The study will focus on papers in criminal psychology that use data and statistical methods. The aim is to evaluate the prevalence of key open science practices, including open access, pre-registration and open data. The research process will follow three steps: collection, classification and analysis. In line with preregistration guidelines, the outlined research plan may be reconsidered during the research process, and any deviations will be reported transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014].
## Sample
The data collection will be based on a systematic sampling approach. Instead of following @scogginsMeasuringTransparencySocial2024's approach of collecting all papers from selected journals, I will draw a subsample of papers from those journals to keep the number of papers manageable for a master's thesis. With the population being all papers published in the top 100 journals in Criminology and Legal Psychology, I will use a stratified sampling approach to ensure a representative sample, sampling papers proportionally from different journals and publication years.
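
To make the proportional sampling step concrete, here is a minimal sketch, assuming Python with pandas; the file name, column names, and target sample size are placeholders rather than decisions already made:

```python
# Minimal sketch of proportional stratified sampling (Python/pandas assumed;
# the file name, column names, and target sample size are placeholders).
import pandas as pd

TOTAL_SAMPLE = 2000  # hypothetical target size for the thesis sample

papers = pd.read_csv("paper_metadata.csv")  # one row per paper: journal, year, doi, ...

# Draw from each journal-year stratum in proportion to its share of the population.
sample = (
    papers
    .groupby(["journal", "year"], group_keys=False)
    .apply(lambda g: g.sample(
        n=max(1, round(len(g) / len(papers) * TOTAL_SAMPLE)),
        random_state=42,
    ))
)
print(len(sample), "papers sampled")
```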
## Data Collection
The process of data collection will closely follow @scogginsMeasuringTransparencySocial2024 and begin with identifying relevant journals in criminal psychology. I will consult the Clarivate Journal Citation Reports to obtain a comprehensive list of journals within the fields, filtering for the top 100 journals. The Transparency and Openness Promotion Factor[^4] (TOP Factor) according to @nosekPromotingOpenResearch2015 will then be used to assess each journal's adoption of open science practices and will be included in the journal dataset. Once the relevant journals are identified, I will use APIs such as Crossref, Scopus, and Web of Science to download metadata for all papers published between 2013 and 2023.
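
As an illustration of the metadata download, the following is a hedged sketch of querying the Crossref REST API for one journal's papers from 2013 to 2023, assuming Python with requests; the ISSN and contact address are placeholders, and Scopus or Web of Science would require their own clients:

```python
# Hedged sketch of downloading paper metadata from the Crossref REST API
# (Python/requests assumed; the ISSN and contact e-mail are placeholders,
# and Scopus or Web of Science would require their own clients).
import requests

URL = "https://api.crossref.org/journals/{issn}/works"
params = {
    "filter": "from-pub-date:2013-01-01,until-pub-date:2023-12-31",
    "rows": 200,                 # page size
    "cursor": "*",               # deep-paging cursor
    "mailto": "me@example.org",  # identifies the client to Crossref's polite pool
}

records = []
while True:
    resp = requests.get(URL.format(issn="0000-0000"), params=params, timeout=30)
    resp.raise_for_status()
    message = resp.json()["message"]
    if not message["items"]:
        break
    records.extend(message["items"])           # DOI, title, license, dates, ...
    params["cursor"] = message["next-cursor"]  # continue where the last page ended
```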
The proposed data collection is resource-intensive but serves multiple purposes.
## Classification
The classification process will begin with operationalizing the key open science practices that I aim to study. This involves defining clear criteria for identifying papers in the categories I plan to classify: papers that use statistical inference, papers that were preregistered, papers that provide open data, papers that offer open materials, and papers that are available via open access.

Classification of open access papers will be performed using the available metadata. The other classes will be identified using machine learning models trained on a preclassified training dataset. The models will categorize papers using generated document feature matrices (DFMs), in line with @scogginsMeasuringTransparencySocial2024.

To train machine learning models capable of classifying the papers, I will manually categorize a subset of papers. The required sample size will be determined using weighted fitting of learning curves according to @figueroaPredictingSampleSize2012, which requires an initial hand-coded sample of 100-200 papers. To ensure the representativeness of this subset, I will sample papers proportionally from different journals, publication years, and subfields within Criminology and Legal Psychology. This stratified sampling approach will help mitigate biases and ensure that the training data reflects the diversity of the overall dataset. If the necessary sample size exceeds my time constraints, I will try to use clustering-based text classification to extend the training sample [@zengCBCClusteringBased2003] and will also consider using large language models such as ScienceOS to preclassify papers for the training data.
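
To illustrate how the learning-curve projection of the required sample size could work, here is a rough sketch of a weighted inverse-power-law fit in the spirit of @figueroaPredictingSampleSize2012, assuming Python with NumPy and SciPy; the accuracy values are invented placeholders, not results:

```python
# Rough sketch of a weighted inverse-power-law learning-curve fit to project the
# required hand-coding effort (NumPy/SciPy assumed; the accuracy values below are
# invented placeholders, not real results).
import numpy as np
from scipy.optimize import curve_fit

n_obs = np.array([25, 50, 75, 100, 150, 200])              # training-set sizes tried so far
acc_obs = np.array([0.62, 0.68, 0.71, 0.74, 0.77, 0.79])   # observed classifier accuracy

def inverse_power_law(n, a, b, c):
    # a is the asymptotic accuracy; b and c govern how fast it is approached.
    return a - b * n ** (-c)

# Weight later observations more heavily, following the weighted-fit idea.
weights = np.linspace(0.5, 1.0, len(n_obs))
(a, b, c), _ = curve_fit(inverse_power_law, n_obs, acc_obs,
                         p0=[0.85, 1.0, 0.5], sigma=1.0 / weights, maxfev=10000)

target = 0.80  # hypothetical target accuracy
n_needed = (b / (a - target)) ** (1.0 / c) if a > target else float("inf")
print(f"Projected hand-coded sample size for {target:.0%} accuracy: {n_needed:.0f}")
```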
### Operationalization
Following the approach of @scogginsMeasuringTransparencySocial2024, I will use document feature matrices (DFMs) created from open-science-specific dictionaries as features in the training process. For instance, the frequencies of terms like "pre-registered," "open data," or "data availability statement" could indicate adherence to pre-registration or open data practices. Similarly, phrases such as "materials available on request" or "open materials" could signify the use of open materials. The freely available data from @scogginsMeasuringTransparencySocial2024 will form the foundation of the keyword dictionaries for identifying relevant papers during the classification phase. Using these dictionaries, DFMs will be generated for all full-text papers gathered. To facilitate this, I will additionally develop my own keyword dictionaries for each category, identifying terms and phrases commonly associated with these practices before consulting @scogginsMeasuringTransparencySocial2024.
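
A minimal sketch of such a dictionary-based DFM, assuming Python with scikit-learn and a toy term list rather than the final keyword dictionaries:

```python
# Minimal sketch of a dictionary-based document-feature matrix with scikit-learn;
# the term list is a toy illustration, not the final keyword dictionaries.
from sklearn.feature_extraction.text import CountVectorizer

open_science_terms = [
    "preregistered", "registered report",
    "open data", "data availability statement",
    "open materials", "materials available on request",
    "open access", "open code",
]

# Count only dictionary phrases; ngram_range must cover the longest phrase, and
# hyphenated variants such as "pre-registered" would need a custom tokenizer.
vectorizer = CountVectorizer(vocabulary=open_science_terms,
                             ngram_range=(1, 4), lowercase=True)

full_texts = ["... full text of paper 1 ...", "... full text of paper 2 ..."]  # placeholders
dfm = vectorizer.fit_transform(full_texts)   # sparse papers x terms count matrix
print(dfm.toarray())
```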
### Training Strategy
A sampled subset will serve as a "labelled" dataset for supervised learning. Other classification methods were considered but deemed unsuitable for the task, as they were either designed for document topic classification or too time-intensive for a master's thesis [e.g. @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016].

Training these models requires a manually categorized subset of papers, and the prevalence of open science practices can be expected to be rather low. Previous studies give different estimates: for open access, prevalence was around 22% in criminological research between 2017 and 2019 [@ashbyOpenAccessAvailabilityCriminological2020]. @greenspanOpenSciencePractices2024 manually coded over 700 papers in the field published between 2018 and 2022 and found a steady, non-growing prevalence of around 5 to 10 percent for open data and 20 to 40 percent for open materials and open access. Pre-registration and open code were concerningly rare, with a prevalence close to zero in most years. This is in line with my experience during my initial literature review, where I failed to find a substantial number of papers using certain open science practices. This is problematic for the training task in several ways: if the prevalence of preregistration is close to zero, it is hard to create a suitable sample for my purpose. Even worse, the prevalence could be so low that my training sample might not catch a single paper for the preclassification dataset.

I will therefore gather an initial set of 20-50 papers from my sample. Given the unbalanced data with large variance in class prevalence, this initial set will not be chosen entirely at random. If the prevalence of preregistration or any other open science practice is too low, I will address this by applying a sequential sampling approach from machine learning called active learning, using uncertainty sampling strategies[^5]. This approach iteratively selects the most informative samples to train the model [@settlesActiveLearningLiterature2009]. I deliberately refrain from describing the process in more detail, as the specific sequential sampling or active learning method is not set in stone.

[^5]: My approach involves bootstrapping with a small set of diverse LLM-labeled papers, training an initial logistic regression model on vectorized text features, and iteratively using active learning (uncertainty sampling, with optional diversity criteria and Query-by-Committee) to efficiently select and annotate new samples, periodically addressing rare classes directly through targeted querying, and continuously monitoring performance to ensure balanced, effective training.
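
For illustration, a rough sketch of the pool-based active-learning loop with least-confidence uncertainty sampling, assuming Python with scikit-learn; synthetic data stands in for the real DFM and for the manual or LLM labelling step:

```python
# Rough sketch of pool-based active learning with least-confidence uncertainty
# sampling (scikit-learn assumed); synthetic data stands in for the real DFM and
# for the manual/LLM labelling step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the DFM of all papers and their (normally unknown) labels; the
# positive class is kept rare to mimic a practice such as preregistration.
X_all, y_all = make_classification(n_samples=2000, n_features=50,
                                   weights=[0.95], random_state=0)

rng = np.random.default_rng(0)
labelled = list(rng.choice(len(X_all), size=30, replace=False))  # initial 20-50 papers
# Make sure both classes are present in the seed set before the first fit.
labelled += [int(np.where(y_all == 1)[0][0]), int(np.where(y_all == 0)[0][0])]
pool = [i for i in range(len(X_all)) if i not in labelled]

model = LogisticRegression(max_iter=1000, class_weight="balanced")
for _ in range(10):                                   # stop once performance plateaus
    model.fit(X_all[labelled], y_all[labelled])
    proba = model.predict_proba(X_all[pool])
    uncertainty = 1.0 - proba.max(axis=1)             # least-confidence criterion
    query = [pool[i] for i in np.argsort(uncertainty)[-10:]]  # 10 most uncertain papers
    labelled.extend(query)                            # "annotate" them (here: reveal y_all)
    pool = [i for i in pool if i not in query]

print(f"{len(labelled)} papers labelled after active learning")
```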
I will use large language models like ChatGPT for the generation of the training data, using such a model to preclassify papers, as they have proven to be reliable in text classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. This of course raises the question of why not to use such a model to classify the whole dataset. The answer lies in efficiency and cost: the use of LLMs is expensive, and training such a model is technically out of reach for me, as it is for many other researchers. Instead, a faster, computationally efficient approach shall be used to classify my sample and serve as a use case for further, more cost-effective research.
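
A hedged sketch of how such LLM preclassification could look, assuming the OpenAI Python client; the model name, prompt, and label set are placeholders, and the returned labels would still be spot-checked by hand:

```python
# Hedged sketch of LLM-assisted preclassification (OpenAI Python client assumed);
# the model name, prompt, and label set are placeholders, and returned labels
# would still be spot-checked manually.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "You will receive the full text of a research paper. Respond only with a "
    "comma-separated subset of these labels: statistical_inference, "
    "preregistration, open_data, open_materials."
)

def preclassify(full_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                                 # placeholder model choice
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": full_text[:30000]},  # naive length cap
        ],
        temperature=0,                                       # keep labelling deterministic-ish
    )
    return response.choices[0].message.content

# labels = preclassify(open("paper_0001.txt").read())        # hypothetical usage
```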
I will then train various machine learning models, including Naive Bayes, Logistic Regression, Support Vector Machines, and Gradient Boosted Trees. The performance of each model will be evaluated to identify the best-performing classifier for each category of open science practices. Once the optimal models are selected, I will use them to classify the entire dataset of papers.
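
As a sketch of this model comparison, assuming Python with scikit-learn and cross-validated F1 as the selection criterion; the count matrix below is synthetic and only stands in for the real DFM and hand-coded labels:

```python
# Sketch of the per-category model comparison (scikit-learn assumed); the count
# matrix is synthetic and only stands in for the real DFM and hand-coded labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "linear_svm": LinearSVC(class_weight="balanced"),
    "gradient_boosting": GradientBoostingClassifier(),
}

def best_classifier(X, y):
    """Return the best model name and all mean cross-validated F1 scores."""
    scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
              for name, clf in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy usage with a synthetic count matrix standing in for one category's data:
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 20))      # non-negative counts, as in a DFM
y = rng.integers(0, 2, size=200)          # placeholder labels for one practice
print(best_classifier(X, y))
```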