## Sample
The data collection will be based on a systematic sampling approach. Instead of following @scogginsMeasuringTransparencySocial2024's approach of collecting all papers from selected journals, I will draw a subsample of the papers from those journals to limit my research to a number of papers manageable for a master's thesis. With the population being all published papers of the top 100 journals in Criminology and Legal Psychology, I will use a stratified sampling approach to ensure a representative sample, sampling papers proportionally from different journals and publication years.
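As a sketch of the proportional allocation step: given per-stratum population counts, sample sizes can be assigned by share of the population. The journal names and counts below are placeholders, and largest-remainder rounding is one of several reasonable choices.

```python
import math

def proportional_allocation(strata_counts, total_sample):
    """Allocate a fixed sample size across strata proportionally to each
    stratum's share of the population (largest-remainder rounding)."""
    population = sum(strata_counts.values())
    quotas = {s: total_sample * n / population for s, n in strata_counts.items()}
    alloc = {s: math.floor(q) for s, q in quotas.items()}
    # hand the remaining units to the strata with the largest fractional parts
    remainder = total_sample - sum(alloc.values())
    for s in sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)[:remainder]:
        alloc[s] += 1
    return alloc

# hypothetical strata: (journal, year) -> number of published papers
strata = {("J1", 2020): 120, ("J1", 2021): 80, ("J2", 2020): 50, ("J2", 2021): 150}
print(proportional_allocation(strata, 40))
```

Papers would then be drawn at random within each stratum according to these quotas.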
## Data Collection
The proposed data collection is resource-intensive but serves multiple purposes.
## Classification
Different classification methods were considered but deemed unsuitable for the task, as they were either designed for document topic classification or too time-intensive for a master's thesis [e.g. @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016].
Classification of open access papers will be performed using the available metadata. The other classes will be identified using machine learning models trained on a preclassified training dataset. The models will categorize papers using generated document-feature matrices (DFMs), in line with @scogginsMeasuringTransparencySocial2024.
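A minimal illustration of what such a document-feature matrix is (a plain bag-of-words count over a shared vocabulary; the actual pipeline would add proper tokenization, stop-word handling, and weighting, and the example snippets are invented):

```python
from collections import Counter

def document_feature_matrix(docs, vocab=None):
    """Build a simple document-feature matrix: one row per document,
    one column per vocabulary token, cells holding token counts."""
    tokenized = [d.lower().split() for d in docs]
    if vocab is None:
        vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts.get(t, 0) for t in vocab])
    return vocab, rows

docs = ["data are openly available", "materials available on request", "no data shared"]
vocab, dfm = document_feature_matrix(docs)
print(vocab)
print(dfm)
```

Each classifier is then trained on these rows rather than on the raw text.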
### Operationalization
### Training Strategy
A subset of the stratified sample will serve as a labelled dataset for supervised learning. To train machine learning models capable of classifying the papers, I will manually categorize this subset. The prevalence of open science practices can be expected to be rather low. Previous studies offer differing estimates: for open access, prevalence was around 22% in criminological research in the years 2017 to 2019 [@ashbyOpenAccessAvailabilityCriminological2020]. @greenspanOpenSciencePractices2024 manually coded over 700 papers in the field published between 2018 and 2022 and found a steady but not growing prevalence of around 5 to 10 percent for open data and 20 to 40 percent for open materials and open access. Preregistration and open code were concerningly rare, with a prevalence close to zero in most years. This matches my experience during my initial literature review, where I failed to find a substantial number of papers using certain open science practices. This is problematic for my training task in several ways: if the prevalence of preregistration is close to zero, it is hard to create a suitable sample for my purpose. Even worse, the prevalence could be so low that my training sample might not catch a single positive paper for my preclassification dataset.
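This risk can be quantified: under the simplifying assumption of independent draws, the chance that a manually coded sample of n papers contains no instance of a practice with prevalence p is (1 - p)^n.

```python
def p_zero_hits(prevalence, n):
    """Probability that none of n independently sampled papers
    exhibits a practice occurring at the given prevalence."""
    return (1 - prevalence) ** n

# e.g. at 1% prevalence, even 50 coded papers miss the practice ~60% of the time
for p in (0.01, 0.05, 0.20):
    print(f"prevalence {p:.0%}: P(no hit in 50 papers) = {p_zero_hits(p, 50):.3f}")
```

This is why purely random selection of the training subset is unlikely to suffice for the rarest practices.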
I will therefore gather an initial set of 20-50 papers from my sample. Given the unbalanced data with large variance in class prevalence, this initial set will not be chosen randomly. If the prevalence of preregistration or any other open science practice is too low, I will address this by applying a sequential sampling approach from machine learning called active learning, using uncertainty sampling strategies[^5]. This approach iteratively selects the most informative samples for training the model [@settlesActiveLearningLiterature2009]. I deliberately refrain from describing the process in more detail, as the specific sequential sampling or active learning method is not yet set in stone.
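Whatever the final model, the core of uncertainty sampling is easy to state: after each training round, the pool items the classifier is least sure about are sent for manual labelling. A sketch for a binary classifier, with invented predicted probabilities:

```python
def uncertainty_sample(probabilities, k):
    """Select the k pool items whose predicted positive-class probability
    is closest to 0.5, i.e. where the current model is least certain."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]

# predicted P(practice present) for six unlabelled papers (illustrative values)
pool_probs = [0.97, 0.51, 0.03, 0.45, 0.88, 0.50]
print(uncertainty_sample(pool_probs, 3))  # → [5, 1, 3]
```

The three selected papers would be hand-coded, added to the training set, and the model retrained before the next selection round.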
I will use large language models such as ChatGPT to generate the training data, employing such a model to preclassify papers, as LLMs have proven reliable in text classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. This of course raises the question of why not to use such a model to classify the whole dataset. The answer lies in efficiency and cost: using LLMs at scale is expensive, and training such a model myself is technically infeasible, as it is for many other researchers. Instead, a faster, computationally efficient approach shall classify my sample, serving as a use case for further, more cost-effective research.
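Independent of the provider, the preclassification step reduces to building a constrained prompt and mapping the model's free-form answer back onto the label set. Everything below is hypothetical: the label names are placeholders and `fake_llm` merely stands in for a real API call.

```python
LABELS = ("OPEN_DATA", "OPEN_MATERIALS", "PREREGISTERED", "NONE")

def build_prompt(text):
    return (
        "Classify the following paper excerpt. Answer with exactly one label: "
        + ", ".join(LABELS) + ".\n\n" + text
    )

def parse_label(response, labels=LABELS):
    """Map a free-form model response onto the allowed label set."""
    cleaned = response.strip().upper()
    for label in labels:
        if label in cleaned:
            return label
    return None  # no recognizable label: flag the paper for manual review

# stub standing in for a real LLM API call (assumption, not a real client)
def fake_llm(prompt):
    return "Label: OPEN_DATA"

print(parse_label(fake_llm(build_prompt("The dataset is available on OSF ..."))))
```

Papers for which `parse_label` returns `None` would fall back to manual coding, keeping the generated labels conservative.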
The resulting labelled training dataset will then be used to train various machine learning models, including Naive Bayes, Logistic Regression, Support Vector Machines, and Gradient Boosted Trees. The performance of each model will be evaluated to identify the best-performing classifier for each category of open science practices. Once the optimal models are selected, I will use them to classify the entire dataset of papers.
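Given the class imbalance, plain accuracy is a poor yardstick; one reasonable selection criterion is F1 on a held-out part of the labelled subset. The toy labels and predictions below are invented for illustration:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_model(y_true, model_predictions):
    """Pick the classifier with the highest F1 on the held-out labels."""
    return max(model_predictions, key=lambda name: f1_score(y_true, model_predictions[name]))

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
preds = {
    "naive_bayes":         [1, 0, 0, 1, 0, 1, 1, 0],
    "logistic_regression": [1, 0, 1, 1, 0, 0, 0, 0],
    "svm":                 [1, 0, 1, 1, 0, 0, 1, 0],
}
print(best_model(y_true, preds))
```

The same comparison would be run separately per practice, since a model that wins for open access need not win for preregistration.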
The automated classification will enable me to categorize a large number of papers based on their adoption of open science practices. Automating the classification process mitigates the inefficiency of manual data collection, allowing for the analysis of a significantly larger dataset than would otherwise be feasible. This classification will provide the foundation for subsequent analyses of temporal trends and other patterns within the data.
In the analysis phase of the research, an exploratory analysis will be conducted to examine temporal trends in the adoption of open science practices over the past decade. This involves comparing the adoption rates of practices such as pre-registration, open data, open materials, and open access across the disciplines of Criminology and Legal Psychology, as well as among different journals. The goal is to identify possible differences or similarities in how these practices have been embraced over time. This evaluation aims to uncover insights into the methodological rigor and transparency within the fields, providing a comprehensive understanding of the current landscape and potential areas for improvement in research practices. By building on the methods developed by @scogginsMeasuringTransparencySocial2024, I hope to generate data and insights that will support future efforts to promote transparency and reproducibility in criminal psychology.
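Once every paper carries binary practice labels and a publication year, the trend analysis itself reduces to a grouped proportion per year (the records below are invented):

```python
from collections import defaultdict

def adoption_by_year(papers, practice):
    """Share of papers per publication year flagged with a given practice."""
    totals, hits = defaultdict(int), defaultdict(int)
    for paper in papers:
        totals[paper["year"]] += 1
        hits[paper["year"]] += paper[practice]
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# hypothetical classified records (1 = practice present, 0 = absent)
papers = [
    {"year": 2018, "open_data": 0}, {"year": 2018, "open_data": 1},
    {"year": 2022, "open_data": 1}, {"year": 2022, "open_data": 1},
]
print(adoption_by_year(papers, "open_data"))  # → {2018: 0.5, 2022: 1.0}
```

The same grouping, keyed additionally by discipline or journal, yields the cross-field comparisons described above.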
\newpage
# Conclusion
My research aims to provide a review of the prevalence of open science practices in Criminology and Legal Psychology, with a specific focus on open data, preregistration, and other key open science practices that enable replication and reduce known biases of the publication process. As the use of these practices has shown positive impacts on research transparency and reproducibility across various disciplines, understanding their application within criminology could reveal important insights into the state of methodological rigor and transparency in this area.
By leveraging both traditional research methods and advanced machine learning techniques, this work aspires to offer valuable contributions to the field of Criminology and Legal Psychology. The results will not only shed light on the adoption of open science practices but will also inform efforts to improve research practices, promote greater transparency, and foster a more collaborative and accessible scholarly environment. The systematic machine learning approach and the public availability of all produced results, data, and methods will enable future efforts aimed at enhancing scientific integrity and fostering more robust, reproducible, and impactful criminological research.
\newpage
# References
::: {#refs}