add conclusion and fix a single typo :)

This commit is contained in:
Michael Beck 2025-03-28 16:18:07 +00:00
parent 341dfd10ff
commit 4e5268b92d


@@ -142,7 +142,7 @@ Thereby, I will gather an initial sample of 20-50 papers of my sample. This samp
[^5]: My approach involves bootstrapping with a small set of diverse LLM-labeled papers, training an initial logistic regression model on vectorized text features, and iteratively using active learning (uncertainty sampling, with optional diversity criteria and Query-by-Committee) to efficiently select and annotate new samples, periodically addressing rare classes directly through targeted querying, and continuously monitoring performance to ensure balanced, effective training.
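The uncertainty-sampling step in this footnote can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn; the toy abstracts and the "open data" label are hypothetical placeholders, not the actual corpus or coding scheme.

```python
# Sketch of uncertainty sampling for active learning (toy data, sklearn assumed).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder snippets standing in for paper abstracts; 1 = "uses open data".
labeled_texts = ["shares open data on a repository", "no data availability statement",
                 "data openly available", "data not shared"]
labels = [1, 0, 1, 0]
unlabeled_texts = ["supplementary data released", "results described narratively",
                   "replication materials posted"]

vec = TfidfVectorizer()
X_labeled = vec.fit_transform(labeled_texts)
X_unlabeled = vec.transform(unlabeled_texts)

clf = LogisticRegression().fit(X_labeled, labels)

# Uncertainty sampling: query the papers whose predicted probability is
# closest to 0.5, i.e. where the current model is least confident.
proba = clf.predict_proba(X_unlabeled)[:, 1]
uncertainty = np.abs(proba - 0.5)
query_order = np.argsort(uncertainty)  # most uncertain first
print(query_order[:2])  # indices of the next papers to hand-label
```

Diversity criteria or a Query-by-Committee variant would replace the single `uncertainty` score with a committee-disagreement measure, but the select-annotate-retrain loop stays the same.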
I will use large language models like ChatGPT to generate the training data by having such a model preclassify papers, as LLMs have proven reliable in text classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. This of course raises the question of why not to use such a model to classify the whole dataset. The answer lies in efficiency and cost: querying LLMs at scale is expensive, and training such a model is technically out of reach for me, as it is for many other researchers. Instead, a faster, computationally efficient approach shall classify my sample and serve as a use case for further, more cost-effective research.
I will then train various machine learning models, including Naive Bayes, Logistic Regression, Support Vector Machines, and Gradient Boosted Trees. The performance of each model will be evaluated to identify the best-performing classifier for each category of open science practices. Once the optimal models are selected, I will use them to classify the entire dataset of papers.
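The model-comparison step described above can be sketched with cross-validation. This is a hedged illustration assuming scikit-learn; the synthetic features stand in for the vectorized paper texts, and the scoring metric is an assumption, not a stated choice of the study.

```python
# Sketch of comparing the four candidate classifiers via cross-validation
# (synthetic stand-in data; sklearn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder for the vectorized corpus and one practice label (e.g. open data).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Gradient Boosted Trees": GradientBoostingClassifier(),
}

# Mean F1 across folds; the best scorer would then label the full corpus.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="f1").mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In the actual study this loop would run once per open science practice category, so different categories may end up with different best-performing classifiers.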
@@ -154,6 +154,14 @@ In the analysis phase of the research, an exploratory analysis will be conducted
\newpage
# Conclusion
My research aims to provide a review of open science practice prevalence in Criminology and Legal Psychology, with a specific focus on the prevalence of open data, preregistration, and other key open science practices in the field that enable replication and reduce known biases of the publication process. As the use of these practices has shown positive impacts on research transparency and reproducibility across various disciplines, understanding their application within criminology could reveal important insights into the state of methodological rigor and transparency in this area.
The study will employ a comprehensive data collection approach, including a stratified sampling strategy from leading criminology journals, followed by the classification of open science practices through machine learning models. The anticipated outcomes will help identify trends in the adoption of these practices, assess the current state of openness in criminology, and contribute to the broader conversation about the role of open science in enhancing the reliability and accessibility of criminological research.
By leveraging both traditional research methods and advanced machine learning techniques, this work aspires to offer valuable contributions to the field of Criminology and Legal Psychology. The results will not only shed light on the adoption of open science practices but will also inform efforts to improve research practices, promote greater transparency, and foster a more collaborative and accessible scholarly environment. The systematic machine learning approach and the public availability of all produced results, data, and methods will enable future efforts aimed at enhancing scientific integrity and fostering more robust, reproducible, and impactful criminological research.
# References
::: {#refs}