---
documentclass: article
author:
- Michael Beck
title: "A Systematic Review of Open Science Practices in the Studies of Crime"
description: ""
subtitle: "Research Proposal"
date: "2025-01-08"
lang: en-US
toc: true
toc-depth: 3
# toccolor: gray
number-sections: true
color-links: true
urlcolor: blue
linkcolor: blue
link-citations: true
lot: true
lof: true
bibliography: [lit.bib]
mainfont: Latin Modern Roman
fontfamilyoptions: lmodern
geometry: "left=2.5cm,right=3cm,top=2.5cm,bottom=2.5cm"
header-includes: |
  \usepackage{pdflscape}
  \newcommand{\blandscape}{\begin{landscape}}
  \newcommand{\elandscape}{\end{landscape}}
  \setcounter{page}{-1}
include-before: |
  \newcommand{\scellL}[1]{%
    \parbox{4.2cm}{\raggedright\leftskip=1em\hskip-1em#1}
  }
  \newcommand{\scell}[1]{%
    \parbox{3cm}{
      \begin{center}#1
      \end{center}
      \vspace{1mm}
    }
  }
  \newcommand{\scellB}[1]{%
    \parbox{2.5cm}{
      \begin{center}#1
      \end{center}
      \vspace{1mm}
    }
  }
  \begin{center}
  \vfill
  Master's thesis \\
  Supervisor: \\
  \textbf{Dr. Alexander Trinidad} \\
  \vspace{1cm}
  \hrule
  \end{center}
  \thispagestyle{empty}
  \newpage
  \thispagestyle{empty}
---
\vfill
\newpage
# Intro & Motivation
## Modern Science
The rise of the internet in the last decades drastically changed our lives: our ways of looking at the world, our social lives or our consumption patterns - the internet influences all spheres of life, whether we like it or not [@SocietyInternetHow2019]. The surge in interconnectivity enabled a rise in movements that resist the classic definition of intellectual property rights: open source, open scholarship access and open science [@willinskyUnacknowledgedConvergenceOpen2005]. Modern technologies enhanced reliability, speed and efficiency in knowledge development, thereby improving communication, collaboration and access to information and data [@thagardInternetEpistemologyContributions1997; @eisendInternetNewMedium2002; @wardenInternetScienceCommunication2010]. The internet significantly facilitated formal and informal scholarly communication through electronic journals and digital repositories like Academia.edu or ResearchGate [@wardenInternetScienceCommunication2010; @waiteINTERNETKNOWLEDGEEXCHANGE2021]. Evidence also shows that an increase in access to the internet increases research output [@xuImpactInternetAccess2021]. But greater output doesn't necessarily imply greater quality, progress or greater scientific discoveries. As availability and thereby the quantity of publications increased, the potential information overload demands effective filtering and assessment of published results [@wardenInternetScienceCommunication2010].
But how do we define scientific progress? In the mid-20th century, Thomas Kuhn characterized scientific progress as a revolutionary shift in paradigms, the accepted theories in a scientific community at a given time. According to Kuhn, normal science operates within these paradigms, "solving puzzles" and refining theories. However, when anomalies arise that cannot be explained by the current paradigm, a crisis occurs, leading to a scientific revolution [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions1962]. Opposed to that, a critical rationalist approach to scientific progress emerged that saw danger in the process Kuhn described, as paradigms might facilitate confirmation bias and thereby stall progress. This view is embodied in Karl Popper's philosophy of science, which emphasizes falsifiability and the idea that scientific theories progress through conjectures and refutations rather than through paradigm shifts. Popper argued that science advances by eliminating false theories, thus moving closer to the truth in a more linear and cumulative manner [@popperLogicScientificDiscovery2005]. Where Kuhn emphasized the development and refinement of dominant theories, Popper suggested the challenging or falsification of those theories.
Social sciences today largely engage in frequentist, deductive reasoning, where significance testing is used to evaluate the null hypothesis and conclusions are drawn based on the rejection or acceptance of this hypothesis, aligning with Popper's idea that scientific theories should be open to refutation. This approach is often criticized for its limitations in interpreting p-values and its reliance on long-run frequency interpretations [@dunleavyUseMisuseClassical2021; @wilkinsonTestingNullHypothesis2013]. In contrast, Bayesian inference is associated with inductive reasoning, where models are updated with new data to improve predictions. Bayesian methods allow for the comparison of competing models using tools like Bayes factors, but they do not directly falsify models through significance tests [@gelmanInductionDeductionBaysian2011; @dollBayesianModelSelection2019]. Overall, while falsification remains a cornerstone of scientific methodology, contemporary science often employs a pluralistic approach, integrating diverse methods to address complex questions and advance knowledge [@rowbottomKuhnVsPopper2011]. Despite the differences between frequentist and Bayesian methods, both share a fundamental commitment to the rigorous testing and validation of scientific theories.
But besides the more theoretically driven discourse about scientific discovery, there are many tangible reasons to talk about the scientific method and the publication process. A recent, highly cited article revealed that only a small proportion of the variance in outcomes of studies based on the same data can be attributed to the choices researchers made in designing their tests. @breznauObservingManyResearchers2022 observed 77 researcher teams analyzing the same dataset to assess the same hypothesis and found that the results were extremely diverse, ranging from strong positive to strong negative results. Less than half of the between-team variance could be explained by assigned conditions, research decisions and researcher characteristics; the rest remained unexplained. This underlines the importance of transparent research: results are prone to many errors and biases, made intentionally or unintentionally by the researcher or induced by the publisher.
> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated coincidence, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]
To challenge the biases and to support the possibility of these "repetitions" or replications of research, a movement has formed within the scientific community, fuelled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. The open science movement tries to establish open science practices to challenge many of the known biases that endanger the reliability of the scientific process and enable access to the scientific discourse for a broader public.
@banksAnswers18Questions2019 define open science as a broad term that refers to many concepts: scientific philosophies embodying communality and universalism; specific practices operationalizing these norms, including open science policies like sharing of data and analytic files, redefinition of confidence thresholds, pre-registration of studies and analytical plans, engagement in replication studies and removal of pay-walls; incentive systems to encourage these practices; and even specific citation standards. This typology is in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024]. The ongoing debate of the last decades has focused especially on two open science practices.
First, the **publishing of materials, data and code** or _open data_ that enables replication of studies. Replication thereby makes it possible to assess the pursued research in detail, find errors, bias or simply support the results of the replicated work [@dienlinAgendaOpenScience2021]. While many researchers see challenges in the publication of their data and materials due to a potentially higher workload, legal concerns or just lack of interest, many of these concerns could be ruled out by streamlined processes or institutional support [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007]. As open data reduces p-hacking, facilitates new research by enabling reproduction, reveals mistakes in the analytical code and enables a diffusion of knowledge on the research process, it seems that many scientists, journals and other institutions start to adopt open data in their research to an increasing extent [@dienlinAgendaOpenScience2021; @finkReplicationCodeAvailability; @freeseAdvancesTransparencyReproducibility2022; @zenk-moltgenFactorsInfluencingData2018; @matternWhyAcademicsUndershare2024].
Second, **preregistration** involves thoroughly outlining and documenting research plans and their rationale in a repository. These plans can be made publicly accessible when the researcher decides to share them. The specifics of preregistration can vary based on the research type and may encompass elements such as hypotheses, sampling strategies, interview guides, exclusion criteria, study design, and analysis plans [@managoPreregistrationRegisteredReports2023]. Within this definition, a preregistration shall not prevent exploratory research. Deviations from the research plan are still allowed but have to be communicated transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014]. Preregistration impacts research in multiple ways: it helps performing exploratory and confirmatory research independently, protects against publication bias as journals typically commit to publish registered research and counters "researchers' degrees of freedom" in data analysis by reducing overfitting through cherry-picking, variable swapping, flexible model selection and subsampling [@mertensPreregistrationAnalysesPreexisting2019; @FalsePositivePsychologyUndisclosed]. This minimizes the risk of bias by promoting decision-making that is independent of outcomes. It also enhances transparency, allowing others to evaluate the potential for bias and adjust their confidence in the research findings accordingly [@hardwickeReducingBiasIncreasing2023].
My initial plan for my master's thesis was to study the effect of open science practices on reported effect sizes in published papers. During my initial literature review, it appeared to me that there were very few publications that used pre-registration in data-driven Criminology and Legal Psychology. Instead of assessing effect sizes, this raised the question of how open science practices have been adopted within criminology. Motivated by the expected positive impact of open science practices and in line with the research of @scogginsMeasuringTransparencySocial2024, I therefore intend to address two research questions in Criminology and Legal Psychology:
> $RQ_1$: What proportion of papers that rely on statistical inference make their data and code public?
> $RQ_2$: What proportion of experimental studies were preregistered?
@scogginsMeasuringTransparencySocial2024 conducted an extensive analysis of nearly 100,000 publications in political science and international relations. They observed an increasing use of preregistration and open data, though at levels that remain relatively low. The extensive research not only revealed the current state of open science in political science, but also generated rich data for further meta-research.
I intend to apply similar methods in the field of Criminology and Legal Psychology: gather data about papers in a subset of Criminology and Legal Psychology journals, classify those papers by their application of open science practices using machine learning methods and explore the patterns over time to take stock of research practices in the disciplines. In the following section I describe the intended data collection and research methods, which closely follow @scogginsMeasuringTransparencySocial2024.
# Data and Method
The study will focus on papers in criminal psychology that use data and statistical methods. The aim is to evaluate the prevalence of key open science practices, including open access, pre-registration and open data. The research process will follow three steps: collection, classification and analysis. In line with preregistration guidelines, the outlined research plan may be subject to reconsideration during the research process; any deviations will be reported transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014].
## Sample
Instead of following @scogginsMeasuringTransparencySocial2024's approach of collecting all papers from selected journals, I will draw a subsample of papers from those journals to limit my research to a number of papers manageable for a master's thesis. With the population being all published papers of the top 100 journals in Criminology and Legal Psychology, I will use a stratified sampling approach to ensure a representative sample, sampling papers proportionally across journals and publication years.
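To make the sampling step concrete, the following is a minimal sketch of proportional stratified sampling in Python, assuming a hypothetical metadata table `papers_metadata.csv` with `journal` and `year` columns; the sampling fraction shown is purely illustrative and will depend on the final population size.

```python
# Minimal sketch: proportional stratified sampling by journal and year.
# Assumes a hypothetical file "papers_metadata.csv" with one row per paper
# and columns "journal" and "year"; the 5% fraction is purely illustrative.
import pandas as pd

papers = pd.read_csv("papers_metadata.csv")
SAMPLE_FRACTION = 0.05

# Draw the same fraction from every journal-year stratum so the sample
# mirrors the population's distribution across journals and years.
sample = (
    papers.groupby(["journal", "year"], group_keys=False)
          .apply(lambda g: g.sample(frac=SAMPLE_FRACTION, random_state=42))
)
sample.to_csv("stratified_sample.csv", index=False)
```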
## Data Collection
The process of data collection will closely follow @scogginsMeasuringTransparencySocial2024 and begin with identifying relevant journals in criminal psychology. I will consult the Clarivate Journal Citation Report to obtain a comprehensive list of journals within the fields by filtering for the top 100 journals. The Transparency and Openness Promotion Factor[^4] (TOP Factor) according to @nosekPromotingOpenResearch2015 will then be used to assess each journal's adoption of open science practices and will be included in the journal dataset. Once the relevant journals are identified, I will use APIs such as Crossref, Scopus and Web of Science to download metadata for all papers published between 2013 and 2023.
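As an illustration of the metadata step, the sketch below queries the public Crossref REST API for one journal's records in the target period; the ISSN and the `mailto` address are placeholders, and equivalent queries would be needed for Scopus and Web of Science.

```python
# Minimal sketch: download Crossref metadata for one journal (placeholder ISSN)
# published between 2013 and 2023, using cursor-based deep paging.
import requests

URL = "https://api.crossref.org/journals/0000-0000/works"  # placeholder ISSN
params = {
    "filter": "from-pub-date:2013-01-01,until-pub-date:2023-12-31",
    "rows": 200,                          # maximum page size
    "cursor": "*",                        # start cursor for deep paging
    "mailto": "researcher@example.org",   # placeholder contact for the polite pool
}

records = []
while True:
    message = requests.get(URL, params=params, timeout=30).json()["message"]
    if not message["items"]:
        break
    records.extend(message["items"])
    params["cursor"] = message["next-cursor"]

print(f"Retrieved metadata for {len(records)} papers")
```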
After obtaining the metadata, I will proceed to download the full-text versions of the identified papers. Whenever possible, I will prioritize downloading HTML versions of the papers due to their structured format, which simplifies subsequent text extraction. For papers that are not available in HTML, I will consider downloading full-text PDFs. Tools such as PyPaperBot or others[^1] can facilitate this process, although I will strictly adhere to ethical and legal guidelines, avoiding unauthorized sources like Sci-Hub or Anna's Archive and only using sources that are either included in my institution's campus license or available via open access. If access to full-text papers becomes a limiting factor, I will assess alternative strategies such as collaborating with institutional libraries to request specific papers or identifying open-access repositories that may provide supplementary resources. Texts that are not available will be treated as their own category in the later analysis. Once all available full-text papers are collected, I will preprocess the data by converting HTML and PDF files into plain text using tools such as SciPDF Parser or others[^2]. This preprocessing step ensures that the text is in a standardized format suitable for analysis.
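A minimal sketch of this preprocessing step is shown below, assuming downloaded full texts are stored in a local `fulltexts/` directory; it uses `html2text` and `pdfplumber`, two of the candidate tools listed in the footnotes, but any of the listed parsers could be substituted.

```python
# Minimal sketch: convert downloaded HTML/PDF full texts to plain text.
# Assumes a local directory "fulltexts/"; output goes to "plaintexts/".
from pathlib import Path
import html2text
import pdfplumber

def to_plain_text(path: Path) -> str:
    if path.suffix.lower() in (".html", ".htm"):
        return html2text.html2text(path.read_text(encoding="utf-8"))
    if path.suffix.lower() == ".pdf":
        with pdfplumber.open(path) as pdf:
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    raise ValueError(f"Unsupported file type: {path}")

Path("plaintexts").mkdir(exist_ok=True)
for paper in Path("fulltexts").iterdir():
    if paper.suffix.lower() not in (".html", ".htm", ".pdf"):
        continue
    out = Path("plaintexts") / (paper.stem + ".txt")
    out.write_text(to_plain_text(paper), encoding="utf-8")
```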
The proposed data collection is resource-intensive but serves multiple purposes. However, resource constraints could pose challenges, such as limited access to computational tools, DDoS-protection[^3], API-rate limits or delays in obtaining full-text papers. To mitigate these risks, I plan to prioritize scalable data collection methods, limit data collection to a manageable extent and use existing institutional resources, including library services and open-access repositories. Additionally, I will implement efficient preprocessing workflows ensuring that the project remains feasible within the given timeline and resources.
[^1]: [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot), [monk1337/resp](https://github.com/monk1337/resp)
[^2]: [GitHub - titipata/scipdf_parser](https://github.com/titipata/scipdf_parser), [GitHub - aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/), [GitHub - jsvine/pdfplumber](https://github.com/jsvine/pdfplumber), [GitHub - cat-lemonade/PDFDataExtractor](https://github.com/cat-lemonade/PDFDataExtractor/tree/main), [GitHub - euske/pdfminer](https://github.com/euske/pdfminer)
[^3]: DDoS: Distributed Denial of Service, see @wangDDoSAttackProtection2015.
[^4]: The TOP Factor according to @nosekPromotingOpenResearch2015 is a score that assesses a journal's adoption of open science practices; it can be obtained from [topfactor.org](https://topfactor.org/journals).
## Classification
Different classification methods were considered but deemed unsuitable for the task, as they were either designed for document topic classification or too time-intensive for a master's thesis [e.g. @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016].
Classification of open access papers will be performed using the available metadata. The other classes will be identified using machine learning models trained on a preclassified training dataset. The models will categorize papers using generated document feature matrices (DFMs) in line with @scogginsMeasuringTransparencySocial2024.
### Operationalization
The classification process will begin with operationalizing the key open science practices that I aim to study. This involves defining clear criteria for identifying papers that fall into the categories I plan to classify: papers that use statistical inference, papers that applied preregistration, papers that applied open data practices, papers that offer open materials and papers that are available via open access.
Following the approach of @scogginsMeasuringTransparencySocial2024, I will use document feature matrices (DFMs) created from open science specific dictionaries as features in the training process. For instance, the frequencies of terms like “pre-registered,” “open data,” or “data availability statement” could indicate adherence to pre-registration or open data practices. Similarly, phrases such as “materials available on request” or “open materials” could signify the use of open materials. The freely available data of @scogginsMeasuringTransparencySocial2024 will form the foundation of keyword dictionaries for identifying relevant papers during the classification phase. Using these dictionaries, DFMs will be generated for all gathered full-text papers. To facilitate this, I will additionally develop my own keyword dictionaries for each category, identifying terms and phrases commonly associated with these practices before consulting @scogginsMeasuringTransparencySocial2024.
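The sketch below shows how such a dictionary-restricted DFM could be generated with scikit-learn; the terms are illustrative examples only, not the final dictionaries derived from @scogginsMeasuringTransparencySocial2024 and my own coding.

```python
# Minimal sketch: build a document-feature matrix (DFM) restricted to an
# open-science keyword dictionary. The terms are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer

dictionary = [
    "preregistered", "preregistration", "open data",
    "data availability statement", "open materials",
    "materials available on request", "replication code",
]

# ngram_range up to 4 so multi-word dictionary phrases are counted as features.
vectorizer = CountVectorizer(vocabulary=dictionary, ngram_range=(1, 4), lowercase=True)

texts = [
    "All data and replication code are openly available ...",
    "The study was preregistered on OSF ...",
]
dfm = vectorizer.transform(texts)   # rows: papers, columns: dictionary terms
print(dfm.toarray())
```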
### Training Strategy
A subset of the stratified sample will serve as a "labelled" dataset for supervised learning. To train machine learning models capable of classifying the papers, I will manually categorize this subset of papers. The prevalence of open science practices can be expected to be rather low; previous studies offer different estimates. For open access, the prevalence has been around 22% in criminological research in the years 2017 to 2019 [@ashbyOpenAccessAvailabilityCriminological2020]. @greenspanOpenSciencePractices2024 manually coded over 700 papers in the field between 2018 and 2022. They found a steady but not growing prevalence of around 5 to 10 percent for open data and 20 to 40 percent for open materials and open access. Pre-registration and open code were concerningly rare, with a prevalence close to zero in most years. This is in line with my experience during my initial literature review, where I failed to find a significant number of papers using certain open science practices. This is problematic for my training task in many ways: if the prevalence of preregistration is close to zero, it is hard to create a suitable sample for my purpose. Even worse, the prevalence could be so low that my training sample might not catch a single paper for my preclassification dataset.
Therefore, I will gather an initial subsample of 20-50 papers from my sample. Given the unbalanced data with large variance in class prevalence, this subsample will not be chosen randomly. If the prevalence of preregistration or any other open science practice is too low, I will address this by applying a sequential sampling approach known in machine learning as active learning, using uncertainty sampling strategies[^5]; a minimal sketch of one such iteration is given below. This approach will be used to iteratively select the most informative samples to train the model [@settlesActiveLearningLiterature2009]. I deliberately refrain from describing the process in more detail, as the sequential sampling or active learning method is not set in stone.
[^5]: My approach involves bootstrapping with a small set of diverse LLM-labeled papers, training an initial logistic regression model on vectorized text features, and iteratively using active learning (uncertainty sampling, with optional diversity criteria and Query-by-Committee) to select and annotate new samples efficiently, periodically addressing rare classes directly through targeted querying and continuously monitoring performance to ensure balanced, effective training.
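The following is a minimal sketch of a single uncertainty-sampling iteration under these assumptions; `X` (the DFM for all sampled papers), `labeled_idx` and `y` (the indices and labels of the already coded subset) are hypothetical variables, and the batch size is arbitrary.

```python
# Minimal sketch: one active-learning iteration with least-confidence sampling.
# X: feature matrix for all papers; labeled_idx, y: already-annotated subset.
import numpy as np
from sklearn.linear_model import LogisticRegression

def next_batch_to_label(X, labeled_idx, y, batch_size=10):
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X[labeled_idx], y)

    unlabeled_idx = np.setdiff1d(np.arange(X.shape[0]), labeled_idx)
    proba = model.predict_proba(X[unlabeled_idx])

    # Least confidence: the lower the top class probability, the more
    # informative the paper is to annotate next.
    uncertainty = 1.0 - proba.max(axis=1)
    most_uncertain = np.argsort(uncertainty)[::-1][:batch_size]
    return unlabeled_idx[most_uncertain]
```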
I will use large language models like ChatGPT to generate the training data by letting such a model preclassify papers, as LLMs have proven to be reliable in text classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. This of course raises the question of why not use such a model to classify the whole dataset. The answer lies in efficiency and cost: the use of LLMs is expensive, and training such a model myself is technically not possible, as it is for many other researchers. Instead, a faster, computationally efficient approach shall lead to the classification of my sample as a use case for further, more cost-effective research.
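As an illustration, a preclassification call could look like the sketch below, assuming access to the OpenAI chat completions API; the model name, prompt and label set are placeholders, and every LLM label would be validated against a manually coded subset before being used for training.

```python
# Minimal sketch: LLM-assisted preclassification of one paper.
# Model name, prompt and label set are placeholders, not a fixed design choice.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

SYSTEM_PROMPT = (
    "You will receive the full text of a research paper. Reply with a JSON object "
    "containing boolean fields: uses_statistical_inference, preregistered, "
    "open_data, open_materials."
)

def preclassify(full_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": full_text[:100_000]},  # truncate very long papers
        ],
    )
    return response.choices[0].message.content
```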
The resulting labelled training dataset will be used to train various machine learning models, including Naive Bayes, Logistic Regression, Support Vector Machines, and Gradient Boosted Trees. The performance of each model will be evaluated to identify the best-performing classifier for each category of open science practices. Once the optimal models are selected, I will use them to classify the entire dataset of papers.
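A minimal sketch of this model comparison with scikit-learn is given below; `X_train` and `y_train` are assumed to be the DFM features and labels of the training subset for one practice, and the candidate models and scoring metric are illustrative.

```python
# Minimal sketch: compare candidate classifiers via cross-validated F1 scores.
# X_train, y_train are assumed to hold DFM features and labels for one practice.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X_train, y_train):
    candidates = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "linear_svm": LinearSVC(class_weight="balanced"),
        "gradient_boosting": GradientBoostingClassifier(),
    }
    for name, model in candidates.items():
        scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
        print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```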
The automated classification will enable me to categorize a large number of papers automatically based on their adoption of open science practices. Automating the classification process mitigates the inefficiency of manual data collection, allowing for the analysis of a significantly larger dataset than would otherwise be feasible. This classification will provide the foundation for subsequent analyses of temporal trends and other patterns within the data.
## Analysis
In the analysis phase of the research, an exploratory analysis will be conducted to examine temporal trends in the adoption of open science practices over the past decade. This involves comparing the adoption rates of practices such as pre-registration, open data, open materials, and open access across the disciplines of Criminology and Legal Psychology, as well as among different journals. The goal is to identify possible differences or similarities in how these practices have been embraced over time. This evaluation aims to uncover insights into the methodological rigor and transparency within the fields, providing a comprehensive understanding of the current landscape and potential areas for improvement in research practices. By building on the methods developed by @scogginsMeasuringTransparencySocial2024, I hope to generate data and insights that will support future efforts to promote transparency and reproducibility in criminal psychology.
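A minimal sketch of this trend analysis is given below, assuming the classified dataset is stored in a hypothetical `classified_papers.csv` with a `year` column and one boolean column per practice; the column names are illustrative.

```python
# Minimal sketch: yearly adoption rates of open science practices.
# Assumes a hypothetical "classified_papers.csv" with boolean practice columns.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("classified_papers.csv")
practices = ["open_access", "open_data", "open_materials", "preregistered"]

# Proportion of papers per publication year adopting each practice.
trends = df.groupby("year")[practices].mean()

ax = trends.plot(marker="o")
ax.set_ylabel("Proportion of papers")
ax.set_title("Adoption of open science practices over time")
plt.tight_layout()
plt.savefig("trends_over_time.png", dpi=300)
```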
# Conclusion
My research aims to provide a review of open science practice prevalence in Criminology and Legal Psychology, with a specific focus on the prevalence of open data, preregistration, and other key open science practices in the field that enable replication and reduce known bias of the publication process. As the use of these practices has shown positive impacts on research transparency and reproducibility across various disciplines, understanding their application within criminology could reveal important insights into the state of methodological rigor and transparency in this area.
The study will employ a comprehensive data collection approach, including a stratified sampling strategy from leading criminology journals, followed by the classification of open science practices through machine learning models. The anticipated outcomes will help identify trends in the adoption of these practices, assess the current state of openness in criminology, and contribute to the broader conversation about the role of open science in enhancing the reliability and accessibility of criminological research.
By leveraging both traditional research methods and advanced machine learning techniques, this work aspires to offer valuable contributions to the field of Criminology and Legal Psychology. The results will not only shed light on the adoption of open science practices but will also inform efforts to improve research practices, promote greater transparency, and foster a more collaborative and accessible scholarly environment. The systematic machine learning approach and the public availability of all produced results, data and methods will enable future efforts aimed at enhancing scientific integrity and fostering more robust, reproducible, and impactful criminological research.
\newpage
# References
::: {#refs}
:::
\newpage
Declaration of Authorship
I hereby declare that I have produced this work independently and without the use of any aids other than those indicated. All passages taken verbatim or in substance from published and unpublished sources have been marked as such.
\vspace{5cm}
\begin{tabular}{@{}p{4in}@{}}
\hrulefill \\
Michael Beck, 08.08.2024 \\
\end{tabular}