revision 1, almost done

adds stuff
2026-05-18 22:43:11 +02:00 · 2026-04-16 19:18:19 +02:00
10 changed files with 642 additions and 226 deletions
@@ -1 +1,5 @@
-source("renv/activate.R")
+source("renv/activate.R")
+
+source(file.path(Sys.getenv(
+   if (.Platform$OS.type == "windows") "USERPROFILE" else "HOME"
+ ), ".vscode-R", "init.R"))
@@ -18,6 +18,8 @@ pdf: index.pdf Supplements.pdf
 docx: $(DOCX_DOCS)
 docx-main: index.docx

+export QUARTO_R := /opt/R/4.5.3/bin/R
+
 # Pattern rules for either format
 %.pdf: %.qmd
 	OUTPUT_FORMAT=pdf/tex $(QUARTO) render $< --to pdf $(PROFILE_FLAG)
@@ -1,4 +1,62 @@
+# Introduction
+
 This repository contains the quarto project for the article "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning".

-Extensions:
- [kapsner/authors-block](https://github.com/kapsner/authors-block): brings the capability to add an author-related header block when rendering docx-documents with Quarto.
+This project only contains the replication files for the manuscript. The scraping, metadata download and classifier code is available in the method report, which can be found in the OSF repository.
+
+## How to run
+
+On Linux:
+
+```bash
+make all
+```
+
+Windows:
+
+```bash
+uninstall windows, install linux, run "make all" in linux terminal
+```
+
+## Technical Requirements
+
+The method report requires rather intensive calculations, whereas this manuscript should run on a simpler machine. The project was written and tested on a Linux machine but should also run on Windows and macOS. The project is set up to run in a virtual R environment using the `renv` package, which ensures that all necessary packages and their specific versions are installed for the project to run correctly. The project also relies on Quarto for rendering the documents.
+
+### Dependencies
+
+- R (4.5.1+)
+- renv R-library
+- Quarto
+- pandoc
+
+There are two packages that might need to be installed beforehand:
+
+- [gtsummary](https://www.danieldsjoberg.com/gtsummary/)
+- [ggthemr](https://github.com/Mikata-Project/ggthemr)
+
+For the R package `gtsummary`, you'll need to install the `libv8` library manually on Linux; on Windows, installation should work right away (refer to the [manual](https://www.danieldsjoberg.com/gtsummary/)). More info on how to install on Arch Linux can be found [here](https://aur.archlinux.org/packages/v8-r). Alternatively, the environment variable `DOWNLOAD_STATIC_LIBV8` can be set to "1". See `globals.R` for more on the requirements and for all necessary packages, which should (!) be installed automatically when you run the `renv::restore()` command.
+
+Plots created with ggplot2 are themed using `ggthemr`, which can be installed using `devtools`. The installation is explained in the [git repository](https://github.com/Mikata-Project/ggthemr) of `ggthemr`.
+
+::: callout-important
+It is important to install the dependencies of gtsummary as well as the R packages devtools and ggthemr before restoring the virtual R environment.
+:::
+
+It is also important to note that a full run of the document requires environment variables to be set in the `.Renviron` file. Here is an example:
+
+```bash
+❯ cat ~/.Renviron 
+OPENAI_API_KEY = "sk-proj-[...]"
+DOWNLOAD_STATIC_LIBV8=1
+RENV_CONFIG_SANDBOX_ENABLED = FALSE
+```
+
+`OPENAI_API_KEY` has to contain the API key for the OpenAI API, `DOWNLOAD_STATIC_LIBV8` is set to 1 for a quicker install of `libv8` (see the installation instructions of `gtsummary` on Linux), and `RENV_CONFIG_SANDBOX_ENABLED` is set to `FALSE` (disabling the renv sandbox) simply to reduce warnings. The latter can be left out with no negative effect except some warnings during all steps involving multiprocessing.
+
+Quarto Extensions:
+
+- [kapsner/authors-block](https://github.com/kapsner/authors-block): brings the capability to add an author-related header block when rendering docx-documents with Quarto.
+
+## License
+This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
+
@@ -38,6 +38,9 @@ date: ""
 #| label: setup
 #| include: false
 source("deps.R")
+
+debug_mode <- FALSE
+output_format <- "docx"
 ```

 \newpage
@@ -46,6 +49,23 @@ source("deps.R")

 This document serves as a supplement to the main article, providing additional details on the sampling approach, sample size determination, model training procedures, and evaluation metrics used in the study of Open Science Practices (OSP) adoption in scientific publications. The full methodological report, containing all code necessary for full replication, can be accessed in the OSF repository.

+# Data Availability
+
+## Legal Considerations
+
+The text and data mining (TDM) policies of all publishers whose content appears in the corpus were reviewed individually. The majority of publishers explicitly permit TDM under institutional access or open-access conditions, including SAGE, Cambridge University Press, Taylor & Francis, Nature, Emerald, Annual Reviews (upon request), and Wiley (via API). Elsevier permits TDM via its official API but not scraping. Oxford University Press permissions depend on the institutional library agreement, and Brill permits TDM depending on the individual article licence. ASCE explicitly prohibits TDM. For publishers where no explicit policy was found (Modern Law Review) or where policies were ambiguous, EU Directive 2019/790 (Articles 3 and 4) was applied as the operative framework, which broadly permits TDM for scientific research purposes by authorised users accessing content through institutional subscriptions. MDPI and Internet Policy Review content is fully open access and permissively licensed. Regardless of whether TDM is permitted, the right to mine does not extend to a right to redistribute the underlying full texts, which is the operative restriction on public data sharing in this study. Therefore, only metadata can be made available. The labelled dataset containing metadata and OSP labels for the sample is available in the OSF repository.
+
+## Replication Materials
+
+All code necessary for full replication of the study is available in the OSF repository at the following link:
+
+- https://osf.io/rvpc3/overview?view_only=0307dc0d99f74b50a738720a4a757aa0
+
+The files are organized into three parts:
+
+- A code folder containing the full, **raw** quarto project with all code and data necessary for replication. A README file with instructions for replication is included in the code folder. This also includes a preprocessed dataset containing metadata and OSP labels for the sample, which can be used for replication of the analysis and figures in the main article without needing to run the full code. This version is more suitable for users with less experience in R, as it allows for a more straightforward replication of the results. A README file with instructions for replication is included in the data folder.
+- A manuscript folder containing the full manuscript quarto project that was used for all analyses and figures in the main article. 
+- The fully **rendered** Methodological Report, which contains all details on the methods used in the study, including the sampling approach, sample size determination, model training procedures, and evaluation metrics. The report also contains discussions of all decisions taken during the process, as well as the full descriptions and specifications of the models used and the preprocessing steps.
+
 # Sampling Approach

 The process involved the following steps:
@@ -218,19 +238,19 @@ Extraction Method: HTML
 Paywall Status: open_access
 ================================================================================

-TITLE: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+TITLE: The passage of Australia's data retention regime: national security, human rights, and media scrutiny

 AUTHORS: Nicolas P. Suzor, Kylie Pappalardo, Natalie McIntosh

 ABSTRACT:
 Abstract
-            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia's data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act's passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act's complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act's passage through parliament.

 FULL TEXT:
-Title: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+Title: The passage of Australia's data retention regime: national security, human rights, and media scrutiny

 Abstract: Abstract
-            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia's data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act's passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act's complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act's passage through parliament.

 This paper is part of Australian internet policy, a special issue of Internet Policy Review guest-edited by Angela Daly and Julian Thomas.

@@ -271,9 +291,11 @@ for (i in seq_along(plots)) {
 }

 # combine plots using patchwork
-combined_plot <- wrap_plots(plotlist, ncol = 2) + # remove legend
-  plot_layout(guides = "collect") & theme(legend.position = "none")
+combined_plot <- wrap_plots(plotlist, ncol = 2)
+# legend removal disabled: plot_layout(guides = "collect") & theme(legend.position = "none")
+
 print(combined_plot)
+
 if (isTRUE(debug_mode)) {
  debug_info[[knitr::opts_current$get("label")]] <- 
    if (knitr::is_html_output()) "HTML" else "LaTeX"
@@ -6,7 +6,7 @@ execute:
  freeze: auto

 abstract: |
-  This pilot study addresses the current lack of systematic, large-scale evidence on Open Science Practices (OSPs) adoption in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, the author utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. By sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields.
+  This pilot study addresses the current lack of systematic, large-scale evidence on the adoption of open science practices (OSPs) in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, the author utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. In sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields.
 ---

 ```{=latex}
@@ -33,15 +33,15 @@ debug_mode <- FALSE

 # Introduction

-When evidence makes headlines, influences public opinions, shapes policing, sentencing or rehabilitation, it touches lives. But over the last decades, social scientists have learned how easily impressive results evaporate. Criminology is not insulated from these pressures. In this paper, open science practices in criminology and legal psychology are monitored to assess if the field is wired to catch errors before they might become policy.
+When evidence makes headlines, influences public opinions, shapes policing, sentencing or rehabilitation, it touches lives. But over the last decades, social scientists have learned how easily impressive results evaporate. Criminology is not insulated from these pressures. In this paper, open science practices (OSPs) in criminology and legal psychology are monitored to assess if the field is wired to catch errors before they might become policy.

 > "Only by [...] repetitions can we convince ourselves that we are not dealing with a mere isolated 'coincidence', but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]

-To challenge bias and to support replication of research, a movement has formed within the scientific community, fueled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. The open science movement tries to establish open science practices (OSPs) to challenge many of the known biases that endanger the reliability of the scientific process and enable access to the scientific discourse for a broader public @banksAnswers18Questions2019. The ongoing debate of the last decades was especially focused on two OSPs: 
+To challenge bias and to support replication of research, a movement has formed within the scientific community, fueled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. The open science movement tries to establish OSPs to challenge many of the known biases that endanger the reliability of the scientific process and enable access to the scientific discourse for a broader public [@banksAnswers18Questions2019]. The ongoing debate of the last decades was especially focused on two OSPs: 

-*First*, openly sharing materials, data and code enables replication that reduces p-hacking, surfaces errors, spreads methodological knowledge and might reduce burdens on the researcher, driving broader adoption across science [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007; @finkReplicationCodeAvailability2024]. *Second*, preregistration involves thoroughly outlining and documenting research plans and their rationale in a repository before conducting the research, reducing deliberate or unconscious decisions taken to improve findings, challenging publication bias and other biases [@managoPreregistrationRegisteredReports2023; @hardwickeReducingBiasIncreasing2023; @mertensPreregistrationAnalysesPreexisting2019]. 
+*First*, openly sharing materials, data and code enables replication that reduces p-hacking, surfaces errors, spreads methodological knowledge and might reduce burdens on the researcher, driving broader adoption across science [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007; @finkReplicationCodeAvailability2024]. *Second*, preregistration (PR) involves thoroughly outlining and documenting research plans and their rationale in a repository before conducting the research, reducing deliberate or unconscious decisions taken to improve findings, challenging publication bias and other biases [@managoPreregistrationRegisteredReports2023; @hardwickeReducingBiasIncreasing2023; @mertensPreregistrationAnalysesPreexisting2019]. 

-The initial plan for this work was to study the proposed effects of OSPs on reported effect sizes in published papers. During a first literature review, it appeared to me that there were only few publications that used preregistration in data-driven Criminology and Legal Psychology. Instead of assessing effect sizes, this raised the question of how OSPs have been adopted within criminology at all. Motivated by the expected positive impact of OSPs, this work studies the use of OSPs in the field. 
+The initial plan for this work was to study the proposed effects of OSPs on reported effect sizes in published papers. During a first literature review, it appeared to me that there were only a few publications that used PR in data-driven Criminology and Legal Psychology. Instead of assessing effect sizes, this raised the question of how OSPs have been adopted within criminology at all. Motivated by the expected positive impact of OSPs, this work studies the use of OSPs in the field. 

@scogginsMeasuringTransparencySocial2024 did an extensive analysis of nearly 100,000 publications in political science as well as international relations and observed an increasing use of OSPs, with levels still being relatively low. Their extensive research not only revealed the current state of open science in political science, but also generated rich data to perform further meta research. Inspired by their work, I adopt their research questions to assess OSPs in the fields of criminology and legal psychology:

@@ -49,53 +49,53 @@ The initial plan for this work was to study the proposed effects of OSPs on repo

 > $RQ_2$: What proportion of statistical inference publications were preregistered?

-This work gathers data about papers in a subset of Criminology and Legal Psychology journals, categorizes those papers by application of open science practices using machine learning methods and explore the patterns over time. The methods will closely resemble and try to improve the approaches taken by @scogginsMeasuringTransparencySocial2024. The research will contribute to the ongoing discussion about the use of OSPs by painting a clearer picture of their adoption in the field. The improved approach will serve as a starting point for a more extensive exploration of OSPs in criminology and legal psychology and will contribute to the growing literature of machine learning and LLMs in classification tasks of scientific literature.
+This work gathers data about papers in a subset of Criminology and Legal Psychology journals, categorizes those papers by application of OSPs using machine learning methods and explores the patterns over time. The methods will replicate and try to improve the approaches taken by @scogginsMeasuringTransparencySocial2024. Alongside the two research questions, the share of openly accessible publications is reported as a secondary, descriptive analysis. While open access addresses a broader aspect of transparency, it is tracked alongside these practices and provides additional context for the overall trend in open-science adoption. The research will contribute to the ongoing discussion about the use of OSPs by painting a clearer picture of their adoption in the field. The improved approach will serve as a starting point for a more extensive exploration of OSPs in criminology and legal psychology and will contribute to the growing literature of machine learning and LLMs in classification tasks of scientific literature.

 But first, a closer look at the underlying issues leading to the recent development of the open science movement will be taken to gain a deeper understanding of its context, the intended goals, implemented methods and their expected impact on the ever progressing scientific discourse.

 # Background

-In his widely reviewed standard reading "Seven rules for social research", @4ff8afa9-5c92-3c50-b832-a1756ccbeedc emphasizes the importance of the reproduction of research findings. But already in the title of the chapter or the rule itself, Firebaugh cuts back on his appeal: "replicate *where possible*". He notes increasing data availability, yet acknowledges challenges for true replication. Given the books influence since 2008, one might expect replication and replication-enabling practices to be widely adopted today. But is this the case?
+In his widely reviewed standard reading "Seven rules for social research", @4ff8afa9-5c92-3c50-b832-a1756ccbeedc emphasizes the importance of the replication of research findings. But already in the title of the chapter or the rule itself, Firebaugh cuts back on his appeal: "replicate *where possible*". He notes increasing data availability, yet acknowledges challenges for true replication. Given the book's influence since 2008, one might expect replication and replication-enabling practices to be widely adopted today. But is this the case?

-Besides the theoretically driven discourse, there are quite tangible reasons to talk about the scientific method, replication and the publication process. Analyzing 77 research teams assessing the same dataset for a single hypothesis, @breznauObservingManyResearchers2022 found extremely diverse results, ranging from strong positive to strong negative outcomes. They termed this phenomenon "researcher degrees of freedom", explaining that most of the variance in results was not explained by assigned conditions, research decisions, or researcher characteristics. Instead, idiosyncratic researcher variability accounted for more than 90% of the variance.
+Besides the theoretically driven discourse, there are quite tangible reasons to talk about the scientific method, replication and the publication process. Analyzing 77 research teams assessing the same dataset for a single hypothesis, @breznauObservingManyResearchers2022 found extremely diverse results, ranging from strong positive to strong negative outcomes. They attribute this to what @simmonsFalsePositivePsychologyUndisclosed2011 first termed "researcher degrees of freedom" (RDOF): most of the variance in results was not explained by assigned conditions, research decisions, or researcher characteristics. Instead, RDOF accounted for more than 90% of the variance [@breznauObservingManyResearchers2022].

 This raises the question: if modern research practices are so prone to bias and error, what steps can be taken to mitigate these issues? A closer look at an ongoing debate resulting from cases around replication failures helps shed light on the whole complex, its implications and today's research culture.

 ## From Replication Crisis to Credibility Revolution? {#sec-replication-crisis}

-The publication of Firebaugh's text coincided with the onset of the replication crisis, a period where widespread replication failures especially but not exclusively in psychology revealed systemic issues in research culture. This crisis wasn't limited to a few fraudulent cases but exposed a broader problem where seemingly robust, highly cited studies could not be reproduced. Examples ranged from unintended to outright data fabrication [@barghAutomaticitySocialBehavior1996; @callawayReportFindsMassive2011; @crockerRoadFraudStarts2011a]. While the crisis began in psychology, it soon spread to other fields like in political science and economics [@breznauDoesSociologyNeed2021]. For instance, a classic social priming study by @barghAutomaticitySocialBehavior1996, finding that participants primed with an "elderly" stereotype walked more slowly, failed to replicate. A follow-up-study suggested, that the original results were likely influenced by experimenter expectations rather than the hypothesized mechanism of unconscious priming [@doyenBehavioralPrimingIts2012]. While some extreme cases are well-documented, the crisis is largely seen as a result of  systemic pressure and normal human behavior or misconduct than in serious intent [@diekmannII2Probleme2022; @crockerRoadFraudStarts2011a; @4ff8afa9-5c92-3c50-b832-a1756ccbeedc].
+The publication of Firebaugh's text coincided with the onset of the replication crisis, a period in which widespread failures to *replicate* - that is, to obtain consistent results when re-running a study with new data - revealed systemic issues in research culture. This is distinct from computational *reproducibility*, the narrower question of whether a given analysis can be re-executed on the same data; both matter for open science, but the crisis is primarily about the former. This crisis wasn't limited to a few fraudulent cases but exposed a broader problem where seemingly robust, highly cited studies could not be replicated. Examples ranged from unintended errors to outright data fabrication [@barghAutomaticitySocialBehavior1996; @callawayReportFindsMassive2011; @crockerRoadFraudStarts2011a]. While the crisis began in psychology, it soon spread to other fields such as political science and economics [@breznauDoesSociologyNeed2021]. For instance, a classic social priming study by @barghAutomaticitySocialBehavior1996, finding that participants primed with an "elderly" stereotype walked more slowly, failed to replicate. A follow-up study suggested that the original results were likely influenced by experimenter expectations rather than the hypothesized mechanism of unconscious priming [@doyenBehavioralPrimingIts2012]. While some extreme cases are well-documented, the crisis is largely seen as a result of systemic pressure - such as institutional incentives to "publish or perish", described later - and normal human behavior or misconduct rather than of serious intent [@diekmannII2Probleme2022; @crockerRoadFraudStarts2011a; @4ff8afa9-5c92-3c50-b832-a1756ccbeedc].

-The term crisis not only implies alarmingly high proportions, but also creates pressure to act. This is supported by findings spanning many fields: Finance [@jensenThereReplicationCrisis2023], economics [@briggsPartialSolutionReplication2023], sociology [@auspurgAusmassUndRisikofaktoren2014] or medicine [@begleyRaiseStandardsPreclinical2012], with some authors even claiming that most published research findings in the social sciences are false [@ioannidisWhyMostPublished2005]. But what drives this crisis?
+The term crisis not only implies alarmingly high proportions, but also creates pressure to act. This is supported by findings spanning many fields: social psychology [@callawayReportFindsMassive2011], finance [@jensenThereReplicationCrisis2023], economics [@briggsPartialSolutionReplication2023], sociology [@auspurgAusmassUndRisikofaktoren2014] or even medicine [@begleyRaiseStandardsPreclinical2012], with some authors even claiming that most published research findings in the social sciences are false [@ioannidisWhyMostPublished2005]. Criminology sits at the intersection of these literatures rather than apart from them, which is why this paper draws on evidence from across the social sciences in what follows. But what drives this crisis?

 ## Questionable Research

-One of the earlier discussed practices that distort scientific progress is called publication bias. @rosenthalFileDrawerProblem1979 defines it as the preference for publishing positive over negative or inconclusive results. This, often institutionally driven bias, also called the 'file drawer problem', can occur at any stage in research [@kuhbergerPublicationBiasPsychology2014; @francoPublicationBiasSocial2014]. Contributing practices include selective reporting, where null findings or variables are omitted from analysis [@breznauDoesSociologyNeed2021] and the post-hoc adaptation of hypotheses [@gerberPublicationBiasEmpirical2008]. But @breznauObservingManyResearchers2022 don't see publication bias as the main driver of the huge variance in results. Instead, they emphasize the role of idiosyncratic researcher variability and the broader context of research practices, leads to the problem of science practices that might produce unreliable or invalid results: so-called questionable research practices (QRP). 
-
-A truth-incentivizing survey of over 2000 psychologists revealed a high prevalence of QRPs. Around 60% admitted to not reporting all dependent measures, 50% to selective reporting, and 30% to falsely claiming they predicted an unexpected finding. About 2% even confessed to data falsification [@johnMeasuringPrevalenceQuestionable2012a]. Criminology shows similar patterns, though with lower rates due to the absence of incentives [@chinQuestionableResearchPractices2023]. 
+One of the earlier discussed practices that distort scientific progress is called publication bias. @rosenthalFileDrawerProblem1979 defines it as the preference for publishing positive over negative or inconclusive results. This often institutionally driven bias, also called the 'file drawer problem', can occur at any stage in research [@kuhbergerPublicationBiasPsychology2014; @francoPublicationBiasSocial2014]. Contributing practices include selective reporting, where null findings or variables are omitted from analysis [@breznauDoesSociologyNeed2021], and the post-hoc adaptation of hypotheses [@gerberPublicationBiasEmpirical2008]. But @breznauObservingManyResearchers2022 don't see publication bias as the main driver of the huge variance in results. Instead, they emphasize the role of RDOF and the broader context of research practices, which leads to the problem of research practices that might produce unreliable or invalid results: so-called questionable research practices (QRPs). 

 In their excellent manifesto for reproducible science, @munafoManifestoReproducibleScience2017 conceptualize QRPs along the typical stages of empirical research. Common QRPs include HARKing (presenting an unexpected exploratory finding as a preplanned hypothesis), p-hacking (manipulating data or analysis to achieve a desired p-value) and selective reporting, that is, not reporting studies or variables that lack significant results. Other QRPs involve undisclosed data exclusion, stopping data collection when a desired result is found, or not reporting all conditions or measures used. These practices inflate false-positive rates and undermine research credibility [@auspurgAusmassUndRisikofaktoren2014; @breznauDoesSociologyNeed2021; @chinQuestionableResearchPractices2023].

-Other problematic practices involve the misuse of p-values, where researchers simply misinterpret the significance level as the likelihood of truth in their findings, leading to vast overconfidence in their results-that can also be a consequence of or lead to a failure to control for bias and poor quality control [@breznauDoesSociologyNeed2021; @munafoManifestoReproducibleScience2017]. Demographic, geographic or political biases and peer review limitations are more sources for error [@breznauDoesSociologyNeed2021; @grossmannOpenScienceReform2021]. Additionally, gendered penalties favor men publishing disproportionately more than women @akbaritabarGenderPatternsPublication2021. Misaligned institutional incentives, also accelerated by an intense competition for academic jobs, tenure and funding, lead to a so-called "publish or perish" culture [@smaldinoOpenScienceModified2019; @breznauDoesSociologyNeed2021]. 
+Problematic practices also involve the misuse of p-values, where researchers misinterpret the significance level as the likelihood that their findings are true. This leads to vast overconfidence in their results, which can be both a consequence and a cause of a failure to control for bias and of poor quality control [@breznauDoesSociologyNeed2021; @munafoManifestoReproducibleScience2017]. Demographic, geographic or political biases and peer review limitations are further sources of error [@breznauDoesSociologyNeed2021; @grossmannOpenScienceReform2021]. Additionally, gendered penalties result in men publishing disproportionately more than women [@akbaritabarGenderPatternsPublication2021]. Misaligned institutional incentives, accelerated by intense competition for academic jobs, tenure and funding, lead to a so-called publish or perish culture [@smaldinoOpenScienceModified2019; @breznauDoesSociologyNeed2021].

-All the above leads to the conclusion, that our institutions make refutation harder than confirmation. Open science (OS) is the design response, resetting defaults to transparency, pre-specification, and reproducibility. @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and preregistration, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the OS movement, devoting its effort to challenge publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021]. 
+A truth-incentivizing survey of over 2000 psychologists revealed a high prevalence of QRPs. Roughly 60% admitted to not reporting all dependent measures, 50% to selective reporting, and 30% to falsely claiming they predicted an unexpected finding. About 2% even confessed to data falsification [@johnMeasuringPrevalenceQuestionable2012a]. A comparable study in criminology by @chinQuestionableResearchPractices2023 recorded markedly lower rates of admission, but this likely reflects methodological rather than substantive differences: the design lacked the truth-incentivizing mechanisms used by @johnMeasuringPrevalenceQuestionable2012a. Additionally, given the sensitivity of the topic and the probable underrepresentation of QRP-engaging researchers in voluntary surveys, both sets of estimates are best read as lower bounds on true prevalence.
+
+All of the above leads to the conclusion that our institutions make refutation harder than confirmation. Open science (OS) is the design response, resetting defaults to transparency, pre-specification, and reproducibility. @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and PR, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the OS movement, which devotes its effort to challenging publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021].

 ## Open Science Practices

-Following an extensive literature review, @vicente-saezOpenScienceNow2018a characterize OS using four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434]. @banksAnswers18Questions2019 establish a broader definition of os that refers to many concepts, including scientific philosophies embodying communality and universalism, specific practices operationalizing these norms including os policies. A common ground is that *open* science and OSPs try to prevent research misconduct by simply increasing research transparency.
+Following an extensive literature review, @vicente-saezOpenScienceNow2018a characterize OS using four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434]. @banksAnswers18Questions2019 establish a broader definition of OS that refers to many concepts, including scientific philosophies embodying communality and universalism, specific practices operationalizing these norms including OS policies. A common ground is that *open* science and OSPs try to prevent research misconduct by simply increasing research transparency.

-Building on these definitions, in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024], there are numerous practices that have been proposed to enact OS.
+Building on these definitions, and in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; @greenspanOpenSciencePractices2024], numerous practices have been proposed to enact OS.

 ### Open Data and Open Materials

-*Open data* and *open materials* enable replication by publishing all materials necessary to reproduce research in detail, finding errors, bias or simply support the results of the replicated work [@dienlinAgendaOpenScience2021]. 
+*Open data* and *open materials* enable computational reproduction and, where new data are collected, replication. By publishing all materials necessary to reproduce research in detail, they make it possible to find errors or bias, or simply to support the results of the reproduced work [@dienlinAgendaOpenScience2021].

-**Open data** (OD) is defined as *the sharing of data that was collected, generated or obtained from a third party and processed to investigate the research question assessed in the publication*. Open materials are often shared alongside open data. To delineate a differentiated picture as sharing behavior for data and materials can be expected to differ due to for example privacy concerns, **open materials** (OM) are distinctively defined as *all research materials necessary to reproduce the reported results like notebooks, code or syntax, guides, protocols that can be shared digitally*. Both definitions closely follow the definitions given by the @americanpsychologicalassociationOpenScienceBadges.
+**Open data** (OD) is defined as *the sharing of data that was collected, generated or obtained from a third party and processed to investigate the research question assessed in the publication*. Open materials are often shared alongside open data. Because sharing behavior for data and materials can be expected to differ, for example due to privacy concerns, the two are treated separately: **open materials** (OM) are defined as *all research materials necessary to reproduce the reported results, like notebooks, code or syntax, guides and protocols, that can be shared digitally*. Both definitions closely follow the definitions given by the @americanpsychologicalassociationOpenScienceBadges. OD and OM are supported by a growing body of evidence that suggests they can improve the quality and impact of research.

-First, there is accumulating evidence that providing data alongside publications increases visibility and impact. Some estimates suggest around a 30% citation increase for papers that share data, and importantly, this advantage appears at least partly independent of JIF [@tennantAcademicEconomicSocietal2016; @banksAnswers18Questions2019]. Beyond citations, openly available datasets enable the exploration by others, supporting novel findings and exploratory, hypothesis-generating work [@piwowarSharingDetailedResearch2007; @piwowarStateOALargescale2018].
+First, research increasingly confirms that providing data alongside publications increases visibility and impact. Some estimates suggest around a 30% citation increase for papers that share data, and importantly, this advantage appears at least partly independent of journal impact factor (JIF) [@tennantAcademicEconomicSocietal2016; @banksAnswers18Questions2019]. Beyond citations, openly available datasets enable exploration by others, supporting novel findings and exploratory, hypothesis-generating work [@piwowarSharingDetailedResearch2007; @piwowarStateOALargescale2018].

 Second, openness improves methodological rigor and documentation. Knowing that others will inspect our code, data, and decisions incentivizes clearer documentation, more careful workflows, and fewer statistical errors in final papers [@tennantAcademicEconomicSocietal2016; @banksAnswers18Questions2019]. This also promotes transparency about analytic choices and potential biases [@breznauDoesSociologyNeed2021].

-Third, OD and OM reinforce field credibility. By allowing independent examination of methods and results, openness reduces the chance that findings are based on idiosyncratic decisions or unreported researcher degrees of freedom [@scogginsMeasuringTransparencySocial2024; @breznauObservingManyResearchers2022]. Multiple sources suggest that open practices reduce QRPs overall [@scogginsMeasuringTransparencySocial2024; @tennantAcademicEconomicSocietal2016; @munafoManifestoReproducibleScience2017]. 
+Third, OD and OM reinforce field credibility. By allowing independent examination of methods and results, openness reduces the chance that findings are based on idiosyncratic decisions or unreported RDOF [@scogginsMeasuringTransparencySocial2024; @breznauObservingManyResearchers2022]. Multiple sources suggest that open practices reduce QRPs overall [@scogginsMeasuringTransparencySocial2024; @tennantAcademicEconomicSocietal2016; @munafoManifestoReproducibleScience2017]. 

 Finally, both have economic and societal benefits, that are even more evident for open access (OA). They discourage redundant data collection, enabling cost savings that can be redirected to new research questions [@tennantAcademicEconomicSocietal2016; @piwowarSharingDetailedResearch2007]. At the same time, the public availability of data stimulates methodological innovation and cross-dataset syntheses that would otherwise remain infeasible [@piwowarStateOALargescale2018]. These dynamics amplify the academic, economic, and societal impact of research [@tennantAcademicEconomicSocietal2016]. 

@@ -103,47 +103,45 @@ Despite these gains, legitimate concerns persist among many researchers. With in

 There are also method-specific hurdles. For qualitative research, transparency can be especially challenging when meaning-making is relational and context-dependent. Fieldnotes and transcripts may lose essential value once separated from the researcher and participants [@breznauDoesSociologyNeed2021; @freeseReplicationSocialScience2017]. These issues underscore that one-size-fits-all mandates are unlikely to succeed.

-In short, many systemic and researcher-centric challenges cut across OSPs - and they will reappear in the discussion of preregistration that follows.
+In short, many systemic and researcher-centric challenges cut across OSPs - and they will reappear in the discussion of PR that follows.

 ### Preregistration

-A preregistration is a time-stamped plan for a study's hypotheses, design, and analysis, often made public. Its contents vary by method (e.g., hypotheses, sampling, interview guides, exclusion rules, analysis plans) [@loggPreregistrationWeighingCosts2021; @managoPreregistrationRegisteredReports2023; @americanpsychologicalassociationOpenScienceBadges].
+A PR is a time-stamped plan for a study's hypotheses, design, and analysis, often made public. Its contents vary by method (e.g., hypotheses, sampling, interview guides, exclusion rules, analysis plans) [@loggPreregistrationWeighingCosts2021; @managoPreregistrationRegisteredReports2023; @americanpsychologicalassociationOpenScienceBadges].

-Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, by committing ex ante, researcher degrees of freedom are narrowed. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. Preregistration also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs at construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation-strengthening credibility [ @evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. Preregistration helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows "researcher degrees of freedom" [@simmonsFalsePositivePsychologyUndisclosed2011].
+Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, committing ex ante narrows RDOF. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. PR also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs at construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation, which strengthens credibility [@evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. PR helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows RDOF [@simmonsFalsePositivePsychologyUndisclosed2011].

-For this work, **preregistration** is defined as *the act of planning and documenting the hypotheses, study design, and analysis plan of a study before data is collected or even viewed. The documentation is typically time-stamped and made publicly available*.
+For this work, **PR** is defined as *the act of planning and documenting the hypotheses, study design, and analysis plan of a study before data is collected or even viewed. The documentation is typically time-stamped and made publicly available*.

-The Open Science movement, particularly preregistration, has been criticized for not providing tailored transparency practices for qualitative research and for importing a positivist framework that may not fit all traditions [@breznauDoesSociologyNeed2021]. Nevertheless, the core principle of transparency remains relevant: qualitative reports should contain enough information for another researcher to understand the logic and process behind the findings [@breznauDoesSociologyNeed2021]. In qualitative contexts, preregistration can focus on documenting guiding questions, sampling logic, coding frameworks, and decision trails while remaining compatible with iterative analysis.
+The Open Science movement, particularly PR, has been criticized for not providing tailored transparency practices for qualitative research and for importing a positivist framework that may not fit all traditions [@breznauDoesSociologyNeed2021]. Nevertheless, the core principle of transparency remains relevant: qualitative reports should contain enough information for another researcher to understand the logic and process behind the findings [@breznauDoesSociologyNeed2021]. In qualitative contexts, PR can focus on documenting guiding questions, sampling logic, coding frameworks, and decision trails while remaining compatible with iterative analysis.

-Frequently voiced concerns are about increasing work, thereby lengthening projects and restricting researcher's freedom by confining them to their predefined plan. However, preplanning simply reorders the workflow rather than creating extra work, potentially preventing costly redesigns or follow-up-studies. Additionally, this does not inhibit exploratory work as the goal is to provide clarity and transparency by distinguishing between preplanned analysis and those conducted after viewing the data. By moving the conceptual work upstream, preregistration clarifies claims, adds transparency to the decision process and strengthens credibility by marking plans and deviations [@loggPreregistrationWeighingCosts2021; @evansImprovingEvidencebasedPractice2023]. In-principle acceptance adds a guarantee to the upfront work, provided the approved plan is followed [@sarafoglouSurveyHowPreregistration2022; @banksAnswers18Questions2019].
+Frequently voiced concerns are that PR increases the workload, thereby lengthening projects, and restricts researchers' freedom by confining them to their predefined plan. However, preplanning simply reorders the workflow rather than creating extra work, potentially preventing costly redesigns or follow-up studies. Additionally, it does not inhibit exploratory work, as the goal is to provide clarity and transparency by distinguishing between preplanned analyses and those conducted after viewing the data. By moving the conceptual work upstream, PR clarifies claims, adds transparency to the decision process and strengthens credibility by marking plans and deviations [@loggPreregistrationWeighingCosts2021; @evansImprovingEvidencebasedPractice2023]. In-principle acceptance adds a guarantee to the upfront work, provided the approved plan is followed [@sarafoglouSurveyHowPreregistration2022; @banksAnswers18Questions2019].

-In summary, preregistration does not constrain scientific creativity, it clarifies claims. By making the sequence of decisions explicit-what was planned, what changed, and why-we reduce bias, improve interpretability, and strengthen confidence in reported findings [@hardwickeReducingBiasIncreasing2023].
+In summary, PR does not constrain scientific creativity; it clarifies claims. By making the sequence of decisions explicit (what was planned, what changed, and why), we reduce bias, improve interpretability, and strengthen confidence in reported findings [@hardwickeReducingBiasIncreasing2023].

 ### Open Access {#sec-open-access}

-**Open access** is a key OSP, defined as making research freely available online to anyone, as opposed to requiring payment via journal subscriptions [@banksAnswers18Questions2019, @breznauDoesSociologyNeed2021]. The Budapest OA Initiative defines OA as being free to read and reuse for lawful purposes, including text and data mining [@BOAI2002]. A simpler, broad definition is the lawful free availability of a research publication on the internet which will be used here.
+**Open access** is a key OSP, defined as making research freely available online to anyone, as opposed to requiring payment via journal subscriptions [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021]. The Budapest OA Initiative defines OA as being free to read and reuse for lawful purposes, including text and data mining [@BOAI2002]. This work uses a simpler, broader definition: the lawful free availability of a research publication on the internet.

-OA publishing offers several benefits. It increases accessibility and equity, as anyone with an internet connection can reach an OA article, potentially reducing inequalities for those at underfunded institutions [@banksAnswers18Questions2019]. There is a significant OA citation advantage, as OA articles are cited more frequently than closed-access publications. This preference is now considered a form of research bias known as "FUTON" (full text on the net) bias [@piwowarStateOALargescale2018, @wentzVisibilityResearchFUTON2002; @piwowarSharingDetailedResearch2007]. OA also improves research quality by reducing the suppression of null findings [@francoPublicationBiasSocial2014] and enabling large-scale text and data mining [@tennantAcademicEconomicSocietal2016]. Furthermore, it accelerates equitable access, helping to bridge the global North-South divide, and enhances public accountability for publicly funded research [@tennantAcademicEconomicSocietal2016].
+OA publishing offers several benefits. It increases accessibility and equity, as anyone with an internet connection can reach an OA article, potentially reducing inequalities for those at underfunded institutions [@banksAnswers18Questions2019]. There is a significant OA citation advantage, as OA articles are cited more frequently than closed-access publications. This preference is now considered a form of research bias known as "FUTON" (full text on the net) bias [@piwowarStateOALargescale2018; @wentzVisibilityResearchFUTON2002; @piwowarSharingDetailedResearch2007]. OA also improves research quality by reducing the suppression of null findings [@francoPublicationBiasSocial2014] and enabling large-scale text and data mining [@tennantAcademicEconomicSocietal2016]. Furthermore, it accelerates equitable access, helping to bridge the global North-South divide, and enhances public accountability for publicly funded research [@tennantAcademicEconomicSocietal2016].

-Despite its benefits, OA faces challenges. Some newer or smaller Gold OA journals are perceived as less prestigious [@piwowarStateOALargescale2018], and concerns about "predatory publishers" have been mistakenly linked with OA [@tennantAcademicEconomicSocietal2016]. Article processing charges (APCs) can be a barrier for authors, particularly in low- and middle-income countries [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021], though roughly 70% of peer-reviewed OA journals are fee-free, and many offer waivers [@tennantAcademicEconomicSocietal2016; @breznauDoesSociologyNeed2021]. Publishers may also be hesitant to adopt OA due to concerns about losing subscription revenue. While OA promotes transparency, it cannot on its own solve issues like QRPs or underpowered studies if incentives continue to reward quantity over quality [@grossmannOpenScienceReform2021; @banksAnswers18Questions2019].
+Despite its benefits, OA faces challenges. Some newer or smaller Gold OA journals are perceived as less prestigious [@piwowarStateOALargescale2018], and concerns about "predatory publishers" have been mistakenly linked with OA [@tennantAcademicEconomicSocietal2016]. Article processing charges (APCs) can be a barrier for authors, particularly in low- and middle-income countries [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021], though roughly 70% of peer-reviewed OA journals are fee-free, and many offer waivers [@tennantAcademicEconomicSocietal2016; @breznauDoesSociologyNeed2021]. Publishers may also be hesitant to adopt OA due to concerns about losing subscription revenue. While OA promotes transparency, it cannot solve issues like QRPs or underpowered studies on its own if incentives continue to reward quantity over quality [@grossmannOpenScienceReform2021; @banksAnswers18Questions2019].

 ## Open Science in Criminology and Legal Psychology {#sec-osp-in-crim}

 A focused literature review on adoption produced limited evidence as we still know surprisingly little about how often OSPs are actually used in criminology and legal psychology. The evidence is fragmented, method-dependent, and sometimes contradictory - so estimates of prevalence are shaky, even as enthusiasm for OSPs is high and QRPs  might appear common.

-Self-reports suggest high OSP familiarity - but they co-exist with widespread QRPs and are vulnerable to bias. In @chinQuestionableResearchPractices2023, 89% of respondents said they had used at least one OSP, yet 87% also admitted at least one QRP, and some serious QRPs (e.g., hiding known problems) were non-trivial. Survey data indicate that about 25% of researchers across fields have preregistered a study, with higher uptake in psychology (50-60%) and lower prevalence in sociology (~30%) [@fergusonSurveyOpenScience2023a]. Another survey in the field similarly estimated preregistration use at 45% (42-49%) [@chinQuestionableResearchPractices2023]. The reported prevalence of OD varies widely across disciplines. Survey data suggest that more than 60% of researchers report having posted data or code, with higher rates in psychology (>50%) compared to sociology (~35%) [@fergusonSurveyOpenScience2023a]. The prevalence of OM sharing is more limited compared to OD and access. Survey results indicate that 43% (40-47%) of researchers report providing access to their research materials [@chinQuestionableResearchPractices2023]. Few or no journals require data sharing in the field, coupled with rare preregistration and a tiny share of replication studies [@pridemoreReplicationCriminologySocial2018].
+Self-reports suggest high OSP familiarity - but they co-exist with widespread QRPs and are vulnerable to bias. In the study conducted by @chinQuestionableResearchPractices2023, 89% of respondents said they had used at least one OSP, yet 87% also admitted at least one QRP, and some serious QRPs (e.g., hiding known problems) were non-trivial. Survey data indicate that about 25% of researchers across fields have preregistered a study, with higher uptake in psychology (50-60%) and lower prevalence in sociology (~30%) [@fergusonSurveyOpenScience2023a]. Another survey in the field similarly estimated PR use at 45% (42-49%) [@chinQuestionableResearchPractices2023]. The reported prevalence of OD varies widely across disciplines. Survey data suggest that more than 60% of researchers report having posted data or code, with higher rates in psychology (>50%) compared to sociology (~35%) [@fergusonSurveyOpenScience2023a]. The prevalence of OM sharing is more limited compared to OD and access. Survey results indicate that 43% (40-47%) of researchers report providing access to their research materials [@chinQuestionableResearchPractices2023]. Few or no journals require data sharing in the field, coupled with rare PR and a tiny share of replication studies [@pridemoreReplicationCriminologySocial2018].

 In their survey at the Netherlands Institute for the Study of Crime and Law Enforcement, @moneva2025attitudes find broadly positive attitudes but divergent views by method and career stage, and a long list of cultural, structural, legal/privacy, and cost barriers. @fessingerStateOpenScience2025 also shows strong approval (88% positive) and some experience (58% tried at least one OSP), but routine adoption looks limited (only 44% even hold a repository account). In contrast, an assessment of social science studies between 2014 and 2017 found no preregistered studies at all [@hardwickeEmpiricalAssessmentTransparency2020]. 

-Article audits show far lower OSP uptake than surveys, implying either nondisclosure or overestimation. @greenspanOpenSciencePractices2024 coded 722 articles (2018-2022) across five leading journals and found OM in about a third of papers, but \<10% with OD, \<2% with open code or preregistration, and no upward trend.
+Article audits show far lower OSP uptake than surveys, implying either nondisclosure or overestimation. @greenspanOpenSciencePractices2024 coded 722 articles (2018-2022) across five leading journals and found OM in about a third of papers, but \<10% with OD, \<2% with open code or PR, and no upward trend.

-Put together, we have: (a) structural signals that transparency norms aren't yet embedded, (b) surveys that likely overstate or at least poorly calibrate actual practice; (c) parallel evidence from legal psychology that approval is high but practical barriers keep routine use patchy and (d) little to no evidence of actual os practice, opposed to plain opinion.
-
-The applied nature of the research in this field means fragile findings can drive high-stakes policy and practice. Single studies have shaped policing responses (e.g., the Minnesota Domestic Violence study by @shermanSpecificDeterrentEffects1984) only to be refuted by later replications, underscoring the risks of acting on unverified results [@mcneeleyReplicationCriminologyNecessary2015]. The relative youth of criminology and incentives that privilege novelty further heighten the need for systematic replication. To enable it, we should adopt measures [@mcneeleyReplicationCriminologyNecessary2015]. Given how little is known about the prevalence of OSPs in the field and the indicators we see for widespread QRPs, there is a strong case for prioritizing replication-and thereby a need to take stock. 
+Put together, we have: (a) structural signals that transparency norms aren't yet embedded, (b) surveys that likely overstate or at least poorly calibrate actual practice, (c) parallel evidence from legal psychology that approval is high but practical barriers keep routine use patchy and (d) little to no evidence of actual OS practice, as opposed to mere opinion. The applied nature of the research in this field means that fragile findings can drive high-stakes policy and practice. Single studies, like the Minnesota Domestic Violence study by @shermanSpecificDeterrentEffects1984, have shaped policing responses only to be refuted by later replications, emphasizing the risks of acting on unverified results [@mcneeleyReplicationCriminologyNecessary2015]. The relative youth of criminology and incentives that privilege novelty further increase the need for systematic replication. To enable it, we should adopt measures [@mcneeleyReplicationCriminologyNecessary2015]. Given how little is known about the prevalence of OSPs in the field and the indicators we see for widespread QRPs, there is a strong case for prioritizing replication, and thereby a need to take stock.

 # Data and Method

-The aim of this work is to compile a sample of publications in the fields of criminology and legal psychology, classify it as either statistical inference (SI) publications or non-SI publications and further examine the former to assess whether any of the OSPs under consideration are used: preregistration, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the expected high reliability of information on OA. The fine-tuned models are validated against a hand-coded sample that is extended using a large-language-model (LLM, ChatGPT 4o & ChatGPT 5o), with the product of both being then used to train classifier models that will classifiy the analytical sample to estimate true prevalences of OSPs.
+The aim of this work is to compile a sample of publications in the fields of criminology and legal psychology, classify each as either a statistical inference (SI) publication or a non-SI publication, and further examine the former to assess whether any of the OSPs under consideration are used: PR, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the expected high reliability of information on OA. The fine-tuned models are validated against a hand-coded sample that is extended using a large language model (LLM, ChatGPT 4o & ChatGPT 5o), with the product of both then being used to train classifier models that will classify the analytical sample to estimate true prevalences of OSPs. Machine learning models learn patterns from a training set, a labeled subset of articles, and are then tested on a validation set, a separate labeled subset withheld during training, to assess whether the learned patterns generalize rather than merely memorize the training examples [@robertsCrossvalidationStrategiesData2017]. The full research process is illustrated in @fig-flowchart-pipeline. A thorough description and discussion of the methods, including the sampling procedure, data collection, and classification pipeline, is provided in the methodological report. The following sections provide a summary of the key aspects of the methods.

 Full-text data for training the machine learning classification models will be collected with a web application developed specifically for this project. Since software development is not the focus of this work, details of the app's architecture will not be discussed here. A brief description of the application, along with screenshots, is provided in the supplementary material.

@@ -151,7 +149,7 @@ This work is necessarily scoped by time and resources. It shall therefore be tre

 ## Population

-The scope of this work encompasses all publications from the top 100 journals classified under "Criminology & Penology" or the journals that are categorized as "Law" (which might also include sociologically or psychologically driven quantitative studies) and "Psychology, Multidisciplinary", ranked by the 2023 JIF according to Clarivate's Journal Citation Reports [@clarivateJournalImpactFactor2023] that rely on SI. Publication metadata were retrieved via the Crossref API. While Crossref provides extensive coverage, it is not exhaustive, and prior work has shown that missing records are often systematic rather than random [@delgado-quirosWhyAreThese2024; @hausteinWhenArticleActually2015]. Using multiple bibliographic sources (e.g., Scopus, Web of Science) would reduce this bias [@gerasimovComparisonDatasetsCitation2024; @delgado-quirosWhyAreThese2024], but this was not feasible within the scope of this pilot. Consequently, the study population is restricted to articles indexed in Crossref from the selected top 100 journals.
+The scope of this work encompasses all publications that rely on SI from the top 100 journals categorized as "Criminology & Penology", "Law", or "Psychology, Multidisciplinary", ranked by the 2023 JIF according to Clarivate's Journal Citation Reports (JCR) [@clarivateJournalImpactFactor2023]. Law journals were included because Clarivate's "Law" category contains a substantial body of legal psychology research relevant to our scope, as shown, for example, in the journal analyses by @gonzalez-salaCaracterizacionPsicologiaJuridica2017. Publication metadata were retrieved via the Crossref API. While Crossref provides extensive coverage, it is not exhaustive, and prior work has shown that missing records are often systematic rather than random [@delgado-quirosWhyAreThese2024; @hausteinWhenArticleActually2015]. Using multiple bibliographic sources (e.g., Scopus, Web of Science) would reduce this bias [@gerasimovComparisonDatasetsCitation2024; @delgado-quirosWhyAreThese2024], but this was not feasible within the scope of this pilot. Consequently, the study population is restricted to articles indexed in Crossref from the selected top 100 journals.

 As the population is restricted to publications that make SIs, this concept has to be clearly defined. Mostly in line with @scogginsMeasuringTransparencySocial2024, SI publications are defined here as works that rely on data, statistical analysis and experiments. @scogginsMeasuringTransparencySocial2024 restricted their scope further to experiments only, which was deemed unnecessary here, as all assessed OSPs are suitable for, and should be used in, not only experiments but also works assessing secondary data and the like [@akkerPreregistrationSecondaryData2021; @westonRecommendationsIncreasingTransparency2019]. Therefore, descriptive, correlational, comparative and other not purely theoretical research was included.

@@ -165,7 +163,87 @@ The sampling procedure involved drawing a large enough sample for the training u

 The sample size was determined by a precision-based calculation to ensure a $\pm$ 1.5 percentage point confidence interval for the SI prevalence as a precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. Calculations were based on prevalences arbitrarily estimated using the results of the literature review described in @sec-osp-in-crim, explained further in the provided supplements. A minimum calculated total sample size equaled $\approx$ 4265 publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method.
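+
+The order of magnitude of this figure can be reproduced with a short calculation. The following is only a sketch under the assumption of a conservative planning proportion $\tilde{p} = 0.5$; the inputs actually used are documented in the supplements and may differ. With $z = 1.96$ and a target half-width of $d = 0.015$, the Agresti-Coull interval works with the adjusted sample size $\tilde{n} = n + z^2$:
+
+$$
+d = z\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}
+\;\Rightarrow\;
+\tilde{n} = \frac{z^2\,\tilde{p}(1-\tilde{p})}{d^2} = \frac{3.8416 \times 0.25}{0.000225} \approx 4268.4,
+\qquad
+n = \tilde{n} - z^2 \approx 4268.4 - 3.8 \approx 4264.6 .
+$$
+
+Rounding up gives $n = 4265$ under these assumed inputs.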

-First, Sample A, a random sample of up around 500 publications was manually classified to train the initial SI classifier. This step also helped estimate the effort for subsequent tasks. Next, an independent Sample B was drawn, stratified by year, thereby addressing problems in cross-validation and the non-independence of residuals assumptions of many machine-learning models [@robertsCrossvalidationStrategiesData2017]. 
+```{mermaid}
+%%| label: fig-flowchart-pipeline
+%%| fig-cap: "A flowchart of the full research process. All steps described are further explained in the methodological report and the supplements."
+%%| fig-width: 4.2
+%%| fig-height: 8
+flowchart TD
+    A["Population
+    Top 100 JIF journals 2013-2023"]
+
+    B["Crossref 
+Metadata filtering
+95,042 → 40,860 publications
+Deduplication, date filter, keyword exclusions"]
+
+    C["Precision-based stratified sampling
+Target ±1.5 pp · n ≈ 4,265"]
+
+    D["Sample A
+n = 408 · unstratified
+SI classifier training only"]
+
+    E["Sample B
+analytical sample
+n = 4,265 · stratified by year"]
+
+    F["Manual + LLM labelling
+Subset of Sample A
+κ ≈ .83 after reconciliation"]
+
+    G["SI classifier trained
+Random Forest / XGBoost
+TF-IDF keyword features"]
+
+    H1["Full-text retrieval
+HTML / PDF, scraped"]
+ 
+    H2["Full-text retrieval
+HTML / PDF, scraped"]
+
+    I["SI classifier applied to Sample B"]
+
+    I2["SI papers identification
+n = 1,763 with usable full text"]
+
+    OA["OA classified from metadata
+using Crossref, Web of Science, Scopus"]
+
+    J["OSP training subset
+n = 352, from SI papers in Sample B
+manual & LLM labelling"]
+
+    KOD["OD classifier
+RF / XGBoost"]
+
+    KOM["OM classifier
+RF / XGBoost"]
+
+    KPR["PR classifier
+RF / XGBoost"]
+
+    L["Prevalence estimates, Post-stratified by year, adjusted for misclassification"]
+
+
+
+A --> B
+B -- Training/Testing Sample --> D
+B -- Analytical Sample --> C
+D --> H1 --> F --> G
+C --> E --> H2 --> I
+G --> I
+E -- OA from metadata --> OA
+I --> I2 --> J
+J --> KOD & KOM & KPR
+KOD & KOM & KPR -- Applied to all SI papers --> L
+OA --> L
+J --> L
+L ~~~ invisible[ ]
+    style invisible fill:none,stroke:none,color:none
+```
+
+First, Sample A, a random sample of up to around 500 publications, was manually classified to train the initial SI classifier. This step also helped estimate the effort for subsequent tasks. Next, an independent Sample B was drawn, stratified by year, thereby addressing problems in cross-validation (repeatedly splitting the sample into smaller training and validation subsets) and with the independence-of-residuals assumption made by many machine-learning models [@robertsCrossvalidationStrategiesData2017].

 The SI classifier was then used to analyze and classify all publications in Sample B. From the identified SI papers in Sample B, a balanced dataset was randomly sampled to create a training set for the OSP classifiers. Finally, these trained OSP classifiers were applied to the entire analytical Sample B. While a publisher or journal-based stratification for the full sample would have been ideal, it was not feasible due to the limited number of available full texts.

@@ -179,7 +257,7 @@ Before the full text data could be collected, some steps were necessary in order
 #| echo: false
 #| results: asis
 #| tbl-cap: Cases Dropped from all Publications Obtained
-#| label: tbl-cases
+#| label: tbl-01-cases

 tbl <- read_csv("data/tbl-sample-case-drops.csv")

@@ -246,10 +324,10 @@ if (isTRUE(debug_mode)) {
 }
 ```

-The data obtained necessitated multiple transformations. All transformations are reported in the respective section in the methodological report. Publications were filtered by the resulting date variable to limit the population to the defined time interval. To reduce SI coding efforts, simple keyword-lists were used to reduce the number of publications by matching titles. Missing values were assessed, checks were processed for language, @tbl-cases shows that from an initial number of 95042 publications, all steps resulted in a final publication count of 40,860. It is important to note here that several improvements were implemented here but not processed. More details can be found in the provided materials.
+The data obtained necessitated multiple transformations. All transformations are reported in the respective section of the methodological report, as stated in the data availability statement. Publications were filtered by the resulting date variable to limit the population to the defined time interval. To reduce manual SI coding efforts, simple keyword lists were matched against titles to reduce the number of publications. Missing values were assessed and language filters were applied. @tbl-01-cases shows that from an initial number of 95,042 publications, all steps resulted in a final publication count of 40,860. Note that, owing to caching difficulties, several planned filtering refinements did not execute at this stage - a deviation transparently documented in the methodological report. The resulting population is therefore somewhat larger than originally intended, but the surplus publications were removed by subsequent filtering steps. Further details are available in the provided materials.

 ```{r}
-#| label: tbl-cases2
+#| label: tbl-02-cases2
 #| tbl-cap: Cases Dropped from Analytical Sample
 tbl2 <- read_csv("data/tbl-sample-case-drops-stattraining-final.csv")
 tbl_cases2 <- tbl2 %>%
@@ -438,27 +516,29 @@ The final analytical sample is made up of 4265 publications. The OS prevalence c

 ### Full Text Retrieval

-The initial approach to gathering full texts, which used Zotero to translate DOIs as per Scoggins and Robertson, was unreliable across multiple attempts and software versions. Due to the unsuitability of existing software tools, be it for technical or legal reasons, a custom web application was developed.
+The initial approach to gathering full texts, which used Zotero to translate DOIs as per @scogginsMeasuringTransparencySocial2024, was unreliable across multiple attempts and software versions. Due to the unsuitability of existing software tools, be it for technical or legal reasons, a custom web application was developed.

 Downloading the analytical sample was mostly successful, though some publisher protections caused dropouts. Due to time constraints, additional more optimized runs were not feasible. Documents under 1,000 words were considered non-full-text papers. However, shorter HTML texts were retained for potential keyword matching. Text quality assessment (Flesch-Index) and word count identified missing full texts [@benoitQuantedaPackageQuantitative2018]. Full texts were downloaded for Independent Sample A and the Analytical Sample from which Sample B was drawn. The resulting dropouts were expected to have been implicitly handled by post-stratification, but publisher-level weighting was planned and considered but infeasible due to sparse cells that would have produced unstable weights. Post-stratification was conducted by year only, which does not correct publisher- or journal-specific dropouts. Future, non-piloting iterations should add publisher-level adjustment.

 ## Classification Tasks and Methods

-This section will present a brief summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the supplied materials.
+This section presents a brief summary of all methods used to classify the variables of interest. A thorough discussion of all decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the reproduction materials available in the OSF repository.

 Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on Random-Forest and XGBoost-models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. 

-For each task, OSP-specific document-feature-matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. 
+For each task, OSP-specific document-feature-matrices using term frequencies or term-frequency inverse-document-frequency (TF-IDF[^8]) of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. 
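+
+Written out, using the standard definition (the symbols and the numbers below are illustrative assumptions, not values from the corpus): for a term $t$ in document $d$, with $N$ documents in total and $\mathrm{df}(t)$ documents containing $t$,
+
+$$
+\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)} .
+$$
+
+For example, with a base-10 logarithm, a keyword that occurs $5$ times in a paper and appears in $10$ of $1000$ papers receives a weight of $5 \cdot \log_{10}(100) = 10$, whereas a keyword that appears in every paper receives a weight of $0$ because $\log_{10}(1) = 0$, regardless of how often it occurs. The base of the logarithm is a convention and only rescales the weights.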

-First, a strict dichotomous operationalization of "SI" or not SI, as well as of the OSPs was synthesized and documented in a short coding manual. A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the fulltext. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels  served as training and test data for the ML classifiers. A similar approach was used for Sample B. Each OSP classifier was tuned on all possible combinations of different feature sets and model. 
+First, a strict dichotomous operationalization of "SI" or "not SI", as well as of the OSPs, was synthesized and documented in a short coding manual. A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the full text. On a random subsample, agreement between the two after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels (the values of the classification outcome variable) served as training and validation data for the ML classifiers. A similar approach was used for Sample B. Each OSP classifier was tuned on all possible combinations of feature sets and models.

-Given time constraints and the pilot nature of the study, preprocessing and evaluation were optimized for the OSP classifier only, not for the SI classifier. The more rigorous workflow applied to OSP - designed to handle high computational demands and substantial class imbalance - would likely also have improved SI performance, but was not pursued because SI results were already satisfactory, as documented in the provided material. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.
+Given time constraints and the pilot nature of the study, preprocessing and evaluation were optimized for the OSP classifier only, not for the SI classifier. The more rigorous workflow applied to OSP - designed to handle high computational demands and substantial class imbalance (unequal distribution of the outcome levels) - would likely also have improved SI performance, but was not pursued because SI results were already satisfactory, as documented in the provided material. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.
+
+[^8]: Term frequency (TF) is the count of a term in a document, while inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents containing the term. TF-IDF is calculated as TF multiplied by IDF, giving higher weight to terms that are frequent in a document but rare across the corpus [@sang-woonResearchPaperClassification2019; @ramosUsingTFIDFDetermine2003]. 

 ## Analysis

 The research was deliberately designed to study open-science practices via supervised classifiers rather than relying exclusively on metadata. This choice prioritized scalability and the potential to capture practice signals that metadata may miss, at the cost of managing model error and class imbalance. Given the exploratory character of the work, the analyses were not pre-defined, only data collection, sampling, and the model-training strategy were specified in advance. Concerns about classifier interpretability informed the evaluation strategy [@gilpinExplainingExplanationsOverview2018].

-Estimates for OSPs are domain-estimates among SI papers (see @tbl-cases2) drawn from a year-stratified random sample, beta-method CIs are based on design-based variance. Design-weights were applied post-stratified to frame-by-year totals with finite-population corrections. All OSP estimates are domain estimates for SI papers using design-based inference. Design corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson) transformation which provides better coverage for low-prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. 
+Estimates for OSPs are domain estimates among SI papers (see @tbl-02-cases2) drawn from a year-stratified random sample, using design-based inference. Design weights were post-stratified to frame-by-year totals with finite-population corrections. Design-corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson), based on the design-based variance, which provides better coverage for low prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007].
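+
+As a sketch of what the beta method computes (the notation is illustrative and not taken from the methodological report, and software implementations may add further adjustments, e.g., for degrees of freedom): with an estimated proportion $\hat{p}$ and its design-based variance $\widehat{\mathrm{Var}}(\hat{p})$, an effective sample size is formed and plugged into the Clopper-Pearson limits, which are quantiles of beta distributions:
+
+$$
+n_{\text{eff}} = \frac{\hat{p}(1-\hat{p})}{\widehat{\mathrm{Var}}(\hat{p})}, \qquad x = n_{\text{eff}}\,\hat{p},
+$$
+
+$$
+p_{\text{low}} = B^{-1}\!\left(\tfrac{\alpha}{2};\; x,\; n_{\text{eff}} - x + 1\right), \qquad
+p_{\text{upp}} = B^{-1}\!\left(1 - \tfrac{\alpha}{2};\; x + 1,\; n_{\text{eff}} - x\right),
+$$
+
+where $B^{-1}(q; a, b)$ denotes the $q$-quantile of a $\mathrm{Beta}(a, b)$ distribution and $\alpha = 0.05$ for 95% intervals. Unlike Wald limits, these limits always stay within $[0, 1]$, which matters when prevalences are only a few percent.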

 Results generalize to the keyword-filtered data. With $n=1,763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.

@@ -468,106 +548,20 @@ Data is reported per year. As per year data given the very low prevalences is ex

 # Results & Discussion

-Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates. 
-
-```{r}
-#| tbl-cap: Sample Characteristics by Statistical Inference Status
-#| label: tbl-sample-char
-#| tbl-pos: H 
-
-df <- qs_read(file_sample_analysis)
-
-population <- qs_read(file_meta_final)
-
-tbl_sample_desc <- df %>% mutate(
-  journal_category = case_when(
-    journal_category == "PSYCHOLOGY, MULTIDISCIPLINARY" ~ "A",
-    journal_category == "LAW" ~ "B",
-    journal_category == "CRIMINOLOGY & PENOLOGY" ~ "C"
-  )) %>%
-  tbl_summary(
-    include = c(is_open_access, is_open_data, is_open_materials, is_prereg, txt_source, txt_only_abstract, journal_category, journal_jif_quartile, txt_count, txt_flesch, journal_x2023_jif),
-    by = is_statistical,
-    label = list(
-      is_open_access = "Open Access",
-      is_open_data = "Open Data",
-      is_open_materials = "Open Materials",
-      is_prereg = "Preregistration",
-      txt_source = "Text Source",
-      txt_only_abstract = "Only Abstract",
-      journal_category = "Journal Category",
-      journal_jif_quartile = "JIF Quartile",
-      txt_count = "Count: Words",
-      txt_flesch = "Flesch Score",
-      journal_x2023_jif = "JIF (2023)",
-      is_statistical = "Statistical Inference"
-      ),
-    statistic = list(
-        all_continuous()  ~ "{mean} ({sd})",
-        all_categorical() ~ "{n} / {N} ({p}%)"
-      )
-  ) %>%
-    add_p(
-    include = c(txt_only_abstract, txt_source, txt_count, txt_flesch, journal_x2023_jif),
-    test = list(
-      txt_only_abstract ~ "fisher.test",
-      txt_source        ~ "chisq.test",
-      txt_count         ~ "wilcox.test",
-      txt_flesch        ~ "wilcox.test",
-      journal_x2023_jif ~ "wilcox.test"
-    ),
-    pvalue_fun = label_style_pvalue(digits = 3)
-  ) %>%
-  add_overall() %>%
-  modify_header(label ~ "**Variable**") %>%
-  modify_spanning_header(c("stat_1", "stat_2") ~ "**Statistical Inference**")%>% 
-  modify_footnote_body(
-    footnote = "A: Psychology, Multidisciplinary; B: Law; C: Criminology & Penology",
-    columns = "label",
-    rows = variable == "journal_category"
-  )
-
-if(output_format == "pdf/tex") {
-  tbl_sample_desc %>%
-    as_gt() %>%
-    tab_options(
-      table.font.size = gt::px(12),
-      latex.use_longtable = TRUE
-    )
-} else if(output_format == "docx") {
-  tbl_sample_desc %>%
-    as_flex_table() %>%
-      set_table_properties(width = 1) %>%
-      theme_booktabs(bold_header = TRUE) %>%
-      align(align = "center", part = "all") %>%
-      fontsize(size = 11, part = "header") %>%
-      fontsize(size = 8, part = "body") %>%
-      width(5, 2.34, unit = "cm") %>%
-      width(2:4, 3.5, unit = "cm") %>%
-      autofit() %>%
-      height_all(height = .2)
-} else {
-}
-
-if (isTRUE(debug_mode)) {
-  debug_info[[knitr::opts_current$get("label")]] <- 
-    if (knitr::is_html_output()) "HTML" else "LaTeX"
-}
-```
+Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of PR. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and PR were too low for the ML classifiers to yield interpretable, adjusted estimates. 

 The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time.

-Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, Preregistration), these reflect survey-design uncertainty only. 
+Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, PR), these reflect survey-design uncertainty only. 

-@fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 Preregistration) estimates were not possible. 
-
-@tbl-osp-prev-overall confirms low prevalences across the full period: OA $40.9\%$ (38.8-43.1), OM $4.3\%$ (3.4-5.3), Preregistration $3.6\%$ (2.8-4.5), and OD $2.2\%$ (1.6-2.9).
+@tbl-03-osp-prev-overall confirms low prevalences across the full period, with OA being the only exception, estimated at $40.9\%$ (95% CI: 38.8-43.1) across the population. PR, OM and OD are each estimated to be present in fewer than five percent of the publications.

 ```{r}
-#| fig-cap: OSP Adoption Over Time, among statistical inference papers (design-weighted)
-#| label: fig-osp-adoption
-#| fig-pos: H
-#| fig-width: 7
+#| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)
+#| label: tbl-03-osp-prev-overall
+df <- qs_read(file_sample_analysis)
+
+population <- qs_read(file_meta_final)

 # ensure that types match
 df <- df %>% mutate(published_year = as.integer(published_year))
@@ -625,6 +619,91 @@ vars <- c(
  "is_open_access_bin"    = "Open Access"
 )

+overall_results_list <- lapply(names(vars), function(var_name) {
+  
+  # Create the formula for the specific variable
+  form <- as.formula(paste0("~", var_name))
+  
+  # Calculate the proportion and CI on the entire des_stat object
+  # again, use method = "beta" for robustness
+  est <- svyciprop(form, design = des_stat, method = "beta", na.rm = TRUE)
+  
+  # Extract the proportion and confidence interval
+  p_est <- as.numeric(coef(est))
+  ci <- as.numeric(confint(est))
+  
+  # Return a clean tibble with the results
+  tibble(
+    osp = paste0(vars[var_name], ""), # Display label from `vars` (paste0 drops the name), e.g., "Open Access"
+    p = p_est,
+    p_low = ci[1],
+    p_upp = ci[2]
+  )
+})
+
+# Combine the list of results into a single data frame
+overall_osp_si_raw <- bind_rows(overall_results_list)
+
+# Sort by the numeric estimate first; sorting the formatted string would order lexicographically
+overall_osp_si <- overall_osp_si_raw %>%
+  arrange(desc(p)) %>%
+  mutate(`Prevalence` = sprintf("%.1f%% (%.1f-%.1f)", 100 * p, 100 * p_low, 100 * p_upp)) %>%
+  select(osp, `Prevalence`)
+
+# Rename the label column for display
+overall_osp_si <- overall_osp_si %>%
+  rename(
+    OSP = osp
+  )
+
+tbl_overall_osp_si <- overall_osp_si %>% 
+  kbl(
+    format = 'latex',
+    longtable = TRUE,
+    booktabs = TRUE, 
+    escape = TRUE
+  ) %>% # add footnote
+  column_spec(1, width = '3cm')%>%
+  kable_styling(
+    position = "center",
+    latex_options = "hold_position",
+    full_width = FALSE) %>%
+  kableExtra::footnote(
+    general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", 
+    general_title = "Note:", 
+    footnote_as_chunk = TRUE, 
+    threeparttable = TRUE
+    )
+
+
+if(output_format == "pdf/tex") {
+  print(tbl_overall_osp_si)
+} else if(output_format == "docx") {
+  overall_osp_si %>% 
+    flextable() %>%
+      set_table_properties(width = 1, layout = "autofit") %>%
+      theme_booktabs(bold_header = TRUE) %>%
+      align(align = "center", part = "all") %>%
+      fontsize(size = 11, part = "header") %>%
+      fontsize(size = 10, part = "body") %>%
+      add_footer_lines(values = c("Note: Prevalence estimates in statistical inference publications using design-weights per year (95% CI)"))
+} else {
+}
+
+if (isTRUE(debug_mode)) {
+  debug_info[[knitr::opts_current$get("label")]] <- 
+    if (knitr::is_html_output()) "HTML" else "LaTeX"
+}
+```
+
+@fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 PR) estimates were not possible. 
+
+```{r}
+#| fig-cap: OSP Adoption Over Time, among statistical inference papers (design-weighted)
+#| label: fig-osp-adoption
+#| fig-pos: H
+#| fig-width: 7
+
 # Loop through each variable, run svyby, and collect results in a list
 results_list <- lapply(names(vars), function(var_name) {
  
@@ -718,80 +797,81 @@ if (isTRUE(debug_mode)) {
 }
 ```

-In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.
+In parallel, @tbl-04-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstract-only records are more common among non-SI items, and SI papers have higher word counts, appear in higher-impact journals, and appear to be OA more often. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.

 ```{r}
-#| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)
-#| label: tbl-osp-prev-overall
+#| tbl-cap: Sample Characteristics by Statistical Inference Status
+#| label: tbl-04-sample-char
+#| tbl-pos: H 

-overall_results_list <- lapply(names(vars), function(var_name) {
-  
-  # Create the formula for the specific variable
-  form <- as.formula(paste0("~", var_name))
-  
-  # Calculate the proportion and CI on the entire des_stat object
-  # again, use method = "beta" for robustness
-  est <- svyciprop(form, design = des_stat, method = "beta", na.rm = TRUE)
-  
-  # Extract the proportion and confidence interval
-  p_est <- as.numeric(coef(est))
-  ci <- as.numeric(confint(est))
-  
-  # Return a clean tibble with the results
-  tibble(
-    osp = paste0(vars[var_name], ""), # Creates the label, e.g., "Prereg (SI)"
-    p = p_est,
-    p_low = ci[1],
-    p_upp = ci[2]
-  )
-})
-
-# Combine the list of results into a single data frame
-overall_osp_si_raw <- bind_rows(overall_results_list)
-
-# Apply the final formatting to match your original code
-overall_osp_si <- overall_osp_si_raw %>%
-  mutate(`Prevalence` = sprintf("%.1f%% (%.1f-%.1f)", 100 * p, 100 * p_low, 100 * p_upp)) %>%
-  select(osp, `Prevalence`)
-
-# Print the final result
-overall_osp_si <- overall_osp_si %>%
-  rename(
-    OSP = osp
+tbl_sample_desc <- df %>% mutate(
+  journal_category = case_when(
+    journal_category == "PSYCHOLOGY, MULTIDISCIPLINARY" ~ "A",
+    journal_category == "LAW" ~ "B",
+    journal_category == "CRIMINOLOGY & PENOLOGY" ~ "C"
+  )) %>%
+  tbl_summary(
+    include = c(is_open_access, is_open_data, is_open_materials, is_prereg, txt_source, txt_only_abstract, journal_category, journal_jif_quartile, txt_count, txt_flesch, journal_x2023_jif),
+    by = is_statistical,
+    label = list(
+      is_open_access = "Open Access",
+      is_open_data = "Open Data",
+      is_open_materials = "Open Materials",
+      is_prereg = "Preregistration",
+      txt_source = "Text Source",
+      txt_only_abstract = "Only Abstract",
+      journal_category = "Journal Category",
+      journal_jif_quartile = "JIF Quartile",
+      txt_count = "Count: Words",
+      txt_flesch = "Flesch Score",
+      journal_x2023_jif = "JIF (2023)",
+      is_statistical = "Statistical Inference"
+      ),
+    statistic = list(
+        all_continuous()  ~ "{mean} ({sd})",
+        all_categorical() ~ "{n} / {N} ({p}%)"
+      ),
+      missing = "no"
  ) %>%
-  arrange(desc(`Prevalence`))
-
-tbl_overall_osp_si <- overall_osp_si %>% 
-  kbl(
-    format = 'latex',
-    longtable = TRUE,
-    booktabs = TRUE, 
-    escape = T,
-  ) %>% # add footnote
-  column_spec(1, width = '3cm')%>%
-  kable_styling(
-    position = "center",
-    latex_options = "hold_position",
-    full_width = FALSE) %>%
-  kableExtra::footnote(
-    general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", 
-    general_title = "Note:", 
-    footnote_as_chunk = T, 
-    threeparttable = T
-    )
-
+    add_p(
+    include = c(txt_only_abstract, txt_source, txt_count, txt_flesch, journal_x2023_jif),
+    test = list(
+      txt_only_abstract ~ "fisher.test",
+      txt_source        ~ "chisq.test",
+      txt_count         ~ "wilcox.test",
+      txt_flesch        ~ "wilcox.test",
+      journal_x2023_jif ~ "wilcox.test"
+    ),
+    pvalue_fun = label_style_pvalue(digits = 3)
+  ) %>%
+  add_overall() %>%
+  modify_header(label ~ "**Variable**") %>%
+  modify_spanning_header(c("stat_1", "stat_2") ~ "**Statistical Inference**")%>% 
+  modify_footnote_body(
+    footnote = "A: Psychology, Multidisciplinary; B: Law; C: Criminology & Penology",
+    columns = "label",
+    rows = variable == "journal_category"
+  )

 if(output_format == "pdf/tex") {
-  print(tbl_overall_osp_si)
+  tbl_sample_desc %>%
+    as_gt() %>%
+    tab_options(
+      table.font.size = gt::px(12),
+      latex.use_longtable = TRUE
+    )
 } else if(output_format == "docx") {
-  overall_osp_si %>% 
-    flextable() %>%
-      set_table_properties(width = 1, layout = "autofit") %>%
+  tbl_sample_desc %>%
+    as_flex_table() %>%
+      set_table_properties(width = 1) %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
-      fontsize(size = 10, part = "body") %>%
-      set_caption(caption = "Note: Prevalence estimates in statistical inference publications using design-weights per year (95% CI)")
+      fontsize(size = 8, part = "body") %>%
+      width(5, 2.34, unit = "cm") %>%
+      width(2:4, 3.5, unit = "cm") %>%
+      autofit() %>%
+      height_all(height = .2)
 } else {
 }

@@ -801,11 +881,11 @@ if (isTRUE(debug_mode)) {
 }
 ```

-In @tbl-osp-prev, adjustments were applied using sensitivity and specificity from the ML-validation analysis in  [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should thereby rather be based on design-based estimates and on OA (measured from metadata).
+In @tbl-05-osp-prev, adjustments were applied using sensitivity and specificity from the ML-validation analysis in [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values should be interpreted as sensitivity ranges rather than confirmatory estimates. Substantive claims should therefore rest on the design-based estimates and on OA (measured from metadata).
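+
+As a rough sketch of why the OD adjustment turns negative, assume the standard Rogan-Gladen estimator and the rounded values quoted above. Here $\hat{p}$ denotes the observed prevalence and $\hat{\pi}$ the adjusted prevalence; both symbols are introduced only for this illustration, and the estimator actually used is the one defined in the code chunk.
+
+$$
+\hat{\pi} = \frac{\hat{p} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1}, \qquad \hat{p} + \text{Sp} - 1 \approx 0.022 - 0.127 = -0.105 < 0
+$$
+
+Because the numerator is negative, the adjusted point estimate falls below zero for any $\text{Se} > 1 - \text{Sp}$, regardless of the exact sensitivity.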

 ```{r}
-#| tbl-cap: Observed and Adjusted Prevalence of Open Science Practices among Statistical Inference Papers
-#| label: tbl-osp-prev
+#| tbl-cap: Observed and Adjusted Prevalence of open science practices among Statistical Inference Papers
+#| label: tbl-05-osp-prev

 # https://influentialpoints.com/Training/estimating_true_prevalence.htm
 # https://academic.oup.com/ije/article/52/3/942/6982613?login=false
@@ -969,7 +1049,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

-Earlier differences in text sources suggest heterogeneity by journal, thereby implicating also publisher variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
+Earlier differences in text sources suggest heterogeneity by journal, which in turn implicates publisher-level variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple ordinary least squares (OLS) regression trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA and Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial generalized linear models, GLMs) and, ideally, hierarchical pooling across publishers and years.

 ```{=latex}
 \footnotesize
@@ -1122,25 +1202,25 @@ if (isTRUE(debug_mode)) {
 \normalsize
 ```

-This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
+This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally restricted to publishers that permit text and data mining; short documents (<1,000 words) were treated as missing full text, risking misclassification.

 Measurement and modeling challenges were substantial. SI/OSP labels were trained on a small, single-coder hand set plus GPT assistance. Severe class imbalance for OSPs, few validation positives, and upsampling inflated nominal accuracy while depressing stability. Misclassification adjustments (Rogan-Gladen) became unstable at very low prevalences, and some OA trend analyses used simple OLS rather than binomial/GLM approaches.

-In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers revealed non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.
+In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers revealed non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features (variables) should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.

 I therefore propose a series of recommendations for future iterations. They should expand bibliographic metadata sources (Crossref plus Scopus and Web of Science) and audit screened-out records to assess selection. Operationalizations should follow sharper rules that stay closer to the constructs as defined by, e.g., OSF. Labelling should either employ multi-coder assessment and quantify inter-rater reliability on a larger training set, or leverage ChatGPT for classification, as suggested by the high precision observed here. OLS should be replaced with binomial GLMs or hierarchical models for proportions. On the technical side, a more stringent Quarto setup should be used, with simplified modular code based on a refined version of the codebase used here. The downloader should adopt a more homogeneous extraction logic by including the HTML and PDF full-text extraction in the pre-processing pipeline, making the whole process more transparent, more reproducible, and less error-prone. Finally, the sample size should be increased substantially, ideally to the full population of SI papers in the frame, to improve precision and enable analysis at the journal level.

-Despite of all the limitations, there are main substantive implications: OSP prevalence signals in SI papers-especially preregistration and OM-are rare enough that model-based estimation is fragile at this scale, whereas OA, measured from metadata, shows a clear upward trend approaching roughly half of SI outputs by 2023. Methodologically, GPT proves to be a promising primary coder for a scaled follow-up, and the pipeline developed here provides a reproducible, yet improvable foundation for a larger, better-powered study.
+Despite all these limitations, the main substantive implications are as follows: OSP prevalence signals in SI papers, especially PR and OM, are rare enough that model-based estimation is fragile at this scale, whereas OA, measured from metadata, shows a clear upward trend approaching roughly half of SI outputs by 2023. Methodologically, GPT proves to be a promising primary coder for a scaled follow-up, and the pipeline developed here provides a reproducible, yet improvable, foundation for a larger, better-powered study.

 # Conclusion

-The replication crisis has intensified the examination of research practices and accelerated the push for transparency and openness. This study contributes by mapping the adoption of open-science practices (OSP) within criminology and legal psychology, establishing a baseline for future efforts. The evidence indicates meaningful progress in availability-most clearly in OA-yet massive, persistent gaps in reproducibility, particularly for OD, OM, and preregistration.
+The replication crisis has intensified the examination of research practices and accelerated the push for transparency and openness. This study contributes by mapping the adoption of OSPs within criminology and legal psychology, establishing a baseline for future efforts. The evidence indicates meaningful progress in availability-most clearly in OA-yet massive, persistent gaps in reproducibility, particularly for OD, OM, and PR.

-Two decades ago, @ioannidisWhyMostPublished2005 argued that the credibility of findings are closely tied to statistical power, field-specific protocols, and careful attention to pre-study odds. In other words, simply assessing p-values in a rather mechanistic manner is insufficient @collingStatisticalInferenceReplication2021. In that spirit, this work emphasizes measurement, validation, and transparency over nominal statistical "wins," offering an initial, field-specific picture of where credibility can be strengthened and how to get there.
+Two decades ago, @ioannidisWhyMostPublished2005 argued that the credibility of findings is closely tied to statistical power, field-specific protocols, and careful attention to pre-study odds. In other words, simply assessing p-values in a rather mechanistic manner is insufficient [@collingStatisticalInferenceReplication2021]. In that spirit, this work emphasizes measurement, validation, and transparency over nominal statistical "wins," offering an initial, field-specific picture of where credibility can be strengthened and how to get there.

 Methodologically, the study shows that GPT-assisted coding can be accurate and scalable for detecting OSPs, while downstream ML classifiers struggle under extreme class imbalance, a limitation that complicates misclassification-adjusted prevalence estimation. Still, the pipeline built here demonstrates a path toward larger, confirmatory follow-ups.

-This work discussed the replication crisis, its implications for criminology and legal psychology, and how OSPs can help to address some of the issues that have been raised. While the last decades the wording "crisis" framed the discussion in a rather negative light, recent work suggests an upward trend in OSPs which accelerates the transition towards more credible research-moving "from crisis to credibility" @korbmacherReplicationCrisisHas2023. Awareness and adoption of open practices are growing [@grossmannReasonsCautiousOptimism2021], institutions are adapting norms and incentives [@smaldinoOpenScienceModified2019]. Even though the results of this study indicate that there is still a long way to go, the upward trend in OA and the presence of OM and preregistration in some papers are encouraging signs
+This work discussed the replication crisis, its implications for criminology and legal psychology, and how OSPs can help to address some of the issues that have been raised. While over the last decades the term "crisis" framed the discussion in a rather negative light, recent work suggests an upward trend in OSPs that accelerates the transition towards more credible research, moving "from crisis to credibility" [@korbmacherReplicationCrisisHas2023]. Awareness and adoption of open practices are growing [@grossmannReasonsCautiousOptimism2021], and institutions are adapting norms and incentives [@smaldinoOpenScienceModified2019]. Even though the results of this study indicate that there is still a long way to go, the upward trend in OA and the presence of OM and PR in some papers are encouraging signs.

 To ensure that our results are robust, reliable, and credible, this work should be seen as a call for an open, cumulative, and collaborative research culture. Accordingly, the author invites direct reproduction and incremental improvement of this pipeline, via open sharing of data, code, prompts, and labeling protocols, so that the analysis can be stress-tested, recalibrated, and strengthened.

@@ -1160,7 +1240,7 @@ Materials, Data and Code are made available at a public OSF-repository that can

 - https://osf.io/rvpc3/overview?view_only=0307dc0d99f74b50a738720a4a757aa0. 

-Further instructions can be found in the README files. Full-text data and the downloader can't be made available to the public due to copyright concerns.
+Further instructions can be found in the README files. Full-text data and the downloader can't be made available to the public due to copyright concerns. The labelled dataset containing metadata and OSP labels for the sample is available in the OSF repository. The code for the downloader is currently under revision and will be made available in the OSF repository as well.

 # Funding {.unnumbered}

@@ -11,6 +11,49 @@
  abstract = {Rule 4 is thereplication rule. The replication rule is a natural follow-up to rule 3, ``Build reality checks into your research.'' Rule 3 advises you to look for ways to cross-check your results both internally---using other information in your data set---and externally---using different methods and data sets. In multiple-method research, as described in the previous chapter, your aim is to see if different methods and different sorts of data lead to the same conclusions.Rule 4 advises replication---the identical analysis (same measures, models, and estimation methods) of parallel data sets (different samples of the same}
 }

+@article{sang-woonResearchPaperClassification2019,
+  title = {Research Paper Classification Systems Based on {{TF-IDF}} and {{LDA}} Schemes},
+  author = {Kim, Sang-Woon and Gil, Joon-Min},
+  year = 2019,
+  month = aug,
+  journal = {Human-centric Computing and Information Sciences},
+  volume = {9},
+  number = {1},
+  pages = {30},
+  issn = {2192-1962},
+  doi = {10.1186/s13673-019-0192-7},
+  urldate = {2024-12-16},
+  abstract = {With the increasing advance of computer and information technologies, numerous research papers have been published online as well as offline, and as new research fields have been continuingly created, users have a lot of trouble in finding and categorizing their interesting research papers. In order to overcome the limitations, this paper proposes a research paper classification system that can cluster research papers into the meaningful class in which papers are very likely to have similar subjects. The proposed system extracts representative keywords from the abstracts of each paper and topics by Latent Dirichlet allocation (LDA) scheme. Then, the K-means clustering algorithm is applied to classify the whole papers into research papers with similar subjects, based on the Term frequency-inverse document frequency (TF-IDF) values of each paper.},
+  langid = {english},
+  keywords = {Artificial Intelligence,K-means clustering,LDA,Paper classification,TF-IDF},
+  file = {/home/michaelb/Zotero/storage/23YFBPYR/Kim and Gil - 2019 - Research paper classification systems based on TF-IDF and LDA schemes.pdf}
+}
+
+@inproceedings{ramosUsingTFIDFDetermine2003,
+  title = {Using {{TF-IDF}} to {{Determine Word Relevance}} in {{Document Queries}}},
+  author = {Ramos, J. E.},
+  year = 2003,
+  urldate = {2026-05-18},
+  abstract = {In this paper, we examine the results of applying Term Frequency Inverse Document Frequency (TF-IDF) to determine what words in a corpus of documents might be more favorable to use in a query. As the term implies, TF-IDF calculates values for each word in a document through an inverse proportion of the frequency of the word in a particular document to the percentage of documents the word appears in. Words with high TF-IDF numbers imply a strong relationship with the document they appear in, suggesting that if that word were to appear in a query, the document could be of interest to the user. We provide evidence that this simple algorithm efficiently categorizes relevant words that can enhance query retrieval.}
+}
+
+@article{gonzalez-salaCaracterizacionPsicologiaJuridica2017,
+  title = {Characterization of {{Legal Psychology}} through Psychology Journals Included in {{Criminology}} \& {{Penology}} and {{Law}} Categories of {{Web}} of {{Science}}},
+  author = {{Gonz{\'a}lez-Sala}, Francisco and {Osca-Lluch}, Julia and Tortosa Gil, Francisco and Pe{\~n}aranda Ortega, Mar{\'i}a},
+  year = 2017,
+  month = mar,
+  journal = {Anales de Psicolog\'ia},
+  volume = {33},
+  number = {2},
+  pages = {411},
+  issn = {1695-2294, 0212-9728},
+  doi = {10.6018/analesps.33.2.262591},
+  urldate = {2026-05-18},
+  abstract = {The objective of this work is to learn about the most relevant aspects that characterize contemporary Legal Psychology throughout the study of journals included in the WoS between the years 2009 and 2014 related with the area of Psychology. The number of selected publications is 16, mainly from the USA and Great Britain. The results show an increase in the number of works and authors, a greater collaboration and a growth in medium productors. It exists a major presence of men in editorial boards and as authors, outstanding the figures of T. Ward in 2009 and A. Vrij in 2014. According to the analysis of key words the most relevant themes during these years have been Crime, Conduct, Woman and Meta-analysis, being sexual violence towards children and women and gender violence the criminal typology most studied.},
+  copyright = {http://revistas.um.es/analesps/about/submissions\#copyrightNotice},
+  file = {/home/michaelb/Zotero/storage/KC3L68AL/González-Sala et al. - 2017 - Characterization of Legal Psychology through psychology journals included in Criminology & Penology.pdf}
+}
+
@inproceedings{abdennourEnsembleLearningModel2023,
  title = {Ensemble {{Learning Model}} for~{{Medical Text Classification}}},
  booktitle = {Web {{Information Systems Engineering}} -- {{WISE}} 2023},
@@ -0,0 +1,76 @@
+```{mermaid}
+%%| label: fig-flowchart-pipeline
+%%| fig-cap: "A flowchart of the full research process. All steps described are further explained in the methodological report and the supplements."
+flowchart TD
+    A["Population
+    Top 100 JIF journals 2013-2023"]
+
+    B["Crossref 
+Metadata filtering
+95,042 → 40,860 publications
+Deduplication, date filter, keyword exclusions"]
+
+    C["Precision-based stratified sampling
+Target ±1.5 pp · n ≈ 4,265"]
+
+    D["Sample A
+n = 408 · unstratified
+SI classifier training only"]
+
+    E["Sample B
+analytical sample
+n = 4,265 · stratified by year"]
+
+    F["Manual + LLM labelling
+Subset of Sample A
+κ ≈ .83 after reconciliation"]
+
+    G["SI classifier trained
+Random Forest / XGBoost
+TF-IDF keyword features"]
+
+    H1["Full-text retrieval
+HTML / PDF, scraped"]
+ 
+    H2["Full-text retrieval
+HTML / PDF, scraped"]
+
+    I["SI classifier applied to Sample B"]
+
+    I2["SI papers identification
+n = 1,763 with usable full text"]
+
+    OA["OA classified from metadata
+using Crossref, Web of Science, Scopus"]
+
+    J["OSP training subset
+n = 352, from SI papers in Sample B
+manual & LLM labelling"]
+
+    KOD["OD classifier
+RF / XGBoost"]
+
+    KOM["OM classifier
+RF / XGBoost"]
+
+    KPR["Preregistration classifier
+RF / XGBoost"]
+
+    L["Prevalence estimates, Post-stratified by year, adjusted for misclassification"]
+
+A --> B
+B -- Training/Testing Sample --> D
+B -- Analytical Sample --> C
+D --> H1 --> F --> G
+C --> E --> H2 --> I
+G --> I
+E -- OA from metadata --> OA
+I --> I2 --> J
+J --> KOD & KOM & KPR
+KOD & KOM & KPR -- Applied to all SI papers --> L
+OA --> L
+J --> L
+```
+
+
+![A flowchart of the full research process. All steps described are further explained in the methodological report and the supplements found in the OSF repository.](img/research-flow.svg){#fig-flowchart-pipeline width=100%}
@@ -0,0 +1,128 @@
+Minor to-dos:
+- [ ] Copyediting
+	- [x] Citations: consistency and correct use in running text or at the end of the sentence.
+	- [x] Capitalize abbreviations
+	- [x] Consistent use of terms (RDOF, REPLICATION/RE...)
+	- [x] Last paragraph of Section 2: double-check again.
+	- [x] Table 3: remove unknown  [completion:: 2026-05-12]
+	- [x] Table 4: remove? Last comment from Reviewer 1 -> no
+	- [ ] Method report: include a description of the legal considerations.
+	- [x] "Criminology sits at the intersections of these literatures rather than apart from them, which is why we draw on evidence from across the social sciences in what follows." OK? See the preceding sentence.
+
+## Mail
+
+The reviewers have made a number of positive comments about the paper and we agree it has the potential to make a significant contribution. However, the reviewers have also identified a number of issues in need of attention.
+
+In particular, both reviewers emphasize the importance of greater transparency in the methodological section. In addition, please revise the language to ensure that it is accessible to a broad criminological audience. As criminology brings together scholars from psychology, economics, sociology, law, and related fields, the manuscript should avoid excessive technical jargon and clearly explain methodological concepts that may not be familiar to readers from less technically oriented backgrounds.
+
+Reviewer 1 also encourages you to provide a stronger justification for the inclusion of Law as a field within your search strategy. In particular, it would be helpful to clarify why criminological journals may reasonably be identified under the category of Law in relevant databases.
+
+In addition, as noted by Reviewer 2, the manuscript would benefit from a more fully developed discussion of why questionable research practices (QRPs) and/or open science practices may be shaped by different incentive structures in criminology compared to other disciplines, such as psychology.
+
+Therefore, I invite you to respond to the reviewers' comments and revise your manuscript. If you choose to do this, please ensure that your revised version is no longer than 10,000 words. To ensure a timely review process please submit your revised version within the next six weeks. If you find you require more time, please request this by contacting the Managing Editor, Dr Beth Hardie, at ejc@crim.cam.ac.uk. However, please note that after six months your manuscript can only be considered as a new submission.
+
+## Reviewer 1  
+
+	"To be transparent, I am not a researcher who regularly works with machine learning. As a result, I occasionally found it challenging to follow certain methodological steps as currently described. A simplified visual overview of the workflow (e.g., a schematic or flowchart) may help readers such as myself keep track of the various stages and clarify what is happening with Samples A and B. That said, the appropriate level of methodological detail and accessibility ultimately depends on the authors' intended target audience."
+
+We thank the reviewer for this important note. A flowchart has been added to the manuscript to better support the understanding of the approach. Additionally, some of the core concepts are now briefly defined in the manuscript, and the methodological report is signposted more prominently to encourage readers to consult it for a more detailed discussion of the methods.
+
+	"First, there are a few citation style inconsistencies (e.g., p. 1: Banks et al.; p. 4: Akbaritabar and Squazzoni), as well as minor inconsistencies in acronym capitalization (e.g., p. 8: OS → os)."
+
+We would like to thank you once again for pointing this out. It is a bit embarrassing that such errors were overlooked despite the text having been proofread several times. These mistakes have been corrected in the revised manuscript.
+
+	"Second, it is unclear why this study is not framed as a replication of Scoggins and Robertson; the manuscript describes the approach as "improved," but it would be helpful to specify explicitly what is improved and how. "
+
+This again is a fair point. A paragraph framing the work as a replication was removed from the manuscript during a broader, previous revision. In this revision, the wording was changed slightly to reflect your considerations. The specific methodological decisions taken relative to the work of @scogginsMeasuringTransparencySocial2024 are discussed in the methodological report. As stated later in this response, the report is now mentioned more prominently and has been improved to encourage readers to consult it. It includes a discussion of all decisions taken, methodological deviations from Scoggins and Robertson's design, and planned methods.
+
+	"Third, in the Background section, the discussion of Breznau et al. reads as though these authors coined the term research degrees of freedom in 2022; while their work provides an excellent illustration of RDOFs, it is not the origin of the concept/term. Relatedly, the manuscript may benefit from using "RDOF" more consistently rather than switching to alternatives such as "idiosyncratic researcher variability," as consistency would improve clarity without sacrificing readability; in the same vein, it is unclear why "RDOF" appears in quotation marks on p. 6. "
+
+We also believe that consistent terminology improves readability, which is why we have reviewed and adjusted the terminology to ensure consistency with other terms as well. In addition, we have corrected the important citation of the term's origin by adding a reference to @simmonsFalsePositivePsychologyUndisclosed2011.
+
+	"Fourth, some concepts are introduced without a definition accessible to a general reader-for example, the meaning of "systemic pressure" on p. 3 was not immediately clear; although this is elaborated later, a brief signpost earlier in the text could help orient readers. "
+
+A short side note has been added to foreshadow the later discussion.
+
+	"Fifth, regarding QRPs in psychology and criminology, I would be interested in the authors' rationale for why incentives appear weaker in criminology, and it may improve flow to define QRPs before discussing their prevalence (potentially by switching the order of the relevant paragraphs). "
+
+We thank the reviewer for this observation. We assume it refers to this sentence: "Criminology shows similar patterns, though with lower rates due to the absence of incentives (Chin et al., 2023)." The lower admission rates in Chin et al. (2023) were not intended to suggest that incentives are inherently less significant in criminology, but rather to highlight a methodological difference between the two studies: unlike John et al. (2012), which incorporated truth-incentivizing mechanisms (according to @prelecBayesianTruthSerum2004) into its survey design, Chin et al. (2023) did not. The original phrasing was ambiguous in conveying this distinction. The paragraph has been revised accordingly to make the methodological contrast explicit and less suggestive.
+The order of the paragraphs has been improved according to the reviewer's recommendation.
+
+	"Sixth, p. 5 begins with "First... Second... Third...," but it is unclear what is being enumerated, and it reads as though a sentence may have been removed during editing. "
+
+We have revised this passage accordingly.
+
+	"Seventh, the final paragraph of Section 2 could benefit from minor polishing for readability. "
+
+The wording has been revised for improved readability.
+
+	"Eighth, I found the sampling section and the related explanation in the metadata section somewhat difficult to follow; for example, in the data processing description, it is unclear what is meant by "several improvements were implemented but not processed." I would recommend revisiting these sections with an eye toward clarity, as even small wording issues may confuse readers who are not already deeply familiar with this work."
+
+Unfortunately, the cited paragraph was poorly worded. It refers to a caching issue that is described in greater detail in the methodological report:
+
+	"Several improvements could have been made here. First, the inclusion criterion for the publication date was defined as a combination of published_print and published_online, using the rule ifelse(is.na(published_print), published_online, published_print). A better approach would have been to take the minimum of both dates, ensuring that the earliest publication date was used. Although this solution had been written, it was never applied due to a small but impactful mistake made: Quarto's freeze parameter. When set to true, no code changes are executed. Because this setting was active and the issue was discovered only late in the project, it could no longer be corrected without fully rebuilding the pipeline, including manual recoding of the newly sampled data."
+
+This mistake left many non-SI papers in the corpus, which was both a curse and a blessing: it improved the training of the SI classifier, but it likely also allowed some misclassified non-SI papers into the final analytical sample, which may have led to further misclassification, though only to a small extent. We felt it was necessary to acknowledge this without over-emphasizing it, given its limited implications for the reported results. Nevertheless, the sentence has been revised to provide more context.
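+
+As an illustration, the difference between the two date rules quoted above can be sketched in R. This is a minimal sketch and not the project code: the data frame `dates` and its example values are hypothetical, and only the column names `published_print` and `published_online` are taken from the quoted rule.
+
+```r
+# Hypothetical example data; the column names follow the quoted rule,
+# the dates themselves are invented for illustration.
+dates <- data.frame(
+  published_print  = as.Date(c("2015-06-01", NA, "2011-02-01")),
+  published_online = as.Date(c("2014-11-15", "2012-03-10", NA))
+)
+
+# Rule as applied: prefer the print date, fall back to the online date.
+# Indexing is used instead of ifelse(), which would drop the Date class.
+applied <- dates$published_print
+missing_print <- is.na(applied)
+applied[missing_print] <- dates$published_online[missing_print]
+
+# Intended rule: the earlier of the two dates, ignoring missing values.
+intended <- pmin(dates$published_print, dates$published_online, na.rm = TRUE)
+
+# Rows where the two rules disagree (here: the first record only).
+which(applied != intended)
+```
+
+In this invented example, the two rules disagree only for the first record, where the online date precedes the print date.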
+
+The whole section is now supported by a flow-chart. Additionally, some parts within the Data Collection section were improved. 
+
+	"Ninth, the question of whether research is OA does not appear to be reflected in either of the stated research questions. "
+
+This is of course correct. Open access was not part of the initial research questions but is "[...] reported as secondary, descriptive analyses to benchmark open-science adoption", that is, planned as an accompanying analysis to provide context. Rather than formulating a research question post hoc, we have added a short paragraph to the introduction that explains this rationale.
+
+	"Tenth, it is unclear why "Unknown" appears three times in Table 3. "
+
+The "Unknown" category represents the non-statistical inference papers in each category, reported along with the rest of the sample to indicate the non-applicable cases. To improve the table, "Unknown" has been changed to "Non-SI".
+
+	"Finally, the table ordering and references in the text were slightly confusing (e.g., Table 3 is referenced and the next paragraph refers to Table 5; Table 4 is mentioned before Table 3 but appears afterward). This disrupted the reading flow somewhat, and it may be worth considering how much added value Table 4 provides beyond what is already described in the text."
+
+The table ordering and references have been revised for improved flow. Table 4 was retained as we believe that it provides a useful summary of the sample characteristics, but the text has been revised to better integrate its content.
+
+## Reviewer 2  
+  
+	"First, the scope of the paper, mostly notably the data, is incredibly broad. This is evident from the background section, in which the authors - I assume consciously and intentionally - talk about the challenges of open science practices for social science (and even beyond). This seems odd, given that readers of EJC are a pretty specific subset of social science. Similarly, the scope of the literature population includes law. Again, this might seem trivial for someone outside of social science, but criminology and law (researchers) have very little in common (mostly), so lumping these results together is a rather crude decision that doesn't seem to have much justification. A minor point but I wanted to highlight it. Are the authors planning on deploying these models for other fields?"
+
+We thank the reviewer for raising this point about scope, and we welcome the opportunity to clarify our rationale.
+
+On the broad framing of open science challenges: The decision to discuss open science across the social sciences (and adjacent fields) reflects the inherently interdisciplinary composition of criminological research itself. Criminological scholarship spans macro-level sociological and economic work through to micro-level analytical psychology, and the open science challenges relevant to the literature we analyze cut across this entire range. Restricting the background to "criminology" narrowly defined would, in our view, understate the disciplinary heterogeneity of the work actually published in the journals in our corpus. That said, we agree the motivation for the broader framing could be made more explicit for EJC's readership, and we will revise the background section to signpost this rationale.
+
+On the inclusion of law journals: We take the reviewer's point that criminology and law are, generally speaking, distinct fields with limited methodological overlap, and we should have justified this inclusion more carefully. Journals classified as "law" by JCL contain a substantial body of legal psychology research that falls squarely within our analytical scope. @gonzalez-salaCaracterizacionPsicologiaJuridica2017 for example explicitly analyze journals that span both the Criminology & Penology and Law categories in Web of Science (WoS), documenting the relationship between the two categories through legal psychology content. The Law category in WoS captures a body of empirical work, particularly legal psychology, that overlaps substantially with criminological research. This is reflected by the sample characteristics: although only a small share of papers drawn from law journals are empirical statistical inference papers, a closer inspection of the corpus shows that these papers are predominantly legal psychology works rather than doctrinal legal scholarship. In other words, the "law" label captures legal psychology contributions that would otherwise be missed. We will add a clarifying note to the methods section to make this filtering logic transparent to readers.
+
+On deployment to other fields: This falls outside the immediate scope of this paper, but the pipeline is designed to be portable and reusable for other corpora, and we hope it will be used by other researchers to analyze open science practices in other disciplines.
+
+	"Second, and perhaps most importantly, I do have concerns about the manner in which the data and methods are explained, and the transparency of the methods themselves. This is a criminology journal, so it's a balance between comprehensiveness (and therefore technical language) and vagueness (not enough detail). This paper text itself does neither, really, but then I was pleasantly surprised when reading the supplementary materials. These are great, I would highly recommend that the authors sign-post the supplementary materials better in the paper, and then improve the README of the OSF repo itself so that more people can access the local website. The contents of the 'anon' folder is fantastic, but the average crim academic would never find it or know what to do with a folder full of html files without specific instructions. It's somewhat ironic given the contents of the paper, so we should expect better from the authors in this regard."
+
+This is a very important and, to be honest, fairly obvious criticism to make. The data availability statement as well as the introduction to the methods section have been revised to better reflect this. To improve accessibility, a README file for the method report has been added. Additionally, a GitHub Pages site has been created so that the HTML files can be accessed via a simple link. Note that this link contains the user name of one of the authors, which could deanonymize the submission for the reviewers. We therefore did not explicitly include that URL in the manuscript. Brief definitions of the core concepts have been added to the manuscript. We decided not to include any further discussion, as this is already available in the methodological report and would have reduced clarity for readers without a background in ML methods.
+
+	"On that note, I think most readers will expect a defence of the 'blackbox' criticism of GPT models. Only a handful of criminologists these days still think 'machine learning = amazing novel paper', but instead, many now approach such papers with skepticism. I think, especially given the topic of the paper, that this blackbox critique will need preemptively addressing by the authors."
+
+We welcome the opportunity to address the blackbox concern directly, as we agree it needs a clear response - particularly given the subject matter of the paper.
+
+The blackbox criticism is well-founded when a large language model serves as the primary analytic instrument, producing classifications or inferences that cannot be independently inspected or reproduced. Our design differs in an important respect: ChatGPT was used not as a classifier but exclusively as a _labelling assistant_ during the construction of the training data. Its labels were validated against hand-coded annotations on a random subsample, yielding high agreement after reconciliation (κ ≈ .83). The combined manual/LLM labels then served as training and test data for conventional, tunable ML classifiers - models whose feature sets, hyperparameters, decision boundaries, and performance metrics are fully documented and reported.
+
+The final classifications driving our results therefore come from these transparent, validated ML models, not from GPT directly. The role of the LLM was to scale the labelling process efficiently, subject to human validation - a use case that is both auditable and replicable. The trained models, labelled dataset, and coding manual are made available in the supplementary materials precisely so that readers can scrutinize and if necessary contest the classification decisions.
+
+We have added a brief clarification of this distinction to the methods section to preempt this concern for readers.
+
+	"Third, related to the above, the lack of data availability is not particularly convincing. Again. the obvious irony, because the authors must know the vague reasons academic give for not sharing data (legal/privacy reasons with no actual justification or legal basis), and yet here the data is not shared due to 'copyright concerns'. Well, considering the topic of the paper, we need more than that. Have you sought advice on this, or got written correspondence with the publishers to say you cannot? Can you publish some of the data? The authors must know that reproducing this paper would actually be very difficult without this, so we need more detail and justification for not providing at least some of the data. If the data really really cannot be made available, then I would suggest elaborating and sign-posting the supplementary materials even further so that you really do make it easier for someone to re-obtain the data and reproduce your study."
+
+We thank the reviewer for pressing on this point - it is a fair and important one, and we want to address it with the detail it deserves. We have in fact reviewed the text and data mining (TDM) policies of every publisher whose content appears in our corpus. The situation is heterogeneous: some publishers explicitly permit TDM (e.g. SAGE, Cambridge, MDPI, Taylor & Francis, Wiley via their TDM API, Nature, Emerald, Annual Reviews on request), while others prohibit or significantly restrict it (e.g. Elsevier permits API-based access but not scraping; ASCE explicitly prohibits TDM). A small number of sources could not be verified or are excluded on other grounds (e.g. PsycNET supplementary files, which contain no analysable full text). This variation across publishers means that a blanket public release of the full-text corpus is not possible: even where individual publisher policies would allow it, the corpus as a whole includes content from publishers that do not.
+
+It is also worth noting that under EU Directive 2019/790 (Articles 3 and 4), which is applicable to this work, text and data mining for scientific research purposes is broadly permitted for authorised users, and our access was mostly obtained through institutional subscriptions. However, this right to mine does not extend to a right to redistribute the underlying full texts - which is the relevant restriction here.
+
+We have therefore clarified the data availability statement to reflect this situation more precisely, and have expanded the supplementary materials to include:
+
+1. Reproduction materials, including the labelled dataset derived from the full texts, are fully available. All analyses reported in the paper can be replicated directly from this dataset without requiring access to the underlying full texts.
+2. All Quarto (R) documents for the manuscript and the methodology report will be made publicly available via a git repository, as stated in the data availability statement. The repository is not yet publicly accessible, as full anonymization for double-blind review requires non-trivial effort that cannot be completed within the current revision timeline. An anonymized, less extensive version of all the files is now included in the OSF repository. The authors hope that the materials provided in their current form are sufficient for the purposes of review.
+3. The trained classification models will be made publicly available alongside the reproduction materials.
+4. A summary of the relevant publisher TDM policies has been added to the supplementary materials, covering all publishers whose content appears in the corpus.
+
+We hope this makes clear that the copyright concern is not a vague disclaimer but reflects a heterogeneous licensing landscape, and that we have taken seriously the responsibility to make reproduction as straightforward as possible within those constraints.
+
+	"Fourth, the authors often mix-up replication and reproduction terms (as used in social science). These are not the same and cannot be used interchangeably. This needs rectifying."
+
+We thank the reviewer for pointing this out. We have reviewed the manuscript to ensure that the terms "replication" and "reproduction" are used correctly and consistently according to their standard definitions in the social sciences. The relevant sections have been revised accordingly.
+
+	"Fifth, maybe I am being pedantic, but preregistration is not necessarily an open research practice. It's much more about avoiding questionable research practices (QRP). You can preregister your study and have zero open research materials. Also, often in criminology, preregistration does not necessarily help with QRP unless there's an audit trail of the preregistration being recorded prior to data being collected and accessed (often, in criminology, this is not the case, because of secondary data analysis). I say this as someone that is pro-prereg and has a few myself. A minor point because preregistration is still worth looking into, but I wanted to voice this."
+
+We understand the rationale behind the critique and see the challenges of preregistration in criminology, especially due to secondary data analysis. We think that the discussion largely hinges on the definition of open science itself. In this work, we follow the Center for Open Science, which describes preregistration as "[...] a specific plan for the upcoming study. Doing so helps to distinguish planned from unplanned work" [@sciencePreregistration]. Here, the second sentence is of great importance: distinguishing planned from unplanned work makes visible the deviations and research decisions that were made once the data were at hand, and these are of special interest. A thorough discussion of this can be found in @nosekPreregistrationRevolution2018. While the main motivation might of course be avoiding QRPs, transparently distinguishing the decisions made in light of the available data, or of challenges that arose during the analysis, enables a critical review of published work, which also makes preregistration a valuable instrument in the open science framework. We have tried to emphasize this point but consider a more thorough discussion to be beyond the scope of this work.
Author	SHA1	Message	Date
mischbeck	c4b94c7f8f	revision 1, almost done	2026-05-18 22:43:11 +02:00
mischbeck	ada154a107	adds stuff	2026-04-16 19:18:19 +02:00