finalizes manuscript for preprint

2025-12-17 18:41:28 +01:00
parent 0908df2a79
commit b7a0ddbc94
5 changed files with 168 additions and 109 deletions
@@ -4,6 +4,11 @@ top-level-division: section
 prefer-html: true
 execute:
  freeze: auto
+format:
+  pdf:
+    toc: true
+    lot: true
+    lof: true
 ---

 ```{r}
@@ -12,6 +17,8 @@ execute:
 source("deps.R")
 ```

+\newpage
+
 # Introduction

 This document serves as a supplement to the main article, providing additional details on the sampling approach, sample size determination, model training procedures, and evaluation metrics used in the study of Open Science Practices (OSP) adoption in scientific publications. The full methodological report, containing all code necessary for full replication, can be accessed in the OSF repository.
@@ -26,9 +33,7 @@ The process involved in the following steps:
 4. Due to good performance, ChatGPT was used to classify the rest of Sample A, and the combined manual/LLM labels formed the training and test data for subsequent ML models.
 5. ML Classifiers were trained on the produced classified subsample.

-Classification of the training Sample B followed the same approach. For classification document feature matrices were generated using term frequencies of keywords. These keywords were both adopted from  @scogginsMeasuringTransparencySocial2024 as well as self created, and extended using ChatGPT. Keywords were context specific according to the classified variable. All classification tasks were binary classifications. After assembling the keywords, the SI classifier was fine-tuned. Using this classifier, the analytical sample was categorized. SI documents were then classified for applying OSPs.
-
-The approach might seem overly complicated but was intitially designed to be used on a much larger corpus of publications. As time progressed during the project multiple reasons recommend a simpler approach that will be discussed later. 
+Classification of the training Sample B followed a similar approach. For classification, document-feature matrices were generated from keyword term frequencies. The keywords were partly adopted from @scogginsMeasuringTransparencySocial2024, partly self-created, and then extended using ChatGPT. Keywords were specific to the variable being classified. All classification tasks were binary. After the keywords were assembled, the SI classifier was fine-tuned and used to categorize the analytical sample. SI documents were then classified for the application of OSPs.

 # Sample Size

@@ -118,7 +123,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

-The minimum calculated total sample size equals 4265 (rounded) publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When applying the assumed prevalence values for each OSP, the required sample sizes to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%. 
+The minimum total sample size needed to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp is 4,265 publications, using the @agrestiApproximateBetterExact1998 method. When the assumed prevalence of each OSP is applied instead, the required sample sizes for the same precision vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%. 

 These values are all below the worst-case requirement of 4,264, reflecting the lower variance at prevalences farther from 50%. At the assumed prevalences, 2,182 SI papers would be required to estimate OD at 15% with a precision of $\pm$ 1.5 percentage points. This matches the OD requirement but is below the OA requirement; OA, however, can be measured for the whole population, not just SI publications. Thus, while the sample is sufficiently large for OD, OM, and Preregistration, it falls slightly short of the target precision for OA, which could be measured on a larger scale.
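
 As a rough check of the worst-case figure, the following sketch assumes the Agresti-Coull adjusted interval with $p = 0.5$, $z = 1.96$, and a target half-width of $d = 0.015$. The symbols $\tilde{n}$, $\tilde{p}$, $x$, and $d$ are notation introduced here for illustration only; the methodological report remains the authoritative source for the actual calculation.

 $$
 \tilde{n} = n + z^2, \qquad \tilde{p} = \frac{x + z^2/2}{\tilde{n}}, \qquad d = z\sqrt{\frac{\tilde{p}\,(1-\tilde{p})}{\tilde{n}}}
 $$

 $$
 p = 0.5 \;\Rightarrow\; \tilde{p} = 0.5, \qquad \tilde{n} = \frac{0.25\, z^2}{d^2} = \frac{0.9604}{0.000225} \approx 4268.4, \qquad n = \tilde{n} - z^2 \approx 4264.6
 $$

 Under these assumptions the requirement lies between 4,264 and 4,265 publications, which is consistent with the figures quoted above.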

@@ -169,11 +174,49 @@ An overestimation the prevalence of each OSP in the population can lead to poten

 # Full Text Retrieval

-As mentioned in the manuscript, full texts were retreived using a self developed web application that used both web scraping and publisher API's. Legal aspects were carefully considered throughout the development. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Scraping was therefore limited to the university network and only to publishers that permit it while other publishers were scraped outside of the network. Technical details are available in the documents provided while the scraper might be made publicly available in the future. 
+Legal aspects were carefully considered throughout development of the retrieval tool. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Within the university network, scraping was therefore limited to publishers that permit it, while other publishers were scraped from outside the network. The scraper might be made publicly available in the future. 
+
+The full texts were downloaded using **SciPaperLoader**, a Flask-based web application developed by the author for automated scientific paper processing and metadata extraction. The app lets the user import scientific paper metadata (via CSV or manually), then schedules and runs automated fetching and parsing jobs to extract content and metadata, and tracks each paper's processing status with logs and a dashboard. It supports many academic publishers, mostly through YAML-configured parsers (plus optional custom Python parsers) that leverage Beautiful Soup, with a focus on paywall detection and transparent pattern-matching output. 
+
+![Fulltext Downloader Screenshot: Control Panel](img/2025-08-04-174744_hyprshot.png)
+
+The scraper produced .txt files using the following standardized format while also downloading the full HTML and PDF files, if available. This enabled later re-extraction if necessary.
+
+```
+DOI: 10.14763/2017.1.454
+DOI URL: https://doi.org/10.14763/2017.1.454
+Extracted URL: https://policyreview.info/articles/analysis/passage-australias-data-retention-regime-national-security-human-rights-and-media
+Extracted: 2025-08-05T16:57:14.685108
+Parser: policyreview
+Publisher: policyreview
+Extraction Method: HTML
+Paywall Status: open_access
+================================================================================
+
+TITLE: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+
+AUTHORS: Nicolas P. Suzor, Kylie Pappalardo, Natalie McIntosh
+
+ABSTRACT:
+Abstract
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+
+FULL TEXT:
+Title: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+
+Abstract: Abstract
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+
+This paper is part of Australian internet policy, a special issue of Internet Policy Review guest-edited by Angela Daly and Julian Thomas.
+
+[FULLTEXT REDACTED IN THIS DOCUMENT]
+```
+
+During preprocessing, everything above the "FULL TEXT:" heading was discarded.

 # Model Training

-Overall agreement after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which later proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable, but was not financially feasible within the project.
+Overall agreement between manual and ChatGPT-based coding after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable but was not financially feasible within the project.

 ```{r}
 #| fig-cap: Confusion Matrices - Manual vs ChatGPT Labels for Open Science Practices and Statistical Inference (design-weighted)
@@ -253,7 +296,7 @@ if (isTRUE(debug_mode)) {

 ## Model Evaluation

-The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pre show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as seen below.
+The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pr show the performance of different combinations of feature set and model, measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term-frequency feature set as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite accurate, with an accuracy of 91.7% and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Calibration brought little improvement because the model's probabilities were already well calibrated and concentrated near the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification: any case with a predicted probability greater than 0.25 is classified as 1. Note that the OSP classifiers performed much worse, as seen below.

 ![Evaluation Metrics: Statistical Inference Classification](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}

@@ -280,7 +323,7 @@ Category-specific results highlight class-imbalance constraints. Preregistration

 [^1]: The accuracy-no-information-rate p-value tests the null hypothesis that the accuracy is equal to the no-information rate, i.e., the accuracy obtained when always predicting the most frequent class [@kuhnBuildingPredictiveModels2008].

-The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant-an expected consequence of the very low base rate rather than a systematic error.
+The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant, an expected consequence of the very low base rate rather than a systematic error.


 # Tables: OSP Adoption Over Time Among Statistical Inference Papers
@@ -8,8 +8,6 @@ profile:
  group:
    - [default, anon]

-abstract: |
-  This pilot study addresses the current lack of systematic, large-scale evidence on Open Science Practices (OSPs) adoption in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, the author utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. By sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields.
 keywords:
  - Open Science Practices
  - Open Access
@@ -4,6 +4,9 @@ top-level-division: section
 prefer-html: true
 execute:
  freeze: auto
+
+abstract: |
+  This pilot study addresses the current lack of systematic, large-scale evidence on Open Science Practices (OSPs) adoption in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, the author utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. By sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields.
 ---

 ```{=latex}
@@ -218,7 +221,7 @@ if(output_format == "pdf/tex") {
 } else if(output_format == "docx") {
  tbl_cases %>% 
    flextable() %>%
-      set_table_properties(width = 1, layout = "autofit") %>%
+      set_table_properties(width = 1) %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
@@ -229,7 +232,10 @@ if(output_format == "pdf/tex") {
        n_before = "Before",
        n_after = "After",
        n_dropped = "Dropped"
-      )
+      ) %>%
+      width(2, 8.5, unit = "cm") %>%
+      width(5, 2.34, unit = "cm") %>%
+      autofit()
 } else {

 }
@@ -242,14 +248,71 @@ if (isTRUE(debug_mode)) {

 The data obtained necessitated multiple transformations, all of which are reported in the respective section of the methodological report. Publications were filtered by the resulting date variable to limit the population to the defined time interval. To reduce SI coding efforts, simple keyword lists were used to reduce the number of publications by matching titles. Missing values were assessed and language checks were performed. @tbl-cases shows that, from an initial 95,042 publications, these steps resulted in a final publication count of 40,860. Note that several improvements were implemented but not processed. More details can be found in the provided materials.

+```{r}
+#| label: tbl-cases2
+#| tbl-cap: Cases Dropped from Analytical Sample
+tbl2 <- read_csv("data/tbl-sample-case-drops-stattraining-final.csv")
+tbl_cases2 <- tbl2 %>%
+  rename(
+    step_id      = `...1`,
+    step_code    = step,
+    n_before     = before,
+    n_after      = after,
+    n_dropped    = dropped,
+    filter_logic = logic
+  ) %>%
+  mutate(
+    step_label = case_when(
+      step_code == "status == \"Done\"" ~ "Only Successfully Downloaded",
+      step_code == "nchar(fulltext) > 1000"  ~ "Filter out texts < 1000 Words",
+      step_code == "is_statistical == \"Yes\""     ~ "Drop non-statistical papers"
+    ),
+    step_id = as.integer(step_id)
+  ) %>%
+  arrange(step_id) %>%
+  select(step_id, step_label, n_before, n_after, n_dropped)
+
+if(output_format == "pdf/tex") {
+  tbl_cases2 %>%
+    kable(
+      format   = "latex", # force LaTeX output (not markdown)
+      booktabs = TRUE,
+      longtable = FALSE, # avoid longtable entirely
+      col.names = c("Step #", "Step", "Before", "After", "Dropped")
+      )
+} else if(output_format == "docx") {
+  tbl_cases2 %>% 
+    flextable() %>%
+      set_table_properties(width = 1) %>%
+      theme_booktabs(bold_header = TRUE) %>%
+      align(align = "center", part = "all") %>%
+      fontsize(size = 11, part = "header") %>%
+      fontsize(size = 10, part = "body") %>%
+      set_header_labels(
+        step_id = "Step #",
+        step_label = "Step",
+        n_before = "Before",
+        n_after = "After",
+        n_dropped = "Dropped"
+      ) %>%
+      width(2, 8.5, unit = "cm") %>%
+      width(5, 2.34, unit = "cm") %>%
+      autofit()
+} else {
+
+}
+
+if (isTRUE(debug_mode)) {
+  debug_info[[knitr::opts_current$get("label")]] <- 
+    if (knitr::is_html_output()) "HTML" else "LaTeX"
+}
+```
+
+
 [^7]: [https://jcr.clarivate.com/jcr/browse-journals](https://jcr.clarivate.com/jcr/browse-journals), JCR Year set to 2023. 

 ### Sample

-Using the obtained crossref metadata, the analytical sample was drawn stratified by year according to the calculation in @sec-sampling. The resulting analytical sample contains roughly 10% of the population data. As seen in @fig-freq-pubs-comp, Sample A, that is the training and validation sample for the SI classifier, is intended as the proportion of SI papers are expected to not vary and therefore not stratified by year. Stratification by journal was rejected due to the resulting sample sizes of 100 journals would have required much more cases. 
-
-The final analytical sample is made up of 4265 publications. The OS prevalence classification sample consists of 352 publications stratified by year whereas the unstratified sample A for the training of the SI classifiers consists of 408 publications. The next step involved downloading full-text HTML or PDF versions, only using legal and ethical sources.
-
 ```{r}
 #| fig-cap: "Frequencies: Publications by Year in Population and Sample"
 #| label: fig-freq-pubs-comp
@@ -369,74 +432,23 @@ if (isTRUE(debug_mode)) {
 }
 ```

+Using the obtained Crossref metadata, the analytical sample was drawn stratified by year according to the calculation in @sec-sampling. The resulting analytical sample contains roughly 10% of the population data. As seen in @fig-freq-pubs-comp, Sample A, the training and validation sample for the SI classifier, was not stratified by year because the proportion of SI papers was not expected to vary over time. Stratification by journal was rejected because stratifying across 100 journals would have required many more cases. 
+
+The final analytical sample is made up of 4,265 publications. The OS prevalence classification sample consists of 352 publications stratified by year, whereas the unstratified Sample A for training the SI classifiers consists of 408 publications. The next step involved downloading full-text HTML or PDF versions, using only legal and ethical sources.
+
 ### Full Text Retrieval

 The initial approach to gathering full texts, which used Zotero to translate DOIs as per Scoggins and Robertson, was unreliable across multiple attempts and software versions. Due to the unsuitability of existing software tools, be it for technical or legal reasons, a custom web application was developed.

 Downloading the analytical sample was mostly successful, though some publisher protections caused dropouts. Due to time constraints, additional, more optimized runs were not feasible. Documents under 1,000 words were considered non-full-text papers. However, shorter HTML texts were retained for potential keyword matching. Text quality assessment (Flesch-Index) and word count identified missing full texts [@benoitQuantedaPackageQuantitative2018]. Full texts were downloaded for Independent Sample A and the Analytical Sample from which Sample B was drawn. The resulting dropouts were expected to be handled implicitly by post-stratification. Publisher-level weighting was planned but proved infeasible because sparse cells would have produced unstable weights. Post-stratification was therefore conducted by year only, which does not correct publisher- or journal-specific dropouts. Future non-pilot iterations should add publisher-level adjustment.

-```{r}
-#| label: tbl-cases2
-#| tbl-cap: Cases Dropped from Analytical Sample
-tbl2 <- read_csv("data/tbl-sample-case-drops-stattraining-final.csv")
-tbl_cases2 <- tbl2 %>%
-  rename(
-    step_id      = `...1`,
-    step_code    = step,
-    n_before     = before,
-    n_after      = after,
-    n_dropped    = dropped,
-    filter_logic = logic
-  ) %>%
-  mutate(
-    step_label = case_when(
-      step_code == "status == \"Done\"" ~ "Only Successfully Downloaded",
-      step_code == "nchar(fulltext) > 1000"  ~ "Filter out texts < 1000 Words",
-      step_code == "is_statistical == \"Yes\""     ~ "Drop non-statistical papers"
-    ),
-    step_id = as.integer(step_id)
-  ) %>%
-  arrange(step_id) %>%
-  select(step_id, step_label, n_before, n_after, n_dropped)
-
-if(output_format == "pdf/tex") {
-  tbl_cases2 %>%
-    kable(
-      format   = "latex", # force LaTeX output (not markdown)
-      booktabs = TRUE,
-      longtable = FALSE, # avoid longtable entirely
-      col.names = c("Step #", "Step", "Before", "After", "Dropped")
-      )
-} else if(output_format == "docx") {
-  tbl_cases2 %>% 
-    flextable() %>%
-      set_table_properties(width = 1, layout = "autofit") %>%
-      theme_booktabs(bold_header = TRUE) %>%
-      align(align = "center", part = "all") %>%
-      fontsize(size = 11, part = "header") %>%
-      fontsize(size = 10, part = "body") %>%
-      set_header_labels(
-        step_id = "Step #",
-        step_label = "Step",
-        n_before = "Before",
-        n_after = "After",
-        n_dropped = "Dropped"
-      )
-} else {
-
-}
-
-if (isTRUE(debug_mode)) {
-  debug_info[[knitr::opts_current$get("label")]] <- 
-    if (knitr::is_html_output()) "HTML" else "LaTeX"
-}
-```
-
 ## Classification Tasks and Methods

 This section will present a brief summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the supplied materials.

-Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on Random-Forest and XGBoost-models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. For each task, OSP-specific document-feature-matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. 
+Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on Random Forest and XGBoost models trained on a manually and LLM-coded subset of publications, as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. 
+
+For each task, OSP-specific document-feature matrices were constructed using term frequencies or TF-IDF of keyword sets partly adapted from @scogginsMeasuringTransparencySocial2024. 

 First, a strict dichotomous operationalization of SI versus non-SI, as well as of each OSP, was developed and documented in a short coding manual. A subset of Sample A was coded by hand, followed by ChatGPT-based labelling of the full text. On a random subsample, agreement after reconciliation was high ($\kappa \approx .83$), so combined manual/LLM labels served as training and test data for the ML classifiers. A similar approach was used for Sample B. Each OSP classifier was tuned on all combinations of feature sets and models. 

@@ -446,7 +458,9 @@ Given time constraints and the pilot nature of the study, preprocessing and eval

 The research was deliberately designed to study open-science practices via supervised classifiers rather than relying exclusively on metadata. This choice prioritized scalability and the potential to capture practice signals that metadata may miss, at the cost of managing model error and class imbalance. Given the exploratory character of the work, the analyses were not pre-defined; only data collection, sampling, and the model-training strategy were specified in advance. Concerns about classifier interpretability informed the evaluation strategy [@gilpinExplainingExplanationsOverview2018].

-Estimates for OSPs are domain-estimates among SI papers (see @tbl-cases2) drawn from a year-stratified random sample, beta-method CIs are based on design-based variance. Design-weights were applied post-stratified to frame-by-year totals with finite-population corrections. All OSP estimates are domain estimates for SI papers using design-based inference. Design corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson) transformation which provides better coverage for low-prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. Results generalize to the keyword-filtered data. With $n=1,763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.
+Estimates for OSPs are domain estimates among SI papers (see @tbl-cases2) drawn from a year-stratified random sample and rely on design-based inference. Design weights were post-stratified to frame-by-year totals with finite-population corrections. Design-corrected 95% confidence intervals are computed with the beta (Clopper-Pearson) method, which provides better coverage at low prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. 
+
+Results generalize to the keyword-filtered data. With $n = 1{,}763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.

 OSP labels were assigned by classifiers whose sensitivity and specificity are imperfect, with potential misclassifications affecting the reported prevalence rates. To assess robustness, a simple sensitivity analysis using the Rogan-Gladen correction for misclassification of a binary outcome was conducted [@liuQuantitativeBiasAnalysis2023; @vallecamposSerosurveySerologicalSurvey2020]. 
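
 For reference, the Rogan-Gladen estimator has the following standard form, where $\hat{p}_{\text{obs}}$ is the classifier-based prevalence and $Se$ and $Sp$ are the classifier's sensitivity and specificity. The symbols are notation introduced here for illustration; which $Se$ and $Sp$ values were plugged in is documented in the methodological report, not in this sketch.

 $$
 \hat{p}_{\text{adj}} = \frac{\hat{p}_{\text{obs}} + Sp - 1}{Se + Sp - 1}, \qquad Se + Sp > 1
 $$

 When the observed prevalence is close to the false-positive rate $1 - Sp$, the numerator approaches zero and the adjusted estimate becomes unstable or even negative, which is the typical situation at very low prevalences.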

@@ -456,10 +470,6 @@ Data is reported per year. As per year data given the very low prevalences is ex

 Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates. 

-The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time.
-
-Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, Preregistration), these reflect survey-design uncertainty only. @fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 Preregistration) estimates were not possible. @tbl-osp-prev-overall confirms low prevalences across the full period: OA $40.9\%$ (38.8-43.1), OM $4.3\%$ (3.4-5.3), Preregistration $3.6\%$ (2.8-4.5), and OD $2.2\%$ (1.6-2.9).
-
 ```{r}
 #| tbl-cap: Sample Characteristics by Statistical Inference Status
 #| label: tbl-sample-char
@@ -527,12 +537,14 @@ if(output_format == "pdf/tex") {
 } else if(output_format == "docx") {
  tbl_sample_desc %>%
    as_flex_table() %>%
-      set_table_properties(width = 1, layout = "autofit") %>%
+      set_table_properties(width = 1) %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
-      fontsize(size = 10, part = "body") %>%
-      width(5, 1) %>%
+      fontsize(size = 8, part = "body") %>%
+      width(5, 2.34, unit = "cm") %>%
+      width(2:4, 3.5, unit = "cm") %>%
+      autofit() %>%
      height_all(height = .2)
 } else {
 }
@@ -543,7 +555,13 @@ if (isTRUE(debug_mode)) {
 }
 ```

-In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.
+The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time.
+
+Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, Preregistration), these reflect survey-design uncertainty only. 
+
+@fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 Preregistration) estimates were not possible. 
+
+@tbl-osp-prev-overall confirms low prevalences across the full period: OA $40.9\%$ (38.8-43.1), OM $4.3\%$ (3.4-5.3), Preregistration $3.6\%$ (2.8-4.5), and OD $2.2\%$ (1.6-2.9).

 ```{r}
 #| fig-cap: OSP Adoption Over Time, among statistical inference papers (design-weighted)
@@ -700,6 +718,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

+In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstract-only records are more common among non-SI items, and SI papers have higher word counts, appear in higher-impact journals, and are more often OA. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.

 ```{r}
 #| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)
@@ -784,8 +803,6 @@ if (isTRUE(debug_mode)) {

 In @tbl-osp-prev, adjustments were applied using sensitivity and specificity from the ML-validation analysis [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values should be read as sensitivity ranges rather than confirmatory estimates. Substantive claims should therefore rest on the design-based estimates and on OA (measured from metadata).
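
 To make the OD case concrete, the standard Rogan-Gladen estimator can be written out with the figures quoted above. The symbols $\hat{p}_{\text{obs}}$ and $\hat{p}_{\text{adj}}$ are notation introduced here for illustration; the OD sensitivity is not restated in this section and is left symbolic.

 $$
 \hat{p}_{\text{adj}} = \frac{\hat{p}_{\text{obs}} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1},
 \qquad
 \hat{p}_{\text{obs}} + \text{Sp} - 1 \approx 0.022 + 0.873 - 1 = -0.105
 $$

 Because the denominator is positive for any classifier that performs better than chance ($\text{Se} + \text{Sp} > 1$), the negative numerator alone forces $\hat{p}_{\text{adj}} < 0$ for OD, whatever the sensitivity.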

-Earlier differences in text sources suggest heterogeneity by journal, thereby implicating also publisher variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
-
 ```{r}
 #| tbl-cap: Observed and Adjusted Prevalence of Open Science Practices among Statistical Inference Papers
 #| label: tbl-osp-prev
@@ -915,11 +932,11 @@ if(output_format == "pdf/tex") {
                ~ stringr::str_replace_all(.x, "\\\\([%&_{}#])", "\\1"))
      ) %>%
    flextable() %>%
-      set_table_properties(width = 1, layout = "autofit") %>%
+      set_table_properties(width = 1) %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
-      fontsize(size = 10, part = "body") %>%
+      fontsize(size = 9, part = "body") %>%
      footnote(
        i = 1,
        j = 4:7,
@@ -935,7 +952,10 @@ if(output_format == "pdf/tex") {
        ref_symbols = c("a", "b", "c", "d")
      ) %>%
      fontsize(size = 9, part = "footer") %>%
-      width(j = 1:3, 1.85) %>%
+      width(1, 3.13, unit = "cm") %>%
+      width(2:3, 3.62, unit = "cm") %>%
+      width(4:7, 1.5, unit = "cm") %>%
+      autofit() %>%
      colformat_double(
        big.mark = ",", digits = 2, na_str = "N/A"
      )
@@ -949,15 +969,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

-This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
-
-Measurement and modeling challenges were substantial. SI/OSP labels were trained on a small, single-coder hand set plus GPT assistance. Severe class imbalance for OSPs, few validation positives, and upsampling inflated nominal accuracy while depressing stability. Misclassification adjustments (Rogan-Gladen) became unstable at very low prevalences, and some OA trend analyses used simple OLS rather than binomial/GLM approaches.
-
-In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers revealed non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.
-
-I therefore propose a series of recommendations for future iterations, that should expand bibliographic metadata sources (Crossref + Scopus and Web of Science) and further audit screened-out records to assess selection, operationalizations with sharper rules more close to the constructs defined by e.g. OSF, employ multi-coder assessment, and quantify inter-rater-reliability on a larger training data base OR classify leveraging ChatGPT as implied by the very accurate precisions evident here and replace OLS with binomial GLMs or hierarchical models for proportions. On the technicals side, a more stringent Quarto setup should be used, with simplified modular code based on a refined version of the codebase used here. The downloader should be improved in terms of a more homogeneous extraction logic by including the HTML and PDF full-text extraction in the pre-processing pipeline, making the whole process more transparent, reproducible and less error-prone. Finally, the sample size should be increased substantially, ideally to the full population of SI papers in the frame, to improve precision and enable analysis on journal level.
-
-Despite of all the limitations, there are main substantive implications: OSP prevalence signals in SI papers-especially preregistration and OM-are rare enough that model-based estimation is fragile at this scale, whereas OA, measured from metadata, shows a clear upward trend approaching roughly half of SI outputs by 2023. Methodologically, GPT proves to be a promising primary coder for a scaled follow-up, and the pipeline developed here provides a reproducible, yet improvable foundation for a larger, better-powered study.
+The earlier differences in text sources suggest heterogeneity by journal, which in turn implicates publisher-level variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging the larger $n$, simple OLS trends were fitted to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.

 ```{=latex}
 \footnotesize
@@ -1110,6 +1122,16 @@ if (isTRUE(debug_mode)) {
 \normalsize
 ```

+This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
+
+Measurement and modeling challenges were substantial. SI/OSP labels were trained on a small, single-coder hand set plus GPT assistance. Severe class imbalance for OSPs, few validation positives, and upsampling inflated nominal accuracy while depressing stability. Misclassification adjustments (Rogan-Gladen) became unstable at very low prevalences, and some OA trend analyses used simple OLS rather than binomial/GLM approaches.
+
+In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers revealed non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.
+
+I therefore propose a series of recommendations for future iterations. They should expand the bibliographic metadata sources (Crossref plus Scopus and Web of Science) and audit screened-out records to assess selection. Operationalizations should follow sharper rules that are closer to the constructs as defined by, e.g., OSF. Labeling should either employ multiple coders and quantify inter-rater reliability on a larger training set, or rely on ChatGPT for classification, as suggested by the high accuracy observed here. OLS should be replaced with binomial GLMs or hierarchical models for proportions. On the technical side, a more stringent Quarto setup should be used, with simplified, modular code based on a refined version of the codebase used here. The downloader should adopt a more homogeneous extraction logic by including HTML and PDF full-text extraction in the pre-processing pipeline, making the whole process more transparent, reproducible, and less error-prone. Finally, the sample size should be increased substantially, ideally to the full population of SI papers in the frame, to improve precision and enable analysis at the journal level.
+
+Despite these limitations, the main substantive implications are clear: OSP prevalence signals in SI papers, especially preregistration and OM, are rare enough that model-based estimation is fragile at this scale, whereas OA, measured from metadata, shows a clear upward trend approaching roughly half of SI outputs by 2023. Methodologically, GPT proves to be a promising primary coder for a scaled follow-up, and the pipeline developed here provides a reproducible, if improvable, foundation for a larger, better-powered study.
+
 # Conclusion

 The replication crisis has intensified the examination of research practices and accelerated the push for transparency and openness. This study contributes by mapping the adoption of open-science practices (OSP) within criminology and legal psychology, establishing a baseline for future efforts. The evidence indicates meaningful progress in availability, most clearly in OA, yet massive, persistent gaps in reproducibility, particularly for OD, OM, and preregistration.
@@ -1128,11 +1150,11 @@ To make sure, that our results are robust, reliable and credible, this work shal
 \renewcommand\thesection{}
 ```

-# Acknowlegments
+# Acknowledgments {.unnumbered}

-I would like to thank my advisor, [Advisor Name], for their guidance, constructive feedback, and steady support throughout this project. Their expertise and encouragement were invaluable in shaping both the research and this publication.
+I would like to thank my advisor, [Advisor Name], for his guidance, constructive feedback, and steady support throughout this project. His expertise and encouragement were invaluable in shaping both the research and this publication.

-# Data availability
+# Data availability {.unnumbered}

 Materials, Data and Code are made available at a public OSF-repository that can be accessed here:

@@ -1140,19 +1162,15 @@ Materials, Data and Code are made available at a public OSF-repository that can

 Further instructions can be found in the README files. Full-text data and the downloader can't be made available to the public due to copyright concerns.

-# Funding
+# Funding {.unnumbered}

 This research received no external funding.

-# Conflicts of Interest
+# Conflict of interest {.unnumbered}

-The author declares no conflicts of interest.
+One of the editors of this special issue participated as an academic advisor.

-# Conflict of interest
-
-One of the editors of this issue participated as an academic advisor.
-
-# Author Biography
+# Author Biography {.unnumbered}

 Michael Beck is finishing his MSc in Sociology and Social Research at the University of Cologne, Germany. He is interested in computational social science and supports research as a student assistant at the German Institute for Adult Education (Bonn), focusing on evaluating generative AI for education and educational research.

@@ -1160,7 +1178,7 @@ Michael Beck is finishing his MSc in Sociology and Social Research at the Univer
 \newpage
 ```

-# Bibliography
+# Bibliography {.unnumbered}

 ::: {#refs}
 :::