first working full version

2025-12-13 17:52:35 +01:00
parent 9b80d36466
commit 8c412f0454
6 changed files with 193 additions and 50 deletions
@@ -1,5 +1,11 @@
 QUARTO ?= quarto
 # Allow overriding the list of docx outputs via env var `docx`.
 # Example: `make docx docx="index Supplements"` or `make docx docx=index`.
 DOCX ?= index Supplements
 DOCX := $(if $(docx),$(docx),$(DOCX))
 DOCX_DOCS := $(addsuffix .docx,$(DOCX))
 .PHONY: all pdf docx clean
 # Build both formats for both documents
@@ -7,7 +13,8 @@ all: pdf docx
 # Aggregate targets
 pdf: index.pdf Supplements.pdf
-docx: index.docx Supplements.docx
+docx: $(DOCX_DOCS)
 docx-main: index.docx
 # Pattern rules for either format
 %.pdf: %.qmd
@@ -1 +1,4 @@
-This repository contains the quarto project for the article "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning".
+This repository contains the quarto project for the article "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning".
 Extensions:
 - [kapsner/authors-block](https://github.com/kapsner/authors-block): brings the capability to add an author-related header block when rendering docx-documents with Quarto.
@@ -2,6 +2,27 @@ project:
  type: default
  output-dir: _output
 lang: en-US
 authors:
  - name: Michael Beck
    affiliations:
      - ref: die
    corresponding: true
    email: michaeljbeck@proton.me
    orcid: 0009-0005-4622-4717
 affiliations:
  - id: die
    name: German Institute for Adult Education - Leibniz Centre for Lifelong Learning (DIE), Bonn, Germany
 abstract: |
  This pilot study addresses the current lack of systematic, large-scale evidence on Open Science Practices (OSPs) adoption in criminology and legal psychology. A scalable, machine-learning-based text classification pipeline is introduced to map the prevalence of Open Access (OA), Open Data (OD), Open Materials (OM), and Preregistration (PR). The analysis is based on publication metadata and a year-stratified sample of full texts from the top 100 journals in Criminology & Penology, Law, and Psychology (2013-2023). After identifying articles containing statistical inference (SI) via a high-performing classifier, I utilized GPT-assisted coding and supervised learning to train specific classifiers for OD, OM, and PR. OA was classified using publicly available metadata. Among 1,763 SI articles with usable full text, design-based estimates reveal a significant disparity in OSP adoption. OA is relatively common (40.9%, 95% CI: 38.8-43.1) and has steadily increased from approximately 20% in 2013 to 50% in 2023. By sharp contrast, trends for OD, OM, and PR cannot be reliably quantified. 
  | Extreme class imbalance and the minimal number of positive cases indicate a very low underlying true prevalence for these practices in the assessed field. Methodologically, the study confirms that GPT-assisted coding supports accurate SI detection, but robust prevalence estimation for extremely low-frequency OSPs remains challenging for downstream classifiers. Overall, this project establishes a transparent and reproducible pipeline and provides critical baseline estimates for future, larger-scale assessments of research transparency in crime-related fields.
 keywords:
  - Metaphysics
  - String Theory
 date: 2025-12-14
 format:
  aog-article-pdf:
    papersize: a4
@@ -23,10 +44,14 @@ format:
    lof: false
  docx:
    prefer-html: true
-    toc: true
+    toc: false
    toc-depth: 3
    lot: false
    lof: false
    reference-doc: custom-reference-doc.docx
 filters:
  - authors-block
 always_allow_html: true
@@ -45,6 +45,7 @@ pkgs <- c(
  "gt",
  "knitr",
  "kableExtra",
  "flextable", # docx tables
  # Misc helpers
  "rlang",
@@ -18,8 +18,13 @@ execute:
 #| label: setup
 #| include: false
 source("deps.R")
 # uncomment the following line to disable graphs & tables
 #output_format <- ""
 ```
 \newpage
 # Introduction
 When evidence makes headlines, influences public opinions, shapes policing, sentencing or rehabilitation, it touches lives. But over the last decades, social scientists have learned how easily impressive results evaporate. Criminology is not insulated from these pressures. In this paper, open science practices in criminology and legal psychology are monitored to assess if the field is wired to catch errors before they might become policy.
@@ -98,7 +103,7 @@ In short, many systemic and researcher-centric challenges cut across OSPs - and
 A preregistration is a time-stamped plan for a study's hypotheses, design, and analysis, often made public. Its contents vary by method (e.g., hypotheses, sampling, interview guides, exclusion rules, analysis plans) [@loggPreregistrationWeighingCosts2021; @managoPreregistrationRegisteredReports2023; @americanpsychologicalassociationOpenScienceBadges].
-Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, by committing ex ante, researcher degrees of freedom are narrowed. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. Preregistration also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs by construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation-strengthening credibility [ @evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. Preregistration helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows "researcher degrees of freedom" [@simmonsFalsePositivePsychologyUndplanned, what changed, and why-we reduce bias, improve interpretability, and isclosed2011].
+Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, committing ex ante narrows researcher degrees of freedom. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. Preregistration also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs by construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation, which strengthens credibility [@evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. Preregistration helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows "researcher degrees of freedom" [@simmonsFalsePositivePsychologyUndisclosed2011].
 For this work, **preregistration** is defined as *the act of planning and documenting the hypotheses, study design, and analysis plan of a study before data is collected or even viewed. The documentation is typically time-stamped and made publicly available*.
@@ -207,9 +212,25 @@ if(output_format == "pdf/tex") {
    booktabs = TRUE,
    longtable = FALSE, # avoid longtable entirely
    col.names = c("Step #", "Description", "Before", "After", "Dropped"))
 } else if(output_format == "docx") {
  tbl_cases %>% 
    flextable() %>%
      set_table_properties(width = 1, layout = "autofit") %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
      fontsize(size = 10, part = "body") %>%
      set_header_labels(
        step_id = "Step #",
        step_label = "Step",
        n_before = "Before",
        n_after = "After",
        n_dropped = "Dropped"
      )
 } else {
-  
+
 }
 if (isTRUE(debug_mode)) {
  debug_info[[knitr::opts_current$get("label")]] <- 
    if (knitr::is_html_output()) "HTML" else "LaTeX"
@@ -232,6 +253,7 @@ The final analytical sample is made up of 4265 publications. The OS prevalence c
 #| fig-cap: "Frequencies: Publications by Year in Population and Sample"
 #| label: fig-freq-pubs-comp
 #| fig-height: 6
 #| fig-width: 8
 #| fig-pos: H
 meta_final <- qs_read(file_meta_final)
@@ -329,7 +351,14 @@ p4 <- sample_B_by_year %>%
    labels = function(x) ifelse(x %% 2 == 1, x, "") # use modulo
  )
-print((p1|p2) / (p3|p4))
+if(output_format == "pdf/tex") {
  print((p1|p2) / (p3|p4))
 } else if(output_format == "docx") {
  print((p1|p2) / (p3|p4))  
 } else {
 }
 if (isTRUE(debug_mode)) {
  debug_info[[knitr::opts_current$get("label")]] <- 
    if (knitr::is_html_output()) "HTML" else "LaTeX"
@@ -367,15 +396,32 @@ tbl_cases2 <- tbl2 %>%
  select(step_id, step_label, n_before, n_after, n_dropped)
 if(output_format == "pdf/tex") {
-tbl_cases2 %>%
+  tbl_cases2 %>%
-  kable(
+    kable(
-    format   = "latex", # force LaTeX output (not markdown)
+      format   = "latex", # force LaTeX output (not markdown)
-    booktabs = TRUE,
+      booktabs = TRUE,
-    longtable = FALSE, # avoid longtable entirely
+      longtable = FALSE, # avoid longtable entirely
-    col.names = c("Step #", "Step", "Before", "After", "Dropped"))
+      col.names = c("Step #", "Step", "Before", "After", "Dropped")
      )
 } else if(output_format == "docx") {
  tbl_cases2 %>% 
    flextable() %>%
      set_table_properties(width = 1, layout = "autofit") %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
      fontsize(size = 10, part = "body") %>%
      set_header_labels(
        step_id = "Step #",
        step_label = "Step",
        n_before = "Before",
        n_after = "After",
        n_dropped = "Dropped"
      )
 } else {
-  
+
 }
 if (isTRUE(debug_mode)) {
  debug_info[[knitr::opts_current$get("label")]] <- 
    if (knitr::is_html_output()) "HTML" else "LaTeX"
@@ -406,9 +452,7 @@ Data is reported per year. As per year data given the very low prevalences is ex
 Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates. 
-The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier.
+The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time.
 In contrast, a question that was not originally foregrounded proved answerable: the prevalence and trajectory of OA among SI publications, measured from metadata with high reliability, show clear increases over time.
 Before misclassification adjustment, design-based prevalences were estimated among SI papers with 95% CIs. For outcomes identified by the ML classifiers (OD, OM, Preregistration), these reflect survey-design uncertainty only. @fig-osp-adoption shows a steady rise in OA from \~20% in 2013 to \~50% in 2023, while the other practices suffer from extremely low counts; for some years (e.g., 2013 OD; 2016 Preregistration) estimates were not possible. @tbl-osp-prev-overall confirms low prevalences across the full period: OA $40.9\%$ (38.8-43.1), OM $4.3\%$ (3.4-5.3), Preregistration $3.6\%$ (2.8-4.5), and OD $2.2\%$ (1.6-2.9).
@@ -467,16 +511,26 @@ tbl_sample_desc <- df %>% mutate(
    footnote = "A: Psychology, Multidisciplinary; B: Law; C: Criminology & Penology",
    columns = "label",
    rows = variable == "journal_category"
-  ) %>% as_gt() %>%
+  ) 
  tab_options(
    table.font.size = gt::px(12),
    latex.use_longtable = TRUE
  )
 if(output_format == "pdf/tex") {
-  tbl_sample_desc
+  tbl_sample_desc %>%
    as_gt() %>%
    tab_options(
      table.font.size = gt::px(12),
      latex.use_longtable = TRUE
    )
 } else if(output_format == "docx") {
  tbl_sample_desc %>%
    as_flex_table() %>%
      set_table_properties(width = 1, layout = "autofit") %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
      fontsize(size = 10, part = "body") %>%
      width(5, 1) %>%
      height_all(height = .2)
 } else {
  #tbl_sample_desc %>% as_kable()
 }
 if (isTRUE(debug_mode)) {
@@ -485,10 +539,13 @@ if (isTRUE(debug_mode)) {
 }
 ```
 In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.
 ```{r}
 #| fig-cap: OSP Adoption Over Time, among statistical inference papers (design-weighted)
 #| label: fig-osp-adoption
 #| fig-pos: H
 #| fig-width: 7
 # ensure that types match
 df <- df %>% mutate(published_year = as.integer(published_year))
@@ -635,9 +692,7 @@ if (isTRUE(debug_mode)) {
 }
 ```
-In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.
+Because design-based estimates do not account for classifier error, Rogan-Gladen adjustments were applied using sensitivity and specificity from the ML-validation analysis in @tbl-osp-prev-overall [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Substantive claims should therefore rest on the design-based estimates and on OA (measured from metadata).
 Because design-based estimates do not account for classifier error, Rogan-Gladen adjustments were applied using sensitivity and specificity from the ML-validation analysis (@tbl-osp-prev-overall) [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Substantive claims should therefore rest on the design-based estimates and on OA (measured from metadata).
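 For reference, the standard Rogan-Gladen estimator can be written as follows; the symbols $\hat{p}_{\text{obs}}$ (observed, design-based prevalence) and $\hat{p}_{\text{adj}}$ (adjusted prevalence) are notation assumed for this sketch and are not taken from the analysis code:
 $$
 \hat{p}_{\text{adj}} = \frac{\hat{p}_{\text{obs}} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1}
 $$
 With the OD values reported above, the numerator is approximately $0.022 - 0.127 = -0.105$, so the adjusted point estimate is negative for any classifier with $\text{Se} + \text{Sp} > 1$, which illustrates why the adjustment breaks down when the false-positive rate exceeds the observed prevalence.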
 ```{r}
 #| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)
@@ -680,27 +735,38 @@ overall_osp_si <- overall_osp_si %>%
  ) %>%
  arrange(desc(`Prevalence`))
 tbl_overall_osp_si <- overall_osp_si %>% 
  kbl(
    format = 'latex',
    longtable = TRUE,
    booktabs = TRUE, 
    escape = T,
  ) %>% # add footnote
  column_spec(1, width = '3cm')%>%
  kable_styling(
    position = "center",
    latex_options = "hold_position",
    full_width = FALSE) %>%
  kableExtra::footnote(
    general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", 
    general_title = "Note:", 
    footnote_as_chunk = T, 
    threeparttable = T
    )
 if(output_format == "pdf/tex") {
  print(tbl_overall_osp_si)
 } else if(output_format == "docx") {
  overall_osp_si %>% 
-    kbl(
+    flextable() %>%
-      format = 'latex',
+      set_table_properties(width = 1, layout = "autofit") %>%
-      longtable = TRUE,
+      theme_booktabs(bold_header = TRUE) %>%
-      booktabs = TRUE, 
+      align(align = "center", part = "all") %>%
-      escape = T,
+      fontsize(size = 11, part = "header") %>%
-    ) %>% # add footnote
+      fontsize(size = 10, part = "body") %>%
-    column_spec(1, width = '3cm')%>%
+      set_caption(caption = "Note: Prevalence estimates in statistical inference publications using design-weights per year (95% CI)")
    kable_styling(
      position = "center",
      latex_options = "hold_position",
      full_width = FALSE) %>%
    footnote(
      general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", 
      general_title = "Note:", 
      footnote_as_chunk = T, 
      threeparttable = T
      )
 } else {
 }
 if (isTRUE(debug_mode)) {
@@ -709,6 +775,8 @@ if (isTRUE(debug_mode)) {
 }
 ```
 Earlier differences in text sources suggest heterogeneity by journal, which also implicates publisher-level variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging the larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
 ```{r}
 #| tbl-cap: Observed and Adjusted Prevalence of Open Science Practices among Statistical Inference Papers
 #| label: tbl-osp-prev
@@ -818,7 +886,7 @@ if(output_format == "pdf/tex") {
    kbl(format = "latex", booktabs = TRUE, escape = FALSE,
        align = c("l","l","l","r","r","r","r"), longtable = TRUE) %>%
    kable_styling(latex_options = "hold_position") %>%
-    footnote(
+    kableExtra::footnote(
      number = c(
        "Sensitivity",
        "Specificity",
@@ -827,16 +895,53 @@ if(output_format == "pdf/tex") {
      ),
      escape = FALSE
    )
 } else if(output_format == "docx") {
  colnames(osp_table_pretty) <- c(
  "OSP", "Obs. (95% CI)", "Adj. (95% CI)",
  "Se", "Sp", "Pos", "Neg"
  )
  osp_table_pretty %>% 
    mutate(
      across(where(is.character),
                ~ stringr::str_replace_all(.x, "\\\\([%&_{}#])", "\\1"))
      ) %>%
    flextable() %>%
      set_table_properties(width = 1, layout = "autofit") %>%
      theme_booktabs(bold_header = TRUE) %>%
      align(align = "center", part = "all") %>%
      fontsize(size = 11, part = "header") %>%
      fontsize(size = 10, part = "body") %>%
      footnote(
        i = 1,
        j = 4:7,
        value = as_paragraph(
          c(
            "Sensitivity",
            "Specificity",
            "Number of positive cases in validation set",
            "Number of negative cases in validation set"
          )
        ),
        part = "header",
        ref_symbols = c("a", "b", "c", "d")
      ) %>%
      fontsize(size = 9, part = "footer") %>%
      width(j = 1:3, 1.85) %>%
      colformat_double(
        big.mark = ",", digits = 2, na_str = "N/A"
      )
 } else {
-  
+
 }
 if (isTRUE(debug_mode)) {
  debug_info[[knitr::opts_current$get("label")]] <- 
    if (knitr::is_html_output()) "HTML" else "LaTeX"
 }
 ```
 Earlier differences in text sources suggest heterogeneity by journal, which also implicates publisher-level variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging the larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
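 As a sketch of the two model forms contrasted here, let $y_{jt}$ be the number of OA articles among $n_{jt}$ sampled articles for publisher $j$ in year $t$, with $\hat{p}_{jt} = y_{jt}/n_{jt}$; this notation is assumed for illustration and is not taken from the analysis code. The per-publisher OLS trend and a binomial GLM alternative could then be written as:
 $$
 \hat{p}_{jt} = \alpha_j + \beta_j t + \varepsilon_{jt}
 \qquad \text{versus} \qquad
 y_{jt} \sim \text{Binomial}(n_{jt}, \pi_{jt}), \quad \text{logit}(\pi_{jt}) = \alpha_j + \beta_j t
 $$
 The logit link keeps fitted shares inside $(0,1)$ and lets years with more articles carry more weight, which the unweighted linear trend on proportions does not do.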
 This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
@@ -854,8 +959,8 @@ Despite of all the limitations, there are main substantive implications: OSP pre
 ```{r}
 #| label: fig-osp-time-by-publisher
-#| fig-width: 10
+#| fig-width: 9
-#| fig-height: 10
+#| fig-height: 9
 #| fig-cap: Open Access by Publisher over Time.
 #| fig-pos: H
@@ -965,7 +1070,7 @@ grid_publishers <- ggplot(
  ) +
  labs(
    x = "",
-    y = "% of articles Open Access",
+    y = "% of Open Access articles",
    title = "",
    caption = paste0(
      "Top 12 publishers by sample n.\nWithin-year proportions from stratified-by-year sample.\n",
@@ -1056,5 +1161,7 @@ The authors declare no conflicts of interest.
 if (isTRUE(debug_mode)) {
  print("# Debug Info")
  print(debug_info)
  print(paste0("Output Format set to **", output_format, "**"))
 }
 ```