adds docx support

corrects a couple of mistakes
2025-12-10 14:38:33 +01:00
parent ed4d2508e9
commit 1110f67fe3
2 changed files with 243 additions and 79 deletions
+21 -3
@@ -3,15 +3,33 @@ project:
output-dir: _output
format:
aog-article-pdf:
papersize: a4
# mainfont: Noto Serif
# sansfont: Nerd Sans
fontsize: 12pt
geometry: margin=1in
fig-height: 4
fig-width: 7.5
colorlinks: true
urlcolor: blue
fig-cap-location: top
pdf-engine: xelatex
keep-tex: true
latex-max-runs: 3
docx:
prefer-html: true
always_allow_html: true
toc: true
toc-depth: 3
lot: false
lof: false
number-sections: true
citeproc: true
citation-package: none
bibliography: literature/Thesis.bib
reference-section-title: Bibliography
link-citations: true
@@ -19,6 +37,9 @@ csl: https://www.zotero.org/styles/apa
execute:
freeze: auto
echo: false
warning: false
execute-dir: file
header-includes: |
@@ -54,6 +75,3 @@ include-before: |
\thispagestyle{empty}
\newpage
\thispagestyle{empty}
citeproc: true
citation-package: none
+199 -53
@@ -15,10 +15,22 @@ execute:
```
```{r}
#| results: false
#| echo: false
#| label: setup
#| include: false
source("deps.R")
# Output format:
# Set this to match the pandoc/Quarto output format, because the knitr tables below do not render in docx.
# Possible values (the table chunks below compare against "pdf"):
# - "docx"
# - "pdf"
output_format <- "pdf"
# Debug Mode
debug_mode <- TRUE
if (isTRUE(debug_mode)) debug_info <- list()
# Theme
ggthemr('fresh')
# consistent colors for open science practices among plots
@@ -52,13 +64,13 @@ To challenge bias and to support replication of research, a movement has formed
*First*, openly sharing materials, data and code enables replication that reduces p-hacking, surfaces errors, spreads methodological knowledge and might reduce burdens on the researcher, driving broader adoption across science [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007; @finkReplicationCodeAvailability2024]. *Second*, preregistration involves thoroughly outlining and documenting research plans and their rationale in a repository before conducting the research, reducing deliberate or unconscious decisions taken to improve findings, challenging publication bias and other biases [@managoPreregistrationRegisteredReports2023; @hardwickeReducingBiasIncreasing2023; @mertensPreregistrationAnalysesPreexisting2019].
The initial plan for this master's thesis was to study the proposed effects of OSPs on reported effect sizes in published papers. During my initial literature review, it appeared to me that there were only few publications that used preregistration in data-driven Criminology and Legal Psychology. Instead of assessing effect sizes, this raised the question of how OSPs have been adopted within criminology at all. Motivated by the expected positive impact of OSPs, this work studies the use of OSPs in the field.
The initial plan for this work was to study the proposed effects of OSPs on reported effect sizes in published papers. During a first literature review, it appeared to me that there were only few publications that used preregistration in data-driven Criminology and Legal Psychology. Instead of assessing effect sizes, this raised the question of how OSPs have been adopted within criminology at all. Motivated by the expected positive impact of OSPs, this work studies the use of OSPs in the field.
@scogginsMeasuringTransparencySocial2024 did an extensive analysis of nearly 100,000 publications in political science and international relations and observed an increasing use of preregistration and open data, with levels still being relatively low. Their extensive research not only revealed the current state of open science in political science, but also generated rich data to perform further meta research. Inspired by their work, I adopt their research questions to assess OSPs in the fields of criminology and legal psychology:
@scogginsMeasuringTransparencySocial2024 did an extensive analysis of nearly 100,000 publications in political science as well as international relations and observed an increasing use of OSPs, with levels still being relatively low. Their extensive research not only revealed the current state of open science in political science, but also generated rich data to perform further meta research. Inspired by their work, I adopt their research questions to assess OSPs in the fields of criminology and legal psychology:
> $RQ_1$: What proportion of papers that rely on statistical inference make their data and code public?
> $RQ_2$: What proportion of experimental studies were preregistered?
> $RQ_2$: What proportion of statistical inference publications were preregistered?
This work gathers data about papers in a subset of Criminology and Legal Psychology journals, categorizes those papers by application of open science practices using machine learning methods and explores the patterns over time. The methods will closely resemble and try to improve the approaches taken by @scogginsMeasuringTransparencySocial2024. The research will contribute to the ongoing discussion about the use of OSPs by painting a clearer picture of their adoption in the field. The improved approach will serve as a starting point for a more extensive exploration of OSPs in criminology and legal psychology and will contribute to the growing literature of machine learning and LLMs in classification tasks of scientific literature.
@@ -66,20 +78,14 @@ But first, a closer look at the underlying issues leading to the recent developm
# Background
But what defines scientific progress? Scientific progress can be viewed in two major ways. Thomas Kuhn defined it as a revolutionary shift in paradigms, which are accepted theories within a scientific community [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions2012]. He argued that while normal science operates within these paradigms, unexplained anomalies eventually lead to a crisis and a scientific revolution. In contrast, Karl Popper's critical rationalist approach views scientific progress as a cumulative process of conjectures and refutations [@popperLogicScientificDiscovery2005]. Popper argued that science advances by eliminating false theories through falsifiability, moving linearly toward the truth. Where Kuhn emphasized refining dominant theories, Popper focused on challenging them.
In his widely reviewed standard reading "Seven rules for social research", @4ff8afa9-5c92-3c50-b832-a1756ccbeedc emphasizes the importance of the reproduction of research findings. But already in the title of the chapter or the rule itself, Firebaugh cuts back on his appeal: "replicate *where possible*". Emphasizing the increasing availability of data, he acknowledges the challenges researchers face in achieving true replication and advertises optimism. As the book is from 2008 and the acceptance of the book is at least perceived to be high, one could expect that replication today as well as research practices enabling replication are broadly adopted. But is this the case?
Today's social sciences often align with Popper's ideas, using frequentist, deductive reasoning and significance testing to evaluate hypotheses. This approach is criticized for its limitations and focus on p-values [@dunleavyUseMisuseClassical2021; @wilkinsonTestingNullHypothesis2013]. Bayesian inference, conversely, uses inductive reasoning to update models with new data and assess them using Bayes factors, but doesn't directly falsify models [@gelmanInductionDeductionBaysian2011]. Ultimately, many modern sciences use a pluralistic approach, integrating diverse methods to advance knowledge [@rowbottomKuhnVsPopper2011]. Both frequentist and Bayesian methods share a commitment to rigorous testing, with progress dependent on reliable evidence and institutions that correct error.
Besides the theoretically driven discourse, there are quite tangible reasons to talk about the scientific method and the publication process. Analyzing 77 research teams assessing the same dataset for a single hypothesis, @breznauObservingManyResearchers2022 found extremely diverse results, ranging from strong positive to strong negative outcomes. They termed this phenomenon "researcher degrees of freedom", explaining that most of the variance in results was not explained by assigned conditions, research decisions, or researcher characteristics. Instead, idiosyncratic researcher variability accounted for more than 90% of the variance.
Besides the theoretically driven discourse, there are quite tangible reasons to talk about the scientific method, replication and the publication process. Analyzing 77 research teams assessing the same dataset for a single hypothesis, @breznauObservingManyResearchers2022 found extremely diverse results, ranging from strong positive to strong negative outcomes. They termed this phenomenon "researcher degrees of freedom", explaining that most of the variance in results was not explained by assigned conditions, research decisions, or researcher characteristics. Instead, idiosyncratic researcher variability accounted for more than 90% of the variance.
This raises the question: if modern research practices are so prone to bias and error, what steps can be taken to mitigate these issues? A closer look at an ongoing debate resulting from cases around replication failures helps shed light on the whole complex, its implications and today's research culture.
## From Replication Crisis to Credibility Revolution? {#sec-replication-crisis}
> "Rule 4 advises replication - the identical analysis (same measures, models, and estimation methods) of parallel data sets (different samples of the same population) - to see if you obtain similar results." [@4ff8afa9-5c92-3c50-b832-a1756ccbeedc, p. 90]
In his widely reviewed standard reading "Seven rules for social research", @4ff8afa9-5c92-3c50-b832-a1756ccbeedc emphasizes the importance of the reproduction of research findings. But already in the title of the chapter or the rule itself, Firebaugh cuts back on his appeal: "replicate *where possible*". Emphasizing the increasing availability of data, he acknowledges the challenges researchers face in achieving true replication and advertises optimism. As the book is from 2008 and the acceptance of the book is at least perceived to be high, one could expect that replication today as well as research practices enabling replication are broadly adopted. But is this the case?
The publication of Firebaugh's text coincided with the onset of the replication crisis, a period where widespread replication failures, especially but not exclusively in psychology, revealed systemic issues in research culture. This crisis wasn't limited to a few fraudulent cases but exposed a broader problem where seemingly robust, highly cited studies could not be reproduced. Examples ranged from unintended errors to outright data fabrication [@barghAutomaticitySocialBehavior1996; @callawayReportFindsMassive2011; @crockerRoadFraudStarts2011a]. While the crisis began in psychology, it soon spread to other fields like political science and economics [@breznauDoesSociologyNeed2021]. For instance, a classic social priming study by @barghAutomaticitySocialBehavior1996, finding that participants primed with an "elderly" stereotype walked more slowly, failed to replicate. A follow-up study suggested that the original results were likely influenced by experimenter expectations rather than the hypothesized mechanism of unconscious priming [@doyenBehavioralPrimingIts2012]. While some extreme cases are well-documented, the crisis is largely seen as rooted in systemic pressure and normal human behavior or misconduct rather than in serious intent [@diekmannII2Probleme2022; @crockerRoadFraudStarts2011a; @4ff8afa9-5c92-3c50-b832-a1756ccbeedc].
The term crisis not only implies alarmingly high proportions, but also creates pressure to act. The notion of a crisis is supported by findings well beyond psychology, including finance [@jensenThereReplicationCrisis2023], economics [@briggsPartialSolutionReplication2023], sociology [@auspurgAusmassUndRisikofaktoren2014] and medicine [@begleyRaiseStandardsPreclinical2012], with some authors even claiming that most published research findings in the social sciences are false [@ioannidisWhyMostPublished2005]. But what drives this crisis?
@@ -94,17 +100,17 @@ A truth-incentivizing survey of over 2000 psychologists revealed a high prevalen
Common QRPs include HARKing or presenting an unexpected exploratory finding as a preplanned hypothesis, p-hacking or manipulating data or analysis to achieve a desired p-value and selective reporting, that is not reporting studies or variables that lack significant results. Other QRPs involve undisclosed data exclusion, stopping data collection when a desired result is found, or not reporting all conditions or measures used. These practices inflate false-positive rates and undermine research credibility [@auspurgAusmassUndRisikofaktoren2014; @breznauDoesSociologyNeed2021; @chinQuestionableResearchPractices2023].
Other problematic practices involves the misuse of p-values, where researchers simply misinterpret the significance level as the likelihood of truth in their findings, leading to vast overconfidence in their results-that can also be a consequence of or lead to a failure to control for bias and poor quality control [@breznauDoesSociologyNeed2021; @munafoManifestoReproducibleScience2017]. Other biases include demographic, geographic or political biases and peer review limitations [@breznauDoesSociologyNeed2021; @grossmannOpenScienceReform2021]. Additionally, gendered penalties favor men publishing disproportionately more than women @akbaritabarGenderPatternsPublication2021. Misaligned institutional incentives, also accelerated by an intense competition for academic jobs, tenure and funding, lead to a so-called "publish or perish" culture [@smaldinoOpenScienceModified2019; @breznauDoesSociologyNeed2021].
Other problematic practices involve the misuse of p-values, where researchers simply misinterpret the significance level as the likelihood of truth in their findings, leading to vast overconfidence in their results. This can also be a consequence of, or lead to, a failure to control for bias and poor quality control [@breznauDoesSociologyNeed2021; @munafoManifestoReproducibleScience2017]. Demographic, geographic or political biases and peer review limitations are further sources of error [@breznauDoesSociologyNeed2021; @grossmannOpenScienceReform2021]. Additionally, gendered penalties favor men, who publish disproportionately more than women [@akbaritabarGenderPatternsPublication2021]. Misaligned institutional incentives, also accelerated by an intense competition for academic jobs, tenure and funding, lead to a so-called "publish or perish" culture [@smaldinoOpenScienceModified2019; @breznauDoesSociologyNeed2021].
Seen through Kuhn and Popper, today's credibility gap is a design failure: our institutions make refutation harder than confirmation. Open science is the design response-resetting defaults to transparency, pre-specification, and reproducibility-and @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and preregistration, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the so-called open science (OS) movement, The movement devotes its effort to challenge publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021].
All of the above leads to the conclusion that our institutions make refutation harder than confirmation. Open science is the design response, resetting defaults to transparency, pre-specification, and reproducibility. @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and preregistration, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the so-called open science (OS) movement, devoting its effort to challenge publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021].
## Open Science Practices
In an extensive literature review, @vicente-saezOpenScienceNow2018a use textual analysis to analyze 75 studies and collect definitions to synthesize a definition of os distinct from conventional science. The authors identified four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434]
Following an extensive literature review, @vicente-saezOpenScienceNow2018a characterize OS using four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434].
@banksAnswers18Questions2019 establish a broader definition of OS that refers to many concepts, including scientific philosophies embodying communality and universalism, specific practices operationalizing these norms including OS policies, like sharing of data and analytic files, redefinition of confidence thresholds, preregistration of studies and analytical plans, engagement in replication studies, removal of pay-walls, incentive systems to encourage the above practices and even specific citation standards. A common ground is that *open* science and OSPs try to prevent research misconduct by simply increasing research transparency [@banksAnswers18Questions2019].
Building on these definitions, in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024], there are numerous practices that have been proposed to enact os. The most discussed will be evaluated in the next sections.
Building on these definitions, in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024], there are numerous practices that have been proposed to enact OS. The most discussed will be evaluated in the next sections.
### Open Data and Open Materials
@@ -166,15 +172,14 @@ The applied nature of the research in this field means fragile findings can driv
# Data and Method
The aim of this methodological work is to compile a sample of publications in the fields of criminology and legal psychology, and to classify them as either statistical inference (SI) publications or non-SI publications. Publications that employ SI will be further examined to assess whether they use any of the OSPs under consideration: preregistration, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption, even though not present in the research questions. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the relatively high reliability of such information. The fine-tuned models are validated against a hand-coded sample that was extended using a large-language-model (LLM, ChatGPT 4o & ChatGPT 5o), report precision/recall and calibration, and then estimate annual prevalence with uncertainty intervals.
The aim of this methodological work is to compile a sample of publications in the fields of criminology and legal psychology, to classify them as either statistical inference (SI) publications or non-SI publications and to further examine the former to assess whether they use any of the OSPs under consideration: preregistration, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the relatively high reliability of such information. The fine-tuned models are validated against a hand-coded sample that was extended using a large language model (LLM, ChatGPT 4o & ChatGPT 5o). Precision/recall and calibration are reported, and annual prevalence is then estimated with uncertainty intervals.
Full-text data for training the machine learning classification models will be collected with a web application developed specifically for this project. Since software development is not the focus of this work, details of the app's architecture will not be discussed here. A brief description of the application, along with screenshots, is provided in @sec-data-fulltext-collection.
As a master's thesis, this study is necessarily scoped by time and resources. It shall therefore be treated as a pilot that establishes data, measures and a reproducible, yet improvable pipeline to be extended in to a fully exhaustive study. Where necessary, potential improvements that could not be implemented are recommended.
As a master's thesis, this study is necessarily scoped by time and resources. It shall therefore be treated as a pilot that establishes data, measures and a reproducible, yet improvable pipeline to be extended into a fully exhaustive study. Where necessary, potential improvements that could not be implemented are recommended.
All data and code necessary to enable full replication can be retrieved from the OSF repositories. A full description of the software and methods used is further laid out within the replication files and the accompanying methodological report.
## Population
The scope of this work encompasses all publications from the top 100 journals classified under "Criminology & Penology" or the journals that are categorized as "Law" (which might also include sociologically or psychologically driven quantitative studies) and "Psychology, Multidisciplinary", ranked by the 2023 JIF according to Clarivate's Journal Citation Reports [@clarivateJournalImpactFactor2023] that rely on SI. Publication metadata were retrieved via the Crossref API. While Crossref provides extensive coverage, it is not exhaustive, and prior work has shown that missing records are often systematic rather than random [@delgado-quirosWhyAreThese2024; @hausteinWhenArticleActually2015]. Using multiple bibliographic sources (e.g., Scopus, Web of Science) would reduce this bias [@gerasimovComparisonDatasetsCitation2024; @delgado-quirosWhyAreThese2024], but this was not feasible within the scope of this thesis. Consequently, the study population is restricted to articles indexed in Crossref from the selected top 100 journals.
@@ -187,7 +192,7 @@ In summary, the study population consists of all statistical-inference publicati
## Sampling {#sec-sampling}
The sampling procedure involved drawing a large enough sample for the training using sequential sampling, in this specific context called active learning [@chickSequentialSamplingEconomics2012]. Faced with expected challenges in full-text acquisition, a demanding training pipeline, and low anticipated OSP prevalence, the sequential sampling approach was abandoned and an alternative approach was established.
The sampling procedure involved drawing a large enough sample for the training using sequential sampling, in this specific context called active learning [@chickSequentialSamplingEconomics2012]. Faced with expected challenges in full-text acquisition, a rather demanding training pipeline, and an unexpectedly low OSP prevalence, the sequential sampling approach was abandoned and an alternative approach was established.
The sample size was determined by a precision-based calculation to ensure a $\pm$ 1.5 percentage point confidence interval for the SI prevalence as a precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated using the results of the literature review described in @sec-osp-in-crim.
@@ -212,7 +217,7 @@ result <- prec_prop(
n_total <- result$n
result %>% as.tibble() %>%
table <- result %>% as_tibble() %>%
select(-padj) %>%
mutate(n = ceiling(n)) %>%
rename(
@@ -224,7 +229,18 @@ result %>% as.tibble() %>%
`Expected Prevalence` = paste0(p, " (", lwr ,", " , upr , ")")
) %>%
select(-lwr,-upr,-p) %>%
t() %>% kable()
t()
if(output_format == "pdf") {
table %>% kable()
} else {
print("Table: Estimated Minimum Sample Size")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
The SI classifier, trained on Sample A, was then used to analyze and classify all publications in Sample B. From the identified SI papers in Sample B, a balanced dataset was randomly sampled to create a training set for the OSP classifiers. Finally, these trained OSP classifiers were applied to the entire analytical Sample B. While a publisher or journal-based stratification for the full sample would have been ideal, it was not feasible due to the limited number of available full texts.
@@ -257,7 +273,15 @@ summary_tbl <- tibble(
`Required Sample Size` = required_ns
)
summary_tbl %>% kable(digits = 0)
if(output_format == "pdf") {
summary_tbl %>% kable(digits = 0)
} else {
print("Table: Estimated Minimum Sample Sizes - Open Science Practices")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
The minimum calculated total sample size equals 4265 (rounded) publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When applying the assumed prevalence values for each OSP, the required sample sizes to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%.
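As a rough cross-check of these figures, the required sample size for a given half-width can be approximated with the simpler Wald formula $n = z^2 p(1-p)/d^2$. The sketch below uses that approximation in base R; it is not the Agresti-Coull computation performed with `prec_prop()` above, so the results deviate slightly, and the function name `wald_n` is hypothetical.

```r
# Approximate sample size for estimating a proportion p with a
# confidence interval of half-width `half_width` (Wald approximation).
# NOTE: illustrative only; the chunks above use the Agresti-Coull method.
wald_n <- function(p, half_width = 0.015, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  ceiling(z^2 * p * (1 - p) / half_width^2)
}

# Assumed prevalences from the text: worst case (0.5), OA, OD, OM/preregistration
assumed_p <- c(worst_case = 0.50, OA = 0.25, OD = 0.15, OM_prereg = 0.05)
n_required <- sapply(assumed_p, wald_n)
print(n_required)
```

The approximation lands close to the reported values for the larger prevalences, while the 5% case comes out somewhat lower than the Agresti-Coull figure, which is where the two methods differ most.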
@@ -282,7 +306,7 @@ result <- prec_prop(
method = "agresti-coull"
)
result %>% as.tibble() %>%
table_sampl_est <- result %>% as_tibble() %>%
select(-padj) %>%
rename(
`Sample Size` = n,
@@ -294,7 +318,17 @@ result %>% as.tibble() %>%
`Expected Prevalence` = paste0(p, " (", round(lwr,2) ,", " , round(upr,2) , ")")
) %>%
select(-lwr,-upr,-p) %>%
t() %>% kable(digits=2)
t()
if(output_format == "pdf") {
table_sampl_est %>% kable(digits = 2)
} else {
print("Table: Expected 95% CI for Open Access")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
An overestimation of the prevalence of each OSP in the population can lead to potential problems with all following steps. The true prevalences and confidence intervals along with performance diagnostics of trained models were assessed after all classification tasks were processed. An estimation of the prevalences per year was not suitable as no detailed information about those proportions was available. Instead, the established approach to stratify the sample proportionally to the population was used [@larsenProportionalAllocationStrata2008].
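Proportional allocation can be sketched in a few lines of base R: each stratum receives a share of the sample equal to its share of the population. The yearly counts, the total sample size, and the helper name `proportional_allocation` below are made up for illustration and do not come from the thesis data.

```r
# Proportional allocation: each stratum (here: publication year) receives a
# share of the sample equal to its share of the population.
# All counts below are invented for illustration.
proportional_allocation <- function(pop_counts, n_sample) {
  shares <- pop_counts / sum(pop_counts)
  alloc <- floor(shares * n_sample)
  # Hand the units lost to rounding down to the strata with the largest remainders
  remainder <- shares * n_sample - alloc
  leftover <- n_sample - sum(alloc)
  if (leftover > 0) {
    top <- order(remainder, decreasing = TRUE)[seq_len(leftover)]
    alloc[top] <- alloc[top] + 1
  }
  alloc
}

pop_by_year <- c(`2019` = 1200, `2020` = 1500, `2021` = 1800, `2022` = 2500)
alloc_by_year <- proportional_allocation(pop_by_year, n_sample = 700)
print(alloc_by_year)
```

The largest-remainder step is one of several possible rounding rules; it is used here only so that the stratum sizes always add up to the requested total.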
@@ -340,12 +374,21 @@ tbl_cases <- tbl %>%
) %>%
arrange(step_id) %>%
select(step_id, step_label, n_before, n_after, n_dropped)
tbl_cases %>%
if(output_format == "pdf") {
tbl_cases %>%
kable(
format = "latex", # force LaTeX output (not markdown)
booktabs = TRUE,
longtable = FALSE, # avoid longtable entirely
col.names = c("Step #", "Step", "Before", "After", "Dropped"))
} else {
print("Table: Cases Dropped from all Publications Obtained")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
The data obtained necessitated multiple transformations. All transformations are reported in the respective section in the methodological report.
@@ -462,6 +505,10 @@ p4 <- sample_B_by_year %>%
)
print((p1|p2) / (p3|p4))
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
### Full Text Retrieval
@@ -495,12 +542,21 @@ tbl_cases2 <- tbl2 %>%
) %>%
arrange(step_id) %>%
select(step_id, step_label, n_before, n_after, n_dropped)
if(output_format == "pdf") {
tbl_cases2 %>%
kable(
format = "latex", # force LaTeX output (not markdown)
booktabs = TRUE,
longtable = FALSE, # avoid longtable entirely
col.names = c("Step #", "Step", "Before", "After", "Dropped"))
} else {
print("Table: Cases Dropped from Analytical Sample")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
## Classification Methods
@@ -552,6 +608,10 @@ for (i in seq_along(plots)) {
combined_plot <- wrap_plots(plotlist, ncol = 2) + # remove legend
plot_layout(guides = "collect") & theme(legend.position = "none")
print(combined_plot)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
For hyperparameter tuning and training of the ML models, the coded datasets were split into a training sample of 80% and a validation sample of 20%, stratified by the target variable as this improves training in scenarios with high class imbalance [@hilbertModelle2025]. K-Fold cross-validation was used during hyperparameter tuning to further improve model performance and reduce overfitting.
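A stratified split followed by fold assignment could look roughly like the base R sketch below. The labels are simulated, the number of folds (`k = 5`) and the class-wise fold assignment are assumptions, and `stratified_split` is a hypothetical helper rather than the function used in the actual pipeline.

```r
# Stratified 80/20 split: sample within each class so that both parts keep
# (roughly) the original class balance. Labels are simulated for illustration.
set.seed(42)
labels <- factor(c(rep("non_SI", 80), rep("SI", 20)))

stratified_split <- function(y, train_frac = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE)
  list(train = sort(train_idx), valid = setdiff(seq_along(y), train_idx))
}

parts <- stratified_split(labels)

# Assign each training case to one of k folds, again class by class
k <- 5
y_train <- labels[parts$train]
folds <- integer(length(y_train))
for (cls in levels(y_train)) {
  pos <- which(y_train == cls)
  folds[pos] <- sample(rep_len(seq_len(k), length(pos)))
}

print(table(labels[parts$train]))
print(table(labels[parts$valid]))
print(table(folds, y_train))
```

Sampling within each class keeps the rare class represented in both the validation set and every fold, which is the property that matters under strong class imbalance.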
@@ -587,6 +647,10 @@ jobsplot <- readRDS("figures/jobs_osp.rds") +
)
print(jobsplot)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
The two top-left graphs in @fig-evaluation-stat show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. The top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as detailed in @sec-evaluation-metrics.
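How a probability threshold turns model scores into class labels, and how accuracy and Cohen's Kappa follow from the resulting confusion matrix, can be illustrated with a small base R sketch. The probabilities and reference labels are invented; only the 0.25 cut-off is taken from the text.

```r
# Turn predicted probabilities into class labels with a 0.25 cut-off and
# compare them with reference labels. Probabilities and labels are invented.
threshold <- 0.25
prob   <- c(0.02, 0.10, 0.30, 0.97, 0.26, 0.01, 0.88, 0.20, 0.99, 0.05)
actual <- c(0,    0,    1,    1,    0,    0,    1,    1,    1,    0)

predicted <- as.integer(prob > threshold)  # strictly greater than the cut-off
conf_mat <- table(
  predicted = factor(predicted, levels = c(0, 1)),
  actual    = factor(actual, levels = c(0, 1))
)

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Cohen's kappa: observed agreement corrected for chance agreement
p_expected <- sum(rowSums(conf_mat) * colSums(conf_mat)) / sum(conf_mat)^2
cohen_kappa <- (accuracy - p_expected) / (1 - p_expected)

print(conf_mat)
print(c(accuracy = accuracy, kappa = cohen_kappa))
```

A low cut-off such as 0.25 trades some false positives for fewer missed positives, which is visible here in the case scored 0.26.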
@@ -628,7 +692,7 @@ df <- qs_read(file_sample_analysis)
population <- qs_read(file_meta_final)
df %>% mutate(
tbl_sample_desc <- df %>% mutate(
journal_category = case_when(
journal_category == "PSYCHOLOGY, MULTIDISCIPLINARY" ~ "A",
journal_category == "LAW" ~ "B",
@@ -679,6 +743,18 @@ df %>% mutate(
table.font.size = gt::px(12),
latex.use_longtable = TRUE
)
if (output_format == "pdf/tex") {
tbl_sample_desc
} else {
#tbl_sample_desc %>% as_kable()
print("Table: Sample Characteristics by Statistical Inference Status")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
OM (@fig-plt-eval-om) tells a different story: despite a nominal accuracy of $94.3\%$, balanced accuracy drops to $60.0\%$ and $\kappa$ to $31.7\%$. Sensitivity is $20.0\%$ while specificity is $100.0\%$, yielding $F_1 = 33.3\%$. High nominal accuracy combined with a large miss rate indicates accuracy inflation under class imbalance, and the p-value of $0.434$ indicates that accuracy does not significantly exceed the no-information rate.
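To see how these figures relate to each other, the sketch below recomputes them from a single confusion matrix. The counts are hypothetical: they were chosen only because they are consistent with the percentages reported above, and they are not taken from the thesis data.

```r
# Hypothetical confusion-matrix counts (an assumption for illustration)
tp <- 1; fn <- 4; fp <- 0; tn <- 65
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

balanced_accuracy <- (sensitivity + specificity) / 2
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# Cohen's kappa: observed agreement relative to agreement expected by chance
p_expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
kappa <- (accuracy - p_expected) / (1 - p_expected)

round(100 * c(accuracy = accuracy, balanced_accuracy = balanced_accuracy,
              kappa = kappa, f1 = f1), 1)
```

With only five positives among 70 cases, four missed positives barely move the accuracy, whereas balanced accuracy, $\kappa$, and $F_1$ all fall sharply.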
@@ -833,6 +909,10 @@ p <- ggplot(yearly_long, aes(x = published_year, y = prop, color = variable)) +
tbl_osp_prev_overall_dsadj <- yearly_long
print(p)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
In parallel, @tbl-sample-char suggests systematic differences between SI and non-SI papers: distributions of text sources differ (likely reflecting publisher effects or text-quality variation), abstracts-only are more common among non-SI items, word counts are higher for SI papers, journal impact is higher, and OA appears more common. Several contrasts are statistically significant (many $p < .001$), but these should be treated as descriptive given unmodeled multilevel variance and field composition.
@@ -880,7 +960,8 @@ overall_osp_si <- overall_osp_si %>%
) %>%
arrange(desc(`Prevalence`))
overall_osp_si %>%
if (output_format == "pdf/tex") {
overall_osp_si %>%
kbl(
format = 'latex',
longtable = TRUE,
@@ -888,8 +969,24 @@ overall_osp_si %>%
escape = T,
) %>% # add footnote
column_spec(1, width = '3cm')%>%
kable_styling(position = "center", latex_options = "hold_position", full_width = FALSE) %>%
footnote(general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)", general_title = "Note:", footnote_as_chunk = T, threeparttable = T, )
kable_styling(
position = "center",
latex_options = "hold_position",
full_width = FALSE) %>%
footnote(
general = "Prevalence estimates in statistical inference publications using design-weights per year (95% CI)",
general_title = "Note:",
footnote_as_chunk = T,
threeparttable = T
)
} else {
print("Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{r}
@@ -974,7 +1071,7 @@ osp_table_pretty <- osp_table %>%
`Adj. (.95 CI)`,
Se, Sp, n_Se, n_Sp)
# Pretty formatters that *emit LaTeX-safe percents*
# Pretty formatters
pct <- function(x) sprintf("%.1f\\%%", 100 * x) # -> "2.2\%"
ci <- function(l, u) sprintf("(%.1f-%.1f)", 100*l, 100*u) # -> "(1.5-2.8)"
@@ -996,7 +1093,8 @@ colnames(osp_table_pretty) <- c(
"Se$^{1}$", "Sp$^{2}$", "Pos$^{3}$", "Neg$^{4}$"
)
osp_table_pretty %>%
if (output_format == "pdf/tex") {
osp_table_pretty %>%
kbl(format = "latex", booktabs = TRUE, escape = FALSE,
align = c("l","l","l","r","r","r","r"), longtable = TRUE) %>%
kable_styling(latex_options = "hold_position") %>%
@@ -1009,11 +1107,18 @@ osp_table_pretty %>%
),
escape = FALSE
)
} else {
print("Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
Earlier differences in text sources suggest heterogeneity by journal, which in turn implicates publisher-level variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging the larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA and Emerald at 0% in this sample; ASCE shows an apparent decline that is consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
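As a rough sketch of the binomial alternative mentioned above, the following fits a logistic trend to yearly OA counts for one publisher. The data frame `oa_by_year` and all counts in it are invented for illustration and do not come from the thesis sample.

```r
# Toy yearly counts for one hypothetical publisher (illustrative values only)
oa_by_year <- data.frame(
  year = 2013:2023,
  n    = c(40, 42, 45, 50, 48, 55, 60, 62, 65, 70, 72),  # papers per year
  n_oa = c(4, 5, 7, 9, 10, 14, 18, 22, 26, 33, 38)       # of which open access
)
oa_by_year$t <- oa_by_year$year - 2013  # years since start, for a stable fit

# Binomial GLM on (successes, failures): fitted shares stay within [0, 1]
# and each year is weighted by its number of papers.
fit <- glm(cbind(n_oa, n - n_oa) ~ t, family = binomial, data = oa_by_year)

exp(coef(fit)["t"])                        # odds ratio of OA per additional year
fitted_share <- predict(fit, type = "response")
```

Unlike an OLS line through the yearly proportions, this model cannot predict shares below 0 or above 1, and years with few papers contribute less to the estimated trend.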
This study was deliberately scoped as a master's-thesis pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.
Measurement and modeling challenges were substantial. SI/OSP labels were trained on a small, single-coder hand set plus GPT assistance. Severe class imbalance for OSPs, few validation positives, and upsampling inflated nominal accuracy while depressing stability. Misclassification adjustments (Rogan-Gladen) became unstable at very low prevalences, and some OA trend analyses used simple OLS rather than binomial/GLM approaches.
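The instability of the Rogan-Gladen adjustment at low prevalence follows directly from its formula, $\hat{\pi} = (p + Sp - 1) / (Se + Sp - 1)$, where $p$ is the apparent prevalence. The sketch below is illustrative only: the function name, the truncation to $[0, 1]$, and the input values are assumptions rather than the thesis implementation.

```r
# Hypothetical helper: Rogan-Gladen corrected prevalence, truncated to [0, 1]
rogan_gladen <- function(p_obs, se, sp) {
  adj <- (p_obs + sp - 1) / (se + sp - 1)
  pmin(pmax(adj, 0), 1)
}

# Moderate apparent prevalence: the correction behaves well
rogan_gladen(p_obs = 0.30, se = 0.80, sp = 0.95)

# Low apparent prevalence: small changes in specificity change the estimate
# substantially, and it reaches the lower bound once p_obs < 1 - sp
low_prev <- sapply(c(0.99, 0.98, 0.96), function(sp) {
  rogan_gladen(p_obs = 0.03, se = 0.80, sp = sp)
})
low_prev
```

When the apparent prevalence is close to the false-positive rate $1 - Sp$, the numerator is near zero, so sampling error in $Sp$ (estimated from few validation cases) dominates the adjusted estimate.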
@@ -1160,6 +1265,10 @@ grid_publishers <- ggplot(
)
print(grid_publishers)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{=latex}
@@ -1240,7 +1349,7 @@ Full reproducibility can't be guaranteed due to the dependency on data that is a
```{r}
#| tbl-caption: Preregistration
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj %>%
tbl_osp_prev_overall_dsadj_a <- tbl_osp_prev_overall_dsadj %>%
select(published_year, prop, prop_low, prop_upp, variable) %>%
filter(variable == "Preregistration") %>%
select(-variable) %>%
@@ -1257,12 +1366,23 @@ tbl_osp_prev_overall_dsadj %>%
) %>%
kable(digits = 2, row.names = FALSE, booktabs = TRUE, caption = "Preregistration") %>%
kable_styling(position = "center", full_width = FALSE)
if (output_format == "pdf/tex") {
tbl_osp_prev_overall_dsadj_a
} else {
print("Table: Preregistration")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{r}
#| tbl-caption: Open Data
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj %>%
tbl_osp_prev_overall_dsadj_b <- tbl_osp_prev_overall_dsadj %>%
select(published_year, prop, prop_low, prop_upp, variable) %>%
filter(variable == "Open Data") %>%
select(-variable) %>%
@@ -1279,6 +1399,17 @@ tbl_osp_prev_overall_dsadj %>%
) %>%
kable(digits = 2, row.names = FALSE, booktabs = TRUE, caption = "Open Data") %>%
kable_styling(position = "center", full_width = FALSE)
if (output_format == "pdf/tex") {
tbl_osp_prev_overall_dsadj_b
} else {
print("Table: Open Data")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{=latex}
@@ -1288,7 +1419,7 @@ tbl_osp_prev_overall_dsadj %>%
```{r}
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj %>%
tbl_osp_prev_overall_dsadj_c <- tbl_osp_prev_overall_dsadj %>%
select(published_year, prop, prop_low, prop_upp, variable) %>%
filter(variable == "Open Materials") %>%
select(-variable) %>%
@@ -1305,11 +1436,22 @@ tbl_osp_prev_overall_dsadj %>%
) %>%
kable(digits = 2, row.names = FALSE, booktabs = TRUE, caption = "Open Materials") %>%
kable_styling(position = "center", full_width = FALSE)
if (output_format == "pdf/tex") {
tbl_osp_prev_overall_dsadj_c
} else {
print("Table: Open Materials")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{r}
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj %>%
tbl_osp_prev_overall_dsadj_d <- tbl_osp_prev_overall_dsadj %>%
select(published_year, prop, prop_low, prop_upp, variable) %>%
filter(variable == "Open Access") %>%
select(-variable) %>%
@@ -1326,6 +1468,17 @@ tbl_osp_prev_overall_dsadj %>%
) %>%
kable(digits = 2, row.names = FALSE, booktabs = TRUE, caption = "Open Access") %>%
kable_styling(position = "center", full_width = FALSE)
if (output_format == "pdf/tex") {
tbl_osp_prev_overall_dsadj_d
} else {
print("Table: Open Access")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{=latex}
@@ -1340,18 +1493,11 @@ tbl_osp_prev_overall_dsadj %>%
![Evaluation Metrics: Preregistration](figures/combined_plot_is_prereg.pdf){#fig-plt-eval-pr fig-pos=H}
```{=latex}
\clearpage
\setcounter{section}{0}
\renewcommand\thesection{}
```{r}
#| results: asis
if (isTRUE(debug_mode)) {
cat("# Debug Info\n\n")
print(debug_info)
}
```
# Declaration of Authorship
I hereby declare that I have written this thesis independently and without the use of any aids other than those indicated. All passages taken verbatim or in substance from published or unpublished works are marked as such.
\vspace{2cm}
\noindent
\begin{tabular}{@{}p{\dimexpr 0.4\linewidth-2\tabcolsep}p{\dimexpr 0.2\linewidth-2\tabcolsep}p{\dimexpr 0.4\linewidth-2\tabcolsep}@{}}
\hrulefill & & \hrulefill \\
\centering Michael Beck & & \centering Date \\
\end{tabular}