couple of tweaks, adds tables and reorders supplements.

2025-12-16 23:20:20 +01:00
parent ee79e207ea
commit 86b2380902
2 changed files with 92 additions and 64 deletions
+90 -60
@@ -32,7 +32,7 @@ The approach might seem overly complicated but was intitially designed to be use
# Sample Size
The sample size was determined by a precision-based calculation to ensure a $\pm$ 1.5 percentage point confidence interval for the SI prevalence as a precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated using the results of the literature review described in @sec-osp-in-crim.
The sample size was determined by a precision-based calculation targeting a confidence interval of $\pm$ 1.5 percentage points for the SI prevalence, as a precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated from the results of the literature review, which is described in detail in the methodological report.
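A precision-based calculation of this kind can be sketched as follows. This is a minimal illustration, assuming a simple Wald interval for a proportion, $n = z^2 p (1 - p) / d^2$; the function name `precision_sample_size` and the example prevalences are hypothetical and not taken from the study.

```r
# Minimal sketch (assumption: Wald interval for a single proportion).
# p:          assumed prevalence (hypothetical values below)
# margin:     half-width of the confidence interval (1.5 percentage points)
# conf_level: confidence level of the interval
precision_sample_size <- function(p, margin = 0.015, conf_level = 0.95) {
  stopifnot(p > 0, p < 1, margin > 0, conf_level > 0, conf_level < 1)
  z <- qnorm(1 - (1 - conf_level) / 2)
  # Round up so that the target precision is at least met
  ceiling(z^2 * p * (1 - p) / margin^2)
}

# Worst case (p = 0.5) and a lower assumed prevalence (p = 0.1)
precision_sample_size(c(0.5, 0.1))
```

The required sample size is largest at $p = 0.5$ and shrinks as the assumed prevalence moves towards 0 or 1, which is why the assumed prevalences matter for the calculation.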
```{r}
#| echo: false
@@ -167,52 +167,13 @@ if (isTRUE(debug_mode)) {
An overestimation of the prevalence of each OSP in the population can cause problems in all subsequent steps. The true prevalences and confidence intervals, along with performance diagnostics of the trained models, were assessed after all classification tasks were processed. An estimation of the prevalences per year was not suitable as no detailed information about those proportions was available. Instead, the established approach of stratifying the sample proportionally to the population was used [@larsenProportionalAllocationStrata2008].
# Full Text Retreival
# Full Text Retrieval
As mentioned in the manuscript, full texts were retrieved using a self-developed web application that used both web scraping and publisher APIs. Legal aspects were carefully considered throughout the development. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Scraping was therefore limited to the university network and only to publishers that permit it, while other publishers were scraped outside of the network. Technical details are available in the documents provided, and the scraper might be made publicly available in the future.
# Model Training
For hyperparameter tuning and training of the ML models, the coded datasets were split into a training sample of 80% and a validation sample of 20%, stratified by the target variable, as this improves training in scenarios with high class imbalance [@hilbertModelle2025]. K-fold cross-validation was used during hyperparameter tuning to further improve model performance and reduce overfitting.
![Evaluation Metrics: Statistical Inference Classification](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}
The feature sets differed in their construction: "TF" feature sets contained simple term frequencies of the keywords in each category, whereas "n-gram" feature sets contained term frequencies of multi-word phrases. Using n-grams has been shown to enhance results in comparison to simple term frequencies in other contexts [e.g. @jandotInteractiveSemanticFeaturing2016; @ahmedDetectionOnlineFake2017], which is why I chose to include multi-gram (2- or 3-word phrases) feature sets as well as combined term-frequency and n-gram feature sets in the evaluations. Multiple machine learning models were trained on those feature sets, resulting in multiple model and feature set combinations for each OSP assessed. An example of those combinations and the evaluation can be seen in @fig-jobs-osp.
```{r}
#| fig-height: 5
#| fig-width: 10
#| label: fig-jobs-osp
#| fig-cap: Model, Feature and Variable Combinations
#| fig-pos: h
axis_mapping <- c(
"is_prereg" = "Preregistration",
"is_open_data" = "Open Data",
"is_open_materials" = "Open Materials",
"is_open_access" = "Open Access"
)
jobsplot <- readRDS("figures/jobs_osp.rds") +
labs(
title = "",
subtitle = "",
x = "",
y = ""
) +
scale_fill_manual(
values = osp_cols2,
labels = axis_mapping
)
print(jobsplot)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
The two top-left graphs in @fig-evaluation-stat show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. The top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as detailed in @sec-evaluation-metrics.
Overall agreement after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which later proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable, but was not financially feasible within the project.
```{r}
#| fig-cap: Confusion Matrices - Manual vs ChatGPT Labels for Open Science Practices and Statistical Inference (design-weighted)
@@ -253,6 +214,55 @@ if (isTRUE(debug_mode)) {
}
```
For hyperparameter tuning and training of the ML models, the coded datasets were split into a training sample of 80% and a validation sample of 20%, stratified by the target variable, as this improves training in scenarios with high class imbalance [@hilbertModelle2025]. K-fold cross-validation was used during hyperparameter tuning to further improve model performance and reduce overfitting.
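The stratified 80/20 split can be illustrated with a short base R sketch. This is not the project's actual code: the helper `stratified_split`, the seed, and the toy outcome vector are assumptions made for illustration, and the original analysis may have relied on a modelling package instead.

```r
# Minimal sketch of a stratified split (hypothetical helper, base R only).
# y:          vector of class labels (the target variable)
# train_frac: share of each class assigned to the training sample
stratified_split <- function(y, train_frac = 0.8, seed = 1) {
  set.seed(seed)
  # Row indices grouped by class, so each class is sampled separately
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(train_frac * length(idx))
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  list(
    train = sort(train_idx),
    validation = setdiff(seq_along(y), train_idx)
  )
}

# Toy example with strong class imbalance: 10 positives, 90 negatives
y <- c(rep(1, 10), rep(0, 90))
parts <- stratified_split(y)
table(y[parts$train])
table(y[parts$validation])
```

Sampling within each class keeps the share of positives roughly equal in both samples, which a simple random split does not guarantee when positives are rare.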
The feature sets differed in their construction: "TF" feature sets contained simple term frequencies of the keywords in each category, whereas "n-gram" feature sets contained term frequencies of multi-word phrases. Using n-grams has been shown to enhance results in comparison to simple term frequencies in other contexts [e.g. @jandotInteractiveSemanticFeaturing2016; @ahmedDetectionOnlineFake2017], which is why I chose to include multi-gram (2- or 3-word phrases) feature sets as well as combined term-frequency and n-gram feature sets in the evaluations. Multiple machine learning models were trained on those feature sets, resulting in multiple model and feature set combinations for each OSP assessed. An example of those combinations and the evaluation can be seen in @fig-jobs-osp.
```{r}
#| fig-height: 5
#| fig-width: 10
#| label: fig-jobs-osp
#| fig-cap: Model, Feature and Variable Combinations
#| fig-pos: h
axis_mapping <- c(
"is_prereg" = "Preregistration",
"is_open_data" = "Open Data",
"is_open_materials" = "Open Materials",
"is_open_access" = "Open Access"
)
jobsplot <- readRDS("figures/jobs_osp.rds") +
labs(
title = "",
subtitle = "",
x = "",
y = ""
) +
scale_fill_manual(
values = osp_cols2,
labels = axis_mapping
)
print(jobsplot)
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
## Model Evaluation
The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pr show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as seen below.
![Evaluation Metrics: Statistical Inference Classification](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}
![Evaluation Metrics: Open Data](figures/combined_plot_is_open_data.pdf){#fig-evaluation-od fig-pos=H}
![Evaluation Metrics: Open Materials](figures/combined_plot_is_open_materials.pdf){#fig-evaluation-om fig-pos=H}
![Evaluation Metrics: Preregistration](figures/combined_plot_is_prereg.pdf){#fig-evaluation-pr fig-pos=H}
$$
\text{Accuracy} = \frac{TP + TN}{N} \quad \text{and} \quad
\kappa = \frac{p_o - p_e}{1 - p_e}
@@ -260,9 +270,9 @@ $$
As expected, $\kappa$ is typically lower than Accuracy due to chance-agreement correction [@naiduReviewEvaluationMetrics2023].
OM (@fig-plt-eval-om) tells a different story: despite nominal Accuracy of $94.3\%$, balanced accuracy drops to $60.0\%$ and $\kappa$ to $31.7\%$. Sensitivity is $20.0\%$ while specificity is $100.0\%$, yielding $F_1 = 33.3\%$. High nominal accuracy with a large miss rate indicates accuracy inflation under imbalance, and the p-value of $0.434$ confirms that accuracy does not exceed the no-information rate meaningfully.
OM (@fig-evaluation-om) tells a different story: despite nominal Accuracy of $94.3\%$, balanced accuracy drops to $60.0\%$ and $\kappa$ to $31.7\%$. Sensitivity is $20.0\%$ while specificity is $100.0\%$, yielding $F_1 = 33.3\%$. High nominal accuracy with a large miss rate indicates accuracy inflation under imbalance, and the p-value of $0.434$ confirms that accuracy does not exceed the no-information rate meaningfully.
OD (@fig-plt-eval-od) sits between these extremes: accuracy $= 88.6\%$, balanced accuracy $= 93.7\%$, sensitivity $= 100.0\%$, specificity $= 87.3\%$. The classifier captures all positives but at the cost of eight false positives against seven true positives and 55 true negatives, which depresses precision and yields $F_1 = 63.6\%$. $\kappa = 57.9\%$ indicates moderate agreement beyond chance, and $p = 0.736$ again signals that nominal accuracy is uninformative under imbalance.
OD (@fig-evaluation-od) sits between these extremes: accuracy $= 88.6\%$, balanced accuracy $= 93.7\%$, sensitivity $= 100.0\%$, specificity $= 87.3\%$. The classifier captures all positives but at the cost of eight false positives against seven true positives and 55 true negatives, which depresses precision and yields $F_1 = 63.6\%$. $\kappa = 57.9\%$ indicates moderate agreement beyond chance, and $p = 0.736$ again signals that nominal accuracy is uninformative under imbalance.
In short, Preregistration appears comparatively reliable, OM is recall-limited, and OD is precision-limited. These profiles motivate reporting metrics suited to extreme class imbalance (Precision $P = \frac{TP}{TP+FP}$, Recall $R = \frac{TP}{TP+FN}$, and balanced accuracy $BA = \frac{R + \text{Sp}}{2}$ with specificity $\text{Sp} = \frac{TN}{TN+FP}$) and anticipating how errors propagate into downstream estimates [@murphyMachineLearningProbabilistic2012; @fawcettIntroductionROCAnalysis2006].
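The metrics discussed above can be recomputed from the four cells of a confusion matrix. The following is a minimal sketch, not the code used for the figures; the function name `confusion_metrics` is hypothetical. Balanced accuracy is computed as the mean of sensitivity and specificity, and the example uses the OD counts reported above (seven true positives, eight false positives, 55 true negatives, and zero false negatives, as implied by a sensitivity of 100%).

```r
# Minimal sketch: evaluation metrics from confusion-matrix counts.
confusion_metrics <- function(tp, fp, fn, tn) {
  n <- tp + fp + fn + tn
  accuracy <- (tp + tn) / n
  sensitivity <- tp / (tp + fn)   # recall
  specificity <- tn / (tn + fp)
  precision <- tp / (tp + fp)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
  # Expected chance agreement for Cohen's kappa (p_e), from the margins
  p_e <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
  kappa <- (accuracy - p_e) / (1 - p_e)
  c(
    accuracy = accuracy,
    balanced_accuracy = (sensitivity + specificity) / 2,
    sensitivity = sensitivity,
    specificity = specificity,
    precision = precision,
    f1 = f1,
    kappa = kappa
  )
}

# OD validation counts as reported in the text
od <- confusion_metrics(tp = 7, fp = 8, fn = 0, tn = 55)
round(100 * od, 1)
```

With these counts, precision is below 50% even though sensitivity is 100%, which is the pattern described as precision-limited.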
@@ -270,11 +280,13 @@ Category-specific results highlight class-imbalance constraints. Preregistration
[^1]: The accuracy-no-information-rate p-value tests the null hypothesis that the accuracy is equal to the no-information rate or the accuracy when always predicting the most frequent class [@kuhnBuildingPredictiveModels2008].
The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-plt-eval-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant-an expected consequence of the very low base rate rather than a systematic error.
The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant, an expected consequence of the very low base rate rather than a systematic error.
# Tables: OSP Adoption Over Time Among Statistical Inference Papers
The following tables contain the values for Figure 2 (OSP Adoption Over Time among statistical inference papers, design-weighted), which are not adjusted for misclassification. Rogan-Gladen adjusted per-year values were not calculated, given the already enormous errors in the overall estimates.
## OSP Adoption Over Time Among Statistical Inference Papers {#sec-osp-adoption-tables}
```{r}
@@ -451,6 +463,38 @@ if (isTRUE(debug_mode)) {
}
```
```{r}
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj_a <- tbl_osp_prev_overall_dsadj %>%
select(published_year, prop, prop_low, prop_upp, variable) %>%
filter(variable == "Preregistration") %>%
select(-variable) %>%
mutate(
prop = percent(prop, accuracy = 0.1),
prop_low = percent(prop_low, accuracy = 0.1),
prop_upp = percent(prop_upp, accuracy = 0.1),
) %>%
rename(
`Year` = published_year,
`Proportion` = prop,
`.95 CI (Lower)` = prop_low,
`.95 CI (Upper)` = prop_upp
) %>%
kable(digits = 2, row.names = FALSE, booktabs = TRUE, caption = "Preregistration") %>%
kable_styling(position = "center", full_width = FALSE)
if(output_format == "pdf/tex") {
tbl_osp_prev_overall_dsadj_a
} else {
print("Table: Preregistration")
}
if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{=latex}
\clearpage
\newpage
@@ -522,18 +566,4 @@ if (isTRUE(debug_mode)) {
```{=latex}
\clearpage
```
# Evaluation Metrics {#sec-evaluation-metrics}
![Evaluation Metrics: Open Data](figures/combined_plot_is_open_data.pdf){#fig-plt-eval-od fig-pos=H}
![Evaluation Metrics: Open Data](figures/combined_plot_is_open_data.pdf){#fig-plt-eval-od fig-pos=H}
![Evaluation Metrics: Open Materials](figures/combined_plot_is_open_materials.pdf){#fig-plt-eval-om fig-pos=H}
![Evaluation Metrics: Preregistration](figures/combined_plot_is_prereg.pdf){#fig-plt-eval-pr fig-pos=H}>
> I received a response from the editorial board. In short, you can go ahead and submit it to the Special Issue. Let me know if you'd like me to review the title page and the anonymised manuscript-those are the only two unique documents you'll need for the submission. As mentioned in my previous email, you can upload all supplementary materials and replication files to OSF and include the anonymised link in both the manuscript and the title page. If you'd like to meet, I'm happy to briefly walk you through the process.
>
> Best,
> Alex
```
+2 -4
@@ -509,7 +509,7 @@ tbl_sample_desc <- df %>% mutate(
pvalue_fun = label_style_pvalue(digits = 3)
) %>%
add_overall() %>%
modify_header(label ~ "**Variable**") |>
modify_header(label ~ "**Variable**") %>%
modify_spanning_header(c("stat_1", "stat_2") ~ "**Statistical Inference**")%>%
modify_footnote_body(
footnote = "A: Psychology, Multidisciplinary; B: Law; C: Criminology & Penology",
@@ -782,7 +782,7 @@ if (isTRUE(debug_mode)) {
}
```
@tbl-osp-prev adjusts adjustments were applied using sensitivity and specificity from the ML-validation analysis in [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should thereby rather be based on design-based estimates and on OA (measured from metadata).
In @tbl-osp-prev, adjustments were applied using sensitivity and specificity from the ML-validation analysis in [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should therefore be based on design-based estimates and on OA (measured from metadata).
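Why the OD point estimate falls below zero can be seen from a small sketch. It assumes that the adjustment takes the standard Rogan-Gladen form, $\hat{p}_{adj} = \frac{\hat{p}_{obs} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1}$, which is named in the supplement for the per-year tables; the exact procedure behind @tbl-osp-prev is not shown here, and the function name `rogan_gladen` is hypothetical.

```r
# Minimal sketch of a Rogan-Gladen style misclassification adjustment
# (assumption: this is the form of the adjustment applied in the table).
rogan_gladen <- function(p_obs, se, sp) {
  denom <- se + sp - 1
  # The estimator is only defined for classifiers better than chance
  stopifnot(denom > 0)
  (p_obs + sp - 1) / denom
}

# OD values from the text: observed prevalence 2.2%, Se = 1, Sp = 55/63
od_adjusted <- rogan_gladen(p_obs = 0.022, se = 1, sp = 55 / 63)
od_adjusted
```

Whenever the observed prevalence is smaller than the false-positive rate $1 - \text{Sp}$, the numerator is negative, so the unconstrained estimate drops below zero; in practice such values are often truncated at zero.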
Earlier differences in text sources suggest heterogeneity by journal, which also implies variance across publishers [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging the larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.
@@ -939,8 +939,6 @@ if(output_format == "pdf/tex") {
colformat_double(
big.mark = ",", digits = 2, na_str = "N/A"
)
} else {
}