finalizes supplements

This commit is contained in:
2025-12-18 16:31:18 +01:00
parent b7a0ddbc94
commit c4a46fc21b
3 changed files with 48 additions and 54 deletions
+45 -32
@@ -1,5 +1,6 @@
---
title: "Mining Transparency - Supplementary Materials"
title: "Supplementary Materials for"
subtitle: "Mining Transparency: Assessing Open Science Practices in Crime Research Over Time Using Machine Learning"
top-level-division: section
prefer-html: true
execute:
@@ -9,6 +10,28 @@ format:
toc: true
lot: true
lof: true
fig-pos: H
mainfont: Aptos
code-overflow: wrap
include-in-header:
- text: | # wrap lines in code output. Source: https://github.com/orgs/quarto-dev/discussions/4121
\usepackage{fvextra}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{
commandchars=\\\{\},
breaklines, breaknonspaceingroup, breakanywhere
}
- text: '\pagenumbering{arabic}'
include-before: |
\newpage
include-before-body:
text: |
\RecustomVerbatimEnvironment{verbatim}{Verbatim}{
showspaces = false,
showtabs = false,
breaksymbolleft={},
breaklines
}
date: ""
---
```{r}
@@ -39,6 +62,8 @@ Classification of the training Sample B followed a similar approach. For classif
The sample size was determined by a precision-based calculation targeting a confidence interval with a half-width of $\pm$ 1.5 percentage points for the SI prevalence, as this approach was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated from the results of the literature review, which is described in detail in the methodological report.
The minimum calculated total sample size equals 4265 publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When applying the assumed prevalence values for each OSP, the required sample sizes to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%.
```{r}
#| echo: false
#| results: asis
@@ -123,7 +148,6 @@ if (isTRUE(debug_mode)) {
}
```
The minimum calculated total sample size equals 4265 publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When applying the assumed prevalence values for each OSP, the required sample sizes to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%.
These values are all below the worst-case requirement of 4,264, reflecting the lower variance at prevalences farther from 50%. At the assumed prevalences, 2,182 SI papers would be required to estimate OD at 15% with $\pm$ 1.5 percentage-point precision. This equals the OD requirement but is below the OA requirement, which on the other hand can be measured for the whole population, not just SI publications. Thus, while the sample is sufficiently large for OD, OM, and Preregistration, it falls slightly short of the target precision for OA, which could be measured on a larger scale.
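The precision-based calculation can be sketched in a few lines of R. This is a minimal illustration, not the code used for the reported table: the function names `ac_half_width` and `min_n_for_precision` are hypothetical, and the sketch assumes that the expected number of positives is simply `n * p`. Results for individual prevalences may therefore differ by a few units from the values reported above.

```r
# Half-width of the Agresti-Coull interval for sample size n and assumed prevalence p.
ac_half_width <- function(n, p, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  n_tilde <- n + z^2                       # adjusted sample size
  p_tilde <- (n * p + z^2 / 2) / n_tilde   # adjusted proportion
  z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
}

# Smallest n whose half-width does not exceed the target (NA if none up to n_max).
min_n_for_precision <- function(p, half_width = 0.015, conf = 0.95, n_max = 1e5) {
  n <- seq_len(n_max)
  n[which(ac_half_width(n, p, conf) <= half_width)[1]]
}

min_n_for_precision(0.50)  # worst case: 4265
min_n_for_precision(0.15)  # assumed OD prevalence
```

The worst case is reached at a prevalence of 50%, where the binomial variance is largest; lower assumed prevalences require fewer publications for the same half-width.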
@@ -176,16 +200,17 @@ An overestimation the prevalence of each OSP in the population can lead to poten
Legal aspects were carefully considered throughout the development. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Scraping was therefore limited to the university network and only to publishers that permit it while other publishers were scraped outside of the network. The scraper might be made publicly available in the future.
The fulltexts were downloaded using **SciPaperLoader** a self-developed Flask-based web application designed for automated scientific paper processing and metadata extraction, built by the author. The app that lets the user import scientific paper metadata (CSV or manual), then schedules and runs automated fetching/parsing jobs to extract content/metadata and track each paper's processing status with logs and a dashboard. It supports many academic publishers via mostly YAML-configured parsers (plus optional custom Python parsers) that leverage Beautiful Soup, with a streamlined focus on paywall detection and transparent pattern-matching output.
The fulltexts were downloaded using SciPaperLoader, a Flask-based web application for automated scientific paper processing and metadata extraction, built by the author. The app lets the user import scientific paper metadata (CSV or manual), then schedules and runs automated fetching and parsing jobs to extract content and metadata, and tracks each paper's processing status with logs and a dashboard. It supports many academic publishers via mostly YAML-configured parsers (plus optional custom Python parsers) that leverage Beautiful Soup, with a streamlined focus on paywall detection and transparent pattern-matching output.
![Fulltext Downloader Screenshot: Control Panel](img/2025-08-04-174744_hyprshot.png)
![Fulltext Downloader Screenshot: Control Panel](img/2025-08-04-174744_hyprshot.png){width=40%}
The scraper produced .txt files using the following standardized format while also downloading the full HTML and PDF files, if available. This enabled later re-extraction if necessary.
```
DOI: 10.14763/2017.1.454
DOI URL: https://doi.org/10.14763/2017.1.454
Extracted URL: https://policyreview.info/articles/analysis/passage-australias-data-retention-regime-national-security-human-rights-and-media
Extracted URL: https://policyreview.info/articles/analysis/passage-australias-data-retention-
regime-national-security-human-rights-and-media
Extracted: 2025-08-05T16:57:14.685108
Parser: policyreview
Publisher: policyreview
@@ -212,14 +237,12 @@ This paper is part of Australian internet policy, a special issue of Internet Po
[FULLTEXT REDACTED IN THIS DOCUMENT]
```
During the preprocessing steps, parts above the "FULL TEXT:" heading were discarded.
These fulltexts along with PDF files were then preprocessed. A more thorough analysis of how the fulltext data was handled can be found in the full methodological report.
# Model Training
Overall agreement between manual and ChatGPT-based coding after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which later proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable, but was not financially feasible within the project.
# Classification Model Training
```{r}
#| fig-cap: Confusion Matrices - Manual vs ChatGPT Labels for Open Science Practices and Statistical Inference (design-weighted)
#| fig-cap: Confusion Matrices - Manual vs ChatGPT Labels for Open Science Practices and Statistical Inference
#| label: fig-cfm-osp
#| fig-height: 12
#| fig-width: 11
@@ -257,6 +280,8 @@ if (isTRUE(debug_mode)) {
}
```
Overall agreement between manual and ChatGPT-based coding after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which later proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable, but was not financially feasible within the project.
For hyperparameter tuning and training of the ML models, the coded datasets were split into a training sample of 80% and a validation sample of 20%, stratified by the target variable, as this improves training in scenarios with high class imbalance [@hilbertModelle2025]. K-fold cross-validation was used during hyperparameter tuning to further improve model performance and reduce overfitting.
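A stratified 80/20 split of this kind can be sketched in base R as follows. The function `stratified_split`, the seed, and the toy label vector are illustrative assumptions and not the project's actual code; cross-validation during tuning is not shown.

```r
# Split indices into training and validation sets, sampling within each class
# so that both sets keep roughly the same class proportions.
stratified_split <- function(y, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(length(idx) * train_frac)
    idx[sample.int(length(idx), n_train)]  # sample.int is safe for classes of size 1
  }), use.names = FALSE)
  list(train = sort(train_idx),
       validation = setdiff(seq_along(y), train_idx))
}

# Toy example with a 10% positive class.
y <- c(rep(1, 10), rep(0, 90))
parts <- stratified_split(y)
table(y[parts$train])
table(y[parts$validation])
```

Sampling within each class guarantees that rare positive cases appear in both the training and the validation sample, which a simple random split does not.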
The feature sets differed in their construction: "TF" feature sets contained simple term frequencies of the keywords in each category, whereas "n-gram" feature sets contained term frequencies of multi-word phrases. Using n-grams has been shown to improve results compared to simple term frequencies in other contexts [e.g. @jandotInteractiveSemanticFeaturing2016; @ahmedDetectionOnlineFake2017], which is why I chose to include n-gram (two- or three-word phrases) feature sets as well as combined term-frequency and n-gram feature sets in the evaluations. Multiple machine learning models were trained on those feature sets, resulting in multiple model and feature-set combinations for each OSP assessed. An example of those combinations and their evaluation can be seen in @fig-jobs-osp.
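The difference between the two feature types can be illustrated with a small base-R sketch. The helper functions, the example sentence, and the keyword lists are hypothetical and chosen only for illustration; the actual dictionaries and preprocessing are described in the methodological report.

```r
# Lower-case a text and split it into word tokens.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[nzchar(tokens)]
}

# Build n-grams: n consecutive tokens joined by a single space.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count how often each dictionary entry occurs among the terms.
count_terms <- function(terms, dictionary) {
  vapply(dictionary, function(d) sum(terms == d), integer(1))
}

text <- "The data are openly available. Open data and replication code are on OSF."
tokens <- tokenize(text)

tf_features    <- count_terms(tokens, c("data", "osf"))
ngram_features <- count_terms(make_ngrams(tokens, 2),
                              c("open data", "replication code"))
```

Here the single keyword "data" is counted twice, while the bigram "open data" is counted only once, which shows how multi-word phrases can be more specific than single terms.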
@@ -294,25 +319,15 @@ if (isTRUE(debug_mode)) {
}
```
## Model Evaluation
The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pr show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as seen below.
![Evaluation Metrics: Statistical Inference Classification](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}
![Evaluation Metrics: Open Data](figures/combined_plot_is_open_data.pdf){#fig-evaluation-od fig-pos=H}
![Evaluation Metrics: Open Materials](figures/combined_plot_is_open_materials.pdf){#fig-evaluation-om fig-pos=H}
![Evaluation Metrics: Preregistration](figures/combined_plot_is_prereg.pdf){#fig-evaluation-pr fig-pos=H}
As expected, $\kappa$ is typically lower than Accuracy due to chance-agreement correction [@naiduReviewEvaluationMetrics2023].
$$
\text{Accuracy} = \frac{TP + TN}{N} \quad \text{and} \quad
\kappa = \frac{p_o - p_e}{1 - p_e}
$$
As expected, $\kappa$ is typically lower than Accuracy due to chance-agreement correction [@naiduReviewEvaluationMetrics2023].
OM (@fig-evaluation-om) tells a different story: despite nominal Accuracy of $94.3\%$, balanced accuracy drops to $60.0\%$ and $\kappa$ to $31.7\%$. Sensitivity is $20.0\%$ while specificity is $100.0\%$, yielding $F_1 = 33.3\%$. High nominal accuracy with a large miss rate indicates accuracy inflation under imbalance, and the p-value of $0.434$ confirms that accuracy does not exceed the no-information rate meaningfully.
OD (@fig-evaluation-od) sits between these extremes: accuracy $= 88.6\%$, balanced accuracy $= 93.7\%$, sensitivity $= 100.0\%$, specificity $= 87.3\%$. The classifier captures all positives but at the cost of eight false positives against seven true positives and 55 true negatives, which depresses precision and yields $F_1 = 63.6\%$. $\kappa = 57.9\%$ indicates moderate agreement beyond chance, and $p = 0.736$ again signals that nominal accuracy is uninformative under imbalance.
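The OD metrics can be reproduced from the four confusion-matrix counts given above (seven true positives, eight false positives, 55 true negatives, and, given a sensitivity of 100%, zero false negatives). The function `binary_metrics` is a hypothetical helper written for this illustration, not the code that produced the figures.

```r
# Compute common evaluation metrics from the cells of a 2x2 confusion matrix.
binary_metrics <- function(tp, fp, tn, fn) {
  n <- tp + fp + tn + fn
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  precision   <- tp / (tp + fp)
  accuracy    <- (tp + tn) / n                       # observed agreement p_o
  # Expected chance agreement p_e from the marginal totals.
  p_e <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
  c(accuracy          = accuracy,
    balanced_accuracy = (sensitivity + specificity) / 2,
    sensitivity       = sensitivity,
    specificity       = specificity,
    f1                = 2 * precision * sensitivity / (precision + sensitivity),
    kappa             = (accuracy - p_e) / (1 - p_e))
}

od <- binary_metrics(tp = 7, fp = 8, tn = 55, fn = 0)
round(100 * od, 1)
```

With these counts the chance agreement is already about 72.9%, which is why $\kappa$ ends up far below the nominal accuracy.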
@@ -326,12 +341,19 @@ Category-specific results highlight class-imbalance constraints. Preregistration
The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant, an expected consequence of the very low base rate rather than a systematic error.
![Classifier Evaluation: Statistical Inference](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}
![Classifier Evaluation: Open Data](figures/combined_plot_is_open_data.pdf){#fig-evaluation-od fig-pos=H}
![Classifier Evaluation: Open Materials](figures/combined_plot_is_open_materials.pdf){#fig-evaluation-om fig-pos=H}
![Classifier Evaluation: Preregistration](figures/combined_plot_is_prereg.pdf){#fig-evaluation-pr fig-pos=H}
# Tables: OSP Adoption Over Time Among Statistical Inference Papers
The following tables contain the values underlying Figure 2 (OSP Adoption Over Time among statistical inference papers, design-weighted), which are not adjusted for misclassification. Rogan-Gladen adjusted per-year values were not calculated, given the already enormous errors in the overall estimate.
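For reference, the Rogan-Gladen correction rescales an observed prevalence using the classifier's sensitivity and specificity. The sketch below is illustrative only: the function name, the clamping to the interval from 0 to 1, and the example values are assumptions and are not taken from the analysis.

```r
# Rogan-Gladen estimator: correct an observed (apparent) prevalence for
# misclassification, given the classifier's sensitivity and specificity.
rogan_gladen <- function(p_obs, sensitivity, specificity) {
  adjusted <- (p_obs + specificity - 1) / (sensitivity + specificity - 1)
  pmin(pmax(adjusted, 0), 1)  # assumption: truncate to the valid range [0, 1]
}

# Made-up example values.
rogan_gladen(p_obs = 0.24, sensitivity = 0.8, specificity = 0.9)
```

Because the denominator shrinks as sensitivity plus specificity approaches 1, small uncertainties in either quantity translate into large uncertainties in the adjusted prevalence, especially for the sparse per-year samples.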
## OSP Adoption Over Time Among Statistical Inference Papers {#sec-osp-adoption-tables}
```{r}
#| label: osp-adoption-tables
#| tbl-caption: Open Data
@@ -538,11 +560,6 @@ if (isTRUE(debug_mode)) {
}
```
```{=latex}
\clearpage
\newpage
```
```{r}
# rename published_year prop prop_low prop_upp variable
tbl_osp_prev_overall_dsadj_c <- tbl_osp_prev_overall_dsadj %>%
@@ -605,8 +622,4 @@ if (isTRUE(debug_mode)) {
debug_info[[knitr::opts_current$get("label")]] <-
if (knitr::is_html_output()) "HTML" else "LaTeX"
}
```
```{=latex}
\clearpage
```