finalizes manuscript for preprint

2025-12-17 18:41:28 +01:00
parent 0908df2a79
commit b7a0ddbc94
5 changed files with 168 additions and 109 deletions
@@ -4,6 +4,11 @@ top-level-division: section
 prefer-html: true
 execute:
  freeze: auto
+format:
+  pdf:
+    toc: true
+    lot: true
+    lof: true
 ---

 ```{r}
@@ -12,6 +17,8 @@ execute:
 source("deps.R")
 ```

+\newpage
+
 # Introduction

 This document serves as a supplement to the main article, providing additional details on the sampling approach, sample size determination, model training procedures, and evaluation metrics used in the study of Open Science Practices (OSP) adoption in scientific publications. The full methodological report, containing all code necessary for full replication, can be accessed in the OSF repository.
@@ -26,9 +33,7 @@ The process involved in the following steps:
 4. Due to good performance, ChatGPT was used to classify the rest of Sample A, and the combined manual/LLM labels formed the training and test data for subsequent ML models.
 5. ML classifiers were trained on the resulting classified subsample.

-Classification of the training Sample B followed the same approach. For classification document feature matrices were generated using term frequencies of keywords. These keywords were both adopted from  @scogginsMeasuringTransparencySocial2024 as well as self created, and extended using ChatGPT. Keywords were context specific according to the classified variable. All classification tasks were binary classifications. After assembling the keywords, the SI classifier was fine-tuned. Using this classifier, the analytical sample was categorized. SI documents were then classified for applying OSPs.
-
-The approach might seem overly complicated but was intitially designed to be used on a much larger corpus of publications. As time progressed during the project multiple reasons recommend a simpler approach that will be discussed later. 
+Classification of the training Sample B followed a similar approach. For classification, document-feature matrices were generated using term frequencies of keywords. These keywords were partly adopted from @scogginsMeasuringTransparencySocial2024, partly self-created, and then extended using ChatGPT. Keywords were context-specific to the variable being classified. All classification tasks were binary. After assembling the keywords, the SI classifier was fine-tuned. Using this classifier, the analytical sample was categorized. SI documents were then classified for the application of OSPs.

 # Sample Size

@@ -118,7 +123,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

-The minimum calculated total sample size equals 4265 (rounded) publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When applying the assumed prevalence values for each OSP, the required sample sizes to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%. 
+The minimum total sample size is 4,265 publications to achieve a 95% confidence interval with a half-width of $\pm$ 1.5 pp using the @agrestiApproximateBetterExact1998 method. When the assumed prevalence values for each OSP are applied, the sample sizes required for the same precision vary substantially. As shown in @tbl-cap-estimated-sample-sizes-osp, approximately 3,200 publications are needed to estimate OA at 25%, about 2,180 publications for OD at 15%, and only about 840 publications for OM or Preregistration at 5%.

 These values are all below the worst-case requirement of 4,264, reflecting the lower variance at prevalences farther from 50%. At the assumed prevalences, 2,182 SI papers would be required to estimate OD at 15% with $\pm$ 1.5 pp precision. This equals the OD requirement but is below the OA requirement; OA, however, can be measured for the whole population, not just SI publications. Thus, while the sample is sufficiently large for OD, OM, and Preregistration, it falls slightly short of the target precision for OA, which could be measured on a larger scale.
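
The requirements above can be approximated with a short search over $n$. The following is a minimal Python sketch, not the project's own R code: it assumes the Agresti-Coull adjustment ($\tilde{n} = n + z^2$, $\tilde{p} = (np + z^2/2)/\tilde{n}$) and returns the smallest $n$ whose interval half-width does not exceed the target. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_half_width(n, p, conf=0.95):
    """Half-width of the Agresti-Coull interval for n cases at prevalence p."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # about 1.96 for 95%
    n_adj = n + z**2                          # adjusted sample size
    p_adj = (n * p + z**2 / 2) / n_adj        # estimate shrunk toward 0.5
    return z * sqrt(p_adj * (1 - p_adj) / n_adj)


def required_sample_size(p, half_width=0.015, conf=0.95, n_max=1_000_000):
    """Smallest n whose half-width is at most `half_width` (linear search)."""
    for n in range(1, n_max + 1):
        if agresti_coull_half_width(n, p, conf) <= half_width:
            return n
    raise ValueError("target precision not reached within n_max")


if __name__ == "__main__":
    # Assumed prevalences taken from the text: worst case, OA, OD, OM/Prereg.
    for prevalence in (0.50, 0.25, 0.15, 0.05):
        print(prevalence, required_sample_size(prevalence))
```

Under these assumptions, `required_sample_size(0.5)` returns 4265, and the results for prevalences of 0.25, 0.15, and 0.05 land close to the roughly 3,200, 2,180, and 840 publications reported above. Small differences may stem from rounding or from details of the original computation.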

@@ -169,11 +174,49 @@ An overestimation the prevalence of each OSP in the population can lead to poten

 # Full Text Retrieval

-As mentioned in the manuscript, full texts were retreived using a self developed web application that used both web scraping and publisher API's. Legal aspects were carefully considered throughout the development. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Scraping was therefore limited to the university network and only to publishers that permit it while other publishers were scraped outside of the network. Technical details are available in the documents provided while the scraper might be made publicly available in the future. 
+Legal aspects were carefully considered throughout the development. Within the EU, scraping is legal for scientific purposes [@urhg-60d-tdm], but institutional contracts can override this. Scraping from within the university network was therefore limited to publishers that permit it, while other publishers were scraped from outside the network. The scraper might be made publicly available in the future.
+
+The full texts were downloaded using **SciPaperLoader**, a Flask-based web application developed by the author for automated scientific paper processing and metadata extraction. The app lets the user import paper metadata (via CSV or manually), then schedules and runs automated fetching and parsing jobs that extract content and metadata and track each paper's processing status with logs and a dashboard. It supports many academic publishers, mostly via YAML-configured parsers (plus optional custom Python parsers) built on Beautiful Soup, with a streamlined focus on paywall detection and transparent pattern-matching output.
+
+![Fulltext Downloader Screenshot: Control Panel](img/2025-08-04-174744_hyprshot.png)
+
+The scraper produced .txt files using the following standardized format while also downloading the full HTML and PDF files, if available. This enabled later re-extraction if necessary.
+
+```
+DOI: 10.14763/2017.1.454
+DOI URL: https://doi.org/10.14763/2017.1.454
+Extracted URL: https://policyreview.info/articles/analysis/passage-australias-data-retention-regime-national-security-human-rights-and-media
+Extracted: 2025-08-05T16:57:14.685108
+Parser: policyreview
+Publisher: policyreview
+Extraction Method: HTML
+Paywall Status: open_access
+================================================================================
+
+TITLE: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+
+AUTHORS: Nicolas P. Suzor, Kylie Pappalardo, Natalie McIntosh
+
+ABSTRACT:
+Abstract
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+
+FULL TEXT:
+Title: The passage of Australia’s data retention regime: national security, human rights, and media scrutiny
+
+Abstract: Abstract
+            In 2015, the Australian government passed the Telecommunications (Interception and Access) Amendment (Data Retention) Act, which requires ISPs to collect metadata about their users and store this metadata for two years. From its conception, Australia’s data retention scheme has been controversial. In this article we examine how public interest concerns were addressed in Australian news media during the Act’s passage. The Act was ultimately passed with bipartisan support, despite serious deficiencies. We show how the Act’s complexity seemed to limit engaged critique in the mainstream media and how fears over terrorist attacks were exploited to secure the Act’s passage through parliament.
+
+This paper is part of Australian internet policy, a special issue of Internet Policy Review guest-edited by Angela Daly and Julian Thomas.
+
+[FULLTEXT REDACTED IN THIS DOCUMENT]
+```
+
+During preprocessing, everything above the "FULL TEXT:" heading was discarded.
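
The header-stripping step can be sketched as follows. This is an illustrative Python sketch, not the project's preprocessing code: it assumes the file layout shown in the example above (`Key: value` header lines, a line of `=` characters, then a `FULL TEXT:` heading on its own line), and the function name `split_extracted_file` is hypothetical.

```python
def split_extracted_file(text):
    """Split one scraper .txt file into (header metadata, full-text body)."""
    # Everything before the "FULL TEXT:" heading is header material.
    head, sep, body = text.partition("\nFULL TEXT:\n")
    if not sep:
        raise ValueError("no 'FULL TEXT:' heading found")

    metadata = {}
    for line in head.splitlines():
        if line.startswith("="):  # the separator line ends the metadata block
            break
        key, colon, value = line.partition(": ")
        if colon:
            metadata[key] = value
    return metadata, body.strip()


if __name__ == "__main__":
    # Made-up miniature file in the assumed layout.
    sample = (
        "DOI: 10.1/x\n"
        "Paywall Status: open_access\n"
        + "=" * 80
        + "\n\nTITLE: Example\n\nFULL TEXT:\nTitle: Example\n\nBody text.\n"
    )
    meta, body = split_extracted_file(sample)
    print(meta["Paywall Status"])
    print(body)
```

Keeping the metadata in a dictionary rather than dropping it outright makes fields such as the paywall status available for later filtering, while only `body` would feed the classifiers.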

 # Model Training

-Overall agreement after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which later proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable, but was not financially feasible within the project.
+Overall agreement between manual and ChatGPT-based coding after resolving initial disagreements was good, as shown in @fig-cfm-osp. Nevertheless, some categories exhibited extreme class imbalance, especially preregistration, which proved problematic in subsequent stages of the process. Given the small number of positive cases, the ChatGPT-based classification was used to extend the training sample further. A full classification of the analytical sample using an LLM was considered desirable but was not financially feasible within the project.

 ```{r}
 #| fig-cap: Confusion Matrices - Manual vs ChatGPT Labels for Open Science Practices and Statistical Inference (design-weighted)
@@ -253,7 +296,7 @@ if (isTRUE(debug_mode)) {

 ## Model Evaluation

-The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pre show the performance of different feature set and model combinations measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term frequencies dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model is quite precise, with a 91.7% accuracy and a Cohen's Kappa of 0.832. This performance is good compared to hand-coded cases. Model calibration was not highly successful as the model's probabilities were already well-calibrated, mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification, where any case with a predicted probability greater than 0.25 is classified as 1. It's also important to note that the OSP classifiers performed much worse, as seen below.
+The two top-left graphs in @fig-evaluation-stat, @fig-evaluation-od, @fig-evaluation-om and @fig-evaluation-pr show the performance of different combinations of feature sets and models, measured by ROC-AUC [@fawcettIntroductionROCAnalysis2006]. In @fig-evaluation-stat, the top graph identifies the XGBoost classifier combined with a simple term-frequency dataset as the top-performing model. The top-right graph shows the most important terms for the XGBoost classifier, which are primarily statistical. The confusion matrix shows that the model performs well, with an accuracy of 91.7% and a Cohen's Kappa of 0.832. This is good performance relative to the hand-coded cases. Model calibration brought little improvement because the model's probabilities were already well calibrated and concentrated mostly at the extremes of 0 and 1. A probability threshold of 0.25 was chosen based on three different metrics. This threshold is used for the final classification: any case with a predicted probability greater than 0.25 is classified as 1. The OSP classifiers performed much worse, as seen below.

 ![Evaluation Metrics: Statistical Inference Classification](figures/combined_plot_is_statistical.pdf){#fig-evaluation-stat}
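
To make the thresholding rule and the reported agreement metrics concrete, here is a minimal Python sketch. It is not the project's evaluation code: the helper names are hypothetical, the toy labels and probabilities are made up, and only the 0.25 cut-off and the standard definitions of accuracy and Cohen's Kappa are taken as given.

```python
def classify(probabilities, threshold=0.25):
    """Label a case 1 when its predicted probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def accuracy_and_kappa(truth, predicted):
    """Accuracy and Cohen's Kappa for two binary label vectors."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    # Agreement expected by chance, from the marginal rates of each rater.
    rate_truth = sum(truth) / n
    rate_pred = sum(predicted) / n
    expected = rate_truth * rate_pred + (1 - rate_truth) * (1 - rate_pred)
    # Undefined (division by zero) if both vectors are constant and identical.
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    truth = [1, 1, 0, 0]          # made-up reference labels
    probs = [0.9, 0.3, 0.2, 0.6]  # made-up predicted probabilities
    predicted = classify(probs)
    print(predicted, accuracy_and_kappa(truth, predicted))
```

A threshold below 0.5 trades some false positives for fewer missed positives, which is one plausible reason a cut-off such as 0.25 can score well when probabilities cluster near 0 and 1.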

@@ -280,7 +323,7 @@ Category-specific results highlight class-imbalance constraints. Preregistration

 [^1]: The accuracy-versus-no-information-rate p-value tests the null hypothesis that the accuracy equals the no-information rate, i.e., the accuracy obtained by always predicting the most frequent class [@kuhnBuildingPredictiveModels2008].

-The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant-an expected consequence of the very low base rate rather than a systematic error.
+The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity. For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier. Preregistration (@fig-evaluation-pr) appears strongest (balanced accuracy $= 99.2\%$, $F_1 = 88.9\%$, $\kappa = 88.1\%$), but the counts are sparse (four true positives, one false negative, no false positives), and the p-value versus the no-information rate ($p = 0.0853$) is not conventionally significant, an expected consequence of the very low base rate rather than a systematic error.
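
As a quick consistency check on these sparse counts, the reported $F_1$ follows directly from the confusion-matrix cells stated above ($TP = 4$, $FN = 1$, $FP = 0$) under the standard definition:

$$
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2 \cdot 4}{2 \cdot 4 + 0 + 1} = \frac{8}{9} \approx 88.9\%
$$

With only five positive cases, a single additional misclassification would lower $F_1$ noticeably, which illustrates how sensitive these estimates are to the low base rate.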


 # Tables: OSP Adoption Over Time Among Statistical Inference Papers