From 4533b1cdc0e7b27bd9f8c44309b86a4768c83215 Mon Sep 17 00:00:00 2001
From: Michael Beck <ich@mbeck.cologne>
Date: Thu, 11 Dec 2025 16:43:50 +0100
Subject: [PATCH] moves sampling process to supplements

---
 Supplements.qmd | 18 ++++++++++++++++++
 index.qmd       | 20 +++++---------------
 2 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/Supplements.qmd b/Supplements.qmd
index 40b9720..48d7ae2 100644
--- a/Supplements.qmd
+++ b/Supplements.qmd
@@ -12,6 +12,24 @@ execute:
 source("deps.R")
 ```
 
+# Introduction
+
+This document serves as 
+
+# Sampling Approach
+
+The process involved the following steps:
+
+1. A small subset of papers from Sample A was hand-coded by the author according to the operationalization.
+2. ChatGPT classified both the hand-coded and the not-yet-coded publications in Sample A.
+3. A random subsample of 50 papers was coded both manually and with ChatGPT. Disagreements were carefully reviewed and the manual coding was reassessed. Agreement after correction was very high ($\kappa$ = .832), with ChatGPT outperforming the author's initial coding consistency.
+4. Due to good performance, ChatGPT was used to classify the rest of Sample A, and the combined manual/LLM labels formed the training and test data for subsequent ML models.
+5. ML classifiers were trained on the resulting labelled subsample.
+
+Classification of the training Sample B followed the same approach. For classification, document-feature matrices were generated using term frequencies of keywords. These keywords were partly adopted from @scogginsMeasuringTransparencySocial2024 and partly created by the author, and were then extended using ChatGPT. Keywords were specific to the variable being classified. All classification tasks were binary. After the keywords were assembled, the SI classifier was fine-tuned. Using this classifier, the analytical sample was categorized. SI documents were then classified according to whether they applied OSPs.
+
+The approach might seem overly complicated, but it was initially designed to be used on a much larger corpus of publications. As the project progressed, multiple reasons came to recommend a simpler approach, which will be discussed later.
+
 # Sample Size
 
 The sample size was determined by a precision-based calculation to ensure a $\pm$ 1.5 percentage point confidence interval for the SI prevalence as a precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated using the results of the literature review described in @sec-osp-in-crim.
diff --git a/index.qmd b/index.qmd
index c7d3315..20eeb47 100644
--- a/index.qmd
+++ b/index.qmd
@@ -142,11 +142,9 @@ Full-text data for training the machine learning classification models will be c
 
 This work is necessarily scoped by time and resources. It shall therefore be treated as a pilot that establishes data, measures and a reproducible, yet improvable pipeline to be extended in to a fully exhaustive study. Where necessary, potential improvements that could not be implemented are recommended.
 
-All data and code necessary to enable full replication can be retrieved from the osf repositories found in @sec-supplements-downloader. A full description of used software and methods is further layed out within the replication files and the accompanying methodological report.
-
 ## Population
 
-The scope of this work encompasses all publications from the top 100 journals classified under "Criminology & Penology" or the journals that are categorized as "Law" (which might also include sociologically or psychologically driven quantitative studies) and "Psychology, Multidisciplinary", ranked by the 2023 JIF according to Clarivate's Journal Citation Reports [@clarivateJournalImpactFactor2023] that rely on SI. Publication metadata were retrieved via the Crossref API. While Crossref provides extensive coverage, it is not exhaustive, and prior work has shown that missing records are often systematic rather than random [@delgado-quirosWhyAreThese2024; @hausteinWhenArticleActually2015]. Using multiple bibliographic sources (e.g., Scopus, Web of Science) would reduce this bias [@gerasimovComparisonDatasetsCitation2024; @delgado-quirosWhyAreThese2024], but this was not feasible within the scope of this thesis. Consequently, the study population is restricted to articles indexed in Crossref from the selected top 100 journals.
+The scope of this work encompasses all publications from the top 100 journals classified under "Criminology & Penology" or the journals that are categorized as "Law" (which might also include sociologically or psychologically driven quantitative studies) and "Psychology, Multidisciplinary", ranked by the 2023 JIF according to Clarivate's Journal Citation Reports [@clarivateJournalImpactFactor2023] that rely on SI. Publication metadata were retrieved via the Crossref API. While Crossref provides extensive coverage, it is not exhaustive, and prior work has shown that missing records are often systematic rather than random [@delgado-quirosWhyAreThese2024; @hausteinWhenArticleActually2015]. Using multiple bibliographic sources (e.g., Scopus, Web of Science) would reduce this bias [@gerasimovComparisonDatasetsCitation2024; @delgado-quirosWhyAreThese2024], but this was not feasible within the scope of this pilot. Consequently, the study population is restricted to articles indexed in Crossref from the selected top 100 journals.
 
 As the population is restricted to publications that make SIs, this concept has to be clearly defined mostly in line with @scogginsMeasuringTransparencySocial2024, as works that rely on data, statistical analysis and experiments. @scogginsMeasuringTransparencySocial2024 restricted further on only experiments, which was deemed not necessary as all assessed OSPs are suitable to be used and should be used in not only experiments, but also in works assessing second-hand data or alike [@akkerPreregistrationSecondaryData2021; @westonRecommendationsIncreasingTransparency2019]. Thereby, descriptive, correlational, comparative and other non-purely theoretical research was included. 
 
@@ -390,21 +388,13 @@ if (isTRUE(debug_mode)) {
 }
 ```
 
-## Classification Methods
+## Classification Tasks and Methods
 
-This section will present a summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the methodology report. A brief description of the results can be found in the supplementary material.
+This section presents a summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used, and the preprocessing steps can be found in the supplied materials.
 
-Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on machine learning models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. The classification of SI papers followed a staged approach. First a strict operationalization of "SI" (1) versus "not SI" (0), as well as of the OSPs with the same levels was created which was documented in a short coding manual. The process involved in the following steps:
+Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on machine learning models trained on a manually and LLM-coded subset of publications, as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. The classification of SI papers followed a staged approach that is further described in the provided materials. First, a strict operationalization of "SI" (1) versus "not SI" (0), as well as of the OSPs with the same levels, was created and documented in a short coding manual.
 
-1. A small subset of papers from Sample A was hand-coded by the author according to the operationalization.
-2. ChatGPT classified both the hand-coded as well as the not coded publications in Sample A.
-3. A random subsample of 50 papers was coded both manually and with ChatGPT. Disagreements were carefully reviewed and manual coding was reassessed. Agreement after correction was very high ( $\kappa$ = 83,2%), with ChatGPT outperforming the author's initial coding consistency.
-4. Due to good performance, ChatGPT was used to classify the rest of Sample A, and the combined manual/LLM labels formed the training and test data for subsequent ML models.
-5. ML Classifiers were trained on the produced classified subsample.
-
-Classification of the training Sample B followed the same approach. For classification document feature matrices were generated using term frequencies of keywords. These keywords were both adopted from  @scogginsMeasuringTransparencySocial2024 as well as self created, and extended using ChatGPT. Keywords were context specific according to the classified variable. All classification tasks were binary classifications. After assembling the keywords, the SI classifier was fine-tuned. Using this classifier, the analytical sample was categorized. SI documents were then classified for applying OSPs.
-
-The approach might seem overly complicated but was intitially designed to be used on a much larger corpus of publications. As time progressed during the project multiple reasons recommend a simpler approach that will be discussed later. 
+A subset of Sample A was coded by hand, followed by ChatGPT-based labelling of the full text. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so the combined manual/LLM labels served as training and test data for the ML classifiers. A similar approach was used for Sample B. For each task, OSP-specific document-feature matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed.
 
 Due to time constraints and the study's pilot nature, classification evaluation and data preprocessing were only optimized for the OSP classifier, not for the SI classifier. The more thorough approach used for OSP, which addressed challenges like high computational demands and class imbalance, would have improved the SI classifier but was not feasible. Despite this, the SI classifier still performed satisfactorily, and the optimal methods are reflected in the OSP training process. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.