moves sampling process to supplements

2025-12-11 16:43:50 +01:00
parent a2f42ee061
commit 4533b1cdc0
2 changed files with 23 additions and 15 deletions
@@ -12,6 +12,24 @@ execute:
 source("deps.R")
 ```

+# Introduction
+
+This document serves as 
+
+# Sampling Approach
+
+The process involved the following steps:
+
+1. A small subset of papers from Sample A was hand-coded by the author according to the operationalization.
+2. ChatGPT classified both the hand-coded and the uncoded publications in Sample A.
+3. A random subsample of 50 papers was coded both manually and with ChatGPT. Disagreements were carefully reviewed, and the manual coding was reassessed. Agreement after correction was very high ($\kappa$ = 83.2%), with ChatGPT outperforming the consistency of the author's initial coding.
+4. Because of this good performance, ChatGPT was used to classify the rest of Sample A, and the combined manual/LLM labels formed the training and test data for the subsequent ML models.
+5. ML classifiers were trained on the resulting classified subsample.
+
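+For reference, the agreement statistic reported in step 3 can be read against the standard definition of Cohen's $\kappa$, which corrects the observed agreement $p_o$ between two coders for the agreement $p_e$ expected by chance. This is a sketch that assumes the reported $\kappa$ is Cohen's coefficient for a binary label; the variant actually computed is not stated here, and the symbols $p_o$, $p_e$, $p_A$ and $p_B$ are introduced only for illustration:
+
+$$
+\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = p_A \, p_B + (1 - p_A)(1 - p_B)
+$$
+
+Here $p_A$ and $p_B$ denote the share of papers that the manual coding and ChatGPT, respectively, assigned to the positive class.
+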
+Classification of the training Sample B followed the same approach. For classification, document-feature matrices were generated from the term frequencies of keywords. These keywords were adopted from @scogginsMeasuringTransparencySocial2024, created by the author, and extended using ChatGPT. Keywords were context-specific, depending on the variable being classified. All classification tasks were binary. After the keywords were assembled, the SI classifier was fine-tuned and then used to categorize the analytical sample. SI documents were then classified according to whether they applied OSPs.
+
+The approach might seem overly complicated, but it was initially designed to be used on a much larger corpus of publications. As the project progressed, several reasons emerged that recommend a simpler approach; these are discussed later.
+
 # Sample Size

 The sample size was determined by a precision-based calculation targeting a confidence interval of $\pm$ 1.5 percentage points for the SI prevalence. A precision-based sample size calculation was deemed more suitable for an exploratory prevalence study [@blandTyrannyPowerThere2009]. The calculations were based on prevalences arbitrarily estimated from the results of the literature review described in @sec-osp-in-crim.
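+
+As an illustration of what such a precision-based calculation typically looks like, the usual normal-approximation formula for a single proportion is shown below. This is a sketch under stated assumptions, not necessarily the exact calculation used here: $d$ is taken to be the half-width of the interval (0.015), and the confidence level (95%, $z = 1.96$) and the prevalence $p = 0.5$ in the worked line are illustrative choices that do not come from this document.
+
+$$
+n = \frac{z_{1-\alpha/2}^{2} \; p \, (1 - p)}{d^{2}}
+$$
+
+$$
+n = \frac{1.96^{2} \times 0.5 \times 0.5}{0.015^{2}} = \frac{0.9604}{0.000225} \approx 4268.4 \;\Rightarrow\; n = 4269
+$$
+
+The choice $p = 0.5$ maximizes $p(1 - p)$ and therefore gives the most conservative (largest) sample size; a prevalence estimate further from 0.5 would yield a smaller $n$.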