some moving around and so on

This commit is contained in:
2025-12-11 19:32:15 +01:00
parent 4533b1cdc0
commit f78bebb8e8
2 changed files with 7 additions and 6 deletions
+6 -5
View File
@@ -390,16 +390,18 @@ if (isTRUE(debug_mode)) {
## Classification Tasks and Methods
This section will present a summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the supplied materials.
This section will present a brief summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the supplied materials.
Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on machine learning models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. The classification of SI papers followed a staged approach further described in the provided materials. First a strict operationalization of "SI" (1) versus "not SI" (0), as well as of the OSPs with the same levels was created which was documented in a short coding manual.
Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on Random-Forest and XGBoost-models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. First, a strict operationalization of "SI" (1) versus "not SI" (0), as well as of the OSPs with the same levels was created which was documented in a short coding manual.
A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the fulltext. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels served as training and test data for the ML classifiers. A similar approach was used for Sample B. For each task, OSP-specific document-feature-matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed.
A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the fulltext. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels served as training and test data for the ML classifiers. A similar approach was used for Sample B. For each task, OSP-specific document-feature-matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. Each OSP classifier was tuned on all possible combinations of different feature sets and model.
Due to time constraints and the study's pilot nature, classification evaluation and data preprocessing were only optimized for the OSP classifier, not for the SI classifier. The more thorough approach used for OSP, which addressed challenges like high computational demands and class imbalance, would have improved the SI classifier but was not feasible. Despite this, the SI classifier still performed satisfactorily, and the optimal methods are reflected in the OSP training process. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.
## Analysis
The research was deliberately designed to study open-science practices via supervised classifiers rather than relying exclusively on metadata. This choice prioritized scalability and the potential to capture practice signals that metadata may miss, at the cost of managing model error and class imbalance. Given the exploratory character of the work, the analyses were not pre-defined, only data collection, sampling, and the model-training strategy were specified in advance. Concerns about classifier interpretability informed the evaluation strategy [@gilpinExplainingExplanationsOverview2018].
Estimates for OSPs are domain-estimates among SI papers (see @tbl-cases2) drawn from a year-stratified random sample, beta-method CIs are based on design-based variance. A year-stratified random sample was drawn. Design-weights were applied post-stratified to frame-by-year totals with finite-population corrections. All OSP estimates are domain estimates for SI papers using design-based inference. Design corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson) transformation which provides better coverage for low-prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. Results generalize to the keyword-filtered data. With $n=1,763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.
OSP labels were assigned by classifiers whose sensitivity and specificity are imperfect, with potential misclassifications affecting the reported prevalence rates. To assess robustness, a simple sensitivity analysis using the Rogan-Gladen correction for misclassification of a binary outcome was conducted [@liuQuantitativeBiasAnalysis2023; @vallecamposSerosurveySerologicalSurvey2020].
@@ -408,7 +410,7 @@ Data is reported per year. As per year data given the very low prevalences is ex
# Results & Discussion
The research design was deliberately designed to study open-science practices via supervised classifiers rather than relying exclusively on metadata. This choice prioritized scalability and the potential to capture practice signals that metadata may miss, at the cost of managing model error and class imbalance. Given the exploratory character of the work, the analyses were not pre-defined, only data collection, sampling, and the model-training strategy were specified in advance. Concerns about classifier interpretability informed the evaluation strategy [@gilpinExplainingExplanationsOverview2018]. Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates.
Two research questions were formulated: $RQ_1$ on the prevalence of OD and OM among statistical-inference (SI) publications, and $RQ_2$ on the prevalence of preregistration. After extensive model development, validation, calibration, thresholding, and misclassification adjustment, prevalences for OD, OM, and Preregistration were too low for the ML classifiers to yield interpretable, adjusted estimates.
The ML classifiers trained on GPT labels inherit GPT's strengths and the data's sparsity.For the relatively small 20% validation set coded by GPT, the open-science practice classifiers are less precise and less reliable than the Statistical-Inference classifier.
@@ -1030,7 +1032,6 @@ Further instructions can be found in the README file. Full-text data and the dow
\newpage
```
# Bibliography
::: {#refs}