finalizing, statements added

2025-12-16 22:23:27 +01:00
parent efb66ac334
commit ee79e207ea
1 changed files with 41 additions and 43 deletions
@@ -74,11 +74,11 @@ In their excellent manifesto for reproducible science, @munafoManifestoReproduci

 Other problematic practices involve the misuse of p-values, where researchers misinterpret the significance level as the likelihood that their findings are true, leading to vast overconfidence in their results; this can also be a consequence of, or lead to, a failure to control for bias and poor quality control [@breznauDoesSociologyNeed2021; @munafoManifestoReproducibleScience2017]. Demographic, geographic or political biases and peer review limitations are further sources of error [@breznauDoesSociologyNeed2021; @grossmannOpenScienceReform2021]. Additionally, gendered penalties contribute to men publishing disproportionately more than women [@akbaritabarGenderPatternsPublication2021]. Misaligned institutional incentives, also accelerated by an intense competition for academic jobs, tenure and funding, lead to a so-called "publish or perish" culture [@smaldinoOpenScienceModified2019; @breznauDoesSociologyNeed2021]. 

-All the above leads to the conclusion, that our institutions make refutation harder than confirmation. Open science is the design response, resetting defaults to transparency, pre-specification, and reproducibility. @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and preregistration, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the so-called open science (OS) movement, devoting its effort to challenge publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021]. 
+All the above leads to the conclusion that our institutions make refutation harder than confirmation. Open science (OS) is the design response, resetting defaults to transparency, pre-specification, and reproducibility. @munafoManifestoReproducibleScience2017 translate that philosophy into a lifecycle blueprint: blinding and preregistration, stronger methods training and independent oversight, open data, code and diversified peer review to harden reproducibility, evaluation and other measures. The central movement to address the above issues is the OS movement, devoting its effort to challenge publication bias, low statistical power, p-hacking, HARKing and other problems by increasing reproducibility and transparency [@grossmannOpenScienceReform2021]. 

 ## Open Science Practices

-Following an extensive literature review @vicente-saezOpenScienceNow2018a characterize OS using four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434]. @banksAnswers18Questions2019 establish a broader definition of os that refers to many concepts, including scientific philosophies embodying communality and universalism, specific practices operationalizing these norms including os policies. A common ground is that *open* science and OSPs try to prevent research misconduct by simply increasing research transparency.
+Following an extensive literature review, @vicente-saezOpenScienceNow2018a characterize OS using four differentias: transparency in communication, accessibility or searchability to all data and materials, sharing of everything with a commitment to do so and collaboration along a scientific, distributed global dialogue throughout all stages involved in science. They integrate these into a succinct definition: "Open Science is transparent and accessible knowledge that is shared and developed through collaborative networks" [@vicente-saezOpenScienceNow2018a, p. 434]. @banksAnswers18Questions2019 establish a broader definition of OS that refers to many concepts, including scientific philosophies embodying communality and universalism and specific practices operationalizing these norms, including OS policies. A common ground is that *open* science and OSPs try to prevent research misconduct by simply increasing research transparency.

 Building on these definitions, in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024], there are numerous practices that have been proposed to enact OS.

@@ -92,9 +92,9 @@ First, there is accumulating evidence that providing data alongside publications

 Second, openness improves methodological rigor and documentation. Knowing that others will inspect our code, data, and decisions incentivizes clearer documentation, more careful workflows, and fewer statistical errors in final papers [@tennantAcademicEconomicSocietal2016; @banksAnswers18Questions2019]. This also promotes transparency about analytic choices and potential biases [@breznauDoesSociologyNeed2021].

-Third, OD and OM reinforce field credibility. By allowing independent scrutiny of methods and results, openness reduces the chance that findings are based on idiosyncratic decisions or unreported researcher degrees of freedom [@scogginsMeasuringTransparencySocial2024; @breznauObservingManyResearchers2022]. Multiple sources suggest that open practices reduce QRPs overall [@scogginsMeasuringTransparencySocial2024; @tennantAcademicEconomicSocietal2016; @munafoManifestoReproducibleScience2017]. 
+Third, OD and OM reinforce field credibility. By allowing independent examination of methods and results, openness reduces the chance that findings are based on idiosyncratic decisions or unreported researcher degrees of freedom [@scogginsMeasuringTransparencySocial2024; @breznauObservingManyResearchers2022]. Multiple sources suggest that open practices reduce QRPs overall [@scogginsMeasuringTransparencySocial2024; @tennantAcademicEconomicSocietal2016; @munafoManifestoReproducibleScience2017]. 

-Finally, openness has economic and societal benefits, even more evident for open access. It discourages redundant data collection, enabling cost savings that can be redirected to new research questions [@tennantAcademicEconomicSocietal2016; @piwowarSharingDetailedResearch2007]. At the same time, the public availability of data stimulates methodological innovation and cross-dataset syntheses that would otherwise remain infeasible [@piwowarStateOALargescale2018]. These dynamics amplify the academic, economic, and societal impact of research [@tennantAcademicEconomicSocietal2016]. 
+Finally, both have economic and societal benefits that are even more evident for open access (OA). They discourage redundant data collection, enabling cost savings that can be redirected to new research questions [@tennantAcademicEconomicSocietal2016; @piwowarSharingDetailedResearch2007]. At the same time, the public availability of data stimulates methodological innovation and cross-dataset syntheses that would otherwise remain infeasible [@piwowarStateOALargescale2018]. These dynamics amplify the academic, economic, and societal impact of research [@tennantAcademicEconomicSocietal2016]. 

 Despite these gains, legitimate concerns persist among many researchers. With increasingly powerful linkage and inference techniques, even 'anonymized' datasets can risk re-identification if insufficient safeguards are in place. Researchers may fear that openness exposes flaws, invites reputational harm, or enables misuse - but detecting and correcting errors is core to good scientific practice and should be actively encouraged [@banksAnswers18Questions2019]. A major practical barrier is time and effort. Preparing shareable assets, for example by de-identifying data and curating metadata and documented code, can be complex and resource-intensive [@loggPreregistrationWeighingCosts2021; @sarafoglouSurveyHowPreregistration2022]. While many researchers see challenges in the publication of their data and materials, many of these concerns could be addressed by streamlined processes or institutional support [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007; @americanpsychologicalassociationOpenScienceBadges]. 

@@ -106,7 +106,7 @@ In short, many systemic and researcher-centric challenges cut across OSPs - and

 A preregistration is a time-stamped plan for a study's hypotheses, design, and analysis, often made public. Its contents vary by method (e.g., hypotheses, sampling, interview guides, exclusion rules, analysis plans) [@loggPreregistrationWeighingCosts2021; @managoPreregistrationRegisteredReports2023; @americanpsychologicalassociationOpenScienceBadges].

-Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, by committing ex ante, researcher degrees of freedom are narrowed. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. Preregistration also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs by construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation-strengthening credibility [ @evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. Preregistration helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows "researcher degrees of freedom" [@simmonsFalsePositivePsychologyUndisclosed2011].
+Timestamping restrains HARKing by separating predictions from evidence, reducing the flexibility for post-hoc theorizing [@scogginsMeasuringTransparencySocial2024; @loggPreregistrationWeighingCosts2021]. More broadly, by committing ex ante, researcher degrees of freedom are narrowed. The analytic and design choices that otherwise enable selective reporting or specification searching are constrained, and any deviations become visible to readers and reviewers. The same logic limits p-hacking: when transformations, outlier rules, model families, covariates, and confirmatory contrasts are specified in advance, cherry-picking becomes less feasible because analytical decisions are made independently of the data. Preregistration also addresses structural issues of study quality. Declaring sample-size requirements upfront helps prevent underpowered designs at construction [@kuhbergerPublicationBiasPsychology2014; @grossmannOpenScienceReform2021]. We predefine theory, measures, and analyses, seek early input, and document choices so reviewers can vet them and avoid misinterpretation, which strengthens credibility [@evansImprovingEvidencebasedPractice2023; @sarafoglouSurveyHowPreregistration2022; @scogginsMeasuringTransparencySocial2024]. Preregistration helps separate confirmatory from exploratory work, reduces publication bias (e.g., via Registered Reports), and narrows "researcher degrees of freedom" [@simmonsFalsePositivePsychologyUndisclosed2011].

 For this work, **preregistration** is defined as *the act of planning and documenting the hypotheses, study design, and analysis plan of a study before data is collected or even viewed. The documentation is typically time-stamped and made publicly available*.

@@ -114,19 +114,19 @@ The Open Science movement, particularly preregistration, has been criticized for

 Frequently voiced concerns are that preregistration increases work, thereby lengthening projects, and restricts researchers' freedom by confining them to their predefined plan. However, preplanning simply reorders the workflow rather than creating extra work, potentially preventing costly redesigns or follow-up studies. Additionally, this does not inhibit exploratory work, as the goal is to provide clarity and transparency by distinguishing between preplanned analyses and those conducted after viewing the data. By moving the conceptual work upstream, preregistration clarifies claims, adds transparency to the decision process and strengthens credibility by marking plans and deviations [@loggPreregistrationWeighingCosts2021; @evansImprovingEvidencebasedPractice2023]. In-principle acceptance adds a guarantee to the upfront work, provided the approved plan is followed [@sarafoglouSurveyHowPreregistration2022; @banksAnswers18Questions2019].

-In summary, preregistration does not constrain scientific creativity; it clarifies claims. By making the sequence of decisions explicit-what was planned, what changed, and why-we reduce bias, improve interpretability, and strengthen confidence in reported findings [@hardwickeReducingBiasIncreasing2023].
+In summary, preregistration does not constrain scientific creativity but clarifies claims. By making the sequence of decisions explicit - what was planned, what changed, and why - we reduce bias, improve interpretability, and strengthen confidence in reported findings [@hardwickeReducingBiasIncreasing2023].

 ### Open Access {#sec-open-access}

-**Open access** (OA) is a key OSP, defined as making research freely available online to anyone, as opposed to requiring payment via journal subscriptions [@banksAnswers18Questions2019, @breznauDoesSociologyNeed2021]. The Budapest OA Initiative defines OA as being free to read and reuse for lawful purposes, including text and data mining [@BOAI2002]. A simpler, broad definition is the lawful free availability of a research publication on the internet which will be used here.
+**Open access** is a key OSP, defined as making research freely available online to anyone, as opposed to requiring payment via journal subscriptions [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021]. The Budapest OA Initiative defines OA as being free to read and reuse for lawful purposes, including text and data mining [@BOAI2002]. A simpler, broader definition, which will be used here, is the lawful free availability of a research publication on the internet.

 OA publishing offers several benefits. It increases accessibility and equity, as anyone with an internet connection can reach an OA article, potentially reducing inequalities for those at underfunded institutions [@banksAnswers18Questions2019]. There is a significant OA citation advantage, as OA articles are cited more frequently than closed-access publications. This preference is now considered a form of research bias known as "FUTON" (full text on the net) bias [@piwowarStateOALargescale2018; @wentzVisibilityResearchFUTON2002; @piwowarSharingDetailedResearch2007]. OA also improves research quality by reducing the suppression of null findings [@francoPublicationBiasSocial2014] and enabling large-scale text and data mining [@tennantAcademicEconomicSocietal2016]. Furthermore, it accelerates equitable access, helping to bridge the global North-South divide, and enhances public accountability for publicly funded research [@tennantAcademicEconomicSocietal2016].

-Despite its benefits, OA faces challenges. Some newer or smaller Gold OA journals are perceived as less prestigious [@piwowarStateOALargescale2018], and concerns about "predatory publishers" have been mistakenly linked with OA [@tennantAcademicEconomicSocietal2016]. Article processing charges (APCs) can be a barrier for authors, particularly in low- and middle-income countries [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021], though roughly 70% of peer-reviewed OA journals are fee-free, and many offer waivers [@tennantAcademicEconomicSocietal2016; @breznauDoesSociologyNeed2021]. Publishers may also be hesitant to adopt OA due to concerns about losing subscription revenue [@banksAnswers18Questions2019]. While OA promotes transparency, it cannot on its own solve issues like QRPs or underpowered studies if incentives continue to reward quantity over quality [@grossmannOpenScienceReform2021; @banksAnswers18Questions2019].
+Despite its benefits, OA faces challenges. Some newer or smaller Gold OA journals are perceived as less prestigious [@piwowarStateOALargescale2018], and concerns about "predatory publishers" have been mistakenly linked with OA [@tennantAcademicEconomicSocietal2016]. Article processing charges (APCs) can be a barrier for authors, particularly in low- and middle-income countries [@banksAnswers18Questions2019; @breznauDoesSociologyNeed2021], though roughly 70% of peer-reviewed OA journals are fee-free, and many offer waivers [@tennantAcademicEconomicSocietal2016; @breznauDoesSociologyNeed2021]. Publishers may also be hesitant to adopt OA due to concerns about losing subscription revenue. While OA promotes transparency, it cannot on its own solve issues like QRPs or underpowered studies if incentives continue to reward quantity over quality [@grossmannOpenScienceReform2021; @banksAnswers18Questions2019].

 ## Open Science in Criminology and Legal Psychology {#sec-osp-in-crim}

-A focused literature review on adoption produced limited evidence as we still know surprisingly little about how often OSPs are actually used in criminology and legal psychology. The evidence is fragmented, method-dependent, and sometimes contradictory-so estimates of prevalence are shaky even as enthusiasm for OSPs is high and QRPs appear common.
+A focused literature review on adoption produced limited evidence, as we still know surprisingly little about how often OSPs are actually used in criminology and legal psychology. The evidence is fragmented, method-dependent, and sometimes contradictory - so estimates of prevalence are shaky, even as enthusiasm for OSPs is high and QRPs might appear common.

 Self-reports suggest high OSP familiarity - but they co-exist with widespread QRPs and are vulnerable to bias. In @chinQuestionableResearchPractices2023, 89% of respondents said they had used at least one OSP, yet 87% also admitted at least one QRP, and some serious QRPs (e.g., hiding known problems) were non-trivial. Survey data indicate that about 25% of researchers across fields have preregistered a study, with higher uptake in psychology (50-60%) and lower prevalence in sociology (~30%) [@fergusonSurveyOpenScience2023a]. Another survey in the field similarly estimated preregistration use at 45% (42-49%) [@chinQuestionableResearchPractices2023]. The reported prevalence of OD varies widely across disciplines. Survey data suggest that more than 60% of researchers report having posted data or code, with higher rates in psychology (>50%) compared to sociology (~35%) [@fergusonSurveyOpenScience2023a]. The prevalence of OM sharing is more limited compared to OD and access. Survey results indicate that 43% (40-47%) of researchers report providing access to their research materials [@chinQuestionableResearchPractices2023]. Few or no journals require data sharing in the field, coupled with rare preregistration and a tiny share of replication studies [@pridemoreReplicationCriminologySocial2018].

@@ -140,9 +140,9 @@ The applied nature of the research in this field means fragile findings can driv

 # Data and Method

-The aim of this methodological work is to compile a sample of publications in the fields of criminology and legal psychology, classify them as either statistical inference (SI) publications or non-SI publications and further examine the former to assess whether any of the OSPs under consideration are used: preregistration, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the relatively high reliability of such information. The fine-tuned models are validated against a hand-coded sample that was extended using a large-language-model (LLM, ChatGPT 4o & ChatGPT 5o), report precision/recall and calibration, and then estimate annual prevalence with uncertainty intervals.
+The aim of this work is to compile a sample of publications in the fields of criminology and legal psychology, classify them as either statistical inference (SI) publications or non-SI publications and further examine the former to assess whether any of the OSPs under consideration are used: preregistration, OD, OM, or OA. OA results are reported as secondary, descriptive analyses to benchmark open-science adoption. The presented OSPs will be operationalized and a text-classification pipeline (keyword dictionaries and machine-learning models) will be used to detect them. OA status will be determined using publicly available metadata, given the expected high reliability of information on OA. The fine-tuned models are validated against a hand-coded sample that is extended using a large language model (LLM; ChatGPT 4o & ChatGPT 5o). The combined labels are then used to train classifier models that classify the analytical sample in order to estimate the prevalence of OSPs.

-Full-text data for training the machine learning classification models will be collected with a web application developed specifically for this project. Since software development is not the focus of this work, details of the app's architecture will not be discussed here. A brief description of the application, along with screenshots, is provided in @sec-data-fulltext-collection.
+Full-text data for training the machine learning classification models will be collected with a web application developed specifically for this project. Since software development is not the focus of this work, details of the app's architecture will not be discussed here. A brief description of the application, along with screenshots, is provided in the supplementary material.

 This work is necessarily scoped by time and resources. It shall therefore be treated as a pilot that establishes data, measures and a reproducible, yet improvable, pipeline to be extended into a fully exhaustive study. Where necessary, potential improvements that could not be implemented are recommended.

@@ -154,7 +154,7 @@ As the population is restricted to publications that make SIs, this concept has

 Temporally, this study adopts a starting point of 2013-01-01. The endpoint is set at 2023-12-31, consistent with the initial planning of this work, as the year 2024 had not yet come to an end.

-In summary, the study population consists of all statistical-inference publications published between 2013 and 2023 in the top 100 JIF-ranked criminology and legal psychology journals (as of 2023), indexed in Crossref.
+In summary, the study targets all statistical-inference publications published between 2013 and 2023 in the top 100 JIF-ranked criminology and legal psychology journals (as of 2023), indexed in Crossref.

 ## Sampling {#sec-sampling}

@@ -240,9 +240,7 @@ if (isTRUE(debug_mode)) {
 }
 ```

-The data obtained necessitated multiple transformations. All transformations are reported in the respective section in the methodological report.
-
-Publications were filtered by the resulting date variable to limit the population to the defined time interval. To reduce SI coding efforts, simple keyword-lists were used to reduce the number of publications by matching titles (e.g. "Book Review"). Missing values were assessed, checks were processed for language, @tbl-cases shows that from an initial number of 95042 publications, all steps resulted in a final publication count of 40,860. It is important to note here that several improvements were implemented here but not processed. More details can be found in the provided materials. The next step that was planned was to download full-text HTML or PDF versions, only using legal and ethical sources.
+The data obtained necessitated multiple transformations. All transformations are reported in the respective section in the methodological report. Publications were filtered by the resulting date variable to limit the population to the defined time interval. To reduce SI coding efforts, simple keyword lists were used to reduce the number of publications by matching titles. Missing values were assessed and language checks were performed. @tbl-cases shows that, from an initial 95,042 publications, these steps resulted in a final publication count of 40,860. It is important to note that several improvements were implemented but not processed. More details can be found in the provided materials.

 [^7]: [https://jcr.clarivate.com/jcr/browse-journals](https://jcr.clarivate.com/jcr/browse-journals), JCR Year set to 2023. 

@@ -250,7 +248,7 @@ Publications were filtered by the resulting date variable to limit the populatio

 Using the obtained Crossref metadata, the analytical sample was drawn stratified by year according to the calculation in @sec-sampling. The resulting analytical sample contains roughly 10% of the population data. As seen in @fig-freq-pubs-comp, Sample A, the training and validation sample for the SI classifier, is intentionally not stratified by year, as the proportion of SI papers is not expected to vary over time. Stratification by journal was rejected because covering 100 journals would have required many more cases. 

-The final analytical sample is made up of 4265 publications. The OS prevalence classification sample consists of 352 publications stratified by year whereas the unstratified sample A for the training of the SI classifiers consists of 408 publications.
+The final analytical sample is made up of 4265 publications. The OS prevalence classification sample consists of 352 publications stratified by year whereas the unstratified sample A for the training of the SI classifiers consists of 408 publications. The next step involved downloading full-text HTML or PDF versions, only using legal and ethical sources.

 ```{r}
 #| fig-cap: "Frequencies: Publications by Year in Population and Sample"
@@ -373,9 +371,9 @@ if (isTRUE(debug_mode)) {

 ### Full Text Retrieval

-The initial approach to gathering full texts, which used Zotero to translate DOIs as per Scoggins and Robertson, was unreliable across multiple attempts and versions. Due to the unsuitability of existing software tools, be it for technical or legal reasons, a custom web application was developed.
+The initial approach to gathering full texts, which used Zotero to translate DOIs as per Scoggins and Robertson, was unreliable across multiple attempts and software versions. Due to the unsuitability of existing software tools, be it for technical or legal reasons, a custom web application was developed.

-Downloading the analytical sample was mostly successful, though some publisher protections caused dropouts. Due to time constraints, additional more optimized runs were not feasible. Documents under 1,000 words were considered non-full-text papers. However, shorter HTML texts were retained for potential keyword matching. Text quality assessment (Flesch-Index) and word count identified missing full texts [@benoitQuantedaPackageQuantitative2018], with further analysis available in the methodological report. Full texts were downloaded for Independent Sample A and the Analytical Sample from which Sample B was drawn. The resulting dropouts should have been implicitly handled by post-stratification. Publisher-level weighting was considered but infeasible due to sparse cells that would have produced unstable weights. Post-stratification was conducted by year only, which does not correct publisher- or journal-specific dropouts. Future iterations should add publisher-level adjustment.
+Downloading the analytical sample was mostly successful, though some publisher protections caused dropouts. Due to time constraints, additional, more optimized runs were not feasible. Documents under 1,000 words were considered non-full-text papers. However, shorter HTML texts were retained for potential keyword matching. Text quality assessment (Flesch-Index) and word count identified missing full texts [@benoitQuantedaPackageQuantitative2018]. Full texts were downloaded for Independent Sample A and the Analytical Sample from which Sample B was drawn. The resulting dropouts were expected to be handled implicitly by post-stratification. Publisher-level weighting was planned but proved infeasible due to sparse cells that would have produced unstable weights. Post-stratification was therefore conducted by year only, which does not correct publisher- or journal-specific dropouts. Future, non-pilot iterations should add publisher-level adjustment.
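
The year-only adjustment can be written out as a standard post-stratification weight. The notation below is an illustrative assumption and is not taken from the methodological report: $N_y$ is the number of population publications in year $y$, $n_y$ the number of sampled publications with a retrieved full text in that year, $N = \sum_y N_y$, $n = \sum_y n_y$, and $\hat{p}_y$ the classified share of a given OSP in year $y$.

$$
w_y = \frac{N_y / N}{n_y / n}, \qquad \hat{p} = \sum_{y} \frac{N_y}{N}\,\hat{p}_y
$$

Because $w_y$ depends on the year alone, publications from two publishers with different dropout rates in the same year receive the same weight, which is why publisher- or journal-specific dropout remains uncorrected under this scheme.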

 ```{r}
 #| label: tbl-cases2
@@ -438,17 +436,17 @@ if (isTRUE(debug_mode)) {

 This section will present a brief summary of all methods used to classify the variables of interest. A thorough discussion of the decisions taken, the full descriptions and specifications of the models used as well as the preprocessing steps can be found in the supplied materials.

-Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on Random-Forest and XGBoost-models trained on a manually and LLM coded subset of publications as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. First, a strict operationalization of "SI" (1) versus "not SI" (0), as well as of the OSPs with the same levels was created which was documented in a short coding manual. 
+Since most existing classification approaches considered were deemed unsuitable for this scope (e.g., @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016), this work instead relies on random forest and XGBoost models trained on a manually and LLM-coded subset of publications, as LLMs have shown good performance on similar classification tasks [@buntValidatingUseLarge2025; @zhaoAdvancingSingleMultitask2024]. For each task, OSP-specific document-feature matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. 

-A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the fulltext. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels  served as training and test data for the ML classifiers. A similar approach was used for Sample B. For each task, OSP-specific document-feature-matrices using term frequencies or TF-IDF of keyword sets, partly adapted from @scogginsMeasuringTransparencySocial2024, were constructed. Each OSP classifier was tuned on all possible combinations of different feature sets and model. 
+First, a strict dichotomous operationalization of "SI" versus "not SI", as well as of the OSPs, was synthesized and documented in a short coding manual. A subset of Sample A was coded by hand, followed by a ChatGPT-based labelling of the full text. On a random subsample, agreement after reconciliation was high ($\kappa$ $\approx$ .83), so combined manual/LLM labels served as training and test data for the ML classifiers. A similar approach was used for Sample B. Each OSP classifier was tuned on all possible combinations of different feature sets and models. 
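
For orientation, the agreement statistic can be spelled out, assuming the reported $\kappa$ is Cohen's kappa for two coders (here, the manual and the LLM labels); the text itself only states $\kappa$. With $p_o$ the observed share of identically coded publications and $p_e$ the agreement expected by chance from the two coders' marginal label frequencies:

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

As a purely hypothetical illustration (these are not the study's values), $p_o = 0.92$ and $p_e = 0.53$ would give $\kappa = 0.39 / 0.47 \approx .83$. The chance correction matters for imbalanced labels such as rarely used OSPs, where raw percentage agreement can be high even if the coders seldom agree on the positive class.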

-Due to time constraints and the study's pilot nature, classification evaluation and data preprocessing were only optimized for the OSP classifier, not for the SI classifier. The more thorough approach used for OSP, which addressed challenges like high computational demands and class imbalance, would have improved the SI classifier but was not feasible. Despite this, the SI classifier still performed satisfactorily, and the optimal methods are reflected in the OSP training process. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.
+Given time constraints and the pilot nature of the study, preprocessing and evaluation were optimized for the OSP classifier only, not for the SI classifier. The more rigorous workflow applied to OSP - designed to handle high computational demands and substantial class imbalance - would likely also have improved SI performance, but was not pursued because SI results were already satisfactory, as documented in the provided material. Furthermore, journal-level adoption of OSPs was originally intended to be assessed using the Transparency and Openness Promotion Factor [@nosekPromotingOpenResearch2015]. However, as the available sample sizes were insufficient for journal-level analyses, these were not carried out.

 ## Analysis

 The research was deliberately designed to study open-science practices via supervised classifiers rather than relying exclusively on metadata. This choice prioritized scalability and the potential to capture practice signals that metadata may miss, at the cost of managing model error and class imbalance. Given the exploratory character of the work, the analyses were not pre-defined; only data collection, sampling, and the model-training strategy were specified in advance. Concerns about classifier interpretability informed the evaluation strategy [@gilpinExplainingExplanationsOverview2018].

-Estimates for OSPs are domain-estimates among SI papers (see @tbl-cases2) drawn from a year-stratified random sample, beta-method CIs are based on design-based variance. A year-stratified random sample was drawn. Design-weights were applied post-stratified to frame-by-year totals with finite-population corrections. All OSP estimates are domain estimates for SI papers using design-based inference. Design corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson) transformation which provides better coverage for low-prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. Results generalize to the keyword-filtered data. With $n=1,763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.
+Estimates for OSPs are domain estimates among SI papers (see @tbl-cases2), drawn from a year-stratified random sample and obtained by design-based inference. Design weights were post-stratified to frame-by-year totals with finite-population corrections. Design-corrected 95% confidence intervals are computed with the beta method (Clopper-Pearson), which is based on the design-based variance and provides better coverage for low prevalences than Wald intervals [@agrestiIntroductionCategoricalData2007]. Results generalize to the keyword-filtered data. With $n=1,763$ SI papers, SI-domain CIs are wider than the planned $\pm$ 1.5 pp. Because some SI papers may have been excluded by the screening, OSP levels for all SI papers in the full corpus of 90k publications may differ. An audit of excluded records could quantify the coverage and enable adjustment but was not conducted here.
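+
+For orientation, the beta-method interval can be sketched in its standard form, assuming the design-based variant replaces the raw counts with an effective sample size. The symbols $n_{\text{eff}}$, $x$ and $B^{-1}$ are notation introduced here for illustration and are not taken from the analysis code. With estimated proportion $\hat{p}$ and design-based variance $\widehat{\operatorname{Var}}(\hat{p})$:
+
+$$
+n_{\text{eff}} = \frac{\hat{p}(1-\hat{p})}{\widehat{\operatorname{Var}}(\hat{p})}, \quad x = n_{\text{eff}}\,\hat{p}, \quad \left[\, B^{-1}\!\left(\tfrac{\alpha}{2};\, x,\, n_{\text{eff}} - x + 1\right),\; B^{-1}\!\left(1 - \tfrac{\alpha}{2};\, x + 1,\, n_{\text{eff}} - x\right) \right],
+$$
+
+where $B^{-1}(q; a, b)$ denotes the $q$-quantile of a Beta$(a, b)$ distribution and $\alpha = 0.05$ for 95% intervals. Unlike a Wald interval, these bounds cannot fall outside $[0, 1]$, which matters at the low prevalences reported here.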

 OSP labels were assigned by classifiers whose sensitivity and specificity are imperfect, with potential misclassifications affecting the reported prevalence rates. To assess robustness, a simple sensitivity analysis using the Rogan-Gladen correction for misclassification of a binary outcome was conducted [@liuQuantitativeBiasAnalysis2023; @vallecamposSerosurveySerologicalSurvey2020]. 

@@ -702,7 +700,6 @@ if (isTRUE(debug_mode)) {
 }
 ```

-Because design-based estimates do not account for classifier error, Rogan-Gladen adjustments were applied using sensitivity and specificity from the ML-validation analysis in @tbl-osp-prev-overall [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values can be interpreted as sensitivity ranges rather than confirmatory estimates. Any substantive claims should thereby rather be based on design-based estimates and on OA (measured from metadata).

 ```{r}
 #| tbl-cap: Overall Prevalence of Open Science Practices among Statistical Inference Papers (Design-Weighted to Frame-by-Year Totals)
@@ -785,6 +782,8 @@ if (isTRUE(debug_mode)) {
 }
 ```

+Because design-based estimates do not account for classifier error, @tbl-osp-prev reports Rogan-Gladen adjustments that use sensitivity and specificity from the ML-validation analysis [@liuQuantitativeBiasAnalysis2023]. Under extreme rarity, adjustments become unstable: intervals widen dramatically (approaching $[0,1]$) or yield boundary/negative estimates when specificity is insufficient relative to prevalence. For OD, the false-positive rate ($1-\text{Sp} \approx 12.7\%$) exceeds the observed prevalence ($2.2\%$), pushing adjusted points below zero. For OM, low sensitivity ($\text{Se} = 0.20$) and tiny validation counts produce near-uninformative intervals. Given these constraints, the adjusted values should be interpreted as sensitivity ranges rather than confirmatory estimates. Substantive claims should therefore rest on the design-based estimates and on OA (measured from metadata).
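+
+As a hedged illustration of why the OD adjustment turns negative, the Rogan-Gladen estimator can be written as follows. Here $\hat{p}$ is the observed (classifier-based) prevalence and $\hat{\pi}_{\text{RG}}$ the adjusted prevalence; both symbols are notation introduced here, and the inputs are the rounded values quoted above:
+
+$$
+\hat{\pi}_{\text{RG}} = \frac{\hat{p} + \text{Sp} - 1}{\text{Se} + \text{Sp} - 1}, \qquad \hat{p} - (1 - \text{Sp}) \approx 0.022 - 0.127 = -0.105 < 0 .
+$$
+
+Whenever $\text{Se} + \text{Sp} > 1$, the denominator is positive, so the sign of the adjusted estimate is determined by the numerator alone. For OD the numerator is negative regardless of the exact sensitivity.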
+
 Earlier differences in text sources suggest heterogeneity by journal, thereby implicating also publisher variance [@scogginsMeasuringTransparencySocial2024]. @fig-osp-time-by-publisher visualizes OA shares over time for the 12 most prolific publishers in the sample (listed in the caption). Leveraging larger $n$, the author fit simple OLS trends to annual OA proportions. The four most prolific publishers show clear increases. Four publishers do not: Oxford University Press, Emerald, ASCE, and MDPI. MDPI remains at 100% OA, Emerald at 0% in this sample; ASCE shows an apparent decline consistent with limited observations; Oxford University Press is relatively stable. All observed increases are highly statistically significant. Future work should use models designed for proportions (e.g., binomial GLMs) and, ideally, hierarchical pooling across publishers and years.

 ```{r}
@@ -952,12 +951,11 @@ if (isTRUE(debug_mode)) {
 }
 ```

-
 This study was deliberately scoped as a pilot, which constrained coverage, precision, and tooling. The population assessed was limited to SI papers from the top 100 JCR journals in criminology and legal psychology and to Crossref metadata, so venue and index biases remain. The 2013-2023 window omits the most recent changes. Keyword screening did not fully exclude non-target items, and a Quarto "freeze" configuration led to using print over online dates in some cases. Full-text retrieval was partial and legally bounded to TDM-permitted publishers; short documents (<1,000 words) were treated as missing full text, risking misclassification.

 Measurement and modeling challenges were substantial. SI/OSP labels were trained on a small, single-coder hand set plus GPT assistance. Severe class imbalance for OSPs, few validation positives, and upsampling inflated nominal accuracy while depressing stability. Misclassification adjustments (Rogan-Gladen) became unstable at very low prevalences, and some OA trend analyses used simple OLS rather than binomial/GLM approaches.

-In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers reveals non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.
+In the methodological report, comparing basic text characteristics between OSP-labeled papers and non-OSP papers revealed non-independence (e.g., differences in word count, Flesch score, and text source), despite the assumption that such features should not vary with true OSP status. This pattern indicates likely misclassification and/or model leakage, with classifiers picking up irrelevant proxies (publisher templates, document length) rather than OSP content.

 I therefore propose a series of recommendations for future iterations. These should expand the bibliographic metadata sources (Crossref plus Scopus and Web of Science) and audit screened-out records to assess selection; sharpen the operationalizations with rules closer to the constructs defined by, e.g., OSF; employ multi-coder assessment and quantify inter-rater reliability on a larger training data base, or classify directly with ChatGPT, as suggested by the high precision observed here; and replace OLS with binomial GLMs or hierarchical models for proportions. On the technical side, a more stringent Quarto setup should be used, with simplified modular code based on a refined version of the codebase used here. The downloader should be improved with a more homogeneous extraction logic by including the HTML and PDF full-text extraction in the pre-processing pipeline, making the whole process more transparent, more reproducible and less error-prone. Finally, the sample size should be increased substantially, ideally to the full population of SI papers in the frame, to improve precision and enable analysis at the journal level.

@@ -1132,15 +1130,17 @@ To make sure, that our results are robust, reliable and credible, this work shal
 \renewcommand\thesection{}
 ```

-# Materials, Data and Code
+# Acknowledgments
+
+I would like to thank my advisor, [Advisor Name], for their guidance, constructive feedback, and steady support throughout this project. Their expertise and encouragement were invaluable in shaping both the research and this publication.
+
+# Data availability

 Materials, Data and Code are made available at a public OSF-repository that can be accessed here:

 - https://osf.io/rvpc3/overview?view_only=0307dc0d99f74b50a738720a4a757aa0. 

-Further instructions can be found in the README file. Full-text data and the downloader can't be made available to the public due to copyright concerns. An encrypted, password-protected file for each containing the full-texts is available in the repository. 
-
-# Acknowlegments
+Further instructions can be found in the README files. Full-text data and the downloader cannot be made publicly available due to copyright concerns.

 # Funding

@@ -1148,7 +1148,15 @@ This research received no external funding.

 # Conflicts of Interest

-The authors declare no conflicts of interest.
+The author declares no conflicts of interest.
+
+# Conflict of interest
+
+One of the editors of this issue participated as an academic advisor.
+
+# Author Biography
+
+Michael Beck is finishing his MSc in Sociology and Social Research at the University of Cologne, Germany. He is interested in computational social science and supports research as a student assistant at the German Institute for Adult Education (Bonn), focusing on evaluating generative AI for education and educational research.

 ```{=latex}
 \newpage
@@ -1159,16 +1167,6 @@ The authors declare no conflicts of interest.
 ::: {#refs}
 :::

-```{=latex}
-\FloatBarrier   % flush all earlier floats here
-\clearpage
-\appendix
-\setcounter{section}{0}
-\setcounter{page}{1}
-\renewcommand{\thepage}{A\arabic{page}}
-\renewcommand\thesection{\Alph{section}}
-```
-
 ```{r}
 #| results: asis