added tons of stuff, literature, corrected the makefile, added ResearchPlan, Notes and corrected readme
This commit is contained in: parent 4757bcaa73, commit f88ff734bd
147 Notes.md Normal file
@@ -0,0 +1,147 @@
# Notes

## Research Plan

- **Problem**: using both sociology and criminology can introduce bias into the trained models due to the highly different vocabulary used in the two disciplines

According to @scogginsMeasuringTransparencySocial2024a

**Population**: \[social science\] papers using data and statistics

1. **Gathering Papers**
    1. Consult the Clarivate Journal Citation Report to obtain journals in the field
    2. Filter downloadable journals (those included in the campus licences)
    3. Using the [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus) or [WOS](https://github.com/juba/rwos) API: download publication metadata of all papers in the respective time span
    4. Download HTML papers
    5. Filter the to-download list by the HTML papers already grabbed
    6. Download paper full-text PDFs using [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot) or [monk1337/resp](https://github.com/monk1337/resp) (even possible to use Anna's Archive, Sci-Hub or LibGen, but that would be illegal, so of course not) - **really necessary?**
    7. Convert HTML and PDF papers to txt ([titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/))
2. Classification
    1. Operationalization of ...
        1. Papers that use statistical inference
        2. Papers that applied preregistration
        3. Papers that applied open data practices
        4. Papers that offer open materials
        5. Open access (theoretically not interesting?)
        6. Papers with positive results
    2. Definition of identification keywords/dictionaries for each category
    3. Manual classification of a number of papers for ML model training (between 1k and 2k)
    4. Creation of [DFMs](https://quanteda.io/reference/dfm.html) (see also: [Official Tutorial](https://tutorials.quanteda.io/basic-operations/dfm/)) using the dictionaries - see the sketch below this list
    5. ML model training (Naive Bayes, LogReg, nonlinear SVM, Random Forest, XGBoost)
    6. ML model evaluation / decision
    7. Classification of the data using the trained, best-performing model
3. Analysis
    - descriptive analysis of the temporal development in proportions over the last 10 years in each discipline, see @scogginsMeasuringTransparencySocial2024a

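A minimal sketch of step 2.4, assuming the quanteda package, a hypothetical dictionary with made-up keywords and a tiny stand-in corpus (the real dictionaries come from step 2.2, the real texts from step 1.7):

```r
library(quanteda)

# Hypothetical keyword dictionary; the real entries come from step 2.2.
os_dict <- dictionary(list(
  preregistration = c("preregist*", "pre-regist*", "registered report*"),
  open_data       = c("data availability statement", "data are available", "osf.io"),
  open_materials  = c("materials are available", "supplementary materials")
))

# Stand-in texts; in the project these are the converted full texts.
txts <- c(doc1 = "The study was preregistered on osf.io ...",
          doc2 = "No data availability statement was provided ...")

toks <- tokens(corpus(txts), remove_punct = TRUE)

# Dictionary-based features (counts of matches per category) ...
dfm_dict <- dfm(tokens_lookup(toks, os_dict))

# ... or a plain document-feature matrix as input for the models in step 2.5.
dfm_full <- dfm(toks)
```
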
## Todo

- add stuff about the replication crisis, 1-2 sentences in the introduction. see @scogginsMeasuringTransparencySocial2024a
- **improve wording in the last paragraph**
## Open Access

[@dienlinAgendaOpenScience2021]

1. publish materials, data and code
2. preregister studies and submit registered reports
3. conduct replication studies
4. collaborate
5. foster open science skills
6. implement Transparency and Openness Promotion (TOP) Guidelines
7. incentivize open science practices

- Systemic Biases in AI and Big Data: Open science tools can be used to address biases in AI algorithms [@nororiAddressingBiasBig2021].
### **Publication Bias**, selective reporting [@smaldinoOpenScienceModified2019; @fox142OpenScience2021]

- **Problem:** Journals often favor publishing positive or statistically significant results, leaving negative or null findings unpublished.
- **How Open Science Helps:** Pre-registration of studies and publishing all research outcomes (e.g., via open access repositories) ensure that all results are accessible. Open science encourages the publication of all results, including negative or null findings, which helps reduce the bias towards publishing only positive results. By promoting transparency and the sharing of data and methodologies, open science reduces the tendency to selectively report only favorable outcomes.

### **Confirmation Bias** [@fox142OpenScience2021]

- **Problem:** Researchers tend to favor analyses and interpretations that confirm their prior expectations.
- **How Open Science Helps:** Open science practices, such as pre-registration of studies, help mitigate confirmation bias by specifying hypotheses and analysis plans before data collection.

### **Reproducibility Crisis** [@fox142OpenScience2021]

- **Problem:** Many scientific findings cannot be replicated due to opaque methodologies or unavailable data and code.
- **How Open Science Helps:** Sharing detailed methods, datasets, and analysis scripts in open repositories promotes reproducibility and verification. Open science addresses the reproducibility crisis by making data and methods openly available, allowing other researchers to verify and replicate findings.

### **Algorithmic Bias** [@nororiAddressingBiasBig2021]

- **Problem:** Biases in training data and model design can propagate into AI systems and their predictions.
- **How Open Science Helps:** Public data and training reports for AI enable external scrutiny and auditing of models for bias.
### **Inefficiencies in Research Progress**

- **Problem:** Duplication of efforts and siloed research slow down scientific advancements.
- **How Open Science Helps:** Sharing negative results, datasets, and ongoing projects prevents duplication and accelerates innovation.

### **Overemphasis on Novelty**

- **Problem:** The pressure to publish novel findings discourages replication studies or incremental advancements.
- **How Open Science Helps:** Encouraging and funding replication studies through open peer-review processes shifts focus towards reliable and cumulative science.

### **Lack of Peer Review Transparency**

- **Problem:** Traditional peer review is often anonymous and lacks accountability, leading to potential biases or unfair evaluations.
- **How Open Science Helps:** Open peer review, where reviews and reviewer identities are accessible, ensures greater accountability and reduces bias.
### **Authorship and Credit Bias**

- **Problem:** Early-career researchers, women, and underrepresented groups often face challenges in receiving credit for their contributions.
- **How Open Science Helps:** Transparent contributions using tools like the Contributor Roles Taxonomy (CRediT) ensure that all contributors are recognized for their specific roles.
### **Conflicts of Interest**

- **Problem:** Undisclosed funding sources or affiliations may bias research findings.
- **How Open Science Helps:** Transparent declarations of conflicts of interest and funding sources reduce hidden biases.

### **Limited Interdisciplinary Collaboration**

- **Problem:** Barriers to sharing research outputs restrict interdisciplinary collaboration, limiting innovation.
- **How Open Science Helps:** Open sharing of data, methods, and publications fosters cross-disciplinary integration and innovation.

### **Data Access Inequality**

- **Problem:** Researchers in low-resource settings often lack access to expensive journals, datasets, or tools.
- **How Open Science Helps:** Open access publications and open data initiatives democratize access to research outputs, enabling equitable participation in science.

### **Misuse of Metrics (e.g., Impact Factor, h-Index)**

- **Problem:** Reliance on quantitative metrics for evaluating research quality skews scientific priorities.
- **How Open Science Helps:** Encouraging diverse evaluation metrics (e.g., open data reuse, societal impact) ensures fair assessment of research contributions.

### **Cherry-Picking and P-Hacking**

- **Problem:** Selective reporting or manipulating data to achieve statistical significance undermines the integrity of research.
- **How Open Science Helps:** Pre-registration of hypotheses and protocols discourages cherry-picking and promotes adherence to predefined analysis plans.

### **Lack of Public Engagement**

- **Problem:** Complex scientific outputs are often inaccessible to the general public, leading to mistrust or misunderstanding of science.
- **How Open Science Helps:** Open access and lay summaries of research make science more inclusive and comprehensible to non-specialists.

This commitment is rooted in the idea that scientific claims must be substantiated through consistent and reproducible evidence. Modern scientific inquiry, therefore, aligns with the notion that:

> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]
93 ResearchPlan.md Normal file
@@ -0,0 +1,93 @@
# Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers

## **Population**

- Papers in sociology and criminology utilizing data and statistical methods.
- Focus on evaluating open science practices:
    - Pre-registration
    - Open data
    - Open materials
    - Open access
    - Statistical inference

## **Data Collection**

1. **Journal Identification**
    - Use the Clarivate Journal Citation Report API to obtain a comprehensive list of sociology and criminology journals.
    - Filter the list to include journals accessible through university licensing agreements.

2. **Metadata Download**
    - Use APIs such as [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus) or [WOS](https://github.com/juba/rwos) to download metadata for all papers published between 2013 and 2023 (see the sketch after this list).

3. **Full-Text Retrieval**
    - Download HTML versions of papers where available for ease of structured text extraction.
    - Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
    - Tools for retrieval:
        - [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot), [monk1337/resp](https://github.com/monk1337/resp) (licensed sources only).
        - Institutional library services for access.
        - Open-access repositories for additional resources.
    - Parsing and conversion tools: [titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/).

4. **Preprocessing**
    - Convert collected papers to plain text using:
        - SciPDF Parser for PDF-to-text conversion.
        - HTML-to-text tools like `html2text`.
    - Standardize the text format for subsequent analysis.

5. **Resource Management**
    - Address potential constraints:
        - Use scalable data collection methods.
        - Leverage institutional resources (e.g., libraries and repositories).
        - Implement efficient workflows for text extraction and preprocessing (multicore processing).

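A minimal sketch of the metadata download in step 2, assuming the rcrossref package and a hypothetical ISSN taken from the journal list of step 1; the date filters follow the Crossref API, but field names and paging details should be checked against the package documentation:

```r
library(rcrossref)

issn <- "0000-0000"  # hypothetical ISSN from the JCR-based journal list (step 1)

res <- cr_journals(
  issn   = issn,
  works  = TRUE,
  filter = c(from_pub_date = "2013-01-01", until_pub_date = "2023-12-31"),
  limit  = 1000,
  cursor = "*"        # deep paging through all matching records
)

meta <- res$data      # DOIs, titles, publication dates, licence info, etc.
```
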
## **Classification**

1. **Operationalization**
    - Define clear criteria for identifying open science practices:
        - Pre-registration: terms like "pre-registered."
        - Open data: phrases like "data availability statement."
        - Open materials: statements like "materials available on request."

2. **Keyword Dictionary Creation**
    - Develop dictionaries of terms and phrases associated with each open science practice.
    - Base dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
    - Compare and join dictionaries.

3. **Manual Annotation**
    - Manually classify a subset of 1,000–2,000 papers for training machine learning models.
    - Use stratified sampling to ensure diversity in:
        - Journals
        - Publication years
        - Subfields within sociology and criminology.

4. **Feature Extraction**
    - Create document-feature matrices (DFMs) using keyword dictionaries to prepare data for machine learning.

5. **Model Training**
    - Train multiple machine learning models:
        - Naive Bayes
        - Logistic Regression
        - Support Vector Machines
        - Random Forests
        - Gradient Boosted Trees
    - Evaluate model performance to select the best classifier for each open science practice (see the sketch after this list).

6. **Automated Classification**
    - Apply the best-performing models to classify the entire dataset.
    - Automate the identification of open science practices across all collected papers.

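A sketch of steps 4-6 for a single practice, assuming a labelled DFM from the manual annotation and using the Naive Bayes implementation from quanteda.textmodels as one of the candidate models (the other models would be trained analogously):

```r
library(quanteda)
library(quanteda.textmodels)

# Assumed inputs: `dfm_all`, the document-feature matrix from step 4, and
# `labels`, a factor marking e.g. open-data papers in the hand-coded subset.
set.seed(42)
train_ids <- sample(seq_len(ndoc(dfm_all)), size = round(0.8 * ndoc(dfm_all)))

dfm_train <- dfm_all[train_ids, ]
dfm_test  <- dfm_all[-train_ids, ]

nb <- textmodel_nb(dfm_train, y = labels[train_ids])

# Align test features with the training features before predicting.
pred <- predict(nb, newdata = dfm_match(dfm_test, features = featnames(dfm_train)))

# Simple hold-out accuracy; per-practice precision/recall or F1 would
# guide the final choice between the candidate models.
mean(pred == labels[-train_ids])
```
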
## **Analysis**

1. **Descriptive Analysis**
    - Examine temporal trends in the adoption of open science practices over the past decade (see the sketch after this list).
    - Compare practices across sociology and criminology.
    - Compare journals.

2. **Evaluation of Results**
    - Identify patterns in:
        - Prevalence of pre-registration, open data, open materials, and open access.
        - Statistical inference methods.

3. **Ethical Considerations**
    - Ensure all methodologies comply with ethical and legal guidelines.
    - Avoid unauthorized sources such as Sci-Hub or LibGen.

4. **Broader Implications**
    - Contribute to understanding the adoption of transparency and reproducibility in social sciences.
    - Inform efforts to promote open science practices in sociology, criminology, and beyond.

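A minimal sketch of the descriptive analysis in step 1, assuming a hypothetical data frame `papers` with one row per classified paper and columns `year`, `discipline` and a logical `open_data` flag produced by the classification stage:

```r
library(dplyr)
library(ggplot2)

trends <- papers |>
  group_by(discipline, year) |>
  summarise(share_open_data = mean(open_data), .groups = "drop")

ggplot(trends, aes(x = year, y = share_open_data, colour = discipline)) +
  geom_line() +
  labs(x = "Publication year", y = "Share of papers with open data")
```
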
@@ -76,154 +76,54 @@ include-before: |
## Modern Science

The rise of the internet in the last decades has drastically changed our lives: our ways of looking at the world, our social lives, our consumption patterns - the internet influences all spheres of life, whether we like it or not [@SocietyInternetHow2019]. The surge in interconnectivity enabled a rise in movements that resist the classic definition of intellectual property rights: open source, open scholarship access and open science [@willinskyUnacknowledgedConvergenceOpen2005]. Modern technologies enhanced reliability, speed and efficiency in knowledge development, thereby enhancing communication, collaboration and access to information or data [@thagardInternetEpistemologyContributions1997; @eisendInternetNewMedium2002; @wardenInternetScienceCommunication2010]. The internet significantly facilitated formal and informal scholarly communication through electronic journals and digital repositories like Academia.edu or ResearchGate [@wardenInternetScienceCommunication2010; @waiteINTERNETKNOWLEDGEEXCHANGE2021]. Evidence also shows that an increase in access to the internet also increases research output [@xuImpactInternetAccess2021]. But greater output doesn't necessarily imply greater quality, progress or greater scientific discoveries. As availability and thereby the quantity of publications increased, the possible information overload demands effective filtering and assessment of published results [@wardenInternetScienceCommunication2010].
But how do we define scientific progress? In the mid-20th century, Thomas Kuhn characterized scientific progress as a revolutionary shift in paradigms, the accepted theories in a scientific community at a given time. According to Kuhn, normal science operates within these paradigms, "solving puzzles" and refining theories. However, when anomalies arise that cannot be explained by the current paradigm, a crisis occurs, leading to a scientific revolution [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions1962]. Opposed to that, a critical rationalist approach to scientific progress emerged that saw danger in the process Kuhn described, as paradigms might facilitate confirmation bias and thereby stall progress. This view is embodied in Karl Popper's philosophy of science, which emphasizes falsifiability and the idea that scientific theories progress through conjectures and refutations rather than through paradigm shifts. Popper argued that science advances by eliminating false theories, thus moving closer to the truth in a more linear and cumulative manner [@popperLogicScientificDiscovery2005]. Where Kuhn emphasized the development of dominant theories, Popper suggested the challenging or falsification of those theories.

Social sciences today engage in frequentist, deductive reasoning where significance testing is used to evaluate the null hypothesis, and conclusions are drawn based on the rejection or acceptance of this hypothesis, aligning with Popper's idea that scientific theories should be open to refutation. This approach is often criticized for its limitations in interpreting p-values and its reliance on long-run frequency interpretations [@dunleavyUseMisuseClassical2021; @wilkinsonTestingNullHypothesis2013]. In contrast, Bayesian inference is associated with inductive reasoning, where models are updated with new data to improve predictions. Bayesian methods allow for the comparison of competing models using tools like Bayes factors, but they do not directly falsify models through significance tests [@gelmanInductionDeductionBaysian2011; @dollBayesianModelSelection2019]. Overall, while falsification remains a cornerstone of scientific methodology, contemporary science often employs a pluralistic approach, integrating various methods to address complex questions and advance knowledge [@rowbottomKuhnVsPopper2011]. This pluralistic approach in contemporary science underscores the importance of integrating diverse methodologies to tackle complex questions and enhance our understanding. Despite the differences between frequentist and Bayesian methods, both share a fundamental commitment to the rigorous testing and validation of scientific theories.

But despite the more theoretically driven discourse about scientific discovery, there are many tangible reasons to talk about the scientific method and the publication process. A recent, highly cited article revealed that only a very small proportion of the variance in the outcomes of studies based on the same data can be attributed to the choices made by researchers in designing their tests. @breznauObservingManyResearchers2022 observed 77 researcher teams analyzing the same dataset to assess the same hypothesis and found that the results ranged from strongly positive to strongly negative. Less than half of the between-team deviance could be explained by assigned conditions, research decisions and researcher characteristics; the rest of the variance remained unexplained. This underlines the importance of transparent research: results are prone to many errors and biases, made intentionally or unintentionally by the researcher or induced by the publisher.
> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]

To challenge the biases and to support the possibility of these "repetitions" or replications of research, a movement has formed within the scientific community, fuelled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. The open science movement tries to establish open science practices to challenge many of the known biases that endanger the reliability of the scientific process.

@banksAnswers18Questions2019 establish a definition of open science as a broad term that refers to many concepts, including scientific philosophies embodying communality and universalism; specific practices operationalizing these norms, such as open science policies like sharing of data and analytic files, redefinition of confidence thresholds, pre-registration of studies and analytical plans, engagement in replication studies, removal of pay-walls, and incentive systems to encourage the above practices; and even specific citation standards. This typology is in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024]. The two dominant, highly discussed approaches in open science are open data and preregistration.
**Publishing materials, data and code**, or *open data*, is necessary to enable replication of the studies. Replication thereby makes it possible to assess the pursued research in detail, find errors or bias, or even support the results [@dienlinAgendaOpenScience2021]. While many researchers see challenges in the publication of their data and materials due to a potentially higher workload, legal concerns or simply lack of interest, many of these concerns could be ruled out by streamlined processes or institutional support [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007]. As open data reduces p-hacking, facilitates new research by enabling reproduction, reveals mistakes in the coding process and enables a diffusion of knowledge on the research process, many researchers, journals and other institutions are starting to adopt open data in their research [@dienlinAgendaOpenScience2021; @finkReplicationCodeAvailability; @freeseAdvancesTransparencyReproducibility2022; @zenk-moltgenFactorsInfluencingData2018; @matternWhyAcademicsUndershare2024].

**Preregistration** involves thoroughly outlining and documenting research plans and their rationale in a repository. These plans can be made publicly accessible when the researcher decides to share them. The specifics of preregistration can vary based on the research type and may encompass elements such as hypotheses, sampling strategies, interview guides, exclusion criteria, study design, and analysis plans [@managoPreregistrationRegisteredReports2023]. Within this definition, a preregistration shall not prevent exploratory research. Deviations from the research plan are still allowed but shall be communicated transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014]. Preregistration impacts research in multiple ways: it helps to perform exploratory and confirmatory research independently, protects against publication bias as journals typically commit to publish registered research, and counters "researchers' degrees of freedom" in data analysis by reducing overfitting through cherry-picking, variable swapping, flexible model selection and subsampling [@mertensPreregistrationAnalysesPreexisting2019; @FalsePositivePsychologyUndisclosed]. This minimizes the risk of bias by promoting decision-making that is independent of outcomes. It also enhances transparency, allowing others to evaluate the potential for bias and adjust their confidence in the research findings accordingly [@hardwickeReducingBiasIncreasing2023].

My initial plan for my master's thesis was to study the effect of pre-registration on reported effect sizes. During my initial literature review, it appeared to me that there were very few publications that used pre-registration in data-driven criminology and sociology. Instead of assessing effect sizes, this raised the question: **How have open science practices been adopted within sociology and criminology? How has the use of these practices developed over the last decade?**
@scogginsMeasuringTransparencySocial2024 did an extensive analysis of nearly 100,000 publications in political science and international relations. They observed an increasing use of preregistration and open data, with levels still being relatively low. The extensive research not only revealed the current state of open science in political science, but also generated rich data to perform further meta research.

I intend to apply similar methods in the field of sociology and criminology: gather data about papers in a subset of criminology and sociology journals, classify those papers by application of open science practices, and explore the patterns over time to take stock of research practices in the disciplines. In the following sections I describe the intended data collection and research methods, which closely follow @scogginsMeasuringTransparencySocial2024.

# Data and Method

The study will focus on papers in sociology and criminology that use data and statistical methods. The aim is to evaluate the prevalence of key open science practices, including pre-registration, open data, open materials, open access, statistical inference, and the reporting of positive results.
## Data Collection
Why the huge data collection effort?

- preparation for further research; the database might be useful for other research questions
- I want to practice R / ML methods.
- By-hand collection of data on open science practices is very time consuming - why not generate the data from the texts?
- From @akkerPreregistrationSecondaryData2021: "To create a control group for comparison with the preregistered studies in our sample, we linked each preregistered publication in our sample to a non-preregistered publication. We did so by checking Web of Science’s list of related papers for every preregistered publication and selecting the first non-preregistered publication from that list that used primary quantitative data and was published in the same year as the related preregistered publication." I think this is kind of questionable.

The process of data collection will closely follow @scogginsMeasuringTransparencySocial2024 and begin with identifying relevant journals in sociology and criminology. I will consult the Clarivate Journal Citation Report via their API to obtain a comprehensive list of journals within these fields, filtering for the top 30 journals in the respective fields (originally, @scogginsMeasuringTransparencySocial2024 used a top 100 filter; I will use the top 30 journals to limit the amount of data because of technical limitations in my workspace setup). To ensure feasibility, I will filter this list to include only journals that are accessible under the university's licensing agreements. Once the relevant journals are identified, I will use APIs such as Crossref, Scopus, or Web of Science to download metadata for all papers published between 2013 and 2023.

After obtaining the metadata, I will proceed to download the full-text versions of the identified papers. Whenever possible, I will prioritize downloading HTML versions of the papers due to their structured format, which simplifies subsequent text extraction. For papers that are not available in HTML, I will consider downloading full-text PDFs. Tools such as PyPaperBot can facilitate this process, although I will strictly adhere to ethical and legal guidelines, avoiding unauthorized sources like Sci-Hub or LibGen. If access to full-text papers becomes a limiting factor, I will assess alternative strategies such as collaborating with institutional libraries to request specific papers or identifying open-access repositories that may provide supplementary resources. Non-available texts will be treated as their own category in the later analysis. Once all available full-text papers are collected, I will preprocess the data by converting HTML and PDF files into plain text format using tools such as SciPDF Parser or html2text. This preprocessing step ensures that the text is in a standardized format suitable for analysis.

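Since the surrounding pipeline is in R, the HTML-to-text step could also be done with rvest instead of the Python html2text tool; a minimal sketch, assuming the downloaded HTML files sit in a hypothetical `html/` directory:

```r
library(rvest)

dir.create("txt", showWarnings = FALSE)
html_files <- list.files("html", pattern = "\\.html$", full.names = TRUE)

for (f in html_files) {
  # html_text2() collapses the rendered text of the whole document;
  # a real pipeline would first drop navigation, references, etc.
  txt <- html_text2(read_html(f))
  out <- file.path("txt", paste0(tools::file_path_sans_ext(basename(f)), ".txt"))
  writeLines(txt, out)
}
```
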
The proposed data collection is resource-intensive but serves multiple purposes. However, resource constraints could pose challenges, such as limited access to computational tools or delays in obtaining full-text papers. To mitigate these risks, I plan to prioritize scalable data collection methods, limit data collection to a manageable extent and use existing institutional resources, including library services and open-access repositories. Additionally, I will implement efficient preprocessing workflows ensuring that the project remains feasible within the given timeline and resources.
## Classification
The classification process will begin with operationalizing the key open science practices that I aim to study. This involves defining clear criteria for identifying papers that fall into the four categories I plan to classify: papers that use statistical inference, papers that applied preregistration, papers that applied open data practices, and papers that offer open materials.
For instance, terms like "pre-registered," "open data," or "data availability statement" could indicate adherence to pre-registration or open data practices. Similarly, phrases such as "materials available on request" or "open materials" could signify the use of open materials. The freely available data from @scogginsMeasuringTransparencySocial2024 will form the foundation of the keyword dictionaries for identifying relevant papers during the classification phase. To facilitate this, I will additionally develop my own keyword dictionaries for each category, identifying terms and phrases commonly associated with these practices before consulting @scogginsMeasuringTransparencySocial2024.
To train machine learning models capable of classifying the papers, I will manually annotate a subset of papers. The sample size will be determined using weighted fitting of learning curves according to @figueroaPredictingSampleSize2012, which needs an initial hand-coded sample of 100-200 papers. If the necessary sample size exceeds my time constraints, I will try to use clustering-based text classification to extend the training sample [@zengCBCClusteringBased2003]. To ensure the representativeness of this subset, I will sample papers proportionally from different journals, publication years, and subfields within sociology and criminology. This stratified sampling approach will help mitigate biases and ensure that the training data reflects the diversity of the overall dataset. The sampled subset will serve as a labeled dataset for supervised learning. Different classification methods were considered but deemed not suitable for the task, as they were either designed for document topic classification or too time-intensive for a master's thesis [e.g. @kimResearchPaperClassification2019; @sanguansatFeatureMatricizationDocument2012; @jandotInteractiveSemanticFeaturing2016]. Instead, I will use a two-stage approach that has been applied in other fields and in highly specialized document classification tasks [@abdollahiOntologybasedTwoStageApproach2019]. Using the manually labeled data, I will construct document-feature matrices (DFMs) based on the predefined keyword dictionaries, in line with [@scogginsMeasuringTransparencySocial2024]. I will then train various machine learning models, including Naive Bayes, Logistic Regression, Support Vector Machines, and Gradient Boosted Trees. The performance of each model will be evaluated to identify the best-performing classifier for each category of open science practices. Once the optimal models are selected, I will use them to classify the entire dataset of papers.

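As a rough sketch of the learning-curve idea in @figueroaPredictingSampleSize2012 (not their exact algorithm): fit an inverse power law to classifier accuracy at increasing training-set sizes and extrapolate to a candidate annotation budget; the accuracies below are made up:

```r
# Hypothetical accuracies of a classifier trained on the first n hand-coded papers.
curve_dat <- data.frame(
  n   = c(25, 50, 75, 100, 150, 200),
  acc = c(0.62, 0.70, 0.74, 0.77, 0.80, 0.82)
)

# Inverse power law acc(n) = a - b * n^(-c), weighted towards larger samples,
# since later points are estimated more reliably.
fit <- nls(acc ~ a - b * n^(-c),
           data    = curve_dat,
           start   = list(a = 0.9, b = 1, c = 0.5),
           weights = n / max(n))

# Predicted accuracy at a candidate budget of 1,500 hand-coded papers.
predict(fit, newdata = data.frame(n = 1500))
```
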
The automated classification will enable me to categorize papers based on their adoption of open science practices. This classification will provide the foundation for subsequent analyses of temporal trends and other patterns within the data. Automating the classification process mitigates the inefficiency of manual data collection, allowing for the analysis of a significantly larger dataset than would otherwise be feasible.
## Analysis
In the analysis phase of the research, an exploratory analysis will be conducted to examine temporal trends in the adoption of open science practices over the past decade. This involves comparing the adoption rates of practices such as pre-registration, open data, open materials, and open access across the disciplines of sociology and criminology, as well as among different journals. The goal is to identify any significant differences or similarities in how these practices have been embraced over time. This evaluation aims to uncover insights into the methodological rigor and transparency within the fields, providing a comprehensive understanding of the current landscape and potential areas for improvement in research practices. By building on the methods developed by @scogginsMeasuringTransparencySocial2024, I hope to generate insights that will inform future efforts to promote transparency and reproducibility in the social sciences.

\newpage
1 make.sh
@@ -11,7 +11,6 @@ pandoc -i "$IN" \
  -o "$OUT" \
  --csl=apa-7th-edition.csl \
  --citeproc \
  --filter pandoc-crossref \
  --lua-filter=filters/first-line-indent.lua \
  --citation-abbreviations=citation-abbreviations.csl

0 modify-pdf.sh (Normal file → Executable file)