# Notes

## Research Plan
- **Problem:** Using both sociology and criminology papers may bias the trained models, because the two disciplines use markedly different vocabularies.

According to @scogginsMeasuringTransparencySocial2024a:

**Population**: \[social science\] papers using data and statistics
1. **Gathering Papers**

   1. Consult the Clarivate Journal Citation Reports to obtain the journals of each field.

   2. Filter to downloadable journals (those covered by the campus licences).

   3. Using the [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus), or [WOS](https://github.com/juba/rwos) API, download the publication metadata of all papers in the respective time span (a minimal sketch follows after this plan).

   4. Download the HTML versions of the papers.

   5. Remove papers already retrieved as HTML from the to-download list.

   6. Download the remaining full-text PDFs using [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot) or [monk1337/resp](https://github.com/monk1337/resp) (Anna's Archive, Sci-Hub, or LibGen would also work, but that would be illegal, so of course not) - **really necessary?**

   7. Convert the HTML and PDF papers to plain text ([titipata/scipdf_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/)).
2. **Classification**

   1. Operationalization of the target categories:

      1. Papers that use statistical inference

      2. Papers that applied preregistration

      3. Papers that applied open data practices

      4. Papers that offer open materials

      5. Open Access (theoretically not interesting?)

      6. Papers with positive results

   2. Definition of identification keywords/dictionaries for each category

   3. Manual classification of a sample of papers (roughly 1,000-2,000) as training data for the ML models

   4. Creation of [DFMs](https://quanteda.io/reference/dfm.html) (see also: [Official Tutorial](https://tutorials.quanteda.io/basic-operations/dfm/)) using the dictionaries (a quanteda sketch follows after this plan)

   5. ML model training (Naive Bayes, logistic regression, nonlinear SVM, random forest, XGBoost; a training sketch follows after this plan)

   6. ML model evaluation / selection

   7. Classification of the full corpus using the trained, best-performing model
3. **Analysis**

   - Descriptive analysis of the temporal development of the proportions over the last 10 years in each discipline, see @scogginsMeasuringTransparencySocial2024a (a plotting sketch follows below)

   - Comparison of practices across sociology and criminology
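A minimal sketch of the metadata download in step 1.3, using rcrossref; the ISSN, date range, and `cursor_max` value are placeholder assumptions, not the final selection from step 1.1:

```r
# Minimal sketch: download Crossref metadata for one journal.
# ISSN and date range are placeholders, not the final selection.
library(rcrossref)

res <- cr_journals(
  issn       = "0003-1224",          # example: American Sociological Review
  works      = TRUE,                 # return the works, not the journal record
  filter     = c(from_pub_date  = "2014-01-01",
                 until_pub_date = "2024-12-31"),
  limit      = 1000,                 # Crossref's per-request maximum
  cursor     = "*",                  # deep paging through all records
  cursor_max = 50000
)
meta <- res$data                     # tibble with DOI, title, dates, URLs, ...
```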
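For steps 2.2 and 2.4, a sketch of how the dictionaries and DFMs could be built with quanteda; the keyword patterns are illustrative placeholders, and `papers_txt` stands in for the converted full texts from step 1.7:

```r
# Sketch: keyword dictionaries and a document-feature matrix with quanteda.
# The glob patterns below are illustrative placeholders.
library(quanteda)

dict <- dictionary(list(
  preregistration = c("preregist*", "registered report*"),
  open_data       = c("data availab*", "osf.io", "replication package*"),
  open_materials  = c("materials availab*", "supplementary material*")
))

corp  <- corpus(papers_txt)                    # character vector from step 1.7
toks  <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)                             # full DFM for model training
dict_counts <- dfm(tokens_lookup(toks, dict))  # counts per dictionary category
```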
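For steps 2.5 and 2.6, a sketch of training and evaluating one candidate model (Naive Bayes via quanteda.textmodels) on the hand-coded sample; `labels` stands in for the manual codes from step 2.3, and the other models would be compared on the same train/test split:

```r
# Sketch: train and evaluate one candidate classifier on the hand-coded sample.
# `labels` is the vector of manual codes from step 2.3 (assumption).
library(quanteda)
library(quanteda.textmodels)
library(caret)                       # for confusionMatrix()

set.seed(42)
n         <- ndoc(dfmat)
train_idx <- sample(seq_len(n), size = round(0.8 * n))

nb   <- textmodel_nb(dfmat[train_idx, ], y = labels[train_idx])
pred <- predict(nb, newdata = dfmat[-train_idx, ])

# Accuracy, precision, and recall on the held-out 20 %
confusionMatrix(pred, factor(labels[-train_idx]))
```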
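And for the descriptive analysis, a sketch of the temporal trend plot, assuming a classified data frame `results` with columns `year`, `discipline`, and one logical flag per practice (all names are assumptions):

```r
# Sketch: share of papers per year showing each practice, by discipline.
# Assumes `results` from the classification step (column names are assumptions).
library(dplyr)
library(tidyr)
library(ggplot2)

results |>
  pivot_longer(c(preregistration, open_data, open_materials),
               names_to = "practice", values_to = "present") |>
  group_by(discipline, practice, year) |>
  summarise(share = mean(present), .groups = "drop") |>
  ggplot(aes(year, share, colour = practice)) +
  geom_line() +
  facet_wrap(~discipline)
```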
## Todo

- Add 1-2 sentences about the replication crisis to the introduction, see @scogginsMeasuringTransparencySocial2024a

- **improve wording in the last paragraph**
## Open Science

[@dienlinAgendaOpenScience2021]
1. publish materials, data and code

2. preregister studies and submit registered reports

3. conduct replication studies

4. collaborate

5. foster open science skills

6. implement the Transparency and Openness Promotion (TOP) Guidelines

7. incentivize open science practices
- Systemic Biases in AI and Big Data: Open science tools can be used to address biases in AI algorithms [@nororiAddressingBiasBig2021].
### **Publication Bias and Selective Reporting** [@smaldinoOpenScienceModified2019; @fox142OpenScience2021]

- **Problem:** Journals often favor publishing positive or statistically significant results, leaving negative or null findings unpublished.

- **How Open Science Helps:** Pre-registration and the publication of all research outcomes (e.g., via open access repositories) make negative and null findings accessible, and the transparent sharing of data and methodologies reduces the tendency to selectively report only favorable outcomes.
### **Confirmation Bias** [@fox142OpenScience2021]

- **Problem:** Researchers tend to favor analyses and interpretations that confirm their prior expectations.

- **How Open Science Helps:** Open science practices, such as pre-registration of studies, help mitigate confirmation bias by specifying hypotheses and analysis plans before data collection.
### **Reproducibility Crisis** [@fox142OpenScience2021]

- **Problem:** Many scientific findings cannot be replicated due to opaque methodologies or unavailable data and code.

- **How Open Science Helps:** Sharing detailed methods, datasets, and analysis scripts in open repositories allows other researchers to verify and replicate findings.
### **Algorithmic Bias** [@nororiAddressingBiasBig2021]

- **Problem:** Models trained on unrepresentative or historically biased data reproduce and amplify those biases.

- **How Open Science Helps:** Public training data and training reports for AI systems enable independent auditing for bias.
### **Inefficiencies in Research Progress**

- **Problem:** Duplication of efforts and siloed research slow down scientific advancements.

- **How Open Science Helps:** Sharing negative results, datasets, and ongoing projects prevents duplication and accelerates innovation.

### **Overemphasis on Novelty**

- **Problem:** The pressure to publish novel findings discourages replication studies or incremental advancements.

- **How Open Science Helps:** Encouraging and funding replication studies through open peer-review processes shifts focus towards reliable and cumulative science.
### **Lack of Peer Review Transparency**

- **Problem:** Traditional peer review is often anonymous and lacks accountability, leading to potential biases or unfair evaluations.

- **How Open Science Helps:** Open peer review, where reviews and reviewer identities are accessible, ensures greater accountability and reduces bias.
### **Authorship and Credit Bias**

- **Problem:** Early-career researchers, women, and underrepresented groups often do not receive fair credit for their contributions.

- **How Open Science Helps:** Transparent contribution statements ensure that all contributors are recognized for their specific roles.
### **Conflicts of Interest**

- **Problem:** Undisclosed funding sources or affiliations may bias research findings.

- **How Open Science Helps:** Transparent declarations of conflicts of interest and funding sources reduce hidden biases.

### **Limited Interdisciplinary Collaboration**

- **Problem:** Barriers to sharing research outputs restrict interdisciplinary collaboration, limiting innovation.

- **How Open Science Helps:** Open sharing of data, methods, and publications fosters cross-disciplinary integration and innovation.
### **Data Access Inequality**

- **Problem:** Researchers in low-resource settings often lack access to expensive journals, datasets, or tools.

- **How Open Science Helps:** Open access publications and open data initiatives democratize access to research outputs, enabling equitable participation in science.

### **Misuse of Metrics (e.g., Impact Factor, h-Index)**

- **Problem:** Reliance on quantitative metrics for evaluating research quality skews scientific priorities.

- **How Open Science Helps:** Encouraging diverse evaluation metrics (e.g., open data reuse, societal impact) ensures fair assessment of research contributions.
### **Cherry-Picking and P-Hacking**

- **Problem:** Selective reporting or manipulating data to achieve statistical significance undermines the integrity of research.

- **How Open Science Helps:** Pre-registration of hypotheses and protocols discourages cherry-picking and promotes adherence to predefined analysis plans.

### **Lack of Public Engagement**

- **Problem:** Complex scientific outputs are often inaccessible to the general public, leading to mistrust or misunderstanding of science.

- **How Open Science Helps:** Open access and lay summaries of research make science more inclusive and comprehensible to non-specialists.
This commitment is rooted in the idea that scientific claims must be substantiated through consistent and reproducible evidence. Modern scientific inquiry, therefore, aligns with the notion that:

> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]