add intro

Michael Beck 2024-12-13 17:52:59 +01:00
parent d50455ef7f
commit 02b5266611


@@ -23,7 +23,7 @@ link-citations: true
lot: true
lof: true
bibliography: [lit.bib]
mainfont: Latin Modern Roman
fontfamilyoptions: lmodernscellL
@@ -72,15 +72,136 @@ include-before: |
\newpage
# Intro & Motivation
## Modern Science
The rise of the internet in recent decades has drastically changed our lives: our ways of looking at the world, our social lives, our consumption patterns - the internet influences all spheres of life, whether we like it or not [@SocietyInternetHow2019]. The surge in interconnectivity enabled a rise in movements that resist the classic definition of intellectual property rights: open source, open scholarly access and open science [@willinskyUnacknowledgedConvergenceOpen2005]. Modern technologies enhanced reliability, speed and efficiency in knowledge development, thereby improving communication, collaboration and access to information [@thagardInternetEpistemologyContributions1997; @eisendInternetNewMedium2002; @wardenInternetScienceCommunication2010]. The internet significantly facilitated formal and informal scholarly communication through electronic journals and digital repositories like Academia.edu or ResearchGate [@wardenInternetScienceCommunication2010; @waiteINTERNETKNOWLEDGEEXCHANGE2021]. Evidence also shows that increased access to the internet increases research output [@xuImpactInternetAccess2021]. But greater output does not necessarily imply greater quality, progress or discovery. As the availability and thereby the quantity of publications increased, the resulting information overload demands effective filtering and assessment of published results [@wardenInternetScienceCommunication2010].
But what exactly is scientific progress? In the mid-20th century, Thomas Kuhn characterized scientific progress as revolutionary shifts between paradigms, the theories accepted by a scientific community at a given time. According to Kuhn, normal science operates within these paradigms, "solving puzzles" and refining theories. However, when anomalies arise that cannot be explained by the current paradigm, a crisis occurs, leading to a scientific revolution [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions1962]. In opposition, a critical rationalist approach to scientific progress emerged that saw danger in the process Kuhn described, as paradigms might facilitate confirmation bias and thereby stall progress. Karl Popper's philosophy of science emphasizes falsifiability and the idea that scientific theories progress through conjectures and refutations rather than through paradigm shifts. Popper argued that science advances by eliminating false theories, thus moving closer to the truth in a more linear and cumulative manner [@popperLogicScientificDiscovery2005]. Where Kuhn emphasized the development of dominant theories, Popper suggested the challenging or falsification of those theories.
Social sciences today engage in frequentist, deductive reasoning where significance testing is used to evaluate the null hypothesis, and conclusions are drawn based on the rejection or acceptance of this hypothesis, aligning with Popper's idea that scientific theories should be open to refutation. This approach is often criticized for its limitations in interpreting p-values and its reliance on long-run frequency interpretations [@UseMisuseClassicala; @wilkinsonTestingNullHypothesis2013]. In contrast, Bayesian inference is associated with inductive reasoning, where models are updated with new data to improve predictions. Bayesian methods allow for the comparison of competing models using tools like Bayes factors, but they do not directly falsify models through significance tests [@gelmanInductionDeductionBaysian2011; @dollBayesianModelSelection2019]. Overall, while falsification remains a cornerstone of scientific methodology, contemporary science often employs a pluralistic approach, integrating various methods to address complex questions and advance knowledge [@rowbottomKuhnVsPopper2011].
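As a toy illustration of this contrast, consider two simulated groups in R; `ttestBF()` from the BayesFactor package is one common implementation of the Bayesian side, and the data and effect size here are invented purely for illustration:

```r
# Toy contrast of frequentist refutation vs. Bayesian model comparison.
library(BayesFactor) # provides ttestBF()

set.seed(1)
x <- rnorm(50, mean = 0.0) # simulated control group
y <- rnorm(50, mean = 0.3) # simulated treatment group

# Frequentist: attempt to reject the null hypothesis (Popper-style refutation)
t.test(x, y)$p.value

# Bayesian: weigh the evidence for two competing models via a Bayes factor
ttestBF(x = x, y = y)
```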
This pluralistic approach in contemporary science underscores the importance of integrating diverse methodologies to tackle complex questions and enhance our understanding. Despite the differences between frequentist and Bayesian methods, both share a fundamental commitment to the rigorous testing and validation of scientific theories. At the same time, the internet and the developments in science described above created many challenges for scientists in the publication process. Open science practices offer answers to many of these challenges:
### **Publication Bias** and Selective Reporting [@smaldinoOpenScienceModified2019; @fox142OpenScience2021]
- **Problem:** Journals often favor publishing positive or statistically significant results, leaving negative or null findings unpublished.
- **How Open Science Helps:** Pre-registration of studies and the publication of all research outcomes (e.g., via open access repositories) make negative and null findings accessible. By promoting transparency and the sharing of data and methodologies, open science reduces the tendency to selectively report only favorable outcomes.
### **Confirmation Bias** [@fox142OpenScience2021]
- **Problem:** Researchers tend to design analyses and interpret data in ways that confirm their prior hypotheses.
- **How Open Science Helps:** Open science practices, such as pre-registration of studies, help mitigate confirmation bias by specifying hypotheses and analysis plans before data collection.
### **Reproducibility Crisis** [@fox142OpenScience2021]
- **Problem:** Many scientific findings cannot be replicated due to opaque methodologies or unavailable data and code.
- **How Open Science Helps:** Sharing detailed methods, datasets, and analysis scripts in open repositories makes findings verifiable and replicable by other researchers.
### **Algorithmic Bias** [@nororiAddressingBiasBig2021]
- **Problem:** AI systems trained on unrepresentative or undocumented data can reproduce and amplify systemic biases.
- **How Open Science Helps:** Public data and training reports for AI enable external audits of algorithms and the detection of biased outcomes.
### **Inefficiencies in Research Progress**
- **Problem:** Duplication of efforts and siloed research slow down scientific advancements.
- **How Open Science Helps:** Sharing negative results, datasets, and ongoing projects prevents duplication and accelerates innovation.
### **Overemphasis on Novelty**
- **Problem:** The pressure to publish novel findings discourages replication studies or incremental advancements.
- **How Open Science Helps:** Encouraging and funding replication studies through open peer-review processes shifts focus towards reliable and cumulative science.
### **Lack of Peer Review Transparency**
- **Problem:** Traditional peer review is often anonymous and lacks accountability, leading to potential biases or unfair evaluations.
- **How Open Science Helps:** Open peer review, where reviews and reviewer identities are accessible, ensures greater accountability and reduces bias.
### **Authorship and Credit Bias**
- **Problem:** Early-career researchers, women, and underrepresented groups often face challenges in receiving credit for their contributions.
- **How Open Science Helps:** Transparent contributions using tools like the Contributor Roles Taxonomy (CRediT) ensure that all contributors are recognized for their specific roles.
### **Conflicts of Interest**
- **Problem:** Undisclosed funding sources or affiliations may bias research findings.
- **How Open Science Helps:** Transparent declarations of conflicts of interest and funding sources reduce hidden biases.
### **Limited Interdisciplinary Collaboration**
- **Problem:** Barriers to sharing research outputs restrict interdisciplinary collaboration, limiting innovation.
- **How Open Science Helps:** Open sharing of data, methods, and publications fosters cross-disciplinary integration and innovation.
### **Data Access Inequality**
- **Problem:** Researchers in low-resource settings often lack access to expensive journals, datasets, or tools.
- **How Open Science Helps:** Open access publications and open data initiatives democratize access to research outputs, enabling equitable participation in science.
### **Misuse of Metrics (e.g., Impact Factor, h-Index)**
- **Problem:** Reliance on quantitative metrics for evaluating research quality skews scientific priorities.
- **How Open Science Helps:** Encouraging diverse evaluation metrics (e.g., open data reuse, societal impact) ensures fair assessment of research contributions.
### **Cherry-Picking and P-Hacking**
- **Problem:** Selective reporting or manipulating data to achieve statistical significance undermines the integrity of research.
- **How Open Science Helps:** Pre-registration of hypotheses and protocols discourages cherry-picking and promotes adherence to predefined analysis plans.
### **Lack of Public Engagement**
- **Problem:** Complex scientific outputs are often inaccessible to the general public, leading to mistrust or misunderstanding of science.
- **How Open Science Helps:** Open access and lay summaries of research make science more inclusive and comprehensible to non-specialists.
This commitment is rooted in the idea that scientific claims must be substantiated through consistent and reproducible evidence. Modern scientific inquiry, therefore, aligns with the notion that:
> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated coincidence, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]
This raises the question: Have open science practices been adopted within sociology and criminology? How has the use of these practices developed over the last decade?
## Research Question
## Motivation
@scogginsMeasuringTransparencySocial2024a
# Data and Method
- **Problem:** Using both sociology and criminology can bias the trained models, due to the highly different vocabularies of the two disciplines.
According to @scogginsMeasuringTransparencySocial2024a:
**Population**: \[social science\] papers using data and statistics
1. **Gathering Papers**
1. Consult the Clarivate Journal Citation Report to obtain journals in the field
2. Filter downloadable journals (those included in the campus licences)
3. Using the [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus) or [WOS](https://github.com/juba/rwos) API: download publication metadata of all papers in the respective time span (see the metadata sketch after this list)
4. Download HTML Papers
5. Filter to-download list by grabbed html papers
6. Download paper fulltext PDFs: using [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot) or [monk1337/resp](https://github.com/monk1337/resp) (Anna's Archive, Sci-Hub or LibGen would even be possible, but that would be illegal, so of course not) - **really necessary?**
7. Convert HTML and PDF papers to txt ([titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/))
2. Classification
1. Operationalization of ...
1. Papers that use statistical inference
2. Papers that applied preregistration
3. Papers that applied open data practices
4. Papers that offer open materials
5. Open Access (theoretically not interesting?)
6. Papers with Positive Results
2. Definition of Identification keywords/dictionaries for each category
3. Manual classification of a number of papers (between 1k and 2k) for ML model training
4. Creation of [DFMs](https://quanteda.io/reference/dfm.html) using the dictionaries (see the classification sketch after this list)
5. MLM training (Naive Bayes, LogReg, Nonlinear SVM, Random Forest, XGB)
6. MLM evaluation / decision
7. Classification of data using the trained, best performing model
3. Analysis
- One of the two:
- descriptive analysis of the temporal development in proportions in the last 10 years in each discipline, see @scogginsMeasuringTransparencySocial2024a
- Intergroup comparison of effect sizes in a randomly drawn sample of the gathered data. Effect sizes could also be extracted using a trained model.
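The following are minimal, illustrative sketches of two steps above, not a final implementation. The first covers step 1.3 (metadata download) via the Crossref API with rcrossref; the ISSN, date range and page size are placeholder assumptions, not the actual journal list.

```r
# Sketch of step 1.3: download publication metadata via the Crossref API.
library(rcrossref)

meta <- cr_journals(
  issn   = "0011-1287",  # placeholder: one ISSN from the JCR journal list
  works  = TRUE,         # return the journal's papers, not the journal record
  filter = c(from_pub_date = "2014-01-01", until_pub_date = "2024-12-31"),
  limit  = 1000          # maximum page size; use cursor = "*" to page further
)
papers_meta <- meta$data # data frame with DOI, title, authors, dates, ...
```

The second covers steps 2.2 to 2.7 (dictionaries, DFMs, model training, classification) with quanteda and quanteda.textmodels, assuming a data frame `papers` with a `text` column and a hand-coded 0/1 column `open_data` for the training subset; the dictionary keywords are invented placeholders, and only one of the five candidate model families (Naive Bayes) is shown.

```r
# Sketch of steps 2.2-2.7: dictionaries, DFMs, training and classification.
library(quanteda)
library(quanteda.textmodels)

# Step 2.2: identification dictionary (keywords are illustrative placeholders)
os_dict <- dictionary(list(
  open_data = c("data are available", "osf.io", "replication package"),
  prereg    = c("preregistered", "registered report", "aspredicted")
))

# Step 2.4: tokenize and build DFMs; tokens_lookup() maps phrases to categories
toks       <- tokens(corpus(papers, text_field = "text"), remove_punct = TRUE)
dfmat      <- dfm(toks)                         # full document-feature matrix
dfmat_dict <- dfm(tokens_lookup(toks, os_dict)) # dictionary-category counts

# Step 2.5: train one candidate model (Naive Bayes) on the hand-coded subset
train <- !is.na(papers$open_data)
nb    <- textmodel_nb(dfmat[train, ], as.factor(papers$open_data[train]))

# Step 2.7: classify all papers with the best-performing model
papers$open_data_pred <- predict(nb, newdata = dfmat)
```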
Why the huge data collection effort?
- preparation for further research, database might be useful for other research questions
- I want to practice R / ML methods.
- By-hand collection of data on open science practices is very time-consuming. Why not generate the data from the texts?
- From @akkerPreregistrationSecondaryData2021: "To create a control group for comparison with the preregistered studies in our sample, we linked each preregistered publication in our sample to a non-preregistered publication. We did so by checking Web of Sciences list of related papers for every preregistered publication and selecting the first non-preregistered publication from that list that used primary quantitative data and was published in the same year as the related preregistered publication." I think this is kind of questionable.
\newpage
@@ -100,4 +221,4 @@ Hiermit versichere ich, dass ich die vorliegende Arbeit selbstständig und ohne
\begin{tabular}{@{}p{4in}@{}}
\hrulefill \\
Michael Beck, 08.08.2024 \\
\end{tabular}