---
documentclass: article
author: Michael Beck
title: Open Science Practices In Criminology and Social Psychology
description: Disentangling the impact of Open Science practices in Sociology and Criminology
subtitle: Exposé
date: 2024-12-06
lang: en-US
toc: true
toc-depth: 3
number-sections: true
colorlinks: true
urlcolor: blue
linkcolor: blue
link-citations: true
lot: true
lof: true
mainfont: Latin Modern Roman
fontfamilyoptions: lmodern
geometry: left=2.5cm,right=3cm,top=2.5cm,bottom=2.5cm
header-includes: |
  \usepackage{pdflscape}
  \newcommand{\blandscape}{\begin{landscape}}
  \newcommand{\elandscape}{\end{landscape}}
  \setcounter{page}{-1}
  \newcommand{\scellL}[1]{\parbox{4.2cm}{\raggedright\leftskip=1em\hskip-1em#1}}
  \newcommand{\scell}[1]{\parbox{3cm}{\begin{center}#1\end{center}\vspace{1mm}}}
  \newcommand{\scellB}[1]{\parbox{2.5cm}{\begin{center}#1\end{center}\vspace{1mm}}}
include-before: |
  \begin{center}
  \vfill
  \textbf{Research Proposal} \\
  Seminar Comparative Research in Crime and Delinquency \\
  Dr. Alexander Trinidad \\
  \vspace{1cm}
  \hrule
  \end{center}
  Matriculation number: \textbf{7406366}
  \thispagestyle{empty}
  \newpage
  \thispagestyle{empty}
---
\vfill
\newpage
# Intro & Motivation

## Modern Science
The rise of the internet in recent decades has drastically changed our lives: our ways of looking at the world, our social lives, and our consumption patterns - the internet influences all spheres of life, whether we like it or not [@SocietyInternetHow2019]. The surge in interconnectivity enabled a rise in movements that resist the classic definition of intellectual property rights: open source, open scholarship access and open science [@willinskyUnacknowledgedConvergenceOpen2005]. Modern technologies enhanced reliability, speed and efficiency in knowledge development, thereby improving communication, collaboration and access to information and data [@thagardInternetEpistemologyContributions1997; @eisendInternetNewMedium2002; @wardenInternetScienceCommunication2010]. The internet significantly facilitated formal and informal scholarly communication through electronic journals and digital repositories like Academia.edu or ResearchGate [@wardenInternetScienceCommunication2010; @waiteINTERNETKNOWLEDGEEXCHANGE2021]. Evidence also shows that increased access to the internet raises research output [@xuImpactInternetAccess2021]. But greater output does not necessarily imply greater quality, faster progress or greater scientific discoveries. As the availability and thereby the quantity of publications increased, the resulting information overload demands effective filtering and assessment of published results [@wardenInternetScienceCommunication2010].
But how do we define scientific progress? In the mid-20th century, Thomas Kuhn characterized scientific progress as a revolutionary shift in paradigms, the theories accepted in a scientific community at a given time. According to Kuhn, normal science operates within these paradigms, "solving puzzles" and refining theories. However, when anomalies arise that cannot be explained by the current paradigm, a crisis occurs, leading to a scientific revolution [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions1962]. Opposed to that, a critical rationalist approach to scientific progress emerged that saw danger in the process Kuhn described, as paradigms might facilitate confirmation bias and thereby stall progress. This position is embodied in Karl Popper's philosophy of science, which emphasizes falsifiability and the idea that scientific theories progress through conjectures and refutations rather than through paradigm shifts. Popper argued that science advances by eliminating false theories, thus moving closer to the truth in a more linear and cumulative manner [@popperLogicScientificDiscovery2005]. Where Kuhn emphasized the development of dominant theories, Popper suggested the challenging and falsification of those theories.
Social sciences today largely engage in frequentist, deductive reasoning, where significance testing is used to evaluate a null hypothesis and conclusions are drawn from the rejection or acceptance of this hypothesis, aligning with Popper's idea that scientific theories should be open to refutation. This approach is often criticized for its limitations in interpreting p-values and its reliance on long-run frequency interpretations [@UseMisuseClassicala; @wilkinsonTestingNullHypothesis2013]. In contrast, Bayesian inference is associated with inductive reasoning, where models are updated with new data to improve predictions. Bayesian methods allow for the comparison of competing models using tools like Bayes factors, but they do not directly falsify models through significance tests [@gelmanInductionDeductionBaysian2011; @dollBayesianModelSelection2019]. Overall, while falsification remains a cornerstone of scientific methodology, contemporary science often employs a pluralistic approach, integrating diverse methods to address complex questions and advance knowledge [@rowbottomKuhnVsPopper2011]. Despite the differences between frequentist and Bayesian methods, both share a fundamental commitment to the rigorous testing and validation of scientific theories. Beyond this rather theoretical discourse about scientific discovery, there are also many tangible reasons to talk about the scientific method and the publication process. A recent, highly cited article revealed that only a small proportion of the variance in the outcomes of studies based on the same data can be attributed to the choices researchers made in designing their tests: @breznauObservingManyResearchers2022 observed 77 researcher teams analyzing the same dataset to assess the same hypothesis and found results ranging from strongly positive to strongly negative. Less than 50% of the between-team deviance could be explained by assigned conditions, research decisions and researcher characteristics; the rest of the variance remained unexplained. This underlines the importance of transparent research: results are prone to many errors and biases, introduced intentionally or unintentionally by the researcher or induced by the publisher.
"Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]
To counter these biases and to support the possibility of such "repetitions", or replications, of research, a movement has formed within the scientific community, fuelled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. This open science movement tries to establish practices that challenge many of the known biases endangering the reliability of the scientific process.
@banksAnswers18Questions2019 define open science as a broad term that refers to many concepts: scientific philosophies embodying communality and universalism; specific practices operationalizing these norms, such as sharing of data and analytic files, redefinition of confidence thresholds, preregistration of studies and analysis plans, engagement in replication studies, and removal of paywalls; incentive systems to encourage the above practices; and even specific citation standards. This typology is in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024]. The two dominant, highly discussed approaches in open science are open data and preregistration.
Publishing materials, data and code (open data) is necessary to enable replication of studies. Replication makes it possible to assess the pursued research in detail, find errors or bias, or corroborate the results [@dienlinAgendaOpenScience2021]. While many researchers see challenges in publishing their data and materials due to a potentially higher workload, legal concerns or simply a lack of interest, many of these concerns could be addressed by streamlined processes or institutional support [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007]. As open data reduces p-hacking, facilitates new research by enabling reproduction, reveals mistakes in the coding process and enables a diffusion of knowledge about the research process, many researchers, journals and other institutions are starting to adopt open data in their research [@dienlinAgendaOpenScience2021; @finkReplicationCodeAvailability; @freeseAdvancesTransparencyReproducibility2022; @zenk-moltgenFactorsInfluencingData2018; @matternWhyAcademicsUndershare2024].
Preregistration involves thoroughly outlining and documenting research plans and their rationale in a repository. These plans can be made publicly accessible when the researcher decides to share them. The specifics of preregistration can vary based on the research type and may encompass elements such as hypotheses, sampling strategies, interview guides, exclusion criteria, study design, and analysis plans [@managoPreregistrationRegisteredReports2023]. Within this definition, a preregistration shall not prevent exploratory research; deviations from the research plan are still allowed but shall be communicated transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014]. Preregistration impacts research in multiple ways: it helps to perform exploratory and confirmatory research independently, protects against publication bias, as journals typically commit to publishing registered research, and counters "researchers' degrees of freedom" in data analysis by reducing overfitting through cherry-picking, variable swapping, flexible model selection and subsampling [@mertensPreregistrationAnalysesPreexisting2019; @FalsePositivePsychologyUndisclosed]. This minimizes the risk of bias by promoting decision-making that is independent of outcomes. It also enhances transparency, allowing others to evaluate the potential for bias and adjust their confidence in the research findings accordingly [@hardwickeReducingBiasIncreasing2023].
My initial plan for my master's thesis was to study the effect of preregistration on reported effect sizes. During my initial literature review, however, it appeared that very few publications in data-driven criminology and sociology used preregistration. Instead of assessing effect sizes, this raised the question: How have open science practices been adopted within sociology and criminology? How has the use of these practices developed over the last decade?
@scogginsMeasuringTransparencySocial2024a conducted an extensive analysis of almost 100,000 publications in political science and international relations. They found an increasing use of preregistration and open data, with levels still being relatively low. This extensive research not only revealed the current state of open science in political science, but also generated rich data for further meta-research. I therefore intend to apply similar methods to the fields of sociology and criminology. In the following section I describe the intended data collection and research methods, which draw heavily on @scogginsMeasuringTransparencySocial2024a.
# Data and Method
- Problem: using both sociology and criminology can introduce bias into the trained models due to the highly different vocabulary used in the two disciplines
Population, following @scogginsMeasuringTransparencySocial2024a: [social science] papers using data and statistics
- Gathering Papers
  - Consult the Clarivate Journal Citation Reports to obtain journals in the field
  - Filter for downloadable journals (those included in the campus licences)
  - Using the Crossref, Scopus or Web of Science API: download publication metadata of all papers in the respective time span (see the sketch after this list)
  - Download HTML papers
  - Filter the to-download list against the HTML papers already retrieved
  - Download paper full-text PDFs using ferru97/PyPaperBot or monk1337/resp (Anna's Archive, Sci-Hub or LibGen would also be possible, but that would be illegal, so of course not) - is this really necessary?
  - Convert HTML and PDF papers to plain text (titipata/scipdf_parser, aaronsw/html2text, html2text · PyPI)
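A minimal sketch of the metadata-gathering step in R, assuming the `rcrossref` package; the example ISSN, the date range and the selected fields are illustrative assumptions, not final choices:

```r
# Minimal sketch: pull publication metadata for one journal from the
# Crossref API. The ISSN and the time span are assumed examples.
library(rcrossref)
library(dplyr)

# In the full pipeline, the ISSNs would come from the Clarivate JCR list.
journal_issns <- c(criminology = "0011-1384")  # hypothetical example entry

fetch_journal_metadata <- function(issn) {
  res <- cr_works(
    filter = c(issn = issn,
               from_pub_date = "2014-01-01",
               until_pub_date = "2024-12-31",
               type = "journal-article"),
    cursor = "*",         # deep paging through all matching records
    cursor_max = 100000,  # safety cap on the number of records
    limit = 1000          # records per API request
  )
  res$data
}

meta <- bind_rows(lapply(journal_issns, fetch_journal_metadata))
# Keep only the fields needed for the download list and later classification
meta <- select(meta, doi, title, container.title, issued, url)

# Later conversion step (assumed) could use pdftools for the PDFs, e.g.:
# txt <- pdftools::pdf_text("paper.pdf")
```

Scopus and Web of Science offer comparable APIs; Crossref has the practical advantage of not requiring institutional API keys.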
- Classification
- Operationalization of ...
- Papers that use statistical inference
- Papers that applied preregistration
- Papers that applied open data practices
- Papers that offer open materials
- Open Access (theoretically not interesting?)
- Papers with Positive Results
- Definition of Identification keywords/dictionaries for each category
  - Manual classification of a number of papers for ML model training (between 1,000 and 2,000)
  - Creation of document-feature matrices (DFMs) using the dictionaries
  - ML model training (naive Bayes, logistic regression, nonlinear SVM, random forest, XGBoost); see the sketch after this list
  - ML model evaluation and selection
  - Classification of the data using the trained, best-performing model
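A minimal sketch of the dictionary/DFM/training steps in R with `quanteda` and `quanteda.textmodels`; the keyword lists are illustrative stand-ins, `paper_texts` (full texts) and `labels_manual` (hand-coded outcomes) are assumed to exist, and only the naive Bayes baseline is shown:

```r
# Minimal sketch of the dictionary-based classification pipeline.
library(quanteda)
library(quanteda.textmodels)
library(caret)  # confusionMatrix() for evaluation

# Illustrative identification keywords per category (not final dictionaries)
os_dict <- dictionary(list(
  preregistration = c("preregist*", "registered report*", "osf.io"),
  open_data       = c("data availab*", "replication package*", "dataverse"),
  inference       = c("p-value*", "confidence interval*", "regression*")
))

# `paper_texts`: character vector of full texts; `labels_manual`: factor with
# the 1,000-2,000 hand-coded papers used for training (assumed placeholders).
toks      <- tokens(paper_texts, remove_punct = TRUE)
toks_dict <- tokens_lookup(toks, dictionary = os_dict, valuetype = "glob")
dfm_dict  <- dfm(toks_dict)  # one feature per dictionary category

# 80/20 split and a naive Bayes baseline; logistic regression, SVM, random
# forest and XGBoost would follow the same pattern with their own packages.
set.seed(42)
train_idx <- sample(seq_len(ndoc(dfm_dict)), size = 0.8 * ndoc(dfm_dict))
nb   <- textmodel_nb(dfm_dict[train_idx, ], y = labels_manual[train_idx])
pred <- predict(nb, newdata = dfm_dict[-train_idx, ])
confusionMatrix(pred, labels_manual[-train_idx])
```

Using `tokens_lookup()` before building the DFM allows multi-word dictionary patterns such as "registered report*" to match.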
- Analysis
- One of the two:
    - descriptive analysis of the temporal development of the proportions over the last 10 years in each discipline, see @scogginsMeasuringTransparencySocial2024a (a plotting sketch follows after this list)
    - intergroup comparison of effect sizes in a randomly drawn sample of the gathered data. Effect sizes could also be extracted using a trained model.
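A minimal sketch of the descriptive analysis, assuming a classified data frame `papers` with a `year` column, a `discipline` column and logical indicator columns (`prereg`, `open_data`, `open_materials`) produced by the classifier; all column names are assumptions:

```r
# Minimal sketch: yearly shares of papers using each open science practice,
# separately per discipline.
library(dplyr)
library(tidyr)
library(ggplot2)

trends <- papers |>
  group_by(discipline, year) |>
  summarise(across(c(prereg, open_data, open_materials), mean),
            .groups = "drop") |>            # mean of a logical = proportion
  pivot_longer(c(prereg, open_data, open_materials),
               names_to = "practice", values_to = "proportion")

ggplot(trends, aes(year, proportion, colour = practice)) +
  geom_line() +
  facet_wrap(~discipline) +
  labs(x = "Publication year", y = "Share of papers",
       colour = "Open science practice")
```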
## Why the huge data collection effort?
- preparation for further research; the database might be useful for other research questions
- I want to practice R and ML methods
- by-hand collection of data on open science practices is very time-consuming, so why not generate the data from the texts?
- From @akkerPreregistrationSecondaryData2021: "To create a control group for comparison with the preregistered studies in our sample, we linked each preregistered publication in our sample to a non-preregistered publication. We did so by checking Web of Science’s list of related papers for every preregistered publication and selecting the first non-preregistered publication from that list that used primary quantitative data and was published in the same year as the related preregistered publication." I think this matching approach is somewhat questionable.
# Todo
- add stuff about the replication crisis, 1-2 sentences in the introduction. see @scogginsMeasuringTransparencySocial2024a
- improve wording in the last paragraph
# Notes
[@dienlinAgendaOpenScience2021]:

1. publish materials, data and code
2. preregister studies and submit registered reports
3. conduct replication studies
4. collaborate
5. foster open science skills
6. implement the Transparency and Openness Promotion (TOP) Guidelines
7. incentivize open science practices
- Systemic Biases in AI and Big Data: Open science tools can be used to address biases in AI algorithms [@nororiAddressingBiasBig2021].
Publication Bias, selective reporting [@smaldinoOpenScienceModified2019; @fox142OpenScience2021].
- Problem: Journals often favor publishing positive or statistically significant results, leaving negative or null findings unpublished.
- How Open Science Helps: Pre-registration of studies and publication of all research outcomes (e.g., via open access repositories) ensure that all results, including negative or null findings, are accessible, which reduces the bias towards publishing only positive results. By promoting transparency and the sharing of data and methodologies, open science also reduces the tendency to selectively report only favorable outcomes.
Confirmation Bias [@fox142OpenScience2021]
- Problem: Researchers tend to design analyses and interpret evidence in ways that confirm their prior hypotheses.
- How Open Science Helps: Open science practices, such as pre-registration of studies, help mitigate confirmation bias by specifying hypotheses and analysis plans before data collection
Reproducibility Crisis [@fox142OpenScience2021]
- Problem: Many scientific findings cannot be replicated due to opaque methodologies or unavailable data and code.
- How Open Science Helps: Sharing detailed methods, datasets, and analysis scripts in open repositories promotes reproducibility, allowing other researchers to verify and replicate findings.
Algorithmic Bias [@nororiAddressingBiasBig2021]
- Problem: AI systems and big-data models can inherit and amplify biases present in their training data.
- How Open Science Helps: Public datasets and training reports for AI enable external scrutiny and auditing of algorithmic bias.
Inefficiencies in Research Progress
- Problem: Duplication of efforts and siloed research slow down scientific advancements.
- How Open Science Helps: Sharing negative results, datasets, and ongoing projects prevents duplication and accelerates innovation.
Overemphasis on Novelty
- Problem: The pressure to publish novel findings discourages replication studies or incremental advancements.
- How Open Science Helps: Encouraging and funding replication studies through open peer-review processes shifts focus towards reliable and cumulative science.
Lack of Peer Review Transparency
- Problem: Traditional peer review is often anonymous and lacks accountability, leading to potential biases or unfair evaluations.
- How Open Science Helps: Open peer review, where reviews and reviewer identities are accessible, ensures greater accountability and reduces bias.
Authorship and Credit Bias
- Problem: Early-career researchers, women, and underrepresented groups often face challenges in receiving credit for their contributions.
- How Open Science Helps: Transparent contributions using tools like the Contributor Roles Taxonomy (CRediT) ensure that all contributors are recognized for their specific roles.
Conflicts of Interest
- Problem: Undisclosed funding sources or affiliations may bias research findings.
- How Open Science Helps: Transparent declarations of conflicts of interest and funding sources reduce hidden biases.
Limited Interdisciplinary Collaboration
- Problem: Barriers to sharing research outputs restrict interdisciplinary collaboration, limiting innovation.
- How Open Science Helps: Open sharing of data, methods, and publications fosters cross-disciplinary integration and innovation.
Data Access Inequality
- Problem: Researchers in low-resource settings often lack access to expensive journals, datasets, or tools.
- How Open Science Helps: Open access publications and open data initiatives democratize access to research outputs, enabling equitable participation in science.
Misuse of Metrics (e.g., Impact Factor, h-Index)
- Problem: Reliance on quantitative metrics for evaluating research quality skews scientific priorities.
- How Open Science Helps: Encouraging diverse evaluation metrics (e.g., open data reuse, societal impact) ensures fair assessment of research contributions.
Cherry-Picking and P-Hacking
- Problem: Selective reporting or manipulating data to achieve statistical significance undermines the integrity of research.
- How Open Science Helps: Pre-registration of hypotheses and protocols discourages cherry-picking and promotes adherence to predefined analysis plans.
Lack of Public Engagement
- Problem: Complex scientific outputs are often inaccessible to the general public, leading to mistrust or misunderstanding of science.
- How Open Science Helps: Open access and lay summaries of research make science more inclusive and comprehensible to non-specialists.
This commitment is rooted in the idea that scientific claims must be substantiated through consistent and reproducible evidence. Modern scientific inquiry, therefore, aligns with the notion that:
"Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]
\newpage
# References
::: {#refs}
:::
\newpage
# Declaration of Authorship

I hereby declare that I have written this thesis independently and without the use of any aids other than those indicated. All passages taken verbatim or in substance from published or unpublished writings have been marked as such.
\vspace{5cm}
\begin{tabular}{@{}p{4in}@{}} \hrulefill \\ Michael Beck, 08.08.2024 \\ \end{tabular}