## Modern Science

The rise of the internet over the last decades has drastically changed our lives: our ways of looking at the world, our social lives, our consumption patterns - the internet influences all spheres of life, whether we like it or not [@SocietyInternetHow2019]. The surge in interconnectivity enabled the rise of movements that resist the classic definition of intellectual property rights: open source, open access and open science [@willinskyUnacknowledgedConvergenceOpen2005]. Modern technologies enhanced reliability, speed and efficiency in knowledge development, thereby improving communication, collaboration and access to information and data [@thagardInternetEpistemologyContributions1997; @eisendInternetNewMedium2002; @wardenInternetScienceCommunication2010]. The internet significantly facilitated formal and informal scholarly communication through electronic journals and digital repositories like Academia.edu or ResearchGate [@wardenInternetScienceCommunication2010; @waiteINTERNETKNOWLEDGEEXCHANGE2021]. Evidence also shows that an increase in access to the internet increases research output [@xuImpactInternetAccess2021]. But greater output does not necessarily imply greater quality, faster progress or greater scientific discoveries. As availability, and thereby the quantity of publications, increased, the resulting information overload demands effective filtering and assessment of published results [@wardenInternetScienceCommunication2010].

But how do we define scientific progress? In the mid-20th century, Thomas Kuhn characterized scientific progress as revolutionary shifts in paradigms - the theories accepted by a scientific community at a given time. According to Kuhn, normal science operates within these paradigms, "solving puzzles" and refining theories. However, when anomalies arise that cannot be explained by the current paradigm, a crisis occurs, leading to a scientific revolution [@kuhnReflectionsMyCritics1970; @kuhnStructureScientificRevolutions1962]. In opposition, a critical rationalist approach to scientific progress emerged that saw danger in the process Kuhn described, as paradigms might facilitate confirmation bias and thereby stall progress. Karl Popper's philosophy of science emphasizes falsifiability and the idea that scientific theories progress through conjectures and refutations rather than through paradigm shifts. Popper argued that science advances by eliminating false theories, thus moving closer to the truth in a more linear and cumulative manner [@popperLogicScientificDiscovery2005]. Where Kuhn emphasized the development of dominant theories, Popper suggested the challenging, or falsification, of those theories.

Social sciences today largely engage in frequentist, deductive reasoning, where significance testing is used to evaluate the null hypothesis and conclusions are drawn based on the rejection or acceptance of this hypothesis, aligning with Popper's idea that scientific theories should be open to refutation. This approach is often criticized for its limitations in interpreting p-values and its reliance on long-run frequency interpretations [@UseMisuseClassicala; @wilkinsonTestingNullHypothesis2013]. In contrast, Bayesian inference is associated with inductive reasoning, where models are updated with new data to improve predictions. Bayesian methods allow for the comparison of competing models using tools like Bayes factors, but they do not directly falsify models through significance tests [@gelmanInductionDeductionBaysian2011; @dollBayesianModelSelection2019]. Overall, while falsification remains a cornerstone of scientific methodology, contemporary science often employs a pluralistic approach, integrating diverse methods to address complex questions and advance knowledge [@rowbottomKuhnVsPopper2011]. Despite their differences, frequentist and Bayesian methods share a fundamental commitment to the rigorous testing and validation of scientific theories.

But beyond the more theoretically driven discourse about scientific discovery, there are many tangible reasons to talk about the scientific method and the publication process. A recent, highly cited article revealed that only a very small proportion of the variance in the outcomes of studies based on the same data can be attributed to the choices researchers make in designing their tests. @breznauObservingManyResearchers2022 observed 77 research teams analyzing the same dataset to assess the same hypothesis and found that the results ranged from strongly positive to strongly negative. Less than 50% of the between-team deviance could be explained by assigned conditions, research decisions and researcher characteristics; the rest of the variance remained unexplained. This underlines the importance of transparent research: results are prone to many errors and biases, introduced intentionally or unintentionally by the researcher or induced by the publisher.

> "Only by ... repetitions can we convince ourselves that we are not dealing with a mere isolated ‘coincidence’, but with events which, on account of their regularity and reproducibility, are in principle inter-subjectively testable." [@popperLogicScientificDiscovery2005, p. 23]

To counter these biases and to support the possibility of such "repetitions", or replications, of research, a movement has formed within the scientific community, fuelled by the "replication crisis" that was especially prevalent within the field of psychology [@dienlinAgendaOpenScience2021]. The open science movement tries to establish practices that challenge many of the known biases endangering the reliability of the scientific process.

@banksAnswers18Questions2019 define open science as a broad term that refers to many concepts: scientific philosophies embodying communality and universalism; specific practices operationalizing these norms, such as sharing data and analysis files, redefining confidence thresholds, preregistering studies and analysis plans, engaging in replication studies and removing paywalls; incentive systems to encourage the above practices; and even specific citation standards. This typology is in line with the work of many other authors from diverse disciplines [e.g. @dienlinAgendaOpenScience2021; and @greenspanOpenSciencePractices2024]. The two dominant, highly discussed approaches in open science are open data and preregistration.

**Publishing materials, data and code**, or *open data*, is necessary to enable replication of studies. Replication makes it possible to assess the pursued research in detail, find errors or bias, or even support the results [@dienlinAgendaOpenScience2021]. While many researchers see challenges in publishing their data and materials due to a potentially higher workload, legal concerns or simply a lack of interest, many of these concerns could be addressed by streamlined processes or institutional support [@freeseAdvancesTransparencyReproducibility2022; @freeseReplicationStandardsQuantitative2007]. As open data reduces p-hacking, facilitates new research by enabling reproduction, reveals mistakes in the coding process and enables a diffusion of knowledge about the research process, many researchers, journals and other institutions are starting to adopt open data in their research [@dienlinAgendaOpenScience2021; @finkReplicationCodeAvailability; @freeseAdvancesTransparencyReproducibility2022; @zenk-moltgenFactorsInfluencingData2018; @matternWhyAcademicsUndershare2024].

**Preregistration** involves thoroughly outlining and documenting research plans and their rationale in a repository. These plans can be made publicly accessible when the researcher decides to share them. The specifics of preregistration can vary based on the research type and may encompass elements such as hypotheses, sampling strategies, interview guides, exclusion criteria, study design and analysis plans [@managoPreregistrationRegisteredReports2023]. Within this definition, a preregistration shall not prevent exploratory research. Deviations from the research plan are still allowed but shall be communicated transparently [@managoPreregistrationRegisteredReports2023; @nosekRegisteredReports2014]. Preregistration impacts research in multiple ways: it helps to perform exploratory and confirmatory research independently, protects against publication bias as journals typically commit to publishing registered research, and counters "researchers' degrees of freedom" in data analysis by reducing overfitting through cherry-picking, variable swapping, flexible model selection and subsampling [@mertensPreregistrationAnalysesPreexisting2019; @FalsePositivePsychologyUndisclosed]. This minimizes the risk of bias by promoting decision-making that is independent of outcomes. It also enhances transparency, allowing others to evaluate the potential for bias and adjust their confidence in the research findings accordingly [@hardwickeReducingBiasIncreasing2023].

My initial plan for my master's thesis was to study the effect of preregistration on reported effect sizes. During my initial literature review, however, it appeared that very few publications in data-driven criminology and sociology used preregistration. Instead of assessing effect sizes, this raised the question: **How have open science practices been adopted within sociology and criminology? How has the use of these practices developed over the last decade?**

@scogginsMeasuringTransparencySocial2024a conducted an extensive analysis of almost 100,000 publications in political science and international relations. They found increasing use of preregistration and open data, though levels remain relatively low. Their research not only revealed the current state of open science in political science but also generated rich data for further meta-research. I therefore intend to apply similar methods in the fields of sociology and criminology. In the following section I describe the intended data collection and research methods, which draw heavily on @scogginsMeasuringTransparencySocial2024a.

# Data and Method

- **Problem**: using both sociology and criminology can bias the trained models, as the two disciplines use markedly different vocabularies

The intended procedure follows @scogginsMeasuringTransparencySocial2024a:

**Population**: \[social science\] papers using data and statistics

1. **Gathering papers**
    1. Consult the Clarivate Journal Citation Reports to obtain the journals in each field
    2. Filter for downloadable journals (those included in the campus licences)
    3. Via the [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus) or [WOS](https://github.com/juba/rwos) API: download the publication metadata of all papers in the respective time span (a metadata sketch follows this outline)
    4. Download HTML papers
    5. Filter the to-download list by the HTML papers already grabbed
    6. Download paper full-text PDFs using [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot) or [monk1337/resp](https://github.com/monk1337/resp) (it would even be possible to use Anna's Archive, Sci-Hub or LibGen, but that would be illegal, so of course not) - **really necessary?**
    7. Convert HTML and PDF papers to plain text ([titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/); an R-based conversion sketch follows this outline)
2. **Classification**
    1. Operationalization of ...
        1. papers that use statistical inference
        2. papers that applied preregistration
        3. papers that applied open data practices
        4. papers that offer open materials
        5. open access (theoretically not interesting?)
        6. papers with positive results
    2. Definition of identification keywords/dictionaries for each category (see the dictionary sketch below)
    3. Manual classification of a number of papers for ML model training (between 1,000 and 2,000)
    4. Creation of [DFMs](https://quanteda.io/reference/dfm.html) using the dictionaries (also covered in the dictionary sketch below)
    5. ML model training (naive Bayes, logistic regression, nonlinear SVM, random forest, XGBoost; see the training sketch below)
    6. ML model evaluation / decision
    7. Classification of the data using the trained, best-performing model
3. **Analysis** - one of the two:
    - descriptive analysis of the temporal development of the proportions over the last 10 years in each discipline, see @scogginsMeasuringTransparencySocial2024a (a trend sketch closes the code examples below)
    - intergroup comparison of effect sizes in a randomly drawn sample of the gathered data; effect sizes could also be extracted with a trained model

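For step 1.3, a minimal sketch of the metadata download, assuming the Crossref route via `rcrossref`; the ISSN and date range are placeholders, and the real journal list would come from the JCR export:

```r
# Minimal sketch: fetch publication metadata for one journal from Crossref.
# The ISSN and date range are placeholders, not the final selection.
library(rcrossref)

issn <- "0000-0000"  # placeholder; real ISSNs come from the JCR journal list

res <- cr_works(
  filter = c(issn = issn,
             from_pub_date = "2014-01-01",
             until_pub_date = "2024-12-31"),
  cursor = "*",       # deep paging through the full result set
  cursor_max = 10000, # upper bound on records to fetch
  limit = 1000        # records per request
)

# One row per paper: DOI, title, journal, publication date, ...
head(res$data)
```

The Scopus or WOS routes would work analogously through `rscopus` or `rwos`.
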
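For steps 1.4 and 1.7, a sketch of fetching one paper's HTML and stripping it to plain text; `httr` and `rvest` stand in here as an R alternative to the Python tools listed above, the URL is a placeholder, and access through the campus licence is assumed:

```r
# Minimal sketch: download one article page and reduce it to plain text.
# The URL is a placeholder; real URLs come from the downloaded metadata.
library(httr)
library(rvest)

url  <- "https://example.org/article/10.0000/placeholder"
page <- GET(url, user_agent("thesis-corpus-builder"))

txt <- content(page, as = "text", encoding = "UTF-8") |>
  read_html() |>
  html_text2()   # collapses the markup into readable plain text

writeLines(txt, "paper_0001.txt")
```
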
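For steps 2.2 and 2.4, a sketch of how the keyword dictionaries and DFMs could be built with `quanteda`; the categories and patterns are illustrative guesses, not a validated coding scheme:

```r
# Minimal sketch: keyword dictionaries plus a document-feature matrix (DFM).
# The dictionary entries are illustrative, not the final instrument.
library(quanteda)

dict <- dictionary(list(
  preregistration = c("preregist*", "pre-regist*", "registered report*"),
  open_data       = c("open data", "replication package*", "data availability*"),
  open_materials  = c("open materials", "analysis code")
))

# Two toy documents standing in for converted paper full texts.
corp <- corpus(c(
  paper1 = "The study was pre-registered and the replication package is online.",
  paper2 = "We analyze survey data on crime rates in a fixed-effects model."
))

toks <- tokens(corp, remove_punct = TRUE)

# tokens_lookup() matches single- and multi-word patterns; the resulting DFM
# counts dictionary hits per category and paper, usable as model features.
dfmat <- dfm(tokens_lookup(toks, dict))
dfmat
```
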
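For steps 2.5 and 2.6, a sketch of training and evaluating one candidate classifier (naive Bayes via `quanteda.textmodels`); `dfmat_coded` and `labels` are hypothetical stand-ins for the DFM and manual codes of the 1,000-2,000 hand-classified papers:

```r
# Minimal sketch: train one candidate model and check it on a held-out split.
# dfmat_coded (DFM of the hand-coded papers) and labels (factor of manual
# codes, e.g. preregistered vs. not) are hypothetical inputs.
library(quanteda)
library(quanteda.textmodels)
library(caret)  # confusionMatrix() reports accuracy, sensitivity, specificity

set.seed(42)
n         <- ndoc(dfmat_coded)
train_idx <- sample(n, size = floor(0.8 * n))

nb   <- textmodel_nb(dfmat_coded[train_idx, ], y = labels[train_idx])
pred <- predict(nb, newdata = dfmat_coded[-train_idx, ])

# Evaluate on the 20% held-out papers before picking the final model.
confusionMatrix(pred, labels[-train_idx])
```

The other candidates (logistic regression, SVM, random forest, XGBoost) would be compared on the same split before the best model classifies the full corpus.
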
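For the descriptive analysis, a sketch of the intended trend plot; `papers` is a hypothetical data frame with one row per classified paper and columns `year`, `discipline` and logical indicators such as `preregistered`:

```r
# Minimal sketch: yearly share of papers using an open science practice.
# `papers` with columns year, discipline and preregistered is hypothetical.
library(dplyr)
library(ggplot2)

papers |>
  group_by(discipline, year) |>
  summarise(share_prereg = mean(preregistered), .groups = "drop") |>
  ggplot(aes(x = year, y = share_prereg, colour = discipline)) +
  geom_line() +
  labs(x = "Publication year",
       y = "Share of preregistered papers",
       colour = "Discipline")
```
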
Why the huge data collection effort?

- preparation for further research: the database might be useful for other research questions
- I want to practice R and ML methods.
- Collecting data on open science practices by hand is very time-consuming - why not generate the data from the texts?
- From @akkerPreregistrationSecondaryData2021: "To create a control group for comparison with the preregistered studies in our sample, we linked each preregistered publication in our sample to a non-preregistered publication. We did so by checking Web of Science’s list of related papers for every preregistered publication and selecting the first non-preregistered publication from that list that used primary quantitative data and was published in the same year as the related preregistered publication." I think this matching approach is somewhat questionable.

## Todo

- add 1-2 sentences about the replication crisis to the introduction, see @scogginsMeasuringTransparencySocial2024a
- **improve wording in the last paragraph**

# Notes

[@dienlinAgendaOpenScience2021]

1. publish materials, data and code
2. preregister studies and submit registered reports
3. conduct replication studies
4. collaborate
5. foster open science skills
6. implement the Transparency and Openness Promotion (TOP) Guidelines
7. incentivize open science practices

- Reproducibility: open science addresses the reproducibility crisis by making data and methods openly available, allowing other researchers to verify and replicate findings [@fox142OpenScience2021].
- Systemic biases in AI and big data: open science tools can be used to address biases in AI algorithms [@nororiAddressingBiasBig2021].

\newpage