## Data Collection

The process of data collection will closely follow @scogginsMeasuringTransparencySocial2024 and begin with identifying relevant journals in criminal psychology. I will consult the Clarivate Journal Citation Report to obtain a comprehensive list of journals within the field by filtering for the top 100 journals. The Transparency and Openness Promotion Factor[^4] (TOP-Factor) according to @nosekPromotingOpenResearch2015 will then be used to assess each journal's adoption of open science practices and will be included in the journal dataset. Once the relevant journals are identified, I will use APIs such as Crossref, Scopus, and Web of Science to download metadata for all papers published between 2013 and 2023.
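
As an illustration of the metadata step, the sketch below harvests Crossref records for one journal over the 2013–2023 window via the public REST API with cursor-based paging; the ISSN and contact address are placeholders, and Scopus or Web of Science would require their own authenticated clients.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/journals/{issn}/works"

def fetch_journal_metadata(issn: str, mailto: str) -> list[dict]:
    """Download Crossref metadata for all papers a journal published 2013-2023."""
    params = {
        "filter": "from-pub-date:2013-01-01,until-pub-date:2023-12-31",
        "rows": 1000,      # maximum page size Crossref allows
        "cursor": "*",     # deep paging via cursors
        "mailto": mailto,  # identifies the client for Crossref's polite pool
    }
    records = []
    while True:
        resp = requests.get(CROSSREF_WORKS.format(issn=issn), params=params, timeout=30)
        resp.raise_for_status()
        message = resp.json()["message"]
        if not message["items"]:
            break
        records.extend(message["items"])
        params["cursor"] = message["next-cursor"]  # continue where the last page ended
    return records

# Placeholder ISSN and contact address for one of the selected journals
papers = fetch_journal_metadata("0000-0000", mailto="researcher@example.org")
```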

After obtaining the metadata, I will proceed to download the full-text versions of the identified papers. Whenever possible, I will prioritize HTML versions because their structured format simplifies subsequent text extraction. For papers that are not available in HTML, I will consider downloading full-text PDFs. Tools such as PyPaperBot or others[^1] can facilitate this process, although I will strictly adhere to ethical and legal guidelines, avoiding unauthorized sources like Sci-Hub or Anna's Archive and using only sources that are either covered by my institution's campus license or available via open access. If access to full-text papers becomes a limiting factor, I will assess alternative strategies such as collaborating with institutional libraries to request specific papers or identifying open-access repositories that may provide supplementary resources. Papers whose full text remains unavailable will be assigned their own category in the later analysis. Once all available full-text papers are collected, I will preprocess the data by converting HTML and PDF files into plain text using tools such as SciPDF Parser or others[^2]. This preprocessing step ensures that the text is in a standardized format suitable for analysis.
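
The preprocessing step could look roughly like the sketch below, which uses BeautifulSoup for HTML and pdfminer.six for PDFs as stand-ins for SciPDF Parser or the other tools mentioned above; the file path is a placeholder.

```python
from pathlib import Path

from bs4 import BeautifulSoup                  # pip install beautifulsoup4
from pdfminer.high_level import extract_text   # pip install pdfminer.six

def html_to_text(path: Path) -> str:
    """Strip markup from an HTML full text and return plain text."""
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    for tag in soup(["script", "style"]):      # drop non-content elements
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

def pdf_to_text(path: Path) -> str:
    """Extract the embedded text layer of a PDF full text."""
    return extract_text(str(path))

def to_plain_text(path: Path) -> str:
    """Dispatch on file type; HTML is preferred when both versions exist."""
    return html_to_text(path) if path.suffix in {".html", ".htm"} else pdf_to_text(path)

# Placeholder path: one downloaded full text per paper
plain_text = to_plain_text(Path("fulltexts/example_paper.html"))
```
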
[^3]: DDoS: Distributed Denial of Service, see @wangDDoSAttackProtection2015.

[^4]: The TOP-Factor according to @nosekRegisteredReports2014 is a score that assesses a journal's adoption of open science practices; it can be obtained from [topfactor.org](https://topfactor.org/journals).

## Classification

The classification process will begin with operationalizing the key open science practices that I aim to study. This involves defining clear criteria for identifying papers that fall into the categories I plan to classify: papers that use statistical inference, papers that were preregistered, papers that provide open data, papers that offer open materials, and papers that are available via open access.
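
As a first, deliberately simple operationalization, the sketch below flags practices by matching indicator patterns against a paper's plain text; the patterns are illustrative placeholders rather than a validated coding scheme, and open access status would in practice be taken from the journal or paper metadata rather than from the text.

```python
import re

# Illustrative indicator patterns per category (placeholders, to be refined and validated)
CATEGORY_PATTERNS = {
    "statistical_inference": [r"\bp\s*[<=]\s*0?\.\d+", r"confidence interval", r"\bregression\b"],
    "preregistration":       [r"preregist\w*", r"registered report"],
    "open_data":             [r"data (are|is) (openly |publicly )?available", r"osf\.io", r"dataverse"],
    "open_materials":        [r"materials (are|is) (openly |publicly )?available", r"supplementary materials"],
}

def classify(text: str) -> dict[str, bool]:
    """Flag which open science practices a paper's full text appears to report."""
    lowered = text.lower()
    return {
        category: any(re.search(pattern, lowered) for pattern in patterns)
        for category, patterns in CATEGORY_PATTERNS.items()
    }

# Example: classify(plain_text) -> {"statistical_inference": True, "preregistration": False, ...}
```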