added tons of stuff, literature, corrected the makefile, added ResearchPlan, Notes and corrected readme

Michael Beck
2024-12-16 23:56:45 +01:00
parent 4757bcaa73
commit f88ff734bd
6 changed files with 1226 additions and 159 deletions

93
ResearchPlan.md Normal file

@@ -0,0 +1,93 @@
# Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers
## **Population**
- Papers in sociology and criminology utilizing data and statistical methods.
- Focus on evaluating open science practices:
- Pre-registration
- Open data
- Open materials
- Open access
- Statistical inference
## **Data Collection**
1. **Journal Identification**
- Use Clarivate Journal Citation Report API to obtain a comprehensive list of sociology and criminology journals.
- Filter the list to include journals accessible through university licensing agreements.
2. **Metadata Download**
- Use API clients such as [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus), or [WOS](https://github.com/juba/rwos) to download metadata for all papers published between 2013 and 2023.
3. **Full-Text Retrieval**
- Download HTML versions of papers where available, since structured text is easier to extract from HTML.
- Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
- Tools for retrieval:
- [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot), [monk1337/resp](https://github.com/monk1337/resp) (licensed sources only).
- Institutional library services for access.
- Open-access repositories for additional resources.
- Tools for text extraction: [titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text on PyPI](https://pypi.org/project/html2text/).
4. **Preprocessing**
- Convert collected papers to plain text using:
- SciPDF Parser for PDF-to-text conversion.
- HTML-to-text tools like `html2text`.
- Standardize text format for subsequent analysis.
5. **Resource Management**
- Address potential constraints:
- Use scalable data collection methods.
- Leverage institutional resources (e.g., libraries and repositories).
- Implement efficient workflows for text extraction and preprocessing (multicore processing).
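
The preprocessing step above (HTML to standardized plain text) can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it uses Python's standard-library `html.parser` as a dependency-free stand-in for `html2text`, and the names `_TextExtractor` and `html_to_plain_text` are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and skip <script>/<style> content (hypothetical helper)."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "section",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block-level tags start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html: str) -> str:
    """Convert an HTML string to plain text with standardized whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)  # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)       # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)       # collapse blank lines
    return text.strip()
```

In a real run, `html_to_plain_text` would presumably be replaced by a call into `html2text`, with SciPDF Parser handling the PDF branch. Because the function takes one document and returns one string, it could be mapped over files in parallel for the multicore processing mentioned above.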
## **Classification**
1. **Operationalization**
- Define clear criteria for identifying open science practices:
- Pre-registration: Terms like "pre-registered."
- Open data: Phrases like "data availability statement."
- Open materials: Statements like "materials available on request."
2. **Keyword Dictionary Creation**
- Develop dictionaries of terms and phrases associated with each open science practice.
- Base dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
- Compare the resulting dictionaries and merge them.
3. **Manual Annotation**
- Manually classify a subset of 1,000 to 2,000 papers for training machine learning models.
- Use stratified sampling to ensure diversity in:
- Journals
- Publication years
- Subfields within sociology and criminology.
4. **Feature Extraction**
- Create document-feature matrices (DFMs) using keyword dictionaries to prepare data for machine learning.
5. **Model Training**
- Train multiple machine learning models:
- Naive Bayes
- Logistic Regression
- Support Vector Machines
- Random Forests
- Gradient Boosted Trees
- Evaluate model performance to select the best classifier for each open science practice.
6. **Automated Classification**
- Apply the best-performing models to classify the entire dataset.
- Automate the identification of open science practices across all collected papers.
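
The feature extraction step above can be illustrated with a small dictionary-based document-feature matrix. This is a hedged sketch using only the standard library: the `DICTIONARIES` content is a placeholder built from the example phrases in the operationalization step (the spelling variant `preregistered` is an added assumption), and `document_feature_matrix` is a hypothetical helper name, not part of any library.

```python
import re

# Placeholder dictionaries: the quoted phrases come from the plan's examples;
# "preregistered" is an assumed spelling variant added for illustration.
DICTIONARIES = {
    "preregistration": ["pre-registered", "preregistered"],
    "open_data": ["data availability statement"],
    "open_materials": ["materials available on request"],
}


def document_feature_matrix(docs, dictionaries):
    """Count dictionary-term occurrences per document.

    Returns (features, rows): features is a list of (practice, term) pairs,
    and rows[i][j] is the number of times features[j] occurs in docs[i].
    """
    features = [(practice, term)
                for practice, terms in dictionaries.items()
                for term in terms]
    # Case-insensitive match that does not fire inside a longer word.
    patterns = [re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
                for _, term in features]
    rows = []
    for text in docs:
        normalized = " ".join(text.split())  # make line breaks irrelevant
        rows.append([len(p.findall(normalized)) for p in patterns])
    return features, rows
```

The resulting rows could then serve as input to the classifiers listed in the model training step. Real dictionaries would be much larger and based on the prior research cited above.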
## **Analysis**
1. **Descriptive Analysis**
- Examine temporal trends in the adoption of open science practices over the past decade.
- Compare practices across sociology and criminology.
- Compare practices across journals.
2. **Evaluation of Results**
- Identify patterns in:
- Prevalence of pre-registration, open data, open materials, and open access.
- Statistical inference methods.
3. **Ethical Considerations**
- Ensure all methodologies comply with ethical and legal guidelines.
- Avoid unauthorized sources such as Sci-Hub or LibGen.
4. **Broader Implications**
- Contribute to understanding the adoption of transparency and reproducibility in social sciences.
- Inform efforts to promote open science practices in sociology, criminology, and beyond.
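
The temporal trend part of the descriptive analysis amounts to computing, for each publication year, the share of papers flagged with each open science practice. A minimal sketch, assuming each classified paper is represented as a dictionary with a `year` key and one boolean flag per practice (this record layout and the function name `yearly_shares` are illustrative assumptions):

```python
from collections import defaultdict


def yearly_shares(records, practices):
    """Return {year: {practice: share of that year's papers flagged}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for rec in records:
        year = rec["year"]
        totals[year] += 1
        for practice in practices:
            if rec.get(practice, False):  # a missing flag counts as not flagged
                hits[year][practice] += 1
    return {
        year: {p: hits[year][p] / totals[year] for p in practices}
        for year in sorted(totals)
    }
```

Grouping the same records by journal or by discipline instead of by year would give the cross-journal and sociology versus criminology comparisons.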