# Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers
## **Population**
- Papers in sociology and criminology utilizing data and statistical methods.
- Focus on evaluating open science practices:
  - Pre-registration
  - Open data
  - Open materials
  - Open access
  - Statistical inference
## **Data Collection**
1. **Journal Identification**
- Use Clarivate Journal Citation Report API to obtain a comprehensive list of sociology and criminology journals.
- Filter the list to include journals accessible through university licensing agreements.
2. **Metadata Download**
- Use APIs such as [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus), or [WOS](https://github.com/juba/rwos) to download metadata for all papers published between 2013 and 2023.
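
As a minimal sketch of the metadata step, assuming the public Crossref REST API is queried directly (the linked clients wrap comparable services), the following builds a per-journal request for works published from 2013 through 2023. The ISSN, the `mailto` address, and the helper names `build_works_url` and `fetch_page` are placeholders for illustration, not part of the plan.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_BASE = "https://api.crossref.org"


def build_works_url(issn, start_year=2013, end_year=2023, rows=100,
                    cursor="*", mailto=None):
    """Build a Crossref request URL for one journal's works in a date range."""
    params = {
        "filter": f"from-pub-date:{start_year}-01-01,"
                  f"until-pub-date:{end_year}-12-31",
        "rows": rows,
        "cursor": cursor,  # "*" starts cursor-based paging
    }
    if mailto:
        params["mailto"] = mailto  # contact address for polite API use
    return f"{CROSSREF_BASE}/journals/{issn}/works?{urlencode(params)}"


def fetch_page(url):
    """Fetch one page of results; returns (items, next_cursor).

    Requires network access, so it is not exercised here.
    """
    with urlopen(url, timeout=30) as resp:
        message = json.load(resp)["message"]
    return message.get("items", []), message.get("next-cursor")
```

A full download would call `fetch_page` repeatedly, passing each returned cursor back into `build_works_url` until a page comes back empty.
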
3. **Full-Text Retrieval**
- Download HTML versions of papers where available, since HTML is easier to convert into structured text.
- Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
- Tools for retrieval:
  - [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot), [monk1337/resp](https://github.com/monk1337/resp) (licensed sources only).
  - Institutional library services for access.
  - Open-access repositories for additional resources.
- Tools for text extraction: [titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/).
4. **Preprocessing**
- Convert collected papers to plain text using:
  - SciPDF Parser for PDF-to-text conversion.
  - HTML-to-text tools like `html2text`.
- Standardize text format for subsequent analysis.
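
The plan does not spell out the standardization step. One possible sketch, using only the Python standard library, is shown here. The specific choices (NFKC normalization, rejoining words hyphenated across line breaks, lower-casing) are assumptions, and `standardize_text` is a hypothetical helper name.

```python
import re
import unicodedata


def standardize_text(raw):
    """Normalize extracted text into a consistent plain-text form."""
    text = unicodedata.normalize("NFKC", raw)     # fold ligatures and non-breaking spaces
    text = text.replace("\u00ad", "")             # drop soft hyphens
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # rejoin words split at line breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip().lower()
```

Rejoining at line breaks also removes the hyphen from genuine compounds that happen to wrap (for example "pre-" followed by "registered"), so keyword dictionaries would need to cover both hyphenated and unhyphenated spellings.
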
5. **Resource Management**
- Address potential constraints:
  - Use scalable data collection methods.
  - Leverage institutional resources (e.g., libraries and repositories).
  - Implement efficient workflows for text extraction and preprocessing (e.g., multicore processing).
## **Classification**
1. **Operationalization**
- Define clear criteria for identifying open science practices:
  - Pre-registration: Terms like "pre-registered."
  - Open data: Phrases like "data availability statement."
  - Open materials: Statements like "materials available on request."
2. **Keyword Dictionary Creation**
- Develop dictionaries of terms and phrases associated with each open science practice.
- Base dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
- Compare and join dictionaries.
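
Comparing and joining dictionaries could look roughly like the sketch below. It assumes each dictionary maps a practice label to a list of terms. The labels, the helper names, and any terms beyond those quoted in the operationalization step are illustrative only.

```python
import re


def normalize_term(term):
    """Lower-case a term and collapse internal whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())


def join_dictionaries(*dictionaries):
    """Union several {practice: [terms]} dictionaries, removing duplicates."""
    merged = {}
    for dictionary in dictionaries:
        for practice, terms in dictionary.items():
            merged.setdefault(practice, set()).update(
                normalize_term(t) for t in terms
            )
    return {practice: sorted(terms) for practice, terms in merged.items()}


def compare_dictionaries(a, b):
    """For each practice, list terms shared by both and unique to each."""
    report = {}
    for practice in sorted(set(a) | set(b)):
        terms_a = {normalize_term(t) for t in a.get(practice, [])}
        terms_b = {normalize_term(t) for t in b.get(practice, [])}
        report[practice] = {
            "shared": sorted(terms_a & terms_b),
            "only_a": sorted(terms_a - terms_b),
            "only_b": sorted(terms_b - terms_a),
        }
    return report
```

The comparison report makes it easy to review which terms come only from prior research and which only from the project's own list before the union is used downstream.
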
3. **Manual Annotation**
- Manually classify a subset of 1,000 to 2,000 papers for training machine learning models.
- Use stratified sampling to ensure diversity in:
  - Journals
  - Publication years
  - Subfields within sociology and criminology.
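
A simple way to draw such an annotation sample is proportional allocation across strata, sketched here with the standard library. The record fields (`journal`, `year`) and the rule of taking at least one paper per stratum are assumptions rather than decisions stated in the plan.

```python
import random
from collections import defaultdict


def stratified_sample(papers, keys, n_total, seed=0):
    """Draw about n_total papers, allocated proportionally across strata."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for paper in papers:
        strata[tuple(paper[k] for k in keys)].append(paper)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share, with at least one paper per non-empty stratum.
        share = round(n_total * len(members) / len(papers))
        k = min(len(members), max(1, share))
        sample.extend(rng.sample(members, k))
    return sample
```

Because of rounding and the one-paper minimum, the returned sample can be slightly larger or smaller than `n_total` when there are many small strata.
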
4. **Feature Extraction**
- Create document-feature matrices (DFMs) using keyword dictionaries to prepare data for machine learning.
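
To make the document-feature matrix concrete, the sketch below counts whole-phrase, case-insensitive matches of each dictionary term in each document. It is a plain-Python stand-in for whatever text-analysis library is eventually used; `build_dfm` and the practice labels are hypothetical names.

```python
import re


def build_dfm(documents, dictionary):
    """Return (features, matrix): one row per document, one column per term."""
    features = [
        (practice, term)
        for practice in sorted(dictionary)
        for term in sorted(dictionary[practice])
    ]
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for _, term in features
    ]
    # Each cell holds the number of times the term occurs in the document.
    matrix = [[len(p.findall(doc)) for p in patterns] for doc in documents]
    return features, matrix
```

Counts could also be summed per practice or converted to binary indicators, depending on what the downstream classifiers handle best.
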
5. **Model Training**
- Train multiple machine learning models:
  - Naive Bayes
  - Logistic Regression
  - Support Vector Machines
  - Random Forests
  - Gradient Boosted Trees
- Evaluate model performance to select the best classifier for each open science practice.
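
The plan does not name an evaluation metric. As one hedged possibility, models could be compared by F1 score on a held-out part of the manually annotated papers. The sketch assumes binary labels coded 0/1 and that each model's predictions are already available; the function names are illustrative.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def select_best_model(y_true, predictions):
    """Return (best_name, scores) with models ranked by held-out F1."""
    scores = {
        name: precision_recall_f1(y_true, y_pred)
        for name, y_pred in predictions.items()
    }
    best = max(scores, key=lambda name: scores[name][2])
    return best, scores
```

Running the selection once per open science practice would yield a separate best classifier for each, as the plan intends. F1 is a sensible default when a practice such as pre-registration is rare, since plain accuracy would reward always predicting the majority class.
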
6. **Automated Classification**
- Apply the best-performing models to classify the entire dataset.
- Automate the identification of open science practices across all collected papers.
## **Analysis**
1. **Descriptive Analysis**
- Examine temporal trends in the adoption of open science practices over the past decade.
- Compare practices across sociology and criminology.
- Compare practices across journals.
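
The descriptive comparisons above reduce to computing, per group, the share of papers classified as using a practice. A minimal sketch follows, assuming each classified paper is a record with a grouping field and a boolean flag per practice; the field names are hypothetical.

```python
from collections import defaultdict


def prevalence_by(records, practice, group_key="year"):
    """Share of papers per group that are flagged for the given practice."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        positives[group] += int(bool(record[practice]))
    return {group: positives[group] / totals[group] for group in sorted(totals)}
```

Grouping by `year` gives the temporal trend, while passing a discipline or journal field as `group_key` gives the cross-field and cross-journal comparisons.
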
2. **Evaluation of Results**
- Identify patterns in:
  - Prevalence of pre-registration, open data, open materials, and open access.
  - Statistical inference methods.
3. **Ethical Considerations**
- Ensure all methodologies comply with ethical and legal guidelines.
- Avoid unauthorized sources such as Sci-Hub or LibGen.
4. **Broader Implications**
- Contribute to understanding the adoption of transparency and reproducibility in social sciences.
- Inform efforts to promote open science practices in sociology, criminology, and beyond.