# Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers

## **Population**
- Papers in sociology and criminology utilizing data and statistical methods.
- Focus on evaluating open science practices:
    - Pre-registration
    - Open data
    - Open materials
    - Open access
    - Statistical inference

## **Data Collection**
1. **Journal Identification**
    - Use the Clarivate Journal Citation Reports API to obtain a comprehensive list of sociology and criminology journals.
    - Filter the list to include journals accessible through university licensing agreements.

2. **Metadata Download**
    - Use APIs such as [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus), or [WOS](https://github.com/juba/rwos) to download metadata for all papers published between 2013 and 2023.

3. **Full-Text Retrieval**
    - Download HTML versions of papers where available, for ease of structured text extraction.
    - Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
    - Tools for retrieval:
        - [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot), [monk1337/resp](https://github.com/monk1337/resp) (licensed sources only).
        - Institutional library services for access.
        - Open-access repositories for additional resources.
    - Tools for text extraction: [titipata/scipdf\_parser](https://github.com/titipata/scipdf_parser), [aaronsw/html2text](https://github.com/aaronsw/html2text), [html2text · PyPI](https://pypi.org/project/html2text/).

4. **Preprocessing**
    - Convert collected papers to plain text using:
        - SciPDF Parser for PDF-to-text conversion.
        - HTML-to-text tools like `html2text`.
    - Standardize text format for subsequent analysis.

5. **Resource Management**
    - Address potential constraints:
        - Use scalable data collection methods.
        - Leverage institutional resources (e.g., libraries and repositories).
        - Implement efficient workflows for text extraction and preprocessing (e.g., multicore processing).
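
The metadata download step can be sketched against the Crossref REST API. This is a minimal sketch rather than the project's pipeline: it assumes Python instead of the R clients linked in the outline, and the helper names `crossref_works_url` and `parse_work`, the `rows` default, and the placeholder ISSN in the usage note are illustrative assumptions. It only builds the request URL and reduces one returned record; the HTTP call itself is left out.

```python
from urllib.parse import urlencode

CROSSREF_BASE = "https://api.crossref.org"


def crossref_works_url(issn, start_year=2013, end_year=2023,
                       rows=200, cursor="*", mailto=None):
    """Build a Crossref URL listing journal articles for one ISSN and date range."""
    filters = ",".join([
        f"from-pub-date:{start_year}-01-01",
        f"until-pub-date:{end_year}-12-31",
        "type:journal-article",
    ])
    # cursor="*" starts deep paging; later pages reuse the cursor from each response
    params = {"filter": filters, "rows": rows, "cursor": cursor}
    if mailto:
        # Optional contact address, which Crossref uses for its "polite" pool
        params["mailto"] = mailto
    return f"{CROSSREF_BASE}/journals/{issn}/works?{urlencode(params, safe=':,*')}"


def parse_work(item):
    """Reduce one Crossref work record to the fields needed downstream."""
    date_parts = item.get("issued", {}).get("date-parts", [[None]])
    return {
        "doi": item.get("DOI"),
        "title": (item.get("title") or [None])[0],
        "journal": (item.get("container-title") or [None])[0],
        "year": date_parts[0][0] if date_parts and date_parts[0] else None,
    }
```

For example, `crossref_works_url("1234-5678")` (a placeholder ISSN) yields a URL under `/journals/1234-5678/works` restricted to 2013 through 2023. Scopus and Web of Science would need their own authenticated clients.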
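
One possible reading of "standardize text format" is sketched here, assuming Python and only the standard library. The function name `standardize_text` and the specific steps (Unicode NFKC folding, soft-hyphen removal, rejoining words split across line breaks, whitespace collapsing, lowercasing) are assumptions, since the outline does not specify them.

```python
import re
import unicodedata


def standardize_text(raw):
    """Normalize extracted text so keyword matching behaves the same across sources."""
    # Fold ligatures, non-breaking spaces, and other compatibility characters
    text = unicodedata.normalize("NFKC", raw)
    # Drop soft hyphens left over from PDF or HTML extraction
    text = text.replace("\u00ad", "")
    # Rejoin words that were hyphenated across a line break
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) to a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()
```

Because the function has no side effects, it could be mapped over documents with `multiprocessing.Pool.map` for the multicore processing mentioned under Resource Management. Note that rejoining line-broken words also turns a line-broken "pre-registered" into "preregistered", so keyword dictionaries would probably need both spellings.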

## **Classification**
1. **Operationalization**
    - Define clear criteria for identifying open science practices:
        - Pre-registration: Terms like "pre-registered."
        - Open data: Phrases like "data availability statement."
        - Open materials: Statements like "materials available on request."

2. **Keyword Dictionary Creation**
    - Develop dictionaries of terms and phrases associated with each open science practice.
    - Base dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
    - Compare the resulting dictionaries and merge them.

3. **Manual Annotation**
    - Manually classify a subset of 1,000–2,000 papers for training machine learning models.
    - Use stratified sampling to ensure diversity in:
        - Journals
        - Publication years
        - Subfields within sociology and criminology.

4. **Feature Extraction**
    - Create document-feature matrices (DFMs) using keyword dictionaries to prepare data for machine learning.

5. **Model Training**
    - Train multiple machine learning models:
        - Naive Bayes
        - Logistic Regression
        - Support Vector Machines
        - Random Forests
        - Gradient Boosted Trees
    - Evaluate model performance to select the best classifier for each open science practice.

6. **Automated Classification**
    - Apply the best-performing models to classify the entire dataset.
    - Automate the identification of open science practices across all collected papers.
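
The dictionary and feature-extraction steps can be illustrated with a small sketch, assuming Python and the standard library. Only "pre-registered", "data availability statement", and "materials available on request" come from the outline; every other term, the practice keys, and the function names are hypothetical placeholders for dictionaries that would be built from prior research.

```python
import re

# Hypothetical starter dictionaries, not the project's final term lists
DICTIONARIES = {
    "preregistration": ["pre-registered", "preregistered",
                        "pre-registration", "preregistration"],
    "open_data": ["data availability statement", "replication data"],
    "open_materials": ["materials available on request", "replication materials"],
}


def compile_patterns(dictionaries):
    """Compile one case-insensitive, word-bounded regex per practice."""
    patterns = {}
    for practice, terms in dictionaries.items():
        # Longest terms first so longer phrases win over their prefixes
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        patterns[practice] = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return patterns


def document_feature_matrix(docs, dictionaries):
    """Return one row per document holding the match count for each practice."""
    patterns = compile_patterns(dictionaries)
    return [
        {practice: len(pattern.findall(text)) for practice, pattern in patterns.items()}
        for text in docs
    ]
```

Each row is a plain dictionary of counts, which keeps the sketch dependency-free; a real pipeline might convert the rows to a sparse matrix before model training.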
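
Stratified sampling for the manual annotation set might look like the following sketch, assuming Python, proportional allocation, and paper records stored as dictionaries. The field names `journal` and `year`, the function name, and the rule of drawing at least one paper per stratum are assumptions, not choices stated in the outline.

```python
import random
from collections import defaultdict


def stratified_sample(papers, strata_keys, n_total, seed=0):
    """Draw about n_total papers, allocating draws in proportion to stratum size."""
    rng = random.Random(seed)  # fixed seed keeps the annotation sample reproducible
    strata = defaultdict(list)
    for paper in papers:
        strata[tuple(paper[key] for key in strata_keys)].append(paper)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share of the target, with at least one paper per stratum
        k = max(1, round(n_total * len(members) / len(papers)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Because of rounding and the one-paper minimum, the returned sample can be slightly larger or smaller than `n_total` when strata are small or uneven.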
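
The outline does not say which metric decides the "best" classifier. As one hedged possibility, the sketch below compares models by F1 score on held-out manual labels, which may be more informative than accuracy if a practice turns out to be rare. The function names and the binary 0/1 label encoding are assumptions.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def select_best_model(y_true, predictions_by_model):
    """Return the model name with the highest F1, plus all F1 scores."""
    scores = {
        name: precision_recall_f1(y_true, preds)[2]
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

The selection would be run once per open science practice, with `predictions_by_model` holding each trained model's predictions on the same held-out papers.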

## **Analysis**
1. **Descriptive Analysis**
    - Examine temporal trends in the adoption of open science practices between 2013 and 2023.
    - Compare practices across sociology and criminology.
    - Compare practices across journals.

2. **Evaluation of Results**
    - Identify patterns in:
        - Prevalence of pre-registration, open data, open materials, and open access.
        - Statistical inference methods.

3. **Ethical Considerations**
    - Ensure all methodologies comply with ethical and legal guidelines.
    - Avoid unauthorized sources such as Sci-Hub or LibGen.

4. **Broader Implications**
    - Contribute to understanding the adoption of transparency and reproducibility in social sciences.
    - Inform efforts to promote open science practices in sociology, criminology, and beyond.
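
The temporal-trend part of the descriptive analysis could be computed as a yearly prevalence, sketched here in Python under the assumption that each classified paper is a dictionary with a `year` field and one boolean field per practice. The field names and the function name are hypothetical.

```python
from collections import defaultdict


def prevalence_by_year(records, practice):
    """Share of papers per publication year classified as using `practice`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        totals[record["year"]] += 1
        positives[record["year"]] += int(bool(record[practice]))
    # Years come back in ascending order so the result reads as a time series
    return {year: positives[year] / totals[year] for year in sorted(totals)}
```

Filtering `records` by discipline or journal before calling the function would give the cross-field and cross-journal comparisons listed under Descriptive Analysis.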