Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers
Population
- Papers in sociology and criminology utilizing data and statistical methods.
- Focus on evaluating open science practices:
  - Pre-registration
  - Open data
  - Open materials
  - Open access
  - Statistical inference
Data Collection
- Journal Identification
  - Use the Clarivate Journal Citation Reports API to obtain a comprehensive list of sociology and criminology journals.
  - Filter the list to journals accessible through university licensing agreements.
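The filtering step can be sketched as a comparison on ISSNs. This is a minimal illustration, not the project's implementation: the `Journal` record, its fields, and the idea that the library's licensed holdings arrive as a list of ISSNs are all assumptions, and no Clarivate API call is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Journal:
    """Hypothetical record for one journal from the category list."""
    title: str
    issn: str
    category: str


def normalize_issn(issn: str) -> str:
    """Drop hyphens and spaces and upper-case, so '1111-222x' equals '1111222X'."""
    return issn.replace("-", "").replace(" ", "").upper()


def filter_licensed(journals, licensed_issns):
    """Keep only journals whose ISSN appears in the licensed holdings."""
    licensed = {normalize_issn(i) for i in licensed_issns}
    return [j for j in journals if normalize_issn(j.issn) in licensed]
```

Normalizing both sides before comparing avoids silent mismatches when the two sources format ISSNs differently.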
- Metadata Download
- Full-Text Retrieval
  - Download HTML versions of papers where available, for ease of structured text extraction.
  - Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
  - Tools for retrieval:
    - ferru97/PyPaperBot, monk1337/resp (licensed sources only).
    - Institutional library services for access.
    - Open-access repositories for additional resources.
  - Tools for parsing: titipata/scipdf_parser, aaronsw/html2text (html2text on PyPI).
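The HTML-first, PDF-fallback rule can be written down as a small decision function. This is a sketch under the assumption that each paper's metadata carries optional HTML and PDF locations; the function name and return shape are hypothetical.

```python
from typing import Optional, Tuple


def choose_source(html_url: Optional[str],
                  pdf_url: Optional[str]) -> Optional[Tuple[str, str]]:
    """Prefer the HTML version; fall back to the PDF; return None if neither exists."""
    if html_url:
        return ("html", html_url)
    if pdf_url:
        return ("pdf", pdf_url)
    return None
```

Returning the format alongside the location lets the preprocessing step pick the matching converter without inspecting the file again.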
- Preprocessing
  - Convert collected papers to plain text using:
    - SciPDF Parser for PDF-to-text conversion.
    - HTML-to-text tools like html2text.
  - Standardize text format for subsequent analysis.
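One possible reading of "standardize text format" is sketched below using only the standard library. The specific rules (Unicode NFKC normalization, rejoining words split across line breaks, collapsing whitespace) are assumptions about what standardization should cover, not steps named in the plan.

```python
import re
import unicodedata


def standardize_text(raw: str) -> str:
    """Normalize converted paper text into a consistent plain-text form."""
    # Fold ligatures and other compatibility characters (e.g. U+FB01 -> "fi").
    text = unicodedata.normalize("NFKC", raw)
    # Use "\n" as the only line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break ("statis-\ntical" -> "statistical").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim spaces around line breaks.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

One caveat: the de-hyphenation rule also turns a genuinely hyphenated term split at a line break, such as "pre-" followed by "registered", into "preregistered", so keyword dictionaries would need to cover both spellings.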
- Resource Management
  - Address potential constraints:
    - Use scalable data collection methods.
    - Leverage institutional resources (e.g., libraries and repositories).
    - Implement efficient workflows for text extraction and preprocessing (multicore processing).
Classification
- Operationalization
  - Define clear criteria for identifying open science practices:
    - Pre-registration: Terms like "pre-registered."
    - Open data: Phrases like "data availability statement."
    - Open materials: Statements like "materials available on request."
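The example criteria above could be expressed as regular expressions. This is a minimal sketch: the patterns cover only the three quoted phrases plus a few spelling variants that are assumptions (for example "preregistration" and "upon request"), and the practice labels are hypothetical names.

```python
import re

# One illustrative pattern per practice; a real rule set would be far broader.
PATTERNS = {
    "pre_registration": re.compile(r"\bpre-?\s?regist(?:ered|ration)\b", re.IGNORECASE),
    "open_data": re.compile(r"\bdata availability statement\b", re.IGNORECASE),
    "open_materials": re.compile(
        r"\bmaterials (?:are )?available (?:up)?on request\b", re.IGNORECASE
    ),
}


def detect_practices(text: str) -> dict:
    """Return, for each practice, whether its pattern occurs anywhere in the text."""
    return {name: bool(pattern.search(text)) for name, pattern in PATTERNS.items()}
```

A rule-based detector like this is also a useful baseline against which the trained classifiers can later be compared.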
- Keyword Dictionary Creation
  - Develop dictionaries of terms and phrases associated with each open science practice.
  - Base dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
  - Compare the dictionaries and merge them into one.
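Comparing and merging dictionaries can be sketched with set operations, assuming each dictionary maps a practice label to a list of terms. The function names, the practice labels, and the normalization rule (lower-casing and collapsing whitespace) are illustrative assumptions.

```python
def normalize(term: str) -> str:
    """Lower-case a term and collapse internal whitespace."""
    return " ".join(term.lower().split())


def join_dictionaries(*dictionaries):
    """Merge several {practice: terms} dictionaries into one, without duplicates."""
    merged = {}
    for dictionary in dictionaries:
        for practice, terms in dictionary.items():
            merged.setdefault(practice, set()).update(normalize(t) for t in terms)
    return merged


def compare_dictionaries(a, b):
    """Report shared and unique terms per practice for two dictionaries."""
    report = {}
    for practice in sorted(set(a) | set(b)):
        terms_a = {normalize(t) for t in a.get(practice, ())}
        terms_b = {normalize(t) for t in b.get(practice, ())}
        report[practice] = {
            "shared": terms_a & terms_b,
            "only_a": terms_a - terms_b,
            "only_b": terms_b - terms_a,
        }
    return report
```

The comparison report shows where two source dictionaries disagree, which is worth reviewing by hand before accepting the merged version.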
- Manual Annotation
  - Manually classify a subset of 1,000–2,000 papers for training machine learning models.
  - Use stratified sampling to ensure diversity in:
    - Journals
    - Publication years
    - Subfields within sociology and criminology.
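A stratified draw for the annotation subset might look like the following sketch. Proportional allocation with at least one paper per stratum is an assumption (the plan does not fix an allocation rule), as are the record fields used in the usage example.

```python
import random
from collections import defaultdict


def stratified_sample(papers, n_total, key, seed=0):
    """Draw roughly n_total papers, allocated proportionally across strata.

    `key` maps a paper to its stratum, e.g. (journal, year). Every stratum
    contributes at least one paper, so the result can deviate slightly
    from n_total.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for paper in papers:
        strata[key(paper)].append(paper)

    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        share = round(n_total * len(members) / len(papers))
        k = min(len(members), max(1, share))
        sample.extend(rng.sample(members, k))
    return sample
```

For example, `stratified_sample(papers, 1500, key=lambda p: (p["journal"], p["year"]))` would stratify by journal and publication year; adding a subfield to the key works the same way.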
- Feature Extraction
  - Create document-feature matrices (DFMs) using keyword dictionaries to prepare data for machine learning.
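A dictionary-based DFM can be sketched in plain Python as one row per document and one column per dictionary term, with cells holding occurrence counts. Using individual terms as columns (rather than one column per practice) is an assumption, as are the function name and the case-insensitive, whole-word matching.

```python
import re


def build_dfm(documents, dictionary):
    """Build a document-feature matrix of term counts.

    `dictionary` maps practice labels to lists of terms. Returns the sorted
    list of unique terms (the columns) and one row of counts per document.
    """
    features = sorted({t.lower() for terms in dictionary.values() for t in terms})
    patterns = [re.compile(r"\b" + re.escape(f) + r"\b") for f in features]

    matrix = []
    for document in documents:
        text = document.lower()
        matrix.append([len(p.findall(text)) for p in patterns])
    return features, matrix
```

The returned rows can be passed directly to most machine learning libraries as a feature array, with the manual annotations as labels.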
- Model Training
  - Train multiple machine learning models:
    - Naive Bayes
    - Logistic Regression
    - Support Vector Machines
    - Random Forests
    - Gradient Boosted Trees
  - Evaluate model performance to select the best classifier for each open science practice.
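The selection step can be illustrated independently of any particular library. The sketch below assumes that each candidate model has already produced binary predictions on a held-out portion of the annotated papers, and it uses the F1 score as the selection criterion; the plan does not name a metric, so F1 is an assumption.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = practice present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best(y_true, predictions):
    """Score each model's predictions and return the best model's name and all scores."""
    scores = {name: f1_score(y_true, pred) for name, pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Running `select_best` once per open science practice yields a separate winner for each, matching the plan to choose the best classifier per practice. F1 is a reasonable default when a practice is rare, since plain accuracy would reward a model that always predicts "absent".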
- Automated Classification
  - Apply the best-performing models to classify the entire dataset.
  - Automate the identification of open science practices across all collected papers.
Analysis
- Descriptive Analysis
  - Examine temporal trends in the adoption of open science practices over the past decade.
  - Compare practices across sociology and criminology.
  - Compare practices across journals.
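The temporal trend for one practice reduces to a yearly share of classified papers. The sketch below assumes each classified paper is a record with a `year` field and one boolean field per practice; these field names are hypothetical.

```python
from collections import defaultdict


def prevalence_by_year(records, practice):
    """Share of papers per publication year in which `practice` was detected."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["year"]] += 1
        if record[practice]:
            hits[record["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Grouping by discipline or journal instead of year follows the same pattern with a different grouping field.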
- Evaluation of Results
  - Identify patterns in:
    - Prevalence of pre-registration, open data, open materials, and open access.
    - Statistical inference methods.
- Ethical Considerations
  - Ensure all methodologies comply with ethical and legal guidelines.
  - Avoid unauthorized sources such as Sci-Hub or LibGen.
- Broader Implications
  - Contribute to understanding the adoption of transparency and reproducibility practices in the social sciences.
  - Inform efforts to promote open science practices in sociology, criminology, and beyond.