# Data, Method, and Analysis of Open Science Practices in Sociology and Criminology Papers

## **Population**

- Papers in sociology and criminology that use data and statistical methods.
- Focus on evaluating open science practices:
  - Pre-registration
  - Open data
  - Open materials
  - Open access
  - Statistical inference

## **Data Collection**

1. **Journal Identification**
    - Use the Clarivate Journal Citation Reports API to obtain a comprehensive list of sociology and criminology journals.
    - Filter the list to journals accessible through university licensing agreements.
2. **Metadata Download**
    - Use API clients such as [Crossref](https://github.com/ropensci/rcrossref), [Scopus](https://github.com/muschellij2/rscopus), or [WOS](https://github.com/juba/rwos) to download metadata for all papers published between 2013 and 2023 (see the Crossref sketch after this section).
3. **Full-Text Retrieval**
    - Download HTML versions of papers where available, for ease of structured text extraction.
    - Use full-text PDFs when HTML is not available, adhering strictly to ethical and legal guidelines.
    - Tools for retrieval:
        - [ferru97/PyPaperBot](https://github.com/ferru97/PyPaperBot) and [monk1337/resp](https://github.com/monk1337/resp) (licensed sources only).
        - Institutional library services for access.
        - Open-access repositories for additional resources.
4. **Preprocessing**
    - Convert collected papers to plain text using:
        - SciPDF Parser ([titipata/scipdf_parser](https://github.com/titipata/scipdf_parser)) for PDF-to-text conversion.
        - HTML-to-text tools such as [aaronsw/html2text](https://github.com/aaronsw/html2text) ([html2text · PyPI](https://pypi.org/project/html2text/)).
    - Standardize the text format for subsequent analysis (see the conversion sketch after this section).
5. **Resource Management**
    - Address potential constraints:
        - Use scalable data collection methods.
        - Leverage institutional resources (e.g., libraries and repositories).
        - Implement efficient workflows for text extraction and preprocessing (multicore processing).
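The plan above names R clients (rcrossref, rscopus, rwos). As one illustration of the metadata-download step, here is a minimal Python sketch against the public Crossref REST API; the ISSN, date filters, and selected fields are assumptions for the example, not part of the plan.

```python
"""Minimal sketch: pull Crossref metadata for one journal (by ISSN) for 2013-2023,
using cursor-based deep paging. ISSN, filters, and selected fields are assumptions."""
import requests

BASE_URL = "https://api.crossref.org/journals/{issn}/works"

def fetch_journal_metadata(issn, mailto="researcher@example.edu"):
    # 'mailto' is a placeholder contact address; Crossref asks for one for polite use.
    params = {
        "filter": "from-pub-date:2013-01-01,until-pub-date:2023-12-31,type:journal-article",
        "select": "DOI,title,container-title,issued,link",
        "rows": 1000,        # maximum page size
        "cursor": "*",       # start a deep-paging cursor
        "mailto": mailto,
    }
    records = []
    while True:
        resp = requests.get(BASE_URL.format(issn=issn), params=params, timeout=60)
        resp.raise_for_status()
        message = resp.json()["message"]
        items = message.get("items", [])
        if not items:
            break
        records.extend(items)
        params["cursor"] = message["next-cursor"]   # continue from the last page
    return records

# Example (placeholder ISSN, not taken from the plan):
# papers = fetch_journal_metadata("0003-1224")
```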
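For the preprocessing step, a minimal conversion sketch is shown below, assuming the `html2text` package for HTML and `scipdf_parser` (which requires a running GROBID server) for PDFs; the file paths and the choice to drop links and images are illustrative assumptions.

```python
"""Sketch: convert downloaded papers to plain text. Assumes the html2text package
for HTML; the PDF branch assumes scipdf_parser with a local GROBID server running,
as described in that project's README."""
import html2text

def html_file_to_text(path):
    converter = html2text.HTML2Text()
    converter.ignore_links = True     # keep running text only
    converter.ignore_images = True
    with open(path, encoding="utf-8") as fh:
        return converter.handle(fh.read())

def pdf_file_to_text(path):
    # Assumption: scipdf.parse_pdf_to_dict returns a dict with 'abstract' and
    # 'sections' entries, per the scipdf_parser documentation.
    import scipdf
    article = scipdf.parse_pdf_to_dict(path)
    parts = [article.get("abstract", "")]
    parts += [section.get("text", "") for section in article.get("sections", [])]
    return "\n\n".join(part for part in parts if part)

# text = html_file_to_text("papers/example_paper.html")   # hypothetical path
# text = pdf_file_to_text("papers/example_paper.pdf")     # hypothetical path
```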
## **Classification**

1. **Operationalization**
    - Define clear criteria for identifying each open science practice:
        - Pre-registration: terms such as "pre-registered."
        - Open data: phrases such as "data availability statement."
        - Open materials: statements such as "materials available on request."
2. **Keyword Dictionary Creation**
    - Develop dictionaries of terms and phrases associated with each open science practice.
    - Base the dictionaries on prior research (e.g., @scogginsMeasuringTransparencySocial2024a).
    - Compare dictionaries from different sources and merge them into a single dictionary per practice (see the dictionary sketch at the end of this document).
3. **Manual Annotation**
    - Manually classify a subset of 1,000–2,000 papers to serve as training data for the machine learning models (a sampling sketch appears at the end of this document).
    - Use stratified sampling to ensure diversity in:
        - Journals
        - Publication years
        - Subfields within sociology and criminology
4. **Feature Extraction**
    - Create document-feature matrices (DFMs) from the keyword dictionaries to prepare the data for machine learning (also illustrated in the dictionary sketch at the end of this document).
5. **Model Training**
    - Train multiple machine learning models:
        - Naive Bayes
        - Logistic Regression
        - Support Vector Machines
        - Random Forests
        - Gradient Boosted Trees
    - Evaluate model performance to select the best classifier for each open science practice (see the model comparison sketch at the end of this document).
6. **Automated Classification**
    - Apply the best-performing models to classify the entire dataset.
    - Automate the identification of open science practices across all collected papers.

## **Analysis**

1. **Descriptive Analysis**
    - Examine temporal trends in the adoption of open science practices over the past decade (see the trends sketch at the end of this document).
    - Compare practices between sociology and criminology.
    - Compare adoption rates across journals.
2. **Evaluation of Results**
    - Identify patterns in:
        - The prevalence of pre-registration, open data, open materials, and open access.
        - Statistical inference methods.
3. **Ethical Considerations**
    - Ensure all methodologies comply with ethical and legal guidelines.
    - Avoid unauthorized sources such as Sci-Hub or LibGen.
4. **Broader Implications**
    - Contribute to understanding the adoption of transparency and reproducibility in the social sciences.
    - Inform efforts to promote open science practices in sociology, criminology, and beyond.
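To make the dictionary and feature-extraction steps (Classification steps 2 and 4) concrete, here is a hedged sketch: the keyword patterns are illustrative placeholders (the real dictionaries would be drawn from prior work such as @scogginsMeasuringTransparencySocial2024a), and counting regex hits per practice is just one simple way to build a document-feature matrix.

```python
"""Sketch: keyword dictionaries per practice and a simple document-feature matrix.
The phrases below are illustrative placeholders, not the study's actual dictionaries."""
import re
import pandas as pd

DICTIONARIES = {
    "preregistration": [r"pre-?registered", r"pre-?registration", r"registered report"],
    "open_data":       [r"data availability statement", r"data (?:are|is) available",
                        r"replication (?:data|files)"],
    "open_materials":  [r"materials? (?:are )?available", r"code availability"],
    "open_access":     [r"open access", r"creative commons"],
}

def count_features(text, dictionaries=DICTIONARIES):
    """Count regex hits per practice in one document's plain text."""
    lowered = text.lower()
    return {practice: sum(len(re.findall(pattern, lowered)) for pattern in patterns)
            for practice, patterns in dictionaries.items()}

def build_dfm(texts):
    """Document-feature matrix: one row per paper, one count column per practice."""
    return pd.DataFrame([count_features(text) for text in texts])

# Toy example:
# dfm = build_dfm(["This study was pre-registered ...",
#                  "A data availability statement accompanies this article ..."])
```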
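For the manual-annotation subset (Classification step 3), a proportional stratified sample by journal and publication year could be drawn as below; the column names and the 1,500-paper target are assumptions.

```python
"""Sketch: proportional stratified sample of papers for manual annotation.
Column names ('journal', 'year') and the target size are assumptions."""
import pandas as pd

def stratified_sample(metadata: pd.DataFrame, n_total=1500, seed=42):
    # Sample the same fraction from every journal-year stratum so the subset
    # mirrors the distribution of the full corpus.
    frac = n_total / len(metadata)
    return (metadata
            .groupby(["journal", "year"], group_keys=False)
            .apply(lambda stratum: stratum.sample(frac=frac, random_state=seed)))

# annotation_set = stratified_sample(papers_metadata)   # papers_metadata is hypothetical
```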
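A minimal scikit-learn sketch of the model comparison in Classification step 5: it evaluates the five listed model families on dictionary-based features and manual labels for a single practice using cross-validated F1. The feature matrix and label vector are assumed inputs.

```python
"""Sketch: compare the five candidate classifiers for one open science practice
using 5-fold cross-validated F1. X is a document-feature matrix (e.g., the DFM
above) and y holds the manual labels for one practice; both are assumed inputs."""
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

MODELS = {
    "naive_bayes":       MultinomialNB(),
    "logistic":          LogisticRegression(max_iter=1000),
    "svm":               LinearSVC(),
    "random_forest":     RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

def compare_models(X, y, cv=5):
    """Return the mean cross-validated F1 score for each candidate model."""
    return {name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
            for name, model in MODELS.items()}

# scores = compare_models(dfm.values, labels["open_data"])   # dfm and labels are hypothetical
# best_model = max(scores, key=scores.get)
```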
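For the descriptive analysis, yearly adoption of each practice can be summarized as the share of classified papers flagged for it, split by discipline; the column names are assumptions.

```python
"""Sketch: yearly adoption rates of each practice by discipline. Assumes a
DataFrame with 'year', 'discipline', and one 0/1 indicator column per practice."""
import pandas as pd

PRACTICES = ["preregistration", "open_data", "open_materials", "open_access"]

def adoption_trends(papers: pd.DataFrame) -> pd.DataFrame:
    # The mean of a 0/1 indicator is the share of papers using that practice.
    return (papers
            .groupby(["discipline", "year"])[PRACTICES]
            .mean()
            .reset_index())

# trends = adoption_trends(classified_papers)          # classified_papers is hypothetical
# trends.pivot(index="year", columns="discipline", values="open_data").plot()
```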