# SciPaperLoader: Flask Application Initial Structure
## Project Overview
**SciPaperLoader** is a Flask-based web application for managing scientific papers. It provides a web interface (with Jinja2 templates) enhanced by **Alpine.js** for interactive UI components and **HTMX** for partial page updates without full reloads. The application is composed of two main parts: a Flask web app (serving pages for uploading data, configuring schedules, and viewing logs) and a background **scraper daemon** that runs independently to perform long-running tasks (like fetching paper details on a schedule). The project is organized following Flask best practices (using blueprints, separating static files and templates) and is set up for easy development and testing (with configuration files and a pytest test fixture).
## Quick Start
Run the application with `make run`, then open it in the browser at [http://localhost:5000/](http://localhost:5000/).
## Prerequisites
- Python >=3.8
## Development environment
- `make venv`: creates a virtualenv with dependencies and this application
installed in [development mode](http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode)
- `make run`: runs a development server in debug mode (changes in source code
are reloaded automatically)
- `make format`: reformats code
- `make lint`: runs flake8
- `make mypy`: runs type checks by mypy
- `make test`: runs tests (see also: [Testing Flask Applications](https://flask.palletsprojects.com/en/3.0.x/testing/))
- `make dist`: creates a wheel distribution (will run tests first)
- `make clean`: removes virtualenv and build artifacts
- add application dependencies in `pyproject.toml` under `project.dependencies`;
add development dependencies under `project.optional-dependencies.*`; run
`make clean && make venv` to reinstall the environment
## Task Processing Architecture
SciPaperLoader uses **APScheduler** for all task processing:
- **Periodic Tasks**: Hourly scraper scheduling with randomized paper processing
- **Background Tasks**: CSV uploads, manual paper processing, and all async operations
- **Job Management**: Clean job scheduling, revocation, and status tracking
This unified architecture provides reliable task processing with simple, maintainable code.
### Running Components
- `make run`: starts the Flask application with integrated APScheduler
For development monitoring:
- Access the Flask admin interface for APScheduler job monitoring
- View real-time logs in the application's activity log section
### How It Works
**For CSV Uploads:**
1. File is uploaded through the web interface
2. APScheduler creates a background job to process the file
3. Browser shows progress updates via AJAX polling
4. Results are displayed when processing completes
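The upload flow above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `JOB_STATUS` registry, `process_csv`, and `poll` are hypothetical names, and the real application presumably stores job state in its database rather than in memory.

```python
import csv
import io

# Hypothetical in-memory job registry (assumption); the real application
# presumably persists job state in its database instead.
JOB_STATUS = {}


def process_csv(job_id, csv_text):
    """Parse CSV text row by row, recording progress for a polling client."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    JOB_STATUS[job_id] = {"state": "running", "processed": 0, "total": len(rows)}
    for index, row in enumerate(rows, start=1):
        # Placeholder for the real per-row work (e.g. storing a paper record).
        _ = row
        JOB_STATUS[job_id]["processed"] = index
    JOB_STATUS[job_id]["state"] = "done"
    return JOB_STATUS[job_id]


def poll(job_id):
    """Return what a status endpoint could send back to the browser's poll."""
    return JOB_STATUS.get(job_id, {"state": "unknown"})


if __name__ == "__main__":
    sample = "title,doi\nPaper A,10.1000/a\nPaper B,10.1000/b\n"
    process_csv("job-1", sample)
    print(poll("job-1"))
```

With APScheduler, a function like this would typically be handed to the scheduler with `scheduler.add_job(process_csv, args=[job_id, csv_text])`, which runs it once in the background while the browser keeps polling the status endpoint.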
**For Scheduled Scraping:**
1. APScheduler runs hourly at the top of each hour
2. Papers are selected based on volume and schedule configuration
3. Individual paper processing jobs are scheduled at random times within the hour
4. All jobs are tracked in the database with complete visibility
This unified architecture provides reliable task processing without external dependencies.
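As a rough sketch of step 3 in the scheduled-scraping flow, the snippet below spreads a batch of papers over random times within one hour. The function name `random_run_times` and the integer paper IDs are assumptions for illustration; the project's own selection and scheduling logic may differ.

```python
import random
from datetime import datetime, timedelta


def random_run_times(hour_start, paper_ids, rng=None):
    """Assign each paper a random run time within the hour after hour_start."""
    if rng is None:
        rng = random.Random()
    schedule = []
    for paper_id in paper_ids:
        # randrange(3600) yields 0..3599, so every job stays inside the hour.
        offset = timedelta(seconds=rng.randrange(3600))
        schedule.append((hour_start + offset, paper_id))
    # Sort by run time so the plan reads chronologically.
    return sorted(schedule, key=lambda item: item[0])


if __name__ == "__main__":
    top_of_hour = datetime.now().replace(minute=0, second=0, microsecond=0)
    for run_at, paper_id in random_run_times(top_of_hour, [101, 102, 103]):
        print(paper_id, run_at.isoformat())
```

Each `(run_at, paper_id)` pair could then become a one-off APScheduler job, for example via `scheduler.add_job(func, "date", run_date=run_at, args=[paper_id])`, where `func` stands for whatever per-paper processing function the application defines.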
## Configuration
Default configuration is loaded from `scipaperloader.defaults` and can be
overridden by environment variables with a `FLASK_` prefix. See
[Configuring from Environment Variables](https://flask.palletsprojects.com/en/3.0.x/config/#configuring-from-environment-variables).
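To make the override rule concrete, here is a simplified, standard-library imitation of how prefixed environment variables can replace defaults. It is only a sketch: `load_prefixed_env` is a hypothetical helper, and it is an assumption that the project relies on Flask's `Config.from_prefixed_env`, which behaves similarly (and additionally supports nested keys).

```python
import json


def load_prefixed_env(defaults, environ, prefix="FLASK_"):
    """Return defaults overridden by any environment variable with the prefix."""
    config = dict(defaults)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variables are ignored
        key = name[len(prefix):]
        try:
            # Values are parsed as JSON where possible ("true" -> True, "5" -> 5).
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Anything that is not valid JSON stays a plain string.
            value = raw
        config[key] = value
    return config


if __name__ == "__main__":
    defaults = {"DEBUG": False, "SECRET_KEY": "dev"}
    environ = {
        "FLASK_DEBUG": "true",
        "FLASK_SQLALCHEMY_DATABASE_URI": "sqlite:///papers.db",
    }
    print(load_prefixed_env(defaults, environ))
```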
### Task Processing Configuration
APScheduler automatically uses your configured database for job persistence, so no additional configuration is required.
For advanced configuration, you can set:
- `FLASK_SQLALCHEMY_DATABASE_URI`: Database URL (APScheduler uses the same database)
Consider using
[dotenv](https://flask.palletsprojects.com/en/3.0.x/cli/#environment-variables-from-dotenv)
to keep these variables in a `.env` file.
## Database Migrations with Flask-Migrate
SciPaperLoader uses Flask-Migrate (based on Alembic) to handle database schema changes. This allows for version-controlled database updates that can be applied or rolled back as needed.
### Database Migration Commands
- `make db-migrate message="Description of changes"`: Create a new migration script based on detected model changes
- `make db-upgrade`: Apply all pending migration scripts to the database
- `make db-downgrade`: Revert the most recent migration
- `make reset-db`: Reset the database completely (delete, initialize, and migrate)
### Working with Migrations
When you make changes to the database models (in `models.py`):
1. Create a migration: `make db-migrate message="Add user roles table"`
2. Review the generated migration script in the `migrations/versions/` directory
3. Apply the migration: `make db-upgrade`
4. To roll back a problematic migration: `make db-downgrade`
Before applying migrations in production, always back up the database with `make backup-db`.
## Deployment
See [Deploying to Production](https://flask.palletsprojects.com/en/3.0.x/deploying/).
You may use the distribution (`make dist`) to publish it to a package index,
deliver it to your server, or copy it in your `Dockerfile`, and install it with `pip`.
You must set a
[SECRET_KEY](https://flask.palletsprojects.com/en/3.0.x/tutorial/deploy/#configure-the-secret-key)
in production to a secret and stable value.
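One way to produce such a value is Python's `secrets` module. The helper name `generate_secret_key` is hypothetical, and passing the key as `FLASK_SECRET_KEY` is an assumption that follows the `FLASK_` prefix convention described under Configuration.

```python
import secrets


def generate_secret_key(num_bytes=32):
    """Return a random hex string suitable for use as a Flask SECRET_KEY."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Generate once, store it securely, and reuse it across restarts:
    # regenerating the key invalidates existing sessions.
    print(f"FLASK_SECRET_KEY={generate_secret_key()}")
```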
### Deploying with APScheduler
When deploying to production:
1. APScheduler jobs are automatically persistent in your database
2. The Flask application handles all background processing internally
3. No external message broker or workers required
4. Scale by running multiple Flask instances with a shared database
## Troubleshooting and Diagnostics
SciPaperLoader includes a collection of diagnostic and emergency tools to help address issues with the application, particularly with the scraper and APScheduler task system.
### Quick Access
For easy access to all diagnostic tools through an interactive menu:
```bash
# Using Make:
make diagnostics
# Using the shell scripts (works with any shell):
./tools/run-diagnostics.sh
# Fish shell version:
./tools/run-diagnostics.fish
# Or directly with Python:
python tools/diagnostics/diagnostic_menu.py
```
### Diagnostic Tools
All diagnostic tools are located in the `tools/diagnostics/` directory:
- **check_state.py**: Quickly check the current state of the scraper in the database
- **diagnose_scraper.py**: Comprehensive diagnostic tool that examines tasks, logs, and scraper state
- **inspect_tasks.py**: View currently running and scheduled APScheduler tasks
- **test_reversion.py**: Test the paper reversion functionality when stopping the scraper
### Emergency Recovery
For cases where the scraper is stuck or behaving unexpectedly:
- **emergency_stop.py**: Force stops all scraper activities, revokes all running tasks, and reverts papers from "Pending" state
- **quick_fix.py**: Simplified emergency stop that also stops Flask processes to ensure code changes are applied
### Usage Example
```bash
# Check the current state of the scraper
python tools/diagnostics/check_state.py
# Diagnose issues with tasks and logs
python tools/diagnostics/diagnose_scraper.py
# Emergency stop when scraper is stuck
python tools/diagnostics/emergency_stop.py
```
For more information, see:
- The README in the `tools/diagnostics/` directory
- The comprehensive `tools/DIAGNOSTIC_GUIDE.md` for troubleshooting specific issues