# SciPaperLoader: Flask Application Initial Structure

## Project Overview

**SciPaperLoader** is a Flask-based web application for managing scientific papers. Its web interface is built with Jinja2 templates, enhanced by **Alpine.js** for interactive UI components and **HTMX** for partial page updates without full reloads. The application has two main parts: a Flask web app (serving pages for uploading data, configuring schedules, and viewing logs) and a background **scraper daemon** that runs independently to perform long-running tasks (such as fetching paper details on a schedule). The project follows Flask best practices (blueprints, separate static files and templates) and is set up for easy development and testing (configuration files and a pytest test fixture).

## Quick Start

Run the application:

    make run

And open it in the browser at [http://localhost:5000/](http://localhost:5000/)

## Prerequisites

- Python >=3.8

## Development environment

- `make venv`: creates a virtualenv with dependencies and this application
  installed in [development mode](http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode)

- `make run`: runs a development server in debug mode (changes in source code
  are reloaded automatically)

- `make format`: reformats code

- `make lint`: runs flake8

- `make mypy`: runs type checks by mypy

- `make test`: runs tests (see also: [Testing Flask Applications](https://flask.palletsprojects.com/en/3.0.x/testing/))

- `make dist`: creates a wheel distribution (will run tests first)

- `make clean`: removes virtualenv and build artifacts

- add application dependencies in `pyproject.toml` under `project.dependencies`;
  add development dependencies under `project.optional-dependencies.*`; run
  `make clean && make venv` to reinstall the environment

## Task Processing Architecture

SciPaperLoader uses **APScheduler** for all task processing:

- **Periodic Tasks**: Hourly scraper scheduling with randomized paper processing
- **Background Tasks**: CSV uploads, manual paper processing, and all async operations
- **Job Management**: Clean job scheduling, revocation, and status tracking

This unified architecture provides reliable task processing with simple, maintainable code.

### Running Components

- `make run`: starts the Flask application with integrated APScheduler

For development monitoring:
- Access the Flask admin interface for APScheduler job monitoring
- View real-time logs in the application's activity log section

### How It Works

**For CSV Uploads:**
1. File is uploaded through the web interface
2. APScheduler creates a background job to process the file
3. Browser shows progress updates via AJAX polling
4. Results are displayed when processing completes

**For Scheduled Scraping:**
1. An APScheduler job runs at the top of each hour
2. Papers are selected based on volume and schedule configuration
3. Individual paper processing jobs are scheduled at random times within the hour
4. All jobs are tracked in the database with complete visibility

Because scheduling and execution both run inside the Flask process, no external message broker or worker process is needed.

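The hourly planning step can be sketched as follows. This is an illustration under stated assumptions: `plan_hourly_jobs` is a hypothetical helper, `volume` is assumed to mean the maximum number of papers to process in the hour, and the selection is a plain random sample rather than the project's actual selection logic.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional, Sequence, Tuple


def plan_hourly_jobs(
    paper_ids: Sequence[int],
    hour_start: datetime,
    volume: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, datetime]]:
    """Pick up to `volume` papers and give each a random run time in the hour."""
    rng = rng or random.Random()
    # Never request more papers than are available.
    selected = rng.sample(list(paper_ids), k=min(volume, len(paper_ids)))
    plan = [
        # Random offset of 0..3599 seconds keeps every job inside the hour.
        (paper_id, hour_start + timedelta(seconds=rng.randrange(3600)))
        for paper_id in selected
    ]
    # Sort by run time so the plan reads chronologically.
    return sorted(plan, key=lambda item: item[1])
```

Assuming APScheduler 3.x, each `(paper_id, run_at)` pair could then be registered with something like `scheduler.add_job(process_paper, "date", run_date=run_at, args=[paper_id])`, where `process_paper` is a hypothetical job function. The planner itself could be triggered with `scheduler.add_job(planner, "cron", minute=0)`.
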
## Configuration

Default configuration is loaded from `scipaperloader.defaults` and can be
overridden by environment variables with a `FLASK_` prefix. See
[Configuring from Environment Variables](https://flask.palletsprojects.com/en/3.0.x/config/#configuring-from-environment-variables).

### Task Processing Configuration

APScheduler automatically uses your configured database for job persistence. No additional configuration is required.

For advanced configuration, you can set:
- `FLASK_SQLALCHEMY_DATABASE_URI`: Database URL (APScheduler uses the same database)

Consider using
[dotenv](https://flask.palletsprojects.com/en/3.0.x/cli/#environment-variables-from-dotenv).

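The override mechanism described in this section can be sketched with the standard library. This is an illustration of the idea, not the project's code: `apply_env_overrides` is a hypothetical helper that mirrors the behaviour of Flask's `Config.from_prefixed_env`, which strips the prefix and tries to parse each value as JSON before falling back to the raw string.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional


def apply_env_overrides(
    defaults: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
    prefix: str = "FLASK_",
) -> Dict[str, Any]:
    """Return `defaults` updated with every `prefix`-ed environment variable."""
    config = dict(defaults)
    source = os.environ if environ is None else environ
    for name, raw in source.items():
        if not name.startswith(prefix):
            continue  # unrelated variable, leave the config untouched
        key = name[len(prefix):]
        try:
            # "true" -> True, "5" -> 5, '"text"' -> "text"
            config[key] = json.loads(raw)
        except ValueError:
            # Not valid JSON (for example a database URL): keep the string.
            config[key] = raw
    return config
```

For example, with `FLASK_SQLALCHEMY_DATABASE_URI=sqlite:///papers.db` in the environment, the returned mapping carries that URL under `SQLALCHEMY_DATABASE_URI` in place of the default.
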
## Database Migrations with Flask-Migrate

SciPaperLoader uses Flask-Migrate (based on Alembic) to handle database schema changes. This allows for version-controlled database updates that can be applied or rolled back as needed.

### Database Migration Commands

- `make db-migrate message="Description of changes"`: Create a new migration script based on detected model changes
- `make db-upgrade`: Apply all pending migration scripts to the database
- `make db-downgrade`: Revert the most recent migration
- `make reset-db`: Reset the database completely (delete, initialize, and migrate)

### Working with Migrations

When you make changes to the database models (in `models.py`):

1. Create a migration: `make db-migrate message="Add user roles table"`
2. Review the generated migration script in the `migrations/versions/` directory
3. Apply the migration: `make db-upgrade`
4. To roll back a problematic migration: `make db-downgrade`

Always back up the database with `make backup-db` before applying migrations in production.

## Deployment

See [Deploying to Production](https://flask.palletsprojects.com/en/3.0.x/deploying/).

You may use the distribution (`make dist`) to publish it to a package index,
deliver to your server, or copy in your `Dockerfile`, and install it with `pip`.

You must set a
[SECRET_KEY](https://flask.palletsprojects.com/en/3.0.x/tutorial/deploy/#configure-the-secret-key)
in production to a secret and stable value.

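One way to produce a suitable `SECRET_KEY` value is Python's `secrets` module. This is a minimal sketch: `generate_secret_key` is a hypothetical helper, and the 32-byte length is an assumption rather than a project requirement.

```python
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string suitable for use as a SECRET_KEY value."""
    # token_hex returns two hex characters per random byte.
    return secrets.token_hex(num_bytes)
```

Generate the value once and store it outside the code base (for example in an environment variable) so that it stays stable across restarts. Generating a new key on every start would invalidate existing sessions.
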
### Deploying with APScheduler

When deploying to production:

1. APScheduler jobs are automatically persisted in your database
2. The Flask application handles all background processing internally
3. No external message broker or workers are required
4. Scale by running multiple Flask instances with a shared database (APScheduler 3.x does not coordinate several schedulers on one job store, so check that only one instance runs the scheduler to avoid duplicate job runs)

## Troubleshooting and Diagnostics

SciPaperLoader includes a collection of diagnostic and emergency tools to help address issues with the application, particularly with the scraper and APScheduler task system.

### Quick Access

For easy access to all diagnostic tools through an interactive menu:

```bash
# Using Make:
make diagnostics

# Using the shell scripts (works with any shell):
./tools/run-diagnostics.sh

# Fish shell version:
./tools/run-diagnostics.fish

# Or directly with Python:
python tools/diagnostics/diagnostic_menu.py
```

### Diagnostic Tools

All diagnostic tools are located in the `tools/diagnostics/` directory:

- **check_state.py**: Quickly check the current state of the scraper in the database
- **diagnose_scraper.py**: Comprehensive diagnostic tool that examines tasks, logs, and scraper state
- **inspect_tasks.py**: View currently running and scheduled APScheduler tasks
- **test_reversion.py**: Test the paper reversion functionality when stopping the scraper

### Emergency Recovery

For cases where the scraper is stuck or behaving unexpectedly:

- **emergency_stop.py**: Force stops all scraper activities, revokes all running tasks, and reverts papers from "Pending" state
- **quick_fix.py**: Simplified emergency stop that also stops Flask processes to ensure code changes are applied

### Usage Example

```bash
# Check the current state of the scraper
python tools/diagnostics/check_state.py

# Diagnose issues with tasks and logs
python tools/diagnostics/diagnose_scraper.py

# Emergency stop when scraper is stuck
python tools/diagnostics/emergency_stop.py
```

For more information, see:
- The README in the `tools/diagnostics/` directory
- The comprehensive `tools/DIAGNOSTIC_GUIDE.md` for troubleshooting specific issues