Compare commits

...

3 Commits

9 changed files with 1396 additions and 9 deletions

View File

@@ -1,3 +1,35 @@
## Directory Structure
Below is the directory and file layout for the `scipaperloader` project:
```plaintext
scipaperloader/
├── app/
│ ├── __init__.py # Initialize Flask app and database
│ ├── models.py # SQLAlchemy database models
│ ├── main.py # Flask routes (main blueprint)
│ ├── templates/ # Jinja2 templates for HTML pages
│ │ ├── base.html # Base layout template with Alpine.js and HTMX
│ │ ├── index.html # Home page template
│ │ ├── upload.html # CSV upload page template
│ │ ├── schedule.html # Schedule configuration page template
│ │ └── logs.html # Logs display page template
│ └── static/ # Static files (CSS, JS, images)
├── scraper.py # Background scraper daemon script
├── tests/
│ └── test_scipaperloader.py # Tests with a Flask test fixture
├── config.py # Configuration settings for different environments
├── pyproject.toml # Project metadata and build configuration
├── setup.cfg # Development tool configurations (linting, testing)
├── Makefile # Convenient commands for development tasks
└── .venv/ # Python virtual environment (not in version control)
```
- The **`app/`** package contains the Flask application code. It includes an `__init__.py` to create the app and set up extensions, a `models.py` defining database models with SQLAlchemy, and a `main.py` defining routes in a Flask Blueprint. The `templates/` directory holds HTML templates (with Jinja2 syntax) and `static/` will contain static assets (e.g., custom CSS or JS files, if any).
- **`scraper.py`** is a **standalone** Python script that acts as a background daemon. It can be run separately to perform scraping tasks (e.g., periodically fetching new data) and uses the same database (via the SQLAlchemy models or direct database access) to read and write data as needed.
- The **`tests/`** directory includes a test file that uses pytest to ensure the Flask app and its components work as expected. A Flask fixture creates an application instance for testing (with an in-memory database) and verifies routes and database operations (e.g., uploading CSV adds records).
- The **configuration and setup files** at the project root help in development and deployment. `config.py` defines configuration classes (for development, testing, production) so the app can be easily configured. `pyproject.toml` and `setup.cfg` provide project metadata and tool configurations (for packaging, linting, etc.), and a `Makefile` is included to simplify common tasks (running the app, tests, etc.).
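The environment-specific config classes mentioned above follow a common Flask pattern; a minimal sketch is shown below. The key names and values are illustrative assumptions, not the project's actual `config.py` contents:

```python
# Minimal sketch of the config-classes pattern; keys and class names
# are illustrative, not the actual config.py contents.
class Config:
    SECRET_KEY = "change-me"
    SQLALCHEMY_TRACK_MODIFICATIONS = False

class DevelopmentConfig(Config):
    DEBUG = True
    SQLALCHEMY_DATABASE_URI = "sqlite:///dev.db"

class TestingConfig(Config):
    TESTING = True
    SQLALCHEMY_DATABASE_URI = "sqlite:///:memory:"  # in-memory DB for tests

class ProductionConfig(Config):
    DEBUG = False
```

The app factory can then select one of these via `app.config.from_object(...)` depending on the environment.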
## How to use the logger
### GUI Interactions:

View File

@@ -82,6 +82,28 @@ The following environment variables can be set to configure Celery:
Consider using
[dotenv](https://flask.palletsprojects.com/en/3.0.x/cli/#environment-variables-from-dotenv).
## Database Migrations with Flask-Migrate
SciPaperLoader uses Flask-Migrate (based on Alembic) to handle database schema changes. This allows for version-controlled database updates that can be applied or rolled back as needed.
### Database Migration Commands
- `make db-migrate message="Description of changes"`: Create a new migration script based on detected model changes
- `make db-upgrade`: Apply all pending migration scripts to the database
- `make db-downgrade`: Revert the most recent migration
- `make reset-db`: Reset the database completely (delete, initialize, and migrate)
### Working with Migrations
When you make changes to the database models (in `models.py`):
1. Create a migration: `make db-migrate message="Add user roles table"`
2. Review the generated migration script in the `migrations/versions/` directory
3. Apply the migration: `make db-upgrade`
4. To roll back a problematic migration: `make db-downgrade`
Always create database backups before applying migrations in production using `make backup-db`.
## Deployment
See [Deploying to Production](https://flask.palletsprojects.com/en/3.0.x/deploying/).

View File

@@ -18,7 +18,7 @@ def create_app(test_config=None):
app.config.update(test_config)
db.init_app(app)
migrate = Migrate(app, db) # Add this line to initialize Flask-Migrate
migrate = Migrate(app, db)
with app.app_context():
db.create_all()

View File

@@ -6,6 +6,8 @@ from .papers import bp as papers_bp
from .upload import bp as upload_bp
from .schedule import bp as schedule_bp
from .logger import bp as logger_bp
from .api import bp as api_bp
from .scraper import bp as scraper_bp
def register_blueprints(app: Flask):
@@ -14,4 +16,6 @@ def register_blueprints(app: Flask):
app.register_blueprint(papers_bp, url_prefix='/papers')
app.register_blueprint(upload_bp, url_prefix='/upload')
app.register_blueprint(schedule_bp, url_prefix='/schedule')
app.register_blueprint(logger_bp, url_prefix='/logs')
app.register_blueprint(api_bp, url_prefix='/api')
app.register_blueprint(scraper_bp, url_prefix='/scraper')

View File

@@ -0,0 +1,50 @@
from datetime import datetime
from flask import Blueprint, jsonify, request
from ..models import ActivityLog, ActivityCategory
bp = Blueprint("api", __name__, url_prefix="/api")
@bp.route("/activity_logs")
def get_activity_logs():
"""Get activity logs with filtering options."""
# Get query parameters
category = request.args.get("category")
action = request.args.get("action")
after = request.args.get("after")
limit = request.args.get("limit", 20, type=int)
# Build query
query = ActivityLog.query
if category:
query = query.filter(ActivityLog.category == category)
if action:
query = query.filter(ActivityLog.action == action)
if after:
try:
after_date = datetime.fromisoformat(after.replace("Z", "+00:00"))
query = query.filter(ActivityLog.timestamp > after_date)
except (ValueError, TypeError):
pass
# Order by most recent first and limit results
logs = query.order_by(ActivityLog.timestamp.desc()).limit(limit).all()
# Format the results
result = []
for log in logs:
log_data = {
"id": log.id,
"timestamp": log.timestamp.isoformat(),
"category": log.category,
"action": log.action,
"description": log.description,
"status": log.status,
"paper_id": log.paper_id,
"extra_data": log.extra_data
}
result.append(log_data)
return jsonify(result)
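A hedged sketch of how a client might query this endpoint; the base URL is an assumption (adjust to your deployment), and only the query parameters shown in the route above are used:

```python
from urllib.parse import urlencode

def activity_logs_url(base, category=None, action=None, after=None, limit=20):
    """Build a query URL for the /api/activity_logs endpoint."""
    params = {}
    if category:
        params["category"] = category
    if action:
        params["action"] = action
    if after:
        params["after"] = after  # ISO-8601 string, e.g. "2025-01-01T00:00:00Z"
    params["limit"] = limit
    return f"{base}/api/activity_logs?{urlencode(params)}"

# Hypothetical local development server
url = activity_logs_url("http://localhost:5000",
                        category="scraper_activity", limit=5)
```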

View File

@@ -0,0 +1,512 @@
import random
import json
from datetime import datetime, timedelta
from flask import Blueprint, jsonify, render_template, request, current_app, flash
from ..models import ScheduleConfig, VolumeConfig, ActivityLog, PaperMetadata, ActivityCategory
from ..db import db
from ..celery import celery
bp = Blueprint("scraper", __name__, url_prefix="/scraper")
# Global variables to track scraper state.
# Note: module-level globals live per process, so the Flask app and any
# Celery workers each see their own copy; this only behaves as expected
# in a single-process development setup.
SCRAPER_ACTIVE = False
SCRAPER_PAUSED = False
@bp.route("/")
def index():
"""Render the scraper control panel."""
volume_config = VolumeConfig.query.first()
# Ensure we have volume config
if not volume_config:
volume_config = VolumeConfig(volume=100) # Default value
db.session.add(volume_config)
db.session.commit()
# Ensure we have schedule config for all hours
existing_hours = {record.hour: record for record in ScheduleConfig.query.all()}
schedule_config = {}
for hour in range(24):
if hour in existing_hours:
schedule_config[hour] = existing_hours[hour].weight
else:
# Create default schedule entry (weight 1.0)
new_config = ScheduleConfig(hour=hour, weight=1.0)
db.session.add(new_config)
schedule_config[hour] = 1.0
if len(existing_hours) < 24:
db.session.commit()
return render_template(
"scraper.html.jinja",
volume_config=volume_config,
schedule_config=schedule_config,
scraper_active=SCRAPER_ACTIVE,
scraper_paused=SCRAPER_PAUSED
)
@bp.route("/start", methods=["POST"])
def start_scraper():
"""Start the scraper."""
global SCRAPER_ACTIVE, SCRAPER_PAUSED
if not SCRAPER_ACTIVE:
SCRAPER_ACTIVE = True
SCRAPER_PAUSED = False
# Log the action
ActivityLog.log_scraper_command(
action="start_scraper",
status="success",
description="Scraper started manually"
)
# Start the scheduler task
task = dummy_scraper_scheduler.delay()
return jsonify({
"success": True,
"message": "Scraper started",
"task_id": task.id
})
else:
return jsonify({
"success": False,
"message": "Scraper is already running"
})
@bp.route("/stop", methods=["POST"])
def stop_scraper():
"""Stop the scraper."""
global SCRAPER_ACTIVE, SCRAPER_PAUSED
if SCRAPER_ACTIVE:
SCRAPER_ACTIVE = False
SCRAPER_PAUSED = False
ActivityLog.log_scraper_command(
action="stop_scraper",
status="success",
description="Scraper stopped manually"
)
return jsonify({
"success": True,
"message": "Scraper stopped"
})
else:
return jsonify({
"success": False,
"message": "Scraper is not running"
})
@bp.route("/pause", methods=["POST"])
def pause_scraper():
"""Pause the scraper."""
global SCRAPER_ACTIVE, SCRAPER_PAUSED
if SCRAPER_ACTIVE and not SCRAPER_PAUSED:
SCRAPER_PAUSED = True
ActivityLog.log_scraper_command(
action="pause_scraper",
status="success",
description="Scraper paused manually"
)
return jsonify({
"success": True,
"message": "Scraper paused"
})
elif SCRAPER_ACTIVE and SCRAPER_PAUSED:
SCRAPER_PAUSED = False
ActivityLog.log_scraper_command(
action="resume_scraper",
status="success",
description="Scraper resumed manually"
)
return jsonify({
"success": True,
"message": "Scraper resumed"
})
else:
return jsonify({
"success": False,
"message": "Scraper is not running"
})
@bp.route("/status")
def scraper_status():
"""Get the current status of the scraper."""
return jsonify({
"active": SCRAPER_ACTIVE,
"paused": SCRAPER_PAUSED,
"current_hour": datetime.now().hour,
})
@bp.route("/stats")
def scraper_stats():
"""Get scraper statistics for the dashboard."""
# Get the last N hours of activity (default 24; invalid values fall back to the default)
hours = request.args.get("hours", 24, type=int)
now = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
cutoff_time = now - timedelta(hours=hours)
# Get activity logs for scraper actions
logs = ActivityLog.query.filter(
ActivityLog.category == ActivityCategory.SCRAPER_ACTIVITY.value,
ActivityLog.timestamp >= cutoff_time
).all()
# Group by hour and status
stats = {}
for offset in range(hours):
target_hour = (now.hour - offset) % 24
stats[target_hour] = {
"success": 0,
"error": 0,
"pending": 0,
"hour": target_hour,
}
for log in logs:
hour = log.timestamp.hour
if hour in stats:
if log.status == "success":
stats[hour]["success"] += 1
elif log.status == "error":
stats[hour]["error"] += 1
elif log.status == "pending":
stats[hour]["pending"] += 1
# Convert to list for easier consumption by JavaScript
result = [stats[hour] for hour in sorted(stats.keys())]
return jsonify(result)
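The grouping logic in `scraper_stats` can be illustrated in isolation, with hypothetical `(timestamp, status)` pairs standing in for `ActivityLog` rows:

```python
from datetime import datetime

# Hypothetical (timestamp, status) pairs standing in for ActivityLog rows
logs = [
    (datetime(2025, 1, 1, 9, 15), "success"),
    (datetime(2025, 1, 1, 9, 40), "error"),
    (datetime(2025, 1, 1, 10, 5), "success"),
    (datetime(2025, 1, 1, 10, 30), "pending"),
]

stats = {}
for timestamp, status in logs:
    bucket = stats.setdefault(
        timestamp.hour,
        {"success": 0, "error": 0, "pending": 0, "hour": timestamp.hour},
    )
    if status in bucket:
        bucket[status] += 1

# Sorted list form, as returned to the JavaScript chart
result = [stats[h] for h in sorted(stats)]
```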
@bp.route("/update_config", methods=["POST"])
def update_config():
"""Update scraper configuration."""
data = request.json
try:
if "volume" in data:
try:
new_volume = float(data["volume"])
# Validate volume value
if new_volume <= 0 or new_volume > 1000:
return jsonify({
"success": False,
"message": "Volume must be greater than 0 and at most 1000"
})
})
volume_config = VolumeConfig.query.first()
if not volume_config:
volume_config = VolumeConfig(volume=new_volume)
db.session.add(volume_config)
else:
old_value = volume_config.volume
volume_config.volume = new_volume
ActivityLog.log_config_change(
config_key="scraper_volume",
old_value=old_value,
new_value=new_volume,
description="Updated scraper volume"
)
db.session.commit()
except (ValueError, TypeError):
return jsonify({
"success": False,
"message": "Invalid volume value"
})
if "schedule" in data:
try:
schedule = data["schedule"]
# Validate entire schedule
for hour_str, weight in schedule.items():
try:
hour = int(hour_str)
weight = float(weight)
if hour < 0 or hour > 23:
return jsonify({
"success": False,
"message": f"Hour value must be between 0 and 23, got {hour}"
})
if weight < 0.1 or weight > 5:
return jsonify({
"success": False,
"message": f"Weight for hour {hour} must be between 0.1 and 5, got {weight}"
})
except ValueError:
return jsonify({
"success": False,
"message": f"Invalid data format for hour {hour_str}"
})
# Update schedule after validation
for hour_str, weight in schedule.items():
hour = int(hour_str)
weight = float(weight)
schedule_config = ScheduleConfig.query.get(hour)
if not schedule_config:
schedule_config = ScheduleConfig(hour=hour, weight=weight)
db.session.add(schedule_config)
else:
old_value = schedule_config.weight
schedule_config.weight = weight
ActivityLog.log_config_change(
config_key=f"schedule_hour_{hour}",
old_value=old_value,
new_value=weight,
description=f"Updated schedule weight for hour {hour}"
)
db.session.commit()
except Exception as e:
db.session.rollback()
return jsonify({
"success": False,
"message": f"Error updating schedule: {str(e)}"
})
return jsonify({"success": True, "message": "Configuration updated"})
except Exception as e:
db.session.rollback()
return jsonify({"success": False, "message": f"Unexpected error: {str(e)}"})
@bp.route("/schedule", methods=["GET", "POST"])
def schedule():
"""Legacy route to maintain compatibility with the schedule blueprint."""
# For GET requests, redirect to the scraper index with the schedule tab active
if request.method == "GET":
return index()
# For POST requests, handle form data and process like the original schedule blueprint
if request.method == "POST":
try:
# Check if we're updating volume or schedule
if "total_volume" in request.form:
# Volume update
try:
new_volume = float(request.form.get("total_volume", 0))
if new_volume <= 0 or new_volume > 1000:
raise ValueError("Volume must be greater than 0 and at most 1000")
volume_config = VolumeConfig.query.first()
if not volume_config:
volume_config = VolumeConfig(volume=new_volume)
db.session.add(volume_config)
else:
volume_config.volume = new_volume
db.session.commit()
flash("Volume updated successfully!", "success")
except ValueError as e:
db.session.rollback()
flash(f"Error updating volume: {str(e)}", "error")
else:
# Schedule update logic
# Validate form data
for hour in range(24):
key = f"hour_{hour}"
if key not in request.form:
raise ValueError(f"Missing data for hour {hour}")
try:
weight = float(request.form.get(key, 0))
except ValueError:
raise ValueError(f"Invalid weight value for hour {hour}")
if weight < 0 or weight > 5:
raise ValueError(
f"Weight for hour {hour} must be between 0 and 5"
)
# Update database if validation passes
for hour in range(24):
key = f"hour_{hour}"
weight = float(request.form.get(key, 0))
config = ScheduleConfig.query.get(hour)
if config:
config.weight = weight
else:
db.session.add(ScheduleConfig(hour=hour, weight=weight))
db.session.commit()
flash("Schedule updated successfully!", "success")
except ValueError as e:
db.session.rollback()
flash(f"Error updating schedule: {str(e)}", "error")
# Redirect back to the scraper page
return index()
# Calculate schedule information for visualization/decision making
def get_schedule_stats():
"""Get statistics about the current schedule configuration."""
volume_config = VolumeConfig.query.first()
if not volume_config:
return {"error": "No volume configuration found"}
total_volume = volume_config.volume
schedule_configs = ScheduleConfig.query.all()
if not schedule_configs:
return {"error": "No schedule configuration found"}
# Calculate total weight
total_weight = sum(config.weight for config in schedule_configs)
# Calculate papers per hour
papers_per_hour = {}
for config in schedule_configs:
weight_ratio = config.weight / total_weight if total_weight > 0 else 0
papers = weight_ratio * total_volume
papers_per_hour[config.hour] = papers
return {
"total_volume": total_volume,
"total_weight": total_weight,
"papers_per_hour": papers_per_hour
}
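The proportional split computed by `get_schedule_stats` can be checked with plain numbers (the weights below are hypothetical):

```python
def papers_per_hour(weights, total_volume):
    """Split a daily paper volume across hours proportionally to weight."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {hour: 0.0 for hour in weights}
    return {h: (w / total_weight) * total_volume for h, w in weights.items()}

# Hour 9 weighted twice as heavily as hours 10 and 11
split = papers_per_hour({9: 2.0, 10: 1.0, 11: 1.0}, total_volume=100)
```

The hourly shares always sum back to the configured daily volume.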
# Enhanced API route to get schedule information
@bp.route("/schedule_info")
def schedule_info():
"""Get information about the current schedule configuration."""
stats = get_schedule_stats()
return jsonify(stats)
# Define the Celery tasks
@celery.task(bind=True)
def dummy_scraper_scheduler(self):
"""Main scheduler task for the dummy scraper."""
global SCRAPER_ACTIVE, SCRAPER_PAUSED
if not SCRAPER_ACTIVE:
return {"status": "Scraper not active"}
if SCRAPER_PAUSED:
return {"status": "Scraper paused"}
# Calculate how many papers to scrape based on current hour and configuration
current_hour = datetime.now().hour
hour_config = ScheduleConfig.query.get(current_hour)
volume_config = VolumeConfig.query.first()
if not hour_config or not volume_config:
return {"status": "Missing configuration"}
# Calculate papers to scrape this hour
hourly_rate = volume_config.volume / 24 # Base rate per hour
adjusted_rate = hourly_rate * hour_config.weight # Higher weight -> more papers this hour
papers_to_scrape = int(adjusted_rate)
# Log the scheduling decision
ActivityLog.log_scraper_activity(
action="schedule_papers",
status="success",
description=f"Scheduled {papers_to_scrape} papers for scraping at hour {current_hour}",
hourly_rate=hourly_rate,
weight=hour_config.weight,
adjusted_rate=adjusted_rate,
)
# Launch individual scraping tasks
for _ in range(papers_to_scrape):
if not SCRAPER_ACTIVE or SCRAPER_PAUSED:
break
# Schedule a new paper to be scraped
dummy_scrape_paper.delay()
# Schedule the next run in 5 minutes if still active
if SCRAPER_ACTIVE:
dummy_scraper_scheduler.apply_async(countdown=300) # 5 minutes
return {"status": "success", "papers_scheduled": papers_to_scrape}
@celery.task(bind=True)
def dummy_scrape_paper(self):
"""Simulate scraping a single paper."""
# Simulate success or failure
success = random.random() > 0.3 # 70% success rate
# Simulate processing time
import time
time.sleep(random.randint(2, 5)) # 2-5 seconds
if success:
# Create a dummy paper
new_paper = PaperMetadata(
title=f"Dummy Paper {random.randint(1000, 9999)}",
doi=f"10.1234/dummy.{random.randint(1000, 9999)}",
journal=random.choice([
"Nature", "Science", "PLOS ONE", "Journal of Dummy Research",
"Proceedings of the Dummy Society", "Cell", "Dummy Review Letters"
]),
type="article",
language="en",
published_online=datetime.now().date(),
status="Done",
file_path="/path/to/dummy/paper.pdf"
)
db.session.add(new_paper)
db.session.commit()
# Log the successful scrape
ActivityLog.log_scraper_activity(
action="scrape_paper",
paper_id=new_paper.id,
status="success",
description=f"Successfully scraped paper {new_paper.doi}"
)
return {
"success": True,
"paper_id": new_paper.id,
"title": new_paper.title,
"doi": new_paper.doi
}
else:
# Log the failed scrape
error_message = random.choice([
"Connection timeout",
"404 Not Found",
"Access denied",
"Invalid DOI format",
"PDF download failed",
"Rate limited by publisher"
])
ActivityLog.log_scraper_activity(
action="scrape_paper",
status="error",
description=f"Failed to scrape paper: {error_message}"
)
return {
"success": False,
"error": error_message
}

View File

@@ -7,6 +7,9 @@
</button>
<div class="collapse navbar-collapse" id="navbarSupportedContent">
<ul class="navbar-nav me-auto mb-2 mb-lg-0">
<li class="nav-item">
<a class="nav-link" href="{{ url_for('scraper.index') }}">Scraper</a>
</li>
<li class="nav-item">
<a class="nav-link" href="{{ url_for('upload.upload') }}">Import CSV</a>
</li>

View File

@@ -144,13 +144,13 @@
</th>
<th>
{% set params = request.args.to_dict() %}
{% set params = params.update({'sort_by': 'journal', 'sort_dir': journal_sort}) or params %}
<a href="{{ url_for('papers.list_papers', **params) }}">Journal</a>
{% set params = params.update({'sort_by': 'doi', 'sort_dir': doi_sort}) or params %}
<a href="{{ url_for('papers.list_papers', **params) }}">DOI</a>
</th>
<th>
{% set params = request.args.to_dict() %}
{% set params = params.update({'sort_by': 'doi', 'sort_dir': doi_sort}) or params %}
<a href="{{ url_for('papers.list_papers', **params) }}">DOI</a>
{% set params = params.update({'sort_by': 'journal', 'sort_dir': journal_sort}) or params %}
<a href="{{ url_for('papers.list_papers', **params) }}">Journal</a>
</th>
<th>
{% set params = request.args.to_dict() %}
@@ -186,10 +186,9 @@
<path
d="M9.5 1a.5.5 0 0 1 .5.5v1a.5.5 0 0 1-.5.5h-3a.5.5 0 0 1-.5-.5v-1a.5.5 0 0 1 .5-.5h3zm-3-1A1.5 1.5 0 0 0 5 1.5v1A1.5 1.5 0 0 0 6.5 4h3A1.5 1.5 0 0 0 11 2.5v-1A1.5 1.5 0 0 0 9.5 0h-3z" />
</svg>
{{ paper.title }}
{{ paper.title|escape }}
</a>
</td>
<td>{{ paper.journal }}</td>
<td>
<a href="https://doi.org/{{ paper.doi }}" target="_blank" class="icon-link icon-link-hover">
{{ paper.doi }}
@@ -199,7 +198,17 @@
</svg>
</a>
</td>
<td>{{ paper.issn }}</td>
<td>{{ paper.journal }}</td>
<td>
<a href="https://search.worldcat.org/search?q=issn:{{ paper.issn }}" target="_blank"
class="icon-link icon-link-hover">
{{ paper.issn }}
<svg xmlns="http://www.w3.org/2000/svg" class="bi" viewBox="0 0 16 16" aria-hidden="true">
<path
d="M1 8a.5.5 0 0 1 .5-.5h11.793l-3.147-3.146a.5.5 0 0 1 .708-.708l4 4a.5.5 0 0 1 0 .708l-4 4a.5.5 0 0 1-.708-.708L13.293 8.5H1.5A.5.5 0 0 1 1 8z" />
</svg>
</a>
</td>
<td>{{ paper.status }}</td>
<td>{{ paper.created_at.strftime('%Y-%m-%d %H:%M:%S') }}</td>
<td>{{ paper.updated_at.strftime('%Y-%m-%d %H:%M:%S') }}</td>

View File

@@ -0,0 +1,755 @@
{% extends "base.html.jinja" %}
{% block title %}Paper Scraper Control Panel{% endblock title %}
{% block styles %}
{{ super() }}
<style>
.status-indicator {
width: 15px;
height: 15px;
border-radius: 50%;
display: inline-block;
margin-right: 5px;
}
.status-active {
background-color: #28a745;
}
.status-paused {
background-color: #ffc107;
}
.status-inactive {
background-color: #dc3545;
}
.stats-chart {
height: 400px;
}
.notification {
position: fixed;
bottom: 20px;
right: 20px;
max-width: 350px;
z-index: 1050;
}
/* Enhanced scheduler styles */
.timeline {
display: flex;
flex-wrap: wrap;
gap: 3px;
user-select: none;
}
.hour-block {
width: 49px;
height: 70px;
border-radius: 5px;
text-align: center;
line-height: 1.2;
font-size: 0.9rem;
padding-top: 6px;
cursor: pointer;
user-select: none;
transition: background-color 0.2s ease-in-out;
margin: 1px;
}
.hour-block.selected {
outline: 2px solid #4584b8;
}
.papers {
font-size: 0.7rem;
margin-top: 2px;
}
/* Tab styles */
.nav-tabs .nav-link {
color: #495057;
}
.nav-tabs .nav-link.active {
font-weight: bold;
color: #007bff;
}
.tab-pane {
padding-top: 1rem;
}
</style>
{% endblock styles %}
{% block content %}
<div class="container mt-4">
<h1>Paper Scraper Control Panel</h1>
<!-- Navigation tabs -->
<ul class="nav nav-tabs mb-4" id="scraperTabs" role="tablist">
<li class="nav-item" role="presentation">
<button class="nav-link active" id="dashboard-tab" data-bs-toggle="tab" data-bs-target="#dashboard"
type="button" role="tab" aria-controls="dashboard" aria-selected="true">
Dashboard
</button>
</li>
<li class="nav-item" role="presentation">
<button class="nav-link" id="schedule-tab" data-bs-toggle="tab" data-bs-target="#schedule" type="button"
role="tab" aria-controls="schedule" aria-selected="false">
Schedule Configuration
</button>
</li>
</ul>
<div class="tab-content" id="scraperTabsContent">
<!-- Dashboard Tab -->
<div class="tab-pane fade show active" id="dashboard" role="tabpanel" aria-labelledby="dashboard-tab">
<div class="row mb-4">
<div class="col-md-6">
<div class="card">
<div class="card-header">
<h5>Scraper Status</h5>
</div>
<div class="card-body">
<div class="d-flex align-items-center mb-3">
<div id="statusIndicator" class="status-indicator status-inactive"></div>
<span id="statusText">Inactive</span>
</div>
<div class="btn-group" role="group">
<button id="startButton" class="btn btn-success">Start</button>
<button id="pauseButton" class="btn btn-warning" disabled>Pause</button>
<button id="stopButton" class="btn btn-danger" disabled>Stop</button>
</div>
</div>
</div>
</div>
<div class="col-md-6">
<div class="card">
<div class="card-header">
<h5>Volume Configuration</h5>
</div>
<div class="card-body">
<form id="volumeForm">
<div class="form-group">
<label for="volumeInput">Papers per day:</label>
<input type="number" class="form-control" id="volumeInput"
value="{{ volume_config.volume if volume_config else 100 }}">
</div>
<button type="submit" class="btn btn-primary mt-2">Update Volume</button>
</form>
</div>
</div>
</div>
</div>
<div class="row mb-4">
<div class="col-12">
<div class="card">
<div class="card-header d-flex justify-content-between align-items-center">
<h5>Scraping Activity</h5>
<div>
<div class="form-check form-switch">
<input class="form-check-input" type="checkbox" id="notificationsToggle" checked>
<label class="form-check-label" for="notificationsToggle">Show Notifications</label>
</div>
</div>
</div>
<div class="card-body">
<div class="btn-group mb-3">
<button class="btn btn-outline-secondary time-range-btn" data-hours="6">Last 6
hours</button>
<button class="btn btn-outline-secondary time-range-btn active" data-hours="24">Last 24
hours</button>
<button class="btn btn-outline-secondary time-range-btn" data-hours="72">Last 3
days</button>
</div>
<div class="stats-chart" id="activityChart"></div>
</div>
</div>
</div>
</div>
<div class="row mb-4">
<div class="col-12">
<div class="card">
<div class="card-header">
<h5>Recent Activity</h5>
</div>
<div class="card-body">
<div class="table-responsive">
<table class="table table-striped">
<thead>
<tr>
<th>Time</th>
<th>Action</th>
<th>Status</th>
<th>Description</th>
</tr>
</thead>
<tbody id="activityLog">
<tr>
<td colspan="4" class="text-center">Loading activities...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- Schedule Configuration Tab -->
<div class="tab-pane fade" id="schedule" role="tabpanel" aria-labelledby="schedule-tab"
x-data="scheduleManager({{ schedule_config | tojson }}, {{ volume_config.volume if volume_config else 100 }})">
<div class="mb-3">
<h3>How it Works</h3>
<p class="text-muted mb-0">
Configure the daily volume of papers to be downloaded and the hourly download weights.
The weights determine how many papers will be downloaded during each hour of the day.
The total volume (<strong x-text="volume"></strong> papers/day) is split across all hours based on
their relative weights.
<strong>Higher weights result in more papers being scraped</strong> during that hour.
</p>
<h5 class="mt-3">Instructions:</h5>
<p class="text-muted">
Click to select one or more hours below. Then assign a weight to them using the input and apply it.
Color indicates relative intensity. Changes are saved when you click "Update Schedule".
</p>
</div>
<div class="card mb-4">
<div class="card-header">
<h4 class="m-0">Volume Configuration</h4>
</div>
<div class="card-body">
<p class="text-muted">
The total volume of data to be downloaded each day is
<strong x-text="volume"></strong> papers.
</p>
<div class="d-flex align-items-center mb-3">
<div class="input-group">
<span class="input-group-text">Papers per day:</span>
<input type="number" class="form-control" x-model="volume" min="1" max="1000" />
<button type="button" class="btn btn-primary" @click="updateVolume()">
Update Volume
</button>
</div>
</div>
</div>
</div>
<div class="card">
<div class="card-header">
<h4 class="m-0">Hourly Weights</h4>
</div>
<div class="card-body">
<div class="timeline mb-3" @mouseup="endDrag()" @mouseleave="endDrag()">
<template x-for="hour in Object.keys(schedule)" :key="hour">
<div class="hour-block" :id="'hour-' + hour" :data-hour="hour"
:style="getBackgroundStyle(hour)" :class="{'selected': isSelected(hour)}"
@mousedown="startDrag($event, hour)" @mouseover="dragSelect(hour)">
<div><strong x-text="formatHour(hour)"></strong></div>
<div class="weight"><span x-text="schedule[hour]"></span></div>
<div class="papers">
<span x-text="getPapersPerHour(hour)"></span> p.
</div>
</div>
</template>
</div>
<div class="input-group mb-4 w-50">
<span class="input-group-text">Set Weight:</span>
<input type="number" step="0.1" min="0.1" max="5" x-model="newWeight" class="form-control" />
<button type="button" class="btn btn-outline-primary" @click="applyWeight()">
Apply to Selected
</button>
</div>
<button type="button" class="btn btn-success" @click="updateSchedule()">
💾 Update Schedule
</button>
</div>
</div>
</div>
</div>
</div>
<!-- Notification template -->
<div id="notificationContainer"></div>
{% endblock content %}
{% block scripts %}
{{ super() }}
<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
<script src="https://cdn.jsdelivr.net/npm/alpinejs@3.x.x/dist/cdn.min.js" defer></script>
<script>
// Alpine.js scheduler component
function scheduleManager(initial, volume) {
return {
schedule: initial || {},
volume: volume,
selectedHours: [],
newWeight: 1.0,
isDragging: false,
dragOperation: null,
formatHour(h) {
return String(h).padStart(2, "0") + ":00";
},
getBackgroundStyle(hour) {
const weight = parseFloat(this.schedule[hour]);
const maxWeight = 2.5; // You can adjust this
// Normalize weight (0.0 to 1.0)
const t = Math.min(weight / maxWeight, 1.0);
// Interpolate HSL lightness: 95% (light) to 30% (dark)
const lightness = 95 - t * 65; // 95 → 30
const backgroundColor = `hsl(210, 10%, ${lightness}%)`;
const textColor = t > 0.65 ? "white" : "black"; // adaptive text color
return {
backgroundColor,
color: textColor,
};
},
startDrag(event, hour) {
event.preventDefault();
this.isDragging = true;
this.dragOperation = this.isSelected(hour) ? "remove" : "add";
this.toggleSelect(hour);
},
dragSelect(hour) {
if (!this.isDragging) return;
const selected = this.isSelected(hour);
if (this.dragOperation === "add" && !selected) {
this.selectedHours.push(hour);
} else if (this.dragOperation === "remove" && selected) {
this.selectedHours = this.selectedHours.filter((h) => h !== hour);
}
},
endDrag() {
this.isDragging = false;
},
toggleSelect(hour) {
if (this.isSelected(hour)) {
this.selectedHours = this.selectedHours.filter((h) => h !== hour);
} else {
this.selectedHours.push(hour);
}
},
isSelected(hour) {
return this.selectedHours.includes(hour);
},
applyWeight() {
this.selectedHours.forEach((hour) => {
this.schedule[hour] = parseFloat(this.newWeight).toFixed(1);
});
},
getTotalWeight() {
return Object.values(this.schedule).reduce(
(sum, w) => sum + parseFloat(w),
0
);
},
getPapersPerHour(hour) {
const total = this.getTotalWeight();
if (total === 0) return 0;
return (
(parseFloat(this.schedule[hour]) / total) *
this.volume
).toFixed(1);
},
updateVolume() {
fetch('/scraper/update_config', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ volume: parseFloat(this.volume) })
})
.then(response => response.json())
.then(data => {
if (data.success) {
showNotification('Volume updated successfully', 'success');
// Update the volume in the dashboard tab too
document.getElementById('volumeInput').value = this.volume;
} else {
showNotification(data.message, 'danger');
}
});
},
updateSchedule() {
fetch('/scraper/update_config', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({ schedule: this.schedule })
})
.then(response => response.json())
.then(data => {
if (data.success) {
showNotification('Schedule updated successfully', 'success');
this.selectedHours = []; // Clear selections after update
} else {
showNotification(data.message, 'danger');
}
});
}
};
}
// Global variables for the scraper dashboard
let notificationsEnabled = true;
let activityChart = null;
let currentTimeRange = 24;
// DOM elements
const statusIndicator = document.getElementById('statusIndicator');
const statusText = document.getElementById('statusText');
const startButton = document.getElementById('startButton');
const pauseButton = document.getElementById('pauseButton');
const stopButton = document.getElementById('stopButton');
const notificationsToggle = document.getElementById('notificationsToggle');
const activityLog = document.getElementById('activityLog');
// Initialize the page
document.addEventListener('DOMContentLoaded', function () {
initStatusPolling();
loadActivityStats(currentTimeRange);
loadRecentActivity();
// Initialize event listeners
startButton.addEventListener('click', startScraper);
pauseButton.addEventListener('click', togglePauseScraper);
stopButton.addEventListener('click', stopScraper);
notificationsToggle.addEventListener('click', toggleNotifications);
document.getElementById('volumeForm').addEventListener('submit', function (e) {
e.preventDefault();
updateVolume();
});
document.querySelectorAll('.time-range-btn').forEach(btn => {
btn.addEventListener('click', function () {
document.querySelectorAll('.time-range-btn').forEach(b => b.classList.remove('active'));
this.classList.add('active');
currentTimeRange = parseInt(this.dataset.hours, 10);
loadActivityStats(currentTimeRange);
});
});
});
// Status polling
function initStatusPolling() {
updateStatus();
setInterval(updateStatus, 5000); // Poll every 5 seconds
}
function updateStatus() {
fetch('/scraper/status')
.then(response => response.json())
.then(data => {
if (data.active) {
if (data.paused) {
statusIndicator.className = 'status-indicator status-paused';
statusText.textContent = 'Paused';
pauseButton.textContent = 'Resume';
} else {
statusIndicator.className = 'status-indicator status-active';
statusText.textContent = 'Active';
pauseButton.textContent = 'Pause';
}
startButton.disabled = true;
pauseButton.disabled = false;
stopButton.disabled = false;
} else {
statusIndicator.className = 'status-indicator status-inactive';
statusText.textContent = 'Inactive';
startButton.disabled = false;
pauseButton.disabled = true;
stopButton.disabled = true;
}
        })
        .catch(() => {
            // Ignore transient network errors; the next poll will retry
        });
}
// Action functions
function startScraper() {
fetch('/scraper/start', { method: 'POST' })
.then(response => response.json())
.then(data => {
if (data.success) {
showNotification('Scraper started successfully', 'success');
updateStatus();
setTimeout(() => { loadRecentActivity(); }, 1000);
} else {
showNotification(data.message, 'danger');
}
});
}
function togglePauseScraper() {
fetch('/scraper/pause', { method: 'POST' })
.then(response => response.json())
.then(data => {
if (data.success) {
showNotification(data.message, 'info');
updateStatus();
setTimeout(() => { loadRecentActivity(); }, 1000);
} else {
showNotification(data.message, 'danger');
}
});
}
function stopScraper() {
fetch('/scraper/stop', { method: 'POST' })
.then(response => response.json())
.then(data => {
if (data.success) {
showNotification('Scraper stopped successfully', 'warning');
updateStatus();
setTimeout(() => { loadRecentActivity(); }, 1000);
} else {
showNotification(data.message, 'danger');
}
});
}
function updateVolume() {
    // Parse to a number so the payload matches what the config endpoint
    // expects (the Alpine component sends parseFloat(this.volume) too)
    const volume = parseFloat(document.getElementById('volumeInput').value);
    fetch('/scraper/update_config', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({ volume: volume })
    })
        .then(response => response.json())
        .then(data => {
            if (data.success) {
                showNotification('Volume updated successfully', 'success');
            } else {
                showNotification(data.message, 'danger');
            }
        });
}
function toggleNotifications() {
notificationsEnabled = notificationsToggle.checked;
}
// Load data functions
function loadActivityStats(hours) {
    fetch(`/scraper/stats?hours=${hours}`)
        .then(response => response.json())
        .then(data => {
            renderActivityChart(data);
        })
        .catch(() => {
            // Stats endpoint unavailable; leave the chart unchanged
        });
}
function loadRecentActivity() {
fetch('/api/activity_logs?category=scraper_activity&limit=20')
.then(response => response.json())
.then(data => {
renderActivityLog(data);
})
.catch(() => {
// If the API endpoint doesn't exist, just show a message
activityLog.innerHTML = '<tr><td colspan="4" class="text-center">Activity log API not available</td></tr>';
});
}
// Rendering functions
function renderActivityChart(data) {
const ctx = document.getElementById('activityChart').getContext('2d');
// Extract the data for the chart
const labels = data.map(item => `${item.hour}:00`);
const successData = data.map(item => item.success);
const errorData = data.map(item => item.error);
const pendingData = data.map(item => item.pending);
if (activityChart) {
activityChart.destroy();
}
activityChart = new Chart(ctx, {
type: 'bar',
data: {
labels: labels,
datasets: [
{
label: 'Success',
data: successData,
backgroundColor: '#28a745',
stack: 'Stack 0'
},
{
label: 'Error',
data: errorData,
backgroundColor: '#dc3545',
stack: 'Stack 0'
},
{
label: 'Pending',
data: pendingData,
backgroundColor: '#ffc107',
stack: 'Stack 0'
}
]
},
options: {
responsive: true,
maintainAspectRatio: false,
scales: {
x: {
stacked: true,
title: {
display: true,
text: 'Hour'
}
},
y: {
stacked: true,
beginAtZero: true,
title: {
display: true,
text: 'Papers Scraped'
}
}
}
}
});
}
function renderActivityLog(logs) {
activityLog.innerHTML = '';
if (!logs || logs.length === 0) {
activityLog.innerHTML = '<tr><td colspan="4" class="text-center">No recent activity</td></tr>';
return;
}
logs.forEach(log => {
const row = document.createElement('tr');
// Format timestamp
const date = new Date(log.timestamp);
const timeStr = date.toLocaleTimeString();
// Create status badge
let statusBadge = '';
if (log.status === 'success') {
statusBadge = '<span class="badge bg-success">Success</span>';
} else if (log.status === 'error') {
statusBadge = '<span class="badge bg-danger">Error</span>';
} else if (log.status === 'pending') {
statusBadge = '<span class="badge bg-warning text-dark">Pending</span>';
} else {
statusBadge = `<span class="badge bg-secondary">${log.status || 'Unknown'}</span>`;
}
        // Insert server-provided strings via textContent so unexpected HTML
        // in action/description cannot be injected into the page
        row.innerHTML = `
            <td>${timeStr}</td>
            <td></td>
            <td>${statusBadge}</td>
            <td></td>
        `;
        row.cells[1].textContent = log.action;
        row.cells[3].textContent = log.description || '';
activityLog.appendChild(row);
});
}
// Notification functions
function showNotification(message, type) {
if (!notificationsEnabled && type !== 'danger') {
return;
}
const container = document.getElementById('notificationContainer');
const notification = document.createElement('div');
notification.className = `alert alert-${type} notification shadow-sm`;
    // Insert the message as text so server-provided strings cannot inject HTML
    notification.innerHTML = '<button type="button" class="btn-close float-end" aria-label="Close"></button>';
    notification.prepend(document.createTextNode(message));
container.appendChild(notification);
// Add close handler
notification.querySelector('.btn-close').addEventListener('click', () => {
notification.remove();
});
// Auto-close after 5 seconds
setTimeout(() => {
notification.classList.add('fade');
setTimeout(() => {
notification.remove();
}, 500);
}, 5000);
}
// Real-time notifications
function setupWebSocket() {
    // A WebSocket connection could push updates here if the server supports it;
    // for now we poll the server periodically for new papers instead
    setInterval(checkForNewPapers, 10000); // Check every 10 seconds
}
let lastPaperTimestamp = new Date().toISOString();
function checkForNewPapers() {
    fetch(`/api/activity_logs?category=scraper_activity&action=scrape_paper&after=${encodeURIComponent(lastPaperTimestamp)}&limit=5`)
        .then(response => response.json())
        .then(data => {
            if (data && data.length > 0) {
                // Advance the cursor to the newest log we received rather than
                // the client clock, which may be skewed against the server
                lastPaperTimestamp = data.reduce(
                    (latest, log) => (log.timestamp > latest ? log.timestamp : latest),
                    lastPaperTimestamp
                );
                // Show notifications for new papers
                data.forEach(log => {
                    let extraData = {};
                    try {
                        extraData = log.extra_data ? JSON.parse(log.extra_data) : {};
                    } catch (e) {
                        // Malformed extra_data; fall back to defaults
                    }
                    if (log.status === 'success') {
                        showNotification(`New paper scraped: ${extraData.title || 'Unknown title'}`, 'success');
                    } else if (log.status === 'error') {
                        showNotification(`Failed to scrape paper: ${log.description}`, 'danger');
                    }
                });
                // Refresh the activity chart and log
                loadActivityStats(currentTimeRange);
                loadRecentActivity();
            }
        })
        .catch(() => {
            // If the API endpoint doesn't exist, do nothing
        });
}
// Start checking for new papers
setupWebSocket();
</script>
{% endblock scripts %}