{% extends "base.html.jinja" %} {% block content %}
<h1 class="mb-4">📘 About This App</h1>
<p class="lead">
<strong>The Research Paper Scraper</strong> is a lightweight web-based tool
designed to help researchers manage and download large sets of academic papers
efficiently, using only a list of DOIs.
</p>
<hr class="my-4" />
<section class="mb-5">
<h2 class="h4">🔍 What It Does</h2>
<p>
This app automates the process of downloading research paper PDFs based on
metadata provided in a CSV file. It's especially useful when dealing with
hundreds or thousands of papers you want to collect for offline access or
analysis.
</p>
<p>
You simply upload a structured CSV file with paper metadata, and the system
takes care of the rest: importing, organizing, and downloading each paper
in the background.
</p>
</section>
<section class="mb-5">
<h2 class="h4">⚙️ How It Works</h2>
<h5 class="mt-4">1. CSV Import</h5>
<p>
Users start by uploading a CSV file that contains metadata for many papers
(such as title, DOI, ISSN, etc.). The app stores only the fields it needs
(the DOI, title, and publication date) and validates each entry before
importing it into the internal database.
</p>
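<p>
For illustration, here is a minimal Python sketch of that validation step.
The column names <code>doi</code>, <code>title</code>, and <code>date</code>,
and the date format, are assumptions, not the app's actual CSV schema.
</p>
<pre><code>import csv
from datetime import datetime

# Hypothetical column names; the real CSV header may differ.
REQUIRED = ("doi", "title", "date")

def load_rows(path):
    """Yield only rows that carry the fields the app keeps."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not all(row.get(k, "").strip() for k in REQUIRED):
                continue  # skip incomplete entries
            try:
                # Assume ISO dates; the real format may differ.
                datetime.strptime(row["date"], "%Y-%m-%d")
            except ValueError:
                continue  # skip rows with an unparseable date
            yield {k: row[k].strip() for k in REQUIRED}
</code></pre>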
<h5 class="mt-4">2. Metadata Management</h5>
<p>Each paper is stored in a local SQLite database, along with its status:</p>
<ul>
<li><strong>Pending</strong>: Ready to be downloaded.</li>
<li><strong>Done</strong>: Successfully downloaded.</li>
<li><strong>Failed</strong>: Something went wrong (e.g. PDF not found).</li>
</ul>
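<p>
As a rough sketch, the underlying table could look like the following.
The table and column names are illustrative only, not the app's real schema.
</p>
<pre><code>import sqlite3

# Illustrative schema; the app's actual table layout may differ.
conn = sqlite3.connect("papers.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        doi    TEXT PRIMARY KEY,
        title  TEXT NOT NULL,
        date   TEXT,
        status TEXT NOT NULL DEFAULT 'pending'
               CHECK (status IN ('pending', 'done', 'failed')),
        pdf_path TEXT  -- filled in once the download succeeds
    )
""")
conn.commit()
</code></pre>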
<h5 class="mt-4">3. Background Scraping</h5>
<p>
A separate background process runs 24/7, automatically downloading papers
based on a configurable hourly schedule. It uses tools like the Zotero API
to fetch the best available version of each paper (ideally as a PDF), and
stores them on disk in neatly organized folders, one per paper.
</p>
<p>
To avoid triggering download limits or spam detection, download times are
<strong>randomized within each hour</strong> to mimic natural behavior.
</p>
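<p>
The randomization might work along these lines. This is a sketch only:
<code>fetch_pdf</code> is a hypothetical stand-in for the real download
logic, and the queueing details are assumed.
</p>
<pre><code>import random
import time

def run_hour(pending, quota):
    """Download up to `quota` papers at random moments within one hour."""
    offsets = sorted(random.uniform(0, 3600) for _ in range(quota))
    start = time.monotonic()
    for offset, paper in zip(offsets, pending):
        # Sleep until this paper's randomized slot comes up.
        time.sleep(max(0.0, start + offset - time.monotonic()))
        fetch_pdf(paper)  # hypothetical downloader, e.g. via the Zotero API
</code></pre>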
<h5 class="mt-4">4. Smart Scheduling</h5>
<p>
You can set how many papers the system should attempt to download during
each hour of the day. This allows you to, for example, schedule more
downloads during the daytime and pause at night, or tailor usage to match
your institution's bandwidth or rate limits.
</p>
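<p>
One plausible shape for such a schedule is a simple per-hour quota map.
The values and the name <code>HOURLY_QUOTA</code> are illustrative, not
the app's actual configuration format.
</p>
<pre><code># Papers to attempt per hour of the day (0-23); illustrative values.
HOURLY_QUOTA = {hour: 0 for hour in range(24)}      # default: paused
HOURLY_QUOTA.update({h: 10 for h in range(9, 18)})  # busier during the day
HOURLY_QUOTA.update({h: 2 for h in range(18, 22)})  # taper off in the evening
</code></pre>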
<h5 class="mt-4">5. Easy Web Interface</h5>
<p>Everything is managed through a simple, responsive web interface:</p>
<ul>
<li>📥 Upload CSV files</li>
<li>📄 Track the status of each paper</li>
<li>⚠️ See which downloads failed, and why</li>
<li>📂 Download PDFs directly from the browser</li>
<li>🕒 Adjust the hourly download schedule</li>
</ul>
<p>
No command-line tools or scripts are required: everything works in your
browser.
</p>
</section>
<section class="mb-5">
<h2 class="h4">📦 File Storage</h2>
<p>
Downloaded PDFs are saved to a structured folder on the server, with each
paper in its own directory based on the DOI. The app never stores files
inside the database, only references to where each PDF is located.
</p>
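<p>
DOIs contain characters such as <code>/</code> that are unsafe in file
names, so the per-paper directory presumably uses a sanitized form of the
DOI, roughly like this (the storage root and helper name are assumptions):
</p>
<pre><code>import re
from pathlib import Path

STORAGE_ROOT = Path("/srv/papers")  # assumed location, not the app's real path

def paper_dir(doi):
    """Map a DOI to its own folder, replacing filesystem-unsafe characters."""
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", doi)
    return STORAGE_ROOT / safe

# e.g. paper_dir("10.1000/xyz123") yields /srv/papers/10.1000_xyz123
</code></pre>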
</section>
<section class="mb-5">
<h2 class="h4">🔒 Simple & Local</h2>
<p>
This app is designed for internal use on a local server or research
workstation. It does not send or expose data to third parties. Everything,
from file storage to scheduling, happens locally, giving you full control
over your paper collection process.
</p>
</section>
<section class="mb-5">
<h2 class="h4">💡 Who It's For</h2>
<p>This tool is ideal for:</p>
<ul>
<li>Research assistants organizing large literature datasets</li>
<li>Labs preparing reading archives for team members</li>
<li>Faculty compiling papers for courses or research reviews</li>
<li>Anyone needing a structured way to fetch and track papers in bulk</li>
</ul>
</section>
{% endblock content %}