{% extends "base.html.jinja" %}

{% block content %}
<h1 class="mb-4">📘 About This App</h1>

<p class="lead">
<strong>The Research Paper Scraper</strong> is a lightweight web-based tool
designed to help researchers manage and download large sets of academic papers
efficiently, using only a list of DOIs.
</p>

<hr class="my-4" />

<section class="mb-5">
<h2 class="h4">🔍 What It Does</h2>
<p>
This app automates the process of downloading research paper PDFs based on
metadata provided in a CSV file. It’s especially useful when dealing with
hundreds or thousands of papers you want to collect for offline access or
analysis.
</p>
<p>
You simply upload a structured CSV file with paper metadata, and the system
takes care of the rest – importing, organizing, and downloading each paper
in the background.
</p>
</section>

<section class="mb-5">
<h2 class="h4">⚙️ How It Works</h2>

<h5 class="mt-4">1. CSV Import</h5>
<p>
Users start by uploading a CSV file that contains metadata for many papers
(such as title, DOI, and ISSN). The app stores only the fields it needs –
the DOI, title, and publication date – and validates each entry before
importing it into the internal database.
</p>
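<p>
For reference, a minimal input file might look like the sketch below. The
column names here are illustrative; what matters is that the importer can
find a DOI, title, and publication date in each row:
</p>
<pre><code>DOI,Title,ISSN,Publication Date
10.1000/example.2021.001,An Example Paper,1234-5678,2021-05-01
10.1000/example.2022.042,Another Example Paper,1234-5678,2022-11-15</code></pre>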
<h5 class="mt-4">2. Metadata Management</h5>
<p>Each paper is stored in a local SQLite database, along with its status:</p>
<ul>
<li><strong>Pending</strong>: Ready to be downloaded.</li>
<li><strong>Done</strong>: Successfully downloaded.</li>
<li><strong>Failed</strong>: Something went wrong (e.g. PDF not found).</li>
</ul>
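<p>
Conceptually, each record resembles the simplified Django model below. The
class and field names are illustrative, not the app’s actual schema:
</p>
<pre><code>from django.db import models

# Simplified sketch of the stored record (illustrative names)
class Paper(models.Model):
    STATUS_CHOICES = [
        ("pending", "Pending"),  # ready to be downloaded
        ("done", "Done"),        # successfully downloaded
        ("failed", "Failed"),    # something went wrong
    ]

    doi = models.CharField(max_length=255, unique=True)
    title = models.TextField()
    published = models.DateField(null=True, blank=True)
    status = models.CharField(
        max_length=10, choices=STATUS_CHOICES, default="pending"
    )
    # the PDF itself lives on disk; only its location is stored here
    pdf_path = models.CharField(max_length=500, blank=True)</code></pre>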
<h5 class="mt-4">3. Background Scraping</h5>
<p>
A separate background process runs 24/7, automatically downloading papers
based on a configurable hourly schedule. It uses tools like the Zotero API
to fetch the best available version of each paper (ideally as a PDF) and
stores the files on disk in neatly organized folders, one per paper.
</p>
<p>
To avoid triggering download limits or spam detection, download times are
<strong>randomized within each hour</strong> to mimic natural behavior.
</p>
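<p>
For instance, spreading each hour’s downloads over random offsets could look
like this minimal sketch (not the app’s actual implementation):
</p>
<pre><code>import random

def plan_download_times(papers_this_hour: int) -> list[int]:
    """Pick a random second within the hour for each planned download."""
    return sorted(random.randint(0, 3599) for _ in range(papers_this_hour))

# e.g. plan_download_times(3) might yield [412, 1878, 3140]</code></pre>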
<h5 class="mt-4">4. Smart Scheduling</h5>
<p>
You can set how many papers the system should attempt to download during
each hour of the day. This allows you to, for example, schedule more
downloads during the day and pause at night – or tailor usage to match your
institution’s bandwidth or rate limits.
</p>
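<p>
The schedule is essentially a mapping from hour of day to a download quota,
along these lines (the values are illustrative):
</p>
<pre><code># Illustrative hourly schedule: hour of day (0-23) mapped to
# how many downloads to attempt during that hour
hourly_schedule = {
    9: 10,   # busier during the day
    13: 15,
    18: 5,   # wind down in the evening
    23: 0,   # pause overnight
}</code></pre>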
<h5 class="mt-4">5. Easy Web Interface</h5>
<p>Everything is managed through a simple, responsive web interface:</p>
<ul>
<li>📥 Upload CSV files</li>
<li>📄 Track the status of each paper</li>
<li>⚠️ See which downloads failed, and why</li>
<li>📂 Download PDFs directly from the browser</li>
<li>🕒 Adjust the hourly download schedule</li>
</ul>
<p>
No command-line tools or scripts required – everything works in your
browser.
</p>
</section>
<section class="mb-5">
<h2 class="h4">📦 File Storage</h2>
<p>
Downloaded PDFs are saved to a structured folder on the server, with each
paper in its own directory based on the DOI. The app never stores files
inside the database – only references to where each PDF is located.
</p>
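<p>
The on-disk layout might look roughly like this (the folder names are
illustrative; DOIs contain slashes, so the exact naming on disk may differ):
</p>
<pre><code>papers/
├── 10.1000_example.2021.001/
│   └── paper.pdf
└── 10.1000_example.2022.042/
    └── paper.pdf</code></pre>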
</section>

<section class="mb-5">
<h2 class="h4">🔒 Simple & Local</h2>
<p>
This app is designed for internal use on a local server or research
workstation. It does not send or expose data to third parties. Everything –
from file storage to scheduling – happens locally, giving you full control
over your paper collection process.
</p>
</section>

<section class="mb-5">
<h2 class="h4">💡 Who It’s For</h2>
<p>This tool is ideal for:</p>
<ul>
<li>Research assistants organizing large literature datasets</li>
<li>Labs preparing reading archives for team members</li>
<li>Faculty compiling papers for courses or research reviews</li>
<li>Anyone needing a structured way to fetch and track papers in bulk</li>
</ul>
</section>
{% endblock content %}