Simple Python tool to scan one webpage and download tabular/document files (CSV, Excel, DOC, DOCX).
- Windows (No EXE): Windows Usage (No EXE)
- Windows (Build EXE): Build Windows EXE (Step-by-Step)
- Linux (Server CLI): Secure Linux Server Usage
- Single-page crawl: scans only the provided URL (no child-page recursion)
- Download detection for: .csv, .xls, .xlsx, .doc, .docx
- Optional file-type filter (--types)
- Duplicate-safe file naming (report.xlsx, report_1.xlsx, ...)
- Automatic per-run organization into timestamped run folders
- Request delay/rate limiting (--delay)
- Timeout/error handling with continue-on-error behavior
- Output organization into run/section folders under one output root
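The delay and continue-on-error behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the names `download_all` and `fetch_bytes` are hypothetical, and the standard-library `urllib` is assumed here in place of whatever HTTP client the project uses.

```python
import logging
import time
import urllib.request

logger = logging.getLogger("crawler")


def fetch_bytes(url, timeout=15):
    """Download one URL and return its body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_bytes, delay=0.5, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    A failed request is logged and recorded, then the loop moves on,
    so one broken link does not stop the run.
    """
    results = {}
    failed = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: keep going on any failure
            logger.warning("Skipping %s: %s", url, exc)
            failed.append(url)
    return results, failed
```

Passing `fetch` and `sleep` as parameters is a design choice for this sketch only; it lets the loop be exercised without network access or real waiting.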
- Secure Linux server: Use CLI mode (this is the recommended production mode).
- Windows users: Can run from Python or a packaged .exe.
- No GUI is required.
    pip install -r requirements.txt

    python main.py --url <URL> --types csv,xlsx --output downloads
    python main.py --url <URL> --output downloads --run-name fy2023_page

If you run with no arguments, the app starts interactive prompts:

    python main.py

This is useful for non-technical users and for double-clicked .exe use.
- --url (required): Base URL to crawl
- --types (optional): Comma-separated list from csv, xls, xlsx, doc, docx
- --output (optional): Output directory (default: downloads)
- --run-name (optional): Folder name for this run (default: auto timestamp)
- --delay (optional): Delay between requests in seconds (default: 0.5)
- --timeout (optional): Request timeout in seconds (default: 15)
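For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the documented flags and defaults. It is an assumption about how such a parser could look, not a copy of `main.py`: `build_parser` and `parse_types` are hypothetical names, and the sketch marks `--url` as required, so it does not cover the no-argument interactive mode.

```python
import argparse

ALLOWED_TYPES = ("csv", "xls", "xlsx", "doc", "docx")


def parse_types(value):
    """Turn 'csv, XLSX' into ['csv', 'xlsx'] and reject unknown types."""
    types = [t.strip().lower().lstrip(".") for t in value.split(",") if t.strip()]
    unknown = [t for t in types if t not in ALLOWED_TYPES]
    if unknown:
        raise argparse.ArgumentTypeError(
            f"unsupported type(s): {', '.join(unknown)}"
        )
    return types


def build_parser():
    """Build a parser with the documented options and defaults."""
    parser = argparse.ArgumentParser(
        description="Scan one webpage and download tabular/document files."
    )
    parser.add_argument("--url", required=True, help="Base URL to crawl")
    parser.add_argument("--types", type=parse_types, default=list(ALLOWED_TYPES),
                        help="Comma-separated list of file types")
    parser.add_argument("--output", default="downloads", help="Output directory")
    parser.add_argument("--run-name", default=None,
                        help="Folder name for this run (None means auto timestamp)")
    parser.add_argument("--delay", type=float, default=0.5,
                        help="Delay between requests in seconds")
    parser.add_argument("--timeout", type=float, default=15,
                        help="Request timeout in seconds")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```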
Use this if you want to run directly from Python on Windows instead of building main.exe.
- Install Python 3.10+ on Windows.
  - Download from: https://www.python.org/downloads/windows/
  - During setup, check Add Python to PATH.
- Open Command Prompt.
  - Press Win + R, type cmd, press Enter.
- Go to your project folder.
  - Example:
        cd C:\Users\<YourUser>\Desktop\link-scraper
- Install dependencies:
        pip install -r requirements.txt
- Run the crawler:
        python main.py --url https://example.com/page --output downloads --run-name run1
  or interactive mode:
        python main.py

Important: Build the .exe on a Windows machine.
- Install Python 3.10+ on Windows.
  - Download from: https://www.python.org/downloads/windows/
  - During setup, check Add Python to PATH.
  - Verify install:
        py --version
- Open Command Prompt.
  - Press Win + R, type cmd, press Enter.
- Go to your project folder.
  - Example:
        cd C:\Users\<YourUser>\Desktop\link-scraper
- Install dependencies:
        pip install -r requirements.txt
- Build the executable:
        pyinstaller --onefile main.py
- Find the executable here:
        dist\main.exe
- Copy main.exe to any folder you want to run from.
- Double-click main.exe.
- Enter prompts when asked:
  - Website URL
  - File types (or press Enter for all)
  - Output folder (or Enter for downloads)
  - Run name (optional)
  - Delay/timeout (or Enter for defaults)
- Wait for completion message.
- Open output folder.

    main.exe --url https://example.com/page --output downloads --run-name run1

Use this flow for production Linux servers where users do not have sudo rights.
python3 -m venv ~/.venvs/tabular-crawler
source ~/.venvs/tabular-crawler/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
python main.py --url <link of your website> --output downloads --run-name <name of your run>
or
python main.py

All files are saved under one output root with subfolders:
downloads/
  run_20260326_143012/
    Annual Reports/
      report.xlsx
  fy2023_page/
    Data Tables/
      report.csv
If a heading/section is not found, files go to general/.
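The layout above, together with the general/ fallback and the duplicate-safe naming mentioned earlier, could be resolved along these lines. This is a sketch under assumptions: the function names are hypothetical, the run_YYYYMMDD_HHMMSS pattern is inferred from the example folder name, and section names are used as-is without sanitizing characters that are invalid in folder names.

```python
from datetime import datetime
from pathlib import Path


def default_run_name(now=None):
    """Timestamped run folder name, e.g. run_20260326_143012 (assumed format)."""
    now = now or datetime.now()
    return now.strftime("run_%Y%m%d_%H%M%S")


def unique_path(folder, filename):
    """Return folder/filename, appending _1, _2, ... if the name is taken."""
    candidate = Path(folder) / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = Path(folder) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def destination(output_root, run_name, section, filename):
    """Build <output_root>/<run>/<section>/<filename>, creating folders as needed.

    A missing run name falls back to a timestamp, and a missing section
    falls back to 'general'.
    """
    folder = Path(output_root) / (run_name or default_run_name()) / (section or "general")
    folder.mkdir(parents=True, exist_ok=True)
    return unique_path(folder, filename)
```

For example, calling `destination("downloads", "fy2023_page", "Data Tables", "report.csv")` twice, and writing the first file before the second call, would yield report.csv and then report_1.csv in the same section folder.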
- Query parameters and relative links are handled correctly.
- Broken links and request failures are logged and do not stop execution.
- Scanning is limited to the single page specified by --url.
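As an illustration of how relative links and query parameters can be handled when detecting downloadable files on a single page, here is a standard-library sketch. It is not taken from the project: `LinkCollector` and `find_file_links` are hypothetical names, and the real tool may use a different HTML parser.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

SUPPORTED = {".csv", ".xls", ".xlsx", ".doc", ".docx"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_file_links(html, base_url, types=None):
    """Return absolute URLs on the page whose path ends in a wanted extension.

    Relative links are resolved against base_url, and the extension is read
    from the URL path only, so a query string such as ?v=2 does not hide it.
    """
    wanted = {"." + t.lower().lstrip(".") for t in types} if types else SUPPORTED
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        suffix = PurePosixPath(urlparse(absolute).path).suffix.lower()
        if suffix in wanted and absolute not in found:
            found.append(absolute)
    return found
```

Only links found in the HTML passed in are returned; nothing here follows those links to other pages, which matches the single-page scope described above.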