pinformatics/link-scraper

Tabular Data Web Crawler

A simple Python tool that scans a single webpage and downloads the tabular and document files it links to (CSV, Excel, DOC, DOCX).

Features

  • Single-page crawl: scans only the provided URL (no child-page recursion)
  • Download detection for: .csv, .xls, .xlsx, .doc, .docx
  • Optional file-type filter (--types)
  • Duplicate-safe file naming (report.xlsx, report_1.xlsx, ...)
  • Per-run output folders, named by timestamp unless --run-name is given
  • Section subfolders inside each run folder, all under one output root
  • Request delay/rate limiting (--delay)
  • Timeout and error handling: failed requests are logged and the run continues
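
As a rough illustration of the single-page scan and file-type detection listed above, here is a minimal sketch using only the standard library. The names LinkCollector, find_file_links, and SUPPORTED_TYPES are hypothetical and are not taken from main.py; the actual implementation may work differently.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumption: the extensions named in the feature list.
SUPPORTED_TYPES = {"csv", "xls", "xlsx", "doc", "docx"}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on one page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_file_links(html, page_url, types=None):
    """Return absolute URLs whose path ends in one of the wanted extensions."""
    wanted = set(types) if types else SUPPORTED_TYPES
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        # The parsed path excludes the query string, so "?v=2" does not hide the extension.
        suffix = PurePosixPath(urlparse(absolute).path).suffix.lower().lstrip(".")
        if suffix in wanted and absolute not in found:
            found.append(absolute)
    return found
```

Only links found in the given HTML are returned; nothing here follows links to other pages, which matches the no-recursion behavior described above.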

Important Compatibility Notes

  • Secure Linux server: Use CLI mode (this is the recommended production mode).
  • Windows users: Can run from Python or a packaged .exe.
  • No GUI is required.

Installation

pip install -r requirements.txt

Quick Run (Python)

python main.py --url <URL> --types csv,xlsx --output downloads

python main.py --url <URL> --output downloads --run-name fy2023_page

If you run with no arguments, the app starts interactive prompts:

python main.py

This is useful for non-technical users and for double-clicked .exe use.

Arguments

  • --url (required in command mode): URL of the single page to scan
  • --types (optional): Comma-separated list from csv,xls,xlsx,doc,docx
  • --output (optional): Output directory (default: downloads)
  • --run-name (optional): Folder name for this run (default: a timestamped name such as run_20260326_143012)
  • --delay (optional): Delay between requests in seconds (default: 0.5)
  • --timeout (optional): Request timeout in seconds (default: 15)
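
For readers who want to see how these options fit together, the following is a minimal argparse sketch with the defaults listed above. The function names build_parser and parse_types are hypothetical, and --url is left optional here on the assumption that running with no arguments falls back to interactive prompts; main.py may define its arguments differently.

```python
import argparse

ALLOWED_TYPES = ("csv", "xls", "xlsx", "doc", "docx")


def parse_types(value):
    """Turn 'csv,xlsx' into ['csv', 'xlsx'] and reject unknown types."""
    types = [t.strip().lower().lstrip(".") for t in value.split(",") if t.strip()]
    unknown = [t for t in types if t not in ALLOWED_TYPES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unsupported type(s): {', '.join(unknown)}")
    return types


def build_parser():
    parser = argparse.ArgumentParser(
        description="Scan one page and download tabular/document files."
    )
    parser.add_argument("--url", help="URL of the page to scan")
    parser.add_argument("--types", type=parse_types, default=list(ALLOWED_TYPES),
                        help="Comma-separated list from csv,xls,xlsx,doc,docx")
    parser.add_argument("--output", default="downloads", help="Output directory")
    parser.add_argument("--run-name", default=None,
                        help="Folder name for this run (default: timestamp)")
    parser.add_argument("--delay", type=float, default=0.5,
                        help="Delay between requests in seconds")
    parser.add_argument("--timeout", type=float, default=15,
                        help="Request timeout in seconds")
    return parser
```

With this sketch, build_parser().parse_args(["--url", "https://example.com/page", "--types", "csv,xlsx"]) yields types of ["csv", "xlsx"] and leaves the other options at their defaults.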

Windows Usage (No EXE)

Use this if you want to run directly from Python on Windows instead of building main.exe.

  1. Install Python 3.10+ on Windows.

  2. Open Command Prompt.

    • Press Win + R, type cmd, press Enter.
  3. Go to your project folder.

    • Example:
cd C:\Users\<YourUser>\Desktop\link-scraper
  4. Install dependencies:
pip install -r requirements.txt
  5. Run the crawler:
python main.py --url https://example.com/page --output downloads --run-name run1

or interactive mode:

python main.py

Build Windows EXE (Step-by-Step)

Important: Build the .exe on a Windows machine.

  1. Install Python 3.10+ on Windows.
  2. Open Command Prompt.

    • Press Win + R, type cmd, press Enter.
  3. Confirm the Python version:
py --version
  4. Go to your project folder.

    • Example:
cd C:\Users\<YourUser>\Desktop\link-scraper
  5. Install dependencies:
pip install -r requirements.txt
  6. Build the executable:
pyinstaller --onefile main.py
  7. Find the executable here:
dist\main.exe
  8. Copy main.exe to any folder you want to run from.

Use EXE (Non-Technical User Flow)

Option A: Double-click mode (easiest)

  1. Double-click main.exe.
  2. Answer the prompts when asked:
    • Website URL
    • File types (or press Enter for all)
    • Output folder (or press Enter for downloads)
    • Run name (optional)
    • Delay/timeout (or press Enter for defaults)
  3. Wait for the completion message.
  4. Open the output folder.

Option B: Command mode

main.exe --url https://example.com/page --output downloads --run-name run1

Secure Linux Server Usage

Use this flow for production Linux servers where users do not have sudo rights.

1. Create a user-owned virtual environment (no sudo)

python3 -m venv ~/.venvs/tabular-crawler
source ~/.venvs/tabular-crawler/bin/activate

2. Install dependencies into that environment

pip install --upgrade pip
pip install -r requirements.txt

3. Run the crawler

python main.py --url <URL> --output downloads --run-name <RUN_NAME>

or interactive mode:

python main.py

Output Structure

All files are saved under one output root, with one subfolder per run and, inside each run, one subfolder per page section:

downloads/
	run_20260326_143012/
		Annual Reports/
			report.xlsx

	fy2023_page/
		Data Tables/
			report.csv

If no heading or section is found for a file, it is saved to general/.
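
The layout above, together with the duplicate-safe naming from the feature list, could be produced along the following lines. This is only a sketch: run_folder_name, unique_path, and target_path are hypothetical names, the timestamp format is inferred from the example folder run_20260326_143012, and section names are used as folder names without any sanitizing.

```python
from datetime import datetime
from pathlib import Path


def run_folder_name(run_name=None, now=None):
    """Use the given run name, else a timestamp such as run_20260326_143012."""
    if run_name:
        return run_name
    now = now or datetime.now()
    return now.strftime("run_%Y%m%d_%H%M%S")


def unique_path(folder, filename):
    """Return folder/filename, appending _1, _2, ... while the name is taken."""
    candidate = folder / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def target_path(output_root, run_name, section, filename):
    """Build <output_root>/<run>/<section or general>/<unique filename>."""
    folder = Path(output_root) / run_name / (section or "general")
    folder.mkdir(parents=True, exist_ok=True)
    return unique_path(folder, filename)
```

Calling target_path twice for report.xlsx in the same section returns report.xlsx first and report_1.xlsx once the first file has been written.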

Notes

  • Query parameters and relative links are handled correctly.
  • Broken links and request failures are logged and do not stop execution.
  • Scanning is limited to the single page specified by --url.
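
The continue-on-error behavior combined with the request delay might look roughly like this. It is a sketch using urllib from the standard library, not the project's actual download code; fetch and download_all are hypothetical names, and the fetcher and sleep parameters exist only so the loop can be exercised without network access.

```python
import logging
import time
import urllib.request

log = logging.getLogger("crawler")


def fetch(url, timeout=15):
    """Download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, delay=0.5, timeout=15, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL in turn; log failures and keep going."""
    results, failures = {}, []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        try:
            results[url] = fetcher(url, timeout)
        except OSError as exc:
            # urllib.error.URLError, HTTPError and socket timeouts all derive from OSError.
            log.warning("Skipping %s: %s", url, exc)
            failures.append(url)
    return results, failures
```

A broken link ends up in the failures list while the remaining URLs are still fetched, so one bad file does not abort the run.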
