Tabular Data Web Crawler

A simple Python tool that scans a single webpage and downloads the tabular and document files linked from it (CSV, Excel, DOC, DOCX).

Quick Links

Windows (No EXE): Windows Usage (No EXE)
Windows (Build EXE): Build Windows EXE (Step-by-Step)
Linux (Server CLI): Secure Linux Server Usage

Features

Single-page crawl: scans only the provided URL (no child-page recursion)
Download detection for: .csv, .xls, .xlsx, .doc, .docx
Optional file-type filter (--types)
Duplicate-safe file naming (report.xlsx, report_1.xlsx, ...)
Automatic per-run organization into timestamped run folders
Request delay/rate limiting (--delay)
Timeout/error handling with continue-on-error behavior
Output organization into run/section folders under one output root
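
As an illustration of the single-page crawl and download detection listed above, here is a minimal sketch using only the standard library. It is not the project's actual code: the names `find_file_links` and `_HrefCollector` are hypothetical, and it assumes that downloadable files are exposed as ordinary `<a href="...">` links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Extensions named in the feature list above.
SUPPORTED = {"csv", "xls", "xlsx", "doc", "docx"}


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag on the page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_file_links(page_url, html, types=None):
    """Return absolute URLs of links whose path ends in a wanted extension.

    Hypothetical helper: only the given page is inspected, no link is followed.
    """
    wanted = {t.lower() for t in types} if types else SUPPORTED
    collector = _HrefCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        path = urlparse(absolute).path       # ignore ?query and #fragment
        ext = posixpath.splitext(path)[1].lstrip(".").lower()
        if ext in wanted and absolute not in found:
            found.append(absolute)
    return found
```

Checking the extension on the URL path rather than on the full URL means a link such as `report.xlsx?v=2` is still recognized as an Excel file.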

Important Compatibility Notes

Secure Linux server: Use CLI mode (this is the recommended production mode).
Windows users: Can run from Python or a packaged .exe.
No GUI is required.

Installation

pip install -r requirements.txt

Quick Run (Python)

python main.py --url <URL> --types csv,xlsx --output downloads

python main.py --url <URL> --output downloads --run-name fy2023_page

If you run with no arguments, the app starts interactive prompts:

python main.py

This is useful for non-technical users and when the .exe is launched by double-clicking.
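
The fallback to interactive prompts could be sketched roughly as follows. This is an assumption about how such a mode might work, not the code in main.py: `ask` and `collect_settings` are hypothetical names, and the prompt wording is invented for the example. Defaults mirror the ones documented under Arguments.

```python
def ask(prompt, default, input_fn=input):
    """Ask one question; an empty answer (just Enter) selects the default."""
    answer = input_fn(f"{prompt} [{default}]: ").strip()
    return answer or default


def collect_settings(argv, input_fn=input):
    """Return prompted settings when no arguments were given, else None."""
    if len(argv) > 1:
        return None  # arguments present: the normal CLI parser takes over
    return {
        "url": input_fn("Website URL: ").strip(),
        "types": ask("File types", "csv,xls,xlsx,doc,docx", input_fn),
        "output": ask("Output folder", "downloads", input_fn),
        "run_name": input_fn("Run name (optional): ").strip() or None,
        "delay": float(ask("Delay in seconds", "0.5", input_fn)),
        "timeout": float(ask("Timeout in seconds", "15", input_fn)),
    }
```

In a real entry point this would be called as `collect_settings(sys.argv)`; the `input_fn` parameter exists only so the prompts can be exercised without a terminal.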

Arguments

--url (required): Base URL to crawl
--types (optional): Comma-separated list chosen from csv, xls, xlsx, doc, docx; omit it to download all supported types
--output (optional): Output directory (default: downloads)
--run-name (optional): Folder name for this run (default: auto timestamp)
--delay (optional): Delay between requests in seconds (default: 0.5)
--timeout (optional): Request timeout in seconds (default: 15)
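
For readers who want to see how these arguments fit together, the following is a sketch of an equivalent `argparse` setup. It is written from the table above, not copied from main.py: `build_parser` and `parse_types` are hypothetical names, and treating an omitted `--types` as "all supported types" is an assumption.

```python
import argparse

ALL_TYPES = ["csv", "xls", "xlsx", "doc", "docx"]


def parse_types(value):
    """Turn 'csv, XLSX' into ['csv', 'xlsx'], rejecting unknown types."""
    types = [t.strip().lower().lstrip(".") for t in value.split(",") if t.strip()]
    unknown = [t for t in types if t not in ALL_TYPES]
    if unknown:
        raise argparse.ArgumentTypeError(f"unsupported type(s): {', '.join(unknown)}")
    return types


def build_parser():
    """Build a parser with the flags and defaults documented above."""
    parser = argparse.ArgumentParser(
        description="Download tabular/document files linked from one page"
    )
    parser.add_argument("--url", required=True, help="Base URL to crawl")
    parser.add_argument("--types", type=parse_types, default=ALL_TYPES,
                        help="Comma-separated list from csv,xls,xlsx,doc,docx")
    parser.add_argument("--output", default="downloads", help="Output directory")
    parser.add_argument("--run-name", default=None,
                        help="Folder name for this run (default: auto timestamp)")
    parser.add_argument("--delay", type=float, default=0.5,
                        help="Delay between requests in seconds")
    parser.add_argument("--timeout", type=float, default=15,
                        help="Request timeout in seconds")
    return parser
```

Validating `--types` inside the parser means a typo such as `--types cvs` fails immediately with a usage message instead of silently downloading nothing.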

Windows Usage (No EXE)

Use this if you want to run directly from Python on Windows instead of building main.exe.

1. Install Python 3.10+ on Windows.
   - Download from: https://www.python.org/downloads/windows/
   - During setup, check "Add Python to PATH".
2. Open Command Prompt.
   - Press Win + R, type cmd, press Enter.
3. Go to your project folder. Example:

cd C:\Users\<YourUser>\Desktop\link-scraper

4. Install dependencies:

pip install -r requirements.txt

5. Run the crawler:

python main.py --url https://example.com/page --output downloads --run-name run1

or interactive mode:

python main.py

Build Windows EXE (Step-by-Step)

Important: Build the .exe on a Windows machine.

1. Install Python 3.10+ on Windows.
   - Download from: https://www.python.org/downloads/windows/
   - During setup, check "Add Python to PATH".
   - Verify the install:

py --version

2. Open Command Prompt.
   - Press Win + R, type cmd, press Enter.
3. Go to your project folder. Example:

cd C:\Users\<YourUser>\Desktop\link-scraper

4. Install dependencies. The build step also needs PyInstaller; if it is not listed in requirements.txt, install it with pip install pyinstaller.

pip install -r requirements.txt

5. Build the executable:

pyinstaller --onefile main.py

6. Find the executable here:

dist\main.exe

7. Copy main.exe to any folder you want to run it from.

Use EXE (Non-Technical User Flow)

Option A: Double-click mode (easiest)

1. Double-click main.exe.
2. Answer the prompts:
   - Website URL
   - File types (or press Enter for all)
   - Output folder (or press Enter for downloads)
   - Run name (optional)
   - Delay/timeout (or press Enter for the defaults)
3. Wait for the completion message.
4. Open the output folder.

Option B: Command mode

main.exe --url https://example.com/page --output downloads --run-name run1

Secure Linux Server Usage

Use this flow for production Linux servers where users do not have sudo rights.

1. Create a user-owned virtual environment (no sudo)

python3 -m venv ~/.venvs/tabular-crawler
source ~/.venvs/tabular-crawler/bin/activate

2. Install dependencies into that environment

pip install --upgrade pip
pip install -r requirements.txt

3. Run the crawler

python main.py --url <URL> --output downloads --run-name <RUN_NAME>

or, in interactive mode:

python main.py

Output Structure

All files are saved under one output root, in one subfolder per run and, inside it, one subfolder per page section:

downloads/
	run_20260326_143012/
		Annual Reports/
			report.xlsx

	fy2023_page/
		Data Tables/
			report.csv

If no heading or section is found for a file, it is saved under general/.
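
The layout above, together with the duplicate-safe naming mentioned under Features, can be sketched as below. This is an illustrative reconstruction, not the project's implementation: `run_folder_name` and `target_path` are hypothetical names, and the timestamp format is inferred from the example folder `run_20260326_143012`.

```python
from datetime import datetime
from pathlib import Path


def run_folder_name(run_name=None, now=None):
    """Use the given run name, or fall back to a timestamp-based name."""
    if run_name:
        return run_name
    now = now or datetime.now()
    return now.strftime("run_%Y%m%d_%H%M%S")


def target_path(output_root, run_folder, section, filename):
    """Build <root>/<run>/<section>/<filename> without overwriting files.

    Folders are created as needed. If the name is already taken, _1, _2, ...
    is inserted before the extension (report.xlsx, report_1.xlsx, ...).
    """
    folder = Path(output_root) / run_folder / (section or "general")
    folder.mkdir(parents=True, exist_ok=True)
    candidate = folder / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

For example, `target_path("downloads", run_folder_name("fy2023_page"), "Data Tables", "report.csv")` would yield `downloads/fy2023_page/Data Tables/report.csv` on the first call and `report_1.csv` once that file exists.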

Notes

Query parameters and relative links are handled correctly.
Broken links and request failures are logged and do not stop execution.
Scanning is limited to the single page specified by --url.
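
The continue-on-error behavior and the request delay described in these notes might look roughly like the sketch below. It is not taken from the project: `fetch` and `download_all` are hypothetical names, and `urllib.request` is used here only to keep the example free of third-party dependencies, so the real tool may well use a different HTTP library.

```python
import logging
import time
import urllib.request

log = logging.getLogger("crawler")


def fetch(url, timeout):
    """Download one URL and return its bytes (standard library only)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, delay=0.5, timeout=15, fetch_fn=fetch, sleep_fn=time.sleep):
    """Fetch every URL in turn; log failures and keep going.

    Returns (results, failures): a dict of url -> bytes for successful
    downloads and a list of the URLs that failed.
    """
    results, failures = {}, []
    for index, url in enumerate(urls):
        if index:
            sleep_fn(delay)  # pause between requests, not before the first
        try:
            results[url] = fetch_fn(url, timeout)
        except Exception as exc:  # broad on purpose: one bad link skips one file
            log.warning("Skipping %s: %s", url, exc)
            failures.append(url)
    return results, failures
```

Passing `fetch_fn` and `sleep_fn` as parameters is a testing convenience: the loop can be checked with a fake fetcher and no real waiting.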
