Chrono-Crawl Archive Explorer

A temporal web archive explorer that journeys through Common Crawl's historical snapshots with a futuristic interface.

🌟 Overview

Chrono-Crawl is a web application that allows you to explore historical versions of websites through Common Crawl's extensive archive. With a sleek, futuristic interface reminiscent of a time machine control panel, you can navigate through web history as if traveling through time.

✨ Features

  • Temporal Navigation: Browse historical versions of websites from Common Crawl archives
  • Complete Proxy System: All resources (CSS, JS, images) are routed through the archive
  • Interactive Rendering: View websites in an isolated iframe with full functionality
  • Modern UI: Futuristic design with glowing effects and smooth animations
  • Responsive Design: Works seamlessly on desktop and mobile devices
  • Data Redundancy: Automatically attempts retrieval from the live web for data missing from the Common Crawl archive

🛠️ Tech Stack

  • Backend: Python Flask
  • Frontend: Pure HTML/CSS/JavaScript
  • Archive Access: Common Crawl API
  • HTML Processing: BeautifulSoup4
  • Data Handling: WARCIterator for archive processing

🚀 Quick Start

Prerequisites

pip install flask requests beautifulsoup4 warcio

Installation

  1. Clone the repository:
git clone <repository-url>
cd chrono-crawl-explorer
  2. Run the application:
python serve.py
  3. Open your browser to http://localhost:5000

🎮 Usage

1. Temporal Coordinates

  • Select a Common Crawl index (timeline)
  • Enter the target URL to explore
  • Set your user agent identifier
  • Click "Initiate Time Jump"

2. Temporal Viewer

  • View raw HTML content from the archive
  • See metadata about the fetched records
  • Monitor fetch status with visual indicators

3. Temporal Renderer

  • Render the archived website in an interactive iframe
  • Navigate through links (all routed through Common Crawl)
  • View the site as it appeared historically

🔧 How It Works

Archive Processing

  1. Index Lookup: Queries Common Crawl for URL matches in the selected time index
  2. Content Fetching: Retrieves WARC records from Common Crawl storage
  3. HTML Modification: Rewrites all resource URLs to proxy through the application
  4. Safe Rendering: Displays content in a sandboxed iframe
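
The index lookup step can be sketched against Common Crawl's public CDX API (a sketch only — the `lookup_records` and `parse_cdx` helper names are illustrative, not the application's actual code):

```python
import json
import requests

def parse_cdx(text):
    """Parse the CDX API's newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def lookup_records(index_name, url):
    """Query a Common Crawl index (e.g. "CC-MAIN-2024-10") for captures of a URL.

    Each record carries the WARC filename, byte offset, and length needed
    to fetch the capture itself in a later step.
    """
    api = f"https://index.commoncrawl.org/{index_name}-index"
    resp = requests.get(api, params={"url": url, "output": "json"}, timeout=30)
    resp.raise_for_status()
    return parse_cdx(resp.text)
```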

Proxy System

  • Resource Proxy: Routes CSS, JavaScript, and images through Common Crawl
  • Navigation Proxy: Handles link clicks and form submissions
  • AJAX Proxy: Intercepts and proxies dynamic requests
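
The resource-rewriting idea can be sketched with BeautifulSoup4 (assumptions: the `rewrite_resources` name and the `/proxy?url=` endpoint are illustrative; the application's own proxy route may differ):

```python
from urllib.parse import urljoin, quote
from bs4 import BeautifulSoup

def rewrite_resources(html, base_url, proxy_prefix="/proxy?url="):
    """Rewrite resource and link URLs so they route through the proxy.

    Relative URLs are resolved against the archived page's own base URL
    before being wrapped, so the proxy always receives an absolute target.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in (("img", "src"), ("script", "src"),
                      ("link", "href"), ("a", "href")):
        for el in soup.find_all(tag):
            if el.get(attr):
                absolute = urljoin(base_url, el[attr])
                el[attr] = proxy_prefix + quote(absolute, safe="")
    return str(soup)
```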

🎨 Interface Features

  • Animated Elements: Pulsing time dials and scanning effects
  • Status Indicators: Real-time feedback for all operations
  • Tab Navigation: Organized workflow between coordinates, viewer, and renderer
  • Responsive Design: Adapts to different screen sizes
  • Dark Theme: Easy-on-the-eyes futuristic aesthetic

📁 Project Structure

chrono-crawl-explorer/
|-- serve.py             # Flask backend server
|-- static               # Flask static content directory
|   |-- favicon.ico      # Favicon
|   `-- style.css        # Futuristic styling
`-- templates            # Flask templates directory
    `-- index.html       # Main application interface

🌐 Common Crawl Integration

The application automatically:

  • Fetches available crawl indexes from Common Crawl
  • Handles byte-range requests for WARC files
  • Processes WARC records to extract HTML content
  • Manages cache for efficient repeated access
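
The byte-range fetch works roughly as below (a sketch; `fetch_archived_html` is an illustrative name, and the warcio import is deferred into the function so the standalone helper stays usable without it):

```python
import io
import requests

def byte_range(offset, length):
    """HTTP Range header value for one WARC record (inclusive end byte)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_archived_html(record):
    """Fetch one WARC record from Common Crawl storage and return its body.

    `record` is a CDX result dict with "filename", "offset", and "length".
    Only the record's byte span is downloaded, not the multi-GB WARC file.
    """
    from warcio.archiveiterator import ArchiveIterator  # deferred dependency

    url = "https://data.commoncrawl.org/" + record["filename"]
    headers = {"Range": byte_range(int(record["offset"]), int(record["length"]))}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    for rec in ArchiveIterator(io.BytesIO(resp.content)):
        if rec.rec_type == "response":
            return rec.content_stream().read()
    return None
```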

🐛 Troubleshooting

Common Issues

  1. No Records Found

    • Try different Common Crawl indexes
    • Check if the site was crawled during the selected timeframe
  2. Rendering Issues

    • Some modern JavaScript may not execute in archive context
    • Complex CSS might not load correctly from archives
    • Dynamic content dependent on live APIs won't work
  3. Performance

    • Large sites may take longer to fetch and render
    • Some archives have limited resource availability

📄 License

This project is open source and available under the MIT License.

Third-party components and data:

  • Common Crawl data: Used under Common Crawl's terms of use
  • Flask: BSD 3-Clause License
  • BeautifulSoup4: MIT License
  • warcio: Apache 2.0 License
  • Common Crawl API: use must comply with Common Crawl's terms of service

Data Usage

This tool accesses Common Crawl archives. Users are responsible for:

  • Respecting website terms found in the archives
  • Complying with robots.txt directives
  • Using data in accordance with applicable laws
  • Providing proper attribution to source websites

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.


⚡ Powered by Flask & Common Crawl
