crawler

A simple web crawler based on Django

Installation

pip install pipenv

git clone https://github.com/Jibinjohnkj/crawler.git

cd crawler

pipenv install --ignore-pipfile

pipenv shell

python manage.py migrate

python manage.py runserver

Open your browser and go to http://127.0.0.1:8000/

Introduction

The web crawler requires two inputs:

  1. The URL to crawl.
  2. The depth to which the crawler should go.

A depth of 1 means the crawler fetches the images from the supplied URL and displays them. A depth of 2 means it also follows the links found on that first page and fetches images from those pages. Likewise, a depth of 3 follows the links found on the second-level pages and fetches images from them, and so on.

To promote fair usage, the maximum depth is restricted to 3 and the maximum number of links followed is restricted to 25.
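
A minimal sketch of how such a depth-limited, link-capped crawl could be structured (the function and constant names are illustrative rather than the project's actual code, and the sketch assumes the requests and beautifulsoup4 packages are available):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 3    # fair-usage cap on crawl depth
MAX_LINKS = 25   # fair-usage cap on links followed

def crawl(start_url, depth):
    """Breadth-first crawl up to `depth` levels, collecting image URLs."""
    depth = min(depth, MAX_DEPTH)
    images, seen = [], {start_url}
    queue = deque([(start_url, 1)])
    while queue:
        url, level = queue.popleft()
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        images += [img["src"] for img in soup.find_all("img") if img.get("src")]
        if level < depth:
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen and len(seen) <= MAX_LINKS:
                    seen.add(link)
                    queue.append((link, level + 1))
    return images
```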

Duplicate links are removed: http://example.com, https://example.com, https://example.com/, example.com, and www.example.com are all treated as the same URL.
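
One way to treat those variants as the same link is to normalize each URL before comparison. The normalize helper below is an illustrative sketch under that assumption, not the project's actual implementation:

```python
from urllib.parse import urlparse

def normalize(url):
    """Reduce a URL to a canonical form so scheme, trailing-slash and www variants compare equal."""
    if "://" not in url:
        url = "http://" + url          # bare domains such as example.com
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]                # www.example.com -> example.com
    path = parts.path.rstrip("/")      # https://example.com/ -> https://example.com
    return host + path                 # the scheme is dropped, so http == https

# All five variants above normalize to "example.com":
variants = ["http://example.com", "https://example.com", "https://example.com/",
            "example.com", "www.example.com"]
assert len({normalize(u) for u in variants}) == 1
```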

The crawler is limited to the domain of the provided URL; it does not crawl external sites' content.
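
A same-domain check can be as simple as comparing hostnames. The helper below is an illustrative sketch, not necessarily how the project implements it:

```python
from urllib.parse import urlparse

def is_same_domain(link, base_url):
    """True when `link` points at the same host as the site being crawled."""
    return urlparse(link).netloc.lower() == urlparse(base_url).netloc.lower()

is_same_domain("https://example.com/page2", "https://example.com/")  # True
is_same_domain("https://other.org/img.png", "https://example.com/")  # False
```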

For image sources starting with '/' or '//', the scheme and domain are added appropriately (see the sketch below).

Base64-encoded images (data URIs) are ignored.
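
The two points above (resolving relative image sources and skipping inline base64 images) could be handled roughly as follows; the helper name is hypothetical and this is only a sketch of the idea:

```python
from urllib.parse import urljoin

def resolve_image_src(src, page_url):
    """Turn an <img> src into an absolute URL, skipping inline base64 data URIs."""
    if src.startswith("data:"):  # base64-encoded inline images are ignored
        return None
    # urljoin adds scheme and domain for '/path.png', and the scheme for '//cdn.example.com/x.png'
    return urljoin(page_url, src)

resolve_image_src("/static/logo.png", "https://example.com/page")
# -> "https://example.com/static/logo.png"
resolve_image_src("//cdn.example.com/x.png", "https://example.com/page")
# -> "https://cdn.example.com/x.png"
resolve_image_src("data:image/png;base64,iVBORw0...", "https://example.com/page")
# -> None
```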
