GitHub - EthanLi0316/python-web-crawler

Python Web Crawler

Technologies: Python
Description: A Python-based web crawler and search engine designed to minimize runtime complexity. It uses the os and json modules to handle data efficiently and reduce I/O operations. This project was created by Ethan Li and Bowen Zhang.

Instructions for Running the Crawler and Search Engine

Running the Crawler

Open the Terminal:
- Make sure the terminal's working directory is the one containing the project's Python files.
Prepare the Configuration:
- Create a file named crawler_config.txt in the same directory. This file should contain the seed URL for the crawler without quotation marks.
- Example seed URL: http://people.scs.carleton.ca/~davidmckenney/fruits2/N-0.html
Execute the Crawler:
- In the terminal, type python crawler.py and press Enter.
- The crawler will start processing the seed URL, and the output will be saved in crawler_output.txt.
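
The configuration step above can be sketched as follows. This is a minimal illustration, not the code in crawler.py: the helper name `read_seed_url` is hypothetical, and it assumes crawler_config.txt holds the seed URL on its first non-empty line.

```python
from pathlib import Path


def read_seed_url(config_path="crawler_config.txt"):
    """Return the seed URL stored in the config file (hypothetical helper)."""
    for line in Path(config_path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        if url:
            # The file should hold the bare URL, but tolerate stray quotes.
            return url.strip("'\"")
    raise ValueError(f"no seed URL found in {config_path}")
```

Calling `read_seed_url()` from the project directory would return the URL as a plain string, ready to hand to whatever function starts the crawl.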

Running the Search Engine

Prepare the Configuration:
- Ensure you have a file named search_config.txt in the directory. This file should contain:
  - The search phrase on the first line.
  - The boost value (True or False) on the second line.
- Example configuration:
```
apple tomato tomato tomato
True
```
Execute the Search:
- In the terminal, ensure you are in the directory with search.py and searchdata.py.
- Type python search.py and press Enter.
- The search results will be stored in search_results.json in the same directory.
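
As a rough sketch of the files described above, the snippet below parses search_config.txt and loads search_results.json after a run. The function names are hypothetical and the code is not taken from search.py. The layout of the results file is not documented here, so it is returned as whatever JSON it contains.

```python
import json
from pathlib import Path


def read_search_config(config_path="search_config.txt"):
    """Return (phrase, boost) from the two-line config file (hypothetical helper)."""
    lines = Path(config_path).read_text(encoding="utf-8").splitlines()
    if len(lines) < 2:
        raise ValueError("expected a search phrase and a boost value")
    phrase = lines[0].strip()
    # Assumption: the boost flag is the literal text True or False.
    boost_text = lines[1].strip()
    if boost_text not in ("True", "False"):
        raise ValueError(f"boost must be True or False, got {boost_text!r}")
    return phrase, boost_text == "True"


def load_search_results(results_path="search_results.json"):
    """Load whatever JSON search.py wrote; its exact layout is not assumed."""
    with open(results_path, encoding="utf-8") as f:
        return json.load(f)
```

With the example configuration shown earlier, `read_search_config()` would return the phrase `apple tomato tomato tomato` and the boolean `True`.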

Folders and files

- Project Lectures
- __pycache__
- testing-resources
- .gitignore
- Map_url_to_title_file_name.json
- README.md
- crawler.py
- crawler_config.txt
- crawler_output.txt
- fruits2-all-idf-failed.txt
- fruits2-all-idf-passed.txt
- fruits2-all-incoming-failed.txt
- fruits2-all-incoming-passed.txt
- fruits2-all-outgoing-failed.txt
- fruits2-all-outgoing-passed.txt
- fruits2-all-pagerank-failed.txt
- fruits2-all-pagerank-passed.txt
- fruits2-all-search-failed.txt
- fruits2-all-search-passed.txt
- fruits2-all-tf-failed.txt
- fruits2-all-tf-passed.txt
- fruits2-all-tfidf-failed.txt
- fruits2-all-tfidf-passed.txt
- idf_values.json
- incoming_links.json
- matmult.py
- outgoing_links.json
- page_rank_values.json
- search.py
- search_config.txt
- search_results.json
- searchdata.py
- testingtools.py
- tf_values.json
- webdev.py
