PageRank

Python implementation of the PageRank algorithm, using both sampling and iterative methods to rank web pages based on link structure. Aimed at understanding and replicating Google's core search algorithm logic.

PageRank Algorithm

Project Overview

This project implements the PageRank algorithm, used by search engines to rank the importance of web pages. The algorithm determines the probability of a random internet user visiting a specific page, taking into account both the number of incoming links and the significance of the linking pages.

This implementation is done using Python and offers two approaches to calculating PageRank:

Sampling Method - Uses a random surfer model to simulate browsing behavior.
Iterative Method - Uses a recursive mathematical approach to compute PageRank values.

The project is based on the CS50 AI curriculum.

How PageRank Works

PageRank evaluates web pages based on their link structure. The key idea is:

A page is more important if it is linked to by other important pages.
A random surfer model is used to simulate how likely a user is to visit a given page.

Random Surfer Model

In this model, a random surfer:

Starts on a random page.
Clicks links at random to navigate to other pages.
Occasionally jumps to a random page with a small probability (damping factor).

The algorithm computes the probability distribution over pages, estimating how often the surfer will land on each page.

Iterative Approach

This method uses a formula to iteratively update PageRank values until they converge (i.e., the values stabilize). It ensures that pages with no links are treated as if they link to every page in the corpus.

Installation

Ensure you have Python 3 installed on your system.

Clone the repository:

git clone https://github.com/yourusername/pagerank.git

Navigate into the project directory:
```
cd pagerank
```

Usage

To run the PageRank algorithm, use the following command:

python pagerank.py <directory_of_html_pages>

Example

If your HTML files are stored in a folders corpus0, corpus1 and corpus2:

python pagerank.py corpus0

Output

The script will display the PageRank values calculated by both the Sampling and Iterative methods. For example:

PageRank Results from Sampling (n = 10000)
  1.html: 0.2188
  2.html: 0.4275
  3.html: 0.2187
  4.html: 0.1351
PageRank Results from Iteration
  1.html: 0.2198
  2.html: 0.4294
  3.html: 0.2198
  4.html: 0.1311

Code Structure

The primary logic of the PageRank algorithm is implemented in the following key functions:

crawl(directory): Parses a directory of HTML pages and builds a graph representation of pages and their links.
transition_model(corpus, page, damping_factor): Returns the probability distribution of the next page based on the current page and the damping factor.
sample_pagerank(corpus, damping_factor, n): Calculates PageRank using the random sampling method.
iterate_pagerank(corpus, damping_factor): Calculates PageRank using an iterative method until convergence.

For detailed information, refer to the pagerank.py script.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PageRank

PageRank Algorithm

Table of Contents

Project Overview

How PageRank Works

Random Surfer Model

Iterative Approach

Installation

Usage

Example

Output

Code Structure

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
corpus0		corpus0
corpus1		corpus1
corpus2		corpus2
README.md		README.md
pagerank.py		pagerank.py

SJAlanA/PageRank

Folders and files

Latest commit

History

Repository files navigation

PageRank

PageRank Algorithm

Table of Contents

Project Overview

How PageRank Works

Random Surfer Model

Iterative Approach

Installation

Usage

Example

Output

Code Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages