Python implementation of the PageRank algorithm. The script supports both Monte Carlo sampling and iterative convergence to estimate the rank of every HTML page in a corpus.
- Uses a transition model that mixes random jumps with outgoing links via the damping factor.
- Sampling simulates surfers for `n` steps to approximate probabilities.
- Iterative updates recompute ranks until the maximum change is below a small threshold.
- Initialization assigns equal probability to every page. Dangling pages (pages with no outgoing links) are treated as linking uniformly to every page in the corpus.
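- The iterative update applies PR(p) = (1 - d)/N + d × Σ[PR(q)/|L(q)|], summed over every page q that links to p, where d is the damping factor, N is the number of pages, and |L(q)| is the number of links on q. It stops once the largest change in any rank is under 0.001.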
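
The steps above can be sketched roughly as follows. This is a minimal illustration, not necessarily the code in `pagerank.py`: the function names, the `corpus` shape (a dict mapping each page name to the set of pages it links to), and the damping value of 0.85 are all assumptions made for the example.

```python
import random

DAMPING = 0.85  # assumed value; the README only refers to "the damping factor"


def transition_model(corpus, page, damping=DAMPING):
    """Return {page: probability} for the surfer's next step from `page`."""
    n = len(corpus)
    # A dangling page is treated as linking to every page in the corpus.
    links = corpus[page] or set(corpus)
    probs = {p: (1 - damping) / n for p in corpus}  # random-jump share
    for target in links:
        probs[target] += damping / len(links)       # follow-a-link share
    return probs


def sample_pagerank(corpus, n, damping=DAMPING, seed=None):
    """Estimate ranks by walking one random surfer for n steps."""
    rng = random.Random(seed)
    pages = sorted(corpus)
    counts = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(n):
        counts[page] += 1
        model = transition_model(corpus, page, damping)
        page = rng.choices(pages, weights=[model[p] for p in pages])[0]
    return {p: c / n for p, c in counts.items()}


def iterate_pagerank(corpus, damping=DAMPING, tol=0.001):
    """Apply PR(p) = (1 - d)/N + d * sum(PR(q)/|L(q)|) until ranks settle."""
    n = len(corpus)
    ranks = {p: 1 / n for p in corpus}  # equal probability to start
    links = {p: (targets or set(corpus)) for p, targets in corpus.items()}
    while True:
        new_ranks = {
            p: (1 - damping) / n
            + damping * sum(ranks[q] / len(links[q]) for q in corpus if p in links[q])
            for p in corpus
        }
        delta = max(abs(new_ranks[p] - ranks[p]) for p in corpus)
        ranks = new_ranks
        if delta < tol:  # largest change is under the threshold
            return ranks


# Hypothetical three-page corpus used only for illustration.
example = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": {"2.html"},
}
print(sample_pagerank(example, 10000, seed=0))
print(iterate_pagerank(example))
```

Both estimates should sum to 1 and agree closely. The sampling result varies with the seed and the number of steps, while the iterative result depends only on the threshold.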
`python pagerank.py corpus`

Replace `corpus` with any folder that contains the HTML pages you want to analyse. The program prints PageRank estimates from sampling and from the iterative method.
- Corpora must consist of `.html` files with valid `<a href="...">` links.
- For large archives (e.g., WARC extractions), keep a lookup table that maps original URLs to the filenames you store locally so that links can be resolved before running `pagerank.py`.
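
One way to resolve links through such a lookup table is sketched below. It is an illustration only: `LinkCollector`, `extract_links`, `build_corpus`, and the `url_to_file` dict are hypothetical names, and dropping self-links and links that leave the corpus is an assumption rather than documented behaviour of `pagerank.py`.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html_text):
    """Return the list of href targets found in one HTML document."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.hrefs


def build_corpus(directory, url_to_file=None):
    """Map each .html filename in `directory` to the set of pages it links to.

    `url_to_file` (hypothetical) maps original URLs to local filenames so that
    absolute links from an archive resolve to files in the folder.
    """
    url_to_file = url_to_file or {}
    pages = {}
    for path in Path(directory).glob("*.html"):
        text = path.read_text(encoding="utf-8", errors="replace")
        pages[path.name] = {url_to_file.get(h, h) for h in extract_links(text)}
    # Assumption: keep only links to other pages that exist in the corpus.
    return {
        name: {t for t in targets if t in pages and t != name}
        for name, targets in pages.items()
    }
```

For example, `build_corpus("corpus", {"https://example.org/about": "about.html"})` would treat a link to that URL as a link to the local file `about.html`.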