
Pipeline Overview

Ben Steel edited this page Oct 15, 2021 · 2 revisions

Fast, Low Volume Pipeline

Using search engines for quick experiments

  • Search engine APIs such as Google's and Bing's have limited free tiers, so they are only suitable for quick analysis
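One way to stay inside a free tier during quick experiments is to cap and cache queries on the client side. The sketch below is a minimal illustration, not any provider's API: `BudgetedSearch`, `QuotaExceeded`, `search_fn`, and `max_queries` are hypothetical names, and `search_fn` stands in for whatever function actually calls the search engine.

```python
from typing import Callable, Dict, List


class QuotaExceeded(Exception):
    """Raised when the assumed free-tier query budget has been used up."""


class BudgetedSearch:
    """Wraps a search function with a query budget and an in-memory cache.

    `search_fn` is a hypothetical callable that takes a query string and
    returns a list of result URLs; it is an assumption of this sketch.
    """

    def __init__(self, search_fn: Callable[[str], List[str]], max_queries: int):
        self.search_fn = search_fn
        self.max_queries = max_queries
        self.used = 0
        self.cache: Dict[str, List[str]] = {}

    def search(self, query: str) -> List[str]:
        # Repeated queries are served from the cache and cost nothing.
        if query in self.cache:
            return self.cache[query]
        # Refuse to call the API once the budget is spent.
        if self.used >= self.max_queries:
            raise QuotaExceeded(f"budget of {self.max_queries} queries used")
        self.used += 1
        results = self.search_fn(query)
        self.cache[query] = results
        return results
```

Setting `max_queries` to the provider's documented free allowance keeps an experiment from silently running into billing or rate-limit errors.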

Slow, High Volume Pipeline

Using self-hosted analytics engines to search Common Crawl for full analysis

  • Common Crawl maintains a PySpark library (cc-pyspark) for processing Common Crawl data
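A high-volume job over crawl data usually has a map step that emits key/count pairs per record and a reduce step that sums them. The sketch below shows that shape in plain Python for counting pages per host; it does not use the PySpark library's API, and `map_record` and `count_hosts` are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple
from urllib.parse import urlsplit


def map_record(url: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (host, 1) for a record's URL, skipping unparsable ones."""
    host = urlsplit(url).hostname
    if host:
        yield host, 1


def count_hosts(urls: Iterable[str]) -> Counter:
    """Reduce step: sum the emitted counts per host."""
    counts: Counter = Counter()
    for url in urls:
        for host, n in map_record(url):
            counts[host] += n
    return counts
```

In a Spark job the same two steps would run distributed across the cluster, with the reduce performed per key rather than in a single loop.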