- `download`, `raw`, `deduped`: Scripts for downloading the data and creating `.raw.xz` and `.deduped.xz` files, respectively. They are largely based on Christian's pipeline.
- `s3`: Scripts for uploading the local CommonCrawl data to AWS.
- `precc`: A command-line application that automates the CommonCrawl processing pipeline. It is a wrapper around several scripts, which can also be run separately.
- `language_lists`: Files that each contain a list of language codes. They are used extensively in the pipeline.
- `LOCATIONS.md`: Contains information on where the CommonCrawl data is located on Valhalla.
- `TODO.md`: List of things that I did not manage to finish.
treigerm/CommonCrawlProcessing