You will need to install Kenneth Heafield's preprocessing tools (this is a fork). To run the scripts locally, you will have to change some variables inside the scripts.
Deduping is done with the commoncrawl_dedupe executable from Kenneth Heafield's preprocessing tools (this is a fork). The script dedupe.sh is simply a wrapper around it. The deduper uses a hash table to detect duplicates. By saving the hash table to disk we can avoid reading and decompressing the previously deduped files; the commoncrawl_dedupe_save_table executable makes this possible.
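As a purely illustrative sketch (this is not what commoncrawl_dedupe does internally, and the file names are just examples), line-level deduplication with an in-memory hash table amounts to:
xzcat en.raw.2017_17.xz | awk '!seen[$0]++' | xz > en.deduped.xz
The awk array seen plays the role of the hash table; saving that table to disk is what lets later crawls be deduped against earlier ones without touching the old data.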
If all of the raw data for one language is too big to fit into memory, we have to shard the raw data into multiple files; in practice this is usually only needed for English. Before sharding we do some minor preprocessing of the raw data: remove lines containing the document delimiter hash (df6fa1abb58549287111ba8d776733e9), strip leading and trailing whitespace, and remove lines with invalid UTF-8.
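A rough shell approximation of that preprocessing (illustrative only; the real scripts rely on the executables from the preprocessing tools, and the file name is just an example):
xzcat en.raw.2017_17.xz | \
grep -v df6fa1abb58549287111ba8d776733e9 | \
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' | \
LC_ALL=en_US.UTF-8 grep -ax '.*' > en.cleaned
The last grep keeps only lines that are entirely valid UTF-8 (it needs a UTF-8 locale to work).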
For each script below I indicate which executables from the preprocessing tools it uses.
Uses: commoncrawl_dedupe
Let's assume that all the raw files you want to dedupe are in /path/to/raw and are named ${language_code}.raw.2017_17.xz, that you want to store the new deduped files at /path/to/deduped, and that each language already has a deduped file named /path/to/${language_code}.deduped.xz. Then deduping all languages in parallel can be done with:
cat language.codes | parallel ./dedupe.sh /path/to/raw/{}.raw.2017_17.xz /path/to/deduped {} /path/to/{}.deduped.xz
language.codes is the same file as the one used here. The command also works if some languages do not yet have a deduplicated file at /path/to/${language_code}.deduped.xz.
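For a single language, say en, the parallel command above expands to:
./dedupe.sh /path/to/raw/en.raw.2017_17.xz /path/to/deduped en /path/to/en.deduped.xz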
Uses: commoncrawl_dedupe_save_table
Assume we have the same setup as in the previous section and that you want to store the hash table at /path/to/table1. Then you can run:
cat language.codes | \
parallel ./dedupe_hash_table.sh /path/to/raw/{}.raw.2017_17.xz /path/to/deduped {} /dev/null /path/to/table1 /path/to/{}.deduped.xz
Then you can reuse the hash table to dedupe the next crawl as follows:
cat language.codes | \
parallel ./dedupe_hash_table.sh /path/to/raw/{}.raw.2017_30.xz /path/to/deduped {} /path/to/table1 /path/to/table2
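For a single language, say de, these two runs expand to (this is just parallel substituting {}; my reading of the two examples is that the arguments are: raw input, output directory, language code, hash table to load, with /dev/null meaning none, hash table to save, and optionally the previously deduped file):
./dedupe_hash_table.sh /path/to/raw/de.raw.2017_17.xz /path/to/deduped de /dev/null /path/to/table1 /path/to/de.deduped.xz
./dedupe_hash_table.sh /path/to/raw/de.raw.2017_30.xz /path/to/deduped de /path/to/table1 /path/to/table2
The second run no longer needs the previously deduped file, since its hashes are already stored in table1.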
Uses: commoncrawl_dedupe, shard_fifo
NOTE: By default, the sharding assumes that we are working on English data and shards into 100 files. However, it should be trivial to change the script to take the language code as an argument.
Sharding the files:
./shard_fifo.sh /path/to/raw_files /path/to/fifos
Then open a new shell on the same machine and run:
seq 0 99 | parallel -j100 ./compress_shard.sh {} /path/to/fifos /path/to/shards
Here /path/to/shards is the output directory. It is important that you start all 100 jobs at once and that all of them run on the same machine, otherwise the FIFOs will not work.
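The same-machine requirement comes from the fact that the sharding goes through named pipes: shard_fifo.sh writes into FIFOs and each compress_shard.sh job reads one of them. A minimal sketch of that pattern (not the actual scripts; the fifo_${i} names are made up):
# create the named pipes
mkdir -p /path/to/fifos
for i in $(seq 0 99); do mkfifo /path/to/fifos/fifo_${i}; done
# each reader job compresses one FIFO as data arrives, roughly:
xz < /path/to/fifos/fifo_42 > /path/to/shards/shard_42.xz
# opening a FIFO for writing blocks until a reader has opened it too,
# which is why all 100 compress jobs must be running at the same time.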
Now dedupe the sharded files:
seq 0 99 | parallel ./dedupe_from_shard.sh {} /path/to/shards /path/to/previous_deduped_files /path/to/outdir
The deduper outputs only the new lines it sees. This means we want to concatenate the deduper's output to the already existing deduped file and record the offset. This can be done with update_deduped_data.sh.
./update_deduped_data.sh ${old_deduped_file} ${new_deduped_file} ${offset_file} ${crawl_id}
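A minimal sketch of what that update amounts to (illustrative only; the offset file format here is made up, the offset is a line count which may not match what the real script records, and the real logic lives in update_deduped_data.sh):
# record at which line of the combined file the new crawl's data starts
echo "${crawl_id} $(xzcat ${old_deduped_file} | wc -l)" >> ${offset_file}
# xz streams can simply be concatenated, so appending keeps the file valid
cat ${new_deduped_file} >> ${old_deduped_file}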