Add parallel experiment results
Tim Reichelt committed Aug 11, 2017
1 parent ec830fe commit 833fdfc
Showing 1 changed file with 19 additions and 3 deletions.
22 changes: 19 additions & 3 deletions precc/downloadsplit.sh
@@ -2,6 +2,22 @@

# TODO: Rename langsplit into langdetect and split into langsplit.

+# NOTE: We can potentially speed up the download by using "parallel --keep-order curl"
+# instead of "xargs curl".
+# Preliminary experiments with a batch size of 50:
+#
+# Non-parallel
+# real 133m46.524s
+# user 340m41.548s
+# sys 4m6.260s
+#
+# using "parallel --keep-order -j10 curl -s"
+# real 139m9.633s
+# user 343m37.816s
+# sys 5m20.612s
+
+
+
set -e
set -o pipefail

@@ -14,9 +30,9 @@ LIBDIR=${SCRIPTDIR}/lib
DONEFILE="${BATCH_PATH}/download.done"

if [[ ! -f ${DONEFILE} ]]; then
-# NOTE: We can potentially speed up the download by using "parallel --keep-order curl"
-# instead of "xargs curl".
-cat "${BATCH_PATH}/wet.paths.${BATCH_ID}" | xargs curl -s | gzip -cd | \
+cat "${BATCH_PATH}/wet.paths.${BATCH_ID}" | \
+parallel --keep-order -j10 curl -s | \
+gzip -cd | \
${LIBDIR}/read_wet.py | \
${LIBDIR}/langsplit --printchunks 2> /dev/null | \
${LIBDIR}/split_languages.py --outdir "${BATCH_PATH}"
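
The change above swaps "xargs curl -s" for "parallel --keep-order -j10 curl -s". The two are not quite equivalent: by default xargs packs many arguments into a single curl invocation, while GNU parallel starts one job per input line, and --keep-order makes it print each job's output in input order, so the concatenated stream reaches "gzip -cd" in the same sequence as the paths file. A minimal sketch of that ordering behaviour follows. The helper name run_in_order is hypothetical and not part of downloadsplit.sh, and echo stands in for curl so the example needs no network access; the sequential fallback is an assumption for machines without GNU parallel.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of downloadsplit.sh): run a command once per
# input line and emit the results in the order of the input lines.
run_in_order() {
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # GNU parallel: up to 10 jobs at once; --keep-order buffers each
        # job's output and prints it in input order.
        parallel --keep-order -j10 "$@"
    else
        # Assumed fallback: one sequential invocation per input line,
        # which preserves order trivially.
        xargs -n 1 "$@"
    fi
}

# Usage with a harmless command in place of "curl -s":
# prints c, a, b on separate lines, matching the input order.
printf 'c\na\nb\n' | run_in_order echo
```

Wrapping each variant of the pipeline in "time" would be one way to reproduce a comparison like the one recorded in the header comment.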