This error occurs because some of the metadata fields in the jsonl files contain the WARC-Truncated field, an optional field that appears in only some WARCs. I will look into how to resolve this so that the dataset can be loaded with load_dataset. In the meantime, I recommend downloading the dataset from HF and using the jsonl.zst files separately.
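As a rough illustration of that workaround, the sketch below downloads only the `*.jsonl.zst` shards and streams each one record by record, so an optional key such as WARC-Truncated is just another dictionary entry. It assumes the third-party `huggingface_hub` and `zstandard` packages are installed; the helper names (`iter_jsonl`, `iter_jsonl_zst`, `download_shards`) are hypothetical, and the repository ID and local directory must be supplied by the caller.

```python
import io
import json
from pathlib import Path
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_jsonl_zst(path: Path) -> Iterator[dict]:
    """Stream records from one .jsonl.zst shard without decompressing it to disk."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


def download_shards(repo_id: str, local_dir: str) -> list[Path]:
    """Download only the .jsonl.zst files of a dataset repo and return their paths."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    root = snapshot_download(
        repo_id=repo_id,          # assumption: pass the dataset's Hub repo ID here
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=["*.jsonl.zst"],
    )
    return sorted(Path(root).rglob("*.jsonl.zst"))
```

A typical loop would call `download_shards(...)` once and then iterate `iter_jsonl_zst(shard)` for each returned path. Reading with plain `json` avoids committing to a single fixed schema across shards, so records with and without the optional field can be handled side by side.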
However, I also want to point out that the dclm-pool-400m-1x dataset (and the other pool datasets) are not intended to be used directly for training. They have undergone only minimal processing and are meant to be processed further. As such, I recommend first processing each jsonl file individually (with our pipeline and/or your own implementation).
Hi,
When I ran the following command to download the dataset from the Hugging Face Hub, I encountered an error:
My command:
The error:
Could you help take a look? Thanks!