TypeError: Couldn't cast array of type #66

Open
shizhediao opened this issue Aug 28, 2024 · 2 comments

shizhediao commented Aug 28, 2024

Hi,

When I ran the following command to download the dataset from the Hugging Face Hub, I encountered an error:

My command:

from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-pool-400m-1x")

The error:

File /lustre/fsw/portfolios/table.py:2122, in cast_array_to_feature(array, feature, allow_primitive_to_str, allow_decimal_to_str)
   2116     return array_cast(
   2117         array,
   2118         feature(),
   2119         allow_primitive_to_str=allow_primitive_to_str,
   2120         allow_decimal_to_str=allow_decimal_to_str,
   2121     )
-> 2122 raise TypeError(f"Couldn't cast array of type\n{_short_str(array.type)}\nto\n{_short_str(feature)}")

TypeError: Couldn't cast array of type
struct<WARC-Type: string, WARC-Date: timestamp[s], WARC-Record-ID: string, Content-Length: string, Content-Type: string, WARC-Warcinfo-ID: string, WARC-Concurrent-To: string, WARC-IP-Address: string, WARC-Target-URI: string, WARC-Payload-Digest: string, WARC-Block-Digest: string, WARC-Identified-Payload-Type: string>
to
{'WARC-Type': Value(dtype='string', id=None), 'WARC-Date': Value(dtype='timestamp[s]', id=None), 'WARC-Record-ID': Value(dtype='string', id=None), 'Content-Length': Value(dtype='string', id=None), 'Content-Type': Value(dtype='string', id=None), 'WARC-Warcinfo-ID': Value(dtype='string', id=None), 'WARC-Concurrent-To': Value(dtype='string', id=None), 'WARC-IP-Address': Value(dtype='string', id=None), 'WARC-Target-URI': Value(dtype='string', id=None), 'WARC-Payload-Digest': Value(dtype='string', id=None), 'WARC-Block-Digest': Value(dtype='string', id=None), 'WARC-Identified-Payload-Type': Value(dtype='string', id=None), 'WARC-Truncated': Value(dtype='string', id=None)}

The above exception was the direct cause of the following exception:

Could you help take a look? Thanks!

@GeorgiosSmyrnis
Contributor

Hi @shizhediao ,

This error occurs because the metadata in some of the jsonl records includes the WARC-Truncated field, an optional field that appears only in some WARCs. I will look into how this can be resolved so that the dataset can be loaded with load_dataset. In the meantime, I would recommend simply downloading the dataset from HF and using the jsonl.zst files separately.
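To illustrate the schema mismatch described here, the following is a minimal sketch (not a confirmed fix for load_dataset) of padding the optional field so that every record exposes the same metadata keys. The "metadata" key name, the helper name normalize_metadata, and the sample records are assumptions made for illustration; only WARC-Truncated and the other WARC-* names come from the error message.

```python
from typing import Any, Dict, Tuple

# Optional WARC header named in the error message; extend if others turn up.
OPTIONAL_WARC_KEYS: Tuple[str, ...] = ("WARC-Truncated",)


def normalize_metadata(
    record: Dict[str, Any],
    metadata_key: str = "metadata",  # assumed field name, not confirmed by the thread
    optional_keys: Tuple[str, ...] = OPTIONAL_WARC_KEYS,
) -> Dict[str, Any]:
    """Return a copy of `record` whose metadata contains every optional key.

    Missing optional keys are filled with None, so all records share one
    key set. The input record is left unmodified.
    """
    meta = dict(record.get(metadata_key) or {})
    for key in optional_keys:
        meta.setdefault(key, None)
    out = dict(record)
    out[metadata_key] = meta
    return out


# Hypothetical records: one without and one with the optional field.
records = [
    {"text": "a", "metadata": {"WARC-Type": "response"}},
    {"text": "b", "metadata": {"WARC-Type": "response", "WARC-Truncated": "length"}},
]
normalized = [normalize_metadata(r) for r in records]

# After normalization there is exactly one distinct set of metadata keys.
key_sets = {frozenset(r["metadata"]) for r in normalized}
```

Filling with None rather than dropping the field keeps the truncation information for the records that have it.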

However, I also want to point out that the dclm-pool-400m-1x dataset (like the other pool datasets) is not intended to be used directly for training: these datasets contain only very minimal processing and are intended to be processed further. As such, I would recommend processing each jsonl file individually first (with our pipeline and/or your own implementations).
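As a rough sketch of what processing one jsonl file at a time could look like, the snippet below streams records line by line from any text stream. It is not the project's pipeline: the helper name iter_jsonl and the sample field names are hypothetical, and an in-memory stream stands in for a decompressed shard. Reading an actual .jsonl.zst file would additionally need a Zstandard decompressor (for example the third-party zstandard package) wrapped around the file handle.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Stand-in for one decompressed shard; field names are illustrative only.
sample = io.StringIO(
    '{"text": "a", "metadata": {"WARC-Type": "response"}}\n'
    "\n"
    '{"text": "b", "metadata": {"WARC-Type": "response", "WARC-Truncated": "length"}}\n'
)

records = list(iter_jsonl(sample))

# Example per-record step: find records that carry the optional field.
truncated = [r for r in records if "WARC-Truncated" in r["metadata"]]
```

Because each line is parsed independently, records with and without optional fields can coexist without any up-front schema inference.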

@shizhediao
Author

Thank you for your explanation! I would like to clean the pool datasets. Looking forward to the solutions! I will use the jsonl.zst for now. Thanks!
