UnicodeDecodeError: ... Efficient way to debug the dataset with streaming? #820
Comments
I tried to ensure that my data is in UTF-8 format.
But still, after roughly 6,000 batches of training, I get the following error:
Not sure how to proceed from here.

===============================================================================

I spent a lot of time inspecting my dataset and confirmed that it is valid UTF-8. I then tried loading the exact line where the error occurred and found that it's actually encoded in UTF-8, but for some reason, I suspected that the

For anyone encountering a similar issue, try reducing the

Summary: Unexpected unicode error during the

Closing this.
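The inspection step described above (finding the exact file and byte where decoding fails) can be sketched as a small scanner. This is a minimal sketch, not part of the library; `find_invalid_utf8` and the `*.txt` glob pattern are assumptions, so point it at whatever files make up your shards:

```python
import pathlib

def find_invalid_utf8(root: str, pattern: str = "*.txt"):
    """Yield (path, byte_offset) of the first invalid UTF-8 byte in each file.

    Files that decode cleanly are skipped. For very large files you may want
    to read and decode in chunks instead of loading the whole file at once.
    """
    for path in pathlib.Path(root).rglob(pattern):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as e:
            # e.start is the offset of the offending byte, e.object[e.start]
            # is the byte itself (e.g. 0xfb in the error above).
            yield path, e.start
```

Running this over the dataset directory narrows a 200 GB corpus down to the specific files and offsets that would trip the loader.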
Hey @TAYmit, that seems indicative of a bug on our side. Would it be possible for you to share a shard file or a small repro of this behavior with us?
Hello, the dataset is quite large (around 200 GB in total), and I haven't yet tested it with a smaller dataset. I'll try working with a smaller subset (ideally under 10 GB) over the next few days to see if I can reproduce the issue.
Hey @TAYmit, any luck repro-ing on a smaller dataset?
After training for approximately 30,000 batches in streaming mode, I encountered this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte
The dataset is around 200 GB. Is there an efficient way to debug the dataset, or a try/except approach I could use within the streaming process to handle this error?
Thanks!
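One generic try/except approach is to wrap the sample iterator and skip items that fail to decode. This is a minimal sketch, assuming your loader yields samples from an iterable; `skip_bad_samples` is a hypothetical helper, not part of any library. Note the caveat in the comments: it only helps if the underlying iterator survives the exception.

```python
def skip_bad_samples(stream):
    """Wrap a sample iterator, skipping items that raise UnicodeDecodeError.

    Caveat: this only works if the underlying iterator can keep producing
    samples after raising. A plain Python generator is exhausted by an
    uncaught exception inside it, so for generators you would need to catch
    the error closer to the decode call instead.
    """
    it = iter(stream)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except UnicodeDecodeError as e:
            # Log and move on rather than aborting the whole run.
            print(f"skipping corrupt sample: {e}")
```

Skipping hides the symptom rather than fixing the data, so it is best paired with logging the offending positions and repairing the shards offline.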