UnicodeDecodeError: ... Efficient way to debug the dataset with streaming? #820

Open

TAYmit opened this issue Nov 1, 2024 · 4 comments
Labels: enhancement (New feature or request)

TAYmit commented Nov 1, 2024

After training for approximately 30,000 batches with streaming, I encountered this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte

The dataset is around 200 GB. Is there an efficient way to debug the dataset, or any try-catch approach I could use within the streaming process to handle this error?
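
For reference, here is a rough sketch of the kind of try/except approach I mean: index into the written shards directly and report any sample that fails to decode (the path and arguments are placeholders for my setup):

from streaming import StreamingDataset

# Point at the local shard directory written by MDSWriter (placeholder path).
dataset = StreamingDataset(local='/data/shards/', shuffle=False)

bad = []
for idx in range(len(dataset)):
    try:
        dataset[idx]  # decoding the sample is where the UnicodeDecodeError shows up
    except UnicodeDecodeError as e:
        print(f'Sample {idx} failed to decode: {e}')
        bad.append(idx)

print(f'{len(bad)} bad samples out of {len(dataset)}')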

Thanks!

TAYmit added the enhancement label Nov 1, 2024

TAYmit commented Nov 2, 2024

I tried to ensure that my data is in UTF-8 format. The shards were written with the following MDSWriter script:



import tqdm
from streaming import MDSWriter

# Local or remote directory path to store the output compressed files.
out_root = "/data/shards/"

# Column names and their MDS encodings.
columns = {
    'number': 'int',
    'texts': 'str',
}

# Compression algorithm to use for the dataset.
compression = 'zstd:12'

# Hashing algorithms to use for the dataset.
hashes = ['sha1', 'xxh3_64']

# Shard size limit, in bytes (7 << 30 = 7 GB per shard).
size_limit = 7 << 30

print(f'Saving dataset (to {out_root})...')

with MDSWriter(out=out_root, columns=columns, compression=compression,
               hashes=hashes, size_limit=size_limit) as out:
    with open('data.txt', encoding='utf-8') as infile:
        for i, text in enumerate(tqdm.tqdm(infile)):
            # 'texts' is declared as 'str', so write a Python str (not bytes).
            sample = {'number': i, 'texts': text.strip()}
            out.write(sample)
But still, after about 6,000 batches, I get the following error:

'utf-8' codec can't decode byte 0xd3 in position 0: invalid continuation byte

Not sure how to proceed from here.

===============================================================================
Update

I spent a lot of time inspecting my dataset and confirmed that it is valid UTF-8. I then tried loading the exact line where the error occurred and found that it’s actually encoded in UTF-8, but for some reason, StreamingDataset still threw an error.
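
(For reference, the check on the raw file was along these lines: a rough sketch that reads the file in binary and reports any line that does not decode as UTF-8.)

# Scan the raw text file for lines that are not valid UTF-8.
with open('data.txt', 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as e:
            print(f'Line {lineno} is not valid UTF-8: {e}')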

I suspected that the size_limit might be causing an issue (either directly or indirectly), so I reduced it when re-creating the shards with MDSWriter (i.e., size_limit = 4 << 30 rather than 7 << 30). After making this change, StreamingDataset worked fine.

For anyone encountering a similar issue, try reducing the size_limit when creating a sharded dataset with MDSWriter, as in the sketch below.
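
Concretely, the only change was the shard size limit passed to MDSWriter; everything else in the writer script above stayed the same:

# Smaller shard size limit: 4 GB per shard instead of 7 GB.
size_limit = 4 << 30

with MDSWriter(out=out_root, columns=columns, compression=compression,
               hashes=hashes, size_limit=size_limit) as out:
    ...  # same write loop as before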

Summary:

Unexpected Unicode error while reading samples from a StreamingDataset. No issue was found in the raw dataset. Solved the problem by re-creating the sharded dataset with size_limit = 4 << 30 rather than 7 << 30.

Closing this.

TAYmit closed this as completed Nov 2, 2024
snarayan21 (Collaborator) commented:

Hey @TAYmit, that seems indicative of a bug on our side. Would it be possible for you to share a shard file or a small repro of this behavior with us?

snarayan21 reopened this Nov 3, 2024

TAYmit commented Nov 5, 2024

Hello,

The dataset is quite large (around 200GB in total), and I haven't yet tested it with a smaller dataset. I'll try working with a smaller subset (ideally under 10GB) over the next few days to see if I can reproduce the issue.
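
(Rough sketch of how I plan to cut that subset; the file names are placeholders. The idea is to copy roughly the first 10 GB of lines into a new file and re-run the same MDSWriter script against it.)

limit = 10 * 1024 ** 3  # ~10 GB of raw text
written = 0
with open('data.txt', encoding='utf-8') as src, \
        open('data_small.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)
        written += len(line.encode('utf-8'))
        if written >= limit:
            break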

snarayan21 (Collaborator) commented:

Hey @TAYmit, any luck repro-ing on a smaller dataset?
