UnicodeDecodeError: ... Efficient way to debug the dataset with streaming? #820
Comments
I tried to ensure that my data is in UTF-8 format.
But still, after roughly 6,000 batches of training, I get the following error:
Not sure how to proceed from here.

===============================================================================

I spent a lot of time inspecting my dataset and confirmed that it is valid UTF-8. I then tried loading the exact line where the error occurred and found that it's actually encoded in UTF-8, but for some reason, I suspected that the

For anyone encountering a similar issue, try reducing the

Summary: Unexpected unicode error during the

Closing this.
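The inspection step described above (finding the exact file and byte where decoding fails) can be sketched as a small scanner. This is a minimal sketch, not part of the library; `find_invalid_utf8` and the `*.txt` glob pattern are assumptions, so point it at whatever files make up your shards:

```python
import pathlib

def find_invalid_utf8(root: str, pattern: str = "*.txt"):
    """Yield (path, byte_offset) of the first invalid UTF-8 byte in each file.

    Files that decode cleanly are skipped. For very large files you may want
    to read and decode in chunks instead of loading the whole file at once.
    """
    for path in pathlib.Path(root).rglob(pattern):
        data = path.read_bytes()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as e:
            # e.start is the offset of the offending byte, e.object[e.start]
            # is the byte itself (e.g. 0xfb in the error above).
            yield path, e.start
```

Running this over the dataset directory narrows a 200 GB corpus down to the specific files and offsets that would trip the loader.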
Hey @TAYmit, that seems indicative of a bug on our side. Would it be possible for you to share a shard file or a small repro of this behavior with us?
Hello, the dataset is quite large (around 200 GB in total), and I haven't yet tested it with a smaller dataset. I'll try working with a smaller subset (ideally under 10 GB) over the next few days to see if I can reproduce the issue.
Hey @TAYmit, any luck repro-ing on a smaller dataset?
After training for approximately 30,000 batches in streaming mode, I encountered this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfb in position 2: invalid start byte
The dataset is around 200 GB. Is there an efficient way to debug the dataset, or a try/except approach I could use within the streaming process to handle this error?
Thanks!
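One generic try/except approach is to wrap the sample iterator and skip items that fail to decode. This is a minimal sketch, assuming your loader yields samples from an iterable; `skip_bad_samples` is a hypothetical helper, not part of any library. Note the caveat in the comments: it only helps if the underlying iterator survives the exception.

```python
def skip_bad_samples(stream):
    """Wrap a sample iterator, skipping items that raise UnicodeDecodeError.

    Caveat: this only works if the underlying iterator can keep producing
    samples after raising. A plain Python generator is exhausted by an
    uncaught exception inside it, so for generators you would need to catch
    the error closer to the decode call instead.
    """
    it = iter(stream)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except UnicodeDecodeError as e:
            # Log and move on rather than aborting the whole run.
            print(f"skipping corrupt sample: {e}")
```

Skipping hides the symptom rather than fixing the data, so it is best paired with logging the offending positions and repairing the shards offline.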