
dataloader cannot process mp3 files #21

Open
ISzoke opened this issue Jan 31, 2024 · 6 comments

@ISzoke
Collaborator

ISzoke commented Jan 31, 2024

No description provided.

@Lakoc
Collaborator

Lakoc commented Feb 1, 2024

@ISzoke, could you please provide a log of the error? I think it is related to libsndfile or some other underlying audio-manipulation library.

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for audio_folder_vad contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at /mnt/proj1/open-28-57/szoke/huggingface_asr/src/dataset_builders/audio_folder_vad/audio_folder_vad.py
You can avoid this message in future by passing the argument trust_remote_code=True.
Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
warnings.warn(
/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1726 in _prepare_split_single                    │
│                                                                              │
│   1723 │   │   │   )                                                         │
│   1724 │   │   │   try:                                                      │
│   1725 │   │   │   │   _time = time.time()                                   │
│ ❱ 1726 │   │   │   │   for key, record in generator:                         │
│   1727 │   │   │   │   │   if max_shard_size is not None and writer._num_byt │
│   1728 │   │   │   │   │   │   num_examples, num_bytes = writer.finalize()   │
│   1729 │   │   │   │   │   │   writer.close()                                │
│                                                                              │
│ /scratch/project/open-28-57/szoke/huggingface_cache/modules/datasets_modules │
│ /datasets/audio_folder_vad/b49d605da8728ad455f8b4c54bdbb7ecd66d893e3e73b39d5 │
│ 0940697cffaada1/audio_folder_vad.py:93 in _generate_examples                 │
│                                                                              │
│    90 │   │   │   │   chunk = waveform[:, int(segment.start * sample_rate) : │
│    91 │   │   │   │   yield f"{example_id}_{segment.start:.2f}_{segment.end: │
│    92 │   │   │   │   │   **example,                                         │
│ ❱  93 │   │   │   │   │   "audio": audio_encoder.encode_example({"array": ch │
│    94 │   │   │   │   │   "input_len": len(chunk) / self.sampling_rate,      │
│    95 │   │   │   │   }                                                      │
│    96                                                                        │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/features/audio.py:98 in encode_example                      │
│                                                                              │
│    95 │   │   elif "array" in value:                                         │
│    96 │   │   │   # convert the audio array to wav bytes                     │
│    97 │   │   │   buffer = BytesIO()                                         │
│ ❱  98 │   │   │   sf.write(buffer, value["array"], value["sampling_rate"], f │
│    99 │   │   │   return {"bytes": buffer.getvalue(), "path": None}          │
│   100 │   │   elif value.get("path") is not None and os.path.isfile(value["p │
│   101 │   │   │   # we set "bytes": None to not duplicate the data if they'r │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:343 in write                                            │
│                                                                              │
│    340 │   │   channels = 1                                                  │
│    341 │   else:                                                             │
│    342 │   │   channels = data.shape[1]                                      │
│ ❱  343 │   with SoundFile(file, 'w', samplerate, channels,                   │
│    344 │   │   │   │      subtype, endian, format, closefd) as f:            │
│    345 │   │   f.write(data)                                                 │
│    346                                                                       │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:658 in __init__                                         │
│                                                                              │
│    655 │   │   self._mode = mode                                             │
│    656 │   │   self._info = _create_info_struct(file, mode, samplerate, chan │
│    657 │   │   │   │   │   │   │   │   │   │    format, subtype, endian)     │
│ ❱  658 │   │   self._file = self._open(file, mode_int, closefd)              │
│    659 │   │   if set(mode).issuperset('r+') and self.seekable():            │
│    660 │   │   │   # Move write position to 0 (like in Python file objects)  │
│    661 │   │   │   self.seek(0)                                              │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:1216 in _open                                           │
│                                                                              │
│   1213 │   │   if file_ptr == _ffi.NULL:                                     │
│   1214 │   │   │   # get the actual error code                               │
│   1215 │   │   │   err = _snd.sf_error(file_ptr)                             │
│ ❱ 1216 │   │   │   raise LibsndfileError(err, prefix="Error opening {0!r}: " │
│   1217 │   │   if mode_int == _snd.SFM_WRITE:                                │
│   1218 │   │   │   # Due to a bug in libsndfile version <= 1.0.25, frames != │
│   1219 │   │   │   # when opening a named pipe in SFM_WRITE mode.            │
╰──────────────────────────────────────────────────────────────────────────────╯
LibsndfileError: Error opening <_io.BytesIO object at 0x2b8580d726b0>: Format 
not recognised.

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/trainers/pretrain_wav2vec2.p │
│ y:27 in <module>                                                             │
│                                                                              │
│   24 │   model_args, data_args, training_args, gen_args = parser.parse_args_ │
│   25 │                                                                       │
│   26 │   # 1. Collect, preprocess dataset and extract evaluation dataset     │
│ ❱ 27 │   dataset = get_dataset(                                              │
│   28 │   │   datasets_creation_config_path=data_args.datasets_creation_confi │
│   29 │   │   dataset_name=data_args.dataset_name,                            │
│   30 │   │   dataset_config=data_args.dataset_config,                        │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/utilities/data_utils.py:464  │
│ in get_dataset                                                               │
│                                                                              │
│   461 ) -> DatasetDict:                                                      │
│   462 │   """Loads single or multiple datasets, preprocess, and merge them." │
│   463 │   if datasets_creation_config_path is not None:                      │
│ ❱ 464 │   │   dataset = load_multiple_datasets(                              │
│   465 │   │   │   config_path=datasets_creation_config_path,                 │
│   466 │   │   │   num_proc=preprocessing_num_workers,                        │
│   467 │   │   │   writer_batch_size=writer_batch_size,                       │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/utilities/data_utils.py:382  │
│ in load_multiple_datasets                                                    │
│                                                                              │
│   379 │   │   │   │   )                                                      │
│   380 │   │   │                                                              │
│   381 │   │   │   else:                                                      │
│ ❱ 382 │   │   │   │   dataset = load_dataset(                                │
│   383 │   │   │   │   │   dataset_config["dataset_name"],                    │
│   384 │   │   │   │   │   keep_in_memory=False,                              │
│   385 │   │   │   │   │   writer_batch_size=writer_batch_size,               │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/load.py:2549 in load_dataset                                │
│                                                                              │
│   2546 │   try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES          │
│   2547 │                                                                     │
│   2548 │   # Download and prepare data                                       │
│ ❱ 2549 │   builder_instance.download_and_prepare(                            │
│   2550 │   │   download_config=download_config,                              │
│   2551 │   │   download_mode=download_mode,                                  │
│   2552 │   │   verification_mode=verification_mode,                          │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1005 in download_and_prepare                     │
│                                                                              │
│   1002 │   │   │   │   │   │   │   prepare_split_kwargs["max_shard_size"] =  │
│   1003 │   │   │   │   │   │   if num_proc is not None:                      │
│   1004 │   │   │   │   │   │   │   prepare_split_kwargs["num_proc"] = num_pr │
│ ❱ 1005 │   │   │   │   │   │   self._download_and_prepare(                   │
│   1006 │   │   │   │   │   │   │   dl_manager=dl_manager,                    │
│   1007 │   │   │   │   │   │   │   verification_mode=verification_mode,      │
│   1008 │   │   │   │   │   │   │   **prepare_split_kwargs,                   │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1767 in _download_and_prepare                    │
│                                                                              │
│   1764 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1765 │                                                                     │
│   1766 │   def _download_and_prepare(self, dl_manager, verification_mode, ** │
│ ❱ 1767 │   │   super()._download_and_prepare(                                │
│   1768 │   │   │   dl_manager,                                               │
│   1769 │   │   │   verification_mode,                                        │
│   1770 │   │   │   check_duplicate_keys=verification_mode == VerificationMod │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1100 in _download_and_prepare                    │
│                                                                              │
│   1097 │   │   │                                                             │
│   1098 │   │   │   try:                                                      │
│   1099 │   │   │   │   # Prepare split will record examples associated to th │
│ ❱ 1100 │   │   │   │   self._prepare_split(split_generator, **prepare_split_ │
│   1101 │   │   │   except OSError as e:                                      │
│   1102 │   │   │   │   raise OSError(                                        │
│   1103 │   │   │   │   │   "Cannot find data file. "                         │
│                                                                              │
│ /scratch/project/open-28-57/szoke/huggingface_cache/modules/datasets_modules │
│ /datasets/audio_folder_vad/b49d605da8728ad455f8b4c54bdbb7ecd66d893e3e73b39d5 │
│ 0940697cffaada1/audio_folder_vad.py:77 in _prepare_split                     │
│                                                                              │
│    74 │   │   max_shard_size: Optional[Union[int, str]] = None,              │
│    75 │   ):                                                                 │
│    76 │   │   set_start_method("spawn")                                      │
│ ❱  77 │   │   super()._prepare_split(split_generator, check_duplicate_keys,  │
│    78 │                                                                      │
│    79 │   def _generate_examples(self, files, metadata_files, split_name, ad │
│    80 │   │   audio_encoder = datasets.Audio(sampling_rate=self.sampling_rat │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1605 in _prepare_split                           │
│                                                                              │
│   1602 │   │   │   gen_kwargs = split_generator.gen_kwargs                   │
│   1603 │   │   │   job_id = 0                                                │
│   1604 │   │   │   with pbar:                                                │
│ ❱ 1605 │   │   │   │   for job_id, done, content in self._prepare_split_sing │
│   1606 │   │   │   │   │   gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_ │
│   1607 │   │   │   │   ):                                                    │
│   1608 │   │   │   │   │   if done:                                          │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1762 in _prepare_split_single                    │
│                                                                              │
│   1759 │   │   │   # Ignore the writer's error for no examples written to th │
│   1760 │   │   │   if isinstance(e, SchemaInferenceError) and e.__context__  │
│   1761 │   │   │   │   e = e.__context__                                     │
│ ❱ 1762 │   │   │   raise DatasetGenerationError("An error occurred while gen │
│   1763 │   │                                                                 │
│   1764 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1765                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset
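
For reference, soundfile expects data shaped (frames, channels) and reads the channel count from data.shape[1], so a channels-first stereo chunk of shape (2, num_samples) is interpreted as an audio signal with num_samples channels, which libsndfile rejects with exactly this "Format not recognised" error. A minimal sketch reproducing that behavior (the channels-first layout of the chunk is an assumption, not taken from the builder code):

```python
# Sketch only: reproduce the soundfile behavior behind "Format not recognised".
# The channels-first chunk layout is an assumption, not taken from the builder code.
from io import BytesIO

import numpy as np
import soundfile as sf

stereo_chunk = np.zeros((2, 44100), dtype=np.float32)  # (channels, samples)

try:
    # soundfile treats shape[1] (= 44100) as the channel count here
    sf.write(BytesIO(), stereo_chunk, 44100, format="WAV")
except sf.LibsndfileError as err:
    print(err)  # Error opening <_io.BytesIO ...>: Format not recognised.

# The same data written as (samples, channels) is accepted:
sf.write(BytesIO(), stereo_chunk.T, 44100, format="WAV")
```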

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

@Lakoc We need to output the filename which causes this error.
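
A hypothetical sketch of what that could look like: wrap the per-file processing so the failing path is attached to the re-raised error (the function and callback names here are illustrative, not the actual _generate_examples code):

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


# Hypothetical wrapper (illustrative names): re-raise per-file failures
# with the offending audio path in the message.
def generate_with_file_context(
    files: Iterable[str],
    process_file: Callable[[str], Iterator[Tuple[str, Dict]]],
) -> Iterator[Tuple[str, Dict]]:
    for audio_path in files:
        try:
            yield from process_file(audio_path)
        except Exception as err:
            raise RuntimeError(f"Failed while processing {audio_path}") from err
```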

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

One of the problems may be that the file is stereo:

$ soxi /scratch/project/open-28-57/szoke/huggingface_asr/metadata_dirs/ssl/jarin/5/536385_1974_01-Jesenik_JE.wav

Input File     : '/scratch/project/open-28-57/szoke/huggingface_asr/metadata_dirs/ssl/jarin/5/536385_1974_01-Jesenik_JE.wav'
Channels       : 2
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:50:59.23 = 134911896 samples = 229442 CDDA sectors
File Size      : 540M
Bit Rate       : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

@Lakoc
Collaborator

Lakoc commented Feb 2, 2024

If we want to output that, we would need to wrap it in a try/except block, which would slow things down a lot. Yes, it is related to the fact that the audio has more than one channel: the loader is written so that it always expects a mono recording. I will add a check for that.
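
A sketch of such a check, assuming the waveform is a channels-first torch tensor as returned by torchaudio (the helper name and the downmix-by-averaging choice are illustrative; raising an error instead of downmixing would also work):

```python
import torch


def ensure_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Illustrative helper: average channels of a (channels, samples) tensor to mono."""
    if waveform.ndim == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform
```

The array handed to datasets.Audio.encode_example would then be 1-D, e.g. chunk.squeeze(0).numpy() or similar, so soundfile sees (samples,) rather than (channels, samples).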
