
dataloader cannot process mp3 files #21

Open
ISzoke opened this issue Jan 31, 2024 · 6 comments

@ISzoke
Collaborator

ISzoke commented Jan 31, 2024

No description provided.

@Lakoc
Collaborator

Lakoc commented Feb 1, 2024

@ISzoke, could you please provide a log of the error? I think it is related to libsndfile or some other underlying audio-manipulation library.

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for audio_folder_vad contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at /mnt/proj1/open-28-57/szoke/huggingface_asr/src/dataset_builders/audio_folder_vad/audio_folder_vad.py
You can avoid this message in future by passing the argument trust_remote_code=True.
Passing trust_remote_code=True will be mandatory to load this dataset from the next major release of datasets.
warnings.warn(
/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
/mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1726 in _prepare_split_single                    │
│                                                                              │
│   1723 │   │   │   )                                                         │
│   1724 │   │   │   try:                                                      │
│   1725 │   │   │   │   _time = time.time()                                   │
│ ❱ 1726 │   │   │   │   for key, record in generator:                         │
│   1727 │   │   │   │   │   if max_shard_size is not None and writer._num_byt │
│   1728 │   │   │   │   │   │   num_examples, num_bytes = writer.finalize()   │
│   1729 │   │   │   │   │   │   writer.close()                                │
│                                                                              │
│ /scratch/project/open-28-57/szoke/huggingface_cache/modules/datasets_modules │
│ /datasets/audio_folder_vad/b49d605da8728ad455f8b4c54bdbb7ecd66d893e3e73b39d5 │
│ 0940697cffaada1/audio_folder_vad.py:93 in _generate_examples                 │
│                                                                              │
│    90 │   │   │   │   chunk = waveform[:, int(segment.start * sample_rate) : │
│    91 │   │   │   │   yield f"{example_id}_{segment.start:.2f}_{segment.end: │
│    92 │   │   │   │   │   **example,                                         │
│ ❱  93 │   │   │   │   │   "audio": audio_encoder.encode_example({"array": ch │
│    94 │   │   │   │   │   "input_len": len(chunk) / self.sampling_rate,      │
│    95 │   │   │   │   }                                                      │
│    96                                                                        │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/features/audio.py:98 in encode_example                      │
│                                                                              │
│    95 │   │   elif "array" in value:                                         │
│    96 │   │   │   # convert the audio array to wav bytes                     │
│    97 │   │   │   buffer = BytesIO()                                         │
│ ❱  98 │   │   │   sf.write(buffer, value["array"], value["sampling_rate"], f │
│    99 │   │   │   return {"bytes": buffer.getvalue(), "path": None}          │
│   100 │   │   elif value.get("path") is not None and os.path.isfile(value["p │
│   101 │   │   │   # we set "bytes": None to not duplicate the data if they'r │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:343 in write                                            │
│                                                                              │
│    340 │   │   channels = 1                                                  │
│    341 │   else:                                                             │
│    342 │   │   channels = data.shape[1]                                      │
│ ❱  343 │   with SoundFile(file, 'w', samplerate, channels,                   │
│    344 │   │   │   │      subtype, endian, format, closefd) as f:            │
│    345 │   │   f.write(data)                                                 │
│    346                                                                       │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:658 in __init__                                         │
│                                                                              │
│    655 │   │   self._mode = mode                                             │
│    656 │   │   self._info = _create_info_struct(file, mode, samplerate, chan │
│    657 │   │   │   │   │   │   │   │   │   │    format, subtype, endian)     │
│ ❱  658 │   │   self._file = self._open(file, mode_int, closefd)              │
│    659 │   │   if set(mode).issuperset('r+') and self.seekable():            │
│    660 │   │   │   # Move write position to 0 (like in Python file objects)  │
│    661 │   │   │   self.seek(0)                                              │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/soundfile.py:1216 in _open                                           │
│                                                                              │
│   1213 │   │   if file_ptr == _ffi.NULL:                                     │
│   1214 │   │   │   # get the actual error code                               │
│   1215 │   │   │   err = _snd.sf_error(file_ptr)                             │
│ ❱ 1216 │   │   │   raise LibsndfileError(err, prefix="Error opening {0!r}: " │
│   1217 │   │   if mode_int == _snd.SFM_WRITE:                                │
│   1218 │   │   │   # Due to a bug in libsndfile version <= 1.0.25, frames != │
│   1219 │   │   │   # when opening a named pipe in SFM_WRITE mode.            │
╰──────────────────────────────────────────────────────────────────────────────╯
LibsndfileError: Error opening <_io.BytesIO object at 0x2b8580d726b0>: Format 
not recognised.

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/trainers/pretrain_wav2vec2.p │
│ y:27 in <module>                                                             │
│                                                                              │
│   24 │   model_args, data_args, training_args, gen_args = parser.parse_args_ │
│   25 │                                                                       │
│   26 │   # 1. Collect, preprocess dataset and extract evaluation dataset     │
│ ❱ 27 │   dataset = get_dataset(                                              │
│   28 │   │   datasets_creation_config_path=data_args.datasets_creation_confi │
│   29 │   │   dataset_name=data_args.dataset_name,                            │
│   30 │   │   dataset_config=data_args.dataset_config,                        │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/utilities/data_utils.py:464  │
│ in get_dataset                                                               │
│                                                                              │
│   461 ) -> DatasetDict:                                                      │
│   462 │   """Loads single or multiple datasets, preprocess, and merge them." │
│   463 │   if datasets_creation_config_path is not None:                      │
│ ❱ 464 │   │   dataset = load_multiple_datasets(                              │
│   465 │   │   │   config_path=datasets_creation_config_path,                 │
│   466 │   │   │   num_proc=preprocessing_num_workers,                        │
│   467 │   │   │   writer_batch_size=writer_batch_size,                       │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/huggingface_asr/src/utilities/data_utils.py:382  │
│ in load_multiple_datasets                                                    │
│                                                                              │
│   379 │   │   │   │   )                                                      │
│   380 │   │   │                                                              │
│   381 │   │   │   else:                                                      │
│ ❱ 382 │   │   │   │   dataset = load_dataset(                                │
│   383 │   │   │   │   │   dataset_config["dataset_name"],                    │
│   384 │   │   │   │   │   keep_in_memory=False,                              │
│   385 │   │   │   │   │   writer_batch_size=writer_batch_size,               │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/load.py:2549 in load_dataset                                │
│                                                                              │
│   2546 │   try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES          │
│   2547 │                                                                     │
│   2548 │   # Download and prepare data                                       │
│ ❱ 2549 │   builder_instance.download_and_prepare(                            │
│   2550 │   │   download_config=download_config,                              │
│   2551 │   │   download_mode=download_mode,                                  │
│   2552 │   │   verification_mode=verification_mode,                          │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1005 in download_and_prepare                     │
│                                                                              │
│   1002 │   │   │   │   │   │   │   prepare_split_kwargs["max_shard_size"] =  │
│   1003 │   │   │   │   │   │   if num_proc is not None:                      │
│   1004 │   │   │   │   │   │   │   prepare_split_kwargs["num_proc"] = num_pr │
│ ❱ 1005 │   │   │   │   │   │   self._download_and_prepare(                   │
│   1006 │   │   │   │   │   │   │   dl_manager=dl_manager,                    │
│   1007 │   │   │   │   │   │   │   verification_mode=verification_mode,      │
│   1008 │   │   │   │   │   │   │   **prepare_split_kwargs,                   │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1767 in _download_and_prepare                    │
│                                                                              │
│   1764 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1765 │                                                                     │
│   1766 │   def _download_and_prepare(self, dl_manager, verification_mode, ** │
│ ❱ 1767 │   │   super()._download_and_prepare(                                │
│   1768 │   │   │   dl_manager,                                               │
│   1769 │   │   │   verification_mode,                                        │
│   1770 │   │   │   check_duplicate_keys=verification_mode == VerificationMod │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1100 in _download_and_prepare                    │
│                                                                              │
│   1097 │   │   │                                                             │
│   1098 │   │   │   try:                                                      │
│   1099 │   │   │   │   # Prepare split will record examples associated to th │
│ ❱ 1100 │   │   │   │   self._prepare_split(split_generator, **prepare_split_ │
│   1101 │   │   │   except OSError as e:                                      │
│   1102 │   │   │   │   raise OSError(                                        │
│   1103 │   │   │   │   │   "Cannot find data file. "                         │
│                                                                              │
│ /scratch/project/open-28-57/szoke/huggingface_cache/modules/datasets_modules │
│ /datasets/audio_folder_vad/b49d605da8728ad455f8b4c54bdbb7ecd66d893e3e73b39d5 │
│ 0940697cffaada1/audio_folder_vad.py:77 in _prepare_split                     │
│                                                                              │
│    74 │   │   max_shard_size: Optional[Union[int, str]] = None,              │
│    75 │   ):                                                                 │
│    76 │   │   set_start_method("spawn")                                      │
│ ❱  77 │   │   super()._prepare_split(split_generator, check_duplicate_keys,  │
│    78 │                                                                      │
│    79 │   def _generate_examples(self, files, metadata_files, split_name, ad │
│    80 │   │   audio_encoder = datasets.Audio(sampling_rate=self.sampling_rat │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1605 in _prepare_split                           │
│                                                                              │
│   1602 │   │   │   gen_kwargs = split_generator.gen_kwargs                   │
│   1603 │   │   │   job_id = 0                                                │
│   1604 │   │   │   with pbar:                                                │
│ ❱ 1605 │   │   │   │   for job_id, done, content in self._prepare_split_sing │
│   1606 │   │   │   │   │   gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_ │
│   1607 │   │   │   │   ):                                                    │
│   1608 │   │   │   │   │   if done:                                          │
│                                                                              │
│ /mnt/proj1/open-28-57/szoke/CONDA_ENVS/huggingface_asr/lib/python3.10/site-p │
│ ackages/datasets/builder.py:1762 in _prepare_split_single                    │
│                                                                              │
│   1759 │   │   │   # Ignore the writer's error for no examples written to th │
│   1760 │   │   │   if isinstance(e, SchemaInferenceError) and e.__context__  │
│   1761 │   │   │   │   e = e.__context__                                     │
│ ❱ 1762 │   │   │   raise DatasetGenerationError("An error occurred while gen │
│   1763 │   │                                                                 │
│   1764 │   │   yield job_id, True, (total_num_examples, total_num_bytes, wri │
│   1765                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
DatasetGenerationError: An error occurred while generating the dataset
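
For reference, soundfile expects data shaped (frames, channels) and reads the channel count from data.shape[1], so a channels-first stereo chunk of shape (2, num_samples) is interpreted as an audio signal with num_samples channels, which libsndfile rejects with exactly this "Format not recognised" error. A minimal sketch reproducing that behavior (the channels-first layout of the chunk is an assumption, not taken from the builder code):

```python
# Sketch only: reproduce the soundfile behavior behind "Format not recognised".
# The channels-first chunk layout is an assumption, not taken from the builder code.
from io import BytesIO

import numpy as np
import soundfile as sf

stereo_chunk = np.zeros((2, 44100), dtype=np.float32)  # (channels, samples)

try:
    # soundfile treats shape[1] (= 44100) as the channel count here
    sf.write(BytesIO(), stereo_chunk, 44100, format="WAV")
except sf.LibsndfileError as err:
    print(err)  # Error opening <_io.BytesIO ...>: Format not recognised.

# The same data written as (samples, channels) is accepted:
sf.write(BytesIO(), stereo_chunk.T, 44100, format="WAV")
```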

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

@Lakoc We need to output the filename which causes this error.
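
A hypothetical sketch of what that could look like: wrap the per-file processing so the failing path is attached to the re-raised error (the function and callback names here are illustrative, not the actual _generate_examples code):

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


# Hypothetical wrapper (illustrative names): re-raise per-file failures
# with the offending audio path in the message.
def generate_with_file_context(
    files: Iterable[str],
    process_file: Callable[[str], Iterator[Tuple[str, Dict]]],
) -> Iterator[Tuple[str, Dict]]:
    for audio_path in files:
        try:
            yield from process_file(audio_path)
        except Exception as err:
            raise RuntimeError(f"Failed while processing {audio_path}") from err
```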

@ISzoke
Collaborator Author

ISzoke commented Feb 1, 2024

One of the problems may be that the file is stereo:

$ soxi /scratch/project/open-28-57/szoke/huggingface_asr/metadata_dirs/ssl/jarin/5/536385_1974_01-Jesenik_JE.wav

Input File     : '/scratch/project/open-28-57/szoke/huggingface_asr/metadata_dirs/ssl/jarin/5/536385_1974_01-Jesenik_JE.wav'
Channels       : 2
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:50:59.23 = 134911896 samples = 229442 CDDA sectors
File Size      : 540M
Bit Rate       : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

@Lakoc
Collaborator

Lakoc commented Feb 2, 2024

If we want to output that, we would need to wrap it in a try/except block, which would slow things down a lot. Yes, it is related to the fact that the audio has more than one channel: the loader is written so that it always expects a mono recording. I will add a check for that.
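
A sketch of such a check, assuming the waveform is a channels-first torch tensor as returned by torchaudio (the helper name and the downmix-by-averaging choice are illustrative; raising an error instead of downmixing would also work):

```python
import torch


def ensure_mono(waveform: torch.Tensor) -> torch.Tensor:
    """Illustrative helper: average channels of a (channels, samples) tensor to mono."""
    if waveform.ndim == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    return waveform
```

The array handed to datasets.Audio.encode_example would then be 1-D, e.g. chunk.squeeze(0).numpy() or similar, so soundfile sees (samples,) rather than (channels, samples).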
