Opened on behalf of @BeckettFrey
Summary
In src/voxkit/storage/datasets.py, the validate_dataset function checks that the number of audio files equals the number of label files per speaker directory, but it never verifies that each audio file has a matching label file by stem name.
Example
Given a speaker directory:
speaker_001/
├── recording_A.wav
└── recording_B.lab
This passes validation (1 audio file, 1 label file — counts match), even though recording_A has no label and recording_B has no audio. The dataset would then fail silently downstream during alignment or training.
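The false pass can be reproduced in isolation. The sketch below assumes audio files use `.wav` and label files use `.lab`, as in the example. `count_only_check` and `reproduce` are hypothetical names used for illustration. They stand in for the count comparison and are not the actual `validate_dataset` code.

```python
import tempfile
from pathlib import Path


def count_only_check(speaker_path: Path) -> bool:
    """Hypothetical stand-in: compares file counts only, never file names."""
    audio_files = list(speaker_path.glob("*.wav"))  # assumed audio extension
    label_files = list(speaker_path.glob("*.lab"))  # assumed label extension
    return len(audio_files) == len(label_files)


def reproduce() -> bool:
    """Build the speaker_001 layout from the example and run the count check."""
    with tempfile.TemporaryDirectory() as tmp:
        speaker = Path(tmp) / "speaker_001"
        speaker.mkdir()
        (speaker / "recording_A.wav").touch()
        (speaker / "recording_B.lab").touch()
        return count_only_check(speaker)


print(reproduce())  # prints True, although neither file has a partner
```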
Relevant code
https://github.com/BrainBehaviorAnalyticsLab/voxkit-desktop/blob/main/src/voxkit/storage/datasets.py — in validate_dataset, around the per-speaker loop:
    if len(audio_files) != len(label_files):
        return (
            False,
            f"Mismatch between number of audio and label files in speaker "
            f"directory '{speaker_path}'.",
        )
Suggested fix
After the count check, add a stem-name comparison:
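    # Assumes `from pathlib import Path` is available in datasets.py.
    audio_stems = {Path(f).stem for f in audio_files}
    label_stems = {Path(f).stem for f in label_files}
    # Stems present on only one side have no partner file.
    unmatched = sorted(audio_stems.symmetric_difference(label_stems))
    if unmatched:
        return (
            False,
            f"Unpaired audio/label files in '{speaker_path}': {unmatched}",
        )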
This keeps the scope minimal — one additional check in the existing validation loop.
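To make the comparison testable on its own, the stem check could also be pulled into a small helper. This is only a sketch: `find_unpaired_stems` is a hypothetical name, and it assumes `audio_files` and `label_files` are iterables of path strings, which may not match how `validate_dataset` collects them.

```python
from pathlib import Path
from typing import Iterable, List


def find_unpaired_stems(
    audio_files: Iterable[str], label_files: Iterable[str]
) -> List[str]:
    """Return the sorted stems that appear on only one side (audio or label)."""
    audio_stems = {Path(f).stem for f in audio_files}
    label_stems = {Path(f).stem for f in label_files}
    # Symmetric difference: stems with audio but no label, or label but no audio.
    return sorted(audio_stems ^ label_stems)


# The speaker_001 example: counts match, but neither file has a partner.
print(find_unpaired_stems(["recording_A.wav"], ["recording_B.lab"]))
```

Sorting the result keeps the error message stable across runs, which makes it easier to assert on in tests.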