validate_dataset checks file counts but not stem-name pairing #86

@dennisthemenacing

Opened on behalf of @BeckettFrey

Summary

In src/voxkit/storage/datasets.py, the validate_dataset function checks that the number of audio files equals the number of label files per speaker directory, but it never verifies that each audio file has a matching label file by stem name.

Example

Given a speaker directory:

speaker_001/
├── recording_A.wav
└── recording_B.lab

This passes validation because the counts match (1 audio file, 1 label file), even though recording_A has no label and recording_B has no audio. The dataset would then fail silently downstream during alignment or training.
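
The false pass can be reproduced with a minimal sketch. The helper name `count_only_check` and the `.wav`/`.lab` extension sets are assumptions for illustration, not the actual voxkit code; the function mirrors only the count comparison described above.

```python
from pathlib import Path

# Hypothetical extension sets; the real ones in voxkit may differ.
AUDIO_EXTS = {".wav"}
LABEL_EXTS = {".lab"}


def count_only_check(filenames):
    """Mimic a count-only validation: True when the counts are equal."""
    audio_files = [f for f in filenames if Path(f).suffix in AUDIO_EXTS]
    label_files = [f for f in filenames if Path(f).suffix in LABEL_EXTS]
    return len(audio_files) == len(label_files)


# The speaker directory from the example above.
speaker_001 = ["recording_A.wav", "recording_B.lab"]
passes = count_only_check(speaker_001)

# Compare the stems on each side directly.
stems_agree = {Path(f).stem for f in speaker_001 if f.endswith(".wav")} == {
    Path(f).stem for f in speaker_001 if f.endswith(".lab")
}
print(passes, stems_agree)  # True False
```

The count check reports success while the stem sets are disjoint, which is the gap this issue describes.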

Relevant code

https://github.com/BrainBehaviorAnalyticsLab/voxkit-desktop/blob/main/src/voxkit/storage/datasets.py — in validate_dataset, around the per-speaker loop:

if len(audio_files) != len(label_files):
    return (
        False,
        f"Mismatch between number of audio and label files in speaker "
        f"directory '{speaker_path}'.",
    )

Suggested fix

After the count check, add a stem-name comparison:

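# Requires `from pathlib import Path` at module level, if not already imported.
audio_stems = {Path(f).stem for f in audio_files}
label_stems = {Path(f).stem for f in label_files}
# Stems present on only one side: audio without a label, or a label without audio.
unmatched = audio_stems.symmetric_difference(label_stems)
if unmatched:
    return (
        False,
        # sorted() keeps the error message deterministic across runs.
        f"Unpaired audio/label files in '{speaker_path}': {sorted(unmatched)}",
    )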

This keeps the scope minimal: one additional check in the existing validation loop.
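
As a sketch of how the comparison could be exercised in isolation, the function below wraps the same stem logic and reports the two sides separately. `find_unpaired_stems` is a hypothetical helper, not part of voxkit, and it takes plain lists of file names rather than the real `validate_dataset` inputs.

```python
from pathlib import Path


def find_unpaired_stems(audio_files, label_files):
    """Return (audio_only, label_only) stems, each as a sorted list.

    Hypothetical helper for illustration; not part of voxkit.
    """
    audio_stems = {Path(f).stem for f in audio_files}
    label_stems = {Path(f).stem for f in label_files}
    return sorted(audio_stems - label_stems), sorted(label_stems - audio_stems)


# The directory from the example: counts match, stems do not.
audio_only, label_only = find_unpaired_stems(["recording_A.wav"], ["recording_B.lab"])
print(audio_only, label_only)  # ['recording_A'] ['recording_B']

# A correctly paired directory yields two empty lists, regardless of order.
print(find_unpaired_stems(["a.wav", "b.wav"], ["b.lab", "a.lab"]))  # ([], [])
```

Splitting the result into two lists lets an error message say which recordings lack a label and which labels lack audio. Note that `Path.stem` strips only the final suffix, so a name like `rec.v1.wav` has the stem `rec.v1`.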
