
Conversation

@JackTemaki
Contributor

Basic jobs to use Loquacious on our standard pipelines. So far this only contains jobs for dev/test as well as the small and medium train sets. The large corpus needs extra handling.

The jobs require an existing huggingface cache directory.

JackTemaki and others added 2 commits September 29, 2025 09:57

Co-authored-by: Nick Rossenbach <[email protected]>
Co-authored-by: Robin Schmitt <[email protected]>
"-c:a",
"libvorbis",
"-b:a",
"16k",
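For context, the flags above are only a fragment of the job's ffmpeg command line. A minimal sketch of what the full invocation presumably looks like (the helper names and the stdin-piping setup are assumptions for illustration, not the job's actual code):

```python
import subprocess


def build_ogg_encode_cmd(sample_rate: int, out_path: str) -> list:
    """ffmpeg command matching the flags in the diff: raw s16le PCM on
    stdin, the libvorbis encoder, and a fixed 16 kbit/s audio bitrate."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",        # raw signed 16-bit little-endian input
        "-ar", str(sample_rate),
        "-i", "pipe:0",       # read the PCM stream from stdin
        "-c:a", "libvorbis",  # explicitly select the libvorbis encoder
        "-b:a", "16k",        # fixed (ABR) audio bitrate, as in the diff
        out_path,
    ]


def encode_to_ogg(raw_pcm: bytes, sample_rate: int, out_path: str) -> None:
    """Pipe raw PCM bytes through ffmpeg and write an Ogg/Vorbis file."""
    subprocess.run(build_ogg_encode_cmd(sample_rate, out_path),
                   input=raw_pcm, check=True)
```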
Member

I was checking some other examples.

What I also see:
["ffmpeg", "-y", "-f", "s16le", "-ar", "%i" % sr, "-i", "pipe:0", "-c:a", "libvorbis", "-q", "3.0", path]
(That's what you use for your TTS Ogg export.)
["ffmpeg", "-hide_banner", "-loglevel", "error", "-y", "-threads", "1", "-f", "s16le", "-ar", "%i" % sr, "-i", "pipe:0", "-c:a", "libvorbis", "-q", "3.0", path]

Or in i6_experiments.common.datasets.librispeech.corpus.get_bliss_corpus_dict and i6_experiments.common.datasets.tedlium2.corpus.get_bliss_corpus_dict (and many others), we use "output_format": "ogg", "codec": "libvorbis" and sample_rate=16000 for BlissChangeEncodingJob. I wonder a bit about that: Here we don't specify the quality at all (neither -b:a nor -q), as far as I can see?

I have not seen any other example using -b:a. This corresponds to the fixed_bitrate option in BlissChangeEncodingJob.

Using a fixed bitrate (ABR, option -b:a) seems suboptimal to me. A variable bitrate (VBR, option -q) makes more sense?

But I just see that -q 3 is already the default. And you said that the FFmpeg defaults are suboptimal? Maybe we should use -q 4 or higher?
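To make the ABR/VBR distinction concrete, here is a small illustrative helper (made up for this comment, not from the repo) that builds either flag set; libvorbis' own default corresponds to the VBR setting -q 3.0:

```python
from typing import List, Optional


def vorbis_flags(quality: Optional[float] = None,
                 bitrate: Optional[str] = None) -> List[str]:
    """Return ffmpeg audio-codec flags for Ogg/Vorbis output.
    Pass exactly one of `quality` (VBR, option -q, libvorbis default 3.0)
    or `bitrate` (ABR, option -b:a, e.g. "16k")."""
    if (quality is None) == (bitrate is None):
        raise ValueError("pass exactly one of quality or bitrate")
    flags = ["-c:a", "libvorbis"]
    if quality is not None:
        flags += ["-q", str(quality)]   # VBR: quality scale, higher is better
    else:
        flags += ["-b:a", bitrate]      # ABR: average target bitrate
    return flags
```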

Member

Oh, I just learned: if you don't pass -c:a libvorbis, and you have some weird stripped-down FFmpeg build that was built without libvorbis, then FFmpeg still provides a builtin Vorbis encoder, so it can still generate Ogg files, but the quality will just be (much?) lower.

In some older setups, I did not use -c:a libvorbis, but simply ffmpeg -i ... out...ogg. But I think in most of my environments, I always had my custom Linuxbrew ffmpeg installed, which should have libvorbis enabled.

But now, at the RWTH HPC cluster, the FFmpeg provided there (which is also only available after module load FFmpeg) does not support libvorbis.
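A quick way to detect such a stripped-down build is to scan the output of ffmpeg -encoders for libvorbis. A hedged sketch (the helper names are invented for illustration):

```python
import subprocess


def has_libvorbis_in_listing(encoders_listing: str) -> bool:
    """True if an `ffmpeg -encoders` listing contains the libvorbis
    encoder. The encoder name is the second whitespace-separated column
    of each entry line, so the builtin `vorbis` encoder does not match."""
    for line in encoders_listing.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "libvorbis":
            return True
    return False


def ffmpeg_has_libvorbis(ffmpeg_bin: str = "ffmpeg") -> bool:
    """Ask the given ffmpeg binary for its encoder list; returns False
    if the binary is missing or the call fails."""
    try:
        proc = subprocess.run(
            [ffmpeg_bin, "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return has_libvorbis_in_listing(proc.stdout)
```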

Contributor Author

And you said that the FFmpeg defaults are suboptimal?

I think they are rather suboptimal in the other direction: for speech-only audio you do not need as much bandwidth as for e.g. music.

Now that others have already used these settings extensively and there were no issues with the results, we should just keep it as it is, and add a flag in case we later find out there was a serious issue with this setting.
This is anyway only relevant for the more academic small/medium corpora; for the "large" results we will use your HF-Dataset-based pipeline anyway.

Member

Ah I see. Note that also for my HF-Dataset-based pipeline, I convert the audio via ffmpeg, so the flags are also relevant there.


def __init__(
self,
hf_home_dir: Path,
Member

@albertz Nov 27, 2025

I wonder whether hf_home_dir is reasonable to have as an argument to the job. At least this is inconsistent with other HF-dataset related jobs. (cc @dthulke)

In other HF-dataset related jobs (DownloadAndPrepareHuggingFaceDatasetJob, TransformAndMapHuggingFaceDatasetJob, ExtractTextFromHuggingFaceDatasetJob), it is expected that this (and potentially other related) env var is set correctly by the user. There are also other potentially relevant env vars: for example the personal login secret (HF_TOKEN), which you need for some datasets and must not share publicly.

But on the other hand, this here doesn't matter too much. I could just do PrepareLoquaciousTrainSmallDatasetJob(hf_home_dir=Path(os.environ["HF_HOME"])) in my setup. hf_home_dir is (rightfully) not part of the hash.
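The env-var fallback described for the other jobs could be sketched like this (an illustrative helper, not the job's actual API):

```python
import os
from typing import Optional


def resolve_hf_home(hf_home_dir: Optional[str] = None) -> str:
    """Pick the Hugging Face cache directory: an explicit job argument
    wins; otherwise fall back to the HF_HOME env var, as the other
    HF-dataset jobs expect the user to set it."""
    if hf_home_dir is not None:
        return hf_home_dir
    hf_home = os.environ.get("HF_HOME")
    if hf_home is None:
        raise RuntimeError("HF_HOME is not set; set it or pass hf_home_dir")
    return hf_home
```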

So, I just want to raise this point, but it is not really critical, so I will vote to approve anyway.
