
Conversation

@JackTemaki
Contributor

Basic jobs to use Loquacious on our standard pipelines. So far this only contains jobs for dev/test as well as the small and medium train sets. The large corpus needs extra handling.

The jobs require an existing huggingface cache directory.

JackTemaki and others added 2 commits September 29, 2025 09:57

Co-authored-by: Nick Rossenbach <[email protected]>
Co-authored-by: Robin Schmitt <[email protected]>
"-c:a",
"libvorbis",
"-b:a",
"16k",
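For context, the flags above are only a fragment of the job's ffmpeg command line. A minimal sketch of what the full invocation presumably looks like (the helper names and the stdin-piping setup are assumptions for illustration, not the job's actual code):

```python
import subprocess


def build_ogg_encode_cmd(sample_rate: int, out_path: str) -> list:
    """ffmpeg command matching the flags in the diff: raw s16le PCM on
    stdin, the libvorbis encoder, and a fixed 16 kbit/s audio bitrate."""
    return [
        "ffmpeg", "-y",
        "-f", "s16le",        # raw signed 16-bit little-endian input
        "-ar", str(sample_rate),
        "-i", "pipe:0",       # read the PCM stream from stdin
        "-c:a", "libvorbis",  # explicitly select the libvorbis encoder
        "-b:a", "16k",        # fixed (ABR) audio bitrate, as in the diff
        out_path,
    ]


def encode_to_ogg(raw_pcm: bytes, sample_rate: int, out_path: str) -> None:
    """Pipe raw PCM bytes through ffmpeg and write an Ogg/Vorbis file."""
    subprocess.run(build_ogg_encode_cmd(sample_rate, out_path),
                   input=raw_pcm, check=True)
```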
Member

I was checking some other examples.

What I also see:
["ffmpeg", "-y", "-f", "s16le", "-ar", "%i" % sr, "-i", "pipe:0", "-c:a", "libvorbis", "-q", "3.0", path]
(That's what you use for your TTS Ogg export.)
["ffmpeg", "-hide_banner", "-loglevel", "error", "-y", "-threads", "1", "-f", "s16le", "-ar", "%i" % sr, "-i", "pipe:0", "-c:a", "libvorbis", "-q", "3.0", path]

Or in i6_experiments.common.datasets.librispeech.corpus.get_bliss_corpus_dict and i6_experiments.common.datasets.tedlium2.corpus.get_bliss_corpus_dict (and many others), we use "output_format": "ogg", "codec": "libvorbis" and sample_rate=16000 for BlissChangeEncodingJob. I wonder a bit about that: Here we don't specify the quality at all (neither -b:a nor -q), as far as I can see?

I have not seen any other example using -b:a. This corresponds to the fixed_bitrate option in BlissChangeEncodingJob.

Using a fixed bitrate (ABR, option -b:a) seems suboptimal to me. A variable bitrate (VBR, option -q) makes more sense?

But I just see that -q 3 is already the default. And you said that the FFmpeg defaults are suboptimal? Maybe we should use -q 4 or higher?
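To make the ABR/VBR distinction concrete, here is a small illustrative helper (made up for this comment, not from the repo) that builds either flag set; libvorbis' own default corresponds to the VBR setting -q 3.0:

```python
from typing import List, Optional


def vorbis_flags(quality: Optional[float] = None,
                 bitrate: Optional[str] = None) -> List[str]:
    """Return ffmpeg audio-codec flags for Ogg/Vorbis output.
    Pass exactly one of `quality` (VBR, option -q, libvorbis default 3.0)
    or `bitrate` (ABR, option -b:a, e.g. "16k")."""
    if (quality is None) == (bitrate is None):
        raise ValueError("pass exactly one of quality or bitrate")
    flags = ["-c:a", "libvorbis"]
    if quality is not None:
        flags += ["-q", str(quality)]   # VBR: quality scale, higher is better
    else:
        flags += ["-b:a", bitrate]      # ABR: average target bitrate
    return flags
```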

Member

Oh, I just learned: if you don't pass -c:a libvorbis, and you have some weird stripped-down FFmpeg build that was built without libvorbis, then FFmpeg still provides a builtin Vorbis encoder, so it can still generate Ogg files, but the quality will just be (much?) lower.

In some older setups, I did not use -c:a libvorbis, but simply ffmpeg -i ... out...ogg. But I think in most of my environments, I always had my custom Linuxbrew ffmpeg installed, which should have libvorbis enabled.

But now, at the RWTH HPC cluster, the FFmpeg provided there (which is also only available after module load FFmpeg) does not support libvorbis.
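A quick way to detect such a stripped-down build is to scan the output of ffmpeg -encoders for libvorbis. A hedged sketch (the helper names are invented for illustration):

```python
import subprocess


def has_libvorbis_in_listing(encoders_listing: str) -> bool:
    """True if an `ffmpeg -encoders` listing contains the libvorbis
    encoder. The encoder name is the second whitespace-separated column
    of each entry line, so the builtin `vorbis` encoder does not match."""
    for line in encoders_listing.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "libvorbis":
            return True
    return False


def ffmpeg_has_libvorbis(ffmpeg_bin: str = "ffmpeg") -> bool:
    """Ask the given ffmpeg binary for its encoder list; returns False
    if the binary is missing or the call fails."""
    try:
        proc = subprocess.run(
            [ffmpeg_bin, "-hide_banner", "-encoders"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return has_libvorbis_in_listing(proc.stdout)
```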

Contributor Author

And you said that the FFmpeg defaults are suboptimal?

I think they are rather suboptimal in the other direction: for speech-only audio you do not need as much bandwidth as for e.g. music.

Now that others have already used these settings extensively and there were no issues with the results, we should just keep it as it is, and add a flag in case we later find out there was a serious issue with this setting.
This is anyway only relevant for the more academic small/medium corpora; for the "large" results we will use your HF-Dataset-based pipeline anyway.

Member

Ah I see. Note that also for my HF-Dataset-based pipeline, I convert the audio via ffmpeg, so the flags are also relevant there.


def __init__(
self,
hf_home_dir: Path,
Member

@albertz Nov 27, 2025

I wonder whether hf_home_dir is reasonable to have as an argument to the job. At least this is inconsistent with other HF-dataset related jobs. (cc @dthulke)

In other HF-dataset related jobs (DownloadAndPrepareHuggingFaceDatasetJob, TransformAndMapHuggingFaceDatasetJob, ExtractTextFromHuggingFaceDatasetJob), it is expected that this (and potentially other related) env var is set correctly by the user. There are also other potentially relevant env vars: for example the personal login secret (HF_TOKEN), which you need for some datasets and must not share publicly.

But on the other hand, this here doesn't matter too much. I could just do PrepareLoquaciousTrainSmallDatasetJob(hf_home_dir=Path(os.environ["HF_HOME"])) in my setup. hf_home_dir is (rightfully) not part of the hash.
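The env-var fallback described for the other jobs could be sketched like this (an illustrative helper, not the job's actual API):

```python
import os
from typing import Optional


def resolve_hf_home(hf_home_dir: Optional[str] = None) -> str:
    """Pick the Hugging Face cache directory: an explicit job argument
    wins; otherwise fall back to the HF_HOME env var, as the other
    HF-dataset jobs expect the user to set it."""
    if hf_home_dir is not None:
        return hf_home_dir
    hf_home = os.environ.get("HF_HOME")
    if hf_home is None:
        raise RuntimeError("HF_HOME is not set; set it or pass hf_home_dir")
    return hf_home
```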

So, I just want to raise this point, but it is not really critical, so I will vote to approve anyway.
