
About gigaspeech glm file #124

Open
CuiMingyu opened this issue Sep 17, 2022 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@CuiMingyu

Hi,

Does GigaSpeech provide a GLM file, like SWBD's en20000405_hub5.glm, that contains the transcript-filtering rules?

I notice there are some rules in the gigaspeech_scoring.py file, but do you have a GLM file covering all of the rules?

Thanks a lot!

@CuiMingyu
Author

An example of an SWBD GLM:
[image: excerpt of en20000405_hub5.glm]

@dophist
Collaborator

dophist commented Sep 17, 2022

The short answer is: YES and NO.

Actually, this is a pretty good question, so I'm going to keep this thread open for documentation purposes. Here is the long answer:

On the NO side:
The reason we don't provide a GLM within GigaSpeech is that we don't want to complicate the evaluation process with overly complex sub-systems (such as TN and context-dependent language rewriting), so that downstream research toolkits can integrate and adopt GigaSpeech easily.

And as you mentioned, we do provide a very simple script containing our recommended text post-processing; see discussion #24. It should provide a reliable apples-to-apples basis for academic comparisons.
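To illustrate the kind of lightweight post-processing such a script performs, here is a minimal sketch. The tag and filler lists below are illustrative stand-ins, not the exact sets used by gigaspeech_scoring.py; consult that script for the authoritative rules.

```python
# Illustrative post-processing in the spirit of gigaspeech_scoring.py.
# NOTE: these tag/filler sets are hypothetical examples, not the
# actual lists from the script.
PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
FILLERS = {"UH", "UM", "ER", "MM", "HM", "AH"}

def post_process(text: str) -> str:
    """Uppercase, then drop punctuation tags and conversational fillers
    so that hypothesis and reference are compared on the same footing."""
    kept = [w for w in text.upper().split()
            if w not in PUNCT_TAGS and w not in FILLERS]
    return " ".join(kept)

print(post_process("yeah um we are ready <COMMA> i think <PERIOD>"))
# -> "YEAH WE ARE READY I THINK"
```

Applying the same filter to both the reference and the hypothesis before computing WER is what makes the comparison apples-to-apples.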

On the YES side:
To take ASR benchmarking more seriously, closer to real-life ASR scenarios, we developed a universal benchmarking platform that contains modules such as:

  • production-grade TN (based on NeMo)
  • a sophisticated evaluation tool (supporting GLM and other features, going beyond the NIST tools)

These live in our Leaderboard project repo, where you can already find a GLM file containing hundreds of rewriting rules, for English in general, not limited to GigaSpeech. You are welcome to help us improve it; it's an asset for the entire speech community.

Here is a glance at dummy outputs from the scoring tool:
[screenshot: sample scoring-tool output, 2022-09-17]

As you can see, the raw form WE ARE is transformed to WE'RE as the result of the GLM rule WE'RE <-> WE ARE, to match the reference on the fly. We even tag these alternative expansions with # and pretty-align them, so that error analysis becomes crystal clear.
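The rewriting step described above can be sketched in a few lines. The rule table here is a toy example (real GLM files carry hundreds of context-sensitive rules, and real scorers typically allow either surface form via alternation rather than forcing one direction); it only shows the mechanics of collapsing a multi-word form to its GLM alternative before alignment.

```python
# Toy GLM-style rewrite table (illustrative only; not the actual
# Leaderboard GLM, which is far larger and context-sensitive).
GLM_RULES = {
    ("WE", "ARE"): "WE'RE",
    ("DO", "NOT"): "DON'T",
}

def apply_glm(tokens):
    """Greedily collapse adjacent word pairs to their GLM alternatives,
    left to right, leaving unmatched tokens untouched."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in GLM_RULES:
            out.append(GLM_RULES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(apply_glm("WE ARE SURE WE DO NOT KNOW".split()))
# -> ["WE'RE", 'SURE', 'WE', "DON'T", 'KNOW']
```

After this normalization, WE ARE in the hypothesis aligns with WE'RE in the reference, so the contraction no longer counts as a substitution error.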

@dophist dophist added the documentation Improvements or additions to documentation label Sep 17, 2022