
About gigaspeech glm file #124

Open
CuiMingyu opened this issue Sep 17, 2022 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

@CuiMingyu

Hi,

Does GigaSpeech provide a GLM file, like SWBD's en20000405_hub5.glm, that contains the transcript-filtering rules?

I notice there are some rules in the gigaspeech_scoring.py file, but do you have a GLM file covering all of the rules?

Thanks a lot!

@CuiMingyu
Author

An example of an SWBD GLM:
[image: excerpt of en20000405_hub5.glm]

@dophist
Collaborator

dophist commented Sep 17, 2022

The short answer is: YES and NO.

Actually, this is a pretty good question, so I'm going to keep this thread open for documentation purposes. Here is the long answer:

On the NO side:
The reason we don't provide a GLM within GigaSpeech is that we don't want to complicate the evaluation process with overly complex sub-systems (such as TN and context-dependent language rewriting), so that downstream research toolkits can integrate and adopt GigaSpeech easily.

And as you mentioned, we do provide a very simple script containing our recommended text post-processing; see discussion #24. It should provide a reliable apples-to-apples basis for academic comparisons.
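To illustrate the kind of lightweight post-processing such a script performs, here is a minimal sketch. The tag and filler lists below are illustrative stand-ins, not the exact sets used by gigaspeech_scoring.py; consult that script for the authoritative rules.

```python
# Illustrative post-processing in the spirit of gigaspeech_scoring.py.
# NOTE: these tag/filler sets are hypothetical examples, not the
# actual lists from the script.
PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
FILLERS = {"UH", "UM", "ER", "MM", "HM", "AH"}

def post_process(text: str) -> str:
    """Uppercase, then drop punctuation tags and conversational fillers
    so that hypothesis and reference are compared on the same footing."""
    kept = [w for w in text.upper().split()
            if w not in PUNCT_TAGS and w not in FILLERS]
    return " ".join(kept)

print(post_process("yeah um we are ready <COMMA> i think <PERIOD>"))
# -> "YEAH WE ARE READY I THINK"
```

Applying the same filter to both the reference and the hypothesis before computing WER is what makes the comparison apples-to-apples.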

On the YES side:
To take ASR benchmarking more seriously, closer to real-life ASR scenarios, we developed a universal benchmarking platform that contains modules such as:

  • production-grade TN (based on NeMo)
  • a sophisticated evaluation tool (supporting GLM and other features, going beyond the NIST tools)

These live in our Leaderboard project repo, where you can already find a GLM file containing hundreds of rewriting rules, for English in general, not limited to GigaSpeech. You are welcome to help us improve it; it's an asset for the entire speech community.

Here is a glance at dummy outputs from the scoring tool:
[screenshot: sample scoring-tool output, 2022-09-17]

As you can see, the raw form WE ARE is transformed to WE'RE as the result of the GLM rule WE'RE <-> WE ARE, to match the reference on the fly. We even tag these alternative expansions with # and pretty-align them, so that error analysis becomes crystal clear.
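The rewriting step described above can be sketched in a few lines. The rule table here is a toy example (real GLM files carry hundreds of context-sensitive rules, and real scorers typically allow either surface form via alternation rather than forcing one direction); it only shows the mechanics of collapsing a multi-word form to its GLM alternative before alignment.

```python
# Toy GLM-style rewrite table (illustrative only; not the actual
# Leaderboard GLM, which is far larger and context-sensitive).
GLM_RULES = {
    ("WE", "ARE"): "WE'RE",
    ("DO", "NOT"): "DON'T",
}

def apply_glm(tokens):
    """Greedily collapse adjacent word pairs to their GLM alternatives,
    left to right, leaving unmatched tokens untouched."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in GLM_RULES:
            out.append(GLM_RULES[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(apply_glm("WE ARE SURE WE DO NOT KNOW".split()))
# -> ["WE'RE", 'SURE', 'WE', "DON'T", 'KNOW']
```

After this normalization, WE ARE in the hypothesis aligns with WE'RE in the reference, so the contraction no longer counts as a substitution error.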

@dophist dophist added the documentation Improvements or additions to documentation label Sep 17, 2022