
Implement recipe for Fluent Speech Commands dataset #1469

Merged · 9 commits · Jan 31, 2024

Conversation

@HSTEHSTEHSTE (Contributor) opened this pull request.

@pkufool (Collaborator) commented Jan 22, 2024

Thanks! Could you please clean up the code first? For example, if your recipe shares source files with an existing recipe, could we symlink to the existing ones? Also, please have a look at the other recipes and follow the same structure.

@HSTEHSTEHSTE (Contributor, author)

> Thanks! Could you please clean up the code first? For example, if your recipe shares source files with an existing recipe, could we symlink to the existing ones?

Thanks for your feedback!
I removed the TDNN-based recipe (its performance was not competitive anyway). The transducer recipe was largely based on yesno; if you think it would be helpful, I could symlink everything back to the yesno recipe (although I would be somewhat wary of doing so, given that I may make subsequent changes to this recipe). I have also not touched the copyright and licensing information in any of the files. Beyond that, if there is anything in particular where the code could be further cleaned up, please let me know. Thanks!

> Also, please have a look at the other recipes and follow the same structure.

I have moved everything to fluent_speech_commands/SLU, following the dataset/task file structure. Is there anything else I need to fix?

@JinZr (Collaborator) commented Jan 24, 2024

hi xinyuan! i left a few comments in this PR, please check.

also it would be preferable to use symlinks to reduce redundancy. if you want to make further changes to the recipe and make a PR in the future, you can try creating a new recipe like transducer_xx without using symlinks in it.

thank you!

@JinZr (Collaborator) commented Jan 24, 2024

usually we will use symlinks for generic files like beam_search.py, conformer.py, decoder.py, encoder.py, encoder_interface.py, joiner.py, subsampling.py, model.py, transformer.py and test-related scripts like test_*. also, please rename asr_datamodule.py to slu_datamodule.py, considering the task name is "SLU" rather than "ASR" in your case 🤔
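
(Aside: a minimal sketch of how such symlinks could be created. The `link_generic_files` helper and the `egs/...` paths below are hypothetical illustrations, not actual icefall code.)

```python
#!/usr/bin/env python3
# Hypothetical sketch: symlink the generic source files from an existing
# recipe into a new one. The helper name, file list, and recipe paths are
# illustrative assumptions, not the actual icefall layout.
import os
from pathlib import Path

GENERIC_FILES = [
    "beam_search.py", "conformer.py", "decoder.py", "encoder.py",
    "encoder_interface.py", "joiner.py", "subsampling.py",
    "model.py", "transformer.py",
]

def link_generic_files(src_recipe: str, dst_recipe: str) -> None:
    """Create relative symlinks in dst_recipe pointing at src_recipe."""
    dst = Path(dst_recipe)
    dst.mkdir(parents=True, exist_ok=True)
    for name in GENERIC_FILES:
        link = dst / name
        if link.exists():
            continue  # keep any recipe-specific copy (e.g. a modified beam_search.py)
        # a relative target keeps the links valid wherever the repo is checked out
        target = os.path.relpath(Path(src_recipe) / name, start=dst)
        link.symlink_to(target)

if __name__ == "__main__":
    link_generic_files(
        "egs/yesno/ASR/transducer",                   # existing recipe (assumed path)
        "egs/fluent_speech_commands/SLU/transducer",  # new recipe (assumed path)
    )
```

Relative targets are used so the links survive a checkout at a different filesystem location, which absolute paths would not.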

Xinyuan Li added 3 commits on January 23, 2024 at 21:39
@HSTEHSTEHSTE (Contributor, author)

> usually we will use symlinks for generic files like beam_search.py, conformer.py, decoder.py, encoder.py, encoder_interface.py, joiner.py, subsampling.py, model.py, transformer.py and test-related scripts like test_*. also, please rename asr_datamodule.py to slu_datamodule.py, considering the task name is "SLU" rather than "ASR" in your case 🤔

Thanks a lot for your comments! In a recent commit I updated my recipe to use symlinks wherever possible (I think the only exception was beam_search.py, which has a dependency on the vocabulary). I can't seem to see your comments on individual files, and they don't seem to be hidden behind a particular commit. Do you know how I could find them? Thanks again!

@JinZr (Collaborator) commented Jan 24, 2024 via email

dear xinyuan, you can find the comments i left regarding specific files and lines in the "files changed" section as well as the "conversation" section of this PR's webpage. it goes like this in the "conversation" section: [screenshot] best

@HSTEHSTEHSTE (Contributor, author)

I still couldn't see your comments... [screenshot]

I was quite puzzled by this until I stumbled upon this thread, which seems to describe what's happening:

https://github.com/orgs/community/discussions/30638#discussioncomment-4574199

So it seems you might need to "finish the review", whatever that means. Sorry for the hassle, and thanks again for helping with this PR!

@@ -169,7 +169,7 @@ def add_raw_counts_from_file(self, filename):
     with open(filename, encoding=default_encoding) as fp:
         for line in fp:
             line = line.strip(strip_chars)
-            self.add_raw_counts_from_line(line)
+            self.add_raw_counts_from_line(line.split()[0])
Review comment (Collaborator):

this line seems to be hard coded 🤔 not sure if it causes unwanted changes for other cases
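
(To make the concern concrete: if this code path were also used for an ordinary text corpus, keeping only the first whitespace-separated token would silently drop the rest of each line from the raw counts. A hypothetical illustration:)

```python
line = "turn on the kitchen lights"
first = line.split()[0]  # -> "turn"
# only "turn" would be counted; "on the kitchen lights" never reaches
# add_raw_counts_from_line, skewing the counts for higher-order n-grams
```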

@JinZr (Collaborator) commented Jan 24, 2024

oof my bad, just submitted the review.

let me make sure of the ngram lm stuff before merging it.

thanks!

@HSTEHSTEHSTE (Contributor, author)

> oof my bad, just submitted the review.
>
> let me make sure of the ngram lm stuff before merging it.
>
> thanks!

Thanks! It's been a while since I made that particular change (with n-gram LMs); I remember there was a good reason, but I can't recall it off the top of my head. Let me check again as well!

@HSTEHSTEHSTE (Contributor, author)

I tried running without the change: in the generated .arpa LM file, all the word indices were given weights as well, so some unwanted interaction must have taken place between the 1-gram LM training and the add_raw_counts_from_line function. In principle I could add a check that runs the old version when the n-gram order is greater than 1 and the new version when n = 1, although I can't justify the change with anything more convincing than "it seems to be the only way that works without breaking any existing recipes". What are your thoughts on this?
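
(For concreteness, a sketch of what that order-dependent check could look like, based on the diff above. `self.ngram_order` is an assumed attribute name, and `default_encoding` / `strip_chars` come from the surrounding module as in the diff; this is not the actual code in the PR.)

```python
# Hypothetical sketch of the proposed guard (attribute name assumed):
def add_raw_counts_from_file(self, filename):
    with open(filename, encoding=default_encoding) as fp:
        for line in fp:
            line = line.strip(strip_chars)
            if self.ngram_order == 1:
                # special case for 1-gram LMs: keep only the first token
                # so word indices don't get weights in the generated .arpa
                self.add_raw_counts_from_line(line.split()[0])
            else:
                # original behaviour for higher-order LMs
                self.add_raw_counts_from_line(line)
```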

@JinZr (Collaborator) commented Jan 25, 2024

yes i think the latest commit is ok for the special case.

waiting for the final CI test to be done, thank you!

@HSTEHSTEHSTE (Contributor, author)

Is it good to go? :)

@JinZr (Collaborator) commented Jan 26, 2024

yes, i think this one is good to be merged once lhotse has merged the pr for data preparation

@HSTEHSTEHSTE (Contributor, author)

> yes, i think this one is good to be merged once lhotse has merged the pr for data preparation

Thanks!! Looks like the lhotse PR has just been merged :)

@csukuangfj (Collaborator)

Thanks for your contribution!

Let us merge it first so that further work won't be blocked.

Could you update the results and upload pretrained models in a separate PR?

@HSTEHSTEHSTE (Contributor, author)

> Thanks for your contribution!
>
> Let us merge it first so that further work won't be blocked.

Thanks!!

> Could you update the results and upload pretrained models in a separate PR?

Will do!

@csukuangfj merged commit b07d547 into k2-fsa:master on Jan 31, 2024
54 checks passed