Implement recipe for Fluent Speech Commands dataset #1469
Conversation
Signed-off-by: Xinyuan Li <[email protected]>
Thanks! Could you please clean up the code first? For example, if your recipe shares the same source files as an existing one, could we symlink to the existing copies? Also, please have a look at the other recipes and follow the same structure.
Thanks for your feedback!
Hi Xinyuan! I left a few comments in this PR, please check. It would also be preferable to use symlinks to reduce redundancy. If you want to make further changes to the recipe and make a PR in the future, you can try creating a new recipe like … Thank you!
Usually we use symlinks for generic files like beam_search.py, conformer.py, decoder.py, encoder.py, encoder_interface.py, joiner.py, subsampling.py, model.py, transformer.py, and test-related scripts like test_*. Also, please rename asr_datamodule.py to slu_datamodule.py, considering the task name is "SLU" rather than "ASR" in your case 🤔
Signed-off-by: Xinyuan Li <[email protected]>
Signed-off-by: Xinyuan Li <[email protected]>
Signed-off-by: Xinyuan Li <[email protected]>
Thanks a lot for your comments! In a recent commit I updated my recipe to use symlinks wherever possible (I think the only exception was beam_search, which has a dependency on the vocabulary). I can't seem to see your comments on individual files, and they don't seem to be hidden behind a particular commit. Do you know how I could find them? Thanks again!
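For illustration, the symlink approach discussed above usually looks something like this (all paths here are hypothetical, not the repo's actual layout):

```shell
# Hypothetical sketch: reuse generic recipe modules via relative symlinks
# instead of copying them (the directory names below are illustrative only).
mkdir -p egs/fluent/SLU/transducer
for f in conformer.py decoder.py joiner.py model.py subsampling.py; do
  # Point each shared file back at an existing recipe's copy,
  # using a path relative to the new recipe directory.
  ln -sf ../../../librispeech/ASR/transducer/"$f" egs/fluent/SLU/transducer/"$f"
done
```

Relative link targets keep the links valid when the repository is cloned to a different location; recipe-specific files (here, anything depending on the SLU vocabulary) stay as real files.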
Dear Xinyuan,
You can find the comments I left regarding specific files and lines in the "Files changed" section, as well as in the "Conversation" section of this PR's webpage.
It goes like this in the "Conversation" section:
![image](https://github.com/k2-fsa/icefall/assets/60612200/50934cd6-417a-4426-8c18-f420dbe03cee)
Best
Also, please remember to format the scripts with the black and isort formatters:
```
pip install black isort  # install the formatters
isort *.py               # sort the scripts' imports
black *.py               # format the scripts' code
```
The format of this repo conforms to the black and isort style.
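When the two formatters are used together, isort's black-compatible profile keeps them from disagreeing over import formatting. A minimal configuration sketch (illustrative, not necessarily icefall's actual configuration):

```toml
# pyproject.toml (illustrative)
[tool.isort]
profile = "black"

[tool.black]
line-length = 88
```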
Signed-off-by: Xinyuan Li <[email protected]>
I still couldn't see your comments... I was quite puzzled by this until I stumbled upon this thread, which seems to describe what's happening: https://github.com/orgs/community/discussions/30638#discussioncomment-4574199 So it seems you might need to "finish the review", whatever that means. Sorry for the hassle, and thanks again for helping with this PR!
icefall/shared/make_kn_lm.py (outdated)

```diff
@@ -169,7 +169,7 @@ def add_raw_counts_from_file(self, filename):
         with open(filename, encoding=default_encoding) as fp:
             for line in fp:
                 line = line.strip(strip_chars)
-                self.add_raw_counts_from_line(line)
+                self.add_raw_counts_from_line(line.split()[0])
```
This line seems to be hard-coded 🤔 Not sure if it causes unwanted changes for other cases.
Oof, my bad, just submitted the review. Let me make sure of the n-gram LM stuff before merging it. Thanks!
Thanks! It's been a while since I made that particular change (with n-gram LMs); I remember there was a good reason, but I can't recall it off the top of my head now. Let me check again as well!
I tried running without the change active: in the generated .arpa LM file, all the word indices were given a weight as well, so some unwanted interaction must be taking place between the 1-gram LM training and the add_raw_counts_from_line function. In theory I could add a check that runs the old version when the n in n-gram is greater than 1, and the new version when n = 1, although I won't be able to justify the change with anything more convincing than "it seems to be the only way that works without breaking any existing recipes". What are your thoughts on this?
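The special-casing proposed above could look roughly like this (a minimal sketch with illustrative names; make_kn_lm.py's actual API differs, and the "word plus index column" input format is an assumption based on the discussion):

```python
def select_tokens(line: str, ngram_order: int) -> str:
    """Return the portion of an input line whose tokens should be counted.

    Hypothetical helper, not part of make_kn_lm.py. Assumes that in the
    1-gram case each line is "<word> <index>", where the index column must
    not be counted as a word.
    """
    if ngram_order == 1:
        # 1-gram case: count only the first field, dropping the index column.
        return line.split()[0]
    # Higher-order case: count every token on the line, as the original code did.
    return line


# Usage sketch:
assert select_tokens("hello 42", 1) == "hello"
assert select_tokens("turn on the lights", 2) == "turn on the lights"
```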
Signed-off-by: Xinyuan Li <[email protected]>
Yes, I think the latest commit is OK for the special case. Waiting for the final CI test to finish, thank you!
Is it good to go? :)
Yes, I think this one is good to be merged once lhotse has merged the PR for data preparation.
Thanks!! Looks like the lhotse PR has just been merged :)
Thanks for your contribution! Let us merge it first so that further work won't be blocked. Could you update the results and upload pretrained models in a separate PR?
Thanks!! Will do!
Dataset link: https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/