Typo fix in documentation #1455

Merged
merged 1 commit on Jan 9, 2024
27 changes: 9 additions & 18 deletions docs/source/recipes/RNN-LM/librispeech/lm-training.rst
@@ -4,7 +4,7 @@ Train an RNN language model
======================================

If you have enough text data, you can train a neural network language model (NNLM) to improve
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
scratch.

.. HINT::
@@ -15,23 +15,23 @@ scratch.
.. note::

This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
for illustration purposes. You can also collect your own data. The data format is quite simple:
each line should contain a complete sentence, and words should be separated by spaces.
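
For example, a small training file of your own could be created like this (the file name and
sentences below are only placeholders):

.. code-block:: bash

$ # each line is one complete sentence; words are separated by spaces
$ cat > my-lm-train.txt << EOF
THE CAT SAT ON THE MAT
SPEECH RECOGNITION IS FUN
EOF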

First, let's download the training data for the RNNLM. This can be done via the
following command:

.. code-block:: bash

$ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
$ gzip -d librispeech-lm-norm.txt.gz
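
Optionally, you can take a quick look at the uncompressed corpus. It is a single plain-text
file containing tens of millions of sentences (the exact count may vary between mirrors):

.. code-block:: bash

$ wc -l librispeech-lm-norm.txt
$ head -n 3 librispeech-lm-norm.txt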

As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
BPE tokenizer. This can be achieved by executing the following command:

.. code-block:: bash

$ # if you don't have the BPE
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
$ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
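$ # the clone above skips large files (GIT_LFS_SKIP_SMUDGE=1), so the BPE model still has
$ # to be fetched via git-lfs; the file name below is an assumption -- adjust it to the
$ # actual model path inside the repository
$ git lfs pull --include bpe.model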
@@ -56,11 +56,11 @@ sentence length.
--out-statistics data/lang_bpe_500/lm_data_stats.txt


The aforementioned steps can be repeated to create a validation set for your RNNLM. Let's say
you have a validation set in ``valid.txt``; you can then set ``--lm-data valid.txt``
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
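
A sketch of that call could look as follows (only the two options mentioned above are shown;
depending on the script version, further arguments such as the path to the BPE model may be
required):

.. code-block:: bash

$ ./local/prepare_lm_training_data.py \
--lm-data valid.txt \
--lm-archive data/lang_bpe_500/lm-data-valid.pt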

After completing the previous steps, the training and validation sets for the RNNLM are ready.
The next step is to train the RNNLM model. The training command is as follows:

.. code-block:: bash
@@ -77,7 +77,7 @@ The next step is to train the RNNLM model. The training command is as follows:
--use-fp16 0 \
--tie-weights 1 \
--embedding-dim 2048 \
--hidden_dim 2048 \
--hidden-dim 2048 \
--num-layers 3 \
--batch-size 300 \
--lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
@@ -93,12 +93,3 @@ The next step is to train the RNNLM model. The training command is as follows:
.. note::

Training the RNNLM can take a long time (usually a couple of days).
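
Because the job runs for days, it is convenient to launch it in a way that survives a closed
terminal, for example with ``nohup`` (``run_rnnlm.sh`` below is a hypothetical wrapper script
containing the training command shown above):

.. code-block:: bash

$ nohup ./run_rnnlm.sh > rnnlm-train.log 2>&1 &
$ tail -f rnnlm-train.log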