# StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection (ACL 2024)

Code for the paper: ["StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection"](https://arxiv.org/abs/2406.06097) published at the ACL 2024 main conference.

## 📎 Requirements
To run the agent, please make sure that [this repository](../README.md#installation) and
[SimulEval v1.1.0](https://github.com/facebookresearch/SimulEval/commit/ec759d124307096dbbf6c3269d2ed652cc15fbdd)
are installed.

Create a text file (e.g., `src_audiopath_list.txt`) containing the list of paths to the audio
files (one path per line). Differently from SimulST, the audio files are __not__ split into
segments: each file contains an entire speech.
Specifically, in the case of the MuST-C dataset used in the paper, the file contains the paths to
the entire TED talk files, similar to the following:
```txt
${AUDIO_DIR}/ted_1096.wav
${AUDIO_DIR}/ted_1102.wav
${AUDIO_DIR}/ted_1104.wav
${AUDIO_DIR}/ted_1114.wav
${AUDIO_DIR}/ted_1115.wav
...
```
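As a minimal sketch, assuming the full-talk recordings follow the `ted_*.wav` naming shown above, the list can be generated with:
```bash
# One path per line, one entry per full TED talk recording
# (assumes ${AUDIO_DIR} is set and files follow the ted_*.wav naming)
ls ${AUDIO_DIR}/ted_*.wav > src_audiopath_list.txt
```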
As target file `translations.txt`, either a dummy file or the concatenation of the reference
sentences can be used, one line for each talk.
However, for the evaluation of already segmented test sets, such as MuST-C, these references
are not needed: the evaluation is performed directly on the segmented translations provided with
the dataset, as described in [Evaluation with StreamLAAL](#-evaluation-streamlaal).
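For example, a dummy target file with one placeholder line per talk (sufficient when the actual evaluation is done with StreamLAAL) can be created as follows; the placeholder text itself is arbitrary:
```bash
# One dummy line per entry in src_audiopath_list.txt
awk '{print "placeholder"}' src_audiopath_list.txt > translations.txt
```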

## 📌 Pre-trained Offline models
⚠️ The offline ST models used for the baseline, AlignAtt, and StreamAtt are the same and already available at
the [AlignAtt release webpage](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-pre-trained-offline-models)❗

## 🤖 Streaming Inference: *StreamAtt*
For the streaming inference, set `--config` and `--model-path` to, respectively, the config file
and the model checkpoint downloaded in the
[Pre-trained Offline models](#-pre-trained-offline-models) step.
As `--source` and `--target`, please use the files `src_audiopath_list.txt` and `translations.txt`
created in the [Requirements](#-requirements) step.

The output will be saved in `--output`.
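For convenience, the environment variables referenced by the commands below can be set in advance. The values here are purely illustrative placeholders, not prescribed paths:
```bash
# Illustrative placeholder values: adapt all paths to your setup
export SRC_LIST_OF_AUDIO=/path/to/src_audiopath_list.txt  # list of full-talk audio paths
export TGT_FILE=/path/to/translations.txt                 # dummy or concatenated references
export DATA_ROOT=/path/to/data_root                       # data root passed to --data-bin
export FRAME=4                                            # f value for --frame-num (2, 4, 6, or 8)
export OUT_DIR=/path/to/output_dir                        # directory passed to --output
```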

### ⭐ StreamAtt
For the ***Hypothesis Selection*** (based on AlignAtt), please set `--frame-num` to the value of
*f* used for the inference (`f = 2, 4, 6, 8` in the paper).

Depending on the ***Textual History Selection*** method ([Fixed Words](#fixed-words) or [Punctuation](#punctuation)), run one of the following commands:

#### Fixed Words
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedWordsHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --history-words 20 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
    --device cuda:0
```

#### Punctuation
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.PunctuationHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
    --device cuda:0
```
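Note that, differently from Fixed Words, the Punctuation command does not set `--history-words`: with `PunctuationHistorySelection`, the amount of retained textual history is presumably determined by punctuation boundaries rather than by a fixed word count.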

### ⭐ Baseline and Upperbound

To run the baseline, execute the following command:
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedAudioHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --history-words 20 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \
    --device cuda:0
```

For the simultaneous inference with AlignAtt (the upperbound presented in the paper), please refer
to the [AlignAtt README](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-inference).

## 💬 Evaluation: *StreamLAAL*
To evaluate the streaming outputs, download and extract the
[mwerSegmenter](https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz) in the
`${MWERSEGMENTER_DIR}` folder, and run the following command:
```bash
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}

streamLAAL --simuleval-instances ${SIMULEVAL_INSTANCES} \
    --reference ${REFERENCE_TEXTS} \
    --audio-yaml ${AUDIO_YAML} \
    --sacrebleu-tokenizer ${SACREBLEU_TOKENIZER} \
    --latency-unit ${LATENCY_UNIT}
```
where `${SIMULEVAL_INSTANCES}` is the output `instances.log` produced by the agent in the previous
step, `${REFERENCE_TEXTS}` are the textual references in the target language (one line for each
segment), `${AUDIO_YAML}` is the yaml file containing the original audio segmentation,
`${SACREBLEU_TOKENIZER}` is the [sacreBLEU](https://github.com/mjpost/sacrebleu) tokenizer used for
the quality evaluation (defaults to `13a`), and `${LATENCY_UNIT}` is the unit used for the latency
computation (either `word` or `char`; defaults to `word`, the unit used in the paper).
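If mwerSegmenter is not yet available locally, one possible way to fetch and extract it is sketched below; the assumption that the archive unpacks into a top-level `mwerSegmenter` directory may need to be adapted:
```bash
# Download and unpack mwerSegmenter
# (assumes the archive extracts to a mwerSegmenter/ directory)
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
tar -xzf mwerSegmenter.tar.gz
export MWERSEGMENTER_ROOT=$(pwd)/mwerSegmenter
```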

If invoking `streamLAAL` does not work, please include the FBK-fairseq directory
(`${FBK_FAIRSEQ_DIR}`) in the `PYTHONPATH` (`export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH`) or
call it explicitly by running
`python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py`.
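Putting it together, the explicit invocation would look like the following sketch, assuming SimulEval wrote `instances.log` under `${OUT_DIR}` and using the default tokenizer and latency unit described above:
```bash
export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH

# Explicit call to the StreamLAAL script (values mirror those described above)
python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py \
    --simuleval-instances ${OUT_DIR}/instances.log \
    --reference ${REFERENCE_TEXTS} \
    --audio-yaml ${AUDIO_YAML} \
    --sacrebleu-tokenizer 13a \
    --latency-unit word
```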

## 📍Citation
```bibtex
@inproceedings{papi-et-al-2024-streamatt,
    title = {{StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection}},
    author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2024},
    address = {Bangkok, Thailand},
}
```