
Commit 35b4120

sarapapimgaido91 authored and committed

[!202][RELEASE] StreamAtt

# Which work do we release?
"StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection" published at ACL 2024.

# What changes does this release refer to?
800277d6dc8295145a504f1d56c3fc354e8381a6
0396faf27244901498147f04b3f73fa2c7d52951
af71dc0030884da92a93d45adb55020443d15381
77ea833521a1fd54159a12fab8ef67d13bbecced
1 parent d567765 commit 35b4120

File tree: 2 files changed, +148 -0 lines changed

- README.md
- fbk_works/STREAMATT_STREAMLAAL.md

README.md (+3)

@@ -5,6 +5,7 @@ Dedicated README for each work can be found in the `fbk_works` directory.
 
 ### 2024
 
+- [[ACL 2024] **StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection**](fbk_works/STREAMATT_STREAMLAAL.md)
 - [[ACL 2024] **SBAAM! Eliminating Transcript Dependency in Automatic Subtitling**](fbk_works/SBAAM.md)
 - [[ACL 2024] **When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP**](fbk_works/BUGFREE_CONFORMER.md)
 - [[LREC-COLING 2024] **How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena**](fbk_works/HYENA_COLING2024.md)

@@ -37,6 +38,8 @@ Dedicated README for each work can be found in the `fbk_works` directory.
 If using this repository, please acknowledge the related paper(s) citing them.
 Bibtex citations are available for each work in the dedicated README file.
 
+## Installation
+
 To install the repository, do:
 
 ```

fbk_works/STREAMATT_STREAMLAAL.md (new file, +145 lines)

# StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection (ACL 2024)
![ACL Anthology](https://img.shields.io/badge/anthology-brightgreen?logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8%2BCjwhLS0gQ3JlYXRlZCB3aXRoIElua3NjYXBlIChodHRwOi8vd3d3Lmlua3NjYXBlLm9yZy8pIC0tPgo8c3ZnCiAgIHhtbG5zOnN2Zz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciCiAgIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIKICAgdmVyc2lvbj0iMS4wIgogICB3aWR0aD0iNjgiCiAgIGhlaWdodD0iNjgiCiAgIGlkPSJzdmcyIj4KICA8ZGVmcwogICAgIGlkPSJkZWZzNCIgLz4KICA8cGF0aAogICAgIGQ9Ik0gNDEuOTc3NTUzLC0yLjg0MjE3MDllLTAxNCBDIDQxLjk3NzU1MywxLjc2MTc4IDQxLjk3NzU1MywxLjQ0MjExIDQxLjk3NzU1MywzLjAxNTggTCA3LjQ4NjkwNTQsMy4wMTU4IEwgMCwzLjAxNTggTCAwLDEwLjUwMDc5IEwgMCwzOC40Nzg2NyBMIDAsNDYgTCA3LjQ4NjkwNTQsNDYgTCA0OS41MDA4MDIsNDYgTCA1Ni45ODc3MDgsNDYgTCA2OCw0NiBMIDY4LDMwLjk5MzY4IEwgNTYuOTg3NzA4LDMwLjk5MzY4IEwgNTYuOTg3NzA4LDEwLjUwMDc5IEwgNTYuOTg3NzA4LDMuMDE1OCBDIDU2Ljk4NzcwOCwxLjQ0MjExIDU2Ljk4NzcwOCwxLjc2MTc4IDU2Ljk4NzcwOCwtMi44NDIxNzA5ZS0wMTQgTCA0MS45Nzc1NTMsLTIuODQyMTcwOWUtMDE0IHogTSAxNS4wMTAxNTUsMTcuOTg1NzggTCA0MS45Nzc1NTMsMTcuOTg1NzggTCA0MS45Nzc1NTMsMzAuOTkzNjggTCAxNS4wMTAxNTUsMzAuOTkzNjggTCAxNS4wMTAxNTUsMTcuOTg1NzggeiAiCiAgICAgc3R5bGU9ImZpbGw6I2VkMWMyNDtmaWxsLW9wYWNpdHk6MTtmaWxsLXJ1bGU6ZXZlbm9kZDtzdHJva2U6bm9uZTtzdHJva2Utd2lkdGg6MTIuODk1NDExNDk7c3Ryb2tlLWxpbmVjYXA6YnV0dDtzdHJva2UtbGluZWpvaW46bWl0ZXI7c3Ryb2tlLW1pdGVybGltaXQ6NDtzdHJva2UtZGFzaGFycmF5Om5vbmU7c3Ryb2tlLWRhc2hvZmZzZXQ6MDtzdHJva2Utb3BhY2l0eToxIgogICAgIHRyYW5zZm9ybT0idHJhbnNsYXRlKDAsIDExKSIKICAgICBpZD0icmVjdDIxNzgiIC8%2BCjwvc3ZnPgo%3D&label=ACL&labelColor=white&color=red)
Code for the paper: ["StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection"](https://arxiv.org/abs/2406.06097) published at the ACL 2024 main conference.

## 📎 Requirements
To run the agent, please make sure that [this repository](../README.md#installation) and
[SimulEval v1.1.0](https://github.com/facebookresearch/SimulEval/commit/ec759d124307096dbbf6c3269d2ed652cc15fbdd)
are installed.
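
For reference, a minimal installation sketch for SimulEval at the pinned commit (an assumption based on its standard source layout, not an official recipe; adapt paths and environments as needed):

```bash
# Sketch: install SimulEval v1.1.0 at the commit linked above
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout ec759d124307096dbbf6c3269d2ed652cc15fbdd
pip install -e .
```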

Create a text file (e.g., `src_audiopath_list.txt`) containing the list of paths to the audio
files (one path per line). Differently from SimulST, these audio files are __not__ split into
segments: each file is an entire speech.
Specifically, in the case of the MuST-C dataset used in the paper, the file contains the paths to
the entire TED talks, similar to the following:
```txt
${AUDIO_DIR}/ted_1096.wav
${AUDIO_DIR}/ted_1102.wav
${AUDIO_DIR}/ted_1104.wav
${AUDIO_DIR}/ted_1114.wav
${AUDIO_DIR}/ted_1115.wav
...
```
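One simple way to generate such a list, assuming the full-talk recordings are `.wav` files stored directly under `${AUDIO_DIR}` (an assumption about the directory layout, not a requirement of the agent):

```bash
# Write one path per line, one for each full-talk audio file
ls ${AUDIO_DIR}/*.wav > src_audiopath_list.txt
```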
As the target file `translations.txt`, either a dummy file or the concatenation of the reference
sentences (one line for each talk) can be used.
However, for the evaluation of already segmented test sets, such as MuST-C, these references are
not needed: the evaluation is performed directly on the segmented translations provided with the
dataset, as described in [Evaluation with StreamLAAL](#-evaluation-streamlaal).
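For instance, a dummy `translations.txt` with one line per talk can be produced as follows (a sketch; the placeholder content is arbitrary, since in this setup the targets are not used for the final evaluation):

```bash
# One placeholder line per talk listed in src_audiopath_list.txt
awk '{print "dummy"}' src_audiopath_list.txt > translations.txt
```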

## 📌 Pre-trained Offline models
⚠️ The offline ST models used for the baseline, AlignAtt, and StreamAtt are the same and are already
available at the [AlignAtt release webpage](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-pre-trained-offline-models).

## 🤖 Streaming Inference: *StreamAtt*
For the streaming inference, set `--config` and `--model-path` to, respectively, the config file
and the model checkpoint downloaded in the
[Pre-trained Offline models](#-pre-trained-offline-models) step.
As `--source` and `--target`, please use the files `src_audiopath_list.txt` and `translations.txt`
created in the [Requirements](#-requirements) step.

The output will be saved in `--output`.
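
The commands below rely on shell variables as placeholders. A possible setup is sketched here; all paths are illustrative assumptions, while `FRAME` takes one of the *f* values from the paper:

```bash
SRC_LIST_OF_AUDIO=src_audiopath_list.txt  # audio list from the Requirements step
TGT_FILE=translations.txt                 # dummy or concatenated references
DATA_ROOT=/path/to/data_root              # data directory (assumed to contain config.yaml)
OUT_DIR=/path/to/output                   # SimulEval writes instances.log here
FRAME=2                                   # f value for AlignAtt (2, 4, 6, or 8 in the paper)
```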

### ⭐ StreamAtt
For the ***Hypothesis Selection*** (based on AlignAtt), please set `--frame-num` to the value of
*f* used for the inference (`f=[2, 4, 6, 8]` in the paper).

Depending on the ***Textual History Selection*** method ([Fixed Words](#fixed-words) or [Punctuation](#punctuation)), run one of the following commands:

#### Fixed Words
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedWordsHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --history-words 20 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output ${OUT_DIR} \
    --device cuda:0
```

#### Punctuation
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.PunctuationHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output ${OUT_DIR} \
    --device cuda:0
```

### ⭐ Baseline and Upperbound

To run the baseline, execute the following command:
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \
    --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \
    --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedAudioHistorySelection \
    --source ${SRC_LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --config config.yaml \
    --model-path checkpoint.pt \
    --source-segment-size 1000 \
    --extract-attn-from-layer 3 \
    --frame-num ${FRAME} \
    --history-words 20 \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --output ${OUT_DIR} \
    --device cuda:0
```

For the simultaneous inference with AlignAtt (the upperbound presented in the paper), please refer
to the [AlignAtt README](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-inference).

## 💬 Evaluation: *StreamLAAL*
To evaluate the streaming outputs, download and extract
[mwerSegmenter](https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz) into the
`${MWERSEGMENTER_DIR}` folder, and run the following command:
```bash
export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR}

streamLAAL --simuleval-instances ${SIMULEVAL_INSTANCES} \
    --reference ${REFERENCE_TEXTS} \
    --audio-yaml ${AUDIO_YAML} \
    --sacrebleu-tokenizer ${SACREBLEU_TOKENIZER} \
    --latency-unit ${LATENCY_UNIT}
```
where `${SIMULEVAL_INSTANCES}` is the output `instances.log` produced by the agent in the previous
step, `${REFERENCE_TEXTS}` are the textual references in the target language (one line for each
segment), `${AUDIO_YAML}` is the yaml file containing the original audio segmentation,
`${SACREBLEU_TOKENIZER}` is the [sacreBLEU](https://github.com/mjpost/sacrebleu) tokenizer used for
the quality evaluation (defaults to `13a`), and `${LATENCY_UNIT}` is the unit used for the latency
computation (either `word` or `char`; defaults to `word`, the unit used in the paper).
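
As a concrete example, an invocation with the default settings might look as follows (all paths are illustrative assumptions):

```bash
export MWERSEGMENTER_ROOT=/opt/mwerSegmenter

streamLAAL --simuleval-instances ${OUT_DIR}/instances.log \
    --reference /path/to/segmented_references.txt \
    --audio-yaml /path/to/audio_segmentation.yaml \
    --sacrebleu-tokenizer 13a \
    --latency-unit word
```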

If invoking `streamLAAL` does not work, please include the FBK-fairseq directory
(`${FBK_FAIRSEQ_DIR}`) in the `PYTHONPATH` (`export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH`) or
call it explicitly by running
`python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py`.


## 📍 Citation
```bibtex
@inproceedings{papi-et-al-2024-streamatt,
    title = {{StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection}},
    author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2024},
    address = {Bangkok, Thailand},
}
```
