
Commit 9e9c2c7

Lina Varella Conti authored and mgaido91 committed
[!215][RELEASE] SPES
# Which work do we release?
"SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation"

# What changes does this release refer to?
7f7699d1fae4650832d8733cbf54d9279d04da35 3701eaa55ab5b599b23a93df0556760d92da78fc ca14dac9548d0be2f2132924e5bdcbe0c4013285 4641fda1a20569fb388341de12ad164084752607 43172daab2819f0a7810e5daf3c50fbc734ec7d8 076dd532cb7d806c242db44d86c355f6342dd2df c60003bbefc11f7900994b132fb7325237cd2273 713868e03517ee55949a906a26e89d0949f078c0 35c8e41a2ecdf6935076c1d6508279fde3045449 5edbeefe03f3f36ec733edfff3e47f2f53ffcf63 221e192c4f391235b87db8a257109a7a61f82a5f e8692a74ad84ad02af9ccafc4bc1e843f280a2aa 153d6a472262d100b6961f6d42beb691cdb7d64b d319d2167b150c791f4e0b52a1aacf9a1b1d2024 138253d9776a640fe025fb3c257755b4617d360b abd64ee4855a2af20142408907e030c8558c5a4c 777ebafb800e4e6046ac99b87d5a0ec48f607428 644e119553f9c98f7a1c25c15d451b9275e6fffa 62458c83df0870957d4b77f3b80c65c53d06720f c5ccce6d0e24cb9321d22807313871d7804143fc 9fbea0a6d532195c0c6cf741269bb702d9b0d650 79bb24d851827331960f7eab865390d54fec60c6 7f233227560762bd88c343e74fd065c9bb89e3e3 c7889680c86b1472aff8498a8fa82d015a83fc75 96a8559d4f3ac89aee4c1910ecf2b2eabcae4ceb e5ff5114b30cd25a4483827ed626084722db6138 48f88052374399761737712eac63dca56690e5d9 6d94a866c2232026b0b536f4ecb01e4ba89d4dc8 12a298bcc00279f73aa8ce0a28cf718eed073fb3 b24bc8f54a26f7d34aed560922259d7589ec2211 b2be28cbee0c030af61915a34d1c32879b13b14b cd2df8732803f0585f7cef8a9b0d6d1b1e889fbd 4ddadaeaf31c8dbd98b82c639f31ad0bb7dbfc50 b02228eb9652819a2605a29285574c224eae7d53 a039fbdbf029d547aad3848555a49068bee6ae6a 019e89c7a9ed9fd42cd6a6f0a973b1334944b69f 3b8fd52a1f5ce04cd6cbaa42198f32a1d1a6d2c3 07db9c8987dd8c5fc8779a5dce787434d36dd695 c345d58f500b782bb2308cfb093e716aa4361de6
1 parent c44589e commit 9e9c2c7

File tree: 2 files changed, +180 −0 lines changed


README.md (+1)

@@ -9,6 +9,7 @@ Dedicated README for each work can be found in the `fbk_works` directory.

 ### 2024
+- [**SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation**](fbk_works/XAI_FEATURE_ATTRIBUTION.md)
 - [[IWSLT 2024] **SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation**](fbk_works/SIMULSEAMLESS.md)
 - [[ACL 2024] **StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection**](fbk_works/STREAMATT_STREAMLAAL.md)
 - [[ACL 2024] **SBAAM! Eliminating Transcript Dependency in Automatic Subtitling**](fbk_works/SBAAM.md)

fbk_works/XAI_FEATURE_ATTRIBUTION.md (+179)

@@ -0,0 +1,179 @@
# SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation

This README contains instructions to generate feature attribution explanations of the outputs of speech-to-text models using SPES, and to evaluate them.
SPES is introduced in [SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation](https://arxiv.org/abs/2411.01710).

The explanations consist of saliency maps of relevance scores for each value in the spectrogram representation of the input audio, as well as relevance scores for the previously generated text tokens.
To assign these relevance scores, SPES performs multiple forward passes on the same data with different parts of the input masked.
The impact of these occlusions is measured by comparing the original output probability distributions to those produced by the occluded inferences: the more the probability distribution changes, the more relevant the masked part of the input is considered to be.

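The perturb-and-compare idea can be sketched in a few lines. This is an illustrative simplification, not the repository implementation: `relevance_from_occlusions`, the array shapes, and the aggregation scheme are assumptions for the example; the KL-divergence comparison mirrors the `scorer: KL` option used later in this README.

```python
import numpy as np

def relevance_from_occlusions(p_orig, p_occ, masks):
    """Aggregate occlusion effects into a saliency map (illustrative sketch).

    p_orig: (T, V) original output distributions (T tokens, V vocab).
    p_occ:  (N, T, V) distributions from N occluded forward passes.
    masks:  (N, F, S) binary masks over the F x S spectrogram
            (1 = kept, 0 = occluded) used for each perturbed pass.
    Returns a (T, F, S) saliency map: one heatmap per output token.
    """
    eps = 1e-12
    # KL(p_orig || p_occ) per pass and token: how far each occlusion
    # pushes the model away from its reference distribution.
    kl = np.sum(p_orig[None] * (np.log(p_orig[None] + eps)
                                - np.log(p_occ + eps)), axis=-1)  # (N, T)
    occluded = 1.0 - masks  # 1 where the input was masked
    # Credit each occluded cell with the KL of its pass, averaged
    # over the passes in which that cell was occluded.
    return np.einsum('nt,nfs->tfs', kl, occluded) / (occluded.sum(axis=0) + eps)
```

Cells whose occlusion barely moves the distribution end up with relevance near zero, while cells whose occlusion reshuffles the output probabilities accumulate high scores.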
## 0. Preprocess the data

If the data is not preprocessed, follow the preprocessing steps of [Speechformer](https://gitlab.fbk.eu/mt/fbk-fairseq/-/blob/internal_master/fbk_works/SPEECHFORMER.md#preprocessing).

## 1. Perform a standard inference

The first step is a standard inference that saves the model's predictions. These predicted tokens are then used to apply forced decoding when performing the occluded inferences.

This can be done with the following script, where:
- `data_dir` is the directory where the tsv file containing the preprocessed data is stored;
- `tsv_file` is the name of that file (without the .tsv extension);
- `model_path` is the path to the fairseq model checkpoint to be used;
- `model_yaml_config` is the model's configuration file;
- `output_file` is the path where to store the standard output;
- `explanation_tsv` is the name of the output file that will contain the tokens predicted by the model.

In this and the following scripts, the argument `--max-tokens` should be adjusted based on the GPU's VRAM capacity.
```bash
python /fbk-fairseq/fairseq_cli/generate.py ${data_dir} \
    --gen-subset ${tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 40000 \
    --config-yaml ${model_yaml_config} \
    --beam 5 \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${model_path} > ${output_file}

# Save the tokenized translation hypotheses to a tab-separated file
python /fbk-fairseq/examples/speech_to_text/scripts/xai/prep_hyps_for_explanation.py \
    --model-output ${output_file} \
    --original-tsv ${data_dir}/${tsv_file}.tsv \
    --explain-tsv ${explanation_tsv}
```

## 2. Save output probabilities

The second step consists of running a standard inference and storing the output probability distributions. These stored distributions serve as the reference for computing relevance scores: relevance is determined by occluding different parts of the input, running "perturbed" inferences, and comparing their probability distributions to the reference ones. This comparison quantifies how much each perturbation affects the model's output.

In the following script, `explain_tsv_file` should be the name of the file generated in the previous step (without the .tsv extension), `data_dir` the directory where it is stored, and `output_file` the path where to store the probabilities (without the .h5 extension).
In this step, the configuration file (`explain_yaml_config`) is identical to `model_yaml_config` except that it omits the `bpe_tokenizer` field, since the target text used for forced decoding in `explain_tsv_file` is already tokenized.
All other variables are the same as in the previous step.

```bash
python /fbk-fairseq/examples/speech_to_text/get_probs_from_constrained_decoding.py ${data_dir} \
    --gen-subset ${explain_tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 10000 \
    --config-yaml ${explain_yaml_config} \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --path ${model_path} \
    --save-file ${output_file}
```
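Since `explain_yaml_config` is just `model_yaml_config` without the `bpe_tokenizer` field, it can be derived mechanically. A line-based sketch (the function name is illustrative and not part of the repository; it assumes `bpe_tokenizer` is a top-level key whose children are indented, as in fairseq S2T configs):

```python
def drop_top_level_key(yaml_text, key):
    """Remove a top-level YAML key and its indented block (line-based sketch)."""
    out, skipping = [], False
    for line in yaml_text.splitlines():
        if line.startswith(key + ":"):
            skipping = True  # start of the block to drop
            continue
        if skipping and line[:1] in (" ", "\t"):
            continue  # indented child of the removed key
        skipping = False
        out.append(line)
    return "\n".join(out)
```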

## 3. Perform occluded inferences and generate the explanations

Next, multiple inferences are performed with different parts of the input occluded. The relevance matrices computed from them are stored in an .h5 file.

`probs_path` should be the path to the file generated in the previous step, and `explanations_path` the path where to store the explanation heatmaps (without the .h5 extension).

```bash
python /fbk-fairseq/examples/speech_to_text/generate_occlusion_explanation.py ${data_dir} \
    --gen-subset ${explain_tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 100000 \
    --num-workers 0 \
    --config-yaml ${explain_yaml_config} \
    --perturb-config ${occlusion_config} \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${model_path} \
    --original-probs ${probs_path} \
    --save-file ${explanations_path}
```

#### Example of `occlusion_config.yaml`

`occlusion_config` should be a .yaml file containing the parameters with which to perform the occlusions. Below is an example of how these files should be structured, with a set of values that can be used. More information on the meaning of each parameter can be found in the [perturbator](https://gitlab.fbk.eu/mt/fbk-fairseq/-/tree/internal_master/examples/speech_to_text/occlusion_explanation/perturbators) package.

```yaml
fbank_occlusion:
  category: slic_fbank_dynamic_segments
  p: 0.5
  n_segments: [2000, 2500, 3000]
  threshold_duration: 750
  n_masks: 20000
decoder_occlusion:
  category: discrete_embed
  p: 0.0
  no_position_occlusion: true
scorer: KL
```
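For intuition, the masking mechanics behind `fbank_occlusion` can be sketched with uniform random cell masks. This is a deliberate simplification: the `slic_fbank_dynamic_segments` category occludes SLIC-based time-frequency segments rather than independent cells, and `random_fbank_masks` is a hypothetical helper, not repository code.

```python
import numpy as np

def random_fbank_masks(n_masks, n_frames, n_bins, p=0.5, seed=0):
    """Generate binary masks over an (n_frames x n_bins) spectrogram.

    Each cell is kept with probability p and occluded otherwise;
    SPES instead masks SLIC-based segments, but the perturb-and-score
    mechanics are the same.
    """
    rng = np.random.default_rng(seed)
    return (rng.random((n_masks, n_frames, n_bins)) < p).astype(np.float32)

# A perturbed input is the element-wise product of the spectrogram
# with one mask: occluded_fbank = fbank * masks[i]
```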

## 4. Evaluate explanations

To evaluate the explanations, inference is performed with increasing percentages of the most relevant features occluded. The percentage is increased in steps of `--perc-interval`.

In the following script, `tsv_file` should be the original preprocessed data. The translation hypotheses obtained at each occlusion level are stored in `output_file`. All other arguments are the same as in previous steps.

In this step, the file passed as `model_yaml_config` should again contain the `bpe_tokenizer` field, since we return to beam search (as in step 1) rather than forced decoding (steps 2 and 3).

```bash
python /fbk-fairseq/fairseq_cli/generate.py ${data_dir} \
    --gen-subset ${tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 200000 \
    --config-yaml ${model_yaml_config} \
    --beam 5 \
    --max-source-positions 10000 \
    --max-target-positions 1000 \
    --task feature_attribution_evaluation_task \
    --aggregator sentence \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --explanation-path ${explanations_path} \
    --metric deletion \
    --normalizer single_mean_std paired_min_max \
    --perc-interval 5 \
    --path ${model_path} > ${output_file}
```

### Example Output File Structure

The `output_file` contains the hypotheses obtained with different levels of feature occlusion. Below is an example snippet of what the file might look like for sentence 739, with occlusions going from 0 to 20%. Lines starting with `D` contain the detokenized hypotheses output by the system (`T` is the reference, `H` the tokenized hypothesis with its score, and `P` the token-level scores).

```
T-739-2 he said fish
H-739-2 -0.7054908275604248 ▁F ant as tic .
D-739-2 -0.7054908275604248 Fantastic.
P-739-2 -3.0391 -0.6950 -0.0423 -0.0290 -0.2239 -0.2036
T-739-1 he said fish
H-739-1 -0.2201979160308838 ▁Some ▁fish .
D-739-1 -0.2201979160308838 Some fish.
P-739-1 -0.3025 -0.1083 -0.3261 -0.1439
T-739-0 he said fish
H-739-0 -0.25894302129745483 ▁It ' s ▁a ▁fish .
D-739-0 -0.25894302129745483 It's a fish.
P-739-0 -0.7121 -0.2012 -0.1306 -0.4108 -0.0699 -0.1558 -0.1321
```
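Hypotheses at each occlusion level can be pulled out of such a file with a few lines; `parse_hypotheses` is a hypothetical helper following the `D-<sentence>-<level>` naming shown above:

```python
def parse_hypotheses(lines):
    """Map (sentence_id, occlusion_level) -> detokenized hypothesis
    from fairseq-generate style output (D-<id>-<level> lines)."""
    hyps = {}
    for line in lines:
        if not line.startswith("D-"):
            continue  # skip T/H/P lines
        tag, _score, text = line.split(maxsplit=2)
        _, sent_id, level = tag.split("-")
        hyps[(int(sent_id), int(level))] = text.rstrip("\n")
    return hyps
```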

The script below calculates the AUC score: the area under the curve obtained by plotting the percentage of most relevant features occluded against the score of the output generated at that occlusion level. The metric used for scoring the output is defined by the variable `scorer`, which can be `wer`, `wer_max`, or `sacrebleu`. `output_file` should be the file generated by the script above, `reference_txt` a text file containing the reference sentences, and `figure_path` the path where to save the plot from which the AUC score is calculated.

```bash
python /fbk-fairseq/examples/speech_to_text/xai_metrics/auc_score.py \
    --reference ${reference_txt} \
    --output-path ${output_file} \
    --perc-step 5 \
    --scorer ${scorer} \
    --fig-path ${figure_path}
```
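For intuition, the area computation behind such a score can be sketched with the trapezoidal rule. `deletion_auc` and its normalization over the percentage range are illustrative assumptions, not the repository implementation:

```python
def deletion_auc(perc_occluded, scores):
    """Trapezoidal area under the (occlusion %, score) curve.

    perc_occluded: increasing percentages, e.g. [0, 5, 10, ...].
    scores: output quality at each percentage (e.g. SacreBLEU or WER).
    The area is normalized by the percentage range so values are
    comparable across different occlusion ranges.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(perc_occluded, scores),
                                  zip(perc_occluded[1:], scores[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0  # one trapezoid per step
    return area / (perc_occluded[-1] - perc_occluded[0])
```

For a quality metric such as SacreBLEU, a lower AUC indicates a better explanation: occluding the features the explanation marks as most relevant degrades the output faster.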

### Citation

```
@misc{fucci2024spesspectrogramperturbationexplainable,
  title={SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation},
  author={Dennis Fucci and Marco Gaido and Beatrice Savoldi and Matteo Negri and Mauro Cettolo and Luisa Bentivogli},
  year={2024},
  eprint={2411.01710},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.01710},
}
```
