
Commit 9e9c2c7

Lina Varella Conti authored and mgaido91 committed
[!215][RELEASE] SPES
# Which work do we release?
"SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation"

# What changes does this release refer to?
7f7699d1fae4650832d8733cbf54d9279d04da35 3701eaa55ab5b599b23a93df0556760d92da78fc ca14dac9548d0be2f2132924e5bdcbe0c4013285 4641fda1a20569fb388341de12ad164084752607 43172daab2819f0a7810e5daf3c50fbc734ec7d8 076dd532cb7d806c242db44d86c355f6342dd2df c60003bbefc11f7900994b132fb7325237cd2273 713868e03517ee55949a906a26e89d0949f078c0 35c8e41a2ecdf6935076c1d6508279fde3045449 5edbeefe03f3f36ec733edfff3e47f2f53ffcf63 221e192c4f391235b87db8a257109a7a61f82a5f e8692a74ad84ad02af9ccafc4bc1e843f280a2aa 153d6a472262d100b6961f6d42beb691cdb7d64b d319d2167b150c791f4e0b52a1aacf9a1b1d2024 138253d9776a640fe025fb3c257755b4617d360b abd64ee4855a2af20142408907e030c8558c5a4c 777ebafb800e4e6046ac99b87d5a0ec48f607428 644e119553f9c98f7a1c25c15d451b9275e6fffa 62458c83df0870957d4b77f3b80c65c53d06720f c5ccce6d0e24cb9321d22807313871d7804143fc 9fbea0a6d532195c0c6cf741269bb702d9b0d650 79bb24d851827331960f7eab865390d54fec60c6 7f233227560762bd88c343e74fd065c9bb89e3e3 c7889680c86b1472aff8498a8fa82d015a83fc75 96a8559d4f3ac89aee4c1910ecf2b2eabcae4ceb e5ff5114b30cd25a4483827ed626084722db6138 48f88052374399761737712eac63dca56690e5d9 6d94a866c2232026b0b536f4ecb01e4ba89d4dc8 12a298bcc00279f73aa8ce0a28cf718eed073fb3 b24bc8f54a26f7d34aed560922259d7589ec2211 b2be28cbee0c030af61915a34d1c32879b13b14b cd2df8732803f0585f7cef8a9b0d6d1b1e889fbd 4ddadaeaf31c8dbd98b82c639f31ad0bb7dbfc50 b02228eb9652819a2605a29285574c224eae7d53 a039fbdbf029d547aad3848555a49068bee6ae6a 019e89c7a9ed9fd42cd6a6f0a973b1334944b69f 3b8fd52a1f5ce04cd6cbaa42198f32a1d1a6d2c3 07db9c8987dd8c5fc8779a5dce787434d36dd695 c345d58f500b782bb2308cfb093e716aa4361de6
1 parent c44589e commit 9e9c2c7

File tree: 2 files changed, +180 −0 lines changed


README.md (+1)

@@ -9,6 +9,7 @@ Dedicated README for each work can be found in the `fbk_works` directory.

 ### 2024
+- [**SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation**](fbk_works/XAI_FEATURE_ATTRIBUTION.md)
 - [[IWSLT 2024] **SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation**](fbk_works/SIMULSEAMLESS.md)
 - [[ACL 2024] **StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection**](fbk_works/STREAMATT_STREAMLAAL.md)
 - [[ACL 2024] **SBAAM! Eliminating Transcript Dependency in Automatic Subtitling**](fbk_works/SBAAM.md)

fbk_works/XAI_FEATURE_ATTRIBUTION.md (+179)

@@ -0,0 +1,179 @@
# SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation

This README contains instructions to generate feature attribution explanations of the outputs of speech-to-text models using SPES, and to evaluate them.
SPES is introduced in [SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation](https://arxiv.org/abs/2411.01710).

The explanations consist of saliency maps of relevance scores for each value in the spectrogram representation of the input audio, as well as relevance scores for the previously generated text tokens.
To assign these relevance scores, SPES performs multiple forward passes on the same data with different parts of the input masked.
The impact of these occlusions is measured by comparing the original output probability distributions to those produced by the occluded inferences: the more the probability distribution changes, the more relevant the masked part of the input is considered to be.

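The perturb-and-compare idea can be sketched in a few lines. This is an illustrative simplification, not the repository implementation: `relevance_from_occlusions`, the array shapes, and the aggregation scheme are assumptions for the example; the KL-divergence comparison mirrors the `scorer: KL` option used later in this README.

```python
import numpy as np

def relevance_from_occlusions(p_orig, p_occ, masks):
    """Aggregate occlusion effects into a saliency map (illustrative sketch).

    p_orig: (T, V) original output distributions (T tokens, V vocab).
    p_occ:  (N, T, V) distributions from N occluded forward passes.
    masks:  (N, F, S) binary masks over the F x S spectrogram
            (1 = kept, 0 = occluded) used for each perturbed pass.
    Returns a (T, F, S) saliency map: one heatmap per output token.
    """
    eps = 1e-12
    # KL(p_orig || p_occ) per pass and token: how far each occlusion
    # pushes the model away from its reference distribution.
    kl = np.sum(p_orig[None] * (np.log(p_orig[None] + eps)
                                - np.log(p_occ + eps)), axis=-1)  # (N, T)
    occluded = 1.0 - masks  # 1 where the input was masked
    # Credit each occluded cell with the KL of its pass, averaged
    # over the passes in which that cell was occluded.
    return np.einsum('nt,nfs->tfs', kl, occluded) / (occluded.sum(axis=0) + eps)
```

Cells whose occlusion barely moves the distribution end up with relevance near zero, while cells whose occlusion reshuffles the output probabilities accumulate high scores.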
## 0. Preprocess the data

If the data is not preprocessed, follow the preprocessing steps of [Speechformer](https://gitlab.fbk.eu/mt/fbk-fairseq/-/blob/internal_master/fbk_works/SPEECHFORMER.md#preprocessing).

## 1. Perform a standard inference

The first step is a standard inference that saves the model's predictions. These predicted tokens are then used to apply forced decoding when performing the occluded inferences.

This can be done with the following script, where:
- `data_dir` is the directory where the tsv file containing the preprocessed data is stored;
- `tsv_file` is the name of that file (without the .tsv extension);
- `model_path` is the path to the fairseq model checkpoint to be used;
- `model_yaml_config` is the model's configuration file;
- `output_file` is the path where to store the standard output;
- `explanation_tsv` is the name of the output file that will contain the tokens predicted by the model.

In this and the following scripts, the argument `--max-tokens` should be adjusted based on the GPU's VRAM capacity.
```bash
python /fbk-fairseq/fairseq_cli/generate.py ${data_dir} \
    --gen-subset ${tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 40000 \
    --config-yaml ${model_yaml_config} \
    --beam 5 \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${model_path} > ${output_file}

# Save the tokenized translation hypotheses to a tab-separated file
python /fbk-fairseq/examples/speech_to_text/scripts/xai/prep_hyps_for_explanation.py \
    --model-output ${output_file} \
    --original-tsv ${data_dir}/${tsv_file}.tsv \
    --explain-tsv ${explanation_tsv}
```

## 2. Save output probabilities

The second step consists of running a standard inference and storing the output probability distributions. These stored distributions serve as the reference for computing relevance scores: relevance is determined by occluding different parts of the input, running "perturbed" inferences, and comparing their probability distributions to the reference ones. This comparison quantifies how much each perturbation affects the model's output.

In the following script, `explain_tsv_file` should be the name of the file generated in the previous step (without the .tsv extension), `data_dir` the directory where it is stored, and `output_file` the path where to store the probabilities (without the .h5 extension).
In this step, the configuration file (`explain_yaml_config`) is identical to `model_yaml_config` except that it omits the `bpe_tokenizer` field, since the target text used for forced decoding in `explain_tsv_file` is already tokenized.
All other variables are the same as in the previous step.

```bash
python /fbk-fairseq/examples/speech_to_text/get_probs_from_constrained_decoding.py ${data_dir} \
    --gen-subset ${explain_tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 10000 \
    --config-yaml ${explain_yaml_config} \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --path ${model_path} \
    --save-file ${output_file}
```
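Since `explain_yaml_config` is just `model_yaml_config` without the `bpe_tokenizer` field, it can be derived mechanically. A line-based sketch (the function name is illustrative and not part of the repository; it assumes `bpe_tokenizer` is a top-level key whose children are indented, as in fairseq S2T configs):

```python
def drop_top_level_key(yaml_text, key):
    """Remove a top-level YAML key and its indented block (line-based sketch)."""
    out, skipping = [], False
    for line in yaml_text.splitlines():
        if line.startswith(key + ":"):
            skipping = True  # start of the block to drop
            continue
        if skipping and line[:1] in (" ", "\t"):
            continue  # indented child of the removed key
        skipping = False
        out.append(line)
    return "\n".join(out)
```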

## 3. Perform occluded inferences and generate the explanations

Next, multiple inferences are performed with different parts of the input occluded. The relevance matrices computed from them are stored in an .h5 file.

`probs_path` should be the path to the file generated in the previous step, and `explanations_path` the path where to store the explanation heatmaps (without the .h5 extension).

```bash
python /fbk-fairseq/examples/speech_to_text/generate_occlusion_explanation.py ${data_dir} \
    --gen-subset ${explain_tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 100000 \
    --num-workers 0 \
    --config-yaml ${explain_yaml_config} \
    --perturb-config ${occlusion_config} \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --path ${model_path} \
    --original-probs ${probs_path} \
    --save-file ${explanations_path}
```

#### Example of `occlusion_config.yaml`

`occlusion_config` should be a .yaml file containing the parameters with which to perform the occlusions. Below is an example of how these files should be structured, with a set of values that can be used. More information on the meaning of each parameter can be found in the [perturbator](https://gitlab.fbk.eu/mt/fbk-fairseq/-/tree/internal_master/examples/speech_to_text/occlusion_explanation/perturbators) package.

```yaml
fbank_occlusion:
  category: slic_fbank_dynamic_segments
  p: 0.5
  n_segments: [2000, 2500, 3000]
  threshold_duration: 750
  n_masks: 20000
decoder_occlusion:
  category: discrete_embed
  p: 0.0
  no_position_occlusion: true
scorer: KL
```
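For intuition, the masking mechanics behind `fbank_occlusion` can be sketched with uniform random cell masks. This is a deliberate simplification: the `slic_fbank_dynamic_segments` category occludes SLIC-based time-frequency segments rather than independent cells, and `random_fbank_masks` is a hypothetical helper, not repository code.

```python
import numpy as np

def random_fbank_masks(n_masks, n_frames, n_bins, p=0.5, seed=0):
    """Generate binary masks over an (n_frames x n_bins) spectrogram.

    Each cell is kept with probability p and occluded otherwise;
    SPES instead masks SLIC-based segments, but the perturb-and-score
    mechanics are the same.
    """
    rng = np.random.default_rng(seed)
    return (rng.random((n_masks, n_frames, n_bins)) < p).astype(np.float32)

# A perturbed input is the element-wise product of the spectrogram
# with one mask: occluded_fbank = fbank * masks[i]
```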

## 4. Evaluate explanations

To evaluate the explanations, inference is performed with increasing percentages of the most relevant features occluded. The percentage is increased in steps of `--perc-interval`.

In the following script, `tsv_file` should be the original preprocessed data. The translation hypotheses obtained at each occlusion level are stored in `output_file`. All other arguments are the same as in previous steps.

In this step, the file passed as `model_yaml_config` should again contain the `bpe_tokenizer` field, since we return to beam search (as in step 1) rather than forced decoding (steps 2 and 3).

```bash
python /fbk-fairseq/fairseq_cli/generate.py ${data_dir} \
    --gen-subset ${tsv_file} \
    --user-dir examples/speech_to_text \
    --max-tokens 200000 \
    --config-yaml ${model_yaml_config} \
    --beam 5 \
    --max-source-positions 10000 \
    --max-target-positions 1000 \
    --task feature_attribution_evaluation_task \
    --aggregator sentence \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --no-repeat-ngram-size 5 \
    --explanation-path ${explanations_path} \
    --metric deletion \
    --normalizer single_mean_std paired_min_max \
    --perc-interval 5 \
    --path ${model_path} > ${output_file}
```

### Example Output File Structure

The `output_file` contains the hypotheses obtained with different levels of feature occlusion. Below is an example snippet of what the file might look like for sentence 739, with occlusions going from 0 to 20%. Lines starting with `D` contain the detokenized hypotheses output by the system (`T` is the reference, `H` the tokenized hypothesis with its score, and `P` the token-level scores).

```
T-739-2 he said fish
H-739-2 -0.7054908275604248 ▁F ant as tic .
D-739-2 -0.7054908275604248 Fantastic.
P-739-2 -3.0391 -0.6950 -0.0423 -0.0290 -0.2239 -0.2036
T-739-1 he said fish
H-739-1 -0.2201979160308838 ▁Some ▁fish .
D-739-1 -0.2201979160308838 Some fish.
P-739-1 -0.3025 -0.1083 -0.3261 -0.1439
T-739-0 he said fish
H-739-0 -0.25894302129745483 ▁It ' s ▁a ▁fish .
D-739-0 -0.25894302129745483 It's a fish.
P-739-0 -0.7121 -0.2012 -0.1306 -0.4108 -0.0699 -0.1558 -0.1321
```
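Hypotheses at each occlusion level can be pulled out of such a file with a few lines; `parse_hypotheses` is a hypothetical helper following the `D-<sentence>-<level>` naming shown above:

```python
def parse_hypotheses(lines):
    """Map (sentence_id, occlusion_level) -> detokenized hypothesis
    from fairseq-generate style output (D-<id>-<level> lines)."""
    hyps = {}
    for line in lines:
        if not line.startswith("D-"):
            continue  # skip T/H/P lines
        tag, _score, text = line.split(maxsplit=2)
        _, sent_id, level = tag.split("-")
        hyps[(int(sent_id), int(level))] = text.rstrip("\n")
    return hyps
```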

The script below calculates the AUC score: the area under the curve obtained by plotting the percentage of most relevant features occluded against the score of the output generated at that occlusion level. The metric used for scoring the output is defined by the variable `scorer`, which can be `wer`, `wer_max`, or `sacrebleu`. `output_file` should be the file generated by the script above, `reference_txt` a text file containing the reference sentences, and `figure_path` the path where to save the plot from which the AUC score is calculated.

```bash
python /fbk-fairseq/examples/speech_to_text/xai_metrics/auc_score.py \
    --reference ${reference_txt} \
    --output-path ${output_file} \
    --perc-step 5 \
    --scorer ${scorer} \
    --fig-path ${figure_path}
```
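For intuition, the area computation behind such a score can be sketched with the trapezoidal rule. `deletion_auc` and its normalization over the percentage range are illustrative assumptions, not the repository implementation:

```python
def deletion_auc(perc_occluded, scores):
    """Trapezoidal area under the (occlusion %, score) curve.

    perc_occluded: increasing percentages, e.g. [0, 5, 10, ...].
    scores: output quality at each percentage (e.g. SacreBLEU or WER).
    The area is normalized by the percentage range so values are
    comparable across different occlusion ranges.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(zip(perc_occluded, scores),
                                  zip(perc_occluded[1:], scores[1:])):
        area += (x1 - x0) * (y0 + y1) / 2.0  # one trapezoid per step
    return area / (perc_occluded[-1] - perc_occluded[0])
```

For a quality metric such as SacreBLEU, a lower AUC indicates a better explanation: occluding the features the explanation marks as most relevant degrades the output faster.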

### Citation

```
@misc{fucci2024spesspectrogramperturbationexplainable,
  title={SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation},
  author={Dennis Fucci and Marco Gaido and Beatrice Savoldi and Matteo Negri and Mauro Cettolo and Luisa Bentivogli},
  year={2024},
  eprint={2411.01710},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.01710},
}
```
