| 1 | +# How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena |
| 2 | + |
| 3 | +This README contains the instructions to replicate the training and evaluation of the models in the paper |
| 4 | +[How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena](https://arxiv.org/abs/2402.13208). |
| 5 | +In addition, we release the pre-trained models used in the paper. |
| 6 | + |
| 7 | + |
| 8 | +## Setup |
| 9 | +Clone this repository and install it as explained in the original [Fairseq(-py)](https://github.com/pytorch/fairseq). |
| 10 | +We used MuST-C for our experiments, so make sure to [download the corpus](https://mt.fbk.eu/must-c/).
| 11 | +Follow the [preprocessing steps of Speechformer](SPEECHFORMER.md#preprocessing) to preprocess the MuST-C data. |
| 12 | + |
| 13 | +## Pretrained models |
| 14 | + |
| 15 | +Below we release the dictionary/config files and the pre-trained checkpoints |
| 16 | +obtained in our experiments. |
| 17 | +The dictionary and config files are the same as those used for the Conformer baseline, |
| 18 | +whose checkpoints can be found [here](BUGFREE_CONFORMER.md#pretrained-models). |
| 19 | + |
| 20 | +### Common files
| 21 | +- Source dictionary SentencePiece model and fairseq dictionary: |
| 22 | +[srcdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EdAgeZdaw5BEjv6PUPEycvoBZHeOMqZ69ciEAIHM0XoBbw?e=t2z5G1), |
| 23 | +[srcdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EY6_YCFCDjxBlBvm2_8UQFEB9ehLmFoLiGj2r7GGe_pL0A?e=NhIhkz) |
| 24 | +- Target dictionary SentencePiece model and fairseq dictionary: |
| 25 | + - **en (ASR)**: same as srcdict.model and srcdict.txt |
| 26 | + - **en-de**: |
| 27 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Eamb-6DsnklHq-4CZOZA9nYBKZ0XXnz0UdeOb49UXYlLVQ?e=yroKIk), |
| 28 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EVOJ0yFgZZpEqvHUlzhjqOEBkV7U26iryO-bpobz_5q_fQ?e=i2gdi0) |
| 29 | + - **en-es**: |
| 30 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EWmh3csXbEVPmBSI7xeemVMBHqlSEDJHl3JmUOXzPRwCAA?e=T53pKl), |
| 31 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EduV9z-HroFOgh2xQjhdShIBmCs-6PmvgqkzPfcQmXsXdQ?e=iehKch) |
| 32 | + - **en-fr**: |
| 33 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EXQfn6DYxC1CskMO7lJMaxIB23Wa4xIWOtsX2SIukOOM9A?e=HyvZrB), |
| 34 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/ETV367Z8xJ1Egz9E_cKBdykB9iYgDdEj1xLKBLRTANWCUA?e=Y5CUky) |
| 35 | + - **en-it**: |
| 36 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EX_w-V-SN1dLkEEJWrXbK_UBxHQL0zJaJuzIM_ZzosICmg?e=Wf0VKk), |
| 37 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/ERAhMZjPoJNHkPWih7v0GfoBus4jG0WD3XPRmK5CgaV3wA?e=lG50Ny) |
| 38 | + - **en-nl**: |
| 39 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EZ8C2AySmHxLi7qDcf4PcvEBEg5tkVXK9jsB1t8v0F3Maw?e=6VCiwb), |
| 40 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EWvoJ9Lb97RGqaUaFgsWPlMBYgo9uTIxUUY6KidHnZErhw?e=986D7S) |
| 41 | + - **en-pt**: |
| 42 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EX9u-0PII8JKpnNensFj5ygBqVZrcPYoE8RWC8VryspzTg?e=2LjDH5), |
| 43 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EZ2TMRgLtudCuvXcsjCzOtkBjWVSdsof1LGmt9bOtQn9gg?e=boCBtQ) |
| 44 | + - **en-ro**: |
| 45 | + [tgtdict.model](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Ec_zzPD3sTtCkNmibsMUUQUBWQHxinzoNvSRCCx6c_JhzA?e=Q5pDs7), |
| 46 | + [tgtdict.txt](https://fbk-my.sharepoint.com/:t:/g/personal/mgaido_fbk_eu/EbkE3WFxh4lDiR7aB9wA6NoBaQIZnM6MnWscLKD-h5nMTw?e=QgoD95) |
| 47 | +- config yaml: |
| 48 | +```yaml
| 49 | +bpe_tokenizer: |
| 50 | + bpe: sentencepiece |
| 51 | + sentencepiece_model: tgtdict.model |
| 52 | +bpe_tokenizer_src: |
| 53 | + bpe: sentencepiece |
| 54 | + sentencepiece_model: srcdict.model |
| 55 | +input_channels: 1 |
| 56 | +input_feat_per_channel: 80 |
| 57 | +sampling_alpha: 1.0 |
| 58 | +specaugment: |
| 59 | + freq_mask_F: 27 |
| 60 | + freq_mask_N: 1 |
| 61 | + time_mask_N: 1 |
| 62 | + time_mask_T: 100 |
| 63 | + time_mask_p: 1.0 |
| 64 | + time_wrap_W: 0 |
| 65 | +transforms: |
| 66 | + '*': |
| 67 | + - utterance_cmvn |
| 68 | + _train: |
| 69 | + - utterance_cmvn |
| 70 | + - specaugment |
| 71 | +vocab_filename: tgtdict.txt |
| 72 | +vocab_filename_src: srcdict.txt |
| 73 | +``` |
| 74 | +### Checkpoints |
| 75 | +| Model | en (ASR) | en-de | en-es | en-fr | en-it | en-nl | en-pt | en-ro | |
| 76 | +|--------------------|------------|------------|------------|-------|-------|-------|-------|-------| |
| 77 | +| ConfHyena | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EU6Bhy_jGQxJm9fIS3DsJmwBxd-tBl5HsQBM2OCbvu5gQQ?e=cORIdz) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/ETuTNLx7_hNAooQ_U5yQh1oB3zae2fls2xv-K4enmCBMRw?e=2ENGAV) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EXWYMvNOEINMgeKStlW0peABybfiOIcOpInjpbFw3cRUBw?e=JyPdry) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Eb3pv7C6zvJIqkH2nPa9w4YBvvO74khSX7s_uo6D_p7fzg?e=t7NypZ) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Ee7Gvuo2iRJHsr2M_9G4KHQBkgrRkCmwCy5kS9jMlJVP6A?e=lbVxTr) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/ERpXg6Cbe3pDlL1gzGCoe7UBHcpCLw2JQXQKtK1vF05NGg?e=RSsHDJ) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/ETXG5TySDzZLmaWkaPbFMXgBIoxE3n54I-pclaRsmQQedg?e=JNdKaE) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EdR34a0DMMhGiRsIpGCxcQABIbjbICogJaTKZXOtGQa14w?e=MYRU3N) | |
| 78 | +| - non-causal Hyena | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Efpj8KHH9oJDm6bPAJSdDNkB_JRcsmcxXC4ciaPE0U3kgg?e=yfGbhq) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EYHfCm2e4PBAoE-0jHkEm2MB1Wr-qBZAEaeAWJBUXl30Lg?e=ZuKaon) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EVlb_DmkG8VCg2JrddtHGOoB9be1IDpB2Q0aQavIe6hoAw?e=aL3SWY) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EawfiABptKBErrLwYJ5fjdUBAVSVv1gsWU-jwWlgj8qt_A?e=hnU9HB) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EQZSTWx_O8RFhFICdyE8swkBsrCmwkA0LouzRnX4cF7wHQ?e=0Ha4zB) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EXTo4vC_hMtAgijZ1TE7RWABZgwfI4wuXrZvlcHI_ah7Lg?e=3Baczg) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EUf-qg9uf_VBgAZDS5OW3DIBRK8gkxts-Ku067r00bb1VQ?e=2uNgZj) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Ea8i1u8KIldLreB5531Fno8BCEpg7qiiHG2lCE8cE8qZXA?e=pTXNCC) | |
| 79 | +| Hybrid ConfHyena | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EbF2LjOz1MtLnX1gCHTQjsEBgLn_EAhKypyIDhu3Y7nuFQ?e=ZhFRyF) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/ETmOsR9Ie6hOrhM50B6wzioBvWuSLo6g55e_qIp88W13qQ?e=W5eK79) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EcYJBCcrJaNApvOlvWDnRSEBtue-fzMIYpISwMWqdRCPSQ?e=gWXrK5) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EfSjqbL1CwZOkquHsvwZnpQBswt469ymSW3uL_q8ro5xlg?e=GMHKPO) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EQXC8NAnPldElP_0WduGtGYB2lhKCCy-tOQQDBfeQMvC4A?e=0T63hZ) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EZin2xLqqLFBkFYPyh0X1rcBCFbdvB-Dpr567adjGkrpSQ?e=57imQ7) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EVjU7IhkWB5Dq7M09SzWpqABn18U_GbSGdj4biJoNWCaJw?e=vQpsEh) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EYROvhAPNTxEn9WDHgpIgPEBsKhWUWYTpEfydwFV9AXDIw?e=oaet0d) | |
| 80 | +| - non-causal Hyena | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EbNqXhyaUGVFheZ3FExAloEBPEZOG2jlpJv8ynnYnYpf2g?e=qe87Zq) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Ef5HXS1LJvxNvYHv-bp-cNUBZ4DDGdWBAL_iBQNpl6JbcA?e=DX0ItZ) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EWQ_V6szbMdPp149zGa8tuoBLnN-nZ0tVnYc3ymBb9Ddcg?e=BByutz) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EVsXeLu_VkFKndxmUAShl1kB7ANPmdw19QOA87RUBP-TcQ?e=q8Royw) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EbTZHMRnb_BJobUxSK0dFScB3FD1_IvVcLvyfnIWFy6lPg?e=mqh2wK) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/Ef_gZRguYWJEmzyIMn9bzIUBzGgCt-lwb_5FPCSrUHv03A?e=LssC98) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EVhOSMYNNqlJibkYt85laRoBvwNrNzvXCAOX_CJYX13_MQ?e=9KfTZJ) | [ckp.pt](https://fbk-my.sharepoint.com/:u:/g/personal/mgaido_fbk_eu/EZ64nTKeOvhBmNzZgyx7LV8BJKJN0Qx0psoqLYaJ7lzPlg?e=Yq8tAv) | |
| 81 | + |
| 82 | + |
| 83 | + |
| 84 | +## Training |
| 85 | + |
| 86 | +For the Conformer baseline, please refer to the [bug-free Conformer README](BUGFREE_CONFORMER.md). |
| 87 | + |
| 88 | +For the Hybrid ConfHyena models, our training has been executed with the following commands. |
| 89 | + |
| 90 | + |
| 91 | +```bash |
| 92 | +LANG=$1 |
| 93 | +MUSTC_ROOT=$2 |
| 94 | +TASK=$3 |
| 95 | +SAVE_DIR=$4 |
| 96 | + |
| 97 | +mkdir -p $SAVE_DIR |
| 98 | + |
| 99 | +python ${FBK_fairseq}/train.py ${MUSTC_ROOT} \ |
| 100 | + --train-subset train_${TASK}_src --valid-subset dev_${TASK}_src \ |
| 101 | + --user-dir examples/speech_to_text --seed 1 \ |
| 102 | + --num-workers 2 --max-update 100000 --patience 10 --keep-last-epochs 12 \ |
| 103 | + --max-tokens 40000 --update-freq 4 \ |
| 104 | + --task speech_to_text_ctc --config-yaml config.yaml \ |
| 105 | + --criterion ctc_multi_loss \ |
| 106 | + --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ |
| 107 | + --arch confhyena --conformer-after-compression --stride 2 \ |
| 108 | + --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \ |
| 109 | + --optimizer adam --adam-betas '(0.9, 0.98)' \ |
| 110 | + --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 25000 \ |
| 111 | + --clip-norm 10.0 \ |
| 112 | + --skip-invalid-size-inputs-valid-test \ |
| 113 | + --save-dir ${SAVE_DIR} \ |
| 114 | + --log-format simple > $SAVE_DIR/train.log 2> $SAVE_DIR/train.err |
| 115 | + |
| 116 | +python ${FBK_fairseq}/scripts/average_checkpoints.py \ |
| 117 | + --input $SAVE_DIR --num-epoch-checkpoints 5 \ |
| 118 | + --checkpoint-upper-bound $(ls $SAVE_DIR | head -n 5 | tail -n 1 | grep -o "[0-9]*") \ |
| 119 | + --output $SAVE_DIR/avg5.pt |
| 120 | + |
| 121 | +if [ -f $SAVE_DIR/avg5.pt ]; then |
| 122 | + rm $SAVE_DIR/checkpoint??.pt |
| 123 | +fi |
| 124 | +``` |
| 125 | + |
| 126 | +The ConfHyena models can be obtained by removing the `--conformer-after-compression` parameter. |
| 127 | + |
| 128 | + |
| 129 | +The causal version of the two architectures (`- non-causal Hyena` in the paper and in the table above)
| 130 | +can be obtained by adding the parameter `--hyena-causal` to the command.
| 131 | + |
| 132 | +The command is meant to be executed on 2 A100 GPUs with 40GB VRAM. |
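
With the flags above, each update sees `--max-tokens 40000` × `--update-freq 4` × 2 GPUs = 320,000 tokens. The sketch below shows one way to pick `--update-freq` on a different number of GPUs, assuming the goal is to keep that product roughly constant and that the chosen `--max-tokens` still fits in memory. The helper name `update_freq_for` is hypothetical and not part of the repository.

```bash
# Hypothetical helper: compute an --update-freq value that keeps
# max_tokens * update_freq * num_gpus at least equal to the reference setup
# (40000 * 4 * 2 = 320000).
# $1 = number of GPUs, $2 = max tokens per GPU (default: 40000).
update_freq_for() {
    local num_gpus="$1"
    local max_tokens="${2:-40000}"
    local target=$((40000 * 4 * 2))
    local per_step=$((num_gpus * max_tokens))
    # Round up so the effective batch is never smaller than the reference.
    echo $(( (target + per_step - 1) / per_step ))
}
```

For example, `update_freq_for 1` suggests the value to pass as `--update-freq` when training on a single GPU with the same `--max-tokens`.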
| 133 | + |
| 134 | + |
| 135 | +## Evaluation |
| 136 | +Once you have downloaded the pretrained checkpoints and the related config/dictionaries,
| 137 | +generate the output with: |
| 138 | +```bash |
| 139 | +python ${FBK_fairseq}/fairseq_cli/generate.py ${MUSTC_ROOT} \ |
| 140 | + --user-dir examples/speech_to_text \ |
| 141 | + --config-yaml config.yaml --gen-subset tst-COMMON_st_src \ |
| 142 | + --max-source-positions 10000 --max-target-positions 1000 \ |
| 143 | + --task speech_to_text_ctc \ |
| 144 | + --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \ |
| 145 | + --beam 5 --no-repeat-ngram-size 5 --path ${PRETRAINED_CHECKPOINT} > ${OUTPUT_FILE} |
| 146 | +``` |
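
The file written by `generate.py` interleaves several line types for each utterance. The sketch below pulls out only the hypotheses, assuming the usual fairseq layout: tab-separated lines starting with `H-<id>`, with the score in the second field and the text in the third. The helper name `extract_hypotheses` is hypothetical. Some fairseq versions also emit `D-` lines with detokenized text, so check your output file and pass the prefix that applies.

```bash
# Hypothetical helper: print the hypotheses found in a fairseq generate
# output file, ordered by sample id.
# $1 = output file, $2 = line prefix to select (default: H).
extract_hypotheses() {
    local output_file="$1"
    local prefix="${2:-H}"
    # Keep only the selected lines, drop the two-character prefix,
    # sort numerically on the sample id, and keep the text field.
    grep "^${prefix}-" "$output_file" \
        | cut -c3- \
        | sort -t "$(printf '\t')" -k1,1n \
        | cut -f3
}
```

For instance, `extract_hypotheses "${OUTPUT_FILE}" > hyp.txt` writes one hypothesis per line, in the order of the sample ids, ready to be scored against the references.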
| 147 | + |
| 148 | +## Citation |
| 149 | +```bibtex |
| 150 | +@inproceedings{gaido-et-al-2024-hyena, |
| 151 | + title={{How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena}}, |
| 152 | + author={Marco Gaido and Sara Papi and Matteo Negri and Luisa Bentivogli}, |
| 153 | + year={2024}, |
| 154 | + address="Turin, Italy", |
| 155 | + booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, |
| 156 | +} |
| 157 | +``` |