Abstract / Description of output
We propose a pretraining method that uses a Self-Supervised Speech (SSS) model to create a more compact speech-to-text translation model. In contrast to using the SSS model for initialisation, our method is better suited to memory-constrained scenarios such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data, respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization at inference time and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method yields consistent improvements over the baseline on three metrics while being compact, i.e., only half the size of the SSS model.
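The staged pipeline described in the abstract can be sketched in plain Python. This is a minimal illustration of which components are trained and transferred, assuming hypothetical function and model names throughout; it is not the authors' implementation, and real models would be neural encoder-decoders rather than dictionaries:

```python
# Sketch of the three-stage pipeline from the abstract.
# All names here are illustrative assumptions, not the authors' code.

def pretrain_fbk_to_dsu(fbk_dsu_pairs):
    """Stage 1a: pretrain a small encoder-decoder on Filterbank -> DSU data."""
    return {"encoder": "fbk_encoder", "decoder": "dsu_decoder"}

def pretrain_dsu_to_trl(dsu_trl_pairs):
    """Stage 1b: pretrain a small encoder-decoder on DSU -> Translation data."""
    return {"encoder": "dsu_encoder", "decoder": "trl_decoder"}

def build_compact_model(fbk_dsu_model, dsu_trl_model):
    """Stage 2: initialise the compact model from the Fbk-to-DSU encoder and
    the DSU-to-Trl decoder. The DSU-facing halves are discarded, so inference
    runs directly on filterbank features with no speech discretization step."""
    return {"encoder": fbk_dsu_model["encoder"],
            "decoder": dsu_trl_model["decoder"]}

def finetune(compact_model, fbk_trl_pairs):
    """Stage 3: finetune the compact model on paired Filterbank-Translation data."""
    compact_model["finetuned"] = True
    return compact_model

# Neither pretraining stage consumes transcripts, matching the abstract's
# claim that the method is applicable to low-resource settings.
fbk_dsu = pretrain_fbk_to_dsu([])
dsu_trl = pretrain_dsu_to_trl([])
compact = finetune(build_compact_model(fbk_dsu, dsu_trl), [])
```

The key design point the sketch highlights is that the DSU act only as an intermediate training target: they connect the two pretraining stages but are absent from the final compact model and from inference.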
Original language | English |
---|---|
Title of host publication | Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024) |
Editors | Elizabeth Salesky, Marcello Federico, Marine Carpuat |
Publisher | Association for Computational Linguistics |
Pages | 114–124 |
Number of pages | 11 |
ISBN (Electronic) | 9798891761414 |
DOIs | |
Publication status | Published - 16 Aug 2024 |
Event | The International Conference on Spoken Language Translation 2024, Bangkok, Thailand. Duration: 15 Aug 2024 → 16 Aug 2024 |
Conference
Conference | The International Conference on Spoken Language Translation 2024 |
---|---|
Abbreviated title | IWSLT 2024 |
Country/Territory | Thailand |
City | Bangkok |
Period | 15/08/24 → 16/08/24 |