Abstract
Currently, there is a lack of a straightforward implementation of diarization-augmented speech transcription (DAST), ie. implementation of transcription, diarization and alignment to the audio within one model. These tasks typically require distinct models, necessitating to stack them together for complete processing. In this study, we advocate for leveraging the advanced capabilities of the Whisper models, which already excels in automatic transcription and partial alignment. Our approach involves fine-tuning the model’s parameters on both transcription and diarization tasks in a SOT-FIFO (Serialized Output Training- First In First Out) manner. This comprehensive framework facilitates the creation of orthographic transcriptions, identification of speakers, and precise alignment, thus enhancing the efficiency of audio processing workflows. While our work represents an initial step towards a unified transcription and diarization framework, the development of such a model demands substantial high-quality data augmentation and computational resources beyond our current scope. Consequently, our focus is narrowed to the English language. Despite these limitations, our method demonstrates promising performance in both transcription and diarization tasks. Comparative analysis between pre-trained models and fine-tuned TAD (Transcription, Alignment, Diarization) versions suggests that incorporating diarization into a Whisper model doesn’t compromise transcription accuracy. Our findings hint that deploying our TAD framework on the largest Whisper model could potentially yield state-of-the-art performance across all mentioned tasks.
Original language | English |
---|---|
Pages (from-to) | 33-38 |
Number of pages | 6 |
Journal | Proceedings of the International Conference Computational Linguistics in Bulgaria |
Publication status | Published - 2024 |
Externally published | Yes |
Event | 6th International Conference on Computational Linguistics in Bulgaria, CLIB 2024 - Sofia, Bulgaria Duration: 9 Sept 2024 → 10 Sept 2024 |
Keywords
- automatic speech recognition
- Diarization
- Whisper