Title: A Custom ASR Model to Transcribe the Speech of Tai Situ Rinpoche
Introduction
We have a base Speech-to-Text (STT) model trained on diverse data sources, including audiobook recordings, news audio, podcasts, movies, natural conversations, and children’s speech. While the base model performs well across these domains, its accuracy drops when transcribing large volumes of single-speaker audio, such as the hundreds of hours of recordings of Tai Situ Rinpoche. We hypothesize that fine-tuning the base model on a small annotated set of Tai Situ Rinpoche’s speech will significantly improve transcription accuracy for his recordings compared to the unmodified base model.
Background
Single-speaker datasets often present unique challenges for general-purpose STT models, resulting in higher Character Error Rates (CER) due to differences in speech patterns, pronunciation, and intonation. Similar to how accent-specific fine-tuning enhances performance for regional accents, fine-tuning the base model on Tai Situ Rinpoche’s speech data can bridge the performance gap. This approach could also be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.
Methodology
- Annotation:
  - A few hours of Tai Situ Rinpoche’s speech were annotated by our expert annotators.
- Data Preprocessing:
  - Filtered out non-Tibetan transcripts and low-quality audio recordings to ensure data relevance and integrity.
- Dataset Splitting:
  - Split the processed data into training, validation, and test sets in an 8:1:1 ratio (a sketch of the filtering and splitting steps follows this list).
- Model Fine-Tuning:
  - Fine-tuned the base model on the annotated dataset using the parameters and configuration listed under Training Details.
- Evaluation:
  - Compared the performance of the base and fine-tuned models on a test set of unseen data.
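The filtering and splitting steps referenced above can be expressed compactly. The sketch below is a minimal illustration rather than the production pipeline: it assumes the annotated data is held in a Hugging Face datasets.Dataset with audio, transcript, and duration columns, detects Tibetan text via the Tibetan Unicode block, and uses illustrative duration thresholds; these column names and thresholds are assumptions, not the exact filters used.

```python
import re
from datasets import Dataset, DatasetDict

# Assumed schema: each row has "audio", "transcript", and "duration" (seconds).
TIBETAN_CHARS = re.compile(r"[\u0F00-\u0FFF]")  # Tibetan Unicode block

def keep_sample(example):
    """Keep rows whose transcript is mostly Tibetan and whose length is plausible."""
    text = example["transcript"].strip()
    tibetan_ratio = len(TIBETAN_CHARS.findall(text)) / max(len(text), 1)
    duration_ok = 1.0 <= example["duration"] <= 30.0  # illustrative bounds
    return tibetan_ratio > 0.5 and duration_ok

def split_8_1_1(dataset: Dataset, seed: int = 42) -> DatasetDict:
    """Filter, then split into train/validation/test in an 8:1:1 ratio."""
    filtered = dataset.filter(keep_sample)
    train_rest = filtered.train_test_split(test_size=0.2, seed=seed)
    val_test = train_rest["test"].train_test_split(test_size=0.5, seed=seed)
    return DatasetDict(
        train=train_rest["train"],
        validation=val_test["train"],
        test=val_test["test"],
    )
```

Splitting at the utterance level is the simplest option here; since all recordings come from a single speaker, there is no speaker leakage to guard against, although near-duplicate utterances would still need to be kept within one split.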
Training Details
- Dataset Splits:
  - Training: 1.1 hours (838 samples)
  - Validation and Test: 0.10 hours (105 samples) each
- Training Parameters (see the sketch after this list):
  - per_device_train_batch_size=8
  - gradient_accumulation_steps=1
  - evaluation_strategy="steps"
  - save_steps=500
  - eval_steps=50
  - logging_steps=50
  - learning_rate=1e-6
  - num_train_epochs=200
  - fp16=True
- Hardware and Duration:
  - GPU: 1x RTX 4090 (24 GB)
  - Training Time: 3 hours
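The parameters listed above correspond to Hugging Face TrainingArguments. The sketch below shows how they would be assembled; the output directory name is an assumption, and the model, data collator, splits, and CER metric function are placeholders for whichever base architecture is actually used (newer transformers releases rename evaluation_strategy to eval_strategy).

```python
from transformers import TrainingArguments

# The hyperparameters listed above, expressed as Hugging Face TrainingArguments.
training_args = TrainingArguments(
    output_dir="situ-rinpoche-finetune",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",
    save_steps=500,
    eval_steps=50,
    logging_steps=50,
    learning_rate=1e-6,
    num_train_epochs=200,
    fp16=True,
)

# These arguments would be passed to a Trainer along with the base model,
# a data collator, the train/validation splits, and a CER metric, e.g.:
#   trainer = Trainer(model=model, args=training_args, data_collator=collator,
#                     train_dataset=splits["train"], eval_dataset=splits["validation"],
#                     compute_metrics=compute_cer)
#   trainer.train()
```

As a consistency check, 838 training samples at a batch size of 8 is about 105 optimizer steps per epoch, so 200 epochs comes to roughly 21,000 steps, which matches the checkpoint numbers reported under Results.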
Results
- Base Model Performance:
  - CER: 9.78%
- Fine-Tuned Model Performance:
  - Checkpoint 21000: CER: 7.93%
  - Checkpoint 17500: CER: 7.97%
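Character Error Rate is the character-level edit distance between a hypothesis transcript and its reference, divided by the reference length and aggregated over the test set. The sketch below shows how both models could be scored on the same held-out split using the evaluate library’s cer metric; the transcribe callable and the column names are assumptions about the inference wrapper, not the actual interface.

```python
import evaluate

# CER = character-level edit distance / reference length (aggregated over the set).
cer_metric = evaluate.load("cer")

def score_model(transcribe, test_set):
    """Score a model on the test split.

    `transcribe` is any callable mapping an audio example to a hypothesis string;
    the "audio" and "transcript" column names are illustrative assumptions.
    """
    predictions = [transcribe(example["audio"]) for example in test_set]
    references = [example["transcript"] for example in test_set]
    return cer_metric.compute(predictions=predictions, references=references)

# Example comparison on the same unseen test split:
#   base_cer  = score_model(base_transcribe, splits["test"])   # reported: 0.0978
#   tuned_cer = score_model(tuned_transcribe, splits["test"])  # reported: 0.0793
```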
Discussion
Fine-tuning the base model on Tai Situ Rinpoche’s speech data led to a significant reduction in CER, confirming our hypothesis. The results indicate that personalized STT models are more effective for single-speaker datasets. While the performance improvement is substantial, further enhancements can be achieved by collecting additional annotated data and experimenting with advanced fine-tuning techniques. Additionally, this approach sets the stage for fine-tuning models for other speakers, potentially automating large-scale transcription tasks while reducing manual effort.
Conclusion
This experiment demonstrates the effectiveness of fine-tuning a base Speech-to-Text (STT) model on a small annotated dataset, yielding a clear improvement in transcription accuracy for Tai Situ Rinpoche’s speech. Our hypothesis was validated by the reduction in Character Error Rate (CER): the base model achieved a CER of 9.78%, while the fine-tuned model reached 7.93% at Checkpoint 21000, an absolute reduction of 1.85 percentage points and a relative improvement of roughly 19%. This confirms that speaker-specific fine-tuning is highly beneficial for transcription accuracy, particularly for distinctive speech patterns such as Tai Situ Rinpoche’s.