Title: A Custom ASR Model to Transcribe the Speech of Tai Situ Rinpoche
Introduction
We have a base Speech-to-Text (STT) model trained on diverse data sources, including audiobook recordings, news audio, podcasts, movies, natural conversations, and children’s speech. While the base model performs well across these domains, its accuracy drops when transcribing large volumes of single-speaker audio, such as the hundreds of hours of recordings of Tai Situ Rinpoche. We hypothesize that fine-tuning the base model on a small annotated set of Tai Situ Rinpoche’s speech will significantly improve transcription accuracy for his recordings compared to the unmodified base model.
Background
Single-speaker datasets often present unique challenges for general-purpose STT models, resulting in higher Character Error Rates (CER) due to differences in speech patterns, pronunciation, and intonation. Similar to how accent-specific fine-tuning enhances performance for regional accents, fine-tuning the base model on Tai Situ Rinpoche’s speech data can bridge the performance gap. This approach could also be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.
Methodology
- Annotation:
  - A few hours of Tai Situ Rinpoche’s speech were annotated by our expert annotators.
- Data Preprocessing:
  - Filtered out non-Tibetan transcripts and low-quality audio recordings to ensure data relevance and integrity.
- Dataset Splitting:
  - Split the processed data into training, validation, and test sets in an 8:1:1 ratio (a sketch of the filtering and splitting steps follows this list).
- Model Fine-Tuning:
  - Fine-tuned the base model on the annotated dataset using the parameters and configuration listed under Training Details.
- Evaluation:
  - Compared the performance of the base and fine-tuned models on a test set of unseen data.
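The filtering and splitting steps referenced above can be expressed compactly. The sketch below is a minimal illustration rather than the production pipeline: it assumes the annotated data is held in a Hugging Face datasets.Dataset with audio, transcript, and duration columns, detects Tibetan text via the Tibetan Unicode block, and uses illustrative duration thresholds; these column names and thresholds are assumptions, not the exact filters used.

```python
import re
from datasets import Dataset, DatasetDict

# Assumed schema: each row has "audio", "transcript", and "duration" (seconds).
TIBETAN_CHARS = re.compile(r"[\u0F00-\u0FFF]")  # Tibetan Unicode block

def keep_sample(example):
    """Keep rows whose transcript is mostly Tibetan and whose length is plausible."""
    text = example["transcript"].strip()
    tibetan_ratio = len(TIBETAN_CHARS.findall(text)) / max(len(text), 1)
    duration_ok = 1.0 <= example["duration"] <= 30.0  # illustrative bounds
    return tibetan_ratio > 0.5 and duration_ok

def split_8_1_1(dataset: Dataset, seed: int = 42) -> DatasetDict:
    """Filter, then split into train/validation/test in an 8:1:1 ratio."""
    filtered = dataset.filter(keep_sample)
    train_rest = filtered.train_test_split(test_size=0.2, seed=seed)
    val_test = train_rest["test"].train_test_split(test_size=0.5, seed=seed)
    return DatasetDict(
        train=train_rest["train"],
        validation=val_test["train"],
        test=val_test["test"],
    )
```

Splitting at the utterance level is the simplest option here; since all recordings come from a single speaker, there is no speaker leakage to guard against, although near-duplicate utterances would still need to be kept within one split.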
Training Details
- Dataset Splits:
  - Training: 1.1 hours (838 samples)
  - Validation and Test: 0.10 hours (105 samples) each
- Training Parameters (see the sketch after this list):
  - per_device_train_batch_size=8
  - gradient_accumulation_steps=1
  - evaluation_strategy="steps"
  - save_steps=500
  - eval_steps=50
  - logging_steps=50
  - learning_rate=1e-6
  - num_train_epochs=200
  - fp16=True
- Hardware and Duration:
  - GPU: 1x RTX 4090 (24 GB)
  - Training Time: 3 hours
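The parameters listed above correspond to Hugging Face TrainingArguments. The sketch below shows how they would be assembled; the output directory name is an assumption, and the model, data collator, splits, and CER metric function are placeholders for whichever base architecture is actually used (newer transformers releases rename evaluation_strategy to eval_strategy).

```python
from transformers import TrainingArguments

# The hyperparameters listed above, expressed as Hugging Face TrainingArguments.
training_args = TrainingArguments(
    output_dir="situ-rinpoche-finetune",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    evaluation_strategy="steps",
    save_steps=500,
    eval_steps=50,
    logging_steps=50,
    learning_rate=1e-6,
    num_train_epochs=200,
    fp16=True,
)

# These arguments would be passed to a Trainer along with the base model,
# a data collator, the train/validation splits, and a CER metric, e.g.:
#   trainer = Trainer(model=model, args=training_args, data_collator=collator,
#                     train_dataset=splits["train"], eval_dataset=splits["validation"],
#                     compute_metrics=compute_cer)
#   trainer.train()
```

As a consistency check, 838 training samples at a batch size of 8 is about 105 optimizer steps per epoch, so 200 epochs comes to roughly 21,000 steps, which matches the checkpoint numbers reported under Results.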
Results
- Base Model Performance:
  - CER: 9.78%
- Fine-Tuned Model Performance:
  - Checkpoint 21000: CER: 7.93%
  - Checkpoint 17500: CER: 7.97%
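Character Error Rate is the character-level edit distance between a hypothesis transcript and its reference, divided by the reference length and aggregated over the test set. The sketch below shows how both models could be scored on the same held-out split using the evaluate library’s cer metric; the transcribe callable and the column names are assumptions about the inference wrapper, not the actual interface.

```python
import evaluate

# CER = character-level edit distance / reference length (aggregated over the set).
cer_metric = evaluate.load("cer")

def score_model(transcribe, test_set):
    """Score a model on the test split.

    `transcribe` is any callable mapping an audio example to a hypothesis string;
    the "audio" and "transcript" column names are illustrative assumptions.
    """
    predictions = [transcribe(example["audio"]) for example in test_set]
    references = [example["transcript"] for example in test_set]
    return cer_metric.compute(predictions=predictions, references=references)

# Example comparison on the same unseen test split:
#   base_cer  = score_model(base_transcribe, splits["test"])   # reported: 0.0978
#   tuned_cer = score_model(tuned_transcribe, splits["test"])  # reported: 0.0793
```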
Discussion
Fine-tuning the base model on Tai Situ Rinpoche’s speech data led to a significant reduction in CER, confirming our hypothesis. The results indicate that personalized STT models are more effective for single-speaker datasets. While the performance improvement is substantial, further enhancements can be achieved by collecting additional annotated data and experimenting with advanced fine-tuning techniques. Additionally, this approach sets the stage for fine-tuning models for other speakers, potentially automating large-scale transcription tasks while reducing manual effort.
Conclusion
This experiment demonstrates the effectiveness of fine-tuning a base Speech-to-Text (STT) model on a small annotated dataset, yielding a clear improvement in transcription accuracy for Tai Situ Rinpoche’s speech. Our hypothesis was validated by the reduction in Character Error Rate (CER): the base model achieved a CER of 9.78%, while the fine-tuned model reached 7.93% at Checkpoint 21000, an absolute reduction of 1.85 percentage points and a relative improvement of roughly 19%. This confirms that speaker-specific fine-tuning is highly beneficial for transcription accuracy, particularly for distinctive speech patterns such as Tai Situ Rinpoche’s.