Title: A Custom ASR Model to Transcribe the Speech of Kabjye Dilgo Khyentse Rinpoche
Introduction
We have a base Speech-to-Text (STT) model trained on a diverse range of audio data, including general conversations, podcasts, and other varied speech recordings. While the base model provides good transcription results across these general domains, it faces challenges when transcribing audio from specific speakers, especially those with unique accents or speech patterns, like Kabjye Dilgo Khyentse Rinpoche. In this experiment, we aim to fine-tune the base model on a small set of annotated speech data from Dilgo Khyentse Rinpoche and evaluate its performance on his recordings. Our hypothesis is that this fine-tuning will improve transcription accuracy for his voice compared to the baseline model.
Background
General-purpose ASR models are typically trained on diverse data to handle a wide range of speakers and contexts. However, they tend to underperform on single-speaker datasets, especially those with distinct accents, speech cadences, or pronunciations. Fine-tuning on a speaker’s own data can help the model adapt to these particularities and improve performance significantly. This experiment explores the impact of fine-tuning the base model on the unique speech patterns of Kabjye Dilgo Khyentse Rinpoche, a Tibetan spiritual leader.
Methodology
- Dataset Preparation:
- We utilized the Dilgo Khyentse Rinpoche dataset, which includes audio recordings of Rinpoche’s speeches and their corresponding transcripts.
- Data Preprocessing:
- We preprocessed the audio files and text, ensuring they were properly aligned and of good quality. Non-Tibetan or low-quality audio was filtered out to maintain data integrity.
- Dataset Splitting:
- The dataset was split into training, validation, and test sets in an 8:1:1 ratio, providing a solid foundation for training and evaluation (a minimal sketch of the preparation and split follows this list).
- Model Fine-Tuning:
- We fine-tuned the wav2vec2 model, starting from a pre-trained checkpoint, on this speaker-specific dataset (the training setup is sketched after the Training Details section).
- Evaluation:
- The performance of both the base and fine-tuned models was evaluated on the test set, which consists of unseen audio data from Kabjye Dilgo Khyentse Rinpoche (a minimal CER computation is sketched after the Results section).
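To make the pipeline concrete, here is a minimal sketch of how the preparation, filtering, and 8:1:1 split could be implemented with the Hugging Face datasets library. It is illustrative rather than the exact code used: the manifest file name, column names, and filter threshold are assumptions, and the real preprocessing additionally removed non-Tibetan segments.

# Minimal, illustrative sketch of the data preparation and 8:1:1 split.
# Assumptions (not from the report): the corpus is described by a
# "manifest.csv" with "audio" and "transcript" columns, and the Hugging
# Face `datasets` library is used.
from datasets import load_dataset, Audio

ds = load_dataset("csv", data_files="manifest.csv", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # wav2vec2 expects 16 kHz audio

def keep(example):
    # Drop samples with empty transcripts or implausibly short audio;
    # the actual pipeline also filtered non-Tibetan and low-quality clips.
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    return bool(example["transcript"].strip()) and duration > 1.0

ds = ds.filter(keep)

# 8:1:1 split: carve out 20% for validation+test, then halve that portion.
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, valid_ds, test_ds = split["train"], heldout["train"], heldout["test"]

Fixing the seed keeps the split reproducible across runs.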
Training Details
- Training Dataset:
- Training: 9.1 hours (9,316 samples)
- Validation/Test: 1.1 hours (1,165 samples)
- Training Parameters:
per_device_train_batch_size=8
gradient_accumulation_steps=2
evaluation_strategy="steps"
save_steps=1000
eval_steps=1000
logging_steps=100
learning_rate=3e-4
num_train_epochs=100
fp16=True
dataloader_num_workers=4
save_total_limit=50
- Hardware and Duration:
- GPU: 1x RTX 4090 (24 GB)
- Training Time: 3 hours
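The sketch below shows how the parameters listed above could be wired into a wav2vec2 fine-tuning run with the transformers Trainer. The checkpoint name and output directory are placeholders, train_ds and valid_ds are assumed to have already been mapped to input_values and labels with the model’s processor, and the CTC data collator is abbreviated to its essentials.

# Illustrative fine-tuning setup using the training parameters above.
# "base-tibetan-wav2vec2" and "wav2vec2-dkr" are placeholder names.
from dataclasses import dataclass
from transformers import (Trainer, TrainingArguments,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

processor = Wav2Vec2Processor.from_pretrained("base-tibetan-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained(
    "base-tibetan-wav2vec2",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

@dataclass
class DataCollatorCTC:
    processor: Wav2Vec2Processor
    def __call__(self, features):
        # Pad audio inputs and label sequences separately; padded label
        # positions are set to -100 so they are ignored by the CTC loss.
        inputs = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        label_batch = self.processor.tokenizer.pad(labels, return_tensors="pt")
        batch["labels"] = label_batch["input_ids"].masked_fill(
            label_batch["attention_mask"].ne(1), -100)
        return batch

args = TrainingArguments(
    output_dir="wav2vec2-dkr",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-4,
    num_train_epochs=100,
    fp16=True,
    dataloader_num_workers=4,
    save_total_limit=50,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorCTC(processor),
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    tokenizer=processor.feature_extractor,
)
trainer.train()

With per_device_train_batch_size=8 and gradient_accumulation_steps=2, the effective batch size is 16, so 9,316 training samples give roughly 580 optimizer steps per epoch and about 58,000 steps over 100 epochs, consistent with the checkpoint-58000 figure reported below.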
Results
- Base Model Performance:
- CER (Character Error Rate): 0.4754
- Fine-Tuned Model Performance:
- Checkpoint 58000: CER 0.2043
- The fine-tuned model shows a significant reduction in CER, indicating that the speaker-specific fine-tuning led to a noticeable improvement in transcription accuracy for Dilgo Khyentse Rinpoche’s recordings.
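For reference, the CER figures above could be reproduced with a greedy-decoding evaluation along the following lines; the checkpoint path and the use of the jiwer package are assumptions, since the report does not specify the exact evaluation script.

# Illustrative CER evaluation on the held-out test set.
# "wav2vec2-dkr/checkpoint-58000" is a placeholder path; test_ds is the
# held-out split from the preparation sketch.
import torch
import jiwer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("wav2vec2-dkr/checkpoint-58000")
model = Wav2Vec2ForCTC.from_pretrained("wav2vec2-dkr/checkpoint-58000").eval()

refs, hyps = [], []
for sample in test_ds:
    inputs = processor(sample["audio"]["array"],
                       sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    hyps.append(processor.batch_decode(pred_ids)[0])
    refs.append(sample["transcript"])

print("CER:", jiwer.cer(refs, hyps))

The base model’s CER is obtained in the same way by pointing both from_pretrained calls at the original base checkpoint.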
Discussion
Fine-tuning the base model on Dilgo Khyentse Rinpoche’s speech data resulted in a substantial improvement in the Character Error Rate (CER). The fine-tuned model reduced the CER from 0.4754 to 0.2043, an absolute reduction of roughly 27 percentage points and a relative improvement of about 57%, showcasing how speaker-specific fine-tuning can tailor ASR models to unique speech patterns. The results underscore the importance of using a speaker’s own data to improve transcription accuracy, especially when working with spiritual or historical figures whose voices and speech patterns differ significantly from those in a general training corpus.
While the results are promising, further improvements could likely be achieved by incorporating additional data from Kabjye Dilgo Khyentse Rinpoche or similar speakers. The model could also be fine-tuned with more advanced techniques to achieve even better results in real-world applications, such as transcription of historical teachings and Dharma talks.
Conclusion
This experiment demonstrates the effectiveness of fine-tuning a base STT model on a speaker-specific dataset, leading to significant improvements in transcription accuracy for Kabjye Dilgo Khyentse Rinpoche’s speech. Our hypothesis that fine-tuning on this dataset would improve transcription accuracy was validated by the reduction in Character Error Rate (CER): the base model achieved a CER of 0.4754, while the fine-tuned model reduced it to 0.2043, an absolute reduction of roughly 27 percentage points and a relative improvement of about 57%. This substantial reduction confirms that speaker-specific fine-tuning is highly beneficial, particularly for unique speech patterns such as those of Kabjye Dilgo Khyentse Rinpoche.