A custom ASR model to transcribe the speech of Kabjye Dilgo Khyentse Rinpoche

Introduction

We have a base Speech-to-Text (STT) model trained on a diverse range of audio data, including general conversations, podcasts, and other varied speech recordings. While the base model provides good transcription results across these general domains, it faces challenges when transcribing audio from specific speakers, especially those with unique accents or speech patterns, like Kabjye Dilgo Khyentse Rinpoche. In this experiment, we aim to fine-tune the base model on a small set of annotated speech data from Dilgo Khyentse Rinpoche and evaluate its performance on his recordings. Our hypothesis is that this fine-tuning will improve transcription accuracy for his voice compared to the baseline model.

Background

General-purpose ASR models are typically trained on diverse data to handle a wide range of speakers and contexts. However, these models tend to underperform on single-speaker datasets, especially those with distinct accents, speech cadences, or pronunciations. Fine-tuning the model on a speaker’s specific data can help the model adapt to those particularities, improving performance significantly. This experiment explores the impact of fine-tuning the base model on the unique speech patterns of Kabjye Dilgo Khyentse Rinpoche, a Tibetan spiritual leader.

Methodology

  1. Dataset Preparation:

    • We collected recordings of Kabjye Dilgo Khyentse Rinpoche and paired each audio segment with its Tibetan transcript, yielding roughly 10 hours of annotated speech.
  2. Data Preprocessing:

    • We preprocessed the audio files and text, ensuring they were properly aligned and of good quality. Non-Tibetan or low-quality audio was filtered out to maintain data integrity.
  3. Dataset Splitting:

    • The dataset was split into training, validation, and test sets in an 8:1:1 ratio, providing a solid foundation for training and evaluation (see the splitting sketch after this list).
  4. Model Fine-Tuning:

    • We fine-tuned a pre-trained wav2vec2 checkpoint on this speaker-specific dataset.
  5. Evaluation:

    • The performance of both the base and fine-tuned models was evaluated on the test set, which consists of unseen audio data from Kabjye Dilgo Khyentse Rinpoche.
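
To make the preprocessing and splitting steps concrete, here is a minimal sketch rather than the exact pipeline used in the experiment. It assumes the annotated clips are listed in a hypothetical data/annotations.csv with "audio" and "transcript" columns and loaded as a Hugging Face Dataset; the duration thresholds are illustrative.

    # Minimal sketch; file paths, column names, and thresholds are assumptions.
    from datasets import load_dataset, DatasetDict, Audio

    ds = load_dataset("csv", data_files="data/annotations.csv", split="train")
    ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

    # Drop clips that are empty or implausibly short/long for CTC training.
    def keep(example):
        duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
        return 1.0 <= duration <= 30.0 and len(example["transcript"].strip()) > 0

    ds = ds.filter(keep)

    # 8:1:1 split: carve off 20% first, then halve it into validation and test.
    train_rest = ds.train_test_split(test_size=0.2, seed=42)
    val_test = train_rest["test"].train_test_split(test_size=0.5, seed=42)
    splits = DatasetDict(
        train=train_rest["train"],
        validation=val_test["train"],
        test=val_test["test"],
    )
    print({name: len(split) for name, split in splits.items()})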

Training Details

  • Dataset Splits:

    • Training: 9.1 hours (9,316 samples)
    • Validation/Test: 1.1 hours (1,165 samples) each
  • Training Parameters (see the configuration sketch at the end of this section):

    • per_device_train_batch_size=8
    • gradient_accumulation_steps=2
    • evaluation_strategy="steps"
    • save_steps=1000
    • eval_steps=1000
    • logging_steps=100
    • learning_rate=3e-4
    • num_train_epochs=100
    • fp16=True
    • dataloader_num_workers=4
    • save_total_limit=50
  • Hardware and Duration:

    • GPU: 1x RTX 4090 (24 GB)
    • Training Time: 3 hours
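
For reference, the parameters listed above map directly onto a Hugging Face Trainer configuration. The sketch below is illustrative rather than the exact training script: the base checkpoint path and output directory are placeholders, and the processor, CTC data collator, and Trainer wiring are omitted for brevity.

    # Minimal sketch, assuming the Hugging Face Transformers Trainer API.
    from transformers import TrainingArguments, Wav2Vec2ForCTC

    # Placeholder path standing in for the base STT checkpoint.
    model = Wav2Vec2ForCTC.from_pretrained("path/to/base-stt-checkpoint")

    training_args = TrainingArguments(
        output_dir="wav2vec2-dilgo-khyentse-rinpoche",  # illustrative name
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        evaluation_strategy="steps",
        save_steps=1000,
        eval_steps=1000,
        logging_steps=100,
        learning_rate=3e-4,
        num_train_epochs=100,
        fp16=True,
        dataloader_num_workers=4,
        save_total_limit=50,
    )
    # A Trainer would then be constructed with this model, these arguments,
    # the train/validation splits, and a CTC data collator.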

Results

  • Base Model Performance:

    • CER (Character Error Rate): 0.4754
  • Fine-Tuned Model Performance:

    • Checkpoint 58000: CER: 0.2043
    • The fine-tuned model shows a significant reduction in CER, indicating that speaker-specific fine-tuning noticeably improved transcription accuracy for Dilgo Khyentse Rinpoche’s recordings (a short sketch of how CER is computed follows below).
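
CER is the character-level edit distance between a predicted transcript and its reference, divided by the number of reference characters. The snippet below is a minimal sketch of how such a score can be computed, assuming the Hugging Face evaluate library (backed by jiwer); the strings are illustrative placeholders, not samples from the test set.

    # Minimal CER sketch; the example strings are illustrative only.
    import evaluate

    cer_metric = evaluate.load("cer")

    references = ["བཀྲ་ཤིས་བདེ་ལེགས།"]    # ground-truth transcript (example phrase)
    predictions = ["བཀྲ་ཤས་བདེ་ལེགས།"]    # hypothetical model output

    score = cer_metric.compute(predictions=predictions, references=references)
    print(f"CER: {score:.4f}")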

Discussion

Fine-tuning the base model on Dilgo Khyentse Rinpoche’s speech data resulted in a substantial improvement in the Character Error Rate (CER): an absolute reduction of about 27 percentage points (0.4754 - 0.2043 = 0.2711), or a relative reduction of roughly 57% (0.2711 / 0.4754 ≈ 0.57). This showcases how speaker-specific fine-tuning can tailor ASR models to unique speech patterns. The results underscore the importance of using a speaker’s own data to improve transcription accuracy, especially when working with spiritual or historical figures whose voices and speech patterns differ significantly from those in a general training corpus.

While the results are promising, further improvements may be achievable by incorporating additional data from Kabjye Dilgo Khyentse Rinpoche or similar speakers. The model could also be fine-tuned with more advanced techniques to achieve even better results in real-world applications, such as transcription of historical teachings and Dharma talks.
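
For context, applying the fine-tuned model to new recordings is standard wav2vec2 CTC inference. The sketch below is illustrative only, with placeholder checkpoint and audio paths; it is not taken from the experiment’s code.

    # Minimal inference sketch; checkpoint and audio paths are placeholders.
    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    processor = Wav2Vec2Processor.from_pretrained("path/to/fine-tuned-checkpoint")
    model = Wav2Vec2ForCTC.from_pretrained("path/to/fine-tuned-checkpoint")
    model.eval()

    # Load a recording at 16 kHz, the rate wav2vec2 models typically expect.
    speech, sr = librosa.load("path/to/teaching.wav", sr=16_000)
    inputs = processor(speech, sampling_rate=sr, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])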

Conclusion

This experiment demonstrates the effectiveness of fine-tuning a base STT model on a speaker-specific dataset, leading to significant improvements in transcription accuracy for Kabjye Dilgo Khyentse Rinpoche’s speech. Our hypothesis that fine-tuning on this dataset would improve transcription accuracy was validated by the reduction in Character Error Rate (CER): the base model achieved a CER of 0.4754, while the fine-tuned model reduced it to 0.2043, an absolute improvement of about 27 percentage points (roughly 57% relative). This substantial reduction confirms that speaker-specific fine-tuning is highly beneficial for improving transcription accuracy, particularly for unique speech patterns such as those of Kabjye Dilgo Khyentse Rinpoche.

Resources

Could you give figures in your conclusion? Your hypothesis in the introduction is “Our hypothesis is that this fine-tuning will improve transcription accuracy for his voice compared to the baseline model.” so I expect to learn more about it in the conclusion.

Yes, sure, Gen-la. I have updated the conclusion.

@Ganga_Gyatso if we get more data of Rinpoche in the future, how should we proceed with training? Should we train from the latest checkpoint of the base model with the old and new data combined, or should we continue training the fine-tuned checkpoint on the new data only?

Re-training from the base model checkpoint with the combined old and new datasets ensures balanced learning and avoids overfitting to patterns from the previously fine-tuned checkpoint. It leverages the base model’s generalizability, treats all data equally, and eliminates biases toward earlier fine-tuned data. Since the dataset is small (<50 hours), re-training this way is cost-effective, quick (under $10), and provides a fresh optimization run with better convergence and more flexibility for future fine-tuning.
