Custom ASR (Automatic Speech Recognition) for Garchen Rinpoche

Customizing Speech-to-Text: Fine-Tuning a Model for Garchen Rinpoche’s Unique Voice

Introduction

Fine-tuning Speech-to-Text (STT) models for a specific speaker can significantly improve transcription accuracy by adapting the model to that speaker’s unique speech patterns and pronunciation. In this project, we fine-tuned a base wav2vec2 model on speaker-specific data from Garchen Rinpoche’s teachings. This blog post outlines our fine-tuning process, presents the evaluation results, and discusses insights gained from the experiment.

Background

Individual speakers can pose unique challenges for general-purpose STT models, particularly in Tibetan speech recognition. Factors such as age, speaking style, and pronunciation patterns significantly affect transcription accuracy. By fine-tuning our base model on Garchen Rinpoche’s speech data, we aimed to improve accuracy on his specific speech patterns.

Methodology

Dataset Preparation

| Dataset Type     | Duration (hh:mm:ss) | Samples | Description                                           |
|------------------|---------------------|---------|-------------------------------------------------------|
| Training Dataset | 05:33:00            | 3071    | Annotated recordings of Garchen Rinpoche’s teachings  |
| Test Set         | 01:04:12            | 893     | Held-out test data for evaluation                     |
| Total Data       | 06:37:12            | 3964    | Combined dataset used in the experiment               |
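
The post does not show the preprocessing pipeline itself. A minimal sketch using the Hugging Face datasets library might look like the following; the CSV manifests, file names, and column names are assumptions for illustration:

```python
# Hypothetical dataset preparation. Assumes one CSV manifest per split with
# a `path` column (audio file path) and a `transcript` column; the manifest
# names and columns are illustrative, not from the original project.
from datasets import Audio, load_dataset

ds = load_dataset(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)

# wav2vec2-style models expect 16 kHz mono audio, so decode and resample
# each file lazily on access rather than ahead of time.
ds = ds.cast_column("path", Audio(sampling_rate=16_000))
```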

Model Fine-Tuning

  • Base Model: ganga4364/mms_300_v4.96000
  • Model Architecture: Wav2Vec2ForCTC
  • Training Setup (see the sketch after this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=2
    • learning_rate=3e-4
    • num_train_epochs=100
    • warmup_steps=500
    • fp16=True
  • Hardware: GPU with 24GB VRAM
  • Training Duration: ~2.15 hours
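
Below is a minimal sketch of this setup with the Hugging Face Trainer. The hyperparameters are the ones listed above; the output directory, save/logging cadence, and the `train_ds` and `data_collator` objects are assumptions for illustration:

```python
# Sketch of the fine-tuning run, assuming a preprocessed dataset `train_ds`
# and a CTC padding collator `data_collator` (both hypothetical names).
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

processor = Wav2Vec2Processor.from_pretrained("ganga4364/mms_300_v4.96000")
model = Wav2Vec2ForCTC.from_pretrained(
    "ganga4364/mms_300_v4.96000",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

training_args = TrainingArguments(
    output_dir="wav2vec2-garchen-rinpoche",  # assumed name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size of 16
    learning_rate=3e-4,
    num_train_epochs=100,
    warmup_steps=500,
    fp16=True,  # mixed precision fits the 24GB GPU
    save_steps=1000,    # assumed cadence
    logging_steps=100,  # assumed cadence
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)
trainer.train()
```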

Evaluation Approach

Two models were evaluated on the same held-out test set (a sketch of the scoring procedure follows the list):

  1. Base Model (the wav2vec2 checkpoint without speaker-specific fine-tuning)
  2. Fine-Tuned Model (the same checkpoint after training on Garchen Rinpoche’s data)
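
The post does not include the evaluation code, but a typical scoring procedure for CTC models looks like the sketch below: greedy (argmax) decoding of each test utterance, then corpus-level CER/WER with the `evaluate` library. The function names are illustrative:

```python
# Illustrative evaluation: greedy CTC decoding plus corpus-level CER/WER.
import torch
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

def transcribe(model, processor, audio, sampling_rate=16_000):
    """Greedy (argmax) CTC decoding of a single utterance."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

def score(predictions, references):
    """Return corpus-level (CER, WER) as percentages."""
    return (
        100 * cer_metric.compute(predictions=predictions, references=references),
        100 * wer_metric.compute(predictions=predictions, references=references),
    )
```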

Results

Model Performance Metrics

| Metric                     | Base Model | Fine-Tuned Model |
|----------------------------|------------|------------------|
| Character Error Rate (CER) | 27.67%     | 22.93%           |
| Word Error Rate (WER)      | 45.92%     | 39.42%           |

Training Progress

| Checkpoint  | CER (%) |
|-------------|---------|
| Base model  | 27.67   |
| 5000 steps  | 27.41   |
| 10000 steps | 23.37   |
| 19000 steps | 22.93   |

Error Analysis

The fine-tuned model showed improvements in several areas. Its errors on the test set break down as follows (one way to compute such a breakdown is sketched after the list):

  • Error Distribution:
    • Substitutions: 4,217 instances
    • Insertions: 779 instances
    • Deletions: 1,190 instances
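
One way to produce such a breakdown is with the `jiwer` package, as sketched below; this is an assumption, since the post does not name the alignment tool actually used:

```python
# Align hypotheses against references and count error types.
# `references` and `hypotheses` are assumed to be parallel lists of strings.
import jiwer

out = jiwer.process_words(references, hypotheses)
print(f"substitutions: {out.substitutions}")
print(f"insertions:    {out.insertions}")
print(f"deletions:     {out.deletions}")
print(f"WER:           {out.wer:.2%}")
```

For Tibetan, the word-level counts depend on how the text is segmented into words or syllables, so a character-level breakdown (e.g. jiwer.process_characters) may be the more informative view.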

Discussion

Key Findings

  1. Overall Improvement: The fine-tuned model achieved a 4.74 percentage point reduction in CER compared to the base model, demonstrating the effectiveness of speaker-specific training.

  2. Word vs. Character Accuracy: While character-level accuracy improved markedly, word-level accuracy remains a challenge (WER of 39.42% after fine-tuning), suggesting room for improvement in capturing complete word structures.

  3. Error Patterns: The predominance of substitution errors over insertions and deletions indicates that the model is more likely to misidentify characters than to miss them entirely.

Challenges and Limitations

  • Limited dataset size (about 5.6 hours of training audio)
  • Complexity of Tibetan language structure
  • Speaker-specific characteristics (speaker in the 70-90 age group)

Next Steps

  1. Data Collection: Expand the training dataset with more annotated recordings
  2. Model Architecture: Experiment with alternative fine-tuning approaches
  3. Error Analysis: Conduct detailed analysis of common error patterns

Conclusion

Our fine-tuning experiment demonstrates promising results in adapting a general STT model to Garchen Rinpoche’s unique speech patterns. The reduction in both character and word error rates suggests that speaker-specific fine-tuning is an effective approach for improving Tibetan speech recognition accuracy.

Resources

Datasets

Models

Code and Documentation