Customizing Speech-to-Text: Fine-Tuning a Model for Garchen Rinpoche’s Unique Voice
Introduction
Fine-tuning Speech-to-Text (STT) models for specific speakers can significantly enhance transcription accuracy by adapting to their unique speech patterns and pronunciations. In this project, we fine-tuned a base wav2vec2 model on speaker-specific data drawn from Garchen Rinpoche’s teachings. This blog post outlines our fine-tuning process, presents the evaluation results, and discusses insights gained from the experiment.
Background
Individual speakers can pose distinct challenges for general-purpose STT models, particularly in Tibetan speech recognition. Factors such as age, speaking style, and pronunciation patterns can significantly affect transcription accuracy. By fine-tuning our base model on Garchen Rinpoche’s speech data, we aimed to improve transcription accuracy for his specific speech patterns.
Methodology
Dataset Preparation
| Dataset Type | Duration (hh:mm:ss) | Samples | Description |
|---|---|---|---|
| Training Dataset | 05:33:00 | 3,071 | Annotated recordings of Garchen Rinpoche’s teachings |
| Test Set | 01:04:12 | 893 | Held-out data for evaluation |
| Total | 06:37:12 | 3,964 | Combined dataset used in the experiment |
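As a rough illustration, a dataset like this can be prepared with the Hugging Face datasets library. The sketch below is a minimal example, assuming 16 kHz mono audio stored in an audiofolder layout; the directory name, column name, and split fraction are illustrative assumptions, not our actual pipeline.

```python
# Minimal sketch of dataset preparation with Hugging Face `datasets`.
# The directory layout, column name, and split fraction are illustrative
# assumptions, not our actual pipeline.
from datasets import load_dataset, Audio

dataset = load_dataset("audiofolder", data_dir="data/garchen_rinpoche")

# wav2vec2 expects 16 kHz mono audio; resample on the fly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a test split comparable to the one in the table above.
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```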
Model Fine-Tuning
- Base Model: `ganga4364/mms_300_v4.96000`
- Model Architecture: Wav2Vec2ForCTC
- Training Setup (see the sketch after this list):
  - `per_device_train_batch_size=8`
  - `gradient_accumulation_steps=2`
  - `learning_rate=3e-4`
  - `num_train_epochs=100`
  - `warmup_steps=500`
  - `fp16=True`
- Hardware: GPU with 24GB VRAM
- Training Duration: ~2.15 hours
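The hyperparameters above map directly onto Hugging Face TrainingArguments. The following is a minimal sketch of how such a run can be wired up; the output directory, save schedule, and `data_collator` (a CTC padding collator, defined elsewhere) are assumptions, not details from our actual training script.

```python
# Minimal sketch of the fine-tuning run with Hugging Face `transformers`.
# The output directory, save schedule, and `data_collator` are assumptions.
from transformers import Wav2Vec2ForCTC, Trainer, TrainingArguments

model = Wav2Vec2ForCTC.from_pretrained("ganga4364/mms_300_v4.96000")

training_args = TrainingArguments(
    output_dir="wav2vec2-garchen-rinpoche",  # assumption
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size of 16
    learning_rate=3e-4,
    num_train_epochs=100,
    warmup_steps=500,
    fp16=True,  # mixed precision on the 24 GB GPU
    save_steps=1000,  # assumption: periodic checkpointing
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # prepared as sketched earlier
    eval_dataset=test_ds,
    data_collator=data_collator,  # assumed CTC padding collator
)
trainer.train()
```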
Evaluation Approach
Two models were evaluated on the test set:
- Base Model (wav2vec2 without fine-tuning)
- Fine-Tuned Model (trained on Garchen Rinpoche’s data)
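For reference, CER and WER can be computed with a library such as jiwer. The sketch below assumes reference transcripts and decoded model outputs are available as parallel lists of strings; the helper names are hypothetical.

```python
import jiwer

# Hypothetical helpers: gold transcripts and decoded model outputs for
# the test set, as parallel lists of strings.
references = load_references(test_ds)
hypotheses = transcribe(model, test_ds)

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")
```

Note that the WER value for Tibetan depends on the segmentation convention, since the script marks syllable boundaries (with the tsheg) rather than word boundaries.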
Results
Model Performance Metrics
| Metric | Base Model | Fine-Tuned Model |
|---|---|---|
| Character Error Rate (CER) | 27.67% | 22.93% |
| Word Error Rate (WER) | 45.92% | 39.42% |
Training Progress
| Checkpoint | CER (%) |
|---|---|
| Base model | 27.67 |
| 5,000 steps | 27.41 |
| 10,000 steps | 23.37 |
| 19,000 steps | 22.93 |
Error Analysis
The error distribution of the fine-tuned model on the test set breaks down as follows (counts obtainable from the alignment sketch below):
- Substitutions: 4,217 instances
- Insertions: 779 instances
- Deletions: 1,190 instances
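Counts like these fall out of the same edit-distance alignment used for WER; jiwer exposes them directly, as in this sketch (using the same hypothetical references and hypotheses as above).

```python
import jiwer

# Word-level alignment between references and hypotheses; the returned
# object carries aggregate substitution/insertion/deletion counts.
out = jiwer.process_words(references, hypotheses)
print(f"Substitutions: {out.substitutions}")
print(f"Insertions:    {out.insertions}")
print(f"Deletions:     {out.deletions}")
```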
Discussion
Key Findings
- Overall Improvement: The fine-tuned model achieved a 4.74 percentage point reduction in CER compared to the base model, demonstrating the effectiveness of speaker-specific training.
- Word vs Character Accuracy: While character-level accuracy showed significant improvement, word-level accuracy remains a challenge, suggesting room for improvement in capturing complete word structures.
- Error Patterns: The predominance of substitution errors over insertions and deletions indicates that the model is more likely to misidentify characters than to miss them entirely.
Challenges and Limitations
- Limited dataset size (~5.5 hours of training audio)
- Complexity of Tibetan language structure
- Speaker-specific characteristics (age group: 70-90 years)
Next Steps
- Data Collection: Expand the training dataset with more annotated recordings
- Model Architecture: Experiment with alternative fine-tuning approaches
- Error Analysis: Conduct detailed analysis of common error patterns
Conclusion
Our fine-tuning experiment demonstrates promising results in adapting a general STT model to Garchen Rinpoche’s unique speech patterns. The reduction in both character and word error rates suggests that speaker-specific fine-tuning is an effective approach for improving Tibetan speech recognition accuracy.