Custom ASR (Automatic Speech Recognition) for Garchen Rinpoche

Customizing Speech-to-Text: Fine-Tuning a Model for Garchen Rinpoche’s Unique Voice

Introduction

Fine-tuning Speech-to-Text (STT) models for specific speakers can significantly enhance transcription accuracy by adapting to unique speech patterns and pronunciations. In this project, we fine-tuned a base wav2vec2 model on speaker-specific data from Garchen Rinpoche’s teachings. This blog post outlines our fine-tuning process, presents evaluation results, and discusses insights gained from the experiment.

Background

Individual speakers often present unique challenges for general-purpose STT models, particularly in Tibetan speech recognition. Factors such as age, speaking style, and pronunciation patterns can significantly impact transcription accuracy. By fine-tuning our base model on Garchen Rinpoche’s speech data, we aimed to improve transcription accuracy for his specific speech patterns.

Methodology

Dataset Preparation

| Dataset Type | Duration | Samples | Description |
| --- | --- | --- | --- |
| Training Dataset | 05:33:00 | 3,071 | Annotated recordings of Garchen Rinpoche’s teachings |
| Test Set | 01:04:12 | 893 | Held-out test data for evaluation |
| Total Data | 06:37:12 | 3,964 | Combined dataset used in the experiment |
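
Concretely, a minimal sketch of the preparation step might look like the code below, assuming the annotated recordings sit in a Hugging Face audiofolder-style directory. The directory name, column names, and split seed are illustrative, not the project’s actual layout.

```python
# A minimal sketch of dataset preparation, assuming (audio, transcript)
# pairs stored in a Hugging Face datasets-compatible "audiofolder" layout.
# Paths and column names below are illustrative assumptions.
from datasets import load_dataset, Audio

dataset = load_dataset("audiofolder", data_dir="garchen_rinpoche_data")

# wav2vec2/MMS models expect 16 kHz mono input, so resample on the fly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a test split comparable in size to the 893-sample test set above.
splits = dataset["train"].train_test_split(test_size=893, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```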

Model Fine-Tuning

  • Base Model: ganga4364/mms_300_v4.96000
  • Model Architecture: Wav2Vec2ForCTC
  • Training Setup (sketched in code after this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=2
    • learning_rate=3e-4
    • num_train_epochs=100
    • warmup_steps=500
    • fp16=True
  • Hardware: GPU with 24GB VRAM
  • Training Duration: ~2.15 hours
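
The following is a minimal sketch of this setup with the Hugging Face Trainer API, wiring in the hyperparameters listed above. The processor loading, feature-extraction mapping, CTC data collator, and evaluation cadence are assumptions filled in for illustration; `train_ds` and `test_ds` come from the dataset sketch earlier.

```python
# A sketch of the fine-tuning configuration using the hyperparameters
# listed above. In a full pipeline the audio and transcripts are first
# mapped through the processor; that step is omitted here for brevity.
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

processor = Wav2Vec2Processor.from_pretrained("ganga4364/mms_300_v4.96000")
model = Wav2Vec2ForCTC.from_pretrained("ganga4364/mms_300_v4.96000")

training_args = TrainingArguments(
    output_dir="wav2vec2-garchen-rinpoche",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    num_train_epochs=100,
    warmup_steps=500,
    fp16=True,                    # mixed precision on the 24 GB GPU
    evaluation_strategy="steps",  # assumption: periodic eval, as implied
    save_steps=1000,              # by the per-checkpoint CER table below
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    # data_collator=...  # a CTC padding collator is required; omitted here
)
trainer.train()
```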

Evaluation Approach

Two models were evaluated on the test set (scoring sketched in code below):

  1. Base Model (wav2vec2 without fine-tuning)
  2. Fine-Tuned Model (trained on Garchen Rinpoche’s data)
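
A sketch of the scoring step, using the jiwer library to compute CER and WER against the reference annotations. The `transcribe()` helper stands in for batched CTC inference and is hypothetical, as is the transcript column name.

```python
# Score a model's transcripts against the reference annotations.
import jiwer

# transcribe() is a hypothetical helper wrapping CTC inference.
references = [sample["transcription"] for sample in test_ds]
hypotheses = [transcribe(model, sample["audio"]) for sample in test_ds]

print(f"CER: {jiwer.cer(references, hypotheses):.2%}")
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```

Note that jiwer’s WER splits on whitespace by default; for Tibetan, where syllables are tsheg-delimited, the word-level figures depend on how references and hypotheses are segmented into words.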

Results

Model Performance Metrics

| Metric | Base Model | Fine-Tuned Model |
| --- | --- | --- |
| Character Error Rate (CER) | 27.67% | 22.93% |
| Word Error Rate (WER) | 45.92% | 39.42% |

Training Progress

| Checkpoint | CER (%) |
| --- | --- |
| Base model | 27.67 |
| 5,000 steps | 27.41 |
| 10,000 steps | 23.37 |
| 19,000 steps | 22.93 |

Error Analysis

The fine-tuned model’s remaining errors on the test set break down as follows (a jiwer-based sketch follows the list):

  • Error Distribution:
    • Substitutions: 4,217 instances
    • Insertions: 779 instances
    • Deletions: 1,190 instances
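
A sketch of how such counts can be extracted from jiwer’s alignment output, reusing the `references` and `hypotheses` lists from the evaluation sketch. Whether the post’s figures are word- or character-level is not stated; jiwer offers `process_characters` for the character-level equivalent.

```python
# Align references against hypotheses and tally edit operations.
import jiwer

out = jiwer.process_words(references, hypotheses)
print(f"Substitutions: {out.substitutions}")
print(f"Insertions:    {out.insertions}")
print(f"Deletions:     {out.deletions}")
```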

Discussion

Key Findings

  1. Overall Improvement: The fine-tuned model achieved a 4.74 percentage point reduction in CER compared to the base model, demonstrating the effectiveness of speaker-specific training.

  2. Word vs Character Accuracy: While character-level accuracy showed significant improvement, word-level accuracy remains a challenge, suggesting room for improvement in capturing complete word structures.

  3. Error Patterns: The predominance of substitution errors over insertions and deletions indicates that the model is more likely to misidentify characters than to miss them entirely.

Challenges and Limitations

  • Limited dataset size (5.6 hours)
  • Complexity of Tibetan language structure
  • Speaker-specific characteristics (age group: 70-90 years)

Next Steps

  1. Data Collection: Expand the training dataset with more annotated recordings
  2. Model Architecture: Experiment with alternative fine-tuning approaches
  3. Error Analysis: Conduct detailed analysis of common error patterns

Conclusion

Our fine-tuning experiment demonstrates promising results in adapting a general STT model to Garchen Rinpoche’s unique speech patterns. The reduction in both character and word error rates suggests that speaker-specific fine-tuning is an effective approach for improving Tibetan speech recognition accuracy.

Resources

Datasets

Models

Code and Documentation


Cost Estimation for Achieving 5% CER in Tibetan ASR

Current Progress and Data Analysis

Our current experiments with fine-tuning the MMS Wav2Vec2 model on Garchen Rinpoche’s teachings have shown promising results:

| Model Stage | Training Data (hours) | CER (%) |
| --- | --- | --- |
| Base Model | 0 | 27.67 |
| Fine-tuned | 5.6 | 22.93 |

Using a linear fit through these two data points, we can estimate the resources needed to achieve our target CER.

Linear Projection Analysis

Based on our current data points, we can establish a linear equation:


CER = -0.8482 × Hours + 27.67

This suggests (a worked version of the projection follows this list):

  • Improvement rate: ~0.85 percentage points of CER reduction per hour of training data

  • To reach 5% CER: Approximately 26.7 hours of high-quality training data needed

  • Additional data required: ~21.1 hours beyond our current dataset
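
The projection can be reproduced with a few lines of arithmetic, fitting the line through our two observed points. The post’s coefficient (-0.8482) appears to use the unrounded 05:33:00 training duration, so the rounded figures below land marginally higher (26.8 vs. 26.7 hours).

```python
# Fit a line through the two observed (hours, CER) points and solve
# for the hours of training data needed to reach the 5% CER target.
base = (0.0, 27.67)
tuned = (5.6, 22.93)

slope = (tuned[1] - base[1]) / (tuned[0] - base[0])  # ~-0.85 pp per hour
intercept = base[1]

target_cer = 5.0
hours_needed = (target_cer - intercept) / slope
print(f"slope: {slope:.4f} pp/hour")
print(f"hours for {target_cer}% CER: {hours_needed:.1f}")  # ~26.8
print(f"additional hours: {hours_needed - tuned[0]:.1f}")  # ~21.2

phase2_cer = intercept + slope * 10.0
print(f"expected CER at 10 h: {phase2_cer:.2f}%")  # ~19.2, matching Phase 2
```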

Phased Implementation Plan

Phase 1 (Current Progress)

| Metric | Value |
| --- | --- |
| Training Data | 5.6 hours |
| Data Preparation Time | 1 month |
| Syllable Count | 85,206 |
| Training Time | 2.15 hours |
| Current CER | 22.93% |

Phase 2 (Intermediate Goal)

| Metric | Value |
| --- | --- |
| Training Data | 10.0 hours |
| Data Preparation Time | 2 months |
| Syllable Count | 170,412 |
| Training Time | 4.30 hours |
| Expected CER | 19.188% |

Phase 3 (Final Target)

| Metric | Value |
| --- | --- |
| Training Data | 27.0 hours |
| Data Preparation Time | 5 months |
| Syllable Count | 426,030 |
| Training Time | 11.5 hours |
| Target CER | 5% |

Project Timeline Overview

  • Total Timeline: 5 Months

  • Data Collection Rate: ~5.4 hours of training data per month

Conclusion

This implementation plan provides a clear path to a 5% CER through three progressive phases over five months. Based on our analysis, approximately 27 hours of high-quality training data (about 426,030 syllables) will be required to reach the target CER of 5%.