Custom ASR (Automatic Speech Recognition) for Garchen Rinpoche

Customizing Speech-to-Text: Fine-Tuning a Model for Garchen Rinpoche’s Unique Voice

Introduction

Fine-tuning Speech-to-Text (STT) models for specific speakers can significantly enhance transcription accuracy by adapting to unique speech patterns and pronunciations. In this project, we fine-tuned a base wav2vec2 model on speaker-specific data from Garchen Rinpoche’s teachings. This blog post outlines our fine-tuning process, presents evaluation results, and discusses insights gained from the experiment.

Background

Individual speakers often present unique challenges for general-purpose STT models, particularly in Tibetan speech recognition. Factors such as age, speaking style, and pronunciation patterns can significantly impact transcription accuracy. By fine-tuning our base model on Garchen Rinpoche’s speech data, we aimed to improve transcription accuracy for his specific speech patterns.

Methodology

Dataset Preparation

| Dataset Type | Duration | Samples | Description |
| --- | --- | --- | --- |
| Training Dataset | 05:33:00 | 3,071 | Annotated recordings of Garchen Rinpoche’s teachings |
| Test Set | 01:04:12 | 893 | Held-out test data for evaluation |
| Total Data | 06:37:12 | 3,964 | Combined dataset used in the experiment |
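
Concretely, a minimal sketch of the preparation step might look like the code below, assuming the annotated recordings sit in a Hugging Face audiofolder-style directory. The directory name, column names, and split seed are illustrative, not the project’s actual layout.

```python
# A minimal sketch of dataset preparation, assuming (audio, transcript)
# pairs stored in a Hugging Face datasets-compatible "audiofolder" layout.
# Paths and column names below are illustrative assumptions.
from datasets import load_dataset, Audio

dataset = load_dataset("audiofolder", data_dir="garchen_rinpoche_data")

# wav2vec2/MMS models expect 16 kHz mono input, so resample on the fly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a test split comparable in size to the 893-sample test set above.
splits = dataset["train"].train_test_split(test_size=893, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```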

Model Fine-Tuning

  • Base Model: ganga4364/mms_300_v4.96000
  • Model Architecture: Wav2Vec2ForCTC
  • Training Setup (sketched in code after this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=2
    • learning_rate=3e-4
    • num_train_epochs=100
    • warmup_steps=500
    • fp16=True
  • Hardware: GPU with 24GB VRAM
  • Training Duration: ~2.15 hours
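
The following is a minimal sketch of this setup with the Hugging Face Trainer API, wiring in the hyperparameters listed above. The processor loading, feature-extraction mapping, CTC data collator, and evaluation cadence are assumptions filled in for illustration; `train_ds` and `test_ds` come from the dataset sketch earlier.

```python
# A sketch of the fine-tuning configuration using the hyperparameters
# listed above. In a full pipeline the audio and transcripts are first
# mapped through the processor; that step is omitted here for brevity.
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

processor = Wav2Vec2Processor.from_pretrained("ganga4364/mms_300_v4.96000")
model = Wav2Vec2ForCTC.from_pretrained("ganga4364/mms_300_v4.96000")

training_args = TrainingArguments(
    output_dir="wav2vec2-garchen-rinpoche",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=3e-4,
    num_train_epochs=100,
    warmup_steps=500,
    fp16=True,                    # mixed precision on the 24 GB GPU
    evaluation_strategy="steps",  # assumption: periodic eval, as implied
    save_steps=1000,              # by the per-checkpoint CER table below
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    # data_collator=...  # a CTC padding collator is required; omitted here
)
trainer.train()
```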

Evaluation Approach

Two models were evaluated on the test set (scoring sketched in code below):

  1. Base Model (wav2vec2 without fine-tuning)
  2. Fine-Tuned Model (trained on Garchen Rinpoche’s data)
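
A sketch of the scoring step, using the jiwer library to compute CER and WER against the reference annotations. The `transcribe()` helper stands in for batched CTC inference and is hypothetical, as is the transcript column name.

```python
# Score a model's transcripts against the reference annotations.
import jiwer

# transcribe() is a hypothetical helper wrapping CTC inference.
references = [sample["transcription"] for sample in test_ds]
hypotheses = [transcribe(model, sample["audio"]) for sample in test_ds]

print(f"CER: {jiwer.cer(references, hypotheses):.2%}")
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
```

Note that jiwer’s WER splits on whitespace by default; for Tibetan, where syllables are tsheg-delimited, the word-level figures depend on how references and hypotheses are segmented into words.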

Results

Model Performance Metrics

| Metric | Base Model | Fine-Tuned Model |
| --- | --- | --- |
| Character Error Rate (CER) | 27.67% | 22.93% |
| Word Error Rate (WER) | 45.92% | 39.42% |

Training Progress

| Checkpoint | CER (%) |
| --- | --- |
| Base model | 27.67 |
| 5,000 steps | 27.41 |
| 10,000 steps | 23.37 |
| 19,000 steps | 22.93 |

Error Analysis

The fine-tuned model’s remaining errors on the test set break down as follows (a jiwer-based sketch follows the list):

  • Error Distribution:
    • Substitutions: 4,217 instances
    • Insertions: 779 instances
    • Deletions: 1,190 instances
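
A sketch of how such counts can be extracted from jiwer’s alignment output, reusing the `references` and `hypotheses` lists from the evaluation sketch. Whether the post’s figures are word- or character-level is not stated; jiwer offers `process_characters` for the character-level equivalent.

```python
# Align references against hypotheses and tally edit operations.
import jiwer

out = jiwer.process_words(references, hypotheses)
print(f"Substitutions: {out.substitutions}")
print(f"Insertions:    {out.insertions}")
print(f"Deletions:     {out.deletions}")
```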

Discussion

Key Findings

  1. Overall Improvement: The fine-tuned model achieved a 4.74 percentage point reduction in CER compared to the base model, demonstrating the effectiveness of speaker-specific training.

  2. Word vs Character Accuracy: While character-level accuracy showed significant improvement, word-level accuracy remains a challenge, suggesting room for improvement in capturing complete word structures.

  3. Error Patterns: The predominance of substitution errors over insertions and deletions indicates that the model is more likely to misidentify characters than to miss them entirely.

Challenges and Limitations

  • Limited dataset size (5.6 hours)
  • Complexity of Tibetan language structure
  • Speaker-specific characteristics (age group: 70-90 years)

Next Steps

  1. Data Collection: Expand the training dataset with more annotated recordings
  2. Model Architecture: Experiment with alternative fine-tuning approaches
  3. Error Analysis: Conduct detailed analysis of common error patterns

Conclusion

Our fine-tuning experiment demonstrates promising results in adapting a general STT model to Garchen Rinpoche’s unique speech patterns. The reduction in both character and word error rates suggests that speaker-specific fine-tuning is an effective approach for improving Tibetan speech recognition accuracy.

Resources

Datasets

Models

Code and Documentation


Cost Estimation for Achieving 5% CER in Tibetan ASR

Current Progress and Data Analysis

Our current experiments with fine-tuning the MMS Wav2Vec2 model on Garchen Rinpoche’s teachings have shown promising results:

| Model Stage | Training Data (hours) | CER (%) |
| --- | --- | --- |
| Base Model | 0 | 27.67 |
| Fine-tuned | 5.6 | 22.93 |

Using a linear fit through these two data points, we can estimate the resources needed to achieve our target CER.

Linear Projection Analysis

Based on our current data points, we can establish a linear equation:


CER = -0.8482 × Hours + 27.67

This suggests (a worked version of the projection follows this list):

  • Improvement rate: ~0.85 percentage points of CER reduction per hour of training data

  • To reach 5% CER: Approximately 26.7 hours of high-quality training data needed

  • Additional data required: ~21.1 hours beyond our current dataset
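
The projection can be reproduced with a few lines of arithmetic, fitting the line through our two observed points. The post’s coefficient (-0.8482) appears to use the unrounded 05:33:00 training duration, so the rounded figures below land marginally higher (26.8 vs. 26.7 hours).

```python
# Fit a line through the two observed (hours, CER) points and solve
# for the hours of training data needed to reach the 5% CER target.
base = (0.0, 27.67)
tuned = (5.6, 22.93)

slope = (tuned[1] - base[1]) / (tuned[0] - base[0])  # ~-0.85 pp per hour
intercept = base[1]

target_cer = 5.0
hours_needed = (target_cer - intercept) / slope
print(f"slope: {slope:.4f} pp/hour")
print(f"hours for {target_cer}% CER: {hours_needed:.1f}")  # ~26.8
print(f"additional hours: {hours_needed - tuned[0]:.1f}")  # ~21.2

phase2_cer = intercept + slope * 10.0
print(f"expected CER at 10 h: {phase2_cer:.2f}%")  # ~19.2, matching Phase 2
```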

Phased Implementation Plan

Phase 1 (Current Progress)

| Metric | Value |
| --- | --- |
| Training Data | 5.6 hours |
| Data Preparation Time | 1 month |
| Syllable Count | 85,206 |
| Training Time | 2.15 hours |
| Current CER | 22.93% |

Phase 2 (Intermediate Goal)

| Metric | Value |
| --- | --- |
| Training Data | 10.0 hours |
| Data Preparation Time | 2 months |
| Syllable Count | 170,412 |
| Training Time | 4.30 hours |
| Expected CER | 19.188% |

Phase 3 (Final Target)

| Metric | Value |
| --- | --- |
| Training Data | 27.0 hours |
| Data Preparation Time | 5 months |
| Syllable Count | 426,030 |
| Training Time | 11.5 hours |
| Target CER | 5% |

Project Timeline Overview

  • Total Timeline: 5 Months

  • Data Collection Rate: ~5.4 hours of training data per month

Conclusion

This implementation plan provides a clear path to a 5% CER through three progressive phases over five months. Based on our analysis, approximately 27 hours of high-quality training data (about 426,030 syllables) will be required to reach the target CER of 5%.