Tibetan Speech-to-Text Model Training and Benchmark Report

Overview

We have developed and evaluated several Tibetan speech-to-text (ASR/STT) models using different training strategies, datasets, and fine-tuning approaches. Our goal is to improve transcription accuracy for both general Tibetan speech and specific speakers (such as prominent Rinpoches), as well as across multiple dialects.

Training Methodology

Training Data Creation Resources

The following tools and technologies were used to create high-quality training data for our speech-to-text models:

  1. Voice Activity Detection
  • Used pyannote and Silero VAD to extract clean audio segments

  • Helped filter out background noise, music, and silence

  • Enhanced overall audio quality for training data preparation

  2. Speaker Diarization and Identification
  • Employed specialized algorithms for speaker detection and separation

  • Critical for extracting specific speaker data in multi-speaker recordings

  • Enabled the creation of speaker-specific fine-tuned models (e.g., for Rinpoches)

  3. Language Detection
  • Utilized Whisper for identifying and filtering out non-Tibetan voice content

  • Improved data quality by ensuring training data contained only Tibetan speech

  • Reduced noise from multilingual recordings and improved model accuracy (see the pipeline sketch after this list)
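
As a rough illustration, the sketch below chains these three steps using pyannote.audio and openai-whisper. The model identifiers and file paths are placeholders; the actual pipeline may differ in detail.

```python
# Minimal sketch of the data-preparation pipeline, assuming pyannote.audio
# and openai-whisper are installed; model names and paths are illustrative.
import whisper
from pyannote.audio import Pipeline

# 1. Voice activity detection: keep only clean speech regions.
#    (pyannote pretrained pipelines require a Hugging Face access token.)
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")
speech = vad("recording.wav")  # Annotation of detected speech segments

# 2. Speaker diarization: label who speaks when, so segments from a
#    specific speaker (e.g. a Rinpoche) can be extracted.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
for turn, _, speaker in diarizer("recording.wav").itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s-{turn.end:.1f}s")

# 3. Language detection with Whisper: drop segments that are not
#    Tibetan ("bo") before they enter the training set.
model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("segment.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
if max(probs, key=probs.get) != "bo":
    print("Non-Tibetan segment: excluded from training data")
```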

Audio-Transcript Pairing

  • Models were trained using paired audio and text transcription data

  • Audio segments ranged from 1 second to 30 seconds in length

  • Each audio clip was manually transcribed by native Tibetan speakers to ensure accuracy (see the manifest sketch after this list)
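
For illustration, one convenient way to store such pairs is a JSON-lines manifest; the field names below are assumptions, not the project's actual schema.

```python
# Hypothetical manifest builder for paired audio-transcript data;
# field names are illustrative.
import json
import soundfile as sf

def append_pair(manifest_path: str, audio_path: str, transcript: str) -> None:
    info = sf.info(audio_path)
    duration = info.frames / info.samplerate
    if not 1.0 <= duration <= 30.0:
        return  # keep only clips in the 1-30 second training range
    entry = {
        "audio": audio_path,
        "text": transcript,  # manual transcription by a native speaker
        "duration": round(duration, 2),
    }
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```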

Fine-tuning Approach

  • When we indicate “Trained from scratch” for general models, we mean fine-tuning generic ASR models specifically for Tibetan audio

  • The general Tibetan models (wav2vec2 and Whisper) were built using ~1500 hours of diverse Tibetan audio data

  • For custom speaker and dialect models, we leveraged the pre-trained general model’s knowledge and continued fine-tuning specifically for the target speaker/dialect

  • This transfer learning approach means custom models don’t require retraining on the full ~1,500 hours plus the custom data; they only need training on the specific speaker’s data (typically 4-15 hours)

  • This methodology is significantly more efficient and yields better results than training entirely from scratch for each speaker (a minimal sketch of this setup follows)
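
The sketch below shows the shape of this transfer-learning setup with Hugging Face transformers; the checkpoint name and hyperparameters are placeholders rather than the values actually used (those are in the linked documentation).

```python
# Sketch of speaker-specific fine-tuning on top of the general Tibetan
# model; checkpoint names and hyperparameters are placeholders.
from transformers import (Trainer, TrainingArguments, Wav2Vec2ForCTC,
                          Wav2Vec2Processor)

def finetune_speaker_model(speaker_dataset, data_collator):
    """speaker_dataset: 4-15 hours of the target speaker's paired clips."""
    base = "our-org/wav2vec2-tibetan-general"  # hypothetical checkpoint name
    processor = Wav2Vec2Processor.from_pretrained(base)
    model = Wav2Vec2ForCTC.from_pretrained(base)  # inherits ~1,500 h of Tibetan

    args = TrainingArguments(
        output_dir="wav2vec2-speaker-custom",
        per_device_train_batch_size=8,
        learning_rate=1e-5,   # small LR: adapt to the speaker, don't overwrite
        num_train_epochs=10,
        fp16=True,
    )
    Trainer(model=model, args=args, train_dataset=speaker_dataset,
            data_collator=data_collator, tokenizer=processor).train()
    return model
```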

Training Infrastructure and Parameters

  • Training was conducted on GPUs with 24 GB of memory

  • Training duration varied with hyperparameters such as the number of epochs and batch size

  • Detailed documentation of all training parameters for each model is available via the documentation links in the summary table below

  • Model architecture specifications, optimizer configurations, and learning rate schedules are thoroughly documented

Understanding Performance Metrics

Error Rate Metrics Explained

Detailed explanations are available in our blog post.

Character Error Rate (CER) calculates how many character-level mistakes are made in the transcription. For example, if “I like cake.” is mistakenly rendered as “I like ceke.”, the CER reflects the incorrect substitution of “e” for “a”. There would be an additional penalty if the output was “I like cekes”, because in addition to the substitution of “e” for “a”, there is also the incorrect addition of an “s”.

Word Error Rate (WER) calculates word-level mistakes in the transcription. The mistaken outputs “I like ceke.” and “I like cekes” would both be counted as having one incorrect substitution, “ceke” for “cake” or “cekes” for “cake”, respectively.

Syllable Error Rate (SER) is identical to WER except that the unit of measurement is syllables rather than words. For Tibetan, syllables are extracted by splitting strings at each tsek (་). SER can be very similar to WER in practice.
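
For concreteness, here is how these metrics can be computed with the jiwer library (our choice of toolkit here is illustrative; any error-rate toolkit works the same way). SER is simply WER over tsek-delimited syllables.

```python
# Computing CER/WER with jiwer (illustrative), plus an SER helper that
# treats tsek-delimited syllables as "words".
import jiwer

print(jiwer.cer("I like cake.", "I like ceke."))  # one substituted character
print(jiwer.wer("I like cake.", "I like ceke."))  # one substituted word

def ser(reference: str, hypothesis: str) -> float:
    """Syllable Error Rate: WER over syllables split at each tsek."""
    syllables = lambda s: " ".join(t for t in s.split("་") if t)
    return jiwer.wer(syllables(reference), syllables(hypothesis))

# One wrong syllable out of four -> SER = 0.25
print(ser("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤས་བདེ་ལེགས"))
```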

Why We Use CER for Tibetan

For Tibetan language transcription, we primarily use CER because:

  • Tibetan has a complex syllabic structure where word boundaries are not always clearly defined

  • Character-level evaluation provides more granular insight into model performance for languages with complex orthography

  • It allows for consistent comparison across different dialect models and speakers

Evaluation Methodology

All models are evaluated using a rigorous benchmark methodology:

  • We create well-distributed representative benchmark samples of audio and transcriptions that are never used during model training
  • For general models, benchmarks include diverse audio types:
    • Audio book recordings
    • Natural speech conversations
    • Teachings by lamas
    • Children’s speech, etc.
  • For custom speaker models, benchmarks consist of audio samples from the specific speaker on whom the model was trained
  • For dialect models, benchmarks include representative samples from different dialectal regions
  • Evaluation process:
    1. Run inference on benchmark audio using the trained model
    2. Compare generated transcriptions against human-created reference transcriptions
    3. Calculate CER, WER, and other metrics to quantify model accuracy
  • This methodology ensures a fair and realistic assessment of model performance across different use cases (a sketch of the evaluation loop follows)
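
The sketch below shows the shape of that loop, assuming a Hugging Face ASR pipeline and a benchmark stored as (audio, reference) pairs; the model name and benchmark format are placeholders.

```python
# Illustrative benchmark evaluation: inference, then CER/WER against
# human references. Model name and benchmark format are placeholders.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="our-org/wav2vec2-tibetan-general")

def evaluate(benchmark):
    """benchmark: iterable of {"audio": path, "text": human reference}."""
    refs, hyps = [], []
    for example in benchmark:
        hyps.append(asr(example["audio"])["text"])  # 1. run inference
        refs.append(example["text"])                # 2. human reference
    return {"cer": jiwer.cer(refs, hyps),           # 3. quantify accuracy
            "wer": jiwer.wer(refs, hyps)}
```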

Model Performance Results

Summary of Models and Performance

| Model Type | Variant/Speaker | Fine-tuning Source | Training Data | CER (%) | wav2vec2 General Model CER (%) | Benchmark Hours | Benchmark Link | Benchmark Data | Documentation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| wav2vec2 | General Model | Trained from scratch | ~1,500 hours | 15.23 | N/A | ~7.5 hours | Tibetan Voice Benchmark | Standard benchmark | Training History |
| Whisper | Whisper general model | Trained from scratch | ~1,500 hours | 19.19 | N/A | ~100 hours | Benchmark v3 | Standard benchmark | Training History |
| wav2vec2 | Multi-Dialect (Balanced) | Fine-tuned from general model | ~183 hours (balanced) | 13.97 | 36.90 | ~9 hours | Tibetan Voice Benchmark | Standard benchmark | Multi-Dialect Speech Recognition Model |
| wav2vec2 | Custom Tai Situ Rinpoche STT model | Fine-tuned from general model | ~4 hours | 8.08 | 12.60 | ~0.5 hours | Situ Rinpoche Test Set | Speaker-specific test set | Customizing STT for Tai Situ Rinpoche |
| wav2vec2 | Custom Dilgo Khyentse Rinpoche STT model | Fine-tuned from general model | ~15 hours | 24.90 | 50.04 | ~1.5 hours | Dilgo Khyentse Test Set | Speaker-specific test set | Customizing STT for Dilgo Khyentse Rinpoche |
| wav2vec2 | Custom Garchen Rinpoche STT model | Fine-tuned from general model | ~10 hours | 21.50 | 27.70 | ~1 hour | Garchen Rinpoche Benchmark | Speaker-specific benchmark | Garchen Rinpoche CER Analysis |

Performance Analysis

Factors Affecting CER Values

  • Audio Quality: Higher CER values (20%+) often correlate with poor audio quality inputs that would be challenging even for human transcription

  • Speaker Variation: Speaker-specific characteristics (accent, speech rate, clarity) significantly impact performance

  • Dialect Differences: Regional dialect variations can affect transcription accuracy

  • Training Data Diversity: Models trained on more diverse data tend to perform better on general benchmarks

Commercial STT Comparison

  • Generic wav2vec2 models lack the capability to generate Tibetan transcriptions - they can extract audio features but cannot map them to Tibetan text

  • Generic Whisper models can attempt transcription but produce gibberish output because they cannot map Tibetan speech to proper Tibetan words

  • Our specialized models provide significant improvements over these baselines

Key Findings

  • Speaker-specific models consistently outperform general models on speaker-specific benchmarks

  • The most significant improvement was seen with the Tai Situ Rinpoche model (8.08% CER vs 12.60% for general model)

  • The Multi-Dialect model demonstrates that balanced dialect training data greatly improves performance (13.97% CER vs 36.90% for general model)

  • Performance improvements correlate with the quality and quantity of speaker-specific training data

Handling Mixed-Language Content and English Recordings

Approach for English Recordings with Tibetan Terms

To address the specific needs of Tibet House regarding English recordings that contain Tibetan terminology, we are implementing a specialized approach:

  1. Mixed-Language Transcription Strategy

    • Base transcription will be in English, with consistent representation of embedded Tibetan terms
    • We will develop and maintain a vocabulary mapping list that pairs Tibetan terms with their standardized English representations
    • This approach allows flexible post-processing of transcriptions based on specific needs
  2. Vocabulary Mapping Implementation

    • For each Tibetan term spoken in English recordings, we’ll have a consistent English representation
    • This mapping enables us to:
      • Maintain consistency across all transcriptions
      • Optionally replace representations with alternative spellings when needed
      • Address potential overlaps between Tibetan word representations and existing English vocabulary
  3. Enhanced Transcription Methods

    • Utilize prompt-based techniques with models like Whisper to ensure consistent handling of specialized terminology
    • Implement context-aware transcription that recognizes when Tibetan terms are being used in English speech (see the sketch after this list)
    • Reference: Whisper Prompting Guide
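
A rough sketch of both ideas follows; the mapping entries and prompt are invented examples, not the maintained vocabulary list.

```python
# Sketch of vocabulary mapping plus Whisper prompting; the term map and
# prompt below are invented examples, not the maintained list.
import whisper

# Variant spelling -> standardized English representation of the term.
TERM_MAP = {
    "tong len": "tonglen",
    "bodhicitta": "bodhichitta",
}

def normalize_terms(transcript: str) -> str:
    """Post-process a transcript so embedded Tibetan terms are consistent."""
    for variant, standard in TERM_MAP.items():
        transcript = transcript.replace(variant, standard)
    return transcript

# Prompting Whisper with expected terminology nudges it toward consistent
# spellings (see the Whisper Prompting Guide referenced above).
model = whisper.load_model("small")
result = model.transcribe(
    "teaching.mp3",
    initial_prompt="A Buddhist teaching that mentions tonglen and bodhichitta.",
)
print(normalize_terms(result["text"]))
```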

Why Commercial Models Are Insufficient

Standard commercial speech recognition models face significant challenges with mixed-language content and specialized speakers for several reasons:

  1. Accent Recognition Limitations

    • Research shows that English-only models perform well primarily on standard English with moderate accent diversity
    • Multilingual models demonstrate better performance with high accent variability and underrepresented accents
    • Reference: Whisper Model Bias Analysis
  2. Domain-Specific Terminology

    • Generic models lack exposure to specialized Buddhist terminology and concepts
    • Mixed Tibetan-English speech presents unique challenges not addressed in commercial training datasets
    • Specialized terminology often gets mistranscribed due to low occurrence in general training data
  3. Speaker-Specific Characteristics

    • Teachers like Geshe la have unique speech patterns, rhythms, and pronunciation that generic models struggle with
    • The specific context of teachings includes specialized vocabulary that commercial models rarely encounter

Benefits of Fine-tuning on Geshe la’s Recordings

Fine-tuning on speaker-specific data will yield significant improvements:

  1. Accent Adaptation

    • Models will learn the specific accent patterns and speech characteristics of Geshe la
    • As demonstrated in our speaker-specific models (e.g., Tai Situ Rinpoche model with 8.08% CER vs 12.60% for general model)
  2. Terminology Recognition

    • Fine-tuned models will better recognize specialized Buddhist terminology and concepts
    • The model will learn consistent patterns for how Tibetan terms are pronounced in English context
  3. Research-Backed Approach

  4. Implementation Process

Extracting Tibetan Terms from English Teachings

We can effectively extract and identify Tibetan terms used within English teachings:

  1. Term Identification Process

    • Utilize the vocabulary mapping list of Tibetan terms and their English representations
    • Apply natural language processing techniques to identify these terms in transcriptions
    • Leverage existing transcribed SRT files from Geshe la’s teachings as a foundation
  2. Applications

    • Create specialized glossaries from teachings
    • Enable search functionality for specific Tibetan concepts across English language materials
    • Support educational initiatives by highlighting key terminology (a term-spotting sketch follows this list)
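
As a sketch, term spotting over an existing SRT transcript can be as simple as the following; the term list and file name are placeholders.

```python
# Illustrative term spotting over an English SRT transcript; the term
# list and file path are placeholders.
import re

TIBETAN_TERMS = ["tonglen", "bodhichitta", "lojong"]  # from the mapping list

def count_terms(text: str) -> dict:
    """Count occurrences of known Tibetan terms in an English transcript."""
    counts = {}
    for term in TIBETAN_TERMS:
        hits = re.findall(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts

with open("teaching.en.srt", encoding="utf-8") as f:
    print(count_terms(f.read()))  # e.g. {"tonglen": 3, "lojong": 1}
```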

Training Recommendations for English Model

Based on our experience and research, we recommend:

  1. Audio Segment Length

    • Split longer recordings into audio segments of at most 30 seconds for optimal training (as sketched after this list)
    • This aligns with state-of-the-art models like Whisper, which are pre-trained on 30-second segments
    • Reference: Whisper Fine-tuning Guide
  2. Training Data Requirements

    • For speaker-specific fine-tuning, a few hours of high-quality audio is optimal
    • Based on our experience with Rinpoche models (4-15 hours yielded significant improvements)
    • Transcriptions should maintain consistent representation of Tibetan terms
  3. Training Methodology

    • Leverage transfer learning by starting with a pre-trained model
    • Fine-tune specifically on Geshe la’s speech patterns and terminology
    • This approach is significantly more efficient than training from scratch
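
As a minimal sketch of the segmentation step, fixed 30-second windows can be cut with pydub; in practice, silence- or VAD-aware boundaries as described earlier give cleaner, speech-aligned segments.

```python
# Naive fixed-window chunking into <=30 s segments with pydub; the VAD
# step described earlier gives cleaner, speech-aligned boundaries.
from pydub import AudioSegment

def chunk_audio(path: str, max_seconds: int = 30) -> None:
    audio = AudioSegment.from_file(path)
    step = max_seconds * 1000  # pydub indexes audio in milliseconds
    for i, start in enumerate(range(0, len(audio), step)):
        audio[start:start + step].export(f"segment_{i:04d}.wav", format="wav")

chunk_audio("long_teaching.mp3")
```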

We anticipate that a fine-tuned model for English recordings with Tibetan terminology will show performance improvements similar to our speaker-specific Tibetan models, with meaningful CER and WER reductions compared to generic models.