Tibetan Speech-to-Text Model Training and Benchmark Report

Overview

We have developed and evaluated several Tibetan speech-to-text (ASR/STT) models using different training strategies, datasets, and fine-tuning approaches. Our goal is to improve transcription accuracy for both general Tibetan speech and specific speakers (such as prominent Rinpoches), as well as across multiple dialects.

Training Methodology

Training Data Creation Resources

The following tools and technologies were used to create high-quality training data for our speech-to-text models:

  1. Voice Activity Detection
  • Used pyannote and Silero VAD to extract clean audio segments

  • Helped filter out background noise, music, and silence

  • Enhanced overall audio quality for training data preparation

  2. Speaker Diarization and Identification
  • Employed specialized algorithms for speaker detection and separation

  • Critical for extracting specific speaker data in multi-speaker recordings

  • Enabled the creation of speaker-specific fine-tuned models (e.g., for Rinpoches)

  3. Language Detection
  • Utilized Whisper for identifying and filtering out non-Tibetan voice content

  • Improved data quality by ensuring training data contained only Tibetan speech

  • Reduced noise from multilingual recordings and improved model accuracy (see the pipeline sketch after this list)
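
As a rough illustration, the sketch below chains these three steps using pyannote.audio and openai-whisper. The model identifiers and file paths are placeholders; the actual pipeline may differ in detail.

```python
# Minimal sketch of the data-preparation pipeline, assuming pyannote.audio
# and openai-whisper are installed; model names and paths are illustrative.
import whisper
from pyannote.audio import Pipeline

# 1. Voice activity detection: keep only clean speech regions.
#    (pyannote pretrained pipelines require a Hugging Face access token.)
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")
speech = vad("recording.wav")  # Annotation of detected speech segments

# 2. Speaker diarization: label who speaks when, so segments from a
#    specific speaker (e.g. a Rinpoche) can be extracted.
diarizer = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
for turn, _, speaker in diarizer("recording.wav").itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s-{turn.end:.1f}s")

# 3. Language detection with Whisper: drop segments that are not
#    Tibetan ("bo") before they enter the training set.
model = whisper.load_model("small")
audio = whisper.pad_or_trim(whisper.load_audio("segment.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
if max(probs, key=probs.get) != "bo":
    print("Non-Tibetan segment: excluded from training data")
```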

Audio-Transcript Pairing

  • Models were trained using paired audio and text transcription data

  • Audio segments ranged from 1 second to 30 seconds in length

  • Each audio clip was manually transcribed by native Tibetan speakers to ensure accuracy (see the manifest sketch after this list)
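
For illustration, one convenient way to store such pairs is a JSON-lines manifest; the field names below are assumptions, not the project's actual schema.

```python
# Hypothetical manifest builder for paired audio-transcript data;
# field names are illustrative.
import json
import soundfile as sf

def append_pair(manifest_path: str, audio_path: str, transcript: str) -> None:
    info = sf.info(audio_path)
    duration = info.frames / info.samplerate
    if not 1.0 <= duration <= 30.0:
        return  # keep only clips in the 1-30 second training range
    entry = {
        "audio": audio_path,
        "text": transcript,  # manual transcription by a native speaker
        "duration": round(duration, 2),
    }
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```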

Fine-tuning Approach

  • When we indicate “Trained from scratch” for general models, we mean fine-tuning generic ASR models specifically for Tibetan audio

  • The general Tibetan models (wav2vec2 and Whisper) were built using ~1500 hours of diverse Tibetan audio data

  • For custom speaker and dialect models, we leveraged the pre-trained general model’s knowledge and continued fine-tuning specifically for the target speaker/dialect

  • This transfer learning approach means custom models don’t require retraining on the full ~1,500 hours plus the custom data; they only need training on the specific speaker’s data (typically 4-15 hours)

  • This methodology is significantly more efficient and yields better results than training entirely from scratch for each speaker (a minimal sketch of this setup follows)
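
The sketch below shows the shape of this transfer-learning setup with Hugging Face transformers; the checkpoint name and hyperparameters are placeholders rather than the values actually used (those are in the linked documentation).

```python
# Sketch of speaker-specific fine-tuning on top of the general Tibetan
# model; checkpoint names and hyperparameters are placeholders.
from transformers import (Trainer, TrainingArguments, Wav2Vec2ForCTC,
                          Wav2Vec2Processor)

def finetune_speaker_model(speaker_dataset, data_collator):
    """speaker_dataset: 4-15 hours of the target speaker's paired clips."""
    base = "our-org/wav2vec2-tibetan-general"  # hypothetical checkpoint name
    processor = Wav2Vec2Processor.from_pretrained(base)
    model = Wav2Vec2ForCTC.from_pretrained(base)  # inherits ~1,500 h of Tibetan

    args = TrainingArguments(
        output_dir="wav2vec2-speaker-custom",
        per_device_train_batch_size=8,
        learning_rate=1e-5,   # small LR: adapt to the speaker, don't overwrite
        num_train_epochs=10,
        fp16=True,
    )
    Trainer(model=model, args=args, train_dataset=speaker_dataset,
            data_collator=data_collator, tokenizer=processor).train()
    return model
```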

Training Infrastructure and Parameters

  • Training was conducted on GPUs with 24 GB of memory

  • Training duration varied with hyperparameters such as the number of epochs and batch size

  • Detailed documentation of all training parameters for each model is available via the documentation links in the summary table below

  • Model architecture specifications, optimizer configurations, and learning rate schedules are thoroughly documented

Understanding Performance Metrics

Error Rate Metrics Explained

Detailed explanations are available in our blog post.

Character Error Rate (CER) calculates how many character-level mistakes are made in the transcription. For example, if “I like cake.” is mistakenly rendered as “I like ceke.”, the CER reflects the incorrect substitution of “e” for “a”. There would be an additional penalty if the output was “I like cekes”, because in addition to the substitution of “e” for “a”, there is also the incorrect addition of an “s”.

Word Error Rate (WER) calculates word-level mistakes in the transcription. The mistaken outputs “I like ceke.” and “I like cekes” would both be counted as having one incorrect substitution, “ceke” for “cake” or “cekes” for “cake”, respectively.

Syllable Error Rate (SER) is identical to WER except that the unit of measurement is syllables rather than words. For Tibetan, syllables are extracted by splitting strings at each tsek (་). SER can be very similar to WER in practice.
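
For concreteness, here is how these metrics can be computed with the jiwer library (our choice of toolkit here is illustrative; any error-rate toolkit works the same way). SER is simply WER over tsek-delimited syllables.

```python
# Computing CER/WER with jiwer (illustrative), plus an SER helper that
# treats tsek-delimited syllables as "words".
import jiwer

print(jiwer.cer("I like cake.", "I like ceke."))  # one substituted character
print(jiwer.wer("I like cake.", "I like ceke."))  # one substituted word

def ser(reference: str, hypothesis: str) -> float:
    """Syllable Error Rate: WER over syllables split at each tsek."""
    syllables = lambda s: " ".join(t for t in s.split("་") if t)
    return jiwer.wer(syllables(reference), syllables(hypothesis))

# One wrong syllable out of four -> SER = 0.25
print(ser("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤས་བདེ་ལེགས"))
```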

Why We Use CER for Tibetan

For Tibetan language transcription, we primarily use CER because:

  • Tibetan has a complex syllabic structure where word boundaries are not always clearly defined

  • Character-level evaluation provides more granular insight into model performance for languages with complex orthography

  • It allows for consistent comparison across different dialect models and speakers

Evaluation Methodology

All models are evaluated using a rigorous benchmark methodology:

  • We create well-distributed representative benchmark samples of audio and transcriptions that are never used during model training
  • For general models, benchmarks include diverse audio types:
    • Audio book recordings
    • Natural speech conversations
    • Teachings by lamas
    • Children’s speech, etc.
  • For custom speaker models, benchmarks consist of audio samples from the specific speaker on whom the model was trained
  • For dialect models, benchmarks include representative samples from different dialectal regions
  • Evaluation process:
    1. Run inference on benchmark audio using the trained model
    2. Compare generated transcriptions against human-created reference transcriptions
    3. Calculate CER, WER, and other metrics to quantify model accuracy
  • This methodology ensures a fair and realistic assessment of model performance across different use cases (a sketch of the evaluation loop follows)
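
The sketch below shows the shape of that loop, assuming a Hugging Face ASR pipeline and a benchmark stored as (audio, reference) pairs; the model name and benchmark format are placeholders.

```python
# Illustrative benchmark evaluation: inference, then CER/WER against
# human references. Model name and benchmark format are placeholders.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="our-org/wav2vec2-tibetan-general")

def evaluate(benchmark):
    """benchmark: iterable of {"audio": path, "text": human reference}."""
    refs, hyps = [], []
    for example in benchmark:
        hyps.append(asr(example["audio"])["text"])  # 1. run inference
        refs.append(example["text"])                # 2. human reference
    return {"cer": jiwer.cer(refs, hyps),           # 3. quantify accuracy
            "wer": jiwer.wer(refs, hyps)}
```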

Model Performance Results

Summary of Models and Performance

| Model Type | Variant/Speaker | Fine-tuning Source | Training Data | CER (%) | wav2vec2 General Model CER (%) | Benchmark Hours | Benchmark Link | Benchmark Data | Documentation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| wav2vec2 | General Model | Trained from scratch | ~1,500 hours | 15.23 | N/A | ~7.5 hours | Tibetan Voice Benchmark | Standard benchmark | Training History |
| Whisper | Whisper general model | Trained from scratch | ~1,500 hours | 19.19 | N/A | ~100 hours | Benchmark v3 | Standard benchmark | Training History |
| wav2vec2 | Multi-Dialect (Balanced) | Fine-tuned from general model | ~183 hours (balanced) | 13.97 | 36.90 | ~9 hours | Tibetan Voice Benchmark | Standard benchmark | Multi-Dialect Speech Recognition Model |
| wav2vec2 | Custom Tai Situ Rinpoche STT model | Fine-tuned from general model | ~4 hours | 8.08 | 12.60 | ~0.5 hours | Situ Rinpoche Test Set | Speaker-specific test set | Customizing STT for Tai Situ Rinpoche |
| wav2vec2 | Custom Dilgo Khyentse Rinpoche STT model | Fine-tuned from general model | ~15 hours | 24.90 | 50.04 | ~1.5 hours | Dilgo Khyentse Test Set | Speaker-specific test set | Customizing STT for Dilgo Khyentse Rinpoche |
| wav2vec2 | Custom Garchen Rinpoche STT model | Fine-tuned from general model | ~10 hours | 21.50 | 27.70 | ~1 hour | Garchen Rinpoche Benchmark | Speaker-specific benchmark | Garchen Rinpoche CER Analysis |

Performance Analysis

Factors Affecting CER Values

  • Audio Quality: Higher CER values (20%+) often correlate with poor audio quality inputs that would be challenging even for human transcription

  • Speaker Variation: Speaker-specific characteristics (accent, speech rate, clarity) significantly impact performance

  • Dialect Differences: Regional dialect variations can affect transcription accuracy

  • Training Data Diversity: Models trained on more diverse data tend to perform better on general benchmarks

Commercial STT Comparison

  • Generic wav2vec2 models lack the capability to generate Tibetan transcriptions - they can extract audio features but cannot map them to Tibetan text

  • Generic Whisper models can attempt transcription but produce gibberish output because they cannot map Tibetan speech to proper Tibetan words

  • Our specialized models provide significant improvements over these baselines

Key Findings

  • Speaker-specific models consistently outperform general models on speaker-specific benchmarks

  • The most significant improvement was seen with the Tai Situ Rinpoche model (8.08% CER vs 12.60% for general model)

  • The Multi-Dialect model demonstrates that balanced dialect training data greatly improves performance (13.97% CER vs 36.90% for general model)

  • Performance improvements correlate with the quality and quantity of speaker-specific training data

Handling Mixed-Language Content and English Recordings

Approach for English Recordings with Tibetan Terms

To address the specific needs of Tibet House regarding English recordings that contain Tibetan terminology, we are implementing a specialized approach:

  1. Mixed-Language Transcription Strategy

    • Base transcription will be in English, with consistent representation of embedded Tibetan terms
    • We will develop and maintain a vocabulary mapping list that pairs Tibetan terms with their standardized English representations
    • This approach allows flexible post-processing of transcriptions based on specific needs
  2. Vocabulary Mapping Implementation

    • For each Tibetan term spoken in English recordings, we’ll have a consistent English representation
    • This mapping enables us to:
      • Maintain consistency across all transcriptions
      • Optionally replace representations with alternative spellings when needed
      • Address potential overlaps between Tibetan word representations and existing English vocabulary
  3. Enhanced Transcription Methods

    • Utilize prompt-based techniques with models like Whisper to ensure consistent handling of specialized terminology
    • Implement context-aware transcription that recognizes when Tibetan terms are being used in English speech (see the sketch after this list)
    • Reference: Whisper Prompting Guide
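
A rough sketch of both ideas follows; the mapping entries and prompt are invented examples, not the maintained vocabulary list.

```python
# Sketch of vocabulary mapping plus Whisper prompting; the term map and
# prompt below are invented examples, not the maintained list.
import whisper

# Variant spelling -> standardized English representation of the term.
TERM_MAP = {
    "tong len": "tonglen",
    "bodhicitta": "bodhichitta",
}

def normalize_terms(transcript: str) -> str:
    """Post-process a transcript so embedded Tibetan terms are consistent."""
    for variant, standard in TERM_MAP.items():
        transcript = transcript.replace(variant, standard)
    return transcript

# Prompting Whisper with expected terminology nudges it toward consistent
# spellings (see the Whisper Prompting Guide referenced above).
model = whisper.load_model("small")
result = model.transcribe(
    "teaching.mp3",
    initial_prompt="A Buddhist teaching that mentions tonglen and bodhichitta.",
)
print(normalize_terms(result["text"]))
```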

Why Commercial Models Are Insufficient

Standard commercial speech recognition models face significant challenges with mixed-language content and specialized speakers for several reasons:

  1. Accent Recognition Limitations

    • Research shows that English-only models perform well primarily on standard English with moderate accent diversity
    • Multilingual models demonstrate better performance with high accent variability and underrepresented accents
    • Reference: Whisper Model Bias Analysis
  2. Domain-Specific Terminology

    • Generic models lack exposure to specialized Buddhist terminology and concepts
    • Mixed Tibetan-English speech presents unique challenges not addressed in commercial training datasets
    • Specialized terminology often gets mistranscribed due to low occurrence in general training data
  3. Speaker-Specific Characteristics

    • Teachers like Geshe la have unique speech patterns, rhythms, and pronunciation that generic models struggle with
    • The specific context of teachings includes specialized vocabulary that commercial models rarely encounter

Benefits of Fine-tuning on Geshe la’s Recordings

Fine-tuning on speaker-specific data will yield significant improvements:

  1. Accent Adaptation

    • Models will learn the specific accent patterns and speech characteristics of Geshe la
    • As demonstrated in our speaker-specific models (e.g., Tai Situ Rinpoche model with 8.08% CER vs 12.60% for general model)
  2. Terminology Recognition

    • Fine-tuned models will better recognize specialized Buddhist terminology and concepts
    • The model will learn consistent patterns for how Tibetan terms are pronounced in English context
  3. Research-Backed Approach

  4. Implementation Process

Extracting Tibetan Terms from English Teachings

We can effectively extract and identify Tibetan terms used within English teachings:

  1. Term Identification Process

    • Utilize the vocabulary mapping list of Tibetan terms and their English representations
    • Apply natural language processing techniques to identify these terms in transcriptions
    • Leverage existing transcribed SRT files from Geshe la’s teachings as a foundation
  2. Applications

    • Create specialized glossaries from teachings
    • Enable search functionality for specific Tibetan concepts across English language materials
    • Support educational initiatives by highlighting key terminology (a term-spotting sketch follows this list)
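
As a sketch, term spotting over an existing SRT transcript can be as simple as the following; the term list and file name are placeholders.

```python
# Illustrative term spotting over an English SRT transcript; the term
# list and file path are placeholders.
import re

TIBETAN_TERMS = ["tonglen", "bodhichitta", "lojong"]  # from the mapping list

def count_terms(text: str) -> dict:
    """Count occurrences of known Tibetan terms in an English transcript."""
    counts = {}
    for term in TIBETAN_TERMS:
        hits = re.findall(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts

with open("teaching.en.srt", encoding="utf-8") as f:
    print(count_terms(f.read()))  # e.g. {"tonglen": 3, "lojong": 1}
```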

Training Recommendations for English Model

Based on our experience and research, we recommend:

  1. Audio Segment Length

    • Split longer recordings into audio segments of at most 30 seconds for optimal training (as sketched after this list)
    • This aligns with state-of-the-art models like Whisper, which are pre-trained on 30-second segments
    • Reference: Whisper Fine-tuning Guide
  2. Training Data Requirements

    • For speaker-specific fine-tuning, a few hours of high-quality audio is optimal
    • Based on our experience with Rinpoche models (4-15 hours yielded significant improvements)
    • Transcriptions should maintain consistent representation of Tibetan terms
  3. Training Methodology

    • Leverage transfer learning by starting with a pre-trained model
    • Fine-tune specifically on Geshe la’s speech patterns and terminology
    • This approach is significantly more efficient than training from scratch
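
As a minimal sketch of the segmentation step, fixed 30-second windows can be cut with pydub; in practice, silence- or VAD-aware boundaries as described earlier give cleaner, speech-aligned segments.

```python
# Naive fixed-window chunking into <=30 s segments with pydub; the VAD
# step described earlier gives cleaner, speech-aligned boundaries.
from pydub import AudioSegment

def chunk_audio(path: str, max_seconds: int = 30) -> None:
    audio = AudioSegment.from_file(path)
    step = max_seconds * 1000  # pydub indexes audio in milliseconds
    for i, start in enumerate(range(0, len(audio), step)):
        audio[start:start + step].export(f"segment_{i:04d}.wav", format="wav")

chunk_audio("long_teaching.mp3")
```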

We anticipate that a fine-tuned model for English recordings with Tibetan terminology will show performance improvements similar to our speaker-specific Tibetan models, with meaningful CER and WER reductions compared to generic models.