STT Evaluation Report on Garchen Rinpoche by Different Models

:card_index_dividers: Overview

This report evaluates three speech recognition models on Tibetan audio recordings of Garchen Rinpoche’s teachings. It compares their accuracy using three standard metrics: character error rate (CER), word error rate (WER), and syllable error rate (SER).


:file_folder: Dataset Information

| Attribute | Detail |
|---|---|
| Dataset name | Garchen Rinpoche Audios |
| Content type | Tibetan language teachings |
| Total samples | 552 audio segments |
| Total duration | 1.21 hours (72.7 minutes) |
| Audio format | WAV (16 kHz) |
| Models evaluated | General STT, Situ Rinpoche, Dilgo Khyentse |

:input_latin_lowercase: Character Error Rate (CER) Analysis

CER Performance by Model

| Model | Corpus-level CER | Sample Average CER | Difference (Sample Avg. - Corpus) |
|---|---|---|---|
| General STT | 27.53% | 27.73% | 0.20% |
| Situ Rinpoche | 28.92% | 29.17% | 0.25% |
| Dilgo Khyentse | 65.96% | 66.21% | 0.25% |

CER Calculation Methodology

| Method | Formula | Weighting | Description |
|---|---|---|---|
| Sample Avg. CER | `df['cer'].mean()` | Each sample equal | Avg. of CERs across all samples |
| Corpus CER | `cer_metric.compute()` | By total text | CER computed over combined text |
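The difference between the two weightings can be illustrated with a small, library-free sketch. It does not reproduce the pipeline behind `df['cer'].mean()` or `cer_metric.compute()`; the function names and the toy strings are assumptions made for illustration, and references are assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sample_average_cer(refs, hyps):
    """Mean of per-sample CERs: every segment counts equally."""
    # Assumes every reference is non-empty (otherwise len(r) is zero).
    rates = [edit_distance(r, h) / len(r) for r, h in zip(refs, hyps)]
    return sum(rates) / len(rates)


def corpus_cer(refs, hyps):
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return errors / chars


# Toy data, made up for illustration (not taken from the dataset).
refs = ["abcd", "abcdefghij"]
hyps = ["abxd", "abcdefghij"]

print(sample_average_cer(refs, hyps))  # (1/4 + 0/10) / 2 = 0.125
print(corpus_cer(refs, hyps))          # 1 error / 14 characters, about 0.071
```

In this toy case the single error falls in the short segment, so the sample average comes out higher than the corpus-level figure. The report's tables show the same direction of difference, though much smaller.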

CER Observations

  • :white_check_mark: General STT shows the lowest error rate at 27.53%
  • Situ Rinpoche model performs similarly with 28.92%
  • Dilgo Khyentse performs far worse at 65.96%, about 2.4 times the General STT error rate
  • Sample-average and corpus-level CER differ by only 0.20 to 0.25 percentage points

:memo: Word Error Rate (WER) Analysis

WER Performance by Model

| Model | Micro WER (Corpus) | Macro WER (Sample Avg.) | Difference (Macro - Micro) |
|---|---|---|---|
| General STT | 57.24% | 58.75% | 1.51% |
| Situ Rinpoche | 60.68% | 62.15% | 1.47% |
| Dilgo Khyentse | 92.63% | 94.02% | 1.39% |

WER Error Breakdown

| Model | Substitutions | Insertions | Deletions | Total Words | Total Errors |
|---|---|---|---|---|---|
| General STT | 6,967 (75%) | 1,839 (20%) | 494 (5%) | ~16,247 | 9,300 |
| Situ Rinpoche | 7,377 (75%) | 1,858 (19%) | 624 (6%) | ~16,247 | 9,859 |
| Dilgo Khyentse | 9,679 (64%) | 352 (3%) | 5,018 (33%) | ~16,247 | 15,049 |

WER Observations

  • General STT model outperforms others at 57.24%
  • Situ Rinpoche is close, at 60.68%
  • :cross_mark: Dilgo Khyentse model performs poorly with 92.63%
  • Substitution errors dominate for General and Situ models
  • Deletions make up 33% of Dilgo Khyentse's errors, compared with 5 to 6% for the other two models
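The corpus-level WER figures can be cross-checked against the error breakdown with a few lines of arithmetic. This sketch assumes the usual definition WER = (S + I + D) / N and treats the approximate total of ~16,247 reference words as exact; the dictionary layout and the `micro_wer` name are illustrative only.

```python
# Error counts copied from the WER Error Breakdown table.
# TOTAL_WORDS is the approximate reference length (~16,247) given there,
# treated as exact for this check.
TOTAL_WORDS = 16_247

counts = {
    "General STT": {"sub": 6967, "ins": 1839, "del": 494},
    "Situ Rinpoche": {"sub": 7377, "ins": 1858, "del": 624},
    "Dilgo Khyentse": {"sub": 9679, "ins": 352, "del": 5018},
}


def micro_wer(c, n_ref):
    """Corpus-level WER: all errors divided by all reference words."""
    return (c["sub"] + c["ins"] + c["del"]) / n_ref


for model, c in counts.items():
    total_errors = sum(c.values())
    # Share of each error type among this model's errors.
    shares = {kind: n / total_errors for kind, n in c.items()}
    print(
        f"{model}: errors={total_errors}, "
        f"WER={micro_wer(c, TOTAL_WORDS):.2%}, "
        f"sub={shares['sub']:.1%}, ins={shares['ins']:.1%}, del={shares['del']:.1%}"
    )
```

Under these assumptions the recomputed rates are 57.24%, 60.68% and 92.63%, which agree with the Micro WER column.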

:input_latin_uppercase: Syllable Error Rate (SER) Analysis

SER Performance by Model

| Model | Micro SER (Corpus) | Macro SER (Sample Avg.) | Difference (Macro - Micro) |
|---|---|---|---|
| General STT | 51.15% | 51.38% | 0.23% |
| Situ Rinpoche | 53.68% | 54.02% | 0.34% |
| Dilgo Khyentse | 91.55% | 91.39% | -0.16% |

SER Error Breakdown

| Model | Substitutions | Insertions | Deletions | Total Syllables | Total Errors |
|---|---|---|---|---|---|
| General STT | 7,202 (85%) | 898 (10%) | 384 (5%) | ~16,596 | 8,484 |
| Situ Rinpoche | 7,557 (85%) | 738 (8%) | 608 (7%) | ~16,596 | 8,903 |
| Dilgo Khyentse | 9,819 (65%) | 207 (1%) | 5,157 (34%) | ~16,596 | 15,183 |

SER Observations

  • :white_check_mark: General STT again leads with 51.15% SER
  • SER is consistently lower than WER, but higher than CER
  • Deletions account for 34% of the Dilgo Khyentse model's errors, against 5 to 7% for the other two models
  • Substitutions are the largest error category for every model (65 to 85% of errors)
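The report does not say how text was split into syllables or how errors were classified. The following is a minimal sketch of one plausible approach: it assumes syllables are delimited by the tsheg (U+0F0B), shad (U+0F0D) or whitespace, and counts substitutions, insertions and deletions by backtracing a standard minimum-edit alignment. The function names and the example phrase are illustrative assumptions.

```python
import re

TSHEG = "\u0f0b"  # Tibetan intersyllabic mark
SHAD = "\u0f0d"   # Tibetan clause/sentence mark

# Assumption: syllables are delimited by tsheg, shad, or whitespace.
_DELIMS = re.compile(r"[\u0f0b\u0f0d\s]+")


def syllables(text):
    """Split Tibetan text into a list of syllables."""
    return [s for s in _DELIMS.split(text) if s]


def error_counts(ref, hyp):
    """Return (substitutions, insertions, deletions) for two token lists."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the end, classifying each step of one optimal alignment.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1 if (i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]) else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            subs += cost
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels


# Illustrative phrase only; not taken from the dataset.
ref_text = TSHEG.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"]) + SHAD
hyp_text = TSHEG.join(["བཀྲ", "ཤིས", "ལེགས"]) + SHAD  # one syllable dropped

ref_syl, hyp_syl = syllables(ref_text), syllables(hyp_text)
s, i_, d = error_counts(ref_syl, hyp_syl)
print(s, i_, d)                      # 0 0 1
print((s + i_ + d) / len(ref_syl))   # SER for this pair: 0.25
```

When several alignments have the same minimum cost, the split between substitutions, insertions and deletions can vary with the tie-breaking order, so counts from another tool may differ slightly even when the total error count agrees.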

:bar_chart: Comparison Across All Metrics

| Model | CER | WER | SER | Major Error Type | Recommendation |
|---|---|---|---|---|---|
| General STT | 27.53% | 57.24% | 51.15% | Substitutions (75–85%) | :white_check_mark: Recommended |
| Situ Rinpoche | 28.92% | 60.68% | 53.68% | Substitutions (75–85%) | Good alternative |
| Dilgo Khyentse | 65.96% | 92.63% | 91.55% | Substitutions + Deletions | :cross_mark: Not recommended |

:magnifying_glass_tilted_left: Key Observations

  1. Ranking (Best to Worst):
    General STT > Situ Rinpoche > Dilgo Khyentse

  2. Error Type Insights:

    • Substitution is the primary source of error in all models
    • Dilgo Khyentse has disproportionately high deletion errors

  3. Metric Relationship:
    CER < SER < WER for every model, showing that:

    • Character-level recognition is stronger than syllable-level or word-level recognition
    • Word segmentation remains challenging

  4. Unexpected Outcome:

    • The general-purpose model outperformed both specialized models, including one trained on religious Tibetan speech

:white_check_mark: Conclusion

The General STT model demonstrates the best performance on the Garchen Rinpoche dataset, with the lowest CER (27.53%), WER (57.24%), and SER (51.15%). It is recommended as the primary candidate for further fine-tuning on Garchen-specific speech.

The Situ Rinpoche model is a strong secondary candidate, while the Dilgo Khyentse model shows poor generalization and is not recommended for this domain.
