STT Evaluation Report on Garchen Rinpoche by Different Models
Overview
This report evaluates three speech-to-text (STT) models on Tibetan audio recordings of Garchen Rinpoche’s teachings. It compares their accuracy using three standard metrics: character error rate (CER), word error rate (WER), and syllable error rate (SER).
Dataset Information
| Attribute | Detail |
| --- | --- |
| Dataset name | Garchen Rinpoche Audios |
| Content type | Tibetan language teachings |
| Total samples | 552 audio segments |
| Total duration | 1.21 hours (72.7 minutes) |
| Audio format | WAV (16 kHz) |
| Models evaluated | General STT, Situ Rinpoche, Dilgo Khyentse |
Character Error Rate (CER) Analysis
CER Performance by Model
| Model | Corpus-level CER | Sample Average CER | Difference |
| --- | --- | --- | --- |
| General STT | 27.53% | 27.73% | 0.20% |
| Situ Rinpoche | 28.92% | 29.17% | 0.25% |
| Dilgo Khyentse | 65.96% | 66.21% | 0.25% |
CER Calculation Methodology
| Method | Formula | Weighting | Description |
| --- | --- | --- | --- |
| Sample Avg. CER | `df['cer'].mean()` | Each sample equal | Avg. of CERs across all samples |
| Corpus CER | `cer_metric.compute()` | By total text | CER computed over combined text |
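The difference between the two methods can be illustrated with a minimal Python sketch. This is not the evaluation code behind the numbers in this report: `edit_distance`, `sample_average_cer`, and `corpus_cer` are hypothetical helpers written for illustration, the toy strings are made up, and every reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sample_average_cer(refs, hyps):
    """Mean of per-sample CERs: every sample counts equally."""
    rates = [edit_distance(r, h) / len(r) for r, h in zip(refs, hyps)]
    return sum(rates) / len(rates)


def corpus_cer(refs, hyps):
    """Total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# Toy data (hypothetical, not taken from the evaluation set).
refs = ["abcd", "abcdefghij"]
hyps = ["abxx", "abcdefghij"]

print(sample_average_cer(refs, hyps))  # (2/4 + 0/10) / 2 = 0.25
print(corpus_cer(refs, hyps))          # 2 edits over 14 reference characters
```

In this toy case the short, error-heavy sample pulls the sample average above the corpus-level value, which is the same direction as the small gaps in the table above.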
CER Observations
- General STT has the lowest corpus-level CER at 27.53%
- Situ Rinpoche is close behind at 28.92%, 1.39 percentage points higher
- Dilgo Khyentse is far worse at 65.96%, more than twice the error rate of the other two models
- Sample-average and corpus-level CER differ by at most 0.25 percentage points for every model
Word Error Rate (WER) Analysis
WER Performance by Model
| Model | Micro WER (Corpus) | Macro WER (Sample Avg.) | Difference |
| --- | --- | --- | --- |
| General STT | 57.24% | 58.75% | 1.51% |
| Situ Rinpoche | 60.68% | 62.15% | 1.47% |
| Dilgo Khyentse | 92.63% | 94.02% | 1.39% |
WER Error Breakdown
| Model | Substitutions | Insertions | Deletions | Total Words | Total Errors |
| --- | --- | --- | --- | --- | --- |
| General STT | 6,967 (75%) | 1,839 (20%) | 494 (5%) | ~16,247 | 9,300 |
| Situ Rinpoche | 7,377 (75%) | 1,858 (19%) | 624 (6%) | ~16,247 | 9,859 |
| Dilgo Khyentse | 9,679 (64%) | 352 (3%) | 5,018 (33%) | ~16,247 | 15,049 |
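A breakdown like this is typically obtained by aligning each reference with its hypothesis and counting the edit operations along one minimal alignment. The sketch below shows one way to do that in Python. It is an illustration under stated assumptions, not the script used for this report: `count_edit_ops` and `wer_from_counts` are hypothetical names, the toy word lists are made up, and when several minimal alignments exist the split between error types can depend on tie-breaking.

```python
def count_edit_ops(ref, hyp):
    """Return (substitutions, insertions, deletions) for one minimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the bottom-right corner, preferring the diagonal step.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = 1 if (i > 0 and j > 0 and ref[i - 1] != hyp[j - 1]) else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, ins, dels


def wer_from_counts(subs, ins, dels, total_ref_words):
    """WER = (S + I + D) / N, with N the number of reference words."""
    return (subs + ins + dels) / total_ref_words


# Toy example (hypothetical tokens).
print(count_edit_ops("a b c d".split(), "a x c e f".split()))  # (2, 1, 0)

# Consistency check against the General STT row of the table above.
print(f"{wer_from_counts(6967, 1839, 494, 16247):.2%}")  # 57.24%
```

The last line reproduces the reported corpus-level WER for General STT from its error counts, treating the approximate word total as exact.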
WER Observations
- General STT has the lowest WER at 57.24%
- Situ Rinpoche is close, at 60.68%
- Dilgo Khyentse performs poorly at 92.63%
- Substitution errors dominate for the General STT and Situ Rinpoche models (about 75% of errors)
- Dilgo Khyentse has an unusually high share of deletions (33% of its errors, versus 5–6% for the other two models)
Syllable Error Rate (SER) Analysis
SER Performance by Model
| Model | Micro SER (Corpus) | Macro SER (Sample Avg.) | Difference |
| --- | --- | --- | --- |
| General STT | 51.15% | 51.38% | 0.23% |
| Situ Rinpoche | 53.68% | 54.02% | 0.34% |
| Dilgo Khyentse | 91.55% | 91.39% | -0.16% |
SER Error Breakdown
| Model | Substitutions | Insertions | Deletions | Total Syllables | Total Errors |
| --- | --- | --- | --- | --- | --- |
| General STT | 7,202 (85%) | 898 (10%) | 384 (5%) | ~16,596 | 8,484 |
| Situ Rinpoche | 7,557 (85%) | 738 (8%) | 608 (7%) | ~16,596 | 8,903 |
| Dilgo Khyentse | 9,819 (65%) | 207 (1%) | 5,157 (34%) | ~16,596 | 15,183 |
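The report does not state how syllables were tokenized. As a rough sketch, assuming syllables are delimited by the Tibetan tsheg (་), the shad (།), or whitespace, SER for a single sample could be computed as follows. `split_syllables` and `syllable_error_rate` are hypothetical helpers, and the example phrase is illustrative rather than taken from the dataset.

```python
import re

TSHEG = "\u0f0b"  # ་  syllable delimiter
SHAD = "\u0f0d"   # །  clause or sentence delimiter

# Assumption: syllable boundaries are tsheg, shad, or whitespace.
_SYLLABLE_BREAK = re.compile(f"[{TSHEG}{SHAD}\\s]+")


def split_syllables(text):
    """Split Tibetan text into syllables, dropping empty pieces."""
    return [s for s in _SYLLABLE_BREAK.split(text) if s]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def syllable_error_rate(ref_text, hyp_text):
    """Syllable-level edits divided by the number of reference syllables."""
    ref = split_syllables(ref_text)
    hyp = split_syllables(hyp_text)
    return edit_distance(ref, hyp) / len(ref)


ref = "བཀྲ་ཤིས་བདེ་ལེགས།"
hyp = "བཀྲ་ཤིས་བདེ་ལེག།"  # final letter of the last syllable dropped

print(split_syllables(ref))           # four syllables
print(syllable_error_rate(ref, hyp))  # 1 wrong syllable out of 4 = 0.25
```

Here a single dropped character makes the whole syllable count as an error, which is one plausible reason SER sits above CER for every model.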
SER Observations
- General STT again leads with 51.15% SER
- For every model, SER is lower than WER but higher than CER
- Deletions account for 34% of the Dilgo Khyentse model's errors, compared with 5–7% for the other two models
- Substitutions are the largest error category for all three models (65–85% of errors)
Comparison Across All Metrics
| Model | CER | WER | SER | Major Error Type | Recommendation |
| --- | --- | --- | --- | --- | --- |
| General STT | 27.53% | 57.24% | 51.15% | Substitutions (75–85%) | Recommended |
| Situ Rinpoche | 28.92% | 60.68% | 53.68% | Substitutions (75–85%) | Good alternative |
| Dilgo Khyentse | 65.96% | 92.63% | 91.55% | Substitutions + Deletions | Not recommended |
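As a quick sanity check, the comparison can be encoded and ranked programmatically. The short Python sketch below uses only the corpus-level figures from the table above; `results` and `rank_models` are illustrative names, not part of any evaluation script.

```python
# Corpus-level error rates (%) copied from the comparison table.
results = {
    "General STT":    {"CER": 27.53, "WER": 57.24, "SER": 51.15},
    "Situ Rinpoche":  {"CER": 28.92, "WER": 60.68, "SER": 53.68},
    "Dilgo Khyentse": {"CER": 65.96, "WER": 92.63, "SER": 91.55},
}


def rank_models(results, metric):
    """Model names sorted from lowest (best) to highest (worst) error rate."""
    return sorted(results, key=lambda name: results[name][metric])


rankings = {metric: rank_models(results, metric) for metric in ("CER", "WER", "SER")}

# Does CER < SER < WER hold for every model?
ordering_holds = all(r["CER"] < r["SER"] < r["WER"] for r in results.values())

for metric, order in rankings.items():
    print(metric, " > ".join(order))
print("CER < SER < WER for all models:", ordering_holds)
```

All three metrics give the same ordering of models, and the CER < SER < WER relationship holds for each of them.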
Key Observations
- Ranking (Best to Worst):
  General STT > Situ Rinpoche > Dilgo Khyentse
- Error Type Insights:
  - Substitution is the primary source of error in all models
  - Dilgo Khyentse has disproportionately high deletion errors
- Metric Relationship:
  CER < SER < WER, showing that:
  - Character-level recognition is stronger
  - Word segmentation remains challenging
- Unexpected Outcome:
  - The general-purpose model outperformed both specialized models, including one trained on religious Tibetan speech
Conclusion
The General STT model demonstrates the best performance on the Garchen Rinpoche dataset, with the lowest CER (27.53%), WER (57.24%), and SER (51.15%). It is recommended as the primary candidate for further fine-tuning on Garchen-specific speech.
The Situ Rinpoche model is a strong secondary candidate, while the Dilgo Khyentse model shows poor generalization and is not recommended for this domain.