Whisper Tibetan Model Evaluation Summary
0. Experiment Overview
This experiment evaluates and compares two fine-tuned Whisper-small models for Tibetan Automatic Speech Recognition (ASR).
The goal is to assess the effect of using native Tibetan script tokens (via a custom tokenizer) versus Wylie transliteration (using the default Whisper tokenizer).
Model A — Wylie-based Whisper model
- Model name: whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan
- Base model: openai/whisper-small
- Tokenizer: Default Whisper tokenizer
- Training data: 5.6 hours of Tibetan audio paired with Wylie transliterations
- Training steps: 4000
- Batch size: 16
- Gradient accumulation: 2
- Input: Audio + Wylie transcript
- Evaluation: After transcription, the predicted Wylie text is converted back to Tibetan script and compared against the Tibetan ground truth in the benchmark dataset (a conversion sketch appears at the end of this overview).
Model B — Tibetan-script Whisper model with added tokens
- Model name: whisper-small-latin-added-tibetan-checkpoint-4000
- Base model: openai/whisper-small
- Tokenizer: Custom tokenizer extending the default Whisper tokenizer with added Tibetan script tokens
- Training data: Same 5.6 hours of Tibetan audio
- Training steps: 4000
- Batch size: 16
- Gradient accumulation: 2
- Input: Audio + Tibetan script transcript (tokenized using added Tibetan tokens)
- Evaluation: The model's Tibetan-script output is compared directly against the Tibetan ground truth.
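Both models share the same training setup (4,000 steps, per-device batch size 16, gradient accumulation 2). Below is a minimal sketch of how that maps onto Hugging Face `Seq2SeqTrainingArguments`; the output path, learning rate, and precision settings are assumptions typical of Whisper fine-tuning, not values reported here.

```python
# Sketch only: batch size, accumulation, and step count come from this report;
# everything else (output_dir, learning_rate, fp16) is an assumed placeholder.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-tibetan",    # hypothetical output path
    per_device_train_batch_size=16,          # batch size from the report
    gradient_accumulation_steps=2,           # gradient accumulation from the report
    max_steps=4000,                          # training steps from the report
    learning_rate=1e-5,                      # assumption: common Whisper fine-tuning LR
    fp16=True,                               # assumption: mixed precision on the RTX 4090
    predict_with_generate=True,
)
```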
Objective:
To determine whether directly training Whisper with a native Tibetan tokenizer improves transcription accuracy, token efficiency, and inference performance compared to a Wylie transliteration-based approach.
All experiments were performed on an NVIDIA RTX 4090 (24GB VRAM) GPU using 30-second audio samples (the Whisper maximum input window).
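The two evaluation paths above differ only in the Wylie→Tibetan conversion step for Model A. A minimal sketch of that comparison follows, assuming the pyewts package for Wylie→Unicode conversion and jiwer for scoring (the actual tooling used in the experiment is not specified here); segmenting on the tsheg (་) to define "words" is likewise an assumption.

```python
# Sketch: convert Model A's Wylie output to Tibetan script, then score both
# models against the same Tibetan-script ground truth.
import pyewts                 # Wylie/EWTS <-> Tibetan Unicode converter (assumed tool)
from jiwer import wer

converter = pyewts.pyewts()

def segment(text: str) -> str:
    # Tibetan script has no spaces; treat tsheg-delimited syllables as words
    # (an assumption about how WER is defined in this report).
    return text.replace("་", " ").replace("།", "").strip()

reference = "བཀྲ་ཤིས་བདེ་ལེགས།"                      # Tibetan ground truth

# Model A: Wylie prediction, converted back to Tibetan script before scoring
model_a_pred = converter.toUnicode("bkra shis bde legs/")

# Model B: predicts Tibetan script directly
model_b_pred = "བཀྲ་ཤིས་བདེ་ལེགས།"

print(wer(segment(reference), segment(model_a_pred)))
print(wer(segment(reference), segment(model_b_pred)))
```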
1. Word and Sentence Error Rates (WER / SER)
| model | micro_wer | macro_wer | micro_ser | macro_ser | substitutions | insertions | deletions | 
|---|---|---|---|---|---|---|---|
| whisper-small-latin-added-tibetan-checkpoint-4000 | 0.607723 | 0.587186 | 0.565648 | 0.565680 | 7289 | 543 | 1478 | 
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | 0.675397 | 0.712424 | 0.616562 | 0.656280 | 6741 | 1561 | 1846 | 
Summary:
- The Latin-added Tibetan model outperforms the Wylie→Tibetan model in both WER and SER across micro and macro averages.
- The difference is especially clear in the insertion and deletion counts: the Wylie→Tibetan model produces roughly three times as many insertions (1,561 vs. 543) and more deletions (1,846 vs. 1,478), while its substitution count is slightly lower (6,741 vs. 7,289).
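For reference, these aggregates can be computed with the jiwer library as sketched below; the exact scoring script used for this benchmark is not shown here, and treating "micro" as corpus-pooled edits and "macro" as the mean of per-utterance scores is an assumption.

```python
# Sketch: micro WER pools edit operations over the corpus; macro WER averages
# per-utterance WER; SER counts utterances with at least one error.
import jiwer

references = ["བཀྲ་ཤིས་ བདེ་ ལེགས", "སངས་ རྒྱས"]   # pre-segmented references
hypotheses = ["བཀྲ་ཤིས་ བདེ་ ལེགས", "སངས་ རྒྱལ"]   # model outputs, same segmentation

micro_wer = jiwer.wer(references, hypotheses)
macro_wer = sum(jiwer.wer(r, h) for r, h in zip(references, hypotheses)) / len(references)
ser = sum(r != h for r, h in zip(references, hypotheses)) / len(references)

out = jiwer.process_words(references, hypotheses)   # per-class error counts
print(micro_wer, macro_wer, ser, out.substitutions, out.insertions, out.deletions)
```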
2. Character Error Rate (CER)
| model | cer_mean | 
|---|---|
| whisper-small-latin-added-tibetan-checkpoint-4000 | 0.298808 | 
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | 0.384848 | 
Summary:
- The Tibetan-script model achieves a lower CER (~0.30) compared to the Wylie model (~0.38), showing higher character-level accuracy.
- This suggests better modeling of native script structure and spelling consistency.
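CER can be obtained the same way at the character level; a minimal sketch, assuming the score is computed directly on the raw Tibetan Unicode string, tsheg included:

```python
import jiwer

reference  = "བཀྲ་ཤིས་བདེ་ལེགས།"
hypothesis = "བཀྲ་ཤིས་བདེ་ལགས།"    # one vowel sign wrong

print(jiwer.cer(reference, hypothesis))   # fraction of character-level edit operations
```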
3. Tokenization Length Comparison
(for ~30-second audio transcripts)
| Tokenizer Type | Transcript Language | Token Length | Notes | 
|---|---|---|---|
| Whisper Default Tokenizer | Wylie | 271 | Within model limit (≤ 1024) | 
| Whisper + Added Tibetan Tokens Tokenizer | Tibetan | 311 | Within model limit (≤ 1024) | 
| Whisper Default Tokenizer | Tibetan | 1551 |  Exceeds model max input limit (1024) | 
Summary:
- The default Whisper tokenizer is inefficient on Tibetan script, producing over 1,500 tokens for a ~30-second transcript and exceeding the model's sequence limit.
- The added-tokens tokenizer keeps Tibetan input compact and fully processable.
- This makes direct Tibetan script training feasible without transliteration.
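These lengths can be reproduced with the Hugging Face tokenizers as sketched below, assuming both checkpoint repos listed in the Resources section ship their tokenizer files; the transcript strings are placeholders.

```python
# Sketch: encode the same ~30-second transcript with the stock tokenizer
# (Wylie and Tibetan text) and with the extended tokenizer (Tibetan text).
from transformers import WhisperTokenizer

wylie_text   = "bkra shis bde legs ..."      # placeholder Wylie transcript
tibetan_text = "བཀྲ་ཤིས་བདེ་ལེགས། ..."        # placeholder Tibetan transcript

default_tok  = WhisperTokenizer.from_pretrained("openai/whisper-small")
extended_tok = WhisperTokenizer.from_pretrained(
    "ganga4364/whisper-small-latin-added-tibetan-checkpoint-4000"
)

print(len(default_tok(wylie_text).input_ids))     # ~271 on a real transcript
print(len(extended_tok(tibetan_text).input_ids))  # ~311 on a real transcript
print(len(default_tok(tibetan_text).input_ids))   # ~1551, over the model limit
```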
4. Inference Time Benchmark
(on ~30-second audios, Whisper max input, NVIDIA RTX 4090 24GB)
| Model | GPU (VRAM) | Avg Inference Time (sec) | Notes | 
|---|---|---|---|
| whisper-small-latin-added-tibetan-checkpoint-4000 | RTX 4090 (24GB) | 1.3 | Stable average; small variation per run | 
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | RTX 4090 (24GB) | 1.3 | No significant difference between models | 
Summary:
- Both models achieve ~1.3 seconds inference time for 30-second audio clips.
- There is no measurable difference in inference latency between Wylie and Tibetan-tokenized models.
- GPU utilization remains consistent across runs.
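A rough sketch of how this latency can be measured is shown below; the decoding settings (greedy, no beam search), the 16 kHz mono input, and the assumption that the checkpoint ships its processor files are mine, not stated in the benchmark.

```python
# Sketch: time a single generate() call for one 30-second clip on the GPU.
import time
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "ganga4364/whisper-small-latin-added-tibetan-checkpoint-4000"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to("cuda").eval()

audio = np.zeros(16_000 * 30, dtype=np.float32)   # placeholder 30-second waveform
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")

with torch.no_grad():
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(inputs.input_features)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{elapsed:.2f} s for one 30-second clip")
```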
5. Tokenizer Vocabulary Size
| Tokenizer Type | Vocabulary Size | Notes | 
|---|---|---|
| Whisper Default Tokenizer | 51,865 | Original Whisper vocabulary | 
| Whisper + Added Tibetan Tokens Tokenizer | 53,014 | Includes 1,149 additional Tibetan script tokens | 
Summary:
- The custom Tibetan tokenizer expands the base Whisper vocabulary by ~2.2%.
- These extra tokens provide direct Tibetan script coverage, reducing token fragmentation and improving alignment with real-world Tibetan text.
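A minimal sketch of how such a vocabulary extension is typically done with Hugging Face tooling follows; the token list below is an illustrative placeholder, not the actual 1,149 tokens added in the experiment.

```python
# Sketch: add Tibetan tokens to the stock tokenizer and grow the model's
# embedding matrix so the new ids have (randomly initialised) rows to train.
from transformers import WhisperForConditionalGeneration, WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

tibetan_tokens = ["བཀྲ", "ཤིས", "བདེ", "ལེགས", "་", "།"]   # placeholder subset
num_added = tokenizer.add_tokens(tibetan_tokens)

model.resize_token_embeddings(len(tokenizer))
print(num_added, len(tokenizer))   # the experiment added 1,149 tokens -> 53,014 total
```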
6. Overall Observations
| Category | Best Performing Model | Key Advantage | 
|---|---|---|
| WER / SER | Latin-added Tibetan | Lower word and sentence error rates | 
| CER | Latin-added Tibetan | More accurate at character level | 
| Tokenization Efficiency | Latin-added Tibetan | Efficient native-script encoding | 
| Vocabulary Coverage | Latin-added Tibetan | Broader support for Tibetan characters | 
| Inference Speed | Equal (both) | No significant runtime difference | 
Conclusion
The whisper-small-latin-added-tibetan-checkpoint-4000 model consistently outperforms the whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan model across accuracy metrics (WER, SER, CER) while maintaining identical inference performance.
By leveraging a custom tokenizer with added Tibetan tokens, this model:
- Enables direct training on Tibetan script (no transliteration needed),
- Avoids token overflow issues (keeps sequences ≤1024 tokens),
- Achieves higher accuracy on native Tibetan benchmarks,
- And maintains comparable inference speed.
Overall, direct Tibetan tokenization is a robust and scalable improvement for Tibetan ASR tasks using Whisper architectures.
Resources
| Resource Type | Name / Link | 
|---|---|
|  Whisper model (Default Tokenizer, Wylie-based) | ganga4364/whisper-small-tibetan-wylie-checkpoint-4000 | 
|  Whisper model (Added Tibetan Tokens) | ganga4364/whisper-small-latin-added-tibetan-checkpoint-4000 | 
|  Training dataset | openpecha/garchen_rinpoche_data | 
|  Evaluation benchmark | ganga4364/garchen_rinpoche_evaluation_results | 
| Leaderboard | Buddhist AI Arena | 
| Hugging Face Space | STT Inference - a Hugging Face Space by openpecha |