Whisper Tibetan Model Evaluation Summary on Garchen Rinpoche Speech

0. Experiment Overview

This experiment evaluates and compares two fine-tuned Whisper-small models for Tibetan Automatic Speech Recognition (ASR).
The goal is to assess the effect of using native Tibetan script tokens (via a custom tokenizer) versus Wylie transliteration (using the default Whisper tokenizer).

Model A — Wylie-based Whisper model

  • Model name: whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan
  • Base model: openai/whisper-small
  • Tokenizer: Default Whisper tokenizer
  • Training data: 5.6 hours of Tibetan audio paired with Wylie transliterations
  • Training steps: 4000
  • Batch size: 16
  • Gradient accumulation: 2
  • Input: Audio + Wylie transcript
  • Evaluation: After transcription, predicted Wylie text is converted back to Tibetan script and compared against the Tibetan ground-truth in the benchmark dataset.
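This back-conversion step can be sketched in Python. A real pipeline would use a full EWTS converter (e.g. the pyewts library); the hard-coded syllable table below is a toy assumption covering only one example phrase:

```python
# Toy sketch of the Wylie -> Tibetan post-processing step.
# Assumption: a real pipeline would use a complete EWTS converter
# such as pyewts; this lookup only covers one illustrative phrase.
WYLIE_TO_TIBETAN = {
    "bkra": "\u0f56\u0f40\u0fb2",        # བཀྲ
    "shis": "\u0f64\u0f72\u0f66",        # ཤིས
    "bde":  "\u0f56\u0f51\u0f7a",        # བདེ
    "legs": "\u0f63\u0f7a\u0f42\u0f66",  # ལེགས
}
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg)

def wylie_to_tibetan(wylie: str) -> str:
    """Convert a space-separated Wylie string to Tibetan script."""
    return TSHEG.join(WYLIE_TO_TIBETAN[s] for s in wylie.split())

print(wylie_to_tibetan("bkra shis bde legs"))  # བཀྲ་ཤིས་བདེ་ལེགས
```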

Model B — Tibetan-script Whisper model with added tokens

  • Model name: whisper-small-latin-added-tibetan-checkpoint-4000
  • Base model: openai/whisper-small
  • Tokenizer: Custom tokenizer extending the default Whisper tokenizer with added Tibetan script tokens
  • Training data: Same 5.6 hours of Tibetan audio
  • Training steps: 4000
  • Batch size: 16
  • Gradient accumulation: 2
  • Input: Audio + Tibetan script transcript (tokenized using added Tibetan tokens)
  • Evaluation: Direct comparison between the model’s Tibetan output and Tibetan ground-truth.

Objective:
To determine whether directly training Whisper with a native Tibetan tokenizer improves transcription accuracy, token efficiency, and inference performance compared to a Wylie transliteration-based approach.

All experiments were performed on an NVIDIA RTX 4090 (24 GB VRAM) GPU using 30-second audio samples (the Whisper maximum input window).


1. Word and Sentence Error Rates (WER / SER)

| model | micro_wer | macro_wer | micro_ser | macro_ser | substitutions | insertions | deletions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| whisper-small-latin-added-tibetan-checkpoint-4000 | 0.607723 | 0.587186 | 0.565648 | 0.565680 | 7289 | 543 | 1478 |
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | 0.675397 | 0.712424 | 0.616562 | 0.656280 | 6741 | 1561 | 1846 |

Summary:

  • The Latin-added Tibetan model outperforms the Wylie→Tibetan model in both WER and SER across micro and macro averages.
  • The gap is especially clear in the insertion and deletion counts: the Wylie→Tibetan model produces roughly three times as many insertions (1,561 vs 543) and more deletions (1,846 vs 1,478).
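The WER and the substitution/insertion/deletion counts in the table follow the standard Levenshtein-alignment definition, sketched below (the exact evaluation tooling is not stated in this report, so this is an illustrative reimplementation, not the script that produced the numbers):

```python
def wer_counts(ref_words, hyp_words):
    """Word error rate plus substitution/insertion/deletion counts,
    obtained from a standard Levenshtein alignment."""
    m, n = len(ref_words), len(hyp_words)
    # dp[i][j] = minimal edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match / substitution
    # Backtrack through the table to classify each error.
    subs = ins = dels = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1]):
            subs += ref_words[i - 1] != hyp_words[j - 1]
            i, j = i - 1, j - 1
        elif j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dels += 1
            i -= 1
    return (subs + ins + dels) / max(m, 1), subs, ins, dels

ref = "bkra shis bde legs".split()
hyp = "bkra bde legs su".split()
print(wer_counts(ref, hyp))  # (0.5, 0, 1, 1): one deletion, one insertion
```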

2. Character Error Rate (CER)

| model | cer_mean |
| --- | --- |
| whisper-small-latin-added-tibetan-checkpoint-4000 | 0.298808 |
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | 0.384848 |

Summary:

  • The Tibetan-script model achieves a lower CER (~0.30) compared to the Wylie model (~0.38), showing higher character-level accuracy.
  • This suggests better modeling of native script structure and spelling consistency.
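CER is the same edit-distance computation applied at the character level, normalized by reference length. A minimal sketch, comparing Tibetan strings codepoint by codepoint (tsheg marks included):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance over codepoints,
    normalized by the reference length (two-row DP formulation)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n] / max(m, 1)

# One dropped vowel sign out of 7 codepoints -> CER of 1/7.
print(round(cer("\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66",
                "\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f66"), 3))  # 0.143
```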

3. Tokenization Length Comparison

(for ~30-second audio transcripts)

| Tokenizer Type | Transcript Language | Token Length | Notes |
| --- | --- | --- | --- |
| Whisper Default Tokenizer | Wylie | 271 | Within model limit (≤ 1024) |
| Whisper + Added Tibetan Tokens | Tibetan | 311 | Within model limit (≤ 1024) |
| Whisper Default Tokenizer | Tibetan | 1551 | :cross_mark: Exceeds model max input limit (1024) |

Summary:

  • The default Whisper tokenizer is inefficient on Tibetan script, producing over 1,500 tokens for a 30-second transcript and exceeding the 1024-token model limit.
  • The added-tokens tokenizer keeps Tibetan input compact and fully processable.
  • This makes direct Tibetan script training feasible without transliteration.
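The blow-up has a simple cause: every codepoint in the Tibetan Unicode block (U+0F00–U+0FFF) occupies 3 bytes in UTF-8, and a byte-level BPE vocabulary with no Tibetan merges falls back to roughly one token per byte. The exact token count depends on the learned merges, so the arithmetic below is illustrative only:

```python
# Each Tibetan codepoint is 3 bytes in UTF-8; with no Tibetan merges,
# a byte-level BPE emits on the order of one token per byte, which is
# why native-script transcripts explode past the 1024-token window.
text = "\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66"  # བཀྲ་ཤིས
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes, n_bytes / n_chars)  # 7 21 3.0
```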

4. Inference Time Benchmark

(on ~30-second audios, Whisper max input, NVIDIA RTX 4090 24GB)

| Model | GPU (VRAM) | Avg Inference Time (sec) | Notes |
| --- | --- | --- | --- |
| whisper-small-latin-added-tibetan-checkpoint-4000 | RTX 4090 (24 GB) | 1.3 | Stable average; small variation per run |
| whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan | RTX 4090 (24 GB) | 1.3 | No significant difference between models |

Summary:

  • Both models achieve ~1.3 seconds inference time for 30-second audio clips.
  • There is no measurable difference in inference latency between Wylie and Tibetan-tokenized models.
  • GPU utilization remains consistent across runs.
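A generic way to collect such latency averages is a small wall-clock harness like the one below. This is a timing sketch only; the stand-in callable replaces the actual Whisper generate call, which is not reproduced here:

```python
import time

def benchmark(fn, runs: int = 5) -> float:
    """Average wall-clock latency of fn over several runs."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # in the real benchmark: model.generate(...) on a 30 s clip
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Stand-in workload instead of a Whisper forward pass:
avg = benchmark(lambda: time.sleep(0.01))
print(f"avg latency: {avg:.3f} s")
```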

5. Tokenizer Vocabulary Size

| Tokenizer Type | Vocabulary Size | Notes |
| --- | --- | --- |
| Whisper Default Tokenizer | 51,865 | Original Whisper vocabulary |
| Whisper + Added Tibetan Tokens | 53,014 | Includes 1,149 additional Tibetan script tokens |

Summary:

  • The custom Tibetan tokenizer expands the base Whisper vocabulary by ~2.2%.
  • These extra tokens provide direct Tibetan script coverage, reducing token fragmentation and improving alignment with real-world Tibetan text.
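The vocabulary arithmetic checks out as follows; the standard transformers extension calls are noted in comments (the actual Tibetan token list is not reproduced here):

```python
# In a transformers-based setup, the extension itself would be:
#   tokenizer.add_tokens(tibetan_tokens)
#   model.resize_token_embeddings(len(tokenizer))
# Only the size arithmetic from the table above is reproduced here.
base_vocab = 51_865           # original Whisper vocabulary
added_tibetan_tokens = 1_149  # added Tibetan script tokens
extended_vocab = base_vocab + added_tibetan_tokens
growth_pct = 100 * added_tibetan_tokens / base_vocab
print(extended_vocab, f"{growth_pct:.1f}%")  # 53014 2.2%
```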

6. Overall Observations

| Category | Best Performing Model | Key Advantage |
| --- | --- | --- |
| WER / SER | Latin-added Tibetan | Lower word and sentence error rates |
| CER | Latin-added Tibetan | More accurate at character level |
| Tokenization Efficiency | Latin-added Tibetan | Efficient native-script encoding |
| Vocabulary Coverage | Latin-added Tibetan | Broader support for Tibetan characters |
| Inference Speed | Equal (both) | No significant runtime difference |

:white_check_mark: Conclusion

The whisper-small-latin-added-tibetan-checkpoint-4000 model consistently outperforms the whisper-small-tibetan-wylie-checkpoint-4000_to_tibetan model across accuracy metrics (WER, SER, CER) while maintaining identical inference performance.

By leveraging a custom tokenizer with added Tibetan tokens, this model:

  • Enables direct training on Tibetan script (no transliteration needed),
  • Avoids token overflow issues (keeps sequences ≤1024 tokens),
  • Achieves higher accuracy on native Tibetan benchmarks,
  • And maintains comparable inference speed.

Overall, direct Tibetan tokenization is a robust and scalable improvement for Tibetan ASR tasks using Whisper architectures.


:books: Resources

| Resource Type | Name / Link |
| --- | --- |
| :brain: Whisper model (Default Tokenizer, Wylie-based) | ganga4364/whisper-small-tibetan-wylie-checkpoint-4000 |
| :feather: Whisper model (Added Tibetan Tokens) | ganga4364/whisper-small-latin-added-tibetan-checkpoint-4000 |
| :headphone: Training dataset | openpecha/garchen_rinpoche_data |
| :bar_chart: Evaluation benchmark | ganga4364/garchen_rinpoche_evaluation_results |
| Leaderboard | Buddhist AI Arena |
| Hugging Face Space | Stt Inference - a Hugging Face Space by openpecha |

7. Comprehensive Model Performance Statistics

| Model/Checkpoint | CER Mean | CER Std | Levenshtein Mean | Levenshtein Std | Micro WER | Macro WER | Micro SER | Macro SER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| base | 0.277 | 0.161 | 19.172 | 21.794 | 0.554 | 0.597 | 0.493 | 0.518 |
| v1_5000 | 0.274 | 0.182 | 18.441 | 19.909 | 0.518 | 0.574 | 0.482 | 0.544 |
| v1_10000 | 0.234 | 0.172 | 15.658 | 17.422 | 0.410 | 0.468 | 0.398 | 0.479 |
| v1_19000 | 0.229 | 0.172 | 15.334 | 16.935 | 0.394 | 0.459 | 0.386 | 0.470 |
| v2_20000 | 0.223 | 0.175 | 14.884 | 16.349 | 0.392 | 0.451 | 0.384 | 0.467 |
| v2_32000 | 0.218 | 0.167 | 14.730 | 16.144 | 0.386 | 0.443 | 0.380 | 0.466 |
| v2_43000 | 0.215 | 0.168 | 14.608 | 16.029 | 0.381 | 0.438 | 0.376 | 0.461 |
| v3_1000 | 0.234 | 0.176 | 15.440 | 16.877 | 0.412 | 0.474 | 0.401 | 0.486 |
| v3_25000 | 0.222 | 0.171 | 14.991 | 16.405 | 0.391 | 0.452 | 0.383 | 0.463 |
| v4_22000 | 0.227 | 0.171 | 15.169 | 16.710 | 0.409 | 0.476 | 0.392 | 0.476 |
| v5_base_28000 | 0.215 | 0.161 | 14.970 | 16.645 | 0.413 | 0.459 | 0.401 | 0.470 |
| v6_ft_25000 | 0.221 | 0.170 | 15.214 | 17.198 | 0.404 | 0.451 | 0.393 | 0.464 |
| v7_scratch_23000 | 0.219 | 0.167 | 14.571 | 15.467 | 0.412 | 0.465 | 0.397 | 0.471 |
| v8_10000 | 0.242 | 0.165 | 16.830 | 19.463 | 0.491 | 0.526 | 0.464 | 0.522 |
| v8_55000 | 0.220 | 0.164 | 15.216 | 17.317 | 0.401 | 0.454 | 0.394 | 0.472 |
| v8_59000 | 0.230 | 0.165 | 15.859 | 18.460 | 0.430 | 0.480 | 0.415 | 0.488 |
| whisper-small-added-tibetan | 0.299 | 0.176 | 23.260 | 33.615 | 0.608 | 0.587 | 0.566 | 0.566 |
| whisper-small-wylie | 0.385 | 0.201 | 29.950 | 43.624 | 0.675 | 0.712 | 0.617 | 0.656 |

https://arena.buddhistai.tools/leaderboard/60157fcf-bd4d-47a7-af1d-25bb64070352