Title: Fine-Tuning a Multi-Dialect Speech Recognition Model for Tibetan: Balancing Cross-Dialect Performance
Introduction
We developed a speech recognition system capable of transcribing three major Tibetan dialects: Utsang, Khampa, and Amdo. Starting with a base wav2vec2 model trained on 1500 hours of Utsang dialect data, we explored the challenges of extending it to Khampa and Amdo. The experiment revealed a clear trade-off: improving performance on the new dialects caused significant degradation in Utsang recognition due to catastrophic forgetting.
Background
The Tibetan language comprises three major dialects with significant phonetic variation but a unified written form. While our base model achieved strong performance on the Utsang dialect, with a mean CER of 15.23%, extending its capabilities to the other dialects presented unique challenges. This experiment explores methods for balancing performance across all three dialects while dealing with data imbalance and knowledge retention.
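CER (character error rate) is the evaluation metric used throughout: the character-level Levenshtein distance between hypothesis and reference, normalized by reference length. A minimal reference implementation (an illustrative sketch, not the evaluation code actually used in the experiment):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between hypothesis and
    reference transcripts, divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))  # DP row for the empty reference prefix
    for i, rc in enumerate(reference, 1):
        cur = [i]
        for j, hc in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution or match
        prev = cur
    return prev[-1] / max(len(reference), 1)
```

So a mean CER of 15.23% means roughly one character-level error per seven reference characters, averaged over the benchmark.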
Methodology
Dataset Preparation
- Utsang: 1500 hours (base model training data)
- Khampa: 80 hours for fine-tuning
- Amdo: 53 hours for fine-tuning
Model Architecture
- Base Model: wav2vec2 pretrained on Utsang dialect
- Fine-tuning approach: Combined training on Khampa and Amdo data
- No balanced sampling was implemented to handle the data imbalance
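Because no balanced sampling was used, the 80-hour Khampa set outweighs the 53-hour Amdo set in proportion to its size. For contrast, a minimal pure-Python sketch of what dialect-balanced sampling could look like (the function name and batch structure are illustrative, not part of the actual training pipeline):

```python
import random

def balanced_batches(examples_by_dialect, batch_size, num_batches, seed=0):
    """Yield batches in which each example is drawn by first picking a
    dialect uniformly at random, so a smaller corpus (e.g. Amdo) is
    sampled as often as a larger one (e.g. Khampa)."""
    rng = random.Random(seed)
    dialects = sorted(examples_by_dialect)
    for _ in range(num_batches):
        yield [rng.choice(examples_by_dialect[rng.choice(dialects)])
               for _ in range(batch_size)]
```

With this scheme each dialect contributes roughly half the sampled examples regardless of corpus size, at the cost of repeating Amdo utterances more often within an epoch.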
Training Strategy
- Transfer learning from Utsang base model
- Gradient accumulation for stable training
- Warmup steps to mitigate catastrophic forgetting
Implementation Details
Training Parameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-tibetan-multidialect",  # checkpoint directory (name illustrative)
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-5,
    num_train_epochs=20,
    warmup_steps=500,
    fp16=True,
    dataloader_num_workers=4,
)
```
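For reference, the effective batch size implied by these arguments can be worked out directly (assuming all four L4 GPUs participate in data-parallel training):

```python
# Values from the TrainingArguments above; num_gpus assumes the 4x L4 setup
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 4

# Examples consumed per optimizer update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 64
```

Gradient accumulation thus doubles the per-step batch without doubling GPU memory, which is the "stable training" benefit noted above.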
Hardware Configuration
- 4x NVIDIA L4 GPUs (24GB each)
- Training Duration: 24 hours
Results
Performance on Utsang Benchmark
| Model Type | Mean CER | Analysis |
|---|---|---|
| Base Model | 15.23% | Strong initial performance |
| Fine-tuned Model | 29.62% | Shows impact of catastrophic forgetting |
Cross-Dialect Performance
| Benchmark | Fine-Tuned Model (CER) | Base Model (CER) |
|---|---|---|
| STT_AM | 17.23% | 44.29% |
| STT_AM_AB | 7.58% | 35.48% |
| STT_KH | 20.86% | 36.73% |
| STT_KH_AB | 10.46% | 31.10% |
| Mean | 14.03% | 36.90% |
Discussion
The experiment revealed significant trade-offs in multi-dialect speech recognition:
Positive Outcomes
- Improved recognition accuracy for Khampa and Amdo dialects
- Achieved better overall performance across dialects
Challenges
- Catastrophic forgetting of Utsang dialect knowledge
- Performance degradation on Utsang benchmark (CER increased from 15.23% to 29.62%)
- Data imbalance between dialects affecting model training
Conclusion
While we successfully improved the model’s capabilities for Khampa and Amdo dialects, the significant degradation in Utsang performance highlights the challenges of multi-dialect speech recognition. Our experiment revealed clear evidence of catastrophic forgetting, with the following key performance metrics:
Utsang Benchmark:
- Base Model: 15.23% CER
- Fine-tuned Model: 29.62% CER
- Performance degradation: 14.39 percentage-point increase in CER

Amdo and Kham Benchmarks:
- Base Model: 36.90% mean CER
- Fine-tuned Model: 14.03% mean CER
- Performance improvement: 22.87 percentage-point reduction in CER
This trade-off demonstrates that while we achieved significant improvements in overall multi-dialect recognition, it came at the cost of Utsang performance. Future work should focus on developing techniques to preserve base model knowledge while adapting to new dialects, possibly through continual learning on balanced training data from all three dialects.
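One such knowledge-preserving technique is rehearsal (replay): mixing a fraction of the original Utsang data back into the fine-tuning set. A minimal sketch of the idea (the function name and the 20% replay fraction are illustrative assumptions, not something we ran):

```python
import random

def rehearsal_mix(new_data, old_data, replay_fraction=0.2, seed=0):
    """Combine new-dialect examples with replayed base-model (Utsang)
    examples so that replay_fraction of the result comes from old_data,
    a simple rehearsal strategy against catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_fraction / (1 - replay_fraction))
    mixed = list(new_data) + [rng.choice(old_data) for _ in range(n_replay)]
    rng.shuffle(mixed)
    return mixed
```

Even a small replay fraction keeps gradients flowing through Utsang examples during fine-tuning, which in the continual-learning literature typically reduces forgetting at a modest cost to new-task adaptation.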
Resources
- Base Model: model link
- Fine-tuned Model: model link
- Utsang Benchmark: benchmark link
- Amdo and Kham Benchmark: benchmark link