Title: Fine-Tuning a Multi-Dialect Speech Recognition Model for Tibetan: Balancing Cross-Dialect Performance
Introduction
We developed a speech recognition system capable of transcribing three major Tibetan dialects: Utsang, Khampa, and Amdo. Starting with a base wav2vec2 model trained on 1500 hours of Utsang dialect data, we explored the challenges of extending it to Khampa and Amdo. The experiment revealed a clear trade-off: improving performance on the new dialects caused significant degradation in Utsang recognition due to catastrophic forgetting.
Background
The Tibetan language comprises three major dialects with significant phonetic variation but a unified written form. While our base model achieved strong performance on the Utsang dialect, with a mean CER of 15.23%, extending its capabilities to the other dialects presented unique challenges. This experiment explores methods for balancing performance across all three dialects while dealing with data imbalance and knowledge retention.
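CER (character error rate) is the evaluation metric used throughout: the character-level Levenshtein distance between hypothesis and reference, normalized by reference length. A minimal reference implementation (an illustrative sketch, not the evaluation code actually used in the experiment):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance between hypothesis and
    reference transcripts, divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))  # DP row for the empty reference prefix
    for i, rc in enumerate(reference, 1):
        cur = [i]
        for j, hc in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution or match
        prev = cur
    return prev[-1] / max(len(reference), 1)
```

So a mean CER of 15.23% means roughly one character-level error per seven reference characters, averaged over the benchmark.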
Methodology
Dataset Preparation
- Utsang: 1500 hours (base model training data)
- Khampa: 80 hours for fine-tuning
- Amdo: 53 hours for fine-tuning
Model Architecture
- Base Model: wav2vec2 pretrained on Utsang dialect
- Fine-tuning approach: Combined training on Khampa and Amdo data
- No balanced sampling was implemented to handle the data imbalance
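Because no balanced sampling was used, the 80-hour Khampa set outweighs the 53-hour Amdo set in proportion to its size. For contrast, a minimal pure-Python sketch of what dialect-balanced sampling could look like (the function name and batch structure are illustrative, not part of the actual training pipeline):

```python
import random

def balanced_batches(examples_by_dialect, batch_size, num_batches, seed=0):
    """Yield batches in which each example is drawn by first picking a
    dialect uniformly at random, so a smaller corpus (e.g. Amdo) is
    sampled as often as a larger one (e.g. Khampa)."""
    rng = random.Random(seed)
    dialects = sorted(examples_by_dialect)
    for _ in range(num_batches):
        yield [rng.choice(examples_by_dialect[rng.choice(dialects)])
               for _ in range(batch_size)]
```

With this scheme each dialect contributes roughly half the sampled examples regardless of corpus size, at the cost of repeating Amdo utterances more often within an epoch.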
Training Strategy
- Transfer learning from Utsang base model
- Gradient accumulation for stable training
- Warmup steps to mitigate catastrophic forgetting
Implementation Details
Training Parameters
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-tibetan-multidialect",  # checkpoint directory (name illustrative)
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-5,
    num_train_epochs=20,
    warmup_steps=500,
    fp16=True,
    dataloader_num_workers=4,
)
```
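For reference, the effective batch size implied by these arguments can be worked out directly (assuming all four L4 GPUs participate in data-parallel training):

```python
# Values from the TrainingArguments above; num_gpus assumes the 4x L4 setup
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_gpus = 4

# Examples consumed per optimizer update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 64
```

Gradient accumulation thus doubles the per-step batch without doubling GPU memory, which is the "stable training" benefit noted above.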
Hardware Configuration
- 4x NVIDIA L4 GPUs (24GB each)
- Training Duration: 24 hours
Results
Performance on Utsang Benchmark
| Model Type | Mean CER | Analysis |
|---|---|---|
| Base Model | 15.23% | Strong initial performance |
| Fine-tuned Model | 29.62% | Shows impact of catastrophic forgetting |
Cross-Dialect Performance
| Benchmark | Fine-Tuned Model (CER) | Base Model (CER) |
|---|---|---|
| STT_AM | 17.23% | 44.29% |
| STT_AM_AB | 7.58% | 35.48% |
| STT_KH | 20.86% | 36.73% |
| STT_KH_AB | 10.46% | 31.10% |
| Mean | 14.03% | 36.90% |
Discussion
The experiment revealed significant trade-offs in multi-dialect speech recognition:
Positive Outcomes
- Improved recognition accuracy for Khampa and Amdo dialects
- Achieved better overall performance across dialects
Challenges
- Catastrophic forgetting of Utsang dialect knowledge
- Performance degradation on Utsang benchmark (CER increased from 15.23% to 29.62%)
- Data imbalance between dialects affecting model training
Conclusion
While we successfully improved the model’s capabilities for Khampa and Amdo dialects, the significant degradation in Utsang performance highlights the challenges of multi-dialect speech recognition. Our experiment revealed clear evidence of catastrophic forgetting, with the following key performance metrics:
Utsang Benchmark:
- Base Model: 15.23% CER
- Fine-tuned Model: 29.62% CER
- Performance degradation: 14.39 percentage-point increase in CER

Amdo and Kham Benchmarks:
- Base Model: 36.90% mean CER
- Fine-tuned Model: 14.03% mean CER
- Performance improvement: 22.87 percentage-point reduction in CER
This trade-off demonstrates that while we achieved significant improvements in overall multi-dialect recognition, it came at the cost of Utsang performance. Future work should focus on developing techniques to preserve base model knowledge while adapting to new dialects, possibly through continual learning on balanced training data from all three dialects.
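One such knowledge-preserving technique is rehearsal (replay): mixing a fraction of the original Utsang data back into the fine-tuning set. A minimal sketch of the idea (the function name and the 20% replay fraction are illustrative assumptions, not something we ran):

```python
import random

def rehearsal_mix(new_data, old_data, replay_fraction=0.2, seed=0):
    """Combine new-dialect examples with replayed base-model (Utsang)
    examples so that replay_fraction of the result comes from old_data,
    a simple rehearsal strategy against catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = int(len(new_data) * replay_fraction / (1 - replay_fraction))
    mixed = list(new_data) + [rng.choice(old_data) for _ in range(n_replay)]
    rng.shuffle(mixed)
    return mixed
```

Even a small replay fraction keeps gradients flowing through Utsang examples during fine-tuning, which in the continual-learning literature typically reduces forgetting at a modest cost to new-task adaptation.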
Resources
- Base Model: model link
- Fine-tuned Model: model link
- Utsang Benchmark: benchmark link
- Amdo and Kham Benchmark: benchmark link