Fine-Tuning a Multi-Dialect Speech Recognition Model for Tibetan: Balancing Cross-Dialect Performance

Introduction

We developed a speech recognition system capable of transcribing three major Tibetan dialects: Utsang, Khampa, and Amdo. Starting with a base wav2vec2 model trained on 1500 hours of Utsang dialect data, we explored the challenges of extending its capabilities to handle Khampa and Amdo dialects. Our experiment revealed an interesting trade-off: while we improved performance on new dialects, we encountered significant degradation in Utsang dialect recognition due to catastrophic forgetting.

Background

The Tibetan language comprises three major dialect groups with significant phonetic variation but a unified written form. While our base model achieved strong performance on the Utsang dialect, with a mean CER of 15.23%, extending its capabilities to the other dialects presented unique challenges. This experiment explores methods to balance performance across all three dialects while dealing with data imbalance and knowledge retention.
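
All results below are reported as character error rate (CER): the character-level edit distance between the model transcript and the reference, divided by the reference length. For clarity, a minimal plain-Python reference implementation of the metric (any standard CER tool computes the same value):

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: character-level edit distance / reference length.
    ref, hyp = list(reference), list(hypothesis)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)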

Methodology

Dataset Preparation

  • Utsang: 1500 hours (base model training data)
  • Khampa: 80 hours for fine-tuning
  • Amdo: 53 hours for fine-tuning
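
A rough sketch of how the Khampa and Amdo fine-tuning mixture could be assembled, assuming the corpora are available as local audio folders loadable with Hugging Face datasets (the paths here are hypothetical):

from datasets import Audio, concatenate_datasets, load_dataset

# Hypothetical local layouts; substitute the actual Khampa and Amdo corpora.
khampa = load_dataset("audiofolder", data_dir="data/khampa")["train"]  # ~80 h
amdo = load_dataset("audiofolder", data_dir="data/amdo")["train"]      # ~53 h

# wav2vec2 expects 16 kHz mono input.
combined = concatenate_datasets([khampa, amdo]).cast_column(
    "audio", Audio(sampling_rate=16_000)
)
combined = combined.shuffle(seed=42)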

Model Architecture

  • Base Model: wav2vec2 pretrained on Utsang dialect
  • Fine-tuning approach: Combined training on Khampa and Amdo data
  • No balanced sampling was used to handle the data imbalance between dialects (see the sketch after this list for the basic model setup)
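
A minimal sketch of this setup, assuming the Utsang base checkpoint is stored locally (the path is hypothetical, and freezing the feature encoder is a common wav2vec2 fine-tuning choice rather than a confirmed detail of this experiment):

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Hypothetical path to the Utsang-pretrained base checkpoint.
processor = Wav2Vec2Processor.from_pretrained("checkpoints/wav2vec2-utsang-base")
model = Wav2Vec2ForCTC.from_pretrained(
    "checkpoints/wav2vec2-utsang-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)

# Keep the convolutional feature encoder frozen while the transformer adapts.
model.freeze_feature_encoder()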

Training Strategy

  • Transfer learning from Utsang base model
  • Gradient accumulation for stable training
  • Warmup steps to mitigate catastrophic forgetting

Implementation Details

Training Parameters

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-tibetan-multidialect",  # checkpoint directory (name illustrative)
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch: 8 x 2 x 4 GPUs = 64
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-5,
    num_train_epochs=20,
    warmup_steps=500,                # linear LR warmup stabilizes early updates
    fp16=True,
    dataloader_num_workers=4,
)
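
These arguments feed the standard Trainer loop. A sketch of the hookup, assuming a CTC data collator and a held-out evaluation split along the lines of the usual wav2vec2 fine-tuning recipe (data_collator, eval_set, and the metric wiring are assumptions, not the exact script used here):

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=combined,           # Khampa + Amdo mixture from above
    eval_dataset=eval_set,            # hypothetical held-out split
    data_collator=data_collator,      # pads audio inputs and CTC labels
    tokenizer=processor.feature_extractor,
)
trainer.train()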

Hardware Configuration

  • 4x NVIDIA L4 GPUs (24GB each)
  • Training Duration: 24 hours
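
With the Trainer API, all four GPUs can be used by launching the training script under torchrun (the script name is hypothetical):

torchrun --nproc_per_node=4 train.py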

Results

Performance on Utsang Benchmark

Model Type         Mean CER   Analysis
Base Model         15.23%     Strong initial performance
Fine-tuned Model   29.62%     Shows impact of catastrophic forgetting

Cross-Dialect Performance

Dialect     Fine-Tuned Model CER   Base Model CER
STT_AM      17.23%                 44.29%
STT_AM_AB   7.58%                  35.48%
STT_KH      20.86%                 36.73%
STT_KH_AB   10.46%                 31.10%
Mean        14.03%                 36.90%

Discussion

The experiment revealed significant trade-offs in multi-dialect speech recognition:

Positive Outcomes

  • Improved recognition accuracy for the Khampa and Amdo dialects
  • Reduced mean CER on the new-dialect benchmarks from 36.90% to 14.03%

Challenges

  • Catastrophic forgetting of Utsang dialect knowledge
  • Performance degradation on Utsang benchmark (CER increased from 15.23% to 29.62%)
  • Data imbalance between dialects affecting model training

Conclusion

While we successfully improved the model’s capabilities for Khampa and Amdo dialects, the significant degradation in Utsang performance highlights the challenges of multi-dialect speech recognition. Our experiment revealed clear evidence of catastrophic forgetting, with the following key performance metrics:

  1. Utsang Benchmark:

    • Base Model: 15.23% CER
    • Fine-tuned Model: 29.62% CER
    • Performance degradation: 14.39 percentage point increase in CER
  2. Amdo and Khampa Benchmarks:

    • Base Model: 36.90% mean CER
    • Fine-tuned Model: 14.03% mean CER
    • Performance improvement: 22.87 percentage point reduction in CER

This trade-off demonstrates that the significant gains in multi-dialect recognition came at the cost of Utsang performance. Future work should focus on techniques that preserve base-model knowledge while adapting to new dialects, for example continual learning on balanced training data from all three dialects; one possible approach is sketched below.
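
One way to approximate such balanced continual training, assuming prepared datasets for all three dialects, is to interleave them with equal sampling probabilities so that Utsang examples are rehearsed during fine-tuning (a sketch of an alternative, not what was done in this experiment; utsang, khampa, and amdo are hypothetical prepared datasets):

from datasets import interleave_datasets

# Rehearsal-style mixture: sample each dialect with equal probability so the
# model keeps seeing Utsang data while adapting to Khampa and Amdo.
balanced = interleave_datasets(
    [utsang, khampa, amdo],
    probabilities=[1 / 3, 1 / 3, 1 / 3],
    seed=42,
    stopping_strategy="all_exhausted",  # cycle smaller corpora until all are seen
)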
