Customizing Speech-to-Text: Fine-Tuning a Model for Kyabje Dilgo Kyentse Rinpoche’s Unique Voice

Introduction

Fine-tuning a Speech-to-Text (STT) model for specific speakers can significantly enhance transcription accuracy by adapting the model to unique speech patterns and pronunciations. Previously, we fine-tuned a base wav2vec2 model using a smaller, speaker-specific dataset of Kyabje Dilgo Kyentse Rinpoche’s annotated speech. This effort yielded promising results, improving transcription performance for this speaker. With access to a much larger dataset of Kyabje Dilgo Kyentse Rinpoche’s speech, we hypothesize that additional data will lead to further improvement in transcription accuracy. This blog post outlines the updated fine-tuning process, presents the evaluation results, and discusses insights gained from this new iteration. For more details on the previous experiment, refer to this blog post.

Background

Single-speaker datasets often present unique challenges for general-purpose STT models, resulting in higher Character Error Rates (CER) due to differences in speech patterns, pronunciation, and intonation. Similar to how accent-specific fine-tuning enhances performance for regional accents, fine-tuning the base model on Kyabje Dilgo Kyentse Rinpoche’s speech data can bridge the performance gap. This approach could also be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.

Increasing the amount of speaker-specific training data can significantly improve accuracy. Based on the improvements observed with the current dataset size, we can also estimate how many additional training hours would be needed to bring the CER below a target threshold, assuming the CER continues to fall roughly linearly as data is added.

Methodology

Dataset Preparation

| Dataset | Duration (hh:mm:ss) | Samples | Description |
| --- | --- | --- | --- |
| Old Dataset | 09:06:00 | 9316 | Annotated recordings of Kyabje Dilgo Kyentse Rinpoche’s speech, used to fine-tune the initial model (v1). |
| New Dataset | 06:04:03 | 5609 | Additional annotated recordings, collected to expand the training dataset and improve performance. |
| Combined Dataset | 15:10:03 | 14925 | Merged dataset comprising the old and new data, used to fine-tune the updated model (v2). |
| Test Set A | 01:09:32 | 1165 | Test data held out from the old dataset to evaluate performance on previously unseen recordings. |
| Test Set B | 00:44:30 | 702 | Test data held out from the new dataset to assess performance on newly annotated recordings. |
| Test Set C | 01:54:02 | 1867 | Combined test data, including samples from both the old and new datasets, to evaluate overall performance. |
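
For reference, the merging and splitting can be done with the Hugging Face `datasets` library. The sketch below is illustrative only: the `audiofolder` directories and the `audio`/`transcript` column names are assumptions, not the exact identifiers used in our pipeline.

```python
from datasets import Audio, concatenate_datasets, load_dataset

# Hypothetical locations of the annotated recordings (not the actual paths).
old_ds = load_dataset("audiofolder", data_dir="data/dkr_old")["train"]
new_ds = load_dataset("audiofolder", data_dir="data/dkr_new")["train"]

# Merge the old and new recordings into the combined training pool.
combined = concatenate_datasets([old_ds, new_ds])

# Resample all audio to the 16 kHz rate expected by wav2vec2.
combined = combined.cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a test split; the actual Test Sets A/B/C were carved out per source
# dataset (old, new, combined) rather than from a single random split.
splits = combined.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

print(len(train_ds), "training samples,", len(test_ds), "test samples")
```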

Model Fine-Tuning

  • Starting Point: The base model.
  • Training Setup (a configuration sketch follows this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=2
    • evaluation_strategy="steps"
    • save_steps=1000
    • eval_steps=1000
    • logging_steps=100
    • learning_rate=3e-4
    • num_train_epochs=100
    • fp16=True
    • dataloader_num_workers=4
    • save_total_limit=50
    • Checkpoints saved at regular intervals.
  • Hardware:
    • 1x L4 GPU (24 GB)
  • Training duration:
    • 17 hours
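
For readers who want to reproduce the setup, the hyperparameters above map directly onto Hugging Face `TrainingArguments`. The sketch below is a minimal approximation, not our exact training script: the base checkpoint name, output directory, and the CTC data collator are placeholders, and `train_ds`/`test_ds` refer to the dataset preparation sketch earlier.

```python
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

BASE_MODEL = "path/to/base-wav2vec2-checkpoint"  # placeholder for the base model

processor = Wav2Vec2Processor.from_pretrained(BASE_MODEL)
model = Wav2Vec2ForCTC.from_pretrained(BASE_MODEL)

training_args = TrainingArguments(
    output_dir="wav2vec2-dkr-v2",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size of 16
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-4,
    num_train_epochs=100,
    fp16=True,                       # mixed precision on the 24 GB L4 GPU
    dataloader_num_workers=4,
    save_total_limit=50,             # keep at most 50 checkpoints on disk
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,          # from the dataset preparation sketch
    eval_dataset=test_ds,
    data_collator=data_collator,     # a CTC padding collator from the standard
                                     # wav2vec2 fine-tuning recipe, defined elsewhere
    tokenizer=processor.feature_extractor,
)

trainer.train()
```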

Evaluation

Three models were evaluated on the test sets:

  1. Base Model (wav2vec2 without fine-tuning).
  2. Fine-Tuned Model v1 (trained on the old dataset).
  3. Fine-Tuned Model v2 (trained on the combined dataset).
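
Each model is run over a test set and its output is compared against the reference transcripts. A minimal sketch of that loop, using `jiwer` for the CER computation, is shown below; the checkpoint path and the `transcript` column name are placeholders.

```python
import torch
import jiwer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CHECKPOINT = "wav2vec2-dkr-v2"  # repeat for the base model, v1, and v2

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT).eval()

references, hypotheses = [], []
for sample in test_ds:  # e.g. Test Set A, B, or C
    # Extract features from the 16 kHz waveform.
    inputs = processor(
        sample["audio"]["array"],
        sampling_rate=16_000,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding of the most likely token at each frame.
    pred_ids = torch.argmax(logits, dim=-1)
    hypotheses.append(processor.batch_decode(pred_ids)[0])
    references.append(sample["transcript"])

print("CER:", jiwer.cer(references, hypotheses))
```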

Results

Character Error Rate (CER)

| Model | Test Set A (Old) | Test Set B (New) | Test Set C (Combined) |
| --- | --- | --- | --- |
| Base Model | 0.4754 | 0.53993 | 0.50048 |
| Fine-Tuned Model v1 | 0.2043 | 0.37898 | 0.27042 |
| Fine-Tuned Model v2 | 0.207643 | 0.31831 | 0.249256 |

Discussion

Performance Improvements

Fine-Tuned Model v2, trained on the combined dataset, reduced CER substantially relative to the Base Model on all test sets and outperformed Fine-Tuned Model v1 on Test Sets B and C. Specifically:

  • Test Set A (Old): A slight increase in CER compared to v1, possibly because the larger, more varied training set pulled the model slightly away from the distribution of the older recordings.
  • Test Set B (New): A substantial improvement over both the Base Model and v1, demonstrating the value of additional training data.
  • Test Set C (Combined): A marked reduction in CER, highlighting the efficacy of training on a diverse dataset.

Data Requirement Estimation

By analyzing the CER improvement per additional hour of training data, we estimate that reaching a CER below 15% would require approximately 43.58 additional hours of annotated data. This estimate assumes a linear improvement trend based on the current results.
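
Schematically, the linear assumption amounts to an estimate of the form below (the notation is illustrative, not the exact variables from our analysis):

```latex
\text{additional hours} \;\approx\;
  \frac{\mathrm{CER}_{\text{current}} - \mathrm{CER}_{\text{target}}}
       {\Delta\mathrm{CER} \,/\, \Delta\text{hours}}
```

where the denominator is the CER improvement observed per additional hour of training data between v1 and v2. Since gains from additional data often diminish, the true requirement may be somewhat higher than this linear estimate suggests.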

Next Steps

  • Collect additional high-quality annotated data for Kyabje Dilgo Kyentse Rinpoche’s speech.
  • Extend this approach to other speakers with large volumes of recordings.

Conclusion

The results support our hypothesis that incorporating additional training data improves transcription accuracy for speaker-specific datasets. Fine-Tuned Model v2 outperformed both the Base Model and the initial fine-tuned version on the combined test set, achieving a CER of 24.93%, compared with 50.05% for the Base Model and 27.04% for Fine-Tuned Model v1. These findings underscore the potential of personalized STT models for automating large-scale transcription, reducing manual effort, and improving access to unique voices and teachings.

Resources

Models

Datasets

Test Sets
