Title: Customizing Speech-to-Text: Fine-Tuning a Model for Kyabje Dilgo Kyentse Rinpoche’s Unique Voice
Introduction
Fine-tuning a Speech-to-Text (STT) model for specific speakers can significantly enhance transcription accuracy by adapting the model to unique speech patterns and pronunciations. Previously, we fine-tuned a base wav2vec2 model using a smaller, speaker-specific dataset of Kyabje Dilgo Kyentse Rinpoche’s annotated speech. This effort yielded promising results, improving transcription performance for this speaker. With access to a much larger dataset of Kyabje Dilgo Kyentse Rinpoche’s speech, we hypothesize that additional data will lead to further improvement in transcription accuracy. This blog post outlines the updated fine-tuning process, presents the evaluation results, and discusses insights gained from this new iteration. For more details on the previous experiment, refer to this blog post.
Background
Single-speaker datasets often present unique challenges for general-purpose STT models, resulting in higher Character Error Rates (CER) due to differences in speech patterns, pronunciation, and intonation. Similar to how accent-specific fine-tuning enhances performance for regional accents, fine-tuning the base model on Kyabje Dilgo Kyentse Rinpoche’s speech data can bridge the performance gap. This approach could also be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.
Increasing the amount of speaker-specific training data can significantly improve accuracy. Based on the improvements observed with the current dataset, we can also estimate how many additional hours of training data would be required to push the CER below a target such as 5%, assuming a roughly linear reduction in CER with additional data.
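To make that linear-trend assumption explicit, the extrapolation can be written as follows; the symbols are introduced here purely for illustration, with $k$ denoting the observed CER reduction per additional hour of annotated speech:

$$
\mathrm{CER}(h) \approx \mathrm{CER}_{\text{current}} - k\,h,
\qquad
h_{\text{required}} = \frac{\mathrm{CER}_{\text{current}} - \mathrm{CER}_{\text{target}}}{k}
$$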
Methodology
Dataset Preparation
| Dataset | Duration (hh:mm:ss) | Samples | Description |
|---|---|---|---|
| Old Dataset | 09:06:00 | 9316 | Annotated recordings of Kyabje Dilgo Kyentse Rinpoche’s speech, used for fine-tuning the initial model (v1). |
| New Dataset | 06:04:03 | 5609 | Additional annotated recordings, collected to expand the training dataset and improve performance. |
| Combined Dataset | 15:10:03 | 14925 | Merged dataset comprising the old and new data, used for fine-tuning the updated model (v2). |
| Test Set A | 01:09:32 | 1165 | Test data held out from the old dataset to evaluate performance on previously unseen recordings. |
| Test Set B | 00:44:30 | 702 | Test data held out from the new dataset to assess performance on the newly annotated recordings. |
| Test Set C | 01:54:02 | 1867 | Combined test data, including samples from both the old and new datasets, to evaluate overall performance. |
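The merging and splitting above can be reproduced with the Hugging Face `datasets` library. The snippet below is a minimal sketch under assumed column names (`audio`, `transcription`), file paths, and split fractions; the actual preparation pipeline may differ.

```python
from datasets import load_dataset, concatenate_datasets, Audio

# Assumed file layout: one CSV per dataset with "audio" (file path) and
# "transcription" columns; the real annotation format may differ.
old_ds = load_dataset("csv", data_files="old_dataset.csv", split="train")
new_ds = load_dataset("csv", data_files="new_dataset.csv", split="train")

# Decode audio at 16 kHz, the sampling rate wav2vec2 expects.
old_ds = old_ds.cast_column("audio", Audio(sampling_rate=16_000))
new_ds = new_ds.cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a test split from each dataset (Test Set A and Test Set B); the
# split fraction here is illustrative, not the exact proportion used.
old_split = old_ds.train_test_split(test_size=0.1, seed=42)
new_split = new_ds.train_test_split(test_size=0.1, seed=42)

# Merge the remaining training portions into the combined training set, and
# the two held-out splits into the combined test set (Test Set C).
train_combined = concatenate_datasets([old_split["train"], new_split["train"]])
test_a, test_b = old_split["test"], new_split["test"]
test_c = concatenate_datasets([test_a, test_b])
```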
Model Fine-Tuning
- Starting Point: The base wav2vec2 model (no speaker-specific fine-tuning).
- Training Setup (see the configuration sketch after this list):
per_device_train_batch_size=8
gradient_accumulation_steps=2
evaluation_strategy="steps"
save_steps=1000
eval_steps=1000
logging_steps=100
learning_rate=3e-4
num_train_epochs=100
fp16=True
dataloader_num_workers=4
save_total_limit=50
- Checkpoints saved at regular intervals.
- Hardware:
- 1x L4 GPU (24 GB)
- Training duration:
- 17 hours
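These hyperparameters map directly onto a Hugging Face `TrainingArguments` object. The sketch below shows how they would typically be set up; the base checkpoint path and output directory are placeholders, since the post does not name them, and the surrounding `Trainer` wiring (datasets, CTC data collator) is only indicated in comments.

```python
from transformers import Wav2Vec2ForCTC, TrainingArguments

# Placeholder path; the exact base checkpoint is not named in this post.
model = Wav2Vec2ForCTC.from_pretrained("path/to/base-wav2vec2-checkpoint")

training_args = TrainingArguments(
    output_dir="wav2vec2-dkr-v2",   # placeholder output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    save_steps=1000,
    eval_steps=1000,
    logging_steps=100,
    learning_rate=3e-4,
    num_train_epochs=100,
    fp16=True,
    dataloader_num_workers=4,
    save_total_limit=50,
)

# These arguments would then be passed to a standard Trainer along with the
# prepared train/eval datasets and a CTC padding data collator, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_combined, eval_dataset=test_c,
#                   data_collator=data_collator,
#                   tokenizer=processor.feature_extractor)
# trainer.train()
```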
Evaluation
Three models were evaluated on each of the three test sets (a CER evaluation sketch follows this list):
- Base Model (wav2vec2 without fine-tuning).
- Fine-Tuned Model v1 (trained on the old dataset).
- Fine-Tuned Model v2 (trained on the combined dataset).
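CER for each model/test-set pair can be computed with the `evaluate` library's `cer` metric (or equivalently with `jiwer`). The snippet below is a sketch of this kind of evaluation loop, reusing the datasets from the earlier sketch; it is not the exact evaluation script used here.

```python
import torch
import evaluate
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

cer_metric = evaluate.load("cer")

def evaluate_cer(checkpoint, test_set):
    """Transcribe a test set with a given checkpoint and return its CER."""
    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval()

    predictions, references = [], []
    for sample in test_set:
        inputs = processor(
            sample["audio"]["array"],
            sampling_rate=16_000,
            return_tensors="pt",
        )
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        predictions.append(processor.batch_decode(pred_ids)[0])
        references.append(sample["transcription"])

    return cer_metric.compute(predictions=predictions, references=references)

# Example: CER of the v2 checkpoint on the combined test set (Test Set C).
# print(evaluate_cer("wav2vec2-dkr-v2/checkpoint-XXXX", test_c))
```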
Results
Character Error Rate (CER)
| Model | Test Set A (Old) | Test Set B (New) | Test Set C (Combined) |
|---|---|---|---|
| Base Model | 0.4754 | 0.53993 | 0.50048 |
| Fine-Tuned Model v1 | 0.2043 | 0.37898 | 0.27042 |
| Fine-Tuned Model v2 | 0.207643 | 0.31831 | 0.249256 |
Discussion
Performance Improvements
Fine-Tuned Model v2, trained on the combined dataset, reduced CER on every test set relative to the Base Model and outperformed Fine-Tuned Model v1 on Test Sets B and C. Specifically:
- Test Set A (Old): A marginal increase in CER compared to v1 (0.2076 vs. 0.2043), possibly because the larger, more varied training set shifted the model slightly away from the characteristics of the older recordings.
- Test Set B (New): A substantial improvement over both the Base Model and v1, demonstrating the value of additional training data.
- Test Set C (Combined): A marked reduction in CER, highlighting the efficacy of training on a diverse dataset.
Data Requirement Estimation
By analyzing the CER improvement per additional hour of training data, we estimate that roughly 43.58 more hours of annotated data would be needed to bring the CER below 15%. This extrapolation assumes the linear improvement trend observed so far continues (see the sketch below).
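As an illustration of this kind of estimate, the snippet below extrapolates linearly from the Test Set C results reported above. The choice of data points and the per-hour improvement rate are assumptions made for illustration, so the resulting figure will not necessarily match the estimate quoted here.

```python
# Observed points (training hours of speaker-specific data, CER on Test Set C).
v1_hours, v1_cer = 9.10, 0.27042    # old dataset only (~09:06:00)
v2_hours, v2_cer = 15.17, 0.249256  # combined dataset (~15:10:03)

# CER reduction per additional hour of annotated speech, assuming a linear trend.
rate = (v1_cer - v2_cer) / (v2_hours - v1_hours)

# Additional hours (beyond the combined dataset) needed to reach a target CER.
target_cer = 0.15
additional_hours = (v2_cer - target_cer) / rate
print(f"Estimated additional annotated hours needed: {additional_hours:.1f}")
```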
Next Steps
- Collect additional high-quality annotated data for Kyabje Dilgo Kyentse Rinpoche’s speech.
- Extend this approach to other speakers with large volumes of recordings.
Conclusion
The results validate our hypothesis that incorporating additional training data significantly improves transcription accuracy for speaker-specific datasets. Fine-Tuned Model v2 outperformed both the Base Model and the initial fine-tuned version, reaching a CER of 24.93% on the combined test set, compared with 50.05% for the Base Model and 27.04% for Fine-Tuned Model v1. These findings underscore the potential of personalized STT models for automating large-scale transcription tasks, reducing manual effort, and improving access to unique voices and teachings.