Customizing Speech-to-Text: Fine-Tuning a Model for Tai Situ Rinpoche’s Unique Voice

Introduction

Fine-tuning a Speech-to-Text (STT) model for specific speakers can significantly enhance transcription accuracy by adapting the model to unique speech patterns and pronunciations. Previously, we fine-tuned a base wav2vec2 model using a smaller, speaker-specific dataset of Tai Situ Rinpoche’s annotated speech. This effort yielded promising results, improving transcription performance for this speaker. With access to a much larger dataset of Tai Situ Rinpoche’s speech, we hypothesize that additional data will lead to further improvement in transcription accuracy. This blog post outlines the updated fine-tuning process, presents the evaluation results, and discusses insights gained from this new iteration. For more details on the previous experiment, refer to this blog post.

Background

Individual speakers often pose challenges for general-purpose STT models, which tend to produce higher Character Error Rates (CER) on their speech due to differences in speech patterns, pronunciation, and intonation. Just as accent-specific fine-tuning improves performance on regional accents, fine-tuning the base model on Tai Situ Rinpoche’s speech data can close this performance gap. The same approach could be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.

Increasing the amount of speaker-specific training data can significantly improve accuracy. Based on the improvement observed when moving from the old dataset to the combined dataset, and assuming CER continues to fall roughly linearly as data is added, we can estimate how many training hours would be needed to bring the CER below 5%.

Methodology

Dataset Preparation

The datasets used for fine-tuning and evaluation are summarized below; a preparation sketch follows the table.

| Dataset | Duration (hh:mm:ss) | Samples | Description |
| --- | --- | --- | --- |
| Old Dataset | 01:10:00 | 838 | Annotated recordings of Tai Situ Rinpoche’s speech, used for fine-tuning the initial model (v1). |
| New Dataset | 02:39:00 | 2151 | Additional annotated recordings, collected to expand the training dataset and improve performance. |
| Combined Dataset | 03:49:00 | 2989 | Merged dataset comprising the old and new data, used for fine-tuning the updated model (v2). |
| Test Set A | 00:08:12 | 105 | Test data held out from the old training dataset to evaluate performance on previously unseen data. |
| Test Set B | 00:18:28 | 269 | Test data held out from the new training dataset to assess performance on newly annotated recordings. |
| Test Set C | 00:26:40 | 374 | Combined test data, including samples from both the old and new datasets, to evaluate overall performance. |
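
A minimal sketch of how the combined dataset could be assembled and split, assuming the recordings are organized as audio files with a metadata file mapping each file to its transcription. The directory paths, the "transcription" column name, and the split ratio are illustrative assumptions, not the exact layout used in this experiment:

```python
from datasets import load_dataset, concatenate_datasets, Audio

# Load the old and new annotated recordings (audiofolder expects a metadata.csv
# with a file_name column plus extra columns such as "transcription").
old = load_dataset("audiofolder", data_dir="data/old")["train"]
new = load_dataset("audiofolder", data_dir="data/new")["train"]

# Merge old and new data into the combined training pool and resample to 16 kHz.
combined = concatenate_datasets([old, new]).cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a small test split; the combined held-out portion corresponds to Test Set C.
splits = combined.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```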

Model Fine-Tuning

  • Starting Point: the base wav2vec2 model used in the previous experiment.
  • Training Setup (a configuration sketch follows this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=1
    • learning_rate=1e-6
    • num_train_epochs=200
    • fp16=True
    • Checkpoints saved at regular intervals.
  • Hardware:
    • 1x L4 GPU (24 GB)
  • Training duration:
    • 15 hours
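
For reference, a minimal sketch of this fine-tuning setup with the Hugging Face Trainer. The hyperparameters mirror the list above; the checkpoint name, data collator, and save/logging intervals are illustrative, and the preprocessing step that converts raw audio and transcripts into input_values and labels columns is assumed to have been applied already:

```python
from dataclasses import dataclass
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Trainer, TrainingArguments

BASE_CHECKPOINT = "path/to/base-wav2vec2-checkpoint"  # placeholder, not the actual model ID

processor = Wav2Vec2Processor.from_pretrained(BASE_CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(BASE_CHECKPOINT)

@dataclass
class DataCollatorCTCWithPadding:
    """Pads audio inputs and label sequences separately for CTC training."""
    processor: Wav2Vec2Processor

    def __call__(self, features):
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=True, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the CTC loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        return batch

training_args = TrainingArguments(
    output_dir="wav2vec2-tsr-v2",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    num_train_epochs=200,
    fp16=True,
    save_steps=500,      # checkpoints at regular intervals (interval is an assumption)
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    train_dataset=train_ds,  # prepared as in the dataset sketch above
    tokenizer=processor.feature_extractor,
)
trainer.train()
```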

Evaluation

Three models were evaluated on the test sets; a scoring sketch follows the list:

  1. Base Model (wav2vec2 without fine-tuning).
  2. Fine-Tuned Model v1 (trained on the old dataset).
  3. Fine-Tuned Model v2 (trained on the combined dataset).
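
A sketch of how each model’s CER could be computed on a test set, using the evaluate library’s cer metric and greedy CTC decoding for simplicity. The checkpoint names and the "transcription" column are placeholders; the actual model IDs are linked in the Resources section:

```python
import torch
import evaluate
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

cer_metric = evaluate.load("cer")

def transcribe_and_score(checkpoint, test_ds):
    """Run greedy CTC decoding over a test set and return its character error rate."""
    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval()

    predictions, references = [], []
    for sample in test_ds:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        predictions.append(processor.batch_decode(pred_ids)[0])
        references.append(sample["transcription"])  # column name is an assumption

    return cer_metric.compute(predictions=predictions, references=references)

# Placeholder checkpoint names for the three models being compared.
for name, ckpt in [("base", "base-checkpoint"), ("v1", "fine-tuned-v1"), ("v2", "fine-tuned-v2")]:
    print(name, transcribe_and_score(ckpt, test_ds))
```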

Results

Character Error Rate (CER)

| Model | Test Set A (Old) | Test Set B (New) | Test Set C (Combined) |
| --- | --- | --- | --- |
| Base Model | 0.097965 | 0.136982 | 0.126028 |
| Fine-Tuned Model v1 | 0.0793 | 0.1164 | 0.1061 |
| Fine-Tuned Model v2 | 0.080979 | 0.080739 | 0.0808 |

Discussion

Performance Improvements

Fine-Tuned Model v2, trained on the combined dataset, reduced CER on every test set relative to the Base Model and outperformed Fine-Tuned Model v1 on the new and combined test sets. Specifically:

  • Test Set A (Old): A slight increase in CER compared to v1, possibly due to increased dataset complexity.
  • Test Set B (New): A substantial improvement over both the Base Model and v1, demonstrating the value of additional training data.
  • Test Set C (Combined): A marked reduction in CER, highlighting the efficacy of training on a diverse dataset.

Data Requirement Estimation

By analyzing the CER improvement per additional hour of training data, we estimate that bringing the CER below 5% would require approximately 3.0 additional hours of annotated data. This estimate assumes the linear improvement trend observed so far continues; the extrapolation is sketched below.
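
A back-of-the-envelope version of this extrapolation, using the Test Set C CERs from the results table. It is an illustrative reconstruction rather than the exact calculation behind the estimate, and it lands close to the roughly three hours quoted above:

```python
# Linear extrapolation of CER vs. hours of training data (combined test set, Test Set C).
# Assumes CER keeps falling at the same average rate per added hour, which is a
# simplification; real gains usually taper off as data grows.
hours_v1, cer_v1 = 1.17, 0.1061   # ~01:10:00 of data, Fine-Tuned Model v1
hours_v2, cer_v2 = 3.82, 0.0808   # ~03:49:00 of data, Fine-Tuned Model v2
target_cer = 0.05

rate = (cer_v1 - cer_v2) / (hours_v2 - hours_v1)   # CER reduction per extra hour of data
extra_hours = (cer_v2 - target_cer) / rate          # hours needed beyond the combined dataset
print(f"Estimated additional hours needed: {extra_hours:.1f}")  # roughly 3 hours
```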

Next Steps

  • Collect additional high-quality annotated data for Tai Situ Rinpoche’s speech.
  • Extend this approach to other speakers with large volumes of recordings.

Conclusion

The results validate our hypothesis that incorporating additional training data significantly improves transcription accuracy for speaker-specific datasets. Fine-Tuned Model v2 outperformed both the Base Model and the initial fine-tuned version, achieving a CER of 8.08% on the combined test set. These findings underscore the potential of personalized STT models for automating large-scale transcription tasks, reducing manual effort, and enhancing access to unique voices and teachings.

Resources

Models

Datasets

Test Sets