Customizing Speech-to-Text: Fine-Tuning a Model for Tai Situ Rinpoche’s Unique Voice

Introduction

Fine-tuning a Speech-to-Text (STT) model for specific speakers can significantly enhance transcription accuracy by adapting the model to unique speech patterns and pronunciations. Previously, we fine-tuned a base wav2vec2 model using a smaller, speaker-specific dataset of Tai Situ Rinpoche’s annotated speech. This effort yielded promising results, improving transcription performance for this speaker. With access to a much larger dataset of Tai Situ Rinpoche’s speech, we hypothesize that additional data will lead to further improvement in transcription accuracy. This blog post outlines the updated fine-tuning process, presents the evaluation results, and discusses insights gained from this new iteration. For more details on the previous experiment, refer to this blog post.

Background

Individual speakers often pose challenges for general-purpose STT models, which tend to produce higher Character Error Rates (CER) on their speech due to differences in speech patterns, pronunciation, and intonation. Just as accent-specific fine-tuning improves performance on regional accents, fine-tuning the base model on Tai Situ Rinpoche’s speech data can close this performance gap. The same approach could be generalized to other speakers with large annotated datasets, potentially saving significant annotation time and effort through iterative refinement.

Increasing the amount of speaker-specific training data can significantly improve accuracy. Based on the improvement observed when moving from the old dataset to the combined dataset, and assuming CER continues to fall roughly linearly as data is added, we can estimate how many training hours would be needed to bring the CER below 5%.

Methodology

Dataset Preparation

The datasets used for fine-tuning and evaluation are summarized below; a preparation sketch follows the table.

| Dataset | Duration (hh:mm:ss) | Samples | Description |
| --- | --- | --- | --- |
| Old Dataset | 01:10:00 | 838 | Annotated recordings of Tai Situ Rinpoche’s speech, used for fine-tuning the initial model (v1). |
| New Dataset | 02:39:00 | 2151 | Additional annotated recordings, collected to expand the training dataset and improve performance. |
| Combined Dataset | 03:49:00 | 2989 | Merged dataset comprising the old and new data, used for fine-tuning the updated model (v2). |
| Test Set A | 00:08:12 | 105 | Test data held out from the old training dataset to evaluate performance on previously unseen data. |
| Test Set B | 00:18:28 | 269 | Test data held out from the new training dataset to assess performance on newly annotated recordings. |
| Test Set C | 00:26:40 | 374 | Combined test data, including samples from both the old and new datasets, to evaluate overall performance. |
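
A minimal sketch of how the combined dataset could be assembled and split, assuming the recordings are organized as audio files with a metadata file mapping each file to its transcription. The directory paths, the "transcription" column name, and the split ratio are illustrative assumptions, not the exact layout used in this experiment:

```python
from datasets import load_dataset, concatenate_datasets, Audio

# Load the old and new annotated recordings (audiofolder expects a metadata.csv
# with a file_name column plus extra columns such as "transcription").
old = load_dataset("audiofolder", data_dir="data/old")["train"]
new = load_dataset("audiofolder", data_dir="data/new")["train"]

# Merge old and new data into the combined training pool and resample to 16 kHz.
combined = concatenate_datasets([old, new]).cast_column("audio", Audio(sampling_rate=16_000))

# Hold out a small test split; the combined held-out portion corresponds to Test Set C.
splits = combined.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```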

Model Fine-Tuning

  • Starting Point: the base wav2vec2 model used in the previous experiment.
  • Training Setup (a configuration sketch follows this list):
    • per_device_train_batch_size=8
    • gradient_accumulation_steps=1
    • learning_rate=1e-6
    • num_train_epochs=200
    • fp16=True
    • Checkpoints saved at regular intervals.
  • Hardware:
    • 1x L4 GPU (24 GB)
  • Training duration:
    • 15 hours
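
For reference, a minimal sketch of this fine-tuning setup with the Hugging Face Trainer. The hyperparameters mirror the list above; the checkpoint name, data collator, and save/logging intervals are illustrative, and the preprocessing step that converts raw audio and transcripts into input_values and labels columns is assumed to have been applied already:

```python
from dataclasses import dataclass
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Trainer, TrainingArguments

BASE_CHECKPOINT = "path/to/base-wav2vec2-checkpoint"  # placeholder, not the actual model ID

processor = Wav2Vec2Processor.from_pretrained(BASE_CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(BASE_CHECKPOINT)

@dataclass
class DataCollatorCTCWithPadding:
    """Pads audio inputs and label sequences separately for CTC training."""
    processor: Wav2Vec2Processor

    def __call__(self, features):
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=True, return_tensors="pt")
        # Replace padding with -100 so it is ignored by the CTC loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        return batch

training_args = TrainingArguments(
    output_dir="wav2vec2-tsr-v2",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    num_train_epochs=200,
    fp16=True,
    save_steps=500,      # checkpoints at regular intervals (interval is an assumption)
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(processor=processor),
    train_dataset=train_ds,  # prepared as in the dataset sketch above
    tokenizer=processor.feature_extractor,
)
trainer.train()
```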

Evaluation

Three models were evaluated on the test sets; a scoring sketch follows the list:

  1. Base Model (wav2vec2 without fine-tuning).
  2. Fine-Tuned Model v1 (trained on the old dataset).
  3. Fine-Tuned Model v2 (trained on the combined dataset).
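
A sketch of how each model’s CER could be computed on a test set, using the evaluate library’s cer metric and greedy CTC decoding for simplicity. The checkpoint names and the "transcription" column are placeholders; the actual model IDs are linked in the Resources section:

```python
import torch
import evaluate
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

cer_metric = evaluate.load("cer")

def transcribe_and_score(checkpoint, test_ds):
    """Run greedy CTC decoding over a test set and return its character error rate."""
    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval()

    predictions, references = [], []
    for sample in test_ds:
        inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        predictions.append(processor.batch_decode(pred_ids)[0])
        references.append(sample["transcription"])  # column name is an assumption

    return cer_metric.compute(predictions=predictions, references=references)

# Placeholder checkpoint names for the three models being compared.
for name, ckpt in [("base", "base-checkpoint"), ("v1", "fine-tuned-v1"), ("v2", "fine-tuned-v2")]:
    print(name, transcribe_and_score(ckpt, test_ds))
```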

Results

Character Error Rate (CER)

| Model | Test Set A (Old) | Test Set B (New) | Test Set C (Combined) |
| --- | --- | --- | --- |
| Base Model | 0.097965 | 0.136982 | 0.126028 |
| Fine-Tuned Model v1 | 0.0793 | 0.1164 | 0.1061 |
| Fine-Tuned Model v2 | 0.080979 | 0.080739 | 0.0808 |

Discussion

Performance Improvements

Fine-Tuned Model v2, trained on the combined dataset, reduced CER on every test set relative to the Base Model and outperformed Fine-Tuned Model v1 on the new and combined test sets. Specifically:

  • Test Set A (Old): A slight increase in CER compared to v1, possibly due to increased dataset complexity.
  • Test Set B (New): A substantial improvement over both the Base Model and v1, demonstrating the value of additional training data.
  • Test Set C (Combined): A marked reduction in CER, highlighting the efficacy of training on a diverse dataset.

Data Requirement Estimation

By analyzing the CER improvement per additional hour of training data, we estimate that bringing the CER below 5% would require approximately 3.0 additional hours of annotated data. This estimate assumes the linear improvement trend observed so far continues; the extrapolation is sketched below.
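
A back-of-the-envelope version of this extrapolation, using the Test Set C CERs from the results table. It is an illustrative reconstruction rather than the exact calculation behind the estimate, and it lands close to the roughly three hours quoted above:

```python
# Linear extrapolation of CER vs. hours of training data (combined test set, Test Set C).
# Assumes CER keeps falling at the same average rate per added hour, which is a
# simplification; real gains usually taper off as data grows.
hours_v1, cer_v1 = 1.17, 0.1061   # ~01:10:00 of data, Fine-Tuned Model v1
hours_v2, cer_v2 = 3.82, 0.0808   # ~03:49:00 of data, Fine-Tuned Model v2
target_cer = 0.05

rate = (cer_v1 - cer_v2) / (hours_v2 - hours_v1)   # CER reduction per extra hour of data
extra_hours = (cer_v2 - target_cer) / rate          # hours needed beyond the combined dataset
print(f"Estimated additional hours needed: {extra_hours:.1f}")  # roughly 3 hours
```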

Next Steps

  • Collect additional high-quality annotated data for Tai Situ Rinpoche’s speech.
  • Extend this approach to other speakers with large volumes of recordings.

Conclusion

The results validate our hypothesis that incorporating additional training data significantly improves transcription accuracy for speaker-specific datasets. Fine-Tuned Model v2 outperformed both the Base Model and the initial fine-tuned version, achieving a CER of 8.08% on the combined test set. These findings underscore the potential of personalized STT models for automating large-scale transcription tasks, reducing manual effort, and enhancing access to unique voices and teachings.

Resources

Models

Datasets

Test Sets