Discovery Phase Report: Development of Speech-to-Text for Garchen Rinpoche Teachings
Introduction
This report summarizes the discovery phase for developing a specialized Speech-to-Text (STT) solution for Garchen Rinpoche’s Tibetan teachings. The project aims to improve transcription accuracy and efficiency for preserving these valuable Buddhist teachings. Our discovery process was conducted in multiple phases, systematically evaluating existing resources, testing various models, and measuring the impact of customization efforts.
Phase 1: Audio Resource Cataloging
We began by creating a comprehensive catalog of all available Garchen Rinpoche audio recordings with detailed metadata. This catalog provided crucial insights into:
- Total number of audio files available
- Cumulative duration of teachings
- Audio quality and recording environments
- Content categorization and metadata structure
Resource: Complete Audio Catalog (Google Sheets)
This cataloging effort established the foundation for all subsequent work, allowing us to understand the scope and characteristics of the audio data we would be working with.
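To illustrate the kind of aggregation behind the catalog, here is a minimal sketch of summarizing per-recording metadata rows into the headline statistics listed above. The `AudioEntry` fields and `catalog_summary` helper are illustrative assumptions, not the actual catalog schema:

```python
from dataclasses import dataclass

@dataclass
class AudioEntry:
    filename: str
    duration_sec: float   # length of the recording in seconds
    quality: str          # e.g. "studio", "field"
    category: str         # content category from the metadata sheet

def catalog_summary(entries):
    """Aggregate catalog rows into file count, total hours, and quality breakdown."""
    total_hours = sum(e.duration_sec for e in entries) / 3600
    by_quality = {}
    for e in entries:
        by_quality[e.quality] = by_quality.get(e.quality, 0) + 1
    return {
        "files": len(entries),
        "hours": round(total_hours, 2),
        "by_quality": by_quality,
    }
```

A spreadsheet export can be loaded row by row into `AudioEntry` objects and summarized in one pass.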
Phase 2: Model Evaluation on Garchen Rinpoche Audio
In this critical phase, we evaluated multiple existing STT models using a test set of segmented Garchen Rinpoche audio recordings. The evaluation compared model-generated transcripts against human-produced reference transcripts using industry-standard metrics:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Syllable Error Rate (SER)
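All three metrics are edit-distance error rates that differ only in the token unit: characters for CER, whitespace-separated words for WER, and syllables for SER. A minimal sketch (the Tibetan syllable split on the tsheg mark `་` is an assumption about how SER was tokenized, and the helper names are ours):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

def cer(ref, hyp):
    return error_rate(list(ref), list(hyp))

def wer(ref, hyp):
    return error_rate(ref.split(), hyp.split())

def ser(ref, hyp):
    # Assumed syllable tokenization: split on the Tibetan tsheg '་'
    return error_rate([t for t in ref.split("་") if t],
                      [t for t in hyp.split("་") if t])
```

Counting which of the three edit operations dominates the alignment gives the "Major Error Type" column in the results table.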
Key Results
| Model | CER | WER | SER | Major Error Type |
|---|---|---|---|---|
| General STT | 27.53% | 57.24% | 51.15% | Substitutions (75%) |
| Situ Rinpoche | 28.92% | 60.68% | 53.68% | Substitutions (75%) |
| Dilgo Khyentse | 65.96% | 92.63% | 91.55% | Deletions (33%) |
Our evaluation determined that the General STT model performed best across all metrics, contrary to the initial hypothesis that models trained on similar religious speech would outperform a general-purpose model.
Resources:
Based on these findings, we selected the General STT model as our baseline for generating initial transcripts, which could then be edited by human transcribers.
Phase 3: Data Collection and Benchmark Creation
Following model selection, we focused on:
- Tracking transcribed data until we accumulated 5 hours of training data
- Creating a benchmark dataset with audio samples not included in the training data
This phase was critical for establishing both:
- A dataset for fine-tuning the selected base model
- A standardized benchmark for objective evaluation of model improvements
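Keeping the benchmark free of training audio means splitting at the level of source recordings, not individual segments, so no clip from a benchmark recording leaks into training. A hedged sketch of one way to do this (field names and the helper are illustrative, not the actual pipeline):

```python
import random

def split_by_recording(segments, benchmark_frac=0.15, seed=42):
    """Split segment rows into train/benchmark sets by parent recording ID.

    Every segment of a given recording lands on one side only, so the
    benchmark contains no audio that overlaps the training data.
    """
    recordings = sorted({s["recording_id"] for s in segments})
    rng = random.Random(seed)
    rng.shuffle(recordings)
    n_bench = max(1, int(len(recordings) * benchmark_frac))
    bench_ids = set(recordings[:n_bench])
    train = [s for s in segments if s["recording_id"] not in bench_ids]
    bench = [s for s in segments if s["recording_id"] in bench_ids]
    return train, bench
```

A fixed seed keeps the split reproducible across pipeline runs.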
Resources:
Phase 4: Model Fine-tuning for Garchen Rinpoche
Upon collecting sufficient training data (5+ hours), we fine-tuned the base model specifically for Garchen Rinpoche’s unique speech patterns. The fine-tuning process involved:
- Base Model: ganga4364/mms_300_v4.96000
- Architecture: Wav2Vec2ForCTC
- Training Dataset: 5:33:00 (hh:mm:ss; 3,071 samples)
- Test Dataset: 1:04:12 (hh:mm:ss; 893 samples)
Fine-tuning Results
| Metric | Base Model | Fine-Tuned Model | Absolute Improvement |
|---|---|---|---|
| CER | 27.67% | 22.93% | 4.74 pp |
| WER | 45.92% | 39.42% | 6.50 pp |
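The improvement column reports the absolute drop in the error rate, in percentage points; the corresponding relative reduction can be derived from the same figures. A small helper makes the arithmetic explicit (the function is ours, for illustration only):

```python
def improvement(base_pct, tuned_pct):
    """Absolute (percentage-point) and relative (%) reduction of an error rate."""
    absolute = base_pct - tuned_pct
    relative = absolute / base_pct * 100
    return round(absolute, 2), round(relative, 2)

print(improvement(27.67, 22.93))  # CER: base vs fine-tuned
print(improvement(45.92, 39.42))  # WER: base vs fine-tuned
```

So the 4.74-point CER drop corresponds to a roughly 17% relative reduction, and the 6.50-point WER drop to roughly 14%.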
The fine-tuning process yielded significant improvements across all error metrics, with training checkpoints showing steady progress:
- 5,000 steps: 27.41% CER
- 10,000 steps: 23.37% CER
- 19,000 steps: 22.93% CER (final)
Resources:
Phase 5: Transcription Speed Analysis
The final phase examined whether the effort invested in fine-tuning was justified by measurable improvements in transcription speed. We conducted a comparative study with four experienced transcribers using three different approaches:
- Manual transcription (no assistance)
- Base model-assisted transcription
- Fine-tuned model-assisted transcription
Speed Analysis Results
| Metric | Manual | Base Model | Fine-tuned Model |
|---|---|---|---|
| Average Speed (chars/min) | 47.86 | 49.85 | 77.67 |
| Time Saved vs Manual | - | 3.52 min | 40.60 min |
| Speed Improvement | - | 3.31% | 38.25% |
Key Findings
- Base Model Paradox: Surprisingly, base model assistance sometimes slowed down transcription (3 of 4 transcribers were slower)
- Fine-tuned Model Success: All transcribers showed significant speed improvements with the fine-tuned model (12-47% improvement)
- Individual Variations: Transcribers showed different adaptation patterns to the assisted workflows
Resource: Transcription Speed Analysis
Conclusions
Our discovery phase yielded several important insights:
- Model Selection: The general STT model outperformed specialized models for Garchen Rinpoche audio
- Fine-tuning Effectiveness: Speaker-specific fine-tuning reduced the error rate significantly (4.74 percentage-point CER improvement)
- Practical Impact: Fine-tuned model assistance increased transcription speed by 38.25% on average
- Unexpected Finding: Base model assistance sometimes decreased transcription speed, revealing the importance of high-accuracy draft transcripts
Interactive Testing
To make our fine-tuned model accessible for testing, we’ve created a Hugging Face Space that allows users to:
- Upload Garchen Rinpoche audio files
- Generate transcripts using our fine-tuned model
- Download SRT subtitle files for video integration
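SRT export is a straightforward rendering of timed transcript segments into numbered blocks with `HH:MM:SS,mmm` timestamps. A self-contained sketch (the helper names and the `(start, end, text)` segment shape are our assumptions, not the Space's actual code):

```python
def to_srt_time(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render (start_sec, end_sec, text) segments as SRT subtitle blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

The resulting string can be saved with an `.srt` extension and loaded directly by most video players and editors.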
Resources:
Next Steps
Based on these findings, we recommend:
- Continue using the fine-tuned model for all Garchen Rinpoche transcription projects
- Expand the training dataset to further improve model accuracy
- Develop specialized transcriber training for optimal use of STT-assisted workflows
- Investigate why base models sometimes reduce efficiency and develop mitigation strategies
- Promote the interactive testing space to gather more user feedback
This discovery phase report was prepared by the OpenPecha team as part of our ongoing efforts to preserve and make accessible important Buddhist teachings through modern language technologies.