# Tibetan ASR Training Data Estimation: Beyond Linear Projections
## Current Experimental Results
Our current experiments with fine-tuning the MMS Wav2Vec2 model on Tibetan teachings have shown promising results:
| Model Stage | Training Data (hours) | CER (%) |
|---|---|---|
| Base Model | 0 | 27.67 |
| Fine-tuned | 5.6 | 22.93 |
## Understanding Model Improvement Patterns
### Initial Linear Projection
Based on our two data points, a simple linear projection gives:

CER ≈ 27.67 − 0.846 × Hours

This would indicate:

- **Improvement rate**: ~0.85% CER reduction per hour of training data
- **To reach 5% CER**: approximately 26.8 hours of high-quality training data needed
- **Additional data required**: ~21.2 hours beyond our current dataset
**Important Note**: While this linear projection provides a rough estimate based on our current data points, speech recognition model improvement typically follows a non-linear pattern.
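The arithmetic behind these figures is straightforward to reproduce. A minimal Python sketch using only the two measured points from the table above:

```python
# Two measured points: (hours of fine-tuning data, CER %)
h0, cer0 = 0.0, 27.67   # base MMS Wav2Vec2 model
h1, cer1 = 5.6, 22.93   # current fine-tuned checkpoint

slope = (cer1 - cer0) / (h1 - h0)       # CER change per training hour
print(f"slope: {slope:.3f} %/hour")     # ~ -0.846

target = 5.0
hours_needed = (target - cer0) / slope  # solve cer0 + slope * h = target
print(f"hours for {target}% CER: {hours_needed:.1f}")       # ~26.8
print(f"additional hours needed: {hours_needed - h1:.1f}")  # ~21.2
```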
## Realistic Model Improvement Trajectory
In practice, ASR model performance improvement typically follows a logarithmic or exponential decay curve rather than a linear relationship. Here’s why:

- **Initial Rapid Gains**: The first batch of domain-specific training data often yields the most significant improvements as the model adapts to the target language characteristics, domain vocabulary, and acoustic properties.
- **Diminishing Returns**: As more training data is added, the rate of improvement tends to decrease. Each additional hour of training data generally yields smaller CER reductions.
- **Performance Floor**: Models often approach a performance floor where additional data alone doesn’t significantly improve results without architectural changes or improved data quality.
A more realistic projection might follow this pattern:
| Training Data (hours) | Projected CER (%) | CER Decrease (%) | Notes |
|---|---|---|---|
| 0 (Base) | 27.67 | - | Starting point |
| 5.6 | 22.93 | 4.74 | Current progress |
| 10 | 17-19 | ~5 | Significant gains from doubling data |
| 15 | 14-16 | ~3 | Continued improvement at slower rate |
| 20 | 12-14 | ~2 | Diminishing returns begin to appear |
| 30 | 10-12 | ~2 | Approaching initial performance floor |
| 50+ | 6-8 | ~4 | Additional techniques may be needed |
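To make the diminishing-returns intuition concrete, the projections above roughly follow an exponential decay toward a performance floor. The sketch below anchors such a curve at our two measured points; the 6% floor is an assumed modelling parameter, not a measurement, and the resulting curve runs slightly above the table’s mid-range estimates:

```python
import numpy as np

# Measured points: (hours, CER %)
CER_BASE, CER_NOW, HOURS_NOW = 27.67, 22.93, 5.6
FLOOR = 6.0  # assumed asymptotic floor (%): a modelling choice, not a measurement

# Solve the decay rate k so the curve passes through the 5.6-hour point:
#   CER(h) = FLOOR + (CER_BASE - FLOOR) * exp(-k * h)
k = -np.log((CER_NOW - FLOOR) / (CER_BASE - FLOOR)) / HOURS_NOW

def projected_cer(hours):
    return FLOOR + (CER_BASE - FLOOR) * np.exp(-k * hours)

for h in (10, 15, 20, 30, 50):
    print(f"{h:>3} h -> {projected_cer(h):5.2f}% CER")
```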
## Iterative Training Approach
Rather than relying solely on data quantity, we recommend an iterative approach:
### Phase 1: Current Status (Complete)
- **Training Data**: 5.6 hours
- **Current CER**: 22.93%
### Phase 2: Expanded Dataset (Next Milestone)
- **Training Data**: ~10 hours
- **Expected CER**: 17-19%
- **Evaluation**: Analyze error patterns to guide the next iteration (a minimal CER computation is sketched after Phase 4)
### Phase 3: Targeted Data Collection
- **Training Data**: ~15-20 hours
- **Expected CER**: 12-16%
- **Focus**: Address specific error patterns identified in Phase 2
- **Techniques**: Begin implementing data augmentation strategies
### Phase 4: Advanced Optimization
- **Training Data**: ~30-50 hours
- **Target CER**: 6-12%
- **Techniques**:
  - Hyperparameter optimization
  - Model architecture adjustments
  - Advanced data augmentation
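Every milestone decision above hinges on the CER metric itself, so for reference, here is a minimal dependency-free sketch of how character error rate is computed (Levenshtein edit distance over characters, normalised by reference length; the Tibetan strings are placeholders, not actual model output):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum character substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Placeholder example, not real transcripts:
print(f"{cer('བཀྲ་ཤིས་བདེ་ལེགས', 'བཀྲ་ཤས་བདེ་ལེག') * 100:.1f}% CER")
```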
## Factors Affecting CER Improvement
The relationship between training data volume and CER is influenced by:
- **Data Quality**: Well-transcribed, clearly spoken audio improves results more than quantity alone
- **Domain Relevance**: Training data closely matching the target domain yields better results
- **Model Architecture**: Different model architectures respond differently to data volume
- **Augmentation Techniques**: Effective data augmentation can simulate larger datasets (see the sketch below)
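As a concrete illustration of the augmentation point, two common waveform-level techniques can be sketched in plain NumPy. This is illustrative only; a production pipeline would more likely use a dedicated library such as torchaudio:

```python
import numpy as np

def add_noise(waveform, snr_db, rng=None):
    """Mix in Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return waveform + rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)

def speed_perturb(waveform, factor):
    """Naive speed change by linear resampling (note: also shifts pitch)."""
    positions = np.arange(0, len(waveform) - 1, factor)
    return np.interp(positions, np.arange(len(waveform)), waveform)

# Each clean utterance can yield several augmented variants:
clean = np.random.default_rng(0).normal(size=16000)  # stand-in for 1 s of 16 kHz audio
variants = [
    add_noise(clean, snr_db=20),
    speed_perturb(clean, 0.9),  # 10% slower
    speed_perturb(clean, 1.1),  # 10% faster
]
```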
## Visualizing CER Projections for Garchen Rinpoche Audio Fine-tuning
### Initial Linear Projection
Based on our two current data points (base model and 5.6 hours fine-tuned), we can create an initial linear projection:
Figure 1: Linear projection based on initial data points, suggesting approximately 27 hours needed for 5% CER.
Equation: CER ≈ 27.67 − 0.846 × Hours
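A figure like this one can be regenerated with a short matplotlib script (a minimal sketch; the output filename is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

hours = np.linspace(0, 30, 200)
cer = 27.67 - 0.846 * hours  # linear fit through the two measured points

plt.plot(hours, cer, label="Linear projection")
plt.scatter([0, 5.6], [27.67, 22.93], color="red", zorder=3, label="Measured")
plt.axhline(5.0, linestyle="--", color="gray", label="5% CER target")
plt.xlabel("Training data (hours)")
plt.ylabel("CER (%)")
plt.legend()
plt.savefig("linear_projection.png", dpi=150)
```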
### Expected Non-Linear Performance Curve
Figure 2: Expected non-linear performance curve showing diminishing returns as training data increases.
### Recommended Approach
- **Milestone-Based Evaluation**: Re-evaluate and plot the actual CER curve at the 10, 15, 20, and 25 hour milestones
- **Error Analysis**: At each milestone, analyze error patterns to guide targeted data collection
- **Technique Expansion**: Introduce additional training techniques beyond raw data volume increases
- **Regular Stakeholder Updates**: Update projections based on actual measurements rather than theory
### Updating Projections with Iterative Data Points
Figure 3: As more data points are collected through iterative training, the projection curve will be refined to more accurately predict training requirements.
To create and update this visualization:
1. Start with the initial two data points and the linear projection
2. After each milestone (10 hours, 15 hours, etc.), add the new data point to the graph
3. Refit the curve based on all available data points (see the sketch below)
4. Update the projection for reaching 5% CER based on the new curve
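A hedged sketch of the refitting step, using scipy.optimize.curve_fit on the decay model from earlier. With only two points the fit is underdetermined (the base-CER point constrains neither parameter), so the starting values and bounds below are assumptions; they matter much less once three or more milestones are available:

```python
import numpy as np
from scipy.optimize import curve_fit

CER_BASE = 27.67

# Measured milestones so far; append new (hours, CER) pairs after each run.
hours = np.array([0.0, 5.6])
cer = np.array([27.67, 22.93])

def decay(h, floor, k):
    """Exponential decay from the base CER toward an asymptotic floor."""
    return floor + (CER_BASE - floor) * np.exp(-k * h)

popt, _ = curve_fit(decay, hours, cer, p0=[6.0, 0.05],
                    bounds=([0.0, 1e-4], [15.0, 1.0]))
floor, k = popt

target = 5.0
if target > floor:
    # Invert the model: floor + (CER_BASE - floor) * exp(-k * h) = target
    h_needed = -np.log((target - floor) / (CER_BASE - floor)) / k
    print(f"Projected hours for {target}% CER: {h_needed:.1f}")
else:
    print(f"Fitted floor ({floor:.1f}%) is at or above the {target}% target; "
          "more data alone may not get there.")
```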
By following this iterative, data-driven approach, we can more accurately track our progress toward the 5% CER target and make informed decisions about resource allocation for fine-tuning the model specifically for Garchen Rinpoche’s teachings.