ASR Model Evaluation Report for Garchen Rinpoche

INTRODUCTION

We evaluated several speech-to-text models for Garchen Rinpoche’s Tibetan teachings, testing different training methods to find the best performing model. Our goal was to improve transcription accuracy while determining which approach delivers optimal results with limited training data.

Our evaluation compares 16 model checkpoints, spanning a base model and several fine-tuned variants (v1-v8), assessed using four metrics: Character Error Rate (CER), Levenshtein distance, micro Word Error Rate (WER), and micro Syllable Error Rate (SER). The two character-level metrics are defined as follows:

  1. Character Error Rate (CER): The percentage of characters that were incorrectly recognized in the transcription compared to the reference text. CER is calculated as (substitutions + insertions + deletions) / total characters in the reference. Lower CER values indicate better character-level transcription accuracy.

  2. Levenshtein Distance: The minimum number of single-character edits (insertions, deletions, or substitutions) required to change the model’s transcription into the reference text. This metric provides insight into the practical effort needed to correct transcription errors. Lower values indicate transcriptions that require fewer edits to match the reference.
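Both metrics are straightforward to compute. As a reference, here is a minimal generic Python sketch (not the evaluation code actually used for this report) in which CER is simply the Levenshtein distance normalized by the reference length:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))  # distances for an empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit count normalized by reference length."""
    return levenshtein(ref, hyp) / len(ref)
```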

Key Differences Between CER and Levenshtein Distance:

  • Normalization: CER is normalized (expressed as a percentage of the reference text length), while Levenshtein distance is an absolute count of edits needed.
  • Scale: CER ranges from 0 to 1 (or 0% to 100%), while Levenshtein distance can be any non-negative integer and increases with text length.
  • Interpretation: CER tells you the proportion of characters that are incorrect, while Levenshtein gives you the actual number of edits required.
  • Practical use: CER is better for comparing models across different test sets, while Levenshtein distance better reflects the actual post-editing effort needed for a specific transcript.
  • Sensitivity to text length: A short transcript with one error will have a higher CER than a long transcript with the same error, while both would have the same Levenshtein distance.
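A toy calculation makes the length-sensitivity point concrete (hypothetical transcript lengths, one substitution error in each):

```python
edits = 1                      # one substitution error in each transcript
short_len, long_len = 10, 100  # reference lengths in characters

cer_short = edits / short_len  # 0.10, i.e. 10% CER
cer_long = edits / long_len    # 0.01, i.e. 1% CER

# CER differs by a factor of ten, while the Levenshtein distance
# (the raw edit count) is 1 in both cases.
```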

The models reflect different training strategies:

  • wav2vec2 scratch model: Generic wav2vec2 model without any fine-tuning on Tibetan audio
  • Base model: General Tibetan wav2vec2 model trained on 1,500 hours of general Tibetan audio, with no Garchen Rinpoche audio
  • v1 models: Fine-tuned from the base model with 5.6 hours of training data (checkpoints at 5k, 10k, and 19k steps)
  • v2 models: Continued fine-tuning of the base model with 10 hours of Garchen Rinpoche audio
  • v3 models: Progressive fine-tuning built upon v1_19000 with 10 hours of Garchen Rinpoche audio
  • v4 model: Fine-tuned from the wav2vec2 scratch model with 10 hours of Garchen Rinpoche audio
  • v5 model: Fine-tuned from the base model with 20 hours of Garchen Rinpoche audio
  • v6 model: Further fine-tuning of an already optimized checkpoint (v2_43000) with 20 hours of Garchen Rinpoche audio
  • v7 model: Fine-tuned from the wav2vec2 scratch model with 20 hours of Garchen Rinpoche audio
  • v8 model: Fine-tuned from Facebook’s wav2vec2-xls-r-1b (1 billion parameter) model with Garchen Rinpoche audio

Training Arguments

    from transformers import TrainingArguments

    # output_dir is a placeholder; it was not included in the original snippet.
    training_args = TrainingArguments(
        output_dir="wav2vec2-garchen",
        eval_strategy="steps",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=3e-4,
        num_train_epochs=100,
        warmup_steps=500,
        logging_steps=100,
        eval_steps=1000,
        save_steps=1000,
        fp16=True,
        save_total_limit=50,
        dataloader_num_workers=4,
        report_to="wandb",
    )

1. Comprehensive Model Performance Statistics

Our analysis reveals the following performance statistics for each model and checkpoint. In addition to the CER and Levenshtein metrics, we include columns for training data hours, training time, epochs run, and the fine-tuning source.

| Model/Checkpoint | CER Mean | CER Std | Lev Mean | Lev Std | Micro WER | Micro SER | Training Data (hrs) | Training Time (hrs) | Epochs Run | Finetune Source |
|------------------|----------|---------|----------|---------|-----------|-----------|---------------------|---------------------|------------|-----------------|
| base | 0.277 | 0.161 | 19.172 | 21.794 | 0.554 | 0.493 | 0 | 0 | 0 | wav2vec2 scratch model |
| v1_5000 | 0.274 | 0.182 | 18.441 | 19.909 | 0.518 | 0.482 | 5.6 | 2.15 | 100 | base |
| v1_10000 | 0.234 | 0.172 | 15.658 | 17.422 | 0.410 | 0.398 | 5.6 | 2.15 | 100 | base |
| v1_19000 | 0.229 | 0.172 | 15.334 | 16.935 | 0.394 | 0.386 | 5.6 | 2.15 | 100 | base |
| v2_20000 | 0.223 | 0.175 | 14.884 | 16.349 | 0.392 | 0.384 | 10 | 2.3 | 100 | base |
| v2_32000 | 0.218 | 0.167 | 14.730 | 16.144 | 0.386 | 0.380 | 10 | 2.3 | 100 | base |
| v2_43000 | 0.215 | 0.168 | 14.608 | 16.029 | 0.381 | 0.376 | 10 | 2.3 | 100 | base |
| v3_1000 | 0.234 | 0.176 | 15.440 | 16.877 | 0.412 | 0.401 | 10 | 2 | 61 | v1_19000 |
| v3_25000 | 0.222 | 0.171 | 14.991 | 16.405 | 0.391 | 0.383 | 10 | 2 | 61 | v1_19000 |
| v4_22000 | 0.227 | 0.171 | 15.169 | 16.710 | 0.409 | 0.392 | 10 | 2 | 50 | wav2vec2 scratch model |
| v5_base_28000 | 0.215 | 0.161 | 14.970 | 16.645 | 0.413 | 0.401 | 20 | ~2 | ~25 | base |
| v6_ft_25000 | 0.221 | 0.170 | 15.214 | 17.198 | 0.404 | 0.393 | 20 | ~2 | ~24 | v2_43000 |
| v7_scratch_23000 | 0.219 | 0.167 | 14.571 | 15.467 | 0.412 | 0.397 | 20 | ~2 | ~21 | wav2vec2 scratch model |
| v8_10000 | 0.242 | 0.165 | 16.830 | 19.463 | 0.491 | 0.464 | 20 | ~2 | ~20 | wav2vec2-xls-r-1b |
| v8_55000 | 0.220 | 0.164 | 15.216 | 17.317 | 0.401 | 0.394 | 20 | ~2 | ~20 | wav2vec2-xls-r-1b |
| v8_59000 | 0.230 | 0.165 | 15.859 | 18.460 | 0.430 | 0.415 | 20 | ~2 | ~20 | wav2vec2-xls-r-1b |

Note: The best value for each metric is CER 0.215 (v2_43000 and v5_base_28000), Levenshtein 14.571 (v7_scratch_23000), micro WER 0.381 (v2_43000), and micro SER 0.376 (v2_43000).
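The "Combined Performance" ranking given later can be reproduced from this table. Below is a plain-Python sketch (metric values copied from the rows above; ties broken by dictionary order, a simplification) that averages each top checkpoint's rank across the four metrics:

```python
# (CER mean, Lev mean, micro WER, micro SER) per checkpoint,
# copied from the evaluation table above.
scores = {
    "v2_43000":         (0.215, 14.608, 0.381, 0.376),
    "v2_32000":         (0.218, 14.730, 0.386, 0.380),
    "v7_scratch_23000": (0.219, 14.571, 0.412, 0.397),
    "v5_base_28000":    (0.215, 14.970, 0.413, 0.401),
    "v8_55000":         (0.220, 15.216, 0.401, 0.394),
}

def mean_rank(name):
    """Average of a checkpoint's per-metric ranks (lower value = rank 1)."""
    ranks = []
    for m in range(4):
        ordered = sorted(scores, key=lambda n: scores[n][m])
        ranks.append(ordered.index(name) + 1)
    return sum(ranks) / 4

best = min(scores, key=mean_rank)  # overall winner by average rank
```

Running this reproduces v2_43000 as the best all-around checkpoint (average rank 1.25 across the four metrics).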

3. Performance Analysis

Best Models by CER

  1. ft_v5_base (28,000 steps): 0.2152 CER
  2. ft_v2 (43,000 steps): 0.2153 CER
  3. ft_v2 (32,000 steps): 0.218 CER
  4. ft_v7_scratch (23,000 steps): 0.2186 CER
  5. ft_v8 (55,000 steps): 0.2203 CER

Best Models by Levenshtein Distance

  1. ft_v7_scratch (23,000 steps): 14.571
  2. ft_v2 (43,000 steps): 14.608
  3. ft_v2 (32,000 steps): 14.730
  4. ft_v8 (55,000 steps): 15.216

Best Models by Micro WER

  1. ft_v2 (43,000 steps): 0.381
  2. ft_v2 (32,000 steps): 0.386
  3. ft_v3 (25,000 steps): 0.391
  4. ft_v8 (55,000 steps): 0.401

Best Models by Micro SER (Syllable Error Rate)

  1. ft_v2 (43,000 steps): 0.376
  2. ft_v2 (32,000 steps): 0.380
  3. ft_v2 (20,000 steps): 0.384
  4. ft_v8 (55,000 steps): 0.394

Combined Performance (Ranking based on all metrics)

  1. ft_v2 (43,000 steps)

    • CER: 0.215 (tied 1st best)
    • Levenshtein: 14.608 (2nd best)
    • Micro WER: 0.381 (1st best)
    • Micro SER: 0.376 (1st best)
    • Consistently excellent across all metrics
  2. ft_v2 (32,000 steps)

    • CER: 0.218 (3rd best)
    • Levenshtein: 14.730 (3rd best)
    • Micro WER: 0.386 (2nd best)
    • Micro SER: 0.380 (2nd best)
    • Strong performer across all metrics
  3. ft_v7_scratch (23,000 steps)

    • CER: 0.219 (4th best)
    • Levenshtein: 14.571 (1st best)
    • Micro WER: 0.412 (5th best)
    • Micro SER: 0.397 (5th best)
    • Lowest Levenshtein std deviation: 15.467
  4. ft_v5_base (28,000 steps)

    • CER: 0.215 (tied 1st best)
    • Levenshtein: 14.970 (7th best)
    • Micro WER: 0.413 (7th best)
    • Micro SER: 0.401 (8th best)
    • Best CER stability (lowest std deviation: 0.161)
  5. ft_v8 (55,000 steps)

    • CER: 0.220 (5th best)
    • Levenshtein: 15.216 (4th best)
    • Micro WER: 0.401 (4th best)
    • Micro SER: 0.394 (4th best)
    • Uses larger foundation model (wav2vec2-xls-r-1b with 1 billion parameters)
    • Demonstrates that larger models can achieve competitive but not superior performance

4. Key Insights

  1. Overall Best Model: ft_v2 (43,000 steps) continues to demonstrate the best all-around performance across all four metrics, with the best micro WER (0.381) and micro SER (0.376), 2nd best Levenshtein distance (14.608), and tied for best CER (0.215).

  2. Best Levenshtein Performance: ft_v7_scratch (23,000 steps) has the lowest Levenshtein distance (14.571) and the best Levenshtein stability (lowest std deviation: 15.467), though it ranks 5th in WER (0.412) and SER (0.397) metrics.

  3. Best CER Performance: ft_v5_base (28,000 steps) is tied for best CER with ft_v2 (43,000 steps) at 0.215 and has the best CER stability (lowest std deviation: 0.161), though its performance on WER and SER metrics is not as strong.

  4. WER and SER Consistency: The ft_v2 models (at 20000, 32000, and 43000 steps) consistently perform best on both WER and SER metrics, with clear improvements as training steps increase.

  5. Training Progression: Most model versions show improved performance with increased training steps across all metrics; the main exception is v8, which improves through 55,000 steps but degrades by 59,000 steps.

  6. Significant Improvement: The best models show approximately 22% relative improvement in CER over the base model (from 0.277 to 0.215) and 31% improvement in WER (from 0.554 to 0.381).

  7. Consistency: Models with lower error rates generally maintain this advantage across all four evaluation metrics (CER, Levenshtein, WER, and SER).

  8. Larger Model Performance: Despite using a much larger foundation model (1 billion parameters), the ft_v8 models do not outperform the best ft_v2 model, suggesting that model architecture and fine-tuning strategy may be more important than raw parameter count for this specific task. The ft_v8_55000 checkpoint does achieve competitive results (a CER of 0.220 and the 4th best Levenshtein distance at 15.216) but still falls short of the best models.
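The relative-improvement figures quoted in insight 6 follow directly from the table values:

```python
base_cer, best_cer = 0.277, 0.215  # base vs. v2_43000 / v5_base_28000
base_wer, best_wer = 0.554, 0.381  # base vs. v2_43000

# Relative improvement = (old - new) / old
cer_gain = (base_cer - best_cer) / base_cer  # ~0.22, i.e. ~22% relative
wer_gain = (base_wer - best_wer) / base_wer  # ~0.31, i.e. ~31% relative
```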

5. Metric Relationship Analysis

We analyzed the relationships between different error metrics across all models to better understand how they correlate and what insights this reveals about model performance.

WER vs. Levenshtein Distance Correlation

The visualization of Micro WER plotted against Levenshtein distance reveals several important insights:

  1. Strong Positive Correlation: There is a strong positive correlation between WER and Levenshtein distance across all models. This confirms that improvements in one metric generally correspond to improvements in the other, though not always at the same rate.

  2. Performance Clusters: The models form distinct performance clusters:

    • High-error region: The base model, v1_5000, and v8_10000 are clearly separated from the rest, showing significantly higher error rates in both metrics
    • Mid-performance cluster: Models like v1_10000, v3_1000, v4_22000, and v8_59000 show intermediate performance
    • High-performance cluster: v2 models (especially v2_43000), v7_scratch_23000, v5_base_28000, and v8_55000 form the best-performing group
  3. Trade-off Patterns: The best models show interesting trade-offs:

    • v2_43000: Achieves the best WER (0.381) but ranks 2nd in Levenshtein distance (14.608)
    • v7_scratch_23000: Achieves the best Levenshtein distance (14.571) but has only moderate WER performance (0.412)
    • v5_base_28000: Ties for best CER but falls behind on both WER (0.413) and Levenshtein metrics (14.970)
    • v8_55000: Shows 4th best performance in both metrics (WER: 0.401, Levenshtein: 15.216)
  4. Training Progression Pattern: The plot visually demonstrates how increasing training steps improves performance along both metrics, as seen in the progression from v1_5000 → v1_10000 → v1_19000, and similarly for the v2 models. The v8 model shows a similar pattern with significant improvement from v8_10000 to v8_55000, but then degradation at v8_59000, suggesting potential overfitting.
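To put a number on the correlation claim, the Pearson coefficient can be computed directly from the section 1 table (values copied verbatim; pure Python, no plotting):

```python
# Lev means and micro WER per checkpoint, copied from the table in
# section 1 (base through v8_59000, in table order).
lev = [19.172, 18.441, 15.658, 15.334, 14.884, 14.730, 14.608, 15.440,
       14.991, 15.169, 14.970, 15.214, 14.571, 16.830, 15.216, 15.859]
wer = [0.554, 0.518, 0.410, 0.394, 0.392, 0.386, 0.381, 0.412,
       0.391, 0.409, 0.413, 0.404, 0.412, 0.491, 0.401, 0.430]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(lev, wer)  # strongly positive, consistent with the plot
```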

SER vs. Levenshtein Distance Correlation

We also analyzed the relationship between Syllable Error Rate (SER) and Levenshtein distance:

  1. Similar Correlation Pattern: Like WER, SER also shows a strong positive correlation with Levenshtein distance, suggesting that improvements in syllable-level accuracy generally align with character-level improvements.

  2. Tighter Clustering: The SER vs. Levenshtein plot reveals slightly tighter clustering of the high-performance models, suggesting that SER may be a more consistent predictor of overall performance than WER for the best models.

  3. Best Model Confirmation: The v2_43000 model maintains its position as the best performer when considering SER (0.376) against Levenshtein distance (14.608), reinforcing its status as the overall best model.

  4. v8 Model Performance: The v8_55000 checkpoint shows competitive SER performance (0.394, 4th best) paired with good Levenshtein distance (15.216, 4th best), placing it firmly in the high-performance cluster despite using a different base architecture (wav2vec2-xls-r-1b).

  5. Progression Consistency: The training progression pattern observed in the WER analysis is similarly evident in the SER analysis. The v8 model shows an interesting pattern: v8_10000 performs poorly, v8_55000 improves significantly, and v8_59000 degrades again, suggesting that careful checkpoint selection is crucial.

CER vs. Levenshtein Distance Correlation

Finally, we examined the relationship between Character Error Rate (CER) and Levenshtein distance:

  1. Extremely Strong Correlation: CER and Levenshtein distance show the strongest correlation among all the metrics examined. This is expected since both are character-level metrics, but the strength of correlation confirms the reliability of both metrics.

  2. Unique Model Positioning: The CER vs. Levenshtein plot reveals some interesting nuances:

    • v5_base_28000: Shows excellent CER (tied for best at 0.215) but less optimal Levenshtein distance (14.970)
    • v7_scratch_23000: Achieves the best Levenshtein distance (14.571) with a slightly higher CER (0.219)
    • v2_43000: Offers the best balance with excellent CER (0.215) and near-best Levenshtein distance (14.608)
    • v8_55000: Shows good CER (0.220) with decent Levenshtein distance (15.216, 4th best)
  3. Divergence in High-Performance Models: Among the best models, there’s more divergence in the CER-Levenshtein relationship than seen in WER or SER plots, suggesting that character-level metrics may be capturing different aspects of model performance.

  4. Training Stability: The progression of models through training steps shows more stability in CER improvement than in Levenshtein distance, indicating that CER might be a more stable metric during model training.

  5. v8 Model Fluctuations: The v8 model shows significant fluctuations across checkpoints, with v8_55000 achieving competitive CER (0.220) but v8_10000 and v8_59000 showing much poorer performance. This suggests that the larger model architecture may be more sensitive to training dynamics and requires more careful optimization.

This analysis reinforces our primary recommendation of ft_v2 (43,000 steps) as the best overall model, as it achieves optimal or near-optimal performance across all error metrics, while showing that different models may be preferable for specific use cases where one type of error metric is prioritized over others.

6. Recommendations

  1. Primary Recommendation: Use ft_v2 (43,000 steps) for production as it demonstrates the best all-around performance across all four metrics (CER, Levenshtein distance, micro WER, and micro SER (Syllable Error Rate)) and offers the most consistent quality.

  2. Alternative Options:

    • If prioritizing Levenshtein distance: Use ft_v7_scratch (23,000 steps)
    • If prioritizing CER stability: Use ft_v5_base (28,000 steps)
    • If looking for balanced performance: Use ft_v2 (43,000 steps)
    • If using a larger model is desired: Use ft_v8 (55,000 steps) (best checkpoint of the 1B parameter model)
  3. Word-Level vs Character-Level vs Syllable-Level Optimization: If your application prioritizes word-level or syllable-level accuracy (important for practical usage and comprehension), the ft_v2 (43,000 steps) model with its superior WER (0.381) and SER (0.376) is clearly the best choice. If character-level precision is more important, consider ft_v5_base (28,000 steps) for its CER stability.

  4. Further Research: Consider evaluating these top models on specific subsets of data to determine if there are particular strengths or weaknesses for certain types of speech patterns or vocabulary contexts.

7. Training History Analysis

The evaluation data shows clear progression patterns in model performance:

  1. Fine-tuning benefits: All fine-tuned models outperform the base model, with CER improving from 0.277 to as low as 0.215.

  2. Training steps impact: For each model version, increasing training steps generally leads to better performance (lower CER and Levenshtein distance).

  3. Architectural comparison:

    • ft_v2 (43,000 steps) shows the best all-around performance across all metrics
    • ft_v7_scratch (fine-tuned from the wav2vec2 scratch model) excels in Levenshtein distance
    • ft_v2 models demonstrate consistent improvement across training steps
    • ft_v8 models (based on wav2vec2-xls-r-1b) show promising WER/SER performance at 55,000 steps, but performance degrades by 59,000 steps, suggesting potential overfitting

8. Resources & Further Exploration

@Ganga_Gyatso thank you.

Could you do a couple more things with the report?

  1. Also add SER and WER - I’m interested to know if they have any relationship to the Levenshtein distance numbers; and they are more relevant to our needs than CER, given that there is not a deterministic correlation between CER and WER.
  2. Add chart(s) with the WER and Levenshtein numbers so we can see how we’re trending and if we’re already approaching a point where more data might not help much.

I have updated the report to include WER and SER, along with charts showing the different metrics plotted against Levenshtein distance.
