Transcription Speed Analysis: Garchen Rinpoche

Evaluating STT Model-Assisted Transcription Speed: A Case Study with Garchen Rinpoche’s Teachings

Introduction

Transcription of oral teachings plays a crucial role in the preservation of Tibetan Buddhist teachings. We conducted a comprehensive study to evaluate whether Speech-to-Text (STT) models could significantly improve transcription speed. Our study focused on Garchen Rinpoche’s teachings, comparing three transcription methods: manual transcription, base STT model-assisted transcription, and fine-tuned model-assisted transcription. While our initial hypothesis was that both STT-assisted methods would consistently improve transcription speed, our findings revealed a more nuanced reality.

Background

Traditional manual transcription of Buddhist teachings is time-consuming and resource-intensive. Modern STT technology promises to streamline this process, but its effectiveness in real-world scenarios needed validation. We hypothesized that providing transcribers with machine-generated transcripts would significantly reduce transcription time, with fine-tuned models offering superior performance over base models.

Methodology

Dataset Preparation and Standardization

1. Initial Selection and Filtering

  • Selected segments from Garchen Rinpoche’s Benchmark
  • Carefully filtered for segments ≥30 characters
  • Grouped by original audio ID and character length
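The selection step above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code; the segment fields `audio_id` and `text` are assumptions about the dataset schema:

```python
from collections import defaultdict

def filter_and_group(segments, min_chars=30):
    """Keep segments with at least min_chars characters, grouped by source audio.

    `segments` is a list of dicts with assumed keys 'audio_id' and 'text'.
    """
    groups = defaultdict(list)
    for seg in segments:
        if len(seg["text"]) >= min_chars:
            groups[seg["audio_id"]].append(seg)
    # Sort each group by character length so similar-length segments
    # can later be aligned across the three test sets.
    for segs in groups.values():
        segs.sort(key=lambda s: len(s["text"]))
    return dict(groups)
```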

2. Strategic Distribution Design

a. Character Length Standardization

  • Segments in corresponding rows across all three test sets have similar character lengths
  • Why This Matters:
    • Ensures fair comparison of transcription speeds
    • Longer segments might naturally take more time to transcribe
    • Similar character lengths provide a standardized basis for measuring speed improvements
    • Eliminates bias from varying text complexity

b. Audio Source Management

  • Different segments from the same original audio used across test conditions
  • Why This Matters:
    • Prevents transcriber memorization bias
    • If same segment appeared in multiple tests, transcribers might remember content
    • Memory advantage would artificially improve transcription speed
    • Using different segments ensures genuine speed measurement

c. Distribution Balance

  • Similar total character counts across all three test sets
  • Equal representation of audio sources
  • Why This Matters:
    • Maintains consistent workload across all three conditions
    • Ensures fair comparison between manual, base, and fine-tuned approaches
    • Controls for variations in audio quality and speaker patterns
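One way to achieve the length-matching and balancing described above is to deal length-sorted segments from each audio source round-robin into the three sets: consecutive segments in a length-sorted list have similar lengths, so each deal spreads near-identical lengths across the sets while every set draws from every audio source. This is a sketch of that idea, not the script actually used for the study:

```python
def distribute_segments(groups, n_sets=3):
    """Deal length-sorted segments from each audio group round-robin into
    n_sets test sets, so corresponding rows have similar character lengths
    and every set draws from every audio source.

    `groups` maps audio_id -> list of segments sorted by text length
    (as produced by the filtering step).
    """
    test_sets = [[] for _ in range(n_sets)]
    for audio_id in sorted(groups):
        for i, seg in enumerate(groups[audio_id]):
            test_sets[i % n_sets].append(seg)
    return test_sets
```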

Test Design

  1. Manual Transcription: No reference transcript provided
  2. Base Model Assisted: Base STT model transcript provided
  3. Fine-tuned Model Assisted: Fine-tuned model transcript provided

Test Execution

  • Four experienced transcribers participated
  • Three transcribers processed 25 segments per method; one transcriber processed 18 segments per method due to limited test data
  • Time measurements captured using specialized Google Sheets script
  • Character counts and accuracy metrics recorded
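The Google Sheets script itself is not included in this post; the following Python stand-in illustrates the measurement being captured for each segment (elapsed minutes and characters per minute). The class and method names are our own, for illustration only:

```python
import time

class TranscriptionTimer:
    """Stand-in for the study's Google Sheets timing script (assumed
    behavior): record elapsed time and derive characters per minute."""

    def start(self):
        # Mark the moment the transcriber begins a segment.
        self._t0 = time.perf_counter()

    def stop(self, transcript):
        # Compute elapsed minutes and speed for the finished segment.
        minutes = (time.perf_counter() - self._t0) / 60
        return {
            "minutes": minutes,
            "chars": len(transcript),
            "chars_per_min": len(transcript) / minutes if minutes else 0.0,
        }
```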

Results

Key Metrics Visualization

Transcription Speed (characters/minute)

Manual:     47.86 ┤████████████████████
Base:       49.85 ┤█████████████████████
Fine-tuned: 77.67 ┤████████████████████████████████

Total Time Savings vs Manual (minutes)

Base:       3.52  ┤█
Fine-tuned: 40.60 ┤████████████████

Speed Improvement Percentage

Base:       3.31%  ┤█
Fine-tuned: 38.25% ┤███████████████

Individual Transcriber Performance

Speed Metrics by Transcriber

| Transcriber | Manual (chars/min) | Base Model (chars/min) | Fine-tuned (chars/min) |
|---|---|---|---|
| Transcriber 1 | 55.40 | 54.69 (-6.64%) | 94.57 (+34.42%) |
| Transcriber 2 | 37.70 | 42.27 (+35.69%) | 57.79 (+47.11%) |
| Transcriber 3 | 78.63 | 69.87 (-18.02%) | 183.56 (+45.44%) |
| Transcriber 4 | 56.99 | 56.47 (-36.20%) | 73.92 (+12.73%) |

Note: Percentages in parentheses show speed improvement vs manual transcription. Negative values indicate slower performance.

Overall Performance (Across All Transcribers)

| Metric | Manual | Base Model | Fine-tuned Model |
|---|---|---|---|
| Total Time (minutes) | 106.15 | 102.63 | 65.55 |
| Total Characters | 5,080 | 5,116 | 5,091 |
| Segments Count | 93 | 93 | 93 |
| Average Speed (chars/min) | 47.86 | 49.85 | 77.67 |
| Time Saved vs Manual (min) | - | 3.52 | 40.60 |
| Speed Improvement | - | 3.31% | 38.25% |
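The derived rows above can be sanity-checked from the totals. Note that the "Speed Improvement" row matches time saved divided by total manual time (40.60 / 106.15 ≈ 38.25%), not a ratio of the chars/min averages; the sketch below assumes that definition:

```python
def summarize(total_chars, total_minutes, manual_minutes):
    """Recompute the table's derived metrics from its raw totals."""
    speed = total_chars / total_minutes          # chars per minute
    time_saved = manual_minutes - total_minutes  # minutes saved vs manual
    improvement = time_saved / manual_minutes    # share of manual time saved
    return round(speed, 2), round(time_saved, 2), round(improvement * 100, 2)
```

Running this on the fine-tuned column reproduces 77.67 chars/min, 40.60 minutes saved, and 38.25%; the base column yields 3.32% rather than the table's 3.31%, a small drift consistent with the inputs themselves being rounded.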

Individual Transcriber Variations

Key Findings

  1. Most Significant Improvements:

    • Transcriber 2 showed the highest improvement with fine-tuned model (+47.11%)*
    • Also uniquely improved with base model (+35.69%)
    • Notable progression from slowest manual speed (37.70 chars/min) to significant improvements with both models
    • *Note: Based on 18 segments per method, while others completed 25 segments
  2. Base Model Performance Variance:

    • Three out of four transcribers were slower with base model
    • Most severe slowdown: Transcriber 4 (-36.20%)
    • Only Transcriber 2 showed positive improvement
  3. Fine-tuned Model Consistency:

    • All transcribers improved with fine-tuned model
    • Improvements ranged from 12.73% to 47.11%
    • Even the lowest improvement (Transcriber 4: +12.73%) was significant
    • Transcriber 3 achieved second-highest improvement (45.44%) with full 25-segment set

Analysis of Individual Performance Patterns

Transcriber Experience Patterns

  1. High Performers with Fine-tuned Model

    • Transcribers 2 and 3 showed exceptional improvement (>45%)
    • Suggests optimal adaptation to assisted transcription workflow
  2. Base Model Challenges

    • As noted in the key findings, three of four transcribers were slower with the base model; only Transcriber 2 improved
    • The most significant slowdown was Transcriber 4 (-36.20%)
  3. Consistency Patterns

    • The wide range of fine-tuned improvements (12.73% to 47.11%), despite every transcriber improving, suggests individual differences in adapting to assisted workflows

Discussion

The Base Model Paradox

Our most surprising finding was that base model assistance sometimes increased transcription time. This could be attributed to:

  1. Lack of domain-specific training

    • Base model not specifically fine-tuned for Garchen Rinpoche’s audio
    • Higher error rate with Buddhist terminology and teaching context
    • Editing errors often more time-consuming than fresh transcription
  2. Mental Context Switching

    • Transcribers must constantly switch between listening and comparing with base model output
    • Additional effort required to identify and correct errors while maintaining context
    • Fresh transcription allows more natural flow of listening and typing
  3. Trust and Verification Issues

    • Low confidence in base model leads to double-checking every word
    • Transcribers spend extra time verifying even correct transcriptions
    • Psychological barrier: knowing it’s a general model makes transcribers more skeptical

Fine-tuned Model Success

The consistent improvement with fine-tuned model assistance across all transcribers validates the value of domain-specific model training. Key factors contributing to success:

  1. Higher accuracy reducing verification time
  2. Better handling of domain-specific terminology
  3. Increased transcriber confidence in machine output

Conclusions and Recommendations

  1. Fine-tuned Models: Strongly recommended for transcription assistance, showing consistent speed improvements (12.73% to 47.11%)

  2. Base Models: Use with caution

    • May not always improve efficiency
    • Consider transcriber-specific workflows
    • May require additional training for effective use
  3. Future Considerations:

    • Develop transcriber training for optimal use of STT assistance
    • Further investigate why base models sometimes reduce efficiency
    • Consider transcriber-specific customization of workflows

Resources and Data Access

Test Results and Analysis

  • Complete test results and analysis are available in our Google Drive folder
  • Includes detailed transcription timing data and comparative analysis across methods

Source Code

Acknowledgments

Special thanks to our four transcribers who participated in this study.


This research was conducted as part of the OpenPecha initiative to improve the preservation and accessibility of Buddhist teachings.


@Ganga_Gyatso which version of the fine tuned STT model was used in this testing?

We used the v1 fine-tuned model, trained on 5 hours of training data.
