AI-powered transcription of Garchen Rinpoche’s speech.
| SUMMARY | |
|---|---|
| Version | 1.1 |
| Purpose | Build a custom Speech-to-Text model and training dataset to accurately transcribe Garchen Rinpoche’s oral teachings, making them accessible and searchable. |
| Champion(s) | @DavidYesheNyima, @Gade |
| Communication | Discord Channel, GitHub Team, Google Calendar |
| Documentation | PRD (Requirements), GitHub Project Board, Visuals, Meeting Minutes |
Proposal Details
1. Problem Statement / Motivation
Train a custom Speech-to-Text (STT) model that can automatically turn Rinpoche’s spoken teachings into subtitle-style transcripts (WebVTT). This will help preserve, search, and translate his teachings, making them accessible in multiple languages.
Creating accurate transcripts is difficult for the following reasons:
- Rinpoche’s unique voice: His age, speech style, and Tibetan dialect make his speech hard for machines to recognize, and challenging even for professional, experienced transcribers.
- Tibetan language issues: Few training materials are available for Tibetan, especially for religious content.
- Audio problems: Some recordings are noisy or unclear.
- Special vocabulary: Buddhist terms used by Rinpoche are complex and unfamiliar to general-purpose speech software.
2. Previous work
- Audio Analysis: organised and analysed Rinpoche’s audio files to understand their scope, quality, and content.
- Model Selection: After testing three options, a general Tibetan model was chosen as our base because it surprisingly outperformed specialised religious speech models.
- Data Collection: Created a custom training dataset by transcribing over five hours of Rinpoche’s teachings to capture his unique speech.
- Custom Training: Fine-tuned the selected model with this custom data, which significantly reduced its error rate and improved transcription accuracy.
- Proven Efficiency: A final test showed that using the fine-tuned model saved human transcribers 38% of their time, while the basic model sometimes made the process slower.
3. Scope
- In Scope:
- Ideal Goal: 5% error rate in the text.
  - Realistic Goal: Around a 15% error rate, which is still very useful.
- Main Focus: Make sure the transcript is good enough for translation and searching
- Detecting only Rinpoche’s speech
- Training and improving the speech model
- Creating subtitle files with time stamps
- Reducing background noise and fillers
  - Custom STT model for Garchen Rinpoche
- Out of Scope:
- Translating the teachings
- Transcribing other speakers
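The scope above targets subtitle files with timestamps in WebVTT, the format named in the problem statement. As a minimal sketch (the segment timings and text here are hypothetical, not actual model output), formatting STT segments into a WebVTT file could look like this:

```python
def to_webvtt(segments):
    """Format (start_sec, end_sec, text) segments as a WebVTT subtitle file."""
    def ts(seconds: float) -> str:
        # WebVTT timestamp: HH:MM:SS.mmm
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    lines = ["WEBVTT", ""]  # required header, then a blank line
    for start, end, text in segments:
        lines.append(f"{ts(start)} --> {ts(end)}")
        lines.append(text)
        lines.append("")  # blank line separates cues
    return "\n".join(lines)

# Hypothetical segments from an STT pass over a recording
print(to_webvtt([(0.0, 4.2, "First line of the teaching."),
                 (4.2, 9.0, "Second line of the teaching.")]))
```

In a real pipeline the segment boundaries would come from the model's alignment output rather than being hand-written.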
4. Potential Deliverables
- A custom model with an error rate below 15% on Rinpoche’s clean speech
- Subtitle-style transcripts that are easy to read and need very little correction
- A semi-automated system that helps with future transcriptions
- A public tool where users can upload audio and get automatic subtitles
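The error-rate targets above are usually measured as Word Error Rate (WER): edit distance between the reference and hypothesis transcripts, divided by reference length (for Tibetan this is often computed at syllable level). A minimal sketch of the standard computation, using hypothetical English example strings:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length,
    computed via Levenshtein edit distance over whitespace-separated tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: one substitution in a five-word reference -> 20% WER
print(wer("the path of great compassion", "the path of great compression"))  # 0.2
```

In practice an established library (e.g. `jiwer`) would be used rather than a hand-rolled implementation; the sketch only shows what the 5% / 15% targets measure.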
5. Team
Members:
- @DavidYesheNyima - Champion
- @Lhakpa_Wangyal - Coordinator
- @Ganga_Gyatso - Developer
Annotators:
- Kunchok Gawa @kunchok73
- Karma Tsepak @Karmatsep
- Jampa Lobsang @jamluv227
- Kalsang Thardoe @K-Thardoe
SIG Meeting Calendar
- Meetings: 2nd Tuesday of each month, 16:00 UTC
- Meeting Minutes