Product Requirement Document
| Owning Group | [Name of WG or SIG] |
| Status | [Draft] |
| GitHub Project | [Link to GitHub Project Board] |
| Last Updated | [YYYY-MM-DD] |
1. Overview
We are looking for a solution that identifies instances of Tibetan speech in a recording, transcribes them, and outputs a timestamped transcript. Integral to this is fine-tuning a speech-to-text (STT) model that ideally achieves 5% character error rate (CER), though it remains to be seen whether that goal is reachable.
The overall context is that the Garchen Archive will incorporate these artifacts into an automated pipeline that goes from audio file to translated transcript with as little human intervention as possible.
Problem Statement
Although there is a vast collection of audio and video recordings of Garchen Rinpoche, these recordings are scattered around the world and exist on various forms of media, from analog cassettes to mini discs to CDs to files on a Sangha member’s computer. Very few of the recordings have written translations, and even fewer have Tibetan transcriptions. The recordings are being gathered together to create the Garchen Archive, which will feature a fully searchable, transcribed, and translated corpus of Garchen Rinpoche’s teachings.
Examples of the types of recording include but are not limited to:
- untranslated discourses
- teachings where Rinpoche alternates with an oral interpreter
- empowerments
- oral transmissions (lung)
- group practice sessions that contain a segment where Rinpoche gives a teaching
- recordings with group recitations at the beginning and end (the majority); these do not need to be transcribed or passed through the Tibetan STT, but in the interest of simplicity it is acceptable if the diarization model recognizes a recitation as Tibetan and transcribes it
In order for the Garchen Archive to have an automated pipeline, the system requires, for each recording, a timecoded transcript file that identifies each instance of Garchen Rinpoche’s speech.
Core Objectives
- Fine-tune an STT model to achieve 5% CER on Garchen Rinpoche’s unique speech
- Provide a solution that accepts an audio file and outputs a timecoded transcription file in WebVTT format that:
- Has instances of spoken Tibetan transcribed to Tibetan script
- Has timestamped entries for start and end of each Garchen Rinpoche passage
- Identifies the instances where Tibetan is spoken and transcribes only those instances
Other notes:
- If possible, train the model to clean up repetitions, jumbled sentences, and filler words
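For reference, a minimal sketch of the kind of WebVTT output described above. The timestamps and speaker labels are illustrative placeholders, not actual pipeline output; `<v …>` is the standard WebVTT voice span for speaker annotation:

```
WEBVTT

00:00:12.000 --> 00:00:34.500
<v Speaker 1>[transcribed Tibetan text]

00:00:34.500 --> 00:01:02.250
<v Speaker 2>[transcribed Tibetan text]
```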
2. Strategy & Research
Background and foundational research and planning documents that inform this project.
3. Goals & Success Metrics
Primary Goals
- Fine-tune an STT model to achieve 5% CER on Garchen Rinpoche’s unique speech
- Provide a solution that accepts an audio file and outputs a timecoded transcription file in WebVTT format that:
  - Has all instances of spoken Tibetan transcribed to Tibetan script
  - Has timestamped entries
  - Identifies the instances where Tibetan is spoken
  - Refines the transcribed instances with annotations that indicate the speaker. Ideally it would identify instances of Garchen Rinpoche, but at a minimum it should indicate Speaker 1, Speaker 2, etc.
- Produce comprehensive documentation and a how-to guide enabling reproducibility and community reuse of the full process.
- After 8 weeks of training, provide cost estimates for reaching 5% CER, then for going from 5% CER to 4%, 4% to 3%, and so on
Success Metrics
- Achieve either
- < 5% Character Error Rate (CER) on validation data, OR
- if the data indicates a non-linear performance curve, a mutual decision between Garchen Archive and OpenPecha that fine-tuning has reached the point of diminishing returns
- ≥ 95% accuracy in identifying segments where Rinpoche is speaking
- Usable WebVTT files produced with minimal/no manual cleanup
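As a reference for the CER metric above, a minimal sketch of how character error rate is typically computed: Levenshtein edit distance between hypothesis and reference, divided by the reference length. Function names are illustrative, not part of any delivered tooling:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that for Tibetan, the choice of unit (Unicode code points vs. syllables delimited by tsheg) affects the score; the sketch above operates on code points.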
4. Timeline & Quarterly Milestones
Per OpenPecha’s recommendation:
- Milestone-Based Evaluation: Re-evaluate and plot the actual CER curve at 10, 15, 20, and 25 hour milestones
- Error Analysis: At each milestone, analyze error patterns to guide targeted data collection
- Technique Expansion: Introduce additional training techniques beyond raw data volume increases
- Regular Stakeholder Updates: Update projections based on actual measurements rather than theoretical projections
- Status check-in every week
- Target Attainment of 5% CER: September 12, 2025; however, it is understood that this target may not be achievable within that timeframe or the immediate future.
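One way to turn the milestone measurements above into the requested cost projections is to fit the observed CER points to a scaling curve and extrapolate. This is only a sketch under the (strong) assumption that CER follows a power law in training-data hours; the actual curve should be re-fitted at each milestone:

```python
import math

def fit_power_law(hours, cers):
    """Least-squares fit of log(CER) = a + b * log(hours)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(c) for c in cers]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def hours_for_cer(target_cer, a, b):
    """Invert the fitted curve: training hours at which CER reaches the target."""
    return math.exp((math.log(target_cer) - a) / b)
```

Given the fitted curve, the estimated hours for 5%, 4%, 3%, etc. can be converted into data-collection and compute cost estimates for stakeholder updates.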
5. Scope & Features / Data Schema
In Scope
- Voiceprint to detect Garchen Rinpoche’s speech
- Speaker diarization
- Refine the transcribed instances with annotations that indicate the speaker; at minimum, Speaker 1, Speaker 2, etc.
- Tibetan STT model fine-tuned specifically for Rinpoche
- Automatic generation of timestamped WebVTT transcripts
- The STT model should perform pure transcription
- We should test a cleanup model that minimizes filler words, repetition, and jumbled phrasing, and fixes typos
- Ideally all speakers should be transcribed, not just Tibetan speakers
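To illustrate the in-scope WebVTT generation with speaker annotations, a minimal sketch that converts diarized, transcribed segments into a WebVTT document. The segment tuple shape and speaker labels are assumptions for illustration, not an existing pipeline API:

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def segments_to_webvtt(segments) -> str:
    """Build a WebVTT document from (start_sec, end_sec, speaker, text) tuples.

    Speaker labels are emitted as WebVTT voice spans (<v ...>).
    """
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in segments:
        lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")
    return "\n".join(lines)
```

In a full pipeline, the segment list would come from the diarization step and the text from the fine-tuned STT model; this sketch only covers the final serialization.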
Out of Scope
- We won’t build translation capabilities into this version of the product
6. Dependencies
What other groups, projects, or resources does this work depend on?
- Garchen Institute: Access to the raw audio files.
7. Acceptance Criteria
- Fine-tuned STT model has achieved 5% CER, or whatever threshold has been mutually decided based on trends in the charted CER data points
- Instances of Garchen Rinpoche’s speech are identified and transcribed; it is acceptable if, within a recording, these instances are labeled simply as Speaker 1, Speaker 2, etc.
- A Gradio interface is provided to input an audio file and receive as output a WebVTT file
- The unified, fine-tuned model must be hosted on Hugging Face and successfully deployed using the platform’s Inference Endpoints feature, providing an accessible API endpoint that returns transcriptions in the required WebVTT format when given an audio file.
- Documentation includes setup, training, evaluation, and deployment steps
- Following the guide results in a fine-tuned model that produces output consistent in format and structure with the officially delivered model and WebVTT specifications
- Guide includes versioning, configuration, and data requirements for replication