Product Requirement Document
| Owning Group | [Name of WG or SIG] |
| Status | [Draft] |
| GitHub Project | [Link to GitHub Project Board] |
| Last Updated | [YYYY-MM-DD] |
1. Overview
We are looking for a solution that identifies instances of Tibetan speech in a recording, transcribes them, and outputs a timestamped transcript. Integral to this is fine-tuning a speech-to-text (STT) model that ideally achieves 5% character error rate (CER), though it remains to be seen whether that goal is reachable.
The overall context is that the Garchen Archive will incorporate these artifacts into an automated pipeline that goes from audio file to translated transcript with as little human intervention as possible.
Problem Statement
Although there is a vast collection of audio and video recordings of Garchen Rinpoche, these recordings are scattered around the world and exist on various forms of media, from analog cassettes to mini discs to CDs to files on a Sangha member’s computer. Very few of the recordings have written translations, and even fewer have Tibetan transcriptions. The recordings are being gathered together to create the Garchen Archive, which will feature a fully searchable, transcribed, and translated corpus of Garchen Rinpoche’s teachings.
Examples of the types of recording include but are not limited to:
- untranslated discourses
- teachings where Rinpoche alternates with an oral interpreter
- empowerments
- oral transmissions (lung)
- group practice sessions that contain a segment where Rinpoche gives a teaching
- recordings with group recitations at the beginning and end (the majority); these do not need to be transcribed or passed through the Tibetan STT, but in the interest of simplicity it is acceptable if the diarization model recognizes a recitation as Tibetan and transcribes it
In order for the Garchen Archive to have an automated pipeline, the system requires, for each recording, a timecoded transcript file that identifies each instance of Garchen Rinpoche’s speech.
Core Objectives
- Fine-tune an STT model to achieve 5% CER on Garchen Rinpoche’s unique speech
- Provide a solution that accepts an audio file and outputs a timecoded transcription file in WebVTT format that:
- Has instances of spoken Tibetan transcribed to Tibetan script
- Has timestamped entries for start and end of each Garchen Rinpoche passage
- Identifies the instances where Tibetan is spoken and transcribes only those instances
Other notes:
- If possible, train the model to clean up repetitions, jumbled sentences, and filler words
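For reference, a minimal sketch of the kind of WebVTT output described above. The timestamps and speaker labels are illustrative placeholders, not actual pipeline output; `<v …>` is the standard WebVTT voice span for speaker annotation:

```
WEBVTT

00:00:12.000 --> 00:00:34.500
<v Speaker 1>[transcribed Tibetan text]

00:00:34.500 --> 00:01:02.250
<v Speaker 2>[transcribed Tibetan text]
```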
2. Strategy & Research
Background and foundational research and planning documents that inform this project.
3. Goals & Success Metrics
Primary Goals
- Fine-tune an STT model to achieve 5% CER on Garchen Rinpoche’s unique speech
- Provide a solution that accepts an audio file and outputs a timecoded transcription file in WebVTT format that:
  - Has all instances of spoken Tibetan transcribed to Tibetan script
  - Has timestamped entries
  - Identifies the instances where Tibetan is spoken
  - Refines the transcribed instances with annotations that indicate the speaker. Ideally it would identify instances of Garchen Rinpoche, but at a minimum it should indicate Speaker 1, Speaker 2, etc.
- Produce comprehensive documentation and a how-to guide enabling reproducibility and community reuse of the full process.
- After 8 weeks of training, provide cost estimates for reaching 5% CER, then for going from 5% CER to 4%, 4% to 3%, and so on
Success Metrics
- Achieve either
- < 5% Character Error Rate (CER) on validation data, OR
- if the data indicates a non-linear performance curve, a mutual decision between Garchen Archive and OpenPecha that fine-tuning has reached the point of diminishing returns
- ≥ 95% accuracy in identifying segments where Rinpoche is speaking
- Usable WebVTT files produced with minimal/no manual cleanup
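As a reference for the CER metric above, a minimal sketch of how character error rate is typically computed: Levenshtein edit distance between hypothesis and reference, divided by the reference length. Function names are illustrative, not part of any delivered tooling:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Note that for Tibetan, the choice of unit (Unicode code points vs. syllables delimited by tsheg) affects the score; the sketch above operates on code points.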
4. Timeline & Quarterly Milestones
Per OpenPecha’s recommendation:
- Milestone-Based Evaluation: Re-evaluate and plot the actual CER curve at 10, 15, 20, and 25 hour milestones
- Error Analysis: At each milestone, analyze error patterns to guide targeted data collection
- Technique Expansion: Introduce additional training techniques beyond raw data volume increases
- Regular Stakeholder Updates: Update projections based on actual measurements rather than theoretical projections
- Status check-in every week
- Target Attainment of 5% CER: September 12, 2025; however, it is understood that this target may not be achievable within that timeframe or the immediate future.
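One way to turn the milestone measurements above into the requested cost projections is to fit the observed CER points to a scaling curve and extrapolate. This is only a sketch under the (strong) assumption that CER follows a power law in training-data hours; the actual curve should be re-fitted at each milestone:

```python
import math

def fit_power_law(hours, cers):
    """Least-squares fit of log(CER) = a + b * log(hours)."""
    xs = [math.log(h) for h in hours]
    ys = [math.log(c) for c in cers]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def hours_for_cer(target_cer, a, b):
    """Invert the fitted curve: training hours at which CER reaches the target."""
    return math.exp((math.log(target_cer) - a) / b)
```

Given the fitted curve, the estimated hours for 5%, 4%, 3%, etc. can be converted into data-collection and compute cost estimates for stakeholder updates.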
5. Scope & Features / Data Schema
In Scope
- Voiceprint to detect Garchen Rinpoche’s speech
- Speaker diarization
- Refine the transcribed instances with annotations that indicate the speaker; at minimum, Speaker 1, Speaker 2, etc.
- Tibetan STT model fine-tuned specifically for Rinpoche
- Automatic generation of timestamped WebVTT transcripts
- The STT model should perform pure transcription
- We should test a cleanup model that minimizes filler words, repetition, and jumbled phrasing, and fixes typos
- Ideally all speakers should be transcribed, not just Tibetan speakers
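To illustrate the in-scope WebVTT generation with speaker annotations, a minimal sketch that converts diarized, transcribed segments into a WebVTT document. The segment tuple shape and speaker labels are assumptions for illustration, not an existing pipeline API:

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def segments_to_webvtt(segments) -> str:
    """Build a WebVTT document from (start_sec, end_sec, speaker, text) tuples.

    Speaker labels are emitted as WebVTT voice spans (<v ...>).
    """
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in segments:
        lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")
    return "\n".join(lines)
```

In a full pipeline, the segment list would come from the diarization step and the text from the fine-tuned STT model; this sketch only covers the final serialization.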
Out of Scope
- We won’t build translation capabilities into this version of the product
6. Dependencies
What other groups, projects, or resources does this work depend on?
- Garchen Institute: Access to the raw audio files.
7. Acceptance Criteria
- Fine-tuned STT model has achieved 5% CER, or whatever threshold has been mutually decided based on trends in the charted CER data points
- Instances of Garchen Rinpoche’s speech are identified and transcribed; it is acceptable if, within a recording, these instances are labeled simply as Speaker 1, Speaker 2, etc.
- A Gradio interface is provided to input an audio file and receive as output a WebVTT file
- The unified, fine-tuned model must be hosted on Hugging Face and successfully deployed using the platform’s Inference Endpoints feature, providing an accessible API endpoint that returns transcriptions in the required WebVTT format when given an audio file.
- Documentation includes setup, training, evaluation, and deployment steps
- Following the guide results in a fine-tuned model that produces output consistent in format and structure with the officially delivered model and WebVTT specifications
- Guide includes versioning, configuration, and data requirements for replication