Garchen Rinpoche Speech Project Requirement Document

:compass: Purpose and Demographic

This project creates a custom AI model to accurately transcribe the teachings of Garchen Rinpoche, a Tibetan Buddhist master. It aims to preserve and digitise his extensive audio archive, making his wisdom searchable and accessible to a wider audience through efficient and accurate transcription.

✦ Mission Statement

Build an STT model to accurately transcribe and preserve Garchen Rinpoche’s teachings for accessibility and searchability.

✦ Target Demographic

  • Garchen Buddhist Institutes and Dharma Centres
  • Students and practitioners of Garchen Rinpoche
  • Digital archivists working to preserve Tibetan oral teachings
  • Scholars and translators working on Garchen Rinpoche’s lineage
  • Accessibility advocates supporting the deaf and hard-of-hearing community

✦ Problem Statement

Garchen Rinpoche’s teachings are largely inaccessible because most recordings exist only in audio form, with no text transcripts. Standard speech-to-text (STT) systems fail to transcribe his unique speech patterns accurately, creating an urgent need for a specialised STT solution.

:bullseye: Product Objectives

✦ Core Objectives

  • Develop a Garchen-specific STT model with a Character Error Rate (CER) below 5% (see the CER sketch after this list)
  • Build a repeatable, scalable workflow for transcribing Rinpoche’s past and future recordings
  • Reduce manual transcription effort by 50% or more
  • Enable near real-time subtitle generation for live teachings or events
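
CER is the character-level edit distance between the model output and a human reference transcript, divided by the reference length. A minimal sketch of how the <5% target can be measured is below; the sample Tibetan strings are purely illustrative.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table for the Levenshtein distance over characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)


# Illustrative example: reference transcript vs. model output.
reference = "བླ་མ་མཁྱེན་ནོ།"
hypothesis = "བླ་མ་མཁྱན་ནོ།"
print(f"CER: {cer(reference, hypothesis):.2%}")  # target: below 5%
```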

✦ Non-Goals

  • This product won’t replace human transcribers entirely; it will augment their work

  • The model won’t focus on general-purpose speech recognition

  • We won’t build translation capabilities into this version of the product

  • We won’t implement speaker diarization (identifying who is speaking) in the initial release

✦ Impact Areas

  • Preservation of Garchen Rinpoche’s teachings in digital form
  • Easier access for students, scholars, and archivists
  • Improved inclusion for deaf/hard-of-hearing individuals
  • Support for creating searchable audio/video archives
  • Contribution to Tibetan linguistic research and cultural continuity

:light_bulb: Example Use Cases

✦ Use Case: Garchen Institute Archivist

Digitise and transcribe hundreds of hours of legacy teachings from Garchen Rinpoche’s personal archive with minimal human correction required.

✦ Use Case: Translator

Extract clean transcripts from teaching sessions to create translated versions for international audiences.

✦ Use Case: Online Retreat Staff

Use the model to generate subtitles and transcripts of Garchen Rinpoche’s live online teachings in near real-time, supporting global accessibility.

:building_construction: Architectural Considerations

✦ Tech Stack

  • Programming Languages: Python

  • ML Frameworks: Hugging Face Transformers, PyTorch

  • Audio Processing: pyannote.audio or Silero VAD

  • Base Models: Wav2Vec2 (300M parameters), Whisper (280M parameters)

  • Data Management: AWS S3, CSV files, DBeaver

  • Web Interface: Basic web application for model inference (see the inference sketch after this list)
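
As a rough illustration of how the trained model would be served, the sketch below runs inference with the Hugging Face Transformers ASR pipeline; the model ID and audio file name are placeholders, not published artefacts.

```python
# Minimal inference sketch using the Hugging Face Transformers ASR pipeline.
# The model ID and audio path are placeholders for illustration only.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openpecha/garchen-stt",           # hypothetical fine-tuned checkpoint
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,                        # process long teachings in 30 s windows
)

result = asr("teaching_2024_01_segment_003.wav")
print(result["text"])
```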

✦ System Diagram

The system follows a five-phase workflow:

  1. Cataloging audio/video sources

  2. Filtering and splitting audio (see the segmentation sketch after this list)

  3. Transcription and review

  4. Data cleaning and organisation

  5. Model training and evaluation
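
Phase 2 can be prototyped with a simple silence-based splitter. The sketch below uses pydub as a lightweight stand-in for the pyannote.audio/Silero tooling listed in the tech stack; the input file and thresholds are illustrative.

```python
# Rough sketch of phase 2: split a long recording on silences into clip-sized segments.
# pydub is a simple stand-in here for pyannote.audio / Silero VAD; file names and
# thresholds are illustrative.
import os

from pydub import AudioSegment
from pydub.silence import split_on_silence

recording = AudioSegment.from_file("garchen_teaching_2019.mp3")

segments = split_on_silence(
    recording,
    min_silence_len=700,                 # a pause of at least 0.7 s marks a boundary
    silence_thresh=recording.dBFS - 16,  # relative to the recording's average loudness
    keep_silence=300,                    # keep 0.3 s of padding around each segment
)

os.makedirs("segments", exist_ok=True)
for i, segment in enumerate(segments):
    segment.export(f"segments/segment_{i:04d}.wav", format="wav")
```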

✦ Security & Privacy

  • All audio data stored in secure AWS S3 buckets with appropriate access controls (a minimal upload sketch follows this list)

  • Transcriptions reviewed and approved before use in training

  • Personal or sensitive information flagged and, where necessary, redacted during the review process

  • User permissions system to control access to different parts of the platform
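
As one illustration of the storage path, the sketch below uploads an audio segment to S3 with server-side encryption and issues a short-lived presigned URL for reviewers; the bucket name, object key, and file path are placeholders.

```python
# Illustrative upload of an audio segment to S3 with server-side encryption.
# Bucket name, object key, and local path are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="segments/segment_0003.wav",
    Bucket="garchen-stt-audio",                       # hypothetical bucket
    Key="raw/2019/teaching_2019_segment_0003.wav",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)

# Short-lived presigned URL so reviewers can listen without opening the bucket publicly.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "garchen-stt-audio",
            "Key": "raw/2019/teaching_2019_segment_0003.wav"},
    ExpiresIn=3600,  # one hour
)
print(url)
```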

✦ Dependencies

  • AWS S3 for audio storage
  • Hugging Face Hub for model and dataset hosting (see the dataset upload sketch after this list)
  • GPU infrastructure for model training
  • Pecha Tools for transcription review and correction, with transcription data managed in a database accessed through DBeaver
  • fast-antx library for aligning transcriptions
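
To sketch how curated segments and transcripts could reach the Hub, the example below builds an audio dataset from a CSV manifest with the `datasets` library and pushes it as a private repository; the repository ID, CSV columns, and paths are assumptions.

```python
# Hypothetical sketch: turn a CSV manifest (assumed columns: audio_path, transcript)
# into an audio dataset and push it to the Hugging Face Hub as a private repo.
from datasets import Audio, load_dataset

ds = load_dataset("csv", data_files="manifests/train_segments.csv", split="train")
ds = ds.cast_column("audio_path", Audio(sampling_rate=16_000))  # decode audio on access

ds.push_to_hub("openpecha/garchen-stt-training-data", private=True)  # placeholder repo ID
```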

✦ Scalability & Maintenance

  • Modular design allows adding new speakers without rebuilding entire system
  • Training pipeline designed to accommodate incremental data additions
  • Models versioned and stored on Hugging Face for reproducibility (see the versioning sketch after this list)
  • Regular evaluation against benchmark test sets to track performance over time
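
A minimal sketch of the versioning step, assuming a Wav2Vec2-style checkpoint and a placeholder repository ID:

```python
# Push a fine-tuned checkpoint and its processor to the Hub so every training run is
# recorded as a commit, then pin that revision when evaluating. Repo ID, local
# checkpoint path, and commit message are placeholders.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("checkpoints/garchen-stt-run1")
processor = Wav2Vec2Processor.from_pretrained("checkpoints/garchen-stt-run1")

model.push_to_hub("openpecha/garchen-stt", commit_message="Run 1: 5h training data")
processor.push_to_hub("openpecha/garchen-stt", commit_message="Run 1: 5h training data")

# Reload a specific revision (branch, tag, or commit hash) for reproducible evaluation.
model = Wav2Vec2ForCTC.from_pretrained("openpecha/garchen-stt", revision="main")
```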

:busts_in_silhouette: Participants

✦ Working Group Members


✦ Stakeholders

David Yeshe Nyima (Garchen STT)

✦ Point of Contact

Ganga Gyatso
Lhakpa Wangyal

:vertical_traffic_light: Project Status

✦ Current Phase

  • Preparing for the first training run of Garchen Rinpoche’s custom STT model
  • Targeting 5 hours of clean, annotated training data to initiate training
  • Dataset curation, segmentation, and transcription alignment are actively ongoing
  • Benchmark subset design in progress to ensure well-distributed evaluation samples (a sampling sketch follows this list)
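
One way to keep the benchmark subset well distributed is stratified sampling over the catalogue metadata; the sketch below assumes a pandas DataFrame with hypothetical `recording_year` and `audio_quality` columns.

```python
# Hypothetical stratified sampling for the benchmark subset: draw a few segments per
# (recording_year, audio_quality) group so no single recording condition dominates.
import pandas as pd

catalogue = pd.read_csv("manifests/segments_metadata.csv")  # placeholder manifest

benchmark = (
    catalogue
    .groupby(["recording_year", "audio_quality"], group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), 5), random_state=42))
)

benchmark.to_csv("manifests/benchmark_subset.csv", index=False)
print(f"Benchmark segments selected: {len(benchmark)}")
```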

✦ Milestones

  • :white_check_mark: Workflow and tooling setup completed
  • :white_check_mark: Public audio archive identified and segmented
  • :counterclockwise_arrows_button: Training data collection ongoing (goal: 5 hours within 4 weeks)
  • :hourglass_not_done: First model training run will begin after the data goal is met
  • :counterclockwise_arrows_button: Benchmark test set preparation using diverse metadata samples

✦ Roadmap

  • Week 1–4: Collect and annotate at least 5 hours of training data
  • Week 5: Launch first fine-tuning run for Garchen STT model
  • Week 6–7: Evaluate initial model on benchmark test sets
  • Q3 2025: Refine model and aim for <5% CER
  • Q4 2025: Release v1 public demo + continue expanding dataset
  • Q1 2026: Explore real-time transcription pipeline for live events

:spiral_calendar: Meeting Times

When does the group meet?

✦ Regular Schedule

✦ Meeting Notes

Link to running minutes, past discussions, or decisions.

:hammer_and_wrench: What We’re Working On

We maintain a public task board with all active issues and discussions.

:right_arrow: View GitHub Project Board