Garchen Rinpoche Speech Project Requirement Document

:compass: Purpose and Demographic

This project creates a custom AI model to accurately transcribe the teachings of Garchen Rinpoche, a Tibetan Buddhist master. It aims to preserve and digitise his extensive audio archive, making his wisdom searchable and accessible to a wider audience through efficient and accurate transcription.

✦ Mission Statement

Build an STT model to accurately transcribe and preserve Garchen Rinpoche’s teachings for accessibility and searchability.

✦ Target Demographic

  • Garchen Buddhist Institutes and Dharma Centres
  • Students and practitioners of Garchen Rinpoche
  • Digital archivists working to preserve Tibetan oral teachings
  • Scholars and translators working on Garchen Rinpoche’s lineage
  • Accessibility advocates supporting the deaf and hard-of-hearing community

✦ Problem Statement

Garchen Rinpoche’s teachings are largely inaccessible because most recordings exist only in audio form, with no text transcripts. Standard speech-to-text (STT) systems fail to transcribe his unique speech patterns accurately, creating an urgent need for a specialised STT solution.

:bullseye: Product Objectives

✦ Core Objectives

  • Develop a Garchen-specific STT model with a Character Error Rate (CER) below 5% (see the CER sketch after this list)
  • Build a repeatable, scalable workflow for transcribing Rinpoche’s past and future recordings
  • Reduce manual transcription effort by 50% or more
  • Enable near real-time subtitle generation for live teachings or events
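
CER is the character-level edit distance between the model output and a human reference transcript, divided by the reference length. A minimal sketch of how the <5% target can be measured is below; the sample Tibetan strings are purely illustrative.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming table for the Levenshtein distance over characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)


# Illustrative example: reference transcript vs. model output.
reference = "བླ་མ་མཁྱེན་ནོ།"
hypothesis = "བླ་མ་མཁྱན་ནོ།"
print(f"CER: {cer(reference, hypothesis):.2%}")  # target: below 5%
```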

✦ Non-Goals

  • This product won’t replace human transcribers entirely; it will augment their work

  • The model won’t focus on general-purpose speech recognition

  • We won’t build translation capabilities into this version of the product

  • We won’t implement speaker diarization (identifying who is speaking) in the initial release

✦ Impact Areas

  • Preservation of Garchen Rinpoche’s teachings in digital form
  • Easier access for students, scholars, and archivists
  • Improved inclusion for deaf/hard-of-hearing individuals
  • Support for creating searchable audio/video archives
  • Contribution to Tibetan linguistic research and cultural continuity

:light_bulb: Example Use Cases

✦ Use Case: Garchen Institute Archivist

Digitise and transcribe hundreds of hours of legacy teachings from Garchen Rinpoche’s personal archive with minimal human correction required.

✦ Use Case: Translator

Extract clean transcripts from teaching sessions to create translated versions for international audiences.

✦ Use Case: Online Retreat Staff

Use the model to generate subtitles and transcripts of Garchen Rinpoche’s live online teachings in near real-time, supporting global accessibility.

:building_construction: Architectural Considerations

✦ Tech Stack

  • Programming Languages: Python

  • ML Frameworks: Hugging Face Transformers, PyTorch

  • Audio Processing: pyannote.audio or Silero VAD

  • Base Models: Wav2Vec2 (300M parameters), Whisper (280M parameters)

  • Data Management: AWS S3, CSV files, DBeaver

  • Web Interface: Basic web application for model inference (see the inference sketch after this list)
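
As a rough illustration of how the trained model would be served, the sketch below runs inference with the Hugging Face Transformers ASR pipeline; the model ID and audio file name are placeholders, not published artefacts.

```python
# Minimal inference sketch using the Hugging Face Transformers ASR pipeline.
# The model ID and audio path are placeholders for illustration only.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openpecha/garchen-stt",           # hypothetical fine-tuned checkpoint
    device=0 if torch.cuda.is_available() else -1,
    chunk_length_s=30,                        # process long teachings in 30 s windows
)

result = asr("teaching_2024_01_segment_003.wav")
print(result["text"])
```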

✦ System Diagram

The system follows a five-phase workflow:

  1. Cataloging audio/video sources

  2. Filtering and splitting audio (see the segmentation sketch after this list)

  3. Transcription and review

  4. Data cleaning and organisation

  5. Model training and evaluation
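
Phase 2 can be prototyped with a simple silence-based splitter. The sketch below uses pydub as a lightweight stand-in for the pyannote.audio/Silero tooling listed in the tech stack; the input file and thresholds are illustrative.

```python
# Rough sketch of phase 2: split a long recording on silences into clip-sized segments.
# pydub is a simple stand-in here for pyannote.audio / Silero VAD; file names and
# thresholds are illustrative.
import os

from pydub import AudioSegment
from pydub.silence import split_on_silence

recording = AudioSegment.from_file("garchen_teaching_2019.mp3")

segments = split_on_silence(
    recording,
    min_silence_len=700,                 # a pause of at least 0.7 s marks a boundary
    silence_thresh=recording.dBFS - 16,  # relative to the recording's average loudness
    keep_silence=300,                    # keep 0.3 s of padding around each segment
)

os.makedirs("segments", exist_ok=True)
for i, segment in enumerate(segments):
    segment.export(f"segments/segment_{i:04d}.wav", format="wav")
```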

✦ Security & Privacy

  • All audio data stored in secure AWS S3 buckets with appropriate access controls (a minimal upload sketch follows this list)

  • Transcriptions reviewed and approved before use in training

  • Personal or sensitive information flagged and, where necessary, redacted during the review process

  • User permissions system to control access to different parts of the platform
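
As one illustration of the storage path, the sketch below uploads an audio segment to S3 with server-side encryption and issues a short-lived presigned URL for reviewers; the bucket name, object key, and file path are placeholders.

```python
# Illustrative upload of an audio segment to S3 with server-side encryption.
# Bucket name, object key, and local path are placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="segments/segment_0003.wav",
    Bucket="garchen-stt-audio",                       # hypothetical bucket
    Key="raw/2019/teaching_2019_segment_0003.wav",
    ExtraArgs={"ServerSideEncryption": "AES256"},
)

# Short-lived presigned URL so reviewers can listen without opening the bucket publicly.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "garchen-stt-audio",
            "Key": "raw/2019/teaching_2019_segment_0003.wav"},
    ExpiresIn=3600,  # one hour
)
print(url)
```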

✦ Dependencies

  • AWS S3 for audio storage
  • Hugging Face Hub for model and dataset hosting (see the dataset upload sketch after this list)
  • GPU infrastructure for model training
  • Pecha Tools for transcription review and correction, with transcription data managed in a database accessed through DBeaver
  • fast-antx library for aligning transcriptions
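
To sketch how curated segments and transcripts could reach the Hub, the example below builds an audio dataset from a CSV manifest with the `datasets` library and pushes it as a private repository; the repository ID, CSV columns, and paths are assumptions.

```python
# Hypothetical sketch: turn a CSV manifest (assumed columns: audio_path, transcript)
# into an audio dataset and push it to the Hugging Face Hub as a private repo.
from datasets import Audio, load_dataset

ds = load_dataset("csv", data_files="manifests/train_segments.csv", split="train")
ds = ds.cast_column("audio_path", Audio(sampling_rate=16_000))  # decode audio on access

ds.push_to_hub("openpecha/garchen-stt-training-data", private=True)  # placeholder repo ID
```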

✦ Scalability & Maintenance

  • Modular design allows adding new speakers without rebuilding entire system
  • Training pipeline designed to accommodate incremental data additions
  • Models versioned and stored on Hugging Face for reproducibility (see the versioning sketch after this list)
  • Regular evaluation against benchmark test sets to track performance over time
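
A minimal sketch of the versioning step, assuming a Wav2Vec2-style checkpoint and a placeholder repository ID:

```python
# Push a fine-tuned checkpoint and its processor to the Hub so every training run is
# recorded as a commit, then pin that revision when evaluating. Repo ID, local
# checkpoint path, and commit message are placeholders.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model = Wav2Vec2ForCTC.from_pretrained("checkpoints/garchen-stt-run1")
processor = Wav2Vec2Processor.from_pretrained("checkpoints/garchen-stt-run1")

model.push_to_hub("openpecha/garchen-stt", commit_message="Run 1: 5h training data")
processor.push_to_hub("openpecha/garchen-stt", commit_message="Run 1: 5h training data")

# Reload a specific revision (branch, tag, or commit hash) for reproducible evaluation.
model = Wav2Vec2ForCTC.from_pretrained("openpecha/garchen-stt", revision="main")
```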

:busts_in_silhouette: Participants

✦ Working Group Members


✦ Stakeholders

David Yeshe Nyima (Garchen STT)

✦ Point of Contact

Ganga Gyatso
Lhakpa Wangyal

:vertical_traffic_light: Project Status

✦ Current Phase

  • Preparing for the first training run of Garchen Rinpoche’s custom STT model
  • Targeting 5 hours of clean, annotated training data to initiate training
  • Dataset curation, segmentation, and transcription alignment are actively ongoing
  • Benchmark subset design in progress to ensure well-distributed evaluation samples (a sampling sketch follows this list)
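
One way to keep the benchmark subset well distributed is stratified sampling over the catalogue metadata; the sketch below assumes a pandas DataFrame with hypothetical `recording_year` and `audio_quality` columns.

```python
# Hypothetical stratified sampling for the benchmark subset: draw a few segments per
# (recording_year, audio_quality) group so no single recording condition dominates.
import pandas as pd

catalogue = pd.read_csv("manifests/segments_metadata.csv")  # placeholder manifest

benchmark = (
    catalogue
    .groupby(["recording_year", "audio_quality"], group_keys=False)
    .apply(lambda g: g.sample(n=min(len(g), 5), random_state=42))
)

benchmark.to_csv("manifests/benchmark_subset.csv", index=False)
print(f"Benchmark segments selected: {len(benchmark)}")
```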

✦ Milestones

  • :white_check_mark: Workflow and tooling setup completed
  • :white_check_mark: Public audio archive identified and segmented
  • :counterclockwise_arrows_button: Training data collection ongoing (goal: 5 hours within 4 weeks)
  • :hourglass_not_done: First model training run will begin after the data goal is met
  • :counterclockwise_arrows_button: Benchmark test set preparation using diverse metadata samples

✦ Roadmap

  • Week 1–4: Collect and annotate at least 5 hours of training data
  • Week 5: Launch first fine-tuning run for Garchen STT model
  • Week 6–7: Evaluate initial model on benchmark test sets
  • Q3 2025: Refine model and aim for <5% CER
  • Q4 2025: Release v1 public demo + continue expanding dataset
  • Q1 2026: Explore real-time transcription pipeline for live events

:spiral_calendar: Meeting Times

When does the group meet?

✦ Regular Schedule

✦ Meeting Notes

Link to running minutes, past discussions, or decisions.

:hammer_and_wrench: What We’re Working On

We maintain a public task board with all active issues and discussions.

:right_arrow: View GitHub Project Board