Custom Speech-to-Text Model for Garchen Rinpoche’s Teachings

:compass: Purpose and Demographic

This Speech-to-Text (STT) project focuses on developing a custom, high-accuracy transcription model tailored specifically for the voice of Garchen Rinpoche—a revered Tibetan Buddhist master. His extensive archive of oral teachings represents a rich cultural and spiritual heritage, yet remains largely inaccessible in text form due to limitations in manual transcription capacity and the challenges general-purpose STT models face with his speech.

By applying specialized AI modeling, this project aims to preserve, digitize, and make searchable Garchen Rinpoche’s spoken wisdom, improving access for scholars, practitioners, and future generations. Through a combination of fine-tuned machine learning and human review, the project reduces the time and cost of transcription while safeguarding the accuracy and integrity of these sacred teachings.

✦ Mission Statement

To build a high-quality, speaker-specific STT model that can accurately transcribe Garchen Rinpoche’s teachings, making them accessible, searchable, and preserved for generations to come.

✦ Target Demographic

  • Garchen Buddhist Institutes and Dharma Centers
  • Students and practitioners of Garchen Rinpoche
  • Digital archivists working to preserve Tibetan oral teachings
  • Scholars and translators working on Garchen Rinpoche’s lineage
  • Accessibility advocates supporting the deaf and hard-of-hearing community

✦ Problem Statement

Much of Garchen Rinpoche’s spiritual legacy exists only in audio or video formats. These teachings are difficult to access, search, or translate without high-quality transcripts. Manual transcription is time-consuming and expensive, while general-purpose STT models fail to capture the nuances of Garchen Rinpoche’s speech patterns, intonation, and specialized terminology. A dedicated STT model is urgently needed to bridge this gap.


:bullseye: Product Objectives

✦ Core Objectives

  • Develop a Garchen-specific STT model with Character Error Rate (CER) below 5%
  • Build a repeatable, scalable workflow for transcribing Rinpoche’s past and future recordings
  • Reduce manual transcription effort by 50% or more
  • Enable near real-time subtitle generation for live teachings or events
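
The CER target above can be made concrete. Below is a minimal sketch of how CER is typically computed: character-level Levenshtein edit distance divided by reference length. The function and variable names are illustrative only, not part of the project codebase:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    # Standard dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(hypothesis)] / max(1, len(reference))
```

A CER below 0.05 (5%) means fewer than one character in twenty differs from the reviewed reference transcript.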

✦ Non-Goals

  • This product won’t replace human transcribers entirely; it will augment their work

  • The model won’t focus on general-purpose speech recognition

  • We won’t build translation capabilities into this version of the product

  • We won’t implement speaker diarization (identifying who is speaking) in the initial release

✦ Impact Areas

  • Preservation of Garchen Rinpoche’s teachings in digital form
  • Easier access for students, scholars, and archivists
  • Improved inclusion for deaf/hard-of-hearing individuals
  • Support for creating searchable audio/video archives
  • Contribution to Tibetan linguistic research and cultural continuity

:light_bulb: Example Use Cases

✦ Use Case: Garchen Institute Archivist

Digitize and transcribe hundreds of hours of legacy teachings from Garchen Rinpoche’s personal archive with minimal human correction required.

✦ Use Case: Translator

Extract clean transcripts from teaching sessions to create translated versions for international audiences.

✦ Use Case: Online Retreat Staff

Use the model to generate subtitles and transcripts of Garchen Rinpoche’s live online teachings in near real-time, supporting global accessibility.

:building_construction: Architectural Considerations

✦ Tech Stack

  • Programming Languages: Python

  • ML Frameworks: Hugging Face Transformers, PyTorch

  • Audio Processing: pyannote.audio or Silero

  • Base Models: Wav2Vec2 (300M parameters), Whisper (280M parameters)

  • Data Management: AWS S3, CSV files, DBeaver

  • Web Interface: Basic web application for model inference
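
As a rough sketch, the stack above could be pinned in a requirements file like the following (the version bounds are illustrative assumptions, not the project’s actual pins):

```
torch>=2.1
transformers>=4.38
pyannote.audio>=3.1
boto3>=1.34
```

Pinning minimum versions keeps training runs reproducible across environments.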

✦ System Diagram

The system follows a five-phase workflow:

  1. Cataloging audio/video sources

  2. Filtering and splitting audio

  3. Transcription and review

  4. Data cleaning and organization

  5. Model training and evaluation
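
Phase 2 (filtering and splitting) cuts long recordings into model-sized windows. The sketch below assumes simple fixed-length 30-second windows, a common input length for Whisper-family models; in practice, segmentation would follow voice-activity boundaries from a tool such as pyannote.audio rather than hard cuts:

```python
def split_segments(total_sec: float, max_len: float = 30.0):
    """Return (start, end) windows covering a recording, each at most max_len seconds."""
    segments, start = [], 0.0
    while start < total_sec:
        end = min(start + max_len, total_sec)
        segments.append((start, end))
        start = end
    return segments
```

For a 75-second clip this yields two full 30 s windows plus a 15 s tail.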

✦ Security & Privacy

  • All audio data stored in secure AWS S3 buckets with appropriate access controls

  • Transcriptions reviewed and approved before use in training

  • Personal or sensitive information will be flagged and, where necessary, redacted during the review process

  • User permissions system to control access to different parts of the platform

✦ Dependencies

  • AWS S3 for audio storage

  • Hugging Face Hub for model and dataset hosting

  • GPU infrastructure for model training

  • Pecha Tools for transcription review and corrections, with transcription data managed in a database accessed through DBeaver

  • fast-antx library for aligning transcriptions
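
The alignment step can be illustrated with a generic stand-in. The sketch below uses Python’s stdlib difflib rather than fast-antx’s actual API (not shown here) to convey the idea: aligning a raw ASR transcript against a human-corrected one to locate agreements and corrections:

```python
import difflib

def align(auto: str, corrected: str):
    """Yield (op, auto_span, corrected_span) triples showing where the ASR
    output and the reviewed transcript agree ('equal') or differ."""
    sm = difflib.SequenceMatcher(a=auto, b=corrected, autojunk=False)
    for op, a1, a2, b1, b2 in sm.get_opcodes():
        yield op, auto[a1:a2], corrected[b1:b2]
```

Aligned spans make it easy to count corrections per segment and to feed only reviewed text into training.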

✦ Scalability & Maintenance

  • Modular design allows adding new speakers without rebuilding entire system

  • Training pipeline designed to accommodate incremental data additions

  • Models versioned and stored on Hugging Face for reproducibility

  • Regular evaluation against benchmark test sets to track performance over time


:busts_in_silhouette: Participants

✦ Working Group Members


✦ Stakeholders

David Yeshe Nyima (Garchen STT)

✦ Point of Contact

Ganga Gyatso


:vertical_traffic_light: Project Status

✦ Current Phase

  • Preparing for the first training run of Garchen Rinpoche’s custom STT model
  • Targeting 5 hours of clean, annotated training data to initiate training
  • Dataset curation, segmentation, and transcription alignment are actively ongoing
  • Benchmark subset design in progress to ensure well-distributed evaluation samples
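
Benchmark subset design of this kind is often done with stratified sampling over metadata, so that every category of teaching is represented in evaluation. A minimal sketch, assuming each recording carries a metadata field such as topic (the field name and group sizes are hypothetical):

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_group=2, seed=0):
    """Reproducibly pick up to per_group items from each metadata group."""
    rng = random.Random(seed)          # fixed seed keeps the benchmark stable
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for group in sorted(groups):       # sorted for deterministic group order
        members = list(groups[group])
        rng.shuffle(members)
        sample.extend(members[:per_group])
    return sample
```

Holding the seed fixed means the same benchmark subset is regenerated on every run, which keeps evaluation results comparable across model versions.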

✦ Milestones

  • :white_check_mark: Workflow and tooling setup completed
  • :white_check_mark: Public audio archive identified and segmented
  • :counterclockwise_arrows_button: Training data collection ongoing (goal: 5 hours within 4 weeks)
  • :hourglass_not_done: First model training run will begin after data goal is met
  • :counterclockwise_arrows_button: Benchmark test set preparation using diverse metadata samples

✦ Roadmap

  • Week 1–4: Collect and annotate at least 5 hours of training data
  • Week 5: Launch the first fine-tuning run for the Garchen STT model
  • Week 6–7: Evaluate the initial model on benchmark test sets
  • Q3 2025: Refine the model and aim for <5% CER
  • Q4 2025: Release v1 public demo and continue expanding the dataset
  • Q1 2026: Explore a real-time transcription pipeline for live events

:spiral_calendar: Meeting Times


✦ Regular Schedule

E.g., Every Thursday at 5PM IST via Zoom

✦ Meeting Notes

Link to running minutes, past discussions, or decisions.


:hammer_and_wrench: What We’re Working On

We maintain a public task board with all active issues and discussions.

:right_arrow: View GitHub Project Board