Purpose and Demographic
The Speech-to-Text (STT) project addresses a critical challenge in the preservation of Tibetan Buddhist wisdom by developing specialized AI models that accurately transcribe the unique speech patterns, specialized vocabulary, and accents of Buddhist teachers. By combining machine learning with meticulous human review, we transform vast archives of audio teachings into searchable, accessible digital text, opening these invaluable teachings to broader audiences while significantly reducing the time and cost of traditional manual transcription. This technology helps ensure these precious teachings remain available to future generations, scholars, practitioners, and those with hearing impairments.
✦ Mission Statement
To create high-quality, speaker-specific custom speech-to-text models that accurately transcribe the teachings of Tibetan Buddhist masters, making their wisdom more accessible and preservable.
✦ Target Demographic
- Tibetan Buddhist centers and organizations
- Students and followers of Tibetan Buddhist teachers
- Digital archivists and preservationists of Buddhist teachings
- Translators and scholars working with oral teachings
- Accessibility advocates making content available to the deaf/hard-of-hearing community
✦ Problem Statement
Many valuable teachings from Tibetan Buddhist masters exist only in audio/video formats, making them inaccessible to those who cannot hear or understand spoken Tibetan/English with unique accents. Manual transcription is extremely time-consuming and costly, creating a bottleneck in preserving and sharing these teachings. Existing general-purpose STT models perform poorly with specialized vocabulary and unique speech patterns.
Product Objectives
✦ Core Objectives
- Achieve a Character Error Rate (CER) below 5% for speaker-specific transcription
- Create a reusable workflow for building custom STT models for different teachers
- Reduce manual transcription time by at least 50%
- Enable real-time or near real-time transcription of new teachings
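The CER objective above can be checked with a plain edit-distance computation. This is an illustrative, stdlib-only sketch of the metric (in practice a library such as jiwer is commonly used for this); the example strings are hypothetical.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level Levenshtein edit distance between the two
    strings, divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / len(reference)

# A model meets the objective when CER < 0.05 on the benchmark test set.
print(character_error_rate("tashi delek", "tashi deleg"))  # one substitution
```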
✦ Non-Goals
- This product won't replace human transcribers entirely; it will augment their work
- The model won't focus on general-purpose speech recognition
- We won't build translation capabilities into this version of the product
- We won't implement speaker diarization (identifying who is speaking) in the initial release
✦ Impact Areas
- Digital preservation of endangered wisdom traditions
- Improved accessibility of Buddhist teachings
- Enhanced searchability of audio/video archives
- Support for translation efforts across multiple languages
- Acceleration of documentation for historically significant oral teachings
Example Use Cases
✦ Use Case: Archive Manager
- Processing a backlog of 1000+ hours of uncatalogued teachings, using the model to generate draft transcriptions that require minimal human editing
- Creating searchable archives by automatically transcribing and indexing new recorded content
✦ Use Case: Buddhist Centre Staff
- Generating real-time subtitles during live teachings for attendees who are deaf or hard-of-hearing
- Quickly producing written transcripts of weekend retreats or special events for distribution to students unable to attend
✦ Use Case: Scholar/Researcher
- Analyzing linguistic patterns across a teacher's decades of recorded teachings
- Creating accurate citations from oral teachings for academic publications
Architectural Considerations
✦ Tech Stack
- Programming Languages: Python
- ML Frameworks: Hugging Face Transformers, PyTorch
- Audio Processing: pyannote.audio or Silero VAD
- Base Models: Wav2Vec2 (300M parameters), Whisper (280M parameters)
- Data Management: AWS S3, CSV files, DBeaver
- Web Interface: Basic web application for model inference
✦ System Diagram
The system follows a five-phase workflow:
1. Cataloging audio/video sources
2. Filtering and splitting audio
3. Transcription and review
4. Data cleaning and organization
5. Model training and evaluation
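As an illustration of the filtering-and-splitting phase, fixed-length segmentation with a small overlap can be sketched as below. The 30-second window and 1-second overlap are assumptions for illustration, not project settings; the actual pipeline cuts at speech boundaries detected by a VAD tool such as pyannote.audio rather than at fixed times.

```python
def chunk_boundaries(duration_s: float, window_s: float = 30.0,
                     overlap_s: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times covering a recording with fixed-length,
    slightly overlapping windows. A VAD-based splitter would instead cut
    at silences so no word is split mid-utterance."""
    if window_s <= overlap_s:
        raise ValueError("window must be longer than the overlap")
    bounds, start = [], 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return bounds

print(chunk_boundaries(65.0))  # a 65-second recording
```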
✦ Security & Privacy
- All audio data is stored in secure AWS S3 buckets with appropriate access controls
- Transcriptions are reviewed and approved before use in training
- Personal or sensitive information is flagged, and where necessary redacted, during the review process
- A user permissions system controls access to different parts of the platform
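The flagging step could start from simple pattern matching before human review. This is a minimal sketch with assumed patterns (email addresses and phone-like numbers); these are not the project's actual redaction rules, and a reviewer makes the final call on every flagged span.

```python
import re

# Hypothetical patterns for illustration; real review would use broader
# rules plus human judgment.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs so a human reviewer can decide
    whether each flagged span should be redacted."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits

print(flag_pii("Contact the office at archive@example.org or +1 555 0100."))
```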
✦ Dependencies
- AWS S3 for audio storage
- Hugging Face Hub for model and dataset hosting
- GPU infrastructure for model training
- Pecha tools for transcription review and correction; its transcription data lives in a database managed through DBeaver
- fast-antx library for aligning transcriptions
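The role fast-antx plays, aligning a machine-generated draft against its human correction, can be illustrated with Python's stdlib difflib. This sketch is not fast-antx's API (which isn't documented here), only a demonstration of the alignment idea on a hypothetical example.

```python
from difflib import SequenceMatcher

def align(draft: str, corrected: str) -> list[tuple[str, str, str]]:
    """Return (operation, draft_span, corrected_span) triples showing
    where the human reviewer changed the machine-generated draft."""
    matcher = SequenceMatcher(a=draft, b=corrected, autojunk=False)
    return [(op, draft[i1:i2], corrected[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()]

# Show only the spans the reviewer changed.
for op, a, b in align("om mani padma hum", "om mani padme hum"):
    if op != "equal":
        print(f"{op}: {a!r} -> {b!r}")
```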
✦ Scalability & Maintenance
- Modular design allows adding new speakers without rebuilding the entire system
- Training pipeline designed to accommodate incremental data additions
- Models versioned and stored on Hugging Face for reproducibility
- Regular evaluation against benchmark test sets to track performance over time
Participants
✦ Working Group Members
✦ Stakeholders
David Yeshe Nyima (Garchen STT)
Sherabling (Situ Rinpoche STT)
Gen Drupchen (Dilgo Khyentse Rinpoche STT)
✦ Point of Contact
Project Status
✦ Current Phase
Development and data collection. Currently working on two speaker-specific models:
- Garchen Rinpoche: Initial workflow established, data collection ongoing
- Tai Situ Rinpoche: Model v2 completed with a CER of 8.08%; an estimated ~3 more hours of training data are needed to reach 5% CER
✦ Milestones
- ✓ Workflow design completed
- ✓ Initial models trained for two speakers
  - Situ Rinpoche STT model
  - Dilgo Khyentse Rinpoche STT model
- ✓ Data collection and cataloging process established
- Working toward 5% CER for both speakers
- Expanding to additional speakers/teachers
✦ Roadmap
- Q3 2025: Achieve sub-5% CER for both current speaker models
- Q4 2025: Add 2-3 additional speakers to the project
- Q1 2026: Develop streamlined interface for non-technical users
- Q2 2026: Launch public API for approved partners
Meeting Times
When does the group meet?
✦ Regular Schedule
E.g., Every Thursday at 5PM IST via Zoom
✦ Meeting Notes
Link to running minutes, past discussions, or decisions.
What We’re Working On
We maintain a public task board with all active issues and discussions.