Purpose and Demographic
The Speech-to-Text (STT) project addresses a critical challenge in the preservation of Tibetan Buddhist wisdom by developing specialized AI models that accurately transcribe the unique speech patterns, specialized vocabulary, and accents of Buddhist teachers. By combining machine learning with meticulous human review, we transform vast archives of audio teachings into searchable, accessible digital text, opening these invaluable teachings to broader audiences while significantly reducing the time and cost of traditional manual transcription. This technology helps ensure these precious teachings remain available to future generations, scholars, practitioners, and those with hearing impairments.
✦ Mission Statement
To create high-quality, speaker-specific custom speech-to-text models that accurately transcribe the teachings of Tibetan Buddhist masters, making their wisdom more accessible and preservable.
✦ Target Demographic
- Tibetan Buddhist centers and organizations
- Students and followers of Tibetan Buddhist teachers
- Digital archivists and preservationists of Buddhist teachings
- Translators and scholars working with oral teachings
- Accessibility advocates making content available to the deaf/hard-of-hearing community
✦ Problem Statement
Many valuable teachings from Tibetan Buddhist masters exist only in audio/video formats, making them inaccessible to those who cannot hear or understand spoken Tibetan/English with unique accents. Manual transcription is extremely time-consuming and costly, creating a bottleneck in preserving and sharing these teachings. Existing general-purpose STT models perform poorly with specialized vocabulary and unique speech patterns.
Product Objectives
✦ Core Objectives
- Achieve a Character Error Rate (CER) below 5% for speaker-specific transcription
- Create a reusable workflow for building custom STT models for different teachers
- Reduce manual transcription time by at least 50%
- Enable real-time or near real-time transcription of new teachings
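The CER objective above can be checked with a plain edit-distance computation. This is an illustrative, stdlib-only sketch of the metric (in practice a library such as jiwer is commonly used for this); the example strings are hypothetical.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level Levenshtein edit distance between the two
    strings, divided by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / len(reference)

# A model meets the objective when CER < 0.05 on the benchmark test set.
print(character_error_rate("tashi delek", "tashi deleg"))  # one substitution
```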
✦ Non-Goals
- This product won't replace human transcribers entirely; it will augment their work
- The model won't focus on general-purpose speech recognition
- We won't build translation capabilities into this version of the product
- We won't implement speaker diarization (identifying who is speaking) in the initial release
✦ Impact Areas
- Digital preservation of endangered wisdom traditions
- Improved accessibility of Buddhist teachings
- Enhanced searchability of audio/video archives
- Support for translation efforts across multiple languages
- Acceleration of documentation for historically significant oral teachings
Example Use Cases
✦ Use Case: Archive Manager
- Processing a backlog of 1000+ hours of uncatalogued teachings, using the model to generate draft transcriptions that require minimal human editing
- Creating searchable archives by automatically transcribing and indexing new recorded content
✦ Use Case: Buddhist Centre Staff
- Generating real-time subtitles during live teachings for attendees who are deaf or hard-of-hearing
- Quickly producing written transcripts of weekend retreats or special events for distribution to students unable to attend
✦ Use Case: Scholar/Researcher
- Analyzing linguistic patterns across a teacher's decades of recorded teachings
- Creating accurate citations from oral teachings for academic publications
Architectural Considerations
✦ Tech Stack
- Programming Languages: Python
- ML Frameworks: Hugging Face Transformers, PyTorch
- Audio Processing: pyannote.audio or Silero VAD
- Base Models: Wav2Vec2 (300M parameters), Whisper (280M parameters)
- Data Management: AWS S3, CSV files, DBeaver
- Web Interface: Basic web application for model inference
✦ System Diagram
The system follows a five-phase workflow:
1. Cataloging audio/video sources
2. Filtering and splitting audio
3. Transcription and review
4. Data cleaning and organization
5. Model training and evaluation
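As an illustration of the filtering-and-splitting phase, fixed-length segmentation with a small overlap can be sketched as below. The 30-second window and 1-second overlap are assumptions for illustration, not project settings; the actual pipeline cuts at speech boundaries detected by a VAD tool such as pyannote.audio rather than at fixed times.

```python
def chunk_boundaries(duration_s: float, window_s: float = 30.0,
                     overlap_s: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times covering a recording with fixed-length,
    slightly overlapping windows. A VAD-based splitter would instead cut
    at silences so no word is split mid-utterance."""
    if window_s <= overlap_s:
        raise ValueError("window must be longer than the overlap")
    bounds, start = [], 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return bounds

print(chunk_boundaries(65.0))  # a 65-second recording
```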
✦ Security & Privacy
- All audio data is stored in secure AWS S3 buckets with appropriate access controls
- Transcriptions are reviewed and approved before use in training
- Personal or sensitive information is flagged, and where necessary redacted, during the review process
- A user permissions system controls access to different parts of the platform
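The flagging step could start from simple pattern matching before human review. This is a minimal sketch with assumed patterns (email addresses and phone-like numbers); these are not the project's actual redaction rules, and a reviewer makes the final call on every flagged span.

```python
import re

# Hypothetical patterns for illustration; real review would use broader
# rules plus human judgment.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def flag_pii(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs so a human reviewer can decide
    whether each flagged span should be redacted."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        hits.extend((label, m.group()) for m in pattern.finditer(text))
    return hits

print(flag_pii("Contact the office at archive@example.org or +1 555 0100."))
```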
✦ Dependencies
- AWS S3 for audio storage
- Hugging Face Hub for model and dataset hosting
- GPU infrastructure for model training
- Pecha tools for transcription review and correction; its transcription data lives in a database managed through DBeaver
- fast-antx library for aligning transcriptions
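The role fast-antx plays, aligning a machine-generated draft against its human correction, can be illustrated with Python's stdlib difflib. This sketch is not fast-antx's API (which isn't documented here), only a demonstration of the alignment idea on a hypothetical example.

```python
from difflib import SequenceMatcher

def align(draft: str, corrected: str) -> list[tuple[str, str, str]]:
    """Return (operation, draft_span, corrected_span) triples showing
    where the human reviewer changed the machine-generated draft."""
    matcher = SequenceMatcher(a=draft, b=corrected, autojunk=False)
    return [(op, draft[i1:i2], corrected[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes()]

# Show only the spans the reviewer changed.
for op, a, b in align("om mani padma hum", "om mani padme hum"):
    if op != "equal":
        print(f"{op}: {a!r} -> {b!r}")
```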
✦ Scalability & Maintenance
- Modular design allows adding new speakers without rebuilding the entire system
- Training pipeline designed to accommodate incremental data additions
- Models versioned and stored on Hugging Face for reproducibility
- Regular evaluation against benchmark test sets to track performance over time
Participants
✦ Working Group Members
✦ Stakeholders
David Yeshe Nyima (Garchen STT)
Sherabling (Situ Rinpoche STT)
Gen Drupchen (Dilgo Khyentse Rinpoche STT)
✦ Point of Contact
Project Status
✦ Current Phase
Development and data collection. Currently working on two speaker-specific models:
- Garchen Rinpoche: Initial workflow established, data collection ongoing
- Tai Situ Rinpoche: Model v2 completed with a CER of 8.08%; an estimated ~3 more hours of training data are needed to reach 5% CER
✦ Milestones
- ✓ Workflow design completed
- ✓ Initial models trained for two speakers
  - Situ Rinpoche STT model
  - Dilgo Khyentse Rinpoche STT model
- ✓ Data collection and cataloging process established
- Working toward 5% CER for both speakers
- Expanding to additional speakers/teachers
✦ Roadmap
- Q3 2025: Achieve sub-5% CER for both current speaker models
- Q4 2025: Add 2-3 additional speakers to the project
- Q1 2026: Develop streamlined interface for non-technical users
- Q2 2026: Launch public API for approved partners
Meeting Times
When does the group meet?
✦ Regular Schedule
E.g., Every Thursday at 5PM IST via Zoom
✦ Meeting Notes
Link to running minutes, past discussions, or decisions.
What We’re Working On
We maintain a public task board with all active issues and discussions.