Building a Speech-to-Text Model for Garchen Rinpoche: A Step-by-Step Workflow
Training a high-quality speech-to-text (STT) model tailored for a specific speaker, such as Garchen Rinpoche, is a complex but rewarding process. Our goal is to create a model that can accurately transcribe Rinpoche’s speech, leveraging a diverse collection of audio and video sources, some with existing transcriptions and others without. Below, I outline the workflow we follow, from data collection to model evaluation.
Figure: End-to-end workflow for building a speech-to-text model for Garchen Rinpoche.
Phase 1: Cataloging Audio and Video Sources
The first step is to gather all available audio and video recordings of Garchen Rinpoche. These sources may or may not already have transcriptions. We maintain a centralized catalog—typically a Google Sheet—where we log each source along with its metadata, such as audio length, speaker name, and transcription availability.
- Manual Entry: For unique or rare recordings, we add links and metadata by hand.
- Automated Collection: For platforms or accounts with many relevant videos, we use scripts to programmatically extract links and metadata (a minimal collection sketch follows below).
This phase is all about building a comprehensive, well-organized repository of raw data. Example catalog here.
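The cataloging scripts are not shown in this post, so the following is a minimal sketch of the automated-collection step, assuming yt-dlp as the scraper. The channel URL, output file name, and column choices are illustrative placeholders; in practice the rows are appended to the Google Sheet catalog.

```python
# Sketch: list videos on a channel and dump their metadata for the catalog.
# yt-dlp is an assumed tool choice; the URL and output file are placeholders.
import csv
from yt_dlp import YoutubeDL

CHANNEL_URL = "https://www.youtube.com/@example-channel"  # hypothetical source account

with YoutubeDL({"extract_flat": True, "quiet": True}) as ydl:
    info = ydl.extract_info(CHANNEL_URL, download=False)   # list entries, download nothing

with open("catalog.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url", "duration_sec", "has_transcription"])
    for entry in info.get("entries", []):
        writer.writerow([entry.get("title"), entry.get("url"), entry.get("duration"), ""])
```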
Phase 2: Filtering and Splitting Audio
Once the sources are cataloged, we process the audio to prepare it for model training:
- Audio Splitting: Using tools like pyannote.audio, we segment the recordings into clips of 3 to 12 seconds each. This keeps every segment manageable and focused on clear speech, minimizing silence and noise (see the sketches after this list).
- Quality Control: We fine-tune the splitting parameters to maximize usable audio while filtering out segments with excessive noise or silence.
- Standardization: All audio files are converted to 16 kHz WAV format.
- Storage: The processed audio segments are uploaded to an AWS S3 bucket (s3://monlam.ai.stt/wav16k).
- Preliminary Transcription: We run these segments through a general-purpose STT model to generate initial transcriptions and metadata (segment ID, length, etc.), storing the results in a CSV file.
- Transcription Transfer (if available): When a split segment comes from a source that already has a human-created transcription, we use our fast-antx library to align and transfer that high-quality text to the corresponding audio segments. The alignment matches the reference transcription to segments whose model-generated inference closely resembles the original text, so the resulting segment transcriptions are as accurate and grammatically correct as possible. Leveraging human-verified transcriptions in this way improves the overall quality of our training data, since these texts are generally superior to automatic model inferences.
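As a rough illustration of the splitting and standardization steps, here is a minimal sketch assuming the pretrained pyannote/voice-activity-detection pipeline and pydub for slicing and resampling. The input file, Hugging Face token, and exact tuning parameters are placeholders; only the 3 to 12 second bound and the 16 kHz target come from the description above.

```python
# Sketch: VAD-based splitting with pyannote.audio, keeping 3-12 s clips as 16 kHz mono WAV.
# Pipeline name, token, and file names are assumptions; splitting parameters are not tuned here.
from pyannote.audio import Pipeline
from pydub import AudioSegment

vad = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",   # assumed pretrained pipeline
    use_auth_token="HF_TOKEN",             # placeholder token
)

speech = vad("teaching_full.wav")                       # speech regions as an Annotation
audio = AudioSegment.from_wav("teaching_full.wav")

kept = []
for i, region in enumerate(speech.get_timeline().support()):
    duration = region.end - region.start
    if 3.0 <= duration <= 12.0:                         # drop clips outside the target range
        clip = audio[int(region.start * 1000):int(region.end * 1000)]
        clip = clip.set_frame_rate(16000).set_channels(1)   # standardize to 16 kHz mono
        name = f"segment_{i:05d}.wav"
        clip.export(name, format="wav")
        kept.append((name, duration))
```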
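Continuing the sketch above (it reuses the kept list), the segments can then be pushed to the wav16k prefix on S3 and run through a general-purpose STT model to seed the CSV of preliminary transcriptions. The whisper-small checkpoint below is only a stand-in for whichever general-purpose model is actually used.

```python
# Sketch: upload segments to s3://monlam.ai.stt/wav16k and store preliminary transcriptions.
# The STT checkpoint is a placeholder; `kept` comes from the splitting sketch above.
import csv
import boto3
from transformers import pipeline

s3 = boto3.client("s3")
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # stand-in model

with open("preliminary_transcriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["segment_id", "duration_sec", "inference"])
    for name, duration in kept:
        s3.upload_file(name, "monlam.ai.stt", f"wav16k/{name}")
        text = asr(name)["text"]
        writer.writerow([name, round(duration, 2), text])
```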
Phase 3: Transcription and Review
The segmented audio files and their preliminary transcriptions are uploaded into our database (managed via DBeaver for stt.pecha.tools and pecha.tools). This enables collaborative, multi-stage transcription:
- Transcriber (Group A): Listens to each segment and provides an initial transcription, changing the segment’s state from “transcribing” to “submitted.”
- Reviewer (Person 2): Reviews the transcription and either marks it as “reviewed” or “rejected.”
- Group Leader: Performs a final quality check, marking the segment as “finalized” if it meets standards.
Throughout this process, we track statistics such as hours transcribed, group progress, and individual contributions (useful for task management and salary calculations).
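The exact queries depend on the database schema, but these statistics can be reproduced from a plain export of the segments table. Below is a minimal sketch with pandas, assuming state, duration_sec, group_name, and transcriber columns.

```python
# Sketch: progress statistics from an export of the segments table.
# Column names are assumptions about the schema, not the actual stt.pecha.tools fields.
import pandas as pd

segments = pd.read_csv("segments_export.csv")
finalized = segments[segments["state"] == "finalized"]

print("Hours finalized:", finalized["duration_sec"].sum() / 3600)
print(finalized.groupby("group_name")["duration_sec"].sum() / 3600)   # hours per group
print(finalized.groupby("transcriber").size())                        # segments per contributor
```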
Phase 4: Data Cleaning and Organization
With finalized transcriptions in hand, we move on to data cleaning:
- Export: We extract the finalized segments and transcriptions from the database into a CSV file.
- Cleaning: The transcriptions are cleaned to remove unwanted symbols and standardized, including converting Wylie transliteration to English.
- Organization: We analyze the dataset (total hours, segment distribution, etc.) and split it into training, validation, and test sets.
- Benchmarking: The test set is fixed and never used for training, ensuring fair and consistent evaluation throughout model development.
- Upload: The prepared datasets are uploaded to Hugging Face for easy access and sharing (a combined sketch of these steps follows this list).
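Here is a combined sketch of the cleaning, analysis, splitting, and upload steps, using the datasets library. The cleaning rule, split ratios, column names, and repository name are illustrative assumptions rather than the exact pipeline.

```python
# Sketch: clean transcriptions, inspect totals, create splits, and push to the Hugging Face Hub.
# The regex, split sizes, column names, and repository name are assumptions for illustration.
import re
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.read_csv("finalized_segments.csv")

def clean_text(text: str) -> str:
    # Keep Tibetan characters and whitespace only (assumed cleaning rule).
    return re.sub(r"[^\u0F00-\u0FFF\s]", "", text).strip()

df["transcription"] = df["transcription"].map(clean_text)
print("Total hours:", df["duration_sec"].sum() / 3600)     # dataset analysis

split = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
dataset = DatasetDict({
    "train": split["train"],
    "validation": heldout["train"],
    "test": heldout["test"],        # fixed benchmark, never used for training
})
dataset.push_to_hub("openpecha/garchen-rinpoche-stt")      # hypothetical repository name
```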
Phase 5: Model Training
The final phase is training and evaluating the STT model:
- Data Preparation: We load the training data from Hugging Face and convert it into the required input format (see the first sketch after this list).
- Model Selection: We experiment with models such as Wav2Vec2 (300M parameters, small) and Whisper (280M parameters, small), with the option to try larger variants for improved performance.
- Training: The model is trained on a GPU system with carefully chosen hyperparameters.
- Evaluation: After training, we evaluate the model on the fixed test set, reporting metrics like Character Error Rate (CER) and Word Error Rate (WER); a minimal evaluation sketch follows this list.
- Progress Tracking: By training multiple models, we can estimate how much data is needed to reach a target CER (e.g., 5%), helping us plan future data collection and annotation efforts.
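For the data-preparation step, here is a minimal sketch of converting the Hugging Face dataset into Whisper training features. The checkpoint, repository, and column names are assumptions carried over from the earlier sketches.

```python
# Sketch: turn the audio dataset into Whisper input features and label ids.
# Checkpoint, repository, and column names are assumptions, not the actual training config.
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
train = load_dataset("openpecha/garchen-rinpoche-stt", split="train")   # hypothetical repo
train = train.cast_column("audio", Audio(sampling_rate=16000))          # match the 16 kHz segments

def prepare(example):
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

train = train.map(prepare, remove_columns=train.column_names)
```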
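And for the evaluation step, a sketch of scoring a fine-tuned checkpoint on the fixed test split with the evaluate library; the checkpoint path, repository, and column names are again placeholders.

```python
# Sketch: compute CER and WER on the held-out test split.
# The checkpoint path, repository, and column names are placeholders.
import evaluate
from datasets import load_dataset
from transformers import pipeline

test_set = load_dataset("openpecha/garchen-rinpoche-stt", split="test")     # hypothetical repo
asr = pipeline("automatic-speech-recognition", model="./checkpoints/best")  # fine-tuned model path

predictions, references = [], []
for example in test_set:
    audio = example["audio"]
    result = asr({"raw": audio["array"], "sampling_rate": audio["sampling_rate"]})
    predictions.append(result["text"])
    references.append(example["transcription"])

print("CER:", evaluate.load("cer").compute(predictions=predictions, references=references))
print("WER:", evaluate.load("wer").compute(predictions=predictions, references=references))
```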
Conclusion
This workflow ensures that we build a robust, high-quality speech-to-text model for Garchen Rinpoche, leveraging both human expertise and state-of-the-art machine learning techniques. By systematically cataloging, processing, transcribing, and cleaning our data—and rigorously evaluating our models—we move steadily toward our goal of accurate, automated transcription for this unique and valuable speech corpus.