Building a Speech-to-Text Model for Garchen Rinpoche: A Step-by-Step Workflow

Training a high-quality speech-to-text (STT) model tailored for a specific speaker, such as Garchen Rinpoche, is a complex but rewarding process. Our goal is to create a model that can accurately transcribe Rinpoche’s speech, leveraging a diverse collection of audio and video sources, some with existing transcriptions and others without. Below, I outline the workflow we follow, from data collection to model evaluation.


Figure: End-to-end workflow for building a speech-to-text model for Garchen Rinpoche.

Phase 1: Cataloging Audio and Video Sources

The first step is to gather all available audio and video recordings of Garchen Rinpoche. These sources may or may not already have transcriptions. We maintain a centralized catalog—typically a Google Sheet—where we log each source along with its metadata, such as audio length, speaker name, and transcription availability.

  • Manual Entry: For unique or rare recordings, we add links and metadata by hand.

  • Automated Collection: For platforms or accounts with many relevant videos, we use scripts to programmatically extract links and metadata.

This phase is all about building a comprehensive, well-organized repository of raw data. Example catalog here.
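For the automated-collection step above, a minimal sketch of such a script is shown below, assuming yt-dlp is used to list a channel's videos without downloading them; the channel URL, CSV columns, and speaker value are illustrative placeholders rather than our actual configuration.

```python
# Sketch: collect video links and metadata for the catalog with yt-dlp.
# The channel URL and catalog columns below are placeholders.
import csv
from yt_dlp import YoutubeDL

CHANNEL_URL = "https://www.youtube.com/@example-channel/videos"  # placeholder

ydl_opts = {
    "extract_flat": True,  # list entries without downloading any media
    "quiet": True,
}

with YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(CHANNEL_URL, download=False)

with open("catalog.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url", "duration_sec", "speaker", "has_transcription"])
    for entry in info.get("entries", []):
        writer.writerow([
            entry.get("title"),
            entry.get("url"),
            entry.get("duration"),  # may be missing in flat extraction
            "Garchen Rinpoche",     # corrected by hand during manual review
            "",                     # transcription availability is filled in manually
        ])
```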


Phase 2: Filtering and Splitting Audio

Once the sources are cataloged, we process the audio to prepare it for model training:

  • Audio Splitting: Using tools like pyannote.audio, we segment the recordings into smaller clips, each between 3 and 12 seconds long. This ensures that each segment is manageable and contains clear speech, minimizing silence and noise.

  • Quality Control: We fine-tune the splitting parameters to maximize usable audio while filtering out segments with excessive noise or silence.

  • Standardization: All audio files are converted to WAV at a 16kHz sampling rate (see the sketch after this list).

  • Storage: The processed audio segments are uploaded to an AWS S3 bucket (s3://monlam.ai.stt/wav16k).

  • Preliminary Transcription: We run these segments through a general-purpose STT model to generate initial transcriptions and metadata (segment ID, length, etc.), storing the results in a CSV file.

  • Transcription Transfer (if available): When a split audio segment originates from a source that already has a human-created transcription, we use our fast-antx library to align and transfer the high-quality transcription text to the corresponding audio segments. This process matches the reference transcription to segments whose model-generated inference closely resembles the original text, ensuring that the resulting segment transcriptions are as accurate and grammatically correct as possible. Leveraging human-verified transcriptions in this way improves the overall quality of our training data, as these texts are generally superior to automatic model inferences.
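A simplified sketch of the splitting and standardization steps is shown below. It combines pyannote.audio's voice-activity-detection pipeline with pydub; the checkpoint name, Hugging Face token, and length thresholds are placeholders, and our production parameters differ. Uploading the resulting files to S3 and generating preliminary transcriptions happen in separate steps.

```python
# Sketch: split a long recording into 3–12 s speech segments and save them as
# 16 kHz mono WAV files. Pipeline name, token, and thresholds are placeholders.
from pathlib import Path

from pydub import AudioSegment
from pyannote.audio import Pipeline

MIN_SEC, MAX_SEC = 3.0, 12.0

vad = Pipeline.from_pretrained(
    "pyannote/voice-activity-detection",
    use_auth_token="hf_xxx",  # placeholder Hugging Face token
)

def split_recording(src_path: str, out_dir: str) -> None:
    audio = AudioSegment.from_file(src_path).set_frame_rate(16000).set_channels(1)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Merged speech regions detected by the VAD pipeline.
    for i, region in enumerate(vad(src_path).get_timeline().support()):
        start, end = region.start, region.end
        if end - start < MIN_SEC:
            continue  # too short: likely noise or an isolated syllable
        # Break regions longer than MAX_SEC into fixed-size chunks.
        t = start
        while t < end:
            chunk_end = min(t + MAX_SEC, end)
            if chunk_end - t >= MIN_SEC:
                clip = audio[int(t * 1000):int(chunk_end * 1000)]
                name = f"{Path(src_path).stem}_{i:04d}_{int(t):06d}.wav"
                clip.export(str(out / name), format="wav")
            t = chunk_end
```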


Phase 3: Transcription and Review

The segmented audio files and their preliminary transcriptions are uploaded into our database (managed via DBeaver for stt.pecha.tools and pecha.tools). This enables collaborative, multi-stage transcription:

  • Transcriber (Group A): Listens to each segment and provides an initial transcription, changing the segment’s state from “transcribing” to “submitted.”

  • Reviewer (Person 2): Reviews the transcription and either marks it as “reviewed” or “rejected.”

  • Group Leader: Performs a final quality check, marking the segment as “finalized” if it meets standards.

Throughout this process, we track statistics such as hours transcribed, group progress, and individual contributions (useful for task management and salary calculations).
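As a rough illustration of how these statistics can be derived from a database export, the pandas sketch below assumes a CSV with state, group_name, transcriber, and audio_len_sec columns; the actual schema differs and is only hinted at here.

```python
# Sketch: progress statistics from an exported segments table.
# Column names (state, group_name, transcriber, audio_len_sec) are assumptions.
import pandas as pd

df = pd.read_csv("segments_export.csv")
finalized = df[df["state"] == "finalized"]

hours_total = finalized["audio_len_sec"].sum() / 3600
hours_by_group = finalized.groupby("group_name")["audio_len_sec"].sum() / 3600
segments_by_person = finalized.groupby("transcriber").size()

print(f"Finalized audio: {hours_total:.1f} h")
print(hours_by_group.round(1))
print(segments_by_person.sort_values(ascending=False))
```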


Phase 4: Data Cleaning and Organization

With finalized transcriptions in hand, we move on to data cleaning:

  • Export: We extract finalized segments and transcriptions from the database into a CSV file.

  • Cleaning: The transcriptions are cleaned to remove unwanted symbols and standardized (including converting Wylie transliteration to English).

  • Organization: We analyze the dataset—total hours, segment distribution, etc.—and split it into training, validation, and test sets.

  • Benchmarking: The test set is fixed and never used for training, ensuring fair and consistent evaluation throughout model development.

  • Upload: The prepared datasets are uploaded to Hugging Face for easy access and sharing.
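A condensed sketch of the cleaning, splitting, and upload steps using the Hugging Face datasets library is shown below; the symbol list, column names, split ratios, and repository id are illustrative assumptions, not our exact pipeline.

```python
# Sketch: clean exported transcriptions, build train/validation/test splits,
# and push them to the Hugging Face Hub. Names and ratios are placeholders.
import re

from datasets import DatasetDict, load_dataset

UNWANTED = re.compile(r"[\"'“”()\[\]]")  # example symbol set only

def clean(example):
    text = UNWANTED.sub("", example["transcription"])
    example["transcription"] = re.sub(r"\s+", " ", text).strip()
    return example

ds = load_dataset("csv", data_files="finalized_segments.csv")["train"].map(clean)

# 80/10/10 split; the test set is then frozen as the benchmark.
split = ds.train_test_split(test_size=0.20, seed=42)
heldout = split["test"].train_test_split(test_size=0.50, seed=42)
dataset = DatasetDict(
    {"train": split["train"], "validation": heldout["train"], "test": heldout["test"]}
)

dataset.push_to_hub("example-org/garchen-rinpoche-stt")  # placeholder repo id
```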


Phase 5: Model Training

The final phase is training and evaluating the STT model:

  • Data Preparation: We load the training data from Hugging Face and convert it into the required input format.

  • Model Selection: We experiment with models such as Wav2Vec2 (300M parameters, small) and Whisper (280M parameters, small), with the option to try larger variants for improved performance.

  • Training: The model is trained on a GPU system with carefully chosen hyperparameters.

  • Evaluation: After training, we evaluate the model on the fixed test set, reporting metrics like Character Error Rate (CER) and Word Error Rate (WER).

  • Progress Tracking: By training multiple models, we can estimate how much data is needed to reach a target CER (e.g., 5%), helping us plan future data collection and annotation efforts.
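As an example of the data-preparation step, the sketch below loads the dataset from the Hub and converts it into Whisper-style model inputs. It assumes the dataset has an audio column (stored with the datasets Audio feature) and a transcription column; the dataset and checkpoint names are placeholders.

```python
# Sketch: turn the Hub dataset into model-ready features for a Whisper model.
# Dataset repo id and checkpoint are placeholders.
from datasets import Audio, load_dataset
from transformers import WhisperProcessor

dataset = load_dataset("example-org/garchen-rinpoche-stt")  # placeholder repo id
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Decode audio at the 16 kHz rate the model expects.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

dataset = dataset.map(prepare, remove_columns=dataset["train"].column_names)
```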


Conclusion

This workflow ensures that we build a robust, high-quality speech-to-text model for Garchen Rinpoche, leveraging both human expertise and state-of-the-art machine learning techniques. By systematically cataloging, processing, transcribing, and cleaning our data—and rigorously evaluating our models—we move steadily toward our goal of accurate, automated transcription for this unique and valuable speech corpus.


@Ganga_Gyatso thank you for providing the workflow overview. I have a few questions:

For Phase 2:

  1. My links are to YouTube or mp4 videos. Is it helpful for me to extract audio first?
  2. I have just over 8 hrs of video that is only Garchen Rinpoche to start with. Other materials are interspersed with the interpreter (generally English, but also sometimes Chinese, and occasionally other languages as well) and have a group recitation section at the beginning. How much pre-processing do I need to do on these?

For Phase 3:

  1. Before a human transcriber does an initial transcription, is the audio put through the generalized ASR model for a first round of transcription? Could you add that step as well if it is part of the flow? I’m very curious to know how the model handles Rinpoche’s speech without any fine tuning.

cc @Trinley

Phase 2: Filtering and Splitting Audio

Question 1: Audio Extraction from Videos

Q: My links are to YouTube or mp4 videos. Is it helpful for me to extract audio first?

Yes, it would be extremely helpful if you can extract audio directly from YouTube videos or mp4 files. This will significantly streamline our preprocessing pipeline by eliminating several conversion steps. Once we have the audio files, we can proceed directly to the splitting process and convert them to the required 16kHz sampling rate.
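If you already have mp4 files, something like the ffmpeg call below (wrapped in Python here) produces the 16kHz mono WAV we work with; for YouTube links, yt-dlp with -x --audio-format wav does the download and extraction in one step. Paths are placeholders, and ffmpeg must be installed.

```python
# Sketch: extract 16 kHz mono WAV audio from a video file with ffmpeg.
import subprocess

def extract_audio(video_path: str, wav_path: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_path,  # input video (mp4, mkv, ...)
            "-vn",             # drop the video stream
            "-ac", "1",        # mono
            "-ar", "16000",    # 16 kHz sampling rate
            wav_path,
        ],
        check=True,
    )

extract_audio("teaching.mp4", "teaching.wav")  # placeholder file names
```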

Question 2: Handling Multiple Speakers and Languages

Q: I have just over 8 hrs of video that is only Garchen Rinpoche to start with. Other materials are interspersed with the interpreter (generally English, but also sometimes Chinese, and occasionally other languages as well) and have a group recitation section at the beginning. How much pre-processing do I need to do on these?

For videos with multiple speakers and languages, we can implement a speaker diarization step using the pyannote.audio library. This process will:

  1. Identify Speaker Segments: The library analyzes the audio and identifies different speakers based on voice characteristics.

  2. Extract Garchen Rinpoche’s Speech: We can isolate only the segments where Garchen Rinpoche is speaking.

  3. Filter by Language: Additionally, we can use language identification models to retain only the Tibetan language segments.

This approach allows us to automatically extract just the relevant portions of audio (Garchen Rinpoche speaking Tibetan) before proceeding with the audio splitting task.
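As a rough sketch of what the diarization step could look like with pyannote.audio (the pipeline name and token are placeholders, and deciding which diarized label corresponds to Garchen Rinpoche still takes a quick manual listen):

```python
# Sketch: speaker diarization, collecting each speaker's turns so that only
# Garchen Rinpoche's segments are kept for the splitting step.
from pyannote.audio import Pipeline

diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_xxx",  # placeholder Hugging Face token
)

annotation = diarizer("teaching.wav")  # placeholder file name

# Group (start, end) turns by diarized speaker label (SPEAKER_00, SPEAKER_01, ...).
turns_by_speaker = {}
for turn, _, speaker in annotation.itertracks(yield_label=True):
    turns_by_speaker.setdefault(speaker, []).append((turn.start, turn.end))

# After manually identifying which label is Rinpoche, keep only those turns
# and feed them into the Phase 2 splitting step.
rinpoche_turns = turns_by_speaker.get("SPEAKER_00", [])  # label chosen by a human
```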

Alternative approaches include:

  • Using timestamp annotations (if available) that mark when Garchen Rinpoche is speaking (these would need to be created by human annotators).

  • Combining speaker diarization with language identification for more accurate filtering, or

  • Letting annotators handle the non-Tibetan audio by marking it as non-Tibetan, so that it is excluded from model training.

Phase 3: Transcription and Review

Q: Before a human transcriber does an initial transcription, is the audio put through the generalized ASR model for a first round of transcription? Could you add that step as well if it is part of the flow? I’m very curious to know how the model handles Rinpoche’s speech without any fine tuning.

Yes, we definitely use a generalized ASR model for the first round of transcription before human transcribers begin their work. This streamlines the annotation process in several ways:

Initial Transcription Workflow

  1. Automated Transcription: We take the split audio segments from Phase 2 and run them through a base ASR model (like Whisper or Wav2Vec2) to generate initial transcriptions.

  2. Database Storage: Both the audio segments and their machine-generated transcripts are saved to our database.

  3. Annotator Starting Point: Human transcribers access these pre-filled transcriptions through their annotation tool, allowing them to edit existing text rather than transcribing from scratch. This significantly improves efficiency and reduces the workload.
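A minimal sketch of step 1, assuming the transformers ASR pipeline with a Whisper checkpoint as the base model and a folder of 16kHz WAV segments; the model name, folder, and CSV layout are illustrative.

```python
# Sketch: pre-fill transcriptions by running split segments through a base ASR
# model and writing a CSV that can then be loaded into the database.
import csv
from pathlib import Path

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # placeholder checkpoint

segment_dir = Path("wav16k_segments")  # placeholder folder of split segments
with open("preliminary_transcriptions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["segment_id", "file_name", "inference_transcript"])
    for wav in sorted(segment_dir.glob("*.wav")):
        text = asr(str(wav))["text"]
        writer.writerow([wav.stem, wav.name, text])
```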

Evaluating Base Model Performance

Regarding how the base model handles Rinpoche’s speech without fine-tuning:

  1. Benchmark Dataset: We create a benchmark dataset consisting of representative audio samples with human-verified transcripts.

  2. Base Model Evaluation: We run inference on this benchmark using our base ASR model and compare the results against human transcripts.

  3. Performance Metrics: We calculate Character Error Rate (CER) and Word Error Rate (WER) to quantify the base model’s accuracy on Rinpoche’s speech.

  4. Fine-tuning Comparison: After training our specialized model, we re-evaluate on the same benchmark to measure improvement in CER and WER compared to the base model.
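A minimal sketch of how steps 2–4 are scored, using the evaluate library; the reference and prediction lists are placeholders for the benchmark transcripts and the corresponding model outputs.

```python
# Sketch: compute CER and WER on the fixed benchmark for any model's outputs.
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")

references = ["..."]   # human-verified benchmark transcripts (placeholder)
predictions = ["..."]  # base or fine-tuned model inferences (placeholder)

cer = cer_metric.compute(predictions=predictions, references=references)
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.3f}  WER: {wer:.3f}")
```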

This dual approach allows us to both streamline the transcription process and systematically track how much our specialized model improves over general-purpose ASR models when handling Garchen Rinpoche’s unique speech patterns.