Garchen Rinpoche STT Concerns and Review - 2025/09/02

Transcription and SRT File Review Report

This document addresses the concerns raised regarding the quality of transcriptions and technical issues with the generated SRT files.

Summary of Concerns

  1. Recent SRT Files: A request to upload recently edited transcripts from the past couple of weeks.
  2. SRT File Technical Issues: Specific problems found in an SRT file that affect its usability, such as:
    • Missing content on certain lines (e.g., lines 58 and 318).
    • Overlapping timecodes (e.g., lines 100-109).
  3. Transcription Quality: Issues identified during a review by Khenpo, including:
    • Redundant words.
    • Omitted or incorrect words.

Detailed Explanations and Solutions

1. Recent SRT Files and VAD Process

Concern: A request for recently edited transcripts.

Status: All SRT files have been generated and are available in a shared Google Drive folder: SRT File Archive.

Explanation: The drive contains two separate sets of SRT files (one covering audio files 1-33 and the other covering audio files 1-133). This is due to an evolution in the Voice Activity Detection (VAD) process:

  • Initial VAD Settings: The first pass used strict parameters (e.g., min 2s, max 15s), which resulted in some parts of the audio being unintentionally discarded.
  • Improved VAD Settings: By adjusting the hyperparameters (e.g., min 0.5s, max 30s), we were able to process the audio again and retain significantly more of the original content.

By keeping the outputs in separate folders, we ensure clarity and prevent data from different processing stages from getting mixed.
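
For illustration, the sketch below shows how the two parameter sets could map onto silero-vad's get_speech_timestamps call. The keyword-argument names follow the silero-vad Python package; the exact mapping of the min/max values and the filename are assumptions made for the example.

```python
from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

# The two parameter sets described above, expressed as get_speech_timestamps
# keyword arguments (values as stated in the text; the mapping is an assumption).
INITIAL_VAD = {"min_speech_duration_ms": 2000, "max_speech_duration_s": 15}  # strict first pass
IMPROVED_VAD = {"min_speech_duration_ms": 500, "max_speech_duration_s": 30}  # retains more audio

def segment_audio(path: str, settings: dict):
    """Run one VAD pass and return speech segments with start/end in seconds."""
    model = load_silero_vad()
    wav = read_audio(path, sampling_rate=16000)
    return get_speech_timestamps(
        wav, model, sampling_rate=16000, return_seconds=True, **settings
    )

# Outputs from each pass are written to separate folders so the two runs never mix.
segments = segment_audio("STT_GR_0001.wav", IMPROVED_VAD)  # hypothetical filename
```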

2. SRT Technical Issues: Gaps and Overlaps

Concern: Extra spacing between transcript lines and overlapping timecodes.

Explanation:

  • Extra Gaps: The larger spaces between some transcript entries were not due to missing audio segments. They occurred because the original transcript text contained newline characters; when the SRT file was generated, these newlines created additional blank lines. This has been resolved by updating the generation script to remove leading/trailing whitespace from each transcript entry, ensuring uniform spacing (a sketch of the updated step follows this list).
  • Overlapping Timecodes: This issue was caused by combining the outputs of two different VAD models applied to the same audio file. The problem has been solved by storing the output from each VAD model separately. This separation prevents future overlaps and also preserves all segmented audio, which is valuable for training future models.
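
A minimal sketch of the whitespace fix described in the first bullet, assuming a simple in-house generation script (the segment structure and function names are illustrative):

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

def write_srt(segments, path):
    """segments: iterable of dicts with 'start'/'end' (seconds) and 'text'."""
    with open(path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            # Strip leading/trailing whitespace (including stray newlines) from the
            # transcript text, per the fix above, so each entry has uniform spacing.
            text = seg["text"].strip()
            f.write(f"{i}\n")
            f.write(f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n")
            f.write(f"{text}\n\n")
```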

3. Transcription Quality and Khenpo’s Corrections

Concern: Addressing the corrections made by Khenpo, which are categorized with yellow and red labels.

Explanation:

  • Yellow Labels (Omitable Words): These labels mark words or phrases that are part of spoken Tibetan but are often omitted in formal written Tibetan. The current transcription process intentionally captures these spoken nuances. The goal is to create a dataset that can be used to train a separate model specifically for the task of converting spoken-style Tibetan to a more formal, written style. Therefore, these are not considered errors but rather data for a future enhancement.

  • Red Labels (Corrections): These labels indicate genuine errors, such as misspellings or misheard words. These errors are a result of the current single-stage review process where the initial transcriber’s work is final.

Path to Higher Accuracy: To address the ‘red label’ errors and achieve Khenpo-level accuracy, the recommended solution is to re-introduce a two-stage review process. In this workflow, a dedicated reviewer would check and correct the work of the initial transcriber, substantially reducing transcription errors before the transcripts are finalized.

Hugging Face demo

Audio samples for testing

Speech-to-Text (STT) Model Fine-Tuning Q&A

Here are the questions and answers regarding fine-tuning the STT model.


1. For fine-tuning the STT model, is it beneficial to constrain training transcripts to a limited topic?

For instance, the material in the Guide to Death and Dying videos is more abstract and pertains more to the nature of mind than do the teachings of the 37 Bodhisattva Practices.

Answer: This is a good point. While the primary task of an STT model is to accurately recognize phonemes and words regardless of the topic, fine-tuning on domain-specific data (like Buddhist teachings) helps the model learn specialized vocabulary (e.g., “Bodhisattva,” “dharma”). This can improve accuracy for that specific domain.

However, you raise a valid concern: over-specialization could potentially make the model less robust on unfamiliar topics, even with the same speaker. Given that the speaker is consistent, the acoustic properties of the voice are a constant. The main benefit of topic-specific data is in overcoming the vocabulary gap.

For a downstream task like translation, providing the topic as context in a prompt is highly beneficial, as it helps the translation model select the correct terminology and tone.


2. Likewise, is there any benefit to providing translations of the fine-tuning data into other languages for the model to reference?

(I assume no because the STT is different from translation, but I wanted to ask).

Answer: You are correct that for the STT task itself, providing translations is not beneficial. The STT model’s goal is to map audio to text in the same language.

However, for the subsequent task of translating the generated transcript (e.g., an SRT file) into other languages, context is absolutely critical. A translation model will produce a much more accurate and coherent translation if it has access to the full context of the conversation, as found in the complete SRT file, rather than translating isolated sentences.
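
As an illustration only, the sketch below shows one way surrounding SRT entries could be packed into a translation prompt. The prompt wording, the context window of three entries, and the entry structure are all assumptions, not part of the current pipeline.

```python
def build_translation_prompt(entries, index, window=3):
    """Build a prompt that gives the translation model the preceding SRT entries
    as context before asking it to translate the target entry.

    entries: list of dicts, each with a 'text' field (one per SRT entry).
    """
    start = max(0, index - window)
    context = "\n".join(e["text"] for e in entries[start:index])
    target = entries[index]["text"]
    return (
        "Context (preceding subtitle lines):\n"
        f"{context}\n\n"
        "Translate the following line into English, keeping terminology and tone "
        "consistent with the context:\n"
        f"{target}"
    )
```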


3. In STT_GR_0013 some of the passages are pretty long - over 20s, whereas others are around 5s. How do you arrive at the durations of the in/out points?

Answer: The segment durations are determined by an automated process using a Voice Activity Detection (VAD) model, specifically silero-vad. This model detects speech activity and identifies natural pauses or silences in the audio, which are ideal points to split the recording.

The process is configured with specific parameters: a minimum segment duration of 0.5 seconds and a maximum of 30 seconds. If a detected speech segment is longer than 30 seconds, it is automatically split into smaller chunks. This ensures that the training data consists of coherent, sentence-like segments that are of a manageable length for the model.
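
To make the variation concrete, a quick check over the segments returned by the VAD (as in the sketch in section 1, with return_seconds=True) prints the spread of durations; the bounds come from the 0.5 s / 30 s settings above.

```python
# `segments` is the list returned by get_speech_timestamps(..., return_seconds=True).
durations = [seg["end"] - seg["start"] for seg in segments]
print(f"segments: {len(durations)}")
print(f"shortest: {min(durations):.1f}s, longest: {max(durations):.1f}s")
print(f"mean:     {sum(durations) / len(durations):.1f}s")
```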


4. If we were to provide time-stamped transcripts for training but they did not have the same set of in/out points, could they still be used?

E.g., in STT_GR_0013 the first two timestamps are 00:00:05,700 --> 00:00:22,100 and then 00:00:22,800 --> 00:00:40,700; but if we were to provide you with a transcript segmented as, e.g., 00:00:05,700 --> 00:00:12,100 and then 00:00:12,100 --> 00:00:22,100, would that be of use?

Answer: Yes, this data would be very useful, provided one critical condition is met. The timestamps are generated from voice activity in the audio. If you change the segment boundaries (e.g., splitting 00:00:05,700 --> 00:00:22,100 into two smaller segments), you must also split the corresponding transcript text so that it accurately matches each new, smaller audio segment. As long as the audio in each timestamped segment aligns exactly with its transcript, the data is fully valid and effective for training.
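
A minimal sketch of that condition, using the example timestamps from the question; the waveform slicing and the placeholder texts are illustrative assumptions.

```python
from silero_vad import read_audio

SAMPLING_RATE = 16_000  # assumed sampling rate
wav = read_audio("STT_GR_0013.wav", sampling_rate=SAMPLING_RATE)  # hypothetical filename

def slice_segment(wav, start_s: float, end_s: float):
    """Return the audio samples between start_s and end_s (in seconds)."""
    return wav[int(start_s * SAMPLING_RATE):int(end_s * SAMPLING_RATE)]

# The original segment 00:00:05,700 --> 00:00:22,100 split into two pieces, each
# paired with the portion of the transcript that covers exactly that audio
# (the texts below are placeholders).
training_pairs = [
    (slice_segment(wav, 5.7, 12.1), "transcript text spoken between 5.7s and 12.1s"),
    (slice_segment(wav, 12.1, 22.1), "transcript text spoken between 12.1s and 22.1s"),
]
```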


5. Is there any correspondence between the timestamps and the training data, e.g., here: ganga4364/garchen_rinpoche_data_20hours · Datasets at Hugging Face?

Answer: Yes, there is a direct and crucial correspondence. In the dataset, the filenames often include timestamps to ensure each audio segment has a unique identifier. However, the filename itself is just metadata.

The core of the training data is the direct pairing of:

  1. The audio waveform of a segment.

  2. The text transcript that exactly matches that audio waveform.

When the model is fine-tuned, it is fed these pairs. It learns to map the patterns in the audio data to the sequence of characters in the text. The model does not interpret the filename; it works directly with the audio content and its corresponding label (the transcript).
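
For illustration, here is a hedged sketch of loading such pairs with the Hugging Face datasets library. The split name and the column names ("audio", "transcription") are assumptions about the dataset's schema, used only to show the pairing.

```python
from datasets import load_dataset

ds = load_dataset("ganga4364/garchen_rinpoche_data_20hours", split="train")  # split name assumed

example = ds[0]
waveform = example["audio"]["array"]               # decoded audio samples
sampling_rate = example["audio"]["sampling_rate"]  # e.g. 16000
text = example["transcription"]                    # matching transcript (label); column name assumed

# Fine-tuning consumes (waveform, text) pairs; the filename, even if it embeds
# timestamps, is metadata only and is never fed to the model.
```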


6. I am looking to get SRT files of the raw ML transcript without any edits for STT_GR_0003 and STT_GR_0013… Could you provide the raw SRT files for those 2 recordings?

Answer: You are correct that processing a 27-minute audio file requires significant computational resources, typically a GPU, which is why it times out on public demos. As an AI assistant, I cannot run this process for you. The correct approach is what you’ve planned: to run the model on a server with the necessary hardware. I will await the result files you generate.