Transcription and SRT File Review Report
This report addresses the concerns raised about transcription quality and about technical problems in the generated SRT files.
Summary of Concerns
- Recent SRT Files: A request to upload recently edited transcripts from the past couple of weeks.
- SRT File Technical Issues: Specific problems found in an SRT file that affect its usability:
  - Missing content on certain lines (e.g., lines 58 and 318).
  - Overlapping timecodes (e.g., lines 100-109).
- Transcription Quality: Issues identified during a review by Khenpo, including:
  - Redundant words.
  - Omitted or incorrect words.
Detailed Explanations and Solutions
1. Recent SRT Files and VAD Process
Concern: A request for recently edited transcripts.
Status: All SRT files have been generated and are available in a shared Google Drive folder: SRT File Archive.
Explanation: The drive contains two separate sets of SRT files (one for audio files 1-33 and another for 1-133). The two sets exist because the Voice Activity Detection (VAD) settings changed between processing runs:
- Initial VAD Settings: The first pass used strict parameters (e.g., min 2s, max 15s), which resulted in some parts of the audio being unintentionally discarded.
- Improved VAD Settings: By adjusting the hyperparameters (e.g., min 0.5s, max 30s), we were able to process the audio again and retain significantly more of the original content.
Keeping the outputs in separate folders makes it clear which settings produced each file and prevents data from different processing stages from getting mixed.
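The effect of the two settings can be illustrated with a minimal sketch. It assumes that the min and max values are bounds on segment duration and that segments outside the bounds are simply dropped; the actual pipeline may merge or split such segments instead. The function names and the sample segments are hypothetical and are not taken from the project code or audio.

```python
from typing import List, Tuple

# A detected speech segment as (start_seconds, end_seconds).
Segment = Tuple[float, float]


def filter_by_duration(segments: List[Segment], min_s: float, max_s: float) -> List[Segment]:
    """Keep only segments whose duration lies within [min_s, max_s].

    Assumption: out-of-range segments are dropped outright. The real
    pipeline may merge short segments or split long ones instead.
    """
    return [(start, end) for start, end in segments if min_s <= end - start <= max_s]


def total_seconds(segments: List[Segment]) -> float:
    """Total duration of speech covered by the given segments."""
    return sum(end - start for start, end in segments)


# Made-up segments for illustration only.
detected = [(0.0, 1.2), (1.5, 9.5), (10.0, 30.0), (31.0, 31.8)]

strict = filter_by_duration(detected, min_s=2.0, max_s=15.0)
relaxed = filter_by_duration(detected, min_s=0.5, max_s=30.0)

print(len(strict), "segments kept with strict bounds")
print(len(relaxed), "segments kept with relaxed bounds")
```

With these sample values the strict bounds keep one of the four segments, while the relaxed bounds keep all four, which is the kind of content loss described above.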
2. SRT Technical Issues: Gaps and Overlaps
Concern: Extra spacing between transcript lines and overlapping timecodes.
Explanation:
- Extra Gaps: The larger spaces between some transcript entries were not due to missing audio segments. They occurred because the original transcript text contained newline characters. When the SRT file was generated, these newlines created additional blank lines. This has been resolved by updating the generation script to remove leading/trailing whitespace from each transcript entry, ensuring uniform spacing.
- Overlapping Timecodes: This issue was caused by combining the outputs of two different VAD models applied to the same audio file. The problem has been solved by storing the output from each VAD model separately. This separation prevents future overlaps and also preserves all segmented audio, which is valuable for training future models.
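Both fixes can be sketched in a few lines. This is an illustration under stated assumptions, not the project's generation script: the cue representation and the names `clean_text`, `build_srt`, and `find_overlaps` are hypothetical. `clean_text` also drops blank lines inside an entry, which goes one step beyond the leading/trailing trim described above, because a blank line inside a cue would end the cue early in an SRT file.

```python
from typing import List, Tuple

# One subtitle cue as (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def clean_text(text: str) -> str:
    """Trim each line of an entry and drop empty lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def build_srt(cues: List[Cue]) -> str:
    """Build SRT content with exactly one blank line between cues."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{number}\n{timing}\n{clean_text(text)}")
    return "\n\n".join(blocks) + "\n"


def find_overlaps(cues: List[Cue]) -> List[int]:
    """Return 1-based numbers of cues that start before the previous cue ends.

    Assumes the cues are already in file order.
    """
    return [
        index + 1
        for index in range(1, len(cues))
        if cues[index][0] < cues[index - 1][1]
    ]


# Made-up cues for illustration only.
cues = [(0.0, 2.5, "\nfirst entry\n"), (2.0, 4.0, "  second entry "), (4.0, 5.0, "third entry")]
print(build_srt(cues))
print("Overlapping cues:", find_overlaps(cues))
```

Running a check like `find_overlaps` on each VAD model's output separately, before any files are shared, would flag cases such as the one reported for lines 100-109.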
3. Transcription Quality and Khenpo’s Corrections
Concern: How to handle Khenpo’s corrections, which are marked with yellow and red labels.
Explanation:
- Yellow Labels (Omittable Words): These labels mark words or phrases that are part of spoken Tibetan but are often omitted in formal written Tibetan. The current transcription process intentionally captures these spoken nuances. The goal is to create a dataset for training a separate model that converts spoken-style Tibetan into a more formal, written style. These labels are therefore not treated as errors but as data for a future enhancement.
- Red Labels (Corrections): These labels indicate genuine errors, such as misspellings or misheard words. These errors result from the current single-stage review process, in which the initial transcriber’s work is final.
Path to Higher Accuracy: To address the ‘red label’ errors and achieve Khenpo-level accuracy, the recommended solution is to re-introduce a two-stage review process. In this workflow, a dedicated reviewer checks and corrects the initial transcriber’s work, which should substantially reduce misspellings and misheard words before the transcripts are finalized.