Attendees:
- @DavidYesheNyima - Champion
- Mr. Jake Moore
- @Ganga_Gyatso - Developer
- @Lhakpa_Wangyal - Coordinator
Date: September 09, 2025
Agenda:
- Transcript Progress
- Review Process Improvements
- Whisper Model & Tokeniser Development
- Next Steps and Future Actions
Key Discussion Points
- The team discussed the completion of review work for the benchmarking transcripts. Review of the other audio files is in progress.
- Because the Pecha tool’s speech-to-text view does not show work progress in enough detail, the team discussed the need for a dashboard, or simply a spreadsheet, showing the status of each file (transcribed, reviewed, trashed) and the percentage completed. Mr. David Newman strongly emphasised the need for detailed reporting that shows the percentage of transcripts finished and reviewed. Mr. Ganga Gyatso demonstrated a weekly visual progress report, a bar chart generated by a GitHub-based cron job. He was asked to use distinct colours, avoiding red and green, and to automate daily updates via e-mail or Discord notifications. It was also suggested to provide an easy download link for the SRT files.
- The team discussed organising the reviewed SRT files into a dedicated Google Drive folder for easy access. Mr. Ganga Gyatso shared the folder he had already prepared: [Garchen SRT files]
- Mr. Ganga Gyatso gave an overview of the Whisper model and tokeniser development for the Tibetan language. He described the main challenge with the first version of the custom Whisper tokeniser, which was created by adding Tibetan tokens to the Whisper model: the tokens exceed the maximum number of tokens allowed. He added that he is training a fully customised tokeniser to further improve tokenisation efficiency.
- Mr. Jake Moore requested Mr. Ganga Gyatso to compare the following three approaches and evaluate them using both Character Error Rate (CER) and Word Error Rate (WER):
  - Whisper with default tokeniser
  - Whisper with added Tibetan tokens
  - Whisper with fully customised tokeniser
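
For reference, the per-file status report discussed above (transcribed, reviewed, trashed, percentage completion) could be computed along these lines. This is only a minimal sketch: the record layout, the `progress_summary` name, and the file names are hypothetical, and it assumes that reviewed files also count as transcribed and that trashed files are excluded from the percentage base. It does not reflect how the Pecha tool or the existing cron job stores its data.

```python
from collections import Counter


def progress_summary(records):
    """Summarise per-file statuses into counts and completion percentages.

    Assumptions (not from the meeting): each record is a dict with a
    "status" key; a reviewed file also counts as transcribed; trashed
    files are left out of the percentage base.
    """
    counts = Counter(r["status"] for r in records)
    total = len(records)
    active = total - counts["trashed"]

    def pct(n):
        # Avoid division by zero when every file has been trashed.
        return round(100 * n / active, 1) if active else 0.0

    return {
        "total": total,
        "trashed": counts["trashed"],
        "transcribed_pct": pct(counts["transcribed"] + counts["reviewed"]),
        "reviewed_pct": pct(counts["reviewed"]),
    }


# Hypothetical placeholder data, for illustration only.
example_files = [
    {"file": "audio_001", "status": "reviewed"},
    {"file": "audio_002", "status": "transcribed"},
    {"file": "audio_003", "status": "pending"},
    {"file": "audio_004", "status": "trashed"},
]

if __name__ == "__main__":
    print(progress_summary(example_files))
```

The resulting dictionary could feed a spreadsheet row or the body of a daily e-mail or Discord notification.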
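
To illustrate the requested CER/WER comparison, a minimal sketch is given below. Both metrics are the edit distance between the reference and the model output, divided by the reference length, measured in characters for CER and in words for WER. The function names are hypothetical and the code does not represent the team's evaluation script. Written Tibetan does not separate words with spaces, so the `split_on_tsheg` helper is an assumption: it splits on the tsheg (U+0F0B), the shad (U+0F0D), and whitespace, which gives a syllable-level rate rather than a true word-level one.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def split_on_tsheg(text):
    """Assumed segmentation: split on tsheg, shad, and whitespace."""
    return [s for s in re.split("[\u0f0b\u0f0d\\s]+", text) if s]


def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis, split=split_on_tsheg):
    """Word Error Rate over units produced by `split`."""
    ref_units, hyp_units = split(reference), split(hypothesis)
    return edit_distance(ref_units, hyp_units) / max(len(ref_units), 1)


def compare_systems(reference, outputs):
    """Return {system name: (CER, WER)} for each system's transcript."""
    return {name: (cer(reference, hyp), wer(reference, hyp))
            for name, hyp in outputs.items()}
```

Calling `compare_systems` with one reference transcript and the outputs of the three tokeniser variants would give a CER/WER pair per variant. In practice the scores would be aggregated over the whole benchmarking set rather than a single file.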
Action Items
* @Ganga_Gyatso [Add more SRT files to the Google Drive folder] - Due: [24/09/2025]
* @Ganga_Gyatso [Prepare and share documentation of the tokeniser work] - Due: [ASAP]
* @Ganga_Gyatso [Continue testing the custom tokeniser and share documentation and results] - Due: [ASAP]
* @Lhakpa_Wangyal [Share weekly work progress reports with David for easier client reporting] - Due: [every Friday]
Decisions Made
- Weekly work progress reports will be shared with David for easier client reporting.
- Documentation of tokeniser work will be prepared and shared.