PRD/DRD: [Product or Dataset Name]
| Owning Group | [Name of WG or SIG] |
| --- | --- |
| Status | [Draft \| In Review \| Approved] |
| GitHub Project | [Link to GitHub Project Board] |
| Last Updated | [YYYY-MM-DD] |
1. Overview
This project aims to identify the most effective way to evaluate Buddhist translations and to automate that evaluation. Automated evaluation enables faster, more reliable feedback loops for training AI models and for comparing existing translation solutions, helping researchers, developers, and translators working on Buddhist texts improve model quality efficiently and at scale.
2. Strategy & Research
Links to foundational research and planning documents that inform this project.
Previous work
In our experiments, we first evaluated translations using GEMBA-MQM and then with MQM-APE. We found that MQM-APE produced better results thanks to its additional post-editing and verification steps, which filter out non-impactful errors and refine the error spans (a sketch of this pipeline appears after the list below). However, we noticed several issues during evaluation:
- Self-bias: When the same LLM is used as both the evaluator and the translation generator, it tends to prefer its own outputs, exhibiting a notable self-preference bias (a way to quantify this is sketched below as well).
- Mismatch with human-like error distribution: Consistent with what the MQM-APE paper observed, the distribution of predicted errors in our results did not align well with what we would expect from human annotations.
- Literal-over-semantic preference: When translations were enriched with explanatory commentary, making them more semantically faithful to the original Tibetan, the LLM evaluator often preferred more literal, word-by-word translations. It did not credit semantic accuracy or meaningful additions that improve alignment with the source text.
However, we were not able to verify these findings against human evaluations, so the conclusions remain indicative rather than confirmed.
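To make the MQM-APE pipeline concrete, here is a minimal sketch of its three stages: error annotation, automatic post-editing, and pairwise verification. Everything below is illustrative rather than our actual implementation: `call_llm` is a hypothetical stand-in for a real chat-completion client, and the one-line prompts abbreviate the full GEMBA-MQM and MQM-APE prompts published in the respective papers.

```python
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}

@dataclass
class MqmError:
    span: str       # offending substring in the translation
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # "minor" | "major" | "critical"

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in an actual chat-completion client."""
    raise NotImplementedError

def annotate(source: str, translation: str) -> list[MqmError]:
    """Step 1 (GEMBA-MQM style): ask the evaluator LLM for MQM error spans."""
    raw = call_llm(
        "List MQM errors, one per line, as span|category|severity.\n"
        f"SOURCE: {source}\nTRANSLATION: {translation}"
    )
    errors = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[2] in SEVERITY_WEIGHTS:
            errors.append(MqmError(*parts))
    return errors

def post_edit(translation: str, err: MqmError) -> str:
    """Step 2 (automatic post-editing): fix exactly one annotated error."""
    return call_llm(
        f"Fix only the {err.severity} {err.category} error in the span "
        f"'{err.span}' and return the corrected translation:\n{translation}"
    )

def improved(source: str, before: str, after: str) -> bool:
    """Step 3 (verification): pairwise quality check; errors whose fix does
    not win the comparison are treated as non-impactful and discarded."""
    verdict = call_llm(
        f"SOURCE: {source}\nA: {before}\nB: {after}\nWhich is better, A or B?"
    )
    return verdict.strip().upper().startswith("B")

def mqm_ape_score(source: str, translation: str) -> float:
    """Sum severity penalties over verified errors only."""
    kept = [
        e for e in annotate(source, translation)
        if improved(source, translation, post_edit(translation, e))
    ]
    return float(sum(SEVERITY_WEIGHTS[e.severity] for e in kept))
```

The verification step is what filters non-impactful errors: if fixing an annotated error does not yield a preferred translation, the error is dropped before scoring.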
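One way to put a number on the self-preference bias noted above is to measure, for each judge model, how often it prefers translations from its own model over comparable outputs from other systems; an unbiased judge should sit near a 50% win rate against systems of similar quality. This is a hypothetical sketch: `judge_prefers` and the data layout are placeholders, not part of this project's codebase.

```python
from collections import defaultdict

def judge_prefers(judge: str, source: str, cand_a: str, cand_b: str) -> str:
    """Hypothetical: ask model `judge` which candidate is better ("A" or "B")."""
    raise NotImplementedError

def self_preference_rates(
    sources: list[str],
    outputs_by_model: dict[str, list[str]],
    judges: list[str],
) -> dict[str, float]:
    """For each judge that also appears in outputs_by_model, compute how
    often it prefers its own translations in pairwise comparisons."""
    wins: dict[str, int] = defaultdict(int)
    trials: dict[str, int] = defaultdict(int)
    for judge in judges:
        for rival, rival_outputs in outputs_by_model.items():
            if rival == judge:
                continue
            for src, own, theirs in zip(
                sources, outputs_by_model[judge], rival_outputs
            ):
                # In practice, randomize the A/B order to control for
                # position bias; omitted here for brevity.
                if judge_prefers(judge, src, own, theirs) == "A":
                    wins[judge] += 1
                trials[judge] += 1
    return {j: wins[j] / trials[j] for j in judges if trials[j]}
```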
3. Goals & Success Metrics
What are the primary goals? How will we measure success?
4. Timeline & Quarterly Milestones
A high-level schedule for the project, broken down by quarter. This should align with the main Project Roadmap.
- Q3 2025:
  - Milestone 1: [e.g., Initial dataset collected and cleaned.]
  - Milestone 2: [e.g., Schema finalized and validated.]
- Q4 2025:
  - Milestone 3: [e.g., Alpha version of API released to testers.]
- Target Launch: Q1 2026
5. Scope & Features / Data Schema
What is included? What is not included?
6. Dependencies
What other groups, projects, or resources does this work depend on?
- Pecha Server and API WG: Requires a new API endpoint for search.
- Garchen Institute: Access to the raw audio files.
7. Acceptance Criteria
How will we know when this project is "done"?