PRD/DRD: [Product or Dataset Name]
| Owning Group | [Name of WG or SIG] |
| --- | --- |
| Status | [Draft \| In Review \| Approved] |
| GitHub Project | [Link to GitHub Project Board] |
| Last Updated | [YYYY-MM-DD] |
1. Overview
This project aims to identify the most effective way to evaluate Buddhist translations and to automate that evaluation. Automated evaluation enables faster, more reliable feedback loops for training AI models and for comparing existing translation solutions, helping researchers, developers, and translators working on Buddhist texts improve model quality efficiently and at scale.
2. Strategy & Research
Links to foundational research and planning documents that inform this project.
Previous work
In our experiments, we first evaluated translations using GEMBA-MQM and then with MQM-APE. We found that MQM-APE produced better results thanks to its additional post-editing and verification steps, which filter out non-impactful errors and refine the error spans (a sketch of this pipeline appears after the list below). However, we noticed several issues during evaluation:
- Self-bias: When the same LLM is used as both the evaluator and the translation generator, it tends to prefer its own outputs, exhibiting a notable self-preference bias (a way to quantify this is sketched below as well).
- Mismatch with human-like error distribution: Consistent with what the MQM-APE paper observed, the distribution of predicted errors in our results did not align well with what we would expect from human annotations.
- Literal-over-semantic preference: When translations were enriched with explanatory commentary, making them more semantically faithful to the original Tibetan, the LLM evaluator often preferred more literal, word-by-word translations. It did not credit semantic accuracy or meaningful additions that improve alignment with the source text.
However, we were not able to verify these findings against human evaluations, so the conclusions remain indicative rather than confirmed.
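To make the MQM-APE pipeline concrete, here is a minimal sketch of its three stages: error annotation, automatic post-editing, and pairwise verification. Everything below is illustrative rather than our actual implementation: `call_llm` is a hypothetical stand-in for a real chat-completion client, and the one-line prompts abbreviate the full GEMBA-MQM and MQM-APE prompts published in the respective papers.

```python
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"minor": -1, "major": -5, "critical": -10}

@dataclass
class MqmError:
    span: str       # offending substring in the translation
    category: str   # e.g. "accuracy/mistranslation"
    severity: str   # "minor" | "major" | "critical"

def call_llm(prompt: str) -> str:
    """Hypothetical placeholder: swap in an actual chat-completion client."""
    raise NotImplementedError

def annotate(source: str, translation: str) -> list[MqmError]:
    """Step 1 (GEMBA-MQM style): ask the evaluator LLM for MQM error spans."""
    raw = call_llm(
        "List MQM errors, one per line, as span|category|severity.\n"
        f"SOURCE: {source}\nTRANSLATION: {translation}"
    )
    errors = []
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[2] in SEVERITY_WEIGHTS:
            errors.append(MqmError(*parts))
    return errors

def post_edit(translation: str, err: MqmError) -> str:
    """Step 2 (automatic post-editing): fix exactly one annotated error."""
    return call_llm(
        f"Fix only the {err.severity} {err.category} error in the span "
        f"'{err.span}' and return the corrected translation:\n{translation}"
    )

def improved(source: str, before: str, after: str) -> bool:
    """Step 3 (verification): pairwise quality check; errors whose fix does
    not win the comparison are treated as non-impactful and discarded."""
    verdict = call_llm(
        f"SOURCE: {source}\nA: {before}\nB: {after}\nWhich is better, A or B?"
    )
    return verdict.strip().upper().startswith("B")

def mqm_ape_score(source: str, translation: str) -> float:
    """Sum severity penalties over verified errors only."""
    kept = [
        e for e in annotate(source, translation)
        if improved(source, translation, post_edit(translation, e))
    ]
    return float(sum(SEVERITY_WEIGHTS[e.severity] for e in kept))
```

The verification step is what filters non-impactful errors: if fixing an annotated error does not yield a preferred translation, the error is dropped before scoring.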
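One way to put a number on the self-preference bias noted above is to measure, for each judge model, how often it prefers translations from its own model over comparable outputs from other systems; an unbiased judge should sit near a 50% win rate against systems of similar quality. This is a hypothetical sketch: `judge_prefers` and the data layout are placeholders, not part of this project's codebase.

```python
from collections import defaultdict

def judge_prefers(judge: str, source: str, cand_a: str, cand_b: str) -> str:
    """Hypothetical: ask model `judge` which candidate is better ("A" or "B")."""
    raise NotImplementedError

def self_preference_rates(
    sources: list[str],
    outputs_by_model: dict[str, list[str]],
    judges: list[str],
) -> dict[str, float]:
    """For each judge that also appears in outputs_by_model, compute how
    often it prefers its own translations in pairwise comparisons."""
    wins: dict[str, int] = defaultdict(int)
    trials: dict[str, int] = defaultdict(int)
    for judge in judges:
        for rival, rival_outputs in outputs_by_model.items():
            if rival == judge:
                continue
            for src, own, theirs in zip(
                sources, outputs_by_model[judge], rival_outputs
            ):
                # In practice, randomize the A/B order to control for
                # position bias; omitted here for brevity.
                if judge_prefers(judge, src, own, theirs) == "A":
                    wins[judge] += 1
                trials[judge] += 1
    return {j: wins[j] / trials[j] for j in judges if trials[j]}
```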
3. Goals & Success Metrics
What are the primary goals? How will we measure success?
4. Timeline & Quarterly Milestones
A high-level schedule for the project, broken down by quarter. This should align with the main Project Roadmap.
- Q3 2025:
  - Milestone 1: [e.g., Initial dataset collected and cleaned.]
  - Milestone 2: [e.g., Schema finalized and validated.]
- Q4 2025:
  - Milestone 3: [e.g., Alpha version of API released to testers.]
- Target Launch: Q1 2026
5. Scope & Features / Data Schema
What is included? What is not included?
6. Dependencies
What other groups, projects, or resources does this work depend on?
- Pecha Server and API WG: Requires a new API endpoint for search.
- Garchen Institute: Access to the raw audio files.
7. Acceptance Criteria
How will we know when this project is "done"?