Evaluation of Machine Translation
Proposal: MT Evaluation Special Interest Group
- Purpose: A brief paragraph explaining why this SIG should exist and what topic it will focus on.
- Champion(s): [Name/Username of initial lead(s)]
- Communication Channels:
- Chat: # MT Evaluation SIG
- Calendar: Google Calendar
- Important Links:
- PRD: MT Evaluation Pipeline
- GitHub Project Board
Full Proposal Details
1. Problem Statement / Motivation:
A more detailed explanation of why this SIG is needed. What gap does it fill? What opportunity does it pursue?
2. Scope:
- In Scope: [e.g., Evaluating existing Tibetan MT models, Defining a new evaluation metric]
- Out of Scope: [e.g., Training a new production-level MT model (this would be a WG's job)]
3. Potential Deliverables:
A list of tangible things the SIG might produce.
- [e.g., A report comparing 3 different MT services.]
- [e.g., A proof-of-concept for a new evaluation tool.]
- [e.g., A DRD for a new evaluation dataset.]
4. Goals:
- Assess which evaluation metrics are most reliable (see the sketch after this list)
- Create gold-standard data (pricing aligned with ProZ rates)
- Build an annotation tool
- Evaluate what humans consider a good translation
- …
- Train a reward model
- Produce a pricing estimate for the 3k benchmark
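To make the metric-reliability assessment concrete, below is a minimal sketch of scoring translation hypotheses with two common automatic metrics, BLEU and chrF, via the sacrebleu library. The choice of sacrebleu and the sentence pairs are assumptions for illustration; a real run would iterate over the 3k benchmark samples.

```python
# A minimal sketch: segment-level BLEU and chrF scores via sacrebleu.
# The (hypothesis, reference) pairs below are hypothetical placeholders
# for the 3k benchmark samples described in the goals above.
import sacrebleu

pairs = [
    ("the teacher explained the text", "the teacher explained the scripture"),
    ("he went to the monastery", "he traveled to the monastery"),
]

for hyp, ref in pairs:
    bleu = sacrebleu.sentence_bleu(hyp, [ref]).score  # 0-100 scale
    chrf = sacrebleu.sentence_chrf(hyp, [ref]).score  # 0-100 scale
    print(f"BLEU={bleu:5.1f}  chrF={chrf:5.1f}  | {hyp}")
```

Running each candidate metric over the same benchmark this way produces the per-segment scores needed for the reliability comparison against human judgments.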
5. Interested Members:
- 84000 (translating the Tengyur into English); interested in:
  - Disambiguation of source texts in Tibetan
  - Human evaluation dataset of 3k random translation samples
  - Statistical analysis comparing evaluation metrics with human judgments (see the sketch below)
  - Recommendation on the best approach
- KVP
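As a sketch of the statistical analysis mentioned above, the snippet below computes segment-level Pearson and Spearman correlations between human scores and automatic metric scores using scipy. The arrays, score scales, and the use of scipy are assumptions; the real analysis would use the 3k-sample annotations.

```python
# A minimal sketch of comparing an automatic metric against human judgments.
# Both arrays are hypothetical stand-ins for the 3k-sample dataset:
# one human score and one metric score per translation segment.
from scipy.stats import pearsonr, spearmanr

human = [4.5, 2.0, 3.5, 5.0, 1.5]        # e.g., direct-assessment scores, 1-5
metric = [78.2, 41.0, 60.5, 88.9, 30.1]  # e.g., chrF scores for the same segments

r, r_p = pearsonr(human, metric)
rho, rho_p = spearmanr(human, metric)
print(f"Pearson r     = {r:.3f} (p={r_p:.3g})")
print(f"Spearman rho  = {rho:.3f} (p={rho_p:.3g})")
# The metric whose scores correlate most strongly with human judgments
# would be the basis for the SIG's recommendation.
```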