šŸ•‰ļø Proposal: MT Evaluation SIG



Full Proposal Details

1. Problem Statement / Motivation:
Explain in detail why this SIG is needed: what gap does it fill, and what opportunity does it pursue?

2. Scope:

  • In Scope: [e.g., Evaluating existing Tibetan MT models, Defining a new evaluation metric]
  • Out of Scope: [e.g., Training a new production-level MT model (this would be a WG’s job)]

3. Potential Deliverables:
A list of tangible things the SIG might produce.

  • [e.g., A report comparing 3 different MT services.]
  • [e.g., A proof-of-concept for a new evaluation tool.]
  • [e.g., A DRD for a new evaluation dataset.]

1. Goals:

  • Assess which metrics are most reliable
    • Create gold-standard data (pricing aligned with ProZ rates)
    • Build an annotation tool
  • Evaluate what humans consider a good translation
    • …
  • Train a reward model
  • Produce a pricing estimate for a 3k-sample benchmark
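
To make the "assess which metrics are most reliable" goal concrete, the standard approach is to measure how well each automatic metric's scores correlate with human ratings on the same segments. The sketch below is illustrative only: the scores and ratings are made-up data, and the function names are hypothetical, not part of any existing tool.

```python
# Minimal sketch of metric-vs-human correlation analysis.
# All scores below are fabricated illustrative data, not real evaluations.
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical per-segment scores: one automatic metric vs. 1-5 human ratings.
metric_scores = [0.61, 0.75, 0.42, 0.88, 0.55, 0.70]
human_ratings = [3, 4, 2, 5, 3, 4]

print(f"Pearson r:   {pearson(metric_scores, human_ratings):.3f}")
print(f"Kendall tau: {kendall_tau(metric_scores, human_ratings):.3f}")
```

Running this comparison for each candidate metric over the planned 3k human-rated samples would give a direct, quantitative basis for the reliability recommendation.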

2. Interested Members:

  • 84000 (translating the Tengyur into English)
    • Disambiguation of source texts in Tibetan
    • Human evaluation dataset of 3k random translation samples
    • Statistical analysis comparing evaluation metrics against human judgments
    • Recommendation on the best approach
  • KVP