MVP
- Custom arenas looking like challenge cards, so we get 1 arena per task
- Arenas to assess models (a selection of models)
- Arenas to assess prompts or workflows (a selection of prompts frozen once the arena is opened)
- 1 leaderboard for each arena
- Live number of battles shown in leaderboards, or a leaderboard column with the live number of battles per model in an arena
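The arena rules above could be modelled roughly as follows. This is a minimal sketch, not the AI studio data model: the `Arena` class, its field names, and `record_battle` are all hypothetical. It shows the challenger selection being frozen once the arena is opened, and a per-challenger battle count that a leaderboard column could read live.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Arena:
    """Hypothetical arena: one task, one fixed set of challengers."""
    task: str
    challengers: list[str] = field(default_factory=list)
    is_open: bool = False
    battles: Counter = field(default_factory=Counter)

    def add_challenger(self, name: str) -> None:
        # Assumption: the selection is frozen once the arena is opened.
        if self.is_open:
            raise RuntimeError("arena is open; challengers are frozen")
        self.challengers.append(name)

    def open(self) -> None:
        self.is_open = True

    def record_battle(self, left: str, right: str) -> None:
        # Each battle counts once for both participants.
        if not self.is_open:
            raise RuntimeError("arena is not open yet")
        for name in (left, right):
            if name not in self.challengers:
                raise ValueError(f"unknown challenger: {name}")
            self.battles[name] += 1


arena = Arena(task="bo-en translation")
arena.add_challenger("prompt-a")
arena.add_challenger("prompt-b")
arena.open()
arena.record_battle("prompt-a", "prompt-b")
```

After the single battle recorded here, `arena.battles` holds a count of 1 for each of the two challengers, and any further call to `add_challenger` raises an error.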
Problem statement
As a translation project manager working on the multilingual translation of སྤྱོད་འཇུག, རྡོ་རྗེ་གཅོད་པ།, འཕགས་པ་སྡུད་པ།, and ཤེས་རབ་སྙིང་པོ།, plus 3 commentaries each, I need a tool that objectively ranks the performance of different models, translation prompts, and workflows for each language, so that I know which model, prompt, and workflow to use to batch translate these texts in each language on September 22, 2025.
Suggested solution
- 1 anonymous zero-shot battle arena per language (EN, ZH, FR, NP, HI) for the “OP AI Translation Pilot Project”
- 1 anonymous model+prompt battle arena per language (EN, ZH, FR, NP, HI) for the “OP AI Translation Pilot Project”
Suggested deliverables
- Custom arenas in AI studio
- Random text sample suggestions for the “OP AI Translation Pilot Project” automatically filled in the prompt
- Display commentaries together with source text in evaluation page (good to have)
- Leaderboard/translation/ page with 1 leaderboard card for each translation arena
- A column with the # of battles per model, to show whether the sample size of a specific challenger is big enough for the leaderboard to inform projects/investments. The # of battles comes with an icon/message (good sample size, too small, very good, etc.)
- Prompt template with 1/multiple payload input fields
- In the arena input area, hide the prompt and show only the input field(s)
- 1 default/general translation arena?
- 1 combined translation leaderboard?
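Two of the deliverables above lend themselves to a short sketch: the battle-count message and the prompt template with payload fields. Everything named here is hypothetical, and the thresholds (30 and 100 battles) are placeholders, not values agreed for the project. The template part uses Python's standard `string.Template`, with a helper that lists the payload fields so the arena can show only those inputs and keep the prompt text hidden.

```python
from string import Template

# Placeholder thresholds; the real cut-offs still need to be decided.
SMALL, GOOD = 30, 100


def sample_size_label(battles: int) -> str:
    """Map a challenger's battle count to a leaderboard message."""
    if battles < SMALL:
        return "too small"
    if battles < GOOD:
        return "good sample size"
    return "very good"


def payload_fields(template: str) -> list[str]:
    """Return the distinct $placeholders in a prompt template, in order."""
    seen: list[str] = []
    for match in Template.pattern.finditer(template):
        name = match.group("named") or match.group("braced")
        if name and name not in seen:
            seen.append(name)
    return seen


# Hypothetical template with two payload fields.
prompt = "Translate into $language:\n$source_text"
fields = payload_fields(prompt)
filled = Template(prompt).substitute(language="French", source_text="sample verse")
```

Here `fields` lists `language` and `source_text`, which are the only inputs the evaluator would see; `filled` is the full prompt sent to the model.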
Suggested models and workflows
- Claude Sonnet 3.7
- Gemini Pro 2.5
- Gemini Flash 2.5 with thinking
- Gemini Flash 2.5 w/o thinking
- Gemini Flash 2.5 temperature 1
- Gemini Flash 2.5 temperature 0
- Mitra translation
- Mitra translation (research)
- Gemini Flash 2.5 (with thinking? temperature?) + Sanskrit + UCCA + Gloss
- Gemini Flash 2.5 (with thinking? temperature?) + Sanskrit + commentaries
- For non-English languages: above combinations + English
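The workflow combinations in the list above could be expanded per language along these lines. This is only a sketch: the `Workflow` class and `workflows_for` are hypothetical, and it assumes that "+ English" means adding an English translation as extra context alongside the original combinations, which is one possible reading of the last item. The open questions about thinking and temperature are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical description of one challenger in an arena."""
    model: str
    context: tuple[str, ...] = ()  # extra inputs given alongside the source


# The two workflow combinations from the list above.
BASE_WORKFLOWS = [
    Workflow("Gemini Flash 2.5", ("Sanskrit", "UCCA", "Gloss")),
    Workflow("Gemini Flash 2.5", ("Sanskrit", "commentaries")),
]


def workflows_for(language: str) -> list[Workflow]:
    """Assumed reading: non-English arenas also get '+ English' variants."""
    workflows = list(BASE_WORKFLOWS)
    if language != "EN":
        workflows += [
            Workflow(w.model, w.context + ("English",)) for w in BASE_WORKFLOWS
        ]
    return workflows
```

Under this reading, the EN arena gets the two base workflows, and each other arena gets those two plus their "+ English" variants.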
Notes
- Note that apart from འཕགས་པ་སྡུད་པ།, translations of the root texts are widely available on the internet and definitely in the training data of most models. Commentaries might not be translated, and/or their translations might not be widely available on the net. This might affect the evaluation of zero-shot translation.