OCR Benchmark Report
We built a compact but diverse benchmark to see how well current OCR approaches handle Tibetan pecha in the wild. The goal was simple: take representative page images from different scripts and production eras, run the same OCR pipeline across engines, measure character error rate (CER), and learn where things still break.
Special thanks to Nitartha and Tibschol for their contributions.
For full dataset-by-dataset results and detailed CER breakdowns, see the public spreadsheet: Click to see full results.
What We Tested
Models and Flows
- BDRC – MonlamAI models used in practice
- Google Vision
- Gemini family: evaluated with both zero-shot and few-shot prompting (few-shot was applied only to Gemini models)
- Dharmamitra OCR (based on Gemini 2.5 Flash Lite)
Note: BDRC – MonlamAI models, Google Vision, and Dharmamitra were held constant across runs. Only the Gemini prompts changed between zero-shot and few-shot.
Datasets and Notes
| Dataset ID/Name | Type | Script | Description | Pages |
|---|---|---|---|---|
| MW1KG13607_0534 | Manuscript | Uchen | Phugdrak Kangyur | 20 |
| MW1NLM1660_1CF2D2 | Woodblock | Uchen | Poor-quality blockprint from Mongolia | 10 |
| MW4PD1207_347B2B | Manuscript | Ume | Modern Ume manuscript | 6 |
| Nitartha DTG204 | Woodblock | Uchen | Derge blockprint | 14 |
| Nitartha LRD004 | Woodblock | Uchen | Derge blockprint | 16 |
| Tibschol | Manuscript | Ume | Old Ume manuscripts | 50 |
| W1BL4 | Woodblock | Uchen | Ulaanbaatar | 20 |
| W1NLM1737 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W1NLM3731 | Woodblock | Uchen | Tibetan blockprint | 19 |
| W1NLM3933 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W22704 | Woodblock | Uchen | Narthang Tengyur | 15 |
| W2KG5015 | Woodblock | Uchen | Narthang Tengyur | 15 |
How We Ran It
Pipeline
- Models received a concise OCR instruction prompt asking for clean Tibetan text without formatting.
- Few-shot (Gemini only): used a small number of exemplars from within the same dataset. The script builds a short conversation alternating exemplar page images and their ground-truth transcriptions, followed by the target page; a sketch follows this list.
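To make the pipeline concrete, here is a minimal sketch of how the zero-shot and few-shot requests could be assembled with the google-generativeai Python SDK. The prompt wording, function names, and exemplar handling are illustrative stand-ins, not the actual implementation; the real base prompt is in the Gist and the real scripts are in the GitHub repo linked at the end.

```python
# Minimal sketch of request construction for the Gemini runs, using the
# google-generativeai Python SDK. Names and prompt wording are
# illustrative; see the linked Gist and GitHub repo for the real ones.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Placeholder instruction; the actual base prompt is in the linked Gist.
BASE_PROMPT = (
    "Transcribe the Tibetan text on this page. "
    "Return only the clean Tibetan text, with no formatting or commentary."
)

def ocr_zero_shot(page_path: str) -> str:
    """Zero-shot: the instruction prompt plus the target page image."""
    page = Image.open(page_path)
    return model.generate_content([BASE_PROMPT, page]).text

def ocr_few_shot(page_path: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot: alternate exemplar page images (from the same dataset)
    with their ground-truth transcriptions, then append the target page."""
    contents: list = [BASE_PROMPT]
    for exemplar_path, ground_truth in exemplars:
        contents.append(Image.open(exemplar_path))
        contents.append(ground_truth)
    contents.append(Image.open(page_path))
    return model.generate_content(contents).text
```

Passing exemplars as alternating image/text parts in a single request is one simple way to realize the "short conversation" the scripts build; a multi-turn chat history would be equivalent.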
Metric
- Character Error Rate (CER) on Tibetan text (lower is better); a minimal computation sketch follows this list.
- For all Gemini and Dharmamitra models, each page was processed 10 times; CER is the mean across runs, with an accompanying standard deviation to show stability.
- For both Gemini and Dharmamitra models, the standard deviations were often very high, meaning results could differ dramatically across the 10 runs for the same page.
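A minimal sketch of the CER computation and per-page run aggregation, assuming the standard definition (character-level edit distance divided by reference length, reported as a percentage); the rapidfuzz package here is one common choice for the edit distance, and the function names are illustrative. Under this convention CER can exceed 100 when a model inserts much more text than the reference contains, which is one way to read the largest zero-shot values later in this report.

```python
# Sketch of the CER metric, assuming the standard definition:
# character-level edit distance divided by reference length.
from statistics import mean, stdev

from rapidfuzz.distance import Levenshtein

def cer(prediction: str, reference: str) -> float:
    """Character error rate as a percentage. Because insertions count
    toward the edit distance, CER can exceed 100 when a model emits
    much more text than the reference contains."""
    return 100.0 * Levenshtein.distance(prediction, reference) / len(reference)

def page_cer_stats(predictions: list[str], reference: str) -> tuple[float, float]:
    """Mean and standard deviation of CER across repeated runs of the
    same page (10 runs per page for the Gemini and Dharmamitra models)."""
    scores = [cer(p, reference) for p in predictions]
    return mean(scores), stdev(scores)
```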
Key Results
Across datasets, the lowest mean CERs come from the BDRC – MonlamAI models, Google Vision, and in one case Gemini 2.0 Flash. The Gemini 2.5 models improved with few-shot prompting but still did not surpass BDRC or Google Vision overall. We are also working with Dharmamitra to understand why their reported Gemini 2.5 Flash Lite results are significantly better than ours.
Winners
BDRC – MonlamAI models are the clear winners, with the best results on 11 datasets, followed by Google Vision (3 datasets) and Gemini 2.0 Flash (1 dataset).
Note on Zero-shot vs. Few-shot
Few-shot prompting helped the Gemini 2.5 family substantially relative to its own zero-shot baseline and reduced variance across pages. The 2.0 family tended to get worse with few-shot in this configuration. Even after this improvement, the 2.5 family still trailed the BDRC – MonlamAI and Google Vision baselines in aggregate.
Numbers at a Glance (Weighted by Pages per Dataset)
| Model | Zero-shot CER | Few-shot CER | Change |
|---|---|---|---|
| Gemini 2.5 Flash | 195.8 | 84.2 | -111.6 |
| Gemini 2.5 Flash Lite | 577.2 | 286.7 | -290.5 |
| Gemini 2.0 Flash | 38.9 | 85.8 | +46.9 |
| Gemini 2.0 Flash Lite | 39.2 | 80.3 | +41.1 |
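The weighting here is just a page-count average over the per-dataset means, using the page counts from the dataset table above. A minimal sketch of that aggregation (the function name and the example figures are illustrative, not values from the spreadsheet):

```python
# Page-weighted aggregation: each dataset's mean CER contributes in
# proportion to its page count, so the 50-page Tibschol set counts
# far more than the 6-page Ume manuscript set.
def weighted_cer(per_dataset: list[tuple[float, int]]) -> float:
    """per_dataset holds (mean CER, page count) pairs, one per dataset."""
    total_pages = sum(pages for _, pages in per_dataset)
    return sum(c * pages for c, pages in per_dataset) / total_pages

# Illustrative call (not real results):
# weighted_cer([(38.9, 20), (45.2, 10), (120.0, 6)])
```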
Stability
- Gemini 2.5 Flash: standard deviation dropped from ~49.6 to ~8.8 with few-shot.
- Gemini 2.5 Flash Lite: became more stable with few-shot, but remained more variable than 2.5 Flash.
- High standard deviations in the Gemini and Dharmamitra runs mean that while the averages are useful, individual run results could swing widely.
Dataset Difficulty Notes
Old Ume manuscripts and damaged Mongolian blockprints remain challenging. Clean Derge and Narthang Uchen prints fare better.
Reproduce and Extend
Zero-shot base prompt: Gist
Evaluation scripts, few-shot builders, and run instructions: GitHub