OCR Benchmark Report

We built a compact but diverse benchmark to see how well current OCR approaches handle Tibetan pecha in the wild. The goal was simple: take representative page images from different scripts and production eras, run the same OCR pipeline across engines, measure character error rate (CER), and learn where things still break.

Special thanks to Nitartha and Tibschol for their contributions.

For full dataset-by-dataset results and detailed CER breakdowns, see the public spreadsheet: Click to see full results.

What We Tested

Models and Flows

  • BDRC – MonlamAI models used in practice

  • Google Vision

  • Gemini family: evaluated both zero-shot and few-shot prompting (few-shot applied only to Gemini models)

  • Dharmamitra OCR (based on Gemini 2.5 Flash Lite)

Note: BDRC – MonlamAI models, Google Vision, and Dharmamitra were held constant across runs. Only the Gemini prompts changed between zero-shot and few-shot.

Datasets and Notes

| Dataset ID/Name | Type | Script | Description | Pages |
| --- | --- | --- | --- | --- |
| MW1KG13607_0534 | Manuscript | Uchen | Phugdrak Kangyur | 20 |
| MW1NLM1660_1CF2D2 | Woodblock | Uchen | Poor-quality blockprint from Mongolia | 10 |
| MW4PD1207_347B2B | Manuscript | Ume | Modern Ume manuscript | 6 |
| Nitartha DTG204 | Woodblock | Uchen | Derge blockprint | 14 |
| Nitartha LRD004 | Woodblock | Uchen | Derge blockprint | 16 |
| Tibschol | Manuscript | Ume | Old Ume manuscripts | 50 |
| W1BL4 | Woodblock | Uchen | Ulaanbaatar | 20 |
| W1NLM1737 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W1NLM3731 | Woodblock | Uchen | Tibetan blockprint | 19 |
| W1NLM3933 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W22704 | Woodblock | Uchen | Narthang Tengyur | 15 |
| W2KG5015 | Woodblock | Uchen | Narthang Tengyur | 15 |

How We Ran It

Pipeline

  • Models received a concise OCR instruction prompt asking for clean Tibetan text without formatting.

  • Few-shot (Gemini only): used a small number of exemplars from within the same dataset. The script builds a short conversation alternating exemplar page images and their ground-truth transcriptions, followed by the target page (a sketch follows this list).
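
The snippet below is a minimal sketch of that few-shot conversation, not the benchmark's actual script: it assumes the google-generativeai Python SDK and Pillow, and the prompt wording, file paths, and model handle are illustrative placeholders (the real builders are in the GitHub repository linked under Reproduce and Extend).

```python
# Minimal sketch: alternate exemplar page image / ground-truth transcription
# pairs from the same dataset, then append the target page.
# Assumes the google-generativeai SDK and Pillow; paths, prompt wording, and
# model name below are placeholders, not the benchmark's actual configuration.
from pathlib import Path

import google.generativeai as genai
from PIL import Image

PROMPT = "Transcribe the Tibetan text on this page. Return clean Tibetan text only, with no formatting."

def build_few_shot_contents(exemplars: list[tuple[Path, Path]], target_page: Path) -> list:
    """exemplars: (page image, ground-truth text file) pairs from the same dataset."""
    contents = [PROMPT]
    for image_path, gt_path in exemplars:
        contents.append(Image.open(image_path))                # exemplar page image
        contents.append(gt_path.read_text(encoding="utf-8"))   # its transcription
    contents.append(Image.open(target_page))                   # the page to OCR
    return contents

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-flash")
exemplars = [(Path("exemplar_1.png"), Path("exemplar_1.txt")),
             (Path("exemplar_2.png"), Path("exemplar_2.txt"))]
response = model.generate_content(build_few_shot_contents(exemplars, Path("target_page.png")))
print(response.text)
```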

Metric

  • Character Error Rate (CER) on Tibetan text (lower is better); a computation sketch follows this list.

  • For all Gemini and Dharmamitra models: each page was processed 10 times; CER is the mean across runs, with accompanying standard deviation to show stability.

  • For both Gemini and Dharmamitra models, the standard deviations were often very high, meaning results could differ dramatically across the 10 runs for the same page.
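
For reference, here is a minimal sketch of how such numbers can be computed, assuming the common definition of CER (character-level edit distance divided by reference length, expressed as a percentage, which is why values above 100 are possible when a model over-generates) and a simple mean/standard deviation over the 10 runs per page. The actual evaluation scripts linked under Reproduce and Extend may normalise the Tibetan text differently.

```python
# Sketch of CER and per-page mean/std over repeated runs; the benchmark's own
# scripts may apply additional text normalisation before scoring.
from statistics import mean, stdev

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """CER in percent; can exceed 100 when the model over-generates text."""
    return 100.0 * levenshtein(prediction, reference) / max(len(reference), 1)

def page_score(predictions: list[str], reference: str) -> tuple[float, float]:
    """Mean and standard deviation of CER across repeated runs of one page."""
    scores = [cer(p, reference) for p in predictions]
    return mean(scores), stdev(scores) if len(scores) > 1 else 0.0
```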

Key Results

Across datasets, the lowest mean CERs come from BDRC – MonlamAI models, Google Vision, and in one case Gemini 2.0 Flash. Gemini 2.5 models improved with few-shot but still did not surpass BDRC or Google Vision overall. We are also working with Dharmamitra to understand why their reported Gemini 2.5 Flash Lite results are significantly better than ours.

Winners

BDRC – MonlamAI models are the clear winners, with the best results on 11 datasets, followed by Google Vision (3 datasets) and Gemini 2.0 Flash (1 dataset).

Note on Zero-shot vs. Few-shot

Few-shot prompting helped the Gemini 2.5 family substantially compared to its own zero-shot baseline and reduced variance across pages. The 2.0 family tended to get worse with few-shot in this configuration. Even after this improvement, the 2.5 family still trailed the BDRC – MonlamAI and Google Vision baselines in aggregate.

Numbers at a Glance (Weighted by Pages per Dataset)

| Model | Zero-shot CER | Few-shot CER | Change |
| --- | --- | --- | --- |
| Gemini 2.5 Flash | 195.8 | 84.2 | -111.6 |
| Gemini 2.5 Flash Lite | 577.2 | 286.7 | -290.5 |
| Gemini 2.0 Flash | 38.9 | 85.8 | +46.9 |
| Gemini 2.0 Flash Lite | 39.2 | 80.3 | +41.1 |
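
The aggregates in this table are weighted by each dataset's page count, so larger sets such as Tibschol (50 pages) contribute proportionally more. A minimal sketch of that weighting, with placeholder dataset names and values:

```python
# Page-weighted aggregate CER across datasets; dataset ids and values are placeholders.
def weighted_cer(per_dataset: dict[str, tuple[float, int]]) -> float:
    """per_dataset maps dataset id -> (mean CER for that dataset, number of pages)."""
    total_pages = sum(pages for _, pages in per_dataset.values())
    return sum(cer_d * pages for cer_d, pages in per_dataset.values()) / total_pages

print(weighted_cer({"dataset_a": (38.9, 20), "dataset_b": (85.8, 50)}))  # ~72.4
```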

Stability

  • Gemini 2.5 Flash: standard deviation dropped from ~49.6 (zero-shot) to ~8.8 (few-shot).

  • Gemini 2.5 Flash Lite: more stable with few-shot, but still more variable than the full 2.5 Flash.

  • High standard deviations in Gemini and Dharmamitra runs mean that while averages are useful, individual run results could swing widely.

Dataset Difficulty Notes

Old Ume manuscripts and damaged Mongolian blockprints remain challenging. Clean Derge and Narthang Uchen prints fare better.

Reproduce and Extend

Zero-shot base prompt: Gist
Evaluation scripts, few-shot builders, and run instructions: GitHub