OCR Benchmark Report
We built a compact but diverse benchmark to see how well current OCR approaches handle Tibetan pecha in the wild. The goal was simple: take representative page images from different scripts and production eras, run the same OCR pipeline across engines, measure character error rate (CER), and learn where things still break.
Special thanks to Nitartha and Tibschol for their contributions.
For full dataset-by-dataset results and detailed CER breakdowns, see the public spreadsheet: Click to see full results.
What We Tested
Models and Flows
- BDRC – MonlamAI models used in practice
- Google Vision
- Gemini family: evaluated with both zero-shot and few-shot prompting (few-shot was applied only to Gemini models)
- Dharmamitra OCR (based on Gemini 2.5 Flash Lite)
Note: BDRC – MonlamAI models, Google Vision, and Dharmamitra were held constant across runs. Only the Gemini prompts changed between zero-shot and few-shot.
Datasets and Notes
| Dataset ID/Name | Type | Script | Description | Pages |
|---|---|---|---|---|
| MW1KG13607_0534 | Manuscript | Uchen | Phugdrak Kangyur | 20 |
| MW1NLM1660_1CF2D2 | Woodblock | Uchen | Poor-quality blockprint from Mongolia | 10 |
| MW4PD1207_347B2B | Manuscript | Ume | Modern Ume manuscript | 6 |
| Nitartha DTG204 | Woodblock | Uchen | Derge blockprint | 14 |
| Nitartha LRD004 | Woodblock | Uchen | Derge blockprint | 16 |
| Tibschol | Manuscript | Ume | Old Ume manuscripts | 50 |
| W1BL4 | Woodblock | Uchen | Ulaanbaatar | 20 |
| W1NLM1737 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W1NLM3731 | Woodblock | Uchen | Tibetan blockprint | 19 |
| W1NLM3933 | Manuscript | Uchen | Mongolian manuscript | 20 |
| W22704 | Woodblock | Uchen | Narthang Tengyur | 15 |
| W2KG5015 | Woodblock | Uchen | Narthang Tengyur | 15 |
How We Ran It
Pipeline
- Models received a concise OCR instruction prompt asking for clean Tibetan text without formatting.
- Few-shot (Gemini only): used a small number of exemplars from within the same dataset. The script builds a short conversation alternating exemplar page images and their ground-truth transcriptions, followed by the target page; a sketch follows this list.
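To make the pipeline concrete, here is a minimal sketch of how the zero-shot and few-shot requests could be assembled with the google-generativeai Python SDK. The prompt wording, function names, and exemplar handling are illustrative stand-ins, not the actual implementation; the real base prompt is in the Gist and the real scripts are in the GitHub repo linked at the end.

```python
# Minimal sketch of request construction for the Gemini runs, using the
# google-generativeai Python SDK. Names and prompt wording are
# illustrative; see the linked Gist and GitHub repo for the real ones.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Placeholder instruction; the actual base prompt is in the linked Gist.
BASE_PROMPT = (
    "Transcribe the Tibetan text on this page. "
    "Return only the clean Tibetan text, with no formatting or commentary."
)

def ocr_zero_shot(page_path: str) -> str:
    """Zero-shot: the instruction prompt plus the target page image."""
    page = Image.open(page_path)
    return model.generate_content([BASE_PROMPT, page]).text

def ocr_few_shot(page_path: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot: alternate exemplar page images (from the same dataset)
    with their ground-truth transcriptions, then append the target page."""
    contents: list = [BASE_PROMPT]
    for exemplar_path, ground_truth in exemplars:
        contents.append(Image.open(exemplar_path))
        contents.append(ground_truth)
    contents.append(Image.open(page_path))
    return model.generate_content(contents).text
```

Passing exemplars as alternating image/text parts in a single request is one simple way to realize the "short conversation" the scripts build; a multi-turn chat history would be equivalent.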
Metric
- Character Error Rate (CER) on Tibetan text (lower is better); a minimal computation sketch follows this list.
- For all Gemini and Dharmamitra models, each page was processed 10 times; CER is the mean across runs, with an accompanying standard deviation to show stability.
- For both Gemini and Dharmamitra models, the standard deviations were often very high, meaning results could differ dramatically across the 10 runs for the same page.
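A minimal sketch of the CER computation and per-page run aggregation, assuming the standard definition (character-level edit distance divided by reference length, reported as a percentage); the rapidfuzz package here is one common choice for the edit distance, and the function names are illustrative. Under this convention CER can exceed 100 when a model inserts much more text than the reference contains, which is one way to read the largest zero-shot values later in this report.

```python
# Sketch of the CER metric, assuming the standard definition:
# character-level edit distance divided by reference length.
from statistics import mean, stdev

from rapidfuzz.distance import Levenshtein

def cer(prediction: str, reference: str) -> float:
    """Character error rate as a percentage. Because insertions count
    toward the edit distance, CER can exceed 100 when a model emits
    much more text than the reference contains."""
    return 100.0 * Levenshtein.distance(prediction, reference) / len(reference)

def page_cer_stats(predictions: list[str], reference: str) -> tuple[float, float]:
    """Mean and standard deviation of CER across repeated runs of the
    same page (10 runs per page for the Gemini and Dharmamitra models)."""
    scores = [cer(p, reference) for p in predictions]
    return mean(scores), stdev(scores)
```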
Key Results
Across datasets, the lowest mean CERs come from the BDRC – MonlamAI models, Google Vision, and in one case Gemini 2.0 Flash. The Gemini 2.5 models improved with few-shot prompting but still did not surpass BDRC or Google Vision overall. We are also working with Dharmamitra to understand why their reported Gemini 2.5 Flash Lite results are significantly better than ours.
Winners
BDRC – MonlamAI models are the clear winners, with the best results on 11 datasets, followed by Google Vision (3 datasets) and Gemini 2.0 Flash (1 dataset).
Note on Zero-shot vs. Few-shot
Few-shot prompting helped the Gemini 2.5 family substantially relative to its own zero-shot baseline and reduced variance across pages. The 2.0 family tended to get worse with few-shot in this configuration. Even after this improvement, the 2.5 family still trailed the BDRC – MonlamAI and Google Vision baselines in aggregate.
Numbers at a Glance (Weighted by Pages per Dataset)
| Model | Zero-shot CER | Few-shot CER | Change |
|---|---|---|---|
| Gemini 2.5 Flash | 195.8 | 84.2 | -111.6 |
| Gemini 2.5 Flash Lite | 577.2 | 286.7 | -290.5 |
| Gemini 2.0 Flash | 38.9 | 85.8 | +46.9 |
| Gemini 2.0 Flash Lite | 39.2 | 80.3 | +41.1 |
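The weighting here is just a page-count average over the per-dataset means, using the page counts from the dataset table above. A minimal sketch of that aggregation (the function name and the example figures are illustrative, not values from the spreadsheet):

```python
# Page-weighted aggregation: each dataset's mean CER contributes in
# proportion to its page count, so the 50-page Tibschol set counts
# far more than the 6-page Ume manuscript set.
def weighted_cer(per_dataset: list[tuple[float, int]]) -> float:
    """per_dataset holds (mean CER, page count) pairs, one per dataset."""
    total_pages = sum(pages for _, pages in per_dataset)
    return sum(c * pages for c, pages in per_dataset) / total_pages

# Illustrative call (not real results):
# weighted_cer([(38.9, 20), (45.2, 10), (120.0, 6)])
```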
Stability
- Gemini 2.5 Flash: standard deviation dropped from ~49.6 to ~8.8 with few-shot.
- Gemini 2.5 Flash Lite: became more stable with few-shot, but remained more variable than 2.5 Flash.
- High standard deviations in the Gemini and Dharmamitra runs mean that while the averages are useful, individual run results could swing widely.
Dataset Difficulty Notes
Old Ume manuscripts and damaged Mongolian blockprints remain challenging. Clean Derge and Narthang Uchen prints fare better.
Reproduce and Extend
Zero-shot base prompt: Gist
Evaluation scripts, few-shot builders, and run instructions: GitHub