The Current State of Tibetan OCR (BDRC and Monlam AI)
To recognize Tibetan characters from a page image, we need three models (layout analysis, line segmentation, and optical character recognition), along with a script classification model that identifies the writing style; a sketch of how these pieces fit together follows the model list below.
Models Trained:
- Layout Analysis Model: A model that takes a page image and gives annotations for the Background, Image, Line, Margin, and Caption regions.
  Models trained:
- Line Segmentation Model: A model that takes a page image, or the text area of a page image, and gives line segmentation annotations for that page.
  Model trained:
- Optical Character Recognition (OCR) Model: A model that takes a line image and gives out the text contained in that line image.
  Models trained:
- Script Classification Model: A model that takes a page image and predicts the writing style from the following predefined classes:
  1. UCHAN_Woodblock
  2. UCHAN_Manuscript
  3. UCHAN_Modern
  4. UMED_Druma1
  5. UMED_Druma2
  6. UMED_Betsug1
  7. UMED_Betsug2
  8. UMED_Slanted_1
  Model trained:
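To make the hand-off between these models concrete, the sketch below chains them over a single page image. It is a minimal sketch under two assumptions: every function, method, and model object here is a hypothetical placeholder (not the actual BDRC/Monlam AI interface), and the script classifier's prediction is assumed to select a style-specific OCR model, which the notes above do not state explicitly.

```python
# Hypothetical end-to-end inference sketch: page image -> layout -> lines -> text.
# All model objects and method names are placeholders, not real BDRC/Monlam AI APIs.

# The eight writing-style classes listed above for the script classifier.
SCRIPT_CLASSES = [
    "UCHAN_Woodblock", "UCHAN_Manuscript", "UCHAN_Modern",
    "UMED_Druma1", "UMED_Druma2", "UMED_Betsug1", "UMED_Betsug2",
    "UMED_Slanted_1",
]

def recognize_page(page_image, layout_model, line_model, style_model, ocr_models):
    """Transcribe one page image by chaining the models described above."""
    # 1. Layout analysis: annotate Background / Image / Line / Margin / Caption regions.
    regions = layout_model.predict(page_image)
    text_area = regions.crop(label="line")  # keep only the text-bearing area

    # 2. Line segmentation: split the text area into individual line images.
    line_images = line_model.segment(text_area)

    # 3. Script classification: predict the page's writing style (one of SCRIPT_CLASSES).
    style = style_model.predict(page_image)

    # 4. OCR: transcribe each line with the model chosen for that style.
    ocr_model = ocr_models[style]
    return "\n".join(ocr_model.transcribe(line) for line in line_images)
```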
Data Created:
- Layout Analysis Data: Randomly selected page images with different layouts.
- Line Segmentation Data: The same images used for the Layout Analysis data.
- Style Classification Data: Phentsok and Eric created various page images from BDRC.
- OCR Data:
  - Lhasa Kanjur, with data distribution; human annotated
  - Lithang Kanjur, with data distribution
  - Derge Tenjur
  - Karmapa Data
  - Khyentse Wangpo
  - Norbuketaka Numbers
  - HTR, human annotated (see the loading sketch after this list):
    https://huggingface.co/datasets/BDRC/OCR-HTR_Corrections
    https://huggingface.co/datasets/BDRC/OCR-HTR_Manual
  - Google Books: 100 randomly selected BDRC books that were OCRed by Google
  - Norbuketaka Datasets:
    - Line-to-text batches 19-25, with data distribution; human annotated
    - Line-to-text batches 26-37b, with data distribution; human annotated
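Both HTR datasets linked above live on the Hugging Face Hub, so they can be pulled with the standard `datasets` library. This is a minimal sketch: the repository id comes straight from the links, but the split names and column schema are assumptions, so the code prints them rather than relying on any particular field.

```python
# Minimal sketch for loading one of the BDRC HTR datasets linked above.
# Only the repo id is taken from this document; splits and columns are unverified.
from datasets import load_dataset

ds = load_dataset("BDRC/OCR-HTR_Corrections")
print(ds)  # DatasetDict: shows which splits and columns actually exist

# Peek at a few records from the first available split to learn the schema.
split = next(iter(ds.values()))
for record in split.select(range(min(3, len(split)))):
    print({key: type(value).__name__ for key, value in record.items()})
```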