The Current State of Tibetan OCR (BDRC and Monlam AI)
To recognize Tibetan characters from a page image, we need three models (layout analysis, line segmentation, and optical character recognition), along with a script classification model that identifies the writing style; a sketch of how these pieces fit together follows the model list below.
Models Trained:
- Layout Analysis Model: A model that takes a page image and gives annotations for the Background, Image, Line, Margin, and Caption regions.
  Models trained:
- Line Segmentation Model: A model that takes a page image, or the text area of a page image, and gives line segmentation annotations for that page.
  Model trained:
- Optical Character Recognition (OCR) Model: A model that takes a line image and gives out the text contained in that line image.
  Models trained:
- Script Classification Model: A model that takes a page image and predicts the writing style from the following predefined classes:
  1. UCHAN_Woodblock
  2. UCHAN_Manuscript
  3. UCHAN_Modern
  4. UMED_Druma1
  5. UMED_Druma2
  6. UMED_Betsug1
  7. UMED_Betsug2
  8. UMED_Slanted_1
  Model trained:
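To make the hand-off between these models concrete, the sketch below chains them over a single page image. It is a minimal sketch under two assumptions: every function, method, and model object here is a hypothetical placeholder (not the actual BDRC/Monlam AI interface), and the script classifier's prediction is assumed to select a style-specific OCR model, which the notes above do not state explicitly.

```python
# Hypothetical end-to-end inference sketch: page image -> layout -> lines -> text.
# All model objects and method names are placeholders, not real BDRC/Monlam AI APIs.

# The eight writing-style classes listed above for the script classifier.
SCRIPT_CLASSES = [
    "UCHAN_Woodblock", "UCHAN_Manuscript", "UCHAN_Modern",
    "UMED_Druma1", "UMED_Druma2", "UMED_Betsug1", "UMED_Betsug2",
    "UMED_Slanted_1",
]

def recognize_page(page_image, layout_model, line_model, style_model, ocr_models):
    """Transcribe one page image by chaining the models described above."""
    # 1. Layout analysis: annotate Background / Image / Line / Margin / Caption regions.
    regions = layout_model.predict(page_image)
    text_area = regions.crop(label="line")  # keep only the text-bearing area

    # 2. Line segmentation: split the text area into individual line images.
    line_images = line_model.segment(text_area)

    # 3. Script classification: predict the page's writing style (one of SCRIPT_CLASSES).
    style = style_model.predict(page_image)

    # 4. OCR: transcribe each line with the model chosen for that style.
    ocr_model = ocr_models[style]
    return "\n".join(ocr_model.transcribe(line) for line in line_images)
```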
Data Created:
- Layout Analysis Data: Randomly selected page images with different layouts.
- Line Segmentation Data: The same images used for the Layout Analysis data.
- Style Classification Data: Phentsok and Eric created various page images from BDRC.
- OCR Data:
  - Lhasa Kanjur, with data distribution; human annotated
  - Lithang Kanjur, with data distribution
  - Derge Tenjur
  - Karmapa Data
  - Khyentse Wangpo
  - Norbuketaka Numbers
  - HTR, human annotated (see the loading sketch after this list):
    https://huggingface.co/datasets/BDRC/OCR-HTR_Corrections
    https://huggingface.co/datasets/BDRC/OCR-HTR_Manual
  - Google Books: 100 randomly selected BDRC books that were OCRed by Google
  - Norbuketaka Datasets:
    - Line-to-text batches 19-25, with data distribution; human annotated
    - Line-to-text batches 26-37b, with data distribution; human annotated
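Both HTR datasets linked above live on the Hugging Face Hub, so they can be pulled with the standard `datasets` library. This is a minimal sketch: the repository id comes straight from the links, but the split names and column schema are assumptions, so the code prints them rather than relying on any particular field.

```python
# Minimal sketch for loading one of the BDRC HTR datasets linked above.
# Only the repo id is taken from this document; splits and columns are unverified.
from datasets import load_dataset

ds = load_dataset("BDRC/OCR-HTR_Corrections")
print(ds)  # DatasetDict: shows which splits and columns actually exist

# Peek at a few records from the first available split to learn the schema.
split = next(iter(ds.values()))
for record in split.select(range(min(3, len(split)))):
    print({key: type(value).__name__ for key, value in record.items()})
```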