The Current State of Tibetan OCR (BDRC and Monlam AI)

To recognize Tibetan characters in a page image, we need three models.

Models Trained:

  1. Layout Analysis Model:

    A model that takes a page image and annotates the following regions:
    Background,
    Image,
    Line,
    Margin,
    Caption

    Models trained:

  2. Line Segmentation Model:

    A model that takes a page image, or a text area of a page image, and produces line segmentation annotations for it.

    Model trained:

  3. Optical Character Recognition Model:

    A model that takes a line image and outputs the text in that line image.

    Models Trained:

  4. Script Classification Model:

    A model that takes a page image and outputs its writing style, chosen from the following predefined classes:
    1. UCHAN_Woodblock
    2. UCHAN_Manuscript
    3. UCHAN_Modern
    4. UMED_Druma1
    5. UMED_Druma2
    6. UMED_Betsug1
    7. UMED_Betsug2
    8. UMED_Slanted_1
    Model trained:
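The way the models listed above fit together can be sketched roughly as follows. This is only an illustration, not the actual BDRC or Monlam AI code: the function `recognize_page`, the `PageResult` type, and the stage signatures are all hypothetical, and treating the script classifier as an optional extra step is an assumption. Only the eight class names come from the post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional


class ScriptClass(Enum):
    """The eight predefined writing-style classes named in the post."""
    UCHAN_WOODBLOCK = "UCHAN_Woodblock"
    UCHAN_MANUSCRIPT = "UCHAN_Manuscript"
    UCHAN_MODERN = "UCHAN_Modern"
    UMED_DRUMA1 = "UMED_Druma1"
    UMED_DRUMA2 = "UMED_Druma2"
    UMED_BETSUG1 = "UMED_Betsug1"
    UMED_BETSUG2 = "UMED_Betsug2"
    UMED_SLANTED_1 = "UMED_Slanted_1"

    @property
    def family(self) -> str:
        """Return the label prefix, "UCHAN" or "UMED"."""
        return self.value.split("_", 1)[0]


# Hypothetical stage signatures. Images are opaque objects here; a real
# system would pass arrays or PIL images and richer annotation objects.
LayoutModel = Callable[[Any], List[Any]]   # page image -> text-area crops
LineModel = Callable[[Any], List[Any]]     # text area  -> line-image crops
OcrModel = Callable[[Any], str]            # line image -> text
ScriptModel = Callable[[Any], ScriptClass] # page image -> writing style


@dataclass
class PageResult:
    script: Optional[ScriptClass]  # None when no classifier was supplied
    lines: List[str]               # recognized text, one entry per line


def recognize_page(
    page: Any,
    layout: LayoutModel,
    segment_lines: LineModel,
    ocr: OcrModel,
    classify_script: Optional[ScriptModel] = None,
) -> PageResult:
    """Chain the stages: layout -> line segmentation -> OCR."""
    script = classify_script(page) if classify_script is not None else None
    texts: List[str] = []
    for area in layout(page):
        for line_image in segment_lines(area):
            texts.append(ocr(line_image))
    return PageResult(script=script, lines=texts)
```

Each stage is passed in as a plain callable, so any trained model can be wrapped and swapped in without changing the pipeline function. In this sketch the layout stage is assumed to return only the text areas; the other region types (Background, Image, Margin, Caption) would be filtered out before line segmentation.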

Data Created:

  1. Layout Analysis Data:

    Randomly selected page images with different layouts

  2. Line Segmentation Data:

    Same images used for Layout Analysis data

  3. Style Classification Data:

    Phentsok and Eric created various page images from BDRC.

  4. OCR Data:

    1. Uchan Data:

      1. Woodblock:

        Lhasa Kanjur with data distribution. Human annotated
        Lithang Kanjur with data distribution
        Derge Tenjur
        Karmapa Data
        KhyentseWangpo
        NorbuketakaNumbers

      2. Manuscript:

        HTR, human annotated

        https://huggingface.co/datasets/BDRC/OCR-HTR_Corrections
        https://huggingface.co/datasets/BDRC/OCR-HTR_Manual

      3. Modern Print:

        Google Books - 100 randomly selected BDRC books that were OCRed by Google
        Norbuketaka Datasets

    2. Betsug Data:

      Line to text batch 19-25 with data distribution. Human annotated

    3. Drutsa Data:

      Line to text batch 26-37b with data distribution. Human annotated

    4. Handwritten Cursive Data:

  • Durtsa → Drutsa

What’s the architecture of the Photi models? Is there any information available about the training data and training process for the Photi and OCR models? I specialize in Tibetan pecha recognition and would be really glad to learn more!

@Tashi_Tsering I think you need to replace the Uchen Derge Tenjur dataset link with this one: here