The Current State of Tibetan OCR (BDRC and Monlam AI)

To recognize Tibetan characters from a page image, we need three models.

Models Trained:

  1. Layout Analysis Model:

    A model that takes a page image and annotates its regions as one of:
    Background,
    Image,
    Line,
    Margin,
    Caption

    Models trained:

  2. Line Segmentation Model:

    A model that takes a page image, or the text area of a page image, and outputs line segmentation annotations for it.

    Model trained:

  3. Optical Character Recognition (OCR) Model:

    A model that takes a line image and outputs the text it contains.

    Models trained:

  4. Script Classification Model:

    A model that takes a page image and assigns it one of the following predefined writing-style classes:
    1. UCHAN_Woodblock
    2. UCHAN_Manuscript
    3. UCHAN_Modern
    4. UMED_Druma1
    5. UMED_Druma2
    6. UMED_Betsug1
    7. UMED_Betsug2
    8. UMED_Slanted_1
    Model trained:
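
To make the data flow between the models above concrete, here is a minimal Python sketch of how they could be chained. Everything in it is an assumption made for illustration: the names `OcrPipeline`, `Region`, `PageResult`, and `crop` are hypothetical, the box convention is invented, and treating the layout label "Line" as the text area passed to line segmentation is only one possible reading of the descriptions above. It is not the actual BDRC or Monlam AI code, and the four models are passed in as plain callables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Sequence, Tuple

# Stand-in image type (rows of pixel values); real code would likely use
# a NumPy array or a PIL image instead.
Image = Sequence[Sequence[int]]
# Assumed box convention: (top, left, bottom, right) in pixels.
Box = Tuple[int, int, int, int]


class ScriptClass(Enum):
    """The eight predefined writing-style classes from the list above."""
    UCHAN_WOODBLOCK = "UCHAN_Woodblock"
    UCHAN_MANUSCRIPT = "UCHAN_Manuscript"
    UCHAN_MODERN = "UCHAN_Modern"
    UMED_DRUMA1 = "UMED_Druma1"
    UMED_DRUMA2 = "UMED_Druma2"
    UMED_BETSUG1 = "UMED_Betsug1"
    UMED_BETSUG2 = "UMED_Betsug2"
    UMED_SLANTED_1 = "UMED_Slanted_1"


@dataclass
class Region:
    label: str  # Background, Image, Line, Margin, or Caption
    box: Box


@dataclass
class PageResult:
    script: ScriptClass
    lines: List[str]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


@dataclass
class OcrPipeline:
    # Each field is a hypothetical wrapper around one trained model.
    analyze_layout: Callable[[Image], List[Region]]
    segment_lines: Callable[[Image], List[Box]]
    recognize_line: Callable[[Image], str]
    classify_script: Callable[[Image], ScriptClass]

    def run(self, page: Image) -> PageResult:
        script = self.classify_script(page)
        lines: List[str] = []
        for region in self.analyze_layout(page):
            if region.label != "Line":
                continue  # assumption: only "Line" regions contain text
            text_area = crop(page, region.box)
            # Line boxes are assumed to be relative to the text area.
            for line_box in self.segment_lines(text_area):
                lines.append(self.recognize_line(crop(text_area, line_box)))
        return PageResult(script=script, lines=lines)
```

Calling `OcrPipeline(...).run(page)` with four model wrappers returns the predicted script class and the recognized text of each line in reading order of the segmenter's output. Passing the models in as callables keeps the sketch independent of any particular framework.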

Data Created:

  1. Layout Analysis Data:

    Randomly selected page images with different layouts

  2. Line Segmentation Data:

    The same images used for the Layout Analysis data

  3. Style Classification Data:

    Phentsok and Eric created various page images from BDRC.

  4. OCR Data:

    1. Uchan Data:

      1. Woodblock:

        Lhasa Kanjur with data distribution. Human annotated
        Lithang Kanjur with data distribution
        Derge Tenjur
        Karmapa Data
        KhyentseWangpo
        NorbuketakaNumbers

      2. Manuscript:

        HTR, human annotated

        https://huggingface.co/datasets/BDRC/OCR-HTR_Corrections
        https://huggingface.co/datasets/BDRC/OCR-HTR_Manual

      3. Modern Print:

        Google Books - 100 randomly selected BDRC books that were OCRed by Google
        Norbuketaka Datasets

    2. Betsug Data:

      Line to text batch 19-25 with data distribution. Human annotated

    3. Drutsa Data:

      Line to text batch 26-37b with data distribution. Human annotated

    4. Handwritten Cursive Data:
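
For the two Hugging Face datasets linked under Manuscript, a rough sketch of loading them in Python might look like the following. It assumes the third-party `datasets` package is installed and that the repositories are publicly readable. The helper names `repo_id_from_url` and `load_htr_dataset` are hypothetical, and nothing is assumed here about the splits or columns the datasets contain.

```python
from urllib.parse import urlparse

# The two dataset URLs given in the post.
DATASET_URLS = [
    "https://huggingface.co/datasets/BDRC/OCR-HTR_Corrections",
    "https://huggingface.co/datasets/BDRC/OCR-HTR_Manual",
]


def repo_id_from_url(url: str) -> str:
    """Turn a Hugging Face dataset URL into an 'owner/name' repository id."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


def load_htr_dataset(url: str):
    """Download the dataset behind url (needs network access)."""
    # Imported lazily so the URL helper works without the package installed.
    from datasets import load_dataset

    return load_dataset(repo_id_from_url(url))
```

`load_htr_dataset(DATASET_URLS[1])` would then fetch the manually annotated set. Printing the returned object is a quick way to see which splits and columns it actually provides.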
