Introduction
Optical Character Recognition (OCR) is a transformative technology that enables the digitisation of printed or handwritten text. The Tibetan OCR Training and Tibetan OCR Evaluation repositories, developed by the Buddhist Digital Resource Center (BDRC), offer specialised toolkits to build, fine-tune, and evaluate OCR models for Tibetan text recognition.
This blog explores the purpose, structure, workflows, and components of these repositories, highlighting their contributions to OCR development.
Repository Overview
Tibetan OCR Training Repository
This repository focuses on creating robust OCR models for Tibetan text. Key functionalities include:
- Data Preprocessing: Tools for augmenting, cleaning, and formatting datasets.
- Model Training: Support for training architectures like CRNN and Easter2.0.
- Fine-Tuning: Adapting pre-trained models to new datasets for improved performance.
Tibetan OCR Evaluation Repository
The evaluation repository complements the training repository by enabling:
- Performance Assessment: Tools to compute metrics like Character Error Rate (CER; sketched after this list).
- Comparative Evaluation: Workflows to compare multiple OCR models.
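CER, the metric used throughout both repositories, is the Levenshtein (edit) distance between a predicted transcription and its reference, divided by the reference length. A minimal pure-Python sketch of the metric (the evaluation notebooks use a ready-made scorer instead, shown later):
def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit distance between the strings / reference length."""
    m, n = len(prediction), len(reference)
    dp = list(range(n + 1))  # dp[j] = distance(prediction[:0], reference[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete prediction[i-1]
                        dp[j - 1] + 1,  # insert reference[j-1]
                        prev + (prediction[i - 1] != reference[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("བཀྲ་ཤིས", "བཀྲ་ཤིས་"))  # missing final tsheg: 1 edit / 8 chars = 0.125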
Repository Structures
Tibetan OCR Training Repository
tibetan-ocr-training/
│
├── BudaOCR/
│ ├── Augmentations.py
│ ├── Config.py
│ ├── Models.py
│ ├── Modules.py
│ ├── Utils.py
│ └── __init__.py
│
├── Demo-CRNN_Training.ipynb
├── Demo-Easter2_Training.ipynb
├── Demo-FineTuning.ipynb
├── LICENSE
├── README.md
├── tib-stacks.txt
├── train_from_dir.py
└── train_from_dist.py
Key Components
- BudaOCR/: The core directory containing:
- Data Augmentation (Augmentations.py): Image augmentations that make models robust to variation in the source material.
- Model Configurations (Config.py): Defines training parameters and text rules.
- Core Architectures (Models.py): Implements CRNN and Easter2.0 models.
- Utilities (Utils.py): Functions for data preprocessing and training.
- Demo Notebooks: Notebooks demonstrating the CRNN training, Easter2.0 training, and fine-tuning workflows.
- Training Scripts: Scripts (train_from_dir.py, train_from_dist.py) for directory-based or distribution-based training.
Tibetan OCR Evaluation Repository
tibetan-ocr-evaluation/
├── Models.py
├── Modules.py
├── Utils.py
├── Demo-Evaluation.ipynb
├── Demo-Inference.ipynb
├── LICENSE
├── README.md
└── requirement.txt
Key Components
- Models.py and Modules.py: Define inference pipelines for models like CRNN, Easter2.0, and TrOCR.
- Utils.py: Provides helper functions for data processing and visualization.
- Demo Notebooks: Demo-Evaluation.ipynb and Demo-Inference.ipynb demonstrate the model evaluation and inference workflows.
Data Types and Training Workflows
The training repository supports two types of data (a conceptual sketch follows the list):
- Sample Random Data: Training uses shuffled data to expose the model to diverse patterns, which encourages generalization by reducing bias toward specific patterns.
- Data Distribution: Predefined subsets align training with real-world distributions, optimizing model performance for domain-specific datasets.
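As a rough illustration of the two regimes, the helpers below are mine, not the repository's API: random sampling shuffles everything together, while distribution-based sampling draws a fixed share from each predefined subset.
import random

# Illustrative helpers only; not the repository's actual API.

def sample_random(image_paths, label_paths, seed=42):
    """Shuffle images and labels together so pairs stay aligned."""
    pairs = list(zip(image_paths, label_paths))
    random.Random(seed).shuffle(pairs)
    images, labels = zip(*pairs)
    return list(images), list(labels)

def sample_from_distribution(subsets, weights, total):
    """Draw a fixed share of the training set from each predefined subset."""
    picked = []
    for subset, weight in zip(subsets, weights):
        k = min(int(weight * total), len(subset))
        picked.extend(random.sample(subset, k))
    return picked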
CRNN Training Workflow
CRNN (Convolutional Recurrent Neural Network) combines CNNs for feature extraction and RNNs for sequence modeling. Here’s a workflow for training CRNN on shuffled data:
# Imports assumed to come from the repository's BudaOCR package (exact paths may differ):
# from BudaOCR.Modules import OCRTrainer, shuffle_data
# from BudaOCR.Models import CRNNNetwork

# Shuffle image and label paths together to randomize the training order
image_paths, label_paths = shuffle_data(image_paths, label_paths)

ocr_trainer = OCRTrainer(
    network=CRNNNetwork(image_width=3200, image_height=100, num_classes=num_classes),
    label_encoder=wylie_encoder,
    batch_size=16,
    output_dir="Output",
)
ocr_trainer.init(image_paths, label_paths)
ocr_trainer.train(epochs=48, check_cer=True)  # report CER while training
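CRNN models of this kind are typically trained with CTC loss, so the raw network output is one distribution per timestep over the character classes plus a blank token. The repository's decoding step isn't shown above, but a standard greedy CTC decode, sketched here as an assumption (blank id 0 is illustrative), looks like this:
import numpy as np

def greedy_ctc_decode(logits: np.ndarray, charset: list, blank: int = 0) -> str:
    """Greedy CTC decode: best class per timestep, collapse repeats, drop blanks.
    charset maps class ids to characters; index 0 is the blank placeholder."""
    ids = logits.argmax(axis=-1)  # shape (timesteps,)
    collapsed = [int(i) for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    return "".join(charset[i] for i in collapsed if i != blank)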
Easter2.0 Training Workflow
Easter2.0 is a lightweight model optimized for OCR tasks on smaller datasets. Training with a predefined data distribution tailors the model to Tibetan-specific text patterns for domain-specific performance:
# build_distribution_from_file and EasterNetwork are assumed to live in the BudaOCR package
distribution = build_distribution_from_file(distr_file, dataset_path)

ocr_trainer = OCRTrainer(
    network=EasterNetwork(image_width=3200, image_height=100, num_classes=num_classes),
    label_encoder=wylie_encoder,
    batch_size=16,
    output_dir="Output",
)
ocr_trainer.init_from_distribution(distribution)
ocr_trainer.train(epochs=80, scheduler_start=62)  # scheduler_start presumably delays the LR schedule until epoch 62
Role of Encoders
Encoders play a crucial role in transforming Tibetan text into machine-readable formats. Two key encoders are:
- Wylie Encoder:
- Converts Tibetan text into Wylie transliteration.
- Facilitates character-level recognition and alignment with model outputs.
- Stack Encoder:
- Encodes Tibetan stacks (vertically composed glyph units) as numeric classes.
- Ensures compatibility with OCR models by simplifying input processing.
These encoders preserve the linguistic integrity of Tibetan text during training and evaluation.
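Conceptually, both encoders implement the same small interface: map a transcription to integer class ids for training, and map ids back to text at inference. A character-level sketch (illustrative only; the repository's wylie_encoder operates on Wylie transliterations):
class CharLabelEncoder:
    """Illustrative character-level label encoder; id 0 is reserved for the CTC blank."""
    def __init__(self, charset: str):
        self.id2char = ["<blank>"] + list(charset)
        self.char2id = {c: i for i, c in enumerate(self.id2char)}

    def encode(self, text: str) -> list:
        return [self.char2id[c] for c in text]

    def decode(self, ids: list) -> str:
        return "".join(self.id2char[i] for i in ids if i != 0)

encoder = CharLabelEncoder("bkrashi ")  # toy Wylie charset
print(encoder.decode(encoder.encode("bkra shis")))  # prints "bkra shis"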
Evaluation Insights
The Tibetan OCR Evaluation repository provides a robust framework for assessing model performance. The workflow demonstrated in Demo-Evaluation.ipynb focuses on comparing multiple OCR models using CER as the primary evaluation metric.
Evaluation Workflow
1. Dataset and Model Preparation
Download datasets and pre-trained models from the Hugging Face Hub:
from glob import glob
from huggingface_hub import snapshot_download

# Download the dataset, then collect line images and their transcriptions.
# sorted() keeps image/label pairs aligned when they are zipped together later.
data_path = snapshot_download(repo_id="BDRC/KhyentseWangpo", repo_type="dataset", cache_dir="Datasets")
lines = sorted(glob(f"{data_path}/lines/*.jpg"))
labels = sorted(glob(f"{data_path}/transcriptions/*.txt"))
2. CRNN Evaluation
Perform inference using the CRNN model:
import cv2
from evaluate import load
from Modules import CRNNInference
from Utils import read_ctc_model_config  # assumed location of this helper

cer_scorer = load("cer")  # Hugging Face evaluate metric for Character Error Rate

model_path = snapshot_download(repo_id="BDRC/GoogleBooks_C_v1", repo_type="model", local_dir="Models")
crnn_inference = CRNNInference(read_ctc_model_config(f"{model_path}/config.json"))

cer_scores = []
for line, label in zip(lines, labels):
    prediction = crnn_inference.predict(cv2.imread(line))
    reference = open(label, encoding="utf-8").read().strip()  # labels are file paths
    cer_score = cer_scorer.compute(predictions=[prediction], references=[reference])
    cer_scores.append(cer_score)
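With one score per line, a natural corpus-level summary is the mean of the per-line scores:
mean_cer = sum(cer_scores) / len(cer_scores)
print(f"CRNN mean CER: {mean_cer:.4f}")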
3. Easter2.0 and TrOCR Evaluation
Similarly, evaluate the Easter2.0 and TrOCR models using their respective inference pipelines:
from Modules import Easter2Inference, TrOCRInference

# Easter2.0: note that model_path must point at an Easter2.0 checkpoint,
# downloaded with snapshot_download just like the CRNN model above
easter_inference = Easter2Inference(read_ctc_model_config(f"{model_path}/config.json"))

# TrOCR: the inference wrapper takes the downloaded model directory directly
trocr_inference = TrOCRInference(snapshot_download(repo_id="BDRC/GoogleBooks_T_v1", repo_type="model", local_dir="Models"))
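Once all three pipelines are loaded, comparison reduces to running the same loop over each. A sketch, assuming each inference object exposes the same predict(image) method as the CRNN pipeline:
results = {}
for name, pipeline in [("CRNN", crnn_inference),
                       ("Easter2.0", easter_inference),
                       ("TrOCR", trocr_inference)]:
    scores = []
    for line, label in zip(lines, labels):
        prediction = pipeline.predict(cv2.imread(line))
        reference = open(label, encoding="utf-8").read().strip()
        scores.append(cer_scorer.compute(predictions=[prediction], references=[reference]))
    results[name] = sum(scores) / len(scores)
print(results)  # lower CER is better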
Conclusion
Key Takeaways
- Tibetan OCR Training Repository:
- Offers robust workflows for CRNN and Easter2.0 model development.
- Tibetan OCR Evaluation Repository:
- Provides precise evaluation workflows for CRNN, Easter2.0, and TrOCR models.
- Model Comparison:
- CRNN excels with larger datasets.
- Easter2.0 is lightweight and suitable for real-time tasks.
- TrOCR provides versatility for various OCR needs.
- Encoders:
- Wylie and Stack Encoders ensure accurate text representation and model compatibility.
Citations
- Tibetan OCR Training Repository: GitHub Link
- Tibetan OCR Evaluation Repository: GitHub Link