Introduction
Optical Character Recognition (OCR) is a transformative technology that enables the digitisation of printed or handwritten text. The Tibetan OCR Training and Tibetan OCR Evaluation repositories, developed by the Buddhist Digital Resource Center (BDRC), offer specialised toolkits to build, fine-tune, and evaluate OCR models for Tibetan text recognition.
This blog explores the purpose, structure, workflows, and components of these repositories, highlighting their contributions to OCR development.
Repository Overview
Tibetan OCR Training Repository
This repository focuses on creating robust OCR models for Tibetan text. Key functionalities include:
- Data Preprocessing: Tools for augmenting, cleaning, and formatting datasets.
- Model Training: Support for training architectures like CRNN and Easter2.0.
- Fine-Tuning: Adapting pre-trained models to new datasets for improved performance.
Tibetan OCR Evaluation Repository
The evaluation repository complements the training repository by enabling:
- Performance Assessment: Tools to compute metrics like Character Error Rate (CER; sketched after this list).
- Comparative Evaluation: Workflows to compare multiple OCR models.
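CER, the metric used throughout both repositories, is the Levenshtein (edit) distance between a predicted transcription and its reference, divided by the reference length. A minimal pure-Python sketch of the metric (the evaluation notebooks use a ready-made scorer instead, shown later):
def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit distance between the strings / reference length."""
    m, n = len(prediction), len(reference)
    dp = list(range(n + 1))  # dp[j] = distance(prediction[:0], reference[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # delete prediction[i-1]
                        dp[j - 1] + 1,  # insert reference[j-1]
                        prev + (prediction[i - 1] != reference[j - 1]))  # substitute
            prev = cur
    return dp[n] / max(n, 1)

print(cer("བཀྲ་ཤིས", "བཀྲ་ཤིས་"))  # missing final tsheg: 1 edit / 8 chars = 0.125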
Repository Structures
Tibetan OCR Training Repository
tibetan-ocr-training/
│
├── BudaOCR/
│ ├── Augmentations.py
│ ├── Config.py
│ ├── Models.py
│ ├── Modules.py
│ ├── Utils.py
│ └── __init__.py
│
├── Demo-CRNN_Training.ipynb
├── Demo-Easter2_Training.ipynb
├── Demo-FineTuning.ipynb
├── LICENSE
├── README.md
├── tib-stacks.txt
├── train_from_dir.py
└── train_from_dist.py
Key Components
- BudaOCR/: The core directory containing:
- Data Augmentation (Augmentations.py): Image augmentations that make models robust to variation in the source material.
- Model Configurations (Config.py): Defines training parameters and text rules.
- Core Architectures (Models.py): Implements CRNN and Easter2.0 models.
- Utilities (Utils.py): Functions for data preprocessing and training.
- Demo Notebooks: Notebooks demonstrating the CRNN training, Easter2.0 training, and fine-tuning workflows.
- Training Scripts: Scripts (train_from_dir.py, train_from_dist.py) for directory-based or distribution-based training.
Tibetan OCR Evaluation Repository
tibetan-ocr-evaluation/
├── Models.py
├── Modules.py
├── Utils.py
├── Demo-Evaluation.ipynb
├── Demo-Inference.ipynb
├── LICENSE
├── README.md
└── requirement.txt
Key Components
- Models.py and Modules.py: Define inference pipelines for models like CRNN, Easter2.0, and TrOCR.
- Utils.py: Provides helper functions for data processing and visualization.
- Demo Notebooks: Demo-Evaluation.ipynb and Demo-Inference.ipynb demonstrate the model evaluation and inference workflows.
Data Types and Training Workflows
The training repository supports two types of data (a conceptual sketch follows the list):
- Sample Random Data: Training uses shuffled data to expose the model to diverse patterns, which encourages generalization by reducing bias toward specific patterns.
- Data Distribution: Predefined subsets align training with real-world distributions, optimizing model performance for domain-specific datasets.
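As a rough illustration of the two regimes, the helpers below are mine, not the repository's API: random sampling shuffles everything together, while distribution-based sampling draws a fixed share from each predefined subset.
import random

# Illustrative helpers only; not the repository's actual API.

def sample_random(image_paths, label_paths, seed=42):
    """Shuffle images and labels together so pairs stay aligned."""
    pairs = list(zip(image_paths, label_paths))
    random.Random(seed).shuffle(pairs)
    images, labels = zip(*pairs)
    return list(images), list(labels)

def sample_from_distribution(subsets, weights, total):
    """Draw a fixed share of the training set from each predefined subset."""
    picked = []
    for subset, weight in zip(subsets, weights):
        k = min(int(weight * total), len(subset))
        picked.extend(random.sample(subset, k))
    return picked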
CRNN Training Workflow
CRNN (Convolutional Recurrent Neural Network) combines CNNs for feature extraction and RNNs for sequence modeling. Here’s a workflow for training CRNN on shuffled data:
# Imports assumed to come from the repository's BudaOCR package (exact paths may differ):
# from BudaOCR.Modules import OCRTrainer, shuffle_data
# from BudaOCR.Models import CRNNNetwork

# Shuffle image and label paths together to randomize the training order
image_paths, label_paths = shuffle_data(image_paths, label_paths)

ocr_trainer = OCRTrainer(
    network=CRNNNetwork(image_width=3200, image_height=100, num_classes=num_classes),
    label_encoder=wylie_encoder,
    batch_size=16,
    output_dir="Output",
)
ocr_trainer.init(image_paths, label_paths)
ocr_trainer.train(epochs=48, check_cer=True)  # report CER while training
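CRNN models of this kind are typically trained with CTC loss, so the raw network output is one distribution per timestep over the character classes plus a blank token. The repository's decoding step isn't shown above, but a standard greedy CTC decode, sketched here as an assumption (blank id 0 is illustrative), looks like this:
import numpy as np

def greedy_ctc_decode(logits: np.ndarray, charset: list, blank: int = 0) -> str:
    """Greedy CTC decode: best class per timestep, collapse repeats, drop blanks.
    charset maps class ids to characters; index 0 is the blank placeholder."""
    ids = logits.argmax(axis=-1)  # shape (timesteps,)
    collapsed = [int(i) for k, i in enumerate(ids) if k == 0 or i != ids[k - 1]]
    return "".join(charset[i] for i in collapsed if i != blank)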
Easter2.0 Training Workflow
Easter2.0 is a lightweight model optimized for OCR tasks on smaller datasets. Training with a predefined data distribution tailors the model to Tibetan-specific text patterns for domain-specific performance:
# build_distribution_from_file and EasterNetwork are assumed to live in the BudaOCR package
distribution = build_distribution_from_file(distr_file, dataset_path)

ocr_trainer = OCRTrainer(
    network=EasterNetwork(image_width=3200, image_height=100, num_classes=num_classes),
    label_encoder=wylie_encoder,
    batch_size=16,
    output_dir="Output",
)
ocr_trainer.init_from_distribution(distribution)
ocr_trainer.train(epochs=80, scheduler_start=62)  # scheduler_start presumably delays the LR schedule until epoch 62
Role of Encoders
Encoders play a crucial role in transforming Tibetan text into machine-readable formats. Two key encoders are:
- Wylie Encoder:
- Converts Tibetan text into Wylie transliteration.
- Facilitates character-level recognition and alignment with model outputs.
- Stack Encoder:
- Encodes Tibetan stacks (vertically composed glyph units) as numeric classes.
- Ensures compatibility with OCR models by simplifying input processing.
These encoders preserve the linguistic integrity of Tibetan text during training and evaluation.
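Conceptually, both encoders implement the same small interface: map a transcription to integer class ids for training, and map ids back to text at inference. A character-level sketch (illustrative only; the repository's wylie_encoder operates on Wylie transliterations):
class CharLabelEncoder:
    """Illustrative character-level label encoder; id 0 is reserved for the CTC blank."""
    def __init__(self, charset: str):
        self.id2char = ["<blank>"] + list(charset)
        self.char2id = {c: i for i, c in enumerate(self.id2char)}

    def encode(self, text: str) -> list:
        return [self.char2id[c] for c in text]

    def decode(self, ids: list) -> str:
        return "".join(self.id2char[i] for i in ids if i != 0)

encoder = CharLabelEncoder("bkrashi ")  # toy Wylie charset
print(encoder.decode(encoder.encode("bkra shis")))  # prints "bkra shis"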
Evaluation Insights
The Tibetan OCR Evaluation repository provides a robust framework for assessing model performance. The workflow demonstrated in Demo-Evaluation.ipynb focuses on comparing multiple OCR models using CER as the primary evaluation metric.
Evaluation Workflow
1. Dataset and Model Preparation
Download datasets and pre-trained models from the Hugging Face Hub:
from glob import glob
from huggingface_hub import snapshot_download

# Download the dataset, then collect line images and their transcriptions.
# sorted() keeps image/label pairs aligned when they are zipped together later.
data_path = snapshot_download(repo_id="BDRC/KhyentseWangpo", repo_type="dataset", cache_dir="Datasets")
lines = sorted(glob(f"{data_path}/lines/*.jpg"))
labels = sorted(glob(f"{data_path}/transcriptions/*.txt"))
2. CRNN Evaluation
Perform inference using the CRNN model:
import cv2
from evaluate import load
from Modules import CRNNInference
from Utils import read_ctc_model_config  # assumed location of this helper

cer_scorer = load("cer")  # Hugging Face evaluate metric for Character Error Rate

model_path = snapshot_download(repo_id="BDRC/GoogleBooks_C_v1", repo_type="model", local_dir="Models")
crnn_inference = CRNNInference(read_ctc_model_config(f"{model_path}/config.json"))

cer_scores = []
for line, label in zip(lines, labels):
    prediction = crnn_inference.predict(cv2.imread(line))
    reference = open(label, encoding="utf-8").read().strip()  # labels are file paths
    cer_score = cer_scorer.compute(predictions=[prediction], references=[reference])
    cer_scores.append(cer_score)
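With one score per line, a natural corpus-level summary is the mean of the per-line scores:
mean_cer = sum(cer_scores) / len(cer_scores)
print(f"CRNN mean CER: {mean_cer:.4f}")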
3. Easter2.0 and TrOCR Evaluation
Similarly, evaluate the Easter2.0 and TrOCR models using their respective inference pipelines:
from Modules import Easter2Inference, TrOCRInference

# Easter2.0: note that model_path must point at an Easter2.0 checkpoint,
# downloaded with snapshot_download just like the CRNN model above
easter_inference = Easter2Inference(read_ctc_model_config(f"{model_path}/config.json"))

# TrOCR: the inference wrapper takes the downloaded model directory directly
trocr_inference = TrOCRInference(snapshot_download(repo_id="BDRC/GoogleBooks_T_v1", repo_type="model", local_dir="Models"))
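Once all three pipelines are loaded, comparison reduces to running the same loop over each. A sketch, assuming each inference object exposes the same predict(image) method as the CRNN pipeline:
results = {}
for name, pipeline in [("CRNN", crnn_inference),
                       ("Easter2.0", easter_inference),
                       ("TrOCR", trocr_inference)]:
    scores = []
    for line, label in zip(lines, labels):
        prediction = pipeline.predict(cv2.imread(line))
        reference = open(label, encoding="utf-8").read().strip()
        scores.append(cer_scorer.compute(predictions=[prediction], references=[reference]))
    results[name] = sum(scores) / len(scores)
print(results)  # lower CER is better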
Conclusion
Key Takeaways
- Tibetan OCR Training Repository:
- Offers robust workflows for CRNN and Easter2.0 model development.
- Tibetan OCR Evaluation Repository:
- Provides precise evaluation workflows for CRNN, Easter2.0, and TrOCR models.
- Model Comparison:
- CRNN excels with larger datasets.
- Easter2.0 is lightweight and suitable for real-time tasks.
- TrOCR provides versatility for various OCR needs.
- Encoders:
- Wylie and Stack Encoders ensure accurate text representation and model compatibility.
Citations
- Tibetan OCR Training Repository: GitHub Link
- Tibetan OCR Evaluation Repository: GitHub Link