Purpose and Demographic
◦ Mission Statement
To empower researchers and developers to efficiently train, evaluate, and improve Tibetan OCR models by providing a comprehensive platform for dataset preparation, model training, and performance validation.
◦ Target Demographic
- Machine learning researchers working on Tibetan language technologies
- OCR engineers and data annotators
- Digital humanities scholars and archivists working with Tibetan manuscripts
- Institutions like BDRC, Esukhia, Monlam, and academic projects focusing on Tibetan texts
◦ Problem Statement
Training accurate OCR models for Tibetan texts is currently a fragmented, manual process: no dedicated platform exists for collecting and labeling data, then training and validating OCR models efficiently. This platform consolidates all of these steps into a single environment, making the process faster, more reliable, and reproducible.
Product Objectives
◦ Core Objectives
- Enable streamlined collection and labeling of image-text pairs
- Support training of OCR models (e.g., TrOCR, Tesseract, or custom models)
- Provide visual metrics dashboard for evaluating model performance
- Integrate benchmark validation with existing gold standard datasets
- Export trained models and evaluation reports
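The metrics dashboard objective implies standard OCR metrics such as character error rate (CER). A minimal sketch of CER as edit distance normalized by reference length (pure Python; the function names are illustrative, not a committed platform API):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

# One dropped vowel sign out of 7 Unicode code points -> 1/7 ≈ 0.143
print(cer("བཀྲ་ཤིས", "བཀྲ་ཤས"))
```

Note that this counts Unicode code points, not rendered glyphs; for Tibetan stacks a syllable-level metric could also be reported alongside CER.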
◦ Non-Goals
- The platform will not handle handwriting recognition in its initial version
- Does not include end-user text correction interfaces (handled by post-processing tools)
- Not designed for multilingual OCR beyond Tibetan in the initial release
◦ Impact Areas
- Enhances quality and quantity of usable Tibetan OCR data
- Accelerates the development and deployment of OCR models
- Contributes to better digitization of Tibetan cultural heritage
Example Use Cases
◦ Use Case: Tenzin (ML Researcher)
- Tenzin uploads aligned pecha scans and text to prepare training data
- Trains a new OCR model using the collected data and evaluates it against benchmark sets
◦ Use Case: Sonam (Annotator)
- Sonam labels image-text pairs and verifies existing alignments using the platform's tools
- Reviews model outputs and flags inconsistencies
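Review work like Sonam's could be prioritized automatically by flagging model outputs that diverge sharply from verified transcriptions. A hedged sketch using the standard library's difflib (the threshold value and function name are illustrative assumptions):

```python
import difflib

def flag_inconsistencies(pairs, threshold=0.9):
    """Return indices of (gold, predicted) pairs whose similarity
    ratio falls below the threshold and so need human review."""
    flagged = []
    for idx, (gold, predicted) in enumerate(pairs):
        ratio = difflib.SequenceMatcher(None, gold, predicted).ratio()
        if ratio < threshold:
            flagged.append(idx)
    return flagged

pairs = [
    ("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤིས་བདེ་ལེགས"),  # exact match, not flagged
    ("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤིས་ལེགས"),       # dropped syllable, flagged
]
print(flag_inconsistencies(pairs))  # -> [1]
```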
Architectural Considerations
◦ Tech Stack
- Backend: FastAPI
- Frontend: React.js
- ML training: PyTorch, HuggingFace
- Database: PostgreSQL (for data and metadata), S3-compatible object storage for images
- Auth: Auth0 or similar provider
◦ System Diagram
(Optional — can be added later showing flow from data upload → labeling → training → evaluation)
◦ Security & Privacy
- Annotated data and model checkpoints stored securely
- Role-based access to prevent unauthorized access to model configurations and benchmark sets
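Role-based access could be expressed as a role-to-permission mapping enforced at the API layer. A minimal sketch (the role names and permission strings are assumptions for illustration, not the platform's actual scheme):

```python
# Illustrative role -> permission mapping (names are assumptions)
ROLE_PERMISSIONS = {
    "annotator": {"label:read", "label:write"},
    "researcher": {"label:read", "model:train", "benchmark:read"},
    "admin": {"label:read", "label:write", "model:train",
              "model:configure", "benchmark:read", "benchmark:write"},
}

def has_permission(role: str, permission: str) -> bool:
    """Check whether a role grants a given permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(has_permission("annotator", "model:configure"))  # -> False
```

In practice these roles would come from the Auth0 (or similar) identity token and the check would run in a FastAPI dependency.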
◦ Dependencies
- Hugging Face Transformers
- PyTorch Lightning or Accelerate
- Tesseract for baseline model
- BDRC APIs for data integration
◦ Scalability & Maintenance
- Designed with modular microservices for scaling individual components (training, labeling, visualization)
- Will use Docker for deployment and CI/CD pipeline for updates
Participants
◦ Working Group Members
- Lobsang (ML Engineer) – Model training and evaluation
- Pema (Frontend Developer) – Interface development
- Tashi (Product Owner) – Platform architecture and use-case alignment
- Kunga (Data Engineer) – Data ingestion and preprocessing
◦ Stakeholders
- OpenPecha Core Team
- Monlam OCR Initiative
- Esukhia Digital Projects
- Academic Collaborators
◦ Point of Contact
Tashi – tashi@openpecha.org
Project Status
◦ Current Phase
Planning and early prototype development
◦ Milestones
- Finalize data schema for image-text pairs
- Build annotation UI
- Integrate training engine and metrics dashboard
- Conduct first model benchmark evaluation
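The image-text pair schema milestone might look roughly like the following sketch (the field names, S3-key convention, and status values are assumptions for illustration, not the finalized schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ImageTextPair:
    """One line-level training sample: an image region plus its transcription."""
    pair_id: str                 # primary key in PostgreSQL
    image_key: str               # object key in S3-compatible storage
    text: str                    # Tibetan transcription (Unicode)
    annotator: Optional[str] = None   # who labeled or verified the pair
    status: str = "unverified"        # e.g. "unverified", "verified", "flagged"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

sample = ImageTextPair(
    pair_id="pecha-0001-line-042",
    image_key="scans/pecha-0001/line-042.png",
    text="བཀྲ་ཤིས་བདེ་ལེགས",
)
print(sample.status)  # -> unverified
```

Keeping text and metadata in PostgreSQL while only the `image_key` points into object storage matches the tech-stack split described above.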
◦ Roadmap

| Quarter | Deliverables |
|---|---|
| Q2 2025 | Annotation UI, image-text pair manager |
| Q3 2025 | Model training module, metrics visualization |
| Q4 2025 | Benchmark validation, export functions, public release |
Meeting Times
◦ Regular Schedule
Every Tuesday at 4 PM IST via Zoom
◦ Meeting Notes
Meeting Notes Folder
What We're Working On
We maintain a public task board with all active issues and discussions.