PRD - OCR Training & Evaluation Platform

:compass: Purpose and Demographic

✦ Mission Statement

To empower researchers and developers to efficiently train, evaluate, and improve Tibetan OCR models by providing a comprehensive platform for dataset preparation, model training, and performance validation.

✦ Target Demographic

  • Machine learning researchers working on Tibetan language technologies
  • OCR engineers and data annotators
  • Digital humanities scholars and archivists working with Tibetan manuscripts
  • Institutions like BDRC, Esukhia, Monlam, and academic projects focusing on Tibetan texts

✦ Problem Statement

Training accurate OCR models for Tibetan texts is currently a fragmented and largely manual process. Dedicated platforms for collecting and labeling data, and for training and validating OCR models on it, are lacking. This platform aims to consolidate all of these steps into a single environment, making the process faster, more reliable, and reproducible.


:bullseye: Product Objectives

✦ Core Objectives

  • Enable streamlined collection and labeling of image-text pairs
  • Support training of OCR models (e.g., TrOCR, Tesseract, or custom models)
  • Provide a visual metrics dashboard for evaluating model performance
  • Integrate benchmark validation against existing gold-standard datasets
  • Export trained models and evaluation reports
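
The PRD does not name the metrics the dashboard will show. As a minimal sketch, assuming character error rate (CER) is one of them, the function below computes CER as the Levenshtein edit distance between a reference transcription and a model output, divided by the reference length. The function names are hypothetical, and the sketch counts Unicode code points; whether Tibetan stacks should instead be counted as single units is a design decision this document does not settle.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER over Unicode code points (assumption: not grapheme clusters)."""
    if not reference:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty as well.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("abcd", "abxd")` gives 0.25, since one substitution is needed across four reference characters. Note that CER can exceed 1.0 when the hypothesis is much longer than the reference.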

✦ Non-Goals

  • The platform will not handle handwriting recognition in its initial version
  • The platform will not include end-user text correction interfaces (these are handled by post-processing tools)
  • The initial release is not designed for multilingual OCR beyond Tibetan

✦ Impact Areas

  • Enhances quality and quantity of usable Tibetan OCR data
  • Accelerates the development and deployment of OCR models
  • Contributes to better digitization of Tibetan cultural heritage

:light_bulb: Example Use Cases

✦ Use Case: Tenzin (ML Researcher)

  • Tenzin uploads aligned pecha scans and text to prepare training data
  • Tenzin then trains a new OCR model on the collected data and evaluates it against benchmark sets

✦ Use Case: Sonam (Annotator)

  • Sonam labels image-text pairs and verifies existing alignments using the platform’s tools
  • Sonam also reviews model outputs and flags inconsistencies
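
One way the review step could be supported is by pre-selecting outputs that differ noticeably from their reference text, so an annotator looks at those first. The sketch below is an illustration only: `flag_for_review`, the tuple layout, and the 0.9 threshold are hypothetical, and similarity is measured with the standard library's `difflib.SequenceMatcher` rather than any metric the platform has committed to.

```python
from difflib import SequenceMatcher


def flag_for_review(samples, min_similarity=0.9):
    """Return ids of samples whose prediction differs too much from the reference.

    `samples` is an iterable of (sample_id, reference, prediction) tuples.
    The threshold is an assumption and would need tuning on real data.
    """
    flagged = []
    for sample_id, reference, prediction in samples:
        matcher = SequenceMatcher(None, reference, prediction, autojunk=False)
        if matcher.ratio() < min_similarity:
            flagged.append(sample_id)
    return flagged
```

Calling it with one exact match and one completely different output, for instance `flag_for_review([("p1", "abcd", "abcd"), ("p2", "abcd", "wxyz")])`, returns only the id of the second sample.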

:building_construction: Architectural Considerations

✦ Tech Stack

  • Backend: FastAPI
  • Frontend: React.js
  • ML training: PyTorch, Hugging Face
  • Database: PostgreSQL (for data and metadata), S3-compatible object storage for images
  • Auth: Auth0 or similar provider
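
To make the storage split concrete, here is a rough sketch of how one image-text pair might be represented: the metadata and transcription would live in PostgreSQL, while the image itself is referenced by an object-storage key. The data schema is still to be finalized, so every field name, the `PairStatus` values, and the validation rules below are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class PairStatus(str, Enum):
    """Hypothetical labeling states for an image-text pair."""
    UNLABELED = "unlabeled"
    LABELED = "labeled"
    VERIFIED = "verified"


@dataclass(frozen=True)
class ImageTextPair:
    pair_id: str
    image_key: str       # key of the image in S3-compatible object storage
    transcription: str   # Tibetan text aligned with the image
    status: PairStatus = PairStatus.UNLABELED

    def __post_init__(self) -> None:
        if not self.image_key:
            raise ValueError("image_key must not be empty")
        if self.status is not PairStatus.UNLABELED and not self.transcription.strip():
            raise ValueError("labeled or verified pairs need a transcription")


def to_row(pair: ImageTextPair) -> dict:
    """Flatten a pair into a plain dict, e.g. for a database insert."""
    row = asdict(pair)
    row["status"] = pair.status.value  # store the plain string, not the enum
    return row
```

Keeping only the object key in the record, rather than the image bytes, matches the stack described above, where images sit in object storage and the database holds data and metadata.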

✦ System Diagram

(Optional: can be added later, showing the flow from data upload → labeling → training → evaluation)

✦ Security & Privacy

  • Annotated data and model checkpoints stored securely
  • Role-based access to prevent unauthorized access to model configurations and benchmark sets
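
The role-based access requirement can be pictured as a lookup from role to permitted actions. The following is a minimal sketch under stated assumptions: the role names loosely follow the use cases in this document, but the action names and the permission table are hypothetical, and a real deployment would derive roles from the auth provider rather than hard-code them.

```python
from enum import Enum


class Role(Enum):
    ANNOTATOR = "annotator"
    RESEARCHER = "researcher"
    ADMIN = "admin"


# Hypothetical permission table; the actual actions are not defined in this PRD.
_ANNOTATOR_ACTIONS = frozenset({"label_pairs", "review_outputs"})
_RESEARCHER_ACTIONS = _ANNOTATOR_ACTIONS | {"edit_model_config", "run_training"}
_ADMIN_ACTIONS = _RESEARCHER_ACTIONS | {"manage_benchmarks"}

PERMISSIONS = {
    Role.ANNOTATOR: _ANNOTATOR_ACTIONS,
    Role.RESEARCHER: _RESEARCHER_ACTIONS,
    Role.ADMIN: _ADMIN_ACTIONS,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the role may perform the action; unknown actions are denied."""
    return action in PERMISSIONS.get(role, frozenset())
```

With this table, `is_allowed(Role.ANNOTATOR, "edit_model_config")` is false while `is_allowed(Role.RESEARCHER, "edit_model_config")` is true, which is the kind of separation the bullet above calls for.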

✦ Dependencies

  • Hugging Face Transformers
  • PyTorch Lightning or Accelerate
  • Tesseract for baseline model
  • BDRC APIs for data integration

✦ Scalability & Maintenance

  • Designed with modular microservices for scaling individual components (training, labeling, visualization)
  • Deployment will use Docker, with a CI/CD pipeline for rolling out updates

:busts_in_silhouette: Participants

✦ Working Group Members

  • Lobsang (ML Engineer) – Model training and evaluation
  • Pema (Frontend Developer) – Interface development
  • Tashi (Product Owner) – Platform architecture and use-case alignment
  • Kunga (Data Engineer) – Data ingestion and preprocessing

✦ Stakeholders

  • OpenPecha Core Team
  • Monlam OCR Initiative
  • Esukhia Digital Projects
  • Academic Collaborators

✦ Point of Contact

Tashi – tashi@openpecha.org


:vertical_traffic_light: Project Status

✦ Current Phase

Planning and early prototype development

✦ Milestones

  • Finalize data schema for image-text pairs
  • Build annotation UI
  • Integrate training engine and metrics dashboard
  • Conduct first model benchmark evaluation

✦ Roadmap

Quarter    Deliverables
Q2 2025    Annotation UI, image-text pair manager
Q3 2025    Model training module, metrics visualization
Q4 2025    Benchmark validation, export functions, public release

:spiral_calendar: Meeting Times

✦ Regular Schedule

Every Tuesday at 4 PM IST via Zoom

✦ Meeting Notes

Meeting Notes Folder


:hammer_and_wrench: What We’re Working On

We maintain a public task board with all active issues and discussions.

:right_arrow: View GitHub Project Board