Purpose and Demographic
✦ Mission Statement
To empower researchers and developers to efficiently train, evaluate, and improve Tibetan OCR models by providing a comprehensive platform for dataset preparation, model training, and performance validation.
✦ Target Demographic
- Machine learning researchers working on Tibetan language technologies
- OCR engineers and data annotators
- Digital humanities scholars and archivists working with Tibetan manuscripts
- Institutions like BDRC, Esukhia, Monlam, and academic projects focusing on Tibetan texts
✦ Problem Statement
The process of training accurate OCR models for Tibetan texts is currently fragmented and manual: there is no dedicated platform for collecting, labeling, training, and validating models efficiently. This platform consolidates all of these steps into a single environment, making the process faster, more reliable, and reproducible.
Product Objectives
✦ Core Objectives
- Enable streamlined collection and labeling of image-text pairs
- Support training of OCR models (e.g., TrOCR, Tesseract, or custom models)
- Provide a visual metrics dashboard for evaluating model performance
- Integrate benchmark validation with existing gold standard datasets
- Export trained models and evaluation reports
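For the evaluation objective above, the standard OCR metric is character error rate (CER): edit distance between the model output and the reference, divided by the reference length. A minimal sketch (the function names and normalization choices here are illustrative, not part of the platform's API):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    if not reference:
        return float(hypothesis != "")
    return levenshtein(reference, hypothesis) / len(reference)

# Operates on Unicode code points, so Tibetan needs no special handling here,
# though normalizing both strings to NFC before comparison is advisable.
print(cer("བཀྲ་ཤིས་", "བཀྲ་ཤས་"))  # one missing vowel sign
```

A production dashboard would likely also report word/syllable error rate (splitting on the tsheg) and aggregate scores per dataset, but the per-pair computation reduces to this.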
✦ Non-Goals
- Handwriting recognition is out of scope for the initial version
- End-user text correction interfaces are not included (these are handled by post-processing tools)
- Multilingual OCR beyond Tibetan is not planned for the initial release
✦ Impact Areas
- Enhances quality and quantity of usable Tibetan OCR data
- Accelerates the development and deployment of OCR models
- Contributes to better digitization of Tibetan cultural heritage
Example Use Cases
✦ Use Case: Tenzin (ML Researcher)
- Tenzin uploads aligned pecha scans and text to prepare training data
- Trains a new OCR model using the collected data and evaluates it against benchmark sets
✦ Use Case: Sonam (Annotator)
- Sonam labels image-text pairs and verifies existing alignments using the platform’s tools
- Reviews model outputs and flags inconsistencies
Architectural Considerations
✦ Tech Stack
- Backend: FastAPI
- Frontend: React.js
- ML training: PyTorch, HuggingFace
- Database: PostgreSQL (for data and metadata), S3-compatible object storage for images
- Auth: Auth0 or similar provider
✦ System Diagram
(Optional: to be added later, showing the flow from data upload → labeling → training → evaluation)
✦ Security & Privacy
- Annotated data and model checkpoints stored securely
- Role-based access to prevent unauthorized access to model configurations and benchmark sets
✦ Dependencies
- Hugging Face Transformers
- PyTorch Lightning or Accelerate
- Tesseract as a baseline model
- BDRC APIs for data integration
✦ Scalability & Maintenance
- Designed with modular microservices for scaling individual components (training, labeling, visualization)
- Will use Docker for deployment and CI/CD pipeline for updates
Participants
✦ Working Group Members
- Lobsang (ML Engineer) – Model training and evaluation
- Pema (Frontend Developer) – Interface development
- Tashi (Product Owner) – Platform architecture and use-case alignment
- Kunga (Data Engineer) – Data ingestion and preprocessing
✦ Stakeholders
- OpenPecha Core Team
- Monlam OCR Initiative
- Esukhia Digital Projects
- Academic Collaborators
✦ Point of Contact
Tashi – tashi@openpecha.org
Project Status
✦ Current Phase
Planning and early prototype development
✦ Milestones
- Finalize data schema for image-text pairs
- Build annotation UI
- Integrate training engine and metrics dashboard
- Conduct first model benchmark evaluation
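The image-text pair schema is still to be finalized (the first milestone above). As a sketch of what such a record might carry, assuming fields that are plausible but not yet decided:

```python
from dataclasses import dataclass, field, asdict
from enum import Enum

class ReviewStatus(str, Enum):
    UNREVIEWED = "unreviewed"
    VERIFIED = "verified"
    FLAGGED = "flagged"       # e.g. an annotator spotted a misalignment

@dataclass
class ImageTextPair:
    """One line (or page) image aligned with its transcription."""
    pair_id: str
    image_key: str            # object-storage key for the scan crop
    text: str                 # Tibetan transcription, Unicode (NFC)
    source: str               # e.g. a BDRC work/volume identifier
    status: ReviewStatus = ReviewStatus.UNREVIEWED
    tags: list[str] = field(default_factory=list)

pair = ImageTextPair(
    pair_id="pair-000001",
    image_key="images/pair-000001.png",
    text="བཀྲ་ཤིས་བདེ་ལེགས།",
    source="bdrc:W12345",
)
print(pair.status.value)  # newly ingested pairs start unreviewed
```

An explicit review status supports the annotator workflow described in the use cases (verifying alignments, flagging inconsistencies) and makes it easy to restrict training runs to verified pairs.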
✦ Roadmap
| Quarter | Deliverables |
|---|---|
| Q2 2025 | Annotation UI, image-text pair manager |
| Q3 2025 | Model training module, metrics visualization |
| Q4 2025 | Benchmark validation, export functions, public release |
Meeting Times
✦ Regular Schedule
Every Tuesday at 4PM IST via Zoom
✦ Meeting Notes
Meeting Notes Folder
What We’re Working On
We maintain a public task board with all active issues and discussions.