Purpose and Demographic
◦ Mission Statement
To empower researchers and developers to efficiently train, evaluate, and improve Tibetan OCR models by providing a comprehensive platform for dataset preparation, model training, and performance validation.
◦ Target Demographic
- Machine learning researchers working on Tibetan language technologies
- OCR engineers and data annotators
- Digital humanities scholars and archivists working with Tibetan manuscripts
- Institutions like BDRC, Esukhia, Monlam, and academic projects focusing on Tibetan texts
◦ Problem Statement
Training accurate OCR models for Tibetan texts is currently a fragmented, manual process: no dedicated platform exists for collecting and labeling data, then training and validating OCR models efficiently. This platform consolidates all of these steps into a single environment, making the process faster, more reliable, and reproducible.
Product Objectives
◦ Core Objectives
- Enable streamlined collection and labeling of image-text pairs
- Support training of OCR models (e.g., TrOCR, Tesseract, or custom models)
- Provide visual metrics dashboard for evaluating model performance
- Integrate benchmark validation with existing gold standard datasets
- Export trained models and evaluation reports
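The metrics dashboard objective implies standard OCR metrics such as character error rate (CER). A minimal sketch of CER as edit distance normalized by reference length (pure Python; the function names are illustrative, not a committed platform API):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

# One dropped vowel sign out of 7 Unicode code points -> 1/7 ≈ 0.143
print(cer("བཀྲ་ཤིས", "བཀྲ་ཤས"))
```

Note that this counts Unicode code points, not rendered glyphs; for Tibetan stacks a syllable-level metric could also be reported alongside CER.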
◦ Non-Goals
- The platform will not handle handwriting recognition in its initial version
- Does not include end-user text correction interfaces (handled by post-processing tools)
- Not designed for multilingual OCR beyond Tibetan in the initial release
◦ Impact Areas
- Enhances quality and quantity of usable Tibetan OCR data
- Accelerates the development and deployment of OCR models
- Contributes to better digitization of Tibetan cultural heritage
Example Use Cases
◦ Use Case: Tenzin (ML Researcher)
- Tenzin uploads aligned pecha scans and text to prepare training data
- Trains a new OCR model using the collected data and evaluates it against benchmark sets
◦ Use Case: Sonam (Annotator)
- Sonam labels image-text pairs and verifies existing alignments using the platform's tools
- Reviews model outputs and flags inconsistencies
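Review work like Sonam's could be prioritized automatically by flagging model outputs that diverge sharply from verified transcriptions. A hedged sketch using the standard library's difflib (the threshold value and function name are illustrative assumptions):

```python
import difflib

def flag_inconsistencies(pairs, threshold=0.9):
    """Return indices of (gold, predicted) pairs whose similarity
    ratio falls below the threshold and so need human review."""
    flagged = []
    for idx, (gold, predicted) in enumerate(pairs):
        ratio = difflib.SequenceMatcher(None, gold, predicted).ratio()
        if ratio < threshold:
            flagged.append(idx)
    return flagged

pairs = [
    ("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤིས་བདེ་ལེགས"),  # exact match, not flagged
    ("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤིས་ལེགས"),       # dropped syllable, flagged
]
print(flag_inconsistencies(pairs))  # -> [1]
```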
Architectural Considerations
◦ Tech Stack
- Backend: FastAPI
- Frontend: React.js
- ML training: PyTorch, HuggingFace
- Database: PostgreSQL (for data and metadata), S3-compatible object storage for images
- Auth: Auth0 or similar provider
◦ System Diagram
(Optional — can be added later showing flow from data upload → labeling → training → evaluation)
◦ Security & Privacy
- Annotated data and model checkpoints stored securely
- Role-based access to prevent unauthorized access to model configurations and benchmark sets
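Role-based access could be expressed as a role-to-permission mapping enforced at the API layer. A minimal sketch (the role names and permission strings are assumptions for illustration, not the platform's actual scheme):

```python
# Illustrative role -> permission mapping (names are assumptions)
ROLE_PERMISSIONS = {
    "annotator": {"label:read", "label:write"},
    "researcher": {"label:read", "model:train", "benchmark:read"},
    "admin": {"label:read", "label:write", "model:train",
              "model:configure", "benchmark:read", "benchmark:write"},
}

def has_permission(role: str, permission: str) -> bool:
    """Check whether a role grants a given permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(has_permission("annotator", "model:configure"))  # -> False
```

In practice these roles would come from the Auth0 (or similar) identity token and the check would run in a FastAPI dependency.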
◦ Dependencies
- Hugging Face Transformers
- PyTorch Lightning or Accelerate
- Tesseract for baseline model
- BDRC APIs for data integration
◦ Scalability & Maintenance
- Designed with modular microservices for scaling individual components (training, labeling, visualization)
- Will use Docker for deployment and CI/CD pipeline for updates
Participants
◦ Working Group Members
- Lobsang (ML Engineer) – Model training and evaluation
- Pema (Frontend Developer) – Interface development
- Tashi (Product Owner) – Platform architecture and use-case alignment
- Kunga (Data Engineer) – Data ingestion and preprocessing
◦ Stakeholders
- OpenPecha Core Team
- Monlam OCR Initiative
- Esukhia Digital Projects
- Academic Collaborators
◦ Point of Contact
Tashi – tashi@openpecha.org
Project Status
◦ Current Phase
Planning and early prototype development
◦ Milestones
- Finalize data schema for image-text pairs
- Build annotation UI
- Integrate training engine and metrics dashboard
- Conduct first model benchmark evaluation
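The image-text pair schema milestone might look roughly like the following sketch (the field names, S3-key convention, and status values are assumptions for illustration, not the finalized schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ImageTextPair:
    """One line-level training sample: an image region plus its transcription."""
    pair_id: str                 # primary key in PostgreSQL
    image_key: str               # object key in S3-compatible storage
    text: str                    # Tibetan transcription (Unicode)
    annotator: Optional[str] = None   # who labeled or verified the pair
    status: str = "unverified"        # e.g. "unverified", "verified", "flagged"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

sample = ImageTextPair(
    pair_id="pecha-0001-line-042",
    image_key="scans/pecha-0001/line-042.png",
    text="བཀྲ་ཤིས་བདེ་ལེགས",
)
print(sample.status)  # -> unverified
```

Keeping text and metadata in PostgreSQL while only the `image_key` points into object storage matches the tech-stack split described above.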
◦ Roadmap

| Quarter | Deliverables |
|---|---|
| Q2 2025 | Annotation UI, image-text pair manager |
| Q3 2025 | Model training module, metrics visualization |
| Q4 2025 | Benchmark validation, export functions, public release |
Meeting Times
◦ Regular Schedule
Every Tuesday at 4 PM IST via Zoom
◦ Meeting Notes
Meeting Notes Folder
What We're Working On
We maintain a public task board with all active issues and discussions.