Purpose and Demographic
✦ Mission Statement
To streamline the transformation of scanned textual images into clean, accurate, and editable digital texts through an integrated OCR processing and correction pipeline.
✦ Target Demographic
- Digitization teams handling Buddhist texts and pechas
- Editors and scholars working with scanned archival material
- Digital libraries and repositories (e.g., BDRC, Internet Archive)
- Research institutions focused on textual scholarship and preservation
✦ Problem Statement
Manual OCR correction is time-consuming and inconsistent, and it requires domain-specific spelling tools. An efficient, semi-automated pipeline is needed that supports correction of OCR output with transparency and editorial oversight.
Product Objectives
✦ Core Objectives
- Support batch image preprocessing to improve OCR quality
- Integrate inference from pre-trained OCR models
- Provide an editor for side-by-side comparison and correction of OCR output
- Implement audit trails and spelling suggestions for editorial transparency
- Allow export of clean corrected text in OpenPecha format
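The audit-trail objective above can be illustrated with a minimal sketch. It assumes an append-only list of correction events per page; the names CorrectionEvent and CorrectedPage and their fields are hypothetical and are not part of any existing OpenPecha schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CorrectionEvent:
    """One editorial change to a line of OCR output (hypothetical schema)."""
    line_no: int
    before: str
    after: str
    editor: str
    timestamp: str


@dataclass
class CorrectedPage:
    """OCR text for one page plus an append-only history of edits."""
    lines: list[str]
    history: list[CorrectionEvent] = field(default_factory=list)

    def correct(self, line_no: int, new_text: str, editor: str) -> None:
        """Replace one line and record who changed what, and when."""
        old_text = self.lines[line_no]
        if old_text == new_text:
            return  # nothing changed, so nothing to audit
        self.history.append(
            CorrectionEvent(
                line_no=line_no,
                before=old_text,
                after=new_text,
                editor=editor,
                timestamp=datetime.now(timezone.utc).isoformat(),
            )
        )
        self.lines[line_no] = new_text

    def original_lines(self) -> list[str]:
        """Reconstruct the raw OCR output by undoing edits newest-first."""
        restored = list(self.lines)
        for event in reversed(self.history):
            restored[event.line_no] = event.before
        return restored
```

Because every event stores both the old and the new text, the raw OCR output stays recoverable after any number of edits, which is one way to keep corrections transparent to reviewers.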
✦ Non-Goals
- Does not provide translation features (handled by a separate product)
- Does not replace an OCR model training platform
- Does not address layout analysis beyond text recognition
✦ Impact Areas
- Enhances data quality for digital Tibetan corpora
- Reduces manual labor in digitization workflows
- Supports OpenPecha’s vision of trustworthy digital Buddhist knowledge
Example Use Cases
✦ Use Case: Digitization Staff
- Uploads batches of scanned images
- Runs OCR and corrects recognized text in an editor
- Exports final output with version control and correction history
✦ Use Case: Tibetan Text Scholar
- Reviews OCR text side-by-side with original scan
- Annotates misread characters
- Uses spelling suggestions to align with standard orthography
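As a rough illustration of the spelling-suggestion step, the sketch below splits Tibetan text on the tsheg (U+0F0B) and proposes close matches from a word list for any syllable not found in it. It uses Python's standard difflib as a stand-in; the function name suggest_corrections, the lexicon, and the similarity cutoff are assumptions, and a production system would likely rely on a dedicated Tibetan spell-checker or correction model.

```python
import difflib

TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def suggest_corrections(text, lexicon, max_suggestions=3, cutoff=0.6):
    """Map each unknown syllable in `text` to a list of close lexicon entries.

    `lexicon` is a set of known-good syllables (hypothetical input; the real
    dictionary source is not specified in this document).
    """
    candidates = sorted(lexicon)  # stable order for reproducible results
    suggestions = {}
    for syllable in text.split(TSHEG):
        if not syllable or syllable in lexicon:
            continue  # empty fragment or already a valid spelling
        suggestions[syllable] = difflib.get_close_matches(
            syllable, candidates, n=max_suggestions, cutoff=cutoff
        )
    return suggestions
```

With a lexicon containing the syllables of བཀྲ་ཤིས་བདེ་ལེགས, passing the misread text བཀྲ་ཤས་བདེ་ལེགས would flag only ཤས, leaving the editor to accept or reject the proposed replacement.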
Architectural Considerations
✦ Tech Stack
- Python backend (FastAPI)
- OCR model: TrOCR or Tesseract
- Frontend: React or Vue-based editor
- MongoDB/PostgreSQL for storage
- Redis for caching correction history
✦ System Diagram
OCR Pipeline → Inference Engine → Correction Editor UI → Export/OpenPecha
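The flow in the diagram can be sketched as a chain of independent stages, each taking and returning a job object. This is only a structural sketch: the PageJob type, the stage functions, and their placeholder bodies are hypothetical, and it assumes the first box ("OCR Pipeline") covers image preprocessing.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PageJob:
    """State carried through the pipeline for one scanned page (hypothetical)."""
    image_path: str
    ocr_text: str = ""
    corrected_text: str = ""
    log: list[str] = field(default_factory=list)


Stage = Callable[[PageJob], PageJob]


def preprocess(job: PageJob) -> PageJob:
    # Placeholder: real code would deskew, denoise, or binarize the image.
    job.log.append("preprocess")
    return job


def run_inference(job: PageJob) -> PageJob:
    # Placeholder: real code would call the OCR model (TrOCR or Tesseract).
    job.ocr_text = f"<ocr output for {job.image_path}>"
    job.log.append("inference")
    return job


def open_in_editor(job: PageJob) -> PageJob:
    # Placeholder: a human editor would revise the text here.
    job.corrected_text = job.ocr_text
    job.log.append("editor")
    return job


def export(job: PageJob) -> PageJob:
    # Placeholder: real code would serialize to the OpenPecha format.
    job.log.append("export")
    return job


PIPELINE: tuple[Stage, ...] = (preprocess, run_inference, open_in_editor, export)


def run_pipeline(job: PageJob, stages: tuple[Stage, ...] = PIPELINE) -> PageJob:
    """Pass the job through each stage in order."""
    for stage in stages:
        job = stage(job)
    return job
```

Keeping each stage behind the same call signature means the OCR engine or the editor can be replaced without touching the other stages.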
✦ Security & Privacy
- No public data is stored without consent
- Scanned images submitted for processing are assumed to be in the public domain or appropriately licensed
- User activity is tracked for audit purposes only
✦ Dependencies
- OCR model APIs (TrOCR/Tesseract)
- Frontend libraries for image-text alignment
- External spell-check dictionaries and correction models
✦ Scalability & Maintenance
- Modular service-based architecture to enable horizontal scaling
- Separation of OCR engine and editor for easier updates
- Logging and audit capabilities for long-term maintenance
Participants
✦ Working Group Members
- Tashi – Product Owner
- Ganga – ML Engineer (OCR pipeline)
- Sonam – Frontend Engineer
- Thinley – QA and Documentation
✦ Stakeholders
- BDRC Digitization Team
- OpenPecha Backend Platform
- Partner institutions such as Esukhia and Monlam AI
✦ Point of Contact
Tashi – tashi@openpecha.org
Project Status
✦ Current Phase
Development phase: Initial OCR inference and editor integration complete; working on correction history features
✦ Milestones
- Image Preprocessing Pipeline –
- OCR Inference Integration –
- Correction Editor MVP –
- Correction Audit Trail –
- OpenPecha Export –
✦ Roadmap
- July 2025 – Launch MVP
- August 2025 – Usability testing with internal users
- September 2025 – Public beta release
- Q4 2025 – Multi-language support
Meeting Times
✦ Regular Schedule
Every Tuesday at 4 PM IST via Zoom