PRD - OCR Processing & Correction Suite

Kaldan · June 20, 2025, 10:39am

Purpose and Demographic

✦ Mission Statement

To streamline the transformation of scanned textual images into clean, accurate, and editable digital texts through an integrated OCR processing and correction pipeline.

✦ Target Demographic

Digitization teams handling Buddhist texts and pechas
Editors and scholars working with scanned archival material
Digital libraries and repositories (e.g., BDRC, Internet Archive)
Research institutions focused on textual scholarship and preservation

✦ Problem Statement

Manual OCR correction is time-consuming and inconsistent, and it requires domain-specific spelling tools. An efficient, semi-automated pipeline is needed that supports correcting OCR output with transparency and editorial oversight.

Product Objectives

✦ Core Objectives

Support batch image preprocessing to improve OCR quality
Integrate inference from pre-trained OCR models
Provide an editor for side-by-side comparison and correction of OCR output
Implement audit trails and spelling suggestions for editorial transparency
Allow export of clean corrected text in OpenPecha format
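
As a rough illustration of the audit-trail objective, the sketch below logs every changed span whenever an editor saves a corrected version of the OCR text. All names here (`CorrectionEvent`, `AuditedText`, `apply_correction`) are hypothetical and not part of any existing OpenPecha code; it assumes a character-level diff is detailed enough for review and uses only the Python standard library.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class CorrectionEvent:
    """One recorded change: a span of the old text replaced by new text."""
    editor: str
    start: int       # character offset into the text before the edit
    before: str      # text that was removed (empty for a pure insertion)
    after: str       # text that was added (empty for a pure deletion)
    timestamp: str   # ISO 8601, UTC


@dataclass
class AuditedText:
    """Current text plus an append-only log of the edits that produced it."""
    text: str
    history: List[CorrectionEvent] = field(default_factory=list)

    def apply_correction(self, corrected: str, editor: str) -> List[CorrectionEvent]:
        """Diff the current text against `corrected` and log each changed span."""
        now = datetime.now(timezone.utc).isoformat()
        matcher = SequenceMatcher(a=self.text, b=corrected, autojunk=False)
        events = [
            CorrectionEvent(editor, i1, self.text[i1:i2], corrected[j1:j2], now)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]
        self.history.extend(events)
        self.text = corrected
        return events
```

Storing spans rather than full snapshots keeps the history compact and lets a reviewer see exactly which characters an editor touched. A production version would likely persist these events in the database rather than in memory.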

✦ Non-Goals

Does not provide translation features (handled by a separate product)
Does not replace an OCR model training platform
Does not address layout analysis beyond text recognition

✦ Impact Areas

Enhances data quality for digital Tibetan corpora
Reduces manual labor in digitization workflows
Supports OpenPecha’s vision of trustworthy digital Buddhist knowledge

Example Use Cases

✦ Use Case: Digitization Staff

Uploads batches of scanned images
Runs OCR and corrects recognized text in an editor
Exports final output with version control and correction history

✦ Use Case: Tibetan Text Scholar

Reviews OCR text side-by-side with original scan
Annotates misread characters
Uses spelling suggestions to align with standard orthography
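
One possible way to produce the spelling suggestions in this use case is to split a line into syllables at the tsheg (U+0F0B) and look up near matches for any syllable missing from a reference list. This is only a sketch: the four-syllable `LEXICON` is a stand-in for a real dictionary, `suggest_spellings` is a hypothetical name, and generic string similarity from Python's `difflib` is an assumption rather than the project's chosen method.

```python
from difflib import get_close_matches
from typing import Dict, Iterable, List

TSHEG = "\u0f0b"  # Tibetan syllable delimiter

# Stand-in lexicon for illustration only; a real one would hold many thousands of syllables.
LEXICON = {"བཀྲ", "ཤིས", "བདེ", "ལེགས"}


def suggest_spellings(
    line: str,
    lexicon: Iterable[str] = LEXICON,
    max_suggestions: int = 3,
    cutoff: float = 0.6,
) -> Dict[int, List[str]]:
    """Map the index of each unknown syllable to a list of close lexicon entries."""
    known = set(lexicon)
    candidates = sorted(known)  # sorted so results are deterministic
    suggestions: Dict[int, List[str]] = {}
    for index, syllable in enumerate(line.split(TSHEG)):
        if not syllable or syllable in known:
            continue  # skip empty pieces and correctly spelled syllables
        suggestions[index] = get_close_matches(
            syllable, candidates, n=max_suggestions, cutoff=cutoff
        )
    return suggestions
```

The function only proposes candidates and never rewrites the text, which leaves the final decision with the editor. It does not strip punctuation such as the shad, so a real implementation would need proper tokenization first.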

Architectural Considerations

✦ Tech Stack

Python backend (FastAPI)
OCR model: TrOCR or Tesseract
Frontend: React or Vue-based editor
MongoDB/PostgreSQL for storage
Redis for caching correction history

✦ System Diagram

OCR Pipeline → Inference Engine → Correction Editor UI → Export/OpenPecha
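
The flow in the diagram can be sketched as a chain of interchangeable stage functions, which also reflects the goal of keeping the OCR engine and the editor separate. Everything below is hypothetical: `PageJob`, the stage names, and their bodies are placeholders, with no real model call, editor interaction, or OpenPecha serialization.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageJob:
    """State carried through the pipeline for one scanned page."""
    image_id: str
    image_bytes: bytes
    ocr_text: str = ""
    corrected_text: str = ""
    exported: bool = False
    log: List[str] = field(default_factory=list)


Stage = Callable[[PageJob], PageJob]


def preprocess(job: PageJob) -> PageJob:
    # Placeholder: real code would clean up the image before recognition.
    return job


def infer(job: PageJob) -> PageJob:
    # Placeholder for a call to the OCR model; returns dummy text.
    job.ocr_text = f"<recognized text of {job.image_id}>"
    return job


def correct(job: PageJob) -> PageJob:
    # Placeholder for the human editing step; passes the OCR text through unchanged.
    job.corrected_text = job.ocr_text
    return job


def export(job: PageJob) -> PageJob:
    # Placeholder: real code would serialize the corrected text for OpenPecha.
    job.exported = True
    return job


def run_pipeline(job: PageJob, stages: List[Stage]) -> PageJob:
    """Run each stage in order and record its name for traceability."""
    for stage in stages:
        job = stage(job)
        job.log.append(stage.__name__)
    return job
```

Calling `run_pipeline(PageJob("page-001", b"..."), [preprocess, infer, correct, export])` walks one page through all four stages. Because each stage shares the same signature, swapping in a different OCR engine would only mean replacing `infer`.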

✦ Security & Privacy

No public data is stored without consent
Scanned images submitted for processing are assumed to be in the public domain or appropriately licensed
User activity is tracked for audit purposes only

✦ Dependencies

OCR model APIs (TrOCR/Tesseract)
Frontend libraries for image-text alignment
External spell-check dictionaries and correction models

✦ Scalability & Maintenance

Modular service-based architecture to enable horizontal scaling
Separation of OCR engine and editor for easier updates
Logging and audit capabilities for long-term maintenance

Participants

✦ Working Group Members

Tashi – Product Owner
Ganga – ML Engineer (OCR pipeline)
Sonam – Frontend Engineer
Thinley – QA and Documentation

✦ Stakeholders

BDRC Digitization Team
OpenPecha Backend Platform
Partner institutions such as Esukhia and Monlam AI

✦ Point of Contact

Tashi – tashi@openpecha.org

Project Status

✦ Current Phase

Development phase: initial OCR inference and editor integration are complete; correction history features are in progress

✦ Milestones

Image Preprocessing Pipeline –
OCR Inference Integration –
Correction Editor MVP –
Correction Audit Trail –
OpenPecha Export –

✦ Roadmap

July 2025 – Launch MVP
August 2025 – Usability testing with internal users
September 2025 – Public beta release
Q4 2025 – Multi-language support

Meeting Times

✦ Regular Schedule

Every Tuesday at 4 PM IST via Zoom

✦ Meeting Notes

Team Notes & Sprint Plans

What We’re Working On

View GitHub Project Board
