Data Requirements Document

OpenPecha AI-Powered Textual Supply Chain (Stages 1-7)

Version: 1.18 Date: June 20, 2025

1. Executive Summary

This document outlines the data requirements for the first seven stages of the OpenPecha textual supply chain. The primary objective is to create a robust, scalable, and iterative process for producing high-fidelity, factual AI-generated translations and adaptations of Buddhist source texts.

A core principle of this architecture is verifiable provenance. Every piece of data, from a translation choice to an entity link, must be traceable back to its source text, reference materials, and the specific AI model or human editor responsible for its creation. The system is designed as a non-linear feedback loop, where improvements at any stage can propagate both upstream and downstream. For example, an ambiguity encountered during translation (Stage 6) might reveal a spelling mistake in the master text, triggering an update to the critical apparatus (Stage 4), which in turn could be traced back to a transcription error that is then corrected in the original TEI witness (Stage 3).

2. Core Architectural Principles

2.1. Non-Linear & Iterative Workflow

The data flow is not a simple one-way pipeline. It is a dynamic ecosystem where stages are interconnected. For example, an entity extracted in Stage 5 might reveal a transcription error, triggering a correction in Stage 3, which in turn updates the critical text in Stage 4 and all dependent translations and annotations in Stages 6 and 7.

Visual representation of the feedback loops:

graph TD
    subgraph "Knowledge Layer"
        F[Wikidata Knowledge Graph]
        G[Wikipedia Knowledge Base]
    end

    subgraph "Source & Textual Layer"
        A[1. Gather & Model Sources] -->|scans, audio| B(2. Preserve);
        B -->|images| C{3. Diplomatic Transcriptions};
        C -->|TEI| D(4. Critical Etext);
        D -->|STAM Master| E(5. Extract Knowledge);
        E -->|annotations| F;
        E -->|knowledge base articles| G;
    end

    subgraph "Output & Adaptation Layer"
        D -->|STAM Master| H(6. AI Translation);
        F & G & D -->|facts & source| H;
        H -->|STAM w/ Translation| I(7. AI Adaptations);
    end

    %% Feedback Loops
    E -- AI Entity Extraction --> A;
    E -- Corrects Bibliography --> B;
    H -- Reveals Ambiguity --> D;
    I -- Identifies Concepts --> E;

2.2. Metadata Model: BibFrame 2.0 with Work Hubs

We will adopt BibFrame 2.0, utilizing a hierarchical model to manage the complex network of relationships between texts, translations, editions, and other related resources.

  • Work Hub: A conceptual grouping that links a primary Work with other related Works. The Hub is the locus for relationships, connecting a core text to its translations, its source language Works, and other relevant Works such as commentaries, summaries, or adaptations (e.g., “The Heart Sūtra Hub”).
  • Work: The conceptual essence of a single distinct text within a Hub (e.g., The Sanskrit source text of the Heart Sūtra, or the English translation by Conze).
  • Instance: A specific material embodiment, version, or publication of a Work (e.g., the Taishō edition, a specific Sanskrit manuscript, the English translation by X, a TEI file, the STAM master file).
  • Item: An actual copy of an Instance (e.g., a specific server’s copy of the TEI file, a particular physical manuscript in a library).
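
As a minimal illustration of this hierarchy, the following Python sketch models the Hub → Work → Instance → Item chain. Class and field names are illustrative assumptions, not a normative schema; in production this structure lives in the bibliographic graph database described in Stage 1.

# Minimal sketch of the BibFrame 2.0 hierarchy used in this document.
# Class and field names are illustrative, not a normative schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Item:
    item_id: str               # e.g., a specific server's copy of a TEI file, or a shelfmark
    instance_id: str           # the Instance this copy belongs to

@dataclass
class Instance:
    instance_id: str           # e.g., a BDRC ID for a specific edition or file
    work_id: str               # the Work this Instance embodies
    label: str                 # e.g., "Taishō edition", "STAM master file"

@dataclass
class Work:
    work_id: str               # e.g., a Wikidata Q-ID
    label: str                 # e.g., "Heart Sūtra (Sanskrit source text)"
    instances: List[Instance] = field(default_factory=list)

@dataclass
class WorkHub:
    hub_id: str
    label: str                 # e.g., "The Heart Sūtra Hub"
    works: List[Work] = field(default_factory=list)
    # (work_id, relation, work_id) triples, e.g. ("W1", "hasTranslation", "W2")
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)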

2.3. Annotation Model: STAM (Stand-off Annotation Model)

Given that the base text is by nature continuously improving as the project progresses, STAM is the designated model for the master text (from Stage 4 onwards).

  • Pointer Update Mechanism (Critical Function): Because the base text evolves, a core system function must be the ability to automatically update all annotation pointers whenever the base text is modified. When a change is committed to the master text, the system will calculate the difference (diff) and adjust the character offsets for every annotation in every associated layer, ensuring that no data connections are broken (a minimal sketch of this offset adjustment follows this list).
  • Universal Addressability: Every annotation and relationship within the system will be assigned a persistent, unique identifier. This ensures that any piece of data can be directly referenced, updated, or itself be the target of another annotation.
  • Meta-Annotation: The system will support annotations of annotations. For example, a user could comment on a specific translation choice (which is an annotation), or an AI process could flag a named-entity link (an annotation) as having low confidence. This is enabled by the universal addressability of all annotations.
  • Layering: This model allows all data points—translations, commentaries, named entities, linguistic markup, critical apparatus notes, user comments—to be stored as distinct, queryable layers referencing the master text. This separation of concerns is critical for managing complexity.
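
A minimal sketch of the pointer update mechanism described above, using Python's standard difflib to compute the diff between the old and new base text and shift annotation offsets accordingly; the annotation structure (dicts with start/end offsets) is an illustrative assumption:

# Sketch: shift STAM-style character offsets after the base text changes.
# Uses difflib opcodes to determine how each edit moves downstream offsets.
import difflib

def shift_offsets(old_text, new_text, annotations):
    """annotations: list of dicts with 'start' and 'end' character offsets."""
    opcodes = difflib.SequenceMatcher(None, old_text, new_text).get_opcodes()

    def map_offset(pos):
        delta = 0
        for tag, i1, i2, j1, j2 in opcodes:
            if tag == 'equal' or i1 >= pos:
                continue
            if i2 <= pos:
                # Edit lies entirely before this offset: accumulate the size change.
                delta += (j2 - j1) - (i2 - i1)
            else:
                # Offset falls inside an edited region: clamp to the region's new start.
                return j1
        return pos + delta

    for ann in annotations:
        ann['start'] = map_offset(ann['start'])
        ann['end'] = map_offset(ann['end'])
    return annotations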

2.4. Archival Format: TEI (Text Encoding Initiative)

For long-term preservation and interoperability with digital humanities standards, all diplomatic transcriptions (Stage 3) will be encoded in TEI XML. This ensures archival-quality storage at BDRC and other institutions.

3. Detailed Data Requirements by Stage

Stage 1: Gather and Model Bibliographic Data

  • Objective: To catalog primary source materials and model their complex bibliographic relationships using the BibFrame 2.0 framework.

  • Input Data:

  • Physical manuscripts, woodblock prints, modern editions.

  • BDRC catalog data (bdrc-data · GitLab).

  • Core Processes:

  • (Human) Scholarly identification of authentic and relevant sources.

  • (AI/Human) Ingestion and parsing of BDRC’s bibliographic data.

  • (AI) Entity and relationship extraction from existing etexts to correct and extend bibliographic information in Wikidata.

  • Output Data:

  • A comprehensive bibliographic graph representing all identified entities and their relationships.

  • Work Hubs to group related Works.

  • Work records for each distinct conceptual text.

  • Instance records for each version, edition, or manifestation of a Work.

  • Persistent identifiers (e.g., BDRC ID, Wikidata Q-ID) for all created entities.

  • Data Model:

  • Bibliographic data stored in a graph database adhering to the BibFrame 2.0 model.

  • The model must explicitly support:

  • Collections: A volume containing multiple distinct Works (e.g., an anthology) will be modeled as an Instance that is an instance of a “Collection” Work. This Collection Work will in turn have member relationships (hasPart) to the individual Works it contains.

  • Multi-volume Works: A single Work that spans multiple physical volumes will be represented as one Work record with multiple Instance records (one for each volume).

  • All records must link to corresponding BDRC entries and new/existing Wikidata items where applicable.
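
To make the Collection and multi-volume patterns above concrete, here is an illustrative set of graph triples; identifiers and predicate names are examples only, not the final vocabulary:

# Illustrative triples for the two structural patterns above.
# Identifiers and predicate names are examples only.

# Pattern 1: a Collection (anthology) Instance whose Collection Work has member Works.
triples = [
    ("instance:ANTHOLOGY_VOL_1", "instanceOf", "work:COLLECTION_A"),
    ("work:COLLECTION_A",        "hasPart",    "work:TEXT_1"),
    ("work:COLLECTION_A",        "hasPart",    "work:TEXT_2"),
]

# Pattern 2: a single Work spanning multiple physical volumes (one Instance per volume).
triples += [
    ("instance:EDITION_VOL_1", "instanceOf", "work:LONG_TEXT"),
    ("instance:EDITION_VOL_2", "instanceOf", "work:LONG_TEXT"),
]

# Every record also links out to external authorities where available.
triples += [
    ("work:LONG_TEXT", "sameAs", "https://www.wikidata.org/wiki/Q000000"),  # placeholder Q-ID
    ("work:LONG_TEXT", "sameAs", "bdrc:W00000"),                            # placeholder BDRC ID
]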

Stage 2: Preserve as Audio and Images

  • Objective: To create high-fidelity, archival-quality digital surrogates of the physical sources.

  • Input Data:

  • Physical sources identified in Stage 1.

  • Bibliographic records from Stage 1.

  • Core Processes:

  • (Human/Machine) High-resolution scanning of texts.

  • (Human/Machine) Digitization of oral commentaries/teachings.

  • Output Data:

  • High-resolution image files (e.g., TIFF, JP2) for each page.

  • Lossless audio files (e.g., WAV, FLAC).

  • Technical metadata (DPI, color depth, bit rate, etc.).

  • Data Model:

  • A new Instance record is created and linked to the source Work.

  • Each data file (e.g., image_001.tif) must be linked to the Instance record.

  • IIIF manifests will be generated for image sequences.
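
A sketch of how each archival file and its technical metadata might be registered against the new Instance record, with the IIIF manifest referenced by URL; all field names and the URL pattern are assumptions:

# Sketch: technical metadata and IIIF linkage for one digitized Instance.
# Field names are illustrative; the IIIF manifest itself would follow the
# IIIF Presentation API and is only referenced here by URL.
instance_record = {
    "instance_id": "instance:SCAN_0001",
    "work_id": "work:HEART_SUTRA_SKT",
    "iiif_manifest": "https://example.org/iiif/instance:SCAN_0001/manifest.json",  # assumed URL pattern
    "files": [
        {
            "path": "image_001.tif",
            "media_type": "image/tiff",
            "dpi": 600,
            "color_depth": "48-bit RGB",
        },
        {
            "path": "oral_commentary_001.flac",
            "media_type": "audio/flac",
            "bit_rate": "24-bit / 96 kHz",
        },
    ],
}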

Stage 3: Convert into Searchable Digital Text (Diplomatic Transcriptions)

  • Objective: To create a faithful, machine-readable transcription of each source image.

  • Input Data:

  • Image files from Stage 2.

  • Core Processes:

  • (AI) Optical Character Recognition (OCR) / Handwritten Text Recognition (HTR) models trained on specific scripts.

  • (Human) A web-based editing interface for proofreading and correcting AI-generated text against the source image.

  • Output Data:

  • A TEI XML file for each source Instance.

  • Data Model:

  • TEI XML:

  • The <teiHeader> must link to the bibliographic Instance record.

  • The <facsimile> element must contain the IIIF manifest reference.

  • Text content must use facs attributes and #zone pointers to link each line or region of text to the corresponding coordinates on the source image.

  • All characters, line breaks, and scribal marks should be represented as faithfully as possible.
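
The fragment below (wrapped as a Python string to stay consistent with the other sketches) illustrates these linking requirements. It follows common TEI facsimile practice but is simplified and not schema-valid; all IDs, URLs, and coordinates are placeholders.

# Minimal illustration of the TEI linking requirements above.
# Simplified for readability; IDs, URLs, and coordinates are placeholders.
TEI_EXAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- links to the bibliographic Instance record, e.g. a BDRC ID -->
  </teiHeader>
  <facsimile>
    <!-- references the IIIF manifest for the page images -->
    <surface xml:id="page_001">
      <graphic url="https://example.org/iiif/.../canvas/1"/>
      <zone xml:id="zone_001_01" ulx="120" uly="340" lrx="1860" lry="410"/>
    </surface>
  </facsimile>
  <text>
    <body>
      <p>
        <lb facs="#zone_001_01"/>First transcribed line of the witness ...
      </p>
    </body>
  </text>
</TEI>
"""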

Stage 4: Prepare Reliable Editions (Critical Etext)

  • Objective: To create a single, normalized, and reliable “master text” by comparing multiple diplomatic transcriptions and reference materials.

  • Input Data:

  • Multiple TEI diplomatic transcriptions from Stage 3 (as Instances).

  • Reference resources (e.g., digitized commentaries, dictionaries) linked within the Work Hub.

  • Core Processes:

  • (Tool) Collation of multiple TEI witnesses and identification of textual variations using pydurma.

  • (AI) Sentence-level alignment of the text-level reference resources against the collated diplomatic transcriptions.

  • (AI/Human) Decision-making process, guided by the newly aligned sentence-level resources and collated variants, to select the base text and document variants in a critical apparatus.

  • (AI) Normalization and spell-checking of the resulting master text.

  • Output Data:

  • A STAM file containing the master plain text.

  • Multiple STAM annotation layers, including:

  • Critical Apparatus: A layer documenting substantive variants, the editor’s choice for the master text, and links to evidence for that choice.

  • Spelling Variations: One layer per witness, documenting all orthographic variations.

  • Witness-Specific Pagination: A layer for each witness, mapping page breaks from the original TEI file to spans in the master text.

  • OCR Confidence: A layer mapping OCR confidence scores from Stage 3 to the characters/words in the master text.

  • Data Model:

  • The output is a new Instance (the critical edition). The master text is a simple Unicode string.

  • Critical Apparatus Layer: Each annotation must contain:

  • Span pointer to the relevant location in the master text.

  • The reading chosen for the master text (lemma).

  • An array of substantive variant readings, each with a pointer to the witness Instance it came from.

  • A justification field linking to one or more passages in the sentence-aligned reference resources that support the chosen reading.

  • Spelling Variation Layer: One such layer exists per witness. Each annotation must contain:

  • Span pointer to the relevant text in the master text.

  • The variant spelling as it appears in this specific witness.

  • The ID of the witness this variation belongs to.

  • Pagination Layer: These are zero-length “milestone” annotations. Each must contain:

  • A span pointer (start and end are the same).

  • A witness ID (e.g., the Instance ID of the source TEI file).

  • Page/folio number and line number (e.g., 23a, l. 5).

  • OCR Confidence Layer: Each annotation must contain:

  • Span pointer to the relevant text.

  • A confidence value (e.g., 0.98).

  • The ID of the witness this score came from.
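
The sketch below shows one example annotation from each of the four layers described above, expressed as plain Python dictionaries; keys and ID formats are illustrative assumptions, not the final STAM serialization.

# One illustrative annotation per Stage 4 layer. Keys and IDs are examples only.

critical_apparatus_ann = {
    "id": "ann:app:0001",
    "span": {"start": 1042, "end": 1049},          # offsets in the master text
    "lemma": "chosen reading",
    "variants": [
        {"reading": "variant reading", "witness": "instance:TEI_WITNESS_A"},
    ],
    "justification": ["ann:ref:commentary:0457"],  # aligned reference passages supporting the choice
}

spelling_variation_ann = {
    "id": "ann:spell:A:0314",
    "span": {"start": 1042, "end": 1049},
    "witness": "instance:TEI_WITNESS_A",
    "witness_spelling": "variant orthography",
}

pagination_milestone = {
    "id": "ann:page:A:0023",
    "span": {"start": 1050, "end": 1050},          # zero-length "milestone" annotation
    "witness": "instance:TEI_WITNESS_A",
    "folio": "23a",
    "line": 5,
}

ocr_confidence_ann = {
    "id": "ann:ocr:A:8891",
    "span": {"start": 1042, "end": 1049},
    "witness": "instance:TEI_WITNESS_A",
    "confidence": 0.98,
}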

Stage 5: Extract Knowledge Networks & Build Knowledge Base

  • Objective: To identify and link entities within the master text, to generate multi-level summaries from commentaries, and to use this structured data to generate a human-readable knowledge base.

  • Input Data:

  • The STAM master text and annotation layers from Stage 4.

  • Aligned reference commentaries from Stage 4.

  • BDRC/Wikidata bibliographic data.

  • Core Processes:

  • (AI) Named Entity Recognition (NER) and Relationship extraction.

  • (Human) Validation of AI-generated entity and relationship annotations.

  • (API) Pushing validated data to Wikidata, creating a structured knowledge graph.

  • (AI) Generation of multi-level summaries from aligned reference commentaries. The AI will extract key points and explanations related to passages in the master text and structure them as summaries of varying detail.

  • (AI) Assembling knowledge base articles (in a Wikipedia-like format). These articles are not original AI creations; they are structured compilations of facts extracted directly from the STAM annotations and linked reference resources. Every statement is explicitly grounded in and linked to its authentic source via inline citations.

  • Output Data:

  • Enriched STAM file with new annotation layers (e.g., entities, concepts, relationships, summaries). These layers are the primary, source-grounded representation of the knowledge graph data.

  • Updates to external Knowledge Graph (Wikidata) derived directly from the validated data in the STAM annotation layers.

  • A generated Knowledge Base where every fact is traceably linked back via inline citations to the specific source annotations in the STAM file.

  • Data Model:

  • Entity/Relationship Annotation Layers: Each annotation must contain:

  • A unique annotation ID, type, and content (e.g., person, place).

  • A URI pointing to the corresponding Wikidata item.

  • Full provenance data.

  • Summaries Layer: Each summary annotation must contain:

  • A unique annotation ID.

  • A span pointer to the passage in the master text that the summary pertains to.

  • The source of the summary (a pointer to the commentary Instance and span).

  • The summary content itself, potentially at multiple levels of detail (e.g., short, medium, long).

  • Knowledge Base Articles: Each generated article is a derivative product and must:

  • Have a main subject linked to a Wikidata ID.

  • Contain inline citations that link back to the unique IDs of the specific annotations or text spans in the source STAM files from which the facts were derived.
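
A sketch of one entity annotation, one summary annotation, and the citation structure a generated knowledge base article would carry; all keys, IDs, and URIs are illustrative placeholders.

# Illustrative Stage 5 structures; keys, IDs, and URIs are examples only.

entity_ann = {
    "id": "ann:ent:0072",
    "span": {"start": 2210, "end": 2222},
    "type": "person",
    "wikidata_uri": "https://www.wikidata.org/wiki/Q000000",  # placeholder Q-ID
    "provenance": {"agent": "ner-model-vX", "validated_by": "editor:placeholder", "date": "2025-06-20"},
}

summary_ann = {
    "id": "ann:sum:0010",
    "span": {"start": 2100, "end": 2890},                      # passage in the master text
    "source": {"instance": "instance:COMMENTARY_B", "span": {"start": 540, "end": 1210}},
    "content": {
        "short": "One-sentence summary of the passage.",
        "medium": "A paragraph-length summary ...",
        "long": "A detailed, section-length summary ...",
    },
}

kb_article = {
    "main_subject": "https://www.wikidata.org/wiki/Q000000",   # placeholder Q-ID
    "statements": [
        {"text": "Statement of fact compiled from the sources.",
         "citations": ["ann:ent:0072", "ann:sum:0010"]},        # inline citations back into the STAM layers
    ],
}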

Stage 6: Produce Translations

  • Objective: To create factual AI-generated translations with transparent, justifiable interpretation decisions.

Input Data:

  • STAM master text and its associated annotation layers (from Stage 4 and 5), including sentence-aligned reference resources.

  • Buddhist knowledge graph (Wikidata).

  • Buddhist knowledge base (AI-generated articles).

  • Core Processes:

  • (AI) A translation model whose interpretation choices are informed by a multi-source context:

  • The explicit links to reference resources (commentaries, dictionaries) at the word, sentence, or section level.

  • Structured facts from the knowledge graph.

  • Narrative explanations from the knowledge base articles.

  • (AI) Generation of footnotes to justify translation choices. These justifications explicitly cite the evidence used, whether it is a passage in a reference resource, a fact in the knowledge graph, or a statement in the knowledge base.

  • (Human) Post-editing of the translation and justifications.

  • Output Data:

  • A new translation layer in the STAM file.

  • A new justification_footnotes annotation layer in the STAM file.

  • Data Model:

  • The translation layer maps spans from the source text to the translated text.

  • The justification_footnotes layer contains annotations with:

  • A unique annotation ID.

  • Span pointer to the relevant word/phrase in the source text.

  • The chosen translation and any significant alternative translations.

  • Justification text, including one or more traceable links to the supporting evidence. This link could be a unique ID pointing to a STAM annotation in a reference text, a URI for a Wikidata entity, or a citation ID in a knowledge base article.

  • Provenance data.
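
A sketch of a single justification_footnotes annotation meeting these requirements; keys, IDs, and model names are illustrative.

# Illustrative justification_footnotes annotation; keys and IDs are examples only.
justification_footnote = {
    "id": "ann:just:0005",
    "span": {"start": 310, "end": 318},               # word/phrase in the source text
    "chosen_translation": "chosen English rendering",
    "alternatives": ["plausible alternative rendering"],
    "justification": "Why this rendering was chosen, citing the evidence listed below.",
    "evidence": [
        "ann:ref:commentary:0457",                    # STAM annotation in a reference text
        "https://www.wikidata.org/wiki/Q000000",      # placeholder knowledge graph entity
        "kb:article:placeholder#cite-12",             # placeholder knowledge base citation ID
    ],
    "provenance": {"agent": "translation-model-vX", "post_edited_by": None, "date": "2025-06-20"},
}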

Stage 7: Shape Audience-Specific Adaptations

  • Objective: To generate different versions of a translation tailored to various audiences (e.g., scholarly, devotional, beginner).

  • Input Data:

  • STAM translation layer (Stage 6).

  • STAM justification footnotes (Stage 6).

  • STAM multi-level summaries layer (Stage 5).

  • The Knowledge Graph (Wikidata).

  • The Knowledge Base.

  • Core Processes:

  • (AI) Application of stylistic models (e.g., summarization, simplification, elaboration) to the base translation. These models will use the full context of available data to guide adaptations. For example:

  • A “scholarly” adaptation might insert an abridged summary from a commentary (from the summaries layer) as a footnote.

  • A “beginner” adaptation might simplify a term and link to its entry in the knowledge base.

  • (Human) Review and tagging of adaptations.

  • Output Data:

  • Multiple new text layers in the STAM file, one for each adaptation type.

  • Data Model:

  • Each adaptation is a new text layer (e.g., translation_simplified, translation_scholarly_commentary).

  • Each layer maps its content back to the spans of the original translation layer, ensuring alignment is never lost.
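
A sketch of how an adaptation layer might map its segments back to the Stage 6 translation layer; layer and field names are illustrative.

# Illustrative adaptation layer: each segment keeps a pointer back to the
# span of the Stage 6 translation layer it adapts, so alignment is never lost.
translation_simplified = {
    "layer": "translation_simplified",
    "derived_from": "translation",                     # the Stage 6 layer
    "segments": [
        {
            "id": "ann:adapt:simple:0001",
            "source_span": {"start": 0, "end": 142},   # span in the translation layer
            "text": "Simplified rendering of the first passage.",
            "notes": ["kb:article:placeholder"],       # e.g., link a simplified term to the knowledge base
        },
    ],
}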

4. General Data Access & API Requirements

The system must expose an API with the following capabilities:

  • Fetch Data: Retrieve a text and a user-specified set of its annotation/text layers (e.g., “get the master text for Work X, plus its English translation layer and its named entity layer”).
  • Update/Create Data: Allow authenticated users (and AI services) to submit new annotations or corrections. The submission must be in STAM format and include full provenance data.
  • Query: Support queries based on bibliographic metadata (e.g., “find all texts by author Y”) and content annotations (e.g., “find all passages that mention concept Z”).
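
As a non-normative sketch, these three capabilities could be exposed along the following lines (a FastAPI-style outline; paths, parameters, and payload shapes are assumptions, and a production API would also enforce authentication and STAM validation):

# Non-normative API sketch; endpoint paths and payload shapes are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnnotationSubmission(BaseModel):
    stam_payload: dict      # new annotations or corrections, in STAM format
    provenance: dict        # agent, date, and the model/editor responsible

@app.get("/works/{work_id}/text")
def fetch_text(work_id: str, layers: str = ""):
    """Return the master text plus the comma-separated annotation/text layers requested."""
    ...

@app.post("/works/{work_id}/annotations")
def submit_annotations(work_id: str, submission: AnnotationSubmission):
    """Accept new annotations or corrections from authenticated users or AI services."""
    ...

@app.get("/search")
def query(metadata: str = "", mentions: str = ""):
    """Query by bibliographic metadata (e.g., author) or content annotations (e.g., a concept)."""
    ...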