A novel approach to transfer text alignment annotation
By Tashi Tsering and Tenzin Tsundue
Introduction:
Texts often exist in multiple forms, such as versions, translations, and commentaries, each serving specific purposes for different audiences. However, even in today’s digitized world, a significant challenge remains: connecting and aligning these various forms seamlessly. Misaligned texts lead to fragmented information, limiting accessibility and diminishing the utility of digital platforms for users who require synchronized alignments across translations, commentaries, and segmented displays.
To address this challenge, we propose a novel approach for transferring text alignment annotations. Our methodology bridges the gap between differently segmented texts, allowing for the alignment of translations, commentaries, and other text variations. This approach not only enhances the coherence of textual data but also enables digital platforms to deliver a unified experience, making various alignments readily accessible and useful for end-users.
Literature Review:
Cunningham et al., in their paper [5] on the General Architecture for Text Engineering (GATE), introduced the AnnotationDiff Tool, which compares sets of annotations across one or more documents. Their work evaluates agreement under three distinct criteria (strict, lenient, and average) and reports performance with precision, recall, and F-score.
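To make the three criteria concrete, the following is a minimal sketch and not GATE's implementation. It assumes the usual reading of these criteria: strict counts partially matching annotations as wrong, lenient counts them as right, and average is the mean of the two. The function name and the counts are hypothetical.

```python
def precision_recall_f(correct, partial, spurious, missing, lenient):
    """Score annotation agreement; partial matches count only when lenient."""
    matched = correct + (partial if lenient else 0)
    response_total = correct + partial + spurious  # annotations produced
    key_total = correct + partial + missing        # annotations expected
    precision = matched / response_total if response_total else 0.0
    recall = matched / key_total if key_total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical counts, for illustration only
counts = dict(correct=6, partial=2, spurious=2, missing=2)
strict = precision_recall_f(**counts, lenient=False)
lenient = precision_recall_f(**counts, lenient=True)
# "Average" is taken here as the mean of the strict and lenient scores
average = tuple((s + l) / 2 for s, l in zip(strict, lenient))
```

With these counts the strict scores are lower than the lenient ones, and the average falls between them.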
One key functionality of the OpenPecha Toolkit in this context is its ability to transfer span annotations from one text to another. This feature bears similarities to the Google Diff-Match-Patch tool [6].
Although some functionalities and data storage formats share similarities with existing projects, the application of these features for transferring alignment annotations represents a novel approach.
Real User Experience:
On Pecha.org, root texts are available in a segmentation made specifically for display on the site (the pecha display segmentation). Because this segmentation exists only for display, neither translations nor commentaries are aligned to it. We do have a translation and a commentary of the same root text, but they are aligned to its native segments. The transfer mechanism makes this translation and commentary available to Pecha.org users, aligned to the pecha display segmented root text. We start with two segmentations of the same root text and two alignments: the translation and the commentary, each aligned to the natively segmented root text. After the alignment transfer we have two more alignments, those of the translation and the commentary to the pecha display segmented root text, which raises the number of alignments from two to four.
Methodology:
At OpenPecha, textual content and annotations are stored in a custom-defined format known as the OpenPecha Format (OPF). The annotations are maintained in a “standoff” structure, which separates them from the main text. This design choice enhances the efficiency of downloading, parsing, and updating the data. Although the OPF format has undergone several revisions over time, its structure and capabilities have been thoroughly documented in the published paper, “Taming the Wild Etext: Managing, Annotating, and Sharing Tibetan Corpora in Open Spaces” [2].
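The standoff idea can be illustrated with a minimal sketch. This is not the OPF or STAM data model; the `Span` class, the `text_of` helper, and the sample text are hypothetical and only show how annotations can point into an untouched base text by character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character range [start, end) into a base text."""
    start: int
    end: int


def text_of(base_text: str, span: Span) -> str:
    """Resolve a standoff span against its base text."""
    return base_text[span.start:span.end]


# The base text is stored untouched; the annotations live in a separate structure.
base_text = "First sentence. Second sentence."
segment_layer = [Span(0, 15), Span(16, 32)]

segments = [text_of(base_text, span) for span in segment_layer]
```

Because the layer holds only offsets, it can be updated or replaced without rewriting the base text.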
Figure 1: A visual representation of OPF data structure.
Figure 1 shows the structure of an OPF dataset. All OPF data from OpenPecha are stored in the PechaData GitHub repository [3]. Each OPF dataset is uniquely identified by a Pecha ID, as shown by the example folder P00000001 in the figure. The base text is stored in a folder named with a randomized 4-digit UUID. Annotation files are organized under the layers folder, where each subfolder is named after its corresponding base text. Additionally, a metadata.json file contains essential metadata such as the title, source, author, and other relevant information about the base text.
All interactions with OPF data, including creation, uploading, updating, and downloading, are handled through the OpenPecha Toolkit [4]. The annotations in OpenPecha are created using the Stand-off Text Annotation Model (STAM) [1] API and are stored in .json files. By building on STAM features such as selecting text spans, referencing other annotations, and attaching payload information, the OPF format is designed to accommodate any type of text annotation.
Figure 2: A visual representation of alignment annotation transfer
In Figure 2, the English Book is a translation OPF aligned to the Tibetan Book. A segment is an annotation that stores a text span in a given base text. Each English segment corresponds to one Tibetan segment, forming a one-to-one relationship. However, there is also a Tibetan Book Version 1 whose base text has minor character differences, missing content, and variations in segmentation. When a translation is needed for each segment of Tibetan Book Version 1, translating it again, whether manually or by machine, is labor-intensive and an inefficient use of computational resources. We propose an alignment annotation transfer process consisting of three steps:
Step 1: Transfer of Segment Annotation Layer
The segment annotation layer from the Tibetan Book will be transferred to Tibetan Book Version 1 using the base text of Tibetan Book Version 1. This process will be carried out utilizing the StamPecha merge functionality provided by our toolkit [4].
from pathlib import Path

from openpecha.pecha import StamPecha

# Placeholder paths and base-text names for the two OPFs
tibetan_book_path = Path("tibetan_book_path")
tibetan_book_v1_path = Path("tibetan_book_v1_path")
tibetan_book_base_name = "tibetan_book_base_name"
tibetan_book_v1_base_name = "tibetan_book_v1_base_name"

tibetan_book_pecha = StamPecha(tibetan_book_path)
tibetan_book_v1_pecha = StamPecha(tibetan_book_v1_path)

# Transfer the Tibetan Book's annotations onto the base text of Version 1
tibetan_book_v1_pecha.merge_pecha(tibetan_book_pecha, tibetan_book_base_name, tibetan_book_v1_base_name)
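How spans can be carried from one base text to a slightly different one is sketched below. This is not the toolkit's merge implementation: it is an illustration that uses Python's standard difflib in place of a diff-match-patch style algorithm, and the function names and English sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def build_offset_map(source: str, target: str) -> list:
    """Map every source offset 0..len(source) to an offset in target."""
    mapping = [0] * (len(source) + 1)
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                # Unchanged characters keep their relative position
                mapping[i] = j1 + (i - i1)
            else:
                # Replaced or deleted characters collapse onto the
                # start of the corresponding target range
                mapping[i] = j1
    mapping[len(source)] = len(target)
    return mapping


def transfer_spans(spans, source: str, target: str):
    """Re-anchor half-open (start, end) spans from source onto target."""
    offsets = build_offset_map(source, target)
    return [(offsets[start], offsets[end]) for start, end in spans]


source = "The cat sat. The dog ran."
target = "The cat sat! The big dog ran."
source_spans = [(0, 12), (13, 25)]

target_spans = transfer_spans(source_spans, source, target)
target_texts = [target[s:e] for s, e in target_spans]
```

In this example the two source segments land on "The cat sat!" and "The big dog ran." in the target, so the segmentation survives both a character substitution and an insertion.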
Step 2: Mapping Segment Layers
Next, the alignment between the old segment layer of Tibetan Book Version 1 (its own segmentation) and the new segment layer (transferred from the Tibetan Book in Step 1) is established. The alignment data maps each new segment to its corresponding old segment or segments, a relationship that may be one-to-one or one-to-many.
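One simple way to derive such a mapping is by span overlap, since both layers sit on the same base text. The sketch below assumes half-open (start, end) character spans; the function name and the sample layers are hypothetical and do not reflect the toolkit's internals.

```python
def map_segments(new_layer, old_layer):
    """For each new segment index, list the old segments that overlap it."""
    mapping = {}
    for new_idx, (new_start, new_end) in enumerate(new_layer):
        mapping[new_idx] = [
            old_idx
            for old_idx, (old_start, old_end) in enumerate(old_layer)
            # Two half-open ranges overlap when each starts before the other ends
            if old_start < new_end and new_start < old_end
        ]
    return mapping


# Hypothetical layers over the same 30-character base text
old_layer = [(0, 10), (10, 20), (20, 30)]
new_layer = [(0, 10), (10, 30)]

new_to_old = map_segments(new_layer, old_layer)
print(new_to_old)  # {0: [0], 1: [1, 2]}
```

The first new segment maps one-to-one, while the second spans two old segments and therefore maps one-to-many.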
Step 3: Generation of English Segment Layer
Using the mapped output from Step 2, a new segment layer will be generated for the English Book. This layer will be aligned with the old segments of Tibetan Book Version 1.
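The composition behind this step can be sketched as follows, under two assumptions: the mapping from Step 2 is a dictionary from new-segment index to old-segment indices, and the English segments are one-to-one with the new (transferred) Tibetan segments by index. All names and values are hypothetical.

```python
from collections import defaultdict


def build_english_alignment(new_to_old, english_layer):
    """Attach each English span to the old Version 1 segments it belongs to."""
    aligned = defaultdict(list)
    for new_idx in sorted(new_to_old):
        for old_idx in new_to_old[new_idx]:
            # English segment new_idx translates transferred segment new_idx
            aligned[old_idx].append(english_layer[new_idx])
    return dict(aligned)


# Hypothetical Step 2 output and English segment spans (start, end)
new_to_old = {0: [0], 1: [1, 2]}
english_layer = [(0, 18), (19, 60)]

english_for_v1 = build_english_alignment(new_to_old, english_layer)
print(english_for_v1)  # {0: [(0, 18)], 1: [(19, 60)], 2: [(19, 60)]}
```

In this sketch, an English segment whose Tibetan counterpart covers several old segments is attached to each of them; whether to share or split the translation in that case is a design choice the sketch does not settle.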
The following code performs Steps 1 to 3 described above.
# The import of TranslationAlignmentAnnTransfer is not shown in the source.
# Paths and base names are elided; each ... must be filled in.
tibetan_book = {"tibetan_book_pecha_path": ..., "tibetan_book_base_name": ...}
tibetan_book_v1 = {"tibetan_book_v1_pecha_path": ..., "tibetan_book_v1_base_name": ...}
english_book = {
    "english_book_pecha_path": ...,
    "english_book_base_name": ...,
}

ann_transfer = TranslationAlignmentAnnTransfer(tibetan_book, tibetan_book_v1, english_book)
ann_transfer.transfer_annotation()
Experiments:
1. Translation Alignment Transfer
Given:
- P1 is the OPF of a root text segmented specifically for Pecha Display.
- P2 is the OPF of the same root text as P1, but with its native segments.
- P3 is the OPF of the English translation of the root text, aligned to P2.
Aim:
P2 and P3 are currently aligned; we want to align the P3 English translation to the P1 pecha display OPF.
Steps that we follow to achieve the transfer of translation alignment:
- Use P1 and P2 to create L2, a segment layer in P1 whose base text is that of P1 but whose segmentation annotation comes from P2.
- Compare layers L1 (the display segment layer of P1) and L2 of P1 to create C1, the alignment mapping between P1 and P2.
- Use the mapping C1 to create L5, which contains the alignment mapping between P3 and P1.
2. Commentary Alignment Transfer
Given:
- P1 is the OPF of a root text segmented specifically for Pecha Display.
- P2 is the OPF of the same root text as P1, but with its native segments.
- P3 is the OPF of the Tibetan commentary on the root text, aligned to P2.
Aim:
P2 and P3 are currently aligned; we want to align the P3 Tibetan commentary to the P1 pecha display OPF.
Steps that we follow to achieve the transfer of commentary alignment:
- Use P1 and P2 to create L2, a segment layer in P1 whose base text is that of P1 but whose segmentation annotation comes from P2.
- Compare layers L1 (the display segment layer of P1) and L2 of P1 to create C1, the alignment mapping between P1 and P2.
- Use the mapping C1 to create L5, which contains the alignment mapping between P3 and P1.
Results:
We successfully implemented our approach using real-world data. For translation, we worked with a dataset of རྡོ་རྗེ་གཅོད་པ་རྩ་བ། (Vajra Cutter), which included an aligned Tibetan root text and its Chinese translation, consisting of 348 segments. Our approach proved effective in generating a Chinese translation for a different Tibetan root text with a similar base text, comprising 431 segments.
Conclusion:
This research introduces a novel approach to transferring text alignment annotations, enabling seamless integration of various text alignments such as translations, commentaries, and display segments. By leveraging the robust functionalities of the OpenPecha Toolkit and the Stand-off Text Annotation Model (STAM), we effectively addressed the challenges in aligning differently segmented versions of the same root text. Our methodology significantly reduces manual effort and computational inefficiencies, as demonstrated in real-world applications like aligning translations and commentaries of རྡོ་རྗེ་གཅོད་པ་རྩ་བ། (Vajra Cutter).
Future Work:
For future work, we plan to extend the evaluation of the OpenPecha Toolkit by systematically testing its span annotation transfer functionality across diverse text corpora. This will involve benchmarking its performance against similar tools, including the Google Diff-Match-Patch tool [6], to assess its precision, efficiency, and robustness.
Citation:
[1] STAM GitHub Repo
[2] Taming the Wild Etext: Managing, Annotating, and Sharing Tibetan Corpora in Open Spaces
[3] PechaData GitHub Organization
[4] OpenPecha Toolkit v2
[5] GATE: an Architecture for Development of Robust HLT Applications
[6] Google diff-match-patch