A novel approach to transfer text alignment annotation

By Tashi Tsering and Tenzin Tsundue

Introduction:

Texts often exist in multiple forms, such as versions, translations, and commentaries, each serving specific purposes for different audiences. However, even in today’s digitized world, a significant challenge remains: connecting and aligning these various forms seamlessly. Misaligned texts lead to fragmented information, limiting accessibility and diminishing the utility of digital platforms for users who require synchronized alignments across translations, commentaries, and segmented displays.

To address this challenge, we propose a novel approach for transferring text alignment annotations. Our methodology bridges the gap between differently segmented texts, allowing for the alignment of translations, commentaries, and other text variations. This approach not only enhances the coherence of textual data but also enables digital platforms to deliver a unified experience, making various alignments readily accessible and useful for end-users.

Literature Review:

Cunningham et al., in their paper [5] on the General Architecture for Text Engineering (GATE), introduced the AnnotationDiff tool, which compares sets of annotations across one or more documents. Their evaluation uses three matching criteria (strict, lenient, and average) and reports precision, recall, and F-score.
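The three criteria can be illustrated with a minimal Python sketch. It is not GATE's implementation: the function names are hypothetical, annotations are reduced to bare (start, end) spans, and a response span is assumed to be partially correct whenever it overlaps any key span without matching one exactly.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive
Scores = Tuple[float, float, float]  # (precision, recall, F-score)


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def prf(matched: float, n_response: int, n_key: int) -> Scores:
    """Precision, recall, and balanced F-score for a given match count."""
    precision = matched / n_response if n_response else 0.0
    recall = matched / n_key if n_key else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def evaluate(key: List[Span], response: List[Span]) -> Dict[str, Scores]:
    """Score response spans against key (reference) spans."""
    key_set = set(key)
    correct = sum(1 for r in response if r in key_set)
    partial = sum(
        1 for r in response
        if r not in key_set and any(overlaps(r, k) for k in key)
    )
    # Strict treats partial matches as wrong; lenient treats them as right.
    strict = prf(correct, len(response), len(key))
    lenient = prf(correct + partial, len(response), len(key))
    # Average is the mean of the strict and lenient figures.
    average = tuple((s + l) / 2 for s, l in zip(strict, lenient))
    return {"strict": strict, "lenient": lenient, "average": average}


key = [(0, 5), (6, 10), (11, 15)]
response = [(0, 5), (6, 9), (20, 25)]  # one exact, one partial, one spurious
scores = evaluate(key, response)
```

With these toy spans, strict precision is 1/3, lenient precision is 2/3, and the average is 1/2.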

One key functionality of the OpenPecha Toolkit in this context is its ability to transfer span annotations from one text to another. This feature bears similarities to the Google Diff-Match-Patch tool [6].

Although some functionalities and data storage formats share similarities with existing projects, the application of these features for transferring alignment annotations represents a novel approach.

Real User Experience:

On Pecha.org, root texts are segmented specifically for display on the site. Because that segmentation exists only for display, no translation or commentary is aligned to it. We do have translations and commentaries of the same root texts, but they are aligned to the root text's native segmentation. To make them readily available to Pecha.org users, we built this transfer mechanism, which aligns an existing translation or commentary to the display-segmented root text. Before the transfer, we have two segmentations of the same root text and two alignments: a translation and a commentary, each aligned to the natively segmented root text. After the transfer, we have two more alignments: the translation and the commentary aligned to the display-segmented root text. The number of alignments thus grows from two to four.

Methodology:

At OpenPecha, textual content and annotations are stored in a custom-defined format known as the OpenPecha Format (OPF). The annotations are maintained in a “standoff” structure, which separates them from the main text. This design choice enhances the efficiency of downloading, parsing, and updating the data. Although the OPF format has undergone several revisions over time, its structure and capabilities have been thoroughly documented in the published paper, “Taming the Wild Etext: Managing, Annotating, and Sharing Tibetan Corpora in Open Spaces” [2].

Figure 1: A visual representation of OPF data structure.

Figure 1 illustrates an OPF dataset. All OPF data from OpenPecha are stored in the PechaData GitHub repository [3]. Each OPF dataset is uniquely identified by a Pecha ID, as shown by the example folder P00000001 in the figure. The base text is stored in a folder named with a randomized 4-digit UUID. Annotation files are organized under the layers folder, where each subfolder is named after its corresponding base text. Additionally, a metadata.json file contains essential metadata such as the title, source, author, and other relevant information about the base text.

All interactions with OPF data, including creation, uploading, updating, and downloading, are facilitated through the OpenPecha Toolkit [4]. The annotations in OpenPecha are created using the Stand-off Text Annotation Model (STAM) [1] API and are stored in .json files. By leveraging STAM features such as selecting text spans, referencing other annotations, and including payload information, the OPF is designed to accommodate any type of text annotation.
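To make the standoff idea concrete, here is a minimal sketch in plain Python. It does not use the STAM API or the actual OPF schema: the SpanAnnotation class, its field names, and the helper functions are hypothetical and only illustrate how an annotation can point into a base text by character offsets while being stored separately as JSON.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class SpanAnnotation:
    """A standoff annotation: offsets into a base text plus a payload."""
    id: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    payload: Dict[str, str]


def annotate_segments(base_text: str, segments: List[str]) -> List[SpanAnnotation]:
    """Locate each segment in reading order and record only its offsets."""
    annotations = []
    cursor = 0
    for number, segment in enumerate(segments, start=1):
        start = base_text.index(segment, cursor)  # raises ValueError if absent
        end = start + len(segment)
        annotations.append(
            SpanAnnotation(f"seg-{number}", start, end, {"type": "segment"})
        )
        cursor = end
    return annotations


def resolve(base_text: str, annotation: SpanAnnotation) -> str:
    """Recover the annotated text by slicing the base text."""
    return base_text[annotation.start:annotation.end]


base_text = "First sentence. Second sentence."
layer = annotate_segments(base_text, ["First sentence.", "Second sentence."])

# The serialized layer holds offsets and payloads only, never the text itself.
layer_json = json.dumps([asdict(a) for a in layer], indent=2)
```

Because the layer stores offsets rather than text, several independent layers can refer to the same base text, and a layer can be replaced without touching the text.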

Figure 2: A visual representation of alignment annotation transfer

In Figure 2, the English book is shown as a translation OPF aligned to the Tibetan book. A segment is an annotation that stores a span of text in a given base text. Each English segment corresponds to one Tibetan segment, forming a one-to-one relationship. However, there is also a Tibetan Book Version 1 whose base text has minor character differences, missing content, and a different segmentation. When a translation is needed for each segment of Tibetan Book Version 1, translating the text again, whether manually or by machine, is labor-intensive and an inefficient use of computational resources. We propose an alignment annotation transfer process consisting of three steps:

Step 1: Transfer of Segment Annotation Layer
The segment annotation layer from the Tibetan Book will be transferred to Tibetan Book Version 1 using the base text of Tibetan Book Version 1. This process will be carried out utilizing the StamPecha merge functionality provided by our toolkit [4].


from pathlib import Path
from openpecha.pecha import StamPecha

# Paths to the two OPFs on disk
tibetan_book_path = Path("tibetan_book_path")
tibetan_book_v1_path = Path("tibetan_book_v1_path")

# Names of the base texts inside each OPF
tibetan_book_base_name = "tibetan_book_base_name"
tibetan_book_v1_base_name = "tibetan_book_v1_base_name"

tibetan_book_pecha = StamPecha(tibetan_book_path)
tibetan_book_v1_pecha = StamPecha(tibetan_book_v1_path)

# Transfer the annotations of Tibetan Book onto the base text of Version 1
tibetan_book_v1_pecha.merge_pecha(tibetan_book_pecha, tibetan_book_base_name, tibetan_book_v1_base_name)
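
The idea behind Step 1 can be sketched without the toolkit. The example below is not how merge_pecha is implemented; it is an illustration that substitutes Python's standard difflib for the toolkit's own logic, and the function names are hypothetical. It aligns two similar base texts character by character and carries span boundaries from one to the other.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_offset_map(source: str, target: str) -> List[int]:
    """Map every boundary 0..len(source) to a boundary in target."""
    offset_map = [0] * (len(source) + 1)
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Equal and replaced characters advance in step, clamped to the end
        # of the target block; deleted characters collapse onto one position.
        # Insert opcodes cover no source characters, so the loop is skipped.
        for i in range(i1, i2):
            offset_map[i] = min(j1 + (i - i1), j2)
    offset_map[len(source)] = len(target)
    return offset_map


def transfer_spans(spans: List[Span], source: str, target: str) -> List[Span]:
    """Re-express spans defined on source as spans on target."""
    offset_map = build_offset_map(source, target)
    return [(offset_map[start], offset_map[end]) for start, end in spans]


source = "abc def ghi"
target = "abc XX def ghi"  # same text with extra content after "abc"
source_spans = [(0, 3), (4, 7), (8, 11)]  # "abc", "def", "ghi"
target_spans = transfer_spans(source_spans, source, target)
transferred_texts = [target[start:end] for start, end in target_spans]
```

In this sketch, content that exists only in the target is absorbed by the span that precedes it, and a span whose text is missing from the target collapses to an empty span that a caller can detect. A production implementation may resolve these cases differently.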

Step 2: Mapping Segment Layers
The alignment relationship between the old and new segment layers in Tibetan Book Version 1 will be established. The alignment data will map each new segment to its corresponding old segments, which may follow a one-to-one or one-to-many relationship.
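
A minimal sketch of this mapping, assuming both layers are lists of (start, end) character spans over the same base text and that two segments correspond when their spans overlap. The function name and the overlap criterion are illustrative assumptions, not the toolkit's API.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def map_segments(new_layer: List[Span], old_layer: List[Span]) -> Dict[int, List[int]]:
    """For each new segment index, list the old segment indices it overlaps."""
    mapping: Dict[int, List[int]] = {}
    for new_idx, (new_start, new_end) in enumerate(new_layer):
        mapping[new_idx] = [
            old_idx
            for old_idx, (old_start, old_end) in enumerate(old_layer)
            if new_start < old_end and old_start < new_end
        ]
    return mapping


# Both layers segment the same 30-character base text.
new_layer = [(0, 10), (10, 30)]
old_layer = [(0, 10), (10, 18), (18, 30)]
new_to_old = map_segments(new_layer, old_layer)
```

Here the first new segment maps one-to-one to the first old segment, while the second new segment maps one-to-many to the second and third old segments.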

Step 3: Generation of English Segment Layer
Using the mapped output from Step 2, a new segment layer will be generated for the English Book. This layer will be aligned with the old segments of Tibetan Book Version 1.
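
One way to picture this step, as a sketch under stated assumptions: the English segments are aligned one-to-one with the new (transferred) Tibetan segments, and the Step 2 output is a dictionary from each new segment index to the old segment indices it corresponds to. The function name and data shapes are hypothetical.

```python
from typing import Dict, List


def regroup_translation(
    new_to_old: Dict[int, List[int]],
    english_segments: List[str],
    n_old: int,
) -> List[List[str]]:
    """Collect, for each old segment, the English segments that map to it."""
    grouped: List[List[str]] = [[] for _ in range(n_old)]
    for new_idx in sorted(new_to_old):
        for old_idx in new_to_old[new_idx]:
            grouped[old_idx].append(english_segments[new_idx])
    return grouped


# English segment i translates new Tibetan segment i.
english_segments = ["E1", "E2", "E3"]
new_to_old = {0: [0], 1: [0], 2: [1]}  # two new segments fall inside old segment 0
english_for_old = regroup_translation(new_to_old, english_segments, n_old=2)
```

In this sketch an English segment whose Tibetan counterpart spans several old segments is listed under each of them, because the translation cannot be split further without additional alignment information.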

The following code performs Steps 1 to 3 described above.


# TranslationAlignmentAnnTransfer must first be imported from the OpenPecha Toolkit [4].
# Each ... is a placeholder for the actual path or base name.
tibetan_book = {"tibetan_book_pecha_path": ..., "tibetan_book_base_name": ...}
tibetan_book_v1 = {"tibetan_book_v1_pecha_path": ..., "tibetan_book_v1_base_name": ...}

english_book = {
    "english_book_pecha_path": ...,
    "english_book_base_name": ...,
}

ann_transfer = TranslationAlignmentAnnTransfer(tibetan_book, tibetan_book_v1, english_book)
ann_transfer.transfer_annotation()

Experiments:

1. Translation Alignment Transfer

Given:
P1 is the OPF of a root text segmented specifically for Pecha display.
P2 is the OPF of the same root text as P1, but with its native segments.
P3 is the OPF of the English translation of the root text, aligned to P2.

Aim:
P2 and P3 are currently aligned. We want to align the English translation in P3 to the Pecha display OPF P1.

Steps that we follow to achieve the transfer of translation alignment:

  1. Use P1 and P2 to create L2, a segment layer in P1 whose base text comes from P1 but whose segmentation is transferred from P2.
  2. Compare layer L1 (the display segment layer already in P1) with L2 to create C1, the mapping between the P1 and P2 segmentations.
  3. Use the mapping C1 to create L5, which contains the alignment between P3 and P1.

2. Commentary Alignment Transfer

Given:

P1 is the OPF of a root text segmented specifically for Pecha display.
P2 is the OPF of the same root text as P1, but with its native segments.
P3 is the OPF of the Tibetan commentary on the root text, aligned to P2.

Aim:
P2 and P3 are currently aligned. We want to align the Tibetan commentary in P3 to the Pecha display OPF P1.

Steps that we follow to achieve the transfer of commentary alignment:

  1. Use P1 and P2 to create L2, a segment layer in P1 whose base text comes from P1 but whose segmentation is transferred from P2.
  2. Compare layer L1 (the display segment layer already in P1) with L2 to create C1, the mapping between the P1 and P2 segmentations.
  3. Use the mapping C1 to create L5, which contains the alignment between P3 and P1.

Results:

We implemented our approach and applied it to real-world data. For translation, we worked with a dataset of རྡོ་རྗེ་གཅོད་པ་རྩ་བ། (Vajra Cutter), which included a Tibetan root text aligned with its Chinese translation and consisted of 348 segments. Our approach proved effective in generating a Chinese translation alignment for a different Tibetan root text with a similar base text, comprising 431 segments.

Conclusion:

This research introduces a novel approach to transferring text alignment annotations, enabling seamless integration of various text alignments such as translations, commentaries, and display segments. By leveraging the robust functionalities of the OpenPecha Toolkit and the Stand-off Text Annotation Model (STAM), we effectively addressed the challenges in aligning differently segmented versions of the same root text. Our methodology significantly reduces manual effort and computational inefficiencies, as demonstrated in real-world applications like aligning translations and commentaries of རྡོ་རྗེ་གཅོད་པ་རྩ་བ། (Vajra Cutter).

Future Work:

For future work, we plan to extend the evaluation of the OpenPecha Toolkit by systematically testing its span annotation transfer functionality across diverse text corpora. This will involve benchmarking its performance against similar tools, including the Google Diff-Match-Patch tool [6], to assess its precision, efficiency, and robustness.

Citation:

[1] STAM GitHub Repo
[2] Taming the Wild Etext: Managing, Annotating, and Sharing Tibetan Corpora in Open Spaces
[3] PechaData GitHub Organization
[4] OpenPecha Toolkit v2
[5] GATE: an Architecture for Development of Robust HLT Applications
[6] Google diff-match-patch
