Modeling the Full Translation Dataset
Summary
Following results from ‘A First Look at Topic Modeling for the Translation Dataset’, topic modeling was conducted on the full translation dataset openpecha/cleaned_MT_v1.0.2. The results can be seen below.
Explanation of Cluster Labels
Most clusters are labeled with a brief list of topics contained in the cluster.
Clusters whose labels are of the form “label” are composed predominantly of texts that consist of nothing more than the label text itself. For example, the cluster “3.” contains almost exclusively elements of the dataset where the entire target sentence is “3.”
Clusters whose labels are of the form [label] contain texts that are described by the label but are not necessarily about the label. For example, [Numerical Section Headers] contains texts which are themselves numerical section headers (e.g. ‘2.3.2.2.3’).
Elements of the dataset whose source text was readily identifiable are given as [Author, Title]. For example, [Dickens, A Christmas Carol] contains sentences from ‘A Christmas Carol’ by Charles Dickens.
Brief Analysis
We can see that the overwhelming majority of texts are Buddhist in nature. These are not well differentiated here in part because of the presence of extreme outliers in the dataset.
Junk Data
These outliers are partially junk data that can and should be removed from the dataset. For example, [Roman Numerals] contains texts whose entire target sentence consists exclusively of Roman numerals. These and other clusters (e.g. “3.”, “4”) may also indicate broader problems in the machine alignment, as they are likely the result of programmatically splitting source texts at punctuation, which may not accurately reflect the Tibetan input sentence.
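As a hypothetical illustration of how punctuation-based splitting can produce such fragments (this is not the alignment code actually used), a naive split at sentence-final punctuation strands list markers as their own “sentences”:

```python
import re

# Hypothetical example: splitting an English target text at sentence-final
# punctuation leaves numbered list markers as standalone fragments.
target = "The stages are as follows: 1. Generation. 2. Completion. 3. Union."

fragments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", target) if s.strip()]
print(fragments)
# ['The stages are as follows: 1.', 'Generation.', '2.', 'Completion.', '3.', 'Union.']
```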
Other problematic clusters include [Contains Tibetan], which consists of elements whose English target sentence contains Tibetan; these come primarily from language-learning texts. Also problematic are “Yes” and “No.”, which contain primarily target sentences consisting only of the word “yes” or “no” respectively.
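Elements like those in [Contains Tibetan] could be flagged with a simple script check such as the sketch below (not part of the original pipeline); the field name "target" is an assumption about the dataset schema.

```python
import re

# Tibetan script occupies the Unicode block U+0F00-U+0FFF.
TIBETAN_CHARS = re.compile(r"[\u0F00-\u0FFF]")

def target_contains_tibetan(example: dict) -> bool:
    """Flag dataset elements whose English target still contains Tibetan script.

    NOTE: the field name "target" is an assumption about the dataset schema.
    """
    return bool(TIBETAN_CHARS.search(example["target"]))
```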
Fiction
In addition to [Dickens, A Christmas Carol], the dataset features a number of pieces of literary fiction (visible on the bottom right of the visualization above), including Harry Potter and Sherlock Holmes stories.
These texts are unlikely to be tremendously helpful in producing high-quality translations of religious texts, but they are probably valuable context for translating more contemporary, or less formal, works.
Contemporary Non-Fiction
Adjacent to the fiction in the visualization is a set of contemporary non-fiction texts. These include works on history and politics, as well as the autobiographies of Mahatma Gandhi and Malala Yousafzai.
On the left edge of the main mass in the visualization are clusters of more academic non-fiction. These include academic publications on biology, physics, medicine, and finance.
As with the fiction, these texts are likely to be valuable in translation of contemporary works, but their domain-specific jargon is most likely unhelpful for translating older texts.
Dharma Texts
The majority of the dataset is religious or philosophical in nature. Although certain key topics stand out here (Madhyamaka, Tantra, Existence, Awareness), these labels should not be leaned on too heavily for understanding the details of the dataset.
The presence of such dramatically distinct material leads to much of the Buddhist material being either lumped together without much meaningful connection or treated as noise in the data.
Methods
This analysis uses methods similar to those in ‘A First Look at Topic Modeling for the Translation Dataset’, with the exception of the fitting of the projection algorithm, the choice of clustering algorithm, and the manual cleaning of cluster labels.
The Python library used to execute this pipeline is easy_text_clustering.
Sentences were embedded as vectors using sentence-transformers/all-MiniLM-L6-v2.
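A minimal sketch of this step, calling the sentence-transformers library directly rather than through easy_text_clustering, and using a placeholder list in place of the full set of target sentences:

```python
from sentence_transformers import SentenceTransformer

# Placeholder input; in practice this would be every English target sentence
# from openpecha/cleaned_MT_v1.0.2.
target_sentences = ["Homage to the Buddha.", "3.", "What is emptiness?"]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode(target_sentences, show_progress_bar=True)
print(embeddings.shape)  # (len(target_sentences), 384)
```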
These embeddings were projected into two dimensions using the UMAP algorithm.
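A sketch of the projection step with umap-learn, continuing from the embeddings above; the hyperparameters shown are illustrative, not the values used in the actual fit.

```python
import umap

# Project the sentence embeddings down to 2D for clustering and visualization.
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric="cosine")
projection = reducer.fit_transform(embeddings)  # shape: (n_sentences, 2)
```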
The two-dimensional data was then clustered using the HDBSCAN algorithm.
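And a sketch of the clustering step with the hdbscan library, run on the 2D projection; the minimum cluster size is an illustrative value.

```python
import hdbscan

# Cluster the 2D projection. Points HDBSCAN cannot assign to any cluster
# are labeled -1 (noise).
clusterer = hdbscan.HDBSCAN(min_cluster_size=50)
cluster_labels = clusterer.fit_predict(projection)
```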
A random set of 10 samples from each cluster was then fed to mistralai/Mixtral-8x7B-Instruct-v0.1 for summarization using the following prompt: “Use three words total (comma separated)to describe general topics in above texts. Under no circumstances use enumeration. Example format: Tree, Cat, Fireman”
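The labeling step could be sketched as follows, assuming the model is reached through the huggingface_hub Inference API (the actual pipeline drives this through easy_text_clustering); the prompt is reproduced as given above.

```python
import random
from collections import defaultdict
from huggingface_hub import InferenceClient

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Group sentence indices by cluster, skipping HDBSCAN noise (-1).
by_cluster = defaultdict(list)
for idx, label in enumerate(cluster_labels):
    if label != -1:
        by_cluster[label].append(idx)

cluster_names = {}
for label, indices in by_cluster.items():
    sample = [target_sentences[i] for i in random.sample(indices, min(10, len(indices)))]
    prompt = (
        "\n".join(sample)
        + "\nUse three words total (comma separated)to describe general topics in above texts."
        + " Under no circumstances use enumeration. Example format: Tree, Cat, Fireman"
    )
    cluster_names[label] = client.text_generation(prompt, max_new_tokens=20).strip()
```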
These cluster labels were then edited manually to be more reliably descriptive, though some labels remain less than meaningful, reflecting the lack of a cohesive theme in the texts of those clusters.
The clusters and the summary labels were then plotted using Plotly.
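The plotting step can be sketched with plotly.express, combining the 2D projection with the edited labels from the step above:

```python
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "x": projection[:, 0],
    "y": projection[:, 1],
    "cluster": cluster_labels,
})
# Map each point to its (manually edited) cluster label; noise points stay unlabeled.
df["label"] = df["cluster"].map(cluster_names).fillna("Noise")

fig = px.scatter(df, x="x", y="y", color="label", hover_data=["cluster"])
fig.show()
```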
Next Steps
The next step should be to curate a subset of this data that excludes the less relevant data mentioned above. This pipeline can then be re-executed on the curated subset to get a better view of the composition of the Buddhist materials.
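A sketch of how such a subset might be curated, continuing from the cluster assignments above; the list of excluded labels is hypothetical and would in practice be compiled by reviewing the edited cluster labels.

```python
# Hypothetical list of cluster labels to drop, based on the junk clusters
# identified above.
EXCLUDED_LABELS = {"[Roman Numerals]", "[Contains Tibetan]", '"3."', '"4"', '"Yes"', '"No."'}

curated_sentences = [
    sentence
    for sentence, label in zip(target_sentences, cluster_labels)
    if cluster_names.get(label, "Noise") not in EXCLUDED_LABELS
]
```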