Topic Modeling Buddhist Material in the Translation Dataset

billingsmoore · November 21, 2024, 9:49pm

Topic Modeling Buddhist Material in the Translation Dataset

Following results from Modeling the Full Translation Dataset, obviously non-Buddhist materials were excluded from the dataset and topic modeling was conducted on the remaining texts. The results can be seen below.

Brief Analysis

Typical topics and themes from the Buddhist literature are visibly present in the data. The relative ease of finding these topics bodes well for the possibility of training classifiers and/or tagging data for retrieval augmented generation.

The presence of relatively large clusters suggests the possibility of continuing to drill down to smaller datasets. However, I believe it is worthwhile to first, complete a more thorough cleaning of the dataset, and second, determine (if only at a reasonably high level) the objectives of more specific topic modeling whether to create an ontology for the data, attempt to determine specific source texts, etc.

Problem Data

Junk Data

Some elements of the Buddhist material continue to be plagued by numbers or roman numerals which lead to undesirably noisy results. Examples of this include the ‘Numbers’ clusters that are dangling from the bottom of the main mass in the visualization.

Fiction

Despite the attempt to exclude irrelevant materials, the dataset still contained some texts from identifiable pieces of reasonably contemporary fiction.

Elements of the dataset whose source text was readily identifiable are given as [Author, Title]. For example, [Dickens, A Christmas Carol] contains sentences from ‘A Christmas Carol’ by Charles Dickens.

Examples in the dataset can be seen in the bottom right side of the main mass in the visualization above.

Methods

As in previous posts on topic modeling, the Python library used to execute this pipeline is easy_text_clustering

Sentences were embedded as vectors using sentence-transformers/all-MiniLM-L6-v2

These embeddings were projected into two-dimensions using the UMAP algorithm.

The two-dimensional data was then clustered using the HDBSCAN algorithm.

A random set of 10 samples from each cluster were then fed to mistralai/Mixtral-8x7B-Instruct-v0.1 for summarization using the following prompt: “Use three words total (comma separated)to describe general topics in above texts. Under no circumstances use enumeration. Example format: Tree, Cat, Fireman”

These cluster labels were then edited manually to be more reliably descriptive, though some labels remain less than meaningful, reflecting the lack of cohesive theme in the texts in the cluster.

The clusters and the summary labels were then plotted using Plotly.

Topic		Replies	Views
Modeling the Full Translation Dataset 💃🏼 Topic Modeling SIG data-visualization	2	38	November 28, 2024
A First Look at Topic Modeling for the Translation Dataset 💃🏼 Topic Modeling SIG embedding , clusters , dataset , dimension-reduction	2	54	November 28, 2024
Domain Tagging With Unsupervised Clustering for Retrieval Augmented Translation 💃🏼 Topic Modeling SIG docs , clusters , dataset	0	27	December 25, 2024
Using Topic Modeling to Cluster User Translation (BERTopic) 💃🏼 Topic Modeling SIG docs , pandas , bertopic , data-cleaning , data-visualization	0	107	November 28, 2024
Toward a Cleaner Translation Dataset 💃🏼 Topic Modeling SIG data-cleaning , translate	0	71	November 3, 2024

Topic Modeling Buddhist Material in the Translation Dataset

Topic Modeling Buddhist Material in the Translation Dataset

Brief Analysis

Problem Data

Junk Data

Fiction

Methods

Related topics