Guided by Guru Gemma2: Exploring Tibetan's Language Relatives

What?

This project investigates the relationship between Tibetan and its linguistic relatives using datasets like FLORES-200. By exploring the connections between Tibetan and other languages, we aim to identify “language friends,” with Gemma2 (our fine-tuned translation expert) leading the way.
Inspired by “Investigating Multilingual NMT Representations at Scale.”
The ultimate goal? To improve Tibetan-to-English machine translation through transfer learning.

How?

To uncover Tibetan’s linguistic relatives, I used the following methodology:

  1. Embedding the Source Text:

    • Leveraged fine-tuned Gemma2 embeddings with 3000 dimensions to represent the source text from 200 languages, all translated into English. This ensured a unified semantic space for comparison.
  2. Clustering Languages:

    • Applied HDBSCAN (Hierarchical Density-Based Spatial Clustering) to group languages based on their semantic similarity in the embedding space. This allowed for discovering clusters of languages that share structural or semantic traits with Tibetan.
  3. Visualizing the Relationships:

    • Reduced the embeddings’ dimensionality using UMAP (Uniform Manifold Approximation and Projection) for visualization. This provided an intuitive 2D or 3D representation of the language relationships, making it easier to identify Tibetan’s “language friends.”

Result

Here are Tibetan’s closest relatives, with their distances to Tibetan:

  • Dzongkha (25.289)
  • Bengali (85.750) (Surprisingly)
  • Thai (102.783)
  • Gujarati (103.572) (Surprisingly)
  • Armenian (109.434)
  • Burmese (Myanmar) (112.121)
  • Georgian (118.850)
  • Sinhala (120.851)
  • Lao (121.812)
  • Santali (123.460)
  • Telugu (123.709)
  • Assamese (124.945)
  • Kannada (126.392)
  • Malayalam (126.626)
  • Kazakh (130.821)
  • Khmer (131.717)
  • Tamasheq (Tamazight, Berber) (133.532)
  • Odia (Oriya) (134.090)
  • Central Atlas Tamazight (Berber) (139.575)
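A ranking like the one above can be produced with a few lines of code. The sketch below assumes one vector per language and Euclidean distance; the post does not state which metric was used, so treat that as an assumption. The function name `rank_by_distance` and the toy 3-dimensional vectors are invented for illustration and are not real embeddings.

```python
import math


def rank_by_distance(target, vectors):
    """Sort the other languages by Euclidean distance to `target`, closest first."""
    anchor = vectors[target]
    distances = {
        name: math.dist(anchor, vec)
        for name, vec in vectors.items()
        if name != target
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Toy vectors, invented for illustration only (NOT real embeddings).
toy_vectors = {
    "Tibetan": [0.0, 0.0, 0.0],
    "Language A": [3.0, 4.0, 0.0],
    "Language B": [1.0, 2.0, 2.0],
    "Language C": [6.0, 8.0, 0.0],
}

ranking = rank_by_distance("Tibetan", toy_vectors)
for name, dist in ranking:
    print(f"{name} ({dist:.3f})")
```

With real embeddings, each language's vector would first have to be reduced to a single point (for example by averaging its sentence embeddings), which is another choice the post does not specify.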

Chinese isn’t close to Tibetan (which is good).
I’m curious to hear your thoughts on these results! I don’t have deep expertise in linguistics, so I’m open to discussion about whether this clustering makes sense or not. Some of these connections (like Dzongkha and Burmese) seem expected, while others (like Bengali and Gujarati) are surprising to me. Does this clustering align with known linguistic or cultural relationships, or could these distances reflect something unexpected in the embeddings or data?

Looking forward to your insights!


Great work! Very nice to see this!

As for the accuracy of the resulting clusters, I think it all depends on the level of detail you are considering and on the criteria used for the comparison.

I am most surprised to find Armenian, which is a language that is not related to Tibetan. Linguistic reconstructions hypothesize that the Tibetan plateau was first populated by tribes traveling towards China, some of which settled Tibet by climbing up from the Chengdu side, while the rest continued into China. So languages such as Kazakh, Tamasheq, and Tamazight seem to correspond very roughly to that hypothesis.

From the point of view of actually related languages, I only see Dzongkha, then Burmese. Most of the others may be geographically close to Tibet, but they are nonetheless far from the Tibetan language.

Then, the other way of comparing languages is to compare individual traits, such as clustering all languages that have singular/plural agreement for verbs versus those that don’t, or all languages that conjugate verbs versus those that don’t. Or even grouping together all languages that build sentences in “Subject Object Verb” order.

It looks to me like that is what we are seeing here: the clustered languages must share some traits with the Tibetan language. As for which traits, and how relevant they are for comparing languages, I have no idea. I would expect to find Nepali in such a comparison, but for some reason it does not appear.

The Tibetan language has integrated some features of Sanskrit through the texts that were imported from India and translated, which might explain why there are many Indian languages in your clusters.

Finally, on a side note, comparing translations of these 200 languages (instead of actual texts in the respective languages) introduces a HUGE bias and a large source of uncontrolled variation, which would make your results unusable in a linguistic research context.

Yet it is interesting to see what can be done with this methodology. Well done! Thank you for your efforts to keep pushing the Tibetan language forward!

Thank you for your valuable insights!
Sanskrit and Nepali are pretty close to each other, but Tibetan isn’t close to either of them for some reason.

It is true that people who speak both languages don’t think these two languages are close (me included), but when I started studying the Tibetan language at university with a teacher who was a specialist in Nepali, we found many structural similarities. One such similarity is the semi-ergative trait of both Tibetan and Nepali. In other words, certain verbs in Tibetan require the use of བྱེད་སྒྲ།, but dropping it is allowed in the past tense. Nepali has exactly the same thing.

What your results showed is that this specific trait of both languages was not taken into account in the comparison. This is due to the fact that English language does not use བྱེད་སྒྲ།, and since you compared English translations of the languages, this similarity was left aside.

I actually meant that Tibetan is not very close to Sanskrit or Nepali in the results I got, which was surprising to me too. I encoded the source text that is to be translated into English, wrapped in a translation prompt template that looks like this:
"Please translate the following text into English: {SOURCE TEXT} Translation :"
I didn’t encode the translated English version.
The clusters might have formed based on features of the source text that help in translation, or on features the model looks for while translating into English; that’s my intuition. Let me know what you think about that.
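To make the setup concrete, here is a minimal sketch of how each source sentence might be wrapped in that template before being fed to the model. The names `PROMPT_TEMPLATE` and `build_prompt` are made up for this example, and the placeholder is renamed to `source_text` so it works with Python's `str.format`. How the model's hidden states are then pooled into a single embedding is not shown.

```python
# Template text copied from the post; the placeholder is renamed for str.format.
PROMPT_TEMPLATE = (
    "Please translate the following text into English: {source_text} Translation :"
)


def build_prompt(source_text):
    """Wrap one source-language sentence in the translation prompt."""
    return PROMPT_TEMPLATE.format(source_text=source_text.strip())


# Example with a short Tibetan greeting as the source text.
prompt = build_prompt("  བཀྲ་ཤིས་བདེ་ལེགས།  ")
print(prompt)
```

The point is that the model only ever sees the original-language text plus an instruction; no English translation is part of the encoded input.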

ahh very interesting! :smiley:
This post got me really interested in how Bengali is related to the Tibetan language. I found that Bengali and Tibetan are similar in a few ways, and I remember that some of my friends from the North East [Assam and Manipur], who spoke Bengali and Meitei, had words similar to ones in our language.

  • Script: Bengali and Tibetan scripts are considered sister scripts, and both share a lineage back to the Siddhaṃ script. Bhutan’s official language, Dzongkha, also uses the Tibetan script, which is consistent with the results above.
  • Geography: The Tibet and Bengal regions are geographically close.
  • Buddhism: Ancient Bengali Buddhists were instrumental in the creation of Tibetan Buddhism and Tibetan civilization.

Here are some other facts about Bengali and Tibetan:

Language family
Bengali is an Indo-European language, but it has also been influenced by other language families in South Asia.

Tibetan language
Tibetan is spoken in Tibet, parts of China, northern Pakistan, Nepal, Bhutan, and parts of India. It has an alphabet of thirty letters, no punctuation, and its own unique counting system.

Tibetan script
The Tibetan script’s major offshoot is the 'Phags-pa script, which was created to serve as a universal script for the major languages of the Mongol Great Khan’s empire.
Visit the link below to dig deeper if you would like! After all, it was an interesting fun fact!

https://en.m.wikipedia.org/wiki/Tibetan_script

[1] https://worldfamilyofbandana.quora.com/How-are-Bengali-and-Tibetan-script-similar
[2] https://www.ancient-origins.net/history-famous-people/ancient-bengali-buddhism-0014006
[3] https://www.britannica.com/topic/Bengali-language
[4] https://asiapacific.anu.edu.au/language-tibetan
[5] Tibetan - an overview | ScienceDirect Topics
