What?
This project investigates how Tibetan relates to other languages, using multilingual datasets such as FLORES-200. By comparing Tibetan with the other languages in the data, we aim to identify its “language friends,” with Gemma2 (our fine-tuned translation expert) leading the way.
Inspired by “Investigating Multilingual NMT Representations at Scale.”
The ultimate goal? To improve Tibetan-to-English machine translation through transfer learning.
How?
To uncover Tibetan’s linguistic relatives, I used the following methodology:
- Embedding the Source Text:
  - Used 3000-dimensional embeddings from the fine-tuned Gemma2 model to represent the source text from 200 languages, all translated into English. This ensured a unified semantic space for comparison.
- Clustering Languages:
  - Applied HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to group languages by their similarity in the embedding space. This surfaced clusters of languages that may share structural or semantic traits with Tibetan.
- Visualizing the Relationships:
  - Reduced the embeddings’ dimensionality with UMAP (Uniform Manifold Approximation and Projection) for visualization. The resulting 2D or 3D projection makes it easier to spot Tibetan’s “language friends.”
Result
Here are Tibetan’s closest relatives, with their distances to Tibetan (smaller means closer):
- Dzongkha (25.289)
- Bengali (85.750) (Surprisingly)
- Thai (102.783)
- Gujarati (103.572) (Surprisingly)
- Armenian (109.434)
- Burmese (Myanmar) (112.121)
- Georgian (118.850)
- Sinhala (120.851)
- Lao (121.812)
- Santali (123.460)
- Telugu (123.709)
- Assamese (124.945)
- Kannada (126.392)
- Malayalam (126.626)
- Kazakh (130.821)
- Khmer (131.717)
- Tamasheq (Tamazight, Berber) (133.532)
- Odia (Oriya) (134.090)
- Central Atlas Tamazight (Berber) (139.575)
Chinese isn’t close to Tibetan (which is good).
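A ranking like the one above can be produced by measuring the distance from Tibetan’s vector to every other language’s vector and sorting. The sketch below assumes Euclidean distance and one vector per language; the post does not state which metric or space was used, so treat both as assumptions. The vectors and language names are toy placeholders, not real data.

```python
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_neighbors(vectors, target, top_k=None):
    """Return (language, distance) pairs sorted by distance to `target`."""
    anchor = vectors[target]
    ranked = sorted(
        ((name, euclidean(anchor, vec)) for name, vec in vectors.items() if name != target),
        key=lambda pair: pair[1],
    )
    return ranked if top_k is None else ranked[:top_k]


# Toy 3-dimensional vectors, made up purely for illustration.
toy_vectors = {
    "Tibetan": [0.0, 0.0, 0.0],
    "Language A": [3.0, 4.0, 0.0],
    "Language B": [1.0, 2.0, 2.0],
    "Language C": [6.0, 0.0, 8.0],
}

for name, dist in rank_neighbors(toy_vectors, "Tibetan"):
    print(f"{name} ({dist:.3f})")
# prints:
# Language B (3.000)
# Language A (5.000)
# Language C (10.000)
```

If the distances were instead measured in the UMAP projection, swapping in the projected coordinates would give a different ranking, which is one thing worth checking when a neighbor looks surprising.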
I’m curious to hear your thoughts on these results! I don’t have deep expertise in linguistics, so I’m open to discussion about whether this clustering makes sense or not. Some of these connections (like Dzongkha and Burmese) seem expected, while others (like Bengali and Gujarati) are surprising to me. Does this clustering align with known linguistic or cultural relationships, or could these distances reflect something unexpected in the embeddings or data?
Looking forward to your insights!