Tibetan Embedding Models

Embedding Models for Tibetan Text: Production-Ready Options

Only a handful of embedding models explicitly support Tibetan script, reflecting its status as an extremely low-resource language: it makes up less than 0.01% of CommonCrawl, has only ~8,000 Wikipedia articles, and suffers a roughly 4x tokenization penalty in standard multilingual tokenizers. This report covers every model with confirmed or plausible Tibetan support, from large self-hosted options down to lightweight API calls, along with Tibetan-specific NLP models that can serve as embedding backbones.


Models with Confirmed Tibetan Support

1. BGE-M3 (BAAI) – Best Overall Choice

Tibetan support: Yes, explicitly included ("bo" in training data)
Parameters: 568M
Embedding dimensions: 1024
Max context: 8,192 tokens
License: MIT (full commercial use)
Access: HuggingFace BAAI/bge-m3; Ollama bge-m3
Retrieval modes: dense + sparse + ColBERT (multi-vector)
Self-hosting GPU: T4 16GB minimum (FP16)
Architecture: XLM-RoBERTa backbone with Tibetan pairs added during fine-tuning

BGE-M3 is the strongest open-source option. It is the only general-purpose open embedding model that explicitly adds Tibetan language pairs during training, overcoming the limitation of the XLM-R backbone (which excludes Tibetan from its 100-language CC-100 pretraining). The three retrieval modes (dense, sparse, and ColBERT) make it versatile across semantic search, keyword-style retrieval, and fine-grained passage matching.
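At query time the dense and sparse signals can be fused with a weighted sum. A minimal sketch of that scoring step in plain NumPy, with illustrative fusion weights; the vectors and per-token weights would come from BGE-M3's encoder, and nothing here is the model's own API:

```python
import numpy as np

def dense_score(q_vec, d_vec):
    # Cosine similarity between query and document dense vectors.
    return float(np.dot(q_vec, d_vec) /
                 (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def sparse_score(q_weights, d_weights):
    # Lexical overlap: sum the products of weights for shared tokens.
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def hybrid_score(q_vec, d_vec, q_weights, d_weights,
                 w_dense=0.6, w_sparse=0.4):
    # Weighted fusion of the two signals; the 0.6/0.4 split is an
    # illustrative starting point, not a tuned value.
    return (w_dense * dense_score(q_vec, d_vec) +
            w_sparse * sparse_score(q_weights, d_weights))
```

In practice the fusion weights should be tuned on a held-out set of Tibetan query-document pairs.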


2. Cohere embed-multilingual-v3 – Best API Option

Tibetan support: Yes, explicitly listed ("bo" in supported languages)
Parameters: undisclosed
Embedding dimensions: 1024
Max context: 512 tokens
License: proprietary (API-only)
Access: Cohere API
API price: $0.10 per million tokens

The most convenient option for teams that want confirmed Tibetan support without running infrastructure. The 512-token context limit is a meaningful constraint for longer passages: after the tokenization penalty, each API call covers roughly one to two Tibetan sentences' worth of content.
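One workable pattern for longer texts is to split on the shad (།) sentence delimiter and greedily pack sentences under a token budget before each API call. A sketch, where the caller-supplied `count_tokens` callable and the 480-token budget are illustrative assumptions rather than part of Cohere's API:

```python
def chunk_tibetan(text, count_tokens, budget=480):
    # Split on the shad (།) delimiter, keeping the mark on each sentence,
    # then greedily pack sentences into chunks under the token budget.
    sentences = [s.strip() + "།" for s in text.split("།") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = current + sent
        if current and count_tokens(candidate) > budget:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping the budget below 512 leaves headroom, since the exact count depends on the provider's tokenizer.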

Cohere embed-v4, the newer model, offers 1536 dimensions, 128K context, and costs $0.12/MTok. It likely inherits Tibetan support from v3, but this should be verified with Cohere before committing.


3. LaBSE (Google) – Lightweight Self-Hosted Option

Tibetan support: Yes, explicitly included (109 languages including "bo")
Parameters: 471M
Embedding dimensions: 768
Max context: 256 tokens
License: Apache-2.0
Access: HuggingFace sentence-transformers/LaBSE
Self-hosting GPU: runs on CPU (INT8 ONNX) or any GPU

LaBSE's strength is its broad language coverage and lightweight deployment: it runs on CPU with ONNX quantization. The severe 256-token limit restricts it to sentence-level tasks such as similarity matching, short-text classification, and deduplication; it is not suitable for document-level retrieval or long-passage RAG, and it is outperformed by MITRA-E on Buddhist text benchmarks.


4. SONAR (Meta) – Research-Grade Multilingual

Tibetan support: Yes, explicitly included (bod_Tibt via the NLLB-200 language set)
Parameters: ~600M+
Embedding dimensions: 1024
Max context: sentence-level (designed for individual sentences)
License: mixed (some components non-commercial)
Access: GitHub facebookresearch/SONAR
Self-hosting GPU: requires GPU; complex multi-component deployment

SONAR covers 200 languages through Meta's NLLB framework and produces high-quality sentence-level embeddings. The trade-offs are significant: you must specify the language code (bod_Tibt) at inference time, deployment involves multiple model components (encoder plus tokenizer per language family), and some components carry non-commercial license restrictions. It is best suited to research contexts or cross-lingual retrieval where you need to map Tibetan against dozens of other languages simultaneously.


5. OpenAI Embeddings – Plausible but Unverified

text-embedding-3-small: 1,536 dims; 8,191-token context; $0.02/MTok ($0.01 batch)
text-embedding-3-large: 3,072 dims; 8,191-token context; $0.13/MTok ($0.065 batch)

Tibetan support: not confirmed, but plausible
License: proprietary (API-only)
Access: OpenAI API

OpenAI does not publish a language list for its embedding models. The BPE tokenizer handles Tibetan Unicode at the byte level, so it will produce embeddings, but whether those embeddings carry meaningful Tibetan semantics is unknown. The extremely low pricing ($0.02/MTok for small, $0.01 batch) makes it worth testing empirically on your data. If quality proves acceptable, this is the cheapest API option by a wide margin.

Recommendation: Run a quick evaluation on 100–200 Tibetan sentence pairs with known similarity before committing.
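A minimal harness for that check: embed each pair, take cosine similarity, and rank-correlate against human judgments with Spearman's coefficient. The `embed` callable is a stand-in for whichever model or API client you are evaluating:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(xs, ys):
    # Spearman rank correlation via Pearson on ordinal ranks
    # (no tie averaging; adequate for a quick sanity check).
    def ranks(v):
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(1, len(v) + 1)
        return r - np.mean(r)
    rx, ry = ranks(xs), ranks(ys)
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate(embed, pairs, gold_scores):
    # pairs: [(sent_a, sent_b), ...]; gold_scores: human similarity ratings.
    sims = [cosine(embed(a), embed(b)) for a, b in pairs]
    return spearman(sims, gold_scores)
```

Run the same pairs through each candidate model; the one with the highest correlation against your human ratings wins.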


Tibetan-Specific & Domain-Specific Models

These are not general-purpose embedding models out of the box, but they offer Tibetan-native architectures that can be fine-tuned or adapted for embedding tasks.

6. Gemma 2 MITRA-E – Best for Buddhist Texts

Tibetan support: Yes, purpose-built for Buddhist texts (Pāli, Sanskrit, Chinese, Tibetan)
Parameters: 9B
Training data: 1.74M parallel Buddhist text pairs
License: Gemma license (restricted)
Access: HuggingFace buddhist-nlp (gated access)
Self-hosting GPU: A100 40GB minimum
Performance: outperforms BGE-M3 and LaBSE on a 7-task Buddhist semantic benchmark

If your primary corpus is classical Buddhist literature (Kangyur, Tengyur, commentaries), MITRA-E is the highest-quality option available. The 9B parameter count makes it impractical for lightweight deployment (expect ~$0.40/MTok self-hosted on an A100), but for specialized Buddhist digital humanities work the quality advantage may justify the cost.


7. TiBERT (CMLI-NLP) – Tibetan-Native Backbone

Tibetan support: Yes, monolingual Tibetan model
Parameters: ~110M (BERT-base scale)
Vocabulary: 30,005 Tibetan words (99.95% corpus coverage)
License: research
Access: HuggingFace CMLI-NLP/TiBERT
Self-hosting GPU: any GPU or CPU

TiBERT is a BERT-base model pretrained exclusively on Tibetan text with a purpose-built SentencePiece vocabulary. It achieves state-of-the-art results on Tibetan text classification, outperforming multilingual models by 3-5 F1 points. It is not a sentence embedding model (it produces token-level representations), but it can serve as a backbone for training a Tibetan sentence transformer using frameworks like sentence-transformers. The Tibetan-native tokenizer eliminates the 4x tokenization penalty that cripples multilingual models.
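The usual recipe for that conversion is attention-masked mean pooling over the encoder's last hidden states, the default in many sentence-transformers setups. A NumPy sketch of just the pooling step; in practice the inputs would come from running TiBERT through the transformers library:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden) last-layer states for one sequence;
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    emb = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    summed = (emb * mask).sum(axis=0)
    count = max(mask.sum(), 1e-9)  # guard against an all-zero mask
    return summed / count
```

Masked pooling matters because padded positions would otherwise drag the sentence vector toward the padding embedding.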


8. UTibetNLP/tibetan_bert – Alternative Tibetan Backbone

Tibetan support: Yes, monolingual Tibetan model
Parameters: ~110M
License: research
Access: HuggingFace UTibetNLP/tibetan_bert
Task: Tibetan news classification (~86% accuracy)

Another monolingual Tibetan BERT variant, focused on news classification. It has similar potential to TiBERT for fine-tuning into a sentence embedding model but is less documented.


9. CINO (HIT-iFLYTEK) – Chinese Minority Languages

Tibetan support: Yes, covers Tibetan, Mongolian, Uyghur, Kazakh, Korean, and Zhuang alongside Chinese
Parameters: ~110M (base), ~330M (large)
License: research
Access: HuggingFace / GitHub
Advantage: cross-lingual between Tibetan and Chinese

CINO is designed for Chinese minority languages and offers cross-lingual capability between Tibetan and Chinese. Useful if your pipeline involves Tibetan-Chinese parallel texts or if you need embeddings that align both languages.


Cost Comparison

API Pricing (Tibetan-viable models only)

OpenAI text-embedding-3-small: $0.02/MTok ($0.01 batch); Tibetan unverified; 8,191-token context; budget API if quality checks out
Cohere embed-multilingual-v3: $0.10/MTok; Tibetan confirmed; 512-token context; confirmed Tibetan with no infrastructure needed
Cohere embed-v4: $0.12/MTok; Tibetan likely; 128K context; longer-context needs
OpenAI text-embedding-3-large: $0.13/MTok ($0.065 batch); Tibetan unverified; 8,191-token context; higher dimensionality if quality checks out

Self-Hosting Economics

Budget (spot): BGE-M3 FP16 on a T4 spot instance; ~$130/month; ~$0.01/MTok effective; ~15-22M tok/hr
Budget: BGE-M3 FP16 on a T4 (g4dn.xlarge); ~$384/month; ~$0.03/MTok effective; ~15-22M tok/hr
CPU-only: LaBSE INT8 ONNX on a c5.2xlarge; ~$250/month; ~$0.05/MTok effective; ~5-10M tok/hr
Standard: BGE-M3 FP16 on an A10G (g5.xlarge); ~$734/month; ~$0.03/MTok effective; ~30-45M tok/hr
Specialized: MITRA-E 9B on an A100 40GB; ~$1,460/month; ~$0.40/MTok effective; ~3-5M tok/hr

Break-even analysis: Self-hosting BGE-M3 on a T4 (g4dn.xlarge, ~$384/month) beats Cohere API pricing ($0.10/MTok) at approximately 3.8 billion tokens/month. Against OpenAI text-embedding-3-small pricing ($0.02/MTok), break-even rises to ~19 billion tokens/month, and higher still against batch pricing. For most Tibetan text projects, API is more economical unless you are processing very large corpora (e.g., the full TIB-STC at 11B+ tokens).
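The break-even figure is just the monthly hosting cost divided by the API's per-MTok price; a quick check of the table's numbers (all prices approximate, as quoted above):

```python
def breakeven_tokens(monthly_gpu_cost_usd, api_price_per_mtok_usd):
    # Monthly token volume at which self-hosting costs the same as the API.
    return monthly_gpu_cost_usd / api_price_per_mtok_usd * 1_000_000

# T4 (g4dn.xlarge) at ~$384/month vs Cohere at $0.10/MTok:
# 384 / 0.10 = 3,840 MTok, i.e. roughly 3.8 billion tokens/month.
```

The same formula, applied against any of the API prices above, tells you whether your projected monthly volume justifies a GPU.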


Essential Tibetan NLP Preprocessing

Regardless of which model you choose, preprocessing Tibetan text with a word segmenter is critical. Standard multilingual tokenizers fragment Tibetan syllables aggressively: a word like བྱང་ཆུབ་སེམས་དཔའ (bodhisattva, 4 syllables) gets split into 8-16 subword tokens. Proper word segmentation before model tokenization improves representation quality and reduces token consumption.
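Syllable boundaries are explicit in the script: syllables are separated by the tsheg mark (་, U+0F0B). That makes a rough syllable counter a one-liner, handy for measuring how badly a given tokenizer inflates Tibetan input relative to its true length:

```python
def tibetan_syllables(text):
    # Treat the shad (།) as a break too, then split on the tsheg (་).
    return [s for s in text.replace("།", "་").split("་") if s.strip()]

# Comparing len(tibetan_syllables(text)) with a tokenizer's subword count
# for the same text gives the inflation factor directly.
```

This is only a heuristic; Botok's dictionary-based segmentation is what you want for actual preprocessing, since Tibetan words often span multiple syllables.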

Botok (OpenPecha): leading Tibetan word segmenter with dictionary lookup and POS tagging; Apache-2.0; pip install botok, GitHub OpenPecha/Botok
ACTib corpus: 170M words of annotated Classical Tibetan (BDRC collections); research license; via BDRC/OpenPecha
TIB-STC: 11B+ tokens of structured Tibetan text (literature 66%, web 24%, media 10%); research license; arXiv:2503.18288
FastText Tibetan: 100-dim Classical Tibetan word vectors (90K+ tokens); open license; Zenodo

Recommended Pipeline

Tibetan raw text
    β†’ Botok word segmentation
    β†’ BGE-M3 (self-hosted) or Cohere API
    β†’ Vector database (Qdrant, Milvus, Weaviate, pgvector)
    β†’ Semantic search / RAG / clustering

Step 1: Preprocess with Botok to segment Tibetan text into linguistically meaningful word units.

Step 2: Embed with BGE-M3 (best quality, self-hosted) or Cohere embed-multilingual-v3 (best convenience, API). Test OpenAI text-embedding-3-small as a budget alternative.

Step 3: Before committing to any model at scale, benchmark on 200-500 Tibetan sentence pairs with known semantic relationships from your actual corpus. No standardized Tibetan embedding benchmarks exist; your evaluation on your own data is the only reliable quality signal.
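The steps above can be wired together with each stage injected as a callable, so the segmenter, embedder, and vector store all stay swappable. This is an illustrative sketch rather than any library's API; for example, a Botok-based segmenter could supply `segment`, and a BGE-M3 or Cohere client could supply `embed`:

```python
def embed_corpus(texts, segment, embed, store):
    # segment: text -> list of words (e.g. a Botok-based segmenter);
    # embed: text -> vector (BGE-M3, an API client, ...);
    # store: (text, vector) -> None (e.g. a Qdrant/Milvus/pgvector upsert).
    vectors = []
    for text in texts:
        words = segment(text)
        vector = embed(" ".join(words))
        store(text, vector)
        vectors.append(vector)
    return vectors
```

Because each stage is injected, swapping Cohere for self-hosted BGE-M3 later means changing one callable, not the pipeline.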


Summary Ranking

1. BGE-M3 (self-hosted): only open model with confirmed Tibetan, three retrieval modes, and 8K context. Best for all-purpose production use.
2. Cohere embed-multilingual-v3 (API): confirmed Tibetan, zero infrastructure. Best for teams without GPU infrastructure.
3. LaBSE (self-hosted): confirmed Tibetan, runs on CPU, Apache-2.0. Best for sentence-level tasks and budget deployments.
4. SONAR (self-hosted): 200 languages, strong cross-lingual alignment. Best for research and cross-lingual retrieval.
5. OpenAI embeddings (API): cheapest API ($0.02/MTok) but Tibetan unverified. Budget option pending quality validation.
6. MITRA-E (self-hosted): highest quality on Buddhist texts. Best for classical Buddhist literature specialists.
7. TiBERT / CINO (fine-tune): Tibetan-native tokenization, highest potential ceiling. Best for teams with ML capacity to build custom embeddings.