Tibetan Embedding Models

Embedding Models for Tibetan Text: Production-Ready Options

Only a handful of embedding models explicitly support Tibetan script, reflecting its status as an extremely low-resource language: it makes up less than 0.01% of CommonCrawl, has only ~8,000 Wikipedia articles, and suffers a roughly 4x tokenization penalty in standard multilingual tokenizers. This report covers every model with confirmed or plausible Tibetan support, from large self-hosted options down to lightweight API calls, along with Tibetan-specific NLP models that can serve as embedding backbones.


Models with Confirmed Tibetan Support

1. BGE-M3 (BAAI) – Best Overall Choice

Tibetan support: Yes, explicitly included ("bo" in training data)
Parameters: 568M
Embedding dimensions: 1024
Max context: 8,192 tokens
License: MIT (full commercial use)
Access: HuggingFace BAAI/bge-m3; Ollama bge-m3
Retrieval modes: dense + sparse + ColBERT (multi-vector)
Self-hosting GPU: T4 16GB minimum (FP16)
Architecture: XLM-RoBERTa backbone with Tibetan pairs added during fine-tuning

BGE-M3 is the strongest open-source option. It is the only general-purpose open embedding model that explicitly adds Tibetan language pairs during training, overcoming the limitation of the XLM-R backbone (which excludes Tibetan from its 100-language CC-100 pretraining). The three retrieval modes (dense, sparse, and ColBERT) make it versatile across semantic search, keyword-style retrieval, and fine-grained passage matching.
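At query time the dense and sparse signals can be fused with a weighted sum. A minimal sketch of that scoring step in plain NumPy, with illustrative fusion weights; the vectors and per-token weights would come from BGE-M3's encoder, and nothing here is the model's own API:

```python
import numpy as np

def dense_score(q_vec, d_vec):
    # Cosine similarity between query and document dense vectors.
    return float(np.dot(q_vec, d_vec) /
                 (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def sparse_score(q_weights, d_weights):
    # Lexical overlap: sum the products of weights for shared tokens.
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def hybrid_score(q_vec, d_vec, q_weights, d_weights,
                 w_dense=0.6, w_sparse=0.4):
    # Weighted fusion of the two signals; the 0.6/0.4 split is an
    # illustrative starting point, not a tuned value.
    return (w_dense * dense_score(q_vec, d_vec) +
            w_sparse * sparse_score(q_weights, d_weights))
```

In practice the fusion weights should be tuned on a held-out set of Tibetan query-document pairs.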


2. Cohere embed-multilingual-v3 – Best API Option

Tibetan support: Yes, explicitly listed ("bo" in supported languages)
Parameters: undisclosed
Embedding dimensions: 1024
Max context: 512 tokens
License: proprietary (API-only)
Access: Cohere API
API price: $0.10 per million tokens

The most convenient option for teams that want confirmed Tibetan support without running infrastructure. The 512-token context limit is a meaningful constraint for longer passages: after the tokenization penalty, each API call covers roughly one to two Tibetan sentences' worth of content.
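One workable pattern for longer texts is to split on the shad (།) sentence delimiter and greedily pack sentences under a token budget before each API call. A sketch, where the caller-supplied `count_tokens` callable and the 480-token budget are illustrative assumptions rather than part of Cohere's API:

```python
def chunk_tibetan(text, count_tokens, budget=480):
    # Split on the shad (།) delimiter, keeping the mark on each sentence,
    # then greedily pack sentences into chunks under the token budget.
    sentences = [s.strip() + "།" for s in text.split("།") if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        candidate = current + sent
        if current and count_tokens(candidate) > budget:
            chunks.append(current)
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Keeping the budget below 512 leaves headroom, since the exact count depends on the provider's tokenizer.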

Cohere embed-v4, the newer model, offers 1536 dimensions, 128K context, and costs $0.12/MTok. It likely inherits Tibetan support from v3, but this should be verified with Cohere before committing.


3. LaBSE (Google) – Lightweight Self-Hosted Option

Tibetan support: Yes, explicitly included (109 languages including "bo")
Parameters: 471M
Embedding dimensions: 768
Max context: 256 tokens
License: Apache-2.0
Access: HuggingFace sentence-transformers/LaBSE
Self-hosting GPU: runs on CPU (INT8 ONNX) or any GPU

LaBSE's strength is its broad language coverage and lightweight deployment: it runs on CPU with ONNX quantization. The severe 256-token limit restricts it to sentence-level tasks such as similarity matching, short-text classification, and deduplication; it is not suitable for document-level retrieval or long-passage RAG, and it is outperformed by MITRA-E on Buddhist text benchmarks.


4. SONAR (Meta) – Research-Grade Multilingual

Tibetan support: Yes, explicitly included (bod_Tibt via the NLLB-200 language set)
Parameters: ~600M+
Embedding dimensions: 1024
Max context: sentence-level (designed for individual sentences)
License: mixed (some components non-commercial)
Access: GitHub facebookresearch/SONAR
Self-hosting GPU: requires GPU; complex multi-component deployment

SONAR covers 200 languages through Meta's NLLB framework and produces high-quality sentence-level embeddings. The trade-offs are significant: you must specify the language code (bod_Tibt) at inference time, deployment involves multiple model components (encoder plus tokenizer per language family), and some components carry non-commercial license restrictions. It is best suited to research contexts or cross-lingual retrieval where you need to map Tibetan against dozens of other languages simultaneously.


5. OpenAI Embeddings – Plausible but Unverified

text-embedding-3-small: 1,536 dims; 8,191-token context; $0.02/MTok ($0.01 batch)
text-embedding-3-large: 3,072 dims; 8,191-token context; $0.13/MTok ($0.065 batch)

Tibetan support: not confirmed, but plausible
License: proprietary (API-only)
Access: OpenAI API

OpenAI does not publish a language list for its embedding models. The BPE tokenizer handles Tibetan Unicode at the byte level, so it will produce embeddings, but whether those embeddings carry meaningful Tibetan semantics is unknown. The extremely low pricing ($0.02/MTok for small, $0.01 batch) makes it worth testing empirically on your data. If quality proves acceptable, this is the cheapest API option by a wide margin.

Recommendation: Run a quick evaluation on 100–200 Tibetan sentence pairs with known similarity before committing.
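A minimal harness for that check: embed each pair, take cosine similarity, and rank-correlate against human judgments with Spearman's coefficient. The `embed` callable is a stand-in for whichever model or API client you are evaluating:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(xs, ys):
    # Spearman rank correlation via Pearson on ordinal ranks
    # (no tie averaging; adequate for a quick sanity check).
    def ranks(v):
        r = np.empty(len(v))
        r[np.argsort(v)] = np.arange(1, len(v) + 1)
        return r - np.mean(r)
    rx, ry = ranks(xs), ranks(ys)
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def evaluate(embed, pairs, gold_scores):
    # pairs: [(sent_a, sent_b), ...]; gold_scores: human similarity ratings.
    sims = [cosine(embed(a), embed(b)) for a, b in pairs]
    return spearman(sims, gold_scores)
```

Run the same pairs through each candidate model; the one with the highest correlation against your human ratings wins.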


Tibetan-Specific & Domain-Specific Models

These are not general-purpose embedding models out of the box, but they offer Tibetan-native architectures that can be fine-tuned or adapted for embedding tasks.

6. Gemma 2 MITRA-E – Best for Buddhist Texts

Tibetan support: Yes, purpose-built for Buddhist texts (Pāli, Sanskrit, Chinese, Tibetan)
Parameters: 9B
Training data: 1.74M parallel Buddhist text pairs
License: Gemma license (restricted)
Access: HuggingFace buddhist-nlp (gated access)
Self-hosting GPU: A100 40GB minimum
Performance: outperforms BGE-M3 and LaBSE on a 7-task Buddhist semantic benchmark

If your primary corpus is classical Buddhist literature (Kangyur, Tengyur, commentaries), MITRA-E is the highest-quality option available. The 9B parameter count makes it impractical for lightweight deployment (expect ~$0.40/MTok self-hosted on an A100), but for specialized Buddhist digital humanities work the quality advantage may justify the cost.


7. TiBERT (CMLI-NLP) – Tibetan-Native Backbone

Tibetan support: Yes, monolingual Tibetan model
Parameters: ~110M (BERT-base scale)
Vocabulary: 30,005 Tibetan words (99.95% corpus coverage)
License: research
Access: HuggingFace CMLI-NLP/TiBERT
Self-hosting GPU: any GPU or CPU

TiBERT is a BERT-base model pretrained exclusively on Tibetan text with a purpose-built SentencePiece vocabulary. It achieves state-of-the-art results on Tibetan text classification, outperforming multilingual models by 3-5 F1 points. It is not a sentence embedding model (it produces token-level representations), but it can serve as a backbone for training a Tibetan sentence transformer using frameworks like sentence-transformers. The Tibetan-native tokenizer eliminates the 4x tokenization penalty that cripples multilingual models.
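The usual recipe for that conversion is attention-masked mean pooling over the encoder's last hidden states, the default in many sentence-transformers setups. A NumPy sketch of just the pooling step; in practice the inputs would come from running TiBERT through the transformers library:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden) last-layer states for one sequence;
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    emb = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[:, None]
    summed = (emb * mask).sum(axis=0)
    count = max(mask.sum(), 1e-9)  # guard against an all-zero mask
    return summed / count
```

Masked pooling matters because padded positions would otherwise drag the sentence vector toward the padding embedding.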


8. UTibetNLP/tibetan_bert – Alternative Tibetan Backbone

Tibetan support: Yes, monolingual Tibetan model
Parameters: ~110M
License: research
Access: HuggingFace UTibetNLP/tibetan_bert
Task: Tibetan news classification (~86% accuracy)

Another monolingual Tibetan BERT variant, focused on news classification. It has similar potential to TiBERT for fine-tuning into a sentence embedding model but is less documented.


9. CINO (HIT-iFLYTEK) – Chinese Minority Languages

Tibetan support: Yes, covers Tibetan, Mongolian, Uyghur, Kazakh, Korean, and Zhuang alongside Chinese
Parameters: ~110M (base), ~330M (large)
License: research
Access: HuggingFace / GitHub
Advantage: cross-lingual between Tibetan and Chinese

CINO is designed for Chinese minority languages and offers cross-lingual capability between Tibetan and Chinese. Useful if your pipeline involves Tibetan-Chinese parallel texts or if you need embeddings that align both languages.


Cost Comparison

API Pricing (Tibetan-viable models only)

OpenAI text-embedding-3-small: $0.02/MTok ($0.01 batch); Tibetan unverified; 8,191-token context; budget API if quality checks out
Cohere embed-multilingual-v3: $0.10/MTok; Tibetan confirmed; 512-token context; confirmed Tibetan with no infrastructure needed
Cohere embed-v4: $0.12/MTok; Tibetan likely; 128K context; longer-context needs
OpenAI text-embedding-3-large: $0.13/MTok ($0.065 batch); Tibetan unverified; 8,191-token context; higher dimensionality if quality checks out

Self-Hosting Economics

Budget (spot): BGE-M3 FP16 on a T4 spot instance; ~$130/month; ~$0.01/MTok effective; ~15-22M tok/hr
Budget: BGE-M3 FP16 on a T4 (g4dn.xlarge); ~$384/month; ~$0.03/MTok effective; ~15-22M tok/hr
CPU-only: LaBSE INT8 ONNX on a c5.2xlarge; ~$250/month; ~$0.05/MTok effective; ~5-10M tok/hr
Standard: BGE-M3 FP16 on an A10G (g5.xlarge); ~$734/month; ~$0.03/MTok effective; ~30-45M tok/hr
Specialized: MITRA-E 9B on an A100 40GB; ~$1,460/month; ~$0.40/MTok effective; ~3-5M tok/hr

Break-even analysis: Self-hosting BGE-M3 on a T4 (g4dn.xlarge, ~$384/month) beats Cohere API pricing ($0.10/MTok) at approximately 3.8 billion tokens/month. Against OpenAI text-embedding-3-small pricing ($0.02/MTok), break-even rises to ~19 billion tokens/month, and higher still against batch pricing. For most Tibetan text projects, API is more economical unless you are processing very large corpora (e.g., the full TIB-STC at 11B+ tokens).
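The break-even figure is just the monthly hosting cost divided by the API's per-MTok price; a quick check of the table's numbers (all prices approximate, as quoted above):

```python
def breakeven_tokens(monthly_gpu_cost_usd, api_price_per_mtok_usd):
    # Monthly token volume at which self-hosting costs the same as the API.
    return monthly_gpu_cost_usd / api_price_per_mtok_usd * 1_000_000

# T4 (g4dn.xlarge) at ~$384/month vs Cohere at $0.10/MTok:
# 384 / 0.10 = 3,840 MTok, i.e. roughly 3.8 billion tokens/month.
```

The same formula, applied against any of the API prices above, tells you whether your projected monthly volume justifies a GPU.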


Essential Tibetan NLP Preprocessing

Regardless of which model you choose, preprocessing Tibetan text with a word segmenter is critical. Standard multilingual tokenizers fragment Tibetan syllables aggressively: a word like བྱང་ཆུབ་སེམས་དཔའ (bodhisattva, 4 syllables) gets split into 8-16 subword tokens. Proper word segmentation before model tokenization improves representation quality and reduces token consumption.
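Syllable boundaries are explicit in the script: syllables are separated by the tsheg mark (་, U+0F0B). That makes a rough syllable counter a one-liner, handy for measuring how badly a given tokenizer inflates Tibetan input relative to its true length:

```python
def tibetan_syllables(text):
    # Treat the shad (།) as a break too, then split on the tsheg (་).
    return [s for s in text.replace("།", "་").split("་") if s.strip()]

# Comparing len(tibetan_syllables(text)) with a tokenizer's subword count
# for the same text gives the inflation factor directly.
```

This is only a heuristic; Botok's dictionary-based segmentation is what you want for actual preprocessing, since Tibetan words often span multiple syllables.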

Botok (OpenPecha): leading Tibetan word segmenter with dictionary lookup and POS tagging; Apache-2.0; pip install botok, GitHub OpenPecha/Botok
ACTib corpus: 170M words of annotated Classical Tibetan (BDRC collections); research license; via BDRC/OpenPecha
TIB-STC: 11B+ tokens of structured Tibetan text (literature 66%, web 24%, media 10%); research license; arXiv:2503.18288
FastText Tibetan: 100-dim Classical Tibetan word vectors (90K+ tokens); open license; Zenodo

Recommended Pipeline

Tibetan raw text
    β†’ Botok word segmentation
    β†’ BGE-M3 (self-hosted) or Cohere API
    β†’ Vector database (Qdrant, Milvus, Weaviate, pgvector)
    β†’ Semantic search / RAG / clustering

Step 1: Preprocess with Botok to segment Tibetan text into linguistically meaningful word units.

Step 2: Embed with BGE-M3 (best quality, self-hosted) or Cohere embed-multilingual-v3 (best convenience, API). Test OpenAI text-embedding-3-small as a budget alternative.

Step 3: Before committing to any model at scale, benchmark on 200-500 Tibetan sentence pairs with known semantic relationships from your actual corpus. No standardized Tibetan embedding benchmarks exist; your evaluation on your own data is the only reliable quality signal.
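The steps above can be wired together with each stage injected as a callable, so the segmenter, embedder, and vector store all stay swappable. This is an illustrative sketch rather than any library's API; for example, a Botok-based segmenter could supply `segment`, and a BGE-M3 or Cohere client could supply `embed`:

```python
def embed_corpus(texts, segment, embed, store):
    # segment: text -> list of words (e.g. a Botok-based segmenter);
    # embed: text -> vector (BGE-M3, an API client, ...);
    # store: (text, vector) -> None (e.g. a Qdrant/Milvus/pgvector upsert).
    vectors = []
    for text in texts:
        words = segment(text)
        vector = embed(" ".join(words))
        store(text, vector)
        vectors.append(vector)
    return vectors
```

Because each stage is injected, swapping Cohere for self-hosted BGE-M3 later means changing one callable, not the pipeline.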


Summary Ranking

1. BGE-M3 (self-hosted): only open model with confirmed Tibetan, three retrieval modes, and 8K context. Best for all-purpose production use.
2. Cohere embed-multilingual-v3 (API): confirmed Tibetan, zero infrastructure. Best for teams without GPU infrastructure.
3. LaBSE (self-hosted): confirmed Tibetan, runs on CPU, Apache-2.0. Best for sentence-level tasks and budget deployments.
4. SONAR (self-hosted): 200 languages, strong cross-lingual alignment. Best for research and cross-lingual retrieval.
5. OpenAI embeddings (API): cheapest API ($0.02/MTok) but Tibetan unverified. Budget option pending quality validation.
6. MITRA-E (self-hosted): highest quality on Buddhist texts. Best for classical Buddhist literature specialists.
7. TiBERT / CINO (fine-tune): Tibetan-native tokenization, highest potential ceiling. Best for teams with ML capacity to build custom embeddings.