Embedding Models for Tibetan Text: Production-Ready Options
Only a handful of embedding models explicitly support Tibetan script, reflecting its status as an extremely low-resource language: less than 0.01% of CommonCrawl, ~8,000 Wikipedia articles, and a 4× tokenization penalty in standard multilingual tokenizers. This report covers every model with confirmed or plausible Tibetan support, from large self-hosted options down to lightweight API calls, along with Tibetan-specific NLP models that can serve as embedding backbones.
Models with Confirmed Tibetan Support
1. BGE-M3 (BAAI) – Best Overall Choice
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed (pairs added during fine-tuning) |
| Parameters | 568M |
| Embedding dimensions | 1024 |
| Max context | 8,192 tokens |
| License | MIT (full commercial use) |
| Access | HuggingFace: BAAI/bge-m3 · Ollama: bge-m3 |
| Retrieval modes | Dense + Sparse + ColBERT (multi-vector) |
| Self-hosting GPU | T4 16GB minimum (FP16) |
| Architecture | XLM-RoBERTa backbone with Tibetan pairs added during fine-tuning |
BGE-M3 is the strongest open-source option. It is the only general-purpose open embedding model that explicitly adds Tibetan language pairs during training, overcoming the limitation of its XLM-R backbone (which excludes Tibetan from its 100-language CC-100 pretraining). The three retrieval modes (dense, sparse, and ColBERT) make it versatile across semantic search, keyword-style retrieval, and fine-grained passage matching.
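The three scores can be fused at query time with a weighted sum over candidate passages. A minimal rank-time fusion sketch, assuming per-passage dense/sparse/ColBERT relevance scores have already been computed by the model (the weights here are illustrative defaults, not tuned values):

```python
from typing import Dict, List, Tuple

def rank_hybrid(dense: Dict[str, float],
                sparse: Dict[str, float],
                colbert: Dict[str, float],
                weights: Tuple[float, float, float] = (0.4, 0.2, 0.4)
                ) -> List[Tuple[str, float]]:
    """Fuse BGE-M3's three relevance signals per passage ID and
    return passages sorted by the combined score, best first."""
    w_d, w_s, w_c = weights
    ids = dense.keys() & sparse.keys() & colbert.keys()
    scored = {pid: w_d * dense[pid] + w_s * sparse[pid] + w_c * colbert[pid]
              for pid in ids}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

In practice the per-mode scores come from the model's three output heads; the fusion step itself is this simple.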
2. Cohere embed-multilingual-v3 – Best API Option
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed |
| Parameters | Undisclosed |
| Embedding dimensions | 1024 |
| Max context | 512 tokens |
| License | Proprietary (API-only) |
| Access | Cohere API |
| API price | $0.10 per million tokens |
The most convenient option for teams that want confirmed Tibetan support without infrastructure. The 512-token context limit is a meaningful constraint for longer passages: each API call covers roughly 1–2 Tibetan sentences' worth of content after the tokenization penalty.
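One way to work within the 512-token window is to greedily pack segmented Tibetan words into chunks under a token budget before calling the API. A sketch, assuming a rough 4-subword-tokens-per-word estimate (a placeholder heuristic reflecting the tokenization penalty; substitute a real tokenizer count in production):

```python
from typing import Callable, List

def pack_chunks(words: List[str], max_tokens: int = 512,
                est_tokens: Callable[[str], int] = lambda w: 4) -> List[str]:
    """Greedily pack segmented words into chunks whose estimated
    token count stays under the embedding model's context limit."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for w in words:
        t = est_tokens(w)
        if current and used + t > max_tokens:
            chunks.append("".join(current))  # Tibetan script uses no spaces
            current, used = [], 0
        current.append(w)
        used += t
    if current:
        chunks.append("".join(current))
    return chunks
```

Each resulting chunk can then go to the embedding endpoint as one input.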
Cohere embed-v4 (newer model) offers 1536 dimensions, 128K context, and costs $0.12/MTok. It likely inherits Tibetan support from v3, but this should be verified with Cohere before committing.
3. LaBSE (Google) – Lightweight Self-Hosted Option
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed |
| Parameters | 471M |
| Embedding dimensions | 768 |
| Max context | 256 tokens |
| License | Apache-2.0 |
| Access | HuggingFace: sentence-transformers/LaBSE |
| Self-hosting GPU | Runs on CPU (INT8 ONNX) or any GPU |
LaBSE's strength is its broad language coverage and lightweight deployment: it runs on CPU with ONNX quantization. The severe 256-token limit restricts it to sentence-level tasks such as similarity matching, short-text classification, and deduplication. It is not suitable for document-level retrieval or long-passage RAG, and it is outperformed by MITRA-E on Buddhist text benchmarks.
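Deduplication, one of LaBSE's sweet spots, reduces to a cosine-similarity threshold over sentence embeddings. A minimal sketch (embeddings stubbed as a NumPy array; the threshold of 0.95 is an illustrative starting point, not a recommended value):

```python
import numpy as np
from typing import List

def dedupe(embs: np.ndarray, threshold: float = 0.95) -> List[int]:
    """Greedy near-duplicate removal: keep a sentence only if its
    cosine similarity to every already-kept sentence is below the
    threshold. Rows of `embs` are sentence embeddings."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    kept: List[int] = []
    for i, v in enumerate(normed):
        if all(float(v @ normed[k]) < threshold for k in kept):
            kept.append(i)
    return kept
```

For large corpora, replace the inner loop with an approximate-nearest-neighbor index; the greedy logic stays the same.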
4. SONAR (Meta) – Research-Grade Multilingual
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed (bod_Tibt) |
| Parameters | ~600M+ |
| Embedding dimensions | 1024 |
| Max context | Sentence-level (designed for individual sentences) |
| License | Mixed (some components non-commercial) |
| Access | GitHub: facebookresearch/SONAR |
| Self-hosting GPU | Requires GPU; complex multi-component deployment |
SONAR covers 200 languages through Meta's NLLB framework and produces high-quality sentence-level embeddings. The trade-offs are significant: you must specify the language code (bod_Tibt) at inference time, deployment involves multiple model components (encoder + tokenizer per language family), and some components carry non-commercial license restrictions. Best suited for research contexts or cross-lingual retrieval where you need to map Tibetan against dozens of other languages simultaneously.
5. OpenAI Embeddings – Plausible but Unverified
| Model | Dims | Max context | Price/MTok | Batch price/MTok |
|---|---|---|---|---|
| text-embedding-3-small | 1,536 | 8,191 | $0.02 | $0.01 |
| text-embedding-3-large | 3,072 | 8,191 | $0.13 | $0.065 |
| Attribute | Detail |
|---|---|
| Tibetan support | Unverified |
| License | Proprietary (API-only) |
| Access | OpenAI API |
OpenAI does not publish a language list for their embedding models. The BPE tokenizer handles Tibetan Unicode at the byte level, so it will produce embeddings, but whether those embeddings carry meaningful Tibetan semantics is unknown. The extremely low pricing ($0.02/MTok for small, $0.01 batch) makes it worth testing empirically on your data. If quality proves acceptable, this is the cheapest API option by a wide margin.
Recommendation: Run a quick evaluation on 100–200 Tibetan sentence pairs with known similarity before committing.
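Such an evaluation needs nothing beyond cosine similarity and an agreement metric. A sketch using pairwise ranking accuracy, with the embedding call stubbed out (plug in vectors returned by the API under test):

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pairwise_accuracy(sims: List[float], gold: List[float]) -> float:
    """Fraction of pair-of-pairs where the model's similarity ordering
    agrees with the gold ordering; 0.5 is chance, 1.0 is perfect."""
    agree = total = 0
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            if gold[i] == gold[j]:
                continue  # ties carry no ordering signal
            total += 1
            agree += (sims[i] > sims[j]) == (gold[i] > gold[j])
    return agree / total if total else 0.0
```

Compute `sims` as the model's cosine similarity per labeled pair; if the accuracy against your gold labels is near 0.5, the model is not capturing Tibetan semantics.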
Tibetan-Specific & Domain-Specific Models
These are not general-purpose embedding models out of the box, but they offer Tibetan-native architectures that can be fine-tuned or adapted for embedding tasks.
6. Gemma 2 MITRA-E – Best for Buddhist Texts
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed (Buddhist domain) |
| Parameters | 9B |
| Training data | 1.74M parallel Buddhist text pairs |
| License | Gemma license (restricted) |
| Access | HuggingFace: buddhist-nlp (gated access) |
| Self-hosting GPU | A100 40GB minimum |
| Performance | Outperforms BGE-M3 and LaBSE on 7-task Buddhist semantic benchmark |
If your primary corpus is classical Buddhist literature (Kangyur, Tengyur, commentaries), MITRA-E is the highest-quality option available. The 9B parameter count makes it impractical for lightweight deployment (expect ~$0.40/MTok self-hosted on an A100), but for specialized Buddhist digital humanities work, the quality advantage may justify the cost.
7. TiBERT (CMLI-NLP) – Tibetan-Native Backbone
| Attribute | Detail |
|---|---|
| Tibetan support | Native (monolingual) |
| Parameters | ~110M (BERT-base scale) |
| Vocabulary | 30,005 Tibetan words (99.95% corpus coverage) |
| License | Research |
| Access | HuggingFace: CMLI-NLP/TiBERT |
| Self-hosting GPU | Any GPU or CPU |
TiBERT is a BERT-base model pretrained exclusively on Tibetan text with a purpose-built SentencePiece vocabulary. It achieves state-of-the-art results on Tibetan text classification, outperforming multilingual models by 3–5 F1 points. It is not a sentence embedding model (it produces token-level representations), but it can serve as a backbone for training a Tibetan sentence transformer using frameworks like sentence-transformers. The Tibetan-native tokenizer eliminates the 4× tokenization penalty that cripples multilingual models.
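The standard first step in adapting an encoder-only model like TiBERT into a sentence embedder is mask-aware mean pooling over its token outputs. A sketch in NumPy (the token embeddings would come from the model's last hidden state; shapes are assumptions matching BERT conventions):

```python
import numpy as np

def mean_pool(token_embs: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token-level representations into one sentence vector,
    ignoring padding positions.

    token_embs:     [seq_len, hidden] encoder outputs
    attention_mask: [seq_len] with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(token_embs.dtype)  # [seq_len, 1]
    summed = (token_embs * mask).sum(axis=0)
    count = mask.sum()
    return summed / np.maximum(count, 1e-9)  # avoid division by zero
```

From there, a contrastive fine-tuning pass over Tibetan sentence pairs (e.g., via sentence-transformers) turns the pooled vectors into usable embeddings.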
8. UTibetNLP/tibetan_bert – Alternative Tibetan Backbone
| Attribute | Detail |
|---|---|
| Tibetan support | Native (monolingual) |
| Parameters | ~110M |
| License | Research |
| Access | HuggingFace: UTibetNLP/tibetan_bert |
| Task | Tibetan news classification (~86% accuracy) |
Another monolingual Tibetan BERT variant, focused on news classification. It has similar potential to TiBERT for fine-tuning into a sentence embedding model, but is less documented.
9. CINO (HIT-iFLYTEK) – Chinese Minority Languages
| Attribute | Detail |
|---|---|
| Tibetan support | Confirmed (minority-language focus) |
| Parameters | ~110M (base), ~330M (large) |
| License | Research |
| Access | HuggingFace / GitHub |
| Advantage | Cross-lingual between Tibetan and Chinese |
CINO is designed for Chinese minority languages and offers cross-lingual capability between Tibetan and Chinese. Useful if your pipeline involves Tibetan-Chinese parallel texts or if you need embeddings that align both languages.
Cost Comparison
API Pricing (Tibetan-viable models only)
| Provider / Model | Price/MTok | Tibetan confirmed | Max context | Best for |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 ($0.01 batch) | Unverified | 8,191 | Budget API if quality checks out |
| Cohere embed-multilingual-v3 | $0.10 | Confirmed | 512 | Confirmed Tibetan, no infra needed |
| Cohere embed-v4 | $0.12 | Likely | 128K | Longer context needs |
| OpenAI text-embedding-3-large | $0.13 ($0.065 batch) | Unverified | 8,191 | Higher dimensionality if quality checks out |
Self-Hosting Economics
| Setup | Model | GPU / Instance | Monthly cost | Effective $/MTok | Throughput |
|---|---|---|---|---|---|
| Budget (spot) | BGE-M3 FP16 | T4 spot | ~$130 | ~$0.01 | ~15–22M tok/hr |
| Budget | BGE-M3 FP16 | T4 (g4dn.xlarge) | ~$384 | ~$0.03 | ~15–22M tok/hr |
| CPU-only | LaBSE INT8 ONNX | c5.2xlarge | ~$250 | ~$0.05 | ~5–10M tok/hr |
| Standard | BGE-M3 FP16 | A10G (g5.xlarge) | ~$734 | ~$0.03 | ~30–45M tok/hr |
| Specialized | MITRA-E 9B | A100 40GB | ~$1,460 | ~$0.40 | ~3–5M tok/hr |
Break-even analysis: Self-hosting BGE-M3 on an on-demand T4 (~$384/month) beats Cohere API pricing ($0.10/MTok) at approximately 3.8 billion tokens/month. Against OpenAI standard pricing ($0.02/MTok), break-even rises to ~19 billion tokens/month, and against batch pricing ($0.01/MTok) to ~38 billion. For most Tibetan text projects, API is more economical unless processing very large corpora (e.g., the full TIB-STC at 11B+ tokens).
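The break-even arithmetic above is a one-line computation, sketched here for re-running with your own instance pricing (it ignores engineering time and GPU idle periods, so treat it as a lower bound on the volume that justifies self-hosting):

```python
def breakeven_tokens_per_month(monthly_gpu_cost: float,
                               api_price_per_mtok: float) -> float:
    """Monthly token volume above which self-hosting is cheaper
    than the API: fixed GPU cost divided by per-token API price."""
    return monthly_gpu_cost / api_price_per_mtok * 1e6
```

For example, a $384/month T4 against Cohere's $0.10/MTok yields 3.84 billion tokens/month.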
Essential Tibetan NLP Preprocessing
Regardless of which model you choose, preprocessing Tibetan text with a word segmenter is critical. Standard multilingual tokenizers fragment Tibetan syllables aggressively: a word like བྱང་ཆུབ་སེམས་དཔའ (byang chub sems dpa', "bodhisattva", 4 syllables) gets split into 8–16 subword tokens. Proper word segmentation before model tokenization improves representation quality and reduces token consumption.
| Tool | Description | License | Access |
|---|---|---|---|
| Botok (OpenPecha) | Leading Tibetan word segmenter with dictionary lookup + POS tagging | Apache-2.0 | pip install botok Β· GitHub: OpenPecha/Botok |
| ACTib corpus | 170M words of annotated Classical Tibetan (BDRC collections) | Research | Via BDRC/OpenPecha |
| TIB-STC | 11B+ tokens structured Tibetan text (literature 66%, web 24%, media 10%) | Research | arXiv: 2503.18288 |
| FastText Tibetan | 100-dim Classical Tibetan word vectors (90K+ tokens) | Open | Zenodo |
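Botok's dictionary-based segmentation is the right tool, but it downloads language data on first run. As a dependency-free fallback (and an illustration of the preprocessing step), Tibetan text can at least be split into syllables on the tsheg mark:

```python
def split_syllables(text: str) -> list:
    """Naive syllable split on the tsheg separator (U+0F0B), also
    treating the shad sentence delimiter (U+0F0D) as a boundary.
    Botok's word-level segmentation is strictly better; this only
    approximates the preprocessing step without its dictionary."""
    TSHEG, SHAD = "\u0f0b", "\u0f0d"
    return [s for s in text.replace(SHAD, TSHEG).split(TSHEG) if s]
```

Syllable-level splitting is a weaker signal than Botok's word units (many Tibetan words span multiple syllables), so use it only when the full segmenter is unavailable.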
Recommended Pipeline
Tibetan raw text
→ Botok word segmentation
→ BGE-M3 (self-hosted) or Cohere API
→ Vector database (Qdrant, Milvus, Weaviate, pgvector)
→ Semantic search / RAG / clustering
Step 1: Preprocess with Botok to segment Tibetan text into linguistically meaningful word units.
Step 2: Embed with BGE-M3 (best quality, self-hosted) or Cohere embed-multilingual-v3 (best convenience, API). Test OpenAI text-embedding-3-small as a budget alternative.
Step 3: Before committing to any model at scale, benchmark on 200–500 Tibetan sentence pairs with known semantic relationships from your actual corpus. No standardized Tibetan embedding benchmarks exist; your evaluation on your own data is the only reliable quality signal.
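The three steps wire together into a small retrieval loop. A sketch with the two heavy components stubbed: `segment` stands in for Botok and `embed` for BGE-M3 or the Cohere API (both names are placeholders for the real calls), with cosine search over normalized vectors in place of a vector database:

```python
import numpy as np

class TinyPipeline:
    """Segment -> embed -> index -> search, with injectable stubs."""

    def __init__(self, embed, segment):
        self.embed = embed      # list[str] -> np.ndarray  (model stub)
        self.segment = segment  # str -> list[str]         (Botok stub)
        self.vectors = []
        self.texts = []

    def index(self, docs):
        """Embed each document and store its unit-normalized vector."""
        for d in docs:
            v = self.embed(self.segment(d))
            self.texts.append(d)
            self.vectors.append(v / np.linalg.norm(v))

    def search(self, query, k=3):
        """Return the top-k documents by cosine similarity."""
        q = self.embed(self.segment(query))
        q = q / np.linalg.norm(q)
        sims = np.array(self.vectors) @ q
        top = np.argsort(-sims)[:k]
        return [(self.texts[i], float(sims[i])) for i in top]
```

Swapping the stubs for Botok, a real embedding model, and a vector database preserves this exact control flow.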
Summary Ranking
| Rank | Model | Why | Best for |
|---|---|---|---|
| 1 | BGE-M3 (self-hosted) | Only open model with confirmed Tibetan + 3 retrieval modes + 8K context | All-purpose production use |
| 2 | Cohere embed-multilingual-v3 (API) | Confirmed Tibetan, zero infra | Teams without GPU infrastructure |
| 3 | LaBSE (self-hosted) | Confirmed Tibetan, runs on CPU, Apache-2.0 | Sentence-level tasks, budget deployments |
| 4 | SONAR (self-hosted) | 200 languages, strong cross-lingual | Research, cross-lingual retrieval |
| 5 | OpenAI embeddings (API) | Cheapest API ($0.02/MTok) but Tibetan unverified | Budget option pending quality validation |
| 6 | MITRA-E (self-hosted) | Highest quality on Buddhist texts | Classical Buddhist literature specialists |
| 7 | TiBERT / CINO (fine-tune) | Tibetan-native tokenization, best potential ceiling | Teams with ML capacity to build custom embeddings |