Objective
To develop a reliable, lightweight statistical language model capable of assessing the quality of Optical Character Recognition (OCR) outputs for Classical and Contemporary Tibetan texts.
Table of Contents
- Context – Background on the problem and motivation
- The Mathematical Premise: KenLM and Perplexity – Understanding quality metrics
- The Corpus: Establishing the Baseline – BoCorpus overview and composition
- Experiment 1: The Subword Approach (SentencePiece) – Initial approach and limitations
- Experiment 2: Syllable-Level Tokenization and the Normalization Revelation – Regex tokenization and discovery of corpus noise
- Experiment 3: Morphological Tokenization via Botok-rs – Botok approach with detailed validation
- Experiment 4: Corpus Denoising and Punctuation Normalization – Targeted corpus cleaning with breakthrough results
- Final Verdict – Key insights and conclusions
- So What? The Practical Application – Real-world use cases
Context
As the AI community scales Large Language Models (LLMs) to low-resource languages, the demand for high-quality, curated pre-training data has skyrocketed. Inspired by recent research, such as "Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training", it became apparent that an automated filtering mechanism is required to separate clean text from noisy OCR artifacts. An n-gram language model built with KenLM, a toolkit that estimates models with modified Kneser-Ney smoothing, offers an efficient, statistically rigorous approach to this problem by evaluating text via perplexity scores.
The Mathematical Premise: KenLM and Perplexity
Before diving into the experiments, it is crucial to establish how we measure “quality.” We rely on Perplexity (PPL), which evaluates how accurately a model predicts a text sample.
Simply put, perplexity measures the model’s “surprise” when reading a new sentence. A lower PPL score indicates the model is less surprised, meaning the text closely resembles the clean distribution of the language the model was trained on. To achieve this, our KenLM model uses Kneser-Ney smoothing. Instead of relying only on raw n-gram frequencies, Kneser-Ney backs off to lower-order estimates that weight a word by the diversity of contexts it appears in, so word sequences never seen in training still receive a sensible, non-zero probability. Genuinely out-of-vocabulary (OOV) tokens are mapped to a dedicated unknown token. Together, these properties make the model robust for OCR evaluation.
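As a minimal sketch of the definition above, perplexity can be computed from per-token log probabilities. The function name `perplexity` is hypothetical, and the input is assumed to be base-10 log probabilities (the form in which KenLM reports scores), one per scored token.

```python
def perplexity(log10_probs):
    """Return the perplexity of a sequence from its per-token log10 probabilities.

    PPL = 10 ** (-(1/N) * sum(log10 p_i)), where N is the number of scored tokens.
    """
    if not log10_probs:
        raise ValueError("need at least one token score")
    average_log_prob = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average_log_prob)
```

A sequence whose tokens each receive probability 0.1 has a perplexity of 10: the model is as uncertain as if it were choosing uniformly among ten options at every step.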
The Corpus: Establishing the Baseline
To train the KenLM, we aggregated a diverse, cross-domain corpus of assumed “high-quality” Tibetan text. This corpus was formalized and published as the BoCorpus dataset (available on HuggingFace at openpecha/BoCorpus), released under the CC0 1.0 Universal license.
BoCorpus Overview
The dataset comprises 1,039 texts totaling 603,325,999 characters (average of 580,679 characters per text), drawn from eight distinct collections:
| Collection | Texts | Domain |
|---|---|---|
| Bon Kangyur | 151 | Classical Canon |
| Derge Kangyur | 103 | Classical Canon |
| Derge Tengyur | 213 | Classical Canon |
| DharmaEbook | 98 | Contemporary Scholarly |
| Pagen Project | 1 | Contemporary Scholarly |
| Tsadra Collection | 266 | Contemporary Scholarly |
| འབྲི་ལུགས་བང་མཛོད་སྐོར་ལྔ། | 136 | Tibetan Literary |
| རིན་ཆེན་གཏེར་མཛོད་ཆེན་མོ། | 71 | Tibetan Literary |
Each record in BoCorpus contains a unique UUID4 identifier, the source collection name, the original filename, the full text (with all newline characters removed), and a character count. The dataset was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC), with funding from the Khyentse Foundation.
For the purpose of KenLM training, we logically grouped these into two primary categories:
Classical Tibetan Canon: Comprising the Bon Kangyur (151 texts), Derge Kangyur (103 texts), and Derge Tengyur (213 texts), totaling 467 texts across classical canonical sources.
Contemporary Tibetan Scholar Work: Comprising the Tsadra Collection (266 texts) and DharmaEbook (98 texts), totaling 364 texts from contemporary scholarly sources.
Crucial Pre-processing Step
Before generating any vocabularies or training models, we standardized the entire corpus. We removed all newline characters from the source volumes.
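The record layout and newline removal described above could be sketched as follows. The class name, field names, and the decision to also strip carriage returns are assumptions for illustration; the actual BoCorpus column names may differ.

```python
import uuid
from dataclasses import dataclass


@dataclass
class CorpusRecord:
    # Field names are hypothetical; they mirror the fields described in the text.
    id: str
    collection: str
    filename: str
    text: str
    char_count: int


def make_record(collection, filename, raw_text):
    """Build one record with all newline characters removed from the text."""
    # Assumption: carriage returns are treated as newline characters too.
    text = raw_text.replace("\r", "").replace("\n", "")
    return CorpusRecord(
        id=str(uuid.uuid4()),  # random UUID4 identifier
        collection=collection,
        filename=filename,
        text=text,
        char_count=len(text),
    )
```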
Experiment 1: The Subword Approach (SentencePiece)
Methodology
Our initial hypothesis was to rely on modern subword tokenization, ubiquitous in Transformer-based architectures. We trained a SentencePiece model with a strict vocabulary size of 20,000.
Rationale: A vocabulary that is too small forces the KenLM to be overly strict (penalizing natural linguistic variance), while a vocabulary that is too large bloats inference time and makes the model too lenient to OCR noise.
Key Findings
Output: The tokenizer was saved to Hugging Face as BoSentencePiece and the resulting language model as BoKenlm-sp.
Limitations: When evaluating clean, ground-truth data, the perplexity scores remained unexpectedly high. While the model could still function as a relative filter (by establishing a heuristic threshold to separate good OCR from bad), the absolute perplexity scores were statistically unreliable. Subword chunking without morphological awareness proved suboptimal for the specific orthographic rules of Tibetan.
Experiment 2: Syllable-Level Tokenization and the Normalization Revelation
Methodology
Following discussions with Elie, we pivoted from algorithmic subwords to linguistically grounded tokenization. Tibetan text is naturally segmented into syllables by the tsek (་) and shad (།) characters. We implemented a Regex-based tokenizer to split the corpus along these boundaries.
Hypothesis: Bypassing the artificial 20k limit and utilizing true syllable boundaries would yield a more precise, representative vocabulary.
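A regex tokenizer of this kind might look like the sketch below. The function names are hypothetical and the exact pattern used in the experiment is not given in the text, so this only illustrates splitting on tsek and shad and counting the resulting tokens.

```python
import re
from collections import Counter

TSEK = "\u0f0b"  # ་ syllable delimiter
SHAD = "\u0f0d"  # । clause/sentence delimiter

# One or more consecutive delimiters count as a single boundary.
_BOUNDARY = re.compile(f"[{TSEK}{SHAD}]+")


def split_syllables(text):
    """Split text on tsek/shad boundaries, dropping empty pieces."""
    return [piece for piece in _BOUNDARY.split(text) if piece]


def build_vocab(texts):
    """Count how often each syllable-level token occurs across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(split_syllables(text))
    return counts
```

Because the splitter only looks at tsek and shad, anything else that sits between two boundaries (Latin letters, other punctuation marks, stray digits) stays glued to the neighbouring syllable, which is one way fused vocabulary entries can arise.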
Key Findings
The Unnormalized Explosion: The initial training run yielded a massive vocabulary size of 101,462 tokens. Vocabulary validation revealed severe noise: only 53.15% of tokens were valid Tibetan words, while 46.85% were invalid (heavily polluted by fused TEXT+PUNCT and LATIN+TEXT combinations).
The Normalization Pipeline: Realizing our “high-quality” baseline contained severe formatting artifacts, we introduced a strict normalization step prior to tokenization:
- Removal of invalid Unicode keys.
- Pruning of unwanted spacing artifacts.
- Stripping inline spelling suggestion annotations present in the Kangyur/Tengyur datasets.
- Removal of residual Latin characters.
Result of Normalization: Normalization reduced the vocabulary size to 97,814 tokens. However, secondary vocabulary analysis showed that while validity improved, the results were still unacceptable: 55.09% valid words vs. 44.91% invalid words.
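The four normalization steps listed above could be sketched roughly as follows. This is not the pipeline used in the experiment: the annotation pattern (text in parentheses) is a hypothetical stand-in for the real annotation format, "invalid Unicode" is interpreted here as control, unassigned, private-use, and surrogate code points, and the steps are reordered so that whitespace cleanup runs last.

```python
import re
import unicodedata

# Hypothetical annotation format: anything wrapped in parentheses.
_ANNOTATION = re.compile(r"\([^()]*\)")
_LATIN = re.compile(r"[A-Za-z]+")
_SPACES = re.compile(r"\s+")
# Assumed definition of "invalid": control, unassigned, private-use, surrogate.
_INVALID_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def normalize(text):
    """Apply a simplified version of the four normalization steps."""
    text = _ANNOTATION.sub("", text)  # strip inline annotations
    text = _LATIN.sub("", text)       # remove residual Latin letters
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in _INVALID_CATEGORIES
    )                                 # drop invalid code points, keep whitespace
    return _SPACES.sub(" ", text).strip()  # collapse spacing artifacts
```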
Experiment 3: Morphological Tokenization via Botok-rs
Methodology
Given the near 45% invalidity rate from naive Regex splitting—even after normalization—we abandoned simple boundary splitting. Instead, we adopted rule-based, morphological tokenization using botok-rs (the Rust implementation of the Botok Python library, optimized for speed) with NFC Unicode normalization.
Validation Report (BoKenlm-syl-0.1)
The formal vocabulary validation report for this model, generated on 2026-04-20, revealed the following:
| Metric | Value |
|---|---|
| Model | BoKenlm-syl-0.1 |
| Normalisation | NFC |
| Tokenizer | Botok Syllable Tokenizer |
| Total Words Analysed | 97,814 |
| Valid Words | 53,881 (55.09%) |
| Invalid Words | 43,933 (44.91%) |
Key Findings
Despite switching to morphological tokenization, the vocabulary size remained at 97,814 tokens, identical to the normalized Regex result from Experiment 2. The validity rate (55.09%) was likewise unchanged. The dominant contamination pattern was TEXT+PUNCT fusion (23.01%), indicating that thousands of valid Tibetan words were being merged with adjacent punctuation characters (shad ། and gter-tsheg ༔) into single invalid vocabulary entries. A secondary contamination pattern was TEXT+TEXT (6.49%), where common grammatical particles like པའི་, ལས་, and པས་ were being incorrectly fused rather than recognized as multi-syllable constructions.
Critical Diagnosis: The vocabulary pollution was not a failure of the tokenizer, but a failure of the source corpus. The text itself contained embedded punctuation anomalies and formatting artifacts that no tokenizer alone could resolve. The corpus needed aggressive, domain-specific denoising before tokenization.
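The category-based validation used in this diagnosis can be illustrated with a small classifier. This is a sketch under stated assumptions, not the validation tool itself: the Unicode ranges are a simplification, a trailing tsek is treated as part of the syllable, only pure TEXT tokens are counted as valid, and TEXT+TEXT fusions are not detected because that would require a lexicon.

```python
from collections import Counter

TSEK = "\u0f0b"


def char_category(ch):
    """Map one character to a coarse category (simplified, assumed ranges)."""
    cp = ord(ch)
    if ch == TSEK or 0x0F40 <= cp <= 0x0FBC:  # letters, vowel signs, subjoined
        return "TEXT"
    if 0x0F20 <= cp <= 0x0F33 or ch.isdigit():  # Tibetan and other digits
        return "NUM"
    if 0x0F04 <= cp <= 0x0F14 or 0x0F3A <= cp <= 0x0F3D:  # marks, brackets
        return "PUNCT"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    return "SYM"


def classify_token(token):
    """Return a single category, or MULTI_TOKEN(A+B+...) for mixed tokens."""
    categories = []
    for ch in token:
        category = char_category(ch)
        if not categories or categories[-1] != category:
            categories.append(category)
    if not categories:
        return "EMPTY"
    if len(categories) == 1:
        return categories[0]
    return "MULTI_TOKEN(" + "+".join(categories) + ")"


def validity_report(vocab):
    """Summarize a vocabulary, counting only pure TEXT tokens as valid."""
    counts = Counter(classify_token(token) for token in vocab)
    total = sum(counts.values())
    valid = counts["TEXT"]
    valid_pct = round(100.0 * valid / total, 2) if total else 0.0
    return {"total": total, "valid": valid, "valid_pct": valid_pct,
            "by_category": dict(counts)}
```

Run over a vocabulary file, a report like this makes fusion patterns visible as named categories with counts, rather than as an unexplained validity percentage.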
Experiment 4: Corpus Denoising and Punctuation Normalization (BoKenlm-syl-0.4)
Methodology
Armed with the detailed diagnostic data from the BoKenlm-syl-0.1 validation report, we designed a targeted denoising pipeline to attack the two dominant contamination vectors—fused punctuation and invalid character patterns—at the corpus level, before any tokenization occurs.
The enhanced normalization pipeline for BoKenlm-syl-0.4 consisted of three stages:
- NFC Unicode Normalization (carried over from Experiment 3).
- Corpus Denoising from Invalid Patterns: A systematic removal of the specific noise patterns identified in the 0.1 validation report—including fused TEXT+PUNCT sequences, embedded Latin characters, stray numerals merged with Tibetan text, and other formatting artifacts.
- All Punctuation to Shad: A normalization pass converting all variant Tibetan punctuation marks (gter-tsheg ༔, closing brackets ༽, and other ornamental punctuation) into the standard shad (།). This single step eliminated the entire class of TEXT+PUNCT fusion errors by ensuring only one canonical punctuation character existed in the corpus.
The tokenizer remained the Botok Syllable Tokenizer, unchanged from Experiment 3, isolating the effect of corpus-level denoising.
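The "All Punctuation to Shad" stage can be sketched as a character translation table. The gter-tsheg and closing-bracket mappings come from the description above; the other two entries are assumed examples of "other ornamental punctuation", and the full list used for BoKenlm-syl-0.4 is not given in the text.

```python
SHAD = "\u0f0d"  # །

# Variant punctuation mapped to the canonical shad.
_PUNCT_VARIANTS = {
    "\u0f14": SHAD,  # ༔ gter-tsheg (named in the text)
    "\u0f3d": SHAD,  # ༽ closing bracket (named in the text)
    "\u0f0e": SHAD,  # ༎ nyis shad (assumed)
    "\u0f11": SHAD,  # ༑ rin chen spungs shad (assumed)
}
_PUNCT_TABLE = str.maketrans(_PUNCT_VARIANTS)


def punct_to_shad(text):
    """Replace each variant punctuation mark with a standard shad."""
    return text.translate(_PUNCT_TABLE)
```

With a single canonical punctuation character in the corpus, the tokenizer only has to separate text from one mark instead of many.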
Validation Report
| Metric | Value |
|---|---|
| Model | BoKenlm-syl-0.4 |
| Normalisation | NFC + Denoise Corpus from Invalid Patterns + All Punct to Shad |
| Tokenizer | Botok Syllable Tokenizer |
| Total Words Analysed | 20,564 |
| Valid Words | 20,000 (97.26%) |
| Invalid Words | 564 (2.74%) |
Part of the residual 2.74% reflects the bilingual nature of the classical Buddhist canon, where Tibetan translation coexists with preserved Sanskrit terminology (mantras, dharanis, technical terms).
Key Findings
The results represent a transformative improvement across every metric:
Vocabulary Compression: The vocabulary condensed from 97,814 (v0.1) to 20,564 tokens—a 79% reduction. This dramatic compression was achieved entirely through corpus denoising, not tokenizer changes. The 77,250 eliminated tokens were noise artifacts—fused punctuation, embedded Latin characters, stray numerals—that had been inflating the vocabulary with tokens that served no linguistic purpose.
Purity Breakthrough: Validity surged from 55.09% (v0.1) to 97.26% (v0.4). The clean TEXT category alone accounts for 97.06% of the entire vocabulary. The MULTI_TOKEN(TEXT+PUNCT) category, which constituted 23.01% of the v0.1 vocabulary, was completely eliminated by the punctuation normalization step.
Residual Contamination Analysis: The remaining 564 invalid tokens (2.74%) are dominated by MULTI_TOKEN(TEXT+TEXT) fusions (451 tokens, 2.19%), which are common Tibetan grammatical particles like པར (pa-r), ལས (la-s), and པས (pa-s) that botok-rs splits into constituent morphemes. The other invalids include edge cases such as SYM+TEXT fusions from rare Sanskrit diacritical marks (59), and a handful of residual Latin (12) and punctuation (7) artifacts.
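The percentages quoted in these findings follow directly from the counts in the two validation reports. The short check below recomputes them; the helper name `pct` and the constant names are only for illustration.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


V01_VOCAB = 97_814   # vocabulary size in the v0.1 report
V04_VOCAB = 20_564   # vocabulary size in the v0.4 report
V04_VALID = 20_000
V04_INVALID = 564
V04_TEXT_TEXT = 451  # MULTI_TOKEN(TEXT+TEXT) fusions

eliminated = V01_VOCAB - V04_VOCAB           # 77,250 tokens removed
reduction_pct = pct(eliminated, V01_VOCAB)   # about 79%
valid_pct = pct(V04_VALID, V04_VOCAB)        # 97.26
invalid_pct = pct(V04_INVALID, V04_VOCAB)    # 2.74
text_text_pct = pct(V04_TEXT_TEXT, V04_VOCAB)  # 2.19
```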
Final Verdict
The journey from Experiment 1 through Experiment 4 demonstrates a fundamental principle for building statistical language models on morphologically complex or historically rich languages like Tibetan: the tokenizer and the corpus must be co-optimized.
Experiment 1 showed that algorithmic tokenization (SentencePiece) is a poor substitute for linguistic rules. Experiments 2 and 3 showed that even the right tokenizer, whether regex-based or morphologically aware, cannot compensate for a dirty corpus: both produced 97,814-token vocabularies with the same 55.09% validity rate, despite radically different tokenization strategies. It was only in Experiment 4, when we attacked the corpus itself with targeted denoising informed by the v0.1 diagnostic data, that validity broke through to 97.26%.
The key insight: the assumption of “clean” data is a dangerous pitfall in NLP. Our source texts—canonically curated Kangyur and Tengyur volumes, published EPUBs—still harbored tens of thousands of invisible formatting artifacts that silently poisoned every downstream model. Aggressive, domain-specific normalization is an absolute prerequisite before vocabulary construction.
So What? The Practical Application
Why does this matter to the broader AI and Data Engineering ecosystem?
Automated Data Curation: As researchers attempt to build foundational LLMs for Tibetan, the BoKenlm-syl model serves as a highly efficient, deterministic gatekeeper. It can process millions of lines of raw OCR output, assigning perplexity scores to automatically discard hallucinated or garbled text before it ever reaches a computationally expensive LLM pre-training pipeline.
Historical Preservation: Digitizing historical texts (like the Kangyur and Tengyur) relies heavily on OCR. A calibrated KenLM allows archivists to pinpoint exact volumes or pages where the OCR pipeline failed, guiding human-in-the-loop (HITL) correction efforts precisely where they are needed most, rather than proofreading the entire canon manually.
Corpus Quality Auditing: The vocabulary validation methodology developed across these experiments—categorizing tokens by type (TEXT, PUNCT, LATIN, NUM, SYM) and flagging multi-token fusions—is generalizable to any language. It provides a systematic, quantitative framework for auditing corpus quality before committing to expensive model training.
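The first two use cases can be sketched with two small helpers: a threshold filter for data curation and a per-page ranking for guiding manual correction. Everything here is illustrative. The function names are hypothetical, `score` stands for any callable that returns a perplexity for a line (for example, a wrapper around a trained KenLM model), and a suitable threshold would need to be calibrated on held-out clean and noisy text.

```python
from statistics import mean


def filter_by_perplexity(lines, score, max_ppl):
    """Yield non-empty lines whose perplexity does not exceed max_ppl."""
    for line in lines:
        if line.strip() and score(line) <= max_ppl:
            yield line


def rank_pages(line_scores, top_k=10):
    """Rank pages by mean line perplexity, worst first.

    line_scores is an iterable of (page_id, perplexity) pairs.
    """
    by_page = {}
    for page_id, ppl in line_scores:
        by_page.setdefault(page_id, []).append(ppl)
    ranked = sorted(
        ((mean(scores), page_id) for page_id, scores in by_page.items()),
        reverse=True,
    )
    return [(page_id, avg) for avg, page_id in ranked[:top_k]]
```

The filter suits bulk curation, where each line is kept or dropped independently, while the ranking suits archival review, where proofreaders want the handful of pages most likely to contain OCR failures.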
Acknowledgements
This dataset was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.