Training a KenLM for Tibetan OCR Quality Assessment: An Empirical Journey

Kaldan · March 6, 2026, 4:56am

Objective

To develop a reliable, lightweight statistical language model capable of assessing the quality of Optical Character Recognition (OCR) outputs for Classical and Contemporary Tibetan texts.

Table of Contents

Context – Background on the problem and motivation
The Mathematical Premise: KenLM and Perplexity – Understanding quality metrics
The Corpus: Establishing the Baseline – BoCorpus overview and composition
Experiment 1: The Subword Approach (SentencePiece) – Initial approach and limitations
Experiment 2: Syllable-Level Tokenization and the Normalization Revelation – Regex tokenization and discovery of corpus noise
Experiment 3: Morphological Tokenization via Botok-rs – Botok approach with detailed validation
Experiment 4: Corpus Denoising and Punctuation Normalization – Targeted corpus cleaning with breakthrough results
Final Verdict – Key insights and conclusions
So What? The Practical Application – Real-world use cases

Context

As the AI community scales Large Language Models (LLMs) to low-resource languages, demand for high-quality, curated pre-training data has grown sharply. Recent research, such as Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training, makes clear that an automated filtering mechanism is needed to separate clean text from noisy OCR artifacts. KenLM, an n-gram language modeling toolkit that estimates its models with modified Kneser-Ney smoothing, offers an efficient, statistically grounded approach to this problem by scoring text with perplexity.

The Mathematical Premise: KenLM and Perplexity

Before diving into the experiments, it is crucial to establish how we measure “quality.” We rely on Perplexity (PPL), which evaluates how accurately a model predicts a text sample.

Simply put, perplexity measures the model’s “surprise” when reading a new sentence. A lower PPL score indicates the model is less surprised, meaning the text closely resembles the clean distribution of the language the model was trained on. KenLM estimates its n-gram probabilities with modified Kneser-Ney smoothing. Instead of relying on raw frequencies alone, Kneser-Ney backs off to lower-order estimates weighted by the diversity of contexts a token appears in, so unseen token sequences still receive a sensible, non-zero probability. Tokens that never occurred in training are mapped to a single unknown-token entry. Both properties matter for OCR evaluation, where garbled input routinely produces sequences the model has never seen.
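To make the metric concrete, here is a minimal sketch of the perplexity calculation, assuming per-token log10 probabilities are already available (KenLM reports scores in log10). The function name `perplexity` and the sample scores are illustrative, not taken from our pipeline.

```python
def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities.

    The list should include the end-of-sentence token, as KenLM does.
    """
    if not log10_probs:
        raise ValueError("need at least one token score")
    avg_log10 = sum(log10_probs) / len(log10_probs)
    # Perplexity is 10 raised to the negative average log10 probability.
    return 10.0 ** (-avg_log10)


# Four tokens that the model assigns probability 0.1 each ...
likely = perplexity([-1.0, -1.0, -1.0, -1.0])
# ... versus four tokens that it assigns probability 0.001 each.
garbled = perplexity([-3.0, -3.0, -3.0, -3.0])
print(likely, garbled)
```

With the kenlm Python bindings, `kenlm.Model(path).perplexity(sentence)` computes the same quantity for a space-separated token string.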

The Corpus: Establishing the Baseline

To train the KenLM, we aggregated a diverse, cross-domain corpus of assumed “high-quality” Tibetan text. This corpus was formalized and published as the BoCorpus dataset (available on HuggingFace at openpecha/BoCorpus), released under the CC0 1.0 Universal license.

BoCorpus Overview

The dataset comprises 1,039 texts totaling 603,325,999 characters (average of 580,679 characters per text), drawn from eight distinct collections:

Collection	Texts	Domain
Bon Kangyur	151	Classical Canon
Derge Kangyur	103	Classical Canon
Derge Tengyur	213	Classical Canon
DharmaEbook	98	Contemporary Scholarly
Pagen Project	1	Contemporary Scholarly
Tsadra Collection	266	Contemporary Scholarly
འབྲི་ལུགས་བང་མཛོད་སྐོར་ལྔ།	136	Tibetan Literary
རིན་ཆེན་གཏེར་མཛོད་ཆེན་མོ།	71	Tibetan Literary

Each record in BoCorpus contains a unique UUID4 identifier, the source collection name, the original filename, the full text (with all newline characters removed), and a character count. The dataset was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC), with funding from the Khyentse Foundation.

For the purpose of KenLM training, we logically grouped these into two primary categories:

Classical Tibetan Canon: Comprising the Bon Kangyur (151 texts), Derge Kangyur (103 texts), and Derge Tengyur (213 texts), totaling 467 texts across classical canonical sources.

Contemporary Tibetan Scholarly Work: Comprising the Tsadra Collection (266 texts) and DharmaEbook (98 texts), totaling 364 texts from contemporary scholarly sources.

Crucial Pre-processing Step

Before generating any vocabularies or training models, we standardized the entire corpus. We removed all newline characters from the source volumes.
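As a rough illustration of this step and of the record layout described above, the sketch below strips newline characters and wraps a text in a record. The class name, field names, and `make_record` helper are hypothetical and may not match the published BoCorpus schema.

```python
import uuid
from dataclasses import dataclass


@dataclass
class CorpusRecord:
    # Field names are illustrative, not the published schema.
    id: str
    collection: str
    filename: str
    text: str
    char_count: int


def make_record(collection, filename, raw_text):
    """Remove all newline characters and build one corpus record."""
    text = raw_text.replace("\r", "").replace("\n", "")
    return CorpusRecord(
        id=str(uuid.uuid4()),  # random UUID4 identifier
        collection=collection,
        filename=filename,
        text=text,
        char_count=len(text),  # counted after newline removal
    )
```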

Experiment 1: The Subword Approach (SentencePiece)

Methodology

Our initial hypothesis was to rely on modern subword tokenization, ubiquitous in Transformer-based architectures. We trained a SentencePiece model with a strict vocabulary size of 20,000.

Rationale: A vocabulary that is too small forces the KenLM to be overly strict (penalizing natural linguistic variance), while a vocabulary that is too large bloats inference time and makes the model too lenient to OCR noise.
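A training call along these lines would produce such a tokenizer. This is a sketch, not our exact configuration: only the 20,000 vocabulary size comes from the experiment, while `character_coverage`, `model_type`, the helper names, and the file paths are assumptions. It also assumes the corpus file holds one text segment per line, which is what SentencePiece expects as input.

```python
def sentencepiece_train_kwargs(corpus_path, model_prefix, vocab_size=20_000):
    """Keyword arguments for SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # plain-text file, one segment per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # the fixed 20k budget from the experiment
        "character_coverage": 1.0,     # assumption: keep every Tibetan character
        "model_type": "unigram",       # assumption: the library default
    }


def train_sentencepiece(corpus_path, model_prefix):
    # Imported lazily because sentencepiece is a third-party dependency.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **sentencepiece_train_kwargs(corpus_path, model_prefix)
    )
```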

Key Findings

Output: The tokenizer was saved to Hugging Face as BoSentencePiece and the resulting language model as BoKenlm-sp.

Limitations: When evaluating clean, ground-truth data, the perplexity scores remained unexpectedly high. While the model could still function as a relative filter (by establishing a heuristic threshold to separate good OCR from bad), the absolute perplexity scores were statistically unreliable. Subword chunking without morphological awareness proved suboptimal for the specific orthographic rules of Tibetan.

Experiment 2: Syllable-Level Tokenization and the Normalization Revelation

Methodology

Following discussions with Elie, we pivoted from algorithmic subwords to linguistically grounded tokenization. Tibetan text is naturally segmented into syllables by the tsek (་) and shad (།) characters. We implemented a Regex-based tokenizer to split the corpus along these boundaries.

Hypothesis: Bypassing the artificial 20k limit and utilizing true syllable boundaries would yield a more precise, representative vocabulary.
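The splitting idea can be sketched in a few lines. This is a minimal illustration rather than the exact tokenizer we ran: it assumes the tsek is dropped from each syllable and the shad is kept as its own token, and the name `syllable_tokenize` is hypothetical.

```python
import re

TSEK = "\u0f0b"  # syllable delimiter
SHAD = "\u0f0d"  # clause or sentence delimiter

# A token is either a run of characters that are not tsek, shad, or
# whitespace (one syllable), or a single shad.
_TOKEN = re.compile(rf"[^{TSEK}{SHAD}\s]+|{SHAD}")


def syllable_tokenize(text):
    """Split Tibetan text into syllables, keeping each shad as a token."""
    return _TOKEN.findall(text)


sample = TSEK.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"]) + SHAD
# KenLM expects space-separated tokens, one sentence per line.
print(" ".join(syllable_tokenize(sample)))
```

Note that anything glued to a syllable without an intervening tsek or shad, such as Latin letters or another punctuation mark, stays inside that token. That is exactly how fused entries end up in the vocabulary.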

Key Findings

The Unnormalized Explosion: The initial training run yielded a massive vocabulary size of 101,462 tokens. Vocabulary validation revealed severe noise: only 53.15% of tokens were valid Tibetan words, while 46.85% were invalid (heavily polluted by fused TEXT+PUNCT and LATIN+TEXT combinations).

The Normalization Pipeline: Realizing our “high-quality” baseline contained severe formatting artifacts, we introduced a strict normalization step prior to tokenization:

Removal of invalid Unicode code points.
Pruning of unwanted spacing artifacts.
Stripping inline spelling suggestion annotations present in the Kangyur/Tengyur datasets.
Removal of residual Latin characters.
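The four steps above can be sketched as follows. Several details are assumptions made for illustration: the annotation format (here, any parenthesized span containing a comma) is hypothetical since it varies by source, "invalid" is approximated by the replacement character plus control, format, unassigned, private-use, and surrogate categories, and spacing artifacts are simply collapsed to single spaces.

```python
import re
import unicodedata

# Hypothetical annotation format; pass a different pattern for real data.
ANNOTATION = re.compile(r"\([^()]*,[^()]*\)")
LATIN = re.compile(r"[A-Za-z]+")
SPACES = re.compile(r"\s+")
_INVALID_CATEGORIES = {"Cc", "Cf", "Cn", "Co", "Cs"}


def is_invalid(ch):
    """Approximate test for characters that should never be in the corpus."""
    return ch == "\ufffd" or unicodedata.category(ch) in _INVALID_CATEGORIES


def normalize(text, annotation_pattern=ANNOTATION):
    text = annotation_pattern.sub("", text)                  # strip annotations
    text = SPACES.sub(" ", text)                             # collapse whitespace
    text = "".join(ch for ch in text if not is_invalid(ch))  # drop invalid chars
    text = LATIN.sub("", text)                               # remove residual Latin
    return SPACES.sub(" ", text).strip()                     # tidy leftover gaps
```

Whitespace is collapsed before the invalid-character filter because tabs and newlines are themselves control characters and would otherwise be deleted rather than turned into a space.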

Result of Normalization: Normalization reduced the vocabulary size to 97,814 tokens. However, secondary vocabulary analysis showed that while validity improved, the results were still unacceptable: 55.09% valid words vs. 44.91% invalid words.

Experiment 3: Morphological Tokenization via Botok-rs

Methodology

Given the near 45% invalidity rate from naive Regex splitting—even after normalization—we abandoned simple boundary splitting. Instead, we adopted rule-based, morphological tokenization using botok-rs (the Rust implementation of the Botok Python library, optimized for speed) with NFC Unicode normalization.
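The NFC step matters because Tibetan has canonically equivalent spellings that differ at the code point level. The sketch below uses only Python's standard `unicodedata` module and is independent of botok-rs; the helper name `nfc` is illustrative.

```python
import unicodedata


def nfc(text):
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u0f43"       # TIBETAN LETTER GHA as a single code point
decomposed = "\u0f42\u0fb7"  # GA followed by SUBJOINED HA

print(precomposed == decomposed)            # False: different code point sequences
print(nfc(precomposed) == nfc(decomposed))  # True: one canonical form
```

Without this step the two spellings would be counted as distinct vocabulary entries even though they render identically.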

Validation Report (BoKenlm-syl-0.1)

The formal vocabulary validation report for this model, generated on 2026-04-20, revealed the following:

Metric	Value
Model	BoKenlm-syl-0.1
Normalisation	NFC
Tokenizer	Botok Syllable Tokenizer
Total Words Analysed	97,814
Valid Words	53,881 (55.09%)
Invalid Words	43,933 (44.91%)

Key Findings

Despite switching to morphological tokenization, the vocabulary size remained at 97,814 tokens, identical to the normalized Regex result from Experiment 2. The validity rate (55.09%) was likewise unchanged. The dominant contamination pattern was TEXT+PUNCT fusion (23.01%), indicating that thousands of valid Tibetan words were being merged with adjacent punctuation characters (shad ། and gter-tsheg ༔) into single invalid vocabulary entries. A secondary contamination pattern was TEXT+TEXT (6.49%), where common grammatical particles like པའི་, ལས་, and པས་ were being incorrectly fused rather than recognized as multi-syllable constructions.
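For readers who want to reproduce this kind of report, here is a simplified sketch of character-class labelling. It is not the validation tool we used: the code point ranges are approximations of the Tibetan Unicode block, the tsek is treated as part of TEXT, and the function names are hypothetical. It also cannot detect TEXT+TEXT fusions, which require a lexicon rather than character classes.

```python
TSEK = "\u0f0b"


def char_category(ch):
    """Assign one character to a coarse category (approximate ranges)."""
    cp = ord(ch)
    if ch == TSEK:
        return "TEXT"   # syllable delimiter, counted with the text
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    if (ch.isascii() and ch.isdigit()) or 0x0F20 <= cp <= 0x0F33:
        return "NUM"    # ASCII and Tibetan digits
    if 0x0F04 <= cp <= 0x0F14 or 0x0F3A <= cp <= 0x0F3D:
        return "PUNCT"  # shad variants, gter-tsheg, brackets
    if 0x0F40 <= cp <= 0x0FBC and cp != 0x0F85:
        return "TEXT"   # letters, vowel signs, subjoined letters
    return "SYM"


def classify(token):
    """Label a vocabulary entry, flagging mixed-category fusions."""
    runs = []
    for ch in token:
        cat = char_category(ch)
        if not runs or runs[-1] != cat:
            runs.append(cat)  # record each change of category once
    if len(runs) <= 1:
        return runs[0] if runs else "EMPTY"
    return "MULTI_TOKEN(" + "+".join(runs) + ")"
```

Counting the labels over the whole vocabulary gives a breakdown of the same shape as the percentages quoted above.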

Critical Diagnosis: The vocabulary pollution was not a failure of the tokenizer, but a failure of the source corpus. The text itself contained embedded punctuation anomalies and formatting artifacts that no tokenizer alone could resolve. The corpus needed aggressive, domain-specific denoising before tokenization.

Experiment 4: Corpus Denoising and Punctuation Normalization (BoKenlm-syl-0.4)

Methodology

Armed with the detailed diagnostic data from the BoKenlm-syl-0.1 validation report, we designed a targeted denoising pipeline to attack the two dominant contamination vectors—fused punctuation and invalid character patterns—at the corpus level, before any tokenization occurs.

The enhanced normalization pipeline for BoKenlm-syl-0.4 consisted of three stages:

NFC Unicode Normalization (carried over from Experiment 3).
Corpus Denoising from Invalid Patterns: A systematic removal of the specific noise patterns identified in the 0.1 validation report—including fused TEXT+PUNCT sequences, embedded Latin characters, stray numerals merged with Tibetan text, and other formatting artifacts.
All Punctuation to Shad: A normalization pass converting all variant Tibetan punctuation marks (gter-tsheg ༔, closing brackets ༽, and other ornamental punctuation) into the standard shad (།). This single step eliminated the entire class of TEXT+PUNCT fusion errors by ensuring only one canonical punctuation character existed in the corpus.
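The third stage can be sketched with a simple translation table. The gter-tsheg and the closing bracket come from the description above; the remaining code points in `VARIANT_PUNCT` are an assumed selection of shad variants, and the names are hypothetical rather than taken from our pipeline.

```python
SHAD = "\u0f0d"

# Variant marks to fold into the plain shad (assumed selection).
VARIANT_PUNCT = (
    "\u0f0e"  # nyis shad (double shad)
    "\u0f0f"  # tsheg shad
    "\u0f10"  # nyis tsheg shad
    "\u0f11"  # rin chen spungs shad
    "\u0f12"  # rgya gram shad
    "\u0f14"  # gter tsheg
    "\u0f3d"  # closing bracket (ang khang gyas)
)

_TO_SHAD = str.maketrans({ch: SHAD for ch in VARIANT_PUNCT})


def all_punct_to_shad(text):
    """Replace every variant punctuation mark with the standard shad."""
    return text.translate(_TO_SHAD)
```

After this pass only one punctuation character remains, so a tokenizer that splits on the shad no longer leaves other marks attached to neighbouring syllables.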

The tokenizer remained the Botok Syllable Tokenizer, unchanged from Experiment 3, isolating the effect of corpus-level denoising.

Validation Report

Metric	Value
Model	BoKenlm-syl-0.4
Normalisation	NFC + Denoise Corpus from Invalid Patterns + All Punct to Shad
Tokenizer	Botok Syllable Tokenizer
Total Words Analysed	20,564
Valid Words	20,000 (97.26%)
Invalid Words	564 (2.74%)

This near-equal split reflects the bilingual nature of the classical Buddhist canon, where Tibetan translation coexists with preserved Sanskrit terminology (mantras, dharanis, technical terms).

Key Findings

The results represent a transformative improvement across every metric:

Vocabulary Compression: The vocabulary condensed from 97,814 (v0.1) to 20,564 tokens—a 79% reduction. This dramatic compression was achieved entirely through corpus denoising, not tokenizer changes. The 77,250 eliminated tokens were noise artifacts—fused punctuation, embedded Latin characters, stray numerals—that had been inflating the vocabulary with tokens that served no linguistic purpose.

Purity Breakthrough: Validity surged from 55.09% (v0.1) to 97.26% (v0.4). The clean TEXT category alone accounts for 97.06% of the entire vocabulary. The MULTI_TOKEN(TEXT+PUNCT) category, which constituted 23.01% of v0.1, was completely eliminated by the punctuation normalization step.

Residual Contamination Analysis: The remaining 564 invalid tokens (2.74%) are dominated by MULTI_TOKEN(TEXT+TEXT) fusions (451 tokens, 2.19%), which are common Tibetan grammatical particles like པར (pa-r), ལས (la-s), and པས (pa-s) that botok-rs splits into constituent morphemes. The remaining invalids are edge cases: SYM+TEXT fusions from rare Sanskrit diacritical marks (59), and a handful of residual Latin (12) and punctuation (7) artifacts.

Final Verdict

The journey from Experiment 1 through Experiment 4 demonstrates a fundamental principle for building statistical language models on morphologically complex or historically rich languages like Tibetan: the tokenizer and the corpus must be co-optimized.

Experiment 1 indicated that purely algorithmic tokenization (SentencePiece) is a poor substitute for linguistic rules in Tibetan. Experiments 2 and 3 showed that even a linguistically motivated tokenizer, whether regex-based or morphologically aware, cannot compensate for a dirty corpus: both produced identical 97,814-token vocabularies with identical 55.09% validity rates, despite radically different tokenization strategies. It was only in Experiment 4, when we attacked the corpus itself with targeted denoising informed by the v0.1 diagnostic data, that validity broke through to 97.26%.

The key insight: the assumption of “clean” data is a dangerous pitfall in NLP. Our source texts—canonically curated Kangyur and Tengyur volumes, published EPUBs—still harbored tens of thousands of invisible formatting artifacts that silently poisoned every downstream model. Aggressive, domain-specific normalization is an absolute prerequisite before vocabulary construction.

So What? The Practical Application

Why does this matter to the broader AI and Data Engineering ecosystem?

Automated Data Curation: As researchers attempt to build foundational LLMs for Tibetan, the BoKenlm-syl model serves as a highly efficient, deterministic gatekeeper. It can process millions of lines of raw OCR output, assigning perplexity scores to automatically discard hallucinated or garbled text before it ever reaches a computationally expensive LLM pre-training pipeline.
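A filtering pass of this kind could look like the sketch below. The threshold value, the model path, and the function name are hypothetical, and a real threshold would need to be calibrated on held-out clean and noisy samples. The scorer is passed in as a callable so the logic can be tried without a trained model.

```python
def filter_by_perplexity(lines, score_fn, max_ppl):
    """Split lines into (kept, dropped) using a perplexity ceiling.

    score_fn maps a space-separated token string to a perplexity value.
    """
    kept, dropped = [], []
    for line in lines:
        if score_fn(line) <= max_ppl:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped


# With the kenlm Python bindings installed, usage might look like this
# (path and threshold are placeholders):
#
#   import kenlm
#   model = kenlm.Model("path/to/model.arpa")
#   kept, dropped = filter_by_perplexity(lines, model.perplexity, max_ppl=500.0)
```

The input lines must be tokenized and normalized exactly as the training corpus was, otherwise clean text will score poorly for reasons unrelated to OCR quality.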

Historical Preservation: Digitizing historical texts (like the Kangyur and Tengyur) relies heavily on OCR. A calibrated KenLM allows archivists to pinpoint exact volumes or pages where the OCR pipeline failed, guiding human-in-the-loop (HITL) correction efforts precisely where they are needed most, rather than proofreading the entire canon manually.
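One way to direct reviewers is to rank pages by their average line perplexity. This is an illustrative sketch only: the page identifiers, the `pages_to_review` name, and the choice of the mean as the aggregate are assumptions.

```python
from statistics import mean


def pages_to_review(page_lines, score_fn, top_k=3):
    """Return the page ids with the highest mean perplexity.

    page_lines maps a page id to its list of tokenized lines.
    """
    scores = {
        page_id: mean(score_fn(line) for line in lines)
        for page_id, lines in page_lines.items()
        if lines  # skip empty pages, which have no score
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

A median or a high percentile may be a better aggregate when a page mixes clean body text with a few badly recognized lines.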

Corpus Quality Auditing: The vocabulary validation methodology developed across these experiments—categorizing tokens by type (TEXT, PUNCT, LATIN, NUM, SYM) and flagging multi-token fusions—is generalizable to any language. It provides a systematic, quantitative framework for auditing corpus quality before committing to expensive model training.

Acknowledgements

This dataset was developed by Dharmaduta from specifications provided by the Buddhist Digital Resource Center (BDRC) for the BDRC Etext Corpus, with funding from the Khyentse Foundation.
