Training a KenLM for Tibetan OCR Quality Assessment: An Empirical Journey

Objective: To develop a reliable, lightweight statistical language model capable of assessing the quality of Optical Character Recognition (OCR) outputs for Classical and Contemporary Tibetan texts.

Context: As the AI community scales Large Language Models (LLMs) to low-resource languages, the demand for high-quality, curated pre-training data has skyrocketed. Recent research, such as Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training, makes it apparent that an automated filtering mechanism is needed to separate clean text from noisy OCR artifacts. A Kneser-Ney smoothed n-gram model, trained with the KenLM toolkit, offers an efficient, statistically rigorous approach to this problem by evaluating text via perplexity scores.

The Mathematical Premise: KenLM and Perplexity

Before diving into the experiments, it is crucial to establish how we measure “quality.” We rely on Perplexity (PPL), which evaluates how accurately a model predicts a text sample.

Simply put, perplexity measures the model’s “surprise” when reading a new sentence. A lower $PPL$ score indicates the model is less surprised, meaning the text closely resembles the true, clean distribution of the language. To achieve this, our KenLM uses Kneser-Ney smoothing: instead of relying on raw n-gram frequencies alone, it backs off to lower-order estimates weighted by the diversity of contexts a word appears in, which keeps probabilities sensible for rare and unseen sequences and makes the model highly robust for OCR evaluation.
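Formally, for a tokenized sequence $w_1, \dots, w_N$ scored by an $n$-gram model,

$$
PPL(w_1, \dots, w_N) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P\left(w_i \mid w_{i-n+1}, \dots, w_{i-1}\right)\right)
$$

so assigning higher probability to the observed tokens directly lowers the score.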

The Corpus: Establishing the Baseline

To train the KenLM, we aggregated a diverse, cross-domain corpus of assumed “high-quality” Tibetan text. We logically grouped this into two primary datasets:

  • Classical Tibetan Canon: Comprising the Esukhia Kangyur (50 volumes) and the Tengyur (113 randomly selected volumes).

  • Contemporary Tibetan Scholarly Works: Comprising Tsadra (266 EPUBs) and Dharmaebooks (98 books).

Crucial Pre-processing Step: Before generating any vocabularies or training models, we standardized the entire corpus. We removed all newline characters from the source volumes and processed the text using a botok-rs sentence tokenizer. The resulting sentence-level dataset served as the foundational input for all subsequent experiments, ensuring consistency across trials.
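A minimal sketch of this step, using the shad (།) as a stand-in delimiter for the botok-rs sentence tokenizer (whose API is not reproduced here) and with hypothetical file paths:

```python
from pathlib import Path

SHAD = "།"  # Tibetan clause/sentence delimiter

def volume_to_sentences(path: Path) -> list[str]:
    """Flatten one source volume into a list of sentences."""
    text = path.read_text(encoding="utf-8").replace("\n", " ")  # drop all newlines
    # Stand-in split; the real pipeline delegates sentence segmentation to botok-rs.
    return [piece.strip() + SHAD for piece in text.split(SHAD) if piece.strip()]

# Aggregate every volume into a single sentence-per-line corpus file.
with open("corpus.sentences.txt", "w", encoding="utf-8") as out:
    for volume in sorted(Path("raw_volumes").glob("*.txt")):
        out.write("\n".join(volume_to_sentences(volume)) + "\n")
```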

Experiment 1: The Subword Approach (SentencePiece)

Methodology:

Our initial approach relied on modern subword tokenization, now ubiquitous in Transformer-based architectures. We trained a SentencePiece model with a fixed vocabulary size of 20,000 (sketched below).

  • Rationale: A vocabulary that is too small forces the KenLM to be overly strict (penalizing natural linguistic variance), while a vocabulary that is too large bloats inference time and makes the model too lenient toward OCR noise.
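A minimal training sketch, assuming the sentence-per-line corpus file from the pre-processing step; the file names, model prefix, and model type are placeholders, and the exact flags used for BoSentencePiece may differ:

```python
import sentencepiece as spm

# Train a 20,000-piece model on the sentence-level corpus.
spm.SentencePieceTrainer.train(
    input="corpus.sentences.txt",    # one sentence per line
    model_prefix="bo_sp",            # writes bo_sp.model / bo_sp.vocab
    vocab_size=20_000,
    character_coverage=1.0,          # keep every Tibetan codepoint
    model_type="unigram",            # assumption: SentencePiece's default algorithm
)

# Tokenize the corpus; the space-joined pieces are what KenLM estimates on,
# e.g. `lmplz -o 5 < corpus.tokenized.txt > model.arpa` (order shown is illustrative).
sp = spm.SentencePieceProcessor(model_file="bo_sp.model")
with open("corpus.sentences.txt", encoding="utf-8") as src, \
     open("corpus.tokenized.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")
```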

Key Findings:

  • Output: The tokenizer and the resulting KenLM were published to Hugging Face as BoSentencePiece and BoKenlm-sp, respectively.

  • Limitations: When evaluating clean, ground-truth data, the perplexity scores remained unexpectedly high. While the model could still function as a relative filter (by establishing a heuristic threshold to separate good OCR from bad), the absolute perplexity scores were statistically unreliable. Subword chunking without morphological awareness proved suboptimal for the specific orthographic rules of Tibetan.
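To illustrate the relative-filter usage, here is a sketch using the kenlm Python bindings; the model file names and the threshold value are placeholders, not the calibrated values from our experiments:

```python
import kenlm
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bo_sp.model")
lm = kenlm.Model("bo_kenlm_sp.binary")   # KenLM estimated on the tokenized corpus

PPL_THRESHOLD = 1_000.0                  # heuristic cutoff; must be calibrated per corpus

def is_clean(line: str) -> bool:
    """Keep a line only if the model is not too 'surprised' by it."""
    tokens = " ".join(sp.encode(line.strip(), out_type=str))
    return lm.perplexity(tokens) <= PPL_THRESHOLD

with open("ocr_output.txt", encoding="utf-8") as src:
    kept = [line for line in src if is_clean(line)]
```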

Experiment 2: Syllable-Level Tokenization and the Normalization Revelation

Methodology:

Following discussions with Elie, we pivoted from algorithmic subwords to linguistically grounded tokenization. Tibetan text is naturally segmented into syllables by the tsek (་) mark, with the shad (།) marking clause and sentence boundaries. We implemented a regex-based tokenizer to split the corpus along these delimiters (sketched below).

  • Hypothesis: Bypassing the artificial 20k limit and utilizing true syllable boundaries would yield a more precise, representative vocabulary.
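A sketch of the regex splitting, assuming the tsek (U+0F0B) and shad (U+0F0D) as boundary characters; the pattern actually used in the pipeline may include additional Tibetan punctuation:

```python
import re

# Split on the tsek (་, U+0F0B) and shad (།, U+0F0D), dropping empty pieces.
SYLLABLE_BOUNDARY = re.compile(r"[\u0F0B\u0F0D]+")

def syllable_tokenize(sentence: str) -> list[str]:
    return [tok for tok in SYLLABLE_BOUNDARY.split(sentence) if tok.strip()]

print(syllable_tokenize("བཀྲ་ཤིས་བདེ་ལེགས།"))
# ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
```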

Key Findings:

  • The Unnormalized Explosion: The initial training run yielded a massive vocabulary size of 101,462 tokens. Vocabulary validation revealed severe noise: only 53.15% of tokens were valid Tibetan words, while 46.85% were invalid (heavily polluted by fused TEXT+PUNCT and LATIN+TEXT combinations).

  • The Normalization Pipeline: Realizing our “high-quality” baseline contained severe formatting artifacts, we introduced a strict normalization step prior to tokenization (a sketch follows these findings):

    1. Removal of invalid Unicode keys.

    2. Pruning of unwanted spacing artifacts.

    3. Stripping inline spelling suggestion annotations present in the Kangyur/Tengyur datasets.

    4. Removal of residual Latin characters.

  • Result of Normalization: Normalization reduced the vocabulary size to 97,814 tokens. However, secondary vocabulary analysis showed that while validity improved, the results were still unacceptable: 55.09% valid words vs. 44.91% invalid words.
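A hedged sketch of the normalization pass: the annotation pattern is shown as a simple bracketed placeholder because the real Kangyur/Tengyur markup is dataset-specific, and the "invalid Unicode" rule is approximated by dropping everything outside the Tibetan block:

```python
import re
import unicodedata

# Placeholder for inline spelling-suggestion markup; the real Kangyur/Tengyur
# annotation format may differ from simple square brackets.
ANNOTATION = re.compile(r"\[[^\]]*\]")
LATIN = re.compile(r"[A-Za-z]+")
# Anything outside the Tibetan block (U+0F00–U+0FFF) or a plain space is treated
# as an "invalid Unicode key" in this sketch.
NON_TIBETAN = re.compile(r"[^\u0F00-\u0FFF ]")
MULTI_SPACE = re.compile(r" {2,}")

def normalize(line: str) -> str:
    line = unicodedata.normalize("NFC", line)
    line = ANNOTATION.sub("", line)             # 3. strip inline spelling suggestions
    line = LATIN.sub("", line)                  # 4. remove residual Latin characters
    line = NON_TIBETAN.sub("", line)            # 1. drop remaining invalid codepoints
    return MULTI_SPACE.sub(" ", line).strip()   # 2. prune spacing artifacts
```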

Experiment 3: Morphological Tokenization via Botok-rs

Methodology:

Given the near 45% invalidity rate from naive regex splitting—even after normalization—we abandoned simple boundary splitting. Instead, we adopted rule-based, morphological tokenization using botok-rs (the Rust implementation of the Botok Python library, optimized for speed).
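A sketch of the equivalent call using the Python botok library; our pipeline uses botok-rs, whose exact API and call signature may differ, and the corpus file name is a placeholder:

```python
from botok import WordTokenizer

wt = WordTokenizer()   # loads the default Tibetan word lists on first use

def morphological_tokenize(sentence: str) -> list[str]:
    """Return dictionary-backed word tokens instead of raw syllable splits."""
    return [tok.text for tok in wt.tokenize(sentence, split_affixes=False)]

# Build the vocabulary that the final KenLM is estimated over.
vocab = set()
with open("corpus.normalized.txt", encoding="utf-8") as src:
    for line in src:
        vocab.update(morphological_tokenize(line.strip()))
print(len(vocab))
```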

Key Findings:

  • Vocabulary Refinement: The vocabulary condensed to a far cleaner 83,721 tokens.

  • Purity Metrics: Token validity improved dramatically. Our validation report indicated 90.14% valid Tibetan words and only 9.86% invalid. The clean TEXT category alone accounted for 87.68% of the entire vocabulary, proving the efficacy of morphological awareness over naive splitting (see the classification sketch after this list).

  • Deployment: This highly accurate model was successfully deployed to Hugging Face under the repository BoKenlm-syl.
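The validity categories reported throughout (TEXT, TEXT+PUNCT, LATIN+TEXT, …) can be reproduced with a simple character-class check. A sketch, assuming the Unicode ranges below are an adequate approximation of the letter and punctuation classes used in the actual validation report:

```python
import re

TIBETAN_LETTER = re.compile(r"[\u0F40-\u0FBC]")   # consonants, vowels, subjoined letters
TIBETAN_PUNCT = re.compile(r"[\u0F0B-\u0F14]")    # tsek, shad and related marks
LATIN = re.compile(r"[A-Za-z]")

def classify(token: str) -> str:
    """Label a vocabulary entry the way the validation report does."""
    labels = []
    if TIBETAN_LETTER.search(token):
        labels.append("TEXT")
    if TIBETAN_PUNCT.search(token):
        labels.append("PUNCT")
    if LATIN.search(token):
        labels.append("LATIN")
    return "+".join(labels) or "OTHER"

assert classify("བཀྲ") == "TEXT"          # clean entry
assert classify("བཀྲ།") == "TEXT+PUNCT"   # fused punctuation, counted as invalid
```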

Final Verdict

When building statistical language models for morphologically complex or historically rich languages like Tibetan, algorithmic tokenization (like SentencePiece) is no substitute for linguistic rules. Our journey showed that the progression from arbitrary subwords (Exp 1) to naive delimiters (Exp 2) and finally to a morphological tokenizer (Exp 3) was necessary to build a reliable KenLM. Furthermore, the assumption of “clean” data is a dangerous pitfall in NLP; aggressive, domain-specific normalization is an absolute prerequisite before vocabulary construction.

So What? The Practical Application

Why does this matter to the broader AI and Data Engineering ecosystem?

  1. Automated Data Curation: As researchers attempt to build foundational LLMs for Tibetan, the BoKenlm-syl model serves as a highly efficient, deterministic gatekeeper. It can process millions of lines of raw OCR output, assigning perplexity scores to automatically discard hallucinated or garbled text before it ever reaches a computationally expensive LLM pre-training pipeline.

  2. Historical Preservation: Digitizing historical texts (like the Kangyur and Tengyur) relies heavily on OCR. A calibrated KenLM allows archivists to pinpoint exact volumes or pages where the OCR pipeline failed, guiding human-in-the-loop (HITL) correction efforts precisely where they are needed most, rather than proofreading the entire canon manually.

Related Repositories

To support the community in replicating this research and applying it to their own datasets, we have open-sourced the following resources:

  • BoCorpusQC: A toolkit designed to automatically filter Tibetan corpora using our trained KenLM models.

  • BoKenLm: The complete training pipeline and scripts used to conduct our KenLM and tokenization experiments.

  • BoTokenzier: The complete training pipeline and scripts used to train the SentencePiece tokenizer (BoSentencePiece).