Tibetan Script Classification Model training pipeline

Karma_Tashi1 · May 12, 2026, 6:03am

Tibetan Script Classification Pipeline

Building a classifier to sort Buddhist manuscript images by script type, covering dataset analysis, model selection, preprocessing ablation, and progressive fine-tuning.

Introduction

The BDRC holds approximately 3 million digitised manuscript images spanning centuries of Tibetan Buddhist textual heritage. These manuscripts are written in diverse script types, from formal Uchen to cursive Ume variants and from woodblock prints to handwritten notes, but the metadata often lacks script-level classification. Sorting 3 million images by hand is infeasible. This pipeline builds a model to do it automatically, with human oversight for uncertain cases.

The training code, preprocessing scripts, and model checkpoints are available at HuggingFace: tibetan-script-classifier and the associated GitHub repository.

The Dataset: A Taxonomy of 18 Classes

Tibetan paleography is incredibly diverse. While our initial analysis identified 24 distinct script classes, we found that only 18 had sufficient data and distinct enough characteristics to be viable for this first model iteration.

Class	Description	Training images
petsuk	Woodblock print (pecha format)	1,388
uchen_sugdring	Uchen with medium stroke weight	835
tsegdrig	Punctuated/segmented script	749
peri	Peri style manuscript	614
uchen_sugthung	Uchen with light stroke weight	240
multi_scripts	Pages containing multiple script types	235
druthung	Drutsa with light strokes	207
non_tibetan	Non-Tibetan text (Chinese, Sanskrit, etc.)	192
tsumachug	Tsumachug style	178
difficult	Ambiguous/damaged, hard to classify	170
yigchung	Small script annotations	166
drudring	Drutsa with medium strokes	132
drathung	Dratsa with light strokes	129
druring	Drutsa with heavy strokes	119
khyuyig	Shorthand/abbreviated script	113
dhumri	Smoke-style decorative script	98
tsugchung	Small Tsugring script	77
trinyig	Annotation/marginal notes script	42

Total: 5,684 images (after excluding 88 benchmark images and removing 6 classes with fewer than 30 samples: druchen, dradring, draring, tsugthung, uchen_sugring, gongshabma).

The class imbalance ratio exceeds 30:1 between the largest class (petsuk, 1,388) and the smallest (trinyig, 42).
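The imbalance figure follows directly from the class table. A minimal sketch that recomputes the total and the ratio from the listed counts (the name `TRAIN_COUNTS` is only illustrative and is not taken from the repository):

```python
# Per-class training image counts, copied from the class table above.
TRAIN_COUNTS = {
    "petsuk": 1388, "uchen_sugdring": 835, "tsegdrig": 749, "peri": 614,
    "uchen_sugthung": 240, "multi_scripts": 235, "druthung": 207,
    "non_tibetan": 192, "tsumachug": 178, "difficult": 170, "yigchung": 166,
    "drudring": 132, "drathung": 129, "druring": 119, "khyuyig": 113,
    "dhumri": 98, "tsugchung": 77, "trinyig": 42,
}

total = sum(TRAIN_COUNTS.values())
largest = max(TRAIN_COUNTS, key=TRAIN_COUNTS.get)
smallest = min(TRAIN_COUNTS, key=TRAIN_COUNTS.get)
imbalance_ratio = TRAIN_COUNTS[largest] / TRAIN_COUNTS[smallest]

print(f"{len(TRAIN_COUNTS)} classes, {total} images")
print(f"{largest}:{smallest} = {imbalance_ratio:.1f}:1")
```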

EDA: Navigating the Three Main Obstacles

Before training, we performed exploratory data analysis (EDA), which revealed three reasons this task is significantly harder than standard image recognition:

Visually Similar Script Types: Many Tibetan script types differ only in stroke weight or character spacing—features that are subtle at the page level. The Dru-family shares the same basic letterforms with variations in thickness and formality that are easily obscured by scanning artifacts.
The Pecha Aspect Ratio: The dominant manuscript format is the traditional Tibetan pecha—a wide, narrow page with an aspect ratio of approximately 5:1. Standard vision models resize this to 224×224 pixels, which squashes the characters horizontally and distorts the stroke proportions that distinguish script types.
Bimodal Scanning Conditions: The pixel intensity distribution is bimodal. We found two distinct “types” of scans: mid-tone manuscripts representing aged paper, and high-exposure modern scans. Any preprocessing strategy must handle both without destroying the ink information.

Phase 1: Model selection

Methodology

Before investing in fine-tuning, we needed to identify which pretrained backbone produces the most useful visual features for Tibetan script classification. We evaluated 8 candidate models spanning four architectural families:

Self-supervised vision transformers: DINOv2 ViT-S/14 (21M params), DINOv2 ViT-B/14 (86M params), DINOv3 ViT-S/16 (21M params)
Vision-language models: CLIP ViT-B/32 (63M params), SigLIP 2 Base (86M params)
Supervised CNNs: EfficientNet-B0 (5.3M params), ResNet-50 (25M params)
Modern hybrid: ConvNeXt-Base (88M params)

Each model’s pretrained weights were frozen (no fine-tuning). We extracted embeddings for all images and trained a logistic regression classifier on top. All models were evaluated on identical stratified splits of a balanced ~1,467-image subset to ensure fair comparison.

Results

Model	Params	Accuracy	Macro-F1	Extract time (s)
DINOv3 ViT-S	21M	47.2%	0.412	~180
SigLIP 2 Base	86M	~46%	~0.40	~200
DINOv2 ViT-B/14	86M	45.7%	0.392	308
DINOv2 ViT-S/14	21M	44.3%	0.385	196
ConvNeXt-Base	88M	43.2%	0.374	133
EfficientNet-B0	5.3M	42.1%	0.353	113
ResNet-50	25M	41.7%	0.347	116
CLIP ViT-B/32	63M	39.3%	0.283	111

The ranking mostly follows the expected pattern for domain-specific fine-grained tasks: self-supervised vision transformers outperform supervised CNNs, and CLIP ViT-B/32 ranks last. The exception is SigLIP 2 Base, a vision-language model that ranks second overall.
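Accuracy and macro-F1 diverge in the table because macro-F1 gives every class the same weight regardless of how many images it has. As an illustration only (this is not the evaluation code from the repository, and the function name is hypothetical), both metrics can be computed in plain Python like this:

```python
from collections import Counter

def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-F1) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(f1_scores) / len(f1_scores)

# Toy case: a model that always predicts the majority class.
y_true = ["petsuk"] * 8 + ["trinyig"] * 2
y_pred = ["petsuk"] * 10
acc, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro-F1={macro_f1:.3f}")
```

In the toy case accuracy looks respectable while macro-F1 is pulled down by the class that is never predicted, which is why macro-F1 is the more informative number for an imbalanced 18-class problem.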

Confusion matrix analysis

Detailed confusion matrices for DINOv3 ViT-S and SigLIP 2 Base revealed three systematic confusion clusters that persist regardless of model choice, alongside a small group of cleanly separated classes:

Confusion matrix for DINOv3 ViT-S linear probe. The petsuk-peri-tsegdrig triangle and dru-family cluster are visible as off-diagonal hotspots.

Cluster	Classes	Pattern
Petsuk triangle	petsuk, peri, tsegdrig	Bidirectional confusion across the three largest classes
Dru-family	drudring, druring, druthung, drathung	drudring ↔ druring bidirectional; druthung → dhumri leakage
Tsuma/tsug group	tsumachug, tsugchung, yigchung	Cross-confusion driven by similar character proportions
Clean classes	non_tibetan, uchen_sugdring, difficult	High F1 (0.80–0.95), minimal off-diagonal confusion

Confusion clusters identified from the linear probe. These patterns persist across all model architectures tested.

Model selection decision

DINOv3 ViT-S was selected for fine-tuning based on three factors. First, it won outright on 8 of 17 classes compared to SigLIP 2’s 6. Second, it had zero catastrophic failures (no class scoring below 5 correct on the diagonal), whereas SigLIP 2 collapsed on yigchung (3 correct) and tsumachug (7 correct). Third, its error pattern is structured and predictable — confusions concentrate between genuinely similar scripts — making it amenable to targeted hard-negative mining. SigLIP 2’s errors were more erratic, with unrelated classes collapsing into each other.

Phase 2: Preprocessing ablation

After selecting the model, we ran three preprocessing experiments to address the aspect-ratio and lighting problems. The three variants are as follows:

Experiment 1 — Whole page. Resize the image so the short edge is 224 pixels, then center crop to 224×224. This produces one image per page. Simple, fast, but captures only ~20% of a 5:1 pecha page.
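The ~20% figure is simple geometry. A small sketch of the arithmetic, assuming a landscape page and using made-up pixel dimensions for a 5:1 pecha scan:

```python
def center_crop_coverage(width, height, crop=224):
    """Fraction of the page area kept by short-edge resize followed by a center crop."""
    scale = crop / min(width, height)           # short edge becomes `crop` pixels
    resized_w, resized_h = width * scale, height * scale
    kept_area = min(crop, resized_w) * min(crop, resized_h)
    return kept_area / (resized_w * resized_h)

# Hypothetical 5000x1000 scan (5:1): resized to 1120x224, then cropped to 224x224.
print(f"{center_crop_coverage(5000, 1000):.0%} of the page survives the crop")
```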

Experiment 2 — Color patches. Same resize, but instead of one center crop, slide a 224×224 window along the width with 25% overlap. This produces 5-7 undistorted patches per page, filtered by ink density (2-95%) to exclude blank margins. The dataset grows from 5,684 to ~28,286 samples.
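A rough sketch of how the sliding window and the ink-density filter could work, in plain Python. Several details are assumptions rather than facts from the post: the final right-aligned patch, the darkness threshold of 128 used to decide what counts as ink, and all function names.

```python
def patch_offsets(width, patch=224, overlap=0.25):
    """Left x-offsets of patches slid across a page already resized to `patch` height."""
    if width <= patch:
        return [0]
    stride = int(patch * (1 - overlap))  # 168 px for a 224 px patch
    offsets = list(range(0, width - patch + 1, stride))
    if offsets[-1] != width - patch:
        offsets.append(width - patch)    # assumption: right-align one last patch
    return offsets

def ink_density(gray_patch, ink_threshold=128):
    """Fraction of pixels darker than `ink_threshold` in a 2-D list of 0-255 values."""
    pixels = [value for row in gray_patch for value in row]
    return sum(value < ink_threshold for value in pixels) / len(pixels)

def keep_patch(gray_patch, low=0.02, high=0.95):
    """Keep patches whose ink density lies in the 2-95% band from the post."""
    return low <= ink_density(gray_patch) <= high

# A 5:1 page resized to 1120x224 yields 7 patches under these assumptions.
print(patch_offsets(1120))
```

With a 168-pixel stride, a 1120-pixel-wide page gives six regular windows plus the right-aligned one, which is consistent with the 5-7 patches per page reported above.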

Experiment 3 — CLAHE patches. Same patches as Experiment 2, but each patch is converted to grayscale and processed with Contrast Limited Adaptive Histogram Equalization (CLAHE, clipLimit=2.0, tileGridSize=8×8). This normalizes contrast across the two intensity populations without destroying content — unlike binarization, which converts to binary black/white and discards all grayscale and color information.
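For reference, the CLAHE step with the parameters quoted above maps onto OpenCV's `cv2.createCLAHE` roughly as follows. This is a sketch, not the repository's preprocessing script; the function name and the choice to replicate the grayscale channel three times are assumptions.

```python
CLAHE_CLIP_LIMIT = 2.0
CLAHE_TILE_GRID = (8, 8)

def clahe_patch(bgr_patch):
    """Grayscale + CLAHE for one uint8 BGR patch (the layout cv2.imread returns)."""
    import cv2  # imported lazily so this module loads even without OpenCV installed

    gray = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=CLAHE_CLIP_LIMIT, tileGridSize=CLAHE_TILE_GRID)
    equalized = clahe.apply(gray)
    # Assumption: replicate the single channel so a 3-channel backbone can consume it;
    # the post does not say how grayscale patches are fed to the model.
    return cv2.cvtColor(equalized, cv2.COLOR_GRAY2BGR)
```

Unlike binarization, the output is still an 8-bit grayscale image, so faint strokes remain as intermediate gray values instead of being forced to black or white.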

A fourth variant using Sauvola binarization was prepared but abandoned: the binarization step destroyed so much contrast information that the ink density filter rejected most patches, producing a 1:1 ratio with whole-page counts instead of the expected 5-7× multiplication. This confirmed that binarization does not suit this dataset’s mixed scanning conditions.

Dataset sizes after preprocessing

Experiment	Images/patches	Classes	Multiplier
Whole page	5,684	18	1×
Color patches	28,286	18	~5×
CLAHE patches	28,286	18	~5×

Dataset sizes after preprocessing. Patch extraction multiplies the smallest class (trinyig) from 42 to ~248 samples.

Phase 3: Progressive fine-tuning

Training recipe

All three experiments used identical training settings on DINOv3 ViT-S (21M parameters, 384-dimensional CLS token embeddings, 12 transformer blocks). Training proceeds in three stages of progressive unfreezing:

Stage A — Head only (20 epochs): The entire DINOv3 backbone is frozen. A 2-layer MLP classification head (384 → 128 → 18 classes) trains at learning rate 1e-3. This establishes a baseline using only the pretrained features.
Stage B — Last 2 blocks (10 epochs): Transformer blocks 11-12 are unfrozen at learning rate 1e-5 (100× lower than the head). The top layers begin adapting from generic visual features to Tibetan script patterns.
Stage C — Last 4 blocks (10 epochs): Blocks 9-12 are unfrozen at 5e-6, head reduced to 5e-4. Deeper adaptation with diminishing returns expected.

Each stage loads the best checkpoint from the previous stage. The overall best checkpoint across all stages becomes the final model.
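The three stages can be written down as a small schedule. The sketch below only encodes the numbers from the recipe; the `Stage` class and helper are hypothetical, and the Stage B head learning rate of 1e-3 is inferred from the "100× lower than the head" remark rather than stated outright.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    name: str
    epochs: int
    unfrozen_last_blocks: int     # how many top transformer blocks are trainable
    head_lr: float
    backbone_lr: Optional[float]  # None while the backbone is fully frozen

STAGES = [
    Stage("A", epochs=20, unfrozen_last_blocks=0, head_lr=1e-3, backbone_lr=None),
    Stage("B", epochs=10, unfrozen_last_blocks=2, head_lr=1e-3, backbone_lr=1e-5),
    Stage("C", epochs=10, unfrozen_last_blocks=4, head_lr=5e-4, backbone_lr=5e-6),
]

def trainable_blocks(stage, num_blocks=12):
    """1-indexed transformer blocks that receive gradients in this stage."""
    first = num_blocks - stage.unfrozen_last_blocks + 1
    return list(range(first, num_blocks + 1))

for stage in STAGES:
    print(stage.name, stage.epochs, trainable_blocks(stage))
```

In a PyTorch training loop this would typically translate into toggling `requires_grad` on the parameters of the listed blocks and giving the optimizer two parameter groups, one for the head and one for the unfrozen backbone blocks, each with its own learning rate.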

Class balancing: WeightedRandomSampler for balanced batches and class-weighted CrossEntropyLoss (inverse-frequency weights) to prevent the model from defaulting to majority-class predictions.
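One common way to derive inverse-frequency weights is `weight_c = N / (K * n_c)`, where `N` is the total number of images, `K` the number of classes, and `n_c` the class count. The post does not say which normalisation was used, so treat the formula and the function name below as assumptions:

```python
def inverse_frequency_weights(counts):
    """weight_c = N / (K * n_c): rarer classes get proportionally larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}

# Three rows from the class table, for illustration.
counts = {"petsuk": 1388, "uchen_sugdring": 835, "trinyig": 42}
weights = inverse_frequency_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

In PyTorch, weights like these would be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, while `torch.utils.data.WeightedRandomSampler` takes one weight per training sample (the weight of that sample's class).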

Document-aware augmentations: Random rotation ±5° (simulating tilted scans), brightness/contrast jitter ±20%, random crop scale 0.7-1.0, and light random erasing (simulating page damage).

Splits: 70% train / 15% val / 15% test, stratified by class and split at the page level (all patches from one page stay in the same split to prevent data leakage). 88 gold-standard benchmark images (roughly 5 per class) were excluded from all splits.
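A minimal sketch of a page-level stratified split, assuming each page has a single class label and that patch ids look like `<page_id>#<index>` (both the id format and the function names are hypothetical):

```python
import random
from collections import defaultdict

def split_pages(page_labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Map page_id -> 'train' | 'val' | 'test', stratified by class label."""
    by_class = defaultdict(list)
    for page_id, label in page_labels.items():
        by_class[label].append(page_id)

    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_class):
        pages = sorted(by_class[label])
        rng.shuffle(pages)
        n_train = round(len(pages) * fractions[0])
        n_val = round(len(pages) * fractions[1])
        for i, page_id in enumerate(pages):
            if i < n_train:
                assignment[page_id] = "train"
            elif i < n_train + n_val:
                assignment[page_id] = "val"
            else:
                assignment[page_id] = "test"  # remainder goes to test
    return assignment

def split_of_patch(patch_id, assignment):
    """Patches inherit the split of their page, so no page straddles two splits."""
    return assignment[patch_id.split("#")[0]]
```

Because the split is decided per page and patches only look it up, overlapping patches from the same page can never end up on both sides of the train/test boundary.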

Stage-level results

Stage	Whole page Macro-F1	Whole page Acc	Patches color Macro-F1	Patches color Acc	Patches CLAHE Macro-F1	Patches CLAHE Acc
Stage A (head only)	0.496	55.7%	0.504*	58.2%*	0.487*	57.9%*
Stage B (last 2 blocks)	0.512	57.1%	0.497*	58.5%*	0.529*	60.0%*
Stage C (last 4 blocks)	0.505	56.4%	0.502*	59.2%*	0.526*	60.0%*

Stage-level results across all three experiments. Bold indicates best per experiment. Asterisks (*) denote page-level metrics computed by aggregating patch predictions via softmax averaging across the same 844 test pages.
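The softmax-averaging step used for the starred numbers can be sketched in a few lines of plain Python. The function names and the toy logits are illustrative only:

```python
import math
from collections import defaultdict

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def page_predictions(patch_logits, patch_pages):
    """Average per-patch softmax vectors per page; return page -> argmax class index."""
    sums, counts = {}, defaultdict(int)
    for logits, page in zip(patch_logits, patch_pages):
        probs = softmax(logits)
        if page not in sums:
            sums[page] = [0.0] * len(probs)
        sums[page] = [s + p for s, p in zip(sums[page], probs)]
        counts[page] += 1
    result = {}
    for page, summed in sums.items():
        mean = [s / counts[page] for s in summed]
        result[page] = max(range(len(mean)), key=mean.__getitem__)
    return result

# One confident patch and two weakly opposed patches from the same page.
logits = [[2.0, 0.0], [0.0, 0.5], [0.0, 0.5]]
print(page_predictions(logits, ["page_A"] * 3))
```

In this toy case a majority vote over patches would pick class 1, while softmax averaging picks class 0 because the single confident patch outweighs two uncertain ones. That is the practical difference between the two aggregation schemes.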

Stage B consistently produced the best or near-best results across all experiments, confirming that unfreezing the top 2 transformer blocks provides meaningful domain adaptation without overfitting. Stage C offered no consistent improvement, suggesting that deeper unfreezing is unnecessary at this dataset size.

Preprocessing comparison

Experiment	Best page-level macro-F1	Best stage	Best accuracy
Whole page	0.512	Stage B	57.1%
Patches color	0.504	Stage A	59.2%
Patches CLAHE	0.529	Stage B	60.0%

Overall preprocessing comparison. CLAHE patches achieve the highest macro-F1 but the improvement over whole page is modest relative to the inference cost increase.

CLAHE patches achieved the highest page-level macro-F1 (0.529), marginally outperforming whole page (0.512) by 0.017, or 1.7 points. However, this comes at the cost of 5-7× more compute per image at inference (patch extraction, multiple forward passes, and aggregation). Color patches without CLAHE showed no macro-F1 improvement over whole page (0.504 vs. 0.512), which suggests that the 5× data multiplication from patching alone does not help and that the contrast normalization provides the marginal benefit.

Whole page is recommended for production deployment because it delivers comparable macro-F1 (0.512 vs. 0.529) with dramatically simpler inference: one forward pass per image, no patch extraction, no aggregation logic.

Summary:

The experimental results across eight architectures and three preprocessing variants reveal a persistent performance plateau. Regardless of backbone (5.3M to 88M parameters in the linear-probe comparison) or input strategy (whole-page vs. CLAHE patches), macro-F1 never exceeds a “ceiling” of approximately 0.53 (53%).

The Core Bottleneck: Systematic Confusion Clusters

Classification errors are not randomly distributed; they are concentrated in three specific visual “hotspots” that account for the vast majority of failures:

The Petsuk–Peri–Tsegdrig Triangle: These represent the three largest classes in the dataset. The model fails to reliably distinguish between them at the page level, resulting in high “bleeding” between labels.
The Dru-Family Cluster: Regional variants like drudring and druring confuse bidirectionally. Their fundamental letterforms are nearly identical, differing only in subtle stroke weight—a feature that DINOv3’s global embeddings struggle to isolate.
The Tsuma/Tsug/Yig Group: Tsumachug, tsugchung, and yigchung suffer from cross-confusion and extremely low recall (down to 25%) due to similar character proportions.

Solution & Next Steps: Moving Toward Hierarchical Routing

To break the 53% ceiling, the pipeline is shifting from flat 18-way classification to a targeted hierarchical classification strategy.

1. Uchen-Ume Binary Router

The immediate next step is the implementation of a high-precision Uchen vs. Ume binary classifier. By separating “headed” scripts from “headless” scripts first, we capitalize on the model’s strongest discriminative features. Our preliminary tests show that this binary “Router” can achieve near-perfect accuracy, providing a clean entry point for the rest of the pipeline.

2. The 6-Family Hierarchical Classification

Following the binary split, images will be routed into broad hierarchical families rather than 18 granular types. The table below lists six categories; Tsugdri and Gyuyig are handled together, which leaves 5 routing targets. This reduces the “noise” caused by visually identical sub-types and concentrates model capacity on more distinct paleographic groups:

Category	Primary Visual Marker	Target Scripts
Uchen	Horizontal “Head” stroke	Sugthung, Sugdring, Sugring
Druma	Oval/Round counters	Zlumris, Druthung, Druchen
Danyig	Flat-oval/Transitional shape	Tsegdrig, Drathung, DraRing
Pedri	Square/Rectangular counters	Peri, Petsuk
Tsugdri	Oblong counters + Looped gigu	Tsugthung, Tsugchung, Trinyig
Gyuyig	Upward-pointing gigu	Yigchung, Tsugma Khyug, Khyugyig
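To make the planned routing concrete, here is a rough sketch of the two-stage flow. The family table is copied verbatim from above; everything else is hypothetical, including the assumption that every non-Uchen family sits on the Ume side of the binary router and the `is_uchen` / `ume_family` callables that stand in for trained models.

```python
# Family -> target scripts, copied verbatim from the table above.
FAMILIES = {
    "Uchen": ["Sugthung", "Sugdring", "Sugring"],
    "Druma": ["Zlumris", "Druthung", "Druchen"],
    "Danyig": ["Tsegdrig", "Drathung", "DraRing"],
    "Pedri": ["Peri", "Petsuk"],
    "Tsugdri": ["Tsugthung", "Tsugchung", "Trinyig"],
    "Gyuyig": ["Yigchung", "Tsugma Khyug", "Khyugyig"],
}

# Tsugdri and Gyuyig are handled together, leaving five routing targets.
MERGED = {"Tsugdri": "Tsugdri+Gyuyig", "Gyuyig": "Tsugdri+Gyuyig"}

def routing_target(family):
    return MERGED.get(family, family)

def route(image, is_uchen, ume_family):
    """Stage 1: binary Uchen/Ume router. Stage 2: family classifier for Ume pages."""
    if is_uchen(image):
        return "Uchen"
    return routing_target(ume_family(image))

TARGETS = sorted({routing_target(family) for family in FAMILIES})
print(TARGETS)
```

A design consequence worth noting: errors made by the binary router cannot be recovered downstream, so the router's precision sets an upper bound on the accuracy of the whole hierarchy.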

Model:

Dataset:
