Tibetan Script Classification Pipeline
Building a classifier to sort Buddhist manuscript images by script type, covering dataset analysis, model selection, preprocessing ablation, and progressive fine-tuning.
Introduction
The Buddhist Digital Resource Center (BDRC) holds approximately 3 million digitised manuscript images spanning centuries of Tibetan Buddhist textual heritage. These manuscripts are written in diverse script types, from formal Uchen to cursive Ume variants and from woodblock prints to handwritten notes, but the metadata often lacks script-level classification. Sorting 3 million images by hand is infeasible. This pipeline builds a model to do it automatically, with human oversight for uncertain cases.
The training code, preprocessing scripts, and model checkpoints are available at HuggingFace: tibetan-script-classifier and the associated GitHub repository.
The Dataset: A Taxonomy of 18 Classes
Tibetan paleography is incredibly diverse. While our initial analysis identified 24 distinct script classes, we found that only 18 had sufficient data and distinct enough characteristics to be viable for this first model iteration.
| Class | Description | Training images |
|---|---|---|
| petsuk | Woodblock print (pecha format) | 1,388 |
| uchen_sugdring | Uchen with medium stroke weight | 835 |
| tsegdrig | Punctuated/segmented script | 749 |
| peri | Peri style manuscript | 614 |
| uchen_sugthung | Uchen with light stroke weight | 240 |
| multi_scripts | Pages containing multiple script types | 235 |
| druthung | Drutsa with light strokes | 207 |
| non_tibetan | Non-Tibetan text (Chinese, Sanskrit, etc.) | 192 |
| tsumachug | Tsumachug style | 178 |
| difficult | Ambiguous/damaged, hard to classify | 170 |
| yigchung | Small script annotations | 166 |
| drudring | Drutsa with medium strokes | 132 |
| drathung | Dratsa with light strokes | 129 |
| druring | Drutsa with heavy strokes | 119 |
| khyuyig | Shorthand/abbreviated script | 113 |
| dhumri | Smoke-style decorative script | 98 |
| tsugchung | Small Tsugring script | 77 |
| trinyig | Annotation/marginal notes script | 42 |
Total: 5,684 images (after excluding 88 benchmark images and removing 6 classes with fewer than 30 samples: druchen, dradring, draring, tsugthung, uchen_sugring, gongshabma).
The class imbalance ratio is roughly 33:1 between the largest class (petsuk, 1,388) and the smallest (trinyig, 42).
EDA: Navigating the Three Main Obstacles
Before training, we performed Exploratory Data Analysis (EDA) which revealed why this task is significantly harder than standard image recognition:
- Visually Similar Script Types: Many Tibetan script types differ only in stroke weight or character spacing, features that are subtle at the page level. The Dru-family shares the same basic letterforms with variations in thickness and formality that are easily obscured by scanning artifacts.
- The Pecha Aspect Ratio: The dominant manuscript format is the traditional Tibetan pecha, a wide, narrow page with an aspect ratio of approximately 5:1. Standard vision models resize this to 224×224 pixels, which squashes the characters horizontally and distorts the stroke proportions that distinguish script types.
- Bimodal Scanning Conditions: The pixel intensity distribution is bimodal. We found two distinct “types” of scans: mid-tone manuscripts representing aged paper, and high-exposure modern scans. Any preprocessing strategy must handle both without destroying the ink information.
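To make the aspect-ratio obstacle concrete, the following minimal sketch computes how much a direct resize to a square compresses the horizontal axis relative to the vertical one. The 5000×1000 page size is a hypothetical example of a 5:1 pecha scan, not a measurement from the dataset.

```python
def resize_distortion(width: int, height: int, target: int = 224) -> float:
    """Horizontal scale divided by vertical scale for a direct resize to target x target.

    A value of 1.0 means no distortion; values below 1.0 mean glyphs are
    squashed horizontally.
    """
    scale_x = target / width
    scale_y = target / height
    return scale_x / scale_y


# Hypothetical 5:1 pecha scan (assumed size, for illustration only).
print(round(resize_distortion(5000, 1000), 3))  # 0.2
```

Under this assumption, every glyph keeps only one fifth of its original width relative to its height, which is the kind of change that can hide stroke-weight differences.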
Phase 1: Model selection
Methodology
Before investing in fine-tuning, we needed to identify which pretrained backbone produces the most useful visual features for Tibetan script classification. We evaluated 8 candidate models spanning four architectural families:
- Self-supervised vision transformers: DINOv2 ViT-S/14 (21M params), DINOv2 ViT-B/14 (86M params), DINOv3 ViT-S/16 (21M params)
- Vision-language models: CLIP ViT-B/32 (63M params), SigLIP 2 Base (86M params)
- Supervised CNNs: EfficientNet-B0 (5.3M params), ResNet-50 (25M params)
- Modern hybrid: ConvNeXt-Base (88M params)
Each model’s pretrained weights were frozen (no fine-tuning). We extracted embeddings for all images and trained a logistic regression classifier on top. All models were evaluated on identical stratified splits of a balanced ~1,467-image subset to ensure fair comparison.
Results
| Model | Params | Accuracy | Macro-F1 | Extract time (s) |
|---|---|---|---|---|
| DINOv3 ViT-S | 21M | 47.2% | 0.412 | ~180 |
| SigLIP 2 Base | 86M | ~46% | ~0.40 | ~200 |
| DINOv2 ViT-B/14 | 86M | 45.7% | 0.392 | 308 |
| DINOv2 ViT-S/14 | 21M | 44.3% | 0.385 | 196 |
| ConvNeXt-Base | 88M | 43.2% | 0.374 | 133 |
| EfficientNet-B0 | 5.3M | 42.1% | 0.353 | 113 |
| ResNet-50 | 25M | 41.7% | 0.347 | 116 |
| CLIP ViT-B/32 | 63M | 39.3% | 0.283 | 111 |
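The ranking only partly matches the expected pattern for domain-specific fine-grained tasks. The self-supervised vision transformers all outperform the supervised CNNs, and CLIP ViT-B/32 ranks last. SigLIP 2 Base is the exception among the vision-language models: it places second, ahead of both DINOv2 variants.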
Confusion matrix analysis
Detailed confusion matrices for DINOv3 ViT-S and SigLIP 2 Base revealed three systematic confusion clusters, plus one group of cleanly separated classes, that persist regardless of model choice:
Confusion matrix for DINOv3 ViT-S linear probe. The petsuk-peri-tsegdrig triangle and dru-family cluster are visible as off-diagonal hotspots.
| Cluster | Classes | Pattern |
|---|---|---|
| Petsuk triangle | petsuk, peri, tsegdrig | Bidirectional confusion across the three largest classes |
| Dru-family | drudring, druring, druthung, drathung | drudring ↔ druring bidirectional; druthung → dhumri leakage |
| Tsuma/tsug group | tsumachug, tsugchung, yigchung | Cross-confusion driven by similar character proportions |
| Clean classes | non_tibetan, uchen_sugdring, difficult | High F1 (0.80–0.95), minimal off-diagonal confusion |
Confusion clusters identified from the linear probe. These patterns persist across all model architectures tested.
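As an illustration of how such clusters can be read off a confusion matrix, here is a minimal sketch that computes per-class F1, macro-F1, and the largest off-diagonal cells. The 3×3 matrix uses made-up counts for the petsuk triangle purely for demonstration; it is not taken from the actual results, and the function names are hypothetical.

```python
def per_class_f1(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted elsewhere
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes predicted as k
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of per-class F1, so small classes count as much as large ones."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def top_confusions(cm, labels, k=3):
    """Largest off-diagonal cells as (count, true_label, predicted_label)."""
    n = len(cm)
    pairs = [(cm[i][j], labels[i], labels[j])
             for i in range(n) for j in range(n)
             if i != j and cm[i][j] > 0]
    pairs.sort(key=lambda t: t[0], reverse=True)
    return pairs[:k]


# Illustrative counts only (assumption, not real results).
labels = ["petsuk", "peri", "tsegdrig"]
cm = [[8, 1, 1],
      [2, 6, 2],
      [1, 3, 6]]
print(macro_f1(cm))
print(top_confusions(cm, labels))
```

Because macro-F1 averages classes without weighting by size, a collapse on a 42-image class such as trinyig lowers the score as much as a collapse on petsuk would.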
Model selection decision
DINOv3 ViT-S was selected for fine-tuning based on three factors. First, it won outright on 8 of 17 classes compared to SigLIP 2’s 6. Second, it had zero catastrophic failures (no class scoring below 5 correct on the diagonal), whereas SigLIP 2 collapsed on yigchung (3 correct) and tsumachug (7 correct). Third, its error pattern is structured and predictable — confusions concentrate between genuinely similar scripts — making it amenable to targeted hard-negative mining. SigLIP 2’s errors were more erratic, with unrelated classes collapsing into each other.
Phase 2: Preprocessing ablation
After selecting the model, we ran three preprocessing experiments to address the aspect-ratio and lighting problems. The three variants are as follows:
Experiment 1 — Whole page. Resize the image so the short edge is 224 pixels, then center crop to 224×224. This produces one image per page. Simple, fast, but captures only ~20% of a 5:1 pecha page.
Experiment 2 — Color patches. Same resize, but instead of one center crop, slide a 224×224 window along the width with 25% overlap. This produces 5-7 undistorted patches per page, filtered by ink density (2-95%) to exclude blank margins. The dataset grows from 5,684 to ~28,286 samples.
Experiment 3 — CLAHE patches. Same patches as Experiment 2, but each patch is converted to grayscale and processed with Contrast Limited Adaptive Histogram Equalization (CLAHE, clipLimit=2.0, tileGridSize=8×8). This normalizes contrast across the two intensity populations without destroying content — unlike binarization, which converts to binary black/white and discards all grayscale and color information.
A fourth variant using Sauvola binarization was prepared but abandoned: the binarization step destroyed so much contrast information that the ink density filter rejected most patches, producing a 1:1 ratio with whole-page counts instead of the expected 5-7× multiplication. This confirmed that binarization does not suit this dataset’s mixed scanning conditions.
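The sliding-window patching and ink-density filter described for Experiment 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not specify how ink density is measured or how the right edge is handled, so the darkness threshold of 128 and the extra right-aligned window are assumptions, and all function names are hypothetical.

```python
def patch_offsets(width, window=224, overlap=0.25):
    """Left x-coordinates of square windows slid along the page width."""
    if width <= window:
        return [0]
    stride = int(window * (1 - overlap))  # 168 px for a 224 px window
    offsets = list(range(0, width - window + 1, stride))
    # Assumption: add one right-aligned window so the right edge is covered.
    if offsets[-1] != width - window:
        offsets.append(width - window)
    return offsets


def ink_density(patch, threshold=128):
    """Fraction of pixels darker than `threshold` in a grayscale patch.

    `patch` is a list of rows of 0-255 intensities. The threshold is an
    assumption made for this sketch.
    """
    total = sum(len(row) for row in patch)
    dark = sum(1 for row in patch for px in row if px < threshold)
    return dark / total


def keep_patch(patch, lo=0.02, hi=0.95, threshold=128):
    """Keep patches whose ink density lies in the 2-95% range from the text."""
    return lo <= ink_density(patch, threshold) <= hi


# A page resized to 1120x224 (5:1) yields seven windows under these assumptions.
print(patch_offsets(1120))
```

A filter like this also shows why binarization can backfire: once most pixels in a patch end up on one side of the threshold, the density falls outside the accepted range and the patch is rejected.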
Dataset sizes after preprocessing
| Experiment | Images/patches | Classes | Multiplier |
|---|---|---|---|
| Whole page | 5,684 | 18 | 1× |
| Color patches | 28,286 | 18 | ~5× |
| CLAHE patches | 28,286 | 18 | ~5× |
Dataset sizes after preprocessing. Patch extraction multiplies the smallest class (trinyig) from 42 to ~248 samples.
Phase 3: Progressive fine-tuning
Training recipe
All three experiments used identical training settings on DINOv3 ViT-S (21M parameters, 384-dimensional CLS token embeddings, 12 transformer blocks). Each experiment was trained in three stages of progressive unfreezing:
- Stage A — Head only (20 epochs): The entire DINOv3 backbone is frozen. A 2-layer MLP classification head (384 → 128 → 18 classes) trains at learning rate 1e-3. This establishes a baseline using only the pretrained features.
- Stage B — Last 2 blocks (10 epochs): Transformer blocks 11-12 are unfrozen at learning rate 1e-5 (100× lower than the head). The top layers begin adapting from generic visual features to Tibetan script patterns.
- Stage C — Last 4 blocks (10 epochs): Blocks 9-12 are unfrozen at 5e-6, head reduced to 5e-4. Deeper adaptation with diminishing returns expected.
Each stage loads the best checkpoint from the previous stage. The overall best checkpoint across all stages becomes the final model.
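The three-stage schedule can be written down as plain configuration. The sketch below is framework-agnostic and only encodes which blocks are trainable and at what learning rate; the dictionary layout and function names are hypothetical. The Stage B head learning rate of 1e-3 is inferred from the "100× lower than the head" remark and should be treated as an assumption.

```python
NUM_BLOCKS = 12  # DINOv3 ViT-S transformer blocks, numbered 1..12 as in the text

STAGES = {
    "A": {"epochs": 20, "unfrozen_blocks": 0, "backbone_lr": None, "head_lr": 1e-3},
    # head_lr for Stage B is inferred, not stated explicitly in the recipe.
    "B": {"epochs": 10, "unfrozen_blocks": 2, "backbone_lr": 1e-5, "head_lr": 1e-3},
    "C": {"epochs": 10, "unfrozen_blocks": 4, "backbone_lr": 5e-6, "head_lr": 5e-4},
}


def trainable_blocks(stage):
    """1-based indices of the transformer blocks unfrozen in a stage."""
    n = STAGES[stage]["unfrozen_blocks"]
    return list(range(NUM_BLOCKS - n + 1, NUM_BLOCKS + 1))


def param_groups(stage):
    """Named learning-rate groups: the head plus every unfrozen block."""
    cfg = STAGES[stage]
    groups = [{"name": "head", "lr": cfg["head_lr"]}]
    for block in trainable_blocks(stage):
        groups.append({"name": f"block_{block}", "lr": cfg["backbone_lr"]})
    return groups


for stage in ("A", "B", "C"):
    print(stage, trainable_blocks(stage), param_groups(stage)[0]["lr"])
```

In a training framework, each group would typically map to one optimizer parameter group, with all remaining backbone parameters left frozen.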
Class balancing: WeightedRandomSampler for balanced batches and class-weighted CrossEntropyLoss (inverse-frequency weights) to prevent the model from defaulting to majority-class predictions.
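As a sketch of what inverse-frequency weighting means for this dataset, the snippet below derives per-class weights from the training counts in the taxonomy table. The normalisation `total / (n_classes * count)` is one common convention and is an assumption here; the source only says "inverse-frequency weights".

```python
# Training image counts copied from the class table above.
TRAIN_COUNTS = {
    "petsuk": 1388, "uchen_sugdring": 835, "tsegdrig": 749, "peri": 614,
    "uchen_sugthung": 240, "multi_scripts": 235, "druthung": 207,
    "non_tibetan": 192, "tsumachug": 178, "difficult": 170, "yigchung": 166,
    "drudring": 132, "drathung": 129, "druring": 119, "khyuyig": 113,
    "dhumri": 98, "tsugchung": 77, "trinyig": 42,
}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count) (assumed normalisation)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * c) for name, c in counts.items()}


weights = inverse_frequency_weights(TRAIN_COUNTS)
# The rarest class is weighted about 33 times more heavily than the largest.
print(round(weights["trinyig"] / weights["petsuk"], 2))
```

The same counts could drive per-sample sampling weights (for example `1 / count` of the sample's class), which is the usual input to a weighted random sampler.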
Document-aware augmentations: Random rotation ±5° (simulating tilted scans), brightness/contrast jitter ±20%, random crop scale 0.7-1.0, and light random erasing (simulating page damage).
Splits: 70% train / 15% val / 15% test, stratified by class, split at the page level (all patches from one page stay in the same split to prevent data leakage). 88 gold-standard benchmark images (about 5 per class) were excluded from all splits.
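A page-level stratified split of this kind can be sketched as below. It is a minimal illustration, not the project's actual split code: the function names, the fixed seed, and the `<page_id>_p<index>` patch naming scheme are all hypothetical.

```python
import random
from collections import defaultdict


def split_pages(page_labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign each page to train/val/test, stratified by class label.

    page_labels maps page_id -> class label. Rounding means small classes
    only approximately follow the requested fractions.
    """
    by_class = defaultdict(list)
    for page_id, label in page_labels.items():
        by_class[label].append(page_id)

    rng = random.Random(seed)
    assignment = {}
    for label in sorted(by_class):
        pages = sorted(by_class[label])
        rng.shuffle(pages)
        n_train = round(len(pages) * fractions[0])
        n_val = round(len(pages) * fractions[1])
        for i, page_id in enumerate(pages):
            if i < n_train:
                assignment[page_id] = "train"
            elif i < n_train + n_val:
                assignment[page_id] = "val"
            else:
                assignment[page_id] = "test"
    return assignment


def split_of_patch(patch_id, assignment):
    """Patches inherit the split of their page (assumed '<page_id>_p<index>' ids)."""
    page_id = patch_id.rsplit("_p", 1)[0]
    return assignment[page_id]


# Toy data: 20 pages for each of two classes.
pages = {f"a{i}": "petsuk" for i in range(20)}
pages.update({f"b{i}": "peri" for i in range(20)})
assignment = split_pages(pages)
print(split_of_patch("a3_p0", assignment) == split_of_patch("a3_p5", assignment))  # True
```

Splitting before patch extraction is what prevents leakage: overlapping patches from one page are near-duplicates, so letting them straddle train and test would inflate test scores.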
Stage-level results
| Stage | Whole page Macro-F1 | Whole page Acc | Patches color Macro-F1 | Patches color Acc | Patches CLAHE Macro-F1 | Patches CLAHE Acc |
|---|---|---|---|---|---|---|
| Stage A (head only) | 0.496 | 55.7% | 0.504* | 58.2%* | 0.487* | 57.9%* |
| Stage B (last 2 blocks) | 0.512 | 57.1% | 0.497* | 58.5%* | 0.529* | 60.0%* |
| Stage C (last 4 blocks) | 0.505 | 56.4% | 0.502* | 59.2%* | 0.526* | 60.0%* |
Stage-level results across all three experiments. The best macro-F1 per experiment is Stage B for whole page (0.512), Stage A for color patches (0.504), and Stage B for CLAHE patches (0.529). Asterisks (*) denote page-level metrics computed by aggregating patch predictions via softmax averaging across the same 844 test pages.
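The softmax-averaging aggregation used for the page-level metrics can be sketched as follows: each patch's logits are converted to probabilities, the probabilities are averaged across the page, and the class with the highest mean probability wins. The function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one patch's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def page_prediction(patch_logits):
    """Average per-patch probabilities and return (class_index, mean_probs)."""
    probs = [softmax(logits) for logits in patch_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean


# Three patches from one page, three classes (made-up logits).
logits = [[2.0, 1.0, 0.1],
          [0.5, 2.5, 0.3],
          [0.2, 3.0, 0.1]]
best, mean = page_prediction(logits)
print(best, [round(p, 3) for p in mean])
```

The mean probability of the winning class also gives a natural confidence score, which could be used to send uncertain pages to a human reviewer.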
Stage B consistently produced the best or near-best results across all experiments, confirming that unfreezing the top 2 transformer blocks provides meaningful domain adaptation without overfitting. Stage C offered no consistent improvement, suggesting that deeper unfreezing is unnecessary at this dataset size.
Preprocessing comparison
| Experiment | Best page-level macro-F1 | Best stage (by macro-F1) | Best accuracy (any stage) |
|---|---|---|---|
| Whole page | 0.512 | Stage B | 57.1% |
| Patches color | 0.504 | Stage A | 59.2% |
| Patches CLAHE | 0.529 | Stage B | 60.0% |
Overall preprocessing comparison. CLAHE patches achieve the highest macro-F1 but the improvement over whole page is modest relative to the inference cost increase.
CLAHE patches achieved the highest page-level macro-F1 (0.529), marginally outperforming whole page (0.512) by 0.017, or 1.7 points. However, this comes at the cost of 5-7× more compute per image at inference (patch extraction, multiple forward passes, and aggregation). Color patches without CLAHE showed no macro-F1 improvement over whole page (0.504 vs. 0.512), which suggests that the ~5× data multiplication from patching alone does not help; the marginal benefit comes from the contrast normalization.
Whole page is recommended for production deployment because it delivers comparable macro-F1 with dramatically simpler inference: one forward pass per image, no patch extraction, no aggregation logic.
Summary:
The experimental results across eight architectures and three preprocessing variants reveal a persistent performance plateau. Regardless of model size (from 5.3M to 88M parameters) or input strategy (whole-page vs. CLAHE patches), the system consistently encounters a “ceiling” at approximately 53% Macro-F1.
The Core Bottleneck: Systematic Confusion Clusters
Classification errors are not randomly distributed; they are concentrated in three specific visual “hotspots” that account for the vast majority of failures:
- The Petsuk–Peri–Tsegdrig Triangle: These represent the three largest classes in the dataset. The model fails to reliably distinguish between them at the page level, resulting in high “bleeding” between labels.
- The Dru-Family Cluster: Regional variants like drudring and druring confuse bidirectionally. Their fundamental letterforms are nearly identical, differing only in subtle stroke weight, a feature that DINOv3’s global embeddings struggle to isolate.
- The Tsuma/Tsug/Yig Group: Tsumachug, tsugchung, and yigchung suffer from cross-confusion and extremely low recall (down to 25%) due to similar character proportions.
Solution & Next Steps: Moving Toward Hierarchical Routing
To break the 53% ceiling, the pipeline is shifting from flat 18-way classification to a targeted hierarchical classification strategy.
1. Uchen-Ume Binary Router
The immediate next step is the implementation of a high-precision Uchen vs. Ume binary classifier. By separating “headed” scripts from “headless” scripts first, we capitalize on the model’s strongest discriminative features. Our preliminary tests show that this binary “Router” can achieve near-perfect accuracy, providing a clean entry point for the rest of the pipeline.
2. The 6-Family Hierarchical Classification
Following the binary split, images will be routed into six broad families (five if Tsugdri and Gyuyig are handled together) rather than 18 granular types. This reduces the “noise” caused by visually identical sub-types and concentrates model capacity on more distinct paleographic groups:
| Category | Primary Visual Marker | Target Scripts |
|---|---|---|
| Uchen | Horizontal “Head” stroke | Sugthung, Sugdring, Sugring |
| Druma | Oval/Round counters | Zlumris, Druthung, Druchen |
| Danyig | Flat-oval/Transitional shape | Tsegdrig, Drathung, DraRing |
| Pedri | Square/Rectangular counters | Peri, Petsuk |
| Tsugdri | Oblong counters + Looped gigu | Tsugthung, Tsugchung, Trinyig |
| Gyuyig | Upward-pointing gigu | Yigchung, Tsugma Khyug, Khyugyig |
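The planned routing can be sketched as a two-stage decision over the family table above. This is a design illustration only: the two classifiers are hypothetical callables standing in for models that do not exist yet, and treating every non-Uchen family as an Ume ("headless") family is an assumption of this sketch.

```python
# Family -> target scripts, copied from the table above.
FAMILIES = {
    "Uchen": ["Sugthung", "Sugdring", "Sugring"],
    "Druma": ["Zlumris", "Druthung", "Druchen"],
    "Danyig": ["Tsegdrig", "Drathung", "DraRing"],
    "Pedri": ["Peri", "Petsuk"],
    "Tsugdri": ["Tsugthung", "Tsugchung", "Trinyig"],
    "Gyuyig": ["Yigchung", "Tsugma Khyug", "Khyugyig"],
}

# Reverse lookup: script name -> family.
SCRIPT_TO_FAMILY = {
    script: family for family, scripts in FAMILIES.items() for script in scripts
}


def route(image, is_uchen, ume_family_classifier):
    """Stage 1: binary Uchen/Ume router. Stage 2: family among the Ume scripts.

    Both `is_uchen` and `ume_family_classifier` are hypothetical callables.
    """
    if is_uchen(image):
        return "Uchen"
    family = ume_family_classifier(image)
    if family not in FAMILIES or family == "Uchen":
        raise ValueError(f"unexpected Ume family: {family!r}")
    return family


# Stub classifiers, for demonstration only.
print(route("page.jpg", lambda img: False, lambda img: "Pedri"))
```

One consequence of this design is that the second-stage model never has to separate Uchen from the other families, so its capacity goes to the harder distinctions among them.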
