Tibetan Book Page Layout Detection — Experiment Report

Kaldan · May 15, 2026, 6:20am

Data Preparation

I collected all the training data that our annotation team labelled on the Ultralytics platform, merged it, and split it into training, validation, and test sets (80/10/10), keeping the class distribution the same across all splits.

Class	Train (80%)	Val (10%)	Test (10%)	Total
header	1,692 (80.0%)	211 (10.0%)	212 (10.0%)	2,115
text-area	2,219 (80.0%)	277 (10.0%)	278 (10.0%)	2,774
footnote	142 (79.8%)	18 (10.1%)	18 (10.1%)	178
footer	1,429 (80.0%)	178 (10.0%)	179 (10.0%)	1,786
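
The report does not say how the split was produced. One common way to keep class proportions aligned when a page carries several labels is to bucket each page by its rarest class and slice every bucket 80/10/10. A minimal Python sketch under that assumption (the function name and the input shape are hypothetical, not the script used here):

```python
import random
from collections import defaultdict


def stratified_split(image_classes, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split page ids into train/val/test while roughly preserving class shares.

    image_classes: dict mapping a page id to the set of class names on that page
    (hypothetical input shape, chosen for illustration).
    """
    # Count on how many pages each class appears.
    freq = defaultdict(int)
    for classes in image_classes.values():
        for name in classes:
            freq[name] += 1

    # Bucket every page by its rarest class so rare classes are split first-class.
    buckets = defaultdict(list)
    for image_id in sorted(image_classes):
        classes = image_classes[image_id]
        if classes:
            key = min(classes, key=lambda name: (freq[name], name))
        else:
            key = "__background__"
        buckets[key].append(image_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(buckets):
        ids = buckets[key]
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits
```

Bucketing by the rarest class matters most for footnote, which has only 178 instances; a plain random split could easily leave the validation or test set with almost none.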

YOLO Fine-tuning

I took YOLO26m as the base model and fine-tuned it on our custom data on the Ultralytics platform for 150 epochs. Running inference on our benchmark dataset gave the following results.

Overall Metrics

Metric	Value
mAP @ IoU 0.50	0.8339
mAP @ IoU 0.50:0.95 (COCO)	0.5243
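
For readers unfamiliar with the thresholds above: a prediction counts as a true positive at IoU 0.50 when its box overlaps a ground-truth box of the same class by at least half of their union. A minimal sketch of the overlap measure, assuming boxes are given as (x0, y0, x1, y1) corner coordinates (the evaluation script used for this report is not shown, so this is illustrative only):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

The 0.50:0.95 figure averages AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05, so it rewards tighter boxes than the 0.50 figure does.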

Per-Class Metrics (IoU = 0.50)

Class	AP	Precision	Recall	TP	FP	FN	GT	Predictions
header	0.7280	0.8740	0.7963	215	31	55	270	246
text-area	0.9584	0.9745	0.9808	306	8	6	312	314
footnote	0.8589	0.9487	0.9024	74	4	8	82	78
footer	0.7905	0.8971	0.8512	183	21	32	215	204
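
The precision and recall columns follow directly from the TP, FP, and FN counts, and the overall mAP at IoU 0.50 is the mean of the four per-class AP values. A short Python check using the numbers from the table above (the variable names are mine):

```python
# (TP, FP, FN) per class, copied from the YOLO26m per-class table.
COUNTS = {
    "header": (215, 31, 55),
    "text-area": (306, 8, 6),
    "footnote": (74, 4, 8),
    "footer": (183, 21, 32),
}

# Per-class AP at IoU 0.50, copied from the same table.
AP50 = {"header": 0.7280, "text-area": 0.9584, "footnote": 0.8589, "footer": 0.7905}


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# mAP@0.50 is the unweighted mean over classes (about 0.834 here).
mean_ap50 = sum(AP50.values()) / len(AP50)

for name, (tp, fp, fn) in COUNTS.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name}: precision={p:.4f} recall={r:.4f}")
print(f"mAP@0.50 = {mean_ap50:.4f}")
```

Because the mean is unweighted, header and footer pull the overall score down as much as text-area pulls it up, even though text-area has the most ground-truth boxes.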

Fine-tuning Surya Segformer

I attempted to fine-tune the current layout detection model used in the Surya OCR pipeline, but the fine-tuning code is not yet public and the checkpoints are also private. On top of that, the current layout detection model is a combination of a DonutSwin vision encoder and a Qwen2-based decoder. Given the time and resources we have, it was not worth attempting to fine-tune that combination. The main maintainer of Surya has mentioned that they will release a fine-tuning script for layout soon, so we may explore this option then.

However, Surya version 0.4.14 used Segformer for layout detection. We fine-tuned that model from the last available Surya OCR checkpoint with the following parameters:

Epochs: 20
Batch size: 4
Learning rate: 5e-5

We obtained the following results.

Overall Metrics

Metric	Value
mAP @ IoU 0.50	0.0288
mAP @ IoU 0.50:0.95 (COCO)	0.0121

Per-Class Metrics (IoU = 0.50)

Class	AP	Precision	Recall	TP	FP	FN	GT	Predictions
header	0.0000	0.0000	0.0000	0	101	270	270	101
text-area	0.1153	0.1212	0.9199	287	2,081	25	312	2,368
footnote	0.0000	0.0000	0.0000	0	0	82	82	0
footer	0.0000	0.0000	0.0000	0	0	215	215	0

Fine-tuning DocLayout-YOLO

After the Surya Segformer fine-tuning failed to produce a usable model, we ran another experiment: fine-tuning a YOLO model specialized for document layout. We chose DocLayout-YOLO because it was trained on the DocSynth dataset of 300k annotated samples. It already performed reasonably well on our benchmark dataset, achieving 0.38 mAP out of the box.

We started from the doclayout_yolo_docstructbench_imgsz1024.pt checkpoint. Because the DocLayout-YOLO classes do not match ours, we used a two-phase strategy to limit catastrophic forgetting: first train only the detection head on our data with the other layers frozen, then fine-tune the full network.

Phase 1 — Head warm-up

Epochs: 15
Batch size: 8
Optimizer: SGD
Learning rate: 0.01
Frozen layers: the 10 layers that are not part of the head

Phase 2 — Full fine-tuning

Epochs: 85
Batch size: 8
Optimizer: SGD
Learning rate: 0.002
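
The two phases above can be sketched as a pair of training configurations. This is a hedged sketch, not the script used for the report: it assumes the doclayout_yolo package exposes a YOLOv10 class with an Ultralytics-style train() method, that the frozen layers correspond to freeze=10 (which in Ultralytics freezes the first 10 layers), and that the dataset config lives in a file I have called tibetan_layout.yaml (hypothetical name).

```python
CHECKPOINT = "doclayout_yolo_docstructbench_imgsz1024.pt"
DATA_YAML = "tibetan_layout.yaml"  # hypothetical dataset config path

# Phase 1: head warm-up with 10 layers frozen (freeze=10 is an assumed mapping).
PHASE1 = dict(data=DATA_YAML, epochs=15, batch=8, optimizer="SGD", lr0=0.01, freeze=10)

# Phase 2: full fine-tuning at a lower learning rate, nothing frozen.
PHASE2 = dict(data=DATA_YAML, epochs=85, batch=8, optimizer="SGD", lr0=0.002)


def run_two_phase(checkpoint=CHECKPOINT):
    """Run both phases in sequence; requires the doclayout_yolo package and a GPU."""
    from pathlib import Path

    from doclayout_yolo import YOLOv10  # assumed import path

    model = YOLOv10(checkpoint)
    model.train(**PHASE1)

    # Reload the best phase-1 weights explicitly before unfreezing everything.
    best = Path(model.trainer.save_dir) / "weights" / "best.pt"
    model = YOLOv10(str(best))
    model.train(**PHASE2)
    return model
```

Dropping the learning rate from 0.01 to 0.002 in the second phase is what keeps the newly unfrozen pretrained layers from being overwritten too aggressively.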

We obtained the following results.

Overall Metrics

Metric	Value
mAP @ IoU 0.50	0.8255
mAP @ IoU 0.50:0.95 (COCO)	0.5182

Per-Class Metrics (IoU = 0.50)

Class	AP	Precision	Recall	TP	FP	FN	GT	Predictions
header	0.7833	0.8911	0.8481	229	28	41	270	257
text-area	0.9631	0.9808	0.9808	306	6	6	312	312
footnote	0.8263	0.9241	0.8902	73	6	9	82	79
footer	0.7293	0.8663	0.8140	175	27	40	215	202

Conclusion

I recommend using the fine-tuned YOLO26m and DocLayout-YOLO models for now, as they are more accurate, faster, and more cost-effective than the Surya model. I would also like to expand our training data using the following approaches:

Applying data augmentation to the current human-labelled data.
Creating multiple Word document templates and rendering them to page images along with their layout annotations.
Using libraries like PyMuPDF to extract layout from existing PDFs and exporting it in YOLO dataset format.
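
As a rough illustration of the PyMuPDF idea, the sketch below reads text blocks from one PDF page and writes them as YOLO label lines (class id, then box centre and size normalized by page dimensions). The class-id order and the margin-based rule for guessing header and footer are placeholders I made up for illustration; a real pipeline would need a better way to assign classes.

```python
# Hypothetical class order; it must match the dataset's own class list.
CLASS_IDS = {"header": 0, "text-area": 1, "footnote": 2, "footer": 3}


def to_yolo_line(class_id, bbox, page_width, page_height):
    """Convert an (x0, y0, x1, y1) box in page units to a YOLO label line."""
    x0, y0, x1, y1 = bbox
    x_center = (x0 + x1) / 2 / page_width
    y_center = (y0 + y1) / 2 / page_height
    width = (x1 - x0) / page_width
    height = (y1 - y0) / page_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


def guess_class(bbox, page_height, margin=0.08):
    """Placeholder heuristic: blocks inside the top or bottom margin are header or footer."""
    _, y0, _, y1 = bbox
    if y1 <= page_height * margin:
        return CLASS_IDS["header"]
    if y0 >= page_height * (1 - margin):
        return CLASS_IDS["footer"]
    return CLASS_IDS["text-area"]


def pdf_page_labels(pdf_path, page_index=0):
    """Return YOLO label lines for the text blocks on one page of a PDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        width, height = page.rect.width, page.rect.height
        lines = []
        # Each block is (x0, y0, x1, y1, text, block_no, block_type).
        for block in page.get_text("blocks"):
            if block[6] != 0:  # skip image blocks, keep text blocks
                continue
            bbox = block[:4]
            lines.append(to_yolo_line(guess_class(bbox, height), bbox, width, height))
        return lines
```

The matching page image can be rendered with page.get_pixmap() so that each label file has an image of the same page next to it.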

Credits

Developed by Dharmaduta based on specifications from the Buddhist Digital Resource Center for the project The BDRC Etext Corpus, funded by the Khyentse Foundation.
