Tibetan Book Page Layout Detection — Experiment Report

Data Preparation

I first collected all the training data that our annotation team had labelled on the Ultralytics platform and merged it into a single dataset. I then split it into training, validation, and test sets with a stratified 80/10/10 split, so that each class keeps the same proportion in every split. The per-class counts are shown below.

| Class | Train (80%) | Val (10%) | Test (10%) | Total |
| --- | --- | --- | --- | --- |
| header | 1,692 (80.0%) | 211 (10.0%) | 212 (10.0%) | 2,115 |
| text-area | 2,219 (80.0%) | 277 (10.0%) | 278 (10.0%) | 2,774 |
| footnote | 142 (79.8%) | 18 (10.1%) | 18 (10.1%) | 178 |
| footer | 1,429 (80.0%) | 178 (10.0%) | 179 (10.0%) | 1,786 |
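The report does not say how the stratified split was implemented, so the following is only a minimal sketch of one possible approach. It assumes each page is described by the list of class names annotated on it, groups pages by their rarest class, and slices each group 80/10/10. The function name stratified_split and the input format are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(image_labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split pages into train/val/test while roughly preserving class balance.

    image_labels: dict mapping a page id to the list of class names on that page
    (hypothetical input format, not taken from the report).
    """
    # Global class frequency, used to find the rarest class on each page.
    freq = Counter(c for labels in image_labels.values() for c in labels)

    # Group pages by their rarest class so rare classes are spread evenly.
    groups = defaultdict(list)
    for image_id, labels in image_labels.items():
        key = min(labels, key=lambda c: freq[c]) if labels else "__empty__"
        groups[key].append(image_id)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for key in sorted(groups):
        ids = sorted(groups[key])
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits
```

Grouping by rarest class is a simplification: pages carry several classes at once, so the per-class proportions should still be checked after splitting, as the table above does.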

YOLO Fine-tuning

I used YOLO26m as the base model and fine-tuned it on our custom data for 150 epochs on the Ultralytics platform. Inference on our benchmark dataset gave the results reported in the two tables below.
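Training was done on the Ultralytics platform, so no local script exists in this report. For reference, a roughly equivalent local run with the ultralytics Python package might look like the sketch below. The dataset root, the class index order, and the weight file name yolo26m.pt are assumptions, and every setting other than the 150 epochs is left at the package default.

```python
# Class order is an assumption; it must match the ids used in the label files.
CLASS_NAMES = ["header", "text-area", "footnote", "footer"]


def dataset_yaml(root, names=CLASS_NAMES):
    """Build the text of an Ultralytics-style dataset YAML file."""
    lines = [
        f"path: {root}",        # dataset root (hypothetical location)
        "train: images/train",  # sub-folders relative to the root
        "val: images/val",
        "test: images/test",
        "names:",
    ]
    lines += [f"  {i}: {name}" for i, name in enumerate(names)]
    return "\n".join(lines) + "\n"


def finetune(data_yaml_path, weights="yolo26m.pt", epochs=150):
    """Fine-tune a pretrained detector; the weight file name is an assumption."""
    from ultralytics import YOLO  # imported lazily so the helper above has no dependency

    model = YOLO(weights)
    return model.train(data=data_yaml_path, epochs=epochs)
```

Writing the output of dataset_yaml to a file and passing its path to finetune would start a run comparable in length to the one described above.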

Overall Metrics

| Metric | Value |
| --- | --- |
| mAP @ IoU 0.50 | 0.8339 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.5243 |

Per-Class Metrics (IoU = 0.50)

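| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| header | 0.7280 | 0.8740 | 0.7963 | 215 | 31 | 55 | 270 | 246 |
| text-area | 0.9584 | 0.9745 | 0.9808 | 306 | 8 | 6 | 312 | 314 |
| footnote | 0.8589 | 0.9487 | 0.9024 | 74 | 4 | 8 | 82 | 78 |
| footer | 0.7905 | 0.8971 | 0.8512 | 183 | 21 | 32 | 215 | 204 |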
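The columns in the per-class table are related by simple formulas: precision is TP / (TP + FP), recall is TP / (TP + FN), and mAP @ IoU 0.50 is the mean of the per-class AP values. The short sketch below recomputes them from the YOLO26m numbers above as a consistency check. The function names are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_ap(ap_by_class):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# (AP, TP, FP, FN) per class, copied from the YOLO26m table above.
YOLO26M = {
    "header": (0.7280, 215, 31, 55),
    "text-area": (0.9584, 306, 8, 6),
    "footnote": (0.8589, 74, 4, 8),
    "footer": (0.7905, 183, 21, 32),
}

for name, (ap, tp, fp, fn) in YOLO26M.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{name:10s} precision={p:.4f} recall={r:.4f}")

print(f"mAP@0.50 = {mean_ap({k: v[0] for k, v in YOLO26M.items()}):.4f}")
```

The recomputed precision and recall match the table, and the mean of the four AP values agrees with the reported 0.8339 up to rounding of the per-class figures.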

Fine-tuning Surya Segformer

I attempted to fine-tune the current layout detection model used in the Surya OCR pipeline, but the fine-tuning code is not yet public and the checkpoints are also private. On top of that, the current layout detection model is a combination of a DonutSwin vision encoder and a Qwen2-based decoder. Given the time and resources we have, it was not worth attempting to fine-tune that combination. The main maintainer of Surya has mentioned that they will release a fine-tuning script for layout soon, so we may explore this option then.

However, Surya version 0.4.14 used Segformer for layout detection. We fine-tuned that model from the last available Surya OCR checkpoint with the following parameters:

  • Epochs: 20

  • Batch size: 4

  • Learning rate: 5e-5

We obtained the following results.

Overall Metrics

| Metric | Value |
| --- | --- |
| mAP @ IoU 0.50 | 0.0288 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.0121 |

Per-Class Metrics (IoU = 0.50)

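| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| header | 0.0000 | 0.0000 | 0.0000 | 0 | 101 | 270 | 270 | 101 |
| text-area | 0.1153 | 0.1212 | 0.9199 | 287 | 2,081 | 25 | 312 | 2,368 |
| footnote | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 82 | 82 | 0 |
| footer | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 215 | 215 | 0 |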

Fine-tuning DocLayout-YOLO

Because the fine-tuned Surya Segformer performed poorly, we ran a further experiment with a YOLO model specialized for document layout. We chose DocLayout-YOLO because it was pre-trained on the DocSynth-300K dataset, which contains 300k annotated samples. Out of the box, it already performed reasonably well on our benchmark dataset, reaching 0.38 mAP.

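We used the doclayout_yolo_docstructbench_imgsz1024.pt checkpoint. Because the DocLayout-YOLO classes do not match ours, the detection head has to be retrained for our four classes. To limit catastrophic forgetting in the pre-trained layers, we used a two-phase strategy: first train only the head on our data while the other layers stay frozen, then fine-tune the full network at a lower learning rate.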

Phase 1 — Head warm-up

  • Epochs: 15

  • Batch size: 8

  • Optimizer: SGD

  • Learning rate: 0.01

  • Frozen layers: the 10 layers that are not part of the head

Phase 2 — Full fine-tuning

  • Epochs: 85

  • Batch size: 8

  • Optimizer: SGD

  • Learning rate: 0.002
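The two phases above can be sketched as two consecutive training calls. This is an illustration under assumptions, not the script that was actually run: it assumes the doclayout_yolo package and its YOLOv10 class, maps "learning rate" to the Ultralytics-style lr0 argument and the frozen layers to freeze=10, and uses hypothetical output paths for the phase 1 weights.

```python
# Hyperparameters taken from the two phase lists above.
PHASE1 = dict(epochs=15, batch=8, optimizer="SGD", lr0=0.01, freeze=10)  # head warm-up
PHASE2 = dict(epochs=85, batch=8, optimizer="SGD", lr0=0.002)            # full fine-tuning


def run_two_phase(
    data_yaml,
    base_weights="doclayout_yolo_docstructbench_imgsz1024.pt",
    phase1_best="runs/phase1/weights/best.pt",  # hypothetical path to phase 1 output
):
    """Warm up the head with frozen layers, then fine-tune the whole network."""
    from doclayout_yolo import YOLOv10  # imported lazily; assumed package layout

    # Phase 1: first 10 layers frozen, only the remaining layers are updated.
    model = YOLOv10(base_weights)
    model.train(data=data_yaml, project="runs", name="phase1", **PHASE1)

    # Phase 2: reload the best phase 1 weights and train everything at a lower rate.
    model = YOLOv10(phase1_best)
    model.train(data=data_yaml, project="runs", name="phase2", **PHASE2)
    return model
```

Phase 2 omits the freeze argument so that all layers are trainable, and its learning rate is five times lower than in phase 1 so that the pre-trained weights change slowly.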

We obtained the following results.

Overall Metrics

| Metric | Value |
| --- | --- |
| mAP @ IoU 0.50 | 0.8255 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.5182 |

Per-Class Metrics (IoU = 0.50)

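| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| header | 0.7833 | 0.8911 | 0.8481 | 229 | 28 | 41 | 270 | 257 |
| text-area | 0.9631 | 0.9808 | 0.9808 | 306 | 6 | 6 | 312 | 312 |
| footnote | 0.8263 | 0.9241 | 0.8902 | 73 | 6 | 9 | 82 | 79 |
| footer | 0.7293 | 0.8663 | 0.8140 | 175 | 27 | 40 | 215 | 202 |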
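Since the two YOLO-based models end up close in overall mAP, a per-class comparison is more informative. The sketch below only restates the AP columns of the YOLO26m and DocLayout-YOLO tables as differences; the function name ap_deltas is illustrative.

```python
# Per-class AP @ IoU 0.50, copied from the two result tables in this report.
YOLO26M_AP = {"header": 0.7280, "text-area": 0.9584, "footnote": 0.8589, "footer": 0.7905}
DOCLAYOUT_AP = {"header": 0.7833, "text-area": 0.9631, "footnote": 0.8263, "footer": 0.7293}


def ap_deltas(baseline, candidate):
    """Return candidate AP minus baseline AP for every class, rounded to 4 decimals."""
    return {name: round(candidate[name] - baseline[name], 4) for name in baseline}


for name, delta in ap_deltas(YOLO26M_AP, DOCLAYOUT_AP).items():
    better = "DocLayout-YOLO" if delta > 0 else "YOLO26m"
    print(f"{name:10s} {delta:+.4f}  ({better} higher)")
```

On these numbers DocLayout-YOLO is ahead on header and marginally on text-area, while YOLO26m is ahead on footnote and footer.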

Conclusion

For now, I recommend the fine-tuned YOLO26m and DocLayout-YOLO models, as they are more accurate, faster, and more cost-effective than the Surya model. I would also like to expand our training data using the following approaches:

  • Data augmentation on the current human-labelled data.

  • Creating multiple Word document templates and rendering them to page images along with their layout annotations.

  • Using libraries like PyMuPDF to extract layout from existing PDFs and exporting it in YOLO dataset format.
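As an illustration of the last approach, the sketch below converts the text blocks that PyMuPDF reports for a PDF page into YOLO label lines (class id followed by normalized centre x, centre y, width, and height). It is a starting point under assumptions: every text block is given the same class id (1, assumed here to mean text-area), and the page image is assumed to be rendered from the full, unrotated page. Telling headers, footnotes, and footers apart would need additional rules that are not covered here.

```python
def to_yolo_line(class_id, bbox, page_w, page_h):
    """Convert an (x0, y0, x1, y1) box in page units to a YOLO label line."""
    x0, y0, x1, y1 = bbox
    xc = (x0 + x1) / 2 / page_w  # normalized box centre
    yc = (y0 + y1) / 2 / page_h
    w = (x1 - x0) / page_w       # normalized box size
    h = (y1 - y0) / page_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


def pdf_page_to_yolo(pdf_path, page_index=0, class_id=1):
    """Return YOLO label lines for the text blocks of one PDF page."""
    import fitz  # PyMuPDF, imported lazily so to_yolo_line works without it

    lines = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        page_w, page_h = page.rect.width, page.rect.height
        # Each block is (x0, y0, x1, y1, text, block_no, block_type).
        for x0, y0, x1, y1, _text, _no, block_type in page.get_text("blocks"):
            if block_type == 0:  # 0 = text block, 1 = image block
                lines.append(to_yolo_line(class_id, (x0, y0, x1, y1), page_w, page_h))
    return lines
```

For example, a box spanning (100, 200) to (300, 600) on a 400 by 800 page becomes the line "1 0.500000 0.500000 0.500000 0.500000".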

Credits

Developed by Dharmaduta based on specifications from the Buddhist Digital Resource Center for the project The BDRC Etext Corpus, funded by the Khyentse Foundation.