Data Preparation
First, I collected all the training data annotated by our annotation team on the Ultralytics platform and combined it into a single dataset. I then made a stratified split into training, validation, and test sets, so that every split preserves the same class distribution.
| Class | Train (80%) | Val (10%) | Test (10%) | Total |
|---|---|---|---|---|
| header | 1,692 (80.0%) | 211 (10.0%) | 212 (10.0%) | 2,115 |
| text-area | 2,219 (80.0%) | 277 (10.0%) | 278 (10.0%) | 2,774 |
| footnote | 142 (79.8%) | 18 (10.1%) | 18 (10.1%) | 178 |
| footer | 1,429 (80.0%) | 178 (10.0%) | 179 (10.0%) | 1,786 |
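A stratified split of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the script actually used: the function name `stratified_split`, the seed, and the rounding rule (round the training share, then halve the remainder) are all assumptions. It also treats each annotation as an independent sample with a single class label, whereas real page images usually contain several classes at once.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split sample indices into train/val/test, one class at a time.

    Assumption: whatever is left after the training cut is divided
    evenly between val and test, with any odd sample going to test.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        n_val = (len(indices) - n_train) // 2
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]
    return splits


# Toy input using two of the class totals from the table above.
labels = ["header"] * 2115 + ["footnote"] * 178
splits = stratified_split(labels)
footnotes = {
    name: sum(labels[i] == "footnote" for i in idx)
    for name, idx in splits.items()
}
print(footnotes)
```

Splitting class by class keeps a rare class such as footnote at roughly the same share in every split, which a purely random split over all samples would not guarantee.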
YOLO Fine-tuning
I took YOLO26m as the base model and fine-tuned it with our custom data on the Ultralytics platform. It was trained for 150 epochs. After running inference on our benchmark dataset, we obtained the following results.
Overall Metrics
| Metric | Value |
|---|---|
| mAP @ IoU 0.50 | 0.8339 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.5243 |
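The two metrics differ only in how strictly a predicted box must overlap a ground-truth box to count as a match. The sketch below shows the standard IoU computation and the ten COCO thresholds from 0.50 to 0.95. The helper names and the example boxes are illustrative assumptions, not taken from our evaluation code.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


# COCO-style thresholds: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]


def thresholds_passed(iou):
    """Number of COCO thresholds at which this overlap counts as a match."""
    return sum(iou >= t for t in COCO_THRESHOLDS)


# Hypothetical example: the prediction covers the top 30 of 40 rows.
ground_truth = (0, 0, 100, 40)
prediction = (0, 0, 100, 30)
iou = box_iou(ground_truth, prediction)
print(iou, thresholds_passed(iou))
```

A prediction like this one is a match at IoU 0.50 but not at the stricter thresholds, which is why the 0.50:0.95 score is always at or below the 0.50 score.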
Per-Class Metrics (IoU = 0.50)
| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
|---|---|---|---|---|---|---|---|---|
| header | 0.7280 | 0.8740 | 0.7963 | 215 | 31 | 55 | 270 | 246 |
| text-area | 0.9584 | 0.9745 | 0.9808 | 306 | 8 | 6 | 312 | 314 |
| footnote | 0.8589 | 0.9487 | 0.9024 | 74 | 4 | 8 | 82 | 78 |
| footer | 0.7905 | 0.8971 | 0.8512 | 183 | 21 | 32 | 215 | 204 |
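For reference, the precision and recall columns follow directly from the TP, FP, and FN counts, and the overall mAP at IoU 0.50 is the mean of the per-class AP values. The sketch below recomputes them from the YOLO26m rows above. The variable and function names are illustrative, and the zero guards are an assumption about how classes with no predictions are scored.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# (TP, FP, FN) per class, copied from the table above.
counts = {
    "header": (215, 31, 55),
    "text-area": (306, 8, 6),
    "footnote": (74, 4, 8),
    "footer": (183, 21, 32),
}
# Per-class AP at IoU 0.50, copied from the table above.
ap_50 = {"header": 0.7280, "text-area": 0.9584, "footnote": 0.8589, "footer": 0.7905}

for name, (tp, fp, fn) in counts.items():
    precision, recall = precision_recall(tp, fp, fn)
    print(f"{name}: precision={precision:.4f} recall={recall:.4f}")

mean_ap = sum(ap_50.values()) / len(ap_50)
print(f"mAP@0.50 = {mean_ap:.4f}")
```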
Fine-tuning Surya Segformer
I attempted to fine-tune the current layout detection model used in the Surya OCR pipeline, but the fine-tuning code is not yet public and the checkpoints are also private. On top of that, the current layout detection model is a combination of a DonutSwin vision encoder and a Qwen2-based decoder. Given the time and resources we have, it was not worth attempting to fine-tune that combination. The main maintainer of Surya has mentioned that they will release a fine-tuning script for layout soon, so we may explore this option then.
However, Surya version 0.4.14 used Segformer for layout detection. We fine-tuned that model from the last available Surya OCR checkpoint with the following parameters:
- Epochs: 20
- Batch size: 4
- Learning rate: 5e-5
We obtained the following results.
Overall Metrics
| Metric | Value |
|---|---|
| mAP @ IoU 0.50 | 0.0288 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.0121 |
Per-Class Metrics (IoU = 0.50)
| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
|---|---|---|---|---|---|---|---|---|
| header | 0.0000 | 0.0000 | 0.0000 | 0 | 101 | 270 | 270 | 101 |
| text-area | 0.1153 | 0.1212 | 0.9199 | 287 | 2,081 | 25 | 312 | 2,368 |
| footnote | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 82 | 82 | 0 |
| footer | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 215 | 215 | 0 |
Fine-tuning DocLayout-YOLO
After failing to beat the Surya model, we decided to try another experiment by fine-tuning a document-layout-specialized YOLO model. We chose DocLayout-YOLO because it had been trained on the DocSynth dataset consisting of 300k annotated samples. It already performed reasonably well on our benchmark dataset, achieving 0.38 mAP out of the box.
We used the doclayout_yolo_docstructbench_imgsz1024.pt checkpoint. Since the DocLayout-YOLO classes do not match ours, the detection head has to be retrained for our classes. To avoid catastrophic forgetting in the pretrained layers, we used a two-phase strategy: first train only the head on our data while the remaining layers stay frozen, then fine-tune the full network.
Phase 1 — Head warm-up
- Epochs: 15
- Batch size: 8
- Optimizer: SGD
- Learning rate: 0.01
- Froze the 10 layers unrelated to the head
Phase 2 — Full fine-tuning
- Epochs: 85
- Batch size: 8
- Optimizer: SGD
- Learning rate: 0.002
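The two phases can be written down as a pair of training configurations. The sketch below is a hedged reconstruction, not the script we ran. It assumes the `doclayout_yolo` Python package with its Ultralytics-style `YOLOv10(...).train(...)` interface, and that `freeze=10` (freeze the first 10 layers) corresponds to the frozen layers listed above. The output directory, run names, and the `data_yaml` argument are hypothetical.

```python
from pathlib import Path

CHECKPOINT = "doclayout_yolo_docstructbench_imgsz1024.pt"
RUNS_DIR = "runs/layout"  # hypothetical output directory

# Phase 1: head warm-up with the first 10 layers frozen (assumed mapping).
PHASE_1 = dict(name="phase1_head_warmup", epochs=15, batch=8,
               optimizer="SGD", lr0=0.01, freeze=10)
# Phase 2: full fine-tuning at a lower learning rate, nothing frozen.
PHASE_2 = dict(name="phase2_full", epochs=85, batch=8,
               optimizer="SGD", lr0=0.002)


def run_two_phase(data_yaml):
    """Run both phases and return the path of the final weights.

    `data_yaml` is a hypothetical path to a YOLO dataset description.
    """
    # Imported lazily so the configuration can be inspected without the package.
    from doclayout_yolo import YOLOv10

    model = YOLOv10(CHECKPOINT)
    model.train(data=data_yaml, project=RUNS_DIR, exist_ok=True, **PHASE_1)

    # Assumes the trainer writes weights to <project>/<name>/weights/best.pt.
    warm_start = Path(RUNS_DIR) / PHASE_1["name"] / "weights" / "best.pt"
    model = YOLOv10(str(warm_start))
    model.train(data=data_yaml, project=RUNS_DIR, exist_ok=True, **PHASE_2)
    return Path(RUNS_DIR) / PHASE_2["name"] / "weights" / "best.pt"


total_epochs = PHASE_1["epochs"] + PHASE_2["epochs"]
print(total_epochs)
```

Starting phase 2 from the phase 1 weights means the full network is only unfrozen once the new head already produces sensible predictions, and the lower learning rate limits how far the pretrained layers drift.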
We obtained the following results.
Overall Metrics
| Metric | Value |
|---|---|
| mAP @ IoU 0.50 | 0.8255 |
| mAP @ IoU 0.50:0.95 (COCO) | 0.5182 |
Per-Class Metrics (IoU = 0.50)
| Class | AP | Precision | Recall | TP | FP | FN | GT | Predictions |
|---|---|---|---|---|---|---|---|---|
| header | 0.7833 | 0.8911 | 0.8481 | 229 | 28 | 41 | 270 | 257 |
| text-area | 0.9631 | 0.9808 | 0.9808 | 306 | 6 | 6 | 312 | 312 |
| footnote | 0.8263 | 0.9241 | 0.8902 | 73 | 6 | 9 | 82 | 79 |
| footer | 0.7293 | 0.8663 | 0.8140 | 175 | 27 | 40 | 215 | 202 |
Conclusion
I recommend using the fine-tuned YOLO26m and DocLayout-YOLO models for now, as they are more accurate, faster, and more cost-effective than the Surya model. I would also like to expand our training data using the following approaches:
- Data augmentation on the current human-labelled data.
- Creating multiple Word document templates and rendering them to page images along with their layout annotations.
- Using libraries like PyMuPDF to extract layout from existing PDFs and exporting it in YOLO dataset format.
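As a rough idea of what the PyMuPDF approach could look like, the sketch below converts text-block bounding boxes into YOLO label lines (class id followed by normalized centre x, centre y, width, and height). The function names are hypothetical, and so is the class id: it assumes text-area is index 1 in the dataset's class list. PyMuPDF's `page.get_text("blocks")` does not distinguish headers, footers, or footnotes, so assigning those classes would need extra heuristics that are not shown here.

```python
def bbox_to_yolo_line(class_id, bbox, page_width, page_height):
    """Format one (x0, y0, x1, y1) box as a normalized YOLO label line."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 / page_width
    cy = (y0 + y1) / 2 / page_height
    w = (x1 - x0) / page_width
    h = (y1 - y0) / page_height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def page_text_blocks_to_yolo(pdf_path, page_number=0, class_id=1):
    """Return YOLO label lines for every text block on one PDF page.

    Assumption: every text block is labelled with the same class_id.
    """
    import fitz  # PyMuPDF, imported lazily

    lines = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        width, height = page.rect.width, page.rect.height
        for x0, y0, x1, y1, _text, _block_no, block_type in page.get_text("blocks"):
            if block_type == 0:  # 0 = text block, 1 = image block
                lines.append(bbox_to_yolo_line(class_id, (x0, y0, x1, y1), width, height))
    return lines


# Hypothetical box on a 100 x 200 point page.
example = bbox_to_yolo_line(1, (0, 0, 50, 100), 100, 200)
print(example)
```

Because the coordinates are normalized by the page size, the same label lines remain valid for page images rendered at any resolution, as long as the page is not cropped or rotated.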
Credits
Developed by Dharmaduta based on specifications from the Buddhist Digital Resource Center for the project The BDRC Etext Corpus, funded by the Khyentse Foundation.