Document Layout Analysis
Tracking the evolution of document layout detection: models, datasets, benchmarks, and metrics.
Table of Contents
- Overview
- Layout Analysis Resources
- Related Pages
- VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups
- Advanced Layout Analysis Models for Docling
Disclaimer: This page tracks methods for detecting and classifying regions on a document page. For parsing the internal structure of tables, see the TSR Page. For reading order prediction, see the Reading Order Page. For text recognition and end-to-end OCR pipelines, see the OCR Page.
Overview
Document Layout Analysis (DLA) is the computer vision task of identifying regions of interest in a document image and classifying them into semantic categories (e.g., text, title, table, figure). It is often the first step in a RAG or document understanding pipeline, serving as the “chunking” mechanism before text extraction.
Deep Dives & Guides
We maintain detailed guides on the theory and practice of layout analysis:
- The Matter vs. Meaning Standard: Our approach to taxonomy for resolving the “Ontology Gap” between visual perception and logical structure.
- Annotation Guide & Label Studio Config: A two-pass workflow and conflict resolution rules for high-quality data labeling.
Layout Analysis Resources
Layout: Paradigms
The field is currently split between three dominant modeling paradigms:
- Vision-Only Object Detection (OD): Treats the page purely as an image (like a photograph).
- Examples: Faster R-CNN, YOLO, DETR.
- Pros: Fast, works even if OCR fails, great for coarse regions (Tables, Figures).
- Cons: Blind to semantic nuances (e.g., distinguishing “bold text” from “section header” based on content).
- Text + Layout (No Full-Page Image): Operates on parsed text positions from OCR or PDF extraction, without a full-page visual backbone.
- Examples: LiLT, GLAM.
- Pros: Lightweight, fast; LiLT allows swapping text encoders across languages.
- Cons: Loses visual cues (color, font rendering); GLAM restricted to born-digital PDFs.
- Multimodal / Sequence Labeling: Combines visual features with text embeddings (from OCR).
- Examples: LayoutLM, LayoutXLM, UDOP.
- Pros: High accuracy on semantically rich documents (invoices, forms).
- Cons: Slower (requires OCR first), complex pipeline.
Layout: Models
Image-Only
Models that operate on the document image alone, with no OCR or text input required. These are typically object detection architectures adapted for documents.
| Model Family | Artifacts | Code | License | Notes |
|---|---|---|---|---|
| Docling Layout (2025) | heron (43M) • heron-101 (77M) • egret-m (20M) • egret-l (31M) • egret-x (63M) | Docling | Apache-2.0 (weights); MIT (code) | Notes. RT-DETRv2 (heron) + D-FINE (egret). 17-class taxonomy (DocLayNet 11 + 6 delta: Code, Checkbox (Selected/Unselected), Form, Key-Value, Doc Index). Trained on 150K pages. heron-101: 78% mAP, 28 ms/img (A100). |
| FFDNet / FFDetr (2025) | S (9M) • L (25M) • FFDetr (RF-DETR; not in note) | GitHub | Apache-2.0 | Notes. YOLO11 (FFDNet) + RF-DETR (FFDetr). Form field detection only: Text Input, Choice Button, Signature. 1216px high-res input. |
| PP-DocLayout (2025) | L (31M) • Plus-L (31M) • M (6M) • S (1M) | PaddleX | Apache-2.0 | Notes. RT-DETR & PicoDet. |
| DocLayout-YOLO (2024) | Pre-train • DocLayNet • D4LA • DocStructBench | GitHub | AGPL-3.0 (code); Apache-2.0 (weights) | Notes. YOLOv10-M with GL-CRM module. DocSynth-300K synthetic pre-training. |
| DLAFormer (2024) | None released | None | N/A | Notes. DETR-based unified model for detection + logical role classification + reading order via unified label space. Requires OCR text-line bounding boxes as geometric input (no text embeddings). ICDAR 2024. |
| GraphKD (2024) | None released | GitHub | MIT | Notes. Graph-based KD framework for DOD. Distills Faster R-CNN teachers (ResNet50/101/152) into compact students (ResNet18, EfficientNet-B0, MobileNetV2). Cosine + Mahalanobis distillation loss with adaptive text-node sampling. |
| Hybrid DLA (2024) | None released | None | N/A | Notes. ICDAR 2024. DINO + ResNet-50 with RoI-aligned query encoding and hybrid one-to-many/one-to-one matching. PubLayNet 97.3, DocLayNet 81.6, PubTables 98.6 mAP. No code or weights released. |
| TransDLANet (2023) | Transformer | GitHub | CC-BY-NC-ND-4.0 | Introduced in the M6Doc paper; see M6Doc Notes. ISTR-derived instance segmentation with ResNet-101 backbone. No text input. |
| SwinDocSegmenter (2023) | Google Drive (Model Zoo) | GitHub | Apache-2.0 | Notes. ICDAR 2023. SwinL + DETR-style encoder-decoder with contrastive denoising and hybrid bipartite matching. Instance segmentation (not just detection). 223M params. PubLayNet 93.7, HJ 84.6, TableBank 98.0, DocLayNet 76.9 mAP. First transformer instance seg. baseline on DocLayNet. |
| DiT (2022) | Base (87M) • Large (304M) | GitHub | MIT | Notes. BEiT-style MIM pre-training on 42M IIT-CDIP doc images with domain-specific dVAE tokenizer. |
| DocSegTr (2022) | PRImA • HJ • TableBank | GitHub | GPL-3.0 | Notes. Hybrid CNN-transformer with twin attention for bottom-up instance segmentation (no bounding box dependency). ResNeXt-101-FPN + DCN backbone with dynamic mask generation. PubLayNet 89.4, PRImA 40.3, HJ 83.4, TableBank 93.3 mAP. |
| YALTAi (2022) | None released | GitHub | GPL-3.0 | Notes. JDMDH 2023. Replaces Kraken pixel segmentation with YOLOv5 object detection (bounding boxes). 7$\times$ mAP improvement on small historical document datasets (~1K images). Releases MSS-EPB and Tabular datasets. |
| LayoutParser (2021) | Model Zoo (Detectron2) | GitHub | Apache-2.0 | Notes. Unified DIA toolkit wrapping Faster/Mask R-CNN. Pre-trained on PubLayNet, PRImA, HJDataset, Newspaper Navigator, TableBank. Community model hub. |
| CDDOD (2020) | None released | GitHub | MIT | Notes. CVPR 2020. Cross-domain DOD via FPN + three adversarial alignment modules (Feature Pyramid, Region, Rendering Layer). ResNet-101 backbone. Introduces PubMed/Chn/Legal benchmark suite. Partial code/data released. |
Text + Layout
Models that consume text tokens and/or their 2D positions but do not use a full-page image backbone. These require OCR or a PDF parser but process the document as structured data rather than as a raster image.
| Model Family | Artifacts | Code | License | Notes |
|---|---|---|---|---|
| GLAM (2023) | None released | None | N/A | Notes. ICDAR 2023. Graph-based DLA on PDF-parsed text boxes + ResNet-18 ROI-pooled visual features (per text box, not full-page). GNN (TAG conv) with joint node/edge classification. 4M params, 98 pages/sec on T4 GPU. Outperforms 140M+ vision models on 5/11 DocLayNet text-based classes. Ensemble with YOLO v5x6: 80.8 mAP. Born-digital PDFs only. |
| LiLT (2022) | Base (131M) | GitHub | MIT | Notes. ACL 2022. Language-independent layout transformer: disentangled text and layout streams allow swapping in any pre-trained text encoder without retraining the layout branch. No visual backbone. |
Image + Text + Layout
Models that take OCR-derived text tokens and their 2D positions as input alongside the document image. These require an OCR step before inference.
| Model Family | Artifacts | Code | License | Notes |
|---|---|---|---|---|
| M2Doc (2024) | Weights (BaiduNetDisk/GDrive) | GitHub | Proprietary | Notes. AAAI 2024. Pluggable early-fusion (pixel-level gate) + late-fusion (block-level IoU matching) modules for any detector. BERTgrid input, single shared backbone. DocLayNet 89.0 mAP, M6Doc 69.9 mAP. |
| DocFormerv2 (2023) | None released | None | N/A | Notes. AAAI 2024. Encoder-decoder multimodal transformer with asymmetric pre-training (Token-to-Line, Token-to-Grid). Simplified linear visual branch. 66M-750M params. Pre-trained on 64M IDL pages. |
| VGT (2023) | None released | GitHub | Apache-2.0 | Notes. ICCV 2023. Two-stream ViT + Grid Transformer (GiT) with dedicated text-grid pre-training (MGLM + SLM on 4M IIT-CDIP pages). 243M params. Requires OCR bounding boxes for GiT input. PubLayNet 96.2, DocBank 84.1, D4LA 68.8 mAP. Also introduces the D4LA dataset (see Datasets). |
| UDOP (2022) | Unified (~794M) | GitHub | MIT | Notes. CVPR 2023. Generative T5-based foundation model unifying vision, text, and layout. Layout-induced representation + unified seq2seq for all doc tasks. 1st on DUE-Benchmark. |
| LayoutLMv3 (2022) | Base (133M) • Large (368M) | GitHub | CC-BY-NC-SA-4.0 | Notes. CNN-free multimodal pre-training with unified text/image masking. |
| VSR (2021) | PubLayNet • DocBank | GitHub | Apache-2.0 | Notes. ICDAR 2021. Two-stream (ResNeXt-101 + CharGrid/SentGrid) with adaptive fusion and GNN relation module. Supports both detection and sequence labeling. PubLayNet 95.7 AP (1st on ICDAR 2021 leaderboard), DocBank 95.6 F1. Weights on Hikvision file sharing; depends on mmdetection 2.11.0. |
| VTLayout (2021) | None released | None | N/A | Notes. PRICAI 2021. Two-stage: Cascade Mask R-CNN localization + re-classification via fused deep visual (MobileNetV2 + SE), shallow visual (pixel histogram), and text (TF-IDF over PaddleOCR) features. PubLayNet F1 0.9599. No code/weights. |
| LayoutXLM (2021) | Base (369M) • Large (625M) | GitHub | CC-BY-NC-SA-4.0 | Notes. ACL 2022 Findings. Cross-lingual variant of LayoutLMv2. 53 languages via multilingual pre-training on 30M documents. Introduces XFUND benchmark. |
| LayoutLMv2 (2020) | Base (200M) • Large (426M) | GitHub | CC-BY-NC-SA-4.0 | Notes. ACL 2021. Introduced visual backbone (ResNeXt-101 FPN) integration into LayoutLM pre-training. Spatial-aware self-attention + TIA/TIM pre-training objectives. |
| LayoutLM (2019) | Base (113M) • Large (343M) | GitHub | MIT | Notes. First joint text+layout pre-training. BERT + 2D position embeddings + optional Faster R-CNN image features. |
Layout: Datasets
Large-scale annotated collections intended primarily for model training. Datasets are grouped by the most permissive use permitted: commercial, research / non-commercial, and not available / restricted.
Commercial Use
Training a private, for-profit model is permitted with minimal obligations.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| CommonForms (2025) | 480k | Diverse (14 domains, 10+ langs) | Auto (existing fillable PDFs) | 3 (form fields) | Yes | Apache-2.0 | Notes. Form field detection only: Text Input, Choice Button, Signature. Mined from Common Crawl. ~59K documents. One-third non-English. |
| DocLayNet (2022) (patents + SEC subsets only) | subset of 80k | Diverse | Human | 11 | Yes | CDLA-Perm-1.0 (annotations) | Notes. Only the Patents and SEC filings subsets carry permissive underlying document licenses. See the Research section below for full-dataset use. |
| SignverOD (2022) | 2.6k | Diverse (Business) | Human | 4 | Unknown | CC0-1.0 | Notes. Form-element detection only: Signature, Initials, Redaction, Date. Sources: Tobacco800, NIST SD-2, bank cheques, GSA leases. Kaggle. No formal publication. |
| Newspaper Navigator (2020) | 16M+ (3.5k GT) | Hist. Newspapers | Auto (3.5k Human GT) | 7 (visual only) | Yes (GT only) | Public Domain | Notes. GT from Beyond Words crowdsourcing with 80/20 train/val split. The original hosted search application is no longer available, but the underlying data remains accessible. Visual content retrieval focus; text columns are unlabeled background. |
Research / Non-Commercial
Training a free, open-weight non-commercial model is permitted.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| AnnoPage (2025) | 7.6k | Historical Czech/German (1485-present) | Human | 25 | Yes | CC-BY-4.0 (annotations + images) | Notes. Non-textual elements only (maps, decorations, charts, photographs, etc.). Expert librarian annotations following Czech Methodology. Zenodo. GREC Workshop at ICDAR 2025. Attribution required. |
| TextBite (2025) | 8.4k | Historical Czech (18th-20th c.) | Human | 3 | Yes | CC-BY-4.0 (dataset on Zenodo); MIT (code); underlying images from Czech libraries (personal/research use) | Notes. Logical page segmentation as pixel clustering. 78,863 annotated segments across newspapers, dictionaries, handwritten records. Printed + handwritten. Novel Rand index evaluation over foreground text pixels. ICDAR 2025. |
| IndicDLP (2025) | 120k | Diverse (12 domains, 12 langs) | Human | 42 | Yes | MIT (annotations/code); underlying doc licenses vary by source | Notes. 11 Indic languages + English. Hierarchical labels (section titles, list levels). Sources include gov. docs, ebooks, newspapers, arXiv. ICDAR 2025 Best Student Paper Runner-Up. |
| LADaS 2.0 (2024) | 7.3k | Diachronic French (1600-2024) | Human | 36 (13 types) | Yes | CC-BY-4.0 (annotations/code); underlying doc licenses vary by subset | Notes. SegmOnto/TEI-aligned taxonomy. 12 modular subsets (monographs, theses, catalogues, theatre, magazines, etc.). Mostly French with some multilingual content. Includes noisy “Fingers” subset for on-site digitization. |
| DocHieNet (2024) | 15.6k | Diverse (legal, financial, edu, sci) | Human | 19 + hierarchy + sequential | Yes | Apache-2.0 (code/data); research-only clause in repo README | Notes. 1,673 bilingual (EN + ZH) multi-page documents. 187K+ layout elements. Manual annotation (12 annotators, 3 QC rounds). 37.4% cross-page relations. EMNLP 2024. ModelScope. GitHub. |
| CATMuS Medieval Seg. (2024) | 1,680 | Medieval MSS (8th-16th c., 10 langs) | Semi-auto (BLLA + manual correction; 45+ contributors) | 19 (13 zone + 6 line) | Yes | CC-BY-4.0 | Notes. SegmOnto-based medieval layout dataset. 159 manuscripts. Hierarchical block/line annotations with polygons and baselines. Companion to CATMuS Medieval HTR dataset. ICDAR 2024. HuggingFace. |
| DocGenome (2024) | 6.8M | Scientific | Auto | 13 + 6 Rel | No | CC BY-4.0 (annotations); underlying arXiv PDFs carry per-paper licenses | Notes. Logical graph from LaTeX source; 6 relation types (hierarchical + reference). |
| DocSynth-300K (2024) | 300k | Synthetic | Auto | 74 | No | Unknown | Notes. Synthetic pre-training only; bin-packing layout generation from M6Doc crops. Unknown license – treat as research use at most. |
| SciPostLayout (2024) | 7,855 | Scientific Posters | Human | 9 | Yes | CC-BY | Notes. Conference-style scientific posters (not paper documents); sourced from F1000Research. Very different domain from PubLayNet-style datasets. |
| RanLayNet (2024) | Unknown | Synthetic (PubLayNet source) | Auto | 5 | Unknown | Dataset/code license unknown | Notes. Synthetic pages composited from PubLayNet crops for domain generalization. Standard YOLOv8 training. ACM Multimedia Asia 2023. GitHub. |
| OIN (2024) | 1,920 | Persian Newspapers | Human | 2 (text/non-text) | Yes | No license specified | Notes. Non-Latin (Persian) DLA + TLD. 32,642 lines, 2.35M CCs. Scanned at 300 dpi. Includes curved lines, skew, diacritics. Hosted on Google Drive. No license; treat as research use at most. |
| U-DIADS-Bib (2024) | 200 | Ancient manuscripts (6th-12th c.) | Human (pixel-precise) | 6 | Yes | Unknown (dataset); CC-BY-4.0 (paper) | Notes. Latin and Syriac Bibles. Pixel-precise, non-overlapping segmentation. Used in ICDAR 2024 few-shot layout competition (SAM). Published in Neural Computing and Applications. |
| American Stories (2023) | 20M (2.2k GT) | Hist. US Newspapers | Auto (2.2k human-labeled) | 9 | Yes | CC-BY-4.0 | Notes. NeurIPS 2023 Datasets Track. 1.14B content regions, 65.6B tokens from Library of Congress Chronicling America scans. YOLOv8 + EfficientOCR pipeline. Public domain source material. HuggingFace. |
| WordScape (2023) | 40M+ (via pipeline) | Diverse (136 langs) | Auto | 30 | No | Apache-2.0 (pipeline/code); underlying doc licenses vary | Notes. Open XML parsing of Common Crawl Word files. Released as a pipeline + 9.5M URLs, not a fixed dataset. NeurIPS 2023 Datasets Track. |
| D4LA (2023) | 11k | Reports | Human | 27 | Partial (train/val only) | Unknown | Notes. Introduced by VGT. 8,868 train / 2,224 val. Functional roles (LetterHead, Stamp, RegionKV). Unknown license; treat as research use at most. |
| M6Doc (2023) | 9k | Diverse | Human | 74 | Yes | CC BY-NC-ND-4.0 | Notes. Fine-grained modern taxonomy (DropCap, Kicker). |
| HRDoc (2023) | 2,500 docs | Scientific | Human | 14 + 3 rel | Yes | Mixed (CC-BY, CC-BY-NC-SA) | Notes. Hierarchical structure; line-level annotations with cross-page parent-child relations. Two splits: Simple (ACL) and Hard (arXiv multi-domain). AAAI 2023. NC-SA component blocks commercial use. |
| BaDLAD (2023) | 33k | Bengali (6 domains) | Human (polygon) | 4 | Yes | CC BY-SA 4.0 | Notes. Non-Latin-script layout dataset; polygon annotations across books, newspapers, gov. docs, property deeds. Share-alike obligation blocks commercial use. |
| ETD-ODv2 (2023) | 62k | ETDs (Scanned + Digital) | Human (AI-aided) | 24 | Yes | No license specified | Notes. Extends ETD-OD with 16.8K scanned pages and 20.2K AI-aided pages targeting minority classes. 300K bounding-box annotations across theses/dissertations. AI-aided annotation framework reduces labeling time 2-3$\times$. GitHub. WWW ‘23 Companion. No repo license; treat as research use at most. |
| ETD-OD (2022) | 25k | ETDs (Digital) | Human | 24 | Partial (train/val only) | No license specified | Notes. First layout dataset targeting long-form scholarly documents (theses/dissertations). 100K bounding-box annotations across 5 element groups (metadata, abstract, LOC, main content, bibliography). 80/20 train/val split; no held-out test set. Includes PDF-to-XML parsing pipeline. YOLOv7 85.3 mAP@0.50. GitHub. WIESP 2022. No repo license; treat as research use at most. |
| TexBiG (2022) | 7 books (52k+ annotation instances) | Historical (19th-20th c.) | Human (polygon) | 19 | Yes | CC-BY-4.0 (annotations + images) | Notes. Tschirschwitz et al., DAGM GCPR 2022 (no OA). Zenodo. GitHub (code, no license). Expert polygon annotations with inter-annotator agreement (Krippendorff’s Alpha). Includes OCR. Attribution required. |
| YALTAI-MSS-EPB (2022) | 1.1k | Historical MSS/EPB (9th-17th c.) | Human | 12 (Segmonto) | Yes | CC-BY-4.0 | Notes. Manuscripts and early printed books. Bounding-box annotations following Segmonto ontology. 5 source datasets + 593 original pages. Zenodo. |
| YALTAi-Tables (2022) | ~250 | Historical Tabular (16th-20th c.) | Human | 4 | Yes | CC-BY-4.0 | Notes. Tabular documents from Lectaurep + out-of-domain test set. Bounding-box annotations. Zenodo. |
| DAD (2022) | 5.9k | Scientific (Articles) | Human | 43 | Unknown | MIT (annotations); source PDFs unverified | Notes. 43 classes across front matter, body, and back matter. Sources from Elsevier, Springer, SAGE, Wiley, IEEE; per-article licenses not enumerated. Train/val/test split not specified in publicly available materials. |
| DocLayNet (2022) (full dataset) | 80k | Diverse | Human | 11 | Yes | CDLA-Perm-1.0 (annotations); underlying doc licenses vary by domain | Notes. Non-permissive subsets present; full dataset is research use only. See the Commercial section above for the permitted subsets. |
| SciBank (2021) | 74k | Scientific | Auto + Human | 12 | Unknown | CC-BY-4.0 (annotations); underlying paper licenses unverified | Notes. Includes inline equations as a distinct class. Source of 11,007 underlying papers not documented; treat as research use until clarified. IEEE DataPort (free account required). |
| CDLA (2021) | 6k | Chinese (Academic) | Human (polygon) | 10 | Yes | No license specified | No publication. GitHub. Labelme-format polygon annotations. Classes: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation. No license file or terms; treat as research use at most. |
| IIIT-AR-13K (2020) | 13k | Business (Annual Reports) | Human | 5 | Yes | Research | Notes. Separates Figure (charts/diagrams) from Natural Image (photos); adds Logo and Signature classes. English, French, Japanese, Russian. 9,333 train / 1,955 val / 2,120 test (per-company 70/15/15). |
| DocBank (2020) | 500k | Scientific | Weak | 12 | Yes | Apache-2.0 (annotations/code); underlying arXiv PDFs carry per-paper licenses | Notes. Token-level labels. 400K train / 50K val / 50K test. Underlying PDF licenses vary per paper. |
| PubLayNet (2019) | 360k | Scientific | Weak | 5 | Yes | CDLA-Perm-1.0 (annotations); PDFs non-commercial only | Notes. Coarse (Text, Table, Figure). PDFs from PubMed Central are non-commercial only. |
| SPaSe (2019) | 2k | Presentation Slides | Human (pixel) | 25 | Yes | CC-BY-4.0 (annotations); underlying slides from SlideShare-1M (mixed licenses) | Haurilet et al., WACV 2019. Multi-label pixel segmentation (overlapping regions). Location-sensitive classes (title, footnote, etc.). Only slide-layout segmentation dataset. IEEE. No code released. Project page offline as of March 2026. |
| ENP (2015) | 528 | Historical Newspapers (13 langs) | Human (polygon) | Hierarchical (PAGE-XML) | Unknown | Unknown (free for researchers) | Notes. Europeana Newspapers Project. 12 European national libraries. PAGE-XML format with reading order. PRImA. ICDAR 2015. |
| GROTOAP2 (2014) | 119k (13.2k docs) | Scientific | Auto (CERMINE + heuristic correction) | 22 | No | CC-BY (annotations); underlying PDFs from PMC OA (per-article licenses vary) | Tkaczyk et al., D-Lib Magazine 2014. Token/zone-level labels across 208 publishers, 1,170 journals. 22 classes incl. body_content, abstract, references, equation, figure, table. TrueViz XML format. 93% accuracy after correction rules. Download. Also used in S2-VLUE benchmark (see VILA). |
Not Available / Restricted
Described in a publication but not publicly downloadable. Included here because the papers provide useful methodological details and the data may become available in the future.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| DocLayNet-v2 (2025) | 7.6k (test) | Diverse | Human | 17 (11+6 new) | Yes | Proprietary | Notes. IBM/Docling extension of DocLayNet adding Code, Checkbox-Selected, Checkbox-Unselected, Form, Key-Value Region, Document Index. Used to train and evaluate Docling’s heron/egret model family. Not publicly available. |
| GraphDoc (2025) | 80k | Diverse | Rule-based + Human | 11 + 8 Rel | Yes | MIT (code); CDLA-Perm-1.0 (annotations); underlying doc licenses vary | Notes. Relation graph overlay on DocLayNet. 4.13M relation annotations (4 spatial + 4 logical types). ICLR 2025. GitHub repo is a placeholder as of March 2026. |
| PAL (2023) | 441k | Spanish Gov/Public Affairs | Semi-auto (PDF mining + RF) | 8 (4 layout + 4 text semantic) | Yes | Custom (signed license agreement required) | Notes. 37,910 Spanish government gazettes from 24 admin sources. 8M+ labels. Per-source Random Forest classifiers for text-block labeling. Born-digital PDFs only. ICDAR 2023 Workshop. GitHub. |
| HJDataset (2020) | 2,271 | Historical Japanese (Biographical) | Semi-rule-based + Human | 7 (hierarchical) | Yes | Apache-2.0 (annotations); images restricted (copyright, download request required) | Notes. 260K annotations with parent-child hierarchy and reading order. Single source publication (1953 Who’s Who). Vertical text, right-to-left reading. CVPR 2020 Workshop. |
Other Notables:
- PP-DocLayout (2025): Variable label space (20-25 classes) based on DocLayNet + Stamps/Seals.
Layout: Benchmarks
Curated evaluation sets used to compare model performance. Most are too small or too constrained for training use.
| Benchmark | Pages | Domain | Annotation | Classes | License | Notes |
|---|---|---|---|---|---|---|
| Comp-HRDoc (2024) | 1.5k docs | Scientific | Human | 12 + hierarchy + reading order | MIT (annotations/eval scripts); underlying images from HRDoc-Hard (mixed CC-BY, CC-BY-NC-SA) | Notes. Extension of HRDoc-Hard. First benchmark evaluating page object detection, reading order, TOC extraction, and hierarchical reconstruction simultaneously. Introduced with the DOC method. Pattern Recognition 2024. GitHub. |
| RoDLA (2024) | ~450k | Scientific + Diverse | Auto (perturbed) | 5 / 11 / 74 | Apache-2.0 (code/toolkit); data licenses inherited from source datasets | Notes. CVPR 2024. Robustness benchmark: PubLayNet-P, DocLayNet-P, M6Doc-P. 12 perturbation types $\times$ 3 severity levels. Introduces mPE and mRD metrics. GitHub. |
| OmniDocBench (2024) | 981 | Diverse | Human | 15 block + 4 span + 15 attr | Research-only (non-commercial) | Notes. End-to-end parsing eval (Markdown output); covers VLMs and pipeline tools. 9 document types. |
| DocStructBench (2024) | 9,955 (train+test) | Diverse | Human | 10 | N/A | Notes. In-house benchmark from DocLayout-YOLO; not publicly available. Covers Academic, Textbook, Market Analysis, Financial subsets. Results cannot be independently reproduced. |
| S2-VL (2021) | 1.3k | Scientific (19 disciplines) | Human | 15 | Apache-2.0 | Notes. Allen AI / VILA. Token-level annotations on 87 papers across 19 scientific disciplines. Annotated via PAWLS tool. Inter-annotator agreement 0.95. Part of the S2-VLUE benchmark suite (with GROTOAP2 and DocBank). 5-fold CV by paper. GitHub. TACL 2022. |
| PubMed (2020) | 13k | Scientific | Auto | 5 | Unknown | Notes. Li et al., CVPR 2020. Cross-domain transfer benchmark; PDF, English. Subset of PubLayNet with re-processed list annotations. |
| Chn (2020) | 8k | Synthesized Chinese | Auto | 5 | Unknown | Notes. Li et al., CVPR 2020. Cross-domain transfer benchmark; PDF, Chinese. Synthesized from Chinese Wikipedia with randomized layout/style. |
| ICDAR 2017 POD | 2k | Scientific (CiteSeer) | Human | 4 (Formula, Table, Figure, All) | Unknown | Gao et al., ICDAR 2017. Competition site. IEEE (no OA). 2,000 pages from 1,500 CiteSeer papers. Train/test zips available from competition site. |
| DSSE200 (2017) | 200 | Mags/Academic | Human | 6 | Unknown | Notes. Yang et al., CVPR 2017. 160 train / 40 test. Raster images (not PDF). Dataset URL appears dead. |
| DIVA-HisDB (2016) | 150 | Medieval Manuscripts | Human (pixel) | 4 | Unknown (no license specified) | Simistira et al., ICFHR 2016. 3 manuscripts (CB55, CSG18, CSG863), 600 dpi. Classes: Main Text, Comment, Decoration, Background (multi-class pixels via bitwise encoding). 20/10/20 train/val/test per manuscript. Used in ICDAR 2017 Layout Analysis competition. Evaluator (LGPL-3.0). IEEE. |
| PRImA (2009) | 1.2k (305 public) | Mags + Tech | Human | 10 (2009); 13 (2019 schema) | Unknown (research-only access) | Notes. Isothetic polygons. Introduced PAGE-XML. Not safe to assume commercial use. |
Layout: Metrics
| Metric | Paradigm | Notes | Tools |
|---|---|---|---|
| mAP @ IoU | Object Detection | COCO-style mean Average Precision. See below for IoU threshold variants. Higher is better. | pycocotools |
| F1 Score | Sequence Labeling | Harmonic mean of precision and recall over predicted labels. Standard metric when layout is framed as token classification (LayoutLM family, VSR, DocBank). Macro-F1 (equal weight per class) and micro-F1 (equal weight per token) variants both appear. Higher is better. | seqeval, sklearn |
| Rand Index | Segmentation-as-Clustering | Measures agreement between predicted and ground-truth pixel clusters. Used when layout analysis is framed as grouping text into logical segments rather than detecting bounding boxes. Introduced to the DLA context by TextBite. Notes. A minimal sketch follows the table. | Custom |
| mRD | Robustness | Mean Robustness Degradation. Normalizes model performance drop by perturbation difficulty (mPE). Lower is better; 100 = expected degradation. Notes. | RoDLA |
| mPE | Robustness | Mean Perturbation Effect. Model-independent perturbation severity score combining IQA metrics (MS-SSIM, CW-SSIM) with baseline degradation. Notes. | RoDLA |
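The Rand index row above can be computed with scikit-learn once ground truth and prediction are flattened to one cluster id per foreground pixel. A minimal sketch with illustrative assignments (TextBite's official protocol restricts the computation to foreground text pixels):

```python
# Minimal sketch: Rand index over per-pixel segment assignments.
# Each entry is the segment id of one foreground text pixel.
from sklearn.metrics import rand_score

gt_segments   = [0, 0, 0, 1, 1, 2, 2, 2]  # ground-truth clustering
pred_segments = [0, 0, 1, 1, 1, 2, 2, 2]  # predicted clustering
print(f"Rand index = {rand_score(gt_segments, pred_segments):.3f}")
```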
mAP @ IoU: Threshold Variants
Most layout detection papers report some variant of COCO-style Average Precision, but the IoU threshold used changes what the number actually measures. When comparing results across papers, always check which variant is reported.
- $\text{AP}@[.50:.95]$ (also written AP or mAP without qualifier): The COCO primary metric. Averages AP across 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This is the strictest standard and the most common in recent work (DocLayNet, M6Doc, DocLayout-YOLO). A model scoring 80+ on this metric is localizing regions precisely, not just roughly finding them.
- $\text{AP}@50$ (also written AP50 or $\text{mAP}@0.50$): Counts a detection as correct if its IoU with a ground-truth box exceeds 0.50. Lenient on localization; a prediction overlapping just half the ground truth counts as a hit. Common in YOLO-era papers and historical document work where region boundaries are inherently ambiguous. Results can be 10-20 points higher than $\text{AP}@[.50:.95]$ on the same model.
- $\text{AP}@75$: Stricter localization threshold. Less commonly reported but useful for assessing whether a model captures tight region boundaries, which matters for downstream extraction.
Per-class AP values are often more informative than the mean. Classes like “Table” and “Figure” (visually distinct, large regions) tend to score much higher than “Caption” or “Footnote” (small, visually similar to body text). A high mAP can mask poor performance on minority classes.
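For concreteness, a minimal pycocotools sketch that prints the standard summary metrics and per-class $\text{AP}@[.50:.95]$; the file names are placeholders for COCO-format ground-truth and detection files:

```python
# Minimal sketch: COCO-style layout evaluation with pycocotools.
# "layout_gt.json" / "layout_preds.json" are placeholder file names.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("layout_gt.json")
preds = gt.loadRes("layout_preds.json")

ev = COCOeval(gt, preds, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints AP@[.50:.95], AP@50, AP@75, per-size AP

# Per-class AP@[.50:.95]: ev.eval["precision"] has shape
# [T=10 IoU thresholds, R=101 recall points, K classes, A areas, M maxDets].
for k, cat_id in enumerate(gt.getCatIds()):
    p = ev.eval["precision"][:, :, k, 0, -1]  # all areas, top maxDets
    p = p[p > -1]                             # -1 marks absent entries
    name = gt.loadCats(cat_id)[0]["name"]
    print(f"{name}: AP = {p.mean() if p.size else float('nan'):.3f}")
```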
F1 Score: When Layout Is Token Classification
The object detection paradigm (mAP) assumes layout analysis produces bounding boxes. The sequence labeling paradigm instead assigns a class label to each text token or text line, using the 2D position as input context. This is how the LayoutLM family, LiLT, and VSR approach the problem.
In this setting, standard classification metrics apply: precision (fraction of predicted labels that are correct), recall (fraction of ground-truth labels recovered), and their harmonic mean, F1. Two averaging conventions appear in the literature:
- Macro-F1: Compute F1 per class, then average. Gives equal weight to rare classes.
- Micro-F1: Pool all predictions, compute F1 globally. Dominated by frequent classes like “Text” or “Paragraph.”
Comparing F1 scores to mAP scores across paradigms is not meaningful; they measure different things on different granularities (token vs. region).
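The difference between the two conventions is easy to see on a toy example (a minimal sketch with scikit-learn; labels are illustrative):

```python
# Minimal sketch: macro- vs micro-F1 for layout-as-token-classification.
from sklearn.metrics import f1_score

y_true = ["paragraph", "paragraph", "title", "caption", "paragraph", "footnote"]
y_pred = ["paragraph", "paragraph", "title", "paragraph", "paragraph", "footnote"]

macro = f1_score(y_true, y_pred, average="macro")  # equal weight per class
micro = f1_score(y_true, y_pred, average="micro")  # equal weight per token
print(f"macro-F1 = {macro:.3f}, micro-F1 = {micro:.3f}")
# Macro drops sharply because the rare class ("caption") is missed entirely;
# micro barely moves because the frequent "paragraph" class dominates the pool.
```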
Layout: Comparative Studies
Cross-architecture evaluations that benchmark models from multiple paradigms on the same datasets.
| Year | Paper | Models Compared | Datasets | Key Finding | Notes |
|---|---|---|---|---|---|
| 2023 | Kastanas et al. | LayoutLMv3, YOLOv5, Paragraph2Graph, LiLT | DocLayNet, GROTOAP2, FUNSD/XFUND | YOLOv5 most practical for image-centric (fast, single GPU); LayoutLMv3 dominates text-centric (F1 0.87 vs. 0.70); machine translation hurts cross-lingual zero-shot transfer. | Notes |
Layout: Surveys
Comprehensive literature reviews that organize and synthesize the DLA field.
| Year | Paper | Scope | Key Contribution | Notes |
|---|---|---|---|---|
| 2019 | BinMakhashen & Mahmoud | 79 DLA studies (pre-deep-learning era) | General DLA framework (preprocessing, bottom-up/top-down/hybrid analysis, evaluation). Taxonomy of 6 layout types. Three-tier evaluation hierarchy (PLEF, REF, CEF). | Notes |
Related Pages
- Table Structure Recognition: Models and datasets for parsing the internal grid of detected tables (rows, columns, spanning cells).
- Reading Order Prediction: Models for determining the logical reading sequence of detected regions.
VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups
TL;DR
VILA introduces two methods for incorporating visual layout group structure (text lines and blocks) into BERT-based models for scientific PDF parsing. I-VILA inserts boundary indicator tokens and improves Macro F1 by up to 1.9%; H-VILA uses hierarchical encoding to reduce inference time by 47% with minimal accuracy loss. Both require only fine-tuning, cutting training cost by up to 95% compared to layout-aware pretraining. The paper also introduces S2-VL, a human-annotated dataset of 1,337 pages across 19 scientific disciplines.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is two novel methods (I-VILA and H-VILA) for incorporating visual layout group information into existing language models without expensive pretraining. Ablations, efficiency comparisons, and accuracy improvements are the paper’s center of gravity.
Secondary: $\Psi_{\text{Resource}}$. The paper introduces the S2-VL dataset (1,337 human-annotated pages across 19 disciplines) and the S2-VLUE benchmark suite unifying three evaluation datasets.
Secondary: $\Psi_{\text{Evaluation}}$. The paper introduces the group category inconsistency metric $H(G)$ and systematically evaluates how different VILA group detectors (ground-truth, vision model, PDF parsing) affect downstream performance.
What is the motivation?
Extracting structured content from scientific PDFs (titles, abstracts, references, body text, etc.) is a critical first step for NLP over scientific papers. Layout-aware language models like LayoutLM improve accuracy by encoding each token’s 2D position, but they do not explicitly model higher-level layout structures: the grouping of tokens into text lines and blocks. These models also require expensive pretraining (1,000+ GPU-hours for LayoutLM, several thousand for LayoutLMv2).
Existing evaluation datasets (GROTOAP2, DocBank) rely on automatic labeling from source files, have limited domain coverage (life sciences or math/physics/CS only), and contain systematic annotation errors. No manually annotated multi-discipline evaluation set existed.
What is the novelty?
The core insight is the group uniformity assumption: tokens within a visual layout group (text line or text block) generally share the same semantic category. This is exploited in two ways:
I-VILA: Layout Indicator Injection
A special [BLK] token is inserted at each group boundary in the linearized token sequence. The resulting input has the form:
$$[CLS], T_1^{(1)}, \ldots, T_{n_1}^{(1)}, [BLK], T_1^{(2)}, \ldots, [BLK], T_1^{(m)}, \ldots, T_{n_m}^{(m)}, [SEP]$$
These boundary tokens signal possible category changes between groups. When used with a layout-aware base model like LayoutLM, the [BLK] token inherits the 2D positional embedding of its group’s bounding box. No pretraining is needed; the model learns to use these indicators during fine-tuning.
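A minimal sketch of the indicator-injection step, assuming a list-of-lists representation of detected groups (function and variable names are illustrative; per the description above, `[BLK]` is a special token that, with LayoutLM, also carries the 2D position of its group's bounding box):

```python
# Minimal sketch of I-VILA input construction: insert a [BLK] indicator
# token at each group boundary before feeding the sequence to the model.
def linearize_with_indicators(page_groups, indicator="[BLK]"):
    """Flatten grouped tokens into one sequence, separating consecutive
    groups with the indicator token (no indicator before the first group)."""
    tokens = []
    for i, group in enumerate(page_groups):
        if i > 0:
            tokens.append(indicator)
        tokens.extend(group)
    return tokens

page_groups = [
    ["Deep", "Learning", "for", "Documents"],    # title block
    ["We", "study", "layout", "analysis", "."],  # body block
]
print(linearize_with_indicators(page_groups))
# ['Deep', ..., 'Documents', '[BLK]', 'We', ..., '.']
```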
H-VILA: Hierarchical Encoding
A two-level transformer architecture:
- Group encoder ($l_g = 1$ layer): encodes tokens within each group into a single vector $\tilde{\mathbf{h}}_j$, combined with 2D positional embeddings:
$$\mathbf{h}_j = \tilde{\mathbf{h}}_j + p(b_j)$$
- Page encoder ($l_p = 12$ layers): models inter-group relationships across the full page, producing group-level predictions via an MLP classifier.
Both encoders are initialized from pretrained BERT/LayoutLM weights (layer 1 for the group encoder, full model for the page encoder), so no additional pretraining is required.
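For intuition, a minimal PyTorch sketch of the two-level encoding. Module choices, mean pooling, and the linear 2D-position stand-in are assumptions for illustration; the actual model initializes both levels from pretrained BERT/LayoutLM weights and uses LayoutLM-style 2D position embeddings.

```python
# Minimal sketch of H-VILA's two-level encoding (illustrative, not the
# paper's implementation): encode tokens within each group, pool to one
# vector per group, add a 2D position term, then run a page-level encoder.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, hidden=768, n_classes=15, l_g=1, l_p=12):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(hidden, nhead=12, batch_first=True)
        self.group_encoder = nn.TransformerEncoder(make_layer(), num_layers=l_g)
        self.page_encoder = nn.TransformerEncoder(make_layer(), num_layers=l_p)
        self.pos2d = nn.Linear(4, hidden)  # stand-in for p(b_j)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, group_tokens, group_boxes):
        # group_tokens: [num_groups, tokens_per_group, hidden] token embeddings
        # group_boxes:  [num_groups, 4] normalized (x0, y0, x1, y1)
        h_tilde = self.group_encoder(group_tokens).mean(dim=1)  # pool each group
        h = h_tilde + self.pos2d(group_boxes)                   # h_j = h~_j + p(b_j)
        h = self.page_encoder(h.unsqueeze(0)).squeeze(0)        # inter-group attention
        return self.classifier(h)                               # one prediction per group

enc = HierarchicalEncoder()
logits = enc(torch.randn(6, 20, 768), torch.rand(6, 4))
print(logits.shape)  # torch.Size([6, 15])
```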
S2-VLUE Benchmark Suite
Three datasets unified with VILA structure annotations:
- GROTOAP2: 119K pages, 22 categories, life sciences, automatic labels from CERMINE PDF parsing.
- DocBank: 498K pages, 12 categories, math/physics/CS, automatic labels from LaTeX source. VILA structures regenerated via Mask R-CNN.
- S2-VL (new): 1,337 pages from 87 papers across 19 disciplines, 15 categories, human-annotated using the PAWLS tool. Inter-annotator agreement: 0.95 token-level accuracy.
Group Category Inconsistency Metric
A new diagnostic metric $H(G)$ measuring the entropy of predicted token categories within each group:
$$H(G) = \frac{1}{m} \sum_{i=1}^{m} H(g_i), \quad H(g) = -\sum_{c} p_c \log p_c$$
Lower values indicate more consistent predictions within groups.
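A direct transcription of the metric into Python; the `groups` input format (each group's predicted token categories) is illustrative, not the paper's code:

```python
# Minimal sketch of the group category inconsistency metric H(G):
# average entropy of predicted token categories within each layout group.
import math
from collections import Counter

def group_entropy(labels):
    """H(g): entropy of the category distribution inside one group."""
    counts, n = Counter(labels), len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def group_inconsistency(groups):
    """H(G): mean per-group entropy; lower means more consistent predictions."""
    return sum(group_entropy(g) for g in groups) / len(groups)

groups = [
    ["body", "body", "body"],    # perfectly consistent group: entropy 0
    ["title", "title", "body"],  # mixed predictions: positive entropy
]
print(f"H(G) = {group_inconsistency(groups):.3f}")
```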
What experiments were performed?
Experiments are conducted on all three S2-VLUE datasets. For S2-VL, 5-fold cross-validation is used (split by paper, not page).
I-VILA results (Table 2):
| Model | GROTOAP2 F1 | DocBank F1 | S2-VL F1 |
|---|---|---|---|
| LayoutLM (baseline) | 92.34 | 91.06 | 82.69 |
| + I-VILA (Text Block) | 93.38 | 92.00 | 83.44 |
| + I-VILA (Text Line) | 92.37 | 92.79 | 83.77 |
I-VILA with text blocks also reduces group inconsistency $H(G)$ by 32.1% on GROTOAP2.
H-VILA results (Table 3):
| Model | S2-VL F1 | Inference Time (ms) |
|---|---|---|
| LayoutLM (baseline) | 82.69 | 52.56 |
| H-VILA (Text Line) | 83.69 | 28.07 (47% faster) |
| H-VILA (Text Block) | 82.09 | 16.37 (69% faster) |
Generalization across base models (Table 4, GROTOAP2):
I-VILA improves all tested base models: DistilBERT (+1.60), BERT (+1.53), RoBERTa (+0.88), LayoutLM (+1.04 Macro F1).
Training cost comparison (Table 5):
LayoutLM pretraining requires 1,000+ GPU-hours. VILA methods require only fine-tuning: 4.7 GPU-hours for I-VILA and 3.5 GPU-hours for H-VILA on GROTOAP2, representing a 95%+ cost reduction.
VILA group quality ablation (Table 6, S2-VL):
Using ground-truth text blocks yields the best I-VILA performance (86.50 F1) compared to vision model predictions (83.44 F1) or PDF parsing (83.95 F1). H-VILA is more sensitive to group detection errors since it cannot assign different categories within a group.
What are the outcomes/conclusions?
I-VILA provides consistent accuracy improvements (+1-2% Macro F1) and better prediction consistency across all three benchmarks, without any pretraining. H-VILA trades modest accuracy for large efficiency gains (up to 69% inference time reduction). Both methods generalize across different BERT variants.
The S2-VL dataset fills a gap as the first human-annotated, multi-discipline evaluation set for scientific document parsing, covering 19 disciplines with 15 token categories.
Key limitations:
- VILA methods are evaluated only on scientific documents; generalization to other document types is untested.
- H-VILA is sensitive to group detection quality: incorrect block boundaries propagate errors since all tokens in a group receive the same label.
- S2-VL is relatively small (1,337 pages, 87 papers), limiting its use as a training set.
- The group uniformity assumption does not always hold, particularly at block boundaries or when block detectors make errors.
Reproducibility
Models
- I-VILA: Fine-tuned LayoutLM-BASE with `[BLK]` indicator tokens. 110M params (same as LayoutLM-BASE).
- H-VILA: Group encoder ($l_g = 1$, initialized from LayoutLM layer 1) + Page encoder ($l_p = 12$, initialized from full LayoutLM). Positional embeddings from group bounding boxes.
- Eight fine-tuned model checkpoints released on HuggingFace (baseline, I-VILA, H-VILA variants).
Algorithms
- Optimizer: AdamW, lr $5 \times 10^{-5}$ ($2 \times 10^{-5}$ for S2-VL), $\beta = (0.9, 0.999)$.
- Linear warmup over 5% of steps, then linear decay (see the sketch after this list).
- Batch sizes: 40 (GROTOAP2), 40 (DocBank), 12 (S2-VL).
- Epochs: 24 (GROTOAP2), 6 (DocBank), 10/20 (S2-VL).
- Mixed precision training.
- S2-VL uses 5-fold cross-validation, split by paper.
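A minimal sketch of this fine-tuning schedule using Hugging Face `transformers` (the linear model and step count below are placeholders, not the VILA architecture):

```python
# Minimal sketch of the reported schedule: AdamW with linear warmup
# over 5% of steps, then linear decay to zero.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 15)  # placeholder, not the VILA model
total_steps = 10_000              # depends on dataset size, batch size, epochs
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),
    num_training_steps=total_steps,
)
```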
Data
- GROTOAP2: 119K pages from PubMed Central. Automatic annotations. Available via the VILA repo.
- DocBank: 498K pages from arXiv. Automatic annotations. VILA structures regenerated via Mask R-CNN.
- S2-VL: 1,337 pages from 87 papers across 19 disciplines. Human-annotated with PAWLS. 15 categories. Inter-annotator agreement 0.95. Available via the VILA repo.
Evaluation
- Primary metric: Macro F1 (token-level).
- Diagnostic metric: Group category inconsistency $H(G)$ (entropy-based).
- Inference time: average over 3 runs on 1,000 GROTOAP2 test pages, single V100 GPU.
- S2-VL: 5-fold cross-validation with standard deviations reported.
Hardware
- Training: 4-GPU RTX 8000 or A100 machines.
- Training cost: 3.5 GPU-hours (H-VILA) to 4.7 GPU-hours (I-VILA) on GROTOAP2, compared to 1,000+ GPU-hours for LayoutLM pretraining.
- Inference: single V100 GPU. I-VILA: ~53 ms/page. H-VILA (text line): ~28 ms/page. H-VILA (text block): ~16 ms/page.
BibTeX
```bibtex
@article{shen2022vila,
  title={{VILA}: Improving Structured Content Extraction from Scientific {PDFs} Using Visual Layout Groups},
  author={Shen, Zejiang and Lo, Kyle and Wang, Lucy Lu and Kuehl, Bailey and Weld, Daniel S. and Downey, Doug},
  journal={Transactions of the Association for Computational Linguistics},
  volume={10},
  pages={376--392},
  year={2022},
  publisher={MIT Press},
  doi={10.1162/tacl_a_00466}
}
```
Advanced Layout Analysis Models for Docling
TL;DR
IBM trains five real-time document layout detectors (RT-DETRv2 and D-FINE variants) on a mixed corpus of 150K documents spanning 17 layout classes. The best model, heron-101, achieves 0.780 mAP on a filtered DocLayNet test set with 28 ms/image inference on an A100, a 23.9 absolute mAP-point improvement over Docling’s previous baseline. All model weights are released under Apache-2.0.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is training and evaluating a family of object detection architectures (RT-DETRv1, RT-DETRv2, D-FINE) for document layout analysis. The paper devotes most of its pages to architecture selection, training recipes, post-processing pipelines, and multi-benchmark evaluation.
Secondary: $\Psi_{\text{Resource}}$. Five publicly released model checkpoints under Apache-2.0, plus the introduction of DocLayNet-v2 (a proprietary 17-class extension of DocLayNet) and canonical-DocLayNet (a filtered subset of DocLayNet that removes annotation noise from the 6 “delta” classes).
Secondary: $\Psi_{\text{Evaluation}}$. The paper devotes significant analysis to comparing two evaluation methodologies (COCO-tools vs. docling-eval), documents a “post-processing paradox” where qualitatively better outputs score lower on mAP, and concludes that mAP may not be suitable for document layout evaluation.
What is the motivation?
Docling is IBM’s open-source document conversion pipeline. Its previous layout model (RT-DETRv1 with ResNet-50, referred to as “old-docling”) only supported DocLayNet’s 11 original classes and missed important structural elements: checkboxes, code blocks, forms, key-value regions, and document indices. The authors needed to:
- Expand the label taxonomy from 11 to 17 classes to support richer document types (forms, code, etc.).
- Improve detection accuracy while maintaining real-time inference speed.
- Use only permissively licensed architectures (no YOLO due to AGPL licensing concerns).
What is the novelty?
Expanded taxonomy and data curation
The 17-class taxonomy adds six “delta” classes to DocLayNet’s original 11: Checkbox-Selected, Checkbox-Unselected, Code, Document Index, Form, and Key-Value Region. The training mix unifies three sources:
- DocLayNet (public, filtered): A “canonical-DocLayNet” variant created by training a filtering detector at confidence threshold 0.3 to flag and exclude pages likely containing mislabeled delta-class elements. This removed ~32% of DocLayNet, yielding 22,101 training pages.
- DocLayNet-v2 (proprietary): IBM’s extended dataset covering all 17 classes with a 7,613-page test split.
- WordScape (public, filtered): Documents from the 2013 Common Crawl snapshot, with all table-containing pages removed due to systematic “Table”/“Form” label confusion.
The combined training set contains ~150K pages and 2.3M bounding-box annotations.
Architecture selection
The authors chose transformer-based detectors with permissive licenses:
| Model name | Architecture | Backbone | Params |
|---|---|---|---|
| egret-m | D-FINE | HGNet-V2 M | 19.5M |
| egret-l | D-FINE | HGNet-V2 L | 31.2M |
| egret-x | D-FINE | HGNet-V2 XL | 62.7M |
| heron | RT-DETRv2 | ResNet-50vd | 42.9M |
| heron-101 | RT-DETRv2 | ResNet-101vd | 76.7M |
Post-processing pipeline
Raw detections undergo PDF-aware refinement:
- Each predicted bounding box is matched to overlapping native PDF cells and snapped to their boundaries (sketched after this list).
- Pictures covering >90% of a page are discarded.
- Regular elements overlapping “wrapper” elements (Form, Key-Value Region, Table, Document Index) become children; wrapper boxes expand to enclose children.
- Overlapping groups are resolved by a rule-based selector that considers label priority, element size, and confidence.
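A minimal sketch of the first two rules (cell snapping and full-page picture filtering), assuming axis-aligned boxes as simple dataclasses; the names and schema here are illustrative, not Docling's actual post-processing API:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned overlap test."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def area(b: Box) -> float:
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)

def snap_to_cells(pred: Box, pdf_cells: list[Box]) -> Box:
    """Snap a predicted box to the bounding union of overlapping PDF cells."""
    hits = [c for c in pdf_cells if overlaps(pred, c)]
    if not hits:
        return pred  # nothing to snap to; keep the raw detection
    return Box(min(c.x0 for c in hits), min(c.y0 for c in hits),
               max(c.x1 for c in hits), max(c.y1 for c in hits))

def drop_full_page_pictures(pictures: list[Box], page: Box,
                            thresh: float = 0.9) -> list[Box]:
    """Discard picture detections covering more than 90% of the page."""
    return [p for p in pictures if area(p) / area(page) <= thresh]
```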
What experiments were performed?
Evaluation datasets
- DocLayNet (original 11-class test split)
- DocLayNet-v2 (17-class, 7,613-page proprietary test split)
- canonical-DocLayNet (filtered 11-class, 1,574-page test split)
Evaluation methodologies
Both methods report mean Average Precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05:
$$\text{mAP} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \text{AP}_t, \quad \mathcal{T} = \{0.50, 0.55, \ldots, 0.95\}$$
- COCO-tools: Standard $\text{mAP}@[0.50{:}0.95]$ with per-size AP (small/medium/large). Evaluated under three conditions: all predictions (no threshold), confidence $\geq 0.50$, and Docling post-processing (confidence $\geq 0.50$ + PDF-cell snapping + overlap resolution).
- docling-eval: Docling’s own evaluation package. Applies a 0.50 confidence threshold, sets all surviving scores to 1.0, restricts to the label intersection of predictions and ground truth, skips samples with mismatched box counts, then computes $\text{mAP}@[0.50{:}0.95]$; this filtering is sketched below.
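To make the difference between the two protocols concrete, here is a minimal sketch of the docling-eval filtering steps described above, under an assumed prediction schema (dicts with `label`, `box`, `score`); this is not the package's actual interface:

```python
def docling_eval_filter(preds, gts, conf_thresh=0.50):
    """Mimic the described docling-eval preprocessing before mAP@[0.50:0.95].

    `preds` and `gts` are lists of dicts with "label" and "box" keys
    (predictions also carry "score") -- an assumed schema for illustration.
    """
    # 1. Apply the 0.50 confidence threshold, then set surviving scores to 1.0.
    preds = [{**p, "score": 1.0} for p in preds if p["score"] >= conf_thresh]
    # 2. Restrict both sides to the label intersection.
    shared = {p["label"] for p in preds} & {g["label"] for g in gts}
    preds = [p for p in preds if p["label"] in shared]
    gts = [g for g in gts if g["label"] in shared]
    # 3. Samples with mismatched box counts are skipped entirely.
    if len(preds) != len(gts):
        return None
    return preds, gts
```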
Runtime benchmarks
Inference time measured across three hardware configurations:
- CPU: AMD EPYC 7763 (4 threads)
- GPU: NVIDIA A100-80GB (batch sizes 100, 200, 500)
- MPS: Apple M3 Max (batch sizes 50, 100)
What are the outcomes/conclusions?
Accuracy
On canonical-DocLayNet (COCO-tools, no post-processing, no threshold):
- heron-101: 0.780 mAP (best overall)
- heron: 0.776 mAP
- egret-m: 0.765 mAP
On DocLayNet-v2 (COCO-tools, no post-processing, no threshold):
- heron-101: 0.758 mAP
- heron: 0.751 mAP
All new models improve over old-docling by 20.6 to 23.9 absolute mAP points (the paper reports these as percentage-point gains, not relative improvements). On the original DocLayNet test set, heron leads with 0.699 mAP, a 38.4% relative improvement over old-docling’s 0.505.
Post-processing paradox
The Docling post-processing pipeline (PDF-cell snapping, overlap resolution) visually improves layout quality, producing cleaner, less fragmented bounding boxes. However, it consistently lowers mAP scores by 10+ points. The authors attribute this to mAP penalizing geometric adjustments that do not exactly match ground-truth boxes, even when the post-processed output is qualitatively better. They conclude that mAP may not be the most suitable metric for document layout evaluation.
Runtime
On A100 with batch size 200:
- egret-m: 24 ms/image (fastest)
- heron-101: 28 ms/image
- heron: 31 ms/image
On CPU (4 threads, batch 32):
- egret-m: 334 ms/image
- heron: 643 ms/image
- heron-101: 988 ms/image
egret-m is roughly $3\times$ faster than heron-101 on CPU, making it a better choice for CPU-only deployments.
Limitations
- DocLayNet-v2 is proprietary. The 17-class test set that most directly measures full-taxonomy performance is not publicly available, limiting independent reproducibility.
- No comparison to non-DETR baselines. YOLO-family models (DocLayout-YOLO, PP-DocLayout) are excluded due to licensing. This makes it difficult to position these results against the broader DLA landscape.
- mAP critique is observational, not formal. The authors note mAP’s shortcomings for document layout but do not propose an alternative metric.
- Single training run per model. No error bars, variance estimates, or seed sensitivity analysis.
- Annotation ambiguity acknowledged but unaddressed. The qualitative analysis (Figures 2-4) reveals that ground truth and model predictions can both be “valid” layouts, but the paper does not explore multi-annotation or soft-label approaches.
Reproducibility
Models
- Five new models released on HuggingFace in safetensors format under Apache-2.0, plus the pre-existing old-docling baseline.
- Parameter counts: 19.5M (egret-m) to 76.7M (heron-101).
- Backbones initialized from pre-trained weights (ResNet, HGNet-V2). Pre-training source not specified beyond “pre-trained” (likely ImageNet, but not stated).
- Models were trained with native PyTorch code from the original RT-DETR and D-FINE repositories, then converted to HuggingFace Transformers safetensors for inference. Training scripts, configs, and reproduction recipes are not released; only inference-ready checkpoints are available.
Algorithms
- RT-DETRv2 models (heron, heron-101): 72 epochs, lr $10^{-4}$, AdamW ($\beta_1$=0.9, $\beta_2$=0.999), weight decay $10^{-4}$.
- D-FINE models (egret-m/l/x): 132/80/80 epochs, lr $2 \times 10^{-4}$ / $2.5 \times 10^{-4}$ / $2.5 \times 10^{-4}$, AdamW ($\beta_1$=0.9, $\beta_2$=0.999), weight decay $10^{-4}$ / $1.25 \times 10^{-4}$ / $1.25 \times 10^{-4}$.
- Augmentation: randomized distortions, zoom out, horizontal flips, resize to $640 \times 640$ (a possible realization is sketched after this list).
- No mention of learning rate warmup, gradient clipping, mixed precision, or EMA.
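A sketch of that augmentation chain, assuming torchvision's v2 transforms (the paper names only the transform families, so the specific ops and parameters below are guesses, not the authors' recipe):

```python
import torch
from torchvision.transforms import v2

# Illustrative training augmentations: distortions, zoom out, flips, resize.
train_transforms = v2.Compose([
    v2.RandomPhotometricDistort(p=0.5),     # "randomized distortions"
    v2.RandomZoomOut(fill=255, p=0.5),      # "zoom out", padding with white
    v2.RandomHorizontalFlip(p=0.5),         # "horizontal flips"
    v2.Resize((640, 640)),                  # fixed 640x640 training resolution
    v2.ToDtype(torch.float32, scale=True),  # uint8 -> float32 in [0, 1]
])
```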
Data
- Training set: ~150K pages, 2.3M annotations across 17 classes.
- DocLayNet (public, filtered to ~22K pages via canonical filtering at threshold 0.3)
- DocLayNet-v2 (proprietary, not publicly available)
- WordScape (public, table-containing pages excluded)
- Test splits: DocLayNet original (public), DocLayNet-v2 (proprietary, 7,613 pages), canonical-DocLayNet (public, 1,574 pages).
- The proprietary DocLayNet-v2 component means the full training set cannot be reconstructed. DocLayNet-v2 is not available on the docling-project HuggingFace org or the deprecated ds4sd org.
Evaluation
- COCO-tools mAP@[0.50:0.95] with size-stratified AP.
- docling-eval (MIT license): custom evaluation that filters predictions at 0.50 confidence, normalizes scores, and restricts to shared labels.
- No cross-dataset evaluation (e.g., M6Doc, D4LA) beyond the DocLayNet family.
- Single run per model; no error bars or significance tests.
Hardware
- Training hardware not specified (GPU type, count, total hours not reported).
- Inference benchmarked on three platforms: A100-80GB, AMD EPYC 7763 CPU (4 threads), Apple M3 Max (MPS). Batch sizes varied (32 for CPU, 50-500 for GPU/MPS).
- Models are small enough to run on consumer hardware; egret-m at 19.5M params runs at 334 ms/page on CPU.
BibTeX
@article{livathinos2025advanced,
title={Advanced Layout Analysis Models for Docling},
author={Livathinos, Nikolaos and Auer, Christoph and Nassar, Ahmed and Teixeira de Lima, Rafael and Lysak, Maksym and Ebouky, Brown and Berrospi, Cesar and Dolfi, Michele and Vagenas, Panagiotis and Omenetti, Matteo and Dinkla, Kasper and Kim, Yusik and Weber, Valery and Morin, Lucas and Meijer, Ingmar and Kuropiatnyk, Viktor and Strohmeyer, Tim and Gurbuz, A. Said and Staar, Peter W. J.},
journal={arXiv preprint arXiv:2509.11720},
year={2025}
}
American Stories: Large-Scale Structured Text from Historical U.S. Newspapers
TL;DR
American Stories applies a modular, cost-efficient deep learning pipeline (YOLOv8 layout detection, MobileNetV3 legibility classification, EfficientOCR) to nearly 20 million Library of Congress newspaper scans, producing 1.14 billion content region bounding boxes and 65.6 billion tokens of structured article text. The underlying newspaper content is public domain; the dataset itself is released under CC-BY-4.0.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the American Stories dataset itself: 1.14B layout annotations and structured OCR text extracted from the Chronicling America collection. The paper’s applications section, evaluation, and limitations all center on the dataset’s fitness for downstream use.
Secondary: $\Psi_{\text{Method}}$. The paper develops a full extraction pipeline (layout detection, legibility classification, custom OCR, content association) and makes specific architectural choices (mobile-phone-class models) to meet a $60K cloud compute budget constraint. The pipeline is modular and released for reuse.
What is the motivation?
Library of Congress’s Chronicling America project provides approximately 20 million historical newspaper scans with page-level OCR, but the digitized text does not respect page layout. Articles, headlines, captions, advertisements, and other regions are scrambled together at the page level. A non-trivial share of scans are illegible, introducing noise. These limitations prevent effective use of modern NLP methods, language model pre-training on historical English, and structured social science analyses that require article-level text.
Existing open-source alternatives (notably Newspaper Navigator) detect visual content classes (photographs, illustrations, maps, comics, cartoons, ads, headlines) but do not detect article bounding boxes, which is the prerequisite for legibility filtering, custom OCR, and article association.
What is the novelty?
The core contribution is the end-to-end pipeline for converting newspaper scans into structured article text at scale:
Layout detection: YOLOv8-Medium trained on 2,202 labeled pages (48,874 layout objects) with 9 content classes: Article, Headline, Byline, Caption, Image, Advertisement, Table, Masthead, and Header. Achieves 91.3 mAP@50:95 on articles and 88.6 on headlines.
Legibility classification: MobileNetV3-Small classifies each text region as legible, borderline, or illegible. Trained on 979 labeled crops with weighted cross-entropy. No legible texts are misclassified as illegible in evaluation.
OCR (EfficientOCR): A character/word-level image retrieval approach using MobileNetV3-Small encoders trained with supervised contrastive loss. Words are recognized by nearest-neighbor lookup against an offline embedding index of 22,230 dictionary terms rendered across 43 fonts. When cosine similarity falls below 0.82, the system falls back to character-level recognition. This design meets the $60K compute budget while achieving 4.3% CER.
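A minimal sketch of the retrieval step with its character-level fallback; function and variable names are illustrative, not EfficientOCR's API:

```python
import numpy as np

def recognize_word(word_emb: np.ndarray, index_embs: np.ndarray,
                   index_words: list[str], char_level_ocr,
                   threshold: float = 0.82) -> str:
    """Nearest-neighbor word recognition over a dictionary embedding index.

    `word_emb` is the L2-normalized embedding of a word crop; `index_embs`
    holds normalized embeddings of dictionary terms rendered across fonts.
    """
    sims = index_embs @ word_emb       # cosine similarity for unit vectors
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return index_words[best]       # confident word-level match
    return char_level_ocr()            # below 0.82: character-level fallback
```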
Content association: Rule-based methods associate headlines, bylines, and article bounding boxes using spatial overlap heuristics. Achieves 97.0 F1 on headline-article association.
The pipeline prioritizes mobile-class architectures (YOLOv8, MobileNetV3) throughout, making it over an order of magnitude cheaper to deploy than alternatives like TrOCR.
What experiments were performed?
Evaluation uses four hand-constructed datasets:
- Full page scans (10 pages, 597 content regions, 196,655 characters): end-to-end pipeline evaluation including content association (214 headline-article pairs).
- Day-per-decade sample (50 lines per decade, 1850-1920): OCR evaluation across diverse historical periods.
- Random lines from Carlson et al. (64 textlines): comparison with other detection frameworks and OCR engines.
- Legibility sample (100 bounding boxes): legibility classifier evaluation.
Key metrics:
- End-to-end CER: 0.051 (0.044 with spellchecking)
- OCR-only CER: 0.043
- Layout/line detection CER contribution: 0.012
- Content association F1: 97.0
- Legibility: 0/81 legible texts misclassified as illegible; 1/16 illegible texts misclassified as legible
Downstream application comparisons against Chronicling America page-level OCR:
- Topic classification (politics): RoBERTa on American Stories articles achieves F1 83.6 (article-level) and 96.0 (page-level), compared to 83.3 for neural methods on Chronicling America pages.
- Reproduced content detection: Neural method on American Stories achieves ARI 86.2 (page-level), compared to 74.6 for Viral Texts on Chronicling America.
What are the outcomes/conclusions?
The American Stories dataset contains 1.14 billion content region bounding boxes across nearly 20 million scans, with 65.6 billion tokens of OCR text. Coverage spans all U.S. states, with content dating from the 17th century through the mid-20th century (bulk from early 1900s).
The structured article texts enable analyses that are impossible with page-level OCR: article-level topic classification, reproduced content detection, and news story clustering. The authors demonstrate that neural methods on structured articles substantially outperform both neural and sparse methods on unstructured page text.
Limitations the authors acknowledge:
- Foreign-language newspapers are excluded (off-the-shelf OCR performs poorly on diverse scripts)
- Only 3.8% of articles span multiple bounding boxes; these are not associated in the current release
- Historical language reflects cultural biases of the period; the authors recommend against using the dataset for training generative models
- The pipeline is optimized for cost over accuracy; swapping in larger models (ViT, two-stage detectors) could improve quality
Reproducibility
Models
- Layout detection: YOLOv8-Medium, initialized from official pretrained checkpoint. Trained 100 epochs on 2,202 pages. Settings: `imgsz=1280`, `iou=0.2`, `max_det=500`, `conf=0.1`.
- Legibility classification: MobileNetV3-Small (timm `mobilenetv3_small_050`). Trained 50 epochs on 979 crops. Resolution 256, lr 2e-3 with 0.1 decay every 20 epochs. Weighted CE loss [2.0, 1.0, 1.0].
- Line detection: YOLOv8-Small. 100 epochs on 4,000 synthetic lines, then 50 epochs on 373 annotated crops. Resolution 640.
- Word/character localization: YOLOv8-Small. 100 epochs on 8,000 synthetic textlines, then 100 epochs on 684 annotated lines. Resolution 640.
- Word recognition: MobileNetV3-Small (timm `mobilenetv3_small_050`). SupCon loss, $\tau=0.1$, AdamW, cosine annealing with warm restarts (max lr 2e-3). 50 epochs without hard negatives, 40 epochs with offline hard negative mining. Batch size 1024 ($m=4$ views $\times$ 256 words).
- Character recognition: EffOCR-C (Small) from Carlson et al.
- All model checkpoints available via Dropbox link in the GitHub repo.
Data
- Source: Library of Congress Chronicling America collection (~20M scans). Public domain (pre-1925 content).
- Layout annotations: 2,202 pages labeled by undergraduate research assistants. Active learning sampling. 13% from Chronicling America, remainder from other public domain/off-copyright newspapers.
- OCR training data: 43 digital fonts for synthetic rendering; silver-quality labels from EffOCR-C on random sample; small set of hand-labeled word crops. Dictionary of 22,230 terms.
- Output dataset: 1.14B bounding boxes, 65.6B tokens. JSON format on HuggingFace Hub. CC-BY-4.0.
Evaluation
- CER (character error rate) via Levenshtein distance. Non-word rate as proxy metric on larger samples.
- mAP@50:95 for layout and line detection.
- F1 for content association. ARI (adjusted rand index) for reproduced content clustering.
- Evaluation sets are small (10 full pages, 64-225 textlines). No error bars or multi-run analysis reported.
Hardware
- Training: Single NVIDIA A6000 GPU.
- Inference (pipeline deployment): Azure F-series CPU nodes. Total compute budget: $60,000 USD for processing ~20M scans.
- Cost comparison: EfficientOCR is reported as over an order of magnitude cheaper than TrOCR Base on cloud CPUs.
BibTeX
@inproceedings{dell2023american,
title={American Stories: A Large-Scale Structured Text Dataset of Historical {U.S.} Newspapers},
author={Dell, Melissa and Carlson, Jacob and Bryan, Tom and Silcock, Emily and Arora, Abhishek and Shen, Zejiang and D'Amico-Wong, Luca and Le, Quan and Querubin, Pablo and Heldring, Leander},
booktitle={Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track},
year={2023}
}
CATMuS Medieval Segmentation: A SegmOnto-Based Layout Dataset for Medieval Manuscripts
TL;DR
CATMuS Medieval Segmentation is a 1,680-page layout dataset covering 159 medieval manuscripts (8th-16th century) in 10 languages, annotated with 19 classes following the SegmOnto vocabulary (13 zone types, 6 line types). It is released as a companion to the CATMuS Medieval HTR dataset, which was the primary focus of the ICDAR 2024 paper. Since the segmentation dataset has no standalone publication, this note draws on the ICDAR 2024 paper for context on dataset construction, sources, and methodology. The segmentation dataset provides hierarchical block/line annotations with polygons, bounding boxes, and baselines.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The primary contribution is a large-scale, multi-project, interoperable dataset for medieval manuscript analysis. The segmentation dataset is a companion release providing layout annotations derived from the HTR ground truth creation pipeline.
Secondary: $\Psi_{\text{Evaluation}}$. The paper devotes a full section to benchmark split design (general vs. feature-based) and HTR baseline experiments with two engines across three seeds, reporting CER with standard deviations.
What is the motivation?
Digitization initiatives have made thousands of medieval manuscripts available as images, but most lack machine-readable text. Handwritten Text Recognition (HTR) is the key technology for conversion, yet building consistent ground truth across projects is difficult due to conflicting philological traditions, varying transcription standards, and project-specific annotation practices.
Prior to CATMuS, no single dataset provided cross-century, cross-language, and cross-script evaluation capability for medieval Latin-script manuscripts. Each project worked in isolation with incompatible conventions.
For layout analysis specifically, medieval manuscripts present challenges that modern document datasets do not cover: marginal glosses, drop capitals, music notation, quire marks, seals, and other elements unique to historical codices. The SegmOnto vocabulary was designed to provide a shared ontology for these elements across projects.
What is the novelty?
CATMuS Medieval is a macro-dataset assembled from 14 contributing repositories across multiple institutions. The core contribution is the unification effort: harmonizing transcription guidelines, segmentation practices, and metadata across projects that previously used incompatible standards.
The dataset construction followed three phases:
- Phase 1: CREMMA and HTRomance projects formulated primary transcription guidelines for literary manuscripts in book scripts (Old French, Latin, Castilian, Italian).
- Phase 2: The DEEDS project joined, adding Latin cartulary material.
- Phase 3: Adoption by the Carthusian Monastery of Herne team and Middle English researchers.
Segmentation was performed primarily in eScriptorium using the BLLA model, with corrections. Transkribus data was repolygonized and post-corrected for compatibility. The segmentation dataset applies SegmOnto vocabulary for both zone-level and line-level classification.
Transcription Guidelines
A significant part of the paper’s contribution is the graphemic transcription approach, which enables interoperability across projects:
- Allograph normalization: Many-to-one mapping (e.g., round s and long f merged; round r and short r merged).
- Abbreviations preserved: Treated as an NLP task, not an HTR task. Diacritics in Unicode decomposed form (NFD).
- u/v and i/j normalization: The “Ramist” distinction is disregarded; v is transcribed as u, j as i.
- Punctuation standardized: Reduced to four signs: `.` (full stop), `;` (double signs), `,` (comma), `?` (question mark).
- Word segmentation: Semantic (lexical spaces) rather than imitative, addressing inconsistencies across medieval scribal practices.
These choices allow manuscripts from different projects, languages, and centuries to coexist in a single training set without conflicting conventions.
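A toy sketch of a few of these normalizations, assuming lowercase plain-string input (the actual guidelines are applied during transcription, and the allograph table below is only a fragment):

```python
import unicodedata

# Partial allograph / Ramist mapping, for illustration only.
ALLOGRAPHS = str.maketrans({
    "ſ": "s",   # long s -> round s
    "ꝛ": "r",   # r rotunda -> short r
    "v": "u",   # Ramist u/v distinction disregarded
    "j": "i",   # likewise i/j
})
ALLOWED_PUNCT = {".", ";", ",", "?"}

def normalize_line(text: str) -> str:
    text = unicodedata.normalize("NFD", text)  # diacritics in decomposed form
    text = text.translate(ALLOGRAPHS)
    # Keep only the four standardized punctuation signs.
    return "".join(ch for ch in text
                   if ch in ALLOWED_PUNCT
                   or not unicodedata.category(ch).startswith("P"))
```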
SegmOnto Classes (19 total)
Zone types (13):
| Zone Class | Description |
|---|---|
| MainZone | Primary text area |
| MarginTextZone | Text in margins (glosses, annotations) |
| DropCapitalZone | Decorated initial letters |
| GraphicZone | Illustrations or decorative elements |
| MusicZone | Musical notation |
| NumberingZone | Folio/page numbering |
| QuireMarksZone | Quire signatures or catchwords |
| RunningTitleZone | Running headers/titles |
| SealZone | Wax or parchment seals |
| StampZone | Library or ownership stamps |
| TitlePageZone | Title pages (primarily incunabula) |
| DamageZone | Areas of physical damage |
| DigitizationArtefactZone | Scanning artifacts (rulers, color bars) |
Line types (6):
| Line Class | Description |
|---|---|
| DefaultLine | Standard text lines |
| HeadingLine | Section headings or rubrics |
| DropCapitalLine | Lines containing drop capitals |
| InterlinearLine | Interlinear glosses or corrections |
| MusicLine | Lines of musical notation |
| TironianSignLine | Lines containing Tironian shorthand |
Note: The class inventory above is sourced from the HuggingFace dataset card. The ICDAR 2024 paper mentions region and line classes as metadata but does not enumerate the full SegmOnto vocabulary.
What experiments were performed?
The ICDAR 2024 paper evaluates HTR baselines (not layout analysis) on the companion CATMuS Medieval HTR dataset. Two HTR engines were benchmarked across two split strategies:
- General split (90/5/5): Every manuscript appears in all partitions. This tests in-domain generalization.
- Feature-based split (~85/7/7): Entire manuscripts are assigned to a single partition. This tests out-of-domain generalization to unseen manuscripts, scripts, and centuries.
| Engine | Split | Train Time (min) | Val CER (%) | Test CER (%) | Space CER (%) |
|---|---|---|---|---|---|
| Kraken v4.3.10 | General | $2112 \pm 163$ | $5.7 \pm 0.07$ | $4.7 \pm 0.06$ | $1.0 \pm 0.02$ |
| Kraken v4.3.10 | Feature | $1464 \pm 238$ | $6.8 \pm 0.16$ | $13.1 \pm 0.24$ | $2.7 \pm 0.06$ |
| Pylaia v1.1.0 | General | $308 \pm 47$ | $9.1 \pm 0.63$ | $8.4 \pm 0.73$ | $1.8 \pm 0.11$ |
| Pylaia v1.1.0 | Feature | $295 \pm 78$ | $11.3 \pm 0.24$ | $21.2 \pm 0.92$ | $3.8 \pm 0.06$ |
The large CER gap between general and feature-based splits (e.g., 4.7% vs. 13.1% for Kraken) highlights the difficulty of generalizing to unseen manuscripts. Space-related errors are a notable component, reflecting inconsistent word separation practices across medieval languages and centuries.
No layout analysis models were evaluated on the segmentation dataset. The segmentation annotations serve as ground truth for line extraction in the HTR pipeline but have not been benchmarked independently for object detection or instance segmentation tasks.
What are the outcomes/conclusions?
The CATMuS Medieval project demonstrates that cross-institutional collaboration on annotation standards is feasible for historical manuscripts, producing a dataset of 208 documents, 165,347 lines, and 5.77 million characters across 10 languages and 9 centuries.
Key limitations:
- Imbalance: The Carthusian Monastery of Herne project alone contributes 27% of lines, creating overrepresentation of 14th-century Dutch Textualis. Old French, Latin, and Castilian together exceed 50% of the data.
- Rare languages: Venetian (2 documents), Navarese (1), and Old English (1) have minimal coverage.
- Image quality variation: 30 of 203 documents are grayscale (from microfilm), with varying resolution across sources.
- Layout coverage: The segmentation dataset (1,680 pages, per the HuggingFace release) is smaller than the full HTR dataset (165K+ lines across 208 documents), suggesting not all pages have full region-level annotations.
- No layout baselines: The paper does not evaluate layout detection performance, so it is unknown how well modern detectors handle the SegmOnto class set on this data.
- No segmentation pipeline code: The XML-to-Parquet conversion, SegmOnto remapping, and polygon extraction steps used to produce the segmentation dataset are not released as code.
Reproducibility
Models
Not applicable. This is a dataset paper. No layout models are proposed or released. The HTR baselines use default Kraken and recommended Pylaia architectures without modification.
Algorithms
HTR training hyperparameters reported in the paper:
- Batch size: 64
- Seeds: 21, 42, 84 (three runs per configuration)
- Early stopping: 5-epoch patience
- Learning rate: $1 \times 10^{-4}$ (Kraken), $5 \times 10^{-4}$ (Pylaia)
- Deterministic mode: Enabled
- Configuration: Default specifications for Kraken; recommended specifications for Pylaia (per Maarand et al.)
No layout-specific training algorithms are reported.
Data
The segmentation dataset is available on HuggingFace in Parquet format (statistics from the HuggingFace dataset card):
- Train: 1,340 pages
- Validation: 191 pages
- Test: 158 pages
- Total: 1,680 pages from 159 manuscripts
Each record includes the page image, bounding boxes, polygon masks, baselines, SegmOnto category labels, block/line type, parent-child hierarchy, manuscript shelfmark, century, and source project.
The broader HTR dataset is available separately on HuggingFace (CATMuS/medieval) and Zenodo, containing 165,347 extracted line images with transcriptions and rich metadata (language, script type, genre, century).
Annotation:
- Tool: eScriptorium (primary), Transkribus (Middle Dutch subset)
- Annotators: 45+ contributors across 14 source projects
- Process: Automatic segmentation via BLLA model, followed by manual correction
- Format: Polygonal masks with baselines; hierarchical (lines linked to parent blocks)
Evaluation
No layout evaluation metrics or baselines are reported for the segmentation dataset. HTR evaluation uses Character Error Rate (CER):
$$ \text{CER} = \frac{S + D + I}{N} $$
where $S$, $D$, and $I$ are the number of substitutions, deletions, and insertions respectively, and $N$ is the number of characters in the reference. The paper also reports a “space-related” CER isolating spacing errors, which is relevant given medieval word separation variability.
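A minimal pure-Python implementation of this formula via the standard edit-distance recurrence:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance (S + D + I) over N."""
    m = len(hypothesis)
    prev = list(range(m + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[m] / max(len(reference), 1)
```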
Hardware
HTR training was performed on NVIDIA RTX 8000 GPUs with 12 CPU cores per run. Kraken training took substantially longer than Pylaia ($\sim$2100 min vs. $\sim$300 min on the general split). No layout-specific training was reported.
BibTeX
@inproceedings{clerice2024catmus,
author = {Cl{\'e}rice, Thibault and Pinche, Ariane and Vlachou-Efstathiou, Malamatenia and Chagu{\'e}, Alix and Camps, Jean-Baptiste and Gille Levenson, Matthias and Brisville-Fertin, Olivier and Boschetti, Federico and Fischer, Franz and Gervers, Michael and Boutreux, Agn{\`e}s and Manton, Avery and Gabay, Simon and O'Connor, Patricia and Haverals, Wouter and Kestemont, Mike and Vandyck, Caroline and Kiessling, Benjamin},
title = {{CATMuS Medieval: A multilingual large-scale cross-century dataset in Latin script for handwritten text recognition and beyond}},
booktitle = {Document Analysis and Recognition -- ICDAR 2024},
pages = {174--194},
year = {2024},
publisher = {Springer},
address = {Athens, Greece},
doi = {10.1007/978-3-031-70543-4_11}
}
Comp-HRDoc: Comprehensive Benchmark for Hierarchical Document Structure Analysis
TL;DR
This paper introduces DOC (Detect-Order-Construct), a three-stage pipeline for hierarchical document structure analysis that casts each stage as a relation prediction problem solved by multi-modal transformers. Alongside the method, the authors release Comp-HRDoc, a comprehensive benchmark built on HRDoc-Hard (1,500 documents) that evaluates four tasks simultaneously: page object detection, reading order prediction, table of contents extraction, and hierarchical structure reconstruction. DOC improves over prior methods across all four tasks, with margins detailed below.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the DOC framework itself: a three-stage architecture that decomposes hierarchical document structure analysis into Detect (page object detection + text region grouping), Order (reading order prediction), and Construct (TOC and tree building). Each stage is formulated as a relation prediction problem with dedicated transformer-based models, ablations, and comparisons against published results across multiple benchmarks.
Secondary: $\Psi_{\text{Resource}}$. Comp-HRDoc is a new benchmark extending HRDoc-Hard with annotations for four tasks, publicly released with evaluation scripts under MIT license. Document images must be obtained separately from the original HRDoc dataset.
Secondary: $\Psi_{\text{Evaluation}}$. The paper introduces REDS (Reading Edit Distance Score), a new metric for reading order evaluation that uses Levenshtein distance with Hungarian matching across multiple reading order groups.
What is the motivation?
Prior work on document structure analysis treats sub-tasks (layout detection, reading order, TOC extraction, hierarchical reconstruction) independently, using separate models and datasets. No unified benchmark existed to evaluate all four tasks together on the same documents. The original HRDoc dataset only provided line-level semantic labels and hierarchical annotations, lacking standard page-object-level detection and reading order evaluation.
The authors also argue that existing reading order metrics (BLEU, nDCG, Spearman) are borrowed from other domains and do not adequately capture document-specific errors like paragraph segmentation mistakes.
What is the novelty?
DOC Framework
The core insight is that all three stages of hierarchical document structure analysis can be unified as relation prediction problems:
- Detect: A hybrid approach combining top-down graphical object detection (DINO or Mask2Former) with bottom-up text region detection. Text-lines from a PDF parser are grouped into text regions via intra-region reading order prediction, formulated as dependency parsing:
$$P(j \mid i) = \frac{\exp(f_{ij})}{\sum_{k} \exp(f_{ik})}$$
where $f_{ij}$ is a pairwise compatibility score between text-lines $i$ and $j$, computed from multi-modal features (visual via RoIAlign, text via BERT-BASE, 2D positional embeddings). Union-Find groups text-lines into regions based on predicted links.
- Order: Detected page objects are encoded with attention-fused text-line features and region type embeddings, processed through a 3-layer transformer encoder. Inter-region reading order is predicted using the same dependency parsing formulation, with a separate classification head distinguishing text region reading order from graphical region relationships (caption-to-figure/table links).
- Construct: Section headings (in predicted reading order) are encoded with Rotary Positional Embeddings (RoPE) and processed by a transformer encoder. A tree-aware TOC relation prediction head predicts both parent-child and sibling relationships:
$$P_{\text{parent}}(j \mid i) = \frac{\exp(f^{p}_{ij})}{\sum_{k} \exp(f^{p}_{ik})}$$
$$P_{\text{sibling}}(j \mid i) = \frac{\exp(f^{s}_{ij})}{\sum_{k} \exp(f^{s}_{ik})}$$
A serial tree insertion algorithm (Algorithm 1) scores candidate positions by multiplying parent and sibling probabilities along the rightmost sub-tree and inserts each heading at the best-scoring position.
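A sketch of that insertion loop under assumed probability tables (the paper describes Algorithm 1 but releases no code, so this is a reconstruction from the text):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    idx: int                                   # heading index; -1 marks the root
    children: list["Node"] = field(default_factory=list)

def insert_heading(root: Node, h: int, p_parent, p_sibling) -> None:
    """Insert heading h at the best-scoring spot on the rightmost path.

    p_parent[i][h] / p_sibling[i][h] are predicted probabilities that
    heading i is the parent / left sibling of h.
    """
    best_score, best_parent = -1.0, root
    node = root
    while True:
        # Candidate: h becomes the last child of `node`; its left sibling
        # would be node's current rightmost child, if any.
        parent_p = 1.0 if node.idx < 0 else p_parent[node.idx][h]
        last = node.children[-1].idx if node.children else None
        sib_p = 1.0 if last is None else p_sibling[last][h]
        if parent_p * sib_p > best_score:
            best_score, best_parent = parent_p * sib_p, node
        if not node.children:
            break
        node = node.children[-1]               # descend the rightmost sub-tree
    best_parent.children.append(Node(h))
```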
Comp-HRDoc Benchmark
Extends HRDoc-Hard (1,500 documents, 1,000 train / 500 test) with:
- Page object detection annotations (12 categories: Title, Author, Mail, Affiliate, Section heading, Paragraph, Table, Figure, Caption, Footnote, Header, Footer) evaluated via segmentation-based mAP (not box-based, since logical paragraphs may span multiple columns)
- Reading order prediction evaluated via REDS
- TOC extraction and hierarchical structure reconstruction evaluated via Semantic-TEDS (Micro and Macro)
REDS Metric
REDS (Reading Edit Distance Score) addresses shortcomings of borrowed metrics (BLEU, nDCG) by using Levenshtein edit distance with Hungarian matching across reading order groups:
$$\text{REDS} = 1 - \frac{D}{N}$$
where $D$ is the minimum total Levenshtein distance after Hungarian matching across predicted and ground-truth reading order groups, and $N$ is the total number of basic evaluation units. It includes a special paragraph-ending tag to penalize paragraph segmentation errors.
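A sketch of the metric's core computation, using scipy's Hungarian solver; the handling of unmatched groups and of the paragraph-ending tag is simplified relative to the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a, b) -> int:
    """Levenshtein distance over two sequences of reading-order unit ids."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def reds(pred_groups, gt_groups) -> float:
    """REDS = 1 - D/N after Hungarian matching across reading-order groups."""
    if not pred_groups or not gt_groups:
        return 0.0
    cost = np.array([[edit_distance(p, g) for g in gt_groups]
                     for p in pred_groups])
    rows, cols = linear_sum_assignment(cost)
    d = cost[rows, cols].sum()
    # Unmatched ground-truth groups count fully against the score.
    d += sum(len(g) for k, g in enumerate(gt_groups) if k not in set(cols))
    n = sum(len(g) for g in gt_groups)
    return max(0.0, 1.0 - d / n)
```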
What experiments were performed?
The authors evaluate DOC on four benchmarks:
- PubLayNet (360K pages, 5 classes): Page object detection. COCO box-based mAP.
- DocLayNet (80K pages, 11 classes): Page object detection. COCO box-based mAP.
- HRDoc (Simple: 1,000 docs; Hard: 1,500 docs): Semantic unit classification (F1) and hierarchical structure reconstruction (Semantic-TEDS).
- Comp-HRDoc (1,500 docs): All four tasks with dedicated metrics.
Baselines include Mask R-CNN, Faster R-CNN, YOLOv5, DINO, Mask2Former, DiT-L, VSR, LayoutLMv3, and the DSPS Encoder from HRDoc.
Ablation studies cover: (1) hybrid detection strategy effectiveness, (2) modality contributions in the Construct module, (3) component contributions to TOC extraction (sibling head, tree insertion algorithm, loss function choice).
What are the outcomes/conclusions?
Page Object Detection
| Benchmark | DOC | Previous Best | Delta |
|---|---|---|---|
| DocLayNet (mAP) | 81.0 | 76.8 (YOLOv5) | +4.2 |
| PubLayNet val (mAP) | 96.5 | 96.0 (TRDLU) | +0.5 |
| PubLayNet test (mAP) | 96.5 | 95.7 (VSR) | +0.8 |
| Comp-HRDoc (seg mAP) | 88.06 | 73.54 (Mask2Former) | +14.52 |
The large Comp-HRDoc gap (+14.52) comes from the hybrid strategy: top-down detection handles graphical objects well, while bottom-up text-line grouping produces more precise text region boundaries than pure detection approaches.
Reading Order, TOC, and Hierarchy (Comp-HRDoc)
| Task | Metric | DOC | Previous Best |
|---|---|---|---|
| Reading Order (Text) | REDS | 0.932 | 0.774 (Lorenzo et al.) |
| Reading Order (Graphical) | REDS | 0.864 | 0.858 (Lorenzo et al.) |
| TOC Extraction | Micro-STEDS | 0.861 | 0.676 (MTD) |
| TOC Extraction | Macro-STEDS | 0.879 | 0.710 (MTD) |
| Hierarchical Reconstruction | Micro-STEDS | 0.837 | 0.690 (DSPS Encoder) |
| Hierarchical Reconstruction | Macro-STEDS | 0.837 | 0.697 (DSPS Encoder) |
Key Ablation Findings
- Section number information is critical for the Construct module: without section numbers in the text features, Micro-STEDS drops from 0.861 to 0.641.
- The tree insertion algorithm is essential: removing it drops Micro-STEDS from 0.861 to 0.711.
- Softmax cross-entropy loss significantly outperforms binary cross-entropy for relation prediction (0.861 vs. 0.700).
Limitations
- The Construct module depends on accurate section heading recognition; errors in detection propagate to tree construction.
- Section number information is critical for TOC extraction; the approach may not generalize to documents without numbered sections.
- Only tree-structured documents are supported; graph-based logical structures are not addressed.
- The system assumes text-line bounding boxes and content are provided by a PDF parser or OCR engine, limiting direct applicability to image-only documents without a prior OCR step.
- Inconsistent paragraph annotations in HRDoc may affect paragraph detection performance.
Reproducibility
Models
- Detect: ResNet-50 backbone (PubLayNet/DocLayNet) or ResNet-18 (HRDoc/Comp-HRDoc). Graphical object detector: DINO (PubLayNet/DocLayNet) or Mask2Former (HRDoc/Comp-HRDoc). Custom bottom-up text region detector with 1-layer transformer encoder (12 heads, 768-dim, 2048-dim FFN).
- Order: 3-layer transformer encoder for inter-region relation prediction
- Construct: Transformer encoder with RoPE; tree-aware relation prediction head with parent-child and sibling sub-heads
- Text encoder: BERT-BASE (pretrained), averaged token embeddings per text-line
- No model weights or training code released. The GitHub repo contains only benchmark annotations and evaluation scripts.
Algorithms
- Optimizer: AdamW (betas 0.9/0.999, epsilon 1e-8)
- Learning rates: 1e-5 (backbone) / 2e-5 (BERT) for PubLayNet/DocLayNet; 2e-4 (backbone) / 4e-5 (BERT) for HRDoc/Comp-HRDoc
- Weight decay: 1e-4 (backbone, PubLayNet/DocLayNet); 1e-2 (backbone, HRDoc/Comp-HRDoc); 1e-2 (BERT, all)
- Schedule: PubLayNet/DocLayNet: step decay (LR / 10 at epoch 11 / 20 respectively). HRDoc/Comp-HRDoc: linear warmup (2 epochs) then linear decay to 0.
- Epochs: 12 (PubLayNet), 24 (DocLayNet), 20 (HRDoc/Comp-HRDoc)
- Batch size: 16 (PubLayNet/DocLayNet), 1 (HRDoc/Comp-HRDoc, multi-page documents)
- Multi-scale training: Shorter side randomly sampled from [512, 640, 768] (max 800) or [320, 416, 512, 608, 704, 800] (max 1024)
- Test resolution: Shorter side 640 (PubLayNet/DocLayNet), 512 (HRDoc/Comp-HRDoc)
- Loss: Softmax cross-entropy for relation prediction; standard detection losses for DINO/Mask2Former
- Framework: PyTorch v1.10
Data
- Comp-HRDoc: 1,000 train / 500 test documents, built on HRDoc-Hard. Annotations publicly available at GitHub under MIT license. Document images must be obtained separately from the HRDoc dataset (license unknown).
- PubLayNet: 340K train / 12K val / 12K test. CDLA-Perm-1.0 (annotations); underlying PDFs non-commercial.
- DocLayNet: 69K train / 6.5K test / 5K val. CDLA-Perm-1.0 (annotations).
- HRDoc: Simple (1,000 docs) and Hard (1,500 docs). Mixed licenses (CC-BY, CC-BY-NC-SA).
Evaluation
- Page object detection: COCO-style mAP (box-based for PubLayNet/DocLayNet; segmentation-based for Comp-HRDoc)
- Reading order: REDS (new metric introduced in this paper)
- TOC extraction and hierarchy: Semantic-TEDS Micro and Macro (from HRDoc)
- Semantic unit classification (HRDoc): Per-category F1, Micro/Macro averages
- Comp-HRDoc evaluation scripts released at GitHub
- No error bars or multi-run statistics reported
Hardware
- Training: 8 NVIDIA Tesla V100 GPUs (32 GB each)
- No inference latency, throughput, or cost estimates reported
- Batch size of 1 for Comp-HRDoc due to multi-page document processing; memory constraints likely driven by processing dozens of pages per document
BibTeX
@article{wang2024detect,
title={Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis},
author={Wang, Jiawei and Hu, Kai and Zhong, Zhuoyao and Sun, Lei and Huo, Qiang},
journal={Pattern Recognition},
volume={156},
pages={110836},
year={2024},
publisher={Elsevier},
doi={10.1016/j.patcog.2024.110836}
}
DocHieNet: A Large and Diverse Dataset for Document Hierarchy Parsing
TL;DR
DocHieNet is a manually annotated, multi-page, bilingual (English + Chinese) dataset of 1,673 documents (15,610 pages) spanning legal, financial, educational, and scientific domains for document hierarchy parsing. It annotates 187K+ layout elements across 19 categories with hierarchical and sequential relationships, including 37.4% cross-page relations. The authors also propose DHFormer, a transformer-based baseline that uses sparse text-layout encoding and a global layout element decoder, reaching 77.82 F1 on DocHieNet while achieving 93-99 F1 on prior datasets.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the DocHieNet dataset: a multi-domain, bilingual, multi-page dataset for document hierarchy parsing with manually verified annotations, filling a gap left by existing single-domain or single-page resources. The paper devotes its primary sections to data collection, annotation design, dataset statistics, and comparisons with existing resources.
Secondary: $\Psi_{\text{Method}}$. DHFormer is a non-trivial methodological contribution: a sparse text-layout encoder with chunked attention plus a global layout element decoder for multi-page hierarchy prediction, evaluated across five datasets with ablations.
What is the motivation?
Existing document hierarchy parsing datasets are limited in scope:
- arXivdocs is single-page and English-only; E-Periodica is single-page (though multilingual: EN, DE, FR, IT). Neither captures cross-page relationships.
- HRDoc covers only scientific papers with automatically generated (not manually verified) annotations at the text-line level
- No existing dataset spans multiple document domains (legal, financial, educational, technical)
The authors argue that layout-block-level annotation is more practical than text-line-level because (a) it leverages existing layout analysis models, (b) text-line evaluation inflates scores since trivial “connect” relations dominate, and (c) layout-block evaluation better reflects actual hierarchy prediction quality.
What is the novelty?
DocHieNet Dataset
- 1,673 documents (1,110 English, 563 Chinese) across 15,610 pages with 187K+ layout elements
- 19 layout element categories: title, sub-title, section-title, text, formula, TOC-title, TOC, figure, figure-title, figure-caption, table, table-title, table-caption, header, footer, page-number, footnote, endnote, sidebar
- Two relationship types: hierarchical (parent-child) and sequential (reading order)
- Multi-domain: Legal, financial, educational, technical, scientific, government publications
- Cross-page relations: 37.4% macro-average ratio with 5.4-page average span (vs. HRDoc’s 24.9% ratio and 2.4-page span)
- Manual annotation: 12 experienced annotators with 3 rounds of specialist quality checks
- Split: 1,512 train / 161 test documents. 835 training documents are truncated (partially annotated) to reduce pattern redundancy; test documents are fully annotated.
DHFormer
A transformer-based framework for document hierarchy parsing with three components:
Sparse Text-Layout Encoder: Uses a pretrained layout-aware language model (GeoLayoutLM). Documents are split into $K$ chunks where each chunk contains the maximum number of layout elements whose total tokens do not exceed a limit $l = 512$. Attention is dense within chunks but factorized across chunks, reducing complexity from $O(N^2)$ to $O(l \cdot N)$.
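The chunking rule itself is simple; here is a sketch assuming elements are packed greedily in document order (DHFormer's code is unreleased, so this is a reading of the description, not the implementation):

```python
def pack_chunks(elements: list, token_counts: list[int], limit: int = 512):
    """Greedily pack layout elements into chunks of at most `limit` tokens,
    preserving document order; attention is dense within a chunk only."""
    chunks, current, used = [], [], 0
    for elem, n in zip(elements, token_counts):
        n = min(n, limit)  # an oversized element still fits a chunk alone
        if current and used + n > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(elem)
        used += n
    if current:
        chunks.append(current)
    return chunks
```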
Page and Inner-Layout Embeddings: Sinusoidal page embeddings encode absolute page number; 1D inner-layout position embeddings encode relative token position within each layout element.
Global Layout Element Decoder: A 2-layer transformer decoder with Shifted Sparse Attention (window size 48) that pools each layout element’s representation via its first token. A learnable root embedding is prepended. Relation prediction uses a bilinear layer with sigmoid activation trained with cross-entropy loss.
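A sketch of the bilinear relation scorer under assumed shapes (pooled element features of size `dim`, root embedding already prepended); again a reconstruction, not released code:

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Score P(i is parent of j) for all element pairs via a bilinear layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, elems: torch.Tensor) -> torch.Tensor:
        # elems: (N, dim) pooled layout-element features, root at index 0.
        n, d = elems.shape
        parents = elems.unsqueeze(1).expand(n, n, d).reshape(-1, d)
        children = elems.unsqueeze(0).expand(n, n, d).reshape(-1, d)
        scores = self.bilinear(parents, children).view(n, n)
        return torch.sigmoid(scores)
```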
What experiments were performed?
The authors evaluate DHFormer on five datasets, all mapped to DocHieNet’s annotation format for comparability:
- DocHieNet (1,673 docs, multi-domain, bilingual)
- arXivdocs (362 single-page scientific documents)
- HRDoc-Simple (1,000 docs) and HRDoc-Hard (1,500 docs)
- E-Periodica (542 single-page magazine pages)
Metrics: F1 and TEDS (Tree Edit Distance Score).
Baselines: DocParser, DSPS (HRDoc baseline), DOC (Comp-HRDoc baseline), DSG (graph-based), GPT-4 (in-context learning), Llama2-7b (LoRA fine-tuned).
Ablations: (1) encoder choice (GeoLayoutLM vs. LayoutLMv3 vs. BROS vs. XLM-RoBERTa), (2) chunking strategy and window size, (3) page and inner-layout embeddings, (4) text-line vs. layout-block annotation paradigm, (5) LLM comparison across document lengths.
What are the outcomes/conclusions?
Main Results (all datasets in DocHieNet format)
| Dataset | DHFormer F1 | DHFormer TEDS | Next Best F1 | Next Best TEDS |
|---|---|---|---|---|
| arXivdocs | 98.45 | 95.04 | 81.17 (DSG) | 72.47 (DSG) |
| HRDoc-Simple | 99.34 | 98.69 | 84.78 (DSG) | 83.24 (DSG) |
| HRDoc-Hard | 93.40 | 89.14 | 74.04 (DSG) | 64.33 (DSG) |
| E-Periodica | 92.53 | 84.85 | 67.17 (DSG) | 60.14 (DSG) |
| DocHieNet | 77.82 | 57.64 | 53.51 (DSG) | 33.90 (DSG) |
DocHieNet is considerably more challenging than prior datasets: even DHFormer achieves only 77.82 F1, compared to 93+ on the others.
Key Findings
- Encoder matters: GeoLayoutLM (layout-aware) outperforms XLM-RoBERTa (text-only) by 8.7 F1 points (77.82 vs. 69.13).
- Both embedding types help: Page embeddings add ~2 F1 and inner-layout position embeddings add ~1.5 F1; combined they add ~4 F1 (73.66 to 77.82).
- Chunking strategy: Layout-element-aligned chunking at 512 tokens (77.82 F1) outperforms page-level chunking (75.66) and stride-based chunking (73.98).
- LLM baselines lag behind: DHFormer (~80 F1) outperforms fine-tuned Llama2-7b (~55-65 F1) and GPT-4 ICL (~25-50 F1), with LLM performance degrading sharply as document length increases (approximate values read from Figure 5; no table provided).
- Annotation paradigm: Text-line-level evaluation (as in HRDoc) inflates scores because trivial “connect” relations dominate. Layout-block evaluation is more informative.
Limitations
- The dataset, while multi-domain, may not cover all document variations in the wild.
- 835 of 1,673 training documents are truncated (partially annotated), which may affect model training for long-document scenarios.
- The dataset requires layout element detection as a prerequisite; end-to-end evaluation with a layout detector (CenterNet) drops F1 from 99.34 to 94.32 on HRDoc-Simple.
- No arXiv preprint available; the paper is only accessible through ACL Anthology.
Reproducibility
Models
- DHFormer: Sparse text-layout encoder (GeoLayoutLM default) + 2-layer transformer decoder with Shifted Sparse Attention (window size 48)
- Text encoder variants evaluated: GeoLayoutLM, LayoutLMv3, BROS, XLM-RoBERTa
- The GitHub repo contains only a README and LICENSE file; no runnable training or inference code is available. The README states data and model “will be made publicly available in the near future.” The dataset is hosted on ModelScope.
- No pretrained model weights or evaluation scripts released.
Algorithms
- Optimizer: AdamW, base LR 4e-5 decaying to 1e-6
- Epochs: 100
- Max tokens per chunk: 512
- Max chunks: 32 (training), 128 (testing)
- Decoder: 2-layer transformer, window size 48 for Shifted Sparse Attention
- Relation prediction: Bilinear layer with sigmoid, cross-entropy loss
- Batch size: Not reported
- Framework: Not explicitly stated (likely PyTorch given GeoLayoutLM dependency)
Data
- DocHieNet: 1,512 train / 161 test documents. Available on ModelScope under Apache-2.0.
- Languages: English (1,110 docs) + Chinese (563 docs)
- Annotation: Manual by 12 annotators with 3 rounds of specialist QC
- 835 training documents are truncated (partially annotated); all 161 test documents are fully annotated
- Comparison datasets (arXivdocs, HRDoc, E-Periodica) available separately with their own licenses
Evaluation
- Metrics: F1 (relation-level) and TEDS (Tree Edit Distance Score: 1 minus the normalized tree edit distance between predicted and ground-truth document hierarchies; document-level)
- All datasets mapped to DocHieNet annotation format for cross-dataset comparison
- Cross-format evaluation also performed (train in one format, test in another)
- No error bars or multi-run statistics reported
- LLM comparison included (GPT-4 ICL, Llama2-7b LoRA)
Hardware
- DHFormer training: 2 NVIDIA Tesla V100 GPUs
- Llama2-7b fine-tuning: 2 NVIDIA A100 GPUs, LoRA (rank 8, alpha 32, dropout 0.05)
- No inference latency or throughput reported
BibTeX
@inproceedings{xing2024dochiennet,
title={DocHieNet: A Large and Diverse Dataset for Document Hierarchy Parsing},
author={Xing, Hangdi and Cheng, Changxu and Gao, Feiyu and Shao, Zirui and Yu, Zhi and Bu, Jiajun and Zheng, Qi and Yao, Cong},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
pages={1129--1142},
year={2024},
doi={10.18653/v1/2024.emnlp-main.65}
}
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
TL;DR
LayoutLMv2 is the second generation of the LayoutLM family, introducing visual features directly into the pre-training stage (rather than only at fine-tuning time, as in LayoutLMv1). It combines a multi-modal Transformer encoder with a ResNeXt-101 FPN visual backbone, a spatial-aware self-attention mechanism for 2D relative position modeling, and two new pre-training objectives (Text-Image Alignment and Text-Image Matching). It achieved SOTA on six document understanding benchmarks at the time of publication.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$
The headline contribution is the model architecture and pre-training strategy. The paper introduces three key innovations: (1) early visual integration via a trainable CNN backbone, (2) spatial-aware self-attention with 2D relative position biases, and (3) two new cross-modal pre-training objectives. The majority of the paper describes these components and their ablations.
Secondary: $\Psi_{\text{Evaluation}}$
The paper evaluates on six diverse benchmarks (FUNSD, CORD, SROIE, Kleister-NDA, RVL-CDIP, DocVQA) and provides a systematic ablation study isolating the contribution of each component.
What is the motivation?
LayoutLMv1 demonstrated that jointly pre-training text and layout (2D position) embeddings from large-scale unlabeled documents improves downstream document understanding tasks. However, v1 had a significant limitation: visual features from a CNN were only added during fine-tuning, not during pre-training. This meant the model could not learn cross-modal interactions between text, layout, and visual appearance from the 11M unlabeled IIT-CDIP pages.
The authors also identified that absolute 2D position embeddings (as used in v1) encode where a token is on the page but not how tokens relate to each other spatially. Two tokens in the same relative arrangement (e.g., a label above a value) at different absolute positions would produce different representations.
What is the novelty?
1. Early Visual Integration
Instead of adding visual features only at fine-tuning time, LayoutLMv2 passes the document page image through a ResNeXt-101 FPN visual encoder during pre-training. The output feature map is average-pooled to $W \times H = 7 \times 7 = 49$ visual tokens, each projected to the same dimensionality as text tokens. Visual and text tokens are concatenated into a single sequence:
$$x_i^{(0)} = X_i + l_i, \quad X = \{v_0, \ldots, v_{WH-1}, t_0, \ldots, t_{L-1}\}$$
where $l_i$ is the layout (2D position) embedding and $X_i$ is either a visual or text embedding. The CNN backbone is updated through backpropagation during pre-training, initialized from a Mask R-CNN trained on PubLayNet.
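A sketch of how the input sequence is assembled, with assumed shapes and an assumed projection layer (`proj`); this is not the authors' code:

```python
import torch
import torch.nn.functional as F

def build_input_sequence(feat_map: torch.Tensor, text_emb: torch.Tensor,
                         proj: torch.nn.Module,
                         layout_emb: torch.Tensor) -> torch.Tensor:
    """Pool the backbone feature map to 7x7 visual tokens and concatenate
    them with text embeddings, then add layout (2D position) embeddings."""
    pooled = F.adaptive_avg_pool2d(feat_map, (7, 7))  # (B, C, 7, 7)
    vis = pooled.flatten(2).transpose(1, 2)           # (B, 49, C)
    vis = proj(vis)                                   # (B, 49, d), match text dim
    seq = torch.cat([vis, text_emb], dim=1)           # visual tokens come first
    return seq + layout_emb                           # layout_emb: (B, 49 + L, d)
```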
2. Spatial-Aware Self-Attention
The standard self-attention score $\alpha_{ij}$ is augmented with learnable relative position biases:
$$\alpha'_{ij} = \alpha_{ij} + b^{(1D)}_{j-i} + b^{(2D_x)}_{x_j - x_i} + b^{(2D_y)}_{y_j - y_i}$$
where $(x_i, y_i)$ are the top-left bounding box coordinates of the $i$-th token. The biases are shared across encoder layers but differ across attention heads, adding minimal parameters while encoding spatial relationships.
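In code, the bias term amounts to indexing per-head tables with bucketed relative distances; a sketch with the bucketing itself omitted:

```python
import torch

def spatial_aware_scores(alpha: torch.Tensor,
                         rel_1d: torch.Tensor,
                         rel_x: torch.Tensor,
                         rel_y: torch.Tensor,
                         b_1d: torch.Tensor,
                         b_2dx: torch.Tensor,
                         b_2dy: torch.Tensor) -> torch.Tensor:
    """Add learnable relative-position biases to raw attention scores.

    alpha: (heads, L, L) dot-product scores; rel_*: (L, L) long tensors of
    bucketed relative distances; b_*: (heads, num_buckets) bias tables,
    shared across layers but distinct per head.
    """
    return alpha + b_1d[:, rel_1d] + b_2dx[:, rel_x] + b_2dy[:, rel_y]
```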
3. New Pre-training Objectives
In addition to Masked Visual-Language Modeling (MVLM, inherited from v1):
- Text-Image Alignment (TIA): Randomly cover line-level image regions and train the model to predict which text tokens have been covered. This enforces fine-grained spatial correspondence between text and image. Image regions corresponding to MVLM-masked tokens are also covered to prevent visual leakage.
- Text-Image Matching (TIM): Binary classification on whether the image and text come from the same page. 15% of images are replaced with a different document page; 5% are dropped entirely.
What experiments were performed?
Pre-training
Pre-trained on 11M pages from IIT-CDIP Test Collection. Text is extracted via OCR (Microsoft Read API). Two model sizes: BASE (200M params, 12 layers, $d=768$) and LARGE (426M params, 24 layers, $d=1024$). Both use the same ResNeXt-101 FPN visual backbone. Text encoder initialized from UniLMv2; visual backbone initialized from a Mask R-CNN trained on PubLayNet.
Downstream Tasks
Six benchmarks spanning four task types:
| Task Type | Dataset | Metric | v1 LARGE | v2 LARGE |
|---|---|---|---|---|
| Entity Extraction | FUNSD | F1 | 0.7895 | 0.8420 |
| Entity Extraction | CORD | F1 | 0.9493 | 0.9601 |
| Entity Extraction | SROIE | F1 | 0.9524 | 0.9781* |
| Entity Extraction | Kleister-NDA | F1 | 0.8340 | 0.8520 |
| Classification | RVL-CDIP | Acc | 94.43% | 95.64% |
| Visual QA | DocVQA | ANLS | 0.7295 | 0.8672 |
*SROIE F1 of 0.9781 excludes OCR mismatch errors; the raw F1 is 0.9661. The DocVQA score of 0.8672 uses train+dev fine-tuning with question-generation data augmentation.
Ablation Study (DocVQA val, BASE size)
The ablation isolates each component’s contribution:
| Setting | ANLS |
|---|---|
| LayoutLM (text+layout only, BERT init) | 0.6841 |
| + visual module (MVLM only) | 0.6915 |
| + TIA | 0.7061 |
| + TIM | 0.6955 |
| + TIA + TIM | 0.7124 |
| + TIA + TIM + spatial-aware attention | 0.7217 |
| + UniLMv2 init (final v2 BASE) | 0.7421 |
TIA contributes more than TIM. Spatial-aware self-attention adds roughly 1 ANLS point. Better text-side initialization (UniLMv2 over BERT) provides a further 2-point gain.
What are the outcomes/conclusions?
LayoutLMv2 LARGE set new SOTA on all six benchmarks at the time of publication, with particularly large gains on DocVQA (+13.8 ANLS over v1) and FUNSD (+5.3 F1 over v1). The results demonstrate that integrating visual features during pre-training, rather than only at fine-tuning, yields substantial improvements across diverse document understanding tasks.
The ablation confirms that each component contributes: visual integration, TIA, TIM, and spatial-aware attention all provide additive gains. The largest single contributor is the visual module with TIA, suggesting that fine-grained text-image spatial alignment is the most impactful pre-training signal.
Limitations
- The model requires OCR as a preprocessing step, inheriting OCR errors.
- The visual backbone (ResNeXt-101 FPN) adds computational cost; the 49 visual tokens increase sequence length.
- Pre-training on 11M IIT-CDIP pages is computationally expensive (exact GPU-hours not reported).
- The CC-BY-NC-SA-4.0 license on released model weights restricts commercial use.
- No layout detection experiments are reported; the paper focuses on document understanding (entity extraction, classification, VQA) rather than bounding-box detection.
Reproducibility
Models
- LayoutLMv2 BASE: 200M parameters. 12-layer, 12-head Transformer encoder, $d = 768$. ResNeXt-101 FPN visual backbone.
- LayoutLMv2 LARGE: 426M parameters. 24-layer, 16-head Transformer encoder, $d = 1024$. Same visual backbone.
- Both model checkpoints released on Hugging Face under CC-BY-NC-SA-4.0.
- Text encoder initialized from UniLMv2; visual backbone initialized from Mask R-CNN trained on PubLayNet (“MaskRCNN ResNeXt101_32x8d FPN 3X” from detectron2).
Algorithms
- Pre-training: Maximum sequence length $L = 512$. Visual tokens: $7 \times 7 = 49$. MVLM masks 15% of tokens (80% [MASK], 10% random, 10% unchanged). TIA covers 15% of lines. TIM replaces 15% of images and drops 5%.
- Optimizer: Adam with learning rate $2 \times 10^{-5}$, weight decay $1 \times 10^{-2}$, $(\beta_1, \beta_2) = (0.9, 0.999)$. Linear warmup over the first 10% of steps, then linear decay.
- Pre-training schedule: BASE is trained with batch size 64 for 5 epochs; LARGE is trained with batch size 2048 for 20 epochs on IIT-CDIP. (The ablation study uses 1-epoch runs for efficiency.)
- Fine-tuning: Task-specific heads on top of encoder outputs. DocVQA additionally benefits from a question-generation data augmentation step. Fine-tuning hyperparameters (learning rates, batch sizes, epochs per task) are not reported in the paper.
Data
- Pre-training: IIT-CDIP Test Collection 1.0 (11M document images). Text extracted via Microsoft Read API (a commercial OCR service). The IIT-CDIP dataset requires institutional access and is not freely downloadable.
- Downstream: FUNSD (149/50 train/test), CORD (800/100/100), SROIE (626/347), Kleister-NDA (254/83/203), RVL-CDIP (320K/4K/4K), DocVQA (39K/5K/5K).
Evaluation
- Entity extraction: entity-level F1 score.
- Classification: accuracy.
- DocVQA: Average Normalized Levenshtein Similarity (ANLS).
- Ablation on DocVQA validation set with BASE models pre-trained for one epoch.
- No error bars or multi-seed runs reported.
Hardware
- Not specified in the main paper. Pre-training on 11M pages with a 200M-426M parameter model with a ResNeXt-101 FPN backbone requires substantial GPU resources. The original LayoutLM was pre-trained on 8x V100 GPUs; LayoutLMv2 likely requires similar or greater resources.
BibTeX
@inproceedings{xu2021layoutlmv2,
title={LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding},
author={Xu, Yang and Xu, Yiheng and Lv, Tengchao and Cui, Lei and Wei, Furu and Wang, Guoxin and Lu, Yijuan and Florencio, Dinei and Zhang, Cha and Che, Wanxiang and Zhang, Min and Zhou, Lidong},
booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
pages={2579--2591},
year={2021},
publisher={Association for Computational Linguistics},
doi={10.18653/v1/2021.acl-long.201}
}
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
TL;DR
LayoutXLM is the multilingual extension of LayoutLMv2. It uses the same architecture (multimodal Transformer with ResNeXt-FPN visual backbone and spatial-aware self-attention) but pre-trains on 30M documents across 53 languages. The paper also introduces XFUND, a manually labeled multilingual form understanding benchmark in 7 languages. LayoutXLM outperforms multilingual text-only baselines (XLM-R, InfoXLM) on cross-lingual document understanding tasks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is extending LayoutLMv2 to the multilingual setting with a large-scale cross-lingual pre-training corpus.
Secondary: $\Psi_{\text{Resource}}$. The XFUND dataset (1,393 forms across 7 languages with key-value pair annotations) is a meaningful data contribution.
Secondary: $\Psi_{\text{Evaluation}}$. Benchmarks on XFUND with zero-shot transfer and multitask fine-tuning settings.
What is the motivation?
Nearly 40% of digital documents on the web are in non-English languages. LayoutLMv2 was pre-trained only on English documents (IIT-CDIP), limiting its cross-lingual capabilities. Multilingual text-only models (mBERT, XLM-R) lack the layout and visual understanding needed for document tasks. Translating documents via machine translation is unsatisfactory because it loses layout structure.
What is the novelty?
Multilingual pre-training corpus. 22M multilingual PDFs scraped from Common Crawl across 53 languages, combined with 8M English documents from IIT-CDIP (30M total). The multilingual PDFs are filtered for quality and classified by language using fastText.
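Language identification with fastText's off-the-shelf model is essentially a one-liner per document. A sketch (the confidence threshold is our assumption; the lid.176.bin model and the predict call are standard fastText):

```python
import fasttext

model = fasttext.load_model("lid.176.bin")  # pre-trained 176-language identifier

def detect_language(text: str, min_conf: float = 0.8):
    """Return (ISO code, confidence), or None if the prediction is weak."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return (lang, float(probs[0])) if probs[0] >= min_conf else None
```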
Architecture. Same as LayoutLMv2 (multimodal Transformer + ResNeXt-FPN + spatial-aware self-attention), but initialized from InfoXLM (a multilingual pre-trained model) instead of UniLMv2. Three pre-training objectives: MVLM, TIA, TIM.
XFUND benchmark. 1,393 manually annotated forms across 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). 199 forms per language (149 train / 50 test). Two tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE) for key-value pairs.
What experiments were performed?
Three evaluation settings on XFUND:
- Language-specific fine-tuning: Train and test on the same language.
- Zero-shot transfer: Fine-tune on English FUNSD, evaluate on each target language.
- Multitask fine-tuning: Fine-tune on all 8 languages jointly.
LayoutXLM BASE and LARGE are compared against XLM-R, InfoXLM, and LayoutLMv2.
What are the outcomes/conclusions?
LayoutXLM outperforms text-only multilingual baselines on both SER and RE tasks. On XFUND SER (language-specific), LayoutXLM BASE achieves an average F1 of 0.8056 across 7 languages, compared to 0.7047 for XLM-R BASE. Zero-shot transfer from English to other languages is feasible but shows substantial performance drops, confirming that layout understanding transfers better than textual semantics across languages.
Limitations
- CC-BY-NC-SA-4.0 license on model weights restricts commercial use.
- Requires OCR as preprocessing.
- The 30M pre-training corpus is not publicly released.
- XFUND covers only form understanding; no layout detection or document classification benchmarks for multilingual evaluation.
Reproducibility
Models
- LayoutXLM BASE: 369M parameters. Initialized from InfoXLM BASE + ResNeXt-101 FPN.
- LayoutXLM LARGE: 625M parameters. Initialized from InfoXLM LARGE + ResNeXt-101 FPN.
Data
- Pre-training: 30M documents (22M multilingual PDFs + 8M IIT-CDIP English). OCR via Microsoft Read API.
- XFUND: 1,393 forms, 7 languages, manually annotated key-value pairs.
Algorithms
- Optimizer, learning rate, batch size, warmup schedule, and total training steps are not reported in the paper.
- Three pre-training objectives are identical to LayoutLMv2: Masked Visual-Language Modeling (MVLM), Text-Image Alignment (TIA), and Text-Image Matching (TIM).
- OCR preprocessing uses the commercial Microsoft Read API.
Evaluation
- Semantic Entity Recognition (SER) is evaluated with entity-level F1.
- Relation Extraction (RE) is evaluated with pair-level F1.
- No error bars or multi-run statistics are reported.
Hardware
- Not specified in the paper.
BibTeX
@inproceedings{xu2022layoutxlm,
title={LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding},
author={Xu, Yiheng and Lv, Tengchao and Cui, Lei and Wang, Guoxin and Lu, Yijuan and Florencio, Dinei and Zhang, Cha and Wei, Furu},
booktitle={Findings of the Association for Computational Linguistics: ACL 2022},
pages={3184--3196},
year={2022},
doi={10.18653/v1/2022.findings-acl.253}
}
LiLT: A Simple yet Effective Language-Independent Layout Transformer
TL;DR
LiLT (Language-independent Layout Transformer) is a dual-stream Transformer that separates text and layout processing into parallel flows connected by a Bi-directional Attention Complementation Mechanism (BiACM). The key insight: pre-train the layout stream on English-only structured documents (IIT-CDIP), then swap in any off-the-shelf text encoder (monolingual or multilingual) at fine-tuning time. This avoids the costly multilingual pre-training corpus that LayoutXLM requires while achieving competitive or superior results on 8 languages.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The core contribution is the disentangled dual-stream architecture and BiACM mechanism that enables language-independent layout pre-training.
Secondary: $\Psi_{\text{Evaluation}}$. Comprehensive benchmarking on FUNSD, CORD, EPHOIE, RVL-CDIP, and the multilingual XFUND benchmark across 8 languages.
What is the motivation?
LayoutXLM demonstrated that multilingual document understanding requires both layout and text signals. However, it requires scraping 22M multilingual PDFs, cleaning them, and pre-training a 369M+ parameter model from scratch. LiLT asks: can we learn layout structure from English-only documents and transfer that knowledge to other languages by simply swapping the text encoder?
The key observation is that document layout structure is largely language-independent. The spatial arrangement of fields on a form looks similar whether the text is in English, Chinese, or Arabic. If the layout knowledge can be separated from the language knowledge, each can be learned independently.
What is the novelty?
Dual-Stream Architecture
LiLT uses two parallel Transformer streams:
- Text flow: Processes token embeddings + 1D positional embeddings. Standard Transformer architecture.
- Layout flow: Processes 2D positional (bounding box) embeddings only. Reduced hidden size for efficiency (adds only 6.1M parameters to the text flow).
BiACM (Bi-directional Attention Complementation Mechanism)
At each layer, the raw (pre-softmax) attention scores from each stream are added to the other stream’s scores before the softmax:
$$A_T' = \text{Softmax}\left(\frac{Q_T K_T^\top}{\sqrt{d_T}} + \text{DETACH}\left(\frac{Q_L K_L^\top}{\sqrt{d_L}}\right)\right)$$
$$A_L' = \text{Softmax}\left(\frac{Q_L K_L^\top}{\sqrt{d_L}} + \text{DETACH}\left(\frac{Q_T K_T^\top}{\sqrt{d_T}}\right)\right)$$
The DETACH operation (gradient stopping during pre-training) prevents the layout flow from corrupting the text encoder’s representations. At fine-tuning time, DETACH is removed to allow full end-to-end optimization.
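A single-head sketch of the score exchange (names and the `pretraining` switch are ours; the real module is multi-head with learned projections):

```python
import math
import torch

def biacm_attention(q_t, k_t, v_t, q_l, k_l, v_l, pretraining=True):
    """Add each stream's pre-softmax scores to the other stream's."""
    s_t = q_t @ k_t.transpose(-2, -1) / math.sqrt(q_t.size(-1))  # text scores
    s_l = q_l @ k_l.transpose(-2, -1) / math.sqrt(q_l.size(-1))  # layout scores
    # DETACH during pre-training: no gradient flows through the shared
    # scores, so neither stream corrupts the other's encoder.
    cross_t = s_l.detach() if pretraining else s_l
    cross_l = s_t.detach() if pretraining else s_t
    a_t = torch.softmax(s_t + cross_t, dim=-1)
    a_l = torch.softmax(s_l + cross_l, dim=-1)
    return a_t @ v_t, a_l @ v_l
```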
Pre-training Objectives
Three objectives, all operating on text tokens:
- MVLM: Standard masked visual-language modeling.
- KPL (Key Point Location): Predict the top-left and bottom-right corner coordinates of masked bounding boxes.
- CAI (Cross-modal Alignment Identification): Binary classification of whether each text-layout pair is correctly aligned or randomly shifted.
Asynchronous Optimization
The text flow’s learning rate is reduced by a factor of 1,000 relative to the layout flow during pre-training (see the optimizer sketch below). This prevents the text encoder from overfitting to English while allowing the layout encoder to fully learn spatial structure.
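In PyTorch this is just two parameter groups; a sketch with stand-in modules for the two flows (learning rate and weight decay match the Reproducibility notes below, the modules are illustrative):

```python
import torch
from torch import nn

class LiLTSketch(nn.Module):
    """Stand-in with the two flows as submodules (illustrative only)."""
    def __init__(self):
        super().__init__()
        enc = lambda d: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=12), num_layers=12)
        self.text_flow, self.layout_flow = enc(768), enc(192)

model = LiLTSketch()
base_lr = 2e-5
optimizer = torch.optim.Adam(
    [
        {"params": model.layout_flow.parameters(), "lr": base_lr},
        # Text flow slowed by 1000x: it stays near its pre-trained
        # weights while the layout flow learns spatial structure.
        {"params": model.text_flow.parameters(), "lr": base_lr / 1000},
    ],
    weight_decay=1e-2,
)
```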
What experiments were performed?
Monolingual Benchmarks
| Dataset | LiLT[EN-R] BASE | LayoutLMv2 BASE | LayoutXLM BASE |
|---|---|---|---|
| FUNSD (F1) | 0.8841 | 0.8276 | 0.8034 |
| CORD (F1) | 0.9607 | 0.9495 | 0.9481 |
| RVL-CDIP (Acc) | 95.68% | 95.25% | 95.21% |
LiLT with English RoBERTa reports the highest F1 among the compared baselines on FUNSD and competitive results on CORD and RVL-CDIP, outperforming both LayoutLMv2 and LayoutXLM at BASE size. Note that on CORD, DocFormer BASE (0.9633 F1) outperforms LiLT; the table above shows only the most relevant layout-aware baselines.
Multilingual (XFUND)
The paper evaluates three settings on XFUND: (1) language-specific fine-tuning (train and test on the same language), (2) zero-shot transfer (train on English, test on target language), and (3) multitask fine-tuning (train on all languages jointly). On zero-shot transfer, LiLT with InfoXLM achieves competitive performance with LayoutXLM despite using only English pre-training data. On language-specific fine-tuning, LiLT often matches or exceeds LayoutXLM.
What are the outcomes/conclusions?
LiLT demonstrates that layout structure is sufficiently language-independent that it can be learned from a single language and transferred to others. The disentangled architecture enables a practical workflow: pre-train layout once on English, then combine with any text encoder for the target language.
The BiACM mechanism is critical. Without it (plain concatenation), performance drops sharply. Replacing it with co-attention (a common dual-stream approach) also hurts performance, because deeper cross-modal interaction damages the text flow’s consistency during pre-training.
Limitations
- No visual features (no CNN backbone). LiLT uses only text + layout, unlike LayoutLMv2/LayoutXLM which also use document images.
- Only BASE size evaluated; no LARGE variant.
- Pre-trained on IIT-CDIP only (English); adding multilingual layout data could further help.
Reproducibility
Models
- LiLT BASE: 12-layer layout encoder with 192 hidden size, 768 FFN size, 12 attention heads (6.1M parameters). Combined with a text encoder at fine-tuning time.
- Released checkpoints: lilt-roberta-en-base (293 MB, English), lilt-infoxlm-base (846 MB, multilingual), and lilt-only-base (21 MB, layout flow only). Available on both GitHub (OneDrive links) and Hugging Face under the MIT license.
- The text flow is initialized from an existing pre-trained model (RoBERTa BASE or InfoXLM BASE) and can be swapped at fine-tuning time.
Algorithms
- Pre-training: Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$), learning rate $2 \times 10^{-5}$ with linear warmup over the first 10% of steps followed by linear decay, weight decay $1 \times 10^{-2}$, batch size 96, 5 epochs on IIT-CDIP.
- Text flow slow-down ratio: 1000x (text learning rate = base rate / 1000).
- Max sequence length: 512.
Data
- Pre-training: IIT-CDIP Test Collection 1.0 (English only): 6M+ documents, 11M+ scanned images. OCR via TextIn API.
- Ablation pre-training: 2M randomly sampled documents from IIT-CDIP (5 epochs).
- Fine-tuning benchmarks: FUNSD (149/50 train/test forms), CORD (800/100/100), EPHOIE (1183/311 Chinese exam papers), RVL-CDIP (320K/40K/40K), XFUND (149/50 per language, 7 languages + English FUNSD).
Evaluation
- Metrics: Entity-level F1 for SER (FUNSD, CORD, EPHOIE, XFUND); overall accuracy for document classification (RVL-CDIP); F1 for relation extraction (XFUND RE).
- Baselines: LayoutLM, LayoutLMv2, LayoutXLM, BROS, DocFormer, and text-only models (BERT, RoBERTa, InfoXLM).
- Statistical rigor: No error bars, confidence intervals, or multi-run statistics reported.
Hardware
- 4x NVIDIA A40 (48 GB) GPUs for pre-training.
- Training time not reported.
BibTeX
@inproceedings{wang2022lilt,
title={LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding},
author={Wang, Jiapeng and Jin, Lianwen and Ding, Kai},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={7747--7757},
year={2022},
doi={10.18653/v1/2022.acl-long.534}
}
PAL Database: Document Layout Annotation for Public Affairs
TL;DR
The PAL Database is a large-scale document layout analysis dataset covering 37,910 Spanish government gazettes (441,300 pages) with over 8 million layout labels across 8 categories. Annotations were generated through a semi-automatic pipeline combining PDF mining, heuristic rules, and human-in-the-loop Random Forest classifiers. The dataset is gated behind a signed license agreement.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the PAL Database itself: a new domain-specific dataset for document layout analysis in the public affairs domain. The paper devotes most of its space to describing the data collection pipeline, annotation process, and dataset statistics.
Secondary: $\Psi_{\text{Method}}$. The semi-automatic annotation pipeline (heuristic rules followed by iterative human-in-the-loop Random Forest classifiers) is a reusable methodological contribution beyond the dataset itself.
What is the motivation?
Existing DLA datasets (PubLayNet, DocBank, DocLayNet) focus on scientific articles, financial reports, and similar domains. Government and legal documents have distinct visual layouts (multi-column gazettes, structured metadata blocks, legal identifiers) that are underrepresented in current benchmarks. The authors argue that public affairs documents require their own annotated corpus because:
- The layout structure differs from academic and business documents
- Existing models may not generalize well to this domain
- Government document processing is a practical need for legal and administrative automation
Beyond layout analysis, the authors also frame the corpus as a large-scale NLP pre-training resource for the legal domain in Spanish and the co-official languages Basque, Catalan/Valencian, and Galician.
What is the novelty?
The core novelty is the dataset and its semi-automatic annotation pipeline:
Data Collection: Web scraping of 24 Spanish administration sources (3 national, 19 regional, 2 local) using Scrapy. Only born-digital PDFs from 2014 onward are included.
Feature Extraction: PyMuPDF extracts images, links, and text blocks with 18 features per block (bounding box, font size, bold/italic proportions, capitalization ratios, etc.). Camelot extracts tables.
Two-step Annotation:
- Step 0 (Heuristic): Rule-based labeling using font features and keyword matching, followed by human validation through a custom annotation application
- Step 1 (ML-assisted): Random Forest classifiers trained on validated documents (minimum 50 validated pages per source), with iterative human correction
Label Taxonomy: 8 categories split into 4 basic layout blocks (Image, Table, Link, Text) and 4 text semantic categories (Identifier, Title, Summary, Body).
The per-source Random Forest approach acknowledges that government gazette layouts vary significantly across administrative regions, each requiring its own classifier.
What experiments were performed?
The evaluation focuses on validating the text-block classification quality:
- Model: Random Forest classifiers (max depth 1000), one per data source (24 total)
- Features: 17 of 18 extracted features (raw text excluded). Positional features normalized by document dimensions. Font type encoded via dictionary mapping.
- Protocol: 10-fold cross-validation with 80/20 document-level splits on the validation set (1,444 documents, 19,276 pages)
- Metric: Accuracy on 4-class text block classification (Identifier, Title, Summary, Body)
What are the outcomes/conclusions?
Overall accuracy ranges from 93.37% (Source 16, Ceuta) to 99.93% (Source 20), with 20 of 24 sources exceeding 96%.
Per-class performance:
- Identifier: Highest accuracy; mean above 99% for 22 of 24 sources. These blocks have consistent position, token count, and font characteristics across documents.
- Body: All sources above 95%. The dominant class, easily distinguished by font regularity.
- Title: Hardest category; all sources above 84%, 19 above 90%. The authors note titles are the hardest to label consistently due to varying editorial conventions across sources.
- Summary: Generally high, but Source 16 (Ceuta) is an outlier at 61.08% ($\pm$12.35 std), indicating high instability for that source.
Limitations:
- Restricted to born-digital PDFs only; scanned/image-based documents are excluded
- Feature extraction depends on internal PDF structure, which varies by editing software
- Single human supervisor validated all documents (no inter-annotator agreement metrics)
- The training set (422K pages) uses ML-generated labels that were only spot-checked (“visual inspection of an arbitrary selection of documents”), not systematically validated
- No cross-source generalization experiments; each source gets its own classifier, so it is unknown whether the dataset supports training a single unified model
- Only Random Forest classifiers evaluated; no comparison with deep learning approaches
- The dataset is not freely available (requires signed license agreement)
Reproducibility
Models
- Random Forest classifiers with max depth 1000, one per source
- No deep learning models trained
- No pretrained weights released
- No source code released; the GitHub repository contains only a README and license agreement
Algorithms
- Feature extraction via PyMuPDF (text blocks, images, links, fonts) and Camelot (tables); a PyMuPDF sketch follows this list
- 18 features per text block (Table 2 in the paper): page number ($f_1$), bounding box ($f_{2\text{-}5}$: $x_0, y_0, x_1, y_1$), center coordinates ($f_{6\text{-}7}$: $x_c, y_c$), distances to page edges ($f_{8\text{-}11}$), raw text/data ($f_{12}$), bold proportion ($f_{13}$), italic proportion ($f_{14}$), average font size ($f_{15}$), font types tuple ($f_{16}$), capital letter proportion ($f_{17}$), word count ($f_{18}$)
- For classification, 17 of 18 features used ($f_{12}$ excluded). Positional features ($f_2$-$f_{11}$) normalized by page dimensions.
- Font type ($f_{16}$) encoded via dictionary mapping (details not provided)
- Full heuristic rule set for Step 0 annotation not disclosed; the paper provides only one example rule (bold proportion $> 0.5$ for titles)
- 10-fold cross-validation with 80/20 document-level splits
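A sketch of block-level feature extraction with PyMuPDF in the spirit of Table 2 (the span-flag bits are standard PyMuPDF; the feature definitions here are illustrative, not the paper's code):

```python
import fitz  # PyMuPDF

BOLD, ITALIC = 1 << 4, 1 << 1  # span flag bits in PyMuPDF

def block_features(page):
    """Yield per-text-block features: normalized bbox, bold/italic
    proportions, average font size, capitalization ratio, word count."""
    w, h = page.rect.width, page.rect.height
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        spans = [s for line in block["lines"] for s in line["spans"]]
        text = "".join(s["text"] for s in spans)
        n = max(len(text), 1)
        yield {
            "bbox_norm": [block["bbox"][0] / w, block["bbox"][1] / h,
                          block["bbox"][2] / w, block["bbox"][3] / h],
            "bold_prop": sum(len(s["text"]) for s in spans if s["flags"] & BOLD) / n,
            "italic_prop": sum(len(s["text"]) for s in spans if s["flags"] & ITALIC) / n,
            "avg_font_size": sum(s["size"] for s in spans) / max(len(spans), 1),
            "caps_prop": sum(c.isupper() for c in text) / n,
            "word_count": len(text.split()),
        }

doc = fitz.open("gazette.pdf")  # hypothetical input file
features = [f for page in doc for f in block_features(page)]
```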
Data
- Sources: 24 Spanish administration websites (official gazettes)
- Scale: 37,910 documents, 441,300 pages, 138.1M tokens, 8M+ labels
- Split: Train: 36,466 docs (422K pages); Validation: 1,444 docs (19,276 pages)
- Languages: Spanish (primary), Basque, Catalan/Valencian, Galician
- Availability: Gated. Requires downloading and signing a license agreement from the BiDAlab GitHub repo, then emailing credentials to atvs@uam.es
- Annotation: Semi-automatic (heuristic + ML-assisted) with single-annotator human validation
Evaluation
- Metric: Accuracy (4-class text block classification)
- Protocol: 10-fold cross-validation, document-level splits
- No cross-dataset evaluation against established DLA benchmarks (PubLayNet, DocLayNet)
- Standard deviations reported across 10-fold CV, but no formal significance tests (e.g., paired t-tests between sources or against baselines)
- No comparison with deep learning baselines (Faster R-CNN, YOLO, LayoutLM, etc.)
Hardware
- Not reported. Random Forest classifiers are lightweight and likely trained on CPU.
- No training time, memory, or compute cost information provided.
BibTeX
@inproceedings{pena2023document,
title={Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs},
author={Pe{\~n}a, Alejandro and Morales, Aythami and Fierrez, Julian and Ortega-Garcia, Javier and Grande, Marcos and Puente, {\'I}{\~n}igo and C{\'o}rdova, Jorge and C{\'o}rdova, Gonzalo},
booktitle={Document Analysis and Recognition -- ICDAR 2023 Workshops},
series={Lecture Notes in Computer Science},
volume={14194},
pages={126--141},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41501-2_9}
}
TexBiG: A Dataset for Analysing Complex Document Layouts in the Digital Humanities
TL;DR
TexBiG (Text-Bild-Gefüge, “Text-Image-Structure”) is a document layout analysis dataset for historical documents from the late 19th and early 20th centuries. It provides over 52,000 expert polygon annotations across 19 classes, with inter-annotator agreement measured using Krippendorff’s Alpha. Each document image was annotated by at least two independent annotators. Baselines are provided using Mask R-CNN and VSR.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is the dataset itself: its annotation guidelines, 19-class taxonomy, polygon annotations, and inter-annotator agreement evaluation. The paper is organized around data curation and quality assessment.
Secondary: $\Psi_{\text{Evaluation}}$
The paper benchmarks Mask R-CNN and VSR on the dataset and uses Krippendorff’s Alpha to formally quantify annotation reliability, which is unusual for layout datasets.
What is the motivation?
Digital humanities researchers work with historical documents that differ substantially from the born-digital, English-language STEM papers that dominate existing DLA datasets. Late 19th and early 20th century publications exhibit complex text-image arrangements, varied typography, decorative elements, and advertisements interleaved with body content. Existing datasets either lack the class granularity needed for these documents or do not provide polygon-level annotations that can capture the irregular region shapes common in historical layouts.
The authors also note that most DLA datasets lack formal inter-annotator agreement measurements, making it difficult to assess annotation reliability. TexBiG addresses both gaps: fine-grained polygon annotations for historical documents with formal agreement metrics.
What is the novelty?
19-class polygon annotations. Instance segmentation annotations (bounding boxes + polygons/masks) in COCO format across 19 classes covering both textual and non-textual document elements.
Expert annotation with formal agreement. Each document image was annotated by at least two independent expert annotators. Inter-annotator agreement is measured using Krippendorff’s Alpha ($\alpha$), providing a principled estimate of annotation reliability. Krippendorff’s Alpha is defined as:
$$\alpha = 1 - \frac{D_o}{D_e}$$
where $D_o$ is the observed disagreement between annotators and $D_e$ is the disagreement expected by chance. Values of $\alpha \geq 0.8$ are generally considered reliable, while $0.667 \leq \alpha < 0.8$ allows tentative conclusions.
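For nominal labels, $\alpha$ can be computed from a coincidence matrix. A compact sketch of the scalar core of the metric (TexBiG applies $\alpha$ to annotated regions via a task-specific distance, which is more involved than plain label agreement):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per item, as assigned by its annotators
    (lengths may differ; items with a single rating are ignored)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):  # ordered pairs within an item
            coincidences[(a, b)] += 1.0 / (m - 1)
    n = sum(coincidences.values())
    totals = Counter()
    for (a, _), c in coincidences.items():
        totals[a] += c
    d_o = sum(c for (a, b), c in coincidences.items() if a != b) / n
    d_e = sum(totals[a] * totals[b]
              for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1 - d_o / d_e

# Three agreements, one disagreement between two annotators -> alpha = 0.667
print(krippendorff_alpha_nominal(
    [["text", "text"], ["image", "image"], ["text", "image"], ["table", "table"]]))
```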
Historical document focus. The dataset draws from 7 publicly available German-language publications from the late 19th and early 20th centuries, covering a range of layout complexity from simple book pages to magazine-style layouts with embedded images, advertisements, and decorative elements.
OCR data included. Tesseract-generated OCR output is provided alongside the annotations to support multimodal models like VSR, though the authors note this OCR is imperfect.
What experiments were performed?
Two baseline models were evaluated:
- Mask R-CNN (via Detectron2): standard instance segmentation baseline
- VSR (Vision-Semantics-Relations): multimodal model using both visual features and OCR text, pre-trained on PubLayNet
Results are reported as mAP on the test split. Full benchmark numbers are in the published paper (DOI: 10.1007/978-3-031-16788-1_22), which is not open access.
What are the outcomes/conclusions?
TexBiG fills a gap for historical document layout analysis with a well-annotated, multi-class dataset. The formal inter-annotator agreement evaluation via Krippendorff’s Alpha sets it apart from most DLA datasets, which either report no agreement metrics or use ad-hoc measures. The dataset has been adopted by subsequent work (e.g., AnnoPage re-annotates TexBiG pages under its own schema).
Limitations
- The paper is behind a paywall (Springer DAGM GCPR), limiting accessibility.
- The GitHub repository for the VSR benchmark code has no license file.
- The dataset is relatively small by deep learning standards (7 books), though the annotation density (52K+ instances) is high.
- OCR data is Tesseract-generated and imperfect.
Mapping to Matter vs. Meaning Framework
How does TexBiG’s 19-class taxonomy map to our Matter vs. Meaning framework?
| TexBiG Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| paragraph | Text | Body | Standard prose content. |
| heading | Text | SectionHeader | Section headings. |
| sub-heading | Text | SectionHeader | Subsection headings. |
| column title | Text | SectionHeader / Title | Column-level titles in multi-column layouts. |
| author | Text | Author | Author names. |
| caption | Text | Caption | Descriptions of images/tables. |
| footnote | Text | Footnote | Bottom-of-page notes. |
| header | Text | PageHeader | Running headers. |
| footer | Text | PageFooter | Running footers. |
| page number | Text | PageNumber | Folio numbers. |
| editorial note | Text | Annotation | Editor’s notes, commentary. |
| equation | Formula | DisplayEquation | Mathematical content. |
| table | Table | (primitive only) | Tabular data. |
| image | Image | Figure | Photographs, illustrations. |
| logo | Image | Logo | Publisher or organizational logos. |
| advertisement | Image | Advertisement | Commercial advertisements. |
| decoration | Structure | (decorative) | Ornamental elements, borders, vignettes. |
| frame | Structure | (structural) | Page frames, column borders. |
| noise | Structure | (artifact) | Scanner artifacts, stains, damage. |
Coverage strengths: TexBiG provides a well-balanced taxonomy for historical documents. It distinguishes Logo from generic Image, separates Advertisement from editorial content, and includes decoration, frame, and noise classes that are important for historical material but absent from most modern-document datasets. The editorial note class is unique among DLA datasets.
Coverage gaps: No explicit classes for ListItem, BibEntry, Abstract, Blockquote, Code, OpticalCode, or form primitives. No relation annotations (reading order, hierarchy).
Reproducibility
Models
- Mask R-CNN: Standard Detectron2 implementation. Pretrained weights released on Zenodo.
- VSR: Multimodal model from Hikvision’s DAVAR-Lab-OCR, pre-trained on PubLayNet. Code adapted for TexBiG and released on GitHub.
Algorithms
- Framework: Detectron2 for Mask R-CNN; mmdetection 2.11.0 for VSR.
- Pre-training: VSR initialized from PubLayNet-pretrained weights.
- Training details: Optimizer, learning rate, batch size, epochs, and augmentation details are not available outside the paywalled paper.
Data
- Source: 7 publicly available German-language historical publications (late 19th to early 20th century).
- Total annotations: 52,000+ instances across 19 classes.
- Format: COCO JSON with polygon masks and bounding boxes. Train/val/test splits provided.
- Annotation process: Manual expert annotation using a detailed guideline (provided as PDF on Zenodo). At least 2 annotators per document. Agreement measured with Krippendorff’s Alpha.
- OCR: Tesseract-generated text provided in DAVAR format for VSR compatibility.
- License: CC-BY-4.0 for both annotations and images. Attribution required.
- Access: Zenodo (doi:10.5281/zenodo.6885144).
Evaluation
- Metric: mAP (COCO-style) for detection and segmentation.
- Agreement: Krippendorff’s Alpha for inter-annotator reliability.
- Baselines: Mask R-CNN and VSR.
- Full benchmark results in the published paper (Springer, paywalled).
Hardware
- NVIDIA Tesla V100-PCIE (16 GB), Ubuntu 18.04.
BibTeX
@inproceedings{tschirschwitz2022texbig,
title={A Dataset for Analysing Complex Document Layouts in the Digital Humanities and its Evaluation with Krippendorff's Alpha},
author={Tschirschwitz, David and Klemstein, Franziska and Stein, Benno and Rodehorst, Volker},
booktitle={DAGM German Conference on Pattern Recognition},
pages={354--374},
year={2022},
publisher={Springer},
doi={10.1007/978-3-031-16788-1_22}
}
UDOP: Unifying Vision, Text, and Layout for Universal Document Processing
TL;DR
UDOP (Universal Document Processing) is a generative foundation model for document AI that unifies vision, text, and layout in a single encoder-decoder architecture. It converts all document tasks (understanding, classification, QA, layout analysis) into a sequence-to-sequence generation format, including generating document images from text and layout. UDOP ranked first on the Document Understanding Benchmark (DUE) at the time of publication and reported leading results on 8 tasks across diverse document domains.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the Vision-Text-Layout (VTL) Transformer architecture and unified generative framework that handles diverse document tasks without task-specific heads.
Secondary: $\Psi_{\text{Evaluation}}$. Comprehensive evaluation on 8+ benchmarks (DUE-Benchmark, FUNSD, CORD, RVL-CDIP, DocVQA) with ablation studies.
What is the motivation?
Prior document AI models (LayoutLMv2, LayoutLMv3) treat layout as shallow positional embeddings added to text tokens, underutilizing the strong spatial correlation between text content and its visual appearance. They also require task-specific heads for each downstream task (classification head for RVL-CDIP, sequence labeling head for FUNSD, etc.), limiting flexibility.
UDOP addresses both issues: (1) a layout-induced representation that directly fuses text tokens with the image patches where they appear, and (2) a unified seq2seq generation framework where all tasks use the same decoder.
What is the novelty?
Layout-Induced Vision-Text Representation
Instead of encoding text and image separately, UDOP adds each text token’s embedding to the visual features of the image patch where that token is located. This creates a unified representation where each token carries both its textual semantics and its visual appearance.
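A sketch of that fusion: map each token's box center to the ViT patch grid and add the corresponding patch feature to the token embedding (grid size and names are ours):

```python
import torch

def layout_induced_tokens(patch_feats, token_embs, token_boxes, grid=14):
    """patch_feats: (grid*grid, d) image features; token_embs: (T, d);
    token_boxes: (T, 4) normalized [x0, y0, x1, y1].
    Returns (T, d): each token fused with the patch containing its center."""
    cx = (token_boxes[:, 0] + token_boxes[:, 2]) / 2
    cy = (token_boxes[:, 1] + token_boxes[:, 3]) / 2
    col = (cx * grid).long().clamp(0, grid - 1)
    row = (cy * grid).long().clamp(0, grid - 1)
    return token_embs + patch_feats[row * grid + col]
```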
Vision-Text-Layout Transformer
A T5-based encoder-decoder with:
- Modality-agnostic encoder: Processes the fused vision-text tokens.
- Text-Layout decoder: Generates text and bounding box tokens autoregressively.
- Vision decoder: Reconstructs masked image patches (MAE-style).
Bounding boxes are discretized into tokens and added to the text vocabulary, allowing the model to generate layout as part of its output sequence.
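Discretization is a simple quantize-and-name step; a sketch (the bin count and the `<loc_i>` token naming are illustrative, not UDOP's exact vocabulary):

```python
def box_to_tokens(box, num_bins=500):
    """box: normalized [x0, y0, x1, y1] in [0, 1] -> four layout tokens."""
    return [f"<loc_{min(int(v * num_bins), num_bins - 1)}>" for v in box]

def tokens_to_box(tokens, num_bins=500):
    """Inverse mapping back to (approximate) normalized coordinates."""
    return [(int(t[5:-1]) + 0.5) / num_bins for t in tokens]

# Text and layout share one output sequence:
print(["title"] + box_to_tokens([0.12, 0.05, 0.88, 0.09]))
# ['title', '<loc_60>', '<loc_25>', '<loc_440>', '<loc_45>']
```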
Unified Pre-training
Four self-supervised objectives:
- Joint Text-Layout Reconstruction: Reconstruct masked text tokens and their bounding boxes.
- Visual Text Recognition: Predict text from image patches (OCR-like objective).
- Layout Modeling: Predict bounding boxes from text.
- Masked Autoencoding: Reconstruct masked image patches.
Plus 11 supervised datasets (1.8M examples) incorporated into pre-training via task prompts, including FUNSD, CORD, DocVQA, and others.
What experiments were performed?
DUE-Benchmark (7 tasks)
UDOP achieves first place overall with an average score of 64.8, compared to 62.9 for LayoutLMv3 LARGE. Strong results on DocVQA (84.7), InfoVQA (47.4), KLC (82.8), and DeepForm (85.5).
Additional Benchmarks
| Dataset | Task | UDOP | Previous SOTA |
|---|---|---|---|
| FUNSD | Info Extraction (F1) | 91.62 | 88.41 (LiLT) |
| CORD | Info Extraction (F1) | 97.58 | 97.40 (BROS) |
| RVL-CDIP | Classification (Acc) | 96.00 | 96.08 (StructuralLM) |
| DocVQA | VQA (ANLS) | 84.7 | 83.4 (LayoutLMv3) |
Document Generation
UDOP can generate document images from text and layout, enabling document editing and content customization. The authors describe this as the first document AI model with this capability, though the generated images are at MAE resolution (not publication quality).
What are the outcomes/conclusions?
The authors argue that UDOP demonstrates a unified generative framework can match or exceed task-specific models across diverse document AI tasks. The layout-induced representation is effective at exploiting the spatial correlation unique to documents. Incorporating supervised datasets into pre-training alongside self-supervised objectives provides additional gains.
Limitations
- 794M parameters (LARGE size only); no BASE variant released.
- Requires OCR as preprocessing.
- Document generation quality is limited to MAE reconstruction resolution.
- Pre-trained on 11M IIT-CDIP pages + 1.8M supervised examples; substantial compute requirements.
Reproducibility
Models
- UDOP: ~794M parameters. Based on T5 LARGE architecture with added vision encoder (ViT) and vision decoder.
- Checkpoint released on Hugging Face under MIT license.
- Vision decoder weights are withheld. The authors chose not to release the vision decoder due to ethical concerns around potential misuse for document forgery. The released checkpoint includes the encoder and text-layout decoder only.
Algorithms
- Pre-training: 11M IIT-CDIP pages (self-supervised) + 11 supervised datasets (1.8M examples).
- Optimizer: Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, weight decay $1 \times 10^{-2}$.
- Learning rate: $5 \times 10^{-5}$ with 1000 warmup steps.
- Batch size: 512.
- Curriculum: 3-stage resolution scaling ($224 \rightarrow 512 \rightarrow 1024$), 1 epoch per stage.
- Self-supervised tasks: Joint text-layout reconstruction, visual text recognition, layout modeling, masked autoencoding.
Data
- Pre-training (unlabeled): IIT-CDIP Test Collection (11M documents).
- Pre-training (labeled): FUNSD, CORD, RVL-CDIP, DocVQA, InfoVQA, KLC, PWC, DeepForm, WTQ, TabFact, SQA (11 datasets, 1.8M examples total).
Evaluation
- Metrics: F1 for entity extraction tasks (FUNSD, CORD), accuracy for document classification (RVL-CDIP), ANLS for visual question answering (DocVQA, InfoVQA).
- Benchmarks: DUE-Benchmark (7 tasks), plus standalone evaluation on FUNSD, CORD, RVL-CDIP, and DocVQA.
- Baselines: LayoutLMv3 LARGE, LiLT, BROS, StructuralLM, and other document AI models.
- Statistical reporting: The authors report results over 5 runs with different random seeds and include standard deviations (Tables 9-10 in the paper), which is notable for this domain.
Hardware
- Not specified in the main paper. A model of this scale (794M params with vision encoder) typically requires multi-GPU training on A100-class hardware.
BibTeX
@inproceedings{tang2023udop,
title={Unifying Vision, Text, and Layout for Universal Document Processing},
author={Tang, Zineng and Yang, Ziyi and Wang, Guoxin and Fang, Yuwei and Liu, Yang and Zhu, Chenguang and Zeng, Michael and Zhang, Cha and Bansal, Mohit},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={15254--15264},
year={2023},
doi={10.1109/CVPR52729.2023.01462}
}
CDDOD: Cross-Domain Document Object Detection
TL;DR
This paper formalizes cross-domain document object detection (DOD) as a task, introduces a benchmark suite of three PDF document datasets (PubMed, Chn, Legal), and proposes an FPN-based method with three adversarial alignment modules (Feature Pyramid Alignment, Region Alignment, and Rendering Layer Alignment) that leverage PDF rendering layers as domain-consistent supervision signals.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline novelty is the three-module alignment framework for cross-domain DOD. The paper dedicates the most space to describing the FPA, RA, and RLA modules, includes ablation studies isolating each module’s contribution, and reports SOTA-style comparisons across six cross-domain transfer directions.
Secondary: $\Psi_{\text{Resource}}$. The paper establishes a benchmark suite of three document datasets (PubMed, Chn, Legal) with page images, bounding box annotations, raw PDF files, and PDF rendering layers. Two of three datasets (PubMed and Chn) are publicly released.
What is the motivation?
Document object detection requires large labeled training sets, but documents vary widely in layout, language, and genre. Constructing a comprehensive labeled dataset for every document domain is infeasible. The authors identify two key gaps:
- No cross-domain DOD benchmark exists. Existing DOD datasets annotate only certain object types (tables, formulas) and do not provide auxiliary data like PDF rendering layers that could help bridge domain gaps.
- Cross-domain object detection methods from natural images are suboptimal for documents. Document objects have extreme variance in aspect ratio and scale (a page number vs. a full-page table), and documents contain modular elements where context provides limited inpainting cues, unlike natural scenes.
The paper argues that adapting cross-domain techniques from natural image detection requires both document-specific data assets and document-specific alignment strategies.
What is the novelty?
Benchmark Suite
Three datasets sharing the same 5-class label space (text, list, heading, table, figure):
| Dataset | Pages | Domain | Language | Annotation |
|---|---|---|---|---|
| PubMed | 12,871 | Scientific journals | English | Auto (from PubLayNet) |
| Chn | 8,005 | Synthesized documents | Chinese | Auto (from tags) |
| Legal | 19,355 | Legal reports | English | Human |
Each dataset includes page images, bounding box annotations, raw PDF files, and PDF rendering layers (text, vector, raster). The authors re-processed PubLayNet’s “list” annotations, splitting whole-list bounding boxes into individual items.
Cross-Domain DOD Method
Built on FPN with ResNet-101 backbone, the method introduces three adversarial alignment modules:
Feature Pyramid Alignment (FPA): Four binary domain classifiers $\{D_1, D_2, D_3, D_4\}$ operating on feature pyramid layers $\{P_1, P_2, P_3, P_4\}$. Each classifier predicts whether a pixel belongs to the source or target domain, trained adversarially via gradient reversal:
$$L_p = -\frac{1}{4W^sH^s}\sum_{i=1}^{4}\sum_{w=1}^{W^s}\sum_{h=1}^{H^s}\log(D_i(P_{i,w,h}^s)) - \frac{1}{4W^tH^t}\sum_{i=1}^{4}\sum_{w=1}^{W^t}\sum_{h=1}^{H^t}\log(1 - D_i(P_{i,w,h}^t))$$
The authors note that since each pyramid layer mixes high- and low-level features, aligning pyramids jointly addresses both semantic levels simultaneously, unlike prior methods that align them separately.
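The gradient reversal trick makes the domain classifiers adversarial without an explicit min-max loop; a common PyTorch formulation (not the authors' released code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; scales gradients by -lambda on the
    backward pass, so the backbone learns to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage inside FPA: domain_logits = D_i(grad_reverse(P_i))
```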
Region Alignment (RA): A focal-loss-weighted binary domain classifier on region proposals. Unlike prior work that treats all proposals equally, the focal loss focuses on hard-to-align proposals:
$$L_r = -\frac{1}{R}\sum_{i=1}^{R}(1 - D_r(r_i^s))^\gamma \log(D_r(r_i^s)) - \frac{1}{R}\sum_{i=1}^{R}(D_r(r_i^t))^\gamma \log(1 - D_r(r_i^t))$$
with $\gamma = 5.0$.
Rendering Layer Alignment (RLA): A DeepLab-V2-style segmentation network that predicts pixel-level rendering layer masks (background, text, raster; $C = 3$) from the FPN feature map $C_4$. Since rendering layers are available for both source and target PDFs, this auxiliary segmentation task provides domain-consistent supervision. The vector layer is merged into background because vector drawings are too thin to carry reliable semantics.
The total training loss combines the supervised source-domain detection loss with the three alignment losses:
$$\mathcal{L} = \mathcal{L}_{det}^s + \lambda_1 L_p + \lambda_2 L_r + \lambda_3 L_s$$
where $L_s$ is the RLA segmentation loss.
At inference, all alignment modules are removed and only the standard FPN detector remains.
What experiments were performed?
Cross-Domain Transfer Directions
Six transfer experiments across three datasets (each used as both source and target):
- Legal $\leftrightarrow$ Chn (cross-lingual)
- Chn $\leftrightarrow$ PubMed (cross-lingual)
- Legal $\leftrightarrow$ PubMed (cross-category, both English)
Baselines
- FRCNN (source-only): Faster R-CNN trained only on source labels, no adaptation.
- FPN (source-only): Feature Pyramid Network trained only on source labels.
- SWDA: Strong-Weak Distribution Alignment (Saito et al., CVPR 2019), a cross-domain detector for natural images.
- Oracle: FPN trained with labeled target data (upper bound).
Ablation Study (Legal $\rightarrow$ PubMed)
| Configuration | mAP |
|---|---|
| FPN (source-only) | 64.9 |
| FPN + FPA | 66.5 |
| FPN + FPA + RA | 68.6 |
| FPN + FPA + RA + RLA | 70.7 |
RLA was also tested as a plug-in on SWDA, boosting its mAP from 65.3 to 68.7 (+3.4).
Natural Image Transfer
Without RLA (which requires PDF data), FPA + RA were tested on Kitti $\leftrightarrow$ Cityscape car detection, outperforming SWDA (73.3 vs. 70.6 on Cityscape $\rightarrow$ Kitti).
Metric
Mean Average Precision (mAP) at IoU threshold 0.5.
What are the outcomes/conclusions?
Key results
- The full method (FPA + RA + RLA) consistently outperforms all baselines across all six transfer directions.
- Best improvements appear on Legal $\rightarrow$ PubMed (+5.8 mAP over FPN baseline) and PubMed $\rightarrow$ Chn (+20.5 mAP over FPN baseline).
- RLA is a modular component that improves any detector when PDF rendering layers are available.
- PubMed-as-source transfers show the largest domain adaptation gains, likely because PubMed’s homogeneous formatting (scientific journals with similar templates) limits its diversity, making the source-only model especially brittle on out-of-domain targets.
Limitations
- Large oracle gap persists. The best adapted results (e.g., 70.7 on Legal $\rightarrow$ PubMed) remain well below oracle performance (86.1), indicating domain shift is far from fully addressed.
- RLA requires PDF access. The rendering layer module only works when raw PDFs are available for both domains, which excludes scanned or camera-captured documents. This limits the method’s applicability to born-digital PDFs.
- Legal dataset not released. Only PubMed and Chn are publicly available; the Legal dataset (the largest at 19K pages) is withheld due to company policy.
- Limited class diversity. All datasets share the same 5 coarse classes. The benchmark does not test fine-grained or overlapping label spaces.
- No comparison with self-training methods. The related work discusses self-training as an alternative paradigm for cross-domain detection, but no self-training baselines are included in experiments.
- Single evaluation metric. Only mAP@0.5 is reported; no mAP@[0.5:0.95] or per-class analysis of failure modes.
Reproducibility
Models
- Backbone: ResNet-101 (pre-trained on ImageNet), fine-tuned end-to-end. Total parameter count not reported.
- Detection framework: FPN with RPN (standard Faster R-CNN two-stage pipeline).
- FPA: Four binary domain classifiers, each with three conv layers (kernel size 1, padding 0). ReLU after first two, Sigmoid after last. Classifiers share structure but not weights.
- RA: Three FC layers with ReLU + Dropout after the first two.
- RLA: DeepLab-V2 architecture predicting rendering layer segmentation from $C_4$ features.
- Initialization strategy for alignment modules not reported.
- No pretrained weights released.
Algorithms
- Optimizer: SGD, initial learning rate 0.001, divided by 10 every 8 epochs, total 12 epochs. Momentum and weight decay not reported.
- Batch size not reported.
- Loss weights: $\lambda_1 = \lambda_2 = 0.1$, $\lambda_3 = 0.01$.
- Focal loss parameter: $\gamma = 5.0$.
- Image shorter side resized to 600 pixels.
- No mention of warmup, gradient clipping, mixed precision, or other training tricks.
- At inference, FPA/RA/RLA modules are removed; only the standard FPN remains.
Data
- PubMed: 12,871 pages (9,653 train / 3,218 test) from PubLayNet. “List” annotations re-processed to split whole-list boxes into individual items.
- Chn: 8,005 pages (5,000 train / 3,005 test) synthesized from Chinese Wikipedia via layout/style randomization.
- Legal: 19,355 pages (9,684 train / 9,671 test) of human-annotated legal reports. Not publicly released due to company policy. Inter-annotator agreement not reported.
- PubMed and Chn available via Google Drive (Pascal VOC format) from the GitHub repo. Legal is withheld. The repo is MIT-licensed, but no explicit dataset license is stated.
- All datasets provide page images, bounding box annotations, raw PDFs, and rendering layers.
Evaluation
- mAP at IoU 0.5, per-class AP for 5 classes.
- No error bars, significance tests, or multi-run statistics.
- Oracle (supervised target training with FPN) provides an upper bound for each transfer direction.
- SWDA baseline was re-implemented by the authors. Compute budget parity across baselines is not discussed.
Hardware
- No training hardware, GPU hours, or memory requirements reported.
- No inference latency or throughput measurements.
- Implementation in PyTorch. The ResNet-101 + FPN backbone is a standard architecture that can run on a single consumer GPU at inference.
BibTeX
@inproceedings{li2020crossdomain,
title={Cross-Domain Document Object Detection: Benchmark Suite and Method},
author={Li, Kai and Wigington, Curtis and Tensmeyer, Chris and Zhao, Handong and Barmpalios, Nikolaos and Morariu, Vlad I. and Manjunatha, Varun and Sun, Tong and Fu, Yun},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={12912--12921},
year={2020},
doi={10.1109/CVPR42600.2020.01293}
}
DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer
TL;DR
DocSegTr is a hybrid CNN-transformer architecture for bottom-up instance-level document layout segmentation. It combines a ResNeXt-101-FPN backbone with a twin-attention transformer module and dynamic convolution for mask generation, avoiding dependence on bounding box detection. The authors report mAP scores of 89.4 (PubLayNet), 40.3 (PRImA), 83.4 (Historical Japanese), and 93.3 (TableBank), competitive with or slightly above top-down Mask R-CNN baselines on three of four benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new transformer-based architecture for document instance segmentation. The paper introduces a twin-attention mechanism adapted from Guo et al. for document-specific use, combined with a bottom-up dynamic mask prediction strategy. Ablation studies and baseline comparisons dominate the experimental sections.
Secondary: None. While the paper benchmarks across four datasets with per-category AP breakdowns and ablation studies, this evaluation work is standard method-paper validation rather than a standalone evaluation contribution (no new metrics, protocols, or test suites are proposed).
What is the motivation?
Document layout analysis has traditionally relied on top-down approaches (Faster R-CNN, Mask R-CNN) that first detect bounding boxes, then predict segmentation masks within them. The authors identify three problems with this paradigm:
- Overlapping objects: Bounding box detection struggles with overlapping layout elements (e.g., captions overlapping figures), which are common in complex document layouts.
- Computational cost: The two-stage detect-then-segment pipeline is expensive at inference time.
- Limited global reasoning: CNNs capture local features effectively but lack the ability to model long-range dependencies between distant layout elements on a page.
The authors propose a bottom-up approach that generates instance masks directly without bounding box detection, using transformers for global context aggregation.
What is the novelty?
Architecture Overview
DocSegTr has three components:
- CNN backbone with FPN: ResNeXt-101 with Deformable Convolutional Networks (DCN) and Feature Pyramid Networks (FPN) for multi-scale local feature extraction (P2 through P6).
- Twin-attention transformer: A sparse attention mechanism that decomposes the standard self-attention matrix into sequential row-wise and column-wise attention computations.
- Layerwise Feature Aggregation Module (LFAM): Combines local FPN features with global transformer features to produce a unified mask feature map.
Twin Attention
The key efficiency contribution is the twin-attention mechanism adapted from Guo et al. Given feature maps $f_i \in \mathbb{R}^{h \times w \times c}$, the approach divides them into $n \times n$ patches, then computes attention along rows and columns separately before combining through a global attention module. This reduces complexity from $\Theta((h \times w)^2)$ to $\Theta(h \times w^2 + w \times h^2)$.
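A single-head sketch of the decomposition on an $n \times n$ grid (no learned projections and no follow-up global attention module, just the row/column factorization):

```python
import torch
import torch.nn.functional as F

def twin_attention(x):
    """x: (n, n, c) grid of patch features. Attend within rows, then
    within columns: O(n * n^2) per axis vs. O((n * n)^2) for full
    self-attention over the flattened grid."""
    scale = x.size(-1) ** -0.5
    # Row-wise: each of the n rows is a length-n sequence.
    a = F.softmax(x @ x.transpose(1, 2) * scale, dim=-1)   # (n, n, n)
    x = a @ x
    # Column-wise: transpose so columns become rows, then repeat.
    xt = x.transpose(0, 1)
    a = F.softmax(xt @ xt.transpose(1, 2) * scale, dim=-1)
    return (a @ xt).transpose(0, 1)
```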
Dynamic Mask Prediction
Instead of predicting masks inside bounding boxes, DocSegTr uses dynamic convolution. The transformer outputs two heads:
- Category head: An MLP that predicts semantic class for each patch ($n \times n \times q_c$, where $q_c$ is the number of classes).
- Kernel head: A linear layer that generates convolution kernels ($n \times n \times b$, where $b = \theta^2 \times c$).
The final mask is produced by convolving the unified feature map with the generated kernels:
$$M_f^{h \times w \times n \times n} = f^{h \times w \times c} \ast k^{n \times n \times b}$$
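With $\theta = 1$ (1x1 kernels, so $b = c$), the dynamic-convolution step reduces to a single `conv2d` call; a sketch (shapes and names are ours):

```python
import torch
import torch.nn.functional as F

def dynamic_masks(unified_feats, kernels):
    """unified_feats: (c, H, W) mask feature map from the LFAM.
    kernels: (n*n, c), one generated 1x1 kernel per grid cell.
    Returns (n*n, H, W): one soft mask per candidate instance."""
    weight = kernels.view(kernels.size(0), kernels.size(1), 1, 1)
    masks = F.conv2d(unified_feats.unsqueeze(0), weight)  # (1, n*n, H, W)
    return masks.squeeze(0).sigmoid()
```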
Instance predictions are post-processed with Matrix NMS. Training uses focal loss for classification:
$$\text{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$
and Dice loss for mask regularization:
$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2 |X \cap Y|}{|X| + |Y|}$$
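Both losses in compact form (binary variant, with the $\alpha_t$ handling simplified; hyperparameter defaults are the common ones, not necessarily the paper's):

```python
import torch

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """p: predicted probabilities in (0, 1); target: {0, 1} float tensor."""
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (-alpha_t * (1 - p_t) ** gamma * p_t.clamp_min(1e-8).log()).mean()

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Soft Dice on probability masks: 1 - 2|X ∩ Y| / (|X| + |Y|)."""
    inter = (pred_mask * gt_mask).sum()
    return 1 - (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)
```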
What experiments were performed?
Datasets
| Dataset | Train Instances | Eval Instances | Categories |
|---|---|---|---|
| PubLayNet | 3,263,046 | 120,761 | 5 (Text, Title, List, Figure, Table) |
| PRImA | 8,068 | 1,891 | 6 (Text, Image, Table, Math, Separator, Other) |
| Historical Japanese (HJ) | 181,097 | 31,866 | 7 (Body, Row, Title, Bio, Name, Position, Other) |
| TableBank | 2,835 | 1,418 | 1 (Table) |
Baselines
DocSegTr is compared against two baselines:
- LayoutParser (Detectron2-based Mask R-CNN)
- Biswas et al. (prior work from the same group, also Mask R-CNN)
Ablation Study (on PRImA)
The ablation on the PRImA dataset demonstrates component contributions:
| Configuration | AP | AP@0.5 | AP@0.75 |
|---|---|---|---|
| ResNet-101-FPN | 20.12 | 31.32 | 16.78 |
| ResNeXt-101-FPN | 32.59 | 58.62 | 29.73 |
| ResNeXt-101-FPN + DCN | 33.21 | 49.13 | 27.39 |
| Without transformer | 5.21 | 7.12 | 3.22 |
| Transformer without self-attention | 29.14 | 41.23 | 20.22 |
| DocSegTr (full) | 40.31 | 59.72 | 29.54 |
Note: the DCN row is anomalous. Adding DCN improves AP (32.59 to 33.21) but decreases AP@0.5 (58.62 to 49.13) and AP@0.75 (29.73 to 27.39). The paper does not address this inconsistency. One possible explanation is that DCN shifts the AP distribution across IoU thresholds, improving at mid-range thresholds while degrading at the standard 0.5 and 0.75 cutoffs, but this is speculative.
Key Results (mAP)
| Dataset | LayoutParser | Biswas et al. | DocSegTr |
|---|---|---|---|
| PubLayNet | 88.8 | 89.0 | 89.4 |
| PRImA | 37.7 | 39.9 | 40.3 |
| Historical Japanese | 79.4 | 84.0 | 83.4 |
| TableBank | 91.2 | 91.7 | 93.3 |
What are the outcomes/conclusions?
DocSegTr achieves competitive or better mAP compared to Mask R-CNN baselines on three of four benchmarks. The results suggest that transformer-based global context aggregation is particularly beneficial for larger document objects (tables, figures, lists), where long-range spatial reasoning matters. The model underperforms on smaller regions (e.g., text blocks in PRImA), which the authors acknowledge as a limitation.
Notable observations:
- Larger objects benefit most: Tables, figures, and lists consistently outperform the baselines, while smaller elements like text and separators see less improvement or regressions.
- Efficiency (claimed): The twin-attention mechanism is designed to reduce computational cost relative to standard self-attention, though the paper provides no latency or throughput measurements to confirm this.
- Transfer robustness: Pre-trained PubLayNet weights transferred to PRImA without fine-tuning yield 15% AP, suggesting some cross-domain capability.
- Historical Japanese is close: DocSegTr slightly trails on HJ (83.4 vs. 84.0), likely because the dataset has many small, densely packed elements where CNNs retain an advantage.
Limitations
- Narrow baseline comparison: Only two baselines (both Mask R-CNN variants) are compared. No comparison with other bottom-up or non-Mask-RCNN approaches.
- Small object performance: The authors acknowledge that transformers do not improve segmentation of smaller layout regions.
- No DocLayNet evaluation: The paper predates DocLayNet (2022) and does not evaluate on it, limiting comparisons with later work.
- PRImA dataset is small: The ablation study is conducted on PRImA (~300 pages), which may not generalize well to conclusions about transformer behavior on larger datasets.
- No speed benchmarks: Despite claiming computational efficiency, the paper does not report inference latency or FPS comparisons with the baselines.
Reproducibility
Models
- Architecture: ResNeXt-101-FPN backbone with DCN, twin-attention transformer layers, LFAM, two functional heads (category + kernel), and dynamic convolution for mask prediction.
- Parameter count: Not reported.
- Pre-trained weights: Available on the GitHub repo for PRImA, Historical Japanese, and TableBank. PubLayNet weights are not listed. All three checkpoints are hosted on institutional SharePoint (CVC UAB), which raises long-term durability concerns compared to archival platforms like Zenodo or HuggingFace.
- Framework dependencies: Requires Detectron2 v0.2.1 and AdelaiDet built from source, adding friction to reproduction.
Algorithms
- Optimizer: SGD with learning rate 0.001, Nesterov momentum 0.9, weight decay $10^{-5}$
- Schedule: 300K iterations with warm-up (1,000 steps); LR drops by $10\times$ at 210K and 250K iterations
- Batch size: 8
- Loss: Focal loss (classification) + Dice loss (mask regularization)
- Post-processing: Matrix NMS for instance deduplication
Data
- PubLayNet, PRImA, Historical Japanese, and TableBank are all publicly available datasets.
- No custom data augmentation is described beyond standard practices.
- Train/eval splits follow the original dataset definitions.
Evaluation
- Metric: COCO-style mAP (IoU thresholds 0.5 to 0.95, step 0.05), plus per-category AP
- Baselines: LayoutParser and Biswas et al. (both Mask R-CNN). Comparisons appear fair (same datasets, same metrics).
- No error bars or multi-run statistics reported.
Hardware
- Training: $2\times$ NVIDIA A40 GPUs (48 GB each), training takes 4-5 days
- Framework: PyTorch + Detectron2 v0.2.1
- Inference hardware: Not specified. The paper claims computational efficiency from the twin-attention mechanism but reports no inference latency, FPS, or throughput benchmarks to substantiate this.
BibTeX
@article{biswas2022docsegtr,
title={DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer},
author={Biswas, Sanket and Banerjee, Ayan and Llad{\'o}s, Josep and Pal, Umapada},
journal={arXiv preprint arXiv:2201.11438},
year={2022}
}
Document AI: Comparing Transformers, GNNs, and CNNs for Layout Analysis
TL;DR
A comparative benchmarking study that evaluates LayoutLMv3 (transformer), Paragraph2Graph (GNN), and YOLOv5 (CNN) on document layout analysis across DocLayNet and GROTOAP2 datasets, framing the task as both image-centric (object detection) and text-centric (token classification). The study also tests whether machine translation can improve LiLT’s cross-lingual zero-shot transfer on FUNSD/XFUND forms; the answer is no.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The paper proposes no new model or method. Its contribution is a head-to-head comparison of three existing architectures (LayoutLMv3, Paragraph2Graph, YOLOv5) across two benchmarks using standard metrics (mAP, F1). The cross-lingual machine translation experiment is also evaluative in nature.
Secondary: $\Psi_{\text{Systematic}}$. The paper organizes the DLA landscape along two axes: architecture type (transformer vs. graph vs. CNN) and task framing (image-centric vs. text-centric), providing a structured overview of related work.
What is the motivation?
The authors identify three gaps in document layout analysis research:
- No cross-architecture comparison. Prior work developed transformer-based, graph-based, and CNN-based models independently, but no study directly compared their effectiveness on the same benchmarks.
- Limited dataset coverage. Existing comparisons focused on narrow datasets (often PubLayNet or DocBank). The authors target DocLayNet (diverse, multi-domain) and GROTOAP2 (scientific articles, 22 classes) for broader coverage.
- Unexplored machine translation for cross-lingual DLA. Language-independent models like LiLT showed promise for zero-shot cross-lingual transfer, but no prior work tested whether translating documents to the training language (English) could improve performance.
What is the novelty?
The novelty is primarily in the experimental design rather than in any algorithmic contribution:
- First direct comparison of LayoutLMv3, Paragraph2Graph, and YOLOv5 on DocLayNet and GROTOAP2 under both image-centric and text-centric framings.
- First use of machine translation (M2M100) as a preprocessing step for cross-lingual document layout analysis with LiLT.
The evaluation metrics are standard:
$$ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
for image-centric tasks (COCO-style, IoU thresholds $[0.5, 0.95]$ with step 0.05), and F1-score for text-centric token classification.
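With pycocotools, the image-centric protocol reduces to a few calls (file names are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")            # ground truth in COCO format
coco_dt = coco_gt.loadRes("detections.json")  # [{image_id, category_id, bbox, score}, ...]

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # first line is AP @ IoU [0.5:0.95], the mAP used here
```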
What experiments were performed?
Experiment 1: Image-Centric DLA (Object Detection)
GROTOAP2 (first pages only, 22 classes):
| Model | mAP | Training Time | Hardware |
|---|---|---|---|
| LayoutLMv3 | 0.751 | ~3 days (7.5 epochs) | 1 GPU |
| YOLOv5s | 0.725 | ~2.5 hours (75 epochs) | 1 GPU |
YOLOv5 outperformed LayoutLMv3 on 14 of 22 individual classes. LayoutLMv3 was stronger on glossary, tables, figures, unknown elements, and page numbers.
DocLayNet (11 classes): The authors compared LayoutLMv3 against previously reported numbers for YOLOv5 and Paragraph2Graph from prior work. Exact per-class numbers are reported in the paper. The discussion concludes YOLOv5 is most practical for industry use due to speed.
Experiment 2: Text-Centric DLA (Token Classification)
GROTOAP2 (first pages):
| Model | F1 | Training Time |
|---|---|---|
| LayoutLMv3 | 0.866 | ~2 days (6 epochs) |
| Paragraph2Graph | 0.698 | ~7 hours (9 epochs) |
LayoutLMv3 outperformed Paragraph2Graph by a large margin, though at substantially higher computational cost.
Experiment 3: Cross-Lingual Transfer with Machine Translation
LiLT (with XLM-RoBERTa tokenizer) was fine-tuned on English FUNSD forms (4 classes), then evaluated zero-shot on XFUND forms in 7 languages. The M2M100 translation model was used to translate non-English forms to English before inference.
| Method | EN | DE | IT | ZH | JA | ES | FR | PT |
|---|---|---|---|---|---|---|---|---|
| LiLT (direct) | 0.48 | 0.49 | 0.38 | 0.35 | 0.37 | 0.39 | 0.50 | 0.44 |
| LiLT + MT | - | 0.43 | 0.33 | 0.34 | 0.30 | 0.33 | 0.43 | 0.35 |
Machine translation consistently degraded performance across all languages. The authors attribute this to token-level translation errors disrupting layout alignment.
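For reference, the translate-then-infer preprocessing amounts to something like the sketch below, using the standard Hugging Face M2M100 interface. The per-token granularity mirrors the paper's description; the example words and language code are illustrative.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")

def translate_tokens(words, src_lang):
    """Translate each form token to English, keeping its bounding-box slot.

    Per-token translation preserves the 1:1 word/box alignment LiLT needs,
    but it strips away sentence context, which is the failure mode the
    authors blame for the degraded results.
    """
    tokenizer.src_lang = src_lang
    translated = []
    for word in words:
        enc = tokenizer(word, return_tensors="pt")
        out = model.generate(
            **enc, forced_bos_token_id=tokenizer.get_lang_id("en")
        )
        translated.append(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
    return translated

# Illustrative German form tokens:
english = translate_tokens(["Rechnungsnummer", "Datum"], src_lang="de")
```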
What are the outcomes/conclusions?
- YOLOv5 is the most practical choice for image-centric DLA. While LayoutLMv3 achieves marginally higher mAP on GROTOAP2, YOLOv5 trains in hours (vs. days) on a single GPU and runs inference on 1,300 pages in under a minute (vs. ~7 minutes for LayoutLMv3).
- LayoutLMv3 substantially outperforms Paragraph2Graph in text-centric DLA. With an F1 of 0.866 vs. 0.698 for Paragraph2Graph, the gap is substantial, though the authors note LayoutLMv3’s CC-BY-NC-SA license makes it unsuitable for commercial use. They suggest LiLT as a viable alternative.
- Machine translation does not help cross-lingual DLA. Translating documents to English before applying LiLT performed worse than direct zero-shot transfer in every language tested. Token-level translation introduces alignment errors that outweigh any linguistic benefit.
Limitations the authors acknowledge:
- Paragraph2Graph was not evaluated on GROTOAP2 for image-centric tasks (left as future work).
- No hyperparameter tuning for YOLOv5; only YOLOv5s (smallest variant) was tested.
- The GROTOAP2 experiments used only first pages of documents, which biases toward front-matter elements.
- Machine translation was only tested at token level; sequence-level classification might yield different results.
Limitations not acknowledged:
- All comparisons use models “as-is” from prior work with minimal tuning, making it unclear whether performance gaps reflect fundamental architectural differences or just different optimization effort.
- The FUNSD/XFUND experiments report F1 scores below 0.50 even in English, suggesting the 4-class setup or fine-tuning protocol may be suboptimal, limiting the interpretability of the cross-lingual comparison.
- DocLayNet results partially rely on numbers from prior work rather than consistent re-evaluation.
Reproducibility
Models
- LayoutLMv3: Pre-trained on IIT-CDIP (11M images). Uses RoBERTa for text, DiT for vision. 133M params (base). CC-BY-NC-SA-4.0.
- Paragraph2Graph: Language-independent GNN. Exact parameter count not reported.
- YOLOv5s: Smallest YOLOv5 variant. Pre-trained on COCO. Open-source (GPL-3.0 via Ultralytics).
- LiLT: Language-independent layout transformer with XLM-RoBERTa tokenizer. MIT license.
- M2M100: 1.2B-parameter many-to-many translation model from Meta.
Algorithms
- Text-centric LayoutLMv3: LR 1e-5 (from the original paper), trained until convergence (~6 epochs on GROTOAP2); a configuration sketch follows this list.
- Text-centric Paragraph2Graph: LR 0.001, weight decay 0.005 (from original paper), trained until convergence (~9 epochs).
- Image-centric LayoutLMv3: LR 2e-4, batch size 2, 100 epochs max with early stopping.
- Image-centric YOLOv5s: Default configuration, no hyperparameter tuning, trained until convergence (~75 epochs).
- Convergence determined by monitoring training/validation loss and mAP on validation data.
- The released code consists of Jupyter notebooks for individual experiments, not a unified training framework.
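A rough sketch of the text-centric LayoutLMv3 setup via the Hugging Face interface (the checkpoint name, label index, and toy page are assumptions for illustration; the released notebooks remain the authoritative reference):

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=22  # GROTOAP2 zone classes
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # LR from the paper

# Toy single-page example: words, 0-1000 normalized boxes, per-word labels.
page = Image.new("RGB", (800, 1000), "white")
words = ["Abstract"]
boxes = [[60, 40, 220, 70]]
labels = [3]  # hypothetical index of the abstract-text class

enc = processor(page, words, boxes=boxes, word_labels=labels,
                return_tensors="pt")
loss = model(**enc).loss  # token-classification cross-entropy
loss.backward()
optimizer.step()
```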
Data
- DocLayNet: 80,863 pages (69,375 train / 6,489 val / 4,999 test), 11 classes. Diverse domains (financial, tenders, laws, manuals, patents, scientific). 95% English.
- GROTOAP2: Originally 13,210 documents / 119,334 pages. Authors used only first pages: 10,492 train / 1,310 val / 1,317 test. 22 classes. Scientific articles only. Converted from XML to COCO format for image-centric tasks and preprocessed for token classification.
- FUNSD: 149 train + 50 test English forms, 4 classes.
- XFUND: Multilingual forms in 7 languages, 4 classes.
Evaluation
- Image-centric: mAP @ IoU [0.5:0.05:0.95] (COCO-style).
- Text-centric: F1-score.
- No error bars, confidence intervals, or multi-run statistics reported.
- GROTOAP2 results are reported on the validation set, not the held-out test set.
- DocLayNet image-centric results for YOLOv5 and Paragraph2Graph taken from prior publications, not re-run.
Hardware
- All experiments on a single GPU (type not specified).
- YOLOv5s: ~2.5 hours for 75 epochs on GROTOAP2.
- LayoutLMv3: ~3 days for 7.5 epochs (image-centric) or ~2 days for 6 epochs (text-centric) on GROTOAP2.
- YOLOv5 inference: <1 min for 1,300 images. LayoutLMv3 inference: ~7 min for 1,300 images.
BibTeX
@article{kastanas2023document,
title={Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis},
author={Kastanas, Sotirios and Tan, Shaomu and He, Yi},
journal={arXiv preprint arXiv:2308.15517},
year={2023}
}
Document Layout Analysis: A Comprehensive Survey
TL;DR
BinMakhashen and Mahmoud survey 79 document layout analysis studies, organizing the field around a general DLA framework consisting of preprocessing, analysis parameter estimation, layout analysis (bottom-up, top-down, hybrid), post-processing, and performance evaluation. Bottom-up methods dominate the literature (61% of reviewed papers) due to their flexibility with complex layouts, while top-down methods remain useful for structured documents. The survey predates the shift to deep learning in DLA (PubLayNet, LayoutLM, DETR-based detectors all appeared in 2019-2020), making it a useful historical reference for the classical foundations the field built upon.
What kind of paper is this?
Dominant: $\Psi_{\text{Systematic}}$
This is a survey paper. The headline contribution is a unifying DLA framework that organizes the field into preprocessing, analysis strategies (bottom-up, top-down, hybrid), post-processing, and evaluation. The paper reviews 79 studies and categorizes them by strategy, document type, language, and layout complexity. Most of the paper is devoted to organizing and explaining prior work rather than introducing new methods or datasets.
Secondary: $\Psi_{\text{Evaluation}}$
The survey dedicates a full section to performance evaluation, distinguishing three levels of metrics: pixel-level (PLEF), region-level (REF, IoU), and customizable (CEF, used in ICDAR competitions). It provides formal definitions for each and discusses their strengths and limitations.
What is the motivation?
The authors identify several gaps that motivate the survey:
- No unified framework. DLA algorithms vary widely in their processing pipelines, making it difficult to compare methods or understand where a new contribution fits. Earlier surveys focused on narrow topics (texture-based methods, skew detection, printed-document DLA) rather than the full pipeline.
- Historical documents neglected. Most prior surveys concentrated on contemporary printed documents, despite historical manuscripts presenting harder layout challenges (arbitrary layouts, degraded text, multi-script content).
- Evaluation inconsistency. Many early DLA methods used subjective evaluation or incompatible metrics, hindering progress. The paper aims to consolidate the evaluation landscape.
What is the novelty?
The paper’s primary contribution is the general DLA framework (Figure 2 in the paper), which decomposes DLA into five phases:
- Preprocessing: Skew detection/correction and binarization.
- Analysis parameter estimation: Static (predefined thresholds for structured layouts) vs. dynamic (data-driven measurements for heterogeneous documents).
- Layout analysis: The core segmentation step, divided into three strategies:
- Bottom-up (61% of 79 papers): Starts from pixels or connected components, grows to regions. Includes connected component analysis (Docstrum), texture analysis (Gabor, autocorrelation), machine learning (both non-deep and deep), Voronoi diagrams, and Delaunay triangulation.
- Top-down (29%): Starts from page-level and splits into regions. Includes texture-based, RLSA, projection profile (X-Y cut), and whitespace analysis.
- Hybrid (6%): Combines both strategies.
- Post-processing: Optional refinement (clustering, morphological cleaning, human interaction).
- Performance evaluation: Three tiers of metrics.
The survey also adopts and extends Kise’s document layout taxonomy, distinguishing six layout types: regular, Manhattan, non-Manhattan, multi-column Manhattan, arbitrary/complex, and overlapping. The extension to historical manuscripts with arbitrary layouts is the authors’ contribution; the base categorization for printed documents originates from Kise [82].
Evaluation Framework Taxonomy
The paper organizes DLA evaluation into three levels:
Pixel-Level Evaluation (PLEF): Counts pixel matches between segmentation and ground truth. The match score between ground-truth region $G_j$ and segmented region $R_i$ over image foreground $I$ is:
$$ \text{MS}(i, j) = \alpha \frac{\text{T}(G_j \cap R_i \cap I)}{\text{T}((G_j \cup R_i) \cap I)}, \quad \alpha = \begin{cases} 1, & \text{if } g_j = r_i \\ 0, & \text{otherwise} \end{cases} $$
where $\text{T}(\cdot)$ counts foreground pixels and $g_j$, $r_i$ are the labels of the ground-truth and segmented regions, respectively. The match table is aggregated into detection rate (DR) and recognition rate (RR) via weighted one-to-one, one-to-many, and many-to-one correspondences, then combined as $F_{\text{measure}} = 2 \cdot \text{DR} \cdot \text{RR} / (\text{DR} + \text{RR})$.
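A minimal numpy sketch of the match score for one region pair under these definitions (mask construction and the DR/RR aggregation are omitted):

```python
import numpy as np

def match_score(g_mask, r_mask, fg_mask, g_label, r_label):
    """MS(i, j): pixel match between ground-truth region G_j (g_mask) and
    segmented region R_i (r_mask), counted only over foreground fg_mask."""
    if g_label != r_label:  # alpha = 0 when the region labels disagree
        return 0.0
    inter = np.count_nonzero(g_mask & r_mask & fg_mask)
    union = np.count_nonzero((g_mask | r_mask) & fg_mask)
    return inter / union if union else 0.0
```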
Region-Level Evaluation (REF): Uses IoU borrowed from computer vision. For each class $i$:
$$ \text{IoU}_i = \frac{\text{T}(G_i \cap R_i)}{\text{T}(G_i \cup R_i)} $$
The mean IoU averages over $N$ classes ($\text{mIoU} = \frac{1}{N} \sum_i \text{IoU}_i$). A frequency-weighted variant scales each class's IoU by its ground-truth pixel count $t_i = \text{T}(G_i)$.
Customizable Evaluation Framework (CEF): Introduced by Antonacopoulos and Bridson (ICDAR 2007), this framework transforms regions into interval representations, establishes correspondence, then classifies errors as merge, split, miss, partial miss, or false detection. Each error type $i$ has a weighted error rate $ER_i$ (the affected area multiplied by an application-specific weight $\alpha_i$). The per-type weight is:
$$ w_i = \frac{(N - 1) \cdot ER_i + 1}{N} $$
where $N$ is the number of error types considered. The overall success rate is then:
$$ SR = \frac{\sum_{i=1}^{N} w_i}{\sum_{i=1}^{N} \frac{w_i}{1 - ER_i}} $$
This design allows practitioners to customize error severity by adjusting $\alpha_i$ for different analysis objectives (e.g., penalizing column merges more heavily for OCR applications).
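The aggregation follows directly from the two formulas. A small sketch, assuming the weighted error rates $ER_i$ have already been computed from affected areas and application weights $\alpha_i$:

```python
def cef_success_rate(error_rates):
    """CEF success rate from per-type weighted error rates ER_i
    (merge, split, miss, partial miss, false detection)."""
    n = len(error_rates)
    weights = [((n - 1) * er + 1) / n for er in error_rates]  # w_i
    return sum(weights) / sum(w / (1 - er)
                              for w, er in zip(weights, error_rates))

# Five error types, already scaled by application-specific alpha_i:
sr = cef_success_rate([0.02, 0.05, 0.01, 0.03, 0.00])  # ~0.977
```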
What experiments were performed?
This is a survey; no new experiments are conducted. The paper collects and compares reported results from the literature across multiple axes:
- Strategy distribution: Bottom-up (61%), top-down (29%), hybrid (6%), other (4%).
- Document type: Printed (58%), handwritten (33%), other (9%).
- Language: English (41%), Arabic (16%), German (9%), Latin (9%), French (8%), and others.
- Layout complexity: Manhattan (37%), multi-column (35%), complex (28%).
Table 1 compares skew detection methods by angle range, error, document type, and language. Table 2 catalogues 79 DLA algorithms by strategy, method type, document properties, and output. Table 4 collects quantitative results organized by evaluation metric.
What are the outcomes/conclusions?
The survey reaches several conclusions:
- Bottom-up dominates but is slow. Connected-component-based analysis is the most flexible approach for complex layouts, but it carries higher computational cost than top-down methods. A complex layout can be analyzed top-down in 0.7 seconds with 78.5% success rate; bottom-up achieves higher accuracy at the cost of longer processing time.
- Deep learning is emerging. The authors report that deep learning methods (FCNNs, U-Net variants) produced the strongest results among the surveyed papers, but require post-processing (clustering, morphological operations) and large training data. Transfer learning from ImageNet helps with convergence.
- Hybrid methods are underexplored. Only 6% of reviewed papers use hybrid strategies, despite the complementary strengths of bottom-up (precision on complex layouts) and top-down (speed on structured layouts).
- Binarization is declining for modern methods but remains critical for historical documents. Deep learning methods use full pixel intensities rather than binarized inputs.
- Evaluation needs standardization. Many early methods used subjective evaluation. The CEF framework from ICDAR competitions offers the most comprehensive approach, but adoption is inconsistent.
Limitations
- Temporal cutoff. The survey was submitted in July 2018 and published in October 2019, so it predates the rapid adoption of deep learning in DLA. PubLayNet (2019), LayoutLM (2019), Mask R-CNN on documents, and DETR-based detectors are all absent. The “deep learning” methods discussed are early FCNNs and U-Nets, not the object detection or multimodal pre-training approaches that now dominate.
- Narrow language scope. Despite noting the importance of language diversity, the survey focuses primarily on Latin and Arabic scripts. CJK, Indic, and other scripts receive minimal attention.
- No deep learning taxonomy. The bottom-up/top-down/hybrid taxonomy does not cleanly accommodate modern object detection pipelines (Faster R-CNN, YOLO, DETR), which operate at neither the pixel/CC level nor the page-splitting level.
- Dataset coverage. The dataset section (Table 3) lists only 12 datasets, all with fewer than 7K pages. The large-scale weakly-supervised datasets (PubLayNet at 360K, DocBank at 500K) that became widely adopted appeared shortly after this survey.
Reproducibility
Models
Not applicable (survey paper).
Algorithms
Not applicable. The paper reviews but does not implement algorithms.
Data
The survey catalogues 12 DLA datasets (Table 3), including:
- Historical: OHG (596 pages), DIVA-HisDB (150 pages, ICDAR 2017), Parzival (47 pages), George Washington GW20 (20 pages), Saint Gall (60 pages), IMPACT (7K pages)
- Contemporary: UW-3 (1,600 pages), PRImA (305 pages), BCE-Arabic-v1 (1,833 pages), CENIP-UCCP (400 Urdu pages), LAMP (203 mixed pages), MAURDOR (2.5K pages)
These are all small by modern standards. The survey does not discuss dataset licensing.
Evaluation
The paper reviews three evaluation frameworks (PLEF, REF, CEF) and collects quantitative results in Table 4, but does not propose new metrics or benchmarks.
Hardware
Not discussed. This is a survey, so no training or inference hardware is reported.
BibTeX
@article{binmakhashen2019document,
title={Document Layout Analysis: A Comprehensive Survey},
author={BinMakhashen, Galal M. and Mahmoud, Sabri A.},
journal={ACM Computing Surveys},
volume={52},
number={6},
pages={1--36},
year={2019},
publisher={ACM},
doi={10.1145/3355610}
}
ETD-OD: Object Detection for Parsing Electronic Theses and Dissertations
TL;DR
Ahuja et al. introduce ETD-OD, a 25K-page dataset of electronic theses and dissertations (ETDs) annotated with bounding boxes across 24 element categories, along with a complete PDF-to-XML parsing pipeline. YOLOv7 achieves 85.3 mAP@0.50, and pre-training on DocBank followed by fine-tuning on ETD-OD provides large gains for Faster R-CNN (+37.1 mAP@0.50).
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the ETD-OD dataset: 25K page images from 200 ETDs with ~100K bounding-box annotations across 24 document element categories. The dataset targets a document type (theses/dissertations) not covered by existing layout datasets.
Secondary: $\Psi_{\text{Method}}$. The paper also presents an end-to-end PDF-to-XML parsing pipeline (object detection, text extraction, caption-to-figure matching, XML structuring) and benchmarks four detector configurations.
What is the motivation?
Electronic theses and dissertations contain substantial scholarly knowledge but are locked in unstructured PDFs, limiting their use in search, retrieval, summarization, and question-answering systems. Existing document layout datasets and parsing tools focus on shorter research papers (PubLayNet, DocBank, GROBID), which differ from ETDs in several ways:
- ETDs are long-form documents (hundreds of pages) with distinct structural elements: committee pages, degree metadata, lists of contents, and back matter.
- ETDs lack universal formatting; different institutions and disciplines impose different requirements and include domain-specific elements (algorithms in CS, equations in mathematics).
- Layout differences (single vs. double column, wider margins, larger font sizes) mean models trained on research paper datasets do not transfer well.
No existing dataset addressed these gaps at the time of publication.
What is the novelty?
ETD-OD Dataset
The dataset contains 25,073 page images from 200 ETDs sampled across institutions, degree types, and academic domains. Pages are annotated with bounding boxes across 24 categories organized into five groups:
| Group | Categories |
|---|---|
| Metadata (6) | Title, Author, Date, University, Committee, Degree |
| Abstract (2) | Abstract Heading, Abstract Text |
| List of Contents (2) | LOC Heading, LOC Text |
| Main Content (12) | Chapter Title, Section, Paragraph, Figure, Figure Caption, Table, Table Caption, Equation, Equation Number, Algorithm, Footnote, Page Number |
| Bibliography (2) | Reference Heading, Reference Text |
Total annotations: 99,859 bounding boxes. Paragraphs (30,359) and page numbers (24,543) dominate; algorithms (96) are the rarest class.
Annotation was performed by 6 undergraduate students using Roboflow, with each sample validated by 2 graduate students.
PDF-to-XML Pipeline
The parsing pipeline has three stages:
- Preprocessing: Convert PDF pages to images using `pdf2image`.
- Element extraction: Apply an object detection model (Faster R-CNN or YOLO) to produce bounding boxes with class labels. Post-processing rules correct common misclassifications (e.g., abstract heading vs. chapter heading via keyword matching and position constraints).
- XML structuring: Image-based objects (figures, tables, algorithms, equations) are cropped and stored as files. Text-based objects are extracted via `pymupdf` (born-digital) or `pytesseract` (scanned). Captions are matched to figures/tables by Euclidean distance between bounding boxes (see the matching sketch after this list). Output follows a hierarchical XML schema: front matter, body (chapters with sections), back matter (references).
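A sketch of the caption-matching rule, using Euclidean distance between box centers (the paper does not specify its tie-breaking or any distance cutoff, so the greedy scheme here is an assumption):

```python
import math

def match_captions(figures, captions):
    """Greedily pair each figure with the nearest unused caption.

    figures, captions: lists of (x0, y0, x1, y1) boxes on one page.
    Returns (figure_index, caption_index) pairs.
    """
    def center(box):
        return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

    pairs, unused = [], set(range(len(captions)))
    for fi, fbox in enumerate(figures):
        best = min(unused,
                   key=lambda ci: math.dist(center(fbox), center(captions[ci])),
                   default=None)
        if best is not None:
            pairs.append((fi, best))
            unused.remove(best)
    return pairs
```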
What experiments were performed?
Four object detection configurations were trained and evaluated on an 80/20 train/validation split:
| Model | mAP@0.50 | mAP@0.50:0.95 |
|---|---|---|
| Faster R-CNN (vanilla, ResNeXt-101) | 39.1 | 19.6 |
| Faster R-CNN (DocBank pre-trained) | 76.2 | 44.0 |
| YOLOv5 | 83.4 | 52.1 |
| YOLOv7 | 85.3 | 52.7 |
Faster R-CNN models were trained for 60K iterations using Detectron2 (inference threshold 0.7). YOLO models were trained for 150 epochs.
Per-Category Results (YOLOv7, AP@0.50)
Strong categories (AP > 95): Paragraph (97.4), LOC Text (99.3), Reference Text (99.3), Figure (98.4), Footnote (98.9).
Weak categories: Page Number (51.3), Equation Number (55.0), Algorithm (66.6), Date (68.3), Degree (68.3). The authors attribute low performance to limited training samples (degree, date, algorithm) and small object sizes (page numbers, equation numbers).
Cross-Dataset Transfer
A Faster R-CNN experiment on shared categories between DocBank and ETD-OD tested three training configurations:
- DocBank only: Poor transfer to ETDs due to layout differences.
- ETD-OD only: Reasonable but not optimal.
- DocBank + ETD-OD (pre-train + fine-tune): Best results across all shared categories.
This confirms that domain-specific fine-tuning on ETD data is necessary, and that scholarly document pre-training provides a useful initialization.
What are the outcomes/conclusions?
Key findings:
- YOLOv7 achieves the best overall performance (85.3 mAP@0.50) on ETD-OD, outperforming Faster R-CNN substantially.
- Pre-training on scholarly documents (DocBank) nearly doubles Faster R-CNN’s mAP (39.1 to 76.2), confirming that domain overlap matters for document layout detection.
- The proposed XML schema and parsing pipeline provide a complete end-to-end system for converting ETD PDFs to structured representations.
- ETDs present challenges distinct from shorter research papers: metadata diversity, class imbalance (algorithms appear on < 0.4% of pages), and layout variability across institutions and disciplines.
Limitations and open questions:
- Class imbalance is severe. Algorithm (96 instances) and several metadata classes have very few annotations. The authors acknowledge this but do not apply mitigation strategies (oversampling, class-weighted loss).
- No formal inter-annotator agreement. Although each sample was validated by a graduate student, no IAA metric (e.g., Krippendorff’s Alpha or IoU agreement) is reported.
- Generalization is untested. All data comes from US institutional repositories. Performance on non-English ETDs or ETDs from non-US institutions is unknown.
- The XML pipeline is not evaluated end-to-end. The paper benchmarks object detection separately but does not measure the quality of the final XML output (e.g., text extraction accuracy, caption-to-figure matching precision).
- Small object detection remains a challenge. Page numbers and equation numbers have AP below 56, and the paper does not explore specialized techniques for small objects (e.g., high-resolution inputs, feature pyramid tuning).
- Only standard COCO metrics are reported. No error bars, multiple runs, or significance tests are provided.
Reproducibility
Models
- Faster R-CNN: ResNeXt-101 backbone, implemented via Detectron2.
- Faster R-CNN (DocBank): Pre-trained on DocBank, fine-tuned on ETD-OD. Uses the original DocBank model zoo weights and configurations.
- YOLOv5: Open-source implementation; the paper cites Jocher et al. 2022 (ultralytics/yolov5 v6.2) but does not pin a specific commit or version in the repo.
- YOLOv7: Open-source implementation; specific version not stated.
- Weights availability: The paper claims pre-trained models are available at the GitHub repository. However, as of March 2026, the repository (https://github.com/Opening-ETDs/ETD-OD) is not publicly accessible. The paper uses the URL `Opening-ETDS` (capital S) while the note uses `Opening-ETDs`; neither variant resolves to a public repository. Model weights are effectively unavailable.
Algorithms
- Faster R-CNN: 60K iterations, inference threshold 0.7. No learning rate, optimizer, batch size, or augmentation details are reported.
- YOLO models: 150 epochs. No other training hyperparameters (learning rate, optimizer, augmentation, image resolution) are reported.
- Post-processing: Rule-based corrections for abstract headings (keyword matching + position constraint: chapter heading in first 10 pages matching “abstract” keyword) and element-to-caption matching (Euclidean distance between bounding boxes, with a y-coordinate constraint for equation numbers).
Data
- Source: 200 ETDs from publicly accessible institutional repositories. Sampling stratified by degree, domain, and institution, though specific repositories are not named.
- Annotation: Roboflow platform. 6 undergraduate annotators + 2 graduate reviewers. No inter-annotator agreement metrics reported.
- Split: 80/20 train/validation. No separate test set.
- Availability: The paper states that dataset, code, and models are available at a GitHub repository. As of March 2026, this repository is not publicly accessible (neither the `Opening-ETDS` nor the `Opening-ETDs` URL variant resolves). No dataset files, annotation files, or code have been released publicly.
Evaluation
- COCO-style mAP@0.50 and mAP@0.50:0.95.
- Evaluation on validation split only; no held-out test set.
- No cross-dataset evaluation (e.g., testing on PubLayNet or DocLayNet).
- No statistical rigor: single runs, no error bars, no significance tests.
- Evaluation scripts are not available (no code in the repository).
Hardware
- Not reported. No mention of GPU type, training time, memory requirements, or inference throughput.
BibTeX
@inproceedings{ahuja-etal-2022-parsing,
title = {Parsing Electronic Theses and Dissertations Using Object Detection},
author = {Ahuja, Aman and Devera, Alan and Fox, Edward Alan},
booktitle = {Proceedings of the First Workshop on Information Extraction from Scientific Publications},
month = nov,
year = {2022},
address = {Online},
publisher = {Association for Computational Linguistics},
doi = {10.18653/v1/2022.wiesp-1.14},
pages = {121--130},
}
GLAM: A Graphical Approach to Document Layout Analysis
TL;DR
GLAM (Graph-based Layout Analysis Model) represents PDF pages as structured graphs built from parsed text-box metadata and frames document layout analysis as joint node classification and graph segmentation. At 4M parameters, GLAM is over 35$\times$ smaller than the SOTA vision models it competes with, outperforming YOLO v5x6 on 5 of 11 DocLayNet classes. A simple ensemble of GLAM and YOLO v5x6 improved the best reported DocLayNet mAP from 76.8 to 80.8.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new problem formulation (DLA as graph segmentation + node classification on parsed PDF metadata) and a corresponding GNN architecture. Ablations, SOTA comparisons, and an efficiency analysis dominate the experimental sections.
Secondary: $\Psi_{\text{Evaluation}}$. The paper includes a detailed analysis of where graph-based methods succeed and fail relative to vision-based detectors, with per-class breakdowns and a thorough discussion of bounding-box misalignment artifacts in PubLayNet that systematically degrade parser-based approaches.
What is the motivation?
Most DLA models treat document pages as images and apply standard object detection (Faster R-CNN, YOLO, DETR). This discards the structured metadata available in born-digital PDFs: exact character positions, font information, bounding boxes, and rendering instructions. These metadata features are both precise and cheap to extract relative to image-based inference.
Prior graph-based work in document understanding focused narrowly on table extraction or used CV-derived primitives. The authors argue that a pure-graph formulation operating on PDF-parsed text boxes can be competitive with much larger vision models while being an order of magnitude more efficient.
What is the novelty?
Graph Formulation
Each PDF page is parsed into text boxes (via pdfminer.six), where each box becomes a graph node with a 79-dimensional feature vector (bounding box coordinates, text length, fraction of numerical characters, font type, font size, and other heuristics). Bidirectional edges connect each node to its nearest neighbor in four cardinal directions, plus additional edges approximating reading order (top-to-bottom, left-to-right).
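A rough sketch of the nearest-neighbor-per-direction edge construction (directions are decided by center offsets here; GLAM's exact neighbor test, the reading-order edges, and the 79 node features are not reproduced):

```python
import numpy as np

def build_edges(boxes):
    """Connect each text box to its nearest neighbor in each of the four
    cardinal directions. boxes: (N, 4) array of (x0, y0, x1, y1)."""
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    edges = set()
    for i, (cx, cy) in enumerate(centers):
        dx = centers[:, 0] - cx
        dy = centers[:, 1] - cy
        dist = np.hypot(dx, dy)
        dist[i] = np.inf  # exclude self
        for mask in (dx > 0, dx < 0, dy > 0, dy < 0):  # E, W, S, N
            cand = np.where(mask, dist, np.inf)
            j = int(np.argmin(cand))
            if np.isfinite(cand[j]):
                edges.add((min(i, j), max(i, j)))  # store undirected once
    return sorted(edges)
```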
Architecture
GLAM has two components:
Graph Network (~1M params): TAG convolutional layers with batch normalization, producing node embeddings. A separate edge embedding is formed by concatenating source/destination node embeddings with edge features. Two classification heads predict node class (segment type) and edge label (same-segment or not).
CV Feature Extractor (~3M params): A ResNet-18 renders the page as an image, extracts ROI-pooled visual features per text box, and fuses them with graph features via a three-layer attention encoder. This addresses the graph’s blindness to non-textual visual cues (lines, colors, backgrounds).
Training Objective
The model is trained with a weighted joint loss:
$$L = L_{node} + \alpha L_{edge}$$
where both terms are cross-entropy losses and $\alpha = 4$ upweights the edge classification to improve segmentation quality.
Inference
At inference, negative edges are removed from the graph. Connected components define segments. Each segment is classified by majority vote over its constituent nodes. The bounding box is the minimum spanning box of the segment’s nodes, with confidence equal to the mean class probability.
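A sketch of this inference procedure (networkx for the connected components; node and edge predictions are assumed given):

```python
import numpy as np
import networkx as nx

def infer_segments(boxes, node_probs, edges, edge_is_positive):
    """Drop negative edges, take connected components as segments,
    classify each by majority vote, and report the spanning box with
    mean class probability as confidence.

    boxes: (N, 4); node_probs: (N, C); edges: list of (i, j) pairs;
    edge_is_positive: parallel list of booleans from the edge head.
    """
    g = nx.Graph()
    g.add_nodes_from(range(len(boxes)))
    g.add_edges_from(e for e, pos in zip(edges, edge_is_positive) if pos)

    segments = []
    for comp in nx.connected_components(g):
        idx = sorted(comp)
        votes = node_probs[idx].argmax(axis=1)
        cls = int(np.bincount(votes).argmax())             # majority vote
        bbox = (boxes[idx, 0].min(), boxes[idx, 1].min(),
                boxes[idx, 2].max(), boxes[idx, 3].max())  # spanning box
        conf = float(node_probs[idx, cls].mean())
        segments.append({"class": cls, "bbox": bbox, "confidence": conf})
    return segments
```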
What experiments were performed?
Datasets
| Dataset | Pages | Classes | Annotation |
|---|---|---|---|
| DocLayNet | 80,863 | 11 | Human-drawn, text-box-snapped |
| PubLayNet | 358,353 | 5 | Auto-generated (XML/PDF matching) |
DocLayNet was the primary evaluation target due to its diverse layouts (financial reports, manuals, scientific articles, laws, patents, tenders). PubLayNet’s auto-generated bounding boxes are known to misalign with underlying text boxes, imposing a hard upper bound on parser-based methods.
Baselines
Mask R-CNN (ResNet-50, ResNet-101), Faster R-CNN (ResNet-101), YOLO v5x6, LayoutLMv3, MBC, and VSR.
Ablations
CV feature ablation (Tables 3-4): Removing the ResNet-18 visual branch leaves the GNN-only variant (GLAM-CV, ~1M params), which drops to 59.9 mAP on DocLayNet (vs. 68.6 for full GLAM), with the largest losses on visually complex classes like formula ($-22.8$) and list item ($-15.2$). On PubLayNet (visually simple layouts), the drop is minimal ($-1.2$).
Metric
COCO-style mAP at IoU thresholds [0.5:0.95].
What are the outcomes/conclusions?
DocLayNet Results
GLAM (4M params) achieves 68.6 overall mAP vs. 76.8 for YOLO v5x6 (140.7M params). However, GLAM outperforms all baselines on 5 of 11 classes: formula (66.6), page footer (86.2), page header (78.0), section header (79.8), and title (85.1). These are all text-dense, small-region classes where pixel-perfect bounding boxes from parsed text boxes give an advantage.
The ensemble of GLAM + YOLO v5x6 (taking the per-class best) achieves 80.8 mAP, a new SOTA at the time of publication. The complementarity is clear: GLAM excels on text-based classes while YOLO handles vision-rich classes (picture: 77.1 vs. 5.7, table: 86.3 vs. 56.3).
PubLayNet Results
GLAM achieves 72.2 overall mAP (vs. 95.1 for LayoutLMv3 SOTA), significantly handicapped by bounding-box misalignment. At IoU 0.5 only, GLAM exceeds 90 mAP on all text classes, demonstrating that the model correctly identifies segments but cannot match misaligned ground-truth boxes at stricter thresholds.
Efficiency
| Model | Params | GPU Inference | CPU Inference | Pages/sec (GPU) |
|---|---|---|---|---|
| LayoutLMv3 | 133M | 687 ms | 6,100 ms | 1.5 |
| YOLO v5x6 | 140.7M | 56 ms | 179 ms | 17.8 |
| GLAM | 4M | 10 ms | 16 ms | 98.0 |
| GLAM-CV (GNN only) | 1M | 4 ms | 9 ms | 243.9 |
All benchmarked on NVIDIA Tesla T4 (AWS G4 instance). GLAM processes ~98 pages/sec on GPU and ~64 pages/sec on CPU, making on-device deployment feasible.
Limitations
- No scanned document support: GLAM requires born-digital PDFs with parseable text boxes. It cannot handle scanned documents or image-only PDFs.
- Vision-rich classes fail: Picture mAP is 5.7 on DocLayNet; table mAP is 56.3. Any element not well-represented by text boxes is poorly captured.
- Parser dependency: Performance is bounded by PDF parser quality. Different parsers produce different text boxes, and the graph construction is sensitive to this.
- No text embeddings: GLAM does not use any semantic text features; adding text embeddings is noted as future work.
- PubLayNet misalignment: The auto-generated PubLayNet annotations penalize parser-based methods systematically, making cross-method comparisons on that dataset unfair.
Reproducibility
Models
- GNN: TAG convolutional layers, 512-dim hidden (PubLayNet) or 1024-dim hidden (DocLayNet), scaling down by 2$\times$ per layer. Batch normalization on node features.
- CV branch: ResNet-18 backbone with ROI-average-pooling, followed by a 3-layer attention encoder.
- Total: ~4M parameters (full), ~1M (GNN only).
- No official code or weights released by the authors (Kensho Technologies).
Algorithms
- Joint cross-entropy loss on node and edge classification with edge weight $\alpha = 4$.
- Segment inference via connected components after negative-edge removal, with majority-vote classification.
- No details reported on: optimizer, learning rate, batch size, number of epochs, warmup, gradient clipping, or augmentation.
Data
- DocLayNet: publicly available, human-annotated, text-box-snapped ground truth. Standard train/dev/test splits (69,375 / 6,489 / 4,999).
- PubLayNet: publicly available, auto-generated annotations. Standard splits (~335K / 11K / 11K).
- Graph construction uses 79 node features extracted from parsed text boxes plus heuristic edges (nearest neighbor in four directions + reading order).
Evaluation
- COCO-style mAP at IoU [0.5:0.95], reported per-class and overall.
- No error bars, significance tests, or multi-run statistics.
- The ensemble result (80.8 mAP) is a simple per-class max of two independent models, not a trained fusion.
- The PubLayNet comparison is acknowledged as unfair to parser-based methods due to annotation misalignment.
Hardware
- All inference benchmarks on NVIDIA Tesla T4 (AWS G4 instance).
- Training hardware not reported. No GPU-hours, memory requirements, or training time estimates.
BibTeX
@inproceedings{wang2023glam,
title={A Graphical Approach to Document Layout Analysis},
author={Wang, Jilin and Krumdick, Michael and Tong, Baojia and Halim, Hamima and Sokolov, Maxim and Barda, Vadym and Vendryes, Delphine and Tanner, Chris},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={57--75},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41734-4_4}
}
GraphKD: Graph-Based Knowledge Distillation for Document Object Detection
TL;DR
GraphKD is a graph-based knowledge distillation framework for document object detection (DOD) that constructs structured instance graphs from RoI-pooled features and transfers them from a large teacher to a compact student network. By using cosine similarity for node-to-node distillation and Mahalanobis distance for edge-to-edge distillation, the framework enables heterogeneous distillation (e.g., ResNet to EfficientNet) and includes an adaptive text-node sampling strategy to reduce text-class bias.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a graph-based knowledge distillation framework specifically designed for document object detection. The paper proposes a structured instance graph construction from RPN proposals, a new node sampling strategy for text bias reduction, and a combined cosine/Mahalanobis distillation loss. Ablations and SOTA comparisons dominate the experimental sections.
Secondary: $\Psi_{\text{Evaluation}}$. The paper provides an extensive comparative evaluation across four benchmarks, testing both homogeneous and heterogeneous distillation configurations against three competing KD methods (ReviewKD, NKD, SimKD).
What is the motivation?
Document object detection models have grown increasingly complex (e.g., SwinDocSegmenter at 223M params, LayoutLMv3 at 368M params), making deployment on edge devices impractical. Knowledge distillation offers a path to compact models, but existing KD techniques face three problems in the DOD setting:
- Feature imbalance: Documents are dominated by text regions, creating a class imbalance that biases standard distillation.
- Missing instance relations: Logit-based KD loses fine-grained spatial information; feature-based KD suffers from alignment problems across heterogeneous architectures.
- Homogeneity constraint: Conventional layer-to-layer distillation requires matching teacher/student architectures, preventing cross-architecture transfer (e.g., ViT to CNN).
The authors note this is the first application of knowledge distillation to the DOD task specifically.
What is the novelty?
Structured Instance Graph
Rather than distilling raw feature maps or logits, GraphKD constructs a graph $G = (V, \xi)$ where:
- Nodes ($V$): Vectorized RoI-pooled features from region proposals, categorized into “text” and “non-text” based on feature covariance.
- Edges ($\xi$): Cosine similarity between node pairs, forming a complete, symmetric, undirected graph:
$$e_{pq} = \frac{v_p \cdot v_q}{\lVert v_p \rVert \, \lVert v_q \rVert}$$
Teacher and student share the same RPN proposals, ensuring aligned anchor boxes so both backbones extract features from the same regions.
Adaptive Text Node Sampling
To handle the text-dominance problem, the framework applies a mining strategy (Algorithm 1): text nodes with high classification loss in the teacher (likely to be misclassified) are merged with non-text nodes for distillation, while confident text nodes are pruned. This reduces biased edges and improves false-negative regularization.
Graph Distillation Loss
The total loss combines node losses (text and non-text) with an edge loss, each written as a $\sigma$-normalized distance between teacher and student:
$$\mathcal{L}_g = \frac{\lambda_1}{N_{nt}} \sum_{i=1}^{N_{nt}} \left|\frac{v_i^{t,nt} - v_i^{s,nt}}{\sigma_{nt}}\right| + \frac{\lambda_2}{N_{text}} \sum_{i=1}^{N_{text}} \left|\frac{v_i^{t,text} - v_i^{s,text}}{\sigma_{text}}\right| + \frac{\lambda_3}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left|\frac{e_{ij}^t - e_{ij}^s}{\sigma_\xi}\right|$$
where $\lambda_1 = \lambda_3 = 0.3$ (Optuna search) and $\lambda_2 = \alpha \cdot \frac{N_{nt}}{N_{text}}$ adapts to the class imbalance. The key insight from ablations is that cosine similarity works best for node-to-node alignment (captures orthogonality) while Mahalanobis distance works best for edge-to-edge alignment (sensitive to outliers).
A final GCN + cross-entropy loss classifies the non-text nodes into specific object categories beyond the binary text/non-text split.
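A PyTorch sketch of the distillation loss above (the per-dimension $\sigma$-normalization stands in for the full Mahalanobis computation, and the adaptive text-node sampling is assumed to have happened upstream):

```python
import torch

def graphkd_loss(vt_nt, vs_nt, vt_txt, vs_txt, et, es,
                 lam1=0.3, lam3=0.3, alpha=1.0):
    """vt_*/vs_*: teacher/student node features, shapes (N_nt, D) and
    (N_text, D); et/es: teacher/student (N, N) edge-similarity matrices."""
    # lambda_2 adapts to the text/non-text imbalance.
    lam2 = alpha * vt_nt.shape[0] / max(vt_txt.shape[0], 1)

    def sigma_l1(t, s):
        sigma = t.std(dim=0, keepdim=True).clamp_min(1e-6)
        return ((t - s).abs() / sigma).mean()

    loss_nt = lam1 * sigma_l1(vt_nt, vs_nt)      # non-text node term
    loss_txt = lam2 * sigma_l1(vt_txt, vs_txt)   # sampled text node term
    loss_edge = lam3 * ((et - es).abs() / et.std().clamp_min(1e-6)).mean()
    return loss_nt + loss_txt + loss_edge
```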
What experiments were performed?
Datasets
Four benchmarks covering diverse document types:
| Dataset | Classes | Train Instances | Eval Instances |
|---|---|---|---|
| PubLayNet | 5 | 3.26M | 120K |
| PRImA | 6 | 8K | 1.9K |
| Historical Japanese | 7 | 181K | 31.9K |
| DocLayNet | 11 | 994K | 66.5K |
Distillation Configurations
- Homogeneous: ResNet152 $\rightarrow$ ResNet101, ResNet101 $\rightarrow$ ResNet50
- Heterogeneous: ResNet50 $\rightarrow$ ResNet18, ResNet101 $\rightarrow$ EfficientNet-B0, ResNet50 $\rightarrow$ MobileNetV2
Baselines
- KD methods: ReviewKD, NKD, SimKD (all adapted to the DOD setting)
- Supervised upper bounds: SwinDocSegmenter (223M), LayoutLMv3 (368M), DocSegTr (168M)
Ablations
- Component ablation (Table 2): Tested edge-only, edge+non-text, edge+text, and full (edge+non-text+text) on DocLayNet with ResNet50 $\rightarrow$ ResNet18. Full model achieves 42.1 AP vs. 33.1 AP edge-only.
- Distance function ablation (Table 3): Exhaustive 4$\times$4 grid of L1, L2, Cosine, and Mahalanobis for node and edge losses. Best: Cosine (nodes) + Mahalanobis (edges) at 42.1 AP.
Metric
COCO-style mAP at IoU thresholds 0.5 to 0.95 (step 0.05).
What are the outcomes/conclusions?
Key Results
GraphKD consistently outperforms competing KD methods (ReviewKD, NKD, SimKD) across all four benchmarks and all five teacher-student configurations. The best configuration (ResNet152 $\rightarrow$ ResNet101) achieves:
| Dataset | GraphKD AP | Best Competing KD AP | Supervised SOTA AP |
|---|---|---|---|
| PubLayNet | 88.8 | 81.1 (SimKD) | 95.1 (LayoutLMv3) |
| PRImA | 41.9 | 36.2 (SimKD) | 54.3 (SwinDocSeg) |
| Hist. Japanese | 79.7 | 76.7 (SimKD) | 85.0 (SwinDocSeg) |
| DocLayNet | 68.9 | 64.6 (SimKD) | 72.1 (SwinDocSeg) |
Noteworthy observations
- On PRImA (a small dataset), the distilled ResNet101 (44.5M params) outperforms LayoutLMv3 (368M) by ~1.6 AP, suggesting large transformers overfit on limited data.
- On DocLayNet, distilled models outperform DocSegTr and LayoutLMv3 on specific classes like “Caption,” “Page-footer,” and “Picture,” where text-bias reduction helps.
- Heterogeneous distillation (e.g., ResNet50 $\rightarrow$ MobileNetV2 at 3.4M params) is functional but incurs substantial performance drops due to double compression of RoI features.
Limitations
- The performance gap to supervised SOTA remains significant: ~6 AP on PubLayNet (vs. LayoutLMv3), ~3 AP on DocLayNet, and ~12 AP on PRImA (both vs. SwinDocSegmenter).
- Transformer backbones cannot serve as teacher or student due to incompatible data handling at the RPN/RoI level. The authors acknowledge this as a key limitation and future work direction.
- The framework is built entirely on Faster R-CNN with RPN; it does not generalize to anchor-free or DETR-style detectors.
- No latency or inference speed benchmarks are reported, despite efficiency being the stated motivation.
- The “first KD for DOD” claim is plausible but difficult to verify exhaustively.
Reproducibility
Models
- All teacher/student pairs use standard backbones: ResNet18/50/101/152, EfficientNet-B0, MobileNetV2.
- The detection framework is Faster R-CNN with a shared RPN between teacher and student.
- A GCN is appended for final node classification (text $\rightarrow$ specific object categories).
- Code is available on GitHub under MIT license, built on Detectron2.
Algorithms
- Loss penalty coefficients: $\lambda_1 = \lambda_3 = 0.3$ (Optuna), $\lambda_2$ adaptive.
- Node-to-node: cosine similarity loss. Edge-to-edge: Mahalanobis distance loss.
- Logit matching via KL divergence.
- Text node merging threshold $t$ is empirically determined (exact value not reported).
- No details on optimizer, learning rate schedule, batch size, or number of training epochs.
Data
- PubLayNet, PRImA, HJDataset, DocLayNet: all publicly available.
- Standard train/eval splits used (instance counts reported in Table 1).
Evaluation
- COCO-style mAP (IoU 0.5:0.05:0.95), plus AP@50, AP@75, AP_S, AP_M, AP_L.
- Comparisons against three KD baselines (ReviewKD, NKD, SimKD) and three supervised methods (DocSegTr, LayoutLMv3, SwinDocSegmenter).
- No error bars, significance tests, or multi-run statistics reported.
- The supervised baselines use different training setups (pre-training corpora, augmentations) so direct parameter-efficiency comparisons should be interpreted cautiously.
Hardware
- No training hardware, GPU hours, or memory requirements reported.
- No inference latency or throughput measurements, which is a notable gap for a paper motivated by edge deployment.
BibTeX
@inproceedings{banerjee2024graphkd,
title={GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation},
author={Banerjee, Ayan and Biswas, Sanket and Llad{\'o}s, Josep and Pal, Umapada},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={353--370},
year={2024},
publisher={Springer},
doi={10.1007/978-3-031-70543-4_21}
}
Hybrid DLA: Enhanced Query Encoding and Hybrid Matching for Document Layout Analysis
TL;DR
This paper extends the DINO detection transformer for document layout analysis with two additions: (1) a query encoding mechanism that fuses RoI-aligned backbone features with decoder queries via cosine similarity, improving small-object detection, and (2) a hybrid matching strategy that trains with one-to-many matching in the first half of training and switches to one-to-one matching in the second half. The authors report 97.3% mAP on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is an improved detection transformer architecture and training strategy. The bulk of the paper describes the query encoding mechanism, the hybrid matching scheme, and ablations isolating their effects.
Secondary: None. No new datasets, benchmarks, or artifacts are released.
What is the motivation?
Transformer-based detectors like DINO have shown strong performance on natural image detection, but the authors argue they underperform on document layout analysis tasks, particularly for small graphical objects (headers, footers, section titles). The core issue: decoder queries in DINO lack direct access to fine-grained backbone features, limiting their ability to detect small or visually subtle layout elements. Additionally, standard one-to-one Hungarian matching produces fewer positive training samples, slowing convergence.
What is the novelty?
Query Encoding
The method extracts features from the ResNet-50 backbone using RoI-Align on bounding box proposals and passes them through an MLP to produce high-level query features:
$$Q_h = \text{MLP}(\text{RoIAlign}(F_h, b_j))$$
These are compared to decoder queries via cosine similarity after self-attention:
$$Q_e = \text{similarity}(Q'_d, Q'_h)$$
The enhanced query features are concatenated with the original decoder queries:
$$Q_t = \text{Concat}(Q_d, Q_e)$$
This combined representation is fed into the decoder:
$$o = \text{Decoder}(Q_t, E \mid A)$$
where $E$ is the encoder output and $A$ is the denoising attention mask.
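A hedged PyTorch sketch of the query-encoding idea (the dimensions, MLP shape, and the step from cosine similarity to the enhanced feature $Q_e$ are assumptions; the paper leaves them underspecified):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_align

class QueryEncoding(nn.Module):
    """RoI-align backbone features, project with an MLP, weight by cosine
    similarity to the decoder queries, and concatenate back."""

    def __init__(self, feat_dim=256, d_model=256, roi=7):
        super().__init__()
        self.roi = roi
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * roi * roi, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, fmap, boxes, q_dec):
        # fmap: (1, C, H, W) backbone features; boxes: (N, 4) proposals in
        # feature-map coordinates; q_dec: (N, d_model) decoder queries.
        pooled = roi_align(fmap, [boxes], output_size=self.roi)
        q_h = self.mlp(pooled.flatten(1))               # high-level queries
        sim = F.cosine_similarity(q_dec, q_h, dim=-1)   # (N,)
        q_e = sim.unsqueeze(-1) * q_h                   # similarity-weighted
        return torch.cat([q_dec, q_e], dim=-1)          # Q_t = [Q_d; Q_e]
```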
Hybrid Matching
The training loss combines one-to-many and one-to-one assignment strategies. The one-to-many classification loss is:
$$L_{cls}^{1\text{-}m} = \sum_{i=1}^{N_{obj}} |\hat{g}_i - p_i| \cdot \text{BCE}(p_i, \hat{g}_i) + \sum_{j=1}^{N_{no}} p_j \cdot \text{BCE}(p_j, 0)$$
with regression:
$$L_{reg}^{1\text{-}m} = \sum_{i=1}^{N_{obj}} \hat{g}_i \cdot L_{GIoU}(bx_i, \hat{bx}_i) + \sum_{i=1}^{N_{obj}} \hat{g}_i \cdot L_{L1}(bx_i, \hat{bx}_i)$$
During training, the first half uses one-to-many matching (more positive samples, faster convergence), and the second half switches to one-to-one matching (eliminates duplicate predictions, enabling NMS-free inference).
What experiments were performed?
The method is evaluated on three benchmarks:
- PubLayNet (360K pages, 5 classes): 97.3% mAP, improving over DINO baseline (95.5%) and the prior best (Zhong et al., 96.5%).
- DocLayNet (80K pages, 11 classes): 81.6% mAP, improving over DINO (74.3%) and Zhong et al. (81.0%). Particularly strong on Page-footer (+37.1 over DINO), Page-header (+14.1), and Title (+3.5).
- PubTables (table detection): 98.6% mAP (AP50: 99.8%, AP75: 99.1%).
Baselines include Faster R-CNN, Mask R-CNN, YOLOv5, DINO, DiT-L, LayoutLMv3, VSR, and others.
Ablations
- Query encoding ($Q_d + Q_e$ vs. $Q_d$ alone): +1.78 mAP on PubLayNet validation.
- Matching strategy: One-to-many alone reaches 98.4 mAP but generates duplicates. The hybrid scheme (one-to-many then one-to-one) achieves 97.3 mAP while remaining NMS-free.
- Number of queries: 300 queries is optimal. Performance degrades at 400 due to overfitting.
What are the outcomes/conclusions?
The two proposed modifications yield consistent improvements over the DINO baseline across all three benchmarks. The query encoding mechanism is particularly effective for small layout elements. The hybrid matching strategy provides a practical trade-off between detection quality and duplicate elimination without requiring NMS at inference.
The gains over DINO are substantial on DocLayNet (+7.3 mAP) where documents are structurally diverse, but more modest on PubLayNet (+1.8 mAP) where layouts are relatively uniform. No code or model weights are released, limiting independent verification.
Reproducibility
Models
- Backbone: ResNet-50 pre-trained on ImageNet, with multi-scale feature maps (1/4, 1/8, 1/16, 1/32, 1/64), each projected to 256 channels via 1$\times$1 convolutions
- Architecture: DINO transformer with deformable attention, augmented with the proposed query encoding module
- Parameter count: Not reported
- Queries: 300 learnable queries (optimal per ablation)
- Initialization: DINO weight initialization source (pretrained or from scratch) not specified
- Weights: Not released
Algorithms
- Optimizer: AdamW
- Batch size: 16
- Epochs: 12 (PubLayNet, PubTables), 24 (DocLayNet)
- Base learning rate: Not reported
- Learning rate schedule: Reduced by factor of 10 at a later stage (exact milestone not specified)
- Training tricks: Warmup, gradient clipping, and mixed precision not mentioned
- Loss functions: BCE (classification) + GIoU + L1 (regression), combined in one-to-many and one-to-one branches. Loss weighting coefficients not reported.
- Multi-scale training: Images resized to various lengths with a max size limit (exact dimensions not specified)
- Other augmentation: Not described beyond multi-scale resizing
- Test-time: Shorter side resized to 640
- Matching switch: One-to-many for first half of training epochs, one-to-one for second half
Data
- PubLayNet: 360K pages, 5 classes (Text, Title, List, Table, Figure). Standard train/val/test splits. CDLA-Permissive-1.0 (annotations); underlying PDFs non-commercial only.
- DocLayNet: 80K pages, 11 classes. Standard splits. CDLA-Permissive-1.0 (annotations); underlying doc licenses vary by domain.
- PubTables: Table detection benchmark. Standard splits. CDLA-Permissive-1.0.
- All three are established benchmarks with published annotation protocols.
- No additional or proprietary training data used.
Evaluation
- Metric: COCO-style mAP at IoU thresholds 0.50 to 0.95 (step 0.05), plus AP@50 and AP@75
- Baselines: Comparison against CNN-based (Faster R-CNN, Mask R-CNN, YOLOv5) and transformer-based (DINO, DiT-L, LayoutLMv3) methods
- Baseline provenance: DocLayNet DINO numbers are sourced from Zhong et al. [62], not from the original DINO paper or the authors’ own runs. PubLayNet DINO numbers cite the original DINO paper [19]. Provenance of other baselines varies by table.
- Statistical rigor: No error bars, significance tests, or multi-run statistics reported. Number of runs and seed sensitivity not mentioned.
- Author-acknowledged limitations: The paper’s conclusion does not explicitly discuss limitations. It focuses on summarizing contributions without qualifying scope or failure cases.
Hardware
- GPU: NVIDIA “RTXA600” GPUs (plural; likely RTX A6000). Exact GPU count not specified.
- Training time: Not reported
- Inference speed: Not reported
- Memory requirements: Not reported
- Cost estimates: Not reported
- Deployment considerations: Not discussed. Given the DINO-based architecture with ResNet-50, CPU-only inference is likely impractical for real-time use.
BibTeX
@inproceedings{shehzadi2024hybrid,
title={A Hybrid Approach for Document Layout Analysis in Document Images},
author={Shehzadi, Tahira and Stricker, Didier and Afzal, Muhammad Zeshan},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={20--37},
year={2024},
organization={Springer}
}
M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis
TL;DR
M2Doc is a pluggable multi-modal fusion approach for document layout analysis that adds two lightweight modules to existing object detectors: an early-fusion module that gates pixel-level textual features into backbone outputs, and a late-fusion module that injects block-level BERT embeddings into candidate bounding boxes via IoU matching. Applied to DINO and Cascade Mask R-CNN, it achieves 89.0 mAP on DocLayNet (+11.3 over unimodal DINO) and 69.9 mAP on M6Doc (+1.9).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The core contribution is a multi-modal fusion architecture designed specifically for document layout analysis. The paper introduces two pluggable fusion modules (early and late), demonstrates them across multiple detector families, and validates with extensive ablations and SOTA comparisons. The majority of the paper describes architectural design choices and ablation experiments.
Secondary: None. The paper does not introduce new datasets, benchmarks, or evaluation protocols. Code is released, but as a companion to the method, not as an independent infrastructure contribution.
What is the motivation?
Document layout analysis has traditionally been treated as a vision-only object detection problem, even when using multimodal pre-trained models. The authors identify three specific gaps:
- Unimodal pipelines from multimodal pre-training: Models like DiT and LayoutLM undergo multimodal pre-training but are reduced to unimodal (image-only) backbones when fine-tuned for layout detection. The textual modality learned during pre-training is discarded at inference time.
- Weak multi-modal baselines: The primary multi-modal method (VSR) uses a complex two-backbone architecture with Chargrid/Wordgrid/Sentencegrid inputs and Transformer-based relation modeling, yet it performs worse than unimodal detectors on complex logical layout datasets like DocLayNet.
- Detector-specific solutions: Existing DLA enhancements (TransDLANet, SwinDocSegmenter, SelfDocSeg) modify specific detector internals, making them non-transferable across detector families.
The authors argue that documents are inherently multimodal (rich text + visual structure) and that layout analysis should exploit textual semantics, particularly for semantically distinct categories like page headers, footers, and section headers.
What is the novelty?
Textual Grid Representation
M2Doc uses a BERTgrid-style approach to create pixel-aligned textual features. Given a document image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ with $N$ OCR-detected words, each word $w_i$ with bounding box $b_i$ is encoded by BERT:
$$(T_1, \dots, T_N) = \text{BERT}(w_1, \dots, w_N)$$
The sequential embeddings $T_i \in \mathbb{R}^{d \times 1}$ (where $d = 768$) are placed into a 2D grid $\mathbf{G} \in \mathbb{R}^{H \times W \times d}$:
$$G_{x,y} = \begin{cases} T_i, & \text{if } (x, y) \in b_i \\ 0, & \text{otherwise} \end{cases}$$
Unlike VSR’s two-backbone approach, M2Doc feeds both the image and the textual grid through a single shared backbone (ResNet), using convolutions to align channel dimensions before the first block.
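A minimal sketch of the grid construction (integer pixel boxes assumed; per the text above, convolutions then align the channel dimension before the first backbone block):

```python
import torch

def build_textgrid(word_embs, word_boxes, height, width):
    """BERTgrid-style textual grid: paint each word's embedding into every
    pixel of its box; background stays zero.

    word_embs: (N, d) BERT embeddings; word_boxes: N integer pixel boxes
    (x0, y0, x1, y1) for the same words.
    """
    d = word_embs.shape[1]
    grid = torch.zeros(height, width, d)
    for emb, (x0, y0, x1, y1) in zip(word_embs, word_boxes):
        grid[y0:y1, x0:x1] = emb      # broadcast over the box region
    return grid.permute(2, 0, 1)      # (d, H, W) for the conv backbone
```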
Early Fusion (Pixel-Level)
After feature extraction, each scale $\theta \in \{1, 2, 3, 4\}$ produces visual features $P_\theta$ and textual features $S_\theta$. A gated mechanism adaptively weights the textual contribution:
$$\alpha_\theta = \eta(S_\theta)$$
$$F_\theta = \text{LayerNorm}(\alpha_\theta \odot S_\theta + P_\theta)$$
where $\eta(\cdot)$ is two $1 \times 1$ convolutions with ReLU and Tanh activations (restricting scores to $(0, 1)$), and $\odot$ is element-wise multiplication. LayerNorm handles the zero-valued regions of the textual grid.
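A PyTorch sketch of the gate (the channel count, exact activation placement, and a GroupNorm stand-in for LayerNorm on feature maps are assumptions):

```python
import torch
import torch.nn as nn

class GatedEarlyFusion(nn.Module):
    """Gate textual features into visual features at one feature scale."""

    def __init__(self, channels=256):
        super().__init__()
        # eta: two 1x1 convolutions with ReLU/Tanh producing gate scores.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.GroupNorm(1, channels)  # LayerNorm-like for conv maps

    def forward(self, p_visual, s_textual):
        h = torch.relu(self.conv1(s_textual))
        alpha = torch.tanh(torch.relu(self.conv2(h)))  # scores in [0, 1)
        return self.norm(alpha * s_textual + p_visual)
```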
Late Fusion (Block-Level)
After the RPN or Transformer encoder generates candidate bounding boxes $r_j$, late fusion matches each candidate to OCR boxes via IoU:
$$\text{IoU}_{i,j} = \frac{|r_j \cap b_i|}{|r_j \cup b_i|}$$
Words with IoU above a threshold are aggregated into a block-level textual feature:
$$E_j = \Gamma(T \cdot J_j)$$
where $J_j$ is a binary inclusion vector and $\Gamma$ is an MLP mapping BERT dimensions to the decoder’s channel dimensions. For end-to-end detectors (DINO), these features are added to content queries:
$$\text{Query}_j = \text{Query}_j + \lambda_1 E_j$$
For two-stage detectors (Cascade Mask R-CNN), block-level textual features are added to RoI features scaled by $\lambda_2$ before the R-CNN head.
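A sketch of the late-fusion path for the query-based case (torchvision's `box_iou`; the IoU threshold, $\lambda_1$, and mean aggregation are illustrative choices):

```python
import torch
from torchvision.ops import box_iou

def late_fuse(queries, cand_boxes, ocr_boxes, ocr_embs, proj,
              lam1=1.0, iou_thresh=0.1):
    """Add block-level textual features to decoder content queries.

    queries: (M, d); cand_boxes: (M, 4); ocr_boxes: (N, 4);
    ocr_embs: (N, 768) BERT word embeddings; proj: MLP 768 -> d (Gamma).
    """
    iou = box_iou(cand_boxes, ocr_boxes)        # (M, N) pairwise IoU
    inclusion = (iou > iou_thresh).float()      # binary vectors J_j
    counts = inclusion.sum(dim=1, keepdim=True).clamp_min(1.0)
    block_text = inclusion @ ocr_embs / counts  # mean of matched words
    return queries + lam1 * proj(block_text)    # summation fusion
```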
Design Insight
The authors find that gated fusion works best for early fusion (where textual features are backbone-extracted and need adaptive weighting), while simple summation works best for late fusion (where textual features come directly from the pre-trained language model and are already high-quality).
What experiments were performed?
Datasets
| Dataset | Pages | Categories | OCR Source | Layout Type |
|---|---|---|---|---|
| PubLayNet | 360K | 5 | PDFMiner | Physical |
| DocLayNet | 80K | 11 | Human sentence-level | Logical |
| M6Doc | 9K | 74 | OCR engine | Logical |
Detectors Evaluated
- End-to-end: DINO (DINO-4Scale, 900 queries)
- Two-stage: Cascade Mask R-CNN
- Pluggability test: Mask R-CNN, Faster R-CNN, Deformable DETR
All use ResNet-101 + FPN as the visual backbone and BERT-Base-Multilingual-Cased as the language model. Detectors are initialized from COCO 2017 pre-trained weights.
Training Configuration
- DocLayNet and M6Doc: 36 epochs
- Cascade Mask R-CNN: SGD, lr = 2e-2, decayed to 2e-3 at epoch 27 and 2e-4 at epoch 33
- DINO: AdamW, lr = 1e-4, decayed to 3.3e-5 at epoch 27 and 1e-5 at epoch 33
- PubLayNet: 6 epochs, same initial lr, decayed by 10$\times$ at epoch 5
- Custom anchor scales for Cascade Mask R-CNN: [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2, 5, 10]
- Built on MMDetection
Key Results
DocLayNet (Table 1): DINO + M2Doc achieves 89.0 mAP, up from 77.7 for unimodal DINO. This is reported as the first method to significantly exceed the human baseline of 83 mAP. Cascade Mask R-CNN + M2Doc reaches 86.7 mAP (+11.1). The largest per-category gains come from semantically distinct classes: Page-footer (+27.3), Page-header (+14.4), Section-header (+18.6), Formula (+17.1).
M6Doc (Table 2): DINO + M2Doc achieves 69.9 mAP (+1.9). Gains are more modest due to M6Doc’s 74 fine-grained categories and diverse input scales limiting recall.
PubLayNet (Table 3): DINO + M2Doc achieves 95.5 mAP (+0.1 over unimodal DINO). Minimal improvement because PubLayNet’s 5-category physical layout is already near-saturated and semantically simple.
Ablations (M6Doc)
- Module ablation (Table 4): Both early and late fusion contribute. Together they yield +1.9 mAP for DINO and +2.1 mAP for Cascade Mask R-CNN. Individual modules contribute roughly equally (+0.4 to +0.5 for DINO).
- Fusion strategy ablation (Table 5): Gate mechanism is best for early fusion; summation is best for late fusion. Concatenation and other combinations underperform.
- Pluggability (Table 6): All five detectors improve with M2Doc. Gains range from +1.6 mAP (Deformable DETR) to +2.9 mAP (Mask R-CNN). The authors note that their reproduced baselines are higher than M6Doc’s originally reported numbers due to different experimental settings, so absolute mAP values should be compared with that caveat in mind.
Notable Observation
The only category where multimodal fusion hurts performance is “Figure” (which mostly lacks text content). Text-overlaid pictures can also degrade boundary detection for multimodal models.
What are the outcomes/conclusions?
Strengths
- The pluggable design is detector-agnostic, demonstrated across five architectures spanning two-stage and end-to-end families.
- The DocLayNet results (+11.3 mAP) are substantial and indicate that multimodal fusion addresses a real gap in logical layout analysis where semantic categories require textual understanding.
- The single-backbone design is simpler and more parameter-efficient than VSR’s two-backbone approach.
- Ablations are thorough, testing both module contributions and fusion strategy alternatives.
Limitations
- OCR dependency: The method requires pre-computed OCR results (word boxes + text). DocLayNet provides human-annotated OCR, but real-world OCR introduces noise. The paper does not evaluate robustness to OCR errors.
- Marginal gains on simple datasets: On PubLayNet (+0.1 mAP), the multimodal overhead (BERT inference + fusion modules) provides almost no benefit, suggesting the approach is only worthwhile for complex logical layout tasks.
- No latency or throughput analysis: Adding BERT inference and fusion modules has a computational cost that is never quantified. Inference time comparisons with unimodal baselines are absent.
- Missing comparison with pre-trained multimodal models: The paper compares against unimodal detectors and VSR, but does not compare with LayoutLMv3 or DiT fine-tuned for DLA, which also leverage multimodal features (through pre-training). LayoutLMv3 appears in the PubLayNet table but not DocLayNet or M6Doc.
- Figure category degradation: The authors acknowledge that multimodal fusion hurts text-overlaid image detection but offer no mitigation strategy.
Reproducibility
Models
- Visual backbone: ResNet-101 + FPN, initialized from COCO 2017 pre-trained weights.
- Language model: BERT-Base-Multilingual-Cased (110M params), from HuggingFace.
- Early fusion: two $1 \times 1$ conv layers with ReLU + Tanh per feature scale ($\theta \in \{1,2,3,4\}$).
- Late fusion: IoU matching + MLP mapping BERT dim (768) to detector channel dim.
- Pre-trained weights for R50 and R101 variants (4 checkpoints: Cascade Mask R-CNN and DINO, each with R50 and R101) available via BaiduNetDisk and Google Drive (per the GitHub repo). Weights are only released for DocLayNet; no PubLayNet or M6Doc checkpoints are provided. Full training and inference code is provided, including multi-GPU scripts.
Algorithms
- Cascade Mask R-CNN: SGD, lr = 2e-2, step decay at epochs 27 and 33, 36 epochs.
- DINO: AdamW, lr = 1e-4, step decay at epochs 27 and 33, 36 epochs.
- PubLayNet: 6 epochs, lr decayed by 10$\times$ at epoch 5.
- IoU threshold for late fusion: not stated in the paper text. Config files in the GitHub repo (`mmdetection/m2doc_config/`) likely contain the actual values used.
- $\lambda_1$ and $\lambda_2$: described as adjustable hyperparameters; exact values not stated in the paper text but may be recoverable from the repo config files.
- Framework: MMDetection.
- Environment: Python 3.8.0, CUDA 10.2, PyTorch 1.8.1, mmcv, mmengine, transformers.
Data
- PubLayNet: 360K pages, word-level OCR from PDFMiner. Publicly available (CDLA-Permissive-1.0).
- DocLayNet: 80K pages, human sentence-level OCR annotations included. Publicly available (CDLA-Permissive-1.0).
- M6Doc: 9K pages, sentence-level OCR obtained via OCR engine (unspecified which). Restricted access (CC BY-NC-ND 4.0); requires application and approval from HCIILAB.
- Standard train/val/test splits used for all three datasets.
Evaluation
- Metric: COCO-style mAP @ IoU [0.50:0.95:0.05].
- Per-category AP reported for DocLayNet (11 classes) and PubLayNet (5 classes).
- AP50, AP75, and Recall additionally reported for M6Doc.
- No error bars, significance tests, or multi-run statistics.
- Baselines reproduced in the same MMDetection framework with consistent settings.
Hardware
- Not explicitly reported in the paper. The GitHub repo’s distributed training script (`dist_train.sh`) defaults to 8 GPUs, suggesting an 8-GPU setup was used, but GPU type, training time, and memory requirements are unspecified.
- No inference speed benchmarks.
BibTeX
@inproceedings{zhang2024m2doc,
title={M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis},
author={Zhang, Ning and Cheng, Hiuyi and Chen, Jiayu and Jiang, Zongyuan and Huang, Jun and Xue, Yang and Jin, Lianwen},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={7},
pages={7233--7241},
year={2024},
doi={10.1609/aaai.v38i7.28552}
}
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
TL;DR
RoDLA introduces the first systematic robustness benchmark for Document Layout Analysis (DLA), comprising ~450K perturbed document images across three datasets (PubLayNet-P, DocLayNet-P, M6Doc-P) with 12 perturbation types at 3 severity levels. The authors propose two metrics (mPE for perturbation assessment and mRD for robustness evaluation) and a robust model that integrates channel attention and average pooling into a DINO-based detector to resist perturbation-induced attention shifting.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The headline contribution is a robustness benchmarking framework: a perturbation taxonomy, two new evaluation metrics (mPE and mRD), and a systematic assessment of 10+ DLA methods under controlled perturbations. The paper’s center of gravity is measuring how existing models fail under realistic document corruptions.
Secondary: $\Psi_{\text{Resource}}$. The perturbed benchmark datasets (PubLayNet-P, DocLayNet-P, M6Doc-P) and perturbation generation code are released as reusable community assets.
Secondary: $\Psi_{\text{Method}}$. The RoDLA model itself is a methodological contribution (channel attention + average pooling in a DINO encoder), though it serves primarily to demonstrate the benchmark’s utility.
What is the motivation?
DLA models are typically evaluated on clean digital documents, but real-world documents suffer from scanning artifacts, camera distortions, noise, and other degradations. Prior work had not systematically studied how these perturbations affect DLA performance. The authors observed that existing high-performing models (e.g., SwinDocSegmenter achieving 93.7% mAP on clean PubLayNet) can drop drastically under perturbation (down to 29.2% mAP under speckle noise). This gap between clean-data performance and real-world robustness motivates both the benchmark and the robust model.
Existing robustness metrics (borrowed from ImageNet-C) conflate perturbation severity with model robustness because they rely on a single baseline model’s performance as reference. The authors argue this introduces metric uncertainty from model randomness.
What is the novelty?
Perturbation Taxonomy
A hierarchical taxonomy of 12 document-specific perturbations organized into 5 groups, each with 3 severity levels:
- Spatial Transformation: Rotation (P1), Warping (P2), Keystoning (P3)
- Content Interference: Watermark (P4), Complex Background (P5)
- Inconsistency Distortion: Non-uniform Illumination (P6), Ink-bleeding (P7), Ink-holdout (P8)
- Blur: Defocus (P9), Vibration (P10)
- Noise: Speckle (P11), Texture (P12)
Mean Perturbation Effect (mPE)
A model-independent metric for quantifying perturbation severity that combines IQA metrics with degradation scores:
$$\text{mPE}_p = \frac{1}{NMK} \sum_{s=1}^{N} \left(\sum_{i=1}^{M} f_{s,p}^i + \sum_{g=1}^{K} D_{s,p}^g\right)$$
where $f_{s,p}^i$ are IQA scores (MS-SSIM, CW-SSIM) and $D_{s,p}^g$ is the degradation of a reference model. By combining multiple assessment methods, mPE reduces sensitivity to any single metric’s blind spots (e.g., MS-SSIM is insensitive to warping; CW-SSIM is insensitive to defocus at fine severity levels).
Mean Robustness Degradation (mRD)
A robustness metric that normalizes a model’s performance drop by the perturbation’s inherent difficulty:
$$\text{RD}_p = \frac{1}{N} \sum_{s=1}^{N} \frac{D_{s,p}^g}{\text{mPE}_{s,p}}$$
The overall mRD averages $\text{RD}_p$ across all 12 perturbation types. Values above 100 indicate the model degrades more than expected; below 100 indicates better-than-expected robustness. Lower is better.
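As a sanity check on the arithmetic, here is a numpy sketch; the array shapes and the $\times 100$ scaling are assumptions chosen to match the "above/below 100" reading.

```python
import numpy as np

def mrd(degradation, mpe):
    """RD_p = mean over severity levels of D / mPE (scaled to ~100);
    mRD = mean of RD_p over perturbation types. Lower is better."""
    rd_per_type = 100.0 * (degradation / mpe).mean(axis=0)  # shape (P,)
    return rd_per_type, rd_per_type.mean()

rng = np.random.default_rng(0)
mpe = rng.uniform(10, 50, size=(3, 12))   # 3 severity levels x 12 types
deg = rng.uniform(5, 60, size=(3, 12))    # a model's mAP drop per cell
rd, overall = mrd(deg, mpe)               # overall ~100 => "as expected"
```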
RoDLA Model
Built on DINO with an InternImage backbone, the model adds:
- Channel-wise Attention (CA) in the encoder: performs self-attention along the channel dimension to aggregate correlated feature channels, reducing attention shifting on perturbed tokens
- Average Pooling Layers (APL) with a dilation predictor: leverages neighboring token attention to mitigate overemphasis on individual (potentially corrupted) tokens
The channel-wise attention operates as:
$$Y = \sigma\left(\frac{\text{Softmax}(Q) \cdot \text{Softmax}(K^T)}{\sqrt{d}} \cdot \text{MLP}(V)\right)$$
where $Q, K, V \in \mathbb{R}^{d \times n}$.
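A literal reading of the formula as PyTorch is sketched below; the softmax axes, the MLP shape, and the choice of sigmoid for $\sigma$ are assumptions where the paper is ambiguous.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Y = sigmoid(Softmax(Q) @ Softmax(K^T) / sqrt(d) @ MLP(V))."""
    def __init__(self, d):
        super().__init__()
        self.d = d
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, Q, K, V):          # each (B, d, n): channels x tokens
        A = torch.softmax(Q, dim=-1) @ torch.softmax(K, dim=-1).transpose(1, 2)
        A = A / self.d ** 0.5            # (B, d, d) channel-channel affinity
        Vm = self.mlp(V.transpose(1, 2)).transpose(1, 2)  # MLP acts per token
        return torch.sigmoid(A @ Vm)     # (B, d, n)

y = ChannelAttention(64)(*[torch.randn(2, 64, 100) for _ in range(3)])
```

Note that attention here is computed between channels (a $d \times d$ affinity) rather than between tokens, which is what lets the module aggregate correlated feature channels instead of attending to possibly corrupted spatial positions.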
What experiments were performed?
Benchmark Setup
- Datasets: PubLayNet (360K pages, 5 classes), DocLayNet (80.8K pages, 11 classes), M6Doc (9K pages, 74 classes)
- Perturbed versions: Each dataset $\times$ 12 perturbations $\times$ 3 severity levels = ~450K total perturbed images
- Baselines: 10+ methods including LayoutParser, Faster R-CNN, Mask R-CNN, Cascade R-CNN, DocSegTr, SwinDocSegmenter, DINO, Co-DINO, and multi-modal models (DiT, LayoutLMv3)
- Backbones: ResNet, ResNeXt, Swin Transformer, DiT, LayoutLMv3, InternImage
- Framework: All models reproduced in MMDetection with consistent training settings
- Training: Models trained on clean data only; perturbations applied only at test time
- Metrics: Clean mAP, P-Avg (average mAP across all perturbations), mRD
Key Results
PubLayNet-P:
| Method | Clean mAP | P-Avg | mRD |
|---|---|---|---|
| LayoutParser | 89.0 | 58.0 | 212.7 |
| Faster R-CNN | 90.2 | 66.2 | 175.5 |
| SwinDocSegmenter | 93.7 | 57.1 | 214.4 |
| DiT + Cascade R-CNN | 94.5 | 76.7 | 95.8 |
| LayoutLMv3 + Cascade R-CNN | 95.1 | 73.6 | 116.2 |
| DINO (InternImage) | 95.4 | 69.0 | 120.7 |
| RoDLA | 96.0 | 70.0 | 116.0 |
RoDLA achieves the best clean mAP (96.0%) and the best mRD (116.0) among single-modal methods, with +3.8% P-Avg over Faster R-CNN. Among all methods including multi-modal ones, DiT + Cascade R-CNN achieves the best mRD (95.8), benefiting from document-specific pre-training on IIT-CDIP.
DocLayNet-P: RoDLA achieves 82.8% clean mAP (+1.9% over DINO baseline), 62.5% P-Avg (+7.1%), and mRD of 135.4.
M6Doc-P: RoDLA achieves 67.5% clean mAP, 52.9% P-Avg (+12.1%), and mRD of 150.4.
Ablation Studies
- Backbone: InternImage outperforms Swin Transformer, but RoDLA narrows the gap (only 0.1% P-Avg difference and 8.2 mRD difference between backbones)
- CA placement: Encoder-only CA placement is optimal; adding CA to the decoder hurts robustness
- APL placement: Encoder-only APL is optimal; the combined CA (encoder) + APL (encoder) configuration yields the best mRD of 115.7
- Multiple perturbations: Stacking perturbations degrades performance progressively (P-Avg drops from 64.9 with 2 perturbations to 41.9 with 5 on M6Doc-P)
What are the outcomes/conclusions?
Key findings:
- Existing DLA models are fragile under realistic perturbations, with some experiencing 60%+ mAP drops under severe blur or noise
- Document-specific pre-training (DiT, LayoutLMv3 on IIT-CDIP) substantially improves robustness, particularly against texture and noise perturbations
- Multi-modal models are not uniformly more robust; LayoutLMv3 shows vulnerability to content interference (watermark, background) despite strong overall numbers
- The RoDLA model demonstrates that attention mechanism modifications can improve both clean performance and robustness simultaneously
- mPE provides more balanced perturbation assessment than individual IQA metrics, and mRD decouples perturbation difficulty from model robustness measurement
Limitations the authors acknowledge:
- The benchmark focuses on vision-based single-modal DLA models; multi-modal robustness testing is limited
- No human-in-the-loop evaluation for deployment readiness
- Perturbations are applied only at test time (no robustness-aware training or augmentation strategies explored)
Limitations not discussed:
- The 12 perturbations, while comprehensive, do not cover all real-world degradations (e.g., partial occlusion, page folding, mixed-resolution scanning)
- mPE still depends on a baseline model (Faster R-CNN) for the degradation component, partially undermining the claim of model independence
- The benchmark uses existing datasets that are predominantly English and scientific/business documents; robustness for non-Latin scripts and diverse document types is not evaluated
- Severity levels are somewhat arbitrary; real-world degradation distributions may not align with the L1/L2/L3 bins
Reproducibility
Models
- RoDLA: DINO-based detector with InternImage backbone (~323M parameters), pre-trained on ImageNet
- Channel-wise Attention module and Average Pooling Layers added to the encoder
- Pre-trained checkpoints for PubLayNet, DocLayNet, and M6Doc are released via the GitHub repo
Algorithms
- Optimizer: AdamW (lr=2e-4, weight decay=1e-4)
- Schedule: Step-based with warmup at epochs {16, 22}, warmup ratio 1e-3
- Training: 24 epochs, batch size 2 per GPU
- Augmentation: Horizontal flip (0.5 ratio), random crop (384, 600)
- All models trained on clean data only; perturbations applied at test time only
- 38 total models trained for the benchmark (24 on PubLayNet, 7 each on DocLayNet and M6Doc)
Data
- PubLayNet-P: PubLayNet test set + 12 perturbations $\times$ 3 levels
- DocLayNet-P: DocLayNet test set + 12 perturbations $\times$ 3 levels
- M6Doc-P: M6Doc test set + 12 perturbations $\times$ 3 levels
- Perturbation generation code and pre-generated benchmarks available via the GitHub repo
- Background perturbation (P5) uses images from ILSVRC (ImageNet)
Evaluation
- Primary metrics: mAP (COCO-style), P-Avg (average mAP across all perturbations/levels), mRD
- mPE uses MS-SSIM + CW-SSIM + Faster R-CNN degradation as components
- All baselines reproduced in MMDetection for fair comparison
- No error bars or multiple-run variance reported
- Rotation perturbation (P1) causes the largest performance drops across all models due to annotation misalignment at high severity
Hardware
- Training: 4$\times$ NVIDIA A100 (40 GB each), 300 GB CPU memory per node
- No GPU-hours or cost estimates reported
- No inference latency or throughput measurements
BibTeX
@inproceedings{chen2024rodla,
title={RoDLA: Benchmarking the Robustness of Document Layout Analysis Models},
author={Chen, Yufan and Zhang, Jiaming and Peng, Kunyu and Zheng, Junwei and Liu, Ruiping and Torr, Philip and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={15556--15566},
year={2024},
doi={10.1109/CVPR52733.2024.01473}
}
SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation
TL;DR
SwinDocSegmenter is a unified end-to-end transformer architecture for document instance segmentation that combines a Swin Transformer backbone with a DETR-style encoder-decoder, contrastive denoising training, and hybrid bipartite matching. It reports competitive results on PubLayNet (93.7 mAP), TableBank (98.0), and Historical Japanese (84.6), while providing the first transformer-based instance segmentation baseline on DocLayNet (76.9). The key finding is that generic MS-COCO pre-training combined with contrastive learning and adaptive matching reduces the domain bias caused by document-specific pre-training.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ : The headline contribution is a unified transformer architecture for document instance segmentation with three components: mixed query selection with contrastive projection heads, contrastive denoising training adapted for masks, and hybrid bipartite matching for domain shift. Ablations and SOTA comparisons dominate the experimental sections.
Secondary: $\Psi_{\text{Evaluation}}$ : The paper provides extensive benchmarking across five datasets (PubLayNet, PRImA, HJDataset, TableBank, DocLayNet) and includes detailed ablations on backbone choice, resolution, query count, loss functions, and pre-training bias.
What is the motivation?
Document layout analysis has traditionally been addressed by detection models (bounding boxes), but instance segmentation (pixel-level, class-aware and instance-aware labels) provides richer structure for downstream document understanding tasks. The authors identify three specific gaps:
- Domain shift: Transformer models pre-trained on domain-specific document datasets (e.g., PubLayNet for scientific articles) fail when applied to different document types (e.g., magazines, historical texts). The pre-training introduces bias toward common classes.
- No unified detection + segmentation: Existing approaches treat detection and segmentation as separate or sequential tasks rather than jointly optimizing both.
- Low-frequency class penalty: Transformers struggle with document element categories that have few training examples (e.g., separators, “other” regions), because standard matching and loss functions are biased toward high-frequency classes like text.
What is the novelty?
Unified Architecture
The model combines a Swin Transformer Large (SwinL) backbone with a transformer encoder-decoder pair and a segmentation branch. The segmentation branch constructs a pixel embedding map (PEM) by fusing multi-scale features:
$$M = Q_e \otimes \delta(\Gamma(S_b) + \psi(T_e))$$
where $Q_e$ is the query embedding from the decoder, $S_b$ is the 1/4-resolution backbone feature map, $T_e$ is the 1/8-resolution encoder feature map (upsampled via $\psi$), $\Gamma$ is a channel projection convolution, and $\delta$ is a segmentation head.
Mixed Query Selection with Contrastive Projection
The encoder output passes through three parallel heads (classification, detection, segmentation). Top-ranked features (by classification confidence) are selected as content queries. Two projection heads apply contrastive learning:
Low-level projection (shallow MLP) for contrastive feature alignment:
$$\mathcal{L}_{low} = \sum_{i=1}^{n} \sum_{j=1}^{n'} -\log \frac{\exp(f_i \cdot f_j / \tau)}{\sum_{c=1}^{k} \exp(f_c \cdot f_j / \tau)}$$
High-level projection (deep MLP) with learned prototypes $P = (p_1, p_2, \ldots, p_m)$:
$$\mathcal{L}_{high} = \sum_{i=1}^{n} \sum_{j=1}^{n'} -\log \frac{\exp(f_i \cdot p_j / \phi_j)}{\sum_{c=1}^{k} \exp(f_c \cdot p_j / \phi_j)}$$
where $\tau$ is a dataset-specific temperature and $\phi_j$ is a concentration estimate per prototype. Content queries remain learnable (not initialized from encoder features), which the authors argue aids domain shift.
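The low-level term has the standard InfoNCE shape; a minimal sketch follows (the index conventions and batch layout are assumptions).

```python
import torch
import torch.nn.functional as F

def l_low(anchors, candidates, pos_idx, tau=0.1):
    """InfoNCE over projected features: for each anchor f_j, contrast its
    positive f_i (candidates[pos_idx[j]]) against all k candidates f_c."""
    logits = anchors @ candidates.t() / tau    # (n', k) similarities
    return F.cross_entropy(logits, pos_idx)

anchors = F.normalize(torch.randn(8, 128), dim=1)      # f_j
candidates = F.normalize(torch.randn(32, 128), dim=1)  # f_c, incl. positives
loss = l_low(anchors, candidates, pos_idx=torch.arange(8))
```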
Contrastive Denoising Training (CDN)
Adapted from DINO-style denoising but extended to the segmentation task. Ground-truth boxes are randomly noised, and:
- Positive queries (noise scale $< \lambda_p$) are trained to reconstruct ground truth
- Negative queries (noise scale between $\lambda_p$ and $\lambda_e$) are suppressed via focal loss
The paper states $\lambda_e > \lambda_p$ but reports $\lambda_p = 0.1$ and $\lambda_e = 0.02$, which is an apparent contradiction in the text. The numerical values are reproduced here as published.
For segmentation, the noised boxes serve as the denoising target, treating masks as the clean signal and boxes as a noised version. Noised queries are added during training and removed at inference.
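A toy sketch of the query noising (the cxcywh box format and uniform jitter scheme are assumptions; the $\lambda$ values are reproduced as published despite the stated inequality):

```python
import torch

def cdn_queries(gt_boxes, lam_p=0.1, lam_e=0.02):
    """Jitter GT boxes (cx, cy, w, h): mild noise -> positive queries
    (trained to reconstruct GT); stronger noise -> negatives (suppressed)."""
    def jitter(boxes, lo, hi):
        mag = lo + torch.rand_like(boxes) * (hi - lo)    # noise magnitude
        sign = torch.rand_like(boxes).round() * 2 - 1    # random +/-1
        return boxes * (1 + sign * mag)
    positives = jitter(gt_boxes, 0.0, lam_p)
    # Negatives take noise scale between lam_p and lam_e (the paper's reported
    # values invert the stated order lam_e > lam_p; reproduced as published).
    negatives = jitter(gt_boxes, min(lam_p, lam_e), max(lam_p, lam_e))
    return positives, negatives
```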
Hybrid Bipartite Matching
The standard Hungarian matching is augmented with an additional mask prediction loss alongside L1 and focal loss. During domain shift from pre-trained weights, the mask prediction penalty is initially weighted more heavily, then decreased as the model converges. This addresses the inconsistency between masks predicted from different heads.
What experiments were performed?
Datasets
| Dataset | Train Instances | Eval Instances | Classes |
|---|---|---|---|
| PubLayNet | 3.26M | 120K | 5 |
| PRImA | 8K | 1.9K | 6 |
| Historical Japanese | 181K | 31.9K | 7 |
| TableBank | 2.8K | 1.4K | 1 |
| DocLayNet | 91K | N/R | 11 |
All models are initialized from MS-COCO Object Detection pre-trained weights.
Metric
COCO-style mAP at IoU thresholds 0.5 to 0.95 (step 0.05), plus AP@50, AP@75, and size-specific AP (small, medium, large).
Main Results
The paper compares against Layout Parser (multimodal, uses Microsoft OCR), DocSegTr (vision-only), and LayoutLMv3 (multimodal). Results are mAP (COCO-style):
| Dataset | SwinDocSeg | Layout Parser | DocSegTr | LayoutLMv3 |
|---|---|---|---|---|
| PubLayNet | 93.72 | 86.7 | 90.4 | 95.1 |
| PRImA | 54.39 | 64.7 | 42.5 | 40.3 |
| Hist. Japanese | 84.55 | 81.6 | 83.1 | 82.7 |
| TableBank | 98.04 | 91.2 | 93.3 | 92.9 |
| DocLayNet (11-class) | 76.85 | N/A | N/A | N/A |
SwinDocSegmenter outperforms all baselines on Historical Japanese and TableBank. On PubLayNet, LayoutLMv3 (which uses OCR text) leads at 95.1. On PRImA, Layout Parser (which also uses OCR text) leads at 64.7; the authors note that the convolution backbone and textual features give Layout Parser an advantage on this small, complex dataset. DocLayNet results are the first transformer-based instance segmentation baseline for this benchmark (compared against Mask R-CNN, Faster R-CNN, and YOLOv5 in a separate table).
Ablation Studies
Backbone choice (PRImA): SwinL (223M) achieves 54.4 AP vs. ResNet-50 (52M) at 36.1 AP and ViT-B (164M) at 46.1 AP. The roughly 18-point gap over ResNet-50 and 8-point gap over ViT-B justify the parameter cost.
Input resolution (PRImA): Performance increases with resolution: 256px (45.0 AP), 512px (50.1 AP), 1024px (54.4 AP). Higher resolutions were not feasible due to compute constraints.
Number of decoder queries (PRImA): 300 queries (54.4 AP) outperform 100 queries (50.0 AP). The authors note SwinL could theoretically support 900-1200 queries but were limited by available compute.
Pre-training bias (PRImA): MS-COCO pre-training (54.4 AP) substantially outperforms PubLayNet pre-training (49.4 AP). PubLayNet pre-training biases the model toward tables (70.7 vs. 49.9 AP) while severely penalizing rare classes like separators (8.6 vs. 27.6 AP). This finding validates the domain-adaptive design: generic pre-training with contrastive learning and hybrid matching outperforms domain-specific pre-training.
Loss function: L1 + Focal Loss is the best combination, outperforming L2-based alternatives.
What are the outcomes/conclusions?
Key findings
- Generic pre-training outperforms domain-specific on cross-domain tasks: MS-COCO pre-training with the proposed contrastive and matching mechanisms outperforms PubLayNet pre-training on PRImA (54.4 vs. 49.4 AP), reducing common-class bias.
- Competitive with multimodal methods using vision alone: SwinDocSegmenter outperforms all baselines on HJDataset and TableBank, and outperforms the vision-only DocSegTr across all benchmarks. It falls short of multimodal methods (LayoutLMv3, Layout Parser) that leverage OCR text on PubLayNet and PRImA.
- First DocLayNet instance segmentation baseline: The 11-class DocLayNet results (76.9 mAP) provide a reference point for future work; best classes are Text (88.2) and Table (87.4), weakest are Formula (62.3) and Title (63.3).
Limitations
- PRImA performance gap: Despite improvements, 54.4 mAP on PRImA is still relatively low, reflecting the challenge of small training sets with complex layouts.
- No text information: The model is purely visual, which limits disambiguation of semantically similar regions (e.g., title vs. section header).
- Compute-limited exploration: Resolution capped at 1024px and queries at 300 due to hardware constraints; higher settings could potentially improve results.
- Dataset-specific hyperparameters: The temperature $\tau$ must be tuned per dataset (0.02 for PubLayNet, 0.6 for PRImA, 0.1 for HJDataset, 0.2 for TableBank), reducing out-of-the-box applicability.
- Weak on rare classes: Categories with few instances (“Other” in PRImA at 7.1 AP, “Other” in HJDataset at 40.6 AP) remain challenging despite the contrastive approach.
Reproducibility
Models
- Backbone: Swin Transformer Large (SwinL), 223M parameters total.
- Architecture: Swin backbone + transformer encoder-decoder with deformable attention + segmentation branch.
- Positional embeddings via 3$\times$3 convolutions.
- Input resolution: 1024$\times$1024.
- Decoder queries: 300.
- Pre-trained from MS-COCO Object Detection weights.
- Code available at GitHub under Apache-2.0. Pre-trained checkpoints for all five benchmarks (PubLayNet, PRImA, HJDataset, TableBank, DocLayNet) are available via Google Drive links in the repo’s Model Zoo.
Algorithms
- Loss: L1 + Focal Loss (detection) + mask prediction loss (segmentation) + contrastive losses ($\mathcal{L}_{low}$, $\mathcal{L}_{high}$).
- CDN hyperparameters: $\lambda_p = 0.1$, $\lambda_e = 0.02$ (paper states $\lambda_e > \lambda_p$ but provides these contradictory values).
- Temperature: dataset-specific ($\tau$ = 0.02/0.6/0.1/0.2 for PubLayNet/PRImA/HJ/TableBank).
- Optimizer, learning rate, batch size, number of epochs: not reported.
Data
- PubLayNet, PRImA, HJDataset, TableBank, DocLayNet: all publicly available.
- Standard train/eval splits used.
- No data augmentation details reported.
Evaluation
- COCO-style mAP (IoU 0.5:0.05:0.95), plus AP@50, AP@75, AP_S, AP_M, AP_L.
- Compared against Mask R-CNN, SOLOv2, CondInst, Mask2Former, DocSegTr, and others.
- No error bars, significance tests, or multi-run statistics reported.
- Ablations only on PRImA; results may not generalize across all datasets.
Hardware
- No training hardware, GPU hours, or memory requirements reported.
- No inference latency or throughput measurements.
- Authors note compute constraints limited resolution and query exploration.
BibTeX
@inproceedings{banerjee2023swindocsegmenter,
title={SwinDocSegmenter: An End-to-End Unified Domain Adaptive Transformer for Document Instance Segmentation},
author={Banerjee, Ayan and Biswas, Sanket and Llad{\'o}s, Josep and Pal, Umapada},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={307--325},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41734-4_19}
}
VSR: Unifying Vision, Semantics, and Relations for Document Layout Analysis
TL;DR
VSR is a multimodal document layout analysis framework that fuses visual features from document images with character-level and sentence-level text embeddings via an adaptive two-stream network, then refines predictions using a GNN-based relation module. It supports both NLP-based (sequence labeling) and CV-based (object detection) paradigms, and achieved first place on the ICDAR 2021 PubLayNet leaderboard with 95.7% AP.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new multimodal architecture: the two-stream network with adaptive aggregation and GNN-based relation module. The paper devotes most of its pages to describing the architecture, fusion mechanism, and ablation studies.
Secondary: None. The paper’s thorough ablation studies and cross-paradigm evaluation (testing the same architecture in both detection and sequence labeling modes across three benchmarks) serve the method contribution rather than constituting an independent evaluation or measurement contribution.
What is the motivation?
Prior document layout analysis methods fall into two camps, each with limitations:
- NLP-based (sequence labeling): Methods like LayoutLM treat layout as token classification. They capture text semantics but have limited spatial modeling, especially for visually rich elements (figures, tables).
- CV-based (object detection): Methods like Faster R-CNN treat the page as an image. They capture visual structure but ignore text semantics and lack explicit modeling of inter-component relationships (e.g., a figure caption should appear near its figure).
Existing multimodal approaches suffer from two additional weaknesses: (a) they use single-granularity text features (either character or sentence, not both), with shallow fusion strategies like concatenation; and (b) they lack explicit relation modeling between detected components.
What is the novelty?
VSR introduces three architectural contributions:
1. Multi-Granularity Semantic Features
Text information is encoded as spatial maps at two granularities. Character-level embeddings are rendered onto the document canvas using character bounding boxes:
$$\text{CharGrid}_{ij} = \begin{cases} E^c(c_k) & \text{if } (i, j) \in b_k^c \\ 0 & \text{otherwise} \end{cases}$$
Sentence-level embeddings use BERT to encode full sentences, then paint them onto the canvas at sentence bounding boxes:
$$\text{SentGrid}_{ij} = \begin{cases} E^s(s_k) & \text{if } (i, j) \in b_k^s \\ 0 & \text{otherwise} \end{cases}$$
Both are combined via LayerNorm: $S_0 = \text{LayerNorm}(\text{CharGrid} + \text{SentGrid})$.
2. Adaptive Multi-Scale Aggregation
Rather than concatenating visual and semantic features, VSR learns a per-pixel attention map at each FPN scale that adaptively weights the two modalities:
$$AM_i = h(g([V_i, S_i]))$$
$$FM_i = AM_i \odot V_i + (1 - AM_i) \odot S_i$$
where $g(\cdot)$ is a $1 \times 1$ convolution, $h(\cdot)$ is a non-linear activation function (likely sigmoid, given the $[0,1]$ weighting), and $\odot$ is element-wise multiplication. This lets the model learn that visual features matter more for figures, while semantic features matter more for text-centric classes like Author.
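A compact sketch of the aggregation (sigmoid for $h(\cdot)$ is an inference from the $[0,1]$ weighting, as noted above):

```python
import torch
import torch.nn as nn

class AdaptiveAggregation(nn.Module):
    """FM = AM * V + (1 - AM) * S with AM = h(g([V, S])) per pixel and scale."""
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, V, S):                       # both (B, C, H, W)
        AM = torch.sigmoid(self.g(torch.cat([V, S], dim=1)))
        return AM * V + (1 - AM) * S

fm = AdaptiveAggregation(256)(torch.randn(1, 256, 64, 64),
                              torch.randn(1, 256, 64, 64))
```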
3. GNN-Based Relation Module
Detected component candidates form a fully-connected graph. Node features combine RoIAlign features with positional encodings:
$$z_j = \text{LayerNorm}(f_j + e_j^{pos}(b_j))$$
A multi-head self-attention mechanism (16 heads, 2 layers, 1024 dimensions) models pairwise relations:
$$\hat{O} = \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The relation module refines both class predictions and bounding box coordinates. Crucially, the same relation module design works for both detection candidates and token-level sequence labeling.
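The relation module is essentially a small transformer encoder over candidate nodes; a sketch with the stated sizes follows (the box positional encoding is an assumed linear projection).

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Self-attention over a fully-connected candidate graph: 2 layers, 16 heads."""
    def __init__(self, dim=1024, heads=16, layers=2):
        super().__init__()
        self.pos = nn.Linear(4, dim)               # e^pos(b_j), assumed linear
        self.norm = nn.LayerNorm(dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, roi_feats, boxes):           # (B, n, dim), (B, n, 4)
        z = self.norm(roi_feats + self.pos(boxes)) # z_j = LN(f_j + e^pos(b_j))
        return self.encoder(z)                     # refined node features

out = RelationModule()(torch.randn(1, 10, 1024), torch.rand(1, 10, 4))
```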
Combined Loss
For the detection variant:
$$\mathcal{L} = \mathcal{L}_{\text{DET}} + \lambda \mathcal{L}_{\text{RM}}$$
where $\lambda = 1$ and both terms include cross-entropy (classification) and smooth L1 (regression) losses.
What experiments were performed?
VSR is evaluated on three benchmarks spanning both detection and sequence labeling formats:
Article Regions (822 pages, 9 classes, mAP)
| Method | mAP |
|---|---|
| Faster RCNN (reimpl.) | 87.3 |
| Faster RCNN + context (reimpl.) | 87.8 |
| VSR | 94.5 |
VSR improves mAP by +7.2 points. The largest per-class gains are on Author (+42.9 pp over the Faster RCNN baseline) and Table Caption (+17.5 pp), both classes that benefit heavily from semantic and relational cues.
PubLayNet (360K pages, 5 classes, AP@[0.50:0.95])
| Method | AP (val) | AP (test) |
|---|---|---|
| Mask R-CNN | 91.0 | 90.7 |
| SiliconMinds | – | 95.03 |
| VSR | 95.7 | 95.69 |
VSR ranked first on the ICDAR 2021 PubLayNet leaderboard, outperforming the Mask R-CNN baseline by +5.0 AP on the test set.
DocBank (500K pages, 12 classes)
Sequence labeling (F1):
| Method | Macro F1 |
|---|---|
| LayoutLM_large | 93.50 |
| X101 + LayoutLM_large | 94.88 |
| VSR | 95.59 |
VSR achieves the highest score on 10 of 12 classes.
Object detection (mAP):
| Method | mAP |
|---|---|
| Faster RCNN | 86.3 |
| VSR | 87.6 |
Ablation Studies (Article Regions)
The ablation studies on Article Regions isolate each component’s contribution:
| Configuration | mAP |
|---|---|
| Vision only (Faster RCNN) | 87.3 |
| + CharGrid | 90.7 (+3.4) |
| + SentGrid | 90.0 (+2.7) |
| + CharGrid + SentGrid | 92.3 (+5.0) |
| + Adaptive fusion + Relation Module (full VSR) | 94.5 (+7.2) |
The relation module alone adds +5.3 mAP to the Faster RCNN baseline, with the most dramatic improvement on Author (51.1 $\rightarrow$ 88.4, +37.3 pp). Character-level semantics help identify components requiring less context (Author), while sentence-level semantics better capture contextual components (Figure Caption).
What are the outcomes/conclusions?
Strengths:
- The unified framework genuinely supports both NLP-based and CV-based paradigms with the same core modules, which is a practical design contribution.
- The ablation studies are thorough and clearly demonstrate that each module (multi-granularity semantics, adaptive aggregation, relation module) contributes meaningful gains.
- Strong results across three diverse benchmarks, including a first-place leaderboard result on PubLayNet at the time of publication.
- The relation module addresses a real structural gap: document components have strong spatial relationships (caption near figure, columns aligned) that detection-only models cannot exploit.
Limitations:
- VSR requires text positions and content in addition to document images, making it dependent on an upstream OCR or PDF parsing step. This limits applicability to scanned documents without reliable OCR.
- Code and trained model weights were released through the DAVAR-Lab-OCR repository (Apache-2.0). However, the implementation depends on older framework versions (mmdetection 2.11.0, mmcv 1.3.4), and trained weights are hosted on Hikvision’s internal file sharing rather than a standard model hub, which may limit long-term accessibility.
- The DocBank detection results show only a modest +1.3 mAP improvement, and VSR slightly underperforms Faster RCNN on some individual classes (Equation, Footer, Title), suggesting the relation module can occasionally hurt when the graph structure does not match the document’s actual layout.
- The paper uses ResNeXt-101 as the backbone, which is large. No efficiency analysis (inference speed, VRAM) is provided, making it hard to assess practical deployment cost.
- Results on only three datasets, two of which (PubLayNet, DocBank) are scientific documents. Generalization to diverse document types (forms, invoices, historical) is not evaluated.
Reproducibility
Models
- Backbone: ResNeXt-101, initialized from a Mask R-CNN model pretrained on COCO (backbone weights copied to both visual and semantic streams). No parameter count reported for the full model.
- Semantic stream: Character embedding layer (dimension $C_S = 64$) + pretrained BERT-base-uncased for sentence embeddings (dimension reduced to 64).
- Relation module: 2 layers of multi-head self-attention, 16 heads, 1024 hidden dimensions.
- Weights available. Trained checkpoints for PubLayNet (AP 95.8) and DocBank (F1 95.25) are hosted on Hikvision’s file sharing with access codes. Pretrained backbone (Mask R-CNN on COCO) and BERT-base-uncased are also linked.
Algorithms
- Optimizer: SGD, momentum 0.9, weight decay $10^{-4}$
- Learning rate: $10^{-3}$, divided by 10 every 10 epochs (Article Regions) or every 3 epochs (PubLayNet, DocBank)
- Batch size: 2
- Epochs: 30 (Article Regions), 6 (PubLayNet, DocBank)
- RPN anchor ratios: 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0
- Loss trade-off: $\lambda = 1$
- Framework: PyTorch (mmdetection 2.11.0, mmcv 1.3.4)
- Test scale: (1300, 800) for PubLayNet; (600, 800) for DocBank
Data
- Article Regions: 822 documents, 9 classes. Detection format.
- PubLayNet: 360K documents, 5 classes. Detection format. COCO-style AP.
- DocBank: 500K documents, 12 classes. Token-level annotations (sequence labeling) and bounding-box annotations (detection).
- All three datasets are publicly available.
- VSR requires character-level bounding box annotations extracted from PDF parsing, which is a non-trivial preprocessing step. The DAVAR-Lab-OCR repo provides datalist format examples for PubLayNet and DocBank.
Evaluation
- Metrics: mAP (Article Regions, DocBank detection), AP@[0.50:0.95] COCO format (PubLayNet), macro F1 (DocBank sequence labeling).
- Baselines: Faster RCNN, Mask RCNN, BERT/RoBERTa/LayoutLM variants, and ICDAR 2021 leaderboard entries.
- No error bars, significance tests, or multi-run analysis reported.
- The paper does not discuss inference speed or computational overhead from the relation module.
Hardware
- Training: Tesla V100 GPUs (count not specified).
- No inference speed, VRAM, or cost figures reported.
BibTeX
@inproceedings{zhang2021vsr,
title={VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations},
author={Zhang, Peng and Li, Can and Qiao, Liang and Cheng, Zhanzhan and Pu, Shiliang and Niu, Yi and Wu, Fei},
booktitle={Document Analysis and Recognition -- ICDAR 2021},
pages={115--130},
year={2021},
publisher={Springer},
doi={10.1007/978-3-030-86549-8_8}
}
VTLayout: Fusion of Visual and Text Features for Document Layout Analysis
TL;DR
VTLayout is a two-stage document layout analysis model. The first stage uses Cascade Mask R-CNN for region localization, and the second stage re-classifies those regions by fusing three feature types: deep visual features (MobileNetV2), shallow visual features (pixel histogram statistics), and text features (TF-IDF over OCR output). On PubLayNet, the model reports an overall F1 of 0.9599, with the largest gains on List and Title categories where pure detection models struggle.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a two-stage architecture that decouples localization from classification and introduces a multimodal fusion strategy combining three distinct feature extractors (DVFE, SVFE, TFE) with a Squeeze-and-Excitation weighting mechanism.
Secondary: None. The paper does not release code, data, or models.
What is the motivation?
Standard object detection models (Faster R-CNN, Mask R-CNN) applied to DLA perform well at localizing regions but often misclassify visually similar categories, particularly List vs. Text and Title vs. Text. The authors observe that:
- Text-format categories (List, Text, Title) share similar deep visual features, making them hard to distinguish with a detection backbone alone.
- Shallow statistical features (pixel value histograms) exhibit distinct distributions across categories.
- Text content (e.g., short vs. long runs of text) carries discriminative signal that vision-only models ignore.
The paper argues that fusing these three complementary feature types can substantially improve classification accuracy without changing the underlying detector.
What is the novelty?
The core contribution is a two-stage pipeline:
Stage 1 (Localization): Cascade Mask R-CNN with ResNeXt-101-64x4d backbone detects and localizes all category blocks. This stage handles bounding box regression and initial classification.
Stage 2 (Re-classification): Three parallel feature extractors process each detected region:
- Deep Visual Feature Extractor (DVFE): MobileNetV2 extracts CNN features from each cropped region (resized to $128 \times 128$). A Squeeze-and-Excitation (SE) block re-calibrates channel-wise feature responses.
- Shallow Visual Feature Extractor (SVFE): Converts each region to grayscale and computes a 256-dimensional pixel histogram (count of pixels at each intensity value 0-255). This captures intuitive differences: Title blocks tend to have bold/dark pixel distributions distinct from body Text.
- Text Feature Extractor (TFE): PaddleOCR extracts text from each region (with 8$\times$ enlargement for small Title blocks), then TF-IDF produces a feature vector capturing word-level statistics.
The SVFE and TFE vectors are concatenated and passed through a 4-layer fully connected network (512, 256, 128, 64 neurons). The output is concatenated with the SE-weighted DVFE features, then classified via softmax.
The loss function for the classification stage is cross-entropy:
$$\mathcal{L} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
where $C = 5$ is the number of PubLayNet categories.
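The second-stage features are cheap to reproduce; below is a sketch of the SVFE histogram (standard BT.601 luma weights are assumed for the grayscale conversion, which the paper does not specify).

```python
import numpy as np

def shallow_visual_feature(region_rgb):
    """SVFE: grayscale conversion + 256-bin pixel-intensity histogram."""
    gray = (0.299 * region_rgb[..., 0] + 0.587 * region_rgb[..., 1]
            + 0.114 * region_rgb[..., 2]).astype(np.uint8)
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return hist.astype(np.float32)                 # shape (256,)

region = (np.random.rand(64, 128, 3) * 255).astype(np.uint8)  # toy crop
feat = shallow_visual_feature(region)
```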
What experiments were performed?
All experiments use the PubLayNet dataset (335,703 training images, 11,245 validation images; 5 categories: Text, Title, List, Table, Figure). The test set was not publicly available, so the authors use the validation set for evaluation.
Baselines: Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN, all reproduced by the authors.
Metrics: Precision, Recall, and F1 score (classification-focused, not mAP).
Key experiments:
- Full comparison (Table 2): VTLayout vs. baselines on full PubLayNet.
- Per-category F1 (Table 3): Breakdown showing where VTLayout improves.
- Small-dataset stability (Table 4): Training on a reduced subset (25k Text/Title, 10k Figure/List/Table). Five-fold cross-validation to assess stability.
- Ablation study (Table 5): Seven configurations removing individual or pairs of extractors (DVFE, SVFE, TFE) to measure each component’s contribution.
What are the outcomes/conclusions?
Main results:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| Faster R-CNN | 0.9319 | 0.9130 | 0.9224 |
| Mask R-CNN | 0.9379 | 0.9410 | 0.9385 |
| Cascade Mask R-CNN | 0.9515 | 0.9506 | 0.9510 |
| VTLayout | 0.9584 | 0.9618 | 0.9599 |
Per-category highlights:
- List F1 improves from 0.9055 (Cascade Mask R-CNN) to 0.9177
- Title F1 improves from 0.9166 (Cascade Mask R-CNN) to 0.9411, though Faster R-CNN still edges it out at 0.9425, suggesting the fusion approach does not uniformly dominate all baselines on every category
- Text F1 improves from 0.9688 to 0.9751
- Table F1 improves from 0.9794 to 0.9833
- Figure F1 drops slightly from 0.9846 to 0.9824
Ablation findings:
- DVFE alone achieves 0.9272 F1, confirming it as the most important component
- Removing SVFE from the full model drops F1 from 0.9599 to 0.8956 (large degradation, especially for List: 0.9177 to 0.7635)
- Removing TFE drops F1 from 0.9599 to 0.9440; the effect is moderate but consistent across categories
- SVFE alone achieves 0.8209, showing pixel histograms carry meaningful discriminative signal
Limitations:
- Evaluated only on PubLayNet, a single-domain (biomedical) dataset with 5 categories. Generalization to other domains or finer-grained taxonomies is untested.
- The two-stage pipeline is sequential and likely slower than end-to-end detectors, though no latency measurements are reported.
- Relies on PaddleOCR for text extraction, introducing a dependency on OCR quality. The authors use `chinese_ocr_db_crnn_mobile`, a model name suggesting Chinese-language optimization, on English biomedical text. While PaddleOCR supports English, this choice may limit TFE effectiveness. The authors also note that enlarging small Title blocks before OCR helps, but blurring from upscaling still limits recognition.
- No code, model weights, or trained artifacts released.
- The evaluation uses precision/recall/F1 on classification rather than the standard mAP@IoU metric, making direct comparison with other PubLayNet results difficult. Because F1 is measured only on regions that Stage 1 successfully detected, entirely missed regions are not penalized; the reported scores reflect classification accuracy on localized regions, not end-to-end system quality.
- The “shallow visual feature” (pixel histogram) is sensitive to document style. Its effectiveness on diverse document types (e.g., handwritten, colored, degraded) is unknown.
- The five-fold cross-validation results in Table 4 contain a suspicious discrepancy: per-fold List F1 values range from 0.9648 to 0.9810 (averaging approximately 0.9730), yet the paper’s reported column average is 0.9286. This appears to be a typographical error in the original paper.
- The pipeline spans two frameworks (PyTorch for Stage 1, TensorFlow for Stage 2), adding integration complexity for anyone attempting to reproduce the work.
Reproducibility
Models
- Stage 1: Cascade Mask R-CNN with ResNeXt-101-64x4d backbone, pre-trained on ImageNet. Implemented in PyTorch.
- Stage 2 DVFE: MobileNetV2 pre-trained on ImageNet (without top FC layer), implemented in TensorFlow. Input: $128 \times 128 \times 3$. Global average pooling applied to last conv block output. SE block for channel re-calibration.
- Stage 2 classifier: 4-layer FC network with 512, 256, 128, 64 neurons. Softmax output over 5 classes.
- No weights released.
Algorithms
- Stage 1 training: SGD optimizer, initial LR 0.02, momentum 0.9, weight decay 0.0001. Batch size 8 (1 sample/GPU). 30 epochs.
- Stage 2 training: Adam optimizer, initial LR 0.001. Cross-entropy loss.
- SVFE: Color to grayscale conversion, then 256-bin pixel histogram per region.
- TFE: PaddleOCR (`chinese_ocr_db_crnn_mobile` model) for text extraction. Title blocks enlarged 8$\times$ before OCR. TF-IDF via scikit-learn.
- No data augmentation details reported for either stage.
- Several Stage 2 details are unspecified: number of training epochs, FC layer activation functions, SE block reduction ratio, and TF-IDF vector dimensionality (max features/vocabulary size).
Data
- Training: PubLayNet training set (335,703 images, 3.3M annotations across 5 categories). PubLayNet annotations are licensed under CDLA-Permissive-1.0; source images come from PubMed Central Open Access.
- Evaluation: PubLayNet validation set (11,245 images, 120,761 annotations). Test set not publicly available.
- Small-dataset experiment: 25k Text/Title images, 10k Figure/List/Table images randomly sampled from Cascade Mask R-CNN stage 1 output.
Evaluation
- Metrics: Precision, Recall, F1 score (classification-focused). The authors explicitly chose not to use mAP@IoU to isolate classification performance from localization quality.
- Baselines: Faster R-CNN and Mask R-CNN reproduced by the authors; Cascade Mask R-CNN is VTLayout’s own first stage.
- Ablation: 7 configurations testing all subsets of {DVFE, SVFE, TFE}.
- Stability: 5-fold cross-validation on the small subset.
- No error bars, significance tests, or multi-seed runs reported.
Hardware
- 8 GPUs mentioned (batch size 8, 1 sample per GPU) for Cascade Mask R-CNN training.
- GPU type, total training time, and inference speed not reported.
BibTeX
@inproceedings{li2021vtlayout,
title={VTLayout: Fusion of Visual and Text Features for Document Layout Analysis},
author={Li, Shoubin and Ma, Xuyan and Pan, Shuaiqun and Hu, Jun and Shi, Lin and Wang, Qing},
booktitle={PRICAI 2021: Trends in Artificial Intelligence},
pages={304--319},
year={2021},
publisher={Springer},
doi={10.1007/978-3-030-89188-6_23}
}
YALTAi: Object Detection for Layout Analysis in Kraken
TL;DR
YALTAi proposes replacing Kraken’s pixel-classification-based region segmentation with YOLOv5 object detection using isothetic bounding boxes. On small historical document datasets (around 1,000 images), YOLOv5 outperforms Kraken’s built-in segmenter by roughly $7\times$ on mAP, while also being more GPU-efficient. The authors release two datasets and a tool that injects YOLOv5 into Kraken’s segmentation pipeline.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a methodological shift: reframing layout analysis from pixel classification to object detection, and demonstrating that this yields large accuracy gains on small datasets when integrated into an existing OCR pipeline (Kraken).
Secondary: $\Psi_{\text{Resource}}$. The paper releases two new annotated datasets (YALTAI-MSS-EPB for manuscripts/early printed books and YALTAi-Tables for tabular documents) and an open-source tool (YALTAi) that plugs YOLOv5 into Kraken.
What is the motivation?
Layout analysis (identifying and classifying regions on a page) is the first step in OCR/HTR pipelines. The open-source Kraken engine, widely used in digital humanities, performs pixel-level classification for region segmentation. This approach has two known weaknesses:
- Poor performance on small datasets. Training Kraken’s segmenter on roughly 1,000 images or fewer yields unreliable results, which is a common scenario for historical document projects.
- Region merging. Because Kraken operates at the pixel level, it tends to merge adjacent regions of the same class (e.g., neighboring columns), requiring workarounds like “ColOdd/ColEven” naming hacks.
Meanwhile, ICDAR competitions shifted from polygon-based evaluation (2011) to pixel-based evaluation (2017+), which the authors argue biased the field away from object detection approaches that might be better suited for downstream text extraction.
Three specific digital humanities research projects are cited as being blocked by Kraken’s segmentation limitations on small datasets.
What is the novelty?
The core proposal is to treat document layout analysis as an object detection task rather than a pixel classification task. Concretely:
- Ground truth polygons are simplified to isothetic (axis-aligned) bounding boxes by taking the min/max X and Y coordinates of each polygon (see the sketch after this list):
$$\text{bbox} = \big(\min(x_i),\; \min(y_i),\; \max(x_i),\; \max(y_i)\big)$$
- YOLOv5 is trained on these bounding boxes to detect and classify document regions.
- At inference time, YALTAi runs YOLOv5 first for region detection, then injects those regions into Kraken’s pipeline before Kraken’s line segmenter dispatches text lines into the detected zones.
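The polygon simplification is a one-liner, reproduced here to make the information loss explicit (plain Python, not YALTAi's code):

```python
def polygon_to_bbox(points):
    """Reduce a polygon [(x, y), ...] to its isothetic bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# A C-shaped region collapses to a plain rectangle covering the whole glyph:
bbox = polygon_to_bbox([(0, 0), (10, 0), (10, 2), (3, 2),
                        (3, 8), (10, 8), (10, 10), (0, 10)])
# -> (0, 0, 10, 10)
```

As the example shows, non-convex region shapes are lost in the conversion, which is exactly the limitation the authors acknowledge below.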
This is not a novel architecture. The contribution is the reframing of the task and the integration: reusing off-the-shelf YOLOv5 within Kraken’s ecosystem, which lets users benefit from YOLO’s strong few-shot learning and fast inference without abandoning Kraken’s line segmentation and OCR capabilities.
The datasets follow the Segmonto ontology for region classification labels, providing a standardized vocabulary (MainZone, MarginTextZone, GraphicZone, DropCapitalZone, etc.).
What experiments were performed?
Datasets
Two datasets were assembled for training and evaluation:
- YALTAI-MSS-EPB (Manuscripts and Early Printed Books): 1,110 pages from the 9th to 17th century, drawn from five existing CREMMA/Gallicorpora datasets plus 593 new annotated pages. 12 zone types following Segmonto. Split: 854 train / 154 dev / 139 test (test is out-of-source).
- YALTAi-Tables: Tabular documents from the Lectaurep dataset for train/dev, with an entirely out-of-domain test set spanning the 17th to early 20th century. 4 zone types (Col, Header, Marginal, Text).
Models
- Kraken segmenter trained at two input resolutions (640 and 1200 pixels), 50 epochs, best model selected by mAP on dev set.
- YOLOv5x (extra-large, 175 MiB) and YOLOv5n (nano, 3.9 MiB), both trained for 50 epochs on the same bounding-box data.
Metrics
Evaluation uses mAP@0.5 (mean average precision at IoU threshold 0.5) and per-class AP, computed with the `mean-average-precision==2021.4.26.0` Python package. Kraken’s polygon outputs were converted to bounding boxes for fair comparison.
Key Results
YALTAI-MSS-EPB:
| Model | mAP | MainZone AP | DropCapital AP |
|---|---|---|---|
| Kraken | 6.98 | 43.5 | 23.3 |
| YOLOv5n | 34.63 | 87.0 | 54.6 |
| YOLOv5x | 47.75 | 91.7 | 69.2 |
YOLOv5x achieves roughly $7\times$ the mAP of Kraken. On MainZone (the most important class for text extraction), YOLOv5x more than doubles Kraken’s AP. Kraken scores 0% AP on several minority classes (MarginText, Numbering, QuireMarks, RunningTitle), while YOLOv5x achieves 45-76% AP on these.
YALTAi-Tables:
| Model | mAP | Col AP | Header AP |
|---|---|---|---|
| Kraken | 0.09 | 0.1 | 0.1 |
| YOLOv5x | 4.77 | 12.9 | 1.4 |
The tabular results are low overall because the test set is entirely out-of-domain, but they highlight Kraken’s inability to separate adjacent same-class columns (a known pixel-classification failure mode) versus YOLOv5’s ability to detect individual column instances.
Efficiency
All training on a single NVIDIA RTX 2080 Ti (11 GB). YOLOv5 generally uses less peak GPU power and trains in roughly half the time compared to Kraken (batch size 2 vs. 1). YOLOv5x inference is 0.025s per image; YOLOv5n is 0.004s per image.
What are the outcomes/conclusions?
The results suggest that object detection (specifically YOLOv5) is a practical replacement for pixel-based segmentation in Kraken’s pipeline, especially for small dataset scenarios common in historical document digitization projects.
Limitations acknowledged by the authors:
- Isothetic bounding boxes cannot represent complex polygon shapes (P-shaped, C-shaped, or rotated regions). This limits applicability to documents with relatively rectangular, unskewed regions.
- Bounding boxes can overlap, losing the precision of non-overlapping polygon segmentation.
- The tabular test set is deliberately adversarial (fully out-of-domain), making those numbers hard to interpret as general performance indicators.
Limitations not acknowledged:
- The comparison is somewhat asymmetric: YOLOv5 benefits from large-scale COCO pre-training, while Kraken’s segmenter does not leverage comparable pre-training. The performance gap may partly reflect transfer learning advantages rather than fundamental architectural superiority.
- mAP@0.5 is a relatively lenient threshold. Results at higher IoU thresholds (e.g., mAP@0.75 or mAP@[0.5:0.95]) are not reported, which matters because bounding boxes inherently have looser fits than polygons.
- No downstream OCR/HTR accuracy is reported. The paper assumes better region detection leads to better text extraction, but this is not validated end-to-end.
- No ablation over training set size is provided, despite the paper’s central claim being about small-data regimes. It remains unclear at what dataset size the advantage over Kraken’s segmenter would diminish or disappear.
Reproducibility
Models
- No novel architecture; uses off-the-shelf YOLOv5 (commit 29d79a6, July 2, 2022) and Kraken v4.1.2.
- YOLOv5x (~86.7M params, 175 MiB) and YOLOv5n (~1.9M params, 3.9 MiB), both initialized from COCO-pretrained weights. Kraken segmenter model is 5 MiB (no comparable pre-training).
- No trained weights are released. The YALTAi package provides the integration pipeline but does not bundle domain-specific checkpoints.
Algorithms
- 50 epochs for all models.
- YOLOv5: batch size 2, input size 640px, best model selected by YOLOv5’s internal metric. The paper does not specify optimizer, learning rate, or augmentation settings; presumably YOLOv5 defaults apply (SGD with momentum 0.937, initial LR 0.01, cosine annealing, mosaic/mixup augmentation). The default augmentation pipeline is notable given the paper’s small-data thesis.
- Kraken: batch size 1, input resize 640 and 1200. Best model selected by mAP on dev set (computed externally since Kraken does not support this natively).
- Loss functions are not described. YOLOv5 uses a composite loss (objectness + classification + CIoU box regression); Kraken uses pixel-level cross-entropy. Neither is discussed in the paper.
Data
- Both datasets are publicly available on Zenodo under CC-BY-4.0.
- YALTAI-MSS-EPB: 1,110 pages, fixed train/dev/test splits included. Assembled from five source datasets: CREMMA Medieval, CREMMA Medieval LAT, Eutyches, and two Gallicorpora sets (HTR-MSS-15e-Siecle, HTR-imprime-gothique-16e-siecle), plus 593 original pages annotated for this paper.
- YALTAi-Tables: derived from Lectaurep Repertoires with an entirely out-of-domain test set.
- Annotation follows Segmonto ontology. Some annotations (the 593 original pages) were produced with early model predictions and corrected by hand on Roboflow. No inter-annotator agreement statistics are reported, and this semi-automated workflow may introduce annotation bias toward the model’s error patterns.
Evaluation
- mAP@0.5 using the `mean-average-precision==2021.4.26.0` Python package. The AP interpolation method (11-point vs. all-point) is not specified.
- IoU threshold 0.5 only; no results at stricter thresholds.
- Kraken polygons converted to bounding boxes for fair comparison. The paper reports Kraken results at input resize 1200 (the better setting); results at 640 are not shown in the main tables.
- No error bars, multiple runs, or significance tests reported.
Hardware
- Single NVIDIA RTX 2080 Ti (11 GB VRAM, 250W TDP).
- Ubuntu with CUDA 10.2, drivers 440.118.02.
- Training times: Kraken Segmonto ~6.5 hours; YOLOv5x Segmonto ~3 hours; YOLOv5n ~3.2 hours.
- YOLOv5x inference: 0.025s/image. YOLOv5n inference: 0.004s/image.
BibTeX
@article{clerice2023yaltai,
title={You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine},
author={Cl{\'e}rice, Thibault},
journal={Journal of Data Mining and Digital Humanities},
year={2023},
doi={10.46298/jdmdh.9806}
}
CommonForms: A Large, Diverse Dataset for Form Field Detection
TL;DR
CommonForms is a 480k-page dataset of fillable PDF forms mined from Common Crawl, covering 10+ languages and 14 domains. The authors also train and release FFDNet-S (9M params per paper; 6M per released HuggingFace card) and FFDNet-L (25M params), high-resolution YOLO11 object detectors for form field detection that achieve 81.0 mAP$_{50-95}$ and qualitatively outperform Adobe Acrobat, including on choice buttons (checkboxes/radio buttons) that Acrobat does not detect at all. Note: due to a processing error acknowledged in the paper, the released models were trained on ~350K pages rather than the full ~480K.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is the dataset: a large-scale, diverse collection of form images with form field annotations bootstrapped from existing fillable PDFs. The paper devotes substantial attention to the curation pipeline, filtering strategy, language/domain analysis, and annotation consistency discussion.
Secondary: $\Psi_{\text{Method}}$
The FFDNet family of detectors is a meaningful secondary contribution. While they are straightforward YOLO11 models (not architecturally novel), the authors release trained weights and demonstrate that high-resolution inputs and aggressive data filtering are important practical design choices for this task.
What is the motivation?
Despite decades of digitalization, many real-world transactions still rely on paper forms distributed as flat (non-fillable) PDFs. Making these forms digitally fillable requires two steps: (1) form field detection (locating where fields should go and what type they are) and (2) form enrichment (grouping fields with labels for semantics and accessibility).
Prior work (Form2Seq, FormAlly) focused on the second step, assuming fields are already detected. There was no large-scale, open dataset for the detection step itself. Existing form understanding datasets like FUNSD (199 forms) and NAF (77 forms) target key information extraction from filled forms, not detection of empty fields. Commercial tools like Adobe Acrobat offer form preparation, but are proprietary, do not detect choice buttons, and (as the authors demonstrate) suffer from low recall and precision.
What is the novelty?
The core insight is that the web already contains a large supply of well-prepared fillable PDFs. By mining Common Crawl, the authors bootstrap training data without manual annotation:
Dataset construction via aggressive filtering. Starting from 7.9M PDFs in a Common Crawl snapshot, the pipeline filters to documents containing AcroForm or XFA form objects (~762K), then removes documents with no fields or only buttons (~59K), and finally cleans individual field annotations (removing out-of-bounds, too-small, and near-duplicate fields). The result is ~59K documents with 480K+ pages (the paper is internally inconsistent, citing both ~55K in the abstract and ~59K in the pipeline description; the detailed filtering pipeline supports ~59K).
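For intuition, here is a minimal sketch of the first filtering stage, assuming `pypdf` as the parser (the paper does not name its tooling, and the real pipeline adds field-level cleaning on top of this document-level check):

```python
from pathlib import Path

from pypdf import PdfReader

def has_form_fields(pdf_path: Path) -> bool:
    """Keep PDFs whose catalog declares an AcroForm or XFA form object."""
    try:
        root = PdfReader(str(pdf_path)).trailer["/Root"]
        acroform = root.get("/AcroForm")
        if acroform is None:
            return False
        acroform = acroform.get_object()  # resolve a possible indirect reference
        fields = acroform.get("/Fields")
        n_fields = len(fields.get_object()) if fields is not None else 0
        return n_fields > 0 or "/XFA" in acroform
    except Exception:
        return False  # unparsable PDFs are simply dropped

kept = [p for p in Path("pdfs").glob("*.pdf") if has_form_fields(p)]
```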
Form field detection as object detection. The task is cast as predicting bounding boxes and types for three classes: Text Input, Choice Button (checkboxes and radio buttons), and Signature. This is notable because commercial tools like Adobe Acrobat and Apple Preview do not detect choice buttons at all.
High-resolution matters. An ablation across input resolutions (640px to 1536px) shows a ~20 mAP point spread, with choice buttons and signatures benefiting most from higher resolution. The authors settle on 1216px as a practical tradeoff.
Filtering improves data efficiency. Training on 10K pages from the filtered set yields ~4 mAP points higher than training on 10K pages from the unfiltered set (57.9 vs. 53.6), validating the cleaning pipeline.
What experiments were performed?
All results are reported as mAP$_{50-95}$ on the CommonForms test set (25K pages, split by document to prevent leakage).
Main results (Table 3)
| Model | Text AP | Choice AP | Sig. AP | All AP |
|---|---|---|---|---|
| FFDNet-S (1216px, 9M) | 61.5 | 71.3 | 84.2 | 72.3 |
| FFDNet-L (1216px, 25M) | 71.4 | 78.1 | 93.5 | 81.0 |
Resolution ablation (Table 4)
A 6M parameter model trained on 10K pages at varying resolutions:
| Resolution | All AP |
|---|---|
| 640px | 42.7 |
| 960px | 52.8 |
| 1216px | 57.9 |
| 1536px | 62.1 |
Qualitative comparison with Adobe Acrobat (Table 2)
The authors show side-by-side examples on English and Arabic forms. Acrobat does not predict choice buttons, and exhibits lower recall (missing many fields) and lower precision (detecting table lines and separators as text fields). Both FFDNet-S and FFDNet-L detect text, choice, and signature fields with higher coverage and fewer false positives.
Cross-language and cross-domain robustness (Table 5)
FFDNet-L performs consistently across 9 of the 10 most common languages (AP 73.8 to 89.3), with degraded performance on Russian (69.2). FFDNet-S shows higher variance across languages and domains. Across 14 domains, performance ranges from 75.5 (“Other”) to 88.9 (“Real Estate”) for FFDNet-L.
Data efficiency (Section 5.4)
Training on filtered data yields 57.9 mAP vs. 53.6 from unfiltered data (same 10K page budget), a ~4 point improvement from the cleaning pipeline.
What are the outcomes/conclusions?
Key findings:
- CommonForms is (to the authors’ knowledge) the first large-scale open dataset for form field detection, with 480K pages from ~59K documents spanning 10+ languages and 14 domains.
- FFDNet-L achieves 81.0 mAP$_{50-95}$, with particularly strong signature detection (93.5 AP).
- High-resolution input is critical: a 20-point mAP spread across resolutions from 640px to 1536px.
- The aggressive filtering pipeline improves data efficiency by ~4 mAP points.
- FFDNet qualitatively outperforms Adobe Acrobat, and uniquely detects choice buttons.
- Each model costs less than $500 to train.
Limitations:
- Annotation noise. The dataset inherits inconsistencies from web-scraped forms: “For Official Use Only” sections are inconsistently fillable, signatures are sometimes implemented as text fields, and scanned pages within otherwise digital PDFs lack annotations. The authors catalog these issues but do not attempt to fix them.
- No semantic understanding. The models detect field locations and types but do not group fields with labels or understand form semantics (the “form enrichment” step).
- Qualitative-only comparison with baselines. The Acrobat comparison is qualitative (visual examples), not quantitative. There are no numeric comparisons against other open-source detectors or layout models.
- Single architecture family. Only YOLO11 variants are evaluated. No comparison with DETR, Faster R-CNN, or multimodal alternatives.
- Russian underperformance. FFDNet-S drops to 33.2 AP on Russian forms; the authors note this but do not investigate causes beyond parameter count.
- Training data processing error. A footnote in the paper acknowledges that both released models were trained on only ~350K pages rather than the full ~480K, due to a processing error. The authors state they are working on retraining.
- Single Common Crawl snapshot. The entire dataset comes from one July/August 2021 scrape, limiting temporal and source diversity.
- Unverified non-English domain labels. Domain classification used LDA + GPT-5 labeling, but only English-language topics were manually verified. Non-English domain labels may contain errors.
Future work identified by the authors: tackling the full form preparation problem (semantics), improving performance on scans and foreign-language documents via augmentation or resampling, and cleaning up annotation inconsistencies.
Reproducibility
Models
- FFDNet-S: 9M parameters per the paper, but the released HuggingFace card reports 6M parameters. The discrepancy may reflect retraining (see below). YOLO11-based, initialized from scratch (not pretrained).
- FFDNet-L: 25M parameters, YOLO11-based, initialized from scratch
- Both accept 1216px input resolution
- Training data caveat: A footnote in the paper states that both released models were trained on only ~350K pages (not the full ~480K) due to a processing error. The authors indicate they are working on retraining.
- Weights released on HuggingFace. The HuggingFace model cards do not specify a license; Apache-2.0 is inferred from a GitHub release note mentioning “Apache-licensed FFDetr” and from the dataset’s Apache-2.0 license. Verify before commercial use.
Algorithms
- Training: 300 epochs, initial learning rate 0.001
- Architecture: YOLO11 object detector (via Ultralytics)
- Classes: Text Input, Choice Button, Signature
- The paper does not report batch size, optimizer, learning rate scheduler, data augmentation, or loss function details. Ultralytics YOLO11 defaults are presumably used (SGD with momentum 0.937, cosine LR annealing, mosaic/mixup augmentation, combined box regression + classification + DFL loss).
- Hyperparameters and resolution chosen by training on subsets and observing generalization
Data
- Source: Common Crawl July/August 2021 scrape (PDF Association collection)
- Size: 7.9M PDFs filtered to ~59K documents, 480K+ pages
- Splits: Train/val/test split by document (not page); 8K validation pages, 25K test pages, remainder for training
- Annotations: Bootstrapped from existing fillable PDF form fields (AcroForm and XFA standards); no manual annotation
- Language ID: FastText per-page classification; 63.6% English, ~36% other languages
- Domain classification: LDA topic modeling (300 topics via MALLET), labels assigned by GPT-5. Only English-language topics were manually verified; non-English labels are unverified.
- Released on HuggingFace under Apache-2.0; includes per-page text, language, and domain labels
Evaluation
- Metric: mAP$_{50-95}$ (standard COCO-style average precision)
- Baselines: Only Adobe Acrobat (qualitative comparison). No quantitative comparison with other open-source detectors
- No error bars or multi-run statistics reported
- No cross-validation: single train/val/test split
Hardware
- Training: 4$\times$V100 GPUs (compute grant from LambdaLabs)
- Training time: FFDNet-L ~5 days, FFDNet-S ~2 days
- Training cost: < $500 per model
- Inference: FFDNet-L ~16ms/page, FFDNet-S ~5ms/page on a single 3090 Ti
- VRAM requirements and CPU-only feasibility are not reported
BibTeX
@article{barrow2025commonforms,
title={CommonForms: A Large, Diverse Dataset for Form Field Detection},
author={Barrow, Joe},
journal={arXiv preprint arXiv:2509.16506},
year={2025}
}
DiT: Self-supervised Pre-training for Document Image Transformer
TL;DR
DiT applies BEiT-style masked image modeling (MIM) to document images, pre-training a ViT backbone on 42 million unlabeled document pages from IIT-CDIP. A domain-specific dVAE tokenizer replaces the natural-image DALL-E tokenizer. The resulting model achieves top scores on document image classification, layout analysis, table detection, and text detection benchmarks at the time of publication.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the self-supervised pre-training strategy (MIM with a document-domain dVAE) applied to a ViT backbone for document images. The paper’s bulk consists of architectural choices, training recipes, and ablation-style comparisons across four downstream tasks.
Secondary: $\Psi_{\text{Resource}}$. The pre-trained DiT-Base and DiT-Large checkpoints are publicly released, and the domain-specific dVAE tokenizer trained on IIT-CDIP is a reusable asset.
What is the motivation?
Prior Document AI pre-training (LayoutLM, DocFormer, etc.) combines text, layout, and image modalities but still relies on supervised CNN backbones (typically ImageNet-pretrained) for the vision component. No large-scale human-labeled dataset analogous to ImageNet exists for document images, making supervised vision pre-training impractical for this domain. Weakly supervised datasets like PubLayNet and DocBank are drawn almost entirely from academic papers, creating a domain mismatch with real-world business documents (forms, invoices, receipts, reports).
DiT addresses this by bringing self-supervised pre-training to the document vision backbone, using large-scale unlabeled document images that span diverse templates and formats.
What is the novelty?
DiT adapts the BEiT framework to document images with two key modifications:
Domain-specific dVAE tokenizer. Rather than using the DALL-E tokenizer (trained on 400M natural images), the authors retrain a discrete VAE on 42M document images from IIT-CDIP. The document tokenizer produces sharper reconstructions of text boundaries and line structures compared to the DALL-E tokenizer. The tokenizer uses a codebook of size 8,192 and a downsampling factor of 8, producing a $14 \times 14$ token map from a $112 \times 112$ input crop (aligned with the $14 \times 14$ patch grid from $224 \times 224$ input at $16 \times 16$ patch size).
Masked Image Modeling (MIM) on document images. The pre-training objective masks 40% of image patches (using blockwise masking as in BEiT) and requires the model to predict the corresponding discrete visual tokens from the dVAE. The training loss is cross-entropy over the 8,192-entry codebook:
$$ \mathcal{L}_{\text{MIM}} = -\sum_{i \in \mathcal{M}} \log p(z_i \mid \tilde{x}) $$
where $\mathcal{M}$ is the set of masked positions, $z_i$ is the visual token from the dVAE, and $\tilde{x}$ is the corrupted input.
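In code, the objective is plain cross-entropy restricted to masked patch positions; a minimal PyTorch sketch (tensor shapes are illustrative assumptions, not from the paper):

```python
import torch
import torch.nn.functional as F

def mim_loss(logits: torch.Tensor, target_tokens: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
    """BEiT-style MIM loss: cross-entropy over the dVAE codebook,
    computed only at masked patch positions.

    logits:        (B, N, 8192) codebook predictions per patch position
    target_tokens: (B, N) discrete visual tokens from the dVAE
    mask:          (B, N) bool, True where the patch was masked
    """
    masked_logits = logits[mask]          # (M, 8192)
    masked_targets = target_tokens[mask]  # (M,)
    return F.cross_entropy(masked_logits, masked_targets)
```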
The architecture itself is a standard ViT (no modifications to the Transformer blocks). For object detection tasks, DiT feeds into Mask R-CNN or Cascade R-CNN via a multi-scale FPN adapter that extracts features at four resolution levels from intermediate Transformer blocks (1/3, 1/2, 2/3, and the final block), using transposed convolutions for upsampling and max pooling for downsampling.
What experiments were performed?
DiT is evaluated on four Document AI benchmarks covering classification and detection:
Document Image Classification (RVL-CDIP)
400K grayscale images across 16 document categories. Single-model accuracy comparison against CNN (ResNeXt-101) and ViT baselines (DeiT-B, BEiT-B, MAE-B), all at $224 \times 224$ resolution.
Document Layout Analysis (PubLayNet)
360K document images with 5 element types (text, title, list, table, figure). Evaluated with mAP @ IoU [0.50:0.95] using both Mask R-CNN and Cascade R-CNN detection heads.
Table Detection (ICDAR 2019 cTDaR)
Combined archival and modern document subsets. Evaluated with weighted F1 across IoU thresholds (0.6, 0.7, 0.8, 0.9):
$$ wF1 = \frac{0.6 \cdot F1_{0.6} + 0.7 \cdot F1_{0.7} + 0.8 \cdot F1_{0.8} + 0.9 \cdot F1_{0.9}}{0.6 + 0.7 + 0.8 + 0.9} $$
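The metric is simple to compute; a small sketch in Python (the example F1 values are illustrative):

```python
def weighted_f1(f1_by_iou: dict[float, float]) -> float:
    """cTDaR weighted F1: F1 at stricter IoU thresholds counts more."""
    thresholds = (0.6, 0.7, 0.8, 0.9)
    return sum(t * f1_by_iou[t] for t in thresholds) / sum(thresholds)

# Example: detection quality degrades as the IoU threshold tightens
print(weighted_f1({0.6: 1.00, 0.7: 0.95, 0.8: 0.90, 0.9: 0.80}))  # ~0.902
```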
Notable: the authors apply adaptive binarization (OpenCV) to archival images before fine-tuning, improving performance on historical documents with color backgrounds.
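This weighting choice rewards tight localization: an F1 gain at IoU 0.9 moves the score half again as much as the same gain at IoU 0.6.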
Text Detection (FUNSD)
199 scanned forms (31,485 words) with word-level bounding boxes. Evaluated with precision, recall, and F1 at IoU@0.5. Additionally tested with 1M synthetic document images for further pre-training.
Baselines: ResNeXt-101-32x8d (CNN), DeiT-B, BEiT-B, MAE-B (all ImageNet-1K pre-trained ViTs). All baselines were re-run by the authors for fair comparison.
What are the outcomes/conclusions?
DiT reports the highest scores across all four benchmarks at time of publication:
| Task | Previous Best | DiT Best | Config |
|---|---|---|---|
| RVL-CDIP (accuracy) | 91.11% (single) | 92.69% | DiT-L |
| PubLayNet (mAP) | 91.0 | 94.9 | DiT-L + Cascade R-CNN |
| ICDAR 2019 cTDaR (wF1) | 94.23 | 96.55 | DiT-L + Cascade R-CNN |
| FUNSD (F1 @ IoU 0.5) | 93.07 | 94.29 | DiT-L + synthetic data |
Key observations:
- The domain-specific dVAE tokenizer is important: visual reconstructions show much sharper text boundaries compared to the DALL-E tokenizer, particularly for fine-grained text and table borders.
- DiT benefits more from Cascade R-CNN than baselines do, suggesting the richer representations from document-specific pre-training compound with stronger detection heads.
- On the archival subset of ICDAR 2019, BEiT-B slightly outperforms DiT-B at lower IoU thresholds (0.6, 0.7). The authors attribute this to the DALL-E tokenizer being trained on color images (400M), while DiT’s tokenizer was trained on grayscale IIT-CDIP images, disadvantaging it on colorful historical documents.
- DiT-L consistently outperforms DiT-B, indicating that the self-supervised pre-training scales well with model capacity.
Limitations not discussed by the authors:
- All evaluations are on English-centric or Latin-script benchmarks. Performance on CJK, Arabic, or Indic scripts is unknown.
- The $224 \times 224$ resolution for classification is low for dense document pages. Detection tasks use higher resolution but the pre-training is still at $224 \times 224$.
- No ablation on pre-training data size or diversity; all experiments use the full 42M IIT-CDIP corpus.
- The paper does not compare against concurrent multimodal approaches (LayoutLMv2, LayoutLMv3) that use text + layout + image jointly, though this is acknowledged as a different paradigm (DiT is vision-only by design).
Reproducibility
Models
- DiT-Base: 12-layer ViT, 768 hidden size, 12 attention heads, 3072 FFN intermediate size. 87M parameters.
- DiT-Large: 24-layer ViT, 1024 hidden size, 16 attention heads, 4096 FFN intermediate size. 304M parameters.
- Pre-trained checkpoints are publicly available on HuggingFace (`microsoft/dit-base`, `microsoft/dit-large`). Fine-tuned checkpoints for RVL-CDIP classification are also on HuggingFace. Fine-tuned detection checkpoints (PubLayNet, ICDAR 2019 cTDaR) are available via the GitHub repo README.
- Detection framework: Detectron2-based Mask R-CNN and Cascade R-CNN with multi-scale FPN adapter.
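(See the cleaned checkpoint listing below.)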
Algorithms
- Pre-training: 500K steps, batch size 2048, learning rate 1e-3, 10K warmup steps, weight decay 0.05. Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$). Stochastic depth rate 0.1, no dropout. Blockwise masking at 40%.
- dVAE training: Learning rate 5e-4, minimum temperature 1e-10, 3 epochs on full IIT-CDIP. Codebook size 8192, encoder with three layers (each: 2D conv stride 2 + ResNet block), downsampling factor 8. Trained with MSE reconstruction loss + perplexity loss. Implementation based on the open-source DALL-E PyTorch codebase.
- Fine-tuning (RVL-CDIP): 90 epochs, batch size 128, learning rate 1e-3, $224 \times 224$ with RandomResizedCrop.
- Fine-tuning (PubLayNet): Batch size 16, learning rate 4e-4 (base) / 1e-4 (large). DETR-style data augmentation (random crop + multi-scale resize, shortest side 480-800px, longest side 1333px max).
- Fine-tuning (ICDAR 2019 cTDaR): Batch size 16, learning rate 1e-4 (archival) / 5e-5 (modern). Same DETR augmentation. Adaptive binarization applied to archival images.
- Fine-tuning (FUNSD): Batch size 16, learning rate 1e-4 (base) / 5e-5 (large). Anchor sizes [4, 8, 16, 32, 64] for word-level detection (vs. [32, 64, 128, 256, 512] for paragraph-level tasks).
Data
- Pre-training corpus: IIT-CDIP Test Collection 1.0 (42M document images after splitting multi-page documents). Grayscale, diverse business documents.
- dVAE training data: Same IIT-CDIP corpus.
- Downstream datasets: RVL-CDIP (400K), PubLayNet (360K), ICDAR 2019 cTDaR (1,200 train / 439 test), FUNSD (199 forms: 150 train / 49 test, 31,485 words total).
- For FUNSD, an additional 1M synthetic document images were used in extended experiments.
Evaluation
- RVL-CDIP: Overall classification accuracy.
- PubLayNet: Category-wise and overall mAP @ IoU [0.50:0.95].
- ICDAR 2019 cTDaR: Weighted F1 across IoU thresholds {0.6, 0.7, 0.8, 0.9}.
- FUNSD: Precision, recall, F1 at IoU@0.5.
- No error bars, confidence intervals, or multi-run statistics reported.
- All baselines were re-run by the authors under identical conditions.
Hardware
- Not explicitly reported. Given 500K steps at batch size 2048, the pre-training likely required multiple V100 or A100 GPUs over several days, but the paper does not provide GPU count, total GPU-hours, or training time.
BibTeX
@inproceedings{li2022dit,
title={DiT: Self-supervised Pre-training for Document Image Transformer},
author={Li, Junlong and Xu, Yiheng and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={3530--3539},
year={2022},
doi={10.1145/3503161.3547911}
}
DLAFormer: An End-to-End Transformer for Document Layout Analysis
TL;DR
DLAFormer reformulates multiple document layout analysis sub-tasks (graphical object detection, text region detection, logical role classification, reading order prediction) as relation prediction problems over a unified label space, enabling a single end-to-end DETR-based transformer to handle all tasks concurrently. It outperforms multi-branch and multi-stage baselines on both Comp-HRDoc and DocLayNet.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new architecture and training formulation. The paper introduces type-wise queries, a unified label space, and a coarse-to-fine detection strategy. The majority of the paper is devoted to describing the architecture, ablation studies, and SOTA comparisons.
Secondary: None. No new datasets, benchmarks, or model weights are released.
What is the motivation?
Prior DLA systems treat sub-tasks (graphical object detection, text region grouping, logical role classification, reading order) as separate problems, each handled by a dedicated model or branch. Multi-branch and multi-stage pipelines introduce cascading errors: mistakes in early stages propagate downstream. They also scale poorly because adding a new task requires adding a new branch or stage. The authors argue for a single unified model that handles all DLA sub-tasks simultaneously, reducing cascading errors and improving scalability.
What is the novelty?
DLAFormer introduces three key ideas:
1. Unified Label Space
Three relationship types are defined over text-lines and graphical objects:
- Intra-region relationships: link adjacent text-lines belonging to the same text region (handles text region detection)
- Inter-region relationships: link pairs of regions with logical connections, e.g., a table and its caption (handles reading order prediction)
- Logical role relationships: link each text-line to a predefined logical role query, e.g., “paragraph” or “section heading” (handles logical role classification)
These are unified into a single label matrix $M \in \mathbb{Z}^{H \times W}$, where $H$ indexes text-line and graphical object queries and $W$ indexes all queries including logical role queries. Each cell takes one of four values: no relation, intra-region, inter-region, or logical role.
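The paper gives no pseudocode for assembling $M$; here is a toy sketch under assumed index conventions (encoding values and helper signature are illustrative):

```python
import numpy as np

NO_REL, INTRA, INTER, ROLE = 0, 1, 2, 3

def build_label_matrix(n_units: int, n_roles: int,
                       intra: list[tuple[int, int]],
                       inter: list[tuple[int, int]],
                       roles: list[tuple[int, int]]) -> np.ndarray:
    """Toy construction of the unified label matrix M (H x W).

    Rows index the H text-line/graphical-object queries; the first H
    columns index those same queries, and the last n_roles columns
    index the logical-role queries.
    """
    M = np.full((n_units, n_units + n_roles), NO_REL, dtype=np.int64)
    for i, j in intra:   # adjacent text-lines in the same region
        M[i, j] = INTRA
    for i, j in inter:   # logically connected regions, e.g. table -> caption
        M[i, j] = INTER
    for i, r in roles:   # unit i carries logical role r
        M[i, n_units + r] = ROLE
    return M

# 3 text-lines, 2 roles; lines 0-1 form one region, line 2 follows in reading order
M = build_label_matrix(3, 2, intra=[(0, 1)], inter=[(1, 2)],
                       roles=[(0, 0), (1, 0), (2, 1)])
```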
2. Type-wise Queries
Standard DETR initializes content queries as learnable but semantically opaque vectors. DLAFormer replaces the binary classifier in the Deformable DETR encoder’s auxiliary detection head with a multi-class classifier. The predicted category selects a type-specific learnable embedding as the content query, giving queries a concrete “physical meaning” tied to their predicted class (table, figure, formula, etc.).
3. Unified Relation Prediction Head
A relation prediction module computes pairwise scores:
$$ f_{ij} = FC_k^q(q_i) \circ FC_k^r(q_j) $$
where $\circ$ denotes the dot product. Separate softmax normalization is applied over text-line/graphical queries ($j \le H$) and logical role queries ($H < j \le W$). A bilinear relation classification module then determines the relation type:
$$ p_{ij} = \text{BiLinear}(FC_c^q(q_i), FC_c^k(q_j)) $$
Both modules share the same enhanced query representations from the decoder, so all relations are predicted in a single forward pass.
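A rough PyTorch sketch of the two modules (dimensions follow the Reproducibility notes below; the split softmax over unit vs. logical-role columns is omitted for brevity):

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Sketch of pairwise relation scoring: dot-product scores f_ij plus a
    bilinear relation-type classifier p_ij over enhanced decoder queries."""
    def __init__(self, d_model: int = 256, d_rel: int = 1024, n_types: int = 4):
        super().__init__()
        self.fc_q = nn.Linear(d_model, d_rel)    # FC^q_k
        self.fc_r = nn.Linear(d_model, d_rel)    # FC^r_k
        self.fc_cq = nn.Linear(d_model, d_rel)   # FC^q_c
        self.fc_ck = nn.Linear(d_model, d_rel)   # FC^k_c
        self.bilinear = nn.Bilinear(d_rel, d_rel, n_types)

    def forward(self, q: torch.Tensor):
        # q: (W, d_model) -- all queries, including logical-role queries
        scores = self.fc_q(q) @ self.fc_r(q).T                 # f_ij: (W, W)
        n, d = q.shape[0], self.fc_cq.out_features
        a = self.fc_cq(q)[:, None, :].expand(n, n, d).reshape(-1, d)
        b = self.fc_ck(q)[None, :, :].expand(n, n, d).reshape(-1, d)
        types = self.bilinear(a, b).view(n, n, -1)             # p_ij: (W, W, n_types)
        return scores, types
```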
What experiments were performed?
Benchmarks
- Comp-HRDoc (HRDoc-Hard split): 1,000 train / 500 test documents. Tasks: page object detection (Segm. mAP) and reading order prediction (Text Region REDS, Graphical Region REDS).
- DocLayNet: 69,375 train / 6,489 test / 4,999 val pages across 6 document categories, 11 classes. Task: page object detection (box mAP).
Baselines
- Mask2Former (R18, R50)
- Detect-Order-Construct (DOC) (R18, R50)
- Mask R-CNN, Faster R-CNN, YOLOv5, DINO (on DocLayNet)
Key Results
Comp-HRDoc (R50):
| Task | Metric | DOC | DLAFormer |
|---|---|---|---|
| Page Object Detection | Segm. mAP | 86.5 | 89.6 |
| Reading Order (Text) | REDS | 94.4 | 96.6 |
| Reading Order (Graphical) | REDS | 88.6 | 90.0 |
DocLayNet: 83.8 mAP, up from 79.6 (DOC) and 76.8 (YOLOv5). This exceeds the reported human inter-annotator agreement range of 82-83.
Ablations
- Removing type-wise queries drops Segm. mAP by 1.4 points (89.6 to 88.2) on Comp-HRDoc. The effect on reading order is uneven: Text Region REDS is unchanged (96.6), but Graphical Region REDS drops by 0.6 points (90.0 to 89.4), suggesting type-wise queries mainly help with graphical region ordering.
- Replacing the deformable encoder/decoder with vanilla DETR encoder + DAB-DETR decoder drops Segm. mAP from 89.6 to 86.9. An intermediate configuration (DETR encoder + Deformable decoder) reaches 89.0, indicating both components contribute but the deformable encoder provides the larger gain.
What are the outcomes/conclusions?
DLAFormer demonstrates that unifying DLA sub-tasks into a single relation prediction framework is viable and outperforms multi-branch alternatives. The unified label space is the core enabling idea: it allows a single prediction head to handle detection, classification, and ordering simultaneously.
Limitations
- Per-class weaknesses on DocLayNet: DLAFormer does not uniformly dominate all categories. It scores 63.1 mAP on Footnote versus 77.2 for YOLOv5, and also trails DOC or YOLOv5 on List-item, Page-footer, Text, and Table. The authors attribute the Footnote gap to the long-tail problem: text-lines inside graphical objects are visually similar to standard text-lines, confusing the classifier. Multi-stage methods like DOC can filter these using predicted graphical object boxes first.
- Requires OCR text-line bounding boxes as input. The model is described as “purely vision-based” (no text embeddings), but it requires a PDF parser or OCR engine to provide text-line bounding boxes. This is not a zero-dependency vision model.
- Single-page only. The framework handles page-level tasks; cross-page hierarchical analysis (e.g., table of contents extraction, multi-page reading order) is left to future work.
- No code or weights released. Reproducibility is limited to the paper description.
- Logit-adjusted loss on DocLayNet. The authors use logit-adjusted softmax cross-entropy to handle the long-tail distribution, but the improvement from this specific choice is not ablated.
Reproducibility
Models
- Architecture: Deformable DETR with 3 encoder layers + 3 decoder layers.
- Hidden dimension: 256; feedforward dimension: 1,024; 8 attention heads.
- Backbone: ResNet-50 (ImageNet pre-trained) or ResNet-18.
- Multi-scale feature maps: $C_3, C_4, C_5$ (1/8, 1/16, 1/32).
- Relation prediction/classification FC layers: 1,024 nodes each.
- Top-K encoder features for query selection: 50 (Comp-HRDoc), 100 (DocLayNet).
- No weights or checkpoints released.
Algorithms
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).
- Backbone LR: $10^{-5}$; transformer LR: $10^{-4}$; weight decay: $10^{-4}$.
- Batch size: 4 per GPU (effective global batch size: 64 across 16 GPUs).
- Epochs: 40 (Comp-HRDoc), 24 (DocLayNet).
- Multi-scale training: shorter side randomly in [320, 416, 512, 608, 704, 800], longer side capped at 1,024.
- Test resolution: shorter side = 512.
- Loss: softmax cross-entropy for all relation prediction/classification heads. Logit-adjusted variant used on DocLayNet for class imbalance.
- Shared unified relation prediction head at each decoder layer during training.
- Loss weights/balancing across detection, relation prediction, and relation classification heads are not specified. Hungarian matching cost coefficients are inherited from Deformable DETR defaults but not explicitly stated.
- Implemented in PyTorch v1.11.
Data
- Comp-HRDoc: Based on HRDoc-Hard. 1,000 train + 500 test documents. Publicly available at github.com/microsoft/CompHRDoc under MIT license. OCR files with text-line bounding boxes and reading order provided. Text-lines within graphical objects are pre-filtered.
- DocLayNet: 80,863 pages total. Publicly available on HuggingFace under CDLA-Permissive-1.0. OCR files included but contain text-lines within graphical objects (a noted source of difficulty).
- No new data released.
Evaluation
- Comp-HRDoc: COCO-style Segmentation mAP for page object detection; REDS for reading order.
- DocLayNet: COCO-style box mAP.
- Baselines: DOC results replicated by the authors (marked with dagger); other DocLayNet baselines from original papers.
- No error bars, confidence intervals, or multi-run statistics reported.
- No per-class breakdown of reading order performance.
Hardware
- Training: 16 NVIDIA Tesla V100 GPUs (32 GB each).
- No training time, GPU-hours, or inference latency reported.
- No information on inference hardware requirements or throughput.
BibTeX
@inproceedings{wang2024dlaformer,
title={DLAFormer: An End-to-End Transformer For Document Layout Analysis},
author={Wang, Jiawei and Hu, Kai and Huo, Qiang},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={40--57},
year={2024},
series={Lecture Notes in Computer Science},
volume={14807},
publisher={Springer},
doi={10.1007/978-3-031-70546-5_3}
}
DocFormerv2: Local Features for Document Understanding
TL;DR
DocFormerv2 is a multimodal encoder-decoder transformer for Visual Document Understanding (VDU) that introduces two encoder pre-training tasks (Token-to-Line and Token-to-Grid) designed to encourage local semantic alignment between text, vision, and spatial modalities. Pre-trained on 64M document pages from the Industrial Document Library (IDL), it reports competitive results across eight benchmarks spanning table VQA, document VQA, entity extraction, sequence labeling, and scene-text VQA. Published at AAAI 2024.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the architecture (encoder-decoder multimodal transformer with simplified visual branch) and the asymmetric pre-training recipe (two novel encoder tasks + decoder language modeling). The paper devotes most of its pages to ablations proving these design choices drive gains.
Secondary: None. The paper does not release datasets, models, or benchmarks; the ablations serve the method rather than constituting an independent measurement contribution.
What is the motivation?
Visual Document Understanding requires reasoning over three modalities simultaneously: visual appearance, text content, and spatial layout. Prior multimodal document models were predominantly encoder-only (LayoutLM, LayoutLMv2, BROS) or used encoder-decoder architectures with only a single language modeling pre-training task (TILT). The authors argue that:
- Most VDU tasks require local, layout-relative understanding (e.g., a “1” at the top-right is a page number, while in a table it is a quantity).
- Prior approaches relied on heavyweight visual encoders (ResNeXt, Faster R-CNN, Swin) that add complexity without necessarily helping on document images, which are visually simpler than natural scenes.
- Existing pre-training objectives do not explicitly teach local multi-modal feature alignment.
What is the novelty?
Asymmetric Pre-training
DocFormerv2 uses three pre-training tasks applied asymmetrically: two on the encoder and one on the decoder.
Token-to-Line (encoder): Given two randomly selected text tokens, predict the quantized line distance between them. Labels are $\{0, 1, 2\}$: 0 for tokens on the same line, 1 for tokens on adjacent lines, and 2 for all more distant pairs. For tokens on consecutive lines $a, b, c, d$:
$$F(a, a) = 0, \qquad F(a, b) = 1, \qquad F(a, c) = F(a, d) = 2$$
This teaches the model relative positional relationships between tokens. Ablations show +2.2% on DocVQA.
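As a sketch consistent with the definition above, the Token-to-Line target for a token pair reduces to a clipped line distance:

```python
def token_to_line_label(line_i: int, line_j: int) -> int:
    """Quantized line-distance label for a token pair (sketch of the
    Token-to-Line target; line indices count rendered text lines)."""
    d = abs(line_i - line_j)
    return min(d, 2)  # 0: same line, 1: adjacent lines, 2: farther apart
```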
Token-to-Grid (encoder): The document is virtually divided into an $m \times n$ grid. Each OCR token is assigned to a grid cell based on its top-left coordinate:
$$g_i = \lfloor \frac{x_i}{\Delta_x} \rfloor + \lfloor \frac{y_i}{\Delta_y} \rfloor \cdot m$$
where $\Delta_x$ and $\Delta_y$ are cell dimensions. This pairs language semantics with spatial location. A 4$\times$4 grid was found optimal.
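And the Token-to-Grid target is a flattened cell index; a sketch assuming page coordinates and the paper's 4$\times$4 grid:

```python
def token_to_grid_label(x: float, y: float, page_w: float, page_h: float,
                        m: int = 4, n: int = 4) -> int:
    """Grid-cell index for a token's top-left corner (Token-to-Grid target).
    With a 4x4 grid, labels range over 0..15."""
    dx, dy = page_w / m, page_h / n
    col = min(int(x // dx), m - 1)  # clamp tokens on the page border
    row = min(int(y // dy), n - 1)
    return col + row * m
```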
Decoder Language Modeling: Standard T5-style denoising masked language modeling, with the additional constraint that spatial features of masked tokens are also masked, forcing the model to infer position from context.
The final pre-training loss combines all three:
$$L_{\text{final}} = k \cdot L_{\text{tol}} + l \cdot L_{\text{tog}} + m \cdot L_{\text{dlm}}$$
Simplified Visual Branch
Instead of using a pretrained CNN or object detector for visual features, DocFormerv2 uses a single convolutional layer followed by a linear projection:
$$V = \text{linear}(\text{conv}_{2 \times 2}(v))$$
with randomly initialized weights. Ablations show this outperforms Swin Transformer by +4.3% on DocVQA, which the authors attribute to documents having extensive white space that a linear downsampling layer can efficiently compress.
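A minimal PyTorch sketch of this branch (the channel count is an assumption, and the pooling down to the paper's 128 visual tokens is omitted):

```python
import torch
import torch.nn as nn

class SimpleVisualBranch(nn.Module):
    """Simplified visual branch: one strided 2x2 conv plus a linear
    projection, randomly initialized (no pretrained backbone)."""
    def __init__(self, d_model: int = 768, channels: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=2, stride=2)
        self.proj = nn.Linear(channels, d_model)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, 3, H, W) page image -> (B, H/2 * W/2, d_model) visual tokens
        x = self.conv(v)                  # (B, C, H/2, W/2)
        x = x.flatten(2).transpose(1, 2)  # (B, N, C)
        return self.proj(x)
```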
Architecture Variants
| Variant | dim | ff | Heads | Layers (E, D) | Params |
|---|---|---|---|---|---|
| Small | 512 | 2048 | 8 | 6, 6 | 66M |
| Base | 768 | 3072 | 12 | 12, 12 | 232M |
| Large | 1024 | 4096 | 16 | 24, 24 | 750M |
What experiments were performed?
DocFormerv2 is evaluated on eight datasets across four task categories:
Table VQA
- TabFact: 83.2% accuracy (+4.3% over UDOP)
Document VQA
- DocVQA: 87.84 ANLS (+0.79% over TILT, with extra VQA data)
- InfographicsVQA: 48.8 ANLS (+1.4% over UDOP)
Sequence Labeling / Entity Extraction
- FUNSD: 88.89 F1 (competitive; models using entity-box priors like LayoutLMv3 score higher but are not directly comparable)
- CORD: 97.70 F1 (competitive with UDOP which uses entity-box priors)
Scene-Text VQA (Generalization)
Fine-tuned directly from document pre-trained weights with no image-text pre-training:
- OCR-VQA: 71.5% accuracy (+3.4% over GIT)
- TextVQA: 64.0% accuracy (+2.4% over LaTr)
- ST-VQA: 71.8 ANLS (+2.2% over LaTr)
On the TextVQA validation set, DocFormerv2 (750M parameters) outperforms Flamingo (80B, +9.9%), PaLi-3B (+6.8%), and PaLi-15B (+1.5%), though those models were trained on substantially larger and more diverse data.
Ablations
- Pre-training tasks: Each task contributes cumulatively. Baseline + Vision + Line + Grid yields +4.0% on DocVQA over the baseline (small model, 1M pre-training docs).
- Pre-training data: DFv2 base outperforms LayoutLMv2 base when both use 11M pre-training documents, and improves further with 64M.
- OCR noise robustness: 20% character error rate causes only -1.68% F1 drop on FUNSD (vs. -9.84% for LayoutLMv2), demonstrating the benefit of the generative decoder.
- Vision tokens: 128 image tokens is optimal.
- Grid size: 4$\times$4 is optimal for Token-to-Grid; too fine (8$\times$8, 12$\times$12) or too coarse (4$\times$1) hurts.
- Image encoder: Linear projection outperforms Swin Transformer v2 by +4.3% on DocVQA.
What are the outcomes/conclusions?
The results suggest that local feature alignment through carefully designed pre-training tasks may matter more for VDU than complex visual encoders. The two encoder tasks (Token-to-Line, Token-to-Grid) provide complementary benefits: line-level relative positioning for nearby token reasoning, and grid-level spatial awareness for document region understanding.
Limitations:
- No code or model weights have been released (AWS AI Labs), so results cannot be independently reproduced.
- The IDL pre-training corpus (64M pages) is substantially larger than what most baselines used (11M for LayoutLMv2/v3), making direct comparisons imperfect despite the ablation controlling for data size.
- FUNSD and CORD comparisons are complicated by differing use of entity-box priors across methods.
- The paper does not report training cost, GPU hours, or inference latency, making practical deployment assessment difficult.
- Scene-text VQA results, while notable as a generalization test, compare against models not specifically designed for that domain.
Reproducibility
Models
- Three variants: small (66M), base (232M), large (750M)
- Encoder-decoder transformer with sentence-piece tokenizer
- Visual branch: single $2 \times 2$ convolution + linear projection (randomly initialized)
- Spatial features: four learnable embedding layers for $x$, $y$, height, width
- Modality embeddings for text and vision
- No weights released. No model cards or checkpoints available.
Algorithms
- Pre-training: three concurrent losses (Token-to-Line, Token-to-Grid, denoising LM) with empirically determined weights $k, l, m$ (values not specified in the paper)
- Fine-tuning: no dataset-specific hyperparameter tuning reported (authors note this may leave performance on the table)
- Optimizer, learning rate, batch size, and training schedule: not specified in the main paper or supplemental
- Maximum sequence length $s$ is used but the value is not reported
- Implementation: PyTorch + HuggingFace Transformers (no official code released; a third-party reimplementation exists under Apache-2.0 but is incomplete/WIP)
Data
- Pre-training: Industrial Document Library (IDL) from UCSF Industry Documents. 13M documents (70M pages) collected, filtered to 64M pages after cleaning and pruning ~6M documents. Data distribution referenced in supplemental (not in main paper). The IDL is publicly accessible, but the specific filtering pipeline and OCR extraction process are not described in reproducible detail.
- Fine-tuning: Standard public benchmarks (DocVQA, InfoVQA, FUNSD, CORD, TabFact, OCR-VQA, TextVQA, ST-VQA). Some experiments use an additional 850K document VQA question-answer pairs (source not specified).
Evaluation
- Metrics: ANLS (DocVQA, InfoVQA, ST-VQA), Accuracy (TabFact, OCR-VQA, TextVQA), F1 (FUNSD, CORD)
- Baselines: LayoutLMv1/v2/v3, TILT, UDOP, BROS, DocFormerv1, GIT/GIT2, PaLi, Flamingo, LaTr, and others
- FUNSD/CORD comparisons note that some baselines use entity-box priors (line-level bounding boxes during fine-tuning), which is not directly comparable
- No error bars, significance tests, or multi-run statistics reported
Hardware
- Not specified. No GPU type, count, training time, or cost information provided.
BibTeX
@inproceedings{appalaraju2024docformerv2,
title={DocFormerv2: Local Features for Document Understanding},
author={Appalaraju, Srikar and Tang, Peng and Dong, Qi and Sankaran, Nishant and Zhou, Yichu and Manmatha, R.},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={2},
pages={709--718},
year={2024},
doi={10.1609/aaai.v38i2.27828}
}
Enhancing OCR for Persian: Voting-Based Layout Analysis and Font-Size-Driven Text Line Detection
TL;DR
Fateh et al. propose an OCR pipeline for Persian newspaper documents with two main components: (1) a DLA stage that ensembles four object detectors (YOLOv3, SSD, Faster R-CNN, Layout Parser) via a pixel-level voting system with 5$\times$5 dispute resolution, and (2) a TLD stage that uses distance-transform-based “optimum font size” estimation, projection-based angle correction, and a baseline-alignment algorithm for line curvature elimination. On their new Official Iranian Newspapers (OIN) dataset (1,920 images), the pipeline reduces Tesseract-OCR 5.1.0 error rate from 6.2% to 3.4%.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the two-stage pipeline: a voting-based DLA ensemble and a heuristic TLD method built around the “optimum font size” concept. The bulk of the paper describes algorithms (7 pseudocode listings), with tables comparing against Kraken and OCRopus.
Secondary: $\Psi_{\text{Resource}}$. The authors introduce the OIN dataset (1,920 scanned newspaper images with 32,642 text lines), filling a gap in Persian-language DLA/TLD resources.
What is the motivation?
Most DLA and TLD methods are developed for Latin scripts and do not transfer well to Persian and Arabic, which have connector/non-connector letters, extensive diacritics, and frequent dots that interfere with both layout segmentation and line detection. Deep-learning-based TLD methods (Kraken, OCRopus) require large labeled training sets that are unavailable for Persian. Existing TLD approaches also struggle with curved text lines and image skew, both common in newspaper scans. The authors aim to build a pipeline that works well for Persian without requiring large labeled Persian TLD training data.
What is the novelty?
1. Multi-Model Voting for DLA
Four object detectors (YOLOv3, SSD, Faster R-CNN, Layout Parser) each produce pixel-level text/non-text predictions. A per-pixel voting system classifies each pixel by majority vote. For “disputed” pixels where text and non-text votes are tied, a 5$\times$5 window aggregates surrounding votes to break the tie. The result is a binary text/non-text segmentation map.
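A compact NumPy/SciPy sketch of the voting scheme (the 0.5 neighborhood threshold for the 5$\times$5 tie-break is an assumption):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def vote_pixels(preds: np.ndarray) -> np.ndarray:
    """Majority vote over per-model binary text masks, with a 5x5
    neighborhood vote to resolve ties.

    preds: (4, H, W) binary text/non-text maps from the four detectors
    """
    votes = preds.sum(axis=0)            # 0..4 text votes per pixel
    out = (votes > 2).astype(np.uint8)   # clear majority: 3 or 4 text votes
    disputed = votes == 2                # 2-2 tie between text and non-text
    # Average vote share in a 5x5 window around each disputed pixel
    local = uniform_filter(votes.astype(float) / 4.0, size=5)
    out[disputed] = (local[disputed] > 0.5).astype(np.uint8)
    return out
```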
2. Optimum Font Size via Distance Transform
The TLD stage computes the Euclidean distance transform of the binarized image, then identifies the most frequently occurring diameter across all connected components (CCs). This “optimum font size” serves as an adaptive scale parameter throughout the pipeline:
- CCs with fewer than $10 \times \text{OptimumFontSize}$ pixels are classified as dots/diacritics and removed before line grouping
- The search length for connecting CCs into lines is $A \times \text{OptimumFontSize}$, where $A$ is an adaptive coefficient:
$$ A = \begin{cases} \frac{60 - \text{OptimumFontSize}}{10} & \text{if OptimumFontSize} < 50 \\ 1 & \text{otherwise} \end{cases} $$
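A sketch of both heuristics, assuming SciPy's distance transform and connected-component labeling (the rounding and binning details are assumptions):

```python
import numpy as np
from scipy import ndimage

def optimum_font_size(binary: np.ndarray) -> float:
    """Most frequent CC 'diameter', estimated from the Euclidean distance
    transform (binary: 1 = ink, 0 = background)."""
    dist = ndimage.distance_transform_edt(binary)
    labels, n = ndimage.label(binary)
    # Per-CC maximum inscribed radius; diameter ~ 2 * radius
    radii = ndimage.maximum(dist, labels=labels, index=np.arange(1, n + 1))
    values, counts = np.unique(np.round(2.0 * np.asarray(radii)),
                               return_counts=True)
    return float(values[np.argmax(counts)])

def search_length(font_size: float) -> float:
    """Adaptive search length A * OptimumFontSize for linking CCs into lines."""
    A = (60 - font_size) / 10 if font_size < 50 else 1.0
    return A * font_size
```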
3. Angle Correction and Line Curvature Elimination
Angle correction uses a coarse-to-fine rotation search: the image is rotated in 1-degree steps over $[-10^\circ, 10^\circ]$, selecting the angle that maximizes near-zero rows in the horizontal projection. A second pass refines within $\pm 1^\circ$ at 0.1-degree resolution.
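A sketch of the coarse-to-fine search, assuming SciPy rotation and an illustrative "near-zero row" threshold:

```python
import numpy as np
from scipy import ndimage

def estimate_skew(binary: np.ndarray) -> float:
    """Coarse-to-fine projection-based skew search. The score counts
    near-empty rows in the horizontal projection; more empty rows means
    better line separation (the 2% threshold is an assumption)."""
    def score(angle: float) -> int:
        rotated = ndimage.rotate(binary, angle, reshape=False, order=0)
        proj = rotated.sum(axis=1)
        return int((proj <= 0.02 * proj.max()).sum())

    coarse = max(np.arange(-10.0, 10.5, 1.0), key=score)               # 1-degree steps
    fine = max(np.arange(coarse - 1.0, coarse + 1.05, 0.1), key=score)  # 0.1-degree steps
    return float(fine)
```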
For curved lines, the method extracts a baseline from the middle 60% of columns (trimming 20% from each end to avoid edge curvature), then decomposes lines into subwords using vertical projection gaps. Subwords whose local baselines deviate from the main baseline are vertically shifted to align.
What experiments were performed?
Datasets
- OIN (Official Iranian Newspapers): 1,920 scanned images at 300 dpi; 32,642 lines, 2.35M CCs. Created by the authors.
- PRImA: 478 magazine/article images (used for DLA evaluation only).
- Arabic OCR: 50 images of Arabic text with slight rotation.
- Synthetic: 962 images generated from 6 Persian fonts (Tahoma, XB-Niloofar, Ziba, IranNastaliq, Arial, Nazanin) in 5 sizes and 3 styles.
DLA Results
| Method | $\text{Acc}_{\text{text}}$ (OIN) | $\text{Acc}_{\text{nontext}}$ (OIN) | $\text{Acc}_{\text{text}}$ (PRImA) | $\text{Acc}_{\text{nontext}}$ (PRImA) |
|---|---|---|---|---|
| Layout Parser | 78.31% | 94.53% | 85.47% | 90.74% |
| SSD | 87.51% | 93.39% | 89.32% | 92.93% |
| YOLOv3 | 93.33% | 94.02% | 93.78% | 95.73% |
| Faster R-CNN | 93.67% | 94.36% | 93.53% | 95.23% |
| Voting ensemble | 94.77% | 98.04% | 96.11% | 97.96% |
TLD Results (OIN)
| Method | CC Detection Error | CC Removal Error | Total Error | Runtime (s/image) |
|---|---|---|---|---|
| OCRopus | 0.66% | 8.60% | 9.26% | 19.25 |
| Kraken | 2.14% | 0.17% | 2.32% | 16.25 |
| Proposed | 0.19% | 0.31% | 0.51% | 12.25 |
The pattern is consistent across the synthetic dataset (0.11% vs 5.32% Kraken, 10.39% OCRopus) and Arabic dataset (0.18% vs 1.29% Kraken, 4.13% OCRopus).
OCR Impact
Using the TLD step before Tesseract-OCR 5.1.0 reduced total error from 6.235% to 3.431% on OIN (a 2.8 percentage point improvement), and from 13.38% to 1.52% on the Arabic dataset.
What are the outcomes/conclusions?
The voting ensemble consistently outperforms individual detectors for DLA on both OIN and PRImA. The font-size-based TLD method outperforms Kraken and OCRopus by a wide margin, particularly on images with curved lines and the IranNastaliq font (which resembles handwriting). The combined pipeline improves downstream OCR accuracy.
Limitations
- Non-standard DLA metrics. The DLA evaluation uses pixel-level text/non-text accuracy (Equations 3-4), not standard mAP or IoU metrics. This makes comparison with the broader DLA literature difficult.
- Non-standard TLD metrics. TLD is evaluated by CC-level error rates (wrong-line assignment and undetected CCs). These are reasonable for measuring OCR impact but not directly comparable to standard segmentation metrics.
- Hardcoded thresholds. The $10\times$ factor for dot/diacritic removal, the adaptive coefficient formula, the 20% column trimming, and the $2\times$ median height threshold for detecting merged lines are all empirically tuned. No sensitivity analysis is provided.
- DLA requires running four models. The voting ensemble is inherently slow because it runs four object detectors. No runtime comparison is given for the DLA step itself (only TLD runtimes are reported).
- Limited baselines. TLD is compared only against Kraken and OCRopus. No comparison with projection-profiling, CC-based, or other heuristic methods from the related work section.
- No code released. The algorithms are described in pseudocode but no implementation is provided.
- Dataset hosting. OIN and synthetic datasets are hosted on Google Drive with no formal repository (e.g., Zenodo, HuggingFace), raising long-term availability concerns. No explicit license is stated for either dataset.
- Narrow domain. Results are demonstrated on Persian/Arabic newspaper scans. Generalization to multi-column layouts, tables, figures, or non-newspaper documents is not evaluated.
- The 2.8-point OCR improvement is hard to attribute. The headline gain on OIN (6.2% to 3.4% error) is meaningful but is measured end-to-end through Tesseract, making it hard to attribute improvements specifically to DLA vs. TLD vs. preprocessing.
Reproducibility
Models
- DLA: YOLOv3, SSD, Faster R-CNN, Layout Parser (all pre-existing; no architecture modifications described).
- TLD: Entirely heuristic (no learned parameters beyond the DLA models).
- Preprocessing: Pre-trained denoising CNN (DnCNN, Zhang et al. 2017) for image denoising; Bradley’s adaptive thresholding for binarization.
- No model weights or checkpoints released.
Algorithms
- Seven algorithms described in pseudocode (Algorithms 1-7).
- Key hyperparameters: $10\times$ optimum font size threshold for dot/diacritic removal; adaptive coefficient formula for search length; 20% column trim for baseline extraction; coarse rotation range $[-10^\circ, 10^\circ]$ at 1-degree steps, refined to $\pm 1^\circ$ at 0.1-degree steps.
- No optimizer/training details for the DLA models (they appear to use default configurations).
- No code released.
Data
- OIN: 1,920 newspaper images scanned at 300 dpi with HP Scanjet 4890. Available on Google Drive. No license specified. 70/10/20 train/val/test split.
- PRImA: 478 magazine/article images. Standard DLA benchmark. Available at primaresearch.org.
- Arabic OCR: 50 images from Google Drive. No license specified.
- Synthetic: 962 images, 6 fonts, 5 sizes, 3 styles. Available on Google Drive. No license specified.
Evaluation
- DLA: Pixel-level accuracy for text and non-text regions (Equations 3-4). Non-standard.
- TLD: CC-level error rates (Equations 5-7): CC detection error (wrong line), CC removal error (undetected), total error. Non-standard.
- OCR: Tesseract-OCR 5.1.0 error rate, broken down by dot/diacritic and main text CCs.
- No error bars, confidence intervals, or multi-run statistics reported.
- No cross-validation or statistical significance tests.
Hardware
- No hardware specifications reported.
- No training time, GPU type, or compute requirements mentioned.
BibTeX
@article{fateh2024enhancing,
title={Enhancing optical character recognition: Efficient techniques for document layout analysis and text line detection},
author={Fateh, Amirreza and Fateh, Mansoor and Abolghasemi, Vahid},
journal={Engineering Reports},
volume={6},
number={9},
pages={e12832},
year={2024},
publisher={Wiley},
doi={10.1002/eng2.12832}
}
ETD-ODv2: AI-Aided Annotation and Dataset for Layout Analysis of Long Documents
TL;DR
Ahuja et al. propose an AI-aided annotation framework that uses a pre-trained object detection model to generate weak bounding-box labels, which human annotators then verify and correct. They release ETD-ODv2, a 62K-page, 300K-annotation dataset for layout analysis of electronic theses and dissertations (ETDs), including scanned documents and targeted pages for minority element classes. Models trained on the combined dataset reach 0.886 mAP@0.50, up from 0.774 with digital-only training data.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is ETD-ODv2, a new dataset that extends ETD-OD with 16.8K scanned pages and 20.2K AI-aided annotated pages. The annotation framework is in service of producing this resource.
Secondary: $\Psi_{\text{Method}}$. The AI-aided annotation scheme (pre-trained model generates weak labels, optional class-based filtering, human verification) is a methodological contribution, though it is relatively straightforward and primarily described as a pipeline rather than an algorithmic novelty.
What is the motivation?
Layout analysis of long scholarly documents (theses, dissertations) faces three specific challenges:
- Annotation cost: Labeling bounding boxes on document pages is slow. No LaTeX source is available for scanned documents, so automatic annotation methods (used by PubLayNet, DocBank) do not apply.
- Scanned document noise: Legacy ETDs were typewritten, microfilmed, and/or scanned, introducing noise, low resolution, dilated/eroded text, and handwritten elements that cause models trained on clean digital PDFs to underperform.
- Class imbalance: Certain elements (title, author, algorithm, equation number) appear on only a handful of pages per document, leading to severe frequency skew in page-level datasets.
The prior ETD-OD dataset (Ahuja et al., 2022) addressed digital ETDs only and suffered from the imbalance problem. ETD-ODv2 is designed to fill both gaps.
What is the novelty?
AI-Aided Annotation Framework
The framework has four stages:
- Dataset sampling: Sample unlabeled ETDs from open-access institutional repositories and split into page images.
- Weak label generation: Run a YOLOv7 model (pre-trained on ETD-OD + a small scanned seed set) on each page to produce predicted bounding boxes and class labels.
- Optional filtering: Use predicted labels to select pages likely to contain minority-class elements (e.g., title, algorithm, equation number). This directly addresses class imbalance by over-sampling rare pages.
- Human verification and correction: Annotators review model predictions, keeping correct ones and fixing incorrect/missing boxes.
The key practical insight is that even a model trained on limited, imbalanced data produces enough correct predictions to substantially reduce annotation effort (2-3$\times$ speedup).
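Schematically, stages 2-3 reduce to running the seed detector and keeping pages whose weak labels hit a minority class; a sketch with a hypothetical `detect()` wrapper and illustrative class names (neither is specified in the paper):

```python
# Hypothetical detect() wraps the pre-trained YOLOv7 seed model; it returns
# a list of (class_name, box, score) tuples per page image.
MINORITY = {"title", "author", "algorithm", "equation-number"}  # illustrative names

def select_pages_for_annotation(pages, detect, score_thresh=0.5):
    """Stages 2-3 of the pipeline: generate weak labels, then keep pages
    likely to contain minority-class elements (illustrative sketch)."""
    selected = []
    for page in pages:
        weak = [(c, box) for c, box, s in detect(page) if s >= score_thresh]
        if any(c in MINORITY for c, _ in weak):
            selected.append((page, weak))  # sent on to human verification
    return selected
```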
ETD-ODv2 Dataset
The dataset extends ETD-OD with two new subsets:
- Scanned subset: ~16.8K pages from 100 scanned theses/dissertations, manually annotated by 5 undergraduates with author review. Captures noise, low resolution, and handwritten elements absent from digital ETDs.
- AI-aided subset: ~20.2K pages from ~1,200 documents (scanned + digital), filtered to over-represent 12 minority element classes, then verified/corrected by 4 annotators.
Combined with the original ETD-OD digital subset (25K pages), the full dataset contains:
- 62,043 page images
- 300,388 bounding-box annotations
- 24 object categories (metadata elements like title/author/degree, structural elements like paragraph/section/chapter title, and content elements like equation/table/figure/algorithm)
What experiments were performed?
Annotation Time Study
Four annotators each labeled ~500 pages under three conditions:
| Setting | Description |
|---|---|
| No Model | Classical annotation; no bounding boxes shown |
| AI-Aided-v1 | YOLOv7_base model generates initial boxes |
| AI-Aided-v2 | YOLOv7 fine-tuned on 10K additional AI-aided pages generates boxes |
Average time per page decreased by 2-3$\times$ with model assistance (AI-Aided-v1 vs. No Model). AI-Aided-v2 further reduced time beyond v1, confirming that better models yield more useful suggestions.
Object Detection Performance
YOLOv7 was trained on four dataset configurations and evaluated on a held-out test set of 9,353 pages (44,331 objects) spanning digital, scanned, and AI-aided sources:
| Training Data | Images | Objects | mAP@0.50 | mAP@0.50:0.95 |
|---|---|---|---|---|
| Digital only | 21,313 | 85,540 | 0.774 | 0.573 |
| Scanned only | 14,204 | 53,834 | 0.596 | 0.356 |
| Digital + Scanned | 35,517 | 139,374 | 0.832 | 0.590 |
| Digital + Scanned + AI-Aided | 52,690 | 256,057 | 0.886 | 0.655 |
Metrics: COCO-style AP@0.50 and AP@0.50:0.95.
What are the outcomes/conclusions?
Key results:
- The AI-aided annotation framework reduces per-page annotation time by 2-3$\times$ compared to manual labeling, and the benefit increases as the assisting model improves.
- Adding scanned and AI-aided data to the training set improves mAP@0.50 from 0.774 (digital-only) to 0.886, an 11.2 percentage point gain.
- Minority classes benefit the most from targeted annotation. Algorithm detection jumps from 0.368 to 0.665 AP@0.50 (a ~30 pp gain); Degree from 0.524 to 0.732 (~21 pp).
- The filtering step does not hurt majority-class performance; in fact, pages selected for minority elements also contain majority elements like paragraphs, providing additional training signal across all classes.
Limitations and open questions:
- The annotation framework relies on an existing pre-trained model. If the seed model is very poor (e.g., for a completely new document domain), the weak labels may not provide enough benefit to justify the pipeline overhead.
- Inter-annotator agreement is not reported. Although each scanned sample received a second review by an author, no formal IAA metric (e.g., Krippendorff’s Alpha) is provided.
- The test set was constructed from the same sources as the training data (ETD-OD + ETD-ODv2), so generalization to non-ETD long documents (books, legal filings, government reports) is untested.
- Only YOLOv7 was evaluated as the object detection model. The authors acknowledge this but do not test other architectures (e.g., Faster R-CNN, DETR variants) to confirm the gains are model-agnostic.
- The paper does not discuss annotation quality differences across the five undergraduate annotators or the four AI-aided annotators.
- The GitHub repository does not specify a license, making the legal terms for dataset use unclear despite the paper being published under CC-BY-NC-4.0.
Reproducibility
Models
- YOLOv7 used as the detection model throughout (both for AI-aided annotation and final evaluation).
- YOLOv7_base: trained on ETD-OD + ~2K scanned pages. Specific hyperparameters not detailed.
- YOLOv7_v2: YOLOv7_base fine-tuned on an additional 10K AI-aided pages.
- The paper’s conclusion claims “the code, datasets, and pre-trained models discussed in this paper are available at” the GitHub repository. However, the repo contains only a README.md file with a three-sentence description. No code, data files, or model weights are present. The repo was last updated October 2022 and has no license file.
Algorithms
- Training procedure details (learning rate, optimizer, epochs, augmentation) are not reported.
- The AI-aided annotation pipeline uses pylabel (an open-source Python annotation library with a Jupyter bounding-box editing widget) integrated with the pre-trained model; see the sketch after this list.
- Roboflow was used as the annotation platform for the scanned subset.
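A hedged sketch of the pre-labeling plus minority-class filtering loop, using the Ultralytics API as a stand-in for the paper's YOLOv7 + pylabel setup; the weight file, class ids, and paths are illustrative assumptions, not the authors' code:

```python
from ultralytics import YOLO

MINORITY_CLASSES = {3, 7, 11}       # hypothetical ids, e.g. algorithm, degree
model = YOLO("etd_seed_model.pt")   # hypothetical pre-trained seed detector

results = model.predict("page_0001.png", conf=0.25)
boxes = results[0].boxes
if any(int(c) in MINORITY_CLASSES for c in boxes.cls):
    # page contains minority elements -> queue its boxes for human correction
    for cls_id, (x0, y0, x1, y1) in zip(boxes.cls, boxes.xyxy.tolist()):
        print(int(cls_id), x0, y0, x1, y1)
```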
Data
- Source: ETDs from open-access US institutional repositories, uniformly sampled. Specific repositories are not enumerated.
- Scanned subset: 100 documents, ~16.8K pages. Annotated by 5 undergraduates, reviewed by an author.
- AI-aided subset: ~1,200 documents, ~20.2K pages. Filtered for 12 minority element classes. Annotated/corrected by 4 annotators.
- Digital subset: From ETD-OD (Ahuja et al., 2022), ~25K pages.
- Object categories: 24 classes compatible with ETD-OD.
- Code and data are said to be available at https://github.com/Opening-ETDs/ETD-OD. However, the repository contains only a README.md and no actual code, data, or model files. Last updated October 2022.
Evaluation
- Metrics: COCO-style mAP@0.50 and mAP@0.50:0.95 (standard).
- Test set: 9,353 pages / 44,331 objects, sampled from all three subsets (digital, scanned, AI-aided).
- Only YOLOv7 is used, limiting the generalizability of the performance comparison.
- No error bars, significance tests, or multi-run statistics are reported.
- No cross-dataset evaluation (e.g., DocLayNet, PubLayNet).
Hardware
- Not reported. No mention of GPU type, training time, or inference speed.
BibTeX
@inproceedings{ahuja2023etdodv2,
author = {Ahuja, Aman and Dinh, Kevin and Dinh, Brian and Ingram, William A. and Fox, Edward A.},
title = {A New Annotation Method and Dataset for Layout Analysis of Long Documents},
booktitle = {Companion Proceedings of the ACM Web Conference 2023},
year = {2023},
pages = {835--841},
publisher = {ACM},
doi = {10.1145/3543873.3587609},
}
LayoutLM: Pre-training of Text and Layout for Document Image Understanding
TL;DR
LayoutLM extends BERT by adding 2D position embeddings (encoding each token’s bounding box on the page) and optional image embeddings (via Faster R-CNN). Pre-trained on 11M scanned document images from IIT-CDIP with a Masked Visual-Language Model objective, it was the first model to jointly learn text and layout in a single framework. It improved form understanding (FUNSD F1: 70.72 $\rightarrow$ 79.27), receipt extraction (SROIE F1: 94.02 $\rightarrow$ 95.24), and document classification (RVL-CDIP: 93.07% $\rightarrow$ 94.42%).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The primary contribution is a pre-training architecture that extends BERT with 2D layout position embeddings and image embeddings for document image understanding. The paper is organized around the model architecture, pre-training objectives, and SOTA comparisons on three downstream tasks. Ablations over data scale, pre-training epochs, initialization methods, and modality combinations take up most of the experimental section.
Secondary: None. Code and model weights are released, but the paper frames them as supporting the method rather than as standalone contributions.
What is the motivation?
Prior approaches to document AI had two key limitations:
- Limited use of unlabeled data. Models relied on small labeled datasets without leveraging self-supervised pre-training on large-scale document collections.
- Disconnected modalities. Existing methods used either pre-trained CV models or pre-trained NLP models, but never jointly trained text and layout information together.
The authors observed that in document images, spatial position carries substantial semantic signal (e.g., a form value is usually to the right of or below its key). Standard NLP pre-training (BERT, RoBERTa) discards this spatial information entirely. LayoutLM addresses this by encoding 2D bounding box coordinates directly into the token representation.
What is the novelty?
2D Position Embedding
LayoutLM represents each token’s location using its bounding box coordinates $(x_0, y_0, x_1, y_1)$, where $(x_0, y_0)$ is the upper-left corner and $(x_1, y_1)$ is the lower-right corner. All coordinates are normalized to a $[0, 1000]$ virtual coordinate system. Four 2D position embeddings are added per token, drawn from two embedding tables: $E_X$, shared by $x_0$ and $x_1$, and $E_Y$, shared by $y_0$ and $y_1$:
$$\text{emb}_{\text{2D}} = E_X(x_0) + E_Y(y_0) + E_X(x_1) + E_Y(y_1)$$
These 2D embeddings are summed with the standard token, 1D position, and segment embeddings from BERT.
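A minimal PyTorch sketch of this embedding scheme, assuming BERT-base dimensions and the $[0, 1000]$ grid; an illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class LayoutEmbedding2D(nn.Module):
    """Sketch of LayoutLM-style 2D position embeddings."""
    def __init__(self, hidden_size=768, max_coord=1001):
        super().__init__()
        self.E_X = nn.Embedding(max_coord, hidden_size)  # shared by x0 and x1
        self.E_Y = nn.Embedding(max_coord, hidden_size)  # shared by y0 and y1

    def forward(self, boxes):
        # boxes: LongTensor of shape (batch, seq_len, 4) in the [0, 1000] grid
        x0, y0, x1, y1 = boxes.unbind(dim=-1)
        return self.E_X(x0) + self.E_Y(y0) + self.E_X(x1) + self.E_Y(y1)
```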
Image Embedding
For the image modality, each token’s bounding box crop is passed through a pre-trained Faster R-CNN (ResNet-101 backbone, pre-trained on Visual Genome) to produce a region feature vector. The $[\text{CLS}]$ token receives the whole-page image embedding as its ROI. Image embeddings are added during fine-tuning (not during pre-training in the released version).
Pre-training Objectives
Two objectives are used:
Masked Visual-Language Model (MVLM): 15% of input tokens are masked (following the BERT 80/10/10 scheme), but their 2D position embeddings are kept. The model must recover masked tokens using both textual context and layout position, bridging the two modalities.
Multi-label Document Classification (MDC): Using IIT-CDIP’s document tags, the $[\text{CLS}]$ representation is trained to predict document categories. This is optional and only applicable when document labels are available.
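A hedged sketch of the MVLM corruption step; the essential detail is that only token ids are corrupted, while the 2D bounding boxes pass through unchanged so layout signal survives masking:

```python
import torch

def mvlm_corrupt(input_ids, mask_token_id, vocab_size, p=0.15):
    # Corrupt token ids with BERT's 80/10/10 scheme; the caller leaves
    # the 2D position embeddings / boxes for masked tokens untouched.
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < p
    labels[~selected] = -100                           # loss only on selected positions
    corrupted = input_ids.clone()
    mode = torch.rand(input_ids.shape)
    corrupted[selected & (mode < 0.8)] = mask_token_id            # 80% -> [MASK]
    swap = selected & (mode >= 0.8) & (mode < 0.9)                # 10% -> random token
    corrupted[swap] = torch.randint(0, vocab_size, input_ids.shape)[swap]
    # the remaining 10% of selected positions keep the original token
    return corrupted, labels
```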
What experiments were performed?
Benchmarks
| Benchmark | Task | Metric | Domain |
|---|---|---|---|
| FUNSD | Form understanding (sequence labeling) | F1 | Noisy scanned forms (199 pages) |
| SROIE | Receipt key info extraction (sequence labeling) | F1 | Scanned receipts (973 pages) |
| RVL-CDIP | Document image classification (16 classes) | Accuracy | Diverse scanned documents (400K pages) |
Key Results
FUNSD (Form Understanding):
| Model | Modality | F1 | Params |
|---|---|---|---|
| BERT$_{\text{BASE}}$ | Text | 0.6026 | 110M |
| RoBERTa$_{\text{LARGE}}$ | Text | 0.7072 | 355M |
| LayoutLM$_{\text{BASE}}$ (11M, MVLM) | Text + Layout | 0.7866 | 113M |
| LayoutLM$_{\text{BASE}}$ (11M, MVLM+Image) | Text + Layout + Image | 0.7927 | 160M |
SROIE (Receipt Understanding):
| Model | F1 | Params |
|---|---|---|
| BERT$_{\text{LARGE}}$ | 0.9200 | 340M |
| Competition 1st place | 0.9402 | – |
| LayoutLM$_{\text{LARGE}}$ (11M, MVLM) | 0.9524 | 343M |
RVL-CDIP (Document Classification):
| Model | Accuracy | Params |
|---|---|---|
| Multimodal Ensemble | 93.07% | – |
| LayoutLM$_{\text{BASE}}$ (11M, Text+Layout) | 91.78% | 113M |
| LayoutLM$_{\text{BASE}}$ (11M, Text+Layout+Image) | 94.42% | 160M |
Scaling Behavior
The paper includes a thorough study of how pre-training data scale and epochs affect downstream performance on FUNSD. F1 increases monotonically with both more data (500K $\rightarrow$ 11M pages) and more epochs, though gains diminish at larger scales. This is among the more useful analyses in the paper, confirming that layout-aware pre-training benefits substantially from scale even on a tiny downstream dataset (149 training forms).
Initialization Ablation
RoBERTa initialization outperforms BERT initialization by 2.1 F1 points (BASE) and 1.3 F1 points (LARGE) on FUNSD. Training from scratch yields significantly worse results, confirming the value of warm-starting from a text-only pre-trained model.
What are the outcomes/conclusions?
Strengths:
- The 2D position embedding idea is simple, well-motivated, and adds minimal parameter overhead. Subsequent work adopted this design (LayoutLMv2, LayoutLMv3), and other multimodal document models incorporated similar spatial encoding approaches.
- The scaling study (Table 2) is informative: it shows clear log-linear improvement with pre-training data size, useful for practitioners deciding how much pre-training is worth the compute.
- Strong results on three diverse tasks (forms, receipts, classification) demonstrate the generality of the approach.
- The FUNSD improvement (+8.6 F1 percentage points over RoBERTa$_{\text{LARGE}}$ with a smaller model) is substantial and practically significant for form understanding workflows.
Limitations and caveats:
- Image embeddings are only used during fine-tuning, not during pre-training. The authors acknowledge this and note plans to incorporate image features into pre-training in future work (which was realized in LayoutLMv2).
- The Faster R-CNN backbone adds 47M parameters (113M $\rightarrow$ 160M for BASE) and requires an additional pre-trained object detector. This is a notable complexity and inference cost penalty.
- Only three downstream tasks are evaluated. No layout detection or reading order tasks are included.
- The MDC pre-training objective relies on IIT-CDIP metadata tags, which the authors note are noisy and inconsistent. The marginal benefit of MDC is modest (e.g., +0.73 F1 percentage points on FUNSD for the 1M/6-epoch setting).
- No error bars or multi-run statistics are reported.
- OCR quality dependency: the method assumes reasonable OCR output is available. Errors in OCR (bounding boxes or text) propagate directly into the model’s input representation. The paper does not evaluate robustness to OCR noise.
Reproducibility
Models
- LayoutLM$_{\text{BASE}}$: 12-layer Transformer, 12 heads, hidden size 768. ~113M parameters (text + layout) or ~160M with image embeddings.
- LayoutLM$_{\text{LARGE}}$: 24-layer Transformer, 16 heads, hidden size 1024. ~343M parameters.
- Initialized from BERT (base/large). RoBERTa initialization also tested and shown to be superior.
- Image encoder: Faster R-CNN with ResNet-101, pre-trained on Visual Genome. Used only at fine-tuning time.
- Pre-trained checkpoints (base, large) publicly available on Hugging Face under MIT license.
Algorithms
- Optimizer: Adam, initial LR $5 \times 10^{-5}$, linear decay schedule.
- Pre-training: Total batch size 80 across 8 GPUs. BASE takes ~80 hours/epoch on 11M images; LARGE takes ~170 hours/epoch.
- MVLM masking: 15% of tokens masked with BERT’s 80/10/10 replacement scheme. 2D position embeddings are preserved for masked tokens.
- Fine-tuning hyperparameters:
- FUNSD: 100 epochs, batch size 16, LR $5 \times 10^{-5}$.
- SROIE: specific settings not detailed in paper.
- RVL-CDIP: 30 epochs, batch size 40, LR $2 \times 10^{-5}$.
- 2D coordinates normalized to $[0, 1000]$ virtual coordinate space.
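Putting the coordinate normalization and the released checkpoints together, a short fine-tuning setup sketch; the `num_labels=7` head matches a common FUNSD BIO scheme (B/I for header, question, answer, plus O) and is our assumption, not a value from the paper:

```python
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

def normalize_box(box, width, height):
    # Map pixel coordinates into the [0, 1000] virtual grid used by LayoutLM
    x0, y0, x1, y1 = box
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=7,  # assumption: BIO tags for FUNSD's three entity types + O
)
```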
Data
- Pre-training: IIT-CDIP Test Collection 1.0. 6M+ documents, 11M scanned images. OCR re-processed with Tesseract (version not specified); results stored in hOCR format. IIT-CDIP is a gated dataset requiring a formal data use agreement through UCSF Industry Documents Library; the original NIST distribution URL is no longer active.
- FUNSD: 199 forms (149 train / 50 test). 9,707 entities, 31,485 words. Publicly available.
- SROIE: 626 train / 347 test receipts. Ground truth OCR used for experiments. Publicly available via ICDAR 2019 competition.
- RVL-CDIP: 400K images in 16 classes (320K train / 40K val / 40K test). Subset of IIT-CDIP with document-type labels. Publicly available.
Evaluation
- FUNSD: Word-level F1 on semantic entity labeling.
- SROIE: Exact-match entity F1 (Task 3: Key Information Extraction). Ground truth OCR used to isolate model contribution from OCR quality.
- RVL-CDIP: Overall classification accuracy across 16 classes.
- Baselines include BERT (base/large), RoBERTa (base/large), and image-only models (VGG-16, InceptionResNetV2, LadderNet, etc.).
- No error bars, significance tests, or multi-run statistics reported.
Hardware
- Pre-training: 8 NVIDIA Tesla V100 32GB GPUs. BASE: ~80 hours/epoch; LARGE: ~170 hours/epoch.
- No cost estimates or energy consumption figures reported.
- Inference hardware requirements not specified. The Faster R-CNN component (when image embeddings are used) adds meaningful inference latency.
BibTeX
@inproceedings{xu2020layoutlm,
title={LayoutLM: Pre-training of Text and Layout for Document Image Understanding},
author={Xu, Yiheng and Li, Minghao and Cui, Lei and Huang, Shaohan and Wei, Furu and Zhou, Ming},
booktitle={Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages={1192--1200},
year={2020},
doi={10.1145/3394486.3403172}
}
LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis
TL;DR
LayoutParser is an open-source Python library that provides a unified interface for applying deep learning models to document image analysis tasks including layout detection, OCR, and visualization. It wraps Detectron2-based Faster R-CNN and Mask R-CNN models in a model zoo pre-trained on five document layout datasets, and includes annotation tooling, data structures for layout manipulation, and a community platform for sharing models and pipelines.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the toolkit itself: an open-source library with a model zoo, annotation tools, layout data structures, OCR wrappers, and a community hub. No new model architecture, training objective, or algorithm is proposed.
Secondary: $\Psi_{\text{Impact}}$. The paper demonstrates LayoutParser in two end-to-end digitization use cases (historical Japanese financial tables and legal docket extraction), validating that the toolkit can support real document processing workflows with quantitative accuracy metrics.
What is the motivation?
Despite rapid progress in DL-based document image analysis (DIA), adoption remains hampered by three practical obstacles:
- Fragmented codebases. Models are implemented in different frameworks (TensorFlow, PyTorch) with inconsistent APIs, making it difficult for non-specialists to reuse them.
- No customization infrastructure. When pre-trained models perform poorly on a new document domain (due to domain shift), there is no streamlined path for curating training data and fine-tuning.
- Undocumented pipelines. End-to-end digitization workflows require chaining layout detection, OCR, and post-processing, but these pipelines are rarely shared or reproducible.
Existing tools at the time (OCR-D, dhSegment, Tesseract, PaddleOCR) either focus on a single task, target specific document types, or lack DL model support. General-purpose DL libraries (Detectron2, AllenNLP, HuggingFace Transformers) do not address DIA-specific concerns like layout data structures, multi-model pipelines, or document annotation workflows.
What is the novelty?
LayoutParser is not a modeling contribution; it is an engineering and ecosystem contribution. The key components are:
Model Zoo
Nine pre-trained models (Faster R-CNN and Mask R-CNN with ResNet-50/101 backbones) trained on five datasets:
| Dataset | Base Model | Large Model | Domain |
|---|---|---|---|
| PubLayNet | F / M | M | Modern scientific documents |
| PRImA | M | - | Scanned magazines and scientific reports |
| Newspaper Navigator | F | - | 20th century US newspapers |
| TableBank | F | F | Tables in scientific and business documents |
| HJDataset | F / M | - | Historical Japanese documents |
“F” = Faster R-CNN, “M” = Mask R-CNN. “Base” = ResNet-50 backbone, “Large” = ResNet-101.
Models are loaded with a semantic URL scheme:
```python
import layoutparser as lp
import cv2

image = cv2.imread("page.png")[..., ::-1]  # BGR -> RGB; "page.png" is a placeholder
model = lp.Detectron2LayoutModel("lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config")
layout = model.detect(image)  # returns an lp.Layout of detected blocks
```
Layout Data Structures
A three-level hierarchy (Coordinate, TextBlock, Layout) with transformation and spatial operations (shift, pad, scale, intersect, union, is_in, crop_image, relative_to, condition_on). These structures support hierarchical nesting and reading order specification via parent indices, and all share a consistent API.
OCR Module
Unified wrappers around Tesseract, Google Cloud Vision, and a built-in CNN-RNN model trained with CTC loss. Plug-and-play switching between engines via the same API.
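A hedged end-to-end sketch combining the data structures and the OCR wrappers; the image path is a placeholder, the label map is the canonical PubLayNet one, and Tesseract must be installed locally:

```python
import layoutparser as lp
import cv2

image = cv2.imread("page.png")[..., ::-1]  # BGR -> RGB
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)

# Filter to text regions and impose a simple top-to-bottom reading order
text_blocks = [b for b in layout if b.type == "Text"]
text_blocks = sorted(text_blocks, key=lambda b: (b.coordinates[1], b.coordinates[0]))

# Swap-in OCR via the unified agent API (requires a local Tesseract install)
ocr = lp.TesseractAgent(languages="eng")
texts = [ocr.detect(b.crop_image(image)) for b in text_blocks]
```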
Active Learning Annotation
Integration with an object-level active learning tool (OLALA) that trains a layout model alongside labeling, requiring only the most uncertain layout objects within each image to be manually annotated. The authors report that only around 60% of the labeling budget is needed compared to exhaustive annotation.
Community Platform
A model hub for sharing pre-trained weights and complete digitization pipeline descriptions, inspired by Torch Hub and TensorFlow Hub.
What experiments were performed?
LayoutParser is not benchmarked against other toolkits via controlled experiments. Instead, the paper presents two case studies demonstrating the library’s utility:
Case Study 1: Historical Japanese Document Digitization
A large-scale pipeline for extracting structured data from historical Japanese firm financial tables (vertical text, archaic fonts, noisy scans). The pipeline uses:
- Two layout models (column detection and token detection) trained on ~400 images with active learning annotation
- Custom page reorganization to densely pack detected tokens for improved OCR recall
- A self-trained CNN-RNN OCR model for specialized flat numerals
Reported accuracy: 96.97 AP for column detection (5 categories), 89.23 AP for token detection (4 categories), 0.98 Jaccard score and 0.17 average Levenshtein distance for numeral recognition.
Case Study 2: Lightweight Legal Docket Table Extraction
A simple pipeline using a pre-trained PubLayNet Mask R-CNN for table region detection, followed by rule-based column identification and row clustering from OCR token coordinates. No custom model training required.
No formal comparison against alternative tools or frameworks is provided.
What are the outcomes/conclusions?
LayoutParser demonstrated that a unified Python API could lower the barrier to building DIA pipelines for both specialists and non-specialists. The library gained broad community adoption (5,600+ GitHub stars as of early 2026) and was among the first frameworks to provide off-the-shelf DL-based layout detection with a pre-trained model zoo.
Practical impact at time of publication:
- Provided the easiest path to applying Faster/Mask R-CNN for document layout detection without writing Detectron2 configuration files.
- The model zoo offered immediate coverage across five document domains.
- The active learning annotation tool reduced labeling costs for custom domains.
Limitations:
- No novel models. All models are standard Detectron2 Faster/Mask R-CNN architectures. The toolkit adds convenience, not accuracy improvements.
- Detectron2 dependency. The library is tightly coupled to Detectron2, which constrains it to two-stage detection architectures. Newer paradigms (DETR, YOLO, transformer-based detectors) are not supported.
- No multimodal models. Text and layout features (as in LayoutLM) are not integrated. The toolkit is vision-only for layout detection.
- Maintenance status. While the repository remains unarchived, active development appears to have slowed after 2022. The DIA toolkit landscape has since shifted toward tools like Surya, DocLayout-YOLO, and PaddleX/PP-DocLayout, which offer more modern architectures and active maintenance.
- Limited OCR integration. Only Tesseract, Google Cloud Vision, and a basic CNN-RNN model are supported. No integration with PaddleOCR, EasyOCR, or VLM-based text recognition.
- No reading order. The data structures support parent indices for hierarchy, but no reading order prediction model is included.
- No quantitative toolkit comparison. The paper does not benchmark LayoutParser against alternative frameworks on setup time, lines-of-code, or accuracy parity.
Reproducibility
Models
- All models are Detectron2 Faster R-CNN or Mask R-CNN with ResNet-50 (Base) or ResNet-101 (Large) backbones and FPN necks.
- Pre-trained weights are distributed via the LayoutParser model zoo and can be loaded with the semantic URL scheme.
- No parameter counts are reported in the paper for the detection models themselves; standard Detectron2 configurations apply (ResNet-50 FPN ~41M, ResNet-101 FPN ~60M for the backbone + FPN, plus detection head).
Algorithms
- No new training algorithms are introduced. Standard Detectron2 training recipes are used for the model zoo.
- The active learning annotation loop (OLALA) is described in a separate paper (Shen et al., 2020, arXiv:2010.01762).
- The CNN-RNN OCR model uses CTC loss (Graves et al., 2006) via the im2markup architecture (Deng et al., 2017). Training details for this model are not provided.
Data
- The model zoo is trained on publicly available datasets: PubLayNet, PRImA, Newspaper Navigator, TableBank, and HJDataset.
- The historical Japanese document dataset used in Case Study 1 is not publicly released; training set size is ~400 images with approximately 100 annotations each.
- The legal docket dataset used in Case Study 2 is not described in detail.
Evaluation
- Case Study 1 reports AP scores (COCO-style) for layout detection and Jaccard/Levenshtein for OCR. No error bars or multi-run statistics.
- Case Study 2 provides qualitative results only (Figure 6).
- No systematic evaluation of the toolkit itself (e.g., setup time, developer experience, failure modes).
Hardware
- Not reported. The case studies involve Detectron2 inference, which typically requires a GPU but is feasible on consumer hardware. Training the custom models for Case Study 1 on ~400 images would be modest (single GPU, hours).
BibTeX
@inproceedings{shen2021layoutparser,
title={LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis},
author={Shen, Zejiang and Zhang, Ruochen and Dell, Melissa and Lee, Benjamin Charles Germain and Carlson, Jacob and Li, Weining},
booktitle={Document Analysis and Recognition -- ICDAR 2021},
pages={131--146},
year={2021},
publisher={Springer},
doi={10.1007/978-3-030-86549-8_29}
}
SignverOD: Signature Object Detection in Scanned Documents
TL;DR
SignverOD provides 2,576 scanned document images with 7,103 bounding box annotations across four classes (Signature, Initials, Redaction, Date). It targets a narrow but practically important slice of document layout: detecting handwritten artifacts and intentional occlusions in business documents. Licensed CC0 (public domain).
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The contribution is the dataset itself. There is no associated peer-reviewed publication; the dataset was released on Kaggle with a self-published technical report (now offline) and a companion signature verification library (SignVer).
What is the motivation?
Detecting the presence and location of handwritten artifacts (signatures, initials, dates) in scanned offline documents supports multiple downstream tasks: signature verification, document tagging, and categorization. Most existing layout analysis datasets either omit these classes entirely or include them as minor additions (e.g., IIIT-AR-13K’s Signature class accounts for fewer than 600 of its 23,000 annotations). SignverOD focuses exclusively on this subset of form-level elements, providing denser supervision for these specific classes.
The dataset also includes Redaction as a class, which is unusual. In the Matter vs. Meaning framework, redaction maps to a distinct visual primitive (Redaction) that requires its own processing logic (masking/logging rather than OCR).
What is the novelty?
The novelty is practical rather than methodological: a focused, freely licensed (CC0) collection of diverse business documents annotated specifically for form-element detection. The four classes map cleanly to the Matter vs. Meaning taxonomy:
| SignverOD Class | Matter Primitive | Meaning Role | Processing Logic |
|---|---|---|---|
| Signature | Field / Text | Signature | Verification / Matching |
| Initials | Field / Text | Signature (variant) | Verification / Matching |
| Redaction | Redaction | (none; primitive is self-describing) | Masking / Log |
| Date | Text (`is_handwritten: True`) | Value | HTR |
The source documents span memos, emails, bank cheques, lease agreements, letters, and invoices, providing more domain diversity than typical scientific-paper layout datasets.
What experiments were performed?
No experiments are reported in the dataset release. The companion SignVer library uses the dataset’s Signature annotations for a detection module (bounding box localization), followed by cleaning and metric learning for verification, but no benchmark results on the detection task are published.
What are the outcomes/conclusions?
As a dataset-only release without a formal evaluation, there are no reported metrics. The dataset’s value is as a training resource for the specific classes it covers.
Reproducibility
Models
No models released.
Algorithms
Not applicable (dataset release only).
Data
- Images: 2,576 scanned document pages.
- Annotations: 7,103 bounding boxes across 4 classes (Signature, Initials, Redaction, Date).
- Document types: Memos, emails, bank cheques, lease agreements, letters, invoices.
- Image sources:
- Tobacco800: scanned tobacco litigation documents.
- NIST Special Database 2: structured tax forms.
- Bank Cheques Dataset: cheque images with signatures.
- GSA.gov Lease Documents: U.S. government lease agreements.
- License: CC0-1.0 (public domain). Source collections are public records (Tobacco800 from tobacco litigation discovery; NIST SD-2 is a U.S. government product; GSA leases are government documents). The bank cheques dataset (BCSD) is listed as CC0 on Kaggle. Tobacco800 has no formal license declaration, though its underlying documents are public legal records.
- Format: Bounding box annotations (object detection format). Available on Kaggle (~760 MB compressed).
- Splits: Not documented in the Kaggle description. Users should verify whether predefined train/val/test splits are included.
Evaluation
No formal evaluation benchmarks or baselines reported.
Hardware
Not applicable.
BibTeX
@article{Dibia2022signverod,
author = {Dibia, Victor},
title = {A Dataset for Handwritten Signature Object Detection in Scanned Documents},
year = {2022},
publisher = {victordibia.com},
journal = {victordibia.com},
url = {https://www.kaggle.com/datasets/victordibia/signverod}
}
AnnoPage Dataset: Fine-Grained Non-Textual Element Detection in Historical Documents
TL;DR
The AnnoPage Dataset provides 7,550 pages from mostly historical documents (predominantly Czech and German, spanning 1485 to the present) annotated with axis-aligned bounding boxes across 25 fine-grained categories of non-textual elements. Annotations were created by expert librarians following the Czech Methodology of image document processing. Baseline experiments with YOLO11 and DETR models reach up to 0.658 mAP@50, suggesting the task remains challenging.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is the dataset itself: its curation pipeline, annotation methodology, category taxonomy, and public release on Zenodo. The majority of the paper describes data acquisition, the 25-category schema, annotation iterations, dataset statistics, and limitations. Baseline model experiments are secondary and serve mainly to establish reference performance.
Secondary: $\Psi_{\text{Evaluation}}$
The paper defines a standard evaluation protocol (mAP@50 and mAP@50-95) and provides baseline results from five model configurations, including confusion matrices that highlight systematic failure modes (e.g., photograph vs. image confusion). This gives future researchers a clear measurement starting point.
What is the motivation?
Existing document layout analysis datasets tend to have one or more of the following limitations:
- Few non-textual categories. Most datasets focus on textual structure (titles, headers, footers, captions) and lump visual elements into broad classes like “image” or “figure.”
- Modern documents only. Datasets like DocLayNet and PubLayNet focus on contemporary documents and miss visual elements unique to historical materials (friezes, vignettes, signets, exlibris).
- Synthetic data. Some datasets rely on synthetic generation rather than real-world scanned pages.
- Single-element focus. Several datasets target only one element type (tables, text lines).
Historical documents preserved by libraries and archives contain distinctive visual elements (decorative borders, printer’s marks, sheet music, hand-drawn maps) that are important for cultural heritage research but are absent from standard benchmarks. The authors argue that detecting and categorizing these elements is a prerequisite for downstream processing: images can be indexed for search, musical notations can be processed by OMR systems, and mathematical formulas can be parsed by recognition engines.
What is the novelty?
The core novelty is the dataset itself and its fine-grained categorization scheme. Key distinguishing features include:
25 non-textual element categories. This is considerably more granular than the typical handful of classes in existing datasets. Categories range from common elements (photographs, tables, charts, mathematical formulas) to domain-specific ones (exlibris, signets, vignettes, friezes, sheet music, chemical formulas, decorative inscriptions). The taxonomy follows the Czech Methodology of image document processing, a formally defined standard.
Expert librarian annotations. Librarians from Czech national institutions (Moravian Library, National Library of the Czech Republic, Library of the Czech Academy of Sciences) annotated the pages using Label Studio. The process involved four iterations: initial manual annotation, two rounds of model-assisted annotation (using YOLO predictions as proposals), and a final consistency verification pass where two YOLO models trained on complementary halves of the data flagged disagreements for review.
Historical document coverage. The 5,690 Czech Digital Library pages span from 1485 to the present, with the bulk from the late 19th and early 20th centuries. An additional 1,860 pages from six existing datasets (IlluHisDoc, PRImA, PRImA RDCL2019, PRImA Europeana Newspapers, ICDAR2019 cBAD, TexBiG) are re-annotated under the same schema to increase variability and maintain cross-dataset continuity.
Transparent limitations. The authors explicitly discuss category ambiguity (photograph vs. image, single vs. multiple tables, formula segmentation) and how borderline cases were resolved through annotator discussion.
What experiments were performed?
The authors train object detection baselines on the development set (6,950 pages) and evaluate on a held-out test set of 600 pages selected to match the overall category distribution.
Models
- YOLO11 variants: YOLO11n, YOLO11s, YOLO11m, YOLO11l (via the Ultralytics package), initialized from publicly available pretrained weights
- DETR (facebook/detr-resnet-50 from HuggingFace), also initialized from pretrained weights
Training setup
All models were trained for 250 epochs at a fixed resolution of $1024 \times 1024$ pixels. YOLO models used Adam with a learning rate of $2 \times 10^{-4}$. DETR used a learning rate of $5 \times 10^{-5}$.
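A minimal sketch of this recipe via the Ultralytics API; `annopage.yaml` is a hypothetical dataset config pointing at the YOLO-format ground truth, not a file shipped with the paper:

```python
from ultralytics import YOLO

model = YOLO("yolo11m.pt")  # public pretrained checkpoint
model.train(
    data="annopage.yaml",   # hypothetical dataset config (YOLO-format GT)
    epochs=250,
    imgsz=1024,
    optimizer="Adam",
    lr0=2e-4,
)
```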
Evaluation metrics
The primary metrics are mAP@50 and mAP@50-95, following standard COCO-style object detection evaluation. The IoU threshold for mAP@50 is 0.5; mAP@50-95 averages over thresholds from 0.5 to 0.95 in steps of 0.05.
Key comparisons
- Czech subset vs. full dataset: Models trained on the full dataset (including re-annotated pages from other datasets) consistently outperform those trained only on the Czech subset, confirming that the extra pages improve generalization.
- Resolution ablation: YOLO11m was additionally trained at $800 \times 800$ and $1280 \times 1280$ pixel resolutions. The $1024 \times 1024$ setting performed best (0.658 mAP@50 vs. 0.622 at 800 px and 0.651 at 1280 px).
What are the outcomes/conclusions?
Results summary
| Model | Czech mAP@50 | Czech mAP@50-95 | Full mAP@50 | Full mAP@50-95 |
|---|---|---|---|---|
| YOLO11n | 0.595 | 0.531 | 0.616 | 0.558 |
| YOLO11s | 0.567 | 0.513 | 0.641 | 0.585 |
| YOLO11m | 0.592 | 0.541 | 0.658 | 0.598 |
| YOLO11l | 0.607 | 0.547 | 0.657 | 0.602 |
| DETR | 0.438 | 0.356 | 0.458 | 0.377 |
- YOLO11m and YOLO11l achieve the best results, reaching approximately 0.66 mAP@50 and 0.60 mAP@50-95 on the full dataset. These numbers suggest significant room for improvement.
- DETR underperforms substantially (0.458 mAP@50), which the authors attribute to the relatively small dataset size, consistent with the known data-hunger of transformer-based detectors.
- Confusion matrix analysis reveals that the model most frequently confuses photographs and cartoons with the generic “image” category, and other technical drawings with schemas. Chemical formulas, exlibris, and handwritten inscriptions are frequently missed entirely (high false negative rates).
- The extra dataset pages provide a meaningful boost (roughly 2 to 7 points in mAP@50 depending on the model), validating the decision to incorporate re-annotated pages from existing datasets.
Limitations
- The 25-category scheme introduces inherent ambiguity between visually similar classes (photograph vs. image, different drawing types).
- Annotation granularity is not always consistent: whether closely adjacent tables or formula fragments should be treated as one element or several depends on context and interpretation.
- The dataset is heavily skewed toward certain categories (images, photographs, tables are common; exlibris, chemical formulas, barcodes are rare), which likely contributes to poor detection of tail categories.
- Only axis-aligned bounding boxes are provided; no polygon or pixel-level annotations.
Mapping to Matter vs. Meaning Framework
How does AnnoPage’s 25-category non-textual element taxonomy map to our Matter vs. Meaning framework? Note that AnnoPage annotates only non-textual elements; text regions are not labeled.
| # | AnnoPage Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|---|
| 1 | Chemical formula | Formula | ChemScheme | Chemical structures and reaction equations. |
| 2 | Symbol | Image | Logo | Icons, pictograms, logos, heraldic emblems. Broad category with cultural significance. |
| 3 | Exlibris | Image | Stamp | Bookplate ownership marks. |
| 4 | Photograph | Image | Figure | Photographic content. Frequently confused with Image class. |
| 5 | Geometric drawing | Image | Diagram | Points, lines, triangles, spatial forms with simple drawn lines. |
| 6 | Chart | Image | Chart | Data visualizations with coordinate systems. |
| 7 | Initial | Image | Body | Decorated drop-cap letters. Treated as Image because standard font properties cannot describe the appearance. |
| 8 | Cartoon | Image | Figure | Cartoon illustrations, caricatures, satire. |
| 9 | Map | Image | Diagram | Geographic maps. |
| 10 | Mathematical formula | Formula | DisplayEquation | Mathematical notation on separate lines. |
| 11 | Sheet music | Music | (primitive only) | Staves with musical symbols. |
| 12 | Image | Image | Figure | Generic non-photographic visual content (illustrations, engravings). Also serves as fallback. |
| 13 | Other decoration | Structure | (decorative) | Ornamental borders, frames, decorative strips. |
| 14 | Other technical drawing | Image | Diagram | Drawings, outlines, cross-sections not fitting other technical categories. |
| 15 | Decorative inscription | Image | Icon | Ornamental text rendered as artwork (not OCR-readable). |
| 16 | Technical drawing | Image | Diagram | Floor plans, top-down building projections. |
| 17 | Barcode | OpticalCode | Barcode1D / Barcode2D | Machine-readable codes for library collection management. |
| 18 | Stamp | Image | Stamp | Provenance marks indicating ownership. Can be textual or graphic. |
| 19 | Advertisement | Image | Figure | Visual promotions framed as separate graphic units. |
| 20 | Handwritten inscription | Text | Annotation | Marginalia, glosses, provenance notes. `is_handwritten: True`. |
| 21 | Schema | Image | Diagram | Simplified representations of systems or processes. |
| 22 | Signet | Image | Stamp | Printer’s marks, publisher’s devices. |
| 23 | Table | Table | (primitive only) | Structured data in rows and columns. |
| 24 | Vignette | Image | Icon | Ornamental book decoration, commonly on title pages. |
| 25 | Frieze | Structure | (decorative) | Elongated horizontal decoration at page tops. |
Coverage strengths: AnnoPage is unusually rich in historical-document-specific categories (exlibris, signets, friezes, vignettes, coat of arms) that no other DLA dataset covers. It also distinguishes chemical formulas from mathematical formulas and includes music notation.
Coverage gaps: No text-region annotations at all (the dataset is non-textual elements only). No SectionHeader, Caption, PageHeader, PageFooter, ListItem, or form primitives. By design, this dataset complements rather than replaces general-purpose DLA datasets.
Reproducibility
Models
- YOLO11 variants (n/s/m/l) from the Ultralytics package, initialized from publicly available pretrained checkpoints. No custom architectural modifications are described.
- DETR uses the `facebook/detr-resnet-50` pretrained model from HuggingFace. No modifications reported.
- No model weights from the baseline experiments appear to be released.
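For the DETR baseline, a hedged loading sketch; re-heading the checkpoint for AnnoPage's 25 categories is our assumption about how the adaptation was done, not a detail stated in the paper:

```python
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=25,                 # AnnoPage's 25 categories (our assumption)
    ignore_mismatched_sizes=True,  # drop the 91-class COCO head and re-initialize
)
```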
Algorithms
- Optimizer: Adam for all YOLO variants; the optimizer for DETR is not explicitly named, only the learning rate ($5 \times 10^{-5}$) is given.
- Learning rate: $2 \times 10^{-4}$ for YOLO, $5 \times 10^{-5}$ for DETR. No schedule (warmup, decay, cosine annealing) is described for either framework.
- Batch size: Not reported for any model.
- Epochs: 250 for all models. Model selection criterion (best validation mAP checkpoint vs. final epoch) is not specified beyond “evaluated the best-performing models on the test set.”
- Input resolution: $1024 \times 1024$ pixels (with an ablation at 800 and 1280 for YOLO11m).
- Data augmentation: Not described beyond what Ultralytics and HuggingFace DETR provide by default.
- Loss functions: Not specified; presumably the default losses for each framework (YOLO composite box/class/DFL loss; DETR bipartite matching with Hungarian algorithm).
Data
- Source: 5,690 pages from the Czech Digital Library (publicly accessible), plus 1,860 pages from six existing datasets (IlluHisDoc, PRImA, PRImA RDCL2019, PRImA Europeana Newspapers, ICDAR2019 cBAD, TexBiG).
- Total annotations: 27,904 elements across 6,726 pages with at least one element, plus 824 empty pages.
- Split: Development set (6,950 pages) and test set (600 pages from the Czech subset, containing 1,938 elements). The authors further split the development set into training (6,441 pages) and validation (509 pages, 1,926 elements), selecting validation pages to match the overall category distribution. Test set category distribution was similarly selected to match the overall dataset distribution.
- Annotation process: Expert librarians using Label Studio across four iterations (manual, two model-assisted rounds, one consistency-verification round). Annotations follow the Czech Methodology of image document processing. Inter-annotator agreement metrics are not reported.
- Public availability: The dataset is publicly available on Zenodo under CC-BY-4.0. Ground truth is provided in YOLO format. Pages from external datasets must be obtained separately from their original sources.
Evaluation
- Metrics: mAP@50 and mAP@50-95 (COCO-style). mAP@50 uses a single IoU threshold of 0.5; mAP@50-95 averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Column-wise normalized confusion matrices are also provided for YOLO11l, offering per-category diagnostic detail.
- Baselines: Five detector configurations (four YOLO11 variants, one DETR). Comparisons are internal (Czech subset vs. full dataset); no cross-dataset comparisons with other layout analysis benchmarks are reported.
- Limitations acknowledged: Category ambiguity (photograph vs. image, table granularity, formula segmentation), class imbalance (tail categories have very few examples), and axis-aligned bounding boxes only (no polygons or pixel masks).
- Statistical rigor: No error bars, multiple-run statistics, or seed sensitivity analysis are reported. Results appear to be from single training runs. No significance tests are provided.
Hardware
- Training and inference hardware are not specified in the paper.
- No GPU-hour estimates, memory requirements, or latency measurements are reported.
- No cost estimates or deployment considerations are discussed. Given the use of standard YOLO11 and DETR architectures, reproducing baselines should be feasible on a single modern GPU, but specifics are left to the reader.
BibTeX
@misc{kiss2025annopage,
title={AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization},
author={Martin Kišš and Michal Hradiš and Martina Dvořáková and Václav Jiroušek and Filip Kersch},
year={2025},
eprint={2503.22526},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
ENP Image and Ground Truth Dataset of Historical Newspapers
TL;DR
The ENP dataset contains 528 historical newspaper page images from digitization projects of 12 national and major European libraries, spanning the 17th to mid-20th century across 13 languages. Each page is accompanied by comprehensive ground truth in PAGE format, including precise region outlines, type labels, Unicode-encoded full text, and reading order. The authors also provide baseline OCR evaluations using ABBYY FineReader Engine 11 and Tesseract 3.03.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is the dataset itself: a large, representative, publicly available collection of historical newspaper images with detailed ground truth. The paper devotes most of its space to describing how the dataset was created, what it contains, and how it is hosted and accessed.
Secondary: $\Psi_{\text{Evaluation}}$
The second part of the paper establishes a baseline evaluation of two OCR systems (ABBYY FineReader 11 and Tesseract 3.03) on the dataset, using both text-based and scenario-based layout analysis performance measures. This evaluation component, while not the primary contribution, demonstrates the dataset’s utility and provides reference performance numbers for future comparisons.
What is the motivation?
Meaningful progress in document analysis for historical materials requires representative datasets with detailed ground truth. Historical newspapers present particular challenges: low print quality, large image sizes, complex multi-column layouts, mixed typefaces (including Gothic/Fraktur), and diverse languages.
At the time of publication, no comparable dataset existed. The closest prior work, the IMPACT dataset, focused primarily on books rather than newspapers. The Europeana Newspapers Project (ENP), which processed over 10 million newspaper pages for The European Library and Europeana, provided both the resources and the institutional collaboration needed to assemble a representative collection from 12 European libraries.
The authors aimed to build a dataset that would be:
- Realistic: Reflecting the actual distribution of documents in library collections.
- Comprehensive: Including metadata, detailed ground truth (layout, text, reading order).
- Flexibly structured: Supporting searching, browsing, grouping, and direct access from external systems.
What is the novelty?
1. Scale and representativeness for historical newspapers:
The dataset includes 528 pages from 12 European libraries, covering 13 languages (Dutch, English, Estonian, Finnish, French, German, Latvian, Polish, Russian, Serbian, Swedish, Ukrainian, Yiddish) and publication dates from the 17th century through 1950. The distribution includes grayscale (47%), bitonal (29%), and colour (24%) images, all stored as lossless TIFF at 300 or 400 dpi. Pages average over 383 text lines each.
2. Comprehensive ground truth in PAGE format:
Every page has ground truth in PAGE (Page Analysis and Ground Truth Elements) format, including:
- Precise region outlines (polygon boundaries)
- Region type labels
- Full text in Unicode (including special characters and ligatures)
- Reading order
The ground truth covers 61,619 regions total, of which 46,889 are text regions containing 202,524 text lines, along with 1,497 images/graphics and 208 tables.
3. Searchable online repository with web services:
The dataset is hosted through a web-based repository with metadata search, random subset generation, an interactive ground truth viewer/editor rendered in the browser, and a direct-access API for integration with external OCR workflows and evaluation tools.
4. Scenario-based OCR baseline:
The authors evaluate two OCR systems using scenario-based profiles that weight different error types according to intended use cases (keyword search, phrase search, content structure access, print/ebook on demand, content-based image retrieval).
What experiments were performed?
The authors evaluated ABBYY FineReader Engine 11 and Tesseract 3.03 in two settings:
Text-based evaluation
A Bag of Words analysis was used rather than word accuracy, since word accuracy becomes unreliable for complex layouts with non-linear reading order. Results for FineReader across font types:
- Normal (Antiqua) fonts: 81.4% success rate
- Gothic (Fraktur) fonts: 67.3% success rate
- Mixed fonts: 64.0% success rate
The authors note that the Gothic performance was substantially better than prior expectations, attributable to ABBYY’s Fraktur module developed during the IMPACT project.
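The paper does not reproduce the exact Bag-of-Words formula; one plausible formulation, sketched here for orientation only (the authors' definition may differ in tokenization and weighting):

```python
from collections import Counter

def bow_success_rate(gt_text: str, ocr_text: str) -> float:
    # Fraction of ground-truth word occurrences recovered by the OCR
    # output, ignoring word order entirely (hence robust to non-linear
    # reading order in multi-column newspaper layouts).
    gt, ocr = Counter(gt_text.split()), Counter(ocr_text.split())
    matched = sum(min(n, ocr[w]) for w, n in gt.items())
    return matched / max(1, sum(gt.values()))
```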
Scenario-based layout analysis evaluation
Five use scenarios were defined based on input from content holders and users:
- Keyword search in full text
- Phrase search in full text
- Access via content structure
- Print/eBook on demand
- Content-based image retrieval
Each scenario defines an evaluation strategy with specific weights for segmentation errors (merge, split, miss, false detection), misclassification errors, and reading order errors. Results varied considerably across scenarios; neither system dominated every use case, and Tesseract outperformed FineReader under some scenario profiles.
Error breakdowns for FineReader showed different compositions of merge, split, partial miss, miss, false detection, and misclassification errors depending on the scenario.
What are the outcomes/conclusions?
The ENP dataset fills a clear gap in available resources for historical newspaper analysis. The key outcomes are:
- A publicly available, representative dataset: 528 pages with comprehensive ground truth, free of charge for researchers.
- Practical cost insights: Ground truth creation required several hundred person-hours across multiple organizations. The manual correction cost alone was approximately 15,000 euros, underscoring the expense of high-quality annotation (targeting 99.95% accuracy).
- OCR baselines reveal remaining challenges: Gothic and mixed-font documents remain substantially harder than normal text for commercial OCR. Scenario-based evaluation reveals that no single system dominates across all use cases.
- Infrastructure for reproducible evaluation: The online repository, with searchable metadata, random subset generation, and API access, supports reproducible experiments and integration with external tools.
Limitations
- The dataset is limited to 528 pages, constrained by the available budget for ground truth production.
- Copyright restrictions reduced the original 600-page collection.
- The evaluation covers only two OCR systems available at the time (2015); more recent systems are not benchmarked.
- The paper does not report inter-annotator agreement metrics, though a three-stage quality assurance process is described.
Mapping to Matter vs. Meaning Framework
ENP uses the PAGE-XML format (see PRImA notes for the full PAGE-XML region mapping). The dataset’s 61,619 annotated regions break down into three broad types that map to the Matter vs. Meaning framework as follows:
| ENP Region Type | Count | Visual Primitive | Logical Role | Notes |
|---|---|---|---|---|
| TextRegion | 46,889 | Text | Body / SectionHeader / Caption / etc. | Role depends on PAGE-XML type attribute (paragraph, heading, caption, etc.). See PRImA mapping for full subtype breakdown. |
| ImageRegion / GraphicRegion | 1,497 | Image | Figure | Photographs, illustrations, advertisements in newspaper pages. |
| TableRegion | 208 | Table | (primitive only) | Tables in newspaper content (rare). |
| SeparatorRegion | (included in total) | Structure | (none) | Column separators, rules between articles. Particularly important for newspaper layouts. |
Structural notes: ENP provides reading order annotations via PAGE-XML’s OrderedGroup/UnorderedGroup elements, which is critical for multi-column newspaper layouts where the visual reading path is non-trivial. The dataset also includes full Unicode text content at the text-line level (202,524 lines total), making it useful for OCR evaluation beyond layout detection.
Coverage strengths: The multilingual coverage (13 languages) and diachronic range (17th-20th century) give ENP broad coverage that is uncommon among historical newspaper datasets. The PAGE-XML format preserves the logical vs. physical distinction that the M-vs-M framework formalizes.
Coverage gaps: No explicit classes for Formula, Music, OpticalCode, or any form primitives. Newspaper-specific elements like JumpLine, Flag, or Advertisement are not separated from generic TextRegion or ImageRegion.
Reproducibility
Models
Not applicable. This is a dataset paper; no models are introduced.
Algorithms
No new algorithms are proposed. The paper evaluates ABBYY FineReader Engine 11 and Tesseract 3.03 using their existing capabilities:
- FineReader configuration: Run in production workflow mode with three font profiles (Gothic/Fraktur, Normal/Antiqua, Mixed). The Fraktur module was developed as part of the IMPACT project.
- Tesseract configuration: Version 3.03 was used, but the paper does not specify which language models or configuration parameters were applied.
- Ground truth pre-processing: Preliminary OCR output in PAGE format was generated using ABBYY FineReader Engine 10 (not 11) and provided to service providers as a starting point for ground truth correction.
- Scenario-based evaluation profiles: The methodology for weighting segmentation, classification, and reading order errors per scenario follows Clausner et al. (ICDAR 2011), but the exact weight values used in this paper are not reproduced in the text.
Data
- Source: 12 national and major European libraries participating in the Europeana Newspapers Project.
- Size: 528 page images (reduced from an initial collection of 600 due to copyright restrictions). Originally, up to 50 images per institution were targeted.
- Format: Lossless TIFF images at 300 or 400 dpi. Ground truth in PAGE XML format.
- Languages: 13 (Dutch, English, Estonian, Finnish, French, German, Latvian, Polish, Russian, Serbian, Swedish, Ukrainian, Yiddish). German pages (169) are the most represented.
- Time period: 17th century through 1950, with the majority from the 19th century (187 pages) and early-to-mid 20th century (323 pages).
- Ground truth contents: 61,619 regions, 46,889 text regions, 202,524 text lines, 1,497 images/graphics, 208 tables.
- Quality target: 99.95% accuracy, enforced through a three-stage quality assurance process (automated validation, manual layout inspection, text verification by contributing libraries).
- Annotation process: Semi-automated. Service providers received preliminary OCR output in PAGE format (generated using ABBYY FineReader Engine 10) and a customized version of the Aletheia ground truth editor. They could correct or discard the preliminary output.
- Image characteristics: Grayscale (47%), bitonal (29%), and colour (24%). Common page characteristics include multi-column layout (487 pages), text/image overlap, skew, faint print, and bleed-through.
- Ground truth production cost: Approximately 15,000 euros for manual correction alone, with several hundred person-hours across multiple organizations for the full pipeline (selection, pre-processing, correction, verification, ingestion, categorization).
- Public availability: Free of charge for researchers at primaresearch.org/datasets/ENP. Access requires authentication through a permissions management system; some materials may have institution-specific access agreements imposed by the contributing libraries.
- License: Not explicitly stated in the paper. Access is described as “free-of-charge for researchers” but specific licensing terms are not provided.
Evaluation
- Text metric: Bag of Words success rate (chosen over word accuracy to avoid ambiguity from non-linear reading order in complex newspaper layouts). Only FineReader results are broken down by font type (Gothic 67.3%, Normal 81.4%, Mixed 64.0%); per-font Tesseract text accuracy numbers are not reported.
- Layout metric: Scenario-based evaluation using weighted error types (merge, split, miss, false detection, misclassification, reading order) with five predefined use-case profiles (keyword search, phrase search, content structure access, print/ebook on demand, content-based image retrieval). The evaluation methodology and weighting scheme follow Clausner et al. (ICDAR 2011), but the specific weight values applied are not reproduced in this paper.
- Baselines: ABBYY FineReader Engine 11 and Tesseract 3.03. The paper does not specify whether the full dataset (all 528 pages) or a subset was used for evaluation.
- Limitations acknowledged: The dataset is constrained by budget (528 pages from an initial 600); copyright restrictions forced removal of some pages; only two OCR systems from 2015 are benchmarked; no inter-annotator agreement metrics are reported (though a three-stage QA process was used).
- Statistical rigor: No error bars, confidence intervals, significance tests, or multi-run statistics are reported. Results are presented as single aggregate success rates per font type or scenario.
Hardware
Not reported. The paper does not discuss computational requirements for the OCR evaluations.
BibTeX
@inproceedings{clausner2015enp,
title={The ENP Image and Ground Truth Dataset of Historical Newspapers},
author={Clausner, Christian and Papadopoulos, Christos and Pletschacher, Stefan and Antonacopoulos, Apostolos},
booktitle={2015 13th International Conference on Document Analysis and Recognition (ICDAR)},
pages={931--935},
year={2015},
organization={IEEE},
doi={10.1109/ICDAR.2015.7333898}
}
RanLayNet: A Synthetic Dataset for Document Layout Detection via Domain Adaptation
TL;DR
RanLayNet is a synthetic document layout dataset generated by randomly cropping layout elements (text, title, list, table, figure) from PubLayNet and compositing them onto blank canvases with automatically generated bounding box annotations. The authors show that fine-tuning YOLOv8 on a source dataset followed by further fine-tuning on RanLayNet improves cross-domain table detection performance on DocLayNet compared to using the source dataset alone.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ : the headline contribution is the RanLayNet dataset itself, a synthetically generated document layout dataset with automatic annotations. The paper dedicates substantial space to the dataset construction pipeline, class distribution, and its role as a reusable training resource.
Secondary: $\Psi_{\text{Method}}$ : the paper frames the dataset within a domain adaptation workflow and evaluates a two-stage fine-tuning strategy (source dataset $\rightarrow$ RanLayNet $\rightarrow$ target inference). However, the fine-tuning procedure itself is standard YOLOv8 training with no novel algorithmic contribution.
What is the motivation?
Document layout analysis models trained on specific datasets often fail to generalize across domains. A model trained on scientific papers (PubLayNet) struggles with financial documents, manuals, or legal texts. This domain gap is a well-known problem, and collecting large-scale annotated data for every new domain is expensive and time-consuming.
The authors identify two practical issues:
Limited layout diversity in existing datasets. Datasets like PubLayNet contain documents from a single domain (biomedical papers), which share a relatively uniform structure. Models trained on them can overfit to that structure.
Costly annotation for new domains. Manually annotating document layouts for every target domain is impractical at scale. Synthetic data generation offers a path to broader layout diversity without manual labeling.
The core idea is that exposing models to highly varied (even noisy) layout configurations during training can improve their ability to generalize to unseen document types.
What is the novelty?
The main contribution is the RanLayNet dataset and its generation pipeline. The approach works as follows:
- Crop extraction. Individual layout elements (text blocks, titles, lists, tables, figures) are cropped from PubLayNet images using existing bounding box annotations.
- Random compositing. Cropped elements are randomly placed onto blank white canvases. A `smart_plot` function determines positions and gaps to avoid overlap while allowing diverse spatial arrangements (see the sketch after this list).
- Automatic annotation. Bounding box coordinates are tracked during compositing, producing labels in a standard format without any manual annotation.
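A minimal sketch of the compositing step, assuming PIL-style images and rejection sampling for overlap avoidance (the actual `smart_plot` logic lives in the authors' repository and may differ):

```python
from PIL import Image
import random

CANVAS_SIZE = (1024, 1024)  # assumed canvas size; not stated in the paper

def composite_page(crops, max_tries=50):
    """Place pre-cropped layout elements on a blank canvas, tracking boxes.

    crops: list of (PIL.Image, class_name) pairs cut from PubLayNet pages.
    Returns the composite image and (class_name, x1, y1, x2, y2) labels.
    """
    canvas = Image.new("RGB", CANVAS_SIZE, "white")
    placed, labels = [], []
    for crop, cls in crops:
        w, h = crop.size
        if w > CANVAS_SIZE[0] or h > CANVAS_SIZE[1]:
            continue  # skip crops that cannot fit on the canvas
        for _ in range(max_tries):
            x = random.randint(0, CANVAS_SIZE[0] - w)
            y = random.randint(0, CANVAS_SIZE[1] - h)
            box = (x, y, x + w, y + h)
            # Rejection sampling: keep only positions that do not overlap
            # any already-placed element.
            if all(box[2] <= b[0] or b[2] <= box[0] or
                   box[3] <= b[1] or b[3] <= box[1] for b in placed):
                canvas.paste(crop, (x, y))
                placed.append(box)
                labels.append((cls, *box))
                break
    return canvas, labels
```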
The resulting dataset contains $\sim$209,000 element instances across five classes:
| Class | Count | Percentage |
|---|---|---|
| Text | 95,227 | 45.52% |
| Title | 45,306 | 21.65% |
| List | 23,090 | 11.03% |
| Table | 22,146 | 10.58% |
| Figure | 23,493 | 11.22% |
The authors frame this as “noise labeling”: the synthetic images have layouts that do not correspond to any real document structure, which they argue reduces bias toward any specific domain.
What experiments were performed?
All experiments use YOLOv8 as the detection model. The experimental design has two parts:
In-domain training
The authors first train YOLOv8 on three individual datasets to establish baselines:
- PubLayNet (5 classes: text, title, list, table, figure)
- IIIT-AR-13K (5 classes: table, figure, natural image, signature, logo)
- RanLayNet (5 classes, same as PubLayNet)
They also train on RanLayNet using pretrained weights from PubLayNet and IIIT-AR-13K, respectively.
Cross-domain table detection
The key evaluation tests cross-domain transfer. Models are evaluated on DocLayNet documents from four domains: Manuals, Financial Documents, Laws & Regulations, and Scientific Documents. The focus is on detecting the “Table” class specifically.
Two model variants are compared per source dataset (four configurations in total):
- Source dataset only (PubLayNet or IIIT-AR-13K)
- Source dataset + RanLayNet fine-tuning
Metrics reported are Precision, Recall, mAP@50, and mAP@95 (referred to as “mAP95” in the paper, likely meaning mAP@[0.5:0.95]).
Training details
- Epochs: 40
- Batch size: 16
- Optimizer: SGD
- Learning rate: $1 \times 10^{-3}$
- Hardware: NVIDIA RTX A6000 GPU, 512 GB RAM
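In Ultralytics terms, the two-stage recipe reduces to two consecutive `train` calls on the same model object. A sketch under the stated hyperparameters; the YOLOv8 variant and the dataset YAML names are assumptions, since the paper specifies neither:

```python
from ultralytics import YOLO

# Stage 1: train on the source dataset (PubLayNet or IIIT-AR-13K in YOLO format).
model = YOLO("yolov8s.pt")  # variant is a guess; the paper only says "YOLOv8"
model.train(data="publaynet.yaml", epochs=40, batch=16,
            optimizer="SGD", lr0=1e-3, workers=4)

# Stage 2: continue fine-tuning the same weights on RanLayNet.
model.train(data="ranlaynet.yaml", epochs=40, batch=16,
            optimizer="SGD", lr0=1e-3, workers=4)

# Cross-domain evaluation on a DocLayNet domain subset.
metrics = model.val(data="doclaynet_manuals.yaml")
print(metrics.box.map50, metrics.box.map)  # mAP@50, mAP@[0.5:0.95]
```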
What are the outcomes/conclusions?
The main finding is that adding RanLayNet fine-tuning consistently improves cross-domain table detection compared to using the source dataset alone.
IIIT-AR-13K + RanLayNet vs. IIIT-AR-13K alone (Table class on DocLayNet):
- Scientific Documents: mAP@95 improves from 0.002 to 0.398
- Financial Documents: mAP@95 improves from 0.001 to 0.187
- Manuals: mAP@95 improves from 0.026 to 0.226
- Laws & Regulations: mAP@95 improves from 0.026 to 0.222
PubLayNet + RanLayNet vs. PubLayNet alone (Table class on DocLayNet):
- Manuals: mAP@95 improves from 0.220 to 0.588
- Scientific Documents: mAP@95 improves from 0.376 to 0.588
- Financial Documents: mAP@95 improves from 0.163 to 0.293
- Laws & Regulations: mAP@95 improves from 0.194 to 0.282
The improvements are more dramatic when starting from IIIT-AR-13K, which is expected since IIIT-AR-13K’s label set (annual report graphics) is more distant from the DocLayNet target than PubLayNet’s.
Limitations and observations:
- The evaluation focuses exclusively on the Table class. Performance on other layout elements is not evaluated in the cross-domain setting.
- The paper does not compare against other domain adaptation or synthetic data baselines.
- The paper does not ablate the dataset generation choices (e.g., random placement strategy, dataset size, class balance).
- The “noise labeling” framing is informal; there is no formal analysis of how the noise characteristics affect learning.
- The absolute mAP@95 scores remain modest (below 0.6 in all cases), suggesting substantial room for improvement.
- The paper reports the same mAP@95 values (0.562, 0.807, 0.761, 0.588) for both the Manuals and Scientific Documents rows in Table 3 (PubLayNet + RanLayNet), which appears to be a data entry error.
Reproducibility
Models
- Architecture: YOLOv8 (Ultralytics). The specific variant (n/s/m/l/x) is never stated; the paper simply says “YOLOv8.” This omission makes exact reproduction difficult, as parameter counts range from $\sim$3M (nano) to $\sim$68M (extra-large).
- Pretrained initialization: Not explicitly stated. Standard Ultralytics YOLOv8 initializes from COCO-pretrained weights by default, but the paper does not confirm or deny this.
- Input resolution: Not specified. YOLOv8 defaults to 640$\times$640, but the paper does not confirm the image size used.
- No model weights released. The GitHub repository does not include trained checkpoints.
- Two-stage training: Models are first trained on a source dataset (PubLayNet or IIIT-AR-13K), then further fine-tuned on RanLayNet. The appendix training curves (Figures 4, 5, 6) confirm convergence across all training runs.
Algorithms
- Optimizer: SGD with learning rate $1 \times 10^{-3}$. Momentum and weight decay are not reported; YOLOv8 defaults are momentum=0.937 and weight_decay=0.0005, but there is no confirmation the defaults were used.
- Batch size: 16, with `num_workers` set to 4.
- Epochs: 40 for all experiments (both source and RanLayNet fine-tuning stages).
- Learning rate schedule: Not described. YOLOv8 uses a cosine annealing schedule by default with a configurable warmup period, but the paper does not confirm this.
- Loss function: Not explicitly named. The appendix training curves plot box_loss, cls_loss, and dfl_loss, confirming the standard YOLOv8 detection losses (bounding box regression, classification, and distribution focal loss).
- Data augmentation: No augmentation details are provided beyond whatever YOLOv8 applies by default (mosaic, mixup, HSV jitter, etc.). It is unclear whether default augmentations were kept, modified, or disabled.
Data
- RanLayNet: Generated from PubLayNet crops. The total number of composite images is never stated; the paper reports only $\sim$209,000 element instances across five classes (Text, Title, List, Table, Figure). The GitHub repository includes a CSV file (`raylaynet_dataframe.csv`) and generation scripts (the `smart_plot` function).
- Train/val splits: Not specified for any dataset in the experimental setup. It is unclear what portion of each dataset was used for training vs. validation.
- Annotation format: Implied to follow COCO format conventions (consistent with PubLayNet), but not explicitly confirmed.
- Source datasets: PubLayNet ($\sim$360K pages, Apache-2.0 license), IIIT-AR-13K (13,000 pages, custom license).
- Target dataset: DocLayNet (80,863 pages, CDLA-Permissive-1.0 license), used only for inference evaluation across four domain subsets (Manuals, Financial Documents, Laws & Regulations, Scientific Documents).
- RanLayNet license: The GitHub repository does not include a LICENSE file. The license for the RanLayNet dataset and code is unknown.
Evaluation
- Metrics: Precision, Recall, mAP@50, and mAP@95 (the paper’s notation for what is likely mAP@[0.5:0.95], the standard COCO-style averaged metric).
- In-domain results (Tables 4-7) serve as sanity checks showing the model learns each dataset’s classes, but these are not the main evaluation.
- Cross-domain evaluation is limited to the Table class on four DocLayNet domain subsets. Performance on the other four classes (Text, Title, List, Figure) is not evaluated in the cross-domain setting, leaving the generality of the approach undemonstrated.
- No error bars, confidence intervals, or multi-seed experiments are reported. All results appear to come from single training runs.
- No comparison against other domain adaptation methods (e.g., adversarial adaptation, self-training, CycleGAN-based augmentation) or other synthetic data generation approaches.
- No ablation of dataset generation choices (random placement strategy, number of elements per image, class balance, dataset size scaling).
- Possible data error in Table 3: the Manuals and Scientific Documents rows report identical PubLayNet + RanLayNet results (Precision=0.562, Recall=0.807, mAP50=0.761, mAP95=0.588), which is almost certainly a copy-paste error in the paper.
Hardware
- Training: Single NVIDIA RTX A6000 GPU (48 GB VRAM) with 512 GB system RAM.
- Training time: Not reported for any experiment.
- VRAM usage: Not reported.
- Inference latency/throughput: Not reported.
- Cost estimates: Not provided.
- Deployment considerations: The A6000 is a workstation-class GPU. Without knowing the YOLOv8 variant used, it is difficult to assess whether inference is feasible on more constrained hardware.
BibTeX
@inproceedings{anand2023ranlaynet,
title={RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization},
author={Avinash Anand and Raj Jaiswal and Mohit Gupta and Siddhesh S Bangar and Pijush Bhuyan and Naman Lal and Rajeev Singh and Ritika Jha and Rajiv Ratn Shah and Shin'ichi Satoh},
booktitle={ACM Multimedia Asia 2023 (MMAsia '23)},
year={2023},
publisher={ACM},
doi={10.1145/3595916.3626448}
}
U-DIADS-Bib: Pixel-Precise Layout Analysis for Ancient Biblical Manuscripts
TL;DR
U-DIADS-Bib is a pixel-precise document layout analysis dataset containing 200 pages from four ancient biblical manuscripts (three Latin, one Syriac) dating from the 6th to 12th centuries. The dataset distinguishes six non-overlapping semantic classes chosen collaboratively by humanities scholars and computer vision researchers. The authors also release a standardized few-shot variant (U-DIADS-BibFS) with only 3 training images per manuscript and benchmark five segmentation architectures on both versions.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ : the headline contribution is the dataset itself. The paper devotes the majority of its content to describing the manuscripts, the annotation taxonomy, the ground truth construction pipeline, and the data splits. The dataset and its few-shot variant are the primary reusable assets introduced.
Secondary: $\Psi_{\text{Method}}$ : the authors propose a computer-aided annotation pipeline that alternates between ML-generated coarse segmentations and manual expert refinement, reducing the time needed to produce pixel-precise ground truth. $\Psi_{\text{Evaluation}}$ : the paper benchmarks five semantic segmentation models (FCN, LRASPP, DeepLabV3, DeepLabV3+, PSPNet) plus a few-shot baseline on both the full and few-shot dataset versions, establishing reference results for future work.
What is the motivation?
Document layout analysis for historical manuscripts requires pixel-precise segmentation maps as ground truth. Existing datasets in this space suffer from several limitations:
Coarse annotation granularity. Many historical DLA datasets provide only bounding boxes or polygons, which cannot capture the fine-grained, interleaved layout elements common in ancient manuscripts (e.g., marginal glosses weaving around decorations).
Limited class vocabularies. Several pixel-level datasets distinguish only between text and background (e.g., PHTD, HDRC-CHINESE) or offer only a handful of classes (e.g., DIVA-HisDB with three classes). This is insufficient for humanities scholars who need to differentiate between main text, paratextual elements, decorations, titles, and chapter headings.
Noisy ground truth. The HBA 1.0 dataset, one of the larger pixel-level alternatives, contains noisy segmentation maps that compromise model evaluation reliability.
Mono-alphabetic focus. Existing datasets typically cover a single script (Latin, Arabic, or Chinese), whereas real manuscript collections span multiple writing systems.
Annotation cost. Producing pixel-precise segmentation for complex manuscript pages is extremely time-consuming and requires domain expertise, creating a practical bottleneck for dataset creation.
What is the novelty?
The core novelty is the dataset design and its annotation methodology:
Interdisciplinary class selection. The six segmentation classes (background, main text, paratext, decoration, title, chapter headings) were chosen through collaboration between humanities experts and computer vision researchers. The humanities scholars identified which layout elements are meaningful for studying biblical manuscripts, while the CV researchers ensured the annotations meet modern semantic segmentation standards (pixel-precise, non-overlapping, noiseless).
Multi-script coverage. U-DIADS-Bib includes manuscripts in both the Latin alphabet and the Syriac Abjad consonantal alphabet, making the segmentation task more challenging and the dataset more representative of real manuscript collections.
Computer-aided annotation pipeline. To address the annotation cost problem, the authors propose a pipeline that:
- Selects 10 images per manuscript and binarizes them using the Sauvola thresholding technique.
- Has humanities experts manually segment these 10 images at pixel level.
- Trains a few-shot segmentation model (following De Nardin et al.) on the manual annotations to produce coarse segmentation of remaining pages (initially for 4 classes).
- Has humanities experts refine the machine-generated segmentations and add missing classes.
This approach ensures that the final ground truth is always verified by a human expert, while reducing the total manual effort compared to annotating every page from scratch.
Standardized few-shot split. The U-DIADS-BibFS variant provides a fixed split with only 3 training images per manuscript (chosen to cover all classes), 10 for validation, and 30 for testing. This standardized split is intended to encourage research on low-data document layout analysis.
Class weighting for imbalanced data. The dataset exhibits severe class imbalance (background occupies 85-93% of pixels). The authors propose a weighted cross-entropy loss where each class weight is computed as:
$$W_i = \sqrt{\frac{1}{F_i}}$$
where $F_i$ is the frequency of class $i$ in the corresponding manuscript. This helps the models attend to underrepresented classes.
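As a concrete example, this weighting plugs directly into a standard weighted cross-entropy loss. The frequencies below are illustrative, not the paper's measured values:

```python
import torch
import torch.nn as nn

# Illustrative per-class pixel frequencies (background-dominated, summing to 1).
freqs = torch.tensor([0.89, 0.005, 0.02, 0.07, 0.008, 0.007])
weights = torch.sqrt(1.0 / freqs)  # W_i = sqrt(1 / F_i)
criterion = nn.CrossEntropyLoss(weight=weights)

# logits: (batch, 6 classes, H, W); target: (batch, H, W) integer class map.
logits = torch.randn(2, 6, 64, 64)
target = torch.randint(0, 6, (2, 64, 64))
loss = criterion(logits, target)
```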
What experiments were performed?
Full dataset setting
Five semantic segmentation architectures were benchmarked on U-DIADS-Bib:
- FCN (Fully Convolutional Network)
- LRASPP (Lite Reduced Atrous Spatial Pyramid Pooling)
- DeepLabV3
- DeepLabV3+
- PSPNet (Pyramid Scene Parsing Network)
All models except LRASPP use a ResNet34 backbone; LRASPP uses MobileNetV3-Large. Training uses the Adam optimizer with a learning rate of $10^{-3}$ and weight decay of $10^{-5}$, running for up to 200 epochs with early stopping (patience of 20 epochs after a 50-epoch buffer).
Each model is trained and evaluated per manuscript, with the final score computed as the average across all four manuscripts. Evaluation metrics include Precision, Recall, IoU, and F1-Score, reported as both weighted and macro averages.
Few-shot setting
The same five architectures were evaluated on U-DIADS-BibFS (3 training images per manuscript). Additionally, the few-shot method of De Nardin et al. was included as a baseline specifically designed for low-data document layout segmentation.
Analysis
Confusion matrices for the best-performing model (DeepLabV3+) on each manuscript are provided, allowing inspection of per-class failure modes.
What are the outcomes/conclusions?
Full dataset. DeepLabV3+ achieved the best overall performance across all four manuscripts. However, even the best model showed notable weaknesses: Paratext and Title classes proved difficult across manuscripts, with particularly low macro-averaged scores. The weighted averages (which are dominated by the background class) were consistently high, but the macro averages reveal that minority classes remain challenging.
Few-shot dataset. In the few-shot setting, the method of De Nardin et al. (designed for few-shot document layout segmentation) achieved the best results, outperforming the general-purpose segmentation models trained on only 3 images. This result supports the paper’s argument that specialized few-shot approaches are needed for practical deployment where large annotated corpora are unavailable.
Class imbalance is a persistent challenge. The large gap between weighted and macro metrics across all models highlights that severe class imbalance (with minority classes like Paratext and Title comprising less than 1% of pixels) remains a difficult problem. The confusion matrices show that models frequently confuse minority classes with the dominant background and main text classes.
Limitations. The dataset is relatively small (200 pages total) and covers only biblical manuscripts, which limits the diversity of document types. The authors note plans to expand to modern manuscripts, letters, and printed documents with manual annotations. The Syriac manuscript lacks the Chapter Headings class, creating some inconsistency across manuscripts.
Reproducibility
Models
- Five standard architectures: FCN, LRASPP, DeepLabV3, DeepLabV3+, PSPNet.
- Backbones: ResNet34 for all except LRASPP (MobileNetV3-Large). The paper does not state whether backbones are initialized from ImageNet-pretrained weights or trained from scratch, though pretrained initialization is standard practice for these architectures.
- The few-shot baseline follows De Nardin et al. (WACV 2023 / IJNS 2023).
- No parameter counts are reported for any of the models.
- No pretrained weights or trained checkpoints are released.
Algorithms
- Optimizer: Adam, learning rate $10^{-3}$, weight decay $10^{-5}$.
- Batch size: Not reported.
- Training schedule: Up to 200 epochs, early stopping with 20-epoch patience after a 50-epoch warmup buffer (early stopping only activates after 50 epochs to reduce the risk of stopping at a local minimum early in training).
- Loss function: Weighted cross-entropy with class weights $W_i = \sqrt{1 / F_i}$, where $F_i$ is the pixel-level frequency of class $i$ in the corresponding manuscript.
- Data augmentation: Not reported. The Sauvola binarization mentioned in the paper is part of the annotation pipeline, not training-time augmentation.
- Sampling/decoding: Not applicable (segmentation task, no generation or decoding involved).
Data
- 200 images total: 50 per manuscript, split into 10 train / 10 validation / 30 test.
- U-DIADS-BibFS: 43 images per manuscript (3 train / 10 validation / 30 test). The 3 training images per manuscript were chosen to cover all segmentation classes.
- Images stored as JPEG at $1344 \times 2016$ pixels; ground truth as PNG with RGB-encoded class labels.
- Six non-overlapping classes encoded by distinct RGB values: Background (0,0,0), Paratext (255,255,0), Decoration (0,255,255), Main Text (255,0,255), Title (255,0,0), Chapter Headings (0,255,0). Syriaque 341 has only five classes (no Chapter Headings).
- Severe class imbalance: Background dominates at 85-93% of pixels, while minority classes like Paratext and Title each comprise less than 1%.
- Source images collected from Gallica (Bibliothèque nationale de France digital library). Specific Gallica URLs for each manuscript are provided in the paper.
- Annotation: hybrid process combining ML-generated coarse segmentation (for 4 classes initially) with manual expert refinement to add the remaining classes and correct errors. The final ground truth is always verified by a human domain expert.
- No filtering or deduplication pipeline is described; pages were selected to be representative of all segmentation classes.
- The dataset is available at https://ai4ch.uniud.it/udiadsbib/. The specific license for the dataset is not explicitly stated in the paper; the underlying manuscript images come from Gallica.
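Decoding the RGB-encoded ground truth listed above into integer class maps is straightforward. A minimal sketch; the class index order is an arbitrary choice, not specified by the dataset:

```python
import numpy as np

PALETTE = {
    (0, 0, 0): 0,        # Background
    (255, 0, 255): 1,    # Main Text
    (255, 255, 0): 2,    # Paratext
    (0, 255, 255): 3,    # Decoration
    (255, 0, 0): 4,      # Title
    (0, 255, 0): 5,      # Chapter Headings
}

def rgb_to_classes(gt_rgb):
    """gt_rgb: (H, W, 3) uint8 array; returns an (H, W) integer class map."""
    out = np.zeros(gt_rgb.shape[:2], dtype=np.int64)
    for color, idx in PALETTE.items():
        out[np.all(gt_rgb == np.array(color), axis=-1)] = idx
    return out
```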
Evaluation
- Metrics: Precision, Recall, IoU, F1-Score, with standard definitions (Equations 2-5 in the paper). Both weighted average (based on class distribution) and macro average (equal weight per class) are computed per manuscript, then averaged across all four manuscripts.
- Confusion matrices are provided for the best-performing model (DeepLabV3+) on each manuscript, allowing per-class failure mode inspection.
- No error bars, confidence intervals, or multi-seed runs are reported. Each model appears to have been trained once per manuscript.
- Baselines: five general-purpose segmentation models on both full and few-shot settings, plus one few-shot-specific baseline (De Nardin et al.) on the few-shot setting only.
- The authors acknowledge the severe class imbalance as a key challenge and the limited manuscript diversity (only biblical manuscripts) as a scope restriction.
- The large gap between weighted and macro metrics across all models indicates that high weighted scores are dominated by the background class and may overstate practical performance on minority layout elements.
Hardware
- No hardware specifications are reported in the paper: GPU type/count, training time, memory requirements, and inference latency are all absent.
- The dataset and models are relatively small (200 images at $1344 \times 2016$ pixels, standard segmentation architectures with ResNet34/MobileNetV3 backbones), so training is likely feasible on a single consumer GPU, though the paper does not confirm this.
BibTeX
@article{zottin2024udiadsbib,
title={U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts},
author={Zottin, Silvia and De Nardin, Axel and Colombi, Emanuela and Piciarelli, Claudio and Pavan, Filippo and Foresti, Gian Luca},
journal={Neural Computing and Applications},
year={2024},
doi={10.1007/s00521-023-09356-5},
publisher={Springer}
}
GraphDoc: Graph-based Document Structure Analysis
TL;DR
GraphDoc introduces the graph-based Document Structure Analysis (gDSA) task and provides a dataset of 80K document images with 4.13M relation annotations (spatial and logical) built on top of DocLayNet. The accompanying model, DRGG, is a plug-and-play relation prediction head that achieves 57.6% mAP_g@0.5. Prior layout datasets provide bounding boxes but not the relational graph structure needed for reading order, hierarchy, and cross-element references. GraphDoc adds this layer.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the GraphDoc dataset: 80K images, 11 object categories, 8 relation types, and 4.13M annotated relation pairs. This is the first large-scale dataset to combine spatial and logical document relations in a unified graph structure.
Secondary: $\Psi_{\text{Method}}$ (the DRGG relation head architecture), $\Psi_{\text{Evaluation}}$ (definition of the new mAP_g metric and benchmark baselines for the gDSA task).
What is the motivation?
Traditional Document Layout Analysis (DLA) focuses on detecting and classifying bounding boxes (text, table, figure, etc.) but stops there. It does not capture the relations between elements: which paragraph follows which, what section a figure belongs to, or how elements are spatially arranged relative to each other.
Existing relational datasets are limited in scope:
- FUNSD / XFUND: form key-value pairs only, text-only elements
- ReadingBank: reading order only, text-only
- HRDoc / Comp-HRDoc: hierarchical structure with non-textual elements, but no spatial relations and no unified graph modality
No prior dataset captures the full relational graph of a document page, including both textual and non-textual elements, with both spatial and logical edge types. GraphDoc fills this gap.
What is the novelty?
The gDSA Task
Graph-based Document Structure Analysis requires a model to produce a directed graph $G = (V, E)$ where:
- $V$: detected layout elements (nodes), each with a category label and bounding box
- $E$: directed edges between element pairs, labeled with one of 8 relation types
The gDSA objective is formulated as:
$$ \mathcal{L}_{\text{gDSA}} = \sum_{(v_i, v_j) \in E} \bigl(\mathcal{L}_{\text{cls}}(v_i, \hat{v}_i) + \mathcal{L}_{\text{rel}}(r_{ij}, \hat{r}_{ij})\bigr) $$
where $v_i$ and $\hat{v}_i$ are ground-truth and predicted node labels, and $r_{ij}$ and $\hat{r}_{ij}$ are ground-truth and predicted relation labels between elements $i$ and $j$.
This subsumes two classical sub-tasks:
- Reading Order Prediction (ROP): determining the correct sequence of elements
- Hierarchical Structure Analysis (HSA): identifying parent-child relations
The GraphDoc Dataset
Built on top of DocLayNet’s 80K document images using a semi-automated pipeline:
- Content extraction: Tesseract OCR + pdfplumber on DocLayNet’s source PDFs
- Spatial relation extraction: pixel-by-pixel axis scanning to find nearest adjacent bounding boxes in each cardinal direction
- Reading order: Manhattan/non-Manhattan layout detection + Recursive X-Y Cut
- Hierarchical structure: rule-based tree construction for section headers, text, captions, tables, figures
- Human verification: covering approximately 58.5% of the dataset (4,852 Government Tender pages, 12,000 Financial Report pages, 6,469 Patent pages, 8,000 pages from other domains)
Refinement rates during human verification varied by domain: ~23% for Financial Reports, ~8% for Scientific Articles, ~26% for Government Tenders, ~17% for Patents.
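The spatial-relation extraction step above can be approximated with a simple nearest-neighbor search over bounding boxes. A simplified stand-in for the paper's pixel-wise axis scanning, assuming boxes are (x1, y1, x2, y2) tuples:

```python
def nearest_right(box, others):
    """Return the box whose left edge is closest to `box`'s right edge,
    among boxes that overlap it vertically (the 'Right' relation)."""
    x1, y1, x2, y2 = box
    candidates = [b for b in others
                  if b[0] >= x2 and not (b[3] <= y1 or b[1] >= y2)]
    return min(candidates, key=lambda b: b[0] - x2, default=None)

page = [(0, 0, 100, 50), (120, 10, 200, 40), (300, 0, 380, 60)]
print(nearest_right(page[0], page[1:]))  # -> (120, 10, 200, 40)
```

The Up/Down/Left variants follow by symmetry.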
11 Object Categories: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
8 Relation Types:
| Type | Category | Description |
|---|---|---|
| Up | Spatial | Nearest element above |
| Down | Spatial | Nearest element below |
| Left | Spatial | Nearest element to the left |
| Right | Spatial | Nearest element to the right |
| Parent | Logical | Hierarchical parent (e.g., section header over subsection) |
| Child | Logical | Hierarchical child |
| Sequence | Logical | Reading order successor |
| Reference | Logical | Cross-reference (e.g., figure cited in text) |
Spatial relations make up roughly 64% of all annotations, with logical relations accounting for the remainder (the paper's reported figures, 64.1% and 36.9%, do not sum to 100%, so one is presumably a typo).
The DRGG Model
The Document Relation Graph Generator is a plug-and-play module that attaches to any DETR-style object detector:
- Relation Feature Extractor: takes object queries and per-layer decoder features ($X^l \in \mathbb{R}^{N \times d_{\text{embed}}}$), processes them through pooling and MLP layers, then upsamples and concatenates to form 2D relational feature maps:
$$ D_1^l = \text{MLP}_p^1(P_1(X^l)), \quad D_2^l = \text{MLP}_p^2(P_2(X^l)) $$
$$ F^l = \text{Concat}\bigl(\sigma(\text{MLP}_u^1(U_1(D_1^l)) + X^l) \otimes \mathbf{1}_{d_{\text{embed}}},\; \sigma(\text{MLP}_u^2(U_2(D_2^l)) + X^l)^T \otimes \mathbf{1}_{d_{\text{embed}}}\bigr) $$
where $F^l \in \mathbb{R}^{N \times N \times 2d_{\text{embed}}}$.
- Relational Feature Aggregation: learnable weighted sum across decoder layers, followed by an MLP that predicts the relation graph:
$$ G = \text{MLP}_g\left(\sum_{l=1}^{L} \alpha^{(l)} F^l\right) $$
where $G \in \mathbb{R}^{N \times N \times k}$, $k$ is the number of relation categories, and $\alpha^{(l)}$ are learnable aggregation weights.
- Auxiliary Relation Head: binary prediction of whether any relation exists between a pair of elements, multiplied element-wise with the main output at inference
The total loss is:
$$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{bbox}} + \lambda \mathcal{L}_{\text{rel}} + \sigma \mathcal{L}_{\text{rel\_aux}} $$
where $\mathcal{L}_{\text{rel}}$ and $\mathcal{L}_{\text{rel\_aux}}$ are both binary cross-entropy losses, and $\lambda$, $\sigma$ are hyperparameters.
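The core of the relation head is the construction of an $N \times N$ pairwise feature tensor from $N$ object queries. A minimal sketch that collapses the pooling/upsampling MLP branches into two linear maps; this is a simplification in the spirit of the DRGG extractor, not its exact implementation:

```python
import torch
import torch.nn as nn

N, d, k = 100, 256, 8  # object queries, embedding dim, relation types

proj_src = nn.Linear(d, d)   # stand-in for the pooling + MLP branch on D_1
proj_dst = nn.Linear(d, d)   # stand-in for the pooling + MLP branch on D_2
rel_head = nn.Linear(2 * d, k)

X = torch.randn(N, d)  # object queries from one decoder layer
src = proj_src(X).unsqueeze(1).expand(N, N, d)  # feature of element i per pair (i, j)
dst = proj_dst(X).unsqueeze(0).expand(N, N, d)  # feature of element j per pair (i, j)
F = torch.cat([src, dst], dim=-1)               # (N, N, 2d) pairwise features
G = rel_head(F)                                  # (N, N, k) relation logits
# In DRGG, the per-layer tensors F^l are first combined with the learnable
# weights alpha^(l) across decoder layers before the final MLP produces G.
```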
What experiments were performed?
Metrics
- DLA: standard COCO-style mAP@[0.5:0.95]
- gDSA: new metrics mR_g and mAP_g that jointly evaluate detection and relation prediction. Predicted instances are first matched to ground-truth via IoU and category correspondence. Relations with confidence above a threshold $T_R$ are then evaluated. Mean recall per relation category:
$$ \text{mR}_g = \frac{1}{R} \sum_{r=1}^{R} \text{Recall}_r $$
Mean average precision per relation category:
$$ \text{mAP}_g = \frac{1}{R} \sum_{r=1}^{R} \text{AP}_r $$
Results are reported at $T_R \in \{0.5, 0.75, 0.95\}$. Unlike top-$k$ SGG metrics, this threshold-based approach ensures rare but critical relations (e.g., Reference) are not filtered out.
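A toy computation of $\text{mR}_g$ under these definitions, assuming detection matching and thresholding at $T_R$ have already been applied (the counts are invented for illustration):

```python
def mean_recall_g(tp_by_rel, gt_by_rel):
    """Average per-relation-category recall over the R relation types."""
    recalls = [tp_by_rel.get(r, 0) / n for r, n in gt_by_rel.items() if n > 0]
    return sum(recalls) / len(recalls)

gt = {"Left": 120, "Right": 118, "Parent": 60, "Reference": 9}
tp = {"Left": 118, "Right": 117, "Parent": 31, "Reference": 2}
print(round(mean_recall_g(tp, gt), 3))  # -> 0.678; the rare Reference class
# weighs as much in the mean as the abundant Left/Right classes
```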
Detector and Backbone Comparison
All detectors tested with InternImage backbone + DRGG:
| Detector | DLA mAP | mAP_g@0.5 |
|---|---|---|
| DETR | 68.2 | 19.8 |
| Deformable DETR | 73.4 | 25.4 |
| DINO | 79.5 | 25.2 |
| RoDLA | 81.5 | 57.6 |
RoDLA + InternImage outperforms all other combinations by a wide margin. Backbone ablations with RoDLA:
| Backbone | DLA mAP | mAP_g@0.5 |
|---|---|---|
| ResNet | 71.0 | 45.8 |
| ResNeXt | 77.9 | 40.3 |
| Swin | 73.7 | 26.1 |
| InternImage | 81.5 | 57.6 |
Per-Relation Performance (AP_g@0.5, best model)
| Relation | AP_g@0.5 |
|---|---|
| Left | 99.0 |
| Right | 99.0 |
| Up | 49.0 |
| Down | 49.0 |
| Parent | 45.5 |
| Child | 45.5 |
| Sequence | 56.4 |
| Reference | 16.8 |
Left/Right spatial relations are nearly solved. Vertical spatial and logical relations are substantially harder. Reference relations remain very challenging: the best model (InternImage + RoDLA) achieves only 16.8% AP, while ResNeXt + RoDLA reaches 18.8%. This suggests that visual features alone are insufficient for cross-reference detection.
Domain-Specific Results
Performance varies significantly by document domain (mAP_g@0.5): Laws & Regulations (63.2) performs best, while Patents (31.8) performs worst.
Key Ablation Findings
- Adding DRGG improves DLA performance (80.5 to 81.5 mAP), suggesting relational reasoning provides a useful learning signal for detection.
- The full Relation Feature Extractor outperforms a simple linear layer (57.6 vs. 52.9 mAP_g@0.5).
- Including logical relations alongside spatial ones improves overall mAP_g (57.5 vs. 49.5 spatial-only), though mean recall drops (26.7 vs. 32.1).
What are the outcomes/conclusions?
GraphDoc provides the first large-scale benchmark for joint layout detection and relational graph prediction. The gDSA formulation goes beyond flat bounding-box detection, capturing structure that downstream tasks (RAG chunking, reading order, document understanding) require.
Key takeaways:
- Relational graph prediction is feasible at scale with a plug-and-play module, achieving 57.6% mAP_g@0.5 and 30.7% mR_g@0.5.
- Left/Right spatial relations are largely solved (99% AP), but vertical spatial and logical relations remain hard, especially Reference (16.8% AP).
- Relation prediction helps detection: the DRGG head acts as an auxiliary task that improves DLA mAP by 1 point.
- Visual-only input is a limitation: the model uses no textual content, which likely explains the poor Reference relation performance.
- Single-page only: multi-page document structure remains unaddressed.
- Domain sensitivity: performance varies substantially across document types, with structured layouts (laws/regulations) being easier than complex ones (patents, financial reports).
Mapping to Matter vs. Meaning Framework
GraphDoc inherits DocLayNet’s 11-class label set (see DocLayNet notes for the full class-level mapping). The novel contribution is the 8 relation types, which map to the Matter vs. Meaning framework as follows:
Relation Types
| GraphDoc Relation | Category | M-vs-M Mapping | Notes |
|---|---|---|---|
| Up | Spatial | (no direct equivalent) | Nearest element above. Spatial adjacency, not a semantic relation. |
| Down | Spatial | (no direct equivalent) | Nearest element below. |
| Left | Spatial | (no direct equivalent) | Nearest element to the left. |
| Right | Spatial | (no direct equivalent) | Nearest element to the right. |
| Parent | Logical | header_of | Hierarchical parent (e.g., section header over text body). |
| Child | Logical | Inverse of header_of | Hierarchical child. |
| Sequence | Logical | Reading order | Reading order successor. |
| Reference | Logical | group_with | Cross-reference (e.g., figure cited in text, caption linked to figure). The hardest relation at 16.8% AP. |
Coverage gaps in relations: The M-vs-M framework defines label_for and value_of (form-specific relations) which GraphDoc does not cover, since DocLayNet contains no form annotations. The spatial relations (Up/Down/Left/Right) have no direct M-vs-M equivalent; they encode physical adjacency rather than semantic structure.
Reproducibility
Models
- Architecture: DRGG is a plug-and-play relation head compatible with DETR-family detectors
- Best config: InternImage backbone + RoDLA detector + DRGG relation head
- Code: GitHub repo exists under MIT license but is currently a placeholder. Dataset, model checkpoints, training code, and evaluation code are all listed as “coming soon” (as of March 2026)
Algorithms
- Optimizer: AdamW (betas: 0.9/0.999, epsilon: 1e-8)
- Learning rate: 1e-4 initial
- Weight decay: 5e-3
- Batch size: 4
- Multi-scale training: shorter side randomly resized to one of {480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800}; longer side capped at 1333 pixels
- Framework: MMDetection (PyTorch v1.10)
- Loss: $\mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{bbox}} + \lambda \mathcal{L}_{\text{rel}} + \sigma \mathcal{L}_{\text{rel\_aux}}$ (all relation losses are BCE); the numerical values of $\lambda$ and $\sigma$ are not reported
- Number of epochs: not explicitly stated
Data
- Source: DocLayNet 80K images + source PDFs
- Annotation pipeline: rule-based extraction (Tesseract OCR, pdfplumber, X-Y Cut) + human verification (~58.5% coverage; the remaining ~41.5% relies entirely on rule-based annotations without manual review)
- License: CDLA-Permissive-1.0 (inherited from DocLayNet annotations); underlying document licenses vary by domain
- Format: not explicitly specified (likely COCO-style given MMDetection usage)
- Train/val/test splits: follows DocLayNet conventions (not explicitly re-stated)
Evaluation
- DLA metric: COCO mAP@[0.5:0.95]
- gDSA metric: custom mAP_g and mR_g at relation confidence thresholds
- Baselines: four detector architectures (DETR, Deformable DETR, DINO, RoDLA) across four backbones (ResNet, ResNeXt, Swin, InternImage)
- Limitations acknowledged: visual-only input, single-page scope, class imbalance (spatial dominates), reference relations poorly handled, DLA errors propagate to relation prediction
- Statistical rigor: no error bars or multi-seed runs reported
Hardware
- Training: 4x NVIDIA A100 GPUs (40 GB each), 300 GB CPU memory
- Infrastructure: HoreKa supercomputer / HAICORE@KIT / bwForCluster Helix
- Inference latency / cost: not reported
BibTeX
@inproceedings{chen2025graphdoc,
title={Graph-Based Document Structure Analysis},
author={Chen, Yufan and Liu, Ruiping and Zheng, Junwei and Wen, Di and Peng, Kunyu and Zhang, Jiaming and Stiefelhagen, Rainer},
booktitle={International Conference on Learning Representations (ICLR)},
year={2025}
}
IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing
TL;DR
IndicDLP is a 120K-page, human-annotated document layout dataset covering 11 Indic languages plus English, 12 document domains (newspapers, novels, textbooks, forms, etc.), and 42 region labels. Models trained on IndicDLP generalize better to other layout datasets than models trained on English-only resources. The paper received the ICDAR 2025 Best Student Paper Runner-Up award.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the headline contribution is the IndicDLP dataset itself: its scale, linguistic diversity, domain breadth, and annotation quality. The paper is organized around data curation, annotation workflow, and the resulting benchmark.
Secondary: $\Psi_{\text{Evaluation}}$: a substantial portion of the paper benchmarks six detection models (YOLOv10, DiT, DocLayout-YOLO, DINO, RoDLA, Florence-2) on IndicDLP, evaluates cross-lingual zero-shot transfer, and measures the dataset’s effectiveness as a pretraining source for other layout datasets.
What is the motivation?
Existing large-scale layout datasets (PubLayNet, DocBank) offer volume but lack fine-grained labels and multilingual coverage. Manually annotated datasets (M6Doc, D4LA) provide richer labels and domain diversity but are too small to train robust models (9K and 12K pages, respectively). This gap is particularly acute for Indic documents, which encompass 11+ scripts with complex diacritics, conjunct characters, and historical typographic variation, yet have almost no representation in current layout benchmarks.
The authors also note a practical motivation: poor layout localization cascades into poor OCR for Indic scripts, where text recognition is already under-resourced. A high-quality, diverse layout dataset is therefore a prerequisite for downstream digitization pipelines.
What is the novelty?
The core contribution is scale and diversity in a single human-annotated resource:
- 119,806 pages with 1,856,241 annotated instances (bounding boxes)
- 42 region labels covering both physical (figure, table, paragraph) and logical (jumpline, flag, sidebar, hierarchical list levels) regions
- 11 Indic languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu) plus English
- 12 document domains: newspapers, novels, magazines, manuals, textbooks, acts and rules, question papers, forms, brochures, notices, syllabi, research papers
- Documents span from pre-independence era to present day, including both digital-born and scanned sources
The annotation process involved a team of 50 individuals (3 to 4 annotators and 1 reviewer per language), guided by a 150-page annotation manual. Over 60% of validated documents were further checked by a team of 10 supercheckers, including the authors, to ensure cross-language and cross-domain consistency.
The annotation schema builds on M6Doc’s guidelines but refines them: splitting generic labels (e.g., list into ordered-list and unordered-list; caption into table-caption and figure-caption), adding domain-specific labels (jumpline, flag, placeholder-text, contact-info, website-link), and introducing hierarchical nesting for section titles and list levels (up to three depth levels).
The authors also curate UED-mini, a merged and label-harmonized subset of DocLayNet and M6Doc (75K images, 25 unified labels), intended as a pretraining resource for Indic layout models.
What experiments were performed?
Baseline Benchmarking
All experiments use mean Average Precision as the primary metric, computed over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (COCO-style):
$$\text{mAP@[.5:.95]} = \frac{1}{10} \sum_{t \in \{0.50, 0.55, \ldots, 0.95\}} \text{AP}_t$$
Six models were trained and evaluated on IndicDLP:
| Model | Params | mAP@50 | mAP@75 | mAP@[50:95] |
|---|---|---|---|---|
| YOLOv10x | 37M | 73.5 | 60.6 | 55.0 |
| DocLayout-YOLO | 20M | 73.5 | 60.0 | 54.5 |
| RoDLA | 342M | 74.1 | 57.7 | 53.1 |
| DINO | 46M | 69.7 | 53.4 | 49.2 |
| DiT | 86M | 67.2 | 51.8 | 47.8 |
| Florence-2 | 826M | 41.4 | 28.7 | 28.0 |
YOLOv10x offers the best efficiency-performance tradeoff. Of the six models, DiT, RoDLA, and DocLayout-YOLO were pretrained on document-specific data (IIT-CDIP, M6Doc, and DocSynth-300K, respectively), while the others used COCO or (in the case of Florence-2) FLD-5B natural images. Florence-2 underperforms despite its large capacity, likely because its pretraining emphasizes natural images over document layouts.
Cross-Lingual Zero-Shot Transfer
Language-specific YOLOv10x models were trained on single-language subsets and evaluated on all other languages. Key findings:
- Zero-shot evaluation on unseen scripts drops 20-25 mAP points
- Languages sharing the same script (Hindi/Marathi via Devanagari; Assamese/Bengali via Bengali script) show modest cross-lingual gains, but still trail language-specific models
- A model trained on the full multilingual IndicDLP outperforms all single-language models, even those evaluated on their own language
Domain-Wise Performance
Performance varies substantially by domain. Simpler layouts (novels, research papers) score highest. Complex multi-column layouts (newspapers, magazines) and domains with unique region types (forms, brochures) are harder.
Pretraining Experiments
| Pretraining Source | Size | mAP@[50:95] |
|---|---|---|
| UED-mini | 75K | 57.7 (+1.9) |
| No pretraining (scratch) | - | 55.8 (baseline) |
| COCO | 118K | 55.0 (-0.8) |
| UED (full) | 785K | 53.9 (-1.9) |
| PubLayNet | 360K | 53.0 (-2.8) |
UED-mini (the curated, human-annotated subset) is the only pretraining source that improves over training from scratch. Larger but lower-quality or less diverse sources (PubLayNet, UED full) actually hurt performance. This suggests that annotation quality and domain diversity matter more than volume for pretraining.
IndicDLP as a Pretraining Source
Conversely, pretraining on IndicDLP and fine-tuning on M6Doc, DocLayNet, and D4LA yields +2.8 mAP improvement on average, and faster convergence for DocLayNet. The dataset generalizes well beyond Indic layouts.
Cross-Dataset Generalization
A YOLOv10x model trained on IndicDLP and evaluated on other datasets (M6Doc, D4LA, DocLayNet) shows the smallest cross-dataset performance drop of any training source tested, demonstrating that IndicDLP’s diversity leads to better generalization on common physical region labels (paragraph, table, figure).
What are the outcomes/conclusions?
Scale and diversity in one package: IndicDLP is, by a significant margin, the largest human-annotated layout dataset with multilingual and multi-domain coverage (120K pages vs. DocLayNet’s 80K, with far broader language and domain representation).
Script matters for layout: Zero-shot cross-lingual results show a 20-25 mAP drop on unseen scripts, indicating that visual script properties do influence layout detection. This contradicts the common assumption that layout analysis is script-agnostic.
Quality over quantity for pretraining: The curated UED-mini (75K pages) outperforms the full UED (785K pages) as a pretraining source, reinforcing that careful curation and label harmonization matter more than raw scale.
Strong generalization: Models trained on IndicDLP transfer better to English-centric datasets than the reverse, suggesting the dataset’s domain and layout diversity provides a robust visual prior.
Limitations acknowledged by the authors: The dataset uses axis-aligned bounding boxes only (no polygons or rotated boxes), which limits precision on skewed or rotated documents. The 42-class label set, while richer than most, still does not cover all possible regions as dedicated labels (e.g., QR codes and barcodes are subsumed under `figure`; poems are excluded entirely).
Mapping to Matter vs. Meaning Framework
How does IndicDLP’s 42-class taxonomy map to our Matter vs. Meaning framework? IndicDLP uses a hierarchical label set with depth-level variants for headings and lists.
Headings & Structure
| IndicDLP Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| headline | Text | Title | The main document title. |
| sub-headline | Text | Subtitle | Secondary headline. |
| subsub-headline | Text | Subtitle | Tertiary headline. |
| chapter-title | Text | SectionHeader | Chapter-level heading. |
| section-title | Text | SectionHeader | Section-level heading (H2 equivalent). |
| sub-section-title | Text | SectionHeader | Subsection heading (H3 equivalent). |
| subsub-section-title | Text | SectionHeader | Sub-subsection heading (H4 equivalent). |
Content
| IndicDLP Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| paragraph | Text | Body | Standard prose content. |
| quote | Text | Blockquote | Extended quotations. |
| sidebar | Text | Sidebar | Content outside the main text flow. |
| figure | Image | Figure | Visual content (charts, photos, diagrams not distinguished). |
| table | Table | (primitive only) | Tabular data. |
| formula | Formula | DisplayEquation | Mathematical notation. |
| index | Table | Index | Back-of-book index. |
| table-of-contents | Table | TOC | Table of contents. |
Lists (Hierarchical)
| IndicDLP Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| ordered-list | Text | ListItem | Numbered list. Container-level annotation. |
| unordered-list | Text | ListItem | Bulleted list. Container-level annotation. |
| sub-ordered-list | Text | ListItem | Nested numbered list (depth 2). |
| sub-unordered-list | Text | ListItem | Nested bulleted list (depth 2). |
| subsub-ordered-list | Text | ListItem | Nested numbered list (depth 3). |
| subsub-unordered-list | Text | ListItem | Nested bulleted list (depth 3). |
Meta & Navigation
| IndicDLP Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| author | Text | Author | Author name(s). |
| dateline | Text | Dateline | Date/location information. |
| header | Text | PageHeader | Running headers. |
| footer | Text | PageFooter | Running footers. |
| folio | Text | PageNumber | Page numbers. |
| page-number | Text | PageNumber | Explicit page index (distinct from folio in some contexts). |
| footnote | Text | Footnote | Bottom-of-page notes. |
| figure-caption | Text | Caption | Caption for figures. |
| table-caption | Text | Caption | Caption for tables. |
| reference | Text | BibEntry | Bibliography items. |
| jumpline | Text | JumpLine | “Continued on page X” navigation pointers. |
| flag | Text | SectionHeader | Newspaper section flags (e.g., “Sports”, “Opinion”). |
| advertisement | Image | Advertisement | Commercial content. |
| contact-info | Text | Address | Physical or electronic address information. |
| website-link | Text | Address | URL or web reference. |
| placeholder-text | Text | Instruction | Template text awaiting replacement (forms context). |
Exam/Question Paper Classes
| IndicDLP Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| first-level-question | Text | Body | Top-level exam question. No direct M-vs-M equivalent; closest to M6Doc’s “QuestionNumber” domain grammar. |
| second-level-question | Text | Body | Sub-question (e.g., part a, b, c). |
| third-level-question | Text | Body | Sub-sub-question. |
| answer | Text | Body | Answer text. |
| options | Text | Option | Multiple-choice options. Maps to the Forms Option role. |
Coverage strengths: IndicDLP is one of the richest taxonomies in the layout landscape. It provides hierarchical depth for both headings (3 levels) and lists (3 levels), separates figure/table captions, and includes domain-specific labels for newspapers (flag, jumpline) and exam papers (first/second/third-level-question, answer, options). The JumpLine and Flag roles are rarely covered by other datasets.
Coverage gaps: No explicit classes for Code, OpticalCode (barcodes/QR), form primitives (Field, Selection, Checkbox), Logo, Stamp, or Watermark. QR codes and barcodes are subsumed under figure. No relation annotations (reading order, hierarchy).
Reproducibility
Models
- Six architectures evaluated: YOLOv10x (37M), DocLayout-YOLO (20M), RoDLA (342M), DINO (46M), DiT (86M), Florence-2 (826M)
- Three fine-tuned checkpoints released (YOLOv10x, DocLayout-YOLO, RoDLA) via Zenodo, with inference code
- Original hyperparameters from each model’s source paper were retained
Algorithms
- All models trained using their default hyperparameters as specified in the original papers
- No additional training tricks (warmup, gradient clipping, mixed precision, data augmentation) are reported beyond what the original model papers specify
- Metric: mAP@[.5:.95] (COCO-style, thresholds from 0.5 to 0.95 in steps of 0.05)
Data
- Training split: ~96K images (80%)
- Validation split: ~12K images (10%)
- Test split: ~12K images (10%)
- Stratified split preserving language and domain proportions
- Available on Hugging Face (gated access: requires agreeing to share contact info) and AIKosh (indiaai.gov.in)
- COCO-format JSON annotations (PNG images)
- Images scaled to max 1024px shortest side, 1333px longest side
- Annotation process: maker-checker workflow with 50 annotators/reviewers total (3-4 annotators + 1 reviewer per language); 10 supercheckers validated over 60% of documents; completed in approximately 8 months
- License: MIT (annotations and code); underlying document images are sourced from government bulletins, ebook stores, school textbooks, newspapers, arXiv, and Shodhganga, with per-source licenses not enumerated in the paper
Evaluation
- Primary metric: mAP@[.5:.95]
- Cross-lingual zero-shot evaluation across all 12 languages
- Domain-wise mAP breakdown
- Cross-dataset transfer evaluation on M6Doc, DocLayNet, D4LA
- No error bars, significance tests, or multi-seed runs reported
Hardware
- Training: 8x NVIDIA H100 GPUs
- Total GPU-hours not reported
- No inference latency or throughput benchmarks reported
BibTeX
@inproceedings{nath2025indicdlp,
title={IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing},
author={Nath, Oikantik and Kukkala, Sahithi and Khapra, Mitesh and Sarvadevabhatla, Ravi Kiran},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={23--39},
year={2025},
publisher={Springer},
doi={10.1007/978-3-032-04614-7_2}
}
LADaS 2.0: Diachronic Document Dataset for Semantic Layout Analysis
TL;DR
LADaS 2.0 is an open, diachronic document layout analysis dataset containing 7,254 manually annotated pages from the 17th century to the present. It uses a 36-class taxonomy (13 zone types with subtypes) grounded in the SegmOnto vocabulary and mapped to the Text Encoding Initiative (TEI) standard, making it one of the few DLA datasets designed with document reconstruction workflows in mind. The authors benchmark YOLOv11 across multiple input sizes and model scales, finding that 1280-pixel input with a large model performs best, and that training on the full generic dataset generally outperforms subset-specific fine-tuning.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ : the paper’s headline contribution is the dataset itself: its annotation guidelines, class taxonomy, subset composition, and release as an open resource. The curation pipeline, TEI/SegmOnto alignment, and metadata design occupy the majority of the paper.
Secondary: $\Psi_{\text{Evaluation}}$ : the paper benchmarks YOLOv11 models at three input resolutions and five model scales, plus a comparison against DocYOLO. A second experiment evaluates domain-specific fine-tuning vs. generic training on held-out subsets.
What is the motivation?
Document layout analysis datasets have historically skewed toward born-digital, English-language STEM papers (PubLayNet, DocBank) or narrow modern document types (DocLayNet). Cultural heritage institutions and digital humanities (DH) researchers work with materials that span centuries of typographic evolution, yet few DLA datasets capture this temporal and structural diversity.
Three specific gaps motivate LADaS 2.0:
Diachronic coverage. Existing datasets either target a single era or a single document type. Historical documents exhibit layout conventions (marginal notes, running titles, ornamental graphics, verse structures) that differ substantially from modern PDF layouts. A dataset spanning 1600 to 2024 forces models to generalize across these conventions.
Semantic granularity aligned to document reconstruction. Most DLA datasets use a flat label set (text, title, table, figure). DH workflows need finer distinctions: paragraphs vs. verse groups vs. list items vs. dramatic speeches within the main text zone, as well as separation of footnotes from manuscript additions in margin zones. The SegmOnto vocabulary provides a principled first level, but it lacks subtype specifications needed for full TEI reconstruction.
Interoperability. The DH community adopted SegmOnto as a shared vocabulary, enabling datasets like Gallicorpora and Ajax to be combined. LADaS 2.0 extends this by defining subtypes that map directly to TEI XML elements, bridging the gap between CV-style bounding box annotations and structured text encoding.
What is the novelty?
The core contribution is the dataset and its taxonomy. Several design choices distinguish it from prior work:
TEI-grounded subtype system. LADaS 2.0 extends SegmOnto’s zone types with subtypes selected based on three criteria: (1) a corresponding TEI XML element exists, (2) the element is visually distinguishable in the page image, and (3) the distinction is relevant for downstream document reconstruction. This yields 36 classes organized under 13 zone types. For example, MainZone is subdivided into Head, P (paragraph), Lg (line group/verse), Sp (dramatic speech), List, Entry, Date, Signature, Maths, and Other.
Modular subset architecture. The 7,254 pages are organized into 12 subsets by provenance and content type (Monographies, Catalogues, Theses, Theatre, Magazines, etc.), each with metadata including publication date, domain, and acquisition method. This modularity enables users to compose domain-specific training sets or study cross-domain transfer.
Diachronic metadata. 95% of documents carry a publication year, enabling temporal analysis. The dataset spans from the 17th century (primarily theatre texts from Gallica) to born-digital 21st-century theses and administrative reports.
“Continued” subtype for run-on elements. When a paragraph or entry spans a page break, the continuation on the next page is annotated as [Zone]-Continued. During reconstruction, reading-order heuristics merge these with their predecessor, maintaining document coherence.
New SegmOnto zone types. The authors introduce FigureZone (for code excerpts), FormZone (for magazine forms), and AdvertisementZone (for advertisements), types that were absent from the original SegmOnto vocabulary.
Noise subset (Fingers). A deliberately noisy subset of 100 pages was created by scanning books with visible fingers, background clutter, and bent pages, simulating real-world on-site digitization conditions.
What experiments were performed?
Generic model benchmarks
The authors train YOLOv11 models at three input resolutions (640, 960, 1280 pixels) across five model scales (nano, small, medium, large, extra-large), for 15 configurations total. Three subsets (Theatre, Admin. Rep., Romans-19) are held out from training for the domain-specific experiments. Training uses the Ultralytics library (v8.3.8) with batch size 16, 100 epochs, learning rate 0.01, seed 42, and standard augmentations (rotation, contrast, shear). A single RTX 8000 GPU is used for all models except the 1280-pixel extra-large variant, which requires two.
DocYOLO (a layout-specific adaptation of YOLOv10) is also evaluated as a comparison, trained on two GPUs with the same parameters except a learning rate of 0.02.
The primary metric is mAP@50 (COCO-style).
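The generic training configuration maps almost one-to-one onto an Ultralytics call. A sketch of the best-performing setup; the dataset YAML name is a placeholder:

```python
from ultralytics import YOLO

# Best configuration reported: YOLOv11-large at 1280-pixel input.
model = YOLO("yolo11l.pt")
model.train(data="ladas2.yaml", imgsz=1280, epochs=100,
            batch=16, lr0=0.01, seed=42)
```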
Domain-specific fine-tuning
For each of the three held-out subsets, the authors train three models: (1) generic training with the subset added to the full pool, (2) fine-tuning from raw YOLOv11-L weights on the subset alone, and (3) fine-tuning from the generic model (Exp-1) on the subset alone. The baseline is the generic model that never saw the subset.
What are the outcomes/conclusions?
Input resolution matters. The 1280-pixel input consistently outperforms 960, which outperforms 640. This aligns with annotator experience: marginal notes and small zones are difficult to distinguish at low resolution.
Model scale plateaus. Performance gains diminish beyond the large model. The extra-large model provides minimal improvement and occasionally degrades, possibly due to overfitting given the dataset size. The YOLOv11-L at 1280 pixels achieves the best overall mAP@50.
DocYOLO underperforms. Despite layout-specific architectural modifications, DocYOLO scores below the standard YOLOv11-L, suggesting that input resolution and dataset diversity may matter more than architectural tweaks for this task.
Generic training beats fine-tuning. For 2 of 3 held-out subsets (Theatre, Romans-19), the generic model incorporating the subset data outperforms fine-tuning on the subset alone. The Admin. Rep. subset is a marginal exception (fine-tuned Exp-1-L edges out the generic model by less than one point on mAP@50). Fine-tuning from raw YOLOv11-L weights on small subsets consistently performs worst.
Challenging subsets. The Fingers (noisy scans) and Tech. Magazines (high visual complexity, limited training data) subsets yield the lowest per-subset scores, suggesting that data augmentation strategies targeting noise and complex layouts could help.
The dataset contains 81,766 total annotation instances across 7,254 pages. The most common classes (MainZone-P, MainZone-Head) are well-represented across the full temporal range, while niche classes (QuireMarks, NumberingZone) are sparser but consistently present from the 18th century onward.
Mapping to Matter vs. Meaning Framework
How does LADaS 2.0’s SegmOnto/TEI-aligned taxonomy map to our Matter vs. Meaning framework? LADaS uses a two-level hierarchy: zone types (from SegmOnto) with subtypes (mapped to TEI elements).
MainZone (Body Content)
| LADaS Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| MainZone-Head | Text | SectionHeader | Section headings within the main text. |
| MainZone-P | Text | Body | Standard paragraphs. The most common class. |
| MainZone-Lg | Text | Body | Line groups (verse, poetry). Preserves line breaks as semantic structure. |
| MainZone-Sp | Text | Body | Dramatic speeches (theatre texts). Domain-specific. |
| MainZone-List | Text | ListItem | List structures. |
| MainZone-Entry | Text | Body | Dictionary/catalogue entries. |
| MainZone-Date | Text | Dateline | Date elements in the main flow. |
| MainZone-Signature | Text | Signature | Author signatures at end of sections. |
| MainZone-Maths | Formula | DisplayEquation | Mathematical content in the main zone. |
| MainZone-Other | Text | Body | Catch-all for main content not fitting other subtypes. |
MarginTextZone
| LADaS Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| MarginTextZone-Notes | Text | Footnote | Marginal notes (printed). |
| MarginTextZone-ManuscriptAddendum | Text | Annotation | Handwritten additions in margins. is_handwritten: True. |
Visual & Graphic Zones
| LADaS Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| FigureZone | Text | Code | Code excerpts (repurposed zone type). |
| FigureZone-Head | Text | Caption | Heading/caption for code blocks. |
| FigureZone-Figdesc | Text | Caption | Description of code figures. |
| GraphicZone | Image | Figure | Visual content (illustrations, photographs). |
| GraphicZone-Head | Text | Caption | Heading for graphic zones. |
| GraphicZone-Figdesc | Text | Caption | Description of graphics. |
| GraphicZone-TextualContent | Text | Body | Text embedded within graphic regions. |
| GraphicZone-Part | Image | Figure | Sub-components of a larger graphic. |
| GraphicZone-Decoration | Structure | (decorative) | Ornamental graphics (vignettes, borders). |
Structural & Navigation Zones
| LADaS Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| TableZone | Table | (primitive only) | Tabular structures. |
| TableZone-Head | Text | Caption | Table headings/captions. |
| PageTitleZone | Text | Title | Page-level titles. |
| PageTitleZone-Index | Text | Title | Index page titles. |
| RunningTitleZone | Text | PageHeader | Running titles in margins/headers. |
| NumberingZone | Text | PageNumber | Page and element numbering. |
| StampZone | Image | Stamp | Library stamps, ownership marks. |
| StampZone-Sticker | Image | Stamp | Adhesive labels. |
| FormZone | Table | FormGroup | Form structures (magazines). |
| AdvertisementZone | Image | Advertisement | Commercial advertisements. |
Continued Variants
LADaS introduces a -Continued suffix for elements spanning page breaks (e.g., MainZone-P-Continued, MainZone-Entry-Continued). During reconstruction, reading-order heuristics merge these with their predecessor on the previous page. This addresses a common gap in single-page datasets that cannot represent cross-page continuity.
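To make the reconstruction step concrete, a toy merging pass over regions already sorted into reading order might look like the sketch below. This is our illustration, not the authors' heuristic, which also has to infer the reading order itself.

```python
def merge_continued(pages):
    """Merge "-Continued" fragments with their predecessor on the previous page.

    pages: list of pages, each a list of (label, text) tuples in reading order.
    Toy sketch of the reconstruction heuristic described above.
    """
    merged = []
    for page in pages:
        for label, text in page:
            if label.endswith("-Continued") and merged:
                prev_label, prev_text = merged[-1]
                merged[-1] = (prev_label, prev_text + " " + text)
            else:
                merged.append((label, text))
    return merged

# merge_continued([[("MainZone-P", "It was a dark")],
#                  [("MainZone-P-Continued", "and stormy night.")]])
# -> [("MainZone-P", "It was a dark and stormy night.")]
```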
Coverage strengths: LADaS is one of the few datasets designed with document reconstruction (TEI output) as the explicit goal. The SegmOnto/TEI alignment means every class maps to a specific XML element. The diachronic coverage (1600-2024) and domain-specific labels (dramatic speeches, catalogue entries, manuscript addenda) make it uniquely suited for cultural heritage workflows. The FormZone and AdvertisementZone are new additions to SegmOnto.
Coverage gaps: No explicit classes for Author, Affiliation, Abstract, BibEntry, Footnote (separate from margin notes), ListItem (individual items), or any form interaction primitives (Field, Selection, Checkbox). The dataset is heavily French-language.
Reproducibility
Models
No new architecture is proposed. All experiments use standard YOLOv11 (Ultralytics) and DocYOLO:
- YOLOv11: nano (n), small (s), medium (m), large (l), extra-large (x) variants.
- DocYOLO: layout-adapted YOLOv10.
- No trained weights are released with the paper.
Algorithms
- Training: Ultralytics library v8.3.8. Batch size 16, 100 epochs, learning rate 0.01 (0.02 for DocYOLO), seed 42.
- Augmentations: Rotations, contrast, shear (Ultralytics defaults).
- Evaluation metric: mAP@50 (COCO-style).
- All other parameters are Ultralytics defaults.
Data
- 7,254 annotated pages; train: 5,443, validation: 912, test: 899.
- 12 subsets organized by provenance and content type.
- Annotations in YOLOv8 txt format (`class center_x center_y width height`, normalized coordinates); see the parsing sketch after this list.
- The HuggingFace release contains 6,060 rows (some images may be excluded from the HF version relative to the paper’s count; the GitHub repository is the primary source for the full dataset).
- 36 classes across 13 zone types; full taxonomy in Table 2 of the paper.
- License: CC-BY-4.0 for annotations and code.
- Underlying document licenses vary by subset: Gallica materials are public domain, ARLP (Picard) and INHA/BnF (Catalogues) are partner donations, Persee materials are from the Persee portal, and theses are from theses.fr. The CC-BY-4.0 applies to the annotations; users should verify source-document terms for their use case.
- Annotation process: manual annotation on Roboflow, with pre-annotations from iteratively trained models. Each annotation reviewed by a second annotator; disagreements resolved by guideline authors.
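For reference, the YOLO txt format above decodes as follows (a small helper we wrote for illustration; it is not part of the LADaS tooling):

```python
def parse_yolo_line(line: str, img_w: int, img_h: int):
    """Convert one YOLOv8-format annotation line to pixel xyxy.

    Format: "class center_x center_y width height", coordinates in [0, 1].
    """
    cls, cx, cy, w, h = line.split()
    cx, cy = float(cx) * img_w, float(cy) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    x0, y0 = cx - w / 2, cy - h / 2
    return int(cls), (x0, y0, x0 + w, y0 + h)

# parse_yolo_line("3 0.5 0.5 0.2 0.1", 1000, 1400)
# -> (3, (400.0, 630.0, 600.0, 770.0))
```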
Evaluation
- mAP@50 reported per model configuration and per subset.
- No error bars, confidence intervals, or multi-seed runs reported for the main experiments. The authors note that additional seed tests were conducted for the extra-large model plateau phenomenon.
- Three held-out subsets (Theatre, Admin. Rep., Romans-19) enable controlled domain-transfer evaluation.
- Comparison against DocYOLO uses the same hyperparameters except learning rate.
Hardware
- Single NVIDIA RTX 8000 GPU for most models.
- Two RTX 8000 GPUs for the 1280-pixel extra-large model and for DocYOLO.
- No wall-clock training times, VRAM usage, or cost estimates reported.
BibTeX
@misc{clerice2024diachronic,
title={Diachronic Document Dataset for Semantic Layout Analysis},
author={Thibault Clérice and Juliette Janès and Hugo Scheithauer and Sarah Bénière and Florian Cafiero and Laurent Romary and Simon Gabay and Benoît Sagot},
year={2024},
eprint={2411.10068},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
SciBank: Deep Learning in Time-Frequency Domain for Document Layout Analysis
TL;DR
Grijalva et al. introduce a three-stage pipeline for scientific document layout analysis that converts page regions into spectrograms of intensity histograms and classifies them with a CNN. Alongside the method, they release SciBank, a 74,435-page dataset of scientific articles annotated with 12 region classes, notably including inline equations, a label absent from most competing datasets.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a DLA pipeline that operates in the time-frequency domain (spectrograms of intensity histograms) rather than directly on pixel data.
Secondary: $\Psi_{\text{Resource}}$. The paper also releases SciBank, a large-scale annotated dataset for scientific document layout analysis, available on IEEE DataPort under CC-BY-4.0.
What is the motivation?
Existing document layout analysis approaches typically operate directly on spatial pixel features. The authors argue that transforming page regions into the frequency domain (via spectrograms of horizontal and vertical intensity histograms) can capture structural patterns more compactly, enabling a lighter-weight classifier. They also note that prior datasets like PubLayNet, DocBank, and TableBank lack annotations for inline equations, a common and structurally important element in scientific papers.
What is the novelty?
The core idea is a three-stage pipeline:
- Segmentation and spectrogram extraction: Each page is segmented into regions of interest. Horizontal and vertical intensity histograms are computed for each region, then converted into spectrograms via short-time Fourier transform.
- CNN classification: A deep CNN classifies each spectrogram-represented region into one of the 12 layout categories.
- Inline equation detection: A Bag of Visual Words (BOVW) approach with Zernike moments identifies isolated equations within text regions.
The central idea is that projecting spatial structure into frequency space can reduce computational cost while preserving discriminative layout features.
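The transformation itself is compact. A minimal sketch under assumed parameters (the paper's exact window length and histogram normalization were not available to us):

```python
import numpy as np
from scipy.signal import stft

def region_spectrograms(region: np.ndarray, nperseg: int = 64):
    """Spectrograms of a region's intensity histograms (our sketch).

    region: 2D grayscale crop of a page region.
    Returns magnitude spectrograms of the horizontal and vertical
    projection profiles; nperseg is an assumed window length.
    """
    h_profile = region.sum(axis=1).astype(float)  # one value per row
    v_profile = region.sum(axis=0).astype(float)  # one value per column
    _, _, zh = stft(h_profile, nperseg=min(nperseg, len(h_profile)))
    _, _, zv = stft(v_profile, nperseg=min(nperseg, len(v_profile)))
    return np.abs(zh), np.abs(zv)
```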
What experiments were performed?
- The method was evaluated on SciBank (74,435 pages from 11,007 scientific papers).
- The 12 classes are: abstract, text block, caption, keywords, reference, section heading, subsection heading, title, table, figure, isolated equation, and inline equation.
- Annotations were generated automatically and validated by human curators.
- The primary metrics reported are the Adjusted Rand Index (ARI) and Variation of Information (VI), along with overall classification accuracy.
- The authors report comparisons against prior approaches. We were unable to access the full text, so the specific baselines are not enumerated here.
What are the outcomes/conclusions?
The system achieved an overall accuracy of 96.27%, which the authors report as competitive with or exceeding prior methods at lower computational cost. The spectrogram-based representation appears to capture sufficient structural information for region classification without requiring multimodal (text + vision) input.
Limitations
- The dataset is limited to scientific articles, so generalization to other document types (forms, reports, newspapers) is untested.
- The annotation pipeline is automatic with human validation rather than fully manual, so annotation quality may vary, particularly for ambiguous regions.
- The evaluation uses ARI and VI rather than the more standard COCO-style mAP metrics, making direct comparison with other DLA benchmarks difficult.
- The 12-class label set is moderately fine-grained but does not include structural relations (e.g., caption-to-figure links) or reading order.
- The paper does not report per-class performance, so it is unclear how well the method handles rare or visually similar classes.
SciBank Dataset Details
| Property | Value |
|---|---|
| Pages | 74,435 |
| Papers | 11,007 |
| Domain | Scientific articles |
| Annotation | Automatic + human validation |
| Classes | 12 |
| Eval split | Unknown |
| Formats | PNG, PDF, CSV |
| Size | ~48 GB |
| License | CC-BY-4.0 |
| Access | IEEE DataPort (free account required) |
Label set: Abstract, Text Block, Caption, Keywords, Reference, Section, Subsection, Title, Table, Figure, Isolated Equation, Inline Equation.
The inclusion of inline equations as a distinct class is SciBank’s most notable differentiator from PubLayNet (5 classes), DocBank (12 classes, token-level), and TableBank (table-only).
Note on authorship: The IEEE DataPort entry lists seven authors (adding Carla Parra and Marco Gallardo to the five on the IEEE Access paper), suggesting additional contributors were involved in the dataset curation and release.
Reproducibility
Models
- The classification model is a deep CNN operating on spectrogram inputs. Specific architecture details (layer count, parameter count) are described in the paper but were not available without full-text access at the time of this review.
- No pretrained weights appear to be publicly released.
Algorithms
- Spectrograms are derived from horizontal and vertical intensity histograms via short-time Fourier transform.
- Inline equation detection uses Bag of Visual Words with Zernike moments.
- Training details (optimizer, learning rate, batch size) are in the paper.
Data
- Source: 11,007 scientific papers (source repositories not specified in available metadata).
- Annotations generated automatically, then validated by human curators.
- Available on IEEE DataPort under CC-BY-4.0.
- Evaluation split structure is unclear from available metadata.
Evaluation
- Metrics: Adjusted Rand Index (ARI), Variation of Information (VI), overall accuracy.
- These are not the standard COCO-style mAP metrics used by most modern DLA benchmarks, which limits comparability.
- Per-class breakdowns were not available in the metadata reviewed.
Hardware
- Not specified in available metadata.
BibTeX
@article{grijalva2021deep,
author={Grijalva, Felipe and Santos, Erick and Acu\~{n}a, Byron and Rodr\'{i}guez, Juan Carlos and Larco, Julio C\'{e}sar},
journal={IEEE Access},
title={Deep Learning in Time-Frequency Domain for Document Layout Analysis},
year={2021},
volume={9},
pages={151254--151265},
doi={10.1109/ACCESS.2021.3125913}
}
TextBite: A Historical Czech Document Dataset for Logical Page Segmentation
TL;DR
TextBite is a dataset of 8,449 manually annotated historical Czech document pages (18th-20th century) for logical page segmentation, the task of grouping document content into semantically coherent units. The dataset covers printed newspapers, dictionaries, and handwritten records, with 78,863 annotated segments. The authors propose evaluating segmentation as a pixel clustering problem using the Rand index over foreground text pixels, removing dependence on OCR accuracy or precise bounding-box geometry. Best baseline: YOLOv11 detection + GNN merging achieves 92.5% Rand index.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The headline contribution is the TextBite dataset itself: 8,449 pages with 78,863 annotated logical segments, publicly released on Zenodo with MIT-licensed code.
Secondary: $\Psi_{\text{Evaluation}}$ (the pixel-clustering evaluation framework and Rand index metric that decouples evaluation from OCR and bounding-box geometry), $\Psi_{\text{Method}}$ (three baseline methods combining detection with relation prediction).
What is the motivation?
Logical page segmentation divides a document page into semantically coherent units, enabling finer text embeddings and better retrieval. Prior approaches fall into two camps, each with limitations:
- Text-based segmentation (e.g., TextTiling): operates on OCR output, so errors in recognition and reading order propagate directly into segmentation quality.
- Object-detection-based methods (e.g., DLAFormer): detect bounding boxes for paragraphs and group them, but evaluation penalizes minor geometric deviations that have no impact on the actual text grouping.
No existing benchmark evaluates logical segmentation in a way that is both OCR-independent and insensitive to irrelevant geometric variation. TextBite fills this gap with a dataset and evaluation framework designed for historical documents, which present additional challenges: mixed printed/handwritten content, multi-column newspaper layouts, ornamental elements, and inconsistent formatting across centuries.
What is the novelty?
Pixel-Clustering Evaluation
Rather than evaluating segmentation through text-level metrics or detection-based mAP, TextBite treats logical segments as clusters of foreground text pixels. The pipeline:
- Intersect human-annotated bounding boxes with OCR-detected textlines (from PERO-OCR)
- Apply adaptive thresholding (301$\times$301 pixel patches) to isolate foreground ink pixels (see the thresholding sketch below)
- Evaluate using the Rand index over foreground pixel pairs only
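A sketch of the thresholding step using OpenCV: the 301-pixel window comes from the paper, while the mean-based method and the offset `C` are our assumptions.

```python
import cv2

# Isolate foreground ink pixels with adaptive thresholding on 301x301 patches,
# mirroring the dataset's ground-truth construction step.
gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
fg_mask = cv2.adaptiveThreshold(
    gray, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,  # local mean per patch (our assumption)
    cv2.THRESH_BINARY_INV,       # dark ink becomes foreground (255)
    301,                         # block size from the paper
    10,                          # offset C: illustrative value
)
```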
The Rand index measures agreement between predicted and ground-truth clusterings over all pixel pairs. For each pair of foreground pixels, the metric checks whether the two assignments agree (both same-segment or both different-segment):
$$\text{RI} = \frac{\text{number of agreeing pixel pairs}}{\text{total number of pixel pairs}}$$
Values range from 0 (no agreement) to 1 (perfect agreement). In practice, when roughly 10% of foreground pixels are misassigned, the Rand index is typically around 0.83.
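Computing the metric directly from the definition is simple, if quadratic in the number of pixels. This brute-force sketch is ours; a real evaluation would use closed-form pair counts or subsample the foreground pixels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index between two clusterings (e.g., segment ids per foreground pixel).

    A pair "agrees" when both clusterings put it in the same segment,
    or both put it in different segments. O(n^2) brute force for clarity.
    """
    n = len(labels_a)
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)

# rand_index([0, 0, 1, 1], [1, 1, 0, 0]) -> 1.0 (identical up to relabeling)
```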
Background pixels are excluded entirely. This means:
- Systems are not penalized for bounding-box shape differences that do not affect text grouping
- OCR quality does not affect the evaluation metric
- Page metadata (headers, footers, edition info) can be freely marked as additional segments without penalty
The TextBite Dataset
- 8,449 pages (7,346 printed + 1,103 handwritten)
- 78,863 annotated segments of logically coherent text
- Historical Czech documents spanning the 18th to 20th centuries
- Domains: newspapers, periodicals, books, dictionaries, handwritten school/organizational records
- Annotation categories: Text region (80,458), Title (48,788), Page number (6,213)
- Relations: directed links between bounding boxes encoding reading order within segments
- Format: extended COCO with a `relations` key; OCR provided in PAGE-XML and ALTO formats
- Test split: 964 pages (779 printed + 185 handwritten)
Annotations were created using Label Studio by a team of students, librarians, and researchers. The test portion was manually verified by researchers; the development portion was reviewed by students.
Baseline Methods
Three approaches, all starting from YOLOv11 detection:
- YOLOv11 alone: each detected bounding box is an independent segment
- YOLOv11 + GNN: a graph neural network performs binary edge classification to merge detected regions; uses 29 geometric node features and 12 edge features (including CZERT text embeddings)
- YOLOv11 + Transformer: relation prediction via exclusive relation scoring with optional image backbone (ResNet-18/50). Input queries are detected bounding boxes represented by 2D sinusoidal positional embeddings. Relation scores between output feature vectors $o_i$ and $o_j$ are computed as:
$$s_{i,j} = FC_a^D(o_i) \cdot FC_b^D(o_j), \quad i, j \in \{1, \dots, N\}$$
where $FC^D$ is a fully connected layer with dimensionality $D$. The highest-scoring query $q_j$ is chosen as the relation target for query $q_i$, forming chains that are grouped into final segments.
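Read literally, the scoring head is two linear projections followed by a dot product over all query pairs. The PyTorch sketch below is our interpretation of the equation; the default dimensions echo the paper's 128-dim transformer but are otherwise illustrative.

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Exclusive relation scoring: s_ij = FC_a(o_i) . FC_b(o_j) (our sketch)."""

    def __init__(self, d_model: int = 128, d_proj: int = 128):
        super().__init__()
        self.fc_a = nn.Linear(d_model, d_proj)
        self.fc_b = nn.Linear(d_model, d_proj)

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        # o: (N, d_model), one output feature vector per detected box
        scores = self.fc_a(o) @ self.fc_b(o).T  # (N, N) pairwise scores
        return scores.argmax(dim=1)             # relation target per query
```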
What experiments were performed?
Detection Stage
YOLOv11 was tested in three sizes (nano, small, medium) at five resolutions (640-1400 pixels). The small variant at 1200px resolution was selected (mAP50: 92.0, mAP50-95: 72.2). Training used batch size 16, learning rate $2 \times 10^{-4}$, Adam optimizer, with early stopping after 20 epochs of no validation improvement.
Merging Stage
| Method | Rand Index | 95% CI |
|---|---|---|
| YOLOv11 (detection only) | 83.9 | [82.6, 85.1] |
| YOLOv11 + Transformer | 86.9 | [85.9, 88.0] |
| YOLOv11 + GNN | 92.5 | [91.6, 93.3] |
Transformer Ablations
Adding image features (via ResNet cross-attention) improved the transformer on validation, with the no-image baseline at 86.5 and image-augmented variants ranging from 87.5 to 88.9 Rand index. Freezing the image backbone had no negative effect, and ResNet-18 performed comparably to ResNet-50. The authors selected the frozen ResNet-18 variant (88.5 validation Rand index) as the final model.
Key Findings
- Pure detection treats each box as independent, missing multi-box segments (columns, title-body associations)
- The GNN’s geometric + text features (via CZERT embeddings) provide strong merging signals
- The transformer underperforms the GNN despite image features, suggesting that explicit pairwise edge classification is more effective than sequence-based relation prediction for this task
- Confidence intervals were computed via bootstrapping
What are the outcomes/conclusions?
- Pixel-clustering evaluation works. By focusing on foreground text pixels and using the Rand index, evaluation becomes independent of OCR quality and robust to inconsequential geometric variation.
- Two-stage detection + merging is effective. The GNN-based merging approach substantially improves over detection-only (83.9 to 92.5 Rand index), demonstrating that relation prediction is critical for logical segmentation.
- Historical documents remain challenging. The dataset includes newspapers with complex multi-column layouts, dictionaries with dense entry structures, and handwritten records with loose formatting.
- The dataset supports multiple tasks. Beyond logical segmentation, TextBite can be used for reading order prediction, title detection and association, and general layout analysis.
Limitations:
- Czech-only (with occasional German/French traces), limiting cross-lingual generalization claims
- Primarily sourced from Czech libraries, so historical printing and layout conventions may not transfer to other traditions
- The pixel-level ground truth is derived semi-automatically (OCR textline intersection + thresholding), not manually drawn at the pixel level
- No comparison to existing segmentation methods beyond the authors’ baselines
Reproducibility
Models
- Detection: YOLOv11s at 1200px resolution, implemented via Ultralytics
- GNN: Residual Gated Graph Convolution (3 layers, 512 dimensions), with Z-score normalized features
- Transformer: 6-layer encoder, 128 hidden dim, 2 attention heads, optional frozen ResNet-18 image backbone
- Checkpoints: baseline model weights available on Zenodo (77.3 MB)
- Code: evaluation scripts and pre-trained weights publicly available at GitHub; full training pipelines for baselines are not included
Algorithms
- YOLOv11: batch size 16, LR $2 \times 10^{-4}$, Adam, early stopping (20 epochs patience)
- GNN: batch size 16, LR $1 \times 10^{-3}$, AdamW, Residual Gated Graph Convolution operator, cosine similarity threshold for edge classification
- Transformer: batch size 4, LR $2 \times 10^{-4}$, cross-entropy loss on relation scores, inputs padded to 80 tokens
- Text features: CZERT (Czech BERT) masked language model for text embeddings in GNN edge features
- OCR: PERO-OCR for textline detection and transcription (provided with dataset)
Data
- Source: historical Czech documents from multiple Czech libraries
- Size: 8,449 pages total; 6,400 train / 1,000 validation / 964 test (185 handwritten in test)
- Annotation: Label Studio; 10+ annotators (students, librarians, researchers); test set verified by researchers
- Format: extended COCO JSON (bounding boxes + relations), PAGE-XML, ALTO
- License: MIT (code), CC-BY-4.0 (dataset)
- Download: Zenodo (11.7 GB dataset + 218.3 MB test labels + 77.3 MB baseline models)
- Language: Czech (primary), with occasional German and French
Evaluation
- Metric: Rand index over foreground text pixels
- Ground truth construction: annotated bounding boxes intersected with PERO-OCR textlines, then adaptive thresholding (301$\times$301 patches)
- Confidence intervals: 95% CI via bootstrapping
- Baselines: three methods (YOLOv11, YOLOv11+GNN, YOLOv11+Transformer)
- Limitations acknowledged: semi-automatic pixel ground truth, Czech-only, no external method comparisons
- Statistical rigor: bootstrapped confidence intervals reported for all results
Hardware
- Not explicitly reported in the paper
- YOLOv11s is lightweight; GNN and transformer baselines use modest architectures, likely trainable on a single GPU
BibTeX
@inproceedings{kostelnik2025textbite,
title={TextBite: A Historical Czech Document Dataset for Logical Page Segmentation},
author={Kosteln{\'\i}k, Martin and Bene{\v{s}}, Karel and Hradi{\v{s}}, Michal},
booktitle={International Conference on Document Analysis and Recognition Workshops (ICDAR Workshops)},
year={2025}
}
Layout Annotation Strategy & Label Studio Configuration
The Two-Project Workflow
To avoid cognitive overload and ensure high-velocity annotation, we strongly recommend splitting the task into Two Separate Label Studio Projects. Mixing geometry (drawing boxes) with semantics (classifying roles) in a single interface is a “footgun” that destroys the flow state needed for precise bounding box creation and offers poor visual feedback on completion.
Project 1: The “Matter” Scan (Geometry & Primitives)
- Goal: pure visual segmentation.
- Action: Rapidly draw bounding boxes around every distinct layout element.
- Classification: Assign only one of the 10 Visual Primitives (Text, Image, Table, etc.).
- Mindset: “Don’t read the text. If it’s a distinct block, box it.”
- Speed: fast (~2-3 sec/box).
- Output: A dataset of high-quality bounding boxes with generic labels (e.g., all paragraphs are just `Text`).
Project 2: The “Meaning” Refinement (Semantic Roles)
- Input: The “Matter” annotations are imported as Pre-annotations (see the payload sketch after this list).
- Goal: Logical understanding.
- Action: Click existing boxes and change their label.
- Gamification: The goal is to turn “Raw” primitives (e.g., gray `Text` boxes) into specific Roles (e.g., red `Title`, blue `Caption`).
- Visual Feedback: You can instantly see which boxes are still generic and need attention.
- Constraint: Do not move the boxes. The geometry was fixed in Project 1.
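Label Studio accepts pre-annotations as a `predictions` array attached to each task. A minimal payload wired to the configs shown later on this page (`<Image name="image">`, `<RectangleLabels name="label">`) might look like the sketch below; the image URL and coordinates are placeholders.

```python
import json

# One task with a Project-1 box attached as a pre-annotation.
# Label Studio rectangle coordinates are percentages of image size (0-100).
task = {
    "data": {"image": "https://example.com/page_001.png"},  # placeholder URL
    "predictions": [{
        "result": [{
            "from_name": "label",   # matches <RectangleLabels name="label">
            "to_name": "image",     # matches <Image name="image">
            "type": "rectanglelabels",
            "value": {
                "x": 10.0, "y": 20.0,          # top-left corner, in %
                "width": 80.0, "height": 5.0,  # box size, in %
                "rectanglelabels": ["Text"],   # generic primitive to refine
            },
        }]
    }],
}

with open("project2_tasks.json", "w") as f:
    json.dump([task], f, indent=2)
```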
Strategies for Speed & Consistency
1. The “Matter-First” Mental Model
When annotators try to do both steps at once, they often subconsciously alter the geometry to fit their semantic opinion (e.g., excluding “Figure 1:” from a caption box because they think it’s “meta-data” or splitting a paragraph because it contains a variable).
Rule: Geometry is objective. Meaning is subjective. Determine the geometry first.
2. Level 1: Visual Primitives (The “Matter”)
- Task: Draw bounding boxes and classify only the 10 Primitives.
- Overlap Rule: Generally, primitives should not overlap (they are competing hypotheses).
- Exception: Inlines (The Span vs Block Model). Small `Formula`, `OpticalCode`, or `Image` (icon) regions inside a paragraph are treated as “children.” You must draw the parent `Text` block around the whole paragraph AND the child boxes inside it.
- Guidance: “Ignore semantic meaning. Focus on processing needs. If it is a Matrix, it’s a Table. If it implies user interaction, it’s a Field.”
- Outcome: High-precision bounding boxes (IoU > 0.90). This step is purely objective and high-agreement.
The 9 Conflict Resolution Rules
When in doubt between primitives during Pass 1, apply these tests in order:
- The “Rich Text vs. Rendered Art” Rule (Text vs Image)
- Test: Is the text rendered as standard selectable fonts?
- Result: Yes $\to$ `Text`.
- No: If it is rendered as vector shapes/outlines (WordArt), heavily distorted, or heavily stylized (e.g., a “30% OFF” burst), it is `Image`.
- The “Brand Identity” Rule (Logo vs Text)
- Test: Is this text a Company Logo or Wordmark?
- Result: Always `Image`.
- Why: Logos imply specific geometry, color, and branding that standard OCR cannot capture. We treat branding as a visual object, not prose.
- The “Recursion” Rule (Image vs Document)
- Test: Is this an image of a document (screenshot, photo of receipt) or a contained graphic?
- Result: Document $\to$ annotate the inner content (Text, Tables, etc.). Graphic $\to$ `Image` (stop and caption).
- The “Visual Grid” Rule (Table vs Text)
- Test: Can the content be read linearly (left-to-right, wrapping) without losing meaning?
- Result: Yes $\to$ `Text` (e.g., a simple 2-column key-value list like “Name: John”).
- No: If the spatial position (X, Y) defines the relationship (e.g., a matrix of prices), or strict columnar alignment is required to understand it, it is a `Table`.
- Heuristic: 3+ columns usually force a `Table` classification.
- The “Interaction” Rule (Field vs Structure)
- Test: Is a human expected to physically write on or alter this line/box?
- Result: Yes $\to$ `Field`. No $\to$ `Structure`.
- The “Info-Content” Test (Structure vs Image)
- Test: Does the graphical element communicate meaning?
- Result: Yes (e.g., an icon, specific brand pattern) $\to$ `Image`.
- No: If it is purely for visual separation (lines, solid blocks, generic gradients), it is `Structure` (and will be ignored).
- The “Explicit Geometry” Rule (Selection vs Text)
- Test: Does the bounding box for a checkbox include the label text (e.g. “[ ] Yes”)?
- Result: No. The `Selection` box must strictly wrap the geometric toggle (square/circle). The label “Yes” is a separate `Text` box (Role: `Option` or `Label`).
- The “Notation” Rule (Formula vs Image)
- Test: Does it use a standardized notation system (Math/Chem)?
- Result: Yes $\to$ `Formula`. No $\to$ `Image` (use for creative illustrations).
- The “Machine” Rule (OpticalCode vs Image)
- Test: Is it for machine vision?
- Result: Yes $\to$ `OpticalCode`. No $\to$ `Image`.
3. Level 2: Semantic Roles (The “Meaning”)
This happens in Project 2. The annotator reviews the boxes and refines the labels.
- Goal: Full Understanding.
- Task: Assign the final semantic label to the primitives.
- Categories to Label (Full Taxonomy):
- Headings: `Title`, `Subtitle`, `SectionHeader`
- Content: `Abstract`, `Body`, `Blockquote`, `ListItem`, `Figure`, `Chart`, `Diagram`, `Sketch`, `Code`, `ChemScheme`, `DisplayEquation`, `InlineEquation`, `Score`
- Meta: `Author`, `Affiliation`, `Keywords`, `Caption`, `TableNote`, `Credit`, `Dateline`, `PageHeader`, `PageFooter`, `PageNumber`, `LineNumber`, `JumpLine`, `Footnote`, `BibEntry`, `FormulaNumber`, `Address`, `Watermark`, `Annotation`, `Correction`, `Logo`, `Icon`, `Stamp`, `Advertisement`, `Barcode1D`, `Barcode2D`, `TOC`, `Index`, `Calendar`
- Forms: `Label`, `Instruction`, `Value`, `Option`, `Signature`, `Input`, `Checkbox`, `Radio`, `FormGroup`
- Guidance: “Turn every Gray box into a Colored box if a specific role applies. If a `Text` box is truly just standard body text, label it `Body` (Blue) or leave as `Text` (Gray) if `Body` is implied.”
4. Level 3: Attributes (The “Nuance”)
- Task: Select the box and toggle any relevant boolean attributes.
- Properties:
  - `is_handwritten`: The text is handwritten/script (for ICR).
  - `orientation`: The content is rotated (0/90/180/270).
  - `state`: `checked` / `unchecked` (for Checkbox/Radio).
  - `is_filled`: `True` (for Input Fields/Signatures that have content).
  - `is_header`: `True` (for Table Cells that act as headers).
  - `row_span` / `col_span`: (Integer) Defines cell topology for Tables.
5. Level 4: Relations (The “Structure”)
- Task: Create directed edges between boxes to reconstruct the logical graph.
- Tool: Use the Relation Mode (Alt + Click Source $\to$ Click Target).
- Relations to Label:
  - `group_with`: Links a `Figure`/`Table` to its `Caption` or `TableNote`.
  - `label_for`: Links a form `Label` to its blank `Input`/`Checkbox` field.
  - `value_of`: Links a form `Label` to the filled `Value`/`Option` selection.
  - `header_of`: Links a `SectionHeader` to the `Body` content it introduces.
  - `row_of` / `col_of`: Explicitly links a cell to its parent `Table` (if geometry is ambiguous).
Global Decision Precedence
Since most CV models require mutually exclusive classes, how do we handle objects with dual natures? These rules override specific primitive/role guidelines.
Navigation trumps Content (The “Crop Rule”)
- Scenario: A Company Logo inside the top margin.
- Classification: `PageHeader` (not `Logo`).
- Why: In RAG pipelines, we usually want to “crop out” headers to avoid repetition. If we label it `Logo`, it survives the crop and pollutes the stream.
Function trumps Semantics (The “Usage Rule”)
- Scenario: A handwritten signature on a line.
- Classification: `Signature` (not just `Text`).
- Why: Its role triggers a specific downstream action (Signature Verification) that generic `Text` does not. (Note: The printed text “Sign Here:” is an `Instruction`.)
Specific trumps Generic (The “Specificity Rule”)
- Scenario: A standard chart in a financial report.
- Classification: `Chart` (not `Figure`).
- Why: `Chart` implies numeric data extraction is possible; `Figure` implies visual description only.
Content trumps Chrome (The “Exclusion Rule”)
- Scenario: A running header “Page 4 of 20” or “CONFIDENTIAL” watermark.
- Classification: `PageNumber` (or `PageHeader`) / `Watermark`.
- Why: These labels act as “Stop Words.” Any region with these roles is deleted from the final text extraction stream. If you mislabel them as `Text`, they will pollute the output.
Layout trumps Logic (The “Visual Grid Rule”)
- Scenario: A Resume “Experience” section or Invoice address block arranged in columns.
- Classification: `Table` (Primitive) $\to$ `FormGroup` (Role).
- Why: Reading order is ambiguous in 2D space (row-major vs. column-major?). If you cannot read it simply top-to-bottom, left-to-right without losing structure, it is visually a `Table`. We resolve the semantic logic later.
Scope trumps Continuity (The “Stop at Header Rule”)
- Scenario: A long form (e.g. ACORD 125) with multiple sections (“Agency”, “Insured”, “Coverages”) that all look like grids.
- Classification: Multiple `Table` boxes, split by the headers.
- Why: A `Table` or `FormGroup` must not cross a `SectionHeader` or a full-width `Structure` line. This ensures downstream parsers receive manageable chunks (e.g., “the Agency table”) rather than one giant, mixed-schema blob.
Implementation: Label Studio Configurations
To support this two-project workflow, we use two different configurations.
Config A: Project 1 (Matter)
Simple, fast, no nested choices. Focus on drawing.
<View>
<Image name="image" value="$image"/>
<RectangleLabels name="label" toName="image">
<!-- Visual Primitives Only -->
<Label value="Text" background="#FFAABB" />
<Label value="Image" background="#BBFFAA" />
<Label value="Table" background="#BBAAFF" />
<Label value="Formula" background="#00FF00" />
<Label value="Structure" background="#CCCCCC" />
<Label value="Music" background="#FF00FF" />
<Label value="OpticalCode" background="#0000FF" />
<Label value="Field" background="#FFCC00" />
<Label value="Selection" background="#FF5500" />
<Label value="Redaction" background="#000000" />
</RectangleLabels>
</View>
Config B: Project 2 (Meaning)
Designed for refining labels and adding attributes. Note that we flatten the taxonomy here to make the “Color Change” explicit (changing Text to Title) but use Choices for boolean attributes like is_handwritten or checked.
<View>
<Image name="image" value="$image"/>
<RectangleLabels name="label" toName="image">
<!-- Visual Primitives (Gray - To be refined) -->
<Label value="Text" background="#CCCCCC" />
<Label value="Image" background="#CCCCCC" />
<Label value="Table" background="#CCCCCC" />
<Label value="Structure" background="#CCCCCC" />
<Label value="Formula" background="#CCCCCC" />
<Label value="Music" background="#CCCCCC" />
<Label value="OpticalCode" background="#CCCCCC" />
<Label value="Field" background="#CCCCCC" />
<Label value="Selection" background="#CCCCCC" />
<Label value="Redaction" background="#CCCCCC" />
<!-- Role: Headings (Red) -->
<Label value="Title" background="#FF0000" />
<Label value="Subtitle" background="#FF4444" />
<Label value="SectionHeader" background="#FF8800" />
<!-- Role: Content (Blue / Teal / Lime) -->
<Label value="Body" background="#4444FF" />
<Label value="Abstract" background="#0000FF" />
<Label value="Blockquote" background="#8888FF" />
<Label value="ListItem" background="#00AAFF" />
<Label value="Code" background="#000088" />
<Label value="Figure" background="#00FFFF" />
<Label value="Chart" background="#00AAAA" />
<Label value="Diagram" background="#008888" />
<Label value="Sketch" background="#006666" />
<Label value="ChemScheme" background="#44AA00" />
<Label value="DisplayEquation" background="#88FF00" />
<Label value="InlineEquation" background="#AAFF44" />
<Label value="Score" background="#CC44FF" />
<!-- Role: Meta (Green / Navy / Orange) -->
<Label value="Author" background="#008800" />
<Label value="Affiliation" background="#008844" />
<Label value="Keywords" background="#008888" />
<Label value="Caption" background="#00FF00" />
<Label value="TableNote" background="#00AA00" />
<Label value="Credit" background="#006600" />
<Label value="Dateline" background="#004400" />
<Label value="PageHeader" background="#AAFF00" />
<Label value="PageFooter" background="#AAFF00" />
<Label value="PageNumber" background="#CCFF00" />
<Label value="LineNumber" background="#CCFF66" />
<Label value="JumpLine" background="#AAFFAA" />
<Label value="Footnote" background="#555500" />
<Label value="BibEntry" background="#888800" />
<Label value="FormulaNumber" background="#BBBB00" />
<Label value="Address" background="#666600" />
<Label value="Watermark" background="#444444" />
<Label value="Annotation" background="#FF0088" />
<Label value="Correction" background="#FF0055" />
<Label value="Logo" background="#002222" />
<Label value="Icon" background="#00AAAA" />
<Label value="Stamp" background="#008888" />
<Label value="Advertisement" background="#00FF88" />
<Label value="Barcode1D" background="#000088" />
<Label value="Barcode2D" background="#0000AA" />
<Label value="TOC" background="#A0522D" />
<Label value="Index" background="#D2691E" />
<Label value="Calendar" background="#CD853F" />
<!-- Role: Forms (Purple / Brown) -->
<Label value="Label" background="#880088" />
<Label value="Instruction" background="#FF00FF" />
<Label value="Value" background="#CC88CC" />
<Label value="Option" background="#BB66BB" />
<Label value="Signature" background="#440044" />
<Label value="Input" background="#660066" />
<Label value="Checkbox" background="#FF5500" />
<Label value="Radio" background="#FF8800" />
<Label value="FormGroup" background="#8B4513" />
</RectangleLabels>
<!-- Mocked for now; still reviewing how to best implement in Label Studio -->
<!-- <Choices name="attributes" toName="image" perRegion="true">
<Choice value="is_handwritten" />
<Choice value="rot_90" />
<Choice value="rot_180" />
<Choice value="rot_270" />
<Choice value="checked" />
<Choice value="unchecked" />
<Choice value="is_filled" />
<Choice value="is_header" />
...
</Choices>
<RelationLabels name="relation" toName="image">
<Label value="group_with" />
<Label value="label_for" />
<Label value="value_of" />
<Label value="header_of" />
<Label value="row_of" />
<Label value="col_of" />
...
</RelationLabels> -->
</View>
Note: Depending on your exact Label Studio version, you may prefer to use a single RectangleLabels block containing both sets, or use the LayoutLMv3 template if you are doing strict taxonomy mapping.
The Matter vs. Meaning Framework: A Unified Taxonomy for Layout
The Ontology Gap: Why Layout Analysis is Hard
The Core Disconnect
A major challenge in Document Layout Analysis is the mismatch between:
- Logical Structure: The hierarchical semantic tree (e.g., Article $\to$ Section $\to$ Nested Section $\to$ Paragraph) defined in document standards like Tagged PDF or JATS.
- Visual Structure: The flat list of bounding boxes (e.g., “Text Block”, “Title”, “Table”) detected by computer vision models.
This gap means that even a top-scoring CV model can fail to reconstruct a valid document because it misses the relationships between elements (e.g., which caption belongs to which figure, or that a text block continues a paragraph from the previous page).
Granularity Mismatch
Models often disagree on what constitutes an “object”:
- Annotation Level: Does a bounding box wrap a whole paragraph (PubLayNet), a single line (frequently in OCR), or individual tokens (DocBank)?
- Geometry: Are regions defined by simple bounding boxes (COCO-style) or complex polygons (PRImA/PAGE-XML)?
- Scope: Do structures persist across pages? Most models are single-page only.
Visual Object Detection Classes
Computer vision datasets typically “flatten” the document hierarchy into a single layer of bounding boxes. The choice of classes dictates what the model can see.
- Coarse (PubLayNet): Generic buckets like `Text`, `Table`, `Figure`. Good for chunking.
- Fine-Grained (DocLayNet): Specific roles like `SectionHeader`, `Caption`, `PageFooter`, or `Footnote`.
- Functional (Forms): Interactive elements like `Checkbox`, `InputLine`, `Signature`.
Specialized Taxonomies
Beyond general detection, specific domains or legacy workflows use distinct taxonomies:
- Pixel-Level Segmentation: Methods for historical manuscripts often use pixel maps rather than boxes to handle complex overlaps. Classes: `Main Text Body`, `Decoration`, `Comment/Marginalia`, `Background`.
- Asset-Focused (Sparse): Some benchmarks (e.g., ICDAR 2017 POD) reduce the space to just `Formula`, `Table`, `Figure` to ensure consistent cross-method evaluation.
Domain-Specific Grammars
Some document types require specialized “sub-grammars” where standard layout classes (like “Figure” or “Text”) are too coarse.
| Domain | Key Primitives / Roles | Relations | Datasets |
|---|---|---|---|
| Charts | AxisTitle, TickLabel, Legend, Mark (Bar/Point) | label_of, axis_of | Benetech, ChartOCR |
| Slides | Bullet, CodeBlock, SlideNumber, Footer, Logo | parent_of | SciPostLayout, D4LA |
| Education | QuestionNumber, Answer, Option, ExamineeInfo | parent_of (Group) | M6Doc |
| Comics | frame, text, face, body | speaker_of (Text $\to$ Character) | Manga109, eBDtheque |
| Flowcharts | Node, Terminal, Process, Decision, Arrow | flow_to (Edge) | FlowChart7k |
Side A: The Logical Ideal (Source Schemas)
Comparison of “hidden ground truth” structures versus visual output.
PDF Tags (Born-Digital ISO 32000)
- text: `P` (Para), `H1`–`H6` (Headings), `Span`, `Quote`, `Code`.
- grouping: `Part`, `Sect`, `Div`, `BlockQuote`, `Caption`, `TOC`.
- lists: `L` (List) $\to$ `LI` (Item) $\to$ `Lbl` (Bullet) + `LBody` (Content).
- tables: `Table` $\to$ `THead`/`TBody`/`TFoot` $\to$ `TR` $\to$ `TH`/`TD`.
<Sect>
<H1>Structure</H1>
<P>Text content.</P>
<L>
<LI><Lbl>•</Lbl><LBody>Item 1</LBody></LI>
</L>
</Sect>
JATS XML (Scientific)
Semantic publishing standard. Key difference: explicit links (captions to figures, refs to bib).
- structure: `front` (meta), `body`, `back` (refs).
- content: `sec`, `fig`, `table-wrap`, `boxed-text`.
<fig id="f1">
<caption><title>Model</title></caption>
<graphic href="fig1.png"/>
</fig>
Side B: The Visual Output (Production Schemas)
Common output formats from OCR engines.
- PAGE-XML / PRImA: Polygon-based. Widely used for historical/complex layouts.
- ABBYY XML: Block-based. `Text`, `Table`, `Picture`, `Barcode`.
- ALTO XML: Layout-based. `PrintSpace` $\to$ `TextBlock`/`Illustration`.
- hOCR: HTML-based. `<div class="ocr_carea">` wrapper.
The Framework: Definitions
To bridge these disparate schemas, we propose the Matter vs. Meaning framework. This framework is not proposed as a perfect schema nor as a one-size-fits-all solution. Instead, it is a practical framing that breaks the problem into something tractable and composable: something that scales to large teams, diverse document types, and evolving use cases without redefining the entire taxonomy every time. The goal is minimal inter-annotator disagreement and maximum composability, even at the cost of some granularity in edge cases.
This framework decouples the physical nature of what is on the page (what an annotator can identify at a glance, without reading) from the semantic function of that content (what it means in context, which requires understanding the document). This improves scalability: we do not create new classes for every edge case (e.g., “Signature”, “Stamp”, “Barcode”). Instead, we compose them from fundamental primitives.
This page defines what to label and why. For practical guidance on how to annotate (bounding box geometry, conflict resolution, two-pass workflow), see the Annotation Guide.
1. Visual Primitives (The “Matter”)
What kind of content is this, visually?
An annotator should be able to assign a primitive without reading the text or understanding the document’s purpose. The question is purely perceptual: “Is this a block of text? A grid? An image? A formula?” This is the fast, low-disagreement pass.
| Primitive | Definition | Typical Downstream | Notes |
|---|---|---|---|
| `Text` | A linear sequence of glyphs or strokes. | OCR / HTR | Includes Prose, Code, Handwriting. Use whenever extracting the string content preserves the primary information. |
| `Table` | A region of 2D content alignment (Grid/Matrix). | TSR | The “Visual Grid Rule”: If the content requires preserving X-axis alignment to be understood (e.g., price lists, dense forms), it is a Table, regardless of border visibility. Simple 1D enumerations (bulleted lists) are Text, even if indented. |
| `Image` | A non-textual pixel map requiring visual description. | VLM / Captioning | The “Rich Text vs. Rendered Art” Rule: If standard font properties (family, size, bold, italic, color) cannot describe the appearance (e.g., 3D effects, gradients, warping, drop caps with illustrations), it is an Image. If it’s just a funky font, it’s Text. |
| `Formula` | A region of complex spatial notation (Math/Chem). | LaTeX / MolVec | Standardized notation systems only. Creative illustrations of atoms are Image. |
| `OpticalCode` | A machine-readable optical data pattern. | Decoder | QR Codes, Barcodes. Distinct from Image because no VLM is used. |
| `Music` | A region of musical notation. | OMR | Sheet music, scores. Treated as a distinct content type in GOT-OCR2.0, which trains on 0.5M sheet music samples (GrandStaff + Verovio). |
| `Selection` | A dedicated geometric region for binary state. | State Classifier | Explicit Geometry Only: Must be a box/circle/polygon intended for marking. Bullets (•, 1.) are part of the Text list item. Dingbats ($\checkmark$) are treated as Text unless they are inside a dedicated box. |
| `Field` | A container awaiting input where interaction is expected. | Detector | Input Lines, Signature Lines. Must be spatially anchored to Text (Label/Instruction). Orphan lines (separators, rules) are Structure. |
| `Redaction` | An intentional visual occlusion. | Masking / Log | Black Bars, White-outs. Signals that content is intentionally hidden. Annotated in SignverOD. |
| `Structure` | Geometric layout artifacts. | Ignored | Separators, Ruled Lines, Background Graphics. Lines that are not anchored to input prompts. |
2. Logical Roles (The “Meaning”)
What semantic function does this primitive serve in the document?
Assigning a role requires understanding context: reading the text, recognizing its position in the document’s structure, or knowing the document type. This is the slower, higher-cognition pass where inter-annotator disagreement is more likely.
A note on multi-primitive roles: Some roles support multiple primitives because the same semantic function can be realized through different visual forms. A PageHeader can be text or a logo image. A Sidebar can contain prose, a data table, or an inset figure. A Code block can be selectable text or a screenshot. In each case, the primitive determines which downstream processor handles extraction (OCR, TSR, VLM, etc.), while the role determines reading-order treatment and structural placement.
| Category | Role | Definition | Supported Primitives |
|---|---|---|---|
| Headings | `Title` | The document root or main headline. | Text |
| | `Subtitle` | Secondary headline or deck. | Text |
| | `SectionHeader` | Structural division (H1-H6). | Text |
| Content | `Abstract` | Summary/Description. | Text |
| | `Body` | Standard prose content. | Text |
| | `Blockquote` | Extended distinct quotation/highlight. | Text |
| | `ListItem` | An individual item in a list. | Text |
| | `Figure` | Visual content to be viewed. | Image |
| | `Chart` | A data visualization (Bar/Line/Pie). | Image |
| | `Diagram` | Technical drawing or schematic. | Image |
| | `Sketch` | Hand-drawn illustration (integral). | Image |
| | `Code` | Source code, pseudocode, or algorithmic description. | Text / Image |
| | `ChemScheme` | Chemical structure. | Formula |
| | `DisplayEquation` | Standalone math block. | Formula |
| | `InlineEquation` | Math embedded in text. | Formula |
| | `Score` | Sheet music or musical score. | Music |
| Meta | `Author` | The document creator(s) in a byline context. | Text |
| | `Affiliation` | Institutional connection of an author. | Text |
| | `Keywords` | Indexing terms or categorization tags. | Text |
| | `Caption` | Meta-text describing another element. | Text |
| | `TableNote` | Explanatory note anchored to a specific table. | Text |
| | `Credit` | Attribution/Source for an element. | Text |
| | `Dateline` | Document-level temporal anchor. | Text |
| | `PageHeader` | Running navigation/meta-data at top. | Text / Image |
| | `PageFooter` | Running navigation/meta-data at bottom. | Text / Image |
| | `PageNumber` | Explicit page index. | Text |
| | `LineNumber` | Explicit line index (Legal/Code). | Text |
| | `JumpLine` | Navigation pointer (“Continued on page X”). | Text |
| | `Footnote` | Explanatory note at bottom of page/text. | Text |
| | `BibEntry` | Bibliography reference/citation. | Text |
| | `FormulaNumber` | Numerical label for an equation. | Text |
| | `Address` | Physical or electronic routing address. | Text |
| | `Watermark` | Faint overlay indicating state. | Text / Image |
| | `Annotation` | User-added marginalia/corrections. | Text / Image |
| | `Correction` | Error correction or patch. | Text / Image |
| | `Logo` | Branding element / Identity. | Image |
| | `Icon` | A symbol or glyph treated as an image. | Image |
| | `Stamp` | Official seal or validation mark. | Image |
| | `Sidebar` | Content outside the main text flow. | Text / Image / Table |
| | `Advertisement` | Commercial content / promos. | Text / Image |
| | `Barcode1D` | One-dimensional barcode (Lines). | OpticalCode |
| | `Barcode2D` | Two-dimensional matrix code (QR/DataMatrix). | OpticalCode |
| | `TOC` | Table of Contents navigation. | Table |
| | `Index` | Index navigation. | Table |
| | `Calendar` | Calendar grid. | Table |
| Forms | `Label` | The question or prompt text. | Text |
| | `Instruction` | Explanatory text guiding input. | Text |
| | `Value` | The content entered into an input. | Text |
| | `Option` | A textual choice (Likert). | Text |
| | `Signature` | A verification mark/zone. | Text / Field |
| | `Input` | A region accepting user entry. | Field |
| | `Checkbox` | Multi-select toggle. | Selection |
| | `Radio` | Single-select toggle. | Selection |
| | `FormGroup` | A logical grouping of form elements. | Table |
3. Attributes (The “Nuance”)
Properties that modify the handling of a primitive without changing its fundamental type.
| Attribute | Values | Function |
|---|---|---|
| `is_handwritten` | `True`, `False` | Flags handwritten vs. printed Text. Visually identifiable. |
| `orientation` | `0`, `90`, `180`, `270` | Rotation of Text/Image. Visually identifiable. |
| `level` | 1-6 | Heading depth for SectionHeader. Partially visual (font size/weight suggest level, but document structure may override). |
| `format` | `line`, `box`, `comb`, `rect` | Geometry variant for Field. Visually identifiable. |
| `state` | `checked`, `unchecked` | Binary state of Selection. |
| `is_filled` | `True`, `False` | Whether a Field contains content. |
| `row_span` / `col_span` | Integer | Defines Table cell topology. |
| `is_header` | `True`, `False` | Distinguishes metadata cells from values in Table. |
| `score` | 0.0 - 1.0 | Model confidence. |
4. Relations (The “Structure”)
Finally, we define the edges that connect these nodes to reconstruct the document graph.
| Relation | Arguments | Definition |
|---|---|---|
| `group_with` | (Figure, Caption) | Links visual content to its description. |
| `label_for` | (Label, Field) | Links a form prompt to its input zone (no value yet). |
| `value_of` | (Label, Value) | Links a prompt to the user’s filled answer ("Name:" $\to$ "John"). |
| `header_of` | (SectionHeader, Body) | Links a header to the content section it introduces. |
| `follows` | (Block, Block) | Reading order: one element immediately precedes another. |
| `continues_from` | (Block, Block) | Cross-page or cross-column continuation of the same logical element (e.g., a paragraph split across pages). See HRDoc for a dataset with explicit cross-page parent-child relations. |
| `refers_to` | (Text, Any) | Explicit cross-reference (e.g., “see Figure 3” $\to$ the Figure). Requires text understanding; not resolvable from layout geometry alone. Typically a Stage 2 task. |
| `row_of` / `col_of` | (Cell, Table) | Explicit grid membership (if not implicit by geometry). |
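Tying the four layers together, one plausible in-memory representation is a flat list of regions plus typed edges. This is our sketch, not a prescribed schema:

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Region:
    """One detected box: primitive is mandatory; role and attributes refine it."""
    id: str
    bbox: tuple[float, float, float, float]         # (x0, y0, x1, y1) in pixels
    primitive: str                                  # e.g. "Text", "Table", "Selection"
    role: Optional[str] = None                      # e.g. "Title", "Checkbox"
    attributes: dict = field(default_factory=dict)  # e.g. {"state": "checked"}

@dataclass
class Relation:
    """A directed edge in the document graph."""
    kind: str     # e.g. "group_with", "value_of", "continues_from"
    source: str   # Region.id
    target: str   # Region.id

# A page is then a small graph: nodes plus directed edges.
page = {
    "regions": [
        Region("r1", (50, 40, 560, 90), "Text", role="Title"),
        Region("r2", (50, 120, 560, 700), "Text", role="Body"),
    ],
    "relations": [Relation("header_of", "r1", "r2")],
}
```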
Operational Heuristics
The “Span vs. Block” Model (Handling Inline Elements)
A major source of ambiguity in layout analysis is the treatment of inline elements like mathematical variables ($x$), small icons ($\to$), or chemical formulas embedded within a sentence.
The Problem: Is the variable $x$ a `Formula` or is it part of the `Text` paragraph? If we label it `Formula`, do we split the paragraph into three boxes (`Text`, `Formula`, `Text`)? This fragmentation destroys the semantic coherence of the sentence.

The Solution: We treat these elements as a separate Enrichment Layer that sits on top of the base text layer.
- Base Layer (The “Block”): The entire paragraph is detected as a single `Text` block.
  - Action: Run standard OCR.
  - Result: “The value of x is 10.” (Note: OCR might misread complex math, but captures the sentence structure).
- Enrichment Layer (The “Span”): The inline equation is detected as a `Formula` span inside the text block.
  - Action: Run specialized LaTeX OCR on the span crop.
  - Result: `$x$`.
- Reconciliation: The pipeline matches the span’s geometry to the text’s character coordinates and injects the high-fidelity result, overwriting the raw OCR.
- Final Output: “The value of $x$ is 10.”
Key Rule: Any primitive fully contained within a Text block is Exempt from NMS. Instead of competing for existence, it is treated as an inline child span of the parent paragraph.
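A minimal sketch of the containment test behind this exemption (our own helper; the tolerance and the child-class whitelist are illustrative):

```python
def contains(parent, child, tol=2.0):
    """True if `child` box lies fully inside `parent` (small pixel tolerance)."""
    px0, py0, px1, py1 = parent
    cx0, cy0, cx1, cy1 = child
    return (cx0 >= px0 - tol and cy0 >= py0 - tol
            and cx1 <= px1 + tol and cy1 <= py1 + tol)

def split_spans(detections):
    """Separate inline child spans from block-level detections.

    detections: list of dicts {"bbox": (x0, y0, x1, y1), "cls": str}.
    Returns (blocks, spans): blocks go through ordinary NMS, while spans
    (fully contained in a Text block) are kept as children of their parent
    paragraph instead of competing with it.
    """
    text_blocks = [d for d in detections if d["cls"] == "Text"]
    blocks, spans = [], []
    for det in detections:
        parent = next((t for t in text_blocks
                       if t is not det and contains(t["bbox"], det["bbox"])), None)
        if parent is not None and det["cls"] in {"Formula", "OpticalCode", "Image"}:
            spans.append({**det, "parent": parent})
        else:
            blocks.append(det)
    return blocks, spans
```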
The Overlap Exceptions (Stamps & Watermarks)
Similarly, some macro-elements physically exist on top of the content but are semantically distinct.
- Stamp (`Image`): Extracts the visual seal (e.g., “APPROVED”), ignoring text underneath.
- Watermark (`Text`): Extracts the overlay text (“CONFIDENTIAL”), identifies the Role: `Watermark`, and removes it from the reading order.
Drop Capitals
A drop cap (or initial) is a large, often decorative first letter spanning multiple lines at the start of a paragraph. Several datasets annotate it as a distinct class (M6Doc, PRImA, YALTAi/Segmonto). We do not define a separate role for it because the drop cap is semantically the first character of the paragraph: its role is Body.
The primitive depends on the Rich Text vs. Rendered Art rule:
- Plain enlarged letter (just a bigger font): `Text`. Standard OCR can extract it.
- Illustrated or decorated initial (gold leaf, figural illustration, intertwined vines): `Image`. The letterform carries visual information that OCR cannot capture.
In both cases, the drop cap’s bounding box should be treated as a child span of the parent paragraph (the same Span vs. Block model used for inline equations). The downstream pipeline merges the extracted character back into the paragraph text.
Critical Case: The “Grid vs. Group” Distinction
A common failure mode is either (A) forcing the model to read text to distinguish Lists from Tables, or (B) being so permissive that an entire form (like ACORD 125) is swallowed as one giant Table.
The Compromise: “Visual Atomicity.” We treat 2D-aligned regions as `Table` (or `Grid`), but strictly enforce Section Boundaries.

The “Stop at Header” Rule:
- A `Table` or `FormGroup` must not cross a `SectionHeader` or a full-width `Structure` line (see the splitting sketch at the end of this subsection).
- Result: An ACORD form is not one big grid. It is 5 separate grids (“Agency”, “Insured”, “Coverages”, etc.), separated by their headers. This ensures the downstream parser receives manageable chunks, not a whole page.
The “Visual Grid” Rule (Inner Content):
- Inside those chunks, if it looks aligned, label it `Table`. Do not worry if it is actually a `List` of records or a `Matrix` of numbers. The downstream parser can easily distinguish a 2-column list from a 10-column matrix once it has the clean chunk.
Vital Inline Elements:
- Tables can feature an inline overlay of key primitive types like `Selection`, `Field`, and `Image`. This allows us to capture checkboxes and input fields inside a table without needing to split it into separate boxes.
Rule of Thumb:
- Is it 2D aligned? $\to$ `Table`.
- Does it cross a Section Header? $\to$ SPLIT IT. Two `Table`s.
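The “Stop at Header” split referenced above reduces to cutting a grid box at the top edge of every full-width header inside it. A toy version of that cut (our helper; coordinates and tolerances are illustrative):

```python
def split_at_headers(table_box, header_boxes):
    """Split one tall grid box at every full-width SectionHeader it crosses.

    Boxes are (x0, y0, x1, y1). Returns one sub-grid per section,
    implementing the "Stop at Header" rule.
    """
    x0, y0, x1, y1 = table_box
    # Top y-coordinates of headers that fall strictly inside the grid.
    cuts = sorted(h[1] for h in header_boxes if y0 < h[1] < y1)
    pieces, top = [], y0
    for cut in cuts:
        if cut - top > 1:  # skip zero-height slivers
            pieces.append((x0, top, x1, cut))
        top = cut
    pieces.append((x0, top, x1, y1))
    return pieces
```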
Why There Is No “ListContainer”
Many datasets (PubLayNet, M6Doc, DocGenome, OmniDocBench) annotate lists as a single bounding box around the entire block. We intentionally omit a ListContainer role for several reasons:
- Ambiguous primitive: A list container is visually just `Text`. An annotator cannot reliably distinguish “a block of list items” from “a paragraph” at a glance; there is no clear visual signature separating them.
- The useful unit is the item: Downstream tasks (chunking, hierarchy recovery) need individual `ListItem` boundaries, not the outer wrapper. Annotating the container encourages skipping the harder, more valuable step.
- Geometry problems: In multi-column layouts, a list can span columns or wrap in ways that make a single rectangular bounding box misleading or impossible.

When mapping datasets that annotate list containers, we map them to `ListItem` and note the granularity mismatch.
Cross-Page and Cross-Column Continuation
The continues_from relation (defined in the Relations table above) handles cases where a paragraph, table, or list splits across a page or column break. In reading order, the second fragment continues_from the first, signaling that they are parts of the same logical element and should be merged during reconstruction.
Several datasets annotate this explicitly. HRDoc provides cross-page parent-child relations for hierarchical document reconstruction. LADaS uses a -Continued suffix on region labels to mark fragments that span page boundaries. DocGenome annotates Identical relations between regions that belong to the same logical entity across pages.
Most single-page detection models ignore continuation entirely, since they process one page at a time with no memory of the previous page. For downstream reconstruction pipelines that stitch pages into a coherent document, however, continues_from is essential for producing correct output. Without it, a paragraph split across two pages becomes two unrelated text blocks.
Domain Case Study: Interactive Forms
Forms are documents with a high density of Interaction Primitives. This section details how the framework handles user input zones and states.
Form Interaction Primitives
| Primitive | Attributes | Roles | Processing Logic |
|---|---|---|---|
| Field | format: line, box, comb, rect | Input, Signature | Line/Box: HTR. Comb: Char-HTR by cell. |
| Selection | state: checked, unchecked | Checkbox, Radio | State Classifier (Empty vs Marked). |
| Text | is_handwritten: True/False | Option, Label, Input | OCR / ICR. |
Critical Distinctions
Signature Lines vs. Input Lines:
- Visual (Primitive): Both are `Field` with `format="line"`. They are physically just lines waiting for ink.
- Semantic (Role): Context determines the role. A field next to “Sign:” is `Role: Signature`. A field next to “Name:” is `Role: Input`.
- Downstream: `Signature` fields trigger Verification/Matching Models. `Input` fields trigger HTR Models.

Explicit vs. Implicit Selections (Checkboxes vs. Likert):
- Explicit (Primitive: `Selection`): A physical box or circle exists. The model detects the geometry and classifies the interior pixel state.
- Implicit (Primitive: `Text` + Role: `Option`): The user circles or strikes through the text “Strongly Agree”.
- Comb: A series of small adjacent boxes, one per character.
- Processing:
  - Box: Standard HTR on the region.
  - Comb: Requires Split-and-Merge: the image is sliced along the cell boundaries, each cell is recognized individually as a single character, and the results are concatenated (see the sketch below).
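A minimal sketch of the comb case, assuming equal-width cells and a hypothetical single-character recognizer `recognize_char`; in practice the cell boundaries may need to be detected from the printed ruling lines rather than assumed uniform:

```python
import numpy as np

def read_comb_field(field_img: np.ndarray, n_cells: int, recognize_char) -> str:
    """Slice a comb field into equal-width cells, recognize each cell as a
    single character, and concatenate the results (Split-and-Merge)."""
    h, w = field_img.shape[:2]
    bounds = np.linspace(0, w, n_cells + 1, dtype=int)  # cell x-boundaries
    chars = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        cell = field_img[:, left:right]
        chars.append(recognize_char(cell))  # single-char HTR per cell
    return "".join(chars)
```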
Binding Relations
To reconstruct a form, we link these primitives (a sketch in code follows the list):
- `group_with(Figure, Caption)`: Groups an image and its description.
- `value_of(Label, Text)`: Links a question to its answer (e.g., “Name:” $\to$ “John Doe”).
- `header_of(SectionHeader, Body)`: Hierarchical grouping.
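A minimal sketch of these relations as typed edges over detected regions; the dataclass names are illustrative, not a reference implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    region_id: str
    primitive: str  # e.g. "Text", "Image", "Field"
    role: str       # e.g. "Label", "Caption", "SectionHeader"

@dataclass(frozen=True)
class Relation:
    kind: str    # "group_with" | "value_of" | "header_of"
    source: str  # region_id
    target: str  # region_id

# Example: bind the question "Name:" to its handwritten answer.
label = Region("r1", "Text", "Label")
answer = Region("r2", "Text", "Input")
edges = [Relation("value_of", label.region_id, answer.region_id)]
```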
Implementation Strategy
The taxonomy above is designed around human annotation: primitives are what an annotator can label quickly and reliably by sight, while roles require deeper understanding. This same principle carries over to model design. In practice, training a single computer vision model to detect all Primitive-Role pairs is often impractical due to class imbalance and visual ambiguity.
In our experience, a Hybrid Detection Strategy is effective, targeting “High-Salience” roles directly while falling back to Primitives for ambiguous content.
The “Visual Salience” Hierarchy
Some roles are visually distinctive enough that both annotators and models can assign them without reading the text. We use this to decide which roles to target directly in Stage 1.
- High Salience (Target Directly): These roles have unique geometric signatures (font size, position, texture).
  - `Title` (vs. Body)
  - `ListItem` (Bullets/Indents)
  - `Caption` (Proximity to Image + Small Font)
  - `PageHeader` / `PageFooter` (Absolute Position)
  - `Footnote` (Separator Line + Bottom Location)
- Low Salience (Target as Primitive): These roles look identical to generic text blocks.
  - `Address` $\to$ Detect as `Text`
  - `Author` $\to$ Detect as `Text`
  - `Body` $\to$ Detect as `Text`
  - `Instruction` $\to$ Detect as `Text`
Recommended Detection Classes
For a robust Layout Model (e.g., RT-DETR), we recommend a simplified class list (~15 classes) that mixes Primitives and High-Salience Roles:
- Textual Anchors: `Title`, `SectionHeader`, `Caption`, `ListItem`, `PageHeader`, `PageFooter`, `Footnote`.
- Visual Primitives: `Table`, `Figure`, `Formula`, `OpticalCode`, `Text` (catch-all).
- Form Primitives: `Input`, `Checkbox`, `Signature`.
The Two-Stage Pipeline
- Stage 1 (Vision): The specific classes (like `Title`) are accepted as-is. The generic classes (like `Text`) are passed to Stage 2.
- Stage 2 (Logic): A Multimodal Model (LayoutLM) or Graph Heuristic acts on the generic `Text` boxes to assign specific roles (`Author`, `Abstract`, `Address`) based on content and context (see the dispatch sketch below).
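A minimal sketch of the dispatch logic, assuming detections arrive as plain dicts and Stage 2 is any callable that maps a generic box to a role; all names here are illustrative:

```python
HIGH_SALIENCE = {"Title", "SectionHeader", "Caption", "ListItem",
                 "PageHeader", "PageFooter", "Footnote"}

def assign_roles(detections: list[dict], classify_role) -> list[dict]:
    """Accept high-salience classes as-is; route generic Text to Stage 2."""
    results = []
    for det in detections:
        if det["label"] in HIGH_SALIENCE:
            role = det["label"]        # Stage 1 role accepted directly
        elif det["label"] == "Text":
            role = classify_role(det)  # Stage 2: multimodal model/heuristic
        else:
            role = det["label"]        # other primitives keep their class
        results.append({**det, "role": role})
    return results
```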
The Recall Advantage: “Surface Everything”
Beyond annotation speed, this separation is critical for system reliability. A common failure mode in fine-grained detection is premature rejection: a model fails to detect a “Section Header” because it looks slightly ambiguous, so it outputs nothing at all.
By simplifying the visual targets to “Matter” (e.g., just Text), we significantly lower the complexity of the visual decision boundaries. This allows the detector to optimize for High Recall, ensuring every patch of ink is captured as a distinct segment.
It shifts the failure mode from “Missing Data” (Vision Failure) to “Mislabelled Data” (Logic Failure). In production, a mislabelled paragraph is easier to fix than a missing one.
Known Limitations
- No spatial relations. The framework captures semantic and logical edges (`group_with`, `header_of`) and reading order (`follows`), but not spatial adjacency (above, below, left, right). GraphDoc defines these, and they matter for form understanding and table reconstruction. We may add spatial relations in a future revision.
- `refers_to` requires text understanding. Cross-references (“see Figure 3”) cannot be resolved from layout geometry alone, making them a Stage 2 task. GraphDoc reports only 16.8% AP for reference relations using visual features, underscoring the difficulty.
- Heritage-specific visual roles are thin. Decorative elements (vignettes, friezes, illuminated initials) map to `Image` or `Structure`, but the role layer does not distinguish art-historically significant ornament from layout chrome. Heritage DLA practitioners should extend the role set for their domain.
- Single-page scope. Most primitives and roles are defined for a single page. Cross-page reconstruction depends on `continues_from` and `follows` relations, which few datasets annotate consistently (see Cross-Page and Cross-Column Continuation above).
Conclusion
The core insight of the Matter vs. Meaning framework is that the two hardest problems in document layout analysis require different kinds of thinking, and forcing annotators (or models) to do both at once is where most taxonomies break down.
Matter asks: “What kind of content is this?” An annotator can answer quickly, by sight, with high agreement. Text looks like text. A table looks like a grid. A formula looks like spatial notation. These judgments are fast, reliable, and largely independent of document type or domain.
Meaning asks: “What role does this content serve?” That requires reading, context, and sometimes domain knowledge. Is this text block an abstract or an introduction? Is this image a figure or a logo? These judgments are slower, more subjective, and more likely to vary across annotators and document types.
By separating these two passes, we get practical benefits at every stage:
- Annotation scales. The fast visual pass can be done by any annotator across any domain. The slower semantic pass can be done by specialists, or deferred to a second stage, or handled by rules and models. New document types don’t require redefining the primitive set.
- Models get simpler targets. A vision model optimizing for “find all text regions” has far simpler decision boundaries than one trying to distinguish “Abstract” from “Body.” This trades classification errors (recoverable) for detection misses (not recoverable).
- Teams share a common language. OCR engineers, data curators, and downstream NLP researchers can coordinate on a shared primitive vocabulary even when they disagree on roles, because the visual layer is stable across domains.
The taxonomy is deliberately not exhaustive. We expect roles to grow as new document types are encountered, and we expect the relations to deepen as cross-page and cross-document reconstruction become better understood. The primitives, by contrast, should remain relatively stable: the ways humans put marks on a page change slowly.
PP-DocLayout: Unified Document Layout Detection
TL;DR
PP-DocLayout is a family of document layout detection models (L/M/S) that predict bounding boxes for 23 layout block classes. The largest variant reports 90.4% mAP@0.5 with 13.39 ms per page on T4 GPU; the smallest variant achieves 70.9% mAP@0.5 at 8.11 ms per page (T4) or 14.49 ms (CPU). Training combines knowledge distillation from GOT-OCR2.0’s visual encoder (for the L backbone) and semi-supervised learning with adaptive per-class thresholds (for M/S variants).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$
Core contribution is a unified detection approach across diverse document types with multiple model variants and training strategies: knowledge distillation for the L backbone and semi-supervised pseudo-labeling with adaptive thresholds for M/S variants.
Secondary: $\Psi_{\text{Resource}}$
Provides three model variants (L/M/S) with public weights and implementation details.
What is the motivation?
Layout detection is positioned as a foundational step for downstream document processing tasks (table recognition, formula recognition, OCR, information extraction) and structured training data generation.
The authors identify three gaps in prior layout detectors:
- Weak generalization beyond academic papers: insufficient coverage of magazines, newspapers, financial reports, and other document types.
- Insufficient granularity for complex layouts: coarse categorization that collapses charts, seals, and formulas into broad buckets.
- Insufficient speed for large-scale and real-time processing requirements.
What is the novelty?
Fine-grained label space: 23 layout categories designed to cover diverse document types and distinguish high-value regions (charts, seals, formula numbers, header/footer images) rather than collapsing them into generic classes.
Knowledge distillation for PP-DocLayout-L backbone: The PP-HGNetV2-B4 student backbone learns from the frozen Vary-VIT-B visual encoder (from GOT-OCR2.0) via feature-level alignment with a learnable projection layer and L2 distillation loss.
Adaptive per-class thresholding for semi-supervised learning: PP-DocLayout-M/S use PP-DocLayout-L as a teacher model to generate pseudo-labels, but instead of a fixed global confidence threshold, they select per-class thresholds by maximizing F1 on labeled validation data for each of the 23 categories.
Throughput focus: The authors emphasize deployment scenarios requiring batch processing at scale, reporting approximately 123 pages per second on T4 GPU when using the PaddleX inference engine.
What experiments were performed?
Main evaluation: Reports mAP@0.5, inference latency on T4 GPU and CPU (Intel Xeon Gold 6271C, 8 threads, FP16), and parameter counts for all three variants.
Qualitative comparison: Visual side-by-side comparisons to DocLayout-YOLO on diverse document types. No direct quantitative comparison due to label set mismatch.
Ablations:
- Knowledge distillation effect on PP-DocLayout-L accuracy.
- Semi-supervised learning effect on PP-DocLayout-M and PP-DocLayout-S accuracy.
Dataset description: Training and evaluation split sizes with per-category instance counts provided in the appendix.
What are the outcomes/conclusions?
Outcomes:
- PP-DocLayout-L: 90.4% mAP@0.5, 13.39 ms (T4), 759.76 ms (CPU), 30.94M parameters.
- PP-DocLayout-M: 75.2% mAP@0.5, 12.73 ms (T4), 59.82 ms (CPU), 5.65M parameters.
- PP-DocLayout-S: 70.9% mAP@0.5, 8.11 ms (T4), 14.49 ms (CPU), 1.21M parameters.
Distillation improves L from 89.3% to 90.4% mAP@0.5. Semi-supervised learning improves M from 73.8% to 75.2% and S from 66.2% to 70.9% (on the Table 4 discrepancy for S, see the reporting ambiguity under Limitations).
Limitations:
Fragile evaluation: The “Unified” claim rests on a very small custom evaluation set of 500 images (approx. 22 images per class on average). Despite using DocLayNet and PubLayNet in the training mix, the authors do not report results on standard public benchmarks (e.g., DocLayNet Test, PubLayNet Val). This makes the 90.4% mAP figure isolated and difficult to compare against the broader SOTA (LayoutLMv3, VGT, DiT).
Qualitative-only baselines: The comparison to DocLayout-YOLO is purely visual. The authors cite “label mismatch” as the reason for avoiding quantitative comparison, but this prevents verifying whether the architecture itself is superior or if gains are solely due to the custom label taxonomy.
IoU looseness: Reporting only mAP@0.5 is permissive for layout analysis, where precise boundary alignment often matters for downstream OCR or reading order recovery. Stricter thresholds (e.g., mAP@0.75 or mAP@0.5:0.95) are not provided.
Reporting ambiguity: The paper’s body text states PP-DocLayout-S improves by “4.7%” with semi-supervised learning, but Table 4 labels the gain as “+3.7” (66.2% to 70.9%). The arithmetic (70.9 - 66.2 = 4.7) supports the body text. This internal inconsistency suggests a typo in Table 4.
Reproducibility
Reproducibility Score: Partially Reproducible
Missing Components:
- Training Data: The 30k-image training set is a mix of public (DocLayNet, PubLayNet) and proprietary data; neither the dataset nor the exact composition ratio is released.
- Distillation Corpus: The 500k-document corpus used to distill the backbone is unreleased, and its origin is unspecified.
- Evaluation Data: The 500-image custom evaluation set used for the 90.4% mAP claim is “self-built” and not public.
- Standard Benchmarks: The authors do not report results on standard public benchmarks (e.g., DocLayNet Test), preventing independent verification of the performance claims.
Compute:
- Training: 8 NVIDIA V100 GPUs (approx. 26 hours for L variant).
- Inference: Benchmarks provided for NVIDIA T4 and Intel Xeon Gold 6271C.
Models
Detector architecture
PP-DocLayout-L:
- Detector: RT-DETR-L
- Backbone: PP-HGNetV2-B4 (15.6M parameters after distillation from Vary-VIT-B)
- Total parameters: 30.94M
PP-DocLayout_plus-L (not described in the paper; details below are from the HuggingFace model card):
- Detector: RT-DETR-L
- Backbone: PP-HGNetV2-B4
- Total parameters: 30.94M
- Classes: 20 (reduced from the standard 23-class set)
PP-DocLayout-M:
- Detector: PicoDet-M
- Total parameters: 5.65M
PP-DocLayout-S:
- Detector: PicoDet-S
- Total parameters: 1.21M
Label set
The standard PP-DocLayout variants (L/M/S) predict bounding boxes for 23 layout block classes. The PP-DocLayout_plus-L variant uses a modified set of 20 classes.
Standard 23-class set:
paragraph_title, image, text, number, abstract, content, figure_title, formula, table, table_title, reference, doc_title, footnote, header, algorithm, footer, seal, chart_title, chart, formula_number, header_image, footer_image, aside_text
Plus-L 20-class set (sourced from the HuggingFace model card, not the paper):
The model card lists 20 categories. The exact mapping of which classes were removed or renamed relative to the standard 23-class set is not fully specified.
Category mapping versus DocLayout-YOLO
The authors provide a mapping table contrasting their 23-class label space with DocLayout-YOLO’s coarser scheme. Examples:
- Page Number is explicitly modeled in PP-DocLayout but mapped to “Abandon” in DocLayout-YOLO.
- Formula is treated as “Isolate Formula” in PP-DocLayout, while many structural elements (header, footer, footnote) are “Abandon” in DocLayout-YOLO.
Data
Labeled layout detection dataset (Training)
Training: 30,000 images annotated with 23 categories.
Composition: A mix of images collected from Baidu image search and public datasets (DocLayNet, PubLayNet). Note: The authors do not specify the ratio of proprietary to public data, nor do they report performance on the standard test splits of the public datasets used in training.
Evaluation: 500 images (approx. 22 images per class).
Document types: Chinese and English academic papers, magazines, newspapers, research reports, exam papers, handwritten notes, contracts, books.
Per-category instance counts (train/eval):
High-volume classes from the appendix include:
- text: 217,257 / 3,342
- formula: 113,145 / 1,961
- paragraph title: 42,158 / 715
- header: 25,001 / 430
- number: 25,217 / 430
Full counts for all 23 categories are provided in Table 5 of the paper.
Distillation pretraining corpus
Used for distilling the PP-HGNetV2-B4 backbone from the Vary-VIT-B teacher: 500,000 document samples across five domains (mathematical formulas, financial documents, scientific literature from arXiv STEM fields, academic dissertations, tabular data from reports and spreadsheets).
The paper does not specify the source or license for this 500k corpus.
Algorithms
Knowledge distillation (PP-DocLayout-L backbone)
Teacher model: Vary-VIT-B visual encoder ($F_{\text{Vary-VIT-B}}$) from GOT-OCR2.0, frozen during student training.
Student model: PP-HGNetV2-B4 backbone ($F_{\text{PP-HGNetV2-B4}}$).
Feature alignment: Teacher features $F_{\text{Vary-VIT-B}} \in \mathbb{R}^{B \times D}$ and student features $F_{\text{PP-HGNetV2-B4}} \in \mathbb{R}^{B \times P}$ are aligned via a learnable linear projection $\varphi: \mathbb{R}^{P} \rightarrow \mathbb{R}^{D}$.
Distillation loss (feature L2):
$$ L_{\text{Distill}} = \frac{1}{B} \sum_{i=1}^{B} \left\lVert F_{\text{Vary-VIT-B}}^{(i)} - \varphi\left(F_{\text{PP-HGNetV2-B4}}^{(i)}\right) \right\rVert_2^2 $$
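A minimal PyTorch sketch of this loss as defined above; the module is a placeholder for illustration, not the authors' released code:

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # phi: R^P -> R^D

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (B, P) pooled student features; f_teacher: (B, D).
        # The teacher is frozen, so we detach it from the graph.
        diff = f_teacher.detach() - self.proj(f_student)
        return diff.pow(2).sum(dim=-1).mean()  # batch mean of squared L2 norm
```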
Distillation training settings:
- Input resolution: 768 $\times$ 768
- Epochs: 50
- Optimizer: AdamW ($\beta_1 = 0.9, \beta_2 = 0.999$)
- Student backbone parameters after distillation: 15.6M
Semi-supervised learning (PP-DocLayout-M/S)
Teacher model: PP-DocLayout-L generates predictions $P(y | x_u) = f_T(x_u; \theta_T)$ over $C = 23$ classes.
Adaptive per-class thresholding: For each class $c$, the threshold $\tau_c^{\ast}$ is selected to maximize per-class F1 on labeled validation data $x_l$:
$$ \tau_c^{\ast} = \arg \max_{\tau \in [0, 1]} F_c(\tau; P(y|x_l)) $$
Pseudo-label assignment: An unlabeled region $i$ in image $x_u$ is assigned a pseudo-label $\hat{y}_{u,i}$ if the prediction exceeds the optimal threshold:
$$ \hat{y}_{u,i} = \begin{cases} 1 & \text{if } P(y_{i,c} | x_u) > \tau_c^{\ast} \text{ for any } c \\ 0 & \text{otherwise} \end{cases} $$
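A minimal sketch of the per-class threshold search, assuming arrays of teacher confidences and binary ground truth for each class on the labeled validation set; the sweep granularity is an assumption:

```python
import numpy as np

def best_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    """Return the tau in (0, 1) maximizing F1 for one class."""
    best_tau, best_f1 = 0.5, -1.0
    for tau in np.linspace(0.05, 0.95, 19):
        pred = scores > tau
        tp = np.sum(pred & (labels == 1))
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau

# One threshold per class, e.g. for C = 23 categories:
# thresholds = {c: best_threshold(val_scores[c], val_labels[c]) for c in range(23)}
```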
Detector training settings
PP-DocLayout-L:
- Epochs: 100
- Learning rate: 0.0001 (constant)
- Batch size: 2 per GPU
- GPUs: 8 NVIDIA V100
- Training time: approximately 26 hours
PP-DocLayout-M:
- Epochs: 100
- Learning rate: 0.02
- Batch size: 2 per GPU
- GPUs: 8
- LR scheduler: CosineDecay
PP-DocLayout-S:
- Epochs: 100
- Learning rate: 0.06
- Batch size: 2 per GPU
- GPUs: 8
- LR scheduler: CosineDecay
Evaluation
Metrics and hardware
Primary metric: mAP@0.5
GPU latency: NVIDIA Tesla T4
CPU latency: Intel Xeon Gold 6271C @ 2.60 GHz, 8 threads, FP16 precision
Main results
| Variant | mAP@0.5 | T4 Latency | CPU Latency | Parameters |
|---|---|---|---|---|
| L | 90.4% | 13.39 ms | 759.76 ms | 30.94M |
| M | 75.2% | 12.73 ms | 59.82 ms | 5.65M |
| S | 70.9% | 8.11 ms | 14.49 ms | 1.21M |
Ablation: Knowledge distillation (PP-DocLayout-L)
Distillation improves the L variant from 89.3% to 90.4% mAP@0.5 (+1.1 percentage points).
Ablation: Semi-supervised learning (PP-DocLayout-M/S)
| Variant | Baseline | With Semi-Supervised | Improvement |
|---|---|---|---|
| M | 73.8% | 75.2% | +1.4 pp |
| S | 66.2% | 70.9% | +4.7 pp (paper Table 4 prints “+3.7”, likely a typo) |
Qualitative comparisons
The authors provide side-by-side visualizations comparing PP-DocLayout to DocLayout-YOLO on diverse document types. Claimed improvements include:
- More granular text hierarchy: separate detection of document title, abstract, and paragraph title.
- Improved detection of headers, footers, and page numbers.
- Inline and block formula detection.
- Handwritten content classified as text rather than figure.
- Separation of charts, seals, and natural images into distinct categories.
Hardware
Throughput claims
The authors emphasize large-scale batch processing scenarios, claiming approximately 123 pages per second on T4 GPU when using the PaddleX inference engine.
Deployment context
Inference performance is framed around two use cases:
- Large-scale data construction: Processing document corpora for training data generation.
- Real-time processing: Low-latency layout detection for interactive applications.
CPU measurements are provided for evaluation scenarios, but primary speed claims focus on T4 GPU performance.
Mapping to Unified Taxonomy
| Dataset Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| paragraph_title | Text | SectionHeader | |
| image | Image | Figure | |
| text | Text | Body | |
| number | Text | PageNumber | |
| abstract | Text | Abstract | |
| content | Table | TOC | “Content” usually refers to Catalogue/TOC in this context. |
| figure_title | Text | Caption | |
| formula | Formula | DisplayEquation | |
| table | Table | Table | |
| table_title | Text | Caption | |
| reference | Text | BibEntry | |
| doc_title | Text | Title | |
| footnote | Text | Footnote | |
| header | Text | PageHeader | |
| algorithm | Text | Code | |
| footer | Text | PageFooter | |
| seal | Image | Stamp | |
| chart_title | Text | Caption | |
| chart | Image | Chart | |
| formula_number | Text | FormulaNumber | Numbering associated with equation. |
| header_image | Image | PageHeader | |
| footer_image | Image | PageFooter | |
| aside_text | Text | Sidebar | |
BibTeX
@misc{sun2025ppdoclayout,
title={PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction},
author={Ting Sun and Cheng Cui and Yuning Du and Yi Liu},
year={2025},
eprint={2503.17213},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
TL;DR
LayoutLMv3 is a multimodal Transformer for Document AI that replaces CNN-based image embeddings with simple linear patch projections and unifies text and image pre-training through masked language modeling (MLM), masked image modeling (MIM), and a word-patch alignment (WPA) objective. The authors report it is the first multimodal Document AI model to operate without CNN or Faster R-CNN backbones, and it performs competitively on both text-centric tasks (form understanding, receipt parsing, document VQA) and image-centric tasks (document classification, layout analysis) with fewer parameters than prior approaches.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new multimodal pre-training architecture and training recipe. The paper dedicates most of its pages to describing the unified masking objectives (MLM, MIM, WPA), the CNN-free image embedding design, and ablation studies that isolate the contribution of each component. SOTA comparisons across five benchmarks are the primary evidence.
Secondary: None. While the authors release code and model weights, the paper does not frame these as the primary contribution; the models serve as validation of the method.
What is the motivation?
Prior multimodal Document AI models (LayoutLM, LayoutLMv2, DocFormer, SelfDoc) all relied on CNN backbones (ResNet, ResNeXt, Faster R-CNN) to extract visual features. This had several downsides:
- Parameter overhead: CNN backbones added tens of millions of parameters (e.g., 44M for ResNet-101) on top of the Transformer.
- Region supervision: Models using Faster R-CNN required object detection pre-training with bounding box annotations.
- Objective mismatch: Text and image modalities used fundamentally different pre-training objectives (discrete token prediction for text vs. pixel reconstruction or region feature regression for images), making cross-modal alignment difficult.
The gap between text objectives (discrete, symbolic) and image objectives (continuous, pixel-level) hindered multimodal representation learning. The authors sought to unify the pre-training approach across modalities.
What is the novelty?
CNN-free Image Embedding
LayoutLMv3 replaces CNN/Faster R-CNN image encoders with a linear projection of image patches, following ViT and ViLT. A document image is resized to $224 \times 224$, split into $16 \times 16$ patches yielding $M = 196$ patch tokens, and each patch is linearly projected to $D$ dimensions. This adds only 0.6M parameters versus 44M+ for CNN backbones.
Unified Masking Objectives
The model is pre-trained with three objectives:
Masked Language Modeling (MLM): 30% of text tokens are masked using span masking ($\lambda = 3$). The objective minimizes:
$$L_{\text{MLM}}(\theta) = -\sum_{l=1}^{L'} \log p_\theta(y_l \mid X^{M'}, Y^{L'})$$
Masked Image Modeling (MIM): Approximately 40% of image tokens are masked with blockwise masking. Target labels come from a discrete VAE image tokenizer (vocabulary size 8,192, initialized from DiT). The objective minimizes:
$$L_{\text{MIM}}(\theta) = -\sum_{m=1}^{M'} \log p_\theta(x_m \mid X^{M'}, Y^{L'})$$
Word-Patch Alignment (WPA): A binary classification task predicting whether a text token’s corresponding image patch is masked. Only unmasked text tokens are used:
$$L_{\text{WPA}}(\theta) = -\sum_{l=1}^{L-L'} \log p_\theta(z_l \mid X^{M'}, Y^{L'})$$
The total loss is $L = L_{\text{MLM}} + L_{\text{MIM}} + L_{\text{WPA}}$.
The key insight is that both MLM and MIM now operate over discrete token vocabularies, making the pre-training objectives symmetric across modalities. WPA provides fine-grained cross-modal alignment by leveraging the natural correspondence between text words and their spatial image patches in documents.
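A minimal sketch of how WPA targets could be constructed under this setup, assuming word boxes are given in pixels on the resized 224 x 224 image; the box-center convention is an assumption, not taken from the released code:

```python
def wpa_targets(token_boxes, patch_mask, image_size=224, patch=16):
    """token_boxes: (x0, y0, x1, y1) tuples in pixels; patch_mask: 14x14
    boolean grid, True where the image patch was masked for MIM.
    Returns one binary label per token: 1 = its covering patch is masked."""
    grid = image_size // patch  # 14 patches per side
    labels = []
    for x0, y0, x1, y1 in token_boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        col = min(int(cx // patch), grid - 1)
        row = min(int(cy // patch), grid - 1)
        labels.append(int(patch_mask[row][col]))
    return labels
```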
Segment-Level Layout Positions
Instead of word-level bounding boxes (as in LayoutLM/LayoutLMv2), LayoutLMv3 uses segment-level layout positions where words in the same text segment share 2D coordinates. This follows StructuralLM’s observation that words within a segment share semantic meaning.
What experiments were performed?
Benchmarks
The authors evaluated on five public benchmarks spanning both text-centric and image-centric tasks:
| Benchmark | Task Type | Task | Metric |
|---|---|---|---|
| FUNSD | Text-centric | Form understanding (sequence labeling) | F1 |
| CORD | Text-centric | Receipt key info extraction (sequence labeling) | F1 |
| DocVQA | Text-centric | Document visual QA (extractive) | ANLS |
| RVL-CDIP | Image-centric | Document image classification (16 classes) | Accuracy |
| PubLayNet | Image-centric | Document layout analysis (object detection) | mAP@IoU[0.50:0.95] |
Key Results (BASE models)
| Model | Params | FUNSD F1 | CORD F1 | RVL-CDIP Acc | DocVQA ANLS |
|---|---|---|---|---|---|
| LayoutLMv2 | 200M | 82.76 | 94.95 | 95.25 | 78.08 |
| DocFormer | 183M | 83.34 | 96.33 | 96.17 | – |
| LayoutLMv3 | 133M | 90.29 | 96.56 | 95.44 | 78.76 |
Key Results (LARGE models)
| Model | Params | FUNSD F1 | CORD F1 | RVL-CDIP Acc | DocVQA ANLS |
|---|---|---|---|---|---|
| LayoutLMv2 | 426M | 84.20 | 96.01 | 95.64 | 83.48 |
| DocFormer | 536M | 84.55 | 96.99 | 95.50 | – |
| LayoutLMv3 | 368M | 92.08 | 97.46 | 95.93 | 83.37 |
Document Layout Analysis (PubLayNet)
| Model | Framework | Overall mAP |
|---|---|---|
| Mask R-CNN + ResNet-101 | Mask R-CNN | 91.0 |
| DiT$_{\text{BASE}}$ | Cascade R-CNN | 94.5 |
| LayoutLMv3$_{\text{BASE}}$ | Cascade R-CNN | 95.1 |
LayoutLMv3 was used as a feature backbone in a Cascade R-CNN detector with FPN, extracting single-scale features from Transformer layers 4, 6, 8, and 12. It outperformed DiT (the then-concurrent vision-only document Transformer) by 0.6 mAP.
Ablation Study
The ablation (Table 3) isolated each component’s contribution at BASE size, trained on 1M samples for 150K steps:
- Without image embedding (text+layout only, MLM): strong text-centric results but cannot do layout analysis.
- Linear image embedding + MLM only: PubLayNet loss diverges; the model cannot learn visual representations without an image-specific objective.
- + MIM: PubLayNet loss converges (94.38 mAP); image-centric tasks improve.
- + WPA: Consistent improvement across all tasks (94.43 mAP on PubLayNet, 89.78 F1 on FUNSD).
This ablation is the most informative part of the paper: it demonstrates that MIM is essential for learning visual representations from linear patches, and WPA provides complementary cross-modal signal.
Chinese Model
A LayoutLMv3-Chinese model (BASE size) was pre-trained on 50M Chinese document pages, initialized from XLM-R. On the EPHOIE visual information extraction benchmark, it achieved 99.21% mean F1, outperforming all compared methods including StrucTexT (97.95%).
What are the outcomes/conclusions?
Strengths:
- The unified masking approach (MLM + MIM over discrete tokens) is clean and well-motivated. Symmetric objectives across modalities reduce architectural complexity.
- Significant parameter savings: LayoutLMv3$_{\text{BASE}}$ (133M) outperforms LayoutLMv2$_{\text{BASE}}$ (200M) across most benchmarks while using a simpler image encoder.
- The ablation clearly isolates each component’s contribution. The finding that linear patches + MLM alone causes divergence on vision tasks, while MIM resolves this, is an important practical insight.
- Generality across both text-centric and image-centric tasks is well-demonstrated.
Limitations and caveats:
- The FUNSD comparison is not entirely apples-to-apples: LayoutLMv3 and StructuralLM use segment-level layout positions, while most baselines use word-level positions. The authors acknowledge this but the magnitude of the FUNSD gain (90.29 vs. 83.34 for DocFormer) likely reflects both the method and the position granularity choice.
- Pre-training data is IIT-CDIP (11M images), which is English-centric scanned documents. Generalization to born-digital, multilingual, or domain-specific documents is not assessed (except for the Chinese appendix).
- The image tokenizer is initialized from DiT, which itself was pre-trained on IIT-CDIP. The paper does not ablate how much the quality of the image tokenizer matters.
- Layout analysis is only evaluated on PubLayNet (research papers). Performance on more diverse document types (forms, invoices, historical documents) is not tested in the vision-only setting.
- No error bars or variance across runs are reported for any experiments.
- DocVQA LARGE result (83.37 ANLS) slightly trails LayoutLMv2 LARGE (83.48), though the comparison is complicated by the fact that some baselines use extra data.
Reproducibility
Models
- LayoutLMv3$_{\text{BASE}}$: 12-layer Transformer, 12 heads, hidden size 768, FFN intermediate 3072. Total: 133M parameters.
- LayoutLMv3$_{\text{LARGE}}$: 24-layer Transformer, 16 heads, hidden size 1024, FFN intermediate 4096. Total: 368M parameters.
- Image input: $3 \times 224 \times 224$, patch size $16 \times 16$, yielding 196 image tokens.
- Text input: BPE tokenized, max sequence length 512.
- Text embeddings initialized from RoBERTa. Image tokenizer initialized from DiT (vocabulary size 8,192).
- Pre-trained checkpoints (base, large, base-chinese) are publicly available on Hugging Face under CC-BY-NC-SA-4.0.
Algorithms
- Optimizer: Adam, $\beta_1 = 0.9$, $\beta_2 = 0.98$, weight decay $1 \times 10^{-2}$.
- Pre-training: 500K steps, batch size 2,048.
- BASE: LR $1 \times 10^{-4}$, warmup 4.8% of steps.
- LARGE: LR $5 \times 10^{-5}$, warmup 10%.
- MLM masking: 30% of text tokens, span masking with Poisson $\lambda = 3$.
- MIM masking: ~40% of image tokens, blockwise masking strategy from BEiT.
- Attention stabilization: PB-Relax technique from CogView with $\alpha = 32$.
- Distributed training, mixed precision, gradient accumulation, gradient checkpointing (for layout analysis).
Data
- Pre-training: IIT-CDIP Test Collection 1.0 (11M scanned document images out of available 42M pages). English-centric. Access requires a formal data use agreement through the University of California, San Francisco (UCSF) Industry Documents Library; the dataset is not freely downloadable.
- Chinese pre-training: 50M Chinese document pages collected from publicly available digital-born documents following Common Crawl principles. No public download link or further specification is provided; this data is not reproducible without additional information.
- OCR is obtained from an off-the-shelf toolkit (Microsoft Read API for fine-tuning benchmarks). This is a commercial API, which introduces a cost dependency for reproduction.
- No image augmentation is applied during pre-training, consistent with prior LayoutLM models.
Evaluation
- FUNSD: 149 train / 50 test documents. F1 on semantic entity labeling. Fine-tuned for 1,000 steps, LR $1 \times 10^{-5}$, batch size 16.
- CORD: 800 train / 100 val / 100 test receipts. F1 on key information extraction. Fine-tuned for 1,000 steps, LR $5 \times 10^{-5}$, batch size 64.
- DocVQA: 10,194 train / 1,286 val / 1,287 test images. ANLS on test set (submitted to official leaderboard). BASE fine-tuned for 100K steps, LR $3 \times 10^{-5}$, batch size 128, warmup 4.8%. LARGE fine-tuned for 200K steps, LR $1 \times 10^{-5}$, batch size 32, warmup 10%.
- RVL-CDIP: 320K train / 40K val / 40K test. Overall classification accuracy. Fine-tuned for 20,000 steps, LR $2 \times 10^{-5}$, batch size 64. OCR from Microsoft Read API.
- PubLayNet: 335,703 train / 11,245 val. mAP@IoU[0.50:0.95] on validation set. Fine-tuned for 60,000 steps using AdamW, LR $2 \times 10^{-4}$, batch size 32, weight decay 0.05, 1,000 warmup steps. Uses Cascade R-CNN with FPN via Detectron2. No flipping or cropping augmentation.
- No error bars, confidence intervals, or multi-run statistics reported for any benchmark.
- FUNSD comparisons are complicated by the segment-level vs. word-level position distinction.
Hardware
- Not explicitly reported. The scale of pre-training (batch size 2,048 for 500K steps on 11M images) implies multi-GPU clusters, likely 32+ V100 or A100 GPUs, but no concrete figures are given.
- No training time, GPU-hours, or cost estimates provided.
BibTeX
@inproceedings{huang2022layoutlmv3,
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
author={Huang, Yupan and Lv, Tengchao and Cui, Lei and Lu, Yutong and Wei, Furu},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
pages={4083--4091},
year={2022},
doi={10.1145/3503161.3548112}
}
OmniDocBench: Benchmarking Diverse PDF Document Parsing
TL;DR
OmniDocBench is a document parsing benchmark with 981 annotated PDF pages across 9 document types (academic papers, textbooks, newspapers, handwritten notes, etc.), 19 layout categories, and 15 attribute labels. It introduces a multi-level evaluation pipeline with an Adjacency Search Match algorithm that addresses paragraph boundary mismatches between model predictions and ground truth. Evaluations of pipeline tools and VLMs reveal that pipelines excel on structured layouts while VLMs generalize better to degraded or unconventional documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (Benchmark dataset). The headline contribution is the dataset itself: 981 pages with comprehensive annotations (bounding boxes, text, LaTeX formulas, HTML/LaTeX tables, reading order, and fine-grained attributes) spanning 9 document types. The paper’s main novelty is the reusable evaluation infrastructure rather than a new model or algorithm.
Secondary: $\Psi_{\text{Evaluation}}$ (Novel evaluation pipeline). The Adjacency Search Match algorithm and multi-level evaluation protocol (end-to-end, component-level, attribute-level) represent a meaningful measurement contribution that addresses the “granularity mismatch” problem in document parsing evaluation.
What is the motivation?
The authors identify three gaps in existing document parsing benchmarks:
- Task misalignment: Benchmarks like PubLayNet and DocLayNet measure layout detection via mAP (bounding box accuracy). RAG systems and LLM data pipelines need valid Markdown or text output, not just correctly drawn boxes.
- Narrow document diversity: Most evaluation sets focus on clean academic papers (e.g., arXiv). Real-world pipelines encounter noisy scans, multi-column newsletters, handwritten notes, exam papers, and financial reports.
- Granularity mismatch: End-to-end VLMs produce continuous text output, while ground truth annotations are stored as separate blocks. Standard text similarity metrics (edit distance, BLEU) fail when prediction paragraph boundaries do not align with ground truth boundaries, penalizing correct content unfairly.
What is the novelty?
OmniDocBench introduces two main contributions:
Hierarchical Taxonomy
The annotation schema covers:
- Block-Level (15 categories): Text paragraphs, headings, tables, figures, charts, display formulas, lists, code blocks, headers, footers, page numbers, TOC, index, captions, and footnotes.
- Span-Level (4 categories): Text lines, inline formulas, subscripts/superscripts, and footnote markers (nested within block-level annotations).
- Attributes (15 labels): 6 page-level attributes (PDF type, layout type, language, watermark, fuzzy scan, colored background) and 9 bbox-level attributes (text language, text background, text rotation, table language, table frame type, merged cells, formulas in tables, colorful tables, rotated tables).
Adjacency Search Match
To handle the paragraph boundary mismatch between model output and ground truth, the authors propose a dynamic matching algorithm. The process operates in two phases:
- Compute a matrix of Normalized Edit Distance (NED) similarity between all prediction and ground truth blocks:
$$ \text{NED}(T_{\text{pred}}, T_{\text{gt}}) = 1 - \frac{\text{EditDistance}(T_{\text{pred}}, T_{\text{gt}})}{\max(|T_{\text{pred}}|, |T_{\text{gt}}|)} $$
Pairs exceeding a similarity threshold are matched directly.
- For unmatched blocks, apply fuzzy matching to detect substring relationships, then iteratively merge adjacent ground truth (or prediction) blocks until NED stops improving.
This approach penalizes fragmentation less severely than strict one-to-one matching while still enforcing semantic correctness. The final reported metrics use Edit Distance (lower is better) rather than the NED similarity score.
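A minimal sketch of the merge phase, assuming `rapidfuzz` for edit distance; the greedy stopping rule here is a simplification of the published algorithm:

```python
from rapidfuzz.distance import Levenshtein  # pip install rapidfuzz

def ned(pred: str, gt: str) -> float:
    """Normalized edit-distance similarity in [0, 1]; 1.0 = identical."""
    if not pred and not gt:
        return 1.0
    return 1 - Levenshtein.distance(pred, gt) / max(len(pred), len(gt))

def merge_adjacent(pred: str, gt_blocks: list[str], start: int) -> int:
    """Merge gt_blocks[start:], one neighbor at a time, while NED improves.
    Returns the index one past the last merged block."""
    merged, best, end = gt_blocks[start], -1.0, start
    while end < len(gt_blocks):
        score = ned(pred, merged)
        if score <= best:
            break  # merging another block stopped helping
        best, end = score, end + 1
        if end < len(gt_blocks):
            merged += " " + gt_blocks[end]
    return end
```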
What experiments were performed?
Data Construction
The dataset was curated through a human-in-the-loop pipeline:
- Sourcing: Over 200k PDFs from Common Crawl, Google, Baidu search engines, and internal data.
- Visual Clustering: ResNet-50 features extracted from pages, clustered with Faiss, and 6,000 visually diverse pages sampled from 10 cluster centers.
- Annotation:
- Pre-annotation: LayoutLMv3 (layout detection), PaddleOCR (text), UniMERNet (formulas), GPT-4o (tables).
- Correction: Human annotators refined bounding boxes, reading order, and character-level content.
- Expert Review: Three researchers reviewed unrenderable formulas and tables using CDM rendering techniques.
- Final Selection: 981 pages balanced across 9 document types.
Evaluation Protocol
Models are evaluated at three levels:
- End-to-End: Full page Markdown text similarity using Normalized Edit Distance.
- Component-Level: Text (Edit Distance), Formula (BLEU and CDM, i.e., Character Detection Matching), Table (TEDS, i.e., Tree-Edit-Distance-based Similarity), Reading Order (Edit Distance).
- Attribute-Level: Performance breakdown by language, rotation, handwriting, background color, table frame type, and other attributes.
Models Evaluated
Three categories of methods:
- Pipeline Tools: MinerU (v0.9.3), Marker (v1.2.3), Mathpix.
- Expert VLMs: GOT-OCR 2.0, Nougat.
- General VLMs: GPT-4o, Qwen2-VL-72B, InternVL2-76B.
What are the outcomes/conclusions?
Key findings from the benchmark:
- Pipeline tools outperform VLMs on standard documents. MinerU achieves the best overall Edit Distance on English text (0.061) and reading order. Pipeline tools benefit from explicit layout segmentation that preserves reading order better than autoregressive generation.
- VLMs generalize better to unconventional formats. GPT-4o and Qwen2-VL perform more robustly on handwritten notes, slides, and visually degraded pages (fuzzy scans, watermarks, colored backgrounds). Their broader pretraining data helps handle long-tail document types that pipeline tools were not specifically trained for.
- VLMs struggle with high-density layouts. On newspapers and dense multi-column pages, VLMs frequently miss content or hallucinate due to input resolution and token length limitations. Pipeline tools, which process each region independently, maintain higher accuracy on these layouts.
- Chinese lags behind English across all methods. Nearly all models show lower accuracy on Chinese pages compared to English, likely reflecting training data imbalances.
- Component-level results reveal specialization. For table recognition, OCR-based models (RapidTable) lead overall. For formula recognition, GPT-4o achieves the highest CDM score (86.8%), while UniMERNet leads on Normalized Edit Distance. DocLayout-YOLO dominates layout detection (47.38 mAP) and powers MinerU’s strong end-to-end performance.
Limitations
- The paper does not report hardware specifications, training costs, or statistical variance (error bars, multiple runs) for the benchmark evaluations.
- The dataset focuses on single-page parsing; multi-page document understanding (e.g., cross-page tables) is not addressed.
- Inline formulas are converted to Unicode for evaluation rather than being assessed in LaTeX format, which limits evaluation granularity for mathematical content.
Mapping to Matter vs. Meaning Framework
How does OmniDocBench’s hierarchical taxonomy map to our Matter vs. Meaning framework?
Block-Level Categories (15)
| OmniDocBench Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Text paragraph | Text | Body | Standard prose content. |
| Heading | Text | Title / SectionHeader | No distinction between document title and section headers. |
| Table | Table | (primitive only) | Tabular regions evaluated via TEDS. |
| Figure | Image | Figure | Photographs, illustrations. |
| Chart | Image | Chart | Data visualizations (bar, line, pie). Notably separated from Figure. |
| Display formula | Formula | DisplayEquation | Block-level math, evaluated via CDM and BLEU. |
| List | Text | ListItem | Annotated as container blocks, not individual items. |
| Code block | Text | Code | Computer source code. |
| Header | Text | PageHeader | Running headers. |
| Footer | Text | PageFooter | Running footers. |
| Page number | Text | PageNumber | Explicit page indices. |
| TOC | Table | TOC | Table of contents. Treated as a grid structure. |
| Index | Table | Index | Back-of-book index. |
| Caption | Text | Caption | Descriptions of figures, tables, charts. |
| Footnote | Text | Footnote | Bottom-of-page notes. |
Span-Level Categories (4)
| OmniDocBench Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Text line | Text | (sub-element) | Line-level segmentation within blocks. |
| Inline formula | Formula | InlineEquation | Math embedded within text. Converted to Unicode for evaluation. |
| Subscript/Superscript | Text | (sub-element) | Positional text variants. |
| Footnote marker | Text | (sub-element) | Reference marks linking text to footnotes. |
Attribute Labels (15)
OmniDocBench defines 6 page-level and 9 bbox-level attributes. These partially map to the M-vs-M Attributes layer: text rotation maps to orientation, text background maps to visual context. Most OmniDocBench attributes (language, frame type, fuzzy scan) are metadata for evaluation stratification rather than structural annotations.
Coverage gaps: No explicit classes for Author, Affiliation, Abstract, BibEntry, Blockquote, or any form primitives (Field, Selection, Checkbox). The Heading class does not distinguish between document title and section headers.
Reproducibility
Data
- Dataset: 981 annotated PDF pages across 9 types (academic papers, textbooks, books, slides, exam papers, financial reports, magazines, newspapers, handwritten notes). Available on Hugging Face under a research-only (non-commercial) license.
- Annotations: Over 20,000 block-level and 70,000 span-level annotations with bounding boxes, text content, LaTeX formulas, HTML/LaTeX tables, reading order, and attribute labels.
- Sourcing pipeline: PDFs collected from Common Crawl, Google, Baidu, and internal data. Visual clustering with ResNet-50 + Faiss for diversity selection.
Evaluation
- Code: Full evaluation toolkit (extraction, matching, metric calculation) released on GitHub under Apache-2.0.
- Metrics: Edit Distance (text, formulas, tables, reading order), CDM (Character Detection Matching for formulas), BLEU (formulas), TEDS (tables).
- Baselines: Pipeline tools (MinerU, Marker, Mathpix), expert VLMs (GOT-OCR, Nougat), general VLMs (GPT-4o, Qwen2-VL-72B, InternVL2-76B). Specific model versions documented (e.g., MinerU v0.9.3, Marker v1.2.3).
Models
- No new model is proposed. The benchmark evaluates existing pipeline tools and VLMs.
Algorithms
- The Adjacency Search Match algorithm is implemented in the open-source evaluation toolkit. No model training is involved.
Hardware
- The paper does not report specific hardware configurations for the benchmark evaluations. Running the VLM baselines (e.g., Qwen2-VL-72B, InternVL2-76B) requires significant GPU resources or API access. The evaluation scripts themselves are lightweight text comparison tools.
BibTeX
@inproceedings{ouyang2024omnidocbench,
title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
author={Ouyang, Linke and Qu, Yuan and Zhou, Hongbin and Zhu, Jiawei and Zhang, Rui and Lin, Qunshu and Wang, Bin and Zhao, Zhiyuan and Jiang, Man and Zhao, Xiaomeng and Shi, Jin and Wu, Fan and Chu, Pei and Liu, Minghao and Li, Zhenxiang and Xu, Chao and Zhang, Bo and Shi, Botian and Tu, Zhongying and He, Conghui},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={24838--24848},
year={2025}
}
DocLayout-YOLO: Real-time Layout Analysis via Synthetic Data & Adaptive Perception
TL;DR
DocLayout-YOLO is a unimodal document layout detector built on YOLOv10 that combines large-scale synthetic pre-training (DocSynth-300K, 300K pages) with a multi-scale receptive field module (GL-CRM). It reports 79.7% mAP on the DocLayNet validation split and 85.5 FPS on an A100 (measured on DocStructBench), outperforming several transformer-based multimodal baselines in both speed and accuracy on the benchmarks tested. Code, weights, and the synthetic dataset are all publicly available.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: a new unimodal architecture (YOLOv10 variant) and training recipe optimized for real-time document layout analysis.
Secondary: $\Psi_{\text{Resource}}$: introduction of DocSynth-300K, a large-scale synthetic dataset generated via a 2D bin-packing algorithm (“Mesh-candidate BestFit”), and DocStructBench, a diverse in-house evaluation benchmark.
What is the motivation?
Document Layout Analysis (DLA) faces a tension between speed and accuracy:
- Multimodal methods (e.g., LayoutLMv3, DiT) achieve high accuracy by fusing visual and textual features but are slow (6-14 FPS on the paper’s benchmarks) due to heavy transformer backbones.
- Unimodal methods (e.g., YOLO, DINO) are fast but struggle with complex documents because they lack:
- diverse pre-training data (existing datasets like PubLayNet are homogeneous and skewed toward academic formats).
- multi-scale perception mechanisms to handle extreme size variance between document elements (e.g., a tiny page number vs. a full-page table).
The authors aim to close this gap by creating a model that matches the speed of unimodal detectors while exceeding the accuracy of heavy multimodal transformers.
What is the novelty?
The paper introduces two complementary innovations:
A. Mesh-candidate BestFit & DocSynth-300K ($\Psi_{\text{Resource}}$)
To address data scarcity, the authors frame document generation as a 2D bin-packing problem rather than a generative diffusion task.
- Algorithm: “Mesh-candidate BestFit” iteratively fills a page grid with elements (Text, Tables, Figures) sampled from a diverse pool sourced from the M6Doc test set (~2,800 pages, 74 categories). It maximizes fill rate (density) while keeping layouts visually aligned. The authors note that M6Doc itself is not open source due to copyright restrictions, which creates a license ambiguity for DocSynth-300K (see Reproducibility).
- Outcome: DocSynth-300K, a 300K-page synthetic dataset with enforced layout and element diversity, intended to prevent the model from overfitting to specific academic formats. (A toy packing sketch follows this list.)
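A toy sketch of the best-fit placement idea, assuming a precomputed list of free grid cells and a pool of element sizes; the published algorithm's candidate generation and diversity sampling are omitted:

```python
def best_fit_layout(cells, elements):
    """cells: free (x, y, w, h) slots on the page grid; elements: (w, h)
    sizes from the element pool. Greedily place the element that fills
    each cell best, maximizing overall fill rate."""
    placements, used = [], set()
    for cx, cy, cw, ch in cells:
        best_i, best_fill = None, 0.0
        for i, (w, h) in enumerate(elements):
            if i in used or w > cw or h > ch:
                continue
            fill = (w * h) / (cw * ch)  # fraction of the cell covered
            if fill > best_fill:
                best_i, best_fill = i, fill
        if best_i is not None:
            used.add(best_i)
            placements.append((best_i, cx, cy))
    return placements
```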
B. Global-to-Local Controllable Receptive Module (GL-CRM) ($\Psi_{\text{Method}}$)
To handle scale variation, the authors modify the YOLOv10 backbone with GL-CRM, a hierarchical module that adapts the receptive field across three levels. Features are extracted at varying dilation rates:
$$F_i = \text{GELU}(\text{BN}(\text{Conv}(X, w, d_i)))$$
These are concatenated across all dilation rates:
$$\hat{F} = \text{Concat}([F_1, F_2, \ldots, F_n])$$
A gating mask $M$ controls feature selection:
$$M = \sigma(\text{GELU}(\text{BN}(\text{Conv}_{gate}(\hat{F}))))$$
The masked features are fused back into the residual stream:
$$X_{CRM} = X + \text{GELU}(\text{BN}(\text{Conv}_{out}(M \otimes \hat{F})))$$
The three levels of GL-CRM are:
- Global level: Large kernels ($k=5$) with dilations ($d=1,2,3$) to capture texture details and local patterns at whole-page scale.
- Block level: Medium kernels ($k=3$) with dilations ($d=1,2,3$) to handle medium-scale document elements.
- Local level: Standard bottleneck (no dilation) to capture fine-grained semantic details.
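A minimal PyTorch sketch of one such gated multi-dilation block, following the equations above; channel sizes and the 1x1 gate convolution are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class CRMBlock(nn.Module):
    def __init__(self, channels: int, kernel: int = 3, dilations=(1, 2, 3)):
        super().__init__()
        # One branch per dilation rate: F_i = GELU(BN(Conv(X, w, d_i)))
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel,
                          padding=d * (kernel // 2), dilation=d),
                nn.BatchNorm2d(channels), nn.GELU())
            for d in dilations)
        n = len(dilations)
        # Gating mask M = sigmoid(GELU(BN(Conv_gate(F_hat))))
        self.gate = nn.Sequential(
            nn.Conv2d(n * channels, n * channels, 1),
            nn.BatchNorm2d(n * channels), nn.GELU(), nn.Sigmoid())
        # Fuse masked features back into the residual stream
        self.out = nn.Sequential(
            nn.Conv2d(n * channels, channels, 1),
            nn.BatchNorm2d(channels), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([b(x) for b in self.branches], dim=1)  # F_hat
        mask = self.gate(feats)                                  # M
        return x + self.out(mask * feats)                        # X_CRM
```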
What experiments were performed?
Training Setup
- Pre-training: YOLOv10-M initialized on DocSynth-300K (30 epochs, LR=0.02, batch=128, image size 1600).
- Fine-tuning: Separately fine-tuned on DocLayNet (LR=0.02, patience=100, image size 1120), D4LA (LR=0.04, patience=100, image size 1600), and DocStructBench (LR=0.04, patience=100, image size 1280).
- Hardware: 8 x A100 GPUs.
Baselines
- Unimodal: YOLOv10, DINO-4scale (ResNet-50 backbone, ImageNet1K pretrained).
- Multimodal: LayoutLMv3, DiT-Cascade (Base/Large), VGT (Da et al., 2023).
Evaluation Metrics
Primary metric is mAP (COCO-style); secondary is AP50. Speed is measured as FPS on a single A100 GPU.
Key Datasets
- D4LA: A 27-class dataset of 11,092 images (8,868 train / 2,224 test) manually annotated from RVL-CDIP across 12 document types, requiring fine-grained class distinction (e.g., `LetterSign`, `Date`, `RegionKV`).
- DocLayNet: A standard diverse layout benchmark containing 80,863 pages from 7 document types, manually annotated with 11 categories.
- DocStructBench: An in-house benchmark (not publicly released) covering Academic, Textbook, Market Analysis, and Financial document types; 7,310 train / 2,645 test images across 10 categories (the appendix reports different totals of 9,082/2,232, an internal inconsistency in the paper; the main-text numbers are what the reported results were trained and evaluated on).
What are the outcomes/conclusions?
Quantitative Performance
DocLayNet:
| Model | FPS | mAP |
|---|---|---|
| DiT-Cascade-L | 6.0* | 72.6% |
| DocLayout-YOLO | 85.5* | 79.7% |
* FPS figures are from the DocStructBench comparison (Table 3 of the paper); the DocLayNet evaluation (Table 2) does not report FPS.
DocLayout-YOLO achieves 79.7% mAP on DocLayNet (evaluated on the validation split), 14.3x faster than DiT-Cascade-L based on the DocStructBench speed comparison, with higher accuracy.
D4LA:
DocLayout-YOLO reports 70.3% mAP on D4LA, outperforming multimodal baselines on this benchmark as well.
DocStructBench:
Overall 78.8% mAP across the benchmark’s four document subsets. On the Financial subset specifically, the model reaches 90.1% mAP. Note that DocStructBench is an in-house evaluation set and is not publicly available for independent verification.
Ablation Studies
Table 1 in the paper isolates each component against a YOLOv10-M baseline (no document-specific pre-training):
- GL-CRM alone: +1.2% on D4LA, +1.0% on DocLayNet.
- DocSynth-300K pre-training alone: +1.2% on D4LA, +2.6% on DocLayNet.
- Full DocLayout-YOLO (GL-CRM + DocSynth): +1.7% on D4LA, +3.0% on DocLayNet, and up to +3.5% on the Textbook subset of DocStructBench.
Limitations
- The bin-packing synthesis is heuristic-based and may not capture the semantic coherence or organic noise of real documents (e.g., text content does not flow logically across elements).
- DocStructBench is not publicly available, so the generalization results on that benchmark cannot be independently verified.
- The model underperforms DiT-Cascade-L on the Market Analysis subset of DocStructBench (69.4% vs 70.8% mAP), suggesting that DocSynth-300K pre-training may not fully cover the most complex layout distributions. The paper does not discuss this gap explicitly.
- The model performs layout detection only; a separate OCR step is still required for text extraction.
Reproducibility
Models
- Backbone: YOLOv10-M with GL-CRM modifications inserted into the backbone feature extraction stages.
- Pre-trained and fine-tuned weights all released on HuggingFace under Apache-2.0: pre-training checkpoint, DocLayNet-finetuned, D4LA-finetuned, and DocStructBench-finetuned variants (see frontmatter artifacts).
- Inference runs within the `ultralytics` Python framework with standard YOLOv10 inference APIs.
Algorithms
- Pre-training: LR=0.02, batch=128, 30 epochs, image size 1600 on DocSynth-300K.
- Fine-tuning (DocLayNet): LR=0.02, patience=100 epochs, image size 1120.
- Fine-tuning (D4LA): LR=0.04, patience=100 epochs, image size 1600.
- Fine-tuning (DocStructBench): LR=0.04, patience=100 epochs, image size 1280.
- The paper does not explicitly report the optimizer type, warmup schedule, gradient clipping, or mixed-precision settings for pre-training.
Data
- DocSynth-300K: 300K synthetic pages generated by the Mesh-candidate BestFit algorithm from a pool sourced from the M6Doc test set specifically (~2,800 pages, 74 element categories). The DocLayout-YOLO paper itself states that M6Doc “is not open source due to copyright restrictions.” DocSynth-300K is therefore built from crops of copyright-encumbered source images; synthetic rearrangement of those crops does not dissolve the underlying copyright. The HuggingFace dataset card lists Apache-2.0 in its metadata, but that claim is legally dubious given the M6Doc provenance. A public inquiry on the HuggingFace discussion page (opened March 2025) has received no response from the authors. Treat the license as unknown.
- DocLayNet: Public benchmark (80,863 pages, 11 categories, 7 document types). The validation split (6,480 images) is used for evaluation, not the test split.
- D4LA: Public dataset; 11,092 images from RVL-CDIP, 27 categories, 12 document types; 8,868 train / 2,224 test. The ModelScope page lists Apache-2.0, but the underlying images come from RVL-CDIP (a subset of the IIT-CDIP tobacco litigation archive), which carries no Apache-2.0 grant. The VGT paper that introduced D4LA makes no license claim beyond “will be made publicly available.” Treat the image license as unknown; the annotation schema may be separately permissible.
- DocStructBench: In-house; NOT publicly available. 7,310 train / 2,645 test images (main-text Section 5.1), 10 categories, covering Academic, Textbook, Market Analysis, and Financial subsets. The appendix reports different totals (9,082/2,232), an internal inconsistency in the paper.
Evaluation
- Primary metric: mAP (COCO-style AP at IoU 0.50:0.95); secondary: AP50.
- Speed: FPS measured on a single A100 GPU (batch size not explicitly stated; single-image inference implied).
- DocLayNet results are on the validation split, not the test split, as stated in Section 5.1 of the paper. This affects direct comparability with papers that report test-split numbers.
- There is an internal inconsistency in the paper between Table 1 and Table 2 for the YOLO-v10 DocLayNet baseline: Table 1 (ablations) reports 76.7 mAP, while Table 2 (method comparison) reports 76.2 mAP. The paper gives no explanation. Ablation deltas in the note use Table 1’s 76.7 baseline.
- Baselines span both unimodal (YOLOv10, DINO-4scale) and multimodal (LayoutLMv3, DiT-Cascade B/L, VGT) models.
- No error bars, seed sensitivity, or multi-run statistics are reported; all results appear to be single runs.
- DocStructBench results cannot be independently reproduced since the benchmark is not public.
Hardware
- Training: 8 x A100 GPUs.
- Inference: 85.5 FPS on a single A100 GPU (measured on DocStructBench; per-benchmark FPS is not reported separately).
- Total GPU-hours and cloud compute cost are not reported.
- No guidance is provided on CPU-only or consumer GPU inference feasibility.
BibTeX
@misc{zhao2024doclayoutyolo,
title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception},
author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},
year={2024},
eprint={2410.12628},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
SciPostLayout: A Layout Dataset for Scientific Posters
TL;DR
SciPostLayout is a dataset of 7,855 manually annotated scientific conference posters collected from F1000Research, all under a CC-BY license. It is the first publicly available dataset designed for layout analysis and layout generation in the scientific poster domain, which the authors find more challenging than existing scientific paper datasets like PubLayNet.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The primary contribution is the SciPostLayout dataset itself: a curated, manually annotated collection of scientific poster images with bounding-box labels across nine element categories, released under an open license. The paper frames its experimental results as baseline benchmarks that accompany the dataset release.
Secondary: $\Psi_{\text{Evaluation}}$. The paper evaluates existing layout analysis and layout generation models on the new dataset to characterize the difficulty of the poster domain relative to prior scientific paper datasets. This benchmarking work is significant but subordinate to the dataset contribution.
What is the motivation?
Scientific conference posters efficiently communicate research findings in a graphical, space-constrained format. Creating a well-designed poster from a paper is labor-intensive work, and automating it is an appealing goal. Yet research on scientific poster generation remained sparse at the time of this paper, largely because no publicly available, properly licensed dataset existed for the task.
Prior work on layout analysis (such as PubLayNet) focused on scientific papers in document form, not posters. Poster layouts differ structurally: figures and tables can appear at arbitrary positions, typography varies widely, and the visual density and arrangement patterns are quite different from journal article pages. Two prior datasets for poster generation existed, but one was not publicly available and the other had unclear licensing, making them unsuitable as shared benchmarks.
This gap motivated the construction of a community resource that could support both layout analysis (detecting element regions in a poster image) and layout generation (producing plausible element arrangements), as well as the more ambitious paper-to-poster generation task.
What is the novelty?
The central contribution is the SciPostLayout dataset. Several aspects distinguish it from prior work:
Scope and domain. All 7,855 items are scientific conference-style posters, not paper pages. This is an important distinction: poster layouts are substantially more variable than paper layouts because fonts, figure sizes, and column arrangements are not constrained by journal templates.
Annotation depth. The authors expanded PubLayNet’s five-category scheme to nine categories to capture the finer-grained structure of posters:
| Category | Contents |
|---|---|
| Title | Paper title |
| Author Info | Author names and affiliations |
| Section | Top-level section headings |
| Text | Body paragraphs |
| List | Bullet points, numbered items, reference blocks |
| Table | Table body |
| Figure | Figure body (including sub-figure panels as a single bounding box) |
| Caption | Captions for tables and figures |
| Unknown | Large-area logos, advertising |
Paired paper-poster data. The dataset includes 100 paper-poster pairs (50 dev, 50 test), all under CC-BY, enabling experiments on generating poster layouts directly from source papers. No prior publicly licensed dataset supported this.
Licensing. All posters and paired papers are under CC-BY (versions 2.0, 3.0, or 4.0), making the dataset suitable for commercial research use. This was a deliberate design choice: the authors collected only from F1000Research and excluded any posters under noncommercial or non-distributable licenses.
What experiments were performed?
The paper benchmarks three tasks using existing models.
Layout Analysis
Two pre-trained document layout detection models were fine-tuned on SciPostLayout’s training split and evaluated on the test split:
- LayoutLMv3 (with Cascade R-CNN detector)
- DiT (with Cascade R-CNN detector)
Both were initialized from base-size checkpoints and fine-tuned until dev-set performance peaked. The evaluation metric is mean average precision at IoU thresholds from 0.50 to 0.95, i.e., $\text{mAP}@\text{IoU}[0.50{:}0.95]$. The Unknown category was excluded due to insufficient instances.
LayoutLMv3 achieved an overall mAP of 66.93; DiT reached 63.77. Both models performed well on Title and Author Info (which are structurally regular and appear at predictable locations on posters) but showed meaningful performance drops compared to their reported PubLayNet results, where mAP values exceed 90. The authors attribute this to the greater diversity of element positions and typography in posters.
Layout Generation
Three layout generation models were evaluated across five conditional generation settings:
- LayoutDM (discrete diffusion, trained from scratch on SciPostLayout)
- LayoutFormer++ (transformer-based, trained from scratch)
- LayoutPrompter (GPT-4-based, few-shot via API; no training needed)
The generation settings vary by what information is given as a condition:
- Gen-T: element type counts only
- Gen-TS: element type counts and sizes
- Gen-R: element types and pairwise relationships
- Completion: a partial layout
- Refinement: a noisy or suboptimal layout
Evaluation metrics are:
- mIoU (maximum IoU, $\uparrow$): the highest IoU between a generated layout and any real layout in the test set
- Alignment ($\downarrow$): how well elements within a layout are aligned with each other
- Overlap ($\downarrow$): the overlapping area between pairs of elements in a layout
- FID (Frechet Inception Distance, $\downarrow$): distributional similarity between generated and real layouts in an embedding space
FID and mIoU both measure similarity to the real layout distribution, but from different angles: mIoU is computed from geometric intersection between individual layouts, while FID is an embedding-based distributional metric. Their rankings across models are not always consistent.
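A minimal sketch of the maximum-IoU metric as described above, assuming each layout is a list of (label, box) pairs; the greedy same-label matching here is a simplification of the matching used in the layout-generation literature:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def layout_iou(gen, real):
    """Greedy same-label matching between two layouts, each a list of
    (label, box) pairs. Returns the mean IoU over matched elements."""
    used, scores = set(), []
    for label, box in gen:
        best, best_j = 0.0, None
        for j, (r_label, r_box) in enumerate(real):
            if j in used or r_label != label:
                continue
            iou = box_iou(box, r_box)
            if iou > best:
                best, best_j = iou, j
        if best_j is not None:
            used.add(best_j)
        scores.append(best)
    return float(np.mean(scores)) if scores else 0.0

def max_iou(gen, real_layouts):
    """mIoU for one generated layout: its best match over the real test set."""
    return max(layout_iou(gen, real) for real in real_layouts)
```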
Results indicate that mIoU was low for all models and less than half the values reported on PubLayNet, confirming that poster layout generation is harder. All models produced well-aligned layouts (low Alignment scores). LayoutPrompter was the most effective at minimizing element overlap; LayoutDM produced layout distributions most similar to the real distribution (lowest FID) but with more overlap. In the Refinement setting, LayoutPrompter substantially outperformed the other models.
Paper-to-Layout
To explore automated poster generation from source papers, the authors implemented two settings using GPT-4:
- Gen-T: GPT-4 extracts element type count constraints from the paper text (via Nougat for PDF parsing), then a layout generation model generates a layout as in the Gen-T generation experiment.
- Gen-P: GPT-4 first summarizes the paper in under 1,000 words, then LayoutPrompter generates a layout from that summary using the 50 dev paper-poster pairs as few-shot examples.
For Gen-T, constraint extraction accuracy was measured with mean absolute error (MAE) between predicted and actual element counts:
$$\text{MAE} = \frac{1}{C} \sum_{c=1}^{C} \left| n_c^{\text{pred}} - n_c^{\text{real}} \right|$$
where $C$ is the number of element categories and $n_c$ is the element count for category $c$. GPT-4 achieved an overall MAE of 1.98, with the largest errors on Text (4.9) and Figure (3.66) categories, which are the most numerous and most variable. Layout generation quality degraded slightly compared to using ground-truth constraints, as expected. In the Gen-P setting (tested only with LayoutPrompter), the model produced well-aligned layouts with very low overlap (0.022), suggesting that LLMs have meaningful potential for this task even though they currently cannot reproduce the exact structure of real poster layouts.
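For concreteness, the MAE is just a per-category absolute difference averaged over the $C$ categories; the counts below are invented purely for illustration:

```python
import numpy as np

# Invented per-category counts for one paper (9 categories), illustration only.
pred = np.array([1, 1, 6, 14, 3, 1, 5, 4, 0])  # GPT-4-extracted constraints
real = np.array([1, 1, 5, 18, 2, 1, 8, 5, 0])  # counts in the actual poster
print(np.abs(pred - real).mean())               # MAE, as in the formula above
```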
What are the outcomes/conclusions?
The authors report that layout analysis on scientific posters is harder than on scientific paper images, with both LayoutLMv3 and DiT showing notable performance drops versus their PubLayNet benchmarks. Layout generation is similarly more challenging: mIoU values for all tested models are less than half of what is observed on PubLayNet. These results suggest that poster layout is a distinct and underexplored problem that warrants dedicated modeling effort.
The paper-to-layout experiments are presented as a proof of concept. GPT-4 can extract plausible element constraints from paper text, and LayoutPrompter can then generate reasonably structured layouts. The generated layouts are not yet close to real poster layouts by quantitative measures, but the results indicate the task is tractable and worth further investment.
A limitation that the authors acknowledge implicitly is domain skew: because posters were sourced exclusively from F1000Research, most of the collected posters are in the biomedical field. This may limit how well models trained on SciPostLayout transfer to posters from other scientific communities (e.g., physics, computer science conferences where PDF poster formats differ).
Mapping to Matter vs. Meaning Framework
How does SciPostLayout’s 9-class poster taxonomy map to our Matter vs. Meaning framework?
| SciPostLayout Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Title | Text | Title | Paper title on the poster. |
| Author Info | Text | Author / Affiliation | Author names and institutional affiliations, combined in a single class. |
| Section | Text | SectionHeader | Top-level section headings on the poster. |
| Text | Text | Body | Body paragraphs. |
| List | Text | ListItem | Bullet points, numbered items, and reference blocks. Annotated as a container (not individual items). |
| Table | Table | (primitive only) | Table body. |
| Figure | Image | Figure / Chart / Diagram | Figure body, including sub-figure panels as a single bounding box. No sub-classification between photographs, charts, and diagrams. |
| Caption | Text | Caption | Captions for tables and figures. |
| Unknown | Image | Logo / Advertisement | Large-area logos and advertising elements. Excluded from evaluation due to insufficient instances. |
Coverage gaps: SciPostLayout does not distinguish Formula, PageHeader, PageFooter, Footnote, BibEntry, or Code. Reference blocks are merged into the List class. The Author Info class combines author names and affiliations into a single label, unlike datasets that separate Author and Affiliation. The Unknown class is a catch-all for non-standard elements.
Reproducibility
Models
No new models are released with this paper. The experiments use LayoutLMv3 (base, 133M parameters), DiT (base, 86M parameters), LayoutDM, LayoutFormer++, and LayoutPrompter (which calls the GPT-4 API, specifically gpt-4-1106-preview). Weights for LayoutLMv3 and DiT are publicly available from Microsoft; the layout generation models are available from their respective repositories.
Algorithms
LayoutLMv3 and DiT were fine-tuned from their base-size pretrained checkpoints using Cascade R-CNN as the object detection head. The checkpoint with the highest dev-set performance was selected for test evaluation. Specific optimizer settings, learning rates, and training durations are not reported in the paper.
LayoutDM and LayoutFormer++ were trained from randomly initialized weights on SciPostLayout’s training split. Detailed hyperparameters are not provided; readers would need to consult the original papers for those models.
For paper-to-layout, Nougat was used to extract text from PDF files, and GPT-4 was prompted to extract element type constraints or generate a paper summary. The prompts used are reproduced in Appendix B of the paper.
Data
The dataset was constructed as follows:
- Posters in PDF format were downloaded from F1000Research.
- Only CC-BY licensed posters were retained (7,943 after license filtering).
- PDFs were converted to PNG at DPI=100.
- Posters with file sizes below 200KB were excluded as low-content (primarily text-only), leaving 7,855 posters. Both steps are sketched after this list.
- Professional data annotators manually annotated bounding boxes and categories for all 7,855 posters.
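A minimal sketch of the conversion and filtering steps using `pdf2image` (which wraps poppler). The paper does not state whether the 200KB threshold was applied to the source PDF or the rendered PNG; the sketch assumes the PDF:

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires poppler installed

MIN_BYTES = 200 * 1024  # the paper's low-content threshold (200KB)

def convert_and_filter(pdf_dir: str, out_dir: str) -> list[Path]:
    """Render each poster PDF to PNG at DPI=100, skipping low-content files."""
    kept = []
    for pdf in Path(pdf_dir).glob("*.pdf"):
        if pdf.stat().st_size < MIN_BYTES:
            continue  # primarily text-only posters, excluded by the authors
        pages = convert_from_path(pdf, dpi=100)
        png = Path(out_dir) / f"{pdf.stem}.png"
        pages[0].save(png)  # posters are single-page; keep the first page
        kept.append(png)
    return kept
```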
Split: 6,855 train / 500 dev / 500 test (from Table 2 in the paper; train annotation statistics sum to 168,313 total annotations, dev to 12,141, test to 12,243).
Paired papers: 100 paper-poster pairs (50 dev, 50 test) were manually identified by searching for papers corresponding to posters in the dataset. There is no automated linking on F1000Research, so this was done by hand.
Domain skew: The authors note through title-word analysis that most posters are from the biomedical field, which reflects F1000Research’s coverage.
The dataset is publicly available on HuggingFace at omron-sinicx/scipostlayout_v2 under CC-BY. Note that older posters may carry CC-BY 2.0 or CC-BY 3.0 rather than 4.0.
Evaluation
Layout analysis is evaluated with $\text{mAP}@\text{IoU}[0.50{:}0.95]$ on bounding boxes. The Unknown category is excluded from evaluation. This follows standard COCO-style object detection evaluation.
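This kind of evaluation can be reproduced with `pycocotools`, assuming COCO-format ground-truth and detection files (the file names below are hypothetical):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("scipostlayout_test.json")      # ground-truth boxes (hypothetical name)
dt = gt.loadRes("model_detections.json")  # detector output in COCO result format

ev = COCOeval(gt, dt, iouType="bbox")
# Exclude the Unknown category, as the paper does, by restricting catIds.
ev.params.catIds = [c for c in gt.getCatIds()
                    if gt.loadCats(c)[0]["name"] != "Unknown"]
ev.evaluate(); ev.accumulate(); ev.summarize()  # prints AP@[0.50:0.95], AP50, ...
```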
Layout generation uses four metrics: mIoU (maximum IoU between a generated layout and any real layout in the test set), Alignment, Overlap, and FID. These metrics are drawn from the layout generation literature and are not novel to this paper.
Paper-to-layout constraint extraction is evaluated with MAE between predicted and actual element type counts, averaged over three GPT-4 inference runs.
The paper compares against results reported in prior work on PubLayNet to contextualize difficulty, but direct cross-dataset comparison involves confounds (different split sizes, annotation granularity, domain).
Hardware
No hardware details are provided for the layout analysis or generation training runs. LayoutPrompter and the paper-to-layout experiments use GPT-4 via API, so inference cost depends on token usage rather than local hardware.
BibTeX
@inproceedings{tanaka2024scipostlayout,
title = {SciPostLayout: A Dataset for Layout Analysis and Layout Generation of Scientific Posters},
author = {Tanaka, Shohei and Wang, Hao and Ushiku, Yoshitaka},
booktitle = {British Machine Vision Conference (BMVC)},
year = {2024}
}
DocGenome: An Open Large-scale Scientific Document Benchmark
TL;DR
DocGenome is a 500K-document scientific benchmark built from arXiv LaTeX source, annotating 13 entity types and 6 logical relationships across 6.8M pages spanning 153 disciplines. It enables a 7-task benchmark suite covering layout detection, document transformation, and MLLM reasoning. The dataset and pipeline (DocParser) are fully open-source; ground truth is recovered directly from LaTeX compilation rather than PDF-based extraction, which largely eliminates OCR noise.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the primary contribution is the DocGenome dataset and its automated construction pipeline (DocParser). The paper’s center of gravity is dataset curation, annotation quality control, and the resulting benchmark suite.
Secondary: $\Psi_{\text{Method}}$: the paper also trains two baseline transformation models (EqVLM-B, TableVLM-B) and fine-tunes YOLOv8 for layout detection to demonstrate the dataset’s utility.
The benchmark covers 7 tasks: document classification, visual grounding, layout detection, equation-to-LaTeX, table-to-LaTeX, single-page QA, and multi-page QA.
What is the motivation?
Scientific documents are unusually demanding for Multimodal Large Language Models (MLLMs). A typical arXiv paper combines structured prose, mathematical notation, figures, tables, cross-references, and multi-column layouts. Understanding these documents requires knowing not just what each element looks like, but how elements relate to one another logically.
Existing document datasets fall short in two ways. Purely visual datasets like PubLayNet and DocLayNet annotate bounding boxes and category labels, but not the logical structure connecting them: which caption belongs to which figure, which paragraph belongs under which section heading, which equation is referenced by which sentence. Datasets that do capture richer structure are either small (requiring expensive human annotation) or domain-narrow.
The paper’s argument is that the field lacks a large-scale, open resource that aligns visual layout with logical structure across a broad range of scientific disciplines. DocGenome is that resource, covering 500K documents from 153 disciplines across 8 primary scientific fields.
What is the novelty?
The core innovation is the logical graph annotation derived from the LaTeX compilation process itself. Rather than inferring structure from the rendered PDF (which discards authorial intent), the DocParser pipeline intercepts the LaTeX source before compilation, recovering the document’s logical hierarchy directly.
The “Genome” Graph
Each document is represented as a graph: nodes are semantic units (one of 13 entity types), and edges are one of 6 logical relationship types.
Entity types (13): Algorithm, Caption, Equation, Figure, Footnote, List, Table, Text, Text-EQ (text with inline math), Title (section titles), PaperTitle, Code, Abstract.
Relationship types (6):
| Relation | Description |
|---|---|
| Identical | Two units share the same source code (e.g., text split across columns or pages). |
| Title adjacent | Two adjacent section titles. |
| Subordinate | One unit is a subclass of another (e.g., \section{Introduction} $\rightarrow$ a paragraph within it). |
| Non-title adjacent | Two adjacent text or equation units. |
| Explicitly-referred | One unit references another via \ref or a footnote (e.g., “As shown in Figure 5” $\rightarrow$ Figure 5). |
| Implicitly-referred | A caption unit refers to its corresponding float (e.g., Table Caption 1 $\rightarrow$ Table 1). |
Reading order is not stored as an explicit relationship; it is the implicit ordering produced by the TexSoup segmentation in Stage 2 of the pipeline.
What experiments were performed?
The DocParser Pipeline
DocParser is a 4-stage automated annotation pipeline:
- Preprocessing: Cleaning LaTeX source, expanding imports (`\input`, `\include`), and normalizing figure formats to PNG.
- Unit Segmentation: Using `TexSoup` to parse the LaTeX AST into a linear sequence of semantic units, establishing ground-truth reading order.
- Attribute and Relation Retrieval: Entities and relations are derived structurally from the LaTeX parse tree. Hierarchical relations (subordinate, adjacent) apply to fixed-form units; reference relations (explicit via `\ref`/`\label`, implicit for caption-float pairs) apply to floating elements.
- Color Rendering for Bounding Box Extraction: To avoid fuzzy text matching when locating units in the rendered PDF, the pipeline renders each target unit $u_i$ in isolation: $u_i$ is rendered black on white while all other content is rendered white. Differencing against a blank page yields a pixel-perfect bounding box for $u_i$ (sketched below).
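A minimal sketch of the render-and-diff step, assuming one isolated render per unit plus a fully blank render of the same page; file names and the grayscale threshold are illustrative, not from the released pipeline:

```python
import numpy as np
from PIL import Image

def unit_bbox(isolated_render: str, blank_page: str, thresh: int = 250):
    """Recover the bounding box of one unit from DocParser-style renders:
    `isolated_render` shows only the target unit in black on white;
    `blank_page` is the same page with all content rendered white."""
    iso = np.asarray(Image.open(isolated_render).convert("L"), dtype=np.int16)
    blank = np.asarray(Image.open(blank_page).convert("L"), dtype=np.int16)
    occupied = np.abs(iso - blank) > (255 - thresh)  # pixels the unit covers
    ys, xs = np.nonzero(occupied)
    if xs.size == 0:
        return None  # unit did not render on this page
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```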
Quality Control
Two metrics filter the automatic annotations:
- Intra-consistency ($IoU_{\text{intra}}$): Checks for illegitimate bounding-box overlap within a paper. For $N$ bounding boxes $B_1, \dots, B_N$ in a document, with $J(\cdot, \cdot)$ denoting Intersection-over-Union:
$$IoU_{\text{intra}} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} J(B_i, B_j)$$
- Alignment ($IoU_{\text{align}}$): Compares generated boxes against DocXChain (a document parsing toolkit from Alibaba DAMO Academy) to flag anomalies where the automatic annotation diverges from an independent detector. Given reference boxes $G_1, \dots, G_N$:
$$IoU_{\text{align}} = \frac{1}{N} \sum_{i=1}^{N} J(B_i, G_i)$$
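A minimal sketch of both checks, assuming boxes are (x1, y1, x2, y2) tuples and that the pairing between DocParser boxes and DocXChain reference boxes has already been established (the paper does not detail the matching step):

```python
import numpy as np

def iou(a, b):
    """Plain box IoU, the J(., .) in the formulas above."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def iou_intra(boxes):
    """Mean pairwise IoU inside one document; near zero for clean annotations."""
    n = len(boxes)
    total = sum(iou(boxes[i], boxes[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1)) if n > 1 else 0.0

def iou_align(boxes, refs):
    """Mean IoU against pre-matched reference boxes (e.g., from DocXChain)."""
    return float(np.mean([iou(b, g) for b, g in zip(boxes, refs)]))
```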
Only documents in the top quality tier (Tier-1, requiring $IoU_{\text{intra}}$ < 0.05% and $IoU_{\text{align}}$ > 60%) are eligible for the test set. From this pool, 1,004 documents were sampled to form DocGenome-test. QA pairs for the test set were generated by GPT-4V (2 single-page + 2 multi-page per paper) and validated through a 3-step process: (1) automated review by Kimi, (2) two independent faculty reviews with 0-3 confidence scoring, and (3) cross-verification against the original document text.
Benchmark Tasks and Baselines
The authors construct a 7-task benchmark using DocGenome-test:
- Layout Detection: YOLOv8 fine-tuned on increasing amounts of DocGenome training data, compared against DocXChain and models trained on smaller datasets.
- Document Transformation: EqVLM-B (equation-to-LaTeX) and TableVLM-B (table-to-LaTeX) fine-tuned from Pix2Struct-B, evaluated against Mathpix (commercial API) using normalized edit distance.
- MLLM Reasoning: Zero-shot evaluation of GPT-4V, GPT-4o, Qwen-VL, InternVL, and others on document classification, single-page QA, and multi-page QA.
What are the outcomes/conclusions?
Layout Detection
Fine-tuning YOLOv8 on DocGenome substantially outperforms baselines trained on smaller datasets. The mAP for Footnotes improves from 48.43% to 86.52% when scaling to 700K pages of training data. On out-of-distribution (OOD) human-annotated layout data, YOLOv8 trained on DocGenome (mAP 50.15) also outperforms DocXChain (mAP 37.99), suggesting the layout annotations generalize reasonably beyond the arXiv distribution.
Document Transformation
On the in-distribution DocGenome-test set, both trained models outperform Mathpix:
| Task | Model | Edit Distance |
|---|---|---|
| Eq-to-LaTeX | Mathpix | 0.4738 |
| Eq-to-LaTeX | EqVLM-B (1M samples) | 0.2111 |
| Table-to-LaTeX | Mathpix | 0.4436 |
| Table-to-LaTeX | TableVLM-B (500K samples) | 0.2223 |
On OOD Sci-Hub data, however, the picture reverses for the equation task. EqVLM-B’s edit distance rises to 0.6627, while Mathpix holds at 0.4873. The trained models appear to overfit to the arXiv/LaTeX distribution and do not generalize as cleanly as the in-distribution numbers suggest.
MLLM Reasoning
Neither GPT-4V nor GPT-4o dominates across all tasks:
| Model | Classification | Single-page QA | Multi-page QA |
|---|---|---|---|
| GPT-4V | 0.9821 | 0.6101 | 0.6501 |
| GPT-4o | 0.9761 | 0.7183 | 0.6762 |
GPT-4V leads on document classification; GPT-4o leads on both QA tasks. Open-source models (Qwen-VL, InternVL) trail significantly on the structural reasoning tasks, suggesting that scale and instruction tuning still matter for document understanding.
The paper concludes that large-scale structural annotation is a meaningful lever for improving document understanding models, but the OOD gaps for transformation tasks indicate that distribution breadth matters as much as scale.
Roots Analysis: Mapping to Matter vs. Meaning
The following is internal team analysis, not the paper’s own framing. We map all 13 DocGenome entity types and all 6 relation types against the Matter vs. Meaning framework to assess coverage, identify gaps, and understand where the dataset fits into our annotation and pipeline design.
Entity Type Mapping
| DocGenome Entity | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| PaperTitle | Text | Title | The document root headline. Clean 1-to-1 match. |
| Title (section) | Text | SectionHeader | Maps to H1-H6 depth; DocGenome does not record heading level. |
| Abstract | Text | Abstract | Clean 1-to-1 match. |
| Text | Text | Body | Standard prose. Clean 1-to-1 match. |
| Text-EQ | Text + Formula span | Body + InlineEquation child | DocGenome treats this as a distinct top-level category. Our framework handles it as a Text block (Body) with one or more InlineEquation enrichment spans inside, per the Span vs. Block model. Do not train a separate detector for Text-EQ: detect the block as Text, then run a formula spotter on the crop. |
| Equation | Formula | DisplayEquation | Standalone math block. Clean 1-to-1 match. |
| Figure | Image | Figure | Visual content. Clean 1-to-1 match. |
| Table | Table | (implicit) | DocGenome annotates the outer bounding box. TSR is a separate downstream step. |
| Caption | Text | Caption | Linked to its float via the Implicitly-referred relation, which maps to our `group_with` edge. |
| List | Text | ListItem | Granularity mismatch: DocGenome annotates the container block, not individual items. Per the framework’s “Why There Is No ListContainer” rationale, we map this to ListItem and note the mismatch. Individual bullet boundaries must be recovered by a separate step. |
| Footnote | Text | Footnote | Clean 1-to-1 match. |
| Code | Text | Code | Computer source code. Could alternatively map to Image if rendered as a screenshot, but in LaTeX-sourced papers it is always machine-readable Text. |
| Algorithm | Text or Image | Code | Ambiguous primitive. Algorithm pseudocode in LaTeX is typeset text (Text → Code), but some papers render it as a figure. DocGenome does not distinguish these cases. Treat as Text / Code unless visual inspection confirms otherwise. |
Relation Type Mapping
| DocGenome Relation | Framework Equivalent | Notes |
|---|---|---|
| Implicitly-referred | `group_with(Figure/Table, Caption)` | Direct match. This is the canonical caption-to-float binding. |
| Subordinate | `header_of(SectionHeader, Body)` | Direct match for section-to-content hierarchy. Also covers nested sections (SectionHeader-to-SectionHeader at different depths). |
| Explicitly-referred | (unmapped) | In-text `\ref` and footnote pointer relations. The framework’s current relation set has no general citation or cross-reference edge. This is a gap worth addressing if we build reading-order or citation graphs. |
| Identical | (unmapped) | Marks units whose source code is shared but renders in two locations (e.g., text split across columns or pages). The framework has no cross-page or column-continuation relation. Important for multi-column layout reconstruction. |
| Title adjacent | (reading order) | Two consecutive section titles. In the framework this is implicit reading order, not an explicit relation. No structural equivalent needed unless building explicit TOC graphs. |
| Non-title adjacent | (reading order) | Two consecutive text or equation units. Same as above: implicit reading order, not a structural edge the framework needs to represent explicitly. |
Coverage Gaps
Present in DocGenome, absent from the framework’s detection targets:
- The `Identical` relation (cross-column/cross-page text continuation) has no equivalent in the current framework relation set. This matters for multi-column scientific papers where a paragraph spans two columns: without this relation, a downstream reader has no signal to stitch the halves together. Worth adding as a `continues_from` edge in a future revision.
- The `Explicitly-referred` relation (in-text `\ref` citations) is similarly absent. It would be useful for building document knowledge graphs but is not needed for the core layout-to-text pipeline.
Present in the framework, not annotated by DocGenome:
DocGenome is scoped to arXiv scientific papers and annotates only what is needed for that domain. The following roles from the framework are therefore absent:
| Missing Role | Why It Matters | Where It Appears |
|---|---|---|
| Author / Affiliation | Needed for paper metadata extraction | Paper header / byline |
| Keywords | Indexing and classification | After abstract |
| BibEntry | Reference list parsing | Back matter |
| FormulaNumber | Linking equations to in-text references | Equation label (e.g., “(1)”) |
| PageHeader / PageFooter | Running navigation strips | Top/bottom of page |
| PageNumber | Pagination | Corner of each page |
These are not DocGenome failures; they reflect a deliberate scope decision. Any pipeline that ingests DocGenome-trained models and processes full scientific papers will need to supplement them with detectors for the above roles.
Summary Assessment
DocGenome’s entity taxonomy covers the core content primitives of the framework well: the 13 types map cleanly except for Text-EQ (which should be decomposed rather than treated as a separate class) and Algorithm (which requires visual disambiguation). Its relation set covers the two most structurally important edges (group_with and header_of) but leaves cross-page continuation and citation links unaddressed. The dataset is a strong foundation for scientific document layout, but will require supplementation for full-document metadata extraction and multi-column reconstruction.
Reproducibility
Models
EqVLM-B and TableVLM-B are both fine-tuned from Pix2Struct-B (0.2B parameters). The paper does not release pretrained weights for these baseline models; researchers would need to retrain from scratch using the released dataset and pipeline code. The same team later released StructTable-InternVL2-1B, a successor model for table-to-LaTeX trained on DocGenome data, but this is a separate effort not described in the paper.
Algorithms
- Layout detection (YOLOv8): AdamW, learning rate 0.01; 30 epochs (fine-tuning sketch after this list)
- Document transformation (EqVLM-B, TableVLM-B): AdamW, learning rate 5e-5, weight decay 0.01; 30 epochs
- Hardware: 64x NVIDIA A100 (80G) GPUs
No additional training tricks (warmup schedule, gradient clipping, mixed precision) are reported in the paper.
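A minimal fine-tuning sketch with the reported settings via the `ultralytics` API; the dataset YAML and the choice of the `yolov8m` checkpoint are assumptions, since the paper states neither the model size nor a config:

```python
from ultralytics import YOLO

# Hypothetical dataset config: 13 DocGenome entity classes, COCO-style boxes.
model = YOLO("yolov8m.pt")  # model size is an assumption, not from the paper
model.train(
    data="docgenome_layout.yaml",
    epochs=30,             # reported training length
    optimizer="AdamW",     # reported optimizer
    lr0=0.01,              # reported learning rate
)
```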
Data
- Training set: ~500K arXiv documents (6.8M pages), spanning 153 disciplines across 8 primary scientific fields, derived from LaTeX source via DocParser
- Test set: 1,004 documents (9K pages, DocGenome-test), sampled from the Tier-1 quality pool ($IoU_{\text{intra}}$ < 0.05%, $IoU_{\text{align}}$ > 60%). QA pairs were generated by GPT-4V (7,028 questions across 1,757 papers), then filtered through a 3-step quality process by 20 PhD/Master’s-level annotators. 2,498 QA pairs survived, with 66.93% modified by reviewers
- Availability: Training data is publicly available on Hugging Face under CC-BY-4.0. The test QA set is also on Hugging Face but has no license explicitly declared in its metadata (the paper states CC-BY-4.0 for the project overall)
- Bias: The dataset is entirely derived from arXiv submissions, meaning it is heavily skewed toward scientific papers with LaTeX source. Applicability to scanned documents, business forms, or non-LaTeX documents is not addressed.
Evaluation
- Layout detection: `mAP@0.5:0.95` (primary COCO metric, averaged over IoU thresholds 0.5 to 0.95)
- Document transformation: Normalized edit distance (lower is better)
- Reasoning tasks: Top-1 accuracy (classification) and GPT-acc (QA, where GPT-4 judges predicted answers against ground truth as True/False; a single-judgment sketch follows this list)
- OOD evaluation: Sci-Hub data (transformation) and human-annotated layout data; both are reported in Table 6 of the paper
- Baselines: DocXChain (layout), Mathpix (transformation), GPT-4V, GPT-4o, Qwen-VL, InternVL (reasoning)
- Evaluation scripts: The test set repository includes evaluation scripts (`eval_open_docqa_gpt.py`, `eval_normal_docqa.py`), but reproducing QA evaluation requires OpenAI API access for GPT-acc scoring.
- The paper does not report error bars, significance tests, or multi-seed results for the trained baselines.
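A sketch of one GPT-acc judgment using the current `openai` Python client; the prompt wording is illustrative, not taken from the released scripts:

```python
from openai import OpenAI

client = OpenAI()

def gpt_acc(question: str, prediction: str, reference: str) -> bool:
    """Ask a GPT model to judge one predicted answer as True/False."""
    prompt = (
        f"Question: {question}\n"
        f"Ground truth: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Is the predicted answer correct? Reply True or False only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().startswith("True")
```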
Hardware
- Training: 64x NVIDIA A100 (80G); total GPU-hours not reported
- Inference: No latency, throughput, or VRAM requirements reported for inference
- Cost: No cloud cost or energy consumption estimates reported
Peer Review History
DocGenome was submitted to ICLR 2025 (Datasets and Benchmarks track) and rejected. The OpenReview discussion provides useful context for understanding the paper’s reception.
Reviewer Scores
| Reviewer | Rating | Confidence |
|---|---|---|
| i9VG | 8/10 | 4/5 |
| c3jL | 6/10 | 4/5 |
| 4Uws | 6/10 | 4/5 |
| TmKL | 5/10 | 5/5 |
Key Criticisms
Limited novelty in pipeline methodology. Multiple reviewers (4Uws, TmKL) noted that parsing LaTeX source to derive layout annotations is not new; prior work (GROTOAP, PubLayNet, DocBank) uses similar approaches. The area chair’s meta-review summarized this as: “its technical contributions and novelty appear limited.”
Logical relationships are underexploited. Reviewer 4Uws flagged that while the paper introduces 6 relationship types as a key differentiator, the experiments do not actually leverage them. The benchmarked tasks operate on individual components, not the graph structure. In response, the authors showed that providing layout bounding boxes as visual prompts improves InternVL-1.5 QA accuracy (0.4529 $\rightarrow$ 0.4922 on single-page QA), but this addresses layout detection, not the relationship graph itself.
GPT-4 dependence in QA generation. Reviewers i9VG and 4Uws raised concerns about bias introduced by using GPT-4V to generate QA pairs. The authors revealed that from 7,028 initial GPT-4-generated questions across 1,757 papers, only 2,498 QA pairs survived quality filtering, and 66.93% of retained pairs were modified by the 20 PhD/Master’s-level quality checkers. This high editing rate somewhat mitigates the bias concern but also raises questions about the efficiency of the generation process.
Non-text modality coverage. Reviewer c3jL noted that QA pairs focus heavily on textual understanding, with limited coverage of chart/plot or table-specific questions, which is a gap for a benchmark targeting multimodal document understanding.
LaTeX distribution bias. Reviewer TmKL questioned whether LaTeX-formatted arXiv papers are representative of scientific documents broadly. The authors conducted OOD experiments on 966 Sci-Hub papers (covering medicine, chemistry, biology, humanities, and other fields outside arXiv’s core), where their DocGenome-trained YOLOv8 layout detector outperformed DocXChain (mAP 50.15 vs. 37.99), partially addressing this concern for layout detection. However, the equation-to-LaTeX task showed the opposite pattern on OOD data (edit distance 0.6627 vs. Mathpix’s 0.4873).
Noteworthy Author Revelations
- The open-source tool MinerU reportedly uses DocGenome for training its equation and table parsing models.
- The full QA pipeline details: 7,028 questions generated $\rightarrow$ quality-checked by 20 annotators $\rightarrow$ 2,498 retained (editing rate 66.93%).
- Evaluation metric limitations were acknowledged: edit distance and BLEU do not capture semantic equivalence of different LaTeX expressions (e.g., `\frac{a}{b}` vs. `a/b`). The authors mention ongoing work on a rendering-based metric that compares visual similarity of compiled LaTeX output.
BibTeX
@misc{xia2024docgenome,
title={DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models},
author={Xia, Renqiu and Mao, Song and Yan, Xiangchao and Zhou, Hongbin and Zhang, Bo and Peng, Haoyang and Pi, Jiahao and Fu, Daocheng and Wu, Wenjie and Ye, Hancheng and Feng, Shiyang and Wang, Bin and Xu, Chao and He, Conghui and Cai, Pinlong and Dou, Min and Shi, Botian and Zhou, Sheng and Wang, Yongwei and Wang, Bin and Yan, Junchi and Wu, Fei and Qiao, Yu},
year={2024},
eprint={2406.11633},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
WordScape: Multilingual Layout Annotations from Web Crawl Data
TL;DR
WordScape is an open-source pipeline that extracts Word documents from Common Crawl, parses their Open XML structure, and produces layout-annotated page images with text. The authors release 9.5M URLs to Word files that yield 40M+ pages across 136 languages and 28 topic categories. Pre-training on WordScape annotations consistently improves layout detection over random initialization and PubLayNet pre-training, particularly in low-resource regimes.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the headline contribution is the WordScape pipeline and its associated URL corpus. The paper’s center of gravity is the data creation methodology, dataset statistics (language distribution, topic modeling, entity breakdowns), and quality analysis. The pipeline itself is the released artifact, not a fixed dataset.
Secondary: $\Psi_{\text{Method}}$: the bounding-box annotation algorithm (colorization of Open XML elements, rendering via LibreOffice, color detection with OpenCV) is a distinct technical contribution, though it serves the primary goal of dataset creation.
What is the motivation?
Existing document layout datasets are limited along two axes:
- Scale vs. quality tradeoff: Human-annotated datasets (DocLayNet, FUNSD) top out at ~80K pages. Auto-annotated alternatives (PubLayNet, DocBank) reach hundreds of thousands but are confined to English scientific papers.
- Domain and language diversity: Nearly all large-scale layout datasets draw from PubMed Central or arXiv. Real-world document understanding requires coverage across industries, cultures, and languages, especially low-resource ones.
Word documents are ubiquitous on the web, used in formal and professional contexts, and carry structured Open XML metadata that enables automatic annotation. Common Crawl provides a scalable source of such documents without manual curation.
What is the novelty?
The core insight is that Word’s Open XML format already encodes the semantic structure of a document (headings, tables, lists, figures) in its XML tags and built-in styles. Rather than training a model to predict layout, WordScape reads the structure directly from the source format.
The annotation pipeline has three stages:
- URL Extraction: Parse Common Crawl `.wat` metadata files for URLs ending in `.doc`/`.docx`. Deduplicate per-snapshot and globally.
- Document Download: Send HTTP requests, filter by content-type, apply OLE-based malware checks (VBA macros, encryption, flash embeds), hash-based deduplication.
- Processing & Annotation:
  - Category identification: Parse Open XML using `python-docx`. Identify elements via (a) built-in Word styles (headings, titles), (b) native XML tags (tables, headers, footers, figures), or (c) heuristic fallbacks (font-size ranking, bullet detection).
  - Colorization: Edit XML tags to color each element category uniquely. Render via LibreOffice to JPEG.
  - Bounding box extraction: Detect colors on rendered images with OpenCV to produce per-category bounding boxes (sketched below).
  - Text extraction: Document-level text from `python-docx` (in reading order); page-level text with word bounding boxes from `PDFPlumber`.
  - Language ID: FastText classifier (176 languages) at both document and page level.
The pipeline annotates 30 semantic entity categories: Title, Heading Levels 1-9, Plain Text, List Item, Header, Footer, Table (plus Table Header, Table Header Cell, Table Cell, Table Row, Table Column), Table of Contents, Bibliography, Quote, Equation, Figure, Table Caption, Footnote, Annotation, Form Field, and Form Tag.
An annotation reliability score $R$ captures the proportion of entities identified via built-in/XML methods vs. heuristics, weighted by character count:
$$R = \sum_{i=1}^{N} \gamma_i r_i, \quad \gamma_i = \frac{c_i}{\sum_{j=1}^{N} c_j}, \quad r_i = \frac{b_i}{b_i + h_i}$$
where $c_i$ is the character count of entity $i$, $b_i$ counts built-in/XML-tagged annotations, and $h_i$ counts heuristic-based annotations.
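A direct transcription of $R$ into code; the field names are illustrative, not from the released pipeline:

```python
def reliability(entities):
    """R from the formula above. `entities` is a list of dicts with keys
    `chars` (character count), `builtin` (built-in/XML-tagged annotation
    count), and `heuristic` (heuristic annotation count)."""
    total_chars = sum(e["chars"] for e in entities)
    score = 0.0
    for e in entities:
        gamma = e["chars"] / total_chars        # character-count weight
        denom = e["builtin"] + e["heuristic"]
        if denom:
            score += gamma * e["builtin"] / denom
    return score
```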
What experiments were performed?
The authors validate WordScape annotations on three downstream tasks, each using a “pre-train on WordScape, then fine-tune on target” protocol:
1. Text Detection on FUNSD (199 scanned forms, word-level bounding boxes)
- Model: Faster R-CNN
- Pre-training on 10K WordScape pages + fine-tuning on just 25 FUNSD samples exceeds full-FUNSD performance (F1 0.840 vs. 0.772), a 6$\times$ reduction in manual labeling cost.
2. Table Detection on ICDAR 2019 cTDaR (600 training / 240 test images)
- Model: YOLOv8m, 640$\times$640 resolution, SGD then AdamW fine-tuning
- Pre-training on 5K WordScape table pages improves mAP@[0.50:0.95] by ~5.5 points in the 75-sample regime (0.924 vs. 0.869).
3. Layout Analysis on DocLayNet (80K+ pages, 11 classes)
- Model: YOLOv5, 200K pre-training images
- WordScape pre-training (mAP 0.508 at 1K fine-tune) outperforms PubLayNet pre-training (0.467) and random initialization (0.299). The gap narrows but persists at full 69K fine-tuning (0.755 vs. 0.745 vs. 0.753).
4. Handcrafted Scientific Dataset (2,500 pages, 31 categories, 8 scientific domains)
- Models: DETR and YOLOv8
- Pre-training on WordScape reduces the need for downstream labeled data; improvements are most pronounced at smaller fine-tuning budgets.
What are the outcomes/conclusions?
Key findings:
- From a single Common Crawl snapshot (Nov/Dec 2022), the pipeline produced 1M annotated documents (5.5M pages) after filtering. Across 6 snapshots, 9.5M unique URLs were collected, projecting to 40M+ pages.
- The language distribution is heavily skewed: Russian (2M pages) and English (1M pages) dominate, with a long tail down to ~1K pages for Tajik and Urdu. 136 languages total are represented.
- Entity distribution is imbalanced: table cells account for ~50% of all 173M bounding boxes. Excluding table sub-elements, the distribution flattens to a more even split across headings, plain text, and list items.
- Topic modeling (Google Cloud NLP API on 25K documents) reveals the top domains are Law & Government (15.3%), Forms/Templates (9.0%), and Education (5.6%). There are significant cross-language differences: Law & Government dominates Russian (~37%) but is rare in other languages.
- Pre-training on WordScape consistently outperforms PubLayNet pre-training and random initialization across all benchmarks, with the largest gains in low-data regimes.
Limitations the authors acknowledge:
- Bounding box annotation quality depends on the assumption that document formatting (built-in styles, font sizes) correlates with user intent. Heuristic-based annotations are less reliable.
- The pipeline does not address toxic or offensive content filtering.
- URL-based distribution means the dataset cannot be released as a fixed, downloadable corpus; documents must be re-crawled, and availability degrades over time (only 12.5% of 2013 URLs were still live).
- Per-snapshot deduplication removes 60-80% of URLs, and ~20% of downloaded documents fail processing (mainly invalid zip files).
Reproducibility
Models
No pre-trained model weights are released. The experiments use standard architectures (Faster R-CNN, YOLOv5, YOLOv8m, DETR) with publicly available implementations. Training details are provided for each experiment (learning rates, optimizers, batch sizes, epochs, early stopping).
Algorithms
The annotation algorithm is fully described and the code is open-sourced at DS3Lab/WordScape under Apache-2.0. The pipeline depends on python-docx, LibreOffice, OpenCV, PDFPlumber, and FastText.
Repo status (as of March 2026): The GitHub repository has 39 stars and 6 forks. The last commit was 2023-11-20, roughly a month before publication. The entire commit history spans 8 commits. One open issue (Dec 2023) has no response. The repo is effectively a paper code dump: sufficient to document the approach but not actively maintained. A reimplementation from the paper’s description using the same underlying libraries would likely be more practical than reviving this codebase.
Data
- URL corpus: 9.5M URLs to be released (not the documents themselves). Availability degrades over time as URLs go stale.
- No fixed dataset download: Users must run the pipeline to produce their own dataset. This is a fundamental design choice (pipeline, not dataset), but it means results depend on which URLs are still live at download time.
- Quality filters are provided: perplexity-based (Wikipedia KenLM models), annotation reliability score, character count thresholds, language confidence, and file size limits.
Evaluation
- FUNSD: F1@IoU 0.5. Standard split (149 train / 50 test).
- ICDAR 2019 cTDaR: mAP@[0.50:0.95]. Modern tables subset (600 train / 240 test). 4 models per setting with error bars reported.
- DocLayNet: mAP@[0.50:0.95]. Standard splits with varying fine-tuning sizes.
- Handcrafted scientific dataset: 2,500 pages, 31 categories. Quartile-based confidence bands reported.
Hardware
The paper’s appendix provides a detailed breakdown for processing one Common Crawl snapshot:
- Common Crawl parsing: 64 CPU cores, 512GB RAM, 49 hours (3,087 CPU hours)
- Document download: 64 CPU cores, 256GB RAM, 22.5 hours (1,440 CPU hours)
- Document processing: 24 nodes with 24 CPU cores and 96GB RAM each, ~22 hours (12,672 CPU hours)
- Total per snapshot: ~17K CPU hours (CPU-only; no GPUs required for the pipeline itself)
Topic modeling uses the Google Cloud NLP API (paid external service). GPU types, total GPU-hours, and training costs for the downstream object detection experiments are not reported.
Practical Assessment
The paper’s lasting value is the idea, not the artifacts. The Common Crawl URL corpus is ephemeral (URLs go stale), no fixed dataset is downloadable, and the reference implementation is unmaintained. However, the core technique is straightforward and well-documented: parse Open XML for semantic structure, colorize, render, detect bounding boxes. The underlying libraries (python-docx, LibreOffice, OpenCV) are all mature and actively maintained.
The strongest use case is private .docx collections. The paper frames WordScape around Common Crawl, but the technique applies to any .docx source. Organizations that generate or receive Word documents as part of normal operations can use this approach to produce domain-specific layout training data at zero annotation cost. Each element identified via built-in Word styles or XML tags (headings, tables, lists, figures) becomes a free bounding-box label. The annotation reliability score lets you filter out heuristic-based labels when you need higher confidence. The result is pre-training data that matches your actual document distribution, not the scientific-paper distribution that dominates public datasets.
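As a concrete starting point, a minimal sketch of harvesting style-based labels from a private `.docx` with `python-docx`; the style-to-category mapping below is ours, not the paper's:

```python
from docx import Document

# Hypothetical mapping from built-in Word style names to layout categories.
BUILTIN_CATEGORIES = {
    "Title": "title",
    "Heading 1": "heading_1",
    "List Paragraph": "list_item",
}

def harvest_labels(path: str):
    """Return (category, text) pairs, in reading order, for every paragraph
    carrying a built-in style we recognize; each is a free label candidate."""
    doc = Document(path)
    labeled = []
    for para in doc.paragraphs:
        category = BUILTIN_CATEGORIES.get(para.style.name)
        if category is not None:
            labeled.append((category, para.text))
    return labeled
```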
BibTeX
@inproceedings{weber2023wordscape,
title={WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data},
author={Weber, Maurice and Siebenschuh, Carlo and Butler, Rory M. and Alexandrov, Anton and Thanner, Valdemar R. and Tsolakis, Georgios and Jabbar, Haris and Foster, Ian and Li, Bo and Stevens, Rick and Zhang, Ce},
booktitle={Advances in Neural Information Processing Systems},
volume={36},
year={2023}
}
VGT: Vision Grid Transformer for Document Layout Analysis
TL;DR
VGT is a two-stream multimodal architecture for document layout analysis that pairs a Vision Transformer (ViT) for image features with a Grid Transformer (GiT) for 2D text grid features. GiT is pre-trained on ~4 million IIT-CDIP document images using two objectives: token-level Masked Grid Language Modeling (MGLM) and segment-level contrastive Segment Language Modeling (SLM). The paper also introduces D4LA, a 27-class, 11,092-image benchmark spanning 12 real-world document types from RVL-CDIP. At the time of publication, VGT set new results on PubLayNet (96.2% mAP), DocBank (84.1% mAP), and D4LA (68.8% mAP).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The core contribution is the VGT architecture: a two-stream ViT + GiT design where GiT is pre-trained with novel objectives (MGLM and SLM) specifically for 2D text grid understanding in document layout analysis.
Secondary: $\Psi_{\text{Resource}}$. The paper introduces D4LA (Diverse and Detailed Dataset for Document Layout Analysis), a manually annotated benchmark with 27 categories across 12 real-world document types.
What is the motivation?
Document layout analysis (DLA) sits at an awkward intersection of two paradigms, and neither makes full use of what is available in documents.
Pure visual methods (CNN-based detectors, DiT) treat the page as an image and leverage pre-training well (DiT uses image-based BEiT-style pre-training), but they ignore the text content that is almost always present and semantically informative for distinguishing layout categories.
Grid-based methods (VSR, sentence-grid methods) construct a 2D representation of text tokens aligned with the image and feed both streams into a detector. They use textual information at inference time, but do so without any textual pre-training objective. The grid branch is trained only on visual detection supervision, which limits how well it learns semantic text representations.
No existing approach at the time combined: (1) multi-modal inputs with both image and text grid, and (2) dedicated pre-training for the text grid stream.
In parallel, existing large-scale DLA datasets (PubLayNet with 5 classes, DocBank with 13 classes) are almost entirely scientific papers. The layout categories they define are poorly suited to real-world document types like invoices, letters, forms, and budget sheets, which have their own distinct structural elements.
What is the novelty?
Two-Stream Architecture: ViT + GiT
VGT runs two parallel ViT-Base encoders on the same document:
- ViT stream: Encodes the document image as patch embeddings with learnable 1D position embeddings, following DiT [25] initialization.
- GiT stream: Encodes a 2D text grid $\mathbf{G} \in \mathbb{R}^{H \times W \times C_G}$, where each pixel position is filled with the sub-word token embedding of whichever OCR token covers that pixel:
$$G_{i,j} = \begin{cases} E(c_k) & \text{if } (i,j) \in \mathbf{b}_k \\ E(\texttt{[PAD]}) & \text{otherwise} \end{cases}$$
Here $c_k$ is the $k$-th sub-word token with bounding box $\mathbf{b}_k$, and $E(\cdot)$ maps tokens to a $C_G = 64$ dimensional embedding space initialized from LayoutLM. Background pixels with no text are assigned the [PAD] token embedding. The grid is then split into patches and processed identically to ViT.
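A minimal sketch of the grid construction under these definitions, assuming OCR tokens arrive with pixel-space boxes and a pre-built token embedding matrix; this is a simplification of the paper's implementation:

```python
import numpy as np

def build_text_grid(tokens, emb, pad_id, H, W):
    """Rasterize OCR tokens into the GiT input grid. `tokens` is a list of
    (token_id, (x1, y1, x2, y2)) pairs in pixel coordinates; `emb` is a
    vocab-size x C embedding matrix (LayoutLM-initialized in the paper)."""
    grid = np.tile(emb[pad_id], (H, W, 1))   # background = [PAD] embedding
    for token_id, (x1, y1, x2, y2) in tokens:
        grid[y1:y2, x1:x2] = emb[token_id]   # fill the token's box
    return grid                               # shape (H, W, C)
```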
Multi-scale features from both streams are fused element-wise at each FPN scale $i$:
$$\mathbf{Z}_i = \mathbf{V}_i \oplus \mathbf{S}_i$$
where $\mathbf{V}_i$ and $\mathbf{S}_i$ are the ViT and GiT feature maps at scale $i$, and $\oplus$ is element-wise summation. The fused pyramid is passed through FPN and a Cascade R-CNN detection head.
Pre-Training for GiT: MGLM and SLM
The authors decouple visual and language pre-training: ViT is initialized from DiT-base, while GiT is separately pre-trained on ~4 million IIT-CDIP document images using two objectives.
Masked Grid Language Modeling (MGLM) adapts the BERT MLM objective to the 2D grid space. Some tokens in $\mathbf{G}$ are replaced with [MASK], and the model must recover the original token identities. Because GiT produces 2D feature maps rather than 1D sequences, token features are extracted via RoIAlign on the FPN output rather than direct index lookup. The objective is:
$$\mathcal{L}_{\text{MGLM}}(\theta) = -\sum_{k=1}^{N_M} \log p_{\theta}(c_k \mid \mathbf{e}_{c_k})$$
where $N_M$ is the number of masked tokens, $\theta$ is the parameters of GiT and FPN, and $\mathbf{e}_{c_k}$ is the RoIAlign-extracted feature for token $c_k$.
Segment Language Modeling (SLM) targets segment-level (text-line-level) representations via contrastive learning. For each text line segment $l_i$, the model produces a segment feature $\mathbf{e}_{l_i}$ by RoIAlign. A frozen LayoutLM model generates pseudo-target features $\mathbf{e}^\ast_{l_i}$. The objective aligns segment features with their pseudo-targets:
$$p_{\theta}(\mathbf{e}_{l_i}, \mathbf{e}^{\ast}_{l_i}) = \frac{\exp(\mathbf{e}_{l_i} \cdot \mathbf{e}^{\ast}_{l_i} / \tau)}{\exp(\mathbf{e}_{l_i} \cdot \mathbf{e}^{\ast}_{l_i} / \tau) + \sum_{k \in \mathcal{N}_{l_i}} \exp(\mathbf{e}_{l_i} \cdot \mathbf{e}^{\ast}_{k} / \tau)}$$
$$\mathcal{L}_{\text{SLM}}(\theta) = -\frac{1}{N_S} \sum_{i=1}^{N_S} \log p_{\theta}(\mathbf{e}_{l_i}, \mathbf{e}^{\ast}_{l_i})$$
where $\cdot$ denotes cosine similarity, $\tau = 0.01$ is a temperature parameter, $\mathcal{N}_{l_i}$ is the set of negative samples for segment $l_i$, and $N_S$ is the number of sampled segments per page. Both losses are combined equally: $\mathcal{L} = \mathcal{L}_{\text{MGLM}} + \mathcal{L}_{\text{SLM}}$.
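A sketch of the SLM term in PyTorch, assuming the negative set $\mathcal{N}_{l_i}$ consists of the pseudo-targets of the other segments sampled from the same page:

```python
import torch
import torch.nn.functional as F

def slm_loss(seg_feats, target_feats, tau=0.01):
    """Contrastive SLM objective: each RoIAlign segment feature should match
    its frozen-LayoutLM pseudo-target, with the other segments on the page
    acting as negatives. Both inputs have shape (N_S, D)."""
    seg = F.normalize(seg_feats, dim=-1)     # unit vectors -> dot = cosine
    tgt = F.normalize(target_feats, dim=-1)
    logits = seg @ tgt.T / tau               # scaled cosine similarities
    labels = torch.arange(seg.size(0), device=seg.device)
    return F.cross_entropy(logits, labels)   # InfoNCE over in-page negatives
```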
D4LA Dataset
D4LA (Diverse and Detailed Dataset for Document Layout Analysis) addresses the scientific-paper bias in existing DLA benchmarks. Key properties:
- Images: 11,092 pages from RVL-CDIP across 12 document types (Budget, Email, Form, Invoice, Letter, Memo, News Article, Presentation, Resume, Scientific Publication, Scientific Report, Specification). Noisy, handwritten, and low-text images are filtered.
- Annotation: Manual, with OCR from IIT-CDIP as a textual resource.
- Categories: 27 classes designed for real-world DLA: DocTitle, ListText, LetterHead, Question, RegionList, TableName, FigureName, Footer, Number, ParaTitle, RegionTitle, LetterDear, OtherText, Abstract, Table, Equation, PageHeader, Catalog, ParaText, Date, LetterSign, RegionKV, Author, Figure, Reference, PageFooter, PageNumber. Notably, RegionKV (key-value regions in invoices) and RegionList (list-style regions in forms) are absent from academic-paper-focused datasets.
- Splits: 8,868 train / 2,224 validation.
What experiments were performed?
Benchmarks
- PubLayNet: 360K scientific PDFs, 5 classes (Text, Title, List, Table, Figure). Trained on 335,703, evaluated on the 11,245-image validation split.
- DocBank: 500K document pages with 13 region-level categories. Trained on 400K, evaluated on the 5K validation split.
- D4LA: 8,868 train / 2,224 validation, 27 categories across 12 document types.
For ablation efficiency, two sub-datasets PubLayNet2K and DocBank2K (2K train / 2K validation each) were used for early module comparisons.
Baselines
- ResNeXt-101: CNN-based Cascade R-CNN, no document-specific pre-training.
- DiT-Base: ViT-based DLA model with BEiT-style document image pre-training. Visual only.
- LayoutLMv3-Base: Multi-modal pre-training with unified text and image masking, but only image tokens are used at DLA fine-tuning time (no text embeddings passed to the detector head).
- VSR: Two-stream CNN model using char-level and sentence-level text grids. Multi-modal but no pre-training for the text stream.
Evaluation
- Primary metric: mAP at IoU[0.50:0.95] (COCO-style bounding box AP).
- Detection framework: Cascade R-CNN with FPN, implemented in Detectron2.
Ablation Studies
Table 5 (on PubLayNet2K / DocBank2K) isolates contributions:
- ViT alone (baseline): 86.92 / 59.61
- GiT alone with only layout (no text): 64.12 / 40.56, showing layout information alone is insufficient
- GiT alone with text, no pre-training: 65.88 / 49.15, showing text helps, especially for semantic DocBank categories
- GiT alone with pre-training (MGLM+SLM): 74.96 / 55.46, showing pre-training substantially improves GiT
- ViT + GiT (LayoutLM embeddings, no pre-training): 87.76 / 64.01, showing multi-modal fusion helps over ViT alone
- Full VGT (ViT + GiT + pre-training): 88.44 / 65.94, the best result, verifying all components contribute
Table 6 (DocBank2K) tests pre-training objectives in isolation:
- No pre-training: 64.01
- MGLM only: 64.54
- SLM only: 65.11
- MGLM + SLM: 65.17 (SLM alone contributes more than MGLM, and combining both yields the best result)
What are the outcomes/conclusions?
Results
PubLayNet (Table 7):
| Model | Text | Title | List | Table | Figure | mAP |
|---|---|---|---|---|---|---|
| ResNeXt-101 | 93.0 | 86.2 | 94.0 | 97.6 | 96.8 | 93.5 |
| DiT-Base | 94.4 | 88.9 | 94.8 | 97.6 | 96.9 | 94.5 |
| LayoutLMv3-Base | 94.5 | 90.6 | 95.5 | 97.9 | 97.0 | 95.1 |
| VSR | 96.7 | 93.1 | 94.7 | 97.4 | 96.4 | 95.7 |
| VGT | 95.0 | 93.9 | 96.8 | 98.1 | 97.1 | 96.2 |
VGT improves on VSR’s prior result (95.7%) by 0.5%, with notable gains in Title and List categories, which the authors attribute to GiT’s text-aware features.
DocBank (Table 8):
| Model | mAP |
|---|---|
| ResNeXt-101 | 77.4 |
| DiT-Base | 79.6 |
| LayoutLMv3-Base | 78.3 |
| VGT | 84.1 |
VGT’s 4.5-point gain over DiT-Base is most pronounced in text-semantic categories: Caption (+5.7), Date (+5.7), Equation (+8.9), List (+5.9), Paragraph (+8.5). This confirms that GiT pre-training specifically benefits classes that require understanding of text content.
D4LA (Table 9):
| Model | mAP |
|---|---|
| ResNeXt-101 | 65.1 |
| DiT-Base | 67.7 |
| LayoutLMv3-Base | 60.5 |
| VGT | 68.8 |
LayoutLMv3 underperforms even ResNeXt-101 on D4LA, which the authors attribute to the disconnect between multi-modal pre-training and visual-only inference. VGT’s gain over DiT-Base (+1.1 overall, +6.6 on Abstract, +8.9 on Question) demonstrates the benefit of using textual features at inference time.
Limitations
- Model size: VGT has 243M parameters, significantly larger than DiT-Base (138M) and LayoutLMv3 (138M). The two-stream design adds compute proportionally.
- Inference speed: VGT takes 460ms per image, compared to 210ms for DiT-Base and 270ms for LayoutLMv3. This makes it unsuitable for latency-sensitive applications.
- OCR dependency: GiT requires word bounding boxes and tokens as input. For PDFs this comes from a parser (PDFPlumber); for scanned images it requires an external OCR engine. OCR errors propagate to the text grid.
- Image-centric scope: VGT is primarily an object detection system. The authors note that extending it to text-centric tasks like information extraction is future work.
Reproducibility
Models
- Both ViT and GiT are 12-layer ViT-Base encoders (12-head attention, D=768, 3,072 MLP size, patch size 16).
- ViT initialized from published DiT-base weights. GiT initialized from DiT-base weights and then further pre-trained with MGLM+SLM.
- GiT text embeddings initialized from LayoutLM with embedding dimension reduced to $C_G = 64$.
- Total model: 243M parameters.
- Detection head: Cascade R-CNN with FPN, implemented in Detectron2. All other Cascade R-CNN settings follow DiT.
- Code and models are stated to be publicly available at: https://github.com/AlibabaResearch/AdvancedLiterateMachinery. Licensed under Apache-2.0.
Algorithms
GiT Pre-training:
- Dataset: ~4 million images from IIT-CDIP.
- Optimizer: Adam, batch size 96, 150,000 steps.
- Learning rate: 5e-4 with 2% linear warmup.
- Image size: 768 x 768.
- SLM: 64 segments sampled per page; LayoutLM used as the pseudo-target generator; temperature $\tau = 0.01$.
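The recipe above maps to a simple optimizer/scheduler pair. A minimal PyTorch sketch, assuming linear decay to zero after the 2% warmup (the paper does not state the post-warmup decay shape):

```python
# Sketch of the GiT pre-training schedule: Adam, peak LR 5e-4, 150k steps,
# 2% linear warmup. The decay after warmup is an assumption (linear to zero).
import torch

TOTAL_STEPS = 150_000
WARMUP_STEPS = int(0.02 * TOTAL_STEPS)  # 3,000 warmup steps

model = torch.nn.Linear(768, 768)  # placeholder for the GiT encoder
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

def lr_lambda(step: int) -> float:
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```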
Fine-tuning:
- Optimizer: AdamW with 1,000 warmup steps, LR=2e-4.
- DocBank: 200,000 steps, batch size 24.
- PubLayNet: 120,000 steps, batch size 24.
- D4LA: 10,000 steps, batch size 12.
Data
- IIT-CDIP (~4M images): Used for GiT pre-training. This is a tobacco litigation archive dataset. OCR results from IIT-CDIP are reused for D4LA annotation as well.
- RVL-CDIP: Source of D4LA images. RVL-CDIP is a subset of IIT-CDIP organized into 16 document categories; D4LA draws 12 of those. No explicit license for RVL-CDIP images is stated in this paper; the underlying IIT-CDIP collection was assembled for litigation and carries no open data grant. Treat the D4LA image license as unknown.
- D4LA: 8,868 train / 2,224 validation (split from 11,092 total). Manually annotated. Available on ModelScope: https://modelscope.cn/datasets/damo/D4LA/summary. License listed as Unknown (the ModelScope page states the data will be made publicly available but provides no explicit license terms).
- PubLayNet and DocBank: Standard benchmarks used as-is.
Evaluation
- Metric: mAP at IoU[0.50:0.95] (COCO-style bounding box detection AP); a minimal evaluation sketch follows this list.
- PubLayNet and DocBank results are on the validation splits.
- No error bars or multi-run statistics are reported.
- No data contamination analysis is described (PubLayNet/DocBank images come from scientific paper repositories; overlap with IIT-CDIP pre-training data is not discussed).
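A minimal pycocotools sketch of this evaluation protocol; the file names are placeholders, not released artifacts:

```python
# COCO-style mAP@[0.50:0.95] over bounding boxes, as reported throughout.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("ground_truth_coco.json")       # COCO-format layout annotations
dt = gt.loadRes("model_detections.json")  # detector outputs (COCO results format)

ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()
print("mAP@[0.50:0.95]:", ev.stats[0])  # the headline metric
```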
Hardware
- Training hardware: not explicitly stated in the paper. The pre-training scale (~4M images, 150K steps, batch 96) and fine-tuning scale suggest a multi-GPU setup.
- Inference speed: VGT 460ms, DiT-Base 210ms, LayoutLMv3 270ms (comparison hardware not stated).
- Total GPU-hours and cloud cost are not reported.
- CPU-only or consumer GPU feasibility is not discussed.
BibTeX
@InProceedings{Da_2023_ICCV,
author={Da, Cheng and Luo, Chuwei and Zheng, Qi and Yao, Cong},
title={Vision Grid Transformer for Document Layout Analysis},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month={October},
year={2023},
pages={19462--19472},
doi={10.1109/ICCV51070.2023.01783}
}
M6Doc: Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Dataset
TL;DR
M6Doc is a manually annotated document layout dataset covering 9,080 pages across seven document types, three formats (PDF, scanned, photographed), and two languages, with 74 fine-grained annotation categories and 237,116 instances. The paper also introduces TransDLANet, a transformer-based instance segmentation model, which achieves 64.5% mAP on the dataset.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the primary contribution is the M6Doc dataset itself, including its collection pipeline, annotation schema, and 74-class label taxonomy.
Secondary: $\Psi_{\text{Method}}$: TransDLANet is a purpose-built baseline model introduced alongside the dataset, representing a real methodological contribution rather than just an evaluation vehicle.
Secondary: $\Psi_{\text{Evaluation}}$: the paper benchmarks eleven object detection and five instance segmentation baselines against the dataset, positioning M6Doc as a community benchmark.
What is the motivation?
The field’s trajectory runs from PRImA (2009, 305 scanned pages of magazines and scientific articles, widely considered the first real-world DLA dataset) through the large automatically annotated PDF corpora of the late 2010s: PubLayNet (360k pages, 5 classes), DocBank (500k pages, 13 classes), and DocLayNet (80k pages, 11 classes). M6Doc positions itself against several structural limitations shared by that generation:
- Format monoculture. All large public datasets consist entirely of PDF documents. Models trained on them tend to fail on scanned or photographed pages, which dominate real-world document workflows.
- Type monoculture. Most datasets are drawn exclusively from scientific articles, which use highly regularized two-column or single-column templates. Magazines, newspapers, textbooks, and exam papers have fundamentally different visual grammars.
- Coarse taxonomies. Five-class taxonomies (Text, Title, Table, Figure, List) are sufficient for document retrieval but insufficient for reconstruction, accessibility pipelines, or editorial understanding. Distinguishing a kicker from a headline, or a footnote from a marginal note, requires categories that these datasets lack.
- Language coverage. Nearly all large-scale public data is English. Non-English work exists (BCE-Arabic reaches 9k pages of scanned Arabic books) but remains narrow in scope and format. Document layout varies with writing direction, script density, and typographic conventions.
The “M6” Properties
The dataset name encodes six design properties:
| # | Property | Details |
|---|---|---|
| 1 | Multi-Format | Scanned, photographed, and born-digital PDF |
| 2 | Multi-Type | Scientific articles, textbooks, books, test papers, magazines, newspapers, notes |
| 3 | Multi-Layout | Rectangular, Manhattan, non-Manhattan, multi-column Manhattan |
| 4 | Multi-Language | Chinese and English |
| 5 | Multi-Annotation Category | 74 annotation classes |
| 6 | Modern Documents | Contemporary layouts drawn from current publications; explicitly excludes historical, archival, or pre-digital documents (e.g., no early-print OCR corpora) |
What is the novelty?
The Dataset
M6Doc contains 9,080 manually annotated pages with 237,116 total instances, split 6:1:3 across training (5,448 pages, 143,040 annotations), validation (908 pages, 23,210 annotations), and test (2,724 pages, 70,866 annotations).
Document type composition: scientific article (11%), textbook (23%), test paper (22%), magazine (22%), newspaper (11%), note (5.5%), book (5.5%).
Format composition: PDF (64%), scanned (31%), photographed (5%).
Sources: arXiv (scientific articles), the Chinese People’s Daily website (newspapers), scanned textbooks across nine school subjects, handwritten student notes, and photographed books. Magazines were collected as PDFs from a mix of Chinese and English publishers (including The New Yorker, Scientific American, and Global Science); the paper also cites VKontakte as a general source, though which specific content came from there is not entirely clear. As with any dataset that aggregates third-party publications, the copyright ownership of individual items may not rest with the dataset creators, which creates some uncertainty about the scope of the CC-BY-NC-ND-4.0 license.
The 74-class taxonomy is, to our knowledge, the most granular publicly released layout annotation schema. It covers editorial structures specific to print journalism (kicker, dateline, jump line, byline, mugshot), academic structures (formula, algorithm, reference, footnote, endnote), exam-specific structures (question numbers at three levels, answer, option, matching, blank), and general structures (section titles at four levels, ordered/unordered lists, figures, tables, and navigation elements). Compared to DocLayNet’s 11 classes, this is roughly a $7\times$ increase in granularity.
A key design note: list items are grouped into container blocks (ordered list, unordered list) rather than annotated individually, which differs from DocLayNet’s per-line list-item convention.
Full Class Distribution
| Category | Train | Train % | Val | Val % | Test | Test % |
|---|---|---|---|---|---|---|
| background | 0 | 0.000 | 0 | 0.000 | 0 | 0.000 |
| QR code | 59 | 0.041 | 15 | 0.065 | 23 | 0.032 |
| advertisement | 257 | 0.180 | 45 | 0.194 | 145 | 0.205 |
| algorithm | 12 | 0.008 | 3 | 0.013 | 12 | 0.017 |
| answer | 165 | 0.115 | 30 | 0.129 | 77 | 0.109 |
| author | 2,424 | 1.695 | 403 | 1.736 | 1,188 | 1.676 |
| barcode | 10 | 0.007 | 1 | 0.004 | 3 | 0.004 |
| bill | 3 | 0.002 | 2 | 0.009 | 3 | 0.004 |
| blank | 189 | 0.132 | 58 | 0.250 | 90 | 0.127 |
| bracket | 863 | 0.603 | 164 | 0.707 | 273 | 0.385 |
| breakout | 411 | 0.287 | 72 | 0.310 | 188 | 0.265 |
| byline | 1,276 | 0.892 | 185 | 0.797 | 660 | 0.931 |
| caption | 3,508 | 2.452 | 605 | 2.607 | 1,766 | 2.492 |
| catalogue | 39 | 0.027 | 10 | 0.043 | 19 | 0.027 |
| chapter title | 245 | 0.171 | 33 | 0.142 | 124 | 0.175 |
| code | 62 | 0.043 | 7 | 0.030 | 31 | 0.044 |
| correction | 9 | 0.006 | 1 | 0.004 | 6 | 0.008 |
| credit | 1,523 | 1.065 | 255 | 1.099 | 728 | 1.027 |
| dateline | 901 | 0.630 | 140 | 0.603 | 482 | 0.680 |
| drop cap | 414 | 0.289 | 71 | 0.306 | 234 | 0.330 |
| editor’s note | 39 | 0.027 | 4 | 0.017 | 9 | 0.013 |
| endnote | 35 | 0.024 | 4 | 0.017 | 19 | 0.027 |
| examinee information | 8 | 0.006 | 2 | 0.009 | 6 | 0.008 |
| fifth-level title | 13 | 0.009 | 2 | 0.009 | 20 | 0.028 |
| figure | 7,614 | 5.323 | 1,242 | 5.351 | 3,762 | 5.309 |
| first-level question number | 5,669 | 3.963 | 930 | 4.007 | 2,740 | 3.866 |
| first-level title | 586 | 0.410 | 81 | 0.349 | 292 | 0.412 |
| flag | 30 | 0.021 | 5 | 0.022 | 12 | 0.017 |
| folio | 1,442 | 1.008 | 213 | 0.918 | 685 | 0.967 |
| footer | 1,984 | 1.387 | 310 | 1.336 | 987 | 1.393 |
| footnote | 295 | 0.206 | 49 | 0.211 | 139 | 0.196 |
| formula | 13,090 | 9.151 | 2,058 | 8.867 | 6,191 | 8.736 |
| fourth-level section title | 15 | 0.010 | 3 | 0.013 | 19 | 0.027 |
| fourth-level title | 70 | 0.049 | 13 | 0.056 | 66 | 0.093 |
| header | 1,877 | 1.312 | 297 | 1.280 | 969 | 1.367 |
| headline | 4,115 | 2.877 | 643 | 2.770 | 1,981 | 2.795 |
| index | 214 | 0.150 | 36 | 0.155 | 100 | 0.141 |
| inside | 16 | 0.011 | 1 | 0.004 | 5 | 0.007 |
| institute | 60 | 0.042 | 9 | 0.039 | 28 | 0.040 |
| jump line | 381 | 0.266 | 63 | 0.271 | 180 | 0.254 |
| kicker | 516 | 0.361 | 91 | 0.392 | 257 | 0.363 |
| lead | 664 | 0.464 | 109 | 0.470 | 285 | 0.402 |
| marginal note | 238 | 0.166 | 37 | 0.159 | 101 | 0.143 |
| matching | 7 | 0.005 | 1 | 0.004 | 8 | 0.011 |
| mugshot | 73 | 0.051 | 11 | 0.047 | 46 | 0.065 |
| option | 3,198 | 2.236 | 515 | 2.219 | 1,577 | 2.225 |
| ordered list | 1,012 | 0.707 | 172 | 0.741 | 510 | 0.720 |
| other question number | 42 | 0.029 | 3 | 0.013 | 31 | 0.044 |
| page number | 4,782 | 3.343 | 803 | 3.460 | 2,383 | 3.363 |
| paragraph | 65,642 | 45.891 | 10,575 | 45.562 | 33,069 | 46.664 |
| part | 524 | 0.366 | 89 | 0.383 | 283 | 0.399 |
| play | 10 | 0.007 | 3 | 0.013 | 2 | 0.003 |
| poem | 98 | 0.069 | 18 | 0.078 | 33 | 0.047 |
| reference | 149 | 0.104 | 23 | 0.099 | 62 | 0.087 |
| sealing line | 3 | 0.002 | 2 | 0.009 | 5 | 0.007 |
| second-level question number | 2,773 | 1.939 | 377 | 1.624 | 1,330 | 1.877 |
| second-level title | 273 | 0.191 | 48 | 0.207 | 140 | 0.198 |
| section | 2,508 | 1.753 | 408 | 1.758 | 1,228 | 1.733 |
| section title | 897 | 0.627 | 171 | 0.737 | 442 | 0.624 |
| sidebar | 54 | 0.038 | 10 | 0.043 | 27 | 0.038 |
| sub section title | 567 | 0.396 | 107 | 0.461 | 269 | 0.380 |
| subhead | 1,998 | 1.397 | 394 | 1.698 | 1,069 | 1.508 |
| subsub section title | 101 | 0.071 | 21 | 0.090 | 71 | 0.100 |
| supplementary note | 986 | 0.689 | 158 | 0.681 | 487 | 0.687 |
| table | 821 | 0.574 | 146 | 0.629 | 409 | 0.577 |
| table caption | 287 | 0.201 | 41 | 0.177 | 143 | 0.202 |
| table note | 8 | 0.006 | 2 | 0.009 | 5 | 0.007 |
| teasers | 32 | 0.022 | 7 | 0.030 | 7 | 0.010 |
| third-level question number | 240 | 0.168 | 36 | 0.155 | 102 | 0.144 |
| third-level title | 146 | 0.102 | 44 | 0.190 | 94 | 0.133 |
| title | 201 | 0.141 | 35 | 0.151 | 100 | 0.141 |
| translator | 73 | 0.051 | 11 | 0.047 | 38 | 0.054 |
| underscore | 3,687 | 2.578 | 590 | 2.542 | 1,717 | 2.423 |
| unordered list | 497 | 0.347 | 84 | 0.362 | 271 | 0.382 |
| weather forecast | 10 | 0.007 | 3 | 0.013 | 3 | 0.004 |
| Total | 143,040 | 100 | 23,210 | 100 | 70,866 | 100 |
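Per-split distributions like the table above can be recomputed directly from the MS COCO-format annotation files the dataset ships with. A short sketch, with a hypothetical local file name:

```python
# Count per-class instances in a COCO-format annotation file and print the
# count and percentage columns as in the table above. The path is a placeholder.
import json
from collections import Counter

with open("m6doc_train.json") as f:
    coco = json.load(f)

names = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(names[a["category_id"]] for a in coco["annotations"])
total = sum(counts.values())

for name, n in counts.most_common():
    print(f"{name:35s} {n:7d} {100 * n / total:7.3f}%")
```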
TransDLANet
Alongside the dataset, the authors propose TransDLANet, a transformer-based instance segmentation model. It is derived from the ISTR framework with three modifications: (1) a Transformer encoder without positional encoding for self-attention over query embeddings; (2) an adaptive element matching mechanism to improve recall by better aligning queries to ground-truth instances; and (3) three shared-parameter MLP branches for joint classification, bounding box regression, and segmentation mask prediction. The model iteratively refines its output over $K$ rounds.
What experiments were performed?
Annotation Methodology
47 annotators followed a purpose-written guideline of over 170 pages. Labels were defined using a “commonality vs. specificity” principle: labels were unified across document types where possible, and type-specific labels were added only where the distinction carried semantic value. Annotation files use MS COCO format. Final quality control was performed by the authors.
The authors report several sources of inter-annotator ambiguity in the guideline: determining table vs. paragraph boundaries when borders are absent; deciding whether a figure containing sub-figures should be annotated as one region or several; and distinguishing list items from paragraphs when no explicit marker is present. Resolving these consistently required explicit annotation rules.
Baseline Benchmarks on M6Doc
The paper evaluates eleven object detection methods and five instance segmentation methods, reporting detection mAP and segmentation mAP separately. Selected results on the M6Doc test set (COCO mAP 0.50:0.95):
| Method | Backbone | Det. mAP | Seg. mAP |
|---|---|---|---|
| RetinaNet | ResNet-101 | 21.4 | 21.0 |
| YOLOv3 | DarkNet-53 | 59.8 | – |
| FCOS | ResNet-101 | 40.6 | 39.3 |
| Faster R-CNN | ResNet-101 | 49.0 | 47.8 |
| Cascade R-CNN | ResNet-101 | 54.1 | 52.7 |
| Mask R-CNN | ResNet-101 | 40.1 | 39.7 |
| Cascade Mask R-CNN | ResNet-101 | 54.4 | 52.9 |
| HTC | ResNet-101 | 58.2 | 57.1 |
| SOLOv2 | ResNet-101 | 46.8 | 48.3 |
| Deformable DETR | ResNet-101 | 57.2 | 55.6 |
| TransDLANet (proposed) | ResNet-101 | 64.5 | 63.4 |
Note that for detection-only methods (e.g., Faster R-CNN), the segmentation mAP is computed using the bounding box as a proxy mask. For segmentation-only methods (e.g., SOLOv2), the detection mAP uses the minimum bounding rectangle of the predicted mask. All anchor-based methods use a non-default anchor ratio set of [0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0] to accommodate M6Doc’s wide aspect ratio diversity.
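In mmdetection, which the paper uses for its baselines, that adjustment is an anchor-generator override. A hedged config fragment; the scales and strides shown are the framework's usual two-stage RPN defaults, assumed here rather than taken from the paper:

```python
# Non-default anchor ratios for M6Doc's wide aspect-ratio diversity.
anchor_generator = dict(
    type="AnchorGenerator",
    scales=[8],  # assumption: mmdetection's default RPN scale
    ratios=[0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0],
    strides=[4, 8, 16, 32, 64],  # assumption: default FPN strides
)
```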
A notable result: Mask R-CNN (40.1 detection mAP) substantially underperforms Faster R-CNN (49.0 detection mAP), despite Mask R-CNN typically performing comparably or better on standard benchmarks. The authors note that pixel-based segmentation degrades performance when documents have complex layouts, a finding they also observe on DocLayNet. Meanwhile, SOLO reaches only 38.7 segmentation mAP. TransDLANet builds on ISTR, and the transformer encoder and adaptive matching exist specifically to address ISTR’s fixed-query collapse when multiple queries converge on the same dense region.
The authors also evaluate TransDLANet on DocLayNet and PubLayNet. On DocLayNet, TransDLANet achieves 72.3% mAP, below the YOLOv5x6 baseline at 76.8%. On PubLayNet, it achieves 94.5% mAP, comparable to but slightly below VSR at 95.7%. The authors attribute the DocLayNet gap to a fixed-query design that misses instances when multiple queries collapse onto a single region.
Cross-Dataset Transfer Experiments
To demonstrate the value of format diversity, the authors trained models on DocBank, PubLayNet, and DocLayNet and evaluated on M6Doc’s scanned and photographed subsets. Models trained solely on PDF data consistently fail to detect instances in scanned handwritten notes and photographed books, reflecting domain gap from background variation, tilt, and brightness shifts. Models trained on M6Doc generalize better to those formats. Conversely, models trained on M6Doc do not transfer well to DocLayNet (and vice versa), which the authors attribute to the two datasets drawing from entirely different source corpora.
What are the outcomes/conclusions?
M6Doc demonstrates that increasing label granularity, even at the cost of raw dataset size, can improve a model’s ability to extract fine-grained logical structure. A Faster R-CNN trained on M6Doc’s 600-page scientific article subset produces more accurate results on formula and section-level distinctions than the same architecture trained on DocBank, which has 500k pages but only 13 coarse classes.
The dataset is genuinely hard: the strongest baseline excluding TransDLANet (HTC) reaches only 58.2% detection mAP. Failure modes the authors identify include: dense or skewed instances that overlap significantly, making anchor-based suppression lossy; large-aspect-ratio regions like newspaper advertisements that fall outside standard anchor configurations; and handwritten notes, which remain the most difficult subset for all evaluated models.
Critical observations:
- The “Large-Scale” framing in the title is hard to justify. At 9,080 pages, M6Doc is substantially smaller than DocLayNet (80k), PubLayNet (360k), or DocBank (500k). Its value is in annotation granularity and format diversity, not volume.
- The class imbalance is extreme. The paragraph class accounts for roughly 46% of all instances, while several classes have single-digit counts in validation (e.g., matching: 1 val instance; barcode: 1 val instance; bill: 2 val instances). Standard COCO mAP weights all classes equally, so headline numbers are dominated by the high-frequency classes and tell you almost nothing about model performance on the rare but semantically distinctive ones.
- The CC-BY-NC-ND-4.0 license prevents commercial use and prohibits derivative datasets, which limits applicability for production-oriented work.
- The train/val splits are not publicly downloadable without requesting access through the repository; only the test set is openly downloadable.
- No training or inference code is provided. The baselines must be reproduced independently, with the anchor ratio adjustment ([0.0625, 0.125, …, 16.0]) being a non-obvious configuration change needed for anchor-based methods to handle M6Doc’s wide aspect ratio diversity.
Mapping to Unified Taxonomy
M6Doc’s granularity stress-tests any unified taxonomy. The tables below map each of its 74 classes to our Matter/Meaning annotation standard, noting where M6Doc distinguishes concepts that are usually merged. (Note: that page is a work in progress; check current availability before linking externally.)
Hierarchy and Text Flow
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| title | Text | Title | Document-level title. |
| first-level title through fifth-level title | Text | Title / SectionHeader | M6Doc splits academic headers by level. |
| section title | Text | SectionHeader | Generic section divider. |
| sub section title | Text | SectionHeader | Second-level academic section. |
| subsub section title | Text | SectionHeader | Third-level academic section. |
| fourth-level section title | Text | SectionHeader | Fourth-level academic section. |
| second-level title / third-level title | Text | SectionHeader | Parallel hierarchy for non-article documents. |
| chapter title | Text | SectionHeader | Book-level chapter division. |
| section / part | Text | SectionHeader | High-level structural divisions (books, textbooks). |
| headline | Text | Title | Newspaper article title. |
| subhead | Text | SectionHeader | Secondary headline (newspaper). |
| kicker | Text | SectionHeader | Introductory label above the headline. |
| paragraph | Text | Body | Standard prose body. |
| drop cap | Text | Body | Typographic first letter; semantically part of the paragraph. |
| breakout | Text | Blockquote | Pull quote or highlighted excerpt. |
| lead | Text | Abstract | Introductory summary paragraph in news articles. |
| supplementary note | Text | Body | Supplementary explanatory text. |
| jump line | Text | JumpLine | “Continued on page X” cross-reference. |
| byline / author | Text | Author | Attribution lines. |
| institute / translator | Text | Author | Secondary attribution blocks. |
Specialized Content
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| advertisement | Image | Advertisement | Distinct commercial region. |
| code | Text | Code | Source code blocks. |
| algorithm | Text | Code | Pseudocode or algorithmic descriptions. |
| poem | Text | Body | Verse with significant whitespace. |
| play | Text | Body | Dialogue-heavy script text. |
| editor’s note | Text | Abstract | Editorial comment or contextual summary. |
| correction | Text | Body | Published correction notice. |
| bill | Text | Body | Legislative bill text or invoice (rare, 3 training instances). |
| sidebar | Text | Sidebar | Detached aside content. |
| weather forecast | Table | Sidebar | Grid-style structured data region. Per the Visual Grid Rule, 2D-aligned data is a Table regardless of domain. |
Lists and Examination
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| ordered list / unordered list | Text | ListItem | M6Doc annotates the container block, not individual items. |
| first-level question number | Text | Label | Top-level exam question label. |
| second-level question number | Text | Label | Sub-question label. |
| third-level question number | Text | Label | Sub-sub-question label. |
| other question number | Text | Label | Question labels not fitting the three levels. |
| answer | Text | Value / Body | Exam answer content. |
| option | Text | ListItem | Multiple-choice option. |
| matching | Table | FormGroup | “Match column A to B” exam regions. |
| examinee information | Table | FormGroup | Name / ID input blocks. |
Visuals and Captions
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| figure | Image | Figure | Standard image or illustration. |
| mugshot | Image | Figure | Headshot portrait photograph. |
| teasers | Image | Figure | Visual preview thumbnails. |
| table | Table | Table | Standard table. |
| formula | Formula | DisplayEquation | Math regions; also annotated inline per guidelines. |
| caption | Text | Caption | Figure caption. |
| table caption | Text | Caption | Table-specific title. |
| table note | Text | TableNote | Notes appended below tables. |
| credit | Text | Credit | Photo source attribution. |
| barcode | OpticalCode | Barcode1D | Linear machine-readable code. |
| QR code | OpticalCode | Barcode2D | 2D machine-readable code. |
Navigation and Meta
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| header | Text | PageHeader | Running page head. |
| footer | Text | PageFooter | Running page foot. |
| page number | Text | PageNumber | Explicit page index. |
| folio | Text | PageNumber | Combined page number and publication name. |
| flag | Text / Image | Title / Logo | Newspaper nameplate (e.g., masthead logotype). Not a literal flag image. |
| inside | Text | TOC | “Inside this issue” section pointer. |
| catalogue | Text | TOC | Table of contents or list of works. |
| index | Text | Index | Back-of-book index. |
| reference | Text | BibEntry | Bibliography or works cited. |
| footnote / endnote | Text | Footnote | Explanatory meta-text tied to a specific passage. |
| marginal note | Text | Annotation | Text in margins; user-added or editorial marginalia. |
| dateline | Text | Dateline | Temporal and geographic anchor (“NEW YORK, March 4”). |
Structures and Form Elements
| M6Doc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| blank | Field | Input | Explicit answer box or input area. |
| underscore | Field | Input | Fill-in-the-blank line. |
| bracket | Structure | Input | Answer bracket region in exams. |
| sealing line | Structure | – | Security seal line on exam papers. |
Reproducibility
Models
TransDLANet uses a ResNet-101 backbone pretrained on ImageNet. The architecture is based on ISTR, extended with a Transformer encoder (no positional encoding), an adaptive element matching mechanism, and a dynamic interaction decoder. Three shared-parameter MLP branches produce classification scores, bounding box coordinates, and segmentation masks. No model weights are released.
Algorithms
- Optimizer: AdamW
- Base learning rate: $2 \times 10^{-5}$, decayed to $2 \times 10^{-6}$ at epoch 250 and $2 \times 10^{-7}$ at epoch 375 (50% and 75% of 500 epochs); see the sketch after this list
- Batch size: not reported
- Augmentation: random crop; input scaled so the short side is 704-896 px and the long side is at most 1333 px
- Anchor ratios (for anchor-based baselines): [0.0625, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0], a significant departure from the default [0.5, 1.0, 2.0] required to handle M6Doc’s wide aspect ratio diversity
- No training or inference code is provided in the repository.
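Since no code ships with the dataset, the reported schedule has to be reconstructed. A minimal PyTorch sketch of the stepped decay, with a placeholder model standing in for TransDLANet:

```python
# AdamW at 2e-5, decayed x0.1 at epochs 250 and 375 of 500, per the paper.
import torch

model = torch.nn.Linear(256, 74)  # placeholder; not the TransDLANet architecture
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[250, 375], gamma=0.1)

for epoch in range(500):
    # ... one training epoch over M6Doc would run here ...
    sched.step()  # LR: 2e-5 -> 2e-6 (epoch 250) -> 2e-7 (epoch 375)
```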
Data
- Total: 9,080 manually annotated pages; 237,116 instances
- Split: 6:1:3 (train 5,448 pages / val 908 pages / test 2,724 pages), stratified to maintain label proportions
- Annotation format: MS COCO JSON with polygon segmentation masks (not bounding-box only); detection-only baselines use the bounding box derived from each mask’s enclosing rectangle
- Annotation guideline: 170+ pages; 47 annotators; final QA by authors. No inter-annotator agreement metric (e.g., Cohen’s kappa or Fleiss’ kappa) is reported. The supplementary material states that inconsistent annotations in the final check were “within 5%,” but this is an informal figure without a formal protocol.
- License: CC-BY-NC-ND-4.0 for both code and data; prohibits commercial use and derivative datasets
- Availability: Test set is publicly downloadable (OneDrive link in repository). Train/val sets require requesting access through the GitHub repository.
Evaluation
- Metrics: Standard COCO mAP (IoU 0.50:0.95), AP50, AP75, and Recall
- Benchmark: M6Doc test set; also DocLayNet and PubLayNet test sets for transfer experiments
- Baselines: Evaluated with mmdetection framework; anchor configurations adjusted as noted above
- No official evaluation scripts are provided; users must reproduce the COCO metric pipeline independently
- Annotation ambiguities acknowledged: table vs. paragraph without borders; whether sub-figures count as one or many instances; list item vs. paragraph without explicit markers. These affect reproducibility of edge cases.
Hardware
Hardware specifications are not reported in the paper.
BibTeX
@InProceedings{Cheng_2023_CVPR,
author = {Cheng, Hiuyi and Zhang, Peirong and Wu, Sihang and Zhang, Jiaxin and Zhu, Qiyuan and Xie, Zecheng and Li, Jing and Ding, Kai and Jin, Lianwen},
title = {M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {15138-15147}
}
BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset
TL;DR
BaDLAD is, to the authors’ knowledge, the first large-scale, human-annotated, multi-domain document layout analysis dataset for Bengali (Bangla). It contains 33,693 document pages collected from six domains, with 710K polygon annotations across four semantic unit types: text-box, paragraph, image, and table. The authors benchmark several object detection architectures on the dataset to establish baseline performance.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The paper’s primary contribution is the dataset itself: a new, large-scale, human-annotated corpus for Bengali document layout analysis (DLA). The curation pipeline, annotation protocol, and statistical characterization of the data constitute the bulk of the paper.
Secondary: $\Psi_{\text{Evaluation}}$. The paper benchmarks Faster R-CNN, Mask R-CNN, and YOLOv8 on the dataset, establishing baseline performance numbers and identifying where current English-DLA models fall short on Bengali documents.
What is the motivation?
Bengali (Bangla) is spoken natively by approximately 300 million people, with another 37 million speakers internationally, making it one of the most widely spoken languages in the world. Despite this, the infrastructure for processing Bengali documents lags considerably behind what is available for Latin-script languages. Bengali OCR has seen progress over the past decade, but document layout analysis (DLA) for Bengali remained severely under-resourced at the time of this paper.
The core problem is that DLA is a prerequisite for document transcription pipelines: before an OCR system can read a page, the page must be segmented into meaningful regions (paragraphs, text-boxes, images, tables). Existing large DLA datasets like PubLayNet and DocBank are either automatically annotated from born-digital PDFs or restricted to a single domain (scientific articles). Neither approach transfers well to Bengali documents, which are predominantly scanned or photographed originals and span a far more heterogeneous range of typographies, scripts, and layouts.
Three compounding challenges make Bengali DLA particularly difficult:
- Bengali is a non-Latin script with complex character composites, inflected forms, and conjunct consonants. Models trained on Latin-script data cannot be naively applied.
- A large fraction of publicly available Bengali documents are scanned images of historical materials, some dating to the early 1800s. These exhibit degraded print quality, unusual typefaces, and inconsistent layouts.
- Modern Bengali legal documents (property deeds) contain handwritten annotations, signatures, stamps, and free-form text interleaved with printed content, creating annotation challenges that rule-based systems handle poorly.
Existing Bengali DLA resources were small and domain-specific, insufficient to train the data-hungry deep learning models that have become standard in the field.
What is the novelty?
The main contribution is BaDLAD itself: the first multi-domain, large-scale, human-annotated dataset for Bengali document layout analysis. Several aspects distinguish it from prior work:
Scale and human annotation. BaDLAD contains 33,693 annotated document pages produced by a team of 13 trained annotators over four months, with three dedicated curators performing quality control and iteratively refining annotation guidelines. Annotation time was also recorded per sample, providing a proxy measure of layout complexity.
Polygon annotations. Unlike axis-aligned bounding box annotations used in most DLA datasets, BaDLAD uses polygon annotations that follow the actual boundaries of text regions. This matters because Bengali documents frequently contain rotated text, irregular column arrangements, and text that does not align neatly to rectangular boxes.
Multi-domain coverage. The dataset spans six distinct domains:
- Books and magazines (largest component, 30,054 pages)
- Government documents (1,285 pages)
- Liberation war documents (1,004 pages)
- Historical newspapers, pre-December 1971 (861 pages)
- New newspapers (161 pages)
- Property deeds (328 pages, held entirely in the test split to preserve confidentiality and to probe out-of-distribution performance)
Data-driven semantic unit discovery. The four annotation categories were not chosen arbitrarily. The authors trained a self-supervised SwAV model on approximately 20,000 scraped Bengali PDFs and inspected the resulting cluster prototypes. The clusters revealed four major semantic categories, which became the annotation schema: text-box (isolated headings, captions, labels), paragraph (contiguous body text), image (any non-text visual element including logos and signatures), and table (structured row-column content, with internal elements also annotated according to their type).
Additional unannotated data. The authors also release approximately 4 million unannotated document images to support unsupervised or semi-supervised DLA research.
The dataset split is 60:40 train-to-test at the document level (not the page level), stratified by domain, ensuring that pages from the same source document land in the same split to prevent data leakage.
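A document-level split of this kind is easy to reproduce with scikit-learn's grouped splitters. A minimal sketch with placeholder arrays (the per-domain stratification BaDLAD also applies is omitted):

```python
# 60:40 split at the document level: all pages of a document share a split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

pages = np.arange(100)                 # placeholder page indices
doc_ids = np.repeat(np.arange(20), 5)  # placeholder: 20 documents x 5 pages

splitter = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=0)
train_idx, test_idx = next(splitter.split(pages, groups=doc_ids))
assert set(doc_ids[train_idx]).isdisjoint(doc_ids[test_idx])  # no page leakage
```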
What experiments were performed?
The authors benchmark three standard object detection and segmentation architectures on BaDLAD:
- Faster R-CNN (F-RCNN) with ResNet-50 backbone, trained with the Detectron framework, 10,000 iterations, learning rate 0.001 with decay 0.1, minibatch size 48.
- Mask R-CNN (M-RCNN) with ResNet-50 backbone, same Detectron training configuration. A second variant (M-RCNN*) uses ResNet-101, pre-trained on PubLayNet.
- YOLOv8 trained with the Ultralytics framework, 100 epochs, batch size 8, initial learning rate 0.01, weight decay 0.0005, warm-up over 3 iterations. Two variants are reported: object detection (bounding boxes) and segmentation (polygon masks).
All models were pre-trained on one of: ImageNet, PubLayNet, or COCO. Performance is reported as mAP (COCO-style, IoU thresholds 0.50-0.95) computed per domain and per unit type (paragraph, text-box, image, table).
The evaluation measures both bounding box prediction (for detection models) and segmentation mask quality (for Mask R-CNN and YOLOv8 segmentation). BaDLAD provides annotations in three formats: polygon, best-fit bounding box, and segmentation mask, allowing fair comparison across model types.
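Deriving the box and mask formats from the polygons is mechanical. A short OpenCV sketch with an illustrative polygon, not actual BaDLAD data:

```python
# Convert one polygon annotation into a best-fit bounding box and a raster mask.
import numpy as np
import cv2

polygon = np.array([[10, 12], [90, 15], [88, 60], [12, 58]], dtype=np.int32)

x, y, w, h = cv2.boundingRect(polygon)       # best-fit axis-aligned box
mask = np.zeros((100, 100), dtype=np.uint8)  # page-sized canvas (placeholder)
cv2.fillPoly(mask, [polygon], 1)             # binary segmentation mask
```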
What are the outcomes/conclusions?
The benchmark results suggest that existing English DLA models, when fine-tuned on BaDLAD, learn meaningful Bengali layout features but leave substantial room for improvement:
- YOLOv8 achieves the highest bounding box mAP at 70.46% averaged across domains and unit types.
- In the segmentation setting, YOLOv8 achieves 35.69% mAP, while M-RCNN pre-trained on PubLayNet reaches 32.27%.
- Models generally perform better on paragraphs and images than on text-boxes and tables. Tables are underrepresented in the training split (1,353 table annotations in training vs. a much larger number of paragraphs), which likely explains the weaker table detection.
- Text-box detection is notably low given it is the second most frequent unit type. The authors attribute this to the structural diversity of text-box layouts (headers, captions, page numbers, etc.).
- The property deeds domain, absent from training, is poorly handled by all models, demonstrating the expected out-of-distribution degradation.
- Models show robustness in some challenging conditions: code-switched text (Bengali and English mixed), noisy scans, and partially torn pages.
The authors acknowledge several limitations. The dataset is domain-imbalanced: books and magazines account for roughly 89% of pages. Several practically important domains (shopping receipts, ID cards, application forms) are absent. The dataset does not currently include word-level or character-level text content, which limits applicability to language-model-based DLA methods (such as LayoutLM-style approaches). Future work is expected to address this gap through word detection and recognition pipelines.
Mapping to Matter vs. Meaning Framework
How does BaDLAD’s 4-class taxonomy map to our Matter vs. Meaning framework?
| BaDLAD Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Paragraph | Text | Body | Contiguous body text blocks. |
| Text-box | Text | Title / SectionHeader / Caption / PageHeader / PageFooter / PageNumber | Isolated text fragments: headings, captions, labels, page numbers. BaDLAD does not distinguish between these roles; all non-paragraph text is collapsed here. |
| Image | Image | Figure / Logo / Stamp | Any non-text visual element, including logos, signatures, and stamps. No sub-classification. |
| Table | Table | (primitive only) | Structured row-column content. Internal elements are also annotated according to their type (text-box, image, etc.). |
Coverage gaps: BaDLAD’s 4-class schema is extremely coarse. It lacks explicit classes for Formula, ListItem, Footnote, BibEntry, PageHeader, PageFooter, and all form primitives (Field, Selection). The Text-box class conflates many distinct roles (headings, captions, page numbers) into a single category, requiring downstream disambiguation. The Image class similarly merges photographs, logos, and stamps.
Reproducibility
Models
No new model architecture is introduced. The benchmarked models are standard implementations:
- Faster R-CNN and Mask R-CNN via the Detectron framework (Facebook Research).
- YOLOv8 via the Ultralytics library.
- Backbone: ResNet-50 for most experiments; ResNet-101 for one M-RCNN variant.
Model weights trained on BaDLAD are available via Google Drive, linked from the GitHub repository README (https://github.com/BengaliAI/badlad). The dataset is also hosted on Kaggle (https://www.kaggle.com/datasets/reasat/badlad-train). Note: the GitHub repository has no LICENSE file despite the paper claiming CC BY-SA 4.0.
Algorithms
- R-CNN models: 10,000 iterations, learning rate 0.001, decay 0.1, minibatch 48, warm-up 5 iterations.
- YOLOv8: 100 epochs, batch size 8, initial learning rate 0.01, weight decay 0.0005, warm-up 3 iterations.
- Pretrained weights from ImageNet, COCO, or PubLayNet depending on the variant.
- Evaluation metric: COCO-style mAP at IoU 0.50-0.95.
Data
- 33,693 annotated pages; train split: 20,365 pages; test split: 13,328 pages.
- Split is 60:40 stratified by domain (at document level, not page level). Property deeds are held entirely in the test set.
- Polygon, bounding box, and segmentation mask formats are all provided.
- An additional approximately 4 million unannotated images are released for unsupervised use.
- License: CC BY-SA 4.0 (covering both the dataset and benchmark code, as stated in the paper).
- The property deeds portion has been anonymized (PII removed) before inclusion.
- Annotations were produced using the Labelbox platform by 13 trained annotators under iterative guidelines developed with three curators.
Evaluation
- Metric: mAP at IoU thresholds 0.50-0.95 (COCO protocol), reported per domain and per unit type.
- No error bars or multi-run statistics are reported; results represent single training runs.
- The test split includes the property deeds domain, which is absent from training, making it a partial out-of-distribution evaluation by design.
- Benchmarks use models pre-trained on different source datasets (ImageNet, COCO, PubLayNet), so comparisons across pretraining regimes are not controlled for compute or data size.
Hardware
The paper does not report specific hardware configurations, GPU type or count, training wall-clock time, or memory requirements. Reproducibility of the benchmarks requires consulting the Detectron and Ultralytics documentation for typical resource estimates for ResNet-50/101 and YOLOv8 at the reported batch sizes.
BibTeX
@inproceedings{shihab2023badlad,
title={BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset},
author={Shihab, Md. Istiak Hossain and Hasan, Md. Rakibul and Emon, Mahfuzur Rahman and Hossen, Syed Mobassir and Ansary, Md. Nazmuddoha and Ahmed, Intesur and Rakib, Fazle Rabbi and Dhruvo, Shahriar Elahi and Dip, Souhardya Saha and Pavel, Akib Hasan and Meghla, Marsia Haque and Haque, Md. Rezwanul and Chowdhury, Sayma Sultana and Sadeque, Farig and Reasat, Tahsin and Humayun, Ahmed Imtiaz and Sushmit, Asif},
booktitle={Document Analysis and Recognition -- ICDAR 2023},
pages={326--340},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41676-7\_19}
}
HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures
TL;DR
HRDoc is a dataset of 2,500 multi-page documents with approximately 2 million line-level annotated semantic units, designed for the task of hierarchical document structure reconstruction. Unlike flat layout detection benchmarks, HRDoc requires predicting both semantic labels and parent-child relationships across pages. The authors also propose DSPS, a multi-modal encoder-decoder system that outperforms the prior DocParser baseline by a wide margin on this task.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The central contribution is the HRDoc dataset itself: 2,500 annotated multi-page documents with line-level semantic labels and explicit parent-child relational annotations. The dataset introduces a new task formulation, defines annotation protocols, and provides train/test splits. Without the dataset, the paper has little to offer.
Secondary: $\Psi_{\text{Method}}$. The Document Structure Parsing System (DSPS) is a meaningful model contribution: a multi-modal bidirectional encoder paired with a structure-aware GRU decoder and a novel soft-mask operation. It is benchmarked against baselines with ablations.
Secondary: $\Psi_{\text{Evaluation}}$. The paper introduces STEDS (Semantic Tree-Edit-Distance Score), a new metric for evaluating hierarchical reconstruction quality at the document level.
What is the motivation?
Most document layout analysis research treats each page in isolation. The dominant paradigm (object detection models like Cascade-RCNN applied to document images) produces flat lists of bounding boxes per page: text regions, figures, tables, and so on. These approaches do not address how elements relate to one another across pages, how paragraphs belong to sections, how sections nest under one another, or how captions attach to figures.
The authors observe that converting a PDF to a structured, editable format (such as Markdown or a knowledge graph) requires more than bounding boxes: it requires a tree. Yet at the time of writing, no benchmark captured this hierarchical, cross-page structure at scale with fine-grained line-level annotations. Existing partial solutions either only extracted tables of contents (ignoring body content) or applied brittle rule-based heuristics that do not generalize beyond fixed layouts.
HRDoc is built to fill this gap, providing both the annotation standard and the scale needed to train and evaluate learned approaches to hierarchical document structure recovery.
What is the novelty?
Task formulation. The paper formalizes hierarchical document structure reconstruction as three linked subtasks:
- Semantic unit classification: assign each text line, figure, table, or equation region one of 14 semantic classes: Title, Author, Mail, Affiliation, Section, First-Line, Para-Line, Equation, Table, Figure, Caption, Page-Footer, Page-Header, and Footnote.
- Parent finding: for each unit $u_i$, identify its nearest parent $u_{p_i}$ in reading order across all pages.
- Relation classification: classify the relationship between each child-parent pair as one of three types: Connect (sequential continuation, e.g., paragraph lines), Contain (hierarchical nesting, e.g., section to subsection), or Equality (same-level siblings, e.g., two subsections under the same section).
Dataset. HRDoc contains two partitions based on layout homogeneity. HRDoc-Simple (HRDS) contains 1,000 ACL conference papers, all following a uniform two-column template, annotated via a rule-based PDF parser with human verification. HRDoc-Hard (HRDH) contains 1,500 arXiv papers spanning more than 30 layout styles from 17 research domains, annotated by compiling LaTeX source code with injected color tags to identify semantic regions, again followed by human review. In total, the dataset has approximately 2 million annotated semantic units.
All source documents carry open licenses. ACL papers are licensed under CC BY-NC-SA 3.0 or CC BY 4.0; arXiv papers were selected specifically for their CC BY-NC-SA 4.0 licenses.
STEDS metric. The paper proposes Semantic Tree-Edit-Distance Score (STEDS) to measure hierarchical reconstruction quality. For a ground-truth tree $T_D$ and predicted tree $\hat{T}_D$, STEDS is defined as:
$$\text{STEDS}(T_D, \hat{T}_D) = 1 - \frac{\text{EditDist}(T_D, \hat{T}_D)}{\max(|T_D|, |\hat{T}_D|)}$$
where $|T|$ is the node count of the tree, and a node match requires both label and text to agree. This is tighter than standard tree-edit distance metrics that do not condition on semantic content.
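A minimal sketch of STEDS using the zss tree-edit-distance package, packing class and text into each node label so that a match requires both to agree; the toy trees are illustrative, and this is not the authors' released implementation:

```python
# STEDS = 1 - EditDist(T, T_hat) / max(|T|, |T_hat|), with unit-cost edits.
from zss import Node, simple_distance

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children)

def steds(gt_root, pred_root):
    dist = simple_distance(gt_root, pred_root)
    return 1.0 - dist / max(tree_size(gt_root), tree_size(pred_root))

# Toy example: one of two nodes is mislabeled, so STEDS = 1 - 1/2 = 0.5.
gt = Node("Section|Intro", [Node("Para-Line|Hello world")])
pred = Node("Section|Intro", [Node("Caption|Hello world")])
print(steds(gt, pred))
```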
DSPS model. The Document Structure Parsing System uses a multi-modal bidirectional encoder that fuses five embedding types per semantic unit:
$$x_i = \text{LN}(x^{\text{text}}_i + x^{\text{layout}}_i + x^{\text{pos}}_i + x^{\text{vis}}_i + x^{\text{page}}_i)$$
The text embedding comes from Sentence-BERT, the layout embedding follows LayoutLMv2’s coordinate tokenization, position and page embeddings are learned, and the visual embedding uses RoIAlign over a ResNet-50 with FPN backbone. A bidirectional Transformer encoder processes the full page sequence to produce per-unit representations.
For parent finding, a structure-aware GRU decoder processes units in reading order across pages. Attention pooling produces a weighted hidden state $q_i$, and a soft-mask operation biases the parent distribution using a domain-prior matrix $\mathbf{M}_{cp}$ derived from observed child-parent type co-occurrence statistics:
$$P^{\text{par}}_{(i,j)} = \text{softmax}\!\left(\alpha(q_i, h_j) \cdot P^{\text{dom}}_{(i,j)}\right)$$
where $P^{\text{dom}}_{(i,j)} = \tilde{P}^{\text{cls}}_j \, \mathbf{M}_{cp} \, (P^{\text{cls}}_i)^T$ injects structural grammar knowledge into the attention distribution.
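A shape-level sketch of that domain prior: given per-unit class distributions and the $C \times C$ co-occurrence matrix, it scores how plausible unit $j$ is as the parent of unit $i$. The orientation of $\mathbf{M}_{cp}$ (parent classes on rows vs. columns) is an assumption:

```python
# P_dom(i,j) = p_cls(j) @ M_cp @ p_cls(i)^T: a scalar plausibility score.
import torch

C = 14                                      # HRDoc's 14 semantic classes
M_cp = torch.rand(C, C)                     # child-parent co-occurrence prior
p_cls_i = torch.softmax(torch.randn(C), 0)  # predicted class dist. of child i
p_cls_j = torch.softmax(torch.randn(C), 0)  # predicted class dist. of candidate j

p_dom_ij = p_cls_j @ M_cp @ p_cls_i         # scalar prior used to bias attention
```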
The full model is trained end-to-end with a combined loss:
$$\mathcal{L}_{\text{tol}} = \mathcal{L}_{\text{cls}} + \alpha_1 \mathcal{L}_{\text{par}} + \alpha_2 \mathcal{L}_{\text{rel}}$$
using Focal Loss for classification and relation subtasks and cross-entropy for parent finding.
What experiments were performed?
Baselines for semantic unit classification. Four methods are compared on both HRDS and HRDH:
- T1: Cascade-RCNN (vision-only object detector)
- T2: ResNet-50 + RoIAlign (vision features only, no spatial context)
- T3: Sentence-BERT (text-only, no visual context)
- T4: DSPS Encoder (full multi-modal bidirectional encoder, Task 1 only)
Baselines for hierarchical reconstruction. DocParser (Rausch et al., AAAI 2021) is the primary comparison. It uses Mask R-CNN for detection and heuristic rules for structure recovery, operating only within single pages. The DSPS model is evaluated in several ablated settings: with/without semantic modality, with/without visual modality, with/without the soft-mask operation, and in a page-level (no cross-page parent finding) versus document-level mode.
Evaluation metrics. Semantic unit classification uses per-class F1 score with a bounding-box IoU threshold of 0.65 and a confidence threshold of 0.5 for detection outputs. Hierarchical reconstruction uses both Micro-STEDS (weighted by node frequency) and Macro-STEDS (equal weight per class).
What are the outcomes/conclusions?
Semantic unit classification. The DSPS encoder (T4) achieves the highest Micro-F1 across nearly all categories on both dataset splits. On HRDS it reaches 99.52% Micro-F1 and 98.90% Macro-F1. On the harder HRDH split the margins narrow but DSPS still leads with 96.74% Micro-F1. One exception: Sentence-BERT (T3) outperforms DSPS on the Mail and Affiliation categories, suggesting that visual features can occasionally mislead when elements look similar but differ in content.
Hierarchical reconstruction. DSPS in full document-level mode outperforms DocParser by a wide margin on both splits. On HRDS, DocParser achieves 0.2361 Micro-STEDS versus DSPS at 0.8143, roughly a 3.4x improvement. Results on HRDH are lower for both systems, reflecting the greater layout diversity. The ablation results suggest each modality and the soft-mask operation each contribute meaningfully: removing any one of them decreases STEDS by several points. Forcing the decoder to operate page-locally (rather than cross-page) also causes a substantial drop, confirming that cross-page parent-finding is a genuine challenge this dataset exposes.
Limitations the authors acknowledge. The HRDH dataset is more difficult and shows lower absolute performance for all methods, indicating the problem is far from solved for heterogeneous layouts. The paper does not report results on documents outside the academic paper domain, so generalization to business documents, legal forms, or scanned materials is uncharacterized. The annotation pipeline relies on heuristic rules and LaTeX source code, which limits applicability to documents where clean digital sources are unavailable.
Mapping to Matter vs. Meaning Framework
How does HRDoc’s 14-class taxonomy and 3 relation types map to our Matter vs. Meaning framework?
Element Classes
| HRDoc Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Title | Text | Title | The main document title. |
| Author | Text | Author | Author name(s). |
| Mail | Text | Address | Author email addresses. |
| Affiliation | Text | Affiliation | Institutional affiliations. |
| Section | Text | SectionHeader | Section and subsection headings. |
| First-Line | Text | Body | The first line of a paragraph (used to mark paragraph boundaries in line-level annotation). |
| Para-Line | Text | Body | Continuation lines of a paragraph. |
| Equation | Formula | DisplayEquation | Block-level mathematical expressions. |
| Table | Table | (primitive only) | Tabular regions (detected at region level, not line level). |
| Figure | Image | Figure | Visual content (detected at region level, not line level). |
| Caption | Text | Caption | Text describing a Figure or Table. |
| Page-Footer | Text | PageFooter | Running footers. |
| Page-Header | Text | PageHeader | Running headers. |
| Footnote | Text | Footnote | Bottom-of-page notes. |
Relation Types
| HRDoc Relation | M-vs-M Mapping | Notes |
|---|---|---|
| Connect | Sequential reading order | Continuation (e.g., paragraph line to next line). Maps to reading order edges. |
| Contain | header_of / hierarchical nesting | Parent-child (e.g., section to subsection). Maps to hierarchical structure. |
| Equality | Sibling grouping | Same-level siblings (e.g., two subsections under one section). No direct M-vs-M relation equivalent; closest is implicit via shared parent. |
Structural notes: HRDoc is one of few datasets that provides cross-page parent-child relationships. The First-Line / Para-Line distinction is unique: it encodes paragraph boundaries at the line level, which is finer-grained than most datasets that annotate whole paragraphs as single boxes.
Coverage gaps: No explicit classes for ListItem, BibEntry, Abstract, PageNumber, Formula (inline), or any form primitives. The dataset is restricted to scientific papers (ACL and arXiv).
Reproducibility
Models
The DSPS model uses a standard Transformer-based bidirectional encoder (architecture details are not fully enumerated in the paper, but it follows the LayoutLMv2 lineage). The visual backbone is ResNet-50 with FPN and RoIAlign. The text encoder is Sentence-BERT. The GRU decoder is a single-layer GRU with attention pooling. No parameter counts are reported. Model weights are not explicitly released in the paper, but the code repository at https://github.com/jfma-USTC/HRDoc is stated to contain source code and datasets.
Algorithms
Training uses multi-task learning with the combined loss $\mathcal{L}_{\text{tol}} = \mathcal{L}_{\text{cls}} + \alpha_1 \mathcal{L}_{\text{par}} + \alpha_2 \mathcal{L}_{\text{rel}}$. Focal Loss is used for classification and relation subtasks; cross-entropy for parent finding. The specific values of $\alpha_1$ and $\alpha_2$ are not reported in the paper. Optimizer, learning rate, batch size, and training duration are not specified. The soft-mask matrix $\mathbf{M}_{cp}$ is computed from training data statistics with additive smoothing (pseudo-count of 5) to handle unseen relation pairs.
Data
HRDoc is split into HRDoc-Simple (1,000 ACL papers, uniform two-column layout) and HRDoc-Hard (1,500 arXiv papers, 30+ layout styles across 17 domains). Together they contain approximately 2 million annotated semantic units at line level (with Equation, Table, and Figure units at region level rather than line level). Semantic unit positions for multi-line objects are obtained using Cascade-RCNN.
- HRDoc-Simple annotation pipeline: PDFPlumber extracts text lines; heuristic rules assign preliminary labels; human annotators verify labels and assign parent-relation pairs.
- HRDoc-Hard annotation pipeline: LaTeX source is re-compiled with injected color tags per semantic block; a rule-based parser uses both heuristics and color to assign preliminary labels; human annotators verify and correct.
Licenses: all ACL papers are CC BY-NC-SA 3.0 or CC BY 4.0; all arXiv papers are CC BY-NC-SA 4.0. The dataset is stated to be publicly available at the GitHub repository, though a separate Hugging Face release is not mentioned in the paper.
Train/test splits are reported in the dataset statistics (Figure 3 of the paper) but the exact document counts per split are not spelled out in the prose.
Evaluation
- Semantic unit classification F1: IoU threshold 0.65 for detection output matching; confidence threshold 0.5 for predicted boxes. Standard per-class F1 and Micro/Macro aggregates.
- STEDS: Tree-edit-distance-based metric; a node match requires label and text to agree exactly. Both Micro-STEDS (frequency-weighted) and Macro-STEDS (class-equal) are reported.
- DocParser comparison: The DocParser inference was adapted to the HRDoc task by excluding table structure recovery and matching detected objects to ground-truth units via 0.7 IoU overlap.
The paper does not report error bars, significance tests, or multiple training runs. Seed sensitivity is not discussed.
Hardware
No hardware details are reported in the paper: no GPU type, training time, memory requirements, or throughput figures.
BibTeX
@inproceedings{ma2023hrdoc,
title={HRDoc: Dataset and Baseline Method toward Hierarchical Reconstruction of Document Structures},
author={Ma, Jiefeng and Du, Jun and Hu, Pengfei and Zhang, Zhenrong and Zhang, Jianshu and Zhu, Huihui and Liu, Cong},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={37},
number={3},
pages={1948--1956},
year={2023}
}
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
TL;DR
DocLayNet is a human-annotated document layout dataset containing 80,863 pages from six diverse domains: Financial Reports, Scientific Articles, Laws & Regulations, Government Tenders, Manuals, and Patents. Unlike previous large-scale datasets (like PubLayNet or DocBank) that rely on automatic matching of XML/LaTeX to PDFs, DocLayNet provides human annotations for 11 distinct classes (including Page-header, Page-footer, Formula, Footnote, and Section-header).
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the dataset itself, which addresses the lack of high-quality, human-annotated layout data for non-scientific documents. It provides a rigorous benchmark for evaluating layout analysis models on complex, real-world documents.
Secondary: $\Psi_{\text{Evaluation}}$
The authors perform extensive baselining to measure the “human ceiling” (inter-annotator agreement) vs. model performance, demonstrating that even top baseline models lag behind human layout perception on average, though top models exceed human agreement on the most visually distinctive classes.
What is the motivation?
Prior datasets like PubLayNet (~300K pages in the authors’ rounding; ~358K pages in the actual release) and DocBank (500K pages) relied on weak supervision (automatically aligning PDF bounding boxes with source XML/LaTeX files). While scalable, this approach has two critical limitations:
- Domain Restriction: They are almost exclusively scientific papers, which have clean, predictable columns and few complex layouts.
- Label Noise: Automatic matching often produces loose bounding boxes or misses elements entirely, creating a noisy ground truth.
The authors built DocLayNet to address a generalization problem: models trained on PubLayNet fail on financial reports, invoices, or legal documents. By providing human-verified distinctions (e.g., distinguishing a “Footnote” from body “Text”), the dataset requires models to learn layout semantics rather than fitting to the uniform templates of scientific publishing.
What is the novelty?
- Diversity (The “Big 6”): It spans six distinct document styles, covering layouts that penalize models overfit to scientific articles:
- Financial Reports: Annual reports with complex tables/branding, including both free-style formats and formal SEC filings.
- Scientific Articles: Standard double-column layouts (comparable to PubLayNet).
- Laws & Regulations: Dense text, numeric lists.
- Government Tenders: Forms, lists, tables.
- Manuals: Graphic-heavy instruction booklets.
- Patents: Unique technical flow with diagrams.
- Human Annotation: Every page was annotated by trained humans (after passing two exam levels in Phase 3), providing tight bounding boxes and resolving ambiguities that confuse automated parsers.
- Inter-Annotator Agreement as a Ceiling: The authors double- and triple-annotated roughly 10% of the data to calculate a human agreement ceiling (IAA). The best-performing baseline models achieve $\approx 72\text{-}77\%$ mAP@0.5:0.95, while human agreement is $\approx 82\text{-}83\%$, leaving a meaningful gap for the community to close.
- Pre-defined document-wise splits: The dataset ships with fixed train/test/validation splits partitioned on document boundaries, preventing pages from the same document from appearing in multiple sets.
What experiments were performed?
The authors establish baselines using four standard object detection architectures: Mask R-CNN R50-FPN 3x, Mask R-CNN R101-FPN 3x, Faster R-CNN R101-FPN 3x (all via detectron2 defaults), and YOLOv5x6. All models were pretrained on COCO 2017 and fine-tuned on $1025 \times 1025$ RGB page images. Performance is reported as mAP@0.5:0.95 (COCO standard, averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05).
Beyond the baseline comparison, the paper runs four ablation studies:
- Dataset size ablation (learning curve): Mask R-CNN R50 was trained on increasing fractions of DocLayNet to assess whether the dataset is large enough. The learning curve flattens between the 80% and 100% data mark (within a 1% error bar from five runs), suggesting the dataset is close to saturated for this model family and that further size increases would yield diminishing returns.
- Label set reduction: The 11-class label set was reduced to 6, 5, and 4 classes by collapsing or dropping labels, to understand the trade-off between label richness and model performance.
- Document-wise vs. page-wise split: The same Mask R-CNN R50 was trained on both the official document-wise split and a randomized page-wise split to quantify how much information leaks when pages from the same document appear in both train and test.
- Cross-dataset comparison: Mask R-CNN R50 was trained on each of PubLayNet, DocBank, and DocLayNet, then evaluated on the other datasets using their common labels, to test transfer robustness.
Methodology: Class Definitions
How does DocLayNet fit into our Matter vs. Meaning framework? The paper defines 11 distinct classes, chosen for their semantic distinctiveness.
| DocLayNet Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Text | Text | Body | Standard body text. |
| Title | Text | Title | The main document title. |
| Section-header | Text | SectionHeader | Crucial for RAG chunking hierarchies. |
| Page-header | Text or Image | PageHeader | Running headers. DocLayNet does not distinguish text-based from image-based headers (logos, decorative bands); both are labeled identically. |
| Page-footer | Text or Image | PageFooter | Running footers. Same caveat as Page-header: image-based footers are not separated from text-based ones. |
| Footnote | Text | Footnote | Bottom-of-page notes. Link to Body via reference. |
| Caption | Text | Caption | Text describing a Picture/Table. The guideline requires exactly one Caption per Picture or Table. |
| Table | Table | (primitive only) | DocLayNet does not assign a further role. The Table visual primitive is the full classification; no TOC, FormGroup, or other role distinction is made. |
| Picture | Image | Figure / Chart / Diagram | DocLayNet collapses all non-text visual content into one class. Photos, data charts, and technical diagrams are indistinguishable here. M vs. M separates these as distinct roles; expect precision loss when mapping. |
| Formula | Formula | DisplayEquation | Display only. Inline math is merged into Text. Equation numbers are included inside the Formula bounding box (unlike M vs. M, which defines FormulaNumber as a separate Text role). |
| List-item | Text | ListItem | Bullets/Enumerations. Each item is a separate atomic instance (not a grouped container). A List-item is defined as a paragraph with hanging indentation; bullet symbols are not required. |
Critical for RAG & Extraction: The explicit `Page-header` and `Page-footer` classes are invaluable. Most other datasets merge these into “Text”. By isolating them, you gain control:
- Filter: Remove “Page 42” or “Annual Report” repetition to clean up prose chunks for RAG.
- Extract: Retain critical metadata (e.g., Policy Numbers, Claimant Names) often found only in the margins.
Constraint: If your dataset lumps these into `Text`, you lose both capabilities (the prose is polluted, and the metadata is unrecoverable).
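As a concrete illustration, the filter is a few lines over layout predictions. The class names below follow DocLayNet's label set; the region dicts (`label`, `bbox`, `text`) are a hypothetical intermediate format, not an official schema.

```python
# Hypothetical intermediate format: one dict per detected region on a page.
REGIONS = [
    {"label": "Page-header", "bbox": (30, 10, 990, 40), "text": "Annual Report 2021"},
    {"label": "Text", "bbox": (30, 60, 990, 400), "text": "Revenue grew by 12%..."},
    {"label": "Page-footer", "bbox": (30, 990, 990, 1015), "text": "Page 42"},
]

MARGIN_CLASSES = {"Page-header", "Page-footer"}

def split_for_rag(regions):
    """Separate body content (for chunking) from margin metadata (for extraction)."""
    body = [r for r in regions if r["label"] not in MARGIN_CLASSES]
    margins = [r for r in regions if r["label"] in MARGIN_CLASSES]
    return body, margins

body, margins = split_for_rag(REGIONS)
print([r["text"] for r in body])     # clean prose chunks for RAG
print([r["text"] for r in margins])  # margin metadata retained for extraction
```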
What are the outcomes/conclusions?
Best model: YOLOv5x6 at 76.8% overall mAP. The Mask R-CNN and Faster R-CNN variants cluster tightly between 72.4% and 73.5%, indicating that pixel-level segmentation masks do not help beyond bounding boxes for this task.
The human IAA ceiling sits at 82-83% mAP, leaving a gap of roughly 6 to 10 percentage points that current vision models have not closed.
Per-Class Breakdown (Table 2, YOLOv5x6 vs. Human IAA)
| Class | YOLOv5x6 AP | Human IAA | Notes |
|---|---|---|---|
| Caption | 77.7 | 84-89 | Below human; caption/text confusion |
| Footnote | 77.2 | 83-91 | Below human; small and visually similar to text |
| Formula | 66.2 | 83-85 | Hardest text class; visually variable and easily confused with surrounding text |
| List-item | 86.2 | 87-88 | Near-human; very frequent |
| Page-footer | 61.1 | 93-94 | Lowest model AP; humans are highly consistent here |
| Page-header | 67.9 | 85-89 | Model struggles; high human consistency |
| Picture | 77.1 | 69-71 | Model exceeds human IAA |
| Section-header | 74.6 | 83-84 | Below human; confused with title/text |
| Table | 86.3 | 77-81 | Model exceeds human IAA |
| Text | 88.1 | 84-86 | Model exceeds human IAA |
| Title | 82.7 | 60-72 | Model far exceeds human IAA |
| All | 76.8 | 82-83 | |
The hardest classes for models are Page-footer (61.1), Formula (66.2), and Page-header (67.9). On the most visually distinctive and abundant classes (Text, Table, Picture, Title), YOLOv5x6 surpasses human annotator agreement. The gap is concentrated in semantically ambiguous classes like Footnote, Section-header, and the header/footer pair.
The paper attributes the model’s advantage on Text, Table, and Picture to their abundance and visual distinctiveness, while the human advantage on header/footer classes reflects that humans bring page-level context that the model lacks.
Learning Curve
The dataset size ablation (Figure 5) shows the mAP learning curve flattens between the 80% and 100% data marks, within the 1% error bar estimated from five full-dataset runs. The authors interpret this as evidence that the dataset is large enough for this model family: adding more similar data would not yield substantial gains. Future improvement is more likely to come from better annotation consistency, data augmentation, or additional document categories.
Label Reduction Tradeoff (Table 3)
Reducing the label set from 11 classes to 5 (keeping List-item, Picture, Section-header, Table, and Text; mapping Caption, Footnote, and Formula into Text; excluding Page-footer and Page-header) raises the overall Mask R-CNN R50 mAP from 72% to 78%. Going to 4 classes (the closest match to PubLayNet, further collapsing List-item into Text) yields 77%.
The 5-point gain from collapsing to 5 classes comes almost entirely from removing Page-footer and Page-header, the two classes where model performance is lowest. If your downstream application does not need to distinguish headers and footers, reducing the label set is a practical option.
Methodological Warning: Document-Wise vs. Page-Wise Split (Table 4)
Many documents in DocLayNet have a distinctive visual style. If pages from the same document appear in both the training set and the test set (a “page-wise” random split), the model can memorize document-specific styling. Table 4 quantifies the effect: page-wise splitting inflates mAP by roughly 10 to 12 points compared to the document-wise split used in the official benchmark (72% vs. 84% for 11 classes with Mask R-CNN R50).
DocLayNet ships with a pre-defined document-wise split. When benchmarking your own model, use that split. Page-wise splits produce results that look better on paper but do not reflect real generalization performance.
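If you build any custom subset, split on document identity, not pages. A minimal sketch using scikit-learn's `GroupShuffleSplit` (the `doc_ids` grouping field is illustrative; prefer DocLayNet's shipped split whenever applicable):

```python
from sklearn.model_selection import GroupShuffleSplit

def document_wise_split(pages, doc_ids, test_size=0.1, seed=0):
    """Split page records so no document contributes pages to both sides.

    pages: list of page records; doc_ids: parallel list of document identifiers.
    """
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(pages, groups=doc_ids))
    # Grouping by document id prevents document-specific styling from
    # leaking between train and test.
    return [pages[i] for i in train_idx], [pages[i] for i in test_idx]
```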
Cross-Dataset Generalization (Table 5)
To test whether layout diversity translates to robustness, the authors trained Mask R-CNN R50 on each available dataset and evaluated on the others using their common label classes. A PubLayNet-trained model achieves 93% mAP on its own test set but only 30% when transferred to DocLayNet. A DocLayNet-trained model achieves 78% on its own test set and drops only to 59% on PubLayNet and 47% on DocBank: a far smaller performance gap. This supports the core motivation that diverse training data leads to more generalizable layout models.
Reproducibility
Annotation Geometry
Each annotation is an axis-aligned rectangle (no polygons, no rotated boxes). The CCS tooling enforced this as a hard constraint: annotators could only draw vertically oriented, non-overlapping rectangular bounding boxes.
Annotations are at block level, not line, word, or character level. A whole paragraph is one box; a whole table is one box; a whole figure is one box. The one deliberate exception is List-item: each item in a list receives its own individual box rather than a single grouped container, which differs from PubLayNet’s approach.
Text-cell snapping: For all purely text-based classes (everything except Table and Picture), the CCS tool automatically shrinks the user-drawn box to the minimum bounding box enclosing the PDF text cells within it. This produces pixel-accurate tight boxes without requiring annotators to hand-draw precise boundaries. Table and Picture have no underlying text-cell layer, so these were drawn entirely by hand.
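The snapping operation is straightforward to emulate: shrink a drawn box to the union of the text cells it contains. The sketch below is a plausible reimplementation, not CCS code; the 0.5 containment threshold is an assumption.

```python
def snap_to_text_cells(drawn_box, text_cells, min_overlap=0.5):
    """Shrink a hand-drawn box to the minimal box enclosing its text cells.

    drawn_box and text_cells use (x0, y0, x1, y1); min_overlap is the fraction
    of a cell's area that must fall inside the drawn box for the cell to count.
    """
    def contained_fraction(cell):
        ix0, iy0 = max(cell[0], drawn_box[0]), max(cell[1], drawn_box[1])
        ix1, iy1 = min(cell[2], drawn_box[2]), min(cell[3], drawn_box[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = (cell[2] - cell[0]) * (cell[3] - cell[1])
        return inter / area if area else 0.0

    inside = [c for c in text_cells if contained_fraction(c) >= min_overlap]
    if not inside:
        return drawn_box  # no text cells to snap to; keep the drawn box
    return (min(c[0] for c in inside), min(c[1] for c in inside),
            max(c[2] for c in inside), max(c[3] for c in inside))
```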
Non-overlapping: Enforced at the tooling level. Boxes cannot overlap.
Coordinate space: Pixel coordinates in the $1025 \times 1025$ PNG image, stored in augmented COCO JSON format.
Annotation Campaign
The campaign ran in four phases. Phases 1 (data selection) and 2 (label definition and guideline creation) required a small expert team. Phases 3 and 4 involved the full annotator group.
- Staff: 40 annotators assembled; 32 admitted to production after passing two exam levels in Phase 3.
- Phase 3 (training): 974 pages were reference-annotated by one proficient core team member. Annotators completed the same pages (blinded from the reference) at two complexity levels, each with a practice and an exam part. Only those meeting quality thresholds on both levels advanced to production.
- Phase 4 (production): Roughly three months of annotation.
- Throughput: 20 to 60 seconds per page, depending on layout complexity.
- Tooling: IBM Corpus Conversion Service (CCS), a cloud-native annotation platform with a visual PDF overlay interface. Annotators could not see each other’s annotations, by design, to avoid biasing the IAA measurement.
- Guideline: Over 100 pages; planned for public release alongside the dataset.
Models
- Architectures: Mask R-CNN R50-FPN 3x, Mask R-CNN R101-FPN 3x, Faster R-CNN R101-FPN 3x (all via detectron2 default configs), YOLOv5x6.
- Pretraining: COCO 2017 weights for all models.
- Input: $1025 \times 1025$ pixels, RGB.
- No custom hyperparameters reported beyond default detectron2/YOLOv5 configurations.
- No official pretrained checkpoints for these specific experiments were released.
Algorithms
- Training procedure: Not detailed. The paper states all models used default detectron2 or YOLOv5 configurations with COCO 2017 pretrained weights. No optimizer, learning rate schedule, batch size, epoch count, or data augmentation settings are reported.
- Loss functions: Standard object detection losses (as defined by detectron2 and YOLOv5 defaults).
Data
- Size: 80,863 unique pages; 7,059 double-annotated, 1,591 triple-annotated (91,104 total page-annotator instances).
- Total bounding boxes: 1,107,470 across the full dataset (train/test/val: 941,123 / 99,816 / 66,531 bounding boxes respectively).
- Domains: Financial Reports, Scientific Articles, Laws & Regulations, Government Tenders, Manuals, Patents.
- Language: ~95% English; also German (2.5%), French (1.0%), Japanese (1.0%).
- Splits: Pre-defined document-wise train/test/validation split. A page-wise split is also available but should not be used for fair evaluation (see Table 4 finding above).
- Format: Augmented COCO JSON with PNG page images ($1025 \times 1025$), original PDFs, and sidecar JSON files with parsed text and text-cell coordinates.
- License: CDLA-Permissive-1.0.
Source Acquisition
Documents were collected from publicly available repositories with open intellectual property:
- Scientific Articles: arXiv
- Financial Reports: Company websites and financial data directory services (including SEC filings and free-style annual reports)
- Laws & Regulations / Government Tenders: Government office websites, mixed across independent providers to maximize layout variability
- Patents: Patent databases
- Manuals: Various public sources
Licensing caveat: The CDLA-Permissive-1.0 license covers the annotations only. The underlying page images inherit the copyright of their source documents, which varies significantly by domain. SEC filings (mandatory US government disclosures) and patents (published by patent offices for public disclosure) are effectively public domain and are the safest categories for commercial training use. arXiv papers carry per-paper licenses that are not uniformly commercial-safe (many are CC-BY-NC or all-rights-reserved). Financial reports beyond SEC filings (free-style annual reports from company websites) and manuals (“various public sources”) have unverified per-document licenses. Government tenders vary by jurisdiction. If you need a commercially safe training subset, filter by doc_category in the dataset (values: financial_reports, scientific_articles, laws_and_regulations, government_tenders, manuals, patents) and restrict to patents and the SEC filing portion of financial_reports.
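Mechanically, that filter is a short pass over the COCO metadata. The sketch below assumes a `doc_category` field is present on image records; how SEC filings are flagged within financial_reports is not specified in the paper, so the `sec_ids` set is a caller-supplied assumption.

```python
import json

SAFE_CATEGORIES = {"patents"}

def commercially_safe_subset(coco_path, sec_ids=frozenset()):
    """Keep patents plus caller-identified SEC-filing pages from a COCO file."""
    with open(coco_path) as f:
        coco = json.load(f)
    keep = {img["id"] for img in coco["images"]
            if img.get("doc_category") in SAFE_CATEGORIES or img["id"] in sec_ids}
    coco["images"] = [im for im in coco["images"] if im["id"] in keep]
    coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] in keep]
    return coco
```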
Selection criteria: The authors required medium-to-large documents (more than 10 pages) with technically dense content, prioritizing pages rich in tables, figures, and captions. Scanned documents were excluded wherever possible. Scans introduce rotation and skew that make rectangular bounding-box annotation unreliable, and the CCS annotation tool’s text-cell snapping feature (which tightens boxes to enclosed PDF text cells) cannot function without a programmatic PDF layer.
Page subsampling: Title pages were always included. For the remaining pages, a pre-trained PubLayNet object detection model was used to estimate figure and table density, biasing selection toward pages with more complex visual content.
Rendering Pipeline
The dataset images are rasterized vector PDFs, not photographs or scans. The pipeline was:
- Source PDFs uploaded to IBM’s Corpus Conversion Service (CCS)
- CCS parsed each PDF and extracted text-cell coordinates (character bounding boxes from the PDF layer)
- Each page was rendered to a PNG at $1025 \times 1025$ pixels
- Annotators drew bounding boxes on the rendered PNG inside the CCS interface; the tool automatically snapped purely text-based boxes to the minimum enclosing text-cell boundary for pixel-accurate annotation
- The released dataset includes the PNG page images, the original PDF pages, and sidecar JSON files linking parsed text and text-cell geometry to each page image
This rendering-from-PDF approach is significant: it means bounding boxes can be tightly fit to PDF text cells, and it is why scanned pages (which lack a programmatic text layer) were excluded.
Evaluation
- Metric: `mAP@0.5:0.95` (COCO standard API), macro-averaged over 11 classes.
- Human ceiling: Computed as pairwise mAP between independent annotators on triple-annotated pages; yields 82-83% overall.
- Baselines: Detectron2 and YOLOv5 defaults; no architecture search or hyperparameter tuning reported.
Hardware
- Not specified in the paper. Standard GPU training with detectron2 and YOLOv5 default configurations.
BibTeX
@inproceedings{pfitzmann2022doclaynet,
title={DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
author={Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
booktitle={Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages={3743--3751},
year={2022},
publisher={ACM},
doi={10.1145/3534678.3539043}
}
DAD: Dense Article Dataset
TL;DR
DAD is a human-annotated layout dataset of ~5,980 pages from 450 open-access scientific articles, labeled with 43 fine-grained semantic classes spanning front matter, body, and back matter. The authors also propose a Weighted Bounding Box Regression Loss to improve segmentation boundary quality, achieving 96.26% pixel-level F1 with DeepLabV3+.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is the DAD dataset itself: a fine-grained, human-annotated benchmark covering 43 semantic classes for scientific document layout. The annotation files are released under MIT, though the underlying document images carry per-article licenses from Elsevier, Springer, SAGE, Wiley, and IEEE that are not enumerated in the repository.
Secondary: $\Psi_{\text{Method}}$: The authors also introduce a Weighted Bounding Box Regression Loss as a training auxiliary that improves segmentation boundary sharpness in dense document layouts.
What is the motivation?
Existing layout datasets like PubLayNet use coarse labels (Text, Title, Figure, Table, List) that are insufficient for downstream extraction tasks. Distinguishing an Affiliation from an Abstract, or a Funding Statement from the body text, requires finer-grained labels. DAD addresses this “missing middle” by providing 43 semantic classes that fully reconstruct the logical structure of a scientific paper, including metadata and back-matter declarations that no prior dataset annotated.
What is the novelty?
The primary novelty is the 43-class taxonomy, which decomposes generic document regions into specific semantic roles across three zones: front matter (e.g., affiliation, abstract, doi), body content (e.g., math_formula, nomenclature, code), and back matter (e.g., funding_info, ethics, conflict_int).
The secondary novelty is the Weighted Bounding Box Regression Loss. In standard semantic segmentation, pixel-wise Cross-Entropy classification:
$$\mathcal{L}_{\text{CE}} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
tends to produce “blobby” masks that merge adjacent lines of text. Adding a regression auxiliary task forces the network to respect the rectangular boundaries of document elements, improving separation in dense multi-column regions. The exact loss formulation is not available without access to the paywalled paper; conceptually, it adds a per-pixel bounding box coordinate regression term weighted by class membership, yielding a combined objective $\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda \mathcal{L}_{\text{box}}$.
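A conceptual PyTorch sketch of such a combined objective is below. Since the paper's exact formulation is not publicly verifiable, the background masking and foreground normalization are assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def combined_loss(class_logits, class_target, box_pred, box_target, lam=1.0):
    """Pixel-wise CE plus an auxiliary per-pixel box regression (conceptual sketch).

    class_logits: (B, C, H, W); class_target: (B, H, W) long, 0 = background.
    box_pred / box_target: (B, 4, H, W), normalized (x0, y0, x1, y1) of the
    element each pixel belongs to.
    """
    ce = F.cross_entropy(class_logits, class_target)
    fg = (class_target > 0).unsqueeze(1).float()           # mask out background
    box = F.smooth_l1_loss(box_pred * fg, box_target * fg, reduction="sum")
    box = box / fg.sum().clamp(min=1.0)                    # mean over foreground
    return ce + lam * box
```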
What experiments were performed?
The authors benchmark the dataset using semantic segmentation rather than object detection:
- Architecture: DeepLabV3+ (a standard pixel-segmentation baseline).
- Loss Function: Standard Cross-Entropy combined with the proposed Box Regression Loss.
- Data Split: 450 open-access research articles from 14 journals across 5 publishers (Elsevier, Springer, SAGE, Wiley, IEEE), comprising approximately 5,980 annotated pages.
- Annotation: Human-annotated using the Microsoft VOTT tool; annotations saved in labelme JSON format.
- Cross-dataset: The same trained model was also evaluated on PubLayNet to test generalization.
What are the outcomes/conclusions?
The authors report the following results for the segmentation approach:
- DAD Performance: 96.26% pixel-level F1 using DeepLabV3+.
- Ablation: The bounding box regression loss improved F1 by +1.99 points on DAD.
- Cross-dataset: The same method achieved 97.11% F1 on PubLayNet. However, PubLayNet uses only 5 coarse classes versus DAD’s 43, so this comparison reflects performance on a simpler labeling scheme rather than true generalization.
Strengths: DAD is one of the few datasets to annotate funding_info, author_bio, and ethics statements, making it relevant for RAG applications that require provenance or compliance checking.
Commercial use warning: The repository is MIT-licensed, but that license covers only the annotations. The underlying document images come from Elsevier, Springer, SAGE, Wiley, and IEEE journals, none of which are enumerated with per-article licenses in the repo. “Open access” in academic publishing spans a spectrum from CC-BY (commercial use permitted) to CC-BY-NC and CC-BY-NC-ND (commercial use prohibited). Elsevier and SAGE in particular frequently publish under non-commercial Creative Commons terms. Without a per-article license audit, using DAD to train commercial models carries meaningful legal risk. The MIT license on the repo cannot grant rights to content the dataset authors do not own. For comparison, PubLayNet explicitly flags its underlying PDFs as non-commercial despite also having permissively-licensed annotations; DAD makes no equivalent disclosure. If you need a commercially defensible scientific layout dataset, DocLayNet (CDLA-Perm-1.0, with cleared sources) is the more defensible choice.
Limitations: The reported 96% F1 is a pixel-level segmentation metric, which can overstate practical performance. A model could achieve 96% pixel accuracy by correctly segmenting bulk body text while missing rare classes like doi or date entirely. Object-level mAP is the more actionable metric for document extraction tasks. With only 450 papers and 43 classes, the dataset also has a long-tail problem where rare classes have very few training examples.
Comparison to Alternatives
| Feature | DAD | DocLayNet | PubLayNet |
|---|---|---|---|
| Classes | 43 | 11 | 5 |
| Granularity | Fine-grained (Funding, DOI, Affiliation) | Mid (Header, Caption, Footnote) | Coarse (Text, Title) |
| Size | 450 docs (~6k pages) | 80,863 pages | ~360k pages |
| Use Case | Parsing logic & Metadata extraction | General robust detection | Pre-training backbone |
Taxonomy & Definitions
DAD is notable for decomposing “Text” regions into specific semantic roles, particularly in the header and footer metadata.
Front Matter & Metadata
- `title`, `author_name`, `affiliation`, `contact_info`
- `abstract`, `keywords`, `doi`, `date`, `copyright`
- `journal`, `publisher`, `editor`, `corresponding_author`
- `article_history`, `MSC` (Math Subject Classification)
Body Content
- `section_heading`, `subheading`, `core_text`
- `figure`, `table`, `caption`
- `math_formula`, `code`, `list`, `index`
- `nomenclature`, `abbreviation`, `note`
Back Matter & Declarations
- `reference`, `appendice`, `author_bio`
- `funding_info`, `acknowledgment`
- `conflict_int` (Conflict of Interest), `author_contribution`
- `ethics`, `consent_publication`
- `availability_of_data`, `additional_file`, `URLs_to_supplementary`
- `publisher_note`, `highlights`
Mapping to Unified Taxonomy
DAD is an Information Extraction (IE) focused dataset. It splits the Body and Author roles into dozens of fine-grained semantic buckets. For Layout Analysis, we collapse these distinctions.
Front Matter
| DAD Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| title | Text | Title | |
| author_name | Text | Author | |
| affiliation | Text | Author | Grouped with Author block. |
| corresponding_author | Text | Author | |
| contact_info | Text | Address | Emails/metrics. |
| abstract | Text | Abstract | |
| keywords | Text | Abstract | Semantic metadata block. |
| highlights | Text | Abstract | Summary points. |
| doi | Text | Value | Metadata. |
| date | Text | Value | Metadata. |
| copyright | Text | Credit | |
| journal | Text | PageHeader | Often running header. |
| publisher | Text | Credit | |
| editor | Text | Credit | |
| article_history | Text | Body | Intro metadata block. |
| MSC | Text | Body | Math Subject Classification. |
Content
| DAD Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| section_heading | Text | SectionHeader | |
| subheading | Text | SectionHeader | |
| core_text | Text | Body | |
| figure | Image | Figure | |
| table | Table | Table | |
| caption | Text | Caption | |
| math_formula | Formula | DisplayEquation | |
| code | Text | Code | |
| list | Text | ListItem | Granularity mismatch: dataset annotates the container block, not individual items. |
| index | Text | Index | |
| nomenclature | Text | Body | List-like definition block. |
| abbreviation | Text | Body | |
| note | Text | Footnote |
Back Matter (Declarations)
Most of these are pure IE Classes that structurally render as standard paragraphs. We map them all to Body.
| DAD Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| reference | Text | BibEntry | Citations. |
| appendice | Text | SectionHeader | Appendix title/block. |
| author_bio | Text | Body | |
| funding_info | Text | Body | |
| acknowledgment | Text | Body | |
| conflict_int | Text | Body | |
| author_contribution | Text | Body | |
| ethics | Text | Body | |
| consent_publication | Text | Body | |
| availability_of_data | Text | Body | |
| additional_file | Text | Body | |
| URLs_to_supplementary | Text | Body | |
| publisher_note | Text | Body |
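The collapse described in these tables is mechanical. A sketch of the role mapping as a plain dict (front-matter and content classes per the tables above; everything unlisted, i.e. the back-matter declarations, defaults to Body):

```python
# Logical-role mapping mirroring the tables above.
DAD_TO_ROLE = {
    "title": "Title", "author_name": "Author", "affiliation": "Author",
    "corresponding_author": "Author", "contact_info": "Address",
    "abstract": "Abstract", "keywords": "Abstract", "highlights": "Abstract",
    "doi": "Value", "date": "Value", "journal": "PageHeader",
    "copyright": "Credit", "publisher": "Credit", "editor": "Credit",
    "section_heading": "SectionHeader", "subheading": "SectionHeader",
    "appendice": "SectionHeader", "core_text": "Body",
    "figure": "Figure", "table": "Table", "caption": "Caption",
    "math_formula": "DisplayEquation", "code": "Code", "list": "ListItem",
    "index": "Index", "note": "Footnote", "reference": "BibEntry",
}

def to_role(dad_class):
    # Back-matter declarations (funding_info, ethics, ...) all collapse to Body.
    return DAD_TO_ROLE.get(dad_class, "Body")
```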
Reproducibility
Models
- Architecture: DeepLabV3+ with a standard backbone (details in the paper).
- Weights: No pre-trained weights released. The training code repository is available under MIT.
- Framework: TensorFlow 2.x.
Algorithms
- Loss: Cross-Entropy + Weighted Bounding Box Regression Loss (auxiliary).
- Optimizer: Not publicly specified; the paper is paywalled, so full training hyperparameters (optimizer, learning rate, batch size, epochs) cannot be verified.
Data
- 450 open-access research articles from 14 journals across 5 publishers (Elsevier, Springer, SAGE, Wiley, IEEE), comprising approximately 5,980 annotated pages.
- Annotation: Human-annotated using the Microsoft VOTT tool; output saved in labelme JSON format.
- Public availability: Dataset released on GitHub under the MIT license.
- Note: The authors separate the dataset repository (`Dense_Article_Dataset_DAD`) and the training code repository (`Document_Layout_Segmentation`).
- Train/val/test split: Not specified in publicly available materials. The exact split ratio and construction method require access to the paywalled paper.
- Inter-annotator agreement: Not reported. The annotations are described as human-annotated but no IAA metrics are provided.
Evaluation
- Metric: Pixel-level F1 score (segmentation-style). This differs from the object-level mAP standard used by most detection-focused datasets.
- Baselines: DeepLabV3+ with and without the proposed Box Regression Loss.
- Cross-dataset: PubLayNet used for generalization evaluation.
- Limitation: Pixel-level F1 can overstate practical performance for extraction tasks. No object-level mAP reported.
- Statistical rigor: No error bars, confidence intervals, or multi-run variance reported in publicly available materials.
Hardware
- Not reported in the paper. As an editorial estimate, DeepLabV3+ training on a dataset of this size (~6k pages) is feasible on a single modern GPU.
BibTeX
@article{DAD,
author = {Markewich, Logan and Zhang, Hao and Xing, Yubin and Lambert-Shirzad, Navid and Jiang, Zhexin and Lee, Roy Ka-Wei and Li, Zhi and Ko, Seok-Bum},
title = {Segmentation for document layout analysis: not dead yet},
journal = {International Journal on Document Analysis and Recognition (IJDAR)},
year = {2022},
month = {Jan},
day = {13},
issn = {1433-2825},
doi = {10.1007/s10032-021-00391-3},
url = {https://doi.org/10.1007/s10032-021-00391-3}
}
IIIT-AR-13K: Graphical Object Detection in Annual Reports
TL;DR
IIIT-AR-13K is a manually annotated dataset of 13,415 pages from annual reports of 29 companies, covering five graphical object categories: Table, Figure, Natural Image, Logo, and Signature. At the time of release, it was the largest manually annotated dataset for graphical object detection, explicitly separating data-driven graphics (charts) from natural photographs and adding document-specific elements (logos, signatures) absent from scientific-paper datasets.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the dataset itself, addressing the absence of high-quality, human-annotated training data for graphical object detection in business documents.
Secondary: $\Psi_{\text{Evaluation}}$
The authors benchmark two standard object detection architectures on the new dataset and run a cross-dataset transfer study to demonstrate that IIIT-AR-13K is more effective as training data than much larger, automatically annotated datasets for most target benchmarks.
What is the motivation?
Prior to this work, document layout analysis datasets were dominated by scientific articles (arXiv, PubMed). Those documents follow rigid column templates and lack graphical elements common in business contexts. The authors identify three gaps:
- Domain restriction: Existing datasets (TableBank, PubLayNet, DeepFigures) contain predominantly scientific papers. Business documents such as annual reports have highly heterogeneous, often artistic layouts that differ substantially from academic typesetting.
- Missing classes: No existing dataset included Logo or Signature classes. These are critical for workflows such as branding authentication and legal document verification.
- Figure conflation: Datasets like PubLayNet use a single `Figure` label for both data-driven charts and natural photographs. IIIT-AR-13K separates these into `Figure` (charts, plots, diagrams, sketches) and `Natural Image` (photographs), enabling more precise encoder selection for downstream tasks.
What is the novelty?
The main novelty is the class taxonomy, which explicitly decomposes visual content into five categories:
| Class | Definition | Unified Taxonomy Mapping |
|---|---|---|
| Table | Structured row-column data. | Table |
| Figure | Data-driven charts, plots, diagrams, sketches. | Chart / Diagram |
| Natural Image | Photographs and raster scenes. | Figure |
| Logo | Company branding elements. | Logo |
| Signature | Handwritten verification zones. | Signature |
The dataset is also notable for its language diversity. Annual reports were collected in English, French, Japanese, and Russian, introducing script variation absent from prior layout datasets.
A secondary claim is that smaller, high-quality human annotations can outperform much larger automatically generated datasets for transfer to new domains, which the cross-dataset study provides partial evidence for.
What experiments were performed?
Baseline detection on IIIT-AR-13K. Faster R-CNN (ResNet-101, PyTorch) and Mask R-CNN (ResNet-101, TensorFlow/Keras) were trained on the IIIT-AR-13K training split and evaluated on its validation and test splits. Both models were initialized from MS-COCO pretrained weights. Input images were resized to $800 \times 1024$ preserving aspect ratio. The evaluation metric is mAP (mean average precision) alongside per-class Precision, Recall, and F-measure.
Cross-dataset transfer study. To establish whether IIIT-AR-13K is effective as a pretraining source, Mask R-CNN was trained separately on:
- TableBank (LaTeX only, Word only, LaTeX+Word combined)
- PubLayNet
- IIIT-AR-13K
Each trained model was then evaluated on the test sets of ICDAR-2013, ICDAR-POD-2017, cTDaR, UNLV, Marmot, and PubLayNet, using their common label (Table). This covered three fine-tuning conditions: no fine-tuning, full fine-tuning on the target dataset’s training split, and partial fine-tuning with only 1k randomly selected images from the target dataset.
What are the outcomes/conclusions?
Baseline Performance on IIIT-AR-13K (Mask R-CNN, Test Set)
| Class | Recall | Precision | F-measure | mAP |
|---|---|---|---|---|
| Table | 0.9711 | 0.9715 | 0.9713 | 0.9654 |
| Figure | 0.8898 | 0.7872 | 0.8385 | 0.8686 |
| Natural Image | 0.9179 | 0.8625 | 0.8902 | 0.8945 |
| Logo | 0.6330 | 0.3920 | 0.5125 | 0.4699 |
| Signature | 0.9175 | 0.7876 | 0.8525 | 0.9115 |
| Average | 0.8659 | 0.7601 | 0.8130 | 0.8220 |
Mask R-CNN outperforms Faster R-CNN across all classes. Table and Signature are the strongest classes. Logo is the weakest: the authors attribute this directly to class imbalance in the training set (approximately 11k table instances vs. 0.3k logo instances), not to visual ambiguity alone.
Cross-Dataset Transfer (No Fine-tuning)
The IIIT-AR-13K-trained model (Experiment-V) achieves the best mAP on ICDAR-2013 (0.9393), cTDaR (0.7478), and UNLV (0.7996) without any target-domain fine-tuning. On Marmot, IIIT-AR-13K (0.8464) is effectively tied with TableBank-LaTeX (0.8465), which the paper acknowledges as the best on that benchmark. On ICDAR-POD-2017, the TableBank (LaTeX+Word) model outperforms it substantially (mAP 0.9035 vs. 0.7509). The authors attribute IIIT-AR-13K’s general advantage to its business-document diversity, but the ICDAR-POD-2017 result is a meaningful exception where scientific-domain training data is more appropriate.
Fine-tuning Findings
When any of the larger datasets (TableBank, PubLayNet) are fine-tuned with the full IIIT-AR-13K training set, performance on IIIT-AR-13K’s test set rises to match or closely approach training directly on IIIT-AR-13K. More practically: fine-tuning with only 1k randomly selected IIIT-AR-13K images is sufficient to bring models pretrained on larger datasets to near-parity with full IIIT-AR-13K training. This suggests IIIT-AR-13K is efficient as a fine-tuning target even when the full training set is not used.
Reproducibility
Models
- Faster R-CNN: ResNet-101 backbone, PyTorch. Five anchor scales (32, 64, 128, 256, 512) and five anchor ratios (1, 2, 3, 4, 5). Pretrained on MS-COCO.
- Mask R-CNN: ResNet-101 backbone, TensorFlow/Keras. Anchor ratios 0.5, 1, 2; anchor scales 32, 64, 128, 256, 512. Pretrained on MS-COCO.
- No pretrained checkpoints released for either model.
Algorithms
- Faster R-CNN: SGD optimizer, initial learning rate 0.001, multiplied by 0.1 every 5 epochs. Batch size 4.
- Mask R-CNN: 80 total epochs. First 20: all FPN and subsequent layers. Next 20: FPN + last 4 ResNet-101 layers. Final 40: all layers. Learning rate 0.001, momentum 0.9, weight decay 0.0001. Batch size 1.
- Input resolution: $800 \times 1024$ (aspect ratio preserved) for both models.
Data
- Pages: 13,415 total (9,333 train / 1,955 validation / 2,120 test); within each company, pages were randomly assigned 70% / 15% / 15% to the three splits.
- Bounding boxes: ~23,000 total across five classes: ~16k Table, ~3k Figure, ~3k Natural Image, ~0.5k Logo, ~0.6k Signature.
- Sources: Publicly available annual reports in English, French, Japanese, and Russian; more than ten years of filings from 29 companies.
- Annotation: Manual bounding box annotation; axis-aligned rectangles only. No inter-annotator agreement metrics are reported, and the annotation guidelines are not described in detail.
- Format: Not specified in the paper; available via static download from the project website.
- License: Project site states “Copyright 2020 All rights reserved.” Intended for academic research. Verify terms before commercial use.
Evaluation
- Metrics: mAP, Precision, Recall, F-measure (per class and averaged). The paper does not specify the IoU threshold for mAP (likely mAP@0.5 given the era and baselines, but not stated explicitly).
- Transfer benchmarks: ICDAR-2013 (238 pages, table only), ICDAR-POD-2017 (2,417 pages, table/figure/equation), cTDaR (modern + archival), UNLV (427 pages with tables), Marmot (2,000 Chinese/English pages), PubLayNet (validation split).
- Baseline: No hyperparameter search; standard detectron / Keras default configs adapted for document resolution.
Hardware
- NVIDIA Titan X GPU, 12GB memory (both Faster R-CNN and Mask R-CNN).
- Training duration not reported.
BibTeX
@inproceedings{mondal2020iiit,
title={IIIT-AR-13K: A New Dataset for Graphical Object Detection in Documents},
author={Mondal, Ajoy and Lipps, Peter and Jawahar, C. V.},
booktitle={Document Analysis Systems (DAS)},
year={2020},
doi={10.1007/978-3-030-57058-3\_16}
}
DocBank: A Benchmark Dataset for Document Layout Analysis
TL;DR
DocBank introduces a 500K-page benchmark for document layout analysis with fine-grained token-level annotations across 12 semantic categories, constructed via weak supervision from arXiv LaTeX sources. The authors inject semantic-specific colors into LaTeX documents, recompile them, and extract token-level labels by mapping RGB values back to structure types. This enables both sequence labeling and object detection workflows.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ (benchmark dataset with construction pipeline, splits, statistics, and baselines)
- Secondary: $\Psi_{\text{Method}}$ (weak supervision construction procedure and object detection conversion)
- Secondary: $\Psi_{\text{Evaluation}}$ (custom area-based metric and multimodal baseline comparisons)
What is the motivation?
Document layout analysis typically emphasizes visual features while underutilizing textual content, despite text providing strong signals for semantic role classification. Existing labeled datasets are either smaller-scale, image-only, or lack token-level annotations, making it difficult to fairly compare NLP, computer vision, and multimodal approaches. High-quality manual annotation at token-level is expensive; the authors target a scalable, low-cost labeling approach using LaTeX structure.
What is the novelty?
The paper formalizes document layout analysis as follows: given a document $D$ composed of a discrete token set $t = \{t_0, t_1, \dots, t_n\}$, where each token $t_i = (w, (x_0, y_0, x_1, y_1))$ consists of a word $w$ and its bounding box, and a set of semantic categories $C = \{c_0, c_1, \dots, c_m\}$, the goal is to find a function $F : (C, D) \to S$ that produces a prediction set:
$$ S = \{(\{t_0^0, \dots, t_{n_0}^0\}, c_0), \dots, (\{t_0^k, \dots, t_{n_k}^k\}, c_k)\} $$
The core innovations are:
- Weak supervision from LaTeX semantics: The authors inject structure-specific font colors into LaTeX source code for semantic units (abstract, author, caption, etc.), recompile the documents, then recover token labels by mapping extracted RGB colors to structure types.
- Token-level annotations at scale: Each token is represented as `(word, bounding box)`, enabling NLP-style sequence labeling while remaining convertible to object detection annotations.
- Conversion to object detection format: Same-label tokens are grouped into connected components using BFS with x/y proximity thresholds, then bounding boxes are computed for each component to produce region-level annotations (sketched below).
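The grouping step is a standard connected-components pass over a proximity graph; a sketch with illustrative pixel thresholds (the paper does not publish its exact x/y values):

```python
from collections import deque

def group_tokens(tokens, x_gap=10, y_gap=4):
    """Group same-label tokens into regions via BFS over box proximity.

    tokens: list of (label, (x0, y0, x1, y1)). O(n^2) for clarity.
    """
    def near(a, b):
        return (a[0] - x_gap <= b[2] and b[0] - x_gap <= a[2] and
                a[1] - y_gap <= b[3] and b[1] - y_gap <= a[3])

    regions, seen = [], set()
    for i, (label, _) in enumerate(tokens):
        if i in seen:
            continue
        comp, queue = [], deque([i])
        seen.add(i)
        while queue:                       # BFS over proximate same-label tokens
            j = queue.popleft()
            comp.append(j)
            for k, (lk, bk) in enumerate(tokens):
                if k not in seen and lk == label and near(tokens[j][1], bk):
                    seen.add(k)
                    queue.append(k)
        boxes = [tokens[j][1] for j in comp]
        regions.append((label, (min(b[0] for b in boxes), min(b[1] for b in boxes),
                                max(b[2] for b in boxes), max(b[3] for b in boxes))))
    return regions
```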
What experiments were performed?
Dataset Construction
- Scale: 500K document pages from arXiv papers published 2014-2018 (400K Train, 50K Val, 50K Test), sampled proportionally to preserve the natural year distribution.
- Labels: 12 semantic structure types: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title.
- Pipeline:
  - Semantic coloring: Inject `\color{fontcolor}{...}` edits into the LaTeX source.
  - PDF Compilation: Generate visually distinct pages.
  - Token Extraction: Use PDFPlumber (built on PDFMiner) to extract text lines and non-text elements (like `#LTFigure#`, `#LTLine#`) with bounding boxes.
  - Labeling: Assign labels based on the RGB color of the first character in the token.
- Reading Order: Tokens are sorted top-to-bottom, left-to-right to support sequence models.
Baselines & Training
The authors frame layout analysis as sequence labeling over a serialized 2D document.
- NLP Baselines: BERT and RoBERTa (text-only sequence labeling).
- Multimodal Baseline: LayoutLM (text + 2D layout embeddings, no image embeddings).
- Vision Baseline: Faster R-CNN with ResNeXt-101 (trained on converted object detection annotations).
- Ensemble: Combining ResNeXt-101 object detections with LayoutLM token predictions.
Evaluation Metric
Instead of standard BIO tagging, the authors propose an area-based metric to handle the spatial nature of the task:
$$ \text{Precision} = \frac{\text{area}(\text{GT tokens} \cap \text{detected tokens})}{\text{area}(\text{all detected tokens})} $$
$$ \text{Recall} = \frac{\text{area}(\text{GT tokens} \cap \text{detected tokens})}{\text{area}(\text{all GT tokens})} $$
$$ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
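Because GT and predicted tokens are drawn from the same extracted token set, the intersection reduces to set membership. A per-class sketch, assuming tokens are hashable (x0, y0, x1, y1) boxes:

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def area_f1(gt_tokens, pred_tokens):
    """Area-weighted precision/recall/F1 for one class."""
    gt, pred = set(gt_tokens), set(pred_tokens)
    inter = sum(area(t) for t in gt & pred)
    p = inter / max(sum(area(t) for t in pred), 1e-9)
    r = inter / max(sum(area(t) for t in gt), 1e-9)
    f1 = 2 * p * r / max(p + r, 1e-9)
    return p, r, f1
```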
What are the outcomes/conclusions?
Outcomes
LayoutLM substantially improves over text-only baselines and pure vision baselines. The ensemble achieves the best performance.
- LayoutLM$_{\text{LARGE}}$ (0.9350 F1) outperforms BERT$_{\text{BASE}}$ (0.8770 F1) and Faster R-CNN (0.9051 F1).
- The ensemble of ResNeXt-101 + LayoutLM$_{\text{LARGE}}$ reaches 0.9488 F1.
Limitations
- Domain restriction: Limited to arXiv papers (LaTeX source required).
- Tokenization artifacts: Mixed-color tokens are assigned the color of the first character, introducing potential label noise.
- Non-text handling: Non-text elements are treated as special tokens (e.g., `##LTFigure##`), which may underspecify their semantic content compared to visual features.
- Dependency reliability: The pipeline depends on successful LaTeX compilation; compilation errors directly degrade annotation quality.
Mapping to Unified Taxonomy
DocBank’s token-level annotation allows for fine-grained mapping.
| DocBank Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Abstract | Text | Abstract | The paper abstract section. |
| Author | Text | Author | Author names and affiliations. |
| Caption | Text | Caption | Text describing a Figure or Table. |
| Equation | Formula | DisplayEquation | Block-level mathematical regions. |
| Figure | Image | Figure | Visual content (charts, plots, images). |
| Footer | Text | PageFooter | Running footers (page numbers, dates). |
| List | Text | ListItem | Tokens part of a list structure. Dataset Violation: Label is List (container) but applies to content tokens. |
| Paragraph | Text | Body | Standard prose. |
| Reference | Text | BibEntry | Bibliography items. |
| Section | Text | SectionHeader | Section and alignment titles. |
| Table | Table | Table | Tabular regions. |
| Title | Text | Title | The main paper title. |
Reproducibility
Models
- Text-only baselines: BERT (BASE: 110M params, LARGE: 340M params) and RoBERTa (same sizes), used as sequence labeling models over tokenized document text.
- Multimodal baseline: LayoutLM (BASE and LARGE), pre-trained on IIT-CDIP Test Collection 1.0. Only text and 2D layout position embeddings were used during fine-tuning; image embeddings were disabled.
- Vision baseline: Faster R-CNN with ResNeXt-101 backbone, pre-trained on ImageNet.
- Checkpoints: Fine-tuned weights for all baselines are available in the Model Zoo (linked in artifacts above) under Apache-2.0.
Data
- Source: arXiv papers from 2014-2018, covering Physics, Mathematics, Computer Science, and other fields.
- Scale: 500K pages total (400K Train / 50K Val / 50K Test), randomly sampled to preserve the natural year distribution without balancing.
- Availability: Annotations and code released publicly on GitHub under Apache-2.0; also mirrored on Hugging Face. The underlying arXiv PDFs are not redistributed - users download them directly from arXiv, where each paper carries its own author-chosen license. Apache-2.0 covers only the annotation layer.
Algorithms
- Sequence labeling: AdamW optimizer, initial learning rate $5 \times 10^{-5}$, max sequence block size $N = 512$.
- Object detection: Faster R-CNN with ResNeXt-101 backbone, pre-trained on ImageNet via Detectron2.
- LayoutLM pre-training: Pre-trained on IIT-CDIP Test Collection 1.0 using two objectives: Masked Visual-Language Model (MVLM) and Multi-label Document Classification (MDC). Image embeddings were not used during fine-tuning on DocBank; only text and 2D layout position embeddings were active.
Hardware
- Training: 8x NVIDIA V100 GPUs, batch size 10 per GPU. Fine-tuning LayoutLM took approximately 5 hours per epoch over 400K pages.
Evaluation
- Toolkit: HuggingFace Transformers for BERT/RoBERTa/LayoutLM baselines; Detectron2 for Faster R-CNN.
- Note: LayoutLM was used without image embeddings; only text and 2D layout position embeddings were used during fine-tuning.
BibTeX
@inproceedings{li-etal-2020-docbank,
title = "{D}oc{B}ank: A Benchmark Dataset for Document Layout Analysis",
author = "Li, Minghao and Xu, Yiheng and Cui, Lei and Huang, Shaohan and Wei, Furu and Li, Zhoujun and Zhou, Ming",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2020.coling-main.82",
doi = "10.18653/v1/2020.coling-main.82",
pages = "949--960"
}
The Newspaper Navigator Dataset
TL;DR
Newspaper Navigator processes all 16.3 million pages from the Chronicling America archive using a fine-tuned Faster R-CNN (R50-FPN) to extract 7 classes of visual content. The resulting dataset combines a 3.5k-page human-annotated ground truth with roughly 102 million silver-label extractions (at a 0.9 confidence threshold; up to 198 million at 0.5), all placed in the public domain. The paper was recognized as Best Resource Paper Runner-up at CIKM 2020.
What kind of paper is this?
Dominant Basis: $\Psi_{\text{Resource}}$ - The primary contribution is the dataset, codebase, fine-tuned model weights, and search application released for the Digital Humanities community.
Secondary Basis: $\Psi_{\text{Method}}$ - The pipeline applies a modern object detection architecture to historic document images across 16.3 million pages.
This paper presents Newspaper Navigator, a massive dataset created by running a visual content extraction pipeline on the Chronicling America newspaper archive. It provides the codebase, fine-tuned models, and a search interface, making it a significant infrastructural contribution for Digital Humanities and Document Layout Analysis.
What is the motivation?
The Chronicling America collection contains over 16 million digitized newspaper pages with OCR text, but searching for specific visual content (like photos, maps, cartoons) was historically difficult because the metadata was primarily text-based and page-level. Unlocking this visual content requires automated methods to detect, classify, and extract visual regions from the scanned pages at scale, surpassing the limitations of keyword-only search.
What is the novelty?
- Headline extraction as a visual task: Prior work (e.g., Google Newspaper Search) identified headlines via font-size heuristics applied to OCR text. This paper instead detects headlines as image-level bounding boxes, treating them like any other visual element. This shift allows the detector to leverage the visual distinctiveness of headline typography rather than relying on imperfect OCR and metadata.
- Scale: Processing 16.3 million pages to extract visual content is a massive undertaking in the Digital Humanities space.
- Dataset: The resulting extracted dataset is the largest of its kind for historic newspapers.
- Pipeline: A complete pipeline utilizing a fine-tuned Faster R-CNN (R50-FPN) trained on the Beyond Words dataset (augmented with headlines/ads) to detect 7 classes of visual content.
- Application: A search interface utilizing image embeddings (ResNet-18, 512-dim; ResNet-50, 2,048-dim) for visual similarity search, enabling users to train “AI navigators” on the fly to find similar content.
What experiments were performed?
Model Architecture
The core extraction model is a Faster R-CNN with a ResNet-50 backbone and Feature Pyramid Network (FPN). It performs region proposal and classification by optimizing the standard Faster R-CNN multi-task loss (from Ren et al. 2015):
$$ \begin{aligned} L(\{p_i\}, \{t_i\}) &= \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^\ast) \\ &+ \lambda \frac{1}{N_{reg}} \sum_i p_i^\ast L_{reg}(t_i, t_i^\ast) \end{aligned} $$
Where $L_{cls}$ is the classification loss (log loss) and $L_{reg}$ is the regression loss (smooth L1 loss) for the bounding boxes.
Training Details
- Base Dataset: Beyond Words (3.5k pages), augmented with semi-automated annotations for headlines and advertisements.
- Hyperparameters: Learning rate of 0.00025, batch size of 8, 64 proposals per image. Augmentation was limited to `RESIZE_SHORTEST_EDGE` and `RANDOM_FLIP` (the only two augmentations supported by Detectron2 at the time). Training ran for 77 epochs (approximately 17 hours) on a single NVIDIA T4 with early stopping.
- Inference threshold: Detectron2's default confidence threshold of 0.05 is used during corpus processing. All predictions above this floor are retained in the dataset, allowing downstream users to apply their own cutoff.
- Search Embeddings: Both ResNet-18 (512-dim) and ResNet-50 (2,048-dim) embeddings are generated for all extracted content using ImageNet-pretrained weights. Neither is fine-tuned on newspaper data. ResNet-50 was the primary choice for the search interface; ResNet-18 was retained for its lower dimensionality and faster querying.
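The embedding side of the search application is reproducible with stock torchvision weights. A minimal sketch of the ResNet-18 variant; the `weights` argument follows current torchvision (the original repo used 2020-era APIs), and the image paths are placeholders:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained ResNet-18 with the classifier head removed -> 512-dim vectors.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1).squeeze(0)

# Rank extracted crops against a query crop by cosine similarity.
corpus_paths = ["crop_a.jpg", "crop_b.jpg"]          # placeholders
corpus = torch.stack([embed(p) for p in corpus_paths])
scores = corpus @ embed("query.jpg")                 # unit vectors -> cosine
print(sorted(zip(scores.tolist(), corpus_paths), reverse=True))
```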
What are the outcomes/conclusions?
The model achieves an overall mAP (COCO) of 63.4%, computed following the COCO standard: precision averaged over 101 recall values and 10 IoU thresholds from 0.50 to 0.95. Per-class results from Table 2 of the paper:
| Class | AP | Count in Val. Set |
|---|---|---|
| Photograph | 61.6% | 879 |
| Illustration | 30.9% | 206 |
| Map | 69.5% | 34 |
| Comic/Cartoon | 65.6% | 211 |
| Editorial Cartoon | 63.0% | 54 |
| Headline | 74.3% | 5,689 |
| Advertisement | 78.7% | 2,858 |
| mAP (COCO) | 63.4% | N/A |
| One Class | 75.1% | 9,931 |
Key Findings
Best and worst performing classes: Advertisement is the strongest class at 78.7% AP, reflecting the high visual consistency of display advertising. Illustration is the weakest at 30.9%, attributed to annotation heterogeneity in the training data (the boundary between an illustration and a photograph is not always clear, and the Beyond Words annotations reflect this ambiguity).
Classification ambiguity vs. detection failure: The “One Class” AP of 75.1% collapses all 7 classes into a single foreground label. The roughly 12-point gap between One Class AP (75.1%) and 7-class mAP (63.4%) indicates that most errors come from inter-class confusion, not from the detector failing to find regions at all.
CIKM recognition: The paper was awarded Best Resource Paper Runner-up at CIKM 2020.
19th Century Generalization
The paper tests on two sets of 500 annotated pages from earlier periods not well-represented in training data. Performance degrades substantially:
| Category | AP (1875-1900) | AP (1850-1875) |
|---|---|---|
| Headline | 51.6% | 21.2% |
| Advertisement | 44.7% | 7.3% |
| Illustration | 36.4% | N/A |
| One Class | 48.1% | 12.1% |
Approximately 10.4% of the full 16.3M-page corpus predates 1875. Users working with pre-1875 material should treat model predictions as lower-confidence silver data and consider targeted annotation or fine-tuning for that time period.
Visual Similarity Search
The search application demonstrates that generic ImageNet features (ResNet-18 and ResNet-50, not fine-tuned) are effective for retrieving visually similar historical content (e.g., finding maps of Virginia or cartoons depicting a specific political figure).
Dataset Details
1. The Ground Truth (The “Good” Part)
- Size: 3,559 full newspaper pages.
- Annotations: Manual bounding boxes for 7 visual classes.
- Format: COCO JSON (`beyond_words_data/trainval.json` in the repo).
- Scope warning: partial annotation by design. The Beyond Words crowdsourcing task asked volunteers to find specific visual elements (photographs, illustrations, maps, comics, editorial cartoons) and ignore everything else. Headlines and advertisements were added later by the paper authors. This means article text columns, bylines, datelines, pull quotes, column rules, and running heads are all unannotated background. Training a general layout model on this data will teach the model to treat normal text layout as negative examples. It is well-suited for a visual-content detector, not a comprehensive page layout annotator.
- Quality Warning: Even within the 7 annotated classes, there are significant recall issues. Many valid Headlines and Advertisements are unannotated (False Negatives), so training with “background” sampling may suppress valid detections of those classes too.
2. The Extracted Dataset (The “Big” Part)
- Size: Visual content extracted from 16,368,041 pages (99.998% success rate).
- Format: Metadata (JSON) containing predicted bounding boxes and links to full images (IIIF).
- Nature: Silver Data. These are model predictions, not human annotations.
- Confidence threshold system: All predictions with confidence ≥ 0.05 are retained, giving users the ability to apply their own threshold. At $\geq 0.9$, the corpus yields approximately 102 million total extractions across all classes (rising to roughly 198 million at $\geq 0.5$). The paper documents counts at three cuts (≥ 0.9, ≥ 0.7, ≥ 0.5); higher thresholds improve precision at the cost of recall.
- Warning: The Hugging Face mirror is currently missing the `ads` and `headline` classes (the bulk of the data). Use the original GitHub Releases for the full set.
3. Missing Pieces
This dataset was designed for visual content retrieval, not full page layout understanding. The following are entirely absent from the annotation scheme; not just missing, but actively treated as unlabeled background:
- Article text / body copy
- Column structure
- Bylines, datelines, pull quotes
- Tables
- Reading order
To build a complete newspaper layout model, you would need to combine the 3.5k Visual GT with raw OCR XML (ALTO/METS) from Chronicling America, which provides word-level bounding boxes for the text regions this dataset omits.
Mapping to Unified Taxonomy
How does Newspaper Navigator fit into our Matter vs. Meaning framework?
| NewsNav Class | Visual Primitive (The What) | Logical Role (The Why) | Notes |
|---|---|---|---|
| Headline | Text | Title | Distinct from a “Header” (running page header). This is the article title. |
| Advertisement | Image | Advertisement | Flow Breaking. Ads interrupt the main article text. Essential to classify separately to exclude them from RAG or processing. |
| Photograph | Image | Figure | Includes Caption. Standard halftone photos. |
| Illustration | Image | Figure | Includes Caption. Engravings/Drawings. |
| Map | Image | Diagram | Includes Caption. Geospatial data. |
| Comic | Image | Figure | Includes Caption. Often includes the strip’s title/dialogue. |
| Editorial Cartoon | Image | Figure | Includes Caption. Single-panel satire. |
Reproducibility
The project is open-source and public domain. While the original news-navigator.labs.loc.gov search application was retired in 2025, the code, model weights, and full dataset are preserved in the archived GitHub repository and its releases.
Models
- Architecture: Faster R-CNN with ResNet-50-FPN backbone, from Detectron2 Model Zoo.
- Pretraining: COCO 2017 weights (standard Detectron2 initialization).
- Weights: Released as `model_final.pth` via GitHub Releases (Unlicense).
- Search embeddings: ResNet-18 (512-dim) and ResNet-50 (2,048-dim), ImageNet-pretrained, not fine-tuned.
Algorithms
- Optimizer: SGD. Base learning rate 0.00025.
- Batch size: 8. Proposals per image: 64.
- Epochs: 77 (early stopping). Training time: ~17 hours.
- Augmentation: `RESIZE_SHORTEST_EDGE` and `RANDOM_FLIP` only (the only two augmentations supported by Detectron2 at the time).
- Image preprocessing: Input images downsampled by 6x before inference to reduce I/O and memory; Detectron2 handles any further resizing internally.
- Inference confidence threshold: 0.05 (Detectron2 default). All predictions above this floor are retained in the released dataset; users apply their own cutoff.
Data
- Training/validation set: Augmented Beyond Words annotations, 48,409 total bounding boxes across 3,559 pages and 7 classes, split 80/20 train/val (see Table 1 in the paper for per-class breakdown). Headlines and advertisements were added by the paper authors and are not crowdsource-verified; map annotations for 122 pages were added via keyword search.
- Extracted corpus: 16,368,041 pages (99.998% of Chronicling America) processed and released as silver-label predictions with IIIF image links.
- License: Unlicense (public domain) for both code and dataset.
- Note: The Hugging Face mirror is missing the `ads` and `headline` classes. Use GitHub Releases for the complete schema.
Evaluation
- Metric: mAP (COCO standard: precision averaged over 101 recall values, IoU thresholds 0.50-0.95).
- Validation set: World War I-era newspaper pages (held out from the Beyond Words annotations).
- Temporal generalization test: Two additional sets of 500 annotated pages each from 1875-1900 and 1850-1875, to assess degradation on pre-training-distribution material.
Hardware
- Training: Single NVIDIA T4 GPU (AWS `g4dn.2xlarge`), ~17 hours.
- Corpus inference: Two `g4dn.12xlarge` instances, each with 48 Intel Cascade Lake vCPUs and 4 NVIDIA T4 GPUs. Full 16.3M-page corpus processed in 19 days (~0.1 seconds/page).
- Dependencies: Detectron2 (2020-era). Requires older PyTorch/Python versions for exact reproduction.
BibTeX
@inproceedings{lee2020newspaper,
title={The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America},
author={Lee, Benjamin Charles Germain and Mears, Jaime and Jakeway, Eileen and Ferriter, Meghan and Adams, Chris and Yarasavage, Nathan and Thomas, Deborah and Zwaard, Kate and Weld, Daniel S.},
booktitle={Proceedings of the 29th ACM International Conference on Information and Knowledge Management},
year={2020},
pages={3055--3062},
publisher={ACM},
doi={10.1145/3340531.3412767}
}
HJDataset: Historical Japanese Documents with Complex Layouts
TL;DR
HJDataset provides 259,616 layout element annotations across 7 hierarchical categories for 2,271 pages of a 1953 Japanese biographical directory. In addition to bounding boxes and masks, the dataset includes parent-child dependency structures and reading orders. A semi-rule-based pipeline generates the annotations, with statistical quality control and human correction achieving an estimated 99.6% accuracy.
What kind of paper is this?
Dominant Basis: $\Psi_{\text{Resource}}$. The headline contribution is a large-scale historical document layout dataset with hierarchical annotations and reading order, filling a gap for Asian-language historical documents.
Secondary Basis: $\Psi_{\text{Method}}$. The semi-rule-based annotation pipeline (contour detection, CCL/RLSA segmentation, CNN-based region classification, and statistical error identification) is a meaningful methodological contribution for scalable dataset construction.
What is the motivation?
Deep learning methods for document layout analysis require large annotated datasets, but historical document datasets were small (DIVA-HisDB: 150 images; ENP: 528 images) and almost exclusively Western-language. Asian documents present unique challenges: vertical text orientation, right-to-left reading order within rows, and complex hierarchical structures that off-the-shelf tools cannot handle. The Japanese National Diet Library has millions of digitized scans, but no large-scale layout dataset existed to train models on them.
What is the novelty?
HJDataset is (to the authors’ knowledge) the first large-scale layout analysis dataset for historical Japanese documents. Its key contributions:
Hierarchical structure annotations. Each page is decomposed into a tree: page frame $\rightarrow$ rows $\rightarrow$ title regions / text regions $\rightarrow$ title / subtitle / other. Parent-child relationships are encoded via `parent_id` fields.
Reading order annotations. Sequential reading order is encoded via `next_id` fields, with $-1$ marking the page end. The dataset captures irregular reading orders caused by section headers disrupting the standard right-to-left flow (a traversal sketch appears after this list).
Semi-rule-based annotation pipeline. Rather than relying on expensive manual annotation, the pipeline combines:
- Contour detection for page frames
- Connected Component Labeling (CCL) and Run Length Smoothing Algorithm (RLSA) for row and region segmentation
- A NASNet Mobile CNN classifier (trained on 1,200 hand-labeled samples, 99% test accuracy) for region type refinement
- Statistical outlier detection (percentile filtering on element counts and gap widths) to flag errors for human correction
Quality control. The statistical approach identified 80% of the 616 corrected errors. After correction, the dataset achieves an estimated 99.6% annotation accuracy.
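To make the linked-list encoding concrete, here is a minimal traversal sketch. It assumes COCO-style annotation dicts carrying the `id` and `next_id` fields described above; the exact field layout in the release may differ:

```python
def reading_order(annotations):
    """Recover sequential reading order from next_id links (-1 = page end)."""
    by_id = {ann["id"]: ann for ann in annotations}
    targets = {ann["next_id"] for ann in annotations if ann["next_id"] != -1}
    # The chain head is the one element that nothing else points to.
    current = next(ann for ann in annotations if ann["id"] not in targets)
    order = [current]
    while current["next_id"] != -1:
        current = by_id[current["next_id"]]
        order.append(current)
    return order
```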
What experiments were performed?
Baseline object detection. Three models were trained on main-page layout elements using Detectron2 with R-50-FPN-3x backbone (pretrained on COCO), on a single NVIDIA RTX 2080Ti:
| Model | Page Frame | Row | Title Region | Text Region | Title | Subtitle | Other | mAP |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | 99.0 | 98.8 | 87.6 | 94.5 | 65.9 | 84.1 | 44.0 | 82.0 |
| Mask R-CNN | 99.1 | 98.5 | 89.5 | 86.8 | 71.5 | 84.2 | 39.8 | 81.3 |
| RetinaNet | 99.0 | 95.1 | 69.6 | 89.5 | 72.6 | 85.9 | 14.4 | 75.2 |
All values are mAP @ IoU [0.50:0.95] on the test set. Large-scale elements (Page Frame, Row) are near-perfect; small elements (Title, Other) are harder.
Transfer learning. Models pretrained on HJDataset main pages were fine-tuned on two targets: index pages and a different 1939 biographical directory. Results are mixed:
- Index pages (all 57 training samples): HJDataset initialization raised mAP from 34.4 (COCO init) to 47.1, a clear gain.
- Index pages (5-shot): The improvement was marginal: 10.0 (COCO) to 10.3 (HJDataset), within noise.
- Cross-publication (4 training samples from 12 annotated pages): HJDataset initialization raised mAP from 69.9 to 81.6, the strongest transfer result.
The cross-publication experiment is encouraging but limited: only 12 pages from a single additional source were annotated, making it hard to draw broad generalization conclusions.
What are the outcomes/conclusions?
- HJDataset fills a gap in historical Asian-language document layout data: large scale, hierarchical annotations, and reading order in a single package.
- The semi-automatic pipeline achieves an estimated 99.6% annotation accuracy at far lower cost than manual annotation. That estimate is based on a small sampling procedure (20 pages, repeated 3 times), so the true error rate may differ.
- Baseline detection models trained on HJDataset achieve mAP above 80% for Faster R-CNN and Mask R-CNN at IoU [0.50:0.95].
- Pretraining on HJDataset provides clear transfer benefits when fine-tuning with sufficient data, though the few-shot gains on index pages are marginal.
- The “Other” category (chapter headers and miscellaneous text) has only 148 total annotations, making detection performance on it unreliable (mAP as low as 14.4% for RetinaNet).
- Reading order and hierarchy are not evaluated. Despite being headline features, neither parent-child structure nor sequential reading order is quantitatively assessed. The experiments focus exclusively on bounding-box detection.
- The dataset is limited to a single publication type (1953 biographical directory), with highly uniform visual style and layout. Generalization evidence rests on only 12 manually annotated pages from one other source.
Mapping to Matter vs. Meaning Framework
How does HJDataset’s 7-class hierarchical taxonomy map to our Matter vs. Meaning framework?
| HJDataset Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Page Frame | Structure | (none) | The outer boundary of the printed area. A structural container, not a content element. |
| Row | Text | Body | A horizontal row of entries in the biographical directory. Serves as a spatial grouping container for the elements within it. |
| Title Region | Text | Title | A region containing a person’s name entry (the “title” of their biographical entry). |
| Text Region | Text | Body | A region containing the biographical details of an entry. |
| Title | Text | Title | The actual name text within a Title Region. Finer granularity than Title Region. |
| Subtitle | Text | Subtitle | Secondary identifying information (birth year, occupation) within a biographical entry. |
| Other | Text | SectionHeader | Chapter headers and miscellaneous text that does not fit the biographical entry structure. Only 148 total annotations; underrepresented. |
Structural notes: HJDataset is unique in providing explicit parent-child hierarchy (via parent_id) and reading order (via next_id). The hierarchy maps to: Page Frame $\rightarrow$ Row $\rightarrow$ Title Region / Text Region $\rightarrow$ Title / Subtitle / Other. Reading order is right-to-left within rows (vertical Japanese text), encoded as a linked list.
Coverage gaps: No explicit classes for Table, Image, Formula, PageHeader, PageFooter, or any form primitives. The dataset is limited to a single domain (biographical directories) with very uniform layout.
Reproducibility
Models
- Baselines: Faster R-CNN, Mask R-CNN, and RetinaNet with R-50-FPN-3x backbone, initialized from COCO pretrained weights.
- Region classifier: NASNet Mobile, trained from scratch on 1,200 hand-labeled + 250 synthetic mis-segmentation samples. Input size: 200 $\times$ 522.
- Pretrained model configs and weights are released via the GitHub repository (Detectron2 format).
Algorithms
- Training: SGD optimizer, base learning rate 0.00025, decay 0.1 every 30k iterations, 60k total iterations, batch size 2.
- Region classifier training: SGD, converges in 40 epochs.
- Annotation pipeline: Binarization $\rightarrow$ contour detection (page frames) $\rightarrow$ CCL + horizontal RLSA (rows) $\rightarrow$ CCL + vertical RLSA (text/title regions) $\rightarrow$ NASNet Mobile classification $\rightarrow$ reading order generation $\rightarrow$ statistical error detection $\rightarrow$ human correction.
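For reference, a minimal one-direction RLSA pass (our own illustrative implementation, not the authors' code). On a binary image where 1 marks ink, background runs shorter than the smoothing threshold are filled, merging nearby components into rows or regions:

```python
import numpy as np

def rlsa_horizontal(img: np.ndarray, threshold: int) -> np.ndarray:
    """Horizontal RLSA: fill background gaps of length <= threshold."""
    out = img.copy()
    for row in out:
        gap_start = None
        for x, v in enumerate(row):
            if v == 0 and gap_start is None:
                gap_start = x
            elif v == 1 and gap_start is not None:
                if x - gap_start <= threshold:
                    row[gap_start:x] = 1  # smear the short gap
                gap_start = None
    return out
```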
Data
- Source: Japanese Who’s Who biographical directory (Jinji Koshinroku, Vol. 17, 1953). Biographies of ~50,000 prominent Japanese citizens.
- Scale: 2,271 page images; 259,616 layout annotations across 7 categories.
- Page types: Main (2,048), Advertisement (87), Index (82), Other (54).
- Layout element classes (7): Page Frame, Row, Title Region, Text Region, Title, Subtitle, Other.
- Splits: 70% train / 15% validation / 15% test, stratified by page type.
| Category | Train | Val | Test | Total |
|---|---|---|---|---|
| Page Frame | 1,490 | 320 | 320 | 2,130 |
| Row | 7,742 | 1,657 | 1,660 | 11,059 |
| Title Region | 33,637 | 7,184 | 7,271 | 48,092 |
| Text Region | 38,034 | 8,129 | 8,207 | 54,370 |
| Title | 66,515 | 14,931 | 14,366 | 95,812 |
| Subtitle | 33,576 | 7,173 | 7,256 | 48,005 |
| Other | 103 | 16 | 29 | 148 |
| Total | 181,097 | 39,410 | 39,109 | 259,616 |
- Format: COCO-style JSON (bounding boxes in XYWH format, plus segmentation masks, `parent_id`, `next_id`, and image-level `category_id` for page type).
- Access: Annotations available via Dropbox (Apache-2.0). Images require a download request form due to copyright restrictions on the source publication.
- Annotation license: Apache-2.0. Image license is restricted (copyright held by original publisher).
Evaluation
- Metric: mAP @ IoU [0.50:0.95] (COCO-style).
- Baselines: Faster R-CNN, Mask R-CNN, RetinaNet; all evaluated on the same test split.
- Dataset is unbalanced: Title has ~96K annotations while Other has only 148, making rare-class evaluation less reliable.
- Not evaluated: Reading order accuracy and hierarchical structure correctness. No metrics or baselines are provided for these annotation types despite their prominence in the dataset description.
- Accuracy estimate: The 99.6% figure is extrapolated from manually checking 20 random pages (repeated 3 times, averaging 0.6% error). No inter-annotator agreement is reported.
Hardware
- Training: Single NVIDIA RTX 2080Ti GPU.
- Inference latency: Not reported.
BibTeX
@inproceedings{shen2020hjdataset,
title={A Large Dataset of Historical Japanese Documents with Complex Layouts},
author={Shen, Zejiang and Zhang, Kaixuan and Dell, Melissa},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year={2020},
pages={548--549},
doi={10.1109/CVPRW50498.2020.00282}
}
PubLayNet: The “ImageNet” of Document Layout Analysis
TL;DR
PubLayNet is a large-scale dataset for document layout analysis, providing over 360,000 annotated pages from PubMed Central scientific articles. By automatically aligning XML source text with PDF visual content, the authors created a dataset orders of magnitude larger than any prior public alternative, enabling effective training of deep object detection models. Its 5-class taxonomy (Text, Title, List, Table, Figure) became the de facto standard across the field until more diverse datasets like DocLayNet emerged.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the dataset and the automated annotation pipeline. By demonstrating that models trained on large-scale automatically generated annotations achieve strong performance and transfer well to other domains, PubLayNet helped shift the field’s focus toward scalable data synthesis.
Secondary: $\Psi_{\text{Evaluation}}$
The paper benchmarks Faster R-CNN and Mask R-CNN on the new dataset and runs two transfer learning experiments (ICDAR 2013 table detection and SPD insurance documents) to validate the dataset’s usefulness as a pretraining source.
What is the motivation?
Prior to 2019, document layout analysis suffered from a severe data scarcity problem. Datasets like ICDAR 2013 or UNLV contained only hundreds of pages, insufficient for training deep networks like Faster R-CNN or Mask R-CNN, which require thousands of examples to generalize.
The authors identified an opportunity in the duality of scientific publishing: PubMed Central (PMC) stores articles as both PDF (visual layout) and XML (logical structure). By systematically aligning the XML text with bounding boxes in the PDF, they could generate annotations for millions of pages without manual effort.
What is the novelty?
- Scale: At 360k+ pages across 6,500+ journals, it was orders of magnitude larger than any prior public dataset.
- Automated alignment pipeline: A robust heuristic algorithm maps logical XML nodes to visual PDF tokens using fuzzy string matching. The maximum Levenshtein distance allowed for a match ($d_{\max}$) adapts to the length of the target string ($l_{\text{target}}$):
$$ d_{\max} = \begin{cases} 0.2 \cdot l_{\text{target}} & \text{if } l_{\text{target}} \le 20 \\ 0.15 \cdot l_{\text{target}} & \text{if } 20 < l_{\text{target}} \le 40 \\ 0.1 \cdot l_{\text{target}} & \text{if } l_{\text{target}} > 40 \end{cases} $$
Shorter strings tolerate proportionally more mismatches (up to 20%), while longer strings use a tighter threshold (10%), reflecting the expectation that longer sequences provide more matching context. A small code sketch of this rule follows the list below.
- Proof of transfer: The paper demonstrates empirically that models pre-trained on this synthetic “silver” data generalize better to real-world document tasks than models pre-trained on ImageNet or COCO alone.
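A sketch of the length-adaptive rule (the function and its use are our illustration; the paper specifies the thresholds, not reference code). `distance` can be any Levenshtein implementation:

```python
def max_edit_distance(target: str) -> float:
    """d_max as a function of len(target), per the piecewise rule above."""
    l = len(target)
    if l <= 20:
        return 0.2 * l    # short strings tolerate up to 20% mismatch
    if l <= 40:
        return 0.15 * l
    return 0.1 * l        # long strings get the tight 10% threshold

def is_match(xml_text: str, pdf_text: str, distance) -> bool:
    return distance(xml_text, pdf_text) <= max_edit_distance(xml_text)
```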
What experiments were performed?
Three experiments were designed to evaluate the dataset:
Layout recognition on PubLayNet itself: Faster R-CNN and Mask R-CNN (ResNeXt-101-64x4d backbone) trained on PubLayNet and evaluated on the development and test sets using mAP@IoU[0.50:0.95].
Table detection transfer to ICDAR 2013: Models pretrained on the table-only subset of PubLayNet were fine-tuned on the 170 training pages from the ICDAR 2013 Table Competition and evaluated on the 238-page competition test set. Evaluated using precision, recall, and F1.
Transfer to a different domain (SPD documents): 20 Summary Plan Description (SPD) health insurance documents (2,131 pages) were manually annotated by the authors for Text, Table, and List. A 5-fold cross-document-validation was used to compare three initialization strategies: ImageNet backbone only, full COCO-pretrained model, and full PubLayNet-pretrained model. Zero-shot PubLayNet was also tested.
All training used the Detectron (v1) framework. PDF pages were rasterized using pdf2image.
What are the outcomes/conclusions?
Layout Recognition (Table III)
| Class | F-RCNN (Test) | M-RCNN (Test) |
|---|---|---|
| Text | 0.913 | 0.917 |
| Title | 0.812 | 0.828 |
| List | 0.885 | 0.887 |
| Table | 0.943 | 0.947 |
| Figure | 0.945 | 0.955 |
| Macro average | 0.900 | 0.907 |
Both models achieve mAP > 0.90 overall. Tables and Figures score highest (~0.94-0.96), attributed to their regular shapes and visual distinctiveness. Titles score lowest (~0.81-0.83), because titles are small and their appearance varies across journal templates.
ICDAR 2013 Table Detection (Table IV)
Fine-tuning a PubLayNet-pretrained F-RCNN on only 170 training pages achieves F1 = 0.968, matching the prior state of the art (Schreiber et al. 2017, F1 = 0.968), which required fine-tuning on roughly 1,600 samples starting from a general-purpose vision model. This demonstrates that pretraining on domain-relevant data substantially reduces the labeled data requirement.
SPD Domain Transfer (Table V)
On the SPD insurance documents, fine-tuning from PubLayNet outperforms COCO and ImageNet initializations for Text and List detection. For Table, COCO initialization is slightly stronger for F-RCNN, which the authors attribute to greater visual dissimilarity between SPD and PMC tables compared to SPD and PMC text. The zero-shot PubLayNet result is substantially below fine-tuned models (macro average ~0.47 vs. ~0.66), confirming that the domain gap between scientific and business documents is significant.
Taxonomy Mapping
PubLayNet defines five broad functional classes. Understanding their exact scope matters for downstream use:
| Class | XML Sources | Key Limitations |
|---|---|---|
| Text | Body paragraphs, abstract, affiliations, footnotes, figure & table captions | Captions are labeled as Text, not merged into Figure/Table boxes. Inline section titles (where the title runs into the first line of the section) are also merged here. |
| Title | Article title, standalone (sub)section titles, standalone figure/table labels | Standalone section headings ARE captured as Title. Only inline section titles are subsumed into Text. |
| List | List elements | Nested (child) lists are annotated as a single object with the parent. |
| Table | Main body of table only | Caption is a separate Text instance, not included in the Table box. |
| Figure | Main body of figure only | Caption is a separate Text instance. Sub-figures are grouped into one box. |
Mapping to Matter vs. Meaning Framework
How does PubLayNet’s 5-class taxonomy map to our Matter vs. Meaning framework?
| PubLayNet Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Text | Text | Body / Caption / Footnote / Abstract | Catch-all for all text regions. Captions are labeled as Text (not grouped with Figure/Table). Inline section titles are also merged here. Footnotes and abstracts are not separated. |
| Title | Text | Title / SectionHeader | Covers both the main paper title and standalone section headings. Only inline section titles (where the title runs into the first paragraph line) are subsumed into Text. |
| List | Text | ListItem | Nested child lists are annotated as a single object with the parent. No separation of individual list items. |
| Table | Table | (primitive only) | Main table body only. Caption is a separate Text instance, not included in the Table bounding box. |
| Figure | Image | Figure | Main figure body only. Caption is a separate Text instance. Sub-figures are grouped into one box. No distinction between charts, diagrams, and photographs. |
Coverage gaps: PubLayNet has no explicit classes for PageHeader, PageFooter, PageNumber, Footnote, Formula, Caption, or BibEntry. All of these are collapsed into the Text class. The absence of Formula is by design: the annotation pipeline explicitly removes tex-math and disp-formula XML nodes before matching, because their PDF rendering diverges too much from the XML source text for fuzzy matching to work reliably. The Caption role is particularly notable: it exists as a separate bounding box but shares the Text label, making it indistinguishable from body paragraphs without post-processing heuristics (e.g., proximity to Figure/Table boxes).
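As an illustration of such a post-processing heuristic (entirely our own sketch, not part of PubLayNet), one can flag a Text box as a likely caption when it sits directly below a Figure box with horizontal overlap. Boxes here are COCO XYWH tuples:

```python
def looks_like_caption(text_box, figure_boxes, max_gap=20.0):
    """True if text_box sits just below some figure box and overlaps it in x."""
    tx, ty, tw, th = text_box
    for fx, fy, fw, fh in figure_boxes:
        gap = ty - (fy + fh)                            # vertical gap below figure
        x_overlap = min(tx + tw, fx + fw) - max(tx, fx)  # shared horizontal extent
        if 0 <= gap <= max_gap and x_overlap > 0:
            return True
    return False
```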
Reproducibility
Note: The official IBM Data Asset eXchange page is offline. The code depends on the deprecated Detectron (v1) framework, which requires older PyTorch versions.
Models
- Architectures: Faster R-CNN and Mask R-CNN.
- Backbone: ResNeXt-101-64x4d, initialized from ImageNet pretrained weights.
- Pretrained weights: Both Faster R-CNN and Mask R-CNN checkpoints released (Apache-2.0).
Algorithms
- Training: 180k iterations. Base learning rate 0.01, reduced by factor of 10 at 120k and 160k iterations.
- Batch size: 8 GPUs, 1 image per GPU (effective mini-batch size of 8).
- PDF rendering: `pdf2image` package to rasterize PDF pages (see the sketch after this list).
- ICDAR fine-tuning: Base learning rate 0.001, reduced by 10 at iteration 100 of 200. Confidence threshold selected by 5-fold cross-validation on the 170 training pages.
Data
- Source: 1,162,856 PMCOA articles downloaded 3 October 2018.
- Splits (journal-level partitioning):
- Training: 340,391 pages / 3,311,660 instances
- Development: 11,858 pages / 127,815 instances (human-curated to remove egregious errors)
- Testing: 11,983 pages / 131,775 instances (human-curated)
- Quality control: Pages excluded if annotation coverage is below 99% for non-title pages, 90% for title pages.
- License: CDLA-Permissive-1.0 covers the annotations. However, the page images are rendered from PDFs obtained via the PMC FTP service, which explicitly states that individual PDFs are only available for non-commercially licensed articles. Since training requires images and annotations together, PubLayNet is not suitable for commercial model training regardless of the annotation license.
- Format: COCO JSON with rasterized page images.
Evaluation
- Metric: mAP@IoU[0.50:0.95] (COCO standard), macro-averaged over classes.
- ICDAR 2013: Official evaluation toolkit; precision/recall/F1 on 238 test pages.
- SPD transfer: 5-fold cross-document-validation on 20 manually annotated documents (2,131 pages).
Hardware
- Not specified in the paper beyond “8 GPUs.”
BibTeX
@inproceedings{zhong2019publaynet,
title={PubLayNet: largest dataset ever for document layout analysis},
author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno},
booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
pages={1015--1022},
year={2019},
organization={IEEE},
url={https://arxiv.org/abs/1908.07836}
}
DSSE-200: Document Semantic Structure Extraction Dataset
TL;DR
DSSE-200 is a 200-page human-annotated benchmark of raster document images (magazines and academic papers), with pixel-level labels for 6 classes: figure, table, section heading, caption, list, and paragraph. It was introduced with a Multimodal Fully Convolutional Network (MFCN) that fuses visual features with text embedding maps. Notably, DSSE-200 consists of raster images rather than PDFs, which precludes PDF-based post-processing approaches. The dataset URL appears to be dead.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ : The paper introduces two reusable datasets: DSSE-200 (a 200-page human-annotated benchmark) and a 135,000-image synthetic pre-training set generated from document templates. We classify this as Resource-dominant because the note’s purpose is to catalog the dataset, though the paper itself devotes more page space to its method.
Secondary: $\Psi_{\text{Method}}$ : The paper also introduces MFCN, a multimodal FCN that fuses visual features with a text embedding map derived from skip-gram word embeddings, plus unsupervised reconstruction and consistency auxiliary losses. The ablation study structure and SOTA comparisons are strong methodological indicators.
What is the motivation?
Document layout analysis at the time relied either on heuristic rule-based systems (exploiting PDF metadata or whitespace geometry) or on purely visual models trained on limited data. Both approaches have weaknesses: PDF-based methods fail on scanned or raster documents, while visual-only models lack the semantic grounding needed to distinguish visually similar but functionally different regions (e.g., a figure caption vs. body paragraph).
The paper argues that documents are inherently multimodal, carrying both visual structure and textual meaning, and that combining both signals in a single end-to-end trainable model should improve layout segmentation. The lack of a suitable benchmark mixing magazines and academic papers motivated the release of DSSE-200.
What is the novelty?
MFCN (Multimodal Fully Convolutional Network): The architecture consists of four parts: an encoder, two decoders, and a bridge. The encoder learns a hierarchy of visual features using dilated convolution blocks (five 3x3 dilated convolutions with $d = 1, 2, 4, 8, 16$) and unpooling-based skip connections. The bridge module concatenates the visual feature map with a text embedding map along the channel dimension. The main decoder outputs per-pixel class probabilities, while an auxiliary decoder handles unsupervised reconstruction during training.
The text embedding map is constructed by running OCR to localize words, embedding each word via a 128-dimensional skip-gram model trained on Wikipedia, averaging word embeddings within each sentence, and then filling each pixel within that sentence’s bounding region with the resulting sentence vector. For an $H \times W$ document image, this yields an embedding map of size $N \times H \times W$ (where $N = 128$).
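A minimal sketch of this construction (helper names are ours; the authors did not release code). `embed` stands in for the 128-dim skip-gram lookup, and sentence boxes come from OCR:

```python
import numpy as np

def text_embedding_map(height, width, sentences, embed, dim=128):
    """sentences: iterable of (words, (x0, y0, x1, y1)) OCR sentence boxes."""
    emb_map = np.zeros((dim, height, width), dtype=np.float32)
    for words, (x0, y0, x1, y1) in sentences:
        vec = np.mean([embed(w) for w in words], axis=0)  # sentence vector
        emb_map[:, y0:y1, x0:x1] = vec[:, None, None]     # fill the region
    return emb_map
```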
Two unsupervised auxiliary tasks regularize training when labeled data is scarce:
- Reconstruction loss: Given encoder activations $a_l$ at layer $l$ and their reconstructions $\hat{a}_l$ from the auxiliary decoder, the loss is:
$$L_{\text{rec}}^{(l)} = \frac{1}{C_l H_l W_l} \lVert a_l - \hat{a}_l \rVert_2^2, \quad l = 0, 1, 2, \ldots, L$$
- Consistency loss: For pixels within a bounding box region $b$ of size $H_b \times W_b$, the loss encourages intra-region feature consistency:
$$L_{\text{cons}} = \frac{1}{H_b W_b} \sum_{(i,j) \in b} \lVert p_{(i,j)} - p^{(b)} \rVert_2^2$$
where $p^{(b)}$ is the mean feature vector over all pixels in box $b$:
$$p^{(b)} = \frac{1}{H_b W_b} \sum_{(i,j) \in b} p_{(i,j)}$$
These auxiliary objectives allow leveraging both the 135,000 synthetic images and unlabeled real documents during semi-supervised training.
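For concreteness, the consistency loss can be computed as follows (a NumPy sketch under the definitions above; not the authors' code):

```python
import numpy as np

def consistency_loss(p, box):
    """p: feature map (C, H, W); box: (x0, y0, x1, y1) for region b."""
    x0, y0, x1, y1 = box
    region = p[:, y0:y1, x0:x1]                         # (C, H_b, W_b)
    mean = region.mean(axis=(1, 2), keepdims=True)      # p^(b)
    return np.sum((region - mean) ** 2, axis=0).mean()  # average over pixels
```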
DSSE-200 Dataset: 200 pages sampled from magazines and academic papers, with human-annotated pixel-level masks for 7 classes (6 semantic classes plus background): figure, table, section (heading), caption, list, paragraph. The images are raster (not PDF), which the authors explicitly note prevents applying PDF post-processing methods used by contemporaneous baselines.
Synthetic pre-training set: 135,000 automatically generated document images from templates with known ground-truth layouts, used for unsupervised pre-training of the visual branch.
What experiments were performed?
The paper evaluates on three datasets:
- DSSE-200 (the introduced benchmark): 160 train / 40 test split, pixel-level IoU.
- ICDAR2015 competition dataset: a broader document segmentation benchmark.
- SectLabel: a dataset focused on scientific paper section structure.
Baselines include vision-only FCN variants and several rule-based / PDF-based methods where applicable. On DSSE-200, the paper cannot compare against PDF-based methods since the documents are raster images.
Ablations test: (1) vision-only FCN, (2) FCN + text embeddings, (3) adding the reconstruction auxiliary loss, (4) adding both auxiliary losses. The full model is evaluated with and without synthetic pre-training.
The primary metric is mean IoU across all classes (pixel-level, not object-level mAP). This measures overlap at the pixel level, so results are not directly comparable to object-detection mAP numbers reported on other datasets.
What are the outcomes/conclusions?
On DSSE-200, the best vision-only architecture (Model5, with dilated blocks, unpooling, and skip connections) achieves 73.0% mean IoU. Per-class IoU for this baseline:
| Class | IoU |
|---|---|
| Background | 84.6% |
| Figure | 83.3% |
| Table | 79.4% |
| Paragraph | 77.1% |
| List | 66.7% |
| Caption | 61.0% |
| Section | 58.3% |
Section headings are the hardest class, likely due to their small spatial footprint relative to surrounding text.
Adding OCR-extracted text embeddings raises the mean IoU to 73.3%, with the gains concentrated in textual classes (list +1.7%, paragraph +2.2%, section heading +1.1%). The unsupervised losses provide further additive improvement: reconstruction brings the mean to 73.9%, consistency to 75.4%, and combining both yields 75.9% mean IoU (the paper reports only mean IoU for these ablations, not per-class breakdowns).
On ICDAR2015, the binary-class MFCN achieves 94.5% (non-text) and 91.0% (text) IoU, outperforming prior baselines. On SectLabel, it achieves higher F1 than Luong et al. for section heading (0.919 vs. 0.916), caption (0.893 vs. 0.781), and list (0.793 vs. 0.712).
The paper concludes that multimodal fusion of visual and text signals is beneficial for layout segmentation, especially on raster documents where PDF metadata is unavailable.
Limitations acknowledged by the authors:
- The text branch depends on OCR quality; errors in OCR propagate to the embedding map.
- The 200-page test set is small, and variance across the 40 test pages is not reported.
- Results on DSSE-200 are not comparable to methods that use PDF structure, since those methods cannot run on DSSE-200.
Dataset caveats:
- No explicit license is stated for DSSE-200. The source images are from magazines and academic papers with unclear copyright. Commercial use is not advisable without clarifying provenance.
- The dataset URL (`http://personal.psu.edu/xuy111/projects/cvpr2017_doc.html`) appears to be a dead personal page and the data may no longer be publicly accessible.
Reproducibility
Models
- MFCN: Encoder-decoder architecture following Noh et al., with modifications: dilated convolution blocks (five 3x3 convolutions at $d = 1, 2, 4, 8, 16$), unpooling-based upsampling, and refined skip connections. All convolutional layers use 3x3 kernels with stride 1, batch normalization before non-linearities. The models in the ablation study (Table 1) are trained from scratch.
- Text embeddings: 128-dimensional skip-gram word vectors trained on the 2016 English Wikipedia dump. Out-of-vocabulary words handled following Bojanowski et al. Sentence embeddings are computed by averaging word vectors.
- The paper does not provide a complete layer-by-layer specification or parameter count.
- No model weights are released (as of the original publication).
Algorithms
- Optimizer: Adadelta with a mini-batch size of 2.
- Input preprocessing: Per-channel mean subtraction; images resized so the longer side is at most 384 pixels. No other preprocessing.
- Semi-supervised training: Mini-batches of synthetic and real documents alternate. For synthetic documents, both the per-pixel classification loss and unsupervised losses are active. For real (unlabeled) documents, only the unsupervised losses are active.
- Class balancing: Per-pixel classification loss uses class weights inversely proportional to class pixel frequency.
- Auxiliary losses: Reconstruction loss reconstructs encoder activations from the auxiliary decoder; consistency loss encourages pixels within the same bounding-box region to have similar feature representations.
- OCR engine: Tesseract is used to extract text for the embedding map.
Data
- DSSE-200: 200 raster images (160 train, 40 test) from magazines and academic papers. Human-annotated pixel-level masks for 7 classes (6 semantic + background). Not PDF.
- 135k synthetic images: Automatically generated from document layout templates. Ground-truth layouts are known by construction. Used for unsupervised pre-training only.
- Public availability: Both datasets appear to be permanently unavailable. DSSE-200 was hosted at `http://personal.psu.edu/xuy111/projects/cvpr2017_doc.html` (a PSU personal page that is now dead), and the 135k synthetic set was bundled on the same page. No mirrors, HuggingFace uploads, or alternative downloads are known to exist. This is a common fate for datasets hosted on personal academic pages from this era.
- License: No explicit license stated for DSSE-200.
Evaluation
- Metric: Pixel-level mean IoU across all classes (including background). This is not the same as object-level mAP and is not directly comparable to COCO-style detection metrics.
- Splits: 160 train / 40 test for DSSE-200. No cross-validation or multiple runs reported; statistical significance is not discussed.
- Baselines: Vision-only FCN ablations; rule-based PDF methods (not applicable to DSSE-200 since it is raster-only). Results on ICDAR2015 and SectLabel are also reported.
Hardware
- No specific GPU type, training time, or memory requirements are reported in the paper.
- As a 2017-era FCN with small batch size (2) and modest input resolution (384px), training and inference should be feasible on a single modern GPU with minimal VRAM.
BibTeX
@InProceedings{Yang_2017_CVPR,
author = {Yang, Xiao and Yumer, Ersin and Asente, Paul and Kraley, Mike and Kifer, Daniel and Giles, C. Lee},
title = {Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017},
pages = {5315--5324},
doi = {10.1109/CVPR.2017.462},
}
PRImA Layout Analysis Dataset
TL;DR
The PRImA Layout Analysis Dataset (ICDAR 2009) contains 1,240 ground-truthed page images at a 7:1 ratio of magazine to technical article pages. It uses pixel-accurate isothetic polygon ground truth (rectilinear shapes that wrap tightly around non-rectangular regions) and stores annotations in the PAGE-XML format (formally described in a companion publication at ICPR 2010). A semi-automated ground-truthing tool fits polygon boundaries to region content using a shrink-wrap approach. A subset of 55 pages was used for the ICDAR 2009 Page Segmentation Competition; a downloadable subset of approximately 305 pages is currently available via the dataset website.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the dataset itself along with the PAGE-XML format and semi-automated ground-truthing tooling. The paper argues that no existing dataset adequately captures the complexity of real-world digitisation targets (colour magazines, non-Manhattan layouts), and addresses that gap with a new benchmark.
Secondary: $\Psi_{\text{Evaluation}}$
The paper describes the in-depth evaluation framework tied to the dataset and its use as the basis for the ICDAR 2009 Page Segmentation Competition. Competition results are reported in a separate companion paper (Antonacopoulos et al., ICDAR 2009 competition paper).
What is the motivation?
Prior datasets used for layout analysis evaluation had two systemic problems:
- Simple layouts only. The University of Washington (UWASH), ISRI, and MARG datasets contain almost exclusively bilevel (black-and-white) journal articles with rectangular, Manhattan-style layouts. These do not represent the complexity of documents most likely to be digitised in practice.
- Limited region representation. Bounding rectangles cannot represent text wrapping around images or other non-rectangular content. The MediaTeam Document Database offered colour but still constrained regions to bounding rectangles; the UvA dataset had more complex regions but concentrated on advertisement-type pages unrepresentative of general digitisation work.
The PRImA group built this dataset to evaluate layout analysis methods on realistic, complex, colour documents with accurate and detailed ground truth.
What is the novelty?
1. Complex layout support via isothetic polygons: Rather than bounding rectangles, PRImA uses isothetic polygons (polygons with only horizontal and vertical edges) to describe region boundaries. This enables tight, accurate representation of non-rectangular regions such as text wrapping around a circular image, without the complexity of arbitrary polygons.
2. The PAGE-XML format: The paper uses and describes the PAGE (Page Analysis and Ground truth Elements) XML format (formally introduced in Pletschacher and Antonacopoulos, ICPR 2010), which explicitly separates:
- Physical structure: Region types (text, image, table, etc.)
- Logical structure: Semantic roles within text regions (heading, paragraph, caption, footer, etc.)
- Reading order: Ordered and unordered groups, nestable for complex logical structures like newspapers.
The format also supports sub-region annotation at the text line, word, and glyph level, enabling evaluation of segmentation methods beyond the region level.
3. Semi-automated ground-truthing tool: Annotators draw a rough outline around any region; the tool then automatically fits the boundary precisely to the region content using a shrink-wrap approach. (This tool was later developed into the Aletheia system, formally described in Clausner et al., ICDAR 2011.) This proved more efficient than running an automatic segmentation and manually correcting it. The tool enforces the isothetic polygon constraint and captures per-region metadata (language, font, reading direction, text colour, background colour, logical label).
4. Dataset composition: The 1,240 images include:
- Magazines covering news, business, and technology (mainstream publications with complex layouts, text-wrap, and varied font sizes)
- Technical articles from journals and conference proceedings (typically simpler, but with some complex pages)
- Supplementary material including forms, bank statements, and advertisements for broader coverage
5. Web-accessible, searchable dataset:
Unlike prior datasets distributed as flat file collections, PRImA provides a web-based database interface at dataset.primaresearch.org supporting browsing and search by colour depth, column count, and layout features (presence of images, tables, varying fonts). Suitable subsets can be selected and downloaded as zip files.
What experiments were performed?
The dataset was used as the basis for the ICDAR 2009 Page Segmentation Competition. A subset of 55 pages was reserved for competition evaluation. The competition methodology and results are described in a separate companion paper (Antonacopoulos et al., ICDAR 2009 Page Segmentation Competition); this paper covers only the dataset design and creation workflow.
The evaluation framework, described in an earlier paper by the same authors (Antonacopoulos and Bridson, ICDAR 2007), weights segmentation errors by semantic severity rather than treating all pixel or region errors equally. Merging two unrelated columns is penalised more heavily than splitting a single paragraph, reflecting the impact on downstream readability and extraction.
What are the outcomes/conclusions?
The paper establishes three artefacts that persisted in the field:
- PAGE-XML became the standard ground-truth format for historical and complex document analysis. It is still used in the Transkribus platform and subsequent PRImA competitions (RDCL 2017, RDCL 2019).
- The semi-automated ground-truthing tool (later named Aletheia; Clausner et al., ICDAR 2011) demonstrated that semi-automated isothetic polygon fitting is more practical than manual correction of automatic segmentation for ground-truthing complex layouts.
- PRImA dataset became a standard benchmark for non-Manhattan layout analysis, filling a gap left by UWASH and other simple-layout datasets.
Limitations
- Scale: 1,240 pages is small by deep learning standards; modern methods typically require tens of thousands of training examples and use PRImA primarily as a test set.
- Domain: Heavy bias toward magazines and technical articles. Forms, bank statements, and advertisements are included but not the focus.
- Access model: The dataset is accessed via registration on the PRImA website. No open-access download or permissive license is specified in the paper; the effective use appears to be research-only.
- Evaluation complexity: The scenario-based weighted evaluation metric, while semantically meaningful, is complex to implement compared to COCO-style mAP, which is why later researchers often apply simpler metrics when using this dataset.
The PAGE-XML Format
From the official documentation: “There is a plethora of established and proposed document representation formats but none that can adequately support individual stages within an entire sequence of document image analysis methods. This paper describes PAGE, a new XML-based page image representation framework.”
The PAGE-XML format is the standard for ground truth storage in PRImA. Below is an overview of the 2019 version of the structure.
Region Types
The 2009 paper defines 10 region types: text, image, line drawing, graphic, table, chart, separator, maths, noise, and frame. The 2019 PAGE-XML schema extended this list; the expanded set is shown below.
Supported types in the 2019 schema:
- TextRegion
- ImageRegion
- GraphicRegion
- ChartRegion
- LineDrawingRegion
- SeparatorRegion
- TableRegion
- MathsRegion
- ChemRegion
- MusicRegion
- AdvertRegion
- NoiseRegion
- UnknownRegion
Main Structure
The XML schema can be found here: http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd
<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
<Metadata>...</Metadata>
<Page imageFilename="SimplePage.png" imageWidth="800" imageHeight="600">
<ReadingOrder>...</ReadingOrder>
<TextRegion>...</TextRegion>
</Page>
</PcGts>
Region Examples
Text Region (Heading)
<TextRegion id="r0" type="heading">
<Coords points="25,30 25,55 235,55 235,30"/>
<TextEquiv>
<Unicode>The PAGE Format</Unicode>
</TextEquiv>
</TextRegion>
Table Region
Regions can have sub-regions (nested regions), such as table cells.
<TableRegion id="r3" lineSeparators="true">
<Coords points="25,475 25,560 400,560 400,475"/>
<!-- Column 1 Header -->
<TextRegion id="r5" type="paragraph">
<Coords points="40,485 40,500 120,500 120,485"/>
<TextEquiv>
<Unicode>Column 1</Unicode>
</TextEquiv>
</TextRegion>
...
</TableRegion>
Reading Order
Reading order describes the logical sequence of text regions. It supports OrderedGroup and UnorderedGroup.
<ReadingOrder>
<OrderedGroup id="ro357564684568544579089">
<RegionRefIndexed regionRef="r0" index="0"/> <!-- Heading -->
<RegionRefIndexed regionRef="r1" index="1"/> <!-- Paragraph 1 -->
</OrderedGroup>
</ReadingOrder>
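Extracting the reading order programmatically is straightforward with the standard library; a minimal sketch (not an official PRImA tool), using the 2019 namespace shown above:

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

def region_order(path):
    root = ET.parse(path).getroot()
    refs = root.findall(".//pc:RegionRefIndexed", NS)
    refs.sort(key=lambda r: int(r.get("index")))  # order by explicit index
    return [r.get("regionRef") for r in refs]

print(region_order("SimplePage.xml"))  # e.g. ['r0', 'r1']
```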
Text Line Objects
Text line objects are sub-elements of TextRegion. The text content can be stored simultaneously in the text region and in the text line objects.
<TextRegion id="r0" type="heading">
...
<TextLine id="l0">
<Coords points="25,30 25,55 235,55 235,30"/>
<TextEquiv><Unicode>...</Unicode></TextEquiv>
</TextLine>
...
</TextRegion>
Mapping to Unified Taxonomy
PRImA’s PAGE-XML schema distinguishes between Region Type (Physical) and TextRegion attributes (Logical), which aligns well with our Matter vs. Meaning standard. This separation between physical and logical structure was a notable design choice for the time.
High-Level Regions (The Primitives)
| PAGE-XML Region | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| TextRegion | Text | Varies | Role depends on the type attribute (see below). |
| ImageRegion | Image | Figure | Photographic content. |
| GraphicRegion | Image | Figure | Logos, decorations, simple graphics. |
| LineDrawingRegion | Image | Diagram | Sketches, technical drawings. |
| ChartRegion | Image | Chart | Business graphics (Bar/Pie). |
| TableRegion | Table | Table | Grid structures. |
| MathsRegion | Formula | DisplayEquation | Math equations. |
| ChemRegion | Formula | ChemScheme | Chemical structures. |
| MusicRegion | Music | SheetMusic | Music scores. |
| AdvertRegion | Image | Advertisement | Adverts (visual or text-heavy blocks). |
| SeparatorRegion | Structure | - | Lines, rulers. |
| NoiseRegion | Structure | - | Scanner artifacts. |
Text Roles (The Meaning)
Within TextRegion, the type attribute defines the semantic role.
| PAGE-XML Type | Unified Role | Notes |
|---|---|---|
| paragraph | Body | Standard prose. |
| heading | SectionHeader | Titles and headings. |
| caption | Caption | Labels for images/tables. |
| header | PageHeader | Running headers. |
| footer | PageFooter | Running footers. |
| page-number | PageNumber | Folio numbers. |
| drop-cap | Body | Violation: Separates the first letter. |
| credit | Credit | Image/Photo attribution. |
| floating | Sidebar | Text outside the main flow. |
| signature-mark | Signature | Quire marks (books) or actual signatures. |
| catch-word | JumpLine | Navigation aid at bottom of page. |
| marginalia | Annotation | Handwritten or printed margin notes. |
| footnote | Footnote | Notes at bottom of page. |
| toc-entry | TOC | Table of contents entries. |
Reproducibility
Data
- Total images: 1,240 ground-truthed page images (stated at time of 2009 publication). A downloadable subset of approximately 305 images (265 magazine + 40 technical article pages) is available via the dataset website as of 2022. The ICDAR 2009 competition used a dedicated subset of 55 pages.
- Composition: ~7:1 ratio of magazine pages to technical article pages. Also includes forms, bank statements, and advertisements.
- Scanning: 300 DPI, 24-bit colour. Bilevel (binarised) versions are also provided, produced using ABBYY FineReader’s binarisation. Deskewing applied automatically (ABBYY FineReader), then manually verified.
- Access: Registration required via `dataset.primaresearch.org`. No explicit open license is stated in the paper; effective use appears to be research-only.
- License: Unknown. The dataset metadata records copyright holder per document. Individual source documents are magazines and technical publications whose underlying copyrights are not cleared; use for commercial training is not safe to assume.
Annotation Geometry
- Format: PAGE-XML (isothetic polygons, not bounding boxes). Each region boundary is an isothetic polygon with only horizontal and vertical edges, enabling tight representation of non-rectangular regions (a small validity check appears after this list).
- Levels: Page, region, text line, word, glyph. Competition evaluations typically use the region level.
- Per-region metadata (text regions): language, font, reading direction, text colour, background colour, logical label (heading, paragraph, caption, footer, etc.).
- Overlaps: Not explicitly prohibited by the format, unlike DocLayNet’s hard constraint. Complex nested structures (e.g., table cells) are supported as sub-regions.
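A quick check of the isothetic constraint (our own sketch, parsing the `Coords` points string shown in the XML examples above): every edge of the closed polygon must be axis-aligned.

```python
def parse_points(coords: str):
    """Parse a PAGE-XML points string like '25,30 25,55 235,55 235,30'."""
    return [tuple(map(int, p.split(","))) for p in coords.split()]

def is_isothetic(points) -> bool:
    """True if every edge of the closed polygon is horizontal or vertical."""
    closed = points + points[:1]
    return all(x0 == x1 or y0 == y1
               for (x0, y0), (x1, y1) in zip(closed, closed[1:]))
```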
Annotation Tool
- Semi-automated ground-truthing tool (later named Aletheia; Clausner et al., ICDAR 2011): Annotators draw an approximate boundary; the tool automatically fits it to region contents. Supports rectangle, polygon, and free-draw input modes. A separate quality-control operator reviews each ground-truthed document before it is committed to the dataset.
Evaluation
- Metric: Weighted scenario-based scoring (Antonacopoulos and Bridson, ICDAR 2007) that penalises errors by semantic severity. Not equivalent to COCO-style mAP; later researchers often substitute simpler IoU-based metrics.
- Competition: ICDAR 2009 Page Segmentation Competition results are in a separate companion paper, not this dataset description paper.
Hardware
- Scanning hardware not specified beyond “300 DPI, 24-bit colour scanner.” No GPU or compute requirements reported (dataset creation, not model training).
BibTeX
@inproceedings{antonacopoulos2009realistic,
title={A realistic dataset for performance evaluation of document layout analysis},
author={Antonacopoulos, Apostolos and Bridson, David and Papadopoulos, Christos and Pletschacher, Stefan},
booktitle={2009 10th International Conference on Document Analysis and Recognition},
pages={296--300},
year={2009},
organization={IEEE},
doi={10.1109/ICDAR.2009.271}
}