Document Layout Analysis

Tracking the evolution of document layout detection: models, datasets, benchmarks, and metrics.

Disclaimer: This page tracks methods for detecting and classifying regions on a document page. For parsing the internal structure of tables, see the TSR Page. For reading order prediction, see the Reading Order Page. For text recognition and end-to-end OCR pipelines, see the OCR Page.

Overview

Document Layout Analysis (DLA) is the computer vision task of identifying regions of interest in a document image and classifying them into semantic categories (e.g., text, title, table, figure). It is often the first step in a RAG or document understanding pipeline, serving as the “chunking” mechanism before text extraction.

Deep Dives & Guides

We maintain detailed guides on the theory and practice of layout analysis:

  1. The Matter vs. Meaning Standard: Our approach to taxonomy for resolving the “Ontology Gap” between visual perception and logical structure.
  2. Annotation Guide & Label Studio Config: A two-pass workflow and conflict resolution rules for high-quality data labeling.

Layout Analysis Resources

Layout: Paradigms

The field is currently split across three dominant modeling paradigms:

  1. Vision-Only Object Detection (OD): Treats the page purely as an image (like a photograph).
    • Examples: Faster R-CNN, YOLO, DETR.
    • Pros: Fast, works even if OCR fails, great for coarse regions (Tables, Figures).
    • Cons: Blind to semantic nuances (e.g., distinguishing “bold text” from “section header” based on content).
  2. Text + Layout (No Full-Page Image): Operates on parsed text positions from OCR or PDF extraction, without a full-page visual backbone.
    • Examples: LiLT, GLAM.
    • Pros: Lightweight, fast; LiLT allows swapping text encoders across languages.
    • Cons: Loses visual cues (color, font rendering); GLAM restricted to born-digital PDFs.
  3. Multimodal / Sequence Labeling: Combines visual features with text embeddings (from OCR).
    • Examples: LayoutLM, LayoutXLM, UDOP.
    • Pros: High accuracy on semantically rich documents (invoices, forms).
    • Cons: Slower (requires OCR first), complex pipeline.
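Whatever the paradigm, the output is the same kind of record: a set of page regions with class labels and confidences. A minimal illustrative sketch of that shared output shape (the field names here are ours, not any particular library's API):

```python
from dataclasses import dataclass


@dataclass
class LayoutRegion:
    """One detected region on a page (illustrative; names are ours)."""
    label: str                                   # e.g. "title", "table", "figure"
    bbox: tuple[float, float, float, float]      # x1, y1, x2, y2 in page pixels
    score: float                                 # detector confidence in [0, 1]


# A toy page as a detector of any paradigm might emit it:
page = [
    LayoutRegion("title", (72, 60, 540, 110), 0.97),
    LayoutRegion("table", (72, 300, 540, 620), 0.91),
]
```

Downstream stages (text extraction, reading order, TSR) consume exactly this list, which is why the paradigms are interchangeable at the pipeline level.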

Layout: Models

Image-Only

Models that operate on the document image alone, with no OCR or text input required. These are typically object detection architectures adapted for documents.

| Model Family | Artifacts | Code | License | Notes |
| --- | --- | --- | --- | --- |
| Docling Layout (2025) | heron (43M) • heron-101 (77M) • egret-m (20M) • egret-l (31M) • egret-x (63M) | Docling | Apache-2.0 (weights); MIT (code) | Notes. RT-DETRv2 (heron) + D-FINE (egret). 17-class taxonomy (DocLayNet 11 + 6 delta: Code, Checkbox, Form, Key-Value, Doc Index). Trained on 150K pages. heron-101: 78% mAP, 28 ms/img (A100). |
| FFDNet / FFDetr (2025) | S (9M) • L (25M) • FFDetr (RF-DETR; not in note) | GitHub | Apache-2.0 | Notes. YOLO11 (FFDNet) + RF-DETR (FFDetr). Form field detection only: Text Input, Choice Button, Signature. 1216px high-res input. |
| PP-DocLayout (2025) | L (31M) • Plus-L (31M) • M (6M) • S (1M) | PaddleX | Apache-2.0 | Notes. RT-DETR & PicoDet. |
| DocLayout-YOLO (2024) | Pre-train • DocLayNet • D4LA • DocStructBench | GitHub | AGPL-3.0 (code); Apache-2.0 (weights) | Notes. YOLOv10-M with GL-CRM module. DocSynth-300K synthetic pre-training. |
| DLAFormer (2024) | None released | None | N/A | Notes. DETR-based unified model for detection + logical role classification + reading order via unified label space. Requires OCR text-line bounding boxes as geometric input (no text embeddings). ICDAR 2024. |
| GraphKD (2024) | None released | GitHub | MIT | Notes. Graph-based KD framework for DOD. Distills Faster R-CNN teachers (ResNet50/101/152) into compact students (ResNet18, EfficientNet-B0, MobileNetV2). Cosine + Mahalanobis distillation loss with adaptive text-node sampling. |
| Hybrid DLA (2024) | None released | None | N/A | Notes. ICDAR 2024. DINO + ResNet-50 with RoI-aligned query encoding and hybrid one-to-many/one-to-one matching. PubLayNet 97.3, DocLayNet 81.6, PubTables 98.6 mAP. No code or weights released. |
| TransDLANet (2023) | Transformer | GitHub | CC-BY-NC-ND-4.0 | Introduced in the M6Doc paper; see M6Doc Notes. ISTR-derived instance segmentation with ResNet-101 backbone. No text input. |
| SwinDocSegmenter (2023) | Google Drive (Model Zoo) | GitHub | Apache-2.0 | Notes. ICDAR 2023. SwinL + DETR-style encoder-decoder with contrastive denoising and hybrid bipartite matching. Instance segmentation (not just detection). 223M params. PubLayNet 93.7, HJ 84.6, TableBank 98.0, DocLayNet 76.9 mAP. First transformer instance seg. baseline on DocLayNet. |
| DiT (2022) | Base (87M) • Large (304M) | GitHub | MIT | Notes. BEiT-style MIM pre-training on 42M IIT-CDIP doc images with domain-specific dVAE tokenizer. |
| DocSegTr (2022) | PRImA • HJ • TableBank | GitHub | GPL-3.0 | Notes. Hybrid CNN-transformer with twin attention for bottom-up instance segmentation (no bounding box dependency). ResNeXt-101-FPN + DCN backbone with dynamic mask generation. PubLayNet 89.4, PRImA 40.3, HJ 83.4, TableBank 93.3 mAP. |
| YALTAi (2022) | None released | GitHub | GPL-3.0 | Notes. JDMDH 2023. Replaces Kraken pixel segmentation with YOLOv5 object detection (bounding boxes). 7× mAP improvement on small historical document datasets (~1K images). Releases MSS-EPB and Tabular datasets. |
| LayoutParser (2021) | Model Zoo (Detectron2) | GitHub | Apache-2.0 | Notes. Unified DIA toolkit wrapping Faster/Mask R-CNN. Pre-trained on PubLayNet, PRImA, HJDataset, Newspaper Navigator, TableBank. Community model hub. |
| CDDOD (2020) | None released | GitHub | MIT | Notes. CVPR 2020. Cross-domain DOD via FPN + three adversarial alignment modules (Feature Pyramid, Region, Rendering Layer). ResNet-101 backbone. Introduces PubMed/Chn/Legal benchmark suite. Partial code/data released. |

Text + Layout

Models that consume text tokens and/or their 2D positions but do not use a full-page image backbone. These require OCR or a PDF parser but process the document as structured data rather than as a raster image.

| Model Family | Artifacts | Code | License | Notes |
| --- | --- | --- | --- | --- |
| GLAM (2023) | None released | None | N/A | Notes. ICDAR 2023. Graph-based DLA on PDF-parsed text boxes + ResNet-18 ROI-pooled visual features (per text box, not full-page). GNN (TAG conv) with joint node/edge classification. 4M params, 98 pages/sec on T4 GPU. Outperforms 140M+ vision models on 5/11 DocLayNet text-based classes. Ensemble with YOLO v5x6: 80.8 mAP. Born-digital PDFs only. |
| LiLT (2022) | Base (131M) | GitHub | MIT | Notes. ACL 2022. Language-independent layout transformer: disentangled text and layout streams allow swapping in any pre-trained text encoder without retraining the layout branch. No visual backbone. |

Image + Text + Layout

Models that take OCR-derived text tokens and their 2D positions as input alongside the document image. These require an OCR step before inference.

| Model Family | Artifacts | Code | License | Notes |
| --- | --- | --- | --- | --- |
| M2Doc (2024) | Weights (BaiduNetDisk/GDrive) | GitHub | Proprietary | Notes. AAAI 2024. Pluggable early-fusion (pixel-level gate) + late-fusion (block-level IoU matching) modules for any detector. BERTgrid input, single shared backbone. DocLayNet 89.0 mAP, M6Doc 69.9 mAP. |
| DocFormerv2 (2023) | None released | None | N/A | Notes. AAAI 2024. Encoder-decoder multimodal transformer with asymmetric pre-training (Token-to-Line, Token-to-Grid). Simplified linear visual branch. 66M-750M params. Pre-trained on 64M IDL pages. |
| VGT (2023) | None released | GitHub | Apache-2.0 | Notes. ICCV 2023. Two-stream ViT + Grid Transformer (GiT) with dedicated text-grid pre-training (MGLM + SLM on 4M IIT-CDIP pages). 243M params. Requires OCR bounding boxes for GiT input. PubLayNet 96.2, DocBank 84.1, D4LA 68.8 mAP. Also introduces the D4LA dataset (see Datasets). |
| UDOP (2022) | Unified (~794M) | GitHub | MIT | Notes. CVPR 2023. Generative T5-based foundation model unifying vision, text, and layout. Layout-induced representation + unified seq2seq for all doc tasks. 1st on DUE-Benchmark. |
| LayoutLMv3 (2022) | Base (133M) • Large (368M) | GitHub | CC-BY-NC-SA-4.0 | Notes. CNN-free multimodal pre-training with unified text/image masking. |
| VSR (2021) | PubLayNet • DocBank | GitHub | Apache-2.0 | Notes. ICDAR 2021. Two-stream (ResNeXt-101 + CharGrid/SentGrid) with adaptive fusion and GNN relation module. Supports both detection and sequence labeling. PubLayNet 95.7 AP (1st on ICDAR 2021 leaderboard), DocBank 95.6 F1. Weights on Hikvision file sharing; depends on mmdetection 2.11.0. |
| VTLayout (2021) | None released | None | N/A | Notes. PRICAI 2021. Two-stage: Cascade Mask R-CNN localization + re-classification via fused deep visual (MobileNetV2 + SE), shallow visual (pixel histogram), and text (TF-IDF over PaddleOCR) features. PubLayNet F1 0.9599. No code/weights. |
| LayoutXLM (2021) | Base (369M) • Large (625M) | GitHub | CC-BY-NC-SA-4.0 | Notes. ACL 2022 Findings. Cross-lingual variant of LayoutLMv2. 53 languages via multilingual pre-training on 30M documents. Introduces XFUND benchmark. |
| LayoutLMv2 (2020) | Base (200M) • Large (426M) | GitHub | CC-BY-NC-SA-4.0 | Notes. ACL 2021. Introduced visual backbone (ResNeXt-101 FPN) integration into LayoutLM pre-training. Spatial-aware self-attention + TIA/TIM pre-training objectives. |
| LayoutLM (2019) | Base (113M) • Large (343M) | GitHub | MIT | Notes. First joint text+layout pre-training. BERT + 2D position embeddings + optional Faster R-CNN image features. |

Layout: Datasets

Large-scale annotated collections intended primarily for model training. Datasets are grouped by the most permissive use permitted: commercial, research / non-commercial, and not available / restricted.

Commercial Use

Training a private, for-profit model is permitted with minimal obligations.

| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CommonForms (2025) | 480k | Diverse (14 domains, 10+ langs) | Auto (existing fillable PDFs) | 3 (form fields) | Yes | Apache-2.0 | Notes. Form field detection only: Text Input, Choice Button, Signature. Mined from Common Crawl. ~59K documents. One-third non-English. |
| DocLayNet (2022) (patents + SEC subsets only) | subset of 80k | Diverse | Human | 11 | Yes | CDLA-Perm-1.0 (annotations) | Notes. Only the Patents and SEC filings subsets carry permissive underlying document licenses. See the Research section below for full-dataset use. |
| SignverOD (2022) | 2.6k | Diverse (Business) | Human | 4 | Unknown | CC0-1.0 | Notes. Form-element detection only: Signature, Initials, Redaction, Date. Sources: Tobacco800, NIST SD-2, bank cheques, GSA leases. Kaggle. No formal publication. |
| Newspaper Navigator (2020) | 16M+ (3.5k GT) | Hist. Newspapers | Auto (3.5k Human GT) | 7 (visual only) | Yes (GT only) | Public Domain | Notes. GT from Beyond Words crowdsourcing with 80/20 train/val split. The original search application is no longer served, but the underlying data remains accessible. Visual content retrieval focus; text columns are unlabeled background. |

Research / Non-Commercial

Training a free, open-weight non-commercial model is permitted.

| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AnnoPage (2025) | 7.6k | Historical Czech/German (1485-present) | Human | 25 | Yes | CC-BY-4.0 (annotations + images) | Notes. Non-textual elements only (maps, decorations, charts, photographs, etc.). Expert librarian annotations following Czech Methodology. Zenodo. GREC Workshop at ICDAR 2025. Attribution required. |
| TextBite (2025) | 8.4k | Historical Czech (18th-20th c.) | Human | 3 | Yes | CC-BY-4.0 (dataset on Zenodo); MIT (code); underlying images from Czech libraries (personal/research use) | Notes. Logical page segmentation as pixel clustering. 78,863 annotated segments across newspapers, dictionaries, handwritten records. Printed + handwritten. Novel Rand index evaluation over foreground text pixels. ICDAR 2025. |
| IndicDLP (2025) | 120k | Diverse (12 domains, 12 langs) | Human | 42 | Yes | MIT (annotations/code); underlying doc licenses vary by source | Notes. 11 Indic languages + English. Hierarchical labels (section titles, list levels). Sources include gov. docs, ebooks, newspapers, arXiv. ICDAR 2025 Best Student Paper Runner-Up. |
| LADaS 2.0 (2024) | 7.3k | Diachronic French (1600-2024) | Human | 36 (13 types) | Yes | CC-BY-4.0 (annotations/code); underlying doc licenses vary by subset | Notes. SegmOnto/TEI-aligned taxonomy. 12 modular subsets (monographs, theses, catalogues, theatre, magazines, etc.). Mostly French with some multilingual content. Includes noisy “Fingers” subset for on-site digitization. |
| DocHieNet (2024) | 15.6k | Diverse (legal, financial, edu, sci) | Human | 19 + hierarchy + sequential | Yes | Apache-2.0 (code/data); research-only clause in repo README | Notes. 1,673 bilingual (EN + ZH) multi-page documents. 187K+ layout elements. Manual annotation (12 annotators, 3 QC rounds). 37.4% cross-page relations. EMNLP 2024. ModelScope. GitHub. |
| CATMuS Medieval Seg. (2024) | 1,680 | Medieval MSS (8th-16th c., 10 langs) | Semi-auto (BLLA + manual correction; 45+ contributors) | 19 (13 zone + 6 line) | Yes | CC-BY-4.0 | Notes. SegmOnto-based medieval layout dataset. 159 manuscripts. Hierarchical block/line annotations with polygons and baselines. Companion to CATMuS Medieval HTR dataset. ICDAR 2024. HuggingFace. |
| DocGenome (2024) | 6.8M | Scientific | Auto | 13 + 6 Rel | No | CC-BY-4.0 (annotations); underlying arXiv PDFs carry per-paper licenses | Notes. Logical graph from LaTeX source; 6 relation types (hierarchical + reference). |
| DocSynth-300K (2024) | 300k | Synthetic | Auto | 74 | No | Unknown | Notes. Synthetic pre-training only; bin-packing layout generation from M6Doc crops. Unknown license – treat as research use at most. |
| SciPostLayout (2024) | 7,855 | Scientific Posters | Human | 9 | Yes | CC-BY | Notes. Conference-style scientific posters (not paper documents); sourced from F1000Research. Very different domain from PubLayNet-style datasets. |
| RanLayNet (2024) | Unknown | Synthetic (PubLayNet source) | Auto | 5 | Unknown | Dataset/code license unknown | Notes. Synthetic pages composited from PubLayNet crops for domain generalization. Standard YOLOv8 training. ACM Multimedia Asia 2023. GitHub. |
| OIN (2024) | 1,920 | Persian Newspapers | Human | 2 (text/non-text) | Yes | No license specified | Notes. Non-Latin (Persian) DLA + TLD. 32,642 lines, 2.35M CCs. Scanned at 300 dpi. Includes curved lines, skew, diacritics. Hosted on Google Drive. No license; treat as research use at most. |
| U-DIADS-Bib (2024) | 200 | Ancient manuscripts (6th-12th c.) | Human (pixel-precise) | 6 | Yes | Unknown (dataset); CC-BY-4.0 (paper) | Notes. Latin and Syriac Bibles. Pixel-precise, non-overlapping segmentation. Used in ICDAR 2024 few-shot layout competition (SAM). Published in Neural Computing and Applications. |
| American Stories (2023) | 20M (2.2k GT) | Hist. US Newspapers | Auto (2.2k human-labeled) | 9 | Yes | CC-BY-4.0 | Notes. NeurIPS 2023 Datasets Track. 1.14B content regions, 65.6B tokens from Library of Congress Chronicling America scans. YOLOv8 + EfficientOCR pipeline. Public domain source material. HuggingFace. |
| WordScape (2023) | 40M+ (via pipeline) | Diverse (136 langs) | Auto | 30 | No | Apache-2.0 (pipeline/code); underlying doc licenses vary | Notes. Open XML parsing of Common Crawl Word files. Released as a pipeline + 9.5M URLs, not a fixed dataset. NeurIPS 2023 Datasets Track. |
| D4LA (2023) | 11k | Reports | Human | 27 | Partial (train/val only) | Unknown | Notes. Introduced by VGT. 8,868 train / 2,224 val. Functional roles (LetterHead, Stamp, RegionKV). Unknown license; treat as research use at most. |
| M6Doc (2023) | 9k | Diverse | Human | 74 | Yes | CC-BY-NC-ND-4.0 | Notes. Fine-grained modern taxonomy (DropCap, Kicker). |
| HRDoc (2023) | 2,500 docs | Scientific | Human | 14 + 3 rel | Yes | Mixed (CC-BY, CC-BY-NC-SA) | Notes. Hierarchical structure; line-level annotations with cross-page parent-child relations. Two splits: Simple (ACL) and Hard (arXiv multi-domain). AAAI 2023. NC-SA component blocks commercial use. |
| BaDLAD (2023) | 33k | Bengali (6 domains) | Human (polygon) | 4 | Yes | CC-BY-SA-4.0 | Notes. Non-Latin-script layout dataset; polygon annotations across books, newspapers, gov. docs, property deeds. Share-alike obligation blocks commercial use. |
| ETD-ODv2 (2023) | 62k | ETDs (Scanned + Digital) | Human (AI-aided) | 24 | Yes | No license specified | Notes. Extends ETD-OD with 16.8K scanned pages and 20.2K AI-aided pages targeting minority classes. 300K bounding-box annotations across theses/dissertations. AI-aided annotation framework reduces labeling time 2-3×. GitHub. WWW ‘23 Companion. No repo license; treat as research use at most. |
| ETD-OD (2022) | 25k | ETDs (Digital) | Human | 24 | Partial (train/val only) | No license specified | Notes. First layout dataset targeting long-form scholarly documents (theses/dissertations). 100K bounding-box annotations across 5 element groups (metadata, abstract, LOC, main content, bibliography). 80/20 train/val split; no held-out test set. Includes PDF-to-XML parsing pipeline. YOLOv7 85.3 mAP@0.50. GitHub. WIESP 2022. No repo license; treat as research use at most. |
| TexBiG (2022) | 7 books (52k+ annotation instances) | Historical (19th-20th c.) | Human (polygon) | 19 | Yes | CC-BY-4.0 (annotations + images) | Notes. Tschirschwitz et al., DAGM GCPR 2022 (no OA). Zenodo. GitHub (code, no license). Expert polygon annotations with inter-annotator agreement (Krippendorff’s Alpha). Includes OCR. Attribution required. |
| YALTAI-MSS-EPB (2022) | 1.1k | Historical MSS/EPB (9th-17th c.) | Human | 12 (Segmonto) | Yes | CC-BY-4.0 | Notes. Manuscripts and early printed books. Bounding-box annotations following Segmonto ontology. 5 source datasets + 593 original pages. Zenodo. |
| YALTAi-Tables (2022) | ~250 | Historical Tabular (16th-20th c.) | Human | 4 | Yes | CC-BY-4.0 | Notes. Tabular documents from Lectaurep + out-of-domain test set. Bounding-box annotations. Zenodo. |
| DAD (2022) | 5.9k | Scientific (Articles) | Human | 43 | Unknown | MIT (annotations); source PDFs unverified | Notes. 43 classes across front matter, body, and back matter. Sources from Elsevier, Springer, SAGE, Wiley, IEEE; per-article licenses not enumerated. Train/val/test split not specified in publicly available materials. |
| DocLayNet (2022) (full dataset) | 80k | Diverse | Human | 11 | Yes | CDLA-Perm-1.0 (annotations); underlying doc licenses vary by domain | Notes. Non-permissive subsets present; full dataset is research use only. See the Commercial section above for the permitted subsets. |
| SciBank (2021) | 74k | Scientific | Auto + Human | 12 | Unknown | CC-BY-4.0 (annotations); underlying paper licenses unverified | Notes. Includes inline equations as a distinct class. Source of 11,007 underlying papers not documented; treat as research use until clarified. IEEE DataPort (free account required). |
| CDLA (2021) | 6k | Chinese (Academic) | Human (polygon) | 10 | Yes | No license specified | No publication. GitHub. Labelme-format polygon annotations. Classes: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation. No license file or terms; treat as research use at most. |
| IIIT-AR-13K (2020) | 13k | Business (Annual Reports) | Human | 5 | Yes | Research | Notes. Separates Figure (charts/diagrams) from Natural Image (photos); adds Logo and Signature classes. English, French, Japanese, Russian. 9,333 train / 1,955 val / 2,120 test (per-company 70/15/15). |
| DocBank (2020) | 500k | Scientific | Weak | 12 | Yes | Apache-2.0 (annotations/code); underlying arXiv PDFs carry per-paper licenses | Notes. Token-level labels. 400K train / 50K val / 50K test. Underlying PDF licenses vary per paper. |
| PubLayNet (2019) | 360k | Scientific | Weak | 5 | Yes | CDLA-Perm-1.0 (annotations); PDFs non-commercial only | Notes. Coarse (Text, Table, Figure). PDFs from PubMed Central are non-commercial only. |
| SPaSe (2019) | 2k | Presentation Slides | Human (pixel) | 25 | Yes | CC-BY-4.0 (annotations); underlying slides from SlideShare-1M (mixed licenses) | Haurilet et al., WACV 2019. Multi-label pixel segmentation (overlapping regions). Location-sensitive classes (title, footnote, etc.). Only slide-layout segmentation dataset. IEEE. No code released. Project page offline as of March 2026. |
| ENP (2015) | 528 | Historical Newspapers (13 langs) | Human (polygon) | Hierarchical (PAGE-XML) | Unknown | Unknown (free for researchers) | Notes. Europeana Newspapers Project. 12 European national libraries. PAGE-XML format with reading order. PRImA. ICDAR 2015. |
| GROTOAP2 (2014) | 119k (13.2k docs) | Scientific | Auto (CERMINE + heuristic correction) | 22 | No | CC-BY (annotations); underlying PDFs from PMC OA (per-article licenses vary) | Tkaczyk et al., D-Lib Magazine 2014. Token/zone-level labels across 208 publishers, 1,170 journals. 22 classes incl. body_content, abstract, references, equation, figure, table. TrueViz XML format. 93% accuracy after correction rules. Download. Also used in S2-VLUE benchmark (see VILA). |

Not Available / Restricted

Described in a publication but not publicly downloadable. Included here because the papers provide useful methodological details and the data may become available in the future.

| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DocLayNet-v2 (2025) | 7.6k (test) | Diverse | Human | 17 (11 + 6 new) | Yes | Proprietary | Notes. IBM/Docling extension of DocLayNet adding Code, Checkbox-Selected, Checkbox-Unselected, Form, Key-Value Region, Document Index. Used to train and evaluate Docling’s heron/egret model family. Not publicly available. |
| GraphDoc (2025) | 80k | Diverse | Rule-based + Human | 11 + 8 Rel | Yes | MIT (code); CDLA-Perm-1.0 (annotations); underlying doc licenses vary | Notes. Relation graph overlay on DocLayNet. 4.13M relation annotations (4 spatial + 4 logical types). ICLR 2025. GitHub repo is a placeholder as of March 2026. |
| PAL (2023) | 441k | Spanish Gov/Public Affairs | Semi-auto (PDF mining + RF) | 8 (4 layout + 4 text semantic) | Yes | Custom (signed license agreement required) | Notes. 37,910 Spanish government gazettes from 24 admin sources. 8M+ labels. Per-source Random Forest classifiers for text-block labeling. Born-digital PDFs only. ICDAR 2023 Workshop. GitHub. |
| HJDataset (2020) | 2,271 | Historical Japanese (Biographical) | Semi-rule-based + Human | 7 (hierarchical) | Yes | Apache-2.0 (annotations); images restricted (copyright, download request required) | Notes. 260K annotations with parent-child hierarchy and reading order. Single source publication (1953 Who’s Who). Vertical text, right-to-left reading. CVPR 2020 Workshop. |

Other Notables:

  • PP-DocLayout (2025): Variable label space (20-25 classes) based on DocLayNet + Stamps/Seals.

Layout: Benchmarks

Curated evaluation sets used to compare model performance. Most are too small or too constrained for training use.

| Benchmark | Pages | Domain | Annotation | Classes | License | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Comp-HRDoc (2024) | 1.5k docs | Scientific | Human | 12 + hierarchy + reading order | MIT (annotations/eval scripts); underlying images from HRDoc-Hard (mixed CC-BY, CC-BY-NC-SA) | Notes. Extension of HRDoc-Hard. First benchmark evaluating page object detection, reading order, TOC extraction, and hierarchical reconstruction simultaneously. Introduced with the DOC method. Pattern Recognition 2024. GitHub. |
| RoDLA (2024) | ~450k | Scientific + Diverse | Auto (perturbed) | 5 / 11 / 74 | Apache-2.0 (code/toolkit); data licenses inherited from source datasets | Notes. CVPR 2024. Robustness benchmark: PubLayNet-P, DocLayNet-P, M6Doc-P. 12 perturbation types × 3 severity levels. Introduces mPE and mRD metrics. GitHub. |
| OmniDocBench (2024) | 981 | Diverse | Human | 15 block + 4 span + 15 attr | Research-only (non-commercial) | Notes. End-to-end parsing eval (Markdown output); covers VLMs and pipeline tools. 9 document types. |
| DocStructBench (2024) | 9,955 (train+test) | Diverse | Human | 10 | N/A | Notes. In-house benchmark from DocLayout-YOLO; not publicly available. Covers Academic, Textbook, Market Analysis, Financial subsets. Results cannot be independently reproduced. |
| S2-VL (2021) | 1.3k | Scientific (19 disciplines) | Human | 15 | Apache-2.0 | Notes. Allen AI / VILA. Token-level annotations on 87 papers across 19 scientific disciplines. Annotated via PAWLS tool. Inter-annotator agreement 0.95. Part of the S2-VLUE benchmark suite (with GROTOAP2 and DocBank). 5-fold CV by paper. GitHub. TACL 2022. |
| PubMed (2020) | 13k | Scientific | Auto | 5 | Unknown | Notes. Li et al., CVPR 2020. Cross-domain transfer benchmark; PDF, English. Subset of PubLayNet with re-processed list annotations. |
| Chn (2020) | 8k | Synthesized Chinese | Auto | 5 | Unknown | Notes. Li et al., CVPR 2020. Cross-domain transfer benchmark; PDF, Chinese. Synthesized from Chinese Wikipedia with randomized layout/style. |
| ICDAR 2017 POD | 2k | Scientific (CiteSeer) | Human | 4 (Formula, Table, Figure, All) | Unknown | Gao et al., ICDAR 2017. Competition site. IEEE (no OA). 2,000 pages from 1,500 CiteSeer papers. Train/test zips available from competition site. |
| DSSE200 (2017) | 200 | Mags/Academic | Human | 6 | Unknown | Notes. Yang et al., CVPR 2017. 160 train / 40 test. Raster images (not PDF). Dataset URL appears dead. |
| DIVA-HisDB (2016) | 150 | Medieval Manuscripts | Human (pixel) | 4 | Unknown (no license specified) | Simistira et al., ICFHR 2016. 3 manuscripts (CB55, CSG18, CSG863), 600 dpi. Classes: Main Text, Comment, Decoration, Background (multi-class pixels via bitwise encoding). 20/10/20 train/val/test per manuscript. Used in ICDAR 2017 Layout Analysis competition. Evaluator (LGPL-3.0). IEEE. |
| PRImA (2009) | 1.2k (305 public) | Mags + Tech | Human | 10 (2009); 13 (2019 schema) | Unknown (research-only access) | Notes. Isothetic polygons. Introduced PAGE-XML. Not safe to assume commercial use. |

Layout: Metrics

| Metric | Paradigm | Notes | Tools |
| --- | --- | --- | --- |
| mAP @ IoU | Object Detection | COCO-style mean Average Precision. See below for IoU threshold variants. Higher is better. | pycocotools |
| F1 Score | Sequence Labeling | Harmonic mean of precision and recall over predicted labels. Standard metric when layout is framed as token classification (LayoutLM family, VSR, DocBank). Macro-F1 (equal weight per class) and micro-F1 (equal weight per token) variants both appear. Higher is better. | seqeval, sklearn |
| Rand Index | Segmentation-as-Clustering | Measures agreement between predicted and ground-truth pixel clusters. Used when layout analysis is framed as grouping text into logical segments rather than detecting bounding boxes. Introduced to the DLA context by TextBite. Notes. | Custom |
| mRD | Robustness | Mean Robustness Degradation. Normalizes model performance drop by perturbation difficulty (mPE). Lower is better; 100 = expected degradation. Notes. | RoDLA |
| mPE | Robustness | Mean Perturbation Effect. Model-independent perturbation severity score combining IQA metrics (MS-SSIM, CW-SSIM) with baseline degradation. Notes. | RoDLA |
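The Rand index used by TextBite can be computed in a few lines of pure Python. This is an illustrative sketch over toy cluster IDs (one ID per foreground text pixel), not TextBite's released evaluator:

```python
from itertools import combinations


def rand_index(gt, pred):
    """Rand index: fraction of element pairs whose grouping both
    labelings agree on (same cluster in both, or different in both)."""
    n = len(gt)
    agree = sum(
        (gt[i] == gt[j]) == (pred[i] == pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) // 2)


# Ground truth groups five foreground pixels as {0,0,1,1,1};
# the prediction splits the second segment into two.
ri = rand_index([0, 0, 1, 1, 1], [0, 0, 1, 1, 2])  # → 0.8
```

Cluster IDs need not match between prediction and ground truth; only the induced pairwise groupings matter, which is why the metric suits segmentation framed as clustering.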

mAP @ IoU: Threshold Variants

Most layout detection papers report some variant of COCO-style Average Precision, but the IoU threshold used changes what the number actually measures. When comparing results across papers, always check which variant is reported.

  • $\text{AP}@[.50:.95]$ (also written AP or mAP without qualifier): The COCO primary metric. Averages AP across 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05. This is the strictest standard and the most common in recent work (DocLayNet, M6Doc, DocLayout-YOLO). A model scoring 80+ on this metric is localizing regions precisely, not just roughly finding them.
  • $\text{AP}@50$ (also written AP50 or $\text{mAP}@0.50$): Counts a detection as correct if its IoU with a ground-truth box exceeds 0.50. Lenient on localization; a prediction overlapping just half the ground truth counts as a hit. Common in YOLO-era papers and historical document work where region boundaries are inherently ambiguous. Results can be 10-20 points higher than $\text{AP}@[.50:.95]$ on the same model.
  • $\text{AP}@75$: Stricter localization threshold. Less commonly reported but useful for assessing whether a model captures tight region boundaries, which matters for downstream extraction.

Per-class AP values are often more informative than the mean. Classes like “Table” and “Figure” (visually distinct, large regions) tend to score much higher than “Caption” or “Footnote” (small, visually similar to body text). A high mAP can mask poor performance on minority classes.
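To make the threshold choice concrete, here is a minimal pure-Python sketch (not pycocotools) of IoU and of how the same detection fares across the COCO threshold sweep:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


# A prediction that covers only the top 57% of a ground-truth table region:
gt, pred = (0, 0, 100, 100), (0, 0, 100, 57)
score = iou(gt, pred)                    # 0.57 — a hit at AP@50, a miss at AP@75

# AP@[.50:.95] averages over 10 IoU thresholds, 0.50 to 0.95 in steps of 0.05;
# this detection counts as correct at only 2 of the 10.
thresholds = [0.50 + 0.05 * k for k in range(10)]
hits = [score >= t for t in thresholds]
```

Full AP additionally sweeps confidence thresholds and interpolates the precision-recall curve, but the IoU gate above is what separates the variants.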

F1 Score: When Layout Is Token Classification

The object detection paradigm (mAP) assumes layout analysis produces bounding boxes. The sequence labeling paradigm instead assigns a class label to each text token or text line, using the 2D position as input context. This is how the LayoutLM family, LiLT, and VSR approach the problem.

In this setting, standard classification metrics apply: precision (fraction of predicted labels that are correct), recall (fraction of ground-truth labels recovered), and their harmonic mean, F1. Two averaging conventions appear in the literature:

  • Macro-F1: Compute F1 per class, then average. Gives equal weight to rare classes.
  • Micro-F1: Pool all predictions, compute F1 globally. Dominated by frequent classes like “Text” or “Paragraph.”
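The gap between the two conventions is easy to see on a toy imbalanced example. This sketch computes both by hand; the numbers should match sklearn's `f1_score` with `average="macro"` and `average="micro"`:

```python
def f1_scores(true, pred):
    """Per-class, macro-, and micro-averaged F1 for flat token labels."""
    classes = sorted(set(true))
    per_class = {}
    for c in classes:
        tp = sum(t == p == c for t, p in zip(true, pred))
        fp = sum(p == c != t for t, p in zip(true, pred))
        fn = sum(t == c != p for t, p in zip(true, pred))
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_class.values()) / len(classes)
    # Single-label multiclass: micro-F1 reduces to plain accuracy.
    micro = sum(t == p for t, p in zip(true, pred)) / len(true)
    return per_class, macro, micro


# 8 body-text tokens, 2 captions; one caption mislabeled as Text.
true = ["Text"] * 8 + ["Caption"] * 2
pred = ["Text"] * 9 + ["Caption"]
per_class, macro, micro = f1_scores(true, pred)
# macro ≈ 0.80 (the rare class drags it down), micro = 0.90
```

One mislabeled token out of ten barely moves micro-F1, but costs the rare Caption class a third of its F1, which is exactly the effect the macro variant is designed to expose.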

Comparing F1 scores to mAP scores across paradigms is not meaningful; they measure different things on different granularities (token vs. region).

Layout: Comparative Studies

Cross-architecture evaluations that benchmark models from multiple paradigms on the same datasets.

| Year | Paper | Models Compared | Datasets | Key Finding | Notes |
| --- | --- | --- | --- | --- | --- |
| 2023 | Kastanas et al. | LayoutLMv3, YOLOv5, Paragraph2Graph, LiLT | DocLayNet, GROTOAP2, FUNSD/XFUND | YOLOv5 most practical for image-centric (fast, single GPU); LayoutLMv3 dominates text-centric (F1 0.87 vs. 0.70); machine translation hurts cross-lingual zero-shot transfer. | Notes |

Layout: Surveys

Comprehensive literature reviews that organize and synthesize the DLA field.

| Year | Paper | Scope | Key Contribution | Notes |
| --- | --- | --- | --- | --- |
| 2019 | BinMakhashen & Mahmoud | 79 DLA studies (pre-deep-learning era) | General DLA framework (preprocessing, bottom-up/top-down/hybrid analysis, evaluation). Taxonomy of 6 layout types. Three-tier evaluation hierarchy (PLEF, REF, CEF). | Notes |