Reading Order Prediction
Tracking models, datasets, and methods for determining the logical reading sequence of detected document regions.
Table of Contents
- Overview
- Reading Order: Paradigms
- Reading Order: Models
- Reading Order: Datasets
- Reading Order: Metrics
- Reading Order: Benchmarks
- To Investigate
- Related Pages
Disclaimer: This page covers models for predicting the logical reading sequence of detected regions. For detecting where regions are on a page, see the Layout Page. For parsing table internal structure, see the TSR Page.
Overview
Reading order prediction determines the logical sequence in which detected regions should be read. While it depends on layout detection for region proposals, it is a distinct task with its own models, datasets, and evaluation challenges.
The core difficulty is that reading order is not purely spatial. Multi-column layouts, sidebars, footnotes, and floating elements all require understanding the logical flow of a document rather than just scanning top-to-bottom, left-to-right.
Note: We are actively researching this area. Expect significant updates to this section.
Reading Order: Paradigms
Reading order prediction methods organize around five main paradigms:
Rule-based / Projection: Partitions the page by recursively finding gaps in horizontal or vertical ink projections. No trainable parameters; fast and deterministic.
- Examples: XY-Cut (classic), XY-Cut++
- Pros: No training data required; CPU-deployable; interpretable.
- Cons: Breaks down where clean projection gaps do not exist (L-shaped regions, multi-column with spanning headers, complex wrap-arounds).
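To make the projection paradigm concrete, here is a minimal recursive XY-Cut sketch, assuming a binarized page as a NumPy array; production implementations such as XY-Cut++ add pre-masking, noise filtering, and more careful gap thresholds:

```python
import numpy as np

def xy_cut(ink, y0=0, x0=0, min_gap=10):
    """Recursively split on the widest blank band; emit leaf boxes
    (top, left, bottom, right) in top-to-bottom, left-to-right order.
    `ink` is a binary 2D array (nonzero = ink)."""
    if ink.size == 0 or not ink.any():
        return []
    for axis in (0, 1):  # try horizontal bands first, then vertical
        proj = ink.any(axis=1 - axis)  # which rows (axis 0) / columns (axis 1) contain ink
        gap = widest_gap(proj, min_gap)
        if gap:
            a, b = gap
            if axis == 0:  # top part before bottom part
                return (xy_cut(ink[:a], y0, x0, min_gap)
                        + xy_cut(ink[b:], y0 + b, x0, min_gap))
            return (xy_cut(ink[:, :a], y0, x0, min_gap)  # left before right
                    + xy_cut(ink[:, b:], y0, x0 + b, min_gap))
    return [(y0, x0, y0 + ink.shape[0], x0 + ink.shape[1])]  # no clean gap: one region

def widest_gap(proj, min_gap):
    """Longest run of blank rows/columns, or None if all runs are too narrow."""
    best = start = None
    for i, v in enumerate(list(proj) + [True]):  # sentinel closes a trailing run
        if not v and start is None:
            start = i
        elif v and start is not None:
            if i - start >= min_gap and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
    return best
```

The emitted leaf order is exactly the paradigm's reading order for Manhattan layouts, and also exactly where it fails: any region arrangement without a clean blank band never gets split.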
Seq2Seq with Pointer Networks: An encoder reads all layout elements and a decoder emits an ordered sequence by repeatedly pointing to the next element.
- Examples: LayoutReader, FocalOrder
- Pros: End-to-end trainable; captures global context across all elements.
- Cons: Sequential decoding; output length tied to input size; requires large annotated datasets.
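A schematic of the greedy pointer decoding loop this paradigm relies on; `decoder_step` is a stand-in for the actual decoder, and LayoutReader's architecture differs in detail:

```python
import torch

def pointer_decode(memory: torch.Tensor, decoder_step) -> list[int]:
    """Greedy pointer decoding: at each step, point at the highest-scoring
    element that has not been emitted yet. `memory` is the encoder output
    (N x d); `decoder_step(memory, state)` returns (N,) attention logits
    over elements plus an updated decoder state (assumed interface)."""
    n = memory.size(0)
    visited = torch.zeros(n, dtype=torch.bool)
    state, order = None, []
    for _ in range(n):  # output length is tied to input size
        scores, state = decoder_step(memory, state)
        scores = scores.masked_fill(visited, float("-inf"))  # no element twice
        idx = int(scores.argmax())
        order.append(idx)
        visited[idx] = True
    return order
```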
Relation Extraction / Dependency Parsing: Frames reading order as pairwise precedence prediction; a chain-decoding step resolves pairs into a total or partial order.
- Examples: ROOR, Detect-Order-Construct, UniHDSA, HRDoc/DSPS, Token Path Prediction
- Pros: Does not assume a single total order; handles non-linear and partial sequences naturally; compatible with graph-based decoding.
- Cons: Quadratic complexity in element count; requires post-processing to resolve relation pairs into a valid sequence.
Multimodal End-to-End: A single model reads the document image and generates text, bounding boxes, and semantic labels in reading order in one pass, with no separate layout detection step.
- Examples: ÉCLAIR, DRGG
- Pros: No upstream OCR or detector required; reading order emerges from generation order.
- Cons: Large model sizes; hallucination risk; complex tables and equations often excluded from current releases.
Action-based / Transition Parsing: Incrementally builds a document hierarchy tree by predicting a discrete action (e.g., insert, skip, merge, move) at each step over a sequence of segments.
- Examples: MTP, HELD, CED/TRACER, CMM, SEG2ACT
- Pros: Naturally captures hierarchical structure (sections, subsections, ToC); interpretable step-by-step construction.
- Cons: Errors propagate through the action chain; many methods require electronic PDFs for font or layout feature extraction.
Reading Order: Models
Models with Code or Weights
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-03 | UniHDSA | None | GitHub | MIT | Notes. MSRA. Unified relation prediction framework collapsing page-level and document-level HDSA sub-tasks into a single label space. Multimodal Transformer (ResNet + BERT). REDS 96.7 (text), 91.0 (graphical) on Comp-HRDoc. Extends Detect-Order-Construct. |
| 2024-09 | ROOR baseline | None | GitHub | Apache-2.0 | Notes. Global pointer network on LayoutLMv3-large. 93.01 F1 (word-level), 82.38 F1 (segment-level) on ROOR dataset. Also introduces RORE downstream enhancement pipeline. |
| 2024-01 | Detect-Order-Construct | None | GitHub | MIT | Notes. MSRA. Multi-modal transformer relation prediction with separate Detect, Order (chain), and Construct (tree) modules with RoPE. REDS 0.9319 on Comp-HRDoc. |
| 2023-04 | CED/TRACER | Spico197/CatalogExtraction | GitHub | Apache-2.0 | Notes. ICDAR 2023. Compact 3-layer Chinese RoBERTa (RBT3) shift-reduce parser with Sub-Heading/Sub-Text/Concat/Reduce actions for catalog/ToC tree construction. 82.39% F1 on ChCatExt. Also introduces ChCatExt dataset and Wikipedia pre-training corpus. |
| 2021-08 | LayoutReader | microsoft/layoutreader | GitHub | MIT (code); Unknown (weights) | Notes. LayoutLM encoder + pointer-network decoder. Trained on ReadingBank (500k pages). |
| 2021-05 | MTP | stanfordnlp/pdf-struct | GitHub | Apache-2.0 | Notes. Stanford + Hitachi. NLLP@EMNLP 2021. Multimodal transition parser: five-label action sequence (continuous, consecutive, down, up, omitted) predicted by Random Forest on visual, textual, and semantic features. Works across four document types (English PDF contracts, plain-text contracts, Japanese contracts, executive orders). Introduces four annotated corpora. |
Methods (Paper Only)
| Date | Name | Notes |
|---|---|---|
| 2026-01 | FocalOrder | Notes. Unisound AI + CAS. LayoutLMv3-large (0.4B) trained with Focal Preference Optimization: EMA-based difficulty discovery + difficulty-calibrated pairwise ranking. Addresses “Positional Disparity” failure mode. Edit 0.038 EN / 0.055 ZH on OmniDocBench v1.0; REDS 97.1 (text) / 91.1 (graphical) on Comp-HRDoc. |
| 2025-04 | XY-Cut++ | Notes. Rule-based: pre-mask processing + multi-granularity XY-Cut + cross-modal matching. No trainable parameters; 514 FPS on CPU. 98.8 BLEU-4 on DocBench-100. |
| 2025-02 | ÉCLAIR | Notes. NVIDIA. ViT-H + mBART decoder (937M params). Jointly extracts text in reading order, bounding boxes, and semantic classes in a single pass. WER 0.142 on DROBS. |
| 2025-02 | DRGG | KIT. ICLR 2025. Plug-and-play relation head added to DETR-family detectors; jointly predicts DLA bounding boxes and full relational graph (spatial + logical). InternImage + RoDLA + DRGG: $mAP_g@0.5$ = 57.6%, Sequence AP 56.4% on GraphDoc. Full notes in Layout. |
| 2024-10 | SEG2ACT | Notes. EMNLP 2024. CAS. Generative LLM (Baichuan-7B) with a 3-symbol action vocabulary (+, *, =). Global context stack serialized in the same vocabulary; multi-segment batching for efficiency. Single-pass per-segment action prediction. DocAcc on ChCatExt; TEDS on HierDoc. Code: icip-cas/Seg2Act (Unknown license). |
| 2024-05 | DLAFormer | ICDAR 2024. MSRA + USTC. Deformable DETR with unified label space treating graphical detection, text region detection, logical role classification, and reading order as a single relation prediction problem. Type-wise query selection (encoder) + type-wise query initialization (decoder). No text embeddings; requires OCR text-line bounding boxes. REDS 96.6% (text) / 90.0% (graphical) on Comp-HRDoc. Full notes in Layout. |
| 2023-10 | CMM | Notes. EMNLP 2023. XY-Cut initial tree construction followed by GRU + GAT per-node encoding over a BFS subtree; Keep/Delete/Move operation prediction per node. Introduces ESGDoc (250 Chinese ESG annual reports). Requires electronic PDFs for font extraction. Code: xnyuwg/cmm (Unknown license). |
| 2023-10 | Token Path Prediction (TPP) | Notes. Fudan + Ant Group. EMNLP 2023. GlobalPointer head on LayoutLMv3/LayoutMask predicting directed token-pair edges; no reading-order assumption. Also introduces FUNSD-r and CORD-r benchmarks reflecting realistic OCR disorder. 80.40 F1 (FUNSD-r), 91.85 F1 (CORD-r) with LayoutLMv3. |
| 2023-10 | DSG | ICDM 2023. ETH Zurich / LMU. Replaces DocParser’s heuristic relation rules with LSTM-based prediction + grammar constraints; end-to-end trainable hierarchical parsing. Introduces E-Periodica dataset. Full notes in Layout. |
| 2023-05 | LayoutMask | Notes. Ant Group. ACL 2023. Text+layout-only pre-training: local (in-segment) 1D positions + segment-level 2D positions, with Whole Word Masking, Layout-Aware Masking, and Masked Position Modeling (GIoU loss). 182M params (Base). 92.91 F1 on FUNSD, outperforming all T+L+I base models. |
| 2023-05 | Sparse Graph Segmentation | Notes. Google. 267K-param multi-task GCN on a $\beta$-skeleton sparse graph. Predicts column-wise vs. row-wise pattern (node) and paragraph membership (edge); hierarchical cluster-and-sort recovers global order. Language-agnostic; mobile-deployable. NED 0.098 on English eval set. ICDAR 2023. |
| 2023-03 | HRDoc / DSPS | USTC + iFLYTEK. AAAI 2023. Hierarchical document structure reconstruction as a three-subtask formulation (classify, parent-find, relation-classify). Multi-modal bidirectional encoder + structure-aware GRU decoder with soft-mask. Micro-STEDS 0.8143 (simple) / 0.6903 (hard). Direct predecessor to Detect-Order-Construct. Full notes in Layout. |
| 2022-10 | ERNIE-Layout | Notes. Baidu. Findings of EMNLP 2022. Layout-aware serialization + spatial-aware disentangled attention + reading order prediction (ROP) as an explicit pre-training task. 24 A100s, 10M IIT-CDIP pages. 93.12 F1 on FUNSD (large). Code + weights: Apache-2.0 (PaddleNLP). |
| 2021-06 | ROPE | Notes. Google. Reading Order Equivariant Positional Encoding: assigns relative reading order codes to GCN neighbor messages before aggregation. Plugs into any GCN pipeline. Up to +8.4 F1 on word labeling, +2.5 F1 on FUNSD. ACL-IJCNLP 2021. |
| 2021-05 | HELD | Notes. JCST 2022. BiLSTM/CNN PUT-or-SKIP classifier predicting insertion positions in a rightmost-branch tree. Variable-depth hierarchy from long documents with error-tolerance training via simulated misplacements. Introduces stricter path-correctness metric. No code or data released. |
| 2019-11 | DocParser | AAAI 2021. ETH Zurich / LMU. Mask R-CNN detection + rule-based reading-order and parent-child relation classification. Direct predecessor to DSG, HRDoc, and Detect-Order-Construct. Full notes in Layout. |
Reading Order: Datasets
Datasets are grouped by the most permissive use permitted: commercial, research / non-commercial, and not available / restricted.
Commercial Use
No reading order datasets with fully permissive licenses (CC0, CDLA-Permissive, MIT, Apache-2.0) are currently confirmed. Datasets in this space tend to carry attribution requirements (CC-BY) or non-commercial clauses, or have undeclared licenses.
Research / Non-Commercial
| Date | Name | Size | Domain | Key Contribution | License | Notes |
|---|---|---|---|---|---|---|
| 2025-11 | PharmaShip | 161 docs | Chinese pharmaceutical shipping (scanned) | ISDR reading order graphs + SER + entity linking on domain-specific long-form documents | CC-BY-NC-4.0 | Notes. Xi’an Jiaotong University. ~108 RO links/doc vs. ~55 for ROOR. Entity-centric annotation decouples semantics from layout blocks. |
| 2024-09 | ROOR | 199 samples | Scanned VrDs (EC-FUNSD-based) | First dataset with ISDR-style reading order as directed acyclic relation pairs; 23.76% non-linear segments | CC-BY-4.0 | Notes. Fudan + Ant Group. 10,662 segments, 10,967 annotated ISDR relation pairs. Builds on EC-FUNSD. |
| 2023-10 | DocTrack | 539 docs | VRDs (forms, structured docs, infographics) | First eye-tracking-grounded reading order benchmark; human gaze trajectories mapped to OCR bounding box sequences | Research-only (Apache-2.0 header conflicts with non-commercial clause in README; treat as non-commercial until clarified) | Notes. 39,671 semantic entities; 86,054 tokens. Three subsets: WEAK (FUNSD), STRUCTURED (SeaBill), INFOGRAPH (InfographicVQA). Simple Z-ORDER heuristic often outperforms true eye-tracking order on downstream tasks. |
| 2021-08 | ReadingBank | 500k | Diverse (born-digital) | First large-scale reading order benchmark | Research-only (README prohibits redistribution despite Apache-2.0 file) | Notes. Auto-extracted from DocX XML metadata via color-based watermarking. English only. |
Not Available / Restricted
Described in a publication but not publicly downloadable.
| Date | Name | Size | Domain | Key Contribution | License | Notes |
|---|---|---|---|---|---|---|
| 2025-11 | MosaicDoc | 72.3K images | Bilingual (ZH/EN) newspapers + magazines (196 publishers) | Block- and line-level reading order on non-Manhattan multi-column layouts; bilingual multi-task (VQA, OCR, ROP, localization) | Not released (code: Apache-2.0) | Notes. AAAI 2026. Promised open release in Nov 2025; no download available as of Apr 2026. |
| 2025-02 | GraphDoc | 80K images | Mixed (financial, scientific, legal, manuals, patents, government) | Extends DocLayNet with 4.13M relation annotations across 8 categories: Sequence (reading order), Parent, Child, Reference, Up, Down, Left, Right. Only dataset combining spatial and logical graph relations at paragraph level with non-textual elements | Unknown (code: MIT; no dataset license declared; CDLA-Permissive-1.0 inheritance from DocLayNet unconfirmed by authors) | Notes. KIT. ICLR 2025. Dataset promised on GitHub but no explicit license statement. Sequence AP = 56.4% with DRGG baseline. |
| 2023-12 | M5HisDoc | 8,000 images (4K regular + 4K hard) | Chinese historical documents (Han to Qing dynasty; 20+ layouts; 20+ doc types) | First multi-style Chinese historical document benchmark; 5 tasks including reading order; Regular + Hard (rotation/distortion/resolution) splits | CC-BY-NC-ND-4.0 (application required) | Notes. NeurIPS 2023 D&B. SCUT. 16,151 character categories; zero-shot val/test splits. Simple right-to-left heuristic outperforms LayoutReader on this domain. |
| 2020-04 | HJDataset | 2,271 scans | Historical Japanese documents (1953 Who’s Who) | Layout, hierarchy, and reading order annotations for seven element types; rule-derived right-to-left RO with irregular-case handling | Apache-2.0 (annotations) / Restricted (images require request) | Notes. 259,616 annotations. Semi-automatic pipeline: RLSA + CCL segmentation, NASNet Mobile classifier, rule-based RO. No dedicated ROP model trained or evaluated. |
Reading Order: Metrics
| Metric | What it measures | Notes |
|---|---|---|
| BLEU (Page-level) | N-gram overlap between predicted and ground-truth reading order sequence | Used by LayoutReader on ReadingBank. |
| ARD (Average Relative Distance) | Positional displacement between predicted and ground-truth element positions; penalty $n$ for omitted elements | Used by LayoutReader and M5HisDoc. Lower is better. |
| F1 over relation pairs | Precision and recall over ISDR directed pairs $(i, j)$ | Introduced by ROOR. Replaces BLEU/ARD for the relation-extraction formulation. |
| REDS (Reading Edit Distance Score) | Edit distance between predicted and ground-truth reading order sequences, evaluated per group | Primary metric for Comp-HRDoc (Detect-Order-Construct). Lower is better. |
| NED over block sequences | Normalized Levenshtein distance between predicted and ground-truth block sequences | Used by OmniDocBench. Lower is better. |
| Kendall’s Tau | Rank correlation: concordant minus discordant pairs, normalized | Used in some document ordering papers as an alternative to BLEU/ARD. Correlates with human ratings. |
| Perfect Match Rate | Fraction of pages where predicted ordering exactly matches ground truth | Used alongside BLEU/ARD as a strict binary measure. |
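For reference, a minimal sketch of three of the sequence-level metrics above, given predicted and ground-truth orders over the same elements (the benchmark implementations differ in normalization details):

```python
from itertools import combinations

def kendall_tau(pred: list[int], gold: list[int]) -> float:
    """(concordant - discordant) / total over all element pairs; assumes
    pred and gold are permutations of the same elements (no ties)."""
    rank = {e: i for i, e in enumerate(gold)}
    pairs = list(combinations(pred, 2))
    concordant = sum(rank[a] < rank[b] for a, b in pairs)
    return (2 * concordant - len(pairs)) / max(len(pairs), 1)

def perfect_match(pred: list[int], gold: list[int]) -> bool:
    """Strict binary measure: the whole page order must match exactly."""
    return pred == gold

def normalized_edit_distance(pred: list[int], gold: list[int]) -> float:
    """Levenshtein distance over element sequences, normalized by length."""
    m, n = len(pred), len(gold)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(n + 1)] for i in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (pred[i - 1] != gold[j - 1]))
    return d[m][n] / max(m, n)
```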
Reading Order: Benchmarks
| Date | Name | Size | Scope | Primary Metric | License | Notes |
|---|---|---|---|---|---|---|
| 2025-04 | DocBench-100 | 100 docs | Block-level reading order; complex and regular layout subsets | BLEU-4 | Unknown | Notes. Introduced with XY-Cut++. Emphasizes multi-column and domain-specific complex layouts underrepresented in existing datasets. |
| 2025-02 | DROBS | 789 pages | Joint reading order, bounding boxes, and semantic classes | Counting F1, WER, BLEU | Unknown | Notes. Introduced with ÉCLAIR. Human-annotated from magazines, books, Common Crawl. Tables and equations masked in current release. |
| 2024-12 | OmniDocBench | 981 pages | Full document parsing: OCR, layout, table, formula, reading order | NED over block sequences (RO); CDM (formula); TEDS (table) | Apache-2.0 | Notes. CVPR 2025. Nine document types; attribute-level evaluation by language, rotation, background. |
| 2024-09 | READoc | 3,576 docs | End-to-end DSE (PDF to Markdown); reading order via KTDS | KTDS (reading order); EDS (text); TEDS (table/ToC) | Unknown | Notes. ACL Findings 2025. Sources: arXiv, GitHub, Zenodo. 16 systems evaluated. |
| 2024-01 | Comp-HRDoc | 1,500 docs (1K train / 500 test; HRDoc-Hard extension) | Hierarchical doc structure: page object detection, reading order, TOC, full reconstruction | REDS (reading order); segmentation mAP; Semantic-TEDS | MIT | Notes. MSRA. First benchmark to evaluate all four subtasks simultaneously. |
To Investigate
Papers and resources identified as likely relevant but not yet fully reviewed.
Methods
| arXiv | Title | Why Relevant |
|---|---|---|
| – | DocHieNet (Xing et al., EMNLP 2024) | Document hierarchy parsing with DHFormer (GeoLayoutLM + encoder-decoder); reading order implicit in the hierarchy. Alibaba + Zhejiang. Already in Layout index. No arXiv; DOI: 10.18653/v1/2024.emnlp-main.65. |
| – | MLARP (Qiao et al., Pattern Recognition 2024) | Multi-modal layout-aware relation prediction for ROP. Harmonizes visual, semantic, and positional features; layout-aware position embedding handles varying reading habits. No arXiv found; DOI: 10.1016/j.patcog.2024.000657. |
| – | XYLayoutLM (Gu et al., CVPR 2022) | Introduces Augmented XY Cut heuristic to generate multiple total-ordered reading sequences, explicitly acknowledging single-permutation limitation. Used as baseline in M5HisDoc ROP evaluation. |
| – | Layout2Pos (Nguyen et al., TPDL 2024) | Positional encoding injection for reading order; layout-only features; argues spatial layout alone is sufficient to infer reading order without textual content. No arXiv; DOI: 10.1007/978-3-031-72437-4_1. Identified via Giovannini & Marinai survey Tab. 3. |
| – | Gao et al. (2013) - “Newspaper article reconstruction using ant colony optimization and bipartite graph” (Applied Soft Computing 2013) | TSP / ant-colony formulation for reading order recovery across newspaper article blocks. Identified via Giovannini & Marinai survey graph-based ROP category. DOI: 10.1016/j.asoc.2012.07.012. |
| – | Quirós & Vidal (2022) - “Reading order detection on handwritten documents” (Neural Computing and Applications 2022) | Reading order as a learned sorting problem over text region pairs; applied to handwritten documents. No arXiv; DOI: 10.1007/s00521-021-06227-9. |
| – | Li et al. (2020) - “End-to-end OCR text re-organization sequence learning” (ECCV 2020) | GNN + pointer-network approach for reordering OCR text blocks; early deep learning predecessor to LayoutReader. No arXiv. |
| – | Clausner et al. (2013) - “The significance of reading order in document recognition and its evaluation” (ICDAR 2013) | Foundational work defining reading order evaluation methodology; notes single permutation is insufficient and proposes tree structures of ordered/unordered groups. Cited by ROOR as direct theoretical precursor. |
| – | Ferilli et al. (2014) - “Abstract argumentation for reading order detection” (DocEng 2014) | Unsupervised strategy based on abstract argumentation and empirical assumptions about human reading behavior. Historical ML-era method. |
| – | Ceci et al. (2007) - “A data mining approach to reading order detection” (ICDAR 2007) | Early statistical method modeling reading order as multiple disjoint chains; precursor to understanding non-linear reading order. |
| – | Aiello & Smeulders (2003/2004) - “Bidimensional relations for reading order detection” | Foundational rule-based ROP using qualitative rectangle relations; introduced the relational framing that ROOR formalizes two decades later. |
| – | Meunier (2005) - “Optimized XY-Cut for Determining a Page Reading Order” (ICDAR 2005) | The classic XY-Cut algorithm; foundational rule-based baseline that XY-Cut++ directly extends. DOI: 10.1109/ICDAR.2005.182. |
| 2101.12741 | Post-OCR Paragraph Recognition by Graph Convolutional Networks (Wang, Fujii & Popat, WACV 2022) | GCN-based grouping and ordering of OCR output into paragraphs. Same Google authors as ROPE; cited by Sparse Graph Segmentation. Reading order implicit in the paragraph assembly step. |
| – | Wu, Chou & Chang (2008) - “A Machine-Learning Approach for Analyzing Document Layout Structures with Two Reading Orders” (Pattern Recognition 2008) | Early ML method explicitly addressing dual reading order in two-column and complex layouts. Cited by FocalOrder as a historical predecessor. No arXiv. DOI: 10.1016/j.patcog.2008.03.009. |
| – | MTD (Hu et al., ICPR 2022) - “Multimodal Tree Decoder for Table of Contents Extraction in Document Images” | Multimodal tree decoder for ToC extraction; introduces HierDoc benchmark dataset. Same group as HRDoc (AAAI 2023) but earlier and distinct paper. Primary baseline for CMM and SEG2ACT. No arXiv found. |
| – | Bentabet et al. (ICDAR 2019) - “Table-of-Contents Generation on Contemporary Documents” | LSTM-based multi-stage ToC generation pipeline; one of the earliest deep learning approaches for ToC/hierarchy extraction; baseline cited in CMM and SEG2ACT. No arXiv found. DOI: 10.1109/ICDAR.2019.00025. |
| 1709.00770 | Rahman & Finin (BDCAT 2017) - “Deep Understanding of a Document’s Structure” | RNN-based classifier assigning logical structure depth labels (section, subsection, etc.) to document blocks; direct predecessor compared against in HELD and SEG2ACT. |
Benchmarks / Competitions
| Reference | Title | Why Relevant |
|---|---|---|
| ICDAR 2024 | RDTAG Task B (ICDAR 2024) | Competition: “Prediction of Reading Order” for text spotted through AR glasses. 33 registered teams. Site. |
Metrics
| Metric | Source | Notes |
|---|---|---|
| Reading-order-independent IE metrics | Villanova-Aparisi et al., ICDAR 2024 (2404.18664) | Metrics for handwritten IE that decouple transcription accuracy from reading order errors. |
Related Pages
- Document Layout Analysis: Models and datasets for detecting and classifying regions on a document page.
- Table Structure Recognition: Models and datasets for parsing the internal grid of detected tables.
- OCR: Text recognition pipelines that often run downstream of layout detection and reading order prediction.
- Document Understanding: End-to-end systems combining layout detection, structure recognition, and content extraction.
Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser
TL;DR
This paper proposes a transition-label formulation for recovering the logical hierarchy of visually structured documents (VSDs) such as PDFs. Given a sequence of text blocks extracted by existing layout tools, a Random Forest classifier predicts one of five transition labels between each consecutive pair of blocks, building a tree that captures paragraph boundaries and their nesting. The multimodal feature set (visual, textual, and semantic cues including a GPT-2 coherence signal) substantially outperforms visual-only and numbering-only baselines, achieving a paragraph boundary detection F1 of 0.953 on English NDAs compared to 0.739 for PDFMiner.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The primary contribution is a complete algorithmic system, the transition parser with its multimodal feature set, formulation, and two-pass decoding procedure. The paper is organized around algorithm description, ablation-style feature importance analysis, and cross-dataset baseline comparisons.
Secondary: $\Psi_{\text{Resource}}$: The authors release both the code (pdf-struct, Apache-2.0) and the annotated dataset covering four document types (English and Japanese NDAs in PDF and plain text, plus English executive orders in PDF). The dataset and system together constitute a reusable infrastructure contribution for legal document preprocessing research.
What is the motivation?
Most NLP pipelines assume clean, linear text, but a substantial fraction of real-world documents are visually structured: PDFs produced by word processors or scanners, plain text with indentation-based formatting, and so on. Existing preprocessing tools (PDFMiner, pdftotext, etc.) concentrate on two extremes: word-level segmentation and coarse layout analysis (figure/body/table detection). Neither step recovers the logical hierarchy inside the body text: which lines belong to the same paragraph, which paragraphs are children of a section header, which blocks are debris (page numbers, running headers) that should be discarded.
This gap matters especially in the legal domain. Contracts and executive orders are heavily hierarchical: a confidentiality obligation may be defined as a numbered clause containing lettered sub-clauses, and extracting the definition requires understanding that hierarchy. The best available PDF-to-text tool at the time of publication had a 17% newline detection error rate (Bast and Korzen, 2017), and none of these tools could recover hierarchy at all.
The paper targets single-column VSDs, for which coarse layout analysis is straightforward, and focuses exclusively on the logical structure within the body text.
What is the novelty?
Transition label formulation
The core insight is to reduce tree construction to a local, sequential classification problem. Given text blocks $b_1, b_2, \ldots, b_n$ in document order, the system predicts a transition label $trans_i$ for each consecutive pair $(b_i, b_{i+1})$:
$$trans_i \in \{\texttt{continuous},\ \texttt{consecutive},\ \texttt{down},\ \texttt{up},\ \texttt{omitted}\}$$
- continuous: $b_{i+1}$ continues the same paragraph as $b_i$.
- consecutive: $b_{i+1}$ starts a new sibling paragraph at the same level.
- down: $b_{i+1}$ starts a new paragraph that is a child of $b_i$’s paragraph (one level deeper).
- up: $b_{i+1}$ returns to an ancestor level; accompanied by a pointer $ptr_i = b_j$ that identifies which prior block’s level $b_{i+1}$ rejoins.
- omitted: $b_i$ is debris (a page number, running header, etc.) and is dropped; $trans_{i-1}$ is carried over to bridge $b_{i-1}$ and $b_{i+1}$.
The five labels are sufficient to reconstruct the full tree. Because up can be ambiguous about how many levels to ascend, the authors introduce a separate pointer identification module (Section 4.2) that selects among all candidate down blocks preceding $b_i$.
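A simplified sketch of how the five labels deterministically rebuild the tree, assuming omitted blocks were already dropped in a first pass and each up label arrives with a resolved ancestor depth (the paper instead resolves a pointer to a prior block):

```python
def build_tree(blocks, transitions):
    """blocks: list of block texts (debris already removed).
    transitions[i]: (label, ptr_level) relating blocks[i] and blocks[i+1];
    ptr_level is only meaningful for "up". Nodes are (text_parts, children)."""
    root = ([], [])
    node = ([blocks[0]], [])
    root[1].append(node)
    stack = [root, node]  # ancestors of the current node, root first
    for block, (label, ptr_level) in zip(blocks[1:], transitions):
        if label == "continuous":        # same paragraph: extend its text
            stack[-1][0].append(block)
            continue
        if label == "down":              # child of the current paragraph
            parent = stack[-1]
        elif label == "consecutive":     # sibling: share the current parent
            stack.pop()
            parent = stack[-1]
        else:                            # "up": rejoin a resolved ancestor level
            del stack[ptr_level + 1:]
            parent = stack[-1]
        node = ([block], [])
        parent[1].append(node)
        stack.append(node)
    return root
```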
Two-pass decoding
Because the omitted label changes which blocks participate in subsequent feature extraction, the system runs a first pass dedicated to detecting omitted blocks, then a second pass that extracts features from the correct four-block context window $[b_{i-1}, b_i, b_j, b_{j+1}]$ (where $b_j$ is the next non-omitted block after $b_i$) to predict the remaining four labels.
Multimodal feature set
The classifier receives hand-crafted features spanning three modalities (Table 2 in the paper lists 12 visual features V1-V12, 12 textual features T1-T12, and 1 semantic feature S1):
- Visual: indentation relative to adjacent blocks, indentation after stripping numbering prefixes, centering, line break before the estimated right margin, page transitions, header/footer zone membership, line spacing relative to the modal spacing, justification with internal spaces, positionally repeated text (header/footer detection via edit distance across pages), and emphasis via spaced characters or parentheses.
- Textual: a heuristic numbering-transition sub-parser (T1) that tracks numbering type memories to output a predicted transition as a feature, terminal punctuation, list-start and list-element patterns, strict and tolerant page-number regexes, domain-specific legal phrasing triggers (“whereas”, “now, therefore”), dictionary-like formatting, all-capitals, blank fill-in fields, and horizontal rule characters.
- Semantic: a GPT-2 language model coherence signal (S1) computed as $l(i, i-1) - l(i+1, i-1)$, where $l(a, b)$ is the language model loss for block $b_a$ given block $b_b$ as context, with a second variant $l(i+1, i) - l(i+1, i-1)$. A more negative value indicates $b_i$ is more coherent after $b_{i-1}$ than $b_{i+1}$ would be, suggesting $b_i$ is not debris.
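A minimal sketch of the S1 coherence computation with Hugging Face GPT-2, masking context tokens out of the loss; the paper's exact tokenization and context handling may differ:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

@torch.no_grad()
def block_loss(target: str, context: str) -> float:
    """l(a, b): LM loss of `target` given `context`; context tokens are
    masked with -100 so the loss covers target tokens only."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    tgt_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100
    return model(input_ids, labels=labels).loss.item()

def s1(blocks: list[str], i: int) -> float:
    """l(i, i-1) - l(i+1, i-1): more negative means b_i coheres better after
    b_{i-1} than b_{i+1} would, suggesting b_i is not debris."""
    return block_loss(blocks[i], blocks[i - 1]) - block_loss(blocks[i + 1], blocks[i - 1])
```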
Pointer identification
For up transitions, the pointer module selects among candidate blocks $C_i = \{b_j : trans_j = \texttt{down}, j < i\}$. Features for each candidate pair $(b_j, b_i)$ include: consecutive numbering between $b_i$ and $b_j$ or the paragraph head $b_{head(j)}$, indentation changes, left-alignment flags, and counts of intervening down and up transitions. A binary Random Forest classifier scores each candidate and the argmax is taken.
Customizable design
Feature extractors are implemented as Python classes with decorator-annotated methods (@single_input_feature, @pairwise_feature, @pointer_feature). New document types can subclass an existing extractor and override or extend individual features. The Contract$^{pdf}$ and Contract$^{jaf}$ (Japanese) extractors both inherit from a base PDF extractor, with Japanese-specific adaptations (e.g., a Japanese-language GPT-2 model and Japanese numbering detection).
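A toy illustration of this extension pattern; the decorator names follow the paper, but the base class and feature collection below are simplified stand-ins for pdf-struct's actual API:

```python
# Simplified stand-ins: real pdf-struct decorators/classes differ in detail.
def single_input_feature(fn):
    fn._feature_kind = "single"
    return fn

def pairwise_feature(fn):
    fn._feature_kind = "pairwise"
    return fn

class BaseFeatureExtractor:
    def feature_methods(self):
        """Collect every decorator-annotated method on this class hierarchy."""
        return [getattr(self, name) for name in dir(self)
                if getattr(getattr(self, name, None), "_feature_kind", None)]

class PdfFeatureExtractor(BaseFeatureExtractor):
    @single_input_feature
    def indentation(self, block):
        return block["left"]  # hypothetical block dict with geometry fields

    @pairwise_feature
    def line_spacing(self, block_a, block_b):
        return block_b["top"] - block_a["bottom"]

class JapaneseContractExtractor(PdfFeatureExtractor):
    """New document type: inherit everything, override or add features."""
    @single_input_feature
    def japanese_numbering(self, block):
        return block["text"].startswith(("第", "("))
```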
What experiments were performed?
Datasets
Four document types are evaluated, all annotated by a single author:
| Type | Format | Language | Documents | Avg. blocks/doc |
|---|---|---|---|---|
| Contract$^{pdf}$ | PDF | English | 40 | 137.9 |
| Law$^{pdf}$ | PDF | English | 40 | 165.9 |
| Contract$^{txt}$ | Plain text | English | 22 | 142.0 |
| Contract$^{jaf}$ | PDF | Japanese | 40 | 73.7 |
PDFs were downloaded from Google search results targeting NDAs and executive orders. Plain-text NDAs were retrieved from SEC EDGAR. All documents are single-column. Blocks were extracted using PDFMiner (LTTextLine objects with vertical overlap merging) for PDFs, and individual non-blank lines for plain text. Evaluation used five-fold cross-validation.
Evaluation metrics
The authors report two families of metrics:
- Information extraction (IE) perspective: For each pair of blocks, classify their pairwise relationship as same-paragraph, sibling, or ancestor-descendant. Precision, recall, and F1 are computed for each relationship type, then micro- and macro-averaged.
- Preprocessing perspective: Paragraph boundary detection F1 (both strict and lenient variants), and F1 for omitted block elimination (debris removal). PDFMiner is additionally included as a baseline for the boundary task.
Baselines
- Numbering baseline (Hatsutori et al., 2017): uses regular-expression-based numbering detection and type changes to infer hierarchy. Equivalent to the T1 feature used in isolation.
- Visual baseline: predicts from indentation direction and line spacing only.
- PDFMiner: purely geometric heuristics for paragraph break detection; available only for PDF inputs.
Classifier
Random Forest (scikit-learn defaults, no hyperparameter tuning) trained separately for each document type. GPT-2 medium is used for English; rinna/japanese-gpt2-medium for Japanese. The paper explicitly motivates the choice of Random Forest by its suitability for the predominantly categorical feature set and the small training set sizes typical in the legal domain.
What are the outcomes/conclusions?
Quantitative results
On the IE perspective metrics, the system achieves micro-average F1 of:
| Type | Visual | Numbering | Ours |
|---|---|---|---|
| Contract$^{pdf}$ | 0.772 | 0.778 | 0.914 |
| Law$^{pdf}$ | 0.827 | 0.685 | 0.908 |
| Contract$^{txt}$ | 0.587 | 0.674 | 0.828 |
| Contract$^{jaf}$ | 0.618 | 0.623 | 0.940 |
On paragraph boundary detection (micro-average F1):
| Type | PDFMiner | Visual | Numbering | Ours |
|---|---|---|---|---|
| Contract$^{pdf}$ | 0.739 | 0.712 | 0.793 | 0.953 |
| Law$^{pdf}$ | 0.667 | 0.676 | 0.750 | 0.948 |
| Contract$^{txt}$ | – | 0.633 | 0.702 | 0.950 |
| Contract$^{jaf}$ | 0.653 | 0.632 | 0.759 | 0.980 |
Transition label prediction accuracy (micro-average) ranges from 0.923 to 0.955 across document types. The system outperforms all baselines on all but one relationship type across all four document types.
Feature importance
Greedy forward selection and backward elimination (Table 5) reveal that the system makes balanced use of both visual and textual cues. Indentation (V1), larger line spacing (V8), and numbering hierarchy (T1) rank highly, as do all-capitals (T10) and terminal punctuation (T2). The semantic GPT-2 feature (S1) ranks lower than most visual and textual features, which the authors attribute to the model not being fine-tuned on legal text, causing the coherence signal to sometimes ignore context.
Error analysis
Two systematic error modes are identified: (1) for Contract$^{pdf}$, documents with bold or underlined section headers followed by unindented paragraphs confuse the system into predicting continuous instead of down; typographic features (bold, underline detection) would address this. (2) For Contract$^{jaf}$, all-capital blocks and underbar sequences that function as section titles or input fields are misclassified as omitted; the authors attribute this to insufficient training data for those patterns rather than a feature design failure.
The authors note an interesting finding: the system tends to perform better on hierarchically more complex documents, possibly because such documents use more varied visual and textual cues to signal structure, giving the classifier more signal to work from.
Limitations
- Single-column documents only. Multi-column layouts require prior layout analysis that is out of scope.
- Single annotator, no inter-annotator agreement measurement.
- Small dataset sizes (22-40 documents per type); results are cross-validated but no significance tests or confidence intervals are reported.
- The GPT-2 semantic feature performs weakly due to domain mismatch; future work could use legal domain-adapted language models.
- Typographic features (bold, italic, underline) are not included despite evidence they would help.
- The system processes blocks in linear order: it cannot revisit earlier predictions. Errors in the first (omitted detection) pass propagate to the second pass.
Reproducibility
Models
- The transition and pointer classifiers are Random Forest models using scikit-learn 0.x default hyperparameters. No model weights are needed beyond what scikit-learn trains from scratch.
- The semantic feature requires GPT-2 medium (English) and `rinna/japanese-gpt2-medium` (Japanese), both available on Hugging Face. These are used for inference only (language model loss computation), not fine-tuned.
- No pretrained weights specific to this paper are released.
Algorithms
- Two-pass classification: first pass predicts omitted labels; second pass uses corrected context windows to predict the four structural labels.
- Pointer identification: among all candidate down blocks, a separate binary Random Forest selects the argmax probability prediction.
- Feature extraction is block-level with a four-block context window. The right margin is estimated per-document by 1D clustering on block right-edge positions (minimum 6 members per page). Normal line spacing is estimated by 1D clustering on inter-block spacings (largest cluster).
- GPT-2 coherence feature uses log-loss differences; no gradient or fine-tuning involved.
- All implementation details are open-sourced at github.com/stanfordnlp/pdf-struct.
Data
- The annotated dataset (142 documents across four types) is released at a dedicated page (stanfordnlp.github.io/pdf-struct-dataset/) under CC-BY-4.0, separate from the code repository (which is Apache-2.0).
- PDFs were collected from Google search results; plain-text NDAs from SEC EDGAR. No license is specified for the source documents themselves, which are legal contracts and government orders from various jurisdictions.
- Annotation was performed by one author with no majority vote. The authors argue that labels are unambiguous enough that single-annotator annotation is reliable, but no inter-annotator agreement metric is reported.
Evaluation
- Five-fold cross-validation throughout.
- IE perspective: pairwise relationship matrices compared for same-paragraph, sibling, and ancestor-descendant categories; precision, recall, and F1 reported per category plus micro/macro averages.
- Preprocessing perspective: paragraph boundary F1 and omitted block elimination F1.
- Transition label accuracy (micro) also reported as a diagnostic.
- No error bars or significance tests reported.
- PDFMiner baseline is applicable only to PDF document types; not included for Contract$^{txt}$.
Hardware
- Training and inference used ABCI (AI Bridging Cloud Infrastructure) provided by Japan’s AIST. Specific GPU types and counts are not reported.
- Random Forest training on small datasets (22-40 documents) is fast; the main computational cost is GPT-2 inference for the S1 feature.
- The system is CPU-feasible for both training and inference given the classical ML classifier; GPU acceleration is needed only for the GPT-2 feature.
BibTeX
@inproceedings{koreeda-manning-2021-capturing,
title = {Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser},
author = {Koreeda, Yuta and Manning, Christopher D.},
booktitle = {Proceedings of the Natural Legal Language Processing Workshop 2021},
month = nov,
year = {2021},
address = {Punta Cana, Dominican Republic},
publisher = {Association for Computational Linguistics},
pages = {144--154},
doi = {10.18653/v1/2021.nllp-1.15},
url = {https://aclanthology.org/2021.nllp-1.15},
}
CED: Catalog Extraction from Documents with TRACER
TL;DR
This paper defines the Catalog Extraction from Documents (CED) task, which parses a flat sequence of OCR text segments into a hierarchical table-of-contents tree. The authors release ChCatExt, a 650-document manually annotated Chinese corpus, and propose TRACER, a transition-based parser built on a lightweight RoBERTa backbone that uses four actions (Sub-Heading, Sub-Text, Concat, Reduce) to incrementally build the catalog tree. TRACER outperforms the pipeline baseline by 5.3 F1 points and the tagging baseline by 4.1 F1 points, and shows meaningful cross-domain transfer when pre-trained on a large Wikipedia corpus.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper’s primary contribution is TRACER, a transition-based parsing algorithm with a defined action set for catalog tree construction. The architecture design, ablation on constraints, and extensive transfer experiments all center on validating this method.
Secondary: $\Psi_{\text{Resource}}$: The paper also releases ChCatExt, the first annotated corpus for the CED task, and a large-scale Wikipedia pre-training corpus with automatically extracted catalog structures.
What is the motivation?
Long documents such as credit rating reports, financial announcements, and bid specifications contain information that is sparsely distributed across tens of pages. Feeding an entire document into an information extraction (IE) pipeline is computationally impractical; a prior structuring step that locates sections and their hierarchical relationships can dramatically reduce the search space for downstream IE tasks.
The obvious solution, handcrafted regular expressions, breaks down across three challenges:
- Section title formatting varies widely across document sources, making rules non-reusable.
- Catalog hierarchies can reach five or six levels deep, and naive rule systems cannot distinguish fine-grained heading levels reliably.
- OCR tools frequently split a single sentence across multiple segments at page breaks or line breaks, so the raw text stream does not align with semantic boundaries.
No prior dataset existed for this task, and no prior model addressed all three challenges jointly.
What is the novelty?
The core novelty is TRACER (Transition-based CAtalog extRaction), which reformulates catalog tree construction as a shift-reduce parsing problem operating on document-scale input.
Transition system. The parser maintains two data structures: a catalog tree stack $S$ and an input queue $Q$ of text segments. At each step, the stack top $s$ and the queue front $q$ are compared, and one of four actions is emitted:
- Sub-Heading: $q$ becomes a child heading node of $s$.
- Sub-Text: $q$ becomes a child text node of $s$.
- Concat: $q$ is the continuation of $s$; their contents are concatenated (handles OCR splits).
- Reduce: $s$ is replaced by its parent, moving up the tree when $q$’s level is at or above $s$’s level.
This action vocabulary sidesteps the need to pre-define a maximum heading depth, making the parser applicable to arbitrarily deep catalog structures.
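A minimal sketch of how the four actions mutate the stack/queue state, using a simplified node type rather than the authors' implementation; note that Reduce leaves the queue untouched, so the same segment is re-compared one level up:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    kind: str  # "heading" or "text"
    children: list = field(default_factory=list)

def apply_action(stack: list, queue: list, action: str) -> None:
    if action == "Reduce":   # pop to the parent; queue front is re-examined
        stack.pop()
        return
    segment = queue.pop(0)
    top = stack[-1]
    if action == "Concat":   # queue front continues the stack top (OCR split)
        top.text += segment
    else:                    # Sub-Heading / Sub-Text: attach a child of the top
        child = Node(segment, "heading" if action == "Sub-Heading" else "text")
        top.children.append(child)
        stack.append(child)

# Parsing loop: start from a root node, then repeatedly predict an action
# from (stack top, queue front) until the queue is exhausted.
root = Node("", "heading")
```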
Model architecture. Each step encodes $s$ and $q$ independently through a pre-trained language model (RBT3, a compact Chinese RoBERTa variant), concatenates the representations, and passes them through a two-layer FFN with ReLU and dropout:
$$ \begin{aligned} g &= s \Vert q \\ o &= \text{FFN}(g) \\ p(A \mid s, q) &= \text{softmax}(o) \end{aligned} $$
At inference, the predicted action is the argmax over the action set $A$:
$$ a_i = \underset{a \in A}{\arg\max}\; p(a \mid s, q) $$
Training uses standard cross-entropy loss:
$$ \mathcal{L} = -\sum_i \mathbb{I}_{y_a = a_i} \log p(a_i \mid s, q) $$
Two hard constraints prevent illegal tree states: the first action must be Sub-Heading or Sub-Text (not Reduce or Concat onto an empty stack); text nodes may only be followed by Reduce or Concat actions. When the top-scoring action is illegal, the model falls back to the highest-scoring legal action.
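A sketch of this constraint-masked greedy decoding, assuming raw action logits from the FFN head:

```python
import torch

ACTIONS = ["Sub-Heading", "Sub-Text", "Concat", "Reduce"]

def legal_mask(step: int, stack_top_kind: str) -> torch.Tensor:
    """True where an action is legal given the current tree state."""
    mask = torch.ones(len(ACTIONS), dtype=torch.bool)
    if step == 0:                 # first action must attach to the empty root
        mask[ACTIONS.index("Concat")] = False
        mask[ACTIONS.index("Reduce")] = False
    if stack_top_kind == "text":  # text nodes only allow Concat or Reduce
        mask[ACTIONS.index("Sub-Heading")] = False
        mask[ACTIONS.index("Sub-Text")] = False
    return mask

def decode_step(logits: torch.Tensor, step: int, stack_top_kind: str) -> str:
    """Greedy argmax with illegal actions masked to -inf, so the model
    falls back to the highest-scoring legal action."""
    masked = logits.masked_fill(~legal_mask(step, stack_top_kind), float("-inf"))
    return ACTIONS[int(masked.argmax())]
```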
Wikipedia pre-training. To improve cross-domain generalization, the authors train TRACER on 214,989 Chinese Wikipedia articles with automatically derived catalog structures (filtered from 665,355 documents to those with depth 2 to 4), producing a WikiBert checkpoint for transfer experiments.
What experiments were performed?
Dataset. ChCatExt contains 650 manually annotated documents across three domain-specific subsets: 100 bid announcements (BidAnn), 300 financial announcements (FinAnn), and 250 credit rating reports (CreRat). Documents are split 80/10/10 for train, development, and test. OCR segmentation is simulated by chunking paragraphs with 50% probability and splitting heading strings at lengths of 7 to 20 characters using a Chinese tokenizer.
Evaluation metric. Micro-averaged F1 over predicted tree nodes, where each node is represented as a tuple (level, type, content). Precision and recall are:
$$ P = \frac{N_r}{N_p}, \quad R = \frac{N_r}{N_g}, \quad F1 = \frac{2PR}{P + R} $$
where $N_r$ is correctly matched tuples, $N_p$ is predicted tuples, and $N_g$ is gold tuples.
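This metric reduces each node to a (level, type, content) tuple and matches exactly; a direct sketch using multiset matching:

```python
from collections import Counter

def tuple_f1(pred, gold):
    """pred/gold: lists of (level, type, content) node tuples."""
    p, g = Counter(pred), Counter(gold)
    n_correct = sum((p & g).values())  # multiset intersection
    precision = n_correct / max(sum(p.values()), 1)
    recall = n_correct / max(sum(g.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```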
Baselines. Two baselines were constructed because no prior systems exist for CED:
- Classification Pipeline: A two-stage system that first predicts segment concatenation using a [CLS]-based classifier, then predicts depth level with PLM + TextCNN.
- Tagging: A joint sequence tagger using PLM + LSTM + CRF to simultaneously predict BIO concatenation tags and depth/type labels.
Experiment settings. Experiments ran on a single NVIDIA Titan Xp GPU. RBT3 was used as the backbone PLM. Training used AdamW with learning rate 2e-5, batch size 20, dropout 0.5, for 10 epochs. Results are averages over 5 random seed trials.
Transfer experiments. Four tables explore cross-domain and low-resource transfer: training on one subset and evaluating on another; training on only $k \in \{3, 5, 10\}$ source documents; sequential fine-tuning on source then $k$ target documents; and joint training on the concatenated source plus $k$ target documents.
What are the outcomes/conclusions?
Main results. On ChCatExt, TRACER achieves 82.390% overall F1, compared to 77.085% for the Classification Pipeline and 78.269% for the Tagging baseline. The gains are consistent across heading (90.486% vs. 87.601% and 87.846%) and text nodes (84.328% vs. 82.047% and 81.344%). The authors attribute the advantage to TRACER reasoning about pairwise structural relationships between nodes rather than assigning absolute depth labels, and to the fact that TRACER does not require a predefined maximum depth.
Removing the two decoding constraints drops overall F1 by 0.794%, confirming they are useful but not load-bearing for correctness.
Wikipedia pre-training did not help on in-domain evaluation. TRACER with WikiBert scores 82.063% overall, marginally below the vanilla TRACER (82.390%), likely because the Wikipedia domain differs substantially from Chinese business documents.
Transfer. Cross-domain zero-shot transfer is weak in all configurations, with BidAnn-trained models scoring only 7.4% and 2.4% on FinAnn and CreRat respectively. WikiBert improves transfer in 6 out of 9 cross-domain pairs (Table 4) and in 23 out of 27 few-shot pairs (Table 5). In low-resource settings, WikiBert pre-training is clearly beneficial. When $k$ target documents are available, concatenating source and target data tends to outperform sequential fine-tuning as $k$ grows.
Depth-level analysis. F1 degrades at deeper levels, and TRACER fails entirely on level-1 text nodes (none exist in the gold data) and level-5 heading nodes (too rare). The authors note that pairwise encoding without global history is a structural limitation for deeper predictions.
Limitations acknowledged by the authors. The domain gap between Wikipedia and business documents limits the utility of Wikipedia pre-training for in-domain settings. Comparing only two nodes at a time discards global structural context, which the authors identify as a key source of errors at deeper hierarchy levels. The dataset is Chinese-only business documents, and generalization to other languages or document types is not evaluated.
Limitations not acknowledged. The simulation of OCR errors by random chunking is a rough approximation; real OCR errors include character substitutions, merged lines, and layout-induced reordering that chunking alone does not capture. The evaluation metric treats node content as an exact match, which may be brittle to minor string normalization differences. No statistical significance testing is reported; results are averages over 5 seeds but without confidence intervals.
Reproducibility
Models
- Backbone: RBT3, a 3-layer Chinese RoBERTa variant from the HFL group, available on Hugging Face (`hfl/rbt3`, Apache-2.0). Parameter count is not stated explicitly in the paper, but RBT3 is a compact model designed for efficiency.
- Head: Two-layer FFN with ReLU activation and 0.5 dropout; output dimension matches the 4-class action space.
- Released checkpoints (GitHub releases, Apache-2.0):
  - `transducer_DomainMix_1227.zip`: vanilla TRACER trained on the full ChCatExt mix.
  - `transducer_plm4w_-1_DomainMix_1227.zip`: TRACER initialized from WikiBert, fine-tuned on ChCatExt.
  - `wiki_plm_4w.zip`: the WikiBert PLM checkpoint (RBT3 trained for 40k steps on the Wikipedia corpus), ready for use as initialization in transfer experiments.
Algorithms
- Optimizer: AdamW, learning rate 2e-5, no other schedule details reported.
- Batch size: 20 segments per batch.
- Training duration: 10 epochs for domain-specific training; 40k steps for Wikipedia pre-training.
- Dropout: 0.5 in the FFN.
- Decoding: Greedy (argmax) action selection at each step, with constraint-based fallback to the next legal action when the top prediction is illegal.
- Model selection: Best checkpoint on the development set; final evaluation on the test set.
- Runs: 5 trials with different random seeds; average reported.
Data
- ChCatExt: 650 manually annotated Chinese business documents (100 BidAnn, 300 FinAnn, 250 CreRat), split 80/10/10. Annotation was performed by pairs of annotators with expert adjudication; each document took up to 20 minutes to label.
- Wikipedia corpus: 214,989 Chinese Wikipedia articles filtered for catalog depth 2 to 4 from a dump dated 2021-12-20. Catalog structure is automatically derived from Wikipedia heading markup.
- OCR simulation: Paragraphs chunked with 50% probability into segments of 70 to 100 characters; headings split at 7 to 20 character boundaries using jieba.
- Availability: Code and all data are released at `https://github.com/Spico197/CatalogExtraction` under Apache-2.0. The GitHub releases page includes `ChCatExt.zip` (annotated data), `Wiki.zip` (pre-training corpus), `DataForAnalysisExp.zip` (data-scale analysis splits), and `OriginalRawData.zip`. Source document URLs are listed in the paper (Hebei bid platform, cninfo.com.cn, chinaratings.com.cn, dfratings.com), but the raw PDFs are not redistributed.
Evaluation
- Metric: Micro-averaged F1 over (level, type, content) node tuples. Heading-only and text-only sub-scores are also reported but on disjoint subsets, so they are not averages of the overall score.
- Baselines: Both baselines use RBT3 as their PLM. The classification pipeline uses TextCNN; the tagging baseline uses LSTM + CRF. Maximum heading depth is set to 8 for both baselines.
- Benchmark: ChCatExt only; no external benchmark exists for this task.
- Statistical rigor: 5 random seed runs with averages; no standard deviations or confidence intervals are reported for the main results (standard deviation is mentioned once in passing for a data-scale ablation).
Hardware
- Training: Single NVIDIA Titan Xp (12 GB VRAM). No GPU-hour estimates are reported.
- Inference: Not characterized separately.
- Cost: Not reported.
- Local feasibility: RBT3 is compact; the system should be runnable on a single consumer GPU given the small batch size and short sequence lengths (individual segment pairs rather than full documents).
BibTeX
@inproceedings{zhu2023ced,
title={CED: Catalog Extraction from Documents},
author={Zhu, Tong and Zhang, Guoliang and Li, Zechang and Yu, Zijian and Ren, Junfei and Wu, Mengsong and Wang, Zhefeng and Huai, Baoxing and Chao, Pingfu and Chen, Wenliang},
booktitle={Document Analysis and Recognition -- ICDAR 2023},
pages={120--136},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41679-8_8}
}
Detect-Order-Construct: A Tree-Based Framework for Hierarchical Document Structure Analysis
TL;DR
Wang et al. propose Detect-Order-Construct, a three-stage framework that decomposes hierarchical document structure reconstruction into page object detection (Detect), reading order prediction (Order), and table-of-contents extraction (Construct). All three stages are cast as relation prediction tasks solved by multi-modal transformer-based models with structure-aware designs. The authors also introduce Comp-HRDoc, the first benchmark to evaluate all four sub-tasks of hierarchical document structure analysis concurrently.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The central contribution is a new pipeline architecture and a family of multi-modal, transformer-based relation prediction models tailored to chain (reading order) and tree (TOC) structures. Most of the paper describes the Detect, Order, and Construct modules, their architectural details, and ablations verifying each design choice.
Secondary: $\Psi_{\text{Resource}}$: The paper introduces Comp-HRDoc, a new benchmark built on top of the HRDoc-Hard dataset. This benchmark extends existing annotations to support simultaneous evaluation of four sub-tasks, and is released publicly.
Secondary: $\Psi_{\text{Evaluation}}$: The paper also proposes a new metric, Reading Edit Distance Score (REDS), for evaluating reading order prediction when multiple independent reading-order groups are present, addressing a gap in existing metrics.
What is the motivation?
Document structure analysis has historically been fragmented: prior work tackled page object detection, reading order prediction, and table-of-contents (TOC) extraction as isolated tasks. No end-to-end system or benchmark addressed all of them together.
Key limitations the authors identify in prior work:
- DocParser (Rausch et al., 2021) handled physical layout parsing with rule-based relation classification, but ignored logical structure such as TOC.
- HRDoc / DSPS (Ma et al., 2023) introduced a hierarchical parsing system, but assumed reading order was provided as ground truth. Its line-level relation prediction scaled quadratically with the number of text lines, making it impractical for long documents.
- Existing reading-order metrics (e.g., full-ranking metrics from Quirós et al.) assume a single linear reading order and cannot handle documents with multiple independent reading-order groups (e.g., newspapers with parallel articles).
The authors frame hierarchical document structure reconstruction as recovering a rooted tree $H$ whose nodes are page objects and whose edges encode three relationship types: text-region reading order, graphical-region attachment (caption-to-figure, footnote-to-table), and TOC parent-child relationships among section headings.
What is the novelty?
Unified relation prediction formulation
All three pipeline stages are cast as dependency-parsing-style relation prediction problems. For any pair of elements $(e_i, e_j)$, a score is computed:
$$ f_{ij} = FC_q(F_{e_i}) \circ FC_k(F_{e_j}) + \text{MLP}(r_{b_{e_i}, b_{e_j}}) $$
where $\circ$ is the dot product, $FC_q$ and $FC_k$ project element representations into distinct feature spaces, and $r_{b_{e_i}, b_{e_j}}$ is a spatial compatibility feature vector encoding relative position and size. A softmax cross-entropy loss (rather than binary cross-entropy) enforces a single-successor constraint consistent with the chain structure of reading order:
$$ s_{ij} = \frac{\exp(f_{ij})}{\sum_{n=1}^{N} \exp(f_{in})} $$
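A minimal PyTorch-style sketch of this relation head; the hidden size, the spatial-feature dimensionality, and the MLP shape are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    def __init__(self, hidden=768, spatial_dim=18):
        super().__init__()
        self.fc_q = nn.Linear(hidden, hidden)   # FC_q: query projection
        self.fc_k = nn.Linear(hidden, hidden)   # FC_k: key projection
        self.spatial_mlp = nn.Sequential(       # MLP over pairwise spatial features
            nn.Linear(spatial_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, feats, spatial):
        # feats: (N, hidden) element features; spatial: (N, N, spatial_dim)
        q, k = self.fc_q(feats), self.fc_k(feats)
        scores = q @ k.t() + self.spatial_mlp(spatial).squeeze(-1)  # f_ij
        # Softmax over candidate successors j encodes the single-successor
        # chain constraint (log_softmax + NLLLoss = softmax cross-entropy).
        return scores.log_softmax(dim=-1)

head = RelationHead()
log_s = head(torch.randn(5, 768), torch.randn(5, 5, 18))       # (5, 5) log s_ij
loss = nn.NLLLoss()(log_s, torch.tensor([1, 2, 3, 4, 0]))      # successor targets
```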
Detect module: hybrid top-down and bottom-up detection
A shared CNN backbone (ResNet) provides multi-scale features. A top-down detector (DINO or Mask2Former) localises graphical objects (tables, figures, displayed formulas). A bottom-up text-region detection model then groups text lines into text regions using the intra-region reading order predicted by the relation head, followed by plurality voting for logical role classification. Multi-modal representations fuse visual embeddings (RoIAlign on fused feature maps), text embeddings from BERT, and 2D positional embeddings:
$$ U_{t_i} = FC(\text{Concat}(V_{t_i}, T_{t_i}, B_{t_i})) $$
A single-layer Transformer encoder enhances these representations via self-attention before the relation and classification heads operate.
Order module: inter-region reading order with attention fusion
For each detected text region, an attention mechanism aggregates the constituent text-line features produced by the Detect module into a single region representation:
$$ \begin{aligned} \alpha_{tn_j} &= FC_1(\tanh(FC_2(F_{tn_j}))) \\ w_{tn_j} &= \frac{\exp \alpha_{tn_j}}{\sum_{j'} \exp \alpha_{tn_{j'}}} \\ U_{O_n} &= \sum_j w_{tn_j} F_{tn_j} \end{aligned} $$
A three-layer Transformer encoder then models interactions among all page objects before the inter-region reading order relation head predicts both the successor and the relation type (text-region order vs. graphical-region attachment).
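A short sketch of the attention pooling step above; dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, hidden=768, attn=256):
        super().__init__()
        self.fc2 = nn.Linear(hidden, attn)
        self.fc1 = nn.Linear(attn, 1)

    def forward(self, line_feats):
        # line_feats: (num_lines, hidden) text-line features from the Detect module
        alpha = self.fc1(torch.tanh(self.fc2(line_feats)))  # (num_lines, 1) scores
        w = alpha.softmax(dim=0)                            # normalized weights
        return (w * line_feats).sum(dim=0)                  # (hidden,) region vector

region = AttentionPool()(torch.randn(7, 768))
print(region.shape)  # torch.Size([768])
```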
Construct module: tree-aware TOC relation prediction
Given section headings in predicted reading order, the Construct module uses Rotary Positional Embedding (RoPE) to inject order information into a Transformer encoder. The TOC Relation Prediction Head predicts both parent-child and sibling relations simultaneously. Final TOC construction uses a greedy tree-insertion algorithm (Algorithm 1 in the paper) that inserts each heading into the rightmost subtree of the growing tree by combining parent and sibling scores element-wise, enforcing a valid tree structure during decoding.
Comp-HRDoc benchmark and REDS metric
Comp-HRDoc extends HRDoc-Hard (1,000 train / 500 test documents) with new annotations for page object detection (segmentation-level) and reading order. The Reading Edit Distance Score (REDS) is defined as:
$$ \text{REDS} = 1 - \frac{D}{N} $$
where $D$ is the minimum total Levenshtein distance between predicted and ground-truth reading-order groups (matched via Hungarian assignment) and $N$ is the total number of basic units. Paragraph boundary markers are included as explicit tokens so that paragraph segmentation errors are penalised within the reading-order evaluation.
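A hedged sketch of REDS under these definitions: groups are matched by Hungarian assignment on Levenshtein cost, unmatched groups pay their full length, and paragraph boundaries would enter simply as extra tokens in each sequence. Padding with empty groups when the group counts differ is an assumption:

```python
from scipy.optimize import linear_sum_assignment
import numpy as np

def lev(a, b):
    """Plain Levenshtein distance over token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def reds(pred_groups, gold_groups):
    n = max(len(pred_groups), len(gold_groups))
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            p = pred_groups[i] if i < len(pred_groups) else []
            g = gold_groups[j] if j < len(gold_groups) else []
            cost[i, j] = lev(p, g)  # empty vs. non-empty costs the full length
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching of groups
    D = cost[rows, cols].sum()                # minimum total edit distance
    N = sum(len(g) for g in gold_groups)      # total number of basic units
    return 1.0 - D / N

print(reds([["a", "b", "c"]], [["a", "c", "b"]]))  # imperfect order -> 0.33...
```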
What experiments were performed?
Datasets
- PubLayNet: 340k training pages from PubMed Central, 5 object classes. Metric: COCO-style mAP at IoU 0.50:0.95.
- DocLayNet: 69k human-annotated pages from diverse document types, 11 object classes. Metric: COCO-style mAP.
- HRDoc: Human-annotated hierarchical reconstruction dataset. Two splits: HRDoc-Simple (1,000 docs) and HRDoc-Hard (1,500 docs). Metrics: per-class F1 for semantic unit classification; Micro- and Macro-STEDS for hierarchical reconstruction.
- Comp-HRDoc: The authors’ new benchmark (1,000 train / 500 test, drawn from HRDoc-Hard). Metrics: segmentation mAP, REDS, and Semantic-TEDS for TOC and reconstruction.
Baselines
- Page object detection: Mask R-CNN, Faster R-CNN, YOLOv5, DINO, Mask2Former.
- Reading order: enhanced replication of the partial-order algorithm of Quirós et al. (2022).
- TOC extraction: Multimodal Tree Decoder (MTD, Hu et al., 2022).
- Hierarchical reconstruction: DSPS Encoder from HRDoc (Ma et al., 2023), evaluated with ground-truth reading order and bounding boxes as upper-bound conditions.
Ablations
- Hybrid strategy vs. pure Mask2Former for detection.
- Visual-only vs. visual+text modalities in the Detect and Construct modules.
- Removing sibling-finding head from TOC Relation Prediction Head.
- Removing the tree-insertion algorithm.
- Replacing softmax cross-entropy with binary cross-entropy loss.
What are the outcomes/conclusions?
Page object detection
On DocLayNet, the proposed Detect module achieves 81.0% mAP, compared to 76.8% for YOLOv5 (the prior leading method). Gains are especially pronounced for Page-footer (90.0% vs. 61.1%) and Section-header (83.2% vs. 74.6%). On PubLayNet, the Vision+Text variant reaches 96.5% mAP on the validation set, outperforming prior multimodal methods such as VSR (95.7%).
Comp-HRDoc end-to-end results
On the authors’ own benchmark (Table 6 in the paper):
| Sub-task | Metric | Best prior | Ours |
|---|---|---|---|
| Page object detection | Seg mAP | 73.54% (Mask2Former) | 88.06% |
| Reading order (text) | REDS | 0.7741 (Quirós et al.) | 0.9319 |
| Reading order (graphical) | REDS | 0.8583 (Quirós et al.) | 0.8637 |
| TOC extraction | Micro-STEDS | 0.6755 (MTD) | 0.8605 |
| Hierarchical reconstruction | Micro-STEDS | 0.6903 (DSPS w/ GT) | 0.8371 |
The DSPS baseline for hierarchical reconstruction was allowed ground-truth reading order and bounding boxes; the proposed system uses only predicted values.
HRDoc hierarchical reconstruction
On HRDoc-Hard, the proposed method exceeds DSPS Encoder by approximately 16.6 percentage points in Micro-STEDS and 15.8 points in Macro-STEDS, despite not receiving ground-truth reading order as input.
Key ablation findings
The tree-insertion algorithm is the most critical single component for TOC extraction: removing it drops Micro-STEDS from 0.8605 to 0.7111. Switching to binary cross-entropy loss causes a similar drop (to 0.7002), confirming the importance of the softmax dependency-parsing formulation. The hybrid detection strategy accounts for a 9.86% mAP gain over pure Mask2Former for page object detection, and adding the text modality contributes a further 4.66% on semantically sensitive categories.
Limitations
- The Construct module assumes section headings have been correctly identified by earlier stages; heading recognition errors propagate directly into TOC quality.
- Section numbers play a large role in the model’s ability to infer heading hierarchy; documents without numbered sections cause notable failures.
- The framework has so far been evaluated only on structured academic and technical PDF documents. Extension to contracts, financial reports, and handwritten material is left for future work.
- Future work also mentions graph-based logical structures, which the current tree-based formulation cannot directly represent.
Reproducibility
Models
- Backbone: ResNet-50 for PubLayNet/DocLayNet; ResNet-18 for HRDoc/Comp-HRDoc (to reduce GPU memory for multi-page processing).
- Graphical object detector: DINO for PubLayNet/DocLayNet; Mask2Former for HRDoc/Comp-HRDoc.
- Text encoder: BERT-Base (uncased), used to extract text-line embeddings.
- Transformer encoder (Detect module): 1-layer, 12 heads, 768 hidden dim, 2048 FFN dim.
- Transformer encoder (Order module): 3-layer, same hyperparameters.
- Pretrained weights for ResNet are from ImageNet classification; BERT weights are from the standard HuggingFace BERT-Base release. No model weights for the trained system are reported as publicly available.
Algorithms
- Optimizer: AdamW; betas (0.9, 0.999), epsilon 1e-8.
- PubLayNet: lr 1e-5 (backbone), 2e-5 (BERT); weight decay 1e-4 (backbone), 1e-2 (BERT); batch size 16; 12 epochs; lr divided by 10 at epoch 11.
- DocLayNet: same lr schedule, 24 epochs, lr divided by 10 at epoch 20.
- HRDoc / Comp-HRDoc: lr 2e-4 (backbone), 4e-5 (BERT); weight decay 1e-2 for both; batch size 1; 20 epochs; linear warmup for 2 epochs then linear decay to 0.
- Multi-scale training: shorter side randomly chosen from [512, 640, 768] (PubLayNet/DocLayNet) or [320, 416, 512, 608, 704, 800] (HRDoc/Comp-HRDoc); longer side capped at 800 or 1024, respectively. Test: shorter side 640 or 512.
- Loss: softmax cross-entropy (dependency parsing formulation) for all relation prediction heads.
Data
- PubLayNet: publicly available, IBM license (research use). ~340k training pages from PubMed Central.
- DocLayNet: publicly available on HuggingFace, CC-BY-4.0. ~69k human-annotated pages.
- HRDoc: publicly available (AAAI 2023 release); license not explicitly stated in the paper.
- Comp-HRDoc: released at https://github.com/microsoft/CompHRDoc under the MIT license (per the repository).
Evaluation
- PubLayNet / DocLayNet: COCO-style mAP at IoU thresholds 0.50:0.05:0.95.
- HRDoc semantic unit classification: per-class F1 score.
- HRDoc / Comp-HRDoc hierarchical reconstruction: Semantic-TEDS (Micro and Macro), which measures tree-edit distance on the predicted vs. ground-truth document trees weighted by semantic content.
- Comp-HRDoc reading order: REDS (proposed in this paper), accounting for multiple reading-order groups via Hungarian matching and including paragraph boundary tokens.
- Comparisons for hierarchical reconstruction use DSPS Encoder with access to ground-truth reading order and bounding boxes, which is a favourable baseline condition. The end-to-end evaluation in Comp-HRDoc is therefore more demanding for the proposed system.
- No error bars or statistical significance tests are reported; results are single-run.
Hardware
- All experiments: 8 Nvidia Tesla V100 GPUs (32 GB memory each).
- Training time is not reported.
- Inference hardware and latency are not reported.
BibTeX
@article{wang2024detectorderconstruct,
title={Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis},
author={Jiawei Wang and Kai Hu and Zhuoyao Zhong and Lei Sun and Qiang Huo},
journal={Pattern Recognition},
year={2024},
doi={10.1016/j.patcog.2024.110426}
}
ÉCLAIR: Jointly Extracting Text, Bounding Boxes, and Semantic Classes in Reading Order
TL;DR
ÉCLAIR is a 937M-parameter multimodal encoder-decoder that jointly extracts formatted text in reading order, block-level bounding boxes, and semantic class labels from a single document image pass. The authors also introduce DROBS, a 789-page human-annotated benchmark covering visually diverse layouts, and report competitive or leading performance on reading-order OCR and document object detection tasks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The central contribution is a new model architecture and training recipe for end-to-end document text extraction with simultaneous reading-order, bounding box, and semantic class prediction. The paper includes ablations, SOTA-style comparison tables, and an inference-speed optimization (multi-token decoding) that together characterize a methodological paper.
Secondary: $\Psi_{\text{Resource}}$: The authors release the arXiv-5M synthetic training dataset (generated via a modified LaTeX compiler) and the DROBS human-annotated benchmark.
Secondary: $\Psi_{\text{Evaluation}}$: DROBS is presented as a new evaluation protocol to address gaps in existing OCR benchmarks, which lack simultaneous reading-order and semantic class annotations.
What is the motivation?
Extracting useful text from complex documents requires more than recognizing characters. A downstream system needs to know which text block comes first, what role each block plays (body text, footnote, table, formula), and where each block sits on the page. Traditional OCR systems work at the word or line level and cannot resolve the spatial and semantic relationships that govern reading order. More capable pipelines chain multiple specialized models together, creating brittle systems where errors in one stage propagate through the rest.
Recent end-to-end models address parts of this problem but leave gaps. Nougat produces structured markdown but predicts no spatial information. GOT similarly omits bounding boxes. Kosmos-2.5 supports either formatted text or bounding boxes through two mutually exclusive prompts, so a caller cannot obtain both simultaneously. None of these models predict semantic classes alongside text and boxes.
The authors identify two further limitations in existing benchmarks: GOT’s benchmark captures document-level reading order but lacks block-level bounding boxes and semantic labels, while DocLayNet provides bounding boxes and classes but not reading order. DROBS is designed to fill both gaps at once.
What is the novelty?
ÉCLAIR combines a ViT-H/16 vision encoder (657M parameters, initialized from RADIO) with a lightweight mBART-based autoregressive decoder (279M parameters, trained from scratch). A horizontal convolutional neck reduces the 5120-token image sequence to 1280 tokens before feeding the decoder.
The decoder conditions on encoder features to predict tokens autoregressively:
$$P(t_i \mid \mathcal{N}(Z), t_{<i}), \quad Z = \mathcal{E}(I)$$
where $\mathcal{E}$ is the vision encoder, $\mathcal{N}$ is the neck, $I \in \mathbb{R}^{3 \times H \times W}$ is the input image, and $\{t_1, \dots, t_P\}$ are prompt tokens.
The output for each bounding box, in the maximal-information prompt (MIP) setting, follows:
<x_(\d+)><y_(\d+)>(text content)<x_(\d+)><y_(\d+)><class_([^>]+)>
where the first coordinate pair gives the top-left corner, the second gives the bottom-right corner, and the class token identifies the semantic category.
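A small parsing sketch for this output grammar; the helper name parse_mip and the sample string are hypothetical:

```python
import re

PATTERN = re.compile(
    r"<x_(\d+)><y_(\d+)>(.*?)<x_(\d+)><y_(\d+)><class_([^>]+)>", re.S)

def parse_mip(output):
    """Parse MIP-formatted output into (bbox, text, class) blocks."""
    blocks = []
    for x1, y1, text, x2, y2, cls in PATTERN.findall(output):
        blocks.append({"bbox": (int(x1), int(y1), int(x2), int(y2)),
                       "text": text.strip(), "class": cls})
    return blocks

sample = "<x_10><y_20>Hello world<x_90><y_40><class_Text>"
print(parse_mip(sample))
# [{'bbox': (10, 20, 90, 40), 'text': 'Hello world', 'class': 'Text'}]
```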
The prompt interface is a tuple drawn from three independent choices:
- text mode: <structured_text>, <plain_text>, or <no_text>
- bounding boxes: <bbox> or <no_bbox>
- classes: <classes> or <no_classes>
This yields eight valid combinations. The MIP (<structured_text><bbox><classes>) is used as the canonical training target. During fine-tuning, information is selectively dropped per dataset to accommodate partially annotated sources, allowing visually diverse data with incomplete labels to contribute to training.
A second technical contribution is multi-token decoding. Standard autoregressive decoding requires one step per output token, which is expensive for text-dense documents. The authors add $n-1$ linear heads on top of the decoder’s final hidden state, each predicting the next offset token:
$$\hat{t}_{i+k} = \text{Head}_k(h_i), \quad k \in \{1, \dots, n-1\}$$
With $n = 2$ or $n = 3$, multi-token ÉCLAIR matches or slightly improves text accuracy while reducing decoding steps by a factor of $n$.
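A minimal sketch of the extra heads; the paper describes $n-1$ linear heads of 1024 nodes each, so the layer shapes and vocabulary size below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, hidden=1024, vocab=50000, n=3):
        super().__init__()
        self.base = nn.Linear(hidden, vocab)  # standard next-token head
        self.extra = nn.ModuleList(           # n-1 offset heads on the same state
            nn.Linear(hidden, vocab) for _ in range(n - 1))

    def forward(self, h):
        # h: (batch, hidden) decoder hidden state at step i;
        # returns logits for t_{i+1}, ..., t_{i+n} in one decoding step
        return [self.base(h)] + [head(h) for head in self.extra]

logits = MultiTokenHead()(torch.randn(2, 1024))
print([l.shape for l in logits])  # three (2, 50000) logit tensors
```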
The arXiv-5M dataset underpins pre-training. Rather than post-processing compiled PDFs (as Nougat does via LaTeXML), the authors embed Python hooks directly into the TeX Live compiler to intercept node allocations, character placements, and hbox/vbox events as they occur. This allows character-level bounding boxes and semantic class assignments to be recorded at compile time, producing a tightly aligned image/annotation dataset at roughly 5 million pages.
What experiments were performed?
DROBS reading-order benchmark. The primary evaluation uses 789 human-annotated pages sampled from magazines, books, and Common Crawl documents. Annotations follow DocLayNet’s labeling schema with reading order added. ÉCLAIR in MIP mode is compared against Kosmos-2.5 and GOT, each run in both OCR and markdown output modes. Evaluation metrics include word error rate (WER), character-level edit distance, token-level F1/precision/recall, BLEU, METEOR, and a Counting F1 score designed to penalize missing repeated words:
$$\text{Counting F1} = \frac{2 \cdot P_{\text{count}} \cdot R_{\text{count}}}{P_{\text{count}} + R_{\text{count}}}$$
Tables and equations are masked from images prior to inference because DROBS does not yet include ground-truth labels for those element types.
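A sketch of the Counting F1 metric above, assuming counting precision and recall come from multiset word overlaps, so a model that drops repeated words is penalized:

```python
from collections import Counter

def counting_f1(pred_words, gold_words):
    pred, gold = Counter(pred_words), Counter(gold_words)
    overlap = sum((pred & gold).values())   # each repetition must be matched
    p = overlap / max(sum(pred.values()), 1)
    r = overlap / max(sum(gold.values()), 1)
    return 2 * p * r / (p + r) if p + r else 0.0

# Dropping one repetition of "the" lowers recall and hence the score:
print(counting_f1("the cat sat on mat".split(),
                  "the cat sat on the mat".split()))  # ~0.909
```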
GOT benchmark. ÉCLAIR is also evaluated on the English-language split of the GOT benchmark, comparing edit distance, F1, BLEU, and METEOR against Nougat, TextMonkey, DocOwl1.5, Vary, Fox, GOT, and Qwen-VL variants.
Nougat validation set. ÉCLAIR pre-trained on arXiv-5M is evaluated on text, math, and table extraction on a 10,000-sample arXiv validation set. The authors are careful to note that direct comparison to Nougat’s published numbers is not valid due to differences in output format and evaluation protocol.
DocLayNet object detection. ÉCLAIR is fine-tuned on DocLayNet alone for 50K steps and evaluated with COCO-mAP (IoU 0.5:0.95) against Mask R-CNN and SwinDocSegmenter. Because ÉCLAIR does not over-predict boxes, the authors also use sequence augmentation and top-k class sampling from Pix2Seq to produce multiple candidates for AP computation.
LLM training quality. Nemotron-8B is trained from scratch on 300B tokens using text extracted by ÉCLAIR versus PyMuPDF4LLM from a common PDF corpus. The resulting models are compared on MMLU and eight additional benchmarks (ARC-Easy, ARC-Challenge, HellaSwag, OpenBookQA, PIQA, RACE, WinoGrande, TriviaQA).
Inference speed and multi-token decoding. Average seconds per image and seconds per 100 tokens are measured on DROBS using an H100 GPU in bfloat16, comparing Nougat, GOT, and ÉCLAIR variants with $n \in \{1, 2, 3, 4\}$ tokens per step.
Training and inference ablations. Table 5 in the paper isolates the contributions of pre-training, fine-tuning, and the repetition penalty on DROBS Counting F1 and WER.
What are the outcomes/conclusions?
On DROBS, ÉCLAIR-MIP achieves WER of 0.142, F1 of 0.942, BLEU of 0.886, and METEOR of 0.930, outperforming Kosmos-2.5 (OCR mode: WER 0.195) and GOT (OCR mode: WER 0.302) on most metrics. Kosmos-2.5 in OCR mode achieves higher recall (0.950 vs. 0.942), and the authors attribute score differences between OCR and markdown modes to differences in training data blending rather than any architectural effect.
On the GOT English benchmark, ÉCLAIR achieves the best edit distance (0.032) among models with under 1B parameters, outperforming Fox (1.8B) and Qwen-VL-Max (72B+) on edit distance and several other metrics, though GOT (580M) achieves marginally higher F1 (0.972 vs. 0.968).
On DocLayNet mAP, ÉCLAIR reaches 73.9 overall, between Mask R-CNN (73.5) and SwinDocSegmenter (75.2). The authors note that mAP is structurally unfavorable to autoregressive detectors because there is no confidence score to sweep for a precision-recall curve. In the supplementary point-to-point precision/recall comparison (Table S1), ÉCLAIR achieves higher mean precision and mean recall than SwinDocSegmenter at matched operating points for most classes.
LLM quality improves when training on ÉCLAIR-extracted text: Nemotron-8B scores 39.1 MMLU (versus 37.2 with PyMuPDF4LLM) and 56.7 on the multi-benchmark average (versus 55.72). ÉCLAIR also extracts more tokens from the same PDFs (55.1B versus 43.6B), likely because it handles formatting and multi-column layouts that PyMuPDF4LLM misses.
Multi-token decoding at $n = 2$ reduces inference time from 3.8 to 2.5 seconds per image with a slight accuracy improvement (WER 0.13, F1 0.94). At $n = 3$, speed is 1.77 s/img with a small accuracy cost. The $n = 4$ variant degrades noticeably. Using a multi-token-trained model at next-token inference (one token per step) provides accuracy gains at no speed cost.
Limitations acknowledged by the authors include: absence of multilingual training (no Chinese or other non-English data beyond what appears incidentally), challenges with repetition-loop hallucinations mitigated by post-processing rather than eliminated, and the fact that DROBS does not yet include ground-truth table and equation labels.
Reproducibility
Models
- Total parameters: 937M (reported as 936M or 963M at different points in the paper; the supplementary states 937M as the canonical figure).
- Vision encoder: RADIO ViT-H/16, 657M parameters, input resolution 1280x1024, producing 80x64 patches. Initialized from a pretrained RADIO checkpoint.
- Vision neck: horizontal 1x4 convolutional compression (stride 1x4), reducing 5120 patches to 1280 tokens.
- Decoder: mBART architecture, 10 layers, 279M parameters, trained from scratch. Maximum sequence length: 3584 tokens.
- Tokenizer: Galactica tokenizer extended with $H + W + C + 7$ special tokens for coordinates, semantic classes, and prompt components.
- Pretrained model weights are not publicly released as of the arXiv submission.
Algorithms
- Two-stage training: pre-training on arXiv-5M, then fine-tuning on the full multi-source dataset.
- Both stages: AdamW optimizer, 130,000 iterations, effective batch size 128.
- Pre-training learning rate: $2 \times 10^{-5}$ constant with 5,000 linear warmup steps.
- Fine-tuning learning rate: $8 \times 10^{-6}$ constant with 500 linear warmup steps.
- Inference: greedy decoding with repetition penalty of 1.1.
- DocLayNet fine-tuning: 50,000 additional steps for the detection evaluation.
- Multi-token variants: $n - 1$ additional linear heads (1024 nodes each) added on top of the decoder hidden state; trained with teacher forcing on next-$n$ ground-truth tokens.
- Post-processing: MIP inference followed by strict regex syntax validation, bounding box spatial validity checks, and class schema validation to filter hallucinated outputs.
Data
- arXiv-5M: approximately 5 million pages of arXiv paper images with character-level bounding boxes, semantic classes, and formatted text. Generated via a modified TeX Live compiler with Python hooks. License not specified in the paper.
- SynthTabNet: 480K synthetic table images with HTML-to-LaTeX converted annotations.
- README: 302K pages rendered from Stack source-code README files via Pandoc and wkhtmltopdf, with Nougat’s data pipeline for alignment.
- DocLayNet: 56K pages, plain text with boxes and classes.
- G1000: 324K pages with plain text obtained via Tesseract OCR.
- Human-annotated Common Crawl: 14K pages with plain text, boxes, and classes.
- Total training data: approximately 6.176M pages.
- DROBS benchmark: 789 pages drawn from magazines, books, and Common Crawl, annotated by human labelers following DocLayNet guidelines with reading order added. The authors state they will release it to the research community; a formal URL or repository was not provided in the arXiv version.
Evaluation
- DROBS metrics: WER, character edit distance, token-level F1/precision/recall, BLEU, METEOR, Counting F1. String normalization removes non-alphanumeric characters and collapses whitespace before scoring. Tables and equations are masked from images for all methods because DROBS lacks ground-truth for those elements.
- GOT benchmark: same text metrics, evaluated on English split only; ÉCLAIR does not train on Chinese data.
- arXiv-5M validation set: 10,000 samples evaluated for text, math, and table extraction separately.
- DocLayNet: COCO-mAP (IoU 0.5:0.95, maxDets=100). Sequence augmentation and top-k class sampling are used specifically to produce enough candidate boxes for the AP curve. The authors note this metric is structurally unfavorable to autoregressive models and provide a point-to-point precision/recall comparison as an alternative in the supplementary.
- LLM benchmark: Nemotron-8B trained on 300B tokens total; 3.3 epochs of PDF-extracted tokens from a shared PDF corpus; remaining tokens from CommonCrawl, StackExchange, OpenWebMath, PubMed, bioRxiv, SEC filings, Wikipedia, and arXiv.
- The paper does not report error bars, statistical significance tests, or multiple runs.
Hardware
- Inference speed measured on a single NVIDIA H100 GPU in bfloat16.
- Single-token ÉCLAIR: 3.8 s/img, 0.42 s/100 tokens on DROBS.
- ÉCLAIR-2tkn (2 tokens per step): 2.5 s/img, 0.31 s/100 tokens.
- ÉCLAIR-3tkn: 1.77 s/img, 0.23 s/100 tokens.
- ÉCLAIR-4tkn: 1.32 s/img, 0.20 s/100 tokens.
- Training hardware (GPU count, total GPU-hours, memory requirements) is not reported in the paper or supplementary.
BibTeX
@article{karmanov2025eclair,
title={{\'E}CLAIR: Extracting Content and Layout with Integrated Reading Order for Documents},
author={Karmanov, Ilia and Deshmukh, Amala Sanjay and V{\"o}gtle, Lukas and Fischer, Philipp and Chumachenko, Kateryna and Roman, Timo and Sepp{\"a}nen, Jarno and Parmar, Jupinder and Jennings, Joseph and Tao, Andrew and Sapra, Karan},
journal={arXiv preprint arXiv:2502.04223},
year={2025}
}
FocalOrder: Focal Preference Optimization for Reading Order Detection
TL;DR
FocalOrder identifies a systematic failure mode called Positional Disparity, in which all existing reading order models achieve near-perfect accuracy at document start and end positions but collapse in the middle 20%–80% of the document. The authors attribute this to uniform cross-entropy supervision drowning out learning signals from structurally ambiguous layout transitions. Their proposed framework, Focal Preference Optimization (FPO), addresses this by pairing an EMA-based difficulty tracker with a difficulty-calibrated pairwise ranking objective. The authors report the lowest edit distance on OmniDocBench v1.0 and the highest REDS on Comp-HRDoc in their comparison, at 0.4B parameters.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is FPO, a new training objective and curriculum strategy for reading order models. The paper includes a progressive ablation study, comparison tables against pipeline tools, general VLMs, and expert document models, and an inference latency comparison showing that FPO adds no overhead at test time.
Secondary: $\Psi_{\text{Evaluation}}$: The Positional Disparity analysis (Section 3) is a careful empirical measurement across multiple architectures and two benchmarks. The Spatial-Logical Mismatch metric introduced there quantifies the structural cause of the observed disparity and stands as a diagnostic contribution independent of the proposed fix.
What is the motivation?
Reading order detection is a pre-requisite for reliable downstream document understanding. A misordered sequence fed into a RAG pipeline or a reasoning system propagates structural errors throughout the downstream task.
The dominant training recipe across all recent methods, from LayoutReader through MinerU 2.5, is standard cross-entropy over the predicted order sequence. This recipe treats every transition in a document as equally important. The authors conduct an empirical analysis across LayoutReader, PaddleOCR-VL, and MinerU 2.5 on OmniDocBench and Comp-HRDoc, measuring prediction error rate binned by relative document position $p = t/T$. All models show a consistent “Inverted-U” error curve: error is low in the first and last 10–20% of the document, peaks between positions 20% and 80%, and drops back at the end.
To diagnose the cause, the authors introduce the Spatial-Logical Mismatch rate: the proportion of ground-truth transitions where the next correct region is not the geometrically nearest neighbor (measured by Euclidean distance between bounding box centers). This mismatch distribution peaks in the same 20%–80% interval, suggesting that the intermediate sections contain the most complex layout decisions, precisely those where simple spatial heuristics fail and where uniform supervision provides the weakest signal.
Formally, all existing methods optimize a standard cross-entropy objective that assigns equal weight to every predicted token:
$$\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{t=1}^{N} \log P(y_t \mid Y_{<t}, O, I) \quad (1)$$
The Positional Disparity analysis is correlated evidence rather than a controlled causal proof: the paper evaluates only three models (LayoutReader, PaddleOCR-VL, MinerU 2.5), all trained with standard CE on similar data distributions. Whether the Inverted-U pattern arises from the loss function specifically or from the training data distribution is not fully disentangled.
What is the novelty?
FocalOrder adds two training-time components on top of a standard generative reading order backbone (LayoutLMv3-large):
Adaptive Difficulty Discovery tracks per-bin loss using exponential moving average (EMA). The document is partitioned into $K$ bins by relative position. A global difficulty vector $\mathbf{D} \in \mathbb{R}^K$ is updated after each iteration:
$$\tilde{\mathcal{L}}_k^{(\text{iter})} = \gamma \cdot \tilde{\mathcal{L}}_k^{(\text{iter}-1)} + (1-\gamma) \cdot \mathcal{L}_{\text{batch}}^{(k)} \quad (2)$$
The momentum $\gamma = 0.99$ acts as a low-pass filter. A per-token difficulty weight $w_t$ is then derived by normalizing bin loss against the mean of $\mathbf{D}$ and clipping to $[1 - \delta, 1 + \delta]$ with $\delta = 0.8$:
$$w_t = \text{Clip}\!\left(\frac{\tilde{\mathcal{L}}_k}{\mu_{\mathbf{D}}},\; 1-\delta,\; 1+\delta\right) \quad (3)$$
These weights modulate the standard cross-entropy loss at the token level, amplifying gradients from the ambiguous document body while attenuating them at the deterministic start and end.
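A sketch of this difficulty tracker (Eqs. 2-3); $\gamma$, $K$, and $\delta$ mirror the reported values, while the position-to-bin mapping helper is an assumption:

```python
import numpy as np

class DifficultyTracker:
    def __init__(self, K=10, gamma=0.99, delta=0.8):
        self.D = np.ones(K)                  # per-bin EMA'd loss (difficulty vector)
        self.K, self.gamma, self.delta = K, gamma, delta

    def update(self, bin_losses):
        # bin_losses: (K,) mean CE loss of this iteration's tokens per bin (Eq. 2)
        self.D = self.gamma * self.D + (1 - self.gamma) * bin_losses

    def token_weights(self, positions, lengths):
        # positions/lengths: token index t and sequence length T per token
        bins = np.minimum((positions / lengths * self.K).astype(int), self.K - 1)
        w = self.D[bins] / self.D.mean()     # normalize against the mean of D
        return np.clip(w, 1 - self.delta, 1 + self.delta)  # Eq. 3: in [0.2, 1.8]

tracker = DifficultyTracker()
tracker.update(np.linspace(0.5, 2.0, 10))    # body bins lossier than the edges
print(tracker.token_weights(np.array([0, 50, 99]), np.array([100, 100, 100])))
```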
Difficulty-Calibrated Pairwise Ranking (DCPR) adds a sequence-level contrastive objective. For each generated candidate $\hat{Y}_i$, a reward is computed as the inverted normalized Levenshtein distance against the ground truth:
$$R(\hat{Y}, Y^{\ast}) = 1 - \frac{\text{Lev}(\hat{Y}, Y^{\ast})}{\max(|\hat{Y}|, |Y^{\ast}|)} \quad (4)$$
A difficulty bonus is added to form the advantage score:
$$A_i = R(\hat{Y}_i, Y_i^{\ast}) + \beta \cdot \tilde{\mathcal{L}}_{\text{CE}}^{(i)} \quad (5)$$
where $\tilde{\mathcal{L}}_{\text{CE}}^{(i)}$ is the length-normalized per-sample cross-entropy (further normalized by a running mean for scale consistency). Setting $\beta = 0.05$ ensures the reward dominates while hard-sample bonuses act as tie-breakers. Training pairs $(i, j)$ are drawn from the top and bottom $\rho = 20\%$ of samples ranked by $A_i$. The ranking loss is a hinge over sequence log-probability scores:
$$\mathcal{L}_{\text{Rank}} = \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \left[S(\hat{Y}_j) - S(\hat{Y}_i) + m_{ij}\right]_+ \quad (6)$$
The adaptive margin scales with the structural complexity of the harder sequence in the pair, forcing larger probability gaps for complex layout transitions:
$$m_{ij} = \alpha \cdot \max(w^{(i)}, w^{(j)}) \quad (7)$$
where $\alpha$ is a base scaling factor and $w^{(i)}$ is the mean token-level difficulty weight for sequence $i$.
The total objective combines both components:
$$\mathcal{L}_{\text{total}} = \sum_{t=1}^{N} w_t \cdot \mathcal{L}_{\text{CE}}^{(t)} + \lambda_{\text{Rank}} \cdot \mathcal{L}_{\text{Rank}} \quad (8)$$
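A sketch of the DCPR pieces (Eqs. 4-7) wired together; alpha is left as a free parameter since the paper does not state its value, and the candidate scores below stand in for sequence log-probabilities:

```python
import numpy as np

def reward(pred, gold, lev):
    # Eq. 4: inverted normalized Levenshtein distance against the ground truth
    return 1.0 - lev(pred, gold) / max(len(pred), len(gold))

def dcpr_loss(scores, advantages, diff_w, alpha=1.0, rho=0.2):
    # scores: S(Y_i) sequence log-probs; advantages: A_i (Eq. 5);
    # diff_w: mean token-level difficulty weight w^(i) per candidate
    order = np.argsort(advantages)
    k = max(1, int(rho * len(order)))
    bottom, top = order[:k], order[-k:]      # bottom-/top-rho by advantage
    loss = 0.0
    for i in top:                            # preferred candidates
        for j in bottom:                     # dispreferred candidates
            m = alpha * max(diff_w[i], diff_w[j])        # Eq. 7: adaptive margin
            loss += max(0.0, scores[j] - scores[i] + m)  # Eq. 6: hinge
    return loss / (len(top) * len(bottom))

adv = np.array([0.9, 0.2, 0.5, 0.7, 0.1])
print(dcpr_loss(np.array([-3.0, -9.0, -5.0, -4.0, -10.0]), adv, np.ones(5) * 1.2))
```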
What experiments were performed?
Benchmarks. All main experiments use OmniDocBench v1.0 (981 pages) and v1.5 (1,355 pages), reporting normalized edit distance (lower is better), and Comp-HRDoc (1,500 documents), reporting Reading Edit Distance Score, REDS (higher is better), separately for text and graphical regions.
Backbone. LayoutLMv3-large is used as the unified backbone encoder across all ablation conditions and the final model, enabling controlled component comparisons. The model is trained for 50 epochs with batch size 24.
Comparison baselines. On OmniDocBench v1.0, FocalOrder is compared against a wide range of systems across three categories: pipeline tools (MinerU, Marker, Mathpix, Docling, Pix2Text, Unstructured, OpenParse, PP-StructureV3), general VLMs (GPT-4o, Qwen2-VL-72B, Qwen2.5-VL-72B, Gemini-1.5 Pro, Doubao-1.5-pro, InternVL2-76B, InternVL3-78B, GOT-OCR, Nougat, Mistral OCR, OLMOCR-sglang, SmolDocling-256M), and expert VLMs (Dolphin, MinerU 2.0, OCRFlux, MonkeyOCR-pro-3B, dots.ocr, PaddleOCR-VL, MinerU 2.5).
Ablation study. A six-row progressive ablation on OmniDocBench v1.0 isolates contributions from: (1) base LayoutReader model; (2) fine-tuning; (3) category token embeddings; (4) standard preference optimization; (5) EMA fine-grained loss alone; and (6) the full FocalOrder combining group contrastive ranking with EMA. All ablation variants use the same 0.4B backbone; inference latency increases from 12.1 ms to 12.3 ms between the base model and the full FocalOrder framework (the EMA-only intermediate variant reaches 12.4 ms).
Sensitivity analysis. The number of bins $K$ is swept over $\{1, 5, 10, 20, 50\}$. The advantage weight $\beta$ is swept over $\{0.0, 0.01, 0.05, 0.1, 0.2\}$. The pair selection ratio $\rho$ is swept over $\{10\%, 20\%, 30\%\}$.
Comparison with alternative weighting strategies. FocalOrder is benchmarked against uniform supervision, a static Inverted-U Gaussian weighting baseline, and a token-level EMA weighting variant (without spatial binning), all at 0.4B parameters on OmniDocBench v1.0.
What are the outcomes/conclusions?
The authors report that FocalOrder achieves an edit distance of 0.038 (EN) and 0.055 (ZH) on OmniDocBench v1.0, the lowest in their comparison across all categories, including much larger general VLMs such as Gemini-1.5 Pro (0.049 EN) and models with dedicated document pipelines such as PaddleOCR-VL (0.045 EN) and MinerU 2.5 (0.045 EN). On OmniDocBench v1.5, the model reaches 0.044 edit distance, tying MinerU 2.5 (1.2B) and outperforming general VLMs up to 241B parameters, though PaddleOCR-VL (0.9B) achieves 0.043 in their comparison.
On Comp-HRDoc, FocalOrder achieves 97.1% REDS on text regions and 91.1% on graphical regions, a 0.1-point improvement over UniHDSA-R50 (91.0%) on graphical regions.
The Positional Disparity analysis confirms the mechanism: in the intermediate document body (20%–80%), FocalOrder reduces the average error rate from 25.99% to 10.28%, a 60.4% relative reduction. The learned difficulty weights form an Inverted-U distribution with dual peaks at 1.61 in the 40%–50% and 60%–70% bins, dropping to 0.32 at the document start, directly mirroring the ground-truth mismatch distribution.
The sensitivity analysis shows robustness: performance peaks at $K = 10$ bins but remains strong across $\{5, 10, 20\}$; $\beta = 0.05$ is optimal but the method is stable in $[0.01, 0.1]$; $\rho = 20\%$ pair selection is best. Static Inverted-U weighting (0.042) and token-level EMA (0.043) both underperform bin-level EMA within FocalOrder (0.038), confirming that the data-driven spatial binning is necessary.
Acknowledged limitations:
- FocalOrder is a downstream serialization module; errors in the upstream layout detection stage (missed or mislabeled regions) propagate without correction.
- Category embeddings are aligned to the training benchmark ontology (English and Chinese). Zero-shot transfer to documents with different semantic schemas or scripts may require re-alignment.
- The definition of “correct” reading order is subjective for artistic or highly unstructured layouts; the difficulty-aware formulation may not cover all ambiguous cases.
- Training incurs marginal overhead from pairwise ranking, though inference latency is unaffected.
Reproducibility
Models
- Backbone: LayoutLMv3-large, approximately 0.4B parameters. A standard pre-trained checkpoint is used; FocalOrder adds no new parameters to the architecture.
- Input representation: bounding box coordinates and semantic category labels encoded as a unified token sequence. Category labels are sourced from an upstream layout analysis model, not ground truth, during inference.
- No model weights are released with the paper.
Algorithms
- Optimizer: not explicitly named; initial learning rate $3 \times 10^{-5}$, linear warmup for the first 5% of steps, followed by cosine decay.
- Epochs: 50; batch size: 24.
- EMA momentum: $\gamma = 0.99$.
- Number of difficulty bins: $K = 10$.
- Clipping parameter: $\delta = 0.8$, yielding weight range $[0.2, 1.8]$.
- Advantage difficulty weight: $\beta = 0.05$.
- Pair selection ratio: $\rho = 20\%$ (top and bottom 20% by advantage).
- Margin scaling factor: $\alpha$ (value not stated in the main paper).
- Ranking loss weight: $\lambda_{\text{Rank}}$ (value not stated in the main paper).
- Training pseudocode is provided in Algorithm 1 (Appendix B.1).
Data
- Training data: a subset is included in submission supplementary materials; the full training data is not publicly released. The authors cite upload size limitations as the reason (Appendix B.3).
- No code repository is released with this paper.
- Evaluation: OmniDocBench v1.0 (981 pages, Apache-2.0), OmniDocBench v1.5 (1,355 pages), and Comp-HRDoc (1,500 documents, 1,000 train / 500 test, MIT license).
- Baselines are either reproduced from official codebases or cited from their respective papers.
- The paper was submitted to ACL 2026 (per Appendix E AI Usage Declaration). As of the arXiv submission date, it has not yet appeared in a published venue.
Evaluation
- OmniDocBench metric: normalized edit distance (Levenshtein distance divided by max sequence length); lower is better. Evaluated separately for English (EN) and Chinese (ZH) documents.
- Comp-HRDoc metric: REDS (Reading Edit Distance Score); higher is better. Evaluated separately for text regions and graphical regions.
- Positional Disparity metric: position-wise error rate derived from Levenshtein backtrace alignment, aggregated into $K = 10$ bins by relative sequence position. Defined formally in Appendix A.5.
- The paper does not report standard deviations, confidence intervals, or multiple runs.
Hardware
- Training hardware: NVIDIA RTX 4090 GPUs. Number of GPUs not specified.
- Inference latency: 12.1–12.4 ms per document on the same hardware class (not GPU-hours or throughput reported).
- No cloud cost or energy consumption estimates are provided.
BibTeX
@article{liu2026focalorder,
title={FocalOrder: Focal Preference Optimization for Reading Order Detection},
author={Liu, Fuyuan and Yu, Dianyu and Ren, He and Liu, Nayu and Kang, Xiaomian and Qiu, Delai and Zhang, Fa and Zhen, Genpeng and Liu, Shengping and Liang, Jiaen and Huang, Wei and Wang, Yining and Zhu, Junnan},
journal={arXiv preprint arXiv:2601.07483},
year={2026}
}
HELD: Extracting Variable-Depth Logical Document Hierarchy from Long Documents
TL;DR
HELD (Hierarchy Extraction from Long Document) treats logical document hierarchy recovery as sequential tree construction: a put-or-skip binary classifier decides where to insert each physical object into a growing rightmost-branch tree, handling variable-depth hierarchies up to 11 levels in documents with thousands of physical objects. On proprietary Chinese financial, English financial, and arXiv datasets it outperforms rule-based and sequence-labeling baselines by meaningful margins. The paper also proposes a stricter evaluation metric that requires the full root-to-node path to be correct, not just the depth.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is the HELD algorithm: a tree-construction procedure with a BiLSTM/CNN put-or-skip classifier, three traversal variants, optional explicit heading extraction, and an error-tolerance training strategy. The bulk of the paper ablates these design choices.
Secondary: $\Psi_{\text{Resource}}$: Three new proprietary datasets are constructed (1,030 Chinese financial, 1,203 English financial, and 1,732 arXiv documents) with multi-annotator hierarchy labels. The documents themselves are drawn from public sources (CNINFO, arXiv), but the annotated hierarchies are not released.
Secondary: $\Psi_{\text{Evaluation}}$: The paper also introduces a path-correctness metric that is strictly stronger than the depth-only measure used in prior work.
What is the motivation?
PDF and other rendered document formats preserve visual layout but discard the logical hierarchical structure present in the original editing format (e.g., Word, LaTeX). Recovering this structure is a prerequisite for hierarchical browsing, passage retrieval, high-quality information extraction, and document re-flow.
Prior approaches fall into three families, each with limitations the authors identify:
- Rule-based methods (e.g., HEPS [Manabe & Tajima, 2015]) rely on assumptions such as “headings with the same visual style occupy the same hierarchical depth.” These assumptions break down on financial disclosures and other domain-specific documents where the same numbering pattern appears at multiple levels.
- Sequence-labeling methods (e.g., TOC [Bentabet, ICDAR 2019], Rahman & Finin 2017) classify each physical object into an absolute depth label. Because the label space is fixed, they cannot generalize to unseen depths, and because they use a window of neighboring objects, they struggle with the very long distances found in hundred-page documents.
- Tree-generation methods (Pembe & Güngor, NLE 2014) build a tree node by node and capture containment/parallel relationships, but that prior work uses logistic regression and lacks a systematic study of design variants.
The HELD paper targets long documents specifically. The authors show that 90% of headings in their benchmark sit at depths 3 through 7, with a maximum of 11, which is well beyond the 4-level cap that existing datasets assume.
What is the novelty?
The central idea is to view hierarchy construction as sequential insertion into a tree, guided by a binary classifier that answers “should I put the current node here, or skip to the next candidate position?”
Rightmost-branch insertion
At any point during construction, the only positions that can receive the next physical object (to preserve the pre-order reading sequence) are the last-child slots of the nodes on the current rightmost branch. Formally, the rightmost branch is the path from the root $\phi$ to the current rightmost leaf; all possible insertion positions $S_i = \{s_j^i\}_{j=1}^{M_i}$ are the last-child positions of nodes on this branch.
Objective function
The model is trained by maximizing the log-likelihood of inserting each node $c_i$ at the correct position $s_i^{\ast}$ while minimizing it at all incorrect positions:
$$ \log P(T) = \sum_{i=1}^{N} \log P(c_i \mid \mathcal{T}_{c_1, \dots, c_{i-1}}) $$
Expanding the per-node term:
$$ \log P(c_i \mid \mathcal{T}_{c_1, \dots, c_{i-1}}) = \sum_{j=1}^{M_i} \left( \mathbf{1}(s_j^i = s_i^{\ast}) \cdot \log P(c_i \mid ctx(s_j^i)) + \mathbf{1}(s_j^i \neq s_i^{\ast}) \cdot \log\left(1 - P(c_i \mid ctx(s_j^i))\right) \right) $$
This is a binary cross-entropy objective over all candidate positions for each node: the correct position is pushed toward probability 1, every other candidate toward 0.
Put-or-skip classifier
For each candidate position $s$, the classifier estimates $P(c \mid ctx(s))$ using the local context: the immediate parent $z$ and the existing siblings $g_1, \dots, g_K$ at that position. Three Bi-LSTMs are used in sequence: one encodes the text of each element into a vector $\mathbf{v}_x$; a second combines the text representations of the candidate context into a final text vector $\mathbf{v}$; a third does the same for format features (font family, size, color, bold, italic, centering, indent) to produce $\mathbf{u}$. The combined $[\mathbf{v}; \mathbf{u}]$ is passed through a feed-forward layer and a sigmoid to produce the insertion probability.
Traversal order variants
Three strategies for which positions to probe are studied:
- Traversal-all: probe all positions, select the highest-probability one.
- Root-to-leaf: probe from root toward the leaf and return the first position whose probability exceeds 0.5.
- Leaf-to-root: same as root-to-leaf but in reverse order.
The authors prove theoretically that $N_{l2r} < N_{r2l} < N_{all}$ in terms of classifier calls, where the subscripts follow the paper’s abbreviations: l2r = leaf-to-root, r2l = root-to-leaf, all = traversal-all; $N_{\cdot}$ counts how many positions are probed under each strategy given a perfect oracle. Root-to-leaf offers a favorable accuracy-efficiency tradeoff in practice.
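A sketch of rightmost-branch insertion with root-to-leaf probing; the classifier is abstracted as a probability function (in HELD, the Bi-LSTM put-or-skip model), and the fallback when no slot clears the threshold is an assumption:

```python
class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent, self.children = label, parent, []

def rightmost_branch(root):
    # Path from the root to the current rightmost leaf: the only slots that
    # can receive the next object while preserving pre-order reading sequence.
    branch = [root]
    while branch[-1].children:
        branch.append(branch[-1].children[-1])
    return branch

def insert_root_to_leaf(root, label, prob, threshold=0.5):
    # Probe last-child slots from the root downward; take the first position
    # whose put probability exceeds the threshold (root-to-leaf traversal).
    for cand in rightmost_branch(root):
        if prob(label, cand, cand.children) > threshold:
            node = Node(label, parent=cand)
            cand.children.append(node)
            return node
    leaf = rightmost_branch(root)[-1]   # assumed fallback: deepest slot
    node = Node(label, parent=leaf)
    leaf.children.append(node)
    return node

# Toy probe: put a node under the root iff its parent slot is the root.
root = Node("ROOT")
toy_prob = lambda lbl, parent, sibs: 0.9 if parent.label == "ROOT" else 0.1
insert_root_to_leaf(root, "1. Introduction", toy_prob)
print([c.label for c in root.children])  # ['1. Introduction']
```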
Two-step explicit heading extraction
In the one-step variant, heading and non-heading objects are inserted by the same procedure. In the two-step variant, a separate heading classifier (Bi-LSTM + multi-layer CNN + self-attention) first identifies headings. The tree is then built over headings only; non-headings are appended as leaf children of the immediately preceding heading. This reduces the number of nodes the put-or-skip module must process by roughly 75%, producing an 8.3$\times$ speedup while improving accuracy.
Error-tolerance training
Because earlier nodes may be misplaced during inference, the authors augment training data by simulating random misplacements and generating additional training tuples from the resulting imperfect trees. This is most beneficial on the noisier English dataset.
Path-correctness metric
Previous work defines a node as correctly inserted when it ends up at the right depth. HELD introduces a stricter criterion: a node is correct if and only if its entire path from the root matches the ground-truth path. This matters for downstream applications such as passage retrieval, where ancestor identity, not just depth, is needed.
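A tiny sketch of the path-correctness criterion, representing each node by its root-to-node label path:

```python
def node_accuracy(pred_paths, gold_paths):
    # Each path is a tuple of labels from the root down to the node;
    # a node is correct only if its entire path matches the ground truth.
    gold = set(gold_paths)
    return sum(p in gold for p in pred_paths) / max(len(pred_paths), 1)

gold = [("ROOT", "1. Intro"), ("ROOT", "1. Intro", "1.1 Scope")]
pred = [("ROOT", "1. Intro"), ("ROOT", "2. Background", "1.1 Scope")]
# The second node sits at the correct depth but under the wrong ancestor,
# so depth-only scoring would accept it while path-correctness rejects it.
print(node_accuracy(pred, gold))  # 0.5
```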
What experiments were performed?
Datasets
Three proprietary datasets are constructed from publicly downloadable documents:
- Chinese dataset: 1,030 prospectuses and annual reports from CNINFO (China Securities Exchange); split 830/100/100.
- English dataset: 1,203 annual reports from Hong Kong Exchange; split 1,003/100/100.
- arXiv dataset: 1,732 scientific papers from arXiv; split 1,432/150/150.
Documents have at least 500 physical objects, and hierarchies reach up to 11 levels. Each document was annotated by at least two annotators with conflict resolution by a senior annotator.
Baselines
- HEPS (Manabe & Tajima, 2015): rule-based, adapted to PDF (web-page features removed).
- TOC (Bentabet, ICDAR 2019): sequence-labeling with CNN replacing LSTM for the longer documents.
- Pembe’s model (Pembe & Güngor, NLE 2014): tree-generation with logistic regression.
Evaluation
Node accuracy (fraction of nodes with a fully correct root-to-node path) and per-level $F1_k$ are the primary measures. Efficiency is tracked via average processing time per document and the number of put-or-skip classifier calls (#inquiry).
Downstream evaluation uses passage retrieval on 110 Chinese IPO prospectuses with 138 queries (88 train / 50 test), measured by mAP and recall@k. A GBDT ranker is trained on BM25 plus four hierarchy-derived features: BM25AncMax (max BM25 over ancestor headings), SameWordAnc (word overlap with ancestor text), Pos (absolute sibling position), and PosRatio (relative sibling position).
Hyperparameters
- Character embeddings: skip-gram, 24 dimensions.
- Heading recognition network: 9-layer Network-in-Network with kernel sizes 5-1-5-5-1-5-5-1-5 and increasing channel counts (128 to 512).
- Put-or-skip Bi-LSTMs: hidden dimensions of 128 ($\mathbf{v}_x$, per-element text), 512 ($\mathbf{u}_T$, text context), and 64 ($\mathbf{u}_F$, format).
- Optimizer: Adam, learning rate 0.00005, mini-batch size 128.
- Hardware: 2 NVIDIA GTX 1080 Ti GPUs with Horovod distributed training.
- Beam size: 1 (greedy search; beam size 3 adds roughly 10$\times$ cost with minimal accuracy change: +0.0005 on Chinese but -0.0026 on English).
What are the outcomes/conclusions?
Accuracy vs. baselines
Using the path-correctness metric, the best HELD variant (2step, traversal-all, greedy) achieves:
- Chinese: 0.9731 vs. 0.3764 (HEPS), 0.9403 (TOC), 0.9339 (Pembe)
- English: 0.7301 vs. 0.4779 (HEPS), 0.6436 (TOC), 0.6563 (Pembe)
- arXiv: 0.9578 vs. 0.8375 (HEPS), 0.8908 (TOC), 0.9034 (Pembe)
The English dataset is hardest because visual/textual cues are more ambiguous (many headings match no numbering regex, and heading depth is less predictable from format alone).
Traversal order
Root-to-leaf and traversal-all achieve nearly identical accuracy. Root-to-leaf is 1.2$\times$ faster; leaf-to-root is 2.4$\times$ faster but noticeably less accurate. The paper recommends root-to-leaf for general use.
Two-step vs. one-step
Explicit heading extraction increases accuracy by 1.5 to 11 percentage points and cuts inference time by a factor of 8.3 by reducing the number of nodes the put-or-skip module handles.
Feature ablation (put-or-skip)
Removing sibling features hurts more than removing parent features, especially on English (accuracy drops from 0.7301 to 0.6445 without siblings, vs. 0.7007 without the parent). Sibling context carries strong cues because consecutive headings at the same level share formatting and numbering patterns.
PDFLux noise experiment
When the commercial PDFLux layout recognizer (F1 approximately 0.97 on physical object detection) is used instead of ground-truth physical objects, overall accuracy drops by 1.5 percentage points on Chinese and 0.6 on English. The hierarchy extraction is reasonably robust to layout recognition noise at this quality level.
Passage retrieval
Combining all four hierarchy features with BM25 raises mAP from 0.149 to 0.338 and recall@1 from 0.083 to 0.218 on the Chinese prospectus subset. BM25AncMax and position features contribute the most individually.
Limitations
The authors acknowledge that the greedy construction process creates cascading errors: if a high-level heading is placed incorrectly, all of its descendants inherit the error. The proposed path-correctness metric makes this failure mode visible (unlike depth-only metrics), and the authors suggest Monte Carlo Tree Search as a future remedy. The datasets are also not publicly released, limiting independent replication.
Reproducibility
Models
- The put-or-skip classifier is a stack of three Bi-LSTMs with hidden dimensions 128, 512, and 64, combined with a feed-forward layer and sigmoid output. The heading classifier uses a Bi-LSTM followed by 9 CNN layers (Network-in-Network) and a self-attention layer.
- No pretrained checkpoints are released. The paper does not describe any pretrained language model backbone; text is encoded from scratch using character embeddings.
- Character embeddings are trained with skip-gram, dimension 24.
Algorithms
- Training: Adam optimizer, learning rate 0.00005, batch size 128. The number of training epochs or steps is not reported.
- Error-tolerance augmentation: random misplacements are simulated during training data generation to expose the model to imperfect predecessor insertions.
- Inference: greedy (beam size 1) with root-to-leaf traversal for the recommended configuration.
- No mixed precision, gradient clipping, or learning rate schedule details are reported.
Data
- All three datasets are annotated by the authors from publicly downloadable documents (CNINFO and arXiv). The annotations are not released.
- Multi-annotator pipeline: at least two annotators per document; a senior annotator resolves conflicts.
- Physical object detection uses the commercial PDFLux tool, not a released open-source system.
- No data augmentation beyond the error-tolerance training examples is described.
Evaluation
- Primary metric is node accuracy under the path-correctness definition: a node is correct iff its full root-to-node path matches ground truth. Per-level $F1_k$ values are also reported.
- Efficiency is measured as average wall-clock seconds per document and #inquiry (put-or-skip calls).
- Passage retrieval uses mAP and recall@k (k = 1, 5, 10) with GBDT ranking.
- No error bars or multi-run statistics are reported for any result.
Hardware
- Training: 2 NVIDIA GTX 1080 Ti GPUs, Horovod for distributed parameter updates. Total GPU-hours are not reported.
- Inference: the recommended 2step root-to-leaf configuration processes a document in approximately 36 seconds on the training hardware; traversal-all takes 42.58 seconds. These are wall-clock times, not normalized to a standard GPU.
- No cloud cost or energy consumption estimates.
BibTeX
@article{cao2022held,
title={Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application},
author={Cao, Rongyu and Cao, Yixuan and Zhou, Ganbin and Luo, Ping},
journal={Journal of Computer Science and Technology},
volume={37},
pages={699--718},
year={2022},
doi={10.1007/s11390-021-1076-7}
}
M5HisDoc: A Multi-Style Chinese Historical Document Analysis Benchmark
TL;DR
M5HisDoc is an 8,000-image Chinese historical document benchmark with five tasks: text line detection, text line recognition, character detection, character recognition, and reading order prediction. It is the first multi-style benchmark for this domain, spanning 20+ layout types, 20+ document types, and 16,151 character categories. Two subsets are provided: M5HisDoc-R (regular, manual annotation) and M5HisDoc-H (hard, with added rotation, distortion, and resolution reduction). Existing reading order methods designed for modern documents generalize poorly to historical Chinese documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The headline contribution is the M5HisDoc benchmark dataset, covering a previously underserved domain (Chinese historical documents) with multi-style diversity and five distinct task annotations. Most of the paper describes data collection, annotation methodology, and benchmark statistics.
Secondary: $\Psi_{\text{Evaluation}}$: Comprehensive baselines are established across all five tasks, exposing systematic weaknesses of modern-document methods when applied to historical Chinese documents, including for reading order prediction.
What is the motivation?
Existing Chinese historical document benchmarks share three limitations:
- Single document type. MTHv2 covers only Buddhist scriptures; ICDAR 2019 HDRC covers only genealogies. Real historical collections span medical texts, anthologies, dictionaries, government records, and more.
- Limited layout variety. Prior datasets cover fewer than 10 layout types; M5HisDoc covers more than 20.
- No dedicated reading order study. Prior reading order work (LayoutReader, Augmented XY Cut) targets modern documents with Western left-to-right reading flows. Chinese historical documents read right-to-left, top-to-bottom in vertical columns, with complex layouts that are poorly handled by these methods.
What is the novelty?
Dataset Construction
Data sources:
- 300 images from MTHv2 training set and 700 from SCUT-CAB, selected for dense text or complex layout.
- 2,799 images from 131 ancient books sourced from electronic archives (Harvard Library, National Archives of Japan, etc.).
- 201 images captured by scanning four physical Chinese ancient books under varied scanning angles and lighting.
Total: 4,000 manually annotated images (M5HisDoc-R).
Annotation process: A character detector and recognizer pre-trained on MTHv2 and CASIA-AHCDB provided preliminary annotations. Annotators then corrected the results in three steps: bounding box correction; two-round character recognition proofreading (two annotators plus one domain expert); and text line grouping with reading order sorting. A three-round quality check followed. Total annotation effort: ~4,000 person-hours.
Hard subset generation (M5HisDoc-H):
- 834 images with white backgrounds randomly rotated $\pm 5^\circ$ to $\pm 15^\circ$.
- 50% of images randomly distorted via DewarpNet to simulate real-world document warping.
- 2,041 images with large character boxes scaled down (minimum edge 18 pixels).
Both subsets split 2:1:1 (train:val:test) for fair evaluation.
Dataset statistics:
- 8,000 images, 50.48 text lines per image, 545.92 characters per image.
- 16,151 character categories (at least 1.6$\times$ more than prior datasets).
- Zero-shot recognition scenario: 1,853 (val) and 1,831 (test) character categories unseen in training.
- Aspect ratios from under 0.4 to over 1.8; text line lengths from under 6 to over 58 characters.
What experiments were performed?
Task 1: Text Line Detection
Three method families: regression-based (Mask R-CNN, Cascade R-CNN, OBD), segmentation-based (PSENet, PAN, FCENet, DBNet++), connected component-based (TextSnake). Metrics: precision, recall, F1 at IoU thresholds 0.5, 0.6, 0.7; end-to-end 1-NED.
Key findings: segmentation-based methods degrade sharply at higher IoU thresholds due to long texts and dense packing. Rotation and distortion in M5HisDoc-H substantially reduce recall for regression-based methods (RPN proposals remain horizontal). TextSnake achieves the best F1 at IoU 0.5 (90.85/90.94 on H/R).
Task 2: Text Line Recognition
Methods: CRNN, Ma et al., ZCTRN (CTC-based); ASTER, NRTR, Robust Scanner (attention-based); Peng et al. (segmentation-based). Metrics: correct rate (CR) and accurate rate (AR).
Key findings: attention-based methods underperform on this task due to the large character category count (16,151) and long text lines causing attention drift. CTC-based methods are more robust. Best overall: Ma et al. at 91.29/92.29 CR (H/R).
Task 3: Character Detection
Faster R-CNN, YOLOv3, YOLOX. Metrics: precision, recall, F1 at IoU 0.5/0.6/0.7; end-to-end top-1 accuracy with a SwinTransformer recognizer.
Key findings: character-level methods are substantially more robust to M5HisDoc-H’s rotation and distortion than text-line-level methods, because individual character scale variations are smaller. All models achieve F1 above 96% at IoU 0.5.
Task 4: Character Recognition
ResNet50, RegNet, ConvNeXt (CNN-based); ViT, SwinTransformer (ViT-based). Metrics: top-1, top-5, macro accuracy.
Key findings: top-1 accuracy is high (94-95%), but macro accuracy is substantially lower (65-75%), reflecting long-tail distribution. Similar-morphology characters are the primary error source.
Task 5: Reading Order Prediction
Three methods evaluated on M5HisDoc using ARD (Average Relative Distance; lower is better):
| Type | Method | ARD (H/R) |
|---|---|---|
| Rule-based | Heuristic (right-to-left, top-to-bottom) | 5.27 / 5.20 |
| Rule-based | Augmented XY Cut | 14.63 / 12.60 |
| Learning-based | LayoutReader (layout only) | 8.68 / 5.49 |
The authors’ simple heuristic (center coordinates sorted right-to-left, top-to-bottom, reflecting the dominant Chinese historical reading direction) outperforms both Augmented XY Cut and LayoutReader on M5HisDoc-R, and achieves the best result on M5HisDoc-H. LayoutReader, trained on modern born-digital documents, performs comparably to the heuristic on M5HisDoc-R but degrades on M5HisDoc-H due to distorted bounding boxes (horizontal minimum enclosing rectangles of distorted text overlap significantly, breaking the spatial assumptions the model learned).
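The paper gives its heuristic only in an appendix; the sketch below is our own reconstruction of a center-coordinate sort for vertical right-to-left layouts, with the box format and the column tolerance `col_tol` as assumptions:

```python
import statistics

def reading_order_rtl(boxes, col_tol=0.5):
    """Approximate right-to-left, top-to-bottom reading order for vertical
    Chinese historical layouts. boxes: list of (x0, y0, x1, y1) rectangles.
    Returns indices into `boxes` in estimated reading order."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    med_w = statistics.median(x1 - x0 for x0, _, x1, _ in boxes) or 1.0
    # Right-to-left over x-centers; near-equal x-centers form one column.
    order = sorted(range(len(boxes)), key=lambda i: -centers[i][0])
    result, column = [], []
    for i in order:
        if column and centers[column[0]][0] - centers[i][0] > col_tol * med_w:
            result.extend(sorted(column, key=lambda j: centers[j][1]))  # top-to-bottom
            column = []
        column.append(i)
    result.extend(sorted(column, key=lambda j: centers[j][1]))
    return result
```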
Cross-Validation
Models trained on MTHv2 and tested on M5HisDoc show substantial degradation, confirming the style gap between homogeneous (single-type) historical benchmarks and M5HisDoc’s multi-style distribution.
What are the outcomes/conclusions?
- Modern reading order methods do not transfer to Chinese historical documents. The simple domain-aware heuristic (right-to-left, top-to-bottom) outperforms LayoutReader and Augmented XY Cut, both of which were designed for modern left-to-right documents. The benchmark makes this gap quantifiable for the first time in a controlled setting.
- M5HisDoc-H is substantially harder than M5HisDoc-R for reading order. Distortion and rotation cause horizontal bounding boxes to overlap, breaking spatial assumptions. ARD degrades from 5.49 to 8.68 for LayoutReader and from 12.60 to 14.63 for Augmented XY Cut.
- Character-level analysis is more robust to image degradation than text-line-level analysis. For detection and recognition tasks, character-level methods degrade less on M5HisDoc-H than text-line-level methods.
- Zero-shot character recognition remains an open problem. Macro accuracy (65-75%) lags far behind top-1 accuracy (94-95%), reflecting the long-tail and similar-morphology challenges.
Limitations
- The dataset requires application and agreement before access (CC-BY-NC-ND-4.0); redistribution and derivatives are restricted. Commercial use is not permitted.
- The reading order baselines are limited: only two rule-based methods and LayoutReader. No method specifically designed for vertical right-to-left reading is evaluated beyond the authors’ heuristic.
- Horizontal minimum enclosing rectangles are used as bounding box inputs to reading order methods; polygon-aware evaluation is not reported.
- No cross-lingual or non-Chinese historical document evaluation.
- Annotation of reading order uses manual sorting by annotators; no inter-annotator agreement metric is reported for reading order specifically.
Reproducibility
Models
No new model weights released. All baselines use publicly available architectures and standard training protocols.
Algorithms
- Text line detection: AdamW, lr $10^{-4}$, 160 epochs, batch size 8; LR decays $\times 0.1$ at epochs 80 and 128; input long side 1,333.
- Text line recognition: AdamW, lr $4 \times 10^{-4}$ with CosineAnnealing to $4 \times 10^{-6}$, 50 epochs, batch size 32; TPS preprocessing; height normalized to 96.
- Character recognition: AdamW, lr $10^{-3}$ with CosineAnnealing to $10^{-6}$, 90 epochs, batch size 1,024; RandAugment; 96$\times$96 input.
- Reading order: heuristic described in Appendix; LayoutReader evaluated layout-only using horizontal bounding boxes as input.
Data
- 8,000 images (4,000 regular + 4,000 hard) with character-level and text-line-level bounding boxes, text content, and reading order.
- Released at github.com/HCIILAB/M5HisDoc under CC-BY-NC-ND-4.0 with application required.
- 2:1:1 train/val/test split; zero-shot character categories in val/test.
Evaluation
- Text detection: precision, recall, F1 at IoU 0.5/0.6/0.7.
- Text recognition: CR, AR.
- Character recognition: top-1, top-5, macro accuracy.
- Reading order: ARD (Average Relative Distance).
- No error bars or significance tests reported.
Hardware
Not reported.
BibTeX
@inproceedings{shi2023m5hisdoc,
title={M$^5$HisDoc: A Large-scale Multi-style Chinese Historical Document Analysis Benchmark},
author={Shi, Yongxin and Liu, Chongyu and Peng, Dezhi and Jian, Cheng and Huang, Jiarong and Jin, Lianwen},
booktitle={37th Conference on Neural Information Processing Systems, Datasets and Benchmarks Track},
year={2023}
}
MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
TL;DR
MosaicDoc is a 72K-image bilingual (Chinese and English) benchmark sourced from newspapers and magazines, annotated with block- and line-level reading order, OCR, DocVQA, and content-aware localization. It is constructed using DocWeaver, a multi-agent LLM pipeline that computes correct reading order explicitly for non-Manhattan and multi-column layouts. Evaluation of 13 models reveals that all current models, including Gemini-2.5, fail to correctly order text across columns in complex layouts.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The headline contribution is the MosaicDoc benchmark dataset itself, along with the DocWeaver pipeline used to construct it. The paper devotes most of its space to data collection, annotation methodology, quality assurance, and dataset statistics.
Secondary: $\Psi_{\text{Evaluation}}$: The benchmark evaluation of 13 models establishes baseline performance and identifies systematic weaknesses, with meaningful diagnostic findings on multi-span reasoning, dense OCR, and multi-column reading order.
What is the motivation?
Existing VRDU benchmarks have three converging weaknesses:
- English-centric and visually simple. Benchmarks like VisualMRC and Docmatix rely on simple single-column documents. Chinese datasets like XFUND and DuReadervis replicate this simplicity.
- No reading order annotations. Models trained without reading order supervision default to left-to-right, top-to-bottom heuristics, which fail on multi-column and non-Manhattan layouts.
- Single-task. Prior reading order datasets (ReadingBank, ROOR) lack VQA annotations; VQA datasets lack reading order ground truth.
The authors argue this gap is particularly severe for newspapers and magazines, which feature irregular column layouts, cross-page articles, and dense bilingual text.
What is the novelty?
DocWeaver: Multi-Agent Annotation Pipeline
DocWeaver automates benchmark construction in three phases:
Phase 1: Document Decomposition and Structuring
- PDF parsing via PyMuPDF and pdfminer.six; encoding recovery via chardet.
- Layout detection using PP-DocLayout fine-tuned on the M$^3$Doc magazine dataset (13 element types).
- Multi-engine OCR with weighted voting for the final text (a sketch follows this list):
$$ \hat{s} = \arg\max_{t \in \mathcal{T}} \sum_{i=1}^{N} w_i p_i \mathbf{1}[s_i = t] $$
where $\mathcal{T}$ is the set of candidate strings, $s_i$ is the string output by engine $i$, $w_i$ is that engine's reliability weight, and $p_i$ is its confidence score.
- Table structure via TableStructureRec (HTML output).
- Cross-page linking via DeepSeek semantic similarity (threshold 0.8, up to 4 pages merged).
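A minimal sketch of the weighted-voting rule above, with illustrative weights and confidences (the paper does not publish its engine weights):

```python
from collections import defaultdict

def vote_ocr(candidates):
    """Weighted-vote consensus over per-engine OCR outputs.
    candidates: list of (s_i, w_i, p_i) triples mirroring the formula above."""
    scores = defaultdict(float)
    for text, weight, confidence in candidates:
        scores[text] += weight * confidence  # accumulates w_i * p_i for s_i = t
    return max(scores, key=scores.get)

# Two engines agree on one string; a third disagrees with higher confidence:
best = vote_ocr([("Daily Times", 0.9, 0.80),
                 ("Dai1y Times", 0.6, 0.95),
                 ("Daily Times", 0.7, 0.70)])
```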
Reading Order Modeling
Two strategies depending on data availability:
- Structured data (magazines, Chinese newspapers with HTML): fuzzy matching of PDF text spans to HTML template, constrained by bounding box and contextual proximity. Reading order derived from the validated HTML sequence.
- Unstructured data (English newspapers): binarization-based layout line detection partitions pages into rectangular article blocks; within each block, line-level reading order is inferred from PDF metadata (font features, positional cues) and semantic coherence.
Pages with more than five detected reading order errors (by distance-based heuristics) are discarded entirely.
Phase 2: QA Generation
GPT-4o and DeepSeek-R1 generate QA pairs using specialized prompts for text, tables, and charts. QWQ-32B serves as an auxiliary discriminator to promote multi-span questions requiring cross-sentence reasoning.
Phase 3: Quality Assurance
Five-criteria hallucination guardrail scored by QWQ (Completeness, Consistency, Conciseness, Clarity, Inference). Only pairs passing a high threshold on all criteria are retained. Retention rates: 87.7-89.8% across subsets (Table 2). A 200-image human-validated gold subset is used for all reported evaluations.
MosaicDoc Benchmark
- Scale: 72.3K images (42.7K magazines, 29.6K newspapers), 620K+ QA pairs.
- Languages: Chinese and English from 196 publishers across 24 domains.
- Tasks: DocVQA (single-span, multi-span, table, chart), word- and line-level OCR, block- and line-level reading order prediction, content-aware localization.
- Layout complexity: Average tokens per document exceeds prior VQA datasets by a wide margin (magazines: ~1,075; newspapers: ~3,558 vs. ReadingBank ~315). Low sentence-level BLEU scores versus left-to-right baselines confirm layout complexity.
What experiments were performed?
Models Evaluated
13 models in three tiers:
- Expert Models: Donut (253M), ViTLP (259M), LayoutReader (line-level bbox input only).
- Expert VLMs: Vary-7B, TextMonkey-7B, mPLUG-DocOwl2-7B, GOT-OCR-0.5B, olmOCR-8B.
- General VLMs: CogVLM2-19B, InternVL3-9B, Qwen2.5-VL-7B, GPT-4o (API), Gemini-2.5 (API).
All evaluated zero-shot on A100 GPUs (80GB) via vLLM at maximum supported resolution.
DocVQA (ANLSL Metric)
Best overall performance: Gemini-2.5 (61.3/68.8 Mag., 62.5/59.8 News. en/zh). All models drop sharply on multi-span questions; even Gemini-2.5 sees a ~10-point decline from single-span to multi-span. Expert VLMs with token reduction (TextMonkey, mPLUG-DocOwl2) collapse on dense layouts.
Page-Level OCR (CRR / OCRR)
Gemini-2.5 dominates (89.4/87.3 CRR for magazines en/zh; 87.9/66.6 for newspapers). All models degrade significantly on Chinese newspapers. A common failure mode: models enter repetitive generation after processing a fraction of a dense page, collapsing CRR.
Reading Order Prediction (Micro-F1)
Evaluated on text line sequences. Results from Table 5:
| Model | Mag. F1 (en/zh) | News. F1 (en/zh) |
|---|---|---|
| LayoutReader | – / – | – / – |
| olmOCR | 81.9 / 71.4 | 33.8 / 0.71 |
| InternVL3 | 72.2 / 76.6 | 63.9 / 41.2 |
| Qwen2.5-VL | 65.4 / 75.7 | 50.9 / 57.2 |
| Gemini-2.5 | 87.9 / 89.4 | 87.9 / 56.3 |
All models exhibit high precision but low recall: they order lines correctly within a column but fail to cover the full page or to bridge across columns. Gemini-2.5 achieves the best results but still fails to establish correct cross-column ordering in multi-column newspaper layouts.
What are the outcomes/conclusions?
- Complex layouts break current models. All 13 models degrade substantially on newspaper layouts relative to magazine layouts, with Chinese newspaper OCR near zero for several models.
- Multi-span reasoning is a universal weakness. No current model handles multi-span QA reliably on dense documents.
- Reading order recovery fails at the column boundary. High-precision/low-recall patterns indicate models read within columns correctly but cannot establish inter-column reading order. This directly degrades OCR and VQA tasks that depend on correct full-page sequencing.
- General VLMs outperform specialized Expert VLMs on MosaicDoc, suggesting that domain-specific pre-training on simpler documents does not transfer to complex layouts.
Limitations
- The human-validated gold set is only 200 images; the full 72K set uses automated annotations whose error rate is not independently characterized.
- Reading order ground truth for unstructured sources (English newspapers) is derived via heuristics; no inter-annotator agreement study is reported.
- The paper is accepted at AAAI 2026 but currently available only as a preprint; the full dataset release (promised “as soon as possible”) was pending at submission time.
- No reported training set; MosaicDoc is an evaluation-only benchmark. Training data for reading order on complex layouts remains an open gap.
Reproducibility
Models
No new model weights introduced. All evaluated models are existing releases.
Algorithms
DocWeaver pipeline code at github.com/DOCLAB-SCUT/MosaicDoc (Apache-2.0). Full release pending as of arXiv submission.
Data
- Source: Official public websites, content aggregation platforms, open-source PDF repositories (compliance details in Appendix D of the paper).
- 196 publishers, 24 domains, Chinese + English.
- Gold evaluation subset: 200 human-validated images. Full dataset release pending.
- Reading order annotations: HTML-derived for structured sources; heuristic for unstructured. No per-source license breakdown provided in the main text.
Evaluation
- DocVQA: ANLSL (Average Normalized Levenshtein Similarity for Lists).
- OCR: CRR (ground-truth normalized), OCRR (output-normalized precision); see the sketch after this list.
- ROP: Micro-F1 on text line sequences.
- Zero-shot evaluation throughout; no fine-tuning baselines.
- No error bars or significance tests.
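Based on the parentheticals above, a sketch of the two OCR scores as edit-distance rates normalized by ground-truth length (CRR) and by output length (OCRR); the paper's exact definitions may differ in detail:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def crr(pred: str, gt: str) -> float:
    # Ground-truth-normalized, recall-style score.
    return max(0.0, 1 - levenshtein(pred, gt) / max(len(gt), 1))

def ocrr(pred: str, gt: str) -> float:
    # Output-normalized, precision-style score.
    return max(0.0, 1 - levenshtein(pred, gt) / max(len(pred), 1))
```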
Hardware
- Evaluation: NVIDIA A100 80GB via vLLM framework.
- Training hardware: not applicable (evaluation-only benchmark paper).
BibTeX
@article{chen2025mosaicdoc,
title={MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding},
author={Chen, Ketong and Chen, Yuhao and Xue, Yang},
journal={arXiv preprint arXiv:2511.09919},
year={2025}
}
PharmaShip: A Chinese Pharmaceutical Shipping Document Benchmark with Reading Order Supervision
TL;DR
PharmaShip is a 161-document Chinese pharmaceutical shipping dataset with ISDR-style reading order graph annotations alongside SER and entity linking tasks. It extends the entity-centric annotation paradigm of EC-FUNSD to the pharmaceutical logistics domain, where dense tabular layouts and long-form documents stress-test proximity-based layout heuristics. Five layout-aware baselines are benchmarked, and reading-order-enhanced variants (RORE) consistently improve SER and EL, with the largest benefit on entity linking.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The headline contribution is the PharmaShip dataset, the first publicly available benchmark for pharmaceutical shipping documents with reading order graph annotations. Dataset construction, annotation methodology, and statistics occupy the majority of the paper.
Secondary: $\Psi_{\text{Evaluation}}$: The controlled benchmark evaluation of five models on SER, EL, and ROP tasks provides diagnostic findings about the complementary roles of pixel features, geometric encoding, and reading order regularization.
What is the motivation?
Existing VrDU benchmarks (FUNSD, CORD, SROIE) target general administrative forms and consumer receipts. They share two limitations that make them unsuitable for pharmaceutical shipping documents:
- Segment-entity coupling. Block-level annotations implicitly tie semantic labels to visual regions, encouraging models to rely on spatial proximity heuristics that fail when visually adjacent segments belong to different semantic categories (e.g., drug name adjacent to batch number in a dense table).
- No reading order supervision. Without reading order annotations, models default to top-to-bottom, left-to-right assumptions. This fails on multi-column tables and field groups where the reading path is non-trivial.
PharmaShip targets Chinese pharmaceutical circulation documents: delivery notes, shipment forms, and outbound sales slips. These documents have heterogeneous templates, dense tabular layouts, and long-form content frequently exceeding the default 512-token limit of standard PTLMs.
What is the novelty?
Dataset: PharmaShip
161 annotated scanned documents sourced from multiple pharmaceutical enterprises in China. Each document is provided in JSON format with:
- SER annotations: Token-level semantic entity labels (header, question, answer, other), following the entity-centric paradigm of EC-FUNSD.
- EL annotations: Directed relation triples over recognized entities $(e_i, e_j) \mapsto r_{ij} \in \mathcal{R}$.
- ROP annotations: ISDR reading order links as a directed acyclic graph over segments, following the formulation of ROOR (Zhang et al., 2024):
$$ f_{ROP}: (s_i, s_j) \mapsto p_{ij} \in \mathcal{P} $$
where $\mathcal{P}$ is the set of immediate succession links.
PharmaShip is substantially denser than ROOR on a per-document basis: 70.16 segments per sample versus 53.57 for ROOR, 415.91 words per sample versus 157.27, and ~108 reading order relations per sample versus ~55.
Key Design Choices
Entity-centric annotation: Semantic labels are assigned independently of visual block boundaries, decoupling segment identity from layout proximity. This removes the “false coupling” that inflates scores on FUNSD-style benchmarks.
Positional embedding extension: Because pharmaceutical documents routinely exceed 512 tokens, positional embeddings are expanded from 512 to 2048 for all three backbones. Newly added positions are initialized according to each repository's default scheme. This prevents truncation of late-page entities and long-range relations.
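A hedged PyTorch sketch of the 512-to-2048 extension; the function name and the normal init for the new rows are our assumptions standing in for the per-repository defaults:

```python
import torch
import torch.nn as nn

def extend_position_embeddings(old_emb: nn.Embedding, new_max: int = 2048) -> nn.Embedding:
    """Grow a learned positional embedding table, keeping the trained rows and
    initializing the new ones (here: normal init matched to the old rows'
    standard deviation, a stand-in for each repository's default)."""
    old_max, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_max, dim)
    with torch.no_grad():
        new_emb.weight[:old_max] = old_emb.weight
        new_emb.weight[old_max:].normal_(std=old_emb.weight.std().item())
    return new_emb
```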
Class-imbalance correction for EL: The public ROOR implementation over-samples the “no-relation” class. The authors rebalance positive/negative pairs and apply class-weighted optimization to stabilize gradients.
What experiments were performed?
Baselines
Five models spanning pixel-aware and geometry-aware families:
| Model | Params | Notes |
|---|---|---|
| LiLT[InfoXLM] | base | Pixel-free; layout via box/position interactions |
| LayoutLMv3-base-chinese | 133M | Pixel-aware joint token+image pre-training |
| RORE-LayoutLMv3 | 133M+12 | LayoutLMv3 + reading order relation injection |
| GeoLayoutLM | 399M | Explicit multi-level geometric encodings |
| RORE-GeoLayoutLM | 399M+24 | GeoLayoutLM + reading order relation injection |
All use Chinese pre-trained checkpoints, identical OCR preprocessing and bounding boxes, and the same page renderings. Fine-tuned up to 500 epochs with early stopping (50-epoch patience).
Results on PharmaShip (Table III)
SER:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| LiLT | 73.15 | 78.77 | 75.85 |
| LayoutLMv3 | 81.06 | 86.08 | 83.49 |
| RORE-LayoutLMv3 | 82.13 | 86.08 | 84.06 (+0.57) |
| GeoLayoutLM | 77.71 | 83.91 | 80.69 |
| RORE-GeoLayoutLM | 78.81 | 85.21 | 81.88 (+1.19) |
EL:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| LiLT | 51.22 | 82.35 | 63.16 |
| LayoutLMv3 | 67.49 | 90.32 | 77.25 |
| RORE-LayoutLMv3 | 68.81 | 91.45 | 78.53 (+1.28) |
| GeoLayoutLM | 77.91 | 73.28 | 75.52 |
| RORE-GeoLayoutLM | 78.42 | 74.79 | 76.56 (+1.04) |
ROP:
| Model | Precision | Recall | F1 |
|---|---|---|---|
| LayoutLMv3 (word-level) | 81.91 | 96.20 | 88.48 |
| LayoutLMv3 (segment-level) | 62.64 | 81.51 | 70.84 |
Cross-dataset results (Table II) show that the same models score 5-9 F1 points lower on PharmaShip than on FUNSD/CORD/SROIE, confirming that entity-centric annotation and layout complexity expose weaknesses masked by simpler benchmarks.
What are the outcomes/conclusions?
- RORE consistently improves SER and EL. Reading order injection provides +0.57 to +1.28 F1 gains on PharmaShip, with the largest benefit for EL (entity linking relies more on logical ordering than local entity semantics).
- Pixel features and geometry are complementary. LayoutLMv3 (pixel-aware) leads on SER; GeoLayoutLM (geometry-aware) has higher EL precision. Neither alone is sufficient.
- Word-level ROP is feasible; segment-level ROP is harder. LayoutLMv3 achieves 88.48 F1 at word-level but only 70.84 at segment-level, with boundary ambiguity and long-range crossings in dense tables driving the gap.
- Positional embedding extension matters. Expanding from 512 to 2048 positions prevents truncation of late-page content and stabilizes performance on long pharmaceutical forms.
- Entity-centric annotation reveals layout heuristic fragility. Decoupling segments from semantic entities removes the block-level bias that inflates scores on simpler benchmarks; PharmaShip scores are consistently lower, reflecting real task difficulty.
Limitations
- Dataset is small (161 documents). Generalization of benchmark conclusions to other pharmaceutical document types or other institutions is not validated.
- All documents are Chinese; no multilingual or non-Chinese evaluation.
- Only LayoutLMv3 is tested for ROP; no comparison against dedicated ROP models (e.g., LayoutReader, the ROOR baseline directly).
- No hardware or training time details reported.
- CC-BY-NC-4.0 license restricts commercial use.
Reproducibility
Models
- LiLT[InfoXLM], LayoutLMv3-base-chinese, GeoLayoutLM: official Chinese pre-trained checkpoints used as-is.
- RORE variants: ROOR framework applied using official ROOR implementations with corrected class weighting.
- No new model weights released.
Algorithms
- AdamW, weight decay $10^{-2}$, learning rate $10^{-4}$, batch size 16, 2% linear warm-up, linear decay (for GeoLayoutLM EL; others follow Huang et al. 2022 schedules).
- Up to 500 epochs, early stopping at 50-epoch patience on validation F1.
- Positional embeddings extended from 512 to 2048; dynamic padding within mini-batches.
- Class-imbalance rebalancing for EL positive/negative pairs.
Data
- 161 annotated scanned documents from Chinese pharmaceutical enterprises.
- Released at github.com/KevinYuLei/PharmaShip under CC-BY-NC-4.0.
- Train/validation/test splits follow EC-FUNSD conventions; exact split sizes not stated in the paper.
- Annotation process: not described in detail; no inter-annotator agreement reported.
Evaluation
- SER and EL: F1 (precision and recall also reported).
- ROP: F1 at word-level and segment-level separately.
- Cross-dataset evaluation on FUNSD, CORD, SROIE, ROOR for context.
- No error bars or significance tests.
Hardware
Not reported.
BibTeX
@article{xie2025pharmaship,
title={PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents},
author={Xie, Tingwei and Zhou, Tianyi and Song, Yonghong},
journal={arXiv preprint arXiv:2512.23714},
year={2025}
}
Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction
TL;DR
BIO-tagging for named entity recognition assumes that entity tokens appear in a clean, contiguous, front-to-back order in the model input. On scanned documents, OCR-driven input order routinely violates this assumption. Token Path Prediction (TPP) avoids the assumption entirely by modeling entities as directed paths in a complete token graph, making it robust to disordered OCR inputs without requiring any pre-computed reading order. The authors also release FUNSD-r and CORD-r, revised versions of popular NER benchmarks that better reflect the disordered layouts seen in practice.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The central contribution is TPP, a new prediction head that replaces the token classification head used by sequence-labeling NER models. The paper includes ablations, SOTA comparisons on multiple tasks, and a case study, all focused on demonstrating the method’s effectiveness.
Secondary: $\Psi_{\text{Resource}}$: FUNSD-r and CORD-r are re-annotation efforts requiring manual word-level entity labeling on OCR-relaid layouts, released as community benchmarks. They also expose a data quality problem in the original FUNSD and CORD annotations that affects fair evaluation.
What is the motivation?
Information extraction from visually-rich documents (VrDs) such as forms, receipts, and contracts typically relies on pre-trained document transformers (LayoutLM family, BROS, LayoutMask) fine-tuned with a BIO token classification head for named entity recognition. The BIO scheme requires that each entity spans a contiguous, correctly ordered slice of the input token sequence.
In practice, OCR systems processing scanned documents read top-to-bottom, left-to-right and produce segment-level annotations that need not align with logical entity boundaries. Complex layouts, including multi-column text, tables, and elements that span multiple rows, cause OCR outputs to interleave tokens from different entities or split a single entity across non-adjacent positions. The BIO tagging scheme cannot represent such entities correctly, regardless of how powerful the underlying encoder is.
The authors note a secondary problem: the standard FUNSD and CORD benchmarks have segment annotations manually aligned with entity boundaries, so each entity always maps to a contiguous span in the input. Models trained and evaluated on these benchmarks are never tested on the disorder that appears in real scanned documents. This makes existing leaderboard numbers misleading as a measure of real-world capability.
What is the novelty?
Token Path Prediction (TPP)
Rather than tagging each input token with a BIO label, TPP treats entity extraction as a graph path prediction problem.
A visually-rich document with $N_D$ words is represented as:
$$D = \{(w_i, b_i)\}_{i=1,\ldots,N_D}$$
where $w_i$ is the $i$-th word and $b_i = (x_i^0, y_i^0, x_i^1, y_i^1)$ is its bounding box. An entity of type $e_j$ is a word sequence $(w_{j_1}, \ldots, w_{j_k})$ where the words need not be adjacent in the input order.
TPP constructs a complete directed graph over all $N_D$ tokens. For each entity type, the token paths that form entities are encoded as an $n \times n$ binary grid label (with $n = N_D$), where entry $(i, j)$ is 1 if token $w_i$ is immediately followed by token $w_j$ in any entity of that type. This transforms NER into $N_e$ independent binary classification problems, one per entity type, each over $n^2$ token-pair samples.
The classifier is implemented using Global Pointer (Su et al., 2022), which computes a bilinear score for each directed token pair:
$$s(w_i \to w_j) = \text{GlobalPointer}(h_i, h_j)$$
Training uses the class-imbalance loss from Global Pointer (Su et al., 2022). For each entity type grid, let $\mathcal{P}$ be the set of positive token pairs and $\mathcal{N}$ the set of negative pairs. The loss penalizes positive pairs with low score and negative pairs with high score:
$$\mathcal{L} = \log\!\left(1 + \sum_{(i,j)\in\mathcal{P}} e^{-s(w_i \to w_j)}\right) + \log\!\left(1 + \sum_{(i,j)\in\mathcal{N}} e^{s(w_i \to w_j)}\right)$$
This formulation handles the extreme class imbalance naturally, since at most $n$ of the $n^2$ pairs per grid are positive. During inference, positive-scored pairs are collected per entity type, and a greedy search decodes them into token paths (entity mentions), keeping the highest-confidence successor when multiple outgoing edges exist from the same token.
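A sketch of the grid-label construction and the greedy path decoding described above, in plain NumPy with hypothetical entity annotations (the official TPP code is unreleased, so all names here are ours):

```python
import numpy as np

def build_grid(entity_paths, n):
    """grid[i, j] = 1 iff token i is immediately followed by token j in some
    entity of this type; entity_paths holds token-index sequences, which may
    be discontinuous in the input order, e.g. [3, 4, 9]."""
    grid = np.zeros((n, n), dtype=np.int8)
    for path in entity_paths:
        for i, j in zip(path, path[1:]):
            grid[i, j] = 1
    return grid

def greedy_decode(scores, threshold=0.0):
    """Keep at most one outgoing edge per token (the highest-scoring successor
    above the threshold), then walk chains from tokens with no incoming edge.
    Single-token entities are not recovered by this simplified sketch."""
    succ = {}
    for i in range(scores.shape[0]):
        j = int(scores[i].argmax())
        if scores[i, j] > threshold:
            succ[i] = j
    has_pred = set(succ.values())
    paths = []
    for start in succ:
        if start not in has_pred:
            path, cur = [start], start
            while cur in succ and succ[cur] not in path:  # cycle guard
                cur = succ[cur]
                path.append(cur)
            paths.append(path)
    return paths
```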
Two enhancements improve performance over the base design:
- Positional residual linking: adds a residual connection from the backbone’s 1D positional embeddings into the GlobalPointer inputs, strengthening the ordering signal available to the classifier.
- Multi-dropout: duplicates the final fully-connected layer and trains each copy with a different dropout mask while sharing weights, following Inoue (2019), to improve robustness.
Adaptation to other VrD tasks
The same grid-label design is adapted to entity linking (VrD-EL) by marking all token pairs between linked entity spans, and to reading order prediction (VrD-ROP) by predicting a global path over all tokens starting from a special `<s>` token, decoded via beam search.
FUNSD-r and CORD-r
The revised benchmarks are re-annotated from the original FUNSD (199 documents) and CORD (999 documents) images using an automatic OCR pass with PP-OCRv3 for layout re-annotation, followed by manual re-labeling of entity mentions as word sequences on the new layouts. Key differences from the originals: (1) every segment is constrained to a single row, matching real OCR behavior; (2) entity boundaries are annotated at the word level and are not forced to align with segment boundaries; (3) entities may span multiple segments (and hence be discontinuous in OCR-order input). The continuous entity rate, the fraction of entities that form a contiguous span in OCR-order input, is reported as a diagnostic: FUNSD-r has 95.74% and CORD-r has 92.10% when no reordering is applied.
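The continuous entity rate itself is easy to reproduce; a sketch assuming each entity is given as its token indices in OCR order:

```python
def continuous_entity_rate(entities):
    """Fraction of entities whose token indices form a contiguous ascending
    run in the OCR-order input: [7, 8, 9] counts, [7, 9, 8] and [7, 8, 12] do not."""
    def is_continuous(idxs):
        return all(b == a + 1 for a, b in zip(idxs, idxs[1:]))
    return sum(is_continuous(e) for e in entities) / max(len(entities), 1)
```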
What experiments were performed?
VrD-NER on FUNSD-r and CORD-r
Two document transformer backbones are evaluated: LayoutLMv3-base (vision + text + layout) and LayoutMask (text + layout only). For each backbone, three conditions are compared:
- Sequence labeling, no pre-processing: vanilla BIO tagging on OCR-ordered input.
- Sequence labeling + reordering pre-processing: inputs are reordered by either LayoutReader or TPP-for-VrD-ROP before BIO tagging.
- TPP-for-VrD-NER: graph-based prediction with no reading order assumption.
The primary metric is entity-level F1 (strict boundary + type match), which the authors argue is more meaningful than the word-level F1 used in prior FUNSD/CORD work.
Results from Table 1, including the continuous entity rate (Cont.) which measures how well each pre-processing step restores contiguous entity spans in the input:
FUNSD-r
| Backbone | Method | Pre-processing | Cont. (%) | F1 |
|---|---|---|---|---|
| LayoutLMv3 | Seq. Labeling | None | 95.74 | 78.77 |
| LayoutLMv3 | Seq. Labeling | LayoutReader | 95.53 | 78.37 |
| LayoutLMv3 | Seq. Labeling | TPPR | 97.29 | 79.72 |
| LayoutLMv3 | TPP | None | N/A | 80.40 |
| LayoutMask | Seq. Labeling | None | 95.74 | 77.10 |
| LayoutMask | Seq. Labeling | LayoutReader | 95.53 | 77.24 |
| LayoutMask | Seq. Labeling | TPPR | 97.29 | 80.70 |
| LayoutMask | TPP | None | N/A | 78.19 |
CORD-r
| Backbone | Method | Pre-processing | Cont. (%) | F1 |
|---|---|---|---|---|
| LayoutLMv3 | Seq. Labeling | None | 92.10 | 82.72 |
| LayoutLMv3 | Seq. Labeling | LayoutReader | 82.10 | 70.33 |
| LayoutLMv3 | Seq. Labeling | TPPC | 92.43 | 83.24 |
| LayoutLMv3 | TPP | None | N/A | 91.85 |
| LayoutMask | Seq. Labeling | None | 92.10 | 81.84 |
| LayoutMask | Seq. Labeling | LayoutReader | 82.10 | 68.05 |
| LayoutMask | Seq. Labeling | TPPC | 92.43 | 81.90 |
| LayoutMask | TPP | None | N/A | 89.34 |
TPPR/TPPC denotes TPP-for-VrD-ROP trained on ReadingBank/CORD respectively. On CORD-r, LayoutReader is trained on ReadingBank, which uses column-by-column reading order; the CORD documents use row-by-row order. This domain mismatch causes LayoutReader to actively reduce the continuous entity rate from 92.10% to 82.10%, which translates into large F1 drops (12 and 14 points below the no-pre-processing baseline for LayoutLMv3 and LayoutMask respectively). This is the strongest quantitative evidence against using generic reading order models as pre-processing.
The CORD-r F1 gains for TPP itself are also large: +9.13 and +7.50 points over vanilla sequence labeling with LayoutLMv3 and LayoutMask respectively, with no pre-processing step at all.
LayoutReader used as a pre-processing step either fails to improve or actively degrades performance on both benchmarks. The authors attribute this to LayoutReader’s seq2seq decoding, which can produce duplicate or missing tokens and is optimized for BLEU on ReadingBank rather than for segment-level entity continuity.
VrD-EL and VrD-ROP
TPP is also evaluated on entity linking (FUNSD, F1) and reading order prediction (ReadingBank, BLEU and ARD). The authors report that TPP outperforms MSAU-PAF on entity linking and beats LayoutReader on reading order prediction in five of six BLEU/ARD settings.
Ablation studies
Ablations on positional encoding choices and TPP design components (Table 5a/5b) show that:
- Using global 1D positional encoding alongside segment-level 2D positions gives the best result for LayoutLMv3 on FUNSD-r.
- Both positional residual linking and multi-dropout contribute positively; their combined removal causes a drop of up to 12.66 F1 points on FUNSD-r with LayoutMask.
What are the outcomes/conclusions?
TPP consistently outperforms BIO-based sequence labeling on both new benchmarks, suggesting that order-invariant entity extraction is better suited to real-world scanned VrDs. As a secondary use mode, TPP-for-VrD-ROP also serves as a better pre-processing reordering mechanism than LayoutReader, as measured by the continuous entity rate after reordering.
The case study (Figure 6) illustrates qualitative strengths: TPP correctly identifies multi-row, multi-column, and long entities that sequence-labeling methods fragment or miss. A notable failure mode is entity type misclassification: TPP sometimes finds the correct tokens but assigns the wrong entity type label, suggesting that layout features dominate text signals in some cases.
The authors acknowledge two limitations. First, the $n^2$ grid label causes quadratic memory and compute scaling: training with TPP requires 2.45-2.75x more time and 1.47-1.56x more peak memory than vanilla token classification. This cost is manageable for fine-tuning on datasets of a few hundred to a few thousand samples but would be more problematic for large-scale pre-training. Second, the revised benchmarks cover English only, and similar disorder-reflecting benchmarks are not yet available for other languages or for IE tasks beyond NER.
Reproducibility
Models
- Architecture: Document transformer backbone (LayoutLMv3-base or LayoutMask) with a GlobalPointer classification head per entity type. Backbone parameter counts follow the original papers (LayoutLMv3-base: 125M parameters).
- Weights: Not released. The GitHub repository (https://github.com/chongzhangFDU/TPP) contains only a README; the authors note that open-sourcing of the code is pending approval from Ant Group. No pre-trained checkpoints are available.
- Backbone initialization: Fine-tuned from publicly available LayoutLMv3-base and LayoutMask checkpoints.
Algorithms
- Optimizer: Adam with 1% linear warm-up and weight decay of 1e-5, dropout 0.1.
- Learning rate: Grid-searched over {3e-5, 5e-5, 8e-5}; best values are backbone- and dataset-specific (e.g., 5e-5 / 5e-5 for LayoutLMv3 / LayoutMask on FUNSD; 3e-5 / 8e-5 on CORD).
- Batch size: 16.
- Steps: 1,000 on FUNSD-r; 2,500 on CORD-r. For VrD-ROP: 100 epochs on ReadingBank.
- Loss: Class-imbalance loss (from Global Pointer, Su et al., 2022) applied to the $n^2$ binary classification problem per entity type.
- Decoding: Greedy path search for NER/EL; beam search (beam size 8) for ROP.
- Maximum entities decoded: 100.
Data
- Training data: FUNSD-r (199 documents) and CORD-r (999 documents), re-annotated from the original FUNSD and CORD images using PP-OCRv3 for layout and manual annotation for entity mentions.
- Splits: Inherited from original FUNSD and CORD train/val/test splits after filtering for OCR confidence (> 0.8) and minimum document length (> 20 valid words).
- Availability: Released in a separate GitHub repository (https://github.com/chongzhangFDU/Token-Path-Prediction-Datasets) under CC-BY-4.0. Note that the TPP code itself has not been released; the dataset repo is independent of the code repo.
- Annotation: Semi-automatic (automated OCR re-annotation + manual entity labeling). No inter-annotator agreement figure is reported.
- Contamination: Not discussed. The revised datasets derive from the same document images as the originals, so any overlap with pre-training data for LayoutLMv3 or LayoutMask is inherited.
Evaluation
- Primary metric: Entity-level F1 (exact match on entity type and all constituent tokens). This is stricter than the word-level F1 used in most prior FUNSD/CORD work.
- Baselines: Sequence labeling with LayoutLMv3-base and LayoutMask; LayoutReader as pre-processing. Comparisons use the same backbones and fine-tuning budget, which is a reasonable basis for comparison.
- Limitations acknowledged: Quadratic complexity; English-only benchmarks; no other-language or non-NER IE tasks evaluated with disorder-reflecting benchmarks.
- Limitations not acknowledged: No error bars or multiple runs reported. The continuous entity rate metric, used to evaluate pre-processing quality, measures a proxy (contiguous spans) rather than direct downstream F1 impact at token level. The 92-96% continuous entity rates in the revised datasets mean that a majority of entities are still contiguous even after re-annotation, limiting how much the benchmarks actually stress-test non-sequential methods.
Hardware
- Training hardware: 8 Tesla A100 GPUs.
- Training time: 2.45x (LayoutLMv3) to 2.75x (LayoutMask) longer than vanilla token classification.
- Peak memory: 1.47x (LayoutMask) to 1.56x (LayoutLMv3) higher than vanilla token classification.
- Inference hardware: Not reported.
BibTeX
@inproceedings{zhang-etal-2023-reading,
title = "Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction",
author = "Zhang, Chong and Guo, Ya and Tu, Yi and Chen, Huan and Tang, Jinyang and Zhu, Huijia and Zhang, Qi and Gui, Tao",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
year = "2023",
publisher = "Association for Computational Linguistics",
doi = "10.18653/v1/2023.emnlp-main.846"
}
READoc: A Unified Benchmark for Realistic Document Structured Extraction
TL;DR
READoc introduces a benchmark that defines Document Structured Extraction (DSE) as a single end-to-end task: given a raw PDF, produce structured Markdown. The benchmark comprises 3,576 documents drawn from arXiv, GitHub, and Zenodo, and comes with a three-module Evaluation S$^3$uite (Standardization, Segmentation, Scoring) that measures text, heading, formula, table, and reading-order quality in a unified way. Evaluation of 16 systems reveals large gaps between current tools and the holistic DSE objective.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the READoc dataset and evaluation suite, both released as reusable community infrastructure. The benchmark itself, its construction pipeline, and the S$^3$uite are the headline deliverables.
Secondary: $\Psi_{\text{Evaluation}}$
The paper also contributes a measurement framework with novel metrics and protocols (KTDS for reading order, TEDS for tables and ToC trees, EDS for text), and the empirical evaluation exposes systematic failure modes that could not be identified with prior fragmented benchmarks.
What is the motivation?
Document Structured Extraction sits at the foundation of knowledge-base construction, corpus creation, and retrieval-augmented generation pipelines. Despite the large number of DSE systems that have emerged from both the academic and open-source communities, a rigorous unified evaluation framework has been absent.
Prior benchmarks fragment DSE into independent subtasks: layout analysis, OCR, reading-order detection, table recognition, formula conversion, and table-of-contents extraction each have their own datasets with distinct formats, annotation conventions, and data sources. This fragmentation creates two practical problems. First, it is impossible to determine how well a system performs at the realistic end-to-end task that downstream users actually care about. Second, because each subtask dataset focuses on a narrow slice, it does not reveal interactions and failure modes that only appear when subtasks are chained together (for example, a reading-order error that corrupts an otherwise accurate formula extraction).
The authors also note that many existing benchmarks use synthetic or curated data, rather than real-world documents in the wild, which limits their ecological validity.
What is the novelty?
The central contribution is the framing of DSE as a single PDF-to-Markdown conversion task and the construction of a benchmark that operationalizes this framing at scale.
Task definition. READoc takes a complete multi-page PDF as input and expects structured Markdown as output. The Markdown variant supports LaTeX syntax for tables and formulas, following the convention introduced by Nougat. This formulation is well-defined, practical (Markdown is directly consumable by LLMs and RAG pipelines), and challenging.
Dataset construction. The 3,576-document corpus is assembled from three heterogeneous sources, each stressing different DSE capabilities:
- READoc-arXiv (1,009 documents): academic papers with complex multi-column layouts, formulas, and tables. Ground truth Markdown is derived from LaTeX sources via LaTeXML and a modified Nougat conversion pipeline.
- READoc-GitHub (1,224 documents): README files converted to PDF via Pandoc and Eisvogel. These test heading detection and ToC construction under varied and often unmarked heading styles.
- READoc-Zenodo (1,343 documents): DOCX and HTML documents from Zenodo’s open repository, covering 27 languages, diverse types (theses, reports, posters, books), and long documents (average 14.93 pages). Ground truth is generated via MarkItDown and Pandoc.
Evaluation S$^3$uite. The suite operates in three sequential stages. The Standardization module normalizes formula boundaries, heading markers, table syntax, and image/link references to make model outputs comparable regardless of formatting idiosyncrasies. The Segmentation module partitions both prediction and reference into five semantic unit types: headings, embedded formulas, isolated formulas, tables, and plain text. The Scoring module then applies four metrics:
- Edit Distance Similarity (EDS) for text and heading concatenation: $$EDS(A, B) = 1 - \frac{ED(A, B)}{\max(|A|, |B|)}$$
- Tree Edit Distance Similarity (TEDS) for ToC and table structure: $$TEDS(T_1, T_2) = 1 - \frac{TED(T_1, T_2)}{\max(|T_1|, |T_2|)}$$
- Kendall’s Tau Distance Similarity (KTDS) for reading order, computed at both block and token granularity (a sketch follows this list): $$KTDS(X, Y) = 1 - \frac{2 \cdot K_d}{n(n-1)}$$ where $K_d$ is the number of discordant pairs and $n$ is the total number of ranked items.
- Vocabulary F1 for plain text.
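KTDS reduces to counting discordant pairs; a minimal sketch over items identified by ids shared between prediction and reference (a brute-force $O(n^2)$ pass, fine at benchmark scale):

```python
from itertools import combinations

def ktds(pred_rank, gold_rank):
    """pred_rank, gold_rank: dicts mapping item id -> rank position.
    Returns 1 - 2 * K_d / (n * (n - 1)), with K_d the discordant-pair count."""
    items = list(pred_rank)
    n = len(items)
    if n < 2:
        return 1.0
    discordant = sum(
        (pred_rank[a] - pred_rank[b]) * (gold_rank[a] - gold_rank[b]) < 0
        for a, b in combinations(items, 2)
    )
    return 1 - 2 * discordant / (n * (n - 1))
```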
Reading-order quality is thus measured implicitly within the holistic task rather than as a separate subtask with a dedicated annotation layer.
What experiments were performed?
The paper evaluates 16 systems across four categories on READoc-arXiv and READoc-GitHub (and selectively on READoc-Zenodo):
Baselines: PyMuPDF4LLM (PDF bytecode parsing), Tesseract (OCR-based), and MarkItDown (multi-format converter). These represent simple, widely applicable approaches without specialized document-understanding capabilities.
Pipeline tools: Marker, MinerU, Pix2Text, and Docling. These integrate engineering pipelines with specialized deep learning submodels for distinct DSE subtasks.
Expert visual models: Nougat-small and Nougat-base (end-to-end academic document parsing), and GOT-OCR 2.0. These are Transformer-based models trained directly for document-to-text conversion.
Vision-Language Models: DeepSeek-VL-7B-Chat, MiniCPM-Llama3-V2.5, LLaVA-1.6-Vicuna-13B, InternVL-Chat-V1.5, and GPT-4o-mini. VLMs process page images one at a time, with concatenated output.
The paper also performs several fine-grained analyses:
- Impact of document length and depth, showing that pipeline tools and expert models degrade with increasing heading depth, while VLMs degrade with increasing document length.
- Impact of layout complexity (single-column vs. multi-column), using pdfplumber classification.
- Multi-page paradigm exploration: GPT-4o-mini is given all page images simultaneously for short documents to test global context effects.
- Efficiency comparison measuring per-document processing time on a sample of 50 documents.
What are the outcomes/conclusions?
The evaluation on READoc-arXiv (Table 3 in the paper) surfaces several findings:
Pipeline tools outperform baselines, but fall well short of expert models on specialized subtasks. MinerU achieves the highest average score (73.07) among pipeline tools, but expert models Nougat-small and Nougat-base reach averages of 81.38 and 81.42 respectively, driven by strong formula and heading conversion.
VLMs lag considerably on specialized content. Even GPT-4o-mini, the best-performing VLM, achieves only 57.98 average on READoc-arXiv. Formula conversion scores are particularly low (31.77 for embedded, 18.65 for isolated), reflecting the difficulty of recognizing and rendering mathematical notation from page images.
Reading order detection is relatively tractable. The heuristic-based Tesseract baseline scores 96.70 block-level KTDS on READoc-arXiv, suggesting that most systems maintain correct sequential ordering even when specialized content conversion is poor. READoc’s holistic framing still surfaces confused reading order in multi-column and complex layouts, as documented in the case studies.
Single-page paradigm limits global structure recovery. Processing pages independently prevents models from constructing accurate cross-page ToC trees. A multi-page experiment with GPT-4o-mini shows significant ToC improvement but degrades local fine-grained capabilities (tables, formulas), indicating an unresolved tension.
Efficiency gaps are large. Inference times range from 23.86 seconds per document (Marker, 1x Titan RTX) to over 1,182 seconds (InternVL-Chat-V1.5, 2x A100). VLMs are orders of magnitude slower than pipeline tools, which matters for large-scale corpus construction or real-time RAG applications.
Limitations acknowledged by the authors: The automated PDF-Markdown construction pipeline introduces noise that is difficult to eliminate entirely, and the Standardization module cannot cover all format variations produced by diverse DSE systems.
Reproducibility
Models
READoc does not introduce a new model. All evaluated systems are publicly available third-party tools referenced with their original publications. Architecture details and parameter counts are not reported by this paper; readers should consult each system’s source. The evaluation pipeline itself is released alongside the benchmark dataset.
Algorithms
The S$^3$uite is implemented in Python. Segmentation is performed via regular expressions. EDS uses standard Levenshtein edit distance. TEDS uses a maximum bipartite matching algorithm for table alignment before computing tree edit distance. KTDS counts discordant pairs directly from ranked lists constructed from block-level and token-level co-occurrence positions.
For VLMs, generation uses top-p = 0.8, top-k = 100, temperature = 0.7, do_sample = True, and repetition_penalty = 1.05. No training is performed.
Data
- READoc-arXiv: 1,009 documents from arXiv (1996-2024). LaTeX sources converted via LaTeXML then a modified Nougat HTML-to-Markdown script. English only. Covers 6 document types and 8 disciplines.
- READoc-GitHub: 1,224 README files (2008-2024) with at least 500 stars, English, no HTML syntax, no tables or formulas. PDF generated via Pandoc + Eisvogel template.
- READoc-Zenodo: 1,343 documents (910 DOCX + 433 HTML) from 2014-2024. DOCX converted via Microsoft MarkItDown then python-docx; HTML via Pandoc then Chromium. 27 languages.
- Ground truth Markdown is generated automatically; some noise remains despite filtering. No human annotation.
- The dataset is released for research use; per-document source licenses inherit from arXiv (CC-BY or equivalent), GitHub (varied open source), and Zenodo (varied open access).
Evaluation
- Metrics: EDS (text, heading, formula concatenation), Vocabulary F1 (plain text), TEDS (ToC tree, table structure), KTDS at block level and token level (reading order).
- Benchmarks: Three subsets of READoc: arXiv, GitHub, Zenodo. Main results reported on arXiv and GitHub. Zenodo used in supplementary analyses.
- Baselines: 16 systems covering four paradigm classes; no custom fine-tuning is performed.
- Statistical rigor: The paper does not report standard deviations or confidence intervals. Efficiency measurements are averaged over 50 sampled documents from READoc-arXiv.
- Scope limitations: READoc-arXiv ground truth derives from LaTeX source, which may not capture all rendering nuances of the compiled PDF. READoc-GitHub and READoc-Zenodo are constructed from original source files, but automated conversion chains may introduce artifacts.
Hardware
- Nougat-small and Nougat-base: evaluated on 1x Titan RTX (24 GB).
- MiniCPM-Llama3-V2.5: evaluated on 1x A100 (80 GB).
- InternVL-Chat-V1.5: evaluated on 2x A100 (80 GB).
- Marker, MinerU, Pix2Text: GPU type and VRAM not specified in detail.
- No training is performed; all reported times are inference throughput.
BibTeX
@inproceedings{li2025readoc,
title = {{READOC}: A Unified Benchmark for Realistic Document Structured Extraction},
author = {Zichao Li and Aizier Abulaiti and Yaojie Lu and Xuanang Chen and
Jia Zheng and Hongyu Lin and Xianpei Han and Shanshan Jiang and
Bin Dong and Le Sun},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
year = {2025}
}
ROOR: Modeling Layout Reading Order as Ordering Relations
TL;DR
ROOR argues that representing layout reading order as a single permutation of elements is inherently lossy for complex documents, and proposes instead to model reading order as a directed acyclic relation (Immediate Succession During Reading, ISDR) over layout elements. The authors introduce a 199-sample benchmark dataset with ISDR annotations built on EC-FUNSD, a LayoutLMv3-based relation extraction baseline, and a downstream enhancement pipeline that injects reading order relations into any layout-aware model.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is the reformulation of ROP as a relation extraction task and the accompanying relation-aware attention pipeline for enhancing downstream VrD tasks. The method is the core novelty; the dataset and benchmark are released to enable it.
Secondary: $\Psi_{\text{Resource}}$: Introduces the ROOR benchmark dataset (199 samples, 10,662 segments, 10,967 annotated reading order relation pairs built on EC-FUNSD).
Secondary: $\Psi_{\text{Evaluation}}$: Proposes a new task formulation that challenges the standard permutation-based evaluation setup, and uses F1 over relation pairs as the primary metric rather than BLEU/ARD.
What is the motivation?
Reading Order Prediction (ROP) for visually-rich document layouts has historically been framed as predicting a permutation of layout elements. The authors identify a structural problem with this framing: for complex document layouts (multi-column pages, sidebars, tables, floating elements), multiple valid reading orders exist simultaneously, and no single permutation can capture all of them without discarding information or injecting annotation noise.
Concretely, they identify three failure modes of the permutation formulation:
- A layout element can have multiple valid immediate successors (e.g., the last cell in a table row can be followed by the cell below or the first cell in the next row).
- Some layout elements (headers, footers, watermarks) are logically disconnected from the main reading flow and cannot be cleanly inserted into a linear sequence.
- Indirect reading order relationships (e.g., a question and its distant answer) are not captured by immediate succession alone.
The downstream motivation is practical: prior work has shown reading order information improves VrD-IE and VrD-QA tasks, but this benefit is limited by the quality of the reading order representation.
What is the novelty?
Reformulation: ISDR as a Directed Acyclic Relation
The paper defines two order-theoretic concepts:
Immediate Succession During Reading (ISDR): A directed acyclic relation (DAR) over the set of layout elements. Given a document layout $D = \{(w_i, b_i)\}_{i=1,\ldots,N_D}$, the reading order is represented as a set of directed pairs:
$$ P_D = \{(k_s, k_o) \mid w_{k_o} \text{ immediately follows } w_{k_s} \text{ during reading}\} $$
This is a DAR (not a total order), which means a single element may have multiple successors, and disconnected subgraphs are permitted.
Generalized Succession During Reading (GSDR): The transitive closure of ISDR, forming a strict partial order (SPO). GSDR captures indirect ordering relationships (e.g., a question element always precedes its distant answer).
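GSDR is the transitive closure of the ISDR edge set; a naive sketch (quadratic fixpoint iteration, adequate for page-sized graphs):

```python
def gsdr_from_isdr(pairs):
    """Transitive closure: (a, c) is in GSDR iff c is reachable from a by
    following immediate-succession (ISDR) links."""
    succ = {}
    for a, b in pairs:
        succ.setdefault(a, set()).add(b)
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c in succ.get(b, ()):
                if (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure

# ISDR {(question, answer_1), (answer_1, answer_2)} additionally yields
# (question, answer_2) in GSDR.
```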
Relation Extraction Baseline
ROP is reformulated as a relation extraction task. The model uses LayoutLMv3 as the backbone encoder, pools per-element token embeddings via average pooling:
$$ h_i = \text{AvgPool1d}(h_i^1, \ldots, h_i^{n_i}) $$
Then applies a global pointer network for pair scoring:
$$ \begin{aligned} q_i &= W_q h_i + b_q \\ k_i &= W_k h_i + b_k \\ s_{ij} &= q_i^T k_j \end{aligned} $$
Training uses a class-imbalance loss:
$$ \mathcal{L} = \log\!\left(1 + \sum_{(i,j) \notin L} e^{s_{ij}}\right) + \log\!\left(1 + \sum_{(i,j) \in L} e^{-s_{ij}}\right) $$
Inference extracts pairs above a zero threshold: $\hat{L} = \{(i,j) \mid s_{ij} > 0\}$.
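A PyTorch sketch of the pooled-embedding pair scorer; projection size and module names are ours:

```python
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    """Global-pointer-style scoring: s_ij = (W_q h_i + b_q)^T (W_k h_j + b_k)."""
    def __init__(self, d_model: int, d_head: int = 64):
        super().__init__()
        self.proj_q = nn.Linear(d_model, d_head)
        self.proj_k = nn.Linear(d_model, d_head)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n, d_model) average-pooled segment embeddings.
        return self.proj_q(h) @ self.proj_k(h).T  # (n, n); keep pairs with s_ij > 0
```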
Downstream Enhancement Pipeline (RORE)
Reading order relations are encoded as an $n \times n$ binary matrix over input tokens and injected into downstream layout-aware models via a relation-aware attention module:
$$ a_{ij}^l = \frac{\exp\!\left(\left(q_i^T k_j^l + \lambda^l \rho_{ij}\right) / \sqrt{d_k}\right)}{\sum_{j'} \exp\!\left(\left(q_i^T k_{j'}^l + \lambda^l \rho_{ij'}\right) / \sqrt{d_k}\right)} $$
where $\rho_{ij}$ is the binary reading order relation matrix and $\lambda^l$ is a learnable per-layer scalar weight.
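A single-head PyTorch sketch of this relation-aware attention; tensor shapes and names are ours:

```python
import torch

def relation_aware_attention(q, k, v, rho, lam):
    """q, k, v: (n, d_k) projections for one head at one layer.
    rho: (n, n) binary reading order relation matrix; lam: the learnable
    per-layer scalar lambda^l. The bias is added before scaling and softmax,
    matching the formula above."""
    d_k = q.size(-1)
    logits = (q @ k.T + lam * rho) / d_k ** 0.5
    attn = torch.softmax(logits, dim=-1)
    return attn @ v
```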
What experiments were performed?
Dataset
ROOR is built on EC-FUNSD: 199 single-page scanned VrD samples, 10,662 segments, 31,297 words, 10,967 annotated ISDR relation pairs. Two domain experts annotated independently and resolved conflicts by discussion. 23.76% of segments are involved in non-linear reading order.
ROP Baseline Evaluation (Table 1)
Evaluated on word-level and segment-level ROP with F1 score. LayoutLMv3-large achieves 93.01 (word-level) and 82.38 (segment-level) F1. Human inter-annotator agreement is 99.28 (segment-level). Prior sequence-based methods (LayoutReader: 9.44 word-level F1; Token Path Prediction: 42.96 segment-level F1) perform dramatically worse, illustrating the incompatibility of sequence-based methods with the relation-based evaluation.
Downstream Enhancement on EC-FUNSD (Table 2)
Ground truth ISDR injected into LayoutLMv3-base/large and GeoLayoutLM for SER and EL tasks. EL benefits substantially more than SER (up to +6.17 F1 for LayoutLMv3-base).
Cross-Domain VrD-IE with Pseudo Labels (Table 3)
An off-the-shelf LayoutLMv3-large ROP model trained only on ROOR generates pseudo labels for FUNSD, CORD, and SROIE. RORE variants improve over baselines across all settings; RORE-GeoLayoutLM achieves the best comprehensive VrD-IE performance (93.94 overall).
VrD-QA with Pseudo Labels (Table 4)
Tested on DocVQA, InfoVQA, WTQ, TextVQA using ANLS. Gains range from +0.09 (TextVQA, simple layouts) to +6.48 (InfoVQA, layout-heavy). TextVQA’s minimal gain is attributed to its simple top-to-bottom layouts providing little new signal.
What are the outcomes/conclusions?
- Permutation formulation is insufficient. 23.76% of segments in ROOR are in non-linear reading positions. A single sequence cannot represent them without noise.
- Relation-based models dramatically outperform sequence-based models on ROOR. LayoutLMv3-large reaches 93.01 F1 versus LayoutReader-style baselines below 10 F1 at word-level, though this comparison is partly an artifact of task reformulation (the prior models were not designed for this metric).
- Reading order relations improve downstream VrD tasks universally, including via pseudo labels generated by the ROP model, without task-specific tuning. This is the most practically significant finding.
- EL benefits more than SER from reading order. Entity linking requires understanding logical spatial flow; SER relies more on local entity semantics.
Limitations
- The ROOR benchmark is small (199 samples) and built from a single source dataset (EC-FUNSD). Generalization of the benchmark itself is not validated.
- The baseline ROP model over-relies on spatial layout features relative to textual features, as shown in case studies in the appendix.
- Only one fusion approach (relation-aware attention additive bias) is tested; graph-based or pre-training-based alternatives are acknowledged but not compared.
- Search space increases from $O(n!)$ (permutation) to $O(2^{n^2})$ (DAR), which the authors flag as an inherent cost of the richer representation.
Reproducibility
Models
- Backbone: LayoutLMv3-base (133M) and LayoutLMv3-large (368M). Standard HuggingFace releases.
- Downstream baselines: GeoLayoutLM (399M), generative LayoutLMv3 + BART decoder.
- No new model weights released; code for training and inference is available.
Algorithms
- Global pointer network for relation extraction (Su et al., 2022).
- Relation-aware attention: additive bias with learnable per-layer scalar $\lambda^l$, initialized from pre-trained backbone.
- Training details in Appendix D of the paper; specific hyperparameters not reported in main text.
Data
- ROOR dataset: 199 samples from EC-FUNSD, train/validation split follows EC-FUNSD. Released at github.com/chongzhangFDU/ROOR-Datasets under CC-BY-4.0.
- Annotation: two domain experts, conflict resolution by discussion, no formal inter-annotator agreement metric reported beyond 99.28 human F1 on the task.
- Downstream benchmarks: FUNSD, CORD, SROIE, DocVQA, InfoVQA, WTQ, TextVQA (all publicly available, used as-is).
Evaluation
- Primary ROP metric: F1 over ISDR relation pairs (not BLEU or ARD).
- Downstream metrics: F1 for SER/EL, ANLS for VrD-QA.
- No error bars or significance tests reported.
Hardware
- Not explicitly reported in the main paper or appendix.
BibTeX
@inproceedings{zhang2024roor,
title={Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding},
author={Zhang, Chong and Tu, Yi and Zhao, Yixi and Yuan, Chenshu and Chen, Huan and Zhang, Yue and Chai, Mingxu and Guo, Ya and Zhu, Huijia and Zhang, Qi and Gui, Tao},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
year={2024}
}
ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction
TL;DR
ROPE (Reading Order Equivariant Positional Encoding) addresses a specific limitation in graph-based document information extraction: GCN aggregation functions discard word ordering during message passing, just as average pooling discards position in CNNs. ROPE fixes this by assigning relative reading order codes to each word’s neighbors before message passing, adding ordering context without requiring a serialized document. On FUNSD and a large proprietary payment dataset, ROPE improves node and edge classification F1 by up to 8.4 points, and proves at least as important as geometric edge features in the GCN baseline.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The primary contribution is the ROPE encoding mechanism itself: a lightweight modification to existing GCN pipelines that injects relative reading order into graph message passing. The paper centers on ablation tables, architecture description, and performance comparisons across encoding variants.
What is the motivation?
Key information extraction from form-like documents is difficult because layout varies enormously: multi-column layouts, tables, scattered text blocks, and non-aligned sections mean there is no reliable linear serialization of words. Sequence models applied to OCR output suffer when the left-to-right, top-to-bottom order produced by an OCR engine does not reflect actual reading order.
GCNs avoid serialization by operating directly on word-level graphs where nodes are tokens and edges encode spatial relationships. However, the aggregation step in GCNs (typically a pooling or attention function applied over all incoming neighbor messages) is order-invariant. Any sequential context latent in the OCR reading order is lost at this step. The gap between sequence models (which capture order but struggle with complex 2D layout) and graph models (which handle 2D layout but drop order) motivates ROPE.
What is the novelty?
The core idea is as follows: before each GCN message passing step, iterate through the neighbors of a target word in OCR reading order and assign consecutive integer codes $p \in \mathbb{N}$ starting from zero. These ROPE codes are concatenated to the corresponding neighbor messages before aggregation.
Formally, for a target token $t_i$ with neighbor set $\mathcal{N}(i)$ in the graph sorted by OCR reading order, the $k$-th neighbor receives ROPE code $p = k$. The augmented message from neighbor $v_j$ at position $k$ becomes:
$$\tilde{m}_{ij} = [\, m_{ij} \,|\, \phi(k) \,]$$
where $m_{ij}$ is the raw neighbor message, $\phi(k)$ is the positional encoding function, and $[\cdot | \cdot]$ denotes concatenation. The self-attention aggregation then operates over these augmented messages, becoming sensitive to the relative reading order of each neighbor.
A key property of this formulation is equivariance: if the entire local neighborhood (target word plus all its neighbors) shifts by the same offset in the document while preserving relative order, the assigned codes remain unchanged. This makes ROPE robust to global position variation while remaining sensitive to local sequential structure.
The authors explore two choices for $\phi(k)$:
- Index encoding: $\phi(k) = k$, used directly as a scalar or embedded.
- Sinusoidal encoding: following Vaswani et al. (2017) with 3 base frequencies $\omega \in \{1, 10, 100\}$:
$$\phi_{\sin}(k) = \left[\sin\left(\frac{k}{\omega}\right), \cos\left(\frac{k}{\omega}\right)\right]_{\omega \in \{1, 10, 100\}}$$
Ablations show that sinusoidal encoding outperforms index encoding on word labeling, while combining the two is strongest for word grouping on FUNSD. Since each target word in a $\beta$-skeleton graph typically has fewer than 8 neighbors, index encoding alone already provides much of the benefit.
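A small illustrative sketch of the code assignment and both encodings (function names and the dict-based reading order lookup are assumptions, not the paper's implementation):

```python
import numpy as np

def rope_codes(neighbor_ids: list[int], reading_order: dict[int, int]) -> list[int]:
    """Assign consecutive integer codes 0..k-1 to a target word's neighbors,
    iterating in OCR reading order. Only relative order within the local
    neighborhood matters, which is the equivariance property."""
    ranked = sorted(neighbor_ids, key=lambda v: reading_order[v])
    return [ranked.index(v) for v in neighbor_ids]

def sinusoidal(k: int, freqs=(1.0, 10.0, 100.0)) -> np.ndarray:
    # phi_sin(k): sin/cos pairs at the 3 base frequencies from the paper.
    return np.array([f(k / w) for w in freqs for f in (np.sin, np.cos)])

# Augmented message: concatenate phi(k) to each neighbor message before
# self-attention aggregation (index encoding would just use k directly).
# m_aug = np.concatenate([m_ij, sinusoidal(k)])
```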
ROPE is plugged into a GCN with multi-head self-attention aggregation (inspired by Graph Attention Networks), which allows the model to weight incoming messages by both content and their ROPE-encoded sequential position.
What experiments were performed?
Tasks
Two entity extraction tasks are evaluated, both from Jaume et al. (2019):
- Word labeling: multi-class node classification assigning each token a semantic entity type. Metric: macro F1.
- Word grouping: binary edge classification deciding whether two tokens belong to the same entity. Metric: F1 with precision and recall.
These tasks deliberately avoid CRF-style decoders and do not rely on pre-annotated entity groupings, which isolates the quality of node and edge embeddings.
Datasets
- Payment: a proprietary large-scale collection of roughly 18K single-page payment documents from multiple vendors. Human-annotated with 13 semantic entity types, producing over 3M word-level annotations. 80/20 train/test split. OCR performed with Google Cloud Vision.
- FUNSD: a public benchmark of 199 noisy scanned forms with 4 entity types (header, question, answer, other), 9,707 entities, and 31,485 word-level annotations. Official 75/25 train/test split.
Architecture and training details
- Node features: BERT word embeddings + normalized bounding box coordinates (height, width, four corners).
- Edge features: spatial embeddings (relative distances, aspect ratios) and MobileNetV3 visual embeddings from the union bounding box of connected token pairs.
- Graph construction: $\beta$-skeleton with $\beta = 1$ (ball-of-sight strategy), providing high connectivity while remaining sparser than fully-connected graphs (a reference construction is sketched after this list).
- Aggregation: 3-layer multi-head self-attention pooling (4 heads, 32-dimensional head size).
- Node update: 2-layer MLP with 128 hidden units.
- GCN depth: 7 hops for payment (due to scale), 2 hops for FUNSD.
- Optimizer: Adam, learning rate $10^{-4}$, warm-up proportion 0.01, batch size 1.
- Hardware: 8 Tesla P100 GPUs; approximately 1 day of training on the largest corpus.
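The $\beta = 1$ skeleton is the Gabriel graph. A naive reference construction over box centers is sketched below; the paper's ball-of-sight variant operates on text-line boxes rather than points, so this is only an approximation:

```python
import numpy as np

def gabriel_graph(points: np.ndarray) -> list[tuple[int, int]]:
    """Beta-skeleton with beta = 1 (Gabriel graph) over word-box centers:
    keep edge (i, j) iff no third point lies inside the circle whose
    diameter is the segment ij. O(n^3) reference implementation."""
    n = len(points)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            center = (points[i] + points[j]) / 2.0
            r2 = np.sum((points[i] - points[j]) ** 2) / 4.0
            blocked = any(
                k not in (i, j) and np.sum((points[k] - center) ** 2) < r2
                for k in range(n)
            )
            if not blocked:
                edges.append((i, j))
    return edges
```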
Ablations
Table 1 crosses EdgeGeo (geometric edge features) and ROPE independently and in combination across both datasets:
| Dataset | EdgeGeo | ROPE | WL F1 | WG F1 |
|---|---|---|---|---|
| Payment | | | 74.55 | 86.64 |
| Payment | ✓ | | 60.80 | 83.80 |
| Payment | | ✓ | 66.09 | 84.94 |
| Payment | ✓ | ✓ | 68.17 | 85.88 |
| FUNSD | | | 57.22 | 89.33 |
| FUNSD | ✓ | | 50.86 | 86.86 |
| FUNSD | | ✓ | 53.16 | 87.37 |
| FUNSD | ✓ | ✓ | 51.78 | 89.28 |
(WL = Word Labeling; WG = Word Grouping. Full precision/recall numbers are in the paper.)
The paper frames these results relative to the no-encoding reference, the first row per dataset: this is the internal GCN baseline from Qian et al. (2019), which already incorporates BERT node embeddings and spatial bounding box features, so what is ablated here is only the additional positional encoding on top of that. On Payment, the EdgeGeo-only variant sits 13.75 word labeling points and 2.84 word grouping points below that reference (60.80 / 83.80); adding ROPE on top recovers 7.37 of those word labeling points (68.17), leaving a 6.38-point gap, while ROPE alone (66.09) loses less than EdgeGeo alone. The authors conclude that reading order information is at least as important as geometric edge features, and that the two provide orthogonal improvements when combined.
Table 2 compares encoding functions for ROPE:
| Dataset | Index | Sine | WL F1 | WG F1 |
|---|---|---|---|---|
| Payment | ✓ | | 66.09 | 84.94 |
| Payment | | ✓ | 72.41 | 86.53 |
| Payment | ✓ | ✓ | 70.94 | 85.66 |
| FUNSD | ✓ | | 53.16 | 87.37 |
| FUNSD | | ✓ | 55.48 | 88.94 |
| FUNSD | ✓ | ✓ | 54.14 | 89.12 |
Results are compared against the no-encoding GCN baseline (74.55 WL F1 on Payment). Both encoding functions improve over the ROPE-free EdgeGeo variant of Table 1 (60.80 WL F1 on Payment). For word grouping, the combined Index+Sine setting is best on FUNSD, while sinusoidal alone is strongest on Payment. For word labeling, sinusoidal encoding alone edges out the combined variant on Payment (72.41 vs. 70.94), suggesting slight interference between the two functions at this task. Notably, all ROPE variants trail the no-encoding baseline on WL F1, meaning ROPE's gains appear primarily through reducing the performance gap rather than exceeding the baseline when EdgeGeo is absent.
Sensitivity to OCR order quality: the authors shuffle a fraction of words in the input reading order before feeding into ROPE. Performance is robust up to roughly 30% shuffling; beyond that, degradation is more pronounced for word grouping than word labeling, suggesting that grouping relies more heavily on sequential consistency.
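An illustrative reconstruction of that perturbation setup (the paper does not publish its shuffling code, so this is an assumption about the mechanics):

```python
import random

def perturb_reading_order(order: list[int], frac: float, seed: int = 0) -> list[int]:
    """Shuffle a given fraction of positions in the OCR reading order,
    leaving the rest in place, as in the robustness study."""
    rng = random.Random(seed)
    k = int(len(order) * frac)
    idx = rng.sample(range(len(order)), k)  # positions to disturb
    vals = [order[i] for i in idx]
    rng.shuffle(vals)
    out = list(order)
    for i, v in zip(idx, vals):
        out[i] = v
    return out
```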
What are the outcomes/conclusions?
ROPE consistently improves GCN-based document entity extraction, with gains of up to 8.4 F1 points on the payment dataset (word labeling) and around 2.5 F1 points on FUNSD (word labeling). Reading order encoding and geometric edge features provide orthogonal improvements, suggesting they capture different aspects of document structure.
The equivariance property is a meaningful design choice: by defining codes relative to the target word, ROPE avoids dependence on absolute document position and remains consistent when local neighborhoods span column or section boundaries.
The paper is narrow in scope: it applies ROPE only to a GCN with self-attention aggregation, and only to two entity extraction tasks (node and edge classification). The payment dataset is proprietary and not released, limiting reproducibility of the larger-corpus results. Results on FUNSD are modest. The authors acknowledge that ROPE is compatible with other GCN-based document tasks but do not demonstrate this directly.
One limitation not explicitly acknowledged by the authors: the paper predates the dominant transformer-based document models (LayoutLM family, Donut, etc.), and ROPE’s benefit in an architecture where positional encoding is already handled by cross-attention or 2D sinusoidal embeddings is untested. The $\beta$-skeleton graph construction also requires pre-computed bounding box information, so ROPE is not applicable to end-to-end pixel-level models.
Reproducibility
Models
- The GCN architecture is fully described: 2-layer MLP node updates (128 hidden), 3-layer multi-head self-attention aggregation (4 heads, 32 head size), 7 or 2 GCN hops depending on dataset.
- Node input: BERT embeddings (exact variant unspecified in the paper; likely BERT-base given the 2021 era and compute budget) plus 2D bounding box spatial embeddings.
- Edge input: spatial geometric features (relative distances, aspect ratios) plus MobileNetV3 visual embeddings.
- No model weights or code repository are released with the paper.
Algorithms
- Training: Adam optimizer, $lr = 10^{-4}$, warm-up proportion 0.01, batch size 1, cross-entropy loss.
- Graph construction: $\beta$-skeleton with $\beta = 1$; ball-of-sight strategy.
- ROPE: integer code assignment by OCR order; optionally mapped through sinusoidal encoding with 3 base frequencies.
- No gradient clipping or mixed precision details are reported.
Data
- FUNSD: publicly available at guillaumejaume.github.io/FUNSD. 199 annotated scanned forms, 4 entity types, official 75/25 split. License: not formally specified by the authors, but widely used for academic research.
- Payment: proprietary, approximately 18K single-page payment documents with 13 entity types, annotated by human labelers. Not released. OCR from Google Cloud Vision.
Evaluation
- Word labeling metric: multi-class node classification F1.
- Word grouping metric: binary edge classification F1 (precision + recall also reported).
- No error bars, confidence intervals, or multiple-run variance reported.
- Comparison is against an internal GCN baseline; no comparison to externally published methods on FUNSD.
Hardware
- Training: 8 Tesla P100 GPUs.
- Training time: approximately 1 day on the payment dataset (largest corpus).
- No inference latency or memory consumption reported.
BibTeX
@inproceedings{lee-etal-2021-rope,
title = {{ROPE}: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction},
author = {Lee, Chen-Yu and Li, Chun-Liang and Wang, Chu and Wang, Renshen and Fujii, Yasuhisa and Qin, Siyang and Popat, Ashok and Pfister, Tomas},
booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
year = {2021},
publisher = {Association for Computational Linguistics},
address = {Online},
doi = {10.18653/v1/2021.acl-short.41},
url = {https://aclanthology.org/2021.acl-short.41},
}
Seg2Act: Global Context-aware Action Generation for Document Logical Structuring
TL;DR
Seg2Act treats document logical structure extraction as a one-pass action generation task: a generative language model reads text segments through a sliding window and predicts a symbolic action per segment, while a global context stack tracks the partially built hierarchy. On ChCatExt and HierDoc benchmarks, the method outperforms both text-only transition-based baselines and multimodal pipeline methods using only textual input.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper’s headline contribution is an end-to-end generative framework for document logical structuring. It proposes a new action vocabulary, a global context stack mechanism, and a multi-segment multi-action inference strategy, and validates them with ablation studies and SOTA comparisons.
What is the motivation?
Document logical structuring (recovering the hierarchical tree of headings and paragraphs from a document’s text segments) is important for downstream tasks like retrieval, summarization, and question answering over long documents. Existing methods share three failure modes:
- Local encoding: Pipeline methods encode text segments in isolation or in pairs, missing long-range structural cues.
- Multi-stage error propagation: Decomposing the task into heading detection, node classification, and relationship prediction lets errors accumulate across stages.
- Weak generalization: Because each stage is designed around structural conventions of a specific document type, adapting to new document families requires redesigning the pipeline.
Transition-based methods such as TRACER partially address these issues but still rely on pairwise local context and shift-reduce operations that force each segment to participate in multiple prediction steps.
What is the novelty?
Seg2Act contributes three tightly coupled ideas:
1. A one-to-one action vocabulary. Rather than pairwise transition decisions, Seg2Act defines exactly three actions, each mapping one text segment to one position in the logical tree:
- New Level-$k$ Heading (symbol: $k$ consecutive `+` characters): adds the segment as a level-$k$ heading under the last level-$(k-1)$ heading.
- New Paragraph (symbol: `*`): adds the segment as a paragraph under the current heading.
- Concatenation (symbol: `=`): appends the segment to the last node, handling OCR line splits.
This one-to-one design reduces the total number of predictions to $N$ (the number of segments) versus the multiple-pass decisions in shift-reduce parsing.
2. A global context stack. To give the model awareness of the evolving structure, Seg2Act maintains a stack $S$ of the nodes along the ancestor path of the most recently added node. The stack is serialized in the same symbolic vocabulary (`+`, `*`) as the action set and prepended to the model input at each step. Update rules are:
- New Level-$k$ Heading: pop nodes until the stack top is a level-$(k-1)$ heading, then push the new node.
- New Paragraph: pop the top paragraph (if any), then push the new paragraph.
- Concatenation: append text to the top node; stack shape is unchanged.
This keeps the globally relevant structural context within a bounded input length (a sketch of these update rules appears after this list).
3. Multi-segment multi-action (MSMA) strategy. Processing 853 segments one-by-one (the HierDoc average) is slow and limits context. Seg2Act instead processes a window of $w_I$ segments and predicts $w_O$ actions in a single forward pass. The total number of steps is $\lceil N / w_O \rceil$. The two window sizes are independent: a wider $w_I$ provides more look-ahead context, while $w_O$ controls throughput. Training uses teacher forcing with cross-entropy loss:
$$ \mathcal{L} = -\sum_{i=1}^{|D|} \log P(y_{i:i+w_I-1} \mid s_i, x_{i:i+w_I-1}; \Theta) $$
where $s_i$ is the global context stack at step $i$ and $\Theta$ are the model parameters. At training time, $w_I = w_O$ by default (one-pass mode).
Hard inference constraints enforce structural validity: only tokens from the predefined set {`+`, `*`, `=`, `\n`} are allowed, concatenation is blocked when only the root node is on the stack, and heading levels cannot skip (a predicted level-$k$ heading is clamped to the current maximum depth plus one if needed).
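A compact sketch of the stack update rules, using a list of dicts for the stack; the node representation and level bookkeeping are illustrative assumptions, not the paper's data structures:

```python
def update_stack(stack: list[dict], action: str, text: str) -> None:
    """Apply one Seg2Act action to the global context stack in place."""
    if action and set(action) == {"+"}:           # New Level-k Heading
        k = len(action)
        # pop until the top is the level-(k-1) heading (root has level 0)
        while stack and stack[-1]["level"] >= k:
            stack.pop()
        stack.append({"level": k, "kind": "heading", "text": text})
    elif action == "*":                           # New Paragraph
        if stack and stack[-1]["kind"] == "paragraph":
            stack.pop()                           # replace previous paragraph
        level = stack[-1]["level"] + 1 if stack else 1
        stack.append({"level": level, "kind": "paragraph", "text": text})
    elif action == "=":                           # Concatenation
        stack[-1]["text"] += text                 # stack shape unchanged
```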
What experiments were performed?
Datasets. Two benchmarks:
- ChCatExt (Zhu et al., 2023): 650 Chinese documents (bid announcements, financial announcements, credit rating reports) with full heading and paragraph annotations.
- HierDoc (Hu et al., 2022): 650 English scientific documents with Table-of-Contents (heading-only) annotations; average 853.38 segments per document.
Metrics. Node-level F1 for headings, paragraphs, and all nodes combined; TEDS for ToC structure on HierDoc; and DocAcc (the fraction of documents whose predicted structure exactly matches the ground truth).
Baselines.
- Text-only: TRACER (shift-reduce transition parser with a pretrained LM encoder).
- Multimodal: MTD (BERT + ResNet attention-based ToC extractor) and CMM (RoBERTa + heuristic tree initialization, then node-level refinement).
Backbone models. GPT2-Medium (~345 M parameters), Baichuan-7B (7B parameters, fine-tuned with LoRA: rank $r=8$, $\alpha=16$), Baichuan-13B, and Qwen1.5 (0.5B, 1.8B, 4B) in the model size ablation.
Transfer learning. Models are pretrained on the Wiki corpus (provided by Zhu et al.) for 10,000 steps, then evaluated zero-shot, few-shot (3 and 5 examples), and full-shot on each ChCatExt sub-corpus.
Ablations.
- Removing the MSMA strategy ($w_I = w_O = 1$).
- Removing the symbolic component of the global context stack (leaving only text).
- Removing the text component (leaving only symbols).
- Removing the full stack (both text and symbols).
Optimizer. AdamW, learning rate $3 \times 10^{-4}$, 10 epochs, batch size 128, $w_I = w_O = 3$ by default. Experiments run on one NVIDIA A100 GPU; results averaged over 5 random seeds.
What are the outcomes/conclusions?
Supervised setting. On ChCatExt with the Baichuan-7B backbone, Seg2Act achieves 92.63 total F1 and 63.69 DocAcc, versus TRACER at 89.55 F1 and 53.85 DocAcc, a +9.84 DocAcc gap. On HierDoc, Seg2Act with Baichuan-7B reaches 98.1 heading detection F1 and 96.3 TEDS, outperforming the multimodal MTD (96.1 / 87.2) and CMM (97.0 / 88.1) using text alone.
Transfer learning. In zero-shot, few-shot, and full-shot transfer across ChCatExt sub-corpora, Seg2Act averages +15.28, +6.04, and +10.65 F1 over SEG2ACT-T (the transition-based variant), respectively. With only 5 labeled examples, performance drops only 3.98 F1 points on average from the full-shot setting, suggesting the action vocabulary generalizes well across document families.
Ablations. Removing the hierarchical symbols from the global context stack costs 6.46 DocAcc. Removing the text portion costs more (12.92 DocAcc). Removing the full stack degrades DocAcc by 18.77 points, confirming that global context is the primary driver of document-level accuracy.
Model size. Qwen1.5-1.8B matches Baichuan-7B's DocAcc (both 63.69) at roughly half the inference time (4.27 s vs. 8.13 s per document), and runs in roughly $0.65\times$ the per-document inference time of Qwen1.5-4B while achieving comparable accuracy. Qwen1.5-4B achieves 65.23 DocAcc with a moderate speed penalty. Results suggest the action generation framework can run effectively on smaller models.
Limitations acknowledged by the authors. Seg2Act relies on a correctly ordered sequence of text segments; disrupted segment order (e.g., from imperfect OCR column detection) is not handled. The generative model may occasionally emit the wrong number of actions for a given window, causing skipped segments. Visual information is not used, so purely layout-driven cues such as font size and indentation are unavailable unless the text encoder can infer them from OCR output.
Limitations not addressed. Evaluation is restricted to two Chinese and English benchmarks with relatively clean ground-truth segmentations. Performance on noisier OCR pipelines, PDF documents with complex layouts (multi-column, tables embedded in the flow), or non-standard document types remains untested. The paper does not report statistical significance tests, though it does average over five seeds. Computational overhead of the autoregressive generation loop at inference compared to a classifier-based approach is not analyzed for latency-sensitive production settings.
Reproducibility
Models
- GPT2-Medium: ~345 M parameters, standard causal LM decoder, fine-tuned fully on the task.
- Baichuan-7B: 7B-parameter causal LM, fine-tuned with LoRA (rank $r=8$, alpha $\alpha=16$). Specific layers adapted are not specified in the paper.
- Baichuan-13B: Larger variant tested in the appendix; LoRA is used by analogy with the 7B setting (specific rank/alpha not re-stated for 13B).
- Qwen1.5 (0.5B, 1.8B, 4B): Additional model size ablation; training details assumed consistent with Baichuan-7B setting.
- Model weights for Seg2Act fine-tuned models are not released. Base model weights (GPT2, Baichuan-7B, Qwen1.5) are publicly available.
Algorithms
- Optimizer: AdamW with learning rate $3 \times 10^{-4}$.
- Schedule: 10 training epochs; no warmup schedule reported.
- Batch size: 128.
- Input/output windows: $w_I = w_O = 3$ (default).
- Loss: Teacher-forcing cross-entropy over action tokens.
- Inference: Greedy decoding with hard constraints via `LogitsProcessor` from the HuggingFace Transformers library (a minimal sketch follows this list).
- Pretraining for transfer: 10,000 steps on Wiki corpus, then fine-tuned with LoRA.
- Gradient clipping and mixed-precision training are not mentioned.
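A minimal sketch of such a constraint, assuming the action-token ids have been resolved from the tokenizer beforehand; the paper's actual processor also applies the dynamic rules (stack state, level clamping), which would be handled by recomputing the allowed set per step:

```python
import torch
from transformers import LogitsProcessor

class ActionConstraintProcessor(LogitsProcessor):
    """Restrict generation to the action vocabulary {+, *, =, \\n}."""
    def __init__(self, allowed_token_ids: list[int]):
        self.allowed = allowed_token_ids

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # -inf everywhere except the allowed action tokens
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed] = 0.0
        return scores + mask
```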
Data
- ChCatExt: 650 Chinese documents with logical tree annotations; split details and download location are not described in this paper. The dataset was provided by Zhu et al. (2023) alongside the TRACER work.
- HierDoc: 650 English scientific documents with ToC annotations; provided by Hu et al. (2022) via ICPR 2022. Public availability should be verified at the original source.
- Wiki corpus: Used for pretraining in transfer experiments; provided by Zhu et al. (2023). License and availability are not discussed in this paper.
- Annotation process for ChCatExt and HierDoc is described in the original dataset papers, not here.
Evaluation
- Metrics: Node-level F1 (heading, paragraph, total); TEDS for ToC; DocAcc (exact match at document level). DocAcc is newly introduced here and is straightforward to compute.
- Benchmarks: ChCatExt (all three sub-corpora: BidAnn, FinAnn, CreRat) and HierDoc.
- Baselines: TRACER results on ChCatExt were re-run by the authors (TRACER*) because original results used a different backbone. CMM and MTD results are taken from the original publications.
- Statistical rigor: Results are averaged over 5 random seeds. No confidence intervals or significance tests are reported.
- Reproducibility gap: Fine-tuned model weights are not released. The code repository (linked in frontmatter artifacts) has no declared license file, which limits derivative use.
Hardware
- Training hardware: Single NVIDIA A100 GPU. Total GPU-hours not reported.
- Inference time: Per-document inference time with Baichuan-7B at $w_I = w_O = 3$ is approximately 8.13 seconds (from Table 6 in the paper). Qwen1.5-0.5B achieves ~4.01 seconds per document.
- Memory requirements and VRAM usage are not reported.
- The LoRA fine-tuning is intended to reduce GPU memory overhead for the 7B model, suggesting training fits on a single A100 (40 GB or 80 GB) and might be feasible with less VRAM under quantization.
BibTeX
@inproceedings{li-etal-2024-seg2act,
title = "{S}eg2{A}ct: Global Context-aware Action Generation for Document Logical Structuring",
author = "Li, Zichao and He, Shaojie and Liao, Meng and Chen, Xuanang and Lu, Yaojie and Lin, Hongyu and Lu, Yanxiong and Han, Xianpei and Sun, Le",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
year = "2024",
publisher = "Association for Computational Linguistics",
doi = "10.18653/v1/2024.emnlp-main.1003",
}
Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation
TL;DR
Wang et al. (Google Research) reframe text reading order prediction as a graph segmentation task rather than sequence generation. A sparse, $\beta$-skeleton-based multi-task GCN predicts whether each text line belongs to a column-wise or row-wise reading pattern and whether neighboring lines share a paragraph. A deterministic hierarchical cluster-and-sort post-processing step then converts those local predictions into a global reading order. The resulting model has 267K parameters, is language-agnostic, and generalizes from English training data to seven other languages.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is an algorithm combining a sparse-graph GCN with a deterministic cluster-and-sort post-processing stage. The paper introduces the architecture, the labeling scheme for deriving pattern labels from annotations, the hierarchical clustering algorithms, and ablations validating the design choices.
Secondary: none. The internal annotated dataset is not released and is used solely for validation.
What is the motivation?
Reading order is a critical output of any OCR pipeline. Downstream tasks from text selection to structured document understanding with LayoutLM-family models depend on correctly ordered text. The difficulty is not uniform: a column of newsprint text is easily sorted by y-coordinate, but a restaurant menu photographed at an oblique angle with interleaved dish names and prices requires understanding two-dimensional relational structure.
Prior approaches fall into two failure modes:
- Rule-based methods such as XY-Cut and topological sort based on bidimensional relations work well within a narrow domain but fail out-of-distribution. They commit to a single layout pattern (column-wise or row-wise) globally and cannot handle mixed documents.
- Sequence-learning methods such as the graph encoder plus pointer-network decoder in Li et al. (ECCV 2020) or the LayoutLM-based LayoutReader suffer from $O(n^2)$ time complexity and accuracy degradation as input length grows, because the fully-connected attention conflates local layout signals with irrelevant global context.
The authors identify that most real-world reading order sequences follow one of two local patterns (column-wise or row-wise) and that deciding which pattern applies to any given text region is a local rather than global problem. This observation motivates reframing the task as node classification on a sparse neighborhood graph.
What is the novelty?
Sparse-graph formulation
Rather than building a fully connected graph over all text lines, the model uses a $\beta$-skeleton graph [Kirkpatrick and Radke, 1985] whose edges connect only spatially neighboring text-line boxes. This keeps graph-convolution complexity linear in the number of lines, $O(n)$, and provides a natural set of edge bounding boxes for RoI pooling.
Multi-task GCN with edge RoI pooling
The GCN is an MPNN variant with weight-sharing convolutions. It has two output heads:
$$\hat{y}_{\text{node}} \in \{\text{col}, \text{row}\}$$
for reading order pattern per line, and
$$\hat{y}_{\text{edge}} \in \{\text{same-para}, \text{different-para}\}$$
for paragraph membership. Image features are extracted by a MobileNetV3-Small backbone (144K of the model’s 267K parameters) at 512 $\times$ 512 resolution and pooled over edge bounding boxes (the minimum containing box of each connected pair of text-line boxes) rather than node boxes. An ablation (Table 2 in the paper) shows that node-box RoI pooling slightly degrades accuracy relative to using no image features at all, whereas edge-box RoI pooling provides a consistent improvement. The interpretation is that the discriminative visual signals (separator lines, color boundaries) live between text boxes rather than inside them.
Spatial node features encode $x, y$ corner coordinates multiplied by rotation coefficients $\cos \alpha$ and $\sin \alpha$, making the model rotation-aware from the input representation.
Hierarchical cluster-and-sort post-processing
After the GCN outputs per-line pattern labels and per-edge paragraph membership, three deterministic algorithms convert local predictions into a global traversal order:
- Algorithm 1 (Hierarchical Clustering): Lines are grouped into paragraphs via edge predictions; paragraphs inherit a pattern via majority vote; same-pattern clusters are merged along graph edges; clusters are nested by bounding box containment; and a single top-level cluster is created from all remaining children.
- Algorithm 2 (Sorting within a cluster): Boxes are rotated to remove mean orientation, and then a topological sort applies column-wise or row-wise precedence constraints based on axis-overlap and center coordinates.
- Algorithm 3 (Pattern labeling from annotations): Used at training time to derive ground-truth binary labels from human-annotated paragraph ordering sequences by comparing the geometric relation (vertical/horizontal/unknown) between each consecutive paragraph pair.
This design separates the machine learning component (local binary classification) from the combinatorial reasoning component (sorting), keeping the model small while enabling correct handling of mixed layouts, perspective distortion, and rotated text.
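A simplified sketch of the within-cluster sorting logic (Algorithm 2), using a pairwise comparator over de-rotated boxes; the paper's actual procedure is a topological sort over precedence constraints, so this comparator version is an approximation that can mis-order non-transitive cases:

```python
from functools import cmp_to_key

def sort_cluster(boxes: list[tuple[float, float, float, float]],
                 column_wise: bool = True) -> list[int]:
    """Order (x0, y0, x1, y1) boxes inside one cluster. Column-wise: a box
    precedes another if their x-ranges overlap and it is higher on the
    page; otherwise the leftmost goes first. Row-wise swaps the axes."""
    def cmp(a, b):
        if column_wise:
            x_overlap = min(a[2], b[2]) > max(a[0], b[0])
            key = (a[1] - b[1]) if x_overlap else (a[0] - b[0])
        else:
            y_overlap = min(a[3], b[3]) > max(a[1], b[1])
            key = (a[0] - b[0]) if y_overlap else (a[1] - b[1])
        return -1 if key < 0 else (1 if key > 0 else 0)

    return sorted(range(len(boxes)),
                  key=cmp_to_key(lambda i, j: cmp(boxes[i], boxes[j])))
```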
What experiments were performed?
Datasets
- Internal annotated set: 25K training images and a few hundred test images per language, annotated with paragraph polygons and partial reading order groups. Ground truths are partial (some subsets of paragraphs form ordered groups; order among groups is undefined). Languages tested: English, French, Italian, German, Spanish, Russian, Hindi, Thai.
- PubLayNet: 340K training + 12K validation images used to supplement training; “text” regions are treated as column-wise, “table”/“figure” containers as row-wise.
OCR bounding boxes are provided by Google Cloud Vision API.
Evaluation metric
The authors use a normalized Levenshtein distance (lower is better) between the serialized predicted reading order and a ground-truth word sequence extracted from each annotated reading order group. This metric accommodates partially annotated ground truth, unlike BLEU or Kendall’s Tau which require full permutation comparisons.
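A plain dynamic-programming sketch of the metric over word lists; the paper's exact normalization may differ in detail:

```python
def normalized_levenshtein(pred: list[str], gold: list[str]) -> float:
    """Word-level edit distance divided by the longer sequence length
    (0 = identical order, 1 = maximally different)."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + (pred[i - 1] != gold[j - 1]))
    return dp[m][n] / max(m, n, 1)
```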
Baselines
- All-column-wise topological sort: Applies column-wise bidimensional sort rules globally. Chosen over all-row-wise because it scores better in practice.
- Fully-connected graph with direct edge-direction prediction: A GCN on a complete graph predicting pairwise precedence relations, similar to Li et al. (ECCV 2020). Included to demonstrate scalability problems with dense graphs on receipt-style inputs.
Ablations
- Image feature RoI source (none vs. node boxes vs. edge boxes): edge boxes are best; node boxes are no better than no image features (Table 2).
- Classification task scores on PubLayNet vs. the internal annotated set: PubLayNet scores near-perfect (F1 $\approx$ 0.997 for pattern, 0.995 for clustering), while the real-world annotated set reaches F1 0.819 / 0.902, reflecting the difficulty of in-the-wild images.
Training setup
10M steps with random rotation and scaling augmentations, with image and OCR boxes transformed together. The image is scaled so all text boxes fit in a circle of diameter 512, then cropped or padded to 512 $\times$ 512; this ensures rotation augmentation never moves boxes out of bounds.
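A geometric sketch of that fit-to-circle scaling, assuming axis-aligned (x0, y0, x1, y1) boxes; the helper name and return convention are illustrative:

```python
import numpy as np

def scale_to_circle(boxes: np.ndarray, diameter: float = 512.0):
    """Return the scale factor and center such that all text boxes fit in
    a circle of the given diameter, so rotation augmentation keeps every
    box inside the 512 x 512 canvas."""
    corners = np.concatenate([boxes[:, [0, 1]], boxes[:, [2, 3]],
                              boxes[:, [0, 3]], boxes[:, [2, 1]]])
    center = (corners.min(axis=0) + corners.max(axis=0)) / 2.0
    radius = np.max(np.linalg.norm(corners - center, axis=1))
    return (diameter / 2.0) / max(radius, 1e-9), center
```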
What are the outcomes/conclusions?
On the multi-language evaluation set (Table 3), the 2-task GCN plus cluster-and-sort algorithm reduces normalized Levenshtein distance from 0.146 (all-column baseline) to 0.098 for English, and shows consistent improvements across all seven non-English languages despite training on English only. The fully-connected graph baseline shows intermediate results; on dense inputs (e.g., long receipts) it produces clearly worse output than the sparse approach (Figure 10).
The edge-box RoI ablation (Table 2) is the most informative experiment: node-box pooling does not help and can hurt, while edge-box pooling consistently improves both tasks. This supports the authors’ hypothesis that useful visual signals (separator lines, color gradients) appear between text boxes.
The model is 267K parameters total, runs on mobile devices, and needs no text recognition (language features are omitted deliberately) to stay language-agnostic.
Limitations acknowledged by the authors:
- Complex layouts with multiple adjacent tabular sections can cause incorrect cross-section reading order because the two-task model lacks a higher-level section-parsing signal. A 3-task version trained on menu-specific section annotations partially addresses this but does not generalize reliably.
- Perspective-distorted tables (Figure 8, the menu from Figure 1) remain difficult: rotating boxes to cancel mean angle still leaves slanted individual boxes, so topological sort on axis-aligned containing boxes cannot guarantee row alignment.
- The model is not a table structure parser; for table-internal reading order, a dedicated structure recognition model will likely outperform it.
- Training data is English-only; while Latin-script generalization is strong, Cyrillic, Devanagari, and Thai still trail Latin languages in absolute score.
Reproducibility
Models
- MPNN-variant GCN: 10 weight-sharing graph convolution steps; node feature dimension 32; message-passing hidden dimension 128; edge-to-node pooling uses 4-head attention with 3 hidden layers of size 16 and dropout 0.5.
- MobileNetV3-Small image backbone: input resized so all text fits in a circle of diameter 512; output pooled via bilinear interpolation over $16 \times 3$ points per edge box; dropout 0.5.
- Total parameters: 267K (144K from MobileNetV3-Small, 123K from the GCN).
- No weights are released.
Algorithms
- Optimizer and learning rate schedule: not reported.
- Training steps: 10M.
- Augmentations: random rotation and scaling with synchronized image and box transforms.
- TF-GNN used as the GCN implementation framework.
- No code repository released.
Data
- Internal annotated dataset: 25K English training images with paragraph polygon and partial reading order annotations; a few hundred test images per language. Not publicly released.
- PubLayNet (340K train / 12K val): publicly available. Used to supplement training.
- OCR input: Google Cloud Vision API (not publicly reproducible at the same quality).
- No contamination analysis reported.
Evaluation
- Metric: normalized Levenshtein distance. Exact word-level formulation described in Section 4.1.
- Baselines use the same OCR inputs from Google Cloud Vision API, so comparisons are fair within the paper but may not transfer to other OCR systems.
- No error bars, significance tests, or multi-run statistics reported.
- The PubLayNet reading order ground truth is constructed heuristically (text $\rightarrow$ column-wise, table/figure $\rightarrow$ row-wise), not from human annotation of actual reading order.
Hardware
- No training hardware or compute budget reported.
- Inference is described as fast enough for mobile deployment, but no latency benchmarks or VRAM figures are provided.
BibTeX
@inproceedings{wang2023sparse,
title={Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation},
author={Wang, Renshen and Fujii, Yasuhisa and Bissacco, Alessandro},
booktitle={Document Analysis and Recognition -- ICDAR 2023},
year={2023},
organization={Springer}
}
UniHDSA: Unified Relation Prediction for Hierarchical Document Structure Analysis
TL;DR
Wang et al. extend their earlier Detect-Order-Construct work by proposing UniHDSA, a unified relation prediction framework for hierarchical document structure analysis. Rather than running separate modules for detection, reading order, TOC extraction, and reconstruction, UniHDSA merges all sub-tasks at both the page-level and the document-level into a single label space handled by a shared relation prediction head. A multimodal Transformer system (vision backbone + BERT) built on this framework reports the best published results across all sub-tasks on the Comp-HRDoc benchmark and competitive results on DocLayNet.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper’s central contribution is a new architectural principle (unified label space) and the multimodal Transformer system built on it. Most of the paper describes the page-level module, the document-level module, the type-wise query design, the loss functions, and extensive ablations verifying each design choice. This is a journal extension of DLAFormer (Wang et al., ICDAR 2024), which introduced unified relation prediction for page-level tasks only. UniHDSA extends that to document-level tasks and adds a language model for multimodal fusion.
What is the motivation?
Hierarchical Document Structure Analysis (HDSA) encompasses multiple interdependent sub-tasks: page object detection, text region detection, logical role classification, reading order prediction, table-of-contents extraction, hierarchical list extraction, and cross-page grouping. Prior methods fell into one of two camps:
- Single-task models: Effective in isolation but unable to exploit cross-task interactions.
- Multi-branch unified systems (e.g., Detect-Order-Construct): Each branch handles a distinct sub-task, sharing feature maps but not operating on a common representation. Sequential execution introduces cascading errors: an error in the detection stage propagates to the reading order stage, which then propagates into TOC extraction.
The authors also note that multi-branch designs are hard to extend. Adding a new sub-task requires engineering a new branch and possibly a new decoding algorithm.
UniHDSA addresses these problems by framing every sub-task at both the page-level and the document-level as a relation prediction problem and collapsing all relation labels into a single label space. A single prediction head then handles all sub-tasks concurrently, reducing the potential for cascading errors and making the system easier to extend.
What is the novelty?
Unified label space
The core idea is to define a label matrix $M \in \mathbb{Z}^{H \times W}$, where rows correspond to source queries and columns to target queries. Each cell can take one of a small number of values encoding the relation type (or no relation). At the page-level, three relation types are defined:
- Intra-region: reading order between text-lines within a text region, or a self-referential marker for single-line regions.
- Inter-region: reading order between adjacent text regions, or semantic association between a text region and a graphical object (e.g., a caption pointing to a table).
- Logical role: assignment of each text-line or graphical object to a predefined logical role class (title, section heading, paragraph, caption, etc.).
At the document-level, two relation types are defined:
- Intra-region: cross-page grouping (e.g., sub-tables on consecutive pages belonging to the same logical table, or paragraphs split across page boundaries).
- Inter-region: hierarchical relationships between section headings or list items, supporting TOC and list extraction.
This encoding means the same relation prediction head, with a relation prediction module and a relation classification module, handles all tasks at both levels.
Relation prediction head
Given a set of queries $\{q_1, \ldots, q_H\}$ (text-line and graphical object queries) and logical role queries $\{q_{H+1}, \ldots, q_W\}$, the relation prediction module computes:
$$ f_{ij} = FC^r_q(q_i) \circ FC^r_k(q_j), \quad i \le H,\ j \le W $$
$$ s_{ij} = \begin{cases} \dfrac{\exp(f_{ij})}{\sum_{j'=1}^{H} \exp(f_{ij'})}, & j \le H \\[6pt] \dfrac{\exp(f_{ij})}{\sum_{j'=H+1}^{W} \exp(f_{ij'})}, & H < j \le W \end{cases} $$
The separate softmax normalizations enforce single-successor constraints for intra-region and inter-region relations independently of the logical role relations. The relation classification module then uses a bilinear classifier to predict the relation type for pairs with a predicted relation:
$$ p_{ij} = \text{BiLinear}(FC^c_q(q_i), FC^c_k(q_j)), \quad c_{ij} = \text{argmax}(p_{ij}) $$
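A PyTorch sketch of this head, with the two softmax normalizations and the bilinear type classifier; module names, dimensions, and the single-projection layout are assumptions, not the release configuration:

```python
import torch
import torch.nn as nn

class UnifiedRelationHead(nn.Module):
    """Pair logits from two FC projections, softmax-normalized separately
    over entity targets (j <= H) and logical-role targets (H < j <= W),
    plus a bilinear relation-type classifier."""
    def __init__(self, d: int, n_rel_types: int):
        super().__init__()
        self.fc_q_r, self.fc_k_r = nn.Linear(d, d), nn.Linear(d, d)
        self.fc_q_c, self.fc_k_c = nn.Linear(d, d), nn.Linear(d, d)
        self.bilinear = nn.Bilinear(d, d, n_rel_types)

    def forward(self, queries: torch.Tensor, H: int):
        W = queries.size(0)  # H entity queries followed by W - H role queries
        f = self.fc_q_r(queries[:H]) @ self.fc_k_r(queries).T         # (H, W)
        s = torch.cat([f[:, :H].softmax(dim=-1),   # single successor: entities
                       f[:, H:].softmax(dim=-1)], dim=-1)  # and roles
        qc = self.fc_q_c(queries[:H]).unsqueeze(1).expand(H, W, -1)
        kc = self.fc_k_c(queries).unsqueeze(0).expand(H, W, -1)
        p = self.bilinear(qc.contiguous(), kc.contiguous())  # (H, W, types)
        return s, p.argmax(dim=-1)  # relation scores s_ij, classes c_ij
```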
Type-wise query initialization
The page-level module is built on a Deformable DETR-like encoder-decoder. The authors introduce a type-wise query selection strategy: instead of using a binary foreground classifier to select the top-$K$ encoder features as object proposals, they use a multi-class classifier. The predicted category is then used to initialize the content query from a set of learnable per-category embeddings. This gives the decoder a semantically grounded starting point rather than a generic learnable token. Text-line queries are initialized directly from BERT embeddings, while predefined logical role queries use learnable embeddings with auxiliary supervision from the union boxes of all queries in that role class.
Two-stage efficiency for long documents
The framework divides the problem into page-level analysis (run per page) and document-level analysis (run once over the entire document, taking only the relevant entity types as input). The document-level Transformer encoder uses RoPE to encode the predicted reading order sequence. Because only section headings, list items, tables, and boundary paragraphs are passed to the document-level module, the quadratic cost of full cross-page attention is avoided. Training uses a random sampling strategy (6-8 consecutive pages per sample), which also introduces diversity. The authors report that models trained on 6-8 page samples produced promising TOC extraction results on a 44-page document in their qualitative evaluation.
Loss functions
The overall objective combines detection and relation prediction losses:
$$ L_{\text{overall}} = \lambda_{\text{graphical}} \cdot (L_{\text{graphical}}^{\text{enc}} + L_{\text{graphical}}^{\text{dec}}) + \lambda_{\text{relation}}^{\text{page}} \cdot L_{\text{relation}}^{\text{page}} + \lambda_{\text{relation}}^{\text{doc}} \cdot L_{\text{relation}}^{\text{doc}} $$
All three $\lambda$ weights are set to 1.0. The graphical detection losses follow Deformable DETR: focal loss for classification and a linear combination of $L_1$ and GIoU for box regression. Relation losses are softmax cross-entropy over both the relation prediction scores and the relation classification logits:
$$ L_{\text{rel}} = \frac{1}{N} \sum_{l} \sum_{i} L_{\text{CE}}(s_i, s_i^{\ast}) $$
$$ L_{\text{rel.cls}} = \frac{1}{N} \sum_{l} \sum_{i} L_{\text{CE}}(c_i, c_i^{\ast}) $$
where $l$ indexes decoder layers (auxiliary loss at each layer) and $N$ is the number of queries.
What experiments were performed?
Datasets
- Comp-HRDoc: 1,000 training and 500 test documents drawn from the HRDoc-Hard dataset. Four sub-tasks evaluated: page object detection (segmentation mAP), reading order prediction (REDS), table-of-contents extraction (Semantic-TEDS), and hierarchical structure reconstruction (Semantic-TEDS). OCR files provide pre-filtered text-line bounding boxes.
- DocLayNet: 69,375 training pages and 6,489 test pages, 11 object classes, diverse document types. Evaluated with COCO-style box mAP at IoU $\{0.50, 0.55, \ldots, 0.95\}$ and mean F1 score at the same thresholds.
Baselines
On Comp-HRDoc: Mask2Former (detection only), Lorenzo et al. (reading order), MTD (TOC), DSPS Encoder (reconstruction with ground-truth reading order and bounding boxes provided), and the prior best-reported end-to-end system, Detect-Order-Construct (DOC).
On DocLayNet: Faster R-CNN, Mask R-CNN, YOLOv5, SwinDocSegmenter, DINO, VSR, DOC, DLAFormer (a vision-only predecessor from the same group), and M2Doc.
Ablations
- Attention mask ablation: disabling interaction between text-line and graphical object queries in the page-level decoder to test the value of unified modeling.
- Larger sampling window (14-16 pages vs. 6-8 pages) for document-level tasks.
- Semantic feature ablation: comparing vision-only variants with the late-fusion multimodal variant across different backbone strengths (DOC-R18, DOC-R50, DLAFormer-R18, UniHDSA-R18, UniHDSA-R50).
- Backbone comparison: ResNet-18, ResNet-50, ResNet-101, Swin-Tiny, InternImage-Tiny, InternImage-Small.
What are the outcomes/conclusions?
Comp-HRDoc end-to-end results
UniHDSA outperforms DOC (the prior best end-to-end system) across all sub-tasks:
| Sub-task | Metric | DOC-R18 | UniHDSA-R18 | UniHDSA-R50 |
|---|---|---|---|---|
| Page object detection | Seg. mAP | 88.1 | 90.9 | 91.2 |
| Reading order (text) | REDS | 93.2 | 96.4 | 96.7 |
| Reading order (graphical) | REDS | 86.4 | 90.6 | 91.0 |
| TOC extraction | Micro-STEDS | 86.1 | 87.9 | 88.3 |
| TOC extraction | Macro-STEDS | 87.9 | 89.5 | 88.8 |
| Hierarchical reconstruction | Micro-STEDS | 83.7 | 88.0 | 88.9 |
| Hierarchical reconstruction | Macro-STEDS | 83.7 | 87.8 | 88.6 |
Reading order shows the largest gains: +3.2 points for text regions and +4.2 points for graphical regions versus DOC, despite DOC using a two-stage reading order prediction requiring additional parameters.
DocLayNet page object detection
UniHDSA-R50 achieves 87.0% mAP and 87.9% mean F1 score. This is competitive with M2Doc (87.3% mAP, 87.5% F1 after correcting for unreasonable post-processing). The authors note that M2Doc’s early-fusion strategy allows one bounding box to receive multiple category predictions, which inflates AP but is problematic for downstream tasks such as markdown reconstruction. After correcting M2Doc’s post-processing to enforce a single category per box, UniHDSA remains competitive. UniHDSA particularly outperforms at the strictest IoU threshold (0.95), suggesting the hybrid approach produces tighter bounding box predictions.
Ablation findings
- Unified modeling of text and graphical elements: Masking cross-attention between text-line and graphical object queries drops formula detection by 26.6 mAP points and table detection by 2.0 points, confirming that text-lines within graphical objects provide critical signal for top-down detection.
- Sampling window size: Training on 14-16 pages versus 6-8 pages yields no significant improvement on the test set (which contains full documents of roughly 20 pages). The smaller window provides sufficient context while improving training efficiency.
- Semantic features: Late fusion boosts semantically ambiguous categories (author, mail, affiliation, caption) substantially. However, for visually distinctive but semantically confusable categories (header vs. title), semantic features can hurt slightly. Gains from semantic features diminish as visual backbone quality increases.
- Backbone scaling: ResNet-50 delivers the best overall balance on Comp-HRDoc. InternImage-Small, despite using a smaller sampling window due to GPU memory constraints, achieves 91.7% page object detection mAP, suggesting further gains are possible with sufficient compute.
Relation prediction performance
Page-level intra-region and logical role relations exceed 97% macro F1, while inter-region relations exceed 93%. Document-level inter-region relations also exceed 93%, validating TOC and hierarchical list extraction. Document-level intra-region relations (cross-page paragraph grouping) achieve around 87% F1, which the authors attribute to limited contextual connections across pages and content ambiguity in this task.
Limitations
- No suitable public datasets exist to evaluate cross-page table grouping and hierarchical list extraction in isolation. Without held-out test splits for these specific tasks, it is difficult to characterize model performance on them independently.
- The vision and language backbones are pretrained separately, without joint visual-language pretraining. Features may not be optimally aligned for cross-modal reasoning.
- The quadratic relation matrix may create scalability challenges for pages with very large numbers of text-lines or graphical regions.
- The authors acknowledge a broader gap: there is no comprehensive benchmark and agreed-upon evaluation metrics for end-to-end document digitization, making fair comparison between specialist models and general multimodal large language models difficult.
Reproducibility
Models
- Vision backbone: ResNet-18 or ResNet-50 (pretrained on ImageNet). Swin-Tiny, ResNet-101, InternImage-Tiny, and InternImage-Small also tested. UniHDSA-R18 and UniHDSA-R50 total approximately 150M and 162M parameters, respectively.
- Language model: BERT-Base (for Comp-HRDoc) or LayoutXLM (for multilingual settings, mentioned but not primary in the experiments).
- Page-level Transformer: 3-layer Deformable Transformer encoder and decoder; 8 attention heads; 256 hidden dim; 1024 FFN dim. Top-$K$ graphical object proposals: 50 (Comp-HRDoc), 100 (DocLayNet).
- Document-level Transformer: 3-layer encoder; 8 heads; 256 hidden dim; 2048 FFN dim; RoPE positional encoding.
- No pretrained weights for the full trained system are reported as publicly released. Configurations and evaluation code are available at https://github.com/microsoft/CompHRDoc/tree/main/UniHDSA.
Algorithms
- Framework: PyTorch v1.11.
- Optimizer: AdamW; betas (0.9, 0.999); epsilon 1e-8.
- Comp-HRDoc: mini-batch 1 (one sampled sub-document), 40 epochs. CNN backbone: lr 4e-5, weight decay 1e-2. BERT: lr 8e-5, weight decay 1e-2. All other parameters: lr 4e-4, weight decay 1e-2.
- DocLayNet: mini-batch 4, 24 epochs. Same lr settings. Logit-adjusted softmax cross-entropy loss used to handle the long-tail distribution of text-lines inside graphical objects.
- Multi-scale training: shorter side randomly chosen from $\{320, 416, 512, 608, 704, 800\}$; longer side capped at 1024. Test: shorter side 512.
- Document sampling: 6-8 consecutive pages sampled per training example.
- Text-line grouping in DocLayNet uses a heuristic merging step (grouping separated words/characters by horizontal and vertical distance relative to average character height) before feeding into the model.
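A rough reconstruction of that grouping heuristic; the thresholds `h_gap` and `v_tol` are illustrative assumptions, since the paper does not publish exact values:

```python
def group_words_into_lines(words, h_gap: float = 1.0, v_tol: float = 0.5):
    """Greedily merge word boxes (x0, y0, x1, y1) into text-lines when the
    horizontal gap and vertical offset are small relative to the average
    character height."""
    avg_h = sum(w[3] - w[1] for w in words) / max(len(words), 1)
    lines: list[list[tuple]] = []
    for w in sorted(words, key=lambda b: (b[1], b[0])):
        for line in lines:
            last = line[-1]
            same_row = abs(w[1] - last[1]) < v_tol * avg_h
            close = 0 <= w[0] - last[2] < h_gap * avg_h
            if same_row and close:
                line.append(w)
                break
        else:
            lines.append([w])
    return lines
```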
Data
- Comp-HRDoc: 1,000 train / 500 test documents (HRDoc-Hard extension). Available at https://github.com/microsoft/CompHRDoc under the MIT license. Provides pre-filtered OCR with text-lines inside graphical objects removed.
- DocLayNet: 69,375 train / 6,489 test pages, released by IBM under CC-BY-4.0. OCR files have many fragmented text-lines requiring the heuristic grouping step described above.
- Annotation for document-level tasks (TOC hierarchy, list hierarchy) is derived from the structural metadata of the source documents (academic PDFs and technical documents).
Evaluation
- Page object detection: COCO-style segmentation mAP (Comp-HRDoc) and box mAP + mean F1 at IoU 0.50:0.05:0.95 (DocLayNet).
- Reading order: REDS (Comp-HRDoc), evaluated separately for text regions and graphical regions.
- TOC extraction and hierarchical reconstruction: Semantic-TEDS (Micro and Macro), which measures tree-edit distance weighted by semantic content.
- No error bars or significance tests are reported for the main comparison tables. The authors do note average performance standard deviations for UniHDSA-R18 (0.16) and UniHDSA-R50 (0.23) on Comp-HRDoc without specifying the number of runs.
- M2Doc comparisons required replicating M2Doc with corrected post-processing (one box, one category). The corrected numbers differ from the M2Doc paper’s reported figures.
Hardware
- 16 Nvidia Tesla V100 GPUs (32 GB each) for all experiments.
- Training time and inference latency are not reported.
BibTeX
@article{wang2025unihdsa,
title={UniHDSA: A Unified Relation Prediction Approach for Hierarchical Document Structure Analysis},
author={Jiawei Wang and Kai Hu and Qiang Huo},
journal={Pattern Recognition},
volume={165},
pages={111617},
year={2025},
doi={10.1016/j.patcog.2025.111617}
}
XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism
TL;DR
XY-Cut++ extends the classic XY-Cut projection-based algorithm with three complementary components: pre-mask processing to remove highly dynamic layout elements before sorting, multi-granularity segmentation using density-driven adaptive splitting to handle complex multi-column layouts, and cross-modal matching that reinstates masked elements into the ordered sequence using semantic priority and geometric distance metrics. The authors also introduce DocBench-100, a 100-page block-level benchmark with annotations across complex and regular layout subsets. On DocBench-100, XY-Cut++ achieves 98.8 BLEU-4 overall while running at 514 FPS on CPU (ordering module only), outperforming both classic projection methods and learning-based baselines in accuracy.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The core contribution is the XY-Cut++ algorithmic framework. The paper is structured around ablations of each component, comparisons against baselines, and analysis of design choices such as the density threshold and geometric distance weights. The algorithm and its performance are the central focus.
Secondary: $\Psi_{\text{Resource}}$: The paper introduces DocBench-100, a focused 100-page benchmark with block-level reading order annotations, emphasizing multi-column and complex page topologies that existing datasets underrepresent.
What is the motivation?
Document reading order recovery is a prerequisite for downstream applications such as RAG pipelines and LLM preprocessing: if blocks are ordered incorrectly, text chunking and context windows will contain semantically incoherent sequences. Three specific challenges motivate this work:
Complex layouts. Multi-column newspapers, nested text boxes, non-rectangular text regions, and cross-page content routinely defeat the classic XY-Cut algorithm, which relies on clean projection-cut separability. The algorithm’s rigid threshold mechanism causes hierarchical errors when L-shaped inputs span multiple columns.
Cross-modal alignment overhead. Learning-based methods such as LayoutReader are capable of handling complex layouts but carry prohibitive latency (around 22 FPS averaged across benchmarks) that makes them impractical in production document parsing pipelines.
Lack of block-level benchmarks. Existing datasets such as ReadingBank annotate at the word level, making it difficult to evaluate block-level structural reasoning directly. OmniDocBench provides block-level support but has sparse coverage of domain-specific complex layouts such as newspapers and multi-column technical documents.
The authors identify that many hard cases for XY-Cut stem from L-shaped inputs: regions that span multiple columns create false projection gaps and cause the recursive split to propagate errors into all subsequent ordering decisions. Their key observation is that temporarily masking such elements before the projection stage and then reinserting them afterwards is both highly effective and computationally cheap.
What is the novelty?
XY-Cut++ is a heuristic, geometry-aware algorithm with shallow semantic augmentation. It requires no neural training and operates entirely on the bounding box coordinates and coarse semantic labels produced by any layout detection model (the paper uses PP-DocLayout). The pipeline has four stages: layout detection, pre-mask processing, multi-granularity segmentation, and cross-modal matching.
Pre-Mask Processing
Elements with high positional flexibility (titles, figures, tables) are identified via semantic labels and temporarily excluded from the subsequent XY-Cut passes. This directly addresses the L-shape problem: once these spanning elements are masked, the remaining text blocks form clean separable regions for projection-based splitting.
Multi-Granularity Segmentation
Segmentation proceeds through three phases:
Phase 1: Cross-Layout Detection. An adaptive threshold $T_l$ is computed from the median bounding-box width across all content blocks $\{B_i\}$:
$$T_l = \beta \cdot \text{median}(\{l_i\}_{i=1}^N)$$
with $\beta = 1.3$ as an empirically tuned scaling factor. A block $B_i$ is classified as a cross-layout element if its width exceeds $T_l$ and its horizontal projection overlaps with at least two other blocks:
$$C_{\text{cross}}(B_i) = \begin{cases} 1 & l_i > T_l \land \sum_{j \ne i} I_{\text{overlap}}(B_i, B_j) \ge 2 \\ 0 & \text{otherwise} \end{cases}$$
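A minimal sketch of the Phase 1 classifier under the definitions above, assuming axis-aligned `(x0, y0, x1, y1)` boxes from the upstream detector; `horizontal_overlap` is a hypothetical stand-in for $I_{\text{overlap}}$.

```python
from statistics import median

BETA = 1.3  # scaling factor from Eq. 1

def horizontal_overlap(a, b):
    """1 if the horizontal projections of boxes a and b intersect, else 0."""
    return int(min(a[2], b[2]) > max(a[0], b[0]))

def cross_layout_flags(boxes):
    """Flag blocks wider than the adaptive threshold T_l that overlap >= 2 others."""
    widths = [x1 - x0 for x0, _, x1, _ in boxes]
    t_l = BETA * median(widths)
    flags = []
    for i, b in enumerate(boxes):
        overlaps = sum(horizontal_overlap(b, o) for j, o in enumerate(boxes) if j != i)
        flags.append(widths[i] > t_l and overlaps >= 2)
    return flags
```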
Phase 2: Geometric Pre-Segmentation. Central title elements and isolated graphical components are identified using a centrality and isolation criterion:
$$P(B_i) = \mathbb{I}\left(\frac{\|C_i - C_{\text{page}}\|_2}{d_{\text{page}}} \le 0.2 \;\wedge\; \phi_{\text{text}}(B_i) = \infty\right)$$
where $\phi_{\text{text}}(B_i) = \infty$ indicates no adjacent text block. These elements are used to partition the page into coarse sub-regions $R$ for further refinement.
Phase 3: Density-Driven Refinement. Each sub-region $R$ is processed with an adaptive XY-Cut/YX-Cut strategy. The density ratio $\tau_d$ measures the proportion of cross-layout to single-layout element area within $R$:
$$\tau_d = \frac{\sum_{k=1}^{N_c} w_k^{(C_c)} h_k^{(C_c)}}{\sum_{k=1}^{N_s} w_k^{(C_s)} h_k^{(C_s)}}$$
The splitting strategy $S(R)$ is determined by comparing $\tau_d$ against a threshold $\theta_v = 0.9$:
$$S(R) = \begin{cases} \text{XY-Cut} & \tau_d > \theta_v \\ \text{YX-Cut} & \text{otherwise} \end{cases}$$
XY-Cut prioritizes horizontal splits (finding gaps in the Y-axis projection), while YX-Cut prioritizes vertical splits. Recursion terminates when each sub-region contains exactly one unmasked block.
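A sketch of the Phase 3 strategy selection under the $\tau_d$ definition above; the guard against an empty single-layout set is our own addition, not from the paper.

```python
THETA_V = 0.9  # density threshold from Eq. 5

def box_area(b):
    x0, y0, x1, y1 = b
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def choose_split(cross_boxes, single_boxes):
    """Pick XY-Cut when cross-layout area dominates the sub-region, YX-Cut otherwise."""
    cross_area = sum(box_area(b) for b in cross_boxes)
    single_area = sum(box_area(b) for b in single_boxes) or 1e-9  # guard: empty set
    tau_d = cross_area / single_area
    return "XY-Cut" if tau_d > THETA_V else "YX-Cut"
```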
Cross-Modal Matching
After multi-granularity segmentation produces an ordered sequence of atomic regions, the masked elements are restored. Matching is driven by a label priority ordering:
$$L_{\text{order}}: L_{\text{cross-layout}} \succ L_{\text{title}} \succ L_{\text{vision}} \succ L_{\text{others}}$$
Each masked element $B_p$ is assigned an anchor position $B_{\text{best}}$ in the ordered sequence by minimizing a joint geometric distance:
$$D(B_p, B_o, l) = \sum_{k=1}^{4} w_k \cdot \phi_k(B_p, B_o)$$
The four geometric constraints $\phi_1$–$\phi_4$ encode intersection filtering, boundary proximity, vertical continuity, and horizontal ordering respectively. Weights are scaled to avoid interference between constraints:
$$w_k = [\max(h, w)^2,\ \max(h, w),\ 1,\ \max(h, w)^{-1}]$$
and further tuned per semantic label class via a grid search on 2,800 documents.
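The matching step can be read as a greedy anchor search; in the sketch below `constraints` stands in for the four $\phi_k$ functions and `weights_for` for the per-label weight lookup, both hypothetical placeholders, and inserting a masked element directly after its anchor is our simplifying assumption.

```python
def reinsert_masked(ordered, masked, constraints, weights_for):
    """Assign each masked block B_p the anchor B_o minimizing the weighted
    sum of geometric constraints, then splice it in after that anchor.
    `masked` is assumed pre-sorted by the label priority L_order."""
    for label, b_p in masked:
        w = weights_for(label)  # per-label weight vector w_k
        best = min(range(len(ordered)),
                   key=lambda i: sum(wk * phi(b_p, ordered[i])
                                     for wk, phi in zip(w, constraints)))
        ordered.insert(best + 1, b_p)
    return ordered
```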
What experiments were performed?
Dataset
Experiments use two benchmarks:
- DocBench-100: The newly introduced dataset with 30 complex ($D_c$) and 70 regular ($D_r$) pages. Complex pages are predominantly three-or-more column layouts (90% of $D_c$); regular pages are dominated by single and double-column formats typical of academic and business documents. Both end-to-end (image input, full detection required) and oracle JSON-based (detection boxes provided) evaluation protocols are supported.
- OmniDocBench: An existing block-level benchmark used to assess cross-dataset transfer, covering single, double, three-column, and complex layout categories.
Baselines
- XY-Cut (Ha et al., 1995): Classic projection-based method, geometry only.
- LayoutReader (Wang et al., 2021): LayoutLMv3 fine-tuned on 500k samples; uses rich text and visual features.
- MinerU (Wang et al., 2024): End-to-end document extraction pipeline.
Metrics
- BLEU-4: Applied to sequences of block identifiers (not text tokens), using standard n-gram precision with brevity penalty (a small sketch of this block-ID variant follows the metric list):
$$\text{BLEU-4}(\hat{s}, s) = BP \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4} \log p_n\right)$$
- ARD (Absolute Relative Difference): Measures prediction accuracy for block positions; lower is better.
- Kendall’s $\tau$: Rank correlation between predicted and reference orderings; higher is better.
- FPS: Ordering module throughput only; upstream detection and downstream OCR/LM are excluded. Measured over 10 runs on an Intel Xeon Gold 6326 CPU at 2.90 GHz with 256 GB RAM.
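The block-identifier BLEU-4 variant referenced above can be computed with NLTK by treating block IDs as opaque tokens; that NLTK's brevity-penalty convention exactly matches the paper's implementation is an assumption.

```python
from nltk.translate.bleu_score import sentence_bleu

def block_bleu4(pred_ids, gold_ids):
    """BLEU-4 over block-identifier sequences rather than text tokens.
    IDs (e.g. ints) act as opaque tokens; uniform 1/4 n-gram weights."""
    return sentence_bleu([gold_ids], pred_ids, weights=(0.25, 0.25, 0.25, 0.25))

# e.g. block_bleu4([0, 1, 2, 3, 5, 4], [0, 1, 2, 3, 4, 5])
```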
Ablations
Table 2 in the paper presents progressive component ablations, adding each module incrementally:
| Configuration | $D_c$ BLEU-4 | $D_r$ BLEU-4 | $\mu$ BLEU-4 |
|---|---|---|---|
| XY-Cut baseline | 0.749 | 0.819 | 0.797 |
| +Pre-Mask | 0.818 | 0.823 | 0.822 |
| +MGS | 0.946 | 0.969 | 0.962 |
| +CMM (full XY-Cut++) | 0.986 | 0.989 | 0.988 |
A detailed component ablation (Table 3) isolates individual design choices within each module, including cross-layout masking, pre-cut, adaptive scheme, and the four geometric constraint terms.
What are the outcomes/conclusions?
XY-Cut++ achieves 98.8 mean BLEU-4 on DocBench-100, compared to 79.7 for the XY-Cut baseline, 78.8 for LayoutReader, and 87.3 for MinerU. The improvement is more pronounced on complex layouts: BLEU-4 on $D_c$ rises from 0.749 (baseline XY-Cut) to 0.986, an absolute gain of 23.7 points. ARD across all pages falls from 0.139 to 0.009, a reduction of 93.5%.
On OmniDocBench, results suggest XY-Cut++ achieves the highest mean scores across layout categories (0.953 BLEU-4 versus 0.926 for MinerU), though MinerU has a slight edge on the “Complex ARD” sub-metric (0.050 versus 0.064).
In terms of throughput, the ordering module runs at 514 FPS on average across DocBench-100 and OmniDocBench, modestly faster than the XY-Cut baseline (487 FPS). The authors attribute this to the semantic filtering step in cross-modal matching, which processes each block once rather than repeatedly partitioning subsets through recursive descent. LayoutReader and MinerU run at 22 and 11 FPS respectively.
Limitations
The authors identify several open issues:
- Label granularity: Fine-grained segmentation (caption linkage, ambiguous block boundaries) depends on the quality and category count of the upstream detector’s semantic labels. On OmniDocBench, which has fewer label categories, some fine-grained failures persist.
- Sub-page nested structures: Pages that contain embedded sub-documents or complex nested regions (illustrated in Figure A8) remain a failure mode, as the current framework lacks hierarchical sub-page reasoning.
- Heuristic hyperparameters: The scaling factor $\beta$ (Eq. 1) and density threshold $\theta_v$ (Eq. 5) were tuned on the available data and may need domain-specific adjustment for out-of-distribution documents.
- Benchmark scale: DocBench-100 contains only 100 annotated pages, which limits statistical power and may not represent the full distribution of real-world document types.
Future work directions the authors identify include sub-page detection with hierarchical reasoning, learning the split policy and edge weights from weak supervision, and coupling lightweight language priors with end-to-end RAG/LLM evaluation.
Reproducibility
Models
XY-Cut++ is a heuristic algorithm, not a learned model. There are no trainable parameters and no model weights to release. The algorithm depends on an upstream layout detector to provide bounding boxes and coarse semantic labels. Experiments use PP-DocLayout (Sun et al., 2025) for this purpose.
Algorithms
The full pipeline is described in Algorithm 1 of the paper. Key hyperparameters are:
- Cross-layout scaling factor: $\beta = 1.3$
- Density threshold for split axis selection: $\theta_v = 0.9$
- Overlap projection threshold for intersection constraint: $\tau_{\text{overlap}} = 0.3$
- Centrality threshold for Phase 2 pre-segmentation: 0.2 (fraction of page diagonal)
- Edge weights $w_{\text{edge}}$ for semantic-specific matching: four-class lookup table tuned via grid search on 2,800 documents
No training procedure, optimizer, or learning rate schedule is involved. The grid search over semantic-specific edge weights is not described in enough detail to fully reproduce without the 2,800-document corpus.
Data
- DocBench-100: 100 pages (30 complex, 70 regular). Pages are sourced from PP-DocLayout public document detection corpora and MinerU extraction outputs. Annotation was performed in two stages: automatic pre-annotation by MinerU followed by manual verification and screening. Pages with inherently ambiguous global reading order are excluded. The authors state they will release the evaluation script and dataset documentation, but at the time of writing the license is not specified in the preprint.
- OmniDocBench: Used for additional evaluation only; not introduced by this paper.
- Annotation format: Each page provides an image, an input JSON (bounding boxes, labels, no reading order index), and a ground truth JSON (same fields plus block-level reading order index).
Evaluation
- BLEU-4 is applied to sequences of block identifiers, not text tokens. The brevity penalty is computed using block sequence lengths.
- ARD and Kendall’s $\tau$ are computed over the same block-level ordering.
- FPS is measured on the ordering module only, explicitly excluding upstream detection and downstream OCR. Results are averaged over 10 runs on an Intel Xeon Gold 6326 @ 2.90 GHz with 256 GB RAM. The paper reports both DocBench-100 and OmniDocBench FPS values separately before averaging.
- The Wilcoxon signed-rank test is reported for the Kendall’s $\tau$ comparison ($p < 0.001$), but no other statistical significance tests or error bars are provided.
- No code repository is linked in the preprint. The benchmark and evaluation script availability is stated as future release.
Hardware
- Evaluation hardware: Intel Xeon Gold 6326 CPU @ 2.90 GHz, 256 GB RAM. No GPU is used for the ordering module.
- No GPU training: XY-Cut++ is a heuristic algorithm; GPU hardware is not required.
- Grid search: Conducted on 2,800 documents to tune edge weights; no compute cost is reported for this step.
- Deployment: The algorithm is CPU-only and lightweight enough for production-grade deployment alongside any detection model that provides bounding boxes and coarse labels.
BibTeX
@article{liu2025xycutplusplus,
title={XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark},
author={Liu, Shuai and Li, Youmeng and Wei, Jizeng},
journal={arXiv preprint arXiv:2504.10258},
year={2025}
}
Survey: Reading Order, Table of Contents, and Structure Extraction
TL;DR
Giovannini and Marinai present a survey that jointly covers reading order prediction, table of contents (ToC) detection and generation, and hierarchical document structure extraction. The survey organizes roughly 45 papers into a two-level taxonomy of tasks and methods, and identifies key open problems including cross-domain generalization, multimodal integration, and the lack of a standard, task-spanning benchmark.
What kind of paper is this?
Dominant: $\Psi_{\text{Systematic}}$ (survey / systematization of knowledge). The paper’s entire contribution is a unified taxonomy of tasks and methods across three related but previously siloed research areas, with no new model, dataset, or empirical evaluation of its own.
Secondary: none.
What is the motivation?
Information extraction from structured documents depends on understanding their internal organization. Reading order, table of contents structure, and hierarchical layout are three facets of the same underlying problem: where is each piece of content, and how does it relate to everything else? Yet the literature on these three topics has developed largely in isolation.
The authors observe that existing surveys either cover document layout analysis broadly or focus on specific document types such as historical printed books. No prior work unifies reading order prediction, ToC analysis, and structure extraction under a common conceptual framework. The recent proliferation of transformer-based multimodal architectures across all three areas makes this a suitable moment to draw those connections explicitly.
What is the novelty?
The core contribution is a two-level taxonomy: one level for tasks (what is being predicted) and one for methods (how it is predicted).
Task taxonomy
The survey organizes the problem space into three top-level tasks, each subdivided into distinct problem formulations:
Reading Order splits into three variants:
- Guided Document Understanding: Reading order is not the end goal; it is computed as an auxiliary representation to improve downstream tasks like named entity recognition. Methods include position-embedding injection (Layout2Pos), graph-based token path prediction (TPP), and reading-order pretraining objectives (ERNIE-Layout, XYLayoutLM).
- Global Order Prediction: The system predicts a single, total ordering of all document elements. Methods include pointer-network decoders (LayoutReader), graph-based approaches (Gao et al., MLARP), and attention-based architectures.
- Local Order Prediction: The system predicts binary precedence relations between element pairs, yielding a partial order graph. This formulation is appropriate when a single canonical sequence does not exist, especially in visually complex documents.
Table of Contents splits into:
- ToC Detection: Identify and parse an existing ToC page, then link its entries to corresponding document sections. Problems include locating the ToC pages, extracting hierarchical entry structure, and matching entries to body text.
- ToC Generation: Automatically construct a ToC by identifying headings and organizing them into a hierarchy, even when no printed ToC exists. Domain-specific formulations include ESG reports and financial documents.
Structure Extraction splits into three levels of scope:
- Text Region Classification: Assign semantic roles (title, abstract, body, heading) to individual text fragments. Relies on domain-specific label sets and does not generalize across document types without re-specification.
- Text-only Tree Generation: Build a hierarchical tree of textual elements (sections, subsections, paragraphs). Does not require domain-specific label sets and is more likely to generalize across domains.
- Overall Tree Generation: Extend tree generation to include non-textual elements: figures, captions, tables, equations. This is the most comprehensive formulation and the most challenging.
Method taxonomy
The survey cross-cuts tasks with a method taxonomy organized by modeling strategy:
| Task | Strategy | Representative works |
|---|---|---|
| Reading Order | Rule-based | Aiello & Smeulders (Allen relations), Ferilli et al. (argumentation), XYLayoutLM (augmented XY-Cut) |
| Reading Order | Graph-based | Gao et al. (TSP on newspaper blocks), Li et al. ECCV 2020 (GCN + pointer decoder), Wang et al. ICDAR 2023 (sparse graph GCN multitask) |
| Reading Order | Transformer-based | Layout2Pos, LayoutReader, TPP, ERNIE-Layout, MLARP, Zhang et al. EMNLP 2024 |
| ToC | Rule-based | Belaïd, Dresevic et al., Lin & Xiong, Marinai et al., Story et al., Wu et al. |
| ToC | ML-based | Déjean & Meunier, Gao et al. (clustering), Bogatenkova et al. (XGBoost), Klampfl et al. (unsupervised) |
| ToC | DL-based | Bentabet et al. (CNN + BiLSTM + CRF), Hu et al. ICPR 2022 (multimodal tree decoder) |
| Structure | Text-based classification | AIDAS (rule-based geometry), Luong et al. (CRF on line sequences) |
| Structure | Hand-crafted features | Tuarob et al. (Random Forest/SVM/NB), Rahman & Finin (SVM/DT/RNN + topic modeling) |
| Structure | Relation-based | DocParser (Mask R-CNN + heuristics), DSG (Faster R-CNN + BiLSTM + grammar), D-O-C (CNN+BERT+transformer encoder), DHFormer (GeoLayoutLM + encoder-decoder) |
| Structure | Action-based | MTP (Random Forest transitions), TRACER (RoBERTa stack parser), SEG2ACT (LLM + global context stack), HELD (BiLSTM/CNN + PUT-or-SKIP), CMM (GAT + XY-Cut + Keep/Move/Delete) |
For transformer-based reading order methods, the paper additionally provides a comparative breakdown by feature modality (layout only, layout+text, layout+text+visual) and prediction head (no explicit relation, decoder, global pointer network, FC+sigmoid). This is summarized in Table 3 of the paper:
| Method | Features | Backbone | Prediction head |
|---|---|---|---|
| Layout2Pos | Layout | Custom | None (positional encoding) |
| LayoutReader | Layout + Text | LayoutLMv1 | Pointer decoder |
| TPP | Layout + Text | LayoutLMv3 | Global Pointer Network |
| ERNIE-Layout | Layout + Text + Visual | Custom | None (pretraining task) |
| MLARP | Layout + Text + Visual | Custom | FC + Sigmoid |
| Zhang et al. 2024 | Layout + Text | Custom | Global Pointer Network |
Unifying perspective
The central argument of the survey is that reading order, ToC, and structure extraction are all variations of the same fundamental problem: identifying relationships between document entities. No element has a well-defined role in isolation; its function emerges from its relationship to surrounding elements. This framing motivates the authors’ claim that insights from one task can transfer to another, and that joint modeling across tasks is an underexplored direction.
What experiments were performed?
This is a survey; no new experiments are conducted. The paper explicitly acknowledges that it does not provide a quantitative comparison of methods or datasets, citing space constraints. Instead, it organizes the literature qualitatively through taxonomy tables and comparative descriptions.
The scope of the review covers approximately 45 papers published between 2001 and 2024, spanning rule-based systems from the early 2000s through recent transformer and LLM-based methods. The survey covers reading order (13 papers), ToC detection and generation (12 papers), and structure extraction (16 papers), with some methods appearing in multiple categories.
Document types covered by the surveyed methods include newspapers, scientific articles, books, financial documents, ESG reports, legal documents, and visually rich documents generally.
The survey does not describe a formal systematic search protocol (database queries, inclusion/exclusion criteria, or PRISMA-style selection). Coverage appears to be based on the authors’ knowledge of the literature and citation snowballing rather than an exhaustive search strategy. This means the paper set is representative but not guaranteed to be complete.
What are the outcomes/conclusions?
Key findings:
The survey finds that the field has evolved through a clear succession of paradigms: rule-based approaches, then classical machine learning, then deep learning, and now transformer-based multimodal architectures. This pattern holds across all three tasks and suggests the tasks are pulled by the same general-purpose modeling advances rather than task-specific algorithmic insights.
Reading order prediction and hierarchical structure extraction are mutually reinforcing: accurate reading order aids ToC generation and structure construction, while a correct structural hierarchy constrains which reading orders are plausible.
Action-based methods for structure extraction (MTP, TRACER, SEG2ACT, HELD, CMM) form a coherent research thread that the survey highlights as deserving more attention. Rather than predicting pairwise relations and inferring structure in post-processing, these methods directly model document parsing as a sequence of structured decisions (e.g., push, pop, keep, move, delete operations on a stack or graph), which may yield more consistent global structures.
Open problems the survey identifies:
- Layout variability and domain generalization: Most methods are tuned to a specific document type or corpus. Robust generalization across scientific articles, books, newspapers, forms, and scanned handwritten documents remains unsolved.
- Multimodal integration: Methods that combine visual, textual, and layout signals consistently outperform unimodal approaches, but optimal fusion strategies are still an open research question.
- Task integration: Reading order, ToC, and structure extraction are almost always addressed in isolation. The survey calls for joint approaches that explicitly model the interdependence between these tasks.
- No unified benchmark: There is no benchmark that evaluates a method across all three tasks jointly, making cross-task comparison and transfer learning studies difficult.
- Large language models: The survey notes that general-purpose LLMs are beginning to appear in structure extraction (SEG2ACT uses a generative LLM) and anticipates this trend will continue.
Limitations of the survey itself:
The authors acknowledge that coverage is constrained by space. There is no quantitative meta-analysis: no result tables comparing methods on shared benchmarks, no discussion of dataset contamination or reproducibility issues, and no attempt to rank approaches. The survey does not describe a formal systematic search methodology, so the coverage reflects the authors’ literature familiarity rather than a reproducible selection process. The survey is intended as a first step toward a more comprehensive empirical study.
Reproducibility
This paper does not release code, models, or data. As a survey, reproducibility considerations apply primarily to the reviewed works rather than to the paper itself.
Models
Not applicable. The survey includes comparative tables (Tables 3, 4, 5) summarizing architecture choices, feature modalities, and prediction heads for the methods it covers. These tables serve as a reference for anyone implementing or comparing methods.
Algorithms
Not applicable. The survey describes algorithmic strategies at a categorical level (rule-based, graph-based, transformer-based, action-based) without providing pseudocode or implementation details beyond what the original papers contain.
Data
No datasets are introduced or released. The survey notes the absence of a unified benchmark across the three tasks as a significant gap. Individual datasets used by the reviewed methods (ReadingBank, Comp-HRDoc, FinTOC shared tasks, DocHieNet) are mentioned but not catalogued systematically.
Evaluation
The survey explicitly declines to provide quantitative comparisons due to space constraints. No metrics are discussed at the survey level, and no baselines are proposed for future comparison. This is the most significant limitation from a reproducibility standpoint.
The survey does not describe a formal paper-selection methodology (no database queries, no inclusion/exclusion criteria, no time-range cutoff stated explicitly). The coverage of approximately 45 papers from 2001 to 2024 appears to rely on the authors’ domain expertise and citation following. As a result, the paper set cannot be mechanically reproduced by a third party following a stated protocol.
Hardware
Not applicable. This is a survey paper; no training or inference was performed.
BibTeX
@inproceedings{giovannini2025survey,
author = {Simone Giovannini and Simone Marinai},
title = {A Survey on Reading Order, Table of Contents, and Structure Extraction in Document Analysis},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
pages = {7644--7653},
year = {2025}
}
OmniDocBench: A Diverse, Multi-Level Benchmark for Document Parsing Evaluation
TL;DR
OmniDocBench is an evaluation benchmark covering 981 annotated PDF pages from nine distinct document types, including handwritten notes and newspapers. It supports end-to-end, task-specific, and attribute-level evaluation of document parsing systems across OCR, layout detection, table recognition, formula parsing, and reading order. Evaluation of pipeline tools and vision-language models on the benchmark surfaces consistent strengths and weaknesses that narrower prior benchmarks could not reveal.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the primary contribution is the benchmark dataset itself: 981 carefully curated and annotated pages, along with the evaluation framework and leaderboard infrastructure. The paper’s center of gravity is the curation pipeline, annotation schema, and diversity of the resource.
Secondary: $\Psi_{\text{Evaluation}}$: the evaluation methodology section introduces non-trivial measurement contributions, including an Adjacency Search Match algorithm designed to handle paragraph-splitting inconsistencies across models, format-specific metrics (CDM for formulas, TEDS for tables), and a principled reading order evaluation protocol. These are genuine measurement contributions beyond standard edit distance.
What is the motivation?
Document parsing (converting unstructured PDFs into machine-readable formats) is a foundational step for large language model pre-training and retrieval-augmented generation (RAG) pipelines. Despite substantial progress in both pipeline-based systems (layout detection + OCR + formula/table recognition modules) and end-to-end vision-language models (VLMs), the field lacked a benchmark capable of fairly comparing these two paradigms.
Existing benchmarks had three key gaps. First, document diversity was narrow: most datasets focused exclusively on academic papers from arXiv, ignoring textbooks, financial reports, exam papers, handwritten notes, newspapers, and magazines that appear routinely in real deployments. Second, evaluation metrics were inconsistent: generic string similarity metrics such as edit distance and BLEU treat LaTeX formula representations as arbitrary strings, penalizing syntactically different but semantically equivalent expressions. Third, evaluations were not fine-grained: benchmarks typically reported a single aggregate score, making it impossible to identify whether failures were concentrated in a particular document type, content category, or visual attribute.
These gaps made it difficult to diagnose model weaknesses or choose the right system for a given document domain.
What is the novelty?
OmniDocBench contributes three interconnected components.
Benchmark dataset. The dataset contains 981 pages sampled from over 200,000 initial PDFs sourced from Common Crawl and search engines. Pages were clustered using ResNet-50 visual features and Faiss to maximize visual diversity, then manually labeled for page-level attributes. The final dataset covers nine document types: academic papers, books, colorful textbooks, exam papers, magazines, notes, newspapers, slides, and financial reports. Each page carries six page-level attributes (type, layout, language, fuzzy scan, watermark, colorful background), and each annotation block carries up to nine attribute labels covering text language, background color, text rotation, table frame type, merged cells, formula presence, and table rotation.
Annotations span two granularity levels. Block-level annotations cover 19 region categories (titles, text paragraphs, figures, tables, formulas, headers, footers, code blocks, references, and more), with over 20,000 bounding boxes. Reading order labels are provided for all blocks except headers, footers, and page notes, totaling over 16,000 ordered annotations. Span-level annotations cover inline formulas (4,009, in LaTeX), equation-ignore spans, text spans, and footnote markers, totaling over 70,000 entries.
Evaluation pipeline. The evaluation framework addresses three practical challenges that undermine naive metric computation. First, different models apply different standards for headers, footers, and caption placement; OmniDocBench handles these through ignore rules so that incidental differences do not dominate scores. Second, paragraph boundaries vary across models; the Adjacency Search Match algorithm resolves this by constructing a normalized edit distance matrix between predicted and ground-truth paragraphs and iteratively merging adjacent paragraphs until the best-possible alignment is found. Third, formula and table outputs are format-diverse; the benchmark applies the CDM metric (Character Detection Matching) to formulas and the Tree Edit Distance-based Similarity (TEDS) metric to tables, rather than treating these as plain strings.
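A heavily simplified sketch of the adjacency-merging idea, using the rapidfuzz library for normalized edit distance; the real Adjacency Search Match builds a full distance matrix with a similarity threshold before merging, whereas this greedy loop only absorbs the next predicted paragraph while the best match to any ground-truth paragraph improves.

```python
from rapidfuzz.distance import Levenshtein

def merge_adjacent(preds, gts):
    """Greedy adjacent-paragraph merging against ground-truth paragraphs.
    `preds` and `gts` are lists of paragraph strings; naive concatenation
    stands in for real whitespace handling."""
    def best(text):
        return min(Levenshtein.normalized_distance(text, g) for g in gts)
    out, i = [], 0
    while i < len(preds):
        cur, j = preds[i], i + 1
        while j < len(preds) and best(cur + preds[j]) < best(cur):
            cur, j = cur + preds[j], j + 1
        out.append(cur)
        i = j
    return out
```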
For reading order, after matching predicted text blocks to ground-truth blocks, the benchmark derives an ordering over matched pairs and computes normalized edit distance between the predicted sequence and the ground-truth sequence. Tables, images, and deliberately ignored blocks are excluded from this calculation.
Formally, the reading order score uses normalized edit distance:
$$\text{ROD}_{\text{Edit}} = \frac{\text{EditDistance}(\hat{s},\ s)}{\max(|\hat{s}|,\ |s|)}$$
where $\hat{s}$ is the predicted block sequence and $s$ is the ground-truth sequence, after adjacency-match alignment. Lower values indicate better ordering.
The overall end-to-end normalized edit distance aggregates across text, formula, and table elements:
$$\text{Overall}^{\text{Edit}} = \frac{\sum_{i} \text{EditDistance}(\hat{t}_{i},\ t_{i})}{\sum_{i} \max(|\hat{t}_{i}|,\ |t_{i}|)}$$
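Both formulas reduce to normalized Levenshtein distance over matched sequences; a minimal sketch using rapidfuzz, whose `normalized_distance` divides by the longer sequence length as above.

```python
from rapidfuzz.distance import Levenshtein

def rod_edit(pred_blocks, gold_blocks):
    """ROD_Edit: edit distance over matched block-ID sequences,
    normalized by the longer sequence length (lower is better)."""
    return Levenshtein.normalized_distance(pred_blocks, gold_blocks)

# e.g. rod_edit([0, 2, 1, 3], [0, 1, 2, 3]) -> 0.5
```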
Comprehensive model comparison. The benchmark is applied to three categories of document parsing systems: pipeline tools (MinerU v0.9.3, Marker v1.2.3, Mathpix), expert VLMs trained specifically for document parsing (GOT-OCR 2.0, Nougat 0.1.0-base), and general-purpose VLMs (GPT-4o, Qwen2-VL-72B, InternVL2-Llama3-76B).
What experiments were performed?
Evaluation spans four levels.
End-to-end evaluation measures full-page parsing accuracy across seven dimensions: text (normalized edit distance $\downarrow$), formula edit distance $\downarrow$, formula CDM $\uparrow$, table TEDS $\uparrow$, table edit distance $\downarrow$, reading order edit distance $\downarrow$, and overall edit distance $\downarrow$. Results are broken down by language (English vs. Chinese), document type (nine categories), layout type, and visual degradation (fuzzy scan, watermark, colorful background).
Layout detection evaluation uses mean Average Precision (mAP) on a layout-annotated subset, comparing DIT-L, LayoutLMv3, DocLayout-YOLO, SwinDocSegmenter, GraphKD, and DOCX-Chain across the nine page types.
Table recognition evaluation uses TEDS on a table-annotated subset, stratified by language, frame type (full frame, omission line, three-line, no frame), and special conditions (merged cells, colorful background, formula content, rotation).
OCR evaluation uses normalized edit distance on a text-annotated subset, stratified by language, background color, and text rotation (normal, 90°, 270°, horizontal). Models evaluated include PaddleOCR, Tesseract, Surya, GOT-OCR, Mathpix, Qwen2-VL-72B, InternVL2-76B, and GPT-4o.
Formula recognition evaluation uses CDM, ExpRate@CDM, BLEU, and normalized edit distance on a formula subset. Models include GOT-OCR, Mathpix, Pix2Tex, UniMERNet-B, GPT-4o, InternVL2-76B, and Qwen2-VL-72B.
For general VLMs, do_sample=False was set to ensure reproducibility. Qwen2-VL-72B used max_token=32000; InternVL2-Llama3-76B used max_token=4096. Pipeline tools and expert VLMs used default settings.
What are the outcomes/conclusions?
Pipeline tools vs. VLMs. Pipeline tools, especially MinerU and Mathpix, achieve lower end-to-end edit distance on common document types such as academic papers and financial reports. General VLMs generalize better to unconventional formats (e.g., notes, exam papers, slides) and show greater robustness under visual degradations such as fuzzy scans and colored backgrounds, which the authors attribute to their broader training data.
Layout detection. DocLayout-YOLO significantly outperforms other layout detectors, particularly on diverse formats. Other methods struggle with slides and notes due to limited training data diversity.
Reading order. All models degrade substantially on multi-column and complex layouts. MinerU maintains the most consistent reading order overall, but its performance drops on handwritten single-column pages. VLMs tend to merge multi-column text, compounding both recognition and ordering errors.
Newspapers. Most VLMs fail on newspaper pages due to high text density and resolution constraints on input tokens. Pipeline tools, which apply layout segmentation before recognition, handle this case substantially better.
Rotated text and tables. Rotation is a consistent failure mode across all model types; table rotation in particular renders almost all models ineffective.
Formula recognition. GPT-4o, Mathpix, and UniMERNet-B achieve CDM scores of 86.8, 86.6, and 85.0 respectively. Mathpix has high character-level precision but occasionally omits punctuation, reducing its exact-match rate.
Limitations. The benchmark is annotated for English and Chinese only, leaving multilingual generalization untested. The 981-page scale, while carefully curated, remains relatively small compared to training sets. The evaluation framework relies on matched block sequences, so models that produce highly fragmented or reorganized output may be penalized more than their semantic accuracy warrants.
Reproducibility
Models
OmniDocBench does not introduce new model weights. It evaluates existing systems: MinerU v0.9.3, Marker v1.2.3, Mathpix (commercial API), Nougat 0.1.0-base (350M parameters), GOT-OCR 2.0, GPT-4o (2024-08-06 version), Qwen2-VL-72B, and InternVL2-Llama3-76B (76B parameters). Layout detection models evaluated include DIT-L (361.6M), LayoutLMv3 (138.4M), DocLayout-YOLO (19.6M), SwinDocSegmenter (Swin-L, 223M), and GraphKD (ResNet-101, 44.5M).
Algorithms
No novel training algorithm is introduced. The evaluation pipeline includes: regex-based element extraction (LaTeX tables, HTML tables, display formulas, markdown tables, code blocks), Unicode normalization for inline formulas, Adjacency Search Match for paragraph alignment (normalized edit distance matrix with a similarity threshold, followed by iterative fuzzy merging), and metric calculation via normalized edit distance, CDM, and TEDS. The annotation pipeline used LayoutLMv3 for pre-annotation of layout, PaddleOCR for text, UniMERNet for formulas, and GPT-4o for tables, followed by human correction and expert quality inspection using CDM’s rendering technique to identify unrenderable formula annotations.
Data
The benchmark was constructed from over 200,000 PDFs sourced from Common Crawl and Google/Baidu search engines. ResNet-50 visual features were extracted and clustered with Faiss (10 cluster centers, 6,000 candidate pages sampled), then manually filtered to 981 balanced pages. The dataset contains English (290 pages), Simplified Chinese (612 pages), and mixed (79 pages) pages. The dataset and evaluation code are available at https://github.com/opendatalab/OmniDocBench under an Apache-2.0 license.
Evaluation
Metrics used: normalized edit distance (text, formula, reading order, overall), CDM (formula recognition, higher is better), TEDS (table recognition, higher is better), and mAP (layout detection). The reading order metric applies normalized edit distance over the matched ordered block sequence, excluding tables, images, and ignored elements. No statistical significance tests or confidence intervals are reported; results are single-run evaluations. Baseline comparisons are generally fair within each modality, though the pipeline tool versions are pinned while VLMs are evaluated at their publicly available weights.
Hardware
The paper does not report GPU specifications, training compute, or inference latency for the evaluated systems. No cost estimates are provided. The evaluation framework itself is lightweight (regex extraction, edit distance computation), but inference costs for large VLMs (e.g., Qwen2-VL-72B, InternVL2-76B) would require substantial GPU memory.
BibTeX
@inproceedings{ouyang2025omnidocbench,
title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
author={Ouyang, Linke and Qu, Yuan and Zhou, Hongbin and Zhu, Jiawei and Zhang, Rui and Lin, Qunshu and Wang, Bin and Zhao, Zhiyuan and Jiang, Man and Zhao, Xiaomeng and Shi, Jin and Wu, Fan and Chu, Pei and Liu, Minghao and Li, Zhenxiang and Xu, Chao and Zhang, Bo and Shi, Botian and Tu, Zhongying and He, Conghui},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
CMM: A Scalable Framework for Table of Contents Extraction from Complex ESG Reports
TL;DR
CMM (Construction-Modelling-Modification) is a three-stage pipeline for extracting a hierarchical Table of Contents from long, structurally diverse PDF documents. It builds an initial document tree from reading order and font sizes, encodes each node independently using GRU and Graph Attention Networks over a node-centric subtree, then predicts a Keep/Delete/Move operation per node to produce the final ToC tree. The accompanying ESGDoc dataset contains 1,093 ESG annual reports averaging 72 pages (versus 19 pages for the prior HierDoc benchmark), with less regular structure and higher assumption-violation rates.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The core contribution is the CMM framework itself: a three-stage pipeline for ToC extraction that replaces pairwise heading-relationship modelling with per-node operation prediction over a graph-structured subtree. The paper presents ablations, runtime comparisons, and comparisons against the prior best-reported method, all orbiting the methodological contribution.
Secondary: $\Psi_{\text{Resource}}$: The paper also introduces ESGDoc, a new dataset of 1,093 ESG reports spanning 2001-2022. The dataset is a meaningful standalone contribution, but it is built and released primarily to benchmark CMM and expose the limitations of MTD (the prior best-reported method on HierDoc) on long, complex documents.
What is the motivation?
Most document understanding work targets well-structured, short scientific papers. The HierDoc benchmark (Hu et al., 2022), which MTD was built and evaluated on, consists of 650 arXiv papers averaging 19 pages. In such documents, section numbering (e.g., “5.1 Experimental Setup”) encodes hierarchy directly in the heading text, making ToC extraction comparatively straightforward.
ESG (Environmental, Social, and Governance) annual reports present a different challenge: they routinely exceed 100 pages, lack standardized section numbering, include landscape-oriented pages, embed heavy visual elements (charts, infographics, tables), and show wide variation in font choices across companies and years. No dataset or method addressed this regime at the time.
The specific failure mode of MTD on long documents is also concrete: MTD must process all headings simultaneously to model pairwise relationships, which causes out-of-memory errors on documents exceeding approximately 50 pages. This makes MTD technically inapplicable to a substantial fraction of ESGDoc.
What is the novelty?
The central insight is to reformulate ToC extraction as a per-node classification problem on a pre-built document tree, rather than as pairwise heading-relationship classification over the full document. This decoupling enables linear scaling with document length.
Tree Construction. Using the XY-Cut algorithm (Ha et al., 1995) to establish reading order, CMM builds a complete tree $T$ over all text blocks extracted by PyMuPDF. For each block $x_i$, CMM finds the closest preceding block $x_j \in \{x_{<i}\}$ such that $s_j > s_i$ (where $s$ denotes font size) and sets $x_i$ as a child of $x_j$. This encodes the assumption that higher-level headings have larger font sizes; blocks with no larger-font predecessor become children of the pseudo root.
Node-Centric Subtree Extraction. Rather than processing the entire tree at once, CMM extracts a local subtree $t_i$ for each node $x_i$ by running Breadth First Search to depth $n_d$ (set to 2), capturing parent, sibling, and child nodes. This limits GPU memory footprint to a fixed-size neighbourhood regardless of document length, and allows multiple subtrees to be batched simultaneously.
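A sketch of these two steps under the stated assumptions (blocks already in XY-Cut reading order, font sizes available); the backward scan is the naive O(n²) version, and a monotonic stack would be the obvious speedup.

```python
from collections import deque

def build_tree(blocks):
    """blocks: reading-ordered list of (text, font_size). Returns parent[i],
    with -1 denoting the pseudo root. Each block attaches to the closest
    preceding block with a strictly larger font size."""
    parent = []
    for i, (_, size) in enumerate(blocks):
        p = -1
        for j in range(i - 1, -1, -1):
            if blocks[j][1] > size:
                p = j
                break
        parent.append(p)
    return parent

def subtree(parent, root, depth=2):
    """Undirected BFS to `depth` hops over parent/child edges, approximating
    the node-centric subtree t_i (parent, siblings, children at depth 2)."""
    children = {}
    for c, p in enumerate(parent):
        children.setdefault(p, []).append(c)
    seen, frontier = {root}, deque([(root, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        neighbours = children.get(node, []) + ([parent[node]] if parent[node] >= 0 else [])
        for nb in neighbours:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    return sorted(seen)
```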
Node Encoding. Each block $x_i$ is encoded by concatenating a pretrained language model representation (RoBERTa-base) with a handcrafted feature vector $f_i$ covering page number, font and font size, colour (RGB), text line count and length, and bounding box coordinates:
$$b_i = \text{MLP}([\text{TextEncoder}(x_i),\ f_i])$$
A one-layer bidirectional GRU is then applied to the nodes of $t_i$ in in-order traversal to incorporate sequential reading-order context:
$$\{v_j\}_{t_1}^{t_l} = \text{GRU}(t_i)$$
Graph Attention over the Subtree. The subtree is converted to a graph $G = (V, \mathcal{E})$ with nodes connected by parent, child, and sibling edges. Edge features $f_{j,i}$ encode edge type, font size difference $s_j - s_i$, font/colour match, page difference, and bounding box position difference. Graph Attention Networks (GAT) with $n_d$ layers propagate information across the graph:
$$\{h_i\}^{t_i} = \text{GAT}(V, \mathcal{E})$$
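A compact PyTorch sketch of the encoding path, covering $b_i$ and the GRU; the RoBERTa encoder and the edge-featured GAT layers are omitted and assumed to be supplied externally, and the handcrafted feature width (16 here) is a placeholder rather than the paper's exact dimensionality.

```python
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    """b_i = MLP([TextEncoder(x_i), f_i]) followed by a bidirectional GRU
    over the subtree's in-order node sequence. Hidden size 128 follows the
    paper; splitting it across GRU directions is our design choice."""
    def __init__(self, text_dim=768, feat_dim=16, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim + feat_dim, hidden), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden // 2, batch_first=True, bidirectional=True)

    def forward(self, text_emb, feats):
        # text_emb: (nodes, text_dim) pooled RoBERTa outputs
        # feats:    (nodes, feat_dim) handcrafted layout features
        b = self.mlp(torch.cat([text_emb, feats], dim=-1))  # (nodes, hidden)
        v, _ = self.gru(b.unsqueeze(0))                     # (1, nodes, hidden)
        return v.squeeze(0)
```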
Tree Modification. For each node $x_i$, three operation scores are computed from the final GAT representation $h_i$:
$$\begin{aligned} o_i^{[kp]} &= W_{kp} h_i + b_{kp} \\ o_i^{[de]} &= W_{de} h_i + b_{de} \\ o_i^{[mv]} &= W_{mv}[\text{POOL}(\text{PRS}(h_i)),\ h_i] + b_{mv} \end{aligned}$$
The Move score incorporates a max-pooled representation of all preceding sibling nodes $\text{PRS}(h_i)$, allowing the model to compare a node against its already-seen siblings before deciding whether it should be re-parented. Operations are selected via:
$$\hat{y}_i = \text{argmax}(p_i)$$
where $p_i$ is the vector of softmax probabilities over the three operation scores. After all per-node predictions are made, deletions are executed first, then Move operations are applied in reading order: a moved node’s parent is set to its immediately preceding sibling.
Training uses cross-entropy loss over the three-class operation labels. Ground-truth labels are derived from the annotated ToC: non-headings receive Delete; headings with a higher-level heading preceding them in the document receive Move; remaining headings receive Keep.
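A sketch of applying the predicted operations; the rule that children of a deleted node re-attach to the nearest surviving ancestor is our assumption, not spelled out in the summary above.

```python
def apply_operations(parent, ops):
    """Execute Delete ops first, then Move ops in reading order.
    `parent[i]` is the constructed-tree parent (-1 = pseudo root);
    `ops[i]` is one of 'keep', 'delete', 'move'. Returns new parents
    for surviving nodes."""
    keep = [i for i in range(len(parent)) if ops[i] != "delete"]
    new_parent = {}
    for i in keep:
        p = parent[i]
        while p != -1 and ops[p] == "delete":  # assumption: skip deleted ancestors
            p = parent[p]
        new_parent[i] = p
    for i in keep:  # reading order
        if ops[i] == "move":
            sibs = [j for j in keep if j < i and new_parent[j] == new_parent[i]]
            if sibs:
                new_parent[i] = sibs[-1]  # re-parent to preceding sibling
    return new_parent
```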
What experiments were performed?
Datasets. Two benchmarks are used: ESGDoc (the new dataset) and HierDoc (the prior benchmark of arXiv papers). For ESGDoc, MTD cannot process documents longer than 50 pages due to memory constraints, so a filtered sub-dataset “ESGDoc (Partial)” of documents under 50 pages (274/40/78 train/dev/test) is constructed for fair MTD comparison. CMM is evaluated on the full ESGDoc (765/110/218 train/dev/test) as well.
Baseline. MTD (Hu et al., 2022) is the only baseline. It fuses text, visual, and layout modalities using a pretrained language model for heading classification, then models all pairwise heading relationships with GRU and attention to decode the tree. The paper cites earlier rule-based and ML methods (Namboodiri and Jain, 2007; Tuarob et al., 2015) but does not compare against them, which is a gap: their inclusion would help calibrate how much the neural approach adds over classical heuristics on HierDoc where both models already score above 87 TEDS.
Metrics. Heading Detection (HD) is measured by F1-score, capturing how well the model identifies which blocks are headings. ToC quality is measured by Tree Edit Distance Similarity (TEDS):
$$\text{TEDS}(T_p, T_g) = 1 - \frac{\text{TreeEditDist}(T_p, T_g)}{\max(|T_p|, |T_g|)}$$
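TEDS can be sketched with the zss library, assuming unit edit costs and exact label matching; real implementations may weight label substitutions differently.

```python
from zss import Node, simple_distance

def teds(pred_root, gold_root, pred_size, gold_size):
    """TEDS = 1 - TED / max(|T_p|, |T_g|), with unit edit costs via zss.
    `pred_size`/`gold_size` are the node counts of each tree."""
    return 1.0 - simple_distance(pred_root, gold_root) / max(pred_size, gold_size)

# Toy example: two three-node ToC trees differing in one heading label.
a = Node("root").addkid(Node("1 Intro")).addkid(Node("2 Methods"))
b = Node("root").addkid(Node("1 Intro")).addkid(Node("2 Approach"))
# teds(a, b, 3, 3) -> 1 - 1/3
```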
Ablations. Four conditions are compared: (1) full CMM, (2) page-based division instead of tree-based subtree segmentation, (3) CMM without GRU, (4) CMM without GNN. A hyperparameter sweep over $n_d \in \{1, 2, 3, 4\}$ is also reported.
Implementation. RoBERTa-base is the text encoder. BFS depth $n_d = 2$. Hidden size for $b$, $v$, and $h$ is 128. Training uses Adam with learning rate $10^{-5}$ for pretrained parameters and $10^{-3}$ for randomly initialized parameters, batch size 32, on a single NVIDIA A100 80GB GPU.
What are the outcomes/conclusions?
Main results. On HierDoc, CMM slightly outperforms MTD (HD: 97.0 vs. 96.1 F1; ToC: 88.1 vs. 87.2 TEDS). On ESGDoc (Partial), CMM substantially outperforms MTD (HD: 53.2 vs. 40.4; ToC: 30.0 vs. 26.9). On ESGDoc (Full), MTD achieves only 12.7/12.8 HD/ToC due to out-of-memory failures on long documents; CMM achieves 55.6/33.2. The gap on ToC is smaller than on HD, which the authors attribute to assumption violations (see below).
Runtime. CMM trains 4.6$\times$ faster than MTD on HierDoc and 2.1$\times$ faster on ESGDoc (Partial). Inference is 4.2$\times$ and 1.3$\times$ faster, respectively. The smaller inference gap on ESGDoc is attributed to the higher number of graph edges in ESGDoc, which contains many small numeric text blocks that become individual nodes.
Ablation findings. Removing the GNN causes the largest performance drop: 0.6 points on HierDoc and 11.6 points on ESGDoc (HD F1). Removing GRU costs 0.2 and 4.8 points respectively. Replacing tree-based segmentation with page-based windowing (6-page window, 2-page overlap) costs 0.3 points on HierDoc and 3.3 points on ESGDoc. All components matter more on ESGDoc, confirming that the GNN’s ability to capture long-distance relationships is particularly important when heading hierarchy cannot be inferred from heading text alone. BFS depth $n_d = 2$ offers the best accuracy-efficiency tradeoff; gains plateau beyond $n_d = 2$.
Assumption violations. CMM relies on three assumptions: (1) reading order is left-to-right, top-to-bottom; (2) higher-level headings have larger or equal font size; (3) headings at the same level share the same font size. In ESGDoc, 10.8% of headings violate at least one assumption, versus 4.6% in HierDoc. Violations arise from decorative typography, emphasis styling, and XY-Cut errors. These cases are the primary source of ToC prediction errors, since the initial tree construction may not position these headings correctly.
Limitations acknowledged. CMM requires font size extraction from PDFs; scanned or photographed documents need an additional OCR step to recover font information. The model cannot recover from tree construction errors caused by XY-Cut failures or assumption violations. The paper also notes that CMM’s advantage over MTD is smaller on the ToC metric than on HD, partly because assumption violations propagate into the modification step even when heading detection is correct.
Reproducibility
Models
- Text encoder: RoBERTa-base (12-layer, 768-dimensional hidden, 125M parameters)
- GRU: one-layer bidirectional; hidden size 128
- GAT: $n_d = 2$ layers; hidden size 128; uses edge embeddings in addition to node embeddings (following the GATv2 formulation of Brody et al., 2021)
- MLP for block encoding: maps from [RoBERTa hidden + feature dim] to 128
- Output heads: three separate linear layers (Keep, Delete, Move) from 128-dimensional $h_i$; Move additionally uses a max-pooled preceding-sibling representation concatenated before the linear layer
- No pretrained weights are reported as released; the code repository has no license file (all rights reserved by default)
Algorithms
- Optimizer: Adam
- Learning rate: $10^{-5}$ for RoBERTa parameters, $10^{-3}$ for randomly initialized parameters
- Batch size: 32
- Training epochs/steps: not specified
- Loss: cross-entropy over three operation classes (Keep, Delete, Move)
- No warmup schedule, gradient clipping, or mixed precision details are mentioned
- BFS depth $n_d = 2$; texts with very small sizes are deleted automatically during tree modification
Data
- ESGDoc: 1,093 ESG annual reports from 563 companies (2001-2022), sourced from ResponsibilityReports.com; split 765/110/218 (train/dev/test)
- Documents range from 4 to 521 pages; average 72 pages
- Text extracted using PyMuPDF; annotations derived by using the embedded ToC of each PDF as ground truth (documents without an embedded ToC were excluded from the original 10,639 downloads)
- ESGDoc is not directly redistributed due to copyright concerns over the original PDFs; the GitHub repository provides metadata and download scripts (via Google Drive) that re-crawl reports from ResponsibilityReports.com, so reproducibility depends on that site remaining accessible and the download links staying live
- No explicit license is stated for the ESGDoc metadata or annotations in the paper or code repository
- HierDoc: 650 arXiv papers (350/300 train/test), from Hu et al. (2022); font sizes re-extracted from PDFs using PyMuPDF since the HierDoc release does not include them
Evaluation
- HD metric: F1-score over heading block identification
- ToC metric: TEDS (tree edit distance similarity) averaged over all documents in the test set
- ESGDoc (Partial) used for MTD comparison to avoid out-of-memory exclusions affecting MTD
- No error bars, significance tests, or multiple-run seeds are reported
- Assumption violation statistics (Table 1) are computed automatically from the labeled data, not from human inspection of every case
Hardware
- Training hardware: single NVIDIA A100 80GB GPU
- Training time: 525.4 minutes on HierDoc, 241.4 minutes on ESGDoc (Partial) for CMM; MTD requires 2420.6 and 513.5 minutes respectively
- Inference time: 3.8 minutes on HierDoc, 2.1 minutes on ESGDoc (Partial) for CMM
- No CPU-only or deployment considerations reported
BibTeX
@inproceedings{wang-etal-2023-cmm,
title = "A Scalable Framework for Table of Contents Extraction from Complex {ESG} Annual Reports",
author = "Wang, Xinyu and Gui, Lin and He, Yulan",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
pages = "13215--13229",
doi = "10.18653/v1/2023.emnlp-main.816"
}
DocTrack: Eye-Tracking Ground Truth for Visually-Rich Document Reading Order
TL;DR
DocTrack is a benchmark dataset of 539 visually-rich documents with reading order ground truth derived from human eye-tracking experiments using Tobii hardware. The authors use it to study whether human reading order actually helps document AI models, finding that simple heuristic orderings often outperform both OCR default order and true eye-tracking order on downstream understanding tasks.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The paper’s primary contribution is the DocTrack dataset itself: a collection of visually-rich document images paired with reading order labels derived from eye-tracking experiments. The dataset construction methodology, participant protocols, and annotation voting procedure are the core of the paper.
Secondary: $\Psi_{\text{Evaluation}}$: The paper also conducts a systematic evaluation study measuring how different reading order strategies (OCR default, Z-order heuristics, XYLayout rules, and learned multimodal models) affect downstream task performance. This benchmarking analysis is substantial but serves to validate the dataset’s utility rather than propose a new model.
Secondary: $\Psi_{\text{Method}}$: A “preordering pipeline” is introduced alongside atomic comparison models (Box, Text, Text+Box, Text+Box+Image) for generating human-like reading orders.
What is the motivation?
Visually-rich documents (VRDs) (forms, tables, infographics, and other document types mixing text with graphical elements) present a challenge for document AI models because those models require a linearized token sequence as input. The standard approach is to either follow OCR engine output order or apply simple left-to-right, top-to-bottom rules. Both approaches diverge significantly from how humans actually read such documents.
Prior work on reading order, notably ReadingBank (Wang et al., 2021), derived ground truth from the XML structure of Word source files. The authors argue this proxy measure may not reflect actual human behavior, particularly for complex layouts. No dataset existed that directly measured human reading order via eye tracking for real-world document images. DocTrack fills that gap.
The motivation is also evaluative: the authors want to know whether bringing machine reading order closer to human reading order actually improves document understanding performance, a question that had not been rigorously tested.
What is the novelty?
The core novelty is the use of eye-tracking hardware to collect reading order ground truth directly from human participants reading real document images. This avoids the assumptions embedded in proxy measures like XML source order.
Dataset construction. Documents are drawn from three existing sources: FUNSD (form understanding), SeaBill (structured tabular shipping documents), and InfographicVQA. These form three subsets within DocTrack:
- WEAK: 149 train / 50 test documents from FUNSD, characterized by the normal-Z reading pattern
- STRUCTURED: 160 train / 50 test documents from SeaBill, with local-priority reading patterns
- INFOGRAPH: 100 train / 30 test documents from InfographicVQA, exhibiting cross-modal interaction and visual instruction patterns
The full dataset contains 539 documents with 39,671 semantic entities and 86,054 tokens across train and test splits.
Eye-tracking methodology. Five participants (graduate and undergraduate students) were recruited and divided into five groups, each assigned a disjoint subset of the data. Participants read documents on an HP 24-inch 1080p display using Tobii TX300 and Tobii Studio to record fixation points, fixation time, fixation frequency, saccade distance, and pupil size. Two of the five participants labeled an overlapping subset for annotation agreement; final labels were selected by majority vote across all five participants.
Gaze-to-OCR alignment. Raw gaze trajectories contain noise. The authors apply three corrections: (1) when a gaze point lands on the periphery of a known OCR bounding box within a Euclidean distance threshold, the ordinal position of that peripheral gaze point is used as the reading sequence index for that box; (2) when gaze points are missing between two known points, ordinal positions are filled from the surrounding adjacent readings; (3) gaze points that represent repeated return fixations are removed, keeping only the first occurrence. This alignment maps continuous eye movement traces to a discrete reading order over OCR bounding boxes.
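A simplified sketch of correction (1) plus the return-fixation filter (3); the distance threshold and nearest-box tie-breaking are hypothetical, and the interpolation of missing points (2) is omitted.

```python
import math

def snap_gaze_to_boxes(gaze_points, boxes, max_dist=30.0):
    """Assign each OCR box the ordinal position of the first gaze point
    landing on or near it (within `max_dist` of its border), dropping
    repeated return fixations. Returns box indices in derived reading order."""
    def dist_to_box(p, b):
        x0, y0, x1, y1 = b
        dx = max(x0 - p[0], 0.0, p[0] - x1)
        dy = max(y0 - p[1], 0.0, p[1] - y1)
        return math.hypot(dx, dy)

    order, seen = [], set()
    for p in gaze_points:  # gaze points in temporal order
        cand = min(range(len(boxes)), key=lambda i: dist_to_box(p, boxes[i]))
        if dist_to_box(p, boxes[cand]) <= max_dist and cand not in seen:
            seen.add(cand)
            order.append(cand)
    return order
```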
Reading pattern taxonomy. The authors identify four qualitative patterns observed in the data:
- Normal-Z: left-to-right, line-by-line zigzag, typical of plain or weakly structured text
- Local priority: attention to cell content before surrounding structure, characteristic of tables and forms
- Cross-modal interaction: radial back-and-forth between graphical elements and their text labels, seen in pie charts, bar charts, and line graphs
- Visual instruction: backtracking over hierarchical diagrams such as flowcharts, where readers re-read earlier nodes for contextual reference
Preordering pipeline. The authors propose a preprocessing step that reorders OCR bounding boxes before feeding them into downstream document AI models. Four atomic comparison models each predict which of two bounding boxes comes first in reading order, framed as a binary classification:
$$p = f(b_i : b_j) = \begin{cases} 0, & \text{if } r[b_i] < r[b_j] \\ 1, & \text{if } r[b_i] > r[b_j] \end{cases}$$
The four model variants are:
- Box: centroid coordinates fed into a multi-layer Transformer
- Text: BERT encoding of token content
- Text+Box: LayoutLM joint encoding of text and 2D position
- Text+Box+Image: LayoutLMv2 joint encoding of text, 2D position, and image ROI
A bubble-sort variant uses pairwise comparison outputs to produce a final sorted sequence.
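A sketch of that sort wrapper; `comes_first` stands in for any of the four atomic comparison models. Bubble sort is a defensible choice here because a learned pairwise comparator need not be transitive, which faster comparison sorts implicitly assume.

```python
def preorder(boxes, comes_first):
    """Bubble sort driven by a learned pairwise comparator.
    `comes_first(a, b)` is the atomic model: True if block a precedes b
    in reading order. O(n^2) comparisons over the OCR bounding boxes."""
    order = list(boxes)
    for i in range(len(order)):
        for j in range(len(order) - 1 - i):
            if not comes_first(order[j], order[j + 1]):
                order[j], order[j + 1] = order[j + 1], order[j]
    return order
```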
What experiments were performed?
The paper conducts both intrinsic and extrinsic evaluations.
Intrinsic evaluation. The four comparison models are evaluated by measuring rank correlation between their predicted order and the eye-tracking ground truth, using Kendall’s $\tau$ and Spearman’s $\rho$. Results across the three subsets (Table 2 in the paper) show:
- Box alone performs worst overall ($\tau = 0.5992$, $\rho = 0.6366$)
- Text+Box+Image achieves the best overall performance ($\tau = 0.8052$, $\rho = 0.8665$)
- For WEAK documents, visual features matter: Text+Box+Image beats Text+Box
- For STRUCTURED and INFOGRAPH documents, Text+Box is nearly as strong or stronger, suggesting visual features provide diminishing returns when layout structure is clear
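Both rank correlations are standard and available in scipy; a quick sketch with hypothetical ranks, for reference:

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical example: gold vs. predicted ranks of the same six boxes.
gold_ranks = [0, 1, 2, 3, 4, 5]
pred_ranks = [0, 2, 1, 3, 5, 4]

tau, _ = kendalltau(gold_ranks, pred_ranks)
rho, _ = spearmanr(gold_ranks, pred_ranks)
print(f"Kendall tau = {tau:.4f}, Spearman rho = {rho:.4f}")
```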
A notable finding is that 38.16% of WEAK documents have missing gaze points (gaze fell outside OCR boxes), compared to 12.98% for STRUCTURED and 9.55% for INFOGRAPH.
Extrinsic evaluation. The impact of reading order on downstream tasks is measured using:
- Semantic Entity Recognition (SER) on WEAK and STRUCTURED subsets, reported as precision, recall, and F1
- Visual Question Answering on INFOGRAPH, reported as Average Normalized Levenshtein Similarity (ANLS)
Three backbone models are tested: BERT (text only), LayoutLMv2 (text + position + image), and LayoutLMv3 (text + position + image with unified masking pre-training).
Conditions compared include: EYE (raw human eye-tracking order), EYE++ (eye order with position features added), DEFAULT-OCR, Z-ORDER, XYLayout rule-based sorting, and the four machine-generated model orders.
What are the outcomes/conclusions?
The results suggest a counterintuitive finding: true human eye-tracking order (EYE) does not consistently produce better downstream performance than simpler alternatives. Across BERT, LayoutLMv2, and LayoutLMv3, the Z-ORDER heuristic (top-to-bottom, left-to-right coordinate sorting) frequently achieves the highest or near-highest F1 on SER tasks and ANLS on VQA.
For example, under LayoutLMv3 on WEAK, Z-ORDER reaches an F1 of 93.73%, while EYE achieves only 90.97% and EYE++ 91.33%. The model-generated order (MODEL-T+B+I) reaches 93.56%, close to Z-ORDER even though the comparator was trained on eye-tracking data. The pattern holds on STRUCTURED and INFOGRAPH as well.
The authors interpret this as evidence that document AI models have learned implicit reading order assumptions aligned with simple coordinate-based rules rather than natural human gaze patterns. The eye-tracking order captures regressions, local backtracking, and cross-modal jumps that do not benefit sequence models.
Key conclusions:
- Document AI models still cannot read VRDs as flexibly as humans
- Human reading order is nuanced (four distinct patterns) and encodes more cognitive strategy than current models can exploit
- Simple rule-based sorting often suffices for current architectures
- The DocTrack dataset provides a resource for studying the human-machine reading order gap more deeply
Limitations acknowledged by the authors. Due to high annotation cost, most documents were labeled by only one participant (with a small overlapping subset assigned to two participants). The authors state that “the inner-agreement rate is not available for the current dataset” because full multi-annotator labeling was not performed. The study also ignores fixation duration, back-gaze frequency, and fixation count, which may carry additional signal.
Reproducibility
Models
The atomic comparison models use standard pretrained weights: BERT-base for the Text model, LayoutLM for Text+Box, and LayoutLMv2 for Text+Box+Image. No custom architecture is introduced. Exact model sizes and layer configurations are not specified in the paper beyond the pretrained model names.
Algorithms
The preordering algorithm is a pairwise comparison wrapped in bubble sort (Algorithm 1 in the paper). The atomic comparison takes two bounding box representations and predicts binary precedence. No optimizer, learning rate, batch size, or training duration details are provided in the paper.
Data
- DocTrack is hosted at https://github.com/hint-lab/doctrack
- The dataset builds on document images from FUNSD (CC-BY-4.0), SeaBill (license not stated in this paper), and InfographicVQA (license not stated in this paper)
- Eye-tracking annotations were collected under institutional ethics approval with informed participant consent; participants approved data release for research purposes
- The repository carries an Apache-2.0 license, but the README includes a non-commercial restriction note: “The DocTrack dataset should only be used for non-commercial research purposes. For any person/institution/company working on this direction, please contact us for a commercial license.” This conflicts with the Apache-2.0 header; treat as research-only until clarified with the authors.
- Five participants labeled the data; the dataset was divided into five disjoint parts, with each participant assigned one part. Two of the five participants were additionally assigned an overlapping subset; final labels were selected by majority vote across all five participants. The paper’s Limitations section notes that “due to the high annotation cost, the annotation has not been done by multiple annotators. Therefore, the inner-agreement rate is not available for the current dataset.”
- Missing gaze rates vary substantially by document type (9.55% for INFOGRAPH to 38.16% for WEAK)
Evaluation
- Intrinsic: Kendall’s $\tau$ and Spearman’s $\rho$ correlation between predicted and human reading order
- Extrinsic SER: precision, recall, F1 on semantic entity recognition
- Extrinsic VQA: ANLS on InfographicVQA subset
- Baselines include DEFAULT-OCR, Z-ORDER, XYLayout, and human eye-tracking conditions
- No error bars or significance tests are reported
- The paper does not specify the number of training runs or random seeds
Hardware
No training hardware, GPU specifications, training time, or inference costs are reported in the paper.
BibTeX
@inproceedings{wang-etal-2023-doctrack,
title = "{D}oc{T}rack: A Visually-Rich Document Dataset Really Aligned with Human Eye Movement for Machine Reading",
author = "Wang, Hao and Wang, Qingxuan and Li, Yue and Wang, Changqing and Chu, Chenhui and Wang, Rui",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.344/",
doi = "10.18653/v1/2023.findings-emnlp.344",
pages = "5176--5189"
}
LayoutMask: Local 1D Position and Dual Masking for Text-Layout Pre-training
TL;DR
LayoutMask is a text-and-layout-only pre-training model for visually-rich document understanding (VrDU). It replaces the global 1D token positions used by prior models with local, in-segment positions, forcing reading order to be inferred jointly from 1D and 2D layout signals. Two objectives drive pre-training: a Masked Language Model augmented with Whole Word Masking and Layout-Aware Masking, and a new Masked Position Modeling task that recovers masked 2D bounding boxes via GIoU loss. The authors report the base model leads all prior methods benchmarked on FUNSD, including those that incorporate image modality.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the paper proposes a new pre-training architecture and two new pre-training objectives. The central claim is that local 1D position, combined with targeted masking strategies, produces more adaptive and robust document representations than global reading-order positions. The main content is an ablation grid across position choices and pre-training configurations, with baseline comparison tables across multiple VrDU benchmarks.
Secondary: $\Psi_{\text{Evaluation}}$: the authors conduct a robustness analysis by simulating segment-swap layout disturbances on held-out test sets, quantifying the sensitivity of global vs. local 1D positions to OCR ordering errors.
What is the motivation?
Pre-trained models for VrDU tasks (forms, receipts, document classification) routinely encode reading order as a globally ascending sequence of 1D position integers: token 0, 1, 2, …, 511 across the whole document. This representation carries two structural problems.
First, plain text has a single linear reading order, but document layouts do not. A receipt with a vertical price column alongside a horizontal total row cannot be faithfully serialized by a single linear ordering. Any such ordering is an approximation, and different OCR tools or heuristic rules may produce different approximations for the same document.
Second, the global order depends on consistent, stable OCR results and on empirical serialization rules (e.g., top-down, left-right). When OCR segments are detected out of order due to document rotation or scanning artifacts, the global 1D positions encode incorrect cross-segment precedence, which then contaminates fine-tuned models at inference time.
The authors argue that requiring a model to infer global reading order from local and 2D positional signals, rather than receiving it as an explicit input, forces richer text-layout interactions and produces representations that generalize better across layout styles.
What is the novelty?
Local 1D Position
Rather than assigning each token a globally unique position index reflecting its rank in the document-wide reading order, LayoutMask assigns positions that restart at 1 for each OCR segment. A segment might be a line or a bounding box returned by the OCR engine. Within a segment, words are ordered 1, 2, 3, … as usual. Across segments, no ordering information is conveyed by the 1D position.
The model is instead expected to recover cross-segment ordering from segment-level 2D bounding boxes (shared by all tokens in a segment) and from semantic context. Table 1 in the paper compares position choices across prior work; LayoutMask is the only model using local 1D position.
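The position change is easy to pin down in code. A minimal sketch contrasting global and local 1D positions, assuming tokens arrive grouped by OCR segment (the data layout is our assumption):

```python
def global_1d_positions(segments):
    """Globally ascending positions across the whole document (prior work)."""
    out, pos = [], 1
    for seg in segments:
        out.append(list(range(pos, pos + len(seg))))
        pos += len(seg)
    return out

def local_1d_positions(segments):
    """LayoutMask-style positions: restart at 1 inside every OCR segment,
    so the 1D channel carries no cross-segment ordering signal."""
    return [list(range(1, len(seg) + 1)) for seg in segments]

segs = [["Total", ":"], ["$", "12", ".", "50"]]
print(global_1d_positions(segs))  # [[1, 2], [3, 4, 5, 6]]
print(local_1d_positions(segs))   # [[1, 2], [1, 2, 3, 4]]
```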
Masked Language Modeling with Two Masking Strategies
The standard MLM loss averages cross-entropy over masked tokens:
$$\mathcal{L}_{\text{mlm}} = -\frac{1}{M}\sum_{i=1}^{M} \text{CE}(y_i, \hat{y}_i)$$
where $M$ is the number of masked tokens and $y_i$, $\hat{y}_i$ are the ground truth and predicted token for position $i$.
Whole Word Masking (WWM) applies masks at word level rather than subword level. All subword tokens of a word are masked together, which removes within-word context and forces the model to rely on surrounding words and layout signals.
Layout-Aware Masking (LAM) targets cross-segment boundaries. Because local 1D positions do not encode cross-segment order, the model has to infer that order from segment 2D positions and semantics. LAM increases the masking probability for the first and last word of each segment from $P_{\text{mlm}}$ to $3 \times P_{\text{mlm}}$, specifically forcing the model to reason about what precedes or follows a given segment.
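One way to realize the two masking strategies together is sketched below: decisions are made per word (under WWM, all subwords of a chosen word are masked together), with the boundary boost applied per segment. The clipping at 1.0 is our own safeguard, not from the paper.

```python
import random

def lam_word_mask(segments, p_mlm=0.25, boost=3.0):
    """Per-word masking decisions with Layout-Aware Masking.

    segments: list of segments, each a list of words.
    Returns a parallel structure of booleans; True means "mask this word"
    (with WWM, all of the word's subword tokens are masked together).
    """
    decisions = []
    for seg in segments:
        seg_dec = []
        for k in range(len(seg)):
            # First and last word of each segment get 3 * P_mlm.
            p = p_mlm * boost if k in (0, len(seg) - 1) else p_mlm
            seg_dec.append(random.random() < min(p, 1.0))
        decisions.append(seg_dec)
    return decisions
```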
Masked Position Modeling
Masked Position Modeling (MPM) is an auxiliary objective that predicts the 2D bounding box of a randomly selected word whose spatial information has been removed. The procedure involves two steps:
- Box Split: The selected word is separated from its segment. The segment is split into two or three pieces around the selected word, and local 1D positions are recomputed for each piece.
- Box Masking: The word’s 2D position is replaced with a pseudo-box $[0, 0, 0, n]$ where $n$ is a random integer, acting as a unique identifier to distinguish multiple masked positions.
The model predicts the original 2D box for each masked word. The MPM loss is GIoU averaged over $N$ masked positions:
$$\mathcal{L}_{\text{mpm}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{|B_i \cap \hat{B}_i|}{|B_i \cup \hat{B}_i|} - \frac{|C_i \setminus (B_i \cup \hat{B}_i)|}{|C_i|} \right)$$
where $B_i$ is the ground-truth box (normalized to $[0,1]$), $\hat{B}_i$ is the predicted box, and $C_i$ is the smallest convex shape covering both. The combined training objective is:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{mlm}} + \lambda \mathcal{L}_{\text{mpm}}$$
with $\lambda = 1$ in practice.
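The GIoU term is mechanical to compute for axis-aligned boxes; a sketch under the paper's normalization (coordinates in $[0, 1]$) follows. Whether training minimizes $1 - \text{GIoU}$ or uses the term with the sign as written is a convention we leave to the paper.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x0, y0, x1, y1) in [0, 1]."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Intersection area
    iw = max(min(ax1, bx1) - max(ax0, bx0), 0.0)
    ih = max(min(ay1, by1) - max(ay0, by0), 0.0)
    inter = iw * ih

    # Union area
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter

    # Smallest enclosing box C
    area_c = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))

    iou = inter / union if union > 0 else 0.0
    return iou - (area_c - union) / area_c if area_c > 0 else iou
```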
What experiments were performed?
Pre-training
LayoutMask uses a spatial-aware self-attention backbone from LayoutLMv2. It is initialized from XLM-RoBERTa weights and pre-trained on 10 million pages sampled from the IIT-CDIP Test Collection (42 million scanned pages total). OCR is performed with PaddleOCR. Two sizes are trained: Base (12 layers, 16 heads, hidden size 768, 182M parameters) and Large (24 layers, 16 heads, hidden size 1024, 404M parameters). Masking probabilities are $P_{\text{mlm}} = 25\%$ and $P_{\text{mpm}} = 15\%$.
Downstream Tasks
Form and receipt understanding (entity extraction / named entity recognition): FUNSD (199 forms), CORD (1000 receipts, 30 entity types), and SROIE (973 receipts, 4 entity types). Word-level F1 for FUNSD and CORD; entity-level F1 for SROIE. Each experiment is repeated 10 times; the paper reports mean and standard error.
Document image classification: RVL-CDIP (400,000 images, 16 categories; 320k train / 40k val / 40k test). Overall classification accuracy is the metric.
Ablation Studies
The authors systematically ablate:
- 1D position type (global vs. local) crossed with 2D position type (word-level vs. segment-level), reported as average F1 across FUNSD, CORD, and SROIE (Table 4).
- All combinations of the four training components: no pre-training, MLM only, MLM+WWM, MLM+LAM, MLM+WWM+LAM+MPM (Table 6 across all four benchmarks).
- Robustness to segment-swap perturbations: for Global+Segment models, segments in the same line are swapped with probability $P_{\text{swap}} \in \{10\%, 20\%, 30\%\}$ at test time (Table 5).
- Masking probabilities $P_{\text{mlm}}$ and $P_{\text{mpm}}$ (Appendix A).
What are the outcomes/conclusions?
Main Results
On FUNSD, CORD, and SROIE (entity extraction), the authors report LayoutMask Base achieves 92.91, 96.99, and 96.87 F1 respectively, exceeding all prior base models the paper benchmarks against, including those that incorporate image modality (e.g., LayoutLMv3 Base at 90.29 on FUNSD). LayoutMask Large leads on FUNSD (93.20) and is competitive on CORD and SROIE.
On RVL-CDIP (document classification), LayoutMask Base and Large reach 93.26% and 93.80% accuracy. These results are competitive with text-only and text+layout baselines, but trail trimodal (T+L+I) models. The authors attribute this gap to RVL-CDIP containing figures, table lines, and orientation anomalies that cannot be captured without image features.
Ablation Findings
The Local+Segment position combination is consistently best or near-best. The performance advantage of local over global 1D is most pronounced on FUNSD (where complex spatial layouts dominate) and on SROIE for the “Total” entity (which appears in mixed vertical/horizontal layouts with many semantically identical number strings nearby).
Adding WWM, LAM, and MPM each contributes incremental gains across all four benchmarks. The full model improves over the naive version by 3.18% on FUNSD, 0.67% on CORD, 1.11% on SROIE, and 1.09% on RVL-CDIP.
The robustness experiment shows that global 1D positions degrade substantially under segment-swap perturbations: the “Address” entity on SROIE drops by 4.81%, 6.51%, and 8.42% at swap rates of 10%, 20%, and 30%. Local 1D positions are not affected by this perturbation because they carry no cross-segment ordering signal.
Limitations
The authors identify two limitations. First, the evaluation datasets are small and domain-limited, so it is unclear how well the conclusions transfer to more diverse real-world document collections. Second, LayoutMask deliberately omits image modality, leaving a gap on tasks where visual elements (figures, ruling lines, handwriting) carry essential information not captured by OCR. Integrating image modality within the local-position framework is left as future work.
Reproducibility
Models
LayoutMask is a transformer with spatial-aware self-attention (following LayoutLMv2). The Base variant has 12 layers, 16 heads, and hidden dimension 768 (182M parameters). The Large variant has 24 layers, 16 heads, and hidden dimension 1024 (404M parameters). Both are initialized from XLM-RoBERTa (Base and Large respectively). No model weights appear to have been publicly released with the paper.
Algorithms
Pre-training uses the combined loss $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{mlm}} + \lambda \mathcal{L}_{\text{mpm}}$ with $\lambda = 1$, $P_{\text{mlm}} = 25\%$, and $P_{\text{mpm}} = 15\%$. Masking is applied at word level for both MLM and MPM. LAM triples the masking probability for the first and last words of each segment. Optimizer, batch size, learning rate schedule, and total training steps are not detailed in the main paper or appendix.
Data
Pre-training uses 10 million pages sampled from the IIT-CDIP Test Collection. OCR is produced by PaddleOCR. IIT-CDIP is publicly available for research use. No data filtering or augmentation details beyond OCR are provided. Downstream evaluation uses FUNSD, CORD, SROIE, and RVL-CDIP, all publicly available benchmarks.
Evaluation
Entity extraction uses word-level F1 (FUNSD, CORD) and entity-level F1 (SROIE). Results are averaged over 10 runs with standard errors reported; this is more statistically careful than most prior work in the space, which typically reports single runs. Document classification uses top-1 accuracy on RVL-CDIP. No cross-dataset evaluation or domain generalization tests are performed.
Hardware
Hardware details are not reported in the paper or appendix.
BibTeX
@inproceedings{tu-etal-2023-layoutmask,
title = "{L}ayout{M}ask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding",
author = "Tu, Yi and
Guo, Ya and
Chen, Huan and
Tang, Jinyang",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2023",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.847",
doi = "10.18653/v1/2023.acl-long.847",
}
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
TL;DR
ERNIE-Layout is a multi-modal pre-trained model for visually-rich document understanding that treats layout as a first-class modality rather than a positional add-on. The two core contributions are: (1) correcting the serialization order of document tokens using a layout-aware parser before pre-training, coupled with a reading order prediction pre-training task; and (2) a spatial-aware disentangled attention mechanism that incorporates 1D and 2D relative positions directly into attention weight computation. On downstream key information extraction, document question answering, and document image classification tasks, the large variant improves substantially over prior methods.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper introduces a new model architecture, attention mechanism, pre-training objectives, and serialization pipeline. Performance on VrDU benchmarks is the central evaluation, with ablations isolating each design choice.
Secondary: $\Psi_{\text{Evaluation}}$: The ablation study systematically measures the contribution of each pre-training task and attention variant, providing quantitative comparisons across serialization strategies.
What is the motivation?
Visually-rich document understanding (VrDU) involves forms, invoices, receipts, and other documents where text, spatial layout, and visual appearance jointly determine meaning. Models like LayoutLM and LayoutLMv2 showed that incorporating 2D bounding box coordinates can substantially improve over text-only baselines.
The authors identify two systematic gaps in prior approaches. First, serialization: existing methods pass OCR output to the model in raster-scan order (left to right, top to bottom), which is incorrect for documents with multi-column layouts, tables, or parallel text blocks. Human reading order groups words by their logical region first, then proceeds from region to region. Raster-scan serialization breaks this grouping and forces the model to learn reading order from corrupted sequences. Second, attention integration: prior work treats layout coordinates as extra input embeddings (appended to token representations) rather than as an independent factor in pairwise token similarity. The cross-modal interaction between layout and content is therefore shallow.
What is the novelty?
ERNIE-Layout addresses both gaps through a unified pre-training framework with four interconnected components.
Serialization module. Before training, documents are passed through Document-Parser, an in-house layout analysis toolkit (the paper describes it conceptually and compares it against the open-source Layout-Parser). Document-Parser detects paragraph boundaries, tables, and figures, then applies element-type-specific heuristics to assign a logical reading order. The result is that tokens belonging to the same table cell or paragraph stay contiguous, even when their pixel positions would place them in an interleaved raster order. The authors validate this with perplexity measured by GPT-2: for complex-layout documents, Document-Parser serialization reduces perplexity substantially compared to raster-scan order.
Input representation. Each token (textual or visual) receives three additive embeddings:
$$T = E_{tk}(T) + E_{1p}(T) + E_{tp}(T)$$
$$V = F_{vs}(V) + E_{1p}(V) + E_{tp}([V])$$
$$L = E_{2x}(x_0, x_1, w) + E_{2y}(y_0, y_1, h)$$
The final combined representation is:
$$H = [\,T + L\,;\; V + L\,]$$
Visual tokens come from a Faster-RCNN visual encoder that produces a $7 \times 7 = 49$-token sequence from a $224 \times 224$ image.
Spatial-aware disentangled attention. Inspired by DeBERTa’s disentangled attention, ERNIE-Layout decomposes each pairwise attention score into four components: content-to-content, content-to-1D-position, content-to-2D-x, and content-to-2D-y. Relative position indices are computed for each axis with a clipping range $k$:
$$\delta_{1p}(i, j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise} \end{cases}$$
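A direct transcription of the clipping function (the value of $k$ below is a placeholder, not the paper's setting):

```python
def delta(i, j, k=128):
    """Clipped relative-position index: maps i - j into {0, ..., 2k - 1},
    saturating distances beyond +/- k. k = 128 is a placeholder value."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```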
The four attention score components are:
$$A_{ij}^{ct,ct} = Q_i^{ct} K_j^{ct,\mathsf{T}}$$
$$A_{ij}^{ct,1p} = Q_i^{ct} K_{\delta_{1p}(i,j)}^{1p,\mathsf{T}} + K_j^{ct} Q_{\delta_{1p}(j,i)}^{1p,\mathsf{T}}$$
$$A_{ij}^{ct,2x} = Q_i^{ct} K_{\delta_{2x}(i,j)}^{2x,\mathsf{T}} + K_j^{ct} Q_{\delta_{2x}(j,i)}^{2x,\mathsf{T}}$$
$$A_{ij}^{ct,2y} = Q_i^{ct} K_{\delta_{2y}(i,j)}^{2y,\mathsf{T}} + K_j^{ct} Q_{\delta_{2y}(j,i)}^{2y,\mathsf{T}}$$
The final attention matrix and output are:
$$A_{ij} = A_{ij}^{ct,ct} + A_{ij}^{ct,1p} + A_{ij}^{ct,2x} + A_{ij}^{ct,2y}$$
$$H_{out} = \text{softmax}\!\left(\frac{\hat{A}}{\sqrt{3d}}\right) V^{ct}$$
This means layout directly influences which pairs of tokens attend strongly to each other, rather than layout affecting only the query/key vectors through the input embedding.
Pre-training tasks. ERNIE-Layout uses four tasks jointly:
Reading Order Prediction (ROP): given the attention matrix $\hat{A}$, the model learns to assign $A_{ij}$ the additional meaning of “probability that token $j$ is the next token after token $i$ in reading order.” Ground truth is a binary matrix $G$ where $G_{ij} = 1$ when there is a reading-order successor relationship. The loss is:
$$\mathcal{L}_{\text{ROP}} = -\sum_{0 < i < N} \sum_{0 < j < N} G_{ij} \log(\hat{A}_{ij})$$
Replaced Region Prediction (RRP): 10% of visual patches are replaced with patches from another document image. The [CLS] vector is used to predict which patches were replaced, trained with binary cross-entropy:
$$\mathcal{L}_{\text{RRP}} = -\sum_{0 \le i < HW} \left[ G_i \log(P_i) + (1 - G_i) \log(1 - P_i) \right]$$
Masked Visual-Language Modeling (MVLM) and Text-Image Alignment (TIA) are carried over from LayoutLMv2. The overall pre-training objective sums all four losses:
$$\mathcal{L} = \mathcal{L}_{\text{ROP}} + \mathcal{L}_{\text{RRP}} + \mathcal{L}_{\text{MVLM}} + \mathcal{L}_{\text{TIA}}$$
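As a concrete reference for the ROP term, a NumPy sketch assuming $\hat{A}$ has been row-normalized into next-token probabilities and $G$ is the binary successor matrix (array shapes are our assumption):

```python
import numpy as np

def rop_loss(attn_probs, successor, eps=1e-9):
    """Reading Order Prediction loss.

    attn_probs: (N, N) matrix; entry (i, j) read as the probability that
        token j is the next token after token i in reading order.
    successor:  (N, N) binary ground-truth matrix G.
    eps: numerical floor so log never sees zero.
    """
    return -float(np.sum(successor * np.log(attn_probs + eps)))
```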
What experiments were performed?
Pre-training data. The model is pre-trained on 10 million scanned document pages sampled from the IIT-CDIP Test Collection, a large tobacco industry document archive. The layout-aware serialization is applied during pre-training. Notably, the fine-tuning datasets retain their original raster-scan OCR order, so the serialization benefit comes entirely from improved pre-training representations.
Downstream tasks and datasets. Three task types are evaluated across six datasets:
- Key information extraction (sequence labeling): FUNSD, CORD, SROIE, Kleister-NDA. Metric: entity-level F1.
- Document question answering: DocVQA. Metric: Average Normalized Levenshtein Similarity (ANLS).
- Document image classification: RVL-CDIP. Metric: accuracy.
Baselines. The comparisons include text-only models (BERT-large, RoBERTa-large, UniLMv2-large) and multi-modal document models (LayoutLM, LayoutLMv2, StructuralLM, DocFormer, TILT).
Ablations. Table 6 in the paper shows the incremental effect of adding RRP, ROP, and switching attention mechanisms. Table 7 compares raster-scan, Layout-Parser, and Document-Parser serialization on FUNSD and CORD.
What are the outcomes/conclusions?
On key information extraction, ERNIE-Layout-large achieves F1 of 0.9312 on FUNSD (7.98 points above the previous best), 0.9721 on CORD, 0.9755 on SROIE, and 0.8810 on Kleister-NDA. On document image classification (RVL-CDIP), accuracy is 0.9627, edging past StructuralLM. On DocVQA, ANLS is 0.8321 with the train split alone, and 0.8841 on the leaderboard with ensemble; this is competitive but not uniformly best, which the authors attribute to ERNIE-Layout being initialized from RoBERTa while LayoutLMv2 starts from the stronger UniLMv2 question-answering backbone.
The ablation results are informative. Adding RRP improves FUNSD F1 by about 0.95 points over the MVLM + TIA baseline. Adding ROP on top of that adds roughly 1.3 more points. Switching from LayoutLMv2’s spatial-aware self-attention to ERNIE-Layout’s disentangled variant adds a further gain. The serialization ablation suggests that layout-aware parsing helps even at fine-tuning time when the fine-tuning data itself is still raster-scanned, confirming that the pre-training representation carries useful reading-order knowledge.
The base-size model (12 layers, 768 hidden, initialized from RoBERTa-base) reproduces the same pattern and on several datasets outperforms large-scale baselines.
One limitation is that Document-Parser is a proprietary internal Baidu toolkit. While the paper describes it conceptually and compares it against open-source Layout-Parser, the exact implementation is not released. Downstream fine-tuning also uses raster-scan OCR rather than the improved serializer, which limits the benefit to pre-training only. The authors do not report error bars or statistical significance tests for the main results, only for the FUNSD ablation where they average five random seeds.
Reproducibility
Models
- ERNIE-Layout-large: 24 transformer layers, 1024 hidden units, 16 attention heads. Initialized from RoBERTa-large.
- ERNIE-Layout-base: 12 transformer layers, 768 hidden units, 12 attention heads. Initialized from RoBERTa-base.
- Visual encoder: Faster-RCNN, initialized from a pretrained Faster-RCNN checkpoint. Output is an adaptive-pooled $7 \times 7 = 49$ token sequence.
- Code and model weights are available via PaddleNLP at the artifact URL above (Apache-2.0 license).
Algorithms
- Optimizer: Adam with learning rate 1e-4, weight decay 0.01 during pre-training; learning rate 2e-5 with weight decay 0.01 during fine-tuning.
- Learning rate schedule: linear warm-up over the first 10% of steps, then linear decay to zero.
- Pre-training batch size: 576; trained for 20 epochs.
- Maximum textual sequence length: 512 tokens; visual sequence length: 49 tokens.
- Fine-tuning hyperparameters (epochs, batch size, weight decay) vary per dataset and are reported in Table 2 of the paper.
- No mixed precision or gradient clipping details are reported.
Data
- Pre-training: 10 million pages sampled from the IIT-CDIP Test Collection (tobacco documents), serialized with Document-Parser.
- Fine-tuning: FUNSD (199 documents), CORD (1000 receipts), SROIE (973 receipts), Kleister-NDA (540 documents), RVL-CDIP (400k document images), DocVQA (50k questions on 12,767 images).
- OCR: for FUNSD, CORD, SROIE, and Kleister-NDA, the official OCR annotations are used. For RVL-CDIP and DocVQA, Microsoft OCR tools are used.
- Fine-tuning data is not re-serialized; only pre-training data uses Document-Parser ordering.
Evaluation
- Key information extraction: entity-level F1 (BIO labeling, sequence labeling framework).
- Document QA: ANLS, as defined in the DocVQA competition.
- Document classification: accuracy on RVL-CDIP (16 classes).
- The main results compare model families at the large scale. Base-scale results are in the appendix.
- Statistical rigor: five-run averaging with standard deviation reported only for the FUNSD ablation (standard deviation 0.0011 for the best-vs-prior comparison). No significance tests for other results.
- The Kleister-NDA test set is not publicly available; results are reported on the validation set.
Hardware
- Pre-training: 24 Tesla A100 GPUs, 20 epochs. Total GPU-hours are not stated.
- Fine-tuning hardware is not reported.
- Inference requirements are not discussed.
BibTeX
@inproceedings{peng-etal-2022-ernie,
title = "{ERNIE}-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding",
author = "Peng, Qiming and Pan, Yinxu and Wang, Wenjin and Luo, Bin and Zhang, Zhenyu and Huang, Zhengjie and Cao, Yuhui and Yin, Weichong and Chen, Yongfeng and Zhang, Yin and Feng, Shikun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.274/",
doi = "10.18653/v1/2022.findings-emnlp.274",
pages = "3744--3756"
}
LayoutReader: Pre-training of Text and Layout for Reading Order
TL;DR
This paper introduces ReadingBank, a 500k-page benchmark for reading order detection built by automatically extracting word order from DocX XML metadata and aligning it to rendered PDF bounding boxes via a color-based watermarking scheme. The accompanying LayoutReader model, a seq2seq architecture built on LayoutLM, achieves near-perfect reading order detection (0.98 BLEU) and improves text line ordering for both open-source and commercial OCR engines.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is ReadingBank, the first large-scale benchmark for reading order detection. The automated data pipeline (DocX XML parsing + color-based alignment) is the core novelty that enables everything else: without it, there is no dataset, and without the dataset, the model cannot be trained.
Secondary: $\Psi_{\text{Method}}$, $\Psi_{\text{Impact}}$
The authors also propose LayoutReader, a seq2seq model using LayoutLM as an encoder with a pointer-network-style decoder for predicting reading order permutations ($\Psi_{\text{Method}}$). They further demonstrate practical value by using the model to reorder OCR engine outputs, improving line-level ordering for both Tesseract and a commercial engine ($\Psi_{\text{Impact}}$).
What is the motivation?
Reading order detection is a prerequisite for document understanding tasks such as extraction, summarization, and question answering.
- The problem: Traditional OCR engines typically output text in top-to-bottom, left-to-right order. This heuristic fails on complex layouts like multi-column articles, forms, and invoices.
- The gap: Deep learning approaches were limited by the lack of large-scale supervision. Manual annotation of reading order is prohibitively expensive (it requires full page context and careful sequential labeling). Prior deep learning efforts (Li et al., 2020a) used small in-house datasets that were not publicly available for comparison.
- The insight: The reading order of Microsoft Word documents is already embedded in their DocX XML metadata. By converting DocX files to PDFs and aligning the XML-derived word sequence to PDF bounding boxes, one can obtain large-scale reading order supervision at near-zero annotation cost.
What is the novelty?
1. Dataset Construction (ReadingBank)
The core innovation is the data pipeline.
- Source: DocX files crawled from the internet, where the XML structure preserves the logical reading order (paragraphs traversed line-by-line, tables traversed cell-by-cell).
- Alignment scheme: To map XML text nodes to visual bounding boxes (from a rendered PDF), the authors use a color-based watermarking technique. Each word is rendered with a unique RGB color corresponding to its appearance index $i$ in the reading sequence:
$$ \begin{aligned} r &= i \mathbin{\&} \text{0x110000} \\ g &= i \mathbin{\&} \text{0x001100} \\ b &= i \mathbin{\&} \text{0x000011} \end{aligned} $$
This resolves duplicate words (e.g., multiple instances of “the”) by their color-coded appearance index, creating a 1:1 mapping between logical order and spatial layout. Each word-index pair $(w, i)$ in the DocX is matched to its PDF counterpart $(w', c, x_0, y_0, x_1, y_1, W, H)$ subject to $w = w'$ and $c = C(i)$; a round-trip sketch of one plausible encoding follows this list.
- Scale: 500,000 document pages split 8:1:1 into training (400k), validation (50k), and test (50k) sets. The average page contains approximately 196 words.
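If the bit masks above are read as carving the appearance index into one byte per RGB channel (our reading; the paper's exact constants may differ), the encode/decode round trip is:

```python
def index_to_rgb(i):
    """Encode a word's appearance index as an RGB triple, one byte per
    channel (covers indices up to 2**24 - 1; assumed interpretation)."""
    return ((i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF)

def rgb_to_index(r, g, b):
    """Recover the appearance index from a rendered word's color."""
    return (r << 16) | (g << 8) | b

assert rgb_to_index(*index_to_rgb(196)) == 196
```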
2. LayoutReader Model
The model treats reading order as a sequence generation task.
- Encoder: LayoutLM (initialized from pre-trained weights) encodes text tokens alongside their 2D bounding box coordinates. Source and target segments are packed into one contiguous input sequence.
- Attention mask: A custom self-attention mask $M$ allows full bidirectional attention within the source segment while enforcing causal (left-to-right) attention in the target segment:
$$ M_{i,j} = \begin{cases} 1 & \text{if } j \le i \text{ or } i, j \in \text{src} \\ 0 & \text{otherwise} \end{cases} $$
- Decoder (pointer network): Instead of generating vocabulary tokens, the decoder predicts an index into the source sequence. The probability of selecting source index $i$ at decoding step $k$ is:
$$ P(x_k = i \mid x_{<k}) = \frac{\exp(e_i^T h_k + b_k)}{\sum_j \exp(e_j^T h_k + b_k)} $$
where $e_i$ and $e_j$ are source input embeddings, $h_k$ is the hidden state at step $k$, and $b_k$ is a bias term. This constrains predictions to valid source indices rather than an open vocabulary.
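A sketch of one greedy decoding step, assuming plain NumPy arrays for the source embeddings and decoder state (masking of already-emitted indices is our addition; the paper does not specify it):

```python
import numpy as np

def pointer_step(src_embeddings, h_k, b_k=0.0, already_emitted=()):
    """One pointer-network decoding step: choose the source index with the
    highest probability under P(x_k = i | x_<k).

    src_embeddings: (n_src, d) matrix of source input embeddings e_i.
    h_k:            (d,) decoder hidden state at step k.
    b_k:            scalar bias term at step k.
    """
    logits = src_embeddings @ h_k + b_k      # e_i^T h_k + b_k
    logits[list(already_emitted)] = -np.inf  # forbid repeating an index
    probs = np.exp(logits - logits.max())    # stable softmax over source indices
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```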
What experiments were performed?
Training details:
- 500k pages (400k train, 50k val, 50k test)
- 4 $\times$ Tesla V100 GPUs, batch size 4 per GPU
- 3 epochs (~6 hours), AdamW optimizer, learning rate $7 \times 10^{-5}$, 500 warm-up steps
Evaluation metrics:
- Average Page-level BLEU: Measures n-gram overlap between the predicted reading order and the ground truth sequence, computed per page and then averaged.
- Average Relative Distance (ARD): Measures positional displacement between elements in the predicted and ground truth sequences, with a penalty $n$ (sequence length) for omitted elements:
$$ s(e_k, B) = \begin{cases} |k - I(e_k, B)| & \text{if } e_k \in B \\ n & \text{otherwise} \end{cases} $$
$$ \text{ARD}(A, B) = \frac{1}{n} \sum_{e_k \in A} s(e_k, B) $$
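A direct transcription of ARD, assuming both sequences are lists of hashable element ids:

```python
def ard(predicted, gold):
    """Average Relative Distance between predicted sequence A and ground
    truth B; an element of A absent from B is penalized with n = len(A)."""
    n = len(predicted)
    gold_index = {e: k for k, e in enumerate(gold)}
    total = sum(
        abs(k - gold_index[e]) if e in gold_index else n
        for k, e in enumerate(predicted)
    )
    return total / n
```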
Key experiments:
- Reading order detection: Full model vs. modality ablations (text-only with BERT or UniLM, layout-only with LayoutLM minus token embeddings) and a heuristic baseline (left-to-right, top-to-bottom sorting).
- Input order study: Shuffling a proportion $r$ of training inputs (r = 0%, 50%, 100%) to test whether the model relies on the default OCR ordering or genuinely learns layout-based reading order.
- OCR adaptation: Reordering the text line output of Tesseract and a commercial OCR engine using LayoutReader, evaluated on a line-level adaptation of ReadingBank.
What are the outcomes/conclusions?
Layout signals dominate
The layout-only variant significantly outperforms text-only baselines, confirming that spatial features (bounding box coordinates) are the primary signal for reading order detection.
| Model | Encoder | BLEU $\uparrow$ | ARD $\downarrow$ |
|---|---|---|---|
| Heuristic (L-R, T-B) | n/a | 0.6972 | 8.46 |
| LayoutReader (text only) | BERT | 0.8510 | 12.08 |
| LayoutReader (text only) | UniLM | 0.8765 | 10.65 |
| LayoutReader (layout only) | LayoutLM (layout only) | 0.9732 | 2.31 |
| LayoutReader (full) | LayoutLM | 0.9819 | 1.75 |
Note that the text-only variants actually have worse ARD than the heuristic baseline despite better BLEU. The authors attribute this to severe ARD penalties for token omission: the text-only models produce roughly correct orderings but fail to generate complete sequences.
Robustness to input order
The layout-only and full models are nearly unaffected by input shuffling (BLEU drops from 0.9732 to 0.9701 for layout-only at r=100%), while text-only models collapse (BLEU drops from 0.8765 to 0.3440 for UniLM at r=100%). This confirms the model learns spatial reading order rules rather than memorizing the default OCR sequence.
Practical OCR improvement
Adapting LayoutReader to reorder text lines from commercial OCR engines yielded improvements in line-level BLEU, validating the model’s potential as a post-processing module for production pipelines.
Limitations
- Synthetic ground truth: The “ground truth” is defined by the DocX structure. Users can create DocX files with logical structures that do not match the visual reading order (e.g., floating text boxes). The model learns the authoring order, which is a proxy for, but not identical to, reading order.
- Licensing ambiguity: The ReadingBank repository presents conflicting information, citing purely academic usage limitations despite an Apache 2.0 license file. This complicates commercial adoption.
- Language restriction: The dataset is strictly filtered to English documents, limiting multilingual generalization.
- No scanned document evaluation: All documents are born-digital (DocX-to-PDF conversions). Performance on noisy scanned documents with OCR errors is not evaluated.
Reproducibility
Models
- LayoutReader uses LayoutLM-base as the encoder (approximately 113M parameters). Pre-trained weights are released via Google Drive, though the license is unspecified.
- Text-only ablations substitute BERT-base or UniLM-base for the encoder.
- The decoder is implemented as a modified seq2seq layer from the s2s-ft toolkit in the UniLM repository.
Algorithms
- Optimizer: AdamW with initial learning rate $7 \times 10^{-5}$ and 500 warm-up steps
- Training: 3 epochs, batch size 4 per GPU (effective batch size 16 across 4 GPUs)
- Loss: Standard cross-entropy over pointer indices (not explicitly stated but implied by the seq2seq formulation)
Data
- Source: DocX files crawled from the internet, filtered for English using Azure Text Analytics API with a high confidence threshold
- Size: 500,000 pages (400k/50k/50k train/val/test split), pages with fewer than 50 words excluded
- Availability: Released at github.com/doc-analysis/ReadingBank under Apache 2.0, though the repository text restricts usage to research purposes, creating ambiguity
- Annotation: Fully automated via the DocX XML + coloring scheme pipeline (no human annotation)
Evaluation
- Metrics: Average Page-level BLEU and ARD (defined above)
- Baselines: Heuristic (L-R, T-B), text-only (BERT, UniLM), layout-only ablation
- Limitations acknowledged by authors: Ground truth is derived from authoring order, not human-verified reading order. Future work includes labeling a real-world scanned document dataset.
- Statistical rigor: No error bars, significance tests, or multi-seed experiments reported
Hardware
- Training: 4 $\times$ Tesla V100 (32GB per GPU), approximately 6 hours for 3 epochs (~75k steps)
- Inference: Not explicitly reported
BibTeX
@inproceedings{wang2021layoutreader,
title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection},
author={Wang, Zilong and Xu, Yiheng and Cui, Lei and Shang, Jingbo and Wei, Furu},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={4735--4744},
year={2021}
}
HJDataset: Historical Japanese Documents with Complex Layouts and Reading Order
TL;DR
HJDataset is a semi-automatically constructed layout analysis dataset for historical Japanese documents, containing over 259k annotations of seven element types across 2,271 page scans. Beyond bounding boxes and segmentation masks, it provides two annotations rarely found together in layout datasets: hierarchical dependency structure and reading order for all layout elements. The reading order reflects the real right-to-left, top-to-bottom conventions of Japanese text, including irregular cases where section headers disrupt the standard flow.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The central contribution is the HJDataset itself, a large annotated corpus accompanied by a semi-rule-based construction pipeline. The paper’s organization, tables, and figures all center on the dataset’s contents, construction methodology, and statistics.
Secondary: $\Psi_{\text{Method}}$: The authors develop a multi-stage pipeline (rule-based detector, CNN classifier, reading order generator, statistical quality control) to construct the dataset with limited manual effort. This pipeline is a reusable contribution in its own right.
Secondary: $\Psi_{\text{Evaluation}}$: The paper provides baseline detection results on the dataset using Faster R-CNN, Mask R-CNN, and RetinaNet, and evaluates transfer learning from the main-page subset to index pages and to a different historical Japanese publication.
What is the motivation?
Deep learning approaches for document layout analysis require large labeled datasets for training. Two gaps motivated this work. First, available historical document datasets are small: DIVA-HisDB has 150 instances and the European Newspapers Project has 528. Models trained on these tend to overfit and produce unreliable benchmarks. Second, nearly all historical document datasets cover Western languages, leaving the distinct layout conventions of Asian languages (vertical text, right-to-left column order, dense biographical formats) without dedicated training resources.
Japan’s National Diet Library has digitized millions of historical scans, providing raw material for a large-scale dataset if an annotation method can be devised that avoids prohibitive manual labeling costs. The Japanese Who’s Who biographical directory from 1953 is chosen as the source because it contains roughly 2,000 page scans with consistent structural conventions, making semi-automatic extraction feasible.
What is the novelty?
Dataset scope and annotation richness
HJDataset contains 2,271 document image scans partitioned into four page categories: main (2,048), advertisement (87), index (82), and other (54). For the main and index pages, 259,616 layout element annotations are provided across seven categories: Page Frame, Row, Title Region, Text Region, Title, Subtitle, and Other. Annotations include both rectangular bounding boxes and quadrilateral segmentation masks (for page frames, which are subject to scan rotation). Each annotation also carries its position in the hierarchical layout tree and its reading order index.
The hierarchical structure reflects the physical organization of main pages: five rows are vertically stacked in each page, and within each row, text regions and title regions are arranged horizontally. Title regions are further subdivided into title and subtitle blocks. This nesting gives three levels of granularity, from the full page frame down to individual text blocks.
Semi-rule-based construction pipeline
The construction proceeds in four stages. The Text Block Detector extracts bounding boxes hierarchically: page frame by contour detection, rows by horizontal Run Length Smoothing Algorithm (RLSA) followed by Connected Component Labeling (CCL), then text and title regions by applying RLSA vertically within each row. Page frames are characterized as quadrilaterals to handle scan distortion; row and block regions use axis-aligned rectangles after warp affine correction.
A CNN classifier (Text Region Classifier) is then applied to each cropped region to distinguish text regions, title regions, and mis-segmented regions. The architecture is NASNet Mobile, trained from scratch on 1,200 hand-labeled samples (augmented with 250 synthetic mis-segmentations to address class imbalance) and achieving 0.99 test accuracy on 100 samples.
Reading order generation
Reading order is determined by exploiting Japanese reading conventions: texts inside each region are written in vertical columns read right-to-left across columns; within a row, blocks are also ordered right-to-left. For the majority of pages this yields a deterministic topological ordering.
The more interesting case involves irregular reading orders in index pages. Section headers can span multiple rows, breaking the expected right-to-left sequence. The pipeline detects these discontinuities by searching for unusually large horizontal gaps between blocks in a row and assigns a corrected reading order accordingly. These irregular cases are preserved in the dataset rather than smoothed away, making HJDataset useful for evaluating reading order methods that must handle non-trivial flows.
Reading order annotations take the form of ordered position indices associated with each annotation entry in the COCO-format JSON files. Combined with the hierarchical parent-child links between layout elements, this gives each annotation both its rank in the reading sequence and its structural role.
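A minimal sketch of the within-row rule with gap-based discontinuity flagging, assuming axis-aligned block boxes; the gap threshold is illustrative, not the paper's value:

```python
def order_row_blocks(blocks, gap_factor=3.0):
    """Order blocks in one row right-to-left (Japanese column convention)
    and flag unusually wide horizontal gaps that may signal an irregular
    reading order (e.g., a section header spanning rows).

    blocks: list of (x0, y0, x1, y1) boxes within a single row.
    Returns (ordered_blocks, gap_flags); gap_flags[i] is True when the gap
    before ordered_blocks[i] exceeds gap_factor times the median gap.
    """
    ordered = sorted(blocks, key=lambda b: b[2], reverse=True)  # rightmost first
    gaps = [prev[0] - cur[2] for prev, cur in zip(ordered, ordered[1:])]
    if not gaps:
        return ordered, [False] * len(ordered)
    median_gap = sorted(gaps)[len(gaps) // 2]
    flags = [False] + [g > gap_factor * max(median_gap, 1) for g in gaps]
    return ordered, flags
```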
Quality control
After automated extraction, statistical screening identifies candidate errors without requiring full manual review. Pages with abnormally many or few layout elements (outside the 5th to 95th percentile range) are flagged for page frame correction (182 pages examined, 18 errors corrected). Blocks with unusually wide gaps (above the 99th percentile) are flagged for missed final text lines (1,011 examined, 487 corrected). Remaining structural noise is handled by human annotators during the inspection pass. In total, over 616 errors are corrected, and the authors estimate the final dataset achieves 99.6% annotation accuracy.
What experiments were performed?
Layout element detection benchmark
Faster R-CNN, Mask R-CNN, and RetinaNet are trained on main-page training data (181,097 annotations) for 60k iterations using a ResNet-50 FPN backbone pre-trained on COCO. Evaluation uses COCO-style mean Average Precision (mAP) at IOU $[0.50{:}0.95]$.
Results on the test set show high accuracy for coarser elements and lower accuracy for fine-grained ones:
| Category | Faster R-CNN | Mask R-CNN | RetinaNet |
|---|---|---|---|
| Page Frame | 99.05 | 99.10 | 99.04 |
| Row | 98.83 | 98.48 | 95.07 |
| Title Region | 87.57 | 89.48 | 69.59 |
| Text Region | 94.46 | 86.80 | 89.53 |
| Title | 65.91 | 71.52 | 72.57 |
| Subtitle | 84.09 | 84.17 | 85.87 |
| Other | 44.02 | 39.85 | 14.37 |
| mAP | 81.99 | 81.34 | 75.22 |
Faster R-CNN and Mask R-CNN reach comparable overall mAP (82.0 vs 81.3), both ahead of RetinaNet (75.2). Title detection is the weakest category, likely due to its small physical size and visual similarity to subtitle blocks. The Other category (chapter headers and miscellaneous text) has the lowest mAP due to its small sample count (148 total).
Transfer to index pages
Five Faster R-CNN variants are compared on index pages, varying the initialization (COCO or HJDataset weights) and the amount of training data (all 57 training index samples, 5-shot, or zero-shot). Pre-training on HJDataset consistently outperforms COCO initialization: with all index training data, HJDataset initialization reaches 47.1 mAP versus 34.4 for COCO; with 5 samples, the gap is smaller (10.3 vs 10.0) but still favors HJDataset. Zero-shot transfer from main to index reaches only 9.4 mAP, consistent with the structural dissimilarity between page types.
Transfer to a different publication
Twelve pages from a 1939 edition of the same Who’s Who directory with a different layout schema are manually annotated. Four pages are used for training, eight for testing. HJDataset-initialized Faster R-CNN reaches 81.6 mAP at 5-shot, versus 69.9 for COCO initialization, a gain of approximately 12 points.
What are the outcomes/conclusions?
HJDataset fills a concrete gap: a large-scale layout analysis dataset for historical Asian language documents. The dataset’s combination of bounding boxes, masks, hierarchical structure, and reading order annotations makes it more information-rich than most contemporaneous layout datasets, which typically provide only bounding boxes or masks.
The reading order annotations specifically capture right-to-left, top-to-bottom Japanese conventions along with documented irregular cases (section headers disrupting column flow). This makes HJDataset an early example of a dataset that treats reading order as a first-class annotation rather than an afterthought, though the reading order is generated from known rules rather than human annotation.
The transfer learning results suggest that pre-training on HJDataset generalizes to related historical Japanese documents, which is useful for real-world digitization workflows where labeled data is scarce.
Limitations
- Reading order annotations are rule-derived rather than human-annotated. For the majority of main pages the rules are deterministic and accurate, but the dataset does not quantify inter-annotator agreement for the reading order specifically.
- The source material is a single publication type (Japanese Who’s Who, 1953 edition). The layouts are structurally homogeneous (five rows per main page, vertical text only), which limits generalization to documents with less regular or more diverse layout patterns.
- No dedicated reading order model is trained or evaluated in the paper. The reading order annotations are included in the dataset but their utility is demonstrated only implicitly through the overall layout task.
- The dataset covers one language (Japanese) and one document era. Western-language users and researchers working on other Asian historical materials will not benefit directly from the pre-trained models.
- Image access requires submitting a request form due to copyright considerations, which limits immediate reproducibility.
Reproducibility
Models
- Faster R-CNN, Mask R-CNN, and RetinaNet with R-50-FPN-3x backbone, implemented via Detectron2. Backbone weights initialized from the public COCO pre-trained checkpoint.
- The text region classifier uses NASNet Mobile, trained from scratch (no pre-trained weights loaded).
- The authors state that training configurations will be open-sourced. Detectron2 is Apache-2.0.
Algorithms
- Text block detection: contour detection for page frames, CCL and RLSA for row and block segmentation.
- Text classification: NASNet Mobile trained with SGD, converging in 40 epochs; input images resized to 200 (height) $\times$ 522 (width) pixels.
- Reading order: rule-based; right-to-left ordering within rows, with gap-based discontinuity detection for irregular cases.
- Detection models: 60k iterations, base learning rate 0.00025, decay factor 0.1 at 30k and 60k iterations, batch size 2, trained on a single NVIDIA RTX 2080Ti.
Data
- Source: 2,271 scans from the 1953 edition of the Japanese Who's Who (Jinji Kōshinroku, vol. 17).
- 259,616 layout element annotations across seven categories (Page Frame, Row, Title Region, Text Region, Title, Subtitle, Other) for main and index pages.
- 70/15/15 train/validation/test split, stratified by page type.
- Annotations in COCO JSON format with an additional `category_id` field per image for page-type classification.
- Dataset annotations released under Apache-2.0; images subject to copyright and require a download request.
- Available at: https://dell-research-harvard.github.io/HJDataset/
Evaluation
- COCO-style mAP at IOU $[0.50{:}0.95]$ as the primary detection metric; AP$_{50}$ and AP$_{75}$ also reported for transfer experiments.
- No dedicated reading order evaluation metric or evaluation protocol is defined in the paper; the 99.6% annotation accuracy estimate covers layout element detection errors overall, not reading order correctness specifically. Reading order quality has no separate quantitative measure in the paper.
- No error bars or significance tests reported.
Hardware
- Single NVIDIA RTX 2080Ti GPU for all detection model training.
- No training time or memory requirements reported.
- NASNet Mobile inference is lightweight; no GPU requirements stated for the classifier.
BibTeX
@inproceedings{shen2020hjdataset,
title={A Large Dataset of Historical Japanese Documents with Complex Layouts},
author={Shen, Zejiang and Zhang, Kaixuan and Dell, Melissa},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year={2020}
}