Reading Order Prediction

Tracking models, datasets, and methods for determining the logical reading sequence of detected document regions.

Disclaimer: This page covers models for predicting the logical reading sequence of detected regions. For detecting where regions are on a page, see the Layout Page. For parsing table internal structure, see the TSR Page.

Overview

Reading order prediction determines the logical sequence in which detected regions should be read. While it depends on layout detection for region proposals, it is a distinct task with its own models, datasets, and evaluation challenges.

The core difficulty is that reading order is not purely spatial. Multi-column layouts, sidebars, footnotes, and floating elements all require understanding the logical flow of a document rather than just scanning top-to-bottom, left-to-right.
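To make the point concrete, here is a minimal sketch using a hypothetical two-column page with four blocks. The box coordinates, the `Box` type, and the fixed `split_x` column boundary are all assumptions chosen for illustration; no model covered on this page works this way.

```python
from typing import NamedTuple


class Box(NamedTuple):
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical two-column page: the left column holds A then B,
# the right column holds C then D. y grows downward.
page = [
    Box("A", 0, 0, 45, 40),
    Box("C", 55, 0, 100, 40),
    Box("B", 0, 50, 45, 90),
    Box("D", 55, 50, 100, 90),
]


def naive_order(boxes):
    """Sort by top edge, then left edge (top-to-bottom, left-to-right)."""
    return [b.name for b in sorted(boxes, key=lambda b: (b.y0, b.x0))]


def column_first_order(boxes, split_x):
    """Read everything left of split_x first, then everything right of it."""
    left = [b for b in boxes if b.x1 <= split_x]
    right = [b for b in boxes if b.x1 > split_x]
    return [b.name for b in sorted(left, key=lambda b: b.y0)
            + sorted(right, key=lambda b: b.y0)]


naive = naive_order(page)
column_aware = column_first_order(page, split_x=50)
print(naive)         # ['A', 'C', 'B', 'D']  (jumps across columns)
print(column_aware)  # ['A', 'B', 'C', 'D']  (follows each column)
```

The purely spatial sort interleaves the two columns. The column-aware version only works because the column boundary was supplied by hand, which is exactly the knowledge a reading order method has to infer.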

Note: We are actively researching this area. Expect significant updates to this section.


Reading Order: Paradigms

Reading order prediction methods organize around five main paradigms:

  1. Rule-based / Projection: Partitions the page by recursively finding gaps in horizontal or vertical ink projections. No trainable parameters; fast and deterministic.

    • Examples: XY-Cut (classic), XY-Cut++
    • Pros: No training data required; CPU-deployable; interpretable.
    • Cons: Breaks down where clean projection gaps do not exist (L-shaped regions, multi-column with spanning headers, complex wrap-arounds).
  2. Seq2Seq with Pointer Networks: An encoder reads all layout elements and a decoder emits an ordered sequence by repeatedly pointing to the next element.

    • Examples: LayoutReader, FocalOrder
    • Pros: End-to-end trainable; captures global context across all elements.
    • Cons: Sequential decoding; output length tied to input size; requires large annotated datasets.
  3. Relation Extraction / Dependency Parsing: Frames reading order as pairwise precedence prediction; a chain-decoding step resolves pairs into a total or partial order.

    • Examples: ROOR, Detect-Order-Construct, UniHDSA, HRDoc/DSPS, Token Path Prediction
    • Pros: Does not assume a single total order; handles non-linear and partial sequences naturally; compatible with graph-based decoding.
    • Cons: Quadratic complexity in element count; requires post-processing to resolve relation pairs into a valid sequence.
  4. Multimodal End-to-End: A single model reads the document image and generates text, bounding boxes, and semantic labels in reading order in one pass, with no separate layout detection step.

    • Examples: ÉCLAIR, DRGG
    • Pros: No upstream OCR or detector required; reading order emerges from generation order.
    • Cons: Large model sizes; hallucination risk; complex tables and equations often excluded from current releases.
  5. Action-based / Transition Parsing: Incrementally builds a document hierarchy tree by predicting a discrete action (e.g., insert, skip, merge, move) at each step over a sequence of segments.

    • Examples: MTP, HELD, CED/TRACER, CMM, SEG2ACT
    • Pros: Naturally captures hierarchical structure (sections, subsections, ToC); interpretable step-by-step construction.
    • Cons: Errors propagate through the action chain; many methods require electronic PDFs for font or layout feature extraction.
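As an illustration of the first paradigm, the sketch below applies a recursive XY-cut to bounding boxes. It is a simplified stand-in, not the published XY-Cut or XY-Cut++ procedure: it looks for gaps between box extents instead of pixel-level ink projections, and it always cuts along the axis with the widest gap, which is an assumption made here to keep the example short. The `min_gap` threshold and the sample page are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def widest_gap(boxes: List[Box], axis: int):
    """Return (width, cut) for the widest empty band along an axis.

    axis=0 scans x extents (a vertical cut between columns);
    axis=1 scans y extents (a horizontal cut between rows).
    """
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_width, best_cut = 0.0, None
    reach = spans[0][1]  # furthest extent covered so far
    for start, end in spans[1:]:
        if start - reach > best_width:
            best_width, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_width, best_cut


def xy_cut(boxes: List[Box], min_gap: float = 3.0) -> List[Box]:
    """Recursively split at the widest gap; read the first part, then the second."""
    if len(boxes) <= 1:
        return list(boxes)
    # Ties favor the horizontal cut (axis 1) because it is listed first.
    candidates = [(widest_gap(boxes, axis), axis) for axis in (1, 0)]
    (width, cut), axis = max(candidates, key=lambda c: c[0][0])
    if cut is None or width < min_gap:
        # No clean gap: fall back to a plain top-to-bottom, left-to-right sort.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    first = [b for b in boxes if b[axis + 2] <= cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Hypothetical page: a spanning title above two columns of two paragraphs each.
page = {
    "title": (0, 0, 100, 10),
    "left-1": (0, 20, 45, 50),
    "right-1": (55, 20, 100, 50),
    "left-2": (0, 55, 45, 90),
    "right-2": (55, 55, 100, 90),
}
names = {box: name for name, box in page.items()}
order = [names[b] for b in xy_cut(list(page.values()))]
print(order)  # ['title', 'left-1', 'left-2', 'right-1', 'right-2']
```

When no gap of at least `min_gap` exists on either axis, as with L-shaped regions or wrap-arounds, the function falls back to a spatial sort. That fallback is where the listed weaknesses of projection methods show up.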

Reading Order: Models

Models with Code or Weights

Date | Name | Artifacts | Code | License | Notes
2025-03 | UniHDSA | None | GitHub | MIT | Notes. MSRA. Unified relation prediction framework collapsing page-level and document-level HDSA sub-tasks into a single label space. Multimodal Transformer (ResNet + BERT). REDS 96.7 (text), 91.0 (graphical) on Comp-HRDoc. Extends Detect-Order-Construct.
2024-09 | ROOR baseline | None | GitHub | Apache-2.0 | Notes. Global pointer network on LayoutLMv3-large. 93.01 F1 (word-level), 82.38 F1 (segment-level) on ROOR dataset. Also introduces RORE downstream enhancement pipeline.
2024-01 | Detect-Order-Construct | None | GitHub | MIT | Notes. MSRA. Multi-modal transformer relation prediction with separate Detect, Order (chain), and Construct (tree) modules with RoPE. REDS 0.9319 on Comp-HRDoc.
2023-04 | CED/TRACER | Spico197/CatalogExtraction | GitHub | Apache-2.0 | Notes. ICDAR 2023. Compact 3-layer Chinese RoBERTa (RBT3) shift-reduce parser with Sub-Heading/Sub-Text/Concat/Reduce actions for catalog/ToC tree construction. 82.39% F1 on ChCatExt. Also introduces ChCatExt dataset and Wikipedia pre-training corpus.
2021-08 | LayoutReader | microsoft/layoutreader | GitHub | MIT (code); Unknown (weights) | Notes. LayoutLM encoder + pointer-network decoder. Trained on ReadingBank (500k pages).
2021-05 | MTP | stanfordnlp/pdf-struct | GitHub | Apache-2.0 | Notes. Stanford + Google. NLLP@EMNLP 2021. Multimodal transition parser: five-label action sequence (continuous, consecutive, down, up, omitted) predicted by Random Forest on visual and textual features. Works across four document types (scientific, legal, government). Introduces four annotated corpora.

Methods (Paper Only)

Date | Name | Notes
2024-05 | DLAFormer | ICDAR 2024. MSRA + USTC. Deformable DETR with unified label space treating graphical detection, text region detection, logical role classification, and reading order as a single relation prediction problem. Type-wise query selection (encoder) + type-wise query initialization (decoder). No text embeddings; requires OCR text-line bounding boxes. REDS 96.6% (text) / 90.0% (graphical) on Comp-HRDoc. Full notes in Layout.
2026-01 | FocalOrder | Notes. Unisound AI + CAS. LayoutLMv3-large (0.4B) trained with Focal Preference Optimization: EMA-based difficulty discovery + difficulty-calibrated pairwise ranking. Addresses “Positional Disparity” failure mode. Edit 0.038 EN / 0.055 ZH on OmniDocBench v1.0; REDS 97.1 (text) / 91.1 (graphical) on Comp-HRDoc.
2025-04 | XY-Cut++ | Notes. Rule-based: pre-mask processing + multi-granularity XY-Cut + cross-modal matching. No trainable parameters; 514 FPS on CPU. 98.8 BLEU-4 on DocBench-100.
2025-02 | ÉCLAIR | Notes. NVIDIA. ViT-H + mBART decoder (937M params). Jointly extracts text in reading order, bounding boxes, and semantic classes in a single pass. WER 0.142 on DROBS.
2025-02 | DRGG | KIT. ICLR 2025. Plug-and-play relation head added to DETR-family detectors; jointly predicts DLA bounding boxes and full relational graph (spatial + logical). InternImage + RoDLA + DRGG: $\text{mAP}_g@0.5 = 57.6\%$, Sequence AP 56.4% on GraphDoc. Full notes in Layout.
2024-10 | SEG2ACT | Notes. EMNLP 2024. CAS. Generative LLM (Baichuan-7B) with a 3-symbol action vocabulary (+, *, =). Global context stack serialized in the same vocabulary; multi-segment batching for efficiency. Single-pass per-segment action prediction. DocAcc on ChCatExt; TEDS on HierDoc. Code: icip-cas/Seg2Act (Unknown license).
2023-10 | CMM | Notes. EMNLP 2023. XY-Cut initial tree construction followed by GRU + GAT per-node encoding over a BFS subtree; Keep/Delete/Move operation prediction per node. Introduces ESGDoc (250 Chinese ESG annual reports). Requires electronic PDFs for font extraction. Code: xnyuwg/cmm (Unknown license).
2023-10 | Token Path Prediction (TPP) | Notes. Fudan + Ant Group. EMNLP 2023. GlobalPointer head on LayoutLMv3/LayoutMask predicting directed token-pair edges; no reading-order assumption. Also introduces FUNSD-r and CORD-r benchmarks reflecting realistic OCR disorder. 80.40 F1 (FUNSD-r), 91.85 F1 (CORD-r) with LayoutLMv3.
2023-05 | LayoutMask | Notes. Ant Group. ACL 2023. Text+layout-only pre-training: local (in-segment) 1D positions + segment-level 2D positions, with Whole Word Masking, Layout-Aware Masking, and Masked Position Modeling (GIoU loss). 182M params (Base). 92.91 F1 on FUNSD, outperforming all T+L+I base models.
2023-05 | Sparse Graph Segmentation | Notes. Google. 267K-param multi-task GCN on a $\beta$-skeleton sparse graph. Predicts column-wise vs. row-wise pattern (node) and paragraph membership (edge); hierarchical cluster-and-sort recovers global order. Language-agnostic; mobile-deployable. NED 0.098 on English eval set. ICDAR 2023.
2022-10 | ERNIE-Layout | Notes. Baidu. Findings of EMNLP 2022. Layout-aware serialization + spatial-aware disentangled attention + reading order prediction (ROP) as an explicit pre-training task. 24 A100s, 10M IIT-CDIP pages. 93.12 F1 on FUNSD (large). Code + weights: Apache-2.0 (PaddleNLP).
2023-10 | DSG | ICDM 2023. ETH Zurich / LMU. Replaces DocParser’s heuristic relation rules with LSTM-based prediction + grammar constraints; end-to-end trainable hierarchical parsing. Introduces E-Periodica dataset. Full notes in Layout.
2019-11 | DocParser | AAAI 2021. ETH Zurich / LMU. Mask R-CNN detection + rule-based reading-order and parent-child relation classification. Direct predecessor to DSG, HRDoc, and Detect-Order-Construct. Full notes in Layout.
2023-03 | HRDoc / DSPS | USTC + iFLYTEK. AAAI 2023. Hierarchical document structure reconstruction as a three-subtask formulation (classify, parent-find, relation-classify). Multi-modal bidirectional encoder + structure-aware GRU decoder with soft-mask. Micro-STEDS 0.8143 (simple) / 0.6903 (hard). Direct predecessor to Detect-Order-Construct. Full notes in Layout.
2021-06 | ROPE | Notes. Google. Reading Order Equivariant Positional Encoding: assigns relative reading order codes to GCN neighbor messages before aggregation. Plugs into any GCN pipeline. Up to +8.4 F1 on word labeling, +2.5 F1 on FUNSD. ACL-IJCNLP 2021.
2021-05 | HELD | Notes. JCST 2022. BiLSTM/CNN PUT-or-SKIP classifier predicting insertion positions in a rightmost-branch tree. Variable-depth hierarchy from long documents with error-tolerance training via simulated misplacements. Introduces stricter path-correctness metric. No code or data released.

Reading Order: Datasets

Datasets are grouped by the most permissive use their licenses allow: commercial, research / non-commercial, and not available / restricted.

Commercial Use

No reading order datasets with fully permissive licenses (CC0, CDLA-Permissive, MIT, Apache-2.0) are currently confirmed. Datasets in this space tend to carry attribution requirements (CC-BY) or non-commercial clauses, or have undeclared licenses.

Research / Non-Commercial

Date | Name | Pages | Domain | Key Contribution | License | Notes
2025-11 | PharmaShip | 161 docs | Chinese pharmaceutical shipping (scanned) | ISDR reading order graphs + SER + entity linking on domain-specific long-form documents | CC-BY-NC-4.0 | Notes. Xi’an Jiaotong University. ~108 RO links/doc vs. ~55 for ROOR. Entity-centric annotation decouples semantics from layout blocks.
2024-09 | ROOR | 199 samples | Scanned VRDs (EC-FUNSD-based) | First dataset with ISDR-style reading order as directed acyclic relation pairs; 23.76% non-linear segments | CC-BY-4.0 | Notes. Fudan + Ant Group. 10,662 segments, 10,967 annotated ISDR relation pairs. Builds on EC-FUNSD.
2023-10 | DocTrack | 539 docs | VRDs (forms, structured docs, infographics) | First eye-tracking-grounded reading order benchmark; human gaze trajectories mapped to OCR bounding box sequences | Research-only (Apache-2.0 header conflicts with non-commercial clause in README; treat as non-commercial until clarified) | Notes. 39,671 semantic entities; 86,054 tokens. Three subsets: WEAK (FUNSD), STRUCTURED (SeaBill), INFOGRAPH (InfographicVQA). Simple Z-ORDER heuristic often outperforms true eye-tracking order on downstream tasks.
2021-08 | ReadingBank | 500k | Diverse (born-digital) | First large-scale reading order benchmark | Research-only (README prohibits redistribution despite Apache-2.0 file) | Notes. Auto-extracted from DocX XML metadata via color-based watermarking. English only.
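Datasets such as ROOR and PharmaShip annotate reading order as directed relation pairs instead of a single permutation. The sketch below shows why that matters: one set of pairs can admit several valid linear orders. The segment names are hypothetical, and each pair `(i, j)` is read here simply as "i comes before j", which is an assumption of this example and may be looser than the datasets' exact annotation semantics. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical form segments; each pair (i, j) says "i is read before j".
relation_pairs = [
    ("title", "name_label"),
    ("title", "date_label"),
    ("name_label", "name_value"),
    ("date_label", "date_value"),
]


def linearize(pairs):
    """Return one linear order consistent with the pairs (a topological sort)."""
    predecessors = {}
    for before, after in pairs:
        predecessors.setdefault(before, set())
        predecessors.setdefault(after, set()).add(before)
    # static_order() raises graphlib.CycleError if the pairs contain a cycle.
    return list(TopologicalSorter(predecessors).static_order())


def respects(order, pairs):
    """Check that every 'before' segment appears ahead of its 'after' segment."""
    position = {segment: i for i, segment in enumerate(order)}
    return all(position[a] < position[b] for a, b in pairs)


order = linearize(relation_pairs)
print(order)

# Two different permutations both satisfy the same relation pairs.
name_first = ["title", "name_label", "name_value", "date_label", "date_value"]
date_first = ["title", "date_label", "date_value", "name_label", "name_value"]
print(respects(name_first, relation_pairs), respects(date_first, relation_pairs))
```

Because both `name_first` and `date_first` are consistent with the pairs, scoring a prediction against only one of them would penalize an order that the relation annotation accepts.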

Not Available / Restricted

Described in a publication but not publicly downloadable.

Date | Name | Pages | Domain | Key Contribution | License | Notes
2025-02 | GraphDoc | 80K images | Mixed (financial, scientific, legal, manuals, patents, government) | Extends DocLayNet with 4.13M relation annotations across 8 categories: Sequence (reading order), Parent, Child, Reference, Up, Down, Left, Right. Only dataset combining spatial and logical graph relations at paragraph level with non-textual elements | Unknown (code: MIT; no dataset license declared; CDLA-Permissive-1.0 inheritance from DocLayNet unconfirmed by authors) | Notes. KIT. ICLR 2025. Dataset promised on GitHub but no explicit license statement. Sequence AP = 56.4% with DRGG baseline.
2025-11 | MosaicDoc | 72.3K images | Bilingual (ZH/EN) newspapers + magazines (196 publishers) | Block- and line-level reading order on non-Manhattan multi-column layouts; bilingual multi-task (VQA, OCR, ROP, localization) | Not released (code: Apache-2.0) | Notes. AAAI 2026. Promised open release in Nov 2025; no download available as of Apr 2026.
2023-12 | M5HisDoc | 8,000 images (4K regular + 4K hard) | Chinese historical documents (Han to Qing dynasty; 20+ layouts; 20+ doc types) | First multi-style Chinese historical document benchmark; 5 tasks including reading order; Regular + Hard (rotation/distortion/resolution) splits | CC-BY-NC-ND-4.0 (application required) | Notes. NeurIPS 2023 D&B. SCUT. 16,151 character categories; zero-shot val/test splits. Simple right-to-left heuristic outperforms LayoutReader on this domain.
2020-04 | HJDataset | 2,271 scans | Historical Japanese documents (1953 Who’s Who) | Layout, hierarchy, and reading order annotations for seven element types; rule-derived right-to-left RO with irregular-case handling | Apache-2.0 (annotations) / Restricted (images require request) | Notes. 259,616 annotations. Semi-automatic pipeline: RLSA + CCL segmentation, NASNet Mobile classifier, rule-based RO. No dedicated ROP model trained or evaluated.

Reading Order: Metrics

Metric | What it measures | Notes
BLEU (Page-level) | N-gram overlap between predicted and ground-truth reading order sequence | Used by LayoutReader on ReadingBank.
ARD (Average Relative Distance) | Positional displacement between predicted and ground-truth element positions; penalty $n$ for omitted elements | Used by LayoutReader and M5HisDoc. Lower is better.
F1 over relation pairs | Precision and recall over ISDR directed pairs $(i, j)$ | Introduced by ROOR. Replaces BLEU/ARD for the relation-extraction formulation.
REDS (Reading Edit Distance Score) | Edit distance between predicted and ground-truth reading order sequences, evaluated per group | Primary metric for Comp-HRDoc (Detect-Order-Construct). Higher is better.
NED over block sequences | Normalized Levenshtein distance between predicted and ground-truth block sequences | Used by OmniDocBench. Lower is better.
Kendall’s Tau | Rank correlation: concordant minus discordant pairs, normalized | Used in some document ordering papers as an alternative to BLEU/ARD. Correlates with human ratings.
Perfect Match Rate | Fraction of pages where predicted ordering exactly matches ground truth | Used alongside BLEU/ARD as a strict binary measure.
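The sketch below computes several of the metrics above on one toy page, following the one-line definitions in the table. It is an illustration under assumptions, not any benchmark's official scorer: element IDs are hypothetical, `ard` assumes a non-empty ground truth, `kendall_tau` assumes both orders contain the same elements, and the relation pairs are simply adjacent-successor pairs derived from each sequence.

```python
from itertools import combinations


def ard(gt, pred):
    """Average Relative Distance: mean index displacement; penalty n if omitted."""
    n = len(gt)
    pos = {e: i for i, e in enumerate(pred)}
    return sum(abs(i - pos[e]) if e in pos else n for i, e in enumerate(gt)) / n


def kendall_tau(gt, pred):
    """(concordant - discordant) / total pairs, for orders over the same elements."""
    rank = {e: i for i, e in enumerate(pred)}
    pairs = list(combinations(gt, 2))
    concordant = sum(rank[a] < rank[b] for a, b in pairs)
    return (2 * concordant - len(pairs)) / len(pairs)


def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def ned(gt, pred):
    """Edit distance normalized by the longer sequence length."""
    return edit_distance(gt, pred) / max(len(gt), len(pred), 1)


def successor_pairs(order):
    """Adjacent (i, j) pairs implied by a linear order."""
    return set(zip(order, order[1:]))


def pair_f1(gt_pairs, pred_pairs):
    """F1 over directed relation pairs."""
    tp = len(set(gt_pairs) & set(pred_pairs))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(set(pred_pairs)), tp / len(set(gt_pairs))
    return 2 * precision * recall / (precision + recall)


gt = ["A", "B", "C", "D", "E"]
pred = ["A", "B", "D", "C", "E"]  # one adjacent swap

scores = {
    "ARD": ard(gt, pred),                   # 0.4
    "Kendall tau": kendall_tau(gt, pred),   # 0.8
    "NED": ned(gt, pred),                   # 0.4
    "Pair F1": pair_f1(successor_pairs(gt), successor_pairs(pred)),  # 0.25
    "Perfect match": gt == pred,            # False
}
print(scores)
```

A single adjacent swap barely moves Kendall's Tau but leaves only one of four successor pairs intact, so the choice of metric strongly affects how severe the same error looks.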

Reading Order: Benchmarks

Date | Name | Size | Scope | Primary Metric | License | Notes
2025-04 | DocBench-100 | 100 docs | Block-level reading order; complex and regular layout subsets | BLEU-4 | Unknown | Notes. Introduced with XY-Cut++. Emphasizes multi-column and domain-specific complex layouts underrepresented in existing datasets.
2025-02 | DROBS | 789 pages | Joint reading order, bounding boxes, and semantic classes | Counting F1, WER, BLEU | Unknown | Notes. Introduced with ÉCLAIR. Human-annotated from magazines, books, Common Crawl. Tables and equations masked in current release.
2024-12 | OmniDocBench | 981 pages | Full document parsing: OCR, layout, table, formula, reading order | NED over block sequences (RO); CDM (formula); TEDS (table) | Apache-2.0 | Notes. CVPR 2025. Nine document types; attribute-level evaluation by language, rotation, background.
2024-09 | READoc | 3,576 docs | End-to-end DSE (PDF to Markdown); reading order via KTDS | KTDS (reading order); EDS (text); TEDS (table/ToC) | Unknown | Notes. ACL Findings 2025. Sources: arXiv, GitHub, Zenodo. 16 systems evaluated.
2024-01 | Comp-HRDoc | 1,500 docs (1K train / 500 test; HRDoc-Hard extension) | Hierarchical doc structure: page object detection, reading order, TOC, full reconstruction | REDS (reading order); segmentation mAP; Semantic-TEDS | MIT | Notes. MSRA. First benchmark to evaluate all four subtasks simultaneously.

To Investigate

Papers and resources identified as likely relevant but not yet fully reviewed.

Methods

arXiv | Title | Why Relevant
DOI: 10.18653/v1/2024.emnlp-main.65 | DocHieNet (Xing et al., EMNLP 2024) | Document hierarchy parsing with DHFormer (GeoLayoutLM + encoder-decoder); reading order implicit in the hierarchy. Alibaba + Zhejiang. Already in Layout index. No arXiv.
- | MLARP (Qiao et al., Pattern Recognition 2024) | Multi-modal layout-aware relation prediction for ROP. Harmonizes visual, semantic, and positional features; layout-aware position embedding handles varying reading habits. No arXiv found; DOI: 10.1016/j.patcog.2024.000657.
- | XYLayoutLM (Gu et al., CVPR 2022) | Introduces Augmented XY Cut heuristic to generate multiple total-ordered reading sequences, explicitly acknowledging single-permutation limitation. Used as baseline in M5HisDoc ROP evaluation.
- | Layout2Pos (Nguyen et al., TPDL 2024) | Positional encoding injection for reading order; layout-only features; argues spatial layout alone is sufficient to infer reading order without textual content. No arXiv; DOI: 10.1007/978-3-031-72437-4_1. Identified via Giovannini & Marinai survey Tab. 3.
- | Gao et al. (2013) - “Newspaper article reconstruction using ant colony optimization and bipartite graph” (Applied Soft Computing 2013) | TSP / ant-colony formulation for reading order recovery across newspaper article blocks. Identified via Giovannini & Marinai survey graph-based ROP category. DOI: 10.1016/j.asoc.2012.07.012.
- | Quirós & Vidal (2022) - “Reading order detection on handwritten documents” (Neural Computing and Applications 2022) | Reading order as a learned sorting problem over text region pairs; applied to handwritten documents. No arXiv; DOI: 10.1007/s00521-021-06227-9.
- | Li et al. (2020) - “End-to-end OCR text re-organization sequence learning” (ECCV 2020) | GNN + pointer-network approach for reordering OCR text blocks; early deep learning predecessor to LayoutReader. No arXiv.
- | Clausner et al. (2013) - “The significance of reading order in document recognition and its evaluation” (ICDAR 2013) | Foundational work defining reading order evaluation methodology; notes single permutation is insufficient and proposes tree structures of ordered/unordered groups. Cited by ROOR as direct theoretical precursor.
- | Ferilli et al. (2014) - “Abstract argumentation for reading order detection” (DocEng 2014) | Unsupervised strategy based on abstract argumentation and empirical assumptions about human reading behavior. Historical ML-era method.
- | Ceci et al. (2007) - “A data mining approach to reading order detection” (ICDAR 2007) | Early statistical method modeling reading order as multiple disjoint chains; precursor to understanding non-linear reading order.
- | Aiello & Smeulders (2003/2004) - “Bidimensional relations for reading order detection” | Foundational rule-based ROP using qualitative rectangle relations; introduced the relational framing that ROOR formalizes two decades later.
- | Meunier (2005) - “Optimized XY-Cut for Determining a Page Reading Order” (ICDAR 2005) | The classic XY-Cut algorithm; foundational rule-based baseline that XY-Cut++ directly extends. DOI: 10.1109/ICDAR.2005.182.
2101.12741 | Post-OCR Paragraph Recognition by Graph Convolutional Networks (Wang, Fujii & Popat, WACV 2022) | GCN-based grouping and ordering of OCR output into paragraphs. Same Google authors as ROPE; cited by Sparse Graph Segmentation. Reading order implicit in the paragraph assembly step.
- | Wu, Chou & Chang (2008) - “A Machine-Learning Approach for Analyzing Document Layout Structures with Two Reading Orders” (Pattern Recognition 2008) | Early ML method explicitly addressing dual reading order in two-column and complex layouts. Cited by FocalOrder as a historical predecessor. No arXiv. DOI: 10.1016/j.patcog.2008.03.009.
- | MTD (Hu et al., ICPR 2022) - “Multimodal Tree Decoder for Table of Contents Extraction in Document Images” | Multimodal tree decoder for ToC extraction; introduces HierDoc benchmark dataset. Same group as HRDoc (AAAI 2023) but earlier and distinct paper. Primary baseline for CMM and SEG2ACT. No arXiv found.
- | Bentabet et al. (ICDAR 2019) - “Table-of-Contents Generation on Contemporary Documents” | LSTM-based multi-stage ToC generation pipeline; one of the earliest deep learning approaches for ToC/hierarchy extraction; baseline cited in CMM and SEG2ACT. No arXiv found. DOI: 10.1109/ICDAR.2019.00025.
1709.00770 | Rahman & Finin (BDCAT 2017) - “Deep Understanding of a Document’s Structure” | RNN-based classifier assigning logical structure depth labels (section, subsection, etc.) to document blocks; direct predecessor compared against in HELD and SEG2ACT.

Benchmarks / Competitions

Reference | Title | Why Relevant
ICDAR 2024 | RDTAG Task B (ICDAR 2024) | Competition: “Prediction of Reading Order” for text spotted through AR glasses. 33 registered teams. Site.

Metrics

Metric | Source | Notes
Reading-order-independent IE metrics | Villanova-Aparisi et al., ICDAR 2024 (2404.18664) | Metrics for handwritten IE that decouple transcription accuracy from reading order errors.

  • Document Layout Analysis: Models and datasets for detecting and classifying regions on a document page.
  • Table Structure Recognition: Models and datasets for parsing the internal grid of detected tables.
  • OCR: Text recognition pipelines that often run downstream of layout detection and reading order prediction.
  • Document Understanding: End-to-end systems combining layout detection, structure recognition, and content extraction.