Disclaimer: This page covers models for predicting the logical reading sequence of detected regions. For detecting where regions are on a page, see the Layout Page. For parsing table internal structure, see the TSR Page.
Overview
Reading order prediction determines the logical sequence in which detected regions should be read. While it depends on layout detection for region proposals, it is a distinct task with its own models, datasets, and evaluation challenges.
The core difficulty is that reading order is not purely spatial. Multi-column layouts, sidebars, footnotes, and floating elements all require understanding the logical flow of a document rather than just scanning top-to-bottom, left-to-right.
Note: We are actively researching this area. Expect significant updates to this section.
Notes: Auto-extracted from DocX XML metadata via color-based watermarking. English only.
| Date | Dataset | Pages | Domain | Format | License | Notes |
|------|---------|-------|--------|--------|---------|-------|
| 2015-10 | ENP | 528 | Historical Newspapers | PAGE-XML with reading order | Unknown | Reading order annotations in PAGE-XML format. See Layout notes. |
Reading Order: Metrics
| Metric | What it measures |
|--------|------------------|
| BLEU (Page-level) | N-gram overlap between the predicted and ground-truth reading order sequences. Used by LayoutReader. |
| ARD (Average Relative Distance) | Positional displacement between predicted and ground-truth element positions. Penalizes omissions. |
LayoutReader: Pre-training of Text and Layout for Reading Order
TL;DR
This paper introduces ReadingBank, a 500k-page benchmark for reading order detection built by automatically extracting word order from DocX XML metadata and aligning it to rendered PDF bounding boxes via a color-based watermarking scheme. The accompanying LayoutReader model, a seq2seq architecture built on LayoutLM, achieves near-perfect reading order detection (0.98 BLEU) and improves text line ordering for both open-source and commercial OCR engines.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The headline contribution is ReadingBank, the first large-scale benchmark for reading order detection. The automated data pipeline (DocX XML parsing + color-based alignment) is the core novelty that enables everything else: without it, there is no dataset, and without the dataset, the model cannot be trained.
The authors also propose LayoutReader, a seq2seq model using LayoutLM as an encoder with a pointer-network-style decoder for predicting reading order permutations ($\Psi_{\text{Method}}$). They further demonstrate practical value by using the model to reorder OCR engine outputs, improving line-level ordering for both Tesseract and a commercial engine ($\Psi_{\text{Impact}}$).
What is the motivation?
Reading order detection is a prerequisite for document understanding tasks such as extraction, summarization, and question answering.
The problem: Traditional OCR engines typically output text in top-to-bottom, left-to-right order. This heuristic fails on complex layouts like multi-column articles, forms, and invoices.
The gap: Deep learning approaches were limited by the lack of large-scale supervision. Manual annotation of reading order is prohibitively expensive (it requires full page context and careful sequential labeling). Prior deep learning efforts (Li et al., 2020a) used small in-house datasets that were not publicly available for comparison.
The insight: The reading order of WORD documents is already embedded in their DocX XML metadata. By converting DocX files to PDFs and aligning the XML-derived word sequence to PDF bounding boxes, one can obtain large-scale reading order supervision at near-zero annotation cost.
What is the novelty?
1. Dataset Construction (ReadingBank)
The core innovation is the data pipeline.
Source: DocX files crawled from the internet, where the XML structure preserves the logical reading order (paragraphs traversed line-by-line, tables traversed cell-by-cell).
Alignment scheme: To map XML text nodes to visual bounding boxes (from a rendered PDF), the authors use a color-based watermarking technique. Each word is rendered with a unique RGB color corresponding to its appearance index $i$ in the reading sequence:
$$
\begin{aligned}
r &= (i \mathbin{\&} \text{0xFF0000}) \gg 16 \\
g &= (i \mathbin{\&} \text{0x00FF00}) \gg 8 \\
b &= i \mathbin{\&} \text{0x0000FF}
\end{aligned}
$$
This resolves duplicate words (e.g., multiple instances of “the”) by their color-coded appearance index, creating a 1:1 mapping between logical order and spatial layout. Each word-index pair $(w, i)$ in the DocX is matched to its PDF counterpart $(w', c, x_0, y_0, x_1, y_1, W, H)$ subject to $w = w'$ and $c = C(i)$.
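As a concrete sketch of this mask-and-shift encoding (function names `index_to_color` and `color_to_index` are illustrative, not from the paper), assuming standard 8-bit RGB channels:

```python
def index_to_color(i: int) -> tuple[int, int, int]:
    """Encode a word's reading-order index i as a unique RGB color C(i)."""
    r = (i & 0xFF0000) >> 16
    g = (i & 0x00FF00) >> 8
    b = i & 0x0000FF
    return r, g, b

def color_to_index(r: int, g: int, b: int) -> int:
    """Invert the encoding: recover the reading-order index from a rendered word's color."""
    return (r << 16) | (g << 8) | b
```

Because the mapping is a bijection over 24-bit indices, every rendered word can be traced back to its exact position in the XML-derived sequence, even when the same word appears many times on a page.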
Scale: 500,000 document pages split 8:1:1 into training (400k), validation (50k), and test (50k) sets. The average page contains approximately 196 words.
2. LayoutReader Model
The model treats reading order as a sequence generation task.
Encoder: LayoutLM (initialized from pre-trained weights) encodes text tokens alongside their 2D bounding box coordinates. Source and target segments are packed into one contiguous input sequence.
Attention mask: A custom self-attention mask $M$ allows full bidirectional attention within the source segment while enforcing causal (left-to-right) attention in the target segment:
$$
M_{i,j} = \begin{cases}
1 & \text{if } i, j \in \text{src}\text{, or } i \in \text{tgt} \text{ and } j \le i \\
0 & \text{otherwise}
\end{cases}
$$
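A minimal pure-Python sketch of such a packed-sequence mask (the function name is illustrative; 1 means attention is allowed, with the source segment occupying the first `src_len` positions):

```python
def seq2seq_attention_mask(src_len: int, tgt_len: int) -> list[list[int]]:
    """Build the packed-sequence mask M: full bidirectional attention inside
    the source segment; each target position attends to the whole source
    plus earlier (and current) target positions only."""
    n = src_len + tgt_len
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < src_len and j < src_len:
                M[i][j] = 1            # both in source: bidirectional
            elif i >= src_len and j <= i:
                M[i][j] = 1            # target: causal, sees all of source
    return M
```

This lets a single Transformer stack act as both a bidirectional encoder (over the source) and an autoregressive decoder (over the target) in one forward pass.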
Decoder (pointer network): Instead of generating vocabulary tokens, the decoder predicts an index into the source sequence. The probability of selecting source index $i$ at decoding step $k$ is:
$$
P(x_k = i) = \frac{\exp(e_i^\top h_k + b_k)}{\sum_{j \in \text{src}} \exp(e_j^\top h_k + b_k)}
$$
where $e_i$ and $e_j$ are source input embeddings, $h_k$ is the hidden state at step $k$, and $b_k$ is a bias term. This constrains predictions to valid source indices rather than an open vocabulary.
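The pointer distribution amounts to a softmax over dot products between the decoder state and each source embedding. A toy sketch (`pointer_step` is a hypothetical name; embeddings are plain lists for illustration):

```python
import math

def pointer_step(h_k, src_embeddings, b_k=0.0):
    """One decoding step: softmax over e_i . h_k + b_k for each source
    embedding e_i, returning P(next index = i) over source positions."""
    scores = [sum(h * e for h, e in zip(h_k, e_i)) + b_k
              for e_i in src_embeddings]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]
```

At each step the decoder emits a distribution over source positions, so the full output is a permutation of the input words rather than free-form text.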
Evaluation metrics
Average Page-level BLEU: Measures n-gram overlap between the predicted reading order and the ground-truth sequence, computed per page and then averaged.
Average Relative Distance (ARD): Measures positional displacement between elements of the predicted sequence $A$ and the ground-truth sequence $B$, where $I(e_k, B)$ is the index of element $e_k$ in $B$; an element of $A$ absent from $B$ incurs a penalty of $n$ (the sequence length):
$$
s(e_k, B) = \begin{cases}
|k - I(e_k, B)| & \text{if } e_k \in B \\
n & \text{otherwise}
\end{cases}
$$
$$
\text{ARD}(A, B) = \frac{1}{n} \sum_{e_k \in A} s(e_k, B)
$$
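A direct implementation of this definition (illustrative; assumes hashable elements and takes $n$ as the length of the predicted sequence):

```python
def ard(pred, gold):
    """Average Relative Distance between predicted sequence pred (A) and
    ground truth gold (B). Elements present in both contribute
    |position_in_A - position_in_B|; omitted elements are penalized with n."""
    n = len(pred)
    index_in_gold = {e: i for i, e in enumerate(gold)}
    total = 0
    for k, e in enumerate(pred):
        total += abs(k - index_in_gold[e]) if e in index_in_gold else n
    return total / n
```

The omission penalty explains why a model that orders most tokens correctly but drops some of them can still score badly on ARD while keeping a high BLEU.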
Key experiments:
Reading order detection: Full model vs. modality ablations (text-only with BERT or UniLM, layout-only with LayoutLM minus token embeddings) and a heuristic baseline (left-to-right, top-to-bottom sorting).
Input order study: Shuffling a proportion $r$ of training inputs (r = 0%, 50%, 100%) to test whether the model relies on the default OCR ordering or genuinely learns layout-based reading order.
OCR adaptation: Reordering the text line output of Tesseract and a commercial OCR engine using LayoutReader, evaluated on a line-level adaptation of ReadingBank.
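For intuition, the left-to-right, top-to-bottom heuristic baseline can be sketched as bucketing boxes into lines by top coordinate and reading each line left to right (`line_tol` is an assumed grouping tolerance, not a parameter from the paper). Note how it interleaves the columns of a two-column page, which is exactly the failure mode motivating a learned model:

```python
def heuristic_reading_order(boxes, line_tol=5.0):
    """Naive L-R, T-B baseline. boxes: list of (x0, y0, x1, y1) tuples;
    returns box indices in predicted reading order."""
    idx = sorted(range(len(boxes)), key=lambda i: boxes[i][1])  # by top y
    lines = []                       # list of (anchor_y, [indices]) groups
    for i in idx:
        y = boxes[i][1]
        if lines and abs(y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(i)   # same visual line
        else:
            lines.append((y, [i]))   # start a new line
    order = []
    for _, line in lines:
        order.extend(sorted(line, key=lambda i: boxes[i][0]))  # left to right
    return order

# Two-column layout: the heuristic reads across both columns row by row
# instead of finishing the left column first.
boxes = [(0, 0, 10, 10), (50, 0, 60, 10), (0, 20, 10, 30), (50, 20, 60, 30)]
```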
What are the outcomes/conclusions?
Layout signals dominate
The layout-only variant significantly outperforms text-only baselines, confirming that spatial features (bounding box coordinates) are the primary signal for reading order detection.
| Model | Encoder | BLEU $\uparrow$ | ARD $\downarrow$ |
|-------|---------|-----------------|------------------|
| Heuristic (L-R, T-B) | N/A | 0.6972 | 8.46 |
| LayoutReader (text only) | BERT | 0.8510 | 12.08 |
| LayoutReader (text only) | UniLM | 0.8765 | 10.65 |
| LayoutReader (layout only) | LayoutLM (layout only) | 0.9732 | 2.31 |
| LayoutReader (full) | LayoutLM | 0.9819 | 1.75 |
Note that the text-only variants actually have worse ARD than the heuristic baseline despite better BLEU. The authors attribute this to severe ARD penalties for token omission: the text-only models produce roughly correct orderings but fail to generate complete sequences.
Robustness to input order
The layout-only and full models are nearly unaffected by input shuffling (BLEU drops from 0.9732 to 0.9701 for layout-only at r=100%), while text-only models collapse (BLEU drops from 0.8765 to 0.3440 for UniLM at r=100%). This confirms the model learns spatial reading order rules rather than memorizing the default OCR sequence.
Practical OCR improvement
Adapting LayoutReader to reorder text lines from commercial OCR engines yielded improvements in line-level BLEU, validating the model’s potential as a post-processing module for production pipelines.
Limitations
Synthetic ground truth: The “ground truth” is defined by the DocX structure. Users can create DocX files with logical structures that do not match the visual reading order (e.g., floating text boxes). The model learns the authoring order, which is a proxy for, but not identical to, reading order.
Licensing ambiguity: The ReadingBank repository presents conflicting information, citing purely academic usage limitations despite an Apache 2.0 license file. This complicates commercial adoption.
Language restriction: The dataset is strictly filtered to English documents, limiting multilingual generalization.
No scanned document evaluation: All documents are born-digital (DocX-to-PDF conversions). Performance on noisy scanned documents with OCR errors is not evaluated.
Reproducibility
Models
LayoutReader uses LayoutLM-base as the encoder (approximately 113M parameters). Pre-trained weights are released via Google Drive, though the license is unspecified.
Text-only ablations substitute BERT-base or UniLM-base for the encoder.
The decoder is implemented as a modified seq2seq layer from the s2s-ft toolkit in the UniLM repository.
Algorithms
Optimizer: AdamW with initial learning rate $7 \times 10^{-5}$ and 500 warm-up steps
Training: 3 epochs, batch size 4 per GPU (effective batch size 16 across 4 GPUs)
Loss: Standard cross-entropy over pointer indices (not explicitly stated but implied by the seq2seq formulation)
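The warm-up behavior above can be sketched as a simple linear ramp; the post-warm-up decay policy is not specified in the paper, so the rate is held flat here purely for illustration:

```python
def lr_at_step(step: int, base_lr: float = 7e-5, warmup_steps: int = 500) -> float:
    """Linear warm-up matching the reported hyperparameters: the learning
    rate ramps from 0 to base_lr over the first 500 steps. The decay policy
    after warm-up is not stated in the paper, so it is held constant here."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```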
Data
Source: DocX files crawled from the internet, filtered for English using Azure Text Analytics API with a high confidence threshold
Size: 500,000 pages (400k/50k/50k train/val/test split), pages with fewer than 50 words excluded
Availability: Released at github.com/doc-analysis/ReadingBank under Apache 2.0, though the repository text restricts usage to research purposes, creating ambiguity
Annotation: Fully automated via the DocX XML + coloring scheme pipeline (no human annotation)
Evaluation
Metrics: Average Page-level BLEU and ARD (defined above)
Limitations acknowledged by authors: Ground truth is derived from authoring order, not human-verified reading order. Future work includes labeling a real-world scanned document dataset.
Statistical rigor: No error bars, significance tests, or multi-seed experiments reported
Hardware
Training: 4 $\times$ Tesla V100 (32GB per GPU), approximately 6 hours for 3 epochs (~75k steps)
Inference: Not explicitly reported
BibTeX
@inproceedings{wang2021layoutreader,
title={LayoutReader: Pre-training of Text and Layout for Reading Order Detection},
author={Wang, Zilong and Xu, Yiheng and Cui, Lei and Shang, Jingbo and Wei, Furu},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={4735--4744},
year={2021}
}