Document Understanding & Visual Question Answering
Tracking models and benchmarks for document understanding, visual information extraction, and document VQA.
Table of Contents
- Overview
- Models
- Datasets & Benchmarks
- OmniParser V2: Unified Visual Text Parsing with Structured-Points-of-Thought
  - TL;DR
  - What kind of paper is this?
  - What is the motivation?
  - What is the novelty?
  - What experiments were performed?
  - What are the outcomes/conclusions?
  - Reproducibility
  - BibTeX
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
  - TL;DR
  - What kind of paper is this?
  - What is the motivation?
  - What is the novelty?
  - What experiments were performed?
  - What are the outcomes/conclusions?
  - Reproducibility
  - BibTeX
- OmniParser V1: A Unified Framework for Text Spotting, Key Information Extraction, and Table Recognition
  - TL;DR
  - What kind of paper is this?
  - What is the motivation?
  - What is the novelty?
  - What experiments were performed?
  - What are the outcomes/conclusions?
  - Reproducibility
  - BibTeX
Disclaimer: This page tracks methods that consume document layout, text, and visual information to answer questions or extract structured data. For models that produce layout annotations (bounding boxes, region labels), see the Layout Page. For text recognition and OCR pipelines, see the OCR Page.
Overview
Document Understanding encompasses tasks where a model must jointly reason over a document’s text, visual appearance, and spatial layout to produce answers, extracted fields, or structured outputs. Key task families include:
- Document VQA: Open-ended question answering over document images.
- Visual Information Extraction (VIE): Extracting key-value pairs from forms, receipts, invoices, and other semi-structured documents.
- Visual Machine Reading Comprehension: Answering questions that require reading and reasoning over rendered text in context.
These tasks differ from layout analysis in that the output is semantic (answers, extracted values) rather than geometric (bounding boxes, region classes). Many methods build on top of layout-aware encoders (LayoutLM, LayoutLMv3) but apply them to downstream comprehension tasks rather than detection.
Models
Unified Document Parsing
End-to-end systems that jointly perform text spotting, layout detection, table structure recognition, and key information extraction in a single framework. Unlike task-specific TSR or OCR models, these systems are designed to parse entire document pages into structured output without offline preprocessing.
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-02 | OmniParser V2 | None released | AdvancedLiterateMachinery | Apache-2.0 | HUST + Alibaba. Swin-B + FPN + Token-Router Shared Decoder (MoE-like). ~110M params. Two-stage SPOT prompting; covers TSR + TCR + text spotting + KIE + layout. No offline OCR required. S-TEDS 93.2 / TEDS 90.5 (FinTabNet). |
| 2024-03 | OmniParser V1 | None released | AdvancedLiterateMachinery | Apache-2.0 | CVPR 2024. Swin-B + FPN + three independent decoders (text spotting + KIE + TSR). Predecessor to V2. S-TEDS 90.45 / TEDS 88.83 (PubTabNet). |
LLM-Based
Models that pair a document-specialized encoder with a large language model backbone for instruction-following or generative document understanding.
| Model Family | Encoder | LLM Backbone | Code | License | Notes |
|---|---|---|---|---|---|
| LayoutLLM (2024) | LayoutLMv3-large | Vicuna-7B | None | N/A | Layout instruction tuning (5.7M pre-training instructions across document/region/segment levels) + LayoutCoT for layout-aware chain-of-thought reasoning. Zero-shot eval on DocVQA, VisualMRC, FUNSD, CORD, SROIE. |
Datasets & Benchmarks
Document VQA
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| DocVQA (2021) | Document visual QA | ANLS | 50K questions, 12K images | Mathew et al., WACV 2021. Extractive QA over diverse industry documents. |
| VisualMRC (2021) | Visual machine reading comprehension | Rouge-L | 30K+ questions | Tanaka et al., AAAI 2021. QA over web page screenshots. |
Visual Information Extraction
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| FUNSD (2019) | Form understanding | F1 | 199 forms | Jaume et al., ICDAR 2019 Workshop. Entity labeling + linking on noisy scanned forms. |
| CORD (2019) | Receipt key info extraction | F1 | 1,000 receipts | Park et al. Post-OCR parsing of Indonesian receipts. |
| SROIE (2019) | Receipt key info extraction | F1 | 973 receipts | ICDAR 2019 Competition. Scanned receipt text localization + key info extraction. |
OmniParser V2: Unified Visual Text Parsing with Structured-Points-of-Thought
TL;DR
OmniParser V2 is a unified visually-situated text parsing (VsTP) model that handles text spotting, key information extraction (KIE), table recognition (TR), and layout analysis within a single encoder-decoder. The core contribution is Structured-Points-of-Thought (SPOT) prompting: a two-stage decoding strategy that first generates a structured sequence of text center points, then predicts polygon contours and content in parallel from those points. A token-router-based shared decoder (a supervised MoE variant) replaces three separate decoders from the prior version, reducing model size by 23.6% while improving performance.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper’s headline contribution is a new architecture and training paradigm. The token-router-based shared decoder and SPOT prompting scheme are the central contributions, supported by ablations that isolate each design choice and SOTA-style comparison tables across four tasks.
What is the motivation?
Visually-situated text parsing spans several related subtasks: text spotting (detection + recognition), KIE (linking named fields to their values), table recognition (structure + cell content), and layout analysis (word/line/paragraph hierarchy). Existing approaches address each subtask with dedicated architectures and objectives, creating modal isolation and complex multi-stage pipelines. Generalist models (typically large MLLMs) offer breadth but sacrifice localization precision and often fail without an external OCR engine. Specialist models are strong on their subtask but compose poorly.
The authors argue that a single model with a shared representation and a unified objective can match or exceed specialist performance while eliminating task-specific heads and reducing inference complexity.
What is the novelty?
Structured-Points-of-Thought (SPOT) Prompting. SPOT decomposes generation into two stages:
- Stage 1 (Structured Points Sequence): The decoder autoregressively generates center-point coordinates of text instances, interleaved with task-specific structural tokens (e.g., `<tr>` for table rows, `<line>` for layout lines). Coordinates are quantized into discrete bins: $x, y \in [0, n_{\text{bins}} - 1]$ with $n_{\text{bins}} = 1000$.
- Stage 2 (Polygon + Content): Given each center point from Stage 1 as a prompt, the same decoder generates the full 16-point polygon and the character-level text transcript in parallel across all instances.
This decoupling reduces sequence length relative to directly predicting full HTML (as in Donut) and provides an explicit spatial anchor that reduces attention drift in long sequences. The authors draw an analogy to chain-of-thought reasoning: structured points are intermediate “reasoning” steps before the final output.
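To make the two-stage decomposition concrete, the snippet below sketches how a Stage-1 structured-points target might be assembled for a table: center points are quantized into $n_{\text{bins}} = 1000$ bins and interleaved with structural tokens, and at inference each predicted point then prompts Stage-2 polygon and transcript decoding. Only the `<tr>` token and the bin count come from the paper; the helper names and exact sequence layout are assumptions.

```python
# Minimal sketch (illustrative, not the released implementation) of a Stage-1
# structured-points target for a table: quantized center points interleaved with
# structural tokens. Only <tr> and n_bins = 1000 come from the paper; the exact
# sequence layout and helper names are assumptions.

N_BINS = 1000

def quantize_point(x: float, y: float, img_w: int, img_h: int) -> tuple[int, int]:
    """Normalize (x, y) by the image size, then map to integer bins in [0, N_BINS - 1]."""
    qx = min(int(x / img_w * N_BINS), N_BINS - 1)
    qy = min(int(y / img_h * N_BINS), N_BINS - 1)
    return qx, qy

def build_points_sequence(rows, img_w, img_h):
    """rows: list of table rows, each a list of (cx, cy) text-instance center points."""
    seq = []
    for row in rows:
        seq.append("<tr>")                      # structural token marking a new row
        for cx, cy in row:
            qx, qy = quantize_point(cx, cy, img_w, img_h)
            seq.extend([str(qx), str(qy)])      # two coordinate tokens per instance
    return seq

# Example: a two-row table on a 1920x1080 page.
print(build_points_sequence([[(300, 200), (900, 205)], [(310, 400), (905, 402)]], 1920, 1080))
# ['<tr>', '156', '185', '468', '189', '<tr>', '161', '370', '471', '372']
```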
Token-Router-Based Shared Decoder. Rather than three independent decoders (one per decoding mode: structure, detection, recognition), OmniParser V2 uses a single decoder in which each transformer layer contains three task-specific feed-forward networks (Structured FFN, Detection FFN, Recognition FFN). Token routing is not learned; the category of each input token is predetermined by task-specific priors, making training more stable than a standard MoE. Training uses a weighted negative log-likelihood over the target sequence:
$$ L = -\sum_{j=k}^{N} w_j \log P(\tilde{s}_j \mid v, s_{k:j-1}) $$
where $w_j = 4.0$ for structural/entity tags and $w_j = 1.0$ for other tokens. The shared self-attention and cross-attention layers use the same parameters across all token types; only the FFNs differ. This yields 110M total parameters versus 144M for OmniParser V1, a 23.6% reduction.
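A minimal PyTorch sketch of the routed-FFN idea follows. The hidden size (512) and head count (8) match the paper's decoder configuration; the module layout, the omission of cross-attention to visual features, and the way routing labels are passed in are assumptions made for brevity.

```python
import torch
import torch.nn as nn

# Sketch of a token-router decoder layer: shared self-attention, three task-specific
# FFNs ("structured", "detection", "recognition"), with routing decided by a
# precomputed per-token label rather than a learned gate. Cross-attention to the
# visual features is omitted here for brevity.

class RoutedDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One FFN per token category; these parameters are NOT shared across categories.
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, ffn_mult * d_model), nn.GELU(),
                          nn.Linear(ffn_mult * d_model, d_model))
            for _ in range(3)  # 0: structured, 1: detection, 2: recognition
        ])

    def forward(self, x: torch.Tensor, route: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); route: (batch, seq) integer labels in {0, 1, 2},
        # determined by task priors, not learned (the "supervised MoE" aspect).
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        ffn_out = torch.zeros_like(x)
        for i, ffn in enumerate(self.ffns):
            mask = route == i                     # tokens assigned to FFN i
            if mask.any():
                ffn_out[mask] = ffn(x[mask])
        return self.norm2(x + ffn_out)

# Tiny smoke test: 4 tokens routed to different FFNs.
layer = RoutedDecoderLayer()
x = torch.randn(1, 4, 512)
route = torch.tensor([[0, 0, 1, 2]])
print(layer(x, route).shape)  # torch.Size([1, 4, 512])
```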
Pre-training Strategies. Two auxiliary strategies improve the decoder's spatial and semantic grounding (a brief sketch of spatial-window target selection follows the list):
- Spatial-window prompting: the decoder is conditioned on a bounding box and tasked with predicting center points only within that window, training robust spatial coordinate perception.
- Prefix-window prompting: the decoder is conditioned on a character range and tasked with predicting points only for text instances whose first character falls within that range, training character-level semantic grounding.
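A small, self-contained sketch of how spatial-window prompting targets could be selected, assuming text instances are stored as dicts with a `center` field; the data structures and function name are illustrative, not the paper's code.

```python
# Illustrative sketch of spatial-window prompting targets: given a window prompt,
# keep only the instances whose center falls inside it. Field names are assumptions.

def spatial_window_targets(instances, window):
    """instances: list of dicts with 'center' = (x, y); window: (x_left, y_top, x_right, y_bottom)."""
    x0, y0, x1, y1 = window
    return [inst for inst in instances
            if x0 <= inst["center"][0] <= x1 and y0 <= inst["center"][1] <= y1]

instances = [{"center": (120, 80), "text": "Invoice"},
             {"center": (640, 900), "text": "Total"}]
print(spatial_window_targets(instances, (0, 0, 960, 540)))
# [{'center': (120, 80), 'text': 'Invoice'}]
```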
SPOT for MLLMs. The authors demonstrate that SPOT prompting can be applied to existing multimodal LLMs (e.g., QwenVL) via supervised fine-tuning on curated text spotting datasets of 181k and 389k examples (TS180k and TS380k). The two-stage instruction-following formulation (first generate center points; then generate polygons and transcripts) substantially improves text localization in models that otherwise struggle with it.
What experiments were performed?
Experiments span four tasks across eight datasets.
Text Spotting: Total-Text, CTW1500 (curved text, line-level), and ICDAR 2015 (quadrilateral, word-level). Metrics: end-to-end F1 under None/Full/Strong/Weak/Generic lexicons. Baselines include DeepSolo, UNITS, TESTR, ABCNet v2, and others.
Key Information Extraction: CORD (1k receipt samples, 30 entity labels) and SROIE (626/347 receipts, 4 entities). Metrics: field-level F1 and TED-based accuracy. Baselines restricted to end-to-end (OCR-free) methods comparable to the OmniParser V2 formulation.
Table Recognition (end-to-end): PubTabNet (500k train, 9k val) and FinTabNet (113k, financial tables). Metric: TEDS and S-TEDS. Baselines: EDD, Donut (fine-tuned), and OmniParser V1. Importantly, this is an end-to-end evaluation (structure + cell content recognized jointly), not a pure TSR evaluation, so results are not directly comparable to non-end-to-end TSR methods like TFLOP or UniTabNet that use offline OCR.
Layout Analysis: HierText validation and test sets. Metric: Panoptic Quality (PQ) at word, line, and paragraph level. Baselines: UniDec and Hi-SAM-B.
Ablations:
- Pre-training strategies (spatial-window vs. prefix-window): each contributes independently; combined they add +1.8 points E2E on Total-Text.
- Decoder architecture: token-router-based decoder outperforms native shared decoder (+1.72% on text spotting) and standard MoE decoder, at lower parameter count than OmniParser V1.
- Decoder length for table recognition: performance saturates at 1,500 structure tokens + 200 content tokens; longer sequences yield marginal gains at significant speed cost.
- SPOT length for MLLMs (N-SPOT / S-SPOT / L-SPOT): normal SPOT (generating the full center-point sequence as an intermediate step) performs best; omitting it (S-SPOT) or extending to per-instance detection/recognition prompting (L-SPOT) both hurt.
What are the outcomes/conclusions?
On text spotting, OmniParser V2 achieves competitive results across all three benchmarks, with Total-Text E2E (None) at 84.3%, and ICDAR 2015 Generic at 80.6%, matching or exceeding prior unified models.
On KIE, it reaches 85.0% field F1 on CORD and outperforms prior generation-based methods on SROIE TED accuracy, while being pre-trained on scene text data only (no large document corpora).
On end-to-end table recognition, it achieves S-TEDS 90.5 / TEDS 88.9 on PubTabNet and S-TEDS 93.2 / TEDS 90.5 on FinTabNet, improving over OmniParser V1 and substantially over Donut. However, results are lower than pure TSR methods (e.g., TFLOP at 99.56 S-TEDS on FinTabNet) that assume clean OCR input; the comparison set in the paper is correctly restricted to end-to-end methods.
On layout analysis (HierText test), OmniParser V2 surpasses Hi-SAM-B by +1.9 / +1.2 / +0.8 PQ at word, line, and paragraph level respectively, in an end-to-end fashion without offline OCR.
SPOT applied to QwenVL and similar MLLMs shows meaningful improvements in text localization over both the base MLLMs and some specialist baselines. A performance gap versus OmniParser V2 itself remains, which the authors acknowledge as an open problem.
Limitations acknowledged by the authors:
- Understanding and reasoning capabilities remain limited; they cite reinforcement learning (e.g., DeepSeek-R1 style) as a future direction.
- A gap remains between SPOT-augmented MLLMs and the lightweight OmniParser V2 model on localization tasks.
- The paper does not compare end-to-end table recognition to non-end-to-end TSR methods, which would require a different experimental setup.
Unacknowledged limitations:
- The evaluation on table recognition excludes widely used benchmarks (e.g., WTW for wild tables) where non-end-to-end methods are typically reported. The restricted benchmark set makes cross-paper comparison difficult.
- No inference latency breakdown per task is reported; the 2.1 FPS figure is for the table recognition setting only (the most sequence-length-constrained task).
- No error bars, seeds, or significance tests are reported.
Reproducibility
Models
- Architecture: Swin-B (pretrained on ImageNet-22k) encoder with FPN, producing multi-scale visual embeddings. Token-router-based shared decoder with 4 transformer layers, 8 attention heads, hidden dimension 512, MLP amplification factor 4. Total parameters: approximately 110M.
- Code: The paper states code will be released at AlibabaResearch/AdvancedLiterateMachinery. The repository exists and is licensed Apache-2.0, but as of the preprint date no OmniParser V2 subdirectory has been added (only OmniParser V1 is present in the OCR/ folder).
- Weights: Not released at time of preprint submission.
Algorithms
- Pre-training Stage 1: AdamW, batch 128, resolution $768 \times 768$, 500k steps, lr $5 \times 10^{-4}$, 5k-step linear warm-up then linear decay.
- Pre-training Stage 2: AdamW, batch 16, resolution $1920 \times 1920$, 200k steps, lr $2.5 \times 10^{-4}$, same schedule.
- Fine-tuning: Learning rate $1 \times 10^{-4}$, cosine decay. Text spotting and KIE: 20k and 200k steps respectively. Table recognition and layout: up to 400k steps.
- Data augmentation: Instance-aware random cropping, random rotation in $[-90^\circ, 90^\circ]$, random resizing, color jittering.
- Loss: Negative log-likelihood with token weighting ($w = 4.0$ for structural tags, $w = 1.0$ for others). Prompt tokens excluded from loss.
- Coordinate quantization: $n_{\text{bins}} = 1000$; coordinates normalized to image dimensions then quantized to integers.
Data
- Pre-training data (9 datasets): Curved SynthText, ICDAR 2013, ICDAR 2015, MLT 2017, Total-Text, TextOCR, HierText, COCO Text, Open Image V5.
- MLLM fine-tuning data: TS180k (181k) and TS380k (389k) text spotting examples; R440k (447k) and R980k (981k) read-all-text examples. Constructed from public datasets and data collected from Platypus.
- Task-specific fine-tuning: Uses standard splits for each benchmark (PubTabNet, FinTabNet, CORD, SROIE, Total-Text, CTW1500, ICDAR 2015, HierText).
- Public availability: PubTabNet, FinTabNet, CORD, SROIE, Total-Text, CTW1500, HierText, and ICDAR benchmarks are all publicly available. Platypus data is publicly available. Curved SynthText is publicly available.
Evaluation
- Text spotting: End-to-end F1 with lexicon (None, Full, Strong, Weak, Generic). Standard for Total-Text and ICDAR 2015.
- KIE: Field-level F1 and TED-based accuracy. Standard for CORD and SROIE.
- Table recognition: TEDS and S-TEDS. Comparison is restricted to end-to-end methods only (EDD, Donut, OmniParser V1); non-end-to-end TSR methods are explicitly excluded.
- Layout analysis: Panoptic Quality (PQ) at three levels. Standard for HierText.
- Baselines: Generally fair within each task track, though compute budgets are not normalized across methods.
- Statistical rigor: No error bars, seeds, or significance testing reported. Single-run results throughout.
Hardware
- Hardware is not explicitly stated in the paper.
- From context: the two-stage pre-training (700k total steps, batch 128 at $768^2$ then batch 16 at $1920^2$) suggests substantial GPU resources, likely 8+ A100-class GPUs.
- Table recognition inference at 1,124 + 200 decoder tokens runs at 2.1 FPS; the default 1,500 + 200 configuration runs at 1.7 FPS (configuration-dependent).
- The paper notes GPU constraints limit Donut’s max sequence length to 4,000 in their reproductions, providing a rough lower bound on memory requirements.
BibTeX
@article{yu2025omniparserv2,
title={OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models},
author={Yu, Wenwen and Yang, Zhibo and Wan, Jianqiang and Song, Sibo and Tang, Jun and Cheng, Wenqing and Liu, Yuliang and Bai, Xiang},
journal={arXiv preprint arXiv:2502.16161},
year={2025}
}
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
TL;DR
LayoutLLM integrates a document pre-trained model (LayoutLMv3) as the visual-textual encoder for an LLM and introduces a two-stage “layout instruction tuning” strategy: layout-aware pre-training across document, region, and segment levels (5.7M instructions), followed by layout-aware supervised fine-tuning with a “LayoutCoT” (Layout Chain-of-Thought) module that decomposes document QA into question analysis, relevant area localization, and answer formation. On zero-shot document understanding benchmarks, LayoutLLM with Vicuna-7B outperforms existing open-source 7B LLMs and MLLMs by significant margins.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new architecture and training recipe for combining document pre-trained models with LLMs. The paper dedicates most of its content to describing the three-level pre-training tasks, the LayoutCoT module, and ablations isolating each component.
Secondary: $\Psi_{\text{Evaluation}}$. The paper conducts systematic zero-shot evaluation across five benchmarks, comparing against both LLM-based and MLLM-based approaches, providing a useful snapshot of the landscape.
What is the motivation?
Prior approaches to LLM-based document understanding fall into two camps, each with clear shortcomings:
LLM-based methods feed documents as flattened plain text or layout-formatted text (text with coordinates) to LLMs. This does not guarantee effective comprehension of spatial layout information.
MLLM-based methods (LLaVAR, mPLUG-DocOwl, Qwen-VL) use generic visual encoders (ViT, CLIP) rather than document-specialized encoders. Their pre-training tasks (image captioning, plain text generation) do not capture document layout structure.
Additionally, existing SFT approaches directly supervise with the final answer, with no explicit mechanism for the model to learn about document layout during fine-tuning. LayoutLLM addresses both gaps: it uses a document-specialized encoder and introduces layout-aware objectives at both pre-training and fine-tuning stages.
What is the novelty?
Document Pre-trained Model as Encoder
Rather than a generic vision encoder, LayoutLLM uses LayoutLMv3 (a document pre-trained model) that jointly processes the document image, OCR text, and bounding boxes:
$$F_V, F_T = \text{DocPTM}(V, T, \text{Box})$$
where $F_V \in \mathbb{R}^{m \times d_0}$ are visual features and $F_T \in \mathbb{R}^{n \times d_0}$ are text-layout features. Two separate MLP projectors map these into the LLM embedding space:
$$H_V = P_V(F_V), \quad H_T = P_T(F_T)$$
These projected features are concatenated with instruction token embeddings and fed to the LLM (Vicuna-7B).
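A minimal PyTorch sketch of this projection path follows, with the dimensions reported in the paper ($d_0 = 1024$ for LayoutLMv3-large, $d_1 = 4096$ for Vicuna-7B). The random stand-in feature tensors, token counts, and the two-layer projector shape are assumptions; the paper only describes the projectors as MLPs.

```python
import torch
import torch.nn as nn

# Sketch of the LayoutLLM projection path: document-encoder features are mapped by
# two separate projectors into the LLM embedding space and concatenated with
# instruction token embeddings. Dimensions follow the paper; everything else is a
# stand-in for illustration.

d0, d1 = 1024, 4096          # LayoutLMv3-large hidden size, Vicuna-7B embedding size

def make_projector(d_in: int, d_out: int) -> nn.Module:
    # The paper calls these MLP projectors; a 2-layer MLP is an assumption.
    return nn.Sequential(nn.Linear(d_in, d_out), nn.GELU(), nn.Linear(d_out, d_out))

proj_v = make_projector(d0, d1)   # P_V for visual features
proj_t = make_projector(d0, d1)   # P_T for text-layout features

# Stand-ins for DocPTM(V, T, Box) outputs: m visual tokens, n text-layout tokens.
F_V = torch.randn(1, 197, d0)     # F_V in R^{m x d0}
F_T = torch.randn(1, 512, d0)     # F_T in R^{n x d0}

H_V, H_T = proj_v(F_V), proj_t(F_T)

# Instruction token embeddings would come from the LLM's own embedding table.
H_instr = torch.randn(1, 32, d1)

# Input to the LLM: projected document features followed by the instruction.
llm_inputs = torch.cat([H_V, H_T, H_instr], dim=1)
print(llm_inputs.shape)           # torch.Size([1, 741, 4096])
```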
Layout-Aware Pre-training (3 Levels)
LayoutLLM defines pre-training tasks across three hierarchical levels, all unified as instruction tuning (5.7M total instructions, ratio 1:4:4 across levels):
Document-level:
- Document Dense Description (DDD): Generate detailed document descriptions (avg 373 words, vs. 36 for LLaVAR captions). Descriptions generated by GPT-3.5 Turbo.
- Text and Layout Reconstruction (TLR): Reconstruct the full text and layout in structured formats (JSON, Markdown, or `<box, text>` pairs).
Region-level:
- Document Layout Analysis (DLA): Locate layout regions by type, or identify the type of a given area.
- Table Understanding (TU): Parse rows, columns, logical coordinates, and cell content.
Segment-level:
- Masked Vision-Language Modeling (MVLM): Predict randomly masked text tokens.
- Mask Position: Recover zeroed-out bounding box coordinates.
- Geometric Layout: Answer direction and distance questions between text lines.
LayoutCoT (Layout Chain-of-Thought) for SFT
The key SFT innovation is a three-step reasoning module that makes layout reasoning explicit:
- Question Analysis: Classify the question type from a layout perspective (table, extraction, reasoning, etc.).
- Relevant Area Concentration: Identify the bounding box region of the document relevant to the question.
- Answer Formation: Generate the answer based on the identified region.
SFT data (300K instructions) is constructed using GPT-3.5 Turbo to generate QA pairs and text-based chain-of-thought, then converting text CoT to LayoutCoT by mapping relevant sentences to their union bounding boxes. Three document sources are used in ratio 5:4.5:0.5 (image documents, GPT-generated HTML documents, MRC text documents).
LayoutCoT also enables interactive correction: users can manually correct the identified region in Step 2, and the model re-generates the answer based on the corrected area.
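For illustration, a hypothetical LayoutCoT-style response might be structured as follows. Only the three-step decomposition and the interactive-correction idea come from the paper; the field names, box convention, and question/answer content are invented for this sketch.

```python
# Hypothetical example of a LayoutCoT-style response, showing the three-step
# decomposition (question analysis -> relevant area -> answer). Field names, the
# box convention, and the content are illustrative, not the paper's format.

layout_cot_example = {
    "question": "What is the invoice total?",
    "step1_question_analysis": "Extraction-type question about a key-value field.",
    "step2_relevant_area": [812, 1403, 1088, 1447],   # (x0, y0, x1, y1) on the page
    "step3_answer": "$1,250.00",
}

def apply_user_correction(cot: dict, corrected_box: list[int]) -> dict:
    """Interactive correction: replace the Step-2 region; Step 3 would then be
    re-generated by the model conditioned on the corrected area."""
    return {**cot, "step2_relevant_area": corrected_box}

print(apply_user_correction(layout_cot_example, [800, 1390, 1100, 1460]))
```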
What experiments were performed?
Zero-shot Evaluation
All models use open-source 7B LLMs. Evaluation spans five benchmarks (a small ANLS sketch follows the table):
| Benchmark | Task | Metric |
|---|---|---|
| DocVQA | Document visual QA | ANLS |
| VisualMRC | Visual machine reading comprehension | CIDEr |
| FUNSD | Form understanding (VIE) | ANLS |
| CORD | Receipt key info extraction (VIE) | ANLS |
| SROIE | Receipt key info extraction (VIE) | ANLS |
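Since most of these benchmarks are scored with ANLS (Average Normalized Levenshtein Similarity), a generic reference implementation is sketched below, with threshold $\tau = 0.5$ as in the standard DocVQA protocol. This is a self-contained sketch, not the official evaluation script.

```python
# Generic reference sketch of ANLS: per question, take the best similarity against
# any ground-truth answer, where similarity = 1 - normalized Levenshtein distance
# if that distance is <= tau (0.5), else 0; then average over questions.

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions: list[str], gold: list[list[str]], tau: float = 0.5) -> float:
    scores = []
    for pred, answers in zip(predictions, gold):
        best = 0.0
        for ans in answers:
            p, g = pred.strip().lower(), ans.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl if nl <= tau else 0.0)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)

print(anls(["$1,250.00", "march 3 2021"], [["$1,250.00"], ["March 3, 2021"]]))  # ~0.96
```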
Key Results
| Model | DocVQA | VisualMRC | FUNSD | CORD | SROIE |
|---|---|---|---|---|---|
| Vicuna-7B (plain text) | 65.16 | 46.71 | 48.56 | 4.49 | 15.83 |
| Vicuna-7B (layout text) | 69.17 | 52.51 | 55.03 | 5.47 | 32.04 |
| mPLUG-DocOwl | 38.20 | 188.40 | 17.24 | 2.74 | 2.79 |
| Qwen-VL | 62.48 | 29.31 | 8.18 | 13.64 | 33.69 |
| LayoutLLM | 72.73 | 221.28 | 55.81 | 33.84 | 57.17 |
| LayoutLLM + LayoutCoT | 69.67 | 218.28 | 60.68 | 42.82 | 58.17 |
LayoutCoT substantially improves VIE tasks (FUNSD +4.87, CORD +8.98) at a slight cost on DocVQA (-3.06). The authors attribute this to the intermediate generation overhead being less helpful for open-ended QA.
For reference, supervised fine-tuned LayoutLMv3-Large achieves 83.37 on DocVQA, 92.08 on FUNSD, and 97.46 on CORD, indicating meaningful room between zero-shot LLM performance and task-specific supervised models.
Ablation Studies
The ablation progressively adds components on top of the layout-text Vicuna-7B baseline:
- Each pre-training level (document, region, segment) contributes incrementally.
- Layout-aware SFT is highly effective, particularly for VIE tasks.
- LayoutCoT boosts VIE performance by providing explicit layout reasoning, at a slight cost on DocVQA.
- Both layout-aware pre-training and layout-aware SFT are important; the combination yields the best overall results.
What are the outcomes/conclusions?
Strengths:
- The idea of using a document-specialized encoder (LayoutLMv3) rather than a generic vision encoder for LLM-based document understanding is well-motivated and effective. The multi-level pre-training tasks provide a principled curriculum for learning layout at different granularities.
- LayoutCoT is an interesting approach to making layout reasoning explicit during fine-tuning, and the interactive correction capability (manually adjusting the relevant area) is a practical design choice for real-world use.
- The zero-shot results consistently outperform other open-source 7B models across all five benchmarks.
Limitations:
- No refusal capability: LayoutLLM cannot indicate when the answer is not present in the document, which is important for real-world deployment.
- Imprecise region-level reasoning: The model still struggles with precisely understanding relationships between multiple regions containing similar content.
- OCR dependency: The method requires external OCR or PDF parsing for text and bounding boxes; it is not end-to-end.
- Proprietary data dependency: SFT and pre-training data construction relies on GPT-3.5 Turbo, introducing a dependency on a proprietary model.
- Token limit: Maximum document tokens limited to 512, which may be insufficient for long or dense documents.
- Scale: Only evaluated with 7B LLMs; no experiments with larger backbones.
Reproducibility
Models
- Document encoder: LayoutLMv3-large ($d_0 = 1024$).
- LLM backbone: Vicuna-7B-v1.5 ($d_1 = 4096$). Also tested with Llama2-7B-chat.
- Projectors: Two separate MLPs mapping $\mathbb{R}^{d_0} \to \mathbb{R}^{d_1}$ (one for visual features, one for text-layout features).
- No public release: No code, model weights, or constructed instruction datasets have been released. No GitHub repository or HuggingFace link is provided.
Algorithms
- Pre-training: AdamW optimizer, LR $1 \times 10^{-4}$ with cosine scheduler, warmup ratio 0.03, weight decay 0.0001, max grad norm 1.0, batch size 32 per GPU, 1 epoch over 5.7M instructions. LLM frozen; projectors and DocPTM encoder updated.
- SFT: LR $2 \times 10^{-5}$ with cosine scheduler, warmup ratio 0.03, no weight decay, max grad norm 1.0, batch size 8 per GPU, 3 epochs over 300K instructions. DocPTM encoder frozen; LLM and projectors fine-tuned (a freezing sketch follows this list).
- Inference: Beam search with beam size 5, no sampling.
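The two-phase freezing recipe above is straightforward to express; here is a toy sketch with stand-in modules (the `set_trainable` helper and the stand-ins are illustrative, not LayoutLLM code).

```python
import torch.nn as nn

# Toy sketch of the two training phases' parameter freezing. The modules are
# trivial stand-ins for the real DocPTM encoder, projectors, and LLM.

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

doc_encoder = nn.Linear(1024, 1024)   # stand-in for LayoutLMv3-large
projectors = nn.Linear(1024, 4096)    # stand-in for the two MLP projectors
llm = nn.Linear(4096, 4096)           # stand-in for Vicuna-7B

# Layout-aware pre-training: LLM frozen; encoder + projectors updated.
set_trainable(llm, False)
set_trainable(doc_encoder, True)
set_trainable(projectors, True)

# Layout-aware SFT: encoder frozen; LLM + projectors fine-tuned.
set_trainable(doc_encoder, False)
set_trainable(llm, True)
set_trainable(projectors, True)
```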
Data
- Pre-training sources: PubLayNet, DocLayNet, DocBank, RVL-CDIP, DocILE (layout tasks); PubTabNet (table understanding); GPT-3.5 Turbo (dense descriptions). No downstream benchmark train/val/test data used.
- SFT sources: Three document types in ratio 5:4.5:0.5: image documents with GPT-generated QA, GPT-generated HTML documents, and MRC text documents.
- Constructed datasets are not released.
Evaluation
- All benchmarks evaluated in zero-shot setting (no fine-tuning on benchmark training splits).
- No error bars, confidence intervals, or multi-run statistics reported.
- FUNSD (50 test forms), CORD (100 test receipts), SROIE (347 test receipts), DocVQA (5,188 test questions), VisualMRC (6,708 test questions).
Hardware
- Not reported. Batch sizes reference “per GPU” but the number and type of GPUs, total training time, and compute cost are not disclosed.
BibTeX
@article{luo2024layoutllm,
title={LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding},
author={Luo, Chuwei and Shen, Yufan and Zhu, Zhaoqing and Zheng, Qi and Yu, Zhi and Yao, Cong},
journal={arXiv preprint arXiv:2404.05225},
year={2024}
}
OmniParser V1: A Unified Framework for Text Spotting, Key Information Extraction, and Table Recognition
TL;DR
OmniParser V1 proposes a single encoder-decoder model for visually-situated text parsing (VsTP) that handles text spotting, key information extraction (KIE), and table recognition under one objective: point-conditioned text generation. A Structured Points Decoder first autoregressively produces center-point coordinates interleaved with task-specific structural tokens; a Region Decoder and a Content Decoder then predict polygon contours and character-level transcripts in parallel from those points. With two pre-training strategies (spatial-window prompting and prefix-window prompting), the model achieves competitive or leading results on 7 benchmarks spanning all three tasks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The core contribution is a new architecture and training paradigm for multi-task visually-situated text parsing. The unified representation (structured points interleaved with task tokens), the three-decoder design, and the two pre-training strategies together constitute the methodological headline. Ablation tables and SOTA-style comparison tables across all three tasks form the bulk of the experimental content.
No strong secondary vector. The multi-task comparison tables serve the methodological argument rather than constituting an independent contribution to evaluation methodology.
What is the motivation?
Visually-situated text parsing covers several tightly coupled subtasks: detecting and recognizing text instances in natural images (text spotting), linking detected text to structured named fields (KIE), and recovering both the logical structure and cell content of tabular layouts (table recognition). Despite their shared visual and textual substrate, prior work has almost universally addressed them with task-specific architectures, objectives, and post-processing pipelines.
This fragmentation causes two problems. First, specialist models trained for a single subtask cannot share representations or transfer knowledge across tasks, leading to duplicate engineering effort and modal isolation. Second, generalist models (typically large multimodal LLMs) achieve breadth at the cost of spatial precision and interpretability, and their performance degrades when an external OCR engine is unavailable.
The paper argues that a single, lightweight model with a shared representation and a unified generation objective can match or exceed specialist performance on all three tasks simultaneously.
What is the novelty?
Unified Input/Output Representation. OmniParser represents the output of all three tasks as a structured sequence composed of three sub-sequences:
- Structured Points Sequence: Discrete coordinate tokens encoding the center point of each text instance, interleaved with task-specific structural tokens (e.g., `<address>` for KIE entities, `<tr>` for table rows). Text spotting is treated as the special case with no structural tokens. Coordinates are normalized to image dimensions, then quantized to integers in $[0, n_{\text{bins}} - 1]$.
- Polygon Sequence: A 16-point polygon per text instance, tokenized with the same quantization scheme, representing the polygonal contour.
- Content Sequence: Character-level tokens encoding the text transcript.
Points act as the bridging primitive: the Structured Points Decoder first generates the entire points sequence, and the Region and Content Decoders then generate polygons and transcripts in parallel, conditioned on each point.
Three-Decoder Architecture. The model uses a shared Swin-B encoder with FPN for multi-scale visual feature extraction. Three independent transformer decoders (same architecture, separate parameters) handle structure-point generation, polygon prediction, and content transcription. Each decoder has 4 layers, 8 attention heads, and hidden dimension 512. The independence of decoder weights, rather than weight sharing, is shown by ablation to be important: shared decoder weights degrade performance, suggesting the sub-tasks require different inductive biases.
Training Objective. All tasks share the same negative log-likelihood loss:
$$ L = -\sum_{j=k}^{N} w_j \log P(\bar{s}_j \mid \mathbf{v}, s_{k:j-1}) $$
where $\bar{s}$ is the target sequence, $k$ is the number of prompt tokens excluded from the loss, and $w_j$ is a per-token weight: 4.0 for structural or entity tags, 1.0 for all other tokens.
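A minimal sketch of this weighted objective, with tag tokens up-weighted by 4.0 and the first $k$ prompt tokens excluded from the loss; tensor shapes and the helper are assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

# Sketch of the weighted NLL objective above: per-token negative log-likelihood,
# weighted 4.0 for structural/entity-tag tokens and 1.0 otherwise, with the first
# k prompt tokens excluded. Shapes are illustrative.

def weighted_nll(logits: torch.Tensor, targets: torch.Tensor,
                 tag_mask: torch.Tensor, num_prompt_tokens: int) -> torch.Tensor:
    # logits: (seq, vocab); targets: (seq,); tag_mask: (seq,) bool for structural tags.
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # (seq,)
    weights = tag_mask.float() * 3.0 + 1.0          # 4.0 for tags, 1.0 otherwise
    token_nll = token_nll[num_prompt_tokens:]       # exclude prompt tokens
    weights = weights[num_prompt_tokens:]
    # The equation sums over tokens; in practice one might normalize by weights.sum().
    return (weights * token_nll).sum()

# Toy example: 6 tokens, first 2 are prompt tokens, token 2 is a structural tag.
logits = torch.randn(6, 100)
targets = torch.randint(0, 100, (6,))
tag_mask = torch.tensor([False, False, True, False, False, False])
print(weighted_nll(logits, targets, tag_mask, num_prompt_tokens=2))
```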
Spatial-Window Prompting. During pre-training, the Structured Points Decoder is given a bounding box prompt $(X_{\text{left}}, Y_{\text{top}}, X_{\text{right}}, Y_{\text{bottom}})$ and tasked with predicting only the center points whose locations fall inside that window. Windows are sampled from fixed grid layouts (e.g., $3 \times 3$, $2 \times 2$) or randomly from the image (covering at least 1/9 of the area). This mechanism allows the decoder to handle images with many text instances despite a finite sequence length, and trains robust spatial coordinate perception.
Prefix-Window Prompting. The Structured Points Decoder is also pre-trained to predict center points only for text instances whose single-character prefix falls within a designated character range. The range is defined by start and end characters drawn from an ordered dictionary (26 uppercase, 26 lowercase, 10 digits, 34 ASCII punctuation). This strategy trains character-level semantic grounding, which the authors find particularly useful for KIE where entity types are associated with characteristic vocabulary.
Ablation shows that each prompting strategy provides an independent performance gain, and combining both yields the best results on both Total-Text and ICDAR 2015.
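For concreteness, prefix-window target selection might look like the following sketch. The ordered-dictionary idea and the symbol counts come from the paper; the exact punctuation set (Python's `string.punctuation` has 32 symbols versus the 34 reported) and the data structures are assumptions.

```python
import string

# Illustrative sketch of prefix-window target selection: keep only instances whose
# first character falls in a designated range of an ordered dictionary
# (uppercase, lowercase, digits, punctuation). Exact symbol set is an assumption.

ORDERED_DICT = (string.ascii_uppercase + string.ascii_lowercase +
                string.digits + string.punctuation)  # 26 + 26 + 10 + 32 symbols
CHAR_RANK = {c: i for i, c in enumerate(ORDERED_DICT)}

def prefix_window_targets(instances, start_char: str, end_char: str):
    """Keep instances whose first character ranks between start_char and end_char."""
    lo, hi = CHAR_RANK[start_char], CHAR_RANK[end_char]
    kept = []
    for inst in instances:
        text = inst["text"]
        rank = CHAR_RANK.get(text[0], -1) if text else -1
        if lo <= rank <= hi:
            kept.append(inst)
    return kept

instances = [{"text": "Total", "center": (640, 900)},
             {"text": "date:", "center": (120, 80)},
             {"text": "$12.50", "center": (700, 905)}]
print(prefix_window_targets(instances, "A", "Z"))   # only "Total" survives
```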
What experiments were performed?
Experiments cover three tasks across seven evaluation benchmarks, plus three ablation studies.
Text Spotting. Evaluated on Total-Text (1,255 train / 300 test, word-level polygon annotations for arbitrary-shaped text), CTW1500 (1,000 train / 500 test, line-level curved text), and ICDAR 2015 (1,000 train / 500 test, quadrilateral annotations for incidental text). End-to-end recognition metrics are the primary target: F1 under None and Full lexicons for Total-Text and CTW1500, and F1 under Strong, Weak, and Generic lexicons for ICDAR 2015. Baselines include TextDragon, ABCNet v2, TESTR, DeepSolo, UNITS, and others.
Key Information Extraction. Evaluated on CORD (1,000 receipts, 30 entity labels across 4 categories; 800/100/100 train/val/test split) and SROIE (626 train / 347 test receipts, 4 entity types: company, date, address, total). Metrics: field-level F1 and tree-edit-distance-based accuracy. Baselines: Donut, Dessurt, DocParser, SeRum.
Table Recognition. Evaluated on PubTabNet (500,777 train / 9,115 val, scientific document tables; validation set used due to missing public test annotations) and FinTabNet (92,000 train / 10,656 test, financial document tables). Metric: Tree-Edit-Distance-based Similarity (TEDS) and structure-only TEDS (S-TEDS). Baselines: WYGIWYS, EDD, and Donut (fine-tuned by the authors using the official training configuration). The comparison is correctly restricted to end-to-end methods that predict both table structure and cell content jointly, excluding non-end-to-end TSR methods.
Ablations.
- Pre-training strategies (Table 5): Removing spatial-window prompting, prefix-window prompting, or both shows additive degradation on Total-Text and ICDAR 2015 end-to-end metrics.
- Encoder and decoder design (Table 6): Swin-B outperforms ResNet-50, and three independent decoders outperform a shared-weight decoder.
- Decoder length for table recognition (Table 7): Performance plateaus for the Structured Points Decoder at 1,500 tokens; extending the Content Decoder length from 200 to 300 tokens yields small TEDS improvement. Inference speed (1.3 FPS) is faster than Donut (0.8 FPS) at a shorter maximum sequence length.
Pre-training details. Two-stage pre-training on a hybrid corpus (Curved SynthText, ICDAR 2013, ICDAR 2015, MLT 2017, Total-Text, TextOCR, HierText, COCO Text, Open Image V5). Stage 1: batch 128, resolution $768 \times 768$, 500k steps, AdamW with lr $5 \times 10^{-4}$, 5k-step linear warm-up then linear decay. Stage 2: batch 16, resolution $1920 \times 1920$, 200k additional steps, lr $2.5 \times 10^{-4}$.
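The Stage-1 schedule quoted above (5k-step linear warm-up to $5 \times 10^{-4}$, then linear decay over 500k steps) is a standard recipe; the sketch below is a generic implementation of that schedule, not code from the paper.

```python
# Generic sketch of the Stage-1 learning-rate schedule: linear warm-up for 5k steps
# to the peak LR, then linear decay to 0 over the remaining steps.

PEAK_LR = 5e-4
WARMUP_STEPS = 5_000
TOTAL_STEPS = 500_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    frac = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 1.0 - frac)

for s in (0, 2_500, 5_000, 250_000, 500_000):
    print(s, f"{lr_at(s):.2e}")
```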
What are the outcomes/conclusions?
Text Spotting. On Total-Text and CTW1500, OmniParser V1 achieves 84.0% E2E (None) on Total-Text and 66.8% E2E (None) on CTW1500, the highest end-to-end scores reported on those benchmarks at the time of publication (+1.5% and +3.2% above the prior best, respectively). On ICDAR 2015, results are competitive with the then-leading methods across all three lexicon settings.
Key Information Extraction. On CORD, the model achieves 84.8% field-level F1, outperforming all compared generation-based methods. On SROIE, it matches the prior best in F1 (85.6%) and achieves the best TED-based accuracy (93.6%). Importantly, OmniParser V1 also provides explicit localization of extracted entities, unlike OCR-free generation methods such as Donut, and is pre-trained on scene text data only, not large document corpora.
Table Recognition. OmniParser V1 achieves 90.45 S-TEDS / 88.83 TEDS on PubTabNet and 91.55 S-TEDS / 89.75 TEDS on FinTabNet, surpassing EDD and the authors’ Donut reproduction on both. The modularized architecture separates structural HTML tags from cell text, avoiding the attention drift and error accumulation that limits monolithic sequence-to-sequence approaches on long sequences.
Limitations acknowledged by the authors:
- The model requires precise word-level point location annotations during training, which may not be available in all real-world scenarios.
- It does not handle non-text elements (figures, charts, diagrams), limiting applicability to complex document parsing tasks.
- Future work is intended to extend the model to layout analysis and chart parsing.
Unacknowledged limitations:
- No error bars, seeds, or significance tests are reported; all results are single-run.
- The table recognition comparison set is small (only WYGIWYS, EDD, and Donut), reflecting the limited availability of end-to-end baselines at the time but making it difficult to assess the model’s position in a broader context.
- Inference speed (1.3 FPS at 1,024 resolution for table recognition) may be insufficient for latency-sensitive document processing pipelines.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint | Paper | CC-BY-4.0 | arXiv 2403.19128 |
| Published paper | Paper | Unknown | CVPR 2024 open access |
| Code | Code | Apache-2.0 | AlibabaResearch/AdvancedLiterateMachinery |
| Pretrained weights | Model | Unknown | Not confirmed released; check repository |
Models
- Architecture: Swin-B encoder (pretrained on ImageNet-22k) with FPN for multi-scale features at strides 4, 8, 16, 32. Three independent transformer decoders (Structured Points, Region, Content), each with 4 layers, 8 attention heads, hidden dimension 512, MLP amplification factor 4. Separate randomly initialized positional encodings per decoder to accommodate varying sequence lengths.
- Code: Available at AlibabaResearch/AdvancedLiterateMachinery under Apache-2.0. The OmniParser V1 implementation is in the OCR/ subdirectory.
- Weights: Not explicitly mentioned as released in the paper. The code repository should be checked for the current status of pretrained weights.
Algorithms
- Pre-training Stage 1: AdamW, batch 128, resolution $768 \times 768$, 500k steps, lr $5 \times 10^{-4}$, 5k-step linear warm-up then linear decay to 0.
- Pre-training Stage 2: AdamW, batch 16, resolution $1920 \times 1920$, 200k steps, lr $2.5 \times 10^{-4}$, same schedule.
- Fine-tuning: lr $1 \times 10^{-4}$, cosine decay. Text spotting: 20k steps. KIE: 200k steps. Table recognition: 400k steps for Structured Points Decoder, 200k for Content Decoder.
- Data augmentation: Instance-aware random cropping, random rotation in $[-90^\circ, 90^\circ]$, random resizing, color jittering.
- Loss: Negative log-likelihood with $w = 4.0$ for structural/entity tags, $w = 1.0$ for others. Prompt tokens excluded from loss computation.
- Coordinate quantization: Coordinates normalized to image dimensions, quantized to $[0, n_{\text{bins}} - 1]$; value of $n_{\text{bins}}$ is not stated in the main paper.
- Text ordering during pre-training: Center points arranged in raster scan order.
Data
- Pre-training corpus (9 datasets): Curved SynthText, ICDAR 2013, ICDAR 2015, MLT 2017, Total-Text, TextOCR, HierText, COCO Text, Open Image V5. All are publicly available.
- Task fine-tuning: Standard train/val/test splits for each benchmark (Total-Text, CTW1500, ICDAR 2015, CORD, SROIE, PubTabNet, FinTabNet).
- Annotation requirements: The model requires word-level center-point annotations during training. Polygon annotations (16-point format) are also required. Standard benchmarks provide these.
Evaluation
- Text spotting: End-to-end F1 under None/Full (Total-Text, CTW1500) and Strong/Weak/Generic (ICDAR 2015) lexicons. Standard protocol for each dataset.
- KIE: Field-level F1 and TED-based accuracy. Standard for CORD and SROIE; SROIE point locations are generated by the authors where not provided in the original annotation.
- Table recognition: TEDS and S-TEDS. Evaluated on PubTabNet validation (no public test set) and FinTabNet test set. Comparison restricted to end-to-end methods.
- Baselines: Generally fair within each task track. Donut for table recognition is reproduced by the authors using official configuration.
- Statistical rigor: No error bars, seeds, or significance tests reported.
Hardware
- Training hardware not specified in the paper.
- The two-stage pre-training (700k total steps at high resolution, batch sizes 16-128) implies multi-GPU training, likely on A100-class hardware. Specific GPU counts and training time are not reported.
- Table recognition inference runs at 1.3 FPS (compared to Donut at 0.8 FPS) at input size $1024 \times 1024$ with a maximum decoder length of 1,500 + 200 tokens.
BibTeX
@inproceedings{wan2024omniparser,
title={OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition},
author={Wan, Jianqiang and Song, Sibo and Yu, Wenwen and Liu, Yuliang and Cheng, Wenqing and Huang, Fei and Bai, Xiang and Yao, Cong and Yang, Zhibo},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2024}
}