# Document Understanding & Visual Question Answering
Tracking models and benchmarks for document understanding, visual information extraction, and document VQA.
Disclaimer: This page tracks methods that consume document layout, text, and visual information to answer questions or extract structured data. For models that produce layout annotations (bounding boxes, region labels), see the Layout Page. For text recognition and OCR pipelines, see the OCR Page.
## Overview
Document Understanding encompasses tasks where a model must jointly reason over a document’s text, visual appearance, and spatial layout to produce answers, extracted fields, or structured outputs. Key task families include:
- Document VQA: Open-ended question answering over document images.
- Visual Information Extraction (VIE): Extracting key-value pairs from forms, receipts, invoices, and other semi-structured documents.
- Visual Machine Reading Comprehension: Answering questions that require reading and reasoning over rendered text in context.
These tasks differ from layout analysis in that the output is semantic (answers, extracted values) rather than geometric (bounding boxes, region classes). Many methods build on top of layout-aware encoders (LayoutLM, LayoutLMv3) but apply them to downstream comprehension tasks rather than detection.
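As a concrete illustration of the layout-aware input these encoders consume: LayoutLM-style models pair each OCR token with its bounding box, normalized to a 0–1000 integer grid so coordinates are independent of page resolution. A minimal sketch (the `normalize_box` helper and token dict shape are illustrative, not any library's actual API):

```python
# Sketch of the token + bounding-box representation used by layout-aware
# encoders such as LayoutLM. Pixel coordinates are scaled to a 0-1000
# integer grid, independent of the page's pixel dimensions.

def normalize_box(box, page_width, page_height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# Each OCR token carries both its text and its layout position:
tokens = [
    {"text": "Invoice", "box": normalize_box((34, 20, 120, 38), 850, 1100)},
    {"text": "#1234",   "box": normalize_box((128, 20, 180, 38), 850, 1100)},
]
```

Downstream comprehension models embed both the text and the box coordinates, which is what lets them distinguish, say, a header field from a line item with identical text.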
## Models
### LLM-Based
Models that pair a document-specialized encoder with a large language model backbone for instruction-following or generative document understanding.
| Model Family | Encoder | LLM Backbone | Code | License | Notes |
|---|---|---|---|---|---|
| LayoutLLM (2024) | LayoutLMv3-large | Vicuna-7B | None | N/A | Layout instruction tuning (5.7M pre-training instructions across document/region/segment levels) + LayoutCoT for layout-aware chain-of-thought reasoning. Zero-shot eval on DocVQA, VisualMRC, FUNSD, CORD, SROIE. |
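To make the encoder-plus-LLM pattern concrete: one common way to expose layout to an LLM backbone is to serialize OCR tokens together with their coordinates into the instruction text. The sketch below is illustrative only and is not LayoutLLM's actual prompt format; `build_layout_prompt` and the `<x0,y0,x1,y1>` tag syntax are assumptions for this example.

```python
# Illustrative sketch: serialize (token, box) pairs into a layout-aware
# instruction for a generative LLM. Boxes are assumed to be on the
# 0-1000 grid used by LayoutLM-family encoders.

def build_layout_prompt(tokens, question):
    """tokens: list of (text, (x0, y0, x1, y1)) tuples."""
    lines = [
        f"{text} <{x0},{y0},{x1},{y1}>"
        for text, (x0, y0, x1, y1) in tokens
    ]
    return (
        "Document (token <x0,y0,x1,y1>):\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nAnswer:"
    )

prompt = build_layout_prompt(
    [("Total:", (50, 900, 120, 920)), ("$42.00", (130, 900, 200, 920))],
    "What is the total amount?",
)
```

Approaches like LayoutLLM instead inject layout through the encoder's embeddings and train with layout-specific instructions, but textual serialization is a useful mental model for how spatial information can reach a text-only backbone.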
## Datasets & Benchmarks
### Document VQA
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| DocVQA (2021) | Document visual QA | ANLS | 50K questions, 12K images | Mathew et al., WACV 2021. Extractive QA over diverse industry documents. |
| VisualMRC (2021) | Visual machine reading comprehension | ROUGE-L | 30K+ questions | Tanaka et al., AAAI 2021. Abstractive QA over webpage screenshots. |
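Since ANLS is the headline metric for DocVQA, a minimal sketch may help: the score is 1 minus the normalized Levenshtein distance between prediction and gold answer, zeroed out when that distance reaches a threshold (0.5 in the standard setup), and averaged over questions after taking the best match among the valid gold answers. The implementation below assumes the usual lowercase/strip normalization; exact preprocessing can vary between evaluation scripts.

```python
# Sketch of Average Normalized Levenshtein Similarity (ANLS), the
# DocVQA metric: rewards near-miss answers (OCR slips) but zeroes out
# answers that are too far from any gold string.

def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """predictions: list of strings; gold_answers: list of lists of
    valid answer strings, one list per question."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

For example, a prediction with one missing character in an 8-character answer scores 0.875, while an unrelated answer scores 0.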
### Visual Information Extraction
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| FUNSD (2019) | Form understanding | F1 | 199 forms | Jaume et al., ICDAR 2019 Workshop. Entity labeling + linking on noisy scanned forms. |
| CORD (2019) | Receipt key info extraction | F1 | 1,000 receipts | Park et al., NeurIPS 2019 Document Intelligence Workshop. Post-OCR parsing of Indonesian receipts. |
| SROIE (2019) | Receipt key info extraction | F1 | 973 receipts | ICDAR 2019 Competition. Scanned receipt text localization + key info extraction. |