# Document Understanding & Visual Question Answering
Tracking models and benchmarks for document understanding, visual information extraction, and document VQA.
Disclaimer: This page tracks methods that consume document layout, text, and visual information to answer questions or extract structured data. For models that produce layout annotations (bounding boxes, region labels), see the Layout Page. For text recognition and OCR pipelines, see the OCR Page.
## Overview
Document Understanding encompasses tasks where a model must jointly reason over a document’s text, visual appearance, and spatial layout to produce answers, extracted fields, or structured outputs. Key task families include:
- Document VQA: Open-ended question answering over document images.
- Visual Information Extraction (VIE): Extracting key-value pairs from forms, receipts, invoices, and other semi-structured documents.
- Visual Machine Reading Comprehension: Answering questions that require reading and reasoning over rendered text in context.
These tasks differ from layout analysis in that the output is semantic (answers, extracted values) rather than geometric (bounding boxes, region classes). Many methods build on top of layout-aware encoders (LayoutLM, LayoutLMv3) but apply them to downstream comprehension tasks rather than detection.
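As a concrete illustration of the layout-aware input these encoders consume: LayoutLM-style models pair each OCR token with its bounding box, normalized to a 0–1000 integer grid so coordinates are independent of page resolution. A minimal sketch (the `normalize_box` helper and token dict shape are illustrative, not any library's actual API):

```python
# Sketch of the token + bounding-box representation used by layout-aware
# encoders such as LayoutLM. Pixel coordinates are scaled to a 0-1000
# integer grid, independent of the page's pixel dimensions.

def normalize_box(box, page_width, page_height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000 grid."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# Each OCR token carries both its text and its layout position:
tokens = [
    {"text": "Invoice", "box": normalize_box((34, 20, 120, 38), 850, 1100)},
    {"text": "#1234",   "box": normalize_box((128, 20, 180, 38), 850, 1100)},
]
```

Downstream comprehension models embed both the text and the box coordinates, which is what lets them distinguish, say, a header field from a line item with identical text.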
## Models
### LLM-Based
Models that pair a document-specialized encoder with a large language model backbone for instruction-following or generative document understanding.
| Model Family | Encoder | LLM Backbone | Code | License | Notes |
|---|---|---|---|---|---|
| LayoutLLM (2024) | LayoutLMv3-large | Vicuna-7B | None | N/A | Layout instruction tuning (5.7M pre-training instructions across document/region/segment levels) + LayoutCoT for layout-aware chain-of-thought reasoning. Zero-shot eval on DocVQA, VisualMRC, FUNSD, CORD, SROIE. |
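To make the encoder-plus-LLM pattern concrete: one common way to expose layout to an LLM backbone is to serialize OCR tokens together with their coordinates into the instruction text. The sketch below is illustrative only and is not LayoutLLM's actual prompt format; `build_layout_prompt` and the `<x0,y0,x1,y1>` tag syntax are assumptions for this example.

```python
# Illustrative sketch: serialize (token, box) pairs into a layout-aware
# instruction for a generative LLM. Boxes are assumed to be on the
# 0-1000 grid used by LayoutLM-family encoders.

def build_layout_prompt(tokens, question):
    """tokens: list of (text, (x0, y0, x1, y1)) tuples."""
    lines = [
        f"{text} <{x0},{y0},{x1},{y1}>"
        for text, (x0, y0, x1, y1) in tokens
    ]
    return (
        "Document (token <x0,y0,x1,y1>):\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}\nAnswer:"
    )

prompt = build_layout_prompt(
    [("Total:", (50, 900, 120, 920)), ("$42.00", (130, 900, 200, 920))],
    "What is the total amount?",
)
```

Approaches like LayoutLLM instead inject layout through the encoder's embeddings and train with layout-specific instructions, but textual serialization is a useful mental model for how spatial information can reach a text-only backbone.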
## Datasets & Benchmarks
### Document VQA
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| DocVQA (2021) | Document visual QA | ANLS | 50K questions, 12K images | Mathew et al., WACV 2021. Extractive QA over diverse industry documents. |
| VisualMRC (2021) | Visual machine reading comprehension | ROUGE-L | 30K+ questions | Tanaka et al., AAAI 2021. Abstractive QA over webpage screenshots. |
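Since ANLS is the headline metric for DocVQA, a minimal sketch may help: the score is 1 minus the normalized Levenshtein distance between prediction and gold answer, zeroed out when that distance reaches a threshold (0.5 in the standard setup), and averaged over questions after taking the best match among the valid gold answers. The implementation below assumes the usual lowercase/strip normalization; exact preprocessing can vary between evaluation scripts.

```python
# Sketch of Average Normalized Levenshtein Similarity (ANLS), the
# DocVQA metric: rewards near-miss answers (OCR slips) but zeroes out
# answers that are too far from any gold string.

def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """predictions: list of strings; gold_answers: list of lists of
    valid answer strings, one list per question."""
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

For example, a prediction with one missing character in an 8-character answer scores 0.875, while an unrelated answer scores 0.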
### Visual Information Extraction
| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| FUNSD (2019) | Form understanding | F1 | 199 forms | Jaume et al., ICDAR 2019 Workshop. Entity labeling + linking on noisy scanned forms. |
| CORD (2019) | Receipt key info extraction | F1 | 1,000 receipts | Park et al., NeurIPS 2019 Document Intelligence Workshop. Post-OCR parsing of Indonesian receipts. |
| SROIE (2019) | Receipt key info extraction | F1 | 973 receipts | ICDAR 2019 Competition. Scanned receipt text localization + key info extraction. |