Document Understanding & Visual Question Answering

Tracking models and benchmarks for document understanding, visual information extraction, and document VQA.

Disclaimer: This page tracks methods that consume document layout, text, and visual information to answer questions or extract structured data. For models that produce layout annotations (bounding boxes, region labels), see the Layout Page. For text recognition and OCR pipelines, see the OCR Page.

Overview

Document Understanding encompasses tasks where a model must jointly reason over a document’s text, visual appearance, and spatial layout to produce answers, extracted fields, or structured outputs. Key task families include:

  • Document VQA: Open-ended question answering over document images.
  • Visual Information Extraction (VIE): Extracting key-value pairs from forms, receipts, invoices, and other semi-structured documents.
  • Visual Machine Reading Comprehension: Answering questions that require reading and reasoning over rendered text in context.

These tasks differ from layout analysis in that the output is semantic (answers, extracted values) rather than geometric (bounding boxes, region classes). Many methods build on top of layout-aware encoders (LayoutLM, LayoutLMv3) but apply them to downstream comprehension tasks rather than detection.
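The semantic-vs-geometric distinction can be made concrete with two illustrative record types (the field names below are hypothetical, chosen only to show the contrast in output targets):

```python
from dataclasses import dataclass

@dataclass
class LayoutRegion:
    """Layout analysis output: geometric (where something is)."""
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    label: str                       # region class, e.g. "table", "paragraph"

@dataclass
class ExtractedField:
    """VIE output: semantic (what something means)."""
    key: str    # e.g. "total_amount"
    value: str  # e.g. "12.50"

# The same receipt region yields two different prediction targets:
region = LayoutRegion(bbox=(40, 610, 380, 640), label="text")
field = ExtractedField(key="total_amount", value="12.50")
```

A layout model stops at `region`; a VIE or document VQA model must produce `field` (or a free-text answer), typically consuming regions like `region` as input features.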


Models

LLM-Based

Models that pair a document-specialized encoder with a large language model backbone for instruction-following or generative document understanding.

| Model Family | Encoder | LLM Backbone | Code | License | Notes |
|---|---|---|---|---|---|
| LayoutLLM (2024) | LayoutLMv3-large | Vicuna-7B | None | N/A | Layout instruction tuning (5.7M pre-training instructions across document/region/segment levels) + LayoutCoT for layout-aware chain-of-thought reasoning. Zero-shot eval on DocVQA, VisualMRC, FUNSD, CORD, SROIE. |
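One simple way to make an LLM layout-aware is to serialize OCR words together with their bounding boxes into the prompt. The sketch below is a generic illustration of that idea, not LayoutLLM's actual input format (LayoutLLM feeds layout through learned encoder embeddings rather than plain text; the `<x0,y0,x1,y1>` notation here is an assumption for demonstration):

```python
def serialize_with_layout(words, boxes, question):
    """Build a layout-annotated prompt from OCR output.

    words: list of recognized token strings
    boxes: list of (x0, y0, x1, y1) pixel boxes, one per word
    """
    body = " ".join(
        f"{w}<{x0},{y0},{x1},{y1}>"
        for w, (x0, y0, x1, y1) in zip(words, boxes)
    )
    return f"Document: {body}\nQuestion: {question}\nAnswer:"

prompt = serialize_with_layout(
    ["Total", "12.50"],
    [(40, 610, 120, 640), (300, 610, 380, 640)],
    "What is the total?",
)
```

Text-serialized layout is cheap but inflates token counts quickly; encoder-based approaches like LayoutLLM avoid that cost at the price of a dedicated pre-training stage.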

Datasets & Benchmarks

Document VQA

| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| DocVQA (2021) | Document visual QA | ANLS | 50K questions, 12K images | Mathew et al., WACV 2021. Extractive QA over diverse industry documents. |
| VisualMRC (2021) | Visual machine reading comprehension | ROUGE-L | 30K+ questions | Tanaka et al., AAAI 2021. QA over webpage screenshots. |
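ANLS (Average Normalized Levenshtein Similarity), the DocVQA metric, scores each prediction by its best normalized edit similarity against the accepted answers, zeroing out scores below a threshold (0.5 in the official evaluation). A minimal sketch, assuming case-insensitive comparison over whitespace-trimmed strings:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold, tau=0.5):
    """predictions[i] is scored against gold[i], a list of accepted answers."""
    scores = []
    for pred, answers in zip(predictions, gold):
        best = 0.0
        for ans in answers:
            p, g = pred.strip().lower(), ans.strip().lower()
            m = max(len(p), len(g))
            sim = 1.0 if m == 0 else 1.0 - levenshtein(p, g) / m
            best = max(best, sim)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)
```

The threshold makes ANLS forgiving of OCR-level noise (a one-character typo in a long answer still scores well) while treating badly wrong answers as complete misses.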

Visual Information Extraction

| Benchmark | Task | Metric | Size | Notes |
|---|---|---|---|---|
| FUNSD (2019) | Form understanding | F1 | 199 forms | Jaume et al., ICDAR 2019 Workshop. Entity labeling + linking on noisy scanned forms. |
| CORD (2019) | Receipt key info extraction | F1 | 1,000 receipts | Park et al. Post-OCR parsing of Indonesian receipts. |
| SROIE (2019) | Receipt key info extraction | F1 | 973 receipts | ICDAR 2019 Competition. Scanned receipt text localization + key info extraction. |
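The F1 reported by these VIE benchmarks is entity-level: a prediction counts as correct only if both the field type and the extracted value match the ground truth. Exact matching conventions vary per benchmark (e.g. SROIE's official scorer normalizes whitespace), so the sketch below assumes the simplest exact-match variant over `(field, value)` pairs:

```python
def entity_f1(pred: set, gold: set) -> float:
    """Exact-match entity-level F1.

    pred, gold: sets of (field, value) tuples, e.g. ("total", "12.00").
    """
    tp = len(pred & gold)                       # exact (field, value) matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = {("total", "12.00"), ("date", "2019-01-01")}
gold = {("total", "12.00"), ("company", "ACME")}
score = entity_f1(pred, gold)  # 1 true positive out of 2 predicted, 2 gold
```

Because matching is all-or-nothing at the entity level, a single OCR error inside a value costs the whole entity, which is why post-OCR correction matters so much on these benchmarks.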