Tables: Understanding, Reasoning, and LLM-era Evaluation
Tracking benchmarks and datasets for table understanding, visual table QA, and LLM/VLM-era table evaluation. Scope: downstream reasoning and multimodal understanding over tables, not structural recognition.
Scope note: This page covers table understanding: tasks like visual table QA, multi-hop reasoning over tables, and end-to-end document parsing benchmarks that treat table content as the target rather than table structure as the output. For detecting table regions (TD) and parsing the row/column/cell grid (TSR), see the Tables: Detection and Structure Recognition page.
Status: This page is a working investigation cluster. All entries below are candidates that have been identified but not yet fully vetted for inclusion. Treat this as a reading list, not a curated index.
Why This Exists
The 2024-2026 period produced a surge of benchmarks and datasets targeting LLM and VLM evaluation on tables. These are distinct from the structural recognition (TD/TSR) pipeline: they ask “can the model understand what a table says?” rather than “can the model recover the table’s row/column structure?”. They deserve their own tracking page.
Many of these also have implications for synthetic data generation (structured table content can be rendered into training images) and for evaluating end-to-end document parsing systems that produce HTML, Markdown, or JSON output containing tables.
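As a concrete illustration of that rendering step, here is a minimal sketch that turns a list-of-lists of cell strings into a styled table image. It assumes the `imgkit` package (a thin wrapper around the wkhtmltoimage binary) purely for convenience; any HTML-to-image path, such as a headless browser, works the same way, and the randomized fonts and borders are illustrative augmentation knobs rather than any specific pipeline's settings.

```python
# Minimal sketch: render structured table content into a training image.
# Assumes `imgkit` plus a wkhtmltoimage binary are installed; any
# HTML-to-image renderer (e.g. a headless browser) would work equally well.
import random
import imgkit

def rows_to_html(rows: list[list[str]]) -> str:
    """Wrap a list-of-lists of (already HTML-safe) cell strings in a styled table."""
    # Randomize a few style knobs so repeated renders yield varied images.
    font = random.choice(["Arial", "Georgia", "Courier New"])
    border = random.choice(["1px solid #333", "1px solid #999", "none"])
    body = "\n".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"""<html><head><style>
        table {{ border-collapse: collapse; font-family: '{font}'; }}
        td {{ border: {border}; padding: 4px 10px; }}
    </style></head><body><table>{body}</table></body></html>"""

rows = [["Year", "Revenue", "Net income"],
        ["2023", "1,204", "187"],
        ["2024", "1,411", "233"]]
imgkit.from_string(rows_to_html(rows), "table_000.png")
```

Pairing each rendered image with its source HTML yields aligned (image, structure) examples at negligible cost, which is the general idea behind the synthetic pipelines referenced in the datasets table below.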
To Investigate: Benchmarks
These have been identified but not yet reviewed in depth.
| Name | Year | Domain | Size | Output/Metric | arXiv | Notes |
|---|---|---|---|---|---|---|
| OmniDocBench | 2024 | Diverse PDFs (9 types: academic, financial, newspaper, handwritten, etc.) | 1,355 pages, 3 language settings | TEDS / HTML+LaTeX for tables; NED for text | 2412.07626 | CVPR 2025. Evaluates end-to-end document parsers; table sub-task uses HTML+LaTeX annotations. Relevant to TSR evaluation pipeline (see the TEDS sketch after this table). |
| READoc | 2024 | Diverse (arXiv, GitHub, Zenodo) | 3,576 real-world PDFs | PDF-to-Markdown; S³uite metric | 2409.05137 | ACL 2025 Findings. End-to-end document structured extraction; tables appear as Markdown. 27 languages. |
| MMTU | 2025 | Real-world tables (25 tasks) | 28K+ questions | Accuracy across 25 tasks | 2506.05587 | NeurIPS 2025 Datasets & Benchmarks. Even GPT-5 scores ~69%. Covers table understanding + reasoning + coding. |
| MMLongBench-Doc | 2024 | Lengthy PDFs (multi-modal) | 135 PDFs, 1,082 QA pairs | F1 | — | NeurIPS 2024 D&B. Table-dependent questions within a long-context document understanding benchmark; GPT-4o scores 44.9% F1. |
| TableVQA-Bench | 2024 | Multiple table domains | 1,500 QA pairs | VQA accuracy | 2404.19205 | Naver AI Lab. Evaluates MLLMs on image-format tables. Derived from existing TSR and table QA sources. |
| MMTab | 2024 | Wikipedia, financial, web (Excel, Markdown, HTML) | 105K table images; 232K instruction samples; 45K eval | Task accuracy (15 tasks) | 2406.08100 | ACL 2024. Instruction-tuning resource + eval for table-aware VLMs. CC-BY-4.0. Introduced Table-LLaVA. |
| TableBench | 2024 | Real-world industrial tables | 886 test cases, 18 QA categories | Accuracy (Fact Checking, Numerical Reasoning, Data Analysis, Visualization) | 2408.09174 | AAAI 2025. Text-format table QA; targets LLM gap vs. human. GPT-4 scores modestly. |
| MTabVQA | 2025 | Multi-table reasoning | 3,745 QA pairs | Multi-hop accuracy | 2506.11684 | First benchmark for multi-hop reasoning across multiple table images simultaneously. |
| MMTBench | 2025 | Real-world multimodal tables (with charts/maps embedded) | 500 tables, 4,021 QA pairs | Accuracy (Explicit, Implicit, Visual-Based, Answer-Mention) | 2505.21771 | 8 table types. Tables containing embedded visual elements. |
| TABLET-VTU | 2025 | Diverse (14 seed datasets: Wikipedia, financial, scientific) | 4M examples, 2M unique table images, 21 tasks | Task accuracy | 2509.21205 | Aggregates 14 existing datasets for VLM instruction tuning. Not to be confused with TABLET (the TSR method at ICDAR 2025). |
| MirageTVQA | 2025 | Multilingual (24 languages) | ~60K QA pairs | VQA accuracy | 2511.17238 | Exposes English-first bias in frontier VLMs on table QA and a >35% accuracy drop under visual noise. |
| m3TQA | 2025 | Multilingual (97 languages) | 2,916 QA test + 39K training pairs | Multi-task accuracy | 2508.16265 | 50 source tables (annual reports, statistical reports) translated to 97 languages. |
| TableEval | 2025 | Real-world Excel (Chinese + English) | 617 spreadsheets, 2,325 QA pairs | Accuracy (6 tasks, 16 sub-tasks) | 2506.03949 | Hierarchical/nested/merged-cell tables from government, finance, academia. |
| WikiMixQA | 2025 | Wikipedia (7 domains) | 1,000 MCQ from 4,000 pages | Multi-choice accuracy | 2506.15594 | ACL 2025. Cross-modal reasoning over table+chart pairs. |
| ExtractBench | 2026 | High-value domains (financial reporting) | 35 PDFs, 12,867 fields | JSON schema-guided; field-level accuracy | 2602.12247 | PDF-to-JSON structured extraction. Frontier models fail on 369-field financial schema. |
| OCRBench v2 | 2025 | Diverse OCR (31 scenarios) | 10,000 QA pairs | Accuracy per task | 2501.00321 | Bilingual ZH/EN. Table/formula parsing is one of 31 scenarios. |
| CC-OCR | 2024 | Diverse (document parsing track) | 7,058 images, 39 subsets | Accuracy per track | 2412.02210 | ICCV 2025. 4 tracks; document parsing track covers tables + formulas. |
| ComTQA / TabPedia | 2024 | Comprehensive visual table understanding | ~9K QA pairs | VQA accuracy | 2406.01326 | NeurIPS 2024. Introduced alongside TabPedia (unified VLM for TD+TSR+QA). |
| ChemTable | 2025 | Chemistry literature | 1,300+ tables, 9,000+ QA pairs | TSR accuracy + QA; evaluates MLLMs | 2506.11375 | Expert-annotated chemical tables from peer-reviewed journals. Tests domain-specific MLLM performance. |
| NGTRBench | 2024 | Multi-type table images (varied quality) | Not stated | Hierarchical VLLM evaluation | 2412.20662 | IJCAI 2025. Evaluates VLLMs on realistic low-quality table images. Introduced with NGTR (RAG-augmented toolchain). |
| TReB | 2025 | Diverse (text-format tables) | 26 sub-tasks | Accuracy (TCoT, PoT, ICoT modes) | 2506.18421 | Shallow understanding + deep reasoning; 20+ LLMs evaluated. |
| KITAB-Bench | 2025 | Arabic OCR + document understanding | Not stated | Multi-task per sub-task | 2502.14949 | Arabic OCR benchmark; table extraction is one sub-task among text recognition, chart analysis, VQA. |
| CHiTab | 2025 | Scientific tables (hierarchical headers) | PubTables-1M filtered subset | QA accuracy (integer answers) | 2511.08298 | ICDAR 2025 GREC workshop. Tests VLLM reasoning over hierarchical header structure. Derived from PubTables-1M. |
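Several of the benchmarks above (OmniDocBench, ComTQA/TabPedia, ChemTable) score predicted tables with TEDS: tree edit distance between the predicted and reference HTML trees, normalized into a similarity in [0, 1]. Below is a minimal sketch built on the `lxml` and `zss` (Zhang-Shasha) packages rather than the reference PubTabNet implementation; the real metric uses a normalized-Levenshtein substitution cost on `<td>` text, which is simplified here to an all-or-nothing label match.

```python
# Minimal TEDS sketch: tree edit distance over HTML table trees, normalized
# by the larger tree's size. The reference implementation additionally blends
# a string-similarity cost for <td> text; here cell text is folded into the
# node label, so any text change counts as a single edit.
from lxml import html
from zss import Node, simple_distance

def to_tree(element) -> Node:
    """Recursively convert an lxml element into a zss Node."""
    label = element.tag
    if element.tag == "td":  # fold cell text into the label
        label += "|" + (element.text_content() or "").strip()
    node = Node(label)
    for child in element:
        node.addkid(to_tree(child))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(kid) for kid in node.children)

def teds(pred_html: str, true_html: str) -> float:
    pred = to_tree(html.fromstring(pred_html))
    true = to_tree(html.fromstring(true_html))
    dist = simple_distance(pred, true)
    return 1.0 - dist / max(tree_size(pred), tree_size(true))

# One relabeled cell out of a 4-node tree -> TEDS = 0.75
print(teds("<table><tr><td>a</td><td>b</td></tr></table>",
           "<table><tr><td>a</td><td>c</td></tr></table>"))
```

The structure-only variant (TEDS-S) reported by many TSR papers is the same computation with cell text dropped from the labels.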
To Investigate: Datasets
Structured text/content-level table datasets that may be useful as content sources for synthetic image generation pipelines or for pre-training table-aware models. A short sketch after the table shows the glue step from cell records to renderable HTML.
| Name | Year | Domain | Size | Format | License | URL | Notes |
|---|---|---|---|---|---|---|---|
| ENTRANT | 2024 | Financial (SEC EDGAR, 2013-2021) | ~6.7M tables from ~330K filings | JSON (bi-tree: cell attributes, positional and hierarchical info) | Open (unrestricted) | Sci. Data / Zenodo | ~20 tables/filing, ~25 rows, ~5 columns avg. Text-level; no images. Potential upstream source for synthetic financial table generation (cf. SynFinTabs pipeline). |
| HiFi-KPI | 2025 | SEC filings (10-K, 10-Q, 2017-2024) | ~1.8M paragraphs, ~5M entities, 218K-label hierarchy | Text spans + iXBRL taxonomy labels | Unknown | 2502.15411 / HuggingFace | KPI extraction from SEC earnings filings. Text-based, not image-based. |
| WikiDT (VQA side) | 2024 | Wikipedia screenshots | 70,652 QA + 53,698 SQL annotations | Visual table QA + SQL | Unknown | Amazon Science | The TD+TSR annotations are tracked in the TSR page. This entry covers the QA/SQL side. |
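To make the "content source" idea concrete, here is the glue step sketched under a hypothetical schema: flattening a cell-level record into an HTML string that the renderer from the Why This Exists section can consume. The `row`/`col`/`text` field names are invented for illustration; ENTRANT's actual bi-tree JSON carries richer cell attributes and hierarchy, and its exact keys should be checked against the Scientific Data paper.

```python
# Hypothetical sketch: turn a text-level cell-list record into an HTML table.
# The field names (`row`, `col`, `text`) are invented for illustration and do
# not reflect ENTRANT's actual bi-tree JSON schema.
from collections import defaultdict

def cells_to_html(cells: list[dict]) -> str:
    """Assemble {row, col, text} cell dicts into an HTML table string."""
    grid: dict[int, dict[int, str]] = defaultdict(dict)
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]
    rows = []
    for r in sorted(grid):
        tds = "".join(f"<td>{grid[r].get(c, '')}</td>"
                      for c in range(max(grid[r]) + 1))
        rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(rows) + "</table>"

record = [{"row": 0, "col": 0, "text": "Revenue"},
          {"row": 0, "col": 1, "text": "1,204"},
          {"row": 1, "col": 0, "text": "Net income"},
          {"row": 1, "col": 1, "text": "187"}]
print(cells_to_html(record))
```

Spanning cells would need `rowspan`/`colspan` attributes derived from the hierarchical info, which this sketch ignores.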
Related Pages
- Tables: Detection and Structure Recognition: TD and TSR models, datasets, benchmarks, and metrics.
- Document Understanding: End-to-end systems combining layout, structure, and content.