Tables: Understanding, Reasoning, and LLM-era Evaluation

Tracking benchmarks and datasets for table understanding, visual table QA, and LLM/VLM-era table evaluation. Scope: downstream reasoning and multimodal understanding over tables, not structural recognition.

Scope note: This page covers table understanding: tasks like visual table QA, multi-hop reasoning over tables, and end-to-end document parsing benchmarks that treat table content as the target rather than table structure as the output. For detecting table regions (TD) and parsing the row/column/cell grid (TSR), see the Tables: Detection and Structure Recognition page.

Status: This page is a working investigation cluster. All entries below are candidates that have been identified but not yet fully vetted for inclusion. Treat this as a reading list, not a curated index.


Why This Exists

The 2024-2026 period produced a surge of benchmarks and datasets targeting LLM and VLM evaluation on tables. These are distinct from the structural recognition (TD/TSR) pipeline: they ask “can the model understand what a table says?” rather than “can the model recover the table’s row/column structure?”. They deserve their own tracking page.

Many of these also have implications for synthetic data generation (structured table content can be rendered into training images) and for evaluating end-to-end document parsing systems that produce HTML, Markdown, or JSON output containing tables.
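
As a concrete illustration of that first point, below is a minimal sketch of turning content-level table data into the HTML that table-aware parsing metrics compare against. The headers-plus-rows JSON layout and the `table_to_html` helper are assumptions for illustration, not any particular dataset's schema or tooling.

```python
import html
import json

def table_to_html(headers, rows):
    """Serialize a headers-plus-rows table into minimal HTML.

    This is the kind of ground-truth string that TEDS-style table metrics
    compare a parser's output against. Cell text is escaped so punctuation
    in real-world content cannot break the markup.
    """
    parts = ["<table>", "<tr>"]
    parts += [f"<th>{html.escape(h)}</th>" for h in headers]
    parts.append("</tr>")
    for row in rows:
        parts.append("<tr>")
        parts += [f"<td>{html.escape(str(cell))}</td>" for cell in row]
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)

# Hypothetical content-level record, e.g. one table from a JSON dump.
record = json.loads('{"headers": ["Year", "Revenue"], "rows": [["2023", "1,204"], ["2024", "1,391"]]}')
print(table_to_html(record["headers"], record["rows"]))
```

The same serialized table can then be rasterized (e.g., via a headless-browser screenshot) to pair the content with a training image; a drawing-based variant of that step is sketched after the datasets table below.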


To Investigate: Benchmarks

These have been identified but not yet reviewed in depth. A generic sketch of the exact-match scoring most of them share follows the table.

| Name | Year | Domain | Size | Output/Metric | arXiv / Venue | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| OmniDocBench | 2024 | Diverse PDFs (9 types: academic, financial, newspaper, handwritten, etc.) | 1,355 pages, 3 language settings | TEDS / HTML+LaTeX for tables; NED for text | 2412.07626 | CVPR 2025. Evaluates end-to-end document parsers; the table sub-task uses HTML+LaTeX annotations. Relevant to the TSR evaluation pipeline. |
| READoc | 2024 | Diverse (arXiv, GitHub, Zenodo) | 3,576 real-world PDFs | PDF-to-Markdown; S³uite metric | 2409.05137 | ACL 2025 Findings. End-to-end document structured extraction; tables appear as Markdown. 27 languages. |
| MMTU | 2025 | Real-world tables (25 tasks) | 28K+ questions | Accuracy across 25 tasks | 2506.05587 | NeurIPS 2025 Datasets & Benchmarks. Even GPT-5 scores ~69%. Covers table understanding + reasoning + coding. |
| MMLongBench-Doc | 2024 | Lengthy PDFs (multi-modal) | 135 PDFs, 1,082 QA pairs | F1; GPT-4o scores 44.9% | NeurIPS 2024 D&B | Table-dependent questions within a long-context document understanding benchmark. |
| TableVQA-Bench | 2024 | Multiple table domains | 1,500 QA pairs | VQA accuracy | 2404.19205 | Naver AI Lab. Evaluates MLLMs on image-format tables. Derived from existing TSR and table QA sources. |
| MMTab | 2024 | Wikipedia, financial, web (Excel, Markdown, HTML) | 105K table images; 232K instruction samples; 45K eval | Task accuracy (15 tasks) | 2406.08100 | ACL 2024. Instruction-tuning resource + eval for table-aware VLMs. CC-BY-4.0. Introduced Table-LLaVA. |
| TableBench | 2024 | Real-world industrial tables | 886 test cases, 18 QA categories | Accuracy (Fact Checking, Numerical Reasoning, Data Analysis, Visualization) | 2408.09174 | AAAI 2025. Text-format table QA; targets the gap between LLMs and humans. GPT-4 scores modestly. |
| MTabVQA | 2025 | Multi-table reasoning | 3,745 QA pairs | Multi-hop accuracy | 2506.11684 | First benchmark for multi-hop reasoning across multiple table images simultaneously. |
| MMTBench | 2025 | Real-world multimodal tables (with charts/maps embedded) | 500 tables, 4,021 QA pairs | Accuracy (Explicit, Implicit, Visual-Based, Answer-Mention) | 2505.21771 | 8 table types. Tables contain embedded visual elements. |
| TABLET-VTU | 2025 | Diverse (14 seed datasets: Wikipedia, financial, scientific) | 4M examples, 2M unique table images, 21 tasks | Task accuracy | 2509.21205 | Aggregates 14 existing datasets for VLM instruction tuning. Not to be confused with TABLET (the TSR method at ICDAR 2025). |
| MirageTVQA | 2025 | Multilingual (24 languages) | ~60K QA pairs | VQA accuracy; >35% drop under visual noise | 2511.17238 | Exposes English-first bias and noise sensitivity in frontier VLMs on table QA. |
| m3TQA | 2025 | Multilingual (97 languages) | 2,916 test QA + 39K training pairs | Multi-task accuracy | 2508.16265 | 50 source tables (annual reports, statistical reports) translated into 97 languages. |
| TableEval | 2025 | Real-world Excel (Chinese + English) | 617 spreadsheets, 2,325 QA pairs | Accuracy (6 tasks, 16 sub-tasks) | 2506.03949 | Hierarchical/nested/merged-cell tables from government, finance, academia. |
| WikiMixQA | 2025 | Wikipedia (7 domains) | 1,000 MCQ from 4,000 pages | Multiple-choice accuracy | 2506.15594 | ACL 2025. Cross-modal reasoning over table+chart pairs. |
| ExtractBench | 2026 | High-value domains (financial reporting) | 35 PDFs, 12,867 fields | JSON schema-guided; field-level accuracy | 2602.12247 | PDF-to-JSON structured extraction. Frontier models fail on a 369-field financial schema. |
| OCRBench v2 | 2025 | Diverse OCR (31 scenarios) | 10,000 QA pairs | Accuracy per task | 2501.00321 | Bilingual ZH/EN. Table/formula parsing is one of 31 scenarios. |
| CC-OCR | 2024 | Diverse (document parsing track) | 7,058 images, 39 subsets | Accuracy per track | 2412.02210 | ICCV 2025. 4 tracks; the document parsing track covers tables + formulas. |
| ComTQA / TabPedia | 2024 | Comprehensive visual table understanding | ~9K QA pairs | VQA accuracy | 2406.01326 | NeurIPS 2024. Introduced alongside TabPedia (a unified VLM for TD+TSR+QA). |
| ChemTable | 2025 | Chemistry literature | 1,300+ tables, 9,000+ QA pairs | TSR accuracy + QA; evaluates MLLMs | 2506.11375 | Expert-annotated chemical tables from peer-reviewed journals. Tests domain-specific MLLM performance. |
| NGTRBench | 2024 | Multi-type table images (varied quality) | Not stated | Hierarchical VLLM evaluation | 2412.20662 | IJCAI 2025. Evaluates VLLMs on realistic low-quality table images. Introduced with NGTR (a RAG-augmented toolchain). |
| TReB | 2025 | Diverse (text-format tables) | 26 sub-tasks | Accuracy (TCoT, PoT, ICoT modes) | 2506.18421 | Shallow understanding + deep reasoning; 20+ LLMs evaluated. |
| KITAB-Bench | 2025 | Arabic OCR + document understanding | Not stated | Multi-task, per sub-task | 2502.14949 | Arabic OCR benchmark; table extraction is one sub-task among text recognition, chart analysis, and VQA. |
| CHiTab | 2025 | Scientific tables (hierarchical headers) | PubTables-1M filtered subset | QA accuracy (integer answers) | 2511.08298 | ICDAR 2025 GREC workshop. Tests VLLM reasoning over hierarchical header structure. Derived from PubTables-1M. |
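
Most of the QA-style benchmarks above report exact-match accuracy over answer strings, but normalization details (case, whitespace, thousands separators) vary per paper and must be checked individually. The sketch below is a generic, stdlib-only version of that scoring loop; the `normalize` rules are an assumption for illustration, not any benchmark's official scorer.

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, trim, collapse whitespace, drop thousands separators.

    NOTE: these normalization rules are illustrative assumptions; official
    scorers (e.g., TableVQA-Bench's) may differ in detail.
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)
    ans = re.sub(r"(?<=\d),(?=\d{3}\b)", "", ans)  # "1,204" -> "1204"
    return ans

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_accuracy(["1,204", "Paris"], ["1204", "paris"]))  # 1.0
```

Note that TEDS (used by OmniDocBench) is a different kind of metric: it scores HTML table trees by tree edit distance rather than comparing answer strings.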

To Investigate: Datasets

Structured text/content-level table datasets that may be useful as content sources for synthetic image generation pipelines or for pre-training table-aware models. A minimal rendering sketch follows the table.

| Name | Year | Domain | Size | Format | License | URL | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ENTRANT | 2024 | Financial (SEC EDGAR, 2013-2021) | ~6.7M tables from ~330K filings | JSON (bi-tree: cell attributes, positional and hierarchical info) | Open (unrestricted) | Sci. Data / Zenodo | ~20 tables/filing; ~25 rows × ~5 columns on average. Text-level; no images. Potential upstream source for synthetic financial table generation (cf. the SynFinTabs pipeline). |
| HiFi-KPI | 2025 | SEC filings (10-K, 10-Q, 2017-2024) | ~1.8M paragraphs, ~5M entities, 218K-label hierarchy | Text spans + iXBRL taxonomy labels | Unknown | 2502.15411 / HuggingFace | KPI extraction from SEC earnings filings. Text-based, not image-based. |
| WikiDT (VQA side) | 2024 | Wikipedia screenshots | 70,652 QA + 53,698 SQL annotations | Visual table QA + SQL | Unknown | Amazon Science | The TD+TSR annotations are tracked on the TSR page; this entry covers the QA/SQL side. |
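
To make the "content sources for synthetic image generation" idea concrete, here is a minimal sketch that rasterizes a content-level table into a training image. Pillow, the uniform grid layout, and the flattened list-of-rows input are all assumptions of this sketch (ENTRANT's bi-tree JSON, for instance, would first need flattening into such a grid); it is not part of any dataset's tooling.

```python
from PIL import Image, ImageDraw  # pip install Pillow

def render_table(rows, cell_w=120, cell_h=28, pad=4):
    """Rasterize a list-of-rows table into a simple gridded image.

    Uniform cell sizes and the default bitmap font keep the sketch short;
    a real synthetic-data pipeline would vary fonts, column widths, borders,
    and noise to get useful visual diversity.
    """
    n_rows, n_cols = len(rows), max(len(r) for r in rows)
    img = Image.new("RGB", (n_cols * cell_w + 1, n_rows * cell_h + 1), "white")
    draw = ImageDraw.Draw(img)
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            x, y = j * cell_w, i * cell_h
            draw.rectangle([x, y, x + cell_w, y + cell_h], outline="black")
            draw.text((x + pad, y + pad), str(cell), fill="black")
    return img

# Hypothetical flattened financial table (the kind ENTRANT could supply).
rows = [["Metric", "FY2023", "FY2024"],
        ["Revenue", "1,204", "1,391"],
        ["Net income", "210", "245"]]
render_table(rows).save("synthetic_table.png")
```

Pairing the rendered image with its source content (as QA pairs or HTML ground truth) yields supervision in the style of SynFinTabs or MMTab, without manual annotation.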