Tables: Understanding, Reasoning, and LLM-era Evaluation
Tracking benchmarks and datasets for table understanding, visual table QA, and LLM/VLM-era table evaluation. Scope: downstream reasoning and multimodal understanding over tables, not structural recognition.
Scope note: This page covers table understanding: tasks like visual table QA, multi-hop reasoning over tables, and end-to-end document parsing benchmarks that treat table content as the target rather than table structure as the output. For detecting table regions (TD) and parsing the row/column/cell grid (TSR), see the Tables: Detection and Structure Recognition page.
Status: This page is a working investigation cluster. All entries below are candidates that have been identified but not yet fully vetted for inclusion. Treat this as a reading list, not a curated index.
Why This Exists
The 2024-2026 period produced a surge of benchmarks and datasets targeting LLM and VLM evaluation on tables. These are distinct from the structural recognition (TD/TSR) pipeline: they ask “can the model understand what a table says?” rather than “can the model recover the table’s row/column structure?”. They deserve their own tracking page.
Many of these also have implications for synthetic data generation (structured table content can be rendered into training images) and for evaluating end-to-end document parsing systems that produce HTML, Markdown, or JSON output containing tables.
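As a concrete illustration of that rendering step, here is a minimal sketch that turns a list-of-lists of cell strings into a styled table image. It assumes the `imgkit` package (a thin wrapper around the wkhtmltoimage binary) purely for convenience; any HTML-to-image path, such as a headless browser, works the same way, and the randomized fonts and borders are illustrative augmentation knobs rather than any specific pipeline's settings.

```python
# Minimal sketch: render structured table content into a training image.
# Assumes `imgkit` plus a wkhtmltoimage binary are installed; any
# HTML-to-image renderer (e.g. a headless browser) would work equally well.
import random
import imgkit

def rows_to_html(rows: list[list[str]]) -> str:
    """Wrap a list-of-lists of (already HTML-safe) cell strings in a styled table."""
    # Randomize a few style knobs so repeated renders yield varied images.
    font = random.choice(["Arial", "Georgia", "Courier New"])
    border = random.choice(["1px solid #333", "1px solid #999", "none"])
    body = "\n".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"""<html><head><style>
        table {{ border-collapse: collapse; font-family: '{font}'; }}
        td {{ border: {border}; padding: 4px 10px; }}
    </style></head><body><table>{body}</table></body></html>"""

rows = [["Year", "Revenue", "Net income"],
        ["2023", "1,204", "187"],
        ["2024", "1,411", "233"]]
imgkit.from_string(rows_to_html(rows), "table_000.png")
```

Pairing each rendered image with its source HTML yields aligned (image, structure) examples at negligible cost, which is the general idea behind the synthetic pipelines referenced in the datasets table below.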
To Investigate: Benchmarks
These have been identified but not yet reviewed in depth.
| Name | Year | Domain | Size | Output/Metric | arXiv | Notes |
|---|---|---|---|---|---|---|
| OmniDocBench | 2024 | Diverse PDFs (9 types: academic, financial, newspaper, handwritten, etc.) | 1,355 pages, 3 language settings | TEDS / HTML+LaTeX for tables; NED for text | 2412.07626 | CVPR 2025. Evaluates end-to-end document parsers; table sub-task uses HTML+LaTeX annotations. Relevant to TSR evaluation pipeline (see the TEDS sketch after this table). |
| READoc | 2024 | Diverse (arXiv, GitHub, Zenodo) | 3,576 real-world PDFs | PDF-to-Markdown; S³uite metric | 2409.05137 | ACL 2025 Findings. End-to-end document structured extraction; tables appear as Markdown. 27 languages. |
| MMTU | 2025 | Real-world tables (25 tasks) | 28K+ questions | Accuracy across 25 tasks | 2506.05587 | NeurIPS 2025 Datasets & Benchmarks. Even GPT-5 scores ~69%. Covers table understanding + reasoning + coding. |
| MMLongBench-Doc | 2024 | Lengthy PDFs (multi-modal) | 135 PDFs, 1,082 QA pairs | F1 | — | NeurIPS 2024 D&B. Table-dependent questions within a long-context document understanding benchmark; GPT-4o scores 44.9% F1. |
| TableVQA-Bench | 2024 | Multiple table domains | 1,500 QA pairs | VQA accuracy | 2404.19205 | Naver AI Lab. Evaluates MLLMs on image-format tables. Derived from existing TSR and table QA sources. |
| MMTab | 2024 | Wikipedia, financial, web (Excel, Markdown, HTML) | 105K table images; 232K instruction samples; 45K eval | Task accuracy (15 tasks) | 2406.08100 | ACL 2024. Instruction-tuning resource + eval for table-aware VLMs. CC-BY-4.0. Introduced Table-LLaVA. |
| TableBench | 2024 | Real-world industrial tables | 886 test cases, 18 QA categories | Accuracy (Fact Checking, Numerical Reasoning, Data Analysis, Visualization) | 2408.09174 | AAAI 2025. Text-format table QA; targets LLM gap vs. human. GPT-4 scores modestly. |
| MTabVQA | 2025 | Multi-table reasoning | 3,745 QA pairs | Multi-hop accuracy | 2506.11684 | First benchmark for multi-hop reasoning across multiple table images simultaneously. |
| MMTBench | 2025 | Real-world multimodal tables (with charts/maps embedded) | 500 tables, 4,021 QA pairs | Accuracy (Explicit, Implicit, Visual-Based, Answer-Mention) | 2505.21771 | 8 table types. Tables containing embedded visual elements. |
| TABLET-VTU | 2025 | Diverse (14 seed datasets: Wikipedia, financial, scientific) | 4M examples, 2M unique table images, 21 tasks | Task accuracy | 2509.21205 | Aggregates 14 existing datasets for VLM instruction tuning. Not to be confused with TABLET (the TSR method at ICDAR 2025). |
| MirageTVQA | 2025 | Multilingual (24 languages) | ~60K QA pairs | VQA accuracy | 2511.17238 | Exposes English-first bias in frontier VLMs on table QA and a >35% accuracy drop under visual noise. |
| m3TQA | 2025 | Multilingual (97 languages) | 2,916 QA test + 39K training pairs | Multi-task accuracy | 2508.16265 | 50 source tables (annual reports, statistical reports) translated to 97 languages. |
| TableEval | 2025 | Real-world Excel (Chinese + English) | 617 spreadsheets, 2,325 QA pairs | Accuracy (6 tasks, 16 sub-tasks) | 2506.03949 | Hierarchical/nested/merged-cell tables from government, finance, academia. |
| WikiMixQA | 2025 | Wikipedia (7 domains) | 1,000 MCQ from 4,000 pages | Multi-choice accuracy | 2506.15594 | ACL 2025. Cross-modal reasoning over table+chart pairs. |
| ExtractBench | 2026 | High-value domains (financial reporting) | 35 PDFs, 12,867 fields | JSON schema-guided; field-level accuracy | 2602.12247 | PDF-to-JSON structured extraction. Frontier models fail on 369-field financial schema. |
| OCRBench v2 | 2025 | Diverse OCR (31 scenarios) | 10,000 QA pairs | Accuracy per task | 2501.00321 | Bilingual ZH/EN. Table/formula parsing is one of 31 scenarios. |
| CC-OCR | 2024 | Diverse (document parsing track) | 7,058 images, 39 subsets | Accuracy per track | 2412.02210 | ICCV 2025. 4 tracks; document parsing track covers tables + formulas. |
| ComTQA / TabPedia | 2024 | Comprehensive visual table understanding | ~9K QA pairs | VQA accuracy | 2406.01326 | NeurIPS 2024. Introduced alongside TabPedia (unified VLM for TD+TSR+QA). |
| ChemTable | 2025 | Chemistry literature | 1,300+ tables, 9,000+ QA pairs | TSR accuracy + QA; evaluates MLLMs | 2506.11375 | Expert-annotated chemical tables from peer-reviewed journals. Tests domain-specific MLLM performance. |
| NGTRBench | 2024 | Multi-type table images (varied quality) | Not stated | Hierarchical VLLM evaluation | 2412.20662 | IJCAI 2025. Evaluates VLLMs on realistic low-quality table images. Introduced with NGTR (RAG-augmented toolchain). |
| TReB | 2025 | Diverse (text-format tables) | 26 sub-tasks | Accuracy (TCoT, PoT, ICoT modes) | 2506.18421 | Shallow understanding + deep reasoning; 20+ LLMs evaluated. |
| KITAB-Bench | 2025 | Arabic OCR + document understanding | Not stated | Multi-task per sub-task | 2502.14949 | Arabic OCR benchmark; table extraction is one sub-task among text recognition, chart analysis, VQA. |
| CHiTab | 2025 | Scientific tables (hierarchical headers) | PubTables-1M filtered subset | QA accuracy (integer answers) | 2511.08298 | ICDAR 2025 GREC workshop. Tests VLLM reasoning over hierarchical header structure. Derived from PubTables-1M. |
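Several of the benchmarks above (OmniDocBench, ComTQA/TabPedia, ChemTable) score predicted tables with TEDS: tree edit distance between the predicted and reference HTML trees, normalized into a similarity in [0, 1]. Below is a minimal sketch built on the `lxml` and `zss` (Zhang-Shasha) packages rather than the reference PubTabNet implementation; the real metric uses a normalized-Levenshtein substitution cost on `<td>` text, which is simplified here to an all-or-nothing label match.

```python
# Minimal TEDS sketch: tree edit distance over HTML table trees, normalized
# by the larger tree's size. The reference implementation additionally blends
# a string-similarity cost for <td> text; here cell text is folded into the
# node label, so any text change counts as a single edit.
from lxml import html
from zss import Node, simple_distance

def to_tree(element) -> Node:
    """Recursively convert an lxml element into a zss Node."""
    label = element.tag
    if element.tag == "td":  # fold cell text into the label
        label += "|" + (element.text_content() or "").strip()
    node = Node(label)
    for child in element:
        node.addkid(to_tree(child))
    return node

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(kid) for kid in node.children)

def teds(pred_html: str, true_html: str) -> float:
    pred = to_tree(html.fromstring(pred_html))
    true = to_tree(html.fromstring(true_html))
    dist = simple_distance(pred, true)
    return 1.0 - dist / max(tree_size(pred), tree_size(true))

# One relabeled cell out of a 4-node tree -> TEDS = 0.75
print(teds("<table><tr><td>a</td><td>b</td></tr></table>",
           "<table><tr><td>a</td><td>c</td></tr></table>"))
```

The structure-only variant (TEDS-S) reported by many TSR papers is the same computation with cell text dropped from the labels.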
To Investigate: Datasets
Structured text/content-level table datasets that may be useful as content sources for synthetic image generation pipelines or for pre-training table-aware models. A short sketch after the table shows the glue step from cell records to renderable HTML.
| Name | Year | Domain | Size | Format | License | URL | Notes |
|---|---|---|---|---|---|---|---|
| ENTRANT | 2024 | Financial (SEC EDGAR, 2013-2021) | ~6.7M tables from ~330K filings | JSON (bi-tree: cell attributes, positional and hierarchical info) | Open (unrestricted) | Sci. Data / Zenodo | ~20 tables/filing, ~25 rows, ~5 columns avg. Text-level; no images. Potential upstream source for synthetic financial table generation (cf. SynFinTabs pipeline). |
| HiFi-KPI | 2025 | SEC filings (10-K, 10-Q, 2017-2024) | ~1.8M paragraphs, ~5M entities, 218K-label hierarchy | Text spans + iXBRL taxonomy labels | Unknown | 2502.15411 / HuggingFace | KPI extraction from SEC earnings filings. Text-based, not image-based. |
| WikiDT (VQA side) | 2024 | Wikipedia screenshots | 70,652 QA + 53,698 SQL annotations | Visual table QA + SQL | Unknown | Amazon Science | The TD+TSR annotations are tracked in the TSR page. This entry covers the QA/SQL side. |
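To make the "content source" idea concrete, here is the glue step sketched under a hypothetical schema: flattening a cell-level record into an HTML string that the renderer from the Why This Exists section can consume. The `row`/`col`/`text` field names are invented for illustration; ENTRANT's actual bi-tree JSON carries richer cell attributes and hierarchy, and its exact keys should be checked against the Scientific Data paper.

```python
# Hypothetical sketch: turn a text-level cell-list record into an HTML table.
# The field names (`row`, `col`, `text`) are invented for illustration and do
# not reflect ENTRANT's actual bi-tree JSON schema.
from collections import defaultdict

def cells_to_html(cells: list[dict]) -> str:
    """Assemble {row, col, text} cell dicts into an HTML table string."""
    grid: dict[int, dict[int, str]] = defaultdict(dict)
    for cell in cells:
        grid[cell["row"]][cell["col"]] = cell["text"]
    rows = []
    for r in sorted(grid):
        tds = "".join(f"<td>{grid[r].get(c, '')}</td>"
                      for c in range(max(grid[r]) + 1))
        rows.append(f"<tr>{tds}</tr>")
    return "<table>" + "".join(rows) + "</table>"

record = [{"row": 0, "col": 0, "text": "Revenue"},
          {"row": 0, "col": 1, "text": "1,204"},
          {"row": 1, "col": 0, "text": "Net income"},
          {"row": 1, "col": 1, "text": "187"}]
print(cells_to_html(record))
```

Spanning cells would need `rowspan`/`colspan` attributes derived from the hierarchical info, which this sketch ignores.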
Related Pages
- Tables: Detection and Structure Recognition: TD and TSR models, datasets, benchmarks, and metrics.
- Document Understanding: End-to-end systems combining layout, structure, and content.