Tables: Detection & Structure Recognition
Tracking models, datasets, and metrics for detecting tables in documents and parsing their internal structure.
Table of Contents
- Overview
- Table Detection
- Table Structure Recognition
- Metrics
- Surveys
- Comparative Studies
- To Investigate
- Related Pages
Disclaimer: This page covers the full table pipeline: detecting tables on a page (TD) and parsing their internal grid structure (TSR). For general-purpose layout detection (which often includes table regions), see the Layout Page. For reading order prediction, see the Reading Order Page. For end-to-end document understanding, see the Document Understanding Page.
Overview
The table pipeline in document analysis has two stages:
- Table Detection (TD): Locating table regions on a page. This is typically handled by general-purpose layout models (Faster R-CNN, YOLO, DETR) that classify “Table” as one of several region types. Some specialized models and benchmarks target TD specifically.
- Table Structure Recognition (TSR): Recovering the logical grid of a detected table region, including rows, columns, spanning cells, and (optionally) header vs. body roles. TSR operates on a pre-cropped table image produced by the detection stage.
Both stages are active research areas with distinct datasets, metrics, and modeling paradigms. Some datasets (TableBank, PubTables-1M) provide annotations for both stages.
Table Detection
TD: Paradigms
Table detection is most commonly handled as a special case of document layout analysis. General-purpose object detectors (Faster R-CNN, DETR, YOLO variants) trained on layout datasets like PubLayNet or DocLayNet naturally produce table bounding boxes alongside other region types.
A few dedicated efforts focus on TD specifically, often using competition benchmarks (ICDAR series) or specialized datasets where tables are the only annotated class.
TD: Models
Most table detection models are general layout detectors that happen to produce table bounding boxes. The tables below list models with TD-specific contributions or evaluations; a usage sketch for the Table Transformer checkpoint follows the first table. For the full set of layout models, see the Layout Page.
Models with Code or Weights
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2021-09 | Table Transformer (Det) | microsoft/table-transformer-detection | table-transformer | MIT | Notes |
| 2020-08 | CDeC-Net | None | CDeCNet | MIT | Notes |
| 2020-04 | CascadeTabNet | None | CascadeTabNet | MIT | Notes |
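As a usage sketch, the Table Transformer detection checkpoint in the table loads through the Hugging Face transformers API as a standard DETR-style detector; the confidence threshold and input file below are illustrative, not recommendations from the model authors.

```python
# Hedged usage sketch for microsoft/table-transformer-detection via
# Hugging Face transformers; threshold and input file are illustrative.
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

image = Image.open("page.png").convert("RGB")  # hypothetical input page
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale predicted boxes to image coordinates and keep confident tables.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for label, box in zip(detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], [round(v, 1) for v in box.tolist()])
```

The table crops produced by this stage are what the TSR models in the next section consume.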
Methods (Paper Only)
| Date | Name | Notes |
|---|---|---|
| 2024-05 | SemiTabDETR | Notes |
| 2017-11 | DeepDeSRT | DOI |
Layout models with strong TD results: Many general-purpose layout models also detect tables. See the Layout Page for details on these models, which include:
- DocLayout-YOLO (2024): YOLOv10-based; DocSynth-300K pre-training.
- DiT (2022): BEiT-style pre-training; evaluated on ICDAR 2019 cTDaR.
- SwinDocSegmenter (2023): Instance segmentation; TableBank 98.0 mAP.
- IIIT-AR-13K baselines (2020): Cross-dataset TD evaluation on ICDAR 2013, cTDaR, UNLV, Marmot.
- LayoutParser (2021): Pre-trained on TableBank (among others).
TD: Datasets
Large-scale annotated collections for table detection (bounding box) training and evaluation. Datasets are grouped by the most permissive use permitted: commercial, research / non-commercial, and not available / restricted.
Commercial Use
Training a private, for-profit model is permitted with minimal obligations.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| TNCR (2021) | 9,428 pages | FDA drug labels | Human; format not stated | 5 (table type) | Yes | MIT | Notes. Pages may contain multiple tables; total table count exceeds page count. |
Research / Non-Commercial
Training a free, open-weight non-commercial model is permitted.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| WikiDT (2024) | 16,887 full pages (54,032 sub-pages after pagination); 159,905 table annotations | Wikipedia screenshots | Auto (rendered from Wikipedia source); Pascal VOC XML | 1 (table) | Yes | CC-BY-SA-3.0 | Notes. Also includes TSR annotations; see TSR section. |
| Open-Tables + ICT-TD (2023) | ~16k images (Open-Tables: 11,074; ICT-TD: 5,000) | Cleaned merger of open TD datasets (ICDAR 2013/17/19, Marmot, TNCR) + ICT commodity PDFs | Auto (re-annotated) + Human; format not stated | 1 (table) | Yes | Apache-2.0 (HuggingFace release); underlying Open-Tables sources include Marmot (research-only) and ICDAR (unknown) | Notes. ICT-TD component is original data; Open-Tables component inherits source restrictions. |
| SCI-3000 (2023) | 34,791 pages (3,000 PDFs) | Scientific (CS, biomed, chemistry, physics) | Human | 3 (table, figure, caption) | Yes | CC-BY-4.0 | ICDAR 2023. Zenodo. |
| TabRecSet (2023) | 32k images | Wild (scanned, camera, spreadsheet, bilingual EN/ZH) | Human (LabelMe JSON polygons) | 1 (table) | Yes | CC-BY-SA-4.0 | Notes. Also includes TSR + TCR; see TSR section. |
| PubTables-1M (2021) | 460k pages (~947k table instances) | Scientific | Auto (canonicalized); format not stated | 1 (table) | Yes | CDLA-Perm-2.0 (annotations); underlying PMCOA images have mixed per-article licenses | Notes. Table Transformer detection baseline: AP 0.966. |
| TableBank (2019) | 417k table instances (163k Word, 253k LaTeX) | Diverse | Weak (Word/LaTeX source); format not stated | 1 (table) | Yes | Apache-2.0 (code/models); Research-only (data) | Notes. Data license is research-only despite code being Apache-2.0. |
| PubLayNet (2019) | 360k pages | Scientific | Weak | 5 (table is one class) | Yes | CDLA-Perm-1.0 (annotations); underlying PMCOA PDFs are non-commercial only | Layout. ICDAR 2013 TD transfer demonstrated. |
| Marmot (2012) | 2,000 pages | Chinese e-books + English scientific | Human; format not stated | 1 (table) | No official splits | Research-only | PKU. 50% hard negatives. |
| UNLV (2010) | 2,889 pages | Diverse (scanned) | Human; format not stated | 1 (table) | Unknown | Research-only | Shahab et al., DAS 2010. Used in ICDAR 2013 competition. |
Not Available / Restricted
Described in a publication but not publicly downloadable. Included here because the papers provide useful methodological details and the data may become available in the future.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| BankTabNet (2024) | 11,607 pages | Bank statements (transaction tables) | Human (K-alpha 0.99); format not stated | 10 (table category) | Unknown | Proprietary | Notes. Also includes TSR annotations; see TSR section. |
TD: Benchmarks
Competition evaluation sets used to compare TD models. These are typically too small for training.
| Benchmark | Pages | Domain | Annotation | Metric | License | Notes |
|---|---|---|---|---|---|---|
| ICDAR 2019 cTDaR | ~600 (modern) + ~600 (archival) | Modern + Historical | Human | Weighted F1 (IoU $\geq$ 0.6) | Unknown | Two tracks: modern documents and archival records. DiT and WordScape evaluated here. |
| ICDAR 2013 Table Competition | 238 (EU + US gov) | Government documents | Human | Completeness + Purity | Unknown | Göbel et al. Classic TD benchmark; PubLayNet and IIIT-AR-13K models evaluated here. |
Layout benchmarks with TD evaluation: General-purpose layout benchmarks also evaluate table detection as one region class. See the Layout Page for full details, including:
- RoDLA (2024): Robustness benchmark covering PubLayNet-P, DocLayNet-P, M6Doc-P; introduces mPE and mRD metrics.
- ICDAR 2017 POD: 2,000 scientific pages; 4 region types (Formula, Table, Figure, All). Competition site
- ICDAR 2023 DocLayNet: Hard-split subset of DocLayNet (~80k pages); 11 region classes. Table is one class.
Table Structure Recognition
TSR: Paradigms
The field organizes around four main formulations:
- Image-to-Sequence (Im2Seq): Generates a markup token sequence (HTML, OTSL, or LaTeX) from a table image using an encoder-decoder architecture. Some variants are multi-task, sharing a backbone across structure, cell detection, and content decoders. Recent work fine-tunes large multimodal models with reinforcement learning on rendered output quality (Table2LaTeX-RL).
- Examples: EDD, TableFormer + OTSL, MTL-TabNet, TFLOP, UniTable, SPRINT, Table2LaTeX-RL.
- Pros: Captures complex spanning patterns naturally; end-to-end trainable; amenable to beam search decoding.
- Cons: Sequence length scales with table size; attention drift on large or complex tables; LaTeX output is harder to evaluate than HTML.
- Object Detection: Treats rows, columns, and/or cells as bounding-box objects detected in a single forward pass. Some models (OmniParser V2) additionally perform text spotting without a separate offline OCR stage, which affects fair comparison against structure-only models.
- Examples: Table Transformer (DETR-based), GTE, Cycle-CenterNet, GridFormer, OmniParser V1/V2.
- Pros: Fast single-pass inference; leverages standard detection tooling; produces spatial coordinates directly.
- Cons: Post-processing required to resolve spanning cells; struggles with dense or borderless tables.
- Split-and-Merge / Separation Line: Recovers the cell grid by predicting spatial structure and then assembling it. Approaches vary: some predict separator lines (via segmentation, regression, or query-based detection) and merge spanning cells in a second stage; others directly segment cell regions at the pixel or instance level (TableNet, CascadeTabNet, OG-HFYOLO). The common thread is that cell boundaries are predicted before the grid topology is assembled.
- Examples: TableNet, CascadeTabNet (Mask R-CNN cell segmentation), SEMv2 (separator line instance segmentation), SPLERGE (projection networks + merge model), TRUST (query-based decoder + vertex merge), LGPMA (soft-mask pyramid supervision), OG-HFYOLO (deformed cell instance segmentation), SepFormer, TABLET.
- Pros: Grid structure falls naturally from separator or boundary predictions; interpretable intermediate representations.
- Cons: Two-stage pipelines are sensitive to errors in the first stage; spanning cell resolution adds complexity; instance segmentation variants are slower than detection-only approaches.
- Graph / Cell Relationship: Reasons directly over cells rather than lines or sequences. Approaches share the goal of assigning each cell an adjacency structure or logical row/column index, but differ architecturally: GCN methods propagate context over explicit cell-to-cell edges; token-based methods predict adjacency matrices over OCR word tokens; regression-based methods output logical indices per cell in parallel.
- Examples: TGRNet (GCN + ordinal regression), TabStruct-Net (DGCNN + LSTM), NCGM (multi-modal collaborative blocks), ClusterTabNet (adjacency matrix over OCR tokens), LORE (cascade index regression; 0.45s/image), TableCenterNet (parallel spatial + logical index regression), VertexNet (keypoint-based cell stitching).
- Pros: Explicit cell-level reasoning; handles complex spanning structures; regression-based variants enable fully parallel inference.
- Cons: Graph methods depend on upstream cell detection quality; token-based methods require reliable OCR positions; harder to scale to very large tables.
The subtables below are organized by paradigm. Some models implement hybrid approaches and appear under their primary paradigm.
Choosing a Paradigm
The right paradigm depends on what your downstream pipeline needs and the quality of your input:
| If you need… | Lean toward… |
|---|---|
| HTML/LaTeX output ready for a parser or LLM | Im2Seq |
| Spatial cell crops to feed a downstream OCR pass | Object Detection |
| High boundary precision on bordered/printed documents | Split-and-Merge |
| Deformed or warped table images (camera capture) | Split-and-Merge (segmentation variants) |
| OCR-first pipeline that already has word bounding boxes | Graph/Cell (token-based, e.g. ClusterTabNet) |
| Fast parallel inference with logical grid indices | Graph/Cell (regression-based, e.g. LORE) |
TSR: Models
Im2Seq
Models that generate a markup token sequence (HTML, OTSL, or LaTeX) using an encoder-decoder architecture. Sequence length scales with table size. An OTSL-to-HTML sketch follows the table.
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-12 | TRivia | TRivia-3B | TRivia | Apache-2.0 (code); Unknown (weights) | Notes. opendatalab. Self-supervised GRPO fine-tuning of Qwen2.5-VL-3B from unlabeled table images; attention-guided QA generation creates training signal without human labels. TEDS 89.88 overall vs. Gemini 2.5 Pro (88.93) and MinerU2.5 (86.82). |
| 2025-09 | Table2LaTeX-RL | LLLHHH/Table2Latex-RL | Table2LaTeX-RL | Apache-2.0 | Notes. NeurIPS 2025. Qwen2.5-VL-3B fine-tuned with VSGRPO: dual-reward RL combining TEDS-Structure and CW-SSIM on rendered images. Trained on 1.2M arXiv table image-to-LaTeX pairs (SFT) + 5,936 complex tables (RL). TEDS-S 0.9218 on complex tables. Generates LaTeX rather than HTML/OTSL. |
| 2025-03 | SPRINT | None released | SPRINT | MIT | Notes. ICDAR 2024. IIT Bombay. ResNet-31 + Multi-Aspect Global Attention (GCA) encoder + 6-layer Transformer decoder. Aggressively downsamples input to 128×128 to produce script-agnostic features; decodes OTSL (6-token vocabulary). TEDS-S 97.55 (PubTabNet), 98.17 (FinTabNet). +11.12% avg TEDS-S over MTL-TabNet on MUSTARD (multilingual). Introduces MUSTARD dataset. |
| 2025-01 | TFLOP | None released | TFLOP | CC-BY-NC-4.0 | Notes. IJCAI 2024. Swin + BART + Layout Pointer. Encodes input text bounding boxes; decoder points to text regions via InfoNCE loss. TEDS-S/TEDS: 99.56/99.45 (FinTabNet), 98.38/96.66 (PubTabNet test). |
| 2024-09 | UniTabNet | None released | None | N/A | Notes. USTC + iFLYTEK. EMNLP 2024 Findings. Swin + BART + Vision Guider + Language Guider. GriTS-Top 99.43 (PubTables-1M); TEDS-S 94.0 (iFLYTAB). Code promised but not released. |
| 2024-04 | MuTabNet | None released | MuTabNet | MIT | Notes. ICDAR 2024. Multi-cell content decoder with bidirectional mutual learning; cross-attention between adjacent cells during decoding. Outperforms non-end-to-end models on PubTabNet without extra annotations. |
| 2024-03 | UniTable | None released | unitable | MIT | Notes. Preprint. Linear Projection Transformer + VQ-VAE SSP. Self-supervised pretraining on up to 2M unlabeled tables. 30M (base) / 125M (large). Best reported results on 4 of 5 major benchmarks at publication. |
| 2023-05 | OTSL / TableFormer | None released | None | N/A | Notes. IBM, ICDAR 2023. 5-token language with backward-only syntax rules; ~$2\times$ inference speedup over HTML. No code/weights released. |
| 2023-03 | MTL-TabNet | PubTabNet weights • FinTabNet weights | MTL-TabNet | Apache-2.0 | Notes. NII Tokyo. VISAPP 2023. ResNet-31 + GCAttention + Shared Decoder + 3 Task Decoders. Joint structure, cell detection, and content in one model. TEDS-Struct 98.79% (FinTabNet), 97.88% (PubTabNet val). |
| 2019-11 | EDD | None released | None | N/A | Notes. IBM. ResNet-18 + dual LSTM. Encoder-dual-decoder separating structure from cell content. Introduced with PubTabNet. |
| 2019-03 | TableBank baselines | None released | TableBank | Apache-2.0 | Notes. OpenNMT enc-dec. Image-to-HTML tag sequence (12-token vocabulary). Faster R-CNN for detection + OpenNMT for structure. |
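Since no reference implementation of OTSL was released (per the table above), here is a minimal sketch of how an OTSL token grid maps to HTML, based on a straightforward reading of the published token semantics (C starts a cell, L extends left, U extends up, X fills a 2D span interior, NL ends a row); cell text insertion and malformed-sequence handling are omitted.

```python
def otsl_to_html(tokens):
    """Convert a well-formed OTSL token sequence into skeleton HTML.
    "C" starts a cell, "L" extends the cell to its left (colspan),
    "U" extends the cell above (rowspan), "X" fills the interior of a
    2D span, "NL" ends a row. Sketch only: no text, no validation."""
    grid, row = [], []
    for tok in tokens:
        if tok == "NL":
            grid.append(row)
            row = []
        else:
            row.append(tok)
    if row:
        grid.append(row)
    n_rows, n_cols = len(grid), len(grid[0])
    html = ["<table>"]
    for r in range(n_rows):
        html.append("<tr>")
        for c in range(n_cols):
            if grid[r][c] != "C":
                continue  # slot is covered by some spanning cell's root
            colspan = 1
            while c + colspan < n_cols and grid[r][c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < n_rows and grid[r + rowspan][c] == "U":
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<td{attrs}></td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)

tokens = ["C", "C", "NL", "U", "C", "NL"]  # 2x2 grid, first cell spans 2 rows
print(otsl_to_html(tokens))
# <table><tr><td rowspan="2"></td><td></td></tr><tr><td></td></tr></table>
```

The appeal for Im2Seq decoding is sequence length: each grid slot costs one OTSL token versus several HTML tags, which is where the reported ~2× inference speedup over HTML comes from.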
Object Detection
Models that detect rows, columns, and cells as bounding-box objects in a single forward pass.
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-01 | VertexNet | None released | None | Unknown | IJDAR 2025. Keypoint-based TSR: detects cell center points, regresses four vertex positions, then stitches adjacent cells into the grid. F1 86.9% (WTW); TEDS 79.4%. |
| 2024-12 | TabSniper (TSR) | None released | None | Unknown | Notes. AmEx + Bosch. CODS-COMAD 2024. DETR fine-tuned from PubTables-1M with CIoU loss substitution and long-table split-merge strategy. Deployed pipeline for bank statement transaction extraction. Evaluated on proprietary BankTabNet only; no TEDS/GriTS reported. |
| 2023-09 | GridFormer | None released | None | N/A | Notes. Baidu + SCUT. ACM MM 2023. ResNet-50 + Deformable DETR (two-stream row/col decoders). Represents tables as $M \times N$ vertex-edge grids. TEDS-S 97.0% (PubTabNet val); TEDS-S 98.63% (FinTabNet val); F1 94.1% (WTW). |
| 2021-09 | Table Transformer (Struct) | microsoft/table-transformer-structure-recognition | table-transformer | MIT | Notes. DETR (ResNet-18). 125 object queries. Standard detection baseline for TSR. |
| 2021-09 | Cycle-CenterNet | None released | None | N/A | Notes. CenterNet + cycle-pairing module. Cell detection for wild table images. Introduced with WTW dataset. |
| 2020-05 | GTE | None released | None | N/A | Notes. RetinaNet (ResNet-50-FPN). Constraint loss coupling cell and table detectors. Joint TD + TSR. Also introduces FinTabNet dataset. |
Unified document parsers with TSR: Some systems perform TSR as part of a broader end-to-end pipeline (text spotting, KIE, layout). These are tracked on the Document Understanding Page, including OmniParser V1 (CVPR 2024) and OmniParser V2.
Split-and-Merge / Separation Line
Models that predict spatial structure and recover the cell grid in two stages. Early approaches used coarse pixel-wise mask segmentation (TableNet) or cell instance segmentation (OG-HFYOLO); most later work refined this into explicit separator line detection with a learned merge stage for spanning cells. A toy sketch of the split stage follows the table.
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-06 | SepFormer | None released | None | N/A | Notes. ICDAR 2025. RT-DETR backbone + dual two-stage decoder branches for coarse-to-fine separator regression (single line to line-strip). Eliminates segmentation masks entirely. 25.6 FPS. 98.6% F1 (SciTSR-COMP); 96.8% TEDS-S (PubTabNet); 93.9% F1 (WTW); 93.8% F1 (iFLYTAB). |
| 2025-06 | TABLET | None released | None | N/A | Notes. ICDAR 2025. ResNet-18 + FPN + Dual Transformer Encoders (split) + Transformer Encoder (merge). Formulates row/col splitting as 1D sequence labeling; merging as OTSL grid classification. 18 FPS on A100. 98.54 TEDS / 98.71 TEDS-S (FinTabNet test); 96.79 TEDS / 97.67 TEDS-S (PubTabNet val). |
| 2025-04 | OG-HFYOLO | DWTAL dataset + code | OGHFYOLO | AGPL-3.0 | Notes. NCHU. YOLOv5 + GOE + HKCF + scale-aware loss + mask-driven NMS. Cell-level instance segmentation (pixel masks) rather than separator line detection; evaluates on deformed table images. Mask mAP@50:95 74.23% (DWTAL-s); 62.38% (DWTAL-l). Introduces DWTAL (28,285 images). |
| 2024-07 | DTSM | None released | DTSM | Unknown | ICDAR 2024. SCUT. Text query encoder + adjacent feature aggregator targeting dense tables with high cell counts. Introduces DenseTab dataset (16,575 dense table images). |
| 2024-05 | SEMv3 | None released | None | N/A | Notes. IJCAI 2024. ResNet-34 + FPN + KOR. Keypoint offset regression replaces instance segmentation for separation line detection; O(NM) merge action map. 95.1% F1 (WTW), 89.3% F1 (ICDAR-2019 cTDaR Historical). |
| 2023-05 | TRACE | None released | None | N/A | Notes. ICDAR 2023. NAVER AI. Single U-Net (ResNet-50) predicts 5 segmentation maps (cell corners + 4 edge directions); bottom-up post-processing assembles cells without explicit separator lines or a separate TD stage. SubTableBank adds per-cell border visibility annotations. Best reported results on ICDAR 2013 and WTW at publication. |
| 2023-03 | SEMv2 | None released | SEMv2 | Unknown | Notes. Pattern Recognition 2024. ResNet-18 + conditional conv. Instance segmentation of separation lines; kernel/feature branch decoupling. Also introduces iFLYTAB dataset. Code exists; license not stated. |
| 2022-08 | TRUST | None released | None | N/A | Notes. Baidu + DUT. ResNet-18 + FPN + Query-Based Splitting (Transformer Decoder) + Vertex-Based Merging (cross-attention). Predicts multi-oriented row/col separators with angle. 97.1% Str-TEDS / 96.2% TEDS (PubTabNet). 10 FPS on A100. |
| 2022-08 | TSRFormer | None released | None | N/A | Notes. ACM MM 2022. ResNet-18 + FPN + SepRETR (DETR). Replaces segmentation with direct line regression via two-stage DETR decoder. 97.5% TEDS-S (PubTabNet), 93.4% F1 (WTW). |
| 2022-03 | RobusTabNet | None released | None | N/A | Notes. USTC + Microsoft Research Asia. ResNet-18 + FPN + Spatial CNN (split) + Grid CNN (merge) + CornerNet-FRCN detector. Spatial CNN propagates context across blank regions for robust separator prediction. 99.3% F1 (SciTSR), 97.0% TEDS-S (PubTabNet val). |
| 2021-05 | LGPMA | None released | DAVAR-Lab-OCR | Apache-2.0 | Notes. ICDAR 2021 Best Industry Paper. Hikvision + Zhejiang Univ. ResNet-50 + FPN + Mask-RCNN + LPMA + GPMA. Dual pyramid soft-mask supervision. TEDS 94.6 / TEDS-Struct 96.7 (PubTabNet val); F1 98.8 (SciTSR). Code only; no pretrained weights. |
| 2020-04 | CascadeTabNet | None | CascadeTabNet | MIT | Notes. CVPR Workshops 2020. Cascade Mask R-CNN + HRNet backbone. End-to-end TD + TSR: detects table regions then recovers cell structure via instance segmentation. Iterative transfer learning and document-specific augmentation. Also listed in TD: Models. F1 0.9252 (ICDAR 2013 TD). |
| 2020-01 | TableNet | None released | None | N/A | Notes. TCS Research. ICDAR 2019. VGG-19 + Dual FCN Decoders. Early precursor: pixel-wise table and column segmentation with rule-based row extraction from OCR. Releases Marmot Extended column annotations. F1 0.9662 (TD) / F1 0.9151 (TSR) on ICDAR 2013. |
| 2019-09 | SPLERGE | None released | None | N/A | ICDAR 2019, pp. 114-121. Adobe Research. DOI. CNN Row/Col Projection Networks + Merge Model. Projection pooling for global row/col split prediction; grid pooling for spanning cell merge decisions. Best reported results on ICDAR 2013 at publication. |
| 2017-11 | DeepDeSRT | None released | None | N/A | ICDAR 2017, pp. 1162-1167. DFKI. DOI. Faster R-CNN (detection) + FCN (row/col segmentation). Early joint TD + TSR with deep learning. See also TD: Models. |
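To make the split stage concrete, here is a toy, non-learned stand-in for projection-based splitting in the spirit of SPLERGE: separators are placed at low-ink stripes of a binarized table image. The learned projection networks in the actual papers replace this raw density heuristic; all thresholds below are invented for illustration.

```python
import numpy as np

def projection_splits(ink_mask, axis=0, max_ink=0.02, min_gap=3):
    """Toy split stage: place separators at near-empty stripes of a
    binarized table image. ink_mask is an HxW array with 1 = ink;
    axis=0 yields column separators, axis=1 yields row separators.
    Thresholds are illustrative, not taken from any paper."""
    profile = np.asarray(ink_mask).mean(axis=axis)  # ink density profile
    is_gap = profile <= max_ink                     # near-empty stripes
    splits, start = [], None
    for i, gap in enumerate(is_gap):
        if gap and start is None:
            start = i
        elif not gap and start is not None:
            if i - start >= min_gap:
                splits.append((start + i) // 2)     # separator at gap midpoint
            start = None
    if start is not None and len(is_gap) - start >= min_gap:
        splits.append((start + len(is_gap)) // 2)
    return splits
```

A merge stage then decides, per candidate grid cell, whether to fuse it with a neighbor to recover spanning cells; that second stage is what distinguishes the models in this table from pure grid-splitting heuristics.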
Graph / Cell Relationship
Models that assign each cell an adjacency structure or logical row/column position by reasoning over inter-cell relationships. Approaches differ architecturally: GCN methods propagate context over explicit cell-to-cell edges; token-based methods predict adjacency matrices over OCR word tokens; regression methods directly output logical indices in parallel without a graph. A schematic sketch of the token-based adjacency formulation follows the table.
| Date | Name | Artifacts | Code | License | Notes |
|---|---|---|---|---|---|
| 2025-04 | TableCenterNet | None released | TableCenterNet | Apache-2.0 | Notes. One-stage parallel regression for both spatial coordinates and logical row/col indices per cell simultaneously. Synergistic shared-feature + task-specific decoding. Best reported results on TableGraph-24K at publication. |
| 2024-02 | ClusterTabNet | None released | clustertabnet | Apache-2.0 | Notes. ICDAR 2024. SAP. Transformer Encoder + optional patch CNN. Predicts $n \times n$ adjacency matrix per target (tables, rows, columns, cells, headers) via $\sigma(QK^T)$ + BCELoss over OCR word tokens. Covers TD and TSR in one model. ~5M non-embedding params; rotation-robust. TD: AP 0.989 (PubTables-1M). TSR (4-class): AP 0.931 (PubTables-1M). |
| 2024-01 | LORE++ | None released | None | N/A | Notes. Pattern Recognition 2025. Follow-up to LORE. Adds MAE + Logical Distance Prediction pre-training; 60% of training data matches LORE at 100%. 0.43s/image inference. |
| 2023-03 | LORE | WTW checkpoint • PubTabNet checkpoint • Wireless checkpoint | LORE-TSR | Apache-2.0 | Notes. AAAI 2023. Zhejiang University + Alibaba DAMO. DLA-34 + CenterNet + Cascade Self-Attention Regressors. Regresses logical row/col start-end indices per cell in parallel. 99.3 F1 (SciTSR-comp); 95.1 F1 (WTW); 98.1 TEDS (PubTabNet, 20k training). 0.45s/image vs. 14.8s for EDD. |
| 2021-11 | NCGM | None released | None | N/A | Notes. Tencent YouTu Lab. CVPR 2022. ResNet-18 + CMHA-based ECE (intra-modality) + CCS (inter-modality); 3 collaborative blocks over geometry, appearance, and content. SciTSR-COMP Setup-B: F1 99.0%; strong gains on distorted tables. |
| 2021-06 | TGRNet | Checkpoints | TGRNet | Apache-2.0 (code); Unknown (data/weights) | Notes. ICCV 2021. Segmentation + GCN. Two-branch: cell detection via segmentation, logical index prediction via ordinal regression. Also introduces TableGraph-24K (full 350K described in paper not publicly released). Pretrained checkpoints on CMDD, ICDAR13, ICDAR19-cTDaR, TableGraph-24K. |
| 2020-10 | TabStruct-Net | None released | TabStructNet | Unknown | Notes. IIIT Hyderabad. ECCV 2020. ResNet-101 + FPN + Mask R-CNN + DGCNN + LSTM. Joint cell detection and row/column adjacency prediction; alignment loss enforces grid constraints. F1 0.906 (ICDAR-2013), 0.920 (SciTSR), TEDS 0.901 (PubTabNet). Code available; license not stated. |
Pipeline references: Some models appear only in downstream pipeline integrations: TableMaster (used in MinerU for TSR; see OCR Page) and StructEqTable (also referenced by MinerU for table-to-LaTeX conversion).
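As a schematic of the token-based formulation, here is a minimal PyTorch sketch of an adjacency-matrix head in the spirit of ClusterTabNet's $\sigma(QK^T)$ + BCELoss objective; the embedding dimensions, token count, and identity target are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

class AdjacencyHead(nn.Module):
    """Schematic adjacency head: project token embeddings to queries and
    keys, take pairwise dot products, and train the resulting n x n
    logits against a binary matrix marking which OCR tokens share a
    row/column/cell. Dimensions are placeholders."""

    def __init__(self, d_model: int = 256, d_head: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n, d_model) -> adjacency logits: (n, n)
        return self.q_proj(tokens) @ self.k_proj(tokens).T

head = AdjacencyHead()
tokens = torch.randn(12, 256)   # 12 OCR word embeddings (toy input)
logits = head(tokens)
target = torch.eye(12)          # placeholder 0/1 ground-truth adjacency
loss = nn.BCEWithLogitsLoss()(logits, target)
```

One head per relation type (table, row, column, cell, header) yields the multi-target setup described in the ClusterTabNet row above.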
TSR: Datasets
Datasets with table structure annotations (row/column/cell/spanning cell labels). Grouped by the most permissive use permitted: commercial, research / non-commercial, and not available / restricted.
Commercial Use
Training a private, for-profit model is permitted with minimal obligations.
| Dataset | Tables | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| SynFinTabs (2024) | 100k synthetic | Synthetic (browser-rendered HTML/CSS; no real document images; styled after UK Companies House filings) | Auto (programmatic rendering) | Rows, columns, cells; HTML + JSON + CSV + bbox annotations | Yes | MIT | Notes. Queen’s University Belfast. Also introduces FinTabQA (LayoutLM fine-tune). |
| SynthTabNet (2022) | 600k | Synthetic (browser-rendered HTML/CSS; no real document images) | Auto (HTML + cell bboxes) | Rows, columns, cells (spanning) | Yes | CDLA-Permissive-1.0 | Notes. Introduced with TableFormer (Nassar et al., CVPR 2022). Four styled subsets (FinTabNet, marketing, PubTabNet, sparse), 150k each. 80/10/10 splits. ~37 GB total. |
Research / Non-Commercial
Training a free, open-weight non-commercial model is permitted.
| Dataset | Tables | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| PubTables-v2 (2025) | 548k tables (9,172 documents; 9,492 multi-page tables) | Scientific (PubMed, 2023-2025) | Auto (canonicalized) + full-page context | Rows, columns, cells (spanning); multi-page annotations | Yes | CDLA-Permissive-2.0 (annotations); underlying PMCOA images have mixed per-article licenses | Notes. Kensho. First large-scale multi-page TSR dataset; adds full-page context and cross-page table annotations. |
| CISOL (2025) | 844 table images; 120k+ cell instances | Construction industry (steel ordering lists, civil engineering) | Human | Cells + TD bboxes | Yes (two tracks: TD+TSR end-to-end; TSR-only cropped) | CC-BY-4.0 | Notes. WACV 2025. Real-world documents from 24 construction projects (2015-2023). Industrial domain absent from prior TSR benchmarks. |
| Table2LaTeX-RL (2025) | ~1.2M pairs | Scientific (arXiv rendered tables) | Auto (LaTeX source + rendered image) | LaTeX sequences | Yes | Apache-2.0 (stated); underlying arXiv papers have mixed licenses including NC variants; same provenance concern as SciTSR/TableBank | Notes. Large-scale arXiv-derived table image to LaTeX training corpus. Introduced with RL-based table-to-LaTeX model (NeurIPS 2025). |
| DWTAL (2025) | 28,285 (DWTAL-s: 8,765; DWTAL-l: 19,520) | Diverse (deformed/warped table images; derived from TAL-OCR + WTW + 150 collected) | Synthetic (wave + cylindrical warping generator) + Human (150 offline images) | Cells (pixel-level instance segmentation masks) | Yes | DWTAL-l: CC-BY-NC-4.0 (WTW-derived); DWTAL-s: Unknown (TAL-OCR license unverified); HuggingFace Apache-2.0 claim does not reflect upstream NC restriction | Notes. NCHU. Dataset focused on deformed and warped real-world tables with fine-grained segmentation annotations. Introduced alongside OG-HFYOLO. |
| UoS_Data_Rescue (2025) | 1,113 logbooks (594k+ cells) | Historical (scientific logbooks, 19th-20th century) | Human | Cells (row/col structure + transcription) | Yes | CC-BY-4.0 | IJDAR 2025. University of Southampton. Historical scientific logbooks digitized for climate/environmental research. Handwritten and mixed printed/handwritten tables. No arXiv preprint. |
| MMSci (2025) | ~52k TSR samples | Scientific (multimodal figures + tables) | Auto | Cells + structure | Yes | CC-BY-NC-SA-4.0 (inherited from SciGen source via ShareAlike) | Notes. Large-scale multimodal science dataset; TSR component derived from arXiv papers. Also includes 12K instruction-tuning and 3,114-sample evaluation benchmark. |
| ENTRANT (2024) | ~6.7M tables (~330k SEC filings) | Financial (SEC EDGAR, 2013-2021) | Auto (extracted from XLSX) | Cells (JSON bi-tree: positional + hierarchical attributes) | Yes | CC-BY-4.0 | Notes. IIT Demokritos. Text/JSON format only; no table images. Structural content source for synthetic image generation pipelines (cf. SynFinTabs). ~20 tables/filing avg, ~25 rows, ~5 cols. |
| WikiDT (2024, TSR side) | 159,905 table crops | Wikipedia screenshots | Auto (Wikipedia markup) | Rows, columns, cells | Yes | CC-BY-SA-3.0 | Notes. Also includes TD and QA/SQL annotations; TSR side tracked here. See TD: Datasets for the TD side and Table Understanding for the QA side. |
| MUSTARD (2024) | 1,428 tables | Multilingual (11 Indic scripts + Chinese; scanned and scene-text) | Human | Cells (OTSL sequences) | Yes | MIT | Notes (SPRINT paper). IIT Bombay. ICDAR 2024. 1,214 Indic + 214 Chinese/other tables from magazines. Released alongside SPRINT (script-agnostic TSR model). First large-scale multilingual TSR dataset covering Indic scripts. |
| TabRecSet (2023) | 38.2k | Wild (scanned, camera, spreadsheet, bilingual EN/ZH) | Human (cell polygons + logical structure + cell text) | Cells (TD + TSR + TCR) | Yes | CC-BY-SA-4.0 | Notes. 32k images; 80/20 split. First large-scale bilingual (English + Chinese) end-to-end table recognition dataset. Also includes TD annotations. |
| ComFinTab (2022) | 10k (6k Chinese + 4k English) | Financial (compound tables from Chinese listed-company annual reports) | Human (cell bboxes, row/col indices, text, cell type, cell linking) | Cells (compound spanning; TH/LH/DA/OT + linking) | Yes (4.5k/1.5k Chinese; 3.2k/0.8k English; company-level split) | CC-BY-NC-SA-4.0 | Notes. DAVAR Lab (Hikvision/ShanghaiTech/ZJU). ACM MM 2022. Over 70% compound tables. Introduces table item extraction task and Tree-F1-Score metric. CTUNet code released via DAVAR-Lab-OCR (Apache-2.0). Dataset available via gated application; see ComFinTab page. |
| TableGraph-24K (2021) | 24k (350K described in paper; only 24K publicly released) | Scientific | Auto (graph: cell bboxes + logical indices) | Cells (spatial + logical row/col indices) | Yes | No license stated (annotations/code); underlying images derived from arXiv LaTeX papers with mixed per-article licenses | Notes. Derived from TABLE2LATEX-450K (rendered arXiv LaTeX tables). Same mixed-license provenance as SciTSR/TableBank. Full TableGraph-350K has not been released. |
| GloSAT (2021) | 500 page images (one table per page) | Historical meteorological logbooks (UK Met Office, NOAA, Univ. of Reading) | Human (VOC2007 + ICDAR cTDaR XML) | Headings, headers, table body, coarse segmentation cells | Yes | BSD-3-Clause | Notes. HIP@ICDAR 2021. University of Southampton. Enhanced annotations for TSR in historical documents; adds coarse cell groupings following original ruling lines. |
| WTW (2021) | 14.5k | Wild (photos, scans, documents) | Human (cell coordinates + row/col structure) | Cells | Yes | CC-BY-NC-4.0 | Notes. Tables in natural scenes with deformation, bending, occlusion. |
| PubTables-1M (2021) | ~947k tables (from 460k pages) | Scientific | Auto (canonicalized bboxes) | Rows, columns, cells (incl. blank), spanning cells | Yes | CDLA-Perm-2.0 (annotations); underlying PMCOA images have mixed per-article licenses | Notes. Canonicalized annotations fix PubTabNet oversegmentation. Includes projected row header labels. |
| TSRD + TCRD (2021) | 46K + 38K | Scientific (arXiv CS preprints, LaTeX-rendered to JPG) | Auto (LaTeX source compiled to image + structure sequences) | Structure (TSRD) + content (TCRD) | Yes | CC-BY-NC-SA-4.0 (stated on CodaLab competition page) | Notes. ICDAR 2021 competition datasets from IIT Gandhinagar. TSRD: 46K table images; TCRD: 38K. Images are programmatically rendered from arXiv CS LaTeX sources. Access requires CodaLab login; no open mirror available. |
| FinTabNet (2020) | ~113k | Financial | Auto (HTML) | Rows, columns, cells (spanning) | Yes | CDLA-Permissive-1.0 (annotations); underlying annual report images are from copyrighted S&P 500 corporate filings | Notes. Complex, dense financial tables from S&P 500 SEC annual reports. Introduced with GTE. |
| Marmot Extended (2020) | 509 | English scientific (English subset of Marmot) | Human (column bounding boxes) | Columns | No | Research-only (inherits Marmot restriction); Unknown (annotation license) | Notes. TCS Research. Column bounding box annotations for 509 English documents from the Marmot dataset, released alongside TableNet. Column-level only; no row or cell annotations. Google Drive. |
| PubTabNet (2019) | 568k | Scientific | Auto (HTML structure + cell content) | Rows, columns, cells (spanning) | Yes | CDLA-Permissive-1.0 (annotations); underlying PMCOA images have mixed per-article licenses | Notes. First large-scale TSR dataset. Introduces EDD model and TEDS metric. |
| SciTSR (2019) | 15k | Scientific (arXiv PDF table images) | Auto (LaTeX source → cell adjacency graph) | Cells (adjacency) | Yes | MIT (annotations/code); underlying arXiv PDF images have mixed per-article licenses | Notes. Focuses on complex spanning structures. SciTSR-COMP subset: 716 complex tables. PDF images sourced from arXiv papers; same mixed-license provenance as PubTabNet/TableBank. |
| TableBank (2019) | 145k (structure split) | Diverse | Weak (HTML-like tag sequence) | Cells | Yes | Research-only (data) | Notes. Weak supervision from Word/LaTeX sources. Also provides TD annotations (see above). |
Not Available / Restricted
Described in a publication but not publicly downloadable. Included here because the papers provide useful methodological details and the data may become available in the future.
| Dataset | Pages | Domain | Annotation | Classes | Eval Split | License | Notes |
|---|---|---|---|---|---|---|---|
| Arabic TSR (2024) | 7,300 | Arabic documents | Human | Cells | Unknown | Unknown | No verified public release or arXiv preprint found. Dataset referenced in literature; availability unconfirmed. |
| Chinese Financial TSR (2024) | ~1.5M tables (105,600 bordered sampled for synthesis) | Financial (Chinese annual reports) | Auto (extracted from reports) | Rows, columns, cells | Unknown | Unknown | arXiv 2404.11100. ICDAR 2024. Authors state intent to publicly release; no download available as of 2026-03. Includes 2,290-table manually verified benchmark. |
| BankTabNet (2024, TSR side) | 5,165 tables | Bank statements (transaction tables) | Human (K-alpha 0.99) | Cells (rows, columns, spanning) | Unknown | Proprietary | Notes. AmEx internal; PII-masked; not released. Same paper as TD side. TD side tracked in TD: Datasets. |
| DenseTab (2024) | 16,575 | Dense/complex (multi-row/col spanning); image source undisclosed | Human | Cells (complex spanning) | Yes | Unknown | ICDAR 2024. Google Drive. Files are technically downloadable but moved here due to two blockers: (1) no license stated anywhere in the repo or README; (2) image provenance is completely undisclosed; the source documents are unknown and the full paper is paywalled. Cannot assess rights or suitability. |
| SubTableBank (2023) | 9,717 images | Financial and scientific documents (+ some TableBank images) | Human (cell bboxes + per-edge visibility flags: explicit vs. implicit borders) | Cells, border visibility | Yes (7,783 / 971 / 963 train/val/test) | Unknown | Notes. NAVER AI. In-house dataset used to train TRACE. Per-edge visibility annotations distinguish explicit (visible) from implicit (invisible) cell borders. Partial public release promised in ICDAR 2023 preprint; no public URL found as of 2026-04. |
| iFLYTAB (2023) | 17.3k | Diverse (digital + camera); digital document sources unspecified | Human (cell polygons + row/col info polygons) | Rows, columns, cells | Unknown | Unknown | Notes. Introduced with SEMv2. No stated license; digital image provenance unspecified; originated from an iFLYTEK competition (restricted redistribution implied). USTC file-sharing download link may no longer be live. |
| cTDaR TrackA TSR Annotations (2022) | 600 pages (modern) | Modern documents (ICDAR 2019 cTDaR TrackA) | Human (row/col separation lines + cell bboxes) | Rows, columns, cells | Yes (follows cTDaR 600/240 modern train/test split) | Unknown | Notes. USTC + Microsoft Research Asia. Structure annotations for the 600 modern images from ICDAR 2019 cTDaR TrackA, used to train RobustTabNet’s TSR module. Public release stated in the paper; no repository or download link identified as of 2026-04. |
| ICDAR 2017 POD TSR Supplement (post-2017) | 549 train + 243 test table crops | Scientific (CiteSeer; subset of ICDAR 2017 POD) | Human (cell polygons + adjacency XML) | Cells (polygon coords + adjacency relations) | Yes | Unknown | Post-hoc TSR annotations added to a subset of ICDAR 2017 POD table regions by the CIAS group (PKU). Cell polygon coordinates and adjacency neighbors attributes in XML; no row/col span indices. No stated license. GitHub. |
TSR: Benchmarks
Evaluation-only sets used to compare TSR systems. Not suitable for training.
| Benchmark | Tables | Domain | Annotation | Metric | License | Notes |
|---|---|---|---|---|---|---|
| Benchmarking PDF Parsers (2026) | 451 tables (100 pages) | Synthetic PDFs from arXiv LaTeX sources | Human + LLM-as-judge | LLM-as-judge scoring (Pearson r=0.93 vs. human) | MIT (code); CC-BY-SA-4.0 (data) | Notes. Evaluates 21 PDF parsers using LLM-as-a-judge, validated against 1,554 human ratings. LLM metrics substantially outperform TEDS (r=0.68) and GriTS (r=0.70). |
| OmniDocBench (2025) | 1,355 pages (subset of 9 doc types) | Diverse (academic, financial, newspaper, handwritten, etc.) | Human | TEDS / HTML+LaTeX for tables; NED for text | Custom (research only; non-commercial per dataset card) | Notes. CVPR 2025. End-to-end document parsing benchmark; table sub-task uses HTML+LaTeX annotations for TEDS-style evaluation. Covers 9 document types and 3 language settings. |
| Benchmarking TE (Soric et al.) (2025) | 37k (Table-arXiv + Table-BRGM); 56k (PubTables-Test) | Scientific (LaTeX preprints, geological reports) | Auto + Human | GriTS-Top; GriTS-Cont; TEDS; end-to-end P/R metrics | MIT | Notes. End-to-end benchmark covering TD, TSR, and full TE pipeline. Introduces Table-arXiv (36k samples, all arXiv domains) and Table-BRGM (124 tables from French geological reports) alongside PubTables-Test. Formally justified metrics propagate TD errors into TSR scores. Nine methods evaluated; models trained on PubTables-1M degrade substantially on heterogeneous data. |
| DocPTBench (2025) | 1,381 images (Original + Photographed + Unwarped) | Camera-captured and digital (phone photos of printed documents; 8 translation directions) | Human | Edit distance (parsing); BLEU/chrF/METEOR (translation) | Apache-2.0 | Notes. Unified parsing + translation benchmark on photographed documents. Expert OCR models degrade ~25% on photographed vs. digital; general MLLMs ~18%. |
| CC-OCR (2025) | 7,058 images | Diverse (4 tracks: multi-scene text, multilingual OCR, document parsing, KIE) | Human | F1 (text); NED (parsing); TEDS (tables); field F1 (KIE) | MIT | Notes. ICCV 2025. 4-track OCR benchmark; document parsing track is most relevant to TSR. 39 subsets, 10 languages. Evaluates 9 LMMs including GPT-4o, Gemini, Qwen2-VL. |
| RD-TableBench (2024) | 1,000 | Diverse (financial, scientific, scanned, handwritten, multilingual) | Human | Needleman-Wunsch HTML array alignment with Levenshtein partial credit | CC-BY-NC-ND-4.0 | Released by Reducto. Targets complex real-world tables; eval-only, no training split. Partial public release to prevent contamination. HuggingFace. Used in Nemotron Parse evaluation. |
| ICDAR 2021 SLP Task B (2021) | 9,064 (final eval) | Scientific (PubTabNet) | Auto (PubTabNet) | TEDS (Simple/Complex/All) | Apache-2.0 (eval harness); CDLA-Perm-1.0 (annotations); underlying PDFs mixed | Notes. IBM. 30 teams. Top result: 96.36 TEDS all (Davar-Lab-OCR). Note: `<b>` bold tags excluded from scoring due to a data preparation bug; all reported numbers are slightly inflated. Prior EDD baseline: 91 TEDS. GitHub. |
| ICDAR 2019 cTDaR TSR (2019) | ~600 (modern) + ~600 (archival) | Modern documents + historical handwritten | Human | Adjacency F1 | Unknown | Gao et al., ICDAR 2019. DOI. Three TSR tasks: (1) structure with given regions; (2) structure without given regions; (3) both on archival handwritten documents. SEMv3 reports 89.3% F1 on the archival track. Same competition as the ICDAR 2019 cTDaR TD benchmark; see TD section for detection-side details. |
| SciTSR-COMP (2019) | 716 | Scientific (arXiv CS PDF images) | Auto (LaTeX source) | Adjacency F1 | MIT (inherits from SciTSR) | Complex-table subset of SciTSR: tables with at least one spanning cell. Widely used as the primary eval target for graph-paradigm and split-and-merge models (LORE: 99.3 F1; NCGM: 99.0 F1; RobustTabNet: 99.3 F1; TRUST: reported on val). SciTSR-COMP results are not comparable to full SciTSR results. |
| ICDAR 2013 TSR (2013) | 150 tables (238 pages) | Government documents (EU + US) | Human (cell bboxes + row/col spans; adjacency XML) | Adjacency F1 | Unknown | Göbel et al., ICDAR 2013. Same documents as the ICDAR 2013 TD benchmark; separate TSR ground truth with cell bounding boxes, row/column span indices, and adjacency relations between neighboring cells. The community’s sole public TSR benchmark from 2013 through 2018; DeepDeSRT (2017) and SPLERGE (2019) both evaluate here. A corrected version is on HuggingFace (CDLA-Perm-2.0 for new annotations). |
Metrics
Detection Metrics
Table detection uses the same metrics as general layout detection. See the Layout Page metrics section for detailed explanations. A minimal sketch of threshold-based box matching follows the table.
| Metric | What it measures | Notes | Tools |
|---|---|---|---|
| mAP @ IoU | COCO-style detection accuracy | Standard for PubTables-1M, ICDAR 2017 POD. Thresholds vary: $\text{AP}@[.50:.95]$, $\text{AP}@50$, $\text{AP}@75$. | pycocotools |
| Weighted F1 (IoU $\geq$ 0.6) | Detection quality with class weighting | Used in ICDAR 2019 cTDaR competition. | – |
| Area-based P/R/F1 | Pixel-area overlap between predicted and GT regions | Used by TableBank. Less standard than mAP; may obscure object-level errors like merging adjacent tables. | Custom |
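A minimal sketch of the threshold-based box matching these F1-style metrics build on; the greedy matcher and the 0.6 default mirror the cTDaR setup in spirit, but the official protocol (class weighting, multiple thresholds) is more involved.

```python
def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def f1_at_iou(preds, gts, thresh=0.6):
    """Greedy one-to-one matching at an IoU threshold. Simplified: the
    official cTDaR evaluation adds class weighting on top of this idea."""
    matched, tp = set(), 0
    for p in preds:
        score, idx = max(
            ((iou(p, g), i) for i, g in enumerate(gts) if i not in matched),
            default=(0.0, -1),
        )
        if score >= thresh:
            matched.add(idx)
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```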
Structure Metrics
| Metric | Paradigm | Notes | Tools |
|---|---|---|---|
| TEDS (Tree Edit Distance Similarity) | Im2Seq, Object Detection | Standard for PubTabNet and FinTabNet. Introduced by PubTabNet. Normalized tree-edit distance between predicted and ground-truth HTML table trees: $\text{TEDS}(T_a, T_b) = 1 - \frac{\text{EditDist}(T_a, T_b)}{\max(\lvert T_a \rvert, \lvert T_b \rvert)}$. Often reported as Simple / Complex / All splits. | Custom; teds (unofficial) |
| TEDS-S (TEDS-Structure) | Im2Seq, Object Detection | Structure-only variant: cell content is stripped from both prediction and ground truth before computing the tree edit distance. Isolates layout prediction from OCR quality. Widely reported in recent work as the primary comparison number. | Custom; teds (unofficial) |
| GriTS (Grid Table Similarity) | Object Detection | Grid topology correctness measuring row/column spanning alignment. More robust to empty cell variations than TEDS. Introduced by PubTables-1M. Three variants: $\text{GriTS}_{\text{Top}}$ (topology), $\text{GriTS}_{\text{Cont}}$ (content), $\text{GriTS}_{\text{Loc}}$ (location). | Custom (PubTables-1M repo) |
| Adjacency F1 | Graph Reconstruction | Correctness of adjacent cell pair relationships. Pre-TEDS metric. Known to under-react to structural errors (row/column misalignment) and over-react to content perturbations. PubTabNet demonstrated these failure modes. Largely superseded. | Custom |
| BLEU | Im2Seq | N-gram overlap on generated tag sequences. Used by TableBank for structure recognition evaluation (4-gram BLEU). Less sensitive to structural errors than TEDS; largely superseded. | nltk, sacrebleu |
TEDS Variants
TEDS measures how many tree edit operations (node insertion, deletion, relabeling) are needed to transform the predicted HTML tree into the ground-truth HTML tree, normalized by the size of the larger tree. A score of 1.0 is a perfect prediction; 0.0 means the trees share no recoverable structure.
Two variants appear in the literature:
- TEDS (full): Includes both the structural token sequence and the OCR text content of each cell. A structurally correct prediction with wrong cell text is penalized. This requires a working OCR component and makes cross-system comparison harder unless OCR is held constant.
- TEDS-S (structure only): Strips cell content from both prediction and ground truth before computing tree edit distance. This isolates the layout prediction task from OCR quality and is the more commonly reported number in recent work.
TEDS is often reported split across Simple (few spanning cells, rectangular grids) and Complex (multi-span, hierarchical headers) subsets. A model that performs well on Simple but poorly on Complex is likely struggling with the spanning cell case specifically.
One known limitation: TEDS is proportional, so a structural error that shifts every cell in a large table still yields a moderate score, while the same error on a small table is more severely penalized. This makes cross-table-size comparisons unreliable.
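A minimal sketch of the TEDS computation using the zss package (Zhang-Shasha tree edit distance). The official PubTabNet scorer additionally applies custom relabel costs over colspan/rowspan attributes and cell text, which this simplifies to plain label equality:

```python
# Minimal TEDS sketch with the zss tree-edit-distance package; the official
# scorer's attribute- and text-aware relabel costs are simplified away.
from zss import Node, simple_distance

def tree_size(node: Node) -> int:
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def teds(pred: Node, gt: Node) -> float:
    dist = simple_distance(pred, gt)  # unit-cost insert/delete/relabel
    return 1.0 - dist / max(tree_size(pred), tree_size(gt))

# Tiny example: a 1x2 ground-truth row vs. a prediction missing one cell.
gt = Node("tr").addkid(Node("td:A")).addkid(Node("td:B"))
pred = Node("tr").addkid(Node("td:A"))
print(round(teds(pred, gt), 3))  # 0.667: one edit over max tree size 3
```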
GriTS Variants
GriTS frames TSR evaluation as comparing the predicted grid structure directly, rather than via a tree edit distance on HTML. Given predicted and ground-truth grids, the metric finds their most similar substructures via dynamic programming and scores the aligned cells with an F-score. Three variants measure different aspects:
- GriTS-Top (topology): Checks only that each cell occupies the correct row/column span position. Ignores both spatial coordinates and text content.
- GriTS-Cont (content): Extends topology evaluation to also require correct cell text content, using a string similarity score per matched pair.
- GriTS-Loc (location): Extends topology to also require correct bounding box coordinates, using a spatial IoU score per matched pair.
GriTS is more robust than TEDS to annotation artifacts like empty cell representation (the oversegmentation issue that PubTables-1M corrects relative to PubTabNet). When GriTS and TEDS disagree substantially, the discrepancy often traces to empty cell handling. GriTS-Top is the most commonly reported single number for object detection-paradigm models (Table Transformer, ClusterTabNet). For Im2Seq models evaluated on PubTabNet, TEDS-S remains the community standard.
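The hard part of GriTS is the dynamic-programming search for the most similar 2D substructures, which is omitted below. Assuming that alignment is already given, this sketch only illustrates how the three variants differ in their per-cell similarity function; the dictionary keys and the SequenceMatcher stand-in for normalized LCS are my assumptions, not the official implementation:

```python
# Illustrative only: assumes the grid alignment is precomputed and shows
# just the per-cell similarity f for each GriTS variant.
from difflib import SequenceMatcher

def f_topology(a: dict, b: dict) -> float:
    # 1 if the cell occupies the same row/column span layout, else 0.
    return float(a["rowspan"] == b["rowspan"] and a["colspan"] == b["colspan"])

def f_content(a: dict, b: dict) -> float:
    # Normalized string similarity on cell text; the official metric uses
    # normalized LCS, for which SequenceMatcher.ratio() is a stand-in.
    return SequenceMatcher(None, a["text"], b["text"]).ratio()

def f_location(a: dict, b: dict) -> float:
    # Spatial IoU of the two cells' bounding boxes (x1, y1, x2, y2).
    ax, bx = a["box"], b["box"]
    ix1, iy1 = max(ax[0], bx[0]), max(ax[1], bx[1])
    ix2, iy2 = min(ax[2], bx[2]), min(ax[3], bx[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    union = area(ax) + area(bx) - inter
    return inter / union if union else 0.0

def grits_score(aligned_pairs, n_pred, n_gt, f) -> float:
    # F-score form from the GriTS paper: 2 * sum(f) / (|pred| + |gt|).
    return 2 * sum(f(a, b) for a, b in aligned_pairs) / (n_pred + n_gt)
```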
Surveys
| Date | Title | Venue | Key Contribution | Notes |
|---|---|---|---|---|
| 2024 | A Survey for Table Recognition Based on Deep Learning (Yu et al.) | Neurocomputing 2024 | Reviews TD and TSR methods through ~2024; covers DL-based methods from object detection through transformer paradigms; updated dataset and benchmark comparison tables | DOI. Neurocomputing vol. 600, article 128154. Xidian University. No arXiv preprint. Post-Kasem coverage extension; provides updated taxonomy of methods and dataset landscape through the large-scale transformer era. |
| 2022-11 | Deep Learning for TD and TSR: A Survey (Kasem et al.) | ACM CSUR 2024 | Reviews 19 datasets, heuristic/ML/DL taxonomy, comparative TNCR experiments across HRNet, ResNeSt, and Dynamic R-CNN backbones | Notes. arXiv 2211.08469. GitHub repo for ongoing tracking. |
Comparative Studies
Cross-architecture evaluations that benchmark models from multiple paradigms on the same datasets.
| Year | Paper | Models Compared | Datasets | Key Finding | Notes |
|---|---|---|---|---|---|
| 2022 | Kasem et al. (Deep Learning for TD and TSR: A Survey, ACM CSUR 2024) | HRNet, ResNeSt, Dynamic R-CNN backbones | TNCR + 18 datasets reviewed | Reviews heuristic/ML/DL taxonomy; comparative backbone experiments show Dynamic R-CNN competitive across TNCR table types | Notes. arXiv 2211.08469. |
To Investigate
Papers and resources identified as likely relevant but not yet fully reviewed.
TD: Methods
| Reference | Title | Why Relevant |
|---|---|---|
| Siddiqui et al., IEEE Access 2018 (10.1109/ACCESS.2018.2848541) | DeCNT: Deep Deformable CNN for Table Detection | Deformable convolution applied specifically to table detection; transfer learning across domains; surfaced via DocParser references. No arXiv. |
TSR: Methods
| Reference | Title | Why Relevant |
|---|---|---|
| Qasim et al., ICDAR 2019 | Rethinking Table Recognition Using Graph Neural Networks | GNN-based table structure parsing from rendered document images; direct TSR predecessor cited by DocParser. No arXiv found; DOI: 10.1109/ICDAR.2019.00028. |
Related Pages
- Document Layout Analysis: General-purpose layout detection models, many of which detect tables as one region class.
- Reading Order Prediction: Determining the logical reading sequence of detected regions, including table placement in document flow.
- OCR: Text recognition pipelines; MinerU integrates TableMaster and StructEqTable for table extraction.
- Document Understanding: End-to-end systems that combine detection, structure recognition, and content extraction.
- Tables: Understanding, Reasoning, and LLM-era Evaluation: Benchmarks and datasets for table QA, visual table reasoning, and LLM/VLM evaluation on table content. Out of scope for this page; tracked separately as a working investigation cluster.
ComFinTab: A Compound Financial Table Dataset for End-to-End Table Understanding
TL;DR
ComFinTab is a 10,000-image benchmark of financial document tables, of which over 70% are compound tables: tables that integrate multiple basic sub-tables with complex, non-uniform header structures. Alongside the dataset, the authors propose CTUNet, a multi-modal graph-based framework that jointly handles table structure recognition and a new unified “table item extraction” task defined over the dataset.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The headline contribution is the ComFinTab dataset and its associated annotation scheme. The paper devotes substantial space to collection methodology, annotation pipeline, and the formal task definition that the dataset enables. No existing public dataset provided compound-table coverage with both structure and understanding annotations for image-based tables.
Secondary: $\Psi_{\text{Method}}$: CTUNet is a non-trivial multi-modal framework combining Faster R-CNN cell detection, RoI-Align visual features, BERT text features, positional embeddings, and a graph attention network for joint cell classification and relation linking. It serves primarily as a strong baseline to validate the new task formulation rather than as a standalone architectural contribution.
What is the motivation?
Real-world financial documents routinely contain tables that go beyond the three basic forms (relational, entity, matrix) described in prior taxonomies. These “compound tables” integrate multiple basic sub-tables into a single grid, so cells in the same row or column may belong to entirely different semantic groups and cannot be correctly linked by simple header-chasing heuristics.
Existing image-based datasets such as PubTabNet (568k), FinTabNet (113k), and SciTSR (15k) contain fewer than 2% compound tables. Digital-format table understanding datasets (WebSheet, SAUS, DeEX, Spider, WikiSQL) use spreadsheet or database representations that carry color, font, and structural metadata unavailable in scanned images. The result is a gap: no public benchmark tests visual table understanding on the challenging compound tables that appear routinely in financial filings.
The paper identifies a secondary motivation: the proliferation of incompatible task formulations across prior table understanding work (table type classification, cell type classification, column type identification, entity linking, Table QA). The authors argue these tasks can largely be unified into a single “table item extraction” formulation, and use ComFinTab to instantiate and evaluate that formulation.
What is the novelty?
Dataset. ComFinTab contains 10,000 table images from public annual reports of companies listed on the Shanghai and Shenzhen Stock Exchanges: 6,000 Chinese-language and 4,000 English-language tables. Over 70% are compound tables. Annotations cover cell bounding boxes, row and column indices, text bounding boxes, text content, cell type (TH, LH, DA, OT), and cell linking (which data cells are linked to which header cells). This combination of structure and understanding annotations in an image-based, majority-compound-table dataset is the first of its kind.
Task formulation. The paper proposes a “table item extraction” task. Each data (DA) cell is the root of a tree: the left subtree stores the left-header (LH) hierarchy and the right subtree stores the top-header (TH) hierarchy, with virtual header nodes added when a sub-table lacks one axis of headers. This representation encodes cell type and cell linking jointly, and can losslessly express the outputs of prior cell-type classification and cell-linking tasks for basic tables.
Evaluation metric. The authors introduce Tree-F1-Score, which extends standard F1 with Tree-Edit-Distance-Based Similarity (TEDS) as a soft matching credit. Formally:
$$ \begin{aligned} \text{Tree-R} &= \frac{\sum_{t_i \in T_G} \text{TEDS}(t_i, t_i')}{\lvert T_G \rvert} \\ \text{Tree-P} &= \frac{\sum_{t_i \in T_P} \text{TEDS}(t_i, t_i')}{\lvert T_P \rvert} \end{aligned} $$
$$ \text{Tree-F1} = \frac{2 \times \text{Tree-R} \times \text{Tree-P}}{\text{Tree-R} + \text{Tree-P}} $$
where $T_P$ and $T_G$ are predicted and ground-truth tree sets, and $t_i'$ is the predicted tree sharing the same root DA cell as $t_i$. Items with no corresponding root in the counterpart set score TEDS = 0.
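A minimal sketch of Tree-F1 under these definitions, assuming a `teds(t1, t2)` callable and trees keyed by the id of their root DA cell; the names are illustrative, not from the paper's code:

```python
def tree_f1(pred_trees: dict, gt_trees: dict, teds) -> float:
    """Tree-F1 given {root_cell_id: tree} maps and a TEDS callable."""
    if not pred_trees or not gt_trees:
        return 0.0
    # Unmatched roots contribute TEDS = 0, as specified above.
    tree_r = sum(teds(gt_trees[r], pred_trees[r]) if r in pred_trees else 0.0
                 for r in gt_trees) / len(gt_trees)
    tree_p = sum(teds(pred_trees[r], gt_trees[r]) if r in gt_trees else 0.0
                 for r in pred_trees) / len(pred_trees)
    if tree_r + tree_p == 0:
        return 0.0
    return 2 * tree_r * tree_p / (tree_r + tree_p)
```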
CTUNet. The framework fuses three feature modalities per detected cell:
- Position feature: $PE_i = \text{embedding}(b_i)$ from the bounding box coordinates $b_i = (x_1, y_1, x_2, y_2)$, $PE_i \in \mathbb{R}^{d_F}$.
- Visual feature: RoI-Align crop from the Faster R-CNN feature pyramid, followed by convolutions and a linear projection, $V_i \in \mathbb{R}^{d_F}$.
- Textual feature: BERT (or BERT-Chinese) encoding of cell text content, $T_i \in \mathbb{R}^{d_F}$.
The combined cell representation is:
$$F_i = \text{LayerNorm}(PE_i + V_i + T_i)$$
A graph attention network (GAT) with a masked self-attention mechanism propagates features within each cell’s structural neighborhood (all cells in the same row or column). Enhanced node features $F’_i$ are computed as:
$$ \begin{aligned} e_{ij} &= \text{LeakyReLU}\left(w^{T} [W F_i \,\Vert\, W F_j]\right) \\ \alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k \in N_i} \exp(e_{ik})} \\ F_i' &= \sigma\left(\sum_{j \in N_i} \alpha_{ij} W F_j\right) \end{aligned} $$
Edge features are formed by pairwise concatenation $E_{i,j} = [F_i' \,\Vert\, F_j']$. Three branches are trained simultaneously: node classification (cell type, 4 classes), row relation linking (sigmoid binary), and column relation linking (sigmoid binary). The combined loss is:
$$L = L_{\text{det}} + \lambda_1 L_{\text{node}} + \lambda_2 (L_{\text{rlink}} + L_{\text{clink}})$$
where $L_{\text{det}}$ is the Faster R-CNN detection loss and the remaining terms are cross-entropy losses.
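For concreteness, a minimal PyTorch sketch of the masked graph-attention update above. This is one interpretation of the equations, not the authors' code: the class and argument names are mine, and $\sigma$ is taken to be a sigmoid:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedGATLayer(nn.Module):
    def __init__(self, d_f: int):
        super().__init__()
        self.W = nn.Linear(d_f, d_f, bias=False)    # shared projection W
        self.a = nn.Linear(2 * d_f, 1, bias=False)  # attention vector w

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (n, d_f) fused cell features F_i = LayerNorm(PE + V + T)
        # adj:   (n, n) bool mask; True where cells i, j share a row or
        #        column (N_i); assumed to include the diagonal so every
        #        cell has at least one neighbor.
        h = self.W(feats)
        n = h.size(0)
        # Pairwise concatenation [W F_i || W F_j] for all (i, j).
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pair).squeeze(-1))  # scores e_ij
        e = e.masked_fill(~adj, float("-inf"))      # restrict to N_i
        alpha = torch.softmax(e, dim=-1)            # attention alpha_ij
        return torch.sigmoid(alpha @ h)             # enhanced features F'_i
```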
What experiments were performed?
The paper evaluates on ComFinTab-Chinese (4,500 train / 1,500 test) and ComFinTab-English (3,200 train / 800 test), split at company level to prevent format leakage between splits.
Because no prior method directly addresses the table item extraction task, the authors construct two comparison settings:
Cell Classification + Rules. TUTA (a table-pretrained language model) and LayoutLMv2 (a multi-modal document model) are fine-tuned for cell type classification, then handcrafted adjacency rules link headers to data cells. TUTA is English-only; LayoutLMv2 is tested on both splits.
End-to-End Relation Extraction. The task is recast as triple extraction: ('DA 1', 'Left-Linking', 'LH 1'), etc. AGCCN (an NLP relation extraction GCN) is adapted and augmented with the same multi-modal features used in CTUNet.
All baselines and CTUNet use the same OCR source: PaddleOCR for Chinese, Tesseract for English. A ground-truth oracle variant (GT) is also reported to isolate table understanding performance from OCR errors.
Table structure recognition quality (TEDS) is reported as a reference: CTUNet achieves 98.99 on English and 98.83 on Chinese, reflecting that bordered financial tables are relatively straightforward structurally once cell detection is accurate.
Ablations test (a) the contribution of each modality (position, visual, semantic) and (b) neighborhood definitions in the GAT (none, 1-step, all-pairs, row/column). The row/column neighborhood setting outperforms all others on both splits.
What are the outcomes/conclusions?
On the table item extraction task (Tree-F1), CTUNet substantially outperforms rule-based approaches and outperforms the end-to-end AGCCN baseline on English (89.20 vs. 84.67), matching it on Chinese:
| Method | English Tree-F1 | Chinese Tree-F1 |
|---|---|---|
| TUTA + rule | 72.91 | 71.69 |
| LayoutLMv2 + rule | 72.05 | 84.14 |
| AGCCN | 84.67 | 86.98 |
| CTUNet (ours) | 89.20 | 86.98 |
| CTUNet (GT) | 90.37 | 88.90 |
Rule-based methods struggle specifically on compound tables: cell classification accuracy is adequate (91-92%), but heuristic header-to-data matching fails when the same row or column spans multiple logically independent sub-tables.
Ablation results confirm that all three modalities contribute. Removing position features causes the largest drop (Tree-F1 falls to roughly 31-38%), reflecting the centrality of spatial layout in table reasoning. Visual and textual features each add 5-6 points independently; together they further improve over any single-modality baseline.
The GAT neighborhood ablation shows that using row/column neighbors (the structural prior) outperforms both 1-step spatial neighbors and all-pairs connections, suggesting that long-range row/column context is genuinely useful and that the structural graph provides meaningful inductive bias.
Limitations noted or apparent: The model relies on an offline OCR engine, so OCR errors propagate into the understanding stage (the GT oracle consistently scores 1-2 points higher). The dataset is restricted to bordered, horizontally-oriented tables extracted from PDF annual reports, which is a narrower distribution than general document tables. The annotation pipeline used heuristic pseudo-labels for cell types (subsequently manually corrected), and heuristic linking rules (also manually cleaned), leaving some residual annotation noise. The authors do not report inter-annotator agreement statistics.
Reproducibility
Models
CTUNet uses a ResNet-50 backbone with an FPN neck, instantiated within a Faster R-CNN cell detector. RoI-Align crops from the FPN map feed a small CNN stack followed by a linear projection to produce visual features. BERT-base (multilingual or Chinese variant) produces textual features. A single-layer GAT with masked self-attention constitutes the relational graph construction module. The paper does not specify the feature dimension $d_F$ explicitly; the re-implementation in DAVAR-Lab-OCR targets the same architecture. Pre-trained model weights (gated download; access codes in the DAVAR-Lab-OCR README) are provided separately for Chinese and English variants; re-implementation numbers differ slightly from paper results (Chinese: 93.59 vs. 91.78 Cell-F1; English: 92.75 vs. 92.27).
Algorithms
Training uses SGD with momentum 0.9, weight decay $1 \times 10^{-4}$, initial learning rate $1 \times 10^{-3}$, divided by 10 every 20 epochs, batch size 4. OCR is handled offline by PaddleOCR (Chinese) or Tesseract (English) and is not part of the end-to-end training graph. Loss weights $\lambda_1$ and $\lambda_2$ are not stated explicitly in the paper. All experiments run on 8 Tesla V100 GPUs.
Data
ComFinTab contains 10,000 table images from public Chinese-listed company annual reports (Shanghai and Shenzhen Stock Exchanges). Split by language: 6,000 Chinese, 4,000 English. Train/test splits are 4,500/1,500 (Chinese) and 3,200/800 (English), partitioned at the company level.
Annotation pipeline: (1) bordered table detection and cell-line crossing to extract cells; (2) LGPMA cell-matching strategy for row/column indices; (3) PDFPlumber for text locations and content; (4) color-based auto-labeling for cell types, refined by a text classifier for pseudo-labels, then manual correction; (5) heuristic linking rules, manually cleaned.
The dataset is publicly available via a gated application process (application form required; email to the corresponding author). License: CC-BY-NC-SA-4.0 (non-commercial only). See ComFinTab dataset page.
Evaluation
Primary metric: Tree-F1-Score (TEDS-weighted precision and recall over tree representations of table items). Secondary metric: Macro-F1 for cell type classification. Table structure quality is also reported via TEDS (the standard TSR metric from PubTabNet). All baselines and CTUNet use the same OCR pipelines. Ground-truth oracle variants are reported to isolate understanding from OCR quality. No error bars, confidence intervals, or multi-seed averages are reported.
Hardware
All experiments use 8 Tesla V100 GPUs. Total training time and GPU-hours are not reported.
BibTeX
@inproceedings{li2022comfintab,
author = {Zaisheng Li and Pengfei Li and Shiliang Pu and Yi Li and Zhanzhan Cheng and Xi Li and Qiao Liang and Yi Niu},
title = {End-to-End Compound Table Understanding with Multi-Modal Modeling},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia (MM '22)},
year = {2022},
pages = {4112--4121},
doi = {10.1145/3503161.3547885},
publisher = {ACM}
}
ICDAR 2021 SLP Task B: Table Recognition on PubTabNet
TL;DR
ICDAR 2021 SLP Task B is the competition track for full table recognition (structure plus cell content) on PubTabNet. With 30 teams submitting in the final phase, the top systems reached 96.36 TEDS overall (97.88 on simple tables, 94.78 on complex), well above the 91 TEDS baseline reported with PubTabNet’s original EDD model. Two-stage pipelines (cell detection then structure inference) dominated over end-to-end sequence methods.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is a competition benchmark event. The paper documents the evaluation protocol, participation statistics, leaderboard results, and system descriptions. No new model or dataset is introduced; PubTabNet (v2.0.0) provides the underlying data.
Secondary: none.
What is the motivation?
Prior table recognition competitions (ICDAR 2013, ICDAR 2019) were small (156 and 3,600 samples respectively) and did not require systems to recover cell content, only cell bounding boxes or adjacency relations. Task B raised the bar: participants must produce a complete HTML representation of each table image, including both structure (row/column spans) and the text content of every cell. No intermediate cell-position annotations are provided during evaluation, pushing systems toward truly end-to-end solutions.
The competition is part of a two-task event (ICDAR2021-SLP). Task A, which addresses document layout recognition on PubLayNet, is covered in the layout notes.
What is the novelty?
The competition establishes a documented, large-scale snapshot of the state of full table recognition in 2021, with TEDS as the community-standard metric. The system descriptions from the top teams reveal the architectural patterns that worked best at scale, serving as a useful reference for understanding the field’s trajectory toward later models like TableFormer, TRUST, and UniTable.
What experiments were performed?
Data
Task B used PubTabNet v2.0.0. The final evaluation set (9,064 images) was released three days before the competition deadline. Participants had access to training (500,777 samples) and development (9,115 samples) sets throughout.
| Split | Size | Phase |
|---|---|---|
| Training | 500,777 | N/A |
| Development | 9,115 | N/A |
| Mini development | 20 | Format verification |
| Test | 9,138 | Development phase |
| Final evaluation | 9,064 | Final evaluation phase |
Metric
TEDS (Tree-Edit-Distance-based Similarity), the metric introduced with PubTabNet. It computes normalized tree edit distance between the predicted and ground-truth HTML table trees, including cell content:
$$\text{TEDS}(T_a, T_b) = 1 - \frac{\text{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$$
Results are decomposed into Simple and Complex subsets following PubTabNet’s original split.
Known evaluation bug: Bold tags (`<b>`) were inadvertently excluded from scoring in the final evaluation phase due to a data preparation issue. All reported numbers are therefore slightly inflated for tables containing bold text. This affects all teams equally, so relative rankings hold, but absolute TEDS numbers are not directly comparable to evaluations that include bold tags.
Results
Top 9 teams on the Final Evaluation Phase leaderboard:
| Team | TEDS Simple | TEDS Complex | TEDS All |
|---|---|---|---|
| Davar-Lab-OCR (Hikvision) | 97.88 | 94.78 | 96.36 |
| VCGroup | 97.90 | 94.68 | 96.32 |
| XM | 97.60 | 94.89 | 96.27 |
| YG | 97.38 | 94.79 | 96.11 |
| DBJ | 97.39 | 93.87 | 95.66 |
| TAL | 97.30 | 93.93 | 95.65 |
| PaodingAI | 97.35 | 93.79 | 95.61 |
| anyone | 96.95 | 93.43 | 95.23 |
| LTIAYN | 97.18 | 92.40 | 94.84 |
Prior state of the art (EDD model, reported with PubTabNet): 91 TEDS.
What are the outcomes/conclusions?
Key findings:
- Complex tables are consistently 3-4 points harder than simple ones across all teams. The gap is stable, suggesting it reflects genuine structural difficulty rather than a calibration artifact.
- Two-stage pipelines dominated: detect cells (Mask R-CNN variants, often HRNet backbone with pyramid mask supervision), then infer structure by horizontally/vertically aligning bounding boxes via maximum clique search or similar combinatorial post-processing. This approach cleanly separates cell localization from topology recovery.
- One team (VCGroup) decomposed the task into four sub-tasks: table structure recognition, text-line detection, text-line recognition, and box assignment. Their structure recognizer was based on MASTER (a scene text recognition model), demonstrating that sequence models from adjacent domains transferred well.
- A pure sequence approach (Kaen Context, Kakao Enterprise) using a 12-layer linear-attention transformer operating on flattened image patches also competed, anticipating the Im2Seq direction later pursued by TableFormer and UniTable.
- Performance improvement over EDD (91 TEDS) is substantial (~5 points), driven by larger backbones, ensemble strategies, and better OCR modules.
Limitations:
- The `<b>` bold tag bug means all reported TEDS numbers are slightly inflated. The magnitude of the effect depends on how bold-heavy the evaluation tables are; the authors acknowledge it but do not quantify the impact.
- Only 30 teams participated in the final phase, far fewer than Task A (78 teams). Table recognition attracted less participation, likely due to higher implementation complexity.
- The competition uses PubTabNet, which is scientific literature only. Transferability to financial tables (FinTabNet) or wild tables (WTW) is not evaluated.
- No source code was released by competition organizers; most system descriptions are high-level.
Reproducibility
Models
No trained weights were released by the competition organizers or most participating teams. Davar-Lab-OCR referenced their lab website and later released code via DAVAR-Lab-OCR GitHub; VCGroup released MASTER-pytorch (the backbone of their structure recognizer).
Algorithms
Top two-stage systems: Cascade Mask R-CNN with HRNet-W48 backbone for cell detection (with pyramid mask supervision for row/column-aligned bounding boxes), followed by structure inference via alignment overlap analysis and maximum clique search for row/column index assignment. OCR was handled by a separate single-line text detection and recognition model. Optimizer, learning rate schedule, batch size, and other training hyperparameters are not reported in the competition overview paper; individual teams did not provide full training details in their system descriptions.
Data
PubTabNet v2.0.0 is publicly available. Annotations are under CDLA-Permissive-1.0; underlying PMCOA images have mixed per-article licenses. See PubTabNet notes for full dataset details.
Evaluation
TEDS computation code is available in the PubTabNet repository. Competition evaluation harness at ICDAR2021-SLP GitHub (Apache-2.0). The online leaderboard is no longer active.
Statistical rigor: single-run competition results; no error bars, confidence intervals, or multi-seed averages are reported. Rankings should be interpreted with caution given the small score differences between top teams (e.g., rank 1 vs. rank 2 differ by 0.04 TEDS overall).
Hardware
Not reported. Individual teams used V100-class GPUs; no aggregate compute figures available.
BibTeX
@inproceedings{yepes2021icdar,
title={ICDAR 2021 Competition on Scientific Literature Parsing},
author={Jimeno Yepes, Antonio and Zhong, Peter and Burdick, Douglas},
booktitle={Document Analysis and Recognition -- ICDAR 2021},
year={2021},
publisher={Springer},
doi={10.1007/978-3-030-86337-1_40}
}
Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation
TL;DR
Horn and Keuper evaluate 21 contemporary PDF parsers on table extraction using 100 synthetically generated documents with exact LaTeX ground truth. Their central finding is that LLM-as-a-judge evaluation correlates substantially better with human judgment (Pearson $r = 0.93$ for Gemini-3-Flash-Preview) than rule-based metrics such as TEDS ($r = 0.68$) or GriTS ($r = 0.70$). Among the 21 parsers tested, the Gemini 3 family achieves the highest scores, while rule-based tools lag far behind all learning-based approaches, and parser choice can mean the difference between near-perfect and nearly unusable table extraction.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$ The paper’s primary contribution is a new evaluation methodology for table extraction. It proposes LLM-as-a-judge as a replacement for rule-based metrics, validates that methodology against over 1,500 human ratings, and applies it to benchmark 21 parsers. The headline results concern measurement reliability and parser rankings, not a new model architecture.
Secondary: $\Psi_{\text{Systematic}}$ The work also surveys the landscape of existing table extraction metrics and document-level benchmarks, systematically characterizing the failure modes of TEDS, GriTS, and SCORE with concrete examples from real parser outputs.
What is the motivation?
Accurate table extraction from PDFs is increasingly important for scientific data mining, language model pretraining, and retrieval-augmented generation pipelines. Yet the dominant evaluation metrics, TEDS and GriTS, operate purely at the syntactic level. Both metrics compare structure and cell content as strings, which means they penalize semantically harmless representational differences (format conversions, symbol encoding variants, value equivalences like “85.0%” vs. “85%”) while being nearly insensitive to small but meaning-altering errors (a lost decimal point, a sign flip).
This creates a measurement gap: a parser that produces structurally different but semantically equivalent tables is unfairly penalized, while one that preserves structure but corrupts a handful of cell values receives an inflated score. Existing document-level benchmarks such as OmniDocBench and READoc adopt the same metrics without addressing this flaw. Separately, most benchmarks rely on manually annotated ground truth, which is expensive and limits scale and reproducibility.
What is the novelty?
The paper makes three interconnected contributions.
LLM-based semantic evaluation. The authors propose using an LLM as a judge to evaluate table extraction quality. Given a ground truth table and its parsed counterpart, the model assigns scores on a 0 to 10 scale for content accuracy and structural preservation, defined as whether every cell value can be unambiguously mapped to its row and column headers. This framing captures semantic equivalence that string-level comparison cannot.
Synthetic benchmark with exact ground truth. Rather than relying on manual annotation, the authors construct 100 benchmark pages by embedding real LaTeX tables sourced from arXiv papers (published December 2025, to avoid overlap with parser training data) into synthetically generated PDFs. Each page samples a random layout configuration (document class, font, margins, column layout) and fills available space with filler text and tables, compiled with pdflatex. The LaTeX source serves as exact ground truth at zero annotation cost. Tables are classified into three structural complexity tiers by an LLM classifier: simple (regular grid), moderate (limited cell merging), and complex (multi-dimensional merging, nested structures).
LLM-based table matching pipeline. Because parsers produce tables in diverse formats (HTML, Markdown, LaTeX, plain text) and may split, merge, or reorder content, aligning parser output to ground truth tables is non-trivial. The authors address this with a Gemini-3-Flash-Preview matching step that identifies and extracts the parsed counterpart of each ground truth table from the full parser output, followed by rule-based post-validation to correct minor LLM artifacts.
What experiments were performed?
Meta-evaluation of metrics. To compare automated metrics against human judgment, the authors collected 1,554 human ratings across 518 ground truth/parser output table pairs, sampled from diverse parsers, output formats, and table complexities. Three independent evaluators each rated all 518 pairs on a 0 to 10 scale. Inter-annotator agreement is reported as Krippendorff’s $\alpha = 0.77$ (interval), with average pairwise Pearson correlation $r = 0.85$ and a human ceiling of $r = 0.89$ (leave-one-out). To help surface subtle discrepancies, evaluators were shown LLM-generated hints identifying potential differences, though final scores remained entirely human decisions.
Four LLM judges were evaluated: DeepSeek-v3.2, GPT-5-mini, Gemini-3-Flash-Preview, and Claude Opus 4.6. All rule-based metrics (TEDS, GriTS variants, SCORE variants) were rescaled to a 0 to 10 range for comparability.
Correlations with mean human scores across 518 pairs:
| Metric | Type | Pearson $r$ | Spearman $\rho$ | Kendall $\tau$ |
|---|---|---|---|---|
| TEDS | Rule-based | 0.684 | 0.717 | 0.557 |
| GriTS-Avg | Rule-based | 0.698 | 0.763 | 0.604 |
| SCORE-Avg | Rule-based | 0.637 | 0.684 | 0.539 |
| DeepSeek-v3.2 | LLM | 0.802 | 0.827 | 0.713 |
| GPT-5-mini | LLM | 0.888 | 0.827 | 0.739 |
| Gemini-3-Flash-Preview | LLM | 0.927 | 0.889 | 0.799 |
| Claude Opus 4.6$^\dagger$ | LLM | 0.939 | 0.890 | 0.804 |
$^\dagger$ Also used to generate error hints shown to evaluators; correlation may be inflated.
The authors note that Claude Opus 4.6 also generated the error hints shown to human evaluators, which may inflate its correlation. Gemini-3-Flash-Preview ($r = 0.927$) had no role in the annotation process, making it the recommended judge for the subsequent parser benchmark.
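A minimal sketch of this correlation computation with scipy, on synthetic stand-in data; the authors' actual pipeline lives in the table-metric-study repository:

```python
# Meta-evaluation sketch: correlate a candidate metric's scores with mean
# human ratings over the 518 pairs. Data below is synthetic, for shape only.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

rng = np.random.default_rng(0)
human = rng.uniform(0, 10, 518)                           # stand-in: mean human score per pair
metric = np.clip(human + rng.normal(0, 1.5, 518), 0, 10)  # stand-in: metric rescaled to 0-10

r, _ = pearsonr(metric, human)
rho, _ = spearmanr(metric, human)
tau, _ = kendalltau(metric, human)
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  Kendall tau={tau:.3f}")
```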
Parser benchmark. Using the validated Gemini-3-Flash-Preview judge, the authors evaluated 21 parsers on 451 tables from the 100 synthetic pages, reporting mean LLM score (0 to 10), per-complexity breakdowns, TEDS for comparison, and inference cost or wall-clock time on an NVIDIA RTX 4090. The parser set spans a broad spectrum: specialized OCR VLMs (LightOnOCR-2-1B, dots.ocr, MonkeyOCR-3B, GOT-OCR2.0, Chandra, olmOCR-2-7B, Nanonets-OCR-s, DeepSeek-OCR, MinerU2.5), commercial API services (Mathpix, Mistral OCR 3), general-purpose multimodal models prompted for Markdown output (Gemini 3 Pro, Gemini 3 Flash, Gemini 2.5 Flash, Qwen3-VL-235B, GLM-4.5V, GPT-5 mini, GPT-5 nano, Claude Sonnet 4.6), and rule-based tools (PyMuPDF4LLM, GROBID).
What are the outcomes/conclusions?
LLM evaluation vs. rule-based metrics. All rule-based metrics cluster in a narrow band of $r = 0.56$ to $0.70$ with human judgment. Even the weakest LLM judge (DeepSeek-v3.2, $r = 0.80$) surpasses the best rule-based metric. The authors argue this confirms that semantic assessment captures dimensions of table quality that string-level comparison systematically misses. TEDS scores for the 21 parsers cluster within 22% of its scale (0.66 to 0.88), while LLM scores span 38% (5.75 to 9.55), suggesting the former paints a misleadingly uniform picture of parser quality.
Parser rankings. Overall LLM scores range from 2.10 (GROBID) to 9.55 (Gemini 3 Pro). Selected results:
| Parser | LLM Score | Simple | Moderate | Complex | Cost/Time |
|---|---|---|---|---|---|
| Gemini 3 Pro | 9.55 | 9.58 | 9.57 | 9.49 | $10.00 API |
| Gemini 3 Flash | 9.50 | 9.53 | 9.38 | 9.61 | $0.57 API |
| LightOnOCR-2-1B | 9.08 | 9.41 | 8.90 | 8.91 | 30 min GPU |
| Mistral OCR 3 | 8.89 | 8.92 | 8.69 | 9.07 | $0.20 API |
| dots.ocr | 8.73 | 9.01 | 8.43 | 8.76 | 20 min GPU |
| MinerU2.5 | 6.49 | 7.07 | 6.03 | 6.35 | free-tier API |
| PyMuPDF4LLM | 5.25 | 6.78 | 4.86 | 3.91 | 30 s CPU |
| GROBID | 2.10 | 2.27 | 1.94 | 2.09 | 2 min CPU |
The top two systems are general-purpose multimodal models (Gemini 3), not dedicated OCR tools, suggesting broad visual-linguistic capabilities transfer well to table extraction. However, LightOnOCR-2-1B achieves 9.08 with only 1B parameters, running on a single consumer GPU within 30 minutes, narrowing the gap between proprietary APIs and self-hosted pipelines for privacy-constrained deployments.
Complexity sensitivity. Table complexity affects parsers unevenly. Gemini 3 Flash actually scores slightly higher on complex tables than on simple ones, while GLM-4.5V drops 2.19 points from simple to complex, Qwen3-VL-235B drops 1.56 points, and Mathpix drops 1.55 points. The authors identify handling multi-dimensional cell merging as a key differentiator.
Score distributions. Per-parser histograms reveal bimodal failure patterns that mean scores obscure. Top parsers (Gemini 3, LightOnOCR) concentrate more than 70% of tables at score 10. Claude Sonnet 4.6 and olmOCR-2-7B show strongly bimodal distributions: they frequently omit tables entirely (score 0) but extract near-perfectly when they succeed. Mid-tier parsers such as GPT-5 mini and Gemini 2.5 Flash produce broad distributions centered around scores 5 to 8, indicating pervasive partial errors. The authors note that, depending on the application, a missed table may be preferable to a corrupted one, making bimodal parsers more useful than uniformly mediocre ones in certain contexts.
Limitations. The authors acknowledge several scope restrictions. Synthetic PDFs do not capture the full diversity of real-world documents (scanned pages, non-standard layouts). The table source is exclusively arXiv, which may bias toward scientific formats and leave financial or medical table styles unrepresented. LLM-as-a-judge is not infallible and depends on proprietary models, though evaluation costs are described as modest (approximately $0.20 to score all 451 tables, roughly $1 per parser for a full benchmark run). The authors do not report statistical significance for parser score differences, which may matter for closely ranked systems.
Reproducibility
Models
The paper does not introduce a new model. The judge model used for the parser benchmark is Gemini-3-Flash-Preview (Google DeepMind), accessed via API. The matching pipeline also uses Gemini-3-Flash-Preview. For the human evaluation study, Claude Opus 4.6 was used to generate discrepancy hints shown to raters. No model weights are released as part of this work.
Algorithms
The benchmark pipeline proceeds as follows:
- LaTeX tables are scraped from arXiv papers (December 2025), cleaned, and classified into simple/moderate/complex tiers by an LLM classifier.
- Synthetic PDF pages are generated by sampling random layout configurations and filling pages with filler text and tables, compiled with pdflatex (see the sketch after this list).
- Each of the 21 parsers processes the 100 PDF pages.
- A Gemini-3-Flash-Preview matching step identifies the parsed counterpart for each ground truth table in the parser output, followed by rule-based post-validation.
- Matched pairs are scored by the LLM judge on a 0 to 10 scale for content accuracy and structural preservation.
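A hedged sketch of the page-generation step referenced above. The layout knobs and document skeleton are assumptions; the authors' actual generator is in the pdf-parse-bench repository. Requires a local pdflatex install:

```python
# Sketch: embed a scraped LaTeX table in a randomized layout and compile.
import random
import subprocess
import pathlib

def make_page(table_tex: str, out: str = "page") -> None:
    margin = random.choice(["2cm", "2.5cm", "3cm"])   # assumed layout knob
    doc = f"""\\documentclass{{article}}
\\usepackage[margin={margin}]{{geometry}}
\\usepackage{{booktabs}}
\\begin{{document}}
Filler text before the table.

{table_tex}

Filler text after the table.
\\end{{document}}"""
    pathlib.Path(f"{out}.tex").write_text(doc)
    subprocess.run(["pdflatex", "-interaction=nonstopmode", f"{out}.tex"],
                   check=True)

make_page("\\begin{tabular}{lr}\\toprule a & 1 \\\\ \\bottomrule\\end{tabular}")
```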
The evaluation pipeline, parser configurations, exact prompts, and software versions used to produce the leaderboard are available in the pdf-parse-bench repository. The metric meta-evaluation code and human study data are in the table-metric-study repository.
Data
The benchmark consists of 100 synthetically generated PDF pages containing 451 tables derived from arXiv papers published in December 2025. Ground truth is the original LaTeX source, requiring no manual annotation. The dataset is released as part of the pdf-parse-bench repository and on Hugging Face (piushorn/pdf-parse-bench) under the MIT license.
For the human evaluation study, 518 ground truth/parser output table pairs were rated by three independent evaluators, yielding 1,554 total ratings. The human study data is available in the table-metric-study repository.
Evaluation
The primary metric is the LLM judge score (0 to 10) using Gemini-3-Flash-Preview, selected over Claude Opus 4.6 for cost efficiency and lack of confounding with the human annotation process. TEDS (0 to 1 scale) is reported alongside for comparison purposes.
Rule-based baselines evaluated in the meta-study: TEDS, GriTS$_{\text{Top}}$, GriTS$_{\text{Cont}}$, GriTS-Avg, SCORE Index, SCORE Content, SCORE-Avg.
LLM judges evaluated in the meta-study: DeepSeek-v3.2, GPT-5-mini, Gemini-3-Flash-Preview, Claude Opus 4.6.
Human reference scores were collected from three evaluators rating 518 pairs, with Krippendorff’s $\alpha = 0.77$ and average pairwise Pearson $r = 0.85$.
Hardware
Parser benchmarking was run on a single NVIDIA RTX 4090 for GPU-based models, or via API for cloud-hosted models. API pricing in USD is reported at the time of writing. No training hardware is required as the benchmark evaluates existing parsers.
BibTeX
@article{horn2026benchmarking,
title={Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
author={Horn, Pius and Keuper, Janis},
journal={arXiv preprint arXiv:2603.18652},
year={2026}
}
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
TL;DR
CC-OCR is a four-track OCR evaluation benchmark from Alibaba Group and South China University of Technology covering 39 subsets, 7,058 fully annotated images, and 10 languages. It systematically evaluates large multimodal models (LMMs) on text reading, multilingual OCR, document parsing (including tables and formulas), and key information extraction, revealing specific weaknesses in text grounding, multi-orientation robustness, and structured output generation.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$: the primary contribution is a new benchmark and evaluation protocol. The paper introduces CC-OCR as a measurement framework for assessing LMM literacy capabilities across multiple tasks, scenarios, and languages. The headline novelty is “how we measure” these models, and the paper’s center of gravity lies in the benchmark design, annotation methodology, and systematic model evaluation rather than in proposing a new model or training recipe.
Secondary: $\Psi_{\text{Resource}}$: the benchmark itself represents a reusable evaluation asset: 7,058 annotated images spanning newly collected and re-annotated data, including the SOLD dataset (re-annotated open-category KIE data), all released to the research community.
What is the motivation?
Large multimodal models have demonstrated strong general visual understanding, but their performance on OCR-centric tasks that require fine-grained text perception, structured output generation, and multilingual reading is less well understood. Existing benchmarks for evaluating LMMs on OCR tasks suffer from several limitations: narrow scope (focusing on only one subtask such as line-level text recognition or formula detection), restricted language coverage (many are English-only or at best English and Chinese), and inconsistent annotation standards that make cross-model comparison difficult.
Benchmarks such as OCRBench emphasize line-level recognition, while FOX and DocLocal4K focus on document images but omit natural scenes and are restricted to one or two languages. KOSMOS2.5-Eval covers parsing but not key information extraction. The gap is a comprehensive benchmark that can simultaneously assess all major OCR-relevant capabilities of general-purpose LMMs. CC-OCR is constructed to fill that gap.
What is the novelty?
The main contribution is the CC-OCR benchmark itself, characterized by three design principles: diversity of scenarios, practicality through use of real-world captured images, and deliberate challenge construction targeting known weak points of LMMs.
Four tracks: The benchmark organizes evaluation into (1) multi-scene text reading (natural scenes, documents, web/UGC images in English and Chinese); (2) multilingual text reading across ten languages from four language families; (3) document parsing covering full-page document structuring, table recognition, and formula recognition; and (4) key information extraction in both constrained-category and open-category settings.
Challenge dimensions are orthogonal to the track structure and include orientation-sensitivity (multi-directional and curved text), grounding (predicting word-level bounding boxes), natural noise, artistic/stylized text, multilingual diversity, mathematical and chemical formula expressions, and structured output formats (LaTeX, HTML, JSON, SMILES).
SOLD dataset: A notable sub-contribution is SOLD (Structurally-rich Open Layout Dataset), created by re-annotating two existing open-category KIE datasets (SIBR and HUST-CELL) with a bottom-up annotation scheme that produces end-to-end JSON representations, including hierarchical structures and table-format key-value groups. The original link-based annotation format of those datasets was unsuitable for directly evaluating LMMs.
Evaluation metrics are tailored per track. For OCR text sequences, the authors use a full-text multi-set matching metric (Eval-Trans) that accounts for the fact that LMMs may predict text in different orders:
$$ \text{Recall} = \frac{\sum_{i}^{N} \min(c_i, c_i')}{\sum_{i}^{N} c_i}, \quad \text{Precision} = \frac{\sum_{i}^{N} \min(c_i, c_i')}{\sum_{i}^{N} c_i'} $$
where $u_i$ is a basic unit (character for CJK scripts, word for Latin/Cyrillic/Arabic), $c_i$ is its count in the ground-truth sequence, and $c_i'$ is its count in the predicted sequence. F1 is computed from recall and precision.
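A minimal sketch of this multi-set matching with collections.Counter; the function and variable names are mine, not from the CC-OCR release:

```python
from collections import Counter

def eval_trans_f1(gt_units: list[str], pred_units: list[str]) -> float:
    """Order-free F1 over unit counts (chars for CJK, words otherwise)."""
    if not gt_units or not pred_units:
        return 0.0
    gt, pred = Counter(gt_units), Counter(pred_units)
    overlap = sum((gt & pred).values())  # sum_i min(c_i, c'_i)
    if overlap == 0:
        return 0.0
    recall = overlap / sum(gt.values())
    precision = overlap / sum(pred.values())
    return 2 * recall * precision / (recall + precision)

# e.g. word-level units for an English sample:
print(eval_trans_f1("the cat sat".split(), "cat sat the sat".split()))
```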
For document content structuring and formula recognition, normalized edit distance (NED) is used:
$$ \text{NED} = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \frac{\text{EditDist}(P_i, G_i)}{\max(\text{len}(P_i), \text{len}(G_i))}\right) $$
For table parsing, the benchmark adopts Tree Edit Distance-based Similarity (TEDS), which accounts for both structural similarity and cell content accuracy:
$$ \text{TEDS}(T_{pred}, T_{gt}) = 1 - \frac{\text{EditDist}(T_{pred}, T_{gt})}{\max(|T_{pred}|, |T_{gt}|)} $$
For key information extraction, field-level F1 score is used, where a key-value pair is treated as a single field and any character mismatch counts as a failed extraction.
The paper also introduces a repetition ratio $R_{rep}$ to quantify hallucination in the form of repeated output, defined as the fraction of images for which the length of the longest continuous repetitive string exceeds 25% of the total output length.
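A simplified sketch of $R_{rep}$'s per-image test under one reading of this definition, operating at the character level with the $T_{rep} = 5$ run threshold reported in the Evaluation section below; this is my interpretation, not the released scoring code:

```python
# Flag an output if its longest span covered by a consecutively repeated
# unit exceeds 25% of the total output length (brute force, character level).
def longest_repeat_span(text: str, t_rep: int = 5) -> int:
    best, n = 0, len(text)
    for period in range(1, n // (t_rep + 1) + 1):
        i = 0
        while i + period <= n:
            k = 1  # count consecutive copies of text[i:i+period]
            while text[i + k * period : i + (k + 1) * period] == text[i : i + period]:
                k += 1
            if k > t_rep:
                best = max(best, k * period)
            i += 1 if k == 1 else k * period
    return best

def is_repetitive(text: str) -> bool:
    return bool(text) and longest_repeat_span(text) > 0.25 * len(text)

print(is_repetitive("abc " * 40 + "normal tail"))  # True: long repeated run
```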
What experiments were performed?
Nine LMMs are evaluated: five generalist models (GPT-4o-2024-08-06, Gemini-1.5-Pro-002, Claude-3.5-Sonnet-20241022, Qwen2-VL-72B, InternVL2-76B) and four specialist models (KOSMOS2.5, TextMonkey, Florence-2, GOT). Specialist models are those trained specifically for OCR or document understanding tasks.
The evaluation covers all four tracks. For multi-scene OCR, each model is prompted to output the full text content of the image. For multilingual OCR, prompts are in the target language. For document parsing, models are prompted to produce structured output (LaTeX for documents and formulas, HTML for tables, SMILES for molecular formulas). For KIE, models must fill in a provided JSON schema.
Three additional analyses probe specific challenge dimensions: text grounding (word-level bounding box prediction on English multi-scene subsets), multi-orientation robustness (rotating multi-scene images by 0, 90, 180, and 270 degrees), and repetition hallucination.
What are the outcomes/conclusions?
Overall ranking: Gemini-1.5-Pro achieves the highest average score (73.0), ranking first in three of four tracks. Qwen2-VL-72B ranks second overall (68.7) and first in the KIE track. Generalist LMMs consistently outperform specialist models, which the authors attribute to their larger parameter counts and broader training data.
Multi-scene OCR: Gemini-1.5-Pro scores 83.25, followed by Qwen2-VL-72B at 77.95. Performance in natural scenes is approximately 15 percentage points lower than in document scenes for most models, suggesting that text in natural environments remains a more difficult regime.
Multilingual OCR: The performance gap across languages is substantial. Asian languages (Japanese, Korean, Vietnamese) and Arabic show the weakest results across most models. Japanese scores are particularly low, which the authors attribute to the prevalence of vertical text in their collected images. Latin-family languages are generally easier, though German and Italian show lower scores, possibly due to special characters.
Document parsing: Gemini-1.5-Pro leads with 62.37. Even the top-performing model scores below 70 on individual document parsing sub-tasks (67.17 on document content structuring and 67.93 on table recognition), which the authors note as evidence of the benchmark’s difficulty. All models perform substantially better on English documents than Chinese documents; Gemini-1.5-Pro, for instance, scores about 5 percentage points higher on English tables than Chinese tables. Handwritten formula recognition is a clear weak point: nearly all models score below 50, with molecular formula recognition being a partial exception for Gemini-1.5-Pro.
Key information extraction: Qwen2-VL-72B (71.76 F1) and Gemini-1.5-Pro (67.28 F1) lead. GPT-4o and Claude-3.5-Sonnet perform notably weaker on the EPHOIE dataset. The highest-performing model still falls short of practical application thresholds, which the authors attribute to real-world noise, complex document structures, and the strict JSON output requirement.
Text grounding: Only four of the nine evaluated models support word-level grounding output. Even for these, performance drops sharply relative to recognition-only results. Gemini-1.5-Pro achieves the highest grounding score at 60.98%; all others remain below 50%. This finding indicates that fine-grained spatial understanding remains a major gap.
Multi-orientation robustness: Most models degrade significantly when images are rotated. InternVL2-76B shows the largest drop (38.92 percentage points), followed by GOT (34.86). Gemini-1.5-Pro is a notable outlier, declining by only 3.80 points, suggesting substantially better rotation invariance.
Repetition hallucination: TextMonkey produces repetitive output in 33.93% of images, and KOSMOS2.5 in 10.64%. Among generalist models, InternVL2-76B shows the highest repetition rate at 5.94%, while Claude-3.5-Sonnet shows the lowest at 0.09%. The authors identify this as a practical failure mode worth monitoring.
Failure mode observations: GPT-4o exhibits both hallucination (inventing plausible but incorrect text for long fields such as addresses) and instruction noncompliance (reformatting dates despite explicit prompts). These qualitative failures are documented alongside quantitative results.
Reproducibility
Models
No new model is introduced. The benchmark evaluates nine existing LMMs. The commercial APIs evaluated are GPT-4o-2024-08-06, Gemini-1.5-Pro-002, and Claude-3.5-Sonnet-20241022. The open-source models are KOSMOS2.5, TextMonkey, Florence-2, GOT, InternVL2-76B (76 billion parameters), and Qwen2-VL-72B (72 billion parameters). Model-specific details (parameter counts, architecture) are described in the original model papers and not repeated here.
Algorithms
Evaluation uses the metrics described above (Eval-Trans F1, NED, TEDS, field-level F1, $R_{rep}$). Normalization code for the KIE evaluation and evaluation scripts for all tracks are released at AlibabaResearch/AdvancedLiterateMachinery under MIT license, with VLMEvalKit integration supported via the HuggingFace dataset. The annotation pipeline for the SOLD dataset uses a bottom-up re-annotation scheme with rule-based multi-round quality rectification, LLM-assisted error correction, key normalization, and secondary manual review.
Data
CC-OCR comprises 7,058 fully annotated images across four tracks. Approximately 41% of the data are sourced from real-world applications and released for the first time. Data sources fall into three categories: existing benchmarks with qualified annotations (open-source, re-annotated as needed), newly collected data, and re-annotated data. All document parsing data and all Chinese multi-scene data are newly collected. KIE constrained-category data is re-annotated from SROIE, CORD, EPHOIE, and POIE. The open-category SOLD dataset is re-annotated from SIBR and HUST-CELL.
The benchmark data is publicly available on HuggingFace at wulipc/CC-OCR under an MIT license; this hosts a TSV-format version of the data integrated with VLMEvalKit. The dataset combines images from multiple source licenses: subsets derived from open-source benchmarks retain their original licenses (e.g., TotalText, IC15, CORD, SROIE are individually licensed), while newly collected and re-annotated images (including all document parsing data, Chinese multi-scene data, and SOLD) are released by the authors under the MIT license. The overall dataset package on HuggingFace carries MIT, though downstream users should verify the original source licenses for any open-source subsets they use independently.
Evaluation
Benchmark statistics: multi-scene OCR has 2,750 images; multilingual OCR has 1,500 images (150 per language); document parsing has 800 images (300 documents, 300 tables, 200 formulas); KIE has 2,008 images (1,008 constrained, 1,000 open-category).
For text evaluation, basic units are defined per script: characters for CJK languages (Chinese, Japanese, Korean), words for Latin/Cyrillic/Arabic. The repetition threshold $T_{rep}$ is set to 5 (a continuous unit must appear more than 5 times consecutively to qualify as repetitive).
No statistical uncertainty measures (confidence intervals, significance tests, multiple-seed runs) are reported. The evaluation is a single-pass inference with no reported variance.
A limitation noted by the authors is that most specialist models cannot produce LaTeX or HTML output, so the document parsing track primarily reflects generalist model performance. Results for specialist models on that track are incomplete or require post-hoc conversion (indicated by asterisks in Table 4).
Hardware
The paper does not report hardware specifications, inference latency, GPU counts, or cost estimates for the model evaluations. No deployment or inference requirements are discussed.
BibTeX
@InProceedings{Yang_2025_ICCV,
author = {Yang, Zhibo and Tang, Jun and Li, Zhaohai and Wang, Pengfei and Wan, Jianqiang and Zhong, Humen and Liu, Xuejing and Yang, Mingkun and Wang, Peng and Bai, Shuai and Jin, Lianwen and Lin, Junyang},
title = {CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {21744--21754}
}
DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
TL;DR
DocPTBench is a benchmark of over 1,300 photographed document images designed to measure how well current models handle parsing and translation under real-world capture conditions. Experiments show that shifting from digital-born to photographed documents causes average parsing edit distance to increase by roughly 18% for general MLLMs and 25% for specialized OCR models, with translation BLEU scores dropping by about 12%.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The central contribution is a new benchmark and evaluation protocol for photographed document parsing and translation. The paper does not introduce a new model or training procedure; it systematically measures the robustness gap in existing systems using a carefully constructed dataset.
Secondary: $\Psi_{\text{Resource}}$. DocPTBench itself is a released community resource: a dataset of over 1,300 labeled photographed document images with human-verified annotations across eight language-pair translation directions, along with evaluation code.
What is the motivation?
Document parsing and translation benchmarks have historically focused on digital-born or high-quality scanned documents with clean geometric structure. Benchmarks such as OmniDocBench, FoxPage, and olmOCR-Bench represent this well-structured distribution, as do translation benchmarks like DoTA and DITrans. The result is that reported model numbers do not reflect what happens when a user points a phone at a document in practice.
Real-world document photographs introduce a range of degradations absent from these benchmarks: perspective distortion from off-axis capture, page curvature and physical deformation, motion blur, uneven or harsh lighting, shadows, and overexposure. While the WildDoc benchmark incorporated in-the-wild images, it focused on visual question answering rather than structured parsing, and provided no translation annotations. The community therefore lacked a unified benchmark to measure end-to-end parsing and translation robustness under photographed conditions.
What is the novelty?
DocPTBench introduces three features that distinguish it from prior benchmarks:
Unified dual-task evaluation. A single dataset supports both document parsing and document translation evaluation. Prior benchmarks either covered parsing or translation in isolation; no earlier resource combined both tasks with photographed images.
Three-tier image collection. DocPTBench organizes its images into an Original set (981 born-digital documents from OmniDocBench), a Photographed set (981 simulated photographs plus 400 physically photographed images of 100 documents under four challenging conditions), and an Unwarping set (the Photographed images after a commercial geometric rectification API). This structure allows controlled ablation of geometric versus photometric degradation.
Broad multilingual translation coverage. Eight translation directions are supported: En-Zh, En-De, En-Fr, En-Ru, and their Zh-originated counterparts. Translation ground truth was produced by Qwen-Max generation followed by rigorous manual verification.
The evaluation metrics for parsing follow OmniDocBench conventions, using Levenshtein-based edit distances for text ($\text{Text Edit}$), formulas ($\text{Formula Edit}$), tables ($\text{Table Edit}$), and reading order ($\text{Read Order Edit}$), plus a Tree-Edit-Distance-based Similarity score (TEDS) for table structure. Translation is assessed with BLEU, chrF, METEOR, and STEDS. Lower edit distance indicates better parsing; higher BLEU indicates better translation.
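For two of the four listed translation metrics, a minimal sacrebleu example; the strings are illustrative, not from DocPTBench, and METEOR and STEDS require other tooling:

```python
# Corpus-level BLEU and chrF with sacrebleu; refs is a list of reference
# streams, here a single stream with one reference per hypothesis.
import sacrebleu

hyps = ["The quarterly revenue grew by 12 percent."]
refs = [["Quarterly revenue increased by 12 percent."]]

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)
print(f"BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```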
What experiments were performed?
The paper benchmarks two model categories across all three image conditions:
Expert OCR models (evaluated on parsing only): PaddleOCR-VL, MinerU2.5, dots.ocr, MonkeyOCR, DeepSeek-OCR, olmOCR2, Dolphin, olmOCR, OCRFlux, SmolDocling, Nanonets-OCR, and Nanonets-OCR2. These systems are purpose-built for document content extraction.
General MLLMs (evaluated on both parsing and translation): Gemini 2.5-Pro, Qwen-VL-Max, GLM-4.5v, Kimi-VL, and Doubao-Seed-1.6-vision. Qwen2.5-VL-72B is also evaluated on parsing in the supplementary full results table. For translation, four smaller open-source models were also included: Qwen3-VL-4B, Qwen2.5-VL-3B, InternVL3-2B, and InternVL3.5-2B.
Parsing is evaluated across the Original, Photographed, and Unwarping conditions. Translation is evaluated with:
- A text-only baseline (the source Markdown text is provided directly, bypassing vision) to establish an upper bound for translation quality independent of OCR.
- Original and Photographed image inputs.
- Two prompting strategies: a Simple prompt (direct image-to-target-language output) and a Chain-of-Thought (CoT) prompt (explicit instruction to first parse the document into source-language Markdown, then translate).
The CoT formulation decouples perception from translation and helps diagnose where errors originate.
What are the outcomes/conclusions?
Parsing degradation is severe and widespread. Across the tested expert models, transitioning from Original to Photographed documents produces average overall edit distance increases far exceeding typical benchmark variance. DeepSeek-OCR’s Overall Edit Score (En) deteriorates from 13.4 to 54.4; SmolDocling’s rises from 49.3 to 90.1. General MLLMs fare comparatively better, with an average increase of roughly 18%, while expert models average roughly 25%. The authors attribute the relative resilience of MLLMs to training on more diverse and noisy data.
Geometric rectification is necessary but not sufficient. Processing photographed images through a commercial unwarping API before evaluation consistently recovers substantial performance. For example, DeepSeek-OCR’s Overall Edit Score (En) improves from 54.4 back to 22.1 after unwarping, and PaddleOCR-VL’s Table TEDS (En) rebounds from 54.3 to 82.5. However, a consistent gap remains between Unwarping and Original scores. The authors interpret this as evidence that photometric degradations (blur, uneven lighting) are not addressed by geometric correction and continue to impair parsing quality.
Translation exhibits a large modality gap. Comparing text-only inputs to image inputs on the same documents reveals a large gap. Qwen-VL-Max’s BLEU score on En-Zh drops from 69.41 (text) to 41.04 (image, original). For some models on low-resource scripts, the collapse is near-total: InternVL3-2B’s BLEU on En-Ru falls from 37.09 (text) to 8.15 (image, original), suggesting that complex scripts compound the difficulty.
Photographed images compound translation errors. When the image source is photographed rather than born-digital, BLEU scores drop by a further 12% on average across models and directions. This degradation occurs consistently whether a simple or CoT prompt is used, indicating that visual image quality is the primary bottleneck rather than the prompting strategy.
CoT prompting mitigates some modality gap. For most models and language pairs, the CoT strategy (parse first, then translate) outperforms the simple direct approach. For example, Qwen-VL-Max’s En-Zh BLEU on original images rises from 41.04 with a simple prompt to 47.60 with CoT. The benefit, while consistent, does not close the gap to text-only baselines.
Dual bottleneck for translation. The paper identifies two independent constraints on end-to-end translation quality: the model’s visual parsing capability and its underlying translation competence. Doubao-Seed-1.6-vision achieves strong parsing scores but delivers only modest translation quality because its text-only translation performance is weaker than competing models. This finding suggests that improving either component in isolation will not fully solve end-to-end translation.
Gemini 2.5-Pro shows strongest overall robustness. Across parsing, translation, and photographed conditions, Gemini 2.5-Pro exhibits the smallest performance degradation among the general MLLMs tested, dropping only 3.4 Overall Edit Score (En) points on photographed images compared to original.
Reproducibility
Models
No new model weights are introduced. DocPTBench evaluates existing, externally released systems. All evaluated models are referenced by their original papers and repositories. Gemini 2.5-Pro was accessed via API; the open-source models (Qwen series, InternVL series) are available on HuggingFace under their respective licenses.
Algorithms
No training is performed. The evaluation pipeline applies each model’s default inference settings. Two prompting strategies are systematically varied: the Simple prompt and the Chain-of-Thought prompt. The exact prompts are illustrated in Fig. 5 of the supplementary material (page 14 of the paper). The commercial unwarping API used for the Unwarping tier is provided by textin.com (https://www.textin.com/market/detail/crop_enhance_image).
Inference scripts for each prompting mode (simple, CoT, and text-only) are provided in code/translation/model_infer/ in the repository, organized by source language. The translation evaluation script is at code/translation/evaluate_metric/evaluate.py. Parsing evaluation uses the OmniDocBench evaluation harness (v1_0 branch of https://github.com/OmniDocBench/OmniDocBench); the repository documentation (docs/parsing.md) includes a step-by-step walkthrough for reproducing parsing results with any model.
Data
The DocPTBench dataset is built on 981 documents from OmniDocBench. The Photographed tier combines:
- 981 photographed-simulation images (exact simulation pipeline details are in the paper and code repository).
- 400 physically photographed images of 100 source documents, each photographed under four conditions (strong illumination, shadows, perspective distortion, physical wear such as folds or wrinkles).
The 1,381 Unwarping images are derived from the Photographed set using a commercial API. Translation ground truth for all eight language directions was generated by Qwen-Max and then manually verified and corrected.
The dataset and code are released under Apache-2.0 at https://github.com/Topdu/DocPTBench. The dataset is also hosted on HuggingFace (topdu/DocPTBench) and ModelScope (topdktu/DocPTBench) and can be downloaded with:
huggingface-cli download topdu/DocPTBench --local-dir ./DocPTBench --repo-type dataset
The downloaded dataset contains a ground-truth file (DocPTBench_combined.json) and four image directories: images_synreal (981 simulated photographed images), images_synreal_dewarp (their unwarped counterparts), images_pic_synreal (400 physically photographed images), and images_pic_dewarp (their unwarped counterparts). Translation ground truth is stored under translation_gt/src_en/ and translation_gt/src_zh/ for the eight language pairs.
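A minimal sketch for inspecting the download, assuming the directory layout described above; the internal schema of `DocPTBench_combined.json` is not documented here, so the snippet only loads it and counts entries:

```python
import json
from pathlib import Path

root = Path("./DocPTBench")

# Ground-truth file named above; its internal schema is not spelled out
# in this summary, so we only load and count.
with open(root / "DocPTBench_combined.json", encoding="utf-8") as f:
    ground_truth = json.load(f)
print(f"{len(ground_truth)} ground-truth entries")

# The four image tiers described above.
for subdir in ("images_synreal", "images_synreal_dewarp",
               "images_pic_synreal", "images_pic_dewarp"):
    n_files = sum(1 for _ in (root / subdir).iterdir())
    print(f"{subdir}: {n_files} files")
```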
Evaluation
Parsing metrics follow OmniDocBench conventions: Levenshtein-based Text Edit, Formula Edit, Table Edit, and Read Order Edit (lower is better), plus TEDS for table structure (higher is better). An Overall Edit score aggregates these. Translation metrics are BLEU, chrF, METEOR, and STEDS.
Baselines cover 18 evaluated systems in total across the supplementary full results table (Tab. 5). The three-tier design (Original/Photographed/Unwarping) controls for degradation type. No statistical significance tests or error bars over multiple runs are reported; results reflect single-run evaluations. The authors acknowledge that the benchmark builds on OmniDocBench’s digital-born subset, so domain coverage reflects OmniDocBench’s composition (academic papers, invoices, forms, magazines, and others).
Hardware
No training hardware is relevant. Inference hardware specifics are not reported for the evaluated models. The paper does not report GPU-hours, inference latency, or throughput.
BibTeX
@article{du2025docptbench,
title={DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation},
author={Du, Yongkun and Chen, Pinxuan and Ying, Xuye and Chen, Zhineng},
journal={arXiv preprint arXiv:2511.18434},
year={2025}
}
ENTRANT: A Large Financial Dataset for Table Understanding
TL;DR
ENTRANT is a large-scale financial table dataset compiled from SEC EDGAR filings. It contains approximately 6.7 million tables extracted from roughly 330,000 XBRL-format reports filed between 2013 and 2021, covering ten report types including 10-K, 10-Q, and 8-K filings. Each table is stored as a JSON object with per-cell attributes (formatting, data type, formula references), spatial coordinates, and two explicit hierarchy trees encoding the top-down and left-right header structure. The authors demonstrate the dataset’s usability by pre-training a TUTA model on ENTRANT and applying it to cell type classification, where it either matches or slightly improves on the original model depending on the training mix.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the paper’s central contribution is the dataset itself, including the automated crawling and transformation pipeline, the JSON annotation format with bi-tree hierarchies, and the public release on Zenodo.
Secondary: $\Psi_{\text{Evaluation}}$: the pre-training and downstream cell classification experiments serve as a technical validation of the dataset’s usability rather than as a standalone evaluation contribution.
What is the motivation?
Table understanding covers a wide range of tasks: cell role prediction, table type classification, question answering over tables, schema mapping, and knowledge extraction. Most recent progress in this area relies on large pre-trained transformers (TaBERT, TaPas, TUTA, FORTAP) that consume millions of tables during pre-training.
The datasets most commonly used for this purpose come from general-purpose web sources: Wikipedia tables, Common Crawl, and the WDC Web Table corpus. These sources require substantial preprocessing to extract tables and attach per-cell features in a format that models can consume directly. Financial tables from spreadsheets, which interleave textual and numerical content and commonly exhibit multi-level hierarchical header structures, are largely absent from existing public pre-training corpora. The one prior financial table resource cited by the authors (TAT-QA) contains only 20K tables, which is several orders of magnitude smaller than what large pre-training runs require.
EDGAR, the SEC’s public filing database, provides a continuously updated source of structured financial spreadsheets in XBRL format. Starting from 2009, SEC rules required publicly traded companies to tag every number and table in their filings with XBRL identifiers, making automated extraction reliable and consistent. The authors argue that this source, once processed, can fill the gap for large-scale, machine-readable financial tabular data.
What is the novelty?
Dataset scale and coverage
ENTRANT covers 331,448 reports filed between 2013 and 2021 across ten SEC form types. The 2013 start date reflects XBRL adoption maturity; the 2021 cutoff was chosen to align with prior research on financial misstatement detection, where average discovery lags of two to three years introduce noise in more recent filings. Processing these reports produced 6,735,407 tables distributed across roughly 330,000 JSON files.
The dataset is not limited to annual financial statements. It includes quarterly reports (10-Q), current event reports (8-K), registration statements (S-1, S-4), foreign issuer reports (20-F, 18-K), and investment company filings (485BPOS, 497). This breadth means the vocabulary spans standard XBRL financial concepts (e.g., “Net Income”, “Current Liabilities”) as well as short textual disclosures and notes that accompany significant corporate events.
JSON annotation format
Each JSON file corresponds to one filing workbook and contains a list of table dictionaries. Each table stores:
- Metadata: file description, language, and spreadsheet name
- Table dimensions via a “RangeAddress” field
- A “Cells” field structured as a list of row lists, where each cell is a dictionary
Per-cell attributes include:
| Key | Description |
|---|---|
| `T` | Cell value as string |
| `V` | Cell value (typed) |
| `is_header` | Participates in a header row |
| `is_attribute` | Participates in a header column |
| `coordinates` | [row index, column index] |
| `HF` | Cell has a formula |
| `A1` / `R1` | Absolute / relative formula cell reference |
| `font_name`, `font_size` | Font metadata |
| `wrap_text` | Whether text is wrapped within the cell dimension |
| `BC`, `FC`, `FB`, `I` | Background color, font color, bold, italic flags |
| `NS` | Number format |
| `DT` | Data type |
| `LB`, `TB`, `BB`, `RB` | Left, top, bottom, right border flags |
| `O`, `HA`, `VA` | Orientation, horizontal alignment, vertical alignment |
In addition to cell content, each table records merged regions and the counts of top header rows and left header columns.
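A short sketch of consuming this format, assuming only the keys listed above; the top-level structure (a list of table dicts, each with `Cells` as a list of row lists) follows the description, and the filename is hypothetical:

```python
import json

# Minimal sketch, assuming the structure described above. The filename is
# hypothetical, following the naming convention given under Data below.
with open("320193_2019_10-K_0000320193-19-000119.json") as f:
    tables = json.load(f)

table = tables[0]
headers, formulas = [], []
for row in table["Cells"]:
    for cell in row:
        if cell.get("is_header"):
            headers.append((cell["coordinates"], cell["T"]))
        if cell.get("HF"):  # the cell carries a formula
            formulas.append(cell["coordinates"])

print("dimensions:", table["RangeAddress"])
print("first header cells:", headers[:5])
print("formula cells:", len(formulas))
```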
Bi-tree hierarchy representation
A distinctive feature of ENTRANT is its explicit encoding of the two hierarchical structures present in financial tables. Financial spreadsheets routinely use merged cells and indentation to imply both a top-down hierarchy (column group headers spanning multiple time periods or categories) and a left-to-right hierarchy (row label groups organizing related financial line items). ENTRANT encodes both hierarchies as nested node dictionaries:
node = {
'RI': <RowIndex>,
'CI': <ColumnIndex>,
'Cd': [ <child node>, ... ]
}
The root node uses coordinates (-1, -1) to indicate it lies conceptually outside the table grid. This representation, adopted from the TUTA paper’s formalism, supports up to four levels of hierarchy depth, which the authors report covers all table types present in EDGAR filings. The two hierarchy trees are stored under the keys TopTreeRoot and LeftTreeRoot in each table dictionary.
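A minimal traversal sketch for these trees, assuming only the three node keys shown above (`RI`, `CI`, `Cd`):

```python
# Walk a TopTreeRoot/LeftTreeRoot hierarchy, yielding each node's
# (row, column) coordinates with its depth below the virtual root.
def walk(node, depth=0):
    if (node["RI"], node["CI"]) != (-1, -1):  # skip the virtual root
        yield node["RI"], node["CI"], depth
    for child in node.get("Cd", []):
        yield from walk(child, depth + 1)

left_tree = {"RI": -1, "CI": -1, "Cd": [
    {"RI": 1, "CI": 0, "Cd": [{"RI": 2, "CI": 0, "Cd": []}]},
]}
for ri, ci, depth in walk(left_tree):
    print(f"header cell at ({ri}, {ci}), depth {depth}")
```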
Automated, tested curation pipeline
The crawling phase uses each company’s Central Index Key (CIK) and EDGAR accession number to construct API endpoints pointing to Financial_Report.xlsx files. Reports without valid XBRL tags are filtered out before downloading. The transformation phase uses OpenPyXL to extract tables from each workbook, annotate cells with the attributes above, and construct the bi-tree structures. The paper states that a test suite of 45 unit tests covering all transformation steps passed successfully at the time of publication. A post-processing pass verifies that no empty rows or columns remain in any extracted table, and all JSON files are validated by loading them into Python dictionaries.
What experiments were performed?
The authors pre-trained TUTA, a transformer-based table understanding model, on two data configurations to validate that ENTRANT is correctly formatted and consumable:
- TUTA (WikiTables + WDC + ENTRANT): pre-trained on the three public datasets combined.
- TUTA (ENTRANT only): pre-trained exclusively on ENTRANT.
Both models were then evaluated on cell type classification using two datasets from the DECO collection (Koci et al., ICDAR 2019), DeEx and SAUS, which annotate cells into six types: “metadata”, “notes”, “data”, “top attribute”, “left attribute”, and “derived”. These datasets draw from financial, educational, and public health domains.
Results are reported as Macro F1:
| Model | DeEx | SAUS |
|---|---|---|
| TUTA original (WikiTables + WDC + non-public spreadsheets) | 76.6 | 90.2 |
| TUTA (WikiTables + WDC + ENTRANT) | 78.8 | 91.9 |
| TUTA (ENTRANT only) | 74.2 | 88.4 |
The model pre-trained only on ENTRANT performs approximately 2 Macro F1 points below the original TUTA baseline (2.4 on DeEx, 1.8 on SAUS). When ENTRANT is combined with the other two public datasets, the resulting model slightly outperforms the original baseline on both tasks.
What are the outcomes/conclusions?
The results suggest that ENTRANT can serve as a drop-in addition to existing public pre-training corpora for table understanding models. The ENTRANT-only model performs reasonably well despite being restricted to a single financial domain, which the authors attribute to the dataset’s vocabulary breadth (financial statements contain diverse numeric formats, date references, and conceptual labels) and its explicit structural annotations.
The paper notes several limitations worth considering:
- The dataset covers only filings from 2013 to 2021. More recent EDGAR filings are not included and would need to be crawled using the released code.
- The 2021 cutoff introduces a specific bias: filings close to the cutoff are less likely to have had misstatements discovered. This bias mattered for the dataset's original research context (misstatement detection) but has no direct bearing on table understanding.
- Tables below a minimum size threshold and tables with sparse or broken content are filtered out during preprocessing. The exact filter criteria are described qualitatively but not as precise thresholds.
- The ENTRANT-only pre-training experiment uses downstream datasets (DeEx, SAUS) that include financial tables, which may partially favor a financial-domain pre-training corpus. Generalization to other domains is not directly measured.
- The paper does not report statistical confidence intervals or multiple pre-training runs, so the observed improvements over the TUTA baseline should be interpreted cautiously.
The code release allows users to extend the dataset by crawling more recent EDGAR filings or to apply the transformation pipeline to other XBRL-format corporate spreadsheets.
Reproducibility
Models
- TUTA: tree-based transformer for general table pre-training; originally described by Wang et al. (KDD 2021). The pre-training uses all three objectives described in the original TUTA paper. Model weights from the published TUTA codebase (https://github.com/microsoft/TUTA_table_understanding) were used as the starting point. Parameter count and layer configuration are not restated in the ENTRANT paper; readers should consult the original TUTA paper.
Algorithms
- Pre-training followed the TUTA procedure using the released Microsoft code. No modifications to the TUTA training recipe are described.
- Downstream cell type classification uses the fine-tuning setup from the original TUTA paper.
- Specific optimizer, learning rate, batch size, and compute budget for the ENTRANT pre-training runs are not reported.
Data
- Source: SEC EDGAR public filings, accessed via the EDGAR API. The underlying financial data is public domain. Spreadsheets are in XBRL-annotated XLSX format.
- Coverage: 2013-2021, ten report types, 331,448 reports, 6,735,407 tables.
- File organization: One JSON file per company filing, organized in subfolders by report type. Each subfolder is provided as a zip archive on Zenodo.
- File naming convention: `<CIK>_<YEAR>_<TYPE>_<ACCESSION_NUMBER>.json` (a parsing sketch follows this list)
- Dataset availability: Released on Zenodo at https://doi.org/10.5281/zenodo.10667088 under CC-BY-4.0 (open access, 17.3 GB total across 10 zip archives, one per report type). Note that the dataset license (CC-BY-4.0) is more permissive than the published article license (CC-BY-NC-ND-4.0).
- Code availability: Python scripts for crawling and transformation released at https://github.com/iit-Demokritos/entrant under CC-BY-4.0.
- Processing libraries: OpenPyXL for spreadsheet parsing.
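The filename convention above can be unpacked with a few lines; this helper is hypothetical and not part of the released ENTRANT code:

```python
# Hypothetical helper: split a filename following the convention above
# into its four components.
def parse_entrant_filename(name: str) -> dict:
    cik, year, form_type, accession = name.removesuffix(".json").split("_", 3)
    return {"cik": cik, "year": int(year),
            "type": form_type, "accession": accession}

print(parse_entrant_filename("320193_2019_10-K_0000320193-19-000119.json"))
```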
Evaluation
- Metric: Macro F1 for multi-class cell type classification (six classes: metadata, notes, data, top attribute, left attribute, derived).
- Downstream datasets: DeEx and SAUS from the DECO collection (Koci et al., ICDAR 2019), covering financial, educational, and public health tables.
- Baselines: Original TUTA model pre-trained on WikiTables, WDC, and non-public web-crawled spreadsheets. This baseline uses pre-training data that is not fully reproducible by external parties.
- No error bars, significance tests, or multiple pre-training seeds are reported.
Hardware
- Computing and storage resources were provided by the SKEL AI Lab at NCSR “Demokritos”. Specific GPU types, GPU counts, pre-training duration, and memory requirements are not reported in the paper.
BibTeX
@article{zavitsanos2024entrant,
title={ENTRANT: A Large Financial Dataset for Table Understanding},
author={Zavitsanos, Elias and Mavroeidis, Dimitris and Spyropoulou, Eirini and Fergadiotis, Manos and Paliouras, Georgios},
journal={Scientific Data},
volume={11},
pages={876},
year={2024},
publisher={Nature Publishing Group},
doi={10.1038/s41597-024-03605-5}
}
MMSci: Benchmarking and Improving Multimodal Scientific Table Understanding
TL;DR
MMSci is a dataset suite built from the SciGen corpus of scientific tables, comprising MMSci-Pre (52K image-to-HTML pairs for table structure learning), MMSci-Ins (12K chain-of-thought instruction samples), and MMSci-Eval (3,114 benchmark samples requiring numerical reasoning). The authors find that 52K domain-specific scientific tables outperform 150K general-domain tables on both in-distribution and held-out evaluation, suggesting that data source matters more than raw volume for this task.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is the construction and release of the three MMSci datasets (pre-training, instruction tuning, and evaluation), along with the construction methodology, quality control procedures, and reasoning-type taxonomy.
Secondary: $\Psi_{\text{Evaluation}}$: MMSci-Eval serves as a purpose-built benchmark for numerical reasoning over scientific tables, and a substantial portion of the paper is devoted to comparing models on this benchmark and analyzing cross-modal consistency.
What is the motivation?
Large language models have shown growing capability on table understanding tasks such as table question answering (TQA), table fact verification (TFV), and table-to-text generation (T2T). Most table-oriented LLMs, however, require converting tables into sequential text formats such as HTML strings, which can discard structural and positional information. Multimodal large language models (MLLMs) sidestep this by processing table images directly, but existing MLLM-based approaches suffer from at least three practical problems:
- Fixed input image resolutions constrain applicability to tables of varying sizes.
- Training corpora largely consist of general-domain tables (Wikipedia, financial reports, government documents) and do not cover the dense numerical content typical of scientific publications.
- Models trained on these general corpora show substantial performance degradation on scientific tables that require arithmetic and statistical reasoning.
Prior datasets such as TAT-QA and FinQA address numerical reasoning in the financial domain, and SciGen introduces scientific table-to-text generation, but no existing resource combines multimodal (image-based) input with comprehensive numerical reasoning evaluation across multiple task types for scientific tables.
What is the novelty?
The primary novelty is the MMSci dataset suite itself. The authors build all three components by sourcing raw tabular data from the SciGen dataset (Moosavi et al., 2021), which pairs scientific tables from computer science papers with textual descriptions that naturally invoke arithmetic reasoning.
MMSci-Pre contains 52K image-to-HTML pairs. The authors convert SciGen’s LaTeX-formatted tables to HTML using a rendering pipeline (imgkit) and store the resulting table images alongside their HTML ground truth. The pre-training objective is standard sequence generation: given a table image, produce the corresponding HTML string. This stage develops table structure perception within the MLLM.
MMSci-Ins contains 12K instruction-tuning samples spanning TQA, TFV, and T2T. For each table, GPT-4o is prompted with both the table image and its SciGen description to generate a question-answer pair, a claim-and-verdict pair, and an extended table-to-text description, each accompanied by explicit chain-of-thought reasoning steps. To improve quality, self-consistency voting (Wang et al., 2023) is applied: multiple reasoning paths are sampled and a majority-vote mechanism selects the final output. The construction prompt is made available in the paper (Table 7). The dataset is balanced: each table contributes exactly one TQA, one TFV, and one T2T sample.
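A minimal sketch of the self-consistency voting step, with `generate` as a hypothetical stand-in for one sampled GPT-4o call that returns the final answer extracted from a reasoning path:

```python
from collections import Counter

# Self-consistency voting as described above: sample several reasoning
# paths and keep the most common final answer. The exact number of paths
# sampled during dataset construction is not stated in the paper.
def self_consistent_answer(generate, prompt: str, n_paths: int = 5) -> str:
    answers = [generate(prompt) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```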
MMSci-Eval contains 3,114 test samples derived from SciGen’s test set using the same GPT-4o generation pipeline, with an additional round of human verification covering all samples. The benchmark covers eight reasoning types:
| Reasoning Type | Proportion |
|---|---|
| Addition | 21.1% |
| Subtraction | 15.3% |
| Max/Min | 15.7% |
| Division | 14.2% |
| Comparison | 13.7% |
| Ranking | 9.6% |
| Look-up | 8.9% |
| Domain Knowledge Calculation | 1.5% |
On the modeling side, the authors implement dynamic input resolution on two base architectures (Qwen2-VL-7B-Instruct and LLaVA-NeXT-7B), addressing the fixed-resolution limitation of prior table MLLMs. Qwen2-VL achieves this through 2D-RoPE positional encodings; LLaVA-NeXT splits images into grids and encodes them independently.
What experiments were performed?
Training procedure. Both models are fine-tuned in two stages on 4 × A100 80 GB GPUs using LoRA (rank 64, sequence length 4096). In the first stage (table structure learning), models are trained for one epoch on image-to-HTML data with varying combinations of MMSci-Pre (52K) and MMTab-Pre (150K from Table-LLaVA). In the second stage (visual instruction tuning), models are fine-tuned for 4 epochs on MMSci-Ins (12K), freezing the visual encoder and updating only the MLP connector and LLM weights.
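For orientation, a LoRA configuration matching the reported rank might look like the following (using HuggingFace `peft`; alpha, dropout, and target modules are not reported in the paper, so those values are assumptions, not the authors' configuration):

```python
from peft import LoraConfig

# Orientation-only sketch of a LoRA setup with the reported rank 64.
lora_config = LoraConfig(
    r=64,                                 # rank reported in the paper
    lora_alpha=128,                       # assumption: not stated
    lora_dropout=0.05,                    # assumption: not stated
    target_modules=["q_proj", "v_proj"],  # assumption: a common choice
    task_type="CAUSAL_LM",
)
```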
Evaluation. The held-in evaluation uses MMSci-Eval’s TQA, TFV, and T2T splits. Held-out evaluation uses the MMTab-Eval benchmark (from Table-LLaVA), covering TABMWP, WTQ, HiTab, TAT-QA, FeTaQA (TQA), TabFact, InfoTabs (TFV), and HiTab_T2T, Rotowire, WikiBIO (T2T). Accuracy is used for TQA and TFV; BLEU is used for T2T.
Baselines. The authors compare against a range of MLLMs: GPT-4V, InternVL-2-76B, Qwen-2-VL-72B-Instruct, LLaVA-NeXT (72B/34B/13B/7B), Table-LLaVA (13B/7B), Pixtral-12B, Llama-3.2-11B-Vision-Instruct, MiniCPM-V-2.6-8B, and InternVL-2-8B.
Ablations. Three ablation axes are evaluated: (1) pre-training data source (MMSci-Pre 52K vs. MMTab-Pre 150K vs. combined MM-Pre 202K vs. no pre-training), (2) inclusion of explicit reasoning steps in instruction tuning, and (3) instruction data scaling from 3K to 12K samples.
Representational alignment analysis. The authors also measure language-vision alignment using kernel similarity metrics (CKA, CKNNA, SVCCA, and several KNN-based measures) on ImageNet, Wikipedia captions, and the MMSci T2T task. Qwen2-VL-7B-Instruct scores highest across most measures, which the authors correlate with its stronger task performance.
What are the outcomes/conclusions?
Data quality over quantity. Across both model architectures, training on MMSci-Pre (52K scientific domain) produces comparable or better results than training on MMTab-Pre (150K general domain), particularly on MMSci-Eval. For Qwen2-VL-7B-Instruct, MMSci-Pre (52K) + MMSci-Ins achieves 41.13% TQA accuracy and 72.92% TFV accuracy, versus 40.75% TQA and 72.73% TFV for MMTab-Pre (150K) + MMSci-Ins. Combining both (MM-Pre, 202K) pushes results to 42.10% TQA and 73.98% TFV.
Reasoning steps help consistently. Ablations show that including chain-of-thought reasoning steps during instruction tuning improves TQA by roughly 6-7 percentage points and TFV by 8-10 percentage points across all pre-training configurations. The benefit is consistent regardless of whether 3K or 12K instruction samples are used, and it generalizes to held-out datasets like TABMWP and TAT-QA.
Generalization to held-out data. Despite not training on any MMTab-Ins data, the MMSci-tuned Qwen2-VL models achieve 21.15% average TQA accuracy and 41.63% average TFV accuracy on the MMTab held-out benchmark, which is competitive with models explicitly trained on general-domain table data.
Limitations. The authors note three main limitations. First, the framework focuses on numerical tables; qualitative or methodology tables are not well covered. Second, complex statistical analyses and domain-specific mathematical notation (e.g., subscripted variables, special symbols) may still challenge the models. Third, very large tables with dense information remain difficult to process due to computational constraints and potential information loss during visual encoding. GPT-4V continues to lead most subtasks, suggesting that the fine-tuned 7B models leave substantial room for improvement.
Reproducibility
Models
Both fine-tuned model variants build on publicly available checkpoints:
- Qwen2-VL-7B-Instruct (ViT + MLP connector + Qwen2-7B-Instruct backbone)
- LLaVA-NeXT-7B (CLIP encoder + MLP connector + Vicuna-7B backbone)
The paper does not report releasing fine-tuned model weights. The base checkpoints are available via their respective repositories under permissive licenses.
Algorithms
- LoRA fine-tuning: rank 64, applied to both stages
- Sequence length: 4096 tokens
- Stage 1 (table structure learning): 1 epoch; LLaVA-NeXT updates only the MLP connector; Qwen2-VL updates the full model
- Stage 2 (instruction tuning): 4 epochs; visual encoder frozen; MLP and LLM weights updated
- Training time (4 × A100 80 GB): LLaVA-NeXT table structure learning takes approximately 15 hours for MMTab-Pre (150K), 3 hours for MMSci-Pre (52K), and 20 hours combined (one epoch); Qwen2-VL-7B takes approximately 15 hours, 8 hours, and 19 hours for the same three configurations respectively; instruction tuning takes approximately 1 hour for 4 epochs on 12K samples for both models
- Optimizer, learning rate, and warmup schedule are not reported in the paper
- Self-consistency voting used during dataset generation; exact number of sampled reasoning paths is not stated
Data
- MMSci-Pre: 52K image-to-HTML pairs derived from SciGen training and development splits; HTML rendered via the imgkit Python package
- MMSci-Ins: 12K samples from SciGen training split; generated by GPT-4o with self-consistency voting; 40% manually verified (94.36% accuracy after regeneration of failures)
- MMSci-Eval: 3,114 samples from SciGen test split; all samples manually verified (95.25% accuracy)
- The dataset JSON files and a zip of MMSci-Pre are publicly available in the GitHub repository; the image data is hosted on HuggingFace, but the exact access path is not detailed in the paper
- The underlying SciGen dataset (Moosavi et al., 2021) is publicly available; its license should be consulted before use
Evaluation
- TQA and TFV: accuracy (exact match for TQA answers formatted as JSON)
- T2T: BLEU score
- No error bars or significance tests are reported; single-run results
- Held-out datasets (TABMWP, TAT-QA, TabFact, WTQ, etc.) are evaluated zero-shot relative to MMSci training distribution
Hardware
- Training: 4 × NVIDIA A100 80 GB GPUs
- Inference hardware not explicitly specified
- No cloud cost estimates provided
BibTeX
@article{yang2025mmsci,
title={Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning},
author={Bohao Yang and Yingji Zhang and Dong Liu and André Freitas and Chenghua Lin},
journal={arXiv preprint arXiv:2501.13042},
year={2025}
}
SPRINT: Script-agnostic Structure Recognition in Tables
TL;DR
SPRINT is an image-to-sequence model for table structure recognition (TSR) that achieves script-agnostic generalization by downsampling table images to 128x128 pixels before encoding, thereby blurring language- and font-specific features. It uses a Global Context Attention (GCA) encoder paired with a transformer decoder trained on the compact Optimized Table Structure Language (OTSL) vocabulary. Alongside the model, the authors release MUSTARD, a multilingual TSR benchmark of 1,428 tables spanning 13 languages, including 11 Indic languages, English, and Chinese. On MUSTARD, SPRINT outperforms the best available comparison model (MTL-TabNet) by an absolute average of 11.12% TEDS-S while also achieving lower inference latency on English benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper’s primary contribution is the SPRINT architecture and training recipe. It introduces a deliberate design choice (aggressive downsampling to produce a script-agnostic feature representation) and an end-to-end methodology combining SPRINT with a loosely coupled table grid estimator. Ablations over decoder depth, image resolution, and training data mix support this framing.
Secondary: $\Psi_{\text{Resource}}$: The MUSTARD dataset is a meaningful standalone contribution: 1,428 annotated multilingual table images across 13 languages, the first such OTSL-labeled multilingual TSR benchmark. Code and dataset are both released under the MIT license.
What is the motivation?
Table structure recognition is a prerequisite for downstream tasks such as information retrieval, table reconstruction, and document understanding. While large English-language TSR benchmarks (PubTabNet, FinTabNet, PubTables-1M) and the models trained on them exist, no labeled multilingual TSR datasets with logical structure annotations were publicly available at the time of this work. Models trained on upsampled English table images implicitly learn language- and font-specific cues that do not transfer to documents in other scripts. The cost of collecting and annotating large labeled corpora for each target language is prohibitive, so the authors seek a model that generalizes across scripts without per-language retraining.
A secondary motivation is inference efficiency. Popular image-to-sequence models (MTL-TabNet, TableFormer + HTML) decode large HTML vocabularies, which slows inference relative to what is needed for practical deployment.
What is the novelty?
SPRINT reframes TSR as a script-agnostic cell arrangement prediction problem. The core insight is that two properties together should produce a model that ignores script-specific appearance: first, shrinking the input image to 128x128 pixels converts cell contents into indistinct blobs, removing text-level cues; second, using the minimal OTSL vocabulary (six tokens: F, E, L, X, U, N) rather than a full HTML tag set both speeds decoding and avoids learning vocabulary-linked biases.
OTSL representation. A table with $R$ rows and $C$ columns is represented as a flat sequence of length $R \times (C+1)$, where N acts as a row delimiter appearing at every $(C+1)^{\text{th}}$ position. The remaining tokens encode cell type:
- `F` (filled cell with content)
- `E` (empty cell)
- `L` (left-looking: column span)
- `U` (upward-looking: row span)
- `X` (cross: both spans)
This six-token vocabulary is substantially smaller than an HTML vocabulary (30+ tokens), which reduces decoder logit complexity and shortens sequences.
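A minimal sketch of this serialization: each of the $R$ rows contributes its $C$ cell tokens plus one trailing `N`, giving the $R \times (C+1)$ length noted above.

```python
# Serialize a grid of OTSL cell tokens into the flat sequence.
def grid_to_otsl(grid: list[list[str]]) -> str:
    return "".join("".join(row) + "N" for row in grid)

# A 2x3 table whose second row opens with a cell spanning two columns.
grid = [["F", "F", "F"],
        ["F", "L", "F"]]
otsl = grid_to_otsl(grid)
assert len(otsl) == len(grid) * (len(grid[0]) + 1)
print(otsl)  # FFFNFLFN
```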
Architecture. SPRINT’s encoder is a ResNet-31 backbone with Multi-Aspect Global Attention (GCA) interleaved between convolutional layers. The encoder maps a 128x128 image to 512 channels of 16x32 feature maps, which are positionally embedded and fed to a six-layer transformer decoder. The decoder is trained with categorical cross-entropy against ground-truth OTSL sequences, with a maximum sequence length of 224 tokens.
Grid-based alignment. SPRINT predicts an unconstrained OTSL string, which is then aligned with row and column counts $R$ and $C$ obtained from a pre-trained Table Aware Transformer (TATR) grid estimator. The alignment step pads or trims the predicted string to length $R \times (C+1)$, enforces the periodicity of N tokens, and replaces syntactically invalid placements of L, U, X, and N with F. This produces a well-formed OTSL matrix that can be losslessly converted to HTML using a deterministic algorithm (Algorithm 1 in the paper).
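A simplified sketch of the alignment step, covering only the length and `N`-periodicity repairs; the full replacement rules for syntactically invalid `L`, `U`, and `X` placements are omitted here:

```python
# Force a raw OTSL prediction into a well-formed R x (C+1) layout.
def align_otsl(pred: str, R: int, C: int) -> str:
    target = R * (C + 1)
    cells = list(pred[:target].ljust(target, "F"))  # trim or pad with F
    for i in range(target):
        if (i + 1) % (C + 1) == 0:
            cells[i] = "N"   # every (C+1)-th token is the row delimiter
        elif cells[i] == "N":
            cells[i] = "F"   # "N" is invalid inside a row
    return "".join(cells)

print(align_otsl("FFLNFNX", R=2, C=3))  # -> FFLNFFXN
```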
MUSTARD. The dataset covers 1,214 scanned or printed document tables in 12 languages (11 Indic: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu, Urdu; plus Chinese from CTDAR) and 214 scene-text tables in English and Chinese sourced from TabRecSet. Each table is annotated with its OTSL logical structure. The split contains 662 simple tables (no merged cells) and 766 complex tables (at least one merged cell).
What experiments were performed?
The authors train two SPRINT variants:
- $\text{SPRINT}_{\text{FTN}}$: trained on the OTSL-labeled canonical subset of FinTabNet only.
- $\text{SPRINT}_{\text{ALL}}$: trained on a merged dataset combining FinTabNet, PubTabNet, and PubTables-1M canonical subsets.
Training runs for more than 80 epochs with a learning rate of 0.0001 on a single NVIDIA RTX A6000 GPU.
Evaluation metric. The primary metric is TEDS-S (Tree Edit Distance-based Similarity, structure-only), reported for simple tables, complex tables, and overall. Higher is better; scores are in the range [0, 100].
English benchmarks (canonical OTSL subsets). $\text{SPRINT}_{\text{FTN}}$ is evaluated on the FinTabNet canonical test set; $\text{SPRINT}_{\text{ALL}}$ on PubTabNet and PubTables-1M. Results are compared against the TableFormer + OTSL baseline:
| Dataset | TableFormer + OTSL Overall | SPRINT Overall |
|---|---|---|
| PubTabNet | 95.50 | 97.55 |
| FinTabNet | 95.90 | 98.17 |
| PubTables-1M | ~97.70 | 97.68 |
English benchmarks (HTML baselines). On the original PubTabNet and FinTabNet test sets, SPRINT is compared against EDD, GTE, TableFormer, VAST, and MTL-TabNet. SPRINT reaches 95.71% overall TEDS-S on PubTabNet and 98.03% on FinTabNet, approximately 1.5 percentage points below the best-reported scores from MTL-TabNet (97.88% and 98.79% respectively). The authors attribute this gap to MTL-TabNet’s use of upscaled images and dual cascaded decoders trained on HTML.
Inference speed. On a random 400-image FinTabNet subset (50 iterations), SPRINT averages 1.52 seconds per image compared to 2.35 seconds for MTL-TabNet (both measured on the same A6000 GPU) and 3.26 seconds for TableFormer + HTML (a pre-reported figure measured on an AMD EPYC 7763 CPU, not directly comparable).
MUSTARD evaluation. $\text{SPRINT}_{\text{FTN}}$ and the FinTabNet-trained MTL-TabNet checkpoint are compared on all 1,428 MUSTARD tables across 14 language-modality groups. The table grid estimator achieves 54.62% exact match for both rows and columns on MUSTARD (versus over 80% on English datasets), reflecting that TATR was pretrained on English data, but the mean L1 error remains below 0.55, which the authors argue limits TEDS-S penalty in practice.
Ablations. Table 9 in the supplementary material explores combinations of decoder layer count (3, 4, 6, 8) and input image shape (32x32, 32x128, 128x128) across training data conditions. The 6-layer decoder with 128x128 input consistently produces the best or near-best results across all three English benchmarks.
What are the outcomes/conclusions?
On English TSR benchmarks, SPRINT is competitive with methods that use much larger image inputs and vocabularies, approaching but not matching the top HTML-based models. On MUSTARD, SPRINT outperforms MTL-TabNet by an absolute average of 11.12% TEDS-S overall, with gains visible across all 14 language-modality groups (ranging from roughly 2 to 29 percentage points). The largest gains occur in languages where MTL-TabNet performs most poorly (Punjabi: +29.39%, Bengali: +16.94%), which the authors attribute to MTL-TabNet’s implicit reliance on English-like visual features.
The authors acknowledge several limitations. The table grid estimator (TATR) was pretrained exclusively on English documents, so its row/column count estimates are less accurate on MUSTARD, which could suppress TEDS-S scores. Source code and checkpoints for VAST and the OTSL baseline are not publicly available, so direct replication of those comparisons is not possible. MUSTARD contains roughly 100 tables per language, which is a small evaluation set, and the dataset lacks a train split for non-English languages.
The conclusion suggests future work on integrating detected cell bounding boxes with the predicted logical structure to support end-to-end table reconstruction across scripts.
Reproducibility
Models
SPRINT consists of a ResNet-31 encoder with GCA-based attention (Multi-Aspect Global Attention) and a 6-layer transformer decoder. The encoder produces a 512-channel 16x32 feature tensor; decoder intermediate layers have width 2048. Total parameter count is not reported. Model weights are released via the GitHub repository under the MIT license. The architecture adapts the MASTER scene text recognition framework.
Algorithms
- Optimizer: Not explicitly stated; learning rate is 0.0001.
- Training duration: More than 80 epochs.
- Loss: Categorical cross-entropy over the OTSL vocabulary (6 characters plus start/stop tokens; vocabulary size of 8).
- Maximum sequence length: 224 tokens.
- Preprocessing: Images are resized to 128x128 pixels using standard resize operations; no augmentation details are reported.
- Grid estimator: TATR v1.1 (pre-trained on FinTabNet, PubTabNet, PubTables-1M); detection threshold 0.25; NMS IOU threshold 0.25 for the table-row class. Two checkpoint variants are used: `v1.1-all` for PubTabNet, FinTabNet, and MUSTARD evaluations; `v1.1-PubTables-1m` for PubTables-1M evaluation.
- OTSL-to-HTML conversion: Deterministic algorithm released with the code.
Data
- Training data: OTSL-labeled canonical subsets of PubTabNet (320,000 train images), FinTabNet (88,441 train images), and PubTables-1M (522,874 train images), released by Lysak et al. (ICDAR 2023).
- Validation: Internal split of PubTabNet train set; non-overlapping PubTabNet validation images used for final reporting.
- MUSTARD: 1,428 tables used as a held-out evaluation set only; no train split is provided. Tables are sourced from Yojana magazine archives (11 Indic languages), CTDAR documents (102 Chinese tables), and a TabRecSet subset (214 English/Chinese scene tables). The HuggingFace release is at `badrivishalk/MUSTARD` under the MIT license.
Evaluation
- Primary metric: TEDS-S, computed from tree edit distance between predicted and ground-truth HTML structure sequences (cell content filtered out).
- Benchmarks: PubTabNet (original and canonical), FinTabNet (original and canonical), PubTables-1M (canonical), MUSTARD.
- Baselines on English: EDD, GTE, TableFormer (HTML and OTSL), VAST, MTL-TabNet. VAST and the OTSL baseline did not release code/checkpoints, so comparisons rely on pre-reported numbers.
- Baselines on MUSTARD: Only MTL-TabNet is compared, as VAST and the OTSL baseline have no public checkpoints.
- No error bars or significance tests are reported. Results are from single runs.
Hardware
- Training: Single NVIDIA RTX A6000 GPU.
- Inference timing: Single NVIDIA RTX A6000 GPU; 50 iterations over 400 FinTabNet images. SPRINT averages 1.52 seconds per image (1.36s for the SPRINT model itself, 0.16s for post-processing: grid estimation + alignment + HTML conversion). MTL-TabNet was also measured on the A6000 at 2.35 seconds. TableFormer timings (3.26s for HTML, 1.85s for OTSL) are pre-reported figures from the OTSL baseline paper, measured on an AMD EPYC 7763 CPU, not comparable to the A6000 measurements.
- No cost estimates or memory requirements are reported.
- SPRINT’s downsampled 128x128 inputs and compact OTSL vocabulary suggest modest inference memory requirements relative to models using full-resolution images.
BibTeX
@inproceedings{kudale2024sprint,
title={SPRINT: Script-agnostic Structure Recognition in Tables},
author={Kudale, Dhruv and Kasuba, Badri Vishal and Subramanian, Venkatapathy and Chaudhuri, Parag and Ramakrishnan, Ganesh},
booktitle={Document Analysis and Recognition -- ICDAR 2024 -- 18th International Conference, Athens, Greece, August 30--September 4, 2024, Proceedings, Part V},
pages={350--367},
year={2024},
publisher={Springer},
doi={10.1007/978-3-031-70549-6_21}
}
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models
TL;DR
Table2LaTeX-RL constructs a large-scale corpus of approximately 1.2 million table image-to-LaTeX pairs extracted from arXiv source files, then fine-tunes multimodal large language models in two stages: supervised fine-tuning on the full corpus followed by reinforcement fine-tuning via VSGRPO, a dual-reward extension of GRPO that combines a TEDS-Structure reward with a CW-SSIM reward computed on rendered output images. The resulting Qwen2.5-VL-3B model reaches TEDS-Structure of 0.9218 on complex tables, the first reported result above 0.9 on that subset, outperforming models with more than 20 times as many parameters.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The primary contribution is the VSGRPO training framework: a visual-in-the-loop reinforcement learning strategy that renders generated LaTeX to PNG and computes CW-SSIM directly as an optimization signal alongside a structure-level TEDS-Structure reward. The paper devotes most of its space to method design, ablations on reward components, and comparative benchmarks.
Secondary: $\Psi_{\text{Resource}}$: The 1.2M table image-to-LaTeX dataset scraped from arXiv, together with the complexity-stratified test split covering simple, medium, and complex tables, represents a meaningful data contribution to a domain that previously lacked large-scale public training corpora.
What is the motivation?
Scientific documents depend on LaTeX tables to communicate experimental results, and digitizing or reusing those tables requires converting scanned or rendered table images back into source code. Most prior table structure recognition work targets HTML representations, which do not support the full expressiveness of LaTeX: nested multirow/multicolumn headers, mathematical content in cells, and typographic control. The handful of systems that do target LaTeX (LATTE, LaTeXNet, Nougat) either rely on from-scratch training with limited data or do not address the combined difficulty of large, deeply nested structures.
Two intertwined gaps motivate the work. First, no large-scale, publicly available dataset of table image-to-LaTeX pairs existed prior to this work. Second, standard evaluation metrics are poorly suited to the task. TEDS and BLEU compare LaTeX code at the token level, which penalizes semantically equivalent but syntactically different code (e.g., {} wrapper groups that do not affect rendering, or \textbf vs {\bf}). Visual similarity metrics like CW-SSIM capture rendered appearance but ignore structural correctness. Neither alone gives a reliable picture of generation quality.
What is the novelty?
The paper makes three interrelated contributions.
Large-scale table corpus. The authors scrape arXiv LaTeX source files for papers published between October 2017 and April 2023, extract table environments using regular expressions, and clean the code by removing references, color commands, and other non-structural elements. The result is 1,209,986 table-LaTeX pairs. Tables are classified into three complexity tiers based on cell count and the presence of \multirow/\multicolumn commands: complex (over 160 cells), medium (100-160 cells with 2 or more \multirow/\multicolumn commands), and simple (all others). In the training set, simple tables account for approximately 94%, with medium and complex each at roughly 3%.
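A sketch of the three complexity tiers as described above; counting cells from raw LaTeX is nontrivial, so the cell count is taken as a given input here:

```python
import re

def complexity_tier(latex_src: str, n_cells: int) -> str:
    """Classify a table by the tier rules stated above."""
    n_spans = len(re.findall(r"\\multirow|\\multicolumn", latex_src))
    if n_cells > 160:
        return "complex"
    if 100 <= n_cells <= 160 and n_spans >= 2:
        return "medium"
    return "simple"
```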
Hybrid evaluation protocol. The authors propose combining TEDS-Structure (measuring structural tree edit distance, ignoring cell content) with a modified CW-SSIM tuned for binary, high-contrast table images. The CW-SSIM variant applies a one-level Haar wavelet decomposition and computes SSIM per sub-band:
$$ \text{CW-SSIM}(X, Y) = \frac{1}{4} \sum_{i \in \{A,H,V,D\}} \text{SSIM}(c^i_X, c^i_Y) $$
where $A$, $H$, $V$, $D$ denote the approximation, horizontal, vertical, and diagonal sub-bands respectively. TEDS-Structure is defined as:
$$ \text{TEDS-Structure} = 1 - \frac{\text{TED}_{\text{structure}}}{\max(|\mathcal{T}_{\text{pred}}|, |\mathcal{T}_{\text{gt}}|)} $$
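A sketch of the modified CW-SSIM above: one-level Haar decomposition, SSIM per sub-band, averaged over the four sub-bands. The authors' own implementation (`cw_ssim.ipynb` in the repository) may differ in normalization and windowing details.

```python
import numpy as np
import pywt
from skimage.metrics import structural_similarity as ssim

def cw_ssim(x: np.ndarray, y: np.ndarray) -> float:
    """x, y: same-shape grayscale table renders, values in [0, 1]."""
    cax, (chx, cvx, cdx) = pywt.dwt2(x, "haar")
    cay, (chy, cvy, cdy) = pywt.dwt2(y, "haar")
    scores = [
        # Guard against a constant sub-band (range 0) with `or 1.0`.
        ssim(bx, by, data_range=float(bx.max() - bx.min()) or 1.0)
        for bx, by in zip((cax, chx, cvx, cdx), (cay, chy, cvy, cdy))
    ]
    return float(np.mean(scores))
```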
VSGRPO dual-reward reinforcement fine-tuning. Standard GRPO optimizes text generation quality using correctness-based rewards without a separate value network. VSGRPO extends this to include a visual fidelity reward. For each training example, the model generates $N$ candidate LaTeX outputs. Each candidate is compiled to an image; if compilation fails the reward is 0. Successful compilations are scored by CW-SSIM against the ground-truth rendered image, yielding a binary reward at threshold 0.6. In parallel, each candidate is converted to HTML and scored by TEDS-Structure against the ground truth, yielding a second binary reward at threshold 0.9. The combined reward drives the GRPO objective:
$$ J_{\text{RFT}}(\theta) = \mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^{N} \min\left( \frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)} A_i,\ \text{clip}\left(\frac{\pi_\theta(o_i|q)}{\pi_{\theta_{\text{old}}}(o_i|q)}, 1-\varepsilon, 1+\varepsilon\right) A_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\middle\|\, \pi_{\text{ref}}\right) \right] $$
where $\varepsilon = 0.2$ and $\beta = 0.02$, and the group advantage $A_i$ is the z-score of reward $r_i$ within the group of $N$ samples:
$$ A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_N\})}{\text{std}(\{r_1, \ldots, r_N\})} $$
Because LaTeX rendering is non-differentiable, this visual reward signal can only be incorporated via the reward mechanism in RL, not during gradient-based supervised training. That is the core motivation for using GRPO rather than extending SFT.
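A sketch of the dual reward and group advantage described above. Whether the two binary rewards are summed or combined some other way is not restated here, so the sum is an assumption; the thresholds follow the paper.

```python
import numpy as np

def candidate_reward(compiled: bool, cw: float, teds_s: float) -> float:
    if not compiled:                 # failed LaTeX compilation
        return 0.0
    return float(cw >= 0.6) + float(teds_s >= 0.9)

def group_advantages(rewards) -> np.ndarray:
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # z-score within the group

rewards = [candidate_reward(True, 0.71, 0.93),  # both thresholds met -> 2
           candidate_reward(True, 0.55, 0.92),  # only TEDS-S met     -> 1
           candidate_reward(False, 0.0, 0.0)]   # failed compile      -> 0
print(group_advantages(rewards))
```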
What experiments were performed?
Training data. SFT uses all 1,209,986 table-LaTeX pairs. VSGRPO fine-tunes on a curated subset of 5,936 complex-tier tables whose ground-truth LaTeX is under 3,000 characters, chosen to balance structural difficulty and rendering feasibility.
Test data. A separate crawl of arXiv papers from January to November 2024 yields 101,469 entries; from these, 496 simple, 354 medium, and 361 complex tables are sampled as the test set. An external dataset from the LATTE paper serves as an additional generalization benchmark.
Baselines. Commercial: Mathpix. General-purpose MLLMs: GPT-4o, Qwen2.5-VL-72B, InternVL2.5-78B. Task-specific: Nougat, LATTE.
Metrics. CW-SSIM and compile ratio measure rendered image quality. TEDS-Structure and TEDS measure structural and content alignment at the source level. Human preference evaluation covers 200 tables (50/50/100 by complexity), with majority-vote preference across four systems.
Ablations. The paper ablates (1) RL training data selection (simple-only vs. mixed vs. complex-only), (2) reward components (TEDS-Structure alone, CW-SSIM alone, both combined), and (3) the necessity of SFT initialization before VSGRPO.
Hardware. SFT for Nougat runs on 4 nodes of 8xA100 GPUs. InternVL2-1B SFT uses 4 nodes; VSGRPO uses 2 nodes. Qwen2.5-VL-3B SFT and VSGRPO each use 2 nodes.
What are the outcomes/conclusions?
Main benchmark results. On the authors’ own test set, Qwen2.5-VL-3B-VSGRPO achieves CW-SSIM scores of 0.8186 (simple), 0.7236 (medium), and 0.6145 (complex). On complex tables, the CW-SSIM gain over the SFT baseline is +0.034, and the compile ratio reaches 0.9917, matching or exceeding all baselines including Mathpix. On TEDS and TEDS-Structure, Qwen2.5-VL-3B-VSGRPO reaches 0.8673 TEDS and 0.9218 TEDS-Structure on complex tables. The reported TEDS gap over the next-best model is 0.1225 on complex tables. InternVL2.5-78B, a model more than 25 times larger, shows a TEDS collapse on complex tables (0.3379) while Qwen2.5-VL-3B-VSGRPO remains robust.
External benchmark. On the LATTE external dataset (predominantly simple tables), Qwen2.5-VL-3B-VSGRPO achieves CW-SSIM of 0.8225 and TEDS-Structure of 0.9461, outperforming LATTE itself (0.7615 / 0.9445) despite LATTE being a specialist system fine-tuned for this format.
Human evaluation. In the 200-table preference study, Qwen2.5-VL-3B-VSGRPO received the highest vote counts across all complexity levels: 42 vs. 29 for the SFT-only baseline on simple tables, 37 vs. 28 on medium, and 70 vs. 56 on complex.
Ablation findings. Training VSGRPO on complex tables only outperforms both simple-only and mixed-data variants across all metrics. Both reward components contribute independently; their combination achieves the highest performance on every metric. Skipping SFT and applying VSGRPO directly to the pre-trained base model results in substantially lower scores (CW-SSIM 0.4695 vs. 0.6145 on complex), confirming that a strong initialization is required.
Limitations. The authors acknowledge that the visual rendering step in VSGRPO training is computationally expensive. Each candidate output must be compiled to PDF and converted to PNG for CW-SSIM scoring, creating a training bottleneck even with multi-threading. This overhead restricted the VSGRPO training set to 5,936 tables. Additionally, the TEDS metric remains sensitive to syntactically irrelevant LaTeX differences (e.g., empty groups, different bold commands), which the paper illustrates with qualitative examples but does not fully resolve.
Reproducibility
Models
Two model families are reported. InternVL2-1B is fine-tuned via the VLM-R1 framework. Qwen2.5-VL-3B is fine-tuned via the ms-swift infrastructure. Both use full-parameter fine-tuning for SFT. Architecture details (layer count, attention heads) are not described in the paper; they are inherited from the pre-trained base models (Qwen2.5-VL-3B-Instruct from HuggingFace). During VSGRPO, the ViT visual encoder is frozen (--freeze_vit true, --freeze_parameters visual), so only the language decoder weights are updated in the RL phase. Trained weights for Qwen2.5-VL-3B-VSGRPO are publicly released at https://huggingface.co/LLLHHH/Table2Latex-RL (Apache-2.0, BF16 safetensors format, approximately 4B parameters).
Algorithms
SFT trains for one epoch with a maximum output length of 4,096 tokens. InternVL2-1B SFT uses batch size 4 with gradient accumulation steps of 2; Qwen2.5-VL-3B SFT uses batch size 4 with gradient accumulation steps of 2 on 2 nodes. For VSGRPO, InternVL2-1B uses 8 sampled generations per input, batch size 8, gradient accumulation 2, and 2 nodes. Qwen2.5-VL-3B VSGRPO uses 4 generations per input, batch size 4, gradient accumulation 2, and 2 nodes, for 1 epoch. Reward thresholds are 0.6 for CW-SSIM and 0.9 for TEDS-Structure. KL penalty $\beta = 0.02$; PPO clipping $\varepsilon = 0.2$.
The VSGRPO training script (published in the repository at examples/train/grpo/plugin/run_external_rm.sh) reveals additional hyperparameters not stated in the paper: learning rate 1e-5, warmup ratio 0.05, bfloat16 mixed precision, training-time sampling temperature 0.9, maximum completion length 3,000 tokens, maximum image resolution 524,288 pixels (--max_pixels), and 8 GPUs per node via torchrun. The optimizer is not named explicitly in the script but ms-swift defaults to AdamW. Testing uses maximum output length 8,192, batch size 1, temperature 0. Training requires a Docker environment; the README specifies a ModelScope image with CUDA 12.6, PyTorch 2.7.1, vLLM 0.10.1, and ms-swift 3.8.1.
Data
Training data covers arXiv papers from October 2017 to April 2023 (1,209,986 table entries). Test data covers arXiv papers from January to November 2024 (496 simple, 354 medium, 361 complex tables). VSGRPO uses 5,936 complex tables with ground-truth LaTeX under 3,000 characters. The full 1.2M SFT training corpus is not directly downloadable from the repository; instead, the authors provide arxiv_papers_get.py and a table.ipynb Kaggle notebook to re-crawl arXiv source files and reconstruct the dataset. The evaluation split (1,211 images, Apache-2.0) is released at https://huggingface.co/datasets/LLLHHH/Table2LaTeX-RL. The underlying arXiv source files are available under arXiv’s open-access policy, though the license status of individual paper LaTeX sources varies by author. The code repository is Apache-2.0.
Evaluation
CW-SSIM is computed using a Python implementation with Haar wavelet decomposition (provided as cw_ssim.ipynb in the repository). TEDS-Structure and TEDS use tree edit distance over structural HTML representations parsed from the LaTeX. All testing is conducted inside a texlive-full Docker environment to ensure consistent LaTeX rendering. No error bars, significance tests, or seed sensitivity analysis are reported; the NeurIPS checklist explicitly states “No” for statistical significance on the grounds that “running a complete experiment takes one week” and compute resources were limited. Human evaluation uses majority voting by 5 graduate students in computer science on 200 tables (50 simple, 50 medium, 100 complex), with anonymized and randomly shuffled model outputs.
Hardware
SFT for Nougat: 4 nodes of 8xA100 GPUs (32 A100s total). SFT for InternVL2-1B: 4 nodes of 8 GPUs; for Qwen2.5-VL-3B: 2 nodes of 8 GPUs (16 GPUs total). VSGRPO for both models: 2 nodes of 8 GPUs (16 GPUs total); the training script confirms 8 processes per node via --nproc_per_node=8. The GPU type for all non-Nougat experiments is not stated in the paper (only A100 is explicitly named for Nougat). Total GPU-hours, cost estimates, and inference latency figures are not reported. The NeurIPS checklist self-reports “Yes” on compute resources but the paper body only enumerates node counts. The rendering bottleneck in VSGRPO (PDF compilation per candidate per training step) is noted qualitatively but not measured; the authors state it was the reason for limiting VSGRPO training to 5,936 tables.
BibTeX
@inproceedings{ling2025table2latexrl,
title={Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models},
author={Ling, Jun and Qi, Yao and Huang, Tao and Zhou, Shibo and Huang, Yanqin and Yang, Jiang and Song, Ziqi and Zhou, Ying and Yang, Yang and Shen, Heng Tao and Wang, Peng},
booktitle={Advances in Neural Information Processing Systems},
year={2025}
}
SepFormer: Coarse-to-Fine Separator Regression for Table Structure Recognition
TL;DR
SepFormer is a real-time table structure recognition model that directly regresses row and column separator lines using a DETR-style architecture. It combines the RT-DETR backbone (ResNet-34 + hybrid encoder) with dual decoder branches, each organized as two sequential transformer decoder stacks: a coarse decoder that predicts straight single-line separators, and a fine decoder that refines $P$ evenly sampled points from each coarse line into a line-strip. This coarse-to-fine decomposition avoids segmentation masks entirely and eliminates the ROIAlign-based merge stage found in split-and-merge systems. On four benchmarks (SciTSR-COMP, PubTabNet, WTW, iFLYTAB), SepFormer runs at 25.6 FPS on average while staying within 0.6-1.2% of the top non-real-time methods on each dataset.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ The main contribution is the architecture: a coarse-to-fine dual decoder design for direct separator regression. The paper’s structure centers on model design, ablation studies, and comparisons against SOTA on four benchmarks.
Secondary: None. No new dataset or benchmark is released.
What is the motivation?
Table structure recognition (TSR) aims to reconstruct the logical layout of tables in document images as machine-readable representations. Within the split-and-merge family, the two dominant paradigms are segmentation-based methods (which predict pixel masks and apply post-processing to extract separator lines) and regression-based methods (which directly output line coordinates).
Segmentation-based methods such as SEMv2 and RTSR require high-resolution feature maps and additional post-processing steps to convert masks to lines. Regression-based approaches such as TSRFormer and its DQ-DETR variant already eliminate the segmentation stage, but retain a merge module relying on ROIAlign for the second stage, adding both computational cost and a dependency on the quality of the first stage.
SepFormer’s goal is to collapse the split-and-merge pipeline into a single forward pass. By framing TSR purely as separator line regression using a DETR architecture, the method removes both segmentation masks and the ROIAlign merge module, targeting practical real-time deployment (the authors note their system is deployed in a commercial product).
What is the novelty?
Architecture overview
SepFormer builds on the RT-DETR backbone: a ResNet-34 encoder extracts multi-scale feature maps $C_2, C_3, C_4$ at strides 8, 16, and 32; a hybrid encoder (one transformer encoder layer on $C_4$ plus a cross-scale fusion module) produces enhanced feature maps $\{M_1, M_2, M_3\}$. These are flattened and concatenated into a single sequence $M \in \mathbb{R}^{S \times C}$ with $C = 256$ channels, which is then fed into two parallel decoder branches for rows and columns respectively.
Coarse-to-fine decoder
Each branch contains two sequential stages (Fig. 2 in the paper).
Query selection. A detection head operating on the full encoder sequence scores $S$ candidate line proposals and selects the top $K = 300$ positions (by classification confidence) as initial reference points for the coarse decoder. Unlike the box proposals in Deformable DETR, SepFormer uses line proposals: row proposals take the form $\{x_i, y_i, x_i + 2^{l-1} \times s, y_i\}$ and column proposals $\{x_i, y_i, x_i, y_i + 2^{l-1} \times s\}$.
Coarse decoder. Three deformable transformer decoder layers refine the selected queries iteratively. The output for the $k$-th query is a single-line separator $l^k = \{(x_1^k, y_1^k), (x_2^k, y_2^k)\}$ and a classification score $c^k$.
Sampling. For each accepted coarse separator, $P = 16$ reference points are evenly interpolated between its two endpoints:
$$ \text{Sampling} = \left\{ \left(1 - \frac{t}{P}\right)(x_1, y_1) + \frac{t}{P}(x_2, y_2) \;\middle|\; t = 1, \ldots, P \right\} $$
Fine decoder. Three more deformable transformer decoder layers refine the sampled reference points, producing a line-strip $ls^k = \{(x_i^k, y_i^k) \mid i = 1, \ldots, P\}$. The fine stage reuses the same content queries and encoder memory; only the reference points change.
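To make the sampling step concrete, here is a minimal NumPy sketch (not the authors' code) that interpolates $P$ reference points along a coarse separator according to the formula above:

```python
import numpy as np

def sample_reference_points(p1, p2, P: int = 16) -> np.ndarray:
    """Evenly interpolate P points from p1 (exclusive) to p2 (inclusive),
    i.e. (1 - t/P) * p1 + (t/P) * p2 for t = 1, ..., P."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    t = np.arange(1, P + 1)[:, None] / P  # shape (P, 1)
    return (1.0 - t) * p1 + t * p2        # shape (P, 2)

# Reference points along a horizontal row separator (normalized coordinates)
points = sample_reference_points((0.1, 0.5), (0.9, 0.5))
```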
Loss function
The training objective combines four terms per branch. Hungarian matching uses only single-line ($L_1$) distance for pairing predictions with ground-truth lines:
$$ \mathcal{L}_{\text{match}} = \sum_{i=1}^{N_{\text{row}}} \left( \lambda_{\text{coord}} \left| l_{gt}^i - l^{\sigma(i)} \right| + \lambda_{\text{cls}} \, c^{\sigma(i)} \right) $$
After matching, the full loss for the row branch is:
$$ \mathcal{L}^{\text{row}} = \lambda_1 \mathcal{L}_{\text{cls}}^{\text{row}} + \lambda_2 \mathcal{L}_{\text{angle}}^{\text{row}} + \lambda_3 \mathcal{L}_{\text{line}}^{\text{row}} + \lambda_4 \mathcal{L}_{\text{linestrip}}^{\text{row}} $$
The angle loss uses cosine similarity between the predicted and ground-truth line direction vectors, with an additional penalty term $|c_{gt}^n| \times 4$ in the denominator to increase sensitivity for short separators:
$$ \mathcal{L}_{\text{angle}}^{\text{row}} = \frac{1}{N_{\text{row}}} \sum_{n=1}^{N_{\text{row}}} \left( 1 - \frac{\cos(c_{gt}^n, c_n)}{|c_{gt}^n| \times 4} \right) $$
The coordinate losses $\mathcal{L}_{\text{line}}$ and $\mathcal{L}_{\text{linestrip}}$ are $L_1$ distances on single-line and line-strip coordinates respectively. The total loss sums row and column branches:
$$ \mathcal{L}^{\text{sep}} = \mathcal{L}^{\text{row}} + \mathcal{L}^{\text{col}} $$
Loss coefficients are set to $\lambda_1 = 1, \lambda_2 = 1, \lambda_3 = 3, \lambda_4 = 1$; matching coefficients to $\lambda_{\text{cls}} = 2, \lambda_{\text{coord}} = 3$.
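A hedged PyTorch sketch of the angle loss as written above; the $(N, 2)$ direction-vector shapes and the epsilon clamp are assumptions not stated in the paper:

```python
import torch
import torch.nn.functional as F

def angle_loss(c_pred: torch.Tensor, c_gt: torch.Tensor) -> torch.Tensor:
    """1 - cos(c_gt, c_pred) / (4 * ||c_gt||), averaged over separators.
    Dividing by the GT length gives short separators a stronger gradient."""
    cos = F.cosine_similarity(c_pred, c_gt, dim=-1)      # (N,)
    scale = 4.0 * c_gt.norm(dim=-1).clamp_min(1e-6)      # assumed epsilon
    return (1.0 - cos / scale).mean()
```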
What experiments were performed?
Datasets
- SciTSR-COMP: 2,885 train / 716 test complex PDF tables. Metric: cell adjacency relationship F1.
- PubTabNet: 500,777 train / 9,115 validation / 9,138 test (validation set used, as the test labels are not released). Metric: TEDS-Struct.
- WTW: 10,970 train / 3,611 test; real-world wired tables including inclined, curved, and occluded examples. Metric: cell adjacency F1.
- iFLYTAB: 12,104 train / 5,187 test; wired and wireless tables from digital documents and camera capture. Metric: cell adjacency F1. Ablations also run on this dataset.
Baselines
Real-time comparison: RTSR (segmentation-based, 34.4 FPS average). Non-real-time comparisons on each benchmark include TSRFormer w/ DQ-DETR, SEMv2, SEMv3, GrabTab, LORE++, GridFormer, and TRUST among others.
Ablation studies
Table 5 in the paper evaluates two design choices on iFLYTAB: (1) two-stage vs. one-stage decoders and (2) single-line vs. line-strip matching. A one-stage 3-layer decoder achieves 90.8% F1; one-stage 6-layer achieves 89.7% (worse, suggesting overfitting at depth). The two-stage 3-layer decoder with single-line matching reaches 93.8%, a gain of 3.0% over the one-stage 3-layer baseline. Line-strip matching degrades performance in all configurations, indicating that the coarser $L_1$ single-line distance provides a better optimization signal for the Hungarian algorithm.
Table 6 in the paper shows the angle loss contributes a marginal +0.2% F1 on iFLYTAB but improves consistency for short separators.
What are the outcomes/conclusions?
Speed: SepFormer averages 25.6 FPS across all four test datasets (RTX 3060), versus RTSR’s 34.4 FPS. It is the second real-time-capable method reported and the only regression-based real-time TSR system in the comparison.
Accuracy on SciTSR-COMP: 98.6% F1 (99.0 P / 98.2 R), within 0.7% of the best non-real-time method LORE++ (99.3%) and above RTSR (98.3%).
Accuracy on PubTabNet: 96.8% TEDS-Struct, 1.1 points above RTSR (95.7%) and within 0.7% of TSRFormer w/ DQ-DETR and SEMv3 (both at 97.5%).
Accuracy on WTW: 93.9% F1 (93.7 P / 94.2 R), above RTSR (92.9%) and within 1.2% of SEMv3 / GrabTab / LORE++ (all at 95.1%).
Accuracy on iFLYTAB: 93.8% F1 (94.6 P / 93.2 R), above RTSR (91.1%) and within 0.6% of SEMv3 (94.4%).
Limitations: The authors identify three failure modes: (1) heavily warped images where text lines exhibit varying degrees of warping within the same image; (2) two adjacent separators that are collapsed into a single predicted separator (attributed to insufficient discrimination in the $L_1$ single-line matching criterion); (3) tables with very high row or column density where over- or under-detection occurs. No code or model weights are publicly released.
Reproducibility
Models
- Backbone: ResNet-34 pretrained on ImageNet. Three feature levels $C_2$, $C_3$, $C_4$ at strides 8, 16, 32.
- Hybrid encoder: one transformer encoder layer (on $C_4$ only) plus cross-scale fusion; $C = 256$ channels.
- Dual decoder branches: query selection + 3 coarse deformable decoder layers + 3 fine deformable decoder layers per branch. 8 attention heads, 4 sampling points per deformable attention.
- $K_{\text{row}} = K_{\text{col}} = 300$ top queries selected.
- Model weights are not publicly released.
Algorithms
- Optimizer: not specified (DETR-style models typically use AdamW, but the paper does not state it).
- Learning rate: $3 \times 10^{-5}$; cosine annealing over 100 epochs (20 epochs for PubTabNet).
- Input resolution: longer side scaled to one of $\{864, 896, 928, 960\}$ during training (aspect ratio preserved); 896 at inference.
- Confidence thresholds: $\tau_{\text{row}} = \tau_{\text{col}} = 0.95$.
- $P = 16$ sampling points per separator.
Data
- SciTSR: MIT license. 12,000 train / 3,000 test; SciTSR-COMP subset used (2,885 train / 716 test).
- PubTabNet: CDLA-Permissive-1.0. ~500k train; validation set (9,115) used for evaluation.
- WTW: License unknown. 10,970 train / 3,611 test.
- iFLYTAB: Introduced with SEMv2. License unknown. 12,104 train / 5,187 test.
Evaluation
- Cell adjacency relationship F1 for SciTSR-COMP, WTW, and iFLYTAB.
- TEDS-Struct for PubTabNet (test labels unreleased; validation set used).
- No error bars, confidence intervals, or multi-seed results reported.
- FPS measured on RTX 3060; training on a single NVIDIA V100 32 GB.
- RTSR FPS figures taken from the prior paper [31]; direct head-to-head hardware comparison is approximate.
Hardware
- Training: single NVIDIA V100 32 GB.
- Inference: NVIDIA RTX 3060.
- GPU-hours, total compute cost, and memory at inference are not reported.
BibTeX
@inproceedings{nguyen2025sepformer,
title={SepFormer: Coarse-to-fine Separator Regression Network for Table Structure Recognition},
author={Nguyen, Nam Quan and Pham, Xuan Phong and Tran, Tuan-Anh},
booktitle={Proceedings of the International Conference on Document Analysis and Recognition (ICDAR)},
year={2025}
}
TableCenterNet: One-Stage Parallel Regression for Table Structure Recognition
TL;DR
TableCenterNet is a one-stage, end-to-end table structure recognition network built on the CenterNet framework. It unifies spatial bounding-box regression and logical row/column index prediction into a single parallel multi-task pass, avoiding the serial two-stage pipelines of prior work. On the TableGraph-24k benchmark, it improves logical location accuracy by 7.2 percentage points over the prior best while running 15.7 times faster than LORE at comparable overall performance.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is a new neural network architecture and training recipe. The paper proposes a regression-based approach to jointly predict spatial and logical cell structure from a single model, with ablation studies, SOTA comparisons across three benchmarks, and efficiency analysis all centered on validating the architectural design choices.
What is the motivation?
Table structure recognition (TSR) requires a system to recover both the physical location of each cell (bounding quadrilateral) and its logical position in the table grid (starting/ending row and column indices). Prior approaches fell into one of two camps, each with significant drawbacks.
The classic two-stage pipeline first detects row and column regions or individual cells, then runs a separate second-stage model to infer logical indices. Methods like TGRNet and LORE are representative: TGRNet uses a Graph Convolutional Network on detected cells, and LORE cascades a Transformer over spatial detections. These require independent training of multiple sub-modules and serial inference, resulting in high latency and compounding prediction errors.
Existing one-stage methods (TableNet, Cycle-CenterNet, SCAN, TRACE) simplified training but each suffered notable limitations in scenario coverage: they either assumed clean, bordered tables or were sensitive to geometric deformations, and none predicted logical indices directly in a single pass without complex post-processing.
The central gap was the absence of a method that could handle diverse real-world table scenarios (borderless, distorted, multi-column spanning, curved) while predicting both spatial and logical structure from a single model with simple training and fast inference.
What is the novelty?
TableCenterNet extends the CenterNet object detection framework into a multi-task regression architecture. The key idea is to represent the logical structure of the table as a pair of continuous interpolation maps (one for rows, one for columns) that can be regressed directly from features, then use those maps to read off integer row/column indices for any detected cell.
Overall architecture. A CNN backbone (DLA-34 or StarNet-s3) produces a shared feature map. Six parallel regression heads decode from that shared representation:
- A keypoint heatmap for cell corner detection.
- Keypoint offsets.
- Two sets of offset vectors: $\hat{u}$ (center-to-corner) and $\hat{v}$ (corner-to-center).
- A row interpolation map $\hat{I}_r \in \mathbb{R}^{H \times W}$.
- A column interpolation map $\hat{I}_c \in \mathbb{R}^{H \times W}$.
- Per-cell row/column span regression $\hat{s}_i = \{\hat{s}_i^r, \hat{s}_i^c\}$.
Spatial location regression. The spatial head follows a modified CenterNet/Cycle-CenterNet approach. Each cell is represented by its four corner points, and the network regresses offset vectors between corners and cell centers. The spatial loss is a weighted combination of L1 losses on these offset vectors, plus a regularization term on invalid (unmatched) corner-to-center vectors:
$$\mathcal{L}_{\text{spatial}} = \frac{1}{8n} \sum_{i=1}^{n} \omega(i) \left(\lambda_u \mathcal{L}_1(u_i, \hat{u}_i) + \lambda_v \mathcal{L}_1(v_i, \hat{v}_i)\right) + \lambda_e \mathcal{L}_e$$
where $\omega(i)$ is a sine-based quality weighting function and $\lambda_u = 1.0$, $\lambda_v = 0.5$, $\lambda_e = 0.2$.
Interpolation maps and logical location. The key novelty is how logical indices are obtained without requiring a graph or sequence model. Before training, a Delaunay triangulation-based polygon interpolation algorithm (Algorithm 1) renders a row interpolation map $I_r$ and column interpolation map $I_c$ for each training image. Each pixel in $I_r$ is assigned a value interpolated from the logical row boundaries of the cells overlapping that pixel. Once trained, the network regresses $\hat{I}_r$ and $\hat{I}_c$ directly. At inference, the logical indices of cell $i$ are extracted by sampling these maps at the upper-left corner of the predicted cell:
$$\begin{aligned} \hat{r}_i^{st} &= \text{round}(\hat{I}_r[\hat{y}_{i,1}, \hat{x}_{i,1}]) \\ \hat{r}_i^{ed} &= \hat{r}_i^{st} + \lfloor \hat{s}_i^r \rfloor - 1 \\ \hat{c}_i^{st} &= \text{round}(\hat{I}_c[\hat{y}_{i,1}, \hat{x}_{i,1}]) \\ \hat{c}_i^{ed} &= \hat{c}_i^{st} + \lfloor \hat{s}_i^c \rfloor - 1 \end{aligned}$$
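A minimal sketch of this readout (variable names are hypothetical): sample each map at the predicted upper-left corner, round to the starting index, and extend by the regressed span.

```python
import numpy as np

def logical_indices(I_r, I_c, corner_xy, span_r, span_c):
    """Read (row_start, row_end, col_start, col_end) for one detected cell
    from the regressed interpolation maps and per-cell spans."""
    x, y = corner_xy                          # predicted upper-left corner
    r_st = int(round(I_r[y, x]))              # starting row index
    c_st = int(round(I_c[y, x]))              # starting column index
    r_ed = r_st + int(np.floor(span_r)) - 1   # ending row via regressed span
    c_ed = c_st + int(np.floor(span_c)) - 1
    return r_st, r_ed, c_st, c_ed
```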
Logical location supervision. Two auxiliary losses guide the interpolation map regression. The boundary loss $\mathcal{L}_{\text{boundary}}$ applies higher weight near logical boundaries (integer values in the map) via a squared sine weight function. The span loss $\mathcal{L}_{\text{span}}$ enforces consistency between the regressed per-cell span $\hat{s}_i$ and the spans implied by reading the interpolation map at the cell’s corners:
$$\mathcal{L}_{\text{logical}} = \mathcal{L}_{\text{boundary}} + \mathcal{L}_{\text{span}}$$
Selective gridding. For well-structured datasets without significant deformation, the authors propose decomposing merged (spanning) cells into unit grid cells before generating interpolation maps. This unifies the interpolation style and improves regression accuracy on such datasets.
Overall loss:
$$\mathcal{L}_{\text{overall}} = \mathcal{L}_{\text{keypoint}} + \mathcal{L}_{\text{spatial}} + \mathcal{L}_{\text{logical}}$$
What experiments were performed?
Datasets. Three public benchmarks were used:
- ICDAR-2013 (156 tables, PDF documents from government websites; modified ICDAR-2013.c version with cell bounding boxes; 80/20 train/test split following prior work).
- WTW (Wired Table in the Wild; 10,970 training and 3,611 test images from diverse real-world scenes including curved, occluded, and blurred tables).
- TableGraph-24k (20,000 training, 2,000 validation, 2,000 test images of document tables from arXiv papers; includes wired, borderless, and partial-border tables).
Metrics. Physical coordinate quality was assessed by precision, recall, and F1 at IoU threshold 0.5 (and 0.9 for WTW). Logical location was assessed by: (1) accuracy of logical location indices, (2) F1 of cell adjacency relations, and (3) TEDS (Tree-Edit-Distance-based Similarity).
Baselines. Comparisons were made against TGRNet, LORE, Cycle-CenterNet, SCAN, TRACE, FLAG-Net, NCGM, LGPMA, TabStrNet, TOD, TSRFormer, and CascadeTabNet, covering the main prior paradigms.
Ablation. Seven ablation experiments on WTW investigated: (a) whether cell-span regression helps, (b) whether the span-supervised loss $\mathcal{L}_{\text{span}}$ helps beyond direct span regression, (c) whether the boundary loss $\mathcal{L}_{\text{boundary}}$ adds value, and (d) whether interpolation map regression is better than regressing logical indices directly from upper-left or all four corners.
Two model variants. TableCenterNet-D uses DLA-34 as the backbone; TableCenterNet-S uses the lighter StarNet-s3.
What are the outcomes/conclusions?
ICDAR-2013. TableCenterNet-D improves physical F1 by 0.6 percentage points and logical accuracy by 4.3 percentage points over LORE. TableCenterNet-D and TableCenterNet-S both exceed Cycle-CenterNet and SCAN on adjacency metrics when trained only on WTW (no ICDAR-2013 training data).
WTW. At IoU 0.5, both variants match or exceed all prior methods on physical F1 (97.3%). Logical accuracy reaches 83.0% (TableCenterNet-D); LORE reports a higher adjacency F1 (95.1%) but only 82.9% TEDS, while TableCenterNet's TEDS of 91.7% improves 1.0 pp over SCAN's 90.7%. At the stricter IoU threshold of 0.9, physical F1 (91.3%) improves 13 percentage points over Cycle-CenterNet.
TableGraph-24k. TableCenterNet-D achieves 95.1% overall logical accuracy versus 87.9% for LORE (+7.2 pp), with near-parity on physical F1 (96.0%; LORE does not report physical F1, and TGRNet's baseline is 90.6%).
Efficiency. Using the same DLA-34 backbone, TableCenterNet-D has 27.3% fewer parameters (16.82 M vs. 27.86 M) and runs 15.7 times faster than LORE (227 ms vs. 3788 ms per image at 1024x1024). TableCenterNet-S (6.27 M parameters, 50.3 GFLOPs, 215 ms) offers further reduction suitable for edge deployment.
Ablation findings. Interpolation map regression outperforms direct logical index regression from either one corner (–3.2% Acc) or four corners (–1.2% Acc). The span loss $\mathcal{L}_{\text{span}}$ contributes the largest single gain (+6.5% Acc on WTW). The boundary loss adds +0.6% Acc. Combined, they account for the majority of the logical accuracy improvement.
Limitations acknowledged. The authors note that the WTW dataset focuses on bordered tables, so the method’s performance on diverse borderless scenes in the wild is not fully validated. Future work mentions collecting more challenging wireless table data. The paper does not report results on PubTables-1M or other large-scale benchmarks, and comparisons on TableGraph-24k are limited to two prior methods.
Reproducibility
Models
- Two backbone variants are evaluated: DLA-34 (18.8 M parameters in LORE; 16.82 M in TableCenterNet-D) and StarNet-s3 (TableCenterNet-S, 6.27 M parameters). Exact layer configurations are not detailed beyond the reference to the original DLA and StarNet papers.
- Six parallel regression heads, each with 256 hidden channels, appended to the backbone’s output feature map within the CenterNet regression framework.
- Backbones initialized from ImageNet-pretrained weights.
- Code released under Apache-2.0 at https://github.com/dreamy-xay/TableCenterNet. No information on whether pretrained TSR model weights are released separately.
Algorithms
- Optimizer: Adam.
- Batch size: 22 for ICDAR-2013 and WTW; 38 for TableGraph-24k.
- Training: 200 epochs. Learning rate starts at $1.25 \times 10^{-4}$, decayed to $1.25 \times 10^{-5}$ at epoch 140 and $1.25 \times 10^{-6}$ at epoch 180.
- Input images resized to 1024x1024 (768x768 for TableGraph-24k), aspect ratio maintained with padding.
- Data augmentation and normalization applied (specific augmentation types not enumerated).
- ICDAR-2013 training used fine-tuning from a WTW-trained model for 100 epochs, with ICDAR-2013 training images border-padded by 100 black pixels on all sides.
- Interpolation map GT generated once before training using Algorithm 1 (Delaunay triangulation-based polygon interpolation).
- Loss hyperparameters: $\lambda_u = 1.0$, $\lambda_v = 0.5$, $\lambda_e = 0.2$.
Data
- ICDAR-2013.c: modified version with cell bounding boxes rather than word boxes; 258 table images after preprocessing; 80/20 train/test split.
- WTW: 10,970 training / 3,611 test; publicly available; physical and logical coordinates with quadrilateral physical boxes.
- TableGraph-24k: 20,000 train / 2,000 val / 2,000 test; sourced from arXiv scholarly articles.
- All three datasets are publicly available. Licenses not stated in the paper.
Evaluation
- Physical coordinate metrics: precision, recall, F1 at IoU 0.5 (and additionally IoU 0.9 on WTW).
- Logical location accuracy: per-index accuracy of start/end row/col.
- Adjacency relation F1: follows ICDAR-2013 competition evaluation protocol.
- TEDS: Tree-Edit-Distance-based Similarity between predicted and ground-truth HTML table representations.
- No error bars or confidence intervals are reported. Single-run results throughout.
- Comparisons are not always on identical settings (some baselines use different training data marked with $\ddagger$), which limits direct head-to-head conclusions.
Hardware
- Training hardware: two NVIDIA GeForce RTX 3090 GPUs (24 GB VRAM each).
- Inference time measured at 1024x1024 input on the WTW test set; TableCenterNet-D: 227.4 ms/image; TableCenterNet-S: 214.8 ms/image. Hardware specification for inference timing is not separately reported (likely the same workstation).
- No cloud compute cost or energy consumption figures are provided.
- The compact TableCenterNet-S (6.27 M parameters, 50.3 GFLOPs) is described by the authors as feasible for edge device deployment.
BibTeX
@article{xiao2025tablecenternet,
title={TableCenterNet: A one-stage network for table structure recognition},
author={Xiao, Anyi and Yang, Cihui},
journal={arXiv preprint arXiv:2504.17522},
year={2025}
}
TRACE: Table Reconstruction Aligned to Corner and Edges
TL;DR
TRACE is a single-model, end-to-end table recognition system that detects corners and edges at the pixel level and reconstructs cell grids from those low-level features without needing a separate table detector. It also introduces SubTableBank, an in-house dataset with per-cell border visibility annotations. At publication, the authors report top results on ICDAR 2013 and WTW.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (Methodological): The core contribution is a new architecture and post-processing pipeline for bottom-up table reconstruction. The paper provides ablations, baselines, and comparisons across two benchmarks to validate that the corner-and-edge prediction approach works better than prior two-stage pipelines.
Secondary: $\Psi_{\text{Resource}}$ (Infrastructure): The authors describe an in-house dataset, SubTableBank, with annotated cell bounding boxes and per-edge visibility flags. A portion is promised for public release, though no public link is provided in this preprint.
What is the motivation?
Table recognition in document images is typically split into two sequential steps: table detection (TD) to localize the table region, and table structure recognition (TSR) to identify rows, columns, and spanning cells. Running two separate models is expensive at training and inference time, and errors from the detection stage propagate into structure recognition with no feedback between the two. The authors argue that the field has studied end-to-end approaches only sparsely, and that the few examples that exist still require parallel branches for bordered versus borderless tables, or rely on OCR input to handle empty cells.
The deeper motivation is conceptual: a table is a hierarchical object built from cells, and each cell is a rectangle defined by its four corners and four edges. If a model can detect those fundamental visual elements directly, cells and tables can be assembled bottom-up through simple post-processing, without ever needing to detect the high-level bounding box first.
What is the novelty?
The central contribution is the framing of table reconstruction as low-level feature prediction rather than object detection.
A single U-Net style CNN with a ResNet-50 backbone outputs five segmentation maps:
- one corner heatmap (Gaussian blobs at every cell corner)
- two explicit edge maps (visible horizontal and vertical borders)
- two implicit edge maps (invisible horizontal and vertical separators)
The distinction between explicit and implicit edges is important for borderless tables. Previous split-merge methods either ignore invisible lines altogether or require separate model branches. TRACE instead labels every cell border by visibility at annotation time and predicts both types jointly in a shared network.
The training objective is a pixel-wise MSE loss across all five channels:
$$ L = \sum_{p}\sum_{i} \left\| S_{i}(p) - S_{i}^{\ast}(p) \right\|_{2}^{2} $$
where $S_{i}(p)$ is the ground truth for the $i$-th map at pixel $p$, and $S_{i}^{\ast}(p)$ is the model prediction.
The post-processing pipeline reconstructs tables through a split-merge strategy. Explicit and implicit edge maps are binarized and projected onto the horizontal and vertical axes. The midpoint of each projected edge cluster becomes a separator. For implicit horizontal lines where rows contain empty cells, ambiguity in the projection is resolved using corner map peak positions. Spanning cells are recovered by checking whether an edge is present at the midpoint between adjacent cell candidates; if not, they are merged. Table location is finally derived from the bounding coordinates of all reconstructed cells.
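To make the split step concrete, here is a simplified NumPy sketch (not the authors' code): binarize an edge map, project it onto one axis, and take the midpoint of each covered run as a separator coordinate. Thresholds follow the reproducibility notes below (0.5 explicit, 0.2 implicit); the corner-based disambiguation for implicit rows is omitted.

```python
import numpy as np

def separators_from_edge_map(edge_map: np.ndarray, threshold: float = 0.5):
    """Project a horizontal-edge probability map onto the y axis and return
    the midpoint of each run of covered rows as a separator position."""
    covered = (edge_map >= threshold).any(axis=1)  # per-row coverage profile
    separators, start = [], None
    for y, on in enumerate(covered):
        if on and start is None:
            start = y                                # run begins
        elif not on and start is not None:
            separators.append((start + y - 1) // 2)  # midpoint of the run
            start = None
    if start is not None:                            # run touching the border
        separators.append((start + len(covered) - 1) // 2)
    return separators
```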
An optional image rectification step applies the Hough line transform followed by a homography to correct perspective distortion before reconstruction. This step is mostly relevant for scene-image tables in the WTW dataset.
What experiments were performed?
The authors evaluate on two public benchmarks.
ICDAR 2013 Table Competition contains 156 PDF-rendered document tables from EU/US government websites. The official evaluation uses character-level recall, precision, and F1 for table detection, along with Purity and Completeness scores. Table structure recognition is evaluated with adjacency-relation recall and precision. TRACE is tested without any fine-tuning on ICDAR 2013 data.
Wired Table in the Wild (WTW) includes 10,970 training and 3,611 test images of tables captured in natural scenes with perspective distortion, rotation, and varied lighting. Detection is measured at cell level with IoU = 0.9. Structure recognition is measured by cell adjacency relation recall and precision using IoU = 0.6 for cell matching.
Baselines: For ICDAR 2013, comparisons include Nurminen (PDF-based), TableNet, DeepDeSRT, GTE, SPLERGE, LGPMA, and GTE with ground-truth table crops. For WTW, comparisons include Cycle-CenterNet (TD+TSR), TSRFormer (TSR-only with GT table), and NCGM (TSR-only with GT table).
Training setup: ResNet-50 pretrained on ImageNet, input resized to 1280 on the long side, Adam optimizer, initial learning rate $1 \times 10^{-4}$ decayed every 10k steps, 100k total iterations, batch size 12, Online Hard Negative Mining, and standard augmentations. The in-house dataset of 9,717 images (7,783 train / 971 val / 963 test) is used for training.
What are the outcomes/conclusions?
On ICDAR 2013 table detection, TRACE reaches F1 = 97.53%, with the highest Completeness (150) and Purity (147) scores, outperforming all image-based methods on completeness and purity even though GTE leads on raw F1.
On ICDAR 2013 structure recognition, TRACE reaches F1 = 97.46% end-to-end, the highest reported result and the only method evaluated without pre-cropped GT table regions. Methods with GT crops score lower (GTE with GT at 96.24%).
On WTW structure recognition, TRACE achieves F1 = 94.5%, the best among compared methods, again without requiring GT table regions. Cell-level detection at IoU = 0.9 is 64.8% F1, a figure the authors attribute to annotation imprecision on small cells rather than model failures; at IoU = 0.7 the same model scores 94.9% F1.
Bordered vs. borderless breakdown (ICDAR 2013): TSR F1 on bordered tables is 99.4%, while borderless tables score 93.85%. The primary failure modes involve ambiguous implicit edge placement, mixed explicit-implicit separators along the same line, and partial table detection when cell content spacing is large.
Limitations acknowledged:
- Implicit edge prediction is the main source of errors, especially when content gaps are large or mixed-type separators appear.
- The model uses a CNN backbone (ResNet-50) without global attention; the authors suggest Swin Transformer or similar architectures as future work.
- Evaluation on tag-generation benchmarks (SciTSR, PubTabNet, TableBank) is out of scope because TRACE does not produce text content and those metrics require OCR.
- The internal dataset used for training is not yet publicly available at the time of publication.
Reproducibility
Models
- Architecture: U-Net style encoder-decoder, ResNet-50 backbone, five-channel segmentation head (one corner map, four edge maps).
- No pretrained TRACE weights are released.
- Backbone initialized from ImageNet-pretrained ResNet-50.
Algorithms
- Optimizer: Adam.
- Learning rate: $1 \times 10^{-4}$, decayed every 10k iterations.
- Total training: 100k iterations, batch size 12.
- Online Hard Negative Mining applied during training.
- Augmentations: color jitter, random rotations, random cropping.
- Binarization thresholds: 0.5 for explicit edge maps, 0.2 for implicit edge maps.
- Edge length filter: edges shorter than 25% of the corresponding table dimension are discarded before the split step.
- No fine-tuning performed on ICDAR 2013; the model trained on the in-house dataset is applied directly.
Data
- In-house training dataset: 9,717 annotated document images (financial and scientific documents plus some from TableBank). Split: 7,783 train / 971 val / 963 test.
- Annotations include cell bounding boxes with per-edge visibility flags (explicit vs. implicit).
- The dataset is described as “SubTableBank.” The authors state they “will soon make a portion” available publicly; no URL is provided in this preprint.
- WTW dataset (public): 10,970 train / 3,611 test; wired tables only from natural scenes. Source: Long et al., ICCV 2021.
- ICDAR 2013 benchmark: 156 tables from EU/US government PDFs; used for testing only.
Evaluation
- ICDAR 2013 TD: character-level recall, precision, F1, Purity, Completeness (official protocol).
- ICDAR 2013 TSR: adjacency-relation recall and precision.
- WTW: cell-level recall/precision/F1 at IoU = 0.9 for detection; adjacency-relation recall/precision at IoU = 0.6 for structure.
- The paper notes that IoU = 0.9 for cell detection is strict relative to annotation quality, and IoU = 0.7 may be more appropriate for small cells. Additional results at IoU 0.8, 0.7, and 0.6 are provided.
- No error bars or multiple-run statistics are reported.
Hardware
- Training hardware not specified.
- No inference latency or throughput numbers are reported.
- No cost or energy estimates provided.
BibTeX
@inproceedings{baek2023trace,
title={TRACE: Table Reconstruction Aligned to Corner and Edges},
author={Baek, Youngmin and Nam, Daehyun and Surh, Jaeheung and Shin, Seung and Kim, Seonghyeon},
booktitle={International Conference on Document Analysis and Recognition},
year={2023}
}
CascadeTabNet: End-to-End Table Detection and Structure Recognition
TL;DR
CascadeTabNet applies a three-stage Cascade Mask R-CNN with an HRNet backbone to jointly detect tables and recognize their cell structure in document images. The authors use two-stage iterative transfer learning from a merged general dataset down to a smaller annotated set, combined with dilation and smudge image augmentations, to achieve competitive results on ICDAR 2013, ICDAR 2019, and TableBank while training on relatively small amounts of labeled data.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The central contribution is a complete training pipeline for an instance segmentation model adapted to table understanding. The paper devotes most of its space to architecture choices, augmentation strategies, ablation tables comparing model variants, and detailed transfer learning procedures.
Secondary: $\Psi_{\text{Resource}}$: The authors manually annotate 342 images from ICDAR 2019 with three-class labels (bordered table, borderless table, borderless cell masks) and release these annotations publicly. This is a supporting contribution rather than the headline.
What is the motivation?
Table recognition from document images involves two tightly coupled subproblems: locating table regions and identifying the internal cell structure. Earlier systems addressed these independently, often relying on heuristics (line detection, junction analysis) or sequential two-model pipelines where errors in detection propagate into structure recognition. More recent deep-learning approaches, such as DeepDeSRT and TableNet, attempted joint or sequential CNN-based solutions but still required large training sets or introduced separate post-processing steps.
The authors aim to handle both tasks with a single model that learns efficiently from limited labeled data, and that outputs cell-level segmentation masks directly without rule-based post-processing for the structure recognition component.
What is the novelty?
The paper combines three ideas that had not been assembled in this way for document table recognition:
1. Cascade Mask R-CNN with HRNet backbone. The model (CMRcnnHr, referred to as CascadeTabNet in the paper) is a three-stage Cascade Mask R-CNN using HRNetV2p-W32 as the backbone. HRNet maintains high-resolution feature maps throughout the network rather than progressively downsampling, which helps preserve fine-grained spatial detail needed for cell-boundary segmentation. The cascade stages allow the model to refine bounding box proposals at increasingly strict IoU thresholds, improving localization quality. The architecture is implemented via MMdetection using the default cascade_mask_rcnn_hrnetv2p_w32_20e configuration.
2. Two-stage iterative transfer learning. The training strategy proceeds as:
$$\theta_{\text{general}} \leftarrow \text{fine-tune}(\theta_{\text{ImageNet+COCO}},\ \mathcal{D}_{\text{general}})$$
$$\theta_{\text{specific}} \leftarrow \text{fine-tune}(\theta_{\text{general}},\ \mathcal{D}_{\text{specific}})$$
where $\mathcal{D}_{\text{general}}$ is a merged dataset of 1,934 images (ICDAR 2019 Modern, Marmot, and a GitHub borderless table set) with all tables labeled as one class, and $\mathcal{D}_{\text{specific}}$ is a smaller 342-image set annotated for three classes (bordered tables, borderless tables, borderless cell masks). No layers are frozen during any stage of fine-tuning.
3. Document-specific image augmentations. Two augmentation transforms are proposed:
- Dilation transform: binarizes the image and applies morphological dilation with a $2 \times 2$ kernel to thicken text strokes, making text regions more salient to an object detector trained on natural images.
- Smudge transform: applies Euclidean, linear, and max distance transforms to spread black pixel regions, producing a blurred, smeary representation. This is adapted from Gilani et al. (2017).
Adding both transforms to the training set (tripling its size) raises the baseline Faster R-CNN model’s weighted-average F1 on ICDAR 2019 from 0.758 to 0.835.
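The dilation transform is simple enough to sketch with OpenCV; the Otsu binarization below is an assumption (the paper does not specify the thresholding method), and the smudge transform, built on distance transforms, is omitted.

```python
import cv2
import numpy as np

def dilation_transform(image_bgr: np.ndarray) -> np.ndarray:
    """Binarize a document image and thicken text strokes with a 2x2
    morphological dilation kernel, then restore black-on-white polarity."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    dilated = cv2.dilate(binary, np.ones((2, 2), np.uint8), iterations=1)
    return cv2.bitwise_not(dilated)
```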
What experiments were performed?
Augmentation ablation (Table 1). A Faster R-CNN ResNeXt-101 baseline is trained on four variants of the general dataset: original only, original + dilation, original + smudge, and all three. Results are reported as F1 at IoU thresholds 0.6, 0.7, 0.8, and 0.9, plus a weighted average (WAvg.) that up-weights higher IoU thresholds. The combined augmentation set achieves WAvg. 0.835 vs. 0.758 for the original-only baseline.
Model comparison (Table 2). Seven model configurations are evaluated on the ICDAR 2019 Track A Modern test set, all trained on the general dataset with both augmentations: RetinaNet (ResNeXt-101), Faster R-CNN HRNet, Cascade R-CNN (ResNeXt-101 and HRNet), Cascade Mask R-CNN (ResNet-50 with deformable convolutions, ResNeXt-101, and HRNet). CascadeTabNet (CMRcnnHr) achieves WAvg. 0.918, outperforming the next best (CMRcnnX at 0.905).
Table detection benchmarks (Tables 3 and 4). After fine-tuning on ICDAR 2019 Track A training data, the model achieves 3rd place on the post-competition leaderboard (WAvg. 0.901), with the best reported F1 at IoU 0.9 (0.901). On TableBank, training on only 1,500 images per document type (Word, Latex, or both), the model achieves F1 of 94.33 (combined), 96.60 (Latex), and 94.92 (Word), exceeding the TableBank baseline ResNeXt-101 and ResNeXt-152 models trained on the full dataset.
ICDAR 2013 table detection (Table 5). Fine-tuning the general model on 40 ICDAR 2013 images (using 198 for testing, a harder evaluation than prior work that reserved most images for training) yields perfect precision, recall, and F1. The authors note their test set is larger and harder than those used by DeepDeSRT and TableNet.
ICDAR 2019 Track B2 structure recognition (Table 6). Evaluated on 100 test images using cell adjacency relations, CascadeTabNet achieves the highest post-competition WAvg. F1 of 0.232 (vs. 0.206 for NLPR-PAL). The absolute scores are low, reflecting the difficulty of precise cell localization at high IoU thresholds.
All experiments were run on Google Colaboratory with a P100 PCIE GPU (16 GB VRAM), Intel Xeon CPU at 2.30 GHz, and 12.72 GB RAM.
What are the outcomes/conclusions?
The results suggest that existing instance segmentation architectures originally designed for natural images can be adapted to document table understanding with relatively modest amounts of domain-specific training data, provided the training strategy (iterative transfer learning) and augmentations are chosen to close the gap between natural and document image statistics.
For bordered tables, the pipeline falls back to conventional line detection for cell extraction rather than relying on model predictions, which the authors describe as more efficient. The model is applied only to borderless cell segmentation, where line cues are insufficient.
Several limitations are worth noting:
- The ICDAR 2013 perfect-score result is difficult to interpret fairly because the test/train split differs from prior comparisons (the authors test on more images while fine-tuning on fewer).
- Structure recognition F1 scores remain low in absolute terms (WAvg. 0.232), and the authors explicitly note that “high-end post-processing can improve the results significantly,” suggesting the model alone is not sufficient for production-quality structure recognition.
- The annotated dataset for structure recognition contains only 342 images, which limits diversity and may explain the model’s failures on some document types (Figure 5d).
- No statistical significance testing or confidence intervals are reported.
- The comparison is framed as “no post-processing” vs. competitors that use post-processing, but the bordered-table branch of the pipeline does use conventional line detection and contour-based algorithms, so the system is not fully end-to-end in the strictest sense.
Reproducibility
Models
- Architecture: Cascade Mask R-CNN with HRNetV2p-W32 backbone, three Bbox stages plus one Mask Head at the final stage. Backbone width W32 indicates 32 channels in the high-resolution convolution stream. All models in the comparison use an FPN neck, as stated in the paper’s model comparison section.
- Parameter count: Not reported in the paper.
- Implementation: MMdetection toolbox, default configuration cascade_mask_rcnn_hrnetv2p_w32_20e (20 epochs by default, labeled “20e” in the config name).
- Pretrained initialization: ImageNet-pretrained HRNet weights (via COCO pretrained model from MMdetection). The paper refers to “imagenet coco model weights” for the first transfer learning stage.
- Weights release: The authors release code and annotations on GitHub (https://github.com/DevashishPrasad/CascadeTabNet) under the MIT license. The paper does not state whether trained model weight checkpoints are included or specify their license separately.
Algorithms
- Transfer learning stage 1: Fine-tune the COCO-pretrained model on the general dataset (1,934 images, single table class). All layers unfrozen.
- Transfer learning stage 2: Fine-tune the stage-1 model on the specific dataset (342 images, three classes: bordered table, borderless table, borderless cell). All layers unfrozen.
- Optimizer and schedule: Not explicitly stated in the paper. The paper uses MMdetection default configurations; the 20e schedule in MMdetection convention uses SGD with momentum and step-based learning rate decay, but these details are not confirmed in the text.
- Batch size: Not reported in the paper.
- Augmentation: Dilation transform (binary + $2 \times 2$ morphological dilation) and smudge transform (distance transform-based blurring) applied offline; augmented images added to the training set alongside originals, tripling its size.
- No test-time augmentation or ensembling is mentioned.
Data
- General dataset: 1,934 images, 2,835 table annotations. Merged from ICDAR 2019 cTDaR Modern (cTDaR), Marmot (Chinese and English subsets), and a GitHub borderless table dataset. Ground-truth errors in the Marmot dataset were corrected.
- Specific dataset: 342 manually annotated images selected from ICDAR 2019 Train set. Three-class annotations: 114 bordered tables, 429 borderless tables, 24,920 borderless cell masks. Released publicly via the GitHub repository.
- TableBank fine-tuning: 1,500 images randomly sampled per document type from the full TableBank dataset. Annotation quality issues in the Word subset led to exclusion of some images from the test set.
- ICDAR 2013 fine-tuning: 40 randomly selected images used for fine-tuning; 198 used for testing.
- Availability: ICDAR 2019, ICDAR 2013, Marmot, and TableBank are publicly available datasets; individual licenses are not stated in the paper and should be verified at their respective sources. The custom 342-image annotation set is available through the GitHub repository under the repository’s MIT license.
Evaluation
- ICDAR 2019 (Track A): Precision, Recall, F1 at IoU thresholds 0.6, 0.7, 0.8, 0.9. The WAvg. metric is defined by the cTDaR competition (Gao et al. 2019) and weights the four F1 scores more heavily at higher IoU thresholds; the exact weights (0.1, 0.2, 0.3, 0.4) are specified in the competition description, not directly in this paper.
- TableBank: Standard precision, recall, and F1 computed by summing overlap areas across all documents, following the method of Gilani et al. (2017) as described in the TableBank paper.
- ICDAR 2013: Cell-level completeness/purity metrics; precision and recall per table, then averaged. The paper follows the same protocol as TableNet (Paliwal et al. 2019).
- ICDAR 2019 (Track B2): Cell adjacency relation-based evaluation (Gobel et al. 2012); F1 at IoU 0.6, 0.7, 0.8, 0.9 and WAvg.
- No error bars, seeds, or significance tests are reported.
- Comparison fairness concern: The ICDAR 2013 evaluation uses a larger test split than prior published results, making direct comparison with DeepDeSRT and TableNet potentially misleading in either direction.
Hardware
- Training hardware: Google Colaboratory, P100 PCIE GPU (16 GB VRAM), Intel Xeon CPU at 2.30 GHz, 12.72 GB RAM.
- Training time: Not reported.
- Inference latency and throughput: Not reported.
- Inference hardware: Not separately characterized; all reported experiments used the same Colab environment.
- Deployment: The pipeline relies on MMdetection and standard image processing libraries; local deployment should be feasible on a single GPU with at least 16 GB VRAM based on the reported training environment.
BibTeX
@inproceedings{prasad2020cascadetabnet,
title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents},
author={Prasad, Devashish and Gadpal, Ayan and Kapadni, Kshitij and Visave, Manish and Sultanpure, Kavita},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
pages={572--573},
year={2020}
}
LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment
TL;DR
LGPMA proposes a Mask-RCNN-based framework that simultaneously learns local (RoI-level) and global (full-image) pyramid soft-mask predictions to produce accurately aligned cell bounding boxes for table structure recognition. A mask re-scoring module fuses the two prediction levels, and a three-step post-processing pipeline uses global segmentation cues to locate and merge empty cells. The method achieves 96.7 TEDS-Struct on the PubTabNet validation set and 98.8 F1 on SciTSR.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper introduces LGPMA, an architecture with a specific training recipe combining dual pyramid mask supervision across local and global feature maps, a multi-task loss, and a structured inference pipeline. The center of gravity is the model design, demonstrated through ablation studies that attribute gains to each module.
Secondary: none significant.
What is the motivation?
Table structure recognition requires recovering the full cell grid from a cropped table image, including logical row/column indices and spanning relations. Two broad families of methods face complementary weaknesses:
- Global-object-based methods predict row/column separators or segmentation masks, but struggle with cells that span multiple rows or columns and with multi-line text that confuses line-cutting decisions.
- Local-object-based methods detect text regions as nodes and recover cell relations via heuristic rules or GNNs. Because these methods anchor predictions to visible text, they cannot reliably identify empty cells, which have no text and are visually ambiguous with neighboring cross-span cells. Empty cell handling directly affects downstream editable document conversion.
The authors observe that if one could obtain aligned bounding boxes (boxes expanded from the text region to the full cell extent, including empty padding) for every cell, the table grid becomes recoverable by simple coordinate overlap. The challenge is that aligned bounding boxes are difficult to regress because cell boundaries typically have no visible texture to guide the detector.
What is the novelty?
The core idea is soft pyramid mask supervision applied at two spatial granularities simultaneously, combined with a re-scoring procedure that fuses local and global predictions.
Pyramid mask label. For a cell proposal of height $H$ and width $W$, the text region corners are $(x_1, y_1)$ (top-left) and $(x_2, y_2)$ (bottom-right). Each pixel $(h, w)$ in the proposal is assigned a soft label in $[0, 1]$ that is maximal at the text midpoint and decreases linearly toward the cell boundary:
$$ t_h^{(w,h)} = \begin{cases} w/x_{mid} & w \le x_{mid} \\ \frac{W-w}{W-x_{mid}} & w > x_{mid} \end{cases}, \quad t_v^{(w,h)} = \begin{cases} h/y_{mid} & h \le y_{mid} \\ \frac{H-h}{H-y_{mid}} & h > y_{mid} \end{cases} $$
where $x_{mid} = (x_1 + x_2)/2$ and $y_{mid} = (y_1 + y_2)/2$. Because every pixel in the proposal participates in predicting the boundary, the network can push the predicted boundary beyond the initial proposal rectangle.
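A short NumPy sketch of the horizontal pyramid label for one proposal, following the equations above (the vertical label is analogous; edge cases such as a zero-width margin are ignored here):

```python
import numpy as np

def horizontal_pyramid_label(W: int, x1: float, x2: float) -> np.ndarray:
    """Soft label over proposal width W: 0 at the cell boundaries, rising
    linearly to 1 at the text region midpoint x_mid = (x1 + x2) / 2."""
    x_mid = (x1 + x2) / 2.0
    w = np.arange(W, dtype=float)
    label = np.where(w <= x_mid, w / x_mid, (W - w) / (W - x_mid))
    return np.clip(label, 0.0, 1.0)
```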
Local Pyramid Mask Alignment (LPMA). Within each RoI-aligned feature map, the mask head learns two tasks jointly: a binary text region segmentation and horizontal/vertical pyramid label regression. Local features provide reliable texture cues near the text content but have a limited receptive field.
Global Pyramid Mask Alignment (GPMA). A parallel branch operates on the full feature map and learns two tasks: (1) a global binary segmentation covering all cells including empty ones (empty cell ground-truth boundaries are derived from the maximum height/width of non-empty cells in the same row/column), and (2) global pyramid label regression for non-empty cells. The global feature captures long-range spatial context but lacks fine-grained texture precision.
Pyramid mask re-scoring. At inference, for each proposed aligned bounding box with text region midpoint $(x_{mid}, y_{mid})$ and overlap with the global segmentation map $P_o$, the local predictions $F^{(L)}$ and global predictions $F^{(G)}$ are combined with distance-weighted coefficients:
$$ F(x) = \begin{cases} \frac{x-x_1}{x_{mid}-x_1} F_{hor}^{(L)} + \frac{x_{mid}-x}{x_{mid}-x_1} F_{hor}^{(G)} & x_1 \le x \le x_{mid} \\ \frac{x-x_2}{x_{mid}-x_2} F_{hor}^{(L)} + \frac{x_{mid}-x}{x_{mid}-x_2} F_{hor}^{(G)} & x_{mid} < x \le x_2 \end{cases} $$
The fused pyramid map for each side of the box is used to fit a plane by least squares:
$$ \min \sum_{y_i=y_1}^{y_2} \sum_{x_i=x_{mid}}^{x_2} (ax_i + by_i + c - F(x_i, y_i))^2 $$
The zero-crossing of the fitted plane gives the refined boundary coordinate:
$$ x_{refine} = - \frac{1}{y_2 - y_1 + 1} \sum_{y_i=y_1}^{y_2} \frac{b y_i + c}{a} $$
All four boundaries (left, right, top, bottom) are refined analogously.
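A hedged NumPy sketch of the right-boundary refinement (function and variable names are illustrative): a least-squares fit of the plane $ax + by + c$ over the right half of the box, followed by the averaged per-row zero crossing.

```python
import numpy as np

def refine_right_boundary(fused, x_mid, x2, y1, y2) -> float:
    """Fit a*x + b*y + c to the fused pyramid values over the region
    [x_mid, x2] x [y1, y2], then average the per-row zero crossings
    x = -(b*y + c) / a to obtain the refined right boundary."""
    ys, xs = np.mgrid[y1:y2 + 1, x_mid:x2 + 1]
    A = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, fused[y1:y2 + 1, x_mid:x2 + 1].ravel(),
                                 rcond=None)
    a, b, c = coeffs
    rows = np.arange(y1, y2 + 1)
    return float(np.mean(-(b * rows + c) / a))
```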
Table structure recovery pipeline. Three sequential steps follow bounding box refinement: (1) cell matching by coordinate midpoint overlap in horizontal and vertical directions, (2) empty cell searching using the Bron-Kerbosch maximum clique algorithm on the matching graph to identify row/column cliques and find vacant positions, and (3) empty cell merging guided by the fraction of GPMA segmentation pixels predicted as foreground in the region between adjacent empty cells; neighbor cells are merged when the ratio exceeds a threshold.
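The clique step can be illustrated with networkx, whose find_cliques generator implements the Bron-Kerbosch algorithm; the vertical-overlap test below is a simplified assumption, not the paper's exact matching rule.

```python
import networkx as nx

def row_groups(cells):
    """Group aligned cell boxes (x1, y1, x2, y2) into rows: connect two
    cells when one's vertical midpoint falls inside the other's y-extent,
    then take maximal cliques (Bron-Kerbosch via networkx.find_cliques)."""
    G = nx.Graph()
    G.add_nodes_from(range(len(cells)))
    for i, a in enumerate(cells):
        for j, b in enumerate(cells[:i]):
            if b[1] < (a[1] + a[3]) / 2 < b[3] or a[1] < (b[1] + b[3]) / 2 < a[3]:
                G.add_edge(i, j)
    return list(nx.find_cliques(G))
```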
Multi-task loss. Training optimizes:
$$ \mathcal{L} = \mathcal{L}_{rpn} + \lambda_1(\mathcal{L}_{cls} + \mathcal{L}_{box}) + \lambda_2(\mathcal{L}_{mask} + \mathcal{L}_{LPMA}) + \lambda_3(\mathcal{L}_{seg} + \mathcal{L}_{GPMA}) $$
where $\mathcal{L}_{seg}$ uses the Dice coefficient loss and $\mathcal{L}_{LPMA}$, $\mathcal{L}_{GPMA}$ use pixel-wise L1 loss. The weights $\lambda_1 = \lambda_2 = \lambda_3 = 1$ were set empirically.
What experiments were performed?
Datasets:
- ICDAR 2013: 98 train / 156 test samples from PDF-extracted government reports.
- SciTSR: 12,000 train / 3,000 test scientific literature tables; SciTSR-COMP is a harder subset of 2,885 train / 716 test.
- PubTabNet: 500,777 train / 9,115 val / 9,138 test scientific tables.
Baselines: DeepDeSRT, Split, DeepTabStR, Siddiqui et al., ReS2TIM, GTE, GraphTSR, and TabStruct-Net on ICDAR 2013 and SciTSR; EDD, TabStruct-Net, and GTE on PubTabNet.
Metrics: Adjacency F1 (micro-averaged precision/recall of neighboring cell-pair relations) for ICDAR 2013 and SciTSR; TEDS and TEDS-Struct for PubTabNet. OCR for PubTabNet TEDS uses a separate attention-based recognizer (Lee and Osindero, CVPR 2016).
Ablations (Table 3, 60k train / 1k val from PubTabNet): Incremental addition of LPMA, GPMA, and Alignment Loss (AL from TabStruct-Net), measured on text region detection, aligned bounding box detection (IoU threshold 0.7), and TEDS-Struct. The combination LPMA+GPMA without AL achieves the best aligned bounding box Hmean (84.95) and TEDS-Struct (95.53). Adding AL on top of LPMA+GPMA degrades aligned bounding box Hmean by 0.27 pp, suggesting the two supervision signals conflict.
Ablations (Table 4): Three empty cell merging strategies compared on empty-cell detection F1, all-cell detection F1, and TEDS-Struct. The LGPMA-guided strategy achieves 70.21% empty-cell Hmean vs. 59.40% for minimum-cells and 14.70% for maximum-cells. When non-empty bounding box ground truth is provided (ideal upper bound), LGPMA recovery achieves 97.47% empty-cell Hmean and TEDS-Struct 99.77.
What are the outcomes/conclusions?
Main results on held-out test/validation sets:
| Benchmark | Metric | Prior best | LGPMA |
|---|---|---|---|
| ICDAR 2013 (SciTSR train) | F1 | 0.935 (GTE) | 0.953 |
| ICDAR 2013 (fine-tuned) | F1 | 0.935 (GTE) | 0.979 |
| SciTSR | F1 | 0.953 (GraphTSR) | 0.988 |
| SciTSR-COMP | F1 | 0.955 (GraphTSR) | 0.980 |
| PubTabNet val | TEDS | 93.0 (GTE) | 94.6 |
| PubTabNet val | TEDS-Struct | n/a | 96.7 |
The paper received the Best Industry Paper Award at ICDAR 2021.
Key findings:
- Pyramid soft supervision allows aligned bounding boxes to exceed initial proposal bounds, helping particularly for cross-span cells where the proposal often underestimates the true cell extent.
- GPMA’s global receptive field and LPMA’s local texture detail are complementary; the re-scoring fusion outperforms either branch alone.
- The empty cell merging strategy using GPMA visual guidance substantially outperforms naive minimum or maximum strategies on empty-cell detection metrics.
- When non-empty bounding boxes are provided as ground truth, the recovery pipeline achieves near-perfect TEDS-Struct (99.77), indicating the structural recovery logic itself is nearly lossless and that detection accuracy is the primary bottleneck.
- Alignment Loss (AL, from TabStruct-Net) degrades aligned bounding box detection by 3.1 pp Hmean when combined with LGPMA’s pyramid supervision, suggesting the approaches encode redundant or conflicting gradients.
Limitations:
- The method is designed for axis-aligned, print-format tables without rotation or perspective distortion. Wild or photographed tables are out of scope.
- Empty cell merging uses a fixed pixel-ratio threshold on GPMA segmentation output, which may be sensitive to heavily bordered or stylized tables.
- No inference latency or throughput figures are reported.
- OCR is handled by a separately trained recognizer; the full pipeline is not end-to-end.
Reproducibility
Models
- Backbone: ResNet-50 with Feature Pyramid Network (FPN), initialized from MS-COCO pretrained weights.
- LPMA: six anchor ratios [1/20, 1/10, 1/5, 1/2, 1, 2] to handle tall and wide cell shapes; RCNN NMS IoU threshold 0.1 at test time.
- GPMA: full-image segmentation branch with pyramid label regression; aligned bounding box ground-truth shrunk by 5% before generating targets to prevent overlap.
- Code is released in the DAVAR-Lab-OCR repository under Apache-2.0 (the repo-level license applies to the lgpma subdirectory; no subdirectory-specific license file exists).
- No pretrained model weights are publicly released by the authors. Reproduction requires training from scratch using the released code and configuration files.
Algorithms
- Optimizer: SGD, momentum 0.9, weight decay $1 \times 10^{-4}$.
- Batch size: 4.
- Learning rate: $1 \times 10^{-2}$, divided by 10 every 5 epochs.
- Training duration: 12 epochs for SciTSR and PubTabNet; 25 epochs for ICDAR 2013 fine-tuning.
- Input scale augmentation: random scale of longer side to [480, 1080] during training; fixed 768 at test time.
- Loss weights: $\lambda_1 = \lambda_2 = \lambda_3 = 1$ (empirical; no ablation over these values reported).
- Plane fitting for boundary refinement uses least squares; iterative refinement is optional following the Pyramid Mask Text Detector procedure.
- Empty cell merging threshold (the pixel-ratio cutoff for deciding whether to merge two adjacent empty cells via GPMA segmentation output) is not reported in the paper and is not exposed in the released configuration files based on available documentation.
Data
- Primary training for SciTSR and ICDAR 2013 experiments: SciTSR training split (12,000 images).
- ICDAR 2013 fine-tuning: 98-image training split, after SciTSR pretraining.
- PubTabNet experiments: full 500,777-image training split.
- Ablation experiments: 60,000 randomly sampled PubTabNet training images.
- ICDAR 2013: publicly available; hosted via ICDAR competition infrastructure; no explicit open license but widely used for academic research.
- SciTSR: publicly available on GitHub (https://github.com/Academic-Integrity-ML/SciTSR); no explicit open-source license stated by the authors.
- PubTabNet: publicly available on IBM’s GitHub (https://github.com/ibm-aur-nlp/PubTabNet) under CDLA-Permissive-1.0.
Evaluation
- Adjacency F1: micro-averaged correctness of neighboring cell-pair relations; standard for ICDAR 2013 and SciTSR benchmarks.
- TEDS / TEDS-Struct: tree edit distance similarity on full HTML (with and without cell content); introduced with PubTabNet.
- No error bars, confidence intervals, or multi-run seeds reported.
- Training data conditions vary across baselines (some use private pretraining data); LGPMA uses only public datasets.
- Full TEDS reproduction requires an OCR component: the authors use an attention-based recognizer (Lee and Osindero, CVPR 2016) trained separately. No checkpoint or training recipe for this recognizer is provided by the LGPMA authors; reproducing the TEDS (full) score requires sourcing or training a compatible OCR model independently. TEDS-Struct (structure only) is reproducible without OCR.
Hardware
- Training: 8 $\times$ Tesla V100 (32 GB each), PyTorch.
- No GPU-hours, per-sample latency, or throughput figures reported in the paper.
- The ResNet-50 + FPN + Mask-RCNN architecture is standard and deployable on a single modern GPU for inference.
BibTeX
@inproceedings{qiao2021lgpma,
title = {LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment},
author = {Qiao, Liang and Li, Zaisheng and Cheng, Zhanzhan and Zhang, Peng and Pu, Shiliang and Niu, Yi and Ren, Wenqi and Tan, Wenming and Wu, Fei},
booktitle = {Document Analysis and Recognition -- ICDAR 2021},
pages = {99--114},
year = {2021},
publisher = {Springer International Publishing},
series = {Lecture Notes in Computer Science},
volume = {12821}
}
NCGM: Neural Collaborative Graph Machines for Table Structure Recognition
TL;DR
NCGM introduces a graph-based architecture for table structure recognition that treats three modalities (geometry, appearance, and content) as separate graphs, then alternates between per-modality context extraction and cross-modality information synthesis in stacked collaborative blocks. The approach handles distorted tables considerably better than earlier graph methods that rely on early or late fusion.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new architecture). The headline contribution is a stack of collaborative blocks composed of Ego Context Extractors (ECE) and Cross Context Synthesizers (CCS). The paper is organized around ablations, comparisons on multiple benchmarks, and an analysis of learned attention diversity.
What is the motivation?
Table structure recognition (TSR) aims to recover the logical or physical grid structure of a table from an image, producing row/column/cell adjacency relationships. Two families of prior approaches fall short on tables with complex spanning cells or geometric distortion:
- Early fusion concatenates multiple modality features before feeding them to a single graph model. This forces the model to learn a single inductive bias and ignores the fact that different modalities may be more or less informative depending on table type.
- Late fusion models each modality in a separate graph and combines the outputs. This disentanglement sacrifices inter-modality interactions entirely.
The authors frame the unaddressed challenge as the “Heterogeneous TSR (Hetero-TSR)” problem: how should different modalities collaborate with each other in a way that adapts to the table at hand?
What is the novelty?
The paper introduces Neural Collaborative Graph Machines (NCGM), built from a stack of $L$ collaborative blocks. Each block has two successive modules.
Ego Context Extractor (ECE): For each modality $m \in \{G, A, C\}$, a fully-connected directed graph is built over the $N$ text segment bounding boxes. The edge feature for a pair $(i, j)$ is the asymmetric function:
$$h_e(x_i, x_j) = x_i \,\|\, (x_i - x_j)$$
where $\|$ denotes channel-wise concatenation.
ECE aggregates global context via a Compressed Multi-head Attention (CMHA) module. Standard MHA over all $N(N-1)/2$ edge features is quadratic in the sequence length, which quickly becomes prohibitive. CMHA reduces this by compressing keys and values through a reshape-and-project operation:
$$\mathrm{MC}(H) = \text{Norm}\!\left(\text{Reshape}(H, \varepsilon)\, W^h\right)$$
with compression ratio $\varepsilon = N/M$, cutting the attention cost from $O(N^2)$ to $O(NM)$ for a length-$N$ input. ECE outputs updated per-modality context embeddings $C^{(l)} \in \{C_G^{(l)}, C_A^{(l)}, C_C^{(l)}\}$.
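A plausible PyTorch sketch of the compression step follows. The grouping factor, padding, and layer shapes are assumptions; the paper does not fully specify the Reshape operator, so treat this as one reading of $\mathrm{MC}(\cdot)$ rather than NCGM's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedMHA(nn.Module):
    """Keys/values of length N are folded into M = N // eps groups and
    linearly projected (W^h) before standard multi-head attention, so
    the attention cost drops from O(N^2) to O(N*M)."""
    def __init__(self, d_model=64, n_heads=8, eps=4):
        super().__init__()
        self.eps = eps
        self.compress = nn.Linear(eps * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, q, kv):
        B, N, D = kv.shape
        pad = (-N) % self.eps                   # pad so eps divides N
        kv = F.pad(kv, (0, 0, 0, pad))
        kv = kv.reshape(B, (N + pad) // self.eps, self.eps * D)
        kv = self.norm(self.compress(kv))       # Norm(Reshape(H, eps) W^h)
        out, _ = self.mha(q, kv, kv)
        return out
```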
Cross Context Synthesizer (CCS): Three parallel CMHA modules each take one modality as queries and the other two as keys/values, producing inter-modality embeddings $M^{(l)}$. For example, for the content branch:
$$M_C^{(l)} = \text{CMHA}\!\left(M_C^{(l-1)},\; C_A^{(l)} \cup C_G^{(l)}\right)$$
After $L=3$ blocks the three inter-modality streams are concatenated into collaborative graph embeddings $E \in \mathbb{R}^{N \times d_e}$. All $N^2$ pair vectors $U = \{u_{i,j}\}$ are formed by channel-wise concatenation of $e_i$ and $e_j$ and fed to three separate FC classifiers predicting binary same-row, same-column, and same-cell relationships.
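A minimal sketch of this pairing-and-classification step, assuming a per-node embedding size of 64 (so pair vectors are 128-d; the head widths follow the 3-layer, 256-d description given in the Models section below):

```python
import torch
import torch.nn as nn

def pair_logits(E, heads):
    """Form all N^2 pair vectors u_ij = [e_i ; e_j] and score them with
    the three binary relation heads (same-row / same-col / same-cell)."""
    N, d = E.shape
    U = torch.cat([E.unsqueeze(1).expand(N, N, d),
                   E.unsqueeze(0).expand(N, N, d)], dim=-1)  # (N, N, 2d)
    return {name: head(U) for name, head in heads.items()}

heads = {k: nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                          nn.Linear(256, 256), nn.ReLU(),
                          nn.Linear(256, 2))
         for k in ("row", "col", "cell")}
logits = pair_logits(torch.randn(42, 64), heads)  # each value: (42, 42, 2)
```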
Training loss: End-to-end with a combination of classification and contrastive terms for each relationship type:
$$L = L_{cell} + L_{col} + L_{row}$$
$$\tilde{L} = \lambda_1 L_{\text{class}} + \lambda_2 L_{\text{con}}$$
$$L_{\text{con}} = \|e_{(a)} - e_{(b)}^+\|_2^2 + \max\{0,\ \alpha - \|e_{(a)} - e_{(b)}^-\|_2^2\}$$
During training, Monte Carlo sampling (sample size $S=10$) is used to keep memory costs tractable. At inference, all pairs are evaluated.
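The contrastive margin term translates directly into code; a minimal sketch (anchor/positive/negative selection and the Monte Carlo sampling are left out):

```python
import torch

def contrastive_loss(e_a, e_pos, e_neg, alpha=1.0):
    """Pull anchor/positive embeddings together; push anchor/negative
    pairs apart until the squared distance exceeds the margin alpha."""
    pos = (e_a - e_pos).pow(2).sum(-1)                       # ||.||^2
    neg = (alpha - (e_a - e_neg).pow(2).sum(-1)).clamp(min=0)
    return (pos + neg).mean()
```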
Augmented evaluation set: The authors also introduce SciTSR-COMP-A, a harder variant of SciTSR-COMP generated by applying perspective transformation and quadratic Bézier curve distortion to its images, for benchmarking under realistic capture conditions.
What experiments were performed?
Datasets:
- Physical structure (precision/recall/F1 on row/column adjacency): ICDAR-2013-P, ICDAR-2019, UNLV-P, WTW, SciTSR, SciTSR-COMP, and the new SciTSR-COMP-A.
- Logical structure (BLEU on TableBank; TEDS on PubTabNet): TableBank (145k train) and PubTabNet (339k train).
Evaluation setups follow TabStruct-Net:
- Setup-A: table image only (no bounding boxes or text content as input; FLAG-Net detection boxes and Tesseract OCR results are used for fair comparison).
- Setup-B: table image plus ground-truth (or OCR-derived) text segment bounding boxes and content.
Baselines: GraphTSR, DGCNN, TabStruct-Net, FLAG-Net, GTE, LGPMA, Cycle-CenterNet.
Ablations isolate: (a) early vs. late vs. collaborative fusion, (b) DGCNN vs. Transformer vs. ECE as the intra-modality aggregator, (c) concatenation vs. CCS as the inter-modality fusion strategy, (d) individual modality contributions (G, A, C alone and in pairs), (e) effect of block depth (1 to 9 blocks).
What are the outcomes/conclusions?
On SciTSR-COMP (complex structures), NCGM under Setup-B achieves F1 99.0% (P 98.8%, R 99.3%), outperforming FLAG-Net. The strongest result is on the distorted set SciTSR-COMP-A: without distorted training data, NCGM exceeds the second-best FLAG-Net by approximately 11 percentage points (F1) under Setup-A and roughly 12 points under Setup-B. When trained with distorted data, the margin remains around 7 to 9 points.
Ablation insights:
- Geometry (G) is the dominant modality for physical structure; appearance (A) contributes meaningfully; content (C) alone is weak (F1 50.2% on SciTSR-COMP).
- ECE with individual modality inputs (99.0 F1) beats all early-fusion and late-fusion alternatives on SciTSR-COMP.
- CCS raises F1 from 98.3% (ECE + late concatenation) to 99.0%.
- A depth of three collaborative blocks is the practical optimum; deeper stacks converge more slowly and become prone to training collapse beyond 7 blocks.
Limitations the authors acknowledge:
- Computational cost grows with the number of modalities and decoupled processing (3.1M parameters, 12.7G FLOPs for 42 bounding boxes, which is heavier than FLAG-Net at 1.9M / 3.3G FLOPs).
- Deeper collaborative blocks risk training collapse (observed for more than 7 blocks beyond 50 epochs).
- Nested tables, where row/column boundaries are ambiguous, remain a failure mode.
The paper does not discuss inference latency beyond FLOPs counts, nor does it report results on FinTabNet or the PubTables-1M benchmark family that has since become standard.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | arXiv-nonexclusive-distrib-1.0 | arxiv.org/abs/2111.13359 |
| Published (CVPR 2022) | Paper | Open Access (CVF) | CVF Open Access |
| Code | Code | Not released | N/A |
| Pretrained weights | Model | Not released | N/A |
Models
- Backbone: ResNet-18 (conv1 to conv2_2) plus three 3$\times$3$\times$64 conv layers for appearance features; RoI Align for per-box pooling.
- Geometry: 4-dim normalized bounding box coordinates projected by a $d$-dim FC layer.
- Content: word2vec embeddings followed by a 7$\times$1$\times$d conv for sequential modeling.
- Hidden size $d=64$; 8 attention heads; $d_k = d_v = 8$; $d_m = 64$.
- 3 collaborative blocks (L=3); FC classifiers have 3 layers at 256 dimensions plus a 2-dim softmax head.
- Total: 3.1M parameters. No pretrained weights or code released.
Algorithms
- Framework: PyTorch.
- Input table images resized to $512 \times 512$.
- Pre-trained on SciTSR for 10 epochs; fine-tuned on each target benchmark for 50 epochs.
- Optimizer: not explicitly named; learning rate initialized at $1 \times 10^{-4}$, divided by 10 on loss plateau.
- Loss weights: $\lambda_1 = \lambda_2 = 1$; margin $\alpha = 1$ in contrastive loss.
- Monte Carlo sampling size $S=10$ during training; full pairing at inference.
- No data augmentation beyond the optional SciTSR-COMP-A distortion procedure (perspective transform + Bézier curve warp).
Data
- Training: SciTSR (12k), WTW (10.97k), ICDAR-2019 (600), PubTabNet (339k), TableBank (145k), and partial splits for ICDAR-2013 and UNLV (80/20 random 5-fold).
- Bounding box alignment: cell-level datasets (ICDAR-2019, UNLV, WTW) are converted to text-segment level via Tesseract OCR; text-segment-level datasets (ICDAR-2013, SciTSR) use parsed GT boxes directly.
- SciTSR-COMP-A synthesis procedure is described in the appendix but the augmented images are not separately distributed.
Evaluation
- Physical structure: precision, recall, and F1 on pairwise adjacency (same-row, same-column, same-cell).
- Logical structure: BLEU (TableBank, following Li et al.) and TEDS (PubTabNet, following Zheng et al.).
- ICDAR-2013 and UNLV: averaged over 10 random 5-fold splits.
- No error bars or significance tests reported for main results. No seed sensitivity study.
- Baselines use different input formats in some cases (FLAG-Net detection boxes vs. GT boxes), creating potential inconsistencies at the dataset level.
Hardware
- Training: one Nvidia Tesla V100 GPU with 32 GB memory.
- No wall-clock training time reported.
- FLOPs: 12.7G for a table with 42 text segment boxes; parameters: 3.1M.
- No inference latency or throughput numbers reported.
BibTeX
@inproceedings{liu2022neural,
title={Neural Collaborative Graph Machines for Table Structure Recognition},
author={Hao Liu and Xin Li and Bing Liu and Deqiang Jiang and Yinsong Liu and Bo Ren},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}
RobusTabNet: CornerNet Proposals and Spatial CNN for Robust Table Extraction
TL;DR
RobusTabNet is a two-stage table extraction system that replaces the standard region proposal network (RPN) in Faster R-CNN with CornerNet to generate better-localized table proposals, and pairs a spatial CNN separator prediction module with a Grid CNN cell merger for table structure recognition. The approach is designed to handle geometrically distorted and even curved tables that confound axis-aligned methods.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The paper’s center of gravity is the design and evaluation of two architectural components: a CornerNet-as-RPN table detector and a spatial CNN + Grid CNN structure recognizer. Extensive ablations and comparisons against published baselines on six public benchmarks support every claim.
Secondary: none significant; the authors release manually annotated structure labels for cTDaR TrackA modern, but this is a minor byproduct rather than the paper’s purpose.
What is the motivation?
Two gaps motivate this work.
For table detection, existing CNN detectors (CDeC-Net, Cascade Mask R-CNN variants) achieve good recall but poor localization at high IoU thresholds. The underlying cause is that standard RPN generates proposals whose IoU with ground-truth boxes falls mostly in the 0.7–0.9 range; proposals with IoU > 0.9 represent only 48.1% of positive samples. Because these lower-quality positives survive NMS with high scores, the final detector is penalized whenever evaluation uses tight IoU thresholds.
For table structure recognition (TSR), the dominant cell-detection-then-clustering paradigm (TabStruct-Net, LGPMA) assumes tables are axis-aligned. Camera-captured documents in settings such as the “Insert data from picture” feature in Excel routinely contain skewed or curved tables that violate this assumption. Separately, row/column segmentation models built on plain ResNet-FPN cannot propagate enough context to disambiguate large blank regions in borderless tables.
What is the novelty?
CornerNet-FRCN table detector. The authors replace RPN with CornerNet to generate table proposals. CornerNet predicts paired top-left and bottom-right corner heatmaps, with offsets to compensate for quantization error from the stride-16 backbone:
$$ p_i^x = \left\lfloor \frac{q_i^x}{s} \right\rfloor, \quad p_i^y = \left\lfloor \frac{q_i^y}{s} \right\rfloor $$
$$ \Delta_i = \left( \frac{q_i^x}{s} - \left\lfloor \frac{q_i^x}{s} \right\rfloor,\ \frac{q_i^y}{s} - \left\lfloor \frac{q_i^y}{s} \right\rfloor \right) $$
All valid top-left/bottom-right corner pairs are enumerated as proposals (x-left $<$ x-right and y-top $<$ y-bottom) and then passed through a standard Fast R-CNN head for classification and box refinement. The authors report that this design raises the fraction of well-localized proposals (IoU $>$ 0.9 within IoU $>$ 0.7 positives) from 48.1% (RPN) to 96.3% (CornerNet). The combined training loss is:
$$ L_{\text{detector}} = \lambda_{\text{corner}} \cdot L_{\text{corner}} + L_{\text{frcn}}, \quad \lambda_{\text{corner}} = 0.2 $$
where $L_{\text{corner}}$ uses focal loss for heatmap classification and Smooth-$L_1$ for offset regression, and $L_{\text{frcn}}$ uses cross-entropy plus $L_1$ box regression.
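A small sketch of the corner targets implied by the two equations above; the function name is hypothetical:

```python
import math

def corner_targets(qx, qy, stride=16):
    """Map a ground-truth corner (qx, qy) to its stride-16 heatmap
    position and the fractional offset the offset head regresses."""
    px, py = math.floor(qx / stride), math.floor(qy / stride)
    dx, dy = qx / stride - px, qy / stride - py   # each in [0, 1)
    return (px, py), (dx, dy)
```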
Spatial CNN separation line prediction. A plain ResNet-18 + FPN feature map struggles with borderless tables because each pixel’s receptive field is local. The spatial CNN module (adapted from lane detection) propagates information sequentially across the feature map. For the row separator branch, slices $\{s_i^w\}$ along the width dimension are updated left-to-right and then right-to-left by convolving with a $9 \times 1$ kernel and adding to the adjacent slice. The column branch performs analogous top-to-bottom and bottom-to-top passes. The resulting context-enriched feature map drives pixel-level binary segmentation:
$$ L_{\text{split}} = \frac{1}{N_{\text{row}}} \sum_i L(R_i, R_i^\ast) + \frac{1}{N_{\text{col}}} \sum_j L(C_j, C_j^\ast) $$
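The left-to-right pass can be sketched in PyTorch as follows; the ReLU nonlinearity and in-place update order are assumptions carried over from the original SCNN lane-detection formulation rather than details confirmed by this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPropagateLR(nn.Module):
    """One left-to-right pass of the row-separator branch: each
    width-slice receives a 9x1 convolution of its left neighbour.
    A mirrored right-to-left pass follows; the column branch runs
    top-to-bottom and bottom-to-top analogously."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, (9, 1), padding=(4, 0))

    def forward(self, x):                   # x: (B, C, H, W)
        slices = list(x.unbind(dim=3))      # W slices of shape (B, C, H)
        for i in range(1, len(slices)):
            prev = slices[i - 1].unsqueeze(3)              # (B, C, H, 1)
            slices[i] = slices[i] + F.relu(self.conv(prev)).squeeze(3)
        return torch.stack(slices, dim=3)
```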
Grid CNN cell merging. After cell generation via connected-component analysis and polynomial curve fitting, the detected cells are arranged in an $M \times N$ grid. RoI-aligned features per cell form a grid feature map $F_{\text{grid}} \in \mathbb{R}^{M \times N \times 512}$. Three stacked $3 \times 3$ convolutions aggregate spatial context across this grid. A relation network then predicts whether each pair of 4-adjacent cells should be merged, using an 18-dimensional spatial compatibility vector $l_{ij}$ encoding the box deltas:
$$ \begin{aligned} t_x^{ij} &= (x^i - x^j) / w^j, \quad & t_y^{ij} &= (y^i - y^j) / h^j, \\ t_w^{ij} &= \log(w^i / w^j), \quad & t_h^{ij} &= \log(h^i / h^j), \\ t_x^{ji} &= (x^j - x^i) / w^i, \quad & t_y^{ji} &= (y^j - y^i) / h^i. \end{aligned} $$
$$ L_{\text{merge}} = \frac{1}{N_p} \sum_i L(r_i, r_i^\ast) $$
The total structure recognizer loss is $L_{\text{recognizer}} = L_{\text{split}} + L_{\text{merge}}$.
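The six box deltas above map directly to code; a sketch with boxes as (center-x, center-y, width, height) tuples, omitting the remaining 12 dimensions of $l_{ij}$:

```python
import math

def box_deltas(box_i, box_j):
    """Relative spatial compatibility deltas between two adjacent
    cells, as consumed by the relation head."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return [(xi - xj) / wj, (yi - yj) / hj,        # t_x^ij, t_y^ij
            math.log(wi / wj), math.log(hi / hj),  # t_w^ij, t_h^ij
            (xj - xi) / wi, (yj - yi) / hi]        # t_x^ji, t_y^ji
```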
What experiments were performed?
Datasets (TD): cTDaR 2019 TrackA (600/240 modern, plus 600/199 historical), PubLayNet (335k/11k/11k), IIIT-AR-13K (9.3k/2k/2.1k). Datasets (TSR): SciTSR (12k/3k), PubTabNet (500k/9k/9k), cTDaR TrackB2-Modern (100 test images). The authors also evaluate TSR on a private in-house dataset of 9,000/700 camera-captured, skewed, or curved table images.
Metrics: Weighted Average F1 (IoU $\in \{0.6, 0.7, 0.8, 0.9\}$) for cTDaR; COCO AP$^{0.5:0.95}$ for PubLayNet; PASCAL VOC AP for IIIT-AR-13K; adjacency relation F1 for SciTSR and cTDaR TrackB2; TEDS-Struct for PubTabNet.
Baselines: CDeC-Net (strongest published TD baseline), LGPMA (ICDAR 2021 TSR winner), TabStruct-Net, GTE, EDD, SPLERGE (for distorted table comparison).
Ablations: The paper systematically compares five message-passing methods for separation line prediction (no message passing, projection networks, Bi-GRU, CC Attention, Spatial CNN) and four cell merging methods (no merging, Relation Network, GCN, Grid CNN) on cTDaR TrackB2-Modern.
What are the outcomes/conclusions?
Table detection: RobusTabNet (ResNet-18 backbone) achieves WAvg. F1 of 94.9% on cTDaR TrackA (vs. 94.3% for CDeC-Net with a much heavier dual-ResNeXt-101 backbone), AP$^{0.5:0.95}$ of 97.0% on PubLayNet (vs. 96.7% for CDeC-Net), and test AP of 97.7% on IIIT-AR-13K (vs. 96.5% for Mask R-CNN with ResNet-101). The authors attribute these gains to proposal quality: CornerNet raises top-50 recall at IoU $\geq$ 0.9 from 89.3% (RPN) to 97.8%.
Table structure recognition: RobusTabNet achieves F1 of 99.3% / 98.7% on SciTSR / SciTSR-COMP (vs. 98.8% / 98.0% for LGPMA), TEDS-Struct of 97.0% on PubTabNet (vs. 96.7% for LGPMA), and the highest reported WAvg. F1 among evaluated methods on cTDaR TrackB2-Modern. On the private distorted-table dataset, RobusTabNet scores 94.6% WAvg. F1 against SPLERGE’s 63.8%.
Ablation findings: Spatial CNN outperforms CC Attention (93.9%), Bi-GRU (93.1%), and projection networks (93.0%) for separation line prediction (94.6% vs. all alternatives). Grid CNN outperforms GCN (94.0%) and Relation Network (93.2%) for cell merging (94.6%).
Limitations: The detector struggles with closely adjacent tables (difficulty disambiguating boundaries). The structure recognizer degrades on cells with multi-line content and on extremely dense tables where separator masks from adjacent rows or columns overlap. All public benchmarks consist of black-line/white-background tables; generalization to colored or stylized documents is untested.
Reproducibility
Models
- TD: ResNet-18 with dilations in Conv5 (stride 16), reduced to 64 channels with a $1 \times 1$ conv. CornerNet corner pooling + 3$\times$3 conv appended to produce Dilated-C5’. Fast R-CNN head with two 1,024-d fc layers. Total parameter count not reported.
- TSR: ResNet-18 + FPN (64 output channels) shared backbone. Spatial CNN branches with $\frac{H}{4} \times \frac{W}{32}$ intermediate resolution ($3 \times$ downsampling), propagating $9 \times 1$ kernels. Grid CNN: RoI Align $7 \times 7 \times 64$, two 512-d fc layers per cell, $3 \times$ stacked $3 \times 3$ convolutions, 2-hidden-layer MLP relation head.
- No pretrained weights or code released. Weights initialized from ImageNet-pretrained ResNet-18; new layers initialized from $\mathcal{N}(0, 0.01)$.
Algorithms
- Optimizer: SGD with momentum 0.9, weight decay $5 \times 10^{-4}$.
- Schedule: 15K iterations; base LR 0.032, decayed by $\times 0.1$ at 10K and 13K iterations. PubLayNet uses $3\times$ schedule. SciTSR and PubTabNet TSR models trained for 12 epochs.
- Batch size: 4 images per GPU across 8 GPUs; synchronized batch normalization.
- Multi-scale training (TD): shorter side randomly chosen from $\{320, 416, 512, 608, 704, 800\}$; cTDaR also uses random rotations in $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$ with $\pm 5^\circ$ jitter. Multi-scale training (TSR): shorter side randomly chosen from $\{416, 512, 608, 704, 800\}$ using cropped table images.
- OHEM: Equal hard positive/negative sampling (32/32 proposals for FRCN; 1,024/1,024 separator/background pixels per branch for TSR split; 64/64 cell pairs for merge).
- Loss weights: $\lambda_{\text{corner}} = 0.2$ (tuned on in-house dataset; applied without per-dataset tuning).
Data
- Public training sets: cTDaR TrackA (600 modern images annotated by the authors for TSR structure), PubLayNet (86k table-containing images used for TD), IIIT-AR-13K, SciTSR, PubTabNet.
- Private in-house dataset: 9,000 training / 700 test camera-captured images with distorted/curved tables; not released.
- The authors annotated row/column separation lines and cell bounding boxes for the cTDaR TrackA 600 modern images for TSR training; they state these annotations “will be released publicly,” though no repository link was provided in either the preprint or the published journal version. As of this writing, no public release has been identified.
Evaluation
- Metrics: Standard per-dataset protocols. TEDS-Struct ignores OCR and evaluates structural HTML trees only. Adjacency F1 uses the official cTDaR measurement tool (https://github.com/cndplab-founder/ctdar_measurement_tool).
- Test-time scaling: TD images rescaled so the shorter side is 512 px (longer side capped at 1,024 px). TSR: cropped table images rescaled so the longer side is 1,024 px; SciTSR images are not resized.
- Comparisons: Single-model, single-scale inference throughout.
- Statistical rigor: No error bars, significance tests, or multi-run variance reported.
- Known limitations: LGPMA leverages the axis-aligned table constraint to boost PubTabNet performance; RobusTabNet does not use this constraint, making the comparison favorable to RobusTabNet on distorted images.
- Hyperparameter transfer: All thresholds ($C_{th}=0.3$, $S_{th}=0.8$, merge score threshold $0.8$, top-$K$=100) were tuned on the private in-house dataset and applied without modification to public benchmarks.
Hardware
- 8 Nvidia V100 GPUs.
- PyTorch 1.6.0.
- Training GPU-hours not reported.
- Inference: no latency figures reported; the paper does not quantify FPS or memory footprint at test time.
BibTeX
@article{ma2023robustabnet,
title={Robust Table Detection and Structure Recognition from Heterogeneous Document Images},
author={Ma, Chixiang and Lin, Weihong and Sun, Lei and Huo, Qiang},
journal={Pattern Recognition},
volume={133},
pages={109006},
year={2023},
publisher={Elsevier},
doi={10.1016/j.patcog.2022.109006}
}
GridFormer: Table Structure Recognition via Grid Prediction
TL;DR
GridFormer frames table structure recognition as predicting the vertices and edges of an $M \times N$ grid. A Deformable DETR backbone with parallel row and column decoders predicts this grid in a single forward pass, covering wired, wireless, oriented, and distorted tables. The method matches or improves on prior results across five benchmarks without requiring OCR annotations.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The paper’s headline contribution is a novel architecture for TSR: a new table representation (the $M \times N$ vertex-edge grid) paired with a DETR-style two-stream decoder. The bulk of the paper describes the model design, loss functions, and ablation studies validating each module choice.
Secondary: none. The paper uses existing public benchmarks and does not release a dataset, metric, or code repository.
What is the motivation?
Table structure recognition (TSR) must handle tables that vary widely in structure, line style, and image quality. Prior methods each handle this diversity poorly in at least one dimension:
- Split-and-merge methods (region-based) use a two-stage pipeline: a segmentation model divides the table into an over-split grid of regions, then a merge model combines spanning cells. The two-stage design complicates end-to-end training and degrades on wireless or geometrically distorted tables.
- Graph-based methods model cells or text lines as graph nodes and predict pairwise relationships. They depend on OCR annotations at inference time, creating a hard coupling to text detection quality.
- Cell-based methods detect individual cells via object detection. They work well on wired tables but struggle with wireless tables where cell boundaries are implicit.
- Markup language-based methods generate HTML sequences autoregressively. They suffer from attention drift on long sequences and scale poorly to large tables.
None of these approaches cleanly handles all four challenging scenarios (wireless, oriented, distorted, multi-spanning-cell) within a single-stage, OCR-free pipeline. GridFormer is designed to close that gap.
What is the novelty?
Grid representation. The central observation is that any table can be expressed as an $M \times N$ grid of vertices and edges. Given a table with $r$ rows and $c$ columns, the corresponding grid has $(r+1) \times (c+1)$ vertices and two sets of edges (rightward and downward). A vertex is “positive” if it coincides with a cell corner; an edge is “positive” if it lies on a cell boundary. Both sets are binary-classified, and each positive vertex stores its physical $(x, y)$ coordinate. This representation is:
- Unified: wired and wireless tables differ only in which edges are positive; the representation itself does not change.
- Compact: no autoregressive decoding; the grid is a fixed-size tensor for a given dataset.
- Geometry-aware: physical coordinates are stored directly, enabling localization for distorted or rotated tables.
Two-stream decoder. GridFormer adapts Deformable DETR with two parallel transformer decoders. One decoder processes row queries $Q_{\text{row}} \in \mathbb{R}^{M \times d}$ and predicts $y$-axis coordinates; the other processes column queries $Q_{\text{col}} \in \mathbb{R}^{N \times d}$ and predicts $x$-axis coordinates. Decoupling the axes simplifies vertex localization because within a single row, the $y$ range is narrow while $x$ varies widely, and vice versa.
Query selection module. Rather than using learnable reference point embeddings (as in standard DETR), the query selection module generates image-conditioned reference points. Row proposals are placed at fixed $x = W/4$ positions evenly distributed along $y$; column proposals at fixed $y = H/4$ evenly along $x$. The top-80 scoring proposals (no NMS) initialize the decoder reference points, and L1 supervision is applied to these points during training.
Three prediction heads. Given row embeddings $Z_{\text{row}}$ and column embeddings $Z_{\text{col}}$:
- Classification head: predicts positive/negative probability for each row and column via Focal loss.
$$L_{\text{cls}} = \text{Focal}(\hat{p}_{\text{row}}, p_{\text{row}}) + \text{Focal}(\hat{p}_{\text{col}}, p_{\text{col}})$$
- Position head: predicts normalized $y$-coordinates from row embeddings and $x$-coordinates from column embeddings via L1 loss, with an auxiliary cell-level GIoU loss on reconstructed cell bounding boxes.
$$L_{\text{coord}} = L1(\hat{r}_{\text{row}}, r_{\text{row}}) + L1(\hat{r}_{\text{col}}, r_{\text{col}}) + \gamma_1 L_{\text{iou}}(\hat{g}, g)$$
- Edge head: at each grid vertex, concatenates the corresponding row/column query embedding (global context) with a locally sampled CNN feature (local context), then applies binary Focal classification for rightward and downward edges separately.
$$L_{\text{edge}} = \text{Focal}(\hat{e}_{\text{row}}, e_{\text{row}}) + \text{Focal}(\hat{e}_{\text{col}}, e_{\text{col}})$$
The total loss is:
$$L = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{coord}} + \lambda_3 L_{\text{ref}} + \lambda_4 L_{\text{edge}}$$
with $\lambda_1 = 1$, $\lambda_2 = \lambda_3 = \lambda_4 = 5$, $\gamma_1 = 0.1$. Auxiliary losses from all six decoder layers are also used.
Table reconstruction. After decoding, rows and columns with score $> \tau_1 = 0.5$ are kept and sorted by reference point coordinates. Edges with score $> \tau_2 = 0.4$ are kept. A breadth-first search on the resulting grid groups adjacent positive vertices into cells.
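The BFS grouping step amounts to a flood fill over unit grid cells; a sketch under the assumption that neighbouring unit cells merge when the edge between them is below threshold (the paper's exact edge indexing may differ):

```python
from collections import deque

def merge_grid_cells(M, N, h_edge, v_edge):
    """Flood-fill the M x N unit grid into table cells. h_edge[i][j]
    is True if the edge below unit cell (i, j) is a cell boundary;
    v_edge[i][j] is True if the edge to its right is one."""
    cell_id = [[-1] * N for _ in range(M)]
    next_id = 0
    for si in range(M):
        for sj in range(N):
            if cell_id[si][sj] != -1:
                continue
            cell_id[si][sj] = next_id
            queue = deque([(si, sj)])
            while queue:
                i, j = queue.popleft()
                # expand across absent (negative) separator edges
                nbrs = [(i, j + 1, j + 1 < N and not v_edge[i][j]),
                        (i + 1, j, i + 1 < M and not h_edge[i][j]),
                        (i, j - 1, j > 0 and not v_edge[i][j - 1]),
                        (i - 1, j, i > 0 and not h_edge[i - 1][j])]
                for ni, nj, ok in nbrs:
                    if ok and cell_id[ni][nj] == -1:
                        cell_id[ni][nj] = next_id
                        queue.append((ni, nj))
            next_id += 1
    return cell_id
```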
What experiments were performed?
GridFormer is evaluated on five benchmarks spanning both regular (PDF-derived) and scene (wild-image) tables.
Datasets:
- SciTSR: 12k train / 3k test; scientific PDFs; evaluated with cell adjacency F1.
- PubTabNet: 500k train / 9k val; scientific PDFs; evaluated with TEDS and TEDS-Struct on the validation set (test annotations are unreleased).
- FinTabNet: 92k train / 10k val; financial tables; evaluated with TEDS-Struct on the validation set.
- WTW: 11k train / 3.6k test; wild photos with deformation; evaluated with cell adjacency precision/recall/F1.
- TAL (TAL_OCR_TABLE): 12.3k train / 3k test (custom split from the competition set); wild scene tables; also evaluated on TAL_rotated ($\pm 30^\circ$) and TAL_curved (geometric distortion variants). Metrics: TEDS-Struct and cell F1 at IoU 0.6.
Baselines compared: EDD, TabStruct-Net, GTE, SEM, LGPMA, FLAG-Net, NCGM, TableFormer, TSRFormer, TRUST, VAST (PubTabNet/FinTabNet); GraphTSR, RobusTabNet (SciTSR); Cycle-CenterNet, TSRFormer, NCGM (WTW); SPLERGE, TableMaster (TAL).
Implementation: ResNet-50 backbone; 6 deformable encoder layers; 6 decoder layers per stream; AdamW, LR 2e-4; 8 V100 GPUs, batch size 24; multi-scale training with short side in {384, 416, 448, 480, 512}; long side capped at 640.
Ablations (on WTW):
- Removing the query selection module (random reference points): F1 drops from 94.1% to 86.3% (-7.8%).
- Replacing the two-stream decoder with a single decoder: F1 drops from 94.1% to 70.2% (-23.9%), suggesting that axis decoupling matters considerably.
- Using only query embeddings for edge classification (no local visual features): F1 drops to 62.2%.
- Using only local visual features for edge classification (no query embeddings): F1 drops to 88.7%.
- Both combined: 94.1% (best), suggesting the value of global-plus-local feature fusion.
What are the outcomes/conclusions?
Regular tables (PDF-derived):
| Dataset | Metric | GridFormer | Nearest competitor |
|---|---|---|---|
| PubTabNet (val) | TEDS | 95.84% | VAST: 96.31% |
| PubTabNet (val) | TEDS-Struct | 97.0% | TSRFormer: 97.5% |
| FinTabNet (val) | TEDS-Struct | 98.63% | VAST: 98.63% |
| SciTSR (test) | F1 (excl. empty) | 99.3% | NCGM/TSRFormer: 99.6% |
On PubTabNet and SciTSR, GridFormer is competitive but not the best-reported number. On FinTabNet, it ties VAST at 98.63% (an improvement of 1.8 points over TableFormer).
Scene tables (wild images):
| Dataset | Metric | GridFormer | Nearest competitor |
|---|---|---|---|
| WTW (test) | F1 | 94.1% | NCGM: 94.1% |
| TAL (test) | TEDS-Struct | 99.4% | TableMaster: 98.8% |
| TAL (test) | Cell F1 | 98.9% | TableMaster: 80.8% |
| TAL_rotated | Cell F1 | 92.9% | TableMaster: 27.8% |
| TAL_curved | Cell F1 | 96.8% | TableMaster: 64.5% |
The gap between methods is largest on rotated and curved tables. TableMaster’s autoregressive decoder degrades sharply under geometric distortion, falling to 27.8% (TAL_rotated) and 64.5% (TAL_curved), versus GridFormer’s 92.9% and 96.8% respectively.
Limitations:
- No code or weights are released, limiting reproducibility.
- The TAL evaluation uses a custom train/test split from the competition set rather than an established evaluation protocol; comparison with future methods may be inconsistent.
- TEDS on PubTabNet requires external OCR (the authors use PSENet + MASTER); the quality of these OCR results directly affects reported TEDS scores, and the paper does not ablate OCR quality.
- The grid size $M \times N$ is matched to the largest table in each dataset. For datasets with very large tables, this increases query count and memory usage proportionally.
- The method does not output cell text content; it is a structure-only recognizer. Downstream content extraction requires a separate OCR stage.
Reproducibility
Models
- Backbone: ResNet-50 (ImageNet pre-trained; standard initialization).
- Encoder: 6 deformable transformer encoder layers from Deformable DETR; 256-channel features; multi-scale features from ResNet stages 3-5.
- Decoder: two parallel deformable transformer decoders, each with 6 layers; hidden dimension 256.
- Query counts: $M$ and $N$ are set to the maximum row and column counts in each dataset (e.g., 50 rows and 50 columns for TAL). Ablations show modest sensitivity to this choice.
- No pre-trained TSR checkpoints, model weights, or code repository are released.
Algorithms
- Optimizer: AdamW, initial LR 2e-4. No learning rate schedule (warmup, decay policy, or total training steps/epochs) is reported in the paper.
- Training hardware: 8 NVIDIA V100 GPUs; total batch size 24.
- Multi-scale training: short side randomly sampled from {384, 416, 448, 480, 512}; long side capped at 640.
- Inference resizing: long side fixed at 640 while preserving aspect ratio.
- Score thresholds: $\tau_1 = 0.5$ (row/column classification), $\tau_2 = 0.4$ (edge classification).
- Loss weights: $\lambda_1 = 1$, $\lambda_2 = \lambda_3 = \lambda_4 = 5$, $\gamma_1 = 0.1$.
- Bipartite matching: Hungarian algorithm used to assign ground-truth rows/columns to predicted queries, following DETR.
Data
- All training data come from existing public datasets (SciTSR, PubTabNet, FinTabNet, WTW, TAL). No new data is collected.
- TAL uses a custom 12,285/3,000 train/test split from the original competition training set; split file names are stated to be released but no repository is given.
- TAL_rotated and TAL_curved are derived from TAL via data augmentation (random rotation $\pm 30^\circ$; geometric distortion); these variants are not available as standalone downloads.
- Label generation for datasets with text-line level bounding box annotations (e.g., SciTSR, PubTabNet) requires a multi-step coordinate extension procedure described in the appendix; this preprocessing is non-trivial but documented in the paper.
Evaluation
- TEDS / TEDS-Struct: standard metrics from PubTabNet; TEDS-Struct ignores cell content and compares HTML structure trees only.
- Cell adjacency F1: used for SciTSR and WTW; measures whether pairs of horizontally or vertically adjacent cells are correctly identified.
- Cell F1 at IoU 0.6: localization metric used for TAL; measures how well predicted cell bounding boxes overlap with ground truth.
- PubTabNet TEDS results depend on external OCR (PSENet + MASTER); the paper does not provide TEDS-Struct-only comparisons that would isolate the structure predictor from OCR errors.
- No error bars, significance tests, or repeated runs are reported.
Hardware
- Training: 8 V100 GPUs; total batch size 24. Total GPU-hours and wall-clock time are not reported.
- Inference: long side 640; no inference latency or throughput numbers are reported.
- Deployment: no discussion of CPU-only or low-memory inference feasibility.
BibTeX
@inproceedings{lyu2023gridformer,
title={GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction},
author={Lyu, Pengyuan and Ma, Weihong and Wang, Hongyi and Yu, Yuechen and Zhang, Chengquan and Yao, Kun and Xue, Yang and Wang, Jingdong},
booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
year={2023},
doi={10.1145/3581783.3611961}
}
MTL-TabNet: End-to-End Multi-Task Learning for Image-Based Table Recognition
TL;DR
MTL-TabNet proposes a single encoder-decoder model that jointly learns table structure recognition, cell bounding box detection, and cell content recognition as three coupled tasks. On PubTabNet and FinTabNet, it improves over TableFormer on all three sub-tasks and achieves results competitive with the top entries in the ICDAR 2021 competition, without ensembling or additional annotations.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The paper’s core contribution is the architecture: a multi-task learning network that jointly handles three aspects of table recognition through shared and task-specific decoder components. The validation is comparison tables against strong baselines, with ablation logic implicit in the sub-task decomposition.
Secondary: None significant. No new dataset, metric, or theoretical derivation is introduced.
What is the motivation?
Most table recognition pipelines at the time of publication were non-end-to-end: a separate system would handle table structure recognition, and a separate OCR system would handle cell content recognition. These two stages were trained independently, so errors from one stage could not be corrected by the other, and the systems could not share visual representations.
Prior end-to-end attempts (EDD, IM2TEX) existed but lagged substantially behind non-end-to-end methods. The authors argue that multi-task learning is a natural fit for table recognition because structure, cell layout, and cell content are interdependent.
What is the novelty?
The core architectural contribution is a five-component design:
Shared Encoder: ResNet-31 with Multi-Aspect Global Context Attention (GCAttention) after each residual block, producing a spatial feature grid that is unrolled column-by-column and passed through positional encoding.
Shared Decoder: Two standard Transformer decoder layers that consume the encoder output and a right-shifted sequence of structural HTML tokens, producing a context vector for each predicted cell token.
Structure Decoder: One Transformer decoder layer predicting an HTML tag sequence for the table structure. Cells are tokenized at the HTML tag level; spanning cells are decomposed into `<td`, span-attribute, value, and `>` tokens.
Cell-BBox Decoder: Triggered whenever the structure decoder emits a cell-start token. Takes the corresponding shared-decoder output and predicts four bounding box coordinates via a linear layer and sigmoid activation.
Cell-Content Decoder: Similarly triggered per cell. Autoregressively generates the text content of that cell at character level, attending to both the shared encoder output and the per-cell shared decoder representation.
The overall loss is:
$$ \mathcal{L} = \lambda_{1}\mathcal{L}_{\text{struc.}} + \lambda_{2}\mathcal{L}_{\text{cont.}} + \lambda_{3}\mathcal{L}_{\text{bbox}} $$
where $\mathcal{L}_{\text{struc.}}$ and $\mathcal{L}_{\text{cont.}}$ are cross-entropy losses and $\mathcal{L}_{\text{bbox}}$ is an L1 regression loss. All three $\lambda$ values are set to 1.
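A minimal sketch of the combined objective (shapes are assumptions: token logits flattened to (T, V) against integer targets, boxes as normalized (num_cells, 4) tensors):

```python
import torch.nn as nn

ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

def mtl_loss(struct_logits, struct_tgt, cont_logits, cont_tgt,
             bbox_pred, bbox_tgt, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of structure CE, content CE, and bbox L1 losses."""
    return (lambdas[0] * ce(struct_logits, struct_tgt)
            + lambdas[1] * ce(cont_logits, cont_tgt)
            + lambdas[2] * l1(bbox_pred, bbox_tgt))
```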
The key design choice is that the shared decoder processes the structural token sequence, giving all three task-specific decoders a contextualized cell-level representation without requiring three separate encoder passes.
What experiments were performed?
Datasets:
- PubTabNet: 568k table images from PMCOA scientific papers, annotated with HTML structure, cell bounding boxes, and cell text. Split used: 500,777 train / 9,115 validation (dev phase) / 9,064 final evaluation.
- FinTabNet: 112k complex financial tables from S&P 500 annual reports; 81k / 9.5k / 9.5k train/val/test split.
Metrics:
- TEDS-Struct: Tree Edit Distance Similarity considering only structural HTML tokens (not cell content).
- TEDS: Full Tree Edit Distance Similarity including cell content.
- mAP (PASCAL VOC) for cell bounding box detection on PubTabNet.
Baselines compared:
- EDD (PubTabNet baseline)
- GTE and GTE (fine-tuned) for FinTabNet
- LGPMA (required additional text-line bounding box annotations)
- TableFormer (with and without PDF-based post-processing)
- SEM (3rd place, ICDAR 2021)
- VCGroup / TableMASTER (2nd place, ICDAR 2021, uses model ensembles)
- Davar-Lab-OCR (1st place, ICDAR 2021, uses additional annotation + ensembles)
What are the outcomes/conclusions?
Table Structure Recognition (TEDS-Struct):
- FinTabNet: 98.79% All (vs. 96.80% TableFormer, +2.0%)
- PubTabNet val: 97.88% All (vs. 96.75% TableFormer, +1.1%)
Cell Detection (mAP on PubTabNet val):
- 88.93% (vs. TableFormer 82.10%, +6.8%; vs. TableFormer+PP 86.80%, +2.1%)
Full Table Recognition (TEDS on PubTabNet val):
- 96.67% All (vs. TableFormer 93.60%, vs. SEM 93.70%, vs. VCGroup without ensemble 96.26%)
ICDAR 2021 Final Evaluation (PubTabNet final set):
- 96.17% TEDS All, approximately 4th place, without ensembles or additional training data.
- Davar-Lab-OCR (1st): 96.36%; VCGroup (2nd): 96.32%; XM (3rd): 96.27%.
The multi-task design appears to benefit cell detection in particular, likely because the structure decoder provides explicit cell token triggers, focusing the bbox decoder on exactly one cell at a time rather than running box detection over the whole table image. The content decoder benefits from this same focus mechanism.
Limitations the authors do not discuss: The architecture is autoregressive in the structure decoder, which means inference time scales with the number of cells in a table. Large tables with many cells could be slow. The paper does not report inference speed or latency. The cell-content decoder also autoregressively generates characters per cell, which may compound slow inference. There is no evaluation on wild or camera-captured table images (e.g., WTW dataset), so robustness outside clean document scans is unknown.
The paper also lacks a formal ablation study. No experiment isolates the contribution of individual decoders (for example, training without the cell-content loss, or replacing the shared decoder with per-task decoders). The gain over TableFormer is attributed to the MTL formulation, but the relative contributions of the stronger backbone (GCAttention), the shared decoder design, and multi-task gradient flow are not disentangled. Additionally, the shared decoder uses teacher forcing during training (conditioned on gold HTML tags) but must consume its own predicted output at inference. This train/inference mismatch is a known source of error accumulation in autoregressive sequence models and is not acknowledged.
Reproducibility
Models
- Backbone: ResNet-31 with GCAttention (Multi-Aspect Global Context Attention from MASTER, Lu et al. 2021) after each residual block.
- Input resolution: 480 $\times$ 480 pixels; CNN output feature map: 60 $\times$ 60.
- Shared decoder: $N = 2$ Transformer decoder layers; hidden size 512, FFN size 2048, 8 attention heads.
- Structure decoder: 1 Transformer decoder layer; same hidden/FFN/head config.
- Cell-BBox decoder: 1 Transformer decoder layer + linear + sigmoid.
- Cell-Content decoder: 1 Transformer decoder layer + linear + softmax.
- Max sequence lengths: 500 structural tokens; 150 character tokens per cell.
- Weights: Pretrained checkpoints for both PubTabNet and FinTabNet are released on Google Drive (see frontmatter artifacts). The repository and weights are licensed Apache-2.0 (inherited from the MMOCR codebase). Note: the repository README instructs using `master_decoder_old20220923.py` instead of the default `master_decoder.py` when loading the released checkpoints.
Algorithms
- Optimizer: SGD (implied; the paper says “stochastic gradient descent algorithms” without specifying Adam vs. SGD).
- Learning rate: 0.001 for the first 12 epochs; then 0.0001 for 8 more epochs or until convergence.
- Batch size: 4 per GPU (2 GPUs; effective batch size 8).
- Loss: Weighted sum of cross-entropy (structure), cross-entropy (content), and L1 (bounding box), with $\lambda_{1} = \lambda_{2} = \lambda_{3} = 1$.
- Framework: PyTorch + MMCV; specifically MMOCR-0.2.0, MMDetection-2.11.0, mmcv-full-1.3.4.
- Data augmentation: Not described.
Data
- PubTabNet: Publicly available (CDLA-Permissive-1.0 for annotations; underlying PMCOA images have per-article mixed licenses).
- FinTabNet: Published by IBM; CDLA-Permissive-1.0 for annotations; underlying S&P 500 report images are from copyrighted filings. The original IBM distribution is no longer easily accessible; the canonicalized FinTabNet.c version is available on HuggingFace.
- Both datasets provide structure annotations in HTML, cell bounding boxes, and cell text per non-empty cell.
Evaluation
- TEDS and TEDS-Struct as defined in Zhong et al. 2020.
- Simple vs. Complex splits: Simple = no spanning cells; Complex = tables with multi-row or multi-column cells.
- Cell detection evaluated with PASCAL VOC mAP protocol on PubTabNet.
- No error bars, significance tests, or multi-run statistics reported.
- Comparison with ICDAR 2021 competition entries is fair only in the sense that the same final evaluation set was used; competition entries may have used different training data augmentation or proprietary preprocessing.
Hardware
- Training: 2 $\times$ NVIDIA A100 80 GB GPUs.
- Training time: Not reported.
- Inference speed: Not reported.
- Deployment: The autoregressive structure and content decoders suggest non-trivial inference latency on large tables; no benchmarks provided.
BibTeX
@inproceedings{ly2023mtltabnet,
title={An End-to-End Multi-Task Learning Model for Image-based Table Recognition},
author={Nam Tuan Ly and Atsuhiro Takasu},
booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP},
pages={626--634},
year={2023},
organization={SCITEPRESS},
doi={10.5220/0011685000003417}
}
GraphTSR: Complicated Table Structure Recognition via Graph Neural Networks
TL;DR
GraphTSR reformulates table structure recognition (TSR) in PDFs as an edge-classification problem on a cell adjacency graph, using alternating edge-to-vertex and vertex-to-edge graph attention blocks to predict horizontal and vertical cell relations. The paper also introduces SciTSR, a 15,000-table dataset of scientific PDF tables with structure labels derived from LaTeX source, split 12,000/3,000 train/test.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new graph neural network architecture (GraphTSR) for recognizing table structure from PDF cell inputs. The paper’s center of gravity is the architectural design, comparison against three baselines, and the per-dataset, per-metric performance tables.
Secondary: $\Psi_{\text{Resource}}$. The paper simultaneously releases SciTSR, a large-scale dataset that did not exist before this work and which has since become a standard benchmark in the TSR community. Without SciTSR, the experimental section would have been limited to the tiny ICDAR-2013 set (156 tables, no training split).
What is the motivation?
PDF table structure recognition requires understanding which cells are adjacent to which, including cells that span multiple rows or columns (“spanning cells”). Prior work fell into two camps: rule-based methods that relied on explicit layout heuristics, and a small set of deep learning approaches that treated the problem as image segmentation into row/column regions. Both camps handled simple grid-like tables reasonably well but struggled with tables containing spanning cells, which carry disproportionately important semantic content (they are often headers).
The existing benchmark (ICDAR-2013) had only 156 tables and no training split, which severely constrained data-driven research. The authors identify both the methodological gap (no principled way to handle spanning cells) and the resource gap (no adequate training corpus) as jointly limiting progress.
What is the novelty?
GraphTSR treats each cell as a node and the task as predicting whether any two cells share a horizontal or vertical adjacency edge. The pipeline has four stages:
- Pre-processing: extract cell text content and bounding boxes from the PDF using an existing tool (same procedure as Shigarov et al., 2016).
- Graph construction: connect each cell to its $K = 20$ nearest neighbors by spatial distance, yielding a sparse graph with $O(K|V|)$ edges rather than $O(|V|^2)$.
- Relation prediction: run GraphTSR to classify each edge as `vertical`, `horizontal`, or `no relation`.
- Post-processing: convert the labeled graph back into a structured table representation.
The core model maintains two representation streams, one over vertices (cells) and one over edges (candidate adjacency pairs), updated by alternating attention blocks. The edge-to-vertex block updates each cell’s representation by attending over its neighboring edge nodes; the vertex-to-edge block updates each edge’s representation by attending over its neighboring cell nodes. Graph attention is local (restricted to the $K$-NN neighborhood) rather than global:
$$ a_{iu} = \frac{e^{\mathbf{k}_i^\top \mathbf{q}_u / \sqrt{d_k}}}{\sum_{j \in N(u)} e^{\mathbf{k}_j^\top \mathbf{q}_u / \sqrt{d_k}}} $$
$$ \text{GraphAtt}(\mathbf{q}_u, \mathbf{K}, \mathbf{V}) = \sum_{j \in N(u)} a_{ju} \mathbf{v}_j $$
Each attention block wraps the graph attention with residual connections and layer normalization, following the Transformer block structure:
$$ \tilde{\mathbf{H}}_v^{(n)} = \text{Norm}\!\left(\mathbf{H}_v + \text{GraphAtt}(\mathbf{Q}, \mathbf{K}, \mathbf{V})\right) $$
$$ \mathbf{H}_v^{(n)} = \text{Norm}\!\left(\tilde{\mathbf{H}}_v^{(n)} + \text{FFN}(\tilde{\mathbf{H}}_v^{(n)})\right) $$
Vertex features include cell size, absolute bounding-box coordinates, and relative positions. Edge features include Euclidean, x-axis, and y-axis distances in both absolute and relative form, plus x-axis and y-axis overlap between cell pairs. These hand-crafted features are fed as initial representations to the two streams.
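A sketch of how such edge features might be assembled for one candidate pair; the exact feature list and normalization constants in the paper may differ (boxes as (x1, y1, x2, y2)):

```python
import math

def edge_features(a, b, page_w, page_h):
    """Distance and overlap features for a candidate cell pair."""
    acx, acy = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bcx, bcy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    dx, dy = abs(acx - bcx), abs(acy - bcy)
    dist = math.hypot(dx, dy)
    x_ov = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # x-axis overlap
    y_ov = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # y-axis overlap
    return [dist, dx, dy,                                # absolute
            dist / page_w, dx / page_w, dy / page_h,     # relative
            x_ov, y_ov]
```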
The key design decision is representing the original graph as a bipartite graph (cells and candidate edges as separate node sets), so that both vertex and edge states can be updated with the same graph attention mechanism.
What experiments were performed?
Datasets: SciTSR (12,000 train / 3,000 test), ICDAR-2013 (test-only, 156 tables), and SciTSR-COMP (the 716 complicated tables from the SciTSR test set, used as a harder held-out subset).
Metric: The ICDAR 2013 evaluation protocol, which converts a table to a list of horizontally and vertically adjacent cell pairs and computes precision, recall, and F1. Both macro- and micro-averaged scores are reported.
Baselines:
- Tabby (Shigarov et al., 2016): rule-based PDF table extractor
- DeepDeSRT (Schreiber et al., 2017): CNN-based semantic segmentation of row and column regions
- Adobe Acrobat DC SDK: commercial table-to-HTML extraction
Note that both DeepDeSRT and GraphTSR are trained on SciTSR training data and then evaluated zero-shot on ICDAR-2013, testing cross-domain generalization.
Training details: Adam optimizer, initial learning rate 0.0005, batch size 1 (one graph per step), 15 epochs, L2 weight decay $\lambda = 0.0001$, dropout $p = 0.4$ on each sub-layer output. Class imbalance is addressed with a manual rescaling weight: 0.2 for no relation edges and 1.0 for vertical and horizontal edges in the cross-entropy loss.
What are the outcomes/conclusions?
On ICDAR-2013, GraphTSR achieves macro-F1 of 0.837 and micro-F1 of 0.872, surpassing Tabby (0.816 / 0.854) and substantially outperforming DeepDeSRT (0.568 / 0.615). The cross-domain gap for DeepDeSRT suggests that image-based models are sensitive to table rendering style, while GraphTSR’s text-and-geometry features generalize better.
On SciTSR, GraphTSR achieves macro-F1 of 0.934 and micro-F1 of 0.953, ahead of Tabby (0.912 / 0.921) and DeepDeSRT (0.897 / 0.890).
The more informative comparison is SciTSR-COMP. All baselines drop at least 4 points relative to full SciTSR; GraphTSR retains macro-F1 0.934 and micro-F1 0.955, while Tabby falls to 0.855 / 0.882. On the most demanding sub-evaluation (spanning cells only, ignoring non-spanning adjacencies), the advantage is sharper: GraphTSR reaches macro-F1 0.703 vs. Adobe’s 0.485 and Tabby’s 0.379. DeepDeSRT cannot recover any spanning-cell structure at all (its outputs are always grid-like).
The authors conclude that framing TSR as graph edge prediction rather than image segmentation or sequence generation naturally handles spanning cells, because each cell-to-cell relation is predicted independently and the model is not forced into a grid assumption.
Limitations (stated and unstated):
- The pre-processing step depends on a rule-based PDF text extractor to supply cell bounding boxes; the model itself does not handle raw PDF pages or tables from scanned images.
- The SciTSR tables come entirely from scientific papers on arXiv compiled from LaTeX source; tables from other domains (business reports, government documents, HTML pages) are not represented, and the LaTeX-derived labels may not generalize to noisier real-world PDFs.
- Training requires a ground-truth cell-bounding-box-to-LaTeX-cell matching step (“cell matching”), which relies on accurate pre-processing and clean LaTeX source; this step is not evaluated separately.
- No ablation is provided over the number of attention blocks, the KNN $K$ value, or the feature engineering choices.
- GPU vs. CPU training is not discussed in the context of scalability or inference speed.
- DeepDeSRT was reimplemented by the authors rather than using original code (none was available). A reimplemented baseline may not match the original’s tuning, which puts the comparison’s fairness in question.
- SciTSR was designed and constructed to match GraphTSR’s input format: pre-extracted cell bounding boxes from clean LaTeX source. Image-based methods like DeepDeSRT are disadvantaged on this benchmark by design, since the benchmark natively favors text-and-geometry approaches.
Reproducibility
Models
GraphTSR is a 4-block model with hidden dimension $d = 64$, implemented in PyTorch 0.4.1. The bipartite-graph formulation has edge-to-vertex and vertex-to-edge attention blocks, each following the Transformer residual-norm-FFN pattern. Total parameter count is not reported. No pretrained checkpoints are released. The GitHub repository contains the SciTSR dataset (via Google Drive download) and evaluation scripts for computing precision/recall/F1 against ground-truth relation labels; it does not include GraphTSR training code.
Algorithms
- Optimizer: Adam, learning rate 0.0005
- Batch size: 1 graph per step
- Epochs: 15
- Dropout: $p = 0.4$ on each sub-layer output
- Weight decay: L2 with $\lambda = 0.0001$
- Loss: cross-entropy with class weights (0.2 for
no relation, 1.0 for adjacency classes) - KNN graph construction: $K = 20$
- No data augmentation or curriculum learning described
Data
SciTSR is collected by crawling LaTeX source files from arXiv, extracting table environments, compiling them to individual PDF files, and parsing \multirow/\multicolumn commands to produce JSON structure labels. The full dataset has 15,000 tables (12,000 train, 3,000 test). The 716 complicated-table test subset (SciTSR-COMP) is the subset of the test split that contains at least one spanning cell. The dataset is publicly available under the MIT license.
A notable caveat: since labels come from LaTeX source, the dataset cannot easily include tables from scanned documents or proprietary PDF producers. It is also restricted to the scientific paper style of arXiv submissions, which tend to be typeset consistently.
Evaluation
The ICDAR 2013 evaluation protocol converts each table to a set of adjacent-cell relation tuples and computes precision, recall, and F1 by set overlap. Both macro (per-table average) and micro (per-relation aggregate) variants are reported. No error bars, significance tests, or multi-run averages are provided.
ICDAR-2013 has no training split; results on it test zero-shot generalization from SciTSR training data.
Hardware
Training runs on Intel Xeon CPUs. Each epoch over the 12,000 SciTSR training graphs takes approximately 20 minutes, giving roughly 5 hours of total training time. No GPU is used or reported. Inference hardware and latency are not discussed.
BibTeX
@misc{chi2019complicated,
title={Complicated Table Structure Recognition},
author={Chi, Zewen and Huang, Heyan and Xu, Heng-Da and Yu, Houjin and Yin, Wanxuan and Mao, Xian-Ling},
year={2019},
eprint={1908.04729},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
PubTables-v2: Full-Page and Multi-Page Table Extraction
TL;DR
PubTables-v2 is a new large-scale dataset from Kensho Technologies for table extraction from scientific documents. It introduces three collections organized by context: 136k large cropped tables, 548k single-page tables with caption and footer annotations, and 9,172 full documents with 9,492 multi-page tables spanning up to 13 pages. It is the first large benchmark for multi-page table structure recognition, and the authors also demonstrate a cross-page table continuation classifier that significantly improves document-level extraction performance.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (the primary contribution is the dataset itself: a large-scale, quality-controlled collection of annotated tables at multiple levels of document context, including the first large multi-page table benchmark).
Secondary: $\Psi_{\text{Evaluation}}$ (the paper introduces generalized GriTS and TEDS metrics for multi-table inputs and establishes baseline performance across four distinct tasks); $\Psi_{\text{Method}}$ (the paper introduces POTATR, a page-object adaptation of TATR, and a cross-page table continuation classifier).
What is the motivation?
Table extraction (TE) has traditionally been decomposed into two sequential stages: detecting individual tables, then recognizing their internal structure from cropped images. This decomposition has known drawbacks. Multi-stage pipelines are complex to maintain, and processing tables in isolation discards potentially useful surrounding context. Interest has grown in methods that extract tables directly from full page or document images, including smaller specialized vision-language models (VLMs), but demonstrating progress on this richer problem has been difficult because no adequate benchmarks existed.
Specific gaps the authors identify:
- No large-scale dataset annotates table structure within full-page context with strict quality control and complete coverage (every table on a page must be annotated).
- No large dataset connects tables to their captions and footers with hierarchical relationships.
- No published dataset exists for multi-page table extraction, despite its clear practical importance.
- Existing benchmarks like PubTables-1M have been in use long enough that current model pre-training practices create data leakage concerns.
- Existing datasets skew toward short, narrow tables; challenging long and wide tables are underrepresented.
What is the novelty?
Three context-level collections: Rather than splitting TD and TSR into separate subtasks, PubTables-v2 organizes its data by the amount of surrounding context available:
- Cropped Tables (135,578 samples): exclusively long ($\geq 30$ rows) or wide ($\geq 12$ columns) tables, representing the challenging tail that prior datasets undersample. This collection includes 32% more long or wide tables than PubTables-1M, despite having a fraction of the total count.
- Single Pages (467,541 pages; 548,414 tables): full-page annotations with bounding boxes for tables, rows, columns, spanning cells, column headers, projected row headers, captions, and footers, plus a hierarchical relation set connecting each table to its associated captions and footers.
- Full Documents (9,172 documents; 137,095 pages; 24,862 tables): document-level annotations where every document contains at least one multi-page or split table. This collection includes 9,492 multi-page tables and 630 single-page split tables (split across two columns of a two-column page).
Stricter quality control: PubTables-v2 tightens the annotation quality thresholds from PubTables-1M. The mean normalized edit distance threshold is $\alpha = 0.02$ (down from $0.05$) and a new maximum per-cell edit distance threshold $\beta = 0.2$ is added. For the page- and document-level collections, if any table in a document fails quality control, the entire document is discarded, ensuring no page has missing table annotations.
Generalized evaluation metrics: Standard GriTS and TEDS metrics are defined for a single predicted table against a single ground truth table. The authors generalize both to handle a predicted set of tables against a ground truth set. Using the Hungarian algorithm to find the optimal one-to-one matching, they compute a pseudo-F1 score aggregated over all samples:
$$ \text{GriTS}_f(\mathbf{A}, \mathbf{B}) = \frac{2 \sum_{i,j} f(\tilde{\mathbf{A}}_{i,j}, \tilde{\mathbf{B}}_{i,j})}{|\mathbf{A}| + |\mathbf{B}|} $$
where $\mathbf{A}$ and $\mathbf{B}$ are the ground truth and predicted table matrices and $f$ is a cell-level similarity function. For multi-table inputs, the true positive scores are summed across all samples before computing the final pseudo-F1.
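A sketch of the set-level matching step, assuming a single-table scorer `grits_pair` returning a similarity in $[0, 1]$; scipy's Hungarian solver finds the score-maximizing one-to-one assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_score_sum(gt_tables, pred_tables, grits_pair):
    """Sum of per-pair GriTS scores under the optimal 1:1 matching.

    gt_tables / pred_tables: lists of table matrices; grits_pair(a, b)
    returns the single-table GriTS similarity in [0, 1].
    """
    if not gt_tables or not pred_tables:
        return 0.0
    sim = np.array([[grits_pair(a, b) for b in pred_tables] for a in gt_tables])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return sim[rows, cols].sum()

# Pseudo-F1 for a set of tables: matched true-positive score over set sizes.
# For multi-table inputs the tp_score values are summed across all samples
# before this final division, as described above.
def pseudo_f1(tp_score, n_gt, n_pred):
    return 2 * tp_score / (n_gt + n_pred)
```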
POTATR (Page-Object Table Transformer): An adaptation of TATR for full-page extraction. POTATR extends TATR by adding two page-level object classes (caption, footer) and their rotated counterparts (16 classes total), doubling the object queries to 250, and adding a relation head (a three-layer MLP) to predict parent-child relationships between a table and its associated structures.
Cross-page table continuation classifier: A binary image classifier trained to predict whether the last table on one page continues onto the next. The input is two contiguous page images concatenated side by side. A ViT-B-16 classifier trained on PubTables-v2 achieves recall of 0.995 and precision of 0.987, suggesting strong visual cues for this task in the dataset.
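A sketch of the classifier input construction, assuming torchvision's ViT-B/16 with a binary head; the side-by-side concatenation follows the paper's description, while the resize to $224 \times 224$ and the ImageNet initialization are assumptions of this sketch:

```python
import torch
from torchvision import transforms
from torchvision.models import vit_b_16
from PIL import Image

def make_pair_input(page_a: Image.Image, page_b: Image.Image) -> torch.Tensor:
    # Concatenate two contiguous page images side by side, then resize to
    # ViT-B/16's expected 224x224 input (the resize choice is an assumption).
    h = max(page_a.height, page_b.height)
    canvas = Image.new("RGB", (page_a.width + page_b.width, h), "white")
    canvas.paste(page_a, (0, 0))
    canvas.paste(page_b, (page_a.width, 0))
    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    return tfm(canvas).unsqueeze(0)  # shape: (1, 3, 224, 224)

model = vit_b_16(weights="IMAGENET1K_V1")
model.heads = torch.nn.Linear(model.hidden_dim, 2)  # binary: continues / not
```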
What experiments were performed?
The authors evaluate several recent smaller, specialized VLMs (SmolDocling-256M, GraniteDocling-258M, Qwen2.5-VL-3B, Granite-Vision-3.2-2B, DeepSeek-OCR, DeepSeek-OCR 2, dots.ocr) plus TATR-based models trained on PubTables-v2 data.
Experiment 1: Cropped table structure recognition. Models are evaluated on the Cropped Tables collection (long and wide tables). TATR-v1.2-Pub (fine-tuned on PubTables-v2) is compared against seven VLMs and prior TATR versions. TATR is also compared with four text sources: EasyOCR, PaddleOCR, docTR, and text extracted directly from the PDF (DT).
Experiment 2: Page-level table extraction. Models are evaluated on the Single Pages collection, requiring prediction of all tables on a page. POTATR-v1.0-Pub is introduced here for the first time. Metrics are the generalized GriTS and TEDS variants.
Experiment 3: Document-level table extraction. Models are evaluated on the Full Documents collection. Two scenarios are compared: (1) providing all pages of a document at once, and (2) processing pages individually while evaluating at the document level. Only Qwen2.5-VL-3B was able to process full documents at once due to context window constraints.
Experiment 4: Cross-page table continuation prediction. ResNet-50 and ViT-B-16 are trained on 15,830 pairs (9,866 positive, 5,964 negative) from the Full Documents collection and evaluated on classification metrics (recall, precision, F1, AUC).
Experiment 5: Document-level TE with cross-page merging. The best-performing document-level model (dots.ocr, processed page-by-page) is augmented with the ViT-B-16 continuation classifier. When the classifier predicts a table continuation and the table column counts match, tables are concatenated vertically.
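The merge rule itself is small. A sketch assuming one candidate table per page, each represented as a list of rows, with one classifier decision per page pair:

```python
def merge_across_pages(page_tables, continues):
    """page_tables: one table per page, each a list of rows.
    continues[i] is True when the classifier says the last table on
    page i continues onto page i + 1."""
    merged = [page_tables[0]]
    for i in range(1, len(page_tables)):
        prev, cur = merged[-1], page_tables[i]
        # Concatenate vertically only when the classifier fires and the
        # column counts match, per the paper's rule.
        if continues[i - 1] and prev and cur and len(prev[0]) == len(cur[0]):
            merged[-1] = prev + cur
        else:
            merged.append(cur)
    return merged
```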
Additional experiments (appendix): Single-page split table extraction, analysis of how often models predict the correct number of tables per page, comparison of image-to-graph architectures (Relationformer, EGTR, POTATR) in a small-scale experiment, and an ablation on training set size for the continuation classifier.
What are the outcomes/conclusions?
Key findings:
- Among VLMs on cropped long/wide tables, dots.ocr performs best with $\text{GriTS}_{\text{Con}} = 0.801$. TATR-v1.2-Pub with direct text extraction achieves $\text{GriTS}_{\text{Con}} = 0.980$, roughly halving the error of the prior TATR-v1.1-Pub model, showing the value of additional training data for this challenging subset.
- At the page level, dots.ocr again leads among VLMs ($\text{GriTS}_{\text{Con}} = 0.899$), still below POTATR with direct text extraction ($\text{GriTS}_{\text{Con}} = 0.957$).
- At the document level (pages processed individually), the best VLM, dots.ocr, achieves only $\text{GriTS}_{\text{Con}} = 0.577$ and perfect content-and-structure exact match on only 11.8% of tables ($\text{Acc}_{\text{Con}} = 0.118$). The only model tested on full documents at once, Qwen2.5-VL-3B, scores $\text{GriTS}_{\text{Con}} = 0.047$.
- The cross-page continuation classifier (ViT-B-16) achieves F1 = 0.991, indicating strong visual cues in the dataset. Even with only 250 training samples, models exceed F1 = 0.86.
- Augmenting dots.ocr with the continuation classifier improves document-level $\text{GriTS}_{\text{Con}}$ from 0.577 to 0.684, a substantial gain from a lightweight auxiliary model.
Failure modes observed:
- Qwen2.5-VL-3B enters infinite repetition loops on full-document inputs, even for short two-page documents.
- Granite-Vision-3.2-2B always predicts exactly one table per page, regardless of actual table count, which artificially inflates its split-table metrics.
- No model achieves $\text{Acc}_{\text{Con}} > 0$ on single-page split tables.
Limitations acknowledged by the authors:
- PubTables-v2 is sourced exclusively from English-language scientific articles (PubMed); it does not cover other domains or languages.
- The dataset is not intended to replace benchmarks covering broader variation in table and page appearance.
- Row and column bounding boxes are not annotated for multi-page tables in the Full Documents collection.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint | Paper | Unknown | arXiv 2512.10888 |
| PubTables-v2 Dataset | Dataset | CDLA-Permissive-2.0 | HuggingFace |
| Code | Code | MIT | HuggingFace |
| TATR-v1.2-Pub, POTATR-v1.0-Pub | Models | MIT | HuggingFace |
Models
- TATR-v1.2-Pub: 28M parameters, identical architecture to TATR-v1.1-Pub, fine-tuned on the PubTables-v2 Cropped Tables collection. Weights initialized from TATR-v1.1-Pub (pre-trained on PubTables-1M cropped tables).
- POTATR-v1.0-Pub: 29M parameters, extends TATR with 125 additional object queries (250 total), two additional object classes (caption, footer) plus rotated counterparts (16 classes total), and a three-layer MLP relation head. Initialized from TATR-v1.1-Pub; relation head and extra queries are randomly initialized.
- Cross-page classifiers: Standard ResNet-50 and ViT-B-16 image classifiers. Input is two page images concatenated horizontally.
- All models will be released at the HuggingFace dataset page under MIT license.
Algorithms
Both TATR-v1.2-Pub and POTATR-v1.0-Pub follow the default TATR training setup (AdamW optimizer, standard DETR hyperparameters) unless noted otherwise.
- TATR-v1.2-Pub training: 8 Nvidia T4 GPUs, batch size 2 per GPU (effective batch size 16). Epoch defined as 100,000 samples; trained for 160 epochs. Initial learning rate $5 \times 10^{-5}$, gamma 0.9 every 4 epochs. No-object class weight `eos_coef` = 0.3.
- POTATR-v1.0-Pub training: Same hardware setup. 100 epochs on the Single Pages collection. Initial learning rate $5 \times 10^{-5}$, gamma 0.9 every 2 epochs. `eos_coef` = 0.1. Relation loss weight = 0.05.
- Cross-page classifiers: Trained for 200 epochs, best checkpoint selected by validation F1.
Data
- Source: PubMed Central articles published 2023-2025 (not included in PubTables-1M).
- Annotation leverages author-supplied HTML/XML table markup aligned to PDF via sequence matching.
- Quality thresholds: mean normalized edit distance $\alpha < 0.02$, maximum per-cell edit distance $\beta < 0.2$.
- Table structure canonicalization scheme from PubTables-1M is applied, plus the improved column header annotation correction from Smock et al. (ICDAR 2023).
- Splits: train, validation, public test, and hidden test sets. The hidden test set (4.3% of cropped tables) is withheld to allow future leakage detection.
- Annotation formats: PASCAL VOC for object detection training, grid matrix format for GriTS, HTML for TEDS, JSON for cells and relations.
- Dataset license: CDLA-Permissive-2.0.
Evaluation
- GriTS variants: $\text{GriTS}_{\text{Top}}$ (cell topology), $\text{GriTS}_{\text{Con}}$ (cell content via normalized edit distance). Aggregated as pseudo-F1 weighted by cell count across all samples.
- TEDS and TEDS-S: tree-edit distance similarity on HTML table representation, generalized to sets via Hungarian matching.
- Exact match accuracy: $\text{Acc}_{\text{Top}}$ (topology) and $\text{Acc}_{\text{Con}}$ (content + topology).
- Graph metrics (POTATR): Edge F1 at IoU threshold 0.8 for relation prediction.
- No statistical significance testing or multi-run reporting. Single training run for each model configuration.
- TEDS computation for full documents has a 10-minute timeout; affected samples are bounded $[0, 1]$.
Hardware
- TATR and POTATR training: 8 Nvidia T4 GPUs. GPU-hours not reported.
- VLM inference: vLLM used for SmolDocling, GraniteDocling, and Qwen2.5-VL-3B. Inference hardware and latency not reported.
- Larger models (Qwen2.5-VL-7B) evaluated only qualitatively on a small number of documents due to computational cost.
BibTeX
@article{smock2025pubtablesv2,
title={PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction},
author={Smock, Brandon and Faucon-Morin, Valerie and Sokolov, Max and Liang, Libin and Khanam, Tayyibah and Ramesh, Amrit and Courtland, Maury},
journal={arXiv preprint arXiv:2512.10888},
year={2025}
}
TRivia: Self-supervised Fine-tuning of VLMs for Table Recognition
TL;DR
TRivia is a self-supervised fine-tuning framework that enables vision-language models (VLMs) to learn table recognition from unlabeled table images. It uses Group Relative Policy Optimization (GRPO) with table question-answering as a proxy reward, sidestepping the need for costly human annotations or proprietary model distillation. The resulting TRivia-3B, fine-tuned from Qwen2.5-VL-3B, outperforms Gemini 2.5 Pro and MinerU2.5 on three standard table recognition benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The core contribution is a self-supervised fine-tuning framework (TRivia) built around a training algorithm combining GRPO, response-consistency sampling, and attention-guided QA generation. The paper includes extensive ablations establishing which components drive performance.
Secondary: $\Psi_{\text{Evaluation}}$: The paper evaluates across four benchmarks (PubTabNet, OmniDocBench, CC-OCR, OCRBench v2), compares against 12 baselines spanning three model categories, and includes a robustness analysis to QA noise and complex table types.
What is the motivation?
Table recognition (TR), the task of converting table images into structured representations such as HTML or Markdown, is a core component of document parsing pipelines. Recent VLMs have pushed TR quality forward significantly, but the gap between open-source and proprietary systems remains large.
The root cause is data. Acquiring labeled TR data follows three established paradigms, each with a ceiling:
- Synthetic data (rendered HTML tables) offers scale but misses real-world visual diversity.
- Human-annotated real-world data captures complexity but is expensive and slow to produce.
- Distillation from proprietary models (e.g., Gemini 2.5 Pro) is still costly, may violate service agreements, and hard-caps downstream performance at the teacher’s level.
Even large-scale open-source efforts like MinerU2.5, trained on millions of samples with manual annotations and Gemini 2.5 Pro distillation, cannot exceed Gemini’s own performance. This motivates a different question: can unlabeled table images from the wild, which are cheap to collect at scale, be used to train TR models that go beyond the limits of labeled data?
What is the novelty?
TRivia introduces a closed-loop self-supervised fine-tuning pipeline built around three components:
1. Table QA-driven GRPO
Rather than predicting full HTML markup as a supervised target, TRivia treats TR as a reinforcement learning (RL) problem. The TR model acts as the policy and generates multiple recognition responses $o_1, \ldots, o_R$ for each table image $I$. A separate LLM $M_{\text{LLM}}$ then answers table-specific questions based on each response, and the reward is defined as the average F1-score across the QA set:
$$\text{Reward}(o_j) = \frac{1}{|QA|} \sum_{(q,a) \in QA} \text{F1}(M_{\text{LLM}}(q; o_j),\, a)$$
GRPO optimizes relative reward differences across the response group rather than chasing absolute scores, which means imperfect QA pairs tend to cancel out during within-group normalization. To prevent destabilization from invalid (illegal) recognition outputs, which receive zero reward by default, TRivia filters those responses out before computing advantages.
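A sketch of the reward and the invalid-response filtering, with hypothetical helpers (`is_valid_table_markup`, an `answer_llm` callable, and an `f1` scorer) standing in for the paper's components:

```python
def qa_reward(response, qa_pairs, answer_llm, f1):
    """Mean QA F1 as the GRPO reward; None marks an invalid response."""
    if not is_valid_table_markup(response):   # hypothetical validity check
        return None                           # filtered before advantages
    scores = [f1(answer_llm(q, context=response), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)

def group_advantages(rewards):
    """Within-group normalization computed over valid responses only."""
    valid = [r for r in rewards if r is not None]
    mu = sum(valid) / len(valid)
    sd = (sum((r - mu) ** 2 for r in valid) / len(valid)) ** 0.5 or 1.0
    return [None if r is None else (r - mu) / sd for r in rewards]
```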
2. Response-consistency sampling
Not all unlabeled images contribute equally to learning. TRivia selects samples that cause the model to produce diverse recognition outputs, since GRPO benefits most from samples where relative advantages are informative. For each image $I$, the TR model generates $K$ recognition outputs $\{o_i\}_{i=1}^K$, and the consistency score is defined as the average pairwise TEDS similarity:
$$\text{Consistency}(I) = \frac{2}{K^2 - K} \sum_{1 \le i < j \le K} \text{TEDS}(o_i, o_j)$$
Images with lower consistency scores (more disagreement across responses) are prioritized for training. To keep the process practical, this sampling is done offline once rather than re-evaluated online. Images with consistency scores below 0.4 tend to be noisy or non-tabular and are discarded; the remaining images are uniformly sampled across score intervals from 0.4 to 1.0.
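A sketch of the offline sampling step, assuming a pairwise `teds` scorer; the bin count and per-bin quota are illustrative parameters, not values from the paper:

```python
from itertools import combinations

def consistency(outputs, teds):
    """Average pairwise TEDS over K sampled recognitions of one image."""
    pairs = list(combinations(range(len(outputs)), 2))
    return sum(teds(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

def select_training_images(scored_images, n_bins=6, per_bin=1000):
    """Drop noisy images (score < 0.4), then sample uniformly across
    consistency intervals between 0.4 and 1.0."""
    kept = [(s, img) for s, img in scored_images if s >= 0.4]
    width = 0.6 / n_bins
    selected = []
    for b in range(n_bins):
        lo = b * width + 0.4
        hi = lo + width if b < n_bins - 1 else 1.0 + 1e-9  # include s == 1.0
        selected.extend(img for s, img in kept if lo <= s < hi)
    return selected[: n_bins * per_bin]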
3. Attention-guided diverse QA generation
QA pairs serve as the supervisory signal, so they must cover different table regions and be verifiable. Naive single-pass prompting often covers only part of a table, while multi-pass prompting generates paraphrased duplicates from the same region. TRivia instead uses the VLM’s own attention weights during answer generation to identify the visual source of each QA pair:
$$VS((q,a); I, M_{\text{QA}}) = \{v \mid A_{M_{\text{QA}}}(v \mid a) > \tau_A,\; v \in \mathcal{V}\}$$
Candidate QA pairs are then selected greedily so that no two selected pairs share significant visual overlap (IOU below a threshold $\tau_{\text{IOU}}$). Before selection, each candidate is cross-checked by a separate VLM $M_{\text{Val}}$ to verify that it can be answered correctly with the image but not without it.
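A sketch of the greedy selection, simplifying each visual source to a bounding region (the paper defines it as a set of visual tokens); $\tau_{\text{IOU}} = 0.3$ matches the value reported later in this entry:

```python
def box_iou(a, b):
    """IoU of two (x0, y0, x1, y1) visual-source regions."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def greedy_select(candidates, tau_iou=0.3):
    """candidates: (qa_pair, source_region) tuples, already cross-checked
    by the validator VLM. Keep a pair only if its visual source does not
    significantly overlap any previously selected pair."""
    selected = []
    for qa, region in candidates:
        if all(box_iou(region, r) < tau_iou for _, r in selected):
            selected.append((qa, region))
    return [qa for qa, _ in selected]
```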
What experiments were performed?
Benchmarks. TRivia-3B is evaluated on four benchmarks:
- PubTabNet: academic papers with digital tables.
- OmniDocBench v1.5: 512 samples from digital PDFs with diverse table types.
- CC-OCR: 300 scanned/photographed images, including handwritten, long, and complex tables.
- OCRBench v2 (table parsing subset): 700 images with varied real-world layouts.
The primary metric is TEDS (Tree Edit Distance-based Similarity), which captures both content accuracy and structure. S-TEDS is a structure-only variant.
Baselines. The paper compares against 12 models in three categories:
- Expert TR models: SLANet-plus, UniTable.
- General-purpose VLMs: InternVL3.5-241B, Qwen2.5-VL-72B, Qwen3-VL-235B, Gemini 2.5 Pro, GPT-4o, GPT-5.
- Document-parsing VLMs: dots.ocr, DeepSeek-OCR, PaddleOCR-VL, MinerU2.5.
Ablations. The paper ablates each of the three main TRivia components:
- Removing attention-guided QA generation significantly drops performance on structurally complex tables.
- Replacing response-consistency sampling with random sampling slows convergence and reduces final TEDS from 63.5 to 52.0 on the analysis set.
- Removing illegal-sample filtering destabilizes GRPO rewards, increasing convergence steps by approximately 25% and reducing final TEDS by 3 points.
The paper also compares the QA-based approach directly against distillation alternatives: using Qwen2.5-VL-72B pseudo-labels for SFT decreases TEDS by 8.37 on average, and GRPO with those same pseudo-labels still lags by 4.92 TEDS compared to TRivia.
What are the outcomes/conclusions?
TRivia-3B achieves 89.88 average TEDS across OmniDocBench, CC-OCR, and OCRBench v2, compared to 88.93 for Gemini 2.5 Pro and 86.82 for MinerU2.5. It is the top-performing model overall, with the only gap being on CC-OCR where Gemini 2.5 Pro scores 85.56 versus 84.90 for TRivia-3B.
Several secondary findings are notable:
- TRivia-3B annotations can be used to distill into a smaller Stage-2 model via standard SFT with nearly no performance loss (89.99 vs 89.88 overall TEDS), which contrasts sharply with the degraded results from distilling Qwen2.5-VL-72B pseudo-labels.
- The improvements are most pronounced on challenging table types: rotated tables gain +21.76 TEDS over the base Stage-2 model, with large gains on large tables and borderless merged-cell tables as well.
- TRivia is robust to moderate QA noise: injecting 20% incorrect QA pairs causes only limited degradation, explained by the relative reward normalization in GRPO canceling out uniformly wrong signals.
- A preliminary multi-task study shows TRivia is compatible with key information extraction (KIE), suggesting the approach may generalize beyond TR.
Limitations. The pipeline relies on large auxiliary models (Qwen2.5-VL-72B as $M_{\text{QA}}$ and InternVL3-78B as $M_{\text{Val}}$), which imposes significant compute cost during data preparation even though TRivia-3B itself is compact. The method is validated only on table recognition and one KIE task; broader applicability to other document understanding tasks remains to be demonstrated.
Reproducibility
Models
- TRivia-3B: fine-tuned from Qwen2.5-VL-3B-Instruct. 3 billion parameters. Model weights are released on HuggingFace at opendatalab/TRivia-3B (Apache-2.0). Training code is at opendatalab/TRivia (Apache-2.0). A HuggingFace demo Space is available at opendatalab/TRivia-3B.
- Visual encoder: supports image resolutions from $256 \times 28 \times 28$ to $1280 \times 28 \times 28$, yielding 256 to 1280 visual tokens per image.
Algorithms
Training proceeds in three stages:
| Stage | Data | Trainable params | Samples | Batch | Epochs |
|---|---|---|---|---|---|
| 1 (OTSL warm-up) | Synthetic (PubTabNet, SynthTabNet, MMTab) | LLM only | 700K | 32 | 1 |
| 2 (SFT) | Real-world | All | 50K | 32 | 2 |
| 3 (TRivia/GRPO) | TRivia curated unlabeled | All | 50K | 128 | 1 |
Stage 3 hyperparameters: learning rate $2 \times 10^{-7}$ (ViT), $1 \times 10^{-6}$ (MLP + LM); GRPO group size $G = 16$; sampling temperature 1.2; sequence length 8192; constant LR scheduler.
- QA generation: Qwen2.5-VL-72B-Instruct as $M_{\text{QA}}$, attention layer 72, $\tau_A = 0.01$, $\tau_{\text{IOU}} = 0.3$.
- Validity cross-checking: InternVL3-78B as $M_{\text{Val}}$.
- Answer LLM: Qwen3-8B as $M_{\text{LLM}}$ during GRPO.
Data
- Stage 1: PubTabNet (200K), SynthTabNet (4 subsets, 100K each), MMTab (100K), converted to OTSL format.
- Stage 2: approximately 50K real-world samples from open-source datasets and web sources.
- Stage 3: 100K unlabeled PDF table images collected from web; after response-consistency filtering, 50K selected; QA generation yields a final dataset of 48,470 images with an average of 28.3 QA pairs each.
- Dataset construction details and prompt templates are provided in the supplementary appendix of the paper.
Evaluation
- Metrics: TEDS (full) and S-TEDS (structure only), both computed via tree edit distance against reference HTML.
- Benchmarks: OmniDocBench v1.5 (512 samples), CC-OCR (300 samples), OCRBench v2 table subset (700 samples), PubTabNet, FinTabNet.
- Inference: vLLM used for compatible models; temperature 0.2 for general-purpose VLMs to reduce repetitions; Gemini 2.5 Pro evaluated with thinking mode enabled.
- The authors acknowledge that PubTabNet and FinTabNet in-domain performance may drop after Stage 2 real-world adaptation, as those benchmarks are only present in Stage 1 training.
Hardware
- All experiments run on 8x A100 80GB GPUs.
- Stage 1: approximately 1 day. Stage 2: approximately 2 hours. Stage 3 (GRPO): approximately 2 days.
- TRivia-3B is designed for offline deployment and can run on hardware that fits a 3B parameter VLM.
BibTeX
@article{zhang2025trivia,
title={TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition},
author={Zhang, Junyuan and Wang, Bin and Zhang, Qintong and Wu, Fan and Wen, Zichen and Lu, Jialin and Shan, Junjie and Zhao, Ziqi and Yang, Shuya and Wang, Ziling and Miao, Ziyang and Zhong, Huaping and Zang, Yuhang and Dong, Xiaoyi and Chow, Ka-Ho and He, Conghui},
journal={arXiv preprint arXiv:2512.01248},
year={2025}
}
Benchmarking Table Extraction from Heterogeneous Scientific Documents
TL;DR
This paper proposes an end-to-end benchmark for table extraction (TE) from scientific PDF documents. It introduces two new heterogeneous datasets (Table-arXiv: 36k samples from arXiv LaTeX sources; Table-BRGM: 124 tables from French geological reports), alongside a formally justified set of metrics that jointly evaluate table detection and structure recognition. Experiments across nine methods show that models trained on homogeneous collections like PubTables-1M suffer substantial performance drops on heterogeneous data, and that table extraction remains an unsolved problem.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The central contribution is a new benchmark framework: two new datasets, a formal metric design covering TD, TSR, and end-to-end TE, and a rigorous evaluation protocol that captures model uncertainty and cascading errors. The paper does not introduce a new model or training algorithm.
Secondary: $\Psi_{\text{Resource}}$. Table-arXiv and Table-BRGM are released as community resources. Their heterogeneity (diverse LaTeX typesetting, Word-processed geological reports, multilingual content) makes them useful for training and evaluation beyond the paper’s own experiments.
What is the motivation?
Table extraction from PDFs is a prerequisite for many downstream tasks: question answering over tables, data lake integration, and automated document analysis pipelines. Despite a large number of competing tools, choosing among them is difficult because existing evaluations have two systematic limitations.
First, prior benchmarks assess table detection (TD) and table structure recognition (TSR) as independent problems, each given a clean pre-cropped input. In practice, a TE pipeline chains these steps: the TSR model receives whatever the TD model detected. Errors in detection degrade structure recognition, but this cascade is invisible when the two tasks are scored separately.
Second, the dominant large-scale dataset, PubTables-1M, draws exclusively from PubMed Central biomedical articles produced with professional publishing systems. Models trained on this distribution learn typesetting conventions that do not transfer to LaTeX-compiled preprints from other domains, Word-processed reports, or documents in languages other than English. Benchmarks constructed from only this source inflate reported scores and mask generalization failures.
The authors address both limitations: they design an end-to-end evaluation protocol and construct heterogeneous test data that stress-tests out-of-distribution behavior.
What is the novelty?
Formal end-to-end metrics
The paper introduces new metrics that propagate table detection quality into the TSR score. For a model whose predictions include bounding boxes and optionally a confidence score $\tau$, the TE precision and recall are defined as:
$$ P_{\text{TSR}} = \frac{1}{|\mathcal{P}|} \sum_{\delta \in \mathcal{P}} \mathbb{1}_{\delta > \lambda}\, B^{\text{TSR}}(\delta) \qquad R_{\text{TSR}} = \frac{1}{|\mathcal{G}|} \sum_{\delta \in \mathcal{G}} \mathbb{1}_{\delta > \lambda}\, B^{\text{TSR}}(\delta) $$
where $\delta$ is the Jaccard index between a detected table and its ground-truth match, $\lambda$ is the matching threshold, and $B^{\text{TSR}}(\delta)$ is the TSR score (GriTS or TEDS) for that detected table. Missed detections receive $B^{\text{TSR}} = 0$, so a TD failure directly penalizes the end-to-end score.
For probabilistic models that output a confidence score $\tau_p$ per prediction, the paper additionally derives expected precision and recall by treating the matching threshold $\lambda$ as a random variable with a probability density function. Under the uniform density $S_0$, the expected value of the indicator $\mathbb{1}_{\delta > \lambda}$ for a prediction with Jaccard index $\delta$ is $2\delta$, providing a smooth, principled summary that avoids dependence on a single threshold choice.
Calibration of probabilistic models is measured with the object detection version of Expected Calibration Error (D-ECE):
$$ \text{D-ECE} = \sum_{i=1}^{M} \frac{|\mathcal{P}_i|}{|\mathcal{P}|} \left| \text{prec}_i - \text{conf}_i \right| $$
where predictions are binned by confidence and the gap between average precision and average confidence within each bin is accumulated.
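A sketch of the binned computation, assuming each prediction carries a confidence and a true-positive flag; the bin count is an assumption:

```python
import numpy as np

def d_ece(confidences, is_true_positive, n_bins=10):
    """Detection ECE: bin predictions by confidence and accumulate the gap
    between per-bin precision and mean confidence, weighted by bin mass."""
    conf = np.asarray(confidences, dtype=float)
    tp = np.asarray(is_true_positive, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(conf)
    for lo, hi in zip(edges[:-1], edges[1:]):
        upper = conf <= hi if hi >= 1.0 else conf < hi
        mask = (conf >= lo) & upper
        if mask.any():
            total += mask.sum() / n * abs(tp[mask].mean() - conf[mask].mean())
    return total
```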
Two new heterogeneous datasets
Table-arXiv is built from 2,443 LaTeX source files downloaded from arXiv, spanning the last 20 years and all arXiv subject areas. TD annotations are generated by instrumenting the LaTeX tabular environment to inject anchors that survive PDF compilation. TSR annotations (with cell content) are produced by converting the LaTeX source to HTML using LaTeXML. The result is 36,869 pages with 6,308 annotated tables, all from documents compiled with LaTeX and covering mathematics, physics, astrophysics, and other domains where PubTables-1M has no representation.
Table-BRGM is a small, manually annotated domain-specific set from geological survey reports published by BRGM, France’s national geological survey institution. It contains 6 PDFs, 499 pages, and 124 tables in English and French. Most documents were produced with Word processing software rather than LaTeX. Tables include large multi-column structures, borderless layouts, empty cells, and merged cells, representing challenges absent from PubTables-1M.
Both datasets include negative pages (pages with no tables). Prior benchmarks consisting only of positive instances inflate precision scores because models are never penalized for predicting tables on table-free pages. The new datasets expose this gap by including pages where false positives can occur.
Content-based table detection evaluation
For models like the LVLM (GPT-4o mini) that produce HTML tables without bounding boxes, the paper introduces a content-Jaccard metric based on character-level 2-grams. Given multisets $\mu(P)$ and $\mu(G)$ of character bigrams from predicted and ground-truth table cells:
$$ \text{Jaccard}_c(P, G) = \frac{|\mu(P) \cap \mu(G)|}{|\mu(P) \cup \mu(G)|} $$
This allows all methods to be evaluated for TD even when spatial coordinates are unavailable, using a content-based notion of whether a table was found.
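A sketch using `collections.Counter` as the bigram multiset, where `&` and `|` give multiset intersection and union:

```python
from collections import Counter

def bigrams(cells):
    """Multiset of character 2-grams over all cell strings of a table."""
    c = Counter()
    for text in cells:
        c.update(text[i:i + 2] for i in range(len(text) - 1))
    return c

def content_jaccard(pred_cells, gold_cells):
    p, g = bigrams(pred_cells), bigrams(gold_cells)
    inter = sum((p & g).values())   # multiset intersection
    union = sum((p | g).values())   # multiset union
    return inter / union if union else 0.0
```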
What experiments were performed?
The benchmark evaluates nine end-to-end TE methods across three datasets.
Baseline methods (no confidence scores): PDFPlumber, PyMuPDF, Camelot (rule-based Python libraries); Grobid (ML-based CRF extraction pipeline); and GPT-4o mini used as a zero-shot LVLM given a page image and a prompt to return HTML tables.
Probabilistic methods (output confidence scores): Docling (RT-DETR for TD + TableFormer for TSR + EasyOCR for content); TATR-extract (Table Transformer detector + Table Transformer structure model, with a modified inference loop that retains low-confidence table predictions before NMS); XY+TATR-extract (XY-cut page segmentation prepended to TATR-extract to handle pages with multiple tables); and VGT+TATR-structure (Vision Grid Transformer for TD, which also uses text token grid information, combined with the same TATR-structure model).
Metrics reported include AP and F1 for TD, GriTS-Topology, GriTS-Content, and TEDS for TSR and end-to-end TE, and D-ECE for calibration of probabilistic models. Inference speed is measured as CPU seconds per image on 4 CPUs with 2 concurrent tasks.
What are the outcomes/conclusions?
No single method dominates across datasets. Docling achieves the best overall TD F1 scores across all three datasets (F1 of 0.99 on PubTables, 0.89 on Table-arXiv, 0.90 on Table-BRGM), benefiting from its RT-DETR model trained on the diverse DocLayNet dataset. Among probabilistic models, TATR achieves near-perfect AP on PubTables (1.00) but drops to 0.84 AP on Table-BRGM, and VGT produces better-calibrated confidence scores on heterogeneous data while achieving lower raw AP than TATR on PubTables.
Rule-based tools collapse on heterogeneous data. Camelot achieves F1 of 0.88 on Table-BRGM (where tables often have explicit grid lines) but only 0.25 on PubTables and 0.33 on Table-arXiv. PyMuPDF follows a similar pattern. These tools depend on visual lines and alignment cues that are absent from the LaTeX-typeset documents in PubTables and Table-arXiv.
PubTables-1M-specialized models do not generalize. TATR is trained on PubTables-1M and achieves near-perfect scores on the PubTables test set. On Table-arXiv and Table-BRGM, its scores drop substantially. Its confidence scores also become poorly calibrated: on Table-arXiv, predictions with 80-90% confidence scores correspond to fewer than 2% true positives on average, making the scores essentially uninformative.
End-to-end metrics are more pessimistic than subtask metrics. Once TD errors are propagated into TSR evaluation, all scores decrease. GriTS-Topology scores remain relatively high (tables that are detected tend to have their row/column structure recognized correctly), but GriTS-Content and TEDS drop more, because token extraction (via PDFAlto) misses stylized text like mathematical formulas, limiting content accuracy for TATR-structure-based models.
The LVLM hallucinates cell content. GPT-4o mini in zero-shot mode replaces capital-I characters with the digit 1, reformats numbers, fills in empty cells with fabricated values, and changes numeric values. Its content-based TD score is accordingly low on Table-arXiv. It also achieves only modest F1 scores on TD across all datasets, consistent with prior reports that GPT-4o mini struggles with precise spatial localization.
Inference speed differs by an order of magnitude. Rule-based tools and Grobid require roughly 0.4-1.2 seconds per image. Docling, TATR, and XY require 9-13 seconds. VGT is substantially slower at 128 seconds per image due to its grid transformer pipeline.
The authors conclude that table extraction is not a solved problem. Heterogeneous data exposes model fragility that homogeneous benchmarks conceal, and the cascade from TD to TSR means that strong subtask numbers do not guarantee strong end-to-end quality.
Reproducibility
Models
No new model weights are introduced. The benchmark uses existing off-the-shelf models:
- TATR-detect and TATR-structure: Table Transformer models from Microsoft, trained on PubTables-1M. Available via the table-transformer repository and HuggingFace.
- VGT: Vision Grid Transformer, fine-tuned on PubLayNet with DiT-base ViT weights. Available via the VGT repository. Uses BERT tokenizer with LayoutLM word embeddings.
- Docling: Available as a Python library. Uses RT-DETR (trained on DocLayNet) for TD and TableFormer for TSR.
- Grobid: Available as an open-source library.
- GPT-4o mini: Accessed via OpenAI API in zero-shot mode with the prompt shown in Appendix A.2 of the paper.
Algorithms
The benchmark evaluation code is available at the GitLab repository listed in the artifacts. The authors state that their pipelines were built starting from the microsoft/table-transformer repository. AP is computed with Scikit-learn’s average_precision_score (area under the full precision-recall curve, following the sklearn convention rather than the Pascal VOC interpolation).
No training is performed. Key pipeline parameters:
- TATR-extract retains all predictions with confidence above 0.05 for “table” or “table rotated” (modifying the original inference which kept only the top-1 label), then applies NMS.
- XY+TATR-extract runs XY-cut recursive page segmentation (counting black pixels along both axes) before TATR-detect; each chunk retains only the top 2 predictions.
- Table padding for TATR-structure crops: $\delta_{\text{pxs}} = 100$ pixels added around detected tables; $\delta_{\text{pxd}} = 10$ pixels padding for XY-cut sub-images.
- Content-Jaccard matching threshold for LVLM TD evaluation: $\iota = 0.5$.
- Camelot uses the “lattice” mode (demarcated cell lines), which outperformed other modes in this evaluation.
Data
Table-arXiv: 2,443 LaTeX source files from arXiv, sampled from all subject domains over 20 years. TD annotations are generated automatically by instrumenting LaTeX compilation; TSR annotations are produced by LaTeXML. Contains 36,869 pages, 5,214 positive pages, 6,308 annotated tables. All documents are LaTeX-compiled.
Table-BRGM: 6 PDF documents from BRGM geological reports (French and English). 499 pages, 91 positive pages, 124 tables. Manually annotated; an end-to-end model was used as a starting point and outputs were manually corrected. Mix of Word-processed (67%) and LaTeX (33%) documents.
PubTables-Test: Subset of the PubTables-1M test set for which PDF files could be retrieved from PubMed Central (Open Access subset). 23,175 PDFs, 46,942 pages (all positive), 55,990 annotated tables. Ground-truth HTML markup was generated from the original dataset annotations. Note that not all PubMed PDFs were available; the subset used here covers only papers with valid retrievable PDFs. The PubTables-1M original annotations are available at the table-transformer repository; the reconstructed HTML ground truth and the PDF subset used are provided via the benchmark GitLab repository.
Datasets are available at https://gitlab.inria.fr/msoric/table-extraction-benchmark under MIT license.
Evaluation
Metrics used:
- TD: Precision, Recall, F1, and AP at IoU threshold $\lambda = 0.5$; expected Precision/Recall (integral over $\lambda$ weighted by density $S_{0.5}$); D-ECE for calibration.
- TSR: GriTS-Topology, GriTS-Content, TEDS computed on HTML markup normalized to only `<table>`, `<tr>`, `<td>` tags (header/body distinctions stripped for comparability).
- End-to-end TE: $P_{\text{TSR}}$ and $R_{\text{TSR}}$ with each of the three TSR scores; AP versions for probabilistic models and F1 versions for baseline models.
TSR evaluation is conditioned on TD: only true-positive detections (IoU $\geq 0.5$) are passed to the structure model. Missed tables receive a score of 0 in the end-to-end aggregation. This means TSR scores cannot be compared directly across models because each model evaluates on a different subset of detected tables.
The benchmark does not report statistical significance tests or error bars. Each model is run once. LVLM results carry a contamination caveat: GPT-4o mini may have been trained on benchmark datasets.
Hardware
Inference is measured on CPU (4 CPUs per task, 2 tasks concurrently). No GPU-hours or training hardware are relevant. Approximate inference times per image: PDFPlumber 0.35s, PyMuPDF 0.41s, Grobid 0.78s, Camelot 1.18s, LVLM 8.88s (API), XY+TATR 9.32s, Docling 11.36s, TATR 13.33s, VGT 127.98s.
BibTeX
@article{soric2025benchmarking,
title={Benchmarking Table Extraction from Heterogeneous Scientific Documents},
author={Soric, Marijan and Manolescu, Ioana and Gracianne, C\'{e}cile and Senellart, Pierre},
journal={arXiv preprint arXiv:2511.16134},
year={2025}
}
TABLET: Table Structure Recognition using Encoder-only Transformers
TL;DR
TABLET is a split-merge TSR model that treats row and column splitting as 1D sequence labeling and cell merging as OTSL-based grid cell classification, using only Transformer encoders throughout. Trained on FinTabNet and PubTabNet, it reaches 98.54 TEDS on FinTabNet while processing 18 FPS end-to-end, roughly 2.5 times faster than the prior fastest reported split-merge method.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the TABLET architecture: dual Transformer encoders for split sequence labeling and a third Transformer encoder for merge classification. The paper devotes most pages to the model design, ablations on encoder depth, and benchmark comparisons on FinTabNet and PubTabNet.
Secondary: None. No dataset, benchmark, or tool is released alongside the method.
What is the motivation?
Table structure recognition in production environments faces two competing pressures: accuracy and throughput. End-to-end autoregressive approaches (e.g., TableMaster, TableFormer, VAST, TFLOP) generate HTML or OTSL sequences token by token, which gives them a natural advantage on TEDS because TEDS itself evaluates cell-level accuracy in a way that mirrors their output format. However, autoregressive decoding is inherently sequential, making it slow on large, densely populated tables. For financial documents (annual reports, earnings announcements), tables often have dozens of rows and columns with small fonts, creating long output sequences that slow autoregressive methods further.
The authors also observe that for end-to-end methods, predicted cell bounding boxes frequently misalign with OCR-extracted text positions, causing errors in the final TEDS evaluation even when structure is correct. This TEDS-Struct vs. TEDS gap is visible in the paper’s comparison tables (e.g., Zhu et al. show a gap exceeding 2 points; Ly et al. exceeding 3 points).
Top-down split-merge methods avoid bounding box prediction and matching entirely, but existing approaches (TSRFormer, SEMv2, SEMv3) rely on Spatial CNN feature extractors or RNN-based sequence models that either introduce complexity or lag behind Transformer-based alternatives. The paper’s goal is a split-merge model that uses only Transformer encoders while maintaining a concise, high-resolution feature pipeline suited for dense financial tables.
What is the novelty?
Overall Architecture
TABLET is a two-stage pipeline. A split model segments the table into an $R \times C$ grid; a merge model then classifies each grid cell to reconstruct spanning cells. The two models are trained separately.
Before inference, all table images are rescaled so that the longer side is 960 pixels (preserving aspect ratio) and padded to $960 \times 960$.
Split Model
A modified ResNet-18 (Max Pooling removed, all channel counts halved) combined with an FPN at 128 channels produces a feature map $F_{1/2}$ of size $(H/2) \times (W/2) \times 128$, half the input resolution. Removing Max Pooling preserves fine-grained spatial detail that matters for small-font, dense tables.
Global and local features are extracted along each axis independently:
- Horizontal (row splitting): Global projection along the horizontal axis yields $F_{RG}$ of shape $(H/2) \times 128$. Local features apply Average Pooling ($1 \times 2$) then $1 \times 1$ convolution to obtain $F_{RL}$ of shape $(H/2) \times (W/4)$. Concatenation gives $F_{RG+L}$ of shape $(H/2) \times (128 + W/4)$.
- Vertical (column splitting): The same procedure along the vertical axis produces $F_{CG+L}$ of shape $(W/2) \times (128 + H/4)$.
Each feature map is fed into its own Transformer encoder (3 layers, 8 heads, $d_{ff} = 2048$, dropout 0.1). The fixed input sequence lengths are $H/2 = 480$ and $W/2 = 480$; positional embeddings are 1D, randomly initialized, and learned end-to-end.
Each encoder output position is passed through a linear head for binary classification (split vs. non-split line). Training uses Focal Loss:
$$ FL_{\text{split}} = \frac{1}{n_h} \sum_{i=1}^{n_h} \alpha_i (1-p_i)^\gamma (-\log p_i) + \frac{1}{n_v} \sum_{j=1}^{n_v} \alpha_j (1-p_j)^\gamma (-\log p_j) $$
with all class weights $\alpha = 1$ and focusing parameter $\gamma = 2$. Predictions are upsampled 2$\times$ to match input resolution. A post-processing rule reclassifies any non-split region containing no OCR text projection as a split region, handling empty rows and columns.
Merge Model
A standard ResNet-18 + FPN (256 channels) extracts a feature map $F_{1/4}$ at $(H/4) \times (W/4) \times 256$. Grid cells from the split output are mapped to $F_{1/4}$ via RoIAlign ($7 \times 7$), yielding a feature tensor $F_{\text{grids}}$ of shape $R \times C \times (7 \times 7 \times 256) = R \times C \times 12544$.
A two-layer MLP (Linear + ReLU, both layers 512-dimensional) compresses each cell to 512 dimensions, forming a sequence $S_{\text{grids}}$ of length $R \times C$. A Transformer encoder (3 layers, 8 heads, $d_{ff} = 2048$, dropout 0.1, maximum sequence length 640) with learnable 2D positional embeddings models cross-cell interactions. A linear classification head then assigns each grid cell one of four OTSL labels:
- C: new cell (may contain content)
- L: left-looking span (merges with the cell to its left)
- U: up-looking span (merges with the cell above)
- X: cross span (merges with both left and upper neighbors)
The NL (newline) token from OTSL is unnecessary in a split-merge context and is not used. Merging again uses Focal Loss:
$$ FL_{\text{merge}} = \frac{1}{R \times C} \sum_{k=1}^{R \times C} \alpha_k (1-p_k)^\gamma (-\log p_k) $$
with $\alpha = 1$, $\gamma = 2$.
The OTSL output is converted to HTML, and OCR-extracted text blocks are assigned to cells by position.
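A sketch of recovering spanning cells from the merge model's OTSL grid labels; it assumes a well-formed label grid in which every span is rectangular:

```python
def otsl_to_spans(labels):
    """labels: R x C grid of OTSL classes 'C', 'L', 'U', 'X'.
    Returns {cell_id: (row, col, rowspan, colspan)}."""
    rows, cols = len(labels), len(labels[0])
    cell_id = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            tag = labels[r][c]
            if tag == 'C':
                cell_id[r][c] = next_id   # new cell
                next_id += 1
            elif tag == 'L':              # merge with the cell to the left
                cell_id[r][c] = cell_id[r][c - 1]
            else:                         # 'U' or 'X': merge with the cell above
                cell_id[r][c] = cell_id[r - 1][c]
    # Collect the bounding extent of each cell id into (row, col, rowspan, colspan).
    extents = {}
    for r in range(rows):
        for c in range(cols):
            i = cell_id[r][c]
            r0, c0, r1, c1 = extents.get(i, (r, c, r, c))
            extents[i] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return {i: (r0, c0, r1 - r0 + 1, c1 - c0 + 1)
            for i, (r0, c0, r1, c1) in extents.items()}
```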
What experiments were performed?
Datasets
- FinTabNet v1.0.0: 91,596 train / 10,656 val / 10,635 test tables from financial announcements. A manually refined subset of 9,681 test tables (Hou et al., same authors) with corrected annotations is also used.
- PubTabNet v2: 500,777 train / 9,115 val tables from scientific articles. Note: PubTabNet does not provide original PDF files, so OCR is required to extract cell text.
Metrics
TEDS (content-inclusive), TEDS-Struct (structure only, ignoring cell content), and Accuracy (proportion of tables with fully correct structure and content).
Ablations
Tables 2 and 3 in the paper sweep the Transformer encoder depth (0, 1, 3, 6 layers) independently for the split and merge models. The 3-layer configuration consistently achieves the best results; deeper encoders (6 layers) degrade performance, suggesting the task does not benefit from additional capacity at this scale.
Speed Comparison
Inference speed is measured end-to-end (image resizing, Split-Merge inference, OCR text matching, HTML post-processing) on a single NVIDIA A100 80GB GPU at FP32.
| Method | Adj. FPS |
|---|---|
| TABLET | 18.01 |
| RobusTabNet | ~7.33 |
| TSRFormer | ~7.31 |
| TRUST | ~4.44 |
| SEMv2 | 7.30 (unadjusted; image size not reported by authors) |
| VAST | ~0.69 |
The authors also benchmark Qwen2.5-VL-7B on FinTabNet: zero-shot TEDS is more than 10 points below TABLET; after fine-tuning, the gap narrows to roughly 2 points, but FPS is over 100$\times$ slower.
What are the outcomes/conclusions?
FinTabNet (test set): TEDS Simple/Complex/All = 98.97 / 98.14 / 98.54; TEDS-Struct = 99.10 / 98.35 / 98.71; Accuracy = 88.18%. The authors report this as the highest TEDS among listed methods with image-only input. The near-zero gap between TEDS and TEDS-Struct (0.17 points) confirms that bounding box misalignment is absent for split-merge methods.
FinTabNet (refined test set): TEDS = 98.87; TEDS-Struct = 98.99; Accuracy = 90.00%.
PubTabNet (val set): TEDS = 96.79; TEDS-Struct = 97.67. Competitive but below DRCC (97.80 TEDS / 98.90 TEDS-Struct). The authors note that DRCC’s semi-autoregressive approach is less sensitive to resolution than pixel-classification split-merge methods, which partly explains the gap.
Error analysis (FinTabNet): The authors identify five failure modes: adjacent columns with overlapping text projections that prevent correct splitting; misaligned column headers; multi-line single-cell text mistakenly split into multiple rows; empty cell annotation inconsistencies; and cases where the model’s output is actually correct but the FinTabNet annotation is wrong.
Acknowledged limitations: The method assumes tables have already been extracted and dewarped; it does not handle raw document pages, distorted images, or tables in natural scenes. It depends on an OCR engine as preprocessing, so OCR quality directly affects final TEDS scores (especially visible on PubTabNet, where different methods use different OCR tools). The merge model sequence length is capped at 640 grid cells; very large tables that produce more grid cells than this are truncated.
Unacknowledged gaps: No code or model weights are released. No error bars or multi-run statistics are reported. The Accuracy metric (fully correct tables) is only reported on FinTabNet, not PubTabNet. The comparison table on FinTabNet is incomplete (several methods report only TEDS-Struct, omitting TEDS), making direct comparison across all metrics impossible.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | arXiv-nonexclusive-v1.0 | arxiv.org/abs/2506.07015 |
| FinTabNet v1.0.0 | Dataset | CDLA-Permissive-1.0 | github.com/ibm-aur-nlp/FinTabNet |
| PubTabNet v2 | Dataset | CDLA-Permissive-1.0 | github.com/ibm-aur-nlp/PubTabNet |
No code, pretrained weights, or evaluation scripts were released with this paper.
Models
- Split model: Modified ResNet-18 (no MaxPool, half channels) + FPN (128 channels); two Transformer encoders (3 layers, 8 heads, $d_{ff} = 2048$, dropout 0.1); 16.1M parameters total.
- Merge model: Standard ResNet-18 + FPN (256 channels); RoIAlign $7 \times 7$; 2-layer MLP (512-dim); Transformer encoder (3 layers, 8 heads, $d_{ff} = 2048$, dropout 0.1); 32.5M parameters total.
- Total: approximately 48.6M parameters.
- No pretrained weights or code repository released.
Algorithms
- Optimizer: AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, weight decay $5 \times 10^{-4}$).
- Learning rate: $3 \times 10^{-4}$ for the split model (constant); polynomial decay schedule (power 0.9) for the merge model.
- Gradient clipping: L2 norm with `max_norm` = 0.5.
- Batch size: 32.
- Training epochs: 16 (split model), 24 (merge model).
- Focal Loss: $\alpha = 1$, $\gamma = 2$ for both split and merge.
- Input resolution: 960$\times$960 (proportional resize + zero-padding).
- Split region annotation preprocessing: regions narrower than 5 pixels are expanded symmetrically to a minimum of 5 pixels.
Data
- FinTabNet v1.0.0: 91,596 train / 10,656 val / 10,635 test. Available under CDLA-Permissive-1.0 from IBM Research.
- PubTabNet v2: 500,777 train / 9,115 val. Available under CDLA-Permissive-1.0. PubTabNet does not provide original PDFs; cell text must be obtained with an OCR tool.
- Refined FinTabNet test set (9,681 tables): released by the same authors in prior work (Hou et al.); exact availability not specified in this paper.
- No new datasets are introduced or released.
Evaluation
- TEDS and TEDS-Struct computed over all tables, split into simple (no spans) and complex (at least one spanning cell) subsets.
- Accuracy = percentage of tables with fully correct structure and content.
- Baselines vary in what they report; direct comparison on all three metrics is not possible for most methods.
- No error bars, significance tests, or multiple training runs reported.
Hardware
- Training: 2$\times$ NVIDIA A100 80GB GPUs.
- Inference: 1$\times$ NVIDIA A100 80GB GPU; 18.01 FPS end-to-end (960$\times$960 input, full pipeline including OCR text matching and HTML post-processing).
- No training time, total GPU-hours, or memory breakdown reported.
BibTeX
@inproceedings{hou2025tablet,
title={TABLET: Table Structure Recognition using Encoder-only Transformers},
author={Hou, Qiyu and Wang, Jun},
booktitle={Proceedings of the International Conference on Document Analysis and Recognition (ICDAR)},
year={2025}
}
OG-HFYOLO: Orientation Gradient Guidance and Heterogeneous Feature Fusion for Deformation Table Cell Instance Segmentation
TL;DR
OG-HFYOLO extends YOLOv5-based instance segmentation with three new modules (a gradient orientation extractor, a heterogeneous kernel cross-fusion block, and a scale-aware loss) plus mask-driven NMS, targeting the problem of accurately localizing cells in physically deformed table images. The authors also construct DWTAL, a large-scale synthetically augmented dataset of deformed wired tables with pixel-level segmentation annotations, using a custom wave and cylindrical warping generator applied to existing datasets.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The paper’s primary contribution is a new model architecture with several interacting modules. The bulk of the content covers architectural design choices, loss function derivations, and ablation experiments validating each component.
Secondary: $\Psi_{\text{Resource}}$. The paper also releases DWTAL, a dataset of 28,285 images across two difficulty tiers (DWTAL-s and DWTAL-l) with instance segmentation annotations for deformed table cells. This is a meaningful contribution, since no prior public dataset offered pixel-level annotations for this task.
What is the motivation?
Physical deformations in photographed or scanned tables (bending, perspective distortion, page curl from binding) disrupt the alignment between cell content and structure. Downstream tasks such as content extraction and OCR depend on accurate spatial coordinate localization of individual cells, but most existing table structure recognition methods assume near-planar inputs.
Prior approaches to cell coordinate localization fall into two categories. Contour-based detection methods lose critical information when deformations are severe. Text box segmentation methods handle non-deformed wired tables but struggle with geometric distortion. Keypoint detection partially addresses the problem but corner regression accuracy degrades under heavy curvature. Instance segmentation offers pixel-level boundary descriptions that can represent irregular, curved cell shapes, but two practical barriers remained: (1) existing datasets lacked the fine-grained segmentation annotations needed to train such a model, and (2) instance segmentation methods had not been adapted to the specific challenges of deformed table cells: dense object arrangements, extreme aspect ratio variation from merged cells, and overlapping bounding boxes from complex cell shapes.
What is the novelty?
The paper introduces four technical contributions that work in concert:
Gradient Orientation-aware Extractor (GOE). Placed after the first downsampling convolution in the backbone, GOE applies separate horizontal and vertical decoupling operators $G_x$ and $G_y$ (initialized as edge operators, then trained freely) to the input feature map. The gradient magnitude and gradient direction are computed as:
$$I_{GM} = \sqrt{G_x^2 + G_y^2}$$
$$I_{GD} = \text{Cat}(G_x, G_y)$$
An orientation attention mechanism $\Phi$, inspired by HOG’s orientation binning, initializes its $1 \times 1$ convolutional kernels from polar directional basis vectors:
$$\mathbf{e}_i = \begin{bmatrix} \cos \theta_i \\ \sin \theta_i \end{bmatrix}, \quad \theta_i = \frac{i\pi}{n}$$
These are assembled into a weight matrix $W \in \mathbb{R}^{n \times 2 \times 1 \times 1}$ mapping gradient components to $n$ orientation channels. The final output combines both cues:
$$I_o = \text{IN}(\text{Softmax}(\Phi(I_{GD}))) \odot I_{GM}$$
where IN denotes instance normalization. This gives the network a geometric prior about edge directionality that standard convolutions lack.
Heterogeneous Kernel Cross Fusion (HKCF). Inserted after skip connections in the FPN-PAN neck, HKCF addresses the extreme width and height variation of merged table cells. It uses a bottleneck to reduce channel count, then runs parallel $1 \times k$ and $k \times 1$ asymmetric convolutions (horizontal and vertical cross-convolutions) whose outputs are summed:
$$HXConv^{(k)}(I) = HConv^{(k)}(I) + VConv^{(k)}(I)$$
A Channel Attention Bridge (CAB) before the cross-convolutions compensates for information loss from dimensionality reduction. Following the Heterogeneous Kernel Select Protocol from YOLO-MS, kernel sizes 3, 5, and 7 are applied at progressively deeper fusion stages.
Scale-aware loss function. The standard YOLO mask normalization divides by instance area $A$, but $1/A$ has a second derivative $2/A^3$ that causes near-vertical gradient growth for very small cells. The authors introduce a composite weighting term $W_s = 1 + \log(1/A)$, yielding:
$$L_{\text{scale-aware}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \left(1 + \log \frac{1}{A_i}\right) \frac{1}{A_i} \sum_{(x,y) \in \text{crop}(\Omega)} \frac{1}{H'_i W'_i} L_{\text{base}}(x,y) \right]$$
where $L_{\text{base}} = L_{\text{BCE}} + L_{\text{Dice}}$. The $\log(1/A)$ term moderates the curvature of compensation, so gradient magnitudes scale more smoothly across the object size distribution.
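A sketch of the weighting, assuming instance areas normalized to $(0, 1]$ and a base loss already averaged over each instance's crop:

```python
import torch

def scale_aware_weight(area: torch.Tensor) -> torch.Tensor:
    """W_s * (1/A): the log term tempers the curvature of the 1/A
    compensation so very small cells don't dominate the gradient.
    Assumes areas are normalized to (0, 1], so log(1/A) >= 0."""
    return (1.0 + torch.log(1.0 / area)) / area

def scale_aware_loss(base_loss_per_instance, areas):
    """base_loss_per_instance: per-instance BCE + Dice, already averaged
    over the cropped region; mean over instances gives the 1/n factor."""
    return (scale_aware_weight(areas) * base_loss_per_instance).mean()
```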
Mask-driven NMS. Standard bounding-box IoU NMS incorrectly suppresses valid cells when deformed bounding boxes substantially overlap. The proposed replacement computes IoU directly on predicted binary masks:
$$\text{Mask\_IoU} = \frac{|M_i \cap M_j|}{|M_i \cup M_j|}$$
suppressing a lower-confidence mask only when this pixel-level overlap exceeds the threshold.
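A sketch of the greedy mask-level NMS; the suppression threshold here is illustrative, since the paper's value is not quoted in this summary:

```python
import numpy as np

def mask_nms(masks, scores, thresh=0.5):
    """Greedy NMS on binary masks: suppress a lower-confidence mask only
    when its pixel-level IoU with an already-kept mask exceeds `thresh`."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    for i in order:
        mi = masks[i].astype(bool)
        suppressed = False
        for j in keep:
            mj = masks[j].astype(bool)
            inter = np.logical_and(mi, mj).sum()
            union = np.logical_or(mi, mj).sum()
            if union and inter / union > thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return keep
```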
DWTAL dataset. The authors construct a data generator applying wave warping (sine/cosine-based, amplitude $A \in [10, 50]$) and cylindrical warping (cosine-based along the y-axis) to mildly deformed source images from TAL-OCR and WTW, with an illumination adjustment stage. The result is two datasets: DWTAL-s (8,765 images, simpler backgrounds from TAL-OCR) and DWTAL-l (19,520 images, more complex backgrounds from WTW), both with pixel-level instance segmentation annotations.
What experiments were performed?
All experiments used a single NVIDIA RTX 3090 (24 GB). For ablation studies, no pretrained weights were used. For comparison experiments against non-YOLO models, ResNet-101 backbones were initialized from ImageNet-pretrained MSRA weights via MMDetection.
Training settings: SGD optimizer, momentum 0.9, learning rate 0.001, weight decay 0.0005, $640 \times 640$ input, batch size 2 (batch size 1 for DWTAL-l comparisons), 200 epochs (ablation) and 100 epochs (comparisons).
Baselines spanned two-stage models (Mask R-CNN, Cascade Mask R-CNN), single-stage models (SOLOv2, YOLACT), Transformer-based segmentation (Mask2Former), and YOLO variants (YOLOv5l-seg, YOLOv8l-seg, YOLOv11l-seg).
Evaluation metrics: mask mAP@50, mask mAP@50:95, bounding box mAP@50, bounding box mAP@50:95, parameter count, and GFLOPs.
Ablation studies isolated each of the four proposed components, examining individual and joint contributions on DWTAL-s. A separate backbone ablation compared the YOLOv5 C3 baseline against substituting YOLOv11’s C3k2 and C2PSA components. An anchor mechanism ablation confirmed that anchor-based detection substantially outperforms anchor-free variants on this task.
What are the outcomes/conclusions?
On DWTAL-s, the full OG-HFYOLO model achieves mask mAP@50:95 of 74.23%, compared to 71.96% for YOLOv5l-seg, 64.4% for Mask2Former, 62.5% for Mask R-CNN, 57.5% for YOLOv8l-seg, and 57.8% for YOLOv11l-seg. On DWTAL-l, OG-HFYOLO reaches mask mAP@50:95 of 62.38% versus 61.34% for YOLOv5l-seg (the next-best on this metric) and 44.7% for Mask R-CNN.
The ablation results indicate that each module delivers modest gains in isolation (GOE alone: +0.44% mask mAP@50:95; HKCF alone: +0.09%; scale-aware loss alone: +0.48%) but their joint use yields substantially larger improvements. Mask-driven NMS provides an additional +0.94% increase when added on top of the three structural modules, and is identified as the single component with the largest individual contribution in the combined setting. The authors note that the challenges in this dataset are interdependent, so the modules benefit from each other.
The backbone ablation found that the simpler YOLOv5 C3 architecture outperforms YOLOv11-style components within the same training setup, which the authors take as evidence that the proposed modules are more important than the backbone choice.
Anchor-based detection shows a greater than 10% absolute improvement over anchor-free variants on both datasets, suggesting that the dense, regular structure of table cell arrangements favors anchor-based priors.
Qualitative results on natural scene photographs and the CamCap dataset (using a model trained exclusively on DWTAL-l) suggest some cross-domain generalization, though the evaluation is limited to visual examples rather than held-out metrics.
Limitations. The datasets are entirely derived from two source datasets (TAL-OCR and WTW), so the diversity of deformation types and backgrounds is constrained by those sources. All images contain only a single table instance. Downstream logical coordinate recovery is left as future work. Statistical reporting (standard deviations, multiple runs) is absent. The model’s parameter count (125.39 M) is higher than several baselines, which may create deployment constraints.
Reproducibility
Models
OG-HFYOLO is built on a modified YOLOv5 backbone (CSPDarknet with C3 modules) augmented by GOE after the first downsampling layer. The neck uses FPN-PAN with HKCF blocks replacing standard skip-connection convolutions. The detection head retains YOLOv5’s anchor-based design. Total parameters: 125.39 M. GFLOPs: 170 (at $640 \times 640$).
Code is released under AGPL-3.0 at https://github.com/justliulong/OGHFYOLO. The model architecture is defined in cfg/models/segment/og-hfyolo.yaml. A demonstration checkpoint (best.pt) is available via GitHub releases (v1.0.2). No fully reproduced checkpoint from the paper’s complete training runs is separately documented; the released weight is a demonstration model. Environment setup is provided via requirements.txt or environment.yaml (Python >= 3.8, PyTorch >= 1.8).
Algorithms
- Optimizer: SGD, momentum 0.9
- Learning rate: 0.001 (schedule not specified)
- Weight decay: 0.0005 (ablation), 0.0001 (comparison experiments)
- Batch size: 2 for DWTAL-s; 1 for DWTAL-l
- Training epochs: 200 (ablation), 100 (comparison)
- Input resolution: $640 \times 640$
- Bounding box regression: EIoU loss replacing CIoU
- Mask loss: $L_{\text{base}} = L_{\text{BCE}} + L_{\text{Dice}}$, scaled by $(1 + \log(1/A)) \cdot (1/A)$
- Post-processing: Mask IoU NMS replacing bounding-box IoU NMS
- GOE orientation bins: $n$ (value not stated explicitly in the paper)
- HKCF kernel sizes: 3, 5, 7 at three FPN levels
Data
- DWTAL-s: 8,765 images total; 7,012 train / 1,753 test. Primarily derived from TAL-OCR with 150 collected images. Simpler backgrounds.
- DWTAL-l: 19,520 images total; 15,616 train / 3,904 test. Primarily derived from WTW. More complex backgrounds.
- 80/20 train/test split, stratified by deformation type.
- Labels are pixel-level instance segmentation masks per table cell.
- A logical coordinate annotation version has also been released (via GitHub releases, without images; images must be downloaded separately).
- DWTAL is distributed in two formats. YOLO format is hosted on Google Drive (separate links for DWTAL-s and DWTAL-l). COCO format is hosted on Hugging Face at justliulong/DWTAL and can be downloaded via the datasets library. The data generator code lives in the GitHub repo at dataprocess/data_gen.py. Labelme-format annotation archives are also available on Google Drive.
- The underlying source datasets (TAL-OCR and WTW) have their own licenses; the paper does not state an explicit license for DWTAL beyond referencing the AGPL-3.0 repo and describing it as open-source.
Evaluation
- Primary metrics: mask mAP@50 and mAP@50:95 (COCO-style AP over 101 recall points)
- Secondary metrics: bounding box mAP@50 and mAP@50:95
- Non-YOLO baselines: ResNet-101 backbone, pretrained on ImageNet, fine-tuned 100 epochs via MMDetection
- No error bars, significance tests, or repeated runs are reported
- Comparison against Mask2Former uses ResNet-101 backbone for consistency with other non-YOLO models; a ViT-based Mask2Former is not evaluated
- CamCap generalization evaluation is qualitative only
Hardware
- Training hardware: NVIDIA RTX 3090, 24 GB VRAM
- Framework: Python 3.8.19, PyTorch 1.13.0, CUDA 12.4
- Training time and energy consumption are not reported
- Inference speed is discussed qualitatively (OG-HFYOLO described as retaining single-stage speed) but no concrete latency numbers are given
BibTeX
@article{liu2025oghfyolo,
title={OG-HFYOLO: Orientation Gradient Guidance and Heterogeneous Feature Fusion for Deformation Table Cell Instance Segmentation},
author={Long Liu and Cihui Yang},
journal={arXiv preprint arXiv:2504.20682},
year={2025}
}
CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry
TL;DR
CISOL (Construction Industry Steel Ordering Lists) is a medium-sized, human-annotated dataset of 844 scanned civil engineering documents with over 120,000 annotated instances for table detection (TD) and table structure recognition (TSR). All documents are anonymized real-world steel ordering lists in German, sourced from 10 structural engineering firms. The dataset is released under CC-BY-4.0 on Zenodo and is accompanied by a public evaluation server. Benchmarking with YOLOv8 reaches 67.22 mAP@0.5:0.95:0.05, outperforming the TSR-specific Table Transformer (TATR) on this domain, while both results remain below the annotation-derived convergence threshold.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$. The primary contribution is the dataset itself: a new, openly licensed, human-annotated collection of real-world civil engineering documents that fills a gap in available table extraction data. The paper documents a thorough creation pipeline, an annotation guideline, a stratified splitting strategy, and a long-term public evaluation server. The dataset release is the headline novelty.
Secondary: $\Psi_{\text{Evaluation}}$. The authors measure annotation consistency using Krippendorff’s $\alpha$ and derive a label convergence threshold that defines a theoretical upper bound on model performance. They then benchmark YOLOv8 and TATR against this bound, providing a measurement framework as well as a dataset.
What is the motivation?
Table extraction from scanned documents is a long-standing problem in document analysis. Most publicly available table extraction datasets come from born-digital government documents, scientific articles, or financial reports. Civil engineering documents, such as steel ordering lists, have a distinctive structure: large tables with spanning cells, embedded tables, technical drawings, and highly domain-specific content. No dataset for this sub-domain existed at the time of writing.
The authors also identify a broader problem in dataset creation: inconsistent annotation conventions, opaque collection pipelines, inadequate licensing, and poor extensibility. Prior human-annotated datasets such as ICDAR 2013, WTW, and TabRecSet lack metadata that would allow others to replicate or extend the annotation process. CISOL is designed from the outset to address these documentation gaps by following the data development lifecycle described by Hutchinson et al., the datasheet framework of Gebru et al., and the FAIR principles (Findability, Accessibility, Interoperability, Reusability).
What is the novelty?
The novelty is the CISOL dataset itself and the process used to create it. Key properties that distinguish it from prior human-annotated TSR datasets include:
- Domain specificity: all documents come from real construction industry projects, a domain not previously represented in TSR benchmarks.
- Anonymized real-world data: raw documents from 10 structural engineering firms (24 projects, creation dates 2015-2023) were anonymized via an automated pass (replacing names, locations, and personal data with length-preserving substitutions) followed by manual review. This allowed real industry data to be licensed under CC-BY-4.0.
- Rich metadata: each document image is tagged with table size (XS/S/M/L by cell count) and table type (open/closed/mixed by separator line usage), plus provenance metadata. This metadata enables stratified sampling and is available to downstream researchers.
- Extensible annotation pipeline: the annotation guideline and the full pipeline (using the publicly available CVAT tool) are released, making it straightforward for others to add new documents or sub-domains without writing a new guideline from scratch.
- Physical structure focus: the dataset annotates physical structure (column, row, spanning cell, and header bounding boxes) rather than logical structure. This design choice makes downstream heuristic inference of logical structure simpler, supports embedded table recognition extensions, and reduces the need for speculative inference during annotation.
- Annotation quality measurement and convergence threshold: 20 images were re-annotated by all four annotators. Krippendorff’s $\alpha$ ($K\text{-}\alpha$) was computed at multiple IoU thresholds and converted to mAP to yield a label convergence threshold, i.e., a theoretical ceiling on model performance given the annotation variance. The authors interpret $K\text{-}\alpha > 0.8$ as reliable agreement and $K\text{-}\alpha > 0.667$ as moderate agreement; CISOL achieves the former at IoU 0.5 and the latter at the aggregate $0.5\text{:}0.95\text{:}0.05$ range. The conversion from $K\text{-}\alpha$ to equivalent mAP follows the formula from Tschirschwitz and Rodehorst (WACV 2025). The convergence threshold sits above the best benchmark result, indicating the problem is not yet solved.
What experiments were performed?
The paper benchmarks two models on CISOL: YOLOv8 (a general-purpose object detector) and TATR (Table Transformer, a TSR-specific model). Both models are trained on the CISOL training split and evaluated on the test split using the COCO mAP metric at IoU thresholds $0.5\text{:}0.95\text{:}0.05$, the same metric used in the PubTabNet ICDAR 2021 Challenge Track A.
The primary reported result is:
| Model | mAP@0.5:0.95:0.05 |
|---|---|
| TATR | not reported in paper (below YOLOv8) |
| YOLOv8 | 67.22 |
| Convergence threshold | above 67.22 |
The paper reports that YOLOv8 outperforms TATR on this dataset and attributes this to TSR-specific models generalizing poorly to new domains, consistent with a prior reproducibility study by Ajayi et al. Exact per-model and per-class numbers are not included in the paper; the authors direct readers to the public evaluation server at EvalAI for the current best results.
Annotation consistency is evaluated as a separate experiment. The four annotators each labeled the same 20 randomly selected images, and $K\text{-}\alpha$ was computed at IoU thresholds $0.5$, $0.75$, and $0.5\text{:}0.95\text{:}0.05$. The $K\text{-}\alpha$ values are converted to equivalent mAP values and plotted alongside benchmark results.
Dataset statistics are also characterized and compared to seven existing human-annotated TSR datasets (ICDAR 2013, ICDAR 2019 Track-B2, UNLV, UW3, TabRecSet, TUCD, WTW) on instance count, image count, class distribution, and annotation density. CISOL falls in the medium-size range and is distinguished by its larger average table size (more rows and columns per table) and the provision of metadata.
What are the outcomes/conclusions?
The CISOL dataset provides a functional benchmark for TSR in a domain not previously covered. Key findings reported by the authors include:
- YOLOv8 achieves 67.22 mAP@0.5:0.95:0.05 on CISOL, outperforming TATR despite TATR being designed specifically for TSR. Results suggest that domain shift remains a significant challenge for TSR-specific models, and that general-purpose detectors can be competitive on out-of-distribution data.
- Annotation quality is moderate to high: $K\text{-}\alpha > 0.8$ at IoU 0.5 and $K\text{-}\alpha > 0.667$ at the aggregate IoU range. This indicates that the annotation guideline was sufficiently clear and that the labeling process was reproducible.
- The best benchmarked model remains below the convergence threshold, leaving room for further model improvement specific to this domain.
- Embedded tables are present in the raw data but were intentionally excluded from annotations in this release. They are identified as a prime extension target.
The authors identify several directions for future work: annotating embedded tables, extending to logical structure and cell content, and expanding to other sub-domains within civil engineering or other industries.
Limitations:
- The dataset covers a single narrow sub-domain (German-language steel ordering lists from 10 firms). Generalization beyond this sub-domain is unclear.
- Exact benchmarking numbers for TATR and per-class results are not in the paper; they require querying the external evaluation server.
- The dataset is openly licensed under CC-BY-4.0, which permits commercial use with attribution, but the underlying documents reflect a single geographic and industry context, limiting diversity.
- No model weights or training configurations are released for the benchmarked models.
Reproducibility
Models
- YOLOv8: general-purpose object detector (Ultralytics YOLO, January 2023 release). No architectural modifications reported. No weights released by the CISOL authors.
- TATR (Table Transformer): DETR-based TSR model from Smock et al. (CVPR 2022), pre-trained on PubTables-1M. Fine-tuned on CISOL. No fine-tuned weights released.
Algorithms
- Training details for the benchmarked models are not described in the paper beyond model selection. Hyperparameters, optimizer, learning rate, batch size, and epoch count are not reported.
- Annotation was performed using CVAT (Computer Vision Annotation Tool, CVAT.ai), a publicly available open-source tool.
- The annotation guideline is released alongside the dataset on Zenodo.
- Train/validation/test splitting used stratified sampling primarily on data origin (company), with size and type tags used to verify balance across subsets rather than as hard stratification criteria.
Data
- Total collected: 3,288 document images from 24 reinforced concrete projects at 10 German structural engineering firms (2015-2023). Documents are scanned and in German.
- Annotated subset: 844 images, selected by stratified sampling with a preference for companies with fewer than 100 documents to ensure provenance diversity.
- Annotation format: COCO-JSON.
- Classes: table (for TD), column, row, spanning cell, header (for TSR).
- Instance count: over 120,000 annotated instances across both tasks.
- Splits: training, validation, and test sets. Stratified sampling on data origin (company) ensures provenance diversity; size and type distributions are verified to be consistent across splits.
- License: CC-BY-4.0.
- Zenodo release contents (five files, 772.5 MB total): annotation guidelines, tagged metadata, TD+TSR annotated dataset, TSR-only annotated dataset, and unlabeled images (full 3,288-image collection).
- Anonymization: automated pattern matching for names, locations, and personal data (length-preserving substitution), followed by manual review. Eight images removed for containing only cover page content.
- Public availability (DOI): https://doi.org/10.5281/zenodo.10829550
Evaluation
- Primary metric: COCO mAP at IoU $0.5\text{:}0.95\text{:}0.05$, the same metric used in the PubTabNet ICDAR 2021 Challenge.
- Annotation quality metric: Krippendorff’s $\alpha$ computed at IoU thresholds $0.5$, $0.75$, and $0.5\text{:}0.95\text{:}0.05$ on 20 re-annotated images.
- Convergence threshold: $K\text{-}\alpha$ values are converted to mAP to yield a theoretical performance ceiling. The formula follows Tschirschwitz and Rodehorst (WACV 2025).
- Evaluation server: https://eval.ai/web/challenges/challenge-page/2257, with held-out test annotations not publicly released.
- No error bars or significance tests are reported for the benchmark results. Number of training runs and seeds are not stated.
Hardware
- Training and evaluation hardware are not reported in the paper.
- No GPU-hour, VRAM, or inference latency figures are provided.
BibTeX
@InProceedings{Tschirschwitz_2025_WACV,
author = {Tschirschwitz, David and Rodehorst, Volker},
title = {CISOL: An Open and Extensible Dataset for Table Structure Recognition in the Construction Industry},
booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
month = {February},
year = {2025},
pages = {7594-7602}
}
TFLOP: Table Structure Recognition Framework with Layout Pointer Mechanism
TL;DR
TFLOP replaces the conventional dual-decoder TSR pipeline (predict cell bounding boxes, then match them to OCR text regions) with a layout pointer mechanism: text region bounding boxes are encoded as context from the start, and the decoder directly associates each predicted table data tag with the corresponding text region. An additional span-aware contrastive supervision improves recognition of tables with row and column spans. On PubTabNet, FinTabNet, and SynthTabNet, TFLOP achieves the highest reported TEDS scores at the time of submission, and the authors demonstrate practical utility on watermarked and non-English industrial documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is the TFLOP architecture: the layout pointer mechanism and span-aware contrastive supervision. The paper devotes most pages to the model design, ablations, and benchmark comparisons.
Secondary: None. No dataset is released (the Korean evaluation set is small and kept in the supplementary; the watermark dataset is synthesized from FinTabNet and not distributed).
What is the motivation?
Recent image-to-text TSR systems follow a dual-decoder pattern: one decoder generates HTML or OTSL structure tokens, and a second decoder predicts cell bounding boxes conditioned on those tokens. The bounding boxes are then matched to text regions obtained from an OCR engine or PDF parser to produce the complete table.
The matching step requires carefully calibrated post-processing heuristics to pair predicted bounding boxes with the correct text regions. When a predicted box does not precisely align with the true cell boundary, the matched text may be wrong or incomplete, degrading the final table structure. The gap between TEDS-Struct (structure only) and TEDS (structure + content) scores in prior work is a direct indicator of this misalignment problem.
TFLOP’s premise is that text region bounding boxes are already available (from OCR or PDF parsing) before inference. Rather than predicting those boxes and then matching them, the model can treat the available boxes as input context and directly point to them from the generated structure tokens, skipping the matching stage entirely.
What is the novelty?
Overall Architecture
TFLOP comprises four modules operating on an input table image and its associated text region bounding boxes:
- Image Encoder (Swin Transformer): Produces visual patch features $\{z_i\}_{i=1}^{P} \in \mathbb{R}^{d}$.
- Layout Encoder (MLP): Embeds each text region bounding box combined with $2 \times 2$ RoI Align features from $\{z_i\}$ into layout embeddings $\{l_j\}_{j=1}^{B} \in \mathbb{R}^{d}$.
- Logical Structure Decoder (BART): Auto-regressively generates a sequence of $T$ table tags $\{y_k\}_{k=1}^{T} \in \mathbb{R}^{v}$, conditioned on both visual features (cross-attention) and layout embeddings (context prompt). Tags are in OTSL format, which has a 1-to-1 mapping with HTML.
- Layout Pointer: Associates the decoder’s last hidden states with the text region bounding boxes to resolve which text belongs to which cell.
Feature dimension $d = 1{,}024$; input resolution $768 \times 768$; output sequence length $N = 1{,}376$.
Layout Pointer Mechanism
The decoder’s final hidden states $\{h_i\}_{i=1}^{N}$ are split into bounding box features $\{b_j\}_{j=1}^{B}$ and tag features $\{t_k\}_{k=1}^{T}$. Both are linearly projected:
$$ b_j = \text{proj}_b(b_j), \quad t_k = \text{proj}_t(t_k) $$
For bounding boxes that are not empty, the pointer loss is an InfoNCE-style objective over all data tag positions in set $D$:
$$ L_{\text{ptr}} = -\frac{1}{B} \sum_{j=1}^{B} \log \frac{\exp(b_j \cdot t_{k=j} / \tau)}{\sum_{k' \in D} \exp(b_j \cdot t_{k'} / \tau)} $$
where $k=j$ is the index of the data tag corresponding to the $j$-th bounding box and $\tau = 0.1$ is the temperature.
Empty data cells (no corresponding bounding box) receive a separate BCE loss using a special learnable embedding $b_0$:
$$ L_{\text{ptr}}^{\text{empty}} = \frac{1}{|D|} \sum_{k' \in D} \text{BCE}\left(\sigma(b_0 \cdot t_{k'}),\; I(k')\right) $$
where $I(k')$ is 1 if data tag $k'$ has no associated bounding box.
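A compact sketch of both objectives, assuming the projections have already been applied and the box-to-tag pairing indices are precomputed (function names and argument layout are illustrative, not the authors' API):

```python
import torch.nn.functional as F

def layout_pointer_loss(b, t, box_to_tag, tau: float = 0.1):
    # b: (B, d) projected box features; t: (T, d) projected features of the
    # data-tag positions in D; box_to_tag[j] indexes the data tag paired
    # with box j. Row-wise cross-entropy realizes the InfoNCE form of L_ptr.
    logits = b @ t.T / tau
    return F.cross_entropy(logits, box_to_tag)

def empty_cell_loss(b0, t, is_empty):
    # BCE for empty data cells against the learnable embedding b0: (d,);
    # is_empty: (T,) floats with 1 where a data tag has no bounding box.
    # The sigmoid in the paper's formula is folded into the logits variant.
    return F.binary_cross_entropy_with_logits(t @ b0, is_empty)
```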
Span-aware Contrastive Supervision
To improve recognition of tables with row or column spans, TFLOP applies contrastive supervision across bounding box embeddings. For the $j$-th bounding box projected to $\hat{b}_j = \text{proj}_s(b_j)$:
$$ L_{\text{contr},j} = -\sum_{p \in P(j)} \frac{c_p(j)}{|P(j)|} \log \frac{\exp(\hat{b}_j \cdot \hat{b}_p / \tau)}{\sum_{a \in A(j)} \exp(\hat{b}_j \cdot \hat{b}_a / \tau)} $$
where $A(j)$ is all bounding boxes except $j$, and $P(j)$ is the set of positive samples (same row or column as $j$). The span coefficient $c_p(j)$ weighs each positive pair by the degree of span overlap:
$$ c_p(j) = \frac{(\text{overlap}(p, j))^2}{\text{span}(p) \cdot \text{span}(j)} $$
$\text{span}()$ is the row or column span count; $\text{overlap}(p,j)$ is the number of rows or columns shared between $p$ and $j$. This formulation assigns higher weight to cells that overlap more completely, and lower weight to partially overlapping cells in multi-span scenarios.
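One plausible vectorized reading of this loss, applied once for rows and once for columns per the overall objective; the masking and batching details are assumptions:

```python
import torch

def span_aware_contrastive(bhat, pos_mask, coeff, tau: float = 0.1):
    # bhat: (B, d) span-projected box embeddings proj_s(b_j).
    # pos_mask: (B, B) bool, True where p is in P(j) (shares a row/column
    # with j); the diagonal must be False. coeff: (B, B) span coefficients
    # c_p(j) = overlap(p, j)^2 / (span(p) * span(j)).
    sim = bhat @ bhat.T / tau
    eye = torch.eye(len(bhat), dtype=torch.bool, device=bhat.device)
    log_prob = sim.masked_fill(eye, float("-inf")).log_softmax(dim=1)  # A(j) excludes j
    log_prob = log_prob.masked_fill(~pos_mask, 0.0)                    # keep P(j) only
    n_pos = pos_mask.sum(dim=1).clamp(min=1)                           # |P(j)|
    return -((coeff * log_prob).sum(dim=1) / n_pos).mean()
```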
Overall Loss
The training objective combines five terms:
$$ L = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{ptr}} + \lambda_3 L_{\text{ptr}}^{\text{empty}} + \lambda_4 L_{\text{contr}}^{\text{row}} + \lambda_5 L_{\text{contr}}^{\text{col}} $$
where $\lambda_1 = \lambda_2 = \lambda_3 = 1$ and $\lambda_4 = \lambda_5 = 0.5$. $L_{\text{cls}}$ is the cross-entropy loss for tag classification.
What experiments were performed?
Datasets
- PubTabNet: 500,777 train / 9,115 validation / 9,064 test. Tables from scientific articles; cell-level annotations available for train and val. For the test set, text regions were obtained using PSENet + MASTER (same OCR pipeline as prior work).
- FinTabNet: 112,887 tables from financial reports; cell-level annotations provided via PDF parsing (no OCR noise).
- SynthTabNet: 600,000 synthetic tables with diverse styles; cell-level annotations provided.
Baselines
PubTabNet: TableMaster, LGPMA, TableFormer, VAST, RobusTabNet, DRCC. FinTabNet and SynthTabNet: TableFormer, GridFormer, VAST, DRCC.
Ablation (Table 3)
Four configurations on PubTabNet test TEDS (%):
| Config | Simple | Complex | All |
|---|---|---|---|
| TFLOP_BASE | 97.92 | 94.85 | 96.42 |
| + Image ROI (I) | +0.04 | +0.14 | +0.08 |
| + I + Uniform contrastive (U) | +0.01 | +0.12 | +0.06 |
| + I + Span-aware contrastive (S) = TFLOP_FULL | +0.14 | +0.35 | +0.24 |
The gain from span-aware over uniform contrastive is concentrated in complex tables (+0.35 vs +0.12 for complex), supporting the claimed benefit for spanning cells.
Industrial Experiments
Watermark TSR: A watermarked FinTabNet set was synthesized by inpainting 20 candidate texts per image (20% probability per box). TFLOP extended with a 2-layer MLP filter (Layout Filter) achieves 99.54% TEDS-S / 99.41% TEDS vs. TableMaster’s 82.18% / 72.83%.
Cross-lingual TSR: 30 manually annotated Korean financial report tables (15 simple, 15 complex) and 175 QA pairs. TFLOP trained on English PubTabNet generalizes to Korean: 95.76% / 89.41% simple/complex TEDS vs. GPT-4V 79.43% / 68.39% and TableMaster 89.96% / 83.94%.
Implementation
- Image encoder: Swin Transformer.
- Structure decoder: BART (configured similarly to Donut).
- Input resolution: $768 \times 768$.
- Output sequence length: $N = 1{,}376$; bounding box context length $B = 640$ for PubTabNet/FinTabNet, $864$ for SynthTabNet.
- Optimizer: AdamW (assumed; not stated explicitly), learning rate $8 \times 10^{-5}$ with cosine scheduling over 250K steps.
- Hardware: 4 Nvidia A100 GPUs.
What are the outcomes/conclusions?
PubTabNet (val): TFLOP_FULL reaches 98.3% TEDS-S and 98.0% TEDS; DRCC scores higher on TEDS-S (98.9%) but lower on TEDS (97.8%). Note: DRCC reports no TEDS on the test set.
PubTabNet (test): 98.38% TEDS-S, 96.66% TEDS. The TEDS-Struct to TEDS gap for TFLOP_FULL (0.11 on FinTabNet) is substantially smaller than for prior dual-decoder methods (e.g., 0.42 for VAST), directly reflecting the reduction in bounding box misalignment.
FinTabNet: 99.56% TEDS-S, 99.45% TEDS, ahead of VAST (98.63% TEDS-S, 98.21% TEDS).
SynthTabNet: 99.42% TEDS-S, 99.40% TEDS, ahead of DRCC (98.70% TEDS-S).
Limitations acknowledged: The Korean TSR evaluation uses only 30 tables (15 simple, 15 complex), which is too small for reliable statistical conclusions. The watermark dataset is synthesized; performance on natural watermarks may differ. TFLOP requires text region bounding boxes as input, which means it is not a fully end-to-end system and depends on an upstream OCR pipeline. For datasets without cell-level annotations (like PubTabNet test), OCR quality affects downstream TEDS scores.
Unacknowledged gaps: No comparison with non-English training baselines; the Korean experiment only tests zero-shot transfer. No latency or throughput measurements. The optimizer is not stated in the main paper. No error bars or multi-seed runs.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | CC-BY-4.0 | arXiv 2501.11800 |
| Published (IJCAI 2024) | Paper | Unknown | DOI 10.24963/ijcai.2024/105 |
| TFLOP Code | Code | CC-BY-NC-4.0 | GitHub |
| Model weights | Model | Unknown | Not confirmed released |
| PubTabNet | Dataset | CDLA-Permissive-1.0 | GitHub |
| FinTabNet | Dataset | CDLA-Permissive-1.0 | IBM Research |
| SynthTabNet | Dataset | CDLA-Permissive-1.0 | GitHub |
Models
- Image encoder: Swin Transformer (configuration matches Donut; exact size not specified in paper).
- Structure decoder: BART architecture; feature dimension $d = 1{,}024$.
- Layout encoder: MLP with $2 \times 2$ RoI Align; output dimension $d = 1{,}024$.
- Layout pointer: two linear projection heads for $b_j$ and $t_k$; one additional projection $\text{proj}_s$ for contrastive.
- Watermark extension: 2-layer MLP (Layout Filter) with BCE and sigmoid.
- Model weights: the GitHub repository exists (license: CC-BY-NC-4.0); weight availability should be checked directly.
Algorithms
- Learning rate: $8 \times 10^{-5}$ with cosine scheduling.
- Training steps: 250K.
- Hardware: 4 Nvidia A100 GPUs.
- Input resolution: $768 \times 768$.
- $N = 1{,}376$ (max sequence length); $B = 640$ (PubTabNet/FinTabNet), $B = 864$ (SynthTabNet).
- Temperature $\tau = 0.1$ for both pointer and contrastive losses.
- Loss weights: $\lambda_1 = \lambda_2 = \lambda_3 = 1$; $\lambda_4 = \lambda_5 = 0.5$.
- OTSL tokenization used (from Lysak et al., 2023; arXiv:2305.03393).
- Optimizer not explicitly stated in the paper text.
Data
- PubTabNet: 500,777 train / 9,115 val / 9,064 test; license CDLA-Permissive-1.0. Test set has no cell-level annotations; OCR (PSENet + MASTER) used.
- FinTabNet: 112,887 tables; PDF-parsed cell annotations; license CDLA-Permissive-1.0.
- SynthTabNet: 600,000 synthetic tables; cell-level annotations; license CDLA-Permissive-1.0.
- Korean tables: 30 manually annotated tables + 175 QA pairs; not released.
- Watermark dataset: synthesized from FinTabNet (not released); synthesis procedure described in Appendix B.
Evaluation
- TEDS and TEDS-Struct as primary metrics (tree-edit-distance similarity on HTML trees).
- PubTabNet: both val and test splits reported.
- Ablation on PubTabNet test TEDS.
- Watermark evaluation: custom synthesized dataset.
- Korean evaluation: 30 tables; tiny set, no confidence intervals.
- No error bars, significance tests, or multi-seed runs.
Hardware
- Training: 4 Nvidia A100 GPUs.
- Training time: 250K steps; approximate GPU-hours not reported.
- Inference latency and memory requirements not reported.
- No deployment or cost estimates.
BibTeX
@inproceedings{khang2024tflop,
title={TFLOP: Table Structure Recognition Framework with Layout Pointer Mechanism},
author={Khang, Minsoo and Hong, Teakgyu},
booktitle={Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence},
pages={947--955},
year={2024},
doi={10.24963/ijcai.2024/105}
}
TabSniper: Table Detection and Structure Recognition for Bank Statements
TL;DR
TabSniper is an end-to-end bank statement processing pipeline from American Express that chains table detection and categorization (TDC) with table structure recognition (TSR), both built on DETR fine-tuned from PubTables-1M. The accompanying BankTabNet dataset provides 11,607 page-level and 5,165 table-level annotations from real bank statements. The primary contribution is a deployed workflow for extracting transactions from diverse bank templates, with CIoU loss substitution and a long-table split-merge strategy as the key model-level changes.
What kind of paper is this?
Dominant: $\Psi_{\text{Impact}}$ (translational/operational). The headline contribution is a validated end-to-end pipeline for a specific industrial use case: extracting credit and debit transactions from bank statements for credit underwriting. The paper describes the full workflow from raw PDF pages to a structured JSON of transactions, includes a post-processing checksum (balance reconciliation), reports CPU inference time as a primary metric, and is explicitly motivated by real-time deployment constraints. The “win” is the validated operational outcome, not a new modeling paradigm.
Secondary: $\Psi_{\text{Method}}$ (the CIoU loss substitution and split-merge strategy for long tables are genuine, if modest, algorithmic contributions) and $\Psi_{\text{Resource}}$ (BankTabNet is a purpose-built annotated dataset for this domain).
What is the motivation?
Bank statement processing is a core task in credit underwriting: lenders need to assess cash flows, spending patterns, and transaction histories before making lending decisions. Unlike structured financial reports, bank statements vary substantially across issuers in layout, template, and table structure. Scanned and digitally generated PDFs coexist, and tables routinely span multiple pages with densely packed multi-line rows.
Standard TSR datasets (PubTabNet, FinTabNet, PubTables-1M) are derived from scientific articles and corporate financial reports. The authors argue these do not capture the specific characteristics of bank statement tables: long tables (often more than 20 rows) with variable intra-cell text spacing that misleads column boundary detection, multiple table categories on a single page (credit, debit, check, transaction balance, summary), and the need to categorize table types for downstream processing.
No existing public dataset addressed bank statement TSR specifically, and no end-to-end system had demonstrated reliable transaction extraction across multiple bank templates. The paper is motivated by this operational gap.
What is the novelty?
The novelty is concentrated in three areas:
1. End-to-end bank statement pipeline. TabSniper chains TDC and TSR into a single workflow designed for real-time CPU deployment. A one-time OCR call is shared across both stages, and a postprocessing layer sorts tables by page order, matches column headers to synonyms for standard transaction fields (Date, Amount, Debit, Credit, Balance), and computes a balance checksum:
$$ \text{Checksum} = \text{OpenBal} - \sum_{t \in \text{Debit}} t + \sum_{t \in \text{Credit}} t - \text{EndBal} $$
A zero checksum confirms that all transactions were extracted for a given statement.
2. CIoU loss for TSR. The baseline DETR uses GIoU loss. The authors identify that GIoU does not account for overlapping area, center distance, or aspect ratio, which causes false positive row bounding boxes in densely packed bank tables. They substitute CIoU, which incorporates a normalized center-distance penalty (DIoU) plus an aspect-ratio consistency term (a code sketch follows this list):
$$ L_{\text{CIoU}} = L_{\text{DIoU}} + \alpha \cdot \upsilon $$
$$ \upsilon = \frac{4}{\pi^2} \left(\arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h}\right)^2, \quad \alpha = \frac{\upsilon}{(1 - \text{IoU}) + \upsilon} $$
The full TSR loss becomes:
$$ L_{\text{TSR}} = \sum_{n=1}^{N} \lambda_{ce} \cdot L_{CE} + \lambda_{l1} \cdot L_{L1} + \lambda_{ciou} \cdot L_{\text{CIoU}} $$
3. Long-table split-merge. Tables with more than 20 rows are split horizontally into two sub-images before passing to the TSR model (which uses a fixed query count of $N = 125$). Predictions are merged at postprocessing time. This addresses the attention drift and missed detections that arise when the fixed-capacity transformer processes tables that are longer than those in its training distribution.
4. Multi-modal TDC. The TDC stage runs a DETR vision model first, then refines Credit/Debit categorization using a text-based Multinomial Naive Bayes classifier trained on table caption and header text. Because caption text is outside the table bounding box (and therefore invisible to the vision model), this two-stage multi-modal approach separates detection from categorization.
5. BankTabNet dataset. The authors introduce two new annotated datasets sourced from real, PII-masked bank statements: a TDC dataset (11,607 page images, 10 table categories) and a TSR dataset (5,165 table images, 5 object classes). Inter-annotator agreement is measured with Krippendorff’s Alpha ($\alpha \geq 0.955$ at $\text{IoU} > 0.5$ for TDC; $\alpha \geq 0.99$ for TSR).
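As flagged in item 2, here is a minimal PyTorch sketch of the CIoU aspect-ratio penalty; the standard DIoU center-distance term is omitted, and elementwise tensor inputs are an assumption:

```python
import math
import torch

def ciou_aspect_penalty(w_gt, h_gt, w, h, iou):
    # Aspect-ratio consistency term of CIoU: v measures disagreement
    # between ground-truth and predicted aspect ratios via arctan, and
    # alpha adaptively weights it against IoU, so L_CIoU = L_DIoU + alpha * v.
    v = (4.0 / math.pi**2) * (torch.atan(w_gt / h_gt) - torch.atan(w / h)) ** 2
    alpha = v / ((1.0 - iou) + v)
    return alpha * v
```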
What experiments were performed?
TDC evaluation. TabSniper-TDC is compared against Faster R-CNN, Mask R-CNN, Cascade R-CNN, DiT-B (Cascade), and Dynamic Head, all evaluated on the BankTabNet TDC test split. Metrics are $AP$, $AP_{50}$, and $AP_{75}$ (COCO-style). CPU inference time on Apple M1 Max (32 GB) is also reported.
| Model | AP | $AP_{50}$ | $AP_{75}$ | CPU time (s/img) |
|---|---|---|---|---|
| Faster R-CNN | 83.50 | 94.55 | 90.25 | 1.71 |
| Mask R-CNN | 83.84 | 95.14 | 90.13 | 1.72 |
| Cascade R-CNN | 84.08 | 93.07 | 89.18 | 2.34 |
| TabSniper-TDC | 85.25 | 93.91 | 90.69 | 1.25 |
| DiT-B (Cascade) | 87.96 | 95.73 | 92.57 | 8.85 |
| Dynamic Head | 89.04 | 97.39 | 94.43 | N/A (GPU only) |
TabSniper-TDC sits below DiT and Dynamic Head on AP but delivers the lowest CPU inference time among all models that support CPU inference.
TSR evaluation with ablations. Three ablations are stacked incrementally on BankTabNet TSR test:
| Model / Ablation | $AP_{50}$ | $AP_{75}$ | AP | AR |
|---|---|---|---|---|
| TabStructNet | 79.2 | 71.0 | 65.4 | 73.6 |
| LGPMA | 84.8 | 76.1 | 70.0 | 78.8 |
| Table-Transformer (Base) | 85.3 | 76.5 | 70.4 | 79.3 |
| + Split-Merge Long Tables | 92.6 | 80.8 | 72.5 | 81.2 |
| + Padding Variations | 94.6 | 89.8 | 80.2 | 87.2 |
| + Complete IoU Loss | 94.8 | 91.5 | 83.1 | 90.6 |
Each ablation contributes meaningfully, with padding variations delivering the largest single jump in $AP_{75}$ (from 80.8 to 89.8).
TSR on external public datasets. TabSniper-TSR is trained from scratch on PubTables-1M and FinTabNet respectively and compared to Table Transformer:
| Dataset | Model | $AP_{50}$ | $AP_{75}$ | AP | AR |
|---|---|---|---|---|---|
| PubTables-1M | Table-Transformer | 96.3 | 92.3 | 84.4 | 89.3 |
| PubTables-1M | TabSniper-TSR | 96.2 | 93.8 | 89.6 | 93.3 |
| FinTabNet | Table-Transformer | 97.4 | 94.2 | 88.8 | 93.1 |
| FinTabNet | TabSniper-TSR | 96.0 | 93.0 | 89.1 | 93.4 |
TabSniper shows modest gains on AP and AR (localization over a range of IoU thresholds) while being slightly below baseline on $AP_{50}$ for FinTabNet.
Text classifier evaluation. Three Multinomial Naive Bayes classifiers (Header NB, Caption NB, Header_Caption NB) are evaluated on per-category F1. Header_Caption NB achieves the best results across all categories.
What are the outcomes/conclusions?
The system demonstrates that a DETR-based pipeline, adapted with domain-specific techniques (long-table split-merge, CIoU loss, padding augmentation, multi-modal TDC), can reliably extract transactions from diverse bank templates at CPU inference speeds suitable for real-time processing (1.91 minutes for a 20-page statement on Apple M1 Max).
The balance checksum mechanism provides a form of deterministic validation: if the extracted debits and credits reconcile to the opening and closing balances, the extraction is likely complete.
On the internal BankTabNet TSR test set, TabSniper achieves AP 83.1 and AR 90.6, substantially above all baselines. On public datasets, it is competitive with Table Transformer, particularly on AP and AR, suggesting the model modifications generalize modestly beyond the bank statement domain.
Limitations acknowledged by the authors:
- Tilted tables (from mobile phone captures) cause bounding box misalignment and partial OCR capture. The authors note this could be addressed with document angle correction, which they plan for future versions.
- The paper does not report end-to-end transaction extraction accuracy (e.g., percentage of statements with checksum zero, false transaction rates), limiting assessment of the pipeline’s actual production performance.
Limitations not acknowledged:
- BankTabNet is proprietary (AmEx internal, PII-masked). The dataset is described but not released, making the bank-statement-specific results unreproducible for external researchers.
- No code or model weights are released. All components (DETR fine-tune, Naive Bayes text classifier, postprocessing heuristics) are described but not available.
- Evaluation uses AP/AR rather than TEDS or GriTS, making direct comparison to most TSR literature difficult.
- The DOI in the paper (https://doi.org/XXXXXXX.XXXXXXX) is a placeholder; the actual ACM DOI was not available at the time of arXiv submission.
- The paper relies on an in-house OCR service and an in-house statement summary extraction service. These dependencies are opaque and unavailable to external researchers.
Reproducibility
Models
- TDC: DETR with ResNet-50 backbone, implemented via Detectron2. Pre-trained weights: Table Transformer (PubTables-1M). Fine-tuned on BankTabNet TDC. No weights released.
- TSR: DETR with ResNet-50 backbone, pre-trained on PubTables-1M, fine-tuned on BankTabNet TSR with CIoU loss substitution. Query count $N = 125$. No weights released.
- Text classifier: Three Multinomial Naive Bayes models trained on caption/header text using count vectorization (vocabulary sizes: 840 caption words, 860 header words). Lightweight; reproducible from the paper’s description if data were available.
Algorithms
- TDC training: LR $1 \times 10^{-5}$; 100 epochs; batch size 16; $\lambda_{ce} = 1$, $\lambda_{l1} = 5$, $\lambda_{giou} = 2$. Scale augmentation: shortest side 400-800 px. Optimizer not stated for TDC (Adam stated only for TSR).
- TSR training: LR $5 \times 10^{-5}$ with linear decay every 4 epochs; 100 epochs; batch size 2; $\lambda_{ce} = 1$, $\lambda_{l1} = 5$, $\lambda_{ciou} = 2$. Adam optimizer. Scale augmentation: longest side 1100-1300 px. Inference: $1200 \times 1200$ px.
- Long-table split: tables with more than 20 rows are split into two sub-images horizontally. Merge happens in postprocessing.
- Postprocessing heuristics: NMS, gap filling, addition of missing rows/columns, rule-based row separation from OCR date column.
- Balance checksum: $\text{OpenBal} - \sum_{\text{Debit}} + \sum_{\text{Credit}} - \text{EndBal} = 0$.
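A minimal sketch of that reconciliation check; the rounding tolerance is an assumption, not from the paper:

```python
def statement_reconciles(open_bal, debits, credits, end_bal, tol=0.005):
    # Checksum = OpenBal - sum(debits) + sum(credits) - EndBal; a value of
    # zero (within tolerance) indicates all transactions were extracted.
    checksum = open_bal - sum(debits) + sum(credits) - end_bal
    return abs(checksum) <= tol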
Data
- BankTabNet TDC: 11,607 page images (7,544 train / 1,741 val / 2,322 test). 10 table categories. PII-masked via in-house service. Not publicly available.
- BankTabNet TSR: 5,165 table images from 310 bank statements (9,724 train / 2,000 val / 2,200 test after augmentation). 5 object classes. Not publicly available.
- Annotation quality: Krippendorff’s Alpha $\geq 0.955$ (TDC, $\text{IoU} > 0.5$), $\geq 0.994$ (TDC, $\text{IoU} > 0.9$); $\geq 0.99$ (TSR, $\text{IoU} > 0.5$), $\geq 0.98$ (TSR, $\text{IoU} > 0.9$).
- External datasets used for generalization tests: PubTables-1M (public, CDLA-Perm-2.0 annotations) and FinTabNet (public, CDLA-Permissive-1.0 annotations).
Evaluation
- Primary metrics: $AP$, $AP_{50}$, $AP_{75}$, $AR$ (COCO-style object detection metrics). Text classifier: per-category F1.
- No TEDS, GriTS, or Adjacency F1 reported. Direct comparison to Im2Seq TSR literature is not straightforward.
- Baselines for TSR: TabStructNet, LGPMA, Table Transformer (all trained on BankTabNet TSR). Comparisons on external datasets: Table Transformer trained on the respective dataset.
- No error bars, significance tests, or multi-seed averaging reported.
- The end-to-end transaction extraction accuracy (checksum pass rate) is mentioned conceptually but not quantified in the results tables.
Hardware
- Training: NVIDIA A100 Tensor Core GPU, 40 GB. Number of GPUs not stated.
- Inference: Apple M1 Max CPU (32 GB). End-to-end pipeline for a 20-page bank statement: approximately 1.91 minutes. Per-image TDC inference: 1.25 s (TabSniper-TDC) on this hardware.
- No GPU-hour estimates or energy consumption figures are reported.
BibTeX
@inproceedings{trivedi2024tabsniper,
title={TabSniper: Towards Accurate Table Detection \& Structure Recognition for Bank Statements},
author={Trivedi, Abhishek and Mukherjee, Sourajit and Singh, Rajat Kumar and Agarwal, Vani and Ramakrishnan, Sriranjani and Bhatt, Himanshu Sharad},
booktitle={8th International Conference on Data Science and Management of Data (12th ACM IKDD CODS and 30th COMAD)},
year={2024},
address={Jodhpur, India},
publisher={ACM}
}
OmniDocBench: A Diverse Benchmark for End-to-End Document Parsing
TL;DR
OmniDocBench is a document parsing benchmark from Shanghai AI Laboratory that covers 981 pages across nine document types (academic papers, books, textbooks, magazines, slides, financial reports, newspapers, handwritten notes, and exam papers), with human-verified annotations for layout detection, text, formulas, and tables, including dual HTML+LaTeX table annotations for TEDS-style evaluation. The authors evaluate three classes of systems: pipeline tools, expert VLMs, and general VLMs. Pipeline-based tools generally lead on standard document types while general VLMs generalize better to uncommon formats and degrade less under visual noise. No single method dominates across all nine document types or both language settings (English and Chinese).
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The primary contribution is a new benchmark and evaluation methodology for document parsing. The dataset design, annotation protocol, multi-level evaluation pipeline, and systematic comparison of methods are all in service of a measurement goal: determining how well current document parsing systems handle diverse real-world document types.
Secondary: $\Psi_{\text{Resource}}$. The benchmark itself, including 981 annotated pages, over 100,000 annotations, dual HTML+LaTeX table annotations, and a publicly released evaluation codebase, is a reusable community resource.
What is the motivation?
Document parsing, the task of extracting structured, machine-readable content from PDFs, underpins data pipelines for LLM pretraining and retrieval-augmented generation systems. Two main paradigms have emerged: pipeline-based approaches (layout detection followed by specialized OCR, formula, and table recognition modules) and end-to-end vision-language models. Despite progress on both fronts, fair and comprehensive comparison between systems has been difficult because existing benchmarks carry several well-documented shortcomings.
Document diversity is narrow: most benchmarks focus on academic papers from arXiv, overlooking document types common in real workflows such as textbooks, financial reports, handwritten notes, newspapers, and slides. Evaluation metrics are inconsistent or superficial: benchmarks relying solely on edit distance or BLEU do not account for the syntactic variety of valid LaTeX or HTML table representations, leading to systematic under-scoring of structurally correct but syntactically non-identical predictions. Evaluations are also typically flat, reporting one aggregate score without breakdown by document type, content category, language, or layout complexity.
The combination of narrow document coverage and weak metrics makes it difficult to identify where particular systems actually fail, or to determine whether a reported improvement generalizes beyond the training distribution.
What is the novelty?
OmniDocBench addresses these gaps with three main contributions.
Diverse evaluation corpus. The dataset spans nine document types: academic papers (129 pages), slides/PPT-to-PDF (133 pages), financial reports (81 pages), colorful textbooks (96 pages), exam papers (114 pages), magazines (97 pages), handwritten and typed notes (116 pages), newspapers (111 pages), and books (104 pages). The corpus covers three language settings: English (290 pages), Simplified Chinese (612 pages), and mixed (79 pages), with multiple layout types and three classes of visual degradations (fuzzy scans, watermarks, and colorful backgrounds).
Comprehensive annotations. Each page carries layout bounding boxes for 19 region categories with reading order and caption-footnote affiliation annotations, nine bounding-box-level attribute labels (three text attributes: language, background, rotation; six table attributes: language, frame type, merged cells, colorful background, formula content, rotation), and content recognition annotations in plain text for paragraphs and titles, LaTeX for formulas, and dual HTML plus LaTeX for tables.
The dual HTML+LaTeX annotation for tables is a direct response to the known problem that TEDS-style metrics penalize syntactically different but semantically equivalent table representations.
Multi-level evaluation pipeline. OmniDocBench supports three evaluation granularities:
- End-to-end evaluation: full-page Markdown prediction versus per-element ground truth
- Task-specific evaluation: isolated assessment of layout detection (mAP), OCR (normalized edit distance), table recognition (TEDS), and formula recognition (CDM + ExpRate)
- Attribute-based evaluation: performance broken down by document type, language, background, rotation, table frame type, and special conditions
A key technical contribution is the Adjacency Search Match algorithm, which addresses the paragraph-splitting problem that arises when different parsers apply different line-break and merge conventions. The algorithm computes a matrix of normalized edit distances between ground truth and predicted paragraph sequences, applies fuzzy matching for unmatched pairs, then iteratively merges adjacent paragraphs in either direction until similarity no longer improves.
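A sketch of the merge step under stated assumptions: `difflib`'s ratio stands in for one minus normalized edit distance (OmniDocBench computes edit distance directly), and the greedy bidirectional extension mirrors the described iteration.

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Stand-in for 1 - normalized edit distance (difflib used for brevity).
    return difflib.SequenceMatcher(None, a, b).ratio()

def adjacency_merge(gt: str, preds: list, start: int):
    # Starting from the predicted paragraph that best matches the ground-truth
    # paragraph `gt`, absorb adjacent predictions in either direction while
    # similarity keeps improving, then return the merged span and its score.
    lo = hi = start
    merged, best = preds[start], similarity(gt, preds[start])
    while True:
        step = None
        for nlo, nhi in ((lo - 1, hi), (lo, hi + 1)):   # extend left or right
            if nlo < 0 or nhi >= len(preds):
                continue
            cand = " ".join(preds[nlo : nhi + 1])
            s = similarity(gt, cand)
            if s > best:
                best, step = s, (nlo, nhi, cand)
        if step is None:
            return merged, best
        lo, hi, merged = step
```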
The CDM (Character Detection Matching) metric is applied for formula evaluation. For tables, TEDS is computed against both HTML and LaTeX annotations and the better of the two scores is retained, reducing the syntactic-equivalence penalty.
What experiments were performed?
Systems evaluated. The authors evaluate three categories of methods:
- Pipeline tools: MinerU v0.9.3, Marker v1.2.3, Mathpix (commercial)
- Expert VLMs: GOT-OCR 2.0, Nougat 0.1.0-base (350M parameters)
- General VLMs: GPT-4o (2024-08-06), Qwen2-VL-72B, InternVL2-Llama3-76B
For general VLMs, deterministic decoding is applied (do_sample=False) with per-model max_token settings (32,000 for Qwen2-VL-72B; 4,096 for InternVL2; default for GPT-4o).
End-to-end evaluation. Each model produces a full-page Markdown prediction. The pipeline extracts structured elements, matches them to ground-truth elements via the adjacency search algorithm, and computes per-category scores for text, formulas, tables, and reading order, reported separately for English and Chinese pages. Selected end-to-end table TEDS results:
| Method | Table TEDS EN | Table TEDS ZH |
|---|---|---|
| MinerU | 78.6 | 62.1 |
| Mathpix | ~75 | ~63 |
| Qwen2-VL-72B | 73.2 | 75.1 |
| GPT-4o | 71.1 | 58.0 |
| InternVL2-76B | 60.9 | 58.5 |
| GOT-OCR | 51.7 | 46.2 |
| Nougat | 36.2 | 0.3 |
Layout detection. DocLayout-YOLO achieves a mean average precision of 47.38 averaged across nine document types, substantially ahead of LayoutLMv3 (28.84) and DiT-L (26.90). Performance drops markedly on handwritten notes and slides across all models evaluated.
Table recognition (component-level). Six models are evaluated on an isolated table subset using TEDS broken down by language, frame type, and special conditions. OCR-based models (RapidTable: 82.5 overall, PaddleOCR: 73.6) outperform expert VLMs (StructEqTable: 75.8, GOT-OCR: 74.9) and general VLMs (Qwen2-VL-7B: 71.0, InternVL2-8B: 71.5). Table rotation degrades all models substantially, with StructEqTable showing the best relative robustness on rotated tables.
OCR (component-level). PaddleOCR achieves the lowest normalized edit distance on the OCR subset across standard and complex-background conditions. GPT-4o achieves the lowest English edit distance (0.020) but is substantially weaker on Chinese (0.224). Rotation to 270 degrees is challenging for most systems.
Formula recognition. GPT-4o leads on CDM (86.8) and ExpRate@CDM (65.5); UniMERNet achieves the best normalized edit distance (0.238). Mathpix reports high CDM (86.6) but low ExpRate (2.8), consistent with high character-level precision combined with occasional punctuation omissions.
What are the outcomes/conclusions?
Pipeline tools (MinerU, Mathpix) generally lead on structured content tasks for standard document types such as academic papers and financial reports. Their strength comes from specialized downstream models and layout segmentation preprocessing.
General VLMs (Qwen2-VL-72B, GPT-4o) generalize better to uncommon or visually unusual document types such as handwritten notes, slides, and exam papers, and degrade less under visual degradations. The authors attribute this to broader training data covering long-tail scenarios.
VLMs struggle with high-density documents like newspapers, where limitations in input resolution and context length cause content truncation. The failure modes differ qualitatively from pipeline failures: VLMs tend to miss content on dense pages or produce hallucinated content on hard-to-read ones, while pipeline tools tend to misclassify layout regions on uncommon formats.
All systems perform worse on Chinese than English pages. All systems degrade substantially on rotated text and rotated tables. No single method achieves strong performance across all nine document types, both language settings, and all content categories simultaneously. This finding suggests that OmniDocBench captures a meaningfully broader evaluation space than prior benchmarks focused narrowly on academic papers.
Limitations acknowledged by the authors include that the dataset is evaluation-only and not suitable for training, the Chinese-language subset represents a specific subset of Chinese document styles, and some annotation categories such as code blocks are represented by too few examples for reliable evaluation.
Reproducibility
Models
The evaluated systems range widely in openness. MinerU and Marker are open-source pipeline tools. Nougat (350M) and GOT-OCR 2.0 are open expert VLMs with released weights. Qwen2-VL-72B and InternVL2-Llama3-76B are open general VLMs with released weights. Mathpix and GPT-4o are commercial API services with no public weights.
Algorithms
For pipeline tools, default settings are used throughout. For general VLMs, do_sample=False is set for reproducibility. Nougat uses its 0.1.0-base checkpoint; GOT-OCR uses its “format OCR” mode to produce structured output. The Adjacency Search Match algorithm applies normalized edit distance with a similarity threshold, followed by iterative paragraph merging until similarity ceases to improve.
Data
The 981 PDF pages were collected from Common Crawl, Google and Baidu search, and internal data at Shanghai AI Laboratory. Pages were selected by clustering ResNet-50 visual features of 6,000 candidate pages with Faiss into 10 cluster centers, with manual balancing across nine document types and page attributes. Annotation used a three-stage pipeline: automatic pre-annotation with LayoutLMv3 (layout), PaddleOCR (text), UniMERNet (formulas), and GPT-4o (tables); manual annotator correction with rendering verification using Tables Generator and latexlive; and expert quality inspection using CDM rendering to flag unrenderable elements.
The evaluation code is available on GitHub under Apache-2.0. The dataset on Hugging Face carries no formal SPDX license; a copyright statement on the dataset page explicitly restricts use to research purposes and prohibits commercial use. The dataset files include an annotation JSON (OmniDocBench.json), a corresponding images directory, and a PDF version of the evaluation pages added in December 2024.
Evaluation
Primary metrics: normalized edit distance for text and reading order (lower is better); CDM and ExpRate@CDM for formulas (higher is better); TEDS for tables (higher is better, computed against both HTML and LaTeX annotations with the better score taken); mAP for layout detection. Inline formulas are converted to Unicode for cross-model fairness before text evaluation. Ignore handling is applied to reduce noise from inter-system differences in header and footer retention conventions and in caption placement conventions.
Statistical details (number of runs, seeds, confidence intervals) are not reported; single-run results are presented for all methods.
Hardware
Hardware requirements are not reported. The evaluation mixes API calls to commercial services (Mathpix, GPT-4o) with local inference for open-weight models. No GPU hours, memory requirements, or cost estimates are provided.
BibTeX
@inproceedings{ouyang2025omnidocbench,
title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
author={Ouyang, Linke and Qu, Yuan and Zhou, Hongbin and Zhu, Jiawei and Zhang, Rui and Lin, Qunshu and Wang, Bin and Zhao, Zhiyuan and Jiang, Man and Zhao, Xiaomeng and Shi, Jin and Wu, Fan and Chu, Pei and Liu, Minghao and Li, Zhenxiang and Xu, Chao and Zhang, Bo and Shi, Botian and Tu, Zhongying and He, Conghui},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}
SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
TL;DR
SynFinTabs is a dataset of 100,000 synthetic financial table images from Queen’s University Belfast, generated to resemble tables found in UK Companies House filings and related financial documents. Each image ships with ground-truth HTML, JSON, and CSV representations and bounding-box annotations at the table, row, cell, and word levels. The authors also fine-tune LayoutLM on an extractive table question-answering task to produce FinTabQA, demonstrating the dataset’s utility for training information-extraction models.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the primary contribution is the dataset itself. The paper’s center of gravity is the curation pipeline, the characteristics of the 100,000 synthetic tables, their annotation scheme, and the licensing and public release. The headline claim is a reusable asset for the financial document AI community.
Secondary: $\Psi_{\text{Method}}$: the FinTabQA model is a secondary contribution. It demonstrates one practical use of the dataset by fine-tuning LayoutLM for extractive question answering, but the model is not the headline novelty; the dataset is.
What is the motivation?
Table extraction from document images sits at the intersection of computer vision, NLP, and information retrieval, and it remains difficult in part because high-quality labelled data is scarce outside the scientific domain. Most existing table datasets (ICDAR 2013, SciTSR, TableBank, PubTabNet, PubTables-1M) draw heavily from scientific articles because large repositories of such articles are available with source markup that can be used to construct annotations automatically. Financial tables differ from scientific tables in layout, typography, and structural conventions: they tend to be larger, use less dense borders, include currency units, section headings, note columns, and parenthesized negative values, and span a wider range of date formats and font choices.
A second gap is the absence of word-level positional ground truth. Many datasets rely on OCR to recover the spatial positions of words after the fact. The authors demonstrate that OCR is unreliable on tabular data: using default EasyOCR parameters on SynFinTabs test images yields approximately 75% exact-match accuracy, compared to near-perfect accuracy when ground-truth positions are used. For training layout language models such as the LayoutLM family, which consume 2D bounding-box coordinates as part of their input, accurate positional annotations are important.
Privacy constraints compound the data-scarcity problem. Businesses are often reluctant to share financial documents for model training, particularly when those documents contain sensitive information and training would involve a third-party service provider. A synthetic dataset avoids this constraint by design.
What is the novelty?
SynFinTabs makes two interconnected contributions. First, it introduces a publicly available dataset of 100,000 synthetic financial table images with:
- Ground-truth structure in three formats: HTML, JSON, and CSV.
- Bounding-box annotations at four levels of granularity: full table, row, cell, and word. Cell bounding boxes represent the full spatially meaningful cell region, not just the minimum pixel region containing text, which is a noted limitation of FinTabNet.
- Semantic cell-type labels: “section title”, “currency unit”, “row header”, “column header”, and “data”.
- A predefined question-answer pair per table, with start and end span positions stored explicitly rather than derived from OCR.
Second, it releases the dataset generation code (SynFinTabGen), enabling others to extend the pipeline to new domains or larger scales. The generation process proceeds as: specification (structural blueprint) $\rightarrow$ table object (rows, cells, words) $\rightarrow$ HTML document $\rightarrow$ headless browser rendering $\rightarrow$ screenshot image plus bounding-box extraction via the browser DOM. This approach sidesteps OCR entirely for annotation and produces pixel-accurate bounding boxes.
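The pipeline's final step, reading pixel-accurate boxes out of the browser DOM, is the part that sidesteps OCR. The sketch below uses Playwright as a stand-in for whatever headless-browser tooling SynFinTabGen actually uses; the HTML snippet, viewport, and selectors are illustrative assumptions.

```python
from playwright.sync_api import sync_playwright

# Render a generated HTML table and harvest cell boxes from the browser DOM,
# in the spirit of SynFinTabGen. HTML, viewport, and selectors are assumptions.
html = "<html><body><table><tr><td>Turnover</td><td>1,200</td></tr></table></body></html>"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page(viewport={"width": 800, "height": 600})
    page.set_content(html)

    annotations = []
    for cell in page.query_selector_all("td"):
        box = cell.bounding_box()  # {'x', 'y', 'width', 'height'} in CSS pixels
        annotations.append({
            "text": cell.inner_text(),
            "bbox": [box["x"], box["y"],
                     box["x"] + box["width"], box["y"] + box["height"]],
        })

    page.screenshot(path="table.png")  # the image the boxes refer to
    browser.close()

print(annotations)
```

Because the boxes come from the same DOM that produces the screenshot, no OCR step is needed and the annotations are pixel-accurate by construction.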
Dataset composition. The 100,000 tables are spread across six visual themes. Theme 0 (40%) replicates the style of financial statements filed with UK Companies House, designed by inspecting a large sample of filings. The remaining five themes (12% each) cover financial-spreadsheet styles plus one company-report style (a styled annual report). An 80/10/10 train/validation/test split is applied with each theme represented proportionally. Textual cell content uses random English words from a 10,000-word vocabulary; numerical cells contain random numbers. Row headers draw from a list of real financial account names.
What experiments were performed?
The experiments center on extractive table question answering. Each table in SynFinTabs has question-answer pairs for every non-empty cell; one pair per table is designated as the “competition pair” used for training or evaluation depending on the split. Questions take the form “What is the value of [row header] for the year [column header]?” and the target answer is the cell value at that intersection.
FinTabQA training. LayoutLM base (113 million parameters) is fine-tuned on SynFinTabs using a batch size of 2 and a learning rate found via PyTorch Lightning’s Tuner (initial max $3 \times 10^{-5}$). Two model variants are trained:
- FinTabQA: input is the table image cropped to the table boundary.
- FinTabQA-A4: input is an A4 page-size image with the table in the top-left corner. This variant is motivated by the observation that when coordinates are rescaled to the 0-1000 “virtual” coordinate space used by LayoutLM, very small or large cropped images suffer more distortion than a consistently-sized A4 canvas.
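The coordinate-distortion argument behind the A4 variant is easy to see in code. Here is a minimal sketch of LayoutLM-style normalization to the 0-1000 virtual grid (the function name is illustrative; the 0-1000 convention is LayoutLM's):

```python
def normalize_box(box, width, height):
    """Rescale a pixel-space box (x0, y0, x1, y1) to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [int(1000 * x0 / width), int(1000 * y0 / height),
            int(1000 * x1 / width), int(1000 * y1 / height)]

# The same 5-pixel-wide word spans 25 virtual units in a 200 px-wide crop
# but only ~6 units on a 794 px-wide A4 canvas (A4 at 96 dpi).
print(normalize_box((10, 10, 15, 20), width=200, height=100))   # [50, 100, 75, 200]
print(normalize_box((10, 10, 15, 20), width=794, height=1123))  # [12, 8, 18, 17]
```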
Training uses ground-truth word bounding boxes. Evaluation uses EasyOCR output, with parameters tuned beyond the defaults through a parameter search.
The exact-match accuracy metric checks that both the predicted start and end span positions are correct. The F1 score used in standard QA benchmarks is not reported because a partial match in financial cell extraction is considered uninformative: either the full cell value is extracted or it is not.
Real-world evaluation. A small test set of 50 real-world table images, sampled from UK Companies House filings from March 2023, is manually annotated with two question-answer pairs each, yielding 100 questions. This set tests how well models trained on synthetic data transfer to real documents.
Baseline comparison. FinTabQA models are compared against GPT-4V under several prompting conditions: question-only, and question plus an instruction specifying that the image contains tabular financial data and requesting parentheses and negation signs be preserved in the response. GPT-4V responses are manually evaluated.
OCR sensitivity analysis. To separate OCR errors from model errors, both FinTabQA and FinTabQA-A4 are also evaluated on the SynFinTabs test split using the ground-truth words and bounding boxes directly.
What are the outcomes/conclusions?
SynFinTabs test split. Using tuned EasyOCR, FinTabQA achieves 95.87% exact-match accuracy and FinTabQA-A4 achieves 94.97% (Table 1). The less-than-one-point difference between the two image-input strategies suggests the choice of image size has limited impact in this setting.
Real-world evaluation. Results on the 100-question real-world set are reported in Table 2:
| Model | Training data | Image size | Prompt | Accuracy |
|---|---|---|---|---|
| FinTabQA | SynFinTabs | Table boundary | Question | 89% |
| FinTabQA-A4 | SynFinTabs | A4 page | Question | 79% |
| GPT-4V | Proprietary | Table boundary | Question | 76% |
| GPT-4V | Proprietary | Table boundary | Instruction + question | 94% |
| GPT-4V | Proprietary | A4 page | Instruction + question | 89% |
FinTabQA trained on synthetic data achieves 89% on real-world tables, outperforming GPT-4V with a question-only prompt (76%). When GPT-4V is prompted with an explicit instruction to preserve parentheses and negation signs, it reaches 94%, a gap of 18 percentage points attributed almost entirely to parenthesis handling: 22 of 24 GPT-4V errors with the question-only prompt were caused by the model omitting parentheses around negative values.
OCR impact. When ground-truth words and bounding boxes are provided, FinTabQA and FinTabQA-A4 achieve 99.98% and 99.99% accuracy respectively, indicating the models themselves are near-perfect on the task. The gap to the OCR-based results (roughly 4-5 percentage points on the synthetic test split) is attributable to OCR errors in two categories: (1) headers not recognized correctly, meaning the question context is corrupted, and (2) answer text not recognized correctly, so the span cannot be located. Using default EasyOCR parameters instead of the tuned parameters reduces accuracy to roughly 75%, a drop of about 20 percentage points. Tesseract, which is commonly used in related work, is reported to perform substantially worse on tabular data.
Limitations. The paper identifies several explicit constraints. Table content is semantically random: words are drawn from a vocabulary at random and numbers are arbitrary, so models cannot learn to interpret meaning from cell values. All question-answer pairs follow the same grammatical template, which may limit generalization to natural-language paraphrases of the same question. GPT-4V experiments were limited by API cost, restricting the extent of prompt engineering explored. The real-world evaluation set is small (100 questions from 50 tables), making it difficult to draw strong statistical conclusions from the real-world results.
Reproducibility
Models
FinTabQA and FinTabQA-A4 are both fine-tuned from the base variant of LayoutLM (microsoft/layoutlm-base-uncased), which has 113 million parameters. The base LayoutLM is pre-trained for document understanding using text, layout (2D bounding-box coordinates), and image features. The fine-tuned checkpoints are released separately on HuggingFace under the MIT license: ethanbradley/fintabqa (table-boundary variant) and ethanbradley/fintabqa-a4 (A4-page variant).
Algorithms
The extractive QA objective follows the standard span prediction setup: the model predicts a start logit and end logit over the context (the flattened list of table words), and the loss is computed against the ground-truth span start and end positions. A post-processing fix is applied at inference: the end position is constrained to come after the predicted start position (argmax over end logits restricted to positions after the start), which yielded approximately 2-point accuracy improvements.
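A minimal sketch of that constrained decoding step, assuming per-token start and end logits over the flattened word sequence (names are illustrative):

```python
import torch

def decode_span(start_logits: torch.Tensor, end_logits: torch.Tensor):
    """start_logits, end_logits: (seq_len,) tensors over context words.
    Pick the start by argmax, then restrict the end argmax to positions
    at or after the start, so end >= start holds by construction."""
    start = int(start_logits.argmax())
    end = start + int(end_logits[start:].argmax())
    return start, end
```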
Fine-tuning hyperparameters: batch size 2; learning rate selected with PyTorch Lightning Tuner with initial max $3 \times 10^{-5}$; FinTabQA trained for 12 epochs (63 GPU hours); FinTabQA-A4 trained for 11 epochs (57 GPU hours). Best checkpoint selected by lowest validation loss at epoch end.
Data
SynFinTabs comprises 100,000 images split 80/10/10 (train/validation/test). The dataset includes HTML, JSON, and CSV representations of each table; word, cell, row, and table bounding-box annotations; semantic cell-type labels; and pre-computed question-answer pairs with stored span positions. For test-split tables, EasyOCR-extracted words and bounding boxes are also included.
The dataset is publicly available on HuggingFace at ethanbradley/synfintabs under the MIT license. The generation code is available at ethanbradley/synfintabgen under the MIT license.
Textual row headers are drawn from a curated list of real financial account names observed in Companies House filings. Other textual cell content uses random words from a 10,000-word English vocabulary. Numerical cell content is random. No personally identifiable information or actual financial figures from real companies appear in the dataset.
Evaluation
The primary metric is exact-match accuracy requiring both span start and end positions to be correct. Evaluation on SynFinTabs uses the competition pair (one randomly selected question-answer pair per table). For the real-world set, two question-answer pairs per table were manually defined. The real-world evaluation set is small (50 tables, 100 questions) and statistical uncertainty measures are not reported for any result in the paper.
Hardware
All training was performed on a single NVIDIA GRID M60-8Q GPU using the Northern Ireland High Performance Computing (NI-HPC) service. FinTabQA required 63 GPU hours; FinTabQA-A4 required 57 GPU hours. No inference latency or memory figures are reported.
BibTeX
@inproceedings{bradley2026synfintabs,
title = {{SynFinTabs}: A Dataset of Synthetic Financial Tables for Information and Table Extraction},
author = {Bradley, Ethan and Roman, Muhammad and Rafferty, Karen and Devereux, Barry},
booktitle = {Document Analysis and Recognition -- ICDAR 2025 Workshops},
publisher = {Springer Nature Switzerland},
address = {Cham},
pages = {85--100},
year = {2026},
month = jan,
doi = {10.1007/978-3-032-09371-4_6}
}
UniTabNet: Bridging Vision and Language Models for Enhanced Table Structure Recognition
TL;DR
UniTabNet is an image-to-text model for table structure recognition (TSR) that pairs a Swin Transformer encoder with a BART decoder, then adds two auxiliary modules: a Vision Guider that supervises attention toward row/column regions, and a Language Guider that aligns structural decoding tokens with text-reading tokens to improve handling of descriptive table cells. Results on PubTables1M, PubTabNet, WTW, and iFLYTAB are competitive with, and in several cases better than, prior split-and-merge methods.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new model architecture (UniTabNet) with two auxiliary training modules (Vision Guider and Language Guider). The paper is structured around ablation tables, baseline comparison tables, and architecture figures, all of which suggest that the model design is the primary claim.
Secondary: none significant. Experiments use existing benchmarks; no new dataset or tool is released.
What is the motivation?
Most prior TSR methods treat table parsing as a purely visual problem: bottom-up methods detect cells from visual features, and split-and-merge methods segment rows and columns with pixel-level classifiers. Both paradigms ignore the textual content of cells, which matters when table structure is visually ambiguous.
Wireless tables with descriptive cells are a clear failure case. If two adjacent cells both span multiple rows but differ in content length, a purely visual model can confuse the cell boundaries. The authors illustrate this with a comparison against SEMv2 (Figure 1): the visual-only baseline merges cells that should remain separate because it cannot distinguish them by layout alone.
Image-to-text models (Donut, Pix2Struct) have shown strong text perception in general document understanding, but prior TSR work in this paradigm (TableFormer, VAST) did not fully exploit text understanding within cells. The authors position UniTabNet as the first image-to-text TSR model to explicitly condition structural decoding on textual semantics; this framing is consistent with the reviewed prior work but is the authors’ own claim rather than an externally verified fact.
What is the novelty?
Architecture overview
UniTabNet follows a “divide-and-conquer” strategy:
- An image-to-text model (Swin encoder + BART decoder) decodes a sequence of cell tokens `<C>` and row-boundary tokens `<NL>`.
- A Physical Decoder converts each cell’s hidden state into a polygon (bounding box with 8 coordinates).
- A Logical Decoder predicts rowspan and colspan for each cell.
- A Vision Guider and a Language Guider are injected at the final decoder layer to improve both prediction types.
Physical Decoder
Polygon coordinates are quantized into 1,000 bins. Rather than taking the argmax over the bin vocabulary (as in Pix2Seq), the physical decoder computes the expected location:
$$ E(p_j) = \sum_{i=0}^{999} i \cdot a_i^{p_j} $$
where $a^{p_j} = \text{softmax}(\mathbf{h}_i^{p_j} \mathbf{Loc}^\top)$ is the distribution over the location vocabulary. The regression loss is mean squared error over the eight coordinate points:
$$ \mathcal{L}_{\text{poly}} = \frac{1}{8} \sum_{j=1}^{8} \left(E(p_j) - p_j^\ast\right)^2 $$
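A minimal PyTorch sketch of this expected-location (soft-argmax) readout; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def expected_location(hidden: torch.Tensor, loc_embed: torch.Tensor) -> torch.Tensor:
    """hidden: (..., D) query for one coordinate; loc_embed: (1000, D) location vocab.
    Returns the probability-weighted bin index E(p_j), a float in [0, 999]."""
    logits = hidden @ loc_embed.T                  # (..., 1000)
    probs = F.softmax(logits, dim=-1)
    bins = torch.arange(1000, dtype=probs.dtype, device=probs.device)
    return (probs * bins).sum(dim=-1)
```

Unlike a hard argmax, this readout is differentiable and can resolve positions between bin centers, which is what makes the MSE regression loss above trainable end-to-end.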
Logical Decoder
Span prediction uses argmax over the same location vocabulary. Because span distributions are heavily imbalanced (most cells have span 1), the training loss is sigmoid focal loss:
$$ \mathcal{L}_{\text{span}} = L_f\!\left(a^{\text{row}}, l_{\text{row}}^\ast\right) + L_f\!\left(a^{\text{col}}, l_{\text{col}}^\ast\right) $$
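Sigmoid focal loss has a stock implementation in torchvision; a minimal sketch of applying it to span logits over the location vocabulary (the shapes and the alpha/gamma values are common defaults, not necessarily the paper's):

```python
import torch
from torchvision.ops import sigmoid_focal_loss

logits = torch.randn(8, 1000)        # span logits for 8 cells over 1,000 bins
targets = torch.zeros(8, 1000)
targets[torch.arange(8), 1] = 1.0    # the dominant case: span = 1
loss = sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0, reduction="mean")
```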
Vision Guider
The Vision Guider supervises which visual positions a cell token attends to. For each decoded cell, the last-layer hidden state $\mathbf{h}_i$ is projected to query vectors $\mathbf{h}_i^{\text{row}}$ and $\mathbf{h}_i^{\text{col}}$, which compute attention scores over the visual feature map $\mathbf{Z}$. The loss penalizes deviations from ground-truth row and column mask maps using sigmoid focal loss:
$$ \mathcal{L}_{\text{vis}} = L_f\!\left(a^{\text{row}}, g_{\text{row}}^\ast\right) + L_f\!\left(a^{\text{col}}, g_{\text{col}}^\ast\right) $$
This is an auxiliary loss; the attention weight itself is not changed, only supervised.
Language Guider
An auxiliary “Table Read” (TR) task trains the model to output the text content of each cell in sequence. A cell token in the TSR task is mapped to a hidden vector $\mathbf{h}_i^{\text{lang}}$, which is aligned with the mean-pooled hidden states $\mathbf{h}_{\text{lang}}^\ast$ of the corresponding TR tokens via MSE loss:
$$ \mathcal{L}_{\text{lang}} = \text{MSE}\!\left(\mathbf{h}_i^{\text{lang}},\, \mathbf{h}_{\text{lang}}^\ast\right) $$
This alignment gives structural tokens access to textual representations without requiring OCR outputs at inference time.
Total loss
Five losses with large differences in scale are combined using learnable homoscedastic uncertainty weights (Kendall et al., 2018). The five components are: the language model cross-entropy loss $\mathcal{L}_{\text{lm}}$, the polygon regression loss $\mathcal{L}_{\text{poly}}$, the span focal loss $\mathcal{L}_{\text{span}}$, the Vision Guider focal loss $\mathcal{L}_{\text{vis}}$, and the Language Guider alignment loss $\mathcal{L}_{\text{lang}}$. These are combined as:
$$ \mathcal{L}_{\text{total}} = \sum_{k=1}^{5} \left( \frac{1}{2\sigma_k^2}\,\mathcal{L}_k + \log\!\left(1 + \sigma_k^2\right) \right) $$
The $\sigma_k$ parameters are learned end-to-end and remove the need for manual loss weighting.
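A minimal PyTorch sketch of this weighting in the form the paper writes it; parameterizing through $\log \sigma^2$ is a standard stability trick and an assumption here:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Combine K losses as sum_k [ L_k / (2 * sigma_k^2) + log(1 + sigma_k^2) ],
    with sigma_k learned jointly with the model."""
    def __init__(self, num_losses: int = 5):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_losses))  # log(sigma_k^2)

    def forward(self, losses: list[torch.Tensor]) -> torch.Tensor:
        total = torch.zeros((), device=losses[0].device)
        for k, loss in enumerate(losses):
            sigma_sq = self.log_var[k].exp()
            total = total + loss / (2.0 * sigma_sq) + torch.log1p(sigma_sq)
        return total
```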
What experiments were performed?
Datasets
| Dataset | Type | Size | Metric |
|---|---|---|---|
| PubTabNet | Digital, wired + wireless | 568,000 | TEDS-Struct |
| PubTables1M | Digital, wired + wireless | 948,000 | GriTS-Top, GriTS-Loc |
| WTW | Wild (digital + camera), wired | 14,581 | F1-Measure |
| iFLYTAB | Mixed (digital + camera) | 17,291 | TEDS-Struct |
| iFLYTAB-DP | Wireless + descriptive cells | 322 | TEDS-Struct |
iFLYTAB-DP is a manually curated subset of iFLYTAB validation images selected to stress-test text-dependent structure prediction.
Baselines
Bottom-up: Faster R-CNN, DETR, Cycle-CenterNet, LORE, LGPMA. Split-and-merge: SEM, RobustTabNet, TSRFormer, SEMv2, TRUST, SEMv3. Image-to-text: EDD, TableFormer, VAST.
Ablation
Table 3 in the paper studies four system variants evaluated on iFLYTAB and iFLYTAB-DP:
| System | Uncertainty loss | Vision Guider | Language Guider | iFLYTAB | iFLYTAB-DP |
|---|---|---|---|---|---|
| T1 | | | | 92.4 | 82.9 |
| T2 | yes | | | 93.2 | 83.3 |
| T3 | yes | yes | | 93.7 | 83.6 |
| T4 (full) | yes | yes | yes | 94.0 | 84.9 |
| SEMv3 | | | | 93.2 | 82.6 |
Each module contributes, with the Language Guider providing the largest gain on the descriptive subset.
What are the outcomes/conclusions?
On PubTables1M, UniTabNet achieves GriTS-Top of 99.43 (vs. 99.22 for VAST), the best reported among image-to-text methods. GriTS-Loc of 95.37 trails the DETR-based bottom-up method (97.81), which the authors attribute to DETR using in-cell content bounding boxes to refine cell localization.
On PubTabNet, UniTabNet TEDS-Struct of 97.50 matches TSRFormer, SEMv2, and SEMv3.
On WTW, UniTabNet achieves F1 of 95.1 (precision 95.6, recall 94.7), tied with LORE and SEMv3 for the top reported score. Recall is constrained by the maximum decoding length of 500 tokens; for large tables, some rows are truncated at inference. UniTabNet trails SEMv3 on recall (94.7 vs. 95.4) but matches it on F1 thanks to higher precision.
On iFLYTAB, UniTabNet TEDS-Struct of 94.0 exceeds all compared methods. The gain is largest on iFLYTAB-DP (84.9 vs. 82.6 for SEMv3), validating the Language Guider’s contribution for descriptive tables.
Limitations
- Autoregressive latency: Inference time scales with cell count, making the model slower than non-autoregressive approaches for large tables.
- Recall degradation: A hard cap on decoding length (500 tokens) causes missed cells in dense tables; this is reported as a known issue with no proposed fix.
- Out-of-distribution spans: Span prediction is cast as classification over a fixed vocabulary. Any span value not seen during training cannot be predicted.
- Benchmark scope: All four evaluated datasets are either scientific articles (PubTabNet, PubTables1M) or primarily Chinese documents (WTW, iFLYTAB). Generalization to business documents, invoices, or HTML tables is not tested.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | CC-BY-4.0 | arXiv:2409.13148 |
| Published (EMNLP 2024 Findings) | Paper | CC-BY-4.0 | doi:10.18653/v1/2024.findings-emnlp.355 |
| UniTabNet code | Code | Not released | No URL provided in paper |
| iFLYTAB-DP IDs | Dataset subset | Unknown | SEMv2 repo (ML_list.txt; images require main iFLYTAB download) |
Models
- Vision encoder: Swin Transformer; downsampling factor 32; feature dimension $D = 1024$.
- Text decoder: BART; 4 layers; 16 attention heads.
- Image resolution: longest side resized to 1600 pixels, aspect ratio preserved.
- Location vocabulary: 1,000 special tokens `<0>` through `<999>`, used for both polygon coordinates and span prediction.
- Weights: the abstract states “the code will be made publicly available,” but no repository URL is provided in this preprint.
Algorithms
- Optimizer: Adam, learning rate $5 \times 10^{-5}$.
- Schedule: linear warmup over first 10% of steps, then linear decay.
- Epochs: 100 for iFLYTAB and WTW; 10 for PubTables1M and PubTabNet.
- Loss weighting: homoscedastic uncertainty ($\sigma_k$ learned jointly).
- Pre-training data: 1.4 million SynthDog (Chinese + English synthetic documents) + PubTables1M training set.
- Three training tasks: OCR (pre-training), Table Read (pre-training + fine-tuning), Table Structure Recognition (fine-tuning).
Data
- Pre-training: SynthDog (1.4 M synthetic entries) and PubTables1M; both are publicly available.
- Fine-tuning: PubTabNet, PubTables1M, WTW, iFLYTAB (publicly available via their respective releases).
- iFLYTAB-DP: 322-image subset curated from iFLYTAB validation; the image IDs are available in `ML_list.txt` in the SEMv2 GitHub repo (https://github.com/ZZR8066/SEMv2), but the actual images must be obtained from the main iFLYTAB dataset. No license is stated for this subset.
- No contamination analysis is reported.
Evaluation
- TEDS-Struct: tree-edit-distance similarity on HTML structure, ignoring OCR output (Eq. 17).
- GriTS-Top / GriTS-Loc: grid-level F-score variants from Smock et al. (2023), measuring topology and localization accuracy.
- F1-Measure (WTW): cell adjacency relationship metric; cells matched by IoU $\geq 0.6$.
- No error bars, significance tests, or multiple run seeds are reported.
- Comparison fairness: not all baselines are evaluated on all datasets; iFLYTAB results are absent for several baselines (LGPMA, TSRFormer, TRUST, VAST).
Hardware
- 8 $\times$ Tesla A40 48 GB GPUs.
- Total training time not reported.
- Inference latency not reported.
- No cloud cost or energy consumption estimates provided.
BibTeX
@inproceedings{zhang-etal-2024-unitabnet,
title = "{U}ni{T}ab{N}et: Bridging Vision and Language Models for Enhanced Table Structure Recognition",
author = "Zhang, Zhenrong and Liu, Shuhang and Hu, Pengfei and Ma, Jiefeng and Du, Jun and Zhang, Jianshu and Hu, Yu",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-emnlp.355/",
doi = "10.18653/v1/2024.findings-emnlp.355",
pages = "6131--6143"
}
WikiDT: Visual-based Table Recognition and Question Answering Dataset
TL;DR
WikiDT is a large-scale document TableVQA dataset built from Wikipedia pages. It contains 16,887 full-page images (54,032 sub-pages after pagination), 159,905 table annotations, and 70,652 question-answer pairs. A key design feature is the layered set of intermediate labels covering table detection (TD), table structure recognition (TSR), table retrieval, and SQL-form queries, enabling sub-task diagnosis and modular training. Baseline experiments with T5, LaTr, and TAPAS show that all tasks remain substantially unsolved, with the best overall TableVQA accuracy reaching only 45.23%.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: the primary contribution is the dataset itself, including its automated curation pipeline, multi-level annotation schema, and release on HuggingFace.
Secondary: $\Psi_{\text{Evaluation}}$: baseline experiments across four sub-tasks establish difficulty benchmarks and surface model failure modes, particularly around multi-step reasoning and long table contexts.
What is the motivation?
Intelligent document processing is a growing commercial area, with the global IDP market projected to grow from \$1.1B in 2022 to \$5.2B by 2027. Table extraction and table-based QA sit at the center of many real-world workflows: insurance claim parsing, income report processing, and financial data extraction all require systems that can locate, recognize, and reason over tabular content.
Existing document VQA datasets fall short in several ways:
- Extractive-only questions: Most benchmarks emphasize directly span-extractable answers. Fewer than 2% of InfographicVQA samples require genuine reasoning or synthesis.
- Single-table context: Datasets like DUE (Document Understanding Benchmark) crop images to contain only the target table, making table recognition trivial and the task less realistic.
- Missing intermediate labels: Models trained on end-to-end supervision cannot be easily diagnosed; it is unclear whether errors stem from table detection, retrieval, or the QA component itself.
- Scale and diversity limitations: Prior table recognition benchmarks (ICDAR-2013 at 450 samples, ICDAR-2019 at 3.2k) are small, and large ones (PubTables-1M from scientific PDFs) come from a narrow visual domain.
WikiDT targets all of these gaps by providing full-page web screenshots with diverse, multi-table layouts, complex reasoning questions, and a full chain of intermediate annotations.
What is the novelty?
Dataset scale and composition
WikiDT is assembled from Wikipedia pages referenced by three existing TableQA datasets: WikiTableQuestions, TabFact, and WikiSQL. The dataset provides:
- 16,887 full-page images rendered as web screenshots at 1600px width
- 54,032 sub-page images after a dynamic-height pagination step
- 159,905 table annotations (web-derived ground truth + AWS Textract output)
- 70,652 QA pairs with table retrieval labels
- ~49,000 SQL annotations linking questions to executable queries
The table content spans a wide domain: sports records, historical events, political data, geography, and more.
Hierarchical annotation schema
Each QA sample is paired with a full chain of intermediate labels:
- Table detection (TD): Bounding boxes for every visible table in the sub-page image (WikiDT-detection, 54,032 images).
- Table structure recognition (TSR): Row, column, cell (including merged cells), and header bounding boxes extracted from HTML rendering, stored in Pascal VOC format (WikiDT-structure, 159,905 table crops).
- Table retrieval: For each question, the label identifies which table on the page contains the answer.
- TableVQA: End-to-end QA requiring multi-step reasoning over the full-page image with OCR assistance.
- SQL annotation: Executable SQL queries covering ~70% of samples, derived from WikiSQL and WikiTableQuestions.
Automated, heuristic-verified construction
The pipeline renders Wikipedia pages with Puppeteer, extracts table bounding boxes directly from HTML DOM elements, and connects them back to existing QA pairs. Because web content drifted after original dataset creation, the authors recover historical page versions from Wikipedia’s edit history and re-execute WikiSQL queries against the updated table content to regenerate correct answers. Manual quality inspections supplement the automated checks.
Challenging table characteristics
Compared to PDF-derived datasets like PubTables-1M, WikiDT tables appear at arbitrary positions and widths across the page (the bounding box distribution is nearly uniform rather than concentrated at a few column positions). Many pages contain 10 or more tables per sub-page (mean 3.4, max over 40), and tables can span beyond a single paginated segment.
What experiments were performed?
Sub-task formulations
The paper evaluates four tasks:
| Task | Input | Output |
|---|---|---|
| Table Detection | sub-page image | table bounding boxes |
| Table Structure Recognition | table crop image | row/column/cell bounding boxes |
| Table Retrieval | tables + question | target table identifier |
| TableVQA | full-page image + question | answer string(s) |
TableVQA
Three models were evaluated using denotation accuracy (exact match, ignoring number formatting):
| Model | Single Answer (%) | Multi-Answer (%) | Overall (%) |
|---|---|---|---|
| T5 (text-only, spatial-blind) | 32.74 | 1.30 | 31.67 |
| LaTr (layout-aware visual) | 35.29 | 0.0 | 34.08 |
| TAPAS (modularized, with ground-truth retrieval) | 46.24 | 6.86 | 45.23 |
TAPAS receives table structure information from Textract and ground-truth retrieval labels, making it the most favorable setup. All models struggle badly on multi-entry answers.
An ablation comparing Textract table annotations vs. web (ground truth) table annotations shows the improvement from better table recognition is modest for monolithic models (T5, LaTr), whereas model size has a much larger impact. For TAPAS, table recognition quality matters more.
Table Extraction
Pre-trained models from other domains transfer poorly to WikiDT:
- TableNet (trained on Marmot) and CascadeTabNet (trained on ICDAR-19) show substantially degraded F1 compared to their home benchmarks, confirming domain shift.
- DETR pre-trained on PubTables-1M achieves AP$_{50}$ well below the WikiDT-trained DETR, illustrating that PubTables-1M’s narrow PDF layout distribution does not generalize to web pages.
At IoU=0.5, Textract achieves 0.901 precision, 0.558 recall, and F1=0.689 on the WikiDT test set, leaving considerable room for improvement especially on recall.
Table Retrieval
BM25 and a dense retrieval model (TAPAS backbone with a learned $R_{CLS}$ token) are compared on Mean Reciprocal Rank (MRR):
| Method | Web Tables (MRR) | Textract Tables (MRR) |
|---|---|---|
| BM25 | 0.382 | 0.389 |
| Dense Retrieval | 0.587 | 0.524 |
Dense retrieval substantially outperforms BM25. Table recognition quality has only a marginal effect on retrieval.
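MRR itself is simple to compute; a minimal sketch (names are illustrative):

```python
def mean_reciprocal_rank(rankings: list[list[str]], gold: list[str]) -> float:
    """rankings[i] is the ranked list of candidate table ids for question i;
    gold[i] is the id of the table that contains the answer."""
    total = 0.0
    for ranking, target in zip(rankings, gold):
        for rank, table_id in enumerate(ranking, start=1):
            if table_id == target:
                total += 1.0 / rank
                break
    return total / len(gold)

# Gold at rank 2 and rank 3 respectively: (1/2 + 1/3) / 2 ≈ 0.417
print(mean_reciprocal_rank([["t2", "t1"], ["t3", "t4", "t2"]], ["t1", "t2"]))
```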
What are the outcomes/conclusions?
WikiDT surfaces challenges that remain unsolved by models performing well on other VQA benchmarks. The key findings are:
- All four evaluated sub-tasks have substantial performance headroom, with the best overall TableVQA accuracy at 45.23%.
- Multi-answer questions are particularly difficult; TAPAS achieves only 6.86% denotation accuracy on them.
- Multi-step reasoning (e.g., “count in one group, count in another, then compare”) consistently defeats TAPAS, which can handle at most one aggregation operation.
- Web-page table layouts are substantially more diverse than PDF-derived layouts, causing significant domain transfer failures for models trained on existing benchmarks.
- Improving table recognition quality helps modularized systems more than end-to-end models; for end-to-end models, scaling model capacity has a larger effect.
The paper notes geographic and cultural bias in the dataset toward content from the United States and Canada, and acknowledges that Wikipedia’s inherent misinformation and outdated data carry over into WikiDT.
Reproducibility
Models
- T5 (base and large): standard text-to-text transformer; spatially blind; inputs are linearized table tokens in row-column order.
- LaTr (Layout-aware Transformer): pre-trained on scene-text and document VQA tasks; takes image patches, OCR tokens with layout coordinates, and question as input; generates answer autoregressively.
- TAPAS: pre-trained table parser; takes flattened table, column names, and question; predicts aggregation operator and cell selection; cannot compose more than one aggregation.
- DETR: transformer-based object detector trained on WikiDT for table detection and structure recognition.
All models fine-tuned from publicly released pre-trained checkpoints on Hugging Face, except LaTr (fine-tuned by the original LaTr authors).
Algorithms
- TAPAS and T5 fine-tuned using standard Hugging Face training procedures; hyperparameters not explicitly stated in the paper.
- Dense retrieval model trained with weighted binary cross-entropy to counteract label imbalance; each background sample receives weight $1 / \lvert\text{target\_sample}\rvert / \lvert\text{background\_sample}\rvert$.
- DETR trained from random initialization on WikiDT; configuration identical to the PubTables-1M-pre-trained version.
Data
- Source: Wikipedia pages from WikiTableQuestions, TabFact, and WikiSQL URL lists.
- Rendering: Puppeteer at 1600px width; historical page versions recovered via Wikipedia edit history for TabFact URLs.
- Pagination: Dynamic-height segmentation using blank-line detection (window W=1200px, H=10px); segments to near-1:1 aspect ratio.
- Table annotation: Extracted directly from HTML DOM via Puppeteer; filtered to visible elements with at least two rows and two columns; nested outer tables removed.
- OCR: AWS Textract applied to sub-page images; footnotes indicate annotations predate merged-cell support added in March 2022.
- QA generation: WikiTableQuestions answers reused as-is; WikiSQL answers regenerated by executing translated SQL on the retrieved web table (due to content drift).
- SQL annotation: WikiSQL queries translated and validated; coverage approximately 70% of QA samples.
- Dataset availability: Released on HuggingFace at AmazonScience/WikiDT; license not stated in the paper.
- Data split: Train/dev/test splits follow standard practice; dev and test sets exclude pages with only one table for the retrieval task.
Evaluation
- TableVQA metric: Denotation accuracy. A prediction is correct if it matches the ground truth in number of entries and each entry can be non-repeatedly matched (format-insensitive, e.g., “1000” matches “1,000”); see the sketch after this list.
- Table extraction metrics: IoU-based AP (AP$_{50}$, AP$_{75}$, AP$_{95}$) and AR following MS-COCO convention; pixel-level precision/recall/F1 for TableNet (mask outputs).
- Table retrieval metric: Mean Reciprocal Rank (MRR) on the subset of samples with multiple table candidates.
- Multi-answer denotation accuracy is evaluated strictly: all entries must match, regardless of order.
- No error bars or significance tests reported; single-run results.
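A minimal sketch of the denotation-accuracy check described above; the exact normalization rules (here, lowercasing and stripping thousands separators) are assumptions:

```python
from collections import Counter

def _norm(entry: str) -> str:
    return entry.strip().lower().replace(",", "")  # "1,000" -> "1000"

def denotation_match(pred: list[str], gold: list[str]) -> bool:
    """Correct iff entry counts match and every predicted entry can be
    matched to a distinct gold entry, order-insensitively."""
    if len(pred) != len(gold):
        return False
    return Counter(map(_norm, pred)) == Counter(map(_norm, gold))

assert denotation_match(["1,000", "Paris"], ["paris", "1000"])
assert not denotation_match(["1000"], ["1000", "1000"])  # entry count must match
```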
Hardware
- Training hardware, GPU configuration, and compute time are not reported in the paper.
- AWS Textract was used as an external service for OCR and table extraction annotations.
BibTeX
@inproceedings{shi2024wikidt,
title={WikiDT: Visual-based Table Recognition and Question Answering Dataset},
author={Shi, Hui and Xie, Yusheng and Goncalves, Luis and Gao, Sicun and Zhao, Jishen},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
pages={401--418},
year={2024},
publisher={Springer},
doi={10.1007/978-3-031-70533-5_24}
}
SEMv3: A Fast and Robust Approach to Table Separation Line Detection
TL;DR
SEMv3 is the third iteration of the SEM (Split, Embed and Merge) family for table structure recognition. The split stage replaces instance segmentation with a Keypoint Offset Regression (KOR) module that predicts offsets from fixed horizontal proposals to the true line keypoints, removing the need for mask-to-line post-processing. The merge stage introduces a four-class merge action map (Stay, Left, Upward, X) that describes the cell grid structure in O(NM) space rather than the O(N²M²) needed by prior merge-map approaches. On the iFLYTAB, WTW, and ICDAR-2019 cTDaR Historical benchmarks the combined system outperforms SEMv2 and is competitive with LORE.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the headline contribution is the KOR split module and the merge action formulation. The paper is organized around model design, ablation studies on iFLYTAB, and comparisons on four benchmarks.
Secondary: None. No new dataset or benchmark is released. iFLYTAB is introduced in the SEMv2 paper (Zhang et al., 2024); SEMv3 evaluates on it but does not release it.
What is the motivation?
The split-and-merge paradigm for TSR detects row and column separation lines, intersects them to form a grid, and then merges over-split grid cells to recover spanning cells. The split stage has historically used semantic segmentation to predict a per-pixel separator mask, but this requires a heuristic mask-to-line post-processing step. Segmentation-based methods also struggle with wireless tables (no visible borders) and geometrically deformed tables, because the predicted masks become noisy and the post-processing amplifies errors.
Instance-segmentation approaches (e.g., SEMv2) partially address this by predicting a per-line mask through dynamic convolution, but the position-insensitivity of instance convolution kernels can reduce robustness, and the inference cost grows linearly with the number of lines.
A separate weakness is in the merge stage. Predicting a full merge map for each grid cell (as in SEMv2) requires quadratic space and computation in the grid dimensions, limiting scalability.
SEMv3 addresses both: replace segmentation with direct offset regression for the split stage, and replace the per-grid merge map with a compact action classification for the merge stage.
What is the novelty?
Keypoint Offset Regression (KOR) Split Module
The backbone is ResNet-34 with FPN, producing four feature levels $\mathbf{P}_2, \mathbf{P}_3, \mathbf{P}_4, \mathbf{P}_5$ that are fused into a single map $\mathbf{F} \in \mathbb{R}^{H/4 \times W/4 \times C}$.
Separation line representation. The $i$-th row separation line is represented by a sequence of keypoints $\{k_{ij}^{\text{row}} \mid j = 0, \ldots, N_k^{\text{row}} - 1\}$. The number of keypoints is determined by image width $W$ and a fixed sampling step $t$:
$$ N_k^{\text{row}} = \left\lceil \frac{W}{t} \right\rceil $$
The x-coordinates of keypoints are fixed ($x_{ij} = j \times t$), so the line is fully characterized by the y-coordinates of its keypoints. Rather than predicting absolute y-coordinates, KOR predicts the offset $\delta_{ij}^{\text{row}}$ from a horizontal proposal to the true keypoint position. Proposals share the same x-coordinates as keypoints and have y-coordinates matching the starting point of their line.
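Under this parameterization, reconstructing a predicted line is one addition per keypoint; a minimal sketch (names are illustrative):

```python
import numpy as np

def row_line_from_offsets(start_y: float, offsets: np.ndarray, t: int = 32) -> np.ndarray:
    """offsets: (N_k,) predicted vertical deltas for one detected row line.
    Returns (N_k, 2) polyline keypoints with fixed x = j * t."""
    xs = np.arange(len(offsets)) * t   # x_ij = j * t, fixed by the sampling step
    ys = start_y + offsets             # proposal y plus predicted offset
    return np.stack([xs, ys], axis=1)
```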
Feature enhancement. Two separate feature enhancement modules (with non-shared parameters) process $\mathbf{F}$ using SCNN for row and column context propagation, producing $\mathbf{F}_{sd}^{\text{row}}$ for starting point detection and $\mathbf{F}_{lr}^{\text{row}}$ for line regression.
Starting point detection. Row-wise average pooling and a softmax over $\mathbf{F}_{sd}^{\text{row}}$ produce a probability vector $\mathbf{P}^{\text{row}}$ for whether each row contains a separator starting point. NMS removes duplicate predictions.
Keypoint offset head. Proposal features $\mathbf{K}' \in \mathbb{R}^{C \times N_k^{\text{row}}}$ are sampled from $\mathbf{F}_{lr}^{\text{row}}$ at proposal positions. A line-level representation is computed by averaging proposal features within the same line:
$$ \mathbf{S}_i^{\text{row}} = \frac{1}{N_k^{\text{row}}} \sum_{j=1}^{N_k^{\text{row}}} \mathbf{K}'_{ij} $$
The per-keypoint feature is then formed by channel-wise concatenation of the proposal feature and the line representation:
$$ \mathbf{K}_{ij} = \text{concat}\!\left(\mathbf{K}'_{ij},\; \mathbf{S}_i^{\text{row}}\right) $$
A $1 \times 3$ convolution over $\mathbf{K}$ predicts the offset $\delta_{ij}^{\text{row}}$, allowing each keypoint to consider its neighbors for a smoother predicted line.
Merge Action Map
After intersecting row and column separators to produce a grid $\mathbf{B} \in \mathbb{R}^{M \times N \times 8}$ (quadrilateral boxes), grid features are extracted via RoI Align and absolute position embedding:
$$ \mathbf{E}' = \text{RoiAlign}(\mathbf{F}, \mathbf{B}) + \text{PE}(\mathbf{B}) $$
A row/column self-attention layer produces the enhanced grid representation $\mathbf{E}$.
Each grid cell is assigned one of four merge actions:
- S (Stay): upper-left cell of a spanning cell; do not merge in either direction.
- L (Left): merge with the cell to the left.
- U (Upward): merge with the cell above.
- X: merge both left and upward.
A convolutional auxiliary branch classifies starting grids; the final merge action is predicted from the concatenation of the starting-grid feature and $\mathbf{E}$. This formulation requires only one merge action map of size $M \times N$ per image, reducing the space complexity from $O(N^2 M^2)$ (per-grid merge maps) to $O(NM)$.
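Decoding the action map back into spanning cells is a linear scan; a minimal union-find sketch of the idea (an illustration, not the authors' code):

```python
def decode_merge_actions(actions: list[list[str]]) -> list[list[int]]:
    """actions: M x N map of 'S' | 'L' | 'U' | 'X'. Returns an M x N map of
    cell ids, where grids sharing an id belong to one spanning cell."""
    M, N = len(actions), len(actions[0])
    parent = list(range(M * N))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a: int, b: int) -> None:
        parent[find(a)] = find(b)

    for i in range(M):
        for j in range(N):
            idx = i * N + j
            if actions[i][j] in ("L", "X") and j > 0:
                union(idx, idx - 1)   # merge with the grid to the left
            if actions[i][j] in ("U", "X") and i > 0:
                union(idx, idx - N)   # merge with the grid above
    return [[find(i * N + j) for j in range(N)] for i in range(M)]

# A 2x2 spanning cell in the top-left corner of a 2x3 grid:
print(decode_merge_actions([["S", "L", "S"],
                            ["U", "X", "S"]]))
```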
Loss Functions
Starting point detection uses binary cross-entropy:
$$ \mathcal{L}_{sp}^{\text{row}} = \frac{1}{H/2} \sum_{i=1}^{H/2} \mathcal{L}_{\text{BCE}}(\hat{\mathbf{P}}_i^{\text{row}}, \mathbf{P}_i^{\text{row}}) $$
Keypoint offset regression uses L2 loss:
$$ \mathcal{L}_\delta^{\text{row}} = \frac{1}{N^{\text{row}} N_k^{\text{row}}} \sum_{i=1}^{N^{\text{row}}} \sum_{j=0}^{N_k^{\text{row}}-1} \left| \hat{\delta}_{ij}^{\text{row}} - \delta_{ij}^{\text{row}} \right|_2 $$
Starting grid classification and merge action prediction both use focal loss:
$$ \mathcal{L}_{\text{sg}} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} L_{\text{focal}}(\hat{\mathbf{P}}_{ij}^{\text{sg}}, \mathbf{P}_{ij}^{\text{sg}}) $$
$$ \mathcal{L}_{\text{ma}} = \frac{1}{NM} \sum_{a} \sum_{i=1}^{N} \sum_{j=1}^{M} L_{\text{focal}}(\hat{\mathbf{P}}_{ij}^{a}, \mathbf{P}_{ij}^{a}) $$
Overall loss:
$$ \mathcal{L} = \mathcal{L}_{sp}^{\text{row}} + \mathcal{L}_{sp}^{\text{col}} + \mathcal{L}_\delta^{\text{row}} + \mathcal{L}_\delta^{\text{col}} + \mathcal{L}_{\text{ma}} + \mathcal{L}_{\text{sg}} $$
What experiments were performed?
Datasets
- ICDAR-2019 cTDaR Historical: 600 train / 150 test; archival historical documents with sparse borders and geometric deformation. Metric: cell adjacency F1 at IoU 0.6.
- WTW: 14,581 wired tables from real business scenarios, including seven hard cases (inclined, curved, occluded, extreme aspect ratio, etc.). Metric: cell adjacency F1 at IoU 0.6.
- iFLYTAB (introduced in SEMv2): 12,103 train / 5,188 test; four subsets: Wired-Digital (WDD), Wired-Camera-Capture (WDC), Wireless-Digital (WLD), Wireless-Camera-Capture (WLC). Cell adjacency F1 and grid detection F1-G (IoU 0.9) are both reported.
- SciTSR-COMP and PubTabNet: axis-aligned digital tables for broader comparison. SciTSR-COMP uses cell adjacency F1; PubTabNet uses TEDS and TEDS-Struct.
Baselines
ICDAR-2019 cTDaR Historical: TabStructNet, FLAGNet, NCGM, SEMv2, LORE. WTW: Cycle-CenterNet, NCGM, TSRFormer-DQ, SEMv2, LORE. iFLYTAB: SEM, SEMv2. SciTSR/PubTabNet: RobusTabNet, TSRFormer, LORE, TSRFormer-DQ, SEMv2.
Ablation
Table 5 in the paper crosses two split options (IS from SEMv2, KOR) with two merge options (MP from SEMv2, MA from SEMv3) on iFLYTAB’s four subsets:
- T1: IS + MP; T2: KOR + MP; T3: IS + MA; T4: KOR + MA. Grid detection F1-G on wireless subsets (WLC, WLD) rises from roughly 18-24% (IS) to 37-59% (KOR) under both merge settings, confirming that the split quality improvement is primarily attributable to KOR.
Implementation
ResNet-34 + FPN backbone; feature channel $C = 256$; grid feature channel $C_g = 512$; sampling step $t = 32$. Trained end-to-end for 100 epochs. Optimizer: Adam, initial LR $10^{-4}$ decayed to $10^{-6}$ via cosine annealing. Four Nvidia Tesla V100 GPUs (24 GB each); PyTorch 1.7.1.
What are the outcomes/conclusions?
ICDAR-2019 cTDaR Historical: 89.3% F1, ahead of LORE (88.3%) and SEMv2 (85.9%).
WTW: 95.1% F1, tied with LORE* (95.1%), but SEMv3 uses the stricter IoU threshold of 0.6 vs. LORE’s 0.5. The authors note this makes the comparison favorable to SEMv3 relative to LORE.
iFLYTAB: 94.4% F1 overall, 0.9 points above SEMv2 (93.5%). On wireless subsets the grid F1-G gain is substantially larger (18-24% for IS to 37-59% for KOR).
SciTSR-COMP: 99.0% F1, comparable to TSRFormer (98.9%) and TSRFormer-DQ (98.8%).
PubTabNet: 97.3% TEDS, 97.5% TEDS-Struct. TEDS-Struct is on par with TSRFormer-DQ and SEMv2; TEDS is slightly below LORE (98.1%).
Inference speed: KOR’s cost is nearly constant as the number of rows and columns grows because it predicts a single shared offset map, while IS cost grows linearly with the number of lines. In the paper’s timing comparison on SciTSR at 512×512, KOR is significantly faster than IS for tables with many rows/columns.
Limitations acknowledged: Performance on SciTSR is noted as potentially confounded because the test set has been “manually rectified in various ways.” On PubTabNet the method is slightly below LORE (a region-based method). No code or trained weights are publicly released. The iFLYTAB dataset used for ablation is from the SEMv2 paper and is not freely downloadable.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| SEMv3 paper (preprint) | Paper | arXiv-nonexclusive-v1.0 | arXiv:2405.11862 |
| SEMv3 code | Code | Not released | N/A |
| SEMv3 model weights | Model | Not released | N/A |
| iFLYTAB (released with SEMv2) | Dataset | Unknown | See SEMv2 (arXiv:2405.11757) |
| WTW | Dataset | Unknown | GitHub |
| ICDAR-2019 cTDaR Historical | Dataset | Unknown | ICDAR 2019 competition page |
| SciTSR | Dataset | MIT | GitHub |
| PubTabNet | Dataset | CDLA-Permissive-1.0 | IBM Research / GitHub |
Models
- Backbone: ResNet-34 pretrained on ImageNet, with FPN producing four feature levels; fused into $\mathbf{F} \in \mathbb{R}^{H/4 \times W/4 \times 256}$.
- Feature enhancement: SCNN with 4 downsampling blocks; two separate modules per branch (row and column), each with non-shared parameters.
- KOR offset head: $1 \times 3$ convolution on concatenated keypoint features.
- Merge module: RoI Align + positional embedding; row/column self-attention; convolutional starting-grid head; merge action classification head.
- Model weights are not publicly released.
Algorithms
- Optimizer: Adam, $\beta_1 = 0.9$, $\beta_2 = 0.999$ (defaults assumed; not specified in paper).
- Learning rate: $10^{-4}$ with cosine annealing to $10^{-6}$.
- Training: 100 epochs end-to-end.
- Sampling step $t = 32$; grid feature channel $C_g = 512$.
- During training, ground-truth grid coordinates are used for RoI Align in the merge module (teacher-forcing).
- NMS applied to starting point predictions (algorithm matches SEMv2).
Data
- ICDAR-2019 cTDaR Historical: 600 train / 150 test; publicly available from the ICDAR 2019 competition page. License unknown.
- WTW: 14,581 images; publicly available on GitHub. License unknown.
- iFLYTAB: 12,103 train / 5,188 test; introduced in the SEMv2 paper (arXiv:2405.11757) by iFLYTEK Research. Not independently released with SEMv3; public download availability is unclear. License unknown.
- SciTSR: 12,000 train / 3,000 test (SciTSR-COMP subset: 716 complex tables); publicly available under MIT license.
- PubTabNet: approximately 500k train / 9k validation; publicly available under CDLA-Permissive-1.0.
Evaluation
- Cell adjacency F1 at IoU 0.6 for ICDAR-2019 cTDaR, WTW, and iFLYTAB.
- Grid detection F1-G at IoU 0.9 for iFLYTAB (measures split stage independently).
- TEDS and TEDS-Struct for PubTabNet.
- Cell adjacency F1 for SciTSR-COMP.
- No error bars, confidence intervals, or multi-seed runs reported.
- LORE on WTW uses IoU 0.5 vs. SEMv3’s IoU 0.6; direct comparison is not fair and the paper acknowledges this.
- SciTSR rectification caveat acknowledged but not quantified.
Hardware
- Training: 4 Nvidia Tesla V100 GPUs, 24 GB each.
- Framework: PyTorch 1.7.1.
- Inference timing reported on SciTSR at 512×512; no absolute latency figures (only relative IS vs. KOR comparison from a chart).
- GPU-hours, memory at inference, and cost are not reported.
BibTeX
@inproceedings{qin2024semv3,
title={SEMv3: A Fast and Robust Approach to Table Separation Line Detection},
author={Qin, Chunxia and Zhang, Zhenrong and Hu, Pengfei and Liu, Chenyu and Ma, Jiefeng and Du, Jun},
booktitle={Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence},
pages={1191--1199},
year={2024},
url={https://www.ijcai.org/proceedings/2024/132}
}
SemiTabDETR: Semi-Supervised Table Detection with Dual Query Assignment
TL;DR
SemiTabDETR is a teacher-student semi-supervised table detector built on Deformable DETR. The key contribution is a dual query assignment strategy: a one-to-many branch generates high-quality pseudo-labels during early training, while a one-to-one branch eliminates duplicate predictions at inference without requiring NMS. With only 30% labeled data, the approach reports mAP 95.7% on TableBank-Word and 97.9% on PubLayNet.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper proposes a new semi-supervised detection framework. The headline contribution is the dual query assignment mechanism combined with a teacher-student training loop. Ablations and comparison tables occupy the majority of the experimental sections.
Secondary: None. No new datasets, benchmarks, or theoretical contributions are introduced.
What is the motivation?
Supervised table detection methods depend on large annotated datasets. Annotating document pages for table bounding boxes is expensive, domain-specific, and does not transfer well across publication types. Semi-supervised approaches address this by generating pseudo-labels for unlabeled images, but existing semi-supervised table detectors had two problems:
- CNN-based methods (Soft Teacher, STAC) require anchor generation and NMS, which adds complexity and limits end-to-end training.
- Prior transformer-based semi-supervised detectors (Shehzadi et al., ICDAR 2023) used a one-to-one bipartite matching strategy. When pseudo-labels are noisy, one-to-one matching forces each query to align tightly with a potentially incorrect label, propagating errors into subsequent training iterations.
The authors observe that one-to-many matching can generate multiple positive proposals per pseudo-label, improving robustness to label noise during the early stages of training, but introduces duplicate detections that require NMS at inference. The goal is to combine the noise-robustness of one-to-many matching with the NMS-free property of one-to-one matching.
What is the novelty?
SemiTabDETR wraps Deformable DETR in a teacher-student semi-supervised framework with two parallel query sets in the decoder:
- A primary set $Q = \{q_1, \ldots, q_N\}$ of $N = 30$ queries undergoes one-to-one bipartite matching via the Hungarian algorithm.
- A secondary set $\hat{Q} = \{\hat{q}_1, \ldots, \hat{q}_T\}$ of $T = 400$ queries undergoes one-to-many matching, where ground truth is replicated $K = 6$ times to form augmented targets $\hat{y}_g = \{y_g^1, \ldots, y_g^K\}$.
The per-decoder-layer losses for each strategy are:
$$\mathcal{L}_{o2o} = \sum_{l=1}^{L} \mathcal{L}_H(y_p, y_g)$$
$$\mathcal{L}_{o2m} = \sum_{l=1}^{L} \mathcal{L}_H(\hat{y}_p, \hat{y}_g)$$
where $\mathcal{L}_H$ is the Hungarian matching cost combining classification and regression terms.
Training proceeds in two stages. First, the one-to-many strategy runs during the early training iterations to produce high-quality pseudo-labels and accelerate convergence. Then the one-to-one strategy takes over to eliminate duplicates, so no NMS step is needed at inference. Both strategies use separate classification and regression losses:
$$\mathcal{L}_{o2m} = \mathcal{L}_{o2m}^{cls} + \mathcal{L}_{o2m}^{reg}$$
$$\mathcal{L}_{o2o} = \mathcal{L}_{o2o}^{cls} + \mathcal{L}_{o2o}^{reg}$$
The overall semi-supervised loss weights a supervised term on labeled images and an unsupervised term on unlabeled images:
$$\mathcal{L} = \mathcal{L}_s + \omega \mathcal{L}_u$$
The teacher module receives weakly augmented unlabeled images and generates pseudo-labels filtered by a confidence threshold of 0.7. The student receives strongly augmented versions of both labeled and unlabeled images. Teacher weights are updated via Exponential Moving Average (EMA) from the student.
The one-to-many pseudo-label selection is formulated as:
$$p_{o2m} = \left\{ \underset{\sigma_i \in \chi}{\text{argmin}} \sum_{i=1}^{N} \mathcal{L}_H\!\left(y_i^T, y_{\sigma_i(k)}^S\right) \right\}_{i=1}^{|y^T|}$$
where $\chi$ assigns a subset of student proposals to each teacher pseudo-label $y_i^T$, and $y_{\sigma_i(k)}^S$ is the matched student prediction.
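The teacher update mentioned above is the standard EMA rule; a minimal sketch, with a decay value assumed since the paper does not report one:

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               decay: float = 0.999) -> None:
    """teacher <- decay * teacher + (1 - decay) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)
```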
What experiments were performed?
Datasets: PubLayNet (table class only), TableBank (Word, LaTeX, and Both splits), and ICDAR-19 cTDaR Modern Track A. Experiments use 10%, 30%, and 50% labeled data with the remainder treated as unlabeled.
Metrics: mAP (IoU 0.50:0.95), AP$^{50}$, AP$^{75}$, and $AR_L$ (average recall for large objects). ICDAR-19 is also evaluated on Recall, Precision, and F1 at IoU thresholds of 0.8 and 0.9, following the competition protocol.
Baselines:
- Supervised: Faster R-CNN (Ren et al.) and Deformable DETR (Zhu et al.)
- Semi-supervised CNN-based: STAC (Sohn et al.), Unbiased Teacher (Liu et al.), Tang et al., Soft Teacher (Xu et al.)
- Semi-supervised transformer-based: Shehzadi et al. (ICDAR 2023, the direct predecessor)
- ICDAR-19 supervised: TableRadar, NLPR-PAL, Lenovo Ocean, CDeC-Net, HybridTabNet
Ablations: Number of one-to-many queries (200, 400, 600 evaluated; 400 is best), confidence threshold (0.5, 0.6, 0.7, 0.8; 0.7 is best), and comparison of o2o-only, o2m-only, and combined strategies with/without NMS and their training times.
Implementation: ResNet-50 backbone pretrained on ImageNet, 150 training epochs, learning rate decayed by 0.1 at epoch 140, input size 600 pixels (800 for comparisons), strong augmentation (flips, resize, patch removal, cropping, grayscale, Gaussian blur) for the student, weak augmentation (horizontal flip) for the teacher.
What are the outcomes/conclusions?
TableBank (Table 2): With 30% labeled data on TableBank-Both, SemiTabDETR achieves mAP 93.3, compared to 86.8 for the previous semi-supervised method (Shehzadi et al.) and 82.6 for the supervised Deformable DETR baseline. With only 10% labeled data, the gain over the prior semi-supervised method is 5.2 mAP points.
| Dataset | Labels | mAP | AP$^{50}$ |
|---|---|---|---|
| TableBank-Word | 30% | 95.7 | 96.9 |
| TableBank-LaTeX | 30% | 88.0 | 94.3 |
| TableBank-Both | 30% | 93.3 | 96.8 |
PubLayNet (Table 4): With 30% labeled data, mAP 97.9 vs. 90.3 for the prior semi-supervised method, a 7.6 point improvement. With 50% labels, the model’s AP$^{50}$ (98.9) matches or exceeds all listed supervised baselines.
ICDAR-19 (Table 7): With 50% labeled data, the method achieves recall 95.0 at IoU 0.8 but precision of only 61.4, yielding F1 74.5. This falls well below the best-reported supervised results on this benchmark (TableRadar F1 94.5, CDeC-Net F1 94.4). The authors highlight the recall advantage but do not address the precision gap. The ICDAR-19 results for this semi-supervised approach should be read with caution for applications where false positives are costly.
Ablation (Table 12): Using o2o-only takes 8.74 hours training time; o2m-only requires NMS and 11.33 hours; the combined approach takes 10.99 hours without NMS. The paper notes that Table 12 is run without augmented GT in the one-to-many branch, which is why the margin over o2o-only is small: the combined strategy achieves mAP 96.1 vs. 96.0 for o2o-only on PubLayNet with 10% labels. The primary advantage of the combined approach is the improved pseudo-label quality during early training rather than a large absolute mAP gain in this controlled ablation setting.
Limitations: No code or model weights were released. The ICDAR-19 results show high recall but poor precision compared to supervised methods, which is not addressed. Hardware details for training are not reported. Error bars and significance tests are absent throughout.
Reproducibility
Models
- Architecture: Deformable DETR with two parallel query sets (30 one-to-one, 400 one-to-many) fed to the same transformer decoder. ResNet-50 backbone. The number of decoder layers is not explicitly confirmed; Deformable DETR’s default of 6 is assumed.
- Parameter count: Not reported.
- Teacher/student relationship: Both use the same Deformable DETR architecture; the teacher’s weights are an EMA of the student’s weights and are not separately parameterized.
- Initialization: ResNet-50 pretrained on ImageNet.
- Weights: Not released. No code repository linked.
Algorithms
- Framework: Not specified; Deformable DETR implementations exist in PyTorch.
- Optimizer: Not stated.
- Initial learning rate: Not stated. Only the decay schedule is reported: multiplied by 0.1 at epoch 140 of 150 total epochs.
- Batch size: Not reported.
- EMA update: Used for the teacher module; decay factor not reported.
- Pseudo-label threshold: 0.7 (ablated in Table 10). Both mechanisms are sketched after this list.
- Loss weights: $\alpha_1 = 2$ (classification), $\alpha_2 = 5$ (regression); unsupervised weight $\omega$ not reported numerically.
- Augmentation: Strong (flips, resize, patch removal, crop, grayscale, Gaussian blur); weak (horizontal flip only).
- Query counts: 30 for one-to-one, 400 for one-to-many; ground truth replicated $K = 6$ times for augmented matching.
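To make the teacher-student mechanics concrete, here is a minimal PyTorch sketch of the EMA update and confidence filtering listed above. The decay value 0.999 is an assumption (the paper does not report it); the 0.7 threshold is the value ablated in Table 10.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999):
    # Teacher weights are an exponential moving average of the student's;
    # the decay value here is assumed, not taken from the paper.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(decay).add_(s_p, alpha=1.0 - decay)

def filter_pseudo_labels(scores: torch.Tensor, boxes: torch.Tensor, threshold: float = 0.7):
    # Keep only teacher detections confident enough to supervise the student.
    keep = scores > threshold
    return boxes[keep], scores[keep]
```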
Data
- Training: PubLayNet, TableBank (Word, LaTeX, Both), ICDAR-19 cTDaR Modern Track A. Evaluated at 10%, 30%, and 50% labeled splits; remainder treated as unlabeled.
- Preprocessing: Input resized to 600 pixels (longer side); 800 pixels for direct comparison experiments.
- Public availability: All three datasets are publicly available (PubLayNet: CDLA-Permissive; TableBank: Apache-2.0 code, research-only data; ICDAR-19: competition data, available from the challenge organizers).
- Split construction: The labeled/unlabeled partition is created by randomly sampling the specified percentage from the training set. How random seeds or stratification were handled is not described.
Evaluation
- Primary metric: mAP following COCO protocol (IoU 0.50:0.95) for TableBank and PubLayNet; Recall/Precision/F1 at IoU thresholds 0.8 and 0.9 for ICDAR-19, following the competition protocol.
- Baselines fairness: All semi-supervised methods use the same labeled/unlabeled splits. Supervised baselines are trained on the same labeled subset only.
- Statistical rigor: No error bars, confidence intervals, or multi-seed runs reported. Results are single-run point estimates throughout.
- Known limitation: ICDAR-19 precision (61.4 at IoU 0.8 with 50% labels) is substantially below the best-reported supervised results (~95%), despite high recall (95.0). The paper does not analyze this precision-recall imbalance.
- Scope restriction: All experiments use a ResNet-50 backbone only; generalization to stronger backbones (ResNet-101, ViT-based) is untested.
Hardware
- Training hardware: GPU type and count not reported.
- Wall-clock training time: Reported in Table 12 for the PubLayNet ablation (10% labeled data): o2o-only 8.74 hours, o2m-only 11.33 hours, combined 10.99 hours. These are wall-clock times on unspecified hardware and should not be taken as reproducible benchmarks.
- Inference speed: The combined strategy runs at 4.36 FPS (same as o2o-only and o2m-only variants, per Table 12), suggesting inference cost is dominated by the backbone and decoder rather than the matching strategy.
- Memory requirements: Not reported.
- Deployment feasibility: No discussion of inference-time requirements or deployment constraints.
BibTeX
@article{ehsan2024semitabdetr,
title = {End-to-end semi-supervised approach with modulated object queries for table detection in documents},
author = {Ehsan, Iqraa and Shehzadi, Tahira and Stricker, Didier and Afzal, Muhammad Zeshan},
journal = {International Journal on Document Analysis and Recognition (IJDAR)},
year = {2024},
doi = {10.1007/s10032-024-00471-0}
}
MuTabNet: Multi-Cell Decoder and Mutual Learning for Table Structure and Character Recognition
TL;DR
MuTabNet is an end-to-end table recognition model that improves upon prior Transformer-based approaches through two mechanisms: a multi-cell decoder that reads cell contents sequentially across multiple cells while attending to neighbors, and bidirectional mutual learning that trains the structure decoder in both left-to-right and right-to-left directions simultaneously. On PubTabNet, MuTabNet outperforms all prior methods, including OCR-assisted ones; on FinTabNet it leads all end-to-end models, trailing only the OCR-assisted VAST on total TEDS.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the paper’s headline contribution is an architectural modification to an existing end-to-end table recognition pipeline. The two proposed components (multi-cell decoder and bidirectional mutual learning) are architectural additions not present in prior work, and the paper is organized around ablation tables and SOTA comparisons that validate those additions.
Secondary: $\Psi_{\text{Evaluation}}$: the paper carefully measures performance across multiple table-length subsets (250+, 500+, 600+, 700+ tokens) to isolate how well the method handles long tables, and it explores sensitivity to local attention window sizes.
What is the motivation?
Extracting structured table content from document images is an important preprocessing step for LLM-based knowledge retrieval, but it remains difficult. Two classes of approaches exist: those that rely on an external OCR system to recognize cell text (which achieves high accuracy but depends on a separate pipeline) and end-to-end approaches that jointly recognize structure and content.
Recent end-to-end work (Ly and Takasu, ICDAR 2023) showed that joint learning can reach accuracy comparable to OCR-assisted models, and introduced local attention to handle tables with hundreds of cells. However, both the HTML structure decoder and the cell content decoder in these models have a directional limitation: the HTML decoder reads tokens strictly left to right, and each cell’s content is decoded independently without any context from neighboring cells. These design choices forgo potentially useful context, particularly for long or complex tables where surrounding cells can disambiguate ambiguous content.
What is the novelty?
MuTabNet makes two architectural additions to the prior Ly-Takasu end-to-end framework:
Multi-cell decoder. Instead of decoding each cell’s content independently, the proposed cell decoder reads all cell contents as a single sequence, separated by a special SEP token. The decoder receives a concatenation of character embeddings and HTML features extracted from the HTML decoder output, giving it access to structural context for each cell. Local attention (window size $w = 300$) handles the resulting long sequences. This means cell $n$ can attend to the decoded content of cells $1, \ldots, n-1$ within the attention window, allowing contextual correction of ambiguous characters.
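As a toy illustration of the resulting target layout (token names here are placeholders, not the paper's actual vocabulary):

```python
# All cell contents are flattened into one character stream with a separator
# token after each cell, so cell n can attend to previously decoded cells
# within the local attention window.
SEP = "<SEP>"  # placeholder for the paper's SEP token
cells = [["1", "2"], ["3", "."], ["4", "5"]]  # per-cell character lists
target = [tok for cell in cells for tok in cell + [SEP]]
assert target == ["1", "2", SEP, "3", ".", SEP, "4", "5", SEP]
```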
Bidirectional mutual learning. Inspired by deep mutual learning (Zhang et al., CVPR 2018), the HTML decoder is trained to predict structure in both left-to-right (LtoR) and right-to-left (RtoL) directions. To avoid doubling the parameter count, both directions share a single decoder; a one-hot vector specifying direction is appended to the embedded HTML tokens. The loss for the LtoR decoder combines standard cross-entropy against the ground truth with a KL divergence term that aligns its output distribution to the RtoL decoder’s predictions:
$$ \mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} p(\vec{x}_n) \log q(\vec{x}_n) + \frac{1}{N}\sum_{n=1}^{N} q(\overleftarrow{x}_n) \log \frac{q(\overleftarrow{x}_n)}{q(\vec{x}_n)}. $$
The symmetric loss for the RtoL decoder reverses the roles. This regularizes the structure decoder to be consistent regardless of reading direction.
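A minimal PyTorch sketch of the LtoR half of this loss, assuming logits of shape (batch, seq, vocab), targets of shape (batch, seq), and RtoL logits already re-aligned to LtoR token order. Detaching the opposing direction is an assumption; the paper does not specify whether gradients flow through it.

```python
import torch
import torch.nn.functional as F

def bml_loss_ltor(logits_ltor, logits_rtol, targets):
    # Cross-entropy against ground truth for the LtoR direction ...
    ce = F.cross_entropy(logits_ltor.transpose(1, 2), targets)
    # ... plus KL(q_rtol || q_ltor), pulling the LtoR distribution toward
    # the RtoL decoder's predictions. The symmetric RtoL loss swaps the roles.
    log_q_ltor = F.log_softmax(logits_ltor, dim=-1)
    q_rtol = F.softmax(logits_rtol.detach(), dim=-1)
    kl = F.kl_div(log_q_ltor, q_rtol, reduction="batchmean")
    return ce + kl
```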
The encoder is a TableResNetExtra backbone (26 convolutional layers, three Global Context Attention blocks) that encodes a $520 \times 520$ input to $65 \times 65$ feature maps with 512 channels, followed by 2D positional encoding and flattening for cross-attention.
What experiments were performed?
Datasets. Two large public benchmarks are used, plus an ablation subset:
- FinTabNet (112k tables from S&P 500 annual reports; evaluation on the 10,656-image validation set following prior convention)
- PubTabNet (568k tables from PubMed Central; validation set and the held-out ICDAR 2021 competition evaluation set)
- PubTabNet250 (subset of PubTabNet with 250+ HTML tokens; used for ablations to reduce training time from ~179 hours to ~45 hours)
Baselines. Comparisons include EDD, GTE, TableFormer, VAST (an OCR-assisted model), and the direct predecessor series from Ly and Takasu (weakly supervised, multi-task, and local-attention variants).
Metric. Tree Edit Distance Similarity (TEDS), defined as:
$$ \text{TEDS}(T_a, T_b) = 1 - \frac{\text{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}, $$
computed in two variants: structural TEDS (excluding cell text) and total TEDS (including cell text). Results are broken out by simple vs. complex tables on PubTabNet.
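Once a tree edit distance is available, the metric itself is a one-liner. In the sketch below, `tree_edit_distance` and `num_nodes` are stand-in callables (reference TEDS implementations use the APTED algorithm on the HTML trees):

```python
def teds(t_a, t_b, tree_edit_distance, num_nodes) -> float:
    # Direct transcription of the formula above; |T| is the node count.
    return 1.0 - tree_edit_distance(t_a, t_b) / max(num_nodes(t_a), num_nodes(t_b))
```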
Ablation. Table 5 isolates the contribution of local attention (LA), multi-cell decoder (MC), and bidirectional mutual learning (BML) across four token-length subsets. Table 6 sweeps local attention window sizes (100, 200, 300, 400, 500) for the cell decoder.
Training. Four NVIDIA V100 GPUs, batch size 8, Ranger optimizer, learning rate 0.001 for 25 epochs then stepped down to 0.0001 and 0.00001. No data augmentation, no ensemble, no early stopping.
What are the outcomes/conclusions?
FinTabNet. MuTabNet achieves 98.87% structural TEDS (best among all models) and 97.69% total TEDS. The total TEDS is below VAST’s 98.21%, which the authors attribute to VAST’s use of an external OCR system for cell content. Among end-to-end models, MuTabNet leads on both metrics.
PubTabNet validation. MuTabNet scores 98.16% (simple), 95.53% (complex), and 96.87% (total) TEDS, outperforming all prior methods including those using external OCR. On the ICDAR 2021 evaluation set, MuTabNet achieves 98.01% (simple), 94.98% (complex), and 96.53% (total) TEDS; the authors report this as the highest single-model score among submissions without ensemble methods or additional annotation.
Ablation findings. Adding the MC decoder to the LA baseline produces a large jump in total TEDS across all table lengths. BML then adds a smaller but consistent improvement, especially visible in total TEDS scores (e.g., 95.02% to 95.81% at the 250+ token threshold). The effect of BML on structural TEDS alone is modest, suggesting it primarily helps cell content recognition by improving the implicit structural representations passed to the cell decoder. A window size of 300 is optimal for the cell decoder in most cases; very long tables (500+ tokens) slightly prefer a smaller window of 100.
Limitations the authors acknowledge. The paper notes that future work should extend toward understanding the semantic meaning of table contents, not just their structure and character-level text. There is no discussion of inference latency per table, runtime memory, or behavior on non-English or handwritten tables.
Reproducibility
Models
- Architecture: TableResNetExtra encoder (26 conv layers, 3 GCA blocks); HTML decoder (1 embedding layer + 3 local-attention blocks + 2 output layers); cell decoder (1 embedding layer + 1 local-attention block + output layer). All attention blocks use 8 heads and 512 channels.
- No pretrained weights are released; no model card is provided.
- Input size: $520 \times 520$ pixels. Feature map: $65 \times 65 \times 512$.
Algorithms
- Optimizer: Ranger (a combination of RAdam and LookAhead)
- Learning rate schedule: 0.001 for epochs 1-25, 0.0001 for epochs 26-28, 0.00001 for epochs 29-30
- Batch size: 8 across 4 GPUs
- Local attention window: 300 for both HTML and cell decoders (default)
- Maximum HTML token sequence length: 800; maximum cell content sequence length: 8000
- Decoding: greedy search
- No data augmentation; no early stopping; no ensembling
Data
- FinTabNet: publicly available, extracted from S&P 500 annual reports. License: see the GTE paper (Zheng et al., WACV 2021).
- PubTabNet: publicly available from IBM Research, derived from PubMed Central Open Access. License: Community Data License Agreement - Permissive (CDLA-Permissive).
- PubTabNet250: a filtered subset created by Ly and Takasu; not separately distributed; reproducible by filtering PubTabNet for tables with 250+ HTML tokens.
Evaluation
- TEDS metric computed on complete HTML trees (including cell text) and on structure-only trees.
- PubTabNet simple/complex split follows Zhong et al. (ECCV 2020): simple means no merged cells.
- FinTabNet evaluation set follows prior convention (using the labeled “validation” split of 10,656 images).
- No error bars or significance tests are reported; single training run per configuration.
- Ablations use PubTabNet250 (a subset), so absolute scores there are not directly comparable to full-dataset results.
Hardware
- Training: 4 NVIDIA V100 GPUs
- Full PubTabNet training time: approximately 179 hours (as implied by the comparison; ablations use PubTabNet250 at approximately 45 hours per model)
- Inference: 3.78 hours on 4 GPUs for the FinTabNet evaluation set; 3.23 hours for PubTabNet validation; 3.13 hours for PubTabNet evaluation. Per-table inference speed is not reported.
- No code repository is released with the paper.
BibTeX
@inproceedings{kawakatsu2024mutabnet,
title={Multi-Cell Decoder and Mutual Learning for Table Structure and Character Recognition},
author={Kawakatsu, Takaya},
booktitle={International Conference on Document Analysis and Recognition (ICDAR)},
year={2024}
}
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
TL;DR
RoDLA is the first systematic robustness benchmark for Document Layout Analysis (DLA). It applies 12 document-specific perturbation types at 3 severity levels to three established DLA datasets, yielding approximately 450K perturbed images. The paper also proposes two new metrics: Mean Perturbation Effect (mPE) to quantify perturbation difficulty independently of any single model, and Mean Robustness Degradation (mRD) to measure how much a model degrades relative to perturbation difficulty. Finally, the authors propose the Robust Document Layout Analyzer (RoDLA), an attention-enhanced DINO-based detector that achieves the best mRD scores across all three benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. The paper’s primary contribution is a robustness benchmarking framework: a perturbation taxonomy, two new evaluation metrics (mPE and mRD), and a suite of three perturbed datasets covering over 10 DLA methods. The framing, the bulk of the paper, and the conclusions all revolve around measurement methodology and systematic evaluation of existing DLA approaches under realistic corruptions.
Secondary: $\Psi_{\text{Method}}$ (the RoDLA model introduces channel-wise attention and adaptive average pooling within a DINO encoder, with ablations showing measurable gains in robustness metrics); $\Psi_{\text{Resource}}$ (the benchmark datasets PubLayNet-P, DocLayNet-P, and M$^6$Doc-P are reusable evaluation artifacts released to the community).
What is the motivation?
Document Layout Analysis is a foundational step in document understanding pipelines. As DLA moves from clean, electronically-generated documents toward photos and scans of physical documents, image quality variations become unavoidable. Factors such as uneven lighting, camera motion, printing artifacts, and paper texture introduce corruptions that differ qualitatively from the distribution assumed during training.
Despite this practical importance, robustness had not been systematically studied for DLA. Prior DLA benchmarks (PubLayNet, DocLayNet, M$^6$Doc) simply collect clean document images without analyzing or controlling for perturbations. Robustness benchmarking exists for image classification (ImageNet-C) and semantic segmentation, but these use perturbations tailored to natural images that do not match the failure modes of document processing. The authors observe that existing DLA models can drop by over 90% in mAP when moving from mild to severe perturbations (e.g., SwinDocSegmenter drops from 93.7% to near zero under strong texture noise), motivating a dedicated benchmark.
What is the novelty?
Perturbation taxonomy. The paper introduces a hierarchical taxonomy of 12 perturbation types grouped into 5 categories, each with 3 severity levels (36 severity configurations total):
- Spatial transformation: Rotation (P1), Warping (P2), Keystoning (P3)
- Content interference: Watermark (P4), Complex Background (P5)
- Inconsistency distortion: Non-uniform Illumination (P6), Ink-bleeding (P7), Ink-holdout (P8)
- Blur: Defocus (P9), Vibration/motion blur (P10)
- Noise: Speckle/blotch noise (P11), Fibrous texture noise (P12)
All perturbations are parametrized mathematically, with severity levels calibrated using image quality assessment (IQA) metrics.
Mean Perturbation Effect (mPE). Previous benchmarks measure perturbation difficulty via a pre-selected baseline model’s degradation, which conflates the model’s inherent robustness with the perturbation’s severity. mPE addresses this by combining IQA metrics (MS-SSIM and CW-SSIM) with baseline model degradation, averaging over severity levels:
$$ \text{mPE}_p = \frac{1}{NMK} \sum_{s=1}^{N} \left( \sum_{i=1}^{M} f_{s,p}^i + \sum_{g=1}^{K} D_{s,p}^g \right) $$
Here $f_{s,p}^i$ are IQA scores and $D_{s,p}^g$ is the degradation of reference model $g$ under perturbation $p$ at severity $s$. Using multiple IQA methods and averaging over models reduces sensitivity to any individual model’s variability.
Mean Robustness Degradation (mRD). Building on mPE, mRD normalizes a model’s degradation under each perturbation by that perturbation’s inherent difficulty:
$$ \text{RD}_p = \frac{1}{N} \sum_{s=1}^{N} \frac{D_{s,p}^g}{\text{mPE}_{s,p}} $$
The overall mRD is the average of $\text{RD}_p$ across all 12 perturbation types. A value above 100 means the model degrades more than expected given the perturbation’s difficulty; below 100 means better-than-expected robustness. Lower is better.
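A small NumPy sketch of the computation, assuming the degradation matrix D[p, s] and the mPE values share the same percentage scale, so a ratio of 1 maps to a score of 100:

```python
import numpy as np

def mean_robustness_degradation(D: np.ndarray, mpe: np.ndarray):
    # D[p, s]: a model's degradation under perturbation p at severity s.
    # mpe[p, s]: that perturbation/severity's inherent difficulty.
    rd_p = 100.0 * (D / mpe).mean(axis=1)  # average over severity levels
    return rd_p, rd_p.mean()               # per-perturbation RD and overall mRD
```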
RoDLA model. To improve robustness, the authors modify the DINO detector’s encoder with two additions. First, a Channel-wise Attention (CA) module applies self-attention across the channel dimension rather than the spatial dimension, allowing the model to group correlated feature channels while filtering out perturbation-driven spatial noise:
$$ Y = \sigma\!\left(\frac{\text{Softmax}(Q) \cdot \text{Softmax}(K^T)}{\sqrt{d}} \cdot \text{MLP}(V)\right), \quad Q,K,V \in \mathbb{R}^{d \times n} $$
Second, spatially dilated average pooling layers are inserted into the encoder via a learned Dilation Predictor (two 3x3 convolution layers). These pooling layers incorporate neighborhood context to reduce overemphasis on perturbed tokens, dynamically constraining long-range attention.
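A hedged PyTorch sketch of the channel-wise attention block; projection and MLP shapes are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, dim)
        q = self.q(x).transpose(1, 2).softmax(dim=-1)     # (b, d, n), softmax over tokens
        k_t = self.k(x).softmax(dim=-1)                   # (b, n, d), softmax over channels
        v = self.v_mlp(x).transpose(1, 2)                 # (b, d, n)
        attn = (q @ k_t) * self.scale                     # (b, d, d) channel-to-channel map
        y = torch.sigmoid(attn @ v)                       # attention across channels, not space
        return y.transpose(1, 2)                          # back to (batch, n, dim)
```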
What experiments were performed?
Datasets. Three perturbed benchmark suites are constructed:
- PubLayNet-P: derived from PubLayNet’s ~360K scientific document pages (5 layout classes: text, title, list, table, figure)
- DocLayNet-P: derived from DocLayNet’s ~80K pages across diverse document types (11 layout classes)
- M$^6$Doc-P: derived from M$^6$Doc’s ~9K pages with fine-grained annotations (74 classes)
Baselines. More than 10 methods are benchmarked, covering CNN-based detectors (Faster R-CNN, Mask R-CNN, Cascade R-CNN with ResNet/InternImage backbone), Transformer-based methods (DocSegTr, DINO, Co-DINO), multimodal methods (DiT+Cascade R-CNN, LayoutLMv3+Cascade R-CNN), and single-modal specialist methods (LayoutParser, SwinDocSegmenter). All models are trained and validated on clean data only; perturbations are applied at test time only.
Training setup. All models are trained with AdamW (lr = 2e-4, weight decay = 1e-4) for 24 epochs with a step-based scheduler, batch size 2 per GPU, on 4x A100 (40 GB each). Data augmentation includes random horizontal flips and crops (384x600).
Metrics. mAP on clean data, P-Avg (average mAP across all 12 perturbation types and 3 severity levels), and mRD (lower is better).
Ablations. The authors study:
- Backbone choice (Swin vs. InternImage) under both Cascade R-CNN and RoDLA
- Component placement of CA and average pooling layers (encoder vs. decoder, various combinations)
- Combined perturbation analysis (stacking up to 5 perturbations simultaneously on M$^6$Doc-P)
- Two-stage pipeline comparison (document rectification followed by DLA vs. one-stage RoDLA)
What are the outcomes/conclusions?
Benchmark findings. Existing DLA models degrade substantially under document perturbations, with large variance across perturbation types and models. Spatial perturbations (especially rotation) cause the largest drops. Models that are strong on clean data are not necessarily more robust: SwinDocSegmenter achieves 93.7% clean mAP but drops to near zero on texture noise (mRD 214.4), while Faster R-CNN shows worse clean mAP but relatively better mRD (175.5). Extra pre-training on document-specific data (DiT and LayoutLMv3 on IIT-CDIP) provides measurable robustness benefits for noise and content perturbations.
RoDLA results. On PubLayNet: 96.0% clean mAP (best), 70.0% P-Avg (best), 116.0 mRD (best). On DocLayNet-P: 80.5% clean mAP (best), 135.4 mRD (best). On M$^6$Doc-P: 70.0% clean mAP (best), 150.4 mRD (best). The clean mAP gains over Faster R-CNN are +5.8%, +7.1%, and +12.1% on the three datasets respectively.
Ablation results. The optimal RoDLA configuration places CA in the encoder and average pooling in the encoder. This achieves 70.0% P-Avg and 115.7 mRD on PubLayNet-P while using fewer parameters (323.2M) than many ablation variants. Swapping to a Swin backbone reduces performance by about 8 mRD units but shows that RoDLA’s design largely closes the gap between backbone choices.
Two-stage vs. one-stage. A pre-processing rectification stage (DocTr) followed by Faster R-CNN achieves 41.2% P-Avg on PubLayNet-P, compared to 68.8% for Faster R-CNN alone and 70.8% for RoDLA. Rectification models trained for general document dewarping introduce their own artifacts that hurt DLA performance.
Limitations. The benchmark does not yet cover content tampering or replacement. Only 3 severity levels are defined, which may be too coarse for fine-grained analysis. Perturbations are applied individually rather than in combination (except the supplementary multi-perturbation analysis). The mPE metric uses only two IQA methods and a single reference model (Faster R-CNN), which could be extended for greater robustness.
Reproducibility
Models
- RoDLA architecture: DINO detector (end-to-end object detection with denoising anchor boxes) modified with Channel-wise Attention in the encoder and dilated average pooling layers gated by a Dilation Predictor (two 3x3 conv layers). Backbone: InternImage pretrained on ImageNet-22K.
- Parameter count: approximately 323.2M (optimal configuration with CA Encoder + APL Encoder).
- No pretrained weights appear to be released as of the paper submission.
Algorithms
- Optimizer: AdamW, lr = 2e-4, weight decay = 1e-4
- Scheduler: step-based, with LR decay steps at epochs {16, 22} and linear warm-up (ratio 1e-3)
- Training epochs: 24
- Batch size: 2 per GPU
- Data augmentation: random horizontal flip (p=0.5), random crop to (384, 600)
- Framework: MMDetection
- All baselines reproduced under identical training settings for fair comparison
Data
- PubLayNet-P: derived from PubLayNet (~360K scientific article images from PubMed Central); perturbations generated programmatically at test time from the original clean test set
- DocLayNet-P: derived from DocLayNet (~80K pages; diverse document types including academic papers, brochures, letters)
- M$^6$Doc-P: derived from M$^6$Doc (~9K pages; 74 layout classes)
- The perturbation code for generating the benchmark does not appear to be publicly released as of the paper
- Benchmark images total approximately 450K across three datasets and all perturbation settings
Evaluation
- mAP: standard COCO-style mean average precision across layout classes
- P-Avg: average mAP across 12 perturbation types and 3 severity levels (36 configurations)
- mRD: normalized robustness degradation metric; lower is better; values above 100 indicate worse-than-expected degradation
- mPE: perturbation severity metric computed from MS-SSIM, CW-SSIM, and Faster R-CNN degradation; intended as a model-agnostic proxy for perturbation difficulty (see the baseline-dependence caveat below)
- Baselines include both visual-only and multimodal methods; multimodal methods use extra document pretraining (IIT-CDIP)
- The authors acknowledge that mRD relies on Faster R-CNN as the reference model in mPE, introducing some baseline dependence
Hardware
- Training hardware: 4x NVIDIA A100 (40 GB each), 300 GB CPU memory per node
- 38 models total were trained for the benchmark: 24 for PubLayNet (including ablations), 7 each for DocLayNet and M$^6$Doc
- Compute cost not explicitly reported
- Inference hardware requirements not specified
BibTeX
@inproceedings{chen2024rodla,
title={RoDLA: Benchmarking the Robustness of Document Layout Analysis Models},
author={Chen, Yufan and Zhang, Jiaming and Peng, Kunyu and Zheng, Junwei and Liu, Ruiping and Torr, Philip and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
UniTable: Towards a Unified Framework for Table Structure Recognition
TL;DR
UniTable introduces two complementary ideas for table recognition (TR). First, a self-supervised pretraining (SSP) stage trains the visual encoder on up to 2 million unannotated tabular images via VQ-VAE masked image modeling, providing structured visual representations before any supervised data is seen. Second, all three TR tasks (table structure, cell bounding boxes, and cell content) are unified under a single language modeling objective, with a shared encoder and task decoder that operates on raw pixel inputs alone, without requiring an external PDF or OCR system. On SynthTabNet, removing SSP costs an average of 14.4 S-TEDS percentage points, and UniTable achieves SOTA on four of five major TR benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ The primary contribution is the UniTable training framework: the SSP stage with VQ-VAE masked image modeling and the unified language modeling objective for all TR tasks. Ablations, scaling experiments, and cross-architecture validation are all in service of establishing the framework’s design choices.
Secondary: None. No new dataset is released; the paper evaluates on existing public benchmarks.
What is the motivation?
Table recognition in document processing involves three interrelated subtasks: predicting the table structure (as an HTML or similar token sequence), localizing cell bounding boxes, and reading cell content. Prior work treated these as separate problems with task-specific decoders and often required external inputs such as PDF text layers or separate OCR systems. The result is fragmented pipelines that are harder to train, maintain, and generalize.
A second problem is initialization quality. Supervised TR datasets range from roughly 100k to 500k annotated examples, but unlabeled tables are abundant in documents. Standard supervised training from random or ImageNet-pretrained weights does not leverage this readily available unlabeled data, and performance on diverse table styles (colorful backgrounds, sparse content, financial tables) is correspondingly inconsistent.
UniTable addresses both problems together: a unified training objective eliminates task-specific decoder design, and SSP on unannotated tabular images provides a domain-adapted initialization before any supervised training begins.
What is the novelty?
Overall Architecture
UniTable has two main components: a visual encoder and a shared task decoder.
Visual encoder: A linear projection Transformer (ViT-style). Given input image $I \in \mathbb{R}^{H \times W \times C}$, it is divided into $N = HW/P^2$ non-overlapping patches of size $P \times P$. With $P = 16$ and $I \in \mathbb{R}^{448 \times 448 \times 3}$, the sequence length is $28 \times 28 = 784$ patch tokens.
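A quick check of the token arithmetic, using a strided convolution as the standard ViT-style patch projection (hidden size 512 matches UniTable-base):

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(in_channels=3, out_channels=512, kernel_size=16, stride=16)
image = torch.randn(1, 3, 448, 448)
tokens = patch_embed(image).flatten(2).transpose(1, 2)  # 28 x 28 patch grid
assert tokens.shape == (1, 784, 512)                    # 784 patch tokens
```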
Task decoder: A 4-layer Transformer decoder shared across all TR tasks. At inference, the decoder generates token sequences autoregressively using greedy decoding. Maximum sequence lengths are 512 for table structure, 1024 for cell bounding boxes, and 200 for cell content.
Two model sizes are reported:
| Variant | Params | Encoder layers | Heads | Hidden size |
|---|---|---|---|---|
| UniTable-base | 30M | 4 | 8 | 512 |
| UniTable-large | 125M | 12 | 12 | 768 |
Self-Supervised Pretraining (SSP)
Before supervised finetuning, the visual encoder is pretrained via masked image modeling using a VQ-VAE visual codebook.
Visual codebook. A VQ-VAE is trained to represent any tabular image as a sequence of discrete tokens drawn from a codebook $Z \in \mathbb{R}^{K \times D}$. The encoder $q_\phi(z \mid I)$ maps the image to discrete tokens $z$; the decoder $p_\psi(I \mid z)$ reconstructs the image. The VQ-VAE training objective is:
$$ \max_{\phi, \psi} \; \mathbb{E}_{z \sim q_\phi(z \mid I)}\bigl[\log p_\psi(I \mid z)\bigr] $$
Because sampling discrete tokens is not differentiable, Gumbel-Softmax reparameterization is used during training. Two codebook sizes are evaluated: $K = 8{,}192$ for a 1M-image VQ-VAE and $K = 16{,}384$ for a 2M-image VQ-VAE.
Pretraining objective. Approximately 40% of patch tokens are randomly masked. The visual encoder is trained to predict the VQ-VAE codebook index of each masked patch, maximizing the log-likelihood of the masked visual tokens given the unmasked context. This is analogous to masked language modeling but applied to table images.
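A hedged sketch of this objective; `encoder` and `vqvae_tokenize` are stand-ins whose interfaces are assumptions here, not the released implementation:

```python
import torch
import torch.nn.functional as F

def ssp_loss(encoder, vqvae_tokenize, patches, mask_ratio: float = 0.4):
    # patches: (batch, n_patches, patch_dim). Roughly 40% of patch tokens are
    # masked; the encoder must predict each masked patch's codebook index.
    batch, n, _ = patches.shape
    mask = torch.rand(batch, n) < mask_ratio           # True = masked position
    targets = vqvae_tokenize(patches)                  # (batch, n) codebook indices
    logits = encoder(patches, mask)                    # (batch, n, codebook_size)
    return F.cross_entropy(logits[mask], targets[mask])
```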
Unified Finetuning Objective
During supervised finetuning, all three TR tasks are formulated as next-token prediction over vocabulary sequences. The table structure task predicts HTML tokens; the cell bbox task predicts coordinate tokens; and the cell content task predicts text tokens. A single cross-entropy loss applies across all tasks without any task-specific auxiliary objectives or detection heads. The input is always a raw pixel image; no external PDF or OCR output is required.
Prior pipelines used a detection head (Faster R-CNN or DETR) for bounding boxes alongside a separate sequence decoder for structure, and relied on PDF-extracted text for cell content. UniTable replaces all of these with a single language modeling objective over raw pixel inputs.
Architecture Generality
The unified language modeling objective also works with a hybrid CNN-Transformer encoder (replacing the linear projection with ResNet-18), yielding results roughly comparable to the linear projection model pretrained on 1M images. The authors recommend the linear projection variant for three reasons: (1) it benefits from SSP, (2) it aligns with modern VLM architectures that share the same design, and (3) the hybrid variant does not surpass SSP-2M linear projection despite its larger parameter count.
What experiments were performed?
Datasets
- ICDAR 2019 B2 Modern (IC19B2M): 100 modern-subset tables; evaluation by weighted average F1 (WAvg. F1) over IoU thresholds $\{0.6, 0.7, 0.8, 0.9\}$:
$$ \text{WAvg. F1} = \frac{\sum_{i=1}^{4} \text{IoU}_{i} \times \text{F1@IoU}_{i}}{\sum_{i=1}^{4} \text{IoU}_{i}} $$
- PubTabNet: 509k scientific article tables; three metrics: COCO $\text{AP}_{50}$ (cell bbox), S-TEDS (structure), and TEDS (structure + content).
- FinTabNet: 113k S&P 500 financial report tables; evaluated on S-TEDS.
- SynthTabNet: 600k synthetic tables across four style subsets (Finance, PubTabNet, Marketing, Sparse); $\text{AP}_{50}$ and S-TEDS.
- PubTables-1M: 947k tables with word-level bbox annotations; evaluated on COCO $\text{AP}_{50}$ and $\text{AP}_{75}$.
Baselines
EDD, TableFormer, VAST, GTE, DRCC, and a DETR baseline from the PubTables-1M authors. General-purpose VLMs (GPT-4V, LLaVA-v1.6 in multiple sizes) are also evaluated on SynthTabNet in Appendix A.
SSP Ablation (Table 2)
S-TEDS on all four SynthTabNet subsets (base and large models):
| Configuration | Finance (Base/Large) | Marketing (Base/Large) |
|---|---|---|
| No SSP | 88.95 / 90.75 | 68.05 / 70.60 |
| SSP 1M | 98.73 / 99.56 | 95.14 / 99.05 |
| SSP 2M | 99.41 / 99.58 | 98.35 / 99.08 |
SSP 1M improves over no SSP by an average of 14.40 percentage points (pp); SSP 2M adds a further 0.74 pp. Without SSP, scaling from base to large barely helps; with SSP, large models consistently outperform base by an average of 1.51 pp.
Implementation
- All models trained with AdamW, cosine LR schedule with linear warmup, 24 epochs.
- Teacher forcing during training; greedy decoding at inference.
- VQ-VAE pretrained on 1M (PubTabNet + SynthTabNet) or 2M (+ PubTables-1M + TableBank) unannotated tabular images.
- Finetuning: table structure trained on PubTabNet + FinTabNet + SynthTabNet; cell bbox on PubTabNet + SynthTabNet; cell content on PubTabNet + SynthTabNet + PubTables-1M.
What are the outcomes/conclusions?
IC19B2M: UniTable improves WAvg. F1 over the prior best result, held by GTE; the authors report the margin as large (see Table 1 of the paper).
PubTabNet: UniTable-base and UniTable-large both surpass VAST on $\text{AP}_{50}$ by more than 3 pp. VAST required a task-specific visual-alignment auxiliary loss; UniTable achieves this with the unified language modeling objective alone.
FinTabNet: New SOTA on S-TEDS; performance scales from base to large, confirming the framework benefits from additional model capacity.
SynthTabNet: SOTA on both $\text{AP}_{50}$ and S-TEDS; SSP is the largest single contributor to performance on the more challenging Marketing and Sparse subsets.
PubTables-1M: Results are competitive with the DETR baseline from the dataset creators, but the authors identify previously unacknowledged annotation inconsistencies that may inflate DETR’s reported score (detailed in Appendix F.1).
VLM comparison (Appendix A): LLaVA-v1.6 variants score between 32% and 50% S-TEDS on SynthTabNet; GPT-4V reaches approximately 64-69% on the Finance and PubTabNet subsets. UniTable-large with SSP 2M scores above 99% on three of four subsets, illustrating the gap between specialist TR models and general VLMs on structured recognition tasks.
Limitations acknowledged: The paper is a preprint and has not undergone peer review at the time of this writing. PubTables-1M evaluation is complicated by discovered annotation issues. No latency or throughput measurements are reported. The cross-dataset finetuning setup (combining multiple training sets) complicates direct comparison with baselines trained on single datasets. Learning rate, batch size, and exact hardware are not stated in the main paper.
Unacknowledged gaps: No multi-seed error bars or confidence intervals. No comparison with methods that use OCR-augmented inputs on a common benchmark. The VQ-VAE training details (optimizer, LR, hardware, training time) are not disclosed.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | CC-BY-4.0 | arXiv 2403.04822 |
| UniTable Code | Code | MIT | GitHub |
| UniTable Demo | Demo | MIT | magic-table |
| PubTabNet | Dataset | CDLA-Permissive-1.0 | GitHub |
| FinTabNet | Dataset | CDLA-Permissive-1.0 | IBM Research |
| SynthTabNet | Dataset | CDLA-Permissive-1.0 | GitHub |
| PubTables-1M | Dataset | MIT | GitHub |
Models
- Visual encoder: linear projection Transformer (ViT-style), patch size $P = 16$, input resolution $448 \times 448$.
- Task decoder: 4 Transformer decoder layers, shared across all three TR tasks.
- UniTable-base: 30M parameters; 4 encoder layers, 8 attention heads, hidden size 512.
- UniTable-large: 125M parameters; 12 encoder layers, 12 attention heads, hidden size 768.
- VQ-VAE codebook: $K = 8{,}192$ entries for 1M pretraining; $K = 16{,}384$ for 2M.
- Model weights: the GitHub repository exists; availability of pretrained checkpoints should be verified directly. A finetuned demo model is accessible via the HuggingFace-hosted public API.
Algorithms
- Optimizer: AdamW.
- LR schedule: cosine with linear warmup (learning rate value not stated in main paper).
- Training duration: 24 epochs for all supervised finetuning experiments.
- Teacher forcing during training; greedy decoding at inference.
- Masking ratio for SSP: approximately 40%.
- Gumbel-Softmax reparameterization for VQ-VAE token sampling.
- VQ-VAE training details (LR, batch size, hardware, epochs) are not reported.
Data
- VQ-VAE 1M pretraining: PubTabNet + SynthTabNet (unannotated tabular images).
- VQ-VAE 2M pretraining: + PubTables-1M + TableBank.
- Supervised finetuning data is task-dependent (see main results section).
- All five evaluation datasets are publicly available; licenses vary (see table above).
- TableBank license: see the original repository; not confirmed in this note.
Evaluation
- Cell adjacency relation (CAR) F1 at multiple IoU thresholds (IC19B2M); COCO $\text{AP}_{50}$/$\text{AP}_{75}$ (cell bbox); S-TEDS and TEDS (structure and content).
- WAvg. F1 is a weighted average over four IoU thresholds as defined in the ICDAR 2019 competition.
- No error bars, significance tests, or multi-seed runs reported.
- Baselines are from published papers; not all use the same training data as UniTable, so comparisons may not reflect equal supervision.
Hardware
- Training hardware, GPU count, and training time are not reported in the main paper.
- Inference hardware and latency are not reported.
- No cost estimates or deployment considerations are provided.
BibTeX
@article{peng2024unitable,
title={UniTable: Towards a Unified Framework for Table Recognition via Self-Supervised Pretraining},
author={Peng, ShengYun and Chakravarthy, Aishwarya and Lee, Seongmin and Wang, Xiaojing and Balasubramaniyan, Rajarajeswari and Chau, Duen Horng},
journal={arXiv preprint arXiv:2403.04822},
year={2024}
}
ClusterTabNet: Supervised Clustering for Table Detection and Structure Recognition
TL;DR
ClusterTabNet reframes table detection and table structure recognition as a supervised clustering problem over OCR word tokens. A compact transformer encoder predicts adjacency matrices indicating which word pairs belong to the same table, row, column, cell, or header. The resulting model has roughly 5M parameters (excluding embeddings), achieves detection accuracy on par with or better than DETR on PubTables-1M, and is inherently robust to rotated or skewed documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The paper proposes a new architecture and training formulation for TD and TSR. The primary contribution is the supervised clustering objective applied to transformer encoders, with an ablation study over model variants and comparisons against DETR and Faster R-CNN baselines.
Secondary: $\Psi_{\text{Evaluation}}$: The paper includes careful discussion of metric comparability, notes reproducibility challenges with the DETR baseline checkpoint versus retrained weights, and cross-dataset evaluation on PubTables-1M, FinTabNet, PubTabNet, and ICDAR-2019 cTDaR.
What is the motivation?
Standard image-based table detection and structure recognition approaches treat these tasks as object detection problems, applying models like DETR or Faster R-CNN to raw document images. This has several practical drawbacks. Large vision models are expensive to run, often requiring high-resolution inputs. They must implicitly re-learn OCR to identify text cues like “Table” captions. And bounding-box outputs break down on rotated or skewed scans, where axis-aligned boxes do not represent the actual table region.
Multi-modal approaches inspired by LayoutLM take OCR word boxes as input but are primarily designed for classification tasks rather than clustering tasks like table structure recovery.
The authors observe that both detection (grouping words into tables) and structure recognition (partitioning words into rows, columns, cells, and headers) are fundamentally clustering problems: there is no intrinsic feature distinguishing one table from another on the same page, so supervised clustering (learning cluster membership from labeled examples) is the right abstraction.
What is the novelty?
Supervised clustering via adjacency matrices. Given $n$ words on a page produced by an OCR engine, ClusterTabNet predicts a binary $n \times n$ adjacency matrix per clustering target. Position $(i, j)$ equals 1 if and only if words $i$ and $j$ belong to the same cluster (table, cell, row, column, or header). Clusters correspond to connected components of the resulting graph, and connected-components post-processing enforces transitivity to correct isolated missed edges.
Symmetric vs. directed adjacency for spanning cells. For table membership and cell membership, the adjacency relation is symmetric. For rows, columns, and headers, the paper uses directed (asymmetric) adjacency to represent spanning cells: a word in a spanning cell points to all words in the rows it spans, but those words do not point back to it, enabling clean recovery of the spanning structure.
Architecture. The model uses a standard 4-layer transformer encoder with $d_{\text{model}} = 256$ and $d_{\text{ff}} = 1024$. Each OCR word is represented by summing:
- Learned embeddings of its normalized word content (vocabulary of 30,015 entries)
- Learned embeddings of its bounding-box coordinates, quantized to integers in $[0, 1023]$
Optionally, a small ResNet-like CNN processes a $32 \times 32$ image crop around the word bounding box (4 blocks with 16, 32, 48, 80 channels), producing a $d_{\text{model}}$-dimensional vector added to the token embedding.
Each output head splits the transformer output of shape $(N, L, D)$ into two tensors $Q$ and $K$ of shape $(N, L, D/2)$, and computes:
$$\hat{A} = \sigma(QK^T)$$
where $\sigma$ is the sigmoid function. The training loss is elementwise binary cross-entropy between $\hat{A}$ and the ground-truth adjacency matrix, summed over unpadded positions:
$$\mathcal{L} = -\sum_{i,j} \left[ A_{ij} \log \hat{A}_{ij} + (1 - A_{ij}) \log(1 - \hat{A}_{ij}) \right]$$
Inference post-processing. To enforce symmetry and recover missed edges, the predicted matrix is symmetrized before thresholding:
$$\frac{1}{2}\left(\sigma(QK^T) + \sigma(KQ^T)\right) \ge k$$
The threshold $k$ is selected on a validation set (empirically $k \approx 0.7$ maximizes Dice score). Connected components are then computed on the thresholded graph. For spanning cells, a second pass identifies weak (one-directional) connections above a fixed 0.5 threshold.
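A small sketch of the symmetric case (table/cell membership) using SciPy's connected components; the directed second pass for spanning cells is omitted:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def words_to_clusters(A_hat: np.ndarray, k: float = 0.7) -> np.ndarray:
    # Symmetrize the predicted adjacency, threshold at k, then read clusters
    # off as connected components (which also enforces transitivity).
    sym = 0.5 * (A_hat + A_hat.T)
    _, labels = connected_components(csr_matrix(sym >= k), directed=False)
    return labels  # labels[i] is the cluster id of word i
```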
Word normalization for language robustness. Word contents are normalized to a 4-character alphabet (uppercase $\to$ ‘A’, lowercase $\to$ ‘a’, digits $\to$ ‘1’, other $\to$ ‘,’) before vocabulary lookup, making the dictionary compact and language-agnostic without subword tokenization.
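The rule is simple enough to transcribe directly:

```python
def normalize_word(word: str) -> str:
    # Map each character onto the 4-symbol alphabet described above.
    return "".join(
        "A" if ch.isupper() else
        "a" if ch.islower() else
        "1" if ch.isdigit() else ","
        for ch in word
    )

assert normalize_word("Table 3:") == "Aaaaa,1,"
```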
What experiments were performed?
Datasets. Training uses PubTables-1M (primary; provides ground-truth word coordinates, avoiding OCR noise), FinTabNet, PubTabNet, and SynthTabNet. The model is evaluated on PubTables-1M, FinTabNet, PubTabNet, and ICDAR-2019 cTDaR (modern track only). Test sets are capped at 10,000 documents for runtime.
Training details. Adam optimizer; learning rate $10^{-4}$ for $100 \times 5000$ steps then $10^{-5}$ for $100 \times 5000$ steps; batch size 8; maximum sequence length 1,000 words.
Metrics. Dice score on adjacency matrices is used for validation and threshold selection. Standard COCO-style $\text{AP}@[\text{IoU}=0.50]$ and $\text{AP}@[0.50:0.95]$ are computed for final evaluation, with a caveat: ground-truth boxes are shrunk to the tight bounding box around words before scoring, making the numbers not directly comparable to published DETR results.
Baseline. The Table Transformer DETR from PubTables-1M (Smock et al., CVPR 2022) is the main comparison target, evaluated both with the authors’ published GitHub checkpoint and with a model retrained by the ClusterTabNet authors following the same code, revealing a reproducibility gap (AP 0.948 checkpoint vs. 0.888 retrained for TSR).
Ablation. Variants explore the number of transformer layers (2, 4, 8), $d_{\text{model}}$ (128, 256, 512), output head dimension $C_{\text{out}}$ (300, 1000), and the presence or absence of image patches.
What are the outcomes/conclusions?
Table detection. On PubTables-1M, ClusterTabNet with image patches achieves AP 0.989 and AR 0.994, outperforming both the published DETR checkpoint (AP 0.966, AR 0.981) and the retrained DETR (AP 0.949).
Table structure recognition (4 classes: cells, rows, columns, headers). On PubTables-1M, ClusterTabNet with image patches reaches AP 0.931, AP50 0.972, AR 0.952. The published DETR checkpoint achieves AP 0.948 (best), and the retrained DETR reaches AP 0.888. Image patches contribute roughly +0.04 AP versus the text-only variant.
Cross-dataset TSR (single model trained on all four datasets): AP 0.924 (PubTables-1M), 0.792 (FinTabNet), 0.921 (PubTabNet), 0.746 (ICDAR-2019).
Ablation findings. More transformer layers or larger hidden dimension provide negligible gains beyond the 4-layer, $d_{\text{model}}=256$ baseline. The smallest tested variant (2-layer, $d_{\text{model}}=128$, 2M non-embedding parameters) still achieves AP 0.912, while the full model uses 5M non-embedding parameters versus roughly 29M for DETR.
Key limitations. The primary limitation of the AP metric comparison is the word-shrinking normalization applied to ground-truth boxes, which makes numbers incomparable with published DETR results. The paper does not evaluate on TEDS, the standard metric for PubTabNet and FinTabNet, so broader comparison to other TSR methods is difficult. Spanning cell recovery relies on a fixed hyperparameter (0.5 threshold) due to data sparsity. The model requires OCR output at inference time, coupling it to an upstream OCR engine whose choices (padding, orientation) affect results.
Reproducibility
Models
- Architecture: 4-layer transformer encoder, $d_{\text{model}} = 256$, $d_{\text{ff}} = 1024$, $C_{\text{out}} = 300$; optional 4-block ResNet CNN for image patches; approximately 15M total parameters (with patches), 5M excluding embeddings.
- Code is released at https://github.com/SAP-samples/clustertabnet under Apache-2.0. A pretrained checkpoint (`model_weights/table_recognition.pth`, ~29 MB) is distributed in the repository and is covered by the same Apache-2.0 license.
- No separate model card is provided; the README references the checkpoint directly for inference.
Algorithms
- Adam optimizer; learning rate $10^{-4}$ for 100 epochs of 5,000 steps, then $10^{-5}$ for 100 more epochs.
- Batch size 8; maximum sequence length 1,000 words.
- Threshold $k \approx 0.7$ selected via Dice score on validation set.
- No gradient clipping, mixed precision, or other training tricks reported.
Data
- PubTables-1M (training + evaluation): ~947k table instances; word coordinates provided directly.
- FinTabNet, PubTabNet, SynthTabNet: used for training; OCR run with an internal SAP engine (not public).
- ICDAR-2019 cTDaR (modern track): evaluation only.
- Data preprocessing: JSON format; FinTabNet header labels masked during training.
- Test sets capped at 10,000 documents for runtime; authors report negligible accuracy difference between capped and full sets.
Evaluation
- $\text{AP}@[0.50:0.95]$, $\text{AP}@50$, and AR via torchvision CocoEvaluator.
- Ground-truth boxes shrunk to tight word-bounding-box before scoring; results are not directly comparable to published detection-based TSR numbers.
- Spanning cells and row headers excluded from 4-class comparison with DETR.
- Dice score used only for validation and threshold selection, not as a final reported metric.
- No error bars, significance tests, or multiple training run statistics reported.
Hardware
- DETR baseline was retrained for 1 week on a single V100 GPU for reference.
- No explicit hardware specification given for ClusterTabNet training.
- Inference hardware: not reported; model is compact (~15M parameters) and OCR-based, suggesting CPU-feasible inference is plausible.
BibTeX
@inproceedings{polewczyk2024clustertabnet,
title={ClusterTabNet: Supervised clustering method for table detection and table structure recognition},
author={Polewczyk, Marek and Spinaci, Marco},
booktitle={International Conference on Document Analysis and Recognition},
year={2024}
}
LORE++: Pre-Training Boosts Logical Location Regression for Table Structure Recognition
TL;DR
LORE++ extends the LORE table structure recognition (TSR) framework by introducing two pre-training tasks: a Masked Autoencoder (MAE) task to learn visual table clues, and a Logical Distance Prediction (LDP) task to learn grid-level logical relationships between text regions. Pre-training on 1.5 million table images consistently boosts accuracy, data efficiency, and generalization compared to the LORE baseline, without changing the underlying logical-location-regression paradigm.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the paper extends an existing TSR framework (LORE) with a pre-training approach. The core contributions are the pre-training architecture, the LDP task, and the joint-training masking strategy. The headline contribution is methodological.
Secondary: none.
What is the motivation?
The original LORE model demonstrated that formulating TSR as logical coordinate regression is both simpler and more accurate than adjacency-based or markup-sequence-based paradigms. However, it did not exploit the large pool of table data available from sources like PubTables-1M (over one million tables) and TableBank. In contrast, pre-trained models had driven major gains in both computer vision and NLP.
Two concrete limitations motivated the upgrade. First, ViT architectures, which are more natural fits for MAE pre-training, actually underperform the CNN-based LORE baseline when trained from scratch on limited TSR data (82.1% vs. 82.9% accuracy on WTW), indicating that sufficient pre-training is needed to realize their potential. Second, TSR requires not just visual understanding of cell regions and ruling lines, but also comprehension of logical grid relationships: which cells share a row or column, and by how much they are separated. No prior TSR pre-training work addressed both aspects jointly.
What is the novelty?
Logical Distance Prediction (LDP). The key contribution is a pre-training proxy task tailored to TSR’s unique demand for logical structure understanding. For each pair of text regions in a table image (obtained from an off-the-shelf OCR engine), the model is trained to predict the row and column logical distances between them. Unlike labeled TSR datasets, the LDP supervision can be generated automatically from any table image with readable text, enabling pre-training at scale.
Concretely, OCR results give 2D positions of text regions, which are then clustered into grid rows and columns. The logical distance between text regions $i$ and $j$ is then computed as:
$$ \Delta r_{ij} = |r^{(i)} - r^{(j)}|, \quad \Delta c_{ij} = |c^{(i)} - c^{(j)}| $$
The logical decoder is trained with an L1 loss on the predicted distances.
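Label generation is then a pairwise absolute difference over the assigned grid indices; a small NumPy illustration with made-up indices:

```python
import numpy as np

rows = np.array([0, 0, 1, 2])  # illustrative grid row indices from clustering
cols = np.array([0, 1, 0, 1])  # illustrative grid column indices
delta_r = np.abs(rows[:, None] - rows[None, :])  # (n, n) row logical distances
delta_c = np.abs(cols[:, None] - cols[None, :])  # (n, n) column logical distances
```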
Joint pre-training with unidirectional self-attention. LORE++ trains MAE and LDP jointly in a single forward pass. To prevent information leakage from masked patches to unmasked ones (which would make MAE trivial), the encoder uses a unidirectional self-attention scheme: unmasked tokens attend only to each other, while masked tokens attend to all patches. This lets the encoder produce meaningful representations for both tasks without separate passes.
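A sketch of the corresponding attention mask, under the convention that `allowed[i, j]` is True when query token i may attend to key token j:

```python
import torch

def unidirectional_mask(masked: torch.Tensor) -> torch.Tensor:
    # masked: (n,) bool, True where the patch token is masked.
    n = masked.shape[0]
    allowed = torch.ones(n, n, dtype=torch.bool)  # masked queries attend to all
    allowed[~masked] = ~masked                    # unmasked queries see only unmasked keys
    return allowed
```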
Architecture. LORE++’s encoder is a 12-layer ViT with 12-head self-attention, hidden size 384, and feed-forward intermediate size 1536 (comparable in parameter count to ResNet-50 or DLA-34). The spatial decoder follows ViT-MAE design. The logical decoder shares the self-attention-plus-linear structure of the LORE base and stacking regressors but operates on paired text-region features for distance prediction. During fine-tuning, the ViT encoder initializes from the pre-trained encoder, and both the base and stacking regressors initialize from the pre-trained logical decoder.
Pre-training data. A corpus of 1.5 million table images is assembled from PubTables-1M, TableBank, and smaller academic datasets. LDP labels are generated automatically via OCR and clustering, requiring no manual annotation.
What experiments were performed?
Benchmarks cover both digital-born and physical tables:
- ICDAR-2013: 238 table images; cross-validation only (no training split).
- SciTSR-comp: complex-table subset of SciTSR.
- PubTabNet: 20,000 training images sampled from the full 339k.
- WTW: wild table photos with distortion and spanning cells.
- TableGraph-24K: digital-born scientific tables.
- ICDAR-2019: scanned document tables.
- TableBank: markup-sequence benchmark.
Metrics include three types to allow comparison across paradigms: logical location accuracy (fraction of cells with all four indices exactly correct), adjacency F1 (precision/recall on adjacent cell pairs), and TEDS/BLEU for markup-sequence methods.
LORE vs. LORE++ comparison (Table III) is the primary quantitative result. Both models are evaluated on ICDAR-2013, SciTSR-comp, PubTabNet, and WTW using detection F1 (D-F1), relation F1 (R-F1), and logical location accuracy (Acc).
Pre-training ablation (Table V) systematically disentangles: (a) CNN baseline (LORE), (b) ViT with no pre-training, (c) ViT pre-trained with MAE on ImageNet, (d) ViT pre-trained with MAE on the curated table dataset, (e) ViT pre-trained with MAE and LDP jointly (LORE++).
Data efficiency experiments (Figure 9) train LORE and LORE++ on 20%, 60%, and 100% of WTW and SciTSR-comp training data.
Generalization experiments (Table IX) train on a hybrid dataset (WTW + 20k PubTabNet + TableGraph-24K) and evaluate on the held-out ICDAR-2013 and SciTSR-comp test sets.
Training used an Adam optimizer with weight decay 0.05, $(\beta_1, \beta_2) = (0.0, 0.95)$, learning rate $1.5 \times 10^{-4}$ with 5% linear warm-up. Fine-tuning follows the same schedule as LORE: initial LR $1 \times 10^{-4}$, decay to $1 \times 10^{-5}$ at epoch 70 and $1 \times 10^{-6}$ at epoch 90 over 100 epochs.
What are the outcomes/conclusions?
Accuracy improvements. LORE++ consistently outperforms LORE across all evaluated datasets (Table III). The largest gains appear on ICDAR-2013: logical accuracy improves from 86.8% to 93.2%, and D-F1 from 97.2% to 98.5%. On WTW (the most challenging benchmark), logical accuracy improves from 82.9% to 84.1% and D-F1 from 96.4% to 97.0%. The pre-training also boosts PubTabNet logical accuracy from 91.0% to 92.7%.
Pre-training task contributions (Table V). Using only MAE on the table-domain data (d) already beats MAE pre-training on ImageNet (c) at 83.2% vs. 82.1% accuracy on WTW, suggesting that domain-specific pre-training matters even with a smaller dataset. Adding LDP (e, LORE++) pushes accuracy to 84.1%, with R-F1 improving from 96.4% to 96.9%. Notably, spatial detection also benefits from the LDP task, indicating that logical and spatial understanding are mutually reinforcing.
Data efficiency. LORE++ trained on 60% of SciTSR-comp data matches LORE trained on 100% of that data. On WTW, LORE++ maintains a consistent advantage across all data fractions.
Generalization. In the cross-dataset setting (Table IX), LORE++ boosts logical accuracy on ICDAR-2013 from 78.6% to 87.5% and on SciTSR-comp from 87.1% to 93.5%.
Inference efficiency. Despite switching to a ViT encoder, LORE++ achieves slightly faster inference than LORE (0.43 s vs. 0.45 s per image) and uses 29.7M parameters vs. 24.2M, representing a modest increase (Table VIII, FLOPs: 88.3G vs. 75.2G). Both are far more efficient than the EDD markup-sequence baseline, which takes 14.8 s per image (and was trained on the full 339k-sample PubTabNet set).
Limitations acknowledged by the paper. The ViT architecture performs worse than the CNN-based LORE when trained from scratch on limited data, so the gains are contingent on having access to the pre-trained weights. The paper evaluates on a fixed set of benchmarks and does not include newer, harder datasets such as FinTabNet. LDP supervision requires an OCR system at pre-training time, and OCR errors could introduce noise in the logical-distance labels. One issue the paper does not address is that TableBank appears in both the pre-training corpus and as a downstream evaluation benchmark, which raises a potential contamination concern for those results.
Reproducibility
Models
- Encoder: 12-layer ViT, 12-head self-attention, hidden size 384, feed-forward intermediate size 1536, roughly comparable in parameter count to ResNet-50. Total model parameters: 29.7M (FLOPs: 88.3G at 1024x1024 with 32 cells).
- Logical decoder (pre-training) and base/stacking regressors (fine-tuning): 3 self-attention layers each, matching the vanilla LORE configuration.
- The original LORE code base is released under Apache-2.0 in the Alibaba AdvancedLiterateMachinery repository; no separate LORE++ release is confirmed as of this writing.
- The ViT encoder is initialized with the MAE pre-trained weights; the pre-trained logical decoder's weights initialize the base and stacking regressors at fine-tuning time.
Algorithms
- Pre-training optimizer: Adam with weight decay 0.05, $(\beta_1, \beta_2) = (0.0, 0.95)$, learning rate $1.5 \times 10^{-4}$, 5% linear warm-up.
- Fine-tuning: initial LR $1 \times 10^{-4}$, step decay to $1 \times 10^{-5}$ at epoch 70 and $1 \times 10^{-6}$ at epoch 90, over 100 epochs.
- Pre-training batch size: 196. Total pre-training step count is not specified in the paper (“for steps” is stated without a number).
- MAE masking ratio: 50% of patches (lower than the 75% used in the original ViT-MAE, to account for the information-dense nature of table images).
- Unidirectional self-attention: unmasked tokens attend only to each other, while masked tokens attend to all tokens (see the mask sketch after this list).
- L1 loss for logical distance prediction; MSE for pixel reconstruction (MAE).
- Input resolution: 224 during pre-training; max-side 1024 (512 for SciTSR and PubTabNet) during fine-tuning.
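On the unidirectional-attention bullet above: a minimal NumPy sketch of the mask it implies (our illustration, not the authors' code; `is_masked` marks the masked patch tokens):

```python
import numpy as np

def ldp_attention_mask(is_masked: np.ndarray) -> np.ndarray:
    """Boolean (N, N) matrix where entry (i, j) is True when query
    token i may attend to key token j: unmasked queries see only
    unmasked keys, while masked queries see every token."""
    return is_masked[:, None] | ~is_masked[None, :]

# Example: tokens 1 and 3 are masked out of a 4-token sequence
mask = ldp_attention_mask(np.array([False, True, False, True]))
```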
Data
- Pre-training: 1.5 million table images from PubTables-1M, TableBank, and additional small-scale academic datasets.
- LDP labels are generated by running an off-the-shelf OCR system to locate text regions, then clustering horizontal/vertical positions to assign grid row/column indices (see the sketch after this list).
- Fine-tuning datasets: the same benchmarks as LORE (ICDAR-2013, SciTSR-comp, PubTabNet with 20k samples, TableGraph-24K, ICDAR-2019, WTW, TableBank).
- No held-out pre-training validation set is described; potential overlap between pre-training tables and fine-tuning/test tables is not explicitly analyzed. Notably, TableBank is used in both the pre-training corpus and as a fine-tuning/evaluation benchmark, creating a direct contamination risk that the paper does not address.
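For the label-generation bullet above, a plausible reading as code (the helper name and the gap threshold are our assumptions; the paper gives no numbers): sort the OCR box centres along one axis and open a new row or column index whenever the gap between consecutive positions exceeds a threshold.

```python
def positions_to_indices(centers, gap):
    """Assign a grid index to each 1-D position (box-centre
    y-coordinates for rows, x-coordinates for columns): sort, then
    start a new index whenever consecutive positions are more than
    `gap` apart."""
    order = sorted(range(len(centers)), key=lambda i: centers[i])
    indices = [0] * len(centers)
    idx = 0
    for prev, cur in zip(order, order[1:]):
        if centers[cur] - centers[prev] > gap:
            idx += 1
        indices[cur] = idx
    return indices

rows = positions_to_indices([12.0, 13.5, 48.0, 49.2], gap=10.0)  # [0, 0, 1, 1]
```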
Evaluation
- Logical location accuracy: fraction of cells with all four indices $(r_s, r_e, c_s, c_e)$ exactly correct. Reported separately for column indices (A-C), row indices (A-R), and all four (Acc).
- Adjacency F1: precision/recall on predicted adjacent cell pairs, following the ICDAR 2013 methodology.
- TEDS: tree edit distance similarity between predicted and ground-truth HTML trees; PDF text extracted following Zhong et al. (2020).
- BLEU (4-gram) for TableBank markup evaluation.
- 5 runs averaged for all reported results; no significance tests or confidence intervals provided.
- Ablation studies are conducted on WTW only; it is unclear whether component conclusions generalize across benchmarks.
Hardware
- 4x NVIDIA Tesla V100 GPUs.
- GPU-hours and memory requirements not reported.
- Inference: 0.43 s/image at 1280x1280 on PubTabNet validation set (LORE++), vs. 0.45 s/image (LORE) and 14.8 s/image (EDD).
BibTeX
@article{long2024loreplusplus,
title={{LORE++}: Logical Location Regression Network for Table Structure Recognition with Pre-training},
author={Long, Rujiao and Xing, Hangdi and Yang, Zhibo and Zheng, Qi and Yu, Zhi and Yao, Cong and Huang, Fei},
journal={arXiv preprint arXiv:2401.01522},
year={2024}
}
ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents
TL;DR
The ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents benchmarked document layout detection across diverse corporate document domains using the DocLayNet dataset. With 45 registered teams and 21 official submissions, the top-performing team (docdog, Tencent WeChat AI) achieved an overall mAP of 0.70 on a purpose-built 498-page competition set spanning reports, manuals, patents, and out-of-distribution “Other” samples. The results point to a clear trend toward vision-transformer-based methods combined with ensemble strategies.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$. This is a competition report that systematically measures the state of the art in document layout segmentation. The primary contribution is the evaluation infrastructure: a hard, out-of-distribution test set, a standardized COCO mAP protocol, and a ranked analysis of 21 teams’ approaches. The organizers themselves do not propose a new model or training recipe.
Secondary: $\Psi_{\text{Resource}}$. The paper introduces the 498-page competition dataset (now available on HuggingFace) as a reusable benchmark for future research, and it facilitated the broader community adoption and HuggingFace release of the underlying DocLayNet dataset.
What is the motivation?
Document layout understanding (recovering the structural layout of paragraphs, tables, figures, and headers from PDF files or scanned pages) has been a long-standing challenge in document AI. Earlier large-scale datasets like PubLayNet and DocBank enabled significant progress but are limited to scientific literature, where uniform XML/LaTeX sources make ground-truth generation straightforward. Models trained on these datasets tend to saturate performance metrics on their own domain while generalizing poorly to corporate documents such as financial reports, manuals, and patents.
ICDAR has hosted a series of layout segmentation competitions to benchmark progress and drive new solutions. This 2023 edition was designed to raise the bar by using DocLayNet (an 80,863-page human-annotated dataset spanning six document domains with 11 class labels) as the training resource, while evaluating on a harder, deliberately diverse competition dataset that included out-of-distribution “Other” pages not present in DocLayNet.
What is the novelty?
The paper’s primary contribution is the competition design and the resulting evaluation dataset. Three design choices are worth noting:
Hard competition set construction. The 498-page competition dataset was engineered to expose distribution bias. It mixes samples from DocLayNet’s existing categories (Reports at 59%, Manuals at 23%, Patents at 10%) with an “Other” category (9%) containing free-style layouts such as newspaper-style pages, advertisements, and product listings that fall outside the DocLayNet layout space. This prevents teams from simply fine-tuning on DocLayNet and expecting high scores on the same distribution.
Multi-modal representation. Competition pages were provided with both image and JSON text-cell layers (original PDF tokens with coordinates), mirroring DocLayNet’s format. Several top teams used this second modality to refine or postprocess detections.
Per-category COCO mAP aggregation. Rather than a single aggregate score, performance was computed for each document category (Reports, Manuals, Patents, Other) and then macro-averaged with equal weights. This penalized solutions that over-indexed on any single domain.
The evaluation metric is:
$$\text{mAP} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \text{AP}_{c}$$
where $\mathcal{C}$ is the set of document categories, and $\text{AP}_{c}$ is the standard COCO average precision for category $c$ averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
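As a concrete sketch of the two nested averages (the `coco_ap` helper is hypothetical, standing in for the pycocotools computation):

```python
import numpy as np

IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)   # 0.50, 0.55, ..., 0.95
CATEGORIES = ["Reports", "Manuals", "Patents", "Other"]

def competition_map(coco_ap, preds, gts):
    """Per-category AP averaged over IoU thresholds, then
    macro-averaged with equal weight per category, so no single
    document domain can dominate the ranking."""
    per_category = [
        np.mean([coco_ap(preds[c], gts[c], t) for t in IOU_THRESHOLDS])
        for c in CATEGORIES
    ]
    return float(np.mean(per_category))
```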
What experiments were performed?
The competition ran from December 19, 2022 through April 3, 2023, including a one-week extension phase. Submissions were handled via the EvalAI platform; each team received a maximum of 15 submission attempts (10 regular plus 5 for the extension). The organizers provided a baseline submission using a YOLOv5 (medium) model trained on DocLayNet for 80 epochs at 1024x1024 resolution with standard augmentations, achieving an overall mAP of 0.49.
Top five team approaches:
- docdog (Tencent WeChat AI): Synthetic data generation (300,000 samples from DocLayNet), YOLOv8 (medium/large/x-large) and DINO ensembles, focal modulation networks in the DINO backbone, per-category models, Tree-Structured Parzen Estimator (TPE) for hyperparameter optimization, Weighted Boxes Fusion (WBF), and text-cell coordinate refinement. Final mAP: 0.70.
- BOE AIoT CTO: YOLOv5 + YOLOv8 + DiT-large ensemble, scale and mosaic augmentation, multi-scale image training for vertical text, BCELoss + FocalLoss, trained for 150 epochs. Final mAP: 0.64.
- INNERCONV: MaskDINO (image-only), WBF multi-scale inference ensemble. Final mAP: 0.63.
- LC-OCR (CVTE): VSR + LayoutLMv3 ensemble with class-specific model selection; text cell information from DocLayNet JSON. Final mAP: 0.63.
- DXM-DI-AI-CV-TEAM (Du Xiaoman Financial): Cascade Mask R-CNN with DiT-large backbone, multiple model fusions. Final mAP: 0.63.
All top teams incorporated vision-transformer-based models (DINO, MaskDINO, DiT, or LayoutLMv3) either as primary detectors or as ensemble components.
What are the outcomes/conclusions?
The top-ranked team (docdog) achieved an overall mAP of 0.70, with a 6-percentage-point lead over second place. Performance was highest on Patents (12 teams reached mAP of 0.79 or better), consistent with the expectation that uniform, structured patent layouts are easier to detect. The “Other” and “Reports” categories were considerably harder: most teams scored in the low 0.60s on Reports and in the 0.40s-0.50s on “Other”.
Several patterns emerge from the results:
- Performance across the top teams was notably lower than what had been reported in the ICDAR 2021 competition on scientific literature parsing, which the authors attribute to the greater layout diversity and higher class count (11 classes) in DocLayNet as well as the hard samples engineered into the competition set.
- Data augmentation (multi-scale, mosaic, synthetic data generation) and ensemble methods (WBF, per-category specialization) were consistently important for top performance; no single end-to-end model approach matched the best ensembles.
- Vision-transformer-based architectures (DINO, MaskDINO, DiT, LayoutLMv3) were present in all top submissions, with CNN-based detectors (YOLO variants) often included in ensembles rather than used alone.
- Two of the top five teams used the text cell layer from DocLayNet’s JSON representation, suggesting that multi-modal approaches carry real value even for this bounding-box detection task.
The competition dataset was released on the HuggingFace hub to support ongoing research. The organizers conclude that this competition helped establish DocLayNet as a well-known benchmark for document layout understanding.
Reproducibility
Models
- Baseline: YOLOv5 medium, trained from scratch on DocLayNet training split at 1024x1024 resolution for 80 epochs with default YOLOv5 settings and standard augmentations. No released checkpoint is cited in the paper.
- Top teams used publicly available pretrained weights for DINO, MaskDINO, DiT-large, LayoutLMv3, and YOLO variants, fine-tuned on DocLayNet. Only the docdog team’s WeLayout system has an accompanying paper (arXiv:2305.06553) with further methodological detail.
Algorithms
- Baseline training: YOLOv5 default settings, mosaic/scale/flip/rotation/mixup augmentation, 80 epochs.
- Top teams applied mosaic augmentation, multi-scale training, synthetic data derived from DocLayNet (docdog: 300,000 synthetic pages), and WBF for ensembling.
- Hyperparameter search: docdog used TPE (Tree-Structured Parzen Estimator).
- Optimizer and learning rate schedules are not reported for any team in this competition report; the individual team papers are the best reference for those details.
Data
| Resource | Type | License | Link |
|---|---|---|---|
| DocLayNet (training) | Dataset | CC-BY-SA-4.0 | HuggingFace |
| ICDAR 2023 competition set (498 pages) | Benchmark | CC-BY-SA-4.0 | HuggingFace |
DocLayNet covers 80,863 pages across 6 document domains (Financial Reports, Patents, Manuals, Laws, Tenders, Technical Papers), with 11 bounding-box class labels, annotated by human experts. The competition test set was provided to participants without ground-truth annotations during the competition; post-competition annotation release status is not stated in the paper. No teams reported creating private labeled ground-truth data outside of DocLayNet derivations.
Evaluation
- Metric: COCO mAP @ IoU [0.50:0.95], computed using pycocotools. Average precision is computed per document category, then macro-averaged across categories.
- Evaluation platform: EvalAI (automated, online). Submission limit: 15 per team.
- Baseline mAP: 0.49. Top mAP: 0.70. The 18 teams ranked at or above the baseline span mAPs from 0.70 down to 0.49.
- The authors note that performance is lower than on the ICDAR 2021 scientific literature competition, attributing this to higher class count and layout diversity rather than evaluation artifacts.
- No statistical significance tests or multi-run variance are reported; each team’s single best public submission is ranked.
Hardware
- No hardware details are reported by the organizers or the participant teams in this competition report. The docdog team’s WeLayout paper (arXiv:2305.06553) may provide additional hardware context.
- Inference was performed locally by each team; no latency or throughput figures are reported.
BibTeX
@inproceedings{auer2023icdar,
title={ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents},
author={Auer, Christoph and Nassar, Ahmed and Lysak, Maksym and Dolfi, Michele and Livathinos, Nikolaos and Staar, Peter},
booktitle={Proceedings of the International Conference on Document Analysis and Recognition},
year={2023}
}
Revisiting Table Detection Datasets for Visually Rich Documents
TL;DR
This paper revisits five existing manually annotated table detection (TD) datasets, cleans their noisy annotations, aligns annotation definitions across them, and merges them into a single larger dataset called Open-Tables. The authors also introduce ICT-TD, a new manually annotated dataset drawn from ICT commodity PDF datasheets, which provides a domain-specific source underrepresented in prior work. Cross-domain baselines show that cleaned data yields 0.6–2.6% higher weighted F1 than training on the noisy originals.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is two new or improved datasets (Open-Tables and ICT-TD) released for the table detection community. The paper’s dataset construction methodology, annotation rules, and quality analysis are the focal content.
Secondary: $\Psi_{\text{Evaluation}}$: The paper builds cross-domain and noise-impact benchmarks, quantifying how annotation quality and domain shift affect model evaluation reliability.
What is the motivation?
Table detection is a foundational step in document understanding pipelines. However, existing public datasets for this task suffer from three problems the authors identify:
- Noisy annotations. Automatically generated datasets like TableBank and PubLayNet contain bounding boxes that are too large, too small, or missing tables entirely. Even manually annotated sets such as ICDAR2013 and TNCR have inconsistent labels due to ambiguous table boundaries (e.g., whether table explanations belong inside the table region).
- Inconsistent annotation definitions. Different datasets draw table boundaries differently, making direct merging unreliable without normalization.
- Limited data sources. Virtually all open TD datasets come from academic publications or government documents. This narrow provenance means that trained models may struggle to generalize to domains with different table layouts, such as industrial product datasheets.
What is the novelty?
The paper’s two core contributions are:
Open-Tables. A merged dataset constructed by cleaning and re-aligning five existing datasets: ICDAR2013, ICDAR2017, ICDAR2019 (modern documents only), Marmot, and TNCR. The authors define two annotation rules applied uniformly across all source sets:
- Table boundaries are set by lines; all content bounded by table lines is included, but explanation text outside those lines is excluded.
- A table must have at least two rows and two columns.
After cleaning (removing bounding boxes that are too large, too small, or mislabeled, and resolving ambiguous samples by the stated rules), the merged result is 8,834 training images, 1,240 test images, and 1,000 validation images.
ICT-TD. A new dataset sourced from 175,682 PDF files covering 370 different ICT commodity types (network hardware, transceivers, modules, etc.), yielding 3.58 million page images. Random sampling produced 5,000 images containing tables, which were annotated manually under expert guidance. Annotation rules for this domain add a third criterion: a table must describe commodity-specific information, not, for example, a table of contents. The resulting split is 4,000 training / 1,000 test images.
ICT domain tables include unique structures not found in academic or governmental datasets: compound tables composed of multiple sub-tables, tables containing figures as cell content, and partially lined tables with domain-specific layouts.
The evaluation metric used throughout is the weighted average F1 defined as:
$$\text{Weighted Avg. F1} = \frac{\sum_{i=1}^{4} \text{IoU}_i \cdot F1@\text{IoU}_i}{\sum_{i=1}^{4} \text{IoU}_i}$$
where IoU thresholds of 80%, 85%, 90%, and 95% are used (stricter than the ICDAR2019 competition setting of 60%–90%).
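Read literally, each F1 score is weighted by its own IoU threshold, so the stricter thresholds count slightly more. A minimal sketch:

```python
def weighted_avg_f1(f1_at_iou):
    """f1_at_iou maps each IoU threshold to the F1 score measured at
    that threshold; the weights are the thresholds themselves."""
    thresholds = [0.80, 0.85, 0.90, 0.95]
    return sum(t * f1_at_iou[t] for t in thresholds) / sum(thresholds)

# e.g. weighted_avg_f1({0.80: 0.95, 0.85: 0.93, 0.90: 0.88, 0.95: 0.75})
```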
What experiments were performed?
The authors evaluate four baseline models, selected to span common detection paradigms:
- TableDet: Cascade R-CNN with transfer learning and table-aware augmentation (two-stage CNN)
- DiffusionDet: Diffusion-process-based detector with random region proposals
- Deformable-DETR: Transformer-based detector using deformable attention
- SparseR-CNN: Learnable sparse proposals with cascade refinement
All models are implemented in Detectron2 with a batch size of 16. TableDet uses SGD for 25,000 iterations; the remaining three use AdamW with MultiStepLR for 50,000 iterations. Experiments cover three settings:
- In-domain ICT-TD: train on ICT-TD training set, test on ICT-TD test set.
- In-domain Open-Tables: train on Open-Tables (clean or noisy) training set, test on Open-Tables test set.
- Cross-domain: two directions – ICT-TD training set evaluated on Open-Tables test set, and Open-Tables training set evaluated on ICT-TD test set. The cross-domain Open-Tables setting is also repeated with the noisy (pre-cleaning) training split to quantify the benefit of noise removal.
What are the outcomes/conclusions?
In-domain results. On ICT-TD, Deformable-DETR achieves the best weighted average F1 of 90.3, followed closely by SparseR-CNN (88.9) and DiffusionDet (88.9). On Open-Tables, SparseR-CNN and Deformable-DETR lead at 93.3 and 93.1 respectively. These numbers are high but not directly comparable to benchmarks like ICDAR2013 where near-perfect F1 has been reported, because the authors use stricter IoU thresholds.
Cross-domain results. Performance drops substantially in both cross-domain directions. Training on ICT-TD and testing on Open-Tables yields weighted F1 of 67.6–75.8 across models. Training on Open-Tables and testing on ICT-TD yields 75.7–80.2. Deformable-DETR shows the best cross-domain generalization in both directions.
Noise cleaning impact. Training with the cleaned Open-Tables training set consistently outperforms training with the noisy version when tested on the ICT-TD test set, with improvements of 0.6–2.6% weighted F1 across models. The benefit is largest at higher IoU thresholds (90%, 95%), which are most discriminative for bounding box quality.
Limitations the authors acknowledge. The IoU-based evaluation metrics used throughout are indirect measures of actual information extraction quality. A predicted box that is slightly larger than ground truth but still captures all table content may receive a lower score than a tighter box that misses some rows. The authors suggest that future work should explore metrics more directly tied to downstream information extraction performance.
Reproducibility
Models
All four baseline models are standard implementations from Detectron2. No custom architecture changes are introduced beyond what is described in the original model papers. Weights for the trained baselines are not released.
Algorithms
| Parameter | TableDet | DiffusionDet | Deformable-DETR | SparseR-CNN |
|---|---|---|---|---|
| Optimizer | SGD | AdamW | AdamW | AdamW |
| Max iterations | 25,000 | 50,000 | 50,000 | 50,000 |
| Base LR | 1e-3 | 1e-5 | 1e-4 | 2.5e-5 |
| LR schedule | – | MultiStepLR | MultiStepLR | MultiStepLR |
| Batch size | 16 | 16 | 16 | 16 |
Data
Open-Tables: Merges ICDAR2013 (test only, 150 images), ICDAR2017 (1,600 train / 817 test), ICDAR2019 modern (600 train / 240 test), Marmot (2,000 train), and TNCR (4,634 train / 987 test / 1,000 val). After cleaning and merging: 8,834 train / 1,240 test / 1,000 val images.
ICT-TD: 175,682 ICT commodity PDFs across 370 product types; 3.58M page images rendered at 200 DPI; 5,000 images with tables selected via random sampling and manually annotated. Split: 4,000 train / 1,000 test.
Both datasets are publicly available at HuggingFace under an Apache-2.0 license.
Evaluation
Weighted average F1 over IoU thresholds 80%, 85%, 90%, 95%. Precision, recall, and F1 at each IoU from 50% to 95% are also reported in appendix tables. No error bars or multi-run statistics are reported. No statistical significance tests are conducted.
Hardware
Not specified in the paper. All experiments are implemented in Detectron2; GPU type, GPU count, and wall-clock training time are not reported.
BibTeX
@article{xiao2025revisiting,
author = {Xiao, Bin and Simsek, Murat and Kantarci, Burak and Abu Alkheir, Ala},
title = {Revisiting Table Detection Datasets for Visually Rich Documents},
journal = {International Journal on Document Analysis and Recognition (IJDAR)},
year = {2025},
doi = {10.1007/s10032-025-00527-9},
url = {https://doi.org/10.1007/s10032-025-00527-9}
}
Optimized Table Tokenization for Table Structure Recognition
TL;DR
OTSL replaces HTML with a 5-token language for autoregressive table structure recognition. Its backward-only syntax rules enable on-the-fly validation, cut sequence length roughly in half, and yield consistent improvements in both accuracy and inference speed across three public benchmarks when evaluated with the TableFormer architecture.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The paper proposes a new representation language (OTSL) with syntactic constraints to replace HTML in Image-to-Sequence (Im2Seq) table structure recognition models, redesigning the output vocabulary to enforce structural validity and improve efficiency.
Secondary: $\Psi_{\text{Evaluation}}$. It systematically compares HTML against OTSL representations across multiple datasets (PubTabNet, FinTabNet, PubTables-1M) and model configurations, focusing on efficiency (latency) and accuracy (TEDS, mAP).
What is the motivation?
Image-to-Markup-Sequence (Im2Seq) table structure recognition typically reuses general-purpose HTML tokenization. However, HTML was not designed for autoregressive decoding efficiency and presents several challenges for neural models:
- Large Vocabulary: Requires at least 28 tokens to cover common rowspan/colspan attributes. The skewed token frequency distribution complicates learning.
- Variable Row Lengths: Rows with complex spanning produce longer token sequences. This variance makes positional encoding and attention mechanisms less effective.
- Late Error Detection: Invalid HTML outputs are difficult to detect early during generation. Partial sequences often violate structural consistency (e.g., missing closing tags) but remain syntactically valid markup until the end.
- Attention Drift: Long sequences on large tables cause output misalignment. This is particularly damaging for bounding box predictions in later rows.
What is the novelty?
The core innovation is OTSL, a minimal vocabulary representing a table as a rectangular grid.
1. Minimal Vocabulary
The language reduces the problem to just 5 tokens:
- C: New cell (anchor for cell region top-left).
- L: Merge with left neighbor (horizontal span continuation).
- U: Merge with upper neighbor (vertical span continuation).
- X: Merge with both left and upper (2D span interior).
- NL: End-of-row marker.
2. Syntactic Constraints
The representation enforces specific structure via backward-only syntax rules. Each token can be validated using only previously generated tokens, enabling incremental constraint enforcement during decoding:
- Left neighbor of L must be C or L.
- Upper neighbor of U must be C or U.
- Left neighbor of X must be U or X; upper neighbor must be L or X.
- First row allows only C and L.
- First column allows only C and U.
- All rows have equal length and are terminated by NL.
3. Error Mitigation
Invalid token predictions signal decoding errors immediately. The authors propose a heuristic that replaces the highest-confidence invalid token with the next-highest valid candidate until syntax rules are satisfied.
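Because every rule looks only backward, a validator needs nothing beyond the completed rows and the current partial row. The sketch below is our own illustration (the authors release no code): `is_valid` encodes the six rules, and `pick_token` implements the replacement heuristic by walking candidates from highest to lowest confidence.

```python
def is_valid(rows, cur, tok, width=None):
    """Check one OTSL token against the backward-only syntax rules.
    rows: completed rows (lists of tokens); cur: current partial row;
    width: row length fixed by the first NL (None while in row one)."""
    col = len(cur)
    if tok == "NL":
        return col > 0 and (width is None or col == width)
    if width is not None and col >= width:
        return False                       # row already full: only NL
    left = cur[-1] if col > 0 else None    # None in the first column
    up = rows[-1][col] if rows else None   # None in the first row
    if tok == "C":
        return True
    if tok == "L":
        return left in ("C", "L")
    if tok == "U":
        return up in ("C", "U")
    if tok == "X":
        return left in ("U", "X") and up in ("L", "X")
    return False

def pick_token(scores, rows, cur, width=None):
    """Error mitigation: take the highest-confidence candidate that
    satisfies the syntax rules."""
    for tok in sorted(scores, key=scores.get, reverse=True):
        if is_valid(rows, cur, tok, width):
            return tok
    return "C"  # safety net; C is always legal mid-row
```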
What experiments were performed?
The authors evaluate OTSL using the TableFormer architecture, an encoder-decoder transformer with separate decoders for structure (HTML/OTSL) and cell bounding boxes.
Experiments
- Hyperparameter Sweep: Compared HTML vs. OTSL on PubTabNet with varying encoder/decoder depths (2-6 layers).
- Cross-Dataset Evaluation: Validated the best configuration (6 encoder, 6 decoder, 8 heads) on three major benchmarks:
- PubTabNet: 395K samples (Scientific).
- FinTabNet: 113K samples (Financial).
- PubTables-1M: ~1M samples (Scientific/Digital).
Metrics
- TEDS (tree edit distance similarity): Structural accuracy, reported separately for simple/complex/all tables. OTSL outputs are converted back to HTML for fair comparison.
- mAP@0.75: Mean Average Precision at 0.75 IoU threshold for cell bounding boxes.
- Inference Time: Measured on a single-core CPU (AMD EPYC 7763 @ 2.45 GHz).
What are the outcomes/conclusions?
The results suggest that optimizing the representation yields consistent improvements without architectural changes.
1. Speed
OTSL achieves approximately $2\times$ inference speedup across configurations, primarily due to shorter sequence lengths. The authors report that OTSL reduces sequence length to roughly half that of HTML on average (e.g., 30 tokens vs. 55 for the example in Figure 1).
- PubTabNet: 2.73s (OTSL) vs. 5.39s (HTML).
- PubTables-1M: 1.79s (OTSL) vs. 3.26s (HTML).
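To make the length difference concrete, consider a hypothetical 2x2 grid whose first row is a single cell spanning both columns (exact HTML structure-token conventions vary between datasets):

```python
# OTSL: one token per grid position plus an NL per row
otsl = ["C", "L", "NL",
        "C", "C", "NL"]                                     # 6 tokens

# The same table as HTML structure tokens
html = ["<tr>", '<td colspan="2">', "</td>", "</tr>",
        "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]  # 10 tokens
```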
2. Accuracy
- PubTabNet: Similar all-TEDS (0.955); improved mAP (0.88 vs. 0.857).
- FinTabNet: Large gains observed: all-TEDS 0.959 vs. 0.920; mAP 0.862 vs. 0.722.
- PubTables-1M: Gains on both metrics: all-TEDS 0.977 vs. 0.966.
Limitations
- Syntactic vs. Structural: While OTSL guarantees a valid grid, it does not guarantee the correct grid. A valid OTSL sequence can still represent an incorrect table structure.
- Heuristic Reliance: The token replacement strategy is a heuristic. The paper does not compare this against formal constrained decoding or beam search.
- Architecture Specificity: The evidence is limited to TableFormer. It is unclear if these gains transfer to graph-based or object-detection-based TSR pipelines.
Reproducibility
The work is partially reproducible. The evaluation datasets are publicly available, but the authors do not release code, model weights, or the OTSL-format dataset conversions (promised in the paper but no public link is provided). Reimplementation requires building both the OTSL tokenization logic and the TableFormer architecture from the paper description alone.
Models
- The architecture used is TableFormer, an encoder-decoder transformer with separate decoders for structure tokens and cell bounding boxes.
- The best configuration uses 6 encoder layers, 6 decoder layers, and 8 attention heads. Smaller configurations (4/4, 2/4, 4/2) are also evaluated.
- No parameter counts are reported. No pretrained weights are released.
Algorithms
- Training procedure details (optimizer, learning rate, batch size, epochs) are not reported in the paper.
- The error-mitigation heuristic replaces the highest-confidence invalid token with the next-highest valid candidate until OTSL syntax rules are satisfied. No comparison is made against constrained decoding or beam search alternatives.
Data
- PubTabNet: 395K samples of scientific tables, semi-automatically generated from PubMed Central.
- FinTabNet: 113K samples of financial tables.
- PubTables-1M: Approximately 1M samples of scientific/digital tables.
- Ground truth from all datasets was converted to OTSL format. The authors state these conversions will be made publicly available, but no download link is provided in the paper.
Evaluation
- TEDS (tree edit distance similarity): Measures structural accuracy. OTSL predictions are converted back to HTML before computing TEDS for fair comparison. Reported separately for simple tables, complex tables (those with spanning cells), and all tables.
- mAP@0.75: Mean Average Precision at $0.75$ IoU threshold for cell bounding box predictions.
- Baselines are HTML-based TableFormer models with identical architecture configurations, making the comparison controlled.
- No error bars, significance tests, or multi-run statistics are reported.
Hardware
- Inference: All timing results measured on a single core of an AMD EPYC 7763 CPU @ 2.45 GHz.
- Training: GPU type, count, and training duration are not reported.
BibTeX
@inproceedings{lysak2023optimized,
title={Optimized Table Tokenization for Table Structure Recognition},
author={Lysak, Maksym and Nassar, Ahmed and Livathinos, Nikolaos and Auer, Christoph and Staar, Peter},
booktitle={Document Analysis and Recognition -- ICDAR 2023},
pages={37--50},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41679-8_3}
}
TabRecSet: A Large-Scale Bilingual Dataset for End-to-End Table Recognition in the Wild
TL;DR
TabRecSet is a large-scale, bilingual (English and Chinese) dataset of 38,177 table images collected from multiple real-world scenarios, providing polygon-based spatial annotations for all three table recognition sub-tasks: table detection (TD), table structure recognition (TSR), and table content recognition (TCR). The authors also release TableMe, a custom annotation tool that assisted in building the dataset, and they benchmark several existing models on the new data.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is a new dataset. The paper describes a careful collection, cleaning, and annotation pipeline to produce a reusable benchmark with specific properties (bilingual, wild scenarios, polygon annotations, end-to-end coverage) not found in any prior dataset.
Secondary: $\Psi_{\text{Evaluation}}$: The paper includes a technical validation section that trains or fine-tunes several published models on TabRecSet and reports baseline performance across all three sub-tasks, establishing initial benchmark numbers.
What is the motivation?
Table recognition (TR) breaks down into three interdependent tasks: table detection (TD) to locate tables in an image, table structure recognition (TSR) to recover the spatial and logical cell layout, and table content recognition (TCR) to extract the text within each cell. Tackling all three simultaneously (end-to-end TR) is a practical goal for document automation, yet no existing dataset supports it well.
Prior datasets have one or more of three shortcomings. First, many restrict annotations to a single sub-task: WTW (14.5k images, 2021) covers TSR and TCR but not TD; PubTabNet (568k images) covers TSR and TCR but not TD; PubTables-1M covers all three but comes entirely from scanned, axis-aligned PDF documents with no real-world distortion. Second, large-scale datasets published before 2019 rely on programmatically generated annotations from structured sources (LaTeX, PDFs) rather than human annotation of real images, which limits background and layout diversity. Third, most datasets use bounding boxes or quadrilaterals for spatial annotation, which cannot accurately represent distorted or curved table cells common in camera-captured documents.
The gap is a human-annotated, multi-scenario, bilingual dataset that covers TD, TSR, and TCR with polygon-level spatial precision for irregular tables.
What is the novelty?
TabRecSet addresses the three identified gaps in a single release:
Scale and bilingual coverage. The dataset contains 32,072 images and 38,177 table instances: 15,542 images with 20,415 English tables and 16,530 images with 17,762 Chinese tables. An additional 21,228 generated images provide border-incomplete (three-line and no-line) variants. At the time of publication this was approximately 2.6 times larger than the prior largest multi-scenario dataset (WTW at 14,500 images).
Wild scenarios. Images are collected through reverse-image search engines using seed images of real tables, then filtered manually. This produces samples from documents, Excel spreadsheets, exam papers, financial invoices, ingredient labels, and books, both scanned and camera-captured. As a result, many tables exhibit rotation, inclination, and perspective or warp distortion.
Polygon spatial annotation. Where most datasets use bounding boxes or quadrilaterals for cell boundaries, TabRecSet annotates both the outer table body and individual cells using polygons. This provides greater precision for distorted cells, and the polygon annotations are shared between the all-line tables and their generated border-incomplete counterparts so that the invisible borders remain annotatable.
Complete end-to-end annotation. Each sample carries three aligned annotation types stored in a LabelMe-based JSON format:
- Table body polygon for TD
- Per-cell polygon plus logical properties (row, column, rowspan, colspan) for TSR
- Per-cell text content (including hand-written text, with # marking indistinguishable characters) for TCR
The cell label string encodes all four logical properties and text as <Row>-<Column>-<Rowspan>-<Colspan>-<Text content>.
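Since the text content can itself contain hyphens, a parser should split on at most the first four separators. A minimal sketch (field order from the paper; the hyphen handling is our assumption):

```python
def parse_cell_label(label: str) -> dict:
    """Parse a TabRecSet cell label of the form
    <Row>-<Column>-<Rowspan>-<Colspan>-<Text content>."""
    row, col, rowspan, colspan, text = label.split("-", 4)
    return {"row": int(row), "column": int(col),
            "rowspan": int(rowspan), "colspan": int(colspan),
            "text": text}

parse_cell_label("2-1-1-3-Total cost (USD)")
# {'row': 2, 'column': 1, 'rowspan': 1, 'colspan': 3, 'text': 'Total cost (USD)'}
```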
Automated annotation assistance. The authors develop a TSR auto-annotation algorithm that rectifies irregular tables using Thin-Plate Spline (TPS) transformation to remove curvature distortion and Affine transformation to remove rotation, then computes logical cell properties by analyzing corner-point proximity on the rectified image. This algorithm achieves approximately 80% accuracy, leaving annotators to correct only the remaining 20% manually. Table body polygons are auto-generated from cell polygons using morphological closure and contour tracing.
Border-incomplete generation. Three-line and no-line table variants are generated by pixel-level edge erasure: the program identifies pixel runs on cell borders from polygon annotations and replaces them with the median background color of a surrounding kernel, effectively removing the visible line without altering the underlying annotation.
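A rough NumPy rendering of that erasure step (the kernel size and the boolean border mask are our assumptions; the paper describes the idea, not an implementation):

```python
import numpy as np

def erase_border_pixels(img, border_mask, k=7):
    """Replace each annotated border pixel with the median colour of
    the non-border pixels in its k x k neighbourhood, removing the
    visible line while the polygon annotation stays unchanged."""
    out = img.copy()
    h, w = border_mask.shape
    r = k // 2
    for y, x in zip(*np.nonzero(border_mask)):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        x0, x1 = max(0, x - r), min(w, x + r + 1)
        keep = ~border_mask[y0:y1, x0:x1]
        if keep.any():
            out[y, x] = np.median(img[y0:y1, x0:x1][keep], axis=0)
    return out
```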
TableMe annotation tool. The team releases a custom annotation tool built on LabelMe that adds table-specific features: group assignment for multi-table images, batch logical property assignment for a row or column of cells at once, and an interactive digital-table view that renders the current annotation as a live spreadsheet. Annotators can type cell content directly into the rendered cells, and structure errors become immediately visible.
What experiments were performed?
The technical validation section trains or fine-tunes four published models on an 80/20 train/test split of TabRecSet and reports results on the held-out test set. No validation set is reported separately.
Topology structure and content recognition (TSR- + TCR). EDD (the sequence-based HTML prediction model from PubTabNet) is evaluated in two settings: trained on PubTabNet and evaluated zero-shot on TabRecSet, and fine-tuned on the TabRecSet training set. The metric is Tree-Edit-Distance-based Similarity (TEDS), computed either on the HTML structure alone (TEDS-S) or on structure and content jointly (TEDS-All).
Spatial and topological structure recognition (TSR). TableMaster (sequence-based, predicts HTML and bounding boxes) and TGRNet (graph-based, predicts a table graph) are tested. TableMaster is evaluated with PubTabNet pre-training followed by TabRecSet fine-tuning; the metric is TEDS-S for topology and bounding-box precision for spatial detection. TGRNet is trained from scratch on TabRecSet and evaluated with logical-property classification accuracy and cell detection precision.
Table detection (TD). CDeC-Net is trained on the TabRecSet training set and evaluated with Average Precision (AP) for table segmentation.
Results are summarized in Table 6 of the paper. Selected numbers:
| Model | Condition | Metric | Value |
|---|---|---|---|
| EDD | PubTabNet only, zero-shot on TabRecSet | TEDS-S | 72.34% |
| EDD | Fine-tuned on TabRecSet | TEDS-S | 90.68% |
| EDD | Fine-tuned on TabRecSet | TEDS-All | 70.70% |
| EDD | Trained from scratch on TabRecSet | TEDS-S | 51.75% |
| TableMaster | Fine-tuned on TabRecSet | TEDS-S | 93.13% |
| TableMaster | Fine-tuned on TabRecSet | Cell precision | 11.00% |
| TGRNet | Trained from scratch | TSR(-) Acc. | 65.66% |
| TGRNet | Trained from scratch | Cell precision | 74.82% |
| CDeC-Net | Trained from scratch | AP-Table | 92.80% |
The large gap between TEDS-S (93.13%) and cell precision (11.00%) for TableMaster illustrates that regression-based spatial localization fails on distorted tables even when topology is recovered correctly.
What are the outcomes/conclusions?
TabRecSet establishes a benchmark for end-to-end table recognition in multi-scenario, real-world conditions. The results from the validation experiments suggest the following:
Fine-tuning on TabRecSet substantially improves model performance relative to zero-shot transfer. EDD gains 18 TEDS-S points (72.34% to 90.68%) after fine-tuning, which indicates that wild-scenario diversity is genuinely different from the scanned-document distribution of PubTabNet.
Training from scratch on TabRecSet yields limited performance for sequence-based models. EDD’s TEDS-S of 51.75% and TableMaster’s TEDS-S of 16.61% without pre-training suggest that the dataset’s structural complexity and visual diversity make it harder than prior benchmarks for methods that have not been pre-trained on a simpler distribution first.
Spatial cell localization on distorted tables remains an open problem. TableMaster’s cell precision of 11.00% after fine-tuning (compared to 93.13% TEDS-S for topology alone) makes clear that recovering the spatial locations of cells in irregular tables is the harder sub-task, and that existing regression-based approaches are insufficient for this.
Graph-based TSR models handle the polygon-annotated data reasonably well. TGRNet achieves 74.82% cell precision and 65.66% logical property accuracy when trained from scratch, suggesting that the dataset is usable for this class of methods.
The authors do not report an end-to-end TR result because, at the time of writing, no published model chains TD, TSR, and TCR together. The dataset is intended to enable that line of research.
Limitations. The border-incomplete tables are generated from all-line tables rather than collected from natural sources, so the distribution of these variants may not fully reflect organic three-line or no-line tables found in documents. The data collection relies on Creative Commons-licensed images from web search, and the geographic and domain balance is not fully characterized. Annotation quality is validated by cross-checking and one round of proofreading, but inter-annotator agreement statistics are not reported.
Reproducibility
Models
No model weights are released as part of this paper; the four validated models (EDD, TableMaster, TGRNet, CDeC-Net) are third-party published systems. The paper reports that training was performed using the official code and pre-trained checkpoints for each model, but links to those checkpoints or exact configuration files are not included in the paper or repository.
Algorithms
The TSR auto-annotation algorithm uses TPS transformation (via a TPS interpolation library) and OpenCV’s findContour API. The border-incomplete generation algorithm is purely image-based, using a median-filter kernel to erase border pixels. Python source code for both algorithms is available in the GitHub repository. No optimizer, learning rate, or training schedule is reported for the baselines since those use the original training procedures from their respective papers.
Data
- Total size: 32,072 images, 38,177 tables (original all-line subset). An additional 21,228 generated images add three-line and no-line variants: 5,113 English (6,728 tables) and 5,501 Chinese (5,911 tables).
- English subset (original): 15,542 images, 20,415 tables.
- Chinese subset (original): 16,530 images, 17,762 tables.
- Scenarios: scanned documents, camera-captured documents, Excel spreadsheets, exam papers, financial invoices, ingredient labels.
- Table types: all-line (border-complete), three-line, and no-line (border-incomplete); regular and irregular (rotated, distorted, nested, over/under-exposed, hand-written).
- Annotation format: LabelMe JSON. TD annotation in table-wise JSON files; TSR + TCR annotation in cell-wise JSON files. Images are JPG. All-line originals and their generated border-incomplete counterparts share the same filename and annotation files.
- Public availability: Dataset hosted on Figshare (https://doi.org/10.6084/m9.figshare.20647788), licensed CC-BY-SA 4.0. The DOI resolves correctly (verified). The GitHub repository (https://github.com/MaxKinny/TabRecSet) is independently licensed CC-BY-SA 4.0 (verified).
- Splits: There is no fixed official train/val/test split in the release. The 80% training / 20% test division used in the paper’s technical validation experiments is the authors’ own partition for those experiments only. The Usage Notes section of the paper recommends that users mix and re-divide the data themselves according to their task.
- Data collection: images sourced from web search (Google, Baidu) using Creative Commons license filters or manual source verification; watermarked, privacy-sensitive, and duplicate images removed; samples from datasets without derivative licenses excluded; approximately 30% of raw downloads were filtered out in data cleaning.
- Annotation process: five qualified annotators (including some authors) were involved. Human annotation was performed with TableMe, assisted by the TSR auto-annotation algorithm (approximately 80% automation rate for logical properties). At each of the four annotation steps, annotators exchanged sub-datasets and cross-checked each other’s work. One final proofreading pass by a designated checker reviewed all samples for remaining dirty images and incorrect annotations; wrongly generated border-incomplete tables were deleted rather than corrected. Inter-annotator agreement statistics are not reported.
Evaluation
- TEDS / TEDS-S: Tree-Edit-Distance-based Similarity comparing predicted HTML sequences to ground-truth. TEDS-S ignores cell text content; TEDS-All includes it.
- TSR(-) Acc.: Classification accuracy of logical properties (row, column, rowspan, colspan) per cell.
- P-Cell: Bounding-box precision for detected cells, following standard object detection precision as defined in Liu et al. (2020, ref. 26 in the paper). The IoU threshold used for cell matching is not specified in the paper.
- AP-Table: Average Precision for table detection/segmentation, also following ref. 26. The IoU threshold is not specified in the paper.
- Baselines are not evaluated under identical compute budgets; some are fine-tuned from PubTabNet pre-trained weights while others are trained from scratch.
- No error bars, confidence intervals, or multiple-run statistics are reported.
Hardware
Training hardware, GPU counts, and training duration are not reported. The paper does not include inference latency or memory requirements for any of the evaluated models.
BibTeX
@article{yang2023large,
title={A large-scale dataset for end-to-end table recognition in the wild},
author={Yang, Fan and Hu, Lei and Liu, Xinwu and Huang, Shuangping and Gu, Zhenghui},
journal={Scientific Data},
volume={10},
number={1},
pages={110},
year={2023},
publisher={Nature Publishing Group UK London}
}
SEMv2: Table Separation Line Detection Based on Instance Segmentation
TL;DR
SEMv2 reformulates the “split” stage of split-and-merge table structure recognition as an instance segmentation task, using conditional convolution to predict individual masks for each table separation line. The paper also introduces the iFLYTAB dataset, which covers wired and wireless tables across digital and camera-captured scenarios. SEMv2 achieves competitive results on SciTSR and PubTabNet, and substantially outperforms its predecessor SEM on the more challenging iFLYTAB benchmark.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. The central contribution is a new approach to table separation line detection: replacing semantic segmentation with instance segmentation via conditional convolution. The paper devotes the majority of its technical content to the splitter architecture (Gather modules, kernel/feature branch decoupling, mask-to-line post-processing) and the parallel merger design.
Secondary: $\Psi_{\text{Resource}}$. The authors release iFLYTAB, a 17,291-image dataset spanning wired/wireless tables in both digital and camera-captured settings. This fills a gap left by prior datasets that focus exclusively on digital documents or only wired tables.
What is the motivation?
Split-and-merge methods for table structure recognition first divide a table into a basic grid of rows and columns, then merge grid cells that span multiple rows or columns. Prior approaches (SEM, SPLERGE, RobusTabNet) handle the “split” stage using semantic segmentation: they predict a single binary mask for all row separation lines and another for all column separation lines. This creates two problems:
- Limited receptive field. Pixel-wise semantic segmentation struggles to capture long-range dependencies across an entire table image, producing noisy or incomplete masks, especially for camera-captured images with distortion or complex backgrounds.
- Complex post-processing. Extracting individual separation lines from a single semantic mask requires heuristic mask-to-line algorithms that make strong assumptions (e.g., straight, axis-aligned lines) and fail on distorted or curved tables.
In addition, existing TSR datasets are concentrated on clean, axis-aligned tables from digital PDF documents. The WTW dataset introduced camera-captured tables but only covers wired tables. There was no public benchmark covering both wired and wireless tables across digital and photographic scenarios.
What is the novelty?
Instance segmentation for table separation line detection
SEMv2 treats each row or column separation line as a distinct instance. The architecture decouples mask generation into two branches:
- Kernel branch: Produces per-instance convolution kernels $\theta^{\text{col}} \in \mathbb{R}^{1 \times \frac{W}{4} \times C}$ (or $\theta^{\text{row}}$ for rows). Each kernel corresponds to one separation line instance.
- Feature branch: Generates feature maps $\mathbf{F}^{\text{col}} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$ (or $\mathbf{F}^{\text{row}}$) to be convolved with the instance kernels.
Dynamic 1$\times$1 convolution between a selected kernel and the feature map produces a per-instance separation line mask. The final line shape is extracted by finding the maximum-scoring position in each row (for column lines) or each column (for row lines), yielding a simple and robust mask-to-line conversion.
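In PyTorch terms, the dynamic convolution plus mask-to-line conversion is an einsum followed by an argmax. A simplified sketch for column lines (shapes follow the paper; everything else is assumed):

```python
import torch

def column_lines_from_instances(feat, kernels):
    """feat:    (C, H/4, W/4) output of the feature branch
    kernels: (N, C) one predicted kernel per separation-line instance
    Returns the per-instance masks and, for each row of each mask,
    the x position of the maximum score, i.e. the line's trajectory."""
    masks = torch.einsum("nc,chw->nhw", kernels, feat).sigmoid()
    xs = masks.argmax(dim=2)        # (N, H/4)
    return masks, xs
```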
Gather module
To give the kernel branch a receptive field spanning the full table image, the authors introduce the Gather module. It uses repeated downsampling followed by two spatial CNN modules that propagate information top-to-bottom and bottom-to-top (for columns) or left-to-right and right-to-left (for rows). The output is reduced to a 1D representation $G^{\text{col}} \in \mathbb{R}^{1 \times \frac{W}{4} \times C}$ via row-mean pooling, capturing global structural context along the relevant axis.
Parallel merger
The merger in the original SEM used a sequential decoder, whose inference time grew with the number of cells. SEMv2 replaces this with a parallel decoder based on conditional convolution. For each grid cell $(i, j)$, a kernel $e^{k}_{i,j}$ is convolved with the shared feature map $E^{f}$ to predict which other grid cells belong to the same table cell. This runs in parallel across all grid positions, increasing throughput from roughly 3 FPS to over 7 FPS on SciTSR.
Overall loss
The training objective combines five terms:
$$ O = \mathcal{L}_{s}^{\text{row}} + \mathcal{L}_{s}^{\text{col}} + \mathcal{L}_{\text{inst}}^{\text{row}} + \mathcal{L}_{\text{inst}}^{\text{col}} + \mathcal{L}_{m} $$
where $\mathcal{L}_{s}$ terms are sigmoid focal losses on the separation line masks, $\mathcal{L}_{\text{inst}}$ terms are binary cross-entropy losses on the instance detection heads, and $\mathcal{L}_{m}$ is the focal loss on the merge maps.
iFLYTAB dataset
The iFLYTAB dataset contains 17,291 table images (12,104 train, 5,187 test) collected from four categories: digital wired (3,664), digital wireless (4,125), camera-captured wired (4,155), and camera-captured wireless (5,347). Annotations include cell polygons (735,781), text line polygons (1,207,709), row information polygons (207,972), and column information polygons (112,820). Each polygon is labeled with four-vertex coordinates.
What experiments were performed?
Datasets
Experiments cover five benchmarks:
- SciTSR: 15,000 tables from scientific literature (12,000 train / 3,000 test), with a hard subset SciTSR-COMP (716 complex tables). The authors note annotation errors in the test set, which they manually corrected following RobusTabNet.
- PubTabNet: 500,777 training and 9,115 validation tables from scientific articles.
- cTDaR TrackB1-Historical: 600 training and 150 testing samples of historical handwritten documents.
- WTW: 10,970 training and 3,611 testing images of wired tables from wild scenes.
- iFLYTAB: The proposed dataset (12,104 train / 5,187 test).
Metrics
- F1-Measure: Percentage of correctly detected adjacent cell pairs.
- TEDS / TEDS-Struct: Tree-edit-distance-based similarity (structure-only variant ignores OCR content).
- WAvg.F1: Weighted average F1 across IoU thresholds [0.6, 0.7, 0.8, 0.9].
- GriTS: Grid table similarity metric comparing predicted and ground truth tables in matrix form.
Ablation studies
Six system variants (T1 through T6) isolate the contributions of three components:
| Comparison | Finding |
|---|---|
| Instance segmentation splitter (T1) vs. semantic segmentation splitter (T2) | Marginal difference on clean digital tables (SciTSR: 99.3 vs. 98.8 F1), large gap on camera-captured tables (iFLYTAB: 93.5 vs. 77.0 F1) |
| With Gather (T1) vs. without Gather (T3) | F1 improves from 90.4 to 93.5 on iFLYTAB; 98.4 to 99.3 on SciTSR |
| Parallel merger (T1) vs. sequential merger (T5) | Similar accuracy (99.3 vs. 99.2 F1 on SciTSR), but 7.3 FPS vs. 2.9 FPS |
| No merger (T4) | Accuracy drops sharply (98.2 F1 on SciTSR) since spanning cells are ignored |
Comparison with prior methods
- SciTSR: 99.3 F1, matching or close to TSRFormer (99.4) and RobusTabNet (99.3).
- SciTSR-COMP: 98.7 F1, comparable to TSRFormer (98.9).
- PubTabNet: 97.5 TEDS-Struct, tied with TSRFormer for the top reported result.
- cTDaR TrackB1-Historical: 67.5 WAvg.F1, outperforming prior competition participants by a large margin.
- WTW: 93.6 F1, slightly ahead of TSRFormer (93.4).
- iFLYTAB: 93.5 F1 and 92.0 TEDS-Struct, compared to SEM’s 78.0 F1 and 75.9 TEDS-Struct.
What are the outcomes/conclusions?
The results suggest that instance segmentation provides a more robust approach to table separation line detection than semantic segmentation, particularly for tables with non-rigid deformation and complex backgrounds. The instance-level formulation simplifies post-processing: instead of applying heuristic mask-to-line algorithms to a global mask, the method extracts each line directly from its own mask via a row-wise or column-wise argmax.
The Gather module contributes meaningfully by propagating structural context across the full image, which is important for long separation lines that span the entire table. The parallel merger achieves roughly 2.5$\times$ the throughput of the sequential decoder with no meaningful loss in accuracy.
Limitations
The authors identify several failure modes:
- Rotated tables. SEMv2 performs poorly when tables exhibit significant angular rotation. On the “overlaid” subset of WTW (which includes many rotated tables), it falls well behind polygon detection methods (75.1 vs. 84.1 F1).
- Multi-line cell content. Without a textual branch (removed from the SEM baseline for efficiency), SEMv2 sometimes mis-splits cells containing multiple text lines.
- Moire patterns. Screen-captured images with pronounced moire artifacts degrade recognition quality.
- No text content annotations. iFLYTAB does not include OCR-level text annotations, which limits its use for end-to-end table recognition evaluation.
Reproducibility
Models
- Backbone: ResNet-34 pretrained on ImageNet, with FPN producing four feature levels ($P_2$ through $P_5$ at scales 1/4 to 1/32).
- Feature channels: $C = 256$ (FPN/splitter), $D = 512$ (embedder/merger).
- RoIAlign pool size: $3 \times 3$.
- Embedder transformer: Used to capture long-range dependencies among grid-level features. The paper does not specify the number of layers or heads.
- Weights: The GitHub repository provides pretrained models, though the specific format and completeness are not documented in the paper.
Algorithms
- Optimizer: ADADELTA with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-9}$.
- Learning rate schedule: Cosine annealing, $\eta_{\text{max}} = 10^{-4}$, $\eta_{\text{min}} = 10^{-6}$.
- Loss functions: Sigmoid focal loss ($\alpha = 0.25$, $\gamma = 2$) for separation line masks and merge maps; binary cross-entropy for instance detection.
- Binarization threshold: 0.5 for all binary decisions (instance detection, merge maps).
- Training strategy: End-to-end (not multi-stage). Each dataset is trained independently. For cTDaR, the model is initialized with iFLYTAB-trained weights, then fine-tuned.
- Data augmentation: On iFLYTAB, images are resized with a random ratio in $[0.8, 1.2]$. On SciTSR and PubTabNet, original image sizes are used.
Data
- SciTSR: 15,000 tables from scientific literature. Publicly available. The authors corrected annotation errors in the test set following RobusTabNet.
- PubTabNet: ~568K tables from PubMed Central. Publicly available under CDLA-Permissive.
- cTDaR TrackB1-Historical: 750 historical handwritten document tables. Available through ICDAR 2019 competition.
- WTW: ~14.5K wild table images. Publicly available.
- iFLYTAB: 17,291 images (70/30 train/test split). Available through the SEMv2 GitHub repository. License is not specified. No text content annotations are provided; evaluation uses unique markers per text line as proxies.
Evaluation
- Metrics: F1-Measure, TEDS, TEDS-Struct, WAvg.F1, GriTS. Evaluation code is released publicly.
- Baselines: SEM (reimplemented without textual branch for fair comparison on iFLYTAB), TSRFormer, RobusTabNet, LGPMA, and others. On SciTSR, the corrected test annotations may make comparison with older methods slightly unfair (the authors acknowledge this).
- Limitations of evaluation: iFLYTAB lacks text content, so TEDS (which accounts for text) cannot be used in its standard form; TEDS-Struct is used instead.
- Statistical rigor: No error bars, significance tests, or multi-run results are reported.
Hardware
- SciTSR / PubTabNet / cTDaR: Single NVIDIA Tesla V100 (32 GB), batch size 8.
- iFLYTAB / WTW: 8$\times$ NVIDIA Tesla V100 (32 GB each), batch size 48.
- Inference speed: 7.3 FPS on SciTSR with the full model (single GPU). No total training time or GPU-hour estimates are reported.
- Framework: PyTorch.
BibTeX
@article{zhang2024semv2,
title={SEMv2: Table Separation Line Detection Based on Instance Segmentation},
author={Zhang, Zhenrong and Hu, Pengfei and Ma, Jiefeng and Du, Jun and Zhang, Jianshu and Yin, Baocai and Yin, Bing and Liu, Cong},
journal={Pattern Recognition},
year={2024},
publisher={Elsevier},
doi={10.1016/j.patcog.2024.110490}
}
LORE: Logical Location Regression for Table Structure Recognition
TL;DR
LORE (LOgical location REgression network) addresses table structure recognition (TSR) by directly regressing the logical row/column indices of each cell, rather than predicting adjacency relations or generating markup sequences. A cascade of self-attention regressors, combined with inter-cell and intra-cell constraint losses, captures dependencies between logical locations and achieves parallel inference that is over 30x faster than encoder-decoder baselines while requiring far less training data.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the paper proposes a TSR paradigm built around a regression objective, a cascade architecture, and auxiliary constraint losses. The headline contribution is the method design, validated with detailed ablations and cross-paradigm comparisons.
Secondary: none.
What is the motivation?
TSR methods at the time of this work fell into two main paradigms. Adjacency-based models classify pairs of detected cells as horizontally adjacent, vertically adjacent, or unrelated. While effective on clean datasets, recovering a complete grid structure from pairwise relations requires heuristic post-processing or graph optimization, and adjacency metrics can be satisfied by locally consistent predictions that are globally incoherent (the authors demonstrate this with a “shifted structure” example where adjacency F1 is 84% but logical accuracy is only 43%). Markup-sequence models (e.g., EDD) sidestep post-processing by generating HTML token sequences from table images, but they require roughly 17x more training data than LORE and suffer from slow sequential decoding (14.8 seconds per sample vs. 0.45 seconds for LORE on PubTabNet).
A third line of work (TGRNet) approached TSR via ordinal classification of logical indices but treated each cell independently, ignoring the structural dependencies between adjacent cells’ coordinates. LORE extends this direction into a regression framework that explicitly models those dependencies.
What is the novelty?
LORE represents each table cell by four integers: starting row $r_s$, ending row $r_e$, starting column $c_s$, and ending column $c_e$. Given a table image, the model predicts $\{(r_s^{(i)}, r_e^{(i)}, c_s^{(i)}, c_e^{(i)})\}_{i=1}^{N}$ jointly with the four spatial corner points $\{B^{(i)}\}_{i=1}^{N}$ for all $N$ detected cells.
Architecture. A DLA-34 backbone with output stride $R = 4$ and $d = 256$ channels produces cell center heatmaps and visual features. Cell center locations are detected with CenterNet-style keypoint segmentation, and spatial corner points are regressed directly from those center features.
For logical location prediction, each cell’s feature vector is enriched with 2D position embeddings from the four predicted corner points:
$$ \tilde{F}_{(\hat{x}_k^{(i)}, \hat{y}_k^{(i)}, :)} = f_{(\hat{x}_k^{(i)}, \hat{y}_k^{(i)}, :)} + PE(\hat{x}_k^{(i)}, \hat{y}_k^{(i)}) $$
These enriched corner features are combined with the center feature via a learned weighted sum to form per-cell representations:
$$ h^{(i)} = f^{(i)} + \sum_{k=1}^{4} w_k \tilde{F}_{(\hat{x}_k^{(i)}, \hat{y}_k^{(i)}, :)} $$
where $[w_1, w_2, w_3, w_4]$ are learnable scalars. These representations are passed through a self-attention encoder (the Base Regressor) to aggregate cross-cell context. The base regressor outputs a first-pass logical coordinate estimate $\hat{l}^{(i)}$.
The Stacking Regressor refines this estimate by conditioning on it:
$$ \tilde{l} = F_s(W_s \hat{l} + \tilde{h}) $$
where $W_s \in \mathbb{R}^{4 \times d}$ projects the predicted indices into feature space, $\tilde{h}$ denotes the contextualized cell features produced by the base regressor, and $F_s$ is a second self-attention encoder with independent weights. At inference, predicted indices are rounded to the nearest integer.
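A minimal PyTorch sketch of the base-plus-stacking cascade described above, using the paper's $d = 256$ and 3 layers per regressor; the head count and FFN width are assumptions, since the paper does not state them:

```python
import torch
import torch.nn as nn

def make_encoder(d: int, layers: int, heads: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class CascadeLogicalRegressor(nn.Module):
    """Base regressor -> first-pass logical coords -> stacking regressor."""

    def __init__(self, d: int = 256, layers: int = 3, heads: int = 8):
        super().__init__()
        self.base = make_encoder(d, layers, heads)
        self.stack = make_encoder(d, layers, heads)
        self.base_head = nn.Linear(d, 4)   # (r_s, r_e, c_s, c_e)
        self.stack_head = nn.Linear(d, 4)
        self.W_s = nn.Linear(4, d)         # projects indices back to feature space

    def forward(self, h: torch.Tensor):
        # h: (B, N, d) per-cell features (center + weighted corner features).
        h_tilde = self.base(h)
        l_hat = self.base_head(h_tilde)                      # first-pass estimate
        l_tilde = self.stack_head(self.stack(self.W_s(l_hat) + h_tilde))
        return l_hat, l_tilde   # round l_tilde to integers at inference
```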
Constraint losses. Two auxiliary supervision terms enforce structural consistency on the stacking regressor’s output. The inter-cell loss penalizes logical overlap between horizontally or vertically adjacent cell pairs:
$$ \begin{aligned} L_{\text{inter}} = &\sum_{(i,j) \in A_r} \max(\tilde{r}_e^{(j)} - \tilde{r}_s^{(i)} + 1,\ 0) \\ &+ \sum_{(i,j) \in A_c} \max(\tilde{c}_e^{(j)} - \tilde{c}_s^{(i)} + 1,\ 0) \end{aligned} $$
The intra-cell loss penalizes inconsistency in spanning cell extents across multi-row or multi-column cells:
$$ \begin{aligned} L_{\text{intra}} = &\sum_{i \in M_r} |\tilde{r}_s^{(i)} - \tilde{r}_e^{(i)} - r_s^{(i)} + r_e^{(i)}| \\ &+ \sum_{i \in M_c} |\tilde{c}_s^{(i)} - \tilde{c}_e^{(i)} - c_s^{(i)} + c_e^{(i)}| \end{aligned} $$
The total training objective combines all losses:
$$ L_{\text{LORE}} = L_{\text{center}} + L_{\text{spa}} + L_{\text{log}} + L_{\text{I2C}} $$
where $L_{\text{I2C}} = L_{\text{inter}} + L_{\text{intra}}$, and $L_{\text{log}}$ is the L1 regression loss computed jointly over both the base and stacking regressor outputs:
$$ L_{\text{log}} = \frac{1}{N} \sum_{i=1}^{N} \left( |\hat{l}^{(i)} - l_i|_1 + |\tilde{l}^{(i)} - l_i|_1 \right) $$
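The two constraint terms translate directly into a few lines of PyTorch; a sketch, assuming the adjacency pair sets $A_r, A_c$ and spanning-cell index sets $M_r, M_c$ have been precomputed from the ground truth:

```python
import torch

def inter_cell_loss(l_tilde, row_pairs, col_pairs):
    """L_inter: penalize logical overlap of adjacent cells.

    l_tilde: (N, 4) predicted (r_s, r_e, c_s, c_e);
    *_pairs: (P, 2) long tensors of (i, j) pairs where cell j precedes cell i.
    """
    r_s, r_e, c_s, c_e = l_tilde.unbind(-1)
    i, j = row_pairs.unbind(-1)
    loss = torch.clamp(r_e[j] - r_s[i] + 1, min=0).sum()
    i, j = col_pairs.unbind(-1)
    return loss + torch.clamp(c_e[j] - c_s[i] + 1, min=0).sum()

def intra_cell_loss(l_tilde, l_gt, row_span_idx, col_span_idx):
    """L_intra: keep predicted span extents of multi-row/column cells
    consistent with the ground-truth extent."""
    dr = (l_tilde[:, 0] - l_tilde[:, 1]) - (l_gt[:, 0] - l_gt[:, 1])
    dc = (l_tilde[:, 2] - l_tilde[:, 3]) - (l_gt[:, 2] - l_gt[:, 3])
    return dr[row_span_idx].abs().sum() + dc[col_span_idx].abs().sum()
```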
What experiments were performed?
LORE is evaluated across seven benchmarks spanning digital-born and physical table images:
- ICDAR-2013: cross-validated (no training split); 238 images.
- SciTSR-comp: 716 complex-table test images from the SciTSR split.
- PubTabNet: 20,000 training images sampled from the full 339k set; evaluated on TEDS.
- TableBank: evaluated on TEDS and BLEU; LORE trained on SciTSR (roughly 1/10 the training size).
- ICDAR-2019: scanned documents; adjacency F1.
- WTW: wild table photos with distortion; adjacency F1 and logical location accuracy.
- TableGraph-24K: used for ablation (logical accuracy metric).
Baselines span three paradigms:
- Logical location: ReS2TIM, TGRNet
- Adjacency: TabStrNet, LGPMA, TOD, FLAGNet, NCGM
- Markup sequence: Image2Text, EDD
Adjacency and markup-sequence metrics are derived from LORE’s predicted logical coordinates via deterministic transformations, not retrained models.
Ablation study (on WTW, Table 4) examines: (a) inter-cell loss only, (b) intra-cell loss only, (c) both, (d) GNN encoder vs. self-attention encoder, (e) single 6-layer regressor vs. cascade of two 3-layer regressors.
What are the outcomes/conclusions?
Against logical-location baselines, LORE substantially improves accuracy: on WTW it achieves 96.4 cell-detection F1 and 82.9 logical accuracy, compared to TGRNet’s 64.7 and 24.3 respectively. Against adjacency baselines, LORE matches or exceeds the top-performing published methods: 99.3 F1 on SciTSR-comp and 95.1 F1 on WTW (vs. NCGM’s 94.1). Against markup-sequence baselines, LORE reaches 98.1 TEDS on PubTabNet (vs. EDD’s 89.9) using only 20k training images compared to EDD’s 339k, and runs inference in 0.45 seconds per image vs. 14.8 seconds for EDD.
Ablation results show that intra-cell supervision contributes more than inter-cell supervision (+1.8% vs. +0.8% accuracy), and the cascade design outperforms a single regressor of equal depth by 3.1% accuracy. Self-attention message passing outperforms graph-based aggregation by about 5.9% accuracy, which the authors attribute to the inductive bias of nearest-neighbor graphs being suboptimal for table structure.
A controlled paradigm comparison (Table 5) is informative: a model retrained under the adjacency paradigm achieves 94.3 F1 on WTW adjacency metrics, but recovers only 51.9% logical accuracy for all cells and 20.2% for spanning cells. LORE (logical paradigm) achieves 95.1 F1 on adjacency metrics while simultaneously achieving 82.9% and 63.8% logical accuracy, underscoring the information loss in purely adjacency-based representations.
Limitations. LORE requires OCR text extracted from PDFs for the TEDS evaluation, so it is not fully image-only for all benchmarks. The PubTabNet comparison uses only 20k out of 339k training images; performance with the full training set is not reported. The method is not evaluated on newer complex benchmarks such as FinTabNet.
Reproducibility
Models
- Backbone: DLA-34 (Deep Layer Aggregation), 24.2M total parameters, 75.2 GFLOPs at 1024x1024 with 32 cells.
- Output stride $R = 4$, hidden size $d = 256$.
- Self-attention encoder: 3 layers for both base and stacking regressors.
- Corner point estimation added for WTW (following Long et al. 2021).
- Three pretrained checkpoints are released via the Alibaba AdvancedLiterateMachinery GitHub repository under Apache-2.0:
  - `ckpt_wtw` (DLA-34, 1024 px, trained on WTW), `ckpt_ptn` (DLA-34, 512 px, trained on PubTabNet), and `ckpt_wireless` (ResNet-18, 768 px, trained on SciTSR + PubTabNet + Chinese tables). Weights are hosted on Google Drive; a ModelScope version is also available.
- The code base is also released under Apache-2.0, restricted to research use per Alibaba’s copyright notice.
Algorithms
- Optimizer: not stated in the paper.
- Learning rate: $1 \times 10^{-4}$, decayed to $1 \times 10^{-5}$ at epoch 70 and $1 \times 10^{-6}$ at epoch 90.
- Training duration: 100 epochs.
- Input resolution: max side 1024 (512 for SciTSR and PubTabNet).
- Batch size: not stated.
- Runs: 5 seeds, average reported.
Data
- Seven benchmarks used for training and evaluation; see datasets section above.
- PubTabNet: only 20k randomly sampled training images used (full split is 339k).
- ICDAR-2013: no training split; cross-validation protocol following prior work.
- TableBank: LORE trained on SciTSR for TableBank evaluation (cross-dataset transfer).
- No custom preprocessing or filtering pipelines are described beyond standard image resizing.
Evaluation
- Logical location accuracy: fraction of cells with all four indices exactly correct; reported separately for row (A-r), column (A-c), and all indices (Acc).
- Adjacency F1: precision/recall on predicted adjacent cell pairs.
- TEDS: tree edit distance similarity on HTML table trees; PDF text extracted following Zheng et al. 2021.
- BLEU (4-gram): used for TableBank markup evaluation.
- No error bars beyond the 5-run average are reported; significance tests not provided.
- Adjacency-based results for LORE are derived post-hoc from predicted logical coordinates, which makes the comparison methodology transparent but means LORE does not optimize directly for adjacency metrics.
Hardware
- 4x NVIDIA Tesla V100 GPUs.
- GPU-hours and memory requirements not reported.
- Inference time: 0.45 seconds/image on PubTabNet validation set at 1280x1280 (compared to 14.8 seconds for EDD under the same conditions).
- 24.2M parameter count makes the model small enough for practical deployment, though memory requirements at training time are not stated.
BibTeX
@inproceedings{xing2023lore,
title={{LORE}: Logical Location Regression Network for Table Structure Recognition},
author={Xing, Hangdi and Gao, Feiyu and Long, Rujiao and Bu, Jiajun and Zheng, Qi and Li, Liangcheng and Yao, Cong and Yu, Zhi},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={37},
number={3},
pages={2992--3000},
year={2023}
}
Deep Learning for Table Detection and Structure Recognition: A Survey
TL;DR
A comprehensive survey of deep learning approaches to table detection (TD) and table structure recognition (TSR) as of late 2022. The authors review 19 datasets, provide a historical taxonomy from heuristic to deep learning methods, and run comparative experiments across multiple backbone-detector combinations on the TNCR dataset.
What kind of paper is this?
Dominant: $\Psi_{\text{Systematic}}$: The paper’s primary contribution is synthesizing and taxonomizing a large body of work on TD and TSR into a structured landscape. It covers historical progression, datasets, architectures, and open problems without proposing a new model.
Secondary: $\Psi_{\text{Evaluation}}$: Beyond literature synthesis, the authors contribute comparative experimental tables. They run a set of backbone-detector combinations (HRNet, ResNeSt, Dynamic R-CNN) on the TNCR dataset and compile cross-paper result comparisons on ICDAR 2013/2017/2019, UNLV, Marmot, and TableBank.
What is the motivation?
Tables appear throughout documents and extracting their content programmatically requires two sequential steps: locating the table region (TD) and then parsing its internal grid structure (TSR). The deep learning literature on both tasks grew rapidly between 2017 and 2022, producing diverse architectures and datasets that were difficult to compare systematically. Existing reviews did not cover both tasks together or were based primarily on classical methods. The authors argue a comprehensive survey that unifies terminology, organizes datasets, and benchmarks representative methods is needed to orient both newcomers and practitioners.
What is the novelty?
The paper does not introduce a new model. Its contributions are organizational:
Historical taxonomy. Methods are grouped into three eras: (1) heuristic-based (1990s to early 2010s), relying on line detection, spatial features, and rule-based logic; (2) machine learning-based (2000s to 2010s), using SVMs, decision trees, and hidden Markov models on handcrafted features; and (3) deep learning-based (2017 to 2022), covering Faster R-CNN, Mask R-CNN, Cascade Mask R-CNN, DETR variants, and encoder-decoder architectures for TSR.
Dataset inventory. Table 1 in the paper catalogs 19 datasets with their sizes, task coverage (TD / TSR / classification), annotation source, and scan type. This includes ICDAR 2013, ICDAR 2017 POD, ICDAR 2019, TABLE2LATEX-450K, IIIT-AR-13K, UNLV, Marmot, TableBank, PubTabNet, PubTables-1M, FinTabNet, SciTSR, SynthTabNet, TNCR, and several smaller sets.
Table classification coverage. The survey explicitly covers table type classification (a five-class problem on the TNCR dataset: full-lined, no-lined, merged, partial-lined, partial-lined-merged), which prior surveys had not addressed.
Comparative benchmark. Section 7 contains the authors’ own experiments running multiple backbone-detector combinations on TNCR with precision, recall, and F1 evaluated at ten IoU thresholds from 50% to 95% in 5% steps.
Open repository. The authors maintain a GitHub repository cataloging recent publications, open data, and code.
What experiments were performed?
Own Experiments
The authors evaluate table detection on the TNCR dataset (9,428 tables from FDA drug label images, MIT license, introduced with a 5-class table type annotation scheme). They test:
- HRNet variants (HRNetV2p-W18, HRNetV2p-W32) with three detection heads: HTC, Mask R-CNN, Cascade Mask R-CNN.
- ResNeSt variants (S-50, S-101) with Faster R-CNN.
- Dynamic R-CNN (ResNet-50 backbone) with Faster R-CNN.
Results are reported as precision, recall, and F1 at each IoU threshold from 50% to 95%, plus the averaged 50%:95% score. Selected highlights:
| Backbone | Detector | F1@50% | F1@50%:95% |
|---|---|---|---|
| HRNetV2p-W18 | HTC | 0.933 | 0.840 |
| HRNetV2p-W32 | Mask R-CNN | 0.911 | 0.871 |
| ResNeSt-101 | Faster R-CNN | 0.934 | 0.748 |
| Dynamic R-CNN (ResNet-50) | Faster R-CNN | 0.912 | 0.628 |
HRNet-based detectors show higher performance at tighter IoU thresholds (80% to 95%), while ResNeSt achieves similar peak F1 at 50% but degrades faster. The W32 models sometimes overfit on the smaller TNCR training set.
Literature Compilation
Tables 16 to 18 compile results from prior work across multiple datasets. Table detection methods are compared on UNLV, ICDAR 2013, ICDAR 2017, ICDAR 2019, Marmot, and TableBank using F1 at various IoU thresholds. TSR methods are compared on ICDAR 2013 using F1 at IoU 50%.
The metrics used across studies are inconsistent: some papers report F1 at a fixed IoU (50%, 60%, or 80%), others report mAP, and IoU thresholds are not always specified. The survey adopts a unified IoU sweep where possible; where it is not, results are reported at whatever threshold the original paper used.
What are the outcomes/conclusions?
The survey finds that deep learning has substantially advanced both tasks:
- TD on standard benchmarks is largely saturated, per the survey’s results. Several methods exceed F1 0.99 at IoU 50% on ICDAR 2013. Deformable CNN + Faster R-CNN (Siddiqui et al.) reaches F1 99.6%.
- TSR is more difficult than TD, the survey finds. Performance drops sharply at tighter IoU thresholds and on complex spanning-cell structures. ICDAR 2019 results show F1 values in the 40% to 55% range at IoU 50% for structure recognition, indicating that parsing internal table structure is harder than locating the table region.
- Dataset diversity matters. Models trained on scanned scientific articles (ICDAR, PubTabNet) do not transfer cleanly to financial (FinTabNet) or wild-captured images (WTW, CamCap).
- Table classification (the five-class TNCR task) is relatively underexplored and no clear winner emerges from existing work.
The authors conclude that opportunities for improvement remain in TSR and cross-domain generalization, and call for more diverse training sets and unified evaluation protocols.
Limitations
- Coverage stops at 2022; transformer-based TSR methods (Table Transformer, EDD successors) are only briefly mentioned.
- The own experiments are restricted to one dataset (TNCR), which is relatively small and domain-specific (FDA drug labels), so the findings may not generalize to more diverse document types.
- The metric collection across papers is inconsistent and many cells in the comparison tables are empty, making direct comparisons difficult.
- The paper does not discuss statistical significance, variance across runs, or data splits in the own experiments.
- There is limited engagement with the more recent methods that emerged in 2021 to 2022 (e.g., PubTables-1M Table Transformer, EDD-family TSR systems).
Reproducibility
Models
This is a survey; no novel model is proposed. The own experiments use standard implementations from the MMDetection toolbox with HRNet, ResNeSt, and Dynamic R-CNN backbones. All these backbones and detectors have publicly available weights and code.
Algorithms
- Framework: MMDetection (PyTorch)
- Detectors: HTC, Mask R-CNN, Cascade Mask R-CNN, Faster R-CNN
- Backbones: HRNetV2p-W18, HRNetV2p-W32, ResNeSt-50, ResNeSt-101, ResNet-50
- Training schedules: 1x (12 epochs) and 20e (20 epochs), as reported per-model
- No additional training details (learning rate, batch size, augmentation, GPU count) are provided for the own TNCR experiments.
Data
- Own experiments: TNCR dataset, MIT licensed, 6,621 pages / 9,428 table instances, scanned FDA drug label images, 5-class table type classification
- Literature comparison: ICDAR 2013 (238 pages), ICDAR 2017 POD (2,417 images), ICDAR 2019 (2,439 images), UNLV (2,889 pages), Marmot (2,000 pages), TableBank (417k tables, research-only data license), PubTabNet (568k tables), FinTabNet (~113k tables), PubTables-1M (~947k tables), SciTSR (15k tables), SynthTabNet (600k synthetic tables)
Evaluation
- Metrics: Precision, Recall, F1 at IoU thresholds 50% to 95% in 5% steps, plus 50%:95% average (a minimal sweep implementation follows this list)
- The mAP formulation follows COCO convention; AP is computed per-class, then averaged
- No error bars, significance tests, or multi-run statistics are reported for the own experiments
- Cross-paper comparison tables have many missing cells due to inconsistent metric reporting in original papers
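A minimal NumPy sketch of the IoU sweep used in the own experiments; greedy one-to-one matching is an assumption here (the COCO convention ranks predictions by confidence first):

```python
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def f1_sweep(pred, gt, thresholds=np.arange(0.50, 1.00, 0.05)):
    """F1 at each IoU cut, matching each prediction to at most one GT box."""
    out = {}
    for t in thresholds:
        used, tp = set(), 0
        for p in pred:
            best = max(((iou(p, g), k) for k, g in enumerate(gt)
                        if k not in used), default=(0.0, -1))
            if best[0] >= t:
                used.add(best[1]); tp += 1
        prec = tp / max(len(pred), 1)
        rec = tp / max(len(gt), 1)
        out[round(float(t), 2)] = 2 * prec * rec / max(prec + rec, 1e-9)
    return out
```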
Hardware
No hardware details are provided for the own experiments. The MMDetection framework is standard; all tested backbones (HRNet, ResNeSt) require GPU memory typical of standard detection workflows.
BibTeX
@article{kasem2022deep,
title={Deep learning for table detection and structure recognition: A survey},
author={Kasem, Mahmoud and Abdallah, Abdelrahman and Berendeyev, Alexander and Elkady, Ebrahem and Abdalla, Mahmoud and Mahmoud, Mohamed and Hamada, Mohamed and Nurseitov, Daniyar and Taj-Eddin, Islam},
journal={ACM Computing Surveys},
volume={56},
number={12},
articleno={305},
pages={1--41},
year={2024},
publisher={ACM New York, NY},
doi={10.1145/3657281}
}
TRUST: An Accurate and End-to-End Table Structure Recognizer Using Splitting-based Transformers
TL;DR
TRUST is a split-merge table structure recognizer from Baidu that introduces two transformer-based modules: a Query-Based Splitting Module that predicts multi-oriented row/column separators, and a Vertex-Based Merging Module that uses cross-attention between row and column features to determine which grid cells should be merged into spanning cells. The system runs at 10 FPS on an A100 and achieves 97.1% Str-TEDS and 96.2% TEDS on PubTabNet, state-of-the-art results at the time of publication.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The core contribution is a two-module architecture for TSR. The paper is structured around specific architectural choices, ablation tables validating each component, and comparison against published baselines. No new dataset or benchmark is introduced, and no formal theoretical analysis is provided.
Secondary: None significant.
What is the motivation?
Table structure recognition requires parsing both the physical coordinates of cells and their logical row/column indices simultaneously. Four challenges make this difficult in practice: spanning cells that occupy multiple rows or columns, unlined or partially-lined tables with no explicit visual delimiters, empty cells that are easily missed, and tables with rotation or linear perspective distortion.
Prior work falls into three families, each with weaknesses. Component-based methods (DeepDeSRT, TableNet, LGPMA) struggle with empty cells and boundary ambiguity in borderless tables. Sequence-based methods (EDD) require large amounts of training data and tend to produce inaccurate cell bounding boxes. Earlier splitting-based methods (SPLERGE, SEM) are promising but inefficient: SEM uses slow RoI operations and BERT-based cell text encoding, while SPLERGE trains its split and merge models independently, which complicates optimization. Neither handles rotated tables reliably.
TRUST targets all four challenges within a single end-to-end trainable framework that keeps the split-merge decomposition while replacing both stages with transformer-based modules.
What is the novelty?
The core innovation is a pair of transformer-based modules replacing the projection-based splitter of SPLERGE and the region-proposal cell encoder of SEM.
Query-Based Splitting Module (QBS): Row and column separators are predicted as a fixed set of $N$ row queries and $M$ column queries. Learnable position embeddings indexed $(0, 1, \ldots, N-1)$ for rows and $(0, 1, \ldots, M-1)$ for columns serve as Transformer decoder queries; flattened CNN feature maps $F^V \in \mathbb{R}^{(H \times W) \times d}$ serve as keys and values. Each query predicts a triple $(c, o, a)$: a binary classification score $c$ indicating whether a separator is present, a spatial offset $o$ from the predefined query position to the separator’s starting point on the table boundary, and a rotation angle $a \in [-45^\circ, +45^\circ]$ encoded as a 91-class categorical output. These define a parallelogram-shaped separator, which can represent tilted dividing lines naturally.
Vertex-Based Merging Module (VBM): Each intersection of a predicted horizontal and vertical separator forms a vertex. Its feature is constructed by fusing the corresponding row feature $F^r \in \mathbb{R}^{N \times d}$ and column feature $F^c \in \mathbb{R}^{M \times d}$ from QBS. Before fusion, cross-feature enhancement applies cross-attention: row features attend over column features and vice versa, propagating context across both spatial axes. Each of the $N \times M$ vertices then predicts four binary merge decisions, one per adjacent grid pair (top-left/top-right, bottom-left/bottom-right, top-left/bottom-left, top-right/bottom-right).
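The paper does not spell out how the $N \times M$ binary link decisions are turned into final spanning cells; a plausible reduction is union-find over the grid, sketched below under that assumption:

```python
class GridMerger:
    """Union-find over an (R x C) cell grid: turns binary link decisions
    into spanning cells (the link-to-merge semantics are an assumption)."""

    def __init__(self, rows: int, cols: int):
        self.cols = cols
        self.parent = list(range(rows * cols))

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, r1, c1, r2, c2):
        a, b = self._find(r1 * self.cols + c1), self._find(r2 * self.cols + c2)
        if a != b:
            self.parent[b] = a

    def cells(self):
        groups = {}
        for i in range(len(self.parent)):
            groups.setdefault(self._find(i), []).append(divmod(i, self.cols))
        return list(groups.values())  # each group is one (possibly spanning) cell
```

Each vertex's fired links then translate into `merge` calls on the grid cells adjacent to that vertex (e.g., a top-left/top-right link merges the two cells above it), and `cells()` yields the final cell groups.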
The combined training loss is:
$$ \begin{aligned} L ={} & \frac{1}{N_r} L_{bce}(y_{row}, c_{row}) + \frac{1}{N_c} L_{bce}(y_{col}, c_{col}) \\ & + \frac{1}{N_{pos}} L_{ce}(y_{ang}, c_{ang}) + \frac{1}{N_{pos}} L_{loc}(\hat{s}, s) \\ & + \frac{1}{N_{vtx}} L_{bce}(y_{lnk}, c_{lnk}) \end{aligned} $$
Row/column classification and link classification each use binary cross-entropy. Rotation angle prediction uses cross-entropy over 91 angle classes. Separator start-point regression uses Smooth L1. Online Hard Example Mining (OHEM) is applied to the classification losses.
What experiments were performed?
Datasets: PubTabNet (500,777 train / 9,115 val / 9,138 test, scientific documents, predominantly simple three-line tables with some multi-span tables) and SynthTable (1,000 train / 1,000 test, synthetic, four difficulty categories: C1 standard with visible lines, C2 standard borderless, C3 spanning-cell, C4 rotated and perspective-distorted). For SynthTable, the authors retrain EDD and SPLERGE themselves for a controlled comparison.
Metrics: TEDS (tree-edit-distance similarity, full: structure and content) and Str-TEDS (structure only). OCR on PubTabNet cell content is provided by PSENet (text detection) and MASTER (text recognition) for consistency with prior work.
Baselines: EDD, TabStruct-Net, GTE, LGPMA, FLAG-Net on PubTabNet; EDD and SPLERGE on SynthTable.
Ablations on PubTabNet:
- Replacing QBS with SPLERGE’s split model drops Str-TEDS from 97.1% to 94.8% and TEDS from 96.2% to 93.4%.
- Replacing VBM with heuristic post-processing drops Str-TEDS from 97.1% to 88.3% and TEDS from 96.2% to 85.4%.
- Replacing VBM with SPLERGE’s merge model drops Str-TEDS from 97.1% to 96.2% and TEDS from 96.2% to 95.3%.
- Removing cross-feature enhancement from VBM drops Str-TEDS from 97.1% to 90.6% and TEDS from 96.2% to 88.0%.
What are the outcomes/conclusions?
On PubTabNet, TRUST achieves 97.1% Str-TEDS and 96.2% TEDS, above LGPMA (96.7% / 94.6%) and FLAG-Net (95.1% TEDS). On the rotated SynthTable C4 category, TRUST achieves 92.4% Str-TEDS and 89.2% TEDS, compared to 89.9% / 81.4% for EDD and 85.6% / 74.9% for SPLERGE.
At 10 FPS on an A100, TRUST runs at roughly 5x the throughput of SEM (1.94 FPS) and 10x that of EDD (1 FPS), attributed to the parallel structure of the transformer decoders and the absence of RoI-based operations.
Ablations support the paper’s design rationale: both QBS and VBM are meaningful contributors, and cross-feature enhancement within VBM accounts for the majority of the merging quality (an 8.2 percentage point TEDS improvement on PubTabNet when removed).
The main failure mode the authors acknowledge is strong perspective distortion; qualitative results show recognizable degradation in those cases. Additional limitations not discussed in the paper include: SynthTable has only 1,000 training samples, making conclusions about rotated-table handling less robust; the evaluation does not cover FinTabNet, WTW, SciTSR, or ICDAR competition sets used by many contemporaneous methods; no code or weights were released.
Reproducibility
Models
- Backbone: ResNet-18 pre-trained on ImageNet, with FPN for multi-scale feature merging.
- Decoder modules: Standard Transformer decoder layers for the row and column branches of QBS, and for the cross-feature enhancement sub-modules of VBM.
- Parameter count: Not reported.
- Weights: Not released. No code repository or model hub link is provided.
Algorithms
- Framework: PaddlePaddle (Baidu’s deep learning framework). This represents a meaningful barrier to reproduction outside Baidu’s ecosystem; porting to PyTorch or JAX is not discussed.
- Optimizer: Adam.
- Learning rate: 0.0001. No schedule is mentioned.
- Batch size: 16.
- Training duration: 20 epochs.
- Input resolution: Images resized to 640 $\times$ 640 with random scaling.
- Max query count: $N$ (row) and $M$ (column) maximum separator counts are predefined but their exact values are not specified.
- OHEM: Applied to classification losses; negative-to-positive ratio is not specified.
Data
- PubTabNet: Publicly available; CDLA-Permissive-1.0 annotations; underlying PMCOA images carry mixed per-article licenses.
- SynthTable: From the TIES paper (Qasim et al., ICDAR 2019); 1,000 train / 1,000 test. License and download availability are not discussed in the TRUST paper.
Evaluation
- TEDS and Str-TEDS are the only reported metrics; no error bars, confidence intervals, or multiple-seed results are reported.
- SynthTable baselines were retrained by the TRUST authors rather than using original checkpoints, which introduces uncertainty about baseline fidelity.
- No evaluation on FinTabNet, WTW, SciTSR, or any ICDAR competition benchmark.
Hardware
- Training and inference: NVIDIA Tesla A100 64 GB. The paper uses singular “GPU” throughout, indicating single-device training.
- Inference speed: 10 FPS on A100.
- Training time: Not reported.
- Deployment considerations: The combination of PaddlePaddle dependency and unavailable weights makes independent reproduction require substantial porting effort.
BibTeX
@article{guo2022trust,
title={TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers},
author={Guo, Zengyuan and Yu, Yuechen and Lv, Pengyuan and Zhang, Chengquan and Li, Haojie and Wang, Zhihui and Yao, Kun and Liu, Jingtuo and Wang, Jingdong},
journal={arXiv preprint arXiv:2208.14687},
year={2022}
}
TSRFormer: Table Structure Recognition with Transformers
TL;DR
TSRFormer reformulates table separation line detection as a direct regression problem rather than a pixel segmentation task. The core module, SepRETR, uses a two-stage DETR-like decoder: a reference point detector finds one anchor per separator, then a transformer decoder regresses the full curvilinear line from that anchor. On SciTSR, PubTabNet, and WTW, TSRFormer achieves results competitive with prior work; on a proprietary in-house dataset with distorted and curved tables it outperforms a re-implemented SPLERGE baseline by more than 11 F1 points.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ The headline contribution is the SepRETR architecture and its two companion improvements (prior-enhanced bipartite matching, high-resolution cross-attention sampling). The paper is centered on the model design and its ablation, with benchmark comparisons as supporting evidence.
Secondary: None. No new dataset or benchmark is released; the in-house dataset is used for evaluation but not distributed.
What is the motivation?
Table structure recognition (TSR) must recover which rows and columns partition a table and identify spanning cells that cross those boundaries. Prior split-and-merge methods predict row and column separators via pixel-level segmentation, then apply post-processing to convert probability masks into line coordinates. This pipeline introduces heuristic mask-to-line steps that are fragile when tables are distorted, curved, or contain large empty regions that disrupt the feature representation.
At the same time, the DETR family of object detectors showed that set prediction with bipartite matching could eliminate NMS and anchor engineering, but DETR converges slowly and struggles to leverage high-resolution feature maps due to the quadratic cost of global attention.
The authors aim to address both problems at once: replace segmentation with direct regression and adapt two-stage DETR to the geometry of line prediction, where the location of a separator provides a strong spatial prior.
What is the novelty?
SepRETR: Separator REgression TRansformer
TSRFormer’s split module runs two parallel SepRETR branches (one for rows, one for columns) on a shared convolutional feature map $P_2$ from a ResNet-18-FPN backbone.
Feature enhancement. A series of downsampling blocks reduces $P_2$ to a compact feature map $P'_2 \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{32} \times C}$. Two spatial CNN (SCNN) modules then propagate context across the width, producing a context-enhanced map $E_{\text{row}}$ where each position integrates information from both sides of the image.
Reference point detection. At a fixed column $x_r = \lfloor W/4 \rfloor$ of the upsampled map $E''_{\text{row}}$, a sigmoid classifier scores each pixel as a candidate reference point. After 7×1 max-pool NMS, the top-100 points above a threshold of 0.05 become object queries for the DETR decoder.
Separation line regression. Each row separator is represented by three parallel curvilinear lines (top boundary, center, bottom boundary), each parameterized as $K = 15$ points at fixed x-coordinates. The decoder attends to a downsampled feature map $C'_{\text{row}} \in \mathbb{R}^{H \times K \times C'}$ formed by stacking the $K$ columns of $E''_{\text{row}}$. A 3-layer transformer decoder produces per-query embeddings; two feedforward networks then classify each query and regress its 3K y-coordinates.
Prior-enhanced bipartite matching
Standard DETR bipartite matching is unstable during early training because any query can be assigned to any ground-truth object. The authors observe that detected reference points consistently fall within the top and bottom boundaries of their corresponding separator across training epochs. They exploit this by constructing a cost matrix where the cost between reference point $i$ and ground-truth separator $k$ is the distance to the GT reference point if the reference point lies within the separator’s boundaries, and $\infty$ otherwise. The Hungarian algorithm then produces a stable matching, substantially accelerating convergence.
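A sketch of the prior-enhanced cost construction with SciPy's Hungarian solver; the GT reference point is approximated here by the separator center, and the paper's $\infty$ cost is replaced by a large finite constant so the solver stays feasible:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def prior_enhanced_match(ref_pts, gt_centers, gt_top, gt_bottom, big=1e6):
    """Match detected reference points to GT separators.

    ref_pts: (Q,) y-coords of detected points; gt_*: (G,) per-separator
    center / top-boundary / bottom-boundary y-coords, all NumPy arrays.
    """
    cost = np.abs(ref_pts[:, None] - gt_centers[None, :])
    inside = (ref_pts[:, None] >= gt_top[None, :]) & \
             (ref_pts[:, None] <= gt_bottom[None, :])
    cost = np.where(inside, cost, big)      # forbid out-of-band assignments
    rows, cols = linear_sum_assignment(cost)
    keep = cost[rows, cols] < big           # drop pairs that hit the big cost
    return rows[keep], cols[keep]
```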
Loss functions
Reference point detection uses a focal-loss variant with Gaussian-augmented targets:
$$ L_{\text{ref}}^{\text{row}} = -\frac{1}{N_r} \sum_{i=1}^{H} \begin{cases} (1-p_i)^\alpha \log(p_i), & p_i^\ast=1 \\ (1-p_i^\ast)^\beta p_i^\alpha \log(1-p_i), & \text{otherwise} \end{cases} $$
where $\alpha=2$, $\beta=4$, and the Gaussian soft label $p_i^\ast$ within separator $k$ is:
$$ p_i^\ast = \exp\!\left(-\frac{(i - y_k)^2}{2\sigma_k^2}\right), \quad \sigma_k = \sqrt{\frac{w_k^2}{2\ln(10)}} $$
with $w_k$ the separator thickness, so that $p_i^\ast \geq 0.1$ within the separator region.
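A NumPy sketch of the Gaussian target construction above; rounding separator centers to pixel indices is an assumption:

```python
import numpy as np

def gaussian_soft_labels(height: int, sep_centers, sep_widths):
    """Gaussian-augmented targets for reference point detection:
    sigma_k = sqrt(w_k^2 / (2 ln 10)) keeps p* >= 0.1 inside a separator
    of thickness w_k; exact centers are hard positives (p* = 1)."""
    i = np.arange(height, dtype=np.float64)
    p_star = np.zeros(height)
    for y_k, w_k in zip(sep_centers, sep_widths):
        sigma_k = np.sqrt(w_k ** 2 / (2.0 * np.log(10.0)))
        p_star = np.maximum(p_star, np.exp(-((i - y_k) ** 2) / (2 * sigma_k ** 2)))
        p_star[int(y_k)] = 1.0
    return p_star
```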
Separation line regression uses a combined focal and L1 loss after bipartite matching $\hat{\sigma}$:
$$ L_{\text{line}}^{\text{row}} = \sum_{i=1}^{Q} \left[ L_{\text{cls}}(c_i, c_{\hat{\sigma}(i)}^\ast) + \mathbf{1}_{\{c_i \neq \emptyset\}} L_{\text{reg}}(l_i, l_{\hat{\sigma}(i)}^\ast) \right] $$
An auxiliary binary cross-entropy segmentation loss penalizes per-pixel separator predictions:
$$ L_{\text{aux}}^{\text{row}} = \frac{1}{|S_{\text{row}}|} \sum_{(x,y) \in S_{\text{row}}} \text{BCE}(M_{\text{row}}(x,y), M_{\text{row}}^\ast(x,y)) $$
where $S_{\text{row}}$ is the set of sampled pixels and $M_{\text{row}}^\ast(x,y)$ is 1 only if the pixel falls within a row separator. The cell merging module uses a corresponding BCE loss over sampled cell pairs:
$$ L_{\text{merge}} = \frac{1}{|S_{\text{rel}}|} \sum_{i \in S_{\text{rel}}} \text{BCE}(P_i, P_i^\ast) $$
The full training objective is:
$$ L = \lambda(L_{\text{ref}}^{\text{row}} + L_{\text{ref}}^{\text{col}}) + L_{\text{aux}}^{\text{row}} + L_{\text{aux}}^{\text{col}} + L_{\text{line}}^{\text{row}} + L_{\text{line}}^{\text{col}} + L_{\text{merge}} $$
with $\lambda = 0.2$.
Cell merging module
After intersecting row and column separators to produce a regular cell grid, a relation network recovers spanning cells. RoI Align extracts $7 \times 7$ features for each cell from $P_2$; a two-layer MLP maps them to 512-d vectors. Three feature enhancement blocks aggregate row-level, column-level, and local context. For each adjacent cell pair, the concatenated features plus an 18-d spatial compatibility feature are fed to a binary MLP classifier.
What experiments were performed?
Datasets
- SciTSR: 12k train / 3k test; axis-aligned tables from scientific PDFs. A harder subset, SciTSR-COMP (716 tables), is evaluated separately.
- PubTabNet: ~500k train, 9k validation; axis-aligned tables from scientific articles. The structure-only TEDS-Struct metric is used to avoid confounds from OCR engine differences.
- WTW: ~11k train / ~3.6k test; bordered tables photographed in wild scenes, including skewed and curved instances.
- In-house dataset: 40,590 train / 1,053 test; heterogeneous scanned and camera-captured documents including scientific, financial, and invoice content. Evaluation uses the cTDaR TrackB metric at IoU 0.9.
Baselines
On public benchmarks, TSRFormer is compared against TabStruct-Net, GraphTSR, LGPMA, and FLAG-Net (SciTSR); EDD, GTE, LGPMA, FLAG-Net (PubTabNet); and Cycle-CenterNet (WTW). For in-house evaluation, a re-implemented SPLERGE with the same backbone is the primary comparison.
Implementation
Backbone: ResNet-18-FPN; $P_2$ channel width set to 64. Optimizer: AdamW with initial LR $10^{-4}$, polynomial decay (power 0.9). Batch size 16 on 8 Nvidia Tesla V100 GPUs. Staged training: reference point detection and auxiliary segmentation for $N$ epochs, then regression added for $N$ epochs, then cell merging added for $N$ epochs ($N = 12$ for PubTabNet, $N = 20$ for others).
What are the outcomes/conclusions?
SciTSR: TSRFormer reaches 99.4% F1 on the full test set and 98.9% on SciTSR-COMP, competitive with FLAG-Net (99.5% / 98.5%).
PubTabNet: 97.5% TEDS-Struct, 0.8 points above LGPMA (the ICDAR 2021 competition winner at 96.7%).
WTW: 93.4% F1, 1.0 point above Cycle-CenterNet (92.4%), which was designed specifically for this wild-scene scenario.
In-house (distorted/borderless tables): TSRFormer reaches 95.2% F1 vs. 83.8% for the re-implemented SPLERGE baseline, a gap of more than 11 points. The ablation attributes this gain primarily to replacing segmentation with regression (line prediction without heuristic mask-to-line steps) and to the cell merging module.
Ablation highlights: The prior-enhanced matching strategy matches the convergence quality of the original DETR strategy at 40 epochs while needing only 20 epochs. Using sampled high-resolution features $C_{\text{row}}, C_{\text{col}}$ in cross-attention adds 0.5 F1 over using the full encoder features.
Limitations acknowledged: Model weights and code are not released. The in-house dataset is not publicly available, so the largest performance claims cannot be externally verified. Evaluation on WTW covers bordered tables only; generalization to borderless wild-scene tables is not demonstrated. No latency or throughput numbers are reported.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| Preprint (arXiv) | Paper | arXiv-nonexclusive-v1.0 | arXiv 2208.04921 |
| Published (ACM MM 2022) | Paper | Unknown | DOI 10.1145/3503161.3548038 |
| Model weights | Model | N/A | Not released |
| Training code | Code | N/A | Not released |
| SciTSR | Dataset | Unknown | GitHub |
| PubTabNet | Dataset | CDLA-Permissive-1.0 | GitHub |
| WTW | Dataset | Unknown | GitHub |
Models
- ResNet-18-FPN backbone; $P_2$ feature dimension 64. ResNet-18 weights initialized from ImageNet pretraining.
- SepRETR decoder: 3 transformer decoder layers; query dimension 256; 16 attention heads; FFN dimension 1024.
- $E'_{\text{row}}/E'_{\text{col}}$ channel dimension: 256.
- Cell merging: 2-layer MLP (512 nodes), relation network with 3 feature enhancement blocks, binary classifier MLP (2 hidden layers, 512 nodes).
- Model weights are not publicly released.
Algorithms
- Optimizer: AdamW; $\beta = (0.9, 0.999)$; $\epsilon = 10^{-8}$; weight decay $5 \times 10^{-4}$.
- Learning rate: $10^{-4}$ with polynomial decay (power 0.9).
- Data augmentation: shorter side randomly rescaled to one of $\{416, 512, 608, 704, 800\}$ (aspect ratio preserved) for all datasets except WTW, where both sides are resized to 1024.
- Synchronized BatchNorm during training.
- Training on 8 V100 GPUs, batch size 16; staged training ($N$ epochs per stage, $N=12$ or $N=20$).
- Reference points: top-100 per image, score threshold 0.05, 7×1 max-pool NMS.
- $K = 15$ regression points per line; $x_r = \lfloor W/4 \rfloor$ for rows; $y_r = \lfloor H/4 \rfloor$ for columns.
- Cell merging: 1024 cells sampled per mini-batch (1024 positive + 1024 negative for segmentation; 64 hard positive + 64 hard negative for merging, selected by OHEM).
- $\lambda = 0.2$ (loss balance for reference point detection).
Data
- SciTSR: publicly available; 12k/3k train/test split. License: see dataset page.
- PubTabNet: publicly available; ~500k/9k/9k train/val/test (test annotations not released). License: CDLA-Permissive-1.0.
- WTW: publicly available; ~11k/3.6k train/test.
- In-house dataset: proprietary; 40,590 train / 1,053 test; not released.
- Ground-truth separation lines for SciTSR and PubTabNet are generated from cell-box annotations following the procedure in RobusTabNet [arXiv:2203.09056].
Evaluation
- SciTSR: cell adjacency relationship metric (F1). Two variants: with and without empty cells.
- PubTabNet: TEDS-Struct (structure-only TEDS, ignoring OCR). Validation set only (test annotations unreleased).
- WTW: cell adjacency F1 at IoU 0.6.
- In-house: cTDaR TrackB metric at IoU 0.9 using GT text boxes.
- No error bars or multi-run statistics are reported.
- SPLERGE baseline is re-implemented by the authors with the same backbone for a fair comparison; results may differ from the original published SPLERGE.
Hardware
- Training: 8 Nvidia Tesla V100 GPUs.
- GPU-hours and memory requirements are not reported.
- Inference resolution: longer side rescaled to 1024 pixels (aspect ratio preserved) for SciTSR, PubTabNet, and in-house; both sides to 1024 for WTW.
- No inference latency or throughput numbers reported.
BibTeX
@inproceedings{lin2022tsrformer,
title={TSRFormer: Table Structure Recognition with Transformers},
author={Lin, Weihong and Sun, Zheng and Ma, Chixiang and Li, Mingze and Wang, Jiawei and Sun, Lei and Huo, Qiang},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
TableFormer and SynthTabNet: Transformer-Based Table Structure Recognition with a Synthetic Dataset
TL;DR
TableFormer is an end-to-end transformer-based model that simultaneously predicts table HTML structure and per-cell bounding boxes from a table image. Alongside the model, the authors release SynthTabNet, a 600k synthetic table dataset with diverse styles and complexity. On PubTabNet, TableFormer reaches 96.75% TEDS overall, compared to 89.9% for EDD, the previous best-reported result on that benchmark.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ The headline contribution is the TableFormer architecture, which replaces LSTM decoders with transformer decoders and introduces a dedicated cell bounding-box decoder. The bulk of the paper is devoted to describing the model, training procedure, and benchmark comparisons.
Secondary: $\Psi_{\text{Resource}}$ The paper also releases SynthTabNet, a 600k synthetic dataset created to address distribution imbalances in existing benchmarks. The dataset is a meaningful contribution in its own right and is described in detail across several pages.
What is the motivation?
Extracting structured data from table images is a prerequisite for many downstream applications: search engines, knowledge graphs, and document automation pipelines. The problem is hard because tables vary widely in shape, size, line style, header complexity, and cell content.
At the time of this work, the leading approaches for table structure recognition fell into three families: image-to-text networks (including the LSTM-based EDD model from PubTabNet), graph neural networks, and hybrid deep-learning/rule-based pipelines. Each had significant weaknesses. Image-to-text methods relied on implicit OCR decoders trained only on the languages present in the training set, making them language-dependent. Graph-based methods required explicit text-cell positions as input and reported lower accuracy than image-to-text approaches. Hybrid methods were not end-to-end and required hand-written rules for new table types.
A secondary motivation is the observation that existing public datasets (PubTabNet, FinTabNet, TableBank) are skewed toward simple tables with few rows and columns, and toward specific visual styles. A more diverse training set is needed to improve generalization.
What is the novelty?
TableFormer Architecture
TableFormer takes a table image as input and outputs two things simultaneously: a sequence of HTML structure tags and a bounding box for each data cell. The model has three main components.
CNN Backbone: A ResNet-18 (modified by removing the final linear and pooling layers and adding an adaptive pooling layer of size $28 \times 28$) encodes the input image into a feature map. The input is resized to $448 \times 448$ pixels.
Structure Decoder: A standard transformer encoder-decoder architecture. The encoder has 2 layers; the decoder has 4 layers. Both use a feature size of 512, a feed-forward dimension of 1024, and 4 attention heads. The decoder autoregressively predicts HTML tags one token at a time. Replacing the LSTM with a transformer decoder is the primary upgrade over the EDD architecture from PubTabNet.
Cell BBox Decoder: Inspired by DETR, this component reuses the hidden state produced by the Structure Decoder for each `<td>` tag (or the opening `<td` token of a spanning cell, whose span attributes follow) as an object query. A cross-attention network combines this hidden state with the CNN feature map to produce a cell-specific feature. An MLP (3 layers, ReLU activations) then regresses normalized bounding box coordinates. A linear layer classifies each predicted box as empty or non-empty.
Because the HTML tag sequence is naturally ordered, the predicted bounding boxes are already in one-to-one correspondence with the table cells, eliminating the need for the Hungarian matching algorithm used in DETR.
Loss function: TableFormer is trained with a multi-task loss:
$$ \begin{aligned} l_{\text{box}} &= \lambda_{\text{iou}} \, l_{\text{iou}} + \lambda_{l1} \, l_{1} \\ l &= \lambda \, l_{s} + (1 - \lambda) \, l_{\text{box}} \end{aligned} $$
where $l_{s}$ is the cross-entropy loss for the structure decoder, $l_{\text{iou}}$ is the generalized IoU loss, $l_{1}$ is the L1 bounding box regression loss, and $\lambda \in [0,1]$ balances the two objectives.
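A sketch of the multi-task objective in PyTorch, using torchvision's generalized IoU loss; $\lambda = 0.5$ follows the paper, while the inner $\lambda_{\text{iou}}, \lambda_{l1}$ weights are placeholders, since their values are not quoted here:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def tableformer_loss(tag_logits, tag_targets, pred_boxes, gt_boxes,
                     lam=0.5, lam_iou=0.5, lam_l1=0.5):
    """tag_logits: (T, V) structure-token logits; tag_targets: (T,) token ids;
    pred_boxes / gt_boxes: (C, 4) xyxy boxes, normalized to [0, 1]."""
    l_s = F.cross_entropy(tag_logits, tag_targets)          # structure decoder
    l_box = (lam_iou * generalized_box_iou_loss(pred_boxes, gt_boxes,
                                                reduction="mean")
             + lam_l1 * F.l1_loss(pred_boxes, gt_boxes))    # cell bbox decoder
    return lam * l_s + (1 - lam) * l_box
```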
SynthTabNet
To address distribution skew in existing datasets, the authors generate SynthTabNet: 600k PNG table images synthesized from four themed subsets of 150k each (FinTabNet style, marketing style, PubTabNet style, and sparse/low-density). Each subset targets a different visual style and complexity profile.
The generation pipeline controls four independent axes: table structure (header rows, body rows, columns, row/column span ratios), appearance (hand-designed CSS styling templates), content (curated term lists from PubTabNet and FinTabNet, plus random text), and rendering (a web browser engine produces exact bounding boxes per cell). Every cell, including empty ones, receives a bounding box. All splits follow an 80%/10%/10% train/test/val division.
What experiments were performed?
Datasets
The authors train and evaluate on three public benchmarks: PubTabNet (509k tables, scientific papers), FinTabNet (112k tables, financial documents), and TableBank (145k tables, mixed documents). They also report baseline results on SynthTabNet itself.
A data preparation step unifies all datasets into PubTabNet annotation format at 72 dpi PNG, filters out extreme-sized tables, and regenerates missing bounding boxes using a grid-inference procedure (required for 48% of simple and 69% of complex PubTabNet tables, and 68% and 98% respectively for FinTabNet).
Metrics
TEDS (Tree-Edit-Distance-Based Similarity), introduced with PubTabNet, measures structural similarity between the predicted and ground-truth HTML trees:
$$ \text{TEDS}(T_{a}, T_{b}) = 1 - \frac{\text{EditDist}(T_{a}, T_{b})}{\max(|T_{a}|, |T_{b}|)} $$
where $T_{a}$ and $T_{b}$ are the table representations as HTML trees, EditDist is the tree-edit distance, and $|T|$ is the number of nodes.
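A structural sketch of TEDS using the zss package's Zhang-Shasha tree edit distance; the official metric uses APTED with per-node text-content costs, so this toy version captures only the shape of the formula:

```python
from zss import Node, simple_distance

def size(t: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t.children)

def teds(a: Node, b: Node) -> float:
    return 1.0 - simple_distance(a, b) / max(size(a), size(b))

# Toy trees: a 1x2 table vs. a 1x1 table, nodes labeled by HTML tag.
gt = Node("table", [Node("tr", [Node("td"), Node("td")])])
pred = Node("table", [Node("tr", [Node("td")])])
print(teds(gt, pred))  # 0.75: one deleted node out of max(4, 3) nodes
```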
Cell bounding box quality is measured with PASCAL VOC mAP on the non-empty cell class.
Baselines
Structure: EDD (the encoder-dual-decoder model from the PubTabNet paper) and GTE (Global Table Extractor). Cell detection: EDD augmented with TableFormer’s Cell BBox Decoder. Cell content (structure + content via PDF extraction): Tabula, Traprange, Camelot, Acrobat Pro, and EDD.
Training Details
Three Adam optimizers are used (one per component). Starting learning rate 0.001 for 12 epochs (batch size 24), then 0.0001 for 12 more epochs or convergence (batch size 18). $\lambda = 0.5$. Dropout rate 0.5. Input constraints: images up to $1024 \times 1024$ pixels, structure token sequences up to 512 tokens.
What are the outcomes/conclusions?
Structure accuracy: On PubTabNet, TableFormer achieves 98.5% TEDS on simple tables and 95.0% on complex tables (96.75% overall), compared to 91.1%/88.7% (89.9% overall) for EDD and 93.01% overall for GTE. On FinTabNet, 97.5%/96.0% (96.8% overall) vs. 88.4%/92.08% (90.6%) for EDD. On TableBank, 89.6% vs. 86.0% for EDD. On SynthTabNet, 96.9%/95.7% (96.7%).
Cell bounding box: TableFormer achieves 82.1% mAP on PubTabNet (86.8% with post-processing), vs. 79.2% (82.7%) for EDD augmented with the same bbox decoder.
Structure with content (PDF extraction): 95.4%/90.1% (93.6% overall) on PubTabNet, vs. 91.2%/85.4% (88.3%) for EDD and 80.0%/66.0% (73.0%) for Camelot.
Language generalization: Although trained only on English tables, TableFormer can extract Japanese-language tables by combining its bounding box predictions with native PDF text cells. This is a direct consequence of removing the language-dependent OCR decoder.
Acknowledged limitations: Performance degrades on tables that occupy a large portion of the page, because these are aggressively downsampled to 448 pixels during preprocessing, producing indistinguishable features. The authors suggest a separate model with a larger input resolution for such cases. No error bars or significance tests are reported. The comparison to commercial systems (Acrobat Pro, Tabula) conflates table structure recognition quality with content extraction mechanism: those tools differ from TableFormer in both dimensions, so the content-with-structure evaluation (Table 4) is not a controlled comparison.
Gaps not acknowledged in the paper: The paper contains no component-level ablations. There is no experiment isolating the effect of replacing the LSTM decoder with a transformer decoder separately from adding the Cell BBox Decoder, so the attribution of the performance gains to individual design choices remains unclear. Additionally, the training data mixture used for each benchmark evaluation is not explicitly stated, making it difficult to know whether results reflect single-dataset training or combined training including SynthTabNet.
Reproducibility
| Resource | Type | License | Link |
|---|---|---|---|
| SynthTabNet Dataset | Dataset | CDLA-Permissive-1.0 | GitHub |
| Preprint | Paper | CC-BY-4.0 | arXiv 2203.01017 |
| Model weights | Model | N/A | Not released |
| Training code | Code | N/A | Not released |
Models
- ResNet-18 CNN backbone (pretrained weights from torchvision), modified for table images.
- Transformer encoder: 2 layers, feature size 512, FFN 1024, 4 heads.
- Transformer decoder: 4 layers, same dimensions.
- Cell BBox Decoder: cross-attention module + 3-layer MLP + linear classification head.
- Model weights are not publicly released in the paper or the SynthTabNet repository. The SynthTabNet GitHub only provides the dataset and a usage notebook.
Algorithms
- Optimizer: Adam (3 separate instances: one per sub-network).
- Learning rate: 0.001 for 12 epochs, then 0.0001 for 12 epochs or convergence.
- Batch sizes: 24 (warm phase), 18 (fine-tuning phase).
- Dropout: 0.5 on all dropout layers.
- $\lambda = 0.5$ (balances structure vs. bbox loss).
- Inference optimization: single forward pass through encoder + caching of decoded token features for autoregressive decoding.
- Post-processing (PDF documents): 9-step pipeline to align predicted bounding boxes with PDF text cells via IOU matching, column alignment, and orphan cell recovery.
Data
- Training: Combination of PubTabNet (509k), FinTabNet (112k), and TableBank (145k), preprocessed to a unified PNG format at 72 dpi. Combined-Tabnet (PubTabNet + FinTabNet) is ~400k; Combined (all three) is ~500k. SynthTabNet adds 600k synthetic tables.
- SynthTabNet: 600k PNG images, 4 subsets of 150k (Part 1: FinTabNet style, Part 2: marketing style, Part 3: PubTabNet style, Part 4: sparse). All splits are 80%/10%/10%. Total download size approximately 37 GB. License: CDLA-Permissive-1.0.
- Annotation: For PubTabNet and FinTabNet, missing bounding boxes are computed via a grid-inference procedure. Tables with non-strict HTML structures (rows of unequal column counts) are discarded.
- Augmented dataset: The paper mentions releasing the preprocessed combination of PubTabNet/FinTabNet/TableBank with generated bounding boxes for reproducibility, but no persistent download link is provided.
Evaluation
- TEDS (structure-only and structure+content variants). No nonstandard variants used.
- PASCAL VOC mAP for cell bounding box evaluation (content cells only).
- No error bars, confidence intervals, or multi-seed runs are reported.
- The content evaluation (Table 4) notes that ground-truth HTML and extracted PDF text sometimes differ in whitespace and unicode representation, potentially deflating scores slightly.
Hardware
- Framework: PyTorch and Torchvision.
- Training hardware: not specified in the paper.
- GPU-hours, memory requirements, and inference latency are not reported.
- No cost estimates provided.
BibTeX
@inproceedings{nassar2022tableformer,
title={TableFormer: Table Structure Understanding with Transformers},
author={Nassar, Ahmed and Livathinos, Nikolaos and Lysak, Maksym and Staar, Peter},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
PubTables-1M
TL;DR
PubTables-1M is a large-scale table extraction dataset containing nearly one million tables from PubMed Central Open Access articles. The dataset introduces richer annotations (including blank cells, rows, columns, and projected row headers) and a canonicalization procedure to address oversegmentation in markup-derived ground truth, substantially improving both training and evaluation reliability for table structure recognition.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (the dataset, annotation pipeline, and quality verification procedures constitute the primary contribution).
Secondary: $\Psi_{\text{Evaluation}}$ (demonstrates how canonicalization reduces ground truth ambiguity and changes evaluation conclusions), $\Psi_{\text{Method}}$ (DETR-based baseline models for table detection and structure recognition).
What is the motivation?
Table extraction requires inferring logical structure from visual presentation; modern deep learning approaches depend on large-scale, unambiguous ground truth. Prior markup-derived datasets exhibit three key limitations:
- Incomplete spatial annotations: lack explicit row/column/cell bounding boxes; omit blank cells entirely
- Missing header information: do not label row headers or projected headers, limiting functional analysis capabilities
- Oversegmentation artifacts: spanning header cells are frequently split into multiple grid cells in source markup, creating contradictory or non-unique ground truth that degrades both training and evaluation
What is the novelty?
- Scale and coverage: nearly 1M tables supporting all three subtasks (table detection, table structure recognition, functional analysis)
- Richer annotation schema: explicit bounding boxes for rows, columns, and cells (including blank cells); projected row header labels
- Canonicalization procedure: data is processed via automated rule-based merging of oversegmented header cells to produce unique, unambiguous structure interpretations. The algorithm operates in two phases: (1) inferring header regions (column headers, projected row headers, and first-column row headers) and (2) merging adjacent cells under structural constraints derived from the Wang model’s hierarchical header-tree assumption.
- Automated quality control: verification pipeline with measurable consistency checks (text alignment quality, cell-word assignment, overlap detection)
What experiments were performed?
The authors trained object detection baselines (Faster R-CNN and DETR) for table detection and joint table structure recognition with functional analysis. Key experimental comparisons:
- Canonicalization ablation: training DETR on canonicalized vs. non-canonical annotations (DETR-NC), evaluated against both test label variants
- Noise isolation: testing the same model (DETR-NC) against canonical vs. non-canonical test labels to measure evaluation noise independent of training data effects
- Metrics: table detection uses standard object detection metrics (AP/AP50/AP75/AR); table structure recognition reports $\text{Acc}_{\text{Cont}}$ (exact table content match), adjacent cell content F-score ($\text{Adj}_{\text{Cont}}$), and GriTS (Grid Table Similarity).
GriTS formulation:
$$ \text{GriTS}_f(\mathbf{A}, \mathbf{B}) = \frac{2 \cdot \sum_{i,j} f(\tilde{\mathbf{A}}_{i,j}, \tilde{\mathbf{B}}_{i,j})}{|\mathbf{A}| + |\mathbf{B}|} $$
where $\mathbf{A}$ and $\mathbf{B}$ are the ground truth and predicted table matrices, $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ are their most similar substructures (a selection of $m$ rows and $n$ columns), and $f$ is a similarity function. Three variants measure different aspects: $\text{GriTS}_{\text{Top}}$ for cell topology, $\text{GriTS}_{\text{Cont}}$ for cell content, and $\text{GriTS}_{\text{Loc}}$ for cell location.
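A degenerate sketch of the GriTS formula: it assumes the most-similar substructures $\tilde{\mathbf{A}}, \tilde{\mathbf{B}}$ have already been found and share a shape, leaving only the final similarity computation (the real metric searches over row/column subsets, and $\text{GriTS}_{\text{Cont}}$ uses an LCS-based $f$, stubbed here with exact string match):

```python
import numpy as np

def grits_aligned(A, B, f):
    """2 * sum_ij f(A_ij, B_ij) / (|A| + |B|) for pre-aligned, equal-shape
    cell matrices A and B; f is a per-cell similarity in [0, 1]."""
    A, B = np.asarray(A, dtype=object), np.asarray(B, dtype=object)
    assert A.shape == B.shape, "sketch assumes aligned substructures"
    s = sum(f(a, b) for a, b in zip(A.ravel(), B.ravel()))
    return 2.0 * s / (A.size + B.size)

# Content variant with exact match as the cell similarity:
gt   = [["h1", "h2"], ["1", "2"]]
pred = [["h1", "h2"], ["1", "x"]]
print(grits_aligned(gt, pred, lambda a, b: float(a == b)))  # 0.75
```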
What are the outcomes/conclusions?
Outcomes
Canonicalized ground truth materially changes experimental conclusions. For complex tables, exact-match table content accuracy ($\text{Acc}_{\text{Cont}}$) jumps from 0.5360 (DETR-NC evaluated on non-canonical test labels) to 0.6944 (DETR evaluated on canonical test labels). This comparison conflates two effects (improved training data and cleaner test labels), but the authors disentangle them by also evaluating DETR-NC against canonical test labels. In that isolated comparison, DETR-NC scores 0.9349 on simple tables with canonical test labels vs. 0.8678 with non-canonical test labels, demonstrating that canonicalization alone reduces evaluation noise.
DETR achieves strong performance under the object detection framing without task-specific customizations (test AP 0.966 for table detection, 0.912 for joint structure recognition and functional analysis).
Limitations
- Domain specificity: canonicalization rules are designed around PMCOA-derived annotation characteristics; generalization to other domains may require new assumptions. The authors explicitly note this in the paper.
- Single-page scope: multi-page tables are excluded.
- Partial header inference: full row-header structure is outside scope; coverage limited to projected row headers and first-column structure.
- No cross-dataset evaluation: all experiments use PubTables-1M only; the paper does not test whether models trained on PubTables-1M generalize to other table extraction benchmarks.
Reproducibility
Models
- Both DETR and Faster R-CNN use ResNet-18 backbones pretrained on ImageNet with early layers frozen.
- DETR architecture: 6 encoder layers, 6 decoder layers. TD model uses 15 object queries; TSR+FA model uses 125 object queries (slightly above the maximum object count in training data).
- Pre-trained weights for both detection and structure recognition models are publicly released under MIT license.
Algorithms
- All models trained for 20 epochs on a single NVIDIA Tesla V100 GPU.
- Learning rate selection: one short experiment comparing initial learning rates of 0.0002, 0.0001, and 0.00005; best validation performance after one epoch determined the choice.
- TSR+FA model: initial learning rate of 0.00005, no-object class weight of 0.4.
- Both models: learning rate step size of 1 epoch with gamma 0.9 (i.e., the learning rate is multiplied by 0.9 after each epoch).
- Standard data augmentations (random cropping, resizing). No custom components, losses, or training procedures.
- TD input: PDF pages rendered as images with maximum length of 1000 pixels.
- TSR+FA input: table region cropped from page image with 30-pixel padding on all sides.
Data
- Source: PubMed Central Open Access (PMCOA) scientific articles.
- Alignment: Needleman-Wunsch algorithm for character-level matching between XML markup and PDF-rendered text (see the sketch after this list).
- Splits: 80/10/10 at document level to prevent leakage.
- 947,642 tables total for TSR; 460,589 pages for TD.
- Quality filters discard tables with overlapping rows/columns, high edit distance (threshold 0.05), low word-cell overlap (threshold 0.9), or extreme outlier object counts (>100). Less than 0.1% of tables are discarded as outliers.
- Licensing: CDLA-Permissive-2.0 applies to Microsoft’s annotations. Underlying PMCOA articles have mixed licenses (CC0, CC BY, CC BY-NC, etc.); per-article verification is needed for commercial use.
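For reference, a compact sketch of Needleman-Wunsch global alignment as used for the markup-to-PDF text matching (score-only dynamic programming; the match/mismatch/gap values here are placeholders, not the paper's):

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Character-level global alignment score via dynamic programming."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                        # all-gap prefix of a
    for j in range(1, m + 1):
        dp[0][j] = j * gap                        # all-gap prefix of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,  # substitute / match
                           dp[i - 1][j] + gap,    # gap in b
                           dp[i][j - 1] + gap)    # gap in a
    return dp[n][m]

print(needleman_wunsch("Table 1", "Tab1e 1"))  # 5: six matches, one mismatch
```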
Evaluation
- TD metrics: AP, AP50, AP75, AR (standard COCO-style object detection).
- TSR metrics: $\text{Acc}_{\text{Cont}}$ (exact table content match), $\text{Adj}_{\text{Cont}}$ (adjacent cell content F-score from Gobel et al.), $\text{GriTS}_{\text{Top}}$, $\text{GriTS}_{\text{Cont}}$, $\text{GriTS}_{\text{Loc}}$.
- Post-prediction, bounding boxes are retroactively tightened (removing dilation padding) before scoring, so that training-time dilation does not unfairly penalize location metrics.
- Baselines compared: DETR vs. Faster R-CNN (both on canonical data), plus DETR-NC (on non-canonical data) for the canonicalization ablation.
- No error bars, significance tests, or multi-run reporting. Single training run per configuration.
- No cross-dataset evaluation or comparison with prior published results on other benchmarks.
Hardware
- Training: Single NVIDIA Tesla V100 GPU (reported in Appendix 9.1).
- Training time and GPU-hours are not reported.
- Inference latency and throughput are not reported.
Reported Results (Test Set)
Table Detection (TD):
| Model | AP | AP50 | AP75 | AR |
|---|---|---|---|---|
| DETR | 0.966 | 0.995 | 0.988 | 0.981 |
| Faster R-CNN | 0.825 | 0.985 | 0.927 | 0.866 |
Joint TSR+FA (Object Detection):
| Model | AP | AP50 | AP75 | AR |
|---|---|---|---|---|
| DETR | 0.912 | 0.971 | 0.948 | 0.942 |
| Faster R-CNN | 0.722 | 0.815 | 0.785 | 0.762 |
Table Structure Recognition (Canonical Test, All Tables):
| Model | $\text{Acc}_{\text{Cont}}$ | $\text{GriTS}_{\text{Top}}$ | $\text{GriTS}_{\text{Cont}}$ | $\text{GriTS}_{\text{Loc}}$ |
|---|---|---|---|---|
| DETR | 0.8138 | 0.9845 | 0.9846 | 0.9781 |
| DETR-NC | 0.5851 | 0.9576 | 0.9588 | 0.9449 |
| Faster R-CNN | 0.1039 | 0.8616 | 0.8538 | 0.7211 |
Mapping to Unified Taxonomy
PubTables-1M covers both Table Detection (TD) and Table Structure Recognition (TSR).
Detection (Layout Level)
For the Layout Analysis pipeline, we only care about the Detection classes:
| PubTables-1M Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Table | Table | Table | Core bounding box. |
Structure (TSR Level)
Inside the detected table, it provides detailed structure annotations (rows, columns, headers) similar to PubTabNet but with bounding boxes for cells. These are not part of the page-level Layout Taxonomy but are inputs for the OTSL pipeline.
BibTeX
@inproceedings{smock2022pubtables,
title={PubTables-1M: Towards comprehensive table extraction from unstructured documents},
author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={4634--4642},
year={2022}
}
Parsing Table Structures in the Wild (WTW)
TL;DR
This paper introduces the Wired Table in the Wild (WTW) dataset, a collection of 14,581 images of tables captured in natural scenes (photos, scans, web pages) with full cell coordinate and row/column annotations. Alongside the dataset, the authors propose Cycle-CenterNet, which extends CenterNet with a cycle-pairing module to simultaneously detect cells and group them into structured tables, achieving a 24.6-point absolute improvement in TEDS over baselines on WTW.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is the WTW dataset itself, which fills a gap in available benchmarks for table structure parsing (TSP) outside of clean document images. The paper spends considerable effort describing the data collection, annotation protocol, challenging cases, and evaluation protocol.
Secondary: $\Psi_{\text{Method}}$: The paper also proposes Cycle-CenterNet with a novel cycle-pairing module and pairing loss, which represents a meaningful methodological contribution for cell detection and grouping.
What is the motivation?
Existing table structure parsing datasets (ICDAR-2013, PubTabNet, TableBank, SciTSR) focus almost exclusively on well-aligned tabular images from digital or scanned PDF documents. These images have clean backgrounds and clear, axis-aligned table structures. In practice, tables are frequently captured by hand-held cameras, introducing challenges like non-rigid deformation, bending, tilting, occlusion, and cluttered backgrounds.
The authors observe that existing TSP methods, which rely on assumptions about well-aligned inputs, fail on these real-world images. There was no large-scale dataset to train or evaluate models for this more practical setting. The authors also note that wireless tables in natural images are particularly difficult for human annotators due to the lack of reference lines, so they focus specifically on wired tables.
What is the novelty?
WTW Dataset
The dataset contains 14,581 images drawn from three sources: natural scene images (50%), archival document images (30%), and printed document images (20%). Each image is annotated with table IDs, cell coordinates (using inner table lines for localization, following ICDAR 2019 conventions), and row/column information. The dataset is split into 10,970 training and 3,611 testing samples.
WTW covers seven challenging cases that existing datasets largely miss:
- Inclined tables
- Curved tables
- Occluded or blurred tables
- Extreme aspect ratio tables
- Overlaid tables
- Multi-color tables
- Irregular tables
Cycle-CenterNet
Built on top of CenterNet with a DLA-34 backbone, the model adds a cycle-pairing module with two branches:
Center-to-Vertex branch: Predicts offsets from each cell center to its four vertices. For center point $P = \{x_C, y_C\}$ and vertices $V = \{x_V, y_V\}$:
$$ \begin{cases} \Delta x_{C_{ik}} = x_{C_i} - x_{V_{ik}} \\ \Delta y_{C_{ik}} = y_{C_i} - y_{V_{ik}} \end{cases} \quad i = 1{:}N^C,\; k = 1{:}4 $$
Vertex-to-Center branch: Predicts offsets from each common vertex to the centers of surrounding cells:
$$ \begin{cases} \Delta x_{V_{ik}} = x_{V_i} - x_{C_{ik}} \\ \Delta y_{V_{ik}} = y_{V_i} - y_{C_{ik}} \end{cases} \quad i = 1{:}N^V,\; k = 1{:}4 $$
The mutual-directed (cycle) relationship between centers and vertices enables grouping cells through shared vertices at cell intersections.
Pairing loss: Rather than applying loss functions directly on the output maps, the authors define a pairwise loss over center-vertex pairs:
$$ L_p = \sum_{c,v} \omega(P_{cv}) \left( \lambda_{cv} L_{cv} + \lambda_{vc} L_{vc} \right) $$
where $L_{cv}$ and $L_{vc}$ are $l_1$ losses on the predicted offsets, and $\omega(P_{cv})$ is a dynamic weighting function:
$$ \omega(P_{cv}) = 1 - \exp(-\pi D_{cv}) $$
Here $D_{cv}$ measures the regression error for each center-vertex pair, so that pairs with larger errors receive higher weight during training. The overall training loss combines the keypoint loss, offset loss, and pairing loss:
$$ L_{det} = L_k + \lambda_{off} L_{off} + L_p $$
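A one-liner makes the dynamic weighting concrete (a numpy sketch of the $\omega(P_{cv})$ formula above; the function name is mine):

```python
import numpy as np

def pairing_weight(D_cv):
    """omega(P_cv) = 1 - exp(-pi * D_cv): center-vertex pairs with larger
    regression error D_cv receive weight closer to 1 during training."""
    return 1.0 - np.exp(-np.pi * np.asarray(D_cv, dtype=float))

print(pairing_weight([0.01, 0.5, 2.0]))  # ~[0.031, 0.792, 0.998]
```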
A post-processing parsing module then recovers row/column information by splitting each cell into bounding edges, merging them into horizontal and vertical lines, and sorting/indexing them.
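The parsing module is described only at a high level; the following hypothetical sketch captures the idea for axis-aligned cells (edge splitting, greedy line merging within a pixel tolerance `tol`, then index assignment). The real module must also handle inclined and curved tables:

```python
def index_rows_and_columns(cells, tol=5.0):
    """Assign logical (start_row, end_row, start_col, end_col) indices to
    axis-aligned cell boxes (x1, y1, x2, y2) by merging their edges into
    horizontal/vertical ruling lines and sorting them."""
    def merge(coords):
        groups = []
        for c in sorted(coords):
            if groups and c - groups[-1][-1] <= tol:
                groups[-1].append(c)          # same ruling line
            else:
                groups.append([c])            # start a new line
        return [sum(g) / len(g) for g in groups]

    ys = merge([y for _, y1, _, y2 in cells for y in (y1, y2)])
    xs = merge([x for x1, _, x2, _ in cells for x in (x1, x2)])

    def nearest(v, lines):
        return min(range(len(lines)), key=lambda k: abs(lines[k] - v))

    return [(nearest(y1, ys), nearest(y2, ys) - 1,
             nearest(x1, xs), nearest(x2, xs) - 1)
            for x1, y1, x2, y2 in cells]
```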
What experiments were performed?
Evaluation on WTW
The authors evaluate physical structure accuracy (cell coordinate precision, recall, F1 at IoU $= 0.9$) and logical structure accuracy (adjacency relation precision, recall, F1 at IoU $= 0.6$, plus TEDS).
Baselines include four object detectors (Faster-RCNN, TridentNet, Cascade-RCNN, CenterNet) with heuristic grouping, as well as two existing TSP methods (Split+Heuristic, CascadeTabNet) retrained on WTW.
Key results on WTW:
| Model | Phys. F1 | Adj. F1 | TEDS |
|---|---|---|---|
| Split+Heuristic | 3.4 | 27.6 | 26.0 |
| CenterNet (baseline) | 73.1 | 84.8 | 58.7 |
| Cycle-CenterNet | 78.3 | 92.4 | 83.3 |
The existing document-oriented TSP methods perform poorly on WTW, confirming that this dataset introduces genuinely new challenges. Cycle-CenterNet improves over the CenterNet baseline by 5.2 points on physical F1, 7.6 points on adjacency F1, and 24.6 points on TEDS.
Cross-dataset evaluation
On ICDAR 2013, Cycle-CenterNet achieves 91.7% adjacency F1, only 1.3% behind the top-ranked Split+Heuristic, and reaches 98.0% on the wired-table subset. On ICDAR 2019 Track B2, Cycle-CenterNet improves the weighted-average F1 by 15.2 points over the previous best reported result. These results suggest that training on WTW generalizes well to simpler document-oriented benchmarks.
What are the outcomes/conclusions?
The WTW dataset establishes a new benchmark for table structure parsing in natural, unconstrained settings. The authors demonstrate that methods designed for clean document images (Split+Heuristic, CascadeTabNet) largely fail on wild images, confirming the need for both new data and new approaches.
Cycle-CenterNet’s cycle-pairing mechanism provides an effective way to jointly detect cells and learn their grouping structure, avoiding the need for heuristic post-processing rules typical of prior work. The dynamic pairing loss focuses training on the hardest center-vertex pairs.
Limitations worth noting: the dataset focuses exclusively on wired tables, leaving wireless tables in natural images as an open problem. The model weights are not publicly released (the authors note the model is used in Alibaba’s online business software). The paper does not report results on merged or spanning cells in detail, nor does it discuss computational cost or inference latency.
Reproducibility
Models
- Architecture: CenterNet with DLA-34 backbone, extended with the cycle-pairing module producing two additional 8-channel output maps ($CV_{map}$ and $VC_{map}$).
- Pretrained on COCO, then trained on WTW.
- Model weights are not publicly released. The authors note the model is deployed in Alibaba’s online business software. An online testing demo was mentioned but may not be currently available.
Algorithms
- Optimizer and schedule: initial learning rate $1.25 \times 10^{-3}$, decayed to $1.25 \times 10^{-4}$ at epoch 90 and $1.25 \times 10^{-5}$ at epoch 120. Total: 150 epochs.
- Batch size: 32 per GPU (8 GPUs, so effective batch size of 256).
- Input resizing: max side resized to 1024, short side scaled proportionally.
- Loss hyperparameters: $\lambda_{cv} = 1.0$, $\lambda_{vc} = 0.5$.
- Post-processing: edge splitting, line merging, and row/column indexing (pseudo-code in supplementary materials).
Data
- WTW dataset: 14,581 images total (10,970 train / 3,611 test). Sources: natural scenes (50%), archival documents (30%), printed documents (20%).
- Annotations: cell coordinates (polygonal, using inner table lines), row/column indices, table instance IDs.
- Sensitive information (names, phone numbers) has been erased.
- The dataset is available via GitHub with a download link to the Tianchi Aliyun platform.
- License: CC-BY-NC-4.0.
Evaluation
- Physical structure: Precision, Recall, F1 at IoU threshold of 0.9 (stricter than typical object detection).
- Logical structure: Cell adjacency Precision, Recall, F1 at IoU threshold of 0.6; TEDS (Tree-Edit-Distance-based Similarity).
- Cross-dataset evaluation on ICDAR 2013 and ICDAR 2019 Track B2 using their respective official metrics.
- Baselines (Split+Heuristic, CascadeTabNet) retrained on WTW with author-provided hyperparameters.
- No error bars, significance tests, or multi-run statistics reported.
Hardware
- Training: 8 NVIDIA GTX 1080Ti GPUs.
- Total GPU-hours, inference latency, and memory requirements are not reported.
BibTeX
@inproceedings{long2021parsing,
title={Parsing Table Structures in the Wild},
author={Long, Rujiao and Wang, Wen and Xue, Nan and Gao, Feiyu and Yang, Zhibo and Wang, Yongpan and Xia, Gui-Song},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2021}
}
GloSAT Historical Measurement Table Dataset
TL;DR
GloSAT is a manually annotated dataset of 500 historical meteorological logbook page images drawn from nine archival sources spanning the 1700s to the modern day. Beyond standard full-table and cell annotations (ICDAR cTDaR-19 XML and VOC2007 formats), it adds labels for page headings, table headers, table bodies, and coarse segmentation cells that group semantically related data cells. Benchmark experiments using CascadeTabNet show that a simple DBScan-based post-processing step substantially improves cell detection, and that table border type matters more than page style for TSR performance on historical documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is a new annotated dataset for table structure recognition in historical documents. The paper’s central claims rest on the dataset’s design, sources, annotation schema, and the open release of splits, code, and model checkpoints.
Secondary: $\Psi_{\text{Evaluation}}$: The paper provides benchmark results across five dataset configurations (GloSAT coarse and individual cells, cTDaR-19 tracks B1 and B2, and their aggregation), and introduces a row/col F1 metric that evaluates ruling-line accuracy directly rather than cell-level IoU.
What is the motivation?
Historical data rescue pipelines, such as those digitizing 100,000s of meteorological logbooks in the GloSAT project, require table structure recognition applied to scanned images of handwritten and printed measurement records. Existing TSR datasets focus on modern electronic documents (PubLayNet, PubTabNet, TableBank) where annotations can be extracted automatically from PDF metadata, yielding large, richly annotated corpora. Historical document datasets (UNLV, cTDaR-19) are small and limit annotations to full-table bounding boxes and individual cells, omitting contextually important elements: page headings that identify the logbook, table headers that label columns, and coarse groupings of cells indicated by ruling lines.
Without annotations for these higher-level structural components, TSR models trained on existing datasets cannot reliably associate data cells with their semantic context, which is precisely what a downstream data rescue pipeline needs.
What is the novelty?
The dataset introduces two categories of annotation beyond the ICDAR standard:
- Enhanced contextual annotations in VOC2007 format: four classes (`heading`, `header`, `table_body`, `background`). A `heading` labels page-level context outside the table region (dates, captions); a `header` labels the column-labeling rows within the table.
- Coarse segmentation cells: annotations that follow the original ruling lines of the logbook, grouping multiple individual data cells into a single coarse cell when the source document groups them by ink lines or whitespace. This preserves the spatial groupings that carry semantic meaning (e.g., groups of weather station IDs from the same region).
The evaluation methodology adds a row/col F1 score that measures ruling-line accuracy directly: precision, recall, and F1 are computed over predicted vs. ground-truth ruling lines, where a predicted line matches the ground truth if it falls within $d$ pixels ($d = 20\%$ of the average cell width or height for the horizontal or vertical dimension, respectively). This is argued to be more informative than cell IoU for applications such as OCR pipelines that need precise cell boundaries to avoid clipping.
The paper also describes the cTDaR-19 weighted average F1 metric used as the primary cell detection score:
$$ \text{Weighted F1} = \frac{\sum_{i=1}^{4} \text{F1@IoU}_i \times \text{IoU}_i}{\sum_{i=1}^{4} \text{IoU}_i} $$
where IoU levels are $\{0.6, 0.7, 0.8, 0.9\}$.
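A minimal sketch of the weighted metric (the function name and input format are mine):

```python
def weighted_f1(f1_at_iou):
    """cTDaR-19 weighted average F1: per-threshold F1 scores weighted by
    the IoU threshold itself, so stricter thresholds count for more."""
    ious = (0.6, 0.7, 0.8, 0.9)
    return sum(f1_at_iou[t] * t for t in ious) / sum(ious)

print(weighted_f1({0.6: 0.5, 0.7: 0.4, 0.8: 0.3, 0.9: 0.2}))  # ~0.333
```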
What experiments were performed?
Two benchmark algorithms are evaluated on five dataset configurations:
CascadeTabNet (baseline): Cascade Mask R-CNN with HRNetV2p_W32 backbone. Due to severe class imbalance (35,555 cells vs. 710 full-table instances in GloSAT), two separate models are trained: one for full-table detection and one for cell detection. The backbone and region proposal network are frozen; only the three cascade heads are fine-tuned using Adam ($\text{lr} = 0.0001$ for TSR, $0.0003$ for detection) until loss stabilizes, typically 60-100 epochs. Learning rate is divided by 3 after epoch 50.
CascadeTabNet + DBScan post-processing: After CascadeTabNet cell predictions, 1D DBScan clustering is applied independently to horizontal and vertical ruling lines. Each detected cell contributes two points (start and end coordinate) to each dimension’s clustering. DBScan parameters are set heuristically: $\varepsilon$ = half the average cell width/height; $\text{minPts} = 2$. This extrapolates missing cells from the sparse initial detections. For the GloSAT dataset, results from both the coarse and individual cell models are combined as input to the post-processing step.
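A minimal sketch of the 1D clustering step using scikit-learn's DBSCAN (the helper name and toy coordinates are mine; per the paper, each cell contributes its start and end coordinate, $\varepsilon$ is half the average cell extent, and $\text{minPts} = 2$):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def ruling_lines_1d(coords, eps):
    """Cluster cell edge coordinates along one dimension; each cluster
    center approximates a ruling line, letting missing cells be
    extrapolated from the recovered grid."""
    X = np.asarray(coords, dtype=float).reshape(-1, 1)
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    return sorted(float(X[labels == k].mean()) for k in set(labels) if k != -1)

# x-starts/ends of sparse cell detections -> vertical ruling line estimates
print(ruling_lines_1d([10, 12, 11, 108, 110, 205, 207, 206], eps=20))
# [11.0, 109.0, 206.0]
```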
Dataset configurations tested: GloSAT coarse segmentation (129 test / 371 train), GloSAT individual cells (129/371), cTDaR-19 track B1 (150/600), cTDaR-19 track B2 (250/600), and an aggregated split (529/1571). A 75/25 train/test split is used uniformly. Training hardware: NVIDIA Tesla V100 (16 GB).
A breakdown of GloSAT performance by table style (bordered, semi-bordered, borderless) and page style (printed, mixed) is also reported for the best-performing model.
What are the outcomes/conclusions?
Post-processing provides substantial gains in all configurations:
| Dataset / Configuration | Baseline Weighted F1 | + Post-processing |
|---|---|---|
| GloSAT coarse cells (manual table det.) | 0.385 | 0.602 |
| GloSAT individual cells (manual table det.) | 0.047 | 0.284 |
| cTDaR-19 track B1 | 0.084 | 0.161 |
| cTDaR-19 track B2 | 0.082 | 0.155 |
| Aggregated | 0.071 | 0.181 |
Table border type is the dominant performance factor. Bordered tables reach weighted F1 = 0.79; semi-bordered reach 0.31; borderless reach 0.023. Page style (printed vs. mixed) has smaller and partially confounded effects, since mixed-style documents in the GloSAT collection correlate with the bordered table type.
The authors conclude that post-processing which encodes structural regularity (rectangular grid assumption) is valuable even when the underlying model has high precision (0.92 at IoU 0.6) but very low recall (0.042). The coarse segmentation annotation variant outperforms individual cell annotation on tables where multiple rows are grouped by ruling lines, suggesting the annotation design should match the downstream use case.
The results provide a public baseline for future researchers, but the absolute scores across all configurations remain low, indicating significant headroom for improvement on historical measurement tables.
Reproducibility
Models
- Architecture: CascadeTabNet (Cascade Mask R-CNN, HRNetV2p_W32 backbone). Two separate model instances trained: one for full-table detection, one for cell structure recognition.
- No new model architecture is introduced; the contribution is dataset and post-processing.
- Model checkpoints are released on GitHub alongside the dataset.
Algorithms
- Optimizer: Adam; learning rate $1 \times 10^{-4}$ (TSR) and $3 \times 10^{-4}$ (detection).
- Training duration: 60-100 epochs until loss stabilization; learning rate divided by 3 after epoch 50.
- Backbone and RPN frozen; only cascade heads trained.
- Post-processing: 1D DBScan per dimension; $\varepsilon =$ half mean cell width/height; $\text{minPts} = 2$.
Data
- 500 page images from nine historical archival sources; train/test split 75/25 (371/129).
- Sources include UK Met Office, NOAA, University of Reading, Ben Nevis observatory records, and others spanning 1830s to 1970s.
- Annotation tools: TTruth (initial grid layout) + Transkribus (visualization and adjustment).
- Annotations exported in VOC2007 and ICDAR cTDaR-19 XML formats; two versions (individual cells and coarse segmentation cells).
- Dataset, train/test splits, benchmark code, and model checkpoints are all released at the GitHub repository under a BSD open source license.
- Dataset is also archived on Zenodo (record 5363457).
Evaluation
- Primary metric: weighted average F1 at IoU $\in \{0.6, 0.7, 0.8, 0.9\}$ (cTDaR-19 convention).
- Secondary metric: row/col F1 evaluated on ruling lines within $d = 20\%$ of the average cell dimension.
- The ICDAR cTDaR-19 competition’s special rules for cell adjacency and blank cells are not used here; the same metric is applied to both tables and cells for direct comparison.
- Comparisons cover automated and manual table region detection variants.
- No statistical significance testing or multiple-run averaging is reported.
Hardware
- Training: NVIDIA Tesla V100 (16 GB memory) on the IRIDIS High Performance Computing Facility at the University of Southampton.
- Inference hardware and latency not reported.
BibTeX
@inproceedings{ziomek2021glosat,
author = {Ziomek, Juliusz and Middleton, Stuart E.},
title = {{GloSAT} Historical Measurement Table Dataset: Enhanced Table Structure Recognition Annotation for Downstream Historical Data Rescue},
booktitle = {Proceedings of the 4th International Workshop on Historical Document Imaging and Processing},
series = {HIP '21},
year = {2021},
publisher = {ACM},
doi = {10.1145/3476887.3476892},
}
TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition
TL;DR
TGRNet reframes table structure recognition as a graph reconstruction problem where each cell is a node with both a spatial location (bounding box) and a logical location (start/end row and column indices). A two-branch architecture performs segmentation-based cell detection and GCN-based ordinal classification for logical index prediction. The authors also contribute TableGraph-350K, a large-scale dataset with over 350K table graph annotations derived from TABLE2LATEX-450K.
What kind of paper is this?
- Dominant Basis: $\Psi_{\text{Method}}$
- The central contribution is an end-to-end architecture (TGRNet) that jointly detects cell spatial locations via segmentation and predicts cell logical locations via graph convolutional networks with ordinal regression. The paper devotes most of its pages to architecture design, loss formulation, and ablation studies.
- Secondary Basis: $\Psi_{\text{Resource}}$
- The paper introduces the TableGraph-350K dataset, providing table graph annotations (cell bounding boxes and logical indices) for over 350K tables from TABLE2LATEX-450K. Pretrained model checkpoints are also released.
What is the motivation?
Existing table structure recognition methods typically model tables in one of three ways: (1) detecting cell bounding boxes without inferring logical structure, (2) converting table images to markup sequences (LaTeX/HTML), or (3) predicting pairwise adjacency relations between cells. Each has limitations:
- Cell detection approaches recover spatial locations but ignore the logical row/column indices that downstream applications (e.g., table QA systems like TAPAS) require.
- Markup sequence approaches face a one-to-many mapping problem: the same table structure can be transcribed into different markup sequences, which introduces noise during training.
- Adjacency relation approaches only capture pairwise relationships. Recovering global logical structure from pairwise adjacency requires additional graph optimization algorithms, which adds complexity and introduces error propagation.
The authors argue that the logical location of each cell (its start/end row and column indices) is the most natural and complete representation of table structure. It can be used to derive adjacency matrices, markup sequences, or database formats, while the reverse transformations are harder.
What is the novelty?
Table Graph Formulation
The paper introduces the concept of a Table Graph $G = (V, A)$, where each node $v_i$ in $V$ represents a table cell with two attributes:
$$ b_i = (b_i^x, b_i^y, b_i^w, b_i^h) $$
$$ l_i = (row_i^{start}, row_i^{end}, col_i^{start}, col_i^{end}) $$
Here $b_i$ encodes the spatial bounding box (center, width, height) and $l_i$ encodes the logical location (start-row, end-row, start-column, end-column). The adjacency matrix $A$ is constructed from Euclidean distances between cell centers, with separate row and column components:
$$ a_{i,j}^{row} = \exp\{ -\left(\frac{b_i^y - b_j^y}{H} \cdot \alpha\right)^2 \} $$
$$ a_{i,j}^{col} = \exp\{ -\left(\frac{b_i^x - b_j^x}{W} \cdot \alpha\right)^2 \} $$
The adjustment factor $\alpha$ controls how sharply edge weights decay with distance, effectively limiting message passing to nearby cells.
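In numpy, the adjacency construction is a few lines (the box format and function name are mine; $\alpha = 3$ follows the paper's default):

```python
import numpy as np

def table_graph_adjacency(boxes, H, W, alpha=3.0):
    """Row/column adjacency from cell centers per the formulas above.
    boxes: (N, 4) array of (x, y, w, h) with (x, y) the cell center;
    H, W: image height and width; alpha: distance-decay factor."""
    b = np.asarray(boxes, dtype=float)
    dy = (b[:, None, 1] - b[None, :, 1]) / H   # normalized center-y gaps
    dx = (b[:, None, 0] - b[None, :, 0]) / W   # normalized center-x gaps
    a_row = np.exp(-(dy * alpha) ** 2)         # near 1 for same-row cells
    a_col = np.exp(-(dx * alpha) ** 2)         # near 1 for same-column cells
    return a_row, a_col
```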
Two-Branch Architecture
TGRNet uses a ResNet-50 backbone with FPN to extract multi-scale features, then splits into two branches:
Cell Spatial Location Branch: A segmentation module produces a 3-class pixel map (background, cell, boundary). A split-aggregation module pools features along row and column dimensions to capture tabular structure statistics. Cell bounding boxes are extracted as connected components from the segmentation map.
Cell Logical Location Branch: Detected cells initialize a graph. Node features combine spatial branch features with RoIAlign features. Parallel GCNs update node representations for row and column prediction separately. An ordinal classifier converts each logical index prediction into $T-1$ binary sub-problems.
Ordinal Regression with Focal Loss
The logical index $r_i \in \{0, 1, \ldots, T-1\}$ is converted to a binary vector $q_i \in \mathbb{R}^{T-1}$ where $q_i^t = 1$ if $t < r_i$. The loss function combines ordinal regression with a focal-style modulation to handle the long-tailed distribution of logical indices:
$$ \Psi(x_i', \Theta) = \sum_{t=0}^{r_i-1} (1 - p_i^t)^{\gamma_t} \log(p_i^t) + \sum_{t=r_i}^{T-2} (1 - p_i^t)^{\gamma_t} \log(1 - p_i^t) $$
$$ \gamma_t = \min(2, -(1 - \lambda_t)^2 \log(\lambda_t) + 1) $$
where $\lambda_t$ is the empirical probability of index $t$ in the training set. This adaptive focusing parameter assigns higher loss weight to rare (large) indices.
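The ordinal encoding and the adaptive focusing parameter are easy to make concrete (a sketch; function names are mine):

```python
import numpy as np

def ordinal_targets(r, T):
    """Encode logical index r in {0, ..., T-1} as the (T-1)-dim binary
    vector q with q[t] = 1 iff t < r, per the ordinal formulation above."""
    return (np.arange(T - 1) < r).astype(float)

def gamma_t(lam):
    """Adaptive focusing parameter: rare indices (small empirical
    probability lam) get gamma closer to the cap of 2."""
    return min(2.0, -(1.0 - lam) ** 2 * np.log(lam) + 1.0)

print(ordinal_targets(3, 8))        # [1. 1. 1. 0. 0. 0. 0.]
print(gamma_t(0.5), gamma_t(0.01))  # ~1.17 (moderate) vs. 2.0 (capped)
```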
What experiments were performed?
Datasets
- TableGraph-24K: A 24K-table subset of TableGraph-350K (20K train, 2K val, 2K test). Max row/column indices: 37/21.
- CMDD: 476 medical laboratory report tables (372 train, 104 test). No spanning cells. Max row/column indices: 24/5.
- ICDAR13-Table: 156 tables with spanning cells and varied styles. Split 50/50 for train/test. Max row/column indices: 57/12.
- ICDAR19-cTDaR (TrackB1): 881 tables from archival historical documents (679 train, 202 test). Max row/column indices: 87/43. The largest table contains over 2,000 cells.
Metrics
- Cell spatial location: Precision, Recall, and Hmean at IoU threshold 0.5.
- Cell logical location: Accuracy on four logical indices ($A_{rowSt}$, $A_{rowEd}$, $A_{colSt}$, $A_{colEd}$) and an overall accuracy $A_{all}$ requiring all four to be correct.
- Combined: $F_{\beta=0.5}$ combining Hmean and $A_{all}$.
- Adjacency-based: Weighted average F-Score (WAF) for comparison with adjacency relation methods.
Key Results
On TableGraph-24K, TGRNet achieves 0.906 Hmean for cell detection and 0.832 $A_{all}$ for logical location, yielding $F_{\beta=0.5} = 0.890$. On ICDAR13-Table, performance drops to $F_{\beta=0.5} = 0.519$, which the authors attribute to limited training data (78 tables) and domain shift from the pre-training data.
On CMDD, using ground-truth cell boxes, TGRNet reaches 0.992 $A_{all}$ for end-to-end table graph reconstruction.
Robustness Analysis
Experiments with randomly removed nodes (simulating imperfect cell detection) show TGRNet maintains relatively stable performance. At 80% cell coverage on CMDD, TGRNet achieves 0.857 $A_{all}$ compared to ReS2TIM’s 0.705.
Ablation Studies
On TableGraph-24K, each component contributes measurably:
| Component | $A_{all}$ | $F_{\beta=0.5}$ |
|---|---|---|
| Baseline (linear + cross-entropy) | 0.697 | 0.850 |
| + GCN | 0.788 | 0.876 |
| + Ordinal regression | 0.824 | 0.882 |
| + Focal loss | 0.832 | 0.890 |
The GCN provides the largest single improvement (+0.091 on $A_{all}$), followed by ordinal regression (+0.036) and focal loss (+0.008).
Logical Location vs. Adjacency Relation
On ICDAR19-cTDaR, TGRNet achieves 0.267 $A_{all}$ while ReS2TIM reaches only 0.138. However, ReS2TIM obtains a higher WAF (0.481 vs. 0.283) due to its pairwise relation loss being better aligned with the adjacency metric. The authors use this to illustrate that good adjacency prediction does not necessarily translate to accurate global logical structure recovery.
What are the outcomes/conclusions?
The paper demonstrates that framing table structure recognition as graph reconstruction provides a unified representation from which other formats (adjacency matrices, markup sequences, databases) can be derived. The ordinal classification formulation with focal loss effectively handles the long-tailed distribution of row/column indices.
Key limitations include:
- Performance degrades on datasets with limited training data and domain shift (ICDAR13-Table).
- On challenging historical documents (ICDAR19-cTDaR), both cell detection and logical location prediction remain difficult, with $A_{all}$ at only 0.267.
- The approach relies on connected-component extraction from segmentation maps, which can fail when cells are tightly packed (requiring morphological post-processing).
- Input images are resized to a fixed resolution ($480 \times 480$ or $800 \times 800$), which may lose fine-grained details in large tables.
Reproducibility
Models
- Backbone: ResNet-50 with FPN, producing 4 feature maps at strides 4, 8, 16, 32, each reduced to 256 channels.
- Cell spatial branch: 1x1 conv to reduce 1024 to 256 channels, split-aggregation module with row/column average pooling, 3-class segmentation output.
- Cell logical branch: Parallel GCN pair (one for rows, one for columns) with ordinal classifiers. Node features are 1280-dimensional (256 spatial + 1024 RoIAlign with 2x2 output).
- Pretrained checkpoints: Available for CMDD, ICDAR13-Table, ICDAR19-cTDaR, and TableGraph-24K via the GitHub repository.
Algorithms
- Pre-training strategy: The cell spatial branch is pre-trained for 50 epochs with the logical branch frozen before end-to-end training on TableGraph-24K.
- Transfer learning: For ICDAR13-Table, the model is initialized from TableGraph-24K-trained weights.
- Adjustment factor $\alpha$: Set to 3 for most experiments; increased to 10 for ICDAR19-cTDaR due to higher cell density.
- Graph construction during training: Only cells with IoU > 0.5 against ground truth are included as nodes.
- ICDAR19-cTDaR modifications: Spatial and logical branches trained separately due to GPU memory constraints; morphological opening (3x3 kernel) applied to segmentation maps; graph edges pruned to top $8 \times N$ by weight.
Data
- TableGraph-350K: 358,767 tables (343,988 train / 7,420 val / 7,359 test) derived from TABLE2LATEX-450K. Annotations generated by parsing LaTeX source code with color-coded row/column borderlines to extract cell spatial and logical locations. Max row index: 48, max column index: 27.
- TableGraph-24K: 24,000-table random subset (20K/2K/2K split). Max row index: 37, max column index: 21.
- Source data: TABLE2LATEX-450K collects tables from arXiv articles with LaTeX source and rendered images.
Evaluation
- Metrics: Precision/Recall/Hmean at IoU 0.5 for spatial detection; per-index accuracies and $A_{all}$ for logical location; $F_{\beta=0.5}$ for combined performance; WAF for adjacency comparison.
- Baselines: ReS2TIM is the primary comparison for logical location prediction. Fair comparison with spatial detection methods is noted as difficult due to inconsistent evaluation settings across prior work.
- Limitations acknowledged: Small training sets, domain shift between datasets, and the challenge of large historical document tables.
- No error bars or multi-run statistics reported.
Hardware
- The paper does not report specific GPU types, training times, or computational costs. ICDAR19-cTDaR experiments note GPU memory constraints requiring the two branches to be trained separately, and input images are resized to $800 \times 800$ (vs. $480 \times 480$ for other datasets).
BibTeX
@inproceedings{xue2021tgrnet,
title={TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition},
author={Xue, Wenyuan and Yu, Baosheng and Wang, Wen and Tao, Dacheng and Li, Qingyong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={6721--6730},
year={2021}
}
TNCR: Table Net Detection and Classification Dataset
TL;DR
TNCR is a dataset of 9,428 labeled tables drawn from publicly available documents (primarily FDA drug labels), spanning roughly 6,621 page images; pages may contain multiple tables, so the table count exceeds the page count. Tables are annotated for detection (bounding boxes) and classified into five structural categories: full lined, no lines, merged cells, partial lined, and partial lined merged cells. The authors benchmark several object detection architectures, with Cascade Mask R-CNN using a ResNeXt-101-64x4d backbone achieving the best overall F1 score of 84.4% at IoU 50%:95%.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is the TNCR dataset itself, including its five-class taxonomy of table types and the open-source release of images, annotations, and model checkpoints. The paper spends considerable effort describing the data collection pipeline, class definitions, and dataset statistics.
Secondary: $\Psi_{\text{Method}}$: The authors train and compare multiple object detection models (Cascade R-CNN, Cascade Mask R-CNN, Cascade RPN, HTC, YOLOv3) with various backbone configurations, establishing baselines for the new dataset.
What is the motivation?
Table detection and classification in scanned document images remains a difficult task due to the diversity of table layouts, image quality variations, and the presence of elements such as merged cells, partial borders, or no borders at all. Existing datasets (ICDAR2013, UNLV, Marmot, TableBank, ICDAR2019) either focus narrowly on table detection without structural classification, are relatively small, or consist of synthetically generated tables from LaTeX or Word. The authors argue that a dataset with real-world images, varying quality, and a structural classification scheme would be a useful resource for the community.
What is the novelty?
The paper introduces a table classification taxonomy that divides tables into five structural categories:
- Full lined: All cell boundaries are drawn with lines, no merged cells.
- No lines: No visible lines delimiting cells.
- Merged cells: Similar to full lined, but containing at least one merged cell.
- Partial lined: Some border lines are missing, but no merged cells.
- Partial lined merged cells: Some border lines are missing, and merged cells are present.
The dataset creation process is notable for its semi-automated pipeline. The authors first parsed 875,026 PDF pages from the FDA website (accessdata.fda.gov), then ran a Faster R-CNN model (trained on an initial unbalanced set) to identify 225,154 pages containing tables. They used the detected tables to augment underrepresented classes (no lines, partial lined, partial lined merged cells) and re-partition the dataset.
Evaluation follows a COCO-style protocol, with precision (denoted AP), recall (denoted AR), and their F1:
$$ \text{AP} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
$$ \text{AR} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
$$ \text{F1} = \frac{2 \times \text{AP} \times \text{AR}}{\text{AP} + \text{AR}} $$
These metrics are computed across IoU thresholds from 50% to 95%.
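As a sanity check of the metric definitions, the best result reported later in this summary can be reproduced directly (helper name is mine):

```python
def f1(ap, ar):
    """Harmonic mean of the paper's AP (precision) and AR (recall)."""
    return 2 * ap * ar / (ap + ar)

# Best TNCR configuration: Cascade Mask R-CNN with ResNeXt-101-64x4d
print(round(f1(0.797, 0.898), 3))  # 0.844
```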
What experiments were performed?
The authors trained and evaluated five families of object detection models on the TNCR dataset:
- Cascade Mask R-CNN with backbones: ResNet-50, ResNet-101, ResNeXt-101-32x4d, ResNeXt-101-64x4d (various learning rate schedules: 1x, 20e, 2x).
- Cascade R-CNN with the same set of backbones and schedules.
- Cascade RPN with Fast R-CNN and CRPN methods using ResNet-50.
- Hybrid Task Cascade (HTC) with ResNet-50, ResNet-101, ResNeXt-101-32x4d, and ResNeXt-101-64x4d backbones.
- YOLOv3 with DarkNet-53 at input scales 320, 416, and 608.
All models used SGD with momentum 0.9, weight decay 0.0001, and learning rate 0.02. Images were resized to $1300 \times 1500$ with batch size 16. All models used Feature Pyramid Network (FPN) as the neck. The dataset was split 70%/15%/15% for training, validation, and testing, with the split balanced across classes.
Evaluation was performed using precision, recall, and F1 at each individual IoU threshold from 50% to 95% (in 5% increments), as well as the average across IoU 50%:95%.
What are the outcomes/conclusions?
Key results at IoU 50%:95%:
| Model | Best Backbone | F1 |
|---|---|---|
| Cascade Mask R-CNN | ResNeXt-101-64x4d | 0.844 |
| Cascade R-CNN | ResNeXt-101-64x4d | 0.841 |
| HTC | ResNet-50 (1x) | 0.840 |
| Cascade RPN (Fast R-CNN) | ResNet-50 | 0.804 |
| YOLOv3 | DarkNet-53 (scale 320) | 0.492 |
Cascade Mask R-CNN with ResNeXt-101-64x4d achieved the best performance overall: precision 0.797, recall 0.898, and F1 0.844 at IoU 50%:95%. At the commonly reported IoU 50% threshold, the same model reached an F1 of 0.931.
YOLOv3 performed notably worse than the two-stage detectors, particularly at higher IoU thresholds. At IoU 90% and above, YOLOv3’s F1 dropped below 0.35 across all input scales, suggesting that the single-stage detector struggled with precise localization.
The authors note that ResNeXt-101-64x4d and ResNeXt-101-32x4d backbones suffered from overfitting when used with HTC, while performing well with Cascade R-CNN and Cascade Mask R-CNN. This suggests that the dataset size (approximately 6,500 training images) may be insufficient for the largest model configurations in some frameworks.
Limitations worth noting:
- The dataset is sourced almost entirely from FDA drug labels, which limits domain diversity.
- The class distribution is not balanced; three classes required augmentation from parsed FDA documents.
- The paper does not report per-class detection or classification accuracy, making it unclear how well models handle each table type.
- No comparison is made against other table detection datasets using the same models.
Reproducibility
Models
- Architectures: Cascade R-CNN, Cascade Mask R-CNN, Cascade RPN, HTC, YOLOv3
- Backbones: ResNet-50, ResNet-101, ResNeXt-101-32x4d, ResNeXt-101-64x4d, DarkNet-53
- All models use FPN as the feature neck
- Pretrained model checkpoints are released on the GitHub repository under MIT license
Algorithms
- Optimizer: SGD with momentum 0.9 and weight decay 0.0001
- Learning rate: 0.02
- Learning rate schedules: 1x (12 epochs) and 20e (20 epochs) for most models
- Batch size: 16
- Input resolution: $1300 \times 1500$
- Implementation: MMDetection library for PyTorch (except YOLOv3)
Data
- 9,428 labeled tables across approximately 6,621 document page images
- Sources: publicly available documents, primarily from accessdata.fda.gov
- Five classes: full lined (2,698), no lines (2,099), merged cells (2,013), partial lined (1,379), partial lined merged cells (1,149)
- Split: 70% training, 15% validation, 15% testing (split per class)
- Initial collection was unbalanced; three classes were augmented using a Faster R-CNN model to find additional tables from 875,026 parsed FDA PDF pages
- Annotations are in COCO format with bounding boxes and class labels
- Available for download via SharePoint and Google Drive links on the GitHub repository
Evaluation
- Metrics: precision, recall, and F1 score computed at each IoU threshold from 50% to 95% in 5% increments, plus the mean across 50%:95%
- True positive defined as a detection whose bounding box IoU with a ground truth box meets the threshold
- No per-class breakdown of detection or classification performance is reported
- No cross-dataset evaluation is performed
- No error bars, significance tests, or multi-run statistics are reported
Hardware
- Google Colab with 3 Tesla V100-SXM GPUs (16 GB each) and 16 GB RAM
- Additionally: a machine with 2x Intel Xeon E5-2680 CPUs and 4x NVIDIA Tesla K20X GPUs
- No training time or GPU-hours reported
- No inference latency measurements reported
BibTeX
@article{abdallah2022tncr,
title={TNCR: Table Net Detection and Classification Dataset},
journal={Neurocomputing},
year={2022},
issn={0925-2312},
doi={10.1016/j.neucom.2021.11.101},
url={https://www.sciencedirect.com/science/article/pii/S0925231221018142},
author={Abdelrahman Abdallah and Alexander Berendeyev and Islam Nuradin and Daniyar Nurseitov}
}
ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX
TL;DR
ICDAR 2021 introduces a competition and two new datasets for converting scientific table images to LaTeX code, splitting the problem into separate structure recognition (TSRD: 46K samples) and content recognition (TCRD: 38K samples) subtasks. The winning VCGroup method achieves 74% exact match on structure and 55% on content, beating prior baselines by 5 and 12 percentage points respectively.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: The primary contribution is the competition infrastructure: two purpose-built datasets (TSRD and TCRD) with ground-truth LaTeX annotations, a CodaLab evaluation platform, and a publicly accessible post-challenge leaderboard.
Secondary: $\Psi_{\text{Evaluation}}$: The paper proposes a suite of fine-grained metrics tailored to LaTeX sequence evaluation, covering both coarse structural accuracy (row/column counts) and fine-grained token-category accuracy (alphanumeric, LaTeX tokens, LaTeX symbols, non-LaTeX symbols).
What is the motivation?
Scientific documents rely heavily on tables to convey experimental comparisons and structured data. Extracting that information accurately requires both understanding the spatial structure of cells and recognizing complex content including mathematical symbols and LaTeX macros.
Existing datasets for this problem are small (e.g., ICDAR 2013 had only 150 table instances) or contain simple tables that do not represent real-world complexity. Prior work by Deng et al. used word-level tokenization without fine-grained task decomposition, resulting in poor recognition of the full LaTeX output. There was no dedicated benchmark that cleanly separated structure recognition from content recognition or that handled the diversity of LaTeX table libraries (booktabs, multirow, amsmath, etc.).
What is the novelty?
The paper introduces two complementary subtask formulations:
Task 1: Table Structure Reconstruction (TSR) asks systems to generate the LaTeX skeleton of a table, using a placeholder token CELL for cell content. The target vocabulary includes alignment specifiers, column separators, spanning commands, and line commands:
$$\mathcal{V}_{\text{TSR}} = \{\ \&,\ 0\text{--}9,\ \texttt{CELL},\ \backslash\backslash,\ \backslash\texttt{hline},\ \backslash\texttt{multirow},\ \backslash\texttt{multicolumn},\ \backslash\texttt{toprule},\ \ldots\ \}$$
Task 2: Table Content Reconstruction (TCR) asks systems to generate the full alphanumeric and mathematical content of each cell, including LaTeX math tokens and symbols.
Separating these two goals is the core structural novelty: structure and content are independently evaluated, enabling cleaner diagnosis of model capabilities and failure modes.
The paper also introduces a suite of evaluation metrics. For both tasks, the primary metric is Exact Match (EM), which requires the predicted token sequence to exactly match the ground truth:
$$\text{EM}(y^\ast, y) = \mathbf{1}[y_1^\ast = y_1 \wedge \cdots \wedge y_{n_2}^\ast = y_{n_2}]$$
A relaxed variant, EM@95%, accepts a prediction if at least 95% of the ground-truth tokens appear as a contiguous subsequence in the prediction (no insertions or deletions allowed within the matched span):
$$\text{EM@95\%}: \quad \exists\, i, j \ \text{s.t.}\ y_i^\ast = y_j,\; \ldots,\; y_{i+l}^\ast = y_{j+l}, \quad l \geq 0.95 \cdot n_2$$
For TSR, Row Accuracy (RA) and Column Accuracy (CA) measure whether the predicted row and column counts match. For TCR, token-category accuracies partition the output into four groups: alphanumeric (AN), LaTeX tokens (LT), LaTeX symbols (LS), and non-LaTeX symbols (NLS), measuring exact match within each category independently.
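A brute-force sketch of the two sequence metrics under my reading of the definitions (token sequences as Python lists; not the official scoring code):

```python
def exact_match(pred, gt):
    """EM: prediction and ground truth must be identical token sequences."""
    return pred == gt

def em_at_95(pred, gt):
    """EM@95%: some contiguous run covering >= 95% of the ground-truth
    tokens must appear verbatim (no insertions/deletions) in the
    prediction. Brute force over run lengths and offsets."""
    need = max(1, int(0.95 * len(gt)))
    for l in range(len(gt), need - 1, -1):           # longest runs first
        for i in range(len(gt) - l + 1):
            window = gt[i:i + l]
            if any(pred[j:j + l] == window
                   for j in range(len(pred) - l + 1)):
                return True
    return False

print(em_at_95(list("abcXdefghij"), list("abcdefghij")))  # False: best run 0.7
```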
What experiments were performed?
The competition ran from October 2020 to March 2021 on the CodaLab platform with three phases: validation, testing (hidden leaderboard), and post-evaluation. Three teams participated in the TSR test phase; two in the TCR test phase.
Datasets:
| Dataset | Max Seq. Len. | Total Samples | Train | Val | Test |
|---|---|---|---|---|---|
| TSRD | 250 | 46,141 | 43,138 | 800 | 2,203 |
| TCRD | 500 | 37,917 | 35,500 | 500 | 1,917 |
Tables were sourced from scientific documents; ground truth LaTeX was extracted from original source files.
Baselines:
- CNN-Baseline: Eight convolutional layers with interleaved max-pooling, followed by bidirectional LSTM encoder with trainable positional embeddings per row, and an LSTM decoder with visual attention (based on Deng et al., 2017).
- Transformer-Baseline: ResNet-101 image encoder with a Transformer decoder, adapted from a scene text recognition architecture (Feng et al., 2020).
Participant method (VCGroup): Based on the MASTER architecture (originally for scene text recognition), which uses a multi-aspect non-local block integrated into a CNN backbone to capture global spatial context. Specific techniques used by VCGroup include:
- Ranger optimizer (combining RAdam, LookAhead, and gradient centralization)
- Data augmentation: shear, affine, perspective, contrast, brightness, saturation
- Synchronized Batch Normalization for multi-GPU training
- Feature concatenation of the last two transformer decoder layers with linear projection
- Model ensemble (voting and bagging)
- For TCR only: multi-resolution inputs ($400 \times 400$ to $600 \times 400$) and pre-training on 58K samples from a larger external table-to-LaTeX dataset
What are the outcomes/conclusions?
TSR results:
| Method | EM | EM@95% | RA | CA |
|---|---|---|---|---|
| VCGroup | 0.74 | 0.88 | 0.95 | 0.89 |
| Transformer-Baseline | 0.69 | 0.85 | 0.93 | 0.86 |
| CNN-Baseline | 0.66 | 0.79 | 0.92 | 0.86 |
| Format* | 0.57 | 0.80 | 0.91 | 0.87 |
| asda* | 0.50 | 0.75 | 0.90 | 0.86 |
TCR results:
| Method | EM | EM@95% | AN | LT | LS | NLS |
|---|---|---|---|---|---|---|
| VCGroup | 0.55 | 0.74 | 0.85 | 0.75 | 0.96 | 0.62 |
| Transformer-Baseline | 0.43 | 0.64 | 0.74 | 0.67 | 0.94 | 0.54 |
| CNN-Baseline | 0.41 | 0.67 | 0.76 | 0.67 | 0.94 | 0.48 |
| Format* | 0.00 | 0.52 | 0.67 | 0.54 | 0.92 | 0.35 |
*Method description unavailable for rows marked with an asterisk.
VCGroup’s strongest gains appear in longer sequences, where the Transformer-Baseline degrades more sharply. The gap between RA/CA and EM reveals that row and column count errors are less common than token-level errors; exact sequence matching is harder than structural coarse accuracy.
Common failure modes across all methods include: incorrect handling of multi-row and multi-column spanning cells, confusion between `\multicolumn` and `\multirow`, and inconsistent dollar sign (`$`) prediction around numeric tokens. Both baselines tend to truncate long sequences prematurely.
The competition received 101 registrations but only a handful of test-phase submissions (three teams in TSR, two in TCR), which the authors attribute to the difficulty of the tasks. The datasets and platform remain available for post-challenge submissions.
Reproducibility
Models
- CNN-Baseline: Eight conv layers + five max-pool layers, bidirectional LSTM encoder, LSTM decoder with visual attention. Architecture based on Deng et al. (2017); no weights released as part of this competition report.
- Transformer-Baseline: ResNet-101 encoder + Transformer decoder. Based on Feng et al. (2020); no weights released.
- VCGroup (MASTER-based): Multi-aspect non-local block fused into CNN backbone + Transformer decoder. Full implementation details in He et al. (2021), arXiv:2105.07395. No weights publicly released with this report.
Algorithms
- VCGroup used the Ranger optimizer (RAdam + LookAhead + gradient centralization).
- Data augmentation: shear, affine, perspective, contrast, brightness, saturation transforms.
- Synchronized BatchNorm used for multi-GPU training.
- Feature concatenation of last two decoder layers with linear projection replaces single final layer output.
- Model ensemble via voting and bagging.
- Pre-training on a trimmed external dataset (58K of 450K samples) for the TCR task.
- No training hyperparameters (learning rate, batch size, number of epochs) are reported in this competition overview paper.
Data
- TSRD: 46,141 image-text pairs, max sequence length 250 tokens. Train/val/test splits: 43,138 / 800 / 2,203.
- TCRD: 37,917 image-text pairs, max sequence length 500 tokens. Train/val/test splits: 35,500 / 500 / 1,917.
- Data sourced from scientific papers; ground-truth LaTeX extracted from original source files.
- Available through the CodaLab competition platform (https://competitions.codalab.org/competitions/26979). No explicit standalone download link or Hugging Face release mentioned.
- License for the datasets is not explicitly stated in the paper beyond the arXiv CC-BY-4.0 license of the report itself.
Evaluation
- Primary metric: Exact Match (EM): full sequence equality.
- Secondary metric: EM@95%: at least 95% of ground-truth tokens matched as a contiguous subsequence (no insertions/deletions within the span).
- TSR-specific: Row Accuracy (RA) and Column Accuracy (CA): count-level structural accuracy.
- TCR-specific: token-category accuracies (AN, LT, LS, NLS).
- No error bars, significance tests, or multiple-run averaging are reported. The competition format uses single submissions evaluated on fixed test sets.
- No contamination analysis of the train/test split is reported.
Hardware
- No hardware details are reported for the baseline methods in this competition overview paper.
- VCGroup used multi-GPU training (SyncBN implies multiple GPUs), but GPU type, count, and training time are not reported here. Full hardware details may appear in He et al. (2021).
BibTeX
@inproceedings{kayal2021icdar,
title={{ICDAR} 2021 Competition on Scientific Table Image Recognition to {LaTeX}},
author={Kayal, Pratik and Anand, Mrinal and Desai, Harsh and Singh, Mayank},
booktitle={Document Analysis and Recognition -- ICDAR 2021},
year={2021},
publisher={Springer},
doi={10.1007/978-3-030-86337-1_50}
}
TabStruct-Net: Top-Down and Bottom-Up Cues for Table Structure Recognition
TL;DR
TabStruct-Net (ECCV 2020) is an end-to-end trainable architecture for table structure recognition that combines a modified Mask R-CNN cell detector with a graph neural network (DGCNN + LSTM) to simultaneously predict cell bounding boxes and row/column adjacency matrices. It introduces an alignment loss that enforces grid-like spatial constraints between detected cells, operates directly on table images without PDF metadata or OCR input, and was evaluated across eight benchmarks.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: the headline contribution is the end-to-end trainable architecture (TabStruct-Net) combining a modified cell detection network and a graph-based structure recognition network, supported by ablation studies.
Secondary: $\Psi_{\text{Evaluation}}$: the paper provides a cross-dataset comparison study that unifies results from previously published methods under two evaluation setups (image-only vs. image + ground-truth boxes), which has independent value for understanding the field at the time of publication.
What is the motivation?
At the time of this work, most table structure recognition systems relied on either PDF metadata (character bounding boxes, font information) or OCR outputs to extract layout features before inferring structure. These dependencies create generalization problems: PDF metadata is unavailable for scanned documents, and OCR errors compound for complex or dense tables. Methods that operated on images directly (e.g., image-to-markup models) struggled with large, complex tables.
The authors target tables with complex layouts: spanning cells, dense content, no ruling lines, and multi-line cells. The central question is whether visual cues alone, without any textual or metadata input, can support accurate structure recognition.
What is the novelty?
TabStruct-Net treats TSR as a two-stage process operating on image features only:
- Top-down cell detection: Locate every cell as a bounding box.
- Bottom-up structure recognition: Given the detected cells, predict pairwise row and column adjacency relationships to reconstruct the grid.
The architecture introduces three specific contributions over the Mask R-CNN + DGCNN baseline:
Modified Feature Pyramid Network. The standard FPN is extended with both a top-down pathway (propagating high-level semantics to low-resolution feature maps) and a bottom-up pathway that feeds low-level visual features upward. Dilated convolutions replace standard $3 \times 3$ convolutions to expand the receptive field and better capture row- and column-spanning structures.
Alignment loss. An auxiliary loss is added to the cell detector to explicitly penalize misalignment of cells that share the same row or column. For cells $c_i$ and $c_j$ in the same starting row $r \in SR$, the alignment penalty is:
$$\begin{aligned} L_{align} ={} & \sum_{r \in SR} \sum_{c_i, c_j \in r} \left\| y1_{c_i} - y1_{c_j} \right\|_2^2 + \sum_{r \in ER} \sum_{c_i, c_j \in r} \left\| y2_{c_i} - y2_{c_j} \right\|_2^2 \\ {}+{} & \sum_{c \in SC} \sum_{c_i, c_j \in c} \left\| x1_{c_i} - x1_{c_j} \right\|_2^2 + \sum_{c \in EC} \sum_{c_i, c_j \in c} \left\| x2_{c_i} - x2_{c_j} \right\|_2^2 \end{aligned}$$
where $SR$, $ER$, $SC$, $EC$ denote sets of cells sharing the same starting row, ending row, starting column, and ending column, respectively. This loss acts as a geometric regularizer that makes predicted boxes easier to post-process.
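A numpy sketch of one of the four terms (the same-starting-row term; the other three are analogous), counting each unordered pair once (function name and input format are mine):

```python
import numpy as np

def start_row_alignment_loss(boxes, same_start_row_groups):
    """Penalize differing top edges (y1) among cells that share a starting
    row. boxes: (N, 4) array of (x1, y1, x2, y2); groups: lists of cell
    indices that the ground truth places in the same starting row."""
    b = np.asarray(boxes, dtype=float)
    loss = 0.0
    for group in same_start_row_groups:
        y1 = b[list(group), 1]
        # sum of squared pairwise differences, each unordered pair once
        loss += ((y1[:, None] - y1[None, :]) ** 2).sum() / 2.0
    return loss

print(start_row_alignment_loss(
    [[0, 10, 5, 20], [6, 12, 11, 20]], [[0, 1]]))  # (12 - 10)^2 = 4.0
```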
LSTM visual features for structure recognition. The DGCNN graph network receives per-cell features extracted from the P2 layer of the FPN. Rather than using only the centroid feature (as in the original DGCNN paper), horizontal and vertical line scans of each cell region are fed through LSTMs (depth 128) to produce summary vectors. This captures content distribution across the full height and width of the cell.
The overall training loss combines detection losses with the structure recognition loss:
$$L = L_{box} + L_{cls} + L_{mask} + L_{align} + L_{gnn}$$
where $L_{box}$, $L_{cls}$, and $L_{mask}$ are the standard Mask R-CNN losses; $L_{align}$ is the alignment regularizer; and $L_{gnn}$ is the cross-entropy loss from the row/column adjacency classifiers.
Structure recognition is formulated as predicting two adjacency matrices $M_{row}, M_{col} \in \mathbb{R}^{N_{cells} \times N_{cells}}$, where $M_{row_{i,j}} = 1$ if cells $i$ and $j$ belong to the same row, and similarly for columns. Pairs are classified by concatenating interaction features (from the DGCNN edge convolutions over 40 nearest neighbors) with the difference between bounding box coordinates.
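As a concrete illustration, the ground-truth adjacency targets can be derived from logical grid coordinates as follows; the per-cell (start_row, end_row, start_col, end_col) tuple format is an assumption of this sketch, not the paper's data format.

```python
import numpy as np

def adjacency_targets(cells):
    """cells: list of (start_row, end_row, start_col, end_col) per cell,
    with spans inclusive. Returns M_row, M_col where M[i, j] = 1 when
    cells i and j occupy at least one common row / column."""
    n = len(cells)
    m_row = np.zeros((n, n), dtype=np.int64)
    m_col = np.zeros((n, n), dtype=np.int64)
    for i in range(n):
        for j in range(n):
            # Spans [a0, a1] and [b0, b1] overlap iff a0 <= b1 and b0 <= a1.
            m_row[i, j] = int(cells[i][0] <= cells[j][1] and cells[j][0] <= cells[i][1])
            m_col[i, j] = int(cells[i][2] <= cells[j][3] and cells[j][2] <= cells[i][3])
    return m_row, m_col
```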
What experiments were performed?
Datasets. Eight datasets are used: SciTSR (12K/3K train/test), SciTSR-COMP (12K/716), ICDAR-2013 (158 test, no official train), ICDAR-2013-partial (124/34), ICDAR-2019 cTDaR archival (600/150), UNLV-partial (446/112), TableBank (145K/1K), and PubTabNet (339K/114K).
Two evaluation setups are used throughout:
- Setup A (S-A): image-only input. Cell detection runs normally.
- Setup B (S-B): image + ground-truth cell bounding boxes. The detection component is bypassed; GT boxes are fed to the structure recognition network. This isolates structure recognition quality from detection errors.
Baselines. DeepDeSRT, TableNet, GraphTSR (the DGCNN paper), SPLERGE, Bi-directional GRU, and Image-to-Text are compared where applicable.
Metrics. Precision, recall, and F1 at IoU $\geq 0.6$ for physical structure; TEDS and BLEU for logical structure (TableBank, PubTabNet). CIDEr and ROUGE are also reported for XML-level evaluation.
Implementation. ResNet-101 backbone pretrained on MS-COCO, input images scaled to $1536 \times 1536$, batch size 1, TITAN X GPU (12 GB), SGD with lr = 0.001, momentum = 0.9, weight decay = 0.0001. Two-stage training: first stage uses 2,014 anchors and 512 ROIs; second stage uses 3,072 anchors and 2,048 ROIs.
What are the outcomes/conclusions?
Physical structure (Setup A). On ICDAR-2013, TabStruct-Net achieves F1 = 0.906, outperforming DeepDeSRT by approximately 27.5 percentage points; DeepDeSRT cannot handle spanning cells because it infers cells purely from row/column intersection. Under equivalent training conditions (no dataset-specific heuristics), TabStruct-Net outperforms SPLERGE by roughly 3 percentage points on ICDAR-2013.
On SciTSR, F1 = 0.920; on SciTSR-COMP (the harder split with complex structures), F1 = 0.895. On ICDAR-2019 cTDaR archival, F1 = 0.804 when trained on SciTSR + ICDAR-2019. The gap between S-A and S-B (e.g., F1 0.906 vs. 0.981 on ICDAR-2013) confirms that cell detection quality remains a bottleneck: when GT boxes are given, structure recognition F1 exceeds 0.98 on most benchmarks.
Logical structure. On TableBank, BLEU = 0.914 (Word) and 0.937 (LaTeX). On PubTabNet, TEDS = 0.901 (training on SciTSR only).
Ablation. Adding the top-down/bottom-up pathways and alignment loss together improves cell detection F1 by 2.3 points and structure recognition F1 by 2.4 points over the plain Mask R-CNN + DGCNN baseline. Adding LSTM-based visual features contributes a further 2.1 points. End-to-end joint training contributes approximately 1.5 points over training the two networks separately.
Failure mode. The model fails on tables with many contiguous empty cells: empty regions lack visual gradients that the cell detector relies on, leading to missed detections and cascading structure errors.
Limitations not acknowledged. The comparison table in S-A vs. S-B reveals that detection errors dominate; no analysis is provided of how detection precision affects downstream TSR quality at the system level. Training is done on SciTSR (scientific PDFs) and transferred to other domains; domain shift to archival or financial tables is observed but not analyzed in depth. The use of Tesseract OCR for post-processing content extraction introduces a dependency the architecture nominally avoids. Inference latency and throughput are not reported.
Reproducibility
Models
- Cell detection: ResNet-101 backbone (pretrained on MS-COCO), Mask R-CNN framework with extended FPN (top-down + bottom-up pathways). Dilated convolutions: RPN uses $2 \times 2$ filter with dilation 2; FPN uses filter-7 dilated convolution blocks.
- Structure recognition: DGCNN with edge convolution over 40 nearest neighbors, dense layer size 64 with ReLU; LSTM depth 128 for visual feature extraction. Features from P2 layer of the shared FPN.
- Weights: The code repository is available at https://github.com/sachinraja13/TabStructNet. The repository carries an MIT license inherited from Matterport’s Mask R-CNN base; no separate license covers the TabStructNet additions by the authors. Pretrained model weights are not released.
- Parameter count: Not reported.
Algorithms
- SGD optimizer; learning rate 0.001, momentum 0.9, weight decay 0.0001.
- Two-stage training: stage 1 (2,014 anchors, 512 ROIs) then stage 2 (3,072 anchors, 2,048 ROIs).
- Anchor scales: 0.5, 1, 2; anchor box sizes: 8, 16, 32, 64, 128.
- Monte Carlo sampling during training for class balancing in the adjacency classifiers; all pairs enumerated at test time.
- IoU threshold of 0.5 to filter predicted cells for structure recognition training.
- Maximum 2,400 vertices per graph.
- Ground-truth bounding boxes for datasets with content-level annotations (e.g., ICDAR-2013) are preprocessed to enforce row/column alignment before computing alignment loss.
Data
- Primary training set: SciTSR (12K train split), plus ICDAR-2019 for the archival evaluation.
- Marmot Extended (1,016 training images) is listed in the dataset table but specific results are not reported in the main paper.
- All datasets used (SciTSR, ICDAR-2013, ICDAR-2019, UNLV, TableBank, PubTabNet) are publicly available with varying licenses; SciTSR is MIT, TableBank is research-only.
- No data augmentation details are reported.
Evaluation
- IoU threshold: 0.6 for confusion matrix computation (consistent across all datasets).
- Evaluation excludes empty cells (consistent with ICDAR-2013 practice).
- TEDS metric used as defined in the PubTabNet paper; BLEU as in the TableBank paper.
- No error bars, confidence intervals, or multiple-run statistics are reported.
- The two-setup evaluation (S-A vs. S-B) provides a useful isolation of detection vs. structure recognition, but direct comparisons against methods that use additional inputs (GT boxes, PDF metadata) are not always fair; the paper acknowledges this.
Hardware
- Training: 1x NVIDIA TITAN X GPU (12 GB VRAM), batch size 1.
- No training time or total compute hours reported.
- Inference speed not reported.
BibTeX
@inproceedings{raja2020tabstructnet,
author = {Sachin Raja and Ajoy Mondal and C. V. Jawahar},
editor = {Andrea Vedaldi and Horst Bischof and Thomas Brox and Jan-Michael Frahm},
title = {Table Structure Recognition Using Top-Down and Bottom-Up Cues},
booktitle = {Computer Vision -- ECCV 2020, Part XXVIII},
series = {Lecture Notes in Computer Science},
pages = {70--86},
publisher = {Springer},
year = {2020},
doi = {10.1007/978-3-030-58604-1_5},
}
CDeC-Net: Composite Deformable Cascade Network for Table Detection
TL;DR
CDeC-Net extends Cascade Mask R-CNN with a composite dual-backbone architecture and deformable convolutions to improve table detection accuracy across a range of document types and IoU thresholds. A single model trained on IIIT-AR-13K produces results at or near dataset-specific baselines across seven benchmark datasets without dataset-specific tuning.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The headline contribution is a new detection architecture. The paper proposes CDeC-Net, combining three existing components (cascade detection, composite backbones, deformable convolution) in a new configuration tailored for table detection. The bulk of the paper consists of ablations and baseline comparisons that validate design choices.
Secondary: $\Psi_{\text{Evaluation}}$: A notable secondary contribution is the breadth of evaluation: seven benchmark datasets are tested under multiple IoU thresholds and diverse experimental protocols, including a single-model evaluation variant.
What is the motivation?
Table detection in document images is complicated by high intra-class variability (tables differ widely in layout and ruling style) and inter-class similarity (figures, graphs, and equations can visually resemble tables). At the time of this work, existing deep learning approaches suffered from three interconnected problems:
- Backbone mismatch: Standard backbones pre-trained for image classification (ImageNet) are not necessarily well suited to extracting features for document table detection. Training a deeper, higher-capacity backbone from scratch is expensive.
- Fixed geometric structure: Standard convolutions sample from a fixed grid, limiting the model’s ability to handle the geometric variation typical of tables (spanning cells, rotated content, irregular borders).
- IoU threshold sensitivity: Most detectors train with an IoU threshold of 0.5 for positive sample assignment. At higher evaluation thresholds the number of valid positives drops sharply, leading to degraded performance and imprecise bounding boxes.
Additionally, prior methods typically trained and evaluated a separate model per benchmark. The authors argue for a generic solution: a single model that works across diverse datasets.
What is the novelty?
CDeC-Net composes three independently proposed techniques into a unified architecture:
Cascade Mask R-CNN: A multi-stage extension of Faster R-CNN in which each successive stage is trained at a higher IoU threshold (0.5, 0.6, 0.7). Each stage regresses bounding boxes that serve as proposals for the next stage. This progressive refinement maintains enough positive samples at higher thresholds while producing tighter boxes.
Composite (dual) backbone: Inspired by CBNet, two ResNeXt-101 backbones are stacked in parallel. The “assistant” backbone feeds its stage-$l$ output into the corresponding stage of the “lead” backbone via a composite connection $g(\cdot)$:
$$x_{bl}^l = F_{bl}^l\left(x_{bl}^{l-1} + g\left(x_{ba}^l\right)\right), \quad l \ge 2$$
This provides the lead backbone with richer features derived from a second forward pass without retraining a substantially larger single network.
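A sketch of the composite connection under CBNet-style assumptions: $g(\cdot)$ is taken to be a $1 \times 1$ convolution plus batch normalization (with resizing when the stages run at different strides), which follows the CBNet design but is an assumption here rather than a detail confirmed by this paper.

```python
import torch.nn as nn

class CompositeConnection(nn.Module):
    """g(.): project the assistant backbone's stage-l output so it can be
    added to the lead backbone's stage-(l-1) output (the input to F_bl^l)."""
    def __init__(self, assistant_channels, lead_channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(assistant_channels, lead_channels, kernel_size=1),
            nn.BatchNorm2d(lead_channels),
        )

    def forward(self, x_assistant, x_lead_prev):
        g = self.proj(x_assistant)
        # Resize if the two stages run at different spatial resolutions.
        if g.shape[-2:] != x_lead_prev.shape[-2:]:
            g = nn.functional.interpolate(g, size=x_lead_prev.shape[-2:], mode="nearest")
        return x_lead_prev + g
```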
Deformable convolution: Standard convolution samples over a fixed regular grid $R$:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$$
Deformable convolution augments each sampling location with a learned offset $\Delta p_n$:
$$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
The offsets are themselves learned end-to-end, allowing each neuron to adapt its receptive field to the shape and scale of the object being detected. Both dual backbones use deformable convolution throughout.
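Deformable convolution is available off the shelf in torchvision; a minimal usage sketch follows (predicting offsets with a plain convolution is the standard recipe, not something specific to CDeC-Net).

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 64, 64, 3
# Offsets (a dx, dy pair per kernel tap) are predicted from the input itself.
offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)
deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=1)

x = torch.randn(1, in_ch, 32, 32)
offsets = offset_conv(x)   # shape (1, 2*3*3, 32, 32)
y = deform(x, offsets)     # sampling grid shifts per location
print(y.shape)             # torch.Size([1, 64, 32, 32])
```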
The combination of these three components is applied to all benchmark datasets under a unified training regime, including a single-model variant (CDeC-Net$^\ddagger$) trained only on IIIT-AR-13K and optionally fine-tuned on each target dataset.
What experiments were performed?
Datasets: ICDAR-2013 (238 test images), ICDAR-POD-2017 (817 test images, multi-class), UNLV (424 test images), Marmot (1K test images, English and Chinese), ICDAR-2019 cTDaR (439 test images), TableBank (LaTeX and Word splits, 1K test each), PubLayNet (11K test images, multi-class).
Metrics: Recall (R), Precision (P), F1 score, and mean Average Precision (mAP) at IoU thresholds matching each benchmark’s established protocol.
Baselines: Per-benchmark state-of-the-art methods at time of submission, including DeCNT (deformable Faster R-CNN), GOD (Mask R-CNN variant), YOLO-based detectors, and Mask/Faster R-CNN variants from the PubLayNet and TableBank papers.
Implementation: MMDetection (PyTorch), ResNeXt-101 backbone pre-trained on MS-COCO with FPN, input resolution 1200$\times$800, initial learning rate 0.00125 with warmup (0.0033 for 500 iterations), decay at epochs 25 and 40, 50 total epochs (8 epochs for large datasets such as PubLayNet and TableBank). Batch size 1 on a single NVIDIA GeForce RTX 2080 Ti (12 GB). Multi-scale testing at 7 scales; a detection is kept only if it appears in at least 4 of the 7 scale runs.
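The 4-of-7 consensus rule can be sketched as follows, assuming per-scale detections have already been mapped back to original image coordinates; the IoU threshold used to match boxes across scales is an assumption, since the paper does not spell out the matching step.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def consensus_filter(per_scale_boxes, min_votes=4, iou_thr=0.5):
    """Keep a detection only if a matching box appears in at least
    min_votes of the scale runs. per_scale_boxes is a list (one entry
    per scale) of lists of (x1, y1, x2, y2) boxes."""
    kept = []
    for boxes in per_scale_boxes:
        for box in boxes:
            votes = sum(
                any(iou(box, other) >= iou_thr for other in scale_boxes)
                for scale_boxes in per_scale_boxes
            )
            # Deduplicate: skip boxes already represented in the output.
            if votes >= min_votes and not any(iou(box, k) >= iou_thr for k in kept):
                kept.append(box)
    return kept
```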
Ablation (Table IX, trained on Marmot, tested on ICDAR-2013 at IoU 0.5):
| Model | F1 |
|---|---|
| Cascade Mask R-CNN + ResNeXt-101 | 0.981 |
| + Composite backbone | 0.984 |
| + Deformable convolution (CDeC-Net) | 1.000 |
IoU robustness (Table VIII/XIII, single best model per dataset): CDeC-Net maintains high F1 from 0.5 to 0.8 on ICDAR-2013 and ICDAR-2019. Performance on UNLV and ICDAR-2013 drops noticeably at IoU 0.9 (F1 0.557 and 0.660 respectively), while ICDAR-2019 retains F1 0.915 at IoU 0.9.
What are the outcomes/conclusions?
Dataset-specific models (Table III): CDeC-Net exceeds prior state of the art on five of seven benchmarks and comes within 0.1% F1 on a sixth (ICDAR-2019):
| Dataset | F1 | mAP |
|---|---|---|
| ICDAR-2013 | 1.000 | 1.000 |
| ICDAR-2017 | 0.947 | 0.912 |
| ICDAR-2019 | 0.944 | 0.922 |
| UNLV | 0.938 | 0.912 |
| Marmot | 0.952 | 0.911 |
| TableBank | 0.987 | 0.976 |
| PubLayNet | 0.978 | 0.967 |
ICDAR-2017 is the one dataset where CDeC-Net falls 2.4% F1 below the YOLO-based state of the art (0.971), which the authors attribute to those methods being specifically optimized for ICDAR-2017.
Single model (CDeC-Net$^\ddagger$): The IIIT-AR-13K-trained model reaches F1 figures close to or above dataset-specific baselines on all datasets after optional fine-tuning, notably exceeding those baselines on ICDAR-2019 and PubLayNet (Table I).
Failure modes: The model merges closely spaced tables into a single bounding box and occasionally produces false positives on graph-like structures that resemble tables. No error bars or statistical significance tests are reported.
Limitations acknowledged: The ICDAR-2017 gap is noted. The authors do not discuss computational overhead relative to simpler baselines or the practical cost of multi-scale inference at 7 scales.
Reproducibility
Models
- Architecture: Cascade Mask R-CNN with dual composite ResNeXt-101 backbones and deformable convolution, plus FPN. Three cascade stages at IoU thresholds 0.5, 0.6, and 0.7.
- Backbone: ResNeXt-101 (blocks 3-4-23-3 configuration), pre-trained on MS-COCO with FPN. Two identical instances connected via composite connections.
- Code and weights: The paper states code and models will be publicly released. The repository has moved to https://github.com/madhav1ag/CDeCNet and is released under the MIT license. Pretrained model weights are included in the repository.
Algorithms
- Framework: MMDetection (PyTorch)
- Optimizer: SGD (implied by MMDetection defaults; not explicitly stated)
- Learning rate: 0.00125 initial; decay at epochs 25 and 40; warmup factor 0.0033 for first 500 iterations
- Epochs: 50 for small/medium datasets; 8 for PubLayNet and TableBank; 12 for fine-tuning
- Batch size: 1
- Input resolution: 1200$\times$800 (aspect ratio preserved)
- Anchor ratios: 0.5, 1.0, 2.0; single anchor scale of 8
- Test-time augmentation: Multi-scale at 7 scales (3 smaller, original, 3 larger); consensus voting requiring presence in at least 4 of 7 scales
Data
- Primary training set for single model: IIIT-AR-13K (9K training, 2K validation, 2K test; 5 classes including table, figure, natural image, logo, signature)
- Benchmark datasets used: ICDAR-2013, ICDAR-POD-2017, UNLV, Marmot, ICDAR-2019 (cTDaR), TableBank (LaTeX, Word, both splits), PubLayNet (all public)
- Annotation quality: Ground truth for TableBank and PubLayNet is automatically generated; annotations for the other benchmarks are human-curated
Evaluation
- Metrics: Recall, Precision, F1, and mAP; IoU threshold varies per benchmark protocol (0.5 standard; some benchmarks use 0.6, 0.8, or 0.9)
- Protocol: The authors replicate the exact training/fine-tuning/test splits and IoU thresholds of each compared paper to ensure fair comparison
- Statistical rigor: No error bars, confidence intervals, or multiple-run reporting. Results are from single training runs per configuration.
- Limitations not acknowledged: Multi-scale test-time augmentation at 7 scales substantially increases inference cost; this is not discussed relative to baselines that do not use TTA.
Hardware
- Training GPU: Single NVIDIA GeForce RTX 2080 Ti (12 GB VRAM)
- GPU-hours: Not reported
- Inference: Multi-scale testing at 7 scales implies non-trivial inference overhead; latency not reported
- Deployment: The 12 GB training requirement suggests a mid-range workstation GPU is sufficient for training; inference requirements are not specified
BibTeX
@inproceedings{agarwal2020cdecnet,
title = {{CDeC-Net}: Composite Deformable Cascade Network for Table Detection in Document Images},
author = {Agarwal, Madhav and Mondal, Ajoy and Jawahar, C. V.},
booktitle = {2020 25th International Conference on Pattern Recognition (ICPR)},
year = {2021},
pages = {9491--9498},
doi = {10.1109/ICPR48806.2021.9411922}
}
GTE: Global Table Extractor and the FinTabNet Dataset
TL;DR
GTE is a vision-guided framework for joint table detection and cell structure recognition that introduces two innovations: a constraint loss coupling cell detection to table detection during training, and a hierarchical cell network that classifies table style before routing to a specialized cell detector. The paper also introduces FinTabNet, a dataset of ~113k financial tables from S&P 500 annual reports with HTML annotations, filling the gap between scientific-paper datasets (PubTabNet) and real-world financial documents.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ The headline contribution is the GTE framework: the constraint-guided table detector, the hierarchical cell detection network, and the cluster-based cell-to-structure algorithm. The bulk of the paper is devoted to architecture design, ablations, and competitive-baseline comparisons on ICDAR 2013 and 2019.
Secondary: $\Psi_{\text{Resource}}$ The paper introduces FinTabNet, a large-scale dataset of financial tables from SEC 10-K filings with HTML cell-level annotations, as a complement to the scientific-paper-focused PubTabNet.
What is the motivation?
Table extraction from PDF and image documents requires solving two coupled problems: locating table regions on a page, and recovering the logical cell grid within each table. Prior approaches treat these as independent tasks. The authors identify two specific weaknesses in prior detection-based work:
- Coupling failure: Table and cell detection are trained independently, ignoring the natural geometric constraint that cells must lie inside tables and tables must contain cells.
- Style blindness: Object detection models attend to local regions and miss the global ruling-line style of a table, causing performance degradation when mixing tables with and without graphical lines in training.
Additionally, existing large-scale TSR datasets (PubTabNet, TableBank) draw exclusively from scientific articles and are not representative of financial documents, which have denser and more complex multi-level header structures.
What is the novelty?
GTE-Table: Constraint-guided detection
The table detector is a RetinaNet with a ResNet-50-FPN backbone. The key innovation is a constraint loss (CL) that explicitly enforces the table-cell spatial relationship during training. Two boolean operators over predicted cell masks form the building blocks. Let $\text{SLC}(M(B), b)$ denote the area of overlap between cell mask $M(B)$ and box $b$, and $A(b)$ denote the area of $b$. Then:
$$ C(b_{in}, b_{out}) = \bigl[\text{SLC}(M(B_{cells}), b_{out}) - \text{SLC}(M(B_{cells}), b_{in})\bigr] < \alpha \cdot \bigl[A(b_{out}) - A(b_{in})\bigr] $$
$$ D(b_{in}, b_{out}) = \bigl[\text{SLC}(M(B_{cells}), b_{out}) - \text{SLC}(M(B_{cells}), b_{in})\bigr] > 0 $$
$C$ is true when the cell coverage in the annular region between $b_{in}$ and $b_{out}$ falls below a fraction $\alpha$; $D$ is true when any cells exist in that region. The penalty indicator for a candidate table box $b_{tbl}$ combines four geometric checks:
$$ I(b_{tbl}) = C(\mathbf{0}, b_{tbl}) \;\lor\; C(S(b_{tbl}, \mu_1), b_{tbl}) \;\lor\; D(S(b_{tbl}, \mu_2), S(b_{tbl}, \mu_3)) \;\lor\; C(U(b_{tbl}, \mu_4), b_{tbl}) $$
where $S(b, \mu)$ expands box $b$ by $\mu$ pixels on all sides and $U(b, \mu)$ expands only the bottom edge. The constraint loss is:
$$ \mathcal{L}_{\text{CL}} = \sum_{b_{tbl} \in B_{tbl}} I(b_{tbl}) P(b_{tbl}) + \gamma_1 (1 - I(b_{tbl}))(1 - P(b_{tbl})) $$
where $P(\cdot)$ is the detection probability and $\gamma_1 = 0.1$ down-weights the false-negative penalty. One channel of the table detector input is also replaced with a mask from the cell detector output to directly propagate cell evidence into table training.
At inference, standard non-maximum suppression is replaced with a Constraint Coefficient (CCoef) ranking. For each candidate box, CCoef measures cells just outside minus cells just inside:
$$ \text{CCoef}(b_{tbl}) = \text{SLC}(M(B_{cell}), S(b_{tbl}, \mu_5)) - \text{SLC}(M(B_{cell}), b_{tbl}) - \gamma_2 \cdot \bigl[\text{SLC}(M(B_{cell}), b_{tbl}) - \text{SLC}(M(B_{cell}), S(b_{tbl}, \mu_6))\bigr] $$
For overlapping pairs where detection probabilities are close ($|P(b_i) - P(b_j)| < \epsilon$), the box with the higher CCoef (more cells outside, fewer inside) is discarded.
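A sketch of this CCoef-based suppression, assuming `ccoefs` holds the values from the formula above and `overlap` returns IoU; the fallback to plain probability ranking when probabilities are not close mirrors ordinary NMS and is an assumption here.

```python
def ccoef_suppress(boxes, probs, ccoefs, overlap, eps=0.1, overlap_thr=0.5):
    """boxes / probs / ccoefs are parallel lists; overlap(a, b) -> IoU."""
    discard = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(boxes[i], boxes[j]) < overlap_thr:
                continue
            if abs(probs[i] - probs[j]) < eps:
                # Near-equal confidence: higher CCoef means more cells just
                # outside and fewer just inside, so that box is discarded.
                discard.add(i if ccoefs[i] > ccoefs[j] else j)
            else:
                discard.add(i if probs[i] < probs[j] else j)
    return [b for k, b in enumerate(boxes) if k not in discard]
```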
GTE-Cell: Hierarchical style-aware detection
A three-stage hierarchy:
- Attribute network: A ResNet-50 binary classifier determines whether a table has vertical graphical ruling lines. This routes the table to one of two specialized cell detectors.
- Specialized cell detectors: Two RetinaNet models trained with different augmentation schemes. The “graphical lines” model trains on original images plus augmentations that add full boundaries at every row/column. The “no lines” model additionally trains on augmentations that erase existing graphical lines.
- Cluster-based structure recovery: Cell bounding boxes are converted to a logical grid via K-means clustering on cell coordinates to identify row and column locations, followed by heuristic post-processing (text-line alignment, spanning cell expansion based on text capitalization, leftover text box assignment); a minimal sketch of the clustering step follows this list.
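The sketch below covers the clustering step alone, assuming the number of rows and columns is supplied; the paper's full Algorithm 1, including the heuristic post-processing, is in its supplementary material.

```python
import numpy as np
from sklearn.cluster import KMeans

def grid_from_cells(cell_boxes, n_rows, n_cols):
    """Assign each detected cell box (x1, y1, x2, y2) a logical (row, col)
    by clustering vertical and horizontal box centers."""
    boxes = np.asarray(cell_boxes, dtype=float)
    y_centers = ((boxes[:, 1] + boxes[:, 3]) / 2).reshape(-1, 1)
    x_centers = ((boxes[:, 0] + boxes[:, 2]) / 2).reshape(-1, 1)

    row_ids = KMeans(n_clusters=n_rows, n_init=10).fit_predict(y_centers)
    col_ids = KMeans(n_clusters=n_cols, n_init=10).fit_predict(x_centers)

    def ordered(ids, centers):
        # Relabel clusters so index 0 is topmost / leftmost.
        means = [centers[ids == k].mean() for k in range(ids.max() + 1)]
        rank = {k: r for r, k in enumerate(np.argsort(means))}
        return np.array([rank[k] for k in ids])

    return ordered(row_ids, y_centers.ravel()), ordered(col_ids, x_centers.ravel())
```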
FinTabNet dataset
FinTabNet uses the same automated annotation pipeline as PubTabNet: pixel-level matching between PDF-rendered pages and XML markup from source documents. The source is annual reports of S&P 500 companies (SEC 10-K filings), chosen to represent complex financial table structures with multi-level headers, spanning cells, and dense numeric content that are absent from scientific paper datasets.
The annotation includes pixel-coordinate cell bounding boxes, logical row/column span, and cell text content as HTML. At the time of submission (arXiv v2, December 2020), release was pending legal review; the dataset was subsequently published on IBM’s Data Asset eXchange.
What experiments were performed?
Datasets
- ICDAR 2013: 96/156 train/test tables from EU and US government reports. Primary benchmark.
- ICDAR 2019: Modern and archival document tracks.
- PubTabNet (val): Scientific tables; used to compare TEDS without access to the test set.
- FinTabNet: Financial tables from annual reports; models tested with and without fine-tuning.
Metrics
- Table detection (ICDAR 2013): Character-level Recall, Precision, F1 (official competition script); also Purity (Pu) and Completeness (Cpt) counts, defined over a set $N$ of test documents as:
$$ Pu = \sum_{n \in N} \mathbb{1}[\text{Prec}(n) = 1], \qquad Cpt = \sum_{n \in N} \mathbb{1}[\text{Rec}(n) = 1] $$
- Cell structure recognition (ICDAR 2013): Adjacency relation Precision, Recall, F1.
- TSR (PubTabNet / FinTabNet): TEDS (Tree-Edit-Distance Similarity, introduced in PubTabNet, arXiv:1911.10683), computed on the validation set PDF images (not the withheld test set). GTE modifies the evaluation script to ignore bold/italic styling tags.
Training setup
- RetinaNet with ResNet-50-FPN, pretrained on MS COCO.
- Table network pretrained on TableBank + PubTabNet boundaries, then fine-tuned on ICDAR 2013 train.
- Cell network pretrained on PubTabNet, then fine-tuned on ICDAR 2013 train.
- Table input resolution: $900 \times 643$; cell input resolution: $965 \times 1350$ (higher resolution needed for small, dense cells).
- Attribute (style) classifier: ResNet-50 with binary head, pretrained on SD-Tables dataset attributes, fine-tuned on ICDAR 2013.
Key results
Table detection on ICDAR 2013:
| Method | Input | Recall | Precision | F1 |
|---|---|---|---|---|
| FineReader | PDF | 99.71 | 97.29 | 98.48 |
| Nurminen | PDF | 90.77 | 92.10 | 91.43 |
| TableBank | Image | / | / | 96.25 |
| GTE | Image | 99.77 | 98.97 | 99.31 |
Cell structure on ICDAR 2013 (without GT border):
| Method | Recall | Precision | F1 |
|---|---|---|---|
| Nurminen | 80.78 | 86.93 | 83.74 |
| GTE | 92.72 | 94.41 | 93.50 |
PubTabNet and FinTabNet (Table F1 and TEDS, validation set):
| Dataset | Method | Finetuned? | Table F1 | TEDS |
|---|---|---|---|---|
| PubTabNet | GTE | Y | N/A | 93.01 |
| FinTabNet | Det-Base | N | 81.17 | 41.57 |
| FinTabNet | GTE | N | 89.97 | 87.14 |
| FinTabNet | GTE | Y | 95.29 | 91.02 |
The Table F1 values show that GTE improves table detection on FinTabNet even without fine-tuning (89.97 vs. 81.17), with a further gain after fine-tuning. TEDS without fine-tuning (87.14) already far exceeds the detection-base structure result (41.57).
What are the outcomes/conclusions?
The authors report a measurable improvement of approximately 3.6 F1 points from the constraint-guided training loop over the version without inter-network information (GTE-Table-Sep). The hierarchical cell style classifier improves cell structure F1 over mixed-style training.
On ICDAR 2013, GTE achieves the highest F1 for both table detection (99.31) and cell structure recognition (93.50 without GT border, 96.24 with GT border) compared to published methods at the time of submission.
On FinTabNet, the large gap between the detection-base result (41.57 TEDS) and GTE without fine-tuning (87.14 TEDS) suggests that the cell detection architecture matters significantly in the complex financial document domain.
Limitations
- Anchor-based cell detection: Very wide cells with extreme aspect ratios and very long lines of text are often detected with boxes that are too short. The anchor configuration partially compensates, but the fundamental limitation of fixed-aspect anchors remains.
- Style classifier errors: The attribute network achieves 78.84% accuracy on ICDAR 2013 (123/156 tables). Failures occur on very small tables or tables with header-only vertical graphical lines, which are genuinely ambiguous in the training data. The paper uses a fallback strategy based on sampling standard deviation to mitigate this.
- Cluster algorithm sensitivity: The post-processing structure recovery relies on heuristics (capitalization-based merge, text-line alignment) that may fail on languages or formatting conventions not represented in training.
- PubTabNet TEDS comparability: GTE reports TEDS of 93.01 on PubTabNet validation using PDF images and a modified evaluation script that ignores bold/italic styling tags. This is not directly comparable to the original EDD score of 88.38, which used rasterized table images and includes styling in the metric.
- FinTabNet statistics not in the paper: The paper does not report table counts or split sizes. The commonly cited ~113k figure comes from the IBM Data Asset eXchange release metadata, not from the paper itself. The original IBM distribution channel has since been decommissioned.
Reproducibility
Models
- GTE-Table: RetinaNet with ResNet-50-FPN. Modified anchor configuration: aspect ratios {0.1, 0.25} added per feature map for wide tables. Input: $900 \times 643$.
- GTE-Cell: RetinaNet with ResNet-50-FPN. Uses pyramid levels P3 and P5 (P4 skipped). Anchor sizes {0.5, 0.7, 1, 1.2, 1.6} times standard. Input: $965 \times 1350$.
- Attribute classifier: ResNet-50 with binary linear head.
- No model weights are publicly released.
Algorithms
- All detection models initialized from MS COCO pretrained weights.
- Constraint loss hyperparameters: $\mu_1 = -5$, $\mu_2 = 5$, $\mu_3 = 10$, $\mu_4 = -10$, $\alpha = 1/8$, $\gamma_1 = 0.1$.
- Inference ranking hyperparameters: $\mu_5 = -20$, $\mu_6 = \{0.25(x_2 - x_1),\ 0.25(y_2 - y_1)\}$, $\gamma_2 = 0.1$, $\epsilon = 0.1$, $\delta = 25$.
- At test time, each page is run at multiple zoom scales to improve detection of unusually small or large tables.
- K-means cluster algorithm for structure recovery; full pseudocode in supplementary material (Algorithm 1).
- Augmentation schemes: “no lines” erases existing graphical lines; “full boundaries” adds lines at median inter-cell positions.
- Optimizer, learning rate, batch size, and number of training steps are not reported in the paper.
Data
- FinTabNet: ~113k tables from annual reports of S&P 500 companies (SEC 10-K filings), per the original IBM release metadata (not stated in the paper). Annotations use the same automated pipeline as PubTabNet (PDF-XML pixel matching). HTML annotation format. License: CDLA-Permissive-1.0. The original IBM Data Asset eXchange distribution channel is no longer available (404 as of early 2026). The closest accessible version is FinTabNet.c (a canonicalized variant by Brandon Smock, used by the Microsoft Table Transformer project) at https://huggingface.co/datasets/bsmock/FinTabNet.c. Exact split sizes are not reported in the paper.
- PubTabNet: Used for pretraining and cross-domain evaluation. Accessed via the original IBM release.
- TableBank: Used for table detection pretraining only (no cell annotations available).
- ICDAR 2013: 96/156 train/test; primary fine-tuning target.
Evaluation
- ICDAR 2013 official evaluation script (character-level metrics).
- TEDS (Tree-Edit-Distance Similarity, Zhong et al., arXiv:1911.10683) computed on validation PDF images, not the official test set. GTE modifies the evaluation script to ignore bold/italic styling tags, making results not directly comparable to the original PubTabNet EDD baseline (88.38 TEDS).
- No error bars, significance tests, or multi-run statistics reported.
- FinTabNet results reported with and without fine-tuning, but without per-split breakdown.
Hardware
- Hardware specifications, GPU types, and training time are not reported.
BibTeX
@inproceedings{zheng2021global,
title={Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context},
author={Zheng, Xinyi and Burdick, Douglas and Popa, Lucian and Zhong, Xu and Wang, Nancy Xin Ru},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2021}
}
TableNet: End-to-End Joint Table Detection and Structure Recognition
TL;DR
TableNet proposes a multi-task encoder-decoder architecture based on VGG-19 that simultaneously predicts table and column regions in scanned document images through two separate decoder branches trained jointly. Rule-based post-processing derives row boundaries from OCR word positions and the predicted column masks. Experiments on ICDAR 2013 show results broadly comparable to DeepDeSRT. The authors also release Marmot Extended, a column-annotated version of the Marmot dataset.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$: The central contribution is the TableNet architecture, a multi-task FCN that couples table detection and column segmentation within a single shared encoder. The paper focuses on architectural design choices, training schedules, and baseline comparisons on ICDAR 2013.
Secondary: $\Psi_{\text{Resource}}$: The authors manually annotated column bounding boxes for 509 English documents from the Marmot dataset and release this augmented split (Marmot Extended) to the community.
What is the motivation?
Most approaches to extracting structured data from scanned documents split the pipeline into two independent problems: table detection (locating the table region) and table structure recognition (segmenting rows and columns within it). Prior deep learning methods such as DeepDeSRT [Schreiber et al., 2017] train separate models on separate datasets for each sub-problem. This independence is at odds with the intuition that tables and their columns share spatial context: column regions are subsets of table regions, so evidence for columns should reinforce the detection of the enclosing table and vice versa. The authors argue that coupling both tasks in a single model can improve accuracy through shared low-level features and mutually reinforcing gradients.
A secondary motivation is data scarcity. Few large, publicly available datasets provide both table detection and structure annotations. The largest at the time, Marmot, had only table-level bounding boxes, not column-level annotations, limiting the ability to train structure-aware models directly.
What is the novelty?
TableNet adapts the Fully Convolutional Network (FCN) architecture from Long et al. (2014) to a dual-task setting. The encoder is the convolutional stack of VGG-19 (conv1 through pool5), pretrained on ImageNet. The fully connected layers of VGG-19 are discarded and replaced with a pair of $1 \times 1$ convolutional layers (conv6) followed by dropout (rate 0.8). After this shared bottleneck, two independent decoder branches diverge:
- Table branch: a $1 \times 1$ conv layer (conv7_table) followed by fractionally strided (transposed) convolutions that upsample the feature map back to the original image resolution, with skip connections from pool4 and pool3.
- Column branch: a $1 \times 1$ conv layer (conv7_column) with ReLU and dropout, then another $1 \times 1$ conv (conv8_column), followed by the same upsampling scheme with pool4 and pool3 skip connections.
Each branch produces a binary pixel-wise prediction mask. The joint training objective is:
$$\mathcal{L} = \mathcal{L}_{\text{table}} + \mathcal{L}_{\text{col}}$$
where both $\mathcal{L}_{\text{table}}$ and $\mathcal{L}_{\text{col}}$ are pixel-wise binary cross-entropy losses over the table and column segmentation masks respectively. The shared encoder receives gradient updates from both branches simultaneously.
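A minimal sketch of the joint objective, assuming each branch emits per-pixel logits at input resolution; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def tablenet_loss(table_logits, column_logits, table_mask, column_mask):
    """Sum of pixel-wise binary cross-entropies for the two branches.
    *_logits: (B, 1, H, W) raw branch outputs; *_mask: binary targets."""
    l_table = F.binary_cross_entropy_with_logits(table_logits, table_mask)
    l_col = F.binary_cross_entropy_with_logits(column_logits, column_mask)
    # The shared VGG-19 encoder receives gradients from both terms.
    return l_table + l_col
```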
An optional semantic feature augmentation step color-codes word bounding boxes by inferred data type (e.g., numeric strings get one color, alphabetic strings another). These color-coded overlays are pixel-wise added to the original image before feeding to the network. The authors report that this provides modest performance gains.
Row boundaries are not predicted by the model. Instead, after column masks are extracted, a rule-based system uses OCR word positions (from Tesseract) and three heuristics: Radon-transform detection of horizontal line demarcations, multi-line row detection based on column fill patterns, and single-line assignment when no demarcations are present.
What experiments were performed?
Datasets used:
- Training: 509 English documents from the Marmot dataset, with manually added column annotations (Marmot Extended).
- Evaluation: ICDAR 2013 table competition dataset, tested on 34 documents (the same split used by DeepDeSRT).
- Fine-tuning (Experiment 3): The Marmot-trained model is fine-tuned on the ICDAR 2013 train split to match DeepDeSRT’s experimental setup.
Baselines: DeepDeSRT [Schreiber et al., ICDAR 2017] and Tran et al. [2015].
Metrics:
- Table detection: Precision, Recall, and F1 based on completeness and purity over character-level sub-objects.
- Table structure recognition and data extraction: Precision, Recall, and F1 over cell adjacency relations (horizontal and vertical neighbors), with normalized cell text.
Training schedule: Initial 500 iterations train table and column branches in a 2:1 ratio (batch size 2); once both losses are comparable, training switches to a 1:1 ratio for 5,000 total iterations. Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$), learning rate $10^{-4}$. Threshold for pixel-wise prediction at inference: 0.99.
What are the outcomes/conclusions?
Results on ICDAR 2013 are as follows.
Table detection (Table I):
| Model | Recall | Precision | F1 |
|---|---|---|---|
| TableNet + Sem. (fine-tuned) | 0.9628 | 0.9697 | 0.9662 |
| TableNet + Sem. | 0.9621 | 0.9547 | 0.9583 |
| TableNet | 0.9501 | 0.9547 | 0.9547 |
| DeepDeSRT | 0.9615 | 0.9740 | 0.9677 |
| Tran et al. | 0.9636 | 0.9521 | 0.9578 |
Table structure recognition and data extraction (Table II):
| Model | Recall | Precision | F1 |
|---|---|---|---|
| TableNet + Sem. (fine-tuned) | 0.9001 | 0.9307 | 0.9151 |
| TableNet + Sem. | 0.8994 | 0.9255 | 0.9122 |
| TableNet | 0.8987 | 0.9215 | 0.9098 |
| DeepDeSRT | 0.8736 | 0.9593 | 0.9144 |
TableNet with semantic features and fine-tuning achieves F1 of 0.9662 on table detection and 0.9151 on structure recognition, compared to DeepDeSRT’s 0.9677 and 0.9144 respectively. The authors characterize the results as “comparable” rather than clearly superior. The semantic feature augmentation provides a small, consistent improvement across all variants. Fine-tuning on ICDAR domain data provides a further modest boost.
Inference time is reported as 0.3765 seconds per document image on the test hardware; no comparison to DeepDeSRT is possible since that model was not publicly released.
The authors acknowledge that the row detection step is entirely rule-based, and plan to add a third model branch for row segmentation in future work. They also note that more abstract semantic types (e.g., currency, city, country) might further improve performance.
Notable limitations not fully discussed by the authors: (1) the evaluation covers only ICDAR 2013, a relatively small benchmark of 34 test images; (2) the comparison with DeepDeSRT is partially asymmetric, since DeepDeSRT was trained directly on ICDAR data while TableNet is only fine-tuned on it; (3) the size of the VGG-19 base means the model is not lightweight, and memory requirements at training time are not fully detailed.
Reproducibility
Models
- Architecture: VGG-19 encoder (conv1 through pool5) plus two $1 \times 1$ conv layers (conv6 with dropout), two decoder branches (table and column), each using transposed convolutions with pool4/pool3 skip connections. No parameter count reported.
- Pretrained weights: VGG-19 initialized from ImageNet pretraining (via standard sources, e.g., Keras/TensorFlow model zoo).
- Initialization/freezing: The paper does not state whether the VGG-19 encoder weights are frozen or updated during joint training. The standard FCN practice is to fine-tune the full network end-to-end, but this is not confirmed in the paper.
- Released weights: No model weights are released.
Algorithms
- Optimizer: Adam; $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$; learning rate $10^{-4}$.
- Batch size: 2.
- Training schedule: 500 initial iterations at 2:1 (table:column) branch ratio, then 1:1 ratio to 5000 total iterations for Marmot pre-training; 3000 fine-tuning iterations at 1:1 on ICDAR 2013.
- Loss functions: Pixel-wise binary cross-entropy for each branch ($\mathcal{L}_{\text{table}}$ and $\mathcal{L}_{\text{col}}$), summed with equal weight. No auxiliary losses.
- Training tricks: No warmup, gradient clipping, mixed precision, EMA, or curriculum learning are mentioned.
- Preprocessing: Images resized to $1024 \times 1024$; histogram equalization applied; Tesseract OCR used to extract word patches; optional color-coding by data type (pixel-wise added to original image).
- Inference threshold: 0.99 for pixel-wise table/column prediction.
- Framework: TensorFlow (version not specified).
- Code: Not publicly released.
Data
- Training: 509 English documents from Marmot Extended (column annotations added by the authors). The Marmot dataset contains 1,016 total documents (Chinese and English); only the English portion with manually added annotations was used.
- Validation split: No held-out validation set is mentioned; the paper uses Marmot Extended for training and ICDAR 2013 for evaluation only.
- Evaluation: ICDAR 2013 table competition dataset; 34-image test split matching DeepDeSRT’s setup.
- Marmot Extended availability: Released via Google Drive (see frontmatter artifacts); license not specified.
- Annotation process: Column bounding boxes were added manually by the authors for the Marmot English split; no inter-annotator agreement or quality details are reported.
Evaluation
- Table detection metric: Completeness/purity-based Precision, Recall, F1 over character-level sub-objects within each document region (standard ICDAR 2013 protocol).
- Structure recognition metric: Precision, Recall, F1 over cell adjacency relations with normalized text content.
- Statistical rigor: No error bars, confidence intervals, or multi-seed runs are reported. Results are averaged across documents but individual variances are not disclosed.
- Limitations acknowledged: Row detection is rule-based, not learned. Fine-tuning comparison with DeepDeSRT is noted as not fully controlled (different original training sets). DeepDeSRT weights not publicly available for runtime comparison.
- Liberties taken: Evaluation is restricted to ICDAR 2013 (34 test images), a small benchmark that limits generalizability claims. The model requires Tesseract OCR as an external dependency for both semantic augmentation and row extraction, meaning end-to-end performance depends on OCR quality, which is not ablated.
Hardware
- Training hardware: Intel Xeon Silver CPU (32 cores), 128 GB RAM, Tesla V100-PCIE GPU (reported as 6 GB GPU memory, though this may reflect available/allocated memory rather than total VRAM).
- Total GPU-hours: Not reported.
- Inference time: ~0.3765 seconds per document image.
- Cost/energy: Not reported.
- Local deployment: Feasible in principle given the VGG-19 base, but model weights are not released and memory requirements at training time are not fully characterized.
BibTeX
@inproceedings{paliwal2019tablenet,
title={{TableNet}: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images},
author={Paliwal, Shubham and D, Vishwanath and Rahul, Rohit and Sharma, Monika and Vig, Lovekesh},
booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
year={2019},
organization={IEEE},
doi={10.1109/ICDAR.2019.00029}
}
PubTabNet: Image-based table recognition
TL;DR
PubTabNet introduces a large-scale table recognition dataset with 568K table images and HTML ground truth extracted from PubMed Central articles. The paper proposes an encoder-dual-decoder (EDD) architecture that separates structure prediction from cell content generation, achieving 88.3% TEDS (All) versus 78.6% for the single-decoder WYGIWYS baseline. The work also introduces TEDS (tree-edit-distance-based similarity) to address known failure modes of adjacency-based evaluation.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the PubTabNet dataset: an automated annotation pipeline, curation methodology, and evaluation tooling for 568K table images with HTML ground truth.
Secondary: $\Psi_{\text{Method}}$
Introduces the encoder-dual-decoder (EDD) architecture for table recognition.
Secondary: $\Psi_{\text{Evaluation}}$
Proposes the TEDS metric for measuring table structure similarity via tree-edit distance.
The dataset release and benchmark establish the paper’s resource contribution, while the architectural and evaluation innovations provide methodological advances.
What is the motivation?
Table recognition requires reconstructing both layout structure (rows, columns, spanning cells) and cell content from images. Progress has been limited by specific gaps:
- Lack of large-scale training data: Existing datasets either provide only structure annotations without cell content or have insufficient scale for deep learning.
- Inadequate evaluation metrics: Adjacency-based metrics under-react to structural errors (misaligned row/column boundaries) and over-react to minor content variations.
The authors address both gaps with a dataset construction pipeline, a model architecture suited to the task decomposition, and an evaluation metric aligned with structural similarity.
What is the novelty?
Dataset: PubTabNet
Scale and diversity:
- 568,192 table images with HTML ground truth
- Sourced from 6,000+ journals in PubMed Central Open Access Subset
- Includes cell-type annotations (header vs. body cells)
Auto-annotation pipeline:
- Match XML-structured tables from PMCOA articles to page-rendered PDF representations
- Render table images at 72 PPI and generate HTML ground truth from XML
- Validate quality via TF-IDF cosine similarity between PDF-extracted text and XML cell text (threshold 0.9), with an additional constraint that text lengths differ by less than 10%
- Curate for learnability: remove tables with cells spanning more than 10 rows/columns, characters occurring fewer than 50 times, or math/inline-formula nodes; normalize HTML by stripping non-visual attributes and unifying header cell definitions
Design choices:
- HTML as target format enables web integration and natural tree representation for TEDS
- Header/body cell distinction supports models that need to differentiate cell roles
- Filtering for consistent structure vocabulary balances diversity with learnability
Model: Encoder-Dual-Decoder (EDD)
Architecture design:
The EDD model decomposes table recognition into two coupled generation tasks:
- Structure decoder: Predicts HTML structural tokens (tags, span attributes) via attention-based LSTM
- Cell decoder: Generates cell content tokens (character-level) when the structure decoder opens a new cell; it receives the structure decoder’s hidden state to attend to the correct cell region
Tokenization:
- Structural tokens: HTML tags (<tr>, <td>, etc.) plus colspan/rowspan attributes (vocabulary size: 32)
- Cell tokens: character-level, with inline HTML tags (e.g., <b>, <i>) treated as single tokens (vocabulary size: 281)
Training objective:
$$\mathcal{L} = \lambda \mathcal{L}_s + (1-\lambda) \mathcal{L}_c$$
where $\mathcal{L}_s$ is structure cross-entropy, $\mathcal{L}_c$ is cell content cross-entropy, and $\lambda \in [0, 1]$ balances the two losses.
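A hedged sketch of this weighting with flattened per-step logits; setting lam = 1.0 reproduces the structure-only pretraining stage described below.

```python
import torch.nn.functional as F

def edd_loss(struct_logits, struct_tgt, cell_logits, cell_tgt, lam=0.5):
    """struct_logits: (T_s, V_s) decoder outputs over structural vocabulary;
    cell_logits: (T_c, V_c) over the character-level cell vocabulary."""
    l_s = F.cross_entropy(struct_logits, struct_tgt)
    l_c = F.cross_entropy(cell_logits, cell_tgt)
    return lam * l_s + (1.0 - lam) * l_c
```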
Architectural rationale: Separating structure from content allows the model to focus on layout grammar independently of vocabulary, particularly beneficial for complex spanning patterns. Unlike other dual-decoder architectures where the decoders are independent, EDD’s cell decoder is triggered by the structure decoder and receives its hidden state, ensuring a one-to-one match between cells and content sequences.
Evaluation: TEDS Metric
Definition:
Tables are represented as HTML trees (root $\rightarrow$ thead/tbody $\rightarrow$ tr $\rightarrow$ td, with attributes colspan/rowspan/content). TEDS computes normalized tree-edit distance:
$$\text{TEDS}(T_a, T_b) = 1 - \frac{\text{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$$
Edit costs:
- Insertion/deletion: cost 1
- Substitution of non-td nodes: cost 1
- Substitution of td nodes: cost 1 if colspan or rowspan differs; otherwise, the normalized Levenshtein distance ($\in [0, 1]$) between cell content strings
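A simplified TEDS sketch using the zss (Zhang-Shasha) tree-edit-distance package, with unit costs everywhere; the full metric additionally charges normalized Levenshtein distance for matched td nodes as described above, which would require custom cost functions.

```python
from zss import Node, simple_distance

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children)

def teds(tree_a, tree_b):
    dist = simple_distance(tree_a, tree_b)  # unit-cost edit distance
    return 1.0 - dist / max(tree_size(tree_a), tree_size(tree_b))

# Toy example: a 1x2 table vs. a 1x1 table (one <td> deletion).
a = Node("table", [Node("tr", [Node("td"), Node("td")])])
b = Node("table", [Node("tr", [Node("td")])])
print(teds(a, b))  # 1 - 1/4 = 0.75
```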
Validation: Perturbation experiments demonstrate that TEDS responds proportionally to structural errors (e.g., row/column misalignment) while the adjacency relation metric under-reacts. At 90% cell-shift perturbation, adjacency F1 remains near 80% while TEDS drops by 60%. Conversely, for cell content perturbations, the adjacency metric over-reacts (dropping over 70% at the 10% perturbation level) while TEDS decreases linearly from 90% to 40% across perturbation levels 10% to 90%.
What experiments were performed?
Architecture Specifications
Encoder:
- ResNet-18 CNN backbone
- Five encoder variants tested, differing in stride and shared vs. independent final CNN layers
- Best variant (EDD-S1S1) uses stride-1 final layers and independent final convolutional layers for structure/cell decoders
Decoders:
- Single-layer LSTMs with hidden dimensions 256 (structure) and 512 (cell)
- Soft attention with hidden layer size 256
- Embedding dimensions: 16 (structural tokens) and 80 (cell tokens)
- Structure decoder operates autoregressively; cell decoder is invoked when structure decoder emits cell-opening tags
Inference:
- Beam search with beam width 3
- Structure and cell predictions are synchronized: cell decoder must complete before structure decoder continues
Data Construction Details
Source: PubMed Central Open Access Subset scientific articles with both PDF renderings and XML markup.
Pipeline steps:
- Matching: Align XML table elements to page-rendered table regions using the algorithm from Zhong et al. (PubLayNet)
- Rendering: Generate table images from PDFs at 72 PPI
- HTML generation: Convert XML markup to HTML with normalized tag vocabulary
- Quality validation: Compute TF-IDF cosine similarity (PDF-extracted text vs. XML cell text); threshold at 0.9 with length difference below 10%
- Curation: Remove tables with rare structures (cells spanning $>$10 rows/columns, characters with $<$50 occurrences), math/inline-formula nodes, or multiple merged tables
Curation rationale: The authors filter for consistent structure patterns to improve learnability, trading exhaustive diversity for training stability.
Scale and Splits
| Split | Original | Balanced (used for eval) |
|---|---|---|
| Train | 548,592 | 548,592 |
| Val | 8,910 | 10,000 (5K spanning + 5K non-spanning) |
| Test | 10,690 | 10,000 (5K spanning + 5K non-spanning) |
Training constraint: GPU memory limits require filtering the training set to 399K samples satisfying:
- Image dimensions $\leq$ 512 $\times$ 512 pixels
- Structural tokens $\leq$ 300
- Longest cell $\leq$ 100 tokens
Validation and test sets are not subject to these constraints, ensuring evaluation on full complexity.
Balanced splits rationale: Raw dev/test distributions are skewed toward simple non-spanning tables. Balanced subsets ensure adequate representation of complex spanning structures.
Training Setup
Hardware: Two NVIDIA V100 GPUs, approximately 16 days training time.
Preprocessing:
- Training images rescaled to 448 $\times$ 448 pixels for batching, with per-channel z-score normalization
Optimization:
- Two-stage training:
- Structure pretraining: $\lambda = 1$ (structure-only loss), batch size 10, learning rate 0.001 for 10 epochs then 0.0001 for 3 epochs
- Joint training: $\lambda = 0.5$ (balanced structure + content), batch size 8, learning rate 0.001 for 10 epochs then 0.0001 for 2 epochs
- Adam optimizer
Baselines
Off-the-shelf extraction tools (PDF input):
- Tabula, Traprange, Camelot, PDFPlumber (document parsing libraries that require text-based PDF)
- Adobe Acrobat Pro (tested with both PDF and high-resolution 300 PPI image input)
Model baselines:
- WYGIWYS: Single-decoder image-to-markup architecture (Deng et al.), trained on PubTabNet
- TIES: Graph neural network model evaluated on synthetic data
PubTabNet Test Results
| Input | Method | Simple TEDS (%) | Complex TEDS (%) | All TEDS (%) |
|---|---|---|---|---|
| PDF | Tabula | 78.0 | 57.8 | 67.9 |
| PDF | Traprange | 60.8 | 49.9 | 55.4 |
| PDF | Camelot | 80.0 | 66.0 | 73.0 |
| PDF | PDFPlumber | 44.9 | 35.9 | 40.4 |
| PDF | Acrobat Pro | 68.9 | 61.8 | 65.3 |
| Image | Acrobat Pro | 53.8 | 53.5 | 53.7 |
| Image | WYGIWYS | 81.7 | 75.5 | 78.6 |
| Image | EDD-S1S1 | 91.2 | 85.4 | 88.3 |
Generalization to Synthetic Data
Setup: 500K synthetic tables (420K/40K/40K train/val/test split) used to compare with TIES baseline, which lacks sufficient real-data training labels.
Results (four complexity levels C1/C2/C3/C4):
| Model | Avg TEDS | Exact Match |
|---|---|---|
| TIES | N/A | 96.9 / 94.7 / 52.9 / 68.5 |
| EDD | 99.8 / 99.8 / 99.7 / 99.7 | 99.7 / 99.9 / 97.2 / 98.0 |
Evaluation note: TIES comparison uses adjacency-based exact match without checking cell content recognition errors. For fairness, EDD’s cell recognition errors are ignored in this comparison, measuring only structural correctness.
Ablations
Encoder variants:
- Tested feature-map resolution (stride-1 vs. stride-2) and shared vs. independent final CNN layers
- EDD-S1S1 (stride-1, independent layers) selected via validation performance as the best configuration
What are the outcomes/conclusions?
Key observations:
- EDD achieves +9.7 TEDS improvement over the WYGIWYS single-decoder baseline (88.3% vs. 78.6%)
- The advantage is more pronounced on complex (spanning) tables: +9.9 TEDS (85.4% vs. 75.5%) compared to +9.5 on simple tables (91.2% vs. 81.7%)
- Camelot is the best off-the-shelf tool at 73.0% All TEDS
- Adobe Acrobat Pro’s performance drops substantially when using image input (53.7%) compared to PDF input (65.3%), illustrating the difficulty of image-only table recognition
- On synthetic data, EDD achieves near-perfect TEDS ($>$99.7%) across all complexity categories, with no significant degradation on complex structures (unlike TIES)
Error analysis: Both EDD and WYGIWYS show performance degradation as table size increases (in width, height, structural token count, or longest cell length). The authors attribute this primarily to aggressive image downsampling for batching and suggest grouping tables by size with different rescaling factors.
Limitations
Missing spatial information: PubTabNet does not include cell bounding box coordinates. The authors note this limits integration with detection-based pipelines and plan to add spatial annotations in future releases (PubTabNet 2.0.0, released July 2020, added bounding boxes for non-empty cells).
Detection not included: EDD assumes pre-cropped table images. End-to-end document processing requires coupling with a separate table detection model.
Scale sensitivity: Performance degrades on large tables. The authors suggest batching by table size to reduce aggressive downsampling, which loses fine-grained spatial detail.
Training subset constraint: GPU memory limitations require filtering to 399K samples (approximately 73% of the full training set) with size/token constraints. Full-scale training might improve performance but requires larger memory or architectural modifications.
License complexity: While annotations are permissive (CDLA-Permissive-1.0), underlying PMCOA images have mixed per-article licenses. Commercial users must audit article-level terms.
No code or weights released: Due to legal constraints, IBM does not release the EDD model code or pretrained weights. Replication requires reimplementing the architecture from the paper’s description.
Reproducibility
Models
- ResNet-18 encoder with modified final layers (EDD-S1S1: stride-1, independent layers for each decoder)
- Structure decoder: single-layer LSTM, hidden size 256, embedding size 16
- Cell decoder: single-layer LSTM, hidden size 512, embedding size 80
- Attention hidden layer size: 256 for both decoders
- No pretrained weights released due to legal constraints
Algorithms
- Two-stage training: structure pretraining ($\lambda = 1$) followed by joint training ($\lambda = 0.5$)
- Adam optimizer; stage 1: batch 10, LR 0.001 (10 epochs) then 0.0001 (3 epochs); stage 2: batch 8, LR 0.001 (10 epochs) then 0.0001 (2 epochs)
- Beam search decoding with beam width 3
- Training images rescaled to 448 $\times$ 448, per-channel z-score normalization
Data
- 568K table images from PubMed Central Open Access Subset
- Training subset filtered to 399K (image $\leq$ 512 $\times$ 512, structure tokens $\leq$ 300, longest cell $\leq$ 100)
- Balanced val/test: 10K each (5K spanning + 5K non-spanning)
- Annotations: CDLA-Permissive-1.0; images: per-article PMCOA licenses
- Test set ground truth withheld for ICDAR competition
Evaluation
- Primary metric: TEDS (tree-edit-distance-based similarity), reported as mean across test samples
- Results broken down by Simple (non-spanning) and Complex (spanning) tables
- Baselines include both PDF-input tools and image-input models
- TIES comparison on synthetic data uses structure-only exact match for fairness
Hardware
- 2x NVIDIA V100 GPUs
- Approximately 16 days total training time
- No inference latency or throughput figures reported
BibTeX
@inproceedings{zhong2020image,
title={Image-based table recognition: data, model, and evaluation},
author={Zhong, Xu and ShafieiBavani, Elaheh and Yepes, Antonio Jimeno},
booktitle={Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XII 16},
pages={564--580},
year={2020},
organization={Springer}
}
TableBank: Table Benchmark for Image-based Table Detection and Recognition
TL;DR
TableBank is a large-scale, weakly supervised dataset of 417K+ labeled tables for image-based table detection and structure recognition. The authors generate high-quality annotations automatically from Word (Office XML) and LaTeX source code, sidestepping the need for manual labeling. Baseline Faster R-CNN and image-to-text models demonstrate that scale matters but cross-domain generalization (Word vs. LaTeX) remains a challenge.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ : Introduces a large-scale benchmark dataset (TableBank) with an automated label generation pipeline and baseline models.
Secondary: $\Psi_{\text{Evaluation}}$ : Provides baseline systems and cross-domain evaluation (Word vs. LaTeX vs. mixed).
What is the motivation?
Table detection and structure recognition are central tasks in document analysis because tables encode structured information across diverse layouts. Traditional heuristics fail to generalize, while deep learning methods require far more labeled data than existing human-annotated sets provide (typically small, e.g., ~1k tables). Manual annotation is prohibitively expensive at scale.
The key insight is that many documents (Word, LaTeX) contain explicit table markup in their source code. This allows for scalable, high-quality weak supervision without human labeling.
What is the novelty?
The core innovation is the automatic generation of weak labels from source documents to create a dataset orders of magnitude larger than prior work.
1. Weak Supervision Pipeline
The authors developed a method to extract bounding boxes and structure labels by manipulating source code:
- Word: Modifying Office XML tags (`<w:tbl>`) to render tables with distinct borders.
- LaTeX: Wrapping table environments in `\fcolorbox` with distinct colors.
- Label Extraction: Recovering bounding boxes by pixel-level differencing, locating the regions rendered in each table's distinct color on the marked page (a minimal extraction sketch follows this list).
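A minimal NumPy sketch of that extraction idea: scan the render of the modified document for the marker color and take the bounding box of matching pixels. The color, tolerance, and single-box simplification are illustrative assumptions, not details from the paper.

```python
import numpy as np

def box_from_marker(page: np.ndarray, marker_rgb=(255, 0, 0), tol=10):
    """page: (H, W, 3) uint8 render of the color-marked document.
    Returns (x0, y0, x1, y1) around all marker-colored pixels, or None.
    A real pipeline would separate multiple tables per page, e.g. by
    distinct marker colors or connected components."""
    mask = np.all(np.abs(page.astype(int) - np.asarray(marker_rgb)) <= tol, axis=-1)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```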
2. Dataset Scale
This pipeline produced TableBank, utilizing documents crawled from the web and arXiv:
- Detection: 417,234 labeled tables (163k Word, 253k LaTeX).
- Structure Recognition: 145,463 instances.
- Validation: Manual spot-checks of 1,000 samples found only 5 erroneous bounding boxes.
3. Structure Recognition Formulation
The paper formulates structure recognition as an image-to-text task, predicting an HTML-like tag sequence (e.g., `<tabular>`, `<tr>`, `<td>`, `<cell_y>`, `<cell_n>`) rather than just coordinates. The vocabulary is deliberately small (12 tokens), keeping the output space tractable.
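To make the formulation concrete, a hypothetical serialization of a 2-column table with one header row and one body row might look as follows (the exact placement of `<cell_y>`/`<cell_n>` inside `<td>` is our assumption):

```
<tabular> <thead> <tr> <td> <cell_y> </td> <td> <cell_y> </td> </tr> </thead>
<tbody> <tr> <td> <cell_y> </td> <td> <cell_n> </td> </tr> </tbody> </tabular>
```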
What experiments were performed?
Baseline Models
Table Detection:
- Architecture: Faster R-CNN with ResNeXt-101 and ResNeXt-152 backbones (ImageNet pretrained).
- Training: Detectron framework (Caffe2), 4×P100 GPUs, standard synchronous SGD.
Structure Recognition:
- Architecture: Encoder-decoder image-to-text model with attention (OpenNMT).
- Output: a sequence of layout tags that encodes the table structure. Cell content is recognized separately via OCR and filled into the predicted structure heuristically (a hypothetical sketch of this fill step follows).
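One plausible version of that fill step: walk the predicted tag sequence and pour OCR'd cell strings into non-empty cells in reading order. The function and token convention below are ours, not the paper's.

```python
def fill_cells(tag_seq, ocr_texts):
    """Replace each predicted non-empty-cell token with the next OCR'd
    string, in reading order. tag_seq: list of layout tokens;
    ocr_texts: cell strings ordered left-to-right, top-to-bottom."""
    texts = iter(ocr_texts)
    return [next(texts, "") if tok == "<cell_y>" else tok for tok in tag_seq]
```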
Evaluation Metrics
The authors assess detection using area-based metrics rather than standard object detection mAP. Precision and Recall are computed via pixel-area overlap aggregation across documents (following Gilani et al., 2017):
$$ \begin{aligned} \text{Precision} &= \frac{\text{Area of Ground Truth } \cap \text{ Detected}}{\text{Area of Detected Tables}} \\ \text{Recall} &= \frac{\text{Area of Ground Truth } \cap \text{ Detected}}{\text{Area of Ground Truth Tables}} \\ \text{F1} &= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} $$
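A minimal sketch of this computation for a single page, rasterizing boxes to pixel masks (the paper aggregates intersection and area counts across documents before taking the ratios):

```python
import numpy as np

def area_prf(gt_boxes, det_boxes, page_hw):
    """gt_boxes/det_boxes: iterables of (x0, y0, x1, y1) pixel boxes;
    page_hw: (height, width). Returns (precision, recall, f1) computed
    by pixel-area overlap, not per-object matching."""
    gt = np.zeros(page_hw, dtype=bool)
    det = np.zeros(page_hw, dtype=bool)
    for x0, y0, x1, y1 in gt_boxes:
        gt[y0:y1, x0:x1] = True
    for x0, y0, x1, y1 in det_boxes:
        det[y0:y1, x0:x1] = True
    inter = np.logical_and(gt, det).sum()
    p = inter / max(det.sum(), 1)
    r = inter / max(gt.sum(), 1)
    return p, r, 2 * p * r / max(p + r, 1e-9)
```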
Structure recognition is evaluated using 4-gram BLEU on the generated tag sequence, hypothesizing that n-gram overlap correlates with structural correctness.
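Computing 4-gram BLEU over tag sequences is straightforward with NLTK; the smoothing choice below is ours, added so that short sequences do not score zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "<tabular> <tr> <td> </td> <td> </td> </tr> </tabular>".split()
hyp = "<tabular> <tr> <td> </td> </tr> </tabular>".split()

# Default weights are uniform over 1..4-grams, i.e. BLEU-4.
score = sentence_bleu([ref], hyp,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```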
What are the outcomes/conclusions?
Key Findings
- Scale matters: Deep learning baselines trained on TableBank outperform traditional methods and models trained on smaller datasets (like ICDAR 2013).
- Domain gaps exist: Models trained on Word documents perform poorly on LaTeX (and vice versa). Mixed training (Word+LaTeX) successfully improves cross-domain robustness.
- Sequence length limits: The structure recognition model struggles with complex tables. Exact-match rates drop significantly as the length of the tag sequence increases.
Limitations
- Metric Selection: The area-based precision/recall metrics are less standard than COCO-style mAP, potentially obscuring object-level errors like merging or splitting adjacent tables.
- Cross-Domain Generalization: Despite mixed training, the gap between document types remains a challenge for structure recognition.
- Task Scope: The structure recognition task is limited to layout tags; it does not solve end-to-end cell text extraction within the same model.
Reproducibility
Models
- Table Detection: Faster R-CNN with ResNeXt-101 and ResNeXt-152 backbones, pretrained on ImageNet. Implemented via the Detectron framework (Caffe2). Confidence threshold set to 90% at inference (an inference sketch follows this list).
- Table Structure Recognition: Encoder-decoder image-to-text model from OpenNMT. Output vocabulary is 12 tokens: `<tabular>`, `</tabular>`, `<thead>`, `</thead>`, `<tbody>`, `</tbody>`, `<tr>`, `</tr>`, `<td>`, `</td>`, `<cell_y>`, `<cell_n>`.
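For orientation, an inference sketch using torchvision's Faster R-CNN as a stand-in (torchvision ships a ResNet-50-FPN backbone, not the ResNeXt backbones the paper trained in Detectron/Caffe2); the 0.9 score threshold is the paper's.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in backbone: ResNet-50-FPN, not the paper's ResNeXt-101/152.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

with torch.no_grad():
    preds = model([torch.rand(3, 800, 600)])[0]  # dummy page image

keep = preds["scores"] >= 0.9  # paper's 90% confidence threshold
boxes = preds["boxes"][keep]
```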
Algorithms
- Detection training: Synchronous SGD with a mini-batch size of 16 images. Other hyperparameters use Detectron defaults.
- Structure recognition training: Learning rate of 0.1, batch size of 24. Other hyperparameters use OpenNMT defaults.
Data
- Word documents: Crawled from the internet in `.docx` format. Multi-language (English, Chinese, Japanese, Arabic, etc.).
- LaTeX documents: Sourced from arXiv bulk data access (2014 to 2018). Primarily English.
- Detection split: 415,234 training images; 2,000 sampled from each domain (Word, LaTeX) for validation (1,000) and test (1,000).
- Structure recognition split: 144,463 training instances; 500 each for validation and test per domain.
- License Note: The GitHub repository `LICENSE` file is Apache 2.0, but the README explicitly states "Our data can only be used for research purpose" and "Please DO NOT re-distribute our data." We recommend adhering to the stricter README terms for the dataset itself.
Evaluation
- Detection metric: Area-based Precision/Recall/F1 (following Gilani et al., 2017), aggregated across documents by pixel-area overlap. This is not standard COCO-style mAP; it may obscure object-level errors such as merging or splitting adjacent tables.
- Structure recognition metric: 4-gram BLEU on generated tag sequences against a single reference.
- ICDAR 2013 cross-evaluation: TableBank models also evaluated on the ICDAR 2013 table competition dataset.
Hardware
- 4$\times$ NVIDIA P100 GPUs for both detection and structure recognition baselines.
- No GPU-hours, inference latency, or cost estimates reported.
Mapping to Unified Taxonomy
TableBank is a specialized dataset focused entirely on a single Primitive: Table. It does not annotate text, figures, or hierarchy.
| TableBank Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Table | Table | Table | The only class. Includes diverse layouts (APA, grid, borderless). |
BibTeX
@inproceedings{li-etal-2020-tablebank,
title = "TableBank: Table Benchmark for Image-based Table Detection and Recognition",
author = "Li, Minghao and Cui, Lei and Huang, Shaohan and Wei, Furu and Zhou, Ming and Li, Zhoujun",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.236",
pages = "1918--1925"
}