Table Structure Recognition
Tracking models, datasets, and metrics for parsing the internal structure of tables in documents.
Table of Contents
- Overview
- TSR: Models
- TSR: Datasets
- TSR: Metrics
- Optimized Table Tokenization for Table Structure Recognition
  - TL;DR
  - What kind of paper is this?
  - Motivation: The HTML Bottleneck
  - Novelty: The 5-Token OTSL Vocabulary
  - Methodology: TableFormer Evaluation
  - Results: Efficiency and Accuracy
  - Reproducibility
  - BibTeX
- PubTables-1M
  - TL;DR
  - What kind of paper is this?
  - What is the motivation?
  - What is the novelty?
  - What experiments were performed?
  - What are the outcomes/conclusions?
  - Reproducibility
  - Reported Results (Test Set)
  - Mapping to Unified Taxonomy
  - BibTeX
- PubTabNet: Image-based table recognition
  - TL;DR
  - What kind of paper is this?
  - What is the motivation?
  - What is the novelty?
  - What experiments were performed?
  - What are the outcomes/conclusions?
  - Reproducibility
  - BibTeX
- TL;DR
- What kind of paper is this?
- What is the motivation?
- What is the novelty?
- What experiments were performed?
- What are the outcomes/conclusions?
- Reproducibility
- Mapping to Unified Taxonomy
- BibTeX
Disclaimer: This page covers models and datasets for parsing the internal grid structure of tables (rows, columns, spanning cells). For detecting where tables are on a page, see the Layout Page. For end-to-end document understanding, see the Document Understanding Page.
Overview
Table Structure Recognition (TSR) is the task of recovering the logical grid of a detected table region: identifying rows, columns, spanning cells, and (optionally) header vs. body roles. It typically operates on a pre-cropped table image produced by a layout detection model.
The field splits into two main formulations:
- Image-to-Sequence (Im2Seq): Generates a markup representation (HTML or OTSL) token by token using an encoder-decoder architecture.
- Examples: EDD (PubTabNet), TableFormer + OTSL.
- Pros: Captures complex spanning patterns naturally; amenable to beam search decoding.
- Cons: Sequence length scales with table size; attention drift on large tables.
- Object Detection: Treats rows, columns, and cells as bounding-box objects detected in a single forward pass.
- Examples: Table Transformer (DETR-based, from PubTables-1M).
- Pros: Fast; leverages standard detection tooling; produces spatial coordinates directly.
- Cons: Post-processing needed to resolve spanning cells; struggles with dense or borderless tables.
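To make the contrast concrete, here is the same one-row, two-column table expressed as each formulation's training target. This is purely an illustrative sketch: the coordinates, class labels, and token inventory are assumptions, not taken from any specific dataset.

```python
# (a) Im2Seq: a markup token sequence (HTML-style here; OTSL is an alternative)
im2seq_target = ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]

# (b) Object detection: labeled boxes in table-image pixel coordinates
detection_target = [
    {"label": "table row",    "bbox": [0, 0, 200, 40]},
    {"label": "table column", "bbox": [0, 0, 100, 40]},
    {"label": "table column", "bbox": [100, 0, 200, 40]},
]

# In the detection framing, the logical grid is recovered afterwards by
# intersecting row and column boxes:
n_rows = sum(1 for o in detection_target if o["label"] == "table row")
n_cols = sum(1 for o in detection_target if o["label"] == "table column")
n_cells = n_rows * n_cols  # 1 x 2 grid -> 2 cells
```

The post-processing in (b) is where spanning cells become awkward, which is exactly the trade-off noted above.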
TSR: Models
| Date | Name | Type | Key Contribution | Notes | License |
|---|---|---|---|---|---|
| 2023-05 | OTSL | Model | Optimized tokenization | 5-token language with backward-only syntax rules for efficient autoregressive TSR. | - |
| 2020-10 | Table Transformer | Model | DETR for Tables | Microsoft. The standard baseline model. | MIT |
TSR: Datasets
| Date | Name | Pages/Tables | Domain | Key Contribution | Notes | License |
|---|---|---|---|---|---|---|
| 2021-09 | PubTables-1M | ~1M tables | Scientific | Canonicalized structure | Canonicalized annotations; fixes PubTabNet oversegmentation. | CDLA-Perm-2.0 |
| 2020-12 | FinTabNet | 113k tables | Financial | Complex financial tables | Complex, dense financial data. | CDLA-Perm-1.0 |
| 2019-11 | PubTabNet | 568k tables | Scientific | Image-to-HTML | First large-scale TSR dataset. Introduces the TEDS metric. | CDLA-Sharing-1.0 |
| 2019-09 | SciTSR | 15k tables | Scientific | Structure | GitHub. Complex spans. | Apache-2.0 |
| 2019-03 | TableBank | 417k tables | Diverse | Detection + Recognition | Weak supervision from Word/LaTeX sources. | Apache-2.0 |
| 2012-03 | Marmot | 2,000 pages | Chinese e-books + English scientific | Table detection | Fang et al., DAS 2012. ~1:1 Chinese/English split; 50% of pages contain tables, 50% are hard negatives. Custom XML format. No official splits. From PKU. | Research-only |
TSR: Metrics
| Metric | What it measures | Notes |
|---|---|---|
| TEDS (Tree Edit Distance) | Similarity between predicted HTML structure and ground truth. | Standard for PubTabNet/FinTabNet. Introduced by PubTabNet. |
| GriTS (Grid Table Similarity) | Grid topology correctness (row/col spanning alignment). | More robust to empty cell variations. Introduced by PubTables-1M. |
Optimized Table Tokenization for Table Structure Recognition
TL;DR
OTSL replaces HTML with a 5-token language for autoregressive table structure recognition. Its backward-only syntax rules enable on-the-fly validation, cut sequence length roughly in half, and yield consistent improvements in both accuracy and inference speed across three public benchmarks when evaluated with the TableFormer architecture.
What kind of paper is this?
OTSL (Optimized Table Structure Language) proposes a tokenization language with syntactic constraints for table structure recognition.
- Dominant Basis: $\Psi_{\text{Method}}$
- The paper proposes a new representation language (OTSL) to replace HTML in Image-to-Sequence (Im2Seq) models. It redesigns the output vocabulary to enforce structural validity and improve efficiency.
- Secondary Basis: $\Psi_{\text{Evaluation}}$
- It systematically compares HTML versus OTSL representations across multiple datasets (PubTabNet, FinTabNet, PubTables-1M) and model configurations, focusing on efficiency (latency) and accuracy (TEDS, mAP).
Motivation: The HTML Bottleneck
Image-to-Markup-Sequence (Im2Seq) table structure recognition typically reuses general-purpose HTML tokenization. However, HTML was not designed for autoregressive decoding efficiency and presents several challenges for neural models:
- Large Vocabulary: Requires at least 28 tokens to cover common rowspan/colspan attributes. The skewed token frequency distribution complicates learning.
- Variable Row Lengths: Rows with complex spanning produce longer token sequences. This variance makes positional encoding and attention mechanisms less effective.
- Late Error Detection: Invalid HTML outputs are difficult to detect early during generation. Partial sequences often violate structural consistency (e.g., missing closing tags) but remain syntactically valid markup until the end.
- Attention Drift: Long sequences on large tables cause output misalignment. This is particularly damaging for bounding box predictions in later rows.
Novelty: The 5-Token OTSL Vocabulary
The core innovation is OTSL, a minimal vocabulary representing a table as a rectangular grid.
1. Minimal Vocabulary
The language reduces the problem to just 5 tokens:
- C: New cell (anchor for cell region top-left).
- L: Merge with left neighbor (horizontal span continuation).
- U: Merge with upper neighbor (vertical span continuation).
- X: Merge with both left and upper (2D span interior).
- NL: End-of-row marker.
2. Syntactic Constraints
The representation enforces specific structure via backward-only syntax rules. Each token can be validated using only previously generated tokens, enabling incremental constraint enforcement during decoding:
- Left neighbor of L must be C or L.
- Upper neighbor of U must be C or U.
- Left neighbor of X must be U or X; upper neighbor must be L or X.
- First row allows only C and L.
- First column allows only C and U.
- All rows have equal length and are terminated by NL.
3. Error Mitigation
Invalid token predictions signal decoding errors immediately. The authors propose a heuristic that replaces the highest-confidence invalid token with the next-highest valid candidate until syntax rules are satisfied.
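A minimal sketch of how the backward-only rules and the replacement heuristic can be implemented. The function names and the per-step probability interface below are illustrative, not from the paper; note that the first-row and first-column restrictions fall out automatically from the missing-neighbor checks.

```python
def is_valid(grid, row, col, tok):
    """Backward-only OTSL syntax check: a token is judged using only
    already-decoded neighbors. A missing (None) neighbor automatically
    forbids L/X in the first column and U/X in the first row."""
    left = grid[row][col - 1] if col > 0 else None
    up = grid[row - 1][col] if row > 0 and col < len(grid[row - 1]) else None
    if tok == "C":
        return True                      # a new cell is always legal
    if tok == "L":                       # horizontal span continuation
        return left in ("C", "L")
    if tok == "U":                       # vertical span continuation
        return up in ("C", "U")
    if tok == "X":                       # 2D span interior
        return left in ("U", "X") and up in ("L", "X")
    return False                         # "NL" is handled by the caller

def decode(prob_steps):
    """Greedy decoding with the mitigation heuristic: if the
    highest-confidence token is syntactically invalid, fall back to the
    next-highest valid candidate. (Equal row lengths are not enforced
    in this sketch.)"""
    grid = [[]]
    for probs in prob_steps:
        row = len(grid) - 1
        for tok in sorted(probs, key=probs.get, reverse=True):
            if tok == "NL":
                grid.append([])          # close the current row
                break
            if is_valid(grid, row, len(grid[row]), tok):
                grid[row].append(tok)
                break
    return [r for r in grid if r]
```

For example, a step that ranks `U` highest in the first row falls through to the next-ranked token that satisfies the rules.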
Methodology: TableFormer Evaluation
The authors evaluate OTSL using the TableFormer architecture, an encoder-decoder transformer with separate decoders for structure (HTML/OTSL) and cell bounding boxes.
Experiments
- Hyperparameter Sweep: Compared HTML vs. OTSL on PubTabNet with varying encoder/decoder depths (2-6 layers).
- Cross-Dataset Evaluation: Validated the best configuration (6 encoder, 6 decoder, 8 heads) on three major benchmarks:
- PubTabNet: 395K samples (Scientific).
- FinTabNet: 113K samples (Financial).
- PubTables-1M: ~1M samples (Scientific/Digital).
Metrics
- TEDS (Tree Edit Distance): Structural accuracy, reported separately for simple/complex/all tables. OTSL outputs are converted back to HTML for fair comparison.
- mAP@0.75: Mean Average Precision at 0.75 IoU threshold for cell bounding boxes.
- Inference Time: Measured on a single-core CPU (AMD EPYC 7763 @ 2.45 GHz).
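Because OTSL outputs are scored after conversion back to HTML, a grid-to-markup conversion is required. The sketch below is a hypothetical implementation (function name and bare structure-only output are my own), assuming a rectangular OTSL token grid: each `C` opens a cell, and its spans are recovered from the `L` run to its right and the `U` run below it.

```python
def otsl_to_html(rows):
    """Convert a rectangular OTSL grid to minimal HTML structure tags
    (no cell content)."""
    nrows, ncols = len(rows), len(rows[0])
    html = ["<table>"]
    for r in range(nrows):
        html.append("<tr>")
        for c in range(ncols):
            if rows[r][c] != "C":
                continue  # position is covered by a span opened earlier
            colspan = 1
            while c + colspan < ncols and rows[r][c + colspan] == "L":
                colspan += 1
            rowspan = 1
            while r + rowspan < nrows and rows[r + rowspan][c] == "U":
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            html.append(f"<td{attrs}></td>")
        html.append("</tr>")
    html.append("</table>")
    return "".join(html)
```

Rows fully covered by rowspans come out as empty `<tr></tr>` pairs, matching how HTML represents vertically merged cells.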
Results: Efficiency and Accuracy
The results suggest that optimizing the representation yields consistent improvements without architectural changes.
1. Speed
OTSL achieves approximately $2\times$ inference speedup across configurations, primarily due to shorter sequence lengths. The authors report that OTSL reduces sequence length to roughly half that of HTML on average (e.g., 30 tokens vs. 55 for the example in Figure 1).
- PubTabNet: 2.73s (OTSL) vs. 5.39s (HTML).
- PubTables-1M: 1.79s (OTSL) vs. 3.26s (HTML).
2. Accuracy
- PubTabNet: Similar all-TEDS (0.955); improved mAP (0.88 vs. 0.857).
- FinTabNet: Large gains observed: all-TEDS 0.959 vs. 0.920; mAP 0.862 vs. 0.722.
- PubTables-1M: Gains on both metrics: all-TEDS 0.977 vs. 0.966.
Limitations
- Syntactic vs. Structural: While OTSL guarantees a valid grid, it does not guarantee the correct grid. A valid OTSL sequence can still represent an incorrect table structure.
- Heuristic Reliance: The token replacement strategy is a heuristic. The paper does not compare this against formal constrained decoding or beam search.
- Architecture Specificity: The evidence is limited to TableFormer. It is unclear if these gains transfer to graph-based or object-detection-based TSR pipelines.
Reproducibility
The work is partially reproducible. The evaluation datasets are publicly available, but the authors do not release code, model weights, or the OTSL-format dataset conversions (promised in the paper but no public link is provided). Reimplementation requires building both the OTSL tokenization logic and the TableFormer architecture from the paper description alone.
Models
- The architecture used is TableFormer, an encoder-decoder transformer with separate decoders for structure tokens and cell bounding boxes.
- The best configuration uses 6 encoder layers, 6 decoder layers, and 8 attention heads. Smaller configurations (4/4, 2/4, 4/2) are also evaluated.
- No parameter counts are reported. No pretrained weights are released.
Algorithms
- Training procedure details (optimizer, learning rate, batch size, epochs) are not reported in the paper.
- The error-mitigation heuristic replaces the highest-confidence invalid token with the next-highest valid candidate until OTSL syntax rules are satisfied. No comparison is made against constrained decoding or beam search alternatives.
Data
- PubTabNet: 395K samples of scientific tables, semi-automatically generated from PubMed Central.
- FinTabNet: 113K samples of financial tables.
- PubTables-1M: Approximately 1M samples of scientific/digital tables.
- Ground truth from all datasets was converted to OTSL format. The authors state these conversions will be made publicly available, but no download link is provided in the paper.
Evaluation
- TEDS (Tree Edit Distance score): Measures structural accuracy. OTSL predictions are converted back to HTML before computing TEDS for fair comparison. Reported separately for simple tables, complex tables (those with spanning cells), and all tables.
- mAP@0.75: Mean Average Precision at $0.75$ IoU threshold for cell bounding box predictions.
- Baselines are HTML-based TableFormer models with identical architecture configurations, making the comparison controlled.
- No error bars, significance tests, or multi-run statistics are reported.
Hardware
- Inference: All timing results measured on a single core of an AMD EPYC 7763 CPU @ 2.45 GHz.
- Training: GPU type, count, and training duration are not reported.
BibTeX
@inproceedings{lysak2023optimized,
title={Optimized Table Tokenization for Table Structure Recognition},
author={Lysak, Maksym and Nassar, Ahmed and Livathinos, Nikolaos and Auer, Christoph and Staar, Peter},
booktitle={Document Analysis and Recognition -- ICDAR 2023},
pages={37--50},
year={2023},
publisher={Springer},
doi={10.1007/978-3-031-41679-8_3}
}
PubTables-1M
TL;DR
PubTables-1M is a large-scale table extraction dataset containing nearly one million tables from PubMed Central Open Access articles. The dataset introduces richer annotations (including blank cells, rows, columns, and projected row headers) and a canonicalization procedure to address oversegmentation in markup-derived ground truth, substantially improving both training and evaluation reliability for table structure recognition.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (the dataset, annotation pipeline, and quality verification procedures constitute the primary contribution).
Secondary: $\Psi_{\text{Evaluation}}$ (demonstrates how canonicalization reduces ground truth ambiguity and changes evaluation conclusions), $\Psi_{\text{Method}}$ (DETR-based baseline models for table detection and structure recognition).
What is the motivation?
Table extraction requires inferring logical structure from visual presentation; modern deep learning approaches depend on large-scale, unambiguous ground truth. Prior markup-derived datasets exhibit three key limitations:
- Incomplete spatial annotations: lack explicit row/column/cell bounding boxes; omit blank cells entirely
- Missing header information: do not label row headers or projected headers, limiting functional analysis capabilities
- Oversegmentation artifacts: spanning header cells are frequently split into multiple grid cells in source markup, creating contradictory or non-unique ground truth that degrades both training and evaluation
What is the novelty?
- Scale and coverage: nearly 1M tables supporting all three subtasks (table detection, table structure recognition, functional analysis)
- Richer annotation schema: explicit bounding boxes for rows, columns, and cells (including blank cells); projected row header labels
- Canonicalization procedure: data is processed via automated rule-based merging of oversegmented header cells to produce unique, unambiguous structure interpretations. The algorithm operates in two phases: (1) inferring header regions (column headers, projected row headers, and first-column row headers) and (2) merging adjacent cells under structural constraints derived from the Wang model’s hierarchical header-tree assumption.
- Automated quality control: verification pipeline with measurable consistency checks (text alignment quality, cell-word assignment, overlap detection)
What experiments were performed?
The authors trained object detection baselines (Faster R-CNN and DETR) for table detection and joint table structure recognition with functional analysis. Key experimental comparisons:
- Canonicalization ablation: training DETR on canonicalized vs. non-canonical annotations (DETR-NC), evaluated against both test label variants
- Noise isolation: testing the same model (DETR-NC) against canonical vs. non-canonical test labels to measure evaluation noise independent of training data effects
- Metrics: table detection uses standard object detection metrics (AP/AP50/AP75/AR); table structure recognition reports $\text{Acc}_{\text{Cont}}$ (exact table content match), adjacent cell content F-score ($\text{Adj}_{\text{Cont}}$), and GriTS (Grid Table Similarity).
GriTS formulation:
$$ \text{GriTS}_f(\mathbf{A}, \mathbf{B}) = \frac{2 \cdot \sum_{i,j} f(\tilde{\mathbf{A}}_{i,j}, \tilde{\mathbf{B}}_{i,j})}{|\mathbf{A}| + |\mathbf{B}|} $$
where $\mathbf{A}$ and $\mathbf{B}$ are the ground truth and predicted table matrices, $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ are their most similar substructures (a selection of $m$ rows and $n$ columns), and $f$ is a similarity function. Three variants measure different aspects: $\text{GriTS}_{\text{Top}}$ for cell topology, $\text{GriTS}_{\text{Cont}}$ for cell content, and $\text{GriTS}_{\text{Loc}}$ for cell location.
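A simplified sketch of the GriTS computation. It assumes the most similar substructures are the full matrices themselves; the actual metric searches over row/column subsets to find them, which this sketch omits. The topology similarity function `f_top` (exact match of span attributes) is an illustrative choice, not the paper's exact definition.

```python
def grits(A, B, f):
    """Simplified GriTS: twice the sum of cellwise similarities, divided
    by the sum of the two matrix sizes. A and B are m x n matrices of
    cell descriptors; f maps a cell pair to a similarity in [0, 1]."""
    match = sum(f(a, b) for row_a, row_b in zip(A, B)
                for a, b in zip(row_a, row_b))
    size_a = sum(len(row) for row in A)
    size_b = sum(len(row) for row in B)
    return 2 * match / (size_a + size_b)

# Illustrative topology similarity: exact match of (rowspan, colspan)
f_top = lambda a, b: 1.0 if a == b else 0.0
```

Swapping `f` for a content or location similarity yields the $\text{GriTS}_{\text{Cont}}$ and $\text{GriTS}_{\text{Loc}}$ variants under the same formula.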
What are the outcomes/conclusions?
Outcomes
Canonicalized ground truth materially changes experimental conclusions. For complex tables, exact-match table content accuracy ($\text{Acc}_{\text{Cont}}$) jumps from 0.5360 (DETR-NC evaluated on non-canonical test labels) to 0.6944 (DETR evaluated on canonical test labels). This comparison conflates two effects (improved training data and cleaner test labels), but the authors disentangle them by also evaluating DETR-NC against canonical test labels. In that isolated comparison, DETR-NC scores 0.9349 on simple tables with canonical test labels vs. 0.8678 with non-canonical test labels, demonstrating that canonicalization alone reduces evaluation noise.
DETR achieves strong performance under the object detection framing without task-specific customizations (test AP 0.966 for table detection, 0.912 for joint structure recognition and functional analysis).
Limitations
- Domain specificity: canonicalization rules are designed around PMCOA-derived annotation characteristics; generalization to other domains may require new assumptions. The authors explicitly note this in the paper.
- Single-page scope: multi-page tables are excluded.
- Partial header inference: full row-header structure is outside scope; coverage limited to projected row headers and first-column structure.
- No cross-dataset evaluation: all experiments use PubTables-1M only; the paper does not test whether models trained on PubTables-1M generalize to other table extraction benchmarks.
Reproducibility
Models
- Both DETR and Faster R-CNN use ResNet-18 backbones pretrained on ImageNet with early layers frozen.
- DETR architecture: 6 encoder layers, 6 decoder layers. TD model uses 15 object queries; TSR+FA model uses 125 object queries (slightly above the maximum object count in training data).
- Pre-trained weights for both detection and structure recognition models are publicly released under MIT license.
Algorithms
- All models trained for 20 epochs on a single NVIDIA Tesla V100 GPU.
- Learning rate selection: one short experiment comparing initial learning rates of 0.0002, 0.0001, and 0.00005; best validation performance after one epoch determined the choice.
- TSR+FA model: initial learning rate of 0.00005, no-object class weight of 0.4.
- Both models: step learning-rate schedule, decaying by a gamma of 0.9 with a drop interval of 1 epoch.
- Standard data augmentations (random cropping, resizing). No custom components, losses, or training procedures.
- TD input: PDF pages rendered as images with maximum length of 1000 pixels.
- TSR+FA input: table region cropped from page image with 30-pixel padding on all sides.
Data
- Source: PubMed Central Open Access (PMCOA) scientific articles.
- Alignment: Needleman-Wunsch algorithm for character-level matching between XML markup and PDF-rendered text.
- Splits: 80/10/10 at document level to prevent leakage.
- 947,642 tables total for TSR; 460,589 pages for TD.
- Quality filters discard tables with overlapping rows/columns, high edit distance (threshold 0.05), low word-cell overlap (threshold 0.9), or extreme outlier object counts (>100). Less than 0.1% of tables are discarded as outliers.
- Licensing: CDLA-Permissive-2.0 applies to Microsoft’s annotations. Underlying PMCOA articles have mixed licenses (CC0, CC BY, CC BY-NC, etc.); per-article verification is needed for commercial use.
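The character-level alignment step uses Needleman-Wunsch global alignment. A textbook dynamic-programming sketch of the score computation (the scoring constants here are illustrative defaults, not the paper's):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between sequences a and b,
    computed with the standard O(len(a) * len(b)) DP recurrence."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # align a[:i] against gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # align b[:j] against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]
```

Traceback through the same DP table (not shown) recovers the character-to-character correspondence used to map XML cell text onto PDF-rendered text.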
Evaluation
- TD metrics: AP, AP50, AP75, AR (standard COCO-style object detection).
- TSR metrics: $\text{Acc}_{\text{Cont}}$ (exact table content match), $\text{Adj}_{\text{Cont}}$ (adjacent cell content F-score from Göbel et al.), $\text{GriTS}_{\text{Top}}$, $\text{GriTS}_{\text{Cont}}$, $\text{GriTS}_{\text{Loc}}$.
- Post-prediction, bounding boxes are retroactively tightened (removing dilation padding) before scoring, so that training-time dilation does not unfairly penalize location metrics.
- Baselines compared: DETR vs. Faster R-CNN (both on canonical data), plus DETR-NC (on non-canonical data) for the canonicalization ablation.
- No error bars, significance tests, or multi-run reporting. Single training run per configuration.
- No cross-dataset evaluation or comparison with prior published results on other benchmarks.
Hardware
- Training: Single NVIDIA Tesla V100 GPU (reported in Appendix 9.1).
- Training time and GPU-hours are not reported.
- Inference latency and throughput are not reported.
Reported Results (Test Set)
Table Detection (TD):
| Model | AP | AP50 | AP75 | AR |
|---|---|---|---|---|
| DETR | 0.966 | 0.995 | 0.988 | 0.981 |
| Faster R-CNN | 0.825 | 0.985 | 0.927 | 0.866 |
Joint TSR+FA (Object Detection):
| Model | AP | AP50 | AP75 | AR |
|---|---|---|---|---|
| DETR | 0.912 | 0.971 | 0.948 | 0.942 |
| Faster R-CNN | 0.722 | 0.815 | 0.785 | 0.762 |
Table Structure Recognition (Canonical Test, All Tables):
| Model | $\text{Acc}_{\text{Cont}}$ | $\text{GriTS}_{\text{Top}}$ | $\text{GriTS}_{\text{Cont}}$ | $\text{GriTS}_{\text{Loc}}$ |
|---|---|---|---|---|
| DETR | 0.8138 | 0.9845 | 0.9846 | 0.9781 |
| DETR-NC | 0.5851 | 0.9576 | 0.9588 | 0.9449 |
| Faster R-CNN | 0.1039 | 0.8616 | 0.8538 | 0.7211 |
Mapping to Unified Taxonomy
PubTables-1M covers both Table Detection (TD) and Table Structure Recognition (TSR).
Detection (Layout Level)
For the Layout Analysis pipeline, we only care about the Detection classes:
| PubTables-1M Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Table | Table | Table | Core bounding box. |
Structure (TSR Level)
Inside the detected table, it provides detailed structure annotations (rows, columns, headers) similar to PubTabNet but with bounding boxes for cells. These are not part of the page-level Layout Taxonomy but are inputs for the OTSL pipeline.
BibTeX
@inproceedings{smock2022pubtables,
title={PubTables-1M: Towards comprehensive table extraction from unstructured documents},
author={Smock, Brandon and Pesala, Rohith and Abraham, Robin},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
pages={4634--4642},
year={2022}
}
PubTabNet: Image-based table recognition
TL;DR
PubTabNet introduces a large-scale table recognition dataset with 568K table images and HTML ground truth extracted from PubMed Central articles. The paper proposes an encoder-dual-decoder (EDD) architecture that separates structure prediction from cell content generation, achieving 88.3% TEDS (All) versus 78.6% for the single-decoder WYGIWYS baseline. The work also introduces TEDS (tree-edit-distance-based similarity) to address known failure modes of adjacency-based evaluation.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is the PubTabNet dataset: an automated annotation pipeline, curation methodology, and evaluation tooling for 568K table images with HTML ground truth.
Secondary: $\Psi_{\text{Method}}$
Introduces the encoder-dual-decoder (EDD) architecture for table recognition.
Secondary: $\Psi_{\text{Evaluation}}$
Proposes the TEDS metric for measuring table structure similarity via tree-edit distance.
The dataset release and benchmark establish the paper’s resource contribution, while the architectural and evaluation innovations provide methodological advances.
What is the motivation?
Table recognition requires reconstructing both layout structure (rows, columns, spanning cells) and cell content from images. Progress has been limited by specific gaps:
- Lack of large-scale training data: Existing datasets either provide only structure annotations without cell content or have insufficient scale for deep learning.
- Inadequate evaluation metrics: Adjacency-based metrics under-react to structural errors (misaligned row/column boundaries) and over-react to minor content variations.
The authors address both gaps with a dataset construction pipeline, a model architecture suited to the task decomposition, and an evaluation metric aligned with structural similarity.
What is the novelty?
Dataset: PubTabNet
Scale and diversity:
- 568,192 table images with HTML ground truth
- Sourced from 6,000+ journals in PubMed Central Open Access Subset
- Includes cell-type annotations (header vs. body cells)
Auto-annotation pipeline:
- Match XML-structured tables from PMCOA articles to page-rendered PDF representations
- Render table images at 72 PPI and generate HTML ground truth from XML
- Validate quality via TF-IDF cosine similarity between PDF-extracted text and XML cell text (threshold 0.9), with an additional constraint that text lengths differ by less than 10%
- Curate for learnability: remove tables with cells spanning more than 10 rows/columns, characters occurring fewer than 50 times, or math/inline-formula nodes; normalize HTML by stripping non-visual attributes and unifying header cell definitions
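The TF-IDF validation step can be sketched as follows. This is a from-scratch illustration (the paper does not specify its exact tokenization or weighting scheme), treating the PDF-extracted text and the XML cell text as a two-document corpus with whitespace tokenization:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b):
    """Cosine similarity between TF-IDF vectors of two documents, with
    the two documents themselves serving as the corpus."""
    docs = [doc_a.split(), doc_b.split()]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed idf over the 2-document corpus
    idf = {t: math.log(2 / sum(t in d for d in docs)) + 1 for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    dot = sum(vecs[0][t] * vecs[1].get(t, 0) for t in vecs[0])
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vecs]
    return dot / (norms[0] * norms[1])

def passes_quality_check(pdf_text, xml_text, sim_thresh=0.9, len_tol=0.1):
    """Thresholds from the paper: cosine similarity >= 0.9 and character
    lengths differing by less than 10%."""
    longest = max(len(pdf_text), len(xml_text), 1)
    if abs(len(pdf_text) - len(xml_text)) > len_tol * longest:
        return False
    return tfidf_cosine(pdf_text, xml_text) >= sim_thresh
```

The length gate cheaply rejects gross extraction mismatches before the vector comparison runs.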
Design choices:
- HTML as target format enables web integration and natural tree representation for TEDS
- Header/body cell distinction supports models that need to differentiate cell roles
- Filtering for consistent structure vocabulary balances diversity with learnability
Model: Encoder-Dual-Decoder (EDD)
Architecture design:
The EDD model decomposes table recognition into two coupled generation tasks:
- Structure decoder: Predicts HTML structural tokens (tags, span attributes) via attention-based LSTM
- Cell decoder: Generates cell content tokens (character-level) when the structure decoder opens a new cell; it receives the structure decoder’s hidden state to attend to the correct cell region
Tokenization:
- Structural tokens: HTML tags (`<tr>`, `<td>`, etc.) plus colspan/rowspan attributes (vocabulary size: 32)
- Cell tokens: character-level, with inline HTML tags (e.g., `<b>`, `<i>`) treated as single tokens (vocabulary size: 281)
Training objective:
$$\mathcal{L} = \lambda \mathcal{L}_s + (1-\lambda) \mathcal{L}_c$$
where $\mathcal{L}_s$ is structure cross-entropy, $\mathcal{L}_c$ is cell content cross-entropy, and $\lambda \in [0, 1]$ balances the two losses.
Architectural rationale: Separating structure from content allows the model to focus on layout grammar independently of vocabulary, particularly beneficial for complex spanning patterns. Unlike other dual-decoder architectures where the decoders are independent, EDD’s cell decoder is triggered by the structure decoder and receives its hidden state, ensuring a one-to-one match between cells and content sequences.
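A toy sketch of the joint objective (the function names and probability-list interface are illustrative); setting `lam=1.0` recovers the structure-only pretraining stage:

```python
import math

def cross_entropy(probs, target_ids):
    """Mean token-level cross-entropy over a sequence: probs[t][i] is the
    predicted probability of vocabulary item i at step t."""
    return -sum(math.log(p[i]) for p, i in zip(probs, target_ids)) / len(target_ids)

def edd_loss(struct_probs, struct_targets, cell_probs, cell_targets, lam=0.5):
    """Joint EDD objective: lambda * structure CE + (1 - lambda) * cell CE."""
    l_s = cross_entropy(struct_probs, struct_targets)
    l_c = cross_entropy(cell_probs, cell_targets)
    return lam * l_s + (1 - lam) * l_c
```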
Evaluation: TEDS Metric
Definition:
Tables are represented as HTML trees (root $\rightarrow$ thead/tbody $\rightarrow$ tr $\rightarrow$ td, with attributes colspan/rowspan/content). TEDS computes normalized tree-edit distance:
$$\text{TEDS}(T_a, T_b) = 1 - \frac{\text{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)}$$
Edit costs:
- Insertion/deletion: cost 1
- Substitution of non-td nodes: cost 1
- Substitution of td nodes: cost 1 if colspan or rowspan differs; otherwise, the normalized Levenshtein distance ($\in [0, 1]$) between cell content strings
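The td substitution cost and the final normalization follow directly from these definitions. In the sketch below, representing a td node as a `(colspan, rowspan, content)` tuple is an assumption for illustration, and the tree edit distance itself (over which these costs are minimized) is omitted:

```python
def levenshtein(a, b):
    """Standard string edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def td_substitution_cost(a, b):
    """TEDS cost of substituting one td node for another: 1 if colspan or
    rowspan differ, otherwise the normalized Levenshtein distance
    between cell contents."""
    if a[0] != b[0] or a[1] != b[1]:
        return 1.0
    denom = max(len(a[2]), len(b[2]), 1)
    return levenshtein(a[2], b[2]) / denom

def teds(edit_dist, size_a, size_b):
    """Normalized similarity from the total edit cost and tree sizes."""
    return 1 - edit_dist / max(size_a, size_b)
```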
Validation: Perturbation experiments demonstrate that TEDS responds proportionally to structural errors (e.g., row/column misalignment) while the adjacency relation metric under-reacts. At 90% cell-shift perturbation, adjacency F1 remains near 80% while TEDS drops by 60%. Conversely, for cell content perturbations, the adjacency metric over-reacts (dropping over 70% at the 10% perturbation level) while TEDS decreases linearly from 90% to 40% across perturbation levels 10% to 90%.
What experiments were performed?
Architecture Specifications
Encoder:
- ResNet-18 CNN backbone
- Five encoder variants tested, differing in stride and shared vs. independent final CNN layers
- Best variant (EDD-S1S1) uses stride-1 final layers and independent final convolutional layers for structure/cell decoders
Decoders:
- Single-layer LSTMs with hidden dimensions 256 (structure) and 512 (cell)
- Soft attention with hidden layer size 256
- Embedding dimensions: 16 (structural tokens) and 80 (cell tokens)
- Structure decoder operates autoregressively; cell decoder is invoked when structure decoder emits cell-opening tags
Inference:
- Beam search with beam width 3
- Structure and cell predictions are synchronized: cell decoder must complete before structure decoder continues
Data Construction Details
Source: PubMed Central Open Access Subset scientific articles with both PDF renderings and XML markup.
Pipeline steps:
- Matching: Align XML table elements to page-rendered table regions using the algorithm from Zhong et al. (PubLayNet)
- Rendering: Generate table images from PDFs at 72 PPI
- HTML generation: Convert XML markup to HTML with normalized tag vocabulary
- Quality validation: Compute TF-IDF cosine similarity (PDF-extracted text vs. XML cell text); threshold at 0.9 with length difference below 10%
- Curation: Remove tables with rare structures (cells spanning $>$10 rows/columns, characters with $<$50 occurrences), math/inline-formula nodes, or multiple merged tables
Curation rationale: The authors filter for consistent structure patterns to improve learnability, trading exhaustive diversity for training stability.
Scale and Splits
| Split | Original | Balanced (used for eval) |
|---|---|---|
| Train | 548,592 | 548,592 |
| Val | 8,910 | 10,000 (5K spanning + 5K non-spanning) |
| Test | 10,690 | 10,000 (5K spanning + 5K non-spanning) |
Training constraint: GPU memory limits require filtering the training set to 399K samples satisfying:
- Image dimensions $\leq$ 512 $\times$ 512 pixels
- Structural tokens $\leq$ 300
- Longest cell $\leq$ 100 tokens
Validation and test sets are not subject to these constraints, ensuring evaluation on full complexity.
Balanced splits rationale: Raw dev/test distributions are skewed toward simple non-spanning tables. Balanced subsets ensure adequate representation of complex spanning structures.
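The training-set filter reduces to a simple predicate over the thresholds above; a sketch with hypothetical parameter names:

```python
def in_training_subset(img_w, img_h, n_struct_tokens, longest_cell_tokens):
    """Memory-driven training filter: images at most 512 x 512 pixels,
    at most 300 structural tokens, and no cell longer than 100 tokens
    (thresholds as reported in the paper)."""
    return (img_w <= 512 and img_h <= 512
            and n_struct_tokens <= 300
            and longest_cell_tokens <= 100)
```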
Training Setup
Hardware: Two NVIDIA V100 GPUs, approximately 16 days training time.
Preprocessing:
- Training images rescaled to 448 $\times$ 448 pixels for batching, with per-channel z-score normalization
Optimization:
- Two-stage training:
- Structure pretraining: $\lambda = 1$ (structure-only loss), batch size 10, learning rate 0.001 for 10 epochs then 0.0001 for 3 epochs
- Joint training: $\lambda = 0.5$ (balanced structure + content), batch size 8, learning rate 0.001 for 10 epochs then 0.0001 for 2 epochs
- Adam optimizer
Baselines
Off-the-shelf extraction tools (PDF input):
- Tabula, Traprange, Camelot, PDFPlumber (document parsing libraries that require text-based PDF)
- Adobe Acrobat Pro (tested with both PDF and high-resolution 300 PPI image input)
Model baselines:
- WYGIWYS: Single-decoder image-to-markup architecture (Deng et al.), trained on PubTabNet
- TIES: Graph neural network model evaluated on synthetic data
PubTabNet Test Results
| Input | Method | Simple TEDS (%) | Complex TEDS (%) | All TEDS (%) |
|---|---|---|---|---|
| PDF | Tabula | 78.0 | 57.8 | 67.9 |
| PDF | Traprange | 60.8 | 49.9 | 55.4 |
| PDF | Camelot | 80.0 | 66.0 | 73.0 |
| PDF | PDFPlumber | 44.9 | 35.9 | 40.4 |
| PDF | Acrobat Pro | 68.9 | 61.8 | 65.3 |
| Image | Acrobat Pro | 53.8 | 53.5 | 53.7 |
| Image | WYGIWYS | 81.7 | 75.5 | 78.6 |
| Image | EDD-S1S1 | 91.2 | 85.4 | 88.3 |
Generalization to Synthetic Data
Setup: 500K synthetic tables (420K/40K/40K train/val/test split) used to compare with TIES baseline, which lacks sufficient real-data training labels.
Results (four complexity levels C1/C2/C3/C4):
| Model | TEDS (C1 / C2 / C3 / C4) | Exact Match (C1 / C2 / C3 / C4) |
|---|---|---|
| TIES | N/A | 96.9 / 94.7 / 52.9 / 68.5 |
| EDD | 99.8 / 99.8 / 99.7 / 99.7 | 99.7 / 99.9 / 97.2 / 98.0 |
Evaluation note: TIES comparison uses adjacency-based exact match without checking cell content recognition errors. For fairness, EDD’s cell recognition errors are ignored in this comparison, measuring only structural correctness.
Ablations
Encoder variants:
- Tested feature-map resolution (stride-1 vs. stride-2) and shared vs. independent final CNN layers
- EDD-S1S1 (stride-1, independent layers) selected via validation performance as the best configuration
What are the outcomes/conclusions?
Key observations:
- EDD achieves +9.7 TEDS improvement over the WYGIWYS single-decoder baseline (88.3% vs. 78.6%)
- The advantage is more pronounced on complex (spanning) tables: +9.9 TEDS (85.4% vs. 75.5%) compared to +9.5 on simple tables (91.2% vs. 81.7%)
- Camelot is the best off-the-shelf tool at 73.0% All TEDS
- Adobe Acrobat Pro’s performance drops substantially when using image input (53.7%) compared to PDF input (65.3%), illustrating the difficulty of image-only table recognition
- On synthetic data, EDD achieves near-perfect TEDS ($>$99.7%) across all complexity categories, with no significant degradation on complex structures (unlike TIES)
Error analysis: Both EDD and WYGIWYS show performance degradation as table size increases (in width, height, structural token count, or longest cell length). The authors attribute this primarily to aggressive image downsampling for batching and suggest grouping tables by size with different rescaling factors.
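The suggested remedy can be sketched as a size-bucketed rescaling policy (the bucket thresholds below are hypothetical, not from the paper):

```python
def rescale_factor_for(width: int, height: int) -> int:
    """Pick a target square resolution per size bucket (hypothetical thresholds).

    The idea from the error analysis: instead of rescaling every table to one
    fixed resolution, group tables by size so large tables are downsampled
    less aggressively relative to their content.
    """
    longest = max(width, height)
    if longest <= 448:
        return 448        # small tables: default resolution
    elif longest <= 768:
        return 640        # medium tables
    return 896            # large tables keep more spatial detail

assert rescale_factor_for(300, 420) == 448
assert rescale_factor_for(700, 500) == 640
```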
Limitations
Missing spatial information: PubTabNet does not include cell bounding box coordinates. The authors note this limits integration with detection-based pipelines and plan to add spatial annotations in future releases (PubTabNet 2.0.0, released July 2020, added bounding boxes for non-empty cells).
Detection not included: EDD assumes pre-cropped table images. End-to-end document processing requires coupling with a separate table detection model.
Scale sensitivity: Performance degrades on large tables. The authors suggest batching by table size to reduce aggressive downsampling, which loses fine-grained spatial detail.
Training subset constraint: GPU memory limitations require filtering to 399K samples (approximately 73% of the full training set) with size/token constraints. Full-scale training might improve performance but requires larger memory or architectural modifications.
License complexity: While annotations are permissive (CDLA-Permissive-1.0), underlying PMCOA images have mixed per-article licenses. Commercial users must audit article-level terms.
No code or weights released: Due to legal constraints, IBM does not release the EDD model code or pretrained weights. Replication requires reimplementing the architecture from the paper’s description.
Reproducibility
Models
- ResNet-18 encoder with modified final layers (EDD-S1S1: stride-1, independent layers for each decoder)
- Structure decoder: single-layer LSTM, hidden size 256, embedding size 16
- Cell decoder: single-layer LSTM, hidden size 512, embedding size 80
- Attention hidden layer size: 256 for both decoders
- No pretrained weights released due to legal constraints
Algorithms
- Two-stage training: structure pretraining ($\lambda = 1$) followed by joint training ($\lambda = 0.5$)
- Adam optimizer; stage 1: batch 10, LR 0.001 (10 epochs) then 0.0001 (3 epochs); stage 2: batch 8, LR 0.001 (10 epochs) then 0.0001 (2 epochs)
- Beam search decoding with beam width 3
- Training images rescaled to 448 $\times$ 448, per-channel z-score normalization
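The width-3 beam search can be sketched generically; this is a toy decoder with an illustrative scoring function, not the EDD implementation:

```python
import math

def beam_search(step_fn, bos, eos, beam_width=3, max_len=10):
    """Minimal beam-search decoder (toy sketch).

    step_fn(prefix) must return a list of (token, log_prob) candidates for
    the next position. Beams are ranked by total log-probability.
    """
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:             # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in step_fn(seq):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda b: b[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                     # highest-scoring sequence

# Toy "model": prefers <td> for the first steps, then ends the sequence.
def toy_step(prefix):
    if len(prefix) < 3:
        return [("<td>", math.log(0.7)), ("</tr>", math.log(0.3))]
    return [("</s>", 0.0)]

print(beam_search(toy_step, bos="<s>", eos="</s>"))
# -> ['<s>', '<td>', '<td>', '</s>']
```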
Data
- 568K table images from PubMed Central Open Access Subset
- Training subset filtered to 399K (image $\leq$ 512 $\times$ 512, structure tokens $\leq$ 300, longest cell $\leq$ 100)
- Balanced val/test: 10K each (5K spanning + 5K non-spanning)
- Annotations: CDLA-Permissive-1.0; images: per-article PMCOA licenses
- Test set ground truth withheld for ICDAR competition
Evaluation
- Primary metric: TEDS (tree-edit-distance-based similarity), reported as mean across test samples
- Results broken down by Simple (non-spanning) and Complex (spanning) tables
- Baselines include both PDF-input tools and image-input models
- TIES comparison on synthetic data uses structure-only exact match for fairness
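The TEDS computation can be illustrated with a naive recursive tree edit distance (an exponential-time sketch for toy trees; the actual metric uses the Zhang-Shasha dynamic program and also compares cell content):

```python
def tree_size(t):
    label, children = t
    return 1 + sum(tree_size(c) for c in children)

def forest_dist(f1, f2):
    """Plain recursive edit distance between ordered forests of
    (label, children) tuples, with unit insert/delete/relabel costs."""
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(tree_size(t) for t in f2)
    if not f2:
        return sum(tree_size(t) for t in f1)
    v, w = f1[-1], f2[-1]
    return min(
        forest_dist(f1[:-1] + list(v[1]), f2) + 1,        # delete root of v
        forest_dist(f1, f2[:-1] + list(w[1])) + 1,        # insert root of w
        forest_dist(f1[:-1], f2[:-1])
        + forest_dist(list(v[1]), list(w[1]))
        + (0 if v[0] == w[0] else 1),                      # match / relabel
    )

def teds(t1, t2):
    """Tree-edit-distance similarity: 1 - TED / max(|T1|, |T2|)."""
    return 1 - forest_dist([t1], [t2]) / max(tree_size(t1), tree_size(t2))

# Two tiny "table trees": identical except one cell label.
a = ("table", [("tr", [("td", []), ("td", [])])])
b = ("table", [("tr", [("td", []), ("th", [])])])
print(teds(a, a))   # -> 1.0
print(teds(a, b))   # -> 0.75  (one relabel out of four nodes)
```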
Hardware
- 2x NVIDIA V100 GPUs
- Approximately 16 days total training time
- No inference latency or throughput figures reported
BibTeX
@inproceedings{zhong2020image,
title={Image-based table recognition: data, model, and evaluation},
author={Zhong, Xu and ShafieiBavani, Elaheh and Yepes, Antonio Jimeno},
booktitle={Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XII 16},
pages={564--580},
year={2020},
organization={Springer}
}
TableBank: Table Benchmark for Image-based Table Detection and Recognition
TL;DR
TableBank is a large-scale, weakly supervised dataset of 417K+ labeled tables for image-based table detection and structure recognition. The authors generate high-quality annotations automatically from Word (Office XML) and LaTeX source code, sidestepping the need for manual labeling. Baseline Faster R-CNN and image-to-text models demonstrate that scale matters but cross-domain generalization (Word vs. LaTeX) remains a challenge.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$: Introduces a large-scale benchmark dataset (TableBank) with an automated label generation pipeline and baseline models.
Secondary: $\Psi_{\text{Evaluation}}$: Provides baseline systems and cross-domain evaluation (Word vs. LaTeX vs. mixed).
What is the motivation?
Table detection and structure recognition are central tasks in document analysis because tables encode structured information across diverse layouts. Traditional heuristics fail to generalize, while deep learning methods require more labeled data than effectively exists in human-annotated sets (which are typically small, e.g., ~1k tables). Manual annotation is prohibitively expensive at scale.
The key insight is that many documents (Word, LaTeX) contain explicit table markup in their source code. This allows for scalable, high-quality weak supervision without human labeling.
What is the novelty?
The core innovation is the automatic generation of weak labels from source documents to create a dataset orders of magnitude larger than prior work.
1. Weak Supervision Pipeline
The authors developed a method to extract bounding boxes and structure labels by manipulating source code:
- Word: Modifying Office XML tags (`<w:tbl>`) to render tables with distinct borders.
- LaTeX: Wrapping table environments in `fcolorbox` with distinct colors.
- Label Extraction: Recovering bounding boxes by pixel-level differencing between the “marked” rendered page and the specific table color.
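The color-keyed label extraction can be sketched with a pixel mask (the marker color and array shapes here are hypothetical choices, not from the paper's pipeline):

```python
import numpy as np

def bbox_from_color(page: np.ndarray, color: tuple):
    """Recover a table bounding box from a rendered page in which the
    table was re-rendered in a distinct solid color (TableBank-style
    weak supervision).

    page: H x W x 3 uint8 image; returns (x_min, y_min, x_max, y_max),
    or None if no pixel matches the marker color.
    """
    mask = np.all(page == np.array(color, dtype=page.dtype), axis=-1)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A blank white page with a green-marked table at rows 10..19, cols 30..49.
page = np.full((100, 100, 3), 255, dtype=np.uint8)
page[10:20, 30:50] = (0, 255, 0)
print(bbox_from_color(page, (0, 255, 0)))   # -> (30, 10, 49, 19)
```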
2. Dataset Scale
This pipeline produced TableBank, utilizing documents crawled from the web and arXiv:
- Detection: 417,234 labeled tables (163k Word, 253k LaTeX).
- Structure Recognition: 145,463 instances.
- Validation: Manual spot-checks of 1,000 samples found only 5 erroneous bounding boxes.
3. Structure Recognition Formulation
The paper formulates structure recognition as an image-to-text task, predicting an HTML-like tag sequence (e.g., `<tabular>`, `<tr>`, `<td>`, `<cell_y>`, `<cell_n>`) rather than just coordinates. The vocabulary is deliberately small (12 tokens), keeping the output space tractable.
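To make the tag formulation concrete, here is a sketch that serializes a simple non-spanning grid into this vocabulary (the exact nesting and header handling are assumptions on my part, not taken from the dataset specification):

```python
def table_to_tags(n_header_rows: int, n_body_rows: int, n_cols: int,
                  non_empty=lambda r, c: True) -> list:
    """Serialize a simple grid into a TableBank-style tag sequence.

    Cells carry no text: <cell_y> marks a non-empty cell and <cell_n>
    an empty one.
    """
    def row(r):
        out = ["<tr>"]
        for c in range(n_cols):
            cell = "<cell_y>" if non_empty(r, c) else "<cell_n>"
            out += ["<td>", cell, "</td>"]
        out.append("</tr>")
        return out

    tags = ["<tabular>", "<thead>"]
    for r in range(n_header_rows):
        tags += row(r)
    tags += ["</thead>", "<tbody>"]
    for r in range(n_header_rows, n_header_rows + n_body_rows):
        tags += row(r)
    tags += ["</tbody>", "</tabular>"]
    return tags

# One header row and one body row, two columns each.
tags = table_to_tags(1, 1, 2)
```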
What experiments were performed?
Baseline Models
Table Detection:
- Architecture: Faster R-CNN with ResNeXt-101 and ResNeXt-152 backbones (ImageNet pretrained).
- Training: Detectron framework (Caffe2), 4×P100 GPUs, standard synchronous SGD.
Structure Recognition:
- Architecture: Encoder-decoder image-to-text model with attention (OpenNMT).
- Output: A sequence of layout tags representing the table structure. Cell content is recognized separately via OCR and filled into the predicted structure heuristically.
Evaluation Metrics
The authors assess detection using area-based metrics rather than standard object detection mAP. Precision and Recall are computed via pixel-area overlap aggregation across documents (following Gilani et al., 2017):
$$ \begin{aligned} \text{Precision} &= \frac{\text{Area of Ground Truth } \cap \text{ Detected}}{\text{Area of Detected Tables}} \\ \text{Recall} &= \frac{\text{Area of Ground Truth } \cap \text{ Detected}}{\text{Area of Ground Truth Tables}} \\ \text{F1} &= 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \end{aligned} $$
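The formulas translate directly to binary pixel masks; below is a minimal per-page sketch (the paper aggregates areas over all documents before dividing, which this single-page version does not show):

```python
import numpy as np

def area_prf(gt_mask: np.ndarray, det_mask: np.ndarray):
    """Area-based precision/recall/F1 over binary pixel masks, in the
    spirit of the Gilani et al. evaluation used by TableBank.
    Assumes both masks contain at least one positive pixel.
    """
    inter = np.logical_and(gt_mask, det_mask).sum()
    precision = inter / det_mask.sum()
    recall = inter / gt_mask.sum()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gt = np.zeros((100, 100), dtype=bool)
det = np.zeros((100, 100), dtype=bool)
gt[10:50, 10:50] = True        # 1600 ground-truth pixels
det[10:50, 30:70] = True       # 1600 detected pixels, half overlapping
p, r, f = area_prf(gt, det)    # -> (0.5, 0.5, 0.5)
```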
Structure recognition is evaluated using 4-gram BLEU on the generated tag sequence, hypothesizing that n-gram overlap correlates with structural correctness.
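A minimal, unsmoothed 4-gram BLEU for a single candidate/reference pair looks like this (library implementations such as NLTK or sacrebleu differ in smoothing and tokenization details):

```python
import math
from collections import Counter

def bleu4(candidate, reference):
    """Sentence-level 4-gram BLEU against a single reference, without
    smoothing: geometric mean of clipped n-gram precisions times a
    brevity penalty."""
    def ngrams(seq, n):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        clipped = sum(min(count, ref[g]) for g, count in cand.items())
        if clipped == 0:
            return 0.0                     # unsmoothed: any zero precision kills the score
        precisions.append(clipped / total)

    bp = 1.0 if len(candidate) > len(reference) else \
         math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

tags = ["<tabular>", "<tr>", "<td>", "</td>", "</tr>", "</tabular>"]
assert bleu4(tags, tags) == 1.0   # identical tag sequences score 1.0
```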
What are the outcomes/conclusions?
Key Findings
- Scale matters: Deep learning baselines trained on TableBank outperform traditional methods and models trained on smaller datasets (like ICDAR 2013).
- Domain gaps exist: Models trained on Word documents perform poorly on LaTeX (and vice versa). Mixed training (Word+LaTeX) successfully improves cross-domain robustness.
- Sequence length limits: The structure recognition model struggles with complex tables. Exact-match rates drop significantly as the length of the tag sequence increases.
Limitations
- Metric Selection: The area-based precision/recall metrics are less standard than COCO-style mAP, potentially obscuring object-level errors like merging or splitting adjacent tables.
- Cross-Domain Generalization: Despite mixed training, the gap between document types remains a challenge for structure recognition.
- Task Scope: The structure recognition task is limited to layout tags; it does not solve end-to-end cell text extraction within the same model.
Reproducibility
Models
- Table Detection: Faster R-CNN with ResNeXt-101 and ResNeXt-152 backbones, pretrained on ImageNet. Implemented via the Detectron framework (Caffe2). Confidence threshold set to 90% at inference.
- Table Structure Recognition: Encoder-decoder image-to-text model from OpenNMT. Output vocabulary is 12 tokens: `<tabular>`, `</tabular>`, `<thead>`, `</thead>`, `<tbody>`, `</tbody>`, `<tr>`, `</tr>`, `<td>`, `</td>`, `<cell_y>`, `<cell_n>`.
Algorithms
- Detection training: Synchronous SGD with a mini-batch size of 16 images. Other hyperparameters use Detectron defaults.
- Structure recognition training: Learning rate of 0.1, batch size of 24. Other hyperparameters use OpenNMT defaults.
Data
- Word documents: Crawled from the internet in `.docx` format. Multi-language (English, Chinese, Japanese, Arabic, etc.).
- LaTeX documents: Sourced from arXiv bulk data access (2014 to 2018). Primarily English.
- Detection split: 415,234 training images; 2,000 sampled from each domain (Word, LaTeX) for validation (1,000) and test (1,000).
- Structure recognition split: 144,463 training instances; 500 each for validation and test per domain.
- License Note: The GitHub repository `LICENSE` file is Apache 2.0, but the README explicitly states “Our data can only be used for research purpose” and “Please DO NOT re-distribute our data.” We recommend adhering to the stricter README terms for the dataset itself.
Evaluation
- Detection metric: Area-based Precision/Recall/F1 (following Gilani et al., 2017), aggregated across documents by pixel-area overlap. This is not standard COCO-style mAP; it may obscure object-level errors such as merging or splitting adjacent tables.
- Structure recognition metric: 4-gram BLEU on generated tag sequences against a single reference.
- ICDAR 2013 cross-evaluation: TableBank models also evaluated on the ICDAR 2013 table competition dataset.
Hardware
- 4$\times$ NVIDIA P100 GPUs for both detection and structure recognition baselines.
- No GPU-hours, inference latency, or cost estimates reported.
Mapping to Unified Taxonomy
TableBank is a specialized dataset focused entirely on a single Primitive: Table. It does not concern itself with text, figures, or hierarchy.
| TableBank Class | Visual Primitive | Logical Role | Notes |
|---|---|---|---|
| Table | Table | Table | The only class. Includes diverse layouts (APA, grid, borderless). |
BibTeX
@inproceedings{li-etal-2020-tablebank,
title = "TableBank: Table Benchmark for Image-based Table Detection and Recognition",
author = "Li, Minghao and Cui, Lei and Huang, Shaohan and Wei, Furu and Zhou, Ming and Li, Zhoujun",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2020.lrec-1.236",
pages = "1918--1925"
}