Modern OCR for the Large Language & Vision Model Era
A reference page for all things Optical Character Recognition (OCR) using Large Language & Vision Models
Disclaimer: This is a work-in-progress research compilation and personal draft. The coverage is not comprehensive, and the analysis reflects one individual’s perspective on recent developments in the field. This resource should not be used as a definitive reference or benchmark for evaluating research contributions. Please refer to the original papers and conduct your own thorough review for academic or professional purposes.
VLM-Based Models
End-to-end vision-language models with learned multimodal representations for general document OCR.
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-11 | Nemotron Parse 1.1 | notes | 885M, 885M-TC | | NVIDIA Open Model License |
| 2025-12 | dots.ocr | notes | 3B | rednote-hilab/dots.ocr | MIT |
| 2025-01 | Ocean-OCR | notes | 3B | guoxy25/Ocean-OCR | Apache 2.0 |
| 2025-10 | olmOCR 2 | notes | 7B | allenai/olmocr | Apache 2.0 |
| 2025-10 | DeepSeek-OCR | notes | 3B | deepseek-ai/DeepSeek-OCR | MIT |
| 2025-09 | POINTS-Reader | notes | 4B | Tencent/POINTS-Reader | Apache 2.0 |
| 2025-09 | MinerU2.5 | notes | 1.2B | opendatalab/MinerU | AGPL-3.0 |
| 2025-06 | Infinity-Parser | notes | 7B | infly-ai/INF-MLLM | Apache 2.0 |
| 2025-05 | Dolphin | notes | 322M, 4B | ByteDance/Dolphin | MIT |
| 2025-04 | VISTA-OCR | notes | | | |
| 2025-03 | SmolDocling | notes | 256M | docling-project/docling | CDLA-Permissive-2.0 |
| 2025-02 | olmOCR | notes | 7B | allenai/olmocr | Apache 2.0 |
| 2024-09 | GOT-OCR2.0 | notes | 580M | Ucas-HaoranWei/GOT-OCR2.0 | Apache 2.0 |
| 2023-08 | Nougat | notes | small, base | facebookresearch/nougat | MIT (code), CC-BY-NC (weights) |
Pipeline Models
Modular systems combining specialized detection, recognition, and layout analysis components.
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-10 | PaddleOCR-VL | notes | 0.9B | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2025-07 | PaddleOCR 3.0 | notes | HuggingFace | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2025-06 | MonkeyOCR | notes | 3B | Yuliang-Liu/MonkeyOCR | Apache 2.0 |
| 2025-01 | Docling v2 | notes | models | DS4SD/docling | MIT (code), CDLA-Permissive-2.0 (weights) |
| 2024-09 | MinerU | notes | PDF-Extract-Kit | opendatalab/MinerU | Apache 2.0 |
Datasets & Benchmarks
This section catalogs datasets and benchmarks organized by OCR sub-domain. Many benchmarks span multiple domains; they are listed under their primary focus.
Datasets are organized into three tiers based on licensing:
- Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY) that allow commercial use
- Research: Non-commercial licenses (GPL, CC-NC) or explicit research-only restrictions
- Unclear: No license specified or mixed/complex licensing; verify before use
General Document OCR
Research Use Only
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-12 | OmniDocBench | 1,355 pages | Text, formula, table, reading-order | Multi-domain benchmark (EN/ZH) | | |
| 2024-05 | Fox Bench | Dense multi-page docs | Full document parsing | EN/ZH documents | | |
Charts & Visualizations
Chart understanding spans several task families:
- Extraction: Chart-to-table or chart-to-dict conversion (structured data recovery)
- QA: Visual question answering requiring numerical reasoning, comparison, or lookup
- Summarization: Natural language descriptions at varying semantic levels
Most datasets use synthetic charts (programmatically generated from tables) or web-scraped visualizations. Evaluation typically relies on exact/relaxed match for QA, BLEU/ROUGE for summarization, and F1 or RMS error for extraction. Scale varies dramatically: from ~6K charts (ChartY) to 28.9M QA pairs (PlotQA).
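As a concrete reference for the QA metrics above, here is a minimal sketch of the relaxed-accuracy rule used in ChartQA-style evaluation (numeric answers count as correct within a 5% relative tolerance, otherwise exact string match). The helper names and the string-match fallback are illustrative, not a specific benchmark's implementation.

```python
def relaxed_match(pred: str, target: str, tolerance: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy check (illustrative helper)."""
    try:
        p, t = float(pred), float(target)
        if t == 0:
            return p == 0
        # Numeric answers: allow a relative error up to `tolerance` (5% by default).
        return abs(p - t) / abs(t) <= tolerance
    except ValueError:
        # Non-numeric answers fall back to case-insensitive exact match.
        return pred.strip().lower() == target.strip().lower()

def relaxed_accuracy(preds: list[str], targets: list[str]) -> float:
    return sum(relaxed_match(p, t) for p, t in zip(preds, targets)) / len(targets)
```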
Commercial Use
Datasets with permissive licenses suitable for commercial training and deployment.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-10 | NovaChart | 47K charts, 856K instr. pairs | 18 chart types, 15 tasks (understanding + generation) | notes | GitHub, HuggingFace | MIT (code), Apache-2.0 (dataset) |
| 2024-04 | TinyChartData | 140K PoT pairs | Chart QA with program-of-thought learning | notes | GitHub, HuggingFace | Apache-2.0 |
| 2019-09 | PlotQA | 224K plots, 28.9M QA pairs | Plot question answering with OOV reasoning | notes | GitHub | MIT (code), CC-BY-4.0 (data) |
Research Use Only
Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | OneChart / ChartY | ~6K charts | Chart-to-dict structural extraction, bilingual (EN/ZH) | notes | Project, GitHub | Apache-2.0 (code); research use only |
| 2023-08 | SciGraphQA | 295K multi-turn, 657K QA pairs | Multi-turn scientific graph question answering | notes | GitHub, HuggingFace | Research only (Palm-2/GPT-4 terms) |
| 2023-08 | VisText | 12,441 charts | Chart captioning with semantic richness (L1-L3) | notes | GitHub | GPL-3.0 |
| 2022-05 | ChartQA | 20,882 charts, 32,719 QA pairs | Chart question answering with visual and logical reasoning | notes | GitHub | GPL-3.0 |
| 2022-03 | Chart-to-Text | 44,096 charts | Chart summarization: natural language text generation | notes | GitHub | GPL-3.0 (+ source restrictions) |
| 2019-09 | CHART-Infographics | ~200K synthetic, 4.2K real | Chart classification, text detection/OCR, role classification, axis/legend analysis | notes | Synthetic, PMC | CC-BY-NC-ND 3.0 (synthetic), CC-BY-NC-SA 3.0 (PMC) |
| 2018-04 | DVQA | 300K bar charts, 3.5M QA pairs | Bar chart question answering | notes | GitHub | CC-BY-NC 4.0 |
License Unclear or Mixed
Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | ChartThinker | 595K charts, 8.17M QA pairs | Chart summarization QA | notes | GitHub, HuggingFace | MIT (HF); sources include GPL/CC-NC |
| 2023-07 | DePlot | 516K plot-table, 5.7M QA pairs | Plot-to-table translation, chart question answering | notes | google-research, HuggingFace | Apache-2.0 (model); mixed data licenses |
| 2023-05 | UniChart | 611K charts | Pretraining corpus: table extraction, reasoning, QA, summ. | notes | GitHub | Varies by source (see notes) |
| 2023-04 | ChartSumm | 84K charts | Chart summarization with short and long summaries | notes | GitHub, Drive | Unspecified (no LICENSE file) |
| 2021-01 | ChartOCR / ExcelChart400K | 386,966 charts | Chart-to-table extraction: bar, line, pie | notes | GitHub, HuggingFace | MIT (HF); crawled data, paper silent on license |
| 2018-04 | Beagle | 42K SVG | Visualization type classification | notes | UW | MIT (code only); dataset license not stated |
Mathematical Expression Recognition
Mathematical expression recognition addresses printed, handwritten, and screen-captured formulas with complex 2D spatial structure. The domain is characterized by large symbol inventories (101 classes in CROHME benchmarks, 245 in HME100K, extended vocabularies for LaTeX rendering) and structural relationships such as superscripts, subscripts, fractions, radicals, and matrix layouts.
Task families include:
- Symbol recognition: Isolated classification with reject options for non-symbol junk
- Expression parsing: Combined segmentation, classification, and structural relationship extraction
- Image-to-LaTeX: End-to-end conversion from formula images to markup
- Matrix recognition: Hierarchical evaluation at matrix, row, column, and cell levels
Evaluation typically measures expression-level exact match rates (ExpRate) alongside object-level metrics for symbol segmentation, classification, and spatial relation detection. CROHME benchmarks indicate structure parsing remains a bottleneck: 90% accuracy with perfect symbol labels versus 67% end-to-end. Recent large-scale datasets (UniMER-1M with 1M+ samples) target real-world complexity beyond clean academic benchmarks, including noisy screen captures, font inconsistencies, and long expressions (up to 7,000+ tokens).
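For reference, a minimal sketch of the expression-level exact match rate (ExpRate) mentioned above, assuming predictions and references are LaTeX strings normalized the same way; published variants additionally report ExpRate tolerating one or two symbol errors, which this sketch omits.

```python
def exprate(preds: list[str], refs: list[str]) -> float:
    """Expression-level exact match rate: a prediction counts only if the
    whole LaTeX string matches the reference after whitespace normalization."""
    norm = lambda s: " ".join(s.split())
    return sum(norm(p) == norm(r) for p, r in zip(preds, refs)) / len(refs)
```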
Datasets are organized into three tiers based on licensing:
- Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY) that allow commercial use
- Research: Non-commercial licenses (GPL, CC-NC) or explicit research-only restrictions
- Unclear: No license specified or mixed/complex licensing; verify before use
License Unclear or Mixed
Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | UniMER-1M | 1,061,791 train; 23,757 test (4 subsets) | Image-to-LaTeX: printed, complex, screen-captured, handwritten | notes | HuggingFace, OpenDataLab | Apache-2.0 (HF tag); upstream sources have mixed licenses |
| 2024-04 | MathWriting | 626k total (230k human, 396k synthetic) | Online handwritten math expression recognition, image-to-LaTeX | notes | Google Storage, HuggingFace | CC-BY-NC-SA 4.0 |
| 2022-03 | HME100K | 74,502 train + 24,607 test images | Handwritten mathematical expression recognition | notes | GitHub, Portal | Unspecified (no LICENSE file) |
Research Use Only
Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2019-09 | CROHME 2019 + TFD | 1,199 test expressions, 236 pages (TFD) | Handwritten math + typeset formula detection | notes | TC10/11 Package, TFD GitHub | CC-BY-NC-SA 3.0 |
| 2016-09 | CROHME 2016 | 1,147 test expressions (Tasks 1/4), 250 test matrices | 4 tasks: formula, symbol, structure, matrix recognition | notes | TC10/11 Package | CC-BY-NC-SA 3.0 |
| 2014-09 | CROHME 2014 | 986 test expressions (10K symbols), 175 matrices, 10K+9K junk | Symbol recognition with reject, expression, matrix parsing | notes | TC11, TC10/11 Package, GitHub | CC-BY-NC-SA 3.0 |
Handwriting Recognition
Handwriting recognition for natural language text focuses on word-level and line-level detection and recognition in unconstrained conditions. Unlike mathematical expressions, which require parsing 2D spatial structure, general handwriting tasks emphasize sequential text extraction from camera-captured images, historical documents, and field notes. Evaluation uses localization metrics (IoU-based measures) for detection and character/word accuracy rates for recognition.
Datasets are organized into three tiers based on licensing:
- Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY) that allow commercial use
- Research: Non-commercial licenses (GPL, CC-NC) or explicit research-only restrictions
- Unclear: No license specified or mixed/complex licensing; verify before use
Commercial Use
Datasets with permissive licenses suitable for commercial training and deployment.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-09 | GNHK | 687 images, 39,026 texts, 172,936 chars | Word-level detection and recognition (camera-captured) | notes | GoodNotes, GitHub | CC-BY 4.0 |
Layout Detection & Document Structure
Document layout analysis identifies and classifies page regions (text blocks, tables, figures, titles, lists, headers, footers) and their spatial relationships. Unlike full OCR pipelines that also perform text recognition, layout detection focuses on structural segmentation: predicting bounding boxes and category labels for document components. Evaluation uses object detection metrics (mAP at various IoU thresholds) with per-category AP breakdowns for fine-grained analysis.
The field distinguishes between category-agnostic detection (locating all content regions regardless of type) and category-aware detection (classifying regions into semantic categories). Modern approaches balance two competing pressures: fine-grained taxonomies (20+ categories for specialized documents) versus coarse taxonomies (5-10 categories for cross-domain generalization).
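The mAP metrics above reduce to box-level IoU comparisons; a minimal sketch of that computation (boxes as (x1, y1, x2, y2) in pixels, names illustrative):

```python
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A predicted region is a true positive at threshold t (e.g., 0.5) if it matches a
# ground-truth box of the same category with iou >= t; mAP averages AP over categories.
```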
Models
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-03 | PP-DocLayout | notes | L, _plus-L, M, S | PaddlePaddle/PaddleX | Apache 2.0 |
| 2025-10 | PP-DocLayoutV2 | notes | Model | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2021-08 | LayoutReader | notes | Model | GitHub | Research only |
Commercial Use
Datasets with permissive licenses suitable for commercial training and deployment.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2022-08 | DocLayNet | 80,863 pages, 11 categories | Layout detection, reading order | | GitHub | CDLA-Permissive-1.0 |
| 2019-09 | PubLayNet | 360K+ pages, 5 categories | Layout detection for scientific documents | | GitHub | CDLA-Permissive-1.0 |
| 2019-05 | DocBank | 500K pages, 13 categories | Weakly supervised layout detection | notes | GitHub | Apache 2.0 |
Research Use Only
Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-08 | ReadingBank | 500K document pages | Reading order detection | notes | GitHub | Research only (no redistribution) |
License Unclear or Mixed
Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
Table Structure Recognition
Table structure recognition (TSR) extracts the logical structure of tables—identifying cells, rows, columns, spanning relationships, and hierarchical organization. Unlike table detection (which only locates table regions) or table understanding (which also interprets content semantics), TSR focuses on parsing the structural grid: mapping visual layouts to machine-readable formats such as HTML, LaTeX, or specialized tokenization schemes.
The field is characterized by two main architectural families:
- End-to-end vision-language models: Directly predict table structure as token sequences from images (image-to-markup)
- Pipeline systems: Combine separate modules for cell detection, structure parsing, and optional content extraction
Evaluation uses tree edit distance (TED) metrics for structural accuracy and mAP/IoU metrics for cell localization. Modern benchmarks emphasize complex spanning (merged cells across rows/columns), multi-page tables, and domain-specific formats (financial statements, scientific papers, invoices).
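For reference, the tree-edit-distance metric is most often reported as TEDS (Tree-Edit-Distance-based Similarity, introduced with PubTabNet), which compares predicted and ground-truth HTML trees: $\text{TEDS}(T_{\text{pred}}, T_{\text{gt}}) = 1 - \frac{\text{EditDist}(T_{\text{pred}}, T_{\text{gt}})}{\max(|T_{\text{pred}}|, |T_{\text{gt}}|)}$, where $|T|$ is the node count. Scores closer to 1 indicate better structural agreement; the TEDS-S variant scores structure only, ignoring cell text.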
Datasets are organized into three tiers based on licensing:
- Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY, CDLA-Permissive) that allow commercial use
- Research: Non-commercial licenses (GPL, CC-NC) or explicit research-only restrictions
- Unclear: No license specified or mixed/complex licensing; verify before use
Commercial Use
Datasets with permissive licenses suitable for commercial training and deployment.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-09 | PubTables-1M | ~1M tables | Structure recognition, detection | | HuggingFace | CDLA-Permissive-2.0 |
| 2019-11 | PubTabNet | 568K tables | Structure recognition | | GitHub | CDLA-Permissive-1.0 |
| 2018-XX | FinTabNet | 113K tables | Structure recognition | | IBM Developer, HF (.c) | CDLA-Permissive-1.0 |
Research Use Only
Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
License Unclear or Mixed
Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.
| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
Specialized Methods (No New Data)
Papers that introduce methods or training techniques but do not release new datasets. Included for completeness; see original papers for evaluation details.
Chart Understanding
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2024-05 | SIMPLOT | notes | | GitHub | No license stated |
| 2024-04 | TinyChart | notes | 3B | X-PLUG/mPLUG-DocOwl | Apache 2.0 |
| 2023-05 | UniChart | notes | base, ChartQA | vis-nlp/UniChart | MIT |
Table Structure Recognition
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2023-05 | OTSL | notes | | | Not specified |
Mathematical Expression Recognition
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2024-04 | UniMERNet | notes | 100M/202M/325M | opendatalab/UniMERNet | Apache 2.0 |
| 2022-03 | SAN | notes | | Code not publicly released | — |
Evaluation & Metrics
| Date | Paper | Notes | Code | License |
|---|---|---|---|---|
| 2024-09 | CDM | notes | opendatalab/UniMERNet | Apache 2.0 |
Handwriting Generation
| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2020-08 | Decoupled Style Descriptors | notes | | GitHub | Non-commercial research only |
PaddleOCR 3.0: Open-Source OCR and Document AI Toolkit
Paper: PaddleOCR 3.0: Advancements in Open-Source OCR and Document AI
Code: PaddlePaddle/PaddleOCR
Models: PaddlePaddle on HuggingFace
Docs: paddlepaddle.github.io/PaddleOCR
TL;DR
PaddleOCR 3.0 is an open-source OCR and document AI toolkit (Apache 2.0) that ships three main solutions: PP-OCRv5 (lightweight multilingual OCR), PP-StructureV3 (end-to-end document parsing), and PP-ChatOCRv4 (OCR + LLM for KIE/QA). The system includes deployment infrastructure (high-performance inference, Triton/FastAPI serving, on-device, MCP integration). On OmniDocBench OCR, PP-OCRv5 ranks first on average across 17 scenarios. On document parsing, PP-StructureV3 achieves Edit distance 0.145 (EN) and 0.206 (ZH), outperforming MinerU and Docling.
What kind of paper is this?
Primarily $\Psi_{\text{Resource}}$, with substantial $\Psi_{\text{Method}}$ components.
- Dominant: $\Psi_{\text{Resource}}$: The paper delivers a production-ready toolkit with Apache 2.0 licensing, pre-trained model zoo (PP-OCRv5 server/mobile variants, PP-StructureV3, PP-ChatOCRv4), layered API/CLI interfaces, and comprehensive deployment infrastructure (HPI with automatic backend selection, FastAPI/Triton serving, on-device via Paddle-Lite, MCP server). The resource framing is explicit: enabling reproducible OCR and document understanding for the research and practitioner communities. Performance optimizations (73% latency reduction on T4 for mobile recognition) and serving options are presented as reusable tooling rather than operational validation.
- Secondary: $\Psi_{\text{Method}}$: PP-OCRv5 introduces a single multilingual model supporting Simplified Chinese, Traditional Chinese, Pinyin, English, and Japanese. PP-StructureV3 integrates layout analysis, table recognition, and specialized document items (seals, formulas, charts). PP-ChatOCRv4 combines pipeline OCR outputs with vector retrieval, VLM-based answer extraction (PP-DocBee2), and LLM reasoning (ERNIE-4.5) with result fusion.
What is the motivation?
The report frames OCR and document parsing as foundational for downstream document understanding, with LLM and RAG adoption increasing demand for high-quality text extraction, structure recovery, and semantic interpretation across diverse document types.
- Multilingual and layout complexity: Documents span handwriting, multilingual text, complex layouts, tables, formulas, and charts. Prior OCR systems often specialize narrowly or require multiple models for different languages/scripts.
- Production barriers: Deploying OCR at scale requires optimized inference (low latency, resource efficiency), robust serving infrastructure, and on-device support. Existing open-source OCR tools may lack these deployment features or use restrictive licenses.
- Key information extraction and QA: Document workflows increasingly involve extracting structured information and answering questions over document content, motivating integrated OCR + LLM/VLM pipelines.
What is the novelty?
PP-OCRv5 single multilingual model: A unified recognition model handling Simplified Chinese, Traditional Chinese, Pinyin, English, and Japanese, with server (GPU-optimized) and mobile (CPU/resource-constrained) variants. The pipeline includes preprocessing, text detection, text-line orientation classification, and recognition.
PP-StructureV3 end-to-end parsing: Integrates layout analysis, table recognition, and structure extraction into a unified document parsing system. Extends to specialized items including seal text recognition, formula recognition, and chart analysis. Reports Edit distance 0.145 (EN) and 0.206 (ZH) on OmniDocBench parsing, outperforming MinerU (0.333 EN / 0.350 ZH) and Docling (0.538 EN / 0.569 ZH).
PP-ChatOCRv4 retrieval-augmented KIE/QA: Combines PP-StructureV3 OCR outputs with vector retrieval, a 3B VLM (PP-DocBee2) for prompt-based answer extraction from document images, and a 300B LLM (ERNIE-4.5-300B-A47B) for reasoning. Result fusion merges text-based and image-based answers. Reports 85.55% Recall@1 on a custom benchmark (638 documents, 1,196 QA pairs), outperforming GPT-4o (63.47%), PP-ChatOCRv3 (70.08%), and Qwen2.5-VL-72B (80.26%).
Deployment infrastructure: Redesigned inference library (PaddleX 3.0) with layered API/CLI, high-performance inference (HPI) with automatic backend selection (Paddle Inference, OpenVINO, ONNX Runtime, TensorRT), built-in optimizations (multi-threading, FP16), FastAPI/Triton serving, on-device support (Paddle-Lite), and an MCP server exposing OCR/parsing as tools with stdio and Streamable HTTP transports.
What experiments were performed?
PP-OCRv5 on OmniDocBench OCR (17 scenarios, 1-EditDist metric): PP-OCRv5 ranks first on average across all scenarios, compared against multiple VLM baselines with reported parameter sizes.
PP-StructureV3 on OmniDocBench parsing (Edit distance, lower is better): Evaluated on English and Chinese document parsing. PP-StructureV3 achieves Edit 0.145 (EN) and 0.206 (ZH), outperforming MinerU-1.3.11 (0.333 EN / 0.350 ZH) and Docling-2.14.0 (0.538 EN / 0.569 ZH).
PP-ChatOCRv4 on custom KIE/QA benchmark (638 document images, 1,196 QA pairs, Recall@1 metric): The dataset spans financial reports, research papers, contracts, manuals, and regulations. PP-ChatOCRv4 achieves 85.55% Recall@1, compared to GPT-4o (63.47%), PP-ChatOCRv3 (70.08%), and Qwen2.5-VL-72B (80.26%).
HPI latency reduction (NVIDIA Tesla T4): Enabling HPI reduces latency for PP-OCRv5_mobile_rec by 73.1% and PP-OCRv5_mobile_det by 40.4%.
What are the outcomes/limitations?
Outcomes:
- OmniDocBench OCR: PP-OCRv5 ranks first on average across 17 scenarios using the 1-EditDist metric.
- OmniDocBench parsing: PP-StructureV3 achieves Edit 0.145 (EN) and 0.206 (ZH), leading among open-source pipeline systems.
- Custom KIE/QA benchmark: PP-ChatOCRv4 achieves 85.55% Recall@1, a +5.29 point improvement over Qwen2.5-VL-72B (80.26%) and +22.08 points over GPT-4o (63.47%).
- Inference optimization: HPI reduces mobile recognition latency by 73.1% and mobile detection latency by 40.4% on T4 hardware.
- Deployment surface: The toolkit provides FastAPI/Triton serving, on-device deployment via Paddle-Lite, and an MCP server with Local, AI Studio, and Self-Hosted modes supporting stdio and Streamable HTTP transports.
Limitations and open questions:
- Benchmark scope: OmniDocBench OCR and parsing evaluations are publicly documented, but the custom KIE/QA benchmark (638 documents, 1,196 QA pairs) has limited external validation. The authors do not release this benchmark publicly, limiting reproducibility of the PP-ChatOCRv4 results.
- Multilingual coverage: PP-OCRv5 supports Simplified Chinese, Traditional Chinese, Pinyin, English, and Japanese. Generalization to other languages (Arabic, Cyrillic, Indic scripts, etc.) is not addressed. The single multilingual model design may face capacity constraints when scaling to dozens of languages.
- VLM dependency for KIE/QA: PP-ChatOCRv4 relies on PP-DocBee2 (3B VLM) and ERNIE-4.5 (300B LLM). The paper does not ablate the contribution of each component (pipeline OCR vs. VLM-based answer extraction vs. LLM reasoning vs. result fusion). It is unclear whether the Recall@1 gains come primarily from the fusion strategy, the quality of PP-StructureV3 outputs, or the VLM/LLM model choices.
- Deployment overhead: The HPI latency reductions are reported for mobile variants on T4 hardware. Server variants on GPU hardware (A100, H100) are not profiled. Throughput and cost-per-page estimates are not provided, limiting comparison to olmOCR ($176/M pages on L40S) or MinerU2.5 (1.224 pages/s on A100).
- License and model availability: The toolkit is Apache 2.0, but the report does not clarify the licensing or availability of ERNIE-4.5-300B-A47B (used in PP-ChatOCRv4). PP-DocBee2 weights are not linked in the paper. This limits reproducibility of the KIE/QA results.
Model
PP-OCRv5 pipeline and variants
Two variants:
- Server: GPU-optimized for throughput and accuracy.
- Mobile: CPU/resource-constrained for edge deployment.
Pipeline stages:
- Image preprocessing: Handles rotation correction, noise reduction, and normalization.
- Text detection: Identifies text regions with bounding boxes.
- Text-line orientation classification: Corrects text-line rotation (0°, 90°, 180°, 270°).
- Text recognition: Converts detected text regions to Unicode strings.
Key architectural ingredients (Figure 3 in paper, selected examples):
- Backbones: PP-HGNetV2 (server), PP-LCNetV3 (mobile).
- Detection enhancements: PFHead, DSR (Dynamic Scale Regression), Lite-Neck.
- Recognition enhancements: GTC-NRTR (Guided Training of CTC with NRTR), multi-scale training.
- Data strategy: Pretrain distillation, synthetic data generation, label refinement using ERNIE 4.5.
PP-StructureV3 modules
Core capabilities:
- Layout analysis: Identifies and classifies document regions (text blocks, tables, figures, formulas, etc.).
- Table recognition: Extracts table structure and cell content, supporting rowspan/colspan via HTML output.
- Structure extraction: Recovers reading order and hierarchical document structure.
Specialized document items:
- Seal text recognition: Handles circular and non-rectangular text (stamps, official seals).
- Formula recognition: Converts mathematical notation to LaTeX.
- Chart analysis: Extracts data from charts and graphs (details not provided in the report).
PP-ChatOCRv4 architecture
Pipeline stages (Figure 9 in paper):
- Prompt engineering: User query + optional image input.
- PP-StructureV3 extraction: OCR and document parsing to recover text, layout, and structure.
- Vector retrieval: Index OCR outputs for semantic search.
- PP-DocBee2 (3B VLM): Prompt-based answer extraction from document image regions.
- ERNIE-4.5-300B-A47B (LLM): Reasoning over retrieved text and VLM outputs.
- Result fusion: Combines text-based (OCR + LLM) and image-based (VLM) answers.
Key design choices:
- Dual-stream reasoning: Separate text-based (pipeline OCR $\rightarrow$ vector retrieval $\rightarrow$ LLM) and image-based (VLM) pathways. Fusion merges results to handle cases where OCR fails (e.g., complex tables, handwriting) or VLM hallucinates.
- PP-DocBee2: A 3B VLM fine-tuned for document understanding tasks. The paper does not provide architectural details, training data, or standalone evaluation results for PP-DocBee2.
Data
PP-OCRv5 training data
The report does not specify the size, composition, or sourcing of the PP-OCRv5 training dataset. Key data strategies are described qualitatively:
- Synthetic data generation: Augments training with rendered text images.
- Label refinement using ERNIE 4.5: Corrects noisy labels in existing datasets.
- Multilingual coverage: Training data includes Simplified Chinese, Traditional Chinese, Pinyin, English, and Japanese text.
PP-ChatOCRv4 evaluation benchmark
Size: 638 document images, 1,196 QA pairs.
Document types: Financial reports, research papers, contracts, manuals, regulations, and other unspecified categories.
Metric: Recall@1 (exact match or fuzzy match against ground-truth answers).
Limitations: The benchmark is not publicly released, limiting external validation and reproducibility. The authors do not specify the annotation protocol, inter-annotator agreement, or difficulty distribution of the QA pairs.
OmniDocBench datasets
PP-OCRv5 evaluation: 17 scenarios covering diverse document types, scripts, and layouts. The paper cites OmniDocBench but does not detail the dataset composition or size.
PP-StructureV3 evaluation: English and Chinese document parsing subsets of OmniDocBench. The Edit metric measures character-level edit distance between predicted and ground-truth document structure.
Evaluation
PP-OCRv5 (OmniDocBench OCR)
| Method | 1-EditDist (average) |
|---|---|
| PP-OCRv5 | Rank 1 (value NR) |
| VLM Baseline 1 | NR |
| VLM Baseline 2 | NR |
The report states PP-OCRv5 ranks first on average across 17 scenarios but does not provide absolute 1-EditDist values or the names/sizes of competing VLM baselines. The table in the paper includes parameter counts for baselines but omits numeric results.
PP-StructureV3 (OmniDocBench parsing)
| Method | EN Edit | ZH Edit |
|---|---|---|
| PP-StructureV3 | 0.145 | 0.206 |
| MinerU-1.3.11 | 0.333 | 0.350 |
| Docling-2.14.0 | 0.538 | 0.569 |
Edit distance is computed at the character level between predicted and ground-truth document structure. Lower is better. PP-StructureV3 leads by a substantial margin (2.3$\times$ better than MinerU on EN, 1.7$\times$ on ZH).
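A minimal sketch of the character-level normalized edit distance behind both metrics above (its complement gives the 1-EditDist OCR score). The exact normalization used by OmniDocBench may differ, and the names here are illustrative.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Character-level Levenshtein distance divided by the longer string's length."""
    if not pred and not ref:
        return 0.0
    dp = list(range(len(ref) + 1))   # distances against an empty prediction prefix
    for i, cp in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, cr in enumerate(ref, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,           # deletion
                                     dp[j - 1] + 1,       # insertion
                                     prev + (cp != cr))   # substitution
    return dp[-1] / max(len(pred), len(ref))

def one_minus_edit_dist(pred: str, ref: str) -> float:
    """Higher is better; this is the form used for the OCR scores above."""
    return 1.0 - normalized_edit_distance(pred, ref)
```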
PP-ChatOCRv4 (custom KIE/QA benchmark)
| Method | Recall@1 |
|---|---|
| PP-ChatOCRv4 | 85.55% |
| Qwen2.5-VL-72B | 80.26% |
| PP-ChatOCRv3 | 70.08% |
| GPT-4o | 63.47% |
Recall@1 measures whether the system’s top-ranked answer matches the ground-truth (exact or fuzzy match). PP-ChatOCRv4 achieves a +5.29 point gain over Qwen2.5-VL-72B and +22.08 points over GPT-4o.
Limitations: The benchmark is not publicly released. The authors do not ablate the contribution of PP-StructureV3 OCR quality vs. VLM answer extraction vs. LLM reasoning vs. result fusion. It is unclear whether the gains come from superior OCR, better VLM/LLM models, or the fusion strategy.
Hardware / Production
High-performance inference (HPI)
Key features:
- Automatic backend selection: Paddle Inference, OpenVINO, ONNX Runtime, TensorRT. The system selects the optimal backend based on hardware and model architecture.
- Built-in optimizations: Multi-threading, FP16 precision, on-demand ONNX conversion.
- Enabled via API: enable_hpi=True in the Python API.
Latency reduction on NVIDIA Tesla T4:
| Model | Latency Reduction |
|---|---|
| PP-OCRv5_mobile_rec | 73.1% |
| PP-OCRv5_mobile_det | 40.4% |
The report does not provide absolute latency values (ms/page) or throughput estimates (pages/s).
Serving options
Basic Serving (FastAPI):
- Lightweight REST API for OCR and document parsing.
- Multi-language client examples (Python, C++, etc.) provided in documentation.
High-Stability Serving (Triton):
- Nvidia Triton Inference Server integration for production deployments.
- Supports dynamic batching, model versioning, and ensemble inference.
On-device deployment:
- Paddle-Lite tooling for Android, iOS, and edge hardware.
- Mobile variants of PP-OCRv5 designed for CPU/resource-constrained environments.
MCP server
Model Context Protocol integration:
- Exposes OCR (PP-OCRv5) and document parsing (PP-StructureV3) as tools for LLM agents.
- Modes: Local (runs models on local hardware), AI Studio (cloud-hosted inference), Self-Hosted (user-managed servers).
- Transports: stdio (standard input/output) and Streamable HTTP (streaming responses).
Example configuration (Local mode, stdio transport):
{
  "mcpServers": {
    "paddleocr-mcp": {
      "command": "python",
      "args": ["-m", "paddleocr_mcp.server"],
      "env": {
        "MODE": "local"
      }
    }
  }
}
Implementation sketch
Python API for PP-StructureV3
from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="PP-StructureV3")
# Single-page inference
output = pipeline.predict("document.pdf")
# Batch inference
for result in pipeline.predict(["doc1.pdf", "doc2.pdf"], batch_size=4):
    result.save_to_img("output/")
    result.save_to_json("output/")
Python API for PP-ChatOCRv4
from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="PP-ChatOCRv4Doc")
# Build vector index from documents
pipeline.build_vector(data_root="documents/")
# Query the indexed documents
result = pipeline.chat(
    key_list=["contract_2023.pdf"],
    query="What is the contract termination date?"
)
print(result["result"])
Enabling HPI
pipeline = create_pipeline(
pipeline="PP-OCRv5",
enable_hpi=True # Automatic backend selection + optimizations
)
Notes and open questions
Observations:
- PaddleOCR 3.0 is a comprehensive toolkit (OCR + parsing + KIE/QA + deployment) rather than a single model. The Apache 2.0 license and deployment infrastructure (HPI, serving, on-device, MCP) target production adoption.
- PP-StructureV3 leads on OmniDocBench parsing with substantial margins (2.3$\times$ better than MinerU on EN Edit). This suggests strong layout analysis and structure recovery, though the paper does not ablate the contribution of individual modules (layout detector, table recognizer, reading-order recovery, etc.).
- PP-ChatOCRv4’s +22 point Recall@1 gain over GPT-4o is striking, but the custom benchmark is not publicly released and the ablation is incomplete. It is unclear whether the improvement comes from superior OCR (PP-StructureV3), better VLM/LLM models (PP-DocBee2 + ERNIE-4.5), or the result fusion strategy.
Open questions:
- PP-OCRv5 training data: The report describes data strategies (synthetic generation, label refinement with ERNIE 4.5) but does not specify dataset size, composition, or sourcing. How much training data is required to achieve the reported OmniDocBench performance?
- OmniDocBench OCR results: The paper states PP-OCRv5 ranks first on average but omits numeric 1-EditDist values and baseline names/sizes. External validation on other OCR benchmarks (e.g., olmOCR-Bench, FoxBench) is absent.
- PP-ChatOCRv4 ablations: What is the contribution of each component (PP-StructureV3 OCR vs. PP-DocBee2 VLM vs. ERNIE-4.5 LLM vs. result fusion)? Does the fusion strategy generalize to other VLM/LLM combinations (e.g., GPT-4o + Qwen2.5-VL)?
- Deployment cost and throughput: The HPI latency reductions (73% for mobile recognition on T4) are impressive, but absolute latency (ms/page) and throughput (pages/s) are not reported. Cost-per-page estimates (e.g., olmOCR’s $176/M pages on L40S) are absent, limiting production planning.
- PP-DocBee2 details: The paper does not provide architectural details, training data, or standalone evaluation results for the 3B VLM used in PP-ChatOCRv4. Is PP-DocBee2 a general-purpose VLM or a document-specialized model? Are the weights publicly available?
- Multilingual scaling: PP-OCRv5 supports five languages/scripts (Simplified Chinese, Traditional Chinese, Pinyin, English, Japanese). How would the single multilingual model design scale to dozens of languages (Arabic, Cyrillic, Indic scripts, etc.)? Would capacity constraints require larger models or ensemble architectures?
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Paper: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Code: allenai/olmocr
Demo: olmocr.allenai.org
Model: olmOCR-7B-0825
TL;DR
olmOCR is a PDF-to-text linearization toolkit centered on a 7B vision-language model fine-tuned from Qwen2-VL-7B-Instruct. The system converts PDFs (both born-digital and scanned) into clean plain text suitable for language model development. The core innovation is “document anchoring,” a prompt technique that injects PDF-extracted text blocks with coordinates to reduce reading-order errors and hallucinations. On olmOCR-Bench (1,402 PDFs, 7,010 unit tests), the anchored model achieves 75.5% overall pass rate with an estimated cost of \$176 per million pages on L40S hardware.
What kind of paper is this?
Primarily $\Psi_{\text{Resource}}$, with meaningful $\Psi_{\text{Method}}$ and $\Psi_{\text{Evaluation}}$ components.
- Dominant: $\Psi_{\text{Resource}}$: The headline deliverables are (1) a training dataset (olmOCR-mix-0225, 260k pages), (2) a benchmark suite (olmOCR-Bench with 7,010 unit tests), and (3) a released fine-tuned model with inference code. The paper emphasizes enabling reproducible PDF processing for the research community.
- Secondary: $\Psi_{\text{Method}}$: Document anchoring (layout-aware prompt engineering with extracted text blocks and bounding boxes) is presented as a novel technique for reducing VLM hallucinations in document contexts.
- Secondary: $\Psi_{\text{Evaluation}}$: The benchmark design itself represents a substantial contribution: deterministic unit-test-style pass/fail rules that avoid LLM-as-judge biases.
What is the motivation?
PDFs encode rendering primitives (glyphs with positions) rather than semantic text structure or ground-truth reading order, making them challenging inputs for language model training and document-grounded inference.
- Low-fidelity extraction hurts LM training: Poor OCR quality can degrade training stability and downstream task performance. Cascading errors in document workflows compound the problem.
- Proprietary solutions are expensive: The authors cite GPT-4o API costs exceeding \$6,200 per million pages as a barrier to large-scale PDF processing for open research.
- Existing open tools lack validation: Pipeline systems (Marker, MinerU) and open VLMs have no standardized, reproducible benchmark for comparing PDF linearization quality.
What is the novelty?
Document anchoring: The system extracts text blocks with bounding boxes from the PDF using pypdf, then injects a subset of these blocks (with coordinates and image placeholders) into the VLM prompt alongside the page image. This provides layout hints that help the model maintain correct reading order and reduce hallucinated content.
Silver-data recipe: The authors generate approximately 260k page OCR targets with GPT-4o using structured JSON output, then fine-tune Qwen2-VL-7B-Instruct into olmOCR-7B-0225-preview. The training data is filtered for English-only content and sampled up to 3 pages per PDF.
Unit-test benchmark: olmOCR-Bench uses deterministic binary checks (presence/absence/order/table structure/math formula accuracy) designed to be evaluator-model-independent, explicitly addressing concerns about LLM-judge bias in document evaluation.
What experiments were performed?
olmOCR-Bench evaluation (Table 4): 1,402 PDFs spanning 7,010 unit tests across categories including text presence/absence, natural reading order, table accuracy, and math formula rendering. Baselines include open pipeline tools (Marker v1.7.5, MinerU v1.3.10), proprietary OCR systems (Mistral OCR, GPT-4o, Gemini Flash), and Qwen2-VL variants with and without document anchoring.
Cost analysis (Table 6): Estimates tokens/sec, pages/USD, and cost per million pages for multiple systems under specified hardware assumptions (L40S at \$0.79/hr, H100 at \$2.69/hr), including GPT-4o API vs. batch pricing.
Downstream LM pretraining ablation (Table 5): Continues pretraining OLMo-2-1124-7B for 50B tokens using PDFs linearized by olmOCR vs. Grobid+rules (peS2o baseline), then evaluates on standard LM benchmarks (MMLU, ARC-Challenge, DROP, etc.).
Ablations on anchoring: Compares Qwen2-VL variants and GPT-4o with and without document anchoring prompts to isolate the contribution of layout-aware context.
What are the outcomes/limitations?
Key results:
- olmOCR-Bench overall pass rate: 75.5% (95% CI via bootstrap) for the anchored model, outperforming Marker (70.1%), MinerU (61.5%), Mistral OCR (72.0%), GPT-4o (68.9%), and Qwen2.5-VL (65.5%).
- Cost estimate: \$176 per million pages on L40S hardware (assumes 12% retry rate for JSON parsing failures and degenerate repetition), compared to \$6,240 for GPT-4o batch API and \$596 for MinerU on L40S.
- Downstream pretraining impact: +1.3 percentage points average improvement across benchmark suite when replacing Grobid+rules PDF processing with olmOCR (53.9% $\rightarrow$ 55.2% average score).
Limitations and open questions:
- English-only training data: Explicit filtering via Lingua removes non-English documents, limiting multilingual generalization. The 3-page-per-PDF sampling may bias coverage toward short documents.
- Teacher model inheritance: Fine-tune targets are GPT-4o outputs, so the student model inherits teacher quirks and evaluation aligns to teacher behavior rather than purely human preference.
- Reliability challenges: The paper describes retries for JSON parsing failures and degenerate repetition (mitigated with higher temperature $\tau = 0.8$). The 12% retry rate impacts throughput in production.
- Benchmark generalization: olmOCR-Bench is carefully engineered but remains an in-house suite. The document distribution (55.9% academic papers) may not reflect other production use cases. External validation on held-out benchmarks is not the core claim.
- Long document handling: Sampling only a few pages per PDF during training raises questions about failure modes on documents with complex multi-page structure.
Contrast to olmOCR 2: This initial olmOCR release (February 2025) focuses on GPT-4o distillation with document anchoring prompts, whereas olmOCR 2 (October 2025) shifts to RL-based policy improvements using unit-test rewards from the same benchmark suite. olmOCR 2 reports 81.2% pass rate (vs. 75.5% here) and eliminates structured JSON output in favor of direct plain-text generation, improving robustness and reducing retry overhead.
Model
Base model: Qwen2-VL-7B-Instruct fine-tuned into olmOCR-7B-0225-preview (7B parameters).
Output format: The model generates structured JSON during training (matching GPT-4o’s synthetic target schema), though the inference prompt can be simplified to return “plain text representation” directly. The JSON schema includes fields for primary_language, rotation validity/correction, boolean flags for tables/formulas/diagrams, and natural_text.
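To make the schema concrete, here is a hypothetical example of a page-level target following the fields described above; the exact key names, types, and defaults are assumptions for illustration, not copied from the paper.

```python
# Hypothetical page-level training target (key names and values are assumptions).
page_target = {
    "primary_language": "en",
    "is_rotation_valid": True,
    "rotation_correction": 0,   # degrees: 0, 90, 180, or 270
    "is_table": False,          # content flags for tables/formulas/diagrams
    "is_diagram": False,
    "natural_text": "Plain-text linearization of the page ...",
}
```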
Document anchoring representation
The prompt construction pipeline:
- Extract PDF text blocks and image blocks with bounding boxes using pypdf.
- Sample a subset of blocks (preferring start/end of document) to inject into the prompt.
- If the prompt exceeds 8,192 tokens, regenerate with reduced character limits per block.
- Concatenate the page image, selected text blocks (with RAW_TEXT_START/END delimiters), and generation instructions.
The authors describe this as “layout-aware retrieval” that provides the VLM with content hints and rough ordering cues, reducing hallucinations and reading-order errors.
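A minimal sketch of how an anchored prompt might be assembled from the sampled blocks; the RAW_TEXT_START/END delimiters and the closing instruction come from the paper's description, while the coordinate prefix format, character budget, and helper names are assumptions.

```python
def build_anchored_prompt(blocks: list[dict], char_budget: int = 4000) -> str:
    """Assemble the text-anchoring section that accompanies the rendered page image.

    Each block is assumed to look like {"bbox": (x0, y0, x1, y1), "text": "..."}.
    """
    lines, used = [], 0
    for block in blocks:
        if used >= char_budget:
            break
        snippet = block["text"][: char_budget - used]
        x0, y0, _, _ = block["bbox"]
        lines.append(f"[{x0:.0f}, {y0:.0f}] {snippet}")
        used += len(snippet)
    return (
        "RAW_TEXT_START\n" + "\n".join(lines) + "\nRAW_TEXT_END\n"
        "Return the plain text representation of this document "
        "as if you were reading it naturally."
    )
```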
Data
Training set: olmOCR-mix-0225
Size: 102,825 unique PDFs / 258,641 pages total (Table 1).
Source breakdown:
- Web PDFs: 96,929 documents / 240,940 pages (drawn from internal crawl of over 240 million PDF documents).
- Internet Archive books: 5,896 documents / 17,701 pages.
Filtering and sampling:
- Remove non-English documents (Lingua language detector).
- Filter parsing failures, spam keywords (explicit list), fillable forms, and documents with insufficient extractable text.
- Sample up to 3 pages per PDF.
Document type distribution (Table 2, manual annotation):
- Academic papers: 55.9%
- Brochures: 11.2%
- Legal documents: 10.2%
- Books: 6.8%
- Table-heavy: 5.6%
- Diagram-heavy: 4.7%
- Slideshows: 1.9%
- Other: 3.7%
Synthetic labeling with GPT-4o
The teacher model (GPT-4o) receives prompts with rules including:
- Preserve natural reading order.
- Use LaTeX for equations, Markdown for tables.
- Remove headers/footers when appropriate.
- Handle handwritten annotations if present.
- Do not hallucinate content.
- Output null if no text is present.
The structured JSON schema captures language, rotation metadata, content flags, and the final natural_text field.
Algorithms / Training
Fine-tuning configuration
- Effective batch size: 4
- Optimizer: AdamW
- Learning rate: $1 \times 10^{-6}$
- Schedule: Cosine annealing
- Steps: 10,000 (approximately 1.2 epochs over 258k pages)
- Hardware: 8 $\times$ H100 80GB
- Runtime: 16 node-hours
- Context truncation: Training examples truncated to 8,192 tokens; loss masked to final response tokens only.
Inference-time prompt
The production prompt for olmOCR-7B-0225-preview is minimal:
- Page image
- Prior extracted raw text blocks (with RAW_TEXT_START/END delimiters)
- Instruction: “return the plain text representation of this document as if you were reading it naturally”
No explicit JSON schema enforcement at inference (though the model was trained to produce JSON).
Evaluation
olmOCR-Bench construction
Design philosophy: Deterministic unit tests with binary pass/fail outcomes, avoiding LLM-as-judge and fuzzy reference matching. The authors explicitly cite concerns about evaluator bias and non-reproducibility in document benchmarks.
Size: 1,402 PDFs / 7,010 tests.
Test categories (Table 3, selected examples):
- Text presence (TP): Fuzzy string matching to verify key content appears.
- Text absence (TA): Verify headers/footers are removed.
- Natural reading order (NR): Check relative ordering of text segments.
- Table accuracy (TT): Validate neighbor-cell constraints in Markdown/HTML tables (HTML required for rowspan/colspan).
- Math formula accuracy (MF): KaTeX-rendered symbol layout matching.
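To illustrate the deterministic pass/fail style of these checks, simplified versions of a presence, absence, and reading-order test follow; the benchmark's actual matching is fuzzier and more careful, and the function names and normalization here are illustrative.

```python
import re

def _norm(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

def passes_text_presence(ocr_text: str, target: str) -> bool:
    """Pass if the target string appears in the OCR output (whitespace-insensitive)."""
    return _norm(target) in _norm(ocr_text)

def passes_text_absence(ocr_text: str, target: str) -> bool:
    """Pass if a header/footer string was correctly removed from the output."""
    return not passes_text_presence(ocr_text, target)

def passes_reading_order(ocr_text: str, first: str, then: str) -> bool:
    """Pass if `first` occurs before `then` in the linearized output."""
    text = _norm(ocr_text)
    i, j = text.find(_norm(first)), text.find(_norm(then))
    return i != -1 and j != -1 and i < j
```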
Document type distribution in benchmark (Table 3):
- arXiv Math (AR): 2,927 formula tests
- Multi-Column (MC): 884 reading-order tests
- Tables (TT): 1,020 table structure tests
- Long Tiny Text (LTT): 213 tests
- Old Scans Math (OSM): 123 math tests
- Table Accuracy (TA): (additional table tests, count not specified in draft)
Benchmark results
Overall pass rate (95% CI, Table 4):
| System | Pass Rate |
|---|---|
| Marker v1.7.5 | 70.1% |
| MinerU v1.3.10 | 61.5% |
| Mistral OCR | 72.0% |
| GPT-4o (no anchor) | 68.9% |
| GPT-4o (anchored) | 69.9% |
| Qwen 2.5 VL (no anchor) | 65.5% |
| olmOCR v0.1.75 (anchored) | 75.5% |
Category-wise highlights:
- Anchored olmOCR leads on AR (74.9%), MC (78.3%), and LTT (73.3%).
- Math-heavy OSM and table-heavy TA categories show more competitive performance across baselines.
Cost comparison
Cost per million pages (Table 6, selected systems):
| System | Cost/M pages |
|---|---|
| GPT-4o API | \$12,480 |
| GPT-4o Batch | \$6,240 |
| Marker (H100) | \$1,484 |
| MinerU (L40S) | \$596 |
| Gemini Flash 2 Batch | \$249 |
| olmOCR (L40S) | \$176 |
| olmOCR (H100) | \$178 |
Assumptions: L40S at \$0.79/hr, H100 at \$2.69/hr, 12% retry rate for JSON parsing failures and repetition handling.
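The cost figures follow directly from the hourly GPU price and effective throughput; a small sketch of the arithmetic is below. The throughput value is a back-of-envelope placeholder inferred from the reported cost, not a number from the paper.

```python
def cost_per_million_pages(gpu_hourly_usd: float, raw_pages_per_hour: float,
                           retry_rate: float = 0.12) -> float:
    """USD to process 1M pages; retries inflate the work by `retry_rate`."""
    effective_pages_per_hour = raw_pages_per_hour / (1 + retry_rate)
    return 1_000_000 / effective_pages_per_hour * gpu_hourly_usd

# Placeholder throughput: roughly 4,500 effective pages/hour on an L40S at $0.79/hr
# lands near the reported $176/M pages.
print(round(cost_per_million_pages(0.79, 4_500 * 1.12)))   # ~176
```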
Downstream LM pretraining ablation
Setup: Continue pretraining OLMo-2-1124-7B for 50B tokens on PDFs linearized with olmOCR vs. Grobid+rules (peS2o baseline). Evaluate on standard benchmarks.
Results (Table 5):
- Baseline (Grobid+rules) average score: 53.9%
- olmOCR average score: 55.2%
- Improvement: +1.3 percentage points
This suggests OCR/linearization quality can measurably impact LM performance at the 50B-token pretraining scale.
Hardware / Production
Inference pipeline and robustness
Orchestration: The system uses SGLang for serving and chunks work into batches of approximately 500 pages, coordinated through shared cloud storage (S3).
Reliability heuristics:
- Prompt regeneration: If token count exceeds 8,192, reduce character budget for sampled text blocks and rebuild prompt.
- Retry on JSON parse failure: Re-run generation with adjusted parameters.
- Rotation correction: Use PDF metadata to detect and correct page orientation.
- Degenerate repetition mitigation: Retry with higher temperature ($\tau = 0.8$) if the model produces repetitive output; fall back to alternative strategies if retries fail.
The reported 12% retry rate indicates non-trivial overhead in production deployments, though the authors frame this as necessary for quality assurance.
Throughput estimates
The paper provides pages-per-hour and cost-per-page calculations based on measured token throughput on L40S and H100 hardware, factoring in the retry rate. Specific throughput numbers are embedded in the cost estimates (Table 6).
Implementation sketch
# Conceptual pipeline based on paper description
plaintext_pages = []
for page in pdf:
    blocks = pypdf.extract_text_and_images_with_bboxes(page)
    selected = sample_blocks_preferring_doc_start_end(blocks)
    prompt = build_prompt(page_image, selected)
    if prompt_tokens > 8192:
        selected = regenerate_selected_blocks_with_lower_char_budget()
        prompt = build_prompt(page_image, selected)
    output = vlm.generate(prompt)
    if parse_fail_or_repetition(output):
        output = retry_with_adjustments(prompt, tau=0.8)
    plaintext_pages.append(output)
return concatenate(plaintext_pages)
This matches the described use of pypdf for block extraction, selective sampling, prompt regeneration under token limits, and retry logic for robustness.
Notes and open questions
Observations:
- Document anchoring is essentially layout-aware retrieval: injecting PDF primitives (text + bboxes) into the VLM prompt. The technique is straightforward to implement but depends heavily on PDF parser quality and block selection policy.
- The benchmark design deliberately avoids LLM-judge evaluators, addressing real reproducibility concerns in document evaluation. The unit-test approach is deterministic but may miss nuanced quality issues that human evaluators would catch.
- The downstream pretraining ablation (+1.3 points at 50B tokens) provides evidence that OCR quality matters for LM training, at least in this experimental setup.
Open questions:
- Robustness beyond English academic PDFs: How do the reported gains generalize to document distributions that are not 56% academic papers, and to multilingual documents (given explicit English-only filtering)?
- Anchoring vs. model quality: How much of the Table 4 advantage comes from model weights vs. prompt engineering (anchoring) vs. post-processing heuristics (rotation, retries, truncation)? Ablations on GPT-4o and Qwen show modest anchoring gains (~1 point), but the fine-tuned model may benefit more.
- Long document failure modes: Sampling only 3 pages per PDF during training may leave gaps in handling complex multi-page structures (cross-references, continued tables, section numbering). Production use cases likely encounter these patterns frequently.
- Structured output overhead: The JSON schema adds tokens and introduces parsing failures (12% retry rate). olmOCR 2’s shift to plain-text generation suggests this was a recognized pain point.
NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models
Paper: NovaChart: A Large-scale Dataset towards Chart Understanding and Generation of Multimodal Large Language Models (ACM MM ‘24)
Code: Elucidator-V/NovaChart
Data: ympan/novachart
License: MIT (code), Apache-2.0 (dataset)
TL;DR
NovaChart is a chart-focused instruction-tuning dataset designed to improve MLLMs on both chart understanding and chart generation. The authors report 47K high-resolution chart images and 856K instruction-response pairs, spanning 18 chart types and 15 tasks, backed by per-chart metadata including data points, visual elements, source tables, and rendering code.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
- Headline contribution is a large-scale dataset + tooling for chart instruction tuning, including metadata designed for scalability.
Secondary: $\Psi_{\text{Method}}$ (data engine/pipeline) + $\Psi_{\text{Evaluation}}$ (task suite + metrics + model comparisons)
Rough superposition:
- $\approx 0.6\,\Psi_{\text{Resource}} + 0.25\,\Psi_{\text{Method}} + 0.15\,\Psi_{\text{Evaluation}}$
What is the motivation?
The authors identify three claimed limitations in existing chart datasets used for training chart-capable MLLMs:
Chart type coverage is narrow and imbalanced
- Many datasets focus on bar/line/pie; long-tail types (histograms, radar, word clouds, etc.) are underrepresented.
Task diversity is restricted
- Prior work often centers on extraction/QA/summarization; practical tasks like conditional extraction, visual element recognition, and chart-type conversion are missing or rare.
Scalability is limited by sparse annotations
- Many datasets provide only data points (sometimes colors), but not source tables and visualization code, which blocks easy task/instruction expansion via LLMs.
What is the novelty?
Dataset scale + coverage
47K chart images, 856K instruction-response pairs
18 chart types (explicitly enumerated in the paper’s overview figures/tables):
- Table (as a “chart type”), single/multi-class line plot, single/multi-class scatter plot, single/multi-class bar plot, univariate/bivariate histogram, correlation heatmap, pie, ring, rose, radar, box, sankey, knowledge graph, word cloud.
15 tasks, grouped into 4 capability buckets:
- Chart Data Understanding: data identification, data comparison, conditional data extraction, data referring
- Chart Visual Understanding: color recognition, style detection, chart classification, visual elements identification/retrieval, text extraction
- Chart Summarization and Analysis: chart pattern recognition, chart analysis, chart summarization
- Chart Generation: chart blueprint, chart type conversion, table-to-chart generation
“Scalable metadata” design
For each chart, NovaChart includes four kinds of annotations:
- Data points (numeric/statistical values shown)
- Visual elements (e.g., colors, style choices)
- Source data (the originating sub-table)
- Visualization code (Python rendering code; exception noted for knowledge graphs where code is unavailable)
Data generation engine (pipeline novelty)
They emphasize an end-to-end engine that produces: raw tables → curated subtables + stats → styled chart images + code → instruction-response pairs.
What experiments were performed?
Dataset comparisons
Compares NovaChart against ChartQA, PlotQA, Chart-to-text, SimChart9K, UniChart, MMC, ChartLlama.
Key table claims:
- NovaChart: 18 types, 47K images, 856K instruction pairs, 15 tasks, covering understanding + generation.
- Metadata comparison table shows NovaChart is the only listed dataset providing data points + visual elements + source data + visualization code (others lack source data/code).
Fine-tuning + evaluation
Fine-tune 3 open MLLMs:
- LLaVA-v1.5, InternLM-XComposer, Qwen-VL-Chat
Evaluate on an independent evaluation set spanning all 15 tasks.
Metrics (by task type):
- Exact Match (EM) for classification/QA
- RNSS for numerical results in data-referring tasks
- Levenshtein distance + SCRM for multi-point extraction
- GPT-Score for open-ended summarization/analysis and chart generation tasks
Reported outcomes
Across tasks, fine-tuning yields large relative improvements reported as 35.47%–619.47% (range across tasks/models).
Qualitative examples show improvements in:
- Correctly interpreting axes/distributions for analysis
- Producing executable chart conversion code (baseline sometimes outputs “not directly executable”).
What are the outcomes/limitations?
Outcomes the paper emphasizes
- The authors report gains across all 15 tasks after tuning, including generation tasks. Relative improvement ranges are reported as 35.47% to 619.47%, though baseline performance levels vary significantly by task.
- Chart classification and text extraction reportedly reach “excellent” performance in their evaluation setup.
- Improvements are claimed across chart types beyond common bar/line/pie charts, attributed to broader type coverage in the training data. Per-type breakdowns are not provided.
Limitations / open questions
Dataset design and coverage:
- Chart type distribution is not reported; balance across the 18 claimed types is unclear, which may affect model generalization.
- Knowledge graph charts explicitly lack visualization code, breaking the “full metadata” claim for at least one chart type.
- Source data for the 47K charts is not specified; provenance, diversity, and potential biases are unclear.
Evaluation concerns:
- Open-ended task evaluation relies on GPT-Score (LLM judge), introducing model-judge dependence and sensitivity to prompt formulation. No inter-rater reliability or correlation with human judgment is reported.
- The paper acknowledges improvements are “limited” for certain analysis and generation tasks, suggesting the dataset may not sufficiently address those capabilities.
- Accuracy drop magnitude when moving from extracted tables to predicted tables is not quantified for the fine-tuned models.
Reproducibility gaps:
- Critical hyperparameters (fine-tuning learning rates, batch sizes, epochs) are deferred to appendices not included in the provided excerpt.
- Chart-type-specific attribute rules and full metric definitions are also appendix-only.
- Training compute (GPU hours, hardware specs) is not reported.
Reproducibility Details
Model
NovaChart itself is a dataset, but the paper fine-tunes 3 MLLMs: LLaVA-v1.5, InternLM-XComposer, Qwen-VL-Chat. Hyperparameters and fine-tuning details are stated as being in Appendix 2 (not included in the visible content).
Data
Dataset size and structure:
- 47K chart images (high-resolution)
- 856K instruction-response pairs
- Built from:
- 1.3K raw tables (post-filtering)
- 28K curated sub-tables (“source data”) used to derive chart statistics
Chart types (18):
From the overview and distribution figure, the chart types include:
- Table
- Single-class and multi-class: line plot, scatter plot, bar plot
- Univariate and bivariate histogram
- Correlation heatmap
- Pie, ring, rose, radar
- Box plot
- Sankey
- Knowledge graph
- Word cloud
Tasks (15):
Grouped as:
- Chart Data Understanding: data identification; data comparison; conditional extraction; data referring
- Chart Visual Understanding: color recognition; style detection; chart classification; visual elements identification/retrieval; text extraction
- Chart Summarization and Analysis: pattern recognition; analysis; summarization
- Chart Generation: blueprint; chart type conversion; table-to-chart generation
Metadata fields (per chart):
The paper’s “chart metadata” is explicitly:
- Data points
- Visual elements
- Source data (sub-table)
- Visualization code (rendering code)
Algorithms / Training (Data Engine)
The paper describes a 4-stage pipeline:
1. Raw Data Acquisition
- Source: Kaggle relational tables with high user votes.
- Filtering/preprocess:
- Remove non-English tables
- Remove tables with too few rows (< 50)
- Remove unnamed columns
- Remove columns with too many missing values (> 90%)
- Remove columns with overly long contents (example: movie reviews)
2. Data Curation
- Goal: sample attributes + rows to produce many sub-tables and then compute chart statistics.
- Attribute type schema (used for choosing columns):
- Numeric: numeric dependent variables (e.g., y-axis)
- Unique-Numeric: numeric, mostly unique (e.g., year), usable as independent variable (x-axis for lines)
- Categorical: string with <= 5 unique values (class labels for multi-class charts)
- Enumerable: <= 25 unique values (qualitative variable for bar/pie, etc.)
- Attribute classification model:
- Uses GPT-3.5-turbo with in-context learning; the prompt includes the attribute-type definitions, the number of unique values, example values, and an instruction to classify the attribute.
- Sub-table sampling (see the sketch after the data-engine steps below):
- For each chart type: randomly select the required attribute types and sample 30–50 rows from the raw table
- If a raw table cannot yield a sub-table meeting requirements, skip it
- Compute chart statistics from the sampled sub-table to derive chart data points
3. Image Styling and Visualization
- Rendering libraries mentioned: Matplotlib, Seaborn, Pyecharts.
- Visual diversity: randomize visual elements (example: colors, shadow, style) and render multiple images per data-point instance.
- The paper states this stage initially yields ~40K charts, later expanded to 47K through the additions described below.
4. Instruction Formulation
- Task design: 15 tasks, including “new” ones the authors emphasize (conditional extraction; some visual element checks like fitting-curve existence in histograms; chart-to-chart conversions).
- LLM usage:
- For tasks with answers obtainable from metadata: use GPT-4 to generate diverse instruction templates and convert to instruction-following format.
- For more open-ended tasks (summarization/analysis): use GPT-3.5-turbo, providing a prompt with the task instruction, in-context demos, and the relevant chart metadata; the model output becomes the response.
Data expansion steps:
- Add line-chart source data from Statista due to shortage of suitable line-chart subtables from Kaggle.
- Add HTML/CSS-rendered tables as an additional “chart” type.
- Manually collect some chart types (explicitly mentioned: knowledge graphs and word clouds); knowledge-graph visualization code unavailable.
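A minimal sketch of the Stage 2 sub-table sampling rule described in the data-engine steps above. The function name, the `attr_types`/`required` dictionaries, and the skip-on-failure convention are hypothetical scaffolding, not the released pipeline.

```python
import random
import pandas as pd

def sample_subtable(table: pd.DataFrame, attr_types: dict, required: dict,
                    rng: random.Random):
    """Sample one sub-table for a chart type.

    attr_types: column name -> one of {"numeric", "unique-numeric",
                "categorical", "enumerable"} (e.g., from the GPT-3.5-turbo classifier).
    required:   attribute type -> number of columns that chart type needs.
    Returns None when the raw table cannot satisfy the requirements (skip it).
    """
    chosen = []
    for needed_type, count in required.items():
        candidates = [c for c, t in attr_types.items() if t == needed_type]
        if len(candidates) < count:
            return None                       # raw table cannot yield this chart type
        chosen += rng.sample(candidates, count)
    # raw tables with < 50 rows were filtered out, so 30-50 rows are always available
    n_rows = rng.randint(30, 50)
    return table[chosen].sample(n=n_rows, random_state=rng.randint(0, 2**31 - 1))
```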
Evaluation
Metrics and protocols:
- EM for classification/QA tasks, RNSS for numeric data-referring, Levenshtein + SCRM for multi-point extraction, GPT-Score for open-ended and generation tasks.
- Comparisons shown in figures:
- Per-task spider/radar charts before vs after tuning for the three models
- Per-chart-type performance plots on selected tasks (example shown for InternLM-XComposer).
Reported improvement magnitude:
- Across tasks: 35.47%–619.47% relative improvements after fine-tuning (paper-reported range).
Hardware / Production
The main text does not list GPU types, wall-clock training time, or serving throughput; it points to appendices for fine-tuning details.
Data Availability
Is all the data publicly shared?
They claim the dataset is publicly available, and they point to a public GitHub repo as the distribution entry point.
That said, whether “all the data” is released depends on which artifacts you count:
Released (public):
- GitHub repo (code + toolkit; also points to where to download the dataset): Elucidator-V/NovaChart
- Hugging Face dataset linked from the GitHub repo as the download location for the “full NovaChart dataset”: ympan/novachart
Potentially not fully released / ambiguous:
- The paper emphasizes rich chart metadata (data points, visual elements, source data, visualization code).
- But a public GitHub issue reports that only the instruction-tuning JSONL files were found and asks whether metadata will be released, suggesting at least some parts may be missing or not obvious in the current release.
Underlying raw sources are likely not fully redistributable as-is:
- Their pipeline uses Kaggle tables and additionally Statista tables for line charts.
- The paper doesn’t specify whether the original Kaggle/Statista tables are redistributed (and those sources typically have their own licenses/ToS).
Where is it shared?
- GitHub: Elucidator-V/NovaChart (paper and repo both point here)
- Hugging Face (dataset download): ympan/novachart, which contains a ~10 GB `novachartv1.rar` file
What license?
- GitHub repository license: marked as MIT on the repo page (this generally covers the code and repo contents, unless the repo says otherwise)
- Hugging Face dataset license: listed as Apache-2.0 on the dataset page/README metadata
- Paper itself: does not appear to state a dataset license in the main text; it only states availability and links
GOT-OCR2.0: Unified End-to-End OCR with General Optical Character Theory
GOT-OCR2.0 — Notes
TL;DR
GOT-OCR2.0 is a unified 580M-parameter encoder-decoder OCR model that treats diverse optical signals (plain text, formulas, tables, charts, sheet music, geometric shapes) as characters in a single unified space. The system introduces interactive region OCR via box/color prompts, dynamic resolution through multi-crop tiling, and multi-page OCR via decoder post-training. On the Fox benchmark for dense document OCR, GOT achieves edit distance 0.035 (EN) and 0.038 (ZH) with F1 scores of 0.972 and 0.980 respectively. High-quality OCR is supported only for English and Chinese.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new unified model architecture + three-stage training recipe + task unification framework).
Secondary: $\Psi_{\text{Resource}}$ (extensive synthetic data engines for formulas, molecules, tables, sheet music, geometry, and charts; open-source code and model release).
The paper introduces “General OCR Theory” (OCR-2.0) as a conceptual framework for treating diverse optical signals as characters, with a simple encoder-decoder architecture trained via multi-stage optimization. The resource component is substantial: the authors describe detailed synthetic data engines totaling $>$10M image-text pairs across stages, with open-source code and Apache 2.0 model weights provided.
What is the motivation?
- OCR-1.0 pipeline brittleness and cost: Traditional multi-module systems (detection/cropping/recognition) are prone to local optima, cascading errors, and high maintenance overhead.
- Task fragmentation: Different specialized models exist for text detection, scene OCR, formula recognition, table extraction, etc. Practitioners must choose among many models for different subtasks, increasing deployment complexity.
- Broadened “OCR” demand: The authors frame the target as “intelligent processing of man-made optical signals” beyond plain text—including formulas, tables, charts, sheet music, molecular structures, and geometric diagrams.
- Limited multilingual support in end-to-end systems: Prior unified OCR models often focus narrowly on English or require separate models for different languages.
What is the novelty?
General OCR Theory (“OCR-2.0”) framing: Treats diverse optical signals (plain text, mathematical notation, tables, charts, sheet music, molecular structures, geometric shapes) as “characters” in a unified space, aiming for a single end-to-end model that handles multiple OCR tasks.
Simple encoder-decoder OCR architecture: Vision encoder (VitDet base with local attention, $\sim$80M params) + linear connector + language decoder (Qwen-0.5B, total 580M params). The encoder compresses $1024 \times 1024$ images to $256 \times 1024$ tokens, with the decoder supporting up to 8K context length.
Feature extensions via decoder post-training: Fine-grained region OCR (box and color prompts), dynamic resolution (multi-crop sliding window with max 12 tiles following InternVL-1.5), and multi-page OCR are added in Stage 3 without modifying the vision encoder. This preserves encoder pretraining while expanding capabilities.
Staged training strategy with data mixing: Three-stage pipeline (encoder pretrain with tiny OPT-125M, joint training with Qwen-0.5B on formatted/general OCR, decoder post-train for features) with 80% mixing of previous stage data to reduce regression.
Extensive synthetic data engines: Stage 1 uses 5M pure text pairs from LAION/Wukong/PDFs; Stage 2 adds 1M formulas (arXiv LaTeX $\rightarrow$ Mathpix format, $>$20$\times$ faster rendering), 1M molecules (ChEMBL SMILES), 0.3M tables, 1.2M full-page formatted docs, 0.5M sheet music (GrandStaff + Verovio), 1M geometry (TikZ), and 2M charts (Matplotlib/Pyecharts); Stage 3 constructs 60w fine-grained, 50w multi-crop, and 20w multi-page samples (the paper counts in “w”, i.e., units of 10,000, so 600K/500K/200K).
What experiments were performed?
The authors evaluate across five OCR task families:
1. Plain document OCR: Fox benchmark (dense multi-page documents); word-level segmentation; metrics include edit distance, F1, precision, recall, BLEU, METEOR. Compares GOT (580M) against larger LVLMs.
2. Scene text OCR: 400 natural scene images (200 English, 200 Chinese) with manually corrected ground truth; character-level segmentation.
3. Formatted document OCR: 90 pages (English + Chinese) with Mathpix pseudo-labels manually corrected; evaluates single-scale vs multi-crop inference on formula and table metrics.
4. Fine-grained OCR: Box-guided and color-guided referential OCR evaluation in English and Chinese; compares GOT to Fox on region extraction accuracy.
5. General OCR: Chart OCR using ChartQA (structure-extraction version) and PlotQA benchmarks; reports AP@strict/slight/high for chart structure accuracy.
Test data filtering: The authors apply “strict text similarity filtering” to reduce overlap between training and test text, though details are not provided.
What are the outcomes/limitations?
Outcomes
Plain document OCR (Fox benchmark):
| Model | Lang | Edit Dist $\downarrow$ | F1 $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ |
|---|---|---|---|---|---|---|---|
| GOT | EN | 0.035 | 0.972 | 0.971 | 0.973 | 0.947 | 0.958 |
| GOT | ZH | 0.038 | 0.980 | 0.982 | 0.978 | 0.878 | 0.939 |
GOT achieves very strong dense-document OCR metrics. The paper reports that GOT compares favorably against larger LVLMs in its comparison table; baseline names, parameter sizes, and scores appear in the paper but are not transcribed in these notes, so only GOT's numbers are reproduced above.
Scene OCR (400 images):
| Model | Lang | Edit Dist $\downarrow$ | F1 $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ |
|---|---|---|---|---|---|---|---|
| GOT | EN | 0.112 | 0.926 | 0.934 | 0.927 | 0.676 | 0.896 |
| GOT | ZH | 0.096 | 0.928 | 0.914 | 0.954 | 0.641 | 0.928 |
Scene OCR edit distance is higher than document OCR (0.112 EN vs 0.035), reflecting the increased difficulty of natural scene text with perspective distortion, occlusion, and varied fonts.
Formatted document OCR (90 pages, single vs multi-crop):
| Setup | Formula F1 $\uparrow$ | Table METEOR $\uparrow$ |
|---|---|---|
| Single-scale | 0.749 | 0.760 |
| Multi-crop | 0.865 | 0.811 |
Multi-crop inference provides substantial gains for formula recognition (+11.6 F1 points) and table extraction (+5.1 METEOR points), demonstrating the effectiveness of dynamic resolution for high-detail regions.
Chart OCR:
| Dataset | AP@strict $\uparrow$ | AP@slight $\uparrow$ | AP@high $\uparrow$ |
|---|---|---|---|
| ChartQA-SE | 0.747 | 0.845 | 0.867 |
| PlotQA-SE | 0.133 | 0.596 | 0.640 |
GOT shows stronger performance on ChartQA (0.747 strict AP) compared to PlotQA (0.133 strict AP), suggesting better generalization to the ChartQA structure extraction task.
Fine-grained OCR: The authors report GOT outperforms Fox on both box-guided and color-guided referential OCR in English and Chinese (Table 4 in paper), though specific numeric metrics are not transcribed in the rough draft.
Limitations and open questions
Language coverage: The authors explicitly state they “mainly support English and Chinese” and “cannot guarantee OCR quality for other languages” even if some appear in crawled PDFs. Scaling to dozens of languages (Arabic, Cyrillic, Indic scripts, etc.) is not addressed. The single multilingual model design may face capacity constraints beyond 2-3 languages.
Geometry scope: The authors describe geometry rendering as “preliminary” and state the model “can only recognize basic geometry at present.” Complex geometric diagrams, 3D figures, and advanced TikZ constructs are likely beyond current capabilities.
Ablation gaps: The paper does not ablate the contribution of individual components:
- How much gain comes from VitDet local attention vs other encoders?
- What is the impact of the 80% data mixing strategy vs training on new data only?
- How much do synthetic data engines contribute vs real data?
No throughput or latency data: Training hardware is reported (64 L40s), but the paper omits inference latency (ms/page), throughput (pages/s), and cost-per-page estimates. This limits comparison to olmOCR ($176/M pages on L40S) or MinerU2.5 (1.224 pages/s on A100).
Test set construction and leakage: The authors mention “strict text similarity filtering” to reduce train/test overlap but do not specify the threshold, filtering method, or amount of data removed. External validation on established benchmarks beyond Fox and ChartQA is limited.
Multi-page OCR evaluation: The paper describes 20w multi-page training samples (2-8 pages each, <8K tokens total) but does not evaluate multi-page OCR performance separately. It is unclear how well page-breaking and cross-page references are handled compared to single-page inference.
Comparison baseline details: The paper compares GOT to “larger LVLMs” on Fox but omits model names, parameter sizes, and numeric results for baselines in the extracted text. This limits reproducibility of the comparison.
Model
Architecture
High-level design: Vision encoder + linear connector + language decoder.
Vision encoder:
- Backbone: VitDet (base variant), chosen for local attention mechanism to reduce compute on high-resolution images.
- Parameters: $\sim$80M.
- Tokenization: Final layers compress $1024 \times 1024 \times 3$ input image to $256 \times 1024$ image tokens.
- Projection: Linear layer ($1024 \times 768$) projects image tokens into language model dimension during encoder pretraining (described in Stage 1).
Language decoder:
- Stage 1 (encoder pretrain): OPT-125M serves as a “tiny decoder” to efficiently pass gradients to the encoder.
- Stage 2-3 (main model): Qwen-0.5B replaces OPT-125M for broader OCR-2.0 training.
- Context length: Supports up to 8K tokens (increased from 4K in Stage 1 to 6K in Stage 2 to 8K in Stage 3).
Total parameters: 580M (encoder ~80M + decoder ~500M).
Supported inputs and outputs
Inputs:
- Scene and document images (single-page or multi-page).
- Fine-grained region specification via bounding box coordinates (normalized $\times 1000$) or color-coded frames (red/green/blue).
- Multi-crop tiling for ultra-high-resolution documents (max 12 tiles, $1024 \times 1024$ per tile, InternVL-1.5 cropping strategy).
Outputs:
- Plain text: Character sequences for scene and document OCR.
- Formatted outputs: Mathpix-markdown (formulas, tables), TikZ (geometry), SMILES (molecules), custom notation (sheet music, charts) via prompting.
Aspect ratio handling: Input images of various shapes are resized to $1024 \times 1024$ via a “compromise” strategy (details not specified in paper).
Design rationale
VitDet local attention: Reduces computational cost on high-resolution images compared to global attention mechanisms. The authors note this is critical for $1024 \times 1024$ inputs.
High compression ratio: a $1024 \times 1024$ input is reduced to 256 image tokens (of dimension 1024), a $\sim$16$\times$ reduction relative to the 4,096 patches a standard 16-pixel patch grid would produce, enabling efficient processing of dense document pages.
Decoder post-training for features: Stage 3 freezes the vision encoder and only trains the decoder on fine-grained, multi-crop, and multi-page data. This design preserves encoder pretraining while adding capabilities, reducing compute cost vs end-to-end retraining.
Contrast to olmOCR2: GOT uses VitDet with local attention for high-resolution processing, while olmOCR2 employs SigLIP with global attention and dynamic tiling. GOT compresses to 256 tokens per image vs olmOCR2’s variable token count (up to 1984 tokens for high-detail regions).
Contrast to Nougat: Nougat uses Swin Transformer encoder with academic document specialization, while GOT adopts VitDet for broader “OCR-2.0” coverage including charts, sheet music, and geometry. GOT’s 8K context supports multi-page OCR, while Nougat processes single pages.
Data
Stage 1: Pure text recognition (encoder pretraining)
Total: $\sim$5M image-text pairs (3M scene OCR + 2M document OCR).
Scene OCR sources (3M):
- Image sources: LAION (English), Wukong (Chinese).
- Pseudo ground truth: PaddleOCR extracts text from images.
- Processing:
- Remove bounding boxes and concatenate text top-to-bottom, left-to-right.
- Crop text regions into slices, yielding additional $\sim$1M slice pairs.
Document OCR sources (2M):
- PDFs: Collected from Common Crawl.
- Text extraction: Fitz library extracts text.
- Outputs: $\sim$1.2M full-page pairs + 0.8M line/paragraph slice data.
Preprocessing: Images of various aspect ratios are resized to $1024 \times 1024$ via a “compromise” strategy (details not specified).
Stage 2: Formatted OCR + general characters
Formatted data sources:
Formulas (1M):
- Source: arXiv LaTeX files.
- Processing: Extract formula fragments, convert to Mathpix-markdown format.
- Rendering: Mathpix-markdown-it library (HTML $\rightarrow$ SVG $\rightarrow$ PNG), claimed $>$20$\times$ faster than LaTeX rendering.
Molecules (1M):
- Source: ChEMBL_25 dataset with 2M SMILES strings.
- Processing: Produce $\sim$1M molecular structure image-text pairs.
- Rendering: Mathpix-markdown-it + rdkit.Chem library.
Tables (0.3M):
- Source: LaTeX table code.
- Rendering: LaTeX rendering (authors prefer this over Mathpix-markdown-it for tables).
Full-page formatted documents (1.2M):
- English: $\sim$0.5M markdown PDF-text pairs via Nougat method, converted to Mathpix format.
- Chinese: $\sim$0.5M markdown pairs via Vary-style processing, converted to Mathpix format.
- In-house labeled: $\sim$0.2M Mathpix-labeled data from books, papers, and financial reports.
General OCR sources:
Sheet music (0.5M):
- Source: GrandStaff dataset.
- Additional rendering: Verovio library.
- Total: $\sim$0.5M samples after rendering.
Geometry (1M):
- Source: TikZ text outputs.
- Total: $\sim$1M geometric TikZ data.
Charts (2M):
- Rendering: Matplotlib and Pyecharts libraries.
- Composition: 1M Matplotlib + 1M Pyecharts chart image-text pairs.
Stage 3: Feature-specific data engines
Fine-grained OCR (60w samples):
- Scene sources: RCTW, ReCTS, ShopSign, COCO-Text for natural fine-grained OCR.
- Document sources: Parse PDFs (filter scanned format), use Fitz/PDFminer to record page images, line/paragraph boxes, and text.
- Coordinate normalization: Bounding boxes normalized then multiplied by 1000.
- Color-guided variant: Red/green/blue frame colors drawn on image to specify region.
Ultra-large images via multi-crop (50w pairs):
- Tiling strategy: Sliding window $1024 \times 1024$, InternVL-1.5 cropping method, max 12 tiles.
- Synthetic data: Horizontal and vertical stitching of single-page PDFs to create ultra-large documents.
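A hedged sketch of the InternVL-1.5-style dynamic tiling described above (sliding $1024 \times 1024$ window, at most 12 tiles). The grid-selection rule and the omission of a thumbnail tile are simplifications; the paper's exact cropping code may differ.

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 1024, max_tiles: int = 12):
    """Split an ultra-large page into <= max_tiles crops of tile x tile pixels."""
    w, h = img.size
    aspect = w / h
    # candidate (cols, rows) grids whose tile count stays within the budget
    grids = [(c, r) for c in range(1, max_tiles + 1)
             for r in range(1, max_tiles + 1) if c * r <= max_tiles]
    # pick the grid whose aspect ratio is closest to the input image's
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - aspect))
    resized = img.resize((cols * tile, rows * tile))
    return [resized.crop((x * tile, y * tile, (x + 1) * tile, (y + 1) * tile))
            for y in range(rows) for x in range(cols)]
```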
Multi-page OCR (20w pairs):
- Motivation: Page-breaking is difficult for some formatted PDFs (e.g., arXiv LaTeX); training on multi-page pairs directly improves cross-page handling.
- Construction: Sample 2-8 pages; each page <650 tokens; ensure total length <8K context.
- Language mixing: Often mixes Chinese and English pages in single multi-page samples.
Data quality and leakage control
Pseudo-labeling quality: Stage 1 uses PaddleOCR for pseudo ground truth. Stage 2 formatted data uses Mathpix-markdown-it rendering and manual correction for evaluation sets. The quality of pseudo-labels is not quantified.
Test set filtering: The authors apply “strict text similarity filtering” to reduce overlap between training and test text, but do not specify the threshold, filtering method, or amount of data removed.
Language coverage: Training data primarily covers English and Chinese. The authors state they cannot guarantee OCR quality for other languages even if some appear in crawled PDFs.
Algorithms / Training
Three-stage training strategy
Stage 1: Encoder pretraining (pure text recognition)
- Objective: Pretrain vision encoder using tiny OPT-125M decoder for efficient gradient propagation.
- Data: 5M pure text pairs (scene + document OCR).
- Optimization: Global batch size 128; 3 epochs; AdamW optimizer; cosine annealing schedule; starting LR $1e{-}4$; max token length 4096.
Stage 2: Joint training (formatted + general OCR)
- Objective: Connect encoder to Qwen-0.5B and train on broader OCR-2.0 “characters” (formulas, charts, geometry, sheet music, molecules, tables).
- Data: Stage 2 formatted/general data (1M formulas, 1M molecules, 0.3M tables, 1.2M full-page, 0.5M sheet music, 1M geometry, 2M charts) + 80% of Stage 1 data (4M pure text samples).
- Optimization: 1 epoch; max token length 6000; same optimizer settings as Stage 1 (AdamW, cosine annealing, LR $1e{-}4$).
Stage 3: Decoder post-training (feature extensions)
- Objective: Freeze vision encoder; train decoder on fine-grained, multi-crop, and multi-page OCR capabilities.
- Data: 60w fine-grained samples + 50w multi-crop pairs + 20w multi-page pairs + 80% of Stage 1-2 data.
- Optimization: 1 epoch; max token length 8192; starting LR $2e{-}5$ (lower than Stage 1-2).
Data mixing strategy
80% mixing across stages: During each stage, the authors sample 80% of previous stage(s) data to reduce catastrophic forgetting when adding new features or data types. For example:
- Stage 2: 80% of Stage 1 pure text data (4M samples) + 100% of Stage 2 formatted/general data.
- Stage 3: 80% of Stage 1-2 data + 100% of Stage 3 feature-specific data.
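A minimal sketch of the mixing rule, assuming each stage's data is available as an in-memory list of samples; this is illustrative scaffolding, not GOT's training code.

```python
import random

def build_stage_mix(new_data, previous_stages, ratio=0.8, seed=0):
    """Keep all new-stage samples and re-sample `ratio` of every earlier stage."""
    rng = random.Random(seed)
    mixed = list(new_data)
    for stage_samples in previous_stages:
        k = int(ratio * len(stage_samples))
        mixed.extend(rng.sample(stage_samples, k))
    rng.shuffle(mixed)
    return mixed

# e.g., Stage 2 training set = all formatted/general data + 80% of Stage 1 data
# stage2_mix = build_stage_mix(stage2_data, [stage1_data])
```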
Hardware and compute
Training hardware: 64 L40s GPUs (described as “8$\times$8 L40s”).
Training time: Not reported in the paper.
Throughput: Not reported in the paper.
Evaluation
Plain document OCR (Fox benchmark)
The Fox benchmark evaluates dense multi-page document OCR with word-level segmentation. Metrics include edit distance (character-level), F1/precision/recall (word-level), BLEU, and METEOR.
| Model | Lang | Edit Dist $\downarrow$ | F1 $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ |
|---|---|---|---|---|---|---|---|
| GOT (580M) | EN | 0.035 | 0.972 | 0.971 | 0.973 | 0.947 | 0.958 |
| GOT (580M) | ZH | 0.038 | 0.980 | 0.982 | 0.978 | 0.878 | 0.939 |
The paper includes a comparison table with larger LVLMs but the rough draft does not transcribe baseline names or numeric results. GOT is reported to achieve competitive or superior performance despite its smaller size (580M).
Scene text OCR (400 images)
A custom evaluation set of 400 natural scene images (200 English, 200 Chinese) with manually corrected ground truth. Character-level segmentation is used for metrics.
| Model | Lang | Edit Dist $\downarrow$ | F1 $\uparrow$ | Precision $\uparrow$ | Recall $\uparrow$ | BLEU $\uparrow$ | METEOR $\uparrow$ |
|---|---|---|---|---|---|---|---|
| GOT (580M) | EN | 0.112 | 0.926 | 0.934 | 0.927 | 0.676 | 0.896 |
| GOT (580M) | ZH | 0.096 | 0.928 | 0.914 | 0.954 | 0.641 | 0.928 |
Scene OCR edit distance is 3$\times$ higher than Fox document OCR (0.112 vs 0.035 for English), reflecting the increased difficulty of natural scene text with perspective distortion, occlusion, complex backgrounds, and varied fonts.
Formatted document OCR (90 pages)
A custom evaluation set of 90 pages (English + Chinese) with Mathpix pseudo-labels manually corrected. Compares single-scale ($1024 \times 1024$) vs multi-crop inference.
| Inference Mode | Formula F1 $\uparrow$ | Table METEOR $\uparrow$ |
|---|---|---|
| Single-scale | 0.749 | 0.760 |
| Multi-crop (max 12 tiles) | 0.865 | 0.811 |
Multi-crop inference provides +11.6 formula F1 points and +5.1 table METEOR points. This demonstrates the effectiveness of dynamic resolution for high-detail regions (small fonts, dense tables, complex formulas).
Fine-grained OCR (box- and color-guided)
The paper evaluates referential OCR using bounding box coordinates and color-coded frames in English and Chinese. Results are compared to Fox.
The rough draft notes “GOT is reported as better than Fox on both region and color referential OCR in both languages (see Table 4)” but does not transcribe specific numeric metrics.
Chart OCR (ChartQA-SE and PlotQA-SE)
Structure extraction variants of ChartQA and PlotQA benchmarks. Metrics are Average Precision at strict/slight/high thresholds.
| Dataset | AP@strict $\uparrow$ | AP@slight $\uparrow$ | AP@high $\uparrow$ |
|---|---|---|---|
| ChartQA-SE | 0.747 | 0.845 | 0.867 |
| PlotQA-SE | 0.133 | 0.596 | 0.640 |
GOT shows substantially stronger performance on ChartQA (0.747 strict AP) compared to PlotQA (0.133 strict AP). This gap suggests dataset-specific generalization—ChartQA may have chart types or structure patterns more similar to GOT’s training data (2M Matplotlib/Pyecharts charts).
Metric definitions
Edit distance: Levenshtein distance at character level, normalized by ground-truth length. Lower is better.
F1/Precision/Recall: Word-level (Fox document OCR) or character-level (scene OCR) token matching. Higher is better.
BLEU: Bilingual Evaluation Understudy score, n-gram overlap between prediction and reference. Higher is better.
METEOR: Metric for Evaluation of Translation with Explicit ORdering, incorporates synonyms and paraphrasing. Higher is better.
AP@strict/slight/high: Average Precision at different IoU or tolerance thresholds for chart structure extraction. Higher is better.
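For reference, a self-contained implementation of the normalized character-level edit distance used above (Levenshtein distance divided by the ground-truth length). This is the standard formulation, not code taken from the paper.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Character-level Levenshtein distance normalized by the reference length."""
    m, n = len(pred), len(ref)
    if n == 0:
        return float(m > 0)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / n
```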
Hardware / Production
Training infrastructure
Hardware: 64 L40s GPUs (described as “8$\times$8 L40s” setup).
Training time: NR (not reported in the paper).
Throughput: NR (pages/second or tokens/second not reported).
Cost: NR (no cost estimates provided).
Inference performance
Latency: NR (milliseconds per page not reported).
Throughput: NR (pages/second not reported).
Cost per page: NR (no cost estimates provided for inference).
Hardware requirements: The paper does not specify minimum GPU memory or recommended inference hardware.
Deployment
Model availability: Open-source weights released on HuggingFace (stepfun-ai/GOT-OCR2_0).
License: Apache 2.0.
Code: GitHub repository (Ucas-HaoranWei/GOT-OCR2.0) provides inference scripts and examples.
Serving infrastructure: The paper does not describe production serving infrastructure, batching strategies, or optimization techniques for deployment.
Notes and open questions
Observations:
- GOT’s “OCR-2.0” framing is ambitious—treating formulas, charts, sheet music, geometry, and molecules as “characters” in a unified space. The single-model approach is simpler than multi-module pipelines, but the capacity constraints of a 580M model may limit depth in each task.
- The three-stage training strategy with 80% data mixing is a practical approach to reducing catastrophic forgetting. However, the paper does not ablate the mixing ratio or compare to alternative continual learning strategies.
- Multi-crop inference provides substantial gains for formatted documents (+11.6 formula F1, +5.1 table METEOR), validating the dynamic resolution design. The max 12 tiles limit balances detail vs context length constraints (8K tokens).
- Scene OCR edit distance is 3$\times$ higher than document OCR (0.112 vs 0.035 for English), suggesting GOT is optimized primarily for document-style inputs. Natural scene text with perspective distortion and complex backgrounds remains challenging.
Open questions:
- Ablation studies: What is the contribution of VitDet local attention vs other encoders (e.g., SigLIP, Swin)? How much gain comes from the 80% data mixing strategy vs training on new data only? How do synthetic data engines compare to real data?
- Geometry and sheet music quality: The authors state geometry recognition is “preliminary” and limited to “basic geometry.” How much of the 1M geometric TikZ training data does the model successfully learn? What is the error rate on GrandStaff sheet music notation?
- Multilingual scaling: GOT supports only English and Chinese with high quality. How would the 580M parameter budget scale to dozens of languages (Arabic, Cyrillic, Indic scripts, etc.)? Would capacity constraints require larger models or ensemble architectures?
- Inference efficiency: No latency or throughput data is reported. How does GOT compare to olmOCR ($176/M pages on L40S) or MinerU2.5 (1.224 pages/s on A100) in cost and speed?
- Test set leakage: The “strict text similarity filtering” method is not detailed. What threshold is used? How much test data is removed? External validation on held-out benchmarks (e.g., olmOCR-Bench, DocVQA) would strengthen confidence in generalization.
- Multi-page OCR evaluation: The paper describes 20w multi-page training samples but does not evaluate multi-page performance separately. How well does GOT handle page-breaking, cross-page references, and consistent formatting across pages compared to single-page inference?
- Comparison baseline details: The Fox benchmark comparison table in the paper includes larger LVLMs, but the rough draft omits model names, parameter sizes, and numeric results. Access to the full table would enable better positioning of GOT’s performance.
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Paper: Kim et al., SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials (arXiv:2405.00021v3, NAACL 2025 Findings)
Code: sangwu99/Simplot
License: No explicit license stated in the paper
TL;DR
SIMPLOT improves chart-to-table extraction (and downstream chart QA) by (1) training on essential-only “simple charts” to avoid chart noise, (2) distilling an encoder so original charts map into the simple-chart representation space, (3) adding row/column rendering supervision, and (4) prompting an LMM with a human-oriented chart instruction so it uses both the extracted table + original image at inference. Reported gains include better table extraction on ChartQA (RDF1) and higher QA relaxed accuracy than Deplot in their table-plus-LMM setting.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$
- Main contribution is an end-to-end method/training recipe (preprocess + 2-phase training + inference prompting) to improve chart-to-table extraction and chart QA.
Secondary: $\Psi_{\text{Evaluation}}$
- Proposes a modified metric (Relative Distance, RD) derived from RMS to better evaluate chart-to-table extraction when header strings differ but reasoning is unaffected.
What is the motivation?
- Prior “chart-to-table then LLM reasoning” approaches (e.g., Deplot) struggle because charts mix essential and irrelevant information, and this “noise” degrades table extraction (examples include missed units and inclusion of irrelevant text like attribution).
- Table-only reasoning also misses purely visual questions that require chart geometry or visual attributes (e.g., “third bar from the top”, “what year does the orange line represent?”).
What is the novelty?
1) Essential-only training signal via “simple chart generation”
- Uses existing datasets’ chart-to-CSV table annotations to re-plot a simplified chart (Matplotlib) from the ground-truth table, excluding irrelevant chart artifacts; training on these “simple charts” improves extraction.
2) Row-column rendering (explicit supervision)
- Renders row/column header info onto the image during training (since ground-truth table exists).
- For inference, since ground-truth is unavailable, an LMM is used to extract row/column text from the chart and render it onto the image (Appendix B with an illustrative example).
3) Two-phase distillation to map original charts into the “simple chart” representation space
- Phase 1: train a teacher encoder + table decoder using essential-only (simple chart) pairs.
- Phase 2: train a student encoder on original charts with a triplet loss (anchor = original chart; positive = simple chart; negative = value-shuffled simple chart) plus a table cross-entropy loss, to generate accurate tables from original charts.
4) Human-oriented chart instruction prompt (for LMM reasoning using image + table)
- A chart-specific prompt that encodes stepwise “how humans interpret charts” (universal rules + chart-type-specific rules for bar/line/pie). It is used so the LMM can resolve questions that require aligning the predicted table with chart positions and visual cues.
What experiments were performed?
Benchmarks / tasks
- ChartQA: pie/bar/line charts; QA includes human-authored + LLM-augmented questions; evaluation reports Relaxed Accuracy on 2,500 test questions.
- PlotQA: dot line/line/bar; large synthetic QA set; they sample 10% of images for training/inference to reduce cost.
- Additional evaluations discussed: OpenCQA (open-ended chart QA; BLEU) and MultiChartQA (comparative/sequential multi-chart reasoning).
Comparisons / baselines (high level)
- Vision-language pretraining baselines and supervised chart models are compared for ChartQA QA. Table-based reasoning baselines include Deplot and Unichart (table extraction variant), with GPT-4V used for reasoning in the “extract table then LMM” setting for fairness.
Key ablations
- Impact of (a) row-column rendering and (b) distillation from simple charts on table extraction, plus (c) the human-oriented prompt on QA.
- Effect of using image + table vs table-only for Unichart/Deplot reasoning and for failure cases where both methods have low table extraction quality.
- Negative-sample (triplet) benefit is evaluated (small but positive).
What are the outcomes and limitations?
Outcomes (reported)
- ChartQA table extraction (RDF1): SIMPLOT reported higher overall RDF1 than Deplot and UniChart in their setup.
- ChartQA QA (RA): In the “table + LMM” category, SIMPLOT reports substantially higher overall relaxed accuracy than Deplot in their table-plus-LMM comparison table.
- Ablations: row/column rendering and simple-chart distillation each improve table extraction; adding the human-oriented prompt improves QA.
- Harder questions: when questions require referencing multiple rows/columns, SIMPLOT’s reported gap vs a “Deplot + image + prompt” baseline increases, supporting the claim that table extraction quality matters more under harder compositional queries.
Limitations / failure modes (reported)
- Expected reduced performance on unseen OOD charts; they argue one round of table-extraction training can adapt to OOD better than per-task training, but still flag OOD as future work.
- Practical error cases: row/column extraction can split entities due to line breaks; LMM can fail at fine-grained color identification (example: confusing orange vs red).
- Misuse risk: high-performing chart interpretation could be used to mislead; authors recommend critical verification.
Model
Backbone and components
- Built on Deplot-style chart-to-table: an image encoder and a text/table decoder (they follow Deplot and use ViT encoders).
- Naming: chart encoder $E_{\text{chart}}$ and table decoder $D_{\text{table}}$; Phase 1 produces a teacher encoder $E^{\text{teacher}}_{\text{chart}}$; Phase 2 trains a student encoder $E^{\text{student}}_{\text{chart}}$.
Tokenization / table linearization details
- Table sequences use `|` to separate cells and `<0x0A>` as the line break, following Deplot.
Data
Training inputs created by preprocessing
- Simple charts: generated by re-plotting from ground-truth CSV tables using Matplotlib; intended to contain only “essential” information needed for reasoning.
- Row-column rendering: in training, rows/columns from the ground-truth table are rendered onto the paired chart image; in inference, an LMM extracts the row/column strings for rendering.
Dataset stats (as reported)
- ChartQA split counts are provided (train/val/test counts by chart type; 2,500 QA pairs in test).
- PlotQA is very large; they sample 10% of PlotQA images stratified by type for training/inference, and use one QA pair per image in their reduced setting.
Data release
- Not explicitly released. The authors do not describe releasing a new SIMPLOT dataset or any additional data artifacts; instead, they use existing public datasets (ChartQA and PlotQA) and note those are “publicly accessible for research purposes.”
- They describe generating “simple charts” offline from the datasets’ existing CSV tables, which suggests their pipeline can produce derived artifacts, but they don’t state they publish those derived data outputs.
Algorithms / Training
Preprocessing stage (offline)
- Simple Chart Generation: re-render chart from CSV table using Matplotlib.
- Row-Column Rendering: render row/column header strings onto images; inference uses an LMM to extract these headers.
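A minimal sketch of the simple-chart-generation step, assuming the ground-truth table is a CSV whose first column holds categories and whose remaining columns are numeric series; the line-chart rendering and file handling are illustrative, not SIMPLOT's released preprocessing code.

```python
import pandas as pd
import matplotlib.pyplot as plt

def render_simple_chart(csv_path: str, out_path: str) -> None:
    """Re-plot a chart from its ground-truth table, keeping only essential content."""
    table = pd.read_csv(csv_path)
    categories = table.iloc[:, 0]
    fig, ax = plt.subplots()
    for column in table.columns[1:]:
        ax.plot(categories, table[column], marker="o", label=column)
    ax.set_xlabel(table.columns[0])
    ax.legend()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
```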
Training stage: chart-to-table extraction
Phase 1 (teacher): fine-tune Deplot backbone on (simple chart, table) pairs.
- Output: teacher encoder representations focus on essential info; $D_{\text{table}}$ and an FC layer are later frozen.
Phase 2 (student): train $E^{\text{student}}_{\text{chart}}$ on original charts to match teacher-space outputs.
- Define original chart $A$, simplified positive $P$, negative $N$ (shuffled values).
- Representations:
- $z_a = D_{\text{table}}(E^{\text{student}}_{\text{chart}}(A))$
- $z_p = D_{\text{table}}(E^{\text{teacher}}_{\text{chart}}(P))$
- $z_n = D_{\text{table}}(E^{\text{teacher}}_{\text{chart}}(N))$
- Triplet loss: $$ L_{\text{triplet}}(A,P,N)=\max\{d(z_a,z_p)-d(z_a,z_n)+m,\,0\} $$ with $d$ as $\ell_2$ distance and margin $m$.
- Table generation cross-entropy $L_{\text{table}}$ over the linearized table tokens, and final loss: $$ L_{\text{final}}=\lambda L_{\text{triplet}}+(1-\lambda)L_{\text{table}} $$ with $\lambda=0.1$ reported.
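A PyTorch sketch of the Phase 2 objective, combining the triplet loss over decoder-space representations with the table cross-entropy; tensor shapes and the margin value are assumptions (the notes do not transcribe $m$).

```python
import torch
import torch.nn.functional as F

def simplot_phase2_loss(z_a, z_p, z_n, table_logits, table_targets,
                        margin=1.0, lam=0.1):
    """z_a/z_p/z_n: decoder representations of anchor (original chart),
    positive (simple chart), and negative (value-shuffled simple chart).
    table_logits: (B, T, V) token logits; table_targets: (B, T) token ids."""
    d_ap = (z_a - z_p).pow(2).sum(dim=-1).sqrt()          # l2 distances
    d_an = (z_a - z_n).pow(2).sum(dim=-1).sqrt()
    l_triplet = torch.clamp(d_ap - d_an + margin, min=0).mean()
    l_table = F.cross_entropy(table_logits.flatten(0, 1), table_targets.flatten())
    return lam * l_triplet + (1 - lam) * l_table          # lambda = 0.1 as reported
```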
Inference stage: reasoning
- LMM receives (original chart image + extracted table), plus the human-oriented chart instruction prompt so it can answer both numeric table queries and visual-attribute queries.
Evaluation
Metrics
- Chart QA: Relaxed Accuracy (RA) on 2,500 ChartQA test questions.
- Chart-to-table extraction: they discuss RMS and introduce Relative Distance (RD) with RDF1 as the harmonic mean of precision/recall using numeric relative distance after minimal-cost matching over header strings.
- Open-ended QA: BLEU for OpenCQA comparisons.
Notable reported result patterns (qualitative)
- Table-based reasoning (table + LMM) generally outperforms image-only supervised models on ChartQA in their comparison, and SIMPLOT is reported best within their table-based group.
- Adding the image at inference helps table-based methods answer questions that need positional/visual interpretation, and the prompt further improves this.
Hardware / Production
- GPU: NVIDIA RTX A6000.
- Training time per epoch: Phase 1 about 2 hours/epoch, Phase 2 about 4 hours/epoch; they train Phase 1 for 7 epochs and Phase 2 for 9 epochs.
- Parameter count: 374M parameters (SIMPLOT model); GPT-4 parameter count is unknown (as noted).
- LMM prompting: temperature 0.1 is mentioned.
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
Paper: TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
Code: X-PLUG/mPLUG-DocOwl
Data: mPLUG/TinyChartData
License: Apache-2.0
TL;DR
TinyChart is a 3B-parameter multimodal LLM for chart understanding that targets efficiency via (1) Visual Token Merging inside the vision transformer to reduce high-res visual sequence length, and (2) Program-of-Thoughts (PoT) learning to offload numeric reasoning into executable Python programs. Authors report strong results on ChartQA and other chart benchmarks, plus higher inference throughput than larger (7B–13B) chart-focused MLLMs.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$
- Introduces an end-to-end model recipe combining visual token merging in ViT + PoT learning and validates via ablations and benchmark comparisons.
- Secondary: $\Psi_{\text{Resource}}$
- Builds ChartQA-PoT (PoT-augmented supervision) and describes its construction and statistics.
- Secondary: $\Psi_{\text{Evaluation}}$ (light)
- Emphasizes efficiency metrics (throughput, OOM behavior at high resolution) alongside accuracy.
What is the motivation?
Chart understanding requires: (a) robust OCR/text retrieval, (b) numerical reasoning, and (c) high-resolution vision encoding. Authors identify three constraints of existing chart MLLMs:
- Large parameter counts hinder deployment/training (13B chart models need substantial VRAM).
- Numerical errors are common for calculation-heavy questions.
- High-resolution images produce long ViT token sequences that are expensive for the LLM to process.
What is the novelty?
Visual Token Merging inside the ViT
- Observation: Charts contain large regions of blank space / uniform color blocks, yielding many similar patches.
- Mechanism: In each transformer layer, tokens are split into two disjoint sets, then bipartite matching selects the top-$r$ most similar token pairs across the sets (similarity = cosine similarity between self-attention key vectors) and merges each matched pair of features by average pooling.
- Not limited to spatial neighbors: non-adjacent tokens can merge if similar and in opposite sets.
- Proportional attention fix: Since merging reduces multiplicity of a feature, they add a patch-count term $s$ into attention:
$$ \text{Attention}=\text{softmax}\left(\frac{QK^\top}{\sqrt{d}}+\log s\right)V $$
where $s$ is the number of original patches represented by the merged token.
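A sketch of the merging mechanism for a single (unbatched) token sequence, together with the proportional-attention bias; it follows the description above but is not the authors' implementation, and the even/odd set split and single-head attention are simplifications.

```python
import torch
import torch.nn.functional as F
from collections import defaultdict

def merge_tokens(x, keys, sizes, r):
    """Merge the r most similar (A, B) token pairs.

    x: (N, C) token features; keys: (N, C) self-attention keys used for similarity;
    sizes: (N,) float count of original patches per token; r: pairs to merge.
    """
    idx_a = list(range(0, x.size(0), 2))          # set A: even-indexed tokens
    idx_b = list(range(1, x.size(0), 2))          # set B: odd-indexed tokens
    sim = F.normalize(keys[idx_a], dim=-1) @ F.normalize(keys[idx_b], dim=-1).T
    best_sim, best_b = sim.max(dim=1)             # best B partner for each A token
    picked = best_sim.argsort(descending=True)[:r].tolist()

    groups = defaultdict(list)                    # B destination -> merged A tokens
    for i in picked:
        groups[best_b[i].item()].append(i)

    merged_x, merged_s = [], []
    for j, a_list in groups.items():
        members = [idx_b[j]] + [idx_a[i] for i in a_list]
        w = sizes[members]
        merged_x.append((x[members] * w[:, None]).sum(0) / w.sum())  # average pooling
        merged_s.append(w.sum())

    kept = [idx_a[i] for i in range(len(idx_a)) if i not in set(picked)]
    kept += [idx_b[j] for j in range(len(idx_b)) if j not in groups]
    new_x = torch.stack([x[i] for i in kept] + merged_x)
    new_s = torch.stack([sizes[i] for i in kept] + merged_s)
    return new_x, new_s

def proportional_attention(q, k, v, sizes):
    """Single-head attention with the +log(s) bias so merged tokens keep their weight."""
    scores = q @ k.T / q.size(-1) ** 0.5 + torch.log(sizes)[None, :]
    return torch.softmax(scores, dim=-1) @ v
```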
Program-of-Thoughts learning for numeric chart QA
- Instead of learning arithmetic implicitly, model is trained to output executable Python assignment statements (with comments), then a Python interpreter produces the final numeric answer.
- Authors construct ChartQA-PoT from ChartQA training split using two pipelines:
- Template-based PoT: 40 question templates (from PlotQA) + manually written numpy-style code templates; placeholders filled from chart tables, then rule-based filtering removes bad fills.
- GPT-based PoT: gpt-3.5-turbo generates PoT code from (question + chart data table text); they execute generated code and keep only samples where execution succeeds and matches the annotated ChartQA answer.
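A hypothetical (question, PoT) pair in the format described above: assignment statements, a one-line comment before each, and the final answer bound to `Answer`. The numpy import is added here only to keep the snippet self-contained; the chart values are invented for illustration.

```python
# Question: "What is the difference between the highest and lowest sales values?"
import numpy as np

# sales values read off the chart
sales = np.array([12.0, 18.5, 9.0, 22.0])
# highest sales value in the chart
highest = np.max(sales)
# lowest sales value in the chart
lowest = np.min(sales)
# difference between the highest and lowest values
Answer = highest - lowest
```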
What experiments were performed?
- Benchmarks: ChartQA, Chart-to-Text (Pew), Chart-to-Table, OpenCQA; plus ChartX cognition generalization.
- ChartQA evaluation settings:
- Direct: short answer.
- PoT: emit Python program; interpreter gives answer.
- Combine (default): keyword rule decides if question is “calculative”; PoT for calculative, Direct otherwise; fallback to Direct if program has syntax errors.
- Oracle: choose the correct one between Direct and PoT after the fact (upper bound).
- Ablations:
- With/without PoT data; template-only vs template+GPT PoT.
- Resolution changes ($384 \rightarrow 512 \rightarrow 768$) with/without token merging; varying merge rate $r$.
- Tracks visual sequence length and inference throughput as efficiency metrics.
- Qualitative case studies: QA, chart-to-table, chart-to-text, and chart redrawing; includes error examples.
What are the outcomes/limitations?
Outcomes (as reported)
- TinyChart@512 and TinyChart@768 (3B) are reported to outperform or match larger chart MLLMs on several benchmarks, while achieving higher throughput.
- ChartQA setting breakdown: Combine improves over Direct; Oracle is notably higher than Combine, implying combination policy is not optimal.
- Calculative vs non-calculative: PoT is much better on “calculative” subset than Direct, while Direct remains strong on non-calculative; Combine captures gains from both.
- Efficiency: Token merging enables high resolution (notably 768) without OOM on 32GB V100.
- Key numbers:
- Throughput (ChartQA test, batch size 1 on V100 32GB): TinyChart@512 3.65 it/s, TinyChart@768 3.14 it/s
- ChartQA results: TinyChart@768 Direct 76.36, PoT 80.84, Combine 83.60, Oracle 89.12
- Ablation: 768×768 without merging produces visual length 2916 and OOM; with merging r=84 produces visual length 732 at 3.14 it/s
Limitations / failure modes (from paper)
Extraction and OCR brittleness:
- Chart-to-table extraction struggles when values must be inferred from axes without explicit numeric labels near data points, a fundamental limitation when extraction leans heavily on visible text annotations.
- The paper does not quantify how often this failure mode occurs in their test sets.
Generation quality issues:
- Chart-to-text generation can hallucinate mismatched content even when some extracted values are correct, suggesting weak grounding between vision and language outputs.
- Chart redrawing struggles with unseen chart types (example given: 3D bar charts), indicating limited generalization beyond training distribution.
PoT combination policy gap:
- The simple keyword-based routing between Direct and PoT modes leaves a substantial gap to Oracle performance (83.60 vs 89.12 on ChartQA).
- The paper does not explore learned routing or confidence-based selection, which could close this gap.
- Error analysis on when the policy fails is not provided.
Evaluation and reproducibility concerns:
- ChartQA-PoT construction relies on GPT-3.5 generation for 21K of its 140K pairs, which may introduce style biases and limit the diversity of reasoning patterns.
- Template-based PoT generation uses 40 question templates from PlotQA; coverage of reasoning types beyond these templates is unclear.
- The paper does not report inter-annotator agreement or validation rates for the generated PoT programs beyond “execution succeeds and matches answer.”
- Full training cost breakdown (GPU hours, energy, cost) is not provided beyond “3 days on 32 V100s.”
Model
Architecture
- Standard MLLM structure: Vision Transformer encoder → vision-language connector → LLM decoder.
- Vision Transformer: Token merging reduces the visual sequence length by $r$ tokens per layer; a parameter-free merging module is inserted into each ViT layer.
- For image resolution $N \times N$ and patch size $P \times P$, number of vision tokens is $(\lfloor N/P \rfloor)^2$; without reduction, this long sequence is passed downstream.
- Vision-language connector: Implemented as an MLP with one hidden layer and GeLU activation, mapping vision features into the LLM embedding space.
- LLM: Transformer decoder with causal mask; supervised fine-tuning objective is LM loss over response tokens only.
Training recipe
- Initialization: from TinyLLaVA, using SigLIP as vision encoder and Phi-2 as LLM.
- Vision resolution changes: base encoder at 384×384; extend to 512×512 and 768×768 with token merging.
- Token merging rates:
- 512×512: $r=20$ (authors also ablate $r=12,15,20$).
- 768×768: $r=84$ (enables training vs OOM without merging).
- Train entire model for 3 epochs, batch size 512, learning rate $1\times 10^{-4}$, warmup 3% of steps, then decay to 0.
- Training cost: 3 days on 32 Tesla V100 GPUs (32 GB VRAM).
Data
ChartQA-PoT (PoT supervision)
- Built from ChartQA training split; total 140,584 (question, PoT answer) pairs.
- Construction counts:
- Template-based: 119,281 pairs over 17,498 images.
- GPT-based: 21,303 pairs over 15,521 images.
- GPT-based generation constraints: assignment statements only; no loops/branches; one-line comment before each statement; the last variable must be `Answer`; restricted operator/function list (numpy ops + simple arithmetic/comparators).
Multitask supervised fine-tuning mix
- Total training samples: 1,364,921 across tasks (QA, chart-to-text, chart-to-table, instruction-following), using datasets including ChartQA, PlotQA, DVQA, OpenCQA, Pew/Statista/Vistext/ChartSumm/Chart2Text-8k, and ChartLlama instruction data.
Evaluation
Metrics and protocols
- ChartQA: “relaxed accuracy” allowing numerical error within 5%.
- Chart-to-Text (Pew): BLEU4.
- Chart-to-Table: RMSF1 metric.
- OpenCQA: BLEU4.
- ChartX cognition: GPT-Accuracy for QA, GPT-score for summary/description/redrawing.
Routing rule for “calculative” questions
Uses a keyword detector; the list includes terms such as “sum, mean, average, ratio, subtract, divide, times, absolute, minus, greater, lowest, number, how many”.
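A minimal sketch of this routing rule together with the relaxed-accuracy check (numerical error within 5%); the keyword list is the partial one quoted above, and the matching details are assumptions.

```python
CALC_KEYWORDS = ["sum", "mean", "average", "ratio", "subtract", "divide", "times",
                 "absolute", "minus", "greater", "lowest", "number", "how many"]

def is_calculative(question: str) -> bool:
    """Route to PoT when a calculation keyword appears; otherwise answer directly."""
    q = question.lower()
    return any(keyword in q for keyword in CALC_KEYWORDS)

def relaxed_match(pred: str, target: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers may deviate by up to 5%."""
    try:
        p, t = float(pred), float(target)
        return abs(p - t) <= tol * abs(t) if t != 0 else p == t
    except ValueError:
        return pred.strip().lower() == target.strip().lower()
```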
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
- UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
- Code
- Models (base, small, tiny)
- Dataset (OpenDataLab)
TL;DR
UniMERNet introduces UniMER-1M (1,061,791 image-LaTeX pairs) and UniMER-Test (23,757 samples across four subsets) to address real-world mathematical expression recognition beyond clean rendered formulas. A Swin-Transformer + mBART encoder-decoder architecture with fine-grained embedding, local convolution modules, and decoder attention compression achieves stronger BLEU scores and throughput than Pix2tex and Texify baselines on complex printed, screen-captured, and handwritten expressions.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ — Releases UniMER-1M (1M training pairs), UniMER-Test benchmark (four subsets covering diverse real-world conditions), and model weights.
- Secondary: $\Psi_{\text{Method}}$ — Proposes UniMERNet architecture modifications (fine-grained embedding, convolutional enhancement, decoder attention compression) for formula-specific recognition.
- Secondary: $\Psi_{\text{Evaluation}}$ — Provides comprehensive baseline comparisons (Pix2tex, Texify) and ablations across architecture, training data, pretraining, and augmentation.
What is the motivation?
Prior mathematical expression recognition (MER) benchmarks emphasize simple printed or handwritten formulas with limited complexity and clean backgrounds. Real-world deployments encounter long expressions, noisy screen captures, font inconsistencies, geometric distortions, and mixed capture modalities. Existing models trained on small clean datasets fail to generalize to these conditions. The paper addresses this by constructing a large-scale dataset reflecting practical distributions and an architecture tuned for formula-specific challenges such as local spatial relations and similar-looking symbols.
What is the novelty?
Dataset:
- UniMER-1M: 1,061,791 training pairs; maximum formula length 7,037 tokens; average length 79.48 tokens. Sourced from arXiv (89%), Wikipedia (9%), StackExchange (2%); rendered with multiple math fonts at 80-350 DPI.
- UniMER-Test: 23,757 samples across four subsets:
- SPE (short printed expressions): rendered LaTeX, clean backgrounds
- CPE (complex printed expressions): long/complex rendered LaTeX, clean backgrounds
- SCE (screen-captured expressions): extracted from 1,000 PDF pages, annotated via Mathpix + manual correction
- HWE (handwritten expressions): CROHME + HME100K combined into 6,332 test samples
Model: UniMERNet modifies a Swin + mBART baseline with:
- FGE (Fine-Grained Embedding): Overlapping 3×3 convolution stack (stride 2, padding 1) for patch embedding to reduce character fragmentation from non-overlapping patches.
- CE (ConvEnhance): Depthwise 3×3 convolution + GELU before attention/MLP blocks for local perception of superscripts, subscripts, and adjacent symbols.
- RSW (Remove Shift Window): Eliminates Swin shift-window mechanism when FGE+CE already enlarge receptive field, improving speed and accuracy.
- SA (Squeeze Attention): Low-dimensional projection of query/key tensors in decoder attention to improve throughput without significant accuracy loss.
What experiments were performed?
Data scaling (Table 2): Compared Pix2tex-only, Pix2tex+HWE, and UniMER-1M training on UniMER-Test subsets, measuring BLEU and edit distance.
Architecture ablations (Table 3): Evaluated baseline vs. incremental addition of FGE, CE, SA, and RSW, measuring BLEU by subset and throughput (images/s at batch size 128, max sequence length 1536 on A100).
SOTA comparisons (Table 5): Compared Pix2tex, Texify, Texify* (Texify trained on UniMER-1M with their augmentations), and UniMERNet-T/S/B variants, reporting BLEU, edit distance, FPS, and parameter counts.
Pretraining ablation (Table 6): Tested with/without 16M arXiv-derived image-text pretraining corpus for Texify and UniMERNet-B.
Augmentation ablation (Table 7): Evaluated with/without Albumentations + custom augmentations (erosion, dilation, degradation, geometric transforms) for Pix2tex and UniMER-1M training.
Depth sweeps (Tables 8-9): Varied encoder depth per Swin stage and decoder depth, tracking BLEU and FPS.
What are the outcomes/limitations?
Outcomes:
- UniMER-1M training improves generalization on complex and real-world subsets:
- CPE BLEU: 0.724 (Pix2tex+HWE) → 0.925 (UniMER-1M); edit distance: 0.225 → 0.056
- SCE BLEU: 0.529 (Pix2tex+HWE) → 0.626 (UniMER-1M); edit distance: 0.309 → 0.224
- UniMERNet-B (325M parameters) achieves the strongest reported results:
- SPE BLEU 0.915, CPE BLEU 0.925, SCE BLEU 0.626, HWE BLEU 0.895
- Throughput: 5.06 img/s (batch 128, max seq len 1536, A100)
- SA (Squeeze Attention) is the primary speed improvement: 4.07 img/s → 5.04 img/s with minimal BLEU change.
- Pretraining improves printed/screen-captured more than handwritten: SCE BLEU 0.601 → 0.626 with 16M arXiv pretraining.
Limitations:
- Inconsistent counts: UniMER-Test size described as 23,789 in text but Table 1 lists 23,757; CROHME test described as 3,332 but Table 1 lists 3,233.
- Sequence length mismatch: Dataset reports max formula length 7,037 tokens while training/evaluation uses max sequence length 1,536; handling of longer sequences (truncation vs. filtering) is not specified.
- SCE labels via Mathpix: Requires proprietary API for initial annotation followed by manual correction; reproducibility implications and licensing considerations are not discussed.
- Pretraining data availability: 16M arXiv-derived corpus is described as “in-house”; public release status is not clarified.
Model
Architecture
UniMERNet follows the encoder-decoder paradigm used in Donut, Nougat, and Texify:
- Encoder: Swin Transformer with modifications
- Decoder: mBART
Encoder Modifications
Fine-Grained Embedding (FGE):
- Two convolutional layers, kernel size 3, stride 2, padding 1
- Each followed by LayerNorm + GELU
- Produces overlapping patches to reduce character feature fragmentation (a sketch of FGE and CE follows this subsection)
ConvEnhance (CE):
- Depthwise 3×3 convolution + GELU inserted before attention and MLP modules
- Provides local perception for spatial relations (superscripts, subscripts)
- Alternates local (conv) and global (attention) feature aggregation
Remove Shift Window (RSW):
- Eliminates Swin shift-window mechanism
- FGE and CE already expand receptive field sufficiently
- Improves both speed and accuracy in ablation studies
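A PyTorch sketch of the two encoder modules as described above; the channel widths, the LayerNorm handling for (B, C, H, W) tensors, and the residual connection in ConvEnhance are assumptions rather than the released implementation.

```python
import torch.nn as nn

class FineGrainedEmbedding(nn.Module):
    """Overlapping patch embedding: two 3x3 convs, stride 2, padding 1,
    each followed by LayerNorm + GELU."""
    def __init__(self, in_ch=3, embed_dim=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, embed_dim // 2, 3, stride=2, padding=1)
        self.norm1 = nn.LayerNorm(embed_dim // 2)
        self.conv2 = nn.Conv2d(embed_dim // 2, embed_dim, 3, stride=2, padding=1)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.act = nn.GELU()

    @staticmethod
    def _channel_norm(x, norm):
        # apply LayerNorm over the channel dim of a (B, C, H, W) tensor
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.act(self._channel_norm(self.conv1(x), self.norm1))
        x = self.act(self._channel_norm(self.conv2(x), self.norm2))
        return x                               # (B, embed_dim, H/4, W/4)

class ConvEnhance(nn.Module):
    """Depthwise 3x3 conv + GELU, placed before the attention / MLP blocks
    to add local perception (superscripts, subscripts, adjacent symbols)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.act = nn.GELU()

    def forward(self, x):                      # x: (B, C, H, W)
        return x + self.act(self.dwconv(x))    # residual branch (assumption)
```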
Decoder Modifications
Squeeze Attention (SA):
- Projects query $Q$ and key $K$ to lower-dimensional space before attention computation
- Reduces computational cost in decoder attention, which becomes the throughput bottleneck
- Value $V$ remains full-dimensional
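A hedged sketch of the Squeeze Attention idea: project $Q$ and $K$ to a lower dimension before the attention product while $V$ keeps the full width. The squeeze ratio, head count, and layer wiring are assumptions for illustration only:

```python
import torch.nn as nn
import torch.nn.functional as F

class SqueezeAttention(nn.Module):
    """Attention with low-dimensional Q/K projections; V stays full-dimensional."""
    def __init__(self, dim=1024, num_heads=16, squeeze_ratio=4):
        super().__init__()
        self.num_heads = num_heads
        self.qk_dim = dim // squeeze_ratio        # squeezed Q/K width (assumption)
        self.q_proj = nn.Linear(dim, self.qk_dim)
        self.k_proj = nn.Linear(dim, self.qk_dim)
        self.v_proj = nn.Linear(dim, dim)         # value keeps the full dimension
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, context=None):
        context = x if context is None else context
        B, Tq, _ = x.shape
        Tk = context.shape[1]
        q = self.q_proj(x).view(B, Tq, self.num_heads, -1).transpose(1, 2)
        k = self.k_proj(context).view(B, Tk, self.num_heads, -1).transpose(1, 2)
        v = self.v_proj(context).view(B, Tk, self.num_heads, -1).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, Tq, -1)
        return self.out_proj(out)
```

The smaller Q/K width shrinks the projection and score computation in the decoder's attention, which the ablation identifies as the main throughput bottleneck.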
Model Variants
| Variant | Encoder Depth $N$ | Decoder Depth $M$ | Hidden Dim $C$ | Parameters |
|---|---|---|---|---|
| UniMERNet-T | [6,6,6,6] | 8 | 512 | 100M |
| UniMERNet-S | [6,6,6,6] | 8 | 768 | 202M |
| UniMERNet-B | [6,6,6,6] | 8 | 1024 | 325M |
Data
UniMER-1M Composition
- Total: 1,061,791 LaTeX-image pairs
- Sources: arXiv (89%), Wikipedia (9%), StackExchange (2%); approximately 4M LaTeX expressions collected
- Length distribution:
- Maximum formula length: 7,037 tokens
- Average formula length: 79.48 tokens
- Initial long-formula proportion: 2.3%; rebalanced by extracting longest formulas as CPE subset
Rendering:
- XeLaTeX with multiple math fonts: Asana Math, Cambria Math, XITS Math, TeX Gyre (Bonum, Pagella, Schola, Termes), Latin Modern Math
- DPI range: 80-350
- Latin Modern Math used in ~22% of samples (default)
Normalization:
- Based on Deng et al. (2017) LaTeX normalization
- Adjusted for multi-line environments (`align`, `cases`, etc.)
- Filters unsupported syntax
UniMER-Test Subsets
| Subset | Type | Size | Description |
|---|---|---|---|
| SPE | Short printed | 8,313 | Rendered LaTeX, clean backgrounds, short expressions |
| CPE | Complex printed | 5,368 | Rendered LaTeX, clean backgrounds, long/complex expressions |
| SCE | Screen-captured | 4,744 | Extracted from 1,000 PDFs, Mathpix + manual annotation, deduplicated |
| HWE | Handwritten | 6,332 | CROHME + HME100K combined test set |
| Total | | 23,757 | |
SCE construction:
- 1,000 PDF pages (Chinese + English)
- Formula detection and bounding box annotation
- Initial labels from Mathpix API
- Manual correction pass
- Perceptual hashing deduplication
HWE construction:
- Training: 83,338 samples from CROHME + HME100K
- Test: 6,332 samples (note: text describes CROHME test as 3,332, but Table 1 lists 3,233)
Training
Framework: PyTorch
Maximum sequence length: 1,536 tokens
Optimization:
- Loss: Cross-entropy language modeling
- Learning rate schedule: Linear warmup + cosine decay
- Initial LR: $1 \times 10^{-4}$
- Minimum LR: $1 \times 10^{-8}$
- Warmup LR: $1 \times 10^{-5}$
- Weight decay: 0.05
- Total iterations: 300,000
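A minimal sketch of the reported schedule (linear warmup to the initial LR, then cosine decay to the minimum LR); the warmup-iteration count is an assumption, since it is not stated above:

```python
import math

def lr_at(step, total_steps=300_000, warmup_steps=10_000,
          warmup_lr=1e-5, init_lr=1e-4, min_lr=1e-8):
    """Linear warmup from warmup_lr to init_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + math.cos(math.pi * progress))
```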
Augmentation:
- Albumentations library + custom transforms
- Morphological: erosion, dilation
- Degradation: fog, frost, rain, snow, shadow simulation
- Geometric: rotation, distortion
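A hedged Albumentations sketch in the spirit of the described pipeline; the specific transforms, probabilities, and the OpenCV-based erosion/dilation helper are illustrative choices, not the authors' exact recipe (frost simulation is omitted here):

```python
import cv2
import numpy as np
import albumentations as A

def random_morphology(image, **kwargs):
    """Randomly erode or dilate glyph strokes (custom morphological transform)."""
    kernel = np.ones((2, 2), np.uint8)
    op = cv2.erode if np.random.rand() < 0.5 else cv2.dilate
    return op(image, kernel, iterations=1)

train_aug = A.Compose([
    A.Lambda(image=random_morphology, p=0.3),                      # erosion / dilation
    A.OneOf([A.RandomFog(), A.RandomRain(),
             A.RandomSnow(), A.RandomShadow()], p=0.3),            # degradation
    A.Rotate(limit=3, border_mode=cv2.BORDER_REPLICATE, p=0.3),    # geometric
    A.GridDistortion(distort_limit=0.05, p=0.2),
])

# augmented = train_aug(image=formula_image)["image"]
```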
Pretraining (optional):
- 16M image-text pairs from arXiv
- Constructed via layout detection + OCR on text blocks, matched to source LaTeX
- Pretraining improves printed/screen-captured performance more than handwritten
Hardware:
- NVIDIA A100 80GB
- Batch size: 64
- Training setup description includes both “single GPU” and “eight such GPUs” references (apparent inconsistency in source material)
Evaluation
Metrics
- BLEU: Measures token-level similarity between predicted and ground-truth LaTeX
- Edit Distance (Levenshtein): Character-level edit operations normalized by ground-truth length
- ExpRate: Exact match rate (expression-level accuracy)
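For reference, a small sketch of the normalized edit distance as defined above (Levenshtein distance divided by the ground-truth length); any LaTeX normalization applied before comparison is outside this sketch:

```python
def normalized_edit_distance(pred: str, gt: str) -> float:
    """Levenshtein distance between strings, normalized by the ground-truth length."""
    m, n = len(pred), len(gt)
    dp = list(range(n + 1))                 # row for the empty prediction prefix
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (pred[i - 1] != gt[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(1, n)
```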
Key Results
Data Scaling (Table 2)
Training on UniMER-1M vs. Pix2tex+HWE, evaluated on UniMER-Test:
| Subset | Metric | Pix2tex+HWE | UniMER-1M |
|---|---|---|---|
| SPE | BLEU | 0.893 | 0.902 |
| SPE | Edit Distance | 0.069 | 0.065 |
| CPE | BLEU | 0.724 | 0.925 |
| CPE | Edit Distance | 0.225 | 0.056 |
| SCE | BLEU | 0.529 | 0.626 |
| SCE | Edit Distance | 0.309 | 0.224 |
| HWE | BLEU | 0.874 | 0.882 |
| HWE | Edit Distance | 0.088 | 0.082 |
Architecture Ablation (Table 3)
Impact of UniMERNet modifications on SCE subset (batch 128, A100):
| Configuration | Params | FPS | SCE BLEU |
|---|---|---|---|
| Baseline | 342M | 4.12 | 0.579 |
| +FGE+CE | 342M | 4.07 | 0.599 |
| +FGE+CE+SA | 325M | 5.04 | 0.599 |
| +FGE+CE+RSW+SA | 325M | 5.06 | 0.601 |
Model Comparison (Table 5)
UniMERNet-B vs. baselines on UniMER-Test:
| Model | Params | FPS | SPE BLEU | CPE BLEU | SCE BLEU | HWE BLEU |
|---|---|---|---|---|---|---|
| Pix2tex | 350M | 7.67 | 0.750 | 0.575 | 0.378 | 0.806 |
| Texify | 312M | 4.16 | 0.820 | 0.764 | 0.420 | 0.844 |
| Texify* | 312M | 4.16 | 0.901 | 0.916 | 0.599 | 0.884 |
| UniMERNet-T | 100M | 7.67 | 0.857 | 0.846 | 0.535 | 0.858 |
| UniMERNet-S | 202M | 6.08 | 0.896 | 0.906 | 0.598 | 0.888 |
| UniMERNet-B | 325M | 5.06 | 0.915 | 0.925 | 0.626 | 0.895 |
Texify* = Texify trained on UniMER-1M with their augmentation pipeline
Pretraining Impact (Table 6)
UniMERNet-B with/without 16M arXiv pretraining:
| Subset | Without Pretraining | With Pretraining | Δ BLEU |
|---|---|---|---|
| SPE | 0.904 | 0.915 | +0.011 |
| CPE | 0.913 | 0.925 | +0.012 |
| SCE | 0.601 | 0.626 | +0.025 |
| HWE | 0.895 | 0.895 | 0.000 |
Hardware
Training:
- NVIDIA A100 80GB
- Batch size: 64 per GPU
- Setup: references both “single GPU” and “eight such GPUs” (inconsistent in source)
Inference throughput tests:
- Batch size: 128
- Max sequence length: 1536
- Hardware: A100 80GB
- UniMERNet-B: 5.06 img/s
License & Availability
Code: Apache-2.0 (GitHub)
Models: Apache-2.0 (Hugging Face)
Dataset: Apache-2.0 (Hugging Face, OpenDataLab)
Important caveat: The dataset card notes that HME100K portion must be downloaded manually “for copyright compliance,” and the dataset incorporates multiple upstream sources (Pix2tex, CROHME, HME100K). The Apache-2.0 tag on the Hugging Face dataset repository may not override upstream licensing terms for these components. Verify upstream licenses before commercial use.
ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization
Paper: ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization (arXiv:2403.11236v2, 25 Apr 2024)
Authors/Affiliations: Sun Yat-sen University; Alibaba Group; Guangdong Provincial Key Lab; Pazhou Lab
Code/Data: GitHub: Notonion/ChartThinker, HuggingFace: ChartThinker/Chart-Sum-QA
License: MIT (dataset on HuggingFace); see notes below for source dataset licenses
TL;DR
ChartThinker targets chart summarization failure modes in VLMs: (1) poor chart-text matching degree (omissions, hallucinated values) and (2) reasoning errors about the chart’s intended message. It introduces (a) a large chart-caption + instruction QA dataset (Chart-Sum-QA) and (b) a method that combines chart parsing (OCR + DePlot), chain-of-thought (CoT) style staged generation, and a small retrieval library for in-context examples.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new end-to-end approach: chart parsing + context-enhanced CoT generation + retrieval integration; extensive ablations).
Secondary: $\Psi_{\text{Resource}}$ (release of Chart-Sum-QA: 595,955 chart-caption pairs + 8.17M instruction QA pairs).
(Using the taxonomy defined in the provided guidelines.)
What is the motivation?
The authors argue that large vision-language models often fail at chart summarization because:
- Insufficient matching degree: summaries omit small-but-critical numbers/text, or fabricate chart content due to pretraining priors.
- Reasoning errors: models misread trends/patterns or infer incorrect “takeaways” from complex charts.
Figure 1 illustrates both error types via examples comparing LLaMA-Adapter-v2 and MiniGPT-4 outputs to a chart, showing both hallucinations and incorrect inferences.
What is the novelty?
Chart-Sum-QA dataset (scale + supervision breadth)
- 595,955 chart-caption pairs (pretraining) and 8,170,000 instruction question-answer pairs (fine-tuning).
- Covers chart types: scatter, line, bar, pie (per Table 1).
Context-Enhanced CoT Generator (retrieval + staged “thoughts”)
- The model generates multiple thought steps (example steps: identify chart type; understand legend/axes; observe trends/proportions), and for each thought retrieves top-$k$ chart-text exemplars from a small retrieval library to use as context examples, then consolidates outputs into a final summary (overview shown in Figure 2).
- Retrieval uses cosine similarity between encoded chart features and stored example features, then uses an order-based weighting where example weight is proportional to $1/i$ for rank $i$ (Equations 3–4), and conditions generation on a weighted context function (Equation 5).
Chart parsing module that fuses OCR + DePlot outputs
- OCR extracts textual/numeric strings; DePlot converts chart to a table (text-number alignment), and the module outputs:
- text-number pairs and
- other text (title/notes/etc.).
- The workflow is depicted in Figure 3.
What experiments were performed?
Benchmarks and baselines
They compare against two baseline groups:
- Classic (text-generation) transformer baselines with OCR-augmented inputs: T5, Chart2text, Field-Infuse, BART, plus their OCR-ChartThinker variant (Table 2).
- Large VLM baselines (encoder-decoder style): LLaMA-Adapter-v2, MiniGPT-4, mPLUG-Owl, LLaVA, plus ChartThinker (Table 3).
Metrics
Automatic evaluation uses:
- BLEU, BLEURT (base-128), CIDEr, Content Selection (CS), and perplexity using GPT-2 Medium; plus a normalized aggregate score $S_{\text{norm}}$ over five indicators.
Human evaluation:
- 200 generated summaries, scored by 3 evaluators on Matching Degree and Reasoning Correctness (1–5).
Ablations (Table 4)
Ablations remove:
- chart parsing module,
- context retrieval (“No Context-Enhanced”),
- CoT (“No CoT”),
- entire context-enhanced CoT generator,
- fine-tuning on caption dataset,
- fine-tuning on instruction dataset.
What are the outcomes/limitations?
Key results (as reported)
- OCR-ChartThinker vs OCR baselines (Table 2): reported best BLEU (11.81), best CIDEr (2.21), best PPL (9.23), best $S_{\text{norm}}$ (0.948), though not best CS (32.72% vs OCR-T5 40.87%).
- ChartThinker vs VLM baselines (Table 3): reported best across listed metrics among those VLM baselines (BLEU 5.82, CIDEr 1.58, CS 21.68%, PPL 11.43).
- Human evaluation (Table 5): ChartThinker rated highest on Matching Degree (4.32) and Reasoning Correctness (4.27) among compared models in their study.
Limitations / caveats (from the paper + inferred risks)
Evaluation metric concerns:
- CS (Content Selection) metric shows counterintuitive behavior: the authors argue it can penalize longer correct summaries that include additional valid detail beyond the reference (Table 6 example). This raises questions about whether CS is a reliable primary metric.
- Heavy reliance on automatic metrics (BLEU, CIDEr, BLEURT) may not capture summary quality aspects like clarity, coherence, or usefulness to end users.
- Human evaluation is limited to 200 samples with 3 evaluators; inter-rater agreement is not reported.
Data generation pipeline concerns:
- A large portion of instruction QA (exact ratio not specified) is generated using ChatGPT-4 from human-written summaries, which may introduce:
- Style biases toward GPT-4’s generation patterns
- Potential data leakage if GPT-4 was trained on overlapping sources
- Limited diversity in question types and reasoning patterns
- The paper mentions “a subset” is manually validated but does not specify validation coverage, criteria, or pass rates.
- Prompt templates and filtering heuristics for QA generation are not fully detailed.
Retrieval library limitations:
- Library contains only 1,000 chart-text pairs split across 4 stages (250 each).
- Coverage of unusual chart designs, rare chart types, or domain-specific visualizations may be insufficient.
- The paper does not analyze retrieval failure modes or cases where no good match exists.
Architectural and scope constraints:
- Chart parsing module combines OCR + DePlot but may inherit limitations from both (e.g., DePlot’s known weakness with color-dependent questions).
- Context-enhanced CoT is stage-specific, but the paper does not justify the choice of 4 stages or explore alternative decompositions.
- Generalization to chart types beyond scatter, line, bar, and pie is not evaluated.
Reproducibility gaps:
- The paper provides LoRA hyperparameters but omits:
- Full hardware specifications (GPU type, memory, count)
- Wall-clock training time
- Inference throughput (tokens/sec, images/sec)
- Total training cost or energy consumption
- Dataset mixing ratios and sampling strategies for the multi-source training corpus are not detailed.
Research questions (explicitly answered)
The paper poses three RQs; below are the answers supported by the reported experiments:
RQ1: Can answer reasoning benefit from introducing a chain of thought?
Yes, per Table 4: removing CoT (“No CoT”) reduces BLEU (5.10 vs 5.82) and lowers the human score (3.92 vs 4.25), indicating CoT contributes to both automatic metrics and judged quality.
RQ2: How can context retrieval and chain of thought effectively interact with each other?
The reported synergy claim is supported by ablations: removing context enhancement reduces performance (“No Context-Enhanced”: BLEU 5.45 vs 5.82; human 4.11 vs 4.25), and removing the entire combined module drops further (“No Context-Enhanced CoT Generator”: BLEU 4.59; human 3.85). The method’s interaction mechanism is “retrieve exemplars per thought step” (Figure 2).
RQ3: How does instruction fine-tuning improve the chart-summary matching degree?
Table 4 shows dropping instruction fine-tuning degrades BLEU (4.52 vs 5.82) and BLEURT (–0.63 vs –0.45), consistent with the claim that directive QA tuning improves chart-specific grounding.
Model
High-level architecture (from Figure 2)
Inputs: chart image + prompt. The pipeline includes:
- Image encoder (CLIP-based encoder) producing chart feature vector $V$.
- Text encoder for the prompt producing token sequence/features $K$ (paper text labels it “encoder” but describes generation with a decoder; see below).
- Chart parsing module produces underlying data features by merging OCR + DePlot outputs and concatenating with prompt features.
- Context-Enhanced CoT Generator generates a sequence of “thoughts”; for each thought it retrieves top-$k$ example chart-text pairs and uses them as in-context examples, then integrates outputs into the final summary.
Generator backbone
- Paper states the final generation is done with Idefics and also mentions a LLaMA2 decoder (the description mixes terms: “text encoder” vs “decoder”); the operational takeaway is: a VLM generator (Idefics) plus LoRA is used for conditioned generation.
Retrieval library design (Appendix B summary in-text)
- 1,000 chart-text pairs, split into 4 stages of 250 each:
- chart type,
- chart overview (title-driven),
- axes meanings,
- numerical trend description.
Data
Chart-Sum-QA composition (Table 1)
- Aggregates multiple existing chart datasets (Autochart, Linecap, DVQA, PlotQA, Chart-to-text, FigureQA), yielding:
- 595,955 images and
- 8,170,000 QA pairs total (their constructed dataset).
Construction steps (Section 3.1)
- Data collection from six datasets.
- Preprocessing: resize, standardize formats, clean titles/descriptions, ensure alignment.
- Generate QA pairs: create ~400k QA from summaries (Chart-to-text, Autochart, Linecap), merge with other QA datasets, filter, finalize 8.17M; use ChatGPT-4 to generate questions from human-written summaries; manually validate a subset.
- Splits: 80% train, 10% validation, 10% test.
Algorithms / Training
Chart parsing module (Figure 3)
- OCR output: text strings and numbers (without positions).
- DePlot output: a table of text-number correspondences (can be affected by irrelevant chart elements; may have numeric extraction errors).
- Fusion output: separate text-number pairs (key aligned entities) and other text.
Retrieval + weighting
- Similarity: cosine similarity between input chart feature vector $V$ and example feature $T_i$ (Eq. 3).
- Context weighting: rank-based weight $1/i$; aggregated weighting function $W_\ell(\cdot)$ conditions generation at token $\ell$ (Eq. 4–5).
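A minimal sketch of the retrieval-and-weighting step as described (cosine similarity, top-$k$ selection, $1/i$ rank weights); feature shapes and the final weight normalization are assumptions:

```python
import torch
import torch.nn.functional as F

def retrieve_weighted_context(chart_feat, library_feats, k=3):
    """Return top-k library indices and rank-based 1/i weights (Eq. 3-4 style)."""
    # chart_feat: (d,) encoded input chart; library_feats: (N, d) stored examples
    sims = F.cosine_similarity(chart_feat.unsqueeze(0), library_feats, dim=-1)  # (N,)
    topk = torch.topk(sims, k)
    ranks = torch.arange(1, k + 1, dtype=torch.float32)
    weights = 1.0 / ranks
    weights = weights / weights.sum()      # normalization of weights is an assumption
    return topk.indices, weights
```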
Fine-tuning implementation (LoRA)
- LoRA rank 16, applied to qproj/kproj/vproj.
- LoRA scaling factor $\alpha = 32$, dropout 0.05.
- Optimizer: paged_adamw_8bit, learning rate $2\times 10^{-4}$.
- Gradient accumulation: 8; eval every 20 steps.
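A hedged sketch of how these hyperparameters could map onto a Hugging Face PEFT setup; the target-module names, batch size, and trainer wiring are assumptions, not the paper's code:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="chartthinker-lora",
    optim="paged_adamw_8bit",
    learning_rate=2e-4,
    gradient_accumulation_steps=8,
    evaluation_strategy="steps",
    eval_steps=20,
    per_device_train_batch_size=1,   # batch size not reported; placeholder
)
```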
Evaluation
Automatic metrics
- BLEU, BLEURT-base-128, CIDEr, CS, PPL (GPT-2 Medium), plus normalized aggregate score $S_{\text{norm}}$.
Human evaluation protocol
- 200 summaries; 3 annotators; criteria:
- Matching Degree (data fidelity: minimal omissions/fabrications)
- Reasoning Correctness (intended message inferred correctly)
- Rated 1–5; randomized order to reduce bias.
Hardware / Production
Not fully specified. The paper provides the optimizer choice and LoRA settings, but does not give:
- GPU type/count, training duration, batch size (beyond gradient accumulation), or inference throughput.
Dataset release + license
Do they share their dataset online?
Yes. The paper explicitly says their “dataset and codes are publicly accessible” and that they “release our dataset and codes at OpenChartThinker.” A public repo associated with the project also points to a Hugging Face dataset upload (GitHub: Notonion/ChartThinker).
Under what license?
On Hugging Face, the ChartThinker/Chart-Sum-QA dataset is labeled MIT (“License: mit”). (HuggingFace: ChartThinker/Chart-Sum-QA)
Detail to be aware of
In the dataset construction section, the authors note the source datasets they aggregate are under various licenses (examples given: CC BY-NC-SA 4.0, MIT, GPL3.0). So, even though the HF card says MIT, it’s worth double-checking provenance/constraints for any subset you plan to use (especially commercial use).
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token
Paper: OneChart: Purify the Chart Structural Extraction via One Auxiliary Token (arXiv:2404.09987; MM ‘24; DOI: 10.1145/3664647.3681167)
Project page: onechartt.github.io
Code: LingyvKong/OneChart
Models: kppkkp/OneChart
License: Apache-2.0 (code/model); research use only (with upstream constraints from Vary/OPT)
TL;DR
OneChart is a chart-to-Python-dict structural extraction model that adds a single auxiliary token (<Chart>) plus a small auxiliary number decoder trained with an $L_1$ loss to improve numeric reliability. It also uses a self-consistency distance between “numbers implied by the text dict” and “numbers predicted by the auxiliary decoder” as an optional reliability score to filter (“purify”) outputs at inference. Authors report strong structural-extraction AP across several chart benchmarks with a relatively small model ($\approx$200M params) and gains when feeding the extracted dict into downstream ChartQA systems.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new architecture element: auxiliary token + decoder + training/inference recipe; includes ablations and SOTA-style tables).
Secondary:
- $\Psi_{\text{Resource}}$ (introduces ChartY benchmark; also describes large synthetic data engine).
- $\Psi_{\text{Evaluation}}$ (self-consistency reliability scoring and “purified output” evaluation protocol; additional comparisons/ablations).
What is the motivation?
Chart parsing is hard because charts vary widely in style, values, text, and layouts, and even large VLMs can struggle, especially when charts lack explicit numeric annotations. Authors attribute failures mainly to:
- “CLIP bias”: CLIP-ViT encoders are trained on natural-image captioning and may miss chart-local details; English-heavy pretraining can also hurt non-English charts.
- Cross-entropy numerics problem: token-level CE does not strongly distinguish “close-looking” numbers (example: “7008” vs “70.8” can have deceptively similar loss), which can slow convergence and reduce numeric accuracy.
Public benchmarks are described as limited in type/style/language diversity (e.g., ChartQA and PlotQA skew heavily to bar/line; many synthetic sets have limited styling).
What is the novelty?
Auxiliary token + auxiliary number decoder:
- Prefix a special token `<Chart>` at the start of the output sequence; its hidden-state embedding is routed to an auxiliary MLP decoder trained to predict a fixed-length vector of normalized chart numbers using a masked $L_1$ loss.
- Because the main LM is causal, later text tokens can attend to the `<Chart>` token embedding, aiming to improve numeric fidelity in the generated Python-dict.
Self-evaluation / “purification”:
- At inference, parse the raw Python-dict output, extract numbers, normalize them, and compare to the auxiliary decoder’s numeric vector via a mean absolute distance $S \in [0,1]$; filter outputs by a threshold (example $\delta=0.1$) to keep only “reliable” predictions and report improved AP on the retained subset.
Data engine + benchmark:
- Large synthetic chart generation (Matplotlib + Pyecharts) with style randomization and bilingual (English/Chinese) content; and the ChartY chart-to-dict benchmark (reported as $\sim$6K charts in the intro, bilingual and stylistically broader).
What experiments were performed?
- Structural Extraction (SE) evaluation on multiple sources (ChartQA-SE, PlotQA-SE, ChartX-SE, ChartY-en/zh), using SCRM mean AP under strict/slight/high tolerances; plus textual OCR accuracy using Reverse Edit distance (RE) for fields like title/source/axes.
- Comparisons to chart parsing baselines (e.g., UniChart, DePlot, ChartVLM, ChartAst), emphasizing performance when charts do not have explicit numeric labels.
- Ablations: auxiliary token presence and position (front vs behind), and training strategy (Stage2 vs Stage3 vs both).
- Downstream QA: combine OneChart’s extracted dict with LLM/VLM QA systems on ChartQA; report accuracy changes when giving models chart image, dict, or both.
What are the outcomes/limitations?
Outcomes (as reported):
- The authors report OneChart at 0.2B parameters achieves competitive or leading AP across multiple SE benchmarks in their evaluation (Table 2).
- Filtering by self-consistency distance threshold $\delta=0.1$ is reported to increase AP on the retained subset (example: ChartQA-SE AP@strict 72.02 $\rightarrow$ 81.97), though this comes at the cost of reduced coverage.
- Token position matters: placing `<Chart>` at the front performs better than at the end, consistent with causal attention (Table 4).
- The authors report QA accuracy improvements when feeding the extracted dict into LLaVA variants (Table 6).
Limitations / open questions (based on what’s in the paper):
Dataset and generalization scope:
- Synthetic data generator focuses primarily on bar/line (“barline”) and pie charts, with explicit constraints (e.g., barline charts “up to three legends”). Coverage of other chart types (scatter, area, stacked variants, multi-panel figures) is limited.
- ChartY benchmark is reported as $\sim$6K charts, but exact split sizes and per-type distributions are not provided.
- Generalization to chart styles outside the synthetic generation distribution is not systematically evaluated.
Architectural constraints:
- The auxiliary number decoder outputs a fixed-length vector of 256 values with padding. Behavior when charts contain many series or data points (e.g., dense scatter plots, time series with hundreds of points) is unclear.
- The paper does not analyze failure modes when the fixed-length representation is insufficient.
Evaluation methodology concerns:
- “Purified” results are reported on filtered subsets with reduced coverage. Full risk-coverage curves are not presented; only snapshots at specific thresholds ($\delta=0.1$).
- This shifts evaluation from “overall accuracy” to “accuracy at coverage,” but the trade-off is not fully characterized.
- The self-consistency distance may be biased toward certain chart types or value ranges; this is not analyzed.
Reproducibility gaps:
- The paper reports three training stages but does not provide full hardware specifications, wall-clock training time, or GPU hours.
- Stage-specific data mixing ratios and sampling strategies are not detailed.
- License complexity: while code is Apache-2.0, the model inherits constraints from Vary/OPT upstream dependencies, making the effective license “research use only” despite the stated Apache-2.0 for code.
Model
Task and output format
- Input: a chart image (resized to 1024$\times$1024).
- Output: a serialized Python-dict containing at least: `title`, `source`, `x_axis`, `y_axis`, and `value`.
Backbone architecture
- Built on a VLM-style encoder–decoder; authors choose Vary-tiny: a vision encoder “from SAM-base” paired with an autoregressive OPT-125M decoder, connected via a linear projection for channel alignment.
- Vision features occupy 256 tokens in the conversation template.
Auxiliary token and decoder
- Add the special token `<Chart>` at the start of the token sequence (Figure 3 pipeline depiction).
- Let the `<Chart>` hidden embedding be $t \in \mathbb{R}^{768}$. Feed $t$ into an auxiliary decoder $F$, described as a 3-layer MLP with 2 ReLU activations.
Data
Synthetic generation (“data engine”)
- Uses both Matplotlib and Pyecharts to render charts of varied styles/types (Figure 2 summarizes the pipeline).
- Adds a “chart source” field (beyond typical title/axes/body) to better match real-world charts; includes a two-stage rendering variant that stitches title/source after rendering the main body to increase layout diversity.
- Style randomization: random 16-bit color codes for text/graphics (beyond standard palettes), “hundreds” of fonts, variability in size/direction/quantity of visual elements.
- Scale: about 10M synthetic chart images with labels for pretraining.
- Chart types emphasized:
- Barline charts: single/multi column, single/multi line, combo; balanced between with/without numeric labels; up to 3 legends.
- Pie charts: labeled pies and legend-based pies in equal proportion.
- Text/content generation: random corpora for pretraining; GPT-3.5 prompts for more “logical/practical” themed data across domains (finance/education/technology/etc.).
Fine-tuning (SFT) data
- Total SFT data reported: 2.7M samples (Table 1).
- Mix includes ChartQA (real), PlotQA (real), and multiple GPT-3.5-driven synthetic sets in English and Chinese rendered via Matplotlib/Pyecharts.
ChartY benchmark
- Bilingual (English/Chinese) chart-to-dict benchmark with broader style/type diversity than prior benchmarks.
- Reported as $\sim$6K charts total (split between ChartY-en and ChartY-zh).
- Available via project page and GitHub repository.
Algorithms / Training
Conversation template and tokens
Uses Vicuna v1-style prompt template:
`USER: <img> [image] </img> Convert the key information of the chart to a Python-dict. ASSISTANT: <Chart> [texts output] </s>`
Adds `<img>`, `</img>`, and `<Chart>` to the OPT tokenizer as special tokens.
Objectives
Text generation loss: standard causal LM cross-entropy:
$$L_{\text{text}}(\theta, w) = -\mathbb{E}_{(w,v)\sim D}\log P_\theta(w_m \mid w_{<m}, v)$$
Auxiliary numeric loss: masked $L_1$ on non-padded entries:
$$L_{\text{num}}(\theta, u) = \mathbb{E}_{(u,t)\sim D}\left|F(t) - u\right|_{\text{masked}}$$
with ground-truth values min–max normalized and padded with “nan” to length 256.
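A minimal sketch of the auxiliary numeric target and the masked $L_1$ loss as described (min-max normalize the ground-truth numbers, pad with NaN to length 256, mask padded entries); tensor names and shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def build_numeric_target(values, length=256):
    """Min-max normalize ground-truth chart numbers and pad with NaN to fixed length."""
    v = torch.tensor(values, dtype=torch.float32)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    target = torch.full((length,), float("nan"))
    n = min(v.numel(), length)
    target[:n] = v[:n]
    return target

def masked_l1_loss(pred, target):
    """L1 loss computed over non-NaN (non-padded) entries only."""
    mask = ~torch.isnan(target)
    return F.l1_loss(pred[mask], target[mask])
```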
Three-stage training recipe (with reported compute)
Stage 1 (Pretraining):
- 10M synthetic charts
- Batch size 16; lr $10^{-4}$; 3 epochs
- Trains vision encoder + language model
- Loss: $L_{\text{text}}$
- Compute: 32 A100 (80G) for $\sim$12 hours
Stage 2 (Warmup aux decoder):
- 2.7M SFT samples
- Freeze vision encoder; train LM + auxiliary decoder
- Batch size 16; lr $5\times10^{-5}$; 1 epoch
- Loss: $L_{\text{text}} + L_{\text{num}}$
- Compute: 16 A100 (80G) for $\sim$3 hours
Stage 3 (SFT all params):
- Same SFT pool
- Train all parameters
- Batch size 16; lr $5\times10^{-5}$; 1 epoch
- Loss: $L_{\text{text}} + L_{\text{num}}$
- Compute: 24 A100 (80G) for $\sim$4 hours
Evaluation
Metrics
Textual OCR for `title`, `source`, `x_axis`, `y_axis`: uses normalized edit distance, reported as Reverse Edit distance (RE) = 1 - normalized edit distance (higher is better).
Structural extraction for value dict: compute tuples of (key, item) pairs and evaluate via SCRM mean Average Precision with tolerances:
- strict: $J_{thr}=0$, $e_{thr}=0$
- slight: $J_{thr}=2$, $e_{thr}=0.05$
- high: $J_{thr}=5$, $e_{thr}=0.1$
Benchmarks described
- ChartQA-SE / PlotQA-SE: derived from ChartQA and PlotQA test sets; PlotQA-SE emphasizes charts without explicit numeric annotations.
- ChartX-SE: Matplotlib-rendered bar/line/pie variants, with and without numeric labels.
- ChartY-en / ChartY-zh: authors’ added benchmark (Pyecharts, bilingual, partial numeric annotations).
Key reported results (selected)
Table 2 (SE AP, OCR RE): OneChart (“Ours”, 0.2B) reports higher AP than compared baselines across many settings, with notable gains on no-numeric-annotation scenarios (authors highlight this in the text).
Table 3 (raw vs purified): with threshold $\delta=0.1$, AP@strict increases after filtering (example: ChartQA-SE 72.02 $\rightarrow$ 81.97) while the number of evaluated samples decreases.
Table 4 (token position): <Chart> at sequence front outperforms “behind” and “no token” for ChartQA-SE and PlotQA-SE AP under all tolerances.
Table 6 (ChartQA QA): authors report that providing the extracted dict can substantially improve LLaVA1.5/1.6 accuracy in their setting (e.g., LLaVA1.5 avg 17.5 $\rightarrow$ 50.1 when given figure + dict; improvements are also shown for other configurations).
Hardware / Production
- Inference preprocessing: resize to 1024$\times$1024, scale pixels to [0,1], embed to $v \in \mathbb{R}^{256\times768}$, and decode until the `</s>` token.
- Reliability scoring (optional, see the sketch after this list): parse the text dict (the authors mention using `json.loads()`), extract numbers, min–max normalize to $u_r$, and compare to the auxiliary decoder output $u_c$ via:
$$S=\frac{1}{N}\sum_{i=1}^{N}|u_{r_i}-u_{c_i}|$$
and threshold $S$ to decide whether to “trust”/retain outputs.
- Efficiency note (authors’ claim): model size is $\approx$200M parameters and they report token-time comparisons (example: 1.3 ms vs 5.7 ms for “one token” compared to a much larger baseline), but the exact measurement setup is not fully detailed in the excerpted text.
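A sketch of the optional reliability check referenced above: parse the generated dict, min-max normalize its numbers to $u_r$, and compare against the auxiliary decoder output $u_c$ with the mean absolute distance $S$. `ast.literal_eval` is used here for parsing purely for illustration (the paper mentions `json.loads()`), and the threshold follows the example $\delta = 0.1$:

```python
import ast
import numpy as np

def _collect_numbers(obj):
    """Recursively collect numeric values from the parsed Python-dict output."""
    if isinstance(obj, bool):
        return []
    if isinstance(obj, (int, float)):
        return [float(obj)]
    if isinstance(obj, dict):
        return [x for v in obj.values() for x in _collect_numbers(v)]
    if isinstance(obj, (list, tuple)):
        return [x for v in obj for x in _collect_numbers(v)]
    return []

def reliability_score(dict_text, aux_numbers, delta=0.1):
    """Mean absolute distance S between text-derived and decoder-predicted numbers."""
    u_r = np.array(_collect_numbers(ast.literal_eval(dict_text)), dtype=float)
    u_r = (u_r - u_r.min()) / (u_r.max() - u_r.min() + 1e-8)   # min-max normalize
    n = min(len(u_r), len(aux_numbers))
    u_c = np.asarray(aux_numbers, dtype=float)[:n]
    s = float(np.mean(np.abs(u_r[:n] - u_c)))
    return s, s <= delta   # retain the output only if S is below the threshold
```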
SciGraphQA — Notes
Paper: SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs (arXiv 2023)
Authors: Shengzhi Li, Nima Tajbakhsh
Code: findalexli/SciGraphQA
Data: GitHub, HuggingFace
License: Research only (Palm-2/GPT-4 terms)
TL;DR: SciGraphQA is a large synthetic multi-turn QA dataset grounded in real scientific graphs extracted from ArXiv papers: the images are real, but the dialogues (questions/answers) are generated (Palm-2) from text context around each figure. The authors report current MLLMs perform poorly zero-shot on this benchmark (low CIDEr), but prompt augmentation with DePlot-extracted tables and fine-tuning a LLaVA-13B baseline materially improves scores.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ 0.65
- Secondary: $\Psi_{\text{Evaluation}}$ 0.20, $\Psi_{\text{Method}}$ 0.15
The primary contribution is a large-scale dataset/benchmark (SciGraphQA) with construction pipeline + release framing. The paper also evaluates multiple MLLMs on the dataset and introduces a baseline fine-tuning recipe plus a prompt-augmentation technique (DePlot tables).
What is the motivation?
Scientific papers communicate key results via graphs/figures that require interpretation in context (abstract + surrounding paragraph), often via interactive multi-turn explanation in real life (reading groups, presentations). Existing chart/graph VQA datasets either rely heavily on synthetic charts and/or template questions, or are much smaller in scale; the authors aim to scale “ChartQA-style” real-chart VQA to the academic graph domain. They also position “dialogue QA” as more useful than “caption prediction” for graphs because captions can be short/underspecified and not very helpful to a user.
What is the novelty?
Dataset construction (core novelty):
- Uses ~290,000 CS/ML ArXiv papers (2010–2020) and extracts figures (PDFFigures 2.0 is referenced)
- Focuses on graphs (not all figure types) and generates multi-turn QA dialogues about each graph using Palm-2 prompted with text context: title, abstract, figure caption, and the first paragraph mentioning the figure; OCR text is also included in the context input
- Scale: 295K multi-turn samples and 657K QA pairs/turns (multi-turn, average 2.23 turns per sample)
Prompt augmentation idea (secondary novelty):
- Prepend the question with a serialized data table extracted from the chart using DePlot (plot-to-table model) to mitigate weak OCR/text-reading in MLLMs (Figure 4 shows the “table then serialize” concept with the newline token `<0x0A>`)
What experiments were performed?
Dataset quality checks:
- GPT-4 is used as a judge to rate QA matching quality on a 3K test subset; authors report average 8.7/10, with distribution heavily skewed high (Figure 2, p.6)
Zero-shot MLLM evaluation:
- Zero-shot evaluation on 3K test samples with NLP overlap metrics: BLEU-4, ROUGE, CIDEr
- Models discussed include LLaVA variants, mPLUG-owl, BLIP-2, OpenFlamingo, and DePlot+LLM pipelines
Fine-tuning:
- Fine-tune LLaVA-13B on SciGraphQA: (1) 1 epoch on full dataset, then (2) additional fine-tuning on a 30K DePlot-augmented subset
- Study effect of training set size on fine-tuning performance (Figure 5, p.9)
What are the outcomes/limitations?
Outcomes (as reported):
- Baseline zero-shot scores are very low; prompt augmentation with DePlot tables improves some models; fine-tuning improves further (Table 2, p.8; Figure 5, p.9)
- Authors argue this indicates: (a) scientific graphs are out-of-distribution and hard for current MLLMs, and (b) external “chart-to-table” extraction can meaningfully help
Limitations / caveats:
- Synthetic answers: images are real graphs, but QA text is model-generated (Palm-2). So evaluation is against synthetic “ground truth,” and overlap metrics may reward stylistic similarity rather than correctness
- Filtering heuristic: they drop QA turns not containing any of `graph/diagram/figure/chart/axis/plot/table/image/visual/illustrat` (string match), which can remove legitimate conceptual follow-ups
- Metric choice: they mostly report BLEU/ROUGE/CIDEr for reproducibility; they discuss GPT-4 evaluation issues (inconsistency with chain-of-thought prompting; resource limits; nondeterminism)
- Internal inconsistency to note: the abstract text claims LLaVA-13B is “most performant” zero-shot at CIDEr 0.08, but Table 2 lists OpenFlamingo v2-7B at CIDEr 0.12, higher than 0.08. Also the paragraph mentions prompting GPT-3.5, while Table 2 labels DePlot+GPT-3. (This may be a reporting mismatch, but it is present in the paper text/tables)
Reproducibility Details
Model
Baselines evaluated:
- MLLMs: BLIP2-2.7B; mPLUG-owl-7B; LLaVA-7B; LLaVA-13B; OpenFlamingo v2-7B
- “Expert system” style: DePlot + (mPLUG-owl / LLaVA-13B / GPT-3.x)
Vision and language backbones (as described):
- Evaluations generally use a fixed CLIP/ViT vision encoder; mPLUG-owl is noted as an exception (encoder unfrozen in its own training, per related-work discussion)
- Fine-tuning: uses a LLaVA checkpoint with LLaMA-2-13B-chat as base model
Prompt augmentation (DePlot):
- DePlot produces a text table with chart title/legends and interpolated values; they serialize it into a single string and prepend to the question prompt (Figure 4, p.8)
Data
Source corpus:
- ~290,000 ArXiv papers (CS or stat.ML), 2010–2020
- Figures extracted (mentions PDFFigures 2.0); of ~2.1M figures, top categories include tables (23.6%), graphs (19.2%), flowcharts (8.5%); graphs chosen as focus
SciGraphQA dataset stats:
- Generation target: multi-turn dialogues about graphs using Palm-2, with context: title, abstract, caption, first paragraph referencing the figure, OCR text; in-context examples used in prompting (Figure 1, p.5; Appendix prompt on p.11–12)
- Pipeline yields 350K initial samples; after filtering, 295K multi-turn entries
- Total size: 59.1M tokens (byte-wise BPE). Avg turn count 2.23; 111K samples have 3+ turns
- Per-turn verbosity: avg 143 chars per question and 775 chars per answer; avg 39 tokens per question and 164 tokens per answer; tokens per sample 199 ± 98
Quality rating:
- GPT-4 judge on 3K test subset: average 8.7/10; authors report 86% are rated 8.5+; small tail of low ratings (Figure 2, p.6)
Dataset availability:
- Publicly available via GitHub repository and Hugging Face datasets
- GitHub repo contains multiple dataset artifacts: 295K training set and 3K test set, sizes “excluding images”
- Licensing note from repo README: “data, code and checkpoint is intended and licensed for research use only” with uses restricted to those that follow Palm-2, LLaMA and GPT-4 license agreements
- Practical guidance: treat as publicly accessible for research-only use; commercial applications should seek clarification from authors
Algorithms / Training
Question filtering:
- Removes questions deemed “not graph-related” via the keyword list `graph, diagram, figure, chart, axis, plot, table, image, visual, illustrat` (see the sketch below)
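A sketch of the string-match filter as described; case handling and substring-matching behavior are assumptions:

```python
GRAPH_KEYWORDS = ("graph", "diagram", "figure", "chart", "axis",
                  "plot", "table", "image", "visual", "illustrat")

def is_graph_related(question: str) -> bool:
    """Keep a QA turn only if the question mentions a graph-related keyword."""
    q = question.lower()
    return any(kw in q for kw in GRAPH_KEYWORDS)
```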
Fine-tuning recipe (LLaVA-13B baseline):
- Two-step: (1) 1 epoch on full 295K dataset; (2) fine-tune on 30K subset with DePlot-augmented prompts
- Optimization: cosine LR schedule, warmup ratio 3%, learning rate $5 \times 10^{-6}$ (explicitly lower than prior LLaVA instruction-tuning LR)
- PEFT: LoRA rank 64, LoRA dropout 0.05
Evaluation
Metrics and setup:
- Test set: 3K (image, question, answer) samples; compute BLEU-4, ROUGE, CIDEr between generated and reference answers
Key result table (Table 2, p.8):
| Model | Finetuned on SciGraphQA | DePlot table in prompt | CIDEr | BLEU-4 | ROUGE |
|---|---|---|---|---|---|
| BLIP2-2.7B | No | No | 0.007 | 0.003 | 0.10 |
| DePlot + mPLUG-owl-7B | No | Yes | 0.037 | 0.058 | 0.22 |
| mPLUG-owl-7B | No | No | 0.040 | 0.062 | 0.22 |
| LLaVA-7B | No | No | 0.048 | 0.070 | 0.18 |
| LLaVA-13B | No | No | 0.080 | 0.070 | 0.23 |
| OpenFlamingo v2-7B | No | No | 0.120 | 0.081 | 0.22 |
| DePlot + GPT-3 (label) | No | Yes | 0.130 | 0.098 | 0.226 |
| DePlot + LLaVA-13B | No | Yes | 0.153 | 0.106 | 0.273 |
| DePlot + SciGraphQA-baseline | Yes | Yes | 0.268 | 0.123 | 0.31 |
Additional evaluation notes:
- OpenFlamingo: authors attribute underperformance to not reproducing the retrieval-based in-context selection (RICE); random 3/6/9-shot in-context examples did not help in their trials
- Dataset-size scaling: performance improves with more training data; biggest gains in the first ~50% of dataset; authors speculate LoRA (vs full fine-tuning) may limit gains from 50% to 100% (Figure 5, p.9)
Hardware / Production
- Fine-tuning run uses DeepSpeed ZeRO-2 with 4× A100-80GB GPUs (Azure)
- Batch sizing: per-device batch size 16, gradient accumulation 2, effective global batch size 128
VisText: A Benchmark for Semantically Rich Chart Captioning
Paper: VisText: A Benchmark for Semantically Rich Chart Captioning
Authors: Benny J. Tang, Angie Boggust, Arvind Satyanarayan
Venue: ACL 2023 (Toronto, July 9–14, 2023)
Code + Dataset: mitvis/vistext
License: GNU GPL v3.0 (data and code)
Core artifact: VisText dataset with 12,441 charts, each paired with (a) rasterized image, (b) data table, and (c) scene graph
Chart types covered: area, bar, line
Task: generate semantically rich chart captions (including higher-level and trend statements, not just “what the chart shows”)
TL;DR
VisText is a chart-captioning benchmark designed to push beyond surface-level captions by including human-written, semantically rich statements (notably “value/trend” style content) and multiple machine-consumable chart representations (image, data table, scene graph). Their experiments suggest text-only representations (scene graph or table) outperform image-based approaches, and semantic prefix-tuning is mainly useful for controlling semantic level rather than improving standard metrics.
What kind of paper is this?
Dominant: 0.55 $\Psi_{\text{Resource}}$ (new benchmark + dataset + standardized inputs)
Secondary: 0.25 $\Psi_{\text{Method}}$ (semantic prefix-tuning to control caption semantic level)
Secondary: 0.20 $\Psi_{\text{Evaluation}}$ (systematic comparisons across representations, models, and metrics)
What is the motivation?
Prior chart captioning benchmarks and methods often emphasize surface descriptions (or omit deeper semantics), while assistive technologies and visualization authoring workflows benefit from captions that include derived insights (values, comparisons, trends). The paper frames semantically rich captions as more useful and closer to human-written chart descriptions.
What is the novelty?
- Dataset design for richer semantics: Each chart has multiple representations (image, data table, scene graph) and captions intended to be more semantically informative than older datasets.
- Multi-level captioning framing: Uses a semantic “level” lens (L1–L4) adapted from accessibility guidance and prior work, focusing on the first three levels for modeling.
- Semantic prefix-tuning: A fine-tuning approach that aims to generate different semantic levels without training separate models per level.
What experiments were performed?
- Quantitative benchmarking comparing:
- text-only models from scene graphs vs data tables
- multimodal/image-guided models (image-only and image+text hybrids)
- prior baseline (Kantharaj et al. 2022) using BLEU, perplexity, ROUGE-L, WMD, TER, and a relation-generation check.
- Ablations over different language-model backbones (ByT5-small, T5-small, BART-base) and prefixes.
- Qualitative error analysis categorizing common failure modes (identity/direction/value/etc.).
What are the outcomes/limitations?
- Outcome: Image-based captioning performed worst; text-only encodings (scene graph or table) performed best, with scene graphs often competitive with data tables.
- Outcome: L1 captions are much easier than L2/L3 (models score far higher on L1 than on richer statements).
- Limitation: Dataset focuses on three univariate chart types; generalization to more complex charts and real-world, messy charts remains open.
- Limitation: Models are evaluated via proxies; the paper calls out the need for work on interactive authoring and better grounding/linking of text to chart regions.
Model
Inputs (three chart representations)
Each example includes:
- Rasterized image of the chart,
- Underlying data table, and
- Scene graph derived from the rendered visualization.
Baseline families evaluated
- Text-only seq2seq: ByT5-style models mapping a text linearization of scene graph or data table to captions.
- Image-guided: VL-T5 variants using image features (alone or paired with text representations).
Semantic prefix-tuning
They propose prefix-tuning to generate captions at different semantic levels without training separate models; importantly, they note it does not necessarily improve standard metrics, but supports semantic-level control.
Data
Source data and cleaning
- Data tables come from the Statista public dataset; they convert tables to structured form, handle missing values, and normalize some date-like formats (e.g., week to datetime).
Chart generation (synthetic but structured)
- For each table, they iterate through field pairs to create univariate charts (quantitative measure vs categorical/temporal dimension), using Vega-Lite/Altair-like constraints (e.g., max rows/cols for certain field types).
- Final dataset size and composition: 12,441 charts across 3,189 area, 6,238 bar, 3,014 line, split approximately 80:10:10 train/val/test.
Caption construction (semantic levels)
They use a 4-level caption taxonomy (L1–L4) adapted from accessibility guidance and prior work, focusing on L1–L3 for this benchmark.
- L1 generation (synthetic): Templates produce a first sentence (3 templates) plus follow-on sentences (26 combinations), with randomized synonym phrasing and optional omissions (e.g., dropping scale words or outlier mentions).
- L2/L3 collection (crowdsourced): Captions are written by crowd workers; they are instructed to produce multiple “non-repeating” captions per chart and are screened for vision/location/approval criteria.
Dataset analysis of semantic content
On a 2% sample (230 captions), they find most semantic statements are L2 or L3, with L3 appearing about 1.4× as often as L2 in that coding.
Algorithms / Training
Representation preprocessing (critical implementation detail)
Before training, they minimize and linearize both scene graphs and tables to reduce token length:
- Scene graphs are reduced by preserving specific elements (title, axis labels/ticks, marks, mark coordinates/sizes) and removing other details, then linearized with a depth-first traversal.
- The paper reports average representation lengths (for the “reduced” forms) on the order of 948 characters for scene graphs vs 426 characters for data tables.
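A minimal sketch of the reduce-then-linearize idea described above (keep a whitelist of node types, then emit a depth-first traversal); the node schema, field names, and separator are assumptions, not VisText's actual format:

```python
KEEP_TYPES = {"title", "axis-label", "axis-tick", "mark"}  # assumed node-type names

def linearize_scene_graph(node, keep=KEEP_TYPES):
    """Depth-first traversal keeping only whitelisted nodes and key attributes."""
    parts = []
    if node.get("type") in keep:
        attrs = {k: node[k] for k in ("text", "x", "y", "width", "height") if k in node}
        attrs_str = " ".join(f"{k}={v}" for k, v in attrs.items())
        parts.append(f"{node['type']} {attrs_str}".strip())
    for child in node.get("children", []):
        parts.append(linearize_scene_graph(child, keep))
    return " | ".join(p for p in parts if p)
```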
Training setup (reported)
- Text models (ByT5): 50 epochs, batch size 1024, learning rate 5e-5, dropout 0.1.
- Image-guided VL-T5: 50 epochs, batch size 512, learning rate 2e-5.
Evaluation
Metrics
They use BLEU, ROUGE-L, WMD, TER, plus GPT-2-medium perplexity as a fluency proxy, and a “relation generation” check based on matching chart fields (title, axis names, etc.) mentioned in captions.
Main quantitative findings (Table 1)
On combined captions (L1+L2+L3), representative rows include:
- Prior baseline (Kantharaj et al. 2022): BLEU 0.30, perplexity 28.51.
- Text-only (scene graph, no PT): BLEU 0.34, perplexity 17.04.
- Text-only (data table, no PT): BLEU 0.34, perplexity 17.86.
- Image-only: BLEU 0.13, perplexity 51.65 (worst).
Overall narrative: image models underperform; scene graphs and data tables do well and are close to each other.
L1 vs L2/L3 difficulty
They explicitly report that models do much better on L1 than on L2/L3. For example, scene-graph model BLEU is 0.71 on L1 but about 0.06 on L2/L3 in their breakdown table.
Ablation observations
Backbone comparison suggests performance differences across ByT5-small, T5-small, and BART-base, with notes about prefix-tuning feasibility (they mention not being able to prefix-tune BART in their setup).
Qualitative error analysis (common failure modes)
In manual analysis of generated captions, they report broad error categories such as:
- Identity errors (wrong entity/label): 86 errors (22.93%).
- Direction errors (wrong trend direction): 32 errors (8.53%).
- Value errors (wrong numeric values): 12 errors (3.20%).
- Stability errors: 4 errors (1.07%).
- Repetition: 117 errors (31.2%).
- Nonsensical: 9 errors (2.4%).
Hardware / Production
Reported training hardware + runtime
They trained on 8× Titan XP (12 GB) GPUs, with 16 CPU cores and 128 GB RAM, reporting per-epoch training times on the order of minutes (varying by model and prefix-tuning).
Production status
No deployed system is claimed; the paper positions this as a benchmark and a step toward mixed-initiative authoring workflows and richer accessibility outputs rather than a production-ready captioning pipeline.
DePlot: One-shot Visual Language Reasoning by Plot-to-Table Translation
Paper: DePlot: One-shot visual language reasoning by plot-to-table translation
Authors: Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, Yasemin Altun
Venue: Findings of ACL 2023 (July 2023)
Code: google-research/google-research (DePlot path)
Models: google/deplot
License: Apache-2.0 (code/models); mixed dataset licenses (GPL-3.0 for ChartQA, CC-BY-4.0 for PlotQA data; see detailed notes below)
TL;DR
DePlot decomposes chart question answering into (1) plot-to-table translation (image $\rightarrow$ linearized table) and (2) LLM reasoning over the translated table using one-shot prompts. They train an image-to-text Transformer (initialized from MatCha) for plot-to-table, introduce a table-matching metric (RMSF1) meant to be more structure-aware than prior number-set matching, and report large gains on human-written ChartQA when combining DePlot with LLM prompting.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$. The headline contribution is a new modality conversion module (DEPLOT) plus a plug-and-play pipeline to use LLMs for chart QA with one-shot supervision, with extensive benchmark results.
- Secondary: $\Psi_{\text{Evaluation}}$. They propose and validate a new metric (RMSF1) and run a human-correlation study to argue it better reflects extraction quality than RNSS.
- Secondary: $\Psi_{\text{Resource}}$ (weaker). They standardize task formats/metrics for plot-to-table and train on a combined corpus, but the core novelty is not a new dataset release.
(These labels follow the provided taxonomy.)
What is the motivation?
- End-to-end chart QA systems need large task-specific finetuning sets and still struggle on complex human-written questions (example given: MatCha at 38.2% on ChartQA human).
- LLMs have strong few-shot reasoning, but the missing piece is getting chart content into a form LLMs can reliably use without bespoke multimodal training.
- Prior chart extraction systems are often pipeline-based (OCR + rules + detection) and chart-type-specific, with inconsistent evaluation metrics.
What is the novelty?
- Two-stage decomposition:
- Convert chart image to a linearized table (markdown-style table text).
- Use an off-the-shelf LLM with one-shot prompting to answer questions from the table.
- DEPLOT model: an end-to-end image-to-text Transformer finetuned specifically for plot-to-table translation, intended to be chart-type-agnostic (line, dot, bar, pie).
- RMSF1 metric: a table similarity metric that (a) accounts for row and column headers plus values, (b) supports approximate matching for numeric/text errors, (c) exposes precision vs. recall, and (d) is invariant to row/column permutations and transposition.
- Prompting stack: evaluates Chain-of-Thought, Self-Consistency, and Program-of-Thought prompting (including executing generated Python for arithmetic).
What experiments were performed?
- Plot-to-table evaluation: PlotQA plot-to-table reconstruction compared against ChartOCR (pipeline), PaLI-17B finetuned variants, MatCha off-the-shelf, and DePlot (new finetune). Metrics: RNSS and RMSF1.
- Downstream QA: ChartQA (augmented + human) and PlotQA (v1 + v2) QA accuracy with 5% numeric tolerance; compares fully supervised models vs. DePlot+LLM one-shot variants.
- Metric validation: human study on 50 plot-table pairs with 6 annotators; correlates RNSS vs. RMSF1 with human ratings.
- OOD check: annotate 10 TaTa charts (excluding choropleths) to estimate generalization; report average RMSF1 and qualitative failure modes.
What are the outcomes and limitations?
Key outcomes
- ChartQA human: best reported DePlot+LLM setup reaches 67.6%, compared with MatCha 38.2% (reported as +29.4 absolute points).
- ChartQA augmented: DePlot+FlanPaLM+Codex PoT SC reported at 91.0%, comparable to strong supervised baselines (MatCha 90.2).
- PlotQA QA: DePlot+LLM underperforms MatCha on synthetic PlotQA (example: 66.6 vs. 91.5 average in their table), despite doing well on ChartQA human.
- Metric: RMSF1 correlates better with human judgments than RNSS (Pearson’s $r$ 0.87 vs 0.46; Spearman’s $\rho$ 0.96 vs 0.84).
Limitations and failure modes
- Loss of visual attributes: plot-to-table drops information like color, which breaks questions referencing “gray bar”, “red line”, etc. They show a concrete ChartQA failure where the LLM answers “Yes” for the wrong reason because the table has no color encoding.
- Synthetic-query bias: they argue PlotQA’s templatic synthetic questions can be exploited by supervised finetuning, which one-shot prompting cannot leverage.
- OOD robustness unclear: in a small TaTa sample, they observe distraction by adjacent text and trouble with arrow-linked labels; they mention cropping helps in that setting.
- Scope: they note DEPLOT is not suited to visual language without a clear latent textual structure (example: some textbook figures).
Reproducibility Details
Model
DEPLOT architecture and representation
- Model type: image-to-text encoder-decoder Transformer, trained autoregressively to emit a table left-to-right.
- Initialization: starts from MatCha weights and continues finetuning specifically for plot-to-table conversion.
- Table format: markdown-like linearization, using `|` separators between cells and newlines between rows.
- Parameter count: DEPLOT 282M parameters (LLMs used are much larger: FlanPaLM 540B; GPT-3 and Codex reported around 175B).
DEPLOT + LLM interface
- Output table is appended with the question and a one-shot demonstration; they evaluate natural-language reasoning (CoT) and code-generation reasoning (PoT) plus self-consistency. Figure 1 illustrates the overall flow.
Data
Plot-to-table finetuning corpus
Three sources mixed 1:1:1 (only training splits used to reduce leakage into downstream eval):
- Synthetic plots from Liu et al. (MatCha pretraining pipeline): 270K plot-table pairs.
- ChartQA plot-table pairs: 22K.
- PlotQA plot-table pairs: 224K.
QA benchmarks and test sizes
- ChartQA: human and machine (augmented) sets, each reported as 1,250 QA pairs with 625 (human) and 987 (machine) tables.
- PlotQA: v1 and v2 each with 33K tables; QA pairs reported as 1.2M (v1) and 4.3M (v2).
Algorithms and training
Training
- Training steps: 10k steps.
- Max sequence length: 512 tokens (they note MatCha used 192; they extend to fit longer tables).
- Inference temperature for DePlot: 0 (deterministic decoding).
Prompting and decoding for LLMs
- LLM prompting temperature: 0.4 in their experiments.
- Self-consistency: majority vote across 10 samples (they also combine CoT and PoT samples in a joint voting scheme for best ChartQA results).
- Program-of-Thought: prompt Codex to emit Python snippets, then execute to compute the final answer; motivated by more reliable arithmetic.
- Prompts are shown in the appendix figures (Figure 3: CoT-style table reasoning; Figure 4: Python-code variant).
Evaluation
Plot-to-table metrics
RNSS (baseline)
- Treats predicted and target tables as unordered sets of numbers and matches them with a relative-distance cost, then normalizes by max set size.
They define relative distance: $$ D(p, t)=\min\left(1, \frac{|p-t|}{|t|}\right) $$ and compute a minimal-cost matching matrix $X$ over predicted set $P$ and target set $T$, with: $$ RNSS = 1 - \frac{\sum_i\sum_j X_{ij} D(p_i, t_j)}{\max(N, M)} $$ as reported in the paper.
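A sketch of RNSS as defined above, using `scipy.optimize.linear_sum_assignment` for the minimal-cost matching $X$ (empty-set handling is an added assumption):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rnss(pred_numbers, target_numbers):
    """Relative Number Set Similarity between predicted and target number sets."""
    P = np.asarray(pred_numbers, dtype=float)
    T = np.asarray(target_numbers, dtype=float)
    N, M = len(P), len(T)
    if N == 0 or M == 0:
        return 1.0 if N == M else 0.0          # edge-case handling is an assumption
    # D(p, t) = min(1, |p - t| / |t|), computed pairwise
    cost = np.minimum(1.0, np.abs(P[:, None] - T[None, :]) / (np.abs(T[None, :]) + 1e-9))
    rows, cols = linear_sum_assignment(cost)   # minimal-cost matching X
    return 1.0 - cost[rows, cols].sum() / max(N, M)
```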
RMSF1 (proposed)
- Represents a table as an unordered set of (row header, column header, value) triples.
- Matches entries by header similarity (normalized Levenshtein with thresholding) and value similarity (relative error with thresholding), then computes precision/recall and $F1$.
- Handles transposition by scoring both table orientations and taking the better score.
The core idea is that similarity is high only when both header keys and values align; unlike RNSS, this penalizes “right numbers, wrong structure”.
Plot-to-table results
On PlotQA plot-to-table reconstruction (their Table 4):
- ChartOCR: RNSS 81.0, RMSF1 60.1
- PaLI-17B (224 res): RNSS 77.2, RMSF1 24.8
- PaLI-17B (588 res): RNSS 90.5, RMSF1 74.9
- MatCha: RNSS 95.4, RMSF1 92.3
- DePlot: RNSS 97.1, RMSF1 94.2
The PaLI 224 vs 588 contrast is used to argue input resolution matters for chart extraction.
Downstream QA results
Key rows from their main results table (Table 5):
- MatCha: ChartQA aug 90.2, human 38.2, avg 64.2; PlotQA avg 91.5
- DePlot+FlanPaLM+Codex PoT SC: ChartQA aug 91.0, human 67.6, avg 79.3; PlotQA avg 66.6
- They also report intermediate variants (GPT-3 CoT/SC, FlanPaLM CoT/SC, Codex PoT SC) showing the effect of prompting and tool-use.
Error analysis highlights
- “Visual attribute” questions (color, shape, orientation) are a recurring failure because the table encoding omits those attributes. Their Table 7 example centers on not being able to identify “highest value of the gray bar” after translation.
- Another failure mode: imperfect alignment between plotted points and x-axis labels can cause incorrect table reconstruction (their Table 11 example).
Hardware and production
- Training hardware: 64 TPUv3 on GCP.
- Training time: roughly 5 hours for DEPLOT.
Release & Licensing Notes
What “other (synthetic) data” DePlot uses
- For plot-to-table training, DePlot trains on a 1:1:1 mix of:
- synthetic data generated by Liu et al. (2023a),
- synthetic data generated by Methani et al. (2020) (the same synthetic source used in PlotQA), and
- real-world data crawled by Masry et al. (2022) (the same source used in ChartQA, sourced from Statista / Pew Research / Our World in Data / OECD).
- In their Table 2 training stats, they explicitly list “synthetic (by us)” = 270K alongside ChartQA and PlotQA portions.
- They also note they use only training-set charts from ChartQA/PlotQA for training (to avoid leakage).
Was that synthetic data shared publicly, or just code/concept?
DePlot authors’ explicit release statement
- The DePlot paper points to “Code and models” (a repo link in a footnote), but it does not explicitly say “we release the 270K synthetic plot–table pairs” as a downloadable dataset.
- Their ethics statement says training/eval data are synthetic via rules or publicly available web data with permissive licenses—but that’s about the inputs’ provenance, not a clear statement that they redistribute the full synthetic corpus.
Best-supported takeaway: from the paper text alone, it looks like they publicly shared code + model artifacts, and they describe the synthetic data + sources, but they don’t clearly claim a public release of the “synthetic (by us) 270K” dataset itself.
If shared, what licenses apply?
ChartQA
- Availability: ChartQA dataset is distributed via the repo and also via a Hugging Face dataset.
- License: Repo lists GPL-3.0.
PlotQA
- License breakdown (explicitly stated):
- Dataset: CC-BY-4.0
- Models + code: MIT
DePlot (code/models)
- Paper statement: “Code and models” are provided via the google-research repo path referenced in the paper.
- Model (Hugging Face): Apache-2.0
- google-research repository (where DePlot code lives): repo license is Apache-2.0
Questions answered
Was the (synthetic) data shared publicly?
- ChartQA / PlotQA: yes, they’re publicly distributed.
- DePlot’s “synthetic (by us) 270K”: the paper does not clearly state that this synthetic corpus is released as a dataset download; it only explicitly calls out code + models.
Or just the code?
- For DePlot: code + models are explicitly shared.
Or just the concept?
- DePlot also fully describes the idea + training mixture, but the explicit “released artifact” callout is code + models.
If anything was shared, what license was attached to the data/code?
- ChartQA: GPL-3.0
- PlotQA: Dataset CC-BY-4.0; Models/code MIT
- DePlot: Code in google-research under Apache-2.0; HF model card shows Apache-2.0
Open point (what I couldn’t verify from the paper text alone)
- Exact license (if any) for DePlot’s newly-generated “synthetic (by us) 270K” plot–table pairs isn’t stated in the paper sections we have—so I can’t responsibly claim a data license for that specific synthetic corpus without checking the repo/docs that accompany the code release.
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Paper: Masry et al., UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning (arXiv:2305.14761v3, 10 Oct 2023)
Code: vis-nlp/UniChart
Models: UniChart checkpoints referenced in the paper (via the GitHub repository)
License (corpus): Varies by source—see Section 3 for per-source licensing details
TL;DR
UniChart is an end-to-end chart vision-language pretrained model (chart image encoder + text decoder) trained on a large real-world chart corpus (611K charts) with multiple chart-specific pretraining objectives (table extraction, reasoning, open-ended QA, summarization). The authors use knowledge distillation and GPT-based summary generation to address the lack of high-quality chart summaries in real-world data. The model reports strong results across ChartQA, OpenCQA, Chart-to-Text, and Chart-to-Table, with an 11× inference speedup over MatCha while using 28% fewer parameters.
What kind of paper is this?
$\Psi_{\text{Method}}$ 0.50, $\Psi_{\text{Resource}}$ 0.35, $\Psi_{\text{Evaluation}}$ 0.15
Dominant: Method (0.50) — The paper introduces a chart-specific vision-language pretraining approach with novel objectives (notably, bootstrapped summarization via GPT distillation) and demonstrates an end-to-end OCR-free architecture for chart comprehension.
Secondary: Resource (0.35) — The authors release a 611K chart pretraining corpus assembled from multiple real-world sources, along with ~470K GPT-distilled summaries and synthetic reasoning QA pairs (5.3M examples). The paper explicitly states the corpus and code are publicly available.
Tertiary: Evaluation (0.15) — The paper evaluates on four established benchmarks (ChartQA, OpenCQA, Chart-to-Text, Chart-to-Table) and includes human + ChatGPT-based evaluation for summarization quality, plus error analysis. However, it does not introduce new evaluation protocols or benchmarks.
Key Questions
1. What is the motivation?
The authors identify a gap: while chart understanding requires both low-level data extraction (table generation) and high-level reasoning/text generation (QA, summarization), existing work either (a) relies on external OCR engines, (b) trains primarily on synthetic charts, or (c) focuses narrowly on specific tasks. MatCha, a notable prior chart pretraining model, was trained largely on textual reasoning datasets (e.g., DROP, FeTaQA), which may limit visual reasoning capacity. UniChart aims to build a single “universal” chart model that handles multiple chart tasks without external OCR, trained on diverse real-world charts rather than synthetic-only data.
2. What is the novelty?
Key insight: The authors treat the lack of high-quality chart summaries as a data acquisition problem. They bootstrap summaries using GPT models—directly prompting ChatGPT for some sources, and distilling a Flan-T5 XL summary generator (trained on 3,700 GPT-generated examples) to produce ~470K summaries for charts lacking captions. This GPT-augmented pretraining corpus is then used for chart-to-text objectives.
Technical contributions:
- Real-world chart corpus: 611K charts from diverse sources (OWID, OECD, Pew, Statista, PlotQA, etc.), explicitly prioritizing real over synthetic charts.
- Four-task pretraining framework: data table generation, numerical/visual reasoning (90 templates), open-ended QA (T5-generated questions from summaries), and chart summarization (GPT-distilled).
- Summary generation pipeline: 3,700-sample GPT dataset → finetune Flan-T5 XL → generate ~470K summaries; additionally use ChatGPT + OCR text for Pew charts without data tables.
- End-to-end OCR-free architecture: Swin Transformer encoder + BART decoder, following Donut’s design principles but specialized for charts.
3. What experiments were performed?
The authors evaluate UniChart on four benchmarks:
- ChartQA (relaxed accuracy): factoid question answering over bar, line, and pie charts.
- OpenCQA (BLEU): open-ended question answering over similar chart types.
- Chart-to-Text (BLEU, human eval, ChatGPT eval): summarization on Pew Research and Statista charts.
- Chart-to-Table (RNSS, RMS): data extraction on ChartQA and WebCharts (zero-shot).
Additional evaluations:
- Human evaluation for summarization informativeness (4-level taxonomy: visual encoding, statistical/relational, perceptual/cognitive, contextual/domain) and factual correctness.
- ChatGPT-based evaluation as a proxy for human judgment.
- Efficiency comparison: inference speed and parameter count versus MatCha.
4. What are the outcomes?
Main results (Table 2 in paper):
| Benchmark | Metric | UniChart | MatCha | Notes |
|---|---|---|---|---|
| ChartQA | RA | 88.56 | 90.2 | MatCha higher overall; UniChart claims advantage on human-written questions |
| OpenCQA | BLEU | 14.88 | 12.2 | UniChart +2.68 BLEU |
| Chart-to-Text (Pew) | BLEU | 43.92 | 38.2 | UniChart +5.72 BLEU |
| Chart-to-Text (Statista) | BLEU | 66.24 | 64.2 | UniChart +2.04 BLEU |
| Chart-to-Table (ChartQA) | RNSS / RMS | 94.01 / 91.10 | 85.21 / 83.49 | UniChart substantially better |
| Chart-to-Table (WebCharts, zero-shot) | RNSS / RMS | 60.73 / 43.21 | 44.37 / 17.94 | UniChart substantially better |
Human evaluation (Table 3): UniChart zero-shot summaries scored highest in informativeness among model outputs (both human and ChatGPT ratings). Finetuned UniChart further reduces factually incorrect sentences versus MatCha (Table 7).
Efficiency: The authors report UniChart is 11× faster than MatCha with 28% fewer parameters (though absolute numbers are not provided in the cited sections).
5. What are the limitations and a good follow-up?
Limitations acknowledged by the authors:
- Overpopulated charts: Charts with many elements (e.g., dense legends, many bars) can confuse the model and reduce summary quality.
- Factual errors: Generated summaries can contain factual inaccuracies (error analysis example provided).
- OCR dependency for some data: Pew charts lack data tables, so the authors rely on layout-preserving OCR text fed to ChatGPT for summary generation, introducing potential OCR errors.
Assumptions baked into the approach:
- Pretraining uses large amounts of automatically constructed supervision (template-based reasoning QAs, synthetic open-ended QA from summaries, and GPT-generated summaries), assuming these transfer to real downstream distributions.
- The corpus-building stance explicitly excludes some synthetic chart datasets, which reduces control over chart type/style coverage but prioritizes realism.
Concrete follow-up experiment:
Ablate “summary source quality” versus downstream gains: Keep UniChart architecture fixed, but pretrain the summarization objective with (a) original dataset summaries, (b) ChatGPT-generated summaries, (c) Flan-T5-distilled summaries, then measure downstream changes on Chart-to-Text and OpenCQA, plus factual error rates (Table 7-style analysis). This directly tests whether the LLM-bootstrapped summaries are the causal driver of gains and where they help/hurt. Motivation: The authors explicitly replace some original summaries with ChatGPT-generated ones and produce ~470K via Flan-T5 XL distillation—isolating this design choice would clarify its impact.
Reproducibility Details
Model
Architecture:
- Chart image encoder: Swin Transformer with patch embeddings and shifted-window attention, following Donut’s encoder design.
- Text decoder: BART decoder; task-specific textual prompts are fed to the decoder, and output is generated conditioned on the image encoding + prompt (Figure 1 in paper).
- OCR-free: End-to-end design with no external OCR preprocessing.
Parameter count: The paper claims 28% fewer parameters than MatCha but does not state the absolute count in the cited sections.
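For orientation, a minimal inference sketch of this Donut-style encoder-decoder setup using Hugging Face's `VisionEncoderDecoderModel` and `DonutProcessor`. The checkpoint path and task prompt below are placeholders, not the authors' released identifiers.

```python
from transformers import VisionEncoderDecoderModel, DonutProcessor
from PIL import Image
import torch

model_id = "path/to/unichart-checkpoint"        # placeholder, not a verified hub id
processor = DonutProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)   # Swin encoder + BART decoder

image = Image.open("chart.png").convert("RGB")
prompt = "<extract_data_table> <s_answer>"      # illustrative task prompt, not the exact token
pixel_values = processor(image, return_tensors="pt").pixel_values
decoder_input_ids = processor.tokenizer(
    prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    out = model.generate(pixel_values, decoder_input_ids=decoder_input_ids, max_length=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```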
Data
Pretraining corpus (Table 4): 611,934 charts from diverse sources:
- Our World in Data (OWID): CC-BY license
- Web Data Commons (WDC): Apache License (software)
- Pew Research Center: permissive license with attribution required
- OECD: downloadable/publishable with appropriate credit
- Statista: permissive license for scientific purposes
- Other publicly available datasets: PlotQA, Beagle, ChartInfo, ExcelChart400K, LineCap, Neural Captions (licenses per original releases)
Important: The paper does not specify a single unified license for the released corpus package; users must comply with terms of each underlying data source.
Pretraining supervision scale (Table 1):
| Objective | Examples |
|---|---|
| Data Table Generation | 601,686 |
| Numerical & Visual Reasoning | 5,334,247 |
| Open-ended QA | 481,097 |
| Chart Summarization | 481,303 |
Summary generation pipeline:
- Created 3,700-sample dataset using in-context “table → caption” prompting.
- Finetuned Flan-T5 XL on this dataset.
- Used finetuned model to generate ~470K summaries for PlotQA, augmented charts, OWID, OECD.
- Prompted ChatGPT (gpt-3.5-turbo) to generate summaries for Statista and Pew charts.
- For Pew charts (no data tables), extracted layout-preserving OCR text and fed to ChatGPT.
Algorithms / Training
Pretraining schedule (Table 6):
- Stage 1: 300K steps at 512×512 resolution, LR 1e-4, batch size 160, checkpoint every 50K steps.
- Stage 2: 100K steps at 960×960 resolution, LR 1e-4, batch size 80, checkpoint every 50K steps.
Hardware: One 4×A100 (40GB), one 4×A100 (80GB), and one 4×V100 (32GB) machine.
Finetuning (per benchmark):
- ChartQA: 20 epochs, LR 5e-5
- Pew: 200 epochs, LR 5e-5
- Statista: 100 epochs, LR 5e-5
- OpenCQA: 200 epochs, LR 5e-5
Task-specific batch sizes and GPU configurations are provided in Table 6.
Evaluation
Metrics:
- ChartQA: Relaxed Accuracy (RA)
- OpenCQA, Chart-to-Text: BLEU (authors note limitations of BLEU for summarization)
- Chart-to-Table: RNSS (Relative Number Set Similarity), RMS (Relative Mapping Similarity)
Human evaluation criteria:
- Informativeness: 4-level semantic taxonomy (visual encoding, statistical/relational, perceptual/cognitive, contextual/domain)
- Factual correctness: count of factually incorrect sentences
ChatGPT evaluation: Used as a proxy for human judgment on summarization quality.
Reproducibility assessment:
- ✅ Datasets and code: Authors state corpus + code are publicly available via GitHub.
- ✅ Architecture: Encoder (Swin) and decoder (BART) clearly described; end-to-end design specified.
- ✅ Training hyperparameters: Pretraining/finetuning details (steps/epochs, LR, batch size, GPU type) provided in Table 6.
- ⚠️ Summary generation pipeline: High-level process described, but exact prompts for GPT models and OCR tool specifics not fully detailed.
- ✅ Compute requirements: GPU machines listed (A100 40GB/80GB, V100 32GB) with staged pretraining resolutions.
Optimized Table Tokenization for Table Structure Recognition
- Paper: Optimized Table Tokenization for Table Structure Recognition
- Code: Not publicly released
- Models: Not publicly released
TL;DR
OTSL (Optimized Table Structure Language) reduces table structure vocabulary from 28+ HTML tokens to 5 tokens (C, L, U, X, NL) with backward-only syntax rules enabling on-the-fly validation during autoregressive decoding. Evaluated in TableFormer on PubTabNet, FinTabNet, and PubTables-1M, OTSL achieves approximately $2\times$ inference speedup versus HTML while maintaining or improving tree edit distance (TEDs) and cell mAP@0.75, with particularly large gains on FinTabNet (all-TEDs 0.959 vs 0.920; mAP 0.862 vs 0.722).
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$ — Proposes a novel tokenization language with syntactic constraints for table structure recognition.
- Secondary: $\Psi_{\text{Evaluation}}$ — Compares HTML versus OTSL representations across multiple datasets, model configurations, and metrics (TEDs, mAP, latency).
- Minor: $\Psi_{\text{Resource}}$ — States intent to release popular TSR datasets converted to OTSL format.
What is the motivation?
Image-to-Markup-Sequence (Im2Seq) table structure recognition typically reuses general-purpose HTML tokenization, which was not designed for autoregressive decoding efficiency. HTML presents several challenges:
- Large vocabulary: Requires at least 28 tokens to cover common rowspan/colspan attributes; skewed token frequency distribution complicates learning.
- Variable row lengths: Rows with complex spanning produce longer token sequences, making positional encoding and attention less effective.
- Late error detection: Invalid HTML outputs are difficult to detect early during generation; partial sequences often violate structural consistency but remain syntactically valid markup.
- Attention drift: Long sequences on large tables cause output misalignment, particularly in later rows; bounding box predictions degrade.
What is the novelty?
OTSL representation:
- 5-token vocabulary representing a rectangular grid:
- C: new cell (anchor for cell region top-left)
- L: merge with left neighbor (horizontal span continuation)
- U: merge with upper neighbor (vertical span continuation)
- X: merge with both left and upper (2D span interior)
- NL: end-of-row marker
- Fixed-width rows: All rows have equal token count, each terminated with NL, regardless of spanning complexity.
- Backward-only syntax rules: Each token can be validated using only previously generated tokens, enabling incremental constraint enforcement during decoding:
- Left neighbor of L must be C or L
- Upper neighbor of U must be C or U
- Left neighbor of X must be U or X; upper neighbor must be L or X
- First row allows only C and L
- First column allows only C and U
- All rows have equal length, terminated by NL
Error mitigation: Invalid token predictions signal decoding errors; one proposed heuristic replaces the highest-confidence invalid token with the next-highest valid candidate until syntax rules are satisfied.
Efficiency claim: Example comparison (Figure 1) reports 12 HTML tokens versus 5 OTSL tokens for vocabulary; 55 HTML tokens versus 30 OTSL tokens for sequence length on the same table structure.
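To make the token semantics concrete, a small sketch (my own, not from the paper) that derives OTSL tokens from a grid in which each slot is labeled with the id of the logical cell that occupies it.

```python
def grid_to_otsl(grid):
    """Convert a rectangular grid of cell ids into OTSL tokens (sketch).
    Each slot holds the id of the cell spanning it; a new id yields C,
    horizontal continuation L, vertical continuation U, 2D interior X."""
    tokens = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            left = row[c - 1] if c > 0 else None
            up = grid[r - 1][c] if r > 0 else None
            if cell == left and cell == up:
                tokens.append("X")
            elif cell == left:
                tokens.append("L")
            elif cell == up:
                tokens.append("U")
            else:
                tokens.append("C")
        tokens.append("NL")
    return tokens

# 2x2 grid where cell "a" spans both columns of the first row:
print(grid_to_otsl([["a", "a"], ["b", "c"]]))   # ['C', 'L', 'NL', 'C', 'C', 'NL']
```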
What experiments were performed?
Architecture: TableFormer (encoder-decoder transformer for Im2Seq TSR) with separate structure and bounding box decoders.
Hyperparameter sweep (Table 1): Compared HTML versus OTSL on PubTabNet with encoder/decoder layer variations:
- enc=6, dec=6, heads=8
- enc=4, dec=4, heads=8
- enc=2, dec=4, heads=8
- enc=4, dec=2, heads=8
Metrics: TEDs (split by simple/complex/all tables), mAP@0.75 for cell boxes, single-core CPU inference time (AMD EPYC 7763 @ 2.45 GHz).
Cross-dataset evaluation (Table 2): Selected best configuration (enc=6, dec=6, heads=8) and trained/evaluated on:
- PubTabNet: 395K samples
- FinTabNet: 113K samples
- PubTables-1M: ~1M samples
Same metrics as hyperparameter sweep; OTSL outputs converted back to HTML for TEDs computation.
Qualitative analysis: Figures 5-6 show bounding box predictions on sparse and many-row complex tables, comparing token counts and visual alignment quality.
What are the outcomes/limitations?
Outcomes:
Latency: Approximately $2\times$ speedup across configurations:
- PubTabNet (enc=6, dec=6): 2.73s (OTSL) vs 5.39s (HTML)
- FinTabNet: 1.85s (OTSL) vs 3.26s (HTML)
- PubTables-1M: 1.79s (OTSL) vs 3.26s (HTML)
Accuracy:
- PubTabNet: Similar all-TEDs (0.955 for both); improved mAP (0.880 vs 0.857)
- FinTabNet: Large gains reported — all-TEDs 0.959 vs 0.920; mAP 0.862 vs 0.722
- PubTables-1M: Gains on both metrics — all-TEDs 0.977 vs 0.966; mAP 0.896 vs 0.889
Qualitative: Reduced bounding box drift and overlap on long/sparse tables compared to HTML; HTML sometimes fails to terminate correctly or shows misalignment in later rows.
Limitations:
- Syntactic validity ≠ structural correctness: Valid OTSL sequences can still represent incorrect table structures; syntax rules only enforce grid consistency.
- Heuristic error correction: Token replacement strategy is not formally evaluated; no comparison with beam search or constrained decoding alternatives.
- Single architecture family: Evidence limited to TableFormer; transfer to object detection + graph neural network TSR pipelines not demonstrated.
- Dataset release unconfirmed: OTSL-converted datasets stated as “will be made publicly available” but release details not provided in paper.
- Training details omitted: GPU type/count, training duration, hyperparameter search methodology not specified; limits reproducibility.
Model
Task Framing
- Input: Table image
- Output: Autoregressive token sequence representing table structure (HTML baseline vs OTSL)
TableFormer Integration
- Structure decoder generates structure tags (HTML or OTSL)
- Separate decoder predicts table cell bounding boxes
- Architecture: encoder-decoder transformer with configurable depth (2-6 encoder layers, 2-6 decoder layers, 8 attention heads in reported experiments)
Data
Datasets
| Dataset | Size | Notes |
|---|---|---|
| PubTabNet | 395K | Scientific papers, HTML ground truth converted to OTSL |
| FinTabNet | 113K | Financial reports, HTML ground truth converted to OTSL |
| PubTables-1M | ~1M | Scientific documents, HTML ground truth converted to OTSL |
Representation Conversion
Ground truth from all datasets converted to OTSL format for training/evaluation. Predicted OTSL sequences converted back to HTML to compute tree edit distance metrics against original HTML ground truth.
Benchmark Licensing
Evaluation benchmarks use publicly available datasets with permissive licensing:
| Benchmark | License | Commercial Use |
|---|---|---|
| PubTabNet | CDLA-Permissive-1.0 | ✓ |
| FinTabNet | CDLA-Permissive-1.0 | ✓ |
| PubTables-1M | CDLA-Permissive-2.0 | ✓ |
All three benchmarks distribute annotations under Community Data License Agreement (CDLA) Permissive terms, which allow broad use, modification, and sharing. Source images may have separate licensing (particularly PubMed Central Open Access images in PubTabNet); users should review upstream provenance for complete rights assessment.
Algorithms / Training
OTSL Language Definition
Tokens: {C, L, U, X, NL}
Semantics: Each table cell region has a C token at its top-left anchor position. Other grid locations within the cell’s span are filled with L (horizontal continuation), U (vertical continuation), or X (2D interior) according to adjacency.
OTSL Syntax Rules
- Left-looking: Left neighbor of L must be C or L
- Up-looking: Upper neighbor of U must be C or U
- Cross rule: Left neighbor of X must be U or X; upper neighbor must be L or X
- First row: Only C and L allowed
- First column: Only C and U allowed
- Rectangular: All rows equal length, terminated by NL
Error Detection and Mitigation
- During generation, validate each token against backward-checkable syntax rules
- Invalid token indicates decoding error
- Proposed heuristic: if highest-confidence token violates rules, replace with next-highest confidence valid token
- No beam search or formal constrained decoding comparison provided
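A minimal sketch of the backward-only validation and the replacement heuristic described above; the grid bookkeeping and the final fallback choice are my assumptions, not the paper's implementation.

```python
def valid_token(grid, tok, row_len=None):
    """Check whether `tok` may be appended given the backward-only rules above.
    `grid` holds decoded rows, the last entry being the partial current row;
    `row_len` is the fixed row length once the first row has been closed."""
    row, col = len(grid) - 1, len(grid[-1])
    left = grid[row][col - 1] if col > 0 else None
    up = grid[row - 1][col] if row > 0 and col < len(grid[row - 1]) else None
    if tok == "NL":
        return row_len is None or col == row_len        # rectangular rows
    if row == 0 and tok in ("U", "X"):
        return False                                    # first row: only C, L
    if col == 0 and tok in ("L", "X"):
        return False                                    # first column: only C, U
    if tok == "L":
        return left in ("C", "L")
    if tok == "U":
        return up in ("C", "U")
    if tok == "X":
        return left in ("U", "X") and up in ("L", "X")
    return True                                         # C is always allowed

def pick_token(probs, grid, row_len=None):
    """Replacement heuristic: if the argmax token is invalid, fall back to the
    next most confident token that satisfies the rules (probs: token -> prob)."""
    for tok in sorted(probs, key=probs.get, reverse=True):
        if valid_token(grid, tok, row_len):
            return tok
    return "C"                                          # a bare cell is always legal
```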
Evaluation
Metrics
- TEDs (Tree Edit Distance): Structural accuracy, reported separately for simple/complex/all tables after OTSL-to-HTML conversion
- mAP@0.75: Mean average precision at 0.75 IoU threshold for cell bounding boxes
- Inference time: Single-core CPU latency (AMD EPYC 7763 @ 2.45 GHz)
Quantitative Results
Table 1: PubTabNet hyperparameter sweep (OTSL vs HTML)
| Config | Repr. | Simple TEDs | Complex TEDs | All TEDs | mAP | Time (s) |
|---|---|---|---|---|---|---|
| 6,6,8 | OTSL | 0.965 | 0.934 | 0.955 | 0.880 | 2.73 |
| 6,6,8 | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
| 4,4,8 | OTSL | 0.946 | 0.880 | 0.927 | 0.853 | 1.97 |
| 4,4,8 | HTML | 0.952 | 0.907 | 0.938 | 0.835 | 3.77 |
| 2,4,8 | OTSL | 0.939 | 0.860 | 0.915 | 0.844 | 1.91 |
| 2,4,8 | HTML | 0.942 | 0.905 | 0.931 | 0.824 | 3.81 |
| 4,2,8 | OTSL | 0.960 | 0.899 | 0.942 | 0.849 | 1.22 |
| 4,2,8 | HTML | 0.949 | 0.887 | 0.931 | 0.821 | 2.00 |
Table 2: Cross-dataset evaluation (enc=6, dec=6, heads=8)
| Dataset | Repr. | Simple TEDs | Complex TEDs | All TEDs | mAP | Time (s) |
|---|---|---|---|---|---|---|
| PubTabNet | OTSL | 0.965 | 0.934 | 0.955 | 0.880 | 2.73 |
| PubTabNet | HTML | 0.969 | 0.927 | 0.955 | 0.857 | 5.39 |
| FinTabNet | OTSL | 0.968 | 0.944 | 0.959 | 0.862 | 1.85 |
| FinTabNet | HTML | 0.946 | 0.869 | 0.920 | 0.722 | 3.26 |
| PubTables-1M | OTSL | 0.983 | 0.965 | 0.977 | 0.896 | 1.79 |
| PubTables-1M | HTML | 0.978 | 0.941 | 0.966 | 0.889 | 3.26 |
Qualitative Observations
- Figure 5 (sparse table): OTSL produces cleaner bounding box alignment with less overlap; HTML shows drift; token count 258 (HTML) vs 135 (OTSL)
- Figure 6 (many-row complex table): OTSL captures repeating horizontal merge pattern and completes sequence correctly; HTML misses merges, ends with incorrect termination, shows drift/overlap
Hardware / Production
- Inference timing: Single CPU core (AMD EPYC 7763 @ 2.45 GHz) for all reported experiments
- Training compute: GPU type, count, and wall-clock training duration not specified
ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries
Paper: Rahman et al., ChartSumm: A Comprehensive Benchmark for Automatic Chart Summarization of Long and Short Summaries (arXiv:2304.13620v3)
Code/Dataset: pranonrahman/ChartSumm, Google Drive
License: Unspecified (no LICENSE file found in repository; treat reuse rights as not granted)
TL;DR
ChartSumm is a chart-to-text benchmark dataset with 84,363 chart samples, each with chart images, metadata, and paired summaries spanning short system-generated summaries (Knoema) and longer descriptive human-written summaries (Statista). The authors benchmark T5 and BART variants and report that while models can produce fluent summaries, they often fail on factual correctness, trend description, and exhibit hallucination.
What kind of paper is this?
$\Psi_{\text{Resource}}$ 0.70, $\Psi_{\text{Evaluation}}$ 0.30
Dominant: Resource (0.70) — The headline contribution is a new large-scale benchmark dataset for chart summarization with defined splits, dual summary regimes (short vs. long), and documented chart types/topics. The authors explicitly release the dataset and code.
Secondary: Evaluation (0.30) — The paper includes baseline benchmarking with T5 and BART variants, cross-dataset generalization comparisons (Chart-To-Text), manual error analysis of 100 sampled generations, and multilingual exploration (Bengali). However, it does not introduce novel evaluation protocols or metrics.
Key Questions
1. What is the motivation?
Chart summarization benefits visually impaired users and improves information retrieval by converting chart/tabular insights into natural language. The field is data-constrained: prior datasets are limited in size, coverage, or summary quality/type (e.g., captioning-only or template-generated descriptions). ChartSumm aims to address this by providing a larger, more varied benchmark with both short and long summaries from diverse chart types and topics.
2. What is the novelty?
Dataset scale and dual-summary regimes:
- 84,363 charts with images, metadata, and summaries from two sources:
- Knoema: 43,179 charts with short descriptions generated by its “digital data assistant” (Yodatai), primarily year-indexed line charts
- Statista: 41,184 charts with human-written descriptive summaries; includes multiple chart types with “simple vs. complex” categorization
Test set design for summary length:
- test-k: From Knoema, featuring “precise and well structured” shorter summaries
- test-s: From Statista, featuring longer descriptive summaries
Multilingual exploration:
- Bengali expansion via machine translation with human-translated test set and mT5 baseline
Key distinction from prior work: The dual-summary regime allows evaluation of models across different summary styles (short/precise vs. long/descriptive), which prior datasets did not systematically address.
3. What experiments were performed?
Baselines:
- T5-Base
- BART-Base, BART-Large-CNN, BART-Large-XSUM
Training regimes:
- Fine-tune on ChartSumm full dataset, ChartSumm-K (Knoema subset), and ChartSumm-S (Statista subset)
- Compare against models fine-tuned on Chart-To-Text (Kantharaj et al., 2022)
- Evaluate on multiple test sets including Chart-To-Text Statista test split
Metrics:
- BLEU, BLEURT (base-128), CIDEr
- Perplexity (using pretrained GPT-2)
- Content Selection (CS; Wiseman et al., 2017)
Error analysis:
- Manual review of 100 sampled generations, categorizing common failures: factual errors (wrong numbers/units), wrong trend descriptions, uninformative summaries, hallucinated irrelevant facts
4. What are the outcomes?
Main results (as reported):
- BART-Large variants tend to lead on BLEURT, CIDEr, and CS metrics
- T5 tends to achieve best perplexity across tests
- Models fine-tuned on ChartSumm-S reportedly outperform Chart-To-Text trained baselines even on the Chart-To-Text test set, suggesting stronger generalization
- Models trained on Chart-To-Text transfer poorly to ChartSumm-K (short/precise style), indicating Chart-To-Text is less suited for structured summaries
Limitations and failure modes:
- Critical issue: Even fluent outputs often contain wrong facts/units (e.g., “million” vs. “billion”), incorrect trend interpretation, and hallucinated attributes not grounded in chart metadata
- Evaluation gap: Primarily automatic-metric-driven; qualitative error analysis reveals that BLEU/BLEURT scores do not correlate well with factual correctness
- Generalization concerns: The authors acknowledge that models trained on one summary style (short vs. long) do not transfer well to the other
Contrast to UniChart: While UniChart (2023) also addresses chart-to-text tasks, ChartSumm focuses on providing a benchmark with dual summary styles and explicit error categorization, whereas UniChart emphasizes pretraining on diverse chart tasks (table extraction, reasoning, QA, summarization) with GPT-distilled summaries.
5. What are good follow-ups?
Concrete experiment: Factual consistency as a primary metric
Establish a factual consistency evaluation protocol (e.g., using structured claim extraction + verification against chart data tables) and re-rank model outputs by factual correctness rather than fluency-oriented metrics. This directly addresses the paper’s core finding that automatic metrics miss critical factual errors.
Motivation: The manual error analysis reveals that the highest-scoring outputs by BLEU/BLEURT often contain factually incorrect numbers or trends. A follow-up that prioritizes factual grounding would better align evaluation with the downstream task requirements (especially for accessibility applications where accuracy is critical).
Reproducibility Details
Model
Task formulation:
- Table-to-text summarization using chart metadata
- Input: chart metadata (title + data table + labels)
- Table is flattened row-wise and concatenated with caption/title using separator tokens
- T5-style prompting uses the prefix “Summarize chart: ” (see the linearization sketch after this list)
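A minimal sketch of this input linearization; the separator strings below are illustrative assumptions, not the authors' exact tokens.

```python
def linearize_chart(title, table, sep=" | ", row_sep=" && "):
    """Flatten a chart's data table row-wise and prepend the title, roughly
    matching the table-to-text input format described above. The separators
    here are illustrative, not the authors' exact choices."""
    rows = [sep.join(str(v) for v in row) for row in table]
    return f"Summarize chart: {title}{row_sep}{row_sep.join(rows)}"

example = linearize_chart(
    "Annual revenue 2019-2021 (in billion U.S. dollars)",
    [["Year", "Revenue"], ["2019", "12.1"], ["2020", "13.4"], ["2021", "15.0"]],
)
print(example)
```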
Baselines:
- T5-Base
- BART-Base
- BART-Large-CNN
- BART-Large-XSUM
Data
Sources and composition:
ChartSumm contains 84,363 charts from two sources:
Knoema (43,179 charts):
- Crawled ~110,000 statistics, filtered to publicly available sources
- Charts are year-based line charts
- Summaries: short descriptions from Knoema’s digital assistant (Yodatai)
Statista (41,184 charts):
- Crawled ~750,000 pages
- Charts categorized as simple vs. complex based on column count
- Summaries: descriptive human-written text
Chart type distribution (Statista):
- Bar: 64.70%
- Line: 33.76%
- Pie: 1.54%
Topic coverage (via LDA):
- Economy & Politics: 21.60%
- Society & Science: 13.03%
- Internet & Media: 11.43%
- Public life & Health: 10.42%
- Sports & Entertainment: 9.14%
- Consumer Goods: 7.71%
- Retail & Trade: 5.35%
- Education: 5.32%
Preprocessing:
- Tokenize title/caption text, remove extraneous whitespace/newlines, and apply stemming
- Normalize numeric entities
- Missing x-axis labels assigned heuristically (Year/Month/Day/Quarter/Country/City/Area)
- NER-based types for companies, social media, etc.
- Chart type classification uses ChartReader
Splits (80/10/10):
| Split | Knoema | Statista | Total |
|---|---|---|---|
| Train | 34,503 | 32,985 | 67,488 |
| Validation | 4,338 | 4,101 | 8,439 |
| Test | 4,338 | 4,098 | 8,436 |
| Total | 43,179 | 41,184 | 84,363 |
Overlap with Chart-To-Text:
- Overlap defined as >90% token similarity in captions
- Reported overlaps (Statista samples): 5,338 total (4,144 simple, 1,194 complex)
Dataset statistics:
| Source subset | Avg cell count | Avg summary length (tokens/chars) | Avg title length (tokens/chars) |
|---|---|---|---|
| Knoema | 55.44 | 34.76 / 207.69 | 8.86 / 57.18 |
| Statista (simple) | 13.31 | 46.96 / 288.68 | 9.58 / 63.08 |
| Statista (complex) | 37.95 | 55.54 / 340.19 | 10.54 / 67.08 |
Algorithms / Training
Fine-tuning setup (English baselines):
- Epochs: 3
- Batch size: 8
- Initial learning rate: $1 \times 10^{-6}$
- Optimizer: AdamW
- Loss: Cross-entropy
- Implementation: HuggingFace Transformers
- Compute: Google Colab (specific GPU type not specified)
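For reference, a Hugging Face Transformers fine-tuning sketch that mirrors the reported hyperparameters; the toy dataset, tokenization details, and sequence lengths are placeholders, not the authors' pipeline.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy stand-in for the linearized (chart metadata -> summary) pairs.
raw = Dataset.from_dict({
    "source": ["Summarize chart: Revenue by year | 2020 | 12.1 | 2021 | 13.4"],
    "target": ["Revenue grew from 12.1 in 2020 to 13.4 in 2021."],
})

def tokenize(batch):
    enc = tokenizer(batch["source"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["target"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="chartsumm-bart-base",
    num_train_epochs=3,             # as reported
    per_device_train_batch_size=8,  # as reported
    learning_rate=1e-6,             # as reported; AdamW is the Trainer default
)

trainer = Seq2SeqTrainer(
    model=model, args=args, train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```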
Bengali experiment (multilingual):
- Translation:
- Train/validation: machine-translated using NLLB
- Test: human-translated by undergraduate students proficient in English and Bengali
- Model: mT5 pretrained on multilingual XL-SUM
- Fine-tuning:
- Epochs: 4
- Batch size: 8
- Initial learning rate: $1 \times 10^{-6}$
- Optimizer: AdamW
- Loss: Cross-entropy
- Evaluation: BLEU only (BLEURT/CIDEr not used due to lack of Bengali-specific models)
Evaluation
Metrics (English):
- BLEU (n-gram overlap)
- BLEURT (base-128)
- CIDEr
- Perplexity (via GPT-2)
- Content Selection (CS; Wiseman et al., 2017)
Manual error analysis (100 samples):
Categories identified:
- Wrong facts/units (e.g., “million cubic meters” vs. gold “billion cubic meters”)
- Incorrect trend interpretation
- Uninformative outputs (restates framing without key numbers/trends)
- Hallucinated unrelated details (e.g., company headquarters not in chart/metadata)
Key finding: Automatic metrics (BLEU, BLEURT) do not reliably correlate with factual correctness. High-scoring outputs can contain critical factual errors.
Hardware / Production
- English experiments conducted in Google Colab
- Detailed hardware specs (GPU type, wall-clock training time) not specified in paper
Reproducibility Assessment
- ✅ Dataset available: Released via GitHub + Google Drive with documented structure
- ⚠️ License unclear: No LICENSE file in repository; reuse rights not explicitly granted
- ✅ Baseline code: Available in GitHub repository
- ✅ Training hyperparameters: Fully specified (epochs, batch size, LR, optimizer)
- ⚠️ Hardware details: Limited (only “Google Colab” mentioned; no GPU type or training time)
- ✅ Evaluation protocol: Metrics clearly documented; error analysis methodology described
- ⚠️ Preprocessing details: High-level description provided; exact heuristics for label assignment may require code inspection
ChartQA — Notes
Paper: ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning (Findings of ACL 2022)
Authors: Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, Enamul Hoque
Code: vis-nlp/ChartQA
Data: vis-nlp/ChartQA
License: GPL-3.0
TL;DR: ChartQA is a benchmark for question answering over real-world charts requiring visual references (color, position, size) and multi-step logical or arithmetic reasoning. It contains 9,608 human-written questions and 23,111 machine-generated questions over 20,882 charts from Statista, Pew Research, OWID, and OECD. The paper evaluates a pipeline combining chart data extraction (extended ChartOCR) with TableQA-style transformers augmented with image features (VL-T5, VisionTaPas), showing that VisionTaPas achieves 61.12% accuracy on human questions with gold tables but drops to 44.33% with extracted tables.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ 0.60
- Secondary: $\Psi_{\text{Method}}$ 0.25, $\Psi_{\text{Evaluation}}$ 0.15
The primary contribution is a new large-scale benchmark with real-world charts and human-authored questions. The paper also introduces models (VisionTaPas) and provides detailed evaluation, but these serve to establish baselines for the benchmark.
What is the motivation?
Existing chart question answering datasets and models under-serve:
- Complex reasoning questions requiring multiple operations (difference, sum, max, etc.)
- Questions that refer to visual attributes (e.g., “orange line”, “rightmost bar”)
- Human-authored language with natural variation, not small template sets
- Real-world chart styles with diverse layouts and visual complexity
Prior datasets (DVQA, PlotQA, FigureQA) are largely synthetic and/or template-based, limiting their ability to evaluate systems on realistic chart understanding tasks.
What is the novelty?
Benchmark contributions:
- ChartQA-H (human-authored): 9,608 questions collected via Amazon Mechanical Turk, with workers explicitly asked to write compositional and visual questions requiring multi-step reasoning or visual references
- ChartQA-M (machine-generated): 23,111 questions generated from human-written Statista chart summaries using a two-stage T5 pipeline (answer extraction + answer-aware question generation), then filtered for answerability
- Real-world chart diversity: 20,882 charts from four sources (Statista, Pew, OWID, OECD) covering multiple domains and chart types
- Open-vocabulary answers: Unlike prior work with closed-vocabulary or template-based answers, ChartQA requires generating free-form text or numeric answers
Modeling contributions:
- VisionTaPas: extends TaPas with a cross-modality encoder that fuses ViT image features with TaPas table encodings via cross-attention blocks
- Operation extension: adds SUBTRACT and DIVIDE operations to TaPas (which originally supported SUM, COUNT, AVERAGE) to handle difference and ratio questions
- Extended ChartOCR: adapts ChartOCR to output fully-structured data tables (not just mark values) by adding text recognition with CRAFT and associating values to labels using positional and color information
What experiments were performed?
Dataset construction:
- ChartQA-H annotation: AMT workers write 2 questions + answers per chart; second annotator answers the same questions; manual resolution on disagreements
- ChartQA-M generation: fine-tune T5 on SQuAD, apply to Statista summaries; filter questions whose answers don’t appear in chart data; manual check of 1,250 pairs found 86.64% valid
Models evaluated:
- T5: flatten table + question into text sequence; generate answer
- TaPas: table encoder with row/column embeddings; heads for aggregation + cell selection
- VL-T5: T5 with visual mark features from Mask R-CNN (36 objects)
- VisionTaPas: TaPas extended with ViT image encoder and cross-modality fusion via cross-attention
Experimental conditions:
- Gold tables: use ground-truth data tables extracted from chart sources
- Extracted tables: use tables predicted by their extended ChartOCR pipeline
Ablations:
- Impact of operation extension (SUBTRACT + DIVIDE) on TaPas and VisionTaPas
- Comparison of gold vs. extracted tables to isolate extraction quality impact
What are the outcomes/limitations?
Main results (relaxed accuracy: non-numeric exact match, numeric within 5% relative error):
| Model | ChartQA-H (Gold) | ChartQA-H (Extracted) | ChartQA-M (Gold) | ChartQA-M (Extracted) |
|---|---|---|---|---|
| T5 | 55.88 | — | 64.32 | — |
| TaPas | 58.29 | 42.61 | 70.77 | 55.82 |
| VL-T5 | 53.93 | — | 65.28 | — |
| VisionTaPas | 61.12 | 44.33 | 74.47 | 59.88 |
Key findings:
- VisionTaPas outperforms all baselines on both human and machine-generated questions when using gold tables
- Accuracy drops significantly with extracted tables: VisionTaPas goes from 61.12% to 44.33% on ChartQA-H, indicating data extraction is a major bottleneck
- Operation extension matters: adding SUBTRACT + DIVIDE improves VisionTaPas from 65.19% to 74.47% on ChartQA-M (gold table)
- Data extraction accuracy: overall extraction accuracy on ChartQA reported as 83.85% using a normalized distance metric with linear assignment
Question type distribution (sample of 300 human questions):
| Type | Percentage |
|---|---|
| Compositional | 43.0% |
| Visual | 10.7% |
| Both visual + compositional | 33.3% |
| Data retrieval | 13.0% |
76.33% of questions require compositional reasoning or combine visual + compositional reasoning.
Limitations called out by the authors:
- Extraction brittleness: modular pipeline breaks when table extraction fails; motivates end-to-end approaches
- Separate representations: current approach combines table and vision “separately then combine”; authors propose semantic graph representations to better capture chart structure
- Nested reasoning difficulty: even with correct extraction, models struggle with multi-step computations (e.g., subtraction then sum over multiple years)
- Extraction metric limitations: their adapted ChartOCR metric ignores noisy chart text (tick labels) and needs refinement
Reproducibility Details
Data
Chart sources:
- Statista: crawled publicly available charts
- Pew Research: crawled charts (images only, no underlying tables available)
- Our World in Data (OWID): crawled charts with underlying data
- OECD: crawled charts with underlying data
What they store:
- Chart images
- Underlying data tables (when available)
- Metadata: title, type, source
- SVG files (when available, used to extract bounding boxes for training extraction models)
- Text descriptions (for machine-generated question synthesis)
Dataset splits:
| Split | ChartQA-H | ChartQA-M |
|---|---|---|
| Train | 3,699 charts (7,398 Q) | 15,474 charts (20,901 Q) |
| Val | 480 charts (960 Q) | 680 charts (960 Q) |
| Test | 625 charts (1,250 Q) | 987 charts (1,250 Q) |
Chart type distribution (Statista-M subset, Table 3):
- Bar: 15,223
- Line: 1,768
- Pie: 150
Human annotation procedure (ChartQA-H):
- AMT workers write 2 questions + answers per chart, focusing on compositional and visual questions
- Second annotator answers the same questions independently
- Manual resolution on disagreements
- Agreement metrics: exact-match 61.04%; manual check on 500 samples yields 78.55% when accounting for typos/lexical variation
- Compensation: $0.6 per task (estimated 3–5 minutes), relative to US minimum wage ($7.25/hour at time of study)
Machine augmentation procedure (ChartQA-M):
- Fine-tune T5 on SQuAD
- Apply to Statista chart summaries:
- Answer extraction model: generate candidate answers from summary
- Answer-aware question generation model: conditioned on (answer + summary)
- Filter: remove questions whose answer is not found in chart data table
- Quality check: manual verification of 1,250 generated pairs found 86.64% valid
- Test set manually cleaned to ensure quality
Model
VisionTaPas Architecture
Extends TaPas with cross-modal fusion:
- ViT encoder: processes image patches (standard Vision Transformer)
- TaPas encoder: processes question + flattened table with row/column embeddings
- Cross-modality encoder: 4 blocks, each containing:
- Bidirectional cross-attention (image ↔ table)
- Self-attention layers
- Feed-forward layers
- Residual connections
- TaPas heads: aggregation head (selects operation: SUM, COUNT, AVERAGE, SUBTRACT, DIVIDE) and cell selection head
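A PyTorch sketch of one such cross-modality block. The hidden size, head count, update order, and the omission of layer norms are my simplifications, not the released architecture.

```python
import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Bidirectional cross-attention between image-patch and table-token states,
    followed by per-stream self-attention and feed-forward layers, with residual
    connections (layer norms omitted for brevity)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.img_to_tab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tab_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_tab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_img = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_tab = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img, tab):
        # Cross-attention: each stream queries the other modality.
        img = img + self.tab_to_img(img, tab, tab)[0]
        tab = tab + self.img_to_tab(tab, img, img)[0]
        # Per-stream self-attention and feed-forward, with residuals.
        img = img + self.self_img(img, img, img)[0]
        tab = tab + self.self_tab(tab, tab, tab)[0]
        img = img + self.ffn_img(img)
        tab = tab + self.ffn_tab(tab)
        return img, tab

# e.g. fuse ViT patch states with TaPas token states across 4 blocks:
blocks = nn.ModuleList(CrossModalityBlock() for _ in range(4))
img = torch.randn(1, 197, 768)   # ViT patches (+CLS)
tab = torch.randn(1, 128, 768)   # question + flattened table tokens
for blk in blocks:
    img, tab = blk(img, tab)
```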
VL-T5 Visual Features
- Train Mask R-CNN (ResNet-101 backbone) to detect chart marks
- Object categories: xAxisTitle, yAxisTitle, legend, label, bar, pie, line, etc.
- Pad object features to 36 objects (fixed size)
- Feed visual features alongside text into T5
Operation Head Extension
- Original TaPas supports: SUM, COUNT, AVERAGE
- Extension adds: SUBTRACT, DIVIDE
- Supervision: heuristic rules for selecting operand cells
- Noise level: manual check of 100 questions found 24% noisy labels
Data Extraction Pipeline
Extends ChartOCR to output structured tables:
- Value detection: detect numeric values on chart (ChartOCR baseline)
- Text recognition: add CRAFT detector to recognize axis labels, legend text, tick labels
- Value-to-label association: use positional information (x/y coordinates) and color matching to link values to semantic labels
- Table construction: assemble into structured rows and columns
Training
All training on 1 Tesla P100 GPU.
| Model | Epochs | Time | Batch Size | Learning Rate |
|---|---|---|---|---|
| TaPas | 30 | ~10 hours | 32 | 5e-5 |
| VisionTaPas | 20 | ~30 hours | 16 | 5e-5 |
| T5 | 20 | ~11 hours | 8 | 5e-4 |
| VL-T5 | 20 | ~15 hours | 16 | 5e-4 |
All models fine-tuned from pretrained checkpoints (TaPas-Base, T5-Base, VL-T5-Base).
Evaluation
QA metric (relaxed accuracy):
- Non-numeric answers: exact string match after normalization (lowercase, strip whitespace, remove articles)
- Numeric answers: correct if within 5% relative error of ground truth
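A minimal sketch of the relaxed-accuracy check described above; the string normalization and number parsing details are assumptions, not the official evaluation script.

```python
def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed match: numeric answers within 5% relative error,
    otherwise normalized exact string match (sketch)."""
    def normalize(s: str) -> str:
        s = s.strip().lower()
        return " ".join(w for w in s.split() if w not in {"a", "an", "the"})
    try:
        p = float(pred.replace(",", "").rstrip("%"))
        g = float(gold.replace(",", "").rstrip("%"))
        return abs(p - g) <= tol * abs(g) if g != 0 else p == g
    except ValueError:
        return normalize(pred) == normalize(gold)
```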
Data extraction metric (adapted from ChartOCR):
- Compute cost based on normalized distance between ground-truth and predicted values
- Solve minimum-cost linear assignment between gt and predictions
- Overall score: 1 minus average normalized cost
- Limitation noted by authors: metric ignores noisy chart text (tick labels, axis titles); better metrics needed
Error analysis (by question type):
The authors perform manual error analysis on a sample of failures, identifying:
- Extraction errors: incorrect or missing values in extracted table
- Reasoning errors: model fails to perform correct operation sequence even with correct extraction
- Visual grounding errors: model cannot resolve visual references (e.g., “rightmost bar”)
Ethical Considerations
Data sourcing:
- Used publicly available charts under source terms (Statista, Pew, OECD, OWID)
- Authors state they complied with each source’s terms of service
- AMT annotators anonymized
Annotator compensation:
- Estimated 3–5 minutes per task
- Paid $0.6 per task
- Authors frame relative to US minimum wage at time ($7.25/hour)
Potential for misuse:
- Authors explicitly note models could be abused to mislead about chart content or manipulate chart-based arguments
- Recommend caution in deployment for public-facing applications without human oversight
Practical Takeaways
If we were to use ChartQA:
- Best for: evaluating chart QA systems on real-world language with explicit visual references and multi-step reasoning
- ChartQA-H is the gold standard: human-authored questions with natural language variation, typos, synonyms
- ChartQA-M is larger: useful for pretraining or augmentation, but slightly lower quality (86.64% valid)
- Extraction quality is critical: performance drops ~17 absolute points when using extracted vs. gold tables, so any deployment would need robust extraction
- Operation space matters: if your questions involve arithmetic (differences, ratios), extending operation heads significantly improves TaPas-style models (+9 points for VisionTaPas on ChartQA-M)
Debugging strategy:
- Separate extraction errors from reasoning errors
- The authors’ results show extraction accounts for roughly half the error gap
- Test with gold tables first to isolate model reasoning capabilities
Chart types:
- Benchmark heavily skewed toward bar charts (>88% of Statista-M)
- Line and pie charts underrepresented
- Results may not generalize to other chart types (scatter, heatmap, etc.)
License considerations:
- GNU GPL-3.0 applies to code and dataset files
- Underlying charts from third-party sources have their own terms
- For commercial use, review source terms (Statista especially restrictive)
Syntax-Aware Network for Handwritten Mathematical Expression Recognition
- Paper: Syntax-Aware Network for Handwritten Mathematical Expression Recognition
- Data: HME100K Dataset (download portal)
- Code/Models: Not publicly released
TL;DR
Grammar-constrained decoder for handwritten mathematical expression recognition using syntax-aware attention and stack-based tree expansion. Achieves 53-56% exact match on CROHME benchmarks and 67% accuracy on HME100K, a new 100K-image dataset with complex camera-captured handwriting.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$ — Proposes grammar-driven parse tree decoding with syntax-aware attention and stack-based traversal for handwritten mathematical expression recognition.
- Secondary: $\Psi_{\text{Resource}}$ — Introduces HME100K, a 100k-image dataset collected from approximately 10,000 writers.
- Secondary: $\Psi_{\text{Evaluation}}$ — Defines expression-level accuracy and structure-only protocol (ESPR) with ablations on grammar and attention.
What is the motivation?
Encoder-decoder systems for handwritten mathematical expression recognition (HMER) often predict character-by-character, which produces structural errors on 2D math layouts and messy handwriting. Even tree-based decoders can behave like sequential models without explicit grammar constraints. The paper addresses this by embedding syntax constraints directly into the decoding process, predicting components and subtrees according to syntactic relationships rather than next-token likelihood.
What is the novelty?
- Grammar formulation: Converts LaTeX into a parse tree with constraints following reading order (left-to-right, top-to-bottom) and spatial relations between symbols.
- Tree expansion decoding: Predicts production rules to expand non-terminals while traversing the tree with a stack-based algorithm.
- Syntax-aware attention: Accumulates attention only along the path from root to current node, reducing drift between unrelated components.
- Attention self-regularization: Uses a reversed decoder to predict parent nodes and regularizes forward vs. reversed attention with a KL term during training.
What experiments were performed?
The model is evaluated on CROHME 2014, 2016, 2019 (offline images rendered from InkML strokes) and HME100K (camera-captured handwriting). Metrics include ExpRate (exact expression match), relaxed ExpRate $\leq 1$ and $\leq 2$ (tolerating one or two symbol-level errors), and ESPR (structure correct regardless of symbol labels). Comparisons are made against prior HMER systems including DWAP-TD and BTTR. Ablations isolate the effects of grammar syntax and syntax-aware attention.
What are the outcomes/limitations?
Outcomes:
- On CROHME 2014/2016/2019, SAN achieves the highest ExpRate among non-augmented methods and reports further gains with data augmentation.
- On HME100K, SAN outperforms DWAP, DWAP-TD, and BTTR on overall accuracy and the hard subset (51.5% vs. 45.4-46.0%), with higher inference speed (23.9 FPS vs. 3.9-23.3 FPS on V100).
- Ablations show grammar rules contribute the majority of improvement, with additional gains from syntax-aware attention.
Limitations:
- Distorted or overlapping components can cause under-translation or over-translation errors.
Model
Grammar Formulation
SAN defines a grammar $G = (N, \Sigma, R, S, \Gamma, C, D)$ with non-terminals $N$, terminals $\Sigma$, production rules $R$, start symbol $S$, spatial relations $\Gamma$, encoder $C$, and decoder $D$.
Spatial relations: The system uses 7 relations (right, above, below, lower right, upper left, upper right, inside) derived from 9 base relations by removing redundant directions under reading-order constraints.
Non-terminals: $S$ (expression) and $E$ (extendable structure with relation slots).
Production rules:
- $S \rightarrow \sigma S \mid E \mid \epsilon$ where $\sigma \in \Sigma$
- $E \rightarrow [((\gamma_1)S \mid \epsilon), \ldots, ((\gamma_7)S \mid \epsilon)]$ where $\gamma_i \in \Gamma$
The grammar generates parse trees where leaves are terminals or relations and internal nodes are non-terminals. LaTeX is recovered via preorder traversal.
Encoder
A DenseNet backbone processes grayscale input $X \in \mathbb{R}^{1 \times H \times W}$ to produce a feature map of size $C \times \frac{H}{\zeta} \times \frac{W}{\zeta}$, flattened to $E(X) = [e_1, \ldots, e_L]$ where $e_i \in \mathbb{R}^{C}$ and $L = \frac{H}{\zeta}\frac{W}{\zeta}$. Implementation uses $C = 684$ and $\zeta = 16$.
Decoder
Two GRUs with syntax-aware attention process the parse tree. For each node $\alpha$, the context state includes:
- Historical state $c_h^\alpha$: how $\alpha$ was produced
- Partner state $c_p^\alpha$: embedding of the latest terminal symbol or relation
GRU-$\alpha$ computes $c_o^\alpha = \text{GRU}(c_p^\alpha, c_h^\alpha)$. Attention produces compact visual feature $\Omega = \text{Att}(E(X), c_o^\alpha, \text{att}_\alpha(X))$. GRU-$\beta$ computes $c_\beta^\alpha = \text{GRU}(\Omega, c_o^\alpha)$.
Two output branches:
- Symbol branch: Softmax over $|\Sigma| + 2$ (terminals, $E$, empty)
- Relation branch: Sigmoid over 7 relations
Decoding logic:
- If terminal $\sigma$ selected: apply $S \rightarrow \sigma S$
- If $E$ selected: apply relation branch; keep relations with probability $> 0.5$; create corresponding $(\gamma)S$ children
- If empty selected: apply $S \rightarrow \epsilon$
Syntax-Aware Attention
Standard attention weights:
$$ \xi^\alpha = \text{softmax}\left(W_w \tanh(W_o c_o^\alpha + W_\alpha \text{att}_\alpha(X) + W_e E(X))\right) $$
Key modification: instead of summing over all past steps, accumulate only along the root-to-node path:
$$ \text{att}_\alpha(X) = \sum_{i \in \text{path}_\alpha} \xi^i $$
This reduces attention drift between structurally unrelated components.
Inference
Stack-based traversal:
- Encode image and initialize stack with start symbol $S$
- While stack non-empty:
- Pop node and context
- Compute production rule probabilities
- Select highest-probability rule
- Push produced non-terminals/relations with updated contexts
- Update parse tree
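A schematic sketch of the stack-based traversal described above, with `predict_rule` standing in for the GRU decoder's symbol and relation branches; the helper names and data structures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Node:
    symbol: str                      # terminal, "E", or non-terminal "S"
    relation: Optional[str] = None   # spatial relation on the edge from the parent
    children: List["Node"] = field(default_factory=list)

def decode(predict_rule: Callable[[Node], Tuple[str, object]], max_steps: int = 200) -> Node:
    """Stack-based tree expansion (sketch). For a popped S node, `predict_rule`
    returns ("symbol", sigma), ("E", kept_relations), or ("empty", None)."""
    root = Node("S")
    stack = [root]
    for _ in range(max_steps):
        if not stack:
            break
        node = stack.pop()
        kind, payload = predict_rule(node)
        if kind == "symbol":                     # S -> sigma S
            node.children.append(Node(payload))
            cont = Node("S")
            node.children.append(cont)
            stack.append(cont)
        elif kind == "E":                        # S -> E, expand kept relation slots
            for rel in payload:                  # relations with probability > 0.5
                child = Node("S", relation=rel)
                node.children.append(child)
                stack.append(child)
        # kind == "empty": S -> epsilon, nothing is pushed
    return root
```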
Data
CROHME (2014, 2016, 2019)
The paper uses CROHME as the primary benchmark, converting InkML strokes to offline images. Training set: 8,836 expressions with 101 symbol classes. Test sizes: 986 (2014), 1,147 (2016), 1,199 (2019).
HME100K
A new dataset of 74,502 train + 24,607 test images with 245 symbol classes, collected from approximately 10,000 writers via camera-captured uploads. Characteristics include color variation, blur, complex backgrounds, perspective distortion, and illumination issues. Statistics:
- Max sequence length: 184 (vs. 96 for CROHME 2019)
- Average sequence length: 17.62 (vs. 15.79 for CROHME 2019)
- Writer count: ~10,000 (vs. ~100 for CROHME 2019)
Dataset availability: Download links provided via GitHub repository and official portal. No explicit license specified; verify terms before use.
Training
Objective
Multi-task loss combining symbol prediction, relation prediction, reversed symbol prediction, and attention regularization:
$$ L = L_{\text{symbol}} + L_{\text{relation}} + L_{\text{rev. symbol}} + L_{\text{reg}} $$
Attention regularization uses a reversed decoder to predict parent nodes and enforces consistency:
$$ L_{\text{reg}} = \sum_{\alpha} \hat{\xi}^{\alpha} \log \frac{\hat{\xi}^{\alpha}}{\xi^{\alpha}} $$
where $\xi^\alpha$ is the forward attention distribution at node $\alpha$ and $\hat{\xi}^\alpha$ the reversed decoder's attention.
The reversed decoder is removed at inference.
Supervision
Ground-truth LaTeX is parsed into a parse tree via depth-first search, generating parent-child training samples processed in preorder with teacher forcing.
Implementation
- Framework: PyTorch
- Hardware: Single NVIDIA Tesla V100 (32GB)
- Batch size: 8
- GRU hidden size: 256
- Word/relation embedding dimension: 256
- Optimizer: Adadelta ($\rho = 0.95$, $\epsilon = 10^{-6}$)
- Learning rate: Linear warmup from 0 to 1 over first epoch, then cosine decay to 0
Results
CROHME Benchmarks
Without data augmentation:
| Dataset | ExpRate | ExpRate $\leq 1$ | ExpRate $\leq 2$ |
|---|---|---|---|
| CROHME 2014 | 56.2 | 72.6 | 79.2 |
| CROHME 2016 | 53.6 | 69.6 | 76.8 |
| CROHME 2019 | 53.5 | 69.3 | 70.1 |
A variant trained with data augmentation achieves higher scores across all benchmarks.
HME100K
| Model | Total Acc. | Hard Subset | FPS (V100) | Params |
|---|---|---|---|---|
| DWAP | 61.9 | 45.4 | 23.3 | — |
| DWAP-TD | 62.6 | 45.4 | 6.9 | — |
| BTTR | 64.1 | 46.0 | 3.9 | — |
| SAN | 67.1 | 51.5 | 23.9 | 8.9M |
Ablations
ExpRate improvements on CROHME 2019:
- Baseline → SAN-GS (with grammar): significant gain
- SAN-GS → SAN (with syntax-aware attention): additional improvement
Grammar syntax contributes the majority of gains, with syntax-aware attention providing further refinement.
Chart-to-Text — Notes
Paper: Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model (ACL 2022)
Code: vis-nlp/Chart-to-text
Data: vis-nlp/Chart-to-text
License: GPL-3.0 (+ source restrictions)
TL;DR
The paper introduces Chart-to-Text, a benchmark for generating natural-language summaries of charts, with two datasets (Statista and Pew) totaling 44,096 charts across multiple chart types and topics, plus baselines spanning image captioning, table-to-text, and OCR-to-text settings. Results suggest pretrained seq2seq models (BART/T5) are strongest, but hallucinations, factual errors, and trend/pattern reasoning remain key failure modes, especially when the underlying data table is unavailable.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
- Primary contribution is a large-scale benchmark (two datasets, construction + analysis) and public release framing.
Secondary: $\Psi_{\text{Evaluation}}$
- Strong baseline suite + automatic metrics + human evaluation + qualitative error analysis.
Tertiary: $\Psi_{\text{Method}}$
- Some method engineering (OCR pipelines, chart-text role classifier, ChartOCR extension), but not the main novelty.
What is the motivation?
Charts are widely used for communicating quantitative information, but extracting key insights can require substantial cognitive/perceptual effort; summaries can help readers, authors, accessibility use cases, and retrieval/indexing.
Prior work is limited by:
- Small datasets, narrow chart coverage (often just bar/line), and single-source collections.
- Template-heavy systems that describe how to read charts rather than synthesizing insights.
- Lack of baselines leveraging modern large-scale pretraining for generation.
What is the novelty?
Benchmark scale + breadth: two sources, many topics, multiple chart types, and two task settings:
- Table-available chart-to-text: input includes chart image + underlying table + metadata.
- Image-only chart-to-text: underlying table unavailable, requiring extraction from chart images (OCR-based).
Pew pipeline for chart-summary alignment when pages contain many paragraphs and charts:
- OCR extraction, chart-text role classification, candidate paragraph heuristics, and crowd labeling for relevance.
Baseline sweep spanning:
- image captioning (vision-only),
- data-to-text (table-to-text),
- OCR+text hybrids.
What experiments were performed?
Dataset analysis: chart-type distribution, linguistic stats (tokens/sentences), semantic content categories.
Automatic evaluation: BLEU, CIDEr, BLEURT, Content Selection (CS), and perplexity using GPT-2 Medium.
Baseline comparisons across Statista and Pew, with variants that use:
- ground-truth tables (TAB-*),
- OCR text (OCR-*),
- automatically extracted tables (TAB_OCR-*).
Human evaluation: pairwise comparisons among TAB-T5, OCR-T5, and gold summaries on factual correctness, coherence, fluency.
Qualitative error analysis on sampled outputs to categorize failures (hallucination, factual errors, reasoning over trends).
What are the outcomes/limitations?
- Best-performing family: pretrained seq2seq (TAB-T5 / TAB-BART when tables exist; OCR-T5 on Pew), but performance drops sharply in Pew (image-only, diverse styles, missing tables).
- Persistent issues: hallucinations and factual errors, especially in OCR-based settings where value-to-label association is brittle, plus difficulty describing complex trends/patterns that humans perceive easily.
- Measurement gap: good fluency and reasonable overlap metrics can coexist with factual mistakes; human evaluation highlights factual correctness deficits relative to gold summaries.
Reproducibility Details
Data
Sources and scale
Statista
- Crawled 34,810 publicly accessible webpages (Dec 2020), yielding 34,811 charts.
- Collected: chart screenshot, downloaded data table (when available), title, axis labels, and human-written description text.
Pew Research
- Scraped 3,999 publicly accessible pages (Jan 2021), yielding 9,285 charts.
- Underlying data tables are usually unavailable (only 143 charts had tables).
- Collected: chart image, surrounding paragraphs, and alt text (if present).
Total: 44,096 charts across both datasets.
Chart complexity definition
- Statista: “simple” charts have data tables with two columns; “complex” charts have $\ge 3$ columns (e.g., grouped/stacked bars, multiple-line charts).
- Pew: complexity labeled manually because tables are generally missing.
Summary selection and annotation
Statista summary selection: used the first part of the webpage text (from chart icon to next heading) as the summary; remaining text often contains background.
Statista x-axis label completion:
- Many charts lacked explicit x-axis labels.
- Used regex heuristics over cell values to detect common entity types; remaining missing labels handled via Wikidata-based entity typing, then manual annotation when labels were too generic.
Pew: 3-stage alignment pipeline (Figure 2)
(i) Data extraction from chart images
- OCR text extracted using CRAFT.
- Extracted bounding boxes and geometric features; trained gradient boosting classifiers to categorize recognized text into: title, axis labels, legends, data labels.
- Separate classifier per chart type.
- Manual labels: 319 examples (171 bar, 68 line, 80 pie) split 8:1:1 into train/val/test.
- Reported performance: 95.0% precision overall, 97.6% precision for title classification on test.
- Title choice: if alt text exists, take the longer of (alt text, OCR-title); otherwise use OCR-title.
(ii) Identification of candidate paragraphs
Candidate set: paragraph adjacent to the chart plus five before and five after (max 11).
Heuristic relevance score:
- Sentence relevance: $$s_i = 0.58 l_i + 1.4 n_i - 0.5 u_i$$ where $l_i$ is lexical matches, $n_i$ numerical matches (excluding years), $u_i$ numerical tokens in sentence not in chart.
- Content score: $$\text{content}=\frac{1}{1+\exp(0.3(-\max_i(s_i)+1.7))}$$
- Proximity score (distance $dist \in [-5,5]$): $$\text{proximity}=0.4\exp(-0.1|dist|^2)+0.6$$
- Paragraph relevance: $$rel = \text{content} \times \text{proximity}$$
- Paragraph considered relevant if constraints hold, including $rel > 0.72$, lexical matches sum $>3$, numerical/year matches $>0$, and no extra numerical tokens ($\sum u_i = 0$). (Figure 7)
Heuristic evaluation (random sample): recall 21.1%, precision 100%, chosen to prioritize precision.
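A minimal sketch of the relevance heuristic above, with the reported coefficients and thresholds; function names and argument structure are ours.

```python
import math

def sentence_score(lexical_matches, numerical_matches, unmatched_numbers):
    # s_i = 0.58 * l_i + 1.4 * n_i - 0.5 * u_i  (years excluded from n_i)
    return 0.58 * lexical_matches + 1.4 * numerical_matches - 0.5 * unmatched_numbers

def paragraph_relevance(sentence_scores, dist):
    """Combine per-sentence scores with the paragraph's distance to the chart.

    sentence_scores: list of s_i values for the paragraph's sentences.
    dist: signed paragraph offset from the chart, in [-5, 5].
    """
    content = 1.0 / (1.0 + math.exp(0.3 * (-max(sentence_scores) + 1.7)))
    proximity = 0.4 * math.exp(-0.1 * abs(dist) ** 2) + 0.6
    return content * proximity

# A paragraph is kept only if rel > 0.72 and the additional constraints above
# hold (lexical matches > 3, at least one numerical/year match, no extra
# numerical tokens in its sentences).
```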
(iii) Selection of relevant paragraphs via crowdsourcing
- Annotated 5,478 charts and 13,237 paragraphs.
- Two annotators per chart; agreement used when both label irrelevant vs relevant; disagreements (2,888 paragraphs) resolved internally.
- Reported overall agreement: 78.2%.
Splits and descriptive stats
- Train/val/test split: 70% / 15% / 15%.
- Chart types (Table 1): bar charts dominate both datasets; line charts second; Pew includes additional types (area/scatter).
- Linguistic stats (Table 2): Pew summaries are ~2× longer than Statista by characters/tokens/sentences; complex charts tend to have longer summaries.
- Semantic content analysis (Table 3): “statistical/comparative” content most common; Pew has more “perceptual/cognitive” sentences than Statista.
Model
Task formulation
Dataset instance: $\langle C, T, M, S \rangle$ where
- $C$ chart image,
- $T$ data table (when available),
- $M=(C_{title}, C_{type}, C_{labels})$ metadata,
- $S$ reference summary.
Two input settings:
- table-available: $X=\langle C, T, M\rangle$
- image-only: $X=\langle C, M\rangle$ (must recover info from chart image)
Baseline families (Section 4)
1) Image captioning (vision-only)
- Show, Attend, and Tell style: ResNet50 encoder + uni-directional LSTM decoder.
- ResNet50 pretrained via Barlow Twins self-supervision (separate pretraining per dataset) because ImageNet-pretrained ResNet transferred poorly to charts.
2) Data-to-text (table-to-text)
- Chart2text (adapted transformer, with auxiliary content selection objective, plus templating strategy to reduce hallucination).
- Field-Infusing model: LSTM encodes cell values, concatenated with row index + column heading embeddings, then a Transformer encoder-decoder generates text.
- BART / T5: flatten the table row-by-row; input includes title + table content; T5 uses the prefix `"translate Chart to Text:"` to mimic pretraining style.
3) Vision+text hybrids (OCR-to-text)
- Use CRAFT OCR to extract chart text, then feed to text generation models (Chart2text, Field-Infuse, BART, T5).
- OCR-T5 has a variant that injects bounding-box positional embeddings (inspired by Tan and Bansal style spatial encoding).
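A sketch of the row-by-row table linearization and T5 prefix described for the BART/T5 baselines above; the exact serialization (separators, field order) is an assumption, and the example title/values are invented for illustration.

```python
def linearize_table(title, table):
    """Flatten a table row-by-row into a single text sequence for BART/T5
    fine-tuning; the separator scheme here is an illustrative guess, not the
    paper's exact format.
    """
    rows = " ".join(" | ".join(str(cell) for cell in row) for row in table)
    return f"translate Chart to Text: {title} {rows}"

# Hypothetical example input (not from the dataset):
example = linearize_table(
    "Smartphone users in the U.S. (millions)",
    [["Year", "Users"], ["2018", "252.3"], ["2019", "261.0"]],
)
```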
Algorithms / Training
Training setup (Appendix A.3)
- Hardware: CPU Intel Xeon Gold 6240 @ 2.60GHz, GPU 4× NVIDIA GTX 2080 Ti.
- Training time note: T5 fine-tuning is reported as the most expensive, ~16–20 hours on 4 GPUs.
Model-specific training details
- Chart2text: 1 encoder layer, 6 decoder layers, dropout 0.1, 80 epochs, batch size 6; beam size 4 at inference.
- Field-Infusing: 10 epochs, dropout 0.1, batch size 1.
- BART-Base: ~140M params, 6 layers; fine-tune 500K iterations, batch size 4, LR 0.0005; validate every 2,000 iters; beam size 4 at inference.
- T5-Base: ~220M params, 12-layer encoder-decoder; fine-tune 500K iterations, batch size 4, LR 0.0005; validate every 2,000 iters; beam size 4 at inference.
Evaluation
Automatic metrics (Section 5.1)
- BLEU and CIDEr: n-gram overlap (CIDEr TF-IDF weighted).
- BLEURT-base-128: learned metric for grammaticality/semantic similarity (sentence-level averaged).
- Content Selection (CS): overlap in selected records relative to gold (sentence-level averaged).
- Perplexity (PPL): computed with GPT-2 Medium for fluency proxy.
Main quantitative results (Table 4)
Statista (table-available is strongest setting):
- Best BLEU among reported baselines: TAB-T5 37.01, TAB-BART 36.36.
- OCR-only variants are slightly worse (example: OCR-T5 35.29) but still competitive, with generally lower PPL.
- Image captioning baseline has relatively low CS even if PPL is low.
Pew (mostly no tables, more diverse styles):
- Best BLEU among reported baselines: OCR-T5 10.49 (OCR-T5* 10.42).
- Vision-only image captioning collapses (BLEU ~4).
Human evaluation (Section 5.2, Table 5)
Setup: 150 Statista charts; 4 internal annotators (native English); 450 pairwise comparisons:
- TAB-T5 vs OCR-T5
- Gold vs TAB-T5
- Gold vs OCR-T5
Criteria: factual correctness, coherence, fluency.
Agreement on a subset (excluding ties): 74.3%.
Outcome: TAB-T5 beats OCR-T5 strongly on factual correctness (and also coherence/fluency), while gold summaries still win more often than either model, especially on factual correctness/coherence.
Error analysis themes (Section 5.3)
- Perceptual and reasoning failures: models struggle with trends/relationships that are visually salient but not trivial from raw extracted text (examples shown in Figure 4).
- Hallucinations: fluent but irrelevant tokens/statements.
- Factual errors: especially OCR-based, due to missing data labels or mis-association between values and entities (Figure 4 example where follower counts swap entities).
- Computer vision constraints: charts often omit explicit numeric labels, and OCR alone does not recover mark-to-label alignment.
- Proposed direction (as an aspiration): richer representations such as semantic graphs encoding numerical/logical relations among chart objects.
Hardware / Production
- Reported training environment: Xeon Gold 6240 CPU, 4× GTX 2080 Ti, with T5 fine-tuning taking ~16–20 hours in their setup.
- No serving/latency benchmarks; evaluation is offline.
Appendix-specific implementation detail: automatic table extraction (Appendix A.5)
They extend ChartOCR to recover fully-structured tables:
- Keypoint detection for chart elements and marks; extend detector to include textual labels and legend marks.
- OCR (CRAFT) recognizes x-axis/legend labels; associate values to nearest labels and series by color; estimate scale using y-axis labels.
Reported automatic extraction accuracy: 77.31% (used to create TAB_OCR-* model inputs).
Notes on ethics and misuse (brief)
- Dataset collection constrained to publicly available charts with publication rights considerations (Statista free studies; Pew attribution/terms).
- AMT compensation targeted to minimum wage rates; per-chart payments 0.10–0.15 USD depending on candidate paragraphs.
- They explicitly call out a misuse risk: fluent outputs with hallucinations/factual errors could misinform if published uncorrected.
Data Availability & Licensing Notes
TL;DR
Yes: the benchmark datasets are publicly available online (GitHub), and the repository is licensed under GNU GPL v3. But: the charts come from third-party sites (Statista, Pew), and your rights to redistribute/reuse those chart images can be constrained by their Terms of Use, even if the repo itself is GPL.
Questions answered
1) Is there data publicly available?
Yes. The paper explicitly says the “code and benchmark datasets are publicly available” (via their GitHub).
2) Online? Where?
Yes (online). The stated primary location is the public GitHub repository vis-nlp/Chart-to-text. (There are also public mirrors/derivatives in the ecosystem, but GitHub is the canonical source per the paper.)
3) With what licensing?
Repository license: The repo includes a GNU General Public License v3.0 license file.
Important nuance (data vs. code):
- GPL v3 clearly applies to “software and other kinds of works” released under it.
- However, the dataset contains/derives from third-party chart content, and that content may carry separate legal/contractual restrictions (see below).
Third-party terms that affect what you can do with the data
Statista (as used by the dataset)
The authors say they use only “free studies” from Statista and cite Statista’s Terms: free content comes with “publication rights” for academic purposes, but paid content does not—and they claim to use only the free portion.
The Statista Terms document explicitly discusses redistribution limits and conditions, including that some redistribution is only allowed for “free material” and typically requires leaving materials unchanged and referencing Statista. It also prohibits “crawlers/spiders” (relevant because the dataset was constructed via crawling).
Pew Research Center
Pew’s Terms grant a license to use Pew content with attribution, and include a specific requirement to “provide proper attribution… in accordance with the citation below.” They also restrict reuse of content attributed to another party that is not Pew.
Practical implications (what this means for you)
- If you just need a benchmark to reproduce research results, the GitHub release being public + GPL’d is straightforward.
- If you plan to redistribute the dataset, host it, or use it in a commercial product, you should treat the third-party chart/image rights (Statista/Pew) as potentially more restrictive than the repo’s GPL label.
- If your use-case needs clean licensing, you may prefer datasets built from explicitly open-licensed charts (or datasets whose creators confirm redistribution rights unambiguously).
GNHK: A Dataset for English Handwriting in the Wild
TL;DR
GNHK introduces a camera-captured, in-the-wild English handwriting dataset (687 images) with word-level quadrilateral annotations, line grouping, and handwritten-versus-printed tags. The paper provides baseline results for text localisation (Mask R-CNN / Faster R-CNN) and cropped-word recognition (Clova scene-text recognizer variants), with best reported recognition reaching CAR 0.861 and WAR 0.502 under TPS + BiLSTM + attention decoding.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (dataset release + benchmark baselines)
Secondary: $\Psi_{\text{Evaluation}}$ (establishes baseline protocols and metrics for localisation and recognition)
Justification: The headline contribution is the dataset itself (“we created a dataset…”) plus accompanying baselines, which matches the taxonomy guideline that dataset/tool releases are typically $\Psi_{\text{Resource}}$-dominant.
What is the motivation?
- Existing widely-used handwriting datasets (e.g., IAM, RIMES) are largely flatbed-scanned and do not reflect camera-captured “in the wild” conditions.
- Prior scene-text benchmarks are mostly printed text; the paper argues there is a gap for offline English handwriting captured via cameras under unconstrained conditions.
- English handwriting varies across regions (lexicon and styles), motivating collection across Europe, North America, Asia, and Africa.
What is the novelty?
- A camera-captured English handwriting dataset modeled after scene-text datasets, explicitly including “in the wild” document types (shopping lists, sticky notes, diaries) and non-handwriting content (printed text, images).
- Annotation schema designed for both detection/localisation and recognition:
  - Per-image JSON annotations containing multiple objects with (`text`, `polygon`, `line_idx`, `type`).
  - Text values drawn from ASCII printable characters plus the British pound sign, with no whitespace characters included.
  - Three special tokens for polygon annotations: `%math%` (math expressions), `%SC%` (illegible scribbles), `%NA%` (no characters/math symbols).
  - Quadrilateral polygons listed clockwise starting at the top-left point.
  - `line_idx` groups texts that belong to the same line; `type` indicates handwritten vs printed.
What experiments were performed?
Text localisation (detection/segmentation)
- Baselines: Mask R-CNN (instance segmentation) and Faster R-CNN (object detection), implemented in detectron2.
- Evaluation: recall, precision, and $F$-measure at IoU $> 0.5$.
Text recognition (cropped word recognition)
- Benchmark setup: use ground-truth word boxes to crop word images, then recognize text from crops (segmented recognition).
- Model family: Clova AI deep text recognition framework with four components (transformation, feature extraction, sequence modelling, prediction), evaluating eight configurations.
- Data filtering for recognition eval: keep words and punctuation; remove unknown characters, scribbles, math symbols, and words that contain only punctuation.
What are the outcomes/limitations?
Key outcomes
- Localisation: both Mask R-CNN and Faster R-CNN achieve $F$-measure $> 0.86$ (IoU $> 0.5$), with high precision in both cases.
- Recognition: best reported configuration (TPS + BiLSTM + attention) reaches CAR 0.861 and WAR 0.502; attention decoding substantially outperforms CTC, and TPS improves CAR/WAR in the reported comparisons.
Limitations and open ends (from the paper’s setup)
- Recognition benchmark is not end-to-end: it assumes ground-truth word crops rather than predicted boxes.
- For localisation, the dataset lacks pixel-level masks separating word vs non-word; the baseline uses polygon-to-box conversion (min/max over polygon points) for the R-CNN box regression.
- The paper explicitly points to future work on end-to-end approaches that do localisation and recognition sequentially in a single framework.
Model
Localisation
- Mask R-CNN baseline; bounding boxes derived from polygon min/max in $x$ and $y$, with the polygon used as the segmentation mask.
- Implementation: detectron2; backbone ResNet-50 with FPN, pretrained on ImageNet; network pretrained on MS COCO for segmentation.
- Comparator: Faster R-CNN in the same detectron2 framework.
Recognition
- Clova AI deep text recognition framework with configurable components:
- Transformation: TPS vs none
- Sequence modelling: BiLSTM vs none
- Prediction: attention vs CTC
Data
Dataset size and composition
- Total: 687 images, 172,936 characters, 39,026 texts, 9,363 lines.
- “Texts” include words, ASCII symbols, and math expressions (via tokenization rules).
- Regional sourcing across Europe, North America, Asia, and Africa.
- Region-level counts (chars/texts/lines): EU 58,982 / 13,592 / 3,306; NA 47,361 / 10,967 / 2,099; AS 39,593 / 8,586 / 2,780; AF 27,000 / 5,881 / 1,178.
- Text statistics: 39,026 total texts, 12,341 unique; median texts per image 57; mean 44.1.
- Character set: 96 unique characters; median character count 486; mean 1,801; max per-character frequency up to 17,887.
- Collection constraint: no more than 5 images per writer.
Annotation format
- Per-image JSON file with objects keyed by `text`, `polygon`, `line_idx`, `type`.
- `polygon` is a quadrilateral with four $(x,y)$ points in clockwise order starting at top-left.
- `type` indicates handwritten or printed.
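A minimal sketch of reading this annotation format and deriving the axis-aligned boxes used by the R-CNN baselines (min/max over polygon points). It assumes the file holds a list of objects with `polygon` stored as four `(x, y)` pairs, which may not match the official files exactly.

```python
import json

def load_gnhk_boxes(path):
    """Load a GNHK-style per-image annotation file and convert each quadrilateral
    to an axis-aligned box via min/max over its four points, mirroring the
    polygon-to-box conversion used for the R-CNN baselines."""
    with open(path) as f:
        objects = json.load(f)
    boxes = []
    for obj in objects:
        xs = [x for x, _ in obj["polygon"]]
        ys = [y for _, y in obj["polygon"]]
        boxes.append({
            "text": obj["text"],
            "type": obj["type"],          # handwritten vs printed
            "line_idx": obj["line_idx"],  # groups words on the same line
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
        })
    return boxes
```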
Splits
- Train/test split: 75% training, 25% testing.
Download and license
- Official download: GoodNotes hosts the dataset with links (Google Drive / Baidu Netdisk) available after agreeing to terms and conditions on the GNHK webpage.
- Pointer repository: The GoodNotes/GNHK-dataset GitHub repo points to the official dataset page.
- License: Creative Commons Attribution 4.0 (CC BY 4.0). You can use, share, and adapt the dataset (including commercially) as long as you provide attribution and include the license notice (and indicate if you made changes).
- Note: Third-party mirrors exist (e.g., on Hugging Face), but the GoodNotes page is the authoritative source.
Algorithms / Training
- Localisation: Mask R-CNN trained in detectron2 with ResNet-50 + FPN backbone; initialization includes ImageNet pretraining and MS COCO pretraining for segmentation.
- Recognition: segmented recognition using ground-truth crops, evaluated across eight Clova framework configurations (TPS/none × BiLSTM/none × attention/CTC).
Training hyperparameters such as learning rate, batch size, and epochs are not specified in the paper.
Evaluation
Localisation metrics and results
- Metric: Recall, precision, and $F$-measure with IoU $> 0.5$.
- Results (Table 5):
- Mask R-CNN: recall 0.8237, precision 0.9079, $F$-measure 0.864
- Faster R-CNN: recall 0.8077, precision 0.9215, $F$-measure 0.860
- Qualitative note: $F$-measure differences are described as coming from high precision (fewer false positives).
Recognition metrics and results
- Metrics:
- Character accuracy rate (CAR): average over words of $1 - \frac{\text{edit distance}(gt_i, pred_i)}{N_i}$
- Word accuracy rate (WAR): fraction of words with zero edit distance
- Results (Table 6): best configuration TPS + BiLSTM + attention gives CAR 0.861 and WAR 0.502.
- Comparative findings:
- Attention decoding outperforms CTC across the reported configurations.
- TPS improves CAR and WAR in like-for-like comparisons.
- BiLSTM generally improves CAR; the paper notes a case where WAR decreases when adding BiLSTM without TPS (0.377 vs 0.430).
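A minimal sketch of the CAR/WAR metrics as described above, assuming $N_i$ is the ground-truth word length; helper names are ours.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def char_accuracy_rate(gt_words, pred_words):
    """CAR: mean over words of 1 - edit_distance(gt, pred) / N_i, with N_i taken
    as the ground-truth word length (our reading of the formula above)."""
    return sum(1 - levenshtein(g, p) / max(1, len(g))
               for g, p in zip(gt_words, pred_words)) / len(gt_words)

def word_accuracy_rate(gt_words, pred_words):
    """WAR: fraction of words recognized with zero edit distance (exact match)."""
    return sum(g == p for g, p in zip(gt_words, pred_words)) / len(gt_words)
```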
Hardware / Production
Not specified.
ChartOCR — Notes
Paper: ChartOCR: Data Extraction from Charts Images via a Deep Hybrid Framework (WACV 2021)
Code: soap117/DeepRule
Data: HuggingFace, GitHub
License: BSD-3-Clause (code), MIT (dataset on HuggingFace)
TL;DR
ChartOCR is a deep + rules hybrid pipeline for extracting the underlying numeric table from bar, line, and pie chart images by (1) detecting chart keypoints with a shared deep model, then (2) applying type-specific geometric rules to reconstruct chart elements and map pixels to values. The authors report strong results on their large ExcelChart400K dataset (scores: 0.919 bar, 0.918 pie, 0.962 line) and show that with ground-truth keypoints the remaining rule module becomes near-perfect, suggesting keypoint localization is the dominant error source.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$. The core contribution is a unified extraction method that uses keypoint detection plus chart-type rules to support multiple chart types in one framework.
- Secondary: $\Psi_{\text{Resource}}$ (they introduce ExcelChart400K, a large annotated dataset) and $\Psi_{\text{Evaluation}}$ (they propose new chart-type-specific metrics).
What is the motivation?
- Prior rule-based chart extractors struggle to generalize across chart styles and layouts, since rules are brittle and style-specific.
- Pure end-to-end deep approaches can be accurate but often specialize to a single chart type and provide less controllable intermediate structure (plot area, range, components).
- The paper targets a middle ground: use deep learning for what generalizes (keypoint detection) and rules for what is structured (geometry and grouping).
What is the novelty?
- Key idea: reduce chart extraction across types to a keypoint detection problem, then reconstruct components with type-specific post-processing rules.
- A shared “common information extraction” network predicts (a) keypoints and (b) chart type; then downstream modules handle range estimation and object grouping per type.
- New dataset: ExcelChart400K (386,966 images) for training deep models on diverse chart styles.
- New metrics: separate evaluation metrics for bar, line, and pie that better match the “read the data values” objective than borrowed detection metrics.
What experiments were performed?
Datasets:
- FQA (100 synthetic chart images across bar/pie/line) and WebData (100 web-crawled images) from prior work.
- ExcelChart400K (large-scale dataset introduced here).
Baselines compared (as reported):
- Rule-based: Revision.
- Deep: Vis, ResNet+Faster-RCNN (bar), RotationRNN (pie), ResNet+RNN (end-to-end).
- Commercial: think-cell (bar only, qualitative).
Ablations:
- Replace predicted keypoints with ground-truth keypoints to estimate an upper bound for the rule module.
- Replace OCR with ground-truth OCR for FQA (since WebData lacks GT OCR).
What are the outcomes/limitations?
Main outcomes (authors’ reported):
- ExcelChart400K scores: 0.919 (bar), 0.918 (pie), 0.962 (line); with GT keypoints: 0.989/0.996/0.991, implying post-processing is strong when keypoints are accurate.
- Public datasets (mean error): ChartOCR improves substantially over Vis, especially on pie and line.
- Runtime: ChartOCR averages 0.206s (bar), 0.193s (pie), 0.507s (line); rule-based Revision is much slower.
Limitations called out:
- Line charts: hard cases with multiple entangled line segments remain challenging; the QUERY network struggles in these settings.
Practical assumptions (implicit in method description, potential fragility):
- Data range extraction assumes y-axis labels are on the left of the plot area and uses only top/bottom OCR numbers to infer scale. This may break for unconventional layouts, multi-axis charts, or non-linear scales (not addressed in the paper text).
Model
Common information extraction: keypoints + chart type
- Backbone: modified CornerNet with Hourglass Net backbone (104 layers).
- Output: a pixel-level probability map with 3 channels (top-left, bottom-right, background). Map size matches input image.
- Uses corner pooling (from CornerNet) on the penultimate keypoint branch layer to expand receptive field along horizontal/vertical directions.
- Loss: “CornerNet-style” combination of probability-map loss plus smooth L1 for keypoint coordinates (the paper references CornerNet’s settings rather than fully re-deriving them).
Chart type classification head
- Adds a conv layer from Hourglass output to downsample features (example given: $(32 \times 32)$), then max-pools to a 1D vector and feeds FC layers to a softmax classifier.
- Loss: cross-entropy (explicit formula provided).
Data
FQA and WebData
- FQA: 100 synthetic images total across chart types; limited style diversity per authors.
- WebData: 100 images crawled from the web; larger style variation than FQA.
ExcelChart400K
Size: 386,966 chart images, collected by crawling public Excel sheets; chart images captured with Excel APIs and underlying data extracted directly from the source spreadsheets.
Privacy: authors overwrite chart text with random characters for anonymization.
Splits (Table 1):
- Bar: train 173,249, val 6,935, test 6,970
- Line: train 116,745, val 3,073, test 3,072
- Pie: train 73,075, val 1,924, test 1,923
Algorithms / Training
Data range extraction (bar + line)
Uses Microsoft OCR API to extract text from chart images (legend/title/axis labels).
Locates plot area via a similar keypoint detection routine (top-left + bottom-right corners).
Assumption: y-axis numbers are on the left-hand side of the plot area; filter OCR results accordingly.
Algorithm 1 (Data Range Estimation) (high-level):
- Find the nearest OCR number near the bottom-left of the plot area as $r_{\max}$ and near the top-left as $r_{\min}$ (using a left-of-plot constraint like `r.r < Left - 4`).
- Convert text to numbers and compute a pixel-to-value scale: $$Y_{\text{scale}} = \frac{r_{\max}.num - r_{\min}.num}{r_{\min}.t - r_{\max}.t}$$
- Estimate $Y_{\min}$ and $Y_{\max}$ using the scale and vertical offsets relative to the plot area bounds.
For pie charts, they skip this step since the total is assumed to be 100% by default.
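A direction-agnostic sketch of the pixel-to-value mapping implied by the $Y_{\text{scale}}$ formula above; argument names are ours, not from the DeepRule code, and pixel $y$ is assumed to grow downward.

```python
def pixel_to_value(y_pixel, ref_a, ref_b):
    """Linearly map a pixel row inside the plot area to a data value from two
    OCR'd y-axis reference labels, following the Y_scale idea above.

    ref_a / ref_b are (numeric_value, pixel_row) pairs for two y-axis labels.
    """
    (val_a, row_a), (val_b, row_b) = ref_a, ref_b
    scale = (val_b - val_a) / (row_a - row_b)   # value units per pixel
    return val_a + (row_a - y_pixel) * scale

# Example: label "0" at pixel row 400 and label "100" at pixel row 200.
assert pixel_to_value(300, (0, 400), (100, 200)) == 50.0
```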
Type-specific chart object detection
Bar charts
- Threshold keypoint heatmap at $s = 0.4$, then match each top-left point to the nearest bottom-right point using a weighted distance: $$dist = \gamma \, dist_x + \nu \, dist_y$$
- For vertical bars, set $\gamma > \nu$ and constrain search to the right side of plot area; for horizontal bars, $\nu > \gamma$.
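An illustrative greedy version of this matching rule; the right/below constraint is our simplification of the search-region restriction, and the default weights are placeholders rather than the paper's values.

```python
def match_bar_corners(top_left_pts, bottom_right_pts, gamma=2.0, nu=1.0):
    """Greedily pair detected top-left keypoints with bottom-right keypoints
    using the weighted distance dist = gamma * dist_x + nu * dist_y.

    gamma > nu corresponds to the vertical-bar setting described above; the
    paper's exact matching procedure and weights may differ.
    """
    matches = []
    for tx, ty in top_left_pts:
        candidates = [(bx, by) for bx, by in bottom_right_pts if bx >= tx and by >= ty]
        if not candidates:
            continue
        bx, by = min(candidates,
                     key=lambda p: gamma * abs(p[0] - tx) + nu * abs(p[1] - ty))
        matches.append(((tx, ty), (bx, by)))
    return matches
```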
Pie charts
Modify keypoint net: replace corner pooling with center pooling (from CenterNet-style methods) to capture 360-degree context.
Threshold at $s = 0.3$.
Algorithm 2 (Sector Combining) handles:
- Tight pies (single center): sort arc points clockwise; pair consecutive arc points with the center.
- Exploded pies (multiple centers): run Pie Radius Estimation (details in supplement per authors) and only connect center-to-arc pairs whose distances fall within a threshold of the estimated radius.
Line charts
Adds an embedding layer (after an early conv) to encourage points on the same line to have similar embeddings; uses pull/push losses (CornerNet-style associative embedding).
Total keypoint loss for line: $loss'_{\text{point}} = loss_{\text{point}} + \lambda \cdot loss_{\text{embedding}}$, with $\lambda = 0.1$.
Groups points into lines via hierarchical clustering using union-find.
Handles points shared by multiple lines (“intersection points”) with a QUERY network:
- For an intersection point $s$ and closest assigned point $e$, sample $K$ points evenly along segment $s \rightarrow e$; obtain features via linear interpolation; classify whether $s$ and $e$ belong to the same line.
Training setup (as reported)
- Optimizer: Adam, LR $2.5 \times 10^{-4}$, reduced to $2.5 \times 10^{-5}$ for last 5,000 batches.
- Batch size: 27.
- Mentions $\alpha = 2$, $\beta = 4$ (consistent with focal-loss-style params, though the paper does not re-explain them in detail here).
- Postproc: Soft-NMS to merge keypoints.
- Early stopping used; validation set from ExcelChart400K used for hyperparameter tuning.
Evaluation
Proposed metrics (Section 6)
Bar
- Defines a custom distance between predicted and GT bar boxes $p$ and $g$ emphasizing $(x,y,h)$ (width is treated as less relevant for reading): $$D(p,g) = \min\left(1, \frac{|x_p-x_g|}{w_g} + \frac{|y_p-y_g|}{h_g} + \frac{|h_p-h_g|}{h_g}\right)$$
- Computes assignment cost via a minimum-cost matching (job assignment) over predictions vs GT; final score is $1 - cost/K$.
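A sketch of this bar metric under stated assumptions: boxes are `(x, y, w, h)`, $K$ is taken as `max(#preds, #GT)`, and unmatched entries are padded with the maximum distance of 1 (the paper does not spell out the padding rule here).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bar_distance(p, g):
    """D(p, g) from the bar metric above; boxes are (x, y, w, h)."""
    xp, yp, wp, hp = p
    xg, yg, wg, hg = g
    return min(1.0, abs(xp - xg) / wg + abs(yp - yg) / hg + abs(hp - hg) / hg)

def bar_score(preds, gts):
    """Minimum-cost assignment between predictions and GT; score = 1 - cost / K."""
    k = max(len(preds), len(gts))
    cost = np.ones((k, k))  # padding: unmatched slots cost the maximum of 1
    for i, p in enumerate(preds):
        for j, g in enumerate(gts):
            cost[i, j] = bar_distance(p, g)
    rows, cols = linear_sum_assignment(cost)
    return 1.0 - cost[rows, cols].sum() / k
```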
Line
- Treats a line as continuous and evaluates via interpolation-based error with a precision/recall and an $F1$ formula; weights segments by point intervals (larger gaps count more).
- For multi-line charts, they enumerate combinations to find best matching score.
Pie
- Treats extraction as a sequence matching problem in clockwise order; uses dynamic programming with a per-element match reward of $1 - \left|\frac{x_i-y_j}{y_j}\right|$, then normalizes.
Key results (as reported)
ExcelChart400K (Table 2, higher is better)
- ChartOCR: 0.919 (bar), 0.918 (pie), 0.962 (line)
- ChartOCR + GT keypoints: 0.989, 0.996, 0.991
- ResNet+Faster-RCNN: 0.802 (bar)
- Revision: 0.582 (bar), 0.838 (pie)
- RotationRNN: 0.797 (pie)
- ResNet+RNN: 0.000 (bar), 0.411 (pie), 0.644 (line)
FQA + WebData (Table 3, mean error, lower is better)
- FQA: ChartOCR 0.185 (bar), 0.038 (pie), 0.484 (line); with GT OCR: 0.093, 0.038, 0.496.
- WebData: ChartOCR 0.285 (bar), 0.439 (pie), 0.740 (line).
- The authors note OCR quality is a major contributor to bar-chart extraction error on these sets.
Hardware / Production
Training environment: 4 $\times$ Tesla P100 GPUs.
Average runtime (Table 4):
- ChartOCR: 0.206s (bar), 0.193s (pie), 0.507s (line)
- Revision: 20.032s (bar), 5.423s (pie)
- ResNet+Faster-RCNN: 0.120s (bar)
- RotationRNN: 0.421s (pie)
Authors suggest merging QUERY with the keypoint backbone if runtime is critical (currently QUERY does not share parameters with keypoint net).
Data availability + licensing (ExcelChart400K / “CHARTEX Data”)
Publicly available? Yes — the paper explicitly states “the code and the dataset are publicly available” and points to the DeepRule GitHub repo.
Online? Yes — the DeepRule repo is public, and its README links to a hosted dataset download (“Downloading CHARTEX Data”) on Hugging Face. (GitHub)
What license is it under?
- Repo (code) license: The DeepRule GitHub repository is labeled BSD-3-Clause. (GitHub)
- Dataset (as hosted) license: The Hugging Face dataset page for DeepRuleDataset declares the license as MIT. (Hugging Face)
- Important caveat: The paper does not state a dataset license (it only says it’s publicly available), and the dataset itself was collected by “crawling public Excel sheets” (with text anonymization). So if you need high confidence for downstream use, rely on the license/terms shipped with the dataset distribution you download (and treat GitHub’s BSD-3-Clause as applying to the repo contents unless the dataset package states otherwise).
Decoupled Style Descriptors
DSD — Notes
Paper: Decoupled Style Descriptors (ECCV 2020)
Code & Models: github.com/brownvc/decoupled-style-descriptors
TL;DR
Proposes Decoupled Style Descriptors (DSDs) that explicitly factor handwriting into a writer style vector $w$ and a content-conditioned character matrix $C_{ct}$, producing writer-character descriptors $w_{ct}=C_{ct}w$. Human preference studies show 88% preference over DeepWriting, with writer identification accuracy reaching 99.70% on 50-word samples.
Five Key Questions
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$. Introduces a new modeling construction for handwriting generation: $w_{ct}=C_{ct}w$ with an explicit invertible content matrix to factor out writer style.
Secondary: $\Psi_{\text{Evaluation}}$. Runs human similarity judgments against DeepWriting and analyzes invertibility, writer ID performance.
Secondary: $\Psi_{\text{Resource}}$. Introduces the BRUSH dataset with design and collection details.
What is the motivation?
Classic encoder-decoder handwriting models learn writer-dependent and character-dependent latent vectors $w_{ct}$ but do not provide a mechanism to represent character-independent writer style directly. They argue explicit structure is needed to extract writer style cleanly.
What is the novelty?
Linear latent factorization: $$w_{ct} = C_{ct} w,\quad w = C_{ct}^{-1} w_{ct}$$ assuming $C_{ct}$ is invertible.
Sequence-aware $C$ construction: $C_{ct}$ is predicted from character substrings using a character encoder $g_\theta$ that embeds characters, passes them through an LSTM to capture history, then maps to a $256\times256$ matrix.
Writer style estimation by averaging per-prefix inversions: for a sample producing multiple $w_{ct}$ across prefixes, they estimate $$w = \frac{1}{M}\sum_{t=1}^{M} C_{ct}^{-1} w_{ct}.$$
Multiple synthesis routes: producing $w_{ct}$ via the encoder, or via “Method $\alpha$” (mean writer style) and “Method $\beta$” (sampling from a database of writer-character DSDs).
Unsupervised character segmentation ($k_\theta$): trains a segmentation network with a modified CTC-style objective so stroke points can be attributed to characters and end-of-character indices can be identified.
What experiments were performed?
Human similarity preference on IAM vs DeepWriting: 25 participants on MTurk, 40 sentence-level target samples, 15 assessments per case totaling 600, with randomized ordering.
Writer identification: reports 89.38% from a single word and 99.70% from 50 words on holdout writers.
Invertibility checks: explicitly evaluates matrix ranks for $C_{ct}$ for 1-, 2-, and 3-character strings, noting a small number of rare non-invertible 3-character cases.
What are the outcomes/limitations?
Outcomes
Generated handwriting achieves 88% human preference over the baseline. The sampling variant is chosen 5.22x more often as most similar to target. Strong writer-identification performance demonstrates the learned style vectors capture meaningful writer characteristics.
Structural decoupling via $C$ performs better than style-transfer baselines where the network must implicitly separate content and style.
Limitations / failure modes
Delayed strokes, such as dotting an i after completing the letter body, are challenging for substring-based representations. Missed delayed strokes shown as a limitation case.
Relies on segmentation quality (predicting end-of-character indices); errors propagate into $w_{ct}$ extraction and synthesis.
Reproducibility Details
Model
Input stroke representation: each point $p_t$ stores $(\Delta x_t, \Delta y_t)$ and an end-of-stroke flag $eos\in\{0,1\}$, giving $x\in\mathbb{R}^{N\times 3}$.
Outputs (decoder): MDN parameters $(\pi_t,\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)$ plus probabilities for $eos$ and $eoc$ (end-of-character).
Core latent variable shapes: $w\in\mathbb{R}^{256}$, $C_{ct}\in\mathbb{R}^{256\times256}$, $w_{ct}\in\mathbb{R}^{256}$.
Character encoder $g_\theta$ for producing $C_{ct}$: FC layer $\rightarrow$ LSTM $\rightarrow$ FC mapping to 65,536 dims and reshape to $256\times256$; FC2 layer is approximately one-third of total parameters.
Segmentation network $k_\theta$: bidirectional LSTM taking 23 features per point; trained with a modified CTC-style loss to label each point with a character and find eoc indices.
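A minimal numerical sketch of the factorization and style recovery with the shapes above, using random well-conditioned stand-ins for $C_{ct}$ (the real matrices come from the character encoder $g_\theta$).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes from above: w in R^256, C_ct in R^{256x256}, w_ct in R^256.
w = rng.standard_normal(256)                                        # writer style vector
C = [np.eye(256) + 0.01 * rng.standard_normal((256, 256)) for _ in range(5)]

# Forward factorization: writer-character descriptors w_ct = C_ct w.
w_ct = [C_t @ w for C_t in C]

# Style recovery by averaging per-prefix inversions: w ~= (1/M) sum_t C_ct^{-1} w_ct.
w_hat = np.mean([np.linalg.solve(C_t, x) for C_t, x in zip(C, w_ct)], axis=0)
assert np.allclose(w, w_hat)
```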
Data
BRUSH dataset design: baseline shown in every drawing box so the initial action includes the shift from baseline to start point; 170 individuals; 488 common words across 192 sentences.
Character inventory: 86 characters (space + 85 listed including digits, upper/lowercase, punctuation/symbols).
Collection constraints: 170 writers on MTurk; 60-minute time limit; word selection based on 3,036 Gutenberg books to cover character pairs.
Coverage: selection achieves 99.9% coverage of character pairs with 3,894 pairs.
Cleaning / ordering: correcting missed delayed strokes by adding them back and sorting strokes left-to-right.
License: Dataset may only be used for non-commercial research purposes; contact Prof. James Tompkin for other uses.
Download: Google Drive ZIP (566 MB) linked from repository README.
Algorithms / Training
Training uses teacher forcing in the decoder: during training feeds true point sequences; at runtime feeds predicted points.
Loss (decoder location term): negative log-likelihood of the next point under the MDN mixture distribution, with additional EOS/EOC terms.
Invertibility enforced implicitly: $L_{wct}$ terms penalize failures where $CC^{-1}\neq I$, discouraging singular $C$.
Backprop through matrix inverse: uses $\frac{dC^{-1}}{dx}=-C^{-1}\frac{dC}{dx}C^{-1}$ to enable end-to-end training.
Optimization + key hyperparameters: Adam, learning rate 0.001, gradient clipping to $[-10,10]$, 5 sentence-level samples per batch, 3-layer stacked LSTMs.
Sampling Method $\beta$ (database-driven): procedure to build/query a database $D$ of writer-character DSDs keyed by substrings and sample $w_{ct}$ for synthesis.
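A quick finite-difference check of the matrix-inverse derivative identity used above, on a small random matrix; purely illustrative, not from the released code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 8, 1e-6

C = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # invertible base matrix
dC = rng.standard_normal((n, n))                     # perturbation direction dC/dx

# Analytic directional derivative of the inverse: -C^{-1} (dC/dx) C^{-1}
C_inv = np.linalg.inv(C)
analytic = -C_inv @ dC @ C_inv

# Central finite difference along the same direction
numeric = (np.linalg.inv(C + eps * dC) - np.linalg.inv(C - eps * dC)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-5)
```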
Evaluation
Human study protocol: 25 MTurk participants; 40 sentence-level targets; 15 assessments per case; randomized ordering; total 600 assessments.
Primary reported preference signal: sampling method selected 5.22x more often as “most similar.”
Writer ID metrics: 89.38% from a single word and 99.70% from 50 words.
Hardware / Production
Compute constraints: baseline comparisons adjusted training settings due to 8GB VRAM GPU constraint, requiring batch size reduction.
License & Availability
Code license: Brown University copyright; non-commercial use only. Grants permission for use/copy/modify/distribute except incorporation into a commercial product or service.
Dataset license: BRUSH dataset restricted to non-commercial research purposes per repository README. Commercial or other uses require contacting the authors.
CHART-Infographics — Notes
Paper: ICDAR 2019 Competition on Harvesting Raw Tables from Infographics (CHART-Infographics) (ICDAR 2019)
Data: Synthetic (CHART2019-S), PMC
License: CC-BY-NC-ND 3.0 (synthetic), CC-BY-NC-SA 3.0 (PMC)
TL;DR
This paper reports the setup and results of the first CHART-Infographics competition: a task decomposition for chart understanding plus a large synthetic training set and a smaller, manually annotated real-chart test set from PubMedCentral. Main takeaway: systems do very well on synthetic data but drop substantially on real charts; the benchmark, annotation tooling, and evaluation scripts are meant to standardize comparisons.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
- Primary “artifact” is a benchmark ecosystem: synthetic dataset generation, real-chart dataset sampling/annotation tooling, and released evaluation scripts.
Secondary: $\Psi_{\text{Evaluation}}$
- Defines task-specific metrics (notably for text detection+OCR scoring, axis tick localization scoring, legend marker localization scoring) and analyzes cross-domain generalization (synthetic to real).
What is the motivation?
- Chart understanding has many proposed methods, but the field lacked common benchmarks and tools to compare approaches fairly.
- The authors aim to (1) provide a large-scale synthetic training regime to enable deep models and (2) test robustness on real charts from scientific literature, where layout and rendering variability is higher.
What is the novelty?
Task decomposition of “extract raw tables from charts” into a pipeline of independently evaluable sub-tasks (Tasks 1–7).
Synthetic dataset generation from real-world data tables, rendered into 10 chart types with randomized stylistic/layout variations, and annotated automatically via Matplotlib API.
New evaluation formulations for chart-specific subtasks:
- Text detection+recognition: combines IoU-based matching with OCR string similarity via normalized character error rate (NCER).
- Axis analysis: weighted F-measure with partial credit based on tick-location accuracy as a function of distance relative to image diagonal.
- Legend analysis: weighted F-measure with partial credit based on legend-marker bounding box IoU.
What experiments were performed?
Competition evaluation on two domains:
- Synthetic charts (large-scale; used for training and testing).
- PMC (real charts) from PubMedCentral Open Access, manually annotated for evaluation (with different annotation depths per task).
Tasks actually receiving submissions: Tasks 1–5 (no submissions for Tasks 6–7).
Reported results include:
Task 1 confusion matrices for top methods on PMC (Figure 4).
Task-level summary tables:
- Task 1: average F-measure on synthetic vs PMC (Table II).
- Task 2: text detection/recognition scores (Table III).
- Tasks 3–5: average/weighted F-measures (Table IV).
What are the outcomes/limitations?
Outcomes
Strong synthetic performance across several tasks, but substantial degradation on real charts.
- Example: Task 1 (chart classification) top system is near-perfect on synthetic (99.81) but lower on PMC (88.29).
For PMC, success correlated with extra data augmentation or external datasets beyond the provided synthetic set.
PMC text is hard: resolution limits, human-difficult regions, and many special symbols (Greek letters, superscripts/subscripts) hurt OCR.
Limitations (as evidenced by setup + results)
- Domain gap is the core difficulty: synthetic rendering does not capture the full diversity of real charts.
- Only 5 teams submitted results, and only for Tasks 1–5; no participation in full data extraction (Tasks 6–7), suggesting the end-to-end problem is still very challenging under this formulation.
- For PMC axis ticks, the paper notes annotation patterns (minor ticks, separator ticks) that did not exist in synthetic; evaluation required adjusting GT to match the only tick-label pattern represented in synthetic training.
Model
This is a competition + benchmark paper, not a single-model paper. The “model” content is participant system summaries.
Representative approaches (submitted teams):
ABC (Fintech team):
- Task 1: ResNet-101 multi-label classifier.
- Task 2: connected components + Faster R-CNN detection; attention modules for multi-orientation OCR.
- Task 3: gradient boosting decision tree over engineered features (geometry, alignment patterns, numeric-ness, direction, relative position to axis/legend).
- Task 4: axis detection via color + line segments; tick detection via gradients along axes.
- Task 5: segmentation (ResNet + FPN) for legend markers + rule filtering; text-marker pairing by proximity.
A-team: PixelLink for text detection; Tesseract OCR for recognition; SVM for text-role classification; heuristic “skeleton” for tick localization; connected components for legend markers.
Other teams: ResNet-50 classifier (ANU-Team); small CNN (Boomerang); SVM text-role classification with geometric features (IITB-Team).
Data
Synthetic chart dataset
Rendered with Matplotlib using tabular data from multiple public sources (World Bank indicators, India open data, UN commodity trade stats, US census, stock/ETF price-volume).
Chart types (10 total) shown in Figure 1:
- Pie, Donut, Line, Scatter, Vertical Box, Horizontal Box, Vertical Bar (Grouped), Vertical Bar (Stacked), Horizontal Bar (Grouped), Horizontal Bar (Stacked).
Synthetic dataset generation includes explicit randomization/variation:
- Title/legend placement, fonts, style (colors/line widths/borders/grids/markers), bar widths, pie radii, optional error bars.
Annotation obtained programmatically from Matplotlib API, including tight bounding boxes for text, axes, legends, and plot elements (bars/lines/pies).
Counts (Table I; synthetic): Train = 198,010; Test = 4,540, with per-type distributions (e.g., Line train 41,874; Scatter train 41,703).
PubMedCentral (PMC) real chart dataset
Source: PubMedCentral Open Access (paper mentions $>1.8$ million papers).
Sampling strategy:
- Select journals likely to contain charts (epidemiology, public health, pathology, genetics), restrict to journals with $>500$ publications, papers after year 2000.
- Cluster extracted figures by visual similarity to separate chart candidates from non-charts; manually annotate images within chart-heavy clusters.
- Final sample: 4,242 single-panel figures across chart types (Table I “PMC Test” column shows per-type counts; e.g., Line 2,257; Scatter 532).
Annotation depth:
- Entire 4,242 used for Task 1 evaluation.
- Two disjoint sub-samples fully annotated for Task 2 and Tasks 3–5 evaluation.
Annotation tooling and workflow (Figure 2, plus text):
Annotate figure type, panel type, chart type/orientation.
Annotate text regions (location, role, transcription).
- Used Tesseract OCR for initial transcription, then manually corrected errors; special symbols recorded as LaTeX strings.
Annotate legends, then axes (axis type categorical vs numerical; first-quadrant bbox; tick locations; axis titles/labels; links between ticks and tick labels).
Annotate plot elements by chart type (bars/lines/boxes/scatter marks).
Tools are described as open-source and available post-competition.
Algorithms / Training
This paper does not define a single training recipe, but it does define a task pipeline and the evaluation condition:
- Overall chart extraction is treated as a pipeline where upstream outputs feed downstream tasks (Figure 3).
- For evaluation of each task, participants were provided the ideal outputs of previous tasks, isolating errors per module rather than compounding pipeline failures.
Evaluation
Task definitions (as used in the competition)
Task 1: Chart Classification
- Classes: horizontal bar, vertical bar, horizontal box, vertical box, line, scatter, pie, donut (pie/donut only used in Task 1).
- Metric: average per-class F-measure.
Task 2: Text Detection and Recognition (logical element level, not word level)
Inputs: chart image + correct chart class.
Detection match: IoU threshold = 0.5; many-to-one and one-to-many resolved by highest overlaps; others count as FP/FN.
Recognition score per matched pair: $\max(1-\text{NCER}, 0)$ where NCER is normalized edit distance (normalized by GT string length).
Aggregation:
- Detection scores averaged per image by $\max(\#GT, \#Pred)$.
- Recognition scores averaged per image by $\#GT$.
- Final metric: harmonic mean of detection and recognition, averaged across test set.
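A sketch of the per-image Task 2 score under this description; the treatment of unmatched elements and empty images is our assumption.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def recognition_score(gt_text, pred_text):
    """max(1 - NCER, 0); NCER = edit distance normalized by the GT string length."""
    return max(1.0 - levenshtein(gt_text, pred_text) / max(1, len(gt_text)), 0.0)

def task2_image_score(matched_pairs, n_gt, n_pred):
    """Per-image Task 2 score: detection averaged over max(#GT, #Pred),
    recognition averaged over #GT, combined by harmonic mean.

    matched_pairs: (gt_text, pred_text) tuples for IoU >= 0.5 matches.
    """
    detection = len(matched_pairs) / max(n_gt, n_pred, 1)
    recognition = sum(recognition_score(g, p) for g, p in matched_pairs) / max(n_gt, 1)
    denom = detection + recognition
    return 0.0 if denom == 0 else 2 * detection * recognition / denom
```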
Task 3: Text Role Classification
- Inputs: text boxes + transcripts for all logical elements.
- Classes: chart title, axis title, tick label, legend title, legend label.
- Metric: average per-class F-measure.
Task 4: Axis Analysis
- Output: for both x and y axes, list of tick-label text elements paired with a pixel point $(x,y)$ for tick location.
- Metric: weighted F-measure with partial credit based on distance to GT tick location.
- Scoring uses thresholds $a=1.0\%$ and $b=2.0\%$ of the image diagonal. For prediction distance $d$: $$ s(d,a,b)= \begin{cases} 1 & d \le a \\ \frac{b-d}{b-a} & a \le d \le b \\ 0 & d \ge b \end{cases} $$ Recall = sum(scores) / (# GT ticks); Precision = sum(scores) / (# predicted ticks).
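A minimal sketch of the partial-credit tick scoring and the resulting weighted F-measure; aggregation details beyond the description above are assumptions.

```python
def tick_score(d, a=0.01, b=0.02):
    """Partial credit for a predicted tick at (diagonal-normalized) distance d
    from its GT location; a = 1% and b = 2% of the image diagonal."""
    if d <= a:
        return 1.0
    if d >= b:
        return 0.0
    return (b - d) / (b - a)

def axis_weighted_f(match_distances, n_gt_ticks, n_pred_ticks):
    """Weighted F-measure: recall and precision are the summed per-tick scores
    normalized by #GT ticks and #predicted ticks, respectively."""
    total = sum(tick_score(d) for d in match_distances)
    recall = total / max(n_gt_ticks, 1)
    precision = total / max(n_pred_ticks, 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```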
Task 5: Legend Analysis
- Output: list of text elements paired with bounding boxes around legend markers.
- Metric: weighted F-measure, where partial TP credit is determined by marker BB IoU (text element must match, then IoU controls partial score).
Tasks 6–7 (data extraction and end-to-end) were specified but received no submissions; the paper only briefly outlines them.
Reported results (key numbers)
Task 1 (avg F-measure) (Table II):
- Synthetic: ABC 99.81; A-Team 94.82; ANU-Team 89.78; Boomerang 9.59.
- PMC: ABC 88.29; A-Team 77.52; ANU-Team 35.96; Boomerang 12.06.
Task 2 (Synthetic + PMC) (Table III):
- Synthetic: ABC IoU 69.92, OCR 94.97, F 80.54; A-Team IoU 70.96, OCR 78.97, F 74.75.
- PMC: A-Team IoU 48.48, OCR 58.81, F 53.15.
Tasks 3–5 (Synthetic) (Table IV):
- Task 3 avg F: ABC 100.00; A-Team 99.95; IITB-Team 60.25.
- Task 4 weighted F: ABC 96.49; A-Team 99.76.
- Task 5 weighted F: ABC 78.14; A-Team 87.13.
Tasks 3–4 (PMC) (Table IV):
- Task 3 avg F: A-Team 84.38; IITB-Team 35.58.
- Task 4 weighted F: A-Team 77.33.
Error patterns called out in analysis
Chart classification confusions:
- Grouped vs stacked bars (synthetic + PMC).
- On PMC: box plots misclassified as line/scatter; line vs scatter difficult (scatter may include fitted lines not representing raw data).
PMC text challenges:
- Low/variable resolution; hard-to-read text; special symbols.
Legend analysis:
- Since Task 3 is near-perfect on synthetic, the remaining gap in Task 5 is attributed mainly to marker BB localization quality; ABC has higher variance in IoU (more very good and more zero-IoU cases), A-Team is more consistently “good.”
Hardware / Production
- No concrete training hardware, runtime, or throughput specs are reported (expected for a competition summary rather than a single-system paper).
PlotQA — Notes
Paper: PlotQA: Reasoning over Scientific Plots (WACV 2020)
Project: iitmnlp.github.io/PlotQA
Code: NiteshMethani/PlotQA
Data: Project downloads (Google Drive)
License: MIT (code), CC-BY-4.0 (data)
TL;DR
PlotQA is a large-scale dataset for question answering over scientific plots that targets real-world variability and open-vocabulary, real-valued answers that often are not explicitly written in the image. The paper reports that common VQA baselines fail badly on these OOV questions, and proposes a hybrid system: classification for simple fixed-vocabulary questions, and a perception-to-table-to-semantic-parsing pipeline for harder reasoning questions.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ (dataset + annotations at scale are the headline contribution: 224,377 plots and 28.9M QA pairs; also bounding boxes for plot elements).
- Secondary: $\Psi_{\text{Method}}$ (hybrid model + staged pipeline for OOV numeric reasoning).
- Secondary: $\Psi_{\text{Evaluation}}$ (analysis showing model failures on OOV; pipeline module breakdown for VED/OCR/SIE).
(Informal coefficients: $\approx 0.45\,\Psi_{\text{Resource}} + 0.35\,\Psi_{\text{Method}} + 0.20\,\Psi_{\text{Evaluation}}$.)
What is the motivation?
- Existing plot-QA datasets (e.g., FigureQA, DVQA) are synthetic and often assume answers are either from a small fixed vocabulary or directly extractable text from the image.
- Many realistic plot questions require numeric reasoning over floating point values where the answer is not a visible string in the chart and is not in a small vocabulary (example shown: averaging multiple bars to get 51.67).
- PlotQA aims to close this gap by using real-world data sources and question templates derived from crowd-sourced questions, with heavy emphasis on open-vocabulary / real-valued answers.
What is the novelty?
Dataset design
- Scale: 224,377 plots; 28,952,641 question-answer pairs.
- Plot types: bar plots, line plots, scatter (dot-line) plots.
- Real-world data sourcing: crawled from sources such as World Bank Open Data, Open Government Data, and Global Terrorism Database; 841 indicator variables and 160 entity types (countries, cities, etc.). Data spans 1960–2016 with values from 0 up to $3.50\times 10^{15}$.
- Visual variability knobs: grid lines on/off, font size, tick label notation (scientific-E vs standard), line style (solid/dashed/dotted/dash-dot), marker style (asterisk/circle/diamond/square/triangle/inverted triangle), legend positions (multiple placements), and colors chosen from 73 colors. X-axis discrete elements vary from 2–12; legend entries from 1–4.
- Annotations: bounding boxes for title, legend box, legend names and markers, axis titles, axis ticks, bars, lines, etc., intended to support supervised perception submodules.
Question/answer taxonomy
- Questions partitioned into a $3\times 3$ grid:
- Question type: Structural Understanding, Data Retrieval, Reasoning
- Answer type: Yes/No, Fixed Vocabulary, Open Vocabulary (OOV)
- Key empirical point: a large majority of questions are open-vocabulary for data retrieval and reasoning (no open-vocab structural questions).
Model contribution (system approach)
- Hybrid approach: a binary question classifier routes each question to either:
  - QA-as-classification (predict from a small top-$k$ answer vocab), or
  - a multi-stage pipeline for OOV / reasoning answers: VED $\rightarrow$ OCR $\rightarrow$ semi-structured table extraction $\rightarrow$ table QA via semantic parsing.
What experiments were performed?
Dataset construction and statistics
- Crowd-sourcing stage: sample 1,400 plots; 5 workers per plot; 7,000 total questions collected on Mechanical Turk; workers encouraged to ask complex reasoning questions; pay $0.1 per question.
- Template stage: manually distilled into 74 templates, then instantiated at scale using a semi-automated process with in-house paraphrasing for natural phrasing.
- Split: Train 157,070 images (20,249,479 QA), Val 33,650 (4,360,648 QA), Test 33,657 (4,342,514 QA).
Baselines compared on PlotQA
- IMG-only (VGG19 image embedding $\rightarrow$ fixed-vocab classifier)
- QUES-only (LSTM question embedding $\rightarrow$ fixed-vocab classifier)
- SAN, BAN, LoRRA (models designed around fixed vocab and/or OCR-copy constraints)
- Proposed hybrid model
Additional evaluation on DVQA
- Reports comparisons against SAN and SANDY-OCR and shows improved accuracy on DVQA test and DVQA test-novel.
Module-level diagnostics (pipeline analysis)
- VED accuracy measured by AP at IoU thresholds (0.5, 0.75, 0.9) by element class and overall mAP.
- OCR evaluated in oracle mode (GT boxes) and pipeline mode (predicted VED boxes).
- SIE evaluated via F1 on extracted table tuples $\{row, col, value\}$.
What are the outcomes/limitations?
Main quantitative outcomes
- PlotQA accuracy (reported):
- IMG-only 4.84%, QUES-only 5.35%
- SAN 7.76%
- BAN 0.01%, LoRRA 0.02%
- Hybrid model 22.52%
- DVQA accuracy (reported):
- SAN 32.1% (test), 30.98% (test-novel)
- SANDY-OCR 45.77% (test), 45.81% (test-novel)
- Hybrid model 57.99% (test), 59.54% (test-novel)
Human performance estimate
- Human accuracy: 80.47% on 5,860 questions over 160 test images, using the same metric (with numeric tolerance). Main human errors attributed to numeric precision difficulties.
Key limitations and failure drivers (as analyzed in the paper)
- The staged pipeline is bottlenecked by tight localization requirements for plot element detection: while mAP@0.5 is high, performance drops sharply at IoU 0.9 for several classes (e.g., dotlines and title). Small box misalignment can cause large numeric value errors downstream.
- OCR is relatively robust to VED noise (oracle 97.06% vs pipeline 93.10%), suggesting VED/SIE alignment issues are more critical than raw OCR quality in this setup.
- Semi-structured extraction quality is imperfect even when VED mAP@0.5 looks strong: reported SIE table extraction F1 is 0.68, attributed to imperfectly tight boxes and mis-associations (example given where IoU 0.58 box changes inferred bar value from 680 to 760).
Model
High-level architecture (hybrid routing)
- Input: plot image + question text.
- Router: binary question classifier predicts whether the answer is in a small fixed vocabulary (simple) vs OOV / needs reasoning (complex).
- If simple: QA-as-classification predicts from top-$k$ vocabulary (softmax over answer classes).
- If complex: staged pipeline:
- VED: detect bounding boxes for plot elements and classify them
- OCR: read text in detected boxes (titles, tick labels, legend labels)
- SIE: construct a semi-structured table from detections + OCR + geometric/color heuristics
- Table QA: answer the question by semantic parsing over the table-as-KG
Visual Elements Detection (VED)
- Data-bearing elements treated as 10 classes (title, axis labels, tick labels, legend markers/names, bars/lines, etc.).
- Implementation choice: Faster R-CNN with Feature Pyramid Network (FPN) selected after comparing multiple detection methods (Fast R-CNN, YOLO, SSD, Mask R-CNN cited as candidates).
OCR
- Crop each detected textual element to its bounding box, grayscale, resize/deskew, then run OCR.
- Paper states a pretrained OCR module performs well on machine-written English text in these plots.
Semi-Structured Information Extraction (SIE)
- Target table: rows are x-axis tick values; columns are legend entries; cell $(i,j)$ stores value for tick $i$ and legend $j$.
- Heuristics described:
- Legend name $\leftrightarrow$ legend marker/color: nearest bounding box association
- Tick label $\leftrightarrow$ tick mark: nearest bounding box association
- Bar $\rightarrow$ x-tick: closest tick label box to bar box
- Bar $\rightarrow$ legend series: dominant bar color matched to legend marker color
- Bar value: infer bar height from box geometry; find y-tick labels immediately above/below and interpolate (see the sketch after this list).
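A minimal Python sketch of the bar-value heuristic, assuming axis-aligned boxes given as (x0, y0, x1, y1) in image coordinates (y grows downward) and numeric y-tick labels already read by OCR; the function name, fallback behaviour, and toy geometry are illustrative, not the paper's implementation.

```python
def center_y(box):
    """Vertical center of an axis-aligned box (x0, y0, x1, y1); y grows downward."""
    return (box[1] + box[3]) / 2.0

def infer_bar_value(bar_box, y_ticks):
    """Estimate a vertical bar's value from its top edge and the two nearest y-tick labels.

    bar_box : (x0, y0, x1, y1) of the detected bar (y0 = top edge).
    y_ticks : list of (numeric_value, box) pairs for OCR-read y-axis tick labels.
    """
    bar_top = bar_box[1]
    # Ticks whose label centers lie above / below the bar's top edge.
    above = [(v, center_y(b)) for v, b in y_ticks if center_y(b) <= bar_top]
    below = [(v, center_y(b)) for v, b in y_ticks if center_y(b) > bar_top]
    if not above or not below:
        # Fallback: nearest tick if the bar top lies outside the tick range.
        v, _ = min(y_ticks, key=lambda t: abs(center_y(t[1]) - bar_top))
        return v
    v_hi, y_hi = min(above, key=lambda t: bar_top - t[1])   # closest tick above (higher value)
    v_lo, y_lo = min(below, key=lambda t: t[1] - bar_top)   # closest tick below (lower value)
    # Linear interpolation between the two neighbouring ticks.
    frac = (y_lo - bar_top) / (y_lo - y_hi)
    return v_lo + frac * (v_hi - v_lo)

# Example: ticks at 0, 200, ..., 800; a bar whose top sits midway between 600 and 800.
ticks = [(v, (0, 400 - v / 2, 20, 410 - v / 2)) for v in range(0, 801, 200)]
print(round(infer_bar_value((50, 55, 80, 405), ticks)))  # ~700
```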
Table Question Answering
- Table converted to knowledge graph; question mapped to candidate logical forms using compositional semantic parsing; logical forms ranked via log-linear model; top logical form executed to produce the answer.
- Motivation: supports numeric reasoning without forcing answers into a small classification vocabulary.
Data
Sources and scope
- Crawled structured data from multiple public sources (World Bank Open Data, Open Government Data, Global Terrorism Database, etc.).
- 841 indicator variables and 160 unique entities; years 1960–2016 (not all variables cover all years). Values include integers, floats, percentages, and linear-scale values; range reported as 0 to $3.50\times 10^{15}$.
Plot generation
- 224,377 plots produced by combining indicators and entities. Plot-level randomization over visual style settings described earlier (grid/font/notation/line/marker/legend position/colors).
- Provides bounding box labels for many plot components to support supervised perception training.
QA generation
- 7,000 crowd questions collected first, then abstracted into 74 templates, then instantiated with paraphrasing to avoid unnatural template fills. Total QA pairs: 28,952,641.
Where to download (public)
- Paper shortlink: the paper states the dataset (and crowd-sourced questions) can be downloaded from bit.ly/PlotQA.
- Official GitHub repo: the PlotQA repository includes a “download the dataset” link. (GitHub)
- Official project site: the PlotQA webpage provides direct download links (Google Drive) for Plot Images, Annotations, and QA Pairs. (NLP at IIT Madras)
Licensing
- The datasets (i.e., PlotQA data) are released under CC-BY-4.0. (GitHub)
- The models & code are released under the MIT license. (GitHub)
Algorithms / Training
Baseline SAN training details (as reported)
- Image features: last pooling layer of VGG19.
- Question features: last hidden state of LSTM.
- Both mapped to 1024-d via FC, combined and passed through $\tanh$.
- Optimizer: Adam; learning rate 0.0003; batch size 128; 25,000 iterations.
Question router (binary classifier)
- 50-d word embeddings $\rightarrow$ LSTM with 128 hidden units $\rightarrow$ projection to 256-d $\rightarrow$ binary output layer.
- Training: 10 epochs with RMSProp; learning rate 0.001; validation accuracy 87.3%.
- Label generation for router training: for each question, label is 1 if QA-as-classification performs better than multi-stage pipeline, else 0.
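A minimal PyTorch sketch of a router with the reported shapes (50-d embeddings, 128-unit LSTM, 256-d projection, binary output) trained with RMSProp at learning rate 0.001; the vocabulary size, the ReLU after the projection, and the toy training step are assumptions.

```python
import torch
import torch.nn as nn

class QuestionRouter(nn.Module):
    """Binary router: 'simple' (answer in fixed vocab) vs. 'complex' (needs the pipeline)."""

    def __init__(self, vocab_size: int = 10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 50)        # 50-d word embeddings
        self.lstm = nn.LSTM(50, 128, batch_first=True)   # LSTM with 128 hidden units
        self.proj = nn.Linear(128, 256)                  # projection to 256-d
        self.out = nn.Linear(256, 1)                     # binary output layer (logit)

    def forward(self, token_ids):
        emb = self.embed(token_ids)                      # (B, T, 50)
        _, (h_n, _) = self.lstm(emb)                     # h_n: (1, B, 128)
        hidden = torch.relu(self.proj(h_n.squeeze(0)))   # (B, 256); ReLU is assumed
        return self.out(hidden).squeeze(-1)              # (B,) logits

router = QuestionRouter()
optimizer = torch.optim.RMSprop(router.parameters(), lr=0.001)  # RMSProp, lr 0.001
criterion = nn.BCEWithLogitsLoss()

# Toy batch: 4 padded questions of length 12; label 1 = "QA-as-classification wins".
questions = torch.randint(0, 10000, (4, 12))
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
optimizer.zero_grad()
loss = criterion(router(questions), labels)
loss.backward()
optimizer.step()
```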
VED training
- Faster R-CNN + FPN trained using PlotQA bounding boxes.
- Batch size 32; 200,000 steps; RMSProp learning rate 0.004.
Table QA training
- Trained using questions from PlotQA paired with corresponding ground-truth tables.
Evaluation
Metrics
- Primary metric: accuracy.
- Textual answers require exact match.
- Numeric answers treated as correct if within 5% of ground truth (to avoid strict float exact-match).
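A minimal sketch of this scoring rule under the stated 5% relative tolerance; the string-parsing fallback for textual answers is an assumption about answer formats, not the authors' evaluation code.

```python
def plotqa_correct(prediction: str, ground_truth: str, tol: float = 0.05) -> bool:
    """Exact match for textual answers; numeric answers accepted within 5% of ground truth."""
    try:
        pred, gt = float(prediction), float(ground_truth)
    except ValueError:
        # Non-numeric answer: plain exact string match (after trimming whitespace).
        return prediction.strip() == ground_truth.strip()
    if gt == 0.0:
        return pred == 0.0
    return abs(pred - gt) / abs(gt) <= tol

print(plotqa_correct("102", "100"))              # True  (2% relative error)
print(plotqa_correct("110", "100"))              # False (10% relative error)
print(plotqa_correct("Argentina", "Argentina"))  # True  (exact match)
```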
Reported PlotQA results
- Large gap between proposed model (22.52%) and human (80.47%).
- Baselines that assume fixed vocab or OCR-copy answers perform very poorly, consistent with the dataset’s emphasis on OOV numeric reasoning.
Pipeline diagnostics (reported)
- VED: overall mAP is strong at IoU 0.5 (96.43%) but degrades at stricter IoU 0.9 (72.29% overall), with especially sharp drops for some classes (e.g., title, dotline).
- OCR: 97.06% (oracle boxes) vs 93.10% (after VED boxes).
- SIE: table extraction F1 = 0.68; argued to be sensitive to tight localization and alignment, motivating improved VED for structured images.
Hardware / Production
- The paper provides training-step counts and batch sizes for key modules (SAN and VED) and optimizers/learning rates, but does not report concrete hardware specs (GPU type/count), wall-clock training time, or throughput/latency.
Beagle — Notes
Paper: Beagle: Automated Extraction and Interpretation of Visualizations from the Web (CHI 2018)
Code: leibatt/beagle-annotator
Data: UW Project Page
License: MIT (code); dataset license not stated
TL;DR
Beagle is an automated pipeline for crawling the web to extract SVG-based visualizations and classifying them into visualization types using 114 hand-engineered SVG statistics with a Random Forest classifier. The system achieves 82-99% accuracy when training and testing within single collections, and about 85% accuracy when mixing collections. The resulting dataset contains 42,000+ visualizations from five web sources (D3, Plotly, Chartblocks, Fusion Charts, Graphiq), enabling analysis of what chart types appear on the web.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$. Primary contribution is an end-to-end system and classifier for SVG visualization extraction and automated labeling (Web Crawler + Annotator).
- Secondary: $\Psi_{\text{Resource}}$. Produces a sizable mined dataset across multiple visualization “islands” (5 sites; 42,000+ visualizations).
- Secondary: $\Psi_{\text{Evaluation}}$ (light). Provides multi-class, cross-collection evaluation methodology with metrics (weighted/non-weighted accuracy, F1; stratified 5-fold CV repeated 10 times).
What is the motivation?
- The authors want to understand how visualization tools are actually used on the web, which requires collecting many real examples and labeling them at scale.
- A naive, unguided crawl is low-yield and redundant: after crawling approximately 20M pages, they found approximately 10k with visualizations (0.05%), heavily dominated by repeated StackOverflow profile charts.
- They pivot to crawling “islands” of visualization usage: centralized sites where SVG-based visualizations are commonly hosted.
What is the novelty?
System design: crawler + annotator
- Two standalone components:
- Web Crawler: extracts SVGs from pages (raw SVG spec + snapshot).
- Annotator: labels SVGs via an SVG-focused classifier.
SVG feature design for classification (114 features)
- Feature extraction computes basic statistics over SVG elements and feeds them into an off-the-shelf classifier.
- Total 114 features, grouped as: general (6), style (19), per-element (89).
- The per-element group targets five SVG element types with per-type feature counts: circle (16), rect (20), line (15), path (35), text (3).
- Examples of what is measured:
- General: counts of element types plus counts of horizontal/vertical axis lines to separate “often has axes” charts (bars) from “often no axes” (maps).
- Style: unique fill/border colors, stroke width min/max, font size min/max, unique font sizes, font variance; also accounts for style inheritance and CSS.
- Per-element geometry/layout: normalized x/y position stats, shared-position counts, class-name counts; plus element-specific stats like circle radii variance, rect width/height variance, and path “d” string length statistics.
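A minimal sketch of a few of the statistics above (element counts, unique fill colors, circle-radius variance) computed with Python's standard `xml.etree.ElementTree`; the feature names and the namespace handling are illustrative, not Beagle's actual extractor.

```python
import statistics
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://www.w3.org/2000/svg}rect' -> 'rect'."""
    return tag.rsplit("}", 1)[-1]

def extract_features(svg_text: str) -> dict:
    root = ET.fromstring(svg_text)
    counts, fills, radii = {}, set(), []
    for el in root.iter():
        name = local_name(el.tag)
        counts[name] = counts.get(name, 0) + 1
        if "fill" in el.attrib:
            fills.add(el.attrib["fill"])
        if name == "circle" and "r" in el.attrib:
            radii.append(float(el.attrib["r"]))
    return {
        "n_rect": counts.get("rect", 0),
        "n_circle": counts.get("circle", 0),
        "n_line": counts.get("line", 0),
        "n_path": counts.get("path", 0),
        "n_text": counts.get("text", 0),
        "n_unique_fill_colors": len(fills),
        "circle_radius_variance": statistics.pvariance(radii) if len(radii) > 1 else 0.0,
    }

svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect x="0" y="10" width="5" height="30" fill="steelblue"/>
  <rect x="10" y="5" width="5" height="35" fill="steelblue"/>
  <circle cx="3" cy="3" r="2" fill="orange"/>
  <text x="0" y="50">label</text>
</svg>"""
print(extract_features(svg))
```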
What experiments were performed?
Data collection and labeling
- Targeted crawl mines 5 SVG-heavy “islands”: bl.ocks.org (D3), Plotly, Chartblocks, Fusion Charts, Graphiq.
- Crawl yields 42,000+ total SVG-based visualizations; per-island counts reported as: D3 2,000+, Plotly 15,000, Chartblocks 22,000, Fusion Charts 500, Graphiq 2,500+.
- For analysis/evaluation, they omit about one quarter of extracted visualizations due to complex webpage features (example: animations) interfering with extraction.
- Labeling approach:
- By hand: D3 (bl.ocks.org), Fusion Charts, Graphiq.
- By code: Plotly, Chartblocks (using site structure cues like page titles or organized link lists).
- They note inconsistent metadata across sites as a general challenge.
Classification model choice and evaluation protocol
- They tried Multinomial NB, Gaussian NB, Decision Tree, and SVM with default scikit-learn parameters; the top performers showed no large differences, so they initially chose a tree-based approach for interpretability and then moved to Random Forest to address overfitting.
- Final classifier: scikit-learn `RandomForestClassifier` with `n_estimators=14`, defaults otherwise.
- Evaluation: stratified 5-fold cross-validation, repeated 10 runs, reporting weighted accuracy and non-weighted (equal-class-weight) accuracy, plus F1.
Within-group and between-group tests
- They run within-group evaluation per collection and between-group evaluation by mixing collections.
- Reported results (Table 2) include:
- D3 weighted accuracy 0.8193 (non-weighted 0.7155)
- Plotly weighted accuracy 0.9721 (non-weighted 0.9159)
- Chartblocks weighted accuracy 0.9955 (non-weighted 0.9955)
- Fusion Charts weighted accuracy 0.9258 (non-weighted 0.8753)
- Graphiq weighted accuracy 0.9873 (non-weighted 0.9726)
- Mixed-collection results:
- Mixture weighted accuracy 0.8527
- Mixture (Revision) weighted accuracy 0.7952 (noted as a revision baseline in the table)
- They also include a “Mixture + Non-Vis (binary)” line with accuracy/F1 reported as binary classification.
What are the outcomes/limitations?
Empirical findings about visualization usage
- From the broad unguided crawl: SVG visualizations are rare on the open web in their sampling (0.05% of approximately 20M crawled pages).
- Across the mined collections, four chart families dominate: bar, line, scatter, geographic maps.
- Table 3 summarizes usage patterns; examples:
- D3: “Map” most popular (30.4%); pie is 0.6%
- Plotly: “Scatter” most popular (46.6%); pie is 0.4%
- Chartblocks: only 4 types, with pie at 23.2% but still below line (34.7%) and bar (31.8%).
- They interpret the distribution as suggestive of a flexibility vs. ease-of-use tradeoff, and note that line/bar charts are consistently in the top 3 across collections.
Limitations acknowledged by the paper
- Format limitation: results are about SVG-based (often interactive) web visualizations and may not generalize to raster images, Excel outputs, etc.
- Time sensitivity: crawls represent a point-in-time view; multiple crawls would be needed to study evolution (example: newer D3 versions).
- Coverage limitation: only five sites were crawled; they explicitly plan to extend to more websites, more metadata (tool info, docs, possibly raw data), and richer design analysis (coordinated views, cross-filtering).
Model
- Pipeline: feature extraction over SVG $\rightarrow$ scikit-learn classifier (ultimately Random Forest).
- Features: 114 total, in three groups (general/style/per-element), with per-element subfeatures for circle/rect/line/path/text.
- Normalization for geometry: x positions divided by visualization width; y positions by height; line/path lengths by diagonal; widths by max(width,height).
Data
Crawl strategy and scale
- Unguided crawl: approximately 20M pages visited, approximately 10k pages with visualizations (0.05%), dominated by redundant StackOverflow profile charts.
- Targeted crawl: five SVG “islands” (D3, Plotly, Chartblocks, Fusion Charts, Graphiq) yielding 42k+ SVG visualizations.
Evaluation datasets (Table 1 excerpts)
- Collection sizes and type counts (examples shown in the table excerpt):
- Chartblocks: 22,730 visualizations, 4 types (pie/line/bar/scatter with counts listed).
- Fusion Charts: 530 visualizations, 10 types.
- Graphiq: 2,727 visualizations, 11 types.
- Plotly: 6,544 visualizations, 11 types.
- Omitted data: about one quarter of extracted visualizations excluded due to extraction issues from complex page features (animations mentioned).
Label construction
- Label superset creation steps: review site docs/galleries, compare against observed visualizations to find missing labels, then consolidate into a final superset.
- Some label consolidation choices: group geographic maps (choropleth, projections) and graphs/trees (dendrograms, trees); merge style variants (stacked vs grouped bars).
Dataset access and license
- Dataset URL: https://homes.cs.washington.edu/~leibatt/beagle.html
- Code license: MIT (beagle-annotator)
- Dataset license: Not explicitly specified; assumed MIT based on code license
Algorithms / Training
- Classifier family comparisons: Multinomial NB, Gaussian NB, Decision Tree, SVM (default scikit-learn params) with similar performance among best; tree-based chosen initially for interpretability, then Random Forest to address overfitting.
- Final model config: scikit-learn `RandomForestClassifier`, `n_estimators=14`, default other parameters.
- Cross-validation: stratified 5-fold, repeated 10 times; weighted and non-weighted accuracy, plus F1.
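A minimal scikit-learn sketch of the reported setup: `RandomForestClassifier` with `n_estimators=14` under stratified 5-fold cross-validation repeated 10 times. The synthetic feature matrix stands in for the 114 SVG statistics, and interpreting "non-weighted accuracy" as balanced (equal-class-weight) accuracy is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the 114 SVG features and chart-type labels.
X, y = make_classification(n_samples=1000, n_features=114, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

weighted_acc, unweighted_acc, f1s = [], [], []
for repeat in range(10):                                    # 10 repeated runs
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=repeat)
    for train_idx, test_idx in skf.split(X, y):             # stratified 5-fold CV
        clf = RandomForestClassifier(n_estimators=14, random_state=repeat)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        weighted_acc.append(accuracy_score(y[test_idx], pred))             # frequency-weighted
        unweighted_acc.append(balanced_accuracy_score(y[test_idx], pred))  # equal class weight
        f1s.append(f1_score(y[test_idx], pred, average="macro"))

print(f"weighted acc   {np.mean(weighted_acc):.4f}")
print(f"unweighted acc {np.mean(unweighted_acc):.4f}")
print(f"macro F1       {np.mean(f1s):.4f}")
```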
Evaluation
- Within-collection weighted accuracies span roughly 0.8193 (D3) up to 0.9955 (Chartblocks), with corresponding non-weighted scores lower on more imbalanced/multi-type sets like D3.
- Between-collection (Mixture) weighted accuracy reported as 0.8527, and the “Mixture (Revision)” line drops further (0.7952 weighted), suggesting sensitivity to domain shifts or representation differences captured by that revision baseline.
- They explicitly frame these experiments as testing labeling robustness across different rendering environments/tools.
Hardware / Production
Hardware, training time, and throughput are not described in the paper. The implementation is described at the level of “off-the-shelf scikit-learn classifier” and feature extraction over SVG.
DVQA — Notes
Paper: DVQA: Understanding Data Visualizations via Question Answering (CVPR 2018)
Code: kushalkafle/DVQA_dataset
Data: kushalkafle/DVQA_dataset
License: CC-BY-NC 4.0
TL;DR
DVQA introduces a large synthetic benchmark for bar-chart question answering (300,000 charts; 3,487,194 QA pairs) designed to break standard VQA assumptions, especially fixed vocabularies and chart-specific text in questions/answers. Standard VQA baselines perform well mainly on “structure” questions but fail on chart-specific answers; the paper proposes two stronger baselines (MOM and SANDY) that explicitly handle chart text via OCR-driven mechanisms.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$
- The headline contribution is the DVQA dataset (scale, splits, controlled variability, question templates, bias reduction).
- Secondary: $\Psi_{\text{Method}} + \Psi_{\text{Evaluation}}$
- Two model variants aimed at OOV/chart-specific text (MOM and SANDY) plus extensive baseline comparisons and breakdowns (including chart-specific subsets).
What is the motivation?
- Bar charts are common in scientific papers, web articles, and reports, but are not machine-interpretable by typical vision pipelines; even small appearance changes can invalidate heuristics.
- DVQA is positioned as both:
- a practical capability: querying chart repositories for numeric/semantic info, and
- a stress test for multi-step attention, measurement, and reasoning that typical VQA does not cover.
- The paper emphasizes three DVQA-specific challenges relative to natural-image VQA:
- fixed vocabularies break on chart-specific labels/answers,
- chart text tokens are arbitrary and context-dependent, and
- charts are brittle: small style changes can completely alter meaning.
What is the novelty?
Dataset novelty (DVQA)
- Scale: 300,000 bar-chart images; 3,487,194 question-answer pairs.
- Three question families: structure understanding, data retrieval, reasoning (template-generated but with large style/data variation).
- Two test regimes:
- Test-Familiar: only labels seen in training
- Test-Novel: labels drawn from a disjoint set to force OOV generalization.
- Style diversity via Matplotlib: variation in number of bars/groups, gridlines, color/width/spacing/orientation/texture, label/legend orientation and location.
- Bias controls: randomization to decorrelate style/color/labels and downsampling to balance certain yes/no questions.
Model novelty (two “strong baselines” for chart-specific text)
- MOM (Multi-Output Model): pairs a conventional classifier with an OCR-based answer generator and chooses between them via a learned branch selector.
- SANDY (SAN with DYnamic encoding): introduces a dynamic local dictionary built from OCR-detected text boxes to encode chart-specific words in questions and to output chart-specific answer tokens.
What experiments were performed?
- Evaluated multiple baselines (YES, IMG, QUES, IMG+QUES, SAN-VQA) vs. MOM and SANDY on:
- Test-Familiar and Test-Novel
- breakdown by Structure / Data / Reasoning
- targeted subsets: chart-specific questions and chart-specific answers.
- Metric: exact string match for correctness; additionally reports MOM (±1) allowing edit distance $\le 1$ to quantify near-miss OCR/string errors.
- Additional transfer check (limited): manually annotated 500+ structure questions on real bar charts scraped from the internet; SAN-based models reached ~59% without fine-tuning, reported as ~15% absolute gain over QUES.
What are the outcomes/limitations?
Key outcomes
- Standard VQA-style models do well on structure but struggle on chart-specific semantics:
- SAN-VQA overall: 36.04 (Familiar), 36.14 (Novel)
- SANDY (Oracle) overall: 56.48 (Familiar), 56.62 (Novel)
- The biggest failure mode for fixed-vocab VQA is chart-specific answers:
- SAN-VQA chart-specific answers: 0.10 (Familiar), 0.00 (Novel)
- IMG+QUES chart-specific answers: 0.09 (Familiar), 0.00 (Novel)
- SANDY (Oracle) chart-specific answers: 52.55 (Familiar), 52.70 (Novel)
- MOM helps, but appears limited by string generation/localization:
- MOM chart-specific answers: 12.78 (Familiar), 2.93 (Novel)
- MOM (±1) improves substantially (23.62 / 12.47), suggesting many errors are “near miss” OCR/string issues.
Limitations and open issues (as discussed)
- SANDY’s dynamic encoding is cascade-sensitive: if OCR misses a word, the positional chaining used to index the local dictionary can corrupt the whole mapping.
- MOM depends on accurate bounding box prediction for the answer region; localization errors can propagate to decoding errors.
- Dataset scope: DVQA in this paper is bar charts only; authors state a follow-up with pie charts/plots/other diagrams is planned.
- Real-world validation is limited (mostly structure questions, ~500 examples) and does not include end-to-end data-retrieval or reasoning evaluation on real charts.
Model
Shared components / preprocessing
- Image backbone for image-processing models: ResNet-152 (ImageNet pretrained), input resized to $448 \times 448$, producing a $14 \times 14 \times 2048$ feature tensor (unless noted).
- Question encoder: 1-layer LSTM with 1024 hidden units; word embeddings are 300-dimensional.
Baseline models
- YES: always answers “YES” (noted as slightly more common than “NO”).
- IMG: question-blind; pooled CNN features $\rightarrow$ MLP (hidden 1024) $\rightarrow$ softmax.
- QUES: image-blind; LSTM features $\rightarrow$ MLP (hidden 1024) $\rightarrow$ softmax.
- IMG+QUES: concat CNN+LSTM embeddings $\rightarrow$ MLP (hidden 1024) $\rightarrow$ softmax.
- SAN-VQA: Stacked Attention Network over last conv feature maps, conditioned on LSTM question embedding (implementation aligned to a “strong SAN baseline” variant the paper cites).
MOM (Multi-Output Model)
- Dual-network architecture:
- Classification sub-network: SAN-VQA for generic answers
- OCR sub-network: predicts a text box then decodes characters from that region.
- OCR branch details:
- BBox predictor: regression with MSE loss
- Crop patch from predicted region, resize to $128 \times 128$
- Apply a small 3-layer CNN
- Use N-step spatial attention over the patch to get per-character features, with $N = 8$ (max sequence length in experiments)
- Encode with bidirectional GRU, then decode with a character classifier trained with CTC loss.
- Branch selection:
- A separate binary classifier uses LSTM question features to choose “generic” vs “chart-specific”; reported as perfect on DVQA test data.
SANDY (SAN with DYnamic encoding)
- Replaces fixed word/answer dictionaries with a Dynamic Encoding Model (DEM):
- Assumes access to OCR boxes with positions + strings (oracle version uses dataset annotations; OCR version uses Tesseract).
- Local dictionary construction:
- Assign indices to detected text boxes based on a positional chaining procedure: start from lower-left as index 0, then repeatedly pick the nearest unassigned box to assign the next index (see the sketch after this list).
- Cap local dictionary size at $M = 30$ (no training chart exceeded 30 text labels).
- Usage:
- Augments the global question dictionary (size $N$) with $M$ local entries for encoding.
- Augments the global answer classes (size $L$) with $M$ local answer classes; predicting one maps back to a string via the local dictionary.
- OCR version preprocessing (Tesseract):
- keep only detections with alphabetic characters
- filter confidence < 50%
- drop single-character detections.
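A minimal sketch of the positional chaining above, assuming OCR boxes given by their centers in image coordinates (origin top-left, y increasing downward) and Euclidean distance throughout; the choice of the lower-left corner as the anchor and the tie-breaking are assumptions.

```python
import math

def chain_local_dictionary(boxes, image_height, max_size=30):
    """Index OCR text boxes by positional chaining: lower-left box first,
    then repeatedly the nearest unassigned box.

    boxes : list of (x, y, text) with (x, y) the box center in image
            coordinates (origin top-left, y increasing downward).
    Returns a dict mapping local index -> text string.
    """
    remaining = list(boxes)
    # Index 0: the box closest to the lower-left corner (0, image_height).
    current = min(remaining, key=lambda b: math.hypot(b[0], image_height - b[1]))
    remaining.remove(current)
    local = {0: current[2]}
    idx = 1
    while remaining and idx < max_size:          # cap the local dictionary at M = 30
        prev = current
        current = min(remaining,
                      key=lambda b: math.hypot(b[0] - prev[0], b[1] - prev[1]))
        remaining.remove(current)
        local[idx] = current[2]
        idx += 1
    return local

# Toy bar chart: three x-tick labels along the bottom, a title at the top.
boxes = [(40, 190, "cats"), (100, 190, "dogs"), (160, 190, "fish"), (100, 10, "Pets owned")]
print(chain_local_dictionary(boxes, image_height=200))
# {0: 'cats', 1: 'dogs', 2: 'fish', 3: 'Pets owned'}
```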
Data
Dataset scale and splits
- Total: 300,000 images, 3,487,194 QA pairs, 1,576 unique answers overall.
- Splits:
- Train: 200,000 images; 2,325,316 questions; 1,076 unique answers
- Test-Familiar: 50,000 images; 580,557 questions; 1,075 unique answers
- Test-Novel: 50,000 images; 581,321 questions; 577 unique answers.
Chart text vocabulary construction
- Uses nouns drawn from the Brown Corpus:
- training + Test-Familiar: 1000 most frequent nouns
- Test-Novel: 500 new words disjoint from training to create OOV conditions.
Underlying bar-value regimes
- Three data types:
- linear: values from 10 randomly chosen values in range 1–10
- percentage: 10 randomly chosen values in range 10–100
- exponential: 10 randomly chosen values in range 1–$10^{10}$
- Some bars may be zero, rendered as a missing bar.
Question families (examples)
- Structure understanding: counts, grouping, stacked vs non-stacked, horizontal vs vertical, patterns vs solid, negative values, etc.
- Data retrieval: scale type (log/percent), label identification (e.g., “third bar from the left”), legend-color mapping, “units sold” queries, etc.
- Reasoning: argmax/argmin, comparisons, sums/differences, threshold counts, cross-category comparisons (e.g., algorithm $A_1$ on dataset $D_1$ vs $A_2$ on $D_2$).
Bias minimization
- Randomize chart generation to avoid correlations between styles/colors/labels.
- Balance yes/no for selected question templates by randomly removing examples until balanced.
Algorithms / Training
- Optimization: Adam with initial learning rate 0.001.
- Regularization: dropout 0.5 applied to inputs of convolutional, fully connected, and LSTM units.
- Classification answer space:
- global answer dictionary size from training: 1076 answers for most classifiers.
- MOM OCR character decoder:
- 27 output classes (26 letters + blank).
- SANDY output layer:
- described as 107 units in the paper’s configuration, with indices 0–30 reserved for local dictionary entries and 31–107 for common answers.
Evaluation
Overall accuracy (Table 3)
Percent correct (exact match), Test-Familiar / Test-Novel:
- SAN-VQA:
- Structure: 94.71 / 94.82
- Data: 18.78 / 18.92
- Reasoning: 37.29 / 37.25
- Overall: 36.04 / 36.14
- MOM:
- Data: 29.52 / 21.40
- Overall: 40.89 / 37.26
- MOM (±1):
- Data: 38.20 / 29.14
- Overall: 45.03 / 40.90
- SANDY (Oracle):
- Data: 65.40 / 65.55
- Overall: 56.48 / 56.62
- SANDY (OCR):
- Data: 37.82 / 37.78
- Overall: 45.77 / 45.81
Chart-specific subsets (Table 4)
- Chart-specific answers are near-impossible for fixed-vocabulary baselines (near 0), while SANDY (Oracle) is ~52–53%.
- The paper attributes MOM’s gap vs SANDY partly to small string generation errors and dependence on precise localization; edit-distance scoring (±1) partially recovers these.
Hardware / Production
- The paper specifies architecture choices and image resolution/features, but does not provide concrete training hardware (GPU model/count), wall-clock training time, or throughput/latency numbers in the provided excerpt.
CROHME 2014: Competition on Recognition of On-Line Handwritten Mathematical Expressions
CROHME 2014: Competition on Recognition of On-Line Handwritten Mathematical Expressions • TC11 Dataset • GitHub
TL;DR
Competition report for CROHME 2014 with two new tasks: isolated symbol recognition with junk rejection and matrix expression recognition. Best system achieves 62.68% exact expression match on standard expressions; matrix recognition proves substantially harder at 53% expression-level accuracy.
What kind of paper is this?
- Dominant: $\Psi_{\text{Evaluation}}$ — It is primarily a competition report: tasks, metrics, protocols, and benchmarked system results (multiple tables of rates and error metrics).
- Secondary: $\Psi_{\text{Resource}}$ — It introduces/organizes new test data (all tasks) and a new matrix representation + multi-level evaluation objects (matrix/row/column/cell), and notes public availability of data/tools.
What is the motivation?
Handwritten math recognition requires solving joint segmentation, recognition, and 2D structure parsing over a large symbol set (101 classes in CROHME 2014). Unlike linear text, math expressions have spatial relations (superscripts, fractions, roots) that complicate interpretation. The task has practical value for pen-based math input on tablets and search applications.
What is the novelty?
Two new tasks expand the competition scope:
Task 1: isolated symbol recognition with reject option. Systems must accept valid symbols and reject “junk” (random 1-4 stroke sequences that don’t form real symbols). This tests robustness to segmentation errors.
Task 3: matrix recognition. Expressions contain matrices where systems must detect matrix boundaries, infer row/column structure, and recognize cell contents (which may be arbitrary expressions). Evaluation reports recall at four levels: matrix, row, column, and cell.
Task 2 (standard expression recognition) uses the same training set and grammar as 2013 with a new test set to track year-over-year progress.
What experiments were performed?
8 systems submitted 16 runs across three tasks. Data collection used whiteboards, iPads, and Wacom tablets from three labs; test expressions came from Wikipedia.
Metrics:
- Task 1: Top-1 accuracy, TAR (true acceptance rate), and FAR (false acceptance rate) for junk rejection
- Task 2: Expression-level exact match plus object-level recall/precision for segmentation, classification, and structural relations
- Task 3: Expression rate, symbol rate, and recall at matrix/row/column/cell levels
Systems were also evaluated on the 2013 test set for year-over-year comparison, though this risks validation leakage.
What are the outcomes/limitations?
Task 2 (standard expressions): Best system achieves 62.68% exact match (MyScript). Expression-level scores show large gaps between systems, but object-level metrics compress: segmentation and relation detection recalls vary less dramatically. The authors note potential validation leakage since 2014 systems could tune on the 2013 test set.
Task 1 (symbol + junk): Test set contains 10,061 valid symbols and 9,161 junk examples. Adding junk typically drops accuracy 5-14 points. False acceptance rates vary widely (6-37%), while true acceptance rates cluster tighter (78-87%). Top systems use different classifiers (MLP, RNN, SVM), suggesting no dominant architecture.
Task 3 (matrices): Only two systems participated. Best expression-level accuracy is 53%, substantially below standard Task 2 performance. Both systems recognize rows better than columns; column recall is the weakest sub-object.
Limitations: Year-over-year comparison is confounded by potential test-set leakage. The matrix task’s low participation limits analysis. Expression length increased (9.06 to 10.20 symbols/expression) but difficulty effects are unclear.
Model
Competition report; no single architecture. Selected system details:
System I (UPV, “seshat”)
Task 1: 7 online point features + rendered image (40px height) with 9 offline features per column. Two BLSTM-RNNs (online/offline) combined via rnnlib.
Tasks 2/3: 2D stochastic context-free grammar parsing with GMM for spatial relationships. Matrix extensions enforce row/column count consistency and use center-difference features for relations.
System II (University of São Paulo)
3-stage pipeline: (1) generate symbol hypotheses by combining each stroke with 3 nearest neighbors, (2) extract baseline tree from low-cost hypotheses, (3) parse LaTeX for grammar legality and select lowest-cost legal expression.
System III (MyScript)
Task 1: MLP combining trajectory features (position/direction/curvature) and bitmap features (projections/histograms).
Tasks 2/3: Joint segmentation-recognition-interpretation with grammar encoding spatial relations. Uses “symbol expert” producing segmentation probabilities, statistical language model with spatial context, and global discriminant training at equation level. Matrix task adds row/column post-processing.
System IV (RIT DPRL)
Task 1: RBF-kernel SVM with online/offline features (stroke count, curvature, crossings, fuzzy histograms).
Task 2: AdaBoost for stroke merging; MST-based parsing that groups vertical structures, identifies dominant operators, then processes baselines. Relations use bounding box geometry and shape context.
System V (RIT CIS)
Segmentation: Merge/split decisions over adjacent stroke pairs using 405 features (geometric + multi-scale shape context + classification probabilities), PCA to 100D, then RBF SVM.
Parsing: Strong relations (above/below/inside) detected first; weak relations (right/sup/sub) via polar histogram features (15 distance $\times$ 20 angle) with PCA and SVM.
System VI (Tokyo Univ. of Agriculture and Technology)
Task 1: 512D NCGF features reduced to 300D via FLDA; GLVQ classifier. Junk clustered into 64 LBG clusters.
Task 2: Two SVMs learn structural relations; stroke order reduces parsing complexity to $O(n^3|P|)$.
System VII (University of Nantes, IRCCyN)
Joint optimization of segmentation, recognition, and structure under grammar. MLP with reject capability; “global learning” trains directly from expressions with explicit junk modeling.
System VIII (ILSP/Athena)
Template elastic matching using 8-direction Freeman chain code. Distance computed over normalized codes weighted by stroke-length proportions.
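A minimal sketch of an 8-direction Freeman chain code over an online stroke, the representation System VIII matches templates against; the direction convention and the omission of resampling and stroke-length weighting are simplifications, not the system's implementation.

```python
import math

def freeman_chain_code(points):
    """8-direction Freeman chain code for a stroke given as (x, y) points (y downward).

    Direction 0 = east, 1 = north-east, ..., 7 = south-east (counter-clockwise).
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y0 - y1            # flip y so "up" is positive
        if dx == 0 and dy == 0:
            continue                          # skip repeated points
        angle = math.atan2(dy, dx) % (2 * math.pi)
        codes.append(int(round(angle / (math.pi / 4))) % 8)
    return codes

# A stroke moving right, then diagonally up-right, then straight up.
stroke = [(0, 10), (5, 10), (10, 5), (10, 0)]
print(freeman_chain_code(stroke))  # [0, 1, 2]
```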
Data
Task 2: full expressions (core CROHME task)
- Training: 8,836 expressions, 85,781 symbols.
- Test: 986 expressions, 10,061 symbols.
Task 1: isolated symbols with junk
- Training: 85,781 valid symbols; junk is user-generated by a provided script from Task 2 training data.
- Test: 10,061 valid symbols, 9,161 junk.
Task 3: matrices
- Training: 362 matrices, 2,332 cells, 4,281 symbols.
- Test: 175 matrices, 1,075 cells, 2,101 symbols.
- Reported averages: about 6 cells per matrix (roughly a $2 \times 3$) and about 2 symbols per cell. (See Figure 1 examples of matrix expressions, p.2.)
Collection protocol
- Organizing labs: IRCCyN/IVC (France), RIT/DPRL (USA), ISI/CVPR (India).
- Device diversity: whiteboard/tablet PC, iPads (finger), Wacom stylus tablet.
Algorithms / Training
No unified training recipe. Common patterns:
- Joint inference: Systems I, III, VII jointly solve segmentation, classification, and structure under grammar constraints
- Reject modeling: Some systems implement explicit rejection for junk; others report “n.r.o.” (no reject option)
- Feature fusion: Systems I, III, IV combine online trajectory and offline rendered-image features
- Grammar-driven parsing: Multiple systems use grammars or staged relation detection
Evaluation
Task 1 (isolated symbols)
Selected results (Top-1 on valid symbols without junk; with junk where available):
| System | No junk | With junk | TAR | FAR |
|---|---|---|---|---|
| System I | 91.24% | 84.14% | 80.29% | 6.44% |
| System III | 91.04% | 85.54% | 87.12% | 10.39% |
| System IV | 88.66% | 83.61% | 83.52% | 9.03% |
| System V | 85.00% | 71.19% | 86.84% | 36.85% |
Top systems use different classifiers (MLP/RNN/SVM). FAR varies more than TAR across methods.
Task 3 (matrix recognition)
| System | Expression | Symbol | Matrix | Row | Column | Cell |
|---|---|---|---|---|---|---|
| System III | 53.28% | 89.81% | 92.57% | 92.00% | 69.16% | 71.07% |
| System I | 31.15% | 87.43% | 73.14% | 70.59% | 50.84% | 55.35% |
Both systems struggle with columns more than rows. Expression-level accuracy is substantially lower than Task 2.
Task 2 (full expressions)
Best system (System III): 62.68% exact match. Next best (System I): 37.22% exact, 50.20% with $\leq$3 errors.
System III object-level (recall/precision):
- Segmentation: 98.42/98.13
- Seg+class: 93.91/93.63
- Tree relations: 94.26/94.01
Authors also report Hamming distances on labeled graphs and derived error rates ($\Delta Bn$, $\Delta E$) to weight segmentation/classification/parsing contributions.
Data Availability
Train and test data for all tasks are publicly available via the IAPR TC11 site and GitHub. The paper does not specify an explicit license; the broader CROHME package (2011-2019) is distributed as CC BY-NC-SA 3.0 via TC10/11, though individual dataset bundles may vary. Evaluation scripts are included.
2016-09-crohme
ICFHR 2016 CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions
- ICFHR 2016 CROHME: Competition on Recognition of Online Handwritten Mathematical Expressions
- Evaluation tools available separately
- Participant systems not publicly released
TL;DR
CROHME 2016 evaluated online handwritten math recognition across four tasks: end-to-end formula recognition, isolated symbol classification, structure parsing from provided symbols, and matrix recognition. MyScript achieved the best overall results (67.65% fully correct on Task 1) but used additional private training data. Among systems trained only on provided data, WIRIS led on most tasks, while structure recognition remained a key bottleneck even with perfect symbol recognition.
What kind of paper is this?
Dominant: $\Psi_{\text{Evaluation}}$
Competition protocol, standardized metrics, ranking methodology, error analysis, and comparative evaluation across four tasks.
Secondary: $\Psi_{\text{Resource}}$
New test sets for Tasks 1 and 4, updated evaluation tooling (CROHMELib/LgEval), and a Wikipedia formula corpus (592,000+ samples) for language modeling.
What is the motivation?
Progress in handwritten math recognition lacked standardized benchmarks and evaluation metrics, making it difficult to track improvements across systems. CROHME 2016 aims to:
- Provide common datasets and evaluation protocols
- Separately evaluate subproblems: symbol recognition, structural parsing, and end-to-end performance
- Introduce experimental matrix recognition as a focused challenge
What is the novelty?
Task structure:
- Task 1: End-to-end formula recognition from strokes
- Task 2a/2b: Isolated symbol classification; Task 2b adds junk samples to test rejection
- Task 3: Structure recognition from provided symbols (isolates spatial parsing from symbol recognition)
- Task 4: Matrix recognition from strokes (experimental)
Resources:
- 592,000+ Wikipedia formulae (LaTeX + Presentation MathML) for language model training
- Web-based submission and evaluation system for real-time scoring
What experiments were performed?
Multi-task evaluation with standardized metrics:
- Expression-level: Recognition rates (exact match, $\leq 1$ error, $\leq 2$ errors)
- Symbol-level: Recall/precision for segmentation and classification
- Relationship-level: Recall/precision for spatial relation parsing
- Isolated symbols: Top-1 accuracy, TMP (True Mean Position)
- Rejection (Task 2b): TAR/FAR for junk sample handling
- Error analysis: Symbol confusion matrices, frequent relationship errors
What are the outcomes/limitations?
Key findings:
- End-to-end recognition remains challenging: best system achieves 67.65% fully correct on Task 1
- Structure parsing is the primary bottleneck: even with perfect symbols (Task 3), best system reaches only 90.67% structure accuracy
- Isolated symbol classification performs well (low-90s Top-1) but hits a ceiling due to ambiguity (`x` vs `X` vs $\times$, `o` vs `O` vs `0`)
- Junk sample rejection (Task 2b) introduces TAR/FAR tradeoffs and degrades performance
Limitations:
- Best overall system (MyScript) used private training data, limiting fair comparison
- Many symbol errors stem from context-independent classification
- Relationship confusions (especially Right vs. Subscript/Superscript) remain frequent
Model
Task definitions (what each system must output)
- Tasks 1/3/4: output an interpreted formula structure represented as a labeled graph (Symbol Layout Tree, SLT) over strokes/symbols/relations; scoring supports structure-only and structure+labels.
- Task 2: output ranked Top-10 class predictions for isolated symbols; Task 2b includes a reject/junk mechanism.
Formula representation: Symbol Layout Trees as label graphs
Formulae are encoded as label graphs (adjacency matrices):
- Diagonal entries label each stroke’s symbol class association.
- Intra-symbol grouping is represented by bidirectional edges among strokes in the same symbol, labeled by the symbol class.
- Spatial relations are represented by directed edges from each stroke of a parent symbol to each stroke of a child symbol, labeled by relation type.
Matrices generalize label graphs to allow sets of labels per node/edge so a stroke can belong simultaneously to symbol/cell/row/column/matrix objects; segmentation can be recovered via labeled cliques.
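A minimal sketch of the label-graph encoding above for the expression $x^2$ written with three strokes (s0 and s1 form "x", s2 forms "2"); the label strings and the set-valued adjacency matrix are illustrative, not LgEval's file format.

```python
strokes = ["s0", "s1", "s2"]          # s0, s1 form "x"; s2 forms "2"
n = len(strokes)

# Adjacency matrix of label sets: diagonal = symbol class of each stroke,
# bidirectional intra-symbol edges labeled with the symbol class,
# directed parent -> child edges labeled with the spatial relation.
graph = [[set() for _ in range(n)] for _ in range(n)]

graph[0][0].add("x")                  # stroke s0 belongs to symbol "x"
graph[1][1].add("x")                  # stroke s1 belongs to symbol "x"
graph[2][2].add("2")                  # stroke s2 belongs to symbol "2"

graph[0][1].add("x"); graph[1][0].add("x")   # s0 and s1 grouped into the same "x"

for parent in (0, 1):                 # every stroke of "x" points to every stroke of "2"
    graph[parent][2].add("Sup")       # superscript relation x -> 2

for row in graph:
    print([sorted(cell) or "-" for cell in row])
# [['x'], ['x'], ['Sup']]
# [['x'], ['x'], ['Sup']]
# ['-', '-', ['2']]
```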
Participant system summaries
MyScript (Tasks 1–4):
- Integrated segmentation/recognition/interpretation with grammar-guided spatial relations
- Features: dynamic trajectory (direction/curvature) + static bitmap
- Deep MLP + RNN classifier with statistical language model
- Used $\sim$30k additional private training samples
WIRIS (Tasks 1–4):
- Probabilistic grammar with statistical LM trained on CROHME + Wikipedia data
- Neural nets with mixed online/offline features (point sequences, HOG)
- Matrix-specific handling (dimension matching, spatial segmentation)
Tokyo Univ. of Agriculture and Technology (Tasks 1/2/3):
- Symbol classifier: CNN (offline) + LSTM (online)
- Structure: CYK parsing with stroke-order heuristics
- Updated from CROHME 2014 system
University of Nantes (Task 1):
- Converts 2D ink to 1D paths in stroke graph
- BLSTM + local CTC for path labeling
- Merges paths to label graph; no language model
University of São Paulo (Tasks 1/3):
- Two-stage: hypotheses graph generation + grammar-based parsing
RIT (Task 2):
- Direction/order-tolerant shape descriptors
- Features: crossings, 2D histograms, visual words (k-means)
Data
Dataset splits and sizes (Table I)
| Task | Training | Validation | Test |
|---|---|---|---|
| Task 1 (Formulae) | Train 2014: 8,836 expr. | Test 2014: 986 expr. | Test 2016: 1,147 expr. (new) |
| Task 2a (Valid symbols) | Train 2014: 85,802 symb. | Test 2013: 10,061 symb. | Test 2014: 10,019 symb. |
| Task 2b (Valid + Junk) | Junk: 74,284 | Junk: 9,161 | Junk: 8,416 (new seed) |
| Task 3 (Structure; new in 2016) | Train 2014: 8,836 expr. | Test 2013: 671 expr. | Test 2014: 986 expr. |
| Task 4 (Matrices; experimental) | M.Train 2014: 362 expr. | M.Test 2014: 175 expr. | M.Test 2016: 250 expr. (new) |
(Table content summarized from the paper.)
New data collection (Tasks 1 and 4)
Source: ArXiv papers (2000–2001) from KDD 2003 Cup; 1,147 expressions selected using CROHME 2014 grammar/frequency constraints.
Collection setup:
- 50 writers
- Three device types: pen-based tablet PC (12"), touch-screen (27", finger input), pen-based interactive whiteboard
Ground truth pipeline:
- Show rendered LaTeX for copying
- Store LaTeX + strokes in InkML
- Parse LaTeX to enumerate symbols/classes
- Guide automatic recognition/parsing
- Manual verification and correction
- Average: 1 minute per expression
Wikipedia formula corpus for language models
- Provided 592,000+ English Wikipedia formulae in LaTeX and Presentation MathML, sourced from NTCIR-12 MathIR data, intended for language model parameter fitting.
Data usage
- All participants used provided training data
- WIRIS: Wikipedia corpus for language model training
- MyScript: Additional $\sim$30,000 private formulae
- RIT: Synthetic data generation ($\sim 5\times$ expansion for symbol training)
Algorithms / Training
Key algorithmic patterns across participant systems:
- Grammar-based joint segmentation + parsing (MyScript, WIRIS): probabilistic scoring with language modeling
- Sequence models (Nantes): RNN/LSTM/BLSTM with CTC-style labeling for stroke/path labeling
- Graph-based approaches (São Paulo): hypothesis generation + grammar-based parsing
- Feature fusion (Tokyo): CNN (offline) + LSTM (online) ensembles for symbol classification
Note: As a competition report, detailed training recipes are system-specific and not fully disclosed.
Evaluation
Tooling
- Evaluation uses updated CROHMELib and LgEval, providing metrics at stroke/symbol/expression levels plus automated error analyses (confusion matrices/histograms for symbol/relationship subgraphs).
Metrics (by task)
Tasks 1/3/4 (formula/structure/matrix):
- Expression recognition rates for (a) structure-only and (b) structure+labels; also reported are rates allowing $\leq 1$ and $\leq 2$ label errors in the label-graph adjacency matrix.
- Recall/precision for symbol segmentation and segmentation+classification; likewise for relationship segmentation and relationship segmentation+classification.
Task 2 (isolated symbols):
- Top-1 recognition and TMP (average rank of correct class in Top-10; missing treated as rank 11).
- Task 2b adds TAR (true acceptance rate for valid) and FAR (false acceptance rate for junk).
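A minimal sketch of the TMP computation as described (average 1-based rank of the correct class in each Top-10 list, with a miss counted as rank 11); the input structures are assumptions.

```python
def true_mean_position(top10_lists, ground_truth):
    """Average rank (1-based) of the correct class in each Top-10 list; missing -> rank 11."""
    ranks = []
    for candidates, truth in zip(top10_lists, ground_truth):
        ranks.append(candidates.index(truth) + 1 if truth in candidates else 11)
    return sum(ranks) / len(ranks)

preds = [["x", "X", "\\times"], ["2", "z"], ["a", "b", "c"]]
truth = ["x", "z", "q"]                  # ranks: 1, 2, miss -> 11
print(true_mean_position(preds, truth))  # 4.666...
```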
Results: Task 2 (symbols)
| System | Task 2a Top-1 | Task 2a TMP | Task 2b Top-1 | Task 2b TMP | Task 2b TAR | Task 2b FAR |
|---|---|---|---|---|---|---|
| MyScript | 92.81 | 1.13 | 86.77 | 1.19 | 89.82 | 11.16 |
| Tokyo | 92.27 | 1.15 | – | – | – | – |
| RIT | 88.85 | 1.25 | 83.34 | 1.31 | 95.86 | 19.71 |
(Table II summarized.)
Results: Task 1 (end-to-end formulae from strokes; Test 2016)
| System | Structure rate | Structure+labels rate | $\leq 1$ err | $\leq 2$ err |
|---|---|---|---|---|
| MyScript | 88.14 | 67.65 | 75.59 | 79.86 |
| WIRIS | 74.28 | 49.61 | 60.42 | 64.69 |
| Tokyo | 61.55 | 43.94 | 50.91 | 53.70 |
| São Paulo | 57.02 | 33.39 | 43.50 | 49.17 |
| Nantes | 21.45 | 13.34 | 21.02 | 28.33 |
(Table III summarized.)
Results: Task 3 (structure from provided symbols; Test 2014)
| System | Structure rate | Structure+labels rate | $\leq 1$ err | $\leq 2$ err |
|---|---|---|---|---|
| MyScript | 90.67 | 84.38 | 85.90 | 87.62 |
| WIRIS | 86.61 | 78.80 | 80.42 | 82.75 |
| São Paulo | 69.27 | 64.81 | 67.34 | 70.69 |
| Tokyo | 70.99 | 61.46 | 63.89 | 66.84 |
(Table III summarized.)
Symbol-level vs. relationship-level performance (Task 1)
- Systems achieve competitive symbol recall but lower expression-level performance due to compounding errors across segmentation, labeling, and relation parsing
- Many systems show precision > recall, indicating a tendency to under-segment when errors occur
- Most common relationship confusions: Right-adjacency vs. Subscript/Superscript
See Table IV for complete recall/precision breakdown.
Results: Task 4 (matrices; Test 2016)
| System | Expression rate | Symbol recall | Matrix recall | Row recall | Column recall | Cell recall |
|---|---|---|---|---|---|---|
| MyScript | 68.40 | 94.86 | 97.52 | 95.61 | 90.71 | 87.49 |
| WIRIS | 56.40 | 87.03 | 85.67 | 87.16 | 82.22 | 84.68 |
(Table V summarized.)
Common error patterns
Symbol ambiguity:
- `x` vs `X` vs $\times$; `o` vs `O` vs `0`; `p` vs `P`
- Punctuation size (comma vs. dot) without context
Error distribution:
- Frequent symbols dominate error lists (`1`, `2`, ambiguous `x`/$\times$)
- See Tables VI–VII for complete symbol and bigram confusion matrices
Hardware / Production
Compute requirements and inference specifications not reported; paper focuses on datasets, evaluation protocols, and comparative results.
Data Availability & Licensing
Dataset availability
CROHME 2016 datasets (Tasks 1, 2a, 2b, 3, 4):
- Available via IAPR TC10/11 dataset package
- Includes training + test data from CROHME 2011–2019
- Ground truth provided in InkML (LaTeX string, MathML structure, SLG/OLG formats)
- More than 10,000 labeled handwritten formulae across all releases
- IAPR TC10/11 Resource
Wikipedia formula corpus:
- 592,000+ formulae (LaTeX + Presentation MathML)
- Sourced from NTCIR-12 MathIR
- Intended for language model training
Licensing
CROHME datasets (IAPR TC10/11):
- CC BY-NC-SA 3.0 (Creative Commons Attribution-NonCommercial-ShareAlike 3.0)
- Academic/research use only; no commercial use
- IAPR TC10/11 Resource, CROHME Portal
Wikipedia formula corpus:
- Source: NTCIR-12 MathIR
- NTCIR distribution: restricted use scope, prohibits redistribution (NTCIR Agreement)
- Wikipedia content: CC BY-SA 4.0 and GFDL (Wikipedia Copyrights)
- Licensing depends on distribution source
Summary: CROHME datasets use CC BY-NC-SA 3.0 for non-commercial research. Wikipedia corpus licensing varies by source.
2019-05-docbank
DocBank: A Benchmark Dataset for Document Layout Analysis
TL;DR
DocBank introduces a 500K-page benchmark for document layout analysis with fine-grained token-level annotations across 12 semantic categories, constructed via weak supervision from arXiv LaTeX sources. The authors inject semantic-specific colors into LaTeX documents, recompile them, and extract token-level labels by mapping RGB values back to structure types—enabling both sequence labeling and object detection workflows.
What kind of paper is this?
- Dominant: $\Psi_{\text{Resource}}$ — benchmark dataset with construction pipeline, splits, statistics, and baselines
- Secondary: $\Psi_{\text{Method}}$ — weak supervision construction procedure and object detection conversion
- Secondary: $\Psi_{\text{Evaluation}}$ — custom area-based metric and multimodal baseline comparisons
What is the motivation?
Document layout analysis typically emphasizes visual features while underutilizing textual content, despite text providing strong signals for semantic role classification. Existing labeled datasets are either smaller-scale, image-only, or lack token-level annotations, making it difficult to fairly compare NLP, computer vision, and multimodal approaches. High-quality manual annotation at token-level is expensive; the authors target a scalable, low-cost labeling approach using LaTeX structure.
What is the novelty?
- Weak supervision from LaTeX semantics: The authors inject structure-specific font colors into LaTeX source code for semantic units (abstract, author, caption, etc.), recompile the documents, then recover token labels by mapping extracted RGB colors to structure types.
- Token-level annotations at scale: Each token is represented as `(word, bounding box)`, enabling NLP-style sequence labeling while remaining convertible to object detection annotations.
- Conversion to object detection format: Same-label tokens are grouped into connected components using BFS with x/y proximity thresholds, then bounding boxes are computed for each component to produce region-level annotations (see the sketch below).
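A minimal sketch of that grouping step, assuming tokens as (label, (x0, y0, x1, y1)) and simple horizontal/vertical gap thresholds; the adjacency test and threshold values are assumptions, not DocBank's released conversion code.

```python
from collections import deque

def group_tokens(tokens, x_gap=10, y_gap=10):
    """Group same-label tokens into connected components (BFS) and return one box per group.

    tokens : list of (label, (x0, y0, x1, y1)).
    Two tokens are adjacent if they share a label and their boxes lie within
    x_gap / y_gap of each other.
    """
    def adjacent(a, b):
        (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) = a, b
        dx = max(bx0 - ax1, ax0 - bx1, 0)      # horizontal gap (0 if overlapping)
        dy = max(by0 - ay1, ay0 - by1, 0)      # vertical gap
        return dx <= x_gap and dy <= y_gap

    regions, seen = [], set()
    for i, (label, box) in enumerate(tokens):
        if i in seen:
            continue
        queue, component = deque([i]), []
        seen.add(i)
        while queue:                            # BFS over same-label neighbours
            j = queue.popleft()
            component.append(tokens[j][1])
            for k, (lab_k, box_k) in enumerate(tokens):
                if k not in seen and lab_k == label and adjacent(tokens[j][1], box_k):
                    seen.add(k)
                    queue.append(k)
        xs0, ys0, xs1, ys1 = zip(*component)
        regions.append((label, (min(xs0), min(ys0), max(xs1), max(ys1))))
    return regions

tokens = [("paragraph", (10, 10, 40, 20)), ("paragraph", (45, 10, 90, 20)),
          ("caption", (10, 200, 60, 210))]
print(group_tokens(tokens))
# [('paragraph', (10, 10, 90, 20)), ('caption', (10, 200, 60, 210))]
```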
What experiments were performed?
- Dataset splits: 400K training pages, 50K validation, 50K test. Statistics provided per class and per-year distribution (2014–2018).
- Text-layout sequence labeling baselines: BERT, RoBERTa, and LayoutLM (without image embeddings—only text and 2D layout embeddings).
- Image-based baseline: Faster R-CNN with Detectron2 trained on converted DocBank object detection format.
- Ensemble approach: Combined ResNeXt-101 detector outputs with LayoutLM predictions.
- Metric: Per-class Precision/Recall/F1 computed using area of ground-truth tokens covered by detected tokens, rather than BIO tagging.
What are the outcomes/limitations?
Outcomes
LayoutLM outperforms BERT and RoBERTa on most labels and on macro average. Reported macro F1 scores on the DocBank test set:
- BERT$_{\text{BASE}}$: 0.8770
- RoBERTa$_{\text{BASE}}$: 0.8891
- LayoutLM$_{\text{BASE}}$: 0.9316
- LayoutLM$_{\text{LARGE}}$: 0.9350
- ResNeXt-101 detector: 0.9051
- ResNeXt-101 + LayoutLM$_{\text{BASE}}$: 0.9478
- ResNeXt-101 + LayoutLM$_{\text{LARGE}}$: 0.9488
Limitations
- Domain restriction: Built from arXiv papers with LaTeX sources; generalization to non-LaTeX, scanned, or heavily stylized documents is not guaranteed.
- Language restriction: Focuses on English documents; expansion to other languages remains future work.
- Tokenization heuristics: Uses whitespace tokenization; bounding boxes reconstructed from character-level coordinates. Mixed-color tokens use the first character’s color, which may introduce label noise.
- Non-text elements: Encoded as special tokens using PDFMiner class names (e.g., `LTFigure`, `LTLine`), which may not capture full graphical semantics.
- Weak supervision quality: Color-based label extraction assumes clean compilation and correct LaTeX semantic markup; errors in source or rendering can propagate to annotations.
Model
Task Framing
The authors frame layout analysis as sequence labeling over a serialized 2D document: input tokens with bounding boxes, output one of 12 semantic structure labels per token.
Baselines
- BERT / RoBERTa: Text-only token sequence labeling.
- LayoutLM: Text plus 2D position embeddings from bounding boxes. Explicitly used without image embeddings in this work.
- Faster R-CNN: Object detection on document images after conversion; output boxes mapped back to token labels for unified evaluation.
Data
Source and Scale
- 500K document pages total from arXiv papers with both compiled PDFs and LaTeX source code.
- Train: 400K pages; Validation: 50K; Test: 50K.
Labels
12 semantic structure types: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title.
Statistics
- Paragraph appears on approximately 99.5% of pages across all splits.
- Equations and sections are common; tables, titles, and authors are relatively sparse.
- Year distribution (2014–2018) is preserved rather than balanced, reflecting natural arXiv submission patterns.
Availability
The dataset is publicly available:
- DocBank_500K_txt.zip (~2.95GB)
- DocBank_500K_ori_img.zip (~47.4GB, split into 10 parts)
- MSCOCO_Format_Annotation.zip (~199MB) for object detection workflows
Available via GitHub (doc-analysis/DocBank) and HuggingFace.
License
- Dataset/repository: Apache 2.0 (allows commercial use with attribution)
- Original paper: Creative Commons Attribution 4.0 (CC BY 4.0)
Note: Since DocBank is derived from arXiv papers, users should verify downstream rights for redistribution use cases. The COCO-format JSON files have had reported license field inconsistencies (see Issue #55 on GitHub).
Algorithms / Training
Weak Supervision Annotation Pipeline
- Document acquisition: Collect arXiv sources and compiled PDFs.
- Semantic structure detection via LaTeX edits: Inject `\color{fontcolor}{...}` with distinct colors per semantic unit; recompile to produce structure-colored pages.
- Token annotation:
- Extract text lines and non-text elements with bounding boxes using PDFPlumber (built on PDFMiner).
- Tokenize text lines by whitespace; compute token bounding boxes from character coordinate extremes.
- Wrap non-text elements as special tokens using `##...##` notation (e.g., `##LTFigure##`, `##LTLine##`).
- Assign labels by extracting RGB values and mapping color to structure type; for mixed-color tokens, use the first character's color.
Reading Order Serialization
- Sort text boxes and non-text elements top-to-bottom by top border.
- Within boxes, lines are already top-to-bottom; tokenize left-to-right.
- Apply the same procedure to multi-column pages.
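A minimal sketch of this serialization rule; the box/line data structure is an assumption made to illustrate the top-to-bottom (by top border), then left-to-right ordering.

```python
def serialize(text_boxes):
    """Serialize a page: boxes top-to-bottom by top border, words left-to-right per line.

    text_boxes : list of dicts {"top": float, "lines": [[(x, word), ...], ...]},
                 with lines already ordered top-to-bottom inside each box.
    """
    tokens = []
    for box in sorted(text_boxes, key=lambda b: b["top"]):    # top-to-bottom by top border
        for line in box["lines"]:
            for _, word in sorted(line, key=lambda t: t[0]):  # left-to-right within a line
                tokens.append(word)
    return tokens

page = [
    {"top": 120.0, "lines": [[(50, "Second"), (110, "block")]]},
    {"top": 30.0,  "lines": [[(50, "Title")], [(50, "first"), (90, "line")]]},
]
print(serialize(page))  # ['Title', 'first', 'line', 'Second', 'block']
```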
Fine-Tuning Setup
- Optimizer: AdamW
- Initial learning rate: $5 \times 10^{-5}$
- Max block size: 512 tokens
- Hardware: 8 V100 GPUs, batch size 10 per GPU
- Training time: approximately 5 hours per epoch on 400K training pages
Object Detection Training
Faster R-CNN trained using Detectron2 with ResNeXt backbone pre-trained on ImageNet.
Evaluation
Metric
For each semantic class, the authors compute:
$$\text{Precision} = \frac{\text{area of GT tokens inside detected tokens}}{\text{area of all detected tokens}}$$
$$\text{Recall} = \frac{\text{area of GT tokens inside detected tokens}}{\text{area of all GT tokens}}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
This area-based metric differs from standard BIO tagging and accounts for spatial overlap of token bounding boxes.
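A minimal sketch of the area-based metric for one class, interpreting "GT tokens inside detected tokens" as ground-truth token boxes fully contained in some detected token box; that containment rule is an interpretation, not the authors' released evaluation script.

```python
def area(box):
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)

def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def area_prf(gt_boxes, det_boxes):
    """Area-based precision/recall/F1 for one class (containment interpretation)."""
    covered = sum(area(g) for g in gt_boxes if any(contains(d, g) for d in det_boxes))
    det_area = sum(area(d) for d in det_boxes)
    gt_area = sum(area(g) for g in gt_boxes)
    precision = covered / det_area if det_area else 0.0
    recall = covered / gt_area if gt_area else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gt = [(0, 0, 10, 10), (20, 0, 30, 10)]      # two ground-truth tokens, area 100 each
det = [(0, 0, 10, 10), (40, 0, 55, 10)]     # one correct token, one spurious (area 150)
print(area_prf(gt, det))                    # (0.4, 0.5, ~0.444)
```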
Key Results
LayoutLM substantially improves over text-only baselines on macro average. The detector-only approach is competitive but falls below LayoutLM. Ensemble combinations of ResNeXt-101 detector with LayoutLM predictions achieve the best results, reaching 0.9488 macro F1.
2019-09-crohme-tfd
ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection
- ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection (cs.rit.edu)
- Online submission system available; IoU implementation
- Participant systems not publicly released
TL;DR
CROHME 2019 introduces a symbol-level label graph (symLG) representation allowing systems that output LaTeX (common in encoder-decoder approaches) to be evaluated against prior CROHME graph-based formats. Best results improved versus CROHME 2016 (online expression rate up to 80.73%), though handwritten math recognition remains challenging with 101 symbol classes and complex 2D structure.
What kind of paper is this?
- Dominant: $\Psi_{\text{Evaluation}}$ — Competition protocol with new symLG representation for metrics, ranking methodology, and comparative evaluation across three tasks.
- Secondary: $\Psi_{\text{Resource}}$ — Expanded training data (added 2012/2013 test sets), new 2019 test set, TFD dataset (36 train PDFs, 10 test PDFs), and updated evaluation tooling.
What is the motivation?
The shift toward encoder-decoder systems that emit LaTeX without stroke segmentation created an evaluation mismatch with CROHME’s historical stroke-level label graph format. CROHME 2019 addresses this by evaluating symbolic structure using a new symLG representation, enabling fair comparison between LaTeX-output systems and graph-output systems.
What is the novelty?
symLG representation: Converts both stroke-level label graphs and LaTeX outputs into a symbol-level graph/tree format compatible with existing CROHME evaluation tools (LgEval, CROHMELib). Nodes are identified by the sequence of relation labels from root (e.g., “oRRSup” for Right, Right, Superscript). Similarity computed using adjacency matrix with symbol labels on diagonal and spatial parent-child relations off-diagonal.
Three tasks:
- Task 1 (online): strokes → SLT; ranked by expression recognition rate. Subtasks: 1a isolated symbols (+ junk), 1b parsing from provided symbols.
- Task 2 (offline): rendered grayscale images → SLT; ranked by expression rate. Subtasks: 2a isolated symbols (+ junk), 2b parsing from provided symbols.
- Task 3 (TFD): detect formula bounding boxes on document pages (given character boxes); ranked by F-measure after one-to-one matching with IoU ≥ 0.75.
Dataset refresh: Expanded training by adding prior CROHME test sets (2012, 2013) and introduced new 2019 handwritten test set from 80 writers and multiple devices.
What experiments were performed?
Evaluation across three tasks with standardized metrics:
Datasets: Training = Train 2014 + Test 2013 + Test 2012 (9993 expr); validation = Test 2014 (986 expr); test = Test 2019 (1199 expr). TFD: 36 train PDFs (569 pages, 26,395 regions), 10 test PDFs (236 pages, 11,885 regions).
Metrics:
- Handwritten tasks: expression rate, symbol recognition rate, relationship metrics via LgEval/CROHMELib
- TFD: F-score after one-to-one matching with IoU ≥ 0.75 (also reports IoU ≥ 0.5)
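A minimal sketch of detection F-scoring with one-to-one matching at IoU ≥ 0.75, using a greedy highest-IoU-first assignment; the competition used the Padilla implementation, so the greedy strategy here is an assumption.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detection_f_score(gt_boxes, det_boxes, threshold=0.75):
    """F-score after greedy one-to-one matching of detections to ground truth at IoU >= threshold."""
    pairs = sorted(((iou(g, d), gi, di) for gi, g in enumerate(gt_boxes)
                    for di, d in enumerate(det_boxes)), reverse=True)
    matched_gt, matched_det, tp = set(), set(), 0
    for score, gi, di in pairs:
        if score < threshold:
            break
        if gi not in matched_gt and di not in matched_det:
            matched_gt.add(gi); matched_det.add(di); tp += 1
    precision = tp / len(det_boxes) if det_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gt = [(0, 0, 100, 20), (0, 40, 60, 60)]
det = [(2, 1, 101, 21), (200, 200, 250, 220)]     # one good match, one false positive
print(round(detection_f_score(gt, det), 3))        # 0.5
```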
Selected participant approaches:
- USTC-iFLYTEK (Tasks 1, 2): Attention-based encoder-decoder; RNN encoder (online), CNN encoder (offline); external RNN language model from NTCIR-12 MathIR.
- Samsung R&D Team 2 (TFD, winner): Graph-theoretic methods for multi-character formulas + statistical/context recognition for single-character math.
- RIT Teams (TFD): Modified YOLOv3 and SSD512 with sliding windows and voting-based pooling.
- PAL-v2 (offline): Heavy augmentation (330k images), Paired Adversarial Learning, ensemble of 6 models.
What are the outcomes/limitations?
Outcomes: Best expression rate improved versus CROHME 2016 (80.73% vs 67.65% on online task). USTC-iFLYTEK achieved 80.73% (Task 1 online) and 77.15% (Task 2 offline). Samsung R&D-2 achieved 93.45% F1 on TFD (IoU ≥ 0.75).
Limitations:
- symLG tradeoffs: Stroke segmentation performance cannot be computed; systems can achieve correct SLT without correctly segmenting symbols. Symbols identified by relationship paths means structural shifts manifest as missing symbols (“ABSENT”), potentially underestimating symbol recall.
- TFD metric sensitivity: Large performance gap between winner and others. Using IoU ≥ 0.5 substantially raises F-scores for non-winning systems.
- Low participation: No teams participated in symbol recognition subtasks (1a, 2a).
Model
Competition report; architectures described at high level. Example: USTC-iFLYTEK uses attention-based encoder-decoder with RNN encoder (online) and CNN encoder (offline) plus RNN language model trained from NTCIR-12 MathIR text.
Data
CROHME 2019 splits (Table I)
- Formulae (Tasks 1, 2): training = Train 2014 + Test 2013 + Test 2012 (9993 expr); validation = Test 2014 (986 expr); test = Test 2019 (1199 expr).
- Symbols (Tasks 1a, 2a): train 180,440 symbols+junks; val 18,435; test 15,483 (Test 2016).
- Structure (Tasks 1b, 2b): train 9993 expr; val 986; test 1147 expr (Test 2016).
TFD dataset (Task 3)
- Train: 36 rendered PDFs at 600 dpi (569 pages), 26,395 formula regions
- Test: 10 PDFs (236 pages), 11,885 regions
- Character boxes always provided; labels only in train
Input specifications
- Task 2 images: rendered 1000 × 1000 with 5 px padding
- Isolated symbols: 28 × 28 with 5 px padding
- 2019 handwritten test: 1200 expressions, 80 writers, 3 device types; sourced from arXiv 2002-2003 documents (KDD 2003 Cup)
Evaluation
Tooling
Online submission system (Django) with real-time leaderboard. LgEval/CROHMELib updated to support LaTeX-to-symLG and stroke-LG-to-symLG conversions. TFD uses one-to-one IoU matching (Padilla implementation).
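A minimal sketch of the one-to-one IoU matching behind the TFD F-score; greedy matching is used here for brevity (the competition uses the Padilla object-detection evaluation code), and the box format is assumed to be corner coordinates.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_fscore(pred_boxes, gt_boxes, thr=0.75):
    """Greedy one-to-one matching at IoU >= thr, then precision/recall/F1."""
    unmatched_gt, tp = list(gt_boxes), 0
    for p in pred_boxes:
        best = max(unmatched_gt, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched_gt.remove(best)
            tp += 1
    prec = tp / len(pred_boxes) if pred_boxes else 0.0
    rec = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```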
Results
Handwritten formula recognition (2019 test):
| Task | System | Expression Rate |
|---|---|---|
| Task 1 (strokes) | USTC-iFLYTEK | 80.73% |
| Task 2 (images) | USTC-iFLYTEK | 77.15% |
TFD (IoU ≥ 0.75):
| System | F1 | Recall | Precision |
|---|---|---|---|
| Samsung R&D-2 | 93.45 | 92.73 | 94.17 |
| RIT 2 | 68.29 | — | — |
| RIT 1 | 60.58 | — | — |
Key findings:
- Most common error: missing symbols, attributed to symLG’s “absolute path” identification
- Structural shifts in symLG can underestimate symbol recall
- TFD gap partly due to better use of character locations (ignored by RIT teams)
2021-08-layoutreader
LayoutReader — Notes
TL;DR
ReadingBank, a 500,000-page dataset for reading order detection, leverages DocX XML metadata to automatically extract reading sequences and aligns them with word-level bounding boxes via a color-based disambiguation scheme. LayoutReader, a seq2seq model built on LayoutLM, predicts reading order by generating indices into the source token list, achieving 0.9819 page-level BLEU and 1.75 ARD on ReadingBank. Layout features alone outperform text-only models by a wide margin, and the approach can reorder OCR text lines via token-to-line assignment.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
The primary contribution is a large-scale, automatically constructed benchmark dataset (ReadingBank) with detailed collection and alignment pipeline. The dataset enables reading order detection at a scale previously unavailable for supervised learning.
Secondary: $\Psi_{\text{Method}}$, $\Psi_{\text{Impact}}$
The paper introduces LayoutReader, a seq2seq permutation model, and demonstrates practical adaptation to improve OCR line ordering.
What is the motivation?
Reading order detection is a prerequisite for document understanding; naive OCR ordering (top-to-bottom, left-to-right) fails on multi-column layouts, forms, and invoices, breaking downstream information extraction. Deep models were historically underused because large-scale reading-order annotation is expensive. The paper exploits the fact that DocX files embed reading order in their XML metadata, enabling automated supervision at scale.
What is the novelty?
Dataset construction: Reading order labels come from DocX XML; word bounding boxes come from converting the document to a fixed-layout format and parsing it. Duplicates are resolved by a word-appearance index to RGB color mapping so each (word, occurrence) is uniquely matchable.
Modeling approach: LayoutReader is a seq2seq permutation model that encodes tokens with LayoutLM and decodes by predicting indices into the source sequence. The decoder vocabulary is constrained to source positions, enabling direct generation of permutation sequences.
Practical adaptation: Converts token-level order into text-line order to improve OCR line ordering via token-to-line assignment by maximum spatial overlap.
What experiments were performed?
Reading order detection on ReadingBank: compares heuristic sort (left-to-right, top-to-bottom), LayoutReader variants using text only (BERT / UniLM), layout only (LayoutLM with token embeddings removed), and full LayoutReader (text + layout).
Input order robustness study: trains with varying proportions of token-shuffled samples ($r \in \{0\%, 50\%, 100\%\}$) and evaluates under both heuristic-ordered and fully shuffled inputs.
OCR adaptation experiments: evaluates line ordering improvements on Tesseract and a commercial OCR API using the adaptation procedure.
What are the outcomes/limitations?
Key outcomes
Reading order detection (test set):
- Heuristic: BLEU 0.6972, ARD 8.46
- LayoutReader (layout only): BLEU 0.9732, ARD 2.31
- LayoutReader (full): BLEU 0.9819, ARD 1.75
Modality finding: Layout contributes more than text for this task. Layout-only beats text-only by a wide margin in both BLEU and ARD, suggesting spatial structure is the dominant signal for reading order prediction.
OCR line ordering adaptation: Improves line-order BLEU/ARD versus the OCR engine’s native ordering (e.g., Tesseract baseline vs LayoutReader-adapted).
Limitations
Data is English-only (filtered via a language detection API) and restricted to pages with more than 50 words. Ground-truth reading order is defined by DocX structure traversal (paragraphs/tables), which may not always match human reading order for every rendered layout. Dataset access is controlled: the authors describe manual checking/redaction for a small public subset and permission requirements for full access. The repo includes conflicting license statements: Apache 2.0 claim alongside “research purpose” and “DO NOT re-distribute” restrictions.
Model
Problem formulation
Input: document tokens $\{t_i\}$ where each token includes the word $w_i$ and bounding box $(x^0_i, y^0_i, x^1_i, y^1_i)$. Goal: output a permutation representing the natural reading sequence.
Architecture overview
Encoder: LayoutLM-based encoder with token, 1D position, segment, and 2D layout embeddings.
Seq2seq packing + attention mask: Source and target segments are packed into one sequence and controlled by a self-attention mask $M$. The mask is defined as:
$$M_{i,j} = \begin{cases} 1 & \text{if } i, j \in \text{src, or } j \le i \\ 0 & \text{otherwise} \end{cases}$$
where $M_{i,j} = 1$ means position $i$ may attend to position $j$: source tokens attend to all source tokens, and target tokens attend to the source plus earlier (and their own) target positions.
Decoder / generation step: Prediction candidates are constrained to source indices. Probability uses dot products between hidden state $h_k$ and the source embedding $e_i$:
$$P(x_k = i \mid x_{<k}) \propto \exp(e_i^\top h_k + b_k)$$
where $i$ ranges over indices in the source segment.
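A minimal sketch of this index-constrained (pointer-style) decoding step; array names and the greedy selection are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def index_distribution(h_k, src_embeddings, b_k=0.0):
    """Distribution over source positions at decoding step k.

    h_k: (d,) decoder hidden state; src_embeddings: (N, d) embeddings e_i of the
    N source tokens. The softmax runs over source indices only, so the decoder
    can only "point" at positions that exist in the input sequence.
    """
    scores = src_embeddings @ h_k + b_k   # e_i^T h_k + b_k
    scores -= scores.max()                # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(0)
h_k = rng.normal(size=64)
e = rng.normal(size=(10, 64))                            # 10 source tokens
next_index = int(np.argmax(index_distribution(h_k, e)))  # greedy choice of next reading-order index
```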
Comparative variants
Text-only: Replace LayoutLM with textual LMs (BERT or UniLM).
Layout-only: Remove token embeddings in LayoutLM so only 1D/2D positional layout remains.
Data
ReadingBank scale and splits
Total: 500,000 document pages
- Train: 400,000
- Validation: 50,000
- Test: 50,000 (ratio 8:1:1)
Collection filters
Crawled DocX documents from the web, respecting robots exclusion and public-domain licensing considerations. Filtered to English via a language detection API and kept pages with more than 50 words. Collected 210,000 English DocX documents and sampled 500,000 pages for the dataset.
Reading sequence extraction
Uses python-docx to parse DocX XML and extract word sequences by traversing paragraphs and tables in order, then line by line for paragraphs and cell by cell for tables.
Layout alignment via coloring scheme
Handles duplicate words by assigning each word an appearance index (e.g., the second “the” gets index 1). Maps appearance index $i$ to RGB color $C(i)$ using bitwise operations:
$$r = i \mathbin{\&} 0x110000, \quad g = i \mathbin{\&} 0x001100, \quad b = i \mathbin{\&} 0x000011$$ $$C(i) = (R: r, G: g, B: b)$$
Conversion + parsing: converts colored documents using PDF Metamorphosis .Net and parses with MuPDF to extract word text, bounding boxes, and word color. Color recovers $i$ and enables a 1:1 match between reading-sequence tokens and layout boxes. Stores page width/height $(W, H)$ along with each word box.
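The key property of the coloring scheme is invertibility: the color parsed from the rendered word recovers the appearance index, giving the 1:1 match between reading-sequence tokens and layout boxes. A sketch of that round trip, assuming a simple byte-wise packing (the paper's exact bit masks may differ):

```python
def index_to_rgb(i):
    """Pack an appearance index into an (R, G, B) triple (byte-wise assumption)."""
    return ((i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF)

def rgb_to_index(rgb):
    """Recover the appearance index from a rendered word's color."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b

assert rgb_to_index(index_to_rgb(123456)) == 123456
```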
Dataset statistics
Average words per page: approximately 196 across splits. “Difficulty” measured via BLEU of heuristic order versus ground truth (average BLEU 0.6974).
Algorithms / Training
Implementation built on HuggingFace Transformers and s2s-ft from the UniLM repository.
Training setup:
- 4 $\times$ Tesla V100
- Batch size: 4 per GPU
- 3 epochs, approximately 6 hours
- AdamW optimizer
- Learning rate: $7 \times 10^{-5}$
- Warmup steps: 500
Evaluation
Metrics
Average page-level BLEU: BLEU computed per page (micro-average precision of n-gram overlap within a page), averaged across pages.
Average Relative Distance (ARD): Measures relative displacement of common elements between reference sequence $A$ and generated sequence $B$, with explicit penalty for omissions. ARD is defined via $s(e_k, B)$ and averaged over $A$.
Reading order detection results
| Model | BLEU | ARD |
|---|---|---|
| Heuristic | 0.6972 | 8.46 |
| LayoutReader (layout only) | 0.9732 | 2.31 |
| LayoutReader (full) | 0.9819 | 1.75 |
Input order robustness
Training with higher shuffle proportions improves robustness when evaluation inputs are shuffled. Models trained with $r = 0\%$ show a large drop when evaluated on fully shuffled inputs, attributed to overfitting the heuristic input order.
OCR line-order adaptation
Token-to-line assignment: Assign each token box $b$ to the text line box $B$ with maximum overlap. Line ranking uses the minimum token index within each line.
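A sketch of the adaptation under simple assumptions (axis-aligned boxes, overlap measured by intersection area):

```python
def box_intersection(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def reorder_lines(line_boxes, token_boxes, token_order):
    """Rank OCR text lines by the earliest predicted reading-order index they contain.

    line_boxes: bounding boxes of OCR text lines
    token_boxes: boxes of the tokens LayoutReader was run on
    token_order: predicted reading-order index for each token
    """
    line_rank = [float("inf")] * len(line_boxes)
    for t_box, order in zip(token_boxes, token_order):
        # assign the token to the line with maximum spatial overlap
        line = max(range(len(line_boxes)),
                   key=lambda i: box_intersection(t_box, line_boxes[i]))
        line_rank[line] = min(line_rank[line], order)
    # lines that received no tokens keep rank inf and sort last
    return sorted(range(len(line_boxes)), key=lambda i: line_rank[i])
```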
Results show improvements on Tesseract and a commercial OCR API when reordering lines via LayoutReader outputs.
Hardware / Production
Training compute: 4 $\times$ V100, 3 epochs, approximately 6 hours, batch size 4/GPU. No serving/latency/throughput numbers reported beyond training time.
2023-08-nougat
Nougat — Notes
TL;DR
Nougat is an OCR-free, encoder–decoder transformer that converts rasterized pages of scientific PDFs (including scanned pages) into a lightweight markup language that preserves mathematical expressions and tables. It couples a Swin Transformer visual encoder with an mBART-style decoder, trains on a large automatically-aligned arXiv/PMC/IDL corpus of $\approx 8.2$M pages, and substantially outperforms a strong GROBID + LaTeX-OCR baseline on text, math, and tables.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (end-to-end OCR-free visual encoder–decoder for document $\rightarrow$ markup conversion and a repetition-robust decoding scheme).
Secondary: $\Psi_{\text{Resource}}$ (releases code, trained models, and a dataset generation pipeline pairing PDFs with source markup); $\Psi_{\text{Evaluation}}$ (defines a modality-aware evaluation setup for plain text, math, and tables with multiple MT-style metrics).
What is the motivation?
- Scientific knowledge is mostly stored as PDFs; embedded text often exists but:
- Math and tables lose semantic structure.
- Many documents (books, scans) have no embedded text at all.
- Traditional OCR (e.g., Tesseract) works line-by-line and cannot capture 2D structure needed for math layout (superscripts, fractions, matrices).
- Existing scholarly pipelines (e.g., GROBID $\rightarrow$ S2ORC) capture body text but drop or flatten equations and tables.
- Goal: build a single model that:
- Works directly from page images (so it supports scanned PDFs).
- Outputs a structured markup with math and tables.
- Is trainable without manual page-level annotations, using PDF+source pairs.
What is the novelty?
- Architecture: Adapts Donut’s OCR-free encoder–decoder design to scientific PDFs, using a Swin Transformer encoder and mBART decoder specialized to scientific tokenization (architecture diagram and high-level flow in Figure 1, p.2).
- Data pipeline for paired PDF page $\rightarrow$ markup:
- Converts LaTeX $\rightarrow$ HTML via LaTeXML, then HTML $\rightarrow$ custom lightweight markdown that preserves math & tables (Figure 3, p.4).
- Automatically aligns PDF page text with source paragraphs using TF-IDF + linear SVM page prediction, decision-tree-style splits, and fuzzy matching with a quality threshold; produces $\approx 7.5$M arXiv pages plus PMC/IDL pages.
- Augmentation for scanned-like robustness: Heavy image augmentation (bitmap, erosion/dilation, affine transformations, grid/elastic distortion, brightness/contrast changes, compression, noise, blur), visualized in Figure 2 (p.3).
- Repetition-robust training and decoding:
- Introduces anti-repetition token perturbation during training.
- Adds a logit-variance–based heuristic to detect when decoding has collapsed into a repetition loop and terminate early (Figure 6, p.8).
- Modality-aware evaluation: Separately evaluate “All text”, “Plain text”, “Math”, and “Tables” for both baselines and Nougat, surfacing where the model actually struggles.
What experiments were performed?
- Training corpus:
- arXiv: 7,511,745 pages.
- PubMed Central (PMC): 536,319 pages (XML-based).
- Industry Documents Library (IDL): 446,777 pages with high-quality OCR (plain text only).
- Total $\approx 8.2$M pages (Table A.1, p.13).
- Baselines:
- Embedded PDF text (extracted text layer from digital PDFs).
- GROBID $\rightarrow$ XML, with formulas converted back from Unicode to LaTeX; small inline formulas sometimes mis-tagged as text.
- GROBID + LaTeX-OCR (pix2tex) on formula bounding boxes to get LaTeX math.
- Models:
- Nougat small: 250M parameters, max sequence length 3584, 4-layer decoder (pretrained base model).
- Nougat base: 350M parameters, 10-layer decoder, max sequence length 4096.
- Metrics: Normalized character-level edit distance (Levenshtein), BLEU, METEOR, Precision, Recall, F1 on tokens.
- Qualitative evaluation: Example pages with dense math (Figure 5, p.6) and scanned books/theses (Figures B.1–B.3, pp.14–16) to show performance on non-digital PDFs and mobile-camera scans; additional pages with tables and quantitative results (Figure B.4, p.17).
What are the outcomes/limitations?
Outcomes
On the arXiv test set, Nougat base improves strongly over both PDF text and GROBID baselines:
- All text modality:
- PDF: edit distance 0.255, BLEU 65.8.
- GROBID: edit distance 0.312, BLEU 55.6.
- Nougat small: edit distance 0.073, BLEU 88.9, F1 92.9.
- Nougat base: edit distance 0.071, BLEU 89.1, F1 93.1.
- Plain text: Nougat base achieves edit distance 0.058, BLEU 91.2, METEOR 94.6, F1 95.7.
- Math: lower numbers, as expected; Nougat base still reaches F1 $\approx 76.5$ with BLEU 56.9 vs GROBID+LaTeX-OCR BLEU 0.3 and F1 9.7.
- Tables: Nougat base edit distance 0.211, BLEU 69.7, F1 78.0.
Qualitatively:
- For dense math pages, the model outputs LaTeX that renders visually close to the original (Figure 5, p.6); bounding boxes and decorative elements are skipped.
- For old scanned textbooks and NASA reports, output is noisy but legible and structurally sensible (Figures B.1–B.2).
- For modern mobile-camera scans of theses, the model handles skew, lighting artifacts, and page curvature reasonably well (Figure B.3).
Limitations
- Repetition / collapse:
- About 1.5% of test pages fall into repetition loops under greedy decoding; frequency increases on out-of-domain documents.
- Heuristic detection helps but does not eliminate the issue; the authors flag this as the main challenge for future work.
- Page-local context only:
- Model processes one page at a time with no document-level context, causing:
- Inconsistent bibliography styles and numbering.
- Section numbers that skip or hallucinate.
- Language coverage:
- Training data is almost entirely English.
- Latin-based languages work but special characters are mapped to nearest Latin equivalents.
- Non-Latin scripts lead to immediate repetitions or failure.
- Data quality:
- Ground truth markup includes artifacts from LaTeXML and splitting heuristics (extra numbering, missing figures/tables, truncated text).
- Authors argue that large corpus size compensates, but no ablation on data quality is provided.
- Throughput:
- On an NVIDIA A10G (24GB), they can process 6 pages in parallel; with average $\approx 1400$ tokens per page, mean generation time is 19.5s per batch (no inference optimizations).
- This is much slower than traditional pipelines (GROBID $\approx 10.6$ PDFs/s) but works for scanned docs and preserves math.
Model
Architecture
- Encoder–decoder transformer following Donut (Figure 1, p.2).
- Input: rasterized page image $x \in \mathbb{R}^{3 \times H_0 \times W_0}$ at 96 DPI.
- Preprocessing:
- Crop page margins.
- Resize to fixed canvas (H, W) = (896, 672).
- If smaller than canvas, pad to fixed size.
- Visual encoder:
- Swin Transformer base model, initialized from image pretraining.
- Splits image into non-overlapping windows; hierarchical self-attention across windows.
- Outputs sequence of patch embeddings $z \in \mathbb{R}^{d \times N}$, where $N$ is number of patches.
- Text decoder:
- Transformer decoder with cross-attention over encoder outputs; implementation based on mBART.
- Uses tokenizer from Galactica (Taylor et al.) tuned for scientific text (equations, citations).
- Autoregressive generation of markup tokens; projection to vocabulary logits $\ell \in \mathbb{R}^v$.
- Sequence lengths & sizes:
- Base model: max sequence length 4096 tokens, 10 decoder layers, total 350M parameters.
- Small model: max sequence length 3584, 4 decoder layers, total 250M parameters (starting from pretrained base).
- Output format:
- Lightweight markdown-like markup that supports:
- Headings.
- Inline & display LaTeX math.
- LaTeX tables.
- Bold/italic.
- Algorithms and references (citations as numeric markers).
Repetition detection (inference-time logic)
- Let $\ell_i$ be the max logit over the vocabulary for token $i$.
- Compute the sliding-window variance of the logits with window size $B = 15$: $\mathrm{VarWin}_B[\ell](x)$.
- Then compute the variance of this signal from position $x$ to the end of the sequence: $\mathrm{VarEnd}_B[\ell](x)$.
- If $\mathrm{VarEnd}_B[\ell](x)$ drops below the threshold (6.75) and stays low for the rest of the sequence, classify the output as collapsed repetition (Figure 6).
- During incremental decoding:
- Use only last 200 tokens and half the threshold for early detection.
- After generation completes, re-run on full sequence to confirm / re-classify.
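A sketch of the sliding-window variance signal described above; the window size and threshold come from these notes, and edge handling is simplified relative to the paper.

```python
import numpy as np

def detect_repetition(max_logits, window=15, threshold=6.75):
    """Return the position where decoding likely collapsed into repetition, else None.

    max_logits: per-step maximum logit l_i over the vocabulary. Once the model
    starts looping, the windowed variance of l goes flat; this flags the first
    position from which the tail-variance signal stays below `threshold`.
    """
    l = np.asarray(max_logits, dtype=float)
    if len(l) < 4 * window:
        return None
    # variance of the logits inside each sliding window of size B
    var_win = np.array([l[x:x + window].var() for x in range(len(l) - window)])
    # variance of that signal from position x to the end (trailing window dropped)
    var_end = np.array([var_win[x:].var() for x in range(len(var_win) - window)])
    low = var_end < threshold
    if not low[-1]:
        return None            # signal recovers before the end: no collapse
    if low.all():
        return 0               # flat from the start
    # first index after which the signal never rises above the threshold again
    return int(len(low) - np.argmin(low[::-1]))
```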
Data
Sources and composition
- arXiv:
- 1,748,201 articles with LaTeX sources and compiled PDFs.
- After processing and page-level alignment, yields 7.51M pages (main training/eval set).
- PubMed Central (PMC):
- Open-access non-commercial subset; XML + PDF.
- Parsed into the same markup format as arXiv; used mainly for pretraining due to noisier math/tables (often embedded as images).
- Industry Documents Library (IDL):
- Public health–related industry documents collected by UCSF.
- Use OCR-IDL annotations (text only, no formatting) as an additional pretraining corpus to teach basic OCR on scanned docs.
LaTeX to markup pipeline (arXiv)
(Figure 3, p.4 shows an example of LaTeX $\rightarrow$ HTML $\rightarrow$ markdown $\rightarrow$ PDF.)
- LaTeXML:
- Convert LaTeX sources to HTML5.
- Normalize: expand macros, standardize whitespace, add optional brackets, normalize tables, canonicalize references/citations.
- HTML to custom markdown:
- Parse HTML into lightweight markup supporting:
- Sections/headings.
- Inline/display LaTeX math.
- LaTeX tables.
- Algorithms and citations.
- Remove ambiguity in math where possible, but some variability remains (e.g., \frac vs \over, different bold commands).
Page splitting and alignment
- Figure/table handling:
- Use pdffigures2 to detect and temporarily remove figures/tables and captions from PDFs.
- Match captions back to source via Levenshtein distance on captions; re-insert removed elements at the end of each page after splitting.
- Bag-of-words page index prediction:
- Extract text lines from PDFs with MuPDF; strip headers/footers/page numbers.
- Compute TF-IDF features; train a linear SVM to classify lines by page index.
- Split LaTeX source into paragraphs; predict page number for each paragraph.
- Boundary optimization:
- Predicted page indices ideally form a staircase; noise introduces mismatches.
- Use a decision-tree-like search over paragraph indices, minimizing a Gini-style impurity measure to choose splitting boundaries (visualized in Figure 4, p.5).
- Fuzzy alignment check:
- Around each predicted split, compare source text with:
- Last sentences of previous PDF page.
- First sentences of next PDF page.
- Use fuzzy string matching (normalized Levenshtein) to score candidate split points.
- Keep pages whose average alignment score $\ge 0.9$ at both boundaries.
- This yields an acceptance rate of about 47% of all pages.
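A sketch of the boundary check above; `SequenceMatcher` stands in for the normalized Levenshtein score, and the argument names are assumptions.

```python
from difflib import SequenceMatcher

def split_alignment(src_before, src_after, prev_page_tail, next_page_head):
    """Fuzzy alignment scores for one candidate page split in the LaTeX source.

    src_before / src_after: source text just before / after the candidate split.
    prev_page_tail / next_page_head: last / first sentences of the adjacent PDF pages.
    """
    s_prev = SequenceMatcher(None, src_before, prev_page_tail).ratio()
    s_next = SequenceMatcher(None, src_after, next_page_head).ratio()
    # per the notes above, a page is kept only if the score averages >= 0.9 at both boundaries
    return s_prev, s_next
```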
Ground truth artifacts
- LaTeXML may:
- Number subsections in the markup even when PDF shows unnumbered headings.
- Drop figures/tables or represent equations as images.
- Page splitting sometimes:
- Includes text from previous page.
- Cuts off words or misses “invisible” formatting tokens (bold, italics, section markers).
- PMC inline math often appears as Unicode or italic text; display equations/tables frequently come as images and are ignored.
Algorithms / Training
Image augmentation
To simulate scanned/low-quality docs, apply a fixed-probability mixture of augmentations per page (Figure 2, p.3):
- Bitmap conversion.
- Erosion / dilation.
- Affine transforms: shift, scale, rotate.
- Grid distortion, elastic transform.
- Random brightness/contrast changes.
- Image compression artifacts.
- Gaussian noise, Gaussian blur.
Implemented using Albumentations library.
Anti-repetition token perturbation
During training, to make the decoder more robust to previous token errors:
- For each training example, sample a random token and replace it with another random token.
- Repeat replacement while random samples fall below a probability threshold (10%), creating a small number of corrupted tokens.
Authors report:
- No performance degradation on in-domain data.
- $\approx 32\%$ reduction in failed page conversions due to repetition on out-of-domain documents.
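A sketch of the perturbation as described above; the interpretation of the 10% rule (keep corrupting while fresh uniform draws fall below 0.1) is an assumption.

```python
import random

def perturb_tokens(token_ids, vocab_size, p=0.10, rng=random):
    """Randomly corrupt a small number of target tokens in one training example.

    One token is always replaced; further replacements continue while a fresh
    uniform draw falls below p, so roughly a geometric number of tokens are
    corrupted per example.
    """
    ids = list(token_ids)
    while True:
        pos = rng.randrange(len(ids))
        ids[pos] = rng.randrange(vocab_size)
        if rng.random() >= p:
            break
    return ids
```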
Optimization
- Optimizer: AdamW.
- Training schedule:
- Train for 3 epochs with effective batch size 192 pages.
- Initial learning rate $5 \times 10^{-5}$.
- Every 15 updates, multiply LR by 0.9996 until reaching final LR $7.5 \times 10^{-6}$.
- Authors mention training instabilities as motivation for the relatively low LR.
- Decoding:
- Greedy decoding (no beam search, no sampling) to simplify analysis and avoid exposure to degenerate outputs introduced by stochastic sampling.
Evaluation
Metrics and modalities
- Character-level normalized edit distance (Levenshtein / #chars).
- BLEU and METEOR borrowed from MT evaluation.
- Precision, Recall, F1 over tokens.
- Evaluate over:
- All text (single stream).
- Plain text.
- Math.
- Tables.
Main quantitative results (arXiv test set)
From Table 1 (p.7):
- All text (no modality split)
- PDF: edit distance 0.255, BLEU 65.8, METEOR 82.1, F1 79.2.
- GROBID: edit distance 0.312, BLEU 55.6, METEOR 71.9, F1 73.0.
- Nougat small: edit distance 0.073, BLEU 88.9, METEOR 92.8, F1 92.9.
- Nougat base: edit distance 0.071, BLEU 89.1, METEOR 93.0, F1 93.1.
- Tables
- GROBID: edit distance 0.626, BLEU 25.1, METEOR 64.5, F1 69.7.
- Nougat small: edit distance 0.220, BLEU 68.5, METEOR 78.6, F1 77.3.
- Nougat base: edit distance 0.211, BLEU 69.7, METEOR 79.1, F1 78.0.
- Plain text
- GROBID + LaTeX-OCR: edit distance 0.363, BLEU 57.4, METEOR 69.2, F1 75.9.
- Nougat small: edit distance 0.058, BLEU 91.0, METEOR 94.3, F1 95.7.
- Nougat base: edit distance 0.058, BLEU 91.2, METEOR 94.6, F1 95.7.
- Math
- GROBID + LaTeX-OCR: edit distance 0.727, BLEU 0.3, METEOR 5.0, F1 9.7.
- Nougat small: edit distance 0.117, BLEU 56.0, METEOR 74.7, F1 76.9.
- Nougat base: edit distance 0.128, BLEU 56.9, METEOR 75.4, F1 76.5.
Qualitative examples
- Figure 5 (p.6): side-by-side original vs. Nougat-rendered math-heavy page (Sorscher et al.). Equations, alignment, and text are reproduced with minor markup differences; decorative equation boxes are skipped.
- Appendix B figures (pp.14–17):
- Old calculus textbook (Figure B.1): shows OCR errors on barely legible exponents and repetition loops triggered by punctuation mistakes.
- NASA report (Figure B.2): longer, dense paragraphs; model mostly tracks text accurately, though math and typographic details occasionally degrade.
- Mobile-scanned thesis pages (Figure B.3): demonstrates robustness to camera artifacts.
- Pages with tables (Figure B.4): show how tables and plots are represented in the markup.
Hardware / Production
- Training hardware: not specified in detail (no GPU counts or training hours given).
- Inference throughput (Section 5.5):
- Machine: NVIDIA A10G GPU with 24GB VRAM.
- Batch: 6 pages in parallel.
- Average output length: $\approx 1400$ tokens/page.
- Mean generation time: 19.5 seconds per 6-page batch (no optimizations).
- Comparison to traditional pipelines:
- GROBID: $\approx 10.6$ PDFs/s on unspecified hardware.
- Nougat is much slower but supports scanned docs and provides semantically richer outputs for math and tables.
2024-04-mathwriting
MathWriting — Notes
TL;DR
MathWriting is a large-scale online handwritten mathematical expression dataset with 230k human-written inks and 396k synthetic inks, each paired with raw and normalized LaTeX labels. The paper proposes a benchmark using token-level character error rate and reports baseline results for CTC Transformer, PaLI, and PaLIGemma. Adding synthetic data reduces test CER from 6.20 to 5.49.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$
Primary contribution is a dataset and benchmark protocol (splits, normalization statistics, licensing, baseline models). Secondary: $\Psi_{\text{Evaluation}}$ through proposed test protocol with tokenized LaTeX CER.
What is the motivation?
Handwritten mathematical expression recognition is challenging due to inherent 2D spatial structure. Online handwriting (stroke sequences) differs from static bitmaps. Data scarcity is a bottleneck: collecting real handwritten math requires specialized hardware and human effort. Existing benchmarks (CROHME) have smaller vocabularies and fewer samples. MathWriting expands symbol coverage, formula diversity, and supports offline recognition via rasterization.
What is the novelty?
- Scale: Largest published online handwritten math dataset (650k inks vs 164k in CROHME23)
- Dual labels: Raw LaTeX plus normalized labels to remove training/evaluation ambiguities
- Synthetic pipeline: Constructs diverse expressions by pasting handwritten symbol inks into LaTeX-derived bounding boxes; bounding-box data published for custom synthesis
- Split strategy: Test designed for low label overlap with train, motivated by findings that “seen vs unseen label” matters more than writer identity
What experiments were performed?
Baselines trained/fine-tuned on MathWriting (train + synthetic), evaluated on valid/test with token-level CER:
- OCR API (rasterized inks)
- CTC Transformer (online)
- PaLI (encoder-decoder VLM, point sequence + raster)
- PaLIGemma (decoder-only LLM, image input with speed features)
- Synthetic data ablation for CTC Transformer
What are the outcomes/limitations?
Outcomes:
- Test CER (lower is better): OCR API 7.17, CTC Transformer 5.49, PaLI 5.95, PaLIGemma 5.97
- Synthetic data improves CTC Transformer: 6.20 $\rightarrow$ 5.49 CER
- Vocabulary expansion: 254 tokens vs 105 in CROHME23; includes matrices not in CROHME23
Limitations:
- Single-formula scope; models may not transfer to full handwritten pages
- LaTeX-only labels; not intended for general handwritten language
- Some distinctions intrinsically ambiguous from ink alone (e.g., “z” vs “2”)
- Noise remains: stray strokes <1%, incorrect ground truth ~1–2%
- Normalization is syntactic; it cannot resolve semantic cases (e.g., “cos” vs \cos in “tacos”)
Model
Input representations
- Online ink: Sequence of strokes; each stroke is a sequence of $(x, y, t)$ points where $t$ is timestamp
- Rasterized: Used for OCR-style models; mixed with point sequences for VLM fine-tuning
Baseline architectures
CTC Transformer (35M params):
- Transformer-base with CTC loss
- 11 layers, embedding size 512, swish activation, dropout 0.15
PaLI (700M params):
- Encoder-decoder VLM fine-tuned on MathWriting
- Uses both point sequences and rasterized ink
PaLIGemma (3B params):
- Decoder-only LLM (Gemma) with image input at 448px
- Trained using ink rendering with speed information
Data
Dataset splits and sizes
Five splits: train, valid, test (human-written), symbols (isolated symbols for synthesis), synthetic (generated expressions).
| Split | Inks | Distinct Labels |
|---|---|---|
| train | 230k | 53k |
| synthetic | 396k | 396k |
| valid | 16k | 8k |
| test | 8k | 4k |
Total: 650k inks with 254-token vocabulary (vs 164k inks, 105 tokens in CROHME23).
Collection protocol (human-written)
- Collected via in-house Android app: contributors copy rendered prompt (bitmap from LaTeX) using finger or stylus
- 6 campaigns (2016–2019), each 2–3 weeks; contributors hired internally
- Prompt sources: ~95% Wikipedia; remainder generated for underrepresented structures (nested fractions, rare symbols, matrices)
- Device diversity: ~150 device types; different sampling rates and artifacts
Synthetic generation
- Uses LaTeX compilation outputs (DVI-derived bounding boxes) to place handwritten symbol inks into expression layouts
- Individual symbol inks manually extracted from train (20–30 occurrences per symbol) to build symbols split
- Synthetic expressions tend to be longer (90th percentile: 68 chars vs 51 in train) for length generalization
Label normalization
Each sample includes the raw annotation label (as collected) and a normalizedLabel (for training/eval robustness).
Normalization: remove spaces, standardize braces, order sub/superscripts, rewrite \over $\rightarrow$ \frac, collapse synonyms, rewrite function commands (e.g., \sin) to letter sequences, normalize matrix environments, drop size modifiers (\left/\right).
Algorithms / Training
Split strategy
- Human-written split by writer (early) and by label (later), motivated by findings that “seen vs unseen label” mattered more than writer style
- Label overlap: valid has substantial overlap with train; test kept low (355 shared labels train-test)
Training recipes
CTC Transformer:
- Adam, lr $1 \times 10^{-3}$, batch 256, 100k steps
PaLI:
- 200k steps, batch 128, lr 0.3, dropout 0.2; three runs with different shuffles
PaLIGemma:
- lr $1 \times 10^{-4}$, batch 512; image input 448px
Evaluation
Benchmark protocol
Evaluate on test split with token-level CER, where “character” = LaTeX token (not ASCII character). Tokenization code provided in Appendix M.
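A minimal sketch of token-level CER (Levenshtein distance over LaTeX tokens, normalized by the reference token count); the regex tokenizer is a naive placeholder, not the tokenizer from Appendix M.

```python
import re

def latex_tokens(s):
    """Naive LaTeX tokenizer: commands like \\frac count as single tokens, everything else per character."""
    return re.findall(r"\\[A-Za-z]+|\S", s)

def token_cer(reference, hypothesis):
    """Edit distance over LaTeX tokens divided by the reference token count."""
    ref, hyp = latex_tokens(reference), latex_tokens(hypothesis)
    d = list(range(len(hyp) + 1))                  # single-row Levenshtein DP
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / max(len(ref), 1)

print(token_cer(r"\frac{a}{b}", r"\frac{a}{c}"))   # 1 of 7 tokens substituted, about 0.143
```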
Results
| Model | Params | Valid CER | Test CER |
|---|---|---|---|
| OCR API | — | 6.50 | 7.17 |
| CTC Transformer | 35M | 4.52 | 5.49 |
| PaLI | 700M | 4.47 | 5.95 |
| PaLIGemma | 3B | 3.95 | 5.97 |
Synthetic data ablation (CTC Transformer)
| Configuration | Valid CER | Test CER |
|---|---|---|
| With synthetic | 4.52 | 5.49 |
| Without synthetic | 4.64 | 6.20 |
Hardware / Production
- CTC Transformer: 4 hours on 4 TPU v2 per run (100k steps)
- PaLI: 14 hours on 16 TPU v5p per run; total experiment cost: 2 TPU v2 days + 28 TPU v5p days
- PaLIGemma: 36 hours on 64 TPU v5p
2024-09-cdm
CDM (Character Detection Matching) – Notes
- Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching (CVPR 2025)
- CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation
- Code
TL;DR
CDM is an image-level evaluation metric for formula recognition that renders predicted and ground-truth LaTeX into images and matches per-character bounding boxes instead of comparing LaTeX strings. CDM reduces unfairness from non-unique LaTeX representations and aligns better with human judgment than BLEU, Edit Distance, and ExpRate.
What kind of paper is this?
Dominant vector: $\Psi_{\text{Evaluation}}$
Headline contribution is a new evaluation metric and protocol for formula recognition, motivated by reliability and fairness issues in existing text-based metrics.
Secondary vectors: $\Psi_{\text{Method}}$ (matching algorithm design)
CDM includes a concrete multi-stage algorithm: localization, Hungarian matching, invalid-match elimination, and scoring.
What is the motivation?
Non-unique LaTeX representations: Text-based metrics (BLEU/Edit Distance/ExpRate) misrepresent correctness because visually identical formulas can score poorly if written differently.
Unfair model comparisons: Metrics favor outputs closer to a dataset’s annotation style even when the prediction is objectively worse, particularly under distribution or style mismatch.
Human-perception mismatch: Predictions with obvious visual errors can receive high BLEU scores, while visually correct predictions may score poorly due to stylistic differences.
What is the novelty?
Render-to-image evaluation: CDM evaluates formula recognition in image space, not LaTeX space, by rendering both predicted and ground-truth LaTeX and performing character-level matching with spatial awareness.
Character detection framing: Each token/character is treated as an “object” with a bounding box; a match-based score analogous to detection evaluation uses an F1-style metric.
Robust matching pipeline: Bipartite matching (Hungarian algorithm) plus post-filters (token consistency and geometric consistency via RANSAC with constrained affine transform) avoid cascading mismatch from local errors or layout differences.
What experiments were performed?
Formula-level evaluation on UniMER-Test (23,757 formulas) comparing Mathpix (API), UniMERNet, Texify, and Pix2Tex under BLEU, ExpRate, CDM, and ExpRate@CDM.
Human preference study (1,008 samples) comparing whether CDM vs BLEU better reflects quality, with randomized score ordering in the UI.
Style-stability test: 50 formulas rewritten 5 times using GPT-4 (250 variants), manually verified to render identically; compared score sensitivity of BLEU vs CDM.
Document-level evaluation on Tiny-Doc-Math (12 PDFs, 196 pages, 437 formulas; post-June 2024 papers) using Nougat, GPT-4o, Mathpix, plus formula-level cropped inputs for GPT-4o/UniMERNet/Mathpix/Pix2Tex.
What are the outcomes/limitations?
Outcomes
Human preference: 64% preferred CDM; 32% said both are good; 3% preferred BLEU; 1% neither. Authors interpret this as 96% consistency with human evaluation.
Style robustness: CDM scores remain 1.0 under equivalent formula rewrites, while BLEU varies widely across writing styles.
Model comparison differences: Subset analyses highlight cases where BLEU and CDM yield opposite conclusions due to annotation-style bias, notably in the SCE (Screenshot Expressions) subset.
Tiny-Doc-Math: GPT-4o achieves the highest BLEU among cropped-formula inputs but the lowest CDM, suggesting BLEU may overstate formula recognition quality for some models.
Limitations
Rendering dependency: CDM requires successful LaTeX rendering; rendering failures are assigned CDM = 0, coupling evaluation to renderer robustness and LaTeX validity.
Token-synonym handling is partial: CDM uses a low token mismatch cost (0.05) for some differently written but identically rendered tokens (e.g., "(", "\left(", "\big("), but coverage of the full LaTeX synonym space is not guaranteed.
Scaling to very long formulas: Element localization uses a finite palette of 5,832 distinct colors; behavior when formulas exceed that many tokens is not described.
Geometric assumptions: Positional consistency uses a constrained affine transform with rotation fixed to 0 (translation + scaling), which matches typical rendering but is an explicit assumption.
Model
What CDM is (conceptually)
CDM reframes formula recognition evaluation as matching sets of detected character regions between two rendered images (prediction vs ground truth), instead of comparing LaTeX strings.
Outputs
- CDM score: F1-style match score in $[0,1]$
- ExpRate@CDM: fraction of samples with perfect match (CDM = 1)
Data
UniMER-Test
Size: 23,757 formula samples
Subsets:
- SPE: Simple Printed Expressions
- CPE: Complex Printed Expressions
- SCE: Screenshot Expressions
- HWE: Handwritten Expressions
Tiny-Doc-Math
Construction: arXiv math/CS papers published after June 2024 (intended to reduce training contamination); LaTeX and PDFs collected; displayed equations matched via regex and manually verified.
Size: 12 PDFs, 196 pages, 437 formulas
Algorithms / Training
CDM pipeline
CDM has four stages: (1) element localization, (2) element region matching, (3) invalid match elimination, (4) metric calculation.
1) Element localization
LaTeX source normalization: Tokenize both ground truth and prediction into tokens such as "2", "a", "A", "\alpha", "\sin". Composite constructs are decomposed (e.g., \frac ab rewritten as \frac {a} {b}).
Element region localization via unique colors:
- Render each token in a unique RGB color using \mathcolor[RGB]{r,g,b}
- Construct a color list with interval 15 from (0,0,15) to (255,255,255), yielding $(255/15 + 1)^3 = 5832$ distinct colors
- After rendering, extract pixels of each color to locate the token’s bounding box
2) Element region matching (Hungarian assignment)
Let $y$ be GT elements, $\hat{y}$ predicted elements; sizes $N_y$, $N_{\hat{y}}$, with $N=\min(N_y,N_{\hat{y}})$.
Find a permutation $\hat{\omega}$ minimizing total cost:
$$ \hat{\omega}=\arg\min_{\omega\in S_N}\sum_{i=1}^{N} L_{\text{match}}(y_i,\hat{y}_{\omega(i)}) $$
Matching cost is a weighted sum of:
- Token matching cost $L_t$: 0 if tokens identical; 1 if different; 0.05 if different tokens render identically
- Positional proximity cost $L_p$: $L_1$ distance between bbox coordinates (normalized by bbox coordinate dimension)
- Order similarity cost $L_o$: $L_1$ distance between normalized token-order indices (approximate reading order from LaTeX source)
Weights $W_t,W_p,W_o$ are defined but specific numeric values are not enumerated in the main text.
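A sketch of the assignment step using SciPy's Hungarian solver; the weights and the element dict format are placeholders (the paper does not enumerate the weights), and the 0.05 identical-render case is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_elements(gt, pred, w_t=1.0, w_p=1.0, w_o=1.0):
    """Match GT and predicted elements by minimizing the combined matching cost.

    Each element is a dict with 'token', 'bbox' (normalized x1, y1, x2, y2) and
    'order' (normalized reading-order index).
    """
    cost = np.zeros((len(gt), len(pred)))
    for i, g in enumerate(gt):
        for j, p in enumerate(pred):
            l_t = 0.0 if g["token"] == p["token"] else 1.0                   # token matching cost
            l_p = np.abs(np.array(g["bbox"]) - np.array(p["bbox"])).mean()   # positional proximity
            l_o = abs(g["order"] - p["order"])                               # order similarity
            cost[i, j] = w_t * l_t + w_p * l_p + w_o * l_o
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))   # candidate matches, before invalid-match elimination
```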
3) Invalid match elimination
Token consistency check: Discard matched pairs whose characters are inconsistent.
Position relationship consistency check:
- Assume predicted bboxes follow an affine transform of GT bboxes: $\hat{b}_{\omega(i)} = A(b_i)$
- Use RANSAC to estimate $A$ and remove outliers; rotation is fixed to 0 (translation + scaling only) to speed convergence and match typical rendering structure
- Run multiple rounds to handle line breaks (multi-line layouts)
4) Metric calculation
Define:
- TP: matched bbox pairs after elimination
- FP: unmatched predicted bboxes
- FN: unmatched GT bboxes
CDM score (F1):
$$ \text{CDM}=\frac{2 \cdot TP}{2 \cdot TP+FP+FN} $$
ExpRate@CDM: proportion of samples with CDM = 1
Evaluation
Baseline metrics discussed
- BLEU: $n$-gram overlap with brevity penalty
- Edit Distance: insertion/deletion/substitution distance
- ExpRate: exact string match rate (noted as coarse/strict and unreliable under LaTeX non-uniqueness)
LaTeX “regularization” helps some syntax variations but cannot cover full symbol synonymy (e.g., \leq vs \le).
Rendering success rate (CDM applicability)
If rendering fails, CDM is set to 0. Reported render success on UniMER-Test: Pix2Tex 96.63%, Texify 94.77%, UniMERNet 99.71%, Mathpix 97.82%.
Main results (UniMER-Test)
| Model | ExpRate | ExpRate@CDM | BLEU | CDM |
|---|---|---|---|---|
| Pix2Tex | 0.1237 | 0.2910 | 0.4080 | 0.6360 |
| Texify | 0.2288 | 0.4950 | 0.5890 | 0.7550 |
| Mathpix | 0.2610 | 0.5000 | 0.8067 | 0.9510 |
| UniMERNet | 0.4799 | 0.8110 | 0.8425 | 0.9680 |
Subset anomalies
SCE subset: BLEU and CDM can disagree for Mathpix vs UniMERNet. SCE annotations were based on Mathpix outputs then manually corrected, biasing LaTeX style toward Mathpix and inflating BLEU alignment.
Pix2Tex: Shows large BLEU drops on HWE/SCE but strong performance on SPE/CPE, attributed to training data skew toward printed arXiv formulas and lack of handwritten/screenshot styles.
Tiny-Doc-Math results
Formula-level (cropped formula inputs):
| Model | BLEU | CDM | ExpRate@CDM |
|---|---|---|---|
| Pix2Tex | 0.4648 | 0.7444 | 0.3684 |
| GPT-4o | 0.6431 | 0.7330 | 0.4324 |
| UniMERNet | 0.6056 | 0.9396 | 0.6887 |
| Mathpix | 0.6112 | 0.9480 | 0.2105 |
Document-level (page screenshots / full-document outputs):
| Model | BLEU | CDM | ExpRate@CDM |
|---|---|---|---|
| GPT-4o | 0.3411 | 0.6502 | 0.1670 |
| Nougat | 0.5897 | 0.8326 | 0.6086 |
| Mathpix | 0.5939 | 0.9567 | 0.6292 |
Manual check note: Mathpix often misses trailing commas/periods, which impacts exact-match rate (ExpRate@CDM) even when CDM remains high.
Human preference protocol
1,008 samples from Pix2Tex with balanced score distribution; annotators saw GT + predicted render and chose which score (BLEU vs CDM, randomized order) better matched perceived quality.
Style sensitivity test
50 formulas rewritten 5 times by GPT-4 (250 variants), manually verified identical render; CDM remained 1 for all style variants while BLEU varied.
Hardware / Production
GPU/throughput requirements for running CDM are not reported. Operational cost is primarily driven by (1) LaTeX rendering and (2) per-formula matching (Hungarian assignment + RANSAC iterations).
2024-09-mineru
MinerU — Notes
TL;DR
MinerU is an open-source, multi-module PDF extraction pipeline built on PDF-Extract-Kit models with explicit preprocessing and post-processing rules to produce Markdown/JSON from diverse PDFs (academic papers, textbooks, exams, financial reports, slides). The system uses LayoutLMv3 for layout detection (~21K training pages), YOLOv8 for formula detection (~2.9K pages), UniMERNet for formula recognition (UniMER-1M data), TableMaster + StructEqTable for tables, and PaddleOCR per text region. Post-processing includes bbox overlap resolution and reading-order segmentation. The paper emphasizes robustness via diverse training data and improved end-to-end readability via rule-based ordering heuristics.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (open-source “all-in-one” extraction system released as a project with models, pipeline, and tooling).
Secondary: $\Psi_{\text{Method}}$ (pipeline design + preprocessing/post-processing algorithms) and $\Psi_{\text{Evaluation}}$ (module-level comparisons on layout, formula detection/recognition, plus qualitative end-to-end visualizations).
The primary contribution is the release of a production-ready document extraction system. The method component describes the engineering of preprocessing (language detection, scanned PDF handling) and post-processing (overlap resolution, reading order) around existing model components. Evaluation focuses on validating individual modules rather than comprehensive end-to-end benchmarking.
What is the motivation?
High-quality document extraction is critical for LLM training data pipelines and RAG systems, but existing approaches have significant gaps:
- OCR-only approaches introduce noise on non-text elements (formulas, tables, figures) and struggle with layout understanding
- Library parsing (direct PDF text extraction) fails on formulas, tables, and scanned documents; reading order is often scrambled in multi-column layouts
- Multi-module pipelines show promise but prior open-source models often overfit to academic papers, degrading on textbooks, financial reports, exam papers, and slides
- End-to-end MLLMs can handle diverse content but incur high inference costs for large-scale document processing
The authors position MinerU as addressing the robustness and diversity gap in open-source document extraction via data-engineering-driven model training and explicit post-processing for layout-aware reading order.
What is the novelty?
System-level engineering
An end-to-end workflow combining:
- PDF preprocessing: Parseability detection, language identification (Chinese/English only), encryption handling, scanned vs. parseable classification, metadata extraction
- Model-based region detection/recognition: Five-model pipeline (layout detection, formula detection, formula recognition, table recognition, OCR)
- Rule-based post-processing: Bounding-box overlap resolution (containment, partial overlap handling), reading-order segmentation using “top-to-bottom, left-to-right” heuristics
- Format conversion: Intermediate JSON representation (with `_parse_type` and `_version_name` fields) converted to Markdown/JSON outputs
Data-engineering emphasis
Models in PDF-Extract-Kit are trained/fine-tuned on diverse document sources beyond academic papers:
- 11-category document taxonomy: Academic Papers, Research Reports (financial), Standard/Special Image-Text Textbooks, Slides, Exam Papers, Historical Documents, Handwritten Notes, Picture Albums, Standard Books
- Validation-guided sampling: cluster PDFs by visual features, sample across cluster centers to maximize diversity
- Iterative annotation based on model validation feedback
Post-processing for reading order
Explicit algorithms for handling common failure modes:
- Containment removal: Text/formula boxes inside image/table regions are filtered out to prevent duplication
- Partial overlap resolution: Vertically/horizontally overlapping text boxes are shrunk to avoid mutual coverage
- Column-aware segmentation: Page is segmented into regions consistent with “top-to-bottom, left-to-right” reading, with each region containing at most one column
Contrast to MinerU2.5: The original MinerU is a pipeline system combining specialized models with explicit preprocessing/post-processing rules. MinerU2.5 (released ~1 year later) replaces the multi-module pipeline with a unified 1.2B-parameter VLM using a two-stage coarse-to-fine architecture (thumbnail layout detection followed by native-resolution crop recognition). The successor achieves higher accuracy with learned representations but requires GPU inference, while the original MinerU targets broader deployment scenarios including CPU-only environments.
What experiments were performed?
The paper provides module-level evaluation on three core components:
1. Layout detection (Table 3)
Compared LayoutLMv3-Finetuned against DocXchain, Surya, and 360LayoutAnalysis variants on academic papers and textbooks using mAP/AP50/AR50 metrics.
2. Formula detection (Table 4)
Compared YOLOv8-Finetuned against Pix2Text-MFD on academic papers and multi-source documents using AP50/AR50 metrics.
3. Formula recognition (Table 5)
Compared UniMERNet against Pix2tex, Texify, and Mathpix on CDM-adapted evaluation using ExpRate, ExpRate@CDM, BLEU, and CDM metrics.
4. End-to-end qualitative results (Figure 7)
Three-column visualization (layout → spans → Markdown) across document types: academic literature, textbooks, exam papers, and research reports. The authors argue module quality plus post-processing yields readable Markdown, but no quantitative end-to-end metrics are reported in the provided sections.
What are the outcomes/limitations?
Outcomes (as reported)
Layout detection (LayoutLMv3-Finetuned):
| Split | mAP $\uparrow$ | AP50 $\uparrow$ | AR50 $\uparrow$ |
|---|---|---|---|
| Academic Papers Val | 77.6 | 93.3 | 95.5 |
| Textbook Val | 67.9 | 82.7 | 87.9 |
The fine-tuned model substantially outperforms reported baselines (e.g., DocXchain academic mAP 52.8, Surya academic mAP 24.2).
Formula detection (YOLOv8-Finetuned):
| Split | AP50 $\uparrow$ | AR50 $\uparrow$ |
|---|---|---|
| Academic Papers Val | 87.7 | 89.9 |
| Multi-source Val | 82.4 | 87.3 |
Formula recognition (UniMERNet):
| Model | ExpRate@CDM $\uparrow$ | CDM $\uparrow$ |
|---|---|---|
| UniMERNet | 0.811 | 0.968 |
| Mathpix | 0.5 | 0.951 |
The authors emphasize CDM as the more reliable metric over ExpRate for formula recognition evaluation.
Limitations / caveats
Language support constraints:
- MinerU currently supports only Chinese and English for high-quality extraction
- Other languages are not quality-guaranteed; preprocessing explicitly filters for these two languages
Reading order assumptions:
- Post-processing assumes “top-to-bottom, left-to-right” reading order, which may not align with vertical scripts (Japanese/Chinese vertical) or right-to-left languages (Arabic, Hebrew)
- The dataset taxonomy includes “Historical Document” with right-to-left reading order mentioned, but the algorithm description remains LTR-oriented
Evaluation coverage gaps:
- No quantitative end-to-end benchmarking reported in the paper (no edit distance, F1, BLEU, or other full-document metrics)
- Table recognition and OCR accuracy are described qualitatively with visualizations but lack detailed quantitative evaluation tables
- Module-level metrics don’t capture error propagation through the pipeline
Runtime / compute not reported:
- The paper positions MinerU as “low inference cost” compared to end-to-end MLLMs, but provides no concrete latency, throughput, or memory measurements
- Hardware requirements (CPU vs GPU, VRAM needs) are not specified
Open questions:
- How does end-to-end accuracy compare to newer unified VLMs (GOT-OCR2.0, InternVL, Qwen2-VL) on standard benchmarks?
- What is the error propagation impact from layout detection mistakes?
- How does the system handle edge cases like overlapping text/tables, complex nested structures, or documents with non-standard layouts?
Model
Pipeline overview
MinerU follows an explicitly staged workflow:
- Document preprocessing: Language detection (Chinese/English), page size/count extraction, encryption/password handling, scanned vs. parseable classification
- Content parsing via five models: Layout detection → formula detection → formula recognition + table recognition + OCR (applied per-region)
- Post-processing: Bounding box overlap resolution, image/table cropping, header/footer removal, reading-order reconstruction
- Format conversion: Intermediate JSON → Markdown and final JSON outputs
Core model components (v0.8.1)
The pipeline integrates five specialized models:
| Component | Model | Training Data | Notes |
|---|---|---|---|
| Layout detection | LayoutLMv3-Finetuned | ~21K pages | 11 categories including title, paragraph, images, captions, tables, formulas, headers/footers |
| Formula detection | YOLOv8-Finetuned | 2,890 pages, 24,157 inline + 1,829 displayed formulas | Three categories: inline, displayed, ignore (ambiguous) |
| Formula recognition | UniMERNet | UniMER-1M | Handles printed/scanned/handwritten variety |
| Table recognition | TableMaster + StructEqTable | PubTabNet v2.0.0 + DocGenome | Table-to-LaTeX / Table-to-HTML |
| OCR | PaddleOCR | NR | Applied per detected text region (not whole-page) |
OCR strategy for multi-column and inline formulas
Per-region OCR (not whole-page):
- OCR is run on detected text regions (titles, paragraph blocks) individually to avoid merging columns into a single reading stream
- Preserves column boundaries established by layout detection
Inline formula masking:
- Formulas are masked in the text region using formula detector bounding boxes
- OCR runs on the masked region (text only)
- Formulas are reinserted into OCR output at their original positions
This prevents OCR from garbling mathematical notation while maintaining text flow.
Data
Layout detection training data
Size: ~21K annotated pages
Data collection process:
- Collect diverse PDFs across document types
- Cluster pages by visual features (embeddings)
- Sample across cluster centers to maximize diversity
- Annotate using validation-guided feedback loop
Annotation schema (11 categories):
- Content elements: title, paragraph text, images, captions, tables, table captions, image/table notes, inline formulas, formula labels
- Discard types: headers, footers, page numbers, page notes
Formula detection training data
Size: 2,890 pages with 24,157 inline formulas and 1,829 displayed formulas
Sources: Chinese and English academic papers, textbooks, books, financial reports
Categories:
- Inline formulas: Embedded in text flow
- Displayed formulas: Standalone equations
- Ignore: Ambiguous cases like “50%”, “NaCl”, “1-2 days” that resemble formulas but are plain text
Table recognition training data
- TableMaster: Trained on PubTabNet v2.0.0
- StructEqTable: Trained on DocGenome benchmark data
Document diversity taxonomy (Table 2)
The paper analyzes 11 document categories for training/validation:
- Academic Paper
- Research Report (financial reports, prospectuses)
- Standard Textbook
- Special Image-Text Textbook
- Slides (presentation materials)
- Exam Papers
- Historical Documents (includes right-to-left reading order examples)
- Handwritten Notes
- Picture Albums
- Standard Books
- Special Image-Text Exam Papers
Algorithms / Training
Preprocessing decisions
Language identification:
- Detects document language to set OCR language parameter
- Only Chinese and English are explicitly supported for quality guarantees
Garbled-text detection:
- Attempts direct PDF text extraction (library parsing via PyMuPDF)
- If extracted text is garbled (encoding issues, corrupted fonts), switches to OCR-based extraction
Scanned PDF identification heuristics:
- Large image area relative to text area
- Full-page or near-full-page images
- Near-zero average text length per page
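A sketch of this kind of check with PyMuPDF; the thresholds are invented for illustration and are not MinerU's actual rules.

```python
import fitz  # PyMuPDF

def looks_scanned(pdf_path, min_chars_per_page=30):
    """Heuristically flag a PDF as scanned: pages carry images but almost no text layer."""
    doc = fitz.open(pdf_path)
    image_only_pages, total_chars = 0, 0
    for page in doc:
        n_chars = len(page.get_text().strip())
        total_chars += n_chars
        if page.get_images() and n_chars < min_chars_per_page:
            image_only_pages += 1
    avg_chars = total_chars / max(len(doc), 1)
    return avg_chars < min_chars_per_page or image_only_pages > 0.8 * len(doc)
```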
Parse path selection:
- Parseable PDFs: Direct text extraction (PyMuPDF `fitz` library) for speed and accuracy
- Scanned PDFs: Enable full OCR pipeline for all content
Post-processing for bounding box overlaps
Containment removal:
- Remove text/formula boxes contained within image or table regions to prevent duplicate content
- Remove text boxes contained within formula boxes (formulas are atomic units)
Partial overlap handling:
- Text-text overlaps: Shrink overlapping text boxes (vertically or horizontally) to eliminate mutual coverage
- Text-image/table overlaps: Temporarily ignore image/table boxes to preserve text integrity; prioritize text reading flow over image placement
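A sketch of the containment filter; the box format and tolerance are assumptions.

```python
def contains(outer, inner, tol=2.0):
    """True if `inner` lies inside `outer` (boxes as x1, y1, x2, y2; small tolerance in px)."""
    return (inner[0] >= outer[0] - tol and inner[1] >= outer[1] - tol and
            inner[2] <= outer[2] + tol and inner[3] <= outer[3] + tol)

def drop_contained_boxes(text_boxes, region_boxes):
    """Remove text/formula boxes that fall inside image, table, or formula regions."""
    return [t for t in text_boxes
            if not any(contains(r, t) for r in region_boxes)]
```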
Reading-order segmentation
Goal: Produce stable “top-to-bottom, left-to-right” reading order across multi-column layouts
Algorithm:
- After overlap resolution, segment page into regions
- Each region contains at most one column of content
- Sort regions by vertical position (top-to-bottom primary)
- Within each region, sort elements by horizontal then vertical position (left-to-right, top-to-bottom secondary)
- Concatenate region contents in sorted order to produce final reading sequence
Assumption: Left-to-right, top-to-bottom reading convention (standard for English and Chinese horizontal text). The algorithm does not explicitly handle right-to-left scripts or vertical text layouts mentioned in the “Historical Document” category.
Evaluation
Layout detection (Table 3)
Metrics: mAP, AP50, AR50 on academic papers and textbook validation sets
| Model | Split | mAP | AP50 | AR50 |
|---|---|---|---|---|
| LayoutLMv3-Finetuned (Ours) | Academic Papers Val | 77.6 | 93.3 | 95.5 |
| LayoutLMv3-Finetuned (Ours) | Textbook Val | 67.9 | 82.7 | 87.9 |
| DocXchain | Academic Papers Val | 52.8 | 75.4 | 81.2 |
| Surya | Academic Papers Val | 24.2 | 47.6 | 58.9 |
The fine-tuned model shows substantial improvements over baselines, particularly on academic papers. Textbook performance is lower across all models, suggesting increased layout complexity.
Formula detection (Table 4)
Metrics: AP50, AR50 on academic papers and multi-source validation sets
| Model | Split | AP50 | AR50 |
|---|---|---|---|
| YOLOv8-Finetuned (Ours) | Academic Papers Val | 87.7 | 89.9 |
| YOLOv8-Finetuned (Ours) | Multi-source Val | 82.4 | 87.3 |
| Pix2Text-MFD | Academic Papers Val | 78.3 | 81.5 |
| Pix2Text-MFD | Multi-source Val | 71.6 | 76.8 |
Performance drops on multi-source validation reflect increased diversity (textbooks, financial reports, etc.) beyond academic papers.
Formula recognition (Table 5)
Metrics adapted from CDM paper: ExpRate, ExpRate@CDM, BLEU, CDM
| Model | ExpRate | ExpRate@CDM | BLEU | CDM |
|---|---|---|---|---|
| Pix2tex | 0.584 | 0.407 | 0.753 | 0.881 |
| Texify | 0.607 | 0.632 | 0.836 | 0.945 |
| Mathpix | 0.831 | 0.5 | 0.879 | 0.951 |
| UniMERNet (Ours) | 0.817 | 0.811 | 0.871 | 0.968 |
UniMERNet achieves the highest CDM score (0.968), which the authors emphasize as the most reliable metric. Mathpix leads on raw ExpRate but scores lower on ExpRate@CDM, suggesting potential overfitting to exact match without considering semantic equivalence.
End-to-end qualitative results (Figure 7)
The paper shows three-column visualizations (layout detection → spans → Markdown output) across document types:
- Academic literature: Complex formulas, tables, multi-column layouts
- Textbooks: Mixed text, images, captions, inline formulas
- Exam papers: Varied formatting, handwritten elements (in scanned cases)
- Research reports: Dense tables, financial notation
The visualizations demonstrate module integration and post-processing effectiveness, but no quantitative end-to-end metrics (edit distance, F1, BLEU, etc.) are provided for full-document extraction accuracy.
Hardware / Production
Runtime and compute: The paper positions MinerU as “low inference cost” compared to end-to-end MLLMs, but does not provide concrete measurements:
- No latency or throughput numbers (pages/second, seconds/page)
- No memory requirements (VRAM for GPU inference, RAM for CPU inference)
- No hardware specifications for evaluation runs
Deployment artifacts:
- Intermediate JSON format: `pdf_info.para_blocks` structure with `_parse_type` field (`txt` vs `ocr`) and `_version_name` for versioning
- Output formats: Markdown and JSON with optional image/table cropping
- CLI interface: Command-line tool for batch processing
Production considerations:
- Pipeline modularity enables CPU-only deployment (though OCR and detection models benefit from GPU acceleration)
- Open-source license (Apache 2.0 for code, model licenses vary by component) permits commercial use
- Maintenance requires updates to five separate model components plus preprocessing/post-processing rules
Note: This analysis follows the Roots Labs OCR paper-notes guidelines and classification taxonomy. For academic or production use, consult the original paper and verify claims through independent evaluation.
2025-01-docling-v2
Docling (v2) — Notes
TL;DR
Docling is an MIT-licensed, fully local document-conversion toolkit that parses multiple formats (PDF, images, MS Office, HTML) into a unified DoclingDocument representation, then exports lossless JSON or lossy Markdown/HTML and supports chunking for RAG. Architecturally it combines modular parser backends, pipelines, and task-specific models (layout detection, OCR, table structure), with an explicit design goal of faithful extraction where text comes from PDF tokens or OCR rather than being generated. The v2 paper focuses on speed benchmarking across CPU and GPU settings, showing OCR dominates runtime and GPU acceleration changes the relative cost of pipeline stages.
What kind of paper is this?
Dominant: $\Psi_{\text{Resource}}$ (releases an open-source toolkit with modular architecture, reusable model components, and permissive licensing suitable for air-gapped and sensitive deployments).
Secondary: $\Psi_{\text{Evaluation}}$ (benchmarks conversion speed and profiles pipeline stages across hardware configurations); $\Psi_{\text{Method}}$ (describes modular pipeline architecture and faithful extraction design).
What is the motivation?
- Fragmented ecosystem: Document conversion is scattered across formats and often locked behind SaaS or restrictive licensing. The authors aim for a self-contained local library suitable for sensitive or air-gapped environments.
- VLM concerns: The paper positions Docling against VLM-based converters, arguing that hallucinations and cost/compute requirements (often SaaS-only) are risks when faithful transcription is required.
- Faithful extraction approach: Their alternative uses modular, task-specific models that recover structure and features, while text is sourced from PDF tokens or OCR to reduce “generated” false content.
- Extensibility: Need for a system that supports swapping or adding models, custom pipelines, and integration with downstream RAG and processing workflows.
What is the novelty?
Unified DoclingDocument representation
- DoclingDocument is a Pydantic model capturing document elements, hierarchy, layout (bounding boxes), and provenance (page numbers, origin).
- API surface supports building and traversing in reading order, exporting to lossless JSON versus lossy Markdown/HTML, and chunking abstraction that produces text chunks plus metadata.
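A minimal usage sketch of this representation via the docling Python API; the class and method names follow the project README at the time of writing and may differ across versions:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")    # PDF, images, Office, HTML, ...
doc = result.document                       # DoclingDocument (Pydantic model)

markdown = doc.export_to_markdown()         # lossy export
lossless = doc.export_to_dict()             # lossless JSON-serializable form
```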
Modular architecture
- Three main concepts: pipelines, parser backends, and the DoclingDocument data model as the centerpiece.
- Two standard pipelines:
- StandardPdfPipeline (PDF/images): parse pages, OCR, layout analysis, table structure, assemble document, then export/chunk.
- SimplePipeline (markup formats like Office/HTML/AsciiDoc): parse markup, build/assemble, optional enrichment.
- Design emphasizes extensibility: swap models, add custom pipelines, integrate new backends.
PDF parsing backends
- Default: `docling-parse`, built on the low-level `qpdf` library.
- Alternative backend: `pypdfium`.
- The parsing step retrieves programmatic text tokens (strings with coordinates) and renders a bitmap per page for downstream operations.
Task-specific models
- Layout analysis: Object detector for page elements, architecture derived from RT-DETR, re-trained on DocLayNet plus proprietary datasets. Inference via Hugging Face `transformers` and `safetensors`. Post-processing removes overlaps and intersects detections with PDF tokens to form paragraphs, titles, lists, captions, figures, and tables.
- Table structure recognition: TableFormer, a vision-transformer model predicting table row/column structure and header/body roles. The pipeline provides an image crop with the included text cells; predictions are matched back to PDF cells, avoiding re-transcription and making the approach language-agnostic.
- OCR: Integrations with EasyOCR and Tesseract. Authors note EasyOCR is “fairly slow on CPU” and is the biggest compute expense in their benchmark.
- Assembly/post-processing: Outputs assembled into DoclingDocument via `docling-core`; post-processing algorithms augment features like reading order and figure-caption matching.
What experiments were performed?
Speed benchmark and comparisons
- Compared against Unstructured, Marker, and MinerU. Explicitly excludes SaaS or remote services.
- Benchmark dataset: 89 PDFs, derived largely from DocLayNet and augmented with CCpdf. Totals: 4,008 pages, 56,246 text items, 1,842 tables, 4,676 pictures.
- Hardware configurations:
- AWS EC2 `g6.xlarge` (8 vCores, 32GB RAM, Nvidia L4 24GB).
- MacBook Pro M3 Max 64GB.
- CPU-only and GPU-enabled runs.
- Methodology controls: Clean environments, latest versions, “state-of-the-art” options. CPU runs constrained to 8 threads; GPU runs enable CUDA where supported.
Profiling and ablations
- Break down per-stage contribution: PDF parse, OCR, layout, table structure.
- Measure runtime impact of disabling OCR and/or table structure recognition.
- Analyze stage-specific GPU speedups versus CPU baseline.
What are the outcomes and limitations?
Outcomes
Per-page conversion time distribution (benchmark pages):
- x86 CPU: 0.6s (5th percentile) / 0.79s (median) / 16.3s (95th percentile).
- M3 Max: 0.26s / 0.32s / 6.48s.
- L4 GPU: 57ms / 114ms / 2,081ms.
Ablation results:
- Disabling OCR saves approximately 60% runtime on x86 and M3, approximately 50% on L4.
- Disabling table structure saves 16% (x86/M3) and 24% (L4).
- Disabling both OCR and table structure saves approximately 75% across configurations.
OCR dominance:
- OCR engaged on 578 pages.
- EasyOCR average transcription time: 1.6s (L4), 13s (x86), 5s (M3).
Tool comparison (average seconds per page):
- CPU-only: Docling leads (3.1s x86, 1.27s M3), followed by MinerU and Unstructured. Marker is substantially slower.
- GPU-enabled: MinerU leads (0.21s) versus Docling (0.49s) and Marker (0.86s). Unstructured does not benefit from GPU.
Stage-specific GPU speedups (versus x86 CPU):
- OCR: approximately $8 \times$ faster.
- Layout analysis: approximately $14 \times$ faster.
- Table structure: approximately $4.3 \times$ faster.
Limitations and open questions
Faithful extraction trade-offs:
- The authors argue faithful extraction avoids hallucinations, but note this requires maintaining a diverse model set for components like formulas and figures. It is unclear how well this scales to highly specialized domains or degraded document quality.
Model coverage gaps:
- Future work includes adding figure classification, equation recognition, and code recognition models. Current system does not handle these elements as robustly as text and tables.
Benchmark scope:
- Speed evaluation is comprehensive, but there is no accuracy or quality benchmark reported in this paper. The authors mention future plans for an open-source quality evaluation framework (referencing DP-Bench, OmniDocBench), but results are not included here.
Hardware generalization:
- Benchmarks use specific configurations (L4 GPU, M3 Max). It is unclear how performance scales to other GPU generations, ARM architectures, or resource-constrained environments.
Comparison fairness:
- Baseline tools are configured with “state-of-the-art” options, but version choices and configuration decisions may not represent optimal settings for all tools. Authors provide version numbers but not full hyperparameter details for baselines.
Contrast to MinerU: MinerU is faster on GPU (0.21s/page versus 0.49s for Docling) but slower on CPU. Both use modular pipelines with task-specific models, but MinerU focuses on optimized inference speed while Docling emphasizes extensibility and faithful extraction design.
Contrast to Marker: Marker is substantially slower than Docling on both CPU and GPU. Marker uses Surya OCR by default, which may contribute to slower performance compared to Docling’s EasyOCR integration.
Contrast to olmOCR and Infinity-Parser: olmOCR and Infinity-Parser are end-to-end VLMs that generate structured output directly from images, whereas Docling uses a pipeline of task-specific models and extracts text from PDF tokens or OCR. Docling explicitly avoids generation to reduce hallucination risk, trading potential accuracy for faithful extraction.
Model
System architecture
Three main concepts:
- Pipelines: Coordinate parsing, enrichment, and assembly steps.
- Parser backends: Handle format-specific extraction (PDF, images, Office, HTML).
- DoclingDocument data model: Pydantic model capturing elements, hierarchy, layout, and provenance.
StandardPdfPipeline
Processing flow:
- Parse pages via backend (extract text tokens with coordinates, render bitmap).
- OCR (optional, triggered for scanned pages or regions without programmatic text).
- Layout analysis (detect page elements via RT-DETR-based model).
- Table structure recognition (TableFormer predicts row/column structure).
- Assemble document (build DoclingDocument with reading order, hierarchy).
- Export (lossless JSON or lossy Markdown/HTML) and chunking.
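A sketch of toggling the optional stages, mirroring the OCR and table-structure ablations reported later; option and class names are taken from the docling codebase as recalled and should be verified against the installed version:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Trade extraction coverage for speed by skipping OCR and table structure
opts = PdfPipelineOptions(do_ocr=False, do_table_structure=False)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("digital_born.pdf")
```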
SimplePipeline
For markup formats (Office, HTML, AsciiDoc):
- Parse markup structure.
- Build and assemble DoclingDocument.
- Optional enrichment steps.
Layout analysis model
- Architecture: Derived from RT-DETR (real-time detection transformer).
- Training data: Re-trained on DocLayNet plus proprietary datasets (specifics not disclosed).
- Inference: Hugging Face `transformers` and `safetensors`.
- Post-processing: Remove overlaps, intersect detections with PDF tokens to form semantic units (paragraphs, titles, lists, captions, figures, tables).
Table structure model (TableFormer)
- Architecture: Vision transformer predicting table row/column structure and header/body roles.
- Input: Image crop of detected table region plus included text cells from PDF tokens.
- Output: Structural predictions matched back to PDF cells, avoiding re-transcription. This makes it language-agnostic.
- Inference: Implemented in PyTorch.
- Training details: Refined with a custom structure token language (per cited prior work), but this paper does not provide end-to-end training hyperparameters.
OCR integrations
- EasyOCR: Default, supports many languages. Authors note it is “fairly slow on CPU” and dominates compute cost in benchmarks.
- Tesseract: Alternative OCR backend.
- Trigger logic: OCR is applied to scanned pages or regions without programmatic text.
PDF parsing backends
- Default: `docling-parse`, built on `qpdf` (low-level PDF library).
- Alternative: `pypdfium`.
- Output: Text tokens (strings with bounding box coordinates) and page bitmaps.
- Average parsing time: 81ms per page (x86), 44ms per page (M3). No GPU support for this stage.
Data
Benchmark dataset
- Source: 89 PDFs, largely from DocLayNet, augmented with CCpdf.
- Statistics: 4,008 pages, 56,246 text items, 1,842 tables, 4,676 pictures.
- Purpose: Speed benchmark and profiling; no accuracy evaluation reported in this paper.
Layout model training data
- Re-trained on DocLayNet plus proprietary datasets.
- DocLayNet is a public dataset for document layout analysis with diverse document types and detailed annotations.
- Proprietary data specifics are not disclosed (size, domain distribution, quality control).
Algorithms / Training
This paper is primarily a system and tool report; it does not provide end-to-end training hyperparameters or detailed training procedures. Key details:
- Layout detector: RT-DETR-based architecture re-trained on DocLayNet and proprietary data. Training hyperparameters (learning rate, optimizer, schedule, batch size) are not reported.
- TableFormer: Described as refined with a custom structure token language (per cited prior work). Inference is implemented in PyTorch, but training details are omitted.
- OCR models: Docling integrates existing OCR engines (EasyOCR, Tesseract) rather than training custom OCR models.
Evaluation
Benchmark configurations
Tool versions and settings (Table 1 in paper):
- Docling: v2.5.2, EasyOCR (default), TableFormer (fast mode).
- Marker: 0.3.10, Surya OCR (default).
- MinerU: 0.9.3, auto OCR, doclayout_yolo, rapid_table.
- Unstructured: 0.16.5, “hi res with table structure” mode.
Speed results
Per-page conversion time (median and percentiles reported in Section 5.4):
- See Outcomes section above for detailed numbers.
Average seconds per page (CPU and GPU):
- CPU-only:
- Docling: 3.1s (x86), 1.27s (M3).
- MinerU: higher than Docling on CPU.
- Marker: substantially slower than Docling.
- Unstructured: between Docling and Marker.
- GPU-enabled:
- MinerU: 0.21s (fastest).
- Docling: 0.49s.
- Marker: 0.86s.
- Unstructured: no GPU benefit.
Profiling insights
- OCR cost: OCR engaged on 578 of 4,008 pages. EasyOCR is the dominant compute expense, especially on CPU.
- Stage-specific speedups: GPU provides large speedups for OCR and layout analysis, moderate speedup for table structure, but no speedup for PDF parsing.
- Ablation impact: Disabling OCR and table structure dramatically reduces runtime but sacrifices extraction quality for scanned or complex documents.
Hardware / Production
Benchmark systems
- AWS EC2 g6.xlarge:
- 8 vCores, 32GB RAM, Nvidia L4 24GB.
- Used for x86 CPU and L4 GPU benchmarks.
- MacBook Pro M3 Max:
- 64GB RAM.
- Used for ARM CPU benchmarks.
Experimental controls
- CPU runs: Constrained to 8 threads via `OMP_NUM_THREADS` and tool-specific configuration.
- GPU runs: Enable CUDA where supported. Not all tools benefit from GPU (e.g., Unstructured).
Performance characteristics
- PDF parsing: 81ms per page (x86), 44ms per page (M3). No GPU support for this stage.
- OCR (EasyOCR): 1.6s per page (L4), 13s (x86), 5s (M3). Dominates total runtime when engaged.
- Layout analysis: Approximately $14 \times$ faster on L4 GPU versus x86 CPU.
- Table structure: Approximately $4.3 \times$ faster on L4 GPU versus x86 CPU.
Deployment considerations
- Local, air-gapped use: MIT license and fully local execution make Docling suitable for sensitive or offline environments.
- Extensibility: Modular architecture allows swapping OCR engines, adding custom models, or integrating new pipelines.
- Memory and throughput: Paper focuses on per-page latency; batch processing, memory footprint, and concurrent throughput are not quantified.
Note: This analysis follows the Roots Labs OCR paper-notes guidelines and classification taxonomy. For academic or production use, consult the original paper and verify claims through independent evaluation.
2025-01-ocean-ocr
Ocean-OCR-3B — Notes
TL;DR
Ocean-OCR is a 3B-parameter multimodal LLM built around a NaViT vision encoder plus an MLP projector into a Qwen-2.5-3B language model, trained with a three-stage pipeline and a large OCR-heavy data mixture. The authors report strong results on standard OCR benchmarks and on three “practical scenario” evaluations (dense bilingual documents, scene text, bilingual handwriting), including competitive or better performance than specialized OCR systems in their setup.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new model recipe for “general OCR” via dynamic-resolution vision encoder + OCR-centric data + staged training).
Secondary: $\Psi_{\text{Impact}}$ (real-world style comparisons vs TextIn API and PaddleOCR across scenarios), plus a smaller $\Psi_{\text{Resource}}$ component (large curated in-house + synthetic data, plus constructed eval sets, though release details are not the headline).
One reasonable superposition:
$$\text{Paper} \approx 0.65\,\Psi_{\text{Method}} + 0.20\,\Psi_{\text{Impact}} + 0.15\,\Psi_{\text{Resource}}$$
What is the motivation?
- Mainstream MLLMs are strong at reasoning, but their OCR/perception is often insufficient for dense text and text-heavy images, which blocks text-related tasks.
- Prior approaches include sliding windows, cropping, token compression, and dynamic tiling, but the authors claim OCR performance is still short of practical requirements and often limited to OCR-specific settings rather than general-purpose use.
What is the novelty?
- Architecture choice for resolution variability: Use a NaViT vision encoder so the model can process images of any resolution and produce a variable number of visual tokens.
- Token-count control for high-res images: Compress adjacent $2 \times 2$ visual tokens into 1 token to reduce compute while trying to preserve key information.
- Data and pipeline emphasis: Build a large multimodal dataset mixture (including OCR and OCR-augmented captioning) and train with a 3-phase pipeline (projector alignment, full pretraining, SFT).
What experiments were performed?
- General VLM benchmarks: MMMU, MMBench (EN/CN), MathVista, MME, SEEDBench, RealWorldQA, HallusionBench, evaluated via VLMEvalKit in zero-shot using original configs.
- Open OCR benchmarks: DocVQA, TextVQA, ChartQA, OCRBench.
- Practical OCR scenarios (constructed evals):
- Dense bilingual document extraction (100 EN papers + 100 ZH papers), metrics include normalized edit distance, F1, precision, recall, BLEU, METEOR.
- Scene text OCR benchmark (260 natural images, half Chinese and half English, from MSRA-TD500; pseudo labels via PaddleOCR then manual correction).
- Bilingual handwriting benchmark with 4 granularities/sources, 100 samples per category, evaluated with the same metric set.
What are the outcomes/limitations?
Key outcomes:
- General benchmarks: Ocean-OCR reports MMMU 42.0, MMBench-EN 75.3, MMBench-CN 73.0, MathVista 55.6, MME 2094, SEEDBench 72.5, RealWorldQA 61.2, HallusionBench 46.0.
- OCR benchmarks (Table 3): DocVQA 91.4, TextVQA 80.0, ChartQA 84.6, OCRBench 82.7, average 84.7.
- Dense bilingual documents (Table 4): Ocean-OCR achieves edit distance 0.057 (EN) and 0.062 (ZH), with strong F1/precision/recall and BLEU/METEOR on both languages.
- Scene text (Table 5): Ocean-OCR reports edit distance 0.113 with F1 0.875 and METEOR 0.754, outperforming several MLLM baselines in this setup.
Limitations:
- Training compute and hyperparameters are not specified: optimizer, learning rate schedule, batch sizes, training tokens/steps, GPUs, training time are unstated.
- “First MLLM to outperform TextIn and PaddleOCR” is evaluated on their constructed scenario datasets and their chosen TextIn endpoints; for example, on dense EN document edit distance TextIn is slightly lower (0.055 vs 0.057), while Ocean-OCR is much better on ZH in that table.
- The paper mentions future work to build a medical report benchmark, implying that scenario is currently anecdotal/case-based rather than a standardized eval.
Model
High-level architecture
- LLaVA-style 3-part stack: NaViT vision encoder + MLP projector + Qwen-2.5-3B LLM.
- NaViT provides variable-resolution input handling and yields a variable number of visual tokens per image.
Visual token compression
- For high-resolution images, adjacent $2 \times 2$ tokens are compressed into a single token to reduce token count and compute.
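A minimal sketch of merging adjacent 2x2 visual tokens; the paper does not specify the merge operator, so concatenation followed by a linear projection is an assumption here:

```python
import torch
import torch.nn as nn

class TokenMerge2x2(nn.Module):
    """Merge each 2x2 neighborhood of visual tokens into one token (4x fewer tokens)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(4 * dim, dim)  # assumed projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) grid of visual tokens; height/width assumed even
        b, h, w, d = x.shape
        x = x.reshape(b, h // 2, 2, w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // 2, w // 2, 4 * d)
        return self.proj(x)  # (batch, h/2, w/2, dim)
```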
LLM choice rationale
- They choose Qwen-2.5-3B for “ease of use and practical deployment,” balancing capability and size.
Data
Training data inventory (Table 1)
Public vs in-house counts (in M samples, as reported):
Alignment + pretrain:
- Pure text: 150.7M (in-house only)
- Caption: 33.2M public, 49.1M in-house
- Interleaved: 19.1M public, 28.7M in-house
- OCR: 12.4M public, 7.8M in-house
SFT:
- General QA (Cauldron): 3.6M public
- OCR QA: 3.0M public, 1.9M in-house
Totals: 71.3M public, 238.2M in-house.
Mixture and sourcing details
- Pretraining keeps a 50/50 mix of pure text and vision-language data.
- Interleaved data: OBELICS plus an in-house parsed corpus from books/papers, with an approximate 4:6 OBELICS:in-house ratio.
- Caption data: open datasets (DenseFusion-1M, Synthdog, DreamLIP, InternVL-SA-1B-Caption) plus synthetic captions where OCR hints are injected using PaddleOCR + GPT-4o (images sourced from Wukong and LAION-2B).
- OCR data: open OCR datasets (DocStruct4M, RenderedText, AnyWord-3M, TinyChartData) plus synthetic scene/PDF/handwriting OCR (PDFs from in-house e-books; handwriting rendered from pure-text corpora).
SFT data construction and filtering
- General VQA SFT uses Cauldron with filtering; they put data into a dialogue template and use Qwen2-VL-72B to judge response accuracy and filter low-quality QA pairs.
- OCR SFT expansions include synthesized scene OCR QA (COCO-Text, ICDAR2019 ArT, Incidental Scene Text) using GPT-4o to generate “realistic” text VQA, plus synthetic handwriting and in-house PDF-derived data.
Algorithms / Training
Three-phase pipeline
- Vision-language alignment: freeze vision encoder + LLM; train the projector to map visual tokens into the text feature space; NaViT is used as the vision encoder.
- Vision-language pretraining: unfreeze all modules (NaViT, MLP projector, LLM) and train jointly to build multimodal knowledge while preserving language capability.
- Supervised fine-tuning: update all components; goal is instruction following while maintaining general ability.
Optimization objective
- Across all stages, they use next-token prediction loss on text tokens.
Missing implementation details
- No batch size, learning rate schedule, optimizer, weight decay, training steps/tokens, or hardware/training time are provided in the paper.
Evaluation
Evaluation harness and protocol
- All evaluations use VLMEvalKit, run zero-shot, and follow models’ original configurations for consistency.
General benchmarks (Table 2)
| Model | MMMU | MMBench-EN | MMBench-CN | MathVista | MME | SEEDBench | RealWorldQA | HallusionBench |
|---|---|---|---|---|---|---|---|---|
| Ocean-OCR | 42.0 | 75.3 | 73.0 | 55.6 | 2094 | 72.5 | 61.2 | 46.0 |
Open OCR benchmarks (Table 3)
| Model | DocVQA | TextVQA | ChartQA | OCRBench | Average |
|---|---|---|---|---|---|
| Ocean-OCR | 91.4 | 80.0 | 84.6 | 82.7 | 84.7 |
Note on baselines: GOT is described as OCR-specific and not compatible with complex instructions or general tasks; TextIn and PaddleOCR are treated as specialized OCR baselines.
Practical scenario evaluations
Dense bilingual document extraction (Table 4)
- Dataset: 100 English paper images + 100 Chinese paper images
- Metrics: normalized edit distance, F1, precision, recall, BLEU, METEOR
- Ocean-OCR results (EN / ZH):
- Edit distance: 0.057 / 0.062
- F1: 0.937 / 0.962
- Precision: 0.932 / 0.956
- Recall: 0.956 / 0.974
- BLEU: 0.906 / 0.912
- METEOR: 0.945 / 0.916
- TextIn endpoints used: PDF-to-markdown for document extraction; multipage endpoint for scene + handwriting.
Scene text recognition (Table 5)
- Dataset: 260 natural images, split evenly Chinese/English, sampled from MSRA-TD500; pseudo GT via PaddleOCR then manual verification and correction.
- Ocean-OCR results:
- Edit distance: 0.113
- F1: 0.875
- Precision: 0.875
- Recall: 0.887
- BLEU: 0.420
- METEOR: 0.754
Handwritten recognition (Table 6)
- Dataset: paragraph-level real (CASIA-HWDB, GNHK), line-level real (CASIA-HWDB, BRUSH), plus paragraph/line synthetic, bilingual; 100 samples per category; same metric set.
Hardware / Production
Hardware, training time, throughput, and serving details are not reported in the paper.
2025-03-ppdoclayout
PP-DocLayout: Notes
- PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
- Code: PaddlePaddle/PaddleX
- Models: PaddlePaddle/PP-DocLayout-L, _plus-L, M, S
TL;DR
PP-DocLayout is a family of document layout detection models (L/M/S) that predict bounding boxes for 23 layout block classes. The largest variant reports 90.4% mAP@0.5 with 13.39 ms per page on T4 GPU; the smallest variant achieves 70.9% mAP@0.5 at 8.11 ms per page (T4) or 14.49 ms (CPU). Training combines knowledge distillation from GOT-OCR2.0’s visual encoder (for the L backbone) and semi-supervised learning with adaptive per-class thresholds (for M/S variants).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$
Core contribution is a unified detection approach across diverse document types with multiple model variants and training strategies: knowledge distillation for the L backbone and semi-supervised pseudo-labeling with adaptive thresholds for M/S variants.
Secondary: $\Psi_{\text{Resource}}$
Provides three model variants (L/M/S) with public weights and implementation details.
What is the motivation?
Layout detection is positioned as a foundational step for downstream document processing tasks (table recognition, formula recognition, OCR, information extraction) and structured training data generation.
The authors identify three gaps in prior layout detectors:
- Weak generalization beyond academic papers: insufficient coverage of magazines, newspapers, financial reports, and other document types.
- Insufficient granularity for complex layouts: coarse categorization that collapses charts, seals, and formulas into broad buckets.
- Insufficient speed for large-scale and real-time processing requirements.
What is the novelty?
Fine-grained label space: 23 layout categories designed to cover diverse document types and distinguish high-value regions (charts, seals, formula numbers, header/footer images) rather than collapsing them into generic classes.
Knowledge distillation for PP-DocLayout-L backbone: The PP-HGNetV2-B4 student backbone learns from the frozen Vary-VIT-B visual encoder (from GOT-OCR2.0) via feature-level alignment with a learnable projection layer and L2 distillation loss.
Adaptive per-class thresholding for semi-supervised learning: PP-DocLayout-M/S use PP-DocLayout-L as a teacher model to generate pseudo-labels, but instead of a fixed global confidence threshold, they select per-class thresholds by maximizing F1 on labeled validation data for each of the 23 categories.
Throughput focus: The authors emphasize deployment scenarios requiring batch processing at scale, reporting approximately 123 pages per second on T4 GPU when using the PaddleX inference engine.
What experiments were performed?
Main evaluation: Reports mAP@0.5, inference latency on T4 GPU and CPU (Intel Xeon Gold 6271C, 8 threads, FP16), and parameter counts for all three variants.
Qualitative comparison: Visual side-by-side comparisons to DocLayout-YOLO on diverse document types. No direct quantitative comparison due to label set mismatch.
Ablations:
- Knowledge distillation effect on PP-DocLayout-L accuracy.
- Semi-supervised learning effect on PP-DocLayout-M and PP-DocLayout-S accuracy.
Dataset description: Training and evaluation split sizes with per-category instance counts provided in the appendix.
What are the outcomes/limitations?
Outcomes:
- PP-DocLayout-L: 90.4% mAP@0.5, 13.39 ms (T4), 759.76 ms (CPU), 30.94M parameters.
- PP-DocLayout-M: 75.2% mAP@0.5, 12.73 ms (T4), 59.82 ms (CPU), 5.65M parameters.
- PP-DocLayout-S: 70.9% mAP@0.5, 8.11 ms (T4), 14.49 ms (CPU), 1.21M parameters.
Distillation improves L from 89.3% to 90.4% mAP@0.5. Semi-supervised learning improves M from 73.8% to 75.2% and S from 66.2% to 70.9%.
Limitations:
Small evaluation set: 500 images is small relative to typical layout benchmarks. The authors claim broad generalization but do not report results on standard external benchmarks (DocLayNet, PubLayNet, DocBank) under their native label schemes.
Limited baseline comparison: DocLayout-YOLO is compared primarily via visualization because category sets differ. This makes it difficult to isolate gains attributable to architecture and training strategies versus label design choices.
Single IoU threshold: Only mAP@0.5 is reported in the main results. No evaluation at stricter IoU thresholds (0.75, 0.9) or COCO-style mAP averaging across IoU thresholds. Per-category AP breakdowns are not provided in the main body.
Ambiguity in reporting: The narrative states PP-DocLayout-S improves by 4.7% with semi-supervised learning, and Table 4 shows a +4.7 percentage-point gain (66.2% to 70.9%). The figures agree, but the text should state explicitly whether improvements are percentage points or relative percentages.
Model
Detector architecture
PP-DocLayout-L:
- Detector: RT-DETR-L
- Backbone: PP-HGNetV2-B4 (15.6M parameters after distillation from Vary-VIT-B)
- Total parameters: 30.94M
PP-DocLayout_plus-L:
- Detector: RT-DETR-L
- Backbone: PP-HGNetV2-B4
- Total parameters: 30.94M
- Classes: 20 (reduced from 23 by removing three least-common categories)
- Training: Extended training duration for higher precision
PP-DocLayout-M:
- Detector: PicoDet-M
- Total parameters: 5.65M
PP-DocLayout-S:
- Detector: PicoDet-S
- Total parameters: 1.21M
Label set (23 categories)
The standard PP-DocLayout variants (L/M/S) predict bounding boxes for 23 layout block classes. The PP-DocLayout_plus-L variant uses a reduced set of 20 classes, removing the three least-common categories for improved precision on high-frequency classes.
Standard 23-class set:
- paragraph title
- image
- text
- number
- abstract
- content
- figure title
- formula
- table
- table title
- reference
- doc title
- footnote
- header
- algorithm
- footer
- seal
- chart title
- chart
- formula number
- header image
- footer image
- aside text
Category mapping versus DocLayout-YOLO
The authors provide a mapping table contrasting their 23-class label space with DocLayout-YOLO’s coarser scheme. Examples:
- Page Number is explicitly modeled in PP-DocLayout but mapped to “Abandon” in DocLayout-YOLO.
- Formula is treated as “Isolate Formula” in PP-DocLayout, while many structural elements (header, footer, footnote) are “Abandon” in DocLayout-YOLO.
Data
Labeled layout detection dataset
Training: 30,000 images annotated with 23 categories.
Evaluation: 500 images.
Document types: Chinese and English academic papers, magazines, newspapers, research reports, exam papers, handwritten notes, contracts, books.
Sources: Collected from Baidu image search plus public datasets DocLayNet and PubLayNet. The paper does not state that this combined dataset is publicly released; only model weights and code are available via PaddleX.
Per-category instance counts (train/eval):
High-volume classes from the appendix include:
- text: 217,257 / 3,342
- formula: 113,145 / 1,961
- paragraph title: 42,158 / 715
- header: 25,001 / 430
- number: 25,217 / 430
Full counts for all 23 categories are provided in Table 5 of the paper.
Distillation pretraining corpus
Used for distilling the PP-HGNetV2-B4 backbone from the Vary-VIT-B teacher: 500,000 document samples across five domains (mathematical formulas, financial documents, scientific literature from arXiv STEM fields, academic dissertations, tabular data from reports and spreadsheets).
The paper does not specify the source or license for this 500k corpus.
Algorithms / Training
Knowledge distillation (PP-DocLayout-L backbone)
Teacher model: Vary-VIT-B visual encoder from GOT-OCR2.0, frozen during student training.
Student model: PP-HGNetV2-B4 backbone.
Feature alignment: Teacher features $F_{\text{teacher}} \in \mathbb{R}^{B \times D}$ and student features $F_{\text{student}} \in \mathbb{R}^{B \times P}$ are aligned via a learnable projection $\phi: \mathbb{R}^{P} \rightarrow \mathbb{R}^{D}$.
Distillation loss (feature L2):
$$ L_{\text{Distill}} = \frac{1}{B} \sum_{i=1}^{B} \left\lVert F^{(i)}_{\text{teacher}} - \phi\left(F^{(i)}_{\text{student}}\right) \right\rVert_2^2 $$
Distillation training settings:
- Input resolution: 768 $\times$ 768
- Epochs: 50
- Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$
- Student backbone parameters after distillation: 15.6M
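A minimal PyTorch sketch of this feature-level objective; how teacher and student features are pooled into per-image vectors before alignment is not detailed in the paper and is assumed away here:

```python
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    """L2 distillation between a frozen teacher feature and a projected student feature."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.phi = nn.Linear(student_dim, teacher_dim)  # learnable projection phi: R^P -> R^D

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # f_student: (B, P), f_teacher: (B, D); teacher is frozen, so detach its features
        diff = f_teacher.detach() - self.phi(f_student)
        return diff.pow(2).sum(dim=-1).mean()  # mean over the batch of squared L2 norms
```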
Semi-supervised learning (PP-DocLayout-M/S)
Teacher model: PP-DocLayout-L generates predictions $P(y | x_u) = f_T(x_u; \theta_T)$ over $C = 23$ classes.
Adaptive per-class thresholding: For each class $c$, the threshold $\tau_c^{*}$ is selected to maximize per-class F1 on labeled validation data, rather than using a fixed global threshold across all categories.
Pseudo-label assignment: A region is assigned a pseudo-label if the predicted score exceeds $\tau_c^{*}$ for any class $c$.
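A sketch of the per-class threshold search described above; the F1 computation here is a simplified score-level stand-in, whereas a detection pipeline would first match predicted and ground-truth boxes by IoU:

```python
import numpy as np
from sklearn.metrics import f1_score

def per_class_thresholds(scores: np.ndarray, labels: np.ndarray,
                         candidates=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """scores: (N, C) predicted confidences on validation detections,
    labels: (N, C) binary ground-truth assignment; returns one threshold per class."""
    n_classes = scores.shape[1]
    thresholds = np.zeros(n_classes)
    for c in range(n_classes):
        best_f1, best_tau = -1.0, 0.5
        for tau in candidates:
            f1 = f1_score(labels[:, c], scores[:, c] >= tau, zero_division=0)
            if f1 > best_f1:
                best_f1, best_tau = f1, tau
        thresholds[c] = best_tau
    return thresholds
```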
Detector training settings
PP-DocLayout-L:
- Epochs: 100
- Learning rate: 0.0001 (constant)
- Batch size: 2 per GPU
- GPUs: 8 NVIDIA V100
- Training time: approximately 26 hours
PP-DocLayout-M:
- Epochs: 100
- Learning rate: 0.02
- Batch size: 2 per GPU
- GPUs: 8
- LR scheduler: CosineDecay
PP-DocLayout-S:
- Epochs: 100
- Learning rate: 0.06
- Batch size: 2 per GPU
- GPUs: 8
- LR scheduler: CosineDecay
Evaluation
Metrics and hardware
Primary metric: mAP@0.5
GPU latency: NVIDIA Tesla T4
CPU latency: Intel Xeon Gold 6271C @ 2.60 GHz, 8 threads, FP16 precision
Main results
| Variant | mAP@0.5 | T4 Latency | CPU Latency | Parameters |
|---|---|---|---|---|
| L | 90.4% | 13.39 ms | 759.76 ms | 30.94M |
| M | 75.2% | 12.73 ms | 59.82 ms | 5.65M |
| S | 70.9% | 8.11 ms | 14.49 ms | 1.21M |
Ablation: Knowledge distillation (PP-DocLayout-L)
Distillation improves the L variant from 89.3% to 90.4% mAP@0.5 (+1.1 percentage points).
Ablation: Semi-supervised learning (PP-DocLayout-M/S)
| Variant | Baseline | With Semi-Supervised | Improvement |
|---|---|---|---|
| M | 73.8% | 75.2% | +1.4 pp |
| S | 66.2% | 70.9% | +4.7 pp |
Qualitative comparisons
The authors provide side-by-side visualizations comparing PP-DocLayout to DocLayout-YOLO on diverse document types. Claimed improvements include:
- More granular text hierarchy: separate detection of document title, abstract, and paragraph title.
- Improved detection of headers, footers, and page numbers.
- Inline and block formula detection.
- Handwritten content classified as text rather than figure.
- Separation of charts, seals, and natural images into distinct categories.
Hardware / Production
Throughput claims
The authors emphasize large-scale batch processing scenarios, claiming approximately 123 pages per second on T4 GPU when using the PaddleX inference engine.
Deployment context
Inference performance is framed around two use cases:
- Large-scale data construction: Processing document corpora for training data generation.
- Real-time processing: Low-latency layout detection for interactive applications.
CPU measurements are provided for edge deployment scenarios, but primary speed claims focus on T4 GPU performance.
2025-03-smoldocling
SmolDocling — Notes
TL;DR
SmolDocling is a 256M-parameter vision–language model that converts document page images directly into DocTags, a structured XML-style markup encoding content, layout, and element locations in a single sequence. Built on SmolVLM-256M with a document-specific token vocabulary, large synthetic datasets (tables, charts, code, formulas), and a curriculum training recipe, it matches or exceeds much larger VLMs on OCR, layout analysis, and structure extraction while remaining in the same compute budget as classic ensemble systems.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ – introduces SmolDocling model, DocTags output format, and curriculum training strategy for end-to-end document conversion.
Secondary: $\Psi_{\text{Resource}}$ – contributes several new datasets (DocLayNet-PT, SynthDocNet, SynthChartNet, SynthCodeNet, SynthFormulaNet) with instruction-tuned variants.
What is the motivation?
Converting complex PDFs into structured, machine-readable formats remains challenging:
- PDFs prioritize print rendering over semantics or reading order
- Real documents have complex layouts: multi-column, forms, tables, charts, code, equations
- Existing solutions split into two camps:
- Ensembles: pipelines of OCR + layout + table + post-processing; efficient but brittle
- Large VLMs: single multimodal models for Q&A over documents; compute-heavy, prone to hallucinations, focused on Q&A rather than faithful conversion
- Data gap: few open multimodal datasets jointly cover layout, tables, equations, code, and charts at page level with consistent markup
- Target: a small, efficient VLM for full document conversion (content + structure + spatial grounding) with open datasets and standardized markup
What is the novelty?
- SmolDocling architecture on SmolVLM-256M:
- 256M total parameters: SigLIP-base vision encoder ($\sim$93M) + lightweight SmolLM-2 language backbone ($\sim$135M)
- Aggressive pixel shuffle: compresses each 512$\times$512 patch to 64 visual tokens (4096 pixels per token)
- DocTags markup vocabulary:
- XML-style tokens separating text content from structure
- Blocks: `<text>`, `<caption>`, `<footnote>`, `<formula>`, `<title>`, `<section_header>`, `<list_item>`, `<page_header>`, `<page_footer>`, `<picture>`, `<otsl>`, `<code>`, `<document_index>`, plus `<ordered_list>`/`<unordered_list>`
- Spatial grounding: every block embeds its bounding box via `<loc_x1><loc_y1><loc_x2><loc_y2>` on a 0–500 coordinate grid
- Tables: OTSL vocabulary (`<fcel>`, `<ecel>`, `<lcel>`, `<ucel>`, `<xcel>`, `<nl>`, plus header tags `<ched>`, `<rhed>`, `<srow>`)
- Code: `<_programming-language_>` subtype tags preserve indentation and line breaks
- Pictures: `<image_class>` tags with a detailed class taxonomy (charts, molecular structures, logos, signatures, diagrams)
- Unified representation: cropped elements (tables, code, equations, charts) use identical DocTags structure as when embedded in full pages, improving alignment between local and global tasks
- Curriculum training:
- Extend tokenizer with DocTags vocabulary
- Freeze vision encoder, train language model to output DocTags
- Unfreeze vision encoder, jointly train on document pretraining + task-specific datasets
- Final fine-tuning on all datasets including instruction-style examples
- Instruction interface: simple textual prompts such as “Convert this page to Docling,” “Convert chart to table,” “Convert formula to LaTeX,” “OCR the text in a specific location `<loc_...>`” (Table 6, p. 20)
Architecture diagram (Figure 1, p. 2) shows: SigLIP encoder $\rightarrow$ projection and pooling $\rightarrow$ concatenation with text embeddings $\rightarrow$ LLM autoregressively emits DocTags.
What experiments were performed?
General setup:
- Input images standardized to 144 DPI
- Maximum sequence length: 8192 tokens (up to 3 pages per forward pass)
- For text-only metrics, DocTags and HTML outputs converted to Markdown; tables excluded from full-page text evaluation
1) Text recognition (full pages)
- Dataset: DocLayNet test pages (excluding tables)
- Baselines: Qwen2.5-VL 7B, GOT 580M, Nougat base 350M
- Metrics: edit distance, F1, precision, recall, BLEU, METEOR
2) Code listing OCR
- Dataset: SynthCodeNet rendered code snippets
- Metrics: text similarity on plaintext outputs (line breaks, indentation)
- Evaluation only reported for SmolDocling (other models not trained for this task)
3) Equation recognition
- Dataset: Im2LaTeX-230k test formulas (normalized LaTeX)
- Baselines: Qwen2.5-VL, GOT, Nougat
4) Layout analysis
- Dataset: DocLayNet test set (6 classes: Text, Section Heading, List Item, Table, Picture, Formula)
- Metric: mAP[0.5:0.95]
- Baselines: Qwen2.5-VL-7B, human agreement
5) Table structure recognition
- Datasets: FinTabNet, PubTables-1M (cropped table images, $\sim$72 DPI with compression artifacts)
- Baselines: SmolVLM-2.2B, Granite Vision, TableFormer
- Metric: TEDS with and without cell text (structure-only in parentheses)
6) Chart extraction
- Task: convert cropped charts to tables
- Baselines: Phi-3.5-vision, Granite Vision, SmolVLM-2.2B, Molmo-E
- Metric: TEDS between predicted and ground-truth HTML tables
7) Qualitative analyses
- Figure 2 (p. 4): DocTags snippets for tables with captions, code blocks, equations, nested lists, image classes
- Figure 3 (p. 15): treemap of dataset contributions by type and size
- Figures 4–6 (pp. 16–18): visual samples of SynthChartNet, SynthCodeNet, SynthFormulaNet
- Figure 7 (p. 21): molecule recognition comparison (GOT-OCR 2.0, Qwen2.5-VL-72B, SmolDocling)
- Table 7 (pp. 22–24): layout bounding box overlays (SmolDocling vs. Qwen2.5-VL) on six DocLayNet pages
What are the outcomes/limitations?
Outcomes (reported):
- Text OCR (full pages, DocLayNet):
- SmolDocling: edit distance 0.48, F1 0.80, precision 0.89, recall 0.79, BLEU 0.58, METEOR 0.67
- Qwen2.5-VL 7B: edit distance 0.56, F1 0.72, BLEU 0.46, METEOR 0.57
- GOT 580M: edit distance 0.61, F1 0.69
- Nougat 350M: edit distance 0.62, F1 0.66
- SmolDocling strictly better on all metrics despite being significantly smaller
- Code OCR (SynthCodeNet):
- SmolDocling: edit distance 0.11, F1 0.92, precision 0.94, recall 0.91, BLEU 0.87, METEOR 0.89
- No comparison baselines (effectively sets benchmark)
- Equation OCR (Im2LaTeX-230k):
- SmolDocling: edit distance 0.11, F1 0.95, precision 0.96, recall 0.95, BLEU 0.83, METEOR 0.89
- GOT: slightly higher on BLEU/METEOR but very close overall
- Qwen2.5-VL: edit distance 0.22, F1 0.89
- Nougat: edit distance 0.62, F1 0.60
- Layout analysis (DocLayNet, mAP[0.5:0.95]):
- SmolDocling: 0.231 overall
- Qwen2.5-VL-7B: 0.133 overall
- Human agreement baseline: 0.82
- SmolDocling clearly better than Qwen2.5-VL but far below human performance
- Table structure recognition:
- FinTabNet: TEDS 0.52 (with text), 0.81 (structure only)
- PubTables-1M: TEDS 0.65 (with text), 0.88 (structure only)
- TableFormer (52M): 0.89 FinTabNet, 0.84 PubTables-1M (uses external OCR)
- Structure-only scores indicate grid reconstruction works reasonably well; suffers from low resolution of table crops
- Chart extraction:
- SmolDocling: TEDS 0.75
- Granite Vision: 0.95
- Molmo-E: 0.54
- Phi-3.5-vision: 0.40
Limitations:
- Layout grounding: bounding boxes can be imprecise, especially for colorful layouts or terminal output; recall can be low on some document types
- Tag integrity: occasional missing closing tags or partially emitted location tags can break downstream parsing
- Repetition loops: sometimes repeats tokens and fabricates extra cells or text segments late in sequence (typical autoregressive failure)
- Charts and molecules: chart reconstruction quality is dataset-dependent; molecule recognition experiments (Figure 7) show general VLMs do not match specialized models like MolGrapher or DECIMER on structure fidelity
- Long-sequence robustness: layout grounding and consistency need improvement for complex multi-page documents
The results suggest a small VLM with well-designed markup and datasets can match or approach much larger models on several document tasks, but spatial precision and long-sequence handling remain open challenges.
Model
Architecture
- Vision encoder: SigLIP base patch-16/512, $\sim$93M parameters
- Language model: lightweight SmolLM-2 variant, $\sim$135M parameters
- Total parameters: 256M
- Visual pipeline:
- Images tiled into 512$\times$512 patches
- Pixel shuffle compresses each patch to 64 visual tokens
- Special tokens separate sub-images when multiple pages packed
- Pixel-to-token ratio: 4096 pixels per token
- Fusion: projected visual embeddings concatenated with text token embeddings from instruction prompt; autoregressive decoder (SmolLM-2) generates DocTags sequence (Figure 1, p. 2)
- Tokenizer: DocTags tokens added to vocabulary before training; model trained to emit them exactly
Data
Treemap visualization of dataset sizes and types shown in Figure 3 (p. 15).
Pretraining / multi-task document data
- DocLayNet-PT: 1.4M pages from DocFM corpus with weak annotations for layout, table structure, language, topic, figure classification encoded in DocTags
- DocMatix with DocTags conversion: 1.3M documents extended from DocMatix with full document conversion to DocTags plus DocVQA questions
- DocLayNet-PT-Instruct: instruction-tuned version of DocLayNet-PT for document-centric instructions
- The Cauldron: other SmolVLM pretraining datasets retained for general multimodal skills
Task-specific datasets
Layout / conversion:
- DocLayNet v2: 60K pages with human-corrected layout annotations (from 76K total pages)
- WordScape pages with tables: 63K pages with structure-preserving XML ground truth
- SynthDocNet: 250K synthetic pages from Wikipedia with diverse layouts; table lists and text blocks synthesized with precise annotations
Tables:
- PubTables-1M, FinTabNet, WikiTableSet (EN), WordScape tables
- Original structures converted to OTSL interleaved with cell text
- Total: $\sim$1.5M+ tables
Charts (SynthChartNet):
- Source: 90K tables from FinTabNet; each rendered into multiple chart types
- Counts: $\sim$5K line charts, 380K pie charts, 380K bar charts, 77K stacked bar charts
- Renderers: Matplotlib, Seaborn, Pyecharts
- Total: $\sim$2.5M charts (Figure 4, p. 16)
Code (SynthCodeNet):
- 9.3M rendered snippets from permissively licensed code datasets (The Stack, CodeNet)
- Rendered with LaTeX listings and Pygments at 120 DPI
- Random fonts, sizes, colors, line numbers, syntax highlighting
- Covers 56 programming languages (Figure 5, p. 17)
Equations (SynthFormulaNet):
- 5.5M normalized LaTeX formulas combining public datasets plus 4.7M equations from arXiv LaTeX sources
- Rendered at 120 DPI with randomized fonts and optional equation numbers (Figure 6, p. 18)
Instruction data:
- From DocLayNet-PT pages: instructions like “Perform OCR at bbox,” “Identify page element type at bbox,” “Extract all section headers,” “Detect footer elements,” plus full conversion instructions (Table 6, p. 20)
Algorithms / Training
Curriculum
- Extend tokenizer with DocTags vocabulary
- Freeze vision encoder and train remaining network to output DocTags (ensuring language model learns new markup)
- Unfreeze vision encoder and jointly train on:
- Document pretraining data (DocLayNet-PT, DocMatix with DocTags)
- Task-specific datasets (tables, code, equations, charts)
- Final fine-tuning on union of all datasets including instruction-style examples to align with conversion instructions and no-code actions
Optimization
- Optimizer: AdamW
- Learning rate: $2 \times 10^{-4}$ for most of network; $2 \times 10^{-6}$ for vision encoder after unfreezing
- Gradient clipping: 1.0
- Warmup ratio: 0.03
- Training epochs: 4
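A minimal sketch of the two-learning-rate setup above; grouping parameters by a `vision_encoder` name prefix is an assumption about how the encoder is identified in the model:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with 2e-6 for the (unfrozen) vision encoder and 2e-4 for the rest."""
    encoder_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (encoder_params if name.startswith("vision_encoder") else other_params).append(p)
    return torch.optim.AdamW([
        {"params": encoder_params, "lr": 2e-6},
        {"params": other_params, "lr": 2e-4},
    ])

# Gradient clipping at 1.0 each step, as reported:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```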
Sequence handling
- Maximum sequence length: 8192 tokens (up to 3 pages concatenated)
- Pages may be interleaved with text prompts or instructions
The paper does not present detailed ablations on curriculum stages or DocTags design; evidence is indirect via task metrics.
Evaluation
Text recognition (DocLayNet)
SmolDocling outperforms Nougat, GOT, and Qwen2.5-VL across all similarity scores; improvements especially strong in F1 and BLEU. Tables excluded from these metrics (table OCR quality evaluated separately via TEDS).
Code and equation OCR
- Code: strong scores on SynthCodeNet; authors treat this as defining a new benchmark
- Equations: SmolDocling and GOT roughly tied; both significantly above Nougat; SmolDocling slightly ahead of Qwen2.5-VL
Layout
- SmolDocling achieves higher mAP than Qwen2.5-VL in all six classes
- Human mAP far higher; both models struggle with exact box alignment and class confusion
- Example visualizations (Table 7, pp. 22–24):
- Multi-column pages handled reasonably, with occasional missed paragraphs or misclassified text
- Terminal-style manual pages and colorful reports are harder (low recall for Qwen2.5-VL, misaligned boxes for SmolDocling)
Tables
TEDS scores indicate SmolDocling reconstructs table structure well; text transcription for low-resolution crops weaker. Structure-only scores (0.81, 0.88) suggest downstream systems can recover accurate HTML or spreadsheets once text is OCR’ed.
Charts
SmolDocling sits in the middle of the pack on chart-to-table conversion; the authors attribute this to inconsistencies in how different datasets represent chart semantics.
Molecule recognition
Preliminary experiment only: SmolDocling, GOT-OCR 2.0, and Qwen2.5-VL do not reconstruct molecules as accurately as domain-specific models like MolGrapher or DECIMER. Authors suggest better SMILES tokenization and explicit atom/bond encoding might be necessary for competitive performance.
Hardware / Production
Training
- Setup: 64 $\times$ NVIDIA A100 80GB GPUs
- Duration: $\sim$38 hours per epoch
- Total epochs: 4
Inference
- Implementation: vLLM
- Throughput: 0.35 seconds per page on single A100
- VRAM usage: 0.489 GB for inference
The compute footprint is similar to strong ensemble systems rather than large VLMs, while supporting multimodal instructions and unified DocTags output.
2025-04-vista-ocr
VISTA-OCR — Notes
TL;DR
VISTA-OCR is a lightweight encoder-decoder OCR system that generates both text and spatial coordinates in a single sequence using one Transformer decoder, instead of separate detection and recognition branches. The authors introduce prompt-controlled tasks (region-based OCR and content-based text localization) during pre-training to enable interactive OCR behaviors without scaling to LVLM size. The model is described as having approximately 150M parameters.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new unified generative OCR formulation with spatial tokenization, progressive multitask training, and prompt-controlled OCR tasks).
Secondary: $\Psi_{\text{Resource}}$ (dataset enrichment with line-level boxes and synthetic generation); $\Psi_{\text{Evaluation}}$ (broad benchmarking across printed and handwritten documents, plus prompt-task evaluation).
What is the motivation?
- Error propagation: Two-stage OCR pipelines (detect then recognize) suffer from cascading errors and often require domain-specific retraining.
- Weak spatial positioning: Many encoder-decoder OCR-free models generate text but struggle with explicit spatial positioning, which is required for document understanding and structured extraction.
- LVLM cost/scale: LVLMs can perform richer OCR behaviors (e.g., region-based reading) but are typically too large and costly for constrained deployment. VISTA aims to provide similar flexibility at smaller scale (authors cite 150M parameters for their omni model).
What is the novelty?
- Single-branch generation of text + layout: The decoder generates tuples $(t_i, l_i)$ in reading order, where $l_i$ encodes bounding box coordinates. This avoids a separate detection head.
- Spatial tokenization: Coordinates are quantized onto a grid, represented by dedicated vocabulary tokens (with separate X-axis and Y-axis token sets).
- Prompt-controlled OCR tasks during pre-training:
- OCR only
- OCR with layout
- Region-based OCR (prompt includes a box)
- Content-based localization (prompt includes text to “find”)
- Progressive training recipe: Calibrate encoder-to-decoder attention by freezing the decoder first, then expand to multimodal (text + layout) and multitask prompts.
What experiments were performed?
- Standard OCR: Text recognition (printed + handwritten) and text detection with line-level boxes across SROIE 2019, IAM, RIMES 2009, MAURDOR.
- Prompt tasks: Region-based OCR and content-based text localization on MAURDOR (C3/C4) and on a set of 458 PDFA-derived images.
- Encoding ablations: Compare 3 spatial encoding schemes (original interleaving, segmented tokens, unified coordinate token set) on SROIE.
What are the outcomes and limitations?
Outcomes
- SROIE 2019: Fine-tuned VISTA reports strong text recognition (word-level F1 around 94). Detection F1 improves substantially if predicted boxes are padded by 1-2 pixels, suggesting a systematic tight-box bias.
- IAM and RIMES 2009: Fine-tuned model reports competitive CER/WER while also performing detection (Area F1 reported).
- Prompt tasks: High AP on PDFA for both region-based OCR and content-based localization, with lower performance on heterogeneous MAURDOR.
Limitations (authors)
- Quantization error in box coordinates (they mention a 10-pixel quantizer; suggest smaller steps like 3 pixels may help).
- Box representation is only 4 coords, so it cannot faithfully represent slanted or curved lines.
- Annotation heterogeneity (page/paragraph/line) and reading-order mismatch across datasets complicate training and evaluation. The authors call for more consistent large-scale datasets.
Model
Formulation and output sequence
- OCR is framed as $M(I) = {T, L}$ mapping an image $I$ to text tokens and location tokens.
- Generation is over elements $e_i = (t_i, l_i)$, where $l_i = (x_1, y_1, x_2, y_2)$ (top-left and bottom-right of a line box).
- Token layout: Spatial tokens are interleaved with text in reading order (Figure 1 shows line transcription delimited by spatial tokens and a separator).
Spatial tokenization
- Coordinates are quantized onto a grid; each grid position corresponds to a unique vocabulary class.
- They use two distinct location-token sets: one for X-axis positions and one for Y-axis positions.
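A minimal sketch of quantizing a line box into location tokens; the 10-pixel step follows the quantizer the authors mention, and the token naming is illustrative rather than the paper's actual vocabulary:

```python
def box_to_tokens(x1: int, y1: int, x2: int, y2: int, step: int = 10) -> list[str]:
    """Map pixel coordinates to dedicated X/Y vocabulary tokens on a quantized grid."""
    def qx(v: int) -> str: return f"<x_{v // step}>"   # X-axis token set
    def qy(v: int) -> str: return f"<y_{v // step}>"   # Y-axis token set
    return [qx(x1), qy(y1), qx(x2), qy(y2)]

# e.g. box_to_tokens(123, 48, 512, 67) -> ['<x_12>', '<y_4>', '<x_51>', '<y_6>']
```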
Architecture
- Encoder-decoder: Lightweight CNN encoder + Transformer decoder (mBART-based).
- Absolute positional encoding is applied to encoder features; decoder uses cross-attention over these features (Figure 2).
- Decoder initialization: weights initialized from Donut (transfer learning for faster convergence and better results, per authors).
- Vocabulary: reduced size plus added spatial tokens and task prompt tokens (to reduce decoder head compute).
- Model scale: VISTAomni is described as an “OmniOCR system” with 150M parameters.
Objective
They define a combined loss: $$ L_{\text{total}} = \lambda L_{\text{text}} + (1 - \lambda) L_{\text{loc}} $$ with both terms as cross-entropy over predicted token sequences (text and location).
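A minimal sketch of this objective in a simplified form that treats text-token and location-token positions as separate tensors; the value of $\lambda$ is not reported in the paper, so 0.5 below is an assumption:

```python
import torch.nn.functional as F

def vista_loss(text_logits, text_targets, loc_logits, loc_targets, lam: float = 0.5):
    """Cross-entropy over text tokens and location tokens, mixed by lambda."""
    # logits: (batch, seq_len, vocab); targets: (batch, seq_len) of token ids
    l_text = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    l_loc = F.cross_entropy(loc_logits.flatten(0, 1), loc_targets.flatten())
    return lam * l_text + (1.0 - lam) * l_loc
```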
Data
Pre-training data sources
- Real documents: Subsets of PDFA and IDL. PDFs converted to images at 200 DPI; OCR (PaddleOCR) used to obtain text lines and boxes for PDFA; for IDL they use multiple OCR systems to handle handwriting. Non-Latin, empty, flipped, etc. are filtered/corrected.
- Synthetic: IAM-synth, RIMES-synth, SROIE-synth generated with custom generators; SynthDOG is also used for English/French documents with box annotations.
Dataset sizes
Table 1 summarizes counts and types:
- Synthetic: IAM (30K), RIMES (30K), SynthDog (40K), SROIE (20K)
- Real: IAM (747 train / 336 val / 116 test), RIMES (1050/100/100), SROIE (626 train / 361 test), IDL & PDFA (170K pretrain / 458 eval), MAURDOR (1727/259/280)
Annotation enrichment
- Several eval datasets needed line-level location annotations. Authors describe using segmentation/detection modules for pre-annotations, matching to labels, and manual correction (rectangles only).
- MAURDOR has mixed paragraph/line/page annotations; they use paragraph boxes to crop then line-segment, then match to ground truth, with manual correction on test only and high-quality matches for train.
Algorithms / Training
Training stages (progressive)
- Encoder-decoder calibration: Train for text recognition on IDL/PDFA with decoder frozen, then train all parameters for the same task.
- Multimodal text + layout training: Add location tokens so model predicts text and box coordinates jointly.
- Progressive multitask prompts: Introduce region-based OCR and content-based localization, driven by prompt tokens.
Optimization and batch sizes
- VISTAomni pre-training: batch size 1 on an A100 RTX NVIDIA GPU with 80GB; “Adam weighted” optimizer with LR scheduler (wording as in the paper).
- Fine-tuning batch sizes: 4 (RIMES 2009), 6 (IAM), 2 (SROIE 2019).
Evaluation
Metrics
- SROIE 2019 TR: Word-level precision/recall/F1 (exact match) per official protocol; TD: DetEval (precision/recall/F1 by overlap).
- IAM/RIMES TR: CER and WER; TD: DetEval Area F1.
- Prompt tasks:
- Region-based OCR: CER/WER plus AP (to penalize false positives).
- Content-based localization: AP at IoU thresholds (AP50, AP60, AP70, AP80).
Key quantitative results
SROIE 2019 (Table 2):
- VISTAft TR: Precision 94.15, Recall 93.75, F1 93.95; VISTAomni TR F1 89.93.
- VISTAft TD: raw F1 83.13; with +1px padding F1 90.28; with +2px padding F1 94.16. (Authors attribute this to tighter boxes in training and quantization effects.)
IAM and RIMES 2009 (Table 3):
- IAM VISTAft: CER 4.46, WER 10.14, Area F1 98.12; VISTAomni: CER 6.58, WER 14.41, Area F1 93.52.
- RIMES VISTAft: CER 4.72, WER 9.92, Area F1 90.48; VISTAomni: CER 7.16, WER 16.99, Area F1 87.03.
MAURDOR (Tables 4a/4b):
- Combined C3 & C4: VISTAft CER 8.51, WER 14.33, Area F1 87.02 vs DAN CER 11.59, WER 27.68 (DAN has no Area F1 reported in the table).
Prompt tasks (Table 5) on VISTAomni:
- Region-based OCR: MAURDOR CER 13.74 / WER 22.32 / AP 84.83; PDFA CER 1.87 / WER 8.77 / AP 94.14.
- Content-based localization: MAURDOR AP50 91.88 (then AP drops with stricter IoU); PDFA AP50 97.01 (similarly decreasing at higher IoU thresholds).
- Query length effect (Table 6, PDFA): AP generally improves as the number of words in the query increases.
Encoding scheme ablation (Table 7)
- Original interleaving outperforms segmented tokens for detection and is competitive for recognition.
- Using separate X/Y token sets appears beneficial for detection vs. a unified loc-token set.
Hardware / Production
- Only a single concrete training hardware config is specified: pre-training VISTAomni with batch size 1 on an A100 80GB GPU.
- No inference throughput, latency, or memory benchmarks are reported in the preprint (beyond parameter-count positioning vs. LVLMs).
2025-05-dolphin
Dolphin — Notes
TL;DR
Dolphin is a 322M-parameter document image parsing VLM that uses an analyze-then-parse pipeline: Stage 1 predicts a reading-order layout element sequence (type + bounding box), Stage 2 parses each element in parallel using type-specific prompts. Trained on 30.27M multi-granularity samples (page-level + element-level), heavily weighted toward formula elements, it reports normalized Edit Distance 0.0114 on Fox-Page-EN, 0.0131 on Fox-Page-ZH, and 0.1028 on the introduced Dolphin-Page benchmark, running at 0.1729 FPS with parallel element parsing (vs. 0.0971 FPS sequentially).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (introduces two-stage analyze-then-parse design with reading-order layout anchors and parallel type-specific element recognition).
Secondary: $\Psi_{\text{Resource}}$ (30.27M training set across multiple granularities; introduces Dolphin-Page and Dolphin-Block benchmarks).
Also present: $\Psi_{\text{Evaluation}}$ (broad comparisons across multiple benchmarks, though no new metric proposed).
Rough superposition: 0.65 Method + 0.20 Resource + 0.15 Evaluation.
What is the motivation?
End-to-end autoregressive limitations: Full-page generation methods face quadratic attention costs and can degrade layout structure on long or complex documents containing mixed elements (text, tables, formulas, figures).
Pipeline coordination overhead: Multi-stage toolchains require careful integration and coordination across separate specialized models, introducing complexity and potential error propagation.
Goal: Achieve robust structured extraction from document images with mixed elements, producing machine-readable markup that maintains layout structure while enabling efficient parallel processing.
What is the novelty?
Two-stage analyze-then-parse pipeline:
- Stage 1 (Analyze): Generate a sequence of layout elements in natural reading order, each with type and bounding box. These elements become anchors for the recognition stage. Prompted with: “Parse the reading order of this document.”
- Stage 2 (Parse): Crop each predicted element region and parse all elements in parallel, guided by type-specific prompts (e.g., “Parse the table in the image” for table regions, outputting HTML structure).
Contrast to end-to-end models: Dolphin avoids full-page autoregressive generation by decomposing into layout planning + parallel element recognition, reducing long-context burden and enabling batched processing of detected regions.
Contrast to pipeline tools: Unlike tools that route different element types to completely separate specialized models, Dolphin uses a unified model for all element types, differentiating behavior via type-specific prompts rather than separate model architectures.
Multi-granularity training: Combines page-level parsing data with element-level recognition data (cropped blocks). This allows the model to learn both layout structure and fine-grained content recognition, with heavy weighting toward formula elements (23M of 30.27M samples).
Element-centric data strategy: Stage 1 training avoids treating formulas as independent layout elements to preserve broader context for formula recognition, while Stage 2 includes extensive formula-specific training data.
What experiments were performed?
Page-level parsing benchmarks:
- Fox-Page (English + Chinese): text-heavy document pages
- Dolphin-Page: “complex document” benchmark introduced by the authors
Element-level recognition benchmarks:
- Text blocks: FoxBlock-EN, FoxBlock-ZH, DolphinBlock
- Formulas: SPE (Simple Printed Expression), SCE (Simple Chinese Expression), CPE (Complex Printed Expression)
- Tables: PubTabNet (7,904 test samples), PubTab1M (10,000 test samples)
Baselines: Comparisons include Nougat, GOT-OCR2.0, StructEqTable, Vary, OFA, and Donut across different element types.
Ablations: Parallel vs. sequential element decoding (throughput comparison).
What are the outcomes/limitations?
Outcomes (reported):
Page-level Edit Distance (normalized):
- Fox-Page-EN: 0.0114
- Fox-Page-ZH: 0.0131
- Dolphin-Page: 0.1028 (Table 1; note that the Section 5.2 text claims 0.1283, inconsistent with the table)
- Average across benchmarks: 0.0575
- Throughput: 0.1729 FPS
Element-level performance:
- Paragraph ED: 0.0029 (FoxBlock-EN), 0.0121 (FoxBlock-ZH), 0.0136 (DolphinBlock)
- Formula CDM: 0.9850 (SPE), 0.9685 (SCE), 0.8739 (CPE)
- Table TEDS: 0.9515 (PubTabNet), 0.9625 (PubTab1M)
Parallel decoding speedup: 1.8$\times$ (0.1729 vs. 0.0971 FPS) with similar accuracy.
Limitations:
- Vertical text support: Limited capability for vertical text layouts common in some Asian language documents.
- Multilingual capacity: Beyond Chinese and English, additional languages require further development.
- Handwriting recognition: Current version does not handle handwritten text effectively.
- Parallel processing bounds: Despite parallel element parsing, speedup is constrained by:
- Preprocessing overhead (detection, cropping, batching)
- Max 16 elements per batch due to GPU memory limits, requiring multiple passes on element-dense pages
- This limits practical throughput gains compared to theoretical maximum from full parallelization
Open questions:
- Minor inconsistency between Table 1 (Dolphin-Page ED: 0.1028) and Section 5.2 text (claims 0.1283) raises questions about which result is authoritative.
- No discussion of how reading-order prediction quality affects downstream parsing accuracy.
Model
Architecture: Encoder-decoder transformer following Donut design.
Encoder:
- Swin Transformer backbone
- Window size: 7
- Layer configuration: [2, 2, 14, 2] (4 stages)
- Attention heads: [4, 8, 16, 32]
Decoder:
- mBART architecture
- 10 layers
- Hidden size: 1024
Initialization: Donut pretrained weights.
Total parameters: 322M.
Data
Training set: 30.27M samples total, combining page-level and element-level data.
| Data Type | Samples | Notes |
|---|---|---|
| Mixed Documents | 0.12M | Page-level parsing |
| HTML | 4.37M | Page-level parsing |
| LaTeX | 0.5M | Page-level parsing |
| Markdown | 0.71M | Page-level parsing |
| Table elements | 1.57M | Element-level cropped regions |
| Formula elements | 23M | Element-level cropped regions (76% of dataset) |
Data strategy: Heavy weighting toward formula elements reflects the complexity and importance of mathematical content recognition in scientific documents.
Layout representation: Stage 1 avoids treating formulas as independent layout elements to maintain broader context during formula recognition, while Stage 2 uses extensive formula-specific training for detailed parsing.
Algorithms / Training
Training objective: Standard cross-entropy token prediction loss.
Dynamic instruction-tuning: For each training sample, randomly select one applicable task from 5 task types based on available annotations to form question-answer pairs. This prevents the model from memorizing fixed prompt-response patterns.
Task prompts (Table 6):
- Page-level Layout Analysis: “Parse the reading order of this document.”
- Text Paragraph/Formula Parsing: “Read text in the image.”
- Table Parsing: “Parse the table in the image.”
- Text Spotting: “Detect and recognize all the text lines in the image.”
- Text Box Query: “Read the text in the image within the specified box [x1,y1,x2,y2].”
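A minimal sketch of this dynamic instruction-tuning step, assuming each training sample carries a dict of available annotations keyed by task (the data format is an assumption):

```python
import random

TASK_PROMPTS = {
    "layout": "Parse the reading order of this document.",
    "text": "Read text in the image.",
    "table": "Parse the table in the image.",
    "spotting": "Detect and recognize all the text lines in the image.",
    "box_query": "Read the text in the image within the specified box [x1,y1,x2,y2].",
}

def build_training_pair(sample):
    """Randomly pick one task that the sample's annotations support and build a
    prompt/answer pair, so the model does not memorize fixed prompt-response patterns."""
    applicable = [t for t in TASK_PROMPTS if t in sample["annotations"]]
    task = random.choice(applicable)
    return {"prompt": TASK_PROMPTS[task], "answer": sample["annotations"][task]}
```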
Training configuration:
- Hardware: 40 A100 GPUs
- Epochs: 2
- Per-device batch size: 16 (via gradient accumulation)
- Optimizer: AdamW
- Learning rate: 5e-5 with cosine decay schedule
Evaluation
Metrics:
- Edit Distance (ED): Normalized character-level edit distance for text and page-level comparisons
- CDM (Character Detection Matching): character-level matching score for formula evaluation
- TEDS (Tree Edit Distance-based Similarity): Structural similarity over HTML table representations
Formula benchmarks:
- SPE (Simple Printed Expression)
- SCE (Simple Chinese Expression)
- CPE (Complex Printed Expression)
Table benchmarks:
- PubTabNet: 7,904 test samples
- PubTab1M: 10,000 test samples
Page-level benchmarks:
- Fox-Page-EN: English text-heavy documents
- Fox-Page-ZH: Chinese text-heavy documents
- Dolphin-Page: Complex multi-element documents (introduced by authors)
Element-level benchmarks:
- FoxBlock-EN/ZH: English and Chinese text paragraphs
- DolphinBlock: Mixed element types
Hardware / Production
Input preprocessing:
- Preserve aspect ratio
- Resize longer edge to 896 pixels
- Pad to 896 $\times$ 896 square
- Normalize bounding box coordinates within padded space
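A minimal sketch of this preprocessing, assuming top-left padding with a white background and normalization of box coordinates to [0, 1] within the padded canvas (padding placement, fill color, and the exact normalization range are assumptions):

```python
from PIL import Image

TARGET = 896  # longer edge / padded square size

def preprocess(image: Image.Image, boxes):
    """Resize the longer edge to 896 px, pad to a 896x896 square, and
    normalize (x1, y1, x2, y2) boxes within the padded canvas."""
    w, h = image.size
    scale = TARGET / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    resized = image.resize((new_w, new_h))
    canvas = Image.new("RGB", (TARGET, TARGET), (255, 255, 255))  # fill color assumed
    canvas.paste(resized, (0, 0))                                  # top-left padding assumed
    norm_boxes = [
        (x1 * scale / TARGET, y1 * scale / TARGET,
         x2 * scale / TARGET, y2 * scale / TARGET)
        for x1, y1, x2, y2 in boxes
    ]
    return canvas, norm_boxes
```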
Inference pipeline:
- Stage 1: Full-page forward pass to predict layout elements (type + bbox)
- Crop detected elements from original image
- Stage 2: Batch element crops (max 16 per batch) for parallel recognition
- Reassemble elements according to predicted reading order
Throughput analysis:
- Sequential decoding: 0.0971 FPS
- Parallel decoding: 0.1729 FPS (1.8$\times$ speedup)
- Speedup limited by preprocessing overhead and batch size constraints
- For pages with $>16$ elements, multiple batches required
Memory constraints: Max 16 elements per batch on typical GPU configurations; element-dense pages require multiple forward passes through Stage 2.
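A minimal sketch of this two-stage inference loop with the 16-element batch cap; `stage1`, `stage2`, and `type_prompt` are hypothetical callables standing in for the layout pass, the batched recognition pass, and the type-specific prompt lookup:

```python
MAX_BATCH = 16  # reported per-batch element limit

def parse_page(image, stage1, stage2, type_prompt):
    """Stage 1 predicts layout elements in reading order; Stage 2 recognizes
    element crops in batches of at most 16; results are joined in that order."""
    elements = stage1(image)                       # list of (type, bbox) in reading order
    crops = [image.crop(bbox) for _, bbox in elements]
    outputs = []
    for i in range(0, len(crops), MAX_BATCH):
        batch = crops[i:i + MAX_BATCH]
        prompts = [type_prompt(t) for t, _ in elements[i:i + MAX_BATCH]]
        outputs.extend(stage2(batch, prompts))     # parallel recognition of the batch
    return "\n".join(outputs)                      # already in predicted reading order
```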
2025-06-infinity-parser
Infinity-Parser — Notes
TL;DR
Wang et al. introduce LayoutRL, a reinforcement learning framework that trains a VLM to output full-page structured Markdown from scanned documents using layout-aware, verifiable rewards (edit distance, paragraph count, and reading order). Built on top of Qwen2.5-VL-7B and trained with the new Infinity-Doc-400K corpus (roughly 400k documents with paired images and Markdown), Infinity-Parser-7B reports competitive results on OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet relative to both pipeline OCR systems and general VLMs.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (introduces LayoutRL, a reinforcement learning framework with multi-aspect layout-aware rewards for end-to-end document parsing).
Secondary: $\Psi_{\text{Resource}}$ (releases Infinity-Doc-400K, a large document corpus with paired images and Markdown annotations, plus synthetic data generation pipeline); $\Psi_{\text{Evaluation}}$ (extensive comparisons on four external benchmarks with ablations on reward design and training regime).
What is the motivation?
- Problem: Scanned document parsing requires hierarchical structure recovery (paragraphs, headers, tables, formulas, reading order), not just text recognition. Traditional multi-stage pipelines (layout detection, OCR, table/formula modules) suffer from error propagation and limited adaptability to layout variation.
- Limitations of supervised fine-tuning: End-to-end VLM parsers trained via SFT tend to overfit surface patterns and provide token-level supervision only, which does not directly reward correct page-level structure or reading order. Generalization to out-of-distribution layouts remains weak, and high-quality layout annotations are expensive.
- Gap in RL approaches: RL is promising for outcome-based training of LLMs and VLMs, but prior RL work mainly uses coarse binary success rewards that are not layout-aware and thus poorly suited to complex document parsing.
- Goal: Design a layout-aware RL framework with verifiable multi-aspect rewards and supply a sufficiently large, reasonably clean dataset to train a robust VLM that generalizes across document domains, languages (English and Chinese), and structural complexities.
What is the novelty?
LayoutRL: multi-aspect, verifiable reward for document parsing
The model outputs a full-page Markdown representation, which is scored with three rule-based rewards:
- Edit distance reward ($R_{\text{dist}}$): Normalized Levenshtein distance between prediction and reference text, formulated as $R_{\text{dist}} = 1 - D(y, \hat{y}) / \max(N, M)$, where $N$ and $M$ are lengths of reference and prediction.
- Count reward ($R_{\text{count}}$): Penalty for mismatch in number of predicted versus reference paragraphs. With $N_Y$ and $N_{\hat{Y}}$ representing numbers of paragraphs in reference and prediction, $R_{\text{count}} = 1 - |N_Y - N_{\hat{Y}}| / N_Y$.
- Order reward ($R_{\text{order}}$): Penalty for inversions in paragraph reading order after optimal matching. After matching predicted and ground truth paragraphs using the Hungarian algorithm, count pairwise inversions ($D_{\text{order}}$), then $R_{\text{order}} = 1 - D_{\text{order}} / \text{maxinv}$, where $\text{maxinv} = N_Y (N_Y - 1) / 2$.
Segments are matched using the Hungarian algorithm before computing rewards, so content and structure are aligned at the paragraph level. The combined reward is the simple sum: $R_{\text{multi}} = R_{\text{dist}} + R_{\text{count}} + R_{\text{order}}$.
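A minimal sketch of these three rewards, assuming reference and predicted paragraphs are available as string lists and using an edit-distance cost matrix for the Hungarian matching (the matching cost is not specified above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian matching

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (kept inline so the sketch is self-contained)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def layout_rewards(ref_paras, pred_paras):
    """Compute (R_dist, R_count, R_order) as described above; R_multi is their sum."""
    if not ref_paras or not pred_paras:            # degenerate pages: skipped in this sketch
        return 0.0, 0.0, 0.0

    ref_text, pred_text = "\n".join(ref_paras), "\n".join(pred_paras)
    r_dist = 1 - edit_distance(ref_text, pred_text) / max(len(ref_text), len(pred_text))

    n_ref, n_pred = len(ref_paras), len(pred_paras)
    r_count = 1 - abs(n_ref - n_pred) / n_ref

    # Match paragraphs with the Hungarian algorithm, then count pairwise inversions
    # of the matched prediction indices relative to reference order.
    cost = np.array([[edit_distance(r, p) for p in pred_paras] for r in ref_paras])
    rows, cols = linear_sum_assignment(cost)
    order = [c for _, c in sorted(zip(rows, cols))]
    inversions = sum(order[i] > order[j]
                     for i in range(len(order)) for j in range(i + 1, len(order)))
    max_inv = n_ref * (n_ref - 1) / 2 or 1
    r_order = 1 - inversions / max_inv
    return r_dist, r_count, r_order

# R_multi is the plain sum of the three terms:
# r_multi = sum(layout_rewards(reference_paragraphs, predicted_paragraphs))
```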
Infinity-Doc-400K dataset and dual-pipeline construction
The training corpus contains roughly 400k annotated document pages with paired images and Markdown, constructed via two pipelines:
- Real-world pipeline (about 331k documents): Scanned documents from financial reports, medical reports, academic papers, books, magazines, and web pages, pseudo-labeled via multiple expert models (layout, formula, table, OCR) plus VLM cross-checks. Only high-consensus regions are kept.
- Synthetic pipeline (about 69k documents): HTML templates (single, double, triple column) filled with text and images from Wikipedia, CC3M, and web corpora via Jinja, rendered in a browser to images, with Markdown extracted directly from HTML.
Three document analysis experts manually inspected about 5% of the data and used feedback to refine screening rules across at least five iterations.
Infinity-Parser-7B: RL-finetuned Qwen2.5-VL-7B
The parser is a standard VLM (Qwen2.5-VL-7B) trained to map full pages to Markdown without explicit intermediate layout detection. RL is applied directly on outputs using Group Relative Policy Optimization (GRPO), learning from relative rewards across multiple sampled completions per page.
Systematic comparison of SFT versus layout-aware RL
Ablations compare supervised SFT, RL with different reward subsets, and SFT followed by RL, studying both in-distribution and out-of-distribution behavior on OmniDocBench document types.
What experiments were performed?
External benchmarks
The model is evaluated on four external datasets:
- OmniDocBench: Comprehensive evaluation across multiple document types with metrics for overall normalized edit distance, text, formulas, tables (TEDS and edit), and reading order. Comparisons against pipeline tools (MinerU, Marker, Mathpix, Docling, Pix2Text, Unstructured, OpenParse), OCR-focused VLMs (GOT-OCR, Nougat, Mistral OCR, olmOCR), and general VLMs (GPT-4o, Qwen2-VL-72B, Qwen2.5-VL-7B, InternVL2/3, SmolDocling).
- olmOCR-Bench: Fact-based evaluation of document-level OCR across domains such as ArXiv, old scans, math tables, multi-column layouts, long tiny text, and “base” PDFs. Compares to pipeline systems and commercial APIs including GPT-4o, Gemini, Mistral OCR, Qwen2-VL variants, and olmOCR with “anchored” and “non-anchored” prompts.
- PubTabNet and FinTabNet: Table recognition benchmarks focusing on structure and content, using TEDS and TEDS-S metrics. Comparisons to EDD, OmniParser, InternVL3 (8B and 78B), Qwen2.5-VL (7B and 72B), and GPT-4o.
Ablations and analyses
- Reward ablation: Compare zero-shot Qwen2.5-VL-7B, SFT on 43k documents, and RL with different reward subsets (edit-only, edit plus count, all three rewards, RL after SFT). Metrics include overall edit distance for English and Chinese pages, plus averaged category-level text edit distance across nine page types.
- Training stability and scaling: Plot performance versus training data size for OmniDocBench subtasks (text, formulas, tables, reading order). RL curves are reported to be smoother and improve more steadily than SFT as data size increases.
- Robustness across document types: Compare base model, SFT, and RL on different categories (old scans, tables, multi-column) on both olmOCR and OmniDocBench. RL is reported to consistently give higher similarity scores.
- In-distribution versus OOD generalization: Evaluate models on in-distribution domains (magazines, research reports) and OOD domains (colorful textbooks and slides excluded from training) as training progresses. RL is reported to continue improving page-level scores in both settings, while SFT tends to plateau and degrade more in OOD cases.
- Case studies: Qualitative comparisons on individual pages from academic papers, books, exams, magazines, newspapers, and slides, showing that Infinity-Parser reduces redundant recognition and formatting errors relative to MinerU and GPT-4o.
What are the outcomes and limitations?
Outcomes
OmniDocBench overall:
- Infinity-Parser-7B achieves overall edit distance of about 0.141 (EN) and 0.197 (ZH), lower than all listed baselines. Pipeline tools like MinerU and Mathpix are competitive on some subtasks but worse overall, particularly on Chinese pages.
OmniDocBench subtasks and categories:
- On text, formula, table TEDS, and reading order, Infinity-Parser generally gives the best or near-best scores among compared systems for both English and Chinese.
- On nine page types (books, slides, financial reports, textbooks, exams, magazines, academic papers, notes, newspapers), the mean text edit distance is reported at 0.104, noticeably lower than Qwen2-VL-72B (about 0.179) and pipeline tools.
olmOCR-Bench:
- Infinity-Parser-7B has the highest reported overall score (82.5), ahead of the anchored olmOCR model (77.4) and GPT-4o variants. Gains are especially strong in multi-column and old scan categories, while performance on the “base” category is also high.
Table benchmarks:
- On PubTabNet, Infinity-Parser reaches TEDS-S $\approx$ 93.5 and TEDS $\approx$ 91.8, slightly above EDD and OmniParser.
- On FinTabNet, it reaches TEDS-S $\approx$ 97.2 and TEDS $\approx$ 95.9, surpassing OmniParser and InternVL3 variants.
Effect of multi-aspect rewards:
- Relative to SFT on 43k documents, RL with only edit reward reduces overall edit distances, and adding count and order rewards further improves both page-level and category-level scores.
Limitations and open questions
Data usage versus dataset size:
- Infinity-Doc-400K contains about 400k documents, but RL training uses a 43k subset due to compute constraints. It is unclear how much headroom remains if RL used the full dataset or a different sampling scheme.
Label noise in real-world data:
- The real-world portion (about 331k documents) uses pseudo labels from multiple expert models. Cross-validation and expert spot checks reduce noise, but residual systematic errors or biases are not quantified.
Language scope:
- Experiments cover English and Chinese, and some mixed-language tables. Other scripts and languages are not studied, so generalization to low-resource languages is unknown.
Single backbone size:
- All experiments use Qwen2.5-VL-7B. There is no scaling study across model sizes, nor comparison of how much improvement comes from the RL recipe versus simply using a larger VLM.
Reward design choices:
- The three rewards are summed with equal weights, and there is no systematic exploration of alternative weighting or additional layout signals. There is also no discussion of possible reward hacking (e.g., degenerate outputs that optimize edit distance while being hard to use downstream).
Deployment aspects:
- The paper does not discuss inference latency, throughput, memory footprint, or production monitoring. There is no operational validation in real workflows, so all results are benchmark-based rather than field-based.
Contrast to olmOCR 2: Both use RL for document parsing, but Infinity-Parser employs multi-aspect layout rewards (edit distance, paragraph count, reading order) optimized via GRPO, whereas olmOCR 2 uses binary unit tests as verifiable rewards. Infinity-Parser explicitly maintains general VLM capability, while olmOCR 2 focuses on document-specific optimization.
Contrast to dots.ocr: dots.ocr treats document parsing as unified autoregressive sequence generation (layout detection, text recognition, and reading order in one pass), whereas Infinity-Parser uses RL with structured multi-aspect rewards on top of a base VLM, explicitly optimizing for layout awareness.
Contrast to MonkeyOCR: MonkeyOCR decomposes parsing into SRR (Structure-Recognition-Relation) with separate YOLO detector, unified LMM, and reading-order model. Infinity-Parser is end-to-end VLM trained with RL, avoiding the pipeline architecture.
Model
Backbone
- Base model: Qwen2.5-VL-7B, a vision-language model with high-resolution perception and a 7B-parameter language decoder. The paper treats it as a black-box backbone rather than detailing its internals.
Task formulation
- Input: Image(s) of scanned document pages.
- Output: Markdown string encoding document structure (headings, paragraphs, lists, tables, formulas).
- The model is trained without explicit thinking or intermediate layout reasoning; the output sequence is directly scored.
Prompting
Document parsing uses a detailed Markdown conversion prompt:
- Recognize all text.
- Output Markdown.
- Convert formulas to LaTeX (`$...$` inline and `$$...$$` block).
- Convert tables to Markdown table syntax.
- Ignore figures and images.
- Maintain original document structure with clear line breaks.
Table-specific tasks use a set of paraphrased prompts that all request encoding the table as HTML, matching evaluation format on table benchmarks.
Data
Infinity-Doc-400K
Scale and composition:
- About 400k documents in total. Real-world documents: 331k across six domains. Synthetic documents: 69k.
- Real-world domains and sizes: Financial reports (58.0k), Medical reports (5.0k), Academic papers (71.7k), Books (11.3k), Magazines (180.0k), Web pages (5.0k).
- Synthetic documents: 69.0k generated via HTML templates plus content from CC3M, web, and Wikipedia.
Real-world pipeline:
- Data collection: scanned documents from six domains.
- Filtering: low-quality images removed, duplicates filtered.
- Layout analysis: layout model identifies regions such as title, text, table, formula, figure.
- Expert models: layout model for block segmentation, table recognition model, formula recognition model, OCR for text content.
- Cross validation: predictions across models and an end-to-end VLM are compared. Only regions with consistent outputs are kept as pseudo-ground-truth.
- Annotation: resulting images paired with Markdown, giving document-level annotations for OCR, tables, formulas, and reading order.
Synthetic pipeline:
- Collect text and images from Wikipedia, web corpora, and image datasets.
- Use Jinja templates to construct HTML with varying layouts (single, double, triple columns) and content types (tables, formulas, figures).
- Render HTML to images via a browser engine.
- Filter low-quality and overlapping images.
- Extract ground-truth annotations by parsing HTML into Markdown. This gives exact alignment between the rendered page and the structured representation.
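A minimal sketch of the template-rendering step using Jinja; the template itself is hypothetical, and the browser-rendering and Markdown-extraction steps are only indicated in comments:

```python
from jinja2 import Template  # HTML templating, as in the synthetic pipeline above

# Hypothetical two-column template; the actual templates are not described in detail.
TWO_COLUMN = Template("""
<html><body style="column-count: 2;">
  <h1>{{ title }}</h1>
  {% for para in paragraphs %}<p>{{ para }}</p>{% endfor %}
</body></html>
""")

html = TWO_COLUMN.render(
    title="Sample page",
    paragraphs=["First paragraph.", "Second paragraph."],
)
# The rendered HTML would then be screenshotted via a browser engine (e.g. headless
# Chromium) to produce the page image, and the same HTML parsed into Markdown to
# obtain exactly aligned ground truth.
```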
Quality control:
- Three document analysis experts manually inspected about 5% of the data and used feedback to refine screening rules in at least five iterations.
- A model-based cross-verification mechanism at scale keeps only high-consistency pseudo labels, with inconsistent samples feeding back into rule refinement.
Training context statistics
- Maximum training context: 8192 tokens.
- Observed lengths in raw data: Minimum (17 tokens), Maximum (31,147 tokens), Average (1,765 tokens), Median (1,127 tokens).
- About 73% of samples between 512 and 4k tokens. Sequences longer than 8k are left-truncated to keep trailing content, presumably more relevant.
Algorithms / Training
LayoutRL and multi-aspect reward
For each training document:
- Generate candidates: The policy model (Infinity-Parser) generates $G$ candidate outputs (eight in this work) with a maximum length of 8192 tokens and temperature 1.0.
- Compute raw rewards: For each candidate output $\hat{y}$ versus reference $y$:
- Edit distance reward: $R_{\text{dist}} = 1 - D(y, \hat{y}) / \max(N, M)$, where $D(y, \hat{y})$ is the Levenshtein distance and $N$ and $M$ are the lengths of reference and prediction (so the distance is normalized by the longer sequence).
- Count reward: $R_{\text{count}} = 1 - |N_Y - N_{\hat{Y}}| / N_Y$, where $N_Y$ and $N_{\hat{Y}}$ are numbers of paragraphs in reference and prediction. This penalizes extra or missing paragraphs.
- Order reward: After matching predicted and ground truth paragraphs using the Hungarian algorithm, count pairwise inversions ($D_{\text{order}}$) between the two lists. With $\text{maxinv} = N_Y (N_Y - 1) / 2$, $R_{\text{order}} = 1 - D_{\text{order}} / \text{maxinv}$.
- Combined reward: $R_{\text{multi}} = R_{\text{dist}} + R_{\text{count}} + R_{\text{order}}$.
- Group Relative Policy Optimization (GRPO): Within each group of candidate outputs, rewards are converted into relative advantages by comparing each output to others in the same group. Training uses these group-based advantages and a KL penalty against a reference model, avoiding the need for a learned critic.
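A minimal sketch of the group-relative advantage step, using the common mean/std standardization form of GRPO (the paper's exact normalization may differ):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-relative advantages: each candidate's reward is standardized against the
    other completions sampled for the same page, so no learned critic is needed."""
    r = np.asarray(group_rewards, dtype=float)  # shape (G,), e.g. G = 8 samples per page
    return (r - r.mean()) / (r.std() + eps)

# Example: advantages for 8 sampled Markdown outputs of one page
# adv = grpo_advantages([2.31, 2.74, 1.90, 2.55, 2.02, 2.68, 2.40, 2.11])
```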
Training setup
- Base model: Qwen2.5-VL-7B.
- Data for RL: Random 43k-document subset from Infinity-Doc-400K.
- Framework: GRPO implemented using Verl or EasyR1-like infrastructure.
- Hyperparameters:
- KL coefficient: $\beta = 1.0 \times 10^{-2}$.
- Number of samples per document: 8.
- Max response length: 8192 tokens.
- Temperature: 1.0.
- Rollout batch size: 128.
- Global batch size: 128.
- Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.99$, learning rate $1.0 \times 10^{-6}$.
- Training duration: 1 epoch over the 43k subset.
SFT versus RL training variants
- SFT-only baseline: Supervised fine-tuning on 43k labeled documents, using token-level loss on Markdown outputs. Details of SFT optimizer and schedule are not fully elaborated but appear standard.
- RL from scratch versus after SFT: RL directly on top of the base model (“Zero + RL”) versus RL starting from the SFT model (“SFT + RL”). Results suggest that, for this task and backbone, RL on the base model with full multi-aspect rewards is competitive with, and in some metrics better than, SFT-based starts.
Evaluation
OmniDocBench
Metrics:
- Overall and per-subtask normalized edit distance for text, formulas, and reading order.
- TEDS and TEDS-S for tables (structure and structure-plus-content).
Baselines:
- Pipeline tools: MinerU, Marker, Mathpix, Docling, Pix2Text, Unstructured, OpenParse.
- Document OCR VLMs: GOT-OCR, Nougat, Mistral OCR, olmOCR, SmolDocling.
- General VLMs: GPT-4o, Qwen2-VL-72B, Qwen2.5-VL-7B, InternVL2-76B, InternVL3-8B.
Key results:
- Infinity-Parser-7B yields lowest overall edit distances among listed systems for both English and Chinese.
- Strong performance across text, formulas, tables, and reading order.
- Reported robust performance across page types and complex table conditions (merged cells, formulas in cells, colorful and rotated tables).
olmOCR-Bench
Setup:
- Fact-based evaluation on single-page PDFs, checking whether specific “facts” are present in OCR output rather than using raw edit distance.
- Models are evaluated in both anchored and non-anchored prompt settings.
Outcome:
- Infinity-Parser-7B has the highest overall score (82.5), outperforming specialized and general systems across most document categories.
PubTabNet and FinTabNet
Metrics:
- TEDS-S for table structure.
- TEDS for structure plus content.
Comparisons:
- Infinity-Parser surpasses EDD, OmniParser, InternVL3, Qwen2.5-VL, and GPT-4o on both datasets according to reported numbers. Improvements are on the order of a few TEDS points.
Ablation and behavior analysis
Reward ablation:
- RL with edit-only reward improves over SFT on overall edit distance. Adding count and order rewards further improves structural metrics and category-level averages.
- SFT followed by RL improves category-level averages slightly further but does not give the best English/Chinese overall edit distances, suggesting a tradeoff.
Task-level behavior:
- RL curves for text, tables, formulas, and reading order show smoother growth with data size and higher end performance than SFT, which can plateau or regress.
Distribution shift:
- In in-distribution settings, RL continues to improve page-level accuracy with more data, whereas SFT tends to focus on paragraph-level edit scores and stagnates on page-level metrics.
- In OOD settings (textbooks and slides excluded from training), RL degrades less and achieves better final scores than SFT.
Hardware / Production
Training hardware:
- RL training performed on 8 $\times$ NVIDIA A100 (80 GB) GPUs in a distributed setup using Verl or EasyR1 tooling.
Training cost and time:
- The paper does not state wall-clock training time or total FLOPs. The limited use of 43k documents with one epoch suggests a moderate training run, but this is not quantified.
Serving / production:
- No details are given about deployment, latency benchmarks, or throughput optimizations. Infinity-Parser is described only in the research context; operationalization is out of scope for this report.
Note: This analysis follows the Roots Labs OCR paper-notes guidelines and classification taxonomy. For academic or production use, consult the original paper and verify claims through independent evaluation.
2025-06-monkeyocr
MonkeyOCR — Notes
TL;DR
MonkeyOCR is a 3B-parameter document parsing system built on a Structure-Recognition-Relation (SRR) triplet paradigm: layout detection (Where), block-level recognition (What), and reading-order prediction (How). Trained on MonkeyDoc, a bilingual dataset with 3.9M block-level instances across 10+ document types, it achieves Edit distance 0.140 (EN) and 0.297 (ZH) on OmniDocBench, outperforming MinerU, Marker, and general VLMs like Qwen2.5-VL-7B. Multi-page inference reaches 0.84 pages/sec on A800 hardware, 29% faster than MinerU (0.65 pages/sec).
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (introduces SRR decomposition, unified LMM with type-aware prompts, block-level relation model, and end-to-end training procedure).
Secondary: $\Psi_{\text{Resource}}$ (MonkeyDoc dataset with 3.9M instances, bilingual coverage, and 3-stage construction pipeline is a substantial contribution).
Also present: $\Psi_{\text{Evaluation}}$ (large-scale benchmark comparison on OmniDocBench with 981 pages, but no new metric or evaluation protocol).
Rough superposition: 0.55 Method + 0.35 Resource + 0.10 Evaluation.
What is the motivation?
Pipeline tool error propagation: Traditional multi-stage toolchains (e.g., MinerU-style: layout detection $\rightarrow$ crop $\rightarrow$ separate recognizers for text/table/formula) accumulate errors across stages. The paper illustrates formula cropping mistakes causing hallucinated superscripts when recognition operates on incorrect regions.
End-to-end efficiency constraints: Full-page VLM inference faces quadratic attention cost and slow throughput on dense, high-resolution documents (e.g., academic papers with hundreds of text lines). General VLMs like Qwen2.5-VL-7B achieve only 0.12 pages/sec on multi-page workloads.
Goal: Design a decomposition that maintains accuracy while improving scalability and throughput for real-world document parsing at scale.
What is the novelty?
SRR triplet paradigm: Explicitly decomposes document parsing into three questions answered by specialized components:
- Structure (Where is it): YOLO-based layout detector identifies regions and types.
- Recognition (What is it): Unified Large Multimodal Model (LMM) processes cropped blocks with type-specific prompts (e.g., “Recognize the text in this formula block”).
- Relation (How is it organized): Block-level reading-order model predicts sequence over detected regions to reassemble content.
Contrast to pipeline tools: SRR avoids multi-tool error propagation by using a unified LMM for all content types (text, tables, formulas, code) rather than separate specialized recognizers. This tighter integration shares representations while maintaining modularity for efficient parallelization.
Contrast to end-to-end VLMs: SRR operates on block-level crops rather than full pages, enabling parallel processing of detected regions and avoiding the $O(n^2)$ attention cost of full-page high-resolution inference.
Type-aware block recognition: A single 3B LMM handles all content types by conditioning on element type via prompts. This contrasts with pipeline tools that route different element types (text, table, formula) to specialized models, and with general VLMs that process full pages without type-specific guidance.
MonkeyDoc dataset: 3.9M block-level instances spanning layout detection, reading-order prediction, and recognition (text, tables, formulas, code blocks). Bilingual (Chinese + English), multi-domain (books, slides, financial reports, arXiv papers, etc.), built via a 3-stage pipeline:
- Structure Detection: Harmonizes layout annotations from M6Doc, DocLayNet, D4LA, CDLA into 11 classes; adds 300k+ Chinese pages with auto-annotation and manual correction (41k samples).
- Content Recognition: Crops 1.9M elements from layout annotations; refines PubTabNet (470k tables); adds UniMER-1M formulas; synthesizes 526k Chinese tables/formulas; extracts 36k arXiv LaTeX samples.
- Relation Prediction: Refines DocGenome reading-order labels (951k samples); adds 154k manually annotated Chinese samples; auto-annotates 78k samples via PPOCR + LayoutReader.
What experiments were performed?
Primary benchmark: OmniDocBench (981 PDF pages, 9 document types, 4 layout styles, 3 language categories: English, Chinese, mixed).
Baselines:
- Pipeline tools: MinerU, Marker, Nougat, PaddleOCR, Mathpix.
- Expert VLMs: GOT-OCR, Mistral OCR, Docling.
- General VLMs: GPT-4o, Gemini 2.5 Pro, Qwen2.5-VL-7B, InternVL3-8B, Idefics3-8B-Llama3.
Evaluation tasks:
- Overall parsing: Edit distance on full document recovery (text + structure + formulas + tables).
- Text recognition: Edit distance on plain text blocks.
- Formula recognition: CDM (Character Detection Match) and ExpRate (expression-level exact match).
- Table recognition: TEDS (Tree Edit Distance-based Similarity).
- Reading order: Sequence accuracy over detected blocks.
Document type breakdown: Separate results for books, slides, financial reports, research papers, exams, magazines, textbooks, newspapers, handwriting.
Throughput comparison: Single-page and multi-page inference speed (pages/sec) on A800 GPUs for MonkeyOCR-3B, MinerU, Qwen2.5-VL-7B.
Ablation (implicit): The paper reports a Chinese-specialized variant (MonkeyOCR*) trained with additional Chinese data, showing improved Chinese edit distance but no detailed ablation of SRR components.
What are the outcomes/limitations?
Outcomes:
- Overall end-to-end parsing on OmniDocBench: MonkeyOCR-3B achieves Edit distance 0.140 (EN) and 0.297 (ZH), leading among open-source models. Gemini 2.5 Pro achieves slightly better Chinese results (0.265 ZH) but comparable English (0.141 EN).
- Formula recognition gains: +15.0% average improvement over MinerU on CDM and ExpRate metrics.
- Table recognition gains: +8.6% average improvement over MinerU on TEDS.
- Multi-page throughput: 0.84 pages/sec vs. 0.65 pages/sec (MinerU) and 0.12 pages/sec (Qwen2.5-VL-7B), a 29% improvement over MinerU.
- Single-page throughput: 0.24 pages/sec vs. 0.28 pages/sec (MinerU), slightly slower on single-page workloads.
- Document type breakdown: Best on books, slides, financial reports; competitive on research papers, textbooks, magazines.
Limitations and open questions:
- Chinese gap remains: Even with MonkeyOCR* (Chinese-specialized variant), Gemini 2.5 Pro achieves better Chinese edit distance (0.265 vs. 0.297 for base MonkeyOCR, value for MonkeyOCR* not clearly stated). The paper does not explain whether this gap stems from dataset coverage, model capacity, or linguistic features.
- Single-page throughput tradeoff: MonkeyOCR is 14% slower than MinerU on single-page inference (0.24 vs. 0.28 pages/sec), suggesting the SRR overhead (detection + relation prediction) may not amortize on short documents.
- Missing ablations: The paper does not isolate the contribution of each SRR component (layout quality, recognition accuracy, reading-order correctness). It is unclear whether gains come primarily from the unified LMM, the type-aware prompts, or the relation model.
- Reproducibility gaps: Although model weights have since been released (see the Model availability note below), the paper provides only high-level architecture descriptions and omits:
- LMM backbone architecture (transformer specs, vision encoder details, fusion mechanism).
- Prompt templates for type-aware recognition (exact text for “Recognize the text in this formula block”).
- Loss functions and training objectives per stage (layout loss, recognition loss, relation loss).
- Hyperparameters beyond optimizer/schedule (batch size per stage, gradient accumulation, warmup steps).
- Deployment cost: Training requires 53 hours on 32 A800 GPUs. The paper does not provide cost-per-page estimates for inference or compare resource efficiency to MinerU/olmOCR (which report $176/M pages on L40S for olmOCR).
- Generalization beyond MonkeyDoc domains: OmniDocBench covers 9 document types, but real-world parsing spans invoices, forms, receipts, technical diagrams, and other specialized layouts not represented in the 10+ domains listed for MonkeyDoc. Cross-domain generalization is not evaluated.
Model
High-level architecture
MonkeyOCR decomposes document parsing into a Structure-Recognition-Relation (SRR) pipeline with three specialized components:
- Structure detection: YOLO-based layout detector outputs bounding boxes and element types for each region.
- Recognition: Unified Large Multimodal Model (LMM) processes cropped blocks with type-specific prompts.
- Relation prediction: Block-level reading-order model predicts sequence over detected regions.
Final output reassembles recognized content in predicted reading order.
Structure detection (YOLO-based)
Architecture: YOLO object detector adapted for document layout analysis.
Output: Bounding boxes (x, y, width, height) + element type (11 classes).
Element types (harmonized from MinerU conventions):
- Text blocks
- Titles/headings
- Tables
- Figures
- Formulas
- Code blocks
- Captions
- Footnotes
- Headers/footers
- Page numbers
- Other
Training data: Aggregates and harmonizes layout annotations from M6Doc, DocLayNet, D4LA, CDLA. Cleaning rules include:
- Remove nested boxes (keep largest region).
- Filter low-information boxes (area < 35% of page).
Chinese supplementation: 300k+ collected pages, pre-annotated, post-processed to 28k high-quality samples, plus 13k manually corrected samples.
Recognition (unified LMM with type-aware prompts)
Model: 3B-parameter Large Multimodal Model (LMM). Architectural details (backbone, vision encoder, fusion mechanism) not specified in the paper.
Input: Cropped image region from structure detection + type-specific text prompt.
Prompt format (examples, exact templates not provided):
- Text block: “Recognize the text in this text block”
- Formula: “Recognize the text in this formula block”
- Table: “Recognize the table structure and content”
- Code: “Recognize the code in this code block”
Output: Structured text (plain text for text blocks, LaTeX for formulas, HTML for tables, etc.).
Key design choice: A single unified model handles all content types via prompt conditioning, rather than routing to specialized models (e.g., separate formula recognizer, table structure recognizer). This enables shared representations and parallel processing of cropped regions.
Relation prediction (reading-order model)
Task: Predict a total ordering over detected blocks to reassemble content in reading sequence.
Input: Detected block positions (bounding boxes) + types from structure detection.
Output: Permutation over block indices (e.g., [3, 1, 5, 2, 4] for 5 blocks).
Architecture: Not specified in the paper. Likely a transformer or graph neural network operating on block embeddings (position + type features).
Training data: 951k samples from DocGenome (refined reading-order labels), 154k manually annotated Chinese samples, 78k auto-annotated samples (PPOCR + LayoutReader).
Data
MonkeyDoc overview
Size: 3.9M block-level instances.
Coverage: 10+ document types (books, slides, financial reports, research papers, exams, magazines, textbooks, newspapers, handwriting, contracts).
Languages: Chinese + English (bilingual).
Tasks supported:
- Layout detection (structure)
- Reading-order prediction (relation)
- Text recognition
- Table recognition
- Formula recognition
- Code block recognition
Construction pipeline: 3-stage process mirroring SRR paradigm (Structure Detection $\rightarrow$ Content Recognition $\rightarrow$ Relation Prediction).
Structure Detection data construction
Base datasets: M6Doc, DocLayNet, D4LA, CDLA.
Harmonization: Converts original label sets to 11 unified classes following MinerU conventions.
Cleaning rules:
- Remove nested boxes: When multiple boxes overlap, keep only the largest region.
- Filter low-information boxes: Remove boxes with area < 35% of page (likely page margins, headers, footers with minimal content).
Chinese supplementation:
- Collect 300k+ Chinese pages from web/internal sources.
- Pre-annotate with existing layout models.
- Post-process and quality-filter to 28k high-quality samples.
- Add 13k manually corrected Chinese samples.
Total structure detection samples: Not explicitly stated (aggregated from multiple datasets).
Content Recognition data construction
Element cropping: Crop 1.9M blocks from layout annotations (text blocks, figures, captions, etc.).
Partial element labeling: Use Gemini 2.5 Pro to annotate cropped regions without ground-truth OCR labels.
Table recognition:
- Refine PubTabNet (original 568k tables) with quality checks.
- Final table dataset: 470k high-quality samples.
Formula recognition:
- Use UniMER-1M (multi-source formula dataset: arXiv, textbooks, exams).
Synthesized Chinese tables/formulas:
- Generate 526k additional Chinese samples via synthesis pipeline (details not provided).
arXiv LaTeX extraction:
- Extract 36k academic paper samples with LaTeX source for tables and formulas.
Total recognition samples: 1.9M (elements) + 470k (tables) + UniMER-1M (formulas) + 526k (Chinese synthesis) + 36k (arXiv) $\approx$ 3.9M instances.
Relation Prediction data construction
DocGenome refinement:
- Original DocGenome dataset provides reading-order annotations.
- Refine labels for consistency and accuracy.
- Final: 951k samples with reading-order sequences.
Manual Chinese reading-order annotation:
- Annotate 154k Chinese document samples with reading order.
Auto-annotation pipeline:
- Run PPOCR line recognition on document images.
- Apply LayoutReader reading-order model to predicted text lines.
- Filter and validate auto-annotations.
- Final: 78k additional samples.
Total relation prediction samples: 951k + 154k + 78k = 1.183M samples.
Algorithms / Training
Optimizer and schedule
Optimizer: AdamW.
Learning rate: 2e-5.
Schedule: Cosine annealing (details on warmup steps, min LR not provided).
Batch size: 64 (likely global batch size across GPUs, per-device batch size not specified).
Training procedure (inferred)
The paper does not provide detailed training objectives or multi-stage training procedures. Likely approach based on SRR decomposition:
- Structure detection (YOLO): Standard object detection loss (bounding box regression + classification).
- Recognition (LMM): Autoregressive language modeling loss on recognition targets (text, LaTeX, HTML), conditioned on cropped image + type prompt.
- Relation prediction: Sequence prediction loss (e.g., cross-entropy over permutation or pairwise ordering loss).
End-to-end training: The paper does not clarify whether SRR components are trained jointly or sequentially.
Training hardware and duration
Hardware: 32 A800 GPUs (80GB VRAM each).
Duration: 53 hours for full 3B model training.
Estimated compute: 32 GPUs $\times$ 53 hours = 1,696 GPU-hours on A800 hardware.
Evaluation
OmniDocBench overall parsing (Edit distance)
Edit distance measures character-level differences between predicted and ground-truth full document text (including structure markers, formula LaTeX, table HTML, etc.). Lower is better.
| Method | EN Edit $\downarrow$ | ZH Edit $\downarrow$ |
|---|---|---|
| MonkeyOCR-3B | 0.140 | 0.297 |
| Gemini 2.5 Pro | 0.141 | 0.265 |
| GPT-4o | 0.215 | 0.398 |
| Qwen2.5-VL-7B | 0.278 | 0.425 |
| InternVL3-8B | 0.312 | 0.467 |
| MinerU | 0.333 | 0.350 |
| Marker | 0.418 | 0.512 |
| GOT-OCR | 0.389 | 0.521 |
MonkeyOCR-3B achieves best English results and competitive Chinese results (trailing only Gemini 2.5 Pro). The Chinese-specialized variant MonkeyOCR* improves Chinese scores but exact values are not provided in the available text.
Formula recognition (CDM and ExpRate)
CDM (Character Detection Match): Character-level F1 score on LaTeX output.
ExpRate: Expression-level exact match (entire formula correct).
The paper reports +15.0% average improvement over MinerU on formula recognition but does not provide absolute CDM/ExpRate values or a structured comparison table.
Table recognition (TEDS)
TEDS (Tree Edit Distance-based Similarity): Measures structural similarity between predicted and ground-truth table HTML. Higher is better (range 0–1).
The paper reports +8.6% average improvement over MinerU on TEDS but does not provide absolute values or a structured comparison table.
Reading-order accuracy
The paper does not report standalone reading-order metrics (e.g., Kendall’s Tau, pairwise ordering accuracy). Reading-order quality is implicitly reflected in overall edit distance.
Throughput comparison
| Method | Single-page (pages/sec) | Multi-page (pages/sec) |
|---|---|---|
| MonkeyOCR-3B | 0.24 | 0.84 |
| MinerU | 0.28 | 0.65 |
| Qwen2.5-VL-7B | 0.12 | 0.12 |
Multi-page inference: MonkeyOCR achieves 29% higher throughput than MinerU (0.84 vs. 0.65 pages/sec), likely due to block-level parallelism.
Single-page inference: MonkeyOCR is 14% slower than MinerU (0.24 vs. 0.28 pages/sec), suggesting SRR overhead (detection + relation prediction) may not amortize on short documents.
Document type breakdown
The paper provides per-document-type results (books, slides, financial reports, research papers, etc.) but numeric values are not included in the available text. Key qualitative findings:
- Best performance: Books, slides, financial reports.
- Competitive performance: Research papers, textbooks, magazines.
- Challenging: Handwriting (likely due to limited handwriting data in MonkeyDoc).
Hardware / Production
Training infrastructure
Hardware: 32 NVIDIA A800 GPUs (80GB VRAM each).
Duration: 53 hours for full 3B model training.
Estimated cost: At typical cloud pricing ($3–4/hour per A800), training cost is approximately $5,000–$7,000.
Inference infrastructure
Single-GPU inference: Runs on RTX 3090 (24GB VRAM) via LMDeploy (quantization/optimization toolkit).
Throughput:
- Multi-page: 0.84 pages/sec (A800 hardware, reported in paper).
- Single-page: 0.24 pages/sec (A800 hardware, reported in paper).
Cost-per-page estimates: Not provided in the paper.
Deployment options
LMDeploy integration: The paper mentions RTX 3090 deployment via LMDeploy, suggesting support for:
- INT8/INT4 quantization for reduced memory footprint.
- TensorRT optimization for faster inference.
- Multi-GPU serving for high-throughput workloads.
Serving infrastructure: Not discussed in the paper. Likely requires custom FastAPI/Triton setup (similar to PaddleOCR 3.0) for production deployment.
Implementation sketch
Python API (inferred from paper description)
The paper does not provide code examples. Based on SRR decomposition, a likely API structure:
```python
from monkeyocr import MonkeyOCR

# Initialize parser
parser = MonkeyOCR(model="MonkeyOCR-3B")

# Single-page parsing
result = parser.parse("document.pdf", page=0)

# Access SRR outputs
layout = result.structure        # Bounding boxes + types
blocks = result.recognition      # Recognized content per block
reading_order = result.relation  # Block sequence

# Reassembled document text (in reading order)
full_text = result.text
full_markdown = result.to_markdown()
full_html = result.to_html()

# Multi-page parsing with parallelism
results = parser.parse_batch(["doc1.pdf", "doc2.pdf"], batch_size=4)
```
Type-aware recognition prompt format (inferred)
The paper mentions “type-specific prompts” but does not provide templates. Likely format:
```python
# Text block
prompt = "Recognize the text in this text block"

# Formula
prompt = "Recognize the text in this formula block"

# Table
prompt = "Recognize the table structure and content"

# Code block
prompt = "Recognize the code in this code block"
```
Notes and open questions
Observations
SRR paradigm tradeoffs: The Structure-Recognition-Relation decomposition achieves strong accuracy (0.140 EN edit distance) while improving multi-page throughput over pipeline tools (29% faster than MinerU). However, single-page inference is 14% slower, suggesting the detection + relation overhead may not amortize on short documents. This positions MonkeyOCR for batch processing workloads (e.g., large-scale document archives) rather than interactive single-page parsing.
Chinese gap persists: Despite a bilingual MonkeyDoc dataset (3.9M instances) and a Chinese-specialized variant (MonkeyOCR*), Gemini 2.5 Pro achieves better Chinese edit distance (0.265 vs. 0.297 for base MonkeyOCR). This gap may stem from:
- Dataset coverage: MonkeyDoc’s Chinese samples (41k structure, 154k relation, 526k synthesis) may underrepresent linguistic complexity (classical Chinese, technical terminology, etc.).
- Model capacity: A 3B LMM may lack capacity for dense Chinese character vocabularies (tens of thousands of characters vs. ~26 letters in English).
- Formula/table challenges: Chinese documents often mix CJK characters with formulas and tables, requiring robust multimodal understanding.
Unified LMM design: Using a single 3B model for all content types (text, tables, formulas, code) with type-aware prompts contrasts with pipeline tools that route to specialized recognizers. This design choice enables shared representations and parallel processing but may sacrifice per-type accuracy (e.g., a dedicated formula recognizer might outperform a general LMM prompted for formulas). The paper does not ablate this tradeoff.
Open questions
SRR component ablations: What is the contribution of each stage (structure, recognition, relation) to overall accuracy? Specifically:
- Structure quality: How much does layout detection accuracy (mAP, IoU) correlate with end-to-end edit distance?
- Recognition quality: If we assume perfect layout detection (oracle boxes + types), how much does edit distance improve?
- Relation quality: If we assume perfect recognition but random reading order, how much does edit distance degrade?
Without these ablations, it is unclear whether MonkeyOCR’s gains come primarily from better layout detection, superior LMM recognition, or improved reading-order modeling.
Prompt engineering details: The paper mentions “type-specific prompts” but does not provide templates or ablate prompt design. Key questions:
- What is the exact prompt format? (e.g., “Recognize the text in this [TYPE] block” vs. more detailed instructions?)
- Does the LMM receive only the prompt + cropped image, or additional context (e.g., neighboring blocks, document metadata)?
- How sensitive are results to prompt wording? (e.g., “Recognize” vs. “Extract” vs. “Transcribe”)
Loss functions and training objectives: The paper specifies optimizer (AdamW) and learning rate (2e-5) but omits:
- Structure detection loss (likely standard YOLO loss, but hyperparameters?).
- Recognition loss (autoregressive language modeling? CTC? Seq2seq with teacher forcing?).
- Relation prediction loss (cross-entropy over permutations? Pairwise ordering loss? Pointer network?).
- Multi-task balancing: Are structure, recognition, and relation trained jointly or sequentially? If jointly, how are losses weighted?
Model availability: The 3B-parameter model is publicly available under Apache 2.0 license at echo840/MonkeyOCR-pro-3B, enabling independent validation of the reported OmniDocBench results (0.140 EN, 0.297 ZH). Key reproducibility gaps remain:
- LMM architecture: Backbone (LLaMA, Qwen, etc.)? Vision encoder (CLIP, SigLIP, etc.)? Fusion mechanism (cross-attention, prefix tuning, etc.)?
- Training data licenses: MonkeyDoc aggregates M6Doc, DocLayNet, D4LA, CDLA, PubTabNet, UniMER-1M. Are all components commercially usable?
- Evaluation protocol: OmniDocBench preprocessing (PDF to images? Resolution? Page cropping?), postprocessing (text normalization? LaTeX cleanup?), and metric implementation (edit distance algorithm, TEDS version).
Deployment cost and efficiency: The paper reports throughput (0.84 pages/sec multi-page) but not cost-per-page. Key questions for production planning:
- What is the dollar cost per million pages on A800/H100/L40S hardware?
- How does MonkeyOCR compare to olmOCR ($176/M pages on L40S) or MinerU2.5 (1.224 pages/sec on A100)?
- Can LMDeploy quantization (INT8/INT4) maintain accuracy while reducing cost?
Generalization beyond OmniDocBench domains: MonkeyDoc covers 10+ document types, but real-world parsing spans invoices, receipts, forms, technical diagrams, and domain-specific layouts (legal, medical, government). The paper does not evaluate zero-shot transfer to unseen document types or provide guidance on fine-tuning for new domains.
2025-09-mineru2_5
MinerU2.5 — Notes
TL;DR
MinerU2.5 is a 1.2B-parameter vision language model for document parsing that decouples global layout analysis from local content recognition using a coarse-to-fine, two-stage inference pipeline (thumbnail layout first, then native-resolution crops). With a NaViT vision encoder (675M params), a Qwen2-0.5B decoder with M-RoPE, and a pixel-unshuffle patch merger, it achieves strong results on OmniDocBench (Overall 90.67) while maintaining practical throughput ($\approx$ 2.12 pages/s on A100-80G). The paper introduces PageIoU (layout metric), ADR (atomic decomposition for long formulas), OTSL (compact table target), and IMIC (consistency-based hard-case mining).
What kind of paper is this?
Primarily $\Psi_{\text{Method}}$ with strong $\Psi_{\text{Resource}}$ and $\Psi_{\text{Evaluation}}$ components.
The headline novelty is the two-stage decoupled parsing architecture: thumbnail-based layout detection followed by native-resolution crop recognition. Secondary contributions include new evaluation methodology (PageIoU metric) and a data flywheel (IMIC mining strategy). The model weights and code are publicly released.
Rough superposition: $\Psi_{\text{Method}}(0.65) + \Psi_{\text{Resource}}(0.20) + \Psi_{\text{Evaluation}}(0.15)$
What is the motivation?
High-resolution document parsing faces a fundamental tension: native-resolution processing yields better recognition quality but incurs $O(N^2)$ attention costs over massive token sequences. Existing solutions have critical gaps:
- Pipeline systems are modular but suffer from error propagation and cumbersome integration
- End-to-end VLMs are bottlenecked by hallucinations on long documents and token redundancy from blank or low-information regions
- Prior systems either sacrifice resolution (hurting small text/formulas) or sacrifice throughput (impractical for production)
MinerU2.5 addresses this by decoupling global layout from local content: make layout explicit in Stage I on a cheap thumbnail, then “spend” tokens only on content-bearing crops in Stage II at native resolution.
What is the novelty?
The core contribution is a coarse-to-fine, two-stage decoupled parsing architecture:
- Stage I (global): Predict layout/rotation/reading-order on a fixed 1036$\times$1036 thumbnail, avoiding native-resolution $O(N^2)$ token blowup
- Stage II (local): Process native-resolution crops independently, enabling parallelism and bounded token counts
Supporting innovations:
- PageIoU: A page-level coverage metric for layout using coverage maps and pixel-wise min/max aggregation, better matching human perception than IoU@0.5 for text blocks (Figure 4, p. 15)
- ADR (Atomic Decomposition & Recombination): For multi-line formulas, decompose into atomic lines, recognize each, then recombine with LaTeX alignment (Figure 5, p. 16)
- OTSL: A compact table representation reducing structural tokens from $\sim$28 to 5, shrinking sequences by $\sim$50% (Figure 6, p. 17)
- IMIC (Iterative Mining via Inference Consistency): Run stochastic decoding multiple times; low consistency (measured by PageIoU/TEDS/CDM) flags hard cases for human annotation (Figure 7, p. 18)
- M-RoPE: Replace 1D-RoPE with multi-dimensional RoPE to generalize across varying crop resolutions/aspect ratios
What experiments were performed?
Evaluation spans six benchmark categories with detailed ablations:
- Full-page parsing: OmniDocBench (1,355 pages, EN/ZH) measuring text edit distance, formula CDM, table TEDS/TEDS-S, reading order edit distance
- Dense text OCR: Ocean-OCR (EN/ZH splits) with edit distance, F1, BLEU, METEOR metrics
- Unit tests: olmOCR-bench (math, scans, tiny text) using ExpRate (render-based) for math splits instead of AST-string CDM
- Layout analysis: OmniDocBench, D4LA, DocLayNet using PageIoU metric with unified tag set (headers, footers, page numbers, code, algorithms, references, lists, caption sub-types)
- Table recognition: PubTabNet, FinTabNet, CC-OCR, OCRBench v2, in-house TR dataset
- Formula recognition: CPE, HWE, SCE, SPE, LaTeX-80MM, plus in-house Chinese/Fuzzy/Complex sets
Baselines include Qwen2.5-VL (7B/72B), InternVL3, Gemini-2.5-Pro, and specialized systems (GOT, Docling, MinerU2). The paper also provides throughput benchmarks across A100-80G, RTX 4090-48G, and H200-141G.
What are the outcomes/limitations?
Key results (Tables 5–11, pp. 19–24):
- OmniDocBench Overall: 90.67 (best reported), with Text-Edit 0.047, TEDS 88.22, CDM 88.46, RO-ED 0.044
- Overall score defined as: $\text{Overall} = \frac{(1-\text{TextEdit}) \times 100 + \text{TableTEDS} + \text{FormulaCDM}}{3}$
- FinTabNet: TEDS 95.97 / TEDS-S 97.61 (leading by margin)
- olmOCR-bench: 75.2 overall (best among listed); AR (math) 76.6, Old Scans Math 54.6, Long Tiny Text 83.5
- Ocean-OCR English: ED 0.033, F1 0.945, BLEU 0.909, METEOR 0.950
- Throughput: 2.12 pages/s (A100-80G), 4.47 pages/s (H200-141G), 1.70 pages/s (RTX 4090-48G) via vLLM
Limitations and open questions:
- Stage I is a recall bottleneck: Missed or merged elements in layout propagate to Stage II; the system cannot recover from layout errors
- Cross-domain robustness depends on the data engine: Rare layouts (multi-page wraps, exotic scripts) may require targeted IMIC mining
- Heuristic tuning: Crop scheduling and repetition penalties are manually tuned; learned scheduling policies could improve batch efficiency
- Dependency on frontier models: The data curation pipeline uses Qwen2.5-VL-72B, Gemini-2.5-Pro for pre-annotation, which may limit reproducibility
Model
Architecture (Figure 2, p. 7)
| Component | Details |
|---|---|
| Vision encoder | NaViT (Native-Res ViT), $\sim$675M params, 2D-RoPE, arbitrary aspect ratios |
| Patch merger | Pixel-unshuffle $2\times 2$, reduces vision tokens before LM |
| LM decoder | Qwen2-Instruct 0.5B with M-RoPE |
| Total | $\sim$1.2B params |
Key architectural choices:
- NaViT with 2D-RoPE: Chosen over Qwen2.5-VL’s window attention, which the authors note can degrade document parsing performance
- M-RoPE (multi-dimensional RoPE): Replaces 1D-RoPE in the LM decoder to generalize across varying crop resolutions and aspect ratios
- Pixel-unshuffle patch merger: Merges adjacent $2 \times 2$ vision tokens before feeding to the LM, trading off efficiency and performance
Two-stage pipeline
- Stage I (global layout):
- Input: Uniformly resized to 1036$\times$1036 pixels
- Output: Box positions, classes, rotation angles, and reading order in a single pass
- Benefit: Fixed thumbnail improves box stability and training efficiency
- Stage II (local content):
- Input: Crops from original high-res image using Stage I boxes
- Resolution: Native resolution with upper bound of 2048$\times$28$\times$28 pixels
- Output: Text, table (OTSL format), or formula (LaTeX) per crop
- Benefit: Crops process independently, enabling batching and parallelism
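To make the decoupling concrete, here is a minimal orchestration sketch in Python. The `layout_model` and `recognizer` callables, the box rescaling, and the threading are illustrative assumptions, not the paper's implementation; rotation handling and per-element prompts are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in thumbnail coordinates

def parse_page(
    page_image,                                   # PIL-style image of the full page
    layout_model: Callable[[object], List[Box]],  # Stage I: thumbnail -> boxes in reading order
    recognizer: Callable[[object], str],          # Stage II: crop -> text / OTSL / LaTeX
    thumb_size: int = 1036,
) -> List[str]:
    """Coarse-to-fine parsing: layout on a fixed thumbnail, recognition on native-res crops."""
    # Stage I: predict layout, rotation, and reading order on a 1036x1036 thumbnail.
    thumbnail = page_image.resize((thumb_size, thumb_size))
    boxes = layout_model(thumbnail)

    # Rescale thumbnail boxes back to the original resolution before cropping.
    sx, sy = page_image.width / thumb_size, page_image.height / thumb_size
    crops = [
        page_image.crop((int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy)))
        for x0, y0, x1, y1 in boxes
    ]

    # Stage II: crops are independent, so they can be batched or recognized in parallel.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(recognizer, crops))
```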
Task reformulations
Layout analysis as multi-task: Predict Position, Class, Rotation Angle, and Reading Order in one pass (instead of pushing rotation/order downstream to separate stages).
Formulas (ADR): “Whole-part” approach classifies formulas as atomic vs compound, decomposes compound formulas into lines, recognizes each line, then recombines using layout info.
Tables (OTSL): Generate OTSL instead of HTML to reduce structural token redundancy (structural tokens reduced from “over 28” down to “5”, $\sim$50% shorter sequences), then convert to HTML in post-processing.
Prompts (Appendix B)
Task-specific prompts switch output format:
<image>
Layout Detection:
Output: box positions, classes, rotation, reading order tokens
<image>
Text Recognition:
Output: OCR Results:{text}
<image>
Formula Recognition:
Output: LaTeX:{latex}
<image>
Table Recognition:
Output: OTSL:{otsl}
Contrast to end-to-end OCR VLMs: Instead of attending over massive native-res token sequences for the whole page, MinerU2.5 filters with Stage I and only “spends” tokens on content-bearing crops. This cuts $O(N^2)$ token growth, reduces hallucinations, and keeps reading order explicit.
Data
Data Engine (Figure 3, p. 12)
The closed-loop curation pipeline balances:
- Layout variety: Document types including papers, textbooks, reports, slides
- Element mix: Titles, paragraphs, tables, formulas, figures
- Language distribution: EN/ZH
Methods: page-level image clustering, metadata-based sampling, and an element-balance detector to ensure representative coverage.
Pre-training label refinement
Starting from MinerU2-pipeline outputs, refined with specialist models:
- Text crops: Qwen2.5-VL-72B-Instruct
- Formulas: UniMERNet (retrained in-house)
- Tables: In-house high-performance table parser
Training data sizes
| Stage | Samples | Breakdown |
|---|---|---|
| Stage 0A (image-caption) | 558K | Modality alignment |
| Stage 0B (VQA & OCR) | 665K | Instruction following |
| Stage 1 (pretrain) | 6.9M | Layout 2.3M, text 2.4M, formula 1.1M, table 1.1M |
| Stage 2 (fine-tune) | 630K | Layout 43K, text 300K, formula 147K, table 140K |
Note: Stage 1 uses 2 epochs over 6.9M samples; Stage 2 uses 3 epochs over 630K samples.
Fine-tuning set curation with IMIC
IMIC (Iterative Mining via Inference Consistency): Run the model multiple times with stochastic decoding and measure consistency via:
- PageIoU for layout analysis
- TEDS for tables
- CDM for formulas
Low-consistency samples (below threshold) are flagged as hard cases for human annotation. Complex tables may be pre-annotated by Gemini-2.5-Pro then expert-corrected via the Dingo QA tool, yielding a compact, high-value SFT set.
Data augmentation (Table 2)
Augmentations simulate scans and photographic artifacts:
- Spatial transforms: Rotation, shearing, perspective (NOT applied to layout analysis samples to preserve box accuracy)
- Background variation: Color jitter, background replacement
- Degradation effects: Blur, noise, compression artifacts
Algorithms / Training
Training stages (Table 1, p. 9)
| Stage | Description | Seq len | Batch size | Epochs | Vision max res |
|---|---|---|---|---|---|
| 0A | Modality alignment (freeze ViT & LM, train MLP) | 4096 | 128 | — | 2048$\times$28$\times$28 |
| 0B | VQA & OCR (unfreeze all) | 4096 | 64 | — | 4096$\times$28$\times$28 |
| 1 | Document parsing pretrain | 8192 | 256 | 2 | 2048$\times$28$\times$28 |
| 2 | Hard-case fine-tuning | 16384 | 256 | 3 | 2048$\times$28$\times$28 |
Initialization:
- Vision encoder: from Qwen2-VL-2B-Instruct
- LM decoder: from Qwen2-Instruct-0.5B
Separate learning rates for ViT vs {MLP, LM}. Stage 2 uses the compact, hard-case mined dataset identified by IMIC.
Key algorithmic contributions
PageIoU (pp. 13–15): Page-level coverage score for layout using coverage maps and pixel-wise min/max aggregation over non-background region $M$. Unified tag set covers headers, footers, page numbers, code, algorithms, references, lists, and caption sub-types (Table 4, p. 14). Better matches qualitative layout quality than IoU@0.5.
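As a rough illustration of the coverage-map idea (the exact PageIoU formulation in the paper may differ), one plausible reading is: rasterize predicted and ground-truth boxes into per-pixel coverage counts, then aggregate with pixel-wise min/max over the non-background region $M$. The box format, per-class handling, and background definition below are assumptions.

```python
import numpy as np

def page_iou(pred_boxes, gt_boxes, height, width, background_mask=None):
    """Hedged sketch of a page-level coverage score via pixel-wise min/max aggregation."""
    def coverage(boxes):
        cov = np.zeros((height, width), dtype=np.int32)
        for x0, y0, x1, y1 in boxes:          # boxes as integer (x0, y0, x1, y1)
            cov[y0:y1, x0:x1] += 1            # count overlapping boxes per pixel
        return cov

    pred, gt = coverage(pred_boxes), coverage(gt_boxes)
    # Non-background region M: here, any pixel covered by a prediction or ground truth.
    m = (gt > 0) | (pred > 0) if background_mask is None else background_mask
    inter = np.minimum(pred, gt)[m].sum()
    union = np.maximum(pred, gt)[m].sum()
    return inter / union if union else 1.0
```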
ADR (Atomic Decomposition & Recombination) (pp. 15–16): For multi-line/long expressions:
- Classify formula as atomic vs compound
- Decompose compound formulas into atomic lines via layout
- Recognize each line independently
- Recombine into LaTeX with alignment (e.g., the `align` environment)
OTSL (Optimized Table Structure Language) (pp. 16–17): Pipeline for table recognition:
- Detect table + rotation
- Rectify crop
- Recognize to OTSL format
- Convert to HTML
Reduces structural tokens from $\sim$28 to 5, shrinking sequences by $\sim$50%.
IMIC (Iterative Mining via Inference Consistency) (pp. 17–18): Data curation method:
- Run model multiple times with stochastic decoding
- Measure consistency via PageIoU (layout), TEDS (tables), CDM (formulas)
- Low-consistency samples (below threshold) flagged for human QA
- Results in compact, high-value SFT set
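A minimal sketch of the consistency-based mining loop; the number of stochastic decodes, the pairwise aggregation, and the 0.9 threshold are illustrative assumptions, and the consistency metric (PageIoU, TEDS, or CDM) is passed in as a callable.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List

def mine_hard_cases(
    samples: List[object],
    run_model: Callable[[object], object],           # one stochastic decode per call
    consistency: Callable[[object, object], float],  # PageIoU / TEDS / CDM, task-dependent
    n_runs: int = 5,
    threshold: float = 0.9,
) -> List[object]:
    """Flag samples whose stochastic decodes disagree (low mean pairwise consistency)."""
    hard = []
    for sample in samples:
        outputs = [run_model(sample) for _ in range(n_runs)]
        score = mean(consistency(a, b) for a, b in combinations(outputs, 2))
        if score < threshold:
            hard.append(sample)  # route to human annotation / pre-annotation
    return hard
```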
Evaluation
OmniDocBench (Tables 5–6, p. 19)
| Metric | MinerU2.5 | Notes |
|---|---|---|
| Overall | 90.67 | Best |
| Text-Edit $\downarrow$ | 0.047 | Best |
| Formula CDM $\uparrow$ | 88.46 | Best |
| Table TEDS $\uparrow$ | 88.22 | Best |
| Table TEDS-S $\uparrow$ | 92.38 | Best |
| Reading-Order ED $\downarrow$ | 0.044 | Best |
Overall score definition:
$$ \text{Overall} = \frac{(1 - \text{TextEdit}) \times 100 + \text{TableTEDS} + \text{FormulaCDM}}{3} $$
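Plugging the reported component scores into this definition reproduces the headline number (a quick sanity check):

```python
text_edit, table_teds, formula_cdm = 0.047, 88.22, 88.46
overall = ((1 - text_edit) * 100 + table_teds + formula_cdm) / 3
print(round(overall, 2))  # 90.66, matching the reported 90.67 up to rounding
```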
By document type (Text-Edit $\downarrow$): Newspaper 0.0540 (best), Textbook 0.0499 (best), Slides 0.0294 (2nd), Financial 0.0104 (2nd).
Benchmark details: DPI raised to 200 for Notes/Newspapers; EN/ZH balanced with +374 pages (total 1,355 pages).
Ocean-OCR (Table 7, p. 20)
Dense text recognition benchmark:
| Split | ED | F1 | BLEU | METEOR |
|---|---|---|---|---|
| English | 0.033 | 0.945 | 0.909 | 0.950 |
| Chinese | — | 0.965 | 0.817 | 0.887 |
olmOCR-bench (Table 8, p. 20)
Overall: 75.2 (best among listed systems)
- AR (math): 76.6
- Old Scans Math: 54.6
- Long Tiny Text: 83.5
Note: ExpRate (render-based evaluation) replaces AST-string CDM for math splits to avoid penalizing cosmetically different but semantically equivalent LaTeX (p. 21).
Layout analysis (Table 9, p. 22)
Top Full-Page F1@PageIoU scores across three benchmarks:
- OmniDocBench: Leads across all categories
- D4LA: Strong performance
- DocLayNet: Leads or co-leads per-category (Textual/Image/Table/Equation/Page-margins)
PageIoU metric uses unified tag set (Table 4, p. 14) covering headers, footers, page numbers, code, algorithms, references, lists, and caption sub-types.
Table recognition (Table 10, p. 23)
| Benchmark | TEDS | TEDS-S |
|---|---|---|
| PubTabNet | 89.07 (2nd) | 93.11 (3rd) |
| FinTabNet | 95.97 | 97.61 |
| CC-OCR (table) | 79.76 | — |
| OCRBench v2 (table) | 87.13 | — |
| In-house TR | 71.48 | 82.83 |
FinTabNet results lead by a significant margin.
Formula recognition (Table 11, p. 24)
Character Detection Metric (CDM) scores:
| Split | CDM |
|---|---|
| CPE | 96.6 |
| HWE | 94.4 |
| SCE | 96.4 |
| SPE | 98.4 |
| LaTeX-80MM (matrix) | 90.6 |
| In-house Chinese | 90.7 |
| In-house Fuzzy-Math | 92.6 |
| In-house Complex | 82.2 |
Hardware / Production
vLLM-based deployment pipeline (Table 3, p. 11)
Architecture:
- vLLM with asynchronous page-level batching
- Stage I and Stage II run as independent tasks with decoupled scheduling
- Stage II work starts as soon as Stage I results arrive (async backend)
- Dynamic penalties (frequency/presence) conditioned on Stage I layout type to suppress degenerate repetition without harming legitimate repetition in tables/equations
Throughput benchmarks:
| GPU | Tokens/s | Pages/s |
|---|---|---|
| A100-80G | 2337.25 | 2.12 |
| RTX 4090-48G | 1875.82 | 1.70 |
| H200-141G | 4938.31 | 4.47 |
Baseline (no deployment optimizations): 1045 tokens/s, 0.95 pages/s on A100-80G.
The deployment improvements come from:
- Async task scheduling for Stage I/II decoupling
- Dynamic sampling penalties conditioned on layout type
- Batch submission optimizations
Training infrastructure
Initialization:
- Vision encoder: from Qwen2-VL-2B-Instruct
- LM decoder: from Qwen2-Instruct-0.5B
Separate learning rates for ViT vs {MLP, LM}. Augmentations tuned by element type (Table 2). The paper emphasizes the data engine and deployment pipeline over full cluster specifications; training compute details are not explicitly provided.
2025-09-points-reader
POINTS-Reader — Notes
TL;DR
POINTS-Reader is a 4B-parameter document-conversion VLM built on POINTS-1.5 and trained via a two-stage, distillation-free pipeline: synthetic “uniform format” warm-up (800K samples across plain text, formulas, tables, multi-column layouts) and iterative self-improvement on real PDFs (DocMatix) with rule-based filtering. By unifying outputs for text (Markdown), tables (HTML), and formulas (LaTeX) and repeatedly bootstrapping on its own filtered annotations, the model surpasses many larger VLMs and OCR experts on OmniDocBench and Fox, especially for tables, while still trailing pipeline systems like MinerU and Mathpix.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (training procedure: synthetic warm-up plus iterative self-improvement with rule-based filtering, rather than new architecture).
Secondary: $\Psi_{\text{Resource}}$ (implicitly releases carefully filtered HTML-table, formula-aware dataset of $\sim$1.1M high-quality samples plus trained model); $\Psi_{\text{Evaluation}}$ (ablations on data scale, aspect ratio filtering, F1 thresholds, sampling ratios, initialization choices, all quantified on OmniDocBench).
The core contribution is a training methodology that avoids distillation from proprietary or very large models. The paper demonstrates that careful synthetic data design plus iterative bootstrapping with strict automated quality filters can produce competitive document-conversion models without inheriting teacher model biases or errors.
What is the motivation?
High-quality labeled data for end-to-end document conversion (text plus tables plus math plus layout) is scarce and expensive to annotate. Existing approaches usually distill from proprietary or very large open VLMs (GPT-4o, Qwen2.5-VL-72B, etc.), which locks progress to teacher capabilities and transfers teacher errors and biases (missing tables, hallucinated or omitted text, wrong table structure). Purely synthetic data helps but diverges from real-world layouts; models trained only on synthetic images underperform on complex PDFs. The goal is to learn a strong OCR/VLM reader without distillation by warm-starting on synthetic but layout-diverse data with unified output format, then adapting to real documents via self-improvement on DocMatix with strict automated filtering.
What is the novelty?
Unified output representation for document elements:
- Plain text: standardized Markdown (headers, lists, etc.)
- Tables: HTML `<table>` with simplified attributes, no CSS except merged-cell info, no extra whitespace or indentation to keep tokens down
- Formulas: KaTeX-compatible LaTeX, `$...$` inline and `$$...$$` display
Two-stage distillation-free training pipeline:
- Uniform Format Warm-up Stage (UWS): LLM-generated synthetic documents across four categories (plain text, text plus formulas, text plus tables, multi-column layouts with tables). HTML templates (1/2/3-column) rendered with headless Chrome to produce image–text pairs.
- Iterative Self-improvement Stage (ISS): Use the warm-up model to annotate millions of real DocMatix pages; filter predictions (plain-text F1 vs PaddleOCR, HTML structural checks for tables, syntax checks for LaTeX); retrain on filtered set; repeat for several iterations.
Rule-based data filtering at scale:
- Bag-of-words F1 against PaddleOCR for text
- Structural validation for tables
- Syntax validity for formulas
Empirical insights about data and training:
- Sweet spot in synthetic data size ($\sim$800K samples) beyond which performance drops due to distribution mismatch
- Restricting aspect ratio to $\sim[0.4, 2.5]$ removes pathological images and improves results
- Always re-initializing each iteration from the base pre-trained model (POINTS-1.5) is better than continuing from the previous iteration
- Including UWS data in later ISS iterations helps, despite its synthetic nature, because its annotations are very clean
What experiments were performed?
Warm-up ablations (OmniDocBench overall metric, lower is better):
- Incrementally adding each synthetic category (plain text, then plus formulas, then plus tables, then plus multi-column tables) while keeping 200K samples per category (800K total)
- Scaling total synthetic data from 100K to 1.2M to find saturation/overfitting point
- Filtering images by aspect ratio ranges to prune extreme shapes
Self-improvement ablations:
- Adding filters in stages (text, then table, then formula) and measuring improvements
- Sweeping F1 thresholds for text filtering (0.70, 0.80, 0.90, 0.95)
- Varying sampling ratios for plain text vs tables vs formulas during training
- Comparing initialization from previous iteration vs re-starting from the original POINTS-1.5 weights
- Tracking performance vs ISS iteration count (up to five rounds), with curves for OmniDocBench, global F1 vs PaddleOCR, and retained data volume
Dataset comparison:
- Compare the final distillation-free dataset ($\sim$1.1M samples) to KOSMOS-2.5 training data and olmOCR data (size, table format, distillation vs non-distillation, language coverage)
Baseline comparisons on OmniDocBench (en) and Fox-Page-en:
- Pipeline methods: MinerU, Marker, Mathpix
- General VLMs: Qwen2.5-VL (3B, 7B, 72B)
- Expert OCR/VLM: GOT-OCR, Nougat, Mistral OCR, OLMOCR
- Additionally: a version of POINTS-Reader trained directly on distillation data from Qwen2.5-VL-72B
What are the outcomes/limitations?
Outcomes (as reported):
On OmniDocBench (en):
- POINTS-Reader (4B) overall edit distance 0.259, beating several larger specialized and general models:
- Qwen2.5-VL-3B: 0.390; Qwen2.5-VL-7B: 0.331; GOT-OCR: 0.287; OLMOCR-7B: 0.326
- Particularly strong table performance: 0.335 vs 0.341 for Qwen2.5-VL-72B and much better than GOT-OCR on the same metric (gap $\approx 0.197$)
On Fox-Page-en (edit distance, lower is better):
- POINTS-Reader: 0.023, outperforming all listed general VLMs and OCR baselines in the table (e.g., Qwen2.5-VL-72B at 0.027)
Compared to a model directly distilled from Qwen2.5-VL-72B, the distillation-free POINTS-Reader is decisively better (overall 0.259 vs 0.302) on OmniDocBench.
However, pipeline systems like MinerU and Mathpix still achieve lower error overall than any end-to-end VLM, including POINTS-Reader, highlighting remaining performance gaps.
ISS curves show monotonic improvement across iterations on OmniDocBench, global F1 vs PaddleOCR rising from $\sim$0.70 up to $\sim$0.84, and number of retained filtered samples increasing for text, tables, and formulas over iterations.
Limitations (acknowledged by authors):
- Language: Training and evaluation are English-only; multi-lingual and CJK support are future work
- Content type: Datasets are largely printed fonts; performance on handwritten notes is suboptimal. The model currently extracts only plain text, formulas, and tables, not images or figure locations
- Data supervision: Table and formula filters check only structure and syntax, not semantic correctness, so some residual label noise remains
- Benchmark scope: Focus is on Fox and OmniDocBench; there is no evaluation on real production workflows or domain-specific documents (e.g., handwritten forms, multi-language PDFs)
Model
Backbone and overall architecture
- Base model: POINTS-1.5 VLM as the visual encoder plus multimodal connector
- Language backbone: Qwen2.5-3B-Instruct as the LLM head, chosen for efficiency vs quality trade-off
- Parameter scale: $\sim$4B parameters
- Context length: 8192 tokens in this work (longer than default POINTS-1.5)
- Training paradigm inherited from POINTS-1.5:
- Pretraining on generic vision–language data
- Visual instruction tuning (VIT) with both the synthetic UWS data and filtered ISS data, plus general data from POINTS-1.5
- All hyperparameters and settings except VIT data and context length are identical to POINTS-1.5, so this paper’s novelty sits almost entirely in the data and training pipeline rather than architecture tweaks
Input and output format
- Input: Document page images, often multi-column, with tables and formulas. Real data comes from PDF-derived PNGs in DocMatix
- Output: Single text sequence in the unified markup format:
- Plain text: Markdown headings, lists, bold/italic, etc.
- Tables: Minimal HTML table markup with no whitespace between tags, no CSS apart from merged-cell attributes; same format used for both synthetic and real annotations
- Formulas: LaTeX obeying KaTeX rules, `$...$` inline, `$$...$$` display
Data
1. Uniform Format Warm-up (Synthetic)
Unified output format:
- Text: Markdown (following Kosmos-2.5 style choices)
- Tables: HTML to handle complex structures (merged cells) more flexibly than Markdown; LaTeX tables rejected due to non-standard and variable syntax
- Formulas: KaTeX-compatible LaTeX syntax
Generation pipeline:
- LLM text generation: Prompts instruct a large language model to generate document-like Markdown, including topic selection, 300–800 words depending on category, style choices from exam papers, slides, academic papers, books, textbooks, magazines, notes, newspapers, and financial reports, and optional use of Markdown constructs (headings, lists, bold/italic, subscripts/superscripts). Formula-containing prompts encourage varied LaTeX constructs (matrices, `align`, `gather`, `frac`, `sum`, etc.), mixing inline and display math. Multi-column prompts split content into two logical chunks separated by a marker string `"x----------x"` for later layout templating.
- Category design: Four synthetic categories (plain text; text plus formulas; text plus tables, single-column; multi-column layouts with tables)
- Table enrichment: LLM-prompting alone tended to produce simple tables, so they augment with real tables from PubTabNet training set, generating descriptive paragraphs and then inserting those tables into the text at random positions
- Filtering (synthetic side): Apply the same LaTeX formula and HTML table filters as in ISS (syntax/structure checks) before rendering
- Rendering: Convert the unified text into HTML and render images with 1/2/3-column templates using Chrome headless mode
Synthetic dataset scale:
- For ablations: 200K filtered samples per category yields 800K total; this configuration gives the best performance
- Scaling from 100K to 1.2M samples shows gains up to $\sim$800K, then degradation (overfitting to synthetic layouts that diverge from real PDFs)
- Aspect ratio filtering: Inspect distribution of width/height and cut off extreme aspect ratios outside $[2/5, 5/2]$. This removes overly tall or wide renderings and measurably improves OmniDocBench scores
2. Iterative Self-improvement (Real Data)
Base real-world corpus:
- DocMatix: More than 2M images derived from PDF/A, covering academic papers, textbooks, exams, and various document types
- Total used: 2,234,134 images; in the final ISS iteration, $\sim$1.1M (1,096,325) pass filters and are used for training
Element distribution (final ISS iteration):
- Only plain text: 90.2% of samples
- Contains tables: 6.5%
- Contains formulas: 3.3% (about 0.1% have both tables and formulas)
Token length distribution:
- Most training samples are shorter than 1,000 tokens
Final dataset comparison:
| Dataset | Size | Tables Format | Distillation | Language |
|---|---|---|---|---|
| KOSMOS-2.5 | 357.4M | Markdown | Yes | English |
| olmOCR | 260K | Markdown | Yes (GPT-4o) | English |
| POINTS-Reader | $\sim$1.1M | HTML | No | English |
Algorithms and Training
1. Plain-text F1 filtering
For text quality control, they compare POINTS-Reader predictions to PaddleOCR outputs as pseudo-ground-truth:
- Normalize both prediction and reference strings (strip non-alphanumeric characters, split on spaces and count occurrences of each token)
- Let $P = \{(u^p_i, c^p_i)\}$ be prediction tokens and counts, $T = \{(u^t_i, c^t_i)\}$ be reference tokens and counts
- Define:
$$\text{Precision} = \frac{\sum_i \min(c^p_i, c^t_i)}{\sum_i c^p_i},\quad \text{Recall} = \frac{\sum_i \min(c^p_i, c^t_i)}{\sum_i c^t_i}$$
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
- Discard samples with $F_1$ below a threshold
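A small sketch of this bag-of-words F1 check; the exact normalization (here: lowercasing and stripping non-alphanumerics) is an assumption beyond what the paper specifies.

```python
import re
from collections import Counter

def bag_of_words_f1(prediction: str, reference: str) -> float:
    """Token-count F1 between a model prediction and a PaddleOCR pseudo-reference."""
    def counts(text: str) -> Counter:
        # Normalize: strip non-alphanumeric characters, split on whitespace, count tokens.
        return Counter(re.sub(r"[^0-9a-zA-Z\s]", " ", text).lower().split())

    p, t = counts(prediction), counts(reference)
    overlap = sum(min(p[tok], t[tok]) for tok in p)  # sum_i min(c^p_i, c^t_i)
    if not overlap:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

# keep_sample = bag_of_words_f1(pred, paddle_ocr_text) >= 0.90   # best threshold per ablation
```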
Ablation on thresholds: 0.90 gives the best overall OmniDocBench score, balancing quality and coverage; 0.70 and 0.95 both hurt performance relative to 0.90.
2. Table filtering
They do not use external structure-recognition models as teachers, due to robustness issues and table-only assumptions. Instead, they parse each HTML table in the output and enforce structural validity:
- Consistent numbers of cells per row/column
- Valid merged-cell attributes
Samples with malformed table structures are dropped.
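One way such a structural check could look; this is a simplified sketch that assumes XHTML-like markup and ignores rowspan carry-over, which a real filter would have to handle.

```python
import xml.etree.ElementTree as ET

def table_structure_ok(table_html: str) -> bool:
    """Reject tables whose rows do not cover a consistent number of columns."""
    try:
        table = ET.fromstring(table_html)  # assumes well-formed, XHTML-like markup
    except ET.ParseError:
        return False
    widths = []
    for row in table.iter("tr"):
        width = 0
        for cell in row:
            if cell.tag not in ("td", "th"):
                return False
            colspan = cell.get("colspan", "1")
            if not colspan.isdigit() or int(colspan) < 1:
                return False
            width += int(colspan)
        widths.append(width)
    # NOTE: rowspan carry-over is ignored here; tables using it need extra bookkeeping.
    return bool(widths) and len(set(widths)) == 1
```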
3. Formula filtering
- Extract all LaTeX formulas from the output and run syntax checks (KaTeX-compatible)
- Samples containing any syntactically invalid formula are discarded
- They explicitly do not check semantic correctness (e.g., whether the math equals the underlying document)
4. Iterative self-improvement loop
Each ISS iteration:
- Run the current model on the full DocMatix dataset to produce annotations (image to unified markup)
- Apply the three filters in order: text F1, table structure, formula syntax
- Use all retained image–text pairs to perform visual instruction tuning from the base POINTS-1.5 weights, not from the previous ISS iteration’s weights (this choice is empirically better)
- Include UWS synthetic data in ISS training; this reliably improves performance
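Putting the loop together as pseudocode-style Python; the callables and iteration count are placeholders rather than the paper's actual training harness.

```python
def self_improve(base_weights, docmatix_images, uws_pairs,
                 annotate, passes_filters, train, n_iterations=5):
    """Hedged sketch of the iterative self-improvement (ISS) loop."""
    model = train(base_weights, uws_pairs)  # warm-up model from synthetic UWS data
    for _ in range(n_iterations):
        labeled = [(img, annotate(model, img)) for img in docmatix_images]
        kept = [(img, y) for img, y in labeled if passes_filters(img, y)]
        # Re-initialize from the base POINTS-1.5 weights each round (empirically better
        # than continuing from the previous iteration) and keep UWS data in the mix.
        model = train(base_weights, kept + uws_pairs)
    return model
```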
Findings from ISS:
- Each iteration improves both OmniDocBench performance and DocMatix F1 vs PaddleOCR
- The volume of retained samples with text/tables/formulas grows per iteration, indicating the model is generating more usable annotations
- Improvements come from a combination of better original training data and increased diversity from new filtered samples
5. Sampling ratios
Because plain-text-only samples dominate, they tested aggressive rebalancing: lowering the plain-text sampling probability and up-weighting tables and formulas. This rebalancing hurts performance; best results come from sampling directly from the natural distribution (1.0:1.0:1.0 weights). Hypothesis: down-sampling text reduces diversity without increasing table/formula diversity (just repeats the same small set more often), degrading generalization.
Evaluation
Benchmarks and metrics
- OmniDocBench (English split): 19 layout types; evaluates text edit distance, formula edit distance, table edit distance, reading order correctness. Overall metric is a blend of these (all normalized edit distances; lower is better)
- Fox-Page-en (English split of Fox): 112 pages, single and double column, each with more than 1K words. Metric: normalized edit distance between model outputs and references; lower is better
Core comparisons
Pipeline methods (MinerU, Marker, Mathpix):
- Achieve the lowest OmniDocBench error overall; still the strongest approach for document conversion in this comparison
General VLMs:
- Qwen2.5-VL-72B is the strongest among generic models (overall $\sim$0.214) but much larger than POINTS-Reader
- Smaller Qwen2.5-VL (3B, 7B) lag behind POINTS-Reader on both OmniDocBench and Fox
Expert VLM/OCR:
- GOT-OCR, Nougat, Mistral OCR, OLMOCR all underperform relative to POINTS-Reader on the combined OmniDocBench metrics, especially for tables
- POINTS-Reader vs Mistral OCR: substantial gain in OmniDocBench overall (0.259 vs 0.268) and much better tables
Ablation highlights
| Experiment | Configuration | OmniDocBench Overall $\downarrow$ |
|---|---|---|
| Synthetic diversity | Plain text only | 0.626 |
| | + Formulas | 0.579 |
| | + Tables | 0.538 |
| | + Multi-column tables | 0.510 |
| Aspect ratio filtering | No filtering | 0.515 |
| | $[2/5, 5/2]$ | 0.498 |
| ISS filtering | No filters | 0.493 |
| | + Text F1 filter | 0.463 |
| | + Table filter | 0.447 |
| | + Formula filter | 0.439 |
| Distillation comparison | Distilled from Qwen2.5-VL-72B | 0.302 |
| | POINTS-Reader ISS (no distill) | 0.259 |
Hardware and Production
- All experiments run on 64 $\times$ NVIDIA H800 GPUs
- Approximate costs reported:
- Training on 1M samples $\approx$ 7 hours
- Inference on 2M DocMatix images $\approx$ 10 hours using SGLang for deployment-time serving
- The paper does not detail batching strategies or peak throughput, but given the 4B parameter count and 8192-token context, the model is engineered to be relatively efficient compared to 70B-class VLMs, while still relying on a sizable GPU cluster for each self-improvement iteration
Note: This analysis follows the Roots Labs OCR paper-notes guidelines and classification taxonomy. For academic or production use, consult the original paper and verify claims through independent evaluation.
2025-10-deepseek-ocr
DeepSeek-OCR — Notes
TL;DR
DeepSeek-OCR is a VLM that treats an image of text as a compression medium: an encoder (“DeepEncoder”) maps high-res document images into a small number of vision tokens, and a 3B MoE decoder reconstructs the text. The authors report $\approx 97\%$ OCR decoding precision when the text-token count is $\le 10\times$ the vision-token count, and $\approx 60\%$ at $\approx 20\times$ compression; on OmniDocBench it aims to be competitive while using relatively few vision tokens.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (new encoder design + multi-resolution token budgeting + end-to-end OCR system). Secondary: $\Psi_{\text{Evaluation}}$ (compression study on Fox; benchmarking on OmniDocBench); $\Psi_{\text{Resource}}$ (code/weights released; curated OCR training data).
What is the motivation?
- Long-context LLM compute scales quadratically with sequence length, so “just feed more text” is expensive.
- A document image can represent many words, suggesting vision tokens could serve as a cheaper “compressed representation” of long text.
- Prior open-source VLM vision encoders (dual-tower, tiling, NaViT-style adaptive resolution) each have issues for this use case: deployment complexity, too many vision tokens, high activation memory, or long packed sequences.
What is the novelty?
- Contexts optical compression framing: treat OCR as a compression-decompression mapping between document pixels $\rightarrow$ small vision-token sequence $\rightarrow$ reconstructed text, and study how many vision tokens are “enough” for a given amount of text.
- DeepEncoder: serially connects a window-attention perception module (SAM-base) to a dense global-attention knowledge module (CLIP-large) using a $16\times$ token compressor so high-res inputs do not explode dense attention cost.
- Multi-resolution token modes (Tiny/Small/Base/Large + dynamic tiling modes “Gundam”) to vary token budgets and study compression ratios.
What experiments were performed?
- Vision-text compression study: Fox benchmark, English docs with 600–1300 tokenizer tokens; evaluate OCR precision under 64 or 100 vision tokens (Tiny/Small). Reports precision vs. compression ratio.
- OCR practical performance: OmniDocBench; compares edit distance across categories (English/Chinese; text/formula/table/order; and per-document-type breakdown) across multiple model modes (Tiny $\rightarrow$ Gundam-M). Includes comparisons to pipeline OCR systems and other end-to-end VLM OCR models.
- Qualitative “deep parsing” demos: secondary calls to parse charts, natural images, chemical formulas, and simple geometry with a unified prompt style.
What are the outcomes/limitations?
Outcomes (reported):
- Fox compression: within $\approx 10\times$ compression, decoding precision reaches $\approx 97\%$; around $\approx 20\times$ compression precision is still $\approx 60\%$.
- OmniDocBench: at 100 vision tokens, the model is reported to surpass GOT-OCR2.0 (which uses 256 tokens/page in their table); at higher token modes (400, <800), it approaches or beats stronger baselines while using fewer tokens than some competitors.
- The paper positions this as evidence that compact decoders can learn to reconstruct text from heavily compressed visual latents, and proposes this as a direction for long-context handling.
Limitations / caveats (explicit or strongly implied):
- The Fox evaluation notes that formatting mismatches between output and ground truth can understate “true” OCR accuracy (their prompt-output format may not match the benchmark exactly).
- Degradation beyond $\approx 10\times$ is attributed to (a) more complex layouts in longer docs and (b) blur at 512–640 resolutions. Their proposed “fix” for (a) is rendering text into a single layout page, and they frame (b) as related to intentional forgetting.
- They state there is no SFT stage, so the model is “not a chatbot” and some capabilities require completion-style prompts.
- Several future-eval items are explicitly left open (digital-optical text interleaved pretraining, needle-in-a-haystack testing).
Model
High-level architecture

End-to-end VLM: DeepEncoder (encoder) + DeepSeek-3B-MoE (decoder).
DeepEncoder params: $\approx 380\text{M}$, composed of:
- SAM-base (patch size 16), $\approx 80\text{M}$ params, used as perception module dominated by window attention.
- CLIP-large, $\approx 300\text{M}$ params, used as knowledge module with dense global attention.
- A $16\times$ token compressor between them.
Decoder: DeepSeek-3B-MoE (routed MoE), 570M activated parameters at inference; activates 6 of 64 routed experts + 2 shared experts.
Token compressor details
Implemented as a 2-layer convolution module for $16\times$ downsampling of vision tokens:
- Each conv layer: kernel size 3, stride 2, padding 1.
- Channel increase: 256 $\rightarrow$ 1024.
Example: a 1024×1024 input yields $1024/16 \times 1024/16 = 4096$ patch tokens into the first stage; after the compressor, tokens reduce to $4096/16 = 256$ before dense global attention.
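A PyTorch sketch consistent with the stated hyperparameters; the intermediate channel width (here 512) and the activation between the two convolutions are assumptions, since the paper only specifies kernel 3 / stride 2 / padding 1 and the 256 $\rightarrow$ 1024 channel increase.

```python
import torch
from torch import nn

class TokenCompressor(nn.Module):
    """16x token downsampler between SAM-base and CLIP-large (hedged sketch).
    Two stride-2 convs halve each spatial dim twice: 4x per side = 16x fewer tokens."""
    def __init__(self, in_ch: int = 256, mid_ch: int = 512, out_ch: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1),
            nn.GELU(),  # activation choice is an assumption
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 256, 64, 64) from a 1024x1024 input with patch size 16 (4096 tokens)
        return self.net(x)  # (B, 1024, 16, 16) -> 256 tokens after flattening
```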
Why this design? The paper positions it between three common encoder families (dual-tower, tile-based, NaViT-style), avoiding their pitfalls (too many tokens, very low native resolution, huge activations):

Multi-resolution modes and token budgets

Native modes (single image, no tiling, fixed token budget):
- Tiny: $512 \times 512$ $\rightarrow$ 64 tokens (resize)
- Small: $640 \times 640$ $\rightarrow$ 100 tokens (resize)
- Base: $1024 \times 1024$ $\rightarrow$ 256 tokens (padding)
- Large: $1280 \times 1280$ $\rightarrow$ 400 tokens (padding)
Dynamic modes (global + tiles, variable token budget):
- Gundam: $n \times 640 \times 640$ tiles + $1024 \times 1024$ global $\rightarrow$ $n \times 100 + 256$ tokens, $n \in [2,9]$. If both sides $< 640$, it degrades to Base.
- Gundam-M: $n \times 1024 \times 1024$ tiles + $1280 \times 1280$ global $\rightarrow$ $n \times 256 + 400$ tokens (trained by continued training for load-balancing).
Positional encodings are dynamically interpolated to support all modes with a single model; tiling follows InternVL2.0.
For padded native modes, valid tokens:
$$N_{\text{valid}}=\left\lceil N_{\text{actual}}\times\left[1-\frac{\max(w,h)-\min(w,h)}{\max(w,h)}\right]\right\rceil$$
(Eq. 1).
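For instance, a hypothetical 800$\times$1100 page in Base mode (256 actual tokens) would count as 187 valid tokens under Eq. 1; the page size here is illustrative, not a reported figure.

```python
import math

def valid_tokens(n_actual: int, w: int, h: int) -> int:
    """Eq. 1: discount padded tokens by the page's aspect-ratio imbalance."""
    return math.ceil(n_actual * (1 - (max(w, h) - min(w, h)) / max(w, h)))

print(valid_tokens(256, 800, 1100))  # 187
```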
Data
OCR 1.0 data (documents + scene text)
30M PDF pages across $\approx 100$ languages; Chinese+English $\approx 25\text{M}$ pages, other languages $\approx 5\text{M}$.
Two annotation types:
- Coarse annotations extracted via `fitz` for broad OCR coverage, especially minority languages.
- Fine annotations: 2M pages Chinese + 2M pages English, labeled using layout (PP-DocLayout) + OCR systems (MinerU, GOT-OCR2.0) to create detection+recognition interleaved ground truth; for minority languages they describe a “model flywheel” using patch OCR to label patches after layout processing, producing 600K samples.
3M Word documents: extract content to create high-quality image-text pairs (notably helpful for formulas and HTML tables).
Natural scene OCR:
- Sources: LAION, Wukong; labels via PaddleOCR.
- 10M Chinese + 10M English samples.
- Prompts can control whether detection boxes are output.
OCR 2.0 data (charts, chemistry, geometry)
- Charts: 10M rendered images (pyecharts + matplotlib); labels are HTML tables (explicitly not OneChart dictionary format) to save tokens.
- Chemical formulas: 5M image-text pairs; source SMILES from PubChem; render via RDKit; output target is SMILES.
- Plane geometry: 1M samples; generation follows “Slow Perception”; uses a perception ruler size of 4 per line segment; includes translation-invariant augmentation.
General vision + text-only mix
- General vision data for caption/detection/grounding is included to preserve a general vision interface; authors state it is 20% of training data.
- Text-only pretraining data is 10%, processed to 8192 tokens (also the full model sequence length).
- Overall mix: 70% OCR, 20% general vision, 10% text-only.
Algorithms / Training
Stage A: Train DeepEncoder
- Objective: next-token prediction using a compact language model, following Vary-style training.
- Data: all OCR 1.0 + OCR 2.0 + 100M general samples from LAION.
- Training: 2 epochs, batch size 1280, AdamW, cosine annealing, learning rate $5\times10^{-5}$, sequence length 4096.
Stage B: Train full DeepSeek-OCR
Platform: HAI-LLM.
Pipeline parallelism: 4 partitions:
- PP0: SAM + compressor as “vision tokenizer”, frozen.
- PP1: CLIP part as embedding layer, unfrozen.
- PP2/PP3: DeepSeek3B-MoE has 12 layers, split 6 + 6.
Compute: 20 nodes, each 8× A100-40G; DP=40; global batch size 640.
Optimizer/schedule: AdamW, step-based scheduler, initial LR $3\times10^{-5}$.
Throughput: text-only $\approx 90\text{B}$ tokens/day; multimodal $\approx 70\text{B}$ tokens/day.
Gundam-M is trained via continued training on a trained model (they cite load balancing / speed issues if trained jointly).
Prompts / interfaces (examples shown in paper)
<image>
Free OCR.
<image>
<|grounding|>Convert the document to markdown.
<image>
Parse the figure.
(These are used to control layout output and deep parsing behavior.)
Hardware / Production
Data-generation throughput claims:
- Abstract: “200k+ pages per day” on a single A100-40G.
- Intro: “33 million pages per day” using 20 nodes (each 8× A100-40G).
The paper frames “contexts optical compression” as enabling a forgetting-like mechanism: render older dialogue rounds into images and progressively downsize them so older context gets blurrier and cheaper (Figure 13).
Evaluation
Fox benchmark: compression ratio vs precision
Setup: English Fox docs, ground-truth tokens 600–1300 (using DeepSeek-OCR tokenizer, vocab size $\approx 129\text{k}$), 100 pages; evaluate Tiny (64) and Small (100) modes.
Reported table highlights (precision, compression):
- 600–700 tokens: 64 toks $\rightarrow$ 96.5% at 10.5$\times$; 100 toks $\rightarrow$ 98.5% at 6.7$\times$
- 1200–1300 tokens: 64 toks $\rightarrow$ 59.1% at 19.7$\times$; 100 toks $\rightarrow$ 87.1% at 12.6$\times$
OmniDocBench: edit distance vs tokens
Metric: edit distance (lower is better), across English/Chinese and subcategories (text/formula/table/order), plus overall; “Tokens” is avg vision tokens per page, with parentheses indicating “valid” tokens for padded modes.
DeepSeek-OCR modes (tokens shown as reported):
- Tiny: 64
- Small: 100
- Base: 256 (182 valid)
- Large: 400 (285 valid)
- Gundam: 795
- Gundam-M @200dpi: 1853
Document-type breakdown suggests some categories (slides, books, reports) work well with low tokens, while newspapers require Gundam-class modes due to much higher text-token counts (4–5k).
2025-10-olmocr2
olmOCR 2: Unit Test Rewards for Document OCR
TL;DR
olmOCR 2 is an OCR system built around a 7B VLM (Qwen2.5-VL-7B-Instruct base) trained with reinforcement learning using verifiable rewards (RLVR), where rewards come from a suite of binary unit tests. The authors scale unit-test creation by generating synthetic document pages with ground-truth HTML and extracted test cases, then apply GRPO-based RL to improve performance. They report an overall olmOCR-Bench score of 82.4 $\pm$ 1.1, a +14.2 point improvement over the initial olmOCR release, with large gains in math/table/multi-column handling.
What kind of paper is this?
Primarily $\Psi_{\text{Method}}$, with substantial $\Psi_{\text{Resource}}$ and $\Psi_{\text{Evaluation}}$ components.
- Dominant: $\Psi_{\text{Method}}$: The center of gravity is the training recipe (synthetic unit-test generation + GRPO RLVR) and the inference/system changes driving benchmark gains.
- Secondary: $\Psi_{\text{Resource}}$: The authors explicitly emphasize releasing model/data/code under permissive open licenses.
- Secondary: $\Psi_{\text{Evaluation}}$: A core argument is why binary unit tests can be preferable to edit distance for OCR correctness, supported by Figures 1–2 (p.3).
A rough superposition: 0.50 $\Psi_{\text{Method}}$ + 0.30 $\Psi_{\text{Resource}}$ + 0.20 $\Psi_{\text{Evaluation}}$.
What is the motivation?
- Need for clean, naturally ordered text from PDFs: The target is digitized print documents (like PDFs) converted into “clean, naturally ordered plain text.”
- Manual unit tests do not scale: The original olmOCR-Bench unit tests were manually verified and “took hours of work” to create/check, which blocks RL scaling.
- Edit distance is a weak proxy for OCR correctness in key cases:
- Floating elements lead to “ties” (multiple equivalent linearizations) that unit tests can treat equivalently, but edit distance can reward/penalize arbitrarily (Figure 1, p.3).
- Continuous scores can miss what matters: reading order vs caption placement; rendered correctness of equations vs LaTeX string similarity (Figure 2, p.3).
- Goal: Build a scalable pipeline where OCR outputs can be automatically verified via unit tests, enabling RLVR training at scale.
What is the novelty?
The paper combines and operationalizes several ideas into a cohesive training loop:
- Binary unit tests as verifiable RL rewards: Rewards are a “diverse set of binary unit tests” used for RLVR, rather than using only continuous text similarity.
- Synthetic pipeline to generate unit tests at scale: They create synthetic documents with known ground-truth HTML source and extracted test cases, enabling programmatic unit-test creation.
- Iterative PDF-to-HTML generation via a general VLM (Figure 3, p.4): They sample a real page and prompt a general VLM (`claude-sonnet-4-20250514`) to produce a highly similar HTML page; the rendered HTML image paired with raw HTML becomes supervision.
- GRPO-based RLVR on OCR: They apply Group Relative Policy Optimization (GRPO) using unit tests as binary reward signals.
- System/inference engineering that matters for OCR: Dynamic temperature scaling to avoid repetition loops, prompt-order standardization, YAML output format changes, and image resizing changes are described as meaningful contributors.
Contrast to Infinity Parser: The authors describe Infinity Parser as the closest related work; their stated key difference is binary unit tests as rewards vs Infinity Parser’s reward based on edit distance/paragraph count/structural consistency, plus differences in how real content seeds HTML generation.
What experiments were performed?
- Benchmarking on olmOCR-Bench (English): The benchmark measures unit-test types including:
- Text presence/absence, natural reading order, table accuracy, math formula accuracy via KaTeX rendering, baseline robustness.
- Comparisons vs other OCR systems (Table 1, p.2): They list multiple baselines (API-only, open-source, and VLM-based OCR systems), and report olmOCR 2 at 82.4 $\pm$ 1.1. The table caption notes their reproduction policy: results are reproduced in-house except those marked with `*`.
- Ablation-style incremental development breakdown (Table 3, p.7): They show stepwise improvements from “olmOCR (first release)” to “Synth data, RLVR, souping” and the resulting overall score changes.
- SFT dataset refresh comparison (Table 2, p.5): One-epoch finetuning on `olmOCR-mix-0225` vs `olmOCR-mix-1025` with per-slice benchmark scores and overall results.
- Metric motivation experiments (Figures 1–2, p.3):
- Figure 1: unit test vs edit distance for reading order errors with floating caption; unit tests treat valid placements as ties while edit distance penalizes some ordering choices more than others.
- Figure 2: unit test vs edit distance for math parsing; rendering-based checks can pass outputs with worse LaTeX edit distance and fail outputs with better edit distance.
What are the outcomes/limitations?
Outcomes reported:
- Overall improvement: +14.2 point improvement over the initial release when evaluated on the latest olmOCR-Bench, moving from 68.2 to 82.4.
- Where gains concentrate: The largest improvements are in math formula conversion, table parsing, and multi-column layouts.
- Open artifacts: Model/data/code released under permissive open licenses. Table 1 is structured to emphasize openness across model weights/training data/training code/inference code.
Limitations and open questions:
- Comparisons are heterogeneous: Table 1 includes systems with scores marked `*` (reported by authors, not reproduced by the olmOCR team) and some entries show “$\pm$ ?” uncertainty, so “best overall” claims depend on which subset you view as comparable.
- Unit-test coverage is a bottleneck: Any unit-test regime is inherently limited to the properties it encodes (a general risk for test-suite-based rewards). The authors explicitly want to extend the synthetic pipeline to more complicated document types/unit tests.
- Reliance on frontier model tooling in the pipeline: Synthetic generation uses `claude-sonnet-4-20250514`, and the refreshed SFT mix is processed using GPT-4.1; reproducing the pipeline end-to-end may require access to these systems.
- Binary vs continuous metrics: The paper argues binary tests often align better with correctness; still, calibrated continuous scores for non-math targets remain open work.
Model
Core model
- `olmOCR-2-7B-1025`: A specialized 7B vision-language model trained with RLVR.
- Base family: Starts from `Qwen2.5-VL-7B-Instruct` fine-tuned on `olmOCR-mix-1025`. The authors explicitly note switching from Qwen 2 VL to Qwen 2.5 VL for a slight benchmark gain.
- FP8 variant: `olmOCR-2-7B-1025-FP8` also released.
Output format and schema constraints
- Switched from JSON to YAML: To reduce retries; the authors speculate YAML avoids quote-count bookkeeping and reduces repetition loops.
- Document metadata at top: A reward term enforces that model outputs document metadata at the top (examples given: primary language, rotation correction factor).
Inference engineering (Section 4)
The authors highlight several system changes that materially affect end quality:
- Dynamic temperature scaling: Start at 0.1 and increase stepwise (0.2, 0.3, …) up to max 0.8 when EOS is not produced, to prevent repetition loops (see the sketch after this list).
- Prompt ordering: Fix mismatch between training and inference prompt order by always placing text first; authors report improved performance and note this enables prompt caching.
- Image resizing: Increase from 1024 px to 1288 px on longest edge; authors describe this as a chosen balance between score and inference speed.
- Blank-page handling: Fix data loader skipping blank pages that caused hallucinations; no benchmark impact reported.
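A minimal sketch of the escalating-temperature retry mentioned in the first item above; the `generate` callable and its return shape are placeholders, not the released inference code.

```python
def generate_with_retries(generate, prompt,
                          temperatures=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Retry with a higher sampling temperature until the model emits an EOS token."""
    text = ""
    for temp in temperatures:
        text, ended_with_eos = generate(prompt, temperature=temp)  # hypothetical callable
        if ended_with_eos:
            return text
    return text  # last attempt at the 0.8 cap, even if it never terminated cleanly
```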
Contrast to DeepSeek-OCR: DeepSeek quantifies a frontier of 97% OCR at ~10$\times$ text$\rightarrow$vision compression with a SAM$\rightarrow$conv-16$\times$$\rightarrow$CLIP DeepEncoder (and token-economical multi-res modes), whereas olmOCR 2 largely keeps model size fixed and improves policy quality via unit-test rewards.
Data
Supervised fine-tuning (SFT) data: olmOCR-mix-1025
- Size: 267,962 pages from over 100,000 PDFs, including 9,828 pages from national archives.
- Delta vs previous mix (`olmOCR-mix-0225`): Reprocessed using GPT-4.1 instead of GPT-4o; more consistent equation formatting using `\[ ... \]` and `\( ... \)`; tables in HTML format; basic alt text for images.
Synthetic RL data: olmOCR2-synthmix-1025
- Size: 2,186 PDF pages with 30,381 unit test cases.
- Sourcing strategy: Sample real documents with “relevant, difficult-to-OCR material,” including arXiv math-heavy papers for equation-focused tests.
- General VLM used for HTML generation: `claude-sonnet-4-20250514`.
- Cost: Approximately $0.12 per document page (for the synthetic pipeline).
- Hallucination robustness claim: Pipeline is robust to Claude OCR errors because unit tests are generated from the HTML output alone.
Unit-test families
- Text presence/absence
- Natural reading order
- Table cell position
- Formula rendering (DOM/KaTeX)
- Baseline robustness (repeated n-grams, non-target language characters)
Algorithms / Training
Pipeline overview
The authors describe a two-part training recipe: (1) build a synthetic pipeline that renders documents into clean HTML and produces verifiable unit tests, then (2) apply GRPO to train using these binary reward signals.
Synthetic PDF-to-HTML generation steps
They break conversion into three steps (Figure 3, p.4):
- Layout analysis: Prompt a general VLM to describe layout (columns, images/tables, headers/footers) to guide HTML generation.
- Content rendering: Prompt again to “render this document as clean, semantic HTML” within original dimensions.
- Output refinement: Render generated HTML to an image, then prompt the VLM to refine HTML to better match the original image.
Unit test creation from HTML semantics
- Header/footer tags: By requiring `<header>` and `<footer>` in HTML, they can generate “Text Absence” tests for those elements.
- Math: Equations are rendered with KaTeX, enabling extraction and test creation for formula correctness.
- Tables: Extracted from ground truth; random cells sampled to create position/structure tests.
RLVR training details
- Base: Qwen2.5-VL-7B-Instruct fine-tuned on `olmOCR-mix-1025`.
- RL epoch: One epoch of RL training on `olmOCR2-synthmix-1025`.
- Compute: 8$\times$H100 GPU node.
- Sampling policy: 28 completions generated per document.
- Reward definition:
- Each unit test is pass/fail; reward is the fraction of passing tests, ranging 0.0 to 1.0.
- Figure 4 example: 4 of 6 tests passing yields page-level reward 0.67.
- Additional format rewards:
- Binary reward for ending with EOS token.
- Reward (0 to 1) for including document metadata at the top of the response.
- Implementation library and regularization: Hugging Face TRL with KL divergence $\beta = 0.01$.
- Model souping: Train six models with different random seeds and average (“soup”) their weights. They further note using importance sampling at token level (3 runs) and sequence level (3 runs) in that set of six runs.
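A sketch of the unit-test reward term; how the EOS and metadata rewards are weighted against it is not spelled out here, so they are only noted in comments.

```python
from typing import Callable, Sequence

def unit_test_reward(output: str, unit_tests: Sequence[Callable[[str], bool]]) -> float:
    """Fraction of binary unit tests the OCR output passes (0.0 to 1.0)."""
    if not unit_tests:
        return 0.0
    return sum(bool(test(output)) for test in unit_tests) / len(unit_tests)

# Figure 4 example: 4 of 6 tests pass -> page-level reward of 0.67.
# The binary EOS reward and the 0-to-1 document-metadata reward are added on top;
# their relative weighting is not specified in this summary.
```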
Evaluation
Benchmark properties (unit tests)
The paper reiterates that olmOCR-Bench uses unit tests that can check: text presence, text absence (headers/footers/page numbers), reading order, table accuracy, math formula accuracy via KaTeX rendering, and baseline robustness against repeated n-grams or non-target language characters.
Why unit tests vs edit distance (Figures 1–2)
- Reading order “ties” (Figure 1, p.3): Floating caption can be correct either before or after a passage; unit tests can score both as equivalent, while edit distance penalizes some valid placements and partially rewards some invalid ones.
- Math formula rendering (Figure 2, p.3): Visually equivalent renderings can matter more than LaTeX string similarity; model A can have worse edit distance but pass rendering-based unit test, and vice versa.
Key results
Table 1 (system-level comparison, p.2):
- olmOCR 2 overall: 82.4 $\pm$ 1.1
- olmOCR (first release) overall: 68.2 $\pm$ 1.1
- Nearby systems (some marked `*` as reported by authors, not reproduced):
- Chandra OCR 0.1.0: 83.1 $\pm$ 0.9*
- Infinity-Parser 7B: 82.5 $\pm$ ?*
- dots.OCR: 79.1 $\pm$ 1.0*
- PaddleOCR-VL: 80.0 $\pm$ 1.0*
- The table annotates openness/licenses; olmOCR 2 highlights being fully open (weights, data, code).
Table 3 (incremental development breakdown, overall column, p.7):
- First release: 68.2 $\pm$ 1.1
- Dynamic temperature scaling: 72.8 $\pm$ 1.2
- Better prompting: 75.8 $\pm$ 1.0
- New trainer, YAML, image resize, Qwen 2.5 VL: 78.5 $\pm$ 1.1
- Handle blank pages: 78.5 $\pm$ 1.1
- Synth data, RLVR, souping: 82.4 $\pm$ 1.1
Biggest gains come from the RLVR stage.
Table 2 (SFT-only comparison, p.5):
One epoch finetuning on olmOCR-mix-0225 vs olmOCR-mix-1025 yields overall 78.5 $\pm$ 1.1 vs 78.3 $\pm$ 1.2, with mix-1025 showing better table slice score (77.9 vs 72.9) but worse “ArXiv” slice score (70.8 vs 78.6).
Hardware / Production
- RL training compute: 8$\times$H100 node, one epoch on synthetic RL data.
- Synthetic pipeline cost: Approximately $0.12 per document page using Claude Sonnet 4, with the note that hallucinations in Claude OCR do not affect their unit-test generation because they rely on HTML outputs.
- Production/API mention (high level): The authors thank inference partners DeepInfra and Parasail for helping set up public API access, but the report does not give latency/throughput numbers.
2025-10-paddleocr-vl
PaddleOCR‑VL — Notes
TL;DR
PaddleOCR‑VL is a two‑stage document parsing system: a lightweight layout analyzer (detect + reading order) followed by a 0.9B VLM (NaViT‑style dynamic‑resolution visual encoder + ERNIE‑4.5‑0.3B LM) that recognizes text, tables (OTSL), formulas (LaTeX), and charts (Markdown tables) across 109 languages. It posts SOTA/near‑SOTA on OmniDocBench v1.0/v1.5 and olmOCR‑Bench, while keeping inference lean via multithreaded, batched pipelines. Compared with end‑to‑end OCR VLMs, the decoupled layout stage improves reading‑order stability, latency, and resource use.
What kind of paper is this?
- Dominant: $\Psi_{\text{Method}}$ (new system decomposition + VLM architecture choices + training recipe).
- Secondary: $\Psi_{\text{Resource}}$ (large-scale data construction pipeline with 30M+ samples across multiple synthesized/auto-labeled corpora).
- Also present: $\Psi_{\text{Evaluation}}$ (explicit throughput/VRAM benchmarking; inference efficiency comparisons across hardware).
What is the motivation?
- Document parsing needs layout understanding + reading order + element extraction (text/tables/formulas/charts) to support downstream retrieval and RAG workflows.
- The paper frames a tradeoff:
- Pipeline systems: strong but complex integration and error propagation.
- End-to-end VLM conversion: simpler interface but suffers from reading-order issues, hallucinations, and long-sequence latency/memory costs.
- Goal: get the stability benefits of pipelines with the simplicity of a compact VLM for element recognition.
What is the novelty?
- Two-stage decomposition: PP-DocLayoutV2 handles layout detection + reading order; PaddleOCR-VL-0.9B recognizes each cropped element. Outputs merge into structured Markdown + JSON.
- Decoupled reading order: RT-DETR + pointer network before VLM decoding mitigates long-context autoregressive burden and stabilizes ordering (a common failure mode in end-to-end VLM OCR).
- NaViT-style native dynamic resolution: fewer hallucinations on dense pages vs. fixed-size or heavy tiling; preserves aspect ratio without extreme token counts.
- 0.9B “ultra-compact” VLM: 0.3B decoder + strong visual encoder yields favorable accuracy/latency balance for production.
- Data engine: loops LLM-refined labels and typed hard-case synthesis to target weaknesses with measurable metrics.
Contrast with DeepSeek-OCR: DeepSeek uses SAM $\rightarrow$ conv-compressor $\rightarrow$ CLIP DeepEncoder to shrink tokens 16$\times$ and studies text/vision token ratios ($\approx$97% OCR at $\sim$10$\times$ compression). PaddleOCR-VL instead optimizes robustness + throughput under a document-parsing workflow with explicit reading order.
What experiments were performed?
- Page-level parsing: OmniDocBench v1.5 (1,355 pages), OmniDocBench v1.0 (981 pages), olmOCR-Bench; comparisons against pipeline tools, general VLMs, and specialized OCR VLMs.
- Element-level recognition:
- Text: OmniDocBench-OCR-block (17,148 crops) + in-house OCR set (107,452 line-level samples) + Ocean-OCR-Handwritten.
- Tables: OmniDocBench-Table-block (512 crops) + in-house benchmark (20 table types).
- Formulas: OmniDocBench-Formula-block (1,050 crops) + in-house benchmark (34,816 samples).
- Charts: in-house benchmark (1,801 samples) evaluated with RMS-F1.
- Inference/efficiency: end-to-end speed on OmniDocBench v1.0 (512 PDFs, A100), reporting pages/s, tokens/s, VRAM; cross-hardware configs (A10, RTX 3060/5070/4090D).
What are the outcomes/limitations?
Outcomes:
- OmniDocBench v1.5: overall 92.56; text edit distance 0.035; formula CDM 91.43; table TEDS 89.76; reading-order ED 0.043.
- olmOCR-Bench: overall 80.0 $\pm$ 1.0 unit-test pass rate (best), leading in ArXiv and Headers/Footers.
- Element-level wins across text/table/formula/chart benchmarks with detailed multilingual breakdowns.
- Throughput: 1.224 pages/s on A100 (+15.8% vs. MinerU2.5); ~43.7 GB VRAM (vs. dots.ocr ~78.5 GB).
Limitations:
- Reproducibility constraints: multiple critical components depend on in-house datasets and proprietary annotators (ERNIE-4.5-VL family); exact replication is difficult without equivalents.
- Coupling to detector quality: overall fidelity and order depend on RT-DETR + pointer network; OOD layouts may degrade.
- Less emphasis on token-economy: no explicit vision-tokens-per-page vs. accuracy frontier like DeepSeek-OCR.
- Output format assumptions: tables trained to OTSL, charts to Markdown tables, formulas to LaTeX; downstream consumers may need adapters.
- Charts grounding: strong RMS-F1 in-house, but public chart test sets are noisy/imbalanced.
Model
High-level architecture (Figures 2–4, pp.5–6)

Two stages:
- PP‑DocLayoutV2 (layout + reading order):
- Detector: RT‑DETR for element boxes/classes (text blocks, tables, formulas, charts).
- Pointer network (6-layer Transformer) infers reading order: it predicts an $N \times N$ pairwise order matrix using absolute 2D position encodings, class embeddings, and a geometry-biased attention head (Relation-DETR-style), which is decoded via deterministic win-accumulation into a topologically consistent ordering (a minimal decoding sketch follows this list).
- Rationale: avoids end-to-end VLM hallucinations and long-sequence overhead for layout; keeps this step small and fast.

- PaddleOCR-VL-0.9B (element-level recognition):
- NaViT-style dynamic-resolution visual encoder (from Keye-VL), so it digests native-resolution crops without tiling distortion.
- 2-layer MLP projector (GELU; merge size 2).
- ERNIE-4.5-0.3B LM with 3D-RoPE as the text decoder (small decoder means faster AR decoding).
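The win-accumulation decoding step is not spelled out in detail; below is a minimal sketch of one plausible reading, assuming `P[i][j]` is the predicted probability that block `i` precedes block `j` (function and variable names are illustrative, not from the paper).

```python
import numpy as np

def decode_reading_order(P: np.ndarray) -> list[int]:
    """Recover a total reading order from an N x N pairwise order matrix.

    Assumes P[i, j] is the predicted probability that block i should be read
    before block j. Each block accumulates one "win" per pairwise comparison
    it is predicted to precede; sorting by win count yields a deterministic,
    topologically consistent ordering when the pairwise predictions agree.
    """
    n = P.shape[0]
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and P[i, j] > P[j, i]:
                wins[i] += 1
    # Higher win count = earlier in reading order; ties broken by block index.
    return sorted(range(n), key=lambda i: (-wins[i], i))

# Example: block 2 precedes block 0, which precedes block 1.
P = np.array([[0.0, 0.9, 0.2],
              [0.1, 0.0, 0.1],
              [0.8, 0.9, 0.0]])
print(decode_reading_order(P))  # [2, 0, 1]
```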

Design point vs. DeepSeek-OCR: PaddleOCR‑VL decouples layout (specialized detector + ordering) from recognition, whereas DeepSeek‑OCR is an end‑to‑end encoder–decoder optimized for optical compression of long text into few vision tokens. The former trades some end‑to‑end elegance for stability, lower latency, and cheaper training; the latter explores token‑economy limits via compression.
“What the VLM emits”
- Text OCR: block/line/word‑level transcription.
- Tables: OTSL structural tokens + content.
- Formulas: LaTeX (distinguishes inline `\(...\)` vs. display `\[...\]`).
- Charts: normalized Markdown tables.
Algorithms / Training (Section 2.2; Table 1, p.8)
- Layout (PP‑DocLayoutV2):
- Train RT‑DETR first (100 epochs on ~20k curated layout pages), then freeze it and train the pointer network for order (200 epochs; AdamW; GCE loss for noisy labels).
- VLM (PaddleOCR‑VL‑0.9B): two stages, all components trainable (ERNIEKit); batch size 128, sequence length 16,384 for both stages.
- Stage‑1 alignment: 29M image–text pairs, max res $1280 \times 28 \times 28$ (NaViT), LR $5 \times 10^{-5} \rightarrow 5 \times 10^{-6}$, 1 epoch.
- Stage‑2 instruction FT: 2.7M carefully curated samples, max res $2048 \times 28 \times 28$, LR $5 \times 10^{-6} \rightarrow 5 \times 10^{-7}$, 2 epochs; teaches 4 task families (OCR/table/formula/chart).
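For quick reference, the reported two-stage recipe can be summarized as a configuration sketch. The pixel budgets are back-of-envelope values obtained by reading the max resolution "$N \times 28 \times 28$" as a budget of $N$ patches of $28 \times 28$ pixels; the field names are illustrative and not taken from any released training config.

```python
# Illustrative summary of the reported two-stage recipe (not the official config).
STAGE1_ALIGNMENT = {
    "samples": 29_000_000,            # image-text pairs
    "max_vision_tokens": 1280,        # NaViT budget; ~1280 * 28 * 28 ≈ 1.0 MP per image
    "lr_start": 5e-5, "lr_end": 5e-6,
    "epochs": 1,
    "batch_size": 128, "seq_len": 16_384,
}
STAGE2_INSTRUCTION_FT = {
    "samples": 2_700_000,             # curated instruction data, 4 task families
    "max_vision_tokens": 2048,        # ~2048 * 28 * 28 ≈ 1.6 MP per image
    "lr_start": 5e-6, "lr_end": 5e-7,
    "epochs": 2,
    "batch_size": 128, "seq_len": 16_384,
}
```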
Data (Section 3; Figure 5, p.9)
Scale: 30M+ training samples across open datasets, synthesis, web‑harvested documents, and in‑house sets.
Three pillars:
- Curation from open sets (CASIA‑HWDB, UniMER‑1M, MathWriting; chart corpora like ChartQA/PlotQA/UniChart/Beagle/ChartINFO/visText/ExcelChart), synthesized data, web‑scale crawl, and in‑house corpora spanning many doc genres.
- Automatic annotation: PP‑StructureV3 pseudo‑labels → prompt LLMs (ERNIE‑4.5‑VL, Qwen2.5‑VL) for refinement → hallucination filtering.
- Hard‑case mining: build typed eval engines (23 text categories, 20 table types, 4 formula types, 11 chart families), score with EditDist, TEDS, CDM, RMS‑F1 to find failures, then synthesize targeted data (XeLaTeX, web renderers). Reported synthesis rate: ~10,000 samples/hour for tables alone.
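The mining loop can be sketched roughly as below. The bucket names, the 0.85 threshold, and the `evaluate` / `synthesize` helpers are hypothetical stand-ins for the paper's internal tooling, and the sketch assumes all metrics are normalized so that higher is better (edit distance would need to be inverted).

```python
# Rough sketch of the typed hard-case mining loop (illustrative, not the paper's code).
METRIC_FOR_TYPE = {"text": "edit_dist", "table": "teds", "formula": "cdm", "chart": "rms_f1"}

def mine_and_synthesize(model, typed_eval_sets, evaluate, synthesize, threshold=0.85):
    """Score the model on typed eval buckets, find weak buckets, synthesize targeted data."""
    new_training_data = []
    for element_type, buckets in typed_eval_sets.items():   # e.g. 20 table types, 11 chart families
        metric = METRIC_FOR_TYPE[element_type]
        for bucket_name, samples in buckets.items():
            score = evaluate(model, samples, metric=metric)  # assumes higher = better
            if score < threshold:                            # weak slice -> targeted synthesis
                # e.g. render new samples with XeLaTeX / web renderers matching this bucket's style
                new_training_data += synthesize(element_type, bucket_name, n=10_000)
    return new_training_data
```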
Hardware / Inference (Section 4.3; Table 13, p.18)
- Pipeline: three asynchronous threads—(1) page rendering, (2) layout model, (3) batched VLM—connected by queues; VLM batching is triggered by queue size or dwell‑time (a minimal sketch of this trigger follows this list). vLLM/SGLang backends; knobs for max‑batched‑tokens and GPU memory utilization.
- Throughput (A100, vLLM, OmniDocBench v1.0):
- Pages/s: 1.224 (vs. MinerU2.5 1.057; +15.8%).
- Tokens/s: 1881 (vs. 1648; +14.2%).
- VRAM: ~43.7 GB; significantly less than dots.ocr (~78.5 GB) while being faster.
- Cross‑hardware benchmarks also provided for A10, RTX 3060, RTX 5070, and RTX 4090D.
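A minimal sketch of the size-or-dwell-time batching trigger described above; the thread wiring, function names, and thresholds are illustrative and not taken from the released pipeline.

```python
import queue
import time

def vlm_batch_worker(crop_queue: queue.Queue, run_vlm_batch,
                     max_batch=64, max_dwell_s=0.05):
    """Collect cropped elements from the layout stage and release a batch to the
    VLM when either the size threshold or the dwell time of the oldest item is hit."""
    batch, oldest_ts = [], None
    while True:
        # Block until an item arrives, but never longer than the remaining dwell time.
        timeout = None if oldest_ts is None else max(0.0, max_dwell_s - (time.time() - oldest_ts))
        try:
            item = crop_queue.get(timeout=timeout)
            if item is None:                      # sentinel: flush what is left and stop
                if batch:
                    run_vlm_batch(batch)
                return
            batch.append(item)
            oldest_ts = oldest_ts or time.time()
        except queue.Empty:
            pass                                  # dwell time expired with a partial batch
        if batch and (len(batch) >= max_batch or time.time() - oldest_ts >= max_dwell_s):
            run_vlm_batch(batch)                  # e.g. hand off to a vLLM/SGLang endpoint
            batch, oldest_ts = [], None
```

In the real system a worker like this would run in its own thread alongside the page-rendering and layout threads, connected by the same queues.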
Evaluation
Page‑level (full pages; Figure 1, Tables 2–3)
- OmniDocBench v1.5 — Overall 92.56↑; Text‑Edit 0.035↓; Formula‑CDM 91.43↑; Table‑TEDS 89.76↑ / TEDS‑S 93.52↑; Reading‑Order ED 0.043↓. Top overall against pipelines and large VLMs.
- OmniDocBench v1.0 — Avg Overall‑Edit 0.115↓; Text‑Edit: EN 0.041, ZH 0.062; Table‑TEDS: EN 88.0, ZH 92.1; Reading‑Order ED: EN 0.045, ZH 0.063.
- olmOCR‑Bench — Overall 80.0 ± 1.0 (best), leading in ArXiv and Headers/Footers; strong on Multi‑column and Long tiny text.
Element‑level (cropped blocks)
- Text (OmniDocBench‑OCR‑block) — lowest Edit Distance across most doc types (e.g., PPT2PDF 0.049, Academic 0.021, Newspaper 0.034).
- Handwriting (Ocean‑OCR‑Bench) — EN ED 0.118, ZH ED 0.034; best F1/Precision/Recall/BLEU/METEOR among compared systems.
- Tables (OmniDocBench‑Table‑block) — TEDS 0.9195, TEDS‑struct 0.9543, Overall‑ED 0.0561 (best).
- Formulas (v1.5 Formula‑block) — CDM 0.9453; In‑house formulas CDM 0.9882.
- Charts (in‑house) — RMS‑F1 0.844 overall, surpassing several specialized OCR VLMs and some 72B‑scale VLMs.
Additional Notes
- Typed hard-case mining loop: the paper’s bucketed eval sets + targeted synthesis approach is a well-documented methodology for iterative data improvement.
- Prompting surface (Stage-2 tasks): the VLM is trained on four task families with distinct output formats:
- OCR: block/line/word transcription
- Tables: OTSL structural markup
- Formulas: LaTeX (inline vs. display)
- Charts: Markdown tables
- Contrast with DeepSeek-OCR: represents different design priorities; DeepSeek-OCR explores token-economy limits via optical compression ($\sim$97% OCR at $\sim$10$\times$ compression), while PaddleOCR-VL prioritizes robustness and throughput via decoupled layout.
2025-11-nemotron-parse
NVIDIA Nemotron Parse 1.1 — Notes
TL;DR
Nemotron Parse 1.1 is a lightweight 885M-parameter encoder-decoder VLM for end-to-end document parsing that outputs formatted text (Markdown/LaTeX), bounding boxes, and semantic classes in reading order. A token-compressed variant (Nemotron-Parse-1.1-TC) offers faster inference with minimal quality trade-offs. The authors report competitive OCR and table extraction performance across several benchmarks and release model weights plus an optimized NIM container.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$
This paper introduces an updated end-to-end document parsing model (v1.1) with specific architectural innovations: a token-compressed speed variant (TC), multi-token inference training, and removal of decoder positional embeddings. The core contribution is the method itself: a single model interface that produces formatted text, bounding boxes, and semantic classes through prompt-controlled outputs.
Secondary: $\Psi_{\text{Resource}}$
The work releases two model variants with weights, a production-ready NIM container, and a subset of the training data (Nemotron-VLM-v2 dataset). It provides the NVpdftex pipeline for synthetic data generation, making the resource contribution significant.
What is the motivation?
Document OCR for downstream LLM and retrieval workflows requires more than plain text extraction. Systems need layout preservation, reading order, semantic block types (captions, footnotes, section headers), table structure, mathematical formulas, and multi-column/page handling. Pipeline approaches can be brittle and introduce latency through multiple stages, while existing end-to-end models may underperform on specific subtasks when asked to produce all outputs simultaneously. Nemotron Parse targets comprehensive extraction in a single model with good throughput characteristics.
What is the novelty?
Unified output interface: A single model produces (a) formatted text in Markdown with LaTeX for formulas/tables, (b) bounding boxes with relative coordinates, and (c) semantic classes for layout elements. Prompts control which outputs are generated, enabling eight valid combinations from three binary axes.
Token-compressed variant (TC): Applies additional pixel-shuffle operations to reduce vision token sequence length, improving inference speed with what the authors describe as “minimal quality degradation.”
Decoder without positional embeddings: The decoder is trained and evaluated without positional embeddings to support longer-context inference. The authors argue that causal masking provides positional cues and visual tokens already encode 2D spatial structure, avoiding interference with document layout.
Multi-token inference training: The model is trained to predict $m$ tokens per decoding step using additional linear projection heads. During inference, this operates in greedy mode without token verification. The authors claim this training strategy benefits even single-token inference quality.
What experiments were performed?
Internal reading-order/OCR test set (Table 2): Evaluation on 789 human-labeled PDF pages from magazines, books, and Common Crawl. Metrics are word error rate (WER) and F1 score, comparing against Kosmos-2.5 and GOT in plain and markdown modes.
GOT benchmark (Table 3): OCR accuracy and reading-order metrics comparing Nemotron Parse variants against multiple systems including Gemini Flash 2.0, Marker, SmolDocling, and others.
OmniDocBench v1.0 English subset (Table 4): Category-level metrics for overall, text, formula, table, and order accuracy across many models.
Table extraction benchmarks (Tables 5-6): TEDS and S-TEDS metrics on RD-TableBench, PubTabNet, and OmniDocBench tables (Table 5). RD-TableBench “table similarity” scores comparing against Reducto, cloud vendor OCR, and several LLM-based parsers (Table 6).
Multilingual OCR evaluation (Table 7): WER and F1 scores across multiple languages using an NVpdftex-derived test set of 10,000 scientific dense documents per language.
Throughput measurement (Table 8): Tokens per second and approximate pages per second on a single H100 GPU for both base and TC variants.
What are the outcomes/limitations?
Key outcomes (as reported):
Internal test set results show Nemotron-Parse-MIP and Nemotron-Parse-TC-MIP achieving lower WER and higher F1 than Kosmos-2.5 and GOT (Table 2). On the GOT benchmark, both variants place in the top tier, with the authors noting that only Gemini Flash 2.0 outperforms them (Table 3). Table extraction is competitive on TEDS/S-TEDS, with RD-TableBench TEDS in the mid-80s (Table 5) and table-similarity scores also around the mid-80s relative to other systems (Table 6). The TC variant improves throughput from 3800 to 4500 tokens/sec (Table 8), which the authors interpret as roughly 4 vs. 5 pages/sec for average document lengths.
Limitations and caveats explicitly noted:
Formula scoring artifact on OmniDocBench: Since the model outputs Markdown format, simple equations may not be wrapped in LaTeX math delimiters. These are penalized in the formula category even when represented correctly in Markdown, artificially lowering formula scores.
Multilingual scope: The authors report stronger performance for Chinese, Japanese, and Korean in scientific PDFs and standard documents, but limited support for “in-the-wild” images and documents in those languages.
Multi-token inference risks: The greedy decoding approach operates “without token verification,” which could amplify errors in dense text regions. The paper does not quantify this potential error propagation.
Missing reproducibility details:
The paper lacks a comprehensive training hyperparameter table (optimizer configuration, learning rate schedule, batch sizes, training steps) and detailed ablation studies for individual design choices beyond high-level architectural descriptions.
Reproducibility Details
Model
Architecture
The model has 885M total parameters with a compact 256M-parameter language decoder. The vision encoder initializes from RADIO using a ViT-H/16 backbone (657M parameters). A vision neck applies horizontal convolution kernels of size $1 \times 4$ with stride $1 \times 4$, reducing both hidden dimensionality and sequence length. For a $1648 \times 2048$ image, the sequence length reduces to 3200 tokens, and the RADIO summary token is concatenated.
The token-compressed (TC) variant applies an additional pixel-shuffle operation to the compressed sequence, reducing it to 833 tokens, described as a total $\times 16$ reduction in sequence length.
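The additional pixel-shuffle is a standard space-to-depth merge over the token grid; below is a generic sketch of that operation, not NVIDIA's implementation. A factor-2 merge cuts the token count by 4×, on the order of the reported 3200 → 833 reduction; the exact counts depend on grid dimensions, padding, and the summary token.

```python
import torch

def merge_tokens(tokens: torch.Tensor, h: int, w: int, r: int = 2) -> torch.Tensor:
    """Space-to-depth merge on a token grid: each r x r neighborhood of visual
    tokens is concatenated into one token, cutting sequence length by r**2
    while multiplying the channel dimension by r**2."""
    b, n, c = tokens.shape
    assert n == h * w and h % r == 0 and w % r == 0
    x = tokens.view(b, h // r, r, w // r, r, c)        # split the grid into r x r cells
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()       # group each cell's tokens together
    return x.view(b, (h // r) * (w // r), c * r * r)   # (B, N / r^2, C * r^2)

# A factor-2 merge quarters the number of vision tokens (grid shape is illustrative):
x = torch.randn(1, 64 * 50, 1024)                      # 3200 tokens
print(merge_tokens(x, h=64, w=50).shape)               # torch.Size([1, 800, 4096])
```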
The decoder uses an mBART-style architecture reduced to 10 layers with tied weights.
Positional embeddings
The decoder is trained and evaluated without positional embeddings to enable large-context inference. The authors argue that causal masking provides implicit positional cues, and visual tokens already encode 2D spatial structure, so explicit position encodings are unnecessary and potentially harmful for document layout understanding.
Multi-token inference
For predicting $m$ tokens simultaneously, the training procedure adds $(m-1) \times 2$ additional linear projection layers. Training uses teacher forcing for embeddings of later tokens. Inference operates in greedy mode without verification of the predicted tokens.
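The multi-token setup is only described at a high level; below is a PyTorch sketch of one plausible reading of the "$(m-1) \times 2$ linear layers" design, in which each extra position gets a fusion layer plus an output head. Fusing the hidden state with the previous token's embedding is an assumption, not the paper's stated formulation.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Sketch of multi-token prediction: besides the usual LM head for token t+1,
    each extra position t+k (k = 2..m) gets two linear layers: one that fuses the
    decoder hidden state with the embedding of the previously predicted token,
    and one that projects to vocabulary logits."""
    def __init__(self, d_model: int, vocab: int, m: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab)
        self.fuse = nn.ModuleList(nn.Linear(2 * d_model, d_model) for _ in range(m - 1))
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(m - 1))

    @torch.no_grad()
    def greedy_step(self, h: torch.Tensor) -> list[int]:
        """Greedily emit m tokens from one decoder hidden state, with no verification."""
        tokens = [self.lm_head(h).argmax(-1)]
        for fuse, head in zip(self.fuse, self.heads):
            prev = self.embed(tokens[-1])
            tokens.append(head(fuse(torch.cat([h, prev], dim=-1))).argmax(-1))
        return [int(t) for t in tokens]

# Usage with a dummy hidden state:
head = MultiTokenHead(d_model=512, vocab=32000, m=3)
print(head.greedy_step(torch.randn(512)))
```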
Data
Training data blend
The training mixture combines synthetic, public, and human-annotated sources. Table 1 in the paper lists examples including:
- Multilingual arXiv (8.3M samples)
- Wikipedia OCR data (9.5M samples)
- Multilingual synthetic OCR data (3.5M samples)
- Table datasets: PubTables, FinTabNet, TabRecSet
- DocLayNet
- Common Crawl samples
NVpdftex pipeline
The core dataset generation pipeline is inspired by Nougat-style LaTeX rendering but claims tighter alignment through TeX Live instrumentation. The pipeline intercepts node/character creation and page output to extract character-level bounding boxes, semantic labels, and reading order directly from the typesetting engine. A repository link is provided in the paper.
For multilingual support, machine translation is applied to NVpdftex content into six languages, with LaTeX-level augmentations including font variations, color modifications, and layout changes.
Data augmentation
DocLayNet augmentation: The authors augment DocLayNet with autolabeled reading order, text inside images, and Markdown formatting. This includes LaTeX formatting for tables and formulas.
Common Crawl annotation: Common Crawl samples receive human annotation for plaintext, bounding boxes, and semantic classes. Additional autolabeling covers text-inside-images and formatting. Low-quality predictions are filtered using edit distance heuristics.
Algorithms / Training
Prompts and output specification
The model uses three prompt axes that define outputs, yielding eight valid combinations:
- Text formatting: `<output_markdown>`, `<output_plain>`, `<output_no_text>`
- Bounding boxes: `<predict_bbox>`, `<no_bbox>`
- Classes: `<predict_classes>`, `<no_classes>` (used only with bbox)
The maximal-information prompt (MIP) is: `<output_markdown> <predict_bbox> <predict_classes>`.
Output format
Bounding boxes use relative coordinates in a 1024 $\times$ 1280 scale. The output schema for MIP is:
`<x_(\d+)><y_(\d+)>(text)<x_(\d+)><y_(\d+)><class_(...)>`
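An illustrative parser for this schema; the exact token spelling in real model outputs may differ, and the rescaling assumes the relative 1024 × 1280 coordinate convention above.

```python
import re

# Illustrative parser for the MIP output schema described above.
PATTERN = re.compile(r"<x_(\d+)><y_(\d+)>(.*?)<x_(\d+)><y_(\d+)><class_([^>]+)>", re.S)

def parse_mip(output: str, page_w: int, page_h: int):
    blocks = []
    for x1, y1, text, x2, y2, cls in PATTERN.findall(output):
        # Coordinates are relative to a 1024 x 1280 grid; rescale to pixel space.
        blocks.append({
            "bbox": (int(x1) / 1024 * page_w, int(y1) / 1280 * page_h,
                     int(x2) / 1024 * page_w, int(y2) / 1280 * page_h),
            "class": cls,
            "text": text.strip(),
        })
    return blocks

sample = "<x_10><y_20>## Introduction<x_500><y_60><class_Section-Header>"
print(parse_mip(sample, page_w=2048, page_h=2560))
```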
Reading order specification
Base model: Uses canonical ordering starting with Page-Header, then Text/Section-Header/List-Item/Title/Formula in reading order, followed by Footnotes/Page-Footers/Tables/Pictures/Captions at the end.
TC model: The authors claim improved ordering that places “floating” elements (tables, figures) within the natural page reading flow rather than at the end.
Evaluation
Internal reading-order/OCR test set (Table 2)
Evaluated on 789 human-labeled PDF pages from magazines, books, and Common Crawl. Metrics reported: WER (lower is better) and F1 (higher is better). Comparisons include masking/unmasking headers and footers depending on baseline model capabilities.
GOT benchmark (Table 3)
OCR accuracy and reading-order metrics comparing multiple systems. Nemotron Parse 1.1 and TC variants show strong performance in this benchmark.
OmniDocBench v1.0 (Table 4)
English subset evaluation with category-level metrics: overall, text, formula, table, and order. The paper notes the formula scoring artifact: Markdown-formatted simple equations without LaTeX delimiters are penalized even when correct.
Table extraction benchmarks (Tables 5-6)
Table 5: TEDS and S-TEDS on RD-TableBench, PubTabNet, and OmniDocBench tables.
Table 6: RD-TableBench “table similarity” scores comparing Nemotron Parse against Reducto, cloud vendor OCR systems, and multiple LLM-based document parsers.
Multilingual OCR (Table 7)
WER and F1 scores per language on NVpdftex-derived dense scientific documents. Reported values range around WER 0.03–0.06 and F1 0.96–0.98 across tested languages.
Qualitative examples
The paper includes figures showing:
- Figure 1 (p.10): Layout analysis with bounding boxes and semantic classes
- Figure 2 (p.11): OCR with mathematical formula formatting
- Figure 3 (p.11): Complex table extraction rendered as LaTeX
Hardware / Production
Model weights are released in fp32 and bf16 formats with vLLM support. A NIM container is also available for production deployment.
Throughput measurements on a single H100 GPU in bf16 precision:
- Nemotron Parse 1.1: 3800 tokens/sec
- Nemotron Parse 1.1-TC: 4500 tokens/sec
The authors interpret these speeds as approximately 4 pages/sec vs. 5 pages/sec based on analysis of 10,000 pages averaging 1000 tokens per page.
2025-12-dots-ocr
dots.ocr — Notes
TL;DR
Li et al. introduce dots.ocr, a $\sim$2.9B parameter ViT–LLM model that jointly performs layout detection, text recognition, and reading order prediction as a single autoregressive sequence for document pages. The model is trained using a three-stage multilingual data engine that combines teacher–student synthetic generation, large-scale auto-labeling, and human-corrected hard examples, and it reports competitive results on OmniDocBench (English and Chinese), a new 126-language benchmark (XDocParse), and olmOCR Bench.
What kind of paper is this?
Dominant: $\Psi_{\text{Method}}$ (unified VLM architecture and task formulation that treats document parsing as one sequence generation problem).
Secondary: $\Psi_{\text{Resource}}$ (introduces XDocParse, a 126-language benchmark for end-to-end document parsing, and a large internal synthetic training corpus, though only the benchmark is intended for release); $\Psi_{\text{Evaluation}}$ (proposes a confidence-free two-stage F1 metric for layout detection and positions dots.ocr as a baseline on multiple public benchmarks).
What is the motivation?
- Fragmented pipelines. Existing document systems often separate layout detection, OCR, and structure or reading order, which introduces error propagation and loses cross-task synergies.
- General VLMs are not ideal. General-purpose VLMs can do high-level QA and summarization but struggle with precise localization, dense text, and large-scale throughput, mostly due to architecture and cost.
- Multilingual coverage is weak. Training and evaluation datasets are heavily skewed toward English and a few high-resource languages; most languages have little labeled data.
- Data bottleneck. Curating fully annotated, multilingual document corpora with layout and structure labels is expensive, so an end-to-end approach needs a different data strategy.
What is the novelty?
Unified task formulation
Document parsing is cast as one autoregressive sequence over semantic blocks. Each block is a triple $(B_k, c_k, t_k)$ where:
- $B_k$: bounding box coordinates
- $c_k$: block category (title, header, paragraph, table, figure, list, etc.)
- $t_k$: textual content (plain text or LaTeX-style structured text for tables or formulas)
Blocks are ordered according to reading order, so a single generation pass must jointly solve detection, recognition, and ordering.
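Schematically, a page's training target serializes the ordered block triples into one sequence. The JSON-style layout below is purely illustrative, since the paper does not publish the exact token format.

```python
import json

# Purely illustrative serialization of the (B_k, c_k, t_k) block triples into one
# autoregressive target; dots.ocr's real token format is not specified in the paper.
blocks = [
    {"bbox": [84, 61, 980, 112],  "category": "title",     "text": "Attention Is All You Need"},
    {"bbox": [84, 140, 512, 620], "category": "paragraph", "text": "The dominant sequence ..."},
    {"bbox": [84, 660, 512, 900], "category": "formula",   "text": r"\mathrm{softmax}(QK^T/\sqrt{d_k})V"},
]

def to_target_sequence(blocks):
    """Blocks are kept in reading order, so one generation pass supervises
    detection (bbox), classification (category), and recognition (text) jointly."""
    return json.dumps(blocks, ensure_ascii=False)

print(to_target_sequence(blocks))
```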
Model design
ViT–LLM architecture with:
- A 1.2B parameter vision encoder trained from scratch on documents, designed for up to about 11M input pixels (high-resolution pages).
- A 1.7B parameter language decoder, initialized from the Qwen2.5 1.5B base model with tied embeddings.
Encoder training objective is set up to capture both fine-grained text and higher-level layout.
Three-stage multilingual data engine
Stage 1: Teacher-guided multilingual synthesis.
- Use Qwen2.5-VL-72B as a teacher VLM.
- Given labeled English documents and their structural representation, the teacher produces layout-preserving renderings in target languages, which are then rendered to images, forming parallel multilingual seeds.
- Distill this capability to a smaller Qwen2.5-VL-7B student model that can generate such documents much more cheaply.
Stage 2: Curated large-scale auto-labeling.
- Apply the 7B student to a large pool of internal PDFs, selected via stratified sampling on layout complexity, language rarity, and domain.
- Over-sample low-resource languages and complex layouts (multi-column, heavy tables, scientific diagrams) to offset dataset bias.
- Student predictions convert millions of PDFs into structured training data.
Stage 3: Human-in-the-loop targeted correction.
- Run the pre-trained dots.ocr model over diverse documents.
- Use Qwen2.5-VL-7B Instruct as an oracle to audit outputs, flagging localization errors, incorrect types, omissions, and hallucinations by checking crops or masked regions.
- Human annotators correct these high-confidence failure cases, creating a focused dataset of more than 15k samples that specifically target weaknesses.
- This dataset is used for final supervised fine-tuning.
New benchmark and metric
- XDocParse: Real-world documents in 126 languages, used purely for evaluation of end-to-end parsing (Overall Edit, Text Edit, Table metrics, reading order).
- Layout detection metric: A two-stage F1-based metric that first does one-to-one matching between predicted and ground truth boxes using the Hungarian algorithm, then clusters remaining boxes into super-boxes to handle one-to-many or many-to-many relationships. Supports category-aware and category-agnostic modes and avoids confidence scores, which autoregressive models do not naturally output.
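A minimal sketch of the metric's first stage (one-to-one Hungarian matching on IoU, with no confidence scores); the super-box clustering stage and the category-aware mode are omitted, and the IoU threshold is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def matched_f1(pred_boxes, gt_boxes, iou_thr=0.5):
    """Stage 1 of the two-stage metric: Hungarian one-to-one matching on IoU cost,
    then F1 over matches above the threshold (no confidence scores needed)."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    cost = np.array([[1.0 - iou(p, g) for g in gt_boxes] for p in pred_boxes])
    rows, cols = linear_sum_assignment(cost)
    tp = sum(1 - cost[r, c] >= iou_thr for r, c in zip(rows, cols))
    precision = tp / len(pred_boxes)
    recall = tp / len(gt_boxes)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```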
What experiments were performed?
OmniDocBench (English and Chinese)
Primary end-to-end evaluation. Uses the benchmark from Ouyang et al. with rich annotations for text, tables, formulas, and reading order. Metrics: Overall Edit, Text Edit, Formula Edit, TableTEDS, Table Edit, Reading Order Edit.
Baselines:
- Pipeline tools: MinerU v1 and v2, Docling, PPStruct, Pix2Text, OpenParse, etc.
- Specialized OCR or document VLMs: MonkeyOCR-pro-3B, OCRFlux, Dolphin, Mistral OCR, SmolDocling, GOT-OCR, olmOCR, Nougat, etc.
- General VLMs: GPT-4o, Gemini-2.5-Pro, Qwen2-VL-72B, Qwen2.5-VL-72B, doubao models.
XDocParse
New benchmark constructed by the authors with documents in 126 languages. Same metric family as OmniDocBench. Baselines include MonkeyOCR-3B, doubao-1.5 and 1.6 variants, and Gemini-2.5-Pro.
olmOCR Bench (supplementary)
End-to-end evaluation on olmOCR Bench subsets: ArXiv, Old Scans, Math Tables, Headers and Footers, Multi-column, Long Tiny Text, and Base. Compared against tools like Marker, MinerU, Mistral OCR, GPT-4o, Gemini Flash 2, Qwen2-based models, Nanonets OCR, and olmOCR itself.
Ablation studies
Synergy of joint task learning.
Train variants that remove one component:
- M-Det: no detection targets.
- M-Rec: no recognition targets.
- M-RO: no supervised reading order (replace with heuristic horizontal, vertical, or random ordering).
Evaluate Overall Edit, Reading Order Edit, and detection F1.
Unified versus specialist paradigms.
- U $\rightarrow$ U: joint training and joint inference on all tasks (full dots.ocr).
- U $\rightarrow$ S: joint training, but specialize at inference to one task.
- S $\rightarrow$ S: train and evaluate on a single task only.
Data engine ablations.
Remove one data pillar at a time:
- D-Multilingual: no multilingual synthetic data.
- D-Structured: no structured-heavy data (tables, formulas).
- D-Correction: no targeted correction set.
Evaluate impacts on Overall Edit, Reading Order Edit, and detection F1.
Qualitative analyses
Visual comparisons showing how removing recognition or reading order supervision leads to fragmented boxes, incorrect grouping of tables, and broken reading sequences. Examples of dots.ocr as a data engine: grounding-enhanced OCR, natural and scientific figure caption pairs, text inpainting masks, and next-page pairs for long-context modeling.
What are the outcomes and limitations?
Outcomes
OmniDocBench performance:
- Overall Edit: 0.125 (EN) and 0.160 (ZH), better than all listed baselines. For example, MonkeyOCR-pro-3B is at 0.138 (EN) and 0.206 (ZH), and Gemini-2.5-Pro is at 0.148 (EN) and 0.212 (ZH).
- Text Edit: 0.032 (EN) and 0.066 (ZH), lower than other methods.
- TableTEDS: 88.6 (EN) and 89.0 (ZH), at or above other models.
- Reading Order Edit: 0.040 (EN) and 0.067 (ZH), best among reported methods.
XDocParse performance:
- Overall Edit 0.177, compared to 0.251 for Gemini-2.5-Pro and about 0.291–0.299 for doubao variants.
- Text Edit 0.075, roughly half of Gemini-2.5-Pro (0.163).
olmOCR Bench:
- Overall score 79.1% $\pm$ 1.0, higher than MonkeyOCR-pro-3B (75.8%) and olmOCR v0.1.75 anchored configuration (75.5%).
Evidence for task synergy:
- Removing detection supervision increases Reading Order Edit; removing recognition supervision slightly improves raw detection F1 but harms end-to-end metrics, suggesting recognition acts as a semantic regularizer.
- Degrading reading order supervision (heuristic or random) harms both reading order and detection F1, indicating that sequence structure guides visual learning.
Unified paradigm vs specialists:
- Jointly trained and inferred model (U $\rightarrow$ U) yields better recognition and reading order scores than both unified training with specialist inference (U $\rightarrow$ S) and specialist-only (S $\rightarrow$ S) models, while detection F1 stays similar across configurations.
Data engine impact:
- Dropping targeted correction (D-Correction) hurts detection F1 the most (from 0.849 to 0.788), showing that the small curated correction set is critical for high-quality localization.
- Removing multilingual or structured data degrades Overall Edit and reading order, with structured data particularly important for Chinese and complex layouts.
Limitations and open questions
- Data transparency. The training corpus relies heavily on internal documents and synthetic data generated from proprietary VLMs; the paper does not quantify total document counts, tokens, or language distribution in detail.
- Compute and hardware. Parameter counts are reported, but there are no explicit numbers for GPU types, training duration, or energy cost, which makes reproducibility at scale harder to assess.
- Benchmark scope. XDocParse is multilingual, but the paper does not break down performance by language family or resource level, so it is unclear how well the model handles individual low-resource languages versus the aggregate.
- Real-world deployment. The paper focuses on benchmark accuracy; there is little discussion of throughput, latency, memory usage, or robustness to noisy scans and edge cases in production.
- Data engine as VLM pretraining source. The authors sketch how dots.ocr could serve as a data engine for future VLM pretraining (e.g., next-page prediction, inpainting, grounded supervision) but do not present experiments that actually use this data to improve a downstream VLM.
Model
Architecture
Vision encoder:
- 1.2B parameters.
- ViT-style encoder optimized for documents, trained from scratch rather than fine-tuning a natural-image model.
- Supports native resolutions up to about 11M pixels, which allows full-page inputs without aggressive downsampling.
Language model decoder:
- Based on Qwen2.5 1.5B base, modified with tied word embeddings, resulting in 1.7B parameters.
- Acts as an autoregressive decoder over a tokenized representation of bounding boxes, categories, and text content.
Vision–language connection:
- Adopts the standard ViT–LLM pattern: image tokens from the encoder are inserted or projected into the language model context; the decoder then generates the structured output sequence.
Output format:
- Bounding boxes are encoded numerically as tokens.
- Categories are generated from a fixed vocabulary (e.g., title, header, paragraph, list, table, picture, formula, footer; see right panel of Figure 2).
- Text content is emitted as standard subword tokens, with LaTeX-like markup for tables and formulas to capture internal structure.
Data
Training data
From the three-stage engine (exact sizes not always given):
Seed multilingual structured data.
- English documents with annotations are transformed by the teacher VLM into multilingual, layout-preserving versions.
- Covers many languages beyond English and Chinese.
Curated large-scale auto-labeled data.
- Millions of internal PDFs sampled using heuristics for:
- Layout complexity: number of columns, table density, presence of images.
- Linguistic rarity: prioritizing low-resource languages.
- Domain diversity: including scientific and niche domains.
- 7B student VLM converts these into structured training examples.
Targeted correction set.
- Over 15k samples, collected by running dots.ocr, auditing with an oracle VLM, and manually correcting errors.
Supervised fine-tuning subset.
- The main supervised fine-tuning stage uses about 300k diverse samples, though it is not fully clear how these intersect with the stages above.
Evaluation data
- OmniDocBench (English and Chinese) for end-to-end parsing.
- XDocParse which the authors curate from real-world multilingual documents (126 languages).
- olmOCR Bench for additional validation on scientific and scanned documents.
Algorithms / Training
Unified training objective:
- Autoregressive next-token prediction over sequences representing bounding boxes, types, text, and separators between blocks.
- Training implicitly couples detection, recognition, and reading order.
Pretraining and finetuning:
- Vision encoder trained from scratch jointly with the LM decoder under the unified objective.
- Supervised finetuning on about 300k curated samples after large-scale pretraining.
Optimization:
- Optimizer: AdamW.
- Peak learning rate: $5 \times 10^{-5}$, with cosine decay schedule.
- Other hyperparameters (batch size, warmup schedule, gradient clipping) are not detailed in the main text.
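A minimal sketch of the stated optimizer setup (AdamW, peak LR $5 \times 10^{-5}$ with cosine decay); the total step count and the final LR floor are placeholders, since the paper does not report them.

```python
import torch

model = torch.nn.Linear(8, 8)                    # stand-in for the actual VLM
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Cosine decay from the 5e-5 peak; total_steps and eta_min are placeholders.
total_steps = 10_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=5e-7)

for step in range(total_steps):
    # ... forward / backward on a batch of serialized document sequences ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```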
Teacher–student procedures:
- Teacher VLM (Qwen2.5-VL-72B) generates multilingual structured documents from English seeds; the 7B student is fine-tuned to imitate this behavior.
- Oracle VLM (Qwen2.5-VL-7B Instruct) is used during Stage 3 to identify potential errors in model outputs; those are then corrected by humans and fed back as high-signal supervision.
Evaluation
Metrics
- Edit distances: Overall Edit summarizes combined errors across layout, text, and structure. Text, Formula, and Table Edit focus on components.
- Table structure: TableTEDS for comparing predicted vs ground truth table structures.
- Reading order: Reading Order Edit measures discrepancy between predicted and target reading sequences.
- Layout detection F1: Two-stage matching and clustering metric described in Algorithm 1 of the supplement.
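For reference, the edit-based scores above are normalized Levenshtein distances (lower is better). A minimal version is shown below, assuming normalization by the longer string, which may differ from the benchmark's exact convention.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string length (0 = exact match)."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                                 # deletion
                         cur[j - 1] + 1,                              # insertion
                         prev[j - 1] + (pred[i - 1] != ref[j - 1]))   # substitution
        prev = cur
    return prev[n] / max(m, n)

print(normalized_edit_distance("reading order", "reading ordre"))  # ≈ 0.154
```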
Key results
OmniDocBench (EN / ZH):
- Overall Edit: 0.125 / 0.160
- Text Edit: 0.032 / 0.066
- Reading Order Edit: 0.040 / 0.067
XDocParse:
- Overall Edit: 0.177
- Text Edit: 0.075
- TableTEDS: 79.2
- Reading Order Edit: 0.152
olmOCR Bench:
- Overall score: 79.1% $\pm$ 1.0, with strong performance on ArXiv, Multi-column, Long Tiny Text, and Base subsets; slightly behind MonkeyOCR-3B on some Old Scans categories but ahead overall.
Hardware / Production
- The paper does not provide detailed hardware specs, training time, or energy use.
- Parameter counts are modest compared to frontier VLMs (about 2.9B total), which suggests that training and inference are considerably cheaper than for very large VLMs, though still substantial.
- There is no discussion of deployment details such as batching strategies, throughput on standard GPUs, or latency for single-page vs multi-page documents.
Note: This analysis follows the Roots Labs OCR paper-notes guidelines and classification taxonomy. For academic or production use, consult the original paper and verify claims through independent evaluation.