Modern OCR for the Large Language & Vision Model Era

A reference page for all things Optical Character Recognition (OCR) using Large Language & Vision Models

Disclaimer: This is a work-in-progress research compilation and personal draft. The coverage is not comprehensive, and the analysis reflects one individual’s perspective on recent developments in the field. This resource should not be used as a definitive reference or benchmark for evaluating research contributions. Please refer to the original papers and conduct your own thorough review for academic or professional purposes.

VLM-Based Models

End-to-end vision-language models with learned multimodal representations for general document OCR.

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-11 | Nemotron Parse 1.1 | notes | 885M, 885M-TC | | NVIDIA Open Model |
| 2025-12 | dots.ocr | notes | 3B | rednote-hilab/dots.ocr | MIT |
| 2025-01 | Ocean-OCR | notes | 3B | guoxy25/Ocean-OCR | Apache 2.0 |
| 2025-10 | olmOCR 2 | notes | 7B | allenai/olmocr | Apache 2.0 |
| 2025-10 | DeepSeek-OCR | notes | 3B | deepseek-ai/DeepSeek-OCR | MIT |
| 2025-09 | POINTS-Reader | notes | 4B | Tencent/POINTS-Reader | Apache 2.0 |
| 2025-09 | MinerU2.5 | notes | 1.2B | opendatalab/MinerU | AGPL-3.0 |
| 2025-06 | Infinity-Parser | notes | 7B | infly-ai/INF-MLLM | Apache 2.0 |
| 2025-05 | Dolphin | notes | 322M, 4B | ByteDance/Dolphin | MIT |
| 2025-04 | VISTA-OCR | notes | | | |
| 2025-03 | SmolDocling | notes | 256M | docling-project/docling | CDLA-Permissive-2.0 |
| 2025-02 | olmOCR | notes | 7B | allenai/olmocr | Apache 2.0 |
| 2024-09 | GOT-OCR2.0 | notes | 580M | Ucas-HaoranWei/GOT-OCR2.0 | Apache 2.0 |
| 2023-08 | Nougat | notes | small, base | facebookresearch/nougat | MIT (code), CC-BY-NC (wts.) |

Pipeline Models

Modular systems combining specialized detection, recognition, and layout analysis components.

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-10 | PaddleOCR-VL | notes | 0.9B | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2025-07 | PaddleOCR 3.0 | notes | HuggingFace | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2025-06 | MonkeyOCR | notes | 3B | Yuliang-Liu/MonkeyOCR | Apache 2.0 |
| 2025-01 | Docling v2 | notes | models | DS4SD/docling | MIT (code), CDLA-Permissive-2.0 (weights) |
| 2024-09 | MinerU | notes | PDF-Extract-Kit | opendatalab/MinerU | Apache 2.0 |

Datasets & Benchmarks

This section catalogs datasets and benchmarks organized by OCR sub-domain. Many benchmarks span multiple domains; they are listed under their primary focus.

Datasets are organized into three tiers based on licensing:

  • Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY) that allow commercial use
  • Research: Copyleft (GPL) or non-commercial (CC-NC) licenses, or explicit research-only restrictions
  • Unclear: No license specified or mixed/complex licensing; verify before use
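
The three tiers can be sketched as a small lookup. This is a minimal illustration, not the procedure used to build this page: the function name `license_tier` and the table `TIER_BY_LICENSE` are hypothetical, and the CDLA-Permissive entries are included only because the tables on this page list them under commercial use. Any string that is not an exact known license, including compound entries such as "MIT (HF); sources include GPL/CC-NC", deliberately falls through to "unclear".

```python
# Hypothetical lookup table; extend it with whatever licenses you have verified yourself.
TIER_BY_LICENSE = {
    "apache-2.0": "commercial",
    "mit": "commercial",
    "cc-by-4.0": "commercial",
    "cdla-permissive-1.0": "commercial",
    "cdla-permissive-2.0": "commercial",
    "gpl-3.0": "research",
    "cc-by-nc-4.0": "research",
    "cc-by-nc-sa-3.0": "research",
    "cc-by-nc-sa-4.0": "research",
}


def license_tier(license_text):
    """Map a license string to 'commercial', 'research', or 'unclear'."""
    if license_text is None:
        return "unclear"
    # Normalize "Apache 2.0" and "Apache-2.0" to the same key.
    key = license_text.strip().lower().replace(" ", "-")
    # Anything not matched exactly is treated as unclear: verify before use.
    return TIER_BY_LICENSE.get(key, "unclear")


if __name__ == "__main__":
    for text in ["Apache 2.0", "GPL-3.0", "MIT (HF); sources include GPL/CC-NC", ""]:
        print(repr(text), "->", license_tier(text))
```

Defaulting to "unclear" rather than guessing keeps the sketch conservative: a mixed or annotated license string never lands in the commercial tier by accident.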

General Document OCR

Research Use Only

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-12 | OmniDocBench | 1,355 pages | Text, formula, table, reading-order | Multi-domain benchmark (EN/ZH) | | |
| 2024-05 | Fox Bench | Dense multi-page docs | Full document parsing | EN/ZH documents | | |

Charts & Visualizations

Chart understanding spans several task families:

  • Extraction: Chart-to-table or chart-to-dict conversion (structured data recovery)
  • QA: Visual question answering requiring numerical reasoning, comparison, or lookup
  • Summarization: Natural language descriptions at varying semantic levels

Most datasets use synthetic charts (programmatically generated from tables) or web-scraped visualizations. Evaluation typically relies on exact/relaxed match for QA, BLEU/ROUGE for summarization, and F1 or RMS error for extraction. Scale varies dramatically: from ~6K charts (ChartY) to 28.9M QA pairs (PlotQA).
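
As an illustration of the relaxed-match idea mentioned above, the sketch below accepts a numeric answer when it falls within a relative tolerance of the target and falls back to case-insensitive exact match for text answers. The 5% default is an assumption based on the tolerance commonly reported for chart QA benchmarks; individual benchmarks may normalize answers differently, and the function name `relaxed_match` is hypothetical.

```python
def _to_float(text):
    """Parse a numeric answer, ignoring thousands separators and a trailing percent sign."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction, target, tolerance=0.05):
    """Return True if the prediction matches the target under relaxed rules."""
    pred_num = _to_float(prediction)
    gold_num = _to_float(target)
    if pred_num is not None and gold_num is not None:
        if gold_num == 0:
            # Relative error is undefined at zero, so require an exact zero.
            return pred_num == 0
        return abs(pred_num - gold_num) / abs(gold_num) <= tolerance
    # Non-numeric answers: case-insensitive exact match.
    return prediction.strip().lower() == target.strip().lower()


if __name__ == "__main__":
    print(relaxed_match("104", "100"))       # within 5% of the target
    print(relaxed_match("106", "100"))       # outside 5% of the target
    print(relaxed_match("Canada", "canada"))
```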

Commercial Use

Datasets with permissive licenses suitable for commercial training and deployment.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-10 | NovaChart | 47K charts, 856K instr. pairs | 18 chart types, 15 tasks (understanding + generation) | notes | GitHub, HuggingFace | MIT (code), Apache-2.0 (dataset) |
| 2024-04 | TinyChartData | 140K PoT pairs | Chart QA with program-of-thought learning | notes | GitHub, HuggingFace | Apache-2.0 |
| 2019-09 | PlotQA | 224K plots, 28.9M QA pairs | Plot question answering with OOV reasoning | notes | GitHub | MIT (code), CC-BY-4.0 (data) |

Research Use Only

Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | OneChart / ChartY | ~6K charts | Chart-to-dict structural extraction, bilingual (EN/ZH) | notes | Project, GitHub | Apache-2.0 (code); research use only |
| 2023-08 | SciGraphQA | 295K multi-turn, 657K QA pairs | Multi-turn scientific graph question answering | notes | GitHub, HuggingFace | Research only (Palm-2/GPT-4 terms) |
| 2023-08 | VisText | 12,441 charts | Chart captioning with semantic richness (L1-L3) | notes | GitHub | GPL-3.0 |
| 2022-05 | ChartQA | 20,882 charts, 32,719 QA pairs | Chart question answering with visual and logical reasoning | notes | GitHub | GPL-3.0 |
| 2022-03 | Chart-to-Text | 44,096 charts | Chart summarization: natural language text generation | notes | GitHub | GPL-3.0 (+ source restrictions) |
| 2019-09 | CHART-Infographics | ~200K synthetic, 4.2K real | Chart classification, text detection/OCR, role classification, axis/legend analysis | notes | Synthetic, PMC | CC-BY-NC-ND 3.0 (S), CC-BY-NC-SA 3.0 (PMC) |
| 2018-04 | DVQA | 300K bar charts, 3.5M QA pairs | Bar chart question answering | notes | GitHub | CC-BY-NC 4.0 |

License Unclear or Mixed

Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | ChartThinker | 595K charts, 8.17M QA pairs | Chart summarization QA | notes | GitHub, HuggingFace | MIT (HF); sources include GPL/CC-NC |
| 2023-07 | DePlot | 516K plot-table, 5.7M QA pairs | Plot-to-table translation, chart question answering | notes | google-research, HuggingFace | Apache-2.0 (model); mixed data licenses |
| 2023-05 | UniChart | 611K charts | Pretraining corpus: table extraction, reasoning, QA, summ. | notes | GitHub | Varies by source (see notes) |
| 2023-04 | ChartSumm | 84K charts | Chart summarization with short and long summaries | notes | GitHub, Drive | Unspecified (no LICENSE file) |
| 2021-01 | ChartOCR / ExcelChart400K | 386,966 charts | Chart-to-table extraction: bar, line, pie | notes | GitHub, HuggingFace | MIT (HF); crawled data, paper silent on license |
| 2018-04 | Beagle | 42K SVG | Visualization type classification | notes | UW | MIT (code only); dataset license not stated |

Mathematical Expression Recognition

Mathematical expression recognition addresses printed, handwritten, and screen-captured formulas with complex 2D spatial structure. The domain is characterized by large symbol inventories (101 classes in CROHME benchmarks, 245 in HME100K, extended vocabularies for LaTeX rendering) and structural relationships such as superscripts, subscripts, fractions, radicals, and matrix layouts.

Task families include:

  • Symbol recognition: Isolated classification with reject options for non-symbol junk
  • Expression parsing: Combined segmentation, classification, and structural relationship extraction
  • Image-to-LaTeX: End-to-end conversion from formula images to markup
  • Matrix recognition: Hierarchical evaluation at matrix, row, column, and cell levels

Evaluation typically measures expression-level exact match rates (ExpRate) alongside object-level metrics for symbol segmentation, classification, and spatial relation detection. CROHME benchmarks indicate structure parsing remains a bottleneck: 90% accuracy with perfect symbol labels versus 67% end-to-end. Recent large-scale datasets (UniMER-1M with 1M+ samples) target real-world complexity beyond clean academic benchmarks, including noisy screen captures, font inconsistencies, and long expressions (up to 7,000+ tokens).
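
The expression-level exact match rate described above can be sketched as follows. This is a simplified illustration that assumes predictions and references are whitespace-tokenized LaTeX strings; real benchmarks apply their own normalization rules before comparison, and the function names here are hypothetical.

```python
def tokenize_latex(latex):
    """Split a whitespace-tokenized LaTeX string into tokens (assumed label format)."""
    return latex.split()


def exprate(predictions, references):
    """Fraction of expressions whose token sequence matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        tokenize_latex(pred) == tokenize_latex(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    preds = ["x ^ 2", "a + b"]
    refs = ["x ^ 2", "a - b"]
    print(exprate(preds, refs))  # one of two expressions matches exactly
```

Because a single wrong token fails the whole expression, ExpRate penalizes long expressions heavily, which is one reason it is usually reported alongside object-level metrics.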

License Unclear or Mixed

Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2024-04 | UniMER-1M | 1,061,791 train; 23,757 test (4 subsets) | Image-to-LaTeX: printed, complex, screen-captured, handwritten | notes | HuggingFace, OpenDataLab | Apache-2.0 (HF tag); upstream sources have mixed licenses |
| 2024-04 | MathWriting | 626k total (230k human, 396k synthetic) | Online handwritten math expression recognition, image-to-LaTeX | notes | Google Storage, HuggingFace | CC-BY-NC-SA 4.0 |
| 2022-03 | HME100K | 74,502 train + 24,607 test images | Handwritten mathematical expression recognition | notes | GitHub, Portal | Unspecified (no LICENSE file) |

Research Use Only

Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2019-09 | CROHME 2019 + TFD | 1,199 test expressions, 236 pages (TFD) | Handwritten math + typeset formula detection | notes | TC10/11 Package, TFD GitHub | CC-BY-NC-SA 3.0 |
| 2016-09 | CROHME 2016 | 1,147 test expressions (Tasks 1/4), 250 test matrices | 4 tasks: formula, symbol, structure, matrix recognition | notes | TC10/11 Package | CC-BY-NC-SA 3.0 |
| 2014-09 | CROHME 2014 | 986 test expressions (10K symbols), 175 matrices, 10K+9K junk | Symbol recognition with reject, expression, matrix parsing | notes | TC11, TC10/11 Package, GitHub | CC-BY-NC-SA 3.0 |

Handwriting Recognition

Handwriting recognition for natural language text focuses on word-level and line-level detection and recognition in unconstrained conditions. Unlike mathematical expressions, which require parsing 2D spatial structure, general handwriting tasks emphasize sequential text extraction from camera-captured images, historical documents, and field notes. Evaluation uses localization metrics (IoU-based measures) for detection and character/word accuracy rates for recognition.
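
The two metric families named above can be sketched in a few lines: intersection-over-union for localization and an edit-distance-based character accuracy for recognition. This is a minimal illustration that assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); datasets with polygon annotations need a polygon intersection instead, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # deletion
                curr[j - 1] + 1,                   # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def char_accuracy(prediction, reference):
    """1 minus the character error rate, clipped at zero."""
    if not reference:
        return 1.0 if not prediction else 0.0
    return max(0.0, 1.0 - edit_distance(prediction, reference) / len(reference))


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(char_accuracy("hel1o", "hello"))
```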

Commercial Use

Datasets with permissive licenses suitable for commercial training and deployment.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-09 | GNHK | 687 images, 39,026 texts, 172,936 chars | Word-level detection and recognition (camera-captured) | notes | GoodNotes, GitHub | CC-BY 4.0 |

Layout Detection & Document Structure

Document layout analysis identifies and classifies page regions (text blocks, tables, figures, titles, lists, headers, footers) and their spatial relationships. Unlike full OCR pipelines that also perform text recognition, layout detection focuses on structural segmentation: predicting bounding boxes and category labels for document components. Evaluation uses object detection metrics (mAP at various IoU thresholds) with per-category AP breakdowns for fine-grained analysis.

The field distinguishes between category-agnostic detection (locating all content regions regardless of type) and category-aware detection (classifying regions into semantic categories). Modern approaches balance two competing pressures: fine-grained taxonomies (20+ categories for specialized documents) versus coarse taxonomies (5-10 categories for cross-domain generalization).
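
To make the detection metric mentioned above concrete, the sketch below computes average precision for one category on one page at a single IoU threshold. It is a simplified illustration under stated assumptions: boxes are axis-aligned (x_min, y_min, x_max, y_max), each detection is greedily matched to the best still-unmatched ground-truth box, and precision is interpolated to be non-increasing in recall. Benchmark toolkits differ in matching and interpolation details, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_threshold=0.5):
    """AP for one category; detections is a list of (score, box) pairs."""
    if not ground_truths:
        return 0.0
    matched = [False] * len(ground_truths)
    hits = []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda det: det[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truths):
            if matched[idx]:
                continue
            overlap = box_iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched[best_idx] = True
            hits.append(1)
        else:
            hits.append(0)
    # Precision and recall after each ranked detection.
    precisions, recalls, true_positives = [], [], 0
    for rank, hit in enumerate(hits, 1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths))
    # Interpolate so precision never increases as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


if __name__ == "__main__":
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
    print(average_precision(dets, gts))
```

Averaging this value over categories gives mAP at that threshold; reporting the per-category values separately corresponds to the per-category AP breakdown.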

Models

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2025-03 | PP-DocLayout | notes | L, _plus-L, M, S | PaddlePaddle/PaddleX | Apache 2.0 |
| 2025-10 | PP-DocLayoutV2 | notes | Model | PaddlePaddle/PaddleOCR | Apache 2.0 |
| 2021-08 | LayoutReader | notes | Model | GitHub | Research only |

Commercial Use

Datasets with permissive licenses suitable for commercial training and deployment.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2022-08 | DocLayNet | 80,863 pages, 11 categories | Layout detection, reading order | | GitHub | CDLA-Permissive-1.0 |
| 2019-09 | PubLayNet | 360K+ pages, 5 categories | Layout detection for scientific documents | | GitHub | CDLA-Permissive-1.0 |
| 2019-05 | DocBank | 500K pages, 13 categories | Weakly supervised layout detection | notes | GitHub | Apache 2.0 |

Research Use Only

Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-08 | ReadingBank | 500K document pages | Reading order detection | notes | GitHub | Research only (no redistribution) |

License Unclear or Mixed

Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|

Table Structure Recognition

Table structure recognition (TSR) extracts the logical structure of tables: cells, rows, columns, spanning relationships, and hierarchical organization. Unlike table detection (which only locates table regions) or table understanding (which also interprets content semantics), TSR focuses on parsing the structural grid, mapping visual layouts to machine-readable formats such as HTML, LaTeX, or specialized tokenization schemes.

The field is characterized by two main architectural families:

  • End-to-end vision-language models: Directly predict table structure as token sequences from images (image-to-markup)
  • Pipeline systems: Combine separate modules for cell detection, structure parsing, and optional content extraction

Evaluation uses tree edit distance (TED) metrics for structural accuracy and mAP/IoU metrics for cell localization. Modern benchmarks emphasize complex spanning (merged cells across rows/columns), multi-page tables, and domain-specific formats (financial statements, scientific papers, invoices).
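
For reference, one widely used tree-edit-distance score is the TEDS formulation introduced with PubTabNet, sketched here under the assumption that predicted and reference tables are both represented as HTML trees $T_a$ and $T_b$, with $|T|$ denoting the number of nodes in a tree:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

The score ranges from 0 to 1, with 1 meaning the two trees are identical. Variants that ignore cell content and compare only the tag structure are also reported in the literature; check which variant a given benchmark uses before comparing numbers.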

Datasets are organized into three tiers based on licensing:

  • Commercial: Permissive licenses (Apache-2.0, MIT, CC-BY, CDLA-Permissive) that allow commercial use
  • Research: Copyleft (GPL) or non-commercial (CC-NC) licenses, or explicit research-only restrictions
  • Unclear: No license specified or mixed/complex licensing; verify before use

Commercial Use

Datasets with permissive licenses suitable for commercial training and deployment.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|
| 2021-09 | PubTables-1M | ~1M tables | Structure recognition, detection | | HuggingFace | CDLA-Permissive-2.0 |
| 2019-11 | PubTabNet | 568K tables | Structure recognition | | GitHub | CDLA-Permissive-1.0 |
| 2018-XX | FinTabNet | 113K tables | Structure recognition | | IBM Developer, HF (.c) | CDLA-Permissive-1.0 |

Research Use Only

Datasets with copyleft (GPL), non-commercial (CC-NC), or explicit research-only restrictions.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|

License Unclear or Mixed

Datasets with unspecified licenses or complex multi-source licensing. Verify terms before use.

| Date | Paper | Size | Tasks | Notes | Data | License |
|---|---|---|---|---|---|---|

Specialized Methods (No New Data)

Papers that introduce methods or training techniques but do not release new datasets. Included for completeness; see original papers for evaluation details.

Chart Understanding

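| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2024-05 | SIMPLOT | notes | | GitHub | No license stated |
| 2024-04 | TinyChart | notes | 3B | X-PLUG/mPLUG-DocOwl | Apache 2.0 |
| 2023-05 | UniChart | notes | base, ChartQA | vis-nlp/UniChart | MIT |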

Table Structure Recognition

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2023-05 | OTSL | notes | | | Not specified |

Mathematical Expression Recognition

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2024-04 | UniMERNet | notes | 100M/202M/325M | opendatalab/UniMERNet | Apache 2.0 |
| 2022-03 | SAN | notes | | Code not publicly released | |

Evaluation & Metrics

| Date | Paper | Notes | Code | License |
|---|---|---|---|---|
| 2024-09 | CDM | notes | opendatalab/UniMERNet | Apache 2.0 |

Handwriting Generation

| Date | Paper | Notes | Weights | Code | License |
|---|---|---|---|---|---|
| 2020-08 | Decoupled Style Descriptors | notes | | GitHub | Non-commercial research only |