A Survey of Multimodal LLMs (2021-2024)

A comprehensive survey of multimodal large language models from 2021 to 2024, covering encoder-only models, encoder-decoder architectures, decoder-only models, and specialized applications for documents and screens.

Multimodal LLMs

This survey presents an overview of multimodal Large Language Models (LLMs) developed between 2021 and 2024. By combining visual and textual understanding, these models enable more sophisticated human-computer interactions.

We organize the models into several categories based on their architectural approaches, examine efficient training frameworks, and explore grounding capabilities that enable models to localize and reference specific regions in images.

Encoder-Only Models

Models that feature encoders for the text and vision modalities but lack any notion of a decoder.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2021 | 02 | CLIP | MIT | 150M, 430M | Yes | Yes | OpenAI |
| 2021 | 02 | ALIGN | - | 600M | No | No | Google |
| 2021 | 07 | ALBEF | BSD-3 | 200M | Yes | Yes | Salesforce |
| 2021 | 11 | VLMo | MIT | 200M, 600M | Yes | Yes | Microsoft |
| 2021 | 12 | FLAVA | BSD-3 | 450M | Yes | Yes | Meta |
| 2021 | 12 | GLIP | MIT | 500M | Yes | Yes | Microsoft |
| 2022 | 07 | GLIPv2 | MIT | - | Yes | Yes | Microsoft |
| 2022 | 08 | BEiT-3 | MIT | 1.9B | Yes | Yes | Microsoft |

Encoder-Decoder Models

Here, we assume a vision encoder, but we allow for the possibility of a text-based encoder-decoder model.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2021 | 08 | SimVLM | - | - | No | No | Google |
| 2022 | 05 | CoCa | - | 0.4B, 0.8B, 2.1B | No | No | Google |
| 2022 | 05 | mPLUG | BSD-3 | 350M, 600M | Yes | Yes | Alibaba |
| 2022 | 09 | PaLI | - | 3B, 15B, 17B | No | No | Google |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | mPLUG-2 | Apache-2.0 | 1.5B, 3B, 6B | Yes | Yes | Alibaba |
| 2023 | 05 | PaLI-X | - | 55B | No | No | Google |
| 2023 | 10 | PaLI-3 | - | 5B | No | No | Google |

Specialized for Documents

Here, we highlight models developed solely for document understanding tasks.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2022 | 12 | UDOP | MIT | 800M | Yes | Yes | Microsoft |

Specialized for Screens

Here, we highlight models developed solely for screen understanding tasks.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2024 | 02 | ScreenAI | - | 0.7B, 2B, 5B | No | No | Google |

Decoder-Only Models

Note: This assumes decoder-only models for text. As usual, vision inputs are still encoded with a vision encoder.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2022 | 01 | BLIP | BSD-3 | 200M, 500M | Yes | Yes | Salesforce |
| 2022 | 04 | Flamingo | - | 3B, 9B, 80B | No | No | Google |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | KOSMOS-1 | MIT | 1.9B | No | Yes | Microsoft |
| 2023 | 04 | LLaVA | MIT | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 04 | mPLUG-Owl | LLaMA, Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 05 | VisionLLM | Apache-2.0 | - | No | No | Shanghai AI |
| 2023 | 06 | KOSMOS-2 | MIT | 1.9B | Yes | Yes | Microsoft |
| 2023 | 06 | Shikra | CC-NC-4.0 | 7B | Yes | No | SenseTime |
| 2023 | 08 | Qwen-VL | Tongyi | 9.6B | Yes | No | Alibaba |
| 2023 | 10 | LLaVA-1.5 | Apache-2.0 | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 11 | SPHINX | Llama | 7B, 13B | Yes | Yes | Shanghai AI |
| 2023 | 11 | InfMLLM | Apache-2.0 | 7B, 13B | Yes | Yes | Inftech.AI |
| 2023 | 11 | mPLUG-Owl2 | Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 11 | CogVLM | Llama | 17B | Yes | Yes | Chinese Univ. |
| 2023 | 11 | Monkey | Tongyi | 9.8B | Yes | No | Chinese Univ. |
| 2023 | 12 | LLaVA-Grounding | CC-BY-NC-4.0 | 7B | Yes | No | Chinese Univ. |
| 2024 | 01 | LLaVA-NEXT | MIT | 7B, 13B, 34B | Yes | Yes | Chinese Univ. |
| 2024 | 02 | LLaVAR | Apache-2.0 | 7B, 13B | Yes | Yes | Adobe |
| 2024 | 02 | SPHINX-X | Apache-2.0 | 1B, 7B, 13B, 8x7B | Yes | Yes | Shanghai AI |
| 2024 | 03 | MM1 | - | 3B, 7B, 30B | No | No | Apple |
| 2024 | 06 | VisionLLM v2 | Apache-2.0 | - | No | No | Shanghai AI |
| 2024 | 08 | mPLUG-Owl3 | MIT | 1B, 2B, 7B | Yes | Yes | Alibaba |

Specialized for Documents

Here, we highlight models developed solely for document understanding tasks.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2023 | 09 | KOSMOS-2.5 | MIT | 1.4B | Yes | Yes | Microsoft |
| 2023 | 11 | DocPedia | - | 7B, 13B | No | No | Chinese Univ. |
| 2024 | 03 | TextMonkey | Tongyi | 9.8B | Yes | Yes | Chinese Univ. |

Specialized for Screens

Here, we highlight models developed solely for screen understanding tasks.

| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|------|-------|-------|---------|---------|----------|-------------|--------------|
| 2022 | 10 | Pix2Struct | Apache-2.0 | 0.3B, 1.3B | Yes | Yes | Google |
| 2023 | 05 | Pix2Act | Apache-2.0 | 0.3B | Yes | Yes | Google |
| 2023 | 12 | CogAgent | Llama | 18B | Yes | Yes | Chinese Univ. |
| 2024 | 01 | SeeClick | Tongyi | 9.6B | Yes | No | Shanghai AI |

Frameworks for Efficient Multimodal LLMs

Frameworks designed to efficiently train or adapt multimodal LLMs are highlighted here, including approaches that use pre-existing models to bootstrap new multimodal models at a fraction of the cost of end-to-end training.

BLIP-2

  • Overview: BLIP-2 efficiently bridges vision and language by using frozen models for both vision and language components, avoiding the need for end-to-end training.
  • Stage 1: Q-Former is trained to align visual features with text using a frozen image encoder, enabling efficient representation learning without modifying the image encoder (a simplified version of this setup is sketched below).
  • Stage 2: Q-Former connects to a frozen LLM for vision-to-language generation, leveraging pre-trained language capabilities without retraining the LLM.
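
To make the Stage-1 idea concrete, here is a minimal PyTorch sketch, not BLIP-2's actual implementation: a small set of learnable query tokens cross-attends to frozen image-encoder features, so only the Q-Former-like module receives gradients. The dimensions and module structure are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QFormerSketch(nn.Module):
    """Toy stand-in for a Q-Former: learnable queries + cross-attention over image features."""
    def __init__(self, num_queries=32, dim=768, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats):                   # image_feats: (B, N_patches, dim)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, image_feats, image_feats)
        return attn_out + self.ffn(attn_out)          # (B, num_queries, dim)

qformer = QFormerSketch()
image_feats = torch.randn(2, 196, 768)                # stand-in for frozen image-encoder patch features
visual_tokens = qformer(image_feats)                  # compact query outputs handed to the LLM stage
```

In Stage 2, these query outputs are projected into the frozen LLM's input embedding space and prepended to the text prompt as soft visual tokens.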

LLaVA & LLaVA-1.5 & LLaVA-NEXT

  • Overview: LLaVA connects a frozen CLIP encoder to a language model, using instruction tuning on GPT-4-generated data for efficient multimodal training. The updated LLaVA-1.5 introduces several enhancements such as an MLP vision-language connector, additional datasets, and scaling to higher resolutions.
  • Stage 1: A simple linear projection (updated to an MLP in LLaVA-1.5) is learned to align visual features with the language model’s embedding space, keeping both models frozen (both connector variants are sketched below).
  • Stage 2: The projection layer and language model are fine-tuned on multimodal instruction data. In LLaVA-1.5, new datasets such as VQA-v2 and ShareGPT data are added, and resolution scaling improves model performance.
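
A minimal sketch of the connector described above, assuming PyTorch; the feature dimensions are illustrative. LLaVA's original design uses a single linear projection, and LLaVA-1.5 replaces it with a two-layer MLP.

```python
import torch.nn as nn

CLIP_DIM, LLM_DIM = 1024, 4096     # illustrative: CLIP ViT-L features into a 7B-scale LLM

# LLaVA: a single linear projection from vision features into the LLM embedding space.
linear_connector = nn.Linear(CLIP_DIM, LLM_DIM)

# LLaVA-1.5: a two-layer MLP connector in place of the linear map.
mlp_connector = nn.Sequential(
    nn.Linear(CLIP_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

# Stage 1 trains only the connector (vision encoder and LLM frozen);
# Stage 2 fine-tunes the connector and the LLM on multimodal instruction data.
```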

mPLUG-Owl

  • Overview: mPLUG-Owl introduces a modularized framework that equips LLMs with multimodal capabilities by decoupling the training of visual and language components, using a two-stage training process to improve efficiency.
  • Stage 1: A frozen LLM is paired with a trainable vision encoder (ViT-L/14) and visual abstractor to align image and text representations. The visual components are trained without modifying the LLM.
  • Stage 2: LoRA is applied to the LLM for efficient fine-tuning on unimodal and multimodal instruction data, while keeping the visual encoder frozen to preserve alignment and reduce computational costs (see the sketch below).
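
A hedged sketch of the Stage-2 recipe using Hugging Face transformers and peft; the checkpoint name, target modules, and LoRA hyperparameters are placeholders rather than mPLUG-Owl's actual settings.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder LLaMA-family checkpoint standing in for the LLM from Stage 1.
llm = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Illustrative LoRA settings: low-rank adapters on the attention projections only.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()    # only the adapter weights require gradients

# The Stage-1 vision encoder and visual abstractor would be kept frozen alongside this,
# so instruction tuning updates only a small fraction of the total parameters.
```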

VisionLLM

  • Overview: VisionLLM presents a unified framework for vision-centric tasks using an LLM-based decoder, efficiently aligning vision and language tasks without end-to-end retraining.
  • Stage 1: A language-guided image tokenizer is trained using a frozen LLM (Alpaca) while fine-tuning the visual backbone (ResNet/InternImage-H) to align visual and language features.
  • Stage 2: The visual backbone is frozen and only the LLM is fine-tuned using LoRA, enabling efficient adaptation to new tasks through flexible language instructions without modifying the vision components (the stage-wise freezing pattern is sketched below).
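
The two stages boil down to which components receive gradients. Below is a minimal sketch of that freezing pattern, with tiny placeholder modules standing in for the components named above; in the actual model, Stage 2 adapts the LLM via LoRA rather than full fine-tuning.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable

# Tiny placeholders standing in for VisionLLM's components.
visual_backbone = nn.Linear(8, 8)    # e.g. ResNet / InternImage-H
image_tokenizer = nn.Linear(8, 8)    # language-guided image tokenizer
llm = nn.Linear(8, 8)                # Alpaca-style decoder

# Stage 1: train the tokenizer and visual backbone against a frozen LLM.
set_trainable(llm, False)
set_trainable(visual_backbone, True)
set_trainable(image_tokenizer, True)

# Stage 2: freeze the vision side and adapt only the LLM (LoRA adapters in practice).
set_trainable(visual_backbone, False)
set_trainable(image_tokenizer, False)
set_trainable(llm, True)
```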

InfMLLM

  • Overview: InfMLLM uses a three-stage training approach with a novel pool-adapter to handle vision-language tasks efficiently.
  • Stage 1 (Pretraining): The pool-adapter is trained while keeping the ViT and LLM frozen, focusing on image-text alignment (a loose sketch of the pool-adapter follows this list).
  • Stage 2 (Multitask Finetuning): The ViT, pool-adapter, and QV projection are unfrozen, training on a mixture of VQA, captioning, and grounding tasks.
  • Stage 3 (Instruction Tuning): The LLM is fully finetuned on instruction-following data, with the ViT remaining frozen.
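
One plausible reading of the pool-adapter is sketched below: adaptive pooling reduces the number of ViT patch tokens before a projection into the LLM embedding space. The dimensions, token count, and exact module design are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PoolAdapterSketch(nn.Module):
    """Loose sketch: pool ViT patch tokens down to a fixed count, then project."""
    def __init__(self, vit_dim=1024, llm_dim=4096, num_tokens=32):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, patch_tokens):                   # (B, N_patches, vit_dim)
        pooled = self.pool(patch_tokens.transpose(1, 2)).transpose(1, 2)
        return self.proj(pooled)                       # (B, num_tokens, llm_dim)

adapter = PoolAdapterSketch()
out = adapter(torch.randn(2, 576, 1024))               # -> torch.Size([2, 32, 4096])
```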

mPLUG-Owl2

  • Overview: mPLUG-Owl2 leverages modality collaboration through adaptive modules, reducing interference between vision and language features for efficient multi-modal learning.
  • Stage 1: A modality-adaptive module (MAM) aligns visual and language features, with frozen LLM weights except for the adaptive modules, preserving separate modality representations (the core idea is sketched below).
  • Stage 2: The entire model is unfrozen and fine-tuned on text and multimodal instruction data, enabling cross-modality performance enhancements without compromising on individual modality strengths.
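
A loose sketch of the modality-adaptive idea: vision and text tokens share one sequence but are normalized by modality-specific parameters (the actual module also decouples other projections per modality, which is omitted here). Module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveNorm(nn.Module):
    """Shared token sequence, but vision and text tokens get their own LayerNorm."""
    def __init__(self, dim=4096):
        super().__init__()
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vision = nn.LayerNorm(dim)

    def forward(self, hidden, is_vision):              # hidden: (B, T, D); is_vision: (B, T) bool
        # Compute both norms and select per token; wasteful, but keeps the sketch simple.
        return torch.where(is_vision.unsqueeze(-1),
                           self.norm_vision(hidden),
                           self.norm_text(hidden))

mam = ModalityAdaptiveNorm(dim=16)
hidden = torch.randn(1, 6, 16)
is_vision = torch.tensor([[True, True, False, False, False, False]])
mixed = mam(hidden, is_vision)                          # per-token, modality-specific normalization
```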

LLaVA-Grounding

  • Overview: LLaVA-Grounding integrates grounded visual chat and localization by combining a language model, a visual encoder, and a grounding model, training them in stages with most components frozen to reduce computational cost (the stage-wise schedule is sketched below).
  • Stage 1: Pretraining aligns vision and language features by finetuning a projection layer and grounding model while keeping the language model frozen.
  • Stage 2: Instruction tuning on GVC data adjusts the language and grounding components for grounded responses, while the vision encoder and prompt encoder remain frozen.
  • Stage 3: For visual prompts (e.g., boxes, marks), only the prompt encoder and specific layers are fine-tuned, supporting efficient visual-grounded chat.
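
A minimal sketch of the stage-wise schedule described above, expressed as a map from stage to the components that receive gradients; the component names are hypothetical labels for the parts listed in the bullets, not identifiers from the released code.

```python
import torch.nn as nn

# Hypothetical component labels; the point is which parts train in which stage.
TRAINABLE_BY_STAGE = {
    "stage1_alignment":     {"projection", "grounding_model"},
    "stage2_instruction":   {"language_model", "grounding_model"},
    "stage3_visual_prompt": {"prompt_encoder"},
}

def configure_stage(parts: dict, stage: str) -> None:
    """Freeze everything, then unfreeze only the components listed for this stage."""
    for name, module in parts.items():
        for p in module.parameters():
            p.requires_grad = name in TRAINABLE_BY_STAGE[stage]

parts = {name: nn.Linear(4, 4) for name in
         ("projection", "grounding_model", "language_model", "prompt_encoder", "vision_encoder")}
configure_stage(parts, "stage2_instruction")   # only language & grounding parts now require grad
```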

Grounding Capabilities

Sometimes when a multimodal model produces an answer, it must also refer to a location in an image. For example, in a question-answering task, the model might need to point to a specific region of an image to justify its answer. Or, for a model that operates on screens, it might be asked where a login button is located. Unfortunately, not all multimodal models have this capability.

Here, we provide a short list of models that are trained to provide grounding and briefly note how each provides it.

Types of Grounding

  • Special Tokens: Some models use special tokens to indicate the location of an object in an image. Frequently, the x- and y-coordinates are uniformly discretized, with a special token for each grid point in the x- and y-dimensions.
    • For example, Google likes to construct models that discretize x- and y-coordinates into 1000 bins each. Each bounding box is then associated with a sequence of four special tokens, representing the x- and y-coordinates of the top-left and bottom-right corners of the bounding box (see the coordinate-encoding sketch after this list).
    • A similar idea was used in the KOSMOS-2 model, which used a special token for each vision patch as determined by its visual encoder. This allows the model to produce a bounding box for a given object in an image using two special tokens: one for the top-left corner and one for the bottom-right corner.
  • Direct Prediction: Some models directly predict the bounding box coordinates for an object in an image. Coordinates are scaled to a fixed range (e.g., [0, 1]) and reduced to a fixed precision (e.g., 2 decimal places). From there, raw tuples are predicted directly in the text output.
    • For example, the LLaVA model directly predicts the bounding box coordinates for objects in images. The model is trained to predict the x- and y-coordinates of the top-left and bottom-right corners of the bounding box.
    • The same format is sometimes used to output a general polygon as well.
  • Object Detection: Some models use object detection to predict bounding boxes for objects in images. These models independently predict boxes and masks with a traditional object detection branch.
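
A small sketch contrasting the two text-based formats above: discretized location tokens with 1000 bins per axis, and direct prediction of normalized coordinates at two decimal places. The token naming and coordinate ordering are illustrative; real models differ in both (some, for instance, emit the y-coordinate before the x-coordinate).

```python
def box_to_loc_tokens(box, width, height, bins=1000):
    """Discretize an (x0, y0, x1, y1) pixel box into four location tokens.
    Token naming (<loc_0> .. <loc_999>) is illustrative, not any model's exact vocabulary."""
    x0, y0, x1, y1 = box
    def bin_of(value, size):
        return min(bins - 1, int(value / size * bins))
    return "".join(f"<loc_{b}>" for b in
                   (bin_of(x0, width), bin_of(y0, height),
                    bin_of(x1, width), bin_of(y1, height)))

def box_to_plain_text(box, width, height, precision=2):
    """Direct prediction: normalize to [0, 1] and emit the tuple as plain text."""
    x0, y0, x1, y1 = box
    coords = (x0 / width, y0 / height, x1 / width, y1 / height)
    return "[" + ", ".join(f"{c:.{precision}f}" for c in coords) + "]"

box = (120, 48, 360, 200)                        # pixel coordinates in a 640x480 image
print(box_to_loc_tokens(box, 640, 480))           # <loc_187><loc_100><loc_562><loc_416>
print(box_to_plain_text(box, 640, 480))           # [0.19, 0.10, 0.56, 0.42]
```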

Grounding Models

| Model | Grounding Type |
|-------|----------------|
| VisionLLM | Direct Prediction or Special Tokens |
| KOSMOS-2 | Patch-based Special Tokens |
| Shikra | Direct Prediction |
| Qwen-VL | Direct Prediction |
| KOSMOS-2.5 | Discretized Special Tokens |
| LLaVA-1.5 | Direct Prediction |
| SPHINX | Direct Prediction |
| InfMLLM | Direct Prediction |
| CogVLM | Direct Prediction |
| LLaVA-Grounding | Object Detection |
| SPHINX-X | Direct Prediction |