A Survey of Multimodal LLMs (2021-2024)
A comprehensive survey of multimodal large language models from 2021 to 2024, covering encoder-only models, encoder-decoder architectures, decoder-only models, and specialized applications for documents and screens.
Table of Contents
- Multimodal LLMs
- 2021-02: ALIGN
- 2021-02: CLIP
- 2021-07: ALBEF
- 2021-08: SimVLM
- 2021-11: VLMo
- 2021-12: FLAVA
- 2021-12: GLIP
- 2022-01: BLIP
- 2022-04: Flamingo
- 2022-05: CoCa
- 2022-05: mPLUG
- 2022-07: GLIPv2
- 2022-08: BEiT-3
- 2022-09: PaLI
- 2022-10: Pix2Struct
- 2022-12: UDOP
- 2023-01: BLIP-2
- 2023-02: KOSMOS-1
- 2023-02: mPLUG-2
- 2023-04: LLaVA
- 2023-04: mPLUG-Owl
- 2023-05: PaLI-X
- 2023-05: Pix2Act
- 2023-05: VisionLLM
- 2023-06: KOSMOS-2
- 2023-06: Shikra
- 2023-08: Qwen-VL
- 2023-09: KOSMOS-2.5
- 2023-10: LLaVA-1.5
- 2023-10: PaLI-3
- 2023-11: CogVLM
- 2023-11: DocPedia
- 2023-11: InfMLLM
- 2023-11: Monkey
- 2023-11: mPLUG-Owl2
- 2023-11: SPHINX
- 2023-12: CogAgent
- 2023-12: LLaVA-Grounding
- 2024-01: LLaVA-NEXT
- 2024-01: SeeClick
- 2024-02: LLaVAR
- 2024-02: ScreenAI
- 2024-02: SPHINX-X
- 2024-03: MM1
- 2024-03: TextMonkey
- 2024-06: VisionLLM v2
- 2024-08: mPLUG-Owl3
Multimodal LLMs
This comprehensive survey presents an overview of multimodal Large Language Models (LLMs) developed between 2021 and 2024. These models represent a significant advancement in AI, combining visual and textual understanding to enable more sophisticated human-computer interactions.
We organize the models into several categories based on their architectural approaches, examine efficient training frameworks, and explore grounding capabilities that enable models to localize and reference specific regions in images.
Encoder-Only Models
Models that feature encoders for both the text and vision modalities, but include no decoder.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2021 | 02 | CLIP | MIT | 150M, 430M | Yes | Yes | OpenAI |
| 2021 | 02 | ALIGN | - | 600M | No | No | Google |
| 2021 | 07 | ALBEF | BSD-3 | 200M | Yes | Yes | SalesForce |
| 2021 | 11 | VLMo | MIT | 200M, 600M | Yes | Yes | Microsoft |
| 2021 | 12 | FLAVA | BSD-3 | 450M | Yes | Yes | Meta |
| 2021 | 12 | GLIP | MIT | 500M | Yes | Yes | Microsoft |
| 2022 | 07 | GLIPv2 | MIT | - | Yes | Yes | Microsoft |
| 2022 | 08 | BEiT-3 | MIT | 1.9B | Yes | Yes | Microsoft |
Encoder-Decoder Models
Here, we assume a vision encoder, but allow the text backbone to be an encoder-decoder model.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2021 | 08 | SimVLM | - | - | No | No | Google |
| 2022 | 05 | CoCa | - | 0.4B, 0.8B, 2.1B | No | No | Google |
| 2022 | 05 | mPLUG | BSD-3 | 350M, 600M | Yes | Yes | Alibaba |
| 2022 | 09 | PaLI | - | 3B, 15B, 17B | No | No | Google |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | mPLUG-2 | Apache-2.0 | 1.5B, 3B, 6B | Yes | Yes | Alibaba |
| 2023 | 05 | PaLI-X | - | 55B | No | No | Google |
| 2023 | 10 | PaLI-3 | - | 5B | No | No | Google |
Specialized for Documents
Here, we highlight models developed solely for document understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 12 | UDOP | MIT | 800M | Yes | Yes | Microsoft |
Specialized for Screens
Here, we highlight models developed solely for screen understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2024 | 02 | ScreenAI | - | 0.7B, 2B, 5B | No | No | Google |
Decoder-Only Models
Note: This assumes decoder-only models for text. As usual, vision inputs are still encoded with a vision encoder.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 01 | BLIP | BSD-3 | 200M, 500M | Yes | Yes | Salesforce |
| 2022 | 04 | Flamingo | - | 3B, 9B, 80B | No | No | DeepMind |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | KOSMOS-1 | MIT | 1.9B | No | Yes | Microsoft |
| 2023 | 04 | LLaVA | MIT | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 04 | mPLUG-Owl | LLaMA, Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 05 | VisionLLM | Apache-2.0 | - | No | No | Shanghai AI |
| 2023 | 06 | KOSMOS-2 | MIT | 1.9B | Yes | Yes | Microsoft |
| 2023 | 06 | Shikra | CC-BY-NC-4.0 | 7B | Yes | No | SenseTime |
| 2023 | 08 | Qwen-VL | Tongyi | 9.6B | Yes | No | Alibaba |
| 2023 | 10 | LLaVA-1.5 | Apache-2.0 | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 11 | SPHINX | Llama | 7B, 13B | Yes | Yes | Shanghai AI |
| 2023 | 11 | InfMLLM | Apache-2.0 | 7B, 13B | Yes | Yes | Inftech.AI |
| 2023 | 11 | mPLUG-Owl2 | Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 11 | CogVLM | Llama | 17B | Yes | Yes | Chinese Univ. |
| 2023 | 11 | Monkey | Tongyi | 9.8B | Yes | No | Chinese Univ. |
| 2023 | 12 | LLaVA-Grounding | CC-BY-NC-4.0 | 7B | Yes | No | Chinese Univ. |
| 2024 | 01 | LLaVA-NEXT | MIT | 7B, 13B, 34B | Yes | Yes | Chinese Univ. |
| 2024 | 02 | LLaVAR | Apache-2.0 | 7B, 13B | Yes | Yes | Adobe |
| 2024 | 02 | SPHINX-X | Apache-2.0 | 1B, 7B, 13B, 8x7B | Yes | Yes | Shanghai AI |
| 2024 | 03 | MM1 | - | 3B, 7B, 30B | No | No | Apple |
| 2024 | 06 | VisionLLM v2 | Apache-2.0 | - | No | No | Shanghai AI |
| 2024 | 08 | mPLUG-Owl3 | MIT | 1B, 2B, 7B | Yes | Yes | Alibaba |
Specialized for Documents
Here, we highlight models developed solely for document understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2023 | 09 | KOSMOS-2.5 | MIT | 1.4B | Yes | Yes | Microsoft |
| 2023 | 11 | DocPedia | - | 7B, 13B | No | No | Chinese Univ. |
| 2024 | 03 | TextMonkey | Tongyi | 9.8B | Yes | Yes | Chinese Univ. |
Specialized for Screens
Here, we highlight models developed solely for screen understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 10 | Pix2Struct | Apache-2.0 | 0.3B, 1.3B | Yes | Yes | Google |
| 2023 | 05 | Pix2Act | Apache-2.0 | 0.3B | Yes | Yes | Google |
| 2023 | 12 | CogAgent | Llama | 18B | Yes | Yes | Chinese Univ. |
| 2024 | 01 | SeeClick | Tongyi | 9.6B | Yes | No | Shanghai AI |
Frameworks for Efficient Multimodal LLMs
Frameworks designed to efficiently train or adapt multimodal LLMs are highlighted here. This includes approaches that use pre-existing models to bootstrap new multimodal models at a fraction of the cost of end-to-end training.
BLIP-2
- Overview: BLIP-2 efficiently bridges vision and language by using frozen models for both vision and language components, avoiding the need for end-to-end training.
- Stage 1: Q-Former is trained to align visual features with text using a frozen image encoder, enabling efficient representation learning without modifying the image encoder.
- Stage 2: Q-Former connects to a frozen LLM for vision-to-language generation, leveraging pre-trained language capabilities without retraining the LLM.
LLaVA & LLaVA-1.5 & LLaVA-NEXT
- Overview: LLaVA connects a frozen CLIP encoder to a language model, using instruction tuning on GPT-4-generated data for efficient multimodal training. The updated LLaVA-1.5 introduces several enhancements such as an MLP vision-language connector, additional datasets, and scaling to higher resolutions.
- Stage 1: A simple linear projection (updated to MLP in LLaVA-1.5) is learned to align visual features with the language model’s embedding space, keeping both models frozen.
- Stage 2: The projection layer and language model are fine-tuned on multimodal instruction data. In LLaVA-1.5, new datasets such as VQA-v2 and ShareGPT data are added, and resolution scaling improves model performance.
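Below is a minimal PyTorch sketch (not the official LLaVA code) of the stage-1 setup: the vision encoder and language model stay frozen, and only a small projector that maps visual features into the LLM's embedding space is trained. Module names and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Linear projection (LLaVA) or two-layer MLP (LLaVA-1.5) from vision-feature space
    into the language model's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, use_mlp: bool = True):
        super().__init__()
        if use_mlp:  # LLaVA-1.5-style MLP connector
            self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))
        else:        # original LLaVA linear projection
            self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

def freeze(module: nn.Module) -> None:
    """Stage-1 trick: keep a pretrained module fixed so only the projector learns."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholders standing in for the pretrained CLIP ViT and the LLM.
vision_encoder = nn.Linear(1024, 1024)
language_model = nn.Linear(4096, 4096)

freeze(vision_encoder)
freeze(language_model)
projector = VisionProjector(vision_dim=1024, llm_dim=4096)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)  # only the projector is updated
```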
mPLUG-Owl
- Overview: mPLUG-Owl introduces a modularized framework that equips LLMs with multimodal capabilities by decoupling the training of visual and language components, using a two-stage training process to improve efficiency.
- Stage 1: A frozen LLM is paired with a trainable vision encoder (ViT-L/14) and visual abstractor to align image and text representations. The visual components are trained without modifying the LLM.
- Stage 2: LoRA is applied to the LLM for efficient fine-tuning on unimodal and multimodal instruction data, while keeping the visual encoder frozen to preserve alignment and reduce computational costs.
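As a hedged illustration of the LoRA step, the sketch below attaches low-rank adapters to a causal LM via the Hugging Face `peft` library; the base checkpoint, target modules, and rank are assumptions, not mPLUG-Owl's actual configuration.

```python
# Hedged sketch: attaching LoRA adapters to a causal LM with Hugging Face `peft`,
# in the spirit of mPLUG-Owl's stage-2 fine-tuning. The base model name, target
# modules, and rank below are illustrative, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # illustrative base LLM
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```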
VisionLLM
- Overview: VisionLLM presents a unified framework for vision-centric tasks using an LLM-based decoder, efficiently aligning vision and language tasks without end-to-end retraining.
- Stage 1: A language-guided image tokenizer is trained using a frozen LLM (Alpaca) while fine-tuning the visual backbone (ResNet/InternImage-H) to align visual and language features.
- Stage 2: The visual backbone is frozen and only the LLM is fine-tuned using LoRA, enabling efficient adaptation to new tasks through flexible language instructions without modifying the vision components.
InfMLLM
- Overview: InfMLLM uses a three-stage training approach with a novel pool-adapter to handle vision-language tasks efficiently.
- Stage 1 (Pretraining): The pool-adapter is trained while keeping the ViT and LLM frozen, focusing on image-text alignment.
- Stage 2 (Multitask Finetuning): The ViT, pool-adapter, and QV projection are unfrozen, training on a mixture of VQA, captioning, and grounding tasks.
- Stage 3 (Instruction Tuning): The LLM is fully finetuned on instruction-following data, with the ViT remaining frozen.
mPLUG-Owl2
- Overview: mPLUG-Owl2 leverages modality collaboration through adaptive modules, reducing interference between vision and language features for efficient multi-modal learning.
- Stage 1: A modality-adaptive module (MAM) aligns visual and language features, with frozen LLM weights except for the adaptive modules, preserving separate modality representations.
- Stage 2: The entire model is unfrozen and fine-tuned on text and multimodal instruction data, enabling cross-modality performance enhancements without compromising on individual modality strengths.
LLaVA-Grounding
- Overview: LLaVA-Grounding integrates grounded visual chat and localization by combining frozen and partially trainable models through a staged training process, leveraging a language model, visual encoder, and grounding model to reduce computational needs.
- Stage 1: Pretraining aligns vision and language features by finetuning a projection layer and grounding model while keeping the language model frozen.
- Stage 2: Instruction tuning on GVC data adjusts the language and grounding components for grounded responses, while the vision encoder and prompt encoder remain frozen.
- Stage 3: For visual prompts (e.g., boxes, marks), only the prompt encoder and specific layers are fine-tuned, supporting efficient visual-grounded chat.
Grounding Capabilities
Sometimes when a multimodal model produces an answer, it also must refer to a location in an image. For example, in the case of a question-answering task, the model might need to point to a specific region in an image to justify its answer. Or, in the case of a model that operates on screens, it might be asked where a login button is located. Unfortunately, not all multimodal models have this capability.
Here, we provide a short list of models that are trained to provide grounding, along with a brief note on how each one does so.
Types of Grounding
- Special Tokens: Some models use special tokens to indicate the location of an object in an image. Frequently, the x- and y-coordinates are uniformly discretized onto a grid, with a special token for each grid point in the x- and y-dimensions.
  - For example, Google frequently builds models that discretize the x- and y-coordinates into 1000 bins each. Each bounding box is then associated with a sequence of four special tokens, representing the x- and y-coordinates of its top-left and bottom-right corners (a discretization sketch follows this list).
- A similar idea was used in the KOSMOS-2 model, which used a special token for each vision patch as determined by its visual encoder. This allows the model to produce a bounding box for a given object in an image using two special tokens: one for the top-left corner and one for the bottom-right corner.
- Direct Prediction: Some models directly predict the bounding box
coordinates for an object in an image. Coordinates are scaled to a fixed
range (e.g., [0, 1]) and reduced to a fixed precision (e.g., 2 decimal
places). From there, raw tuples are predicted directly in the text output.
- For example, the LLaVA model directly predicts the bounding box coordinates for objects in images. The model is trained to predict the x- and y-coordinates of the top-left and bottom-right corners of the bounding box.
- This approach is sometimes used to output general polygons as well.
- Object Detection: Some models use object detection to predict bounding boxes for objects in images. These models independently predict boxes and masks with a traditional object detection branch.
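The sketch below illustrates the 1000-bin special-token scheme described above; the token naming is hypothetical, since each model defines its own location vocabulary.

```python
# Illustrative sketch of the 1000-bin location-token scheme described above.
# Token naming ("<loc_0123>") is hypothetical; each model defines its own vocabulary.
def box_to_location_tokens(box, num_bins=1000):
    """Map a box (x0, y0, x1, y1) with coordinates normalized to [0, 1]
    onto four discrete location tokens."""
    tokens = []
    for coord in box:
        bin_idx = min(int(coord * num_bins), num_bins - 1)  # clamp 1.0 into the last bin
        tokens.append(f"<loc_{bin_idx:04d}>")
    return tokens

# e.g. a box covering the top-left quadrant of the image:
print(box_to_location_tokens((0.0, 0.0, 0.5, 0.5)))
# ['<loc_0000>', '<loc_0000>', '<loc_0500>', '<loc_0500>']
```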
Grounding Models
| Model | Grounding Type |
|---|---|
| VisionLLM | Direct Prediction or Special Tokens |
| KOSMOS-2 | Patch-based Special Tokens |
| Shikra | Direct Prediction |
| Qwen-VL | Direct Prediction |
| KOSMOS-2.5 | Discretized Special Tokens |
| LLaVA-1.5 | Direct Prediction |
| SPHINX | Direct Prediction |
| InfMLLM | Direct Prediction |
| CogVLM | Direct Prediction |
| LLaVA-Grounding | Object Detection |
| SPHINX-X | Direct Prediction |
2021-02: ALIGN
Quick Notes
- Encoder-only model.
- No GitHub repo.
- No Hugging Face Checkpoints.
- Organization: Google.
Key Links
- Arxiv: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- PMLR: Proceedings of the 38th International Conference on Machine Learning
Guiding Questions
What kind of paper is this?
This paper proposes a new method in visual and vision-language representation learning using noisy, large-scale data, leveraging a dual-encoder architecture. The paper mainly presents a big idea with evaluation: a scalable approach to representation learning from noisy, uncurated data.
What is the motivation and potential impact of the paper?
- Motivation:
- Visual and vision-language representations still rely heavily on curated, human-labeled datasets (e.g., ImageNet, MSCOCO), which are expensive and time-consuming to create.
- There is a need to scale up representation learning using larger datasets that don’t require this level of manual curation to support improved downstream task performance and enable zero-shot learning.
- Potential Impact:
- The approach can enable the training of state-of-the-art models with large but noisy data instead of clean, expensive datasets.
- This can reduce the cost and time associated with creating labeled datasets and improve the ability of models to generalize to a wide range of tasks, including zero-shot learning and cross-modal retrieval.
What is the relevant related work and what is this paper’s contribution?
- Related Work:
- Pre-training on large, curated datasets like ImageNet and OpenImages has become the standard for visual representation learning.
- Vision-language pre-training has similarly relied on curated datasets like Conceptual Captions and MSCOCO, which are smaller and require human annotation.
- CLIP (Radford et al., 2021) is a closely related model that also uses image-text pairs for visual representation learning.
- Contribution:
- The key contribution is the ALIGN model, which scales representation learning using a noisy dataset of over 1 billion image alt-text pairs without requiring extensive filtering or cleaning. The paper shows that scale can compensate for noise.
- ALIGN achieves state-of-the-art results in image-text retrieval benchmarks and zero-shot image classification.
What are the results (Theory/Experiments)?
The paper presents experiments on several tasks, including zero-shot image classification, image-to-text and text-to-image retrieval, and transfer learning to ImageNet and fine-grained classification tasks.
- ALIGN achieves state-of-the-art results on Flickr30K and MSCOCO benchmarks, outperforming previous models like CLIP and UNITER.
- In zero-shot classification, ALIGN performs comparably to CLIP with 76.4% top-1 accuracy on ImageNet.
- ALIGN also shows strong performance on transfer learning tasks and smaller fine-grained classification datasets.
Details of Note
Dataset
- The dataset is constructed using a methodology similar to Conceptual Captions (Sharma et al., 2018).
- It consists of 1.8 billion image-text pairs, making it one of the largest available datasets for vision-language pre-training.
- The dataset is noisy due to minimal filtering; only frequency-based heuristics are applied to reduce noise.
- Image-based filtering includes removing low-resolution and pornographic images, as well as duplicates from evaluation sets.
- Text-based filtering involves removing alt-texts that are too short, too long, or contain rare or irrelevant tokens.
- Unlike Conceptual Captions, the dataset avoids extensive cleaning or curation to prioritize scale, demonstrating that scale compensates for noise in training.
Methodology Highlights
Dual-Encoder Architecture:
- The model uses separate encoders for images and text, aligning their embeddings in a shared latent space using contrastive learning.
- Image Encoder: EfficientNet models (e.g., EfficientNet-L2) are used to encode the visual features.
- Text Encoder: BERT-based models (e.g., BERT-Large) are employed for text embeddings.
Contrastive Loss:
- A normalized softmax contrastive loss is applied, where matched image-text pairs are pushed together in the embedding space, and non-matched pairs are pushed apart.
- Both image-to-text and text-to-image losses are used, making it analogous to a classification objective with text embeddings acting as “label” weights.
In-batch Negatives:
- The contrastive learning relies on in-batch negatives, which are all other image-text pairs within the batch. To enhance this, embeddings from all computing cores are concatenated to form larger batches.
Learned Temperature Parameter:
- The softmax temperature, which scales the logits, is learned during training instead of being fixed, allowing it to automatically adjust to improve model performance.
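A minimal PyTorch sketch of this normalized softmax contrastive loss with a learned temperature is shown below; it is a simplification of ALIGN's implementation, which additionally concatenates embeddings across all cores to enlarge the pool of in-batch negatives.

```python
# Minimal sketch of a normalized-softmax (InfoNCE-style) contrastive loss with a
# learned temperature, as described above. Real ALIGN training additionally gathers
# embeddings from all TPU cores before computing the similarity matrix.
import torch
import torch.nn.functional as F

log_temperature = torch.nn.Parameter(torch.tensor(0.0))  # learned; temperature = exp(log_temperature)

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / log_temperature.exp()  # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))                  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_loss(torch.randn(8, 640), torch.randn(8, 640))
```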
Training Framework:
- The training batch size is massive (up to 16,384 effective batch size), utilizing 1024 TPUv3 cores. This is critical for scaling the noisy data and ensuring convergence.
- The model is trained with LAMB optimizer with weight decay, which allows large-batch training and ensures stability.
Zero-shot Transfer and Fine-tuning:
- ALIGN is evaluated in zero-shot settings, where class names are fed into the text encoder, enabling zero-shot image classification.
- The model performs exceptionally well in both zero-shot and fine-tuned transfer to downstream tasks, including Flickr30K and MSCOCO retrieval benchmarks, and classification tasks like ImageNet.
Training Details
- Optimizer: LAMB optimizer with weight decay set to 1e-5.
- Batch Size: 16,384 effective batch size, utilizing 1024 TPUv3 cores.
- Learning Rate:
- Warmup for the first 10,000 steps to a peak learning rate of 1e-3.
- Learning rate is linearly decayed to zero over 1.2M steps (equivalent to 12 epochs).
- In-Batch Negatives: Embeddings from all computing cores are concatenated so the contrastive loss sees the full effective batch, ensuring large batch sizes and efficient contrastive learning.
- Hardware: Training is done on 1024 TPUv3 cores, enabling the model to handle the large-scale dataset and optimize across all cores.
- Precision: Likely using mixed precision for efficiency, though not explicitly mentioned (this is often standard in such large-scale models).
2021-02: CLIP
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: OpenAI.
Key Links
- Arxiv: Learning Transferable Visual Models From Natural Language Supervision
- PMLR: Proceedings of the 38th International Conference on Machine Learning
- GitHub: openai/CLIP
- Hugging Face Checkpoints:
- ViT-B/32 (151M params): openai/clip-vit-base-patch32
- ViT-B/16 (150M params): openai/clip-vit-base-patch16
- ViT-L/14 (428M params): openai/clip-vit-large-patch14
- ViT-L/14@336px (428M params): openai/clip-vit-large-patch14-336
Guiding Questions
What kind of paper is this?
- This paper introduces a new method, and a new idea for how to train multimodal models. It is a method paper but also a bit of a big idea paper.
- It focuses on image-text pairs for zero-shot learning, combining contrastive learning approaches with large-scale pre-training.
What is the motivation and potential impact of the paper?
- Problem: The current state-of-the-art (SOTA) computer vision systems require large, labeled datasets (e.g., ImageNet) to perform well. This reliance on supervised learning limits their generality and applicability to new tasks, as new labeled data is often needed.
- Importance: Learning directly from raw text about images could remove this limitation, enabling broader, more flexible, and more scalable visual learning.
- Impact: The potential for zero-shot transfer of visual models is significant. The authors show that their model, CLIP, can match SOTA supervised models (like ResNet50) without using any labeled training examples, enabling a broader set of tasks like OCR, action recognition, and more.
What is the relevant related work and what is this paper’s contribution?
- Related work: Previous work in image-caption learning (Joulin et al., VirTex, ConVIRT) demonstrated the potential of text-based supervision for learning visual representations but lacked scalability and efficiency compared to fully supervised models.
- Contribution: The paper introduces CLIP, a contrastive learning framework that scales efficiently to large datasets of 400 million image-text pairs. The contribution lies in the contrastive pre-training approach and the demonstration of its ability to perform zero-shot transfer to a wide variety of downstream tasks.
What are the results (Theory/Experiments)?
Experiments: They validate this by testing the model on over 30 computer vision datasets. The results show that CLIP outperforms supervised models on many tasks in a zero-shot setting and is more robust to natural distribution shifts than models like ResNet50.
- They achieve 76.2% accuracy on ImageNet zero-shot, matching ResNet50 trained on labeled data.
- CLIP outperforms supervised models in tasks like action recognition and fine-grained object classification, though it underperforms on more specialized tasks like traffic sign recognition and satellite imagery.
What are the implications of the paper? (Broader Impact)
- Technological implications: The ability to create visual classifiers without specific training data could revolutionize areas like autonomous systems, medical imaging, surveillance, and content moderation. It offers a more scalable and adaptable approach to visual understanding.
- Stakeholders: Researchers and engineers developing AI systems will benefit from the flexibility of this approach. However, communities affected by the deployment of AI (e.g., for surveillance) may experience increased privacy concerns.
- Positive impact: The potential to enable broad transferability of visual models without relying on labeled datasets could drastically reduce the time and cost of developing AI for new domains.
- Negative consequences: There are concerns about bias and fairness, as the model may reinforce existing social biases present in the training data. Additionally, the ease of use for surveillance tasks raises privacy concerns.
Details of Note
Dataset - WebImageText (WIT)
- 400M text-image pairs collected from the internet.
- 500k queries generated using common words from Wikipedia, high pointwise mutual information bigrams, and WordNet synsets.
- Restricted to 20k image-text pairs per query to maintain class balance.
- Filtered to exclude low-quality pairs (e.g., automatically generated filenames or non-descriptive text).
- Covers a broad range of visual concepts and is comparable in size to the WebText dataset used for GPT-2.
- Availability: Proprietary / Non-public.
Method - Contrastive Language-Image Pre-training (CLIP)
Objective: Predict which text is paired with which image in a batch of N image-text pairs, generating N^2 possible pairings (N true, N^2 - N false).
Contrastive Loss: Uses cosine similarity between image and text embeddings. A symmetric cross-entropy loss is applied to maximize similarity for correct pairs and minimize it for incorrect ones.
Multimodal Embedding Space:
- Image Encoder: ResNet or Vision Transformer (ViT).
- Text Encoder: Transformer-based model.
- Both are linearly projected into a shared embedding space and L2 normalized.
- Note: This is a dual-encoder formulation in contrast to fusion encoder formulations that mix all modalities as equal tokens.
Temperature Scaling: A learned temperature parameter controls the sharpness of the similarity distribution.
Data Augmentation: Random square cropping for images; simplified text processing (typically single sentences).
Architecture
Text Encoder:
- Transformer-based, BERT-style.
- 63M parameters.
- Configuration: 12 layers, 512 dimensions, and 8 attention heads.
- Lower-cased vocabulary of 50k tokens.
- Maximum sequence length: 76.
- The text encoder’s width is scaled to match the size of the vision model for different configurations.
Vision Encoders:
ResNet-style (with slight modifications to pooling):
- ResNet-50, ResNet-101, and scaled variants:
- RN50x4 (4x wider), RN50x16, RN50x64.
- ResNet improvements include ResNetD modifications (He et al., 2019) and antialiased blur pooling (Zhang, 2019).
- All ResNet weights are available in the GitHub repo.
ViT-style (Vision Transformer):
- ViT-B/32 (151M parameters): Hugging Face Model.
- ViT-B/16 (150M parameters): Hugging Face Model.
- ViT-L/14 (428M parameters): Hugging Face Model.
- ViT-L/14@336px (428M parameters): Trained at higher 336px resolution for improved performance. Hugging Face Model.
Training
- Optimizer: AdamW.
- Weight decay of 0.1.
- Learning rate set via cosine learning rate schedule.
- Epochs: 32.
- Batch size: 32k images per batch.
- Techniques:
- Gradient checkpointing to reduce memory usage.
- Mixed precision and half precision (FP16) for faster training and lower memory consumption.
- Distributed training across multiple GPUs (e.g., trained on up to 592 V100 GPUs for the largest model).
Hardware and Training Times
- Largest ResNet (RN50x64):
- 18 days on 592 V100 GPUs.
- Largest ViT (ViT-L/14):
- 12 days on 256 V100 GPUs.
- Compute efficiency: ViT models demonstrated higher compute efficiency compared to ResNet models in this setup.
Benchmarks:
- Zero-shot image classification on over 30 datasets spanning a wide
variety of tasks.
- Includes standard datasets such as ImageNet, CIFAR-10, CIFAR-100, and PascalVOC.
- Fine-grained classification tasks (e.g., Stanford Cars, FGVC Aircraft, Flowers102).
- Specialized tasks such as OCR, geo-localization, and action recognition.
- Robustness benchmarks: Evaluated on distribution shifts using datasets like ImageNetV2, ImageNet-R, ImageNet-A, and ObjectNet.
- Achieved competitive performance with fully supervised models (e.g., ResNet50) in a zero-shot setting without using labeled training data.
- Zero-shot CLIP outperforms supervised models on tasks like OCR and action recognition, while underperforming on specialized tasks such as satellite imagery (e.g., EuroSAT), traffic sign recognition (e.g., GTSRB), and object counting (e.g., CLEVRCounts).
- Additional analysis includes few-shot performance comparison and general robustness to natural distribution shifts.
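Since the checkpoints listed in the Key Links above are available on Hugging Face, a minimal zero-shot classification sketch using the `transformers` integration might look like the following; the image path and prompts are placeholders.

```python
# Hedged usage sketch: zero-shot classification with the openai/clip-vit-base-patch32
# checkpoint via the Hugging Face `transformers` integration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per prompt
print(dict(zip(prompts, probs[0].tolist())))
```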
2021-07: ALBEF
Key Links
- Arxiv: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- NeurIPS 2021: Conference Paper
- GitHub: salesforce/ALBEF
- Checkpoints:
- ALBEF (4M training images): Google Storage API
- ALBEF (14M training images): Google Storage API
Guiding Questions
What kind of paper is this?
A vision-language representation learning paper proposing a new framework (ALBEF) to improve vision-and-language pre-training. It focuses on introducing new techniques (Align Before Fuse, momentum distillation) and achieving state-of-the-art results on a set of multimodal tasks.
This is a small idea with evaluation paper. It introduces several new tricks and techniques to improve vision-and-language pre-training and evaluates them on a variety of tasks.
What is the motivation and potential impact of the paper?
Motivation: The paper addresses the limitations in current Vision-and-Language Pre-training (VLP) methods, including:
- Misaligned features: Image features and word tokens are unaligned, making it challenging to learn image-text interactions.
- Resource-Intensive Object Detection: Most existing methods require bounding box annotations and high-resolution images, which are computationally expensive.
- Noisy Data: Web data used in pre-training is noisy, leading to suboptimal model performance.
The authors propose ALBEF to align image and text representations before fusing them using cross-modal attention, making learning more efficient and robust, especially with noisy data.
- Potential Impact: ALBEF aims to outperform existing vision-language models on various downstream tasks like image-text retrieval, visual question answering (VQA), and visual reasoning (NLVR2). The model eliminates the need for bounding box annotations and improves inference speed. Its novel method (momentum distillation) also enhances learning from large, uncurated web datasets.
What is the relevant related work and what is this paper’s contribution?
Related Work:
- Existing VLP methods such as LXMERT, UNITER, OSCAR, etc., use multimodal encoders with object detectors, which have high computational costs and performance limitations when handling noisy web data.
- Recent methods like CLIP and ALIGN that use separate encoders for image and text with a contrastive loss but struggle with complex image-text interactions.
Contribution:
- ALBEF introduces the concept of aligning before fusing the image and text features to improve interaction modeling in multimodal tasks.
- It proposes momentum distillation (MoD) to improve learning in noisy environments, offering theoretical insights on mutual information maximization in this context.
- The model eliminates the need for bounding boxes or high-resolution images and achieves state-of-the-art performance on tasks like image-text retrieval, VQA, and NLVR2.
What are the results (Theory/Experiments)?
Theory:
- The paper provides a theoretical framework for understanding ALBEF’s success, viewing it through the lens of mutual information maximization. Different training tasks (ITC, MLM, MoD) are framed as techniques to generate different views of an image-text pair.
Experiments:
- ALBEF achieves state-of-the-art results across multiple downstream vision-language tasks (e.g., image-text retrieval, VQA, and NLVR2). For example, ALBEF outperforms previous models like VILLA by 2.37% on VQA and 3.84% on NLVR2.
- It also demonstrates faster inference speed and better performance compared to models like CLIP and ALIGN, even when pre-trained on significantly smaller datasets.
Details of Note
Dataset
ALBEF is pre-trained on a combination of image-captioning datasets to cover a wide variety of image-text pairs. The pre-training datasets include:
- Conceptual Captions 3M (CC3M): A large-scale dataset with 3 million image-text pairs, focusing on high-diversity web-crawled images and corresponding descriptions.
- Conceptual Captions 12M (CC12M): A noisier and larger version of CC3M, containing 12 million image-text pairs, providing a broader but less curated data source.
- SBU Captions: A dataset containing image-text pairs scraped from Flickr, known for its diverse visual contexts.
- COCO (Common Objects in Context): A well-established dataset consisting of images depicting everyday scenes with object annotations and captions.
- Visual Genome: A dataset rich in image-region annotations and dense captions, adding granular context to the image-text learning process.
In total, the training dataset comprises:
- 4 million unique images
- 5.1 million image-text pairs
For scaling experiments, the authors also incorporate the Conceptual Captions 12M (CC12M) dataset, bringing the total number of images up to 14.1 million. This larger dataset, while noisier, demonstrates the scalability of ALBEF to web-scale data, showcasing improved performance on downstream tasks.
Methodology Highlights
Image-Text Contrastive (ITC) Loss
- Purpose: Aligns unimodal representations (image and text features) to improve multimodal learning.
- Mechanism: Enforces that the visual and textual embeddings are jointly predictive of their association, using a contrastive loss that aligns them before they are fused in the multimodal encoder.
- Benefits:
- Makes it easier for the multimodal encoder to fuse aligned features, enhancing cross-modal interactions.
- Encourages better understanding of the semantic spaces for both the unimodal encoders (image and text).
- Learns a common low-dimensional space for embedding images and texts, facilitating downstream tasks like image-text retrieval.
Momentum Distillation (MoD)
- Purpose: Helps the model learn from noisy web-based image-text pairs by generating pseudo-targets that provide alternative reasonable outputs.
- Mechanism:
- During training, a momentum version of the model is maintained by taking moving averages of its parameters.
- This momentum model generates pseudo-targets for the main model, which provides additional supervision beyond the noisy annotations.
- Encourages the model to not overfit to web annotations and produce reasonable alternative outputs for tasks like Masked Language Modeling (MLM) and Image-Text Matching (ITM).
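A simplified sketch of the momentum update and pseudo-target blending is shown below; the momentum (0.995) and distillation weight α (0.4) match the training details later in this section, but the code is illustrative rather than ALBEF's implementation.

```python
# Sketch of momentum distillation: an EMA copy of the model generates soft
# pseudo-targets that are blended with the one-hot targets (weight alpha).
import copy
import torch
import torch.nn.functional as F

def ema_update(model: torch.nn.Module, momentum_model: torch.nn.Module, m: float = 0.995) -> None:
    """Update the momentum model as a moving average of the online model's parameters."""
    with torch.no_grad():
        for p, p_m in zip(model.parameters(), momentum_model.parameters()):
            p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def distilled_loss(logits: torch.Tensor, momentum_logits: torch.Tensor,
                   targets: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Cross-entropy on hard targets, blended with the momentum model's soft targets."""
    hard = F.cross_entropy(logits, targets)
    soft = -(F.softmax(momentum_logits, dim=-1) * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    return (1.0 - alpha) * hard + alpha * soft

model = torch.nn.Linear(16, 4)
momentum_model = copy.deepcopy(model)  # kept in sync with ema_update() after each step
```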
Architecture
- Image Encoder: A 12-layer ViT-B/16 (Vision Transformer), pre-trained on ImageNet-1K, with 85.8M parameters.
- Text Encoder: A 6-layer transformer initialized from the first 6 layers of BERT-base.
- Multimodal Encoder: Another 6-layer transformer initialized from the last 6 layers of BERT-base.
- Total Parameters: Approximately 200M parameters (~85.8M for ViT-B/16 and 123.7M from BERT-base).
Pre-training Objectives
- Image-Text Contrastive (ITC) Loss: Aligns the unimodal representations by calculating softmax-normalized similarity scores between paired image and text representations. This loss is minimized to enhance alignment.
- Masked Language Modeling (MLM): A standard task where random tokens in the input text are masked, and the model must predict the masked tokens using both image and text context.
- Image-Text Matching (ITM): Predicts whether a given image-text pair is correctly matched or not, using the joint embedding from the multimodal encoder.
Hard Negative Sampling
- Hard Negatives: For the ITM task, the model mines hard negatives—semantically similar but incorrect pairs—using contrastive similarity, providing a more challenging training signal.
- In-Batch Sampling: Negatives are sampled from the same mini-batch based on similarity scores, ensuring efficient training without additional computational overhead.
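A minimal sketch of this similarity-weighted in-batch sampling is given below; it is a simplification of ALBEF's actual negative mining.

```python
# Sketch of in-batch hard-negative mining for ITM: for each image, sample a
# non-matching text with probability proportional to contrastive similarity
# (and vice versa for texts).
import torch

def sample_hard_negatives(sim: torch.Tensor) -> torch.Tensor:
    """sim: (B, B) image-to-text similarity matrix; returns one hard-negative text index per image."""
    weights = sim.softmax(dim=-1).clone()
    weights.fill_diagonal_(0.0)                 # never pick the positive pair
    return torch.multinomial(weights, num_samples=1).squeeze(-1)

sim = torch.randn(4, 4)
neg_text_idx = sample_hard_negatives(sim)       # hard-negative texts for each image
neg_image_idx = sample_hard_negatives(sim.t())  # hard-negative images for each text
```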
Model Summary
- Training Setup: Pre-training is conducted on large-scale web datasets (e.g., Conceptual Captions, SBU Captions) and domain-specific datasets (e.g., COCO, Visual Genome). The model uses random image crops of resolution 256×256, with RandAugment applied during pre-training.
- Optimizer: The AdamW optimizer is used with a learning rate schedule including warm-up and cosine decay.
Training Details
- Training Duration: The model was trained for 30 epochs.
- Batch Size: Each training run used a batch size of 512.
- Hardware:
- The training was conducted on 8 NVIDIA A100 GPUs.
- This configuration provided sufficient compute for both the image and text encoders as well as the multimodal encoder.
- Optimizer:
- The model used the AdamW optimizer with a weight decay of 0.02.
- Learning Rate Schedule:
- The learning rate was warmed up to 1e-4 within the first 1,000 iterations.
- Following the warm-up phase, a cosine decay schedule reduced the learning rate to 1e-5 by the end of training.
Additional Details
- Data Augmentation: During pre-training, random image crops of resolution 256×256 were used, along with RandAugment (excluding color changes) to increase image variability.
- Fine-Tuning: For fine-tuning tasks, the image resolution was increased to 384×384, and positional encodings for image patches were interpolated.
- Momentum Update: The momentum parameter for updating the momentum model was set to 0.995, and the size of the queue used for image-text contrastive learning was 65,536.
- Distillation Weight: The distillation weight α was linearly ramped from 0 to 0.4 within the first epoch, balancing the pseudo-targets with standard supervision.
2021-08: SimVLM
Quick Notes
- Encoder-decoder model.
- No license. Closed-source.
- Organization: Google.
Key Links
- Arxiv: SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- ICLR 2022 OpenReview: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a vision-language pretraining (VLP) paper proposing a minimalist generative model.
- It presents a new model (SimVLM) and compares it against existing state-of-the-art methods on a range of multimodal benchmarks.
What is the motivation and potential impact of the paper?
- The motivation is to simplify vision-language pretraining by eliminating complex object detection, task-specific losses, and dataset-specific objectives.
- The paper seeks to improve scalability and generalization with weak supervision and end-to-end training on a single objective, Prefix Language Modeling.
- Its potential impact includes enabling zero-shot generalization in vision-language tasks and offering a simpler, more efficient pretraining framework.
What is the relevant related work and what is this paper’s contribution?
- Previous methods (e.g., LXMERT, UNITER) use supervised object detection and complex pretraining pipelines involving auxiliary losses and human-labeled data.
- These approaches struggle with scalability and zero-shot capabilities.
- SimVLM contributes a simplified pretraining protocol using large-scale weak supervision and a single language modeling objective.
- It achieves new state-of-the-art results on several benchmarks, offering both discriminative and generative capabilities.
What are the results (Theory/Experiments)?
- Experiments: SimVLM outperforms previous VLP models on benchmarks like VQA, NLVR2, and image captioning, achieving notable gains without requiring additional data or task-specific customization.
- Zero-shot capability: SimVLM demonstrates strong generalization in zero-shot image captioning and visual question answering (VQA).
- Generative performance: SimVLM improves on open-ended VQA by generating answers in a flexible format, outperforming discriminative baselines.
Details of Note
Dataset
ALIGN: Image-text pairs dataset used for pretraining.
- Sourced from large-scale web-crawled data.
- Includes 4,096 image-text pairs per batch.
C4 (Colossal Clean Crawled Corpus): Text-only corpus.
- Pretraining batches include 512 text documents from C4.
Fine-tuning Benchmarks:
- VQA v2, NLVR2, SNLI-VE, COCO Captioning, NoCaps, and Multi30k datasets were used for fine-tuning and evaluation.
Zero-shot Experiments: Used datasets like COCO and NoCaps to test zero-shot capabilities on image captioning and cross-modality transfer.
Methodology Highlights
Objective:
- Uses a Prefix Language Modeling (PrefixLM) objective.
- PrefixLM enables bidirectional attention on the prefix (vision tokens) and autoregressive attention for the remaining tokens (text).
- The prefix length is randomly sampled during training but forced to be larger than the number of vision tokens to always apply bidirectional attention to vision input.
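A small sketch of the resulting attention pattern is shown below: positions inside the prefix (image tokens plus the sampled text prefix) attend bidirectionally, while the remaining text positions attend causally. This illustrates the PrefixLM idea rather than SimVLM's actual code.

```python
# Sketch of a PrefixLM attention mask. Returns True where attention is allowed.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    causal = torch.ones(seq_len, seq_len).tril().bool()  # standard causal mask
    mask = causal.clone()
    mask[:prefix_len, :prefix_len] = True                # full bidirectional attention inside the prefix
    return mask

# e.g. 4 image tokens + 2 prefix text tokens + 3 target text tokens:
print(prefix_lm_mask(seq_len=9, prefix_len=6).int())
```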
Architecture:
- Encoder-Decoder Transformer model.
- Vision Modality:
- Processes images as flattened patches using the first three blocks of ResNet for contextualized patches.
- The resulting patches are then fed as input tokens to the Transformer.
- Text Modality:
- Uses a standard encoder-decoder setup where vision tokens are prepended to the text input.
Generative Pretraining:
- Unlike contrastive methods like CLIP, SimVLM employs generative training, focusing on text generation rather than classification.
Model Sizes:
- Three model variants: Base, Large, and Huge.
- The variants correspond to different capacities aligned with the ViT architecture.
Unified Training:
- SimVLM is trained end-to-end with no need for object detection pretraining or auxiliary losses, reducing complexity.
Training Details of Note
Batch Composition:
- 4,096 image-text pairs from the ALIGN dataset per batch.
- 512 text-only documents from the C4 dataset per batch.
Hardware:
- Training conducted on 512 TPU v3 chips.
Training Steps:
- Pretrained from scratch for about 1 million steps on the combined dataset.
2021-11: VLMo
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- NeurIPS 2022: Conference Paper
- GitHub: microsoft/unilm
- Checkpoints:
Guiding Questions
What kind of paper is this?
It introduces a novel vision-language pre-trained model (VLMo) that unifies both dual and fusion encoders for vision-language tasks. In this way, it’s just a small idea with evaluation (the small idea being the unification of dual and fusion encoders).
What is the motivation and potential impact of the paper?
Motivation: The paper addresses the challenge of handling vision-language tasks with existing architectures. Previous models either adopt dual encoders (efficient for retrieval tasks) or fusion encoders (effective for vision-language classification but computationally expensive). There is no existing model that flexibly adapts both methods for different types of tasks while leveraging large-scale single-modality data.
Potential impact: The VLMo model can significantly improve the efficiency and accuracy of vision-language tasks across a variety of settings, particularly classification and retrieval tasks. It also demonstrates how stagewise pre-training on single-modality data (image-only, text-only) can enhance multi-modal performance, which can broaden the applicability and scalability of similar models in other multimodal contexts (e.g., speech, video).
What is the relevant related work and what is this paper’s contribution?
Related work:
The paper builds on previous vision-language pre-training models such as CLIP, ALIGN, and ViLT, which either use dual encoders or fusion encoders for vision-language tasks. These models optimize performance on vision-language retrieval or classification tasks, but they each have limitations: dual encoders perform shallow modality interaction, while fusion encoders are computationally expensive for large-scale retrieval tasks.
Contributions:
- The paper proposes a unified model (VLMo) that can function as both a dual encoder and fusion encoder, thus combining the advantages of both architectures.
- It introduces a Multiway Transformer with modality-specific experts (vision, language, vision-language) to flexibly handle different inputs (image-only, text-only, image-text pairs).
- The stagewise pre-training strategy leverages large-scale image-only and text-only data to improve performance on multimodal tasks.
- The model achieves state-of-the-art results on both vision-language classification and retrieval tasks.
What are the results (Theory/Experiments)?
Experiments: The model is evaluated on a range of vision-language tasks, including:
- Visual Question Answering (VQA)
- Natural Language for Visual Reasoning (NLVR2)
- Image-Text Retrieval on datasets like MSCOCO and Flickr30K
- Vision-only tasks, such as image classification on ImageNet and semantic segmentation on ADE20K.
Main results:
- VLMo achieves state-of-the-art results on vision-language tasks, outperforming models like UNITER, ALBEF, and SimVLM, especially in classification tasks where deeper cross-modal interactions are needed.
- VLMo-Large++ demonstrates strong performance when trained on 1B noisy image-text pairs, particularly on VQA and NLVR2, where it sets new benchmark SOTA.
- In retrieval tasks, VLMo performs on par with or better than other models, while offering a faster inference speed than fusion-encoder-based models.
Details of Note
Dataset
Pre-training utilized four image captioning datasets:
- Conceptual Captions (CC)
- SBU Captions
- COCO
- Visual Genome (VG)
Total: ~4M images and 10M image-text pairs.
Methodology Highlights
Architecture
- Vision input: ViT-style patch embeddings.
- Text input: WordPiece tokens, embedded with a max sequence length of 40.
- Initial input: Concatenated text and vision embeddings.
- Mixture of Modality Experts (Multiway Transformer):
- Shared self-attention across all modalities.
- Modality-specific feed-forward (FF) subunits for vision, language, and vision-language tasks.
- Unimodal support: Uses modality-specific FFN for vision or text inputs.
- Multimodal support: Uses unimodal FFN at base layers and multimodal FFN at the top layers (a Multiway block sketch follows this list).
- Architecture type: Encoder-based with modality flexibility.
- Model sizes:
- Base: 12 layers, 768-dimensional hidden state, 12 attention heads, VL fusion in the last 2 layers.
- Large: 24 layers, 1024-dimensional hidden state, 16 attention heads, VL fusion in the last 3 layers.
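Below is a simplified sketch of a Multiway block as described above: shared self-attention with modality-specific feed-forward experts. Dimensions and module names are illustrative, not VLMo's actual implementation.

```python
# Sketch of a Multiway (mixture-of-modality-experts) Transformer block: self-attention
# parameters are shared across modalities, while each modality routes through its own
# feed-forward expert.
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.experts[modality](self.norm2(x))  # modality-specific FFN expert
        return x

block = MultiwayBlock()
vision_tokens = torch.randn(2, 197, 768)
out = block(vision_tokens, modality="vision")
```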
Methods
- Pre-training tasks (stagewise):
- Contrastive learning (CLIP-style) between image and text representations.
- Masked Language Modeling (MLM) using whole-word masking.
- Binary matching classification for image-text pairs. (see: ALBEF)
Pre-training Stages
- Image-only pre-training: Initializes the vision expert and shared attention modules.
- Text-only pre-training: Initializes the language expert.
- Vision-language pre-training: Learns cross-modal representations.
Training Details
- 30 epochs, 512 batch size, using 8xA100 GPUs.
- Optimizer: AdamW with β1 = 0.9, β2 = 0.98, and 0.01 weight decay.
- Learning rate: Peak at 2e-4 (base model) / 5e-5 (large model).
- Warmup: 2.5k linear warmup steps, followed by linear decay.
2021-12: FLAVA
Quick Notes
- Encoder-only model.
- BSD-3 License.
- Weights available. Permissive license.
- Organization: Meta/FAIR.
Key Links
- Arxiv: FLAVA: A Foundational Language And Vision Alignment Model
- CVPR 2022: Conference Paper
- GitHub Site: FLAVA
- GitHub Repo: facebookresearch/multimodal
- Hugging Face: facebook/flava-full
Guiding Questions
What kind of paper is this?
- This is a small idea with comparison paper.
- It presents FLAVA, a unified vision and language model, and evaluates it on a broad set of tasks across vision, language, and multimodal domains.
What is the motivation and potential impact of the paper?
- The paper aims to create a “foundational” model that can handle vision, language, and multimodal tasks simultaneously.
- The problem addressed is that many models excel in either cross-modal or multimodal tasks but not both, limiting their generality.
- The impact is the potential to improve performance across diverse tasks with a single model, using less data than competing models like CLIP and ALIGN.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on work from CLIP, ALIGN, and other contrastive and fusion-based vision-language models.
- It addresses the limitations of contrastive methods that struggle with multimodal tasks and fusion models that ignore unimodal performance.
- The key contribution is the development of a model that jointly handles unimodal (vision or language) and multimodal tasks, along with using publicly available datasets for pretraining.
What are the results (Theory/Experiments)?
- Theory: The paper introduces new multimodal objectives like global contrastive loss, masked multimodal modeling (MMM), and image-text matching (ITM), alongside unimodal losses for vision (MIM) and language (MLM).
- Experiments: FLAVA achieves strong results across 35 tasks, outperforming similar models like CLIP in many multimodal and language tasks, and remains competitive in vision tasks, despite using smaller datasets for pretraining.
- Results show FLAVA’s strength in diverse domains, achieving superior “macro” average scores across vision, language, and multimodal benchmarks.
Details of Note
Dataset
Unimodal Data
Unimodal Text:
- CCNews
- BookCorpus
Unimodal Image:
- ImageNet-1k
Multimodal Data (Image-Text Pairs)
The total number of image-text pairs used in FLAVA’s pretraining is 70 million, derived from the following datasets:
- COCO: 0.9M image-text pairs
- SBU Captions: 1.0M image-text pairs
- Localized Narratives: 1.9M image-text pairs
- Conceptual Captions: 3.1M image-text pairs
- Visual Genome: 5.4M image-text pairs
- Wikipedia Image Text: 4.8M image-text pairs
- Conceptual Captions 12M: 11.0M image-text pairs
- Red Caps: 11.6M image-text pairs
- YFCC100M: Filtered to 30.3M image-text pairs (after applying filters for English captions and removing captions with fewer than two words)
Methodology Highlights
The methodology used in the FLAVA paper revolves around a unified architecture and several pretraining objectives designed for multimodal, vision, and language tasks.
Architecture
- Image Encoder: A ViT-style encoder (Vision Transformer), specifically using the ViT-B/16 configuration (B = base model, 16 = patch size).
- Text Encoder: Mirrors the ViT architecture but uses different (non-shared) parameters from the vision encoder. This allows the model to handle text input in the same structural manner as images, but with distinct learned representations.
- Multimodal Encoder: Also ViT-style, but introduces new parameters to fuse the outputs from the image and text encoders. This is crucial for the model’s capability to reason about multimodal inputs.
- Overall Model Size: The full model consists of three ViT-B/16 blocks (one for images, one for text, and one for multimodal), totaling approximately 450M parameters.
Multimodal Pre-Training Objectives
Global Contrastive Loss: A contrastive learning objective similar to that used in CLIP, where image-text pairs are brought closer in representation space, and mismatched pairs are pushed apart. This objective helps the model align visual and textual representations effectively.
Masked Multimodal Modeling (MMM): This involves masking patches from images and tokens from text, and then reconstructing them. A dVAE codebook is used for the image encoder to tokenize image patches in a manner similar to tokenizing text. This joint masked modeling improves both vision and language understanding simultaneously.
Image-Text Matching (ITM): A binary classification task where the model predicts whether a given image and text pair are correctly matched or not. This helps the model better understand paired multimodal data.
Unimodal Pre-Training Objectives
- Masked Image Modeling (MIM): Applied to unimodal image datasets, this task involves masking random patches of the image and reconstructing them, similar to BEiT. It helps the model improve vision-only performance.
- Masked Language Modeling (MLM): A standard pretraining objective where the model masks 15% of text tokens and predicts them, similar to BERT. This boosts the model’s language understanding capabilities when trained on large text corpora.
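As a concrete illustration of the MLM objective, the sketch below masks 15% of token positions; BERT's 80/10/10 mask/random/keep split is omitted for brevity, and the function is illustrative rather than FLAVA's data pipeline.

```python
# Sketch of BERT-style masked language modeling: 15% of positions are selected,
# their original ids kept as labels, and the inputs at those positions replaced
# with the [MASK] token id.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = input_ids.clone()
    masked = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                      # ignore unmasked positions in the loss
    input_ids = input_ids.clone()
    input_ids[masked] = mask_token_id
    return input_ids, labels

ids = torch.randint(5, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=4)
```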
Training Details
Multimodal Pre-Training
- Batch Size: 8192
- Learning Rate: Peak learning rate of 1e-3 with a warm-up of 10,000 steps
- Optimizer: AdamW
- Weight Decay: 0.1
- Training Steps: 150,000
Text Pre-Training
- Training Steps: 125,000
- Batch Size: 2048
- Learning Rate: 5e-4
Vision Pre-Training
- Method: DINO-based pretraining on unimodal vision data (e.g., ImageNet-1k)
2021-12: GLIP
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: Grounded Language-Image Pre-training
- CVPR 2022: Conference Paper
- GitHub: microsoft/GLIP
- Hugging Face: GLIPModel/GLIP
  - All weights are contained as .pth files in the Hugging Face “demo”.
Guiding Questions
What kind of paper is this?
- This is a research paper proposing a new model for grounded language-image pre-training (GLIP).
- It focuses on unifying object detection and phrase grounding and demonstrates the model’s applicability in object-level recognition tasks with strong transferability.
- It’s a small idea with a comparison paper, but presents a novel approach to multimodal learning.
What is the motivation and potential impact of the paper?
- The motivation is to overcome the limitation of current visual recognition models that predict from a fixed set of object categories, which reduces generalizability to new visual concepts.
- The impact is significant in both object detection and phrase grounding tasks, as the model allows for zero-shot and few-shot transfer and improves performance on rare categories.
What is the relevant related work and what is this paper’s contribution?
- Related work includes CLIP, ALIGN, and other models like MDETR and ViLD, which tackle image-level visual representation and multimodal tasks.
- The contribution of this paper is the development of GLIP, which unifies detection and grounding, leveraging both types of data to improve object recognition. It also scales effectively with image-text data, surpassing the state-of-the-art in many object detection tasks.
What are the results (Theory/Experiments)?
- Experiments:
- GLIP achieves strong zero-shot and few-shot performance on various benchmarks such as COCO and LVIS, surpassing many supervised baselines.
- It also transfers well to 13 downstream detection tasks with minimal data, demonstrating excellent data efficiency.
- In fine-tuning experiments, GLIP surpasses prior state-of-the-art models on COCO and LVIS, particularly in rare categories.
Details of Note
Dataset
The model is pre-trained on 27 million grounding examples, which include:
- 3 million human-annotated grounding data from datasets like Flickr30K, VG Caption, and GQA.
- 24 million web-crawled image-text pairs, where noun phrases are detected by an NLP parser and used to generate pseudo grounding boxes.
The data used for pre-training includes both detection and grounding data, which enriches the model’s ability to understand diverse and rare visual concepts.
For large-scale image-text data, the authors use 78.1 million phrase-box annotations, out of which 58.4 million unique noun phrases are present, greatly expanding the concept pool beyond traditional detection datasets.
Fine-tuning experiments were conducted on COCO and LVIS, and the model was also evaluated on 13 downstream object detection tasks to test its transferability in real-world applications.
Methodology Highlights
Unification of Object Detection and Phrase Grounding: GLIP unifies these tasks by reformulating object detection as a phrase grounding task, where object detection is treated as matching visual regions with text phrases.
Region and Token Encoding:
- Images are encoded as region patches using a visual encoder based on the DyHead module.
- Class names or phrases are tokenized and encoded using a BERT-style text encoder.
Cross-Modal Interaction:
- GLIP uses deep fusion between the visual and text modalities, allowing cross-attention between the two encoders earlier than typical late-fusion models. This interaction is crucial for learning language-aware visual representations.
Binary Classification Approach:
- The model computes region-token interactions as a binary classification problem, where true associations are treated as the positive class and all other combinations as negative (a minimal sketch follows below).
Dual Encoder Architecture:
- The architecture consists of two unimodal encoders: a DyHead visual encoder for region features and a BERT-style text encoder for phrases, with cross-attention layers enabling communication between the two modalities before the final layer.
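The sketch below illustrates the region-token binary classification described above: alignment logits are dot products between region features and token features, supervised with a binary target matrix. This is a simplification of GLIP's actual alignment loss.

```python
# Sketch of region-word alignment as binary classification.
import torch
import torch.nn.functional as F

def grounding_loss(region_feats: torch.Tensor, token_feats: torch.Tensor,
                   targets: torch.Tensor) -> torch.Tensor:
    """region_feats: (R, D); token_feats: (T, D); targets: (R, T) binary matrix."""
    logits = region_feats @ token_feats.t()          # (R, T) region-token alignment scores
    return F.binary_cross_entropy_with_logits(logits, targets)

regions = torch.randn(10, 256)
tokens = torch.randn(6, 256)
targets = torch.zeros(10, 6)
targets[0, 1] = 1.0                                  # e.g. region 0 grounds the phrase at token 1
loss = grounding_loss(regions, tokens, targets)
```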
Training Details
- Optimizer: The model is trained using the AdamW optimizer, which is standard for transformer-based architectures like GLIP’s text encoder (BERT-style).
2022-01: BLIP
Quick Notes
- Decoder-Only model
- BSD-3 License
- Weights available. Permissive license.
- Organization: Salesforce
Key Links
- Arxiv: [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086)
- PMLR 2022: Conference Paper
- GitHub: salesforce/BLIP
- All BLIP checkpoints are available on Hugging Face: BLIP models
Guiding Questions
What kind of paper is this?
- This is a vision-language pre-training (VLP) framework paper.
- It proposes a unified model for both understanding-based and generation-based tasks in vision-language domains, introducing a new architecture and data bootstrapping method.
What is the motivation and potential impact of the paper?
- The paper addresses the limitation that existing VLP models excel at either understanding or generation tasks, but not both.
- It tackles the issue of noisy web data commonly used for pre-training, proposing a better method to handle it.
- The potential impact includes state-of-the-art performance on a wide range of vision-language tasks, improving both understanding and generation capabilities while making models more generalizable to video-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Existing VLP models are either encoder-based or encoder-decoder-based, and they often use noisy web data for training.
- This paper contributes by proposing the Multimodal Mixture of Encoder-Decoder (MED) architecture, which handles both understanding and generation tasks.
- It introduces CapFilt, a new method for dataset bootstrapping by using a captioner to generate synthetic captions and a filter to remove noisy captions, improving the quality of the training data.
What are the results (Theory/Experiments)?
- Experiments show BLIP achieves state-of-the-art performance on tasks like image-text retrieval, image captioning, visual question answering, and video-language tasks.
- The results indicate substantial performance gains: +2.7% on image-text retrieval, +2.8% on captioning, and +1.6% on VQA, with strong generalization to video-language tasks in a zero-shot manner.
Details of Note
Dataset
- 14M image-text pairs from multiple sources: COCO, Visual Genome, Conceptual Captions (CC), CC12M, and SBU Captions.
- LAION dataset (115M images) used with very noisy web text, filtered through CapFilt.
- The dataset is bootstrapped by generating synthetic captions using a captioner and filtering noisy captions using a filter, both fine-tuned on high-quality human-annotated data (COCO).
Methodology Highlights
- Multimodal Mixture of Encoder-Decoder (MED): A multitask architecture with three modes—unimodal encoder, image-grounded text encoder, and image-grounded text decoder.
- CapFilt: A novel dataset augmentation strategy combining a captioner
for generating synthetic captions and a filter for removing noisy
captions. Both initialized from pre-trained weights and fine-tuned
individually.
- Captioner: Generates one synthetic caption per image using nucleus sampling for more diverse captions.
- Filter: Removes noisy image-text pairs using contrastive pairing and matching objectives.
- Pre-training objectives:
- Image-text contrastive (ITC) loss for vision-language alignment.
- Image-text matching (ITM) loss for fine-grained multimodal representation.
- Language modeling (LM) loss for image-grounded text generation.
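As a rough illustration of how the three objectives above might be combined per training batch, the hedged sketch below uses generic tensors and placeholder names; it is not the Salesforce implementation, and details such as momentum distillation and hard-negative mining are omitted.

```python
# Hedged sketch of how BLIP's three pre-training losses might be combined
# (function and tensor names are placeholders, not the released code).
import torch
import torch.nn.functional as F

def blip_pretraining_loss(image_emb, text_emb, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    # ITC: symmetric InfoNCE over in-batch image-text pairs.
    sim = image_emb @ text_emb.t() / temperature                  # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # ITM: binary match/no-match classification on image-text pairs.
    itm = F.cross_entropy(itm_logits, itm_labels)                 # (N, 2) vs (N,)
    # LM: autoregressive caption loss from the image-grounded text decoder.
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(),
                         ignore_index=-100)
    return itc + itm + lm

# Toy usage with random tensors.
B, L, V = 4, 12, 30522
loss = blip_pretraining_loss(
    F.normalize(torch.randn(B, 256), dim=-1), F.normalize(torch.randn(B, 256), dim=-1),
    torch.randn(3 * B, 2), torch.randint(0, 2, (3 * B,)),
    torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```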
Training Details
- ViT-B/16 and ViT-L/16 architectures used for image encoding, trained on 224x224 resolution during pre-training, 384x384 during fine-tuning.
- Training conducted on 2 nodes, each with 16 GPUs.
- AdamW optimizer, trained for 20 epochs.
- Batch sizes: 2880 (ViT-B) and 2400 (ViT-L).
- Learning rate: linear warm-up, decayed by 0.85 with peak rates of 3e-4 (ViT-B) and 2e-4 (ViT-L).
- Weight decay: 0.05.
2022_04_flamingo
2022-04: Flamingo
Quick Notes
- Decoder-Only model
- No weights available. No permissive license. Proprietary.
- Organization: Google (DeepMind)
Key Links
- Arxiv: Flamingo: a Visual Language Model for Few-Shot Learning
- NeurIPS 2022: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a small-idea-with-evaluation paper, proposing a new Visual Language Model (VLM), Flamingo, for few-shot learning.
- It introduces architectural innovations to integrate pre-trained vision and language models and evaluates performance across several benchmarks.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge of adapting models to new vision-language tasks with minimal annotated examples.
- The problem is significant because existing approaches require substantial task-specific fine-tuning, which is time-consuming and data-intensive.
- By solving this, the model enables rapid adaptation to new tasks in data-scarce regimes, making it impactful for multimodal AI applications.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on the recent advances in multimodal models and large language models (LLMs) like GPT-3 for few-shot learning.
- Existing vision-language models focus on classification or retrieval tasks, but Flamingo extends this capability to open-ended tasks like captioning and visual question-answering.
- The main contribution is a novel architecture that combines vision and language models with minimal task-specific training, outperforming existing methods with fewer examples.
What are the results (Theory/Experiments)?
- Experiments: Flamingo is evaluated on 16 multimodal tasks, including COCO, VQAv2, and VATEX, outperforming the fine-tuned state of the art in 6 tasks using only a few examples.
- The model is trained on large-scale multimodal data using a mixture of image-text pairs, interleaved text and image sequences, and video-text pairs. The architecture leverages pretrained vision and language models with newly added cross-attention layers.
- Results show that Flamingo achieves strong performance in both few-shot and fine-tuned settings, scaling well with model size and number of shots.
Details of Note
Dataset
- M3W:
- Extracted text and images from 43M webpages (closed-source).
- Mixture of text-image pairs with interleaved HTML content.
- ALIGN:
- Large-scale dataset of 1.8B images paired with alt-text descriptions.
- Noisy due to poor alignment between images and text.
- LTIP (Long Text & Image Pairs):
- 312M image-text pairs (closed-source), with more descriptive captions than ALIGN.
- Average 20.5 tokens per caption.
- VTP (Video & Text Pairs):
- 27M short video-text pairs (closed-source).
- Average video length: 22 seconds.
Methodology Highlights
- Vision Encoder:
- Uses NFNet-F6 model (Normalizer-free ResNet).
- Perceiver Resampler:
- Summarizes variable-length visual input into 64 output tokens.
- Flexible input size; supports sampled video frames.
- Multi-layer architecture with 6 layers, 1536 internal dimension, 16 attention heads, and Squared ReLU activation.
- 194M parameters in total.
- Text Decoder:
- Frozen decoder with gated cross-attention mechanism.
- Gating uses a tanh of a learned scalar, initialized to zero so that cross-attention initially contributes nothing, which stabilizes training (see the sketch after this list).
- Each text token cross-attends only to the visual tokens of the most recent preceding image in the interleaved sequence.
- Model Sizes:
- Flamingo-3B: 1.4B frozen LM + 1.2B learnable params.
- Flamingo-9B: 7B frozen LM + 1.6B learnable params.
- Flamingo-80B: 70B frozen LM + 10B learnable params.
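The sketch below illustrates the tanh-gated cross-attention idea referenced above: learned scalar gates start at zero, so the frozen language model's outputs are initially unchanged. Module names and dimensions are ours, and this is a single simplified layer, not DeepMind's implementation.

```python
# Sketch of Flamingo-style tanh gating around a cross-attention block
# (simplified single-layer illustration with our own module names).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned scalar gates, initialized to 0 so tanh(gate) = 0 and the
        # frozen LM's behavior is unchanged at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ff_gate) * self.ff(x)

layer = GatedCrossAttention(dim=512)
out = layer(torch.randn(2, 16, 512), torch.randn(2, 64, 512))  # (text, visual)
```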
Training Details
- Optimizer:
- AdamW with gradient norm clipping (1.0) and 0.1 weight decay (none on Perceiver Resampler).
- Learning Rate:
- Peak learning rate of 1e-4, with 5,000 warmup steps.
- Training Duration:
- Trained for 500k steps.
- Hardware:
- Flamingo-80B model trained on 1536 TPUv4 chips for 15 days.
2022_05_coca
2022-05: CoCa
Quick Notes
- Encoder-Decoder model.
- No License. Proprietary and closed-source.
- Organization: Google.
Key Links
Guiding Questions
What kind of paper is this?
- This is a model proposal and evaluation paper.
- It proposes a new image-text foundation model called Contrastive Captioner (CoCa), unifying contrastive learning and image captioning.
What is the motivation and potential impact of the paper?
- The paper aims to improve image-text foundation models by merging contrastive learning and captioning, building on methods like CLIP and SimVLM.
- The impact lies in creating a single, unified model that performs well across multiple tasks (e.g., image classification, image captioning, crossmodal retrieval) with zero-shot transfer or minimal task-specific adaptation.
What is the relevant related work and what is this paper’s contribution?
- Related works include models like CLIP, ALIGN, and SimVLM, which focus on contrastive learning and generative tasks, respectively.
- Prior methods either excelled at crossmodal alignment (contrastive) or multimodal understanding (captioning), but none effectively combined both.
- CoCa contributes by combining contrastive and captioning losses in a minimalist encoder-decoder architecture that subsumes capabilities from both contrastive models (e.g., CLIP) and captioning models (e.g., SimVLM).
What are the results (Theory/Experiments)?
- Experiments: CoCa achieves state-of-the-art results in a variety of
tasks:
- Zero-shot ImageNet classification (86.3% accuracy).
- Crossmodal retrieval on MSCOCO and Flickr30k.
- Multimodal understanding (e.g., VQA, SNLI-VE) and image captioning (e.g., MSCOCO, NoCaps).
Details of Note
Dataset
- JFT-3B: Internal, closed dataset used for image classification pretraining.
- ALIGN: Noisy image-text dataset used for contrastive pretraining.
- Strict deduplication applied to filter near-domain examples for fair evaluation.
Methodology Highlights
- Architecture: Encoder-decoder model with an image encoder and text
decoder.
- Unimodal and multimodal decoders: The lower decoder layers omit cross-attention to produce unimodal text representations; the upper layers cross-attend to the image encoder to produce multimodal image-text representations.
- Attentional pooling layer: Learned atop the vision encoder for task-specific pooling.
- Unified approach: Combines contrastive and captioning losses in a single architecture.
- Model sizes: 383M, 787M, 2.1B parameters across different model variants.
- Pretrained at 288x288 image resolution; additional 1 epoch at 576x576.
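A back-of-the-envelope sketch of the unified objective described above, using the 2:1 captioning-to-contrastive weighting reported in the training details below; tensor shapes, the temperature, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative combination of CoCa's two objectives (our placeholder helpers).
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_labels,
              w_caption=2.0, w_contrastive=1.0, temperature=0.07):
    # Contrastive loss between pooled image embeddings and unimodal text embeddings.
    sim = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    con = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # Captioning loss from the multimodal decoder (teacher-forced next-token prediction).
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_labels.flatten())
    return w_caption * cap + w_contrastive * con

B, L, V = 4, 10, 32000
loss = coca_loss(F.normalize(torch.randn(B, 512), dim=-1),
                 F.normalize(torch.randn(B, 512), dim=-1),
                 torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```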
Training Details
- Batch size: 65,536 image-text pairs (half from JFT, half from ALIGN).
- Training steps: 500k steps (5 epochs on JFT, 10 epochs on ALIGN).
- Loss weighting: Captioning loss weighted twice as much as contrastive loss.
- Optimizer: Adafactor with weight decay set to 0.01.
- Learning rate schedule: Linear warmup for the first 2% of steps to a max LR of 8e-4, then linearly decayed.
- Hardware: Training took 5 days on 2,048 TPUv4 chips.
2022_05_mplug
2022-05: mPLUG
Quick Notes
- Encoder-Decoder model.
- BSD-3 License.
- Weights available. Permissive license.
- Organization: Alibaba.
Key Links
- Arxiv: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
- EMNLP 2022: Conference Paper
- GitHub: alibaba/AliceMind
- Checkpoints available in the GitHub repository, not on Hugging Face.
Guiding Questions
What kind of paper is this?
- This is a vision-language pre-training paper.
- It introduces mPLUG, a novel cross-modal model for vision-language understanding and generation.
- The paper focuses on improving efficiency and addressing information asymmetry in cross-modal fusion by introducing cross-modal skip-connections.
What is the motivation and potential impact of the paper?
- The paper addresses inefficiencies in existing vision-language models and the problem of linguistic signal vanishing when processing long visual sequences.
- The proposed mPLUG architecture aims to enhance the efficiency of vision-language models while improving cross-modal alignment.
- The impact is significant as mPLUG demonstrates state-of-the-art results on several vision-language tasks, with strong zero-shot transferability to video-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes models like CLIP, ALBEF, BLIP, and SimVLM, which have addressed vision-language learning but often suffer from inefficiencies or information loss.
- The paper contributes by introducing cross-modal skip-connections, which balance the fusion of visual and linguistic representations and improve computational efficiency.
- It also introduces a new mechanism for asymmetric co-attention, enhancing linguistic signal processing while reducing visual redundancy.
What are the results (Theory/Experiments)?
- Theory: mPLUG uses a combination of asymmetric co-attention and skip-connections to address cross-modal fusion challenges.
- Experiments: The model achieves state-of-the-art performance on tasks like image captioning, visual question answering, and image-text retrieval.
- On benchmarks such as COCO Caption, Flickr30K, and VQA, mPLUG shows significant improvements in both accuracy and speed, outperforming models trained on much larger datasets.
Details of Note
Dataset
- The model is pre-trained on 14 million image-text pairs from several
datasets:
- COCO, Visual Genome, SBU, CC3M, and CC12M.
- These datasets include a mix of in-domain (COCO, Visual Genome) and web out-domain (CC3M, CC12M, SBU) images, providing a diverse set of image-text pairs.
- Evaluation benchmarks include:
- VQA, Visual Grounding (RefCOCO), Visual Reasoning (NLVR2, SNLI-VE), and Image Captioning (COCO, NoCaps).
Methodology Highlights
- Base architecture: Consists of unimodal encoders for vision and text, followed by a text decoder for generation tasks.
- Fusion mechanism: Introduces an asymmetric cross-attention mechanism
where the vision modality updates the text representation in S asymmetric
blocks before performing full cross-attention fusion.
- This method reduces computational complexity by limiting expensive full cross-attention until after initial asymmetric attention.
- Cross-modal skip connections: These connections skip layers of heavy attention computation, further improving efficiency by focusing fusion at different abstraction levels (see the sketch after this list).
- Transformer-based encoders: Uses 6 layers for the text encoder and the cross-modal skip network, and 12 layers for the text-generation decoder.
- Pre-training objectives:
- Image-Text Contrastive Learning (ITC) to align visual and text features.
- Image-Text Matching (ITM) as a binary classification task to predict if the image and text match.
- Masked Language Modeling (MLM), following a BERT-style objective.
- Prefix Language Modeling (PrefixLM) to predict text sequences based on the context from the image and previous text.
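The sketch below is a structural illustration of the asymmetric co-attention plus skip-connection idea referenced above: several cheap layers let text attend to (fixed) visual features before a single full fusion step over the concatenated sequence. Module names and the exact residual wiring are our assumptions, not the released AliceMind code.

```python
# Rough sketch of asymmetric co-attention followed by one connected fusion
# step with a skip connection (structural illustration only).
import torch
import torch.nn as nn

class SkipConnectedFusion(nn.Module):
    def __init__(self, dim, num_heads=8, s_asym_layers=3):
        super().__init__()
        # S cheap asymmetric layers: text queries attend to (fixed) vision keys/values.
        self.asym = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(s_asym_layers))
        # One expensive connected layer: self-attention over [vision; text].
        self.connected = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        original_text = text_tokens
        for layer in self.asym:
            attn_out, _ = layer(text_tokens, vision_tokens, vision_tokens)
            text_tokens = text_tokens + attn_out
        # Skip connection re-injects the pre-fusion text representation.
        fused_in = torch.cat([vision_tokens, original_text + text_tokens], dim=1)
        fused, _ = self.connected(fused_in, fused_in, fused_in)
        return fused

fusion = SkipConnectedFusion(dim=256)
out = fusion(torch.randn(2, 49, 256), torch.randn(2, 16, 256))
```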
Training Details
- Training settings:
- Pre-trained on 16 NVIDIA A100 GPUs using BF16 precision.
- Trained for 30 epochs with a batch size of 1024.
- Input resolution: Model pre-trained with image resolutions of 224x224 and 256x256; for localization, outputs are generated as four pixel-coordinate tokens.
- Optimizer: Uses AdamW optimizer with a learning rate warm-up and cosine decay.
- Benchmarks: Evaluated on several tasks, including VQA, Visual Grounding (RefCOCO), Visual Reasoning (NLVR2, SNLI-VE), and Image Captioning (COCO, NoCaps).
2022_07_glipv2
2022-07: GLIPv2
Quick Notes
- Encoder-only model
- Weights are supposed to be available, but do not appear in the official repo
- Organization: Microsoft
Key Links
- Arxiv: GLIPv2: Unifying Localization and Vision-Language Understanding
- NeurIPS 2022: GLIPv2
- GitHub: microsoft/GLIP
Guiding Questions
What kind of paper are you reading?
- This paper introduces GLIPv2, a strong extension of the original GLIP model that unifies localization and vision-language understanding.
- It’s a straightforward evaluation paper that focuses on how the model performs across a variety of tasks, including object detection and phrase grounding, and on showing that grounding-based pre-training can improve performance on general vision-language tasks.
What is the motivation and potential impact of the paper?
- The motivation is to improve vision-language models by incorporating grounding-based pre-training, which can enhance the model’s ability to understand visual concepts and their textual descriptions.
- GLIP already showed promise in object detection and phrase grounding tasks, and GLIPv2 aims to extend these capabilities to broader vision-language tasks, potentially improving performance across various multimodal applications.
What is the relevant related work and what is this paper’s contribution?
- As an extension of GLIP, GLIPv2 builds on the idea of grounding-based pre-training to enhance vision-language understanding.
- All other encoder-based models like CLIP, ALIGN, and MDETR are relevant, but GLIPv2’s focus on grounding and localization makes it unique in its approach to vision-language tasks.
What are the results (Theory/Experiments)?
- Models are evaluated on COCO object detection, LVIS, and other downstream tasks. They’re also tested on vision-language tasks like VQA.
- Results:
- GLIPv2 shows improved performance over the original GLIP model in object detection and phrase grounding tasks.
- The model also demonstrates strong performance on vision-language tasks like VQA, showcasing the benefits of grounding-based pre-training in enhancing multimodal understanding.
Details of Note
Dataset
- Pre-training data
- Detection: Objects365, COCO, LVIS, OpenImages, VG, ImageNetBoxes
- Grounding: GoldG
- Caption: Cap4M, CC15M+SBU
Methodology Highlights
- Compared with GLIP:
- GLIP showed grounded pre-training improves localization. GLIPv2 shows that grounded pre-training can improve general vision-language tasks.
- GLIPv2 adds a new loss term: an inter-image region-word contrastive loss, which helps the model learn better visual-semantic alignments. The authors argue this can be viewed as a region-word-level analogue of CLIP's image-text contrastive loss.
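A hedged sketch of what such an inter-image region-word contrastive loss can look like: region features gathered across the whole batch are contrasted against word features from every caption in the batch, so negatives include words from other images. Shapes, the temperature, and the masking convention are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of an inter-image region-word contrastive loss (illustrative only).
import torch
import torch.nn.functional as F

def inter_image_region_word_loss(region_feats, word_feats, positive_mask,
                                 temperature=0.07):
    """region_feats:  (num_regions_in_batch, d) gathered across all images.
    word_feats:    (num_words_in_batch, d)   gathered across all captions.
    positive_mask: (num_regions, num_words) bool, True where a region is
                   grounded to a word from its own image's caption."""
    logits = region_feats @ word_feats.t() / temperature
    # For each region, softmax over every word in the batch; negatives include
    # words that belong to *other* images' captions.
    log_prob = F.log_softmax(logits, dim=-1)
    pos_log_prob = (log_prob * positive_mask).sum(-1) / positive_mask.sum(-1).clamp(min=1)
    return -pos_log_prob.mean()

loss = inter_image_region_word_loss(torch.randn(10, 256), torch.randn(20, 256),
                                    torch.rand(10, 20) > 0.8)
```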
Training Details
- GLIPv2-Tiny
- Based on Swin-T
- 32 GPUs (type not specified) w/ batch size of 64
- Different learning rates for language backbone (1e-5) and vision backbone (1e-4)
- Learning rate reduced 0.1 at fixed points
- 330,000 steps total
- Max token length of 256
- Trained for an additional 300,000 steps with MLM turned off
- GLIPv2-Base
- Based on Swin-B
- 64 GPUs (type not specified) w/ batch size of 64
- All params use 1e-4 learning rate
- Same learning rate schedule as Tiny
- 1 million steps total
- Max token length of 256
- Trained for an additional 500,000 steps with MLM turned off
- GLIPv2-Huge
- Based on CoSwin-Huge
- 64 GPUs (type not specified) w/ batch size of 64
- Mostly the same as Base, but there is no additional stage with MLM turned off
2022_08_beit3
2022-08: BEiT-3
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
- CVPR 2023: Conference Paper
- GitHub: microsoft/unilm
- Checkpoints: (Contained as links in the GitHub repository; no official checkpoint on Hugging Face)
Guiding Questions
What kind of paper is this?
- This is a methodological and empirical paper.
- It introduces BEIT-3, a general-purpose multimodal foundation model and evaluates it across multiple vision and vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper aims to solve the problem of unified multimodal modeling by treating images as a foreign language (termed “Imglish”).
- The motivation is to simplify pretraining objectives and improve scalability for multimodal models.
- The potential impact lies in improving the generalization and transferability of multimodal models across a variety of vision and vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on the masked data modeling paradigm seen in BERT for language and BEIT for images.
- Related work includes CLIP, SimVLM, CoCa, and others that focus on image-text contrastive learning or multi-task pretraining objectives.
- The main contribution is introducing BEIT-3, a model that scales to billions of parameters, uses a simplified pretraining task (masked data modeling), and shows strong performance across tasks using only public datasets.
What are the results (Theory/Experiments)?
- Experiments: BEIT-3 achieves state-of-the-art results on several benchmarks, including COCO, ImageNet, ADE20K, and VQA, outperforming models like CoCa and Florence.
- Theory: The model demonstrates that treating images as a “language” enables more efficient and effective masked modeling, with significant performance gains over previous models.
- Results emphasize the scalability and versatility of BEIT-3 across vision and vision-language tasks, using a smaller batch size than previous models.
Details of Note
Dataset
Multimodal Data: BEIT-3 is pretrained using approximately 15 million images and 21 million image-text pairs from public datasets, including:
- CC12M (Conceptual Captions 12M)
- CC3M (Conceptual Captions 3M)
- SBU Captions
- COCO (Common Objects in Context)
- Visual Genome
Vision Data: The model is also pretrained on ImageNet-21K, which contains 14 million images for vision tasks.
Text Data: For the text-only pretraining, BEIT-3 uses a 160GB corpus consisting of:
- English Wikipedia
- BookCorpus
- OpenWebText
- CC-News
- Stories
These datasets are all publicly accessible, and no private datasets were used for pretraining, ensuring academic reproducibility. Despite this, BEIT-3 outperforms other models that utilize in-house data.
Methodology Highlights
Architecture:
- BEIT-3 uses a Multiway Transformer architecture, which includes:
- A shared self-attention module across modalities (vision, language, and vision-language).
- A pool of feed-forward network (FFN) experts for each modality (vision, language, and vision-language), allowing the model to capture modality-specific information.
- The giant model has 1.9 billion parameters, with separate parameter allocations for vision, language, and vision-language experts.
- BEIT-3 uses a Multiway Transformer architecture, which includes:
Pre-training:
- The model is pre-trained using masked data modeling objectives:
- Masked Language Modeling (MLM): Similar to BERT, it masks 15% of text tokens and learns to recover the original tokens.
- Masked Image Modeling (MIM): Masks 40% of image patches and reconstructs discrete visual tokens using a pre-trained image tokenizer (VQ-KD distilled from CLIP).
- Masked Vision-Language Modeling (MVLM): Masks tokens or patches in image-text pairs, learning to recover them based on the context of both modalities.
- The use of masked objectives (MLM, MIM, MVLM) allows BEIT-3 to be pre-trained with smaller batch sizes compared to contrastive-based methods (e.g., CLIP, CoCa) that require large batches to contrast pairs.
- The simplicity of the mask-then-predict task enables easier implementation of gradient accumulation, unlike contrastive methods.
Training Efficiency:
- The model uses a 1024 sequence length during pre-training.
- Smaller batch sizes make BEIT-3 more efficient to scale, while still providing strong generalization for downstream tasks.
Training Details
Training Steps: BEIT-3 is pre-trained for 1 million steps.
Batch Size: The total batch size is 6144 samples, composed of:
- 2048 image samples,
- 2048 text samples,
- 2048 image-text pairs.
Optimizer: The model uses the AdamW optimizer with the following hyperparameters:
- Peak learning rate: 1e-3,
- Warmup: Linear warmup for the first 10,000 steps,
- Learning rate schedule: Cosine decay after warmup.
Weight Decay: A weight decay of 0.05 is applied.
Training Time & Hardware: The pretraining process takes approximately 2 weeks and is conducted on 256 A100 40GB GPUs.
2022_09_pali
2022-09: PaLI
Quick Notes
- Encoder-Decoder model.
- No License. Proprietary and closed-source.
- Organization: Google.
Key Links
- Arxiv: PaLI: A Jointly-Scaled Multilingual Language-Image Model
- ICLR 2023: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a large-scale model introduction and performance evaluation paper.
- It introduces the PaLI model, which jointly scales language and vision components to perform multimodal tasks across multiple languages.
What is the motivation and potential impact of the paper?
- The paper seeks to address the challenge of joint vision-language modeling at scale, using both modalities more effectively.
- It emphasizes the importance of balancing scaling between vision and language models, claiming this approach improves accuracy across tasks like captioning and visual question-answering.
- The potential impact is providing a scalable, multilingual, multimodal model that advances the state-of-the-art in vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related works include models like SimVLM, CoCa, Flamingo, and Vision Transformers (ViTs), which focus on vision-language tasks but often prioritize language scaling over vision scaling.
- The paper contributes by:
- Introducing PaLI, which balances parameter scaling between the vision and language components.
- Training a massive multilingual dataset (WebLI), which supports over 100 languages, and achieves SOTA performance in various vision-language tasks.
What are the results (Theory/Experiments)?
- Theory: The paper demonstrates that scaling vision models (ViT-e) provides substantial performance gains when compared to smaller vision backbones, especially in joint tasks.
- Experiments:
- PaLI outperforms previous SOTA models in benchmarks such as COCO Captioning, VQAv2, and OKVQA.
- The 17B-parameter version achieves better results than models with significantly skewed parameter distributions between vision and language components.
Details of Note
Dataset
- WebLI dataset: 10 billion images, tens of billions of image-text pairs.
- Multilingual: Supports over 100 languages.
- Image-text pairs: Filtered for quality using cross-modal similarity scoring, top 10% retained (~1 billion examples).
- Closed-source: Built using internal tools, limited details on construction.
- OCR data: Includes 29 billion image-OCR pairs.
Methodology Highlights
- Pre-training tasks: 8 diverse tasks designed to enhance multimodal
capabilities.
- Span corruption on text-only data (similar to masked language modeling).
- Split-captioning task where captions are split and completed by the model.
- CC3M captioning in 35 languages.
- OCR task using WebLI OCR-text data.
- English and multilingual VQA and VQG.
- English-only object-aware VQA.
- Object detection task similar to pix2seq.
- Architecture:
- Pre-trained vision (ViT-G, ViT-e) and language (mT5-L, mT5-XXL) models.
- Standard ViT encoder feeding into a fusion encoder with a text decoder.
- ViT weights are frozen during training, updating only the language model.
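A tiny sketch of the freezing scheme above, assuming generic nn.Module stand-ins for the (unreleased) ViT and mT5 components: the vision tower's parameters are frozen and only language-model parameters are passed to the optimizer.

```python
# Hedged sketch of PaLI's freezing setup with generic placeholder modules.
import torch.nn as nn

def configure_pali_style_training(vision_encoder: nn.Module, language_model: nn.Module):
    for p in vision_encoder.parameters():
        p.requires_grad = False          # ViT weights stay fixed during pre-training
    # Only language-model parameters would be handed to the optimizer.
    return [p for p in language_model.parameters() if p.requires_grad]

trainable = configure_pali_style_training(nn.Linear(8, 8), nn.Linear(8, 8))
```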
Training Details
- Hardware: Largest model trained on 1024 TPUv4 chips for 7 days.
- High-resolution phase: An additional 3 days on 512 TPUv4 chips.
- Optimizer: Adafactor, with a peak learning rate of 5e-3.
- Data size: Pre-training uses 260TB of data.
- Training objective: Standard softmax cross-entropy with teacher forcing.
2022_10_pix2struct
2022-10: Pix2Struct
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Google (DeepMind)
Key Links
- Arxiv: Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
- PMLR 2023: Conference Paper
- GitHub: google-research/pix2struct
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper is this?
- This is a pretrained model proposal paper.
- It introduces Pix2Struct, a pretrained image-to-text model designed for visually-situated language understanding.
- The paper focuses on a general-purpose model capable of handling multiple tasks across various domains.
What is the motivation and potential impact of the paper?
- The paper addresses the need for a unified approach to visually-situated language understanding, where text and visuals are closely intertwined, such as in web pages, documents, and UIs.
- The problem is important because previous work has been fragmented, relying on domain-specific pipelines with limited data/model sharing.
- The potential impact includes enabling more general-purpose models capable of handling diverse visual language tasks without relying on domain-specific pipelines.
What is the relevant related work and what is this paper’s contribution?
- Related work includes models like Donut (OCR-focused for document understanding), VisionTaPas, and GIT2 for image captioning and visual language tasks.
- Existing approaches often rely on task-specific engineering or external systems like OCR, which limits adaptability.
- Pix2Struct’s main contribution is screenshot parsing as a pretraining objective, which subsumes pretraining signals like OCR and image captioning into a more general framework. It integrates visual and language inputs holistically, leading to improved performance in low-resource domains like UIs and illustrations.
What are the results (Theory/Experiments)?
- Experiments: Pix2Struct outperforms previous visual-only models (like Donut) in eight out of nine tasks across four domains: documents, illustrations, UIs, and natural images.
- It achieves state-of-the-art results in six out of the nine benchmarks, excelling in low-resource tasks such as UI and illustration understanding.
- The model lags in tasks with high-resource domains like document understanding, where OCR pipelines still dominate, but shows promise for future scaling.
Details of Note
Dataset
- 80M screenshots collected from URLs in the C4 dataset.
- Screenshots taken at 1024x1024 resolution.
- Pretraining includes a reading curriculum stage using BooksCorpus, where text is rendered as images with random fonts and colors.
Methodology Highlights
- Screenshot parsing pretraining: Condenses HTML DOM tree, retaining only visible elements for modeling in output.
- Masked span-based modeling: 50% of the text is masked during pretraining for better context modeling.
- Flexible aspect-ratio ViT: Scales input images to extract the maximal number of fixed-size patches, enabling the model to handle extreme aspect ratios and variable sequence lengths.
- ViT-encoder + text-based decoder architecture: Integrates vision and language by rendering language prompts (e.g., VQA questions) directly onto the input image, so no separate text-input pathway is needed on the encoder side.
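To illustrate the flexible aspect-ratio input described above, the sketch below rescales an image, preserving its aspect ratio, so that as many fixed-size patches as possible fit within a sequence-length budget; the arithmetic is a simplified approximation, not the google-research/pix2struct preprocessing code.

```python
# Sketch of aspect-ratio-preserving rescaling to fill a patch budget.
import math

def target_size(height, width, patch_size=16, max_patches=2048):
    # Largest scale such that (h/p) * (w/p) stays within max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows * patch_size, cols * patch_size  # resize target in pixels

print(target_size(1024, 512))  # a tall screenshot keeps its 2:1 aspect ratio
```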
Training Details
- Pretraining (Reading stage): 30K steps, 128 patch input length.
- Pretraining (Screenshot stage):
- Base: 270K steps, 2048 batch size, 64 TPUs.
- Large: 170K steps, 1024 batch size, 128 TPUs.
- Optimizer: Adafactor with linear warmup (1K steps) to 0.01 learning rate, followed by cosine decay.
- Input sequence length: 2048 patches; decoder sequence limited to 128 tokens, targets limited to 1024 characters.
2022_12_udop
2022-12: UDOP
Quick Notes
- Encoder-Decoder model
- MIT License
- Weights available. Permissive license.
- Organization: Microsoft
Key Links
- Arxiv: Unifying Vision, Text, and Layout for Universal Document Processing
- CVPR 2023: Conference Paper
- GitHub: microsoft/i-Code
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper is this?
- This is a foundational model paper in the domain of Document AI.
- It introduces a universal framework for multimodal document processing, unifying text, vision, and layout modalities.
- It solves multiple tasks with a single model architecture using generative pretraining for tasks like document classification, layout analysis, and question answering.
What is the motivation and potential impact of the paper?
- The motivation is to address the challenges of processing documents with multimodal content (text, images, layout) efficiently.
- It aims to unify multiple tasks and modalities under a single model to improve document processing tasks across diverse domains.
- The potential impact is significant, providing a foundation model for document AI that could revolutionize tasks like document understanding, classification, and customization.
What is the relevant related work and what is this paper’s contribution?
- Relevant work includes vision-language models and document AI models like LayoutLM and TILT, which integrate vision and text but treat modalities separately or with limited fusion.
- The contribution is a unified framework (Vision-Text-Layout Transformer) that effectively fuses vision, text, and layout modalities into a single representation for all tasks.
- It introduces novel self-supervised and supervised pretraining objectives, showing superior performance on several document AI tasks.
What are the results (Theory/Experiments)?
- Theory: The paper presents a novel architecture (VTL Transformer) and joint pretraining objectives, introducing layout-induced vision-text embeddings.
- Experiments: It evaluates the model on several benchmarks like FUNSD, CORD, and DocVQA, achieving state-of-the-art results on 8 Document AI tasks.
- It also demonstrates unique capabilities like high-quality document generation and editing, which have not been achieved in prior Document AI models.
Details of Note
Dataset
- IIT-CDIP: A large-scale dataset containing 11 million scanned documents with token-level OCR bounding boxes.
- Supervised datasets used include: FUNSD (entity recognition), CORD (key information extraction), RVL-CDIP (document classification), and DUE-benchmark, which comprises 7 datasets such as DocVQA, InfoVQA, KLC, PWC, DeepForm, WTQ, and TabFact.
Methodology Highlights
- Layout-Induced Vision-Text Embeddings: Combines image patches with word tokens using bounding box information to enforce interaction between text and visual modalities; image patches without text are concatenated to the sequence (see the sketch after this list).
- Position Bias: Applies 2D relative attention based on bounding boxes for text; no 1D sequence bias is used.
- Vision-Text-Layout (VTL) Transformer: Unified model architecture with a text-layout decoder (producing tokens with discrete layout) and a vision decoder (masked autoencoder for pixel generation).
- Pretraining Objectives:
- Joint Text-Layout Reconstruction: Mask part of the text and regenerate both the text content and layout.
- Layout Modeling: Predict the location of groups of text tokens.
- Visual Text Recognition: Predict text present at a given location.
- Masked Image Reconstruction: Reconstruct pixel values from text and layout with cross-attention to raw character embeddings.
- Supervised Objectives: Include document classification, layout analysis, information extraction, question answering, and document NLI.
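As a rough illustration of layout-induced vision-text embeddings (referenced in the first bullet above), the sketch below adds each OCR token embedding to the embedding of the image patch whose grid cell contains the token's bounding-box center. Grid size, tensor shapes, and the simple additive fusion are our assumptions, not the released UDOP code.

```python
# Illustrative layout-induced fusion of OCR tokens with image patches.
import torch

def fuse_text_into_patches(patch_emb, token_emb, token_boxes, grid=14):
    """patch_emb:  (grid*grid, d) image patch embeddings.
    token_emb:   (T, d) OCR word-token embeddings.
    token_boxes: (T, 4) normalized [x0, y0, x1, y1] boxes in [0, 1]."""
    centers = (token_boxes[:, :2] + token_boxes[:, 2:]) / 2          # (T, 2)
    col = (centers[:, 0] * grid).long().clamp(0, grid - 1)
    row = (centers[:, 1] * grid).long().clamp(0, grid - 1)
    patch_idx = row * grid + col                                      # (T,)
    fused = token_emb + patch_emb[patch_idx]   # joint vision-text embedding per token
    return fused, patch_idx

fused, idx = fuse_text_into_patches(torch.randn(196, 768), torch.randn(5, 768),
                                    torch.rand(5, 4))
```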
Training Details
- Optimizer: Adam, peak learning rate of 5e-5 with 1000 warmup steps, weight decay of 1e-2 (possibly AdamW).
- Batch Size: 512.
- Epochs: 3 epochs total, one for each resolution stage (224, 512, 1024).
- Curriculum Learning: Training starts at 224 resolution and scales up to 1024 resolution over epochs.
2023_01_blip2
2023-01: BLIP-2
Quick Notes
- Encoder-Decoder or Decoder-Only model; the framework is agnostic to the LLM type.
- BSD-3 License
- Weights available. Permissive license.
- Organization: Salesforce
Key Links
- Arxiv: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- PMLR 2023: Conference Paper
- GitHub: salesforce/LAVIS
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper are you reading?
- This is a vision-language pre-training paper.
- It proposes a novel method, BLIP-2, for vision-language pre-training by using frozen image encoders and frozen large language models (LLMs), bridging the modality gap with a lightweight Querying Transformer (Q-Former).
What is the motivation and potential impact of the paper?
- The paper addresses the high computation cost of vision-language pre-training by leveraging frozen pre-trained models instead of full end-to-end training.
- The problem is important because scaling large multimodal models has become costly, and BLIP-2 provides a more efficient strategy.
- The potential impact is that BLIP-2 can achieve state-of-the-art performance on various vision-language tasks while being more compute-efficient than prior methods.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on work in vision-language models such as Flamingo, which combine image encoders and large language models, but BLIP-2 differentiates itself by using frozen unimodal models.
- Existing approaches (e.g., Flamingo) rely on an image-to-text generation loss but face modality alignment challenges.
- BLIP-2 contributes a two-stage pre-training strategy with a Querying Transformer that efficiently extracts visual features for the LLM without requiring end-to-end fine-tuning of the models.
What are the results (Theory/Experiments)?
- Experiments: BLIP-2 achieves state-of-the-art performance on tasks like Visual Question Answering (VQAv2), image captioning, and image-text retrieval, outperforming models like Flamingo80B with fewer trainable parameters.
- It excels in zero-shot tasks such as image-to-text generation, demonstrating visual conversation and instruction following capabilities.
- The method is shown to be both efficient (using frozen components) and powerful, significantly reducing the number of parameters while maintaining high performance.
Details of Note
Dataset
- Pre-training uses 129M images from several sources: COCO, Visual Genome, CC3M, CC12M, SBU, and LAION400M.
- Synthetic captions are generated with the CapFilt method: 10 captions are produced per image using the BLIP large model.
- Captions are ranked based on image-text similarity from CLIP ViT-L/14, and the top 2 captions are retained for training.
Methodology Highlights
- Q-Former: The core architectural component, with two transformer
sub-modules sharing self-attention layers.
- Image Transformer: Cross-attends to frozen image encoder features every other block.
- Text Transformer: Can act as both encoder and decoder, depending on the pre-training task.
- Learnable Queries: 32 queries interact with the frozen image encoder and LLM to extract relevant visual features (see the sketch after this list).
- Stage 1 (Representation Learning): Pre-training without using the LLM.
- Objectives: Image-Text Contrastive (ITC), Image-Text Matching (ITM), Image-Grounded Text Generation (ITG).
- Stage 2 (Generative Learning): Connects the frozen image encoder and
Q-Former to a frozen LLM (either decoder-only or encoder-decoder) for
vision-to-language generation.
- Language modeling: Causal or prefix-based, depending on the LLM.
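The sketch below shows the core Q-Former idea referenced above in a single simplified layer: a small set of learnable query embeddings cross-attends to frozen image features, and the resulting 32 outputs are projected into the LLM's input space. The real Q-Former is a full BERT-style transformer with shared self-attention; dimensions and names here are illustrative assumptions.

```python
# Simplified, single-layer sketch of query-based visual feature extraction.
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, llm_dim)  # output fed to the frozen LLM

    def forward(self, frozen_image_feats):           # (B, N_patches, dim)
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        return self.proj_to_llm(out)                  # (B, 32, llm_dim)

soft_prompts = QueryExtractor()(torch.randn(2, 257, 768))
```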
Training Details
- Stage 1 (Representation Learning):
- Steps: 250k steps.
- Batch Size: 2320 for ViT-L, 1680 for ViT-g.
- Stage 2 (Generative Learning):
- Steps: 80k steps.
- Batch Size: 1920 for OPT, 1520 for FlanT5.
- Precision: FP16 for ViTs and OPT, BF16 for FlanT5.
- Optimizer: AdamW with weight decay of 0.05.
- Learning Rate: Peak 1e-4 with 2k-step warmup, cosine cooldown to a minimum of 5e-5.
- Hardware: Training done on a single 16xA100 (40GB) machine.
- Training Time: ViT-g + FlanT5-XXL takes 6 days for Stage 1 and 3 days for Stage 2.
- Data Augmentation: Random resized cropping and horizontal flipping.
2023_02_kosmos
2023-02: KOSMOS-1
Quick Notes
- Decoder-Only model
- Weights unavailable: MIT License
- Organization: Microsoft
Key Links
- Arxiv: Language Is Not All You Need: Aligning Perception with Language Models
- NeurIPS 2023: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a research paper introducing a new model: KOSMOS-1, a multimodal large language model (MLLM).
- It proposes aligning perception (vision) with language models to handle multimodal tasks like image captioning, visual question answering, and nonverbal reasoning.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of current large language models (LLMs) in handling multimodal data like images and audio.
- The goal is to move toward artificial general intelligence by aligning multimodal perception (e.g., vision) with language.
- The impact lies in enabling new tasks, such as robotics, document intelligence, and zero-shot multimodal reasoning, by training a model that can perceive, reason, and generate across modalities.
What is the relevant related work and what is this paper’s contribution?
- Related work includes other multimodal models like Flamingo and language models like METALM, which treat language models as general-purpose interfaces.
- Existing approaches are insufficient because they do not natively align perception with LLMs or handle instruction following and in-context learning across modalities.
- This paper contributes KOSMOS-1, which integrates multimodal input into a Transformer-based model, allowing it to learn and follow instructions across both language and vision tasks.
What are the results (Theory/Experiments)?
- KOSMOS-1 extends LLMs to multimodal tasks, allowing them to perceive and understand input from different modalities.
- Experimental results:
- KOSMOS-1 outperforms comparable models (e.g., Flamingo) in zero-shot and few-shot settings across tasks like image captioning and visual question answering.
- It demonstrates the ability to perform nonverbal reasoning on IQ tests and handle OCR-free tasks.
- KOSMOS-1 also shows cross-modal transfer capabilities, transferring knowledge between modalities (e.g., from language to vision).
Details of Note
Dataset
- Unimodal text: Includes The Pile, Common Crawl, CC snapshots, CC-Stories, and RealNews.
- Exclusions: Data from GitHub, arXiv, Stack Exchange, and PubMed Central are excluded.
- Image-caption pairs: Constructed from English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions datasets.
- Interleaved multimodal data: Created by extracting and interleaving text and images from Common Crawl web pages (approx. 71M web pages).
- Instruction tuning data: Includes Unnatural Instructions and FLANv2 (random 54k samples).
Methodology Highlights
- Architecture: Uses ViT-L/14 as the vision encoder with Resampler (attentive pooling); ViT is kept frozen except for the last layer.
- Input representation: Encoded image tokens are wrapped with special <image> and </image> tags (see the toy example after this list).
- Pretraining task: Standard next-token prediction for text, image-caption, and interleaved multimodal data.
- Position encoding: xPos (relative position encoding) is used for long-context modeling in the Magneto Transformer variant.
- Multimodal chain-of-thought: In some tasks, the model generates an intermediate rationale to improve reasoning performance.
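A toy example of how an interleaved web document might be linearized with image boundary tags, as described above; the helper and the placeholder image-token strings are hypothetical, and only the <image>/</image> tags come from the paper.

```python
# Toy linearization of an interleaved text-and-image document.
def linearize(segments):
    """segments: list of ("text", str) or ("image", list_of_image_token_strings)."""
    out = []
    for kind, payload in segments:
        if kind == "text":
            out.append(payload)
        else:  # image: wrap the encoder's image tokens in boundary tags
            out.append("<image>" + " ".join(payload) + "</image>")
    return " ".join(out)

print(linearize([("text", "An aerial photo of"),
                 ("image", ["<img_tok_0>", "<img_tok_1>"]),
                 ("text", "taken at sunset.")]))
```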
Training Details
- Pretraining setup:
- Steps: 300k total with 375 warmup steps.
- Batch sizes: 256 for text, 6144 for image-caption pairs, and 128 for interleaved data.
- Learning rate: Max of 2e-4 with linear decay.
- Optimizer: AdamW with eps=1e-6, beta=(0.9, 0.98), and weight decay of 0.01.
- Language-only instruction tuning:
- Steps: 10k with 375 warmup steps.
- Batch sizes: 256 for instructions, 32 for text, 768 for image-caption pairs, 16 for interleaved data.
- Learning rate: Max of 2e-5.
2023_02_mplug2
2023-02: mPLUG-2
Quick Notes
- Encoder-Decoder model
- Apache License 2.0
- Weights available. Permissive license.
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
- PMLR 2023: Conference Paper
- GitHub: X-PLUG/mPLUG-2
- .pth checkpoints available in the GitHub repository.
Guiding Questions
What kind of paper is this?
- This is a multi-modal pretraining model paper.
- It presents a new modularized framework (mPLUG-2) that addresses multi-modal tasks, unifying text, image, and video understanding and generation.
- The paper solves practical AI problems across multiple domains, leveraging both encoder-based and sequence-to-sequence generation paradigms.
What is the motivation and potential impact of the paper?
- The paper tackles the problem of modality entanglement when training models on multiple modalities like text, image, and video simultaneously.
- It highlights the need for modality collaboration while preventing interference across modalities during training.
- The proposed modularized framework allows for flexible, efficient scaling across various tasks.
- The potential impact is significant in multi-modal AI, where it achieves state-of-the-art or competitive performance on over 30 tasks.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on multi-modal foundation models like CLIP, OFA, and BEIT-3, but introduces a more modularized architecture.
- Prior work either focuses on sequence-to-sequence generation or encoder-based instance discrimination, but mPLUG-2 leverages both approaches through a flexible network.
- It introduces a dual-vision encoder and universal layers to handle multi-modal collaboration and disentanglement.
- The paper contributes by designing a framework that selects and combines modules for downstream tasks, improving both performance and zero-shot transferability.
What are the results (Theory/Experiments)?
- Empirical study shows state-of-the-art performance on tasks like MSRVTT video QA and video captioning, achieving new records with smaller model size and data scale.
- Achieved 48.0 top-1 accuracy on MSRVTT video QA and 80.3 CIDEr for video captioning, demonstrating strong results across vision, language, and multi-modal benchmarks.
- The zero-shot transferability is also impressive, especially on vision-language and video-language tasks.
Details of Note
Dataset
- Image captioning: COCO, Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), SBU, Visual Genome.
- Video-text: WebVid-2M.
- Text-only: WikiCorpus (20GB) and cleaned Common Crawl (350GB), mirroring the C4 dataset.
Methodology Highlights
- Dual Vision Encoder: A module that supports both images and videos, capturing spatial and temporal features using ViT-B/16 and ViT-L/14 models.
- Text Encoder: BERT-based unimodal encoder, initialized from pre-trained checkpoints.
- Universal Module: Aligns vision and language features using text-aware vision tokens and vision-aware text tokens, reducing computational complexity and aligning modalities in a shared semantic space.
- Fusion Module: Combines self-attention, cross-attention, and feed-forward networks (FFNs) to update and refine cross-modal representations.
- Shared Decoder: Supports text generation for both uni-modal and multi-modal tasks.
- Pre-training Losses:
- Language Loss: Masked Language Modeling (MLM) for text encoding.
- Image-Text Contrastive (ITC): Aligns image and text representations.
- Image-Text Matching (ITM): Matches image and text pairs, improving multi-modal understanding.
- Instruction-based LM: Handcrafted instructions for task and modality discrimination.
Training Details
- Pre-training: 30 epochs.
- Batch size:
- Base model: 1024 on 8 NVIDIA A100 GPUs.
- Large model: 512 on 16 NVIDIA A100 GPUs.
- Optimizer: AdamW with a weight decay of 0.02.
- Learning Rate:
- Warmup for 5000 steps, followed by cosine decay.
- Base model: Peak learning rate of 1e-4.
- Large model: Peak learning rate of 5e-5.
- Data Augmentation: Random cropping, horizontal flip, and sparse frame sampling for videos.
2023_04_llava
2023-04: LLaVA
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft(-ish)
Key Links
- Arxiv: Visual Instruction Tuning
- NeurIPS 2023: Conference Paper
- GitHub: haotian-liu/LLaVA
- Hugging Face checkpoints: Model Zoo
Guiding Questions
What kind of paper are you reading?
- This is a methodology paper introducing a new approach called visual instruction tuning.
- It describes the first attempt at using machine-generated multimodal instruction-following data to train a large language-vision model (LLaVA).
What is the motivation and potential impact of the paper?
- The paper addresses the lack of multimodal instruction-following data for training general-purpose vision and language assistants.
- The potential impact includes advancing multimodal AI models by improving their zero-shot capabilities on real-world vision-language tasks, which could benefit fields like automated assistants and multimodal reasoning.
What is the relevant related work and what is this paper’s contribution?
- Related work includes instruction-tuning for LLMs in natural language processing (e.g., ChatGPT, Alpaca) and existing multimodal models (e.g., CLIP, BLIP-2).
- This paper contributes by generating vision-language instruction data using GPT-4 and introducing LLaVA, a model that connects a vision encoder to an LLM for multimodal tasks, and by building new benchmarks for visual instruction-following.
What are the results (Theory/Experiments)?
- Experiments: The paper shows that LLaVA performs similarly to multimodal GPT-4 on visual reasoning tasks and achieves state-of-the-art accuracy (92.53%) on the Science QA dataset when combined with GPT-4.
- The results highlight LLaVA’s effectiveness in multimodal reasoning, outperforming existing models like BLIP-2 and OpenFlamingo on a new multimodal benchmark (LLaVA-Bench).
Details of Note
Dataset
- Visual Instruction Dataset: Created using GPT-4 to generate 158K
multimodal instruction samples.
- 58K conversations, 32K detailed descriptions, and 77K complex reasoning samples.
- Context types: Captions and bounding boxes used as symbolic image representations to prompt GPT for instruction generation.
- Source data: Filtered 595K image-text pairs from CC3M dataset to ensure relevant and diverse samples.
Methodology Highlights
- Architecture: Combines a frozen ViT-L/14 CLIP encoder with a
decoder-only LLaMA LLM.
- Vision tokens are linearly projected into the language space and prepended to text instructions.
- Response types: Three types of GPT-4-generated responses: conversation, detailed description, and complex reasoning.
- End-to-End Multimodal Model: The model connects vision and language components with minimal architectural changes, focusing on efficient training via instruction-tuning.
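A minimal sketch of the projection step described above, assuming generic tensors in place of the CLIP encoder outputs and LLM embeddings: a single linear layer maps patch features into the LLM's embedding space, and the resulting vision tokens are prepended to the text embeddings.

```python
# Sketch of LLaVA-style linear feature alignment (generic placeholder tensors).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only weights trained in Stage 1

    def forward(self, patch_feats, text_embeds):
        vision_tokens = self.proj(patch_feats)                  # (B, N_patches, llm_dim)
        return torch.cat([vision_tokens, text_embeds], dim=1)   # fed to the LLM

inputs = VisionProjector()(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
```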
Training Details
- Two-Stage Training:
- Stage 1 (Feature Alignment): Trains the projection layer to align
visual features with the LLM’s embedding space. All other weights are
frozen.
- Uses 595K image-text pairs from CC3M for this stage.
- Stage 2 (End-to-End Fine-tuning): The vision encoder remains frozen, while the projection layer and LLM are fine-tuned on the 158K instruction-following dataset.
2023_04_mplug_owl
2023-04: mPLUG-Owl
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- GitHub: X-PLUG/mPLUG-Owl
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper are you reading?
- This is a paper proposing a novel modularized training paradigm for equipping large language models (LLMs) with multimodal abilities.
- It presents mPLUG-Owl, a system that enhances LLMs’ visual understanding while maintaining their text generation capabilities through a two-stage training process.
What is the motivation and potential impact of the paper?
- The paper aims to address the limitations of current multimodal LLMs, which often struggle with aligning vision and language tasks due to frozen visual models or inefficient alignment techniques.
- The impact lies in providing a framework that supports multiple modalities (text, vision) concurrently and shows improved instruction understanding, multi-turn conversation, and knowledge reasoning over competing models.
What is the relevant related work and what is this paper’s contribution?
- Related work includes Visual ChatGPT, MM-REACT, BLIP-2, MiniGPT-4, and LLaVA, all of which attempt to integrate vision and language but face issues with efficiency, alignment, or modularity.
- This paper’s contribution is the modularized training paradigm that enables better multimodal alignment and fine-tuning via LoRA. It also introduces OwlEval, an evaluation set for visually-related instruction tasks.
What are the results (Theory/Experiments)?
- Experiments show that mPLUG-Owl outperforms existing models like MiniGPT-4 and LLaVA in instruction understanding, visual reasoning, and multi-turn dialogue.
- The paper provides quantitative evidence via OwlEval (a custom evaluation set), where mPLUG-Owl achieves higher ratings across multiple visually-related tasks, including OCR, knowledge transfer, and multi-turn conversations.
Details of Note
Dataset
- Pre-training datasets:
- LAION-400M, COYO-700M, Conceptual Captions, and COCO for image-captioning tasks.
- Provides diverse and large-scale image-text pairs for teaching the model visual knowledge and alignment.
- Instruction tuning datasets:
- LLaVA data for multimodal instruction tasks.
- Unimodal instructions sourced from Alpaca, Vicuna, and Baize.
Methodology Highlights
Architecture:
- Vision encoder: ViT-L/14 for visual feature extraction, pretrained using CLIP.
- LLM decoder: LLaMA-7B, with an option for LLaMA-2 weights.
- Visual abstractor: Aggregates visual features using a learnable resampler module, providing higher semantic representations.
Training paradigm:
- Stage 1: Pre-training:
- Frozen LLM, trainable vision encoder (ViT) and abstractor module.
- Focuses on aligning visual and textual representations.
- Stage 2: Joint instruction tuning:
- LLM weights trained with LoRA, freezing visual components.
- Fine-tuned on both unimodal and multimodal instructions.
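The sketch below illustrates the LoRA idea used in Stage 2: the frozen base weight is augmented with a trainable low-rank update. This is a generic layer with arbitrary rank and scaling values, not Alibaba's implementation, and it omits details such as which projection matrices receive adapters.

```python
# Generic LoRA linear layer: frozen base weight plus trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the original LLM weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(2, 16, 4096))
```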
Training Details
Stage 1 (Pre-training):
- Batch size: 2.1 million tokens.
- Steps: 50k updates, consuming ~104 billion tokens.
- Optimizer: AdamW.
- Warmup: 2k steps, followed by cosine decay for learning rate.
- Image resolution: 224x224 pixels.
Stage 2 (Instruction tuning):
- Batch size: 256.
- Steps: 2k updates.
- Learning rate: 2e-5.
2023_05_palix
2023-05: PaLI-X
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
Guiding Questions
What kind of paper are you reading?
- This is a multilingual vision and language model paper.
- It focuses on scaling up the PaLI-X model for vision-language tasks and evaluates it across various benchmarks.
- The paper proposes improvements in model architecture, training mixtures, and evaluation methods for complex tasks like image captioning, VQA, and few-shot learning.
What is the motivation and potential impact of the paper?
- Motivation: To investigate the impact of scaling vision-language models (PaLI-X) in both size and task complexity, inspired by similar scaling efforts in language models.
- Importance: Vision-language models play a critical role in tasks like image captioning and VQA, which are essential for applications in AI-driven document understanding, multimodal learning, and interaction systems.
- Impact: The paper aims to push the state-of-the-art (SoTA) across a wide range of vision-language benchmarks and demonstrate emerging capabilities, including multilingual object detection and complex counting.
What is the relevant related work and what is this paper’s contribution?
- Related Work: Builds on models like PaLI, Flamingo, and ViT (vision transformers). It also leverages approaches in few-shot learning, multimodal models, and scaling trends in both language and vision models.
- Contribution:
- Demonstrates that scaling both the vision and language components leads to significant performance improvements.
- Introduces a novel training mixture that combines self-supervision and full supervision.
- Establishes new SoTA results across more than 25 benchmarks, showcasing advances in tasks such as image captioning, VQA, and document understanding.
What are the results (Theory/Experiments)?
- Experiments:
- Fine-tuning experiments on diverse benchmarks (e.g., COCO, VQAv2, NoCaps).
- Few-shot learning tasks that evaluate multilingual captioning and question-answering.
- Performance on video-based tasks and object detection tasks.
- Results:
- PaLI-X outperforms previous models, achieving new SoTA results in tasks like image captioning, VQA, and complex counting.
- Significant improvements in document and infographic understanding.
- Few-shot learning capabilities demonstrated strong multilingual performance.
- Emerging capabilities: PaLI-X exhibits abilities in multilingual object detection and complex counting that were not explicitly part of the training set.
Details of Note
Dataset
- Image captioning: COCO, Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), SBU, Visual Genome.
- Video-text: WebVid-2M.
- Text-only: WikiCorpus (20GB) and cleaned Common Crawl (350GB), mirroring the C4 dataset.
Methodology Highlights
- Dual Vision Encoder: A module that supports both images and videos, capturing spatial and temporal features using ViT-B/16 and ViT-L/14 models.
- Text Encoder: BERT-based unimodal encoder, initialized from pre-trained checkpoints.
- Universal Module: Aligns vision and language features using text-aware vision tokens and vision-aware text tokens, reducing computational complexity and aligning modalities in a shared semantic space.
- Fusion Module: Combines self-attention, cross-attention, and feed-forward networks (FFNs) to update and refine cross-modal representations.
- Shared Decoder: Supports text generation for both uni-modal and multi-modal tasks.
- Pre-training Losses:
- Language Loss: Masked Language Modeling (MLM) for text encoding.
- Image-Text Contrastive (ITC): Aligns image and text representations.
- Image-Text Matching (ITM): Matches image and text pairs, improving multi-modal understanding.
- Instruction-based LM: Handcrafted instructions for task and modality discrimination.
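To make the Image-Text Contrastive (ITC) objective listed above concrete, here is a minimal sketch of a symmetric in-batch InfoNCE formulation; the temperature value and the mean reduction are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric in-batch contrastive loss aligning image and text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```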
Training Details
- Pre-training: 30 epochs.
- Batch size:
- Base model: 1024 on 8 NVIDIA A100 GPUs.
- Large model: 512 on 16 NVIDIA A100 GPUs.
- Optimizer: AdamW with a weight decay of 0.02.
- Learning Rate:
- Warmup for 5000 steps, followed by cosine decay.
- Base model: Peak learning rate of 1e-4.
- Large model: Peak learning rate of 5e-5.
- Data Augmentation: Random cropping, horizontal flip, and sparse frame sampling for videos.
2023-05: Pix2Act
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Google (DeepMind)
Key Links
- Arxiv: From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- NeurIPS 2023: Conference Paper
- GitHub: google-deepmind/pix2act
- Checkpoints available in GitHub repository.
Guiding Questions
What kind of paper is this?
- This is a systems and benchmark paper.
- It proposes a new model (PIX2ACT) and demonstrates its capabilities on GUI-based tasks, benchmarking it against existing methods.
What is the motivation and potential impact of the paper?
- The paper addresses the problem of automating tasks through graphical user interfaces (GUIs) without relying on structured representations like HTML or DOM trees.
- The motivation stems from limitations of prior approaches that require structured data, which is often unavailable or misaligned with visual elements in GUIs.
- The potential impact is significant for digital agents, accessibility, and automation, especially in environments where structured data is inaccessible or incomplete.
What is the relevant related work and what is this paper’s contribution?
- Prior work has focused on using structured representations (like HTML, DOM) for GUI-based tasks, which limits applicability when those representations aren’t available.
- This paper’s main contribution is developing a pixel-based agent (PIX2ACT) that interacts with GUIs using screenshots and generic actions (mouse/keyboard), without relying on structured inputs.
- The paper also shows PIX2ACT can outperform human crowdworkers on the MiniWob++ benchmark using only pixel inputs.
What are the results (Theory/Experiments)?
- Theory: The paper builds on PIX2STRUCT for pixel-based pre-training and incorporates Monte Carlo Tree Search (MCTS) for policy improvement.
- Experiments:
- MiniWob++ benchmark: PIX2ACT outperforms human crowdworkers and improves task scores compared to previous models that do not access DOM information.
- WebShop benchmark: PIX2ACT establishes a baseline, but there remains a gap compared to models using HTML inputs.
- Ablation studies show that pre-training on screenshots is critical for success, especially for tasks where structured data is unavailable.
Details of Note
Dataset
- The paper adapts two benchmarks for GUI-based tasks:
- MiniWob++: A set of over 100 web-based tasks requiring interaction through visual elements. The authors support 59 tasks using pixel-based inputs and mouse/keyboard actions.
- WebShop: A shopping environment with over 1.1 million products, where the task is to find and purchase items based on human-authored instructions.
- Human demonstrations were used for both benchmarks, with 81% conversion success from the MiniWob++ demonstrations into the new action format.
- Tasks involve natural language instructions and various types of interactions like clicking, dragging, and scrolling.
Methodology Highlights
- The proposed PIX2ACT model relies solely on pixel-based inputs (screenshots) and uses generic mouse/keyboard actions to interact with GUIs.
- Builds on PIX2STRUCT, a Transformer-based image-to-text model, pre-trained to map screenshots to structured HTML representations.
- Monte Carlo Tree Search (MCTS) is used to improve policy decisions by exploring action sequences in the environment.
- Uses behavioral cloning with human demonstrations and tree search to iteratively improve the model’s performance.
Training Details
- Optimizer: Uses Adafactor with a learning rate of 0.01.
- MiniWob++: Trained for 26K steps with a batch size of 512. Policy improvement using tree search generated 826K episodes.
- WebShop: Finetuned on MiniWob++ before 10K additional steps on WebShop-specific data with a batch size of 256.
- Input lengths: Sequence lengths of 512 tokens for MiniWob++ and 4096 tokens for WebShop due to text-heavy data.
2023-05: VisionLLM
Quick Notes
- Decoder-Only model
- Weights unavailable; Apache-2.0 License
- Organization: Shanghai AI Lab
Key Links
- Arxiv: VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
- NeurIPS 2023: Conference Paper
- GitHub: OpenGVLab/VisionLLM
Guiding Questions
What kind of paper are you reading?
- This is a big idea paper proposing a new model and framework, VisionLLM, that unifies vision and language tasks.
- It introduces a large language model-based framework for vision-centric tasks, providing flexibility beyond current vision foundation models (VFMs).
What is the motivation and potential impact of the paper?
- The motivation is to bridge the gap between large language models’ (LLMs) open-ended task capabilities and the pre-defined nature of vision foundation models (VFMs).
- The paper aims to unlock flexible vision task customization using language instructions, leading to a unified approach to handling diverse tasks like object detection, segmentation, and image captioning.
- Its potential impact includes setting a new baseline for generalist vision-language models and making it easier to manage vision-centric tasks through a unified framework.
What is the relevant related work and what is this paper’s contribution?
- Related work includes large language models like GPT-4, visual-language models like Flamingo, and vision generalist models such as OFA and Pix2Seq.
- Existing approaches are limited to pre-defined tasks or visual prompt tuning that doesn’t fully integrate LLMs.
- This paper’s contribution is VisionLLM, which aligns vision tasks with LLM methodologies and introduces components like language-guided image tokenization and an LLM-based open-ended task decoder.
What are the results (Theory/Experiments)?
- Theory: VisionLLM redefines how vision tasks are formulated, using language instructions to structure outputs, allowing for customizable task handling.
- Experiments: It achieves over 60% mAP on COCO for object detection, performs well on other tasks like visual grounding and image captioning, and shows the ability to customize both task descriptions and output formats.
- The model’s generalist nature allows for strong performance on tasks that usually require separate models.
Details of Note
Dataset
- COCO2017: Used for object detection and instance segmentation tasks.
- RefCOCO/RefCOCOg/RefCOCO+: Employed for visual grounding tasks.
- LLaVA-Instruct: Synthetic dataset used for instruction fine-tuning, connecting vision and language.
Methodology Highlights
- Unified LLM-based Decoder: Applies to both vision-only and vision-language tasks, allowing flexible task customization using language instructions.
- Language-Guided Image Tokenizer: Extracts multi-scale visual features (via ResNet/InternImage-H) and updates them using cross-attended language features from a BERT-like encoder.
- Special Tokens: Augments the LLM with position and class tokens for object localization and classification.
- Training Stages: Two-stage training process:
- Stage 1: Freezes the LLM while training the visual backbone.
- Stage 2: Freezes the visual backbone and fine-tunes the LLM with LoRA-based updates.
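Stage 2 fine-tunes the LLM with LoRA-based updates. The snippet below is a generic low-rank adapter sketch (the rank and scaling values are illustrative, not taken from the paper): the pretrained projection stays frozen and only the two small matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```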
Training Details
- 50 epochs with AdamW optimizer and cosine annealing learning rate schedule.
- Peak learning rate: 2e-4.
- Hardware: 4x8 NVIDIA A100s, processing one sample per GPU.
2023-06: KOSMOS-2
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft
Key Links
- Arxiv: Kosmos-2: Grounding Multimodal Large Language Models to the World
- GitHub: microsoft/unilm
- Hugging Face checkpoints: microsoft/kosmos-2-patch14-224
Guiding Questions
What kind of paper is this?
- This is a technical research paper introducing a new multimodal large language model (MLLM) called KOSMOS-2.
- It extends previous work (KOSMOS-1) by integrating grounding capabilities, linking language to visual elements (e.g., bounding boxes).
What is the motivation and potential impact of the paper?
- The paper aims to address the challenge of grounding text in the visual world, which improves multimodal interactions.
- Motivation: Enabling human-AI interaction by allowing the model to understand image regions directly rather than relying on text descriptions.
- Potential impact: Enhances multimodal AI systems by introducing new capabilities like grounded image captioning and referring expression comprehension, contributing towards artificial general intelligence (AGI).
What is the relevant related work and what is this paper’s contribution?
- Related work: Builds on MLLMs such as KOSMOS-1, Flamingo, and grounding models like GLIP.
- Contribution: Introduces the grounding capability to MLLMs by linking text spans to image regions via location tokens, constructing the large-scale GRIT dataset of grounded image-text pairs, and expanding the applications of multimodal models to grounded vision tasks.
What are the results (Theory/Experiments)?
- Theory: The grounding capability is achieved by representing bounding boxes as location tokens integrated into the model’s language structure.
- Experiments: KOSMOS-2 is evaluated across multiple tasks, including phrase grounding, referring expression comprehension, and visual question answering.
- Results show strong performance in multimodal grounding tasks and competitive performance on vision-language tasks compared to models like MDETR and FIBER.
Details of Note
Dataset
- GRIT (Grounded Image-Text Pairs): Consists of 91M images, 115M text spans, and 137M bounding boxes.
- Mining process: Dataset created from captioning datasets (e.g., LAION-2B, COYO-700M).
- Grounding: Text spans are linked to image regions using bounding boxes, which are converted into sequences of location tokens.
- The dataset supports multiple bounding boxes for a single text phrase, enabling precise multimodal grounding.
Methodology Highlights
- Input Representation: Bounding boxes are represented by the top-left and bottom-right points, which are discretized into patch indices.
- Uses a Markdown-like format: Phrases are wrapped in special tokens, followed by box tokens that enclose the location tokens.
- Multibox support: Handles multiple bounding boxes linked to a single phrase.
- A special grounding token is used to signal when the model should ground text to the visual world.
- Transition from KOSMOS-1 to KOSMOS-2: KOSMOS-2 integrates GRIT data with KOSMOS-1’s existing multimodal corpora. KOSMOS-2’s weights are initialized from KOSMOS-1, with new tokens initialized randomly.
- Instruction tuning includes LLaVA-Instruct and newly created GRIT-based instructions.
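To make the location-token scheme concrete, here is a small sketch that discretizes a normalized bounding box into two patch-index tokens and wraps a phrase in grounding markup. The grid size and the token spellings (`<phrase>`, `<box>`, `<loc_i>`) are illustrative placeholders, not necessarily the exact vocabulary used by KOSMOS-2.

```python
def box_to_location_tokens(box, grid: int = 32) -> str:
    """Map a normalized (x0, y0, x1, y1) box to two patch-index location tokens
    on a grid x grid layout (top-left and bottom-right corners)."""
    x0, y0, x1, y1 = box
    def patch_index(x, y):
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col
    return f"<loc_{patch_index(x0, y0)}><loc_{patch_index(x1, y1)}>"

def ground_phrase(phrase: str, box) -> str:
    # Markdown-like markup: the phrase span followed by its location tokens.
    return f"<phrase>{phrase}</phrase><box>{box_to_location_tokens(box)}</box>"

print(ground_phrase("a snowman", (0.12, 0.30, 0.55, 0.90)))
# <phrase>a snowman</phrase><box><loc_291><loc_913></box>
```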
Training Details
- Training hyperparameters:
- Steps: 60,000, with 375 warmup steps.
- Learning rate: Peaks at 2e-4, with linear decay.
- Optimizer: AdamW with β = (0.9, 0.98), weight decay of 0.01.
- Batch sizes: Text (93), Image-caption pairs (1117), Grounded image-text pairs (1117), Interleaved data (47).
- Instruction tuning:
- Steps: 10,000, with 375 warmup steps.
- Learning rate: 1e-5.
- Batch sizes: Text instruction (117), Vision-language instruction (351), Grounded image-text pair + grounded instruction (1404), Text (30), Interleaved data (15).
2023-06: Shikra
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: SenseTime
Key Links
Guiding Questions
What kind of paper is this?
- This is a proposal paper introducing a new Multimodal Large Language Model (MLLM) called Shikra, which enables referential dialogue.
- It addresses a gap in MLLMs where spatial coordinates in dialogue were previously unsupported.
What is the motivation and potential impact of the paper?
- The motivation is to introduce referential dialogue capabilities into MLLMs, allowing users to point to specific areas of an image and ask questions, with the model responding in natural language and providing relevant spatial coordinates.
- The impact is significant for Mixed Reality (XR), robotic communication, and online shopping, where such spatial interaction could enhance user experience.
What is the relevant related work and what is this paper’s contribution?
- Related work: It builds on existing MLLMs like Flamingo, BLIP-2, and Mini-GPT4, which combine vision and language but lack spatial referencing capabilities.
- Contribution: Shikra introduces a unified, simple architecture that handles both input and output of spatial coordinates in natural language without extra vocabularies or specialized modules. It enhances the natural dialogue capabilities of MLLMs by supporting tasks like REC, PointQA, and VQA.
What are the results (Theory/Experiments)?
- Theory: Shikra’s architecture integrates a vision encoder, an alignment layer, and a LLM, and uses natural language numerical coordinates for spatial referencing.
- Experiments: Shikra shows promising performance across tasks like REC, PointQA, and image captioning. It demonstrates superior performance compared to other generalist models on REC and PointQA tasks, but still falls behind specialist models in some areas.
Details of Note
Dataset
- Shikra-RD: Custom instruction-tuning dataset generated using GPT-4, which includes chain-of-thought (CoT) examples with spatial annotations.
- Public datasets: Utilizes Flickr30K Entities and RefCOCO datasets to provide visual data and examples for GPT-4 generation.
- Other Vision-Language (VL) datasets: Includes publicly available datasets used in the first stage of training for conventional VL tasks (e.g., image captioning, VQA, REC).
Methodology Highlights
- Architecture: Combines ViT-L/14 (vision encoder) and a decoder-only LLM (Vicuna-7B/13B).
- Coordinate representation: Uses natural language to represent coordinates in the format [x_min, y_min, x_max, y_max] for boxes and [x, y] for points.
- Multiple points/boxes separated by semicolons.
- All coordinates normalized to [0, 1], truncated to 3 decimal places.
- Instruction tuning: Bootstraps CoT instruction examples from GPT-4 using visual datasets.
- No special vocabularies: Coordinates are represented without introducing any special encoders or vocabularies for reference points.
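A minimal helper matching the plain-text coordinate convention described above (normalized values, three decimal places, multiple boxes separated by semicolons); note that this sketch rounds rather than truncates, and the function name is made up for illustration.

```python
def format_boxes(boxes, ndigits: int = 3) -> str:
    """Render normalized boxes as plain-text coordinates:
    [x_min,y_min,x_max,y_max]; multiple boxes joined by ';'."""
    def fmt(v: float) -> str:
        return f"{max(0.0, min(1.0, v)):.{ndigits}f}"
    return ";".join("[" + ",".join(fmt(v) for v in box) + "]" for box in boxes)

print(format_boxes([(0.102, 0.334, 0.657, 0.912), (0.2, 0.2, 0.5, 0.5)]))
# [0.102,0.334,0.657,0.912];[0.200,0.200,0.500,0.500]
```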
Training Details
- Two-stage training:
- Stage 1: Trained on all VL datasets for 100,000 steps (approx. 1.5 epochs).
- Stage 2: Fine-tuned on Shikra-RD and LLaVA-Instruct-150K with increased sampling ratio.
- Optimizer: AdamW with frozen visual encoders, no warmup.
- Learning rate: Peak LR of 2e-5, with cosine annealing for cooldown.
- Compute: Trained on 8 NVIDIA A100 GPUs, taking 100 hours for stage 1 and 20 hours for stage 2.
2023-08: Qwen-VL
Quick Notes
- Decoder-Only model
- Weights available: Tongyi License
- Organization: Alibaba
Key Links
- Arxiv: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- GitHub: QwenLM/Qwen-VL
Guiding Questions
What kind of paper is this?
- This is a research paper that introduces and evaluates a new series of large-scale vision-language models (LVLMs), named Qwen-VL.
- It proposes a novel model for multimodal tasks involving both text and image inputs, targeting tasks like image captioning, question answering, and fine-grained visual understanding.
What is the motivation and potential impact of the paper?
- The motivation is to enhance large language models (LLMs) with the ability to process and understand visual information, expanding their utility beyond text-based tasks.
- The impact is significant, as the Qwen-VL models achieve state-of-the-art results across various vision-language benchmarks and have multilingual capabilities, which could accelerate research and practical applications in fields requiring multimodal understanding.
What is the relevant related work and what is this paper’s contribution?
- Related work includes LVLMs like Flamingo, BLIP-2, and Kosmos-2, which also aim to integrate vision and language processing.
- The main contribution of this paper is the Qwen-VL series, which introduces a robust visual receptor, a three-stage training pipeline, and fine-grained visual understanding. The models surpass previous open-source LVLMs in tasks like image captioning, visual question answering, and object grounding.
What are the results (Theory/Experiments)?
- Experiments: The paper presents results on a wide range of benchmarks, including image captioning (Flickr30K, Nocaps), general VQA (VQAv2, OKVQA, GQA), and text-oriented visual understanding (TextVQA, DocVQA).
- The results demonstrate that Qwen-VL and Qwen-VL-Chat models outperform other generalist models on several benchmarks, achieving leading performance in fine-grained tasks like text-reading and object localization.
- The paper also shows that Qwen-VL-Chat excels in real-world user behavior tests, outperforming similar models in instruction-following tasks.
Details of Note
Dataset
- Stage 1:
- Used a combination of publicly available English and Chinese caption datasets totaling 5 billion image-text pairs, cleaned to 1.4 billion.
- English data sources: LAION-en, LAION-COCO, DataComp, Coyo, CC12M, CC3M, SBU, COCO.
- Chinese data includes private in-house data and LAION-zh.
- Stage 2:
- Multitask pre-training datasets include:
- Captioning data (same as Stage 1 but reduced in size).
- Visual Question Answering (VQA) datasets: GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D.
- Grounding and reference tasks: GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg.
- OCR data from synthetic English and Chinese datasets (SynthDoG-en, Common Crawl PDF & HTML).
Methodology Highlights
- Architecture:
- Base LLM: Qwen-7B, visual encoder: ViT-bigG/14 from OpenCLIP.
- Introduced a VL Adapter: A cross-attention layer compressing image features to 256 fixed-length vector representations.
- Special tokens used for bounding boxes (`<box>...</box>`), image content (`<img>...</img>`), and text references (`<ref>...</ref>`) in input/output.
- Training Pipeline:
- Stage 1: Vision-language alignment with frozen Qwen-7B and unfrozen visual encoder and adapter, trained on weakly-labeled image-text pairs.
- Stage 2: Multitask pre-training with high-resolution image inputs, covering tasks like captioning, VQA, grounding, and OCR.
- Stage 3: Instruction tuning for Qwen-VL-Chat, fine-tuning on dialogue tasks involving multi-image comprehension, grounding, and multilingual support.
Training Details
- Stage 1:
- Optimizer: AdamW with a cosine learning rate schedule (max lr: 2e-4, min lr: 1e-6), 500-step warmup, 5e-2 weight decay.
- 50k steps with a batch size of 30,720, processing 1.5 billion image-text samples (~500 billion image-text tokens).
- Stage 2:
- Similar AdamW optimizer setup, with 19k steps and 400-step warmup.
- Stage 3:
- Instruction tuning with mixed multi-modal and pure text dialogue data.
- Emphasis on multi-image and fine-grained comprehension for dialogue models.
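The Stage 1 learning-rate schedule above (500-step warmup to a peak of 2e-4, cosine decay toward 1e-6 over the 50k steps) can be sketched as follows; the exact handling of the decay endpoint is an assumption.

```python
import math

def lr_at(step: int, max_lr: float = 2e-4, min_lr: float = 1e-6,
          warmup: int = 500, total: int = 50_000) -> float:
    """Linear warmup to max_lr, then cosine decay toward min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 25_000, 50_000):
    print(s, f"{lr_at(s):.2e}")   # 0.00e+00, 2.00e-04, ~1.0e-04, 1.00e-06
```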
2023-09: KOSMOS-2.5
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft
Key Links
- Arxiv: KOSMOS-2.5: A Multimodal Literate Model
- GitHub: microsoft/unilm
- Hugging Face checkpoint: microsoft/kosmos-2.5
Guiding Questions
What kind of paper is this?
- This is a research paper proposing a new multimodal literate model, KOSMOS-2.5, for document-level machine reading.
- It focuses on extending large language models to handle text-intensive images through two tasks: spatially-aware text recognition and image-to-markdown generation.
- The paper introduces a pre-trained model and its fine-tuned variant, KOSMOS-2.5-CHAT, with applications in document understanding.
What is the motivation and potential impact of the paper?
- The problem is enabling multimodal models to handle text-intensive images, including document-level reading and structural understanding.
- Current OCR and multimodal models fail to capture reading order and structure comprehensively, which are critical for accurate document understanding.
- The potential impact is significant for Artificial General Intelligence (AGI) as it advances models’ ability to understand complex text-rich images like academic papers, receipts, and web pages.
What is the relevant related work and what is this paper’s contribution?
- Related work includes OCR-based models (like Tesseract) and multimodal models (like GPT-4o and Donut), which either focus on line-level recognition or structured parsing in limited domains.
- Existing approaches fail to provide comprehensive document-level reading, especially in diverse domains.
- The paper contributes KOSMOS-2.5, a model capable of handling document-level text recognition and image-to-markdown tasks, and introduces two new benchmarks (OCREval, MarkdownEval) for evaluation.
What are the results (Theory/Experiments)?
- The paper presents a unified Transformer-based framework with a ViT-based vision encoder and a language decoder. It introduces a shared architecture for handling both spatially-aware text recognition and markdown generation.
- Experiments: KOSMOS-2.5 achieves superior results on the OCREval and MarkdownEval benchmarks, outperforming models like GPT-4o and Nougat in document-level text recognition and markdown generation tasks.
- The model excels across diverse document categories and tasks, demonstrating impressive performance relative to models with significantly more parameters.
Details of Note
Dataset
- Text-intensive image data:
- IIT-CDIP: 27.6M scanned pages.
- arXiv papers: 20.9M pages.
- PowerPoint slides: 6.2M pages.
- General PDFs: 155.2M pages crawled from the web.
- Web screenshots: 100M pages.
- Markdown data:
- GitHub README files: 2.9M.
- DOCX files: 1.1M from the web.
- LaTeX papers: 3.7M pages from arXiv.
- HTML pages: 6.3M converted to markdown.
- Total: ~300M pages of text-intensive image data, ~14M pages in markdown format.
- Data filtering includes language detection, deduplication, and removing malformed data.
- Uses augmentation techniques such as rotations, blurring, upscaling, and downscaling from TrOCR.
Methodology Highlights
- Model architecture:
- Vision Transformer (ViT) encoder combined with a Transformer language decoder.
- Perceiver resampler compresses visual features into a constant-length sequence.
- Supports flexible x/y coordinate handling for bounding boxes, compatible with Pix2Struct.
- Vocab size increased to 108,481 tokens (from ~64k in earlier models).
- Core tasks:
- Document-level text recognition, assigning spatial coordinates to text.
- Image-to-markdown generation to preserve document structure and style.
- Bounding boxes rendered by referencing x/y coordinates for each text line.
- Markdown output enhances structural sensitivity, especially for tables and complex formats.
Training Details
- Training setup:
- Total steps: 200k, with 375 steps warmup.
- Batch size: 1024.
- Learning rate: 2e-4 with linear decay.
- Standard settings for AdamW optimizer: weight decay, betas (0.9, 0.98), and epsilon.
- Training dataset:
- Total of ~260B tokens.
- Two-stage training: first on layout-based data (100k steps), followed by combined data (140k steps).
2023-10: LLaVA-1.5
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Microsoft
Key Links
- Arxiv: Improved Baselines with Visual Instruction Tuning
- CVPR 2024: Conference Paper
- For code and weights, see the page on LLaVA.
Guiding Questions
What kind of paper is this?
- This is a systematic study and baseline improvement paper.
- It investigates design choices for large multimodal models (LMMs) under the LLaVA framework and explores improvements to state-of-the-art models through modifications in architecture and training.
What is the motivation and potential impact of the paper?
- The paper aims to improve the efficiency and performance of LMMs, which are central to creating general-purpose visual-language assistants.
- It addresses the problem of balancing data efficiency, model performance, and scaling challenges in LMMs.
- The paper’s impact lies in its ability to offer a simple, reproducible baseline (LLaVA-1.5) that uses publicly available data and achieves state-of-the-art performance across 11 benchmarks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes LLaVA, InstructBLIP, and Qwen-VL, which also explore visual instruction tuning and vision-language model training. These models use various techniques such as resamplers (Qformer) or larger image-text datasets.
- The paper’s main contribution is the introduction of LLaVA-1.5, which improves upon LLaVA by using an MLP connector and incorporating academic-task-oriented VQA datasets. This achieves state-of-the-art performance with a much smaller training set and computational resources.
What are the results (Theory/Experiments)?
- Experiments:
- LLaVA-1.5 achieves state-of-the-art performance across 11 benchmarks using only 1.2M training samples, compared to competing models trained on much larger datasets.
- Scaling to higher resolutions and compositional capabilities further improves results, reducing model hallucination and enhancing generalization across multiple tasks.
- Empirical evidence shows that LLaVA-1.5 outperforms larger models like InstructBLIP and IDEFICS on various visual reasoning tasks, while using fewer resources and simpler architecture.
Details of Note
Dataset
- LLaVA-1.5 uses a combination of public datasets for training, totaling 558K image-text pairs.
- Datasets include academic task-oriented VQA datasets (e.g., VQA-v2, OKVQA, OCR-based datasets like TextCaps) and region-level perception datasets (Visual Genome, RefCOCO).
- GQA and ShareGPT data are incorporated to enhance visual question answering (VQA) and conversational capabilities.
- No in-house data is used, ensuring full reproducibility.
Methodology Highlights
- Replaces LLaVA’s original linear projection with a more powerful two-layer MLP vision-language connector.
- Introduces response format prompting to handle short-form answers better in VQA tasks, improving precision in outputs.
- Scales input resolution to 336x336, allowing the model to process higher detail in images, improving accuracy for detailed visual tasks.
- The model is scaled to a 13B language model (Vicuna-v1.5), which enhances its performance on benchmarks like MME and MM-Vet.
- Demonstrates high data efficiency by reducing training data without significantly compromising performance.
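The two-layer MLP connector is simple enough to sketch directly; the input/output dimensions below (CLIP ViT-L/14 features projected into the 13B Vicuna embedding space) are assumed for illustration.

```python
import torch.nn as nn

# Two-layer MLP vision-language connector replacing a single linear projection;
# dimensions are illustrative: ViT-L/14 features (1024) -> 13B LLM embeddings (5120).
mlp_connector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```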
Training Details
- LLaVA-1.5 is trained on 8x A100 GPUs, with a total training time of ~26 hours.
- Stage 1 (vision-language alignment pretraining) takes ~6 hours.
- Stage 2 (visual instruction tuning) takes ~20 hours.
- Efficient training setup with just 1.2M public data points, significantly lower than competitors using much larger datasets.
2023-10: PaLI-3
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
- Arxiv: PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- ICLR 2024: Rejected, but OpenReview available.
Guiding Questions
What kind of paper is this?
- This is a comparison and performance evaluation paper.
- It focuses on contrasting two pretraining methods (classification vs. contrastive) for Vision Transformers (ViT) in the context of vision-language models (VLMs).
- It also introduces a new model (PaLI-3) and assesses its state-of-the-art performance on several benchmarks.
What is the motivation and potential impact of the paper?
- The motivation is to create a smaller, faster, and more efficient vision-language model, PaLI-3, that can compete with much larger models in terms of performance.
- The problem addressed is scaling VLMs while maintaining or improving performance on key benchmarks like localization and visually-situated text understanding.
- The impact lies in demonstrating that models do not need to be extremely large to be state-of-the-art, potentially encouraging more research into efficient VLMs.
What is the relevant related work and what is this paper’s contribution?
- The paper builds upon previous works like PaLI, PaLI-X, and others that pretrain vision encoders either with classification or contrastive objectives.
- Existing approaches show promise in scaling but rely on larger models. This paper’s contribution is showing that a 5B parameter model (PaLI-3) with contrastive pretraining can match or outperform larger models.
- It also introduces a 2B SigLIP-based image encoder and shows superior performance in multimodal tasks.
What are the results (Theory/Experiments)?
- Experiments: PaLI-3 achieves new state-of-the-art results on several benchmarks including RefCOCO, TextVQA, and multilingual cross-modal retrieval, all while being smaller in size.
- Results show that contrastive pretraining (SigLIP) is particularly effective for tasks requiring visually-situated text understanding and localization.
Details of Note
Dataset
- Uses a mixture of web-scale image-text datasets for pretraining, including WebLI, CC3M-35L, and others.
- Additional datasets for multimodal tasks include RefCOCO (for referring expression segmentation) and VQA datasets (TextVQA, OCRVQA, etc.).
- WebLI dataset was filtered for quality control using a model-based approach, retaining around 40% of image-text pairs.
- Document understanding tasks were enhanced with datasets containing dense text images such as posters and PDFs across 100+ languages.
Methodology Highlights
- Pretraining Objectives: Contrastive pretraining (SigLIP) for vision encoder, focusing on image-text alignment over noisy web-scale data. UL2 pretraining for the text encoder-decoder.
- Architecture: 2B parameter ViT-G/14 vision encoder paired with a 3B UL2 encoder-decoder.
- Multimodal Training: Vision encoder frozen in the early stages, later fine-tuned at increasing resolutions.
- Ablations: Comparison between classification-pretrained (JFT) and contrastive-pretrained (SigLIP) vision encoders reveals significant improvements in visually-situated text understanding and localization tasks.
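For reference, the sigmoid-based contrastive objective behind SigLIP pretraining can be sketched as below; the fixed temperature/bias values and the mean reduction are simplifying assumptions (the actual method learns these scalars).

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t: float = 10.0, b: float = -10.0):
    """Pairwise sigmoid loss: every image-text pair in the batch is an independent
    binary classification (matching pair -> +1, all other pairs -> -1)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() * t + b                          # (batch, batch)
    labels = 2 * torch.eye(img.size(0), device=img.device) - 1
    return -F.logsigmoid(labels * logits).mean()
```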
Training Details
- Stage 0 (Unimodal Pretraining): SigLIP contrastive objective applied to the vision encoder on image-text data. UL2 model pretraining as per original UL2 framework.
- Stage 1 (Multimodal Training): Vision encoder frozen, trained across a mixture of captioning, VQA, and object detection tasks at 224x224 resolution.
- Stage 2 (Resolution Increase): Gradual increase in image resolution during fine-tuning (812x812 and 1064x1064) with the vision encoder unfrozen, improving detailed understanding.
2023-11: CogVLM
Quick Notes
- Decoder-Only model
- Weights available: Apache License
- Chinese academic research
Key Links
- Arxiv: CogVLM: Visual Expert for Pretrained Language Models
- GitHub: THUDM/CogVLM
- A variety of HF links are available in the GitHub repo.
- 224/490 pixel visual base model
- Grounding specific models
- Chat-specific models
- Rejected from ICLR 2024: OpenReview
Guiding Questions
What kind of paper is this?
- This is an empirical and methodological research paper focused on introducing a new vision-language model, CogVLM.
- It proposes a new approach for integrating visual and linguistic features through deep fusion using a “visual expert” module within a pretrained language model, rather than using shallow alignment techniques.
What is the motivation and potential impact of the paper?
- The motivation is to overcome limitations in shallow alignment methods used by prior models, which integrate image features into language models in a way that doesn’t fully capture the depth of vision-language information.
- By introducing a trainable module within the attention and feedforward layers, CogVLM achieves state-of-the-art performance across multiple multimodal benchmarks.
- The potential impact includes enabling large pretrained language models to perform high-quality visual tasks without sacrificing NLP capabilities, contributing significantly to both research and industrial applications of multimodal AI.
What is the relevant related work and what is this paper’s contribution?
- Related work includes prior models that use shallow integration techniques, like InstructBLIP and MiniGPT-4, which map image features into the language model’s input embedding space but fall short in performance due to limited fusion.
- CogVLM’s main contribution is the development of a trainable “visual expert” module within attention and feedforward layers, enabling deep fusion of vision and language features.
- It also includes extensive ablation studies and evaluations to validate this approach, resulting in state-of-the-art performance across a variety of benchmarks.
What are the results (Theory/Experiments)?
- Experiments: The paper presents CogVLM’s superior performance across 17 cross-modal benchmarks, including image captioning (NoCaps, Flickr30K), visual question answering (OKVQA, TextVQA), and visual grounding (RefCOCO).
- Performance: CogVLM outperforms other state-of-the-art models in multiple settings, especially for tasks requiring complex vision-language interactions, achieving high scores in generalist tasks and grounded VQA.
- Ablation Studies: The authors conduct detailed studies on architectural components, such as the impact of using different visual attention masks, initialization methods, and visual encoder scales, highlighting the importance of deep fusion.
Details of Note
Dataset
Pre-training Data:
- LAION-2B and COYO-700M datasets, heavily filtered for quality, yielding 1.5B image-text pairs.
- Visual Grounding Dataset: 40M images with bounding boxes for grounded object references, nouns extracted via spaCy, and bounding boxes predicted using GLIPv2.
Supervised Fine-Tuning Data:
- CogVLM-Chat: VQA datasets (VQAv2, OKVQA, TextVQA, etc.) and multi-turn dialogue datasets like LLaVA-Instruct. The VQA data includes both concise and detailed responses.
- CogVLM-Grounding: Grounded captioning, referring expression generation (REG), referring expression comprehension (REC), and VQA with bounding boxes from sources such as Flickr30K Entities, RefCOCO, and Visual7W.
Methodology Highlights
Architecture:
- ViT Encoder: EVA2-CLIP-E (last layer removed); outputs mapped to text embedding space using an MLP adapter (two-layer SwiGLU).
- Visual Expert Module: Added in each layer, separates vision and language modalities with unique QKV and FFN layers for vision, while sharing layer norms.
- Position Embedding: RoPE for text tokens, single shared position for visual tokens to avoid sequential proximity bias.
Training Approach:
- First Stage: Image captioning as a next-token prediction task on image-text pairs, for 120K steps with batch size 8192.
- Second Stage: Referring Expression Comprehension (REC) as VQA-style grounding; location answers normalized to 3-digit coordinates, trained for 60K steps with batch size 1024.
- Resolution Adjustment: Increased input resolution to 490x490 for the final 30K steps.
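A much-simplified sketch of the visual expert idea from the Architecture bullets above: image-token positions route through vision-specific QKV projections while text tokens keep the pretrained language projections, and a shared attention then runs over the mixed sequence. Head splitting, RoPE, and the FFN experts are omitted.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    """Route tokens through modality-specific QKV projections before shared attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)    # pretrained LM projections (frozen in practice)
        self.vision_qkv = nn.Linear(dim, 3 * dim)  # trainable "visual expert" projections

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); is_image: (batch, seq) bool mask of image-token positions
        qkv = torch.where(is_image.unsqueeze(-1),
                          self.vision_qkv(hidden),
                          self.text_qkv(hidden))
        return qkv                                  # split into q, k, v downstream
```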
Training Details
CogVLM-Chat: Uses VQA and multi-turn instruction data for supervised fine-tuning.
- Hyperparameters: 6K steps, learning rate of 1e-5, batch size of 1024, ViT unfrozen with learning rate set to 1/10 of the full model’s learning rate.
CogVLM-Grounding: Fine-tuned on grounded captioning, REG, REC, and Grounded VQA.
- Task-Specific Adjustments: Incorporates bounding box annotations for improved localization and grounding performance across diverse datasets.
2023-11: DocPedia
Quick Notes
- Decoder-Only model
- No code, no weights
- Chinese academic research
Key Links
- Arxiv: DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
- No code or weights available; no official GitHub repository
Guiding Questions
What kind of paper are you reading?
- This is a big idea paper in multimodal document understanding.
- It proposes DocPedia, an OCR-free multimodal model, focusing on frequency-domain processing instead of the pixel-based approach.
- The paper addresses the challenge of high-resolution document parsing and aligns perception and comprehension tasks by introducing a dual-stage training method.
What is the motivation and potential impact of the paper?
- The primary problem is to improve document parsing accuracy in high-resolution images while preserving both visual and textual information, without relying on OCR.
- DocPedia’s approach could substantially benefit OCR-free document parsing applications by providing a framework that maintains both perception and logical reasoning capabilities.
- The potential impact includes improved performance in information extraction, visual question answering (VQA), and complex multimodal document processing tasks, especially in fields needing OCR-free methods for dense, complex documents.
What is the relevant related work and what is this paper’s contribution?
- Traditional document parsing relies on OCR, leading to error accumulation and limited comprehension capabilities when handling high-resolution images.
- Previous models like LLaVAR and UniDoc operate on lower resolutions or do not fully integrate language models, which limits comprehension.
- Main Contributions:
- Proposes a frequency domain approach (Discrete Cosine Transform, DCT) to capture more detail at high resolutions with fewer visual tokens.
- Introduces a dual-stage training strategy combining perception and comprehension tasks.
- Demonstrates superior performance on key benchmarks, confirming the efficacy of joint learning and the frequency-based approach.
What are the results (Theory/Experiments)?
- The paper theorizes that processing in the frequency domain captures more visual and textual information while minimizing token counts, which is effective for high-resolution images.
- The dual-stage training approach balances lower-level perception tasks (e.g., text recognition) and higher-level comprehension tasks (e.g., logical reasoning).
- Experiments:
- DocPedia achieves significant accuracy improvements (e.g., 40% in DocVQA, 28% in FUNSD) over state-of-the-art methods on high-resolution document benchmarks.
- Results confirm the benefits of frequency domain processing and the training approach for handling high-density text and visual elements in complex documents.
Details of Note
Dataset
- Pre-training data:
- PPT images: 600K images from Common Crawl.
- PDF images: 325K images from arXiv, each assigned one of five OCR tasks (text detection, recognition, spotting, paragraph reading, full-text reading).
- Image-caption pairs: 595K pairs for natural scene perception, sourced from LLaVA data.
- Fine-tuning data:
- Uses pre-training datasets with additional VQA datasets (e.g., DocVQA, ChartVQA) and expanded answers generated by ChatGPT.
- Instruction tuning: 158K instances from LLaVA instruction-tuning data, enriching tasks with semantic reasoning.
Methodology Highlights
- Frequency-domain approach:
- Images converted from RGB to YCbCr, followed by JPEG DCT extraction to obtain DCT coefficients.
- Downscaling: Y channel via 8×8 blocks; Cb and Cr channels downscaled via 16×16 blocks and then upsampled.
- Concatenation and processing: Y, Cb, and Cr channels concatenated and passed through a 1×1 convolution, then fed into the Swin Transformer.
- Two-stage training strategy:
- Stage 1: Text-aware pre-training with frozen LLM (Vicuna), focusing on OCR-based tasks to align frequency features with the language model.
- Stage 2: Context-aware fine-tuning, unfreezing LLM and integrating higher-level semantic tasks for robust multimodal comprehension.
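A rough sketch of the frequency-domain front end described above: convert the image to YCbCr and take blockwise DCT coefficients per channel (8×8 for Y, 16×16 for Cb/Cr). The chroma upsampling and the 1×1 convolution are omitted, and the file name is a placeholder.

```python
import numpy as np
from PIL import Image
from scipy.fft import dctn

def blockwise_dct(channel: np.ndarray, block: int = 8) -> np.ndarray:
    """Split a 2-D channel into block x block tiles and DCT each tile."""
    h, w = channel.shape
    h, w = h - h % block, w - w % block           # crop to a multiple of the block size
    tiles = channel[:h, :w].reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(tiles, axes=(-2, -1), norm="ortho")  # (H/block, W/block, block, block)

img = np.asarray(Image.open("page.png").convert("YCbCr"), dtype=np.float32)
y_coeffs  = blockwise_dct(img[..., 0], block=8)    # Y channel: 8x8 blocks
cb_coeffs = blockwise_dct(img[..., 1], block=16)   # Cb/Cr: coarser 16x16 blocks
cr_coeffs = blockwise_dct(img[..., 2], block=16)
```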
Training Details
- Learning rate strategy: One-cycle strategy with peak learning rates set to 1e-3 for pre-training (PT) and 1e-5 for fine-tuning (FT).
- Batch sizes: 64 for PT, 8 for FT.
- Optimizer: AdamW.
- Hardware: Trained on 8 A100 GPUs for one epoch each in PT and FT stages.
- Token count vs. Resolution: 1,600 tokens in RGB (1280×1280) achieves 29.54% accuracy on DocVQA; DCT (2560×2560) achieves 47.08% accuracy.
2023-11: InfMLLM
Quick Notes
- Decoder-Only model
- Weights available: See Model Zoo
- Organization: Inftech.AI
Key Links
Guiding Questions
What kind of paper is this?
- This is a model development and evaluation paper.
- It proposes InfMLLM, a multimodal large language model (MLLM) framework aimed at improving performance in vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper aims to extend the capabilities of large language models (LLMs) to vision-language tasks such as image captioning, visual question answering (VQA), and visual grounding.
- The motivation is to create a versatile, general-purpose multimodal assistant by training models to handle various modalities (e.g., images, text) more effectively.
- The impact lies in advancing the field of multimodal models, contributing a new framework (InfMLLM) that achieves state-of-the-art results in multiple benchmarks, potentially improving real-world applications of MLLMs.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on existing work in LLMs like GPT-3, as well as multimodal models like BLIP-2, Qwen-VL, and LLaVA.
- Existing approaches such as Q-Former and Perceiver struggle with tasks that require spatial relationships (e.g., visual grounding).
- This paper contributes a new visual adapter (pool-adapter) that preserves positional information better, improving performance in tasks like visual grounding.
- The proposed three-stage training scheme (alignment pretraining, multitask finetuning, and instruction tuning) helps to efficiently finetune models for multimodal tasks.
What are the results (Theory/Experiments)?
- Experiments: InfMLLM was evaluated on benchmarks for VQA, image captioning, visual grounding, and other vision-language tasks.
- The model achieves state-of-the-art or near-SOTA performance on many benchmarks, particularly excelling in visual grounding tasks.
- Ablation studies show that increasing the number of visual embeddings improves performance, especially for visual grounding.
- Key findings: The introduction of the pool-adapter significantly enhances the ability to retain positional information, leading to better results in tasks requiring spatial reasoning.
Details of Note
Dataset
- Stage 1 (Pretraining): Uses weakly labeled image-text pairs from publicly available datasets such as CC3M, CC12M, and LAION-115M.
- Stage 2 (Multitask Finetuning): Uses a uniform mixture of tasks (VQA, captioning, and visual grounding) with datasets like:
- VQA: VQAv2, OK-VQA, AOK-VQA, GQA, TextVQA, OCR-VQA
- Captioning: COCO, TextCaps
- Grounding: RefCOCO, RefCOCO+, RefCOCOg
- Stage 3 (Instruction Tuning): Instructional data from LLaVA 1.5 (665k instruction samples).
Methodology Highlights
- Vision Encoder: Uses ViT-g/14 from EVA-CLIP with the last layer and class token discarded for more efficient feature extraction.
- Pool-Adapter (VL Connector): A novel pooler adapter:
- Retains locality of image representations using local pooling operations.
- Applies a 2-layer MLP to project pooled image features into the language space.
- LLM: Utilizes Vicuna-7B as the core LLM.
- Training Scheme:
- Stage 1: Pretraining with only the pool-adapter unfrozen, using captioning data.
- Stage 2: Multitask finetuning with VQA, captioning, and grounding data; ViT, pool-adapter, and QV matrices are unfrozen.
- Stage 3: Instruction tuning with the LLM fully finetuned; ViT remains frozen.
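A minimal sketch of the pool-adapter idea: locally pool the ViT patch grid so positional structure survives, then project into the LLM space with a 2-layer MLP. The dimensions and output grid size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolAdapter(nn.Module):
    """Locally pool the ViT patch grid to a smaller grid (keeping positions),
    then project into the LLM embedding space with a 2-layer MLP."""
    def __init__(self, vit_dim: int = 1408, llm_dim: int = 4096, out_grid: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.mlp = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vit_dim), num_patches = side * side
        b, n, d = patch_feats.shape
        side = int(n ** 0.5)
        grid = patch_feats.transpose(1, 2).reshape(b, d, side, side)
        pooled = self.pool(grid).flatten(2).transpose(1, 2)   # (batch, out_grid**2, vit_dim)
        return self.mlp(pooled)                               # (batch, out_grid**2, llm_dim)
```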
Training Details
- Optimizer: AdamW with cosine decay learning rate scheduler.
- Batch Sizes:
- Stage 1: 1024
- Stage 2: 512
- Stage 3: 128
- Learning Rates:
- Stage 1: 2e-4
- Stage 2: 1e-5
- Stage 3: 2e-5
- Steps: 80k / 40k / 10k (Note: Possible typo, might be 20k for Stage 1).
- Image Resolutions:
- Stage 1: 224
- Stage 2 & 3: 448
- Compute:
- Stage 1: 32×A800 GPUs
- Stage 2: 32×A800 GPUs
- Stage 3: 16×A800 GPUs
Follow-Up: Inf-MLLM2
In a tech report, Infly.AI has introduced Inf-MLLM2, a successor to Inf-MLLM. It lacks a full paper, but the tech report highlights several changes:
- Dynamic image resolution handling: A downscaled global image is encoded alongside upscaled local patches to handle varying resolutions, enabling the model to support inputs of up to 1344x1344 pixels.
- Enhanced OCR: The model handles document processing tasks well, which appears to be driven by a revised data mixture and support for larger image sizes.
2023-11: Monkey
Quick Notes
- Decoder-Only model
- Weights available: Non-commercial use only
- Chinese academic research
Key Links
- Arxiv: Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
- CVPR 2024: Monkey
- GitHub: Yuliang-Liu/Monkey
- Non-HF integrated checkpoints available on GitHub; non-commercial use only
Guiding Questions
What kind of paper are you reading?
- This is a model proposal paper that introduces “Monkey,” a large multimodal model (LMM) specifically designed to handle higher resolution images and generate enhanced text descriptions.
- It emphasizes technical improvements in image resolution handling and contextual description generation within multimodal frameworks, which are critical for applications like image captioning and visual question answering (VQA).
- It validates these improvements through extensive empirical evaluations across various vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of current large multimodal models, particularly their struggles with high-resolution inputs and complex scene understanding.
- High-resolution handling is essential because it improves the model’s ability to capture finer details, crucial for tasks requiring detailed image-text association.
- Monkey’s design enables higher-resolution inputs without costly retraining, making it a resource-efficient alternative with potential impacts across vision-language applications in fields like accessibility, AI-driven content creation, and enhanced human-computer interaction.
What is the relevant related work and what is this paper’s contribution?
- Related work includes recent advancements in LMMs, such as Qwen-VL, PaLI, and other vision-language models like BLIP-2, which attempt to address similar multimodal challenges.
- Existing models use large-scale datasets like LAION and COYO but often have limited success with high-resolution details due to simple captions and lower input resolutions.
- Contribution: Monkey introduces a patch-based approach for handling higher resolutions (up to 1344×896) and a multi-level description generation method that creates richer image-text pairs. This two-fold approach enhances contextual understanding in scene-text associations and improves performance across various vision-language tasks.
What are the results (Theory/Experiments)?
- The model theoretically optimizes large input resolution handling by dividing images into patches processed independently by a static visual encoder, incorporating LoRA adjustments to preserve resolution-specific details.
- Experiments: Extensive empirical results on 18 datasets show Monkey surpassing other LMMs in tasks like image captioning, general VQA, scene-text VQA, and document-oriented VQA.
- Monkey achieves significant performance improvements over models like GPT4V and Qwen-VL, particularly in dense-text VQA tasks.
- Ablation studies indicate the model’s success in leveraging high input resolution and LoRA modules effectively, showing clear benefits in accuracy and computational efficiency over traditional interpolation techniques.
Details of Note
Dataset
- Uses 1.44 million image-text pairs from publicly available datasets, covering tasks like Image Captioning, General VQA, Scene Text-centric VQA, and Document-oriented VQA.
- Includes 427k image-text pairs bootstrapped from CC3M using a multi-level description method to enrich captions, improving text-image associations.
- Example datasets include COCO for general image captions, VizWiz and OKVQA for VQA, and DocVQA and ChartQA for document-oriented VQA, among others.
Methodology Highlights
- High-Resolution Handling: Employs a sliding-window scheme to split images into patches and a LoRA-modified ViT encoder to support input resolutions up to 1344×896.
- Image Patch Strategy: Divides high-resolution images into uniform patches, supplemented by a global image version, with a static resampler across all patches and the global image.
- Multi-Level Description Generation: Leverages a multi-model approach using segmentation, captioning, and detection models (e.g., BLIP2, PPOCR, SAM, ChatGPT) to create detailed, contextual captions.
- Architecture Components:
- Vision Encoder: ViT-bigG from OpenCLIP for processing patches.
- LLM: Qwen-VL, interfaced with 256 learnable queries per crop for enhanced spatial and contextual understanding.
- LoRA: Applied with rank 16 for attention and 32 for MLP modules to update ViT weights efficiently.
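A small sketch of the image-patch strategy: cut the high-resolution input into uniform crops and append a downscaled global view, each of which is then encoded by the shared ViT/resampler. The 448-pixel crop size is an assumption for illustration.

```python
from PIL import Image

def split_image(img: Image.Image, crop: int = 448):
    """Resize to a multiple of the crop size, cut into uniform crops, and append
    a downscaled global view; every view is then encoded by the same ViT/resampler."""
    w = (img.width  // crop) * crop or crop
    h = (img.height // crop) * crop or crop
    resized = img.resize((w, h))
    crops = [resized.crop((x, y, x + crop, y + crop))
             for y in range(0, h, crop) for x in range(0, w, crop)]
    crops.append(img.resize((crop, crop)))     # global image, downscaled
    return crops                               # e.g. 3x2 crops + 1 global view for 1344x896
```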
Training Details
- Optimizer: AdamW with a peak learning rate of 1e-5, cosine decay, and 100-step warmup.
- Batch Size: 1024 with a weight decay of 0.1 to prevent overfitting.
- Compute Requirements: Training takes 40 A800 GPU days for a single epoch on 896×896 images.
- Model Size: Total parameters are 9.8B, combining 7.7B for the LLM, 1.9B for ViT, 117M for LoRA, and 90M for the resampler.
2023-11: mPLUG-Owl2
Quick Notes
- Decoder-Only model
- Weights available: Apache License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- CVPR 2024: mPLUG-Owl2
- GitHub: X-PLUG/mPLUG-Owl2
- Checkpoints not HF integrated, but available on the GitHub repo.
Guiding Questions
What kind of paper is this?
- This is a multi-modal large language model (MLLM) research paper.
- It proposes a new architecture (mPLUG-Owl2) focused on modality collaboration, balancing performance across text and multi-modal tasks.
What is the motivation and potential impact of the paper?
- The problem is modality interference, where prior models struggle to handle multiple modalities without one negatively affecting the other.
- The importance lies in developing a unified, generalist model that can excel at both text and multi-modal tasks without fine-tuning for specific tasks.
- The potential impact is a state-of-the-art model in multi-modal tasks, which could set a new standard for multi-modal large language models (MLLMs).
What is the relevant related work and what is this paper’s contribution?
- Traditional models (e.g., BLIP-2, MiniGPT-4) have used cross-modal alignment but struggled with modality interference.
- The contribution of mPLUG-Owl2 is modality collaboration, via a modality-adaptive module that maintains the integrity of both visual and text features while sharing parameters for cross-modality interaction.
What are the results (Theory/Experiments)?
- The paper introduces a modular network design using modality-adaptive modules to mitigate interference.
- Experiments: The model is evaluated on 8 vision-language benchmarks and achieves state-of-the-art performance on most, including Flickr30K, VQAv2, and others. It also excels in pure-text tasks (e.g., MMLU, AGIEval), indicating that modality collaboration enhances both text and multi-modal performance.
Details of Note
Dataset
- Pre-training Data: 400M image-text pairs randomly sampled from CC3M, CC12M, COCO, LAION-EN, COYO, DataComp.
- Instruction Data:
- Captioning: TextCaps, COCO.
- Visual Question Answering (VQA): VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA.
- Region-aware QA: RefCOCO, VisualGenome.
- Multi-modal Instruction Data: LLaVA-instruct-150K.
- Text-only Instruction Data: ShareGPT-80k, SlimOrca.
Methodology Highlights
Architecture:
- Vision Encoder: ViT-L/14 with a visual abstractor similar to the Perceiver Resampler; 6 layers with 64 learnable queries.
- Language Decoder: LLaMA-2-7B, modified for modality sensitivity in attention layers (modality-specific layer norms, modality-separated key and value matrices).
- Modality-Adaptive Module (MAM): Enhances cross-modality interaction while preserving unique characteristics of each modality.
Training Stages:
- Stage 1: Pre-training:
- Language decoder frozen except for modality-adaptation modules.
- Stage 2: Joint Instruction Tuning:
- Entire model unfrozen for tuning on both text and multi-modal instructions.
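A simplified sketch of the modality-adaptive idea from the Architecture bullets above: modality-specific layer norms and key/value projections with a shared query projection, so each modality keeps its own statistics while both interact through one attention operation. Names and the single-layer structure are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveProjection(nn.Module):
    """Modality-separated layer norm and key/value projections with a shared query."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.ModuleDict({"text": nn.LayerNorm(dim), "image": nn.LayerNorm(dim)})
        self.kv = nn.ModuleDict({"text": nn.Linear(dim, 2 * dim), "image": nn.Linear(dim, 2 * dim)})
        self.q = nn.Linear(dim, dim)            # shared query projection

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor):
        # hidden: (batch, seq, dim); is_image: (batch, seq) bool mask
        mask = is_image.unsqueeze(-1)
        normed = torch.where(mask, self.norm["image"](hidden), self.norm["text"](hidden))
        kv = torch.where(mask, self.kv["image"](normed), self.kv["text"](normed))
        k, v = kv.chunk(2, dim=-1)
        return self.q(normed), k, v             # fed into a standard shared attention
```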
Training Details
Stage 1: Pre-training:
- Steps: 42,500 with a batch size of 8,192 (total 348M image-text pairs).
- Optimizer: AdamW with a peak learning rate of 1e-4, 1k warmup steps, and cosine decay.
- Layer-wise Learning Rate Decay: 0.9 applied to vision encoder layers to preserve low-level visual features.
Stage 2: Instruction Tuning:
- Epochs: 1 with a batch size of 256.
- Learning Rate: 2e-5 with maintained layer-wise decay.
- Resolution Increase: Images upsampled from 224x224 to 448x448 for better OCR and fine-grained visual tasks.
2023-11: SPHINX
Quick Notes
- Decoder-Only model
- Weights available: Llama License
- Organization: Shanghai AI Lab
Key Links
- Arxiv: SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
- GitHub: Alpha-VLLM/LLaMA2-Accessory
Guiding Questions
What kind of paper is this?
- This is a multi-modal model development paper.
- It proposes a new architecture, SPHINX, for multi-modal large language models (MLLMs) that integrates visual and language understanding through a novel joint mixing approach.
What is the motivation and potential impact of the paper?
- The paper aims to enhance the capabilities of large language models (LLMs) to perform multi-modal tasks by better integrating vision and language data.
- It solves the problem of limited visual instruction-following capacity in existing MLLMs by introducing techniques for visual embedding, task mixing, and model weight integration.
- The impact is the development of a versatile model that excels across a wide range of vision-language tasks, including fine-grained visual understanding and multi-tasking.
What is the relevant related work and what is this paper’s contribution?
- Existing work includes multi-modal models like LLaVA, MiniGPT-4, and BLIP-2, which often freeze LLMs or use basic visual encoders.
- SPHINX contributes by unfreezing LLMs during pre-training, introducing weight mixing from real-world and synthetic data, task mixing for diverse visual tasks, and using multiple visual encoders for richer embeddings.
- It also addresses high-resolution image processing challenges, offering superior performance on various benchmarks.
What are the results (Theory/Experiments)?
- The main innovation is the joint mixing approach that combines visual embeddings, model weights, and tasks, aiming for enhanced multi-modal capabilities.
- Experiments: SPHINX achieves state-of-the-art results on multiple benchmarks (e.g., MMBench, VQAV2, RefCOCO) for tasks like visual question answering and human pose estimation.
- The model outperforms strong baselines on key tasks, demonstrating improved fine-grained visual perception and reasoning abilities.
Details of Note
Dataset
- Stage 1 (Pretraining):
- LAION-400M and LAION-COCO for vision-language alignment (image-caption data).
- RefinedWeb (text-only dataset) used to maintain text reasoning capabilities.
- Stage 2 (Finetuning):
- VQA: VQAv2, GQA, OKVQA, A-OKVQA, OCRVQA for visual question answering.
- Other tasks: Multi-object detection, relation reasoning, human pose estimation, document layout detection.
Methodology Highlights
- Two-stage training procedure:
- Stage 1 (Pretraining):
- Unfrozen LLM trained on real data, followed by fine-tuning on synthetic data (e.g., LAION-COCO).
- LLM weights from real and synthetic data are averaged (mixed) to balance both knowledge domains (see the sketch after this list).
- Visual encoders (e.g., CLIP, DINOv2, Q-Former) remain frozen during this stage.
- Stage 2 (Visual instruction tuning):
- Mixture of tasks for multi-purpose capabilities, such as VQA, multi-object detection, region-level understanding, and text-based reasoning.
- Each image is encoded as five sub-images (four cropped at 224x224, one low-res global), processed independently to improve scaling and fine-grained understanding.
- Stage 1 (Pretraining):
Training Details
- Pretraining (Stage 1):
- Optimizer: AdamW with 5e-5 peak learning rate, 2k warmup steps, and 180k steps with cosine annealing to 5e-6.
- Weight decay: 0.1.
- Pretraining time: 125 hours on 32 A100 GPUs for a 7B model (doubled for 13B).
- Finetuning (Stage 2):
- Batch size: 128.
- Similar optimizer settings as pretraining.
- Finetuning time: 38 hours on 16 A100 GPUs for a 13B model.
2023_12_cogagent
2023-12: CogAgent
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Arxiv: CogAgent: A Visual Language Model for GUI Agents
- CVPR 2024: Conference Paper
- GitHub: THUDM/CogVLM
- HuggingFace checkpoints:
Guiding Questions
What kind of paper are you reading?
- This is a “Big idea” paper that proposes and evaluates a new Visual Language Model (VLM), called CogAgent.
- It specifically aims to advance GUI-based agents using high-resolution image encoding and cross-modality for enhanced performance on both general VQA and GUI-specific tasks.
- The paper evaluates CogAgent’s versatility across multiple benchmarks, positioning it as a state-of-the-art generalist VLM.
What is the motivation and potential impact of the paper?
- The motivation stems from the need for large language models (LLMs) to interact effectively with GUIs, which is a limitation in current text-only models.
- GUIs are ubiquitous in digital device interactions, and a VLM that can navigate these interfaces autonomously could enhance automation across various tasks, such as web navigation and mobile app usage.
- CogAgent’s ability to understand and act within GUIs could reduce the need for human intervention in routine digital tasks, potentially impacting personal productivity and digital accessibility.
What is the relevant related work and what is this paper’s contribution?
- Existing models like AutoGPT and WebShop have demonstrated limited capabilities in combining GUI visual signals with LLMs but often fall short in generalist performance or require substantial pre-processing.
- CogAgent advances these by integrating low- and high-resolution image encoders, enabling it to outperform text-extracted methods on GUI and VQA tasks.
- The primary contribution is CogAgent’s architecture, which includes a high-resolution cross-module for GUIs, allowing the model to process tiny text and elements effectively.
- The authors also created a large, specialized dataset for GUI understanding, making CogAgent robust across multiple benchmarks.
What are the results (Theory/Experiments)?
- The paper introduces a novel high-resolution cross-attention module, balancing computational efficiency and high-resolution input processing. This module maintains CogAgent’s usability across GUI and VQA tasks without excessive FLOP increases.
- Experiments: CogAgent is evaluated across several VQA benchmarks and GUI-focused datasets (Mind2Web, AITW), demonstrating state-of-the-art results, particularly in text-rich VQA and GUI navigation tasks.
- Performance: It surpasses generalist models in benchmarks such as VQAv2, OK-VQA, and specific GUI datasets, showing 11.6%–16.5% improvements over comparable language-based agents.
Details of Note
Dataset
Text Recognition:
- Synthetic Text Data: Generated 80M synthetically rendered images with diverse font styles, colors, orientations, and LAION-2B backgrounds.
- OCR on Natural Images: 18M images with OCR annotations extracted from COYO and LAION-2B datasets.
- Academic Documents: 9M academic-style images processed with similar augmentation techniques to the Nougat dataset, featuring text, formulas, and tables.
Visual Grounding:
- Collected 40M image-caption pairs from LAION-115M, associating entities with bounding boxes.
GUI Imagery:
- GUI REG (Referring Expression Generation): Generates HTML code for DOM elements based on specified screenshot areas.
- GUI REC (Referring Expression Comprehension): Creates bounding boxes for DOM elements on screenshots.
- Built the CCS400K dataset from 400k Common Crawl screenshots and 140M GUI-based REC and REG tasks.
Methodology Highlights
Architecture:
- Employs CogVLM-17B as the base VLM, enhanced with a new high-resolution cross-attention module for GUI tasks.
- Integrates a smaller EVA2-CLIP-L image encoder (300M parameters) for 1120x1120 high-resolution images, retaining key visual details.
- High- and low-resolution branches allow for efficient processing; high-resolution input captures text and small elements, with cross-attention at each decoder layer.
High-Resolution Cross-Module:
- Maintains efficiency by reducing hidden sizes for text features, optimizing computation with a lower-dimensional cross-attention setup.
- Operates on 14x14 patches, balancing image clarity and computational load for GUI screens up to 1120x1120 pixels.
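A rough sketch of the high-resolution cross-attention idea: at each decoder layer, text hidden states attend to high-resolution image features through a cross-attention block with a reduced hidden size to keep FLOPs low. Dimensions and the single-head formulation are illustrative, not the paper's exact design.

```python
# Sketch: cross-attention from text hidden states to high-res image features,
# computed at a reduced width to limit FLOPs. Dimensions are illustrative.
import torch
import torch.nn as nn

class HighResCrossAttention(nn.Module):
    def __init__(self, text_dim: int = 4096, hires_dim: int = 1024, cross_dim: int = 256):
        super().__init__()
        # Project down to a small cross-attention width.
        self.q_proj = nn.Linear(text_dim, cross_dim)
        self.k_proj = nn.Linear(hires_dim, cross_dim)
        self.v_proj = nn.Linear(hires_dim, cross_dim)
        self.out_proj = nn.Linear(cross_dim, text_dim)

    def forward(self, text_h: torch.Tensor, hires_feats: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(text_h)                  # (B, T, cross_dim)
        k = self.k_proj(hires_feats)             # (B, P, cross_dim)
        v = self.v_proj(hires_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return text_h + self.out_proj(attn @ v)  # residual update of the text states
```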
Training Details
Pre-Training:
- Conducted 60k steps with a batch size of 4,608 and learning rate of 2e-5.
- Freezing Strategy: Only the high-resolution cross module was trained for the first 20k steps. For the remaining 40k steps, the visual expert was unfrozen for further fine-tuning.
- Curriculum learning: Pre-trained sequentially, progressing from synthetic text recognition tasks to document-style OCR, grounding tasks, and finally GUI-based tasks.
Evaluation Benchmarks:
- VQA Tasks: Assessed on VQAv2, OK-VQA, TextVQA, OCR-VQA, DocVQA, InfoVQA, and ChartQA.
- GUI Interfaces: Tested on the Mind2Web and AITW datasets, covering desktop and Android interface navigation tasks.
2023_12_llava_grounding
2023-12: LLaVA-Grounding
Quick Notes
- Decoder-Only model
- Weights available: CC-BY-NC-4.0 License
- Organization: Chinese university research
Key Links
- Arxiv: LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
- ECCV 2024: Conference Paper
- GitHub: UX-Decoder/LLaVA-Grounding
- HuggingFace checkpoints:
- Haozhangcx/llava_grounding_gd_vp
- Not HF-integrated, but available on the GitHub repo.
Guiding Questions
What kind of paper is this?
- This is a model and dataset paper that introduces a new large multimodal model (LLaVA-Grounding) and a specialized dataset (GVC) for grounded visual chat.
- It contributes a new benchmark (Grounding-Bench) and presents improvements in integrated chat and visual grounding capabilities in LMMs.
What is the motivation and potential impact of the paper?
- Motivation: Current large multimodal models (LMMs) struggle to perform visual grounding and chat together effectively; existing datasets and models either focus on one aspect or are limited by the lack of fine-grained, grounded data.
- Potential Impact: By combining grounded chat and grounding capabilities, this model could enable more effective multimodal interactions across applications that require detailed image understanding and conversational skills (e.g., AI assistants for complex visual tasks).
What is the relevant related work and what is this paper’s contribution?
- Related Work: Previous LMMs like LLaVA, miniGPT-4, and CogVLM focused on visual chat but lacked robust grounding capabilities. Specialized grounding models (e.g., Kosmos-2, Shikra) often require distinct prompting and cannot handle chat coherently.
- Contributions: The paper introduces:
- A dataset of 150K grounded visual chat instances (GVC).
- The LLaVA-Grounding model, which integrates a language model with a grounding model, supporting pixel-level and object-level grounding in visual chat.
- A benchmark (Grounding-Bench) for evaluating grounded visual chat, with new metrics like grounded recall and precision.
What are the results (Theory/Experiments)?
- The paper proposes a novel method to connect LMM features with a grounding model (OpenSeeD) and evaluates it on the new Grounding-Bench, emphasizing object-level and pixel-level grounding capabilities.
- Experiments:
- On Grounding-Bench: LLaVA-Grounding outperformed other LMMs in both chat and grounding capabilities.
- On classic benchmarks: Achieved competitive performance on RefCOCO/+/g and Flickr30K for phrase grounding and referring expression tasks.
- Ablations: Demonstrated the model’s effectiveness across grounding types, prompt types (e.g., boxes, clicks), and in handling the trade-off between chat and grounding tasks.
Details of Note
Dataset
- Grounded Visual Chat (GVC) Data: 150K instances labeled with human annotations from COCO, grounded in conversational context using GPT-4. Ensures noun-phrase references match segmentation masks for accuracy.
- Stage 1 Data: Combines datasets (e.g., RefCOCO, Flickr30K, Visual Genome) with alignment pairs from COCO for visual and language alignment tasks.
- Stage 2 Data: Instruction set and GVC data without visual prompts, leveraging conversational context for grounding-specific tuning.
- Stage 3 Data: Enhanced dataset for visual prompts (boxes, clicks, scribbles) with specific instructions for more interactive grounded visual chat.
Methodology Highlights
- Pipeline: Utilizes a three-stage approach to pretrain, align, and tune grounding and visual chat tasks.
- Stage 1: Pretraining aligns visual and language embeddings using a projection layer to connect the language model’s output with the grounding model (see the sketch after this list).
- Stage 2: Fine-tunes for grounding chat without updating the vision and prompt encoders.
- Stage 3: Adds visual prompting, training only the prompt encoder or updating language decoder for marked instances.
- Grounding Model: OpenSeeD model enables both bounding box and pixel-level grounding, enhancing the model’s ability to perform fine-grained visual localization.
- Prompt Encoder: Uses a Semantic SAM model to convert visual prompts into language-compatible embeddings, allowing the model to interpret diverse prompt types.
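A minimal sketch of how language-model outputs might be connected to the grounding model via the Stage 1 projection layer: hidden states of grounded phrases are projected into an object-query space and handed to the grounding model. The masking scheme and dimensions are assumptions; OpenSeeD's real interface differs.

```python
# Sketch: project LLM hidden states of grounded phrases into the grounding
# model's query-embedding space. Interfaces and shapes are hypothetical.
import torch
import torch.nn as nn

class GroundingConnector(nn.Module):
    def __init__(self, llm_dim: int = 4096, query_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, query_dim)

    def forward(self, llm_hidden: torch.Tensor, grounding_token_mask: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, T, llm_dim); grounding_token_mask: (B, T) bool
        # Gather the hidden states that correspond to grounded phrases and
        # project them; the result would serve as object queries downstream.
        return self.proj(llm_hidden[grounding_token_mask])   # (N, query_dim)
```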
Training Details
- Stage 1: Vision-language (VL) projection and grounding model updated with learning rates of `1e-4`, aligning the visual encoder and grounding model with language model outputs.
- Stage 2: Instruction tuning for grounded responses; prompt encoder and CLIP vision encoder are frozen, with other layers fine-tuned at `2e-5` for language layers and `1e-4` for grounding.
- Stage 3: Visual prompting (e.g., marks) trained by fine-tuning the prompt encoder and projection layers, with additional support for specific image regions through Set-of-Mark prompts.
- Hyperparameters: Gradual learning rate decay, weight decay set to `0.0`, `bf16` and `tf32` precision for faster, stable training on compatible hardware.
2024_01_llava_next
2024-01: LLaVA-NEXT
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Blog: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
- Same codebase as LLaVA, with updated weights
Guiding Questions
What kind of paper are you reading?
- This is a system and enhancement paper describing LLaVA-NeXT, an improved large multimodal model (LMM).
- It presents design advancements over LLaVA-1.5 in areas such as reasoning, OCR, and handling high-resolution images.
- It also outlines technical improvements and evaluates performance against other state-of-the-art (SoTA) models on specific benchmarks.
What is the motivation and potential impact of the paper?
- Motivation: To enhance the LLaVA model’s visual reasoning, OCR, and general world knowledge, addressing limitations in existing multimodal models, especially for visual detail and zero-shot scenarios.
- Impact: LLaVA-NeXT offers improved performance on multiple benchmarks and enables efficient deployment and inference. Its open-source release aims to support further research and development in multimodal AI, benefiting areas requiring robust visual-linguistic integration.
What is the relevant related work and what is this paper’s contribution?
- Related Work: Builds on LLaVA-1.5, which set a baseline in multimodal performance. Other referenced models include Gemini Pro, Qwen-VL-Plus, and CogVLM, each contributing distinct multimodal capabilities.
- Contribution: LLaVA-NeXT surpasses prior models in resolution, reasoning, and deployment efficiency. It introduces dynamic high-resolution support, a refined visual instruction dataset, and compatibility with multiple LLM backbones, achieving SoTA performance in both English and zero-shot Chinese tasks.
What are the results (Theory/Experiments)?
- Experiments:
- LLaVA-NeXT performs strongly across several benchmarks, matching or surpassing proprietary models like Gemini Pro in specific categories.
- Zero-shot Chinese performance is notable, achieving SoTA results without dedicated Chinese training data.
- Model versions (7B, 13B, 34B) achieve varying levels of performance, with high-resolution image support significantly improving OCR and reasoning capabilities.
- Success Metrics:
- The model’s success is defined through SoTA performance on a suite of multimodal benchmarks, including OCR-specific tasks like TextVQA and overall visual reasoning with Math-Vista.
- Efficiency metrics are notable, with low training cost achieved on 32 A100 GPUs over ~1 day.
Details of Note
Dataset
- Image-Text Pairs: 1.3M visual instruction-tuning samples.
- Data Sources:
- GPT-V Data: LAION-GPT-V, ShareGPT-4V for enhanced instruction-following and visual conversation.
- Specialized Datasets: Added ChartQA, DVQA, AI2D for improved chart/diagram understanding, and DocVQA, SynDog-EN to enhance OCR capabilities.
- Data Filtering: TextCaps removed (overlap with TextVQA); sensitive data removed to ensure privacy and safety.
- Zero-Shot Capability: Demonstrates strong zero-shot Chinese language support, despite training predominantly on English data.
Methodology Highlights
- Dynamic High Resolution: Introduced ‘AnyRes’ technology, supporting variable resolutions (up to 4x more pixels), enhancing fine visual detail through grid splitting/merging (e.g., 2×2, 3×1 configurations); see the sketch after this list.
- Data Mixture: Improved data diversity through high-quality visual instruction samples, designed to capture varied user intents and real-world applications.
- LLM Backbones: Models use Vicuna-1.5 (7B, 13B), Mistral-7B, and Yi-34B, enabling flexible deployment options and compatibility with larger multimodal tasks.
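A simplified sketch of ‘AnyRes’-style dynamic high resolution: the image is resized onto a grid of base-resolution tiles that are encoded separately, alongside a low-resolution global view. Grid selection and resizing rules here are simplified assumptions rather than the exact implementation.

```python
# Sketch: split an image into a grid of base-resolution tiles plus a coarse
# global view; each view is then encoded by the ViT independently.
from PIL import Image

def anyres_tiles(img: Image.Image, base: int = 336, grid=(2, 2)):
    rows, cols = grid
    resized = img.resize((cols * base, rows * base))
    tiles = [
        resized.crop((c * base, r * base, (c + 1) * base, (r + 1) * base))
        for r in range(rows) for c in range(cols)
    ]
    global_view = img.resize((base, base))   # coarse view of the whole image
    return tiles + [global_view]

# Usage: views = anyres_tiles(Image.open("screenshot.png"))
```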
Training Details
- Training Hardware:
- LLaVA-NeXT-7B: 8xA100 GPUs, ~20 hours.
- LLaVA-NeXT-13B: 16xA100 GPUs, ~24 hours.
- LLaVA-NeXT-34B: 32xA100 GPUs, ~30 hours.
- Efficiency: Minimalist approach maintains data efficiency of LLaVA-1.5 with low GPU-hour cost, using less than 1M samples in the visual instruction phase.
- Performance Benchmarking: Achieves state-of-the-art performance across multiple multimodal tasks with optimized training time and compute costs.
2024_01_seeclick
2024-01: SeeClick
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Shanghai AI Research
Key Links
- Arxiv: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- ICLR 2024 Workshop on LLM Agents: Conference Paper
- GitHub: njucckevin/SeeClick
- HuggingFace checkpoints:
Guiding Questions
What kind of paper are you reading?
- This is a research paper presenting a new model, benchmark, and evaluation framework for visual GUI agents.
- It proposes SeeClick, a visual agent designed for GUI tasks, and a benchmark called ScreenSpot for evaluating GUI grounding across multiple platforms.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of existing GUI agents that rely on structured text, which is often inaccessible and verbose.
- SeeClick’s approach, using visual GUI grounding on screenshots, could significantly expand the accessibility and efficiency of GUI agents across devices.
- Potential impacts include advancing GUI automation for platforms where structured text is unavailable, leading to more adaptable, universal agents.
What is the relevant related work and what is this paper’s contribution?
- Related work includes text-based GUI agents (e.g., using HTML), GUI agents trained with LLMs, and previous attempts at visual GUI grounding.
- Key contributions:
- Developing SeeClick, a visual GUI agent that relies solely on screenshots, removing dependency on structured text.
- Introducing ScreenSpot, a benchmark to evaluate GUI grounding performance across mobile, desktop, and web GUIs.
- Demonstrating that GUI grounding pre-training improves agent performance on diverse GUI tasks.
What are the results (Theory/Experiments)?
- The authors establish GUI grounding as a critical capability, allowing LVLMs to interpret and locate elements on screen using screenshots.
- Experiments:
- ScreenSpot results show SeeClick outperforming baseline models in GUI grounding, especially on non-text elements.
- On MiniWob, AITW, and Mind2Web tasks, SeeClick demonstrates superior performance, especially in environments with dynamic layouts or varied interfaces.
- Overall, results confirm that GUI grounding significantly boosts task success, validating the model’s approach and pre-training method.
Details of Note
Dataset
- SeeClick Dataset: Composed of ~300k web screenshots, extracting visible text content and “title” attributes as grounding cues.
- Mobile Data: Utilizes RICO dataset’s widget captioning and screen summarization, along with auto-generated RICO SCA data.
- General Data: Incorporates instruction-following data collected via GPT-4 from the LLaVA dataset for maintaining general vision-language capabilities.
- ScreenSpot Benchmark: A curated, human-annotated dataset for GUI grounding across mobile, desktop, and web interfaces.
Methodology Highlights
- Model Architecture: Built on QwenVL, a vision-language model with an integrated Vision Transformer (ViT).
- GUI Grounding Pre-training: Trained to locate elements in GUI screenshots by predicting action locations with floating-point coordinates (e.g., `click (0.12, 0.54)`); see the sketch after this list.
- Data Curation: Automates collection of GUI grounding data from web, mobile, and general datasets, creating diverse and context-rich training samples.
- Task Adaptation: SeeClick was tested and fine-tuned for three downstream tasks: MiniWob (simplified web tasks), AITW (Android interactions), and Mind2Web (web navigation tasks).
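A small sketch of how a GUI grounding target could be serialized for the click-prediction objective above: the center of an element's bounding box is normalized to [0, 1] and rendered as a click-action string. The exact prompt and answer templates are assumptions, not SeeClick's.

```python
# Sketch: turn an element's pixel bounding box into a normalized click target.
def click_target(bbox, img_w, img_h) -> str:
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 / img_w   # normalized x of the box center
    cy = (y0 + y1) / 2 / img_h   # normalized y of the box center
    return f"click ({cx:.2f}, {cy:.2f})"

# e.g. click_target((100, 520, 140, 560), 1000, 1000) -> "click (0.12, 0.54)"
```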
Training Details
- Training Process: Conducted over 10k steps (1 epoch), leveraging the LoRA technique to fine-tune both ViT and the language model.
- Optimizer: Used AdamW with a learning rate of 3e-5 and a global batch size of 64.
- Hardware: Training completed in approximately 24 hours on 8 NVIDIA A100 GPUs.
2024_02_llavar
2024-02: LLaVAR
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Adobe
Key Links
- Arxiv: LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
- GitHub: SALT-NLP/LLaVAR
- HuggingFace checkpoints:
Guiding Questions
What kind of paper is this?
- This is a methodology paper with evaluation that proposes and validates an enhancement to an existing model pipeline, specifically for visual instruction tuning with text-rich images.
- It enhances the LLaVA model to better comprehend text within images through refined instruction-following data collection and multimodal tuning.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge that current visual instruction-tuned models struggle with reading and understanding text in images, limiting real-world usability in contexts with text-heavy visual content.
- The impact of this work is significant for applications requiring multimodal understanding, like text-based visual question answering (VQA) and online content interpretation, as it enhances user interaction capabilities by enabling accurate text comprehension in images.
What is the relevant related work and what is this paper’s contribution?
- Related work includes instruction-tuned language models (e.g., GPT-4) and multimodal models like CLIP and LLaVA that incorporate visual encoders for image processing but struggle with text comprehension.
- The paper’s primary contribution is LLaVAR, an augmented model that combines data from OCR and GPT-4 to improve the text comprehension capability in visual instruction tuning. It provides a new data collection pipeline and demonstrates substantial improvements in text-rich image interpretation.
What are the results (Theory/Experiments)?
- The model is designed to improve visual and textual data alignment through dual-stage training (pre-training with noisy OCR data and fine-tuning with high-quality GPT-4 data).
- Experiments: LLaVAR is evaluated on four text-based VQA datasets and performs better than baseline models on these tasks, especially at higher resolutions, demonstrating significant accuracy improvements across multiple datasets like ST-VQA, OCR-VQA, and DocVQA.
- The paper includes GPT-4-based instruction-following evaluations on natural and text-rich images, with LLaVAR showing gains in nuanced interaction skills and real-world applicability.
Details of Note
Dataset
- Pre-training Data: Augments LLaVA’s pre-training dataset with 422K text-rich images from the LAION-5B dataset, filtered and selected for high OCR relevance, covering diverse categories like posters, book covers, advertisements, and educational materials.
- Instruction Tuning Data: Adds 16K high-quality instruction-following examples generated by GPT-4, using OCR results and captions to improve LLaVA’s text understanding in images.
- Data Filtering and Clustering: To enhance quality, images were categorized using CLIP-based features and manual inspection, ultimately selecting clusters that maximized text-richness.
Methodology Highlights
- Instruction Augmentation: Integrates OCR results and GPT-4 responses to craft realistic, text-focused questions and answers for improved text comprehension.
- Data Pipeline: Implements a two-tier data collection strategy—first generating noisy OCR-based responses, then enhancing with GPT-4–based high-quality data for specific instruction tuning.
- Resolution Scaling: Increases input resolution from 224² to 336², which improves the model’s ability to recognize small and fine text details, as shown in experiments.
Training Details
- Two-Stage Training:
- Pre-training: LLaVA’s original model is pre-trained on combined noisy OCR and standard instruction data to align visual and language features, with only the projection layer trainable.
- Fine-tuning: Finetuning on high-quality GPT-4 data with the visual encoder frozen but both the language decoder and projection layer trainable.
- Model Architecture: Based on LLaVA with a visual encoder (CLIP-ViT-L/14) and Vicuna-13B as the language decoder; enhances the model’s instruction-following capacity with added cross-attention for high-res encoding.
- Evaluation: Benchmarked on four VQA datasets (ST-VQA, OCR-VQA, TextVQA, DocVQA) and GPT-4-based evaluation for text comprehension within real-world images.
2024_02_screenai
2024-02: ScreenAI
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
- Arxiv: ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Published at IJCAI 2024. Main Track
Guiding Questions
What kind of paper are you reading?
- This is a small idea with evaluation paper.
- It introduces ScreenAI, a vision-language model for understanding UIs and infographics, and evaluates its performance through ablation studies and comparisons with state-of-the-art (SoTA) models.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge of understanding complex screen UIs and infographics, which share visual language but are difficult to model due to their diverse designs.
- The impact is significant for fields requiring automated UI understanding, such as UI navigation, infographics comprehension, and screen-based question-answering (QA).
What is the relevant related work and what is this paper’s contribution?
- Existing works in screen-based UI models focus on narrow tasks like icon detection, whereas generalist foundation models excel in broad multimodal tasks.
- The paper contributes ScreenAI, which extends PaLI and pix2struct architectures, introduces a screen annotation task, and automates dataset generation using LLMs. It also releases three new datasets and achieves state-of-the-art results in several tasks.
What are the results (Theory/Experiments)?
- The paper introduces a novel architecture that combines PaLI’s multimodal encoder-decoder with pix2struct’s adaptive patching mechanism.
- Experiments:
- ScreenAI achieves new SoTA on several benchmarks like Multipage DocVQA, WebSRC, MoTIF, and excels in tasks like ChartQA and InfographicVQA.
- The model’s performance improves with larger sizes, and ablation studies confirm the benefits of the pix2struct patching strategy and using LLM-generated data during pre-training.
Details of Note
Dataset
- 353M autogenerated screen schemas: Includes OCR, object detection, classification, and image/icon captioning.
- 38.6M Q&A samples: Added to address challenges with arithmetic, counting, and understanding complex graphics.
- 15.9M navigation samples: Instructions like “click BOX” for UI navigation.
- 13.2M screen summarizations: Summaries of what’s shown on the screen.
- 3M chart-to-table translation samples: Converting graphical data to tabular formats.
- 1M other data: From PaLI, used for additional text and Q&A tasks.
Methodology Highlights
- Screen Schema: UI type + short caption or OCR text + bounding box (e.g., BUTTON 800 100 900 150); supports nested elements (e.g., TEXT in a BUTTON); see the sketch after this list.
- Aspect-ratio preserving patching: For flexible handling of various screen shapes and resolutions, drawn from pix2struct.
- Data generation: Combines object detection, OCR, and image captioning for generating pretraining data; supplemented with LLM-enhanced Q&A and summarization.
- Task-specific fine-tuning: Conducted on human-annotated data for tasks like Q&A, navigation, and summarization.
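A hypothetical sketch of serializing UI elements into the screen-schema text described above (UI type, caption or OCR text, bounding box, with nesting). The data structure and exact output format are illustrative assumptions.

```python
# Sketch: serialize (type, text, bbox, children) tuples into a schema string.
def serialize(element) -> str:
    kind, text, (x0, y0, x1, y1), children = element
    parts = [kind] + ([text] if text else []) + [str(v) for v in (x0, y0, x1, y1)]
    line = " ".join(parts)
    if children:
        line += " (" + " ".join(serialize(c) for c in children) + ")"
    return line

button = ("BUTTON", "Submit", (800, 100, 900, 150), [])
card = ("CARD", "", (780, 80, 920, 170), [button])
print(serialize(card))
# -> CARD 780 80 920 170 (BUTTON Submit 800 100 900 150)
```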
Training Details
- Pre-training: Uses a mixture of self-supervised learning on autogenerated datasets (e.g., screen schemas) and LLM-generated data.
- Fine-tuning: On specific tasks like Q&A, navigation, and summarization, using both human-annotated and machine-generated data.
- Model sizes: Trained models include 670M, 2B, and 5B parameters; the 5B model starts from PaLI-3 multimodal checkpoints.
- Freezing the ViT encoder: During fine-tuning, the ViT encoder is frozen, focusing training on the language model (e.g., mT5 or UL2 depending on size).
2024_02_sphinx-x
2024-02: SPHINX-X
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License, but can’t find them in the official repo
- Organization: Shanghai AI Lab
Key Links
- Arxiv: SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
- Poster @ ICML 2024
- GitHub: Alpha-VLLM/LLaMA2-Accessory
Guiding Questions
What kind of paper are you reading?
- Small idea paper
- This is an introduction of some enhancements made to the SPHINX model and evaluation of the model’s performance across various vision-language tasks to demonstrate the improvements.
What is the motivation and potential impact of the paper?
- The motivation is to scale up the SPHINX model, make it easier to train, and reduce its complexity while speeding up training and inference.
- The potential impact is to make the SPHINX model more accessible and efficient for researchers and practitioners working on vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes the original SPHINX model and other multi-modal models like LLaVA, MiniGPT-4, and BLIP-2.
- Very minor contributions from this paper, detailed below.
What are the results (Theory/Experiments)?
- Better performance on various vision-language tasks compared to the original SPHINX model.
Details of Note
Dataset
Methodology Highlights
- Two of the four original vision encoders are dropped in favor of the two complementary and better-performing encoders.
- They use a trick to skip visual tokens constructed entirely from padded values; this is only really relevant for images with wide or tall aspect ratios. It involves a learned skip token that replaces runs of padding tokens, tightening the sequence length while still preserving relative positioning (see the sketch after this list).
- They ditch a two-stage training pipeline in favor of a single stage.
- Other changes are really just including more and different data, other models as LLM backbones, etc.
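A rough sketch of the padding-skip idea: visual tokens that come entirely from padded regions are collapsed into a single learned skip token, shortening the sequence while keeping a positional placeholder. The run-collapsing behaviour shown here is an assumption about how the mechanism works.

```python
# Sketch: collapse runs of fully-padded visual tokens into one learned token.
import torch
import torch.nn as nn

class PaddingSkipper(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.skip_token = nn.Parameter(torch.zeros(dim))

    def forward(self, tokens: torch.Tensor, is_padding: torch.Tensor) -> torch.Tensor:
        # tokens: (seq, dim); is_padding: (seq,) bool, True where a token is all padding
        out, in_pad_run = [], False
        for tok, pad in zip(tokens, is_padding):
            if pad:
                if not in_pad_run:          # emit one skip token per padded run
                    out.append(self.skip_token)
                in_pad_run = True
            else:
                out.append(tok)
                in_pad_run = False
        return torch.stack(out)
```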
Training Details
2024_03_mm1
2024-03: MM1
Quick Notes
- Decoder-Only model
- No weights available yet; proprietary model
- Organization: Apple
Key Links
Guiding Questions
What kind of paper is this?
- This is a research paper focused on building and improving Multimodal Large Language Models (MLLMs).
- It covers empirical evaluations of architectural components and data configurations to achieve optimal few-shot performance across benchmarks.
What is the main motivation for the research presented?
- The paper aims to address the lack of transparency and comprehensive design guidance in developing performant MLLMs.
- By studying architecture components and pre-training data configurations, the authors hope to establish guiding principles that outlast specific model implementations.
- They emphasize the importance of systematic ablation studies to improve few-shot and zero-shot multimodal performance.
What methodology do the authors employ?
- The authors use a series of ablations across different architectures, visual encoders, vision-language connectors, and data mixes to determine optimal configurations.
- They pre-train smaller models for initial tests, then scale up to larger models to confirm findings, using pre-training data that mixes image-caption pairs, interleaved image-text, and text-only documents.
- For supervised fine-tuning (SFT), they collect diverse vision-language datasets and use specialized training recipes, including varying image resolutions and model scales up to 64B parameters.
What are the key findings or insights?
- Image resolution and the number of visual tokens significantly impact model performance, with the architecture of the vision-language connector having a minor role.
- Interleaved image-text data is crucial for few-shot and text-only performance, while caption data enhances zero-shot performance.
- Synthetic caption data (VeCap) improves few-shot learning, showing a 2–4% increase in performance across tasks.
- Their final model family, MM1, outperforms or matches state-of-the-art multimodal models in few-shot learning, maintaining robust language understanding capabilities after pre-training.
How do the authors validate their approach?
- They benchmark pre-trained MM1 models on captioning and VQA tasks, comparing their performance to other multimodal models such as Flamingo, IDEFICS, and Emu2.
- Supervised fine-tuning (SFT) is evaluated across 12 multimodal benchmarks to validate few-shot performance gains.
- The authors include both quantitative and qualitative evaluations to demonstrate capabilities like counting objects, performing OCR, and multi-image reasoning.
- They also employ qualitative examples to show MM1’s capacity for chain-of-thought reasoning and multi-image contextual understanding.
What contribution does this paper make to its field?
- This paper provides empirical design guidelines for developing efficient MLLMs, offering a “recipe” for configuring data types, image resolutions, and encoder-decoder connections.
- It introduces MM1, a family of MLLMs that achieve state-of-the-art performance in few-shot multimodal tasks and presents a scalable pre-training approach for future large multimodal models.
- The insights on architecture and data configurations establish foundational lessons that may influence future multimodal model training, extending the body of knowledge on multimodal pre-training.
Details of Note
Dataset
- Pre-training Data Composition:
- 45% captioned images
- 45% image-text interleaved documents
- 10% text-only data
- Special Data Types:
- Synthetic Data: Boosts few-shot performance
- Interleaved Data: Essential for few-shot and text-only performance
- Captioning Data: Critical for zero-shot performance
Methodology Highlights
- Base Model Architecture:
- Image Encoder: ViT-L/14 and ViT-H, pre-trained on DFN-5B and VeCap-300M datasets at high resolutions (336x336 and 378x378).
- Vision-Language (VL) Connector: Convolutional abstractor with 144 image tokens; tested alternative VL connectors like average pooling and attention pooling.
- Language Model: 1.2B parameter decoder-only LM for core processing.
- Pre-training Ablations:
- Image Resolution & Model Size: Key factors impacting performance, with higher resolutions and larger models yielding better results.
- Data Mix: Interleaved image-text data boosts text-only and few-shot capabilities; captioning data aids zero-shot.
- Supervised Fine-Tuning:
- Uses ~1M SFT samples mined from pre-training data; all model parameters unfrozen during fine-tuning.
Training Details
- Batch Size: 512
- Sequence Length: 4096 tokens, accommodating up to 16 images per sequence
- Training Steps: 200k steps, equating to 100B tokens processed
- Resolution Scaling Techniques:
- Positional embedding interpolation and sub-image decomposition for higher-resolution adaptation
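A minimal sketch of positional-embedding interpolation for adapting a ViT to higher input resolution: the learned 2D position grid is resized with bilinear interpolation. The no-class-token layout and the grid sizes in the example are assumptions for illustration.

```python
# Sketch: bilinearly resize a ViT's learned 2D positional-embedding grid.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    # pos_embed: (old_grid * old_grid, dim) -> (new_grid * new_grid, dim)
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)

# e.g. going from 24x24 patches (336 / 14) to 27x27 patches (378 / 14):
# new_pe = interpolate_pos_embed(old_pe, 24, 27)
```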
2024_03_text_monkey
2024-03: TextMonkey
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Arxiv: TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
- GitHub: Yuliang-Liu/Monkey
Guiding Questions
What kind of paper are you reading?
- This is a large multimodal model (LMM) development paper with a specific focus on document understanding, without relying on OCR.
- It presents a new model, TextMonkey, optimized for text-centric tasks such as document analysis and scene text spotting.
What is the motivation and potential impact of the paper?
- Motivation: Existing OCR-dependent models create limitations in document processing by adding complexity and often misalign text and visual context, leading to error accumulation.
- Impact: TextMonkey aims to streamline document analysis by integrating cross-modal understanding without OCR, potentially setting new standards in document processing for both industry and academic applications.
What is the relevant related work, and what is this paper’s contribution?
- Related Work: Traditional approaches rely heavily on OCR to extract text, but OCR introduces errors and extra engineering needs, especially in complex document layouts.
- Contribution: TextMonkey eliminates OCR reliance, proposing a novel attention mechanism (Shifted Window Attention) for cross-window context. Additionally, it introduces token compression strategies to handle high-resolution images and complex tasks like text grounding and structured text extraction.
What are the results (Theory/Experiments)?
- The paper’s theoretical contributions include Shifted Window Attention with zero initialization to stabilize training and token resampling to reduce redundancy.
- Experiments: TextMonkey outperforms prior models across 12 benchmarks, achieving a 5.2% improvement in scene text-centric tasks, a 6.9% increase in document-oriented tasks, and a 2.8% rise in key information extraction. It also sets a new standard on OCRBench with a score of 561, surpassing prior large multimodal models.
Details of Note
Dataset
- Scene Text: Datasets include COCOText, OCRText, HierText, TextVQA, and MLT. These support model training for scene text understanding, including text spotting and VQA.
- Document-Based VQA: Training uses IIT-CDIP, DocVQA, ChartQA, InfoVQA, DeepForm, KLC (Kleister Charity), and WikiTableQuestions (WTQ), focusing on document question answering, table understanding, and information retrieval.
- Evaluations:
- Scene Text VQA: Benchmarks include STVQA, TextVQA, and OCRVQA.
- Document-Oriented VQA: Benchmarks such as DocVQA, InfoVQA, and ChartQA assess the model’s document comprehension.
- Key Information Extraction (KIE): Benchmarks like FUNSD, SROIE, and POIE test the model’s ability to extract structured information.
Methodology Highlights
- Shifted Cross-Window Attention: Introduces cross-window attention at intervals in the Transformer layers, enabling the model to capture long-range dependencies across image patches.
- Zero Initialization for Cross-Window Attention: Inspired by LoRA, this stabilizes early training by gradually introducing cross-window connections.
- Token Resampling for Efficiency: Redundant visual tokens are filtered by their cosine similarity, retaining only the top tokens to reduce token count. The remaining tokens attend to discarded tokens to capture essential information.
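A simplified sketch of similarity-based token filtering and resampling: each visual token is scored by its maximum cosine similarity to the others, the least redundant k tokens are kept, and the kept tokens attend back over the full set. The scoring rule and attention details are assumptions, not the paper's exact method.

```python
# Sketch: keep the k least-redundant visual tokens (by cosine similarity),
# then let the kept tokens attend over all tokens to recover discarded info.
import torch
import torch.nn.functional as F

def filter_and_resample(tokens: torch.Tensor, k: int) -> torch.Tensor:
    # tokens: (n, dim)
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T                      # (n, n) pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                     # ignore self-similarity
    redundancy = sim.max(dim=-1).values          # high => token duplicates another
    keep = redundancy.topk(k, largest=False).indices
    kept = tokens[keep]                          # (k, dim) least-redundant tokens
    attn = torch.softmax(kept @ tokens.T / tokens.shape[-1] ** 0.5, dim=-1)
    return attn @ tokens                         # kept tokens gather info from all

# Usage: compressed = filter_and_resample(torch.randn(256, 1024), k=64)
```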
Training Details
- Optimizer: AdamW with cosine decay scheduling.
- Learning Rate and Schedule: Starts at 1e-5 with a warmup over 150 steps, cooling down to 5e-6.
- Batch Size: 128 per training step.
- Training Time: 12 days on A800 GPUs for a single epoch.
- Regularization: Applies weight decay of 0.1 to prevent overfitting.
2024_06_visionllm2
2024-06: VisionLLM v2
Quick Notes
- Decoder-only model.
- Weights unavailable; Apache-2.0 License.
- Organization: Shanghai AI Lab.
Key Links
- Arxiv: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
- GitHub: OpenGVLab/VisionLLM
Guiding Questions
What kind of paper are you reading?
- This paper is a model introduction. They want to demonstrate a single system that can do a wide range of vision-language tasks.
- Not a lot of new ideas, but a lot of training and evaluation for a single model across many tasks.
What is the motivation and potential impact of the paper?
- The motivation is to have one model that can do any language, vision, or vision-language task.
- The impact is in simplifying the deployment of vision-language models, making it easier to use them in a wide range of applications.
What is the relevant related work and what is this paper’s contribution?
- The most relevant previous work is the original VisionLLM model.
- All other decoder-based Vision-Language models are relevant, like Monkey.
What are the results (Theory/Experiments)?
- The model is evaluated on 100+ vision-language tasks, showing strong performance across the board.
Details of Note
Dataset
- Stage 1: Uses a wide variety of datasets across many tasks: conversation, image captioning, image VQA, OCR, region captioning, region VQA, region recognition.
- Stage 3: Some overlap with Stage 1, but specialized towards object detection, instance segmentation, grounded captioning, visual grounding, object counting, pose estimation, image generation and editing, and more.
- Stage 2: Trained on the combination of the Stage 1 and Stage 3 data.
Methodology Highlights
- The model architecture has 4 key components:
- An image encoder, and a region encoder
- A language decoder / LLM
- Task-specific decoders
- A module called a “super-linker”. This appears to be a sort of abstractor or Q-Former-style module, where queries are learned to extract a set of features for that specific task in that particular context.
- A key way that super-linking works is through routing tokens. For a particular specialized task, like object detection, the model learns to output special routing tokens (e.g., `[DET]`, `[SEG]`, `[CAP]`) that guide the super-linker to extract the right features for that task AND to route to the particular task-specific decoder.
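A toy sketch of routing-token dispatch: when the LLM emits a routing token such as `[DET]` or `[SEG]`, the hidden state at that position is handed to the matching task-specific decoder. The token strings, decoder registry, and interfaces are hypothetical.

```python
# Sketch: dispatch hidden states at routing-token positions to task decoders.
import torch

ROUTE_TO_DECODER = {"[DET]": "detection", "[SEG]": "segmentation", "[CAP]": "captioning"}

def dispatch(generated_tokens, hidden_states, decoders):
    """generated_tokens: list[str]; hidden_states: (T, dim); decoders: dict[str, callable]."""
    outputs = []
    for pos, tok in enumerate(generated_tokens):
        if tok in ROUTE_TO_DECODER:
            task = ROUTE_TO_DECODER[tok]
            # The routed hidden state plays the role of a query for that decoder.
            outputs.append((task, decoders[task](hidden_states[pos])))
    return outputs

# Usage with stand-in decoders:
# decoders = {"detection": lambda h: h.sum(), "segmentation": lambda h: h.mean(), "captioning": lambda h: h}
# dispatch(["The", "dog", "[DET]"], torch.randn(3, 4096), decoders)
```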
Training Details
Three stage training process:
- Stage 1: Essentially LLaVA. It included a pre-train and fine-tune stage.
- Stage 2: Specialized training on a subset of tasks. It uses task-specific decoders.
- Stage 3: Keeps every component frozen except task-specific decoders. It trains for longer on 128 A100 GPUs.
2024_08_mplug_owl3
2024-08: mPLUG-Owl3
Quick Notes
- Decoder-only model
- Weights available: MIT / Apache-2.0 License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
- GitHub: X-PLUG/mPLUG-Owl3
- Hugging Face:
Guiding Questions
What kind of paper is this?
- This paper introduces a new model, building upon the past-line of mPLUG models.
- I’d say it’s a relatively small idea plus evaluation, but they present an interesting architectural variation that seems to enable scaling up to long image-sequence lengths.
What is the motivation and potential impact of the paper?
- The motivation is to improve long image-sequence understanding in multimodal LLMs.
- This has massive downstream impact, enabling in-context learning with long sequences of images and text, and also applications to video understanding and other multimodal tasks.
What is the relevant related work and what is this paper’s contribution?
- The most relevant related work is the original mPLUG-Owl model and all subsequent mPLUG models.
- This paper contributes by showing massive gains on scenarios where one wants to have a very long interleaved sequence of images and text, like video understanding or long-form image-text understanding.
What are the results (Theory/Experiments)?
- They show marginal or borderline improvements (or at least no degradation) in performance on standard VQA and image-text tasks.
- Where they really gain is on multi-image tasks and benchmarks.
- Where they stand out light-years above other models is on a new benchmark they design, called Distractor Resistance, which requires a model to perform reasoning over hundreds of images in a sequence.
- Here, they demonstrate an ability to perform this task, albeit with degrading performance as the number of images increases.
- Most models fail to perform this task at all, so this is a significant step forward.
Details of Note
Dataset
Methodology Highlights
- One of the key things the authors do is allow for continual cross-attention between the language and vision outputs
- They call this hyper attention
- It just allows visual features to update language ones; not bidirectional
- It isn’t applied in every block; it’s a sparsely added layer
- The same layer norm is applied to both text and visual input to encourage similar semantic spaces
- Modality-specific KQV matrices are used to allow these components to specialize
- This is reminiscent of past mPLUG-Owl family models
- They only let text refer to image features that occur before the text in the sequence, unlike other models that just attend to all images provided
- Adaptive gating (gating that is input dependent on the text features) is used to provide a final merging of the text representation update with the current representation
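A minimal sketch of a hyper-attention-style block as described above: text hidden states cross-attend to visual features through a shared layer norm and modality-specific projections, and an input-dependent gate controls how much of the visual update is merged back into the text stream. The causal image-visibility mask is noted only in a comment; dimensions are illustrative.

```python
# Sketch: one-directional gated cross-attention from text to visual features,
# with a shared layer norm and an input-dependent (adaptive) gate.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.shared_norm = nn.LayerNorm(dim)   # same norm applied to both modalities
        self.q_proj = nn.Linear(dim, dim)      # modality-specific projections
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)          # adaptive gate from text features

    def forward(self, text_h: torch.Tensor, vis_h: torch.Tensor) -> torch.Tensor:
        # text_h: (B, T, dim); vis_h: (B, V, dim). A full implementation would also
        # mask out image tokens that appear after a given text position.
        q = self.q_proj(self.shared_norm(text_h))
        k = self.k_proj(self.shared_norm(vis_h))
        v = self.v_proj(self.shared_norm(vis_h))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        update = attn @ v
        g = torch.sigmoid(self.gate(text_h))   # (B, T, 1), input-dependent
        return text_h + g * update             # visual info flows into text only
```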
Training Details
- Stage 1: Pre-training
- This stage freezes most of the weights
- Only a linear projection, vision KV projection matrices, and the adaptive gate parameters are trained
- Sequence length is kept shorter (768) and resolution is kept small (384^2)
- 20k updates, batch size of 2048
- Stage 2: Multi-image training
- This stage focuses on training models that can reason over multiple images in an input
- Now the linear projection and the full language model are updated
- Sequence length expanded to 4k and full dynamic resolution enabled
- 3k updates, batch size of 1024
- Stage 3: Supervised Finetuning
- This stage focuses on making sure the model can follow instructions provided
- Again, full LLM and linear projection are updated
- Same sequence length and resolution as stage 2
- 11k updates, batch size of 1024