A Survey of Multimodal LLMs (2021-2024)
A comprehensive survey of multimodal large language models from 2021 to 2024, covering encoder-only models, encoder-decoder architectures, decoder-only models, and specialized applications for documents and screens.
Table of Contents
- Multimodal LLMs
- 2021-02: ALIGN
- 2021-02: CLIP
- 2021-07: ALBEF
- 2021-08: SimVLM
- 2021-11: VLMo
- 2021-12: FLAVA
- 2021-12: GLIP
- 2022-01: BLIP
- 2022-04: Flamingo
- 2022-05: CoCa
- 2022-05: mPLUG
- 2022-07: GLIPv2
- 2022-08: BEiT-3
- 2022-09: PaLI
- 2022-10: Pix2Struct
- 2022-12: UDOP
- 2023-01: BLIP-2
- 2023-02: KOSMOS-1
- 2023-02: mPLUG-2
- 2023-04: LLaVA
- 2023-04: mPLUG-Owl
- 2023-05: PaLI-X
- 2023-05: Pix2Act
- 2023-05: VisionLLM
- 2023-06: KOSMOS-2
- 2023-06: Shikra
- 2023-08: Qwen-VL
- 2023-09: KOSMOS-2.5
- 2023-10: LLaVA-1.5
- 2023-10: PaLI-3
- 2023-11: CogVLM
- 2023-11: DocPedia
- 2023-11: InfMLLM
- 2023-11: Monkey
- 2023-11: mPLUG-Owl2
- 2023-11: SPHINX
- 2023-12: CogAgent
- 2023-12: LLaVA-Grounding
- 2024-01: LLaVA-NEXT
- 2024-01: SeeClick
- 2024-02: LLaVAR
- 2024-02: ScreenAI
- 2024-02: SPHINX-X
- 2024-03: MM1
- 2024-03: TextMonkey
- 2024-06: VisionLLM v2
- 2024-08: mPLUG-Owl3
Multimodal LLMs
This comprehensive survey presents an overview of multimodal Large Language Models (LLMs) developed between 2021 and 2024. These models represent a significant advancement in AI, combining visual and textual understanding to enable more sophisticated human-computer interactions.
We organize the models into several categories based on their architectural approaches, examine efficient training frameworks, and explore grounding capabilities that enable models to localize and reference specific regions in images.
Encoder-Only Models
Models that feature encoders for both the text and vision modalities, but include no decoder.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2021 | 02 | CLIP | MIT | 150M, 430M | Yes | Yes | OpenAI |
| 2021 | 02 | ALIGN | - | 600M | No | No | Google |
| 2021 | 07 | ALBEF | BSD-3 | 200M | Yes | Yes | SalesForce |
| 2021 | 11 | VLMo | MIT | 200M, 600M | Yes | Yes | Microsoft |
| 2021 | 12 | FLAVA | BSD-3 | 450M | Yes | Yes | Meta |
| 2021 | 12 | GLIP | MIT | 500M | Yes | Yes | Microsoft |
| 2022 | 07 | GLIPv2 | MIT | - | Yes | Yes | Microsoft |
| 2022 | 08 | BEiT-3 | MIT | 1.9B | Yes | Yes | Microsoft |
Encoder-Decoder Models
Here, we assume a vision encoder, but allow the text backbone to be an encoder-decoder model.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2021 | 08 | SimVLM | - | - | No | No | Google |
| 2022 | 05 | CoCa | - | 0.4B, 0.8B, 2.1B | No | No | Google |
| 2022 | 05 | mPLUG | BSD-3 | 350M, 600M | Yes | Yes | Alibaba |
| 2022 | 09 | PaLI | - | 3B, 15B, 17B | No | No | Google |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | mPLUG-2 | Apache-2.0 | 1.5B, 3B, 6B | Yes | Yes | Alibaba |
| 2023 | 05 | PaLI-X | - | 55B | No | No | Google |
| 2023 | 10 | PaLI-3 | - | 5B | No | No | Google |
Specialized for Documents
Here, we highlight models developed solely for document understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 12 | UDOP | MIT | 800M | Yes | Yes | Microsoft |
Specialized for Screens
Here, we highlight models developed solely for screen understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2024 | 02 | ScreenAI | - | 0.7B, 2B, 5B | No | No | Google |
Decoder-Only Models
Note: This assumes decoder-only models for text. As usual, vision inputs are still encoded with a vision encoder.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 01 | BLIP | BSD-3 | 200M, 500M | Yes | Yes | Salesforce |
| 2022 | 04 | Flamingo | - | 3B, 9B, 80B | No | No | DeepMind |
| 2023 | 01 | BLIP-2 | BSD-3 | 3B, 4B, 8B, 12B | Yes | Yes | Salesforce |
| 2023 | 02 | KOSMOS-1 | MIT | 1.9B | No | Yes | Microsoft |
| 2023 | 04 | LLaVA | MIT | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 04 | mPLUG-Owl | LLaMA, Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 05 | VisionLLM | Apache-2.0 | - | No | No | Shanghai AI |
| 2023 | 06 | KOSMOS-2 | MIT | 1.9B | Yes | Yes | Microsoft |
| 2023 | 06 | Shikra | CC-BY-NC-4.0 | 7B | Yes | No | SenseTime |
| 2023 | 08 | Qwen-VL | Tongyi | 9.6B | Yes | No | Alibaba |
| 2023 | 10 | LLaVA-1.5 | Apache-2.0 | 7B, 13B | Yes | Yes | Microsoft |
| 2023 | 11 | SPHINX | Llama | 7B, 13B | Yes | Yes | Shanghai AI |
| 2023 | 11 | InfMLLM | Apache-2.0 | 7B, 13B | Yes | Yes | Inftech.AI |
| 2023 | 11 | mPLUG-Owl2 | Apache-2.0 | 7B | Yes | Yes | Alibaba |
| 2023 | 11 | CogVLM | Llama | 17B | Yes | Yes | Chinese Univ. |
| 2023 | 11 | Monkey | Tongyi | 9.8B | Yes | No | Chinese Univ. |
| 2023 | 12 | LLaVA-Grounding | CC-BY-NC-4.0 | 7B | Yes | No | Chinese Univ. |
| 2024 | 01 | LLaVA-NEXT | MIT | 7B, 13B, 34B | Yes | Yes | Chinese Univ. |
| 2024 | 02 | LLaVAR | Apache-2.0 | 7B, 13B | Yes | Yes | Adobe |
| 2024 | 02 | SPHINX-X | Apache-2.0 | 1B, 7B, 13B, 8x7B | Yes | Yes | Shanghai AI |
| 2024 | 03 | MM1 | - | 3B, 7B, 30B | No | No | Apple |
| 2024 | 06 | VisionLLM v2 | Apache-2.0 | - | No | No | Shanghai AI |
| 2024 | 08 | mPLUG-Owl3 | MIT | 1B, 2B, 7B | Yes | Yes | Alibaba |
Specialized for Documents
Here, we highlight models developed solely for document understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2023 | 09 | KOSMOS-2.5 | MIT | 1.4B | Yes | Yes | Microsoft |
| 2023 | 11 | DocPedia | - | 7B, 13B | No | No | Chinese Univ. |
| 2024 | 03 | TextMonkey | Tongyi | 9.8B | Yes | Yes | Chinese Univ. |
Specialized for Screens
Here, we highlight models developed solely for screen understanding tasks.
| Year | Month | Model | License | Size(s) | Weights? | Permissive? | Organization |
|---|---|---|---|---|---|---|---|
| 2022 | 10 | Pix2Struct | Apache-2.0 | 0.3B, 1.3B | Yes | Yes | Google |
| 2023 | 05 | Pix2Act | Apache-2.0 | 0.3B | Yes | Yes | Google |
| 2023 | 12 | CogAgent | Llama | 18B | Yes | Yes | Chinese Univ. |
| 2024 | 01 | SeeClick | Tongyi | 9.6B | Yes | No | Shanghai AI |
Frameworks for Efficient Multimodal LLMs
Frameworks designed to efficiently train or adapt multimodal LLMs are highlighted here. This includes approaches that use pre-existing models to bootstrap new multimodal models at a fraction of the cost of end-to-end training.
BLIP-2
- Overview: BLIP-2 efficiently bridges vision and language by using frozen models for both vision and language components, avoiding the need for end-to-end training.
- Stage 1: Q-Former is trained to align visual features with text using a frozen image encoder, enabling efficient representation learning without modifying the image encoder.
- Stage 2: Q-Former connects to a frozen LLM for vision-to-language generation, leveraging pre-trained language capabilities without retraining the LLM.
LLaVA & LLaVA-1.5 & LLaVA-NEXT
- Overview: LLaVA connects a frozen CLIP encoder to a language model, using instruction tuning on GPT-4-generated data for efficient multimodal training. The updated LLaVA-1.5 introduces several enhancements such as an MLP vision-language connector, additional datasets, and scaling to higher resolutions.
- Stage 1: A simple linear projection (updated to MLP in LLaVA-1.5) is learned to align visual features with the language model’s embedding space, keeping both models frozen.
- Stage 2: The projection layer and language model are fine-tuned on multimodal instruction data. In LLaVA-1.5, new datasets such as VQA-v2 and ShareGPT data are added, and resolution scaling improves model performance.
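Below is a minimal PyTorch sketch (not the official LLaVA code) of the stage-1 setup: the vision encoder and language model stay frozen, and only a small projector that maps visual features into the LLM's embedding space is trained. Module names and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Linear projection (LLaVA) or two-layer MLP (LLaVA-1.5) from vision-feature space
    into the language model's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, use_mlp: bool = True):
        super().__init__()
        if use_mlp:  # LLaVA-1.5-style MLP connector
            self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))
        else:        # original LLaVA linear projection
            self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

def freeze(module: nn.Module) -> None:
    """Stage-1 trick: keep a pretrained module fixed so only the projector learns."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholders standing in for the pretrained CLIP ViT and the LLM.
vision_encoder = nn.Linear(1024, 1024)
language_model = nn.Linear(4096, 4096)

freeze(vision_encoder)
freeze(language_model)
projector = VisionProjector(vision_dim=1024, llm_dim=4096)
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)  # only the projector is updated
```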
mPLUG-Owl
- Overview: mPLUG-Owl introduces a modularized framework that equips LLMs with multimodal capabilities by decoupling the training of visual and language components, using a two-stage training process to improve efficiency.
- Stage 1: A frozen LLM is paired with a trainable vision encoder (ViT-L/14) and visual abstractor to align image and text representations. The visual components are trained without modifying the LLM.
- Stage 2: LoRA is applied to the LLM for efficient fine-tuning on unimodal and multimodal instruction data, while keeping the visual encoder frozen to preserve alignment and reduce computational costs.
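As a hedged illustration of the LoRA step, the sketch below attaches low-rank adapters to a causal LM via the Hugging Face `peft` library; the base checkpoint, target modules, and rank are assumptions, not mPLUG-Owl's actual configuration.

```python
# Hedged sketch: attaching LoRA adapters to a causal LM with Hugging Face `peft`,
# in the spirit of mPLUG-Owl's stage-2 fine-tuning. The base model name, target
# modules, and rank below are illustrative, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # illustrative base LLM
config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```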
VisionLLM
- Overview: VisionLLM presents a unified framework for vision-centric tasks using an LLM-based decoder, efficiently aligning vision and language tasks without end-to-end retraining.
- Stage 1: A language-guided image tokenizer is trained using a frozen LLM (Alpaca) while fine-tuning the visual backbone (ResNet/InternImage-H) to align visual and language features.
- Stage 2: The visual backbone is frozen and only the LLM is fine-tuned using LoRA, enabling efficient adaptation to new tasks through flexible language instructions without modifying the vision components.
InfMLLM
- Overview: InfMLLM uses a three-stage training approach with a novel pool-adapter to handle vision-language tasks efficiently.
- Stage 1 (Pretraining): The pool-adapter is trained while keeping the ViT and LLM frozen, focusing on image-text alignment.
- Stage 2 (Multitask Finetuning): The ViT, pool-adapter, and QV projection are unfrozen, training on a mixture of VQA, captioning, and grounding tasks.
- Stage 3 (Instruction Tuning): The LLM is fully finetuned on instruction-following data, with the ViT remaining frozen.
mPLUG-Owl2
- Overview: mPLUG-Owl2 leverages modality collaboration through adaptive modules, reducing interference between vision and language features for efficient multi-modal learning.
- Stage 1: A modality-adaptive module (MAM) aligns visual and language features, with frozen LLM weights except for the adaptive modules, preserving separate modality representations.
- Stage 2: The entire model is unfrozen and fine-tuned on text and multimodal instruction data, enabling cross-modality performance enhancements without compromising on individual modality strengths.
LLaVA-Grounding
- Overview: LLaVA-Grounding integrates grounded visual chat and localization by combining frozen and partially trainable models through a staged training process, leveraging a language model, visual encoder, and grounding model to reduce computational needs.
- Stage 1: Pretraining aligns vision and language features by finetuning a projection layer and grounding model while keeping the language model frozen.
- Stage 2: Instruction tuning on GVC data adjusts the language and grounding components for grounded responses, while the vision encoder and prompt encoder remain frozen.
- Stage 3: For visual prompts (e.g., boxes, marks), only the prompt encoder and specific layers are fine-tuned, supporting efficient visual-grounded chat.
Grounding Capabilities
Sometimes when a multimodal model produces an answer, it also must refer to a location in an image. For example, in the case of a question-answering task, the model might need to point to a specific region in an image to justify its answer. Or, in the case of a model that operates on screens, it might be asked where a login button is located. Unfortunately, not all multimodal models have this capability.
Here, we provide a short list of models that are trained to provide grounding, along with a brief note on how each one does so.
Types of Grounding
- Special Tokens: Some models use special tokens to indicate the location of an object in an image. Frequently, the x- and y-coordinates are uniformly discretized onto a grid, with a special token for each grid point in the x- and y-dimensions.
  - For example, Google frequently builds models that discretize the x- and y-coordinates into 1000 bins each. Each bounding box is then associated with a sequence of four special tokens, representing the x- and y-coordinates of its top-left and bottom-right corners (a discretization sketch follows this list).
- A similar idea was used in the KOSMOS-2 model, which used a special token for each vision patch as determined by its visual encoder. This allows the model to produce a bounding box for a given object in an image using two special tokens: one for the top-left corner and one for the bottom-right corner.
- Direct Prediction: Some models directly predict the bounding box
coordinates for an object in an image. Coordinates are scaled to a fixed
range (e.g., [0, 1]) and reduced to a fixed precision (e.g., 2 decimal
places). From there, raw tuples are predicted directly in the text output.
- For example, the LLaVA model directly predicts the bounding box coordinates for objects in images. The model is trained to predict the x- and y-coordinates of the top-left and bottom-right corners of the bounding box.
- This approach is sometimes used to output general polygons as well.
- Object Detection: Some models use object detection to predict bounding boxes for objects in images. These models independently predict boxes and masks with a traditional object detection branch.
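The sketch below illustrates the 1000-bin special-token scheme described above; the token naming is hypothetical, since each model defines its own location vocabulary.

```python
# Illustrative sketch of the 1000-bin location-token scheme described above.
# Token naming ("<loc_0123>") is hypothetical; each model defines its own vocabulary.
def box_to_location_tokens(box, num_bins=1000):
    """Map a box (x0, y0, x1, y1) with coordinates normalized to [0, 1]
    onto four discrete location tokens."""
    tokens = []
    for coord in box:
        bin_idx = min(int(coord * num_bins), num_bins - 1)  # clamp 1.0 into the last bin
        tokens.append(f"<loc_{bin_idx:04d}>")
    return tokens

# e.g. a box covering the top-left quadrant of the image:
print(box_to_location_tokens((0.0, 0.0, 0.5, 0.5)))
# ['<loc_0000>', '<loc_0000>', '<loc_0500>', '<loc_0500>']
```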
Grounding Models
| Model | Grounding Type |
|---|---|
| VisionLLM | Direct Prediction or Special Tokens |
| KOSMOS-2 | Patch-based Special Tokens |
| Shikra | Direct Prediction |
| Qwen-VL | Direct Prediction |
| KOSMOS-2.5 | Discretized Special Tokens |
| LLaVA-1.5 | Direct Prediction |
| SPHINX | Direct Prediction |
| InfMLLM | Direct Prediction |
| CogVLM | Direct Prediction |
| LLaVA-Grounding | Object Detection |
| SPHINX-X | Direct Prediction |
2021-02: ALIGN
Quick Notes
- Encoder-only model.
- No GitHub repo.
- No Hugging Face Checkpoints.
- Organization: Google.
Key Links
- Arxiv: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
- PMLR: Proceedings of the 38th International Conference on Machine Learning
Guiding Questions
What kind of paper is this?
This paper proposes a new method in visual and vision-language representation learning using noisy, large-scale data, leveraging a dual-encoder architecture. The paper mainly presents a big idea with evaluation: a scalable approach to representation learning from noisy, uncurated data.
What is the motivation and potential impact of the paper?
- Motivation:
- Visual and vision-language representations still rely heavily on curated, human-labeled datasets (e.g., ImageNet, MSCOCO), which are expensive and time-consuming to create.
- There is a need to scale up representation learning using larger datasets that don’t require this level of manual curation to support improved downstream task performance and enable zero-shot learning.
- Potential Impact:
- The approach can enable the training of state-of-the-art models with large but noisy data instead of clean, expensive datasets.
- This can reduce the cost and time associated with creating labeled datasets and improve the ability of models to generalize to a wide range of tasks, including zero-shot learning and cross-modal retrieval.
What is the relevant related work and what is this paper’s contribution?
- Related Work:
- Pre-training on large, curated datasets like ImageNet and OpenImages has become the standard for visual representation learning.
- Vision-language pre-training has similarly relied on curated datasets like Conceptual Captions and MSCOCO, which are smaller and require human annotation.
- CLIP (Radford et al., 2021) is a closely related model that also uses image-text pairs for visual representation learning.
- Contribution:
- The key contribution is the ALIGN model, which scales representation learning using a noisy dataset of over 1 billion image alt-text pairs without requiring extensive filtering or cleaning. The paper shows that scale can compensate for noise.
- ALIGN achieves state-of-the-art results in image-text retrieval benchmarks and zero-shot image classification.
What are the results (Theory/Experiments)?
The paper presents experiments on several tasks, including zero-shot image classification, image-to-text and text-to-image retrieval, and transfer learning to ImageNet and fine-grained classification tasks.
- ALIGN achieves state-of-the-art results on Flickr30K and MSCOCO benchmarks, outperforming previous models like CLIP and UNITER.
- In zero-shot classification, ALIGN performs comparably to CLIP with 76.4% top-1 accuracy on ImageNet.
- ALIGN also shows strong performance on transfer learning tasks and smaller fine-grained classification datasets.
Details of Note
Dataset
- The dataset is constructed using a methodology similar to Conceptual Captions (Sharma et al., 2018).
- It consists of 1.8 billion image-text pairs, making it one of the largest available datasets for vision-language pre-training.
- The dataset is noisy due to minimal filtering; only frequency-based heuristics are applied to reduce noise.
- Image-based filtering includes removing low-resolution and pornographic images, as well as duplicates from evaluation sets.
- Text-based filtering involves removing alt-texts that are too short, too long, or contain rare or irrelevant tokens.
- Unlike Conceptual Captions, the dataset avoids extensive cleaning or curation to prioritize scale, demonstrating that scale compensates for noise in training.
Methodology Highlights
Dual-Encoder Architecture:
- The model uses separate encoders for images and text, aligning their embeddings in a shared latent space using contrastive learning.
- Image Encoder: EfficientNet models (e.g., EfficientNet-L2) are used to encode the visual features.
- Text Encoder: BERT-based models (e.g., BERT-Large) are employed for text embeddings.
Contrastive Loss:
- A normalized softmax contrastive loss is applied, where matched image-text pairs are pushed together in the embedding space, and non-matched pairs are pushed apart.
- Both image-to-text and text-to-image losses are used, making it analogous to a classification objective with text embeddings acting as “label” weights.
In-batch Negatives:
- The contrastive learning relies on in-batch negatives, which are all other image-text pairs within the batch. To enhance this, embeddings from all computing cores are concatenated to form larger batches.
Learned Temperature Parameter:
- The softmax temperature, which scales the logits, is learned during training instead of being fixed, allowing it to automatically adjust to improve model performance.
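A minimal PyTorch sketch of this normalized softmax contrastive loss with a learned temperature is shown below; it is a simplification of ALIGN's implementation, which additionally concatenates embeddings across all cores to enlarge the pool of in-batch negatives.

```python
# Minimal sketch of a normalized-softmax (InfoNCE-style) contrastive loss with a
# learned temperature, as described above. Real ALIGN training additionally gathers
# embeddings from all TPU cores before computing the similarity matrix.
import torch
import torch.nn.functional as F

log_temperature = torch.nn.Parameter(torch.tensor(0.0))  # learned; temperature = exp(log_temperature)

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / log_temperature.exp()  # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))                  # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)            # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_loss(torch.randn(8, 640), torch.randn(8, 640))
```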
Training Framework:
- The training batch size is massive (up to 16,384 effective batch size), utilizing 1024 TPUv3 cores. This is critical for scaling the noisy data and ensuring convergence.
- The model is trained with LAMB optimizer with weight decay, which allows large-batch training and ensures stability.
Zero-shot Transfer and Fine-tuning:
- ALIGN is evaluated in zero-shot settings, where class names are fed into the text encoder, enabling zero-shot image classification.
- The model performs exceptionally well in both zero-shot and fine-tuned transfer to downstream tasks, including Flickr30K and MSCOCO retrieval benchmarks, and classification tasks like ImageNet.
Training Details
- Optimizer: LAMB optimizer with weight decay set to 1e-5.
- Batch Size: 16,384 effective batch size, utilizing 1024 TPUv3 cores.
- Learning Rate:
- Warmup for the first 10,000 steps to a peak learning rate of 1e-3.
- Learning rate is linearly decayed to zero over 1.2M steps (equivalent to 12 epochs).
- In-Batch Negatives: Embeddings from all computing cores are concatenated so the contrastive loss sees the full effective batch, ensuring large batch sizes and efficient contrastive learning.
- Hardware: Training is done on 1024 TPUv3 cores, enabling the model to handle the large-scale dataset and optimize across all cores.
- Precision: Likely using mixed precision for efficiency, though not explicitly mentioned (this is often standard in such large-scale models).
2021-02: CLIP
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: OpenAI.
Key Links
- Arxiv: Learning Transferable Visual Models From Natural Language Supervision
- PMLR: Proceedings of the 38th International Conference on Machine Learning
- GitHub: openai/CLIP
- Hugging Face Checkpoints:
- ViT-B/32 (151M params): openai/clip-vit-base-patch32
- ViT-B/16 (150M params): openai/clip-vit-base-patch16
- ViT-L/14 (428M params): openai/clip-vit-large-patch14
- ViT-L/14@336px (428M params): openai/clip-vit-large-patch14-336
Guiding Questions
What kind of paper is this?
- This paper introduces a new method, and a new idea for how to train multimodal models. It is a method paper but also a bit of a big idea paper.
- It focuses on image-text pairs for zero-shot learning, combining contrastive learning approaches with large-scale pre-training.
What is the motivation and potential impact of the paper?
- Problem: The current state-of-the-art (SOTA) computer vision systems require large, labeled datasets (e.g., ImageNet) to perform well. This reliance on supervised learning limits their generality and applicability to new tasks, as new labeled data is often needed.
- Importance: Learning directly from raw text about images could remove this limitation, enabling broader, more flexible, and more scalable visual learning.
- Impact: The potential for zero-shot transfer of visual models is significant. The authors show that their model, CLIP, can match SOTA supervised models (like ResNet50) without using any labeled training examples, enabling a broader set of tasks like OCR, action recognition, and more.
What is the relevant related work and what is this paper’s contribution?
- Related work: Previous work in image-caption learning (Joulin et al., VirTex, ConVIRT) demonstrated the potential of text-based supervision for learning visual representations but lacked scalability and efficiency compared to fully supervised models.
- Contribution: The paper introduces CLIP, a contrastive learning framework that scales efficiently to large datasets of 400 million image-text pairs. The contribution lies in the contrastive pre-training approach and the demonstration of its ability to perform zero-shot transfer to a wide variety of downstream tasks.
What are the results (Theory/Experiments)?
Experiments: They validate this by testing the model on over 30 computer vision datasets. The results show that CLIP outperforms supervised models on many tasks in a zero-shot setting and is more robust to natural distribution shifts than models like ResNet50.
- They achieve 76.2% accuracy on ImageNet zero-shot, matching ResNet50 trained on labeled data.
- CLIP outperforms supervised models in tasks like action recognition and fine-grained object classification, though it underperforms on more specialized tasks like traffic sign recognition and satellite imagery.
What are the implications of the paper? (Broader Impact)
- Technological implications: The ability to create visual classifiers without specific training data could revolutionize areas like autonomous systems, medical imaging, surveillance, and content moderation. It offers a more scalable and adaptable approach to visual understanding.
- Stakeholders: Researchers and engineers developing AI systems will benefit from the flexibility of this approach. However, communities affected by the deployment of AI (e.g., for surveillance) may experience increased privacy concerns.
- Positive impact: The potential to enable broad transferability of visual models without relying on labeled datasets could drastically reduce the time and cost of developing AI for new domains.
- Negative consequences: There are concerns about bias and fairness, as the model may reinforce existing social biases present in the training data. Additionally, the ease of use for surveillance tasks raises privacy concerns.
Details of Note
Dataset - WebImageText (WIT)
- 400M text-image pairs collected from the internet.
- 500k queries generated using common words from Wikipedia, high pointwise mutual information bigrams, and WordNet synsets.
- Restricted to 20k image-text pairs per query to maintain class balance.
- Filtered to exclude low-quality pairs (e.g., automatically generated filenames or non-descriptive text).
- Covers a broad range of visual concepts and is comparable in size to the WebText dataset used for GPT-2.
- Availability: Proprietary / Non-public.
Method - Contrastive Language-Image Pre-training (CLIP)
Objective: Predict which text is paired with which image in a batch of N image-text pairs, generating N^2 possible pairings (N true, N^2 - N false).
Contrastive Loss: Uses cosine similarity between image and text embeddings. A symmetric cross-entropy loss is applied to maximize similarity for correct pairs and minimize it for incorrect ones.
Multimodal Embedding Space:
- Image Encoder: ResNet or Vision Transformer (ViT).
- Text Encoder: Transformer-based model.
- Both are linearly projected into a shared embedding space and L2 normalized.
- Note: This is a dual-encoder formulation in contrast to fusion encoder formulations that mix all modalities as equal tokens.
Temperature Scaling: A learned temperature parameter controls the sharpness of the similarity distribution.
Data Augmentation: Random square cropping for images; simplified text processing (typically single sentences).
Architecture
Text Encoder:
- Transformer-based, BERT-style.
- 63M parameters.
- Configuration: 12 layers, 512 dimensions, and 8 attention heads.
- Lower-cased vocabulary of 50k tokens.
- Maximum sequence length: 76.
- The text encoder’s width is scaled to match the size of the vision model for different configurations.
Vision Encoders:
ResNet-style (with slight modifications to pooling):
- ResNet-50, ResNet-101, and scaled variants:
- RN50x4 (4x wider), RN50x16, RN50x64.
- ResNet improvements include ResNetD modifications (He et al., 2019) and antialiased blur pooling (Zhang, 2019).
- All ResNet weights are available in the GitHub repo.
ViT-style (Vision Transformer):
- ViT-B/32 (151M parameters): Hugging Face Model.
- ViT-B/16 (150M parameters): Hugging Face Model.
- ViT-L/14 (428M parameters): Hugging Face Model.
- ViT-L/14@336px (428M parameters): Trained at higher 336px resolution for improved performance. Hugging Face Model.
Training
- Optimizer: AdamW.
- Weight decay of 0.1.
- Learning rate set via cosine learning rate schedule.
- Epochs: 32.
- Batch size: 32k images per batch.
- Techniques:
- Gradient checkpointing to reduce memory usage.
- Mixed precision and half precision (FP16) for faster training and lower memory consumption.
- Distributed training across multiple GPUs (e.g., trained on up to 592 V100 GPUs for the largest model).
Hardware and Training Times
- Largest ResNet (RN50x64):
- 18 days on 592 V100 GPUs.
- Largest ViT (ViT-L/14):
- 12 days on 256 V100 GPUs.
- Compute efficiency: ViT models demonstrated higher compute efficiency compared to ResNet models in this setup.
Benchmarks:
- Zero-shot image classification on over 30 datasets spanning a wide
variety of tasks.
- Includes standard datasets such as ImageNet, CIFAR-10, CIFAR-100, and PascalVOC.
- Fine-grained classification tasks (e.g., Stanford Cars, FGVC Aircraft, Flowers102).
- Specialized tasks such as OCR, geo-localization, and action recognition.
- Robustness benchmarks: Evaluated on distribution shifts using datasets like ImageNetV2, ImageNet-R, ImageNet-A, and ObjectNet.
- Achieved competitive performance with fully supervised models (e.g., ResNet50) in a zero-shot setting without using labeled training data.
- Zero-shot CLIP outperforms supervised models on tasks like OCR and action recognition, while underperforming on specialized tasks such as satellite imagery (e.g., EuroSAT), traffic sign recognition (e.g., GTSRB), and object counting (e.g., CLEVRCounts).
- Additional analysis includes few-shot performance comparison and general robustness to natural distribution shifts.
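Since the checkpoints listed in the Key Links above are available on Hugging Face, a minimal zero-shot classification sketch using the `transformers` integration might look like the following; the image path and prompts are placeholders.

```python
# Hedged usage sketch: zero-shot classification with the openai/clip-vit-base-patch32
# checkpoint via the Hugging Face `transformers` integration.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any local image
prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # one probability per prompt
print(dict(zip(prompts, probs[0].tolist())))
```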
2021-07: ALBEF
Key Links
- Arxiv: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
- NeurIPS 2021: Conference Paper
- GitHub: salesforce/ALBEF
- Checkpoints:
- ALBEF (4M training images): Google Storage API
- ALBEF (14M training images): Google Storage API
Guiding Questions
What kind of paper is this?
A vision-language representation learning paper proposing a new framework (ALBEF) to improve vision-and-language pre-training. It focuses on introducing new techniques (Align Before Fuse, momentum distillation) and achieving state-of-the-art results on a set of multimodal tasks.
This is a small idea with evaluation paper. It introduces several new tricks and techniques to improve vision-and-language pre-training and evaluates them on a variety of tasks.
What is the motivation and potential impact of the paper?
Motivation: The paper addresses the limitations in current Vision-and-Language Pre-training (VLP) methods, including:
- Misaligned features: Image features and word tokens are unaligned, making it challenging to learn image-text interactions.
- Resource-Intensive Object Detection: Most existing methods require bounding box annotations and high-resolution images, which are computationally expensive.
- Noisy Data: Web data used in pre-training is noisy, leading to suboptimal model performance.
The authors propose ALBEF to align image and text representations before fusing them using cross-modal attention, making learning more efficient and robust, especially with noisy data.
- Potential Impact: ALBEF aims to outperform existing vision-language models on various downstream tasks like image-text retrieval, visual question answering (VQA), and visual reasoning (NLVR2). The model eliminates the need for bounding box annotations and improves inference speed. Its novel method (momentum distillation) also enhances learning from large, uncurated web datasets.
What is the relevant related work and what is this paper’s contribution?
Related Work:
- Existing VLP methods such as LXMERT, UNITER, OSCAR, etc., use multimodal encoders with object detectors, which have high computational costs and performance limitations when handling noisy web data.
- Recent methods like CLIP and ALIGN that use separate encoders for image and text with a contrastive loss but struggle with complex image-text interactions.
Contribution:
- ALBEF introduces the concept of aligning before fusing the image and text features to improve interaction modeling in multimodal tasks.
- It proposes momentum distillation (MoD) to improve learning in noisy environments, offering theoretical insights on mutual information maximization in this context.
- The model eliminates the need for bounding boxes or high-resolution images and achieves state-of-the-art performance on tasks like image-text retrieval, VQA, and NLVR2.
What are the results (Theory/Experiments)?
Theory:
- The paper provides a theoretical framework for understanding ALBEF’s success, viewing it through the lens of mutual information maximization. Different training tasks (ITC, MLM, MoD) are framed as techniques to generate different views of an image-text pair.
Experiments:
- ALBEF achieves state-of-the-art results across multiple downstream vision-language tasks (e.g., image-text retrieval, VQA, and NLVR2). For example, ALBEF outperforms previous models like VILLA by 2.37% on VQA and 3.84% on NLVR2.
- It also demonstrates faster inference speed and better performance compared to models like CLIP and ALIGN, even when pre-trained on significantly smaller datasets.
Details of Note
Dataset
ALBEF is pre-trained on a combination of image-captioning datasets to cover a wide variety of image-text pairs. The pre-training datasets include:
- Conceptual Captions 3M (CC3M): A large-scale dataset with 3 million image-text pairs, focusing on high-diversity web-crawled images and corresponding descriptions.
- Conceptual Captions 12M (CC12M): A noisier and larger version of CC3M, containing 12 million image-text pairs, providing a broader but less curated data source.
- SBU Captions: A dataset containing image-text pairs scraped from Flickr, known for its diverse visual contexts.
- COCO (Common Objects in Context): A well-established dataset consisting of images depicting everyday scenes with object annotations and captions.
- Visual Genome: A dataset rich in image-region annotations and dense captions, adding granular context to the image-text learning process.
In total, the training dataset comprises:
- 4 million unique images
- 5.1 million image-text pairs
For scaling experiments, the authors also incorporate the Conceptual Captions 12M (CC12M) dataset, bringing the total number of images up to 14.1 million. This larger dataset, while noisier, demonstrates the scalability of ALBEF to web-scale data, showcasing improved performance on downstream tasks.
Methodology Highlights
Image-Text Contrastive (ITC) Loss
- Purpose: Aligns unimodal representations (image and text features) to improve multimodal learning.
- Mechanism: Enforces that the visual and textual embeddings are jointly predictive of their association, using a contrastive loss that aligns them before they are fused in the multimodal encoder.
- Benefits:
- Makes it easier for the multimodal encoder to fuse aligned features, enhancing cross-modal interactions.
- Encourages better understanding of the semantic spaces for both the unimodal encoders (image and text).
- Learns a common low-dimensional space for embedding images and texts, facilitating downstream tasks like image-text retrieval.
Momentum Distillation (MoD)
- Purpose: Helps the model learn from noisy web-based image-text pairs by generating pseudo-targets that provide alternative reasonable outputs.
- Mechanism:
- During training, a momentum version of the model is maintained by taking moving averages of its parameters.
- This momentum model generates pseudo-targets for the main model, which provides additional supervision beyond the noisy annotations.
- Encourages the model to not overfit to web annotations and produce reasonable alternative outputs for tasks like Masked Language Modeling (MLM) and Image-Text Matching (ITM).
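A simplified sketch of the momentum update and pseudo-target blending is shown below; the momentum (0.995) and distillation weight α (0.4) match the training details later in this section, but the code is illustrative rather than ALBEF's implementation.

```python
# Sketch of momentum distillation: an EMA copy of the model generates soft
# pseudo-targets that are blended with the one-hot targets (weight alpha).
import copy
import torch
import torch.nn.functional as F

def ema_update(model: torch.nn.Module, momentum_model: torch.nn.Module, m: float = 0.995) -> None:
    """Update the momentum model as a moving average of the online model's parameters."""
    with torch.no_grad():
        for p, p_m in zip(model.parameters(), momentum_model.parameters()):
            p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def distilled_loss(logits: torch.Tensor, momentum_logits: torch.Tensor,
                   targets: torch.Tensor, alpha: float = 0.4) -> torch.Tensor:
    """Cross-entropy on hard targets, blended with the momentum model's soft targets."""
    hard = F.cross_entropy(logits, targets)
    soft = -(F.softmax(momentum_logits, dim=-1) * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    return (1.0 - alpha) * hard + alpha * soft

model = torch.nn.Linear(16, 4)
momentum_model = copy.deepcopy(model)  # kept in sync with ema_update() after each step
```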
Architecture
- Image Encoder: A 12-layer ViT-B/16 (Vision Transformer), pre-trained on ImageNet-1K, with 85.8M parameters.
- Text Encoder: A 6-layer transformer initialized from the first 6 layers of BERT-base.
- Multimodal Encoder: Another 6-layer transformer initialized from the last 6 layers of BERT-base.
- Total Parameters: Approximately 200M parameters (~85.8M for ViT-B/16 and 123.7M from BERT-base).
Pre-training Objectives
- Image-Text Contrastive (ITC) Loss: Aligns the unimodal representations by calculating softmax-normalized similarity scores between paired image and text representations. This loss is minimized to enhance alignment.
- Masked Language Modeling (MLM): A standard task where random tokens in the input text are masked, and the model must predict the masked tokens using both image and text context.
- Image-Text Matching (ITM): Predicts whether a given image-text pair is correctly matched or not, using the joint embedding from the multimodal encoder.
Hard Negative Sampling
- Hard Negatives: For the ITM task, the model mines hard negatives—semantically similar but incorrect pairs—using contrastive similarity, providing a more challenging training signal.
- In-Batch Sampling: Negatives are sampled from the same mini-batch based on similarity scores, ensuring efficient training without additional computational overhead.
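A minimal sketch of this similarity-weighted in-batch sampling is given below; it is a simplification of ALBEF's actual negative mining.

```python
# Sketch of in-batch hard-negative mining for ITM: for each image, sample a
# non-matching text with probability proportional to contrastive similarity
# (and vice versa for texts).
import torch

def sample_hard_negatives(sim: torch.Tensor) -> torch.Tensor:
    """sim: (B, B) image-to-text similarity matrix; returns one hard-negative text index per image."""
    weights = sim.softmax(dim=-1).clone()
    weights.fill_diagonal_(0.0)                 # never pick the positive pair
    return torch.multinomial(weights, num_samples=1).squeeze(-1)

sim = torch.randn(4, 4)
neg_text_idx = sample_hard_negatives(sim)       # hard-negative texts for each image
neg_image_idx = sample_hard_negatives(sim.t())  # hard-negative images for each text
```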
Model Summary
- Training Setup: Pre-training is conducted on large-scale web datasets (e.g., Conceptual Captions, SBU Captions) and domain-specific datasets (e.g., COCO, Visual Genome). The model uses random image crops of resolution 256×256, with RandAugment applied during pre-training.
- Optimizer: The AdamW optimizer is used with a learning rate schedule including warm-up and cosine decay.
Training Details
- Training Duration: The model was trained for 30 epochs.
- Batch Size: Each training run used a batch size of 512.
- Hardware:
- The training was conducted on 8 NVIDIA A100 GPUs.
- This configuration provided sufficient compute for both the image and text encoders as well as the multimodal encoder.
- Optimizer:
- The model used the AdamW optimizer with a weight decay of 0.02.
- Learning Rate Schedule:
- The learning rate was warmed up to 1e-4 within the first 1,000 iterations.
- Following the warm-up phase, a cosine decay schedule reduced the learning rate to 1e-5 by the end of training.
Additional Details
- Data Augmentation: During pre-training, random image crops of resolution 256×256 were used, along with RandAugment (excluding color changes) to increase image variability.
- Fine-Tuning: For fine-tuning tasks, the image resolution was increased to 384×384, and positional encodings for image patches were interpolated.
- Momentum Update: The momentum parameter for updating the momentum model was set to 0.995, and the size of the queue used for image-text contrastive learning was 65,536.
- Distillation Weight: The distillation weight α was linearly ramped from 0 to 0.4 within the first epoch, balancing the pseudo-targets with standard supervision.
2021-08: SimVLM
Quick Notes
- Encoder-decoder model.
- No license. Closed-source.
- Organization: Google.
Key Links
- Arxiv: SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
- ICLR 2022 OpenReview: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a vision-language pretraining (VLP) paper proposing a minimalist generative model.
- It presents a new model (SimVLM) and compares it against existing state-of-the-art methods on a range of multimodal benchmarks.
What is the motivation and potential impact of the paper?
- The motivation is to simplify vision-language pretraining by eliminating complex object detection, task-specific losses, and dataset-specific objectives.
- The paper seeks to improve scalability and generalization with weak supervision and end-to-end training on a single objective, Prefix Language Modeling.
- Its potential impact includes enabling zero-shot generalization in vision-language tasks and offering a simpler, more efficient pretraining framework.
What is the relevant related work and what is this paper’s contribution?
- Previous methods (e.g., LXMERT, UNITER) use supervised object detection and complex pretraining pipelines involving auxiliary losses and human-labeled data.
- These approaches struggle with scalability and zero-shot capabilities.
- SimVLM contributes a simplified pretraining protocol using large-scale weak supervision and a single language modeling objective.
- It achieves new state-of-the-art results on several benchmarks, offering both discriminative and generative capabilities.
What are the results (Theory/Experiments)?
- Experiments: SimVLM outperforms previous VLP models on benchmarks like VQA, NLVR2, and image captioning, achieving notable gains without requiring additional data or task-specific customization.
- Zero-shot capability: SimVLM demonstrates strong generalization in zero-shot image captioning and visual question answering (VQA).
- Generative performance: SimVLM improves on open-ended VQA by generating answers in a flexible format, outperforming discriminative baselines.
Details of Note
Dataset
ALIGN: Image-text pairs dataset used for pretraining.
- Sourced from large-scale web-crawled data.
- Includes 4,096 image-text pairs per batch.
C4 (Colossal Clean Crawled Corpus): Text-only corpus.
- Pretraining batches include 512 text documents from C4.
Fine-tuning Benchmarks:
- VQA v2, NLVR2, SNLI-VE, COCO Captioning, NoCaps, and Multi30k datasets were used for fine-tuning and evaluation.
Zero-shot Experiments: Used datasets like COCO and NoCaps to test zero-shot capabilities on image captioning and cross-modality transfer.
Methodology Highlights
Objective:
- Uses a Prefix Language Modeling (PrefixLM) objective.
- PrefixLM enables bidirectional attention on the prefix (vision tokens) and autoregressive attention for the remaining tokens (text).
- The prefix length is randomly sampled during training but forced to be larger than the number of vision tokens to always apply bidirectional attention to vision input.
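A small sketch of the resulting attention pattern is shown below: positions inside the prefix (image tokens plus the sampled text prefix) attend bidirectionally, while the remaining text positions attend causally. This illustrates the PrefixLM idea rather than SimVLM's actual code.

```python
# Sketch of a PrefixLM attention mask. Returns True where attention is allowed.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    causal = torch.ones(seq_len, seq_len).tril().bool()  # standard causal mask
    mask = causal.clone()
    mask[:prefix_len, :prefix_len] = True                # full bidirectional attention inside the prefix
    return mask

# e.g. 4 image tokens + 2 prefix text tokens + 3 target text tokens:
print(prefix_lm_mask(seq_len=9, prefix_len=6).int())
```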
Architecture:
- Encoder-Decoder Transformer model.
- Vision Modality:
- Processes images as flattened patches using the first three blocks of ResNet for contextualized patches.
- The resulting patches are then fed as input tokens to the Transformer.
- Text Modality:
- Uses a standard encoder-decoder setup where vision tokens are prepended to the text input.
Generative Pretraining:
- Unlike contrastive methods like CLIP, SimVLM employs generative training, focusing on text generation rather than classification.
Model Sizes:
- Three model variants: Base, Large, and Huge.
- The variants correspond to different capacities aligned with the ViT architecture.
Unified Training:
- SimVLM is trained end-to-end with no need for object detection pretraining or auxiliary losses, reducing complexity.
Training Details of Note
Batch Composition:
- 4,096 image-text pairs from the ALIGN dataset per batch.
- 512 text-only documents from the C4 dataset per batch.
Hardware:
- Training conducted on 512 TPU v3 chips.
Training Steps:
- Pretrained from scratch for about 1 million steps on the combined dataset.
2021-11: VLMo
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- NeurIPS 2022: Conference Paper
- GitHub: microsoft/unilm
- Checkpoints:
Guiding Questions
What kind of paper is this?
It introduces a novel vision-language pre-trained model (VLMo) that unifies both dual and fusion encoders for vision-language tasks. In this way, it’s just a small idea with evaluation (the small idea being the unification of dual and fusion encoders).
What is the motivation and potential impact of the paper?
Motivation: The paper addresses the challenge of handling vision-language tasks with existing architectures. Previous models either adopt dual encoders (efficient for retrieval tasks) or fusion encoders (effective for vision-language classification but computationally expensive). There is no existing model that flexibly adapts both methods for different types of tasks while leveraging large-scale single-modality data.
Potential impact: The VLMo model can significantly improve the efficiency and accuracy of vision-language tasks across a variety of settings, particularly classification and retrieval tasks. It also demonstrates how stagewise pre-training on single-modality data (image-only, text-only) can enhance multi-modal performance, which can broaden the applicability and scalability of similar models in other multimodal contexts (e.g., speech, video).
What is the relevant related work and what is this paper’s contribution?
Related work:
The paper builds on previous vision-language pre-training models such as CLIP, ALIGN, and ViLT, which either use dual encoders or fusion encoders for vision-language tasks. These models optimize performance on vision-language retrieval or classification tasks, but they each have limitations: dual encoders perform shallow modality interaction, while fusion encoders are computationally expensive for large-scale retrieval tasks.
Contributions:
- The paper proposes a unified model (VLMo) that can function as both a dual encoder and fusion encoder, thus combining the advantages of both architectures.
- It introduces a Multiway Transformer with modality-specific experts (vision, language, vision-language) to flexibly handle different inputs (image-only, text-only, image-text pairs).
- The stagewise pre-training strategy leverages large-scale image-only and text-only data to improve performance on multimodal tasks.
- The model achieves state-of-the-art results on both vision-language classification and retrieval tasks.
What are the results (Theory/Experiments)?
Experiments: The model is evaluated on a range of vision-language tasks, including:
- Visual Question Answering (VQA)
- Natural Language for Visual Reasoning (NLVR2)
- Image-Text Retrieval on datasets like MSCOCO and Flickr30K
- Vision-only tasks, such as image classification on ImageNet and semantic segmentation on ADE20K.
Main results:
- VLMo achieves state-of-the-art results on vision-language tasks, outperforming models like UNITER, ALBEF, and SimVLM, especially in classification tasks where deeper cross-modal interactions are needed.
- VLMo-Large++ demonstrates strong performance when trained on 1B noisy image-text pairs, particularly on VQA and NLVR2, where it sets new benchmark SOTA.
- In retrieval tasks, VLMo performs on par with or better than other models, while offering a faster inference speed than fusion-encoder-based models.
Details of Note
Dataset
Pre-training utilized four image captioning datasets:
- Conceptual Captions (CC)
- SBU Captions
- COCO
- Visual Genome (VG)
Total: ~4M images and 10M image-text pairs.
Methodology Highlights
Architecture
- Vision input: ViT-style patch embeddings.
- Text input: WordPiece tokens, embedded with a max sequence length of 40.
- Initial input: Concatenated text and vision embeddings.
- Mixture of Modality Experts (Multiway Transformer):
- Shared self-attention across all modalities.
- Modality-specific feed-forward (FF) subunits for vision, language, and vision-language tasks.
- Unimodal support: Uses modality-specific FFN for vision or text inputs.
- Multimodal support: Uses unimodal FFN at base layers and multimodal FFN at the top layers (a Multiway block sketch follows this list).
- Architecture type: Encoder-based with modality flexibility.
- Model sizes:
- Base: 12 layers, 768-dimensional hidden state, 12 attention heads, VL fusion in the last 2 layers.
- Large: 24 layers, 1024-dimensional hidden state, 16 attention heads, VL fusion in the last 3 layers.
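Below is a simplified sketch of a Multiway block as described above: shared self-attention with modality-specific feed-forward experts. Dimensions and module names are illustrative, not VLMo's actual implementation.

```python
# Sketch of a Multiway (mixture-of-modality-experts) Transformer block: self-attention
# parameters are shared across modalities, while each modality routes through its own
# feed-forward expert.
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared across modalities
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ("vision", "language", "vision_language")
        })

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.experts[modality](self.norm2(x))  # modality-specific FFN expert
        return x

block = MultiwayBlock()
vision_tokens = torch.randn(2, 197, 768)
out = block(vision_tokens, modality="vision")
```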
Methods
- Pre-training tasks (stagewise):
- Contrastive learning (CLIP-style) between image and text representations.
- Masked Language Modeling (MLM) using whole-word masking.
- Binary matching classification for image-text pairs. (see: ALBEF)
Pre-training Stages
- Image-only pre-training: Initializes the vision expert and shared attention modules.
- Text-only pre-training: Initializes the language expert.
- Vision-language pre-training: Learns cross-modal representations.
Training Details
- 30 epochs, 512 batch size, using 8xA100 GPUs.
- Optimizer: AdamW with β1 = 0.9, β2 = 0.98, and 0.01 weight decay.
- Learning rate: Peak at 2e-4 (base model) / 5e-5 (large model).
- Warmup: 2.5k linear warmup steps, followed by linear decay.
2021-12: FLAVA
Quick Notes
- Encoder-only model.
- BSD-3 License.
- Weights available. Permissive license.
- Organization: Meta/FAIR.
Key Links
- Arxiv: FLAVA: A Foundational Language And Vision Alignment Model
- CVPR 2022: Conference Paper
- GitHub Site: FLAVA
- GitHub Repo: facebookresearch/multimodal
- Hugging Face: facebook/flava-full
Guiding Questions
What kind of paper is this?
- This is a small idea with comparison paper.
- It presents FLAVA, a unified vision and language model, and evaluates it on a broad set of tasks across vision, language, and multimodal domains.
What is the motivation and potential impact of the paper?
- The paper aims to create a “foundational” model that can handle vision, language, and multimodal tasks simultaneously.
- The problem addressed is that many models excel in either cross-modal or multimodal tasks but not both, limiting their generality.
- The impact is the potential to improve performance across diverse tasks with a single model, using less data than competing models like CLIP and ALIGN.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on work from CLIP, ALIGN, and other contrastive and fusion-based vision-language models.
- It addresses the limitations of contrastive methods that struggle with multimodal tasks and fusion models that ignore unimodal performance.
- The key contribution is the development of a model that jointly handles unimodal (vision or language) and multimodal tasks, along with using publicly available datasets for pretraining.
What are the results (Theory/Experiments)?
- Theory: The paper introduces new multimodal objectives like global contrastive loss, masked multimodal modeling (MMM), and image-text matching (ITM), alongside unimodal losses for vision (MIM) and language (MLM).
- Experiments: FLAVA achieves strong results across 35 tasks, outperforming similar models like CLIP in many multimodal and language tasks, and remains competitive in vision tasks, despite using smaller datasets for pretraining.
- Results show FLAVA’s strength in diverse domains, achieving superior “macro” average scores across vision, language, and multimodal benchmarks.
Details of Note
Dataset
Unimodal Data
Unimodal Text:
- CCNews
- BookCorpus
Unimodal Image:
- ImageNet-1k
Multimodal Data (Image-Text Pairs)
The total number of image-text pairs used in FLAVA’s pretraining is 70 million, derived from the following datasets:
- COCO: 0.9M image-text pairs
- SBU Captions: 1.0M image-text pairs
- Localized Narratives: 1.9M image-text pairs
- Conceptual Captions: 3.1M image-text pairs
- Visual Genome: 5.4M image-text pairs
- Wikipedia Image Text: 4.8M image-text pairs
- Conceptual Captions 12M: 11.0M image-text pairs
- Red Caps: 11.6M image-text pairs
- YFCC100M: Filtered to 30.3M image-text pairs (after applying filters for English captions and removing captions with fewer than two words)
Methodology Highlights
The methodology used in the FLAVA paper revolves around a unified architecture and several pretraining objectives designed for multimodal, vision, and language tasks.
Architecture
- Image Encoder: A ViT-style encoder (Vision Transformer), specifically using the ViT-B/16 configuration (B = base model, 16 = patch size).
- Text Encoder: Mirrors the ViT architecture but uses different (non-shared) parameters from the vision encoder. This allows the model to handle text input in the same structural manner as images, but with distinct learned representations.
- Multimodal Encoder: Also ViT-style, but introduces new parameters to fuse the outputs from the image and text encoders. This is crucial for the model’s capability to reason about multimodal inputs.
- Overall Model Size: The full model consists of three ViT-B/16 blocks (one for images, one for text, and one for multimodal), totaling approximately 450M parameters.
Multimodal Pre-Training Objectives
Global Contrastive Loss: A contrastive learning objective similar to that used in CLIP, where image-text pairs are brought closer in representation space, and mismatched pairs are pushed apart. This objective helps the model align visual and textual representations effectively.
Masked Multimodal Modeling (MMM): This involves masking patches from images and tokens from text, and then reconstructing them. A dVAE codebook is used for the image encoder to tokenize image patches in a manner similar to tokenizing text. This joint masked modeling improves both vision and language understanding simultaneously.
Image-Text Matching (ITM): A binary classification task where the model predicts whether a given image and text pair are correctly matched or not. This helps the model better understand paired multimodal data.
Unimodal Pre-Training Objectives
- Masked Image Modeling (MIM): Applied to unimodal image datasets, this task involves masking random patches of the image and reconstructing them, similar to BEiT. It helps the model improve vision-only performance.
- Masked Language Modeling (MLM): A standard pretraining objective where the model masks 15% of text tokens and predicts them, similar to BERT. This boosts the model’s language understanding capabilities when trained on large text corpora.
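As a concrete illustration of the MLM objective, the sketch below masks 15% of token positions; BERT's 80/10/10 mask/random/keep split is omitted for brevity, and the function is illustrative rather than FLAVA's data pipeline.

```python
# Sketch of BERT-style masked language modeling: 15% of positions are selected,
# their original ids kept as labels, and the inputs at those positions replaced
# with the [MASK] token id.
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = input_ids.clone()
    masked = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~masked] = -100                      # ignore unmasked positions in the loss
    input_ids = input_ids.clone()
    input_ids[masked] = mask_token_id
    return input_ids, labels

ids = torch.randint(5, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=4)
```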
Training Details
Multimodal Pre-Training
- Batch Size: 8192
- Learning Rate: Peak learning rate of 1e-3 with a warm-up of 10,000 steps
- Optimizer: AdamW
- Weight Decay: 0.1
- Training Steps: 150,000
Text Pre-Training
- Training Steps: 125,000
- Batch Size: 2048
- Learning Rate: 5e-4
Vision Pre-Training
- Method: DINO-based pretraining on unimodal vision data (e.g., ImageNet-1k)
2021-12: GLIP
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: Grounded Language-Image Pre-training
- CVPR 2022: Conference Paper
- GitHub: microsoft/GLIP
- Hugging Face: GLIPModel/GLIP
  - All weights are contained as .pth files in the Hugging Face “demo”.
Guiding Questions
What kind of paper is this?
- This is a research paper proposing a new model for grounded language-image pre-training (GLIP).
- It focuses on unifying object detection and phrase grounding and demonstrates the model’s applicability in object-level recognition tasks with strong transferability.
- It’s a small idea with a comparison paper, but presents a novel approach to multimodal learning.
What is the motivation and potential impact of the paper?
- The motivation is to overcome the limitation of current visual recognition models that predict from a fixed set of object categories, which reduces generalizability to new visual concepts.
- The impact is significant in both object detection and phrase grounding tasks, as the model allows for zero-shot and few-shot transfer and improves performance on rare categories.
What is the relevant related work and what is this paper’s contribution?
- Related work includes CLIP, ALIGN, and other models like MDETR and ViLD, which tackle image-level visual representation and multimodal tasks.
- The contribution of this paper is the development of GLIP, which unifies detection and grounding, leveraging both types of data to improve object recognition. It also scales effectively with image-text data, surpassing the state-of-the-art in many object detection tasks.
What are the results (Theory/Experiments)?
- Experiments:
- GLIP achieves strong zero-shot and few-shot performance on various benchmarks such as COCO and LVIS, surpassing many supervised baselines.
- It also transfers well to 13 downstream detection tasks with minimal data, demonstrating excellent data efficiency.
- In fine-tuning experiments, GLIP surpasses prior state-of-the-art models on COCO and LVIS, particularly in rare categories.
Details of Note
Dataset
The model is pre-trained on 27 million grounding examples, which include:
- 3 million human-annotated grounding data from datasets like Flickr30K, VG Caption, and GQA.
- 24 million web-crawled image-text pairs, where noun phrases are detected by an NLP parser and used to generate pseudo grounding boxes.
The data used for pre-training includes both detection and grounding data, which enriches the model’s ability to understand diverse and rare visual concepts.
For large-scale image-text data, the authors use 78.1 million phrase-box annotations, out of which 58.4 million unique noun phrases are present, greatly expanding the concept pool beyond traditional detection datasets.
Fine-tuning experiments were conducted on COCO and LVIS, and the model was also evaluated on 13 downstream object detection tasks to test its transferability in real-world applications.
Methodology Highlights
Unification of Object Detection and Phrase Grounding: GLIP unifies these tasks by reformulating object detection as a phrase grounding task, where object detection is treated as matching visual regions with text phrases.
Region and Token Encoding:
- Images are encoded as region patches using a visual encoder based on the DyHead module.
- Class names or phrases are tokenized and encoded using a BERT-style text encoder.
Cross-Modal Interaction:
- GLIP uses deep fusion between the visual and text modalities, allowing cross-attention between the two encoders earlier than typical late-fusion models. This interaction is crucial for learning language-aware visual representations.
Binary Classification Approach:
- The model computes region-token interactions as a binary classification problem, where true associations are treated as the positive class and all other combinations as negative (a minimal sketch follows below).
Dual Encoder Architecture:
- The architecture consists of two unimodal encoders: a DyHead visual encoder for region features and a BERT-style text encoder for phrases, with cross-attention layers enabling communication between the two modalities before the final layer.
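The sketch below illustrates the region-token binary classification described above: alignment logits are dot products between region features and token features, supervised with a binary target matrix. This is a simplification of GLIP's actual alignment loss.

```python
# Sketch of region-word alignment as binary classification.
import torch
import torch.nn.functional as F

def grounding_loss(region_feats: torch.Tensor, token_feats: torch.Tensor,
                   targets: torch.Tensor) -> torch.Tensor:
    """region_feats: (R, D); token_feats: (T, D); targets: (R, T) binary matrix."""
    logits = region_feats @ token_feats.t()          # (R, T) region-token alignment scores
    return F.binary_cross_entropy_with_logits(logits, targets)

regions = torch.randn(10, 256)
tokens = torch.randn(6, 256)
targets = torch.zeros(10, 6)
targets[0, 1] = 1.0                                  # e.g. region 0 grounds the phrase at token 1
loss = grounding_loss(regions, tokens, targets)
```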
Training Details
- Optimizer: The model is trained using the AdamW optimizer, which is standard for transformer-based architectures like GLIP’s text encoder (BERT-style).
2022-01: BLIP
Quick Notes
- Decoder-Only model
- BSD-3 License
- Weights available. Permissive license.
- Organization: Salesforce
Key Links
- Arxiv: [BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation](https://arxiv.org/abs/2201.12086)
- PMLR 2022: Conference Paper
- GitHub: salesforce/BLIP
- All BLIP checkpoints are available on Hugging Face: BLIP models
Guiding Questions
What kind of paper is this?
- This is a vision-language pre-training (VLP) framework paper.
- It proposes a unified model for both understanding-based and generation-based tasks in vision-language domains, introducing a new architecture and data bootstrapping method.
What is the motivation and potential impact of the paper?
- The paper addresses the limitation that existing VLP models excel at either understanding or generation tasks, but not both.
- It tackles the issue of noisy web data commonly used for pre-training, proposing a better method to handle it.
- The potential impact includes state-of-the-art performance on a wide range of vision-language tasks, improving both understanding and generation capabilities while making models more generalizable to video-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Existing VLP models are either encoder-based or encoder-decoder-based, and they often use noisy web data for training.
- This paper contributes by proposing the Multimodal Mixture of Encoder-Decoder (MED) architecture, which handles both understanding and generation tasks.
- It introduces CapFilt, a new method for dataset bootstrapping by using a captioner to generate synthetic captions and a filter to remove noisy captions, improving the quality of the training data.
What are the results (Theory/Experiments)?
- Experiments show BLIP achieves state-of-the-art performance on tasks like image-text retrieval, image captioning, visual question answering, and video-language tasks.
- The results indicate substantial performance gains: +2.7% on image-text retrieval, +2.8% on captioning, and +1.6% on VQA, with strong generalization to video-language tasks in a zero-shot manner.
Details of Note
Dataset
- 14M image-text pairs from multiple sources: COCO, Visual Genome, Conceptual Captions (CC), CC12M, and SBU Captions.
- LAION dataset (115M images) used with very noisy web text, filtered through CapFilt.
- The dataset is bootstrapped by generating synthetic captions using a captioner and filtering noisy captions using a filter, both fine-tuned on high-quality human-annotated data (COCO).
Methodology Highlights
- Multimodal Mixture of Encoder-Decoder (MED): A multitask architecture with three modes—unimodal encoder, image-grounded text encoder, and image-grounded text decoder.
- CapFilt: A novel dataset augmentation strategy combining a captioner
for generating synthetic captions and a filter for removing noisy
captions. Both initialized from pre-trained weights and fine-tuned
individually.
- Captioner: Generates one synthetic caption per image using nucleus sampling for more diverse captions.
- Filter: Removes noisy image-text pairs using contrastive pairing and matching objectives.
- Pre-training objectives:
- Image-text contrastive (ITC) loss for vision-language alignment.
- Image-text matching (ITM) loss for fine-grained multimodal representation.
- Language modeling (LM) loss for image-grounded text generation.
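As a rough illustration of how the three objectives above might be combined per training batch, the hedged sketch below uses generic tensors and placeholder names; it is not the Salesforce implementation, and details such as momentum distillation and hard-negative mining are omitted.

```python
# Hedged sketch of how BLIP's three pre-training losses might be combined
# (function and tensor names are placeholders, not the released code).
import torch
import torch.nn.functional as F

def blip_pretraining_loss(image_emb, text_emb, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    # ITC: symmetric InfoNCE over in-batch image-text pairs.
    sim = image_emb @ text_emb.t() / temperature                  # (B, B)
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # ITM: binary match/no-match classification on image-text pairs.
    itm = F.cross_entropy(itm_logits, itm_labels)                 # (N, 2) vs (N,)
    # LM: autoregressive caption loss from the image-grounded text decoder.
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_labels.flatten(),
                         ignore_index=-100)
    return itc + itm + lm

# Toy usage with random tensors.
B, L, V = 4, 12, 30522
loss = blip_pretraining_loss(
    F.normalize(torch.randn(B, 256), dim=-1), F.normalize(torch.randn(B, 256), dim=-1),
    torch.randn(3 * B, 2), torch.randint(0, 2, (3 * B,)),
    torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```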
Training Details
- ViT-B/16 and ViT-L/16 architectures used for image encoding, trained on 224x224 resolution during pre-training, 384x384 during fine-tuning.
- Training conducted on 2 nodes, each with 16 GPUs.
- AdamW optimizer, trained for 20 epochs.
- Batch sizes: 2880 (ViT-B) and 2400 (ViT-L).
- Learning rate: linear warm-up, decayed by 0.85 with peak rates of 3e-4 (ViT-B) and 2e-4 (ViT-L).
- Weight decay: 0.05.
2022_04_flamingo
2022-04: Flamingo
Quick Notes
- Decoder-Only model
- No weights available. No permissive license. Proprietary.
- Organization: Google (DeepMind)
Key Links
- Arxiv: Flamingo: a Visual Language Model for Few-Shot Learning
- NeurIPS 2022: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a small-idea-with-evaluation paper, proposing a new Visual Language Model (VLM), Flamingo, for few-shot learning.
- It introduces architectural innovations to integrate pre-trained vision and language models and evaluates performance across several benchmarks.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge of adapting models to new vision-language tasks with minimal annotated examples.
- The problem is significant because existing approaches require substantial task-specific fine-tuning, which is time-consuming and data-intensive.
- By solving this, the model enables rapid adaptation to new tasks in data-scarce regimes, making it impactful for multimodal AI applications.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on the recent advances in multimodal models and large language models (LLMs) like GPT-3 for few-shot learning.
- Existing vision-language models focus on classification or retrieval tasks, but Flamingo extends this capability to open-ended tasks like captioning and visual question-answering.
- The main contribution is a novel architecture that combines vision and language models with minimal task-specific training, outperforming existing methods with fewer examples.
What are the results (Theory/Experiments)?
- Experiments: Flamingo is evaluated on 16 multimodal tasks, including COCO, VQAv2, and VATEX, outperforming the fine-tuned state of the art in 6 tasks using only a few examples.
- The model is trained on large-scale multimodal data using a mixture of image-text pairs, interleaved text and image sequences, and video-text pairs. The architecture leverages pretrained vision and language models with newly added cross-attention layers.
- Results show that Flamingo achieves strong performance in both few-shot and fine-tuned settings, scaling well with model size and number of shots.
Details of Note
Dataset
- M3W:
- Extracted text and images from 43M webpages (closed-source).
- Mixture of text-image pairs with interleaved HTML content.
- ALIGN:
- Large-scale dataset of 1.8B images paired with alt-text descriptions.
- Noisy due to poor alignment between images and text.
- LTIP (Long Text & Image Pairs):
- 312M image-text pairs (closed-source), with more descriptive captions than ALIGN.
- Average 20.5 tokens per caption.
- VTP (Video & Text Pairs):
- 27M short video-text pairs (closed-source).
- Average video length: 22 seconds.
Methodology Highlights
- Vision Encoder:
- Uses NFNet-F6 model (Normalizer-free ResNet).
- Perceiver Resampler:
- Summarizes variable-length visual input into 64 output tokens.
- Flexible input size; supports sampled video frames.
- Multi-layer architecture with 6 layers, 1536 internal dimension, 16 attention heads, and Squared ReLU activation.
- 194M parameters in total.
- Text Decoder:
- Frozen decoder with gated cross-attention mechanism.
- Gating uses a tanh of a learned scalar, initialized to zero so that cross-attention initially contributes nothing, which stabilizes training (see the sketch after this list).
- Each text token cross-attends only to the visual tokens of the most recent preceding image in the interleaved sequence.
- Model Sizes:
- Flamingo-3B: 1.4B frozen LM + 1.2B learnable params.
- Flamingo-9B: 7B frozen LM + 1.6B learnable params.
- Flamingo-80B: 70B frozen LM + 10B learnable params.
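The sketch below illustrates the tanh-gated cross-attention idea referenced above: learned scalar gates start at zero, so the frozen language model's outputs are initially unchanged. Module names and dimensions are ours, and this is a single simplified layer, not DeepMind's implementation.

```python
# Sketch of Flamingo-style tanh gating around a cross-attention block
# (simplified single-layer illustration with our own module names).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learned scalar gates, initialized to 0 so tanh(gate) = 0 and the
        # frozen LM's behavior is unchanged at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ff_gate) * self.ff(x)

layer = GatedCrossAttention(dim=512)
out = layer(torch.randn(2, 16, 512), torch.randn(2, 64, 512))  # (text, visual)
```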
Training Details
- Optimizer:
- AdamW with gradient norm clipping (1.0) and 0.1 weight decay (none on Perceiver Resampler).
- Learning Rate:
- Peak learning rate of 1e-4, with 5,000 warmup steps.
- Training Duration:
- Trained for 500k steps.
- Hardware:
- Flamingo-80B model trained on 1536 TPUv4 chips for 15 days.
2022_05_coca
2022-05: CoCa
Quick Notes
- Encoder-Decoder model.
- No License. Proprietary and closed-source.
- Organization: Google.
Key Links
Guiding Questions
What kind of paper is this?
- This is a model proposal and evaluation paper.
- It proposes a new image-text foundation model called Contrastive Captioner (CoCa), unifying contrastive learning and image captioning.
What is the motivation and potential impact of the paper?
- The paper aims to improve image-text foundation models by merging contrastive learning and captioning, building on methods like CLIP and SimVLM.
- The impact lies in creating a single, unified model that performs well across multiple tasks (e.g., image classification, image captioning, crossmodal retrieval) with zero-shot transfer or minimal task-specific adaptation.
What is the relevant related work and what is this paper’s contribution?
- Related works include models like CLIP, ALIGN, and SimVLM, which focus on contrastive learning and generative tasks, respectively.
- Prior methods either excelled at crossmodal alignment (contrastive) or multimodal understanding (captioning), but none effectively combined both.
- CoCa contributes by combining contrastive and captioning losses in a minimalist encoder-decoder architecture that subsumes capabilities from both contrastive models (e.g., CLIP) and captioning models (e.g., SimVLM).
What are the results (Theory/Experiments)?
- Experiments: CoCa achieves state-of-the-art results in a variety of
tasks:
- Zero-shot ImageNet classification (86.3% accuracy).
- Crossmodal retrieval on MSCOCO and Flickr30k.
- Multimodal understanding (e.g., VQA, SNLI-VE) and image captioning (e.g., MSCOCO, NoCaps).
Details of Note
Dataset
- JFT-3B: Internal, closed dataset used for image classification pretraining.
- ALIGN: Noisy image-text dataset used for contrastive pretraining.
- Strict deduplication applied to filter near-domain examples for fair evaluation.
Methodology Highlights
- Architecture: Encoder-decoder model with an image encoder and text
decoder.
- Unimodal and multimodal decoders: The lower decoder layers omit cross-attention to produce unimodal text representations; the upper layers cross-attend to the image encoder to produce multimodal image-text representations.
- Attentional pooling layer: Learned atop the vision encoder for task-specific pooling.
- Unified approach: Combines contrastive and captioning losses in a single architecture.
- Model sizes: 383M, 787M, 2.1B parameters across different model variants.
- Pretrained at 288x288 image resolution; additional 1 epoch at 576x576.
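A back-of-the-envelope sketch of the unified objective described above, using the 2:1 captioning-to-contrastive weighting reported in the training details below; tensor shapes, the temperature, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative combination of CoCa's two objectives (our placeholder helpers).
import torch
import torch.nn.functional as F

def coca_loss(img_emb, txt_emb, caption_logits, caption_labels,
              w_caption=2.0, w_contrastive=1.0, temperature=0.07):
    # Contrastive loss between pooled image embeddings and unimodal text embeddings.
    sim = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    con = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2
    # Captioning loss from the multimodal decoder (teacher-forced next-token prediction).
    cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_labels.flatten())
    return w_caption * cap + w_contrastive * con

B, L, V = 4, 10, 32000
loss = coca_loss(F.normalize(torch.randn(B, 512), dim=-1),
                 F.normalize(torch.randn(B, 512), dim=-1),
                 torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```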
Training Details
- Batch size: 65,536 image-text pairs (half from JFT, half from ALIGN).
- Training steps: 500k steps (5 epochs on JFT, 10 epochs on ALIGN).
- Loss weighting: Captioning loss weighted twice as much as contrastive loss.
- Optimizer: Adafactor with weight decay set to 0.01.
- Learning rate schedule: Linear warmup for the first 2% of steps to a max LR of 8e-4, then linearly decayed.
- Hardware: Training took 5 days on 2,048 TPUv4 chips.
2022_05_mplug
2022-05: mPLUG
Quick Notes
- Encoder-Decoder model.
- BSD-3 License.
- Weights available. Permissive license.
- Organization: Alibaba.
Key Links
- Arxiv: mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
- EMNLP 2022: Conference Paper
- GitHub: alibaba/AliceMind
- Checkpoints available in the GitHub repository, not on Hugging Face.
Guiding Questions
What kind of paper is this?
- This is a vision-language pre-training paper.
- It introduces mPLUG, a novel cross-modal model for vision-language understanding and generation.
- The paper focuses on improving efficiency and addressing information asymmetry in cross-modal fusion by introducing cross-modal skip-connections.
What is the motivation and potential impact of the paper?
- The paper addresses inefficiencies in existing vision-language models and the problem of linguistic signal vanishing when processing long visual sequences.
- The proposed mPLUG architecture aims to enhance the efficiency of vision-language models while improving cross-modal alignment.
- The impact is significant as mPLUG demonstrates state-of-the-art results on several vision-language tasks, with strong zero-shot transferability to video-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes models like CLIP, ALBEF, BLIP, and SimVLM, which have addressed vision-language learning but often suffer from inefficiencies or information loss.
- The paper contributes by introducing cross-modal skip-connections, which balance the fusion of visual and linguistic representations and improve computational efficiency.
- It also introduces a new mechanism for asymmetric co-attention, enhancing linguistic signal processing while reducing visual redundancy.
What are the results (Theory/Experiments)?
- Theory: mPLUG uses a combination of asymmetric co-attention and skip-connections to address cross-modal fusion challenges.
- Experiments: The model achieves state-of-the-art performance on tasks like image captioning, visual question answering, and image-text retrieval.
- On benchmarks such as COCO Caption, Flickr30K, and VQA, mPLUG shows significant improvements in both accuracy and speed, outperforming models trained on much larger datasets.
Details of Note
Dataset
- The model is pre-trained on 14 million image-text pairs from several
datasets:
- COCO, Visual Genome, SBU, CC3M, and CC12M.
- These datasets include a mix of in-domain (COCO, Visual Genome) and web out-domain (CC3M, CC12M, SBU) images, providing a diverse set of image-text pairs.
- Evaluation benchmarks include:
- VQA, Visual Grounding (RefCOCO), Visual Reasoning (NLVR2, SNLI-VE), and Image Captioning (COCO, NoCaps).
Methodology Highlights
- Base architecture: Consists of unimodal encoders for vision and text, followed by a text decoder for generation tasks.
- Fusion mechanism: Introduces an asymmetric cross-attention mechanism
where the vision modality updates the text representation in S asymmetric
blocks before performing full cross-attention fusion.
- This method reduces computational complexity by limiting expensive full cross-attention until after initial asymmetric attention.
- Cross-modal skip connections: These connections skip layers of heavy attention computation, further improving efficiency by focusing fusion at different abstraction levels (see the sketch after this list).
- Transformer-based encoders: Uses 6 layers for the text encoder and the cross-modal skip network, and 12 layers for the text-generation decoder.
- Pre-training objectives:
- Image-Text Contrastive Learning (ITC) to align visual and text features.
- Image-Text Matching (ITM) as a binary classification task to predict if the image and text match.
- Masked Language Modeling (MLM), following a BERT-style objective.
- Prefix Language Modeling (PrefixLM) to predict text sequences based on the context from the image and previous text.
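The sketch below is a structural illustration of the asymmetric co-attention plus skip-connection idea referenced above: several cheap layers let text attend to (fixed) visual features before a single full fusion step over the concatenated sequence. Module names and the exact residual wiring are our assumptions, not the released AliceMind code.

```python
# Rough sketch of asymmetric co-attention followed by one connected fusion
# step with a skip connection (structural illustration only).
import torch
import torch.nn as nn

class SkipConnectedFusion(nn.Module):
    def __init__(self, dim, num_heads=8, s_asym_layers=3):
        super().__init__()
        # S cheap asymmetric layers: text queries attend to (fixed) vision keys/values.
        self.asym = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(s_asym_layers))
        # One expensive connected layer: self-attention over [vision; text].
        self.connected = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vision_tokens, text_tokens):
        original_text = text_tokens
        for layer in self.asym:
            attn_out, _ = layer(text_tokens, vision_tokens, vision_tokens)
            text_tokens = text_tokens + attn_out
        # Skip connection re-injects the pre-fusion text representation.
        fused_in = torch.cat([vision_tokens, original_text + text_tokens], dim=1)
        fused, _ = self.connected(fused_in, fused_in, fused_in)
        return fused

fusion = SkipConnectedFusion(dim=256)
out = fusion(torch.randn(2, 49, 256), torch.randn(2, 16, 256))
```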
Training Details
- Training settings:
- Pre-trained on 16 NVIDIA A100 GPUs using BF16 precision.
- Trained for 30 epochs with a batch size of 1024.
- Input resolution: Model pre-trained with image resolutions of 224x224 and 256x256; for localization, outputs are generated as four pixel-coordinate tokens.
- Optimizer: Uses AdamW optimizer with a learning rate warm-up and cosine decay.
- Benchmarks: Evaluated on several tasks, including VQA, Visual Grounding (RefCOCO), Visual Reasoning (NLVR2, SNLI-VE), and Image Captioning (COCO, NoCaps).
2022_07_glipv2
2022-07: GLIPv2
Quick Notes
- Encoder-only model
- Weights are supposed to be available, but do not appear in the official repo
- Organization: Microsoft
Key Links
- Arxiv: GLIPv2: Unifying Localization and Vision-Language Understanding
- NeurIPS 2022: GLIPv2
- GitHub: microsoft/GLIP
Guiding Questions
What kind of paper are you reading?
- This paper introduces GLIPv2, a strong extension of the original GLIP model that unifies localization and vision-language understanding.
- It’s a straightforward evaluation paper that focuses on how the model performs across a variety of tasks, including object detection and phrase grounding, and on showing that grounding-based pre-training can improve performance on general vision-language tasks.
What is the motivation and potential impact of the paper?
- The motivation is to improve vision-language models by incorporating grounding-based pre-training, which can enhance the model’s ability to understand visual concepts and their textual descriptions.
- GLIP already showed promise in object detection and phrase grounding tasks, and GLIPv2 aims to extend these capabilities to broader vision-language tasks, potentially improving performance across various multimodal applications.
What is the relevant related work and what is this paper’s contribution?
- As an extension of GLIP, GLIPv2 builds on the idea of grounding-based pre-training to enhance vision-language understanding.
- All other encoder-based models like CLIP, ALIGN, and MDETR are relevant, but GLIPv2’s focus on grounding and localization makes it unique in its approach to vision-language tasks.
What are the results (Theory/Experiments)?
- Models are evaluated on COCO object detection, LVIS, and other downstream tasks. They’re also tested on vision-language tasks like VQA.
- Results:
- GLIPv2 shows improved performance over the original GLIP model in object detection and phrase grounding tasks.
- The model also demonstrates strong performance on vision-language tasks like VQA, showcasing the benefits of grounding-based pre-training in enhancing multimodal understanding.
Details of Note
Dataset
- Pre-training data
- Detection: Objects365, COCO, LVIS, OpenImages, VG, ImageNetBoxes
- Grounding: GoldG
- Caption: Cap4M, CC15M+SBU
Methodology Highlights
- Compared with GLIP:
- GLIP showed grounded pre-training improves localization. GLIPv2 shows that grounded pre-training can improve general vision-language tasks.
- GLIPv2 adds a new loss term: an inter-image region-word contrastive loss, which helps the model learn better visual-semantic alignments. The authors argue this can be viewed as a region-word-level analogue of CLIP's image-text contrastive loss.
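A hedged sketch of what such an inter-image region-word contrastive loss can look like: region features gathered across the whole batch are contrasted against word features from every caption in the batch, so negatives include words from other images. Shapes, the temperature, and the masking convention are illustrative assumptions rather than the paper's exact formulation.

```python
# Sketch of an inter-image region-word contrastive loss (illustrative only).
import torch
import torch.nn.functional as F

def inter_image_region_word_loss(region_feats, word_feats, positive_mask,
                                 temperature=0.07):
    """region_feats:  (num_regions_in_batch, d) gathered across all images.
    word_feats:    (num_words_in_batch, d)   gathered across all captions.
    positive_mask: (num_regions, num_words) bool, True where a region is
                   grounded to a word from its own image's caption."""
    logits = region_feats @ word_feats.t() / temperature
    # For each region, softmax over every word in the batch; negatives include
    # words that belong to *other* images' captions.
    log_prob = F.log_softmax(logits, dim=-1)
    pos_log_prob = (log_prob * positive_mask).sum(-1) / positive_mask.sum(-1).clamp(min=1)
    return -pos_log_prob.mean()

loss = inter_image_region_word_loss(torch.randn(10, 256), torch.randn(20, 256),
                                    torch.rand(10, 20) > 0.8)
```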
Training Details
- GLIPv2-Tiny
- Based on Swin-T
- 32 GPUs (type not specified) w/ batch size of 64
- Different learning rates for language backbone (1e-5) and vision backbone (1e-4)
- Learning rate reduced 0.1 at fixed points
- 330,000 steps total
- Max token length of 256
- Trained for an additional 300,000 steps with MLM turned off
- GLIPv2-Base
- Based on Swin-B
- 64 GPUs (type not specified) w/ batch size of 64
- All params use 1e-4 learning rate
- Same learning rate schedule as Tiny
- 1 million steps total
- Max token length of 256
- Trained for an additional 500,000 steps with MLM turned off
- GLIPv2-Huge
- Based on CoSwin-Huge
- 64 GPUs (type not specified) w/ batch size of 64
- Mostly the same as Base, but there is no additional stage with MLM turned off
2022_08_beit3
2022-08: BEiT-3
Quick Notes
- Encoder-only model.
- MIT License.
- Weights available. Permissive license.
- Organization: Microsoft.
Key Links
- Arxiv: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
- CVPR 2023: Conference Paper
- GitHub: microsoft/unilm
- Checkpoints: (Contained as links in the GitHub repository; no official checkpoint on Hugging Face)
Guiding Questions
What kind of paper is this?
- This is a methodological and empirical paper.
- It introduces BEIT-3, a general-purpose multimodal foundation model and evaluates it across multiple vision and vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper aims to solve the problem of unified multimodal modeling by treating images as a foreign language (termed “Imglish”).
- The motivation is to simplify pretraining objectives and improve scalability for multimodal models.
- The potential impact lies in improving the generalization and transferability of multimodal models across a variety of vision and vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on the masked data modeling paradigm seen in BERT for language and BEIT for images.
- Related work includes CLIP, SimVLM, CoCa, and others that focus on image-text contrastive learning or multi-task pretraining objectives.
- The main contribution is introducing BEIT-3, a model that scales to billions of parameters, uses a simplified pretraining task (masked data modeling), and shows strong performance across tasks using only public datasets.
What are the results (Theory/Experiments)?
- Experiments: BEIT-3 achieves state-of-the-art results on several benchmarks, including COCO, ImageNet, ADE20K, and VQA, outperforming models like CoCa and Florence.
- Theory: The model demonstrates that treating images as a “language” enables more efficient and effective masked modeling, with significant performance gains over previous models.
- Results emphasize the scalability and versatility of BEIT-3 across vision and vision-language tasks, using a smaller batch size than previous models.
Details of Note
Dataset
Multimodal Data: BEIT-3 is pretrained using approximately 15 million images and 21 million image-text pairs from public datasets, including:
- CC12M (Conceptual Captions 12M)
- CC3M (Conceptual Captions 3M)
- SBU Captions
- COCO (Common Objects in Context)
- Visual Genome
Vision Data: The model is also pretrained on ImageNet-21K, which contains 14 million images for vision tasks.
Text Data: For the text-only pretraining, BEIT-3 uses a 160GB corpus consisting of:
- English Wikipedia
- BookCorpus
- OpenWebText
- CC-News
- Stories
These datasets are all publicly accessible, and no private datasets were used for pretraining, ensuring academic reproducibility. Despite this, BEIT-3 outperforms other models that utilize in-house data.
Methodology Highlights
Architecture:
- BEIT-3 uses a Multiway Transformer architecture, which includes:
- A shared self-attention module across modalities (vision, language, and vision-language).
- A pool of feed-forward network (FFN) experts for each modality (vision, language, and vision-language), allowing the model to capture modality-specific information.
- The giant model has 1.9 billion parameters, with separate parameter allocations for vision, language, and vision-language experts.
- BEIT-3 uses a Multiway Transformer architecture, which includes:
Pre-training:
- The model is pre-trained using masked data modeling objectives:
- Masked Language Modeling (MLM): Similar to BERT, it masks 15% of text tokens and learns to recover the original tokens.
- Masked Image Modeling (MIM): Masks 40% of image patches and reconstructs discrete visual tokens using a pre-trained image tokenizer (VQ-KD distilled from CLIP).
- Masked Vision-Language Modeling (MVLM): Masks tokens or patches in image-text pairs, learning to recover them based on the context of both modalities.
- The use of masked objectives (MLM, MIM, MVLM) allows BEIT-3 to be pre-trained with smaller batch sizes compared to contrastive-based methods (e.g., CLIP, CoCa) that require large batches to contrast pairs.
- The simplicity of the mask-then-predict task enables easier implementation of gradient accumulation, unlike contrastive methods.
Training Efficiency:
- The model uses a 1024 sequence length during pre-training.
- Smaller batch sizes make BEIT-3 more efficient to scale, while still providing strong generalization for downstream tasks.
Training Details
Training Steps: BEIT-3 is pre-trained for 1 million steps.
Batch Size: The total batch size is 6144 samples, composed of:
- 2048 image samples,
- 2048 text samples,
- 2048 image-text pairs.
Optimizer: The model uses the AdamW optimizer with the following hyperparameters:
- Peak learning rate: 1e-3,
- Warmup: Linear warmup for the first 10,000 steps,
- Learning rate schedule: Cosine decay after warmup.
Weight Decay: A weight decay of 0.05 is applied.
Training Time & Hardware: The pretraining process takes approximately 2 weeks and is conducted on 256 A100 40GB GPUs.
2022_09_pali
2022-09: PaLI
Quick Notes
- Encoder-Decoder model.
- No License. Proprietary and closed-source.
- Organization: Google.
Key Links
- Arxiv: PaLI: A Jointly-Scaled Multilingual Language-Image Model
- ICLR 2023: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a large-scale model introduction and performance evaluation paper.
- It introduces the PaLI model, which jointly scales language and vision components to perform multimodal tasks across multiple languages.
What is the motivation and potential impact of the paper?
- The paper seeks to address the challenge of joint vision-language modeling at scale, using both modalities more effectively.
- It emphasizes the importance of balancing scaling between vision and language models, claiming this approach improves accuracy across tasks like captioning and visual question-answering.
- The potential impact is providing a scalable, multilingual, multimodal model that advances the state-of-the-art in vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related works include models like SimVLM, CoCa, Flamingo, and Vision Transformers (ViTs), which focus on vision-language tasks but often prioritize language scaling over vision scaling.
- The paper contributes by:
- Introducing PaLI, which balances parameter scaling between the vision and language components.
- Training a massive multilingual dataset (WebLI), which supports over 100 languages, and achieves SOTA performance in various vision-language tasks.
What are the results (Theory/Experiments)?
- Theory: The paper demonstrates that scaling vision models (ViT-e) provides substantial performance gains when compared to smaller vision backbones, especially in joint tasks.
- Experiments:
- PaLI outperforms previous SOTA models in benchmarks such as COCO Captioning, VQAv2, and OKVQA.
- The 17B-parameter version achieves better results than models with significantly skewed parameter distributions between vision and language components.
Details of Note
Dataset
- WebLI dataset: 10 billion images, tens of billions of image-text pairs.
- Multilingual: Supports over 100 languages.
- Image-text pairs: Filtered for quality using cross-modal similarity scoring, top 10% retained (~1 billion examples).
- Closed-source: Built using internal tools, limited details on construction.
- OCR data: Includes 29 billion image-OCR pairs.
Methodology Highlights
- Pre-training tasks: 8 diverse tasks designed to enhance multimodal
capabilities.
- Span corruption on text-only data (similar to masked language modeling).
- Split-captioning task where captions are split and completed by the model.
- CC3M captioning in 35 languages.
- OCR task using WebLI OCR-text data.
- English and multilingual VQA and VQG.
- English-only object-aware VQA.
- Object detection task similar to pix2seq.
- Architecture:
- Pre-trained vision (ViT-G, ViT-e) and language (mT5-L, mT5-XXL) models.
- Standard ViT encoder feeding into a fusion encoder with a text decoder.
- ViT weights are frozen during training, updating only the language model.
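A tiny sketch of the freezing scheme above, assuming generic nn.Module stand-ins for the (unreleased) ViT and mT5 components: the vision tower's parameters are frozen and only language-model parameters are passed to the optimizer.

```python
# Hedged sketch of PaLI's freezing setup with generic placeholder modules.
import torch.nn as nn

def configure_pali_style_training(vision_encoder: nn.Module, language_model: nn.Module):
    for p in vision_encoder.parameters():
        p.requires_grad = False          # ViT weights stay fixed during pre-training
    # Only language-model parameters would be handed to the optimizer.
    return [p for p in language_model.parameters() if p.requires_grad]

trainable = configure_pali_style_training(nn.Linear(8, 8), nn.Linear(8, 8))
```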
Training Details
- Hardware: Largest model trained on 1024 TPUv4 chips for 7 days.
- High-resolution phase: An additional 3 days on 512 TPUv4 chips.
- Optimizer: Adafactor, with a peak learning rate of 5e-3.
- Data size: Pre-training uses 260TB of data.
- Training objective: Standard softmax cross-entropy with teacher forcing.
2022_10_pix2struct
2022-10: Pix2Struct
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Google (DeepMind)
Key Links
- Arxiv: Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
- PMLR 2023: Conference Paper
- GitHub: google-research/pix2struct
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper is this?
- This is a pretrained model proposal paper.
- It introduces Pix2Struct, a pretrained image-to-text model designed for visually-situated language understanding.
- The paper focuses on a general-purpose model capable of handling multiple tasks across various domains.
What is the motivation and potential impact of the paper?
- The paper addresses the need for a unified approach to visually-situated language understanding, where text and visuals are closely intertwined, such as in web pages, documents, and UIs.
- The problem is important because previous work has been fragmented, relying on domain-specific pipelines with limited data/model sharing.
- The potential impact includes enabling more general-purpose models capable of handling diverse visual language tasks without relying on domain-specific pipelines.
What is the relevant related work and what is this paper’s contribution?
- Related work includes models like Donut (OCR-focused for document understanding), VisionTaPas, and GIT2 for image captioning and visual language tasks.
- Existing approaches often rely on task-specific engineering or external systems like OCR, which limits adaptability.
- Pix2Struct’s main contribution is screenshot parsing as a pretraining objective, which subsumes pretraining signals like OCR and image captioning into a more general framework. It integrates visual and language inputs holistically, leading to improved performance in low-resource domains like UIs and illustrations.
What are the results (Theory/Experiments)?
- Experiments: Pix2Struct outperforms previous visual-only models (like Donut) in eight out of nine tasks across four domains: documents, illustrations, UIs, and natural images.
- It achieves state-of-the-art results in six out of the nine benchmarks, excelling in low-resource tasks such as UI and illustration understanding.
- The model lags in tasks with high-resource domains like document understanding, where OCR pipelines still dominate, but shows promise for future scaling.
Details of Note
Dataset
- 80M screenshots collected from URLs in the C4 dataset.
- Screenshots taken at 1024x1024 resolution.
- Pretraining includes a reading curriculum stage using BooksCorpus, where text is rendered as images with random fonts and colors.
Methodology Highlights
- Screenshot parsing pretraining: Condenses HTML DOM tree, retaining only visible elements for modeling in output.
- Masked span-based modeling: 50% of the text is masked during pretraining for better context modeling.
- Flexible aspect-ratio ViT: Scales input images to extract the maximal number of fixed-size patches, enabling the model to handle extreme aspect ratios and variable sequence lengths.
- ViT-encoder + text-based decoder architecture: Integrates vision and language by rendering language prompts (e.g., VQA questions) directly onto the input image, so no separate text-input pathway is needed on the encoder side.
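To illustrate the flexible aspect-ratio input described above, the sketch below rescales an image, preserving its aspect ratio, so that as many fixed-size patches as possible fit within a sequence-length budget; the arithmetic is a simplified approximation, not the google-research/pix2struct preprocessing code.

```python
# Sketch of aspect-ratio-preserving rescaling to fill a patch budget.
import math

def target_size(height, width, patch_size=16, max_patches=2048):
    # Largest scale such that (h/p) * (w/p) stays within max_patches.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(1, math.floor(scale * height / patch_size))
    cols = max(1, math.floor(scale * width / patch_size))
    return rows * patch_size, cols * patch_size  # resize target in pixels

print(target_size(1024, 512))  # a tall screenshot keeps its 2:1 aspect ratio
```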
Training Details
- Pretraining (Reading stage): 30K steps, 128 patch input length.
- Pretraining (Screenshot stage):
- Base: 270K steps, 2048 batch size, 64 TPUs.
- Large: 170K steps, 1024 batch size, 128 TPUs.
- Optimizer: Adafactor with linear warmup (1K steps) to 0.01 learning rate, followed by cosine decay.
- Input sequence length: 2048 patches; decoder sequence limited to 128 tokens, targets limited to 1024 characters.
2022_12_udop
2022-12: UDOP
Quick Notes
- Encoder-Decoder model
- MIT License
- Weights available. Permissive license.
- Organization: Microsoft
Key Links
- Arxiv: Unifying Vision, Text, and Layout for Universal Document Processing
- CVPR 2023: Conference Paper
- GitHub: microsoft/i-Code
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper is this?
- This is a foundational model paper in the domain of Document AI.
- It introduces a universal framework for multimodal document processing, unifying text, vision, and layout modalities.
- It solves multiple tasks with a single model architecture using generative pretraining for tasks like document classification, layout analysis, and question answering.
What is the motivation and potential impact of the paper?
- The motivation is to address the challenges of processing documents with multimodal content (text, images, layout) efficiently.
- It aims to unify multiple tasks and modalities under a single model to improve document processing tasks across diverse domains.
- The potential impact is significant, providing a foundation model for document AI that could revolutionize tasks like document understanding, classification, and customization.
What is the relevant related work and what is this paper’s contribution?
- Relevant work includes vision-language models and document AI models like LayoutLM and TILT, which integrate vision and text but treat modalities separately or with limited fusion.
- The contribution is a unified framework (Vision-Text-Layout Transformer) that effectively fuses vision, text, and layout modalities into a single representation for all tasks.
- It introduces novel self-supervised and supervised pretraining objectives, showing superior performance on several document AI tasks.
What are the results (Theory/Experiments)?
- Theory: The paper presents a novel architecture (VTL Transformer) and joint pretraining objectives, introducing layout-induced vision-text embeddings.
- Experiments: It evaluates the model on several benchmarks like FUNSD, CORD, and DocVQA, achieving state-of-the-art results on 8 Document AI tasks.
- It also demonstrates unique capabilities like high-quality document generation and editing, which have not been achieved in prior Document AI models.
Details of Note
Dataset
- IIT-CDIP: A large-scale dataset containing 11 million scanned documents with token-level OCR bounding boxes.
- Supervised datasets used include: FUNSD (entity recognition), CORD (key information extraction), RVL-CDIP (document classification), and DUE-benchmark, which comprises 7 datasets such as DocVQA, InfoVQA, KLC, PWC, DeepForm, WTQ, and TabFact.
Methodology Highlights
- Layout-Induced Vision-Text Embeddings: Combines image patches with word tokens using bounding box information to enforce interaction between text and visual modalities; image patches without text are concatenated to the sequence (see the sketch after this list).
- Position Bias: Applies 2D relative attention based on bounding boxes for text; no 1D sequence bias is used.
- Vision-Text-Layout (VTL) Transformer: Unified model architecture with a text-layout decoder (producing tokens with discrete layout) and a vision decoder (masked autoencoder for pixel generation).
- Pretraining Objectives:
- Joint Text-Layout Reconstruction: Mask part of the text and regenerate both the text content and layout.
- Layout Modeling: Predict the location of groups of text tokens.
- Visual Text Recognition: Predict text present at a given location.
- Masked Image Reconstruction: Reconstruct pixel values from text and layout with cross-attention to raw character embeddings.
- Supervised Objectives: Include document classification, layout analysis, information extraction, question answering, and document NLI.
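As a rough illustration of layout-induced vision-text embeddings (referenced in the first bullet above), the sketch below adds each OCR token embedding to the embedding of the image patch whose grid cell contains the token's bounding-box center. Grid size, tensor shapes, and the simple additive fusion are our assumptions, not the released UDOP code.

```python
# Illustrative layout-induced fusion of OCR tokens with image patches.
import torch

def fuse_text_into_patches(patch_emb, token_emb, token_boxes, grid=14):
    """patch_emb:  (grid*grid, d) image patch embeddings.
    token_emb:   (T, d) OCR word-token embeddings.
    token_boxes: (T, 4) normalized [x0, y0, x1, y1] boxes in [0, 1]."""
    centers = (token_boxes[:, :2] + token_boxes[:, 2:]) / 2          # (T, 2)
    col = (centers[:, 0] * grid).long().clamp(0, grid - 1)
    row = (centers[:, 1] * grid).long().clamp(0, grid - 1)
    patch_idx = row * grid + col                                      # (T,)
    fused = token_emb + patch_emb[patch_idx]   # joint vision-text embedding per token
    return fused, patch_idx

fused, idx = fuse_text_into_patches(torch.randn(196, 768), torch.randn(5, 768),
                                    torch.rand(5, 4))
```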
Training Details
- Optimizer: Adam, peak learning rate of 5e-5 with 1000 warmup steps, weight decay of 1e-2 (possibly AdamW).
- Batch Size: 512.
- Epochs: 3 epochs total, one for each resolution stage (224, 512, 1024).
- Curriculum Learning: Training starts at 224 resolution and scales up to 1024 resolution over epochs.
2023_01_blip2
2023-01: BLIP-2
Quick Notes
- Encoder-Decoder or Decoder-Only model; the framework is agnostic to the LLM type.
- BSD-3 License
- Weights available. Permissive license.
- Organization: Salesforce
Key Links
- Arxiv: BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
- PMLR 2023: Conference Paper
- GitHub: salesforce/LAVIS
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper are you reading?
- This is a vision-language pre-training paper.
- It proposes a novel method, BLIP-2, for vision-language pre-training by using frozen image encoders and frozen large language models (LLMs), bridging the modality gap with a lightweight Querying Transformer (Q-Former).
What is the motivation and potential impact of the paper?
- The paper addresses the high computation cost of vision-language pre-training by leveraging frozen pre-trained models instead of full end-to-end training.
- The problem is important because scaling large multimodal models has become costly, and BLIP-2 provides a more efficient strategy.
- The potential impact is that BLIP-2 can achieve state-of-the-art performance on various vision-language tasks while being more compute-efficient than prior methods.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on work in vision-language models such as Flamingo, which combine image encoders and large language models, but BLIP-2 differentiates itself by using frozen unimodal models.
- Existing approaches (e.g., Flamingo) rely on an image-to-text generation loss but face modality alignment challenges.
- BLIP-2 contributes a two-stage pre-training strategy with a Querying Transformer that efficiently extracts visual features for the LLM without requiring end-to-end fine-tuning of the models.
What are the results (Theory/Experiments)?
- Experiments: BLIP-2 achieves state-of-the-art performance on tasks like Visual Question Answering (VQAv2), image captioning, and image-text retrieval, outperforming models like Flamingo80B with fewer trainable parameters.
- It excels in zero-shot tasks such as image-to-text generation, demonstrating visual conversation and instruction following capabilities.
- The method is shown to be both efficient (using frozen components) and powerful, significantly reducing the number of parameters while maintaining high performance.
Details of Note
Dataset
- Pre-training uses 129M images from several sources: COCO, Visual Genome, CC3M, CC12M, SBU, and LAION400M.
- Synthetic captions are generated with the CapFilt method: 10 captions are produced per image using the BLIP large model.
- Captions are ranked based on image-text similarity from CLIP ViT-L/14, and the top 2 captions are retained for training.
Methodology Highlights
- Q-Former: The core architectural component, with two transformer
sub-modules sharing self-attention layers.
- Image Transformer: Cross-attends to frozen image encoder features every other block.
- Text Transformer: Can act as both encoder and decoder, depending on the pre-training task.
- Learnable Queries: 32 queries interact with the frozen image encoder and LLM to extract relevant visual features (see the sketch after this list).
- Stage 1 (Representation Learning): Pre-training without using the LLM.
- Objectives: Image-Text Contrastive (ITC), Image-Text Matching (ITM), Image-Grounded Text Generation (ITG).
- Stage 2 (Generative Learning): Connects the frozen image encoder and
Q-Former to a frozen LLM (either decoder-only or encoder-decoder) for
vision-to-language generation.
- Language modeling: Causal or prefix-based, depending on the LLM.
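The sketch below shows the core Q-Former idea referenced above in a single simplified layer: a small set of learnable query embeddings cross-attends to frozen image features, and the resulting 32 outputs are projected into the LLM's input space. The real Q-Former is a full BERT-style transformer with shared self-attention; dimensions and names here are illustrative assumptions.

```python
# Simplified, single-layer sketch of query-based visual feature extraction.
import torch
import torch.nn as nn

class QueryExtractor(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=12, llm_dim=2560):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, llm_dim)  # output fed to the frozen LLM

    def forward(self, frozen_image_feats):           # (B, N_patches, dim)
        q = self.queries.expand(frozen_image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, frozen_image_feats, frozen_image_feats)
        return self.proj_to_llm(out)                  # (B, 32, llm_dim)

soft_prompts = QueryExtractor()(torch.randn(2, 257, 768))
```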
Training Details
- Stage 1 (Representation Learning):
- Steps: 250k steps.
- Batch Size: 2320 for ViT-L, 1680 for ViT-g.
- Stage 2 (Generative Learning):
- Steps: 80k steps.
- Batch Size: 1920 for OPT, 1520 for FlanT5.
- Precision: FP16 for ViTs and OPT, BF16 for FlanT5.
- Optimizer: AdamW with weight decay of 0.05.
- Learning Rate: Peak 1e-4 with 2k-step warmup, cosine cooldown to a minimum of 5e-5.
- Hardware: Training done on a single 16xA100 (40GB) machine.
- Training Time: ViT-g + FlanT5-XXL takes 6 days for Stage 1 and 3 days for Stage 2.
- Data Augmentation: Random resized cropping and horizontal flipping.
2023_02_kosmos
2023-02: KOSMOS-1
Quick Notes
- Decoder-Only model
- Weights unavailable: MIT License
- Organization: Microsoft
Key Links
- Arxiv: Language Is Not All You Need: Aligning Perception with Language Models
- NeurIPS 2023: Conference Paper
Guiding Questions
What kind of paper is this?
- This is a research paper introducing a new model: KOSMOS-1, a multimodal large language model (MLLM).
- It proposes aligning perception (vision) with language models to handle multimodal tasks like image captioning, visual question answering, and nonverbal reasoning.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of current large language models (LLMs) in handling multimodal data like images and audio.
- The goal is to move toward artificial general intelligence by aligning multimodal perception (e.g., vision) with language.
- The impact lies in enabling new tasks, such as robotics, document intelligence, and zero-shot multimodal reasoning, by training a model that can perceive, reason, and generate across modalities.
What is the relevant related work and what is this paper’s contribution?
- Related work includes other multimodal models like Flamingo and language models like METALM, which treat language models as general-purpose interfaces.
- Existing approaches are insufficient because they do not natively align perception with LLMs or handle instruction following and in-context learning across modalities.
- This paper contributes KOSMOS-1, which integrates multimodal input into a Transformer-based model, allowing it to learn and follow instructions across both language and vision tasks.
What are the results (Theory/Experiments)?
- KOSMOS-1 extends LLMs to multimodal tasks, allowing them to perceive and understand input from different modalities.
- Experimental results:
- KOSMOS-1 outperforms comparable models (e.g., Flamingo) in zero-shot and few-shot settings across tasks like image captioning and visual question answering.
- It demonstrates the ability to perform nonverbal reasoning on IQ tests and handle OCR-free tasks.
- KOSMOS-1 also shows cross-modal transfer capabilities, transferring knowledge between modalities (e.g., from language to vision).
Details of Note
Dataset
- Unimodal text: Includes The Pile, Common Crawl, CC snapshots, CC-Stories, and RealNews.
- Exclusions: Data from GitHub, arXiv, Stack Exchange, and PubMed Central are excluded.
- Image-caption pairs: Constructed from English LAION-2B, LAION-400M, COYO-700M, and Conceptual Captions datasets.
- Interleaved multimodal data: Created by extracting and interleaving text and images from Common Crawl web pages (approx. 71M web pages).
- Instruction tuning data: Includes Unnatural Instructions and FLANv2 (random 54k samples).
Methodology Highlights
- Architecture: Uses ViT-L/14 as the vision encoder with Resampler (attentive pooling); ViT is kept frozen except for the last layer.
- Input representation: Encoded image tokens are wrapped with special <image> and </image> tags (see the toy example after this list).
- Pretraining task: Standard next-token prediction for text, image-caption, and interleaved multimodal data.
- Position encoding: xPos (relative position encoding) is used for long-context modeling in the Magneto Transformer variant.
- Multimodal chain-of-thought: In some tasks, the model generates an intermediate rationale to improve reasoning performance.
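A toy example of how an interleaved web document might be linearized with image boundary tags, as described above; the helper and the placeholder image-token strings are hypothetical, and only the <image>/</image> tags come from the paper.

```python
# Toy linearization of an interleaved text-and-image document.
def linearize(segments):
    """segments: list of ("text", str) or ("image", list_of_image_token_strings)."""
    out = []
    for kind, payload in segments:
        if kind == "text":
            out.append(payload)
        else:  # image: wrap the encoder's image tokens in boundary tags
            out.append("<image>" + " ".join(payload) + "</image>")
    return " ".join(out)

print(linearize([("text", "An aerial photo of"),
                 ("image", ["<img_tok_0>", "<img_tok_1>"]),
                 ("text", "taken at sunset.")]))
```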
Training Details
- Pretraining setup:
- Steps: 300k total with 375 warmup steps.
- Batch sizes: 256 for text, 6144 for image-caption pairs, and 128 for interleaved data.
- Learning rate: Max of 2e-4 with linear decay.
- Optimizer: AdamW with eps=1e-6, beta=(0.9, 0.98), and weight decay of 0.01.
- Language-only instruction tuning:
- Steps: 10k with 375 warmup steps.
- Batch sizes: 256 for instructions, 32 for text, 768 for image-caption pairs, 16 for interleaved data.
- Learning rate: Max of 2e-5.
2023_02_mplug2
2023-02: mPLUG-2
Quick Notes
- Encoder-Decoder model
- Apache License 2.0
- Weights available. Permissive license.
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
- PMLR 2023: Conference Paper
- GitHub: X-PLUG/mPLUG-2
- .pth checkpoints available in the GitHub repository.
Guiding Questions
What kind of paper is this?
- This is a multi-modal pretraining model paper.
- It presents a new modularized framework (mPLUG-2) that addresses multi-modal tasks, unifying text, image, and video understanding and generation.
- The paper solves practical AI problems across multiple domains, leveraging both encoder-based and sequence-to-sequence generation paradigms.
What is the motivation and potential impact of the paper?
- The paper tackles the problem of modality entanglement when training models on multiple modalities like text, image, and video simultaneously.
- It highlights the need for modality collaboration while preventing interference across modalities during training.
- The proposed modularized framework allows for flexible, efficient scaling across various tasks.
- The potential impact is significant in multi-modal AI, where it achieves state-of-the-art or competitive performance on over 30 tasks.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on multi-modal foundation models like CLIP, OFA, and BEIT-3, but introduces a more modularized architecture.
- Prior work either focuses on sequence-to-sequence generation or encoder-based instance discrimination, but mPLUG-2 leverages both approaches through a flexible network.
- It introduces a dual-vision encoder and universal layers to handle multi-modal collaboration and disentanglement.
- The paper contributes by designing a framework that selects and combines modules for downstream tasks, improving both performance and zero-shot transferability.
What are the results (Theory/Experiments)?
- Empirical study shows state-of-the-art performance on tasks like MSRVTT video QA and video captioning, achieving new records with smaller model size and data scale.
- Achieved 48.0 top-1 accuracy on MSRVTT video QA and 80.3 CIDEr for video captioning, demonstrating strong results across vision, language, and multi-modal benchmarks.
- The zero-shot transferability is also impressive, especially on vision-language and video-language tasks.
Details of Note
Dataset
- Image captioning: COCO, Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), SBU, Visual Genome.
- Video-text: WebVid-2M.
- Text-only: WikiCorpus (20GB) and cleaned Common Crawl (350GB), mirroring the C4 dataset.
Methodology Highlights
- Dual Vision Encoder: A module that supports both images and videos, capturing spatial and temporal features using ViT-B/16 and ViT-L/14 models.
- Text Encoder: BERT-based unimodal encoder, initialized from pre-trained checkpoints.
- Universal Module: Aligns vision and language features using text-aware vision tokens and vision-aware text tokens, reducing computational complexity and aligning modalities in a shared semantic space.
- Fusion Module: Combines self-attention, cross-attention, and feed-forward networks (FFNs) to update and refine cross-modal representations.
- Shared Decoder: Supports text generation for both uni-modal and multi-modal tasks.
- Pre-training Losses:
- Language Loss: Masked Language Modeling (MLM) for text encoding.
- Image-Text Contrastive (ITC): Aligns image and text representations.
- Image-Text Matching (ITM): Matches image and text pairs, improving multi-modal understanding.
- Instruction-based LM: Handcrafted instructions for task and modality discrimination.
Training Details
- Pre-training: 30 epochs.
- Batch size:
- Base model: 1024 on 8 NVIDIA A100 GPUs.
- Large model: 512 on 16 NVIDIA A100 GPUs.
- Optimizer: AdamW with a weight decay of 0.02.
- Learning Rate:
- Warmup for 5000 steps, followed by cosine decay.
- Base model: Peak learning rate of 1e-4.
- Large model: Peak learning rate of 5e-5.
- Data Augmentation: Random cropping, horizontal flip, and sparse frame sampling for videos.
2023_04_llava
2023-04: LLaVA
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft(-ish)
Key Links
- Arxiv: Visual Instruction Tuning
- NeurIPS 2023: Conference Paper
- GitHub: haotian-liu/LLaVA
- Hugging Face checkpoints: Model Zoo
Guiding Questions
What kind of paper are you reading?
- This is a methodology paper introducing a new approach called visual instruction tuning.
- It describes the first attempt at using machine-generated multimodal instruction-following data to train a large language-vision model (LLaVA).
What is the motivation and potential impact of the paper?
- The paper addresses the lack of multimodal instruction-following data for training general-purpose vision and language assistants.
- The potential impact includes advancing multimodal AI models by improving their zero-shot capabilities on real-world vision-language tasks, which could benefit fields like automated assistants and multimodal reasoning.
What is the relevant related work and what is this paper’s contribution?
- Related work includes instruction-tuning for LLMs in natural language processing (e.g., ChatGPT, Alpaca) and existing multimodal models (e.g., CLIP, BLIP-2).
- This paper contributes by generating vision-language instruction data using GPT-4 and introducing LLaVA, a model that connects a vision encoder to an LLM for multimodal tasks, and by building new benchmarks for visual instruction-following.
What are the results (Theory/Experiments)?
- Experiments: The paper shows that LLaVA performs similarly to multimodal GPT-4 on visual reasoning tasks and achieves state-of-the-art accuracy (92.53%) on the Science QA dataset when combined with GPT-4.
- The results highlight LLaVA’s effectiveness in multimodal reasoning, outperforming existing models like BLIP-2 and OpenFlamingo on a new multimodal benchmark (LLaVA-Bench).
Details of Note
Dataset
- Visual Instruction Dataset: Created using GPT-4 to generate 158K
multimodal instruction samples.
- 58K conversations, 32K detailed descriptions, and 77K complex reasoning samples.
- Context types: Captions and bounding boxes used as symbolic image representations to prompt GPT for instruction generation.
- Source data: Filtered 595K image-text pairs from CC3M dataset to ensure relevant and diverse samples.
Methodology Highlights
- Architecture: Combines a frozen ViT-L/14 CLIP encoder with a
decoder-only LLaMA LLM.
- Vision tokens are linearly projected into the language space and prepended to text instructions.
- Response types: Three types of GPT-4-generated responses: conversation, detailed description, and complex reasoning.
- End-to-End Multimodal Model: The model connects vision and language components with minimal architectural changes, focusing on efficient training via instruction-tuning.
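A minimal sketch of the projection step described above, assuming generic tensors in place of the CLIP encoder outputs and LLM embeddings: a single linear layer maps patch features into the LLM's embedding space, and the resulting vision tokens are prepended to the text embeddings.

```python
# Sketch of LLaVA-style linear feature alignment (generic placeholder tensors).
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only weights trained in Stage 1

    def forward(self, patch_feats, text_embeds):
        vision_tokens = self.proj(patch_feats)                  # (B, N_patches, llm_dim)
        return torch.cat([vision_tokens, text_embeds], dim=1)   # fed to the LLM

inputs = VisionProjector()(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
```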
Training Details
- Two-Stage Training:
- Stage 1 (Feature Alignment): Trains the projection layer to align
visual features with the LLM’s embedding space. All other weights are
frozen.
- Uses 595K image-text pairs from CC3M for this stage.
- Stage 2 (End-to-End Fine-tuning): The vision encoder remains frozen, while the projection layer and LLM are fine-tuned on the 158K instruction-following dataset.
2023_04_mplug_owl
2023-04: mPLUG-Owl
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
- GitHub: X-PLUG/mPLUG-Owl
- Hugging Face checkpoints available.
Guiding Questions
What kind of paper are you reading?
- This is a paper proposing a novel modularized training paradigm for equipping large language models (LLMs) with multimodal abilities.
- It presents mPLUG-Owl, a system that enhances LLMs’ visual understanding while maintaining their text generation capabilities through a two-stage training process.
What is the motivation and potential impact of the paper?
- The paper aims to address the limitations of current multimodal LLMs, which often struggle with aligning vision and language tasks due to frozen visual models or inefficient alignment techniques.
- The impact lies in providing a framework that supports multiple modalities (text, vision) concurrently and shows improved instruction understanding, multi-turn conversation, and knowledge reasoning over competing models.
What is the relevant related work and what is this paper’s contribution?
- Related work includes Visual ChatGPT, MM-REACT, BLIP-2, MiniGPT-4, and LLaVA, all of which attempt to integrate vision and language but face issues with efficiency, alignment, or modularity.
- This paper’s contribution is the modularized training paradigm that enables better multimodal alignment and fine-tuning via LoRA. It also introduces OwlEval, an evaluation set for visually-related instruction tasks.
What are the results (Theory/Experiments)?
- Experiments show that mPLUG-Owl outperforms existing models like MiniGPT-4 and LLaVA in instruction understanding, visual reasoning, and multi-turn dialogue.
- The paper provides quantitative evidence via OwlEval (a custom evaluation set), where mPLUG-Owl achieves higher ratings across multiple visually-related tasks, including OCR, knowledge transfer, and multi-turn conversations.
Details of Note
Dataset
- Pre-training datasets:
- LAION-400M, COYO-700M, Conceptual Captions, and COCO for image-captioning tasks.
- Provides diverse and large-scale image-text pairs for teaching the model visual knowledge and alignment.
- Instruction tuning datasets:
- LLaVA data for multimodal instruction tasks.
- Unimodal instructions sourced from Alpaca, Vicuna, and Baize.
Methodology Highlights
Architecture:
- Vision encoder: ViT-L/14 for visual feature extraction, pretrained using CLIP.
- LLM decoder: LLaMA-7B, with an option for LLaMA-2 weights.
- Visual abstractor: Aggregates visual features using a learnable resampler module, providing higher semantic representations.
Training paradigm:
- Stage 1: Pre-training:
- Frozen LLM, trainable vision encoder (ViT) and abstractor module.
- Focuses on aligning visual and textual representations.
- Stage 2: Joint instruction tuning:
- LLM weights trained with LoRA, freezing visual components.
- Fine-tuned on both unimodal and multimodal instructions.
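The sketch below illustrates the LoRA idea used in Stage 2: the frozen base weight is augmented with a trainable low-rank update. This is a generic layer with arbitrary rank and scaling values, not Alibaba's implementation, and it omits details such as which projection matrices receive adapters.

```python
# Generic LoRA linear layer: frozen base weight plus trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the original LLM weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

layer = LoRALinear(nn.Linear(4096, 4096))
y = layer(torch.randn(2, 16, 4096))
```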
Training Details
Stage 1 (Pre-training):
- Batch size: 2.1 million tokens.
- Steps: 50k updates, consuming ~104 billion tokens.
- Optimizer: AdamW.
- Warmup: 2k steps, followed by cosine decay for learning rate.
- Image resolution: 224x224 pixels.
Stage 2 (Instruction tuning):
- Batch size: 256.
- Steps: 2k updates.
- Learning rate: 2e-5.
2023_05_palix
2023-05: PaLI-X
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
Guiding Questions
What kind of paper are you reading?
- This is a multilingual vision and language model paper.
- It focuses on scaling up the PaLI-X model for vision-language tasks and evaluates it across various benchmarks.
- The paper proposes improvements in model architecture, training mixtures, and evaluation methods for complex tasks like image captioning, VQA, and few-shot learning.
What is the motivation and potential impact of the paper?
- Motivation: To investigate the impact of scaling vision-language models (PaLI-X) in both size and task complexity, inspired by similar scaling efforts in language models.
- Importance: Vision-language models play a critical role in tasks like image captioning and VQA, which are essential for applications in AI-driven document understanding, multimodal learning, and interaction systems.
- Impact: The paper aims to push the state-of-the-art (SoTA) across a wide range of vision-language benchmarks and demonstrate emerging capabilities, including multilingual object detection and complex counting.
What is the relevant related work and what is this paper’s contribution?
- Related Work: Builds on models like PaLI, Flamingo, and ViT (vision transformers). It also leverages approaches in few-shot learning, multimodal models, and scaling trends in both language and vision models.
- Contribution:
- Demonstrates that scaling both the vision and language components leads to significant performance improvements.
- Introduces a novel training mixture that combines self-supervision and full supervision.
- Establishes new SoTA results across more than 25 benchmarks, showcasing advances in tasks such as image captioning, VQA, and document understanding.
What are the results (Theory/Experiments)?
- Experiments:
- Fine-tuning experiments on diverse benchmarks (e.g., COCO, VQAv2, NoCaps).
- Few-shot learning tasks that evaluate multilingual captioning and question-answering.
- Performance on video-based tasks and object detection tasks.
- Results:
- PaLI-X outperforms previous models, achieving new SoTA results in tasks like image captioning, VQA, and complex counting.
- Significant improvements in document and infographic understanding.
- Few-shot learning capabilities demonstrated strong multilingual performance.
- Emerging capabilities: PaLI-X exhibits abilities in multilingual object detection and complex counting that were not explicitly part of the training set.
Details of Note
Dataset
- Image captioning: COCO, Conceptual Captions 3M (CC3M), Conceptual Captions 12M (CC12M), SBU, Visual Genome.
- Video-text: WebVid-2M.
- Text-only: WikiCorpus (20GB) and cleaned Common Crawl (350GB), mirroring the C4 dataset.
Methodology Highlights
- Dual Vision Encoder: A module that supports both images and videos, capturing spatial and temporal features using ViT-B/16 and ViT-L/14 models.
- Text Encoder: BERT-based unimodal encoder, initialized from pre-trained checkpoints.
- Universal Module: Aligns vision and language features using text-aware vision tokens and vision-aware text tokens, reducing computational complexity and aligning modalities in a shared semantic space.
- Fusion Module: Combines self-attention, cross-attention, and feed-forward networks (FFNs) to update and refine cross-modal representations.
- Shared Decoder: Supports text generation for both uni-modal and multi-modal tasks.
- Pre-training Losses:
- Language Loss: Masked Language Modeling (MLM) for text encoding.
- Image-Text Contrastive (ITC): Aligns image and text representations.
- Image-Text Matching (ITM): Matches image and text pairs, improving multi-modal understanding.
- Instruction-based LM: Handcrafted instructions for task and modality discrimination.
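To make the Image-Text Contrastive (ITC) objective listed above concrete, here is a minimal sketch of a symmetric in-batch InfoNCE formulation; the temperature value and the mean reduction are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def itc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric in-batch contrastive loss aligning image and text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```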
Training Details
- Pre-training: 30 epochs.
- Batch size:
- Base model: 1024 on 8 NVIDIA A100 GPUs.
- Large model: 512 on 16 NVIDIA A100 GPUs.
- Optimizer: AdamW with a weight decay of 0.02.
- Learning Rate:
- Warmup for 5000 steps, followed by cosine decay.
- Base model: Peak learning rate of 1e-4.
- Large model: Peak learning rate of 5e-5.
- Data Augmentation: Random cropping, horizontal flip, and sparse frame sampling for videos.
2023-05: Pix2Act
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Google (DeepMind)
Key Links
- Arxiv: From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
- NeurIPS 2023: Conference Paper
- GitHub: google-deepmind/pix2act
- Checkpoints available in GitHub repository.
Guiding Questions
What kind of paper is this?
- This is a systems and benchmark paper.
- It proposes a new model (PIX2ACT) and demonstrates its capabilities on GUI-based tasks, benchmarking it against existing methods.
What is the motivation and potential impact of the paper?
- The paper addresses the problem of automating tasks through graphical user interfaces (GUIs) without relying on structured representations like HTML or DOM trees.
- The motivation stems from limitations of prior approaches that require structured data, which is often unavailable or misaligned with visual elements in GUIs.
- The potential impact is significant for digital agents, accessibility, and automation, especially in environments where structured data is inaccessible or incomplete.
What is the relevant related work and what is this paper’s contribution?
- Prior work has focused on using structured representations (like HTML, DOM) for GUI-based tasks, which limits applicability when those representations aren’t available.
- This paper’s main contribution is developing a pixel-based agent (PIX2ACT) that interacts with GUIs using screenshots and generic actions (mouse/keyboard), without relying on structured inputs.
- The paper also shows PIX2ACT can outperform human crowdworkers on the MiniWob++ benchmark using only pixel inputs.
What are the results (Theory/Experiments)?
- Theory: The paper builds on PIX2STRUCT for pixel-based pre-training and incorporates Monte Carlo Tree Search (MCTS) for policy improvement.
- Experiments:
- MiniWob++ benchmark: PIX2ACT outperforms human crowdworkers and improves task scores compared to previous models that do not access DOM information.
- WebShop benchmark: PIX2ACT establishes a baseline, but there remains a gap compared to models using HTML inputs.
- Ablation studies show that pre-training on screenshots is critical for success, especially for tasks where structured data is unavailable.
Details of Note
Dataset
- The paper adapts two benchmarks for GUI-based tasks:
- MiniWob++: A set of over 100 web-based tasks requiring interaction through visual elements. The authors support 59 tasks using pixel-based inputs and mouse/keyboard actions.
- WebShop: A shopping environment with over 1.1 million products, where the task is to find and purchase items based on human-authored instructions.
- Human demonstrations were used for both benchmarks, with 81% conversion success from the MiniWob++ demonstrations into the new action format.
- Tasks involve natural language instructions and various types of interactions like clicking, dragging, and scrolling.
Methodology Highlights
- The proposed PIX2ACT model relies solely on pixel-based inputs (screenshots) and uses generic mouse/keyboard actions to interact with GUIs.
- Builds on PIX2STRUCT, a Transformer-based image-to-text model, pre-trained to map screenshots to structured HTML representations.
- Monte Carlo Tree Search (MCTS) is used to improve policy decisions by exploring action sequences in the environment.
- Uses behavioral cloning with human demonstrations and tree search to iteratively improve the model’s performance.
Training Details
- Optimizer: Uses Adafactor with a learning rate of 0.01.
- MiniWob++: Trained for 26K steps with a batch size of 512. Policy improvement using tree search generated 826K episodes.
- WebShop: Finetuned on MiniWob++ before 10K additional steps on WebShop-specific data with a batch size of 256.
- Input lengths: Sequence lengths of 512 tokens for MiniWob++ and 4096 tokens for WebShop due to text-heavy data.
2023-05: VisionLLM
Quick Notes
- Decoder-Only model
- Weights unavailable; Apache-2.0 License
- Organization: Shanghai AI Lab
Key Links
- Arxiv: VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
- NeurIPS 2023: Conference Paper
- GitHub: OpenGVLab/VisionLLM
Guiding Questions
What kind of paper are you reading?
- This is a big idea paper proposing a new model and framework, VisionLLM, that unifies vision and language tasks.
- It introduces a large language model-based framework for vision-centric tasks, providing flexibility beyond current vision foundation models (VFMs).
What is the motivation and potential impact of the paper?
- The motivation is to bridge the gap between large language models’ (LLMs) open-ended task capabilities and the pre-defined nature of vision foundation models (VFMs).
- The paper aims to unlock flexible vision task customization using language instructions, leading to a unified approach to handling diverse tasks like object detection, segmentation, and image captioning.
- Its potential impact includes setting a new baseline for generalist vision-language models and making it easier to manage vision-centric tasks through a unified framework.
What is the relevant related work and what is this paper’s contribution?
- Related work includes large language models like GPT-4, visual-language models like Flamingo, and vision generalist models such as OFA and Pix2Seq.
- Existing approaches are limited to pre-defined tasks or visual prompt tuning that doesn’t fully integrate LLMs.
- This paper’s contribution is VisionLLM, which aligns vision tasks with LLM methodologies and introduces components like language-guided image tokenization and an LLM-based open-ended task decoder.
What are the results (Theory/Experiments)?
- Theory: VisionLLM redefines how vision tasks are formulated, using language instructions to structure outputs, allowing for customizable task handling.
- Experiments: It achieves over 60% mAP on COCO for object detection, performs well on other tasks like visual grounding and image captioning, and shows the ability to customize both task descriptions and output formats.
- The model’s generalist nature allows for strong performance on tasks that usually require separate models.
Details of Note
Dataset
- COCO2017: Used for object detection and instance segmentation tasks.
- RefCOCO/RefCOCOg/RefCOCO+: Employed for visual grounding tasks.
- LLaVA-Instruct: Synthetic dataset used for instruction fine-tuning, connecting vision and language.
Methodology Highlights
- Unified LLM-based Decoder: Applies to both vision-only and vision-language tasks, allowing flexible task customization using language instructions.
- Language-Guided Image Tokenizer: Extracts multi-scale visual features (via ResNet/InternImage-H) and updates them using cross-attended language features from a BERT-like encoder.
- Special Tokens: Augments the LLM with position and class tokens for object localization and classification.
- Training Stages: Two-stage training process:
- Stage 1: Freezes the LLM while training the visual backbone.
- Stage 2: Freezes the visual backbone and fine-tunes the LLM with LoRA-based updates.
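Stage 2 fine-tunes the LLM with LoRA-based updates. The snippet below is a generic low-rank adapter sketch (the rank and scaling values are illustrative, not taken from the paper): the pretrained projection stays frozen and only the two small matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```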
Training Details
- 50 epochs with AdamW optimizer and cosine annealing learning rate schedule.
- Peak learning rate: 2e-4.
- Hardware: 4x8 NVIDIA A100s, processing one sample per GPU.
2023-06: KOSMOS-2
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft
Key Links
- Arxiv: Kosmos-2: Grounding Multimodal Large Language Models to the World
- GitHub: microsoft/unilm
- Hugging Face checkpoints: microsoft/kosmos-2-patch14-224
Guiding Questions
What kind of paper is this?
- This is a technical research paper introducing a new multimodal large language model (MLLM) called KOSMOS-2.
- It extends previous work (KOSMOS-1) by integrating grounding capabilities, linking language to visual elements (e.g., bounding boxes).
What is the motivation and potential impact of the paper?
- The paper aims to address the challenge of grounding text in the visual world, which improves multimodal interactions.
- Motivation: Enabling human-AI interaction by allowing the model to understand image regions directly rather than relying on text descriptions.
- Potential impact: Enhances multimodal AI systems by introducing new capabilities like grounded image captioning and referring expression comprehension, contributing towards artificial general intelligence (AGI).
What is the relevant related work and what is this paper’s contribution?
- Related work: Builds on MLLMs such as KOSMOS-1, Flamingo, and grounding models like GLIP.
- Contribution: Introduces the grounding capability to MLLMs by linking text spans to image regions via location tokens, constructing the large-scale GRIT dataset of grounded image-text pairs, and expanding the applications of multimodal models to grounded vision tasks.
What are the results (Theory/Experiments)?
- Theory: The grounding capability is achieved by representing bounding boxes as location tokens integrated into the model’s language structure.
- Experiments: KOSMOS-2 is evaluated across multiple tasks, including phrase grounding, referring expression comprehension, and visual question answering.
- Results show strong performance in multimodal grounding tasks and competitive performance on vision-language tasks compared to models like MDETR and FIBER.
Details of Note
Dataset
- GRIT (Grounded Image-Text Pairs): Consists of 91M images, 115M text spans, and 137M bounding boxes.
- Mining process: Dataset created from captioning datasets (e.g., LAION-2B, COYO-700M).
- Grounding: Text spans are linked to image regions using bounding boxes, which are converted into sequences of location tokens.
- The dataset supports multiple bounding boxes for a single text phrase, enabling precise multimodal grounding.
Methodology Highlights
- Input Representation: Bounding boxes are represented by the top-left and bottom-right points, which are discretized into patch indices.
- Uses a Markdown-like format: Phrases are wrapped in special tokens, followed by box tokens that enclose the location tokens.
- Multibox support: Handles multiple bounding boxes linked to a single phrase.
- A special grounding token is used to signal when the model should ground text to the visual world.
- Transition from KOSMOS-1 to KOSMOS-2: KOSMOS-2 integrates GRIT data with KOSMOS-1’s existing multimodal corpora. KOSMOS-2’s weights are initialized from KOSMOS-1, with new tokens initialized randomly.
- Instruction tuning includes LLaVA-Instruct and newly created GRIT-based instructions.
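To make the location-token scheme concrete, here is a small sketch that discretizes a normalized bounding box into two patch-index tokens and wraps a phrase in grounding markup. The grid size and the token spellings (`<phrase>`, `<box>`, `<loc_i>`) are illustrative placeholders, not necessarily the exact vocabulary used by KOSMOS-2.

```python
def box_to_location_tokens(box, grid: int = 32) -> str:
    """Map a normalized (x0, y0, x1, y1) box to two patch-index location tokens
    on a grid x grid layout (top-left and bottom-right corners)."""
    x0, y0, x1, y1 = box
    def patch_index(x, y):
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        return row * grid + col
    return f"<loc_{patch_index(x0, y0)}><loc_{patch_index(x1, y1)}>"

def ground_phrase(phrase: str, box) -> str:
    # Markdown-like markup: the phrase span followed by its location tokens.
    return f"<phrase>{phrase}</phrase><box>{box_to_location_tokens(box)}</box>"

print(ground_phrase("a snowman", (0.12, 0.30, 0.55, 0.90)))
# <phrase>a snowman</phrase><box><loc_291><loc_913></box>
```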
Training Details
- Training hyperparameters:
- Steps: 60,000, with 375 warmup steps.
- Learning rate: Peaks at 2e-4, with linear decay.
- Optimizer: AdamW with β = (0.9, 0.98), weight decay of 0.01.
- Batch sizes: Text (93), Image-caption pairs (1117), Grounded image-text pairs (1117), Interleaved data (47).
- Instruction tuning:
- Steps: 10,000, with 375 warmup steps.
- Learning rate: 1e-5.
- Batch sizes: Text instruction (117), Vision-language instruction (351), Grounded image-text pair + grounded instruction (1404), Text (30), Interleaved data (15).
2023-06: Shikra
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: SenseTime
Key Links
Guiding Questions
What kind of paper is this?
- This is a proposal paper introducing a new Multimodal Large Language Model (MLLM) called Shikra, which enables referential dialogue.
- It addresses a gap in MLLMs where spatial coordinates in dialogue were previously unsupported.
What is the motivation and potential impact of the paper?
- The motivation is to introduce referential dialogue capabilities into MLLMs, allowing users to point to specific areas of an image and ask questions, with the model responding in natural language and providing relevant spatial coordinates.
- The impact is significant for Mixed Reality (XR), robotic communication, and online shopping, where such spatial interaction could enhance user experience.
What is the relevant related work and what is this paper’s contribution?
- Related work: It builds on existing MLLMs like Flamingo, BLIP-2, and Mini-GPT4, which combine vision and language but lack spatial referencing capabilities.
- Contribution: Shikra introduces a unified, simple architecture that handles both input and output of spatial coordinates in natural language without extra vocabularies or specialized modules. It enhances the natural dialogue capabilities of MLLMs by supporting tasks like REC, PointQA, and VQA.
What are the results (Theory/Experiments)?
- Theory: Shikra’s architecture integrates a vision encoder, an alignment layer, and a LLM, and uses natural language numerical coordinates for spatial referencing.
- Experiments: Shikra shows promising performance across tasks like REC, PointQA, and image captioning. It demonstrates superior performance compared to other generalist models on REC and PointQA tasks, but still falls behind specialist models in some areas.
Details of Note
Dataset
- Shikra-RD: Custom instruction-tuning dataset generated using GPT-4, which includes chain-of-thought (CoT) examples with spatial annotations.
- Public datasets: Utilizes Flickr30K Entities and RefCOCO datasets to provide visual data and examples for GPT-4 generation.
- Other Vision-Language (VL) datasets: Includes publicly available datasets used in the first stage of training for conventional VL tasks (e.g., image captioning, VQA, REC).
Methodology Highlights
- Architecture: Combines ViT-L/14 (vision encoder) and a decoder-only LLM (Vicuna-7B/13B).
- Coordinate representation: Uses natural language to represent coordinates in the format [x_min, y_min, x_max, y_max] for boxes and [x, y] for points.
- Multiple points/boxes separated by semicolons.
- All coordinates normalized to [0, 1], truncated to 3 decimal places.
- Instruction tuning: Bootstraps CoT instruction examples from GPT-4 using visual datasets.
- No special vocabularies: Coordinates are represented without introducing any special encoders or vocabularies for reference points.
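A minimal helper matching the plain-text coordinate convention described above (normalized values, three decimal places, multiple boxes separated by semicolons); note that this sketch rounds rather than truncates, and the function name is made up for illustration.

```python
def format_boxes(boxes, ndigits: int = 3) -> str:
    """Render normalized boxes as plain-text coordinates:
    [x_min,y_min,x_max,y_max]; multiple boxes joined by ';'."""
    def fmt(v: float) -> str:
        return f"{max(0.0, min(1.0, v)):.{ndigits}f}"
    return ";".join("[" + ",".join(fmt(v) for v in box) + "]" for box in boxes)

print(format_boxes([(0.102, 0.334, 0.657, 0.912), (0.2, 0.2, 0.5, 0.5)]))
# [0.102,0.334,0.657,0.912];[0.200,0.200,0.500,0.500]
```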
Training Details
- Two-stage training:
- Stage 1: Trained on all VL datasets for 100,000 steps (approx. 1.5 epochs).
- Stage 2: Fine-tuned on Shikra-RD and LLaVA-Instruct-150K with increased sampling ratio.
- Optimizer: AdamW with frozen visual encoders, no warmup.
- Learning rate: Peak LR of 2e-5, with cosine annealing for cooldown.
- Compute: Trained on 8 NVIDIA A100 GPUs, taking 100 hours for stage 1 and 20 hours for stage 2.
2023-08: Qwen-VL
Quick Notes
- Decoder-Only model
- Weights available: Tongyi License
- Organization: Alibaba
Key Links
- Arxiv: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- GitHub: QwenLM/Qwen-VL
Guiding Questions
What kind of paper is this?
- This is a research paper that introduces and evaluates a new series of large-scale vision-language models (LVLMs), named Qwen-VL.
- It proposes a novel model for multimodal tasks involving both text and image inputs, targeting tasks like image captioning, question answering, and fine-grained visual understanding.
What is the motivation and potential impact of the paper?
- The motivation is to enhance large language models (LLMs) with the ability to process and understand visual information, expanding their utility beyond text-based tasks.
- The impact is significant, as the Qwen-VL models achieve state-of-the-art results across various vision-language benchmarks and have multilingual capabilities, which could accelerate research and practical applications in fields requiring multimodal understanding.
What is the relevant related work and what is this paper’s contribution?
- Related work includes LVLMs like Flamingo, BLIP-2, and Kosmos-2, which also aim to integrate vision and language processing.
- The main contribution of this paper is the Qwen-VL series, which introduces a robust visual receptor, a three-stage training pipeline, and fine-grained visual understanding. The models surpass previous open-source LVLMs in tasks like image captioning, visual question answering, and object grounding.
What are the results (Theory/Experiments)?
- Experiments: The paper presents results on a wide range of benchmarks, including image captioning (Flickr30K, Nocaps), general VQA (VQAv2, OKVQA, GQA), and text-oriented visual understanding (TextVQA, DocVQA).
- The results demonstrate that Qwen-VL and Qwen-VL-Chat models outperform other generalist models on several benchmarks, achieving leading performance in fine-grained tasks like text-reading and object localization.
- The paper also shows that Qwen-VL-Chat excels in real-world user behavior tests, outperforming similar models in instruction-following tasks.
Details of Note
Dataset
- Stage 1:
- Used a combination of publicly available English and Chinese caption datasets totaling 5 billion image-text pairs, cleaned to 1.4 billion.
- English data sources: LAION-en, LAION-COCO, DataComp, Coyo, CC12M, CC3M, SBU, COCO.
- Chinese data includes private in-house data and LAION-zh.
- Stage 2:
- Multitask pre-training datasets include:
- Captioning data (same as Stage 1 but reduced in size).
- Visual Question Answering (VQA) datasets: GQA, VGQA, VQAv2, DVQA, OCR-VQA, DocVQA, TextVQA, ChartQA, AI2D.
- Grounding and reference tasks: GRIT, Visual Genome, RefCOCO, RefCOCO+, RefCOCOg.
- OCR data from synthetic English and Chinese datasets (SynthDoG-en, Common Crawl PDF & HTML).
Methodology Highlights
- Architecture:
- Base LLM: Qwen-7B, visual encoder: ViT-bigG/14 from OpenCLIP.
- Introduced a VL Adapter: A cross-attention layer compressing image features to 256 fixed-length vector representations.
- Special tokens used for bounding boxes (`<box>...</box>`), image content (`<img>...</img>`), and text references (`<ref>...</ref>`) in input/output.
- Training Pipeline:
- Stage 1: Vision-language alignment with frozen Qwen-7B and unfrozen visual encoder and adapter, trained on weakly-labeled image-text pairs.
- Stage 2: Multitask pre-training with high-resolution image inputs, covering tasks like captioning, VQA, grounding, and OCR.
- Stage 3: Instruction tuning for Qwen-VL-Chat, fine-tuning on dialogue tasks involving multi-image comprehension, grounding, and multilingual support.
Training Details
- Stage 1:
- Optimizer: AdamW with a cosine learning rate schedule (max lr: 2e-4, min lr: 1e-6), 500-step warmup, 5e-2 weight decay.
- 50k steps with a batch size of 30,720, processing 1.5 billion image-text samples (~500 billion image-text tokens).
- Stage 2:
- Similar AdamW optimizer setup, with 19k steps and 400-step warmup.
- Stage 3:
- Instruction tuning with mixed multi-modal and pure text dialogue data.
- Emphasis on multi-image and fine-grained comprehension for dialogue models.
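The Stage 1 learning-rate schedule above (500-step warmup to a peak of 2e-4, cosine decay toward 1e-6 over the 50k steps) can be sketched as follows; the exact handling of the decay endpoint is an assumption.

```python
import math

def lr_at(step: int, max_lr: float = 2e-4, min_lr: float = 1e-6,
          warmup: int = 500, total: int = 50_000) -> float:
    """Linear warmup to max_lr, then cosine decay toward min_lr."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 500, 25_000, 50_000):
    print(s, f"{lr_at(s):.2e}")   # 0.00e+00, 2.00e-04, ~1.0e-04, 1.00e-06
```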
2023-09: KOSMOS-2.5
Quick Notes
- Decoder-Only model
- Weights available: MIT License
- Organization: Microsoft
Key Links
- Arxiv: KOSMOS-2.5: A Multimodal Literate Model
- GitHub: microsoft/unilm
- Hugging Face checkpoint: microsoft/kosmos-2.5
Guiding Questions
What kind of paper is this?
- This is a research paper proposing a new multimodal literate model, KOSMOS-2.5, for document-level machine reading.
- It focuses on extending large language models to handle text-intensive images through two tasks: spatially-aware text recognition and image-to-markdown generation.
- The paper introduces a pre-trained model and its fine-tuned variant, KOSMOS-2.5-CHAT, with applications in document understanding.
What is the motivation and potential impact of the paper?
- The problem is enabling multimodal models to handle text-intensive images, including document-level reading and structural understanding.
- Current OCR and multimodal models fail to capture reading order and structure comprehensively, which are critical for accurate document understanding.
- The potential impact is significant for Artificial General Intelligence (AGI) as it advances models’ ability to understand complex text-rich images like academic papers, receipts, and web pages.
What is the relevant related work and what is this paper’s contribution?
- Related work includes OCR-based models (like Tesseract) and multimodal models (like GPT-4o and Donut), which either focus on line-level recognition or structured parsing in limited domains.
- Existing approaches fail to provide comprehensive document-level reading, especially in diverse domains.
- The paper contributes KOSMOS-2.5, a model capable of handling document-level text recognition and image-to-markdown tasks, and introduces two new benchmarks (OCREval, MarkdownEval) for evaluation.
What are the results (Theory/Experiments)?
- The paper presents a unified Transformer-based framework with a ViT-based vision encoder and a language decoder. It introduces a shared architecture for handling both spatially-aware text recognition and markdown generation.
- Experiments: KOSMOS-2.5 achieves superior results on the OCREval and MarkdownEval benchmarks, outperforming models like GPT-4o and Nougat in document-level text recognition and markdown generation tasks.
- The model excels across diverse document categories and tasks, demonstrating impressive performance relative to models with significantly more parameters.
Details of Note
Dataset
- Text-intensive image data:
- IIT-CDIP: 27.6M scanned pages.
- arXiv papers: 20.9M pages.
- PowerPoint slides: 6.2M pages.
- General PDFs: 155.2M pages crawled from the web.
- Web screenshots: 100M pages.
- Markdown data:
- GitHub README files: 2.9M.
- DOCX files: 1.1M from the web.
- LaTeX papers: 3.7M pages from arXiv.
- HTML pages: 6.3M converted to markdown.
- Total: ~300M pages of text-intensive image data, ~14M pages in markdown format.
- Data filtering includes language detection, deduplication, and removing malformed data.
- Uses augmentation techniques such as rotations, blurring, upscaling, and downscaling from TrOCR.
Methodology Highlights
- Model architecture:
- Vision Transformer (ViT) encoder combined with a Transformer language decoder.
- Perceiver resampler compresses visual features into a constant-length sequence.
- Supports flexible x/y coordinate handling for bounding boxes, compatible with Pix2Struct.
- Vocab size increased to 108,481 tokens (from ~64k in earlier models).
- Core tasks:
- Document-level text recognition, assigning spatial coordinates to text.
- Image-to-markdown generation to preserve document structure and style.
- Bounding boxes rendered by referencing x/y coordinates for each text line.
- Markdown output enhances structural sensitivity, especially for tables and complex formats.
Training Details
- Training setup:
- Total steps: 200k, with 375 steps warmup.
- Batch size: 1024.
- Learning rate: 2e-4 with linear decay.
- Standard settings for AdamW optimizer: weight decay, betas (0.9, 0.98), and epsilon.
- Training dataset:
- Total of ~260B tokens.
- Two-stage training: first on layout-based data (100k steps), followed by combined data (140k steps).
2023-10: LLaVA-1.5
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Microsoft
Key Links
- Arxiv: Improved Baselines with Visual Instruction Tuning
- CVPR 2024: Conference Paper
- For code and weights, see the page on LLaVA.
Guiding Questions
What kind of paper is this?
- This is a systematic study and baseline improvement paper.
- It investigates design choices for large multimodal models (LMMs) under the LLaVA framework and explores improvements to state-of-the-art models through modifications in architecture and training.
What is the motivation and potential impact of the paper?
- The paper aims to improve the efficiency and performance of LMMs, which are central to creating general-purpose visual-language assistants.
- It addresses the problem of balancing data efficiency, model performance, and scaling challenges in LMMs.
- The paper’s impact lies in its ability to offer a simple, reproducible baseline (LLaVA-1.5) that uses publicly available data and achieves state-of-the-art performance across 11 benchmarks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes LLaVA, InstructBLIP, and Qwen-VL, which also explore visual instruction tuning and vision-language model training. These models use various techniques such as resamplers (Qformer) or larger image-text datasets.
- The paper’s main contribution is the introduction of LLaVA-1.5, which improves upon LLaVA by using an MLP connector and incorporating academic-task-oriented VQA datasets. This achieves state-of-the-art performance with a much smaller training set and computational resources.
What are the results (Theory/Experiments)?
- Experiments:
- LLaVA-1.5 achieves state-of-the-art performance across 11 benchmarks using only 1.2M training samples, compared to competing models trained on much larger datasets.
- Scaling to higher resolutions and compositional capabilities further improves results, reducing model hallucination and enhancing generalization across multiple tasks.
- Empirical evidence shows that LLaVA-1.5 outperforms larger models like InstructBLIP and IDEFICS on various visual reasoning tasks, while using fewer resources and simpler architecture.
Details of Note
Dataset
- LLaVA-1.5 uses a combination of public datasets for training, totaling 558K image-text pairs.
- Datasets include academic task-oriented VQA datasets (e.g., VQA-v2, OKVQA, OCR-based datasets like TextCaps) and region-level perception datasets (Visual Genome, RefCOCO).
- GQA and ShareGPT data are incorporated to enhance visual question answering (VQA) and conversational capabilities.
- No in-house data is used, ensuring full reproducibility.
Methodology Highlights
- Replaces LLaVA’s original linear projection with a more powerful two-layer MLP vision-language connector.
- Introduces response format prompting to handle short-form answers better in VQA tasks, improving precision in outputs.
- Scales input resolution to 336x336, allowing the model to process higher detail in images, improving accuracy for detailed visual tasks.
- The model is scaled to a 13B language model (Vicuna-v1.5), which enhances its performance on benchmarks like MME and MM-Vet.
- Demonstrates high data efficiency by reducing training data without significantly compromising performance.
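The two-layer MLP connector is simple enough to sketch directly; the input/output dimensions below (CLIP ViT-L/14 features projected into the 13B Vicuna embedding space) are assumed for illustration.

```python
import torch.nn as nn

# Two-layer MLP vision-language connector replacing a single linear projection;
# dimensions are illustrative: ViT-L/14 features (1024) -> 13B LLM embeddings (5120).
mlp_connector = nn.Sequential(
    nn.Linear(1024, 5120),
    nn.GELU(),
    nn.Linear(5120, 5120),
)
```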
Training Details
- LLaVA-1.5 is trained on 8x A100 GPUs, with a total training time of ~26 hours.
- Stage 1 (vision-language alignment pretraining) takes ~6 hours.
- Stage 2 (visual instruction tuning) takes ~20 hours.
- Efficient training setup with just 1.2M public data points, significantly lower than competitors using much larger datasets.
2023-10: PaLI-3
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
- Arxiv: PaLI-3 Vision Language Models: Smaller, Faster, Stronger
- ICLR 2024: Rejected, but OpenReview available.
Guiding Questions
What kind of paper is this?
- This is a comparison and performance evaluation paper.
- It focuses on contrasting two pretraining methods (classification vs. contrastive) for Vision Transformers (ViT) in the context of vision-language models (VLMs).
- It also introduces a new model (PaLI-3) and assesses its state-of-the-art performance on several benchmarks.
What is the motivation and potential impact of the paper?
- The motivation is to create a smaller, faster, and more efficient vision-language model, PaLI-3, that can compete with much larger models in terms of performance.
- The problem addressed is scaling VLMs while maintaining or improving performance on key benchmarks like localization and visually-situated text understanding.
- The impact lies in demonstrating that models do not need to be extremely large to be state-of-the-art, potentially encouraging more research into efficient VLMs.
What is the relevant related work and what is this paper’s contribution?
- The paper builds upon previous works like PaLI, PaLI-X, and others that pretrain vision encoders either with classification or contrastive objectives.
- Existing approaches show promise in scaling but rely on larger models. This paper’s contribution is showing that a 5B parameter model (PaLI-3) with contrastive pretraining can match or outperform larger models.
- It also introduces a 2B SigLIP-based image encoder and shows superior performance in multimodal tasks.
What are the results (Theory/Experiments)?
- Experiments: PaLI-3 achieves new state-of-the-art results on several benchmarks including RefCOCO, TextVQA, and multilingual cross-modal retrieval, all while being smaller in size.
- Results show that contrastive pretraining (SigLIP) is particularly effective for tasks requiring visually-situated text understanding and localization.
Details of Note
Dataset
- Uses a mixture of web-scale image-text datasets for pretraining, including WebLI, CC3M-35L, and others.
- Additional datasets for multimodal tasks include RefCOCO (for referring expression segmentation) and VQA datasets (TextVQA, OCRVQA, etc.).
- WebLI dataset was filtered for quality control using a model-based approach, retaining around 40% of image-text pairs.
- Document understanding tasks were enhanced with datasets containing dense text images such as posters and PDFs across 100+ languages.
Methodology Highlights
- Pretraining Objectives: Contrastive pretraining (SigLIP) for vision encoder, focusing on image-text alignment over noisy web-scale data. UL2 pretraining for the text encoder-decoder.
- Architecture: 2B parameter ViT-G/14 vision encoder paired with a 3B UL2 encoder-decoder.
- Multimodal Training: Vision encoder frozen in the early stages, later fine-tuned at increasing resolutions.
- Ablations: Comparison between classification-pretrained (JFT) and contrastive-pretrained (SigLIP) vision encoders reveals significant improvements in visually-situated text understanding and localization tasks.
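For reference, the sigmoid-based contrastive objective behind SigLIP pretraining can be sketched as below; the fixed temperature/bias values and the mean reduction are simplifying assumptions (the actual method learns these scalars).

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(img_emb, txt_emb, t: float = 10.0, b: float = -10.0):
    """Pairwise sigmoid loss: every image-text pair in the batch is an independent
    binary classification (matching pair -> +1, all other pairs -> -1)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() * t + b                          # (batch, batch)
    labels = 2 * torch.eye(img.size(0), device=img.device) - 1
    return -F.logsigmoid(labels * logits).mean()
```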
Training Details
- Stage 0 (Unimodal Pretraining): SigLIP contrastive objective applied to the vision encoder on image-text data. UL2 model pretraining as per original UL2 framework.
- Stage 1 (Multimodal Training): Vision encoder frozen, trained across a mixture of captioning, VQA, and object detection tasks at 224x224 resolution.
- Stage 2 (Resolution Increase): Gradual increase in image resolution during fine-tuning (812x812 and 1064x1064) with the vision encoder unfrozen, improving detailed understanding.
2023-11: CogVLM
Quick Notes
- Decoder-Only model
- Weights available: Apache License
- Chinese academic research
Key Links
- Arxiv: CogVLM: Visual Expert for Pretrained Language Models
- GitHub: THUDM/CogVLM
- A variety of HF links are available in the GitHub repo.
- 224/490 pixel visual base model
- Grounding specific models
- Chat-specific models
- Rejected from ICLR 2024: OpenReview
Guiding Questions
What kind of paper is this?
- This is an empirical and methodological research paper focused on introducing a new vision-language model, CogVLM.
- It proposes a new approach for integrating visual and linguistic features through deep fusion using a “visual expert” module within a pretrained language model, rather than using shallow alignment techniques.
What is the motivation and potential impact of the paper?
- The motivation is to overcome limitations in shallow alignment methods used by prior models, which integrate image features into language models in a way that doesn’t fully capture the depth of vision-language information.
- By introducing a trainable module within the attention and feedforward layers, CogVLM achieves state-of-the-art performance across multiple multimodal benchmarks.
- The potential impact includes enabling large pretrained language models to perform high-quality visual tasks without sacrificing NLP capabilities, contributing significantly to both research and industrial applications of multimodal AI.
What is the relevant related work and what is this paper’s contribution?
- Related work includes prior models that use shallow integration techniques, like InstructBLIP and MiniGPT-4, which map image features into the language model’s input embedding space but fall short in performance due to limited fusion.
- CogVLM’s main contribution is the development of a trainable “visual expert” module within attention and feedforward layers, enabling deep fusion of vision and language features.
- It also includes extensive ablation studies and evaluations to validate this approach, resulting in state-of-the-art performance across a variety of benchmarks.
What are the results (Theory/Experiments)?
- Experiments: The paper presents CogVLM’s superior performance across 17 cross-modal benchmarks, including image captioning (NoCaps, Flickr30K), visual question answering (OKVQA, TextVQA), and visual grounding (RefCOCO).
- Performance: CogVLM outperforms other state-of-the-art models in multiple settings, especially for tasks requiring complex vision-language interactions, achieving high scores in generalist tasks and grounded VQA.
- Ablation Studies: The authors conduct detailed studies on architectural components, such as the impact of using different visual attention masks, initialization methods, and visual encoder scales, highlighting the importance of deep fusion.
Details of Note
Dataset
Pre-training Data:
- LAION-2B and COYO-700M datasets, heavily filtered for quality, yielding 1.5B image-text pairs.
- Visual Grounding Dataset: 40M images with bounding boxes for grounded object references, nouns extracted via spaCy, and bounding boxes predicted using GLIPv2.
Supervised Fine-Tuning Data:
- CogVLM-Chat: VQA datasets (VQAv2, OKVQA, TextVQA, etc.) and multi-turn dialogue datasets like LLaVA-Instruct. The VQA data includes both concise and detailed responses.
- CogVLM-Grounding: Grounded captioning, referring expression generation (REG), referring expression comprehension (REC), and VQA with bounding boxes from sources such as Flickr30K Entities, RefCOCO, and Visual7W.
Methodology Highlights
Architecture:
- ViT Encoder: EVA2-CLIP-E (last layer removed); outputs mapped to text embedding space using an MLP adapter (two-layer SwiGLU).
- Visual Expert Module: Added in each layer, separates vision and language modalities with unique QKV and FFN layers for vision, while sharing layer norms.
- Position Embedding: RoPE for text tokens, single shared position for visual tokens to avoid sequential proximity bias.
Training Approach:
- First Stage: Image captioning as a next-token prediction task on image-text pairs, for 120K steps with batch size 8192.
- Second Stage: Referring Expression Comprehension (REC) as VQA-style grounding; location answers normalized to 3-digit coordinates, trained for 60K steps with batch size 1024.
- Resolution Adjustment: Increased input resolution to 490x490 for the final 30K steps.
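A much-simplified sketch of the visual expert idea from the Architecture bullets above: image-token positions route through vision-specific QKV projections while text tokens keep the pretrained language projections, and a shared attention then runs over the mixed sequence. Head splitting, RoPE, and the FFN experts are omitted.

```python
import torch
import torch.nn as nn

class VisualExpertQKV(nn.Module):
    """Route tokens through modality-specific QKV projections before shared attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.text_qkv = nn.Linear(dim, 3 * dim)    # pretrained LM projections (frozen in practice)
        self.vision_qkv = nn.Linear(dim, 3 * dim)  # trainable "visual expert" projections

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim); is_image: (batch, seq) bool mask of image-token positions
        qkv = torch.where(is_image.unsqueeze(-1),
                          self.vision_qkv(hidden),
                          self.text_qkv(hidden))
        return qkv                                  # split into q, k, v downstream
```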
Training Details
CogVLM-Chat: Uses VQA and multi-turn instruction data for supervised fine-tuning.
- Hyperparameters: 6K steps, learning rate of 1e-5, batch size of 1024, ViT unfrozen with learning rate set to 1/10 of the full model’s learning rate.
CogVLM-Grounding: Fine-tuned on grounded captioning, REG, REC, and Grounded VQA.
- Task-Specific Adjustments: Incorporates bounding box annotations for improved localization and grounding performance across diverse datasets.
2023-11: DocPedia
Quick Notes
- Decoder-Only model
- No code, no weights
- Chinese academic research
Key Links
- Arxiv: DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
- No code or weights available; no official GitHub repository
Guiding Questions
What kind of paper are you reading?
- This is a big idea paper in multimodal document understanding.
- It proposes DocPedia, an OCR-free multimodal model, focusing on frequency-domain processing instead of the pixel-based approach.
- The paper addresses the challenge of high-resolution document parsing and aligns perception and comprehension tasks by introducing a dual-stage training method.
What is the motivation and potential impact of the paper?
- The primary problem is to improve document parsing accuracy in high-resolution images while preserving both visual and textual information, without relying on OCR.
- DocPedia’s approach could substantially benefit OCR-free document parsing applications by providing a framework that maintains both perception and logical reasoning capabilities.
- The potential impact includes improved performance in information extraction, visual question answering (VQA), and complex multimodal document processing tasks, especially in fields needing OCR-free methods for dense, complex documents.
What is the relevant related work and what is this paper’s contribution?
- Traditional document parsing relies on OCR, leading to error accumulation and limited comprehension capabilities when handling high-resolution images.
- Previous models like LLaVAR and UniDoc operate on lower resolutions or do not fully integrate language models, which limits comprehension.
- Main Contributions:
- Proposes a frequency domain approach (Discrete Cosine Transform, DCT) to capture more detail at high resolutions with fewer visual tokens.
- Introduces a dual-stage training strategy combining perception and comprehension tasks.
- Demonstrates superior performance on key benchmarks, confirming the efficacy of joint learning and the frequency-based approach.
What are the results (Theory/Experiments)?
- The paper theorizes that processing in the frequency domain captures more visual and textual information while minimizing token counts, which is effective for high-resolution images.
- The dual-stage training approach balances lower-level perception tasks (e.g., text recognition) and higher-level comprehension tasks (e.g., logical reasoning).
- Experiments:
- DocPedia achieves significant accuracy improvements (e.g., 40% in DocVQA, 28% in FUNSD) over state-of-the-art methods on high-resolution document benchmarks.
- Results confirm the benefits of frequency domain processing and the training approach for handling high-density text and visual elements in complex documents.
Details of Note
Dataset
- Pre-training data:
- PPT images: 600K images from Common Crawl.
- PDF images: 325K images from arXiv, each assigned one of five OCR tasks (text detection, recognition, spotting, paragraph reading, full-text reading).
- Image-caption pairs: 595K pairs for natural scene perception, sourced from LLaVA data.
- Fine-tuning data:
- Uses pre-training datasets with additional VQA datasets (e.g., DocVQA, ChartVQA) and expanded answers generated by ChatGPT.
- Instruction tuning: 158K instances from LLaVA instruction-tuning data, enriching tasks with semantic reasoning.
Methodology Highlights
- Frequency-domain approach:
- Images converted from RGB to YCbCr, followed by JPEG DCT extraction to obtain DCT coefficients.
- Downscaling: Y channel via 8×8 blocks; Cb and Cr channels downscaled via 16×16 blocks and then upsampled.
- Concatenation and processing: Y, Cb, and Cr channels concatenated and passed through a 1×1 convolution, then fed into the Swin Transformer.
- Two-stage training strategy:
- Stage 1: Text-aware pre-training with frozen LLM (Vicuna), focusing on OCR-based tasks to align frequency features with the language model.
- Stage 2: Context-aware fine-tuning, unfreezing LLM and integrating higher-level semantic tasks for robust multimodal comprehension.
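A rough sketch of the frequency-domain front end described above: convert the image to YCbCr and take blockwise DCT coefficients per channel (8×8 for Y, 16×16 for Cb/Cr). The chroma upsampling and the 1×1 convolution are omitted, and the file name is a placeholder.

```python
import numpy as np
from PIL import Image
from scipy.fft import dctn

def blockwise_dct(channel: np.ndarray, block: int = 8) -> np.ndarray:
    """Split a 2-D channel into block x block tiles and DCT each tile."""
    h, w = channel.shape
    h, w = h - h % block, w - w % block           # crop to a multiple of the block size
    tiles = channel[:h, :w].reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(tiles, axes=(-2, -1), norm="ortho")  # (H/block, W/block, block, block)

img = np.asarray(Image.open("page.png").convert("YCbCr"), dtype=np.float32)
y_coeffs  = blockwise_dct(img[..., 0], block=8)    # Y channel: 8x8 blocks
cb_coeffs = blockwise_dct(img[..., 1], block=16)   # Cb/Cr: coarser 16x16 blocks
cr_coeffs = blockwise_dct(img[..., 2], block=16)
```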
Training Details
- Learning rate strategy: One-cycle strategy with peak learning rates set to 1e-3 for pre-training (PT) and 1e-5 for fine-tuning (FT).
- Batch sizes: 64 for PT, 8 for FT.
- Optimizer: AdamW.
- Hardware: Trained on 8 A100 GPUs for one epoch each in PT and FT stages.
- Token count vs. Resolution: 1,600 tokens in RGB (1280×1280) achieves 29.54% accuracy on DocVQA; DCT (2560×2560) achieves 47.08% accuracy.
2023-11: InfMLLM
Quick Notes
- Decoder-Only model
- Weights available: See Model Zoo
- Organization: Inftech.AI
Key Links
Guiding Questions
What kind of paper is this?
- This is a model development and evaluation paper.
- It proposes InfMLLM, a multimodal large language model (MLLM) framework aimed at improving performance in vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper aims to extend the capabilities of large language models (LLMs) to vision-language tasks such as image captioning, visual question answering (VQA), and visual grounding.
- The motivation is to create a versatile, general-purpose multimodal assistant by training models to handle various modalities (e.g., images, text) more effectively.
- The impact lies in advancing the field of multimodal models, contributing a new framework (InfMLLM) that achieves state-of-the-art results in multiple benchmarks, potentially improving real-world applications of MLLMs.
What is the relevant related work and what is this paper’s contribution?
- The paper builds on existing work in LLMs like GPT-3, as well as multimodal models like BLIP-2, Qwen-VL, and LLaVA.
- Existing approaches such as Q-Former and Perceiver struggle with tasks that require spatial relationships (e.g., visual grounding).
- This paper contributes a new visual adapter (pool-adapter) that preserves positional information better, improving performance in tasks like visual grounding.
- The proposed three-stage training scheme (alignment pretraining, multitask finetuning, and instruction tuning) helps to efficiently finetune models for multimodal tasks.
What are the results (Theory/Experiments)?
- Experiments: InfMLLM was evaluated on benchmarks for VQA, image captioning, visual grounding, and other vision-language tasks.
- The model achieves state-of-the-art or near-SOTA performance on many benchmarks, particularly excelling in visual grounding tasks.
- Ablation studies show that increasing the number of visual embeddings improves performance, especially for visual grounding.
- Key findings: The introduction of the pool-adapter significantly enhances the ability to retain positional information, leading to better results in tasks requiring spatial reasoning.
Details of Note
Dataset
- Stage 1 (Pretraining): Uses weakly labeled image-text pairs from publicly available datasets such as CC3M, CC12M, and LAION-115M.
- Stage 2 (Multitask Finetuning): Uses a uniform mixture of tasks (VQA, captioning, and visual grounding) with datasets like:
- VQA: VQAv2, OK-VQA, AOK-VQA, GQA, TextVQA, OCR-VQA
- Captioning: COCO, TextCaps
- Grounding: RefCOCO, RefCOCO+, RefCOCOg
- Stage 3 (Instruction Tuning): Instructional data from LLaVA 1.5 (665k instruction samples).
Methodology Highlights
- Vision Encoder: Uses ViT-g/14 from EVA-CLIP with the last layer and class token discarded for more efficient feature extraction.
- Pool-Adapter (VL Connector): A novel pooler adapter:
- Retains locality of image representations using local pooling operations.
- Applies a 2-layer MLP to project pooled image features into the language space.
- LLM: Utilizes Vicuna-7B as the core LLM.
- Training Scheme:
- Stage 1: Pretraining with only the pool-adapter unfrozen, using captioning data.
- Stage 2: Multitask finetuning with VQA, captioning, and grounding data; ViT, pool-adapter, and QV matrices are unfrozen.
- Stage 3: Instruction tuning with the LLM fully finetuned; ViT remains frozen.
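A minimal sketch of the pool-adapter idea: locally pool the ViT patch grid so positional structure survives, then project into the LLM space with a 2-layer MLP. The dimensions and output grid size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolAdapter(nn.Module):
    """Locally pool the ViT patch grid to a smaller grid (keeping positions),
    then project into the LLM embedding space with a 2-layer MLP."""
    def __init__(self, vit_dim: int = 1408, llm_dim: int = 4096, out_grid: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.mlp = nn.Sequential(nn.Linear(vit_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vit_dim), num_patches = side * side
        b, n, d = patch_feats.shape
        side = int(n ** 0.5)
        grid = patch_feats.transpose(1, 2).reshape(b, d, side, side)
        pooled = self.pool(grid).flatten(2).transpose(1, 2)   # (batch, out_grid**2, vit_dim)
        return self.mlp(pooled)                               # (batch, out_grid**2, llm_dim)
```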
Training Details
- Optimizer: AdamW with cosine decay learning rate scheduler.
- Batch Sizes:
- Stage 1: 1024
- Stage 2: 512
- Stage 3: 128
- Learning Rates:
- Stage 1: 2e-4
- Stage 2: 1e-5
- Stage 3: 2e-5
- Steps: 80k / 40k / 10k (Note: Possible typo, might be 20k for Stage 1).
- Image Resolutions:
- Stage 1: 224
- Stage 2 & 3: 448
- Compute:
- Stage 1: 32×A800 GPUs
- Stage 2: 32×A800 GPUs
- Stage 3: 16×A800 GPUs
Follow-Up: Inf-MLLM2
In a tech report, Infly.AI has introduced Inf-MLLM2, a successor to Inf-MLLM. It lacks a full paper, but the tech report highlights several changes:
- Dynamic image resolution handling: A downscaled global image is encoded alongside upscaled local patches to handle varying resolutions, enabling the model to support inputs of up to 1344x1344 pixels.
- Enhanced OCR: The model handles document processing tasks well, which appears to be driven by a revised data mixture and support for larger image sizes.
2023-11: Monkey
Quick Notes
- Decoder-Only model
- Weights available: Non-commercial use only
- Chinese academic research
Key Links
- Arxiv: Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
- CVPR 2024: Monkey
- GitHub: Yuliang-Liu/Monkey
- Non-HF integrated checkpoints available on GitHub; non-commercial use only
Guiding Questions
What kind of paper are you reading?
- This is a model proposal paper that introduces “Monkey,” a large multimodal model (LMM) specifically designed to handle higher resolution images and generate enhanced text descriptions.
- It emphasizes technical improvements in image resolution handling and contextual description generation within multimodal frameworks, which are critical for applications like image captioning and visual question answering (VQA).
- It validates these improvements through extensive empirical evaluations across various vision-language tasks.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of current large multimodal models, particularly their struggles with high-resolution inputs and complex scene understanding.
- High-resolution handling is essential because it improves the model’s ability to capture finer details, crucial for tasks requiring detailed image-text association.
- Monkey’s design enables higher-resolution inputs without costly retraining, making it a resource-efficient alternative with potential impacts across vision-language applications in fields like accessibility, AI-driven content creation, and enhanced human-computer interaction.
What is the relevant related work and what is this paper’s contribution?
- Related work includes recent advancements in LMMs, such as Qwen-VL, PaLI, and other vision-language models like BLIP-2, which attempt to address similar multimodal challenges.
- Existing models use large-scale datasets like LAION and COYO but often have limited success with high-resolution details due to simple captions and lower input resolutions.
- Contribution: Monkey introduces a patch-based approach for handling higher resolutions (up to 1344×896) and a multi-level description generation method that creates richer image-text pairs. This two-fold approach enhances contextual understanding in scene-text associations and improves performance across various vision-language tasks.
What are the results (Theory/Experiments)?
- The model theoretically optimizes large input resolution handling by dividing images into patches processed independently by a static visual encoder, incorporating LoRA adjustments to preserve resolution-specific details.
- Experiments: Extensive empirical results on 18 datasets show Monkey surpassing other LMMs in tasks like image captioning, general VQA, scene-text VQA, and document-oriented VQA.
- Monkey achieves significant performance improvements over models like GPT4V and Qwen-VL, particularly in dense-text VQA tasks.
- Ablation studies indicate the model’s success in leveraging high input resolution and LoRA modules effectively, showing clear benefits in accuracy and computational efficiency over traditional interpolation techniques.
Details of Note
Dataset
- Uses 1.44 million image-text pairs from publicly available datasets, covering tasks like Image Captioning, General VQA, Scene Text-centric VQA, and Document-oriented VQA.
- Includes 427k image-text pairs bootstrapped from CC3M using a multi-level description method to enrich captions, improving text-image associations.
- Example datasets include COCO for general image captions, VizWiz and OKVQA for VQA, and DocVQA and ChartQA for document-oriented VQA, among others.
Methodology Highlights
- High-Resolution Handling: Employs a sliding-window scheme to split images into patches and a LoRA-modified ViT encoder to support input resolutions up to 1344×896.
- Image Patch Strategy: Divides high-resolution images into uniform patches, supplemented by a global image version, with a static resampler across all patches and the global image.
- Multi-Level Description Generation: Leverages a multi-model approach using segmentation, captioning, and detection models (e.g., BLIP2, PPOCR, SAM, ChatGPT) to create detailed, contextual captions.
- Architecture Components:
- Vision Encoder: ViT-bigG from OpenCLIP for processing patches.
- LLM: Qwen-VL, interfaced with 256 learnable queries per crop for enhanced spatial and contextual understanding.
- LoRA: Applied with rank 16 for attention and 32 for MLP modules to update ViT weights efficiently.
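A small sketch of the image-patch strategy: cut the high-resolution input into uniform crops and append a downscaled global view, each of which is then encoded by the shared ViT/resampler. The 448-pixel crop size is an assumption for illustration.

```python
from PIL import Image

def split_image(img: Image.Image, crop: int = 448):
    """Resize to a multiple of the crop size, cut into uniform crops, and append
    a downscaled global view; every view is then encoded by the same ViT/resampler."""
    w = (img.width  // crop) * crop or crop
    h = (img.height // crop) * crop or crop
    resized = img.resize((w, h))
    crops = [resized.crop((x, y, x + crop, y + crop))
             for y in range(0, h, crop) for x in range(0, w, crop)]
    crops.append(img.resize((crop, crop)))     # global image, downscaled
    return crops                               # e.g. 3x2 crops + 1 global view for 1344x896
```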
Training Details
- Optimizer: AdamW with a peak learning rate of 1e-5, cosine decay, and 100-step warmup.
- Batch Size: 1024 with a weight decay of 0.1 to prevent overfitting.
- Compute Requirements: Training takes 40 A800 GPU days for a single epoch on 896×896 images.
- Model Size: Total parameters are 9.8B, combining 7.7B for the LLM, 1.9B for ViT, 117M for LoRA, and 90M for the resampler.
2023-11: mPLUG-Owl2
Quick Notes
- Decoder-Only model
- Weights available: Apache License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
- CVPR 2024: mPLUG-Owl2
- GitHub: X-PLUG/mPLUG-Owl2
- Checkpoints not HF integrated, but available on the GitHub repo.
Guiding Questions
What kind of paper is this?
- This is a multi-modal large language model (MLLM) research paper.
- It proposes a new architecture (mPLUG-Owl2) focused on modality collaboration, balancing performance across text and multi-modal tasks.
What is the motivation and potential impact of the paper?
- The problem is modality interference, where prior models struggle to handle multiple modalities without one negatively affecting the other.
- The importance lies in developing a unified, generalist model that can excel at both text and multi-modal tasks without fine-tuning for specific tasks.
- The potential impact is a state-of-the-art model in multi-modal tasks, which could set a new standard for multi-modal large language models (MLLMs).
What is the relevant related work and what is this paper’s contribution?
- Traditional models (e.g., BLIP-2, MiniGPT-4) have used cross-modal alignment but struggled with modality interference.
- The contribution of mPLUG-Owl2 is modality collaboration, via a modality-adaptive module that maintains the integrity of both visual and text features while sharing parameters for cross-modality interaction.
What are the results (Theory/Experiments)?
- The paper introduces a modular network design using modality-adaptive modules to mitigate interference.
- Experiments: The model is evaluated on 8 vision-language benchmarks and achieves state-of-the-art performance on most, including Flickr30K, VQAv2, and others. It also excels in pure-text tasks (e.g., MMLU, AGIEval), indicating that modality collaboration enhances both text and multi-modal performance.
Details of Note
Dataset
- Pre-training Data: 400M image-text pairs randomly sampled from CC3M, CC12M, COCO, LAION-EN, COYO, DataComp.
- Instruction Data:
- Captioning: TextCaps, COCO.
- Visual Question Answering (VQA): VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA.
- Region-aware QA: RefCOCO, VisualGenome.
- Multi-modal Instruction Data: LLaVA-instruct-150K.
- Text-only Instruction Data: ShareGPT-80k, SlimOrca.
Methodology Highlights
Architecture:
- Vision Encoder: ViT-L/14 with a visual abstractor similar to the Perceiver Resampler; 6 layers with 64 learnable queries.
- Language Decoder: LLaMA-2-7B, modified for modality sensitivity in attention layers (modality-specific layer norms, modality-separated key and value matrices).
- Modality-Adaptive Module (MAM): Enhances cross-modality interaction while preserving unique characteristics of each modality.
Training Stages:
- Stage 1: Pre-training:
- Language decoder frozen except for modality-adaptation modules.
- Stage 2: Joint Instruction Tuning:
- Entire model unfrozen for tuning on both text and multi-modal instructions.
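A simplified sketch of the modality-adaptive idea from the Architecture bullets above: modality-specific layer norms and key/value projections with a shared query projection, so each modality keeps its own statistics while both interact through one attention operation. Names and the single-layer structure are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveProjection(nn.Module):
    """Modality-separated layer norm and key/value projections with a shared query."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.ModuleDict({"text": nn.LayerNorm(dim), "image": nn.LayerNorm(dim)})
        self.kv = nn.ModuleDict({"text": nn.Linear(dim, 2 * dim), "image": nn.Linear(dim, 2 * dim)})
        self.q = nn.Linear(dim, dim)            # shared query projection

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor):
        # hidden: (batch, seq, dim); is_image: (batch, seq) bool mask
        mask = is_image.unsqueeze(-1)
        normed = torch.where(mask, self.norm["image"](hidden), self.norm["text"](hidden))
        kv = torch.where(mask, self.kv["image"](normed), self.kv["text"](normed))
        k, v = kv.chunk(2, dim=-1)
        return self.q(normed), k, v             # fed into a standard shared attention
```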
Training Details
Stage 1: Pre-training:
- Steps: 42,500 with a batch size of 8,192 (total 348M image-text pairs).
- Optimizer: AdamW with a peak learning rate of 1e-4, 1k warmup steps, and cosine decay.
- Layer-wise Learning Rate Decay: 0.9 applied to vision encoder layers to preserve low-level visual features.
Stage 2: Instruction Tuning:
- Epochs: 1 with a batch size of 256.
- Learning Rate: 2e-5 with maintained layer-wise decay.
- Resolution Increase: Images upsampled from 224x224 to 448x448 for better OCR and fine-grained visual tasks.
2023-11: SPHINX
Quick Notes
- Decoder-Only model
- Weights available: Llama License
- Organization: Shanghai AI Lab
Key Links
- Arxiv: SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
- GitHub: Alpha-VLLM/LLaMA2-Accessory
Guiding Questions
What kind of paper is this?
- This is a multi-modal model development paper.
- It proposes a new architecture, SPHINX, for multi-modal large language models (MLLMs) that integrates visual and language understanding through a novel joint mixing approach.
What is the motivation and potential impact of the paper?
- The paper aims to enhance the capabilities of large language models (LLMs) to perform multi-modal tasks by better integrating vision and language data.
- It solves the problem of limited visual instruction-following capacity in existing MLLMs by introducing techniques for visual embedding, task mixing, and model weight integration.
- The impact is the development of a versatile model that excels across a wide range of vision-language tasks, including fine-grained visual understanding and multi-tasking.
What is the relevant related work and what is this paper’s contribution?
- Existing work includes multi-modal models like LLaVA, MiniGPT-4, and BLIP-2, which often freeze LLMs or use basic visual encoders.
- SPHINX contributes by unfreezing LLMs during pre-training, introducing weight mixing from real-world and synthetic data, task mixing for diverse visual tasks, and using multiple visual encoders for richer embeddings.
- It also addresses high-resolution image processing challenges, offering superior performance on various benchmarks.
What are the results (Theory/Experiments)?
- The main innovation is the joint mixing approach that combines visual embeddings, model weights, and tasks, aiming for enhanced multi-modal capabilities.
- Experiments: SPHINX achieves state-of-the-art results on multiple benchmarks (e.g., MMBench, VQAV2, RefCOCO) for tasks like visual question answering and human pose estimation.
- The model outperforms strong baselines on key tasks, demonstrating improved fine-grained visual perception and reasoning abilities.
Details of Note
Dataset
- Stage 1 (Pretraining):
- LAION-400M and LAION-COCO for vision-language alignment (image-caption data).
- RefinedWeb (text-only dataset) used to maintain text reasoning capabilities.
- Stage 2 (Finetuning):
- VQA: VQAv2, GQA, OKVQA, A-OKVQA, OCRVQA for visual question answering.
- Other tasks: Multi-object detection, relation reasoning, human pose estimation, document layout detection.
Methodology Highlights
- Two-stage training procedure:
- Stage 1 (Pretraining):
- Unfrozen LLM trained on real data, followed by fine-tuning on synthetic data (e.g., LAION-COCO).
- LLM weights from real and synthetic data are averaged (mixed) to balance both knowledge domains (see the sketch after this list).
- Visual encoders (e.g., CLIP, DINOv2, Q-Former) remain frozen during this stage.
- Stage 2 (Visual instruction tuning):
- Mixture of tasks for multi-purpose capabilities, such as VQA, multi-object detection, region-level understanding, and text-based reasoning.
- Each image is encoded as five sub-images (four cropped at 224x224, one low-res global), processed independently to improve scaling and fine-grained understanding.
- Stage 1 (Pretraining):
Training Details
- Pretraining (Stage 1):
- Optimizer: AdamW with 5e-5 peak learning rate, 2k warmup steps, and 180k steps with cosine annealing to 5e-6.
- Weight decay: 0.1.
- Pretraining time: 125 hours on 32 A100 GPUs for a 7B model (doubled for 13B).
- Finetuning (Stage 2):
- Batch size: 128.
- Similar optimizer settings as pretraining.
- Finetuning time: 38 hours on 16 A100 GPUs for a 13B model.
2023_12_cogagent
2023-12: CogAgent
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Arxiv: CogAgent: A Visual Language Model for GUI Agents
- CVPR 2024: Conference Paper
- GitHub: THUDM/CogVLM
- HuggingFace checkpoints:
Guiding Questions
What kind of paper are you reading?
- This is a “Big idea” paper that proposes and evaluates a new Visual Language Model (VLM), called CogAgent.
- It specifically aims to advance GUI-based agents using high-resolution image encoding and cross-modality for enhanced performance on both general VQA and GUI-specific tasks.
- The paper evaluates CogAgent’s versatility across multiple benchmarks, positioning it as a state-of-the-art generalist VLM.
What is the motivation and potential impact of the paper?
- The motivation stems from the need for large language models (LLMs) to interact effectively with GUIs, which is a limitation in current text-only models.
- GUIs are ubiquitous in digital device interactions, and a VLM that can navigate these interfaces autonomously could enhance automation across various tasks, such as web navigation and mobile app usage.
- CogAgent’s ability to understand and act within GUIs could reduce the need for human intervention in routine digital tasks, potentially impacting personal productivity and digital accessibility.
What is the relevant related work and what is this paper’s contribution?
- Existing models like AutoGPT and WebShop have demonstrated limited capabilities in combining GUI visual signals with LLMs but often fall short in generalist performance or require substantial pre-processing.
- CogAgent advances these by integrating low- and high-resolution image encoders, enabling it to outperform text-extracted methods on GUI and VQA tasks.
- The primary contribution is CogAgent’s architecture, which includes a high-resolution cross-module for GUIs, allowing the model to process tiny text and elements effectively.
- The authors also created a large, specialized dataset for GUI understanding, making CogAgent robust across multiple benchmarks.
What are the results (Theory/Experiments)?
- The paper introduces a novel high-resolution cross-attention module, balancing computational efficiency and high-resolution input processing. This module maintains CogAgent’s usability across GUI and VQA tasks without excessive FLOP increases.
- Experiments: CogAgent is evaluated across several VQA benchmarks and GUI-focused datasets (Mind2Web, AITW), demonstrating state-of-the-art results, particularly in text-rich VQA and GUI navigation tasks.
- Performance: It surpasses generalist models in benchmarks such as VQAv2, OK-VQA, and specific GUI datasets, showing 11.6%–16.5% improvements over comparable language-based agents.
Details of Note
Dataset
Text Recognition:
- Synthetic Text Data: Generated 80M synthetically rendered images with diverse font styles, colors, orientations, and LAION-2B backgrounds.
- OCR on Natural Images: 18M images with OCR annotations extracted from COYO and LAION-2B datasets.
- Academic Documents: 9M academic-style images processed with similar augmentation techniques to the Nougat dataset, featuring text, formulas, and tables.
Visual Grounding:
- Collected 40M image-caption pairs from LAION-115M, associating entities with bounding boxes.
GUI Imagery:
- GUI REG (Referring Expression Generation): Generates HTML code for DOM elements based on specified screenshot areas.
- GUI REC (Referring Expression Comprehension): Creates bounding boxes for DOM elements on screenshots.
- Built the CCS400K dataset from 400k Common Crawl screenshots and 140M GUI-based REC and REG tasks.
Methodology Highlights
Architecture:
- Employs CogVLM-17B as the base VLM, enhanced with a new high-resolution cross-attention module for GUI tasks.
- Integrates a smaller EVA2-CLIP-L image encoder (300M parameters) for 1120x1120 high-resolution images, retaining key visual details.
- High- and low-resolution branches allow for efficient processing; high-resolution input captures text and small elements, with cross-attention at each decoder layer.
High-Resolution Cross-Module:
- Maintains efficiency by reducing hidden sizes for text features, optimizing computation with a lower-dimensional cross-attention setup.
- Operates on 14x14 patches, balancing image clarity and computational load for GUI screens up to 1120x1120 pixels.
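A rough sketch of the high-resolution cross-attention idea: at each decoder layer, text hidden states attend to high-resolution image features through a cross-attention block with a reduced hidden size to keep FLOPs low. Dimensions and the single-head formulation are illustrative, not the paper's exact design.

```python
# Sketch: cross-attention from text hidden states to high-res image features,
# computed at a reduced width to limit FLOPs. Dimensions are illustrative.
import torch
import torch.nn as nn

class HighResCrossAttention(nn.Module):
    def __init__(self, text_dim: int = 4096, hires_dim: int = 1024, cross_dim: int = 256):
        super().__init__()
        # Project down to a small cross-attention width.
        self.q_proj = nn.Linear(text_dim, cross_dim)
        self.k_proj = nn.Linear(hires_dim, cross_dim)
        self.v_proj = nn.Linear(hires_dim, cross_dim)
        self.out_proj = nn.Linear(cross_dim, text_dim)

    def forward(self, text_h: torch.Tensor, hires_feats: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(text_h)                  # (B, T, cross_dim)
        k = self.k_proj(hires_feats)             # (B, P, cross_dim)
        v = self.v_proj(hires_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return text_h + self.out_proj(attn @ v)  # residual update of the text states
```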
Training Details
Pre-Training:
- Conducted 60k steps with a batch size of 4,608 and learning rate of 2e-5.
- Freezing Strategy: Only the high-resolution cross module was trained for the first 20k steps. For the remaining 40k steps, the visual expert was unfrozen for further fine-tuning.
- Curriculum learning: Pre-trained sequentially, progressing from synthetic text recognition tasks to document-style OCR, grounding tasks, and finally GUI-based tasks.
Evaluation Benchmarks:
- VQA Tasks: Assessed on VQAv2, OK-VQA, TextVQA, OCR-VQA, DocVQA, InfoVQA, and ChartQA.
- GUI Interfaces: Tested on the Mind2Web and AITW datasets, covering desktop and Android interface navigation tasks.
2023_12_llava_grounding
2023-12: LLaVA-Grounding
Quick Notes
- Decoder-Only model
- Weights available: CC-BY-NC-4.0 License
- Organization: Chinese university research
Key Links
- Arxiv: LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
- ECCV 2024: Conference Paper
- GitHub: UX-Decoder/LLaVA-Grounding
- HuggingFace checkpoints:
- Haozhangcx/llava_grounding_gd_vp
- Not HF-integrated, but available on the GitHub repo.
Guiding Questions
What kind of paper is this?
- This is a model and dataset paper that introduces a new large multimodal model (LLaVA-Grounding) and a specialized dataset (GVC) for grounded visual chat.
- It contributes a new benchmark (Grounding-Bench) and presents improvements in integrated chat and visual grounding capabilities in LMMs.
What is the motivation and potential impact of the paper?
- Motivation: Current large multimodal models (LMMs) struggle to perform visual grounding and chat together effectively; existing datasets and models either focus on one aspect or are limited by the lack of fine-grained, grounded data.
- Potential Impact: By combining grounded chat and grounding capabilities, this model could enable more effective multimodal interactions across applications that require detailed image understanding and conversational skills (e.g., AI assistants for complex visual tasks).
What is the relevant related work and what is this paper’s contribution?
- Related Work: Previous LMMs like LLaVA, miniGPT-4, and CogVLM focused on visual chat but lacked robust grounding capabilities. Specialized grounding models (e.g., Kosmos-2, Shikra) often require distinct prompting and cannot handle chat coherently.
- Contributions: The paper introduces:
- A dataset of 150K grounded visual chat instances (GVC).
- The LLaVA-Grounding model, which integrates a language model with a grounding model, supporting pixel-level and object-level grounding in visual chat.
- A benchmark (Grounding-Bench) for evaluating grounded visual chat, with new metrics like grounded recall and precision.
What are the results (Theory/Experiments)?
- The paper proposes a novel method to connect LMM features with a grounding model (OpenSeeD) and evaluates it on the new Grounding-Bench, emphasizing object-level and pixel-level grounding capabilities.
- Experiments:
- On Grounding-Bench: LLaVA-Grounding outperformed other LMMs in both chat and grounding capabilities.
- On classic benchmarks: Achieved competitive performance on RefCOCO/+/g and Flickr30K for phrase grounding and referring expression tasks.
- Ablations: Demonstrated the model’s effectiveness across grounding types, prompt types (e.g., boxes, clicks), and in handling the trade-off between chat and grounding tasks.
Details of Note
Dataset
- Grounded Visual Chat (GVC) Data: 150K instances labeled with human annotations from COCO, grounded in conversational context using GPT-4. Ensures noun-phrase references match segmentation masks for accuracy.
- Stage 1 Data: Combines datasets (e.g., RefCOCO, Flickr30K, Visual Genome) with alignment pairs from COCO for visual and language alignment tasks.
- Stage 2 Data: Instruction set and GVC data without visual prompts, leveraging conversational context for grounding-specific tuning.
- Stage 3 Data: Enhanced dataset for visual prompts (boxes, clicks, scribbles) with specific instructions for more interactive grounded visual chat.
Methodology Highlights
- Pipeline: Utilizes a three-stage approach to pretrain, align, and tune grounding and visual chat tasks.
- Stage 1: Pretraining aligns visual and language embeddings using a projection layer to connect the language model’s output with the grounding model (see the sketch after this list).
- Stage 2: Fine-tunes for grounding chat without updating the vision and prompt encoders.
- Stage 3: Adds visual prompting, training only the prompt encoder or updating language decoder for marked instances.
- Grounding Model: OpenSeeD model enables both bounding box and pixel-level grounding, enhancing the model’s ability to perform fine-grained visual localization.
- Prompt Encoder: Uses a Semantic SAM model to convert visual prompts into language-compatible embeddings, allowing the model to interpret diverse prompt types.
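A minimal sketch of how language-model outputs might be connected to the grounding model via the Stage 1 projection layer: hidden states of grounded phrases are projected into an object-query space and handed to the grounding model. The masking scheme and dimensions are assumptions; OpenSeeD's real interface differs.

```python
# Sketch: project LLM hidden states of grounded phrases into the grounding
# model's query-embedding space. Interfaces and shapes are hypothetical.
import torch
import torch.nn as nn

class GroundingConnector(nn.Module):
    def __init__(self, llm_dim: int = 4096, query_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, query_dim)

    def forward(self, llm_hidden: torch.Tensor, grounding_token_mask: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, T, llm_dim); grounding_token_mask: (B, T) bool
        # Gather the hidden states that correspond to grounded phrases and
        # project them; the result would serve as object queries downstream.
        return self.proj(llm_hidden[grounding_token_mask])   # (N, query_dim)
```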
Training Details
- Stage 1: Vision-language (VL) projection and grounding model updated with learning rates of `1e-4`, aligning the visual encoder and grounding model with language model outputs.
- Stage 2: Instruction tuning for grounded responses; prompt encoder and CLIP vision encoder are frozen, with other layers fine-tuned at `2e-5` for language layers and `1e-4` for grounding.
- Stage 3: Visual prompting (e.g., marks) trained by fine-tuning the prompt encoder and projection layers, with additional support for specific image regions through Set-of-Mark prompts.
- Hyperparameters: Gradual learning rate decay, weight decay set to `0.0`, `bf16` and `tf32` precision for faster, stable training on compatible hardware.
2024_01_llava_next
2024-01: LLaVA-NEXT
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Blog: LLaVA-NeXT: Improved reasoning, OCR, and world knowledge
- Same codebase as LLaVA, with updated weights
Guiding Questions
What kind of paper are you reading?
- This is a system and enhancement paper describing LLaVA-NeXT, an improved large multimodal model (LMM).
- It presents design advancements over LLaVA-1.5 in areas such as reasoning, OCR, and handling high-resolution images.
- It also outlines technical improvements and evaluates performance against other state-of-the-art (SoTA) models on specific benchmarks.
What is the motivation and potential impact of the paper?
- Motivation: To enhance the LLaVA model’s visual reasoning, OCR, and general world knowledge, addressing limitations in existing multimodal models, especially for visual detail and zero-shot scenarios.
- Impact: LLaVA-NeXT offers improved performance on multiple benchmarks and enables efficient deployment and inference. Its open-source release aims to support further research and development in multimodal AI, benefiting areas requiring robust visual-linguistic integration.
What is the relevant related work and what is this paper’s contribution?
- Related Work: Builds on LLaVA-1.5, which set a baseline in multimodal performance. Other referenced models include Gemini Pro, Qwen-VL-Plus, and CogVLM, each contributing distinct multimodal capabilities.
- Contribution: LLaVA-NeXT surpasses prior models in resolution, reasoning, and deployment efficiency. It introduces dynamic high-resolution support, a refined visual instruction dataset, and compatibility with multiple LLM backbones, achieving SoTA performance in both English and zero-shot Chinese tasks.
What are the results (Theory/Experiments)?
- Experiments:
- LLaVA-NeXT performs strongly across several benchmarks, matching or surpassing proprietary models like Gemini Pro in specific categories.
- Zero-shot Chinese performance is notable, achieving SoTA results without dedicated Chinese training data.
- Model versions (7B, 13B, 34B) achieve varying levels of performance, with high-resolution image support significantly improving OCR and reasoning capabilities.
- Success Metrics:
- The model’s success is defined through SoTA performance on a suite of multimodal benchmarks, including OCR-specific tasks like TextVQA and overall visual reasoning with Math-Vista.
- Efficiency metrics are notable, with low training cost achieved on 32 A100 GPUs over ~1 day.
Details of Note
Dataset
- Image-Text Pairs: 1.3M visual instruction-tuning samples.
- Data Sources:
- GPT-V Data: LAION-GPT-V, ShareGPT-4V for enhanced instruction-following and visual conversation.
- Specialized Datasets: Added ChartQA, DVQA, AI2D for improved chart/diagram understanding, and DocVQA, SynDog-EN to enhance OCR capabilities.
- Data Filtering: TextCaps removed (overlap with TextVQA); sensitive data removed to ensure privacy and safety.
- Zero-Shot Capability: Demonstrates strong zero-shot Chinese language support, despite training predominantly on English data.
Methodology Highlights
- Dynamic High Resolution: Introduced ‘AnyRes’ technology, supporting variable resolutions (up to 4x more pixels), enhancing fine visual detail through grid splitting/merging (e.g., 2×2, 3×1 configurations); see the sketch after this list.
- Data Mixture: Improved data diversity through high-quality visual instruction samples, designed to capture varied user intents and real-world applications.
- LLM Backbones: Models use Vicuna-1.5 (7B, 13B), Mistral-7B, and Yi-34B, enabling flexible deployment options and compatibility with larger multimodal tasks.
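A simplified sketch of ‘AnyRes’-style dynamic high resolution: the image is resized onto a grid of base-resolution tiles that are encoded separately, alongside a low-resolution global view. Grid selection and resizing rules here are simplified assumptions rather than the exact implementation.

```python
# Sketch: split an image into a grid of base-resolution tiles plus a coarse
# global view; each view is then encoded by the ViT independently.
from PIL import Image

def anyres_tiles(img: Image.Image, base: int = 336, grid=(2, 2)):
    rows, cols = grid
    resized = img.resize((cols * base, rows * base))
    tiles = [
        resized.crop((c * base, r * base, (c + 1) * base, (r + 1) * base))
        for r in range(rows) for c in range(cols)
    ]
    global_view = img.resize((base, base))   # coarse view of the whole image
    return tiles + [global_view]

# Usage: views = anyres_tiles(Image.open("screenshot.png"))
```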
Training Details
- Training Hardware:
- LLaVA-NeXT-7B: 8xA100 GPUs, ~20 hours.
- LLaVA-NeXT-13B: 16xA100 GPUs, ~24 hours.
- LLaVA-NeXT-34B: 32xA100 GPUs, ~30 hours.
- Efficiency: Minimalist approach maintains data efficiency of LLaVA-1.5 with low GPU-hour cost, using less than 1M samples in the visual instruction phase.
- Performance Benchmarking: Achieves state-of-the-art performance across multiple multimodal tasks with optimized training time and compute costs.
2024_01_seeclick
2024-01: SeeClick
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Shanghai AI Research
Key Links
- Arxiv: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
- ICLR 2024 Workshop on LLM Agents: Conference Paper
- GitHub: njucckevin/SeeClick
- HuggingFace checkpoints:
Guiding Questions
What kind of paper are you reading?
- This is a research paper presenting a new model, benchmark, and evaluation framework for visual GUI agents.
- It proposes SeeClick, a visual agent designed for GUI tasks, and a benchmark called ScreenSpot for evaluating GUI grounding across multiple platforms.
What is the motivation and potential impact of the paper?
- The paper addresses the limitations of existing GUI agents that rely on structured text, which is often inaccessible and verbose.
- SeeClick’s approach, using visual GUI grounding on screenshots, could significantly expand the accessibility and efficiency of GUI agents across devices.
- Potential impacts include advancing GUI automation for platforms where structured text is unavailable, leading to more adaptable, universal agents.
What is the relevant related work and what is this paper’s contribution?
- Related work includes text-based GUI agents (e.g., using HTML), GUI agents trained with LLMs, and previous attempts at visual GUI grounding.
- Key contributions:
- Developing SeeClick, a visual GUI agent that relies solely on screenshots, removing dependency on structured text.
- Introducing ScreenSpot, a benchmark to evaluate GUI grounding performance across mobile, desktop, and web GUIs.
- Demonstrating that GUI grounding pre-training improves agent performance on diverse GUI tasks.
What are the results (Theory/Experiments)?
- The authors establish GUI grounding as a critical capability, allowing LVLMs to interpret and locate elements on screen using screenshots.
- Experiments:
- ScreenSpot results show SeeClick outperforming baseline models in GUI grounding, especially on non-text elements.
- On MiniWob, AITW, and Mind2Web tasks, SeeClick demonstrates superior performance, especially in environments with dynamic layouts or varied interfaces.
- Overall, results confirm that GUI grounding significantly boosts task success, validating the model’s approach and pre-training method.
Details of Note
Dataset
- SeeClick Dataset: Composed of ~300k web screenshots, extracting visible text content and “title” attributes as grounding cues.
- Mobile Data: Utilizes RICO dataset’s widget captioning and screen summarization, along with auto-generated RICO SCA data.
- General Data: Incorporates instruction-following data collected via GPT-4 from the LLaVA dataset for maintaining general vision-language capabilities.
- ScreenSpot Benchmark: A curated, human-annotated dataset for GUI grounding across mobile, desktop, and web interfaces.
Methodology Highlights
- Model Architecture: Built on QwenVL, a vision-language model with an integrated Vision Transformer (ViT).
- GUI Grounding Pre-training: Trained to locate elements in GUI screenshots by predicting action locations with floating-point coordinates (e.g., `click (0.12, 0.54)`); see the sketch after this list.
- Data Curation: Automates collection of GUI grounding data from web, mobile, and general datasets, creating diverse and context-rich training samples.
- Task Adaptation: SeeClick was tested and fine-tuned for three downstream tasks: MiniWob (simplified web tasks), AITW (Android interactions), and Mind2Web (web navigation tasks).
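A small sketch of how a GUI grounding target could be serialized for the click-prediction objective above: the center of an element's bounding box is normalized to [0, 1] and rendered as a click-action string. The exact prompt and answer templates are assumptions, not SeeClick's.

```python
# Sketch: turn an element's pixel bounding box into a normalized click target.
def click_target(bbox, img_w, img_h) -> str:
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 / img_w   # normalized x of the box center
    cy = (y0 + y1) / 2 / img_h   # normalized y of the box center
    return f"click ({cx:.2f}, {cy:.2f})"

# e.g. click_target((100, 520, 140, 560), 1000, 1000) -> "click (0.12, 0.54)"
```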
Training Details
- Training Process: Conducted over 10k steps (1 epoch), leveraging the LoRA technique to fine-tune both ViT and the language model.
- Optimizer: Used AdamW with a learning rate of 3e-5 and a global batch size of 64.
- Hardware: Training completed in approximately 24 hours on 8 NVIDIA A100 GPUs.
2024_02_llavar
2024-02: LLaVAR
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Adobe
Key Links
- Arxiv: LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
- GitHub: SALT-NLP/LLaVAR
- HuggingFace checkpoints:
Guiding Questions
What kind of paper is this?
- This is a methodology paper with evaluation that proposes and validates an enhancement to an existing model pipeline, specifically for visual instruction tuning with text-rich images.
- It enhances the LLaVA model to better comprehend text within images through refined instruction-following data collection and multimodal tuning.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge that current visual instruction-tuned models struggle with reading and understanding text in images, limiting real-world usability in contexts with text-heavy visual content.
- The impact of this work is significant for applications requiring multimodal understanding, like text-based visual question answering (VQA) and online content interpretation, as it enhances user interaction capabilities by enabling accurate text comprehension in images.
What is the relevant related work and what is this paper’s contribution?
- Related work includes instruction-tuned language models (e.g., GPT-4) and multimodal models like CLIP and LLaVA that incorporate visual encoders for image processing but struggle with text comprehension.
- The paper’s primary contribution is LLaVAR, an augmented model that combines data from OCR and GPT-4 to improve the text comprehension capability in visual instruction tuning. It provides a new data collection pipeline and demonstrates substantial improvements in text-rich image interpretation.
What are the results (Theory/Experiments)?
- The model is designed to improve visual and textual data alignment through dual-stage training (pre-training with noisy OCR data and fine-tuning with high-quality GPT-4 data).
- Experiments: LLaVAR is evaluated on four text-based VQA datasets and performs better than baseline models on these tasks, especially at higher resolutions, demonstrating significant accuracy improvements across multiple datasets like ST-VQA, OCR-VQA, and DocVQA.
- The paper includes GPT-4-based instruction-following evaluations on natural and text-rich images, with LLaVAR showing gains in nuanced interaction skills and real-world applicability.
Details of Note
Dataset
- Pre-training Data: Augments LLaVA’s pre-training dataset with 422K text-rich images from the LAION-5B dataset, filtered and selected for high OCR relevance, covering diverse categories like posters, book covers, advertisements, and educational materials.
- Instruction Tuning Data: Adds 16K high-quality instruction-following examples generated by GPT-4, using OCR results and captions to improve LLaVA’s text understanding in images.
- Data Filtering and Clustering: To enhance quality, images were categorized using CLIP-based features and manual inspection, ultimately selecting clusters that maximized text-richness.
Methodology Highlights
- Instruction Augmentation: Integrates OCR results and GPT-4 responses to craft realistic, text-focused questions and answers for improved text comprehension.
- Data Pipeline: Implements a two-tier data collection strategy—first generating noisy OCR-based responses, then enhancing with GPT-4–based high-quality data for specific instruction tuning.
- Resolution Scaling: Increases input resolution from 224² to 336², which improves the model’s ability to recognize small and fine text details, as shown in experiments.
Training Details
- Two-Stage Training:
- Pre-training: LLaVA’s original model is pre-trained on combined noisy OCR and standard instruction data to align visual and language features, with only the projection layer trainable.
- Fine-tuning: Finetuning on high-quality GPT-4 data with the visual encoder frozen but both the language decoder and projection layer trainable.
- Model Architecture: Based on LLaVA with a visual encoder (CLIP-ViT-L/14) and Vicuna-13B as the language decoder; enhances the model’s instruction-following capacity with added cross-attention for high-res encoding.
- Evaluation: Benchmarked on four VQA datasets (ST-VQA, OCR-VQA, TextVQA, DocVQA) and GPT-4-based evaluation for text comprehension within real-world images.
2024_02_screenai
2024-02: ScreenAI
Quick Notes
- Encoder-Decoder model
- Weights not available. No permissive license. Proprietary.
- Organization: Google
Key Links
- Arxiv: ScreenAI: A Vision-Language Model for UI and Infographics Understanding
- Published at IJCAI 2024. Main Track
Guiding Questions
What kind of paper are you reading?
- This is a small idea with evaluation paper.
- It introduces ScreenAI, a vision-language model for understanding UIs and infographics, and evaluates its performance through ablation studies and comparisons with state-of-the-art (SoTA) models.
What is the motivation and potential impact of the paper?
- The paper addresses the challenge of understanding complex screen UIs and infographics, which share visual language but are difficult to model due to their diverse designs.
- The impact is significant for fields requiring automated UI understanding, such as UI navigation, infographics comprehension, and screen-based question-answering (QA).
What is the relevant related work and what is this paper’s contribution?
- Existing works in screen-based UI models focus on narrow tasks like icon detection, whereas generalist foundation models excel in broad multimodal tasks.
- The paper contributes ScreenAI, which extends PaLI and pix2struct architectures, introduces a screen annotation task, and automates dataset generation using LLMs. It also releases three new datasets and achieves state-of-the-art results in several tasks.
What are the results (Theory/Experiments)?
- The paper introduces a novel architecture that combines PaLI’s multimodal encoder-decoder with pix2struct’s adaptive patching mechanism.
- Experiments:
- ScreenAI achieves new SoTA on several benchmarks like Multipage DocVQA, WebSRC, MoTIF, and excels in tasks like ChartQA and InfographicVQA.
- The model’s performance improves with larger sizes, and ablation studies confirm the benefits of the pix2struct patching strategy and using LLM-generated data during pre-training.
Details of Note
Dataset
- 353M autogenerated screen schemas: Includes OCR, object detection, classification, and image/icon captioning.
- 38.6M Q&A samples: Added to address challenges with arithmetic, counting, and understanding complex graphics.
- 15.9M navigation samples: Instructions like “click BOX” for UI navigation.
- 13.2M screen summarizations: Summaries of what’s shown on the screen.
- 3M chart-to-table translation samples: Converting graphical data to tabular formats.
- 1M other data: From PaLI, used for additional text and Q&A tasks.
Methodology Highlights
- Screen Schema: UI type + short caption or OCR text + bounding box (e.g., BUTTON 800 100 900 150); supports nested elements (e.g., TEXT in a BUTTON); see the sketch after this list.
- Aspect-ratio preserving patching: For flexible handling of various screen shapes and resolutions, drawn from pix2struct.
- Data generation: Combines object detection, OCR, and image captioning for generating pretraining data; supplemented with LLM-enhanced Q&A and summarization.
- Task-specific fine-tuning: Conducted on human-annotated data for tasks like Q&A, navigation, and summarization.
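A hypothetical sketch of serializing UI elements into the screen-schema text described above (UI type, caption or OCR text, bounding box, with nesting). The data structure and exact output format are illustrative assumptions.

```python
# Sketch: serialize (type, text, bbox, children) tuples into a schema string.
def serialize(element) -> str:
    kind, text, (x0, y0, x1, y1), children = element
    parts = [kind] + ([text] if text else []) + [str(v) for v in (x0, y0, x1, y1)]
    line = " ".join(parts)
    if children:
        line += " (" + " ".join(serialize(c) for c in children) + ")"
    return line

button = ("BUTTON", "Submit", (800, 100, 900, 150), [])
card = ("CARD", "", (780, 80, 920, 170), [button])
print(serialize(card))
# -> CARD 780 80 920 170 (BUTTON Submit 800 100 900 150)
```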
Training Details
- Pre-training: Uses a mixture of self-supervised learning on autogenerated datasets (e.g., screen schemas) and LLM-generated data.
- Fine-tuning: On specific tasks like Q&A, navigation, and summarization, using both human-annotated and machine-generated data.
- Model sizes: Trained models include 670M, 2B, and 5B parameters; the 5B model starts from PaLI-3 multimodal checkpoints.
- Freezing the ViT encoder: During fine-tuning, the ViT encoder is frozen, focusing training on the language model (e.g., mT5 or UL2 depending on size).
2024_02_sphinx-x
2024-02: SPHINX-X
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License, but can’t find them in the official repo
- Organization: Shanghai AI Lab
Key Links
- Arxiv: SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
- Poster @ ICML 2024
- GitHub: Alpha-VLLM/LLaMA2-Accessory
Guiding Questions
What kind of paper are you reading?
- Small idea paper
- This is an introduction of some enhancements made to the SPHINX model and evaluation of the model’s performance across various vision-language tasks to demonstrate the improvements.
What is the motivation and potential impact of the paper?
- The motivation is to scale up the SPHINX model, make it easier to train, and reduce its complexity while speeding up training and inference.
- The potential impact is to make the SPHINX model more accessible and efficient for researchers and practitioners working on vision-language tasks.
What is the relevant related work and what is this paper’s contribution?
- Related work includes the original SPHINX model and other multi-modal models like LLaVA, MiniGPT-4, and BLIP-2.
- Very minor contributions from this paper, detailed below.
What are the results (Theory/Experiments)?
- Better performance on various vision-language tasks compared to the original SPHINX model.
Details of Note
Dataset
Methodology Highlights
- Two of the four original vision encoders are dropped in favor of the two complementary and better-performing encoders.
- They use a trick to skip visual tokens constructed entirely from padded values; this is only really relevant for images with wide or tall aspect ratios. It involves a learned skip token that replaces runs of padding tokens, tightening the sequence length while still preserving relative positioning (see the sketch after this list).
- They ditch a two-stage training pipeline in favor of a single stage.
- Other changes are really just including more and different data, other models as LLM backbones, etc.
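A rough sketch of the padding-skip idea: visual tokens that come entirely from padded regions are collapsed into a single learned skip token, shortening the sequence while keeping a positional placeholder. The run-collapsing behaviour shown here is an assumption about how the mechanism works.

```python
# Sketch: collapse runs of fully-padded visual tokens into one learned token.
import torch
import torch.nn as nn

class PaddingSkipper(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.skip_token = nn.Parameter(torch.zeros(dim))

    def forward(self, tokens: torch.Tensor, is_padding: torch.Tensor) -> torch.Tensor:
        # tokens: (seq, dim); is_padding: (seq,) bool, True where a token is all padding
        out, in_pad_run = [], False
        for tok, pad in zip(tokens, is_padding):
            if pad:
                if not in_pad_run:          # emit one skip token per padded run
                    out.append(self.skip_token)
                in_pad_run = True
            else:
                out.append(tok)
                in_pad_run = False
        return torch.stack(out)
```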
Training Details
2024_03_mm1
2024-03: MM1
Quick Notes
- Decoder-Only model
- No weights available yet; proprietary model
- Organization: Apple
Key Links
Guiding Questions
What kind of paper is this?
- This is a research paper focused on building and improving Multimodal Large Language Models (MLLMs).
- It covers empirical evaluations of architectural components and data configurations to achieve optimal few-shot performance across benchmarks.
What is the main motivation for the research presented?
- The paper aims to address the lack of transparency and comprehensive design guidance in developing performant MLLMs.
- By studying architecture components and pre-training data configurations, the authors hope to establish guiding principles that outlast specific model implementations.
- They emphasize the importance of systematic ablation studies to improve few-shot and zero-shot multimodal performance.
What methodology do the authors employ?
- The authors use a series of ablations across different architectures, visual encoders, vision-language connectors, and data mixes to determine optimal configurations.
- They pre-train smaller models for initial tests, then scale up to larger models to confirm findings, using pre-training data that mixes image-caption pairs, interleaved image-text, and text-only documents.
- For supervised fine-tuning (SFT), they collect diverse vision-language datasets and use specialized training recipes, including varying image resolutions and model scales up to 64B parameters.
What are the key findings or insights?
- Image resolution and the number of visual tokens significantly impact model performance, with the architecture of the vision-language connector having a minor role.
- Interleaved image-text data is crucial for few-shot and text-only performance, while caption data enhances zero-shot performance.
- Synthetic caption data (VeCap) improves few-shot learning, showing a 2–4% increase in performance across tasks.
- Their final model family, MM1, outperforms or matches state-of-the-art multimodal models in few-shot learning, maintaining robust language understanding capabilities after pre-training.
How do the authors validate their approach?
- They benchmark pre-trained MM1 models on captioning and VQA tasks, comparing their performance to other multimodal models such as Flamingo, IDEFICS, and Emu2.
- Supervised fine-tuning (SFT) is evaluated across 12 multimodal benchmarks to validate few-shot performance gains.
- The authors include both quantitative and qualitative evaluations to demonstrate capabilities like counting objects, performing OCR, and multi-image reasoning.
- They also employ qualitative examples to show MM1’s capacity for chain-of-thought reasoning and multi-image contextual understanding.
What contribution does this paper make to its field?
- This paper provides empirical design guidelines for developing efficient MLLMs, offering a “recipe” for configuring data types, image resolutions, and encoder-decoder connections.
- It introduces MM1, a family of MLLMs that achieve state-of-the-art performance in few-shot multimodal tasks and presents a scalable pre-training approach for future large multimodal models.
- The insights on architecture and data configurations establish foundational lessons that may influence future multimodal model training, extending the body of knowledge on multimodal pre-training.
Details of Note
Dataset
- Pre-training Data Composition:
- 45% captioned images
- 45% image-text interleaved documents
- 10% text-only data
- Special Data Types:
- Synthetic Data: Boosts few-shot performance
- Interleaved Data: Essential for few-shot and text-only performance
- Captioning Data: Critical for zero-shot performance
Methodology Highlights
- Base Model Architecture:
- Image Encoder: ViT-L/14 and ViT-H, pre-trained on DFN-5B and VeCap-300M datasets at high resolutions (336x336 and 378x378).
- Vision-Language (VL) Connector: Convolutional abstractor with 144 image tokens; tested alternative VL connectors like average pooling and attention pooling.
- Language Model: 1.2B parameter decoder-only LM for core processing.
- Pre-training Ablations:
- Image Resolution & Model Size: Key factors impacting performance, with higher resolutions and larger models yielding better results.
- Data Mix: Interleaved image-text data boosts text-only and few-shot capabilities; captioning data aids zero-shot.
- Supervised Fine-Tuning:
- Uses ~1M SFT samples mined from pre-training data; all model parameters unfrozen during fine-tuning.
Training Details
- Batch Size: 512
- Sequence Length: 4096 tokens, accommodating up to 16 images per sequence
- Training Steps: 200k steps, equating to 100B tokens processed
- Resolution Scaling Techniques:
- Positional embedding interpolation and sub-image decomposition for higher-resolution adaptation
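A minimal sketch of positional-embedding interpolation for adapting a ViT to higher input resolution: the learned 2D position grid is resized with bilinear interpolation. The no-class-token layout and the grid sizes in the example are assumptions for illustration.

```python
# Sketch: bilinearly resize a ViT's learned 2D positional-embedding grid.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    # pos_embed: (old_grid * old_grid, dim) -> (new_grid * new_grid, dim)
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, dim)

# e.g. going from 24x24 patches (336 / 14) to 27x27 patches (378 / 14):
# new_pe = interpolate_pos_embed(old_pe, 24, 27)
```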
2024_03_text_monkey
2024-03: TextMonkey
Quick Notes
- Decoder-Only model
- Weights available: Apache-2.0 License
- Organization: Chinese university research
Key Links
- Arxiv: TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
- GitHub: Yuliang-Liu/Monkey
Guiding Questions
What kind of paper are you reading?
- This is a large multimodal model (LMM) development paper with a specific focus on document understanding, without relying on OCR.
- It presents a new model, TextMonkey, optimized for text-centric tasks such as document analysis and scene text spotting.
What is the motivation and potential impact of the paper?
- Motivation: Existing OCR-dependent models create limitations in document processing by adding complexity and often misalign text and visual context, leading to error accumulation.
- Impact: TextMonkey aims to streamline document analysis by integrating cross-modal understanding without OCR, potentially setting new standards in document processing for both industry and academic applications.
What is the relevant related work, and what is this paper’s contribution?
- Related Work: Traditional approaches rely heavily on OCR to extract text, but OCR introduces errors and extra engineering needs, especially in complex document layouts.
- Contribution: TextMonkey eliminates OCR reliance, proposing a novel attention mechanism (Shifted Window Attention) for cross-window context. Additionally, it introduces token compression strategies to handle high-resolution images and complex tasks like text grounding and structured text extraction.
What are the results (Theory/Experiments)?
- The paper’s theoretical contributions include Shifted Window Attention with zero initialization to stabilize training and token resampling to reduce redundancy.
- Experiments: TextMonkey outperforms prior models across 12 benchmarks, achieving a 5.2% improvement in scene text-centric tasks, a 6.9% increase in document-oriented tasks, and a 2.8% rise in key information extraction. It also sets a new standard on OCRBench with a score of 561, surpassing prior large multimodal models.
Details of Note
Dataset
- Scene Text: Datasets include COCOText, OCRText, HierText, TextVQA, and MLT. These support model training for scene text understanding, including text spotting and VQA.
- Document-Based VQA: Training uses IIT-CDIP, DocVQA, ChartQA, InfoVQA, DeepForm, KLC (Kleister Charity), and WikiTableQuestions (WTQ), focusing on document question answering, table understanding, and information retrieval.
- Evaluations:
- Scene Text VQA: Benchmarks include STVQA, TextVQA, and OCRVQA.
- Document-Oriented VQA: Benchmarks such as DocVQA, InfoVQA, and ChartQA assess the model’s document comprehension.
- Key Information Extraction (KIE): Benchmarks like FUNSD, SROIE, and POIE test the model’s ability to extract structured information.
Methodology Highlights
- Shifted Cross-Window Attention: Introduces cross-window attention at intervals in the Transformer layers, enabling the model to capture long-range dependencies across image patches.
- Zero Initialization for Cross-Window Attention: Inspired by LoRA, this stabilizes early training by gradually introducing cross-window connections.
- Token Resampling for Efficiency: Redundant visual tokens are filtered by their cosine similarity, retaining only the top tokens to reduce token count. The remaining tokens attend to discarded tokens to capture essential information.
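A simplified sketch of similarity-based token filtering and resampling: each visual token is scored by its maximum cosine similarity to the others, the least redundant k tokens are kept, and the kept tokens attend back over the full set. The scoring rule and attention details are assumptions, not the paper's exact method.

```python
# Sketch: keep the k least-redundant visual tokens (by cosine similarity),
# then let the kept tokens attend over all tokens to recover discarded info.
import torch
import torch.nn.functional as F

def filter_and_resample(tokens: torch.Tensor, k: int) -> torch.Tensor:
    # tokens: (n, dim)
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.T                      # (n, n) pairwise cosine similarity
    sim.fill_diagonal_(-1.0)                     # ignore self-similarity
    redundancy = sim.max(dim=-1).values          # high => token duplicates another
    keep = redundancy.topk(k, largest=False).indices
    kept = tokens[keep]                          # (k, dim) least-redundant tokens
    attn = torch.softmax(kept @ tokens.T / tokens.shape[-1] ** 0.5, dim=-1)
    return attn @ tokens                         # kept tokens gather info from all

# Usage: compressed = filter_and_resample(torch.randn(256, 1024), k=64)
```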
Training Details
- Optimizer: AdamW with cosine decay scheduling.
- Learning Rate and Schedule: Starts at 1e-5 with a warmup over 150 steps, cooling down to 5e-6.
- Batch Size: 128 per training step.
- Training Time: 12 days on A800 GPUs for a single epoch.
- Regularization: Applies weight decay of 0.1 to prevent overfitting.
2024_06_visionllm2
2024-06: VisionLLM v2
Quick Notes
- Decoder-only model.
- Weights unavailable; Apache-2.0 License.
- Organization: Shanghai AI Lab.
Key Links
- Arxiv: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
- GitHub: OpenGVLab/VisionLLM
Guiding Questions
What kind of paper are you reading?
- This paper is a model introduction. They want to demonstrate a single system that can do a wide range of vision-language tasks.
- Not a lot of new ideas, but a lot of training and evaluation for a single model across many tasks.
What is the motivation and potential impact of the paper?
- The motivation is to have one model that can do any language, vision, or vision-language task.
- The impact is in simplifying the deployment of vision-language models, making it easier to use them in a wide range of applications.
What is the relevant related work and what is this paper’s contribution?
- The most relevant previous work is the original VisionLLM model.
- All other decoder-based Vision-Language models are relevant, like Monkey.
What are the results (Theory/Experiments)?
- The model is evaluated on 100+ vision-language tasks, showing strong performance across the board.
Details of Note
Dataset
- Stage 1: Uses a wide variety of datasets across many tasks: conversation, image captioning, image VQA, OCR, region captioning, region VQA, region recognition.
- Stage 3: Some overlap with Stage 1, but specialized towards object detection, instance segmentation, grounded captioning, visual grounding, object counting, pose estimation, image generation and editing, and more.
- Stage 2: Trained on the combination of the Stage 1 and Stage 3 data.
Methodology Highlights
- The model architecture has 4 key components:
- An image encoder, and a region encoder
- A language decoder / LLM
- Task-specific decoders
- A module called a “super-linker”. This appears to be a sort of abstractor or Q-Former-style module, where queries are learned to extract a set of features for that specific task in that particular context.
- A key way that super-linking works is through routing tokens. For a particular specialized task, like object detection, the model learns to output special routing tokens (e.g., `[DET]`, `[SEG]`, `[CAP]`) that guide the super-linker to extract the right features for that task AND to route to the particular task-specific decoder.
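A toy sketch of routing-token dispatch: when the LLM emits a routing token such as `[DET]` or `[SEG]`, the hidden state at that position is handed to the matching task-specific decoder. The token strings, decoder registry, and interfaces are hypothetical.

```python
# Sketch: dispatch hidden states at routing-token positions to task decoders.
import torch

ROUTE_TO_DECODER = {"[DET]": "detection", "[SEG]": "segmentation", "[CAP]": "captioning"}

def dispatch(generated_tokens, hidden_states, decoders):
    """generated_tokens: list[str]; hidden_states: (T, dim); decoders: dict[str, callable]."""
    outputs = []
    for pos, tok in enumerate(generated_tokens):
        if tok in ROUTE_TO_DECODER:
            task = ROUTE_TO_DECODER[tok]
            # The routed hidden state plays the role of a query for that decoder.
            outputs.append((task, decoders[task](hidden_states[pos])))
    return outputs

# Usage with stand-in decoders:
# decoders = {"detection": lambda h: h.sum(), "segmentation": lambda h: h.mean(), "captioning": lambda h: h}
# dispatch(["The", "dog", "[DET]"], torch.randn(3, 4096), decoders)
```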
Training Details
Three stage training process:
- Stage 1: Essentially LLaVA. It included a pre-train and fine-tune stage.
- Stage 2: Specialized training on a subset of tasks. It uses task-specific decoders.
- Stage 3: Keeps every component frozen except task-specific decoders. It trains for longer on 128 A100 GPUs.
2024_08_mplug_owl3
2024-08: mPLUG-Owl3
Quick Notes
- Decoder-only model
- Weights available: MIT / Apache-2.0 License
- Organization: Alibaba
Key Links
- Arxiv: mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
- GitHub: X-PLUG/mPLUG-Owl3
- Hugging Face:
Guiding Questions
What kind of paper is this?
- This paper introduces a new model, building upon the past-line of mPLUG models.
- I’d say it’s a relatively small idea plus evaluation, but they present an interesting architectural variation that seems to enable scaling up to long image-sequence lengths.
What is the motivation and potential impact of the paper?
- The motivation is to improve long image-sequence understanding in multimodal LLMs.
- This has massive downstream impact, enabling in-context learning with long sequences of images and text, and also applications to video understanding and other multimodal tasks.
What is the relevant related work and what is this paper’s contribution?
- The most relevant related work is the original mPLUG-Owl model and all subsequent mPLUG models.
- This paper contributes by showing massive gains on scenarios where one wants to have a very long interleaved sequence of images and text, like video understanding or long-form image-text understanding.
What are the results (Theory/Experiments)?
- They show marginal or borderline improvements (or at least no degradation) in performance on standard VQA and image-text tasks.
- Where they really gain is on multi-image tasks and benchmarks.
- Where they stand out light-years above other models is on a new benchmark they design, called Distractor Resistance, which requires a model to perform reasoning over hundreds of images in a sequence.
- Here, they demonstrate an ability to perform this task, albeit with degrading performance as the number of images increases.
- Most models fail to perform this task at all, so this is a significant step forward.
Details of Note
Dataset
Methodology Highlights
- One of the key things the authors do is allow for continual cross-attention between the language and vision outputs
- They call this hyper attention
- It just allows visual features to update language ones; not bidirectional
- It isn’t applied in every block; it’s a sparsely added layer
- The same layer norm is applied to both text and visual input to encourage similar semantic spaces
- Modality-specific KQV matrices are used to allow these components to specialize
- This is reminiscent of past mPLUG-Owl family models
- They only let text refer to image features that occur before the text in the sequence, unlike other models that just attend to all images provided
- Adaptive gating (gating that is input dependent on the text features) is used to provide a final merging of the text representation update with the current representation
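A minimal sketch of a hyper-attention-style block as described above: text hidden states cross-attend to visual features through a shared layer norm and modality-specific projections, and an input-dependent gate controls how much of the visual update is merged back into the text stream. The causal image-visibility mask is noted only in a comment; dimensions are illustrative.

```python
# Sketch: one-directional gated cross-attention from text to visual features,
# with a shared layer norm and an input-dependent (adaptive) gate.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.shared_norm = nn.LayerNorm(dim)   # same norm applied to both modalities
        self.q_proj = nn.Linear(dim, dim)      # modality-specific projections
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, 1)          # adaptive gate from text features

    def forward(self, text_h: torch.Tensor, vis_h: torch.Tensor) -> torch.Tensor:
        # text_h: (B, T, dim); vis_h: (B, V, dim). A full implementation would also
        # mask out image tokens that appear after a given text position.
        q = self.q_proj(self.shared_norm(text_h))
        k = self.k_proj(self.shared_norm(vis_h))
        v = self.v_proj(self.shared_norm(vis_h))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        update = attn @ v
        g = torch.sigmoid(self.gate(text_h))   # (B, T, 1), input-dependent
        return text_h + g * update             # visual info flows into text only
```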
Training Details
- Stage 1: Pre-training
- This stage freezes most of the weights
- Only a linear projection, vision KV projection matrices, and the adaptive gate parameters are trained
- Sequence length is kept shorter (768) and resolution is kept small (384^2)
- 20k updates, batch size of 2048
- Stage 2: Multi-image training
- This stage focuses on training models that can reason over multiple images in an input
- Now the linear projection and the full language model are updated
- Sequence length expanded to 4k and full dynamic resolution enabled
- 3k updates, batch size of 1024
- Stage 3: Supervised Finetuning
- This stage focuses on making sure the model can follow instructions provided
- Again, full LLM and linear projection are updated
- Same sequence length and resolution as stage 2
- 11k updates, batch size of 1024