
Model Configurations


Overview

A comparison of model sizes: the smallest Flamingo model, Flamingo-3B, versus the architectures in this repository that were trained on Conceptual Captions, named flamingo-mini and flamingo-tiny. The parameters for Flamingo-3B were extracted from the paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf

| | Flamingo-3B | flamingo-mini (ours) | flamingo-tiny (ours) |
|---|---|---|---|
| **params (trainable / total)** | | 529M / 835M ¹ | 180M / 267M ¹ |
| **language model** | Chinchilla | OPT-350m | OPT-125m |
| # params | 1.4B | 350M | 125M |
| # layers | 24 | 24 | 12 |
| # heads | 16 | 16 | 12 |
| embedding size | 2048 | 1024 | 768 |
| number of tokens | 32000 | 50256 | 50256 |
| **vision encoder** | NFNet-F6 | CLIP ViT-L/14 | CLIP ViT-L/14 |
| # params | 435M | 303M | 303M |
| output shape | ? x 1536 | 257 x 1024 | 257 x 1024 |
| **perceiver resampler** | | | |
| # params | 194M | 101M | 63M |
| # heads | 16 | 16 | 8 |
| # layers | 6 | 6 | 6 |
| hidden size ² | 1536 | 1024 | 1024 |
| KV size | 128 | 128 | 64 |
| # latents | 64 | 64 | 64 |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU |
| **gated cross-attention** | | | |
| # params | 1.2B | | |
| # heads | 16 | 16 | 8 |
| # layers (frequency) | 24 (every layer) | 24 (every layer) | 12 (every layer) |
| hidden size ³ | 2048 | 1024 | 768 |
| KV size | 128 | 128 | 64 |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU |

¹ For our models, the vision encoder parameters are not included in the trainable/total counts.
² The perceiver resampler hidden size equals the vision encoder hidden size.
³ The gated cross-attention hidden size equals the LM embedding size.
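
For quick reference, the two configurations from this repository can also be summarized in code. The sketch below is only illustrative and does not reflect the repository's actual configuration class; the Hugging Face model identifiers and field names are assumptions chosen to match the table above.

```python
from dataclasses import dataclass

@dataclass
class FlamingoSketchConfig:
    """Illustrative summary of one column of the table above (not the repo's real config class)."""
    lm: str                 # frozen language model (assumed Hugging Face id)
    lm_dim: int             # LM embedding size = gated cross-attention hidden size
    lm_layers: int          # LM layers; a gated cross-attention block follows every layer
    vision_encoder: str     # frozen vision encoder (assumed Hugging Face id)
    vision_dim: int         # vision encoder hidden size = perceiver resampler hidden size
    resampler_layers: int   # perceiver resampler depth
    resampler_latents: int  # number of learned latent queries
    heads: int              # attention heads in resampler and gated cross-attention
    kv_dim: int             # key/value dimension per head

flamingo_mini = FlamingoSketchConfig(
    lm='facebook/opt-350m', lm_dim=1024, lm_layers=24,
    vision_encoder='openai/clip-vit-large-patch14', vision_dim=1024,
    resampler_layers=6, resampler_latents=64, heads=16, kv_dim=128,
)

flamingo_tiny = FlamingoSketchConfig(
    lm='facebook/opt-125m', lm_dim=768, lm_layers=12,
    vision_encoder='openai/clip-vit-large-patch14', vision_dim=1024,
    resampler_layers=6, resampler_latents=64, heads=8, kv_dim=64,
)
```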

From the Flamingo paper: (architecture overview figure)

flamingo-mini: (architecture diagram)

flamingo-tiny: (architecture diagram)
