Model Configurations

Overview

	flamingo 3B	flamingo-mini (ours)	flamingo-tiny (ours)
params (trainable/total)		529M / 835M	180M / 267M	Note: for our models, these counts exclude the vision encoder parameters.

language model	chinchilla	OPT-350m	OPT-125m
# params	1.4B	350M	125M
# layers	24	24	12
# heads	16	16	12
embedding size	2048	1024	768
vocabulary size	32000	50256	50256

vision encoder	NFNet-F6	CLIP ViT-L/14	CLIP ViT-L/14
# params	435M	303M	303M
output shape	?	257 x 1024	257 x 1024

resampler
# params
# heads	16	16	8
# layers	6	6	6
hidden size	1536	1024	1024	= vision encoder hidden size
KV size	128	128	64
# latents	64	64	64
activation function	Sq. ReLU	Sq. ReLU	Sq. ReLU

xattn dense
# params
# heads	16	16	8
# layers (freq)	24 (every)	24 (every)	12 (every)
hidden size	2048	1024	768	= LM embedding size
KV size	128	128	64
activation function	Sq. ReLU	Sq. ReLU	Sq. ReLU
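
As a rough sketch, the flamingo-mini and flamingo-tiny columns above could be written down as plain Python objects. The class `FlamingoVariant`, its field names, and the helper `vit_num_tokens` are hypothetical and are not taken from this repository's code. Two readings of the table are assumptions: "24 (every)" is taken to mean one xattn dense block per language model layer, and the 257 vision tokens are taken to come from a 224 x 224 input split into 14 x 14 patches plus one class token.

```python
from dataclasses import dataclass


def vit_num_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens plus one class token for a square ViT input."""
    return (image_size // patch_size) ** 2 + 1


@dataclass(frozen=True)
class FlamingoVariant:
    """Hypothetical container for one column of the table (not the repo's API)."""

    name: str
    language_model: str
    lm_layers: int
    lm_heads: int
    lm_embedding_size: int
    vocab_size: int
    vision_encoder: str
    vision_tokens: int
    vision_hidden_size: int
    resampler_heads: int
    resampler_layers: int
    resampler_hidden_size: int
    resampler_kv_size: int
    num_latents: int
    xattn_heads: int
    xattn_every: int  # assumption: 1 means an xattn block at every LM layer
    xattn_hidden_size: int
    xattn_kv_size: int

    def __post_init__(self) -> None:
        # The table notes that these sizes are tied together.
        if self.resampler_hidden_size != self.vision_hidden_size:
            raise ValueError("resampler hidden size must equal vision hidden size")
        if self.xattn_hidden_size != self.lm_embedding_size:
            raise ValueError("xattn hidden size must equal LM embedding size")

    @property
    def num_xattn_layers(self) -> int:
        return self.lm_layers // self.xattn_every


# Assumed input resolution of 224 with patch size 14 gives 257 tokens.
CLIP_TOKENS = vit_num_tokens(image_size=224, patch_size=14)

FLAMINGO_MINI = FlamingoVariant(
    name="flamingo-mini", language_model="OPT-350m",
    lm_layers=24, lm_heads=16, lm_embedding_size=1024, vocab_size=50256,
    vision_encoder="CLIP ViT-L/14", vision_tokens=CLIP_TOKENS, vision_hidden_size=1024,
    resampler_heads=16, resampler_layers=6, resampler_hidden_size=1024,
    resampler_kv_size=128, num_latents=64,
    xattn_heads=16, xattn_every=1, xattn_hidden_size=1024, xattn_kv_size=128,
)

FLAMINGO_TINY = FlamingoVariant(
    name="flamingo-tiny", language_model="OPT-125m",
    lm_layers=12, lm_heads=12, lm_embedding_size=768, vocab_size=50256,
    vision_encoder="CLIP ViT-L/14", vision_tokens=CLIP_TOKENS, vision_hidden_size=1024,
    resampler_heads=8, resampler_layers=6, resampler_hidden_size=1024,
    resampler_kv_size=64, num_latents=64,
    xattn_heads=8, xattn_every=1, xattn_hidden_size=768, xattn_kv_size=64,
)

if __name__ == "__main__":
    for cfg in (FLAMINGO_MINI, FLAMINGO_TINY):
        print(cfg.name, cfg.vision_tokens, cfg.num_xattn_layers)
```

Keeping the two size constraints in `__post_init__` means a mistyped hidden size fails at construction time instead of surfacing later as a shape mismatch.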

TODO

TODO