Model Configurations
A comparison of model configurations: the smallest original Flamingo model, Flamingo-3B, versus the two architectures in this repository that were trained on Conceptual Captions, named flamingo-mini and flamingo-tiny.
The parameters for Flamingo-3B were extracted from the Flamingo paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf
| | Flamingo-3B | flamingo-mini (ours) | flamingo-tiny (ours) | notes |
|---|---|---|---|---|
| params (trainable / total) | | 529M / 835M | 180M / 267M | (!) for ours, the vision encoder parameters are not included here |
| **language model** | Chinchilla | OPT-350m | OPT-125m | |
| # params | 1.4B | 350M | 125M | |
| # layers | 24 | 24 | 12 | |
| # heads | 16 | 16 | 12 | |
| embedding size | 2048 | 1024 | 768 | |
| vocabulary size | 32000 | 50256 | 50256 | |
| **vision encoder** | NFNet-F6 | CLIP ViT-L/14 | CLIP ViT-L/14 | |
| # params | 435M | 303M | 303M | |
| output shape | ? x 1536 | 257 x 1024 | 257 x 1024 | |
| **perceiver resampler** | | | | |
| # params | 194M | 101M | 63M | |
| # heads | 16 | 16 | 8 | |
| # layers | 6 | 6 | 6 | |
| hidden size | 1536 | 1024 | 1024 | = vision encoder hidden size |
| KV size | 128 | 128 | 64 | |
| # latents | 64 | 64 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
| **gated cross-attention** | | | | |
| # params | 1.2B | | | |
| # heads | 16 | 16 | 8 | |
| # layers (freq) | 24 (every) | 24 (every) | 12 (every) | |
| hidden size | 2048 | 1024 | 768 | = LM embedding size |
| KV size | 128 | 128 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
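The two "ours" columns can also be summarized in code. Below is a minimal sketch that collects the flamingo-mini and flamingo-tiny hyperparameters from the table in a plain Python dataclass. The class and field names (`FlamingoVariantSpec`, `xattn_layers`, ...) are illustrative placeholders and are not the configuration API of this repository; only the values are taken from the table above.

```python
from dataclasses import dataclass

# Illustrative summary of the table above. Class and field names are
# placeholders, NOT the actual configuration class of this repository.
@dataclass
class FlamingoVariantSpec:
    # frozen language model
    lm_name: str            # Hugging Face model id of the frozen LM
    lm_dim: int             # LM embedding size
    # frozen vision encoder (CLIP ViT-L/14 for both variants, 257 x 1024 output)
    vision_encoder: str = "openai/clip-vit-large-patch14"
    vision_dim: int = 1024
    # trainable perceiver resampler
    resampler_layers: int = 6
    resampler_heads: int = 16
    resampler_kv_dim: int = 128      # "KV size" in the table
    resampler_latents: int = 64
    # trainable gated cross-attention, inserted before every LM layer
    xattn_layers: int = 24
    xattn_heads: int = 16
    xattn_kv_dim: int = 128

# flamingo-mini: OPT-350m backbone
FLAMINGO_MINI = FlamingoVariantSpec(lm_name="facebook/opt-350m", lm_dim=1024)

# flamingo-tiny: OPT-125m backbone with fewer, smaller attention modules
FLAMINGO_TINY = FlamingoVariantSpec(
    lm_name="facebook/opt-125m",
    lm_dim=768,
    resampler_heads=8,
    resampler_kv_dim=64,
    xattn_layers=12,
    xattn_heads=8,
    xattn_kv_dim=64,
)
```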
From the Flamingo paper: