Model Configurations

dhansmair edited this page Sep 14, 2022 · 14 revisions

Overview

| | flamingo 3B | flamingo-mini (ours) | flamingo-tiny (ours) | notes |
|---|---|---|---|---|
| params (trainable/total) | | 529M / 835M | 180M / 267M | (!) For ours, the vision encoder parameters are not included here. |
| **language model** | Chinchilla | OPT-350m | OPT-125m | |
| # params | 1.4B | 350M | 125M | |
| # layers | 24 | 24 | 12 | |
| # heads | 16 | 16 | 12 | |
| embedding size | 2048 | 1024 | 768 | |
| number of tokens | 32000 | 50256 | 50256 | |
| **vision encoder** | NFNet-F6 | CLIP ViT-L/14 | CLIP ViT-L/14 | |
| # params | 435M | 303M | 303M | |
| output shape | ? | 257 x 1024 | 257 x 1024 | |
| **resampler** | | | | |
| # params | | | | |
| # heads | 16 | 16 | 8 | |
| # layers | 6 | 6 | 6 | |
| hidden size | 1536 | 1024 | 1024 | = vision encoder hidden size |
| KV size | 128 | 128 | 64 | |
| # latents | 64 | 64 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
| **xattn dense** | | | | |
| # params | | | | |
| # heads | 16 | 16 | 8 | |
| # layers (freq) | 24 (every) | 24 (every) | 12 (every) | |
| hidden size | 2048 | 1024 | 768 | = LM embedding size |
| KV size | 128 | 128 | 64 | |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU | |
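
The two "ours" columns of the table above can also be written down as plain data, which makes the stated relationships easy to check: the resampler hidden size equals the vision encoder hidden size, and the xattn dense hidden size equals the LM embedding size. The following is only a minimal sketch. The class `FlamingoVariant` and all of its field names are hypothetical and are not taken from the repository's code; the numbers are copied from the table, and the frozen parameter count is simply total minus trainable.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FlamingoVariant:
    """Hypothetical container for one column of the overview table."""
    name: str
    language_model: str
    lm_layers: int
    lm_heads: int
    lm_embedding_size: int
    vision_encoder: str
    vision_output_shape: Tuple[int, int]  # (number of tokens, hidden size)
    resampler_heads: int
    resampler_layers: int
    resampler_kv_size: int
    num_latents: int
    xattn_heads: int
    xattn_kv_size: int
    trainable_params_m: int  # millions, vision encoder not included
    total_params_m: int      # millions, vision encoder not included

    @property
    def resampler_hidden_size(self) -> int:
        # Table note: resampler hidden size = vision encoder hidden size.
        return self.vision_output_shape[1]

    @property
    def xattn_hidden_size(self) -> int:
        # Table note: xattn dense hidden size = LM embedding size.
        return self.lm_embedding_size

    @property
    def frozen_params_m(self) -> int:
        # Derived value: total minus trainable, in millions.
        return self.total_params_m - self.trainable_params_m


FLAMINGO_MINI = FlamingoVariant(
    name="flamingo-mini", language_model="OPT-350m",
    lm_layers=24, lm_heads=16, lm_embedding_size=1024,
    vision_encoder="CLIP ViT-L/14", vision_output_shape=(257, 1024),
    resampler_heads=16, resampler_layers=6, resampler_kv_size=128,
    num_latents=64, xattn_heads=16, xattn_kv_size=128,
    trainable_params_m=529, total_params_m=835,
)

FLAMINGO_TINY = FlamingoVariant(
    name="flamingo-tiny", language_model="OPT-125m",
    lm_layers=12, lm_heads=12, lm_embedding_size=768,
    vision_encoder="CLIP ViT-L/14", vision_output_shape=(257, 1024),
    resampler_heads=8, resampler_layers=6, resampler_kv_size=64,
    num_latents=64, xattn_heads=8, xattn_kv_size=64,
    trainable_params_m=180, total_params_m=267,
)

if __name__ == "__main__":
    for variant in (FLAMINGO_MINI, FLAMINGO_TINY):
        print(variant.name,
              "resampler hidden:", variant.resampler_hidden_size,
              "xattn hidden:", variant.xattn_hidden_size,
              "frozen params (M):", variant.frozen_params_m)
```

Deriving the two hidden sizes from other fields, instead of storing them separately, keeps the sketch from drifting out of sync with the constraints noted in the table.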

flamingo-mini

TODO

flamingo-tiny

TODO
