
Model Configurations


Overview

A comparison of model sizes: the smallest Flamingo model, Flamingo-3B, versus the architectures in this repository that were trained on Conceptual Captions, named flamingo-mini and flamingo-tiny. The parameters for Flamingo-3B were extracted from the paper: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/tackling-multiple-tasks-with-a-single-visual-language-model/flamingo.pdf

| | Flamingo-3B | flamingo-mini (ours) | flamingo-tiny (ours) |
|---|---|---|---|
| **params (trainable / total)** | | 529M / 835M ¹ | 180M / 267M ¹ |
| **language model** | Chinchilla | OPT-350m | OPT-125m |
| # params | 1.4B | 350M | 125M |
| # layers | 24 | 24 | 12 |
| # heads | 16 | 16 | 12 |
| embedding size | 2048 | 1024 | 768 |
| number of tokens | 32000 | 50256 | 50256 |
| **vision encoder** | NFNet-F6 | CLIP ViT-L/14 | CLIP ViT-L/14 |
| # params | 435M | 303M | 303M |
| output shape | ? x 1536 | 257 x 1024 | 257 x 1024 |
| **perceiver resampler** | | | |
| # params | 194M | 101M | 63M |
| # heads | 16 | 16 | 8 |
| # layers | 6 | 6 | 6 |
| hidden size ² | 1536 | 1024 | 1024 |
| KV size | 128 | 128 | 64 |
| # latents | 64 | 64 | 64 |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU |
| **gated cross-attention** | | | |
| # params | 1.2B | | |
| # heads | 16 | 16 | 8 |
| # layers (frequency) | 24 (every layer) | 24 (every layer) | 12 (every layer) |
| hidden size ³ | 2048 | 1024 | 768 |
| KV size | 128 | 128 | 64 |
| activation function | Sq. ReLU | Sq. ReLU | Sq. ReLU |

¹ For our models, the vision encoder parameters are not included in the trainable/total counts.
² The perceiver resampler hidden size equals the vision encoder hidden size.
³ The gated cross-attention hidden size equals the LM embedding size.
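
For quick reference, the two configurations from this repository can also be summarized in code. The sketch below is only illustrative and does not reflect the repository's actual configuration class; the Hugging Face model identifiers and field names are assumptions chosen to match the table above.

```python
from dataclasses import dataclass

@dataclass
class FlamingoSketchConfig:
    """Illustrative summary of one column of the table above (not the repo's real config class)."""
    lm: str                 # frozen language model (assumed Hugging Face id)
    lm_dim: int             # LM embedding size = gated cross-attention hidden size
    lm_layers: int          # LM layers; a gated cross-attention block follows every layer
    vision_encoder: str     # frozen vision encoder (assumed Hugging Face id)
    vision_dim: int         # vision encoder hidden size = perceiver resampler hidden size
    resampler_layers: int   # perceiver resampler depth
    resampler_latents: int  # number of learned latent queries
    heads: int              # attention heads in resampler and gated cross-attention
    kv_dim: int             # key/value dimension per head

flamingo_mini = FlamingoSketchConfig(
    lm='facebook/opt-350m', lm_dim=1024, lm_layers=24,
    vision_encoder='openai/clip-vit-large-patch14', vision_dim=1024,
    resampler_layers=6, resampler_latents=64, heads=16, kv_dim=128,
)

flamingo_tiny = FlamingoSketchConfig(
    lm='facebook/opt-125m', lm_dim=768, lm_layers=12,
    vision_encoder='openai/clip-vit-large-patch14', vision_dim=1024,
    resampler_layers=6, resampler_latents=64, heads=8, kv_dim=64,
)
```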

From the Flamingo paper: (architecture overview figure)

flamingo-mini: (architecture diagram)

flamingo-tiny: (architecture diagram)
