Hi @PaulScotti, thanks for your great work! I'd like to know whether the model used for CLIP L is the pre-trained ViT-L/14 image encoder or the image encoder from the pre-trained GIT model.
As far as I know, although both produce outputs of shape 257 × 1024, the GIT model's encoder is fine-tuned for image captioning and performs better; by contrast, features from the plain ViT-L/14 image encoder are difficult to pass directly through the GIT model to generate image descriptions.
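For context, the 257 × 1024 shape itself follows from the standard ViT-L/14 configuration rather than from either model's fine-tuning. A minimal sketch of where the numbers come from (assuming the usual 224 × 224 input resolution and 1024 hidden size of ViT-Large):

```python
# Why both encoders emit 257 x 1024 tokens (assumed standard ViT-L/14 config):
# a 224x224 image split into 14x14 patches gives (224 // 14) ** 2 = 256 patch
# tokens, plus one prepended [CLS] token, each of hidden size 1024.
image_size, patch_size, hidden_size = 224, 14, 1024
num_patches = (image_size // patch_size) ** 2  # 256 patch tokens
seq_len = num_patches + 1                      # 257 with the [CLS] token
print(seq_len, hidden_size)  # 257 1024
```

So a matching output shape alone doesn't tell the two encoders apart; the question is which checkpoint's weights were used.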
Looking forward to your reply, thank you.