
Some questions about 257 x 1024 CLIP L #36

Open
dongyangli-del opened this issue Sep 23, 2024 · 0 comments
Hi @PaulScotti, thanks for your great work! I'd like to know whether the CLIP L model used here is the pretrained ViT-L/14 image encoder or the pretrained image encoder from the GIT model.

As far as I know, although both produce outputs of shape 257 x 1024, the GIT model is fine-tuned for image captioning and gives better results. In contrast, features from the plain ViT-L/14 image encoder are difficult to pass directly through the GIT model to generate image descriptions.

Looking forward to your reply, thank you.
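As context for why the two encoders are shape-compatible in the first place: the 257 x 1024 output follows directly from ViT-L/14 geometry. A minimal arithmetic sketch, assuming the standard 224 x 224 input resolution and ViT-Large hyperparameters (GIT's image encoder is understood here, per the question, to share the same CLIP ViT-L-style backbone shape):

```python
# Why both encoders emit 257 x 1024 features: ViT-L/14 geometry.
# Assumptions: 224x224 input, 14x14 patches, ViT-Large width of 1024.

image_size = 224      # input resolution (pixels per side)
patch_size = 14       # ViT-L/14 patch side
hidden_size = 1024    # ViT-Large embedding width

patches_per_side = image_size // patch_size   # 224 / 14 = 16
num_patches = patches_per_side ** 2           # 16 * 16 = 256
seq_len = num_patches + 1                     # +1 for the [CLS] token

print(seq_len, hidden_size)  # 257 1024
```

So an identical output shape does not imply identical features: the two encoders share geometry but differ in training objective, which is the crux of the question above.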
