Hi @PaulScotti, thanks for your great work! I'd like to know whether the model used for CLIP L is the pre-trained ViT-L/14 image encoder or the image encoder from the pre-trained GIT model.
As far as I know, although both produce outputs of shape 257 × 1024, the GIT model's encoder is fine-tuned for image captioning and performs better; by contrast, features from the plain ViT-L/14 image encoder are difficult to pass directly through the GIT model to generate image descriptions.
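For context, the 257 × 1024 shape itself follows from the standard ViT-L/14 configuration rather than from either model's fine-tuning. A minimal sketch of where the numbers come from (assuming the usual 224 × 224 input resolution and 1024 hidden size of ViT-Large):

```python
# Why both encoders emit 257 x 1024 tokens (assumed standard ViT-L/14 config):
# a 224x224 image split into 14x14 patches gives (224 // 14) ** 2 = 256 patch
# tokens, plus one prepended [CLS] token, each of hidden size 1024.
image_size, patch_size, hidden_size = 224, 14, 1024
num_patches = (image_size // patch_size) ** 2  # 256 patch tokens
seq_len = num_patches + 1                      # 257 with the [CLS] token
print(seq_len, hidden_size)  # 257 1024
```

So a matching output shape alone doesn't tell the two encoders apart; the question is which checkpoint's weights were used.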
Looking forward to your reply, thank you.