Abstract: Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability on downstream vision tasks given appropriate text prompts. Instead of designing prompts ...
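To make the premise concrete, here is a minimal sketch of the kind of prompt-based zero-shot classification with CLIP that the abstract refers to. It assumes the Hugging Face `transformers` CLIP API; the checkpoint name, prompt template, class names, and image path are illustrative assumptions, not details from the paper.

```python
# Sketch: zero-shot classification with a pretrained CLIP model and
# hand-designed text prompts (Hugging Face `transformers` API assumed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]  # hypothetical downstream labels
# A manually designed prompt template; prompt-learning methods replace this.
prompts = [f"a photo of a {name}." for name in class_names]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits; softmax yields per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs.squeeze().tolist())))
```

The quality of the hand-written template (`"a photo of a {name}."`) directly affects accuracy, which is the motivation for learning prompts rather than designing them by hand.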