Abstract
Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream ta......