With the maturity of visual detection techniques, we are more ambitious in describing visual content with open-vocabulary, fine-grained and free-form language, i.e., the task of image captioning. I…
Grounding natural language in images, such as localizing “the black dog on the left of the tree”, is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grai…