Multiple Instance Visual-Semantic Embedding

Abstract: Visual-semantic embedding models have been recently proposed and shown to be effective for image classification and zero-shot learning. The key idea is that by directly learning a mapping from images into a semantic label space, the algorithm can generalize to a large number of unseen labels. However, existing approaches are limited to single-label embedding, handling images with multiple labels still remains an open problem, mainly due to the complex underlying correspondence between an image and its labels. In this work, we present a novel Multiple Instance Visual-Semantic Embedding (MIVSE) model for multi-label images. Instead of embedding a whole image into the semantic space, our model characterizes the subregion-to-label correspondence, which discovers and maps semantically meaningful image subregions to the corresponding labels. Experiments on two challenging tasks, multi-label image annotation and zero-shot learning, show that the proposed MIVSE model outperforms state-of-the-art methods on both tasks and possesses the ability of generalizing to unseen labels.