Journal ArticleDOI
Collective Reconstructive Embeddings for Cross-Modal Hashing
TL;DR: This paper unifies the projections of text and image to the Hamming space into a common reconstructive embedding through a rigid mathematical reformulation, which not only reduces the optimization complexity significantly but also facilitates inter-modal similarity preservation among different modalities.
Abstract:
In this paper, we study the problem of cross-modal retrieval by hashing-based approximate nearest neighbor search techniques. Most existing cross-modal hashing works mainly address the issue of multi-modal integration complexity by using the same mapping and similarity calculation for data from different media types. Nonetheless, this may cause information loss during the mapping process because it overlooks the specifics of each individual modality. In this paper, we propose a simple yet effective cross-modal hashing approach, termed collective reconstructive embeddings (CRE), which can simultaneously address the heterogeneity and integration complexity of multi-modal data. To address the heterogeneity challenge, we propose to process heterogeneous types of data using different modality-specific models. Specifically, we model textual data with a cosine similarity-based reconstructive embedding to alleviate data sparsity to the greatest extent, while for image data we utilize the Euclidean distance to characterize the relationships of the projected hash codes. Meanwhile, we unify the projections of text and image to the Hamming space into a common reconstructive embedding through a rigid mathematical reformulation, which not only reduces the optimization complexity significantly but also facilitates inter-modal similarity preservation among different modalities. We further incorporate the code balance and uncorrelation criteria into the problem and devise an efficient iterative algorithm for optimization. Comprehensive experiments on four widely used multimodal benchmarks show that the proposed CRE achieves superior performance compared with the state of the art on several challenging cross-modal tasks.
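The pipeline the abstract describes — modality-specific embeddings (cosine-based for text, Euclidean for images) quantized into a shared Hamming space, a code-balance criterion, and Hamming-distance retrieval — can be sketched as follows. This is a minimal illustrative sketch, not the authors' CRE implementation: the random projections stand in for the learned ones, and all dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 paired samples, text (50-d) and image (512-d).
n, d_txt, d_img, k = 100, 50, 512, 16   # k = hash code length
X_txt = rng.random((n, d_txt))
X_img = rng.random((n, d_img))

# Modality-specific projections into a shared k-bit Hamming space
# (random projections standing in for the learned ones).
W_txt = rng.standard_normal((d_txt, k))
W_img = rng.standard_normal((d_img, k))

def cosine_embed(X, W):
    """Text side: L2-normalize rows so inner products behave like cosine
    similarities, mitigating the effects of data sparsity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ W

def euclidean_embed(X, W):
    """Image side: plain linear projection; Euclidean distances between
    projections characterize relations among the projected codes."""
    return X @ W

B_txt = np.sign(cosine_embed(X_txt, W_txt))    # binary codes in {-1, +1}
B_img = np.sign(euclidean_embed(X_img, W_img))

# Code-balance criterion: each bit should split ~evenly (per-bit mean near 0).
balance = np.abs(B_txt.mean(axis=0))

# Cross-modal retrieval: rank images by Hamming distance to a text query.
q = B_txt[0]
ham = (k - B_img @ q) / 2    # Hamming distance from a {-1,+1} inner product
ranking = np.argsort(ham)
```

The `(k - inner_product) / 2` identity is why sign codes in `{-1, +1}` are convenient: Hamming distance reduces to a dot product, so retrieval is a single matrix-vector multiply.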
Citations
Journal ArticleDOI
Deep Multi-View Enhancement Hashing for Image Retrieval
TL;DR: Proposes a supervised multi-view hash model that enhances multi-view information through neural networks; an effective view-stability evaluation method actively explores the relationships among views, which guides the optimization direction of the entire network.
Journal ArticleDOI
Ternary Adversarial Networks With Self-Supervision for Zero-Shot Cross-Modal Retrieval
TL;DR: A novel model called ternary adversarial networks with self-supervision (TANSS), inspired by zero-shot learning, is proposed to overcome the limitations of existing methods on the challenging task of zero-shot cross-modal retrieval.
Posted Content
Deep Multi-View Enhancement Hashing for Image Retrieval
TL;DR: This paper proposes a supervised multi-view hash model which can enhance the multi-view information through neural networks, and significantly outperforms the state-of-the-art single-view and multi-view hashing methods.
Journal ArticleDOI
Scalable Deep Hashing for Large-Scale Social Image Retrieval
TL;DR: This paper proposes a unified scalable deep hash learning framework that explores the weak but free supervision of discriminative user tags commonly accompanying social images, together with a discrete hash optimization method based on the Augmented Lagrangian Multiplier that directly solves for the hash codes and avoids binary quantization information loss.
Proceedings ArticleDOI
Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking
TL;DR: This work proposes a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion rather than seeking a common embedding space for each image- text instance.
References
Journal ArticleDOI
Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope
Aude Oliva, Antonio Torralba, et al.
TL;DR: The performance of the spatial envelope model shows that specific information about object shape or identity is not a requirement for scene categorization and that modeling a holistic representation of the scene informs about its probable semantic category.
Journal ArticleDOI
Content-based image retrieval at the end of the early years
TL;DR: The working conditions of content-based retrieval: patterns of use, types of pictures, the role of semantics, and the sensory gap are discussed, as well as aspects of system engineering: databases, system architecture, and evaluation.
Journal ArticleDOI
Canonical Correlation Analysis: An Overview with Application to Learning Methods
TL;DR: Presents a general method using kernel canonical correlation analysis to learn a semantic representation of web images and their associated text, and compares orthogonalization approaches against a standard cross-representation retrieval technique known as the generalized vector space model.
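As a companion to this overview's core idea, the kernel-free special case — linear CCA — can be sketched via whitening and an SVD of the cross-covariance. The paired views, noise level, and all names below are hypothetical; this is a numerical-illustration sketch, not the paper's kernel CCA method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical paired views (e.g. image and text features) sharing a latent signal.
n = 200
z = rng.standard_normal((n, 2))                                  # shared latent
X = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((n, 6))
Y = z @ rng.standard_normal((2, 4)) + 0.1 * rng.standard_normal((n, 4))

def cca(X, Y, n_components=2, reg=1e-6):
    """Linear CCA via whitening + SVD of the cross-covariance.
    `reg` is a small ridge term for numerical stability."""
    m = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / m + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / m + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / m

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(K)
    Wx = inv_sqrt(Cxx) @ U[:, :n_components]     # projection for view X
    Wy = inv_sqrt(Cyy) @ Vt[:n_components].T     # projection for view Y
    return Wx, Wy, s[:n_components]              # s = canonical correlations

Wx, Wy, corrs = cca(X, Y)
```

Because both views are driven by the same two-dimensional latent signal with small noise, the leading canonical correlations come out close to one, which is exactly the cross-modal alignment property that makes CCA a common baseline for image-text retrieval.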
Proceedings ArticleDOI
NUS-WIDE: a real-world web image database from National University of Singapore
TL;DR: The benchmark results indicate that it is possible to learn effective models from sufficiently large image dataset to facilitate general image retrieval and four research issues on web image annotation and retrieval are identified.
Proceedings Article
Spectral Hashing
TL;DR: The problem of finding the best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP-hard; relaxing it yields a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian.
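The relaxation this summary describes — thresholding eigenvectors of the graph Laplacian to obtain binary codes — can be sketched on toy data. The dataset, Gaussian-kernel bandwidth, and code length below are illustrative assumptions, not the paper's full algorithm (which also handles out-of-sample extension under a distributional assumption).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy dataset: two well-separated clusters in 2-D.
X = np.vstack([rng.standard_normal((50, 2)) - 3,
               rng.standard_normal((50, 2)) + 3])
n, k = X.shape[0], 4                      # k-bit hash codes

# Gaussian affinity matrix and unnormalized graph Laplacian L = D - W.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * d2.mean()))         # bandwidth: an illustrative choice
L = np.diag(W.sum(axis=1)) - W

# Eigenvectors with the smallest eigenvalues give the relaxed solution of
# the graph-partitioning problem; skip the trivial constant eigenvector.
w, V = np.linalg.eigh(L)                  # eigenvalues in ascending order
codes = (V[:, 1:k + 1] > 0).astype(int)   # threshold at zero -> k-bit codes
```

Thresholding the Fiedler vector (the second eigenvector) corresponds to the relaxed minimum cut, so on well-separated clusters the first bit tends to encode cluster membership.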