Showing papers by "Rongrong Ji" published in 2013


Proceedings ArticleDOI
21 Oct 2013
TL;DR: This work presents a method built upon psychological theories and web mining to automatically construct a large-scale Visual Sentiment Ontology (VSO) consisting of more than 3,000 Adjective Noun Pairs (ANP) and proposes SentiBank, a novel visual concept detector library that can be used to detect the presence of 1,200 ANPs in an image.
Abstract: We address the challenge of sentiment analysis from visual content. In contrast to existing methods which infer sentiment or emotion directly from visual low-level features, we propose a novel approach based on understanding of the visual concepts that are strongly related to sentiments. Our key contribution is two-fold: first, we present a method built upon psychological theories and web mining to automatically construct a large-scale Visual Sentiment Ontology (VSO) consisting of more than 3,000 Adjective Noun Pairs (ANP). Second, we propose SentiBank, a novel visual concept detector library that can be used to detect the presence of 1,200 ANPs in an image. The VSO and SentiBank are distinct from existing work and will open a gate towards various applications enabled by automatic sentiment analysis. Experiments on detecting sentiment of image tweets demonstrate significant improvement in detection accuracy when comparing the proposed SentiBank based predictors with the text-based approaches. The effort also leads to a large publicly available resource consisting of a visual sentiment ontology, a large detector library, and the training/testing benchmark for visual sentiment analysis.

692 citations


Proceedings ArticleDOI
21 Oct 2013
TL;DR: A novel system which combines sound structures from psychology and the folksonomy extracted from social multimedia to develop a large visual sentiment ontology consisting of 1,200 concepts and associated classifiers called SentiBank, believed to offer a powerful mid-level semantic representation enabling high-level sentiment analysis of social multimedia.
Abstract: A picture is worth one thousand words, but what words should be used to describe the sentiment and emotions conveyed in the increasingly popular social multimedia? We demonstrate a novel system which combines sound structures from psychology and the folksonomy extracted from social multimedia to develop a large visual sentiment ontology consisting of 1,200 concepts and associated classifiers called SentiBank. Each concept, defined as an Adjective Noun Pair (ANP), is made of an adjective strongly indicating emotions and a noun corresponding to objects or scenes that have a reasonable prospect of automatic detection. We believe such large-scale visual classifiers offer a powerful mid-level semantic representation enabling high-level sentiment analysis of social multimedia. We demonstrate novel applications made possible by SentiBank including live sentiment prediction of social media and visualization of visual content in a rich intuitive semantic space.

180 citations


Journal ArticleDOI
TL;DR: This paper proposes to parallelize the near duplicate visual search architecture to index millions of images over multiple servers, including the distribution of both visual vocabulary and the corresponding indexing structure, and validates the distributed vocabulary indexing scheme in a real world location search system over 10 million landmark images.
Abstract: In recent years, research on the Bag-of-Words based near-duplicate visual search paradigm with inverted indexing has grown steadily. One fundamental yet underexplored challenge is how to maintain the large indexing structures within a single server subject to its memory constraint, which makes it extremely hard to scale up to millions or even billions of images. In this paper, we propose to parallelize the near-duplicate visual search architecture to index millions of images over multiple servers, including the distribution of both visual vocabulary and the corresponding indexing structure. We optimize the distribution of vocabulary indexing from a machine learning perspective, which provides a “memory light” search paradigm that leverages the computational power across multiple servers to reduce the search latency. In particular, our solution addresses two essential issues: “What to distribute” and “How to distribute”. “What to distribute” is addressed by a “lossy” vocabulary Boosting, which discards both frequent and indiscriminating words prior to distribution. “How to distribute” is addressed by learning an optimal distribution function, which maximizes the uniformity of assigning the words of a given query to multiple servers. We validate the distributed vocabulary indexing scheme in a real-world location search system over 10 million landmark images. Compared to the state-of-the-art alternatives of single-server search [5], [6], [16] and distributed search [23], our scheme yields a speedup of about 200% at comparable precision by distributing only 5% of the words. We also report excellent robustness even when some of the servers crash.

104 citations
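
As a rough illustration of the distributed indexing idea in the abstract above, the sketch below shards a Bag-of-Words inverted index over several servers and merges the per-word votes at query time. It is a minimal sketch under stated assumptions, not the authors' method: the modulo rule in `assign_server` is a placeholder for the learned distribution function, the "lossy" vocabulary pruning is omitted, and all function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def assign_server(word_id, num_servers):
    # Placeholder assumption: modulo sharding stands in for the learned
    # distribution function described in the abstract.
    return word_id % num_servers


def build_sharded_index(images, num_servers):
    """images maps image_id -> iterable of visual word ids."""
    shards = [defaultdict(set) for _ in range(num_servers)]
    for image_id, words in images.items():
        for word_id in words:
            shards[assign_server(word_id, num_servers)][word_id].add(image_id)
    return shards


def search(shards, query_words, skip_servers=()):
    """Count one vote per shared word; servers in skip_servers are treated as down."""
    votes = Counter()
    for word_id in set(query_words):
        server = assign_server(word_id, len(shards))
        if server in skip_servers:
            continue
        for image_id in shards[server].get(word_id, ()):
            votes[image_id] += 1
    return votes.most_common()


# Toy data (hypothetical): three images described by visual word ids.
images = {"a": [1, 2, 3, 4], "b": [2, 4, 6], "c": [7, 8]}
shards = build_sharded_index(images, num_servers=2)
ranked = search(shards, [1, 2, 3, 4])
degraded = search(shards, [1, 2, 3, 4], skip_servers={1})
```

With one of the two servers skipped, the query still returns candidates from the surviving shard, only with fewer votes per image. That is the kind of graceful degradation the robustness claim refers to, although the paper's own evaluation is not reproduced here.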


Proceedings ArticleDOI
01 Dec 2013
TL;DR: A novel image reranking approach is proposed by introducing a Co-Regularized Multi-Graph Learning (Co-RMGL) framework, in which the intra-graph and inter-graph constraints are simultaneously imposed to encode affinities in a single graph and consistency across different graphs.
Abstract: Visual reranking has been widely deployed to refine the quality of conventional content-based image retrieval engines. The current trend lies in employing a crowd of retrieved results stemming from multiple feature modalities to boost the overall performance of visual reranking. However, a major challenge pertaining to current reranking methods is how to take full advantage of the complementary property of distinct feature modalities. Given a query image and one feature modality, a regular visual reranking framework treats the top-ranked images as pseudo-positive instances, which are inevitably noisy, make it difficult to reveal this complementary property, and thus lead to inferior ranking performance. This paper proposes a novel image reranking approach by introducing a Co-Regularized Multi-Graph Learning (Co-RMGL) framework, in which the intra-graph and inter-graph constraints are simultaneously imposed to encode affinities in a single graph and consistency across different graphs. Moreover, weakly supervised learning driven by image attributes is performed to denoise the pseudo-labeled instances, thereby highlighting the unique strength of each individual feature modality. Meanwhile, such learning can yield a few anchors in graphs that vitally enable the alignment and fusion of multiple graphs. As a result, an edge weight matrix learned from the fused graph automatically gives the ordering of the initially retrieved results. We evaluate our approach on four benchmark image retrieval datasets, demonstrating a significant performance gain over state-of-the-art methods.

89 citations


Proceedings Article
14 Jul 2013
TL;DR: In this model, a tree-structured sparsity-inducing norm regularization is first introduced to provide a hierarchical description of the image structure to ensure the completeness of the extracted salient object, and high-level priors are integrated to guide the matrix decomposition and enhance the saliency detection.
Abstract: Salient object detection provides an alternative solution to various image semantic understanding tasks such as object recognition, adaptive compression and image retrieval. Recently, low-rank matrix recovery (LR) theory has been introduced into saliency detection and has achieved impressive results. However, the existing LR-based models neglect the underlying structure of images, which inevitably degrades their performance. In this paper, we propose a Low-rank and Structured sparse Matrix Decomposition (LSMD) model for salient object detection. In the model, a tree-structured sparsity-inducing norm regularization is first introduced to provide a hierarchical description of the image structure to ensure the completeness of the extracted salient object. The similarity of saliency values within the salient object is then guaranteed by the l∞-norm. Finally, high-level priors are integrated to guide the matrix decomposition and enhance the saliency detection. Experimental results on the largest public benchmark database show that our model outperforms existing LR-based approaches and other state-of-the-art methods, which verifies the effectiveness and robustness of the structure cues in our model.

61 citations


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper utilizes the existing massive 2D semantic labeled datasets from decade-long community efforts, and a novel "cross-domain" label propagation approach, which effectively addresses the cross-domain issue and does not require any training data from the target scenes, with good scalability towards large scale applications.
Abstract: Recent years have witnessed a growing interest in understanding the semantics of point clouds in a wide variety of applications. However, point cloud labeling remains an open problem, due to the difficulty in acquiring sufficient 3D point labels towards training effective classifiers. In this paper, we overcome this challenge by utilizing the existing massive 2D semantic labeled datasets from decade-long community efforts, such as ImageNet and LabelMe, and a novel "cross-domain" label propagation approach. Our proposed method consists of two major novel components: Exemplar SVM based label propagation, which effectively addresses the cross-domain issue, and a graphical model based contextual refinement incorporating 3D constraints. Most importantly, the entire process does not require any training data from the target scenes, and it scales well towards large scale applications. We evaluate our approach on the well-known Cornell Point Cloud Dataset, achieving much greater efficiency and comparable accuracy even without any 3D training data. Our approach shows further major gains in accuracy when the training data from the target scenes is used, outperforming state-of-the-art approaches with far better efficiency.

44 citations


Journal ArticleDOI
TL;DR: Comprehensive experimental results validate that the proposed reversible watermarking scheme can effectively prevent the high-precision vector data from being illegally used while simultaneously maintaining the basic shape of each polyline.
Abstract: The reversible watermarking technique is suitable for vector maps due to its reversibility after watermark extraction. In this paper, a novel reversible watermarking scheme based on the idea of nonlinear scrambling is proposed. It begins with feature point extraction. To prevent the high-precision vector data from being illegally used by unauthorized users, the algorithm nonlinearly scrambles the relative position of feature points. Then, based on the proposed reversible embedding, both scrambled feature points and nonfeature points are taken as cover data, the coordinates of which are modified to embed both watermark data and feature point identification data. Finally, combined with the scrambling secret key, the original vector data can be exactly recovered with watermark extraction. Comprehensive experimental results validate that the scheme can effectively prevent the high-precision vector data from being illegally used while simultaneously maintaining the basic shape of each polyline.

37 citations


Journal ArticleDOI
07 May 2013-PLOS ONE
TL;DR: This paper proposes a precise and robust scheme for dynamic 3D scene reconstruction by using the compressed color video stream and its inaccurate motion vectors, which ensures the depth maps can be compensated at both video rate and high resolution at the terminal side, reducing the system cost of both compression and transmission.
Abstract: Remote dynamic three-dimensional (3D) scene reconstruction renders the motion structure of a 3D scene remotely by means of both the color video and the corresponding depth maps. It has shown great potential for telepresence applications like remote monitoring and remote medical imaging. Under this circumstance, video rate and high resolution are two crucial characteristics for building a good depth map, which however conflict with each other during depth sensor capture. Therefore, recent works prefer to transmit only the high-resolution color video to the terminal side, and subsequently the scene depth is reconstructed by estimating the motion vectors from the video, typically using propagation based methods towards a video-rate depth reconstruction. However, in most remote transmission systems, only the compressed color video stream is available. As a result, color video restored from the streams has quality losses, and thus the extracted motion vectors are inaccurate for depth reconstruction. In this paper, we propose a precise and robust scheme for dynamic 3D scene reconstruction by using the compressed color video stream and its inaccurate motion vectors. Our method rectifies the inaccurate motion vectors by analyzing and compensating their quality losses, motion vector absence in spatial prediction, and dislocation in near-boundary regions. This rectification ensures the depth maps can be compensated at both video rate and high resolution at the terminal side, reducing the system cost of both compression and transmission. Our experiments validate that the proposed scheme is robust for depth map and dynamic scene reconstruction over long propagation distances, even with high compression ratios, outperforming the benchmark approaches with at least 3.3950 dB quality gains for remote applications.

27 citations


Proceedings Article
03 Aug 2013
TL;DR: Extensive experiments carried out on six benchmark datasets validate that the proposed M-fitted graph is superior to state-of-the-art neighborhood graphs in terms of classification accuracy using popular graph-based semi-supervised learning methods.
Abstract: In this paper, we propose a locality-constrained and sparsity-encouraged manifold fitting approach, aiming at capturing the locally sparse manifold structure into neighborhood graph construction by exploiting a principled optimization model. The proposed model formulates neighborhood graph construction as a sparse coding problem with the locality constraint, therefore achieving simultaneous neighbor selection and edge weight optimization. The core idea underlying our model is to perform a sparse manifold fitting task for each data point so that close-by points lying on the same local manifold are automatically chosen to connect and meanwhile the connection weights are acquired by simple geometric reconstruction. We term the novel neighborhood graph generated by our proposed optimization model M-Fitted Graph since such a graph stems from sparse manifold fitting. To evaluate the robustness and effectiveness of M-fitted graphs, we leverage graph-based semi-supervised learning as the testbed. Extensive experiments carried out on six benchmark datasets validate that the proposed M-fitted graph is superior to state-of-the-art neighborhood graphs in terms of classification accuracy using popular graph-based semi-supervised learning methods.

26 citations


Journal ArticleDOI
TL;DR: This paper introduces an attention shift scheme to detect and partition the focused human actions from YouTube videos, and leverages a boosting based feature selection to output the final action descriptors, which incorporates the ranking distortion of the conjunctive queries into the boosting objective.

16 citations


Journal ArticleDOI
TL;DR: This paper proposes a new automatic way to integrate a background subtraction (BGS) and an alpha matting technique via a heuristic seeds selection scheme, and demonstrates the efficiency and effectiveness of this method.

Journal ArticleDOI
TL;DR: Extensive experiments over three benchmark image datasets well demonstrate the superiority of the proposed query-adaptive hashing method over the state-of-the-art ones in terms of retrieval accuracy.
Abstract: Hashing-based approximate nearest-neighbor search may well realize scalable content-based image retrieval. The existing semantic-preserving hashing methods leverage the labeled data to learn a fixed set of semantic-aware hash functions. However, a fixed hash function set is unable to well encode all semantic information simultaneously, and ignores the specific user's search intention conveyed by the query. In this article, we propose a query-adaptive hashing method which is able to generate the most appropriate binary codes for different queries. Specifically, a set of semantic-biased discriminant projection matrices are first learnt for each of the semantic concepts, through which a semantic-adaptable hash function set is learnt via a joint sparsity variable selection model. At query time, we further use the sparsity representation procedure to select the most appropriate hash function subset that is informative to the semantic information conveyed by the query. Extensive experiments over three benchmark image datasets well demonstrate the superiority of our proposed query-adaptive hashing method over the state-of-the-art ones in terms of retrieval accuracy.

Journal ArticleDOI
TL;DR: A Bayesian framework is proposed to generate accurate and temporal consistent dense depth videos in an efficient way and can achieve accurate depth videos with higher efficiency up to 68.14% than traditional methods.

Proceedings ArticleDOI
17 Aug 2013
TL;DR: Results on two hyperspectral images show that the proposed framework combining spectral information with spatial context can greatly improve the final result with respect to pixel-wise classification with Random Forests.
Abstract: The high dimensionality of hyperspectral images is usually coupled with limited available reference data, which degrades the performance of supervised classification techniques such as random forests (RF). The commonly used pixel-wise classification lacks information about the spatial structures of the image. In order to improve classification performance, spectral and spatial information need to be combined. This paper proposes a novel scheme for accurate spectral-spatial classification of hyperspectral images. It is based on random forests, followed by majority voting within the superpixels obtained by oversegmentation through a graph-based technique. The scheme combines the result of a pixel-wise RF classification and the segmentation map obtained by oversegmentation. Our experimental results on two hyperspectral images show that the proposed framework combining spectral information with spatial context can greatly improve the final result with respect to pixel-wise classification with Random Forests.
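
The majority-voting step described in the abstract above can be sketched in a few lines. This is a minimal sketch, assuming the pixel-wise predictions and the superpixel map are already available as flat sequences. The random forest and the graph-based oversegmentation are not implemented here, and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(pixel_labels, segment_ids):
    """Relabel every pixel with the most common class in its superpixel.

    pixel_labels: per-pixel class predictions (e.g. from a random forest).
    segment_ids:  per-pixel superpixel ids from an oversegmentation.
    Both are flat sequences of equal length.
    """
    if len(pixel_labels) != len(segment_ids):
        raise ValueError("inputs must have the same length")
    votes = defaultdict(Counter)
    for label, segment in zip(pixel_labels, segment_ids):
        votes[segment][label] += 1
    # Pick the most frequent class per superpixel.
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [winner[segment] for segment in segment_ids]


# Toy example (hypothetical labels): two superpixels of three pixels each.
labels = ["water", "water", "soil", "soil", "soil", "water"]
segments = [0, 0, 0, 1, 1, 1]
smoothed = majority_vote(labels, segments)
```

In this toy case the isolated "soil" pixel in superpixel 0 and the isolated "water" pixel in superpixel 1 are overruled by their neighbours, which is how spatial context cleans up noisy pixel-wise predictions.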

Journal ArticleDOI
21 Jun 2013-PLOS ONE
TL;DR: This paper presents a question reformulation scheme to enhance the question retrieval model by fully exploring the intelligence of paraphrase at the phrase level, which complements existing paraphrasing research with a more suitable granularity.
Abstract: The lexical gap in community question answering (cQA) search, which results from the variability of language, has been recognized as an important and widespread phenomenon. To address the problem, this paper presents a question reformulation scheme to enhance the question retrieval model by fully exploring the intelligence of paraphrase at the phrase level. It complements existing paraphrasing research, which falls into either the fine-grained lexical level or the coarse-grained sentence level, with a more suitable granularity. Given a question in natural language, our scheme first detects the involved key-phrases by jointly integrating the corpus-dependent knowledge and question-aware cues. Next, it automatically extracts the paraphrases for each identified key-phrase utilizing multiple online translation engines, and then selects the most relevant reformulations from a large group of question rewrites, which is formed by full permutation and combination of the generated paraphrases. Extensive evaluations on a real-world data set demonstrate that our model is able to characterize complex questions and achieves promising performance compared to the state-of-the-art methods.
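
To make the "full permutation and combination" step in the abstract above concrete, the sketch below enumerates every rewrite of a question by taking the Cartesian product of the paraphrase options for each phrase. It is a minimal sketch under stated assumptions: key-phrase detection, paraphrase extraction through translation engines, and relevance-based selection are not shown, and the phrases and paraphrases are made-up examples.

```python
from itertools import product


def question_rewrites(phrases, paraphrases):
    """Yield every rewrite obtained by swapping key-phrases for paraphrases.

    phrases:     the question split into phrases, in order.
    paraphrases: maps a key-phrase to its alternative phrasings.
    """
    # Each slot keeps the original phrase as its first option.
    options = [[p] + list(paraphrases.get(p, [])) for p in phrases]
    for choice in product(*options):
        yield " ".join(choice)


# Hypothetical question and paraphrases, for illustration only.
phrases = ["how to", "fix", "a flat tire"]
paraphrases = {"fix": ["repair", "mend"], "a flat tire": ["a punctured tyre"]}
rewrites = list(question_rewrites(phrases, paraphrases))
```

The number of rewrites is the product of the option counts per phrase (here 1 × 3 × 2 = 6), which is why a selection step is needed once questions contain several key-phrases.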

Journal ArticleDOI
Ling-Yu Duan, Jie Chen, Rongrong Ji, Tiejun Huang, Wen Gao
TL;DR: This article introduces the work on low bit rate mobile landmark search, in which a compact yet discriminative landmark image descriptor is extracted by using location context such as GPS, crowd-sourced hotspot WLAN, and cell tower locations.
Abstract: With the ever-growing computational power of mobile devices, mobile visual search has undergone an evolution in techniques and applications. A significant trend is low bit rate visual search, where compact visual descriptors are extracted directly on a mobile device and delivered as queries rather than raw images to reduce the query transmission latency. In this article, we introduce our work on low bit rate mobile landmark search, in which a compact yet discriminative landmark image descriptor is extracted by using location context such as GPS, crowd-sourced hotspot WLAN, and cell tower locations. The compactness originates from the bag-of-words image representation, with offline learning from geotagged photos from online photo sharing websites including Flickr and Panoramio. The learning process involves segmenting the landmark photo collection into discrete geographical regions using a Gaussian mixture model, and then boosting a ranking-sensitive vocabulary within each region, with an “entropy” based descriptor compactness feedback to refine both phases iteratively. In online search, when entering a geographical region, the codebook in a mobile device is adapted downstream to generate extremely compact descriptors with promising discriminative ability. We have deployed landmark search apps to both HTC and iPhone mobile phones, working over a database of million-scale images in typical areas such as Beijing, New York, Barcelona, and others. Our descriptor outperforms alternative compact descriptors (Chen et al. 2009; Chen et al., 2010; Chandrasekhar et al. 2009a; Chandrasekhar et al. 2009b) by significant margins. Beyond landmark search, this article summarizes the MPEG standardization progress of compact descriptors for visual search (CDVS) (Yuri et al. 2010; Yuri et al. 2011) towards application interoperability.

Journal ArticleDOI
TL;DR: A novel principle for modeling the visual attention mechanism, named short-term environmental adaption, is proposed, which adaptively extracts sparse features and treats saliency as the features' conditional self-information, making it more accurate in saliency measurement and more sparse with respect to visual signal representation.

Proceedings ArticleDOI
Li-Chuan Geng, Shaozi Li, Songzhi Su, Donglin Cao, Yun-Qi Lei, Rongrong Ji
01 Nov 2013
TL;DR: This paper proposes an artificial immune system (AIS) based method that converges quickly to the globally optimal solution and demonstrates the performance of the proposed method with synthetic and real data.
Abstract: A large number of computer vision applications rely on camera calibration. Camera self-calibration, which depends only on the relationship between corresponding points in a pair of images, draws much attention for its simplicity. Almost all camera self-calibration methods rely on solving the Kruppa equations, which are difficult to solve directly. State-of-the-art self-calibration algorithms usually convert the solution of these equations into a non-linear optimization problem, but traditional optimization methods tend to converge to local extrema. An artificial immune system (AIS), in contrast, is able to converge quickly to a global extremum. To address this problem, we propose an AIS-based method that converges quickly to the globally optimal solution. We demonstrate the performance of the proposed method with synthetic and real data.

Proceedings ArticleDOI
21 Oct 2013
TL;DR: This work presents a wireless 2D and 3D switchable video communication system, named Stereotime, to handle the challenges of mobile 3D video communication, and shows its functionalities and compatibilities on 3D mobile devices in a WiFi network environment.
Abstract: Mobile 3D video communication, especially with 2D and 3D compatibility, is a new paradigm for both video communication and 3D video processing. Current techniques face challenges on mobile devices when bundled constraints such as computation resources and compatibility have to be considered. In this work, we present a wireless 2D and 3D switchable video communication system to handle these challenges, and name it Stereotime. The methods of Zig-Zag fast object segmentation, depth cue detection and merging, and texture-adaptive view generation are used for 3D scene reconstruction. We show the functionalities and compatibilities on 3D mobile devices in a WiFi network environment.

Journal ArticleDOI
TL;DR: A Bidirectional-Isomorphic Manifold Learning strategy is presented to optimize both the visual feature space and the textual space, in order to achieve a more accurate comprehension of image semantics and relationships; promising results show that the model attains a significant improvement over state-of-the-art algorithms.
Abstract: Using relevant textual information to improve visual content understanding and representation is an effective way to deeply understand web image content. However, the description of images is usually imprecise at the semantic level, which is caused by noisy and redundant information in both the textual (such as surrounding text in HTML pages) and visual (such as intra-class diversity) aspects. This paper approaches the problem through association analysis of image content and presents a Bidirectional-Isomorphic Manifold Learning strategy to optimize both the visual feature space and the textual space, in order to achieve a more accurate comprehension of image semantics and relationships. To achieve this optimization between two different models, Bidirectional-Isomorphic Manifold Learning utilizes a novel algorithm to unify adjustments in both models into one topological structure, which is called the reversed manifold mapping. We also demonstrate its correctness and convergence from a mathematical perspective. The method is applied to image annotation and keyword correlation analysis. Two groups of experiments are conducted: the first is carried out on the Corel 5000 image database to validate our method's effectiveness by comparing with the state-of-the-art Generalized Manifold Ranking Based Image Retrieval and SVM, while the second is carried out on a web-downloaded Flickr dataset with over 6,000 images to verify the proposed method's effectiveness in real-world applications. The promising results show that our model attains a significant improvement over state-of-the-art algorithms.

Proceedings ArticleDOI
21 Oct 2013
TL;DR: A query-dependent image reranking approach by leveraging the higher level attribute detection among the top returned images to adapt the dictionary built over the visual features to a query-specific fashion is proposed.
Abstract: Although text-based image search engines are popular for ranking images of users' interest, the state-of-the-art ranking performance is still far from satisfactory. One major issue comes from the visual similarity metric used in the ranking operation, which depends solely on visual features. To tackle this issue, one feasible method is to incorporate semantic concepts, also known as image attributes, into image ranking. However, the optimal combination of visual features and image attributes remains unknown. In this paper, we propose a query-dependent image reranking approach by leveraging the higher level attribute detection among the top returned images to adapt the dictionary built over the visual features in a query-specific fashion. We start by learning offline the transposition probabilities between visual codewords and attributes, then utilize the probabilities to adapt the dictionary online, and finally produce a query-dependent and semantics-induced metric for image ranking. Extensive evaluations on several benchmark image datasets demonstrate the effectiveness and efficiency of the proposed approach in comparison with state-of-the-art methods.

Proceedings ArticleDOI
Jie Chen, Ling-Yu Duan, Jie Lin, Rongrong Ji, Tiejun Huang, Wen Gao
26 May 2013
TL;DR: This paper proposes to combine feature transform and multi-stage vector quantization to implement the interoperability of compact local descriptors, and reports superior performance over state-of-the-art methods on a set of benchmark datasets.
Abstract: There are a number of component technologies that are useful for visual search, including the format of visual descriptors, the descriptor extraction process, as well as indexing and matching algorithms. As a minimum, the format of descriptors as well as parts of their extraction process should be defined to ensure interoperability. In this paper, we study the problem of interoperability among compressed local descriptors at different bit rates; that is, allowing effective and efficient comparison of compact descriptors, which is fundamentally important to mobile visual search applications. We propose to combine feature transform and multi-stage vector quantization to implement the interoperability of compact local descriptors. First, an orthogonal transform (e.g. principal component analysis, PCA) is employed to eliminate the correlation between local feature dimensions, which improves the performance of compressed domain descriptor matching with the well-aligned distance computing of sorted important features in transform space. Second, a multi-stage vector quantization (MSVQ) is applied to generate compact codes for local descriptors. At light quantization tables, MSVQ takes advantage of the transform domain features to properly allocate different budgets to each group of transformed feature dimensions. The interoperability between compressed descriptors at different bit rates can be achieved by the descriptors' fast matching in the orthogonal feature space. In other words, descriptor decoding into the original feature space (SIFT space) is unnecessary, as the distance can be calculated by pre-computed lookup tables. In particular, such efficient matching in the transform domain is significant for large-scale visual search. Over a set of benchmark datasets, we report superior performance over state-of-the-art methods.
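
The multi-stage quantization idea in the abstract above can be illustrated with a generic residual quantizer: each stage encodes what the previous stages left unexplained. This is a minimal sketch, assuming hand-picked toy codebooks. The PCA transform, the per-group bit allocation, and the lookup-table distance computation from the paper are not reproduced, and all function names are hypothetical.

```python
def nearest(codebook, vector):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(codeword):
        return sum((c - v) ** 2 for c, v in zip(codeword, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def msvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage codes the remaining residual."""
    codes, residual = [], list(vector)
    for codebook in codebooks:
        index = nearest(codebook, residual)
        codes.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return codes


def msvq_decode(codes, codebooks):
    """Sum the selected codewords; fewer codes give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Toy codebooks (hypothetical values): a coarse stage and a refinement stage.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]],
]
codes = msvq_encode([4.9, 4.2], codebooks)
coarse = msvq_decode(codes[:1], codebooks)  # first stage only (lower bit rate)
fine = msvq_decode(codes, codebooks)        # both stages (higher bit rate)
```

Because the stages are nested, a code truncated to its first stage is still decodable against the same codebooks. This is one plausible reading of how descriptors at different bit rates could remain comparable, and it is stated here as an assumption rather than as the paper's mechanism.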

Proceedings ArticleDOI
Bing Shuai, Songzhi Su, Shaozi Li, Yun Cheng, Rongrong Ji
01 Nov 2013
TL;DR: The experiment results demonstrated that the decomposition-based model worked very well at localizing deformable persons, which boosted the average precision by 10% compared to state-of-the-art person detectors, and Similar Pose Feature (SPF) provides the feasibility of projecting persons with similar poses into same clusters, facilitating a novel pose-based photo album browsing functionality.
Abstract: Recent years have seen tremendous progress in human detection, although usually only upright poses are considered. In this paper, we relax this constraint to localize highly deformable persons, as commonly exhibited in personal photo albums. Human localization under arbitrary poses is extremely challenging, because the large pose variance defeats traditional part-based template detectors. To tackle this issue, we propose a decomposition-based human localization model that works in three steps: a stable upper body is first detected, then a set of bigger bounding boxes is extended from it, from which the most appropriate instance is selected by a discriminative Whole Person Model. The experimental results demonstrate that our decomposition-based model works very well at localizing deformable persons, boosting the average precision by 10% compared to state-of-the-art person detectors. In addition, the Similar Pose Feature (SPF) makes it feasible to project persons with similar poses into the same clusters, facilitating a novel pose-based photo album browsing functionality.

Journal ArticleDOI
TL;DR: This work combines location-related side information from mobile devices to adaptively supervise the compact visual descriptor design in a flexible manner, which is well suited to searching locations or landmarks over a bandwidth-constrained wireless link.
Abstract: We propose to learn an extremely compact visual descriptor from mobile contexts for low bit rate mobile location search. Our scheme combines location-related side information from mobile devices to adaptively supervise the compact visual descriptor design in a flexible manner, which is well suited to searching locations or landmarks over a bandwidth-constrained wireless link. Along with the proposed compact descriptor learning, a large-scale, context-aware mobile visual search benchmark dataset, PKUBench, is also introduced, which serves as the first comprehensive benchmark for quantitatively evaluating how cheaply available mobile contexts can help mobile visual search systems. Our proposed contextual learning based compact descriptor is shown to outperform existing works in terms of compression rate and retrieval effectiveness.

Journal ArticleDOI
TL;DR: A weakly supervised codebook learning framework that integrates image labels to supervise codebook building in two steps: the Label Propagation step propagates image labels to local patches by multiple instance learning and instance selection, and the Graph Quantization step integrates patch labels to build the codebook using Mean Shift.

Proceedings Article
01 Jan 2013
TL;DR: By "transferring" large-scale web images with geographical tags to web videos through carefully designed associations between visual content similarities, this paper tackles the problem of geo-localization of web videos from a novel perspective.
Abstract: While geo-localization of web images has been widely studied, limited effort has been devoted to that of web videos. Nevertheless, an accurate location inference approach designed for web videos is of fundamental importance, as they occupy an increasing proportion of the web corpus. The key challenge comes from the lack of sufficient labels for model training. In this paper, we tackle this problem from a novel perspective, by "transferring" large-scale web images with geographical tags to web videos through carefully designed associations between visual content similarities. A group of experiments is conducted on a collected web image and video data set, where superior performance gains are reported over several alternatives.

Proceedings ArticleDOI
01 Nov 2013
TL;DR: This paper models the scene as a mid-level “hidden layer” to bridge action descriptors and action categories via a scene topic model, in which hybrid visual descriptors including spatiotemporal action features and scene descriptors are first extracted from the video sequence.
Abstract: Human actions do not occur in isolation; the surrounding scene offers hints about them. In this paper, we investigate the possibility of boosting action recognition performance by exploiting the associated scene context. To this end, we model the scene as a mid-level “hidden layer” to bridge action descriptors and action categories. This is achieved via a scene topic model, in which hybrid visual descriptors including spatiotemporal action features and scene descriptors are first extracted from the video sequence. Then, we learn a joint probability distribution between scene and action with a Naive Bayesian Nearest Neighbor algorithm, which is adopted to jointly infer the action categories online by combining off-the-shelf action recognition algorithms. We demonstrate the merits of our approach by comparing with state-of-the-art methods on several action recognition benchmarks.
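For readers unfamiliar with nearest-neighbor classification in the Naive Bayes style, the following is a minimal sketch of the generic image-to-class decision rule only. It is not the scene-action model of the paper: the descriptors are toy 2-D vectors, and the class names and the function `nbnn_classify` are hypothetical.

```python
# Sketch of a generic Naive Bayes Nearest Neighbor decision rule.
# Each query descriptor votes with its squared distance to the closest
# training descriptor of a class; the class with the lowest total wins.
# Descriptors, labels, and names are hypothetical toy values.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nbnn_classify(query_descriptors, class_descriptors):
    """Return the label whose descriptor pool is closest to the query set."""
    best_label, best_cost = None, float("inf")
    for label, pool in class_descriptors.items():
        # Sum of nearest-neighbor distances from the query to this class.
        cost = sum(min(sq_dist(q, d) for d in pool)
                   for q in query_descriptors)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

TRAIN = {
    "running": [[1.0, 0.0], [0.9, 0.2]],
    "swimming": [[0.0, 1.0], [0.1, 0.8]],
}

query = [[0.8, 0.1], [1.0, 0.1]]
label = nbnn_classify(query, TRAIN)  # "running" for this toy data
```

The rule needs no training phase beyond storing descriptors per class, which is one reason it combines easily with other classifiers.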

Proceedings ArticleDOI
01 Nov 2013
TL;DR: A clustering-based method is proposed to detect refined regions with comparable performance, and an adaptive algorithm called f-means is developed for coarse-grained classification with an unknown number of clusters.
Abstract: Saliency detection plays an important role in image segmentation, content-aware resizing, and object recognition. Most recent approaches achieve promising performance, which is useful for post-processing. We propose a clustering-based method to detect refined regions with comparable performance. For coarse-grained classification with an unknown number of clusters, an adaptive algorithm called f-means is developed in this paper. Pixels are clustered by f-means based on color and spatial features, and the centroids are then used to compute their saliency values. Experiments show that our algorithm generates finer maps, which outperform state-of-the-art approaches on the MSRA dataset. Relying on the saliency map, we also obtain superior results in foreground extraction, image resizing, and thumbnail generation.
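The pipeline of clustering pixels and scoring the centroids can be sketched as follows. Several assumptions are made here, since the abstract does not specify them: plain k-means with a fixed number of seeds stands in for the adaptive f-means algorithm, the image is grayscale, and the saliency of a cluster is taken to be its size-weighted intensity contrast against the other clusters. All function names are hypothetical.

```python
# Sketch of clustering-based saliency on a tiny grayscale image.
# Assumptions: plain k-means replaces the paper's f-means, and cluster
# saliency is a size-weighted intensity contrast (not the paper's formula).

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, seeds, iterations=10):
    """Plain k-means with explicit initial centroids."""
    centroids = [list(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

def cluster_saliency(image, seed_pixels, spatial_weight=0.1):
    """Return a saliency map normalized so the most salient cluster is 1.0."""
    h, w = len(image), len(image[0])
    # Feature per pixel: intensity plus down-weighted normalized position.
    points = [[image[y][x],
               spatial_weight * x / max(w - 1, 1),
               spatial_weight * y / max(h - 1, 1)]
              for y in range(h) for x in range(w)]
    seeds = [points[y * w + x] for x, y in seed_pixels]
    labels, centroids = kmeans(points, seeds)
    n = len(points)
    sizes = [labels.count(k) for k in range(len(centroids))]
    # Contrast of each centroid against the others, weighted by their size.
    raw = [sum(sizes[j] / n * abs(centroids[i][0] - centroids[j][0])
               for j in range(len(centroids)))
           for i in range(len(centroids))]
    peak = max(raw) or 1.0
    return [[raw[labels[y * w + x]] / peak for x in range(w)]
            for y in range(h)]

IMAGE = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

saliency = cluster_saliency(IMAGE, seed_pixels=[(0, 0), (1, 1)])
```

On this toy image the small bright patch receives the highest saliency because it contrasts with a much larger background cluster, while the background receives a lower value.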