
Showing papers in "IEEE Transactions on Multimedia in 2015"


Journal ArticleDOI
TL;DR: A new no-reference (NR) image quality assessment (IQA) metric is proposed using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features to predict an image that the HVS perceives from a distorted image based on the free energy theory.
Abstract: In this paper we propose a new no-reference (NR) image quality assessment (IQA) metric using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features. The features used can be divided into three groups. The first involves the features inspired by the free energy principle and the structural degradation model. Furthermore, the free energy theory also reveals that the HVS always tries to infer the meaningful part from the visual stimuli. In terms of this finding, we first predict an image that the HVS perceives from a distorted image based on the free energy theory, then the second group of features is composed of some HVS-inspired features (such as structural information and gradient magnitude) computed using the distorted and predicted images. The third group of features quantifies the possible losses of “naturalness” in the distorted image by fitting the generalized Gaussian distribution to mean subtracted contrast normalized coefficients. After feature extraction, our algorithm utilizes the support vector machine based regression module to derive the overall quality score. Experiments on LIVE, TID2008, CSIQ, IVC, and Toyama databases confirm the effectiveness of our introduced NR IQA metric compared to the state-of-the-art.
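
The naturalness features in the third group follow the common recipe of fitting a generalized Gaussian distribution (GGD) to mean-subtracted contrast-normalized (MSCN) coefficients and feeding the resulting statistics to a support vector regressor. The sketch below illustrates only that part under stated assumptions (moment-matching GGD estimation, random placeholder images and scores); it is not the authors' full feature set or pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.special import gamma
from sklearn.svm import SVR

def mscn_coefficients(img, sigma=7/6):
    """Mean-subtracted contrast-normalized (MSCN) coefficients of a grayscale image."""
    img = img.astype(np.float64)
    mu = gaussian_filter(img, sigma)
    var = gaussian_filter(img * img, sigma) - mu * mu
    return (img - mu) / (np.sqrt(np.maximum(var, 0.0)) + 1.0)

def fit_ggd(x):
    """Moment-matching estimate of the generalized Gaussian shape and scale parameters."""
    x = x.ravel()
    rho = np.mean(np.abs(x)) ** 2 / np.mean(x ** 2)          # observed moment ratio
    alphas = np.arange(0.2, 10.0, 0.001)
    ratios = gamma(2 / alphas) ** 2 / (gamma(1 / alphas) * gamma(3 / alphas))
    alpha = alphas[np.argmin((ratios - rho) ** 2)]           # best-matching shape
    beta = np.sqrt(np.mean(x ** 2) * gamma(1 / alpha) / gamma(3 / alpha))
    return alpha, beta

# Hypothetical training data: per-image GGD statistics and placeholder quality scores.
# In the paper, features would also include the free-energy and gradient-based groups.
features = np.vstack([fit_ggd(mscn_coefficients(np.random.rand(64, 64) * 255))
                      for _ in range(50)])
scores = np.random.rand(50)                                  # placeholder subjective scores
regressor = SVR(kernel='rbf').fit(features, scores)          # SVR maps features to quality
print(regressor.predict(features[:3]))
```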

548 citations


Journal ArticleDOI
TL;DR: The state of the art in automatically classifying audio scenes, and in automatically detecting and classifying audio events, is reported on.
Abstract: For intelligent systems to make best use of the audio modality, it is important that they can recognize not just speech and music, which have been researched as specific tasks, but also general sounds in everyday environments. To stimulate research in this field we conducted a public research challenge: the IEEE Audio and Acoustic Signal Processing Technical Committee challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). In this paper, we report on the state of the art in automatically classifying audio scenes, and automatically detecting and classifying audio events. We survey prior work as well as the state of the art represented by the submissions to the challenge from various research groups. We also provide detail on the organization of the challenge, so that our experience as challenge hosts may be useful to those organizing challenges in similar domains. We created new audio datasets and baseline systems for the challenge; these, as well as some submitted systems, are publicly available under open licenses, to serve as benchmarks for further research in general-purpose machine listening.
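
A minimal scene-classification baseline in the spirit of the challenge baselines can be sketched as frame-level MFCCs scored by one Gaussian mixture model per acoustic scene. librosa and scikit-learn are assumed, and the file lists are hypothetical, so this is an illustration of the idea rather than the actual DCASE baseline code.

```python
import numpy as np
import librosa                                   # assumed available for MFCC extraction
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=22050, n_mfcc=20):
    """Frame-level MFCCs for one recording; frames are treated as i.i.d. samples."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)

def train_scene_models(files_by_scene, n_components=8):
    """Fit one GMM per acoustic scene on the pooled MFCC frames of that scene."""
    models = {}
    for scene, paths in files_by_scene.items():
        frames = np.vstack([mfcc_features(p) for p in paths])
        models[scene] = GaussianMixture(n_components=n_components).fit(frames)
    return models

def classify(path, models):
    """Pick the scene whose GMM gives the highest average frame log-likelihood."""
    frames = mfcc_features(path)
    return max(models, key=lambda scene: models[scene].score(frames))

# Hypothetical file layout; replace with real scene/file lists.
# models = train_scene_models({"busystreet": ["bs1.wav"], "office": ["of1.wav"]})
# print(classify("test.wav", models))
```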

468 citations


Journal ArticleDOI
TL;DR: This paper describes systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description, and speech recognition.
Abstract: Whereas deep neural networks were first mostly used for classification tasks, they are rapidly expanding in the realm of structured output problems, where the observed target is composed of multiple random variables that have a rich joint distribution, given the input. In this paper we focus on the case where the input also has a rich structure and the input and output structures are somehow related. We describe systems that learn to attend to different places in the input, for each element of the output, for a variety of tasks: machine translation, image caption generation, video clip description, and speech recognition. All these systems are based on a shared set of building blocks: gated recurrent neural networks and convolutional neural networks, along with trained attention mechanisms. We report on experimental results with these systems, showing impressively good performance and the advantage of the attention mechanism.
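
One common form of the trained attention mechanism described here is additive (Bahdanau-style) soft attention: score every input position against the current decoder state, normalize with a softmax, and take the expected input representation as the context. The NumPy sketch below shows that computation on toy dimensions; the exact parameterization differs across the systems surveyed.

```python
import numpy as np

def soft_attention(decoder_state, encoder_states, W_q, W_k, v):
    """Additive soft attention: score each encoder position, normalize with a
    softmax, and return the attention-weighted context vector."""
    # scores[i] = v . tanh(W_q @ h_dec + W_k @ h_enc_i)
    scores = np.tanh(encoder_states @ W_k.T + decoder_state @ W_q.T) @ v
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # attention distribution over positions
    context = weights @ encoder_states            # expected encoder state
    return context, weights

# Toy dimensions: 10 encoder positions, 16-d encoder states, 8-d decoder state, 12-d attention space.
rng = np.random.default_rng(0)
enc = rng.normal(size=(10, 16))
dec = rng.normal(size=(8,))
ctx, w = soft_attention(dec, enc, W_q=rng.normal(size=(12, 8)),
                        W_k=rng.normal(size=(12, 16)), v=rng.normal(size=(12,)))
print(w.round(3), ctx.shape)
```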

410 citations


Journal ArticleDOI
TL;DR: A comprehensive deep learning framework for multimodal face representation is proposed, composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE).
Abstract: Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by the SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, a 98.43% verification rate is achieved on the LFW database. Benefiting from the complementary information contained in multimodal data, our small ensemble system achieves a recognition rate higher than 99.0% on LFW using a publicly available training set.

321 citations


Journal ArticleDOI
TL;DR: Experimental results show that although state-of-the-art methods can achieve competitive performance compared to average human performance, majority votes of several humans can achieve much higher performance on this task and the gap between machine and human would imply possible directions for further improvement of cross-age face recognition in the future.
Abstract: This paper introduces a method for face recognition across age and also a dataset containing variations of age in the wild. We use a data-driven method to address the cross-age face recognition problem, called cross-age reference coding (CARC). By leveraging a large-scale image dataset freely available on the Internet as a reference set, CARC can encode the low-level feature of a face image with an age-invariant reference space. In the retrieval phase, our method only requires a linear projection to encode the feature and thus it is highly scalable. To evaluate our method, we introduce a large-scale dataset called cross-age celebrity dataset (CACD). The dataset contains more than 160,000 images of 2,000 celebrities with age ranging from 16 to 62. Experimental results show that our method can achieve state-of-the-art performance on both CACD and the other widely used dataset for face recognition across age. To understand the difficulties of face recognition across age, we further construct a verification subset from the CACD called CACD-VS and conduct human evaluation using Amazon Mechanical Turk. CACD-VS contains 2,000 positive pairs and 2,000 negative pairs and is carefully annotated by checking both the associated image and web contents. Our experiments show that although state-of-the-art methods can achieve competitive performance compared to average human performance, majority votes of several humans can achieve much higher performance on this task. The gap between machine and human would imply possible directions for further improvement of cross-age face recognition in the future.

273 citations


Journal ArticleDOI
TL;DR: A novel distance metric, superpixel earth mover's distance (SP-EMD), is proposed to measure the dissimilarity between hand gestures; it is not only robust to distortion and articulation, but also invariant to scaling, translation, and rotation with proper preprocessing.
Abstract: This paper presents a new superpixel-based hand gesture recognition system based on a novel superpixel earth mover's distance metric, together with a Kinect depth camera. The depth and skeleton information from Kinect are effectively utilized to produce markerless hand extraction. The hand shapes, corresponding textures, and depths are represented in the form of superpixels, which effectively retain the overall shapes and colors of the gestures to be recognized. Based on this representation, a novel distance metric, superpixel earth mover's distance (SP-EMD), is proposed to measure the dissimilarity between hand gestures. This measurement is not only robust to distortion and articulation, but also invariant to scaling, translation, and rotation with proper preprocessing. The effectiveness of the proposed distance metric and recognition algorithm is illustrated by extensive experiments with our own gesture dataset as well as two other public datasets. Simulation results show that the proposed system is able to achieve high mean accuracy and fast recognition speed. Its superiority is further demonstrated by comparisons with other conventional techniques and two real-life applications.

271 citations


Journal ArticleDOI
TL;DR: In this article, a joint multi-task learning algorithm is proposed to better predict attributes in images using deep convolutional neural networks (CNN), where each CNN will predict one binary attribute.
Abstract: This paper proposes a joint multi-task learning algorithm to better predict attributes in images using deep convolutional neural networks (CNN). We consider learning binary semantic attributes through a multi-task CNN model, where each CNN will predict one binary attribute. The multi-task learning allows CNN models to simultaneously share visual knowledge among different attribute categories. Each CNN will generate attribute-specific feature representations, and then we apply multi-task learning on the features to predict their attributes. In our multi-task framework, we propose a method to decompose the overall model’s parameters into a latent task matrix and combination matrix. Furthermore, under-sampled classifiers can leverage shared statistics from other classifiers to improve their performance. Natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge. Meanwhile, attributes in different groups will generally compete with each other, and consequently share less knowledge. We show the effectiveness of our method on two popular attribute datasets.
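
The core of the decomposition is to write the stacked attribute classifiers W as a product of a latent task matrix L and a combination matrix S, so under-sampled attributes can borrow statistics through the shared latent tasks. A toy alternating least-squares sketch of that factorization is given below; the paper's actual objective adds multi-task regularization and attribute grouping that are not shown.

```python
import numpy as np

def decompose_tasks(W, n_latent, n_iter=200, lam=1e-3):
    """Approximate the per-attribute weight matrix W (d x T) as L @ S, where
    L (d x k) holds shared latent tasks and S (k x T) holds per-attribute
    combination coefficients. Alternating ridge-regularized least squares."""
    d, T = W.shape
    rng = np.random.default_rng(0)
    L = rng.normal(scale=0.1, size=(d, n_latent))
    for _ in range(n_iter):
        S = np.linalg.solve(L.T @ L + lam * np.eye(n_latent), L.T @ W)
        L = np.linalg.solve(S @ S.T + lam * np.eye(n_latent), S @ W.T).T
    return L, S

# Toy example: 64-d features, 12 attribute classifiers sharing 4 latent tasks.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 12))
L, S = decompose_tasks(W_true, n_latent=4)
print(np.linalg.norm(W_true - L @ S) / np.linalg.norm(W_true))   # small relative residual
```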

255 citations


Journal ArticleDOI
TL;DR: The results show that the proposed algorithm is more robust and achieves the best performance, which outperforms the second best algorithm by about 5% on both the Pascal VOC2007 and NUS-WIDE databases.
Abstract: Cross-modal feature matching has gained much attention in recent years and has many practical applications, such as text-to-image retrieval. The most difficult problem in cross-modal matching is how to eliminate the heterogeneity between modalities. Existing methods (e.g., CCA and PLS) try to learn a common latent subspace, where the heterogeneity between two modalities is minimized so that cross-matching is possible. However, most of these methods require fully paired samples and suffer difficulties when dealing with unpaired data. Besides, utilizing class label information has been found to be a good way to reduce the semantic gap between low-level image features and high-level document descriptions. Considering this, we propose a novel and effective supervised algorithm, which can also deal with unpaired data. In the proposed formulation, the basis matrices of different modalities are jointly learned based on the training samples. Moreover, a local group-based prior is proposed in the formulation to make better use of popular block-based features (e.g., HOG and GIST). Extensive experiments are conducted on four public databases: Pascal VOC2007, LabelMe, Wikipedia, and NUS-WIDE. We also evaluated the proposed algorithm with unpaired data. Compared with existing state-of-the-art algorithms, the results show that the proposed algorithm is more robust and achieves the best performance, outperforming the second best algorithm by about 5% on both the Pascal VOC2007 and NUS-WIDE databases.
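
To make the "common latent subspace" idea concrete, the sketch below uses the CCA baseline mentioned above (via scikit-learn) to project paired image and text features into a shared space and rank one modality against a query from the other. It illustrates the baseline being compared against, not the proposed supervised unpaired method, and the data are synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy paired data: 200 samples with 128-d image features and 50-d text features
# generated from shared latent factors, so a common subspace actually exists.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 10))
X_img = shared @ rng.normal(size=(10, 128)) + 0.1 * rng.normal(size=(200, 128))
X_txt = shared @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(200, 50))

cca = CCA(n_components=10)
Z_img, Z_txt = cca.fit_transform(X_img, X_txt)     # project both modalities into a common subspace

def cosine(a, B):
    """Cosine similarity of a query vector against all rows of B."""
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)

# Cross-modal retrieval: rank text samples by similarity to an image query.
query = Z_img[0]
print(np.argsort(-cosine(query, Z_txt))[:5])       # indices of the closest texts
```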

232 citations


Journal ArticleDOI
TL;DR: The creation of a new corpus designed for continuous audio-visual speech recognition research, TCD-TIMIT, which consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences is detailed.
Abstract: Automatic audio-visual speech recognition currently lags behind its audio-only counterpart in terms of major progress. One of the reasons commonly cited by researchers is the scarcity of suitable research corpora. This paper details the creation of a new corpus designed for continuous audio-visual speech recognition research. TCD-TIMIT consists of high-quality audio and video footage of 62 speakers reading a total of 6913 phonetically rich sentences. Three of the speakers are professionally trained lipspeakers, recorded to test the hypothesis that lipspeakers may have an advantage over regular speakers in automatic visual speech recognition systems. Video footage was recorded from two angles: straight on, and at $30^{\circ}$. The paper outlines the recording of the footage and the post-processing required to yield video and audio clips for each sentence. Audio, visual, and joint audio-visual baseline experiments are reported. Separate experiments were run on the lipspeaker and non-lipspeaker data, and the results compared. Visual and audio-visual baseline results on the non-lipspeakers were low overall, while results on the lipspeakers were found to be significantly higher. It is hoped that, as a publicly available database, TCD-TIMIT will help to further the state of the art in audio-visual speech recognition research.

226 citations


Journal ArticleDOI
TL;DR: An author topic model-based collaborative filtering (ATCF) method is proposed to facilitate comprehensive points of interest (POIs) recommendations for social users and advantages and superior performance of this approach are demonstrated by extensive experiments on a large collection of data.
Abstract: Social media has created a continuous need for automatic travel recommendation. Collaborative filtering (CF) is the most well-known approach, but existing approaches generally suffer from various weaknesses. For example, sparsity can significantly degrade the performance of traditional CF: if a user only visits very few locations, accurately identifying similar users becomes very challenging due to the lack of sufficient information for effective inference. Moreover, existing recommendation approaches often ignore rich user information, such as the textual descriptions of photos, which can reflect users’ travel preferences. The topic model (TM) method is an effective way to address the “sparsity problem,” but is still far from satisfactory. In this paper, an author topic model-based collaborative filtering (ATCF) method is proposed to facilitate comprehensive point of interest (POI) recommendation for social users. In our approach, user preference topics, such as cultural, cityscape, or landmark, are extracted from the geo-tag-constrained textual descriptions of photos via the author topic model, instead of only from the geo-tags (GPS locations). The advantages and superior performance of our approach are demonstrated by extensive experiments on a large collection of data.

215 citations


Journal ArticleDOI
TL;DR: A novel deep neural network approach is adopted to allow unified feature learning and classifier training to estimate image aesthetics, and a double-column deep convolutional neural network is developed to support heterogeneous inputs.
Abstract: This paper investigates unified feature learning and classifier training approaches for image aesthetics assessment. Existing methods built upon handcrafted or generic image features and developed machine learning and statistical modeling techniques utilizing training examples. We adopt a novel deep neural network approach to allow unified feature learning and classifier training to estimate image aesthetics. In particular, we develop a double-column deep convolutional neural network to support heterogeneous inputs, i.e., global and local views, in order to capture both global and local characteristics of images. In addition, we employ the style and semantic attributes of images to further boost the aesthetics categorization performance. Experimental results show that our approach produces significantly better results than the earlier reported results on the AVA dataset for both the generic image aesthetics and content-based image aesthetics. Moreover, we introduce a 1.5-million image dataset (IAD) for image aesthetics assessment and we further boost the performance on the AVA test set by training the proposed deep neural networks on the IAD dataset.

Journal ArticleDOI
TL;DR: A new distance metric learning algorithm, namely weakly-supervised deep metric learning (WDML), under the deep learning framework is proposed, which utilizes a progressive learning manner to discover knowledge by jointly exploiting the heterogeneous data structures from visual contents and user-provided tags of social images.
Abstract: Recent years have witnessed the explosive growth of community-contributed images with rich context information, which is beneficial to the task of image retrieval. It can help us to learn a suitable metric to alleviate the semantic gap. In this paper, we propose a new distance metric learning algorithm, namely weakly-supervised deep metric learning (WDML), under the deep learning framework. It utilizes a progressive learning manner to discover knowledge by jointly exploiting the heterogeneous data structures from visual contents and user-provided tags of social images. The semantic structure in the textual space is expected to be well preserved while the problem of the noisy, incomplete or subjective tags is addressed by leveraging the visual structure in the original visual space. Besides, a sparse model with the ${\ell _{2,1}}$ mixed norm is imposed on the transformation matrix of the first layer in the deep architecture to compress the noisy or redundant visual features. The proposed problem is formulated as an optimization problem with a well-defined objective function and a simple yet efficient iterative algorithm is proposed to solve it. Extensive experiments on real-world social image datasets are conducted to verify the effectiveness of the proposed method for image retrieval. Encouraging experimental results are achieved compared with several representative metric learning methods.
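
The ${\ell _{2,1}}$ mixed norm mentioned above sums the $\ell_2$ norms of the rows of a matrix, so penalizing it zeroes out entire rows, i.e., entire noisy or redundant visual features. The small sketch below shows the norm and its row-wise proximal (shrinkage) operator, a standard building block for optimizing such objectives; it is not the paper's full WDML solver.

```python
import numpy as np

def l21_norm(W):
    """l2,1 mixed norm: the sum of the l2 norms of the rows of W.
    Penalizing it drives whole rows (whole input features) to zero."""
    return np.sum(np.linalg.norm(W, axis=1))

def l21_prox(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: row-wise soft shrinkage."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 4.0], [0.1, 0.1], [0.0, 2.0]])
print(l21_norm(W))            # 5 + ~0.141 + 2
print(l21_prox(W, tau=0.5))   # the tiny second row is zeroed out entirely
```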

Journal ArticleDOI
TL;DR: A convolutional neural network (CNN)-based model for human head pose estimation in low-resolution multi-modal RGB-D data is presented, and it is shown that higher-level scene understanding tasks, such as human-human/scene interaction detection, can be achieved.
Abstract: In this paper we present a convolutional neural network (CNN)-based model for human head pose estimation in low-resolution multi-modal RGB-D data. We pose the problem as one of classification of human gazing direction. We further fine-tune a regressor based on the learned deep classifier. Next, we combine the two models (classification and regression) to estimate approximate regression confidence. We present state-of-the-art results on datasets that span the range from high-resolution human-robot interaction data (close-up faces plus depth information) to challenging low-resolution outdoor surveillance data. We build upon our robust head-pose estimation and further introduce a new visual attention model to recover interaction with the environment. Using this probabilistic model, we show that higher-level scene understanding tasks, such as human-human/scene interaction detection, can be achieved. Our solution runs in real time on commercial hardware.

Journal ArticleDOI
TL;DR: Experimental results on two widely used RGB-D object datasets show that the proposed general CNN-based multi-modal learning framework achieves comparable performance to state-of-the-art methods specifically designed forRGB-D data.
Abstract: Most existing feature learning-based methods for RGB-D object recognition either combine RGB and depth data in an undifferentiated manner from the outset, or learn features from color and depth separately, which do not adequately exploit different characteristics of the two modalities or utilize the shared relationship between the modalities. In this paper, we propose a general CNN-based multi-modal learning framework for RGB-D object recognition. We first construct deep CNN layers for color and depth separately, which are then connected with a carefully designed multi-modal layer. This layer is designed to not only discover the most discriminative features for each modality, but is also able to harness the complementary relationship between the two modalities. The results of the multi-modal layer are back-propagated to update parameters of the CNN layers, and the multi-modal feature learning and the back-propagation are iteratively performed until convergence. Experimental results on two widely used RGB-D object datasets show that our method for general multi-modal learning achieves comparable performance to state-of-the-art methods specifically designed for RGB-D data.

Journal ArticleDOI
TL;DR: A novel cross-domain feature learning (CDFL) algorithm based on stacked denoising auto-encoders that can maximize the correlations among different modalities and extract domain invariant semantic features simultaneously.
Abstract: In the Web 2.0 era, a huge amount of media data, such as text, image/video, and social interaction information, has been generated on social media sites (e.g., Facebook, Google, Flickr, and YouTube). These media data can be effectively adopted for many applications (e.g., image/video annotation, image/video retrieval, and event classification) in multimedia. However, it is difficult to design an effective feature representation to describe these data because they have multi-modal properties (e.g., text, image, video, and audio) and multi-domain properties (e.g., Flickr, Google, and YouTube). To deal with these issues, we propose a novel cross-domain feature learning (CDFL) algorithm based on stacked denoising auto-encoders. By introducing the modal correlation constraint and the cross-domain constraint into the conventional auto-encoder, our CDFL can maximize the correlations among different modalities and extract domain-invariant semantic features simultaneously. To evaluate our CDFL algorithm, we apply it to three important applications: sentiment classification, spam filtering, and event classification. Comprehensive evaluations demonstrate the encouraging performance of the proposed approach.

Journal ArticleDOI
TL;DR: A joint density model over the space of multimodal inputs, including visual, auditory, and textual modalities, is developed and trained directly using UGC data without any labeling efforts, leading to performance improvement in both emotion classification and cross-modal retrieval.
Abstract: Social media has been a convenient platform for voicing opinions through posting messages, ranging from tweeting a short text to uploading a media file, or any combination of messages. Understanding the perceived emotions inherently underlying these user-generated contents (UGC) could bring light to emerging applications such as advertising and media analytics. Existing research efforts on affective computation are mostly dedicated to single media, either text captions or visual content. Few attempts at combined analysis of multiple media have been made, even though emotion can be viewed as an expression of multimodal experience. In this paper, we explore the learning of highly non-linear relationships that exist among low-level features across different modalities for emotion prediction. Using the deep Boltzmann machine (DBM), a joint density model over the space of multimodal inputs, including visual, auditory, and textual modalities, is developed. The model is trained directly using UGC data without any labeling efforts. While the model learns a joint representation over multimodal inputs, training samples in the absence of certain modalities can also be leveraged. More importantly, the joint representation enables emotion-oriented cross-modal retrieval, for example, retrieval of videos using the text query “crazy cat”. The model does not restrict the types of input and output, and hence, in principle, emotion prediction and retrieval on any combinations of media are feasible. Extensive experiments on web videos and images show that the learnt joint representation could be very compact and be complementary to hand-crafted features, leading to performance improvement in both emotion classification and cross-modal retrieval.

Journal ArticleDOI
TL;DR: An adversary of deep learning systems applied to music content analysis is constructed and deployed; because the system inputs are magnitude spectral frames, special care is required to produce valid input audio signals from network-derived perturbations.
Abstract: An adversary is an agent designed to make a classification system perform in some particular way, e.g., increase the probability of a false negative. Recent work builds adversaries for deep learning systems applied to image object recognition, exploiting the parameters of the system to find the minimal perturbation of the input image such that the system misclassifies it with high confidence. We adapt this approach to construct and deploy an adversary of deep learning systems applied to music content analysis. In our case, however, the system inputs are magnitude spectral frames, which require special care in order to produce valid input audio signals from network-derived perturbations. For two different train-test partitionings of two benchmark datasets, and two different architectures, we find that this adversary is very effective. We find that convolutional architectures are more robust compared to systems based on a majority vote over individually classified audio frames. Furthermore, we experiment with a new system that integrates an adversary into the training loop, but do not find that this improves the resilience of the system to new adversaries.
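
A minimal sketch of the perturbation idea, under heavy simplification: a fast-gradient-sign step against a toy logistic classifier operating on a single magnitude-spectral frame, with the result clipped to stay non-negative so it could still correspond to a valid magnitude spectrum. The paper's adversary targets deep networks and must additionally invert frames back to audio, which is not shown; all names and dimensions here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, w, b, y_true, eps=0.01):
    """Fast-gradient-sign perturbation of one magnitude-spectral frame x
    against a logistic classifier p(y=1|x) = sigmoid(w.x + b): move x in the
    direction that increases the cross-entropy loss for the true label."""
    p = sigmoid(w @ x + b)
    grad = (p - y_true) * w                       # d(cross-entropy)/dx
    x_adv = x + eps * np.sign(grad)
    return np.maximum(x_adv, 0.0)                 # keep magnitudes non-negative

# Toy setup: 513-bin magnitude frame, randomly initialized linear classifier.
rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=513))                  # stand-in for |STFT| of one frame
w, b = rng.normal(size=513), 0.0
x_adv = fgsm_perturb(x, w, b, y_true=1.0, eps=0.05)
print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b)) # confidence before vs. after the attack
```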

Journal ArticleDOI
TL;DR: A new framework is proposed to track on-road pedestrians across multiple driving recorders, built upon the results of tracking under a single driving recorder, which determines whether a specific pedestrian belongs to one or several cameras' fields of view by considering the association likelihood of the tracked pedestrians.
Abstract: In this paper, we propose a new framework to track on-road pedestrians across multiple driving recorders. The framework is built upon the results of tracking under a single driving recorder. More specifically, we treat the problem as a multi-label classification task and determine whether a specific pedestrian belongs to one or several cameras’ fields of view by considering the association likelihood of the tracked pedestrians. The likelihood is calculated based on the pedestrians’ motion cues and appearance features, which are necessarily transformed via brightness transfer functions obtained from some available spatially overlapping views to compensate for the diversity of the cameras. When a pedestrian is leaving a camera’s field of view, the proposed framework predicts and interpolates its possible moving trajectories, facilitated by an open map service which can provide routing information. Experimental results show the robustness and effectiveness of the proposed framework in tracking pedestrians across several recorded driving videos. Moreover, based on the GPS locations, we can also reconstruct a 3-D visualization in a 3-D virtual real-world environment, so as to show the dynamic scenes of the recorded videos.
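
One common way to obtain a brightness transfer function from a spatially overlapping view is cumulative-histogram matching between the two cameras' pixel intensities. The sketch below, on synthetic pixels, shows that estimation and how a pedestrian appearance patch could be mapped before computing similarity; the paper's exact BTF estimation may differ.

```python
import numpy as np

def brightness_transfer_function(src_pixels, dst_pixels, levels=256):
    """Estimate a brightness transfer function (BTF) between two cameras from
    pixels of a spatially overlapping view, by matching cumulative histograms."""
    src_hist, _ = np.histogram(src_pixels, bins=levels, range=(0, levels))
    dst_hist, _ = np.histogram(dst_pixels, bins=levels, range=(0, levels))
    src_cdf = np.cumsum(src_hist) / max(src_hist.sum(), 1)
    dst_cdf = np.cumsum(dst_hist) / max(dst_hist.sum(), 1)
    # For each source level, find the destination level with the closest CDF value.
    return np.array([np.searchsorted(dst_cdf, c) for c in src_cdf]).clip(0, levels - 1)

def apply_btf(appearance, btf):
    """Map an appearance patch from the source camera into the destination
    camera's brightness space before computing appearance similarity."""
    return btf[appearance.astype(np.uint8)]

# Toy overlapping-view pixels: camera B is a darker, lower-contrast version of A.
rng = np.random.default_rng(0)
cam_a = rng.integers(0, 256, size=10000)
cam_b = (cam_a * 0.7 + 20).astype(int)
btf = brightness_transfer_function(cam_a, cam_b)
patch = rng.integers(0, 256, size=(8, 8))
print(btf[[0, 128, 255]], apply_btf(patch, btf).shape)   # roughly 20, 110, 198 here
```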

Journal ArticleDOI
Shunli Zhang1, Xin Yu1, Yao Sui1, Sicong Zhao1, Li Zhang1 
TL;DR: This paper proposes a novel tracking method via multi-view learning framework by using multiple support vector machines (SVM) and presents a novel collaborative strategy with entropy criterion, which is acquired by the confidence distribution of the candidate samples.
Abstract: How to build an accurate and reliable appearance model to improve performance is a crucial problem in object tracking. Since multi-view learning can lead to a more accurate and robust representation of the object, in this paper we propose a novel tracking method via a multi-view learning framework using multiple support vector machines (SVM). The multi-view SVM tracking method is constructed based on multiple views of features and a novel combination strategy. To realize a comprehensive representation, we select three different types of features, i.e., gray-scale value, histogram of oriented gradients (HOG), and local binary pattern (LBP), to train the corresponding SVMs. These features represent the object from the perspectives of description, detection, and recognition, respectively. To combine the SVMs under the multi-view learning framework, we present a novel collaborative strategy with an entropy criterion, which is acquired from the confidence distribution of the candidate samples. In addition, to learn the changes of the object and the scenario, we propose a novel update scheme based on a subspace evolution strategy. The new scheme can control the model update adaptively and helps to address occlusion problems. We evaluate our approach on several public video sequences, and the experimental results demonstrate that our method is robust and accurate, and can achieve state-of-the-art tracking performance.
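
A simplified sketch of the entropy-based collaborative idea: each view's SVM produces scores over the candidate samples, the peakedness (low entropy) of that confidence distribution determines the view's weight, and the weighted scores are fused to pick the best candidate. The weighting rule below is illustrative only; the paper's exact criterion and update scheme are more elaborate.

```python
import numpy as np

def entropy_weights(view_scores):
    """Weight each view (e.g., gray, HOG, LBP SVMs) by how peaked its confidence
    distribution over the candidate samples is: low entropy -> high weight."""
    weights = []
    for scores in view_scores:                       # scores over all candidates, one view
        p = np.exp(scores - scores.max())
        p /= p.sum()                                 # confidence distribution
        h = -np.sum(p * np.log(p + 1e-12))           # entropy of that distribution
        weights.append(1.0 / (h + 1e-6))
    w = np.array(weights)
    return w / w.sum()

def fuse(view_scores):
    """Entropy-weighted fusion of per-view candidate scores; returns best candidate."""
    w = entropy_weights(view_scores)
    fused = sum(wi * si for wi, si in zip(w, view_scores))
    return int(np.argmax(fused)), w

# Toy example: 3 views scoring 5 candidates; view 0 is confident, view 2 is ambiguous.
scores = [np.array([0.1, 0.2, 2.5, 0.3, 0.1]),
          np.array([0.5, 0.6, 1.2, 0.4, 0.5]),
          np.array([0.9, 1.0, 1.0, 0.9, 1.0])]
best, w = fuse(scores)
print(best, w.round(3))       # candidate 2 wins; the confident view dominates the weights
```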

Journal ArticleDOI
TL;DR: A boosting-based strong classifier for robust visual tracking using a discriminative appearance model and a structural reconstruction error based weight computation method are proposed to adjust the classification score of each candidate for more precise tracking results.
Abstract: Sparse coding methods have achieved great success in visual tracking, and we present a strong classifier and structural local sparse descriptors for robust visual tracking. Since the summary features considering the sparse codes are sensitive to occlusion and other interfering factors, we extract local sparse descriptors from a fraction of all patches by performing a pooling operation. The collection of local sparse descriptors is combined into a boosting-based strong classifier for robust visual tracking using a discriminative appearance model. Furthermore, a structural reconstruction error-based weight computation method is proposed to adjust the classification score of each candidate for more precise tracking results. To handle appearance changes during tracking, we present an occlusion-aware template update scheme. Comprehensive experimental comparisons with state-of-the-art algorithms demonstrate the superior performance of the proposed method.

Journal ArticleDOI
TL;DR: An informative set of features for the analysis of face dynamics, and a completely automatic system to distinguish between genuine and posed enjoyment smiles are proposed, which improves the state of the art in smile classification and provides useful insights in smile psychophysics.
Abstract: Automatic distinction between genuine (spontaneous) and posed expressions is important for visual analysis of social signals. In this paper, we describe an informative set of features for the analysis of face dynamics, and propose a completely automatic system to distinguish between genuine and posed enjoyment smiles. Our system incorporates facial landmarking and tracking, through which features are extracted to describe the dynamics of eyelid, cheek, and lip corner movements. By fusing features over different regions, as well as over different temporal phases of a smile, we obtain a very accurate smile classifier. We systematically investigate age and gender effects, and establish that age-specific classification significantly improves the results, even when the age is automatically estimated. We evaluate our system on the 400-subject UvA-NEMO database we have recently collected, as well as on three other smile databases from the literature. Through an extensive experimental evaluation, we show that our system improves the state of the art in smile classification and provides useful insights into smile psychophysics.

Journal ArticleDOI
TL;DR: A robust coupled dictionary learning method with locality coordinate constraints is introduced to reconstruct the corresponding high resolution depth map and incorporates an adaptively regularized shock filter to simultaneously reduce the jagged noise and sharpen the edges.
Abstract: This paper describes a new algorithm for depth image super resolution and denoising using a single depth image as input. A robust coupled dictionary learning method with locality coordinate constraints is introduced to reconstruct the corresponding high resolution depth map. The local constraints effectively reduce the prediction uncertainty and prevent the dictionary from over-fitting. We also incorporate an adaptively regularized shock filter to simultaneously reduce the jagged noise and sharpen the edges. Furthermore, a joint reconstruction and smoothing framework is proposed with an L0 gradient smooth constraint, making the reconstruction more robust to noise. Experimental results demonstrate the effectiveness of our proposed algorithm compared to previously reported methods.

Journal ArticleDOI
TL;DR: The proposed query-dependent model equipped with learned deep aesthetic abstractions significantly and consistently outperforms state-of-the-art hand-crafted feature-based and universal model-based methods.
Abstract: The automatic assessment of photo quality from an aesthetic perspective is a very challenging problem. Most existing research has predominantly focused on the learning of a universal aesthetic model based on hand-crafted visual descriptors. However, this research paradigm can achieve only limited success because 1) such hand-crafted descriptors cannot well preserve abstract aesthetic properties, and 2) such a universal model cannot always capture the full diversity of visual content. To address these challenges, we propose in this paper a novel query-dependent aesthetic model with deep learning for photo quality assessment. In our method, deep aesthetic abstractions are discovered from massive images, whereas the aesthetic assessment model is learned in a query-dependent manner. Our work addresses the first problem by learning mid-level aesthetic feature abstractions via powerful deep convolutional neural networks to automatically capture the underlying aesthetic characteristics of the massive training images. Regarding the second problem, because photographers tend to employ different rules of photography for capturing different images, the aesthetic model should also be query-dependent. Specifically, given an image to be assessed, we first identify which aesthetic model should be applied for this particular image. Then, we build a unique aesthetic model of this type to assess its aesthetic quality. We conducted extensive experiments on two large-scale datasets and demonstrated that the proposed query-dependent model equipped with learned deep aesthetic abstractions significantly and consistently outperforms state-of-the-art hand-crafted feature-based and universal model-based methods.

Journal ArticleDOI
TL;DR: Thorough experiments suggest that the proposed saliency-inspired fast image retrieval scheme, S-sim, significantly speeds up online retrieval and outperforms the state-of-the-art BoW-based image retrieval schemes.
Abstract: The bag-of-visual-words (BoW) model is effective for representing images and videos in many computer vision problems, and achieves promising performance in image retrieval. Nevertheless, the level of retrieval efficiency in a large-scale database is not acceptable for practical usage. Considering that the relevant images in the database of a given query are more likely to be distinctive than ambiguous, this paper defines “database saliency” as the distinctiveness score calculated for every image to measure its overall “saliency” in the database. By taking advantage of database saliency, we propose a saliency-inspired fast image retrieval scheme, S-sim, which significantly improves efficiency while retaining state-of-the-art accuracy in image retrieval. S-sim has two stages. In the bottom-up stage, the database saliency value of each image is computed by hierarchically decomposing a posterior probability into local patches and visual words; the concurrent information of visual words is then propagated bottom-up to estimate the distinctiveness. In the top-down stage, the query is discriminatively expanded via a very low-dimensional linear SVM trained on the top-ranked images after the initial search, and images are then re-ranked by their distances to the decision boundary as well as by their database saliency values. We comprehensively evaluate S-sim on common retrieval benchmarks, e.g., the Oxford and Paris datasets. Thorough experiments suggest that, because of the offline database saliency computation and online low-dimensional SVM, our approach significantly speeds up online retrieval and outperforms the state-of-the-art BoW-based image retrieval schemes.
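
The top-down stage can be sketched as discriminative query expansion: treat the top-ranked images from the initial search as positives and low-ranked ones as negatives, train a low-dimensional linear SVM, and re-rank by signed distance to its decision boundary. The toy example below (random descriptors, scikit-learn's LinearSVC) shows only this stage; the database-saliency computation and the BoW machinery are omitted.

```python
import numpy as np
from sklearn.svm import LinearSVC

def query_expansion_rerank(db_vecs, initial_ranks, n_pos=10, n_neg=200):
    """Train a linear SVM with the top-ranked images as positives and the
    lowest-ranked images as negatives, then re-sort the whole database by
    signed distance to the decision boundary (most relevant first)."""
    pos = db_vecs[initial_ranks[:n_pos]]
    neg = db_vecs[initial_ranks[-n_neg:]]
    X = np.vstack([pos, neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    svm = LinearSVC(C=1.0).fit(X, y)
    margins = svm.decision_function(db_vecs)      # distance to the boundary
    return np.argsort(-margins)

# Toy database: 1000 images with 32-d descriptors; initial ranking by cosine similarity.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 32))
q = db[3] + 0.1 * rng.normal(size=32)             # query similar to image 3
init = np.argsort(-(db @ q) / (np.linalg.norm(db, axis=1) * np.linalg.norm(q)))
print(query_expansion_rerank(db, init)[:5])       # re-ranked top results
```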

Journal ArticleDOI
TL;DR: In this paper, a relative symmetric bilinear model (RSBM) is introduced to model the similarity between the child and the parents, by incorporating the prior knowledge that a child may resemble one particular parent more than the other.
Abstract: One major challenge in computer vision is to go beyond the modeling of individual objects and to investigate the bi- (one-versus-one) or tri- (one-versus-two) relationship among multiple visual entities, answering such questions as whether a child in a photo belongs to the given parents. The child-parents relationship plays a core role in a family, and understanding such kin relationships would have a fundamental impact on the behavior of an artificial intelligent agent working in the human world. In this work, we tackle the problem of one-versus-two (tri-subject) kinship verification and our contributions are threefold: 1) a novel relative symmetric bilinear model (RSBM) is introduced to model the similarity between the child and the parents, by incorporating the prior knowledge that a child may resemble one particular parent more than the other; 2) a spatially voted method for feature selection, which jointly selects the most discriminative features for the child-parents pair, while taking local spatial information into account; and 3) a large-scale tri-subject kinship database characterized by over 1,000 child-parents families. Extensive experiments on KinFaceW, Family101, and our newly released kinship database show that the proposed method outperforms several previous state-of-the-art methods, and can also be used to significantly boost the performance of one-versus-one kinship verification when information about both parents is available.

Journal ArticleDOI
TL;DR: The main finding is that, while there are variations, the glory days of a video's popularity typically pass quickly and the probability of the same user replaying a video is low; the caching performance achieved by the proposed mixed strategy is very close to that achieved by the offline strategy.
Abstract: Popular online video-on-demand (VoD) services all maintain a large catalog of videos for their users to access. Knowledge of video popularity is very important for system operation, such as video caching on content distribution network (CDN) servers. The video popularity distribution at a given time is quite well understood. We study how video popularity changes with time, for different types of videos, and apply the results to design video caching strategies. Our study is based on analyzing video access levels over time, using data provided by a large video service provider. Our main finding is that, while there are variations, the glory days of a video’s popularity typically pass quickly, and the probability of the same user replaying a video is low. The reason appears to be the fairly regular number of users and viewing time per user per day, together with the continuous arrival of new videos. All of these facts affect how video popularity changes, and hence also the optimal video caching strategy. Based on the observations from our measurement study, we propose a mixed replication strategy (of LFU and FIFO) that can handle different kinds of videos. An offline strategy that assumes tomorrow’s video popularity is known in advance is used as a performance benchmark. Through trace-driven simulation, we show that the caching performance achieved by the mixed strategy is very close to the performance achieved by the offline strategy.
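
A toy sketch of what a mixed replication strategy can look like: a FIFO partition absorbs newly arrived videos whose popularity is still unknown, while an LFU partition retains the videos with the highest accumulated request counts. The class below is illustrative only; the capacities, promotion rule, and trace are hypothetical and the paper's actual strategy differs in detail.

```python
from collections import Counter, deque

class MixedCache:
    """Toy mixed cache: a FIFO partition for newly arrived videos and an LFU
    partition for the all-time most frequently requested ones."""

    def __init__(self, fifo_cap, lfu_cap):
        self.fifo, self.fifo_cap = deque(), fifo_cap
        self.lfu, self.lfu_cap = set(), lfu_cap
        self.freq = Counter()

    def request(self, vid):
        """Return True on a cache hit, then update both partitions."""
        self.freq[vid] += 1
        hit = vid in self.lfu or vid in self.fifo
        # Promote the video to the LFU partition if it is now popular enough.
        if vid not in self.lfu:
            if len(self.lfu) < self.lfu_cap:
                self.lfu.add(vid)
            else:
                coldest = min(self.lfu, key=lambda v: self.freq[v])
                if self.freq[vid] > self.freq[coldest]:
                    self.lfu.discard(coldest)
                    self.lfu.add(vid)
        if vid in self.lfu and vid in self.fifo:
            self.fifo.remove(vid)                 # promoted: drop the FIFO copy
        # Otherwise the video passes through the FIFO partition.
        if vid not in self.lfu and vid not in self.fifo:
            self.fifo.append(vid)
            if len(self.fifo) > self.fifo_cap:
                self.fifo.popleft()
        return hit

cache = MixedCache(fifo_cap=2, lfu_cap=2)
trace = ["a", "b", "a", "c", "a", "d", "b", "a", "e", "a"]
hits = sum(cache.request(v) for v in trace)
print(f"hit ratio: {hits / len(trace):.2f}")      # 0.50 on this toy trace
```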

Journal ArticleDOI
TL;DR: This paper proposes a multimedia social event summarization framework to automatically generate visualized summaries from the microblog stream of multiple media types and conducts extensive experiments on two real-world microblog datasets to demonstrate the superiority of the proposed framework as compared to the state-of-the-art approaches.
Abstract: Microblogging services have revolutionized the way people exchange information. Confronted with the ever-increasing numbers of social events and the corresponding microblogs with multimedia contents, it is desirable to provide visualized summaries to help users to quickly grasp the essence of these social events for better understanding. While existing approaches mostly focus only on text-based summary, microblog summarization with multiple media types (e.g., text, image, and video) is scarcely explored. In this paper, we propose a multimedia social event summarization framework to automatically generate visualized summaries from the microblog stream of multiple media types. Specifically, the proposed framework comprises three stages, as follows. 1) A noise removal approach is first devised to eliminate potentially noisy images. An effective spectral filtering model is exploited to estimate the probability that an image is relevant to a given event. 2) A novel cross-media probabilistic model, termed Cross-Media-LDA (CMLDA), is proposed to jointly discover subevents from microblogs of multiple media types. The intrinsic correlations among these different media types are well explored and exploited for reinforcing the cross-media subevent discovery process. 3) Finally, based on the cross-media knowledge of all the discovered subevents, a multimedia microblog summary generation process is designed to jointly identify both representative textual and visual samples, which are further aggregated to form a holistic visualized summary. We conduct extensive experiments on two real-world microblog datasets to demonstrate the superiority of the proposed framework as compared to the state-of-the-art approaches.

Journal ArticleDOI
TL;DR: This paper proposes to learn features from sets of labeled raw images so that deep CNNs can be trained from scratch with a small amount of training data, i.e., 420 labeled albums with about 30,000 photos.
Abstract: This paper proposes to learn features from sets of labeled raw images. With this method, the problem of over-fitting can be effectively suppressed, so that deep CNNs can be trained from scratch with a small amount of training data, i.e., 420 labeled albums with about 30,000 photos. The method can effectively deal with sets of images regardless of whether the sets bear temporal structures. A typical approach to sequential image analysis usually leverages motions between adjacent frames, while the proposed method focuses on capturing the co-occurrences and frequencies of features. Nevertheless, our method outperforms previous best performers in terms of album classification, and achieves comparable or even better performance in terms of gait-based human identification. These results demonstrate its effectiveness and good adaptivity to different kinds of set data.

Journal ArticleDOI
TL;DR: A hashing-based orthogonal deep model is proposed to learn accurate and compact multimodal representations and it is theoretically proved that, in this case, the learned codes are guaranteed to be approximately orthogonal.
Abstract: As large-scale multimodal data are ubiquitous in many real-world applications, learning multimodal representations for efficient retrieval is a fundamental problem. Most existing methods adopt shallow structures to perform multimodal representation learning. Due to a limitation of learning ability of shallow structures, they fail to capture the correlation of multiple modalities. Recently, multimodal deep learning was proposed and had proven its superiority in representing multimodal data due to its high nonlinearity. However, in order to learn compact and accurate representations, how to reduce the redundant information lying in the multimodal representations and incorporate different complexities of different modalities in the deep models is still an open problem. In order to address the aforementioned problem, in this paper we propose a hashing-based orthogonal deep model to learn accurate and compact multimodal representations. The method can better capture the intra-modality and inter-modality correlations to learn accurate representations. Meanwhile, in order to make the representations compact, the hashing-based model can generate compact hash codes and the proposed orthogonal structure can reduce the redundant information lying in the codes by imposing orthogonal regularizer on the weighting matrices. We also theoretically prove that, in this case, the learned codes are guaranteed to be approximately orthogonal. Moreover, considering the different characteristics of different modalities, effective representations can be attained with different number of layers for different modalities. Comprehensive experiments on three real-world datasets demonstrate a substantial gain of our method on retrieval tasks compared with existing algorithms.
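
The orthogonality constraint on the weighting matrices can be imposed with a penalty of the form $\|W^{\top}W - I\|_F^2$, whose minimization makes the learned code dimensions approximately orthogonal and thus less redundant. The sketch below shows the penalty, its gradient, and a toy check that gradient steps drive a random matrix toward orthogonal columns; the hashing network, binarization, and per-modality depths are not shown.

```python
import numpy as np

def orthogonal_penalty(W):
    """Orthogonality regularizer ||W^T W - I||_F^2 on a weighting matrix W.
    Driving it to zero makes the columns of W approximately orthogonal."""
    G = W.T @ W - np.eye(W.shape[1])
    return np.sum(G * G)

def penalty_gradient(W):
    """Gradient of the penalty w.r.t. W, usable inside any gradient-based trainer."""
    return 4.0 * W @ (W.T @ W - np.eye(W.shape[1]))

# Toy check: gradient steps on the penalty alone make a random matrix near-orthogonal.
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 16)) * 0.2
for _ in range(500):
    W -= 0.01 * penalty_gradient(W)
print(round(orthogonal_penalty(W), 6))     # close to zero after the updates
```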

Journal ArticleDOI
TL;DR: A two-stage approach to generate QR code with high quality visual content based on the Gauss-Jordan elimination procedure and the experimental results show that the proposed method substantially enhances the appearance of the QR code and the processing complexity is near real-time.
Abstract: Quick response (QR) code is generally used for embedding messages such that people can conveniently use mobile devices to capture the QR code and acquire information through a QR code reader. In the past, the design of QR code generators only aimed to achieve high decodability, and the produced QR codes usually look like random black-and-white patterns without visual semantics. In recent years, researchers have tried to endow the QR code with aesthetic elements, and QR code beautification has been formulated as an optimization problem that minimizes the visual perception distortion subject to an acceptable decoding rate. However, the visual quality of the QR codes generated by existing methods still leaves much to be desired. In this work, we propose a two-stage approach to generate QR codes with high-quality visual content. In the first stage, a baseline QR code with reliable decodability but poor visual quality is synthesized based on the Gauss-Jordan elimination procedure. In the second stage, a rendering mechanism is designed to improve the visual quality while avoiding affecting the decodability of the QR code. The experimental results show that the proposed method substantially enhances the appearance of the QR code and that the processing is near real-time.
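
QR code data and error-correction bits live in GF(2), so synthesizing a baseline code that forces selected module values while remaining decodable reduces to solving linear systems over GF(2). The routine below is a plain Gauss-Jordan solver over GF(2) using XOR row operations; how the system is actually assembled from the QR specification (codewords, masking, rendering) is not shown, and the variable names are illustrative.

```python
import numpy as np

def gauss_jordan_gf2(A, b):
    """Solve A x = b over GF(2) by Gauss-Jordan elimination (XOR row operations).
    Returns one solution (free variables set to 0), or None if inconsistent."""
    A = A.copy() % 2
    b = b.copy() % 2
    rows, cols = A.shape
    pivot_cols = []
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if A[i, c]), None)
        if pivot is None:
            continue
        A[[r, pivot]], b[[r, pivot]] = A[[pivot, r]], b[[pivot, r]]   # swap rows
        for i in range(rows):                   # eliminate column c in all other rows
            if i != r and A[i, c]:
                A[i] ^= A[r]
                b[i] ^= b[r]
        pivot_cols.append(c)
        r += 1
    if any(b[i] and not A[i].any() for i in range(r, rows)):
        return None                             # a 0 = 1 row means no solution
    x = np.zeros(cols, dtype=np.uint8)
    for row, c in enumerate(pivot_cols):
        x[c] = b[row]
    return x

# Toy system: 6 parity-style constraints over 8 binary unknowns.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(6, 8), dtype=np.uint8)
x_true = rng.integers(0, 2, size=8, dtype=np.uint8)
b = (A @ x_true) % 2
x = gauss_jordan_gf2(A, b.astype(np.uint8))
print(x, np.array_equal((A @ x) % 2, b))        # a valid solution of A x = b (mod 2)
```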