
Showing papers in "IEEE Transactions on Multimedia in 2016"


Journal ArticleDOI
TL;DR: A new objective metric for research on perceptual quality assessment of distorted SCIs is developed, which mainly relies on simple convolution operators and detects salient areas where the distortions usually attract more attention.
Abstract: With the widespread adoption of multidevice communication, such as telecommuting, screen content images (SCIs) have become more closely and frequently related to our daily lives. For SCIs, the tasks of accurate visual quality assessment, high-efficiency compression, and suitable contrast enhancement have thus attracted increasing attention. In particular, the quality evaluation of SCIs is important because it can guide and optimize various processing systems. Hence, in this paper, we develop a new objective metric for research on perceptual quality assessment of distorted SCIs. Compared to the classical MSE, our method, which mainly relies on simple convolution operators, first highlights the degradations in structures caused by different types of distortions and then detects salient areas where the distortions usually attract more attention. A comparison of our algorithm with the most popular and state-of-the-art quality measures is performed on two new SCI databases (SIQAD and SCD). Extensive results are provided to verify the superiority and efficiency of the proposed IQA technique.
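For illustration, here is a minimal sketch of the kind of pipeline this abstract describes: a convolution-based structural-degradation map weighted by a crude saliency estimate. It is not the authors' metric; the Sobel/Gaussian filters and all constants are assumptions.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def sci_quality_sketch(ref, dist, c=0.001):
    """Toy SCI quality score: gradient similarity (structure) weighted by
    a local-contrast 'saliency' map. Illustrative only, not the paper's metric."""
    ref, dist = ref.astype(np.float64), dist.astype(np.float64)
    # Structural degradation via Sobel gradient magnitudes (convolutions).
    g_ref = np.hypot(sobel(ref, 0), sobel(ref, 1))
    g_dst = np.hypot(sobel(dist, 0), sobel(dist, 1))
    sim = (2 * g_ref * g_dst + c) / (g_ref ** 2 + g_dst ** 2 + c)
    # Crude saliency: deviation from a heavily blurred reference.
    sal = np.abs(ref - gaussian_filter(ref, sigma=5))
    w = sal / (sal.sum() + 1e-12)
    return float((sim * w).sum())  # 1.0 means structurally identical
```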

264 citations


Journal ArticleDOI
Tong Zhang1, Wenming Zheng1, Zhen Cui1, Yuan Zong1, Jingwei Yan1, Keyu Yan1 
TL;DR: A novel deep neural network (DNN)-driven feature learning method is proposed and applied to multi-view facial expression recognition (FER) and the experimental results show that the algorithm outperforms the state-of-the-art methods.
Abstract: In this paper, a novel deep neural network (DNN)-driven feature learning method is proposed and applied to multi-view facial expression recognition (FER). In this method, scale invariant feature transform (SIFT) features corresponding to a set of landmark points are first extracted from each facial image. Then, a feature matrix consisting of the extracted SIFT feature vectors is used as input data and sent to a well-designed DNN model that learns optimal discriminative features for expression classification. The proposed DNN model employs several layers to characterize the correspondence between the SIFT feature vectors and their high-level semantic information. By training the DNN model, we are able to learn a set of optimal features that are well suited to classifying facial expressions across different facial views. To evaluate the effectiveness of the proposed method, experiments are conducted on two nonfrontal facial expression databases, BU-3DFE and Multi-PIE; the results show that our algorithm outperforms the state-of-the-art methods.
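As a hedged sketch of the landmark-SIFT-to-DNN idea (the paper's exact topology is not reproduced here; the landmark count, layer sizes, and six-class output are assumptions):

```python
import torch
import torch.nn as nn

# Assumptions, not the paper's configuration: 49 landmarks x 128-D SIFT
# descriptors per face, six expression classes.
N_LANDMARKS, SIFT_DIM, N_CLASSES = 49, 128, 6

class SiftExpressionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                                  # (B, 49, 128) -> (B, 6272)
            nn.Linear(N_LANDMARKS * SIFT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 256), nn.ReLU(),               # discriminative feature layer
            nn.Linear(256, N_CLASSES),                     # expression logits
        )

    def forward(self, sift_matrix):                        # (B, 49, 128)
        return self.net(sift_matrix)

# Usage: logits = SiftExpressionNet()(torch.randn(8, N_LANDMARKS, SIFT_DIM))
```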

262 citations


Journal ArticleDOI
TL;DR: An effective and efficient no-reference objective quality metric which can automatically assess LDR images created by different TMOs without access to the original HDR images is developed.
Abstract: High dynamic range (HDR) imaging techniques have long served fault detection and disease diagnosis in the astronomical and medical fields, and they have recently gained much attention from the digital image processing and computer vision communities. While HDR imaging devices are becoming affordable, HDR display devices are still out of reach of typical consumers. Due to the limited availability of HDR display devices, in most cases tone mapping operators (TMOs) are used to convert HDR images to standard low dynamic range (LDR) images for visualization. But existing TMOs cannot work effectively for all kinds of HDR images, with their performance largely depending on the brightness, contrast, and structure properties of a scene. To accurately measure and compare the performance of distinct TMOs, in this paper we develop an effective and efficient no-reference objective quality metric which can automatically assess LDR images created by different TMOs without access to the original HDR images. Our model is shown to be statistically superior to recent full- and no-reference quality measures on the existing tone-mapped image database and a new relevant database built in this work.

186 citations


Journal ArticleDOI
TL;DR: A ranking aggregation algorithm is proposed to enhance the detection of similarity and dissimilarity based on the following assumption: the true match should be similar to the probe under different baseline methods, and also dissimilar to the strongly dissimilar galleries of the probe.
Abstract: Person reidentification is a key technique to match different persons observed in nonoverlapping camera views. Many researchers treat it as a special object-retrieval problem, where ranking optimization plays an important role. Existing ranking optimization methods mainly utilize the similarity relationship between the probe and gallery images to optimize the original ranking list, but seldom consider the important dissimilarity relationship. In this paper, we propose to use both similarity and dissimilarity cues in a ranking optimization framework for person reidentification. Its core idea is that the true match should not only be similar to those strongly similar galleries of the probe, but also be dissimilar to those strongly dissimilar galleries of the probe. Furthermore, motivated by the philosophy of multiview verification, a ranking aggregation algorithm is proposed to enhance the detection of similarity and dissimilarity based on the following assumption: the true match should be similar to the probe under different baseline methods. In other words, if a gallery image is strongly similar to the probe under one method but strongly dissimilar to the probe under another, it is probably a wrong match for the probe. Extensive experiments conducted on public benchmark datasets and comparisons with different baseline methods have shown the great superiority of the proposed ranking optimization method.
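A toy re-ranking in this spirit (not the paper's aggregation algorithm; k and alpha are illustrative) might look like:

```python
import numpy as np

def rerank_with_dissimilarity(score_lists, k=10, alpha=0.5):
    """Toy re-ranking: a gallery item's final score rewards membership in
    each baseline's strongly similar set and penalizes membership in any
    baseline's strongly dissimilar set.

    score_lists: list of 1-D arrays, one per baseline method; higher means
    more similar to the probe. All arrays index the same gallery.
    """
    scores = np.stack([(s - s.mean()) / (s.std() + 1e-12)
                       for s in score_lists])      # normalize per method
    n = scores.shape[1]
    bonus, penalty = np.zeros(n), np.zeros(n)
    for s in scores:
        order = np.argsort(-s)
        bonus[order[:k]] += 1       # strongly similar under this method
        penalty[order[-k:]] += 1    # strongly dissimilar under this method
    final = scores.mean(axis=0) + alpha * (bonus - penalty) / len(scores)
    return np.argsort(-final)       # re-ranked gallery indices
```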

183 citations


Journal ArticleDOI
TL;DR: This paper proposes a data-driven distance metric (DDDM) method, re-exploiting the training data to adjust the metric for each query-gallery pair, with a significant improvement over three baseline metric learning methods.
Abstract: Person re-identification, aiming to identify images of the same person from various cameras configured in different places, has attracted much attention in the multimedia retrieval community. In this problem, choosing a proper distance metric is a crucial aspect, and many classic methods utilize a uniformly learned metric. However, their performance is limited because they ignore the zero-shot and fine-grained characteristics present in real person re-identification applications. In this paper, we investigate two consistencies across two cameras: cross-view support consistency and cross-view projection consistency. The philosophy behind them is that, in spite of visual changes between two images of the same person under two camera views, the support sets in their respective views are highly consistent, and after being projected to the same view, their context sets are also highly consistent. Based on the above phenomena, we propose a data-driven distance metric (DDDM) method, re-exploiting the training data to adjust the metric for each query-gallery pair. Experiments conducted on three public data sets have validated the effectiveness of the proposed method, with a significant improvement over three baseline metric learning methods. In particular, on the public VIPeR dataset, the proposed method achieves an accuracy rate of 42.09% at rank-1, which outperforms the state-of-the-art methods by 4.29%.

162 citations


Journal ArticleDOI
TL;DR: It is shown that the algorithms, which progressively increase quality toward the point of view, manage to reduce the bandwidth requirement and provide a similar quality of experience (QoE) compared to a full panorama system.
Abstract: Interactive panoramic systems are currently on the rise. However, one of the major challenges in such a system is the overhead involved in transferring a full-quality panorama to the client when only a part of the panorama is used to extract a virtual view. Thus, such a system should maximize the user experience while simultaneously minimizing the bandwidth required. In this paper, we apply tiling to deliver different quality levels for different parts of the panorama. Tiling has traditionally been applied to the delivery of very high-resolution content to clients. Here, we apply similar ideas in a real-time interactive panoramic video system. A major challenge lies in the movement of such a virtual view, for which clients' regions of interest change dynamically and independently of each other. We show that our algorithms, which progressively increase quality toward the point of view, manage to (i) reduce the bandwidth requirement and (ii) provide a similar quality of experience (QoE) compared to a full panorama system.
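A minimal sketch of the tiling idea, assigning lower quality to tiles farther from the viewer's point of view (the grid size and number of quality levels below are assumptions):

```python
import numpy as np

def tile_quality_map(grid_w, grid_h, roi_x, roi_y, levels=4):
    """Tiles nearer the point of view get higher quality; quality level
    falls off with distance. 0 = highest quality, levels-1 = lowest."""
    xs, ys = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    dist = np.hypot(xs - roi_x, ys - roi_y)
    q = (dist / (dist.max() + 1e-12) * levels).astype(int)
    return np.minimum(q, levels - 1)   # grid_h x grid_w quality levels

# Example: a 16x4 tile grid with the virtual view centered at tile (5, 2).
print(tile_quality_map(16, 4, roi_x=5, roi_y=2))
```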

151 citations


Journal ArticleDOI
TL;DR: A novel cross-media active learning algorithm is proposed to reduce the effort of labeling images for training: classifiers are trained on both visual features and privileged information, and the uncertainty of unlabeled data is measured by exploiting the learned classifiers and the slack function.
Abstract: In this paper, we propose a novel cross-media active learning algorithm to reduce the effort of labeling images for training. Internet images are often associated with rich textual descriptions. Even though such textual information is not available in test images, it is still useful for learning robust classifiers. In light of this, we apply the recently proposed supervised learning paradigm, learning using privileged information, to the active learning task. Specifically, we train classifiers on both visual features and privileged information, and measure the uncertainty of unlabeled data by exploiting the learned classifiers and the slack function. Then, we propose to select unlabeled samples by jointly measuring the cross-media uncertainty and the visual diversity. Our method automatically learns the optimal tradeoff parameter between the two measurements, which in turn makes our algorithm particularly suitable for real-world applications. Extensive experiments demonstrate the effectiveness of our approach.
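A greedy sketch of joint uncertainty/diversity selection (the paper learns the tradeoff parameter automatically; here it is fixed, and cosine distance stands in for the paper's diversity measure):

```python
import numpy as np

def select_batch(uncertainty, features, batch_size, tradeoff=0.5):
    """Pick unlabeled samples scoring high on both uncertainty and
    diversity (distance to the closest already-selected sample).

    uncertainty: (n,) cross-media uncertainty per unlabeled sample.
    features:    (n, d) visual features used to measure diversity.
    """
    features = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    chosen = []
    for _ in range(batch_size):
        if chosen:
            sims = features @ features[chosen].T
            diversity = 1.0 - sims.max(axis=1)   # far from selected set
        else:
            diversity = np.ones(len(uncertainty))
        score = tradeoff * uncertainty + (1 - tradeoff) * diversity
        score[chosen] = -np.inf                  # never pick twice
        chosen.append(int(np.argmax(score)))
    return chosen
```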

151 citations


Journal ArticleDOI
TL;DR: A sentiment-based rating prediction method (RPS) to improve prediction accuracy in recommender systems and results show the sentiment can well characterize user preferences, which helps to improve the recommendation performance.
Abstract: In recent years, we have witnessed a flourishing of review websites. They present a great opportunity to share our viewpoints on the various products we purchase. However, we face an information overload problem. How to mine valuable information from reviews to understand a user's preferences and make an accurate recommendation is crucial. Traditional recommender systems (RS) consider factors such as a user's purchase records, product category, and geographic location. In this work, we propose a sentiment-based rating prediction method (RPS) to improve prediction accuracy in recommender systems. Firstly, we propose a social-user sentiment measurement approach and calculate each user's sentiment on items/products. Secondly, we not only consider a user's own sentimental attributes but also take interpersonal sentimental influence into consideration. Then, we consider product reputation, which can be inferred from the sentimental distributions of a user set and reflects customers' comprehensive evaluation. At last, we fuse three factors (user sentiment similarity, interpersonal sentimental influence, and item reputation similarity) into our recommender system to make an accurate rating prediction. We conduct a performance evaluation of the three sentimental factors on a real-world dataset collected from Yelp. Our experimental results show that sentiment can well characterize user preferences, which helps to improve recommendation performance.
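As a hedged illustration of the first step, measuring a user's sentiment on an item from review text (the tiny lexicon and averaging rule below are assumptions, not the paper's measurement approach):

```python
# Minimal lexicon-based sentiment sketch; the word lists are illustrative.
POSITIVE = {"great", "excellent", "tasty", "friendly", "good"}
NEGATIVE = {"bad", "awful", "slow", "rude", "dirty"}

def review_sentiment(text):
    """Score one review in [-1, 1] by counting lexicon hits."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)

def user_item_sentiment(reviews):
    """Average sentiment of one user's reviews of one item."""
    return sum(map(review_sentiment, reviews)) / max(len(reviews), 1)

print(user_item_sentiment(["Great food but slow service"]))  # 0.0
```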

148 citations


Journal ArticleDOI
TL;DR: An integrated system for clothing co-parsing (CCP), in order to jointly parse a set of clothing images (unsegmented but annotated with tags) into semantic configurations, consisting of two phases of inference is proposed.
Abstract: This paper aims at developing an integrated system for clothing co-parsing (CCP), in order to jointly parse a set of clothing images (unsegmented but annotated with tags) into semantic configurations. A novel data-driven system consisting of two phases of inference is proposed. The first phase, referred to as "image cosegmentation," iterates to extract consistent regions on images and jointly refines the regions over all images by employing the exemplar-SVM technique [1]. In the second phase (i.e., "region colabeling"), we construct a multiimage graphical model by taking the segmented regions as vertices and incorporating several contexts of clothing configuration (e.g., item locations and mutual interactions). The joint label assignment can be solved using the efficient Graph Cuts algorithm. In addition to evaluating our framework on the Fashionista dataset [2], we construct a dataset called the SYSU-Clothes dataset, consisting of 2098 high-resolution street fashion photos, to demonstrate the performance of our system. We achieve 90.29%/88.23% segmentation accuracy and 65.52%/63.89% recognition rate on the Fashionista and the SYSU-Clothes datasets, respectively, which is superior to previous methods. Furthermore, we apply our method to a challenging task, cross-domain clothing retrieval: given a user photo depicting clothing, we retrieve the same clothing items from online shopping stores based on the fine-grained parsing results.

143 citations


Journal ArticleDOI
TL;DR: The proposed BIQA model is called no-reference quality assessment using statistical structural and luminance features (NRSL), and it is demonstrated that the proposed NRSL metric compares favorably with the relevant state-of-the-art BIQA models in terms of high correlation with human subjective ratings.
Abstract: Blind image quality assessment (BIQA) aims to develop quantitative measures to automatically and accurately estimate perceptual image quality without any prior information about the reference image. In this paper, we introduce a novel BIQA metric based on structural and luminance information, motivated by the characteristics of human visual perception of distorted images. We extract the perceptual structural features of a distorted image from its local binary pattern distribution. In addition, the distribution of normalized luminance magnitudes is extracted to represent the luminance changes in the distorted image. After extracting the structural and luminance features, support vector regression is adopted to model the complex nonlinear relationship from the feature space to the quality measure. The proposed BIQA model is called no-reference quality assessment using statistical structural and luminance features (NRSL). Extensive experiments conducted on four synthetically distorted image databases and three naturally distorted image databases have demonstrated that the proposed NRSL metric compares favorably with the relevant state-of-the-art BIQA models in terms of high correlation with human subjective ratings. The MATLAB source code and validation results of NRSL are publicly available at http://www.ntu.edu.sg/home/wslin/Publications.htm.
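A sketch of the two NRSL-style feature groups and the SVR mapping (bin counts, filter widths, and the uniform-LBP setting are assumptions, not the paper's exact configuration):

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import local_binary_pattern
from sklearn.svm import SVR

def nrsl_style_features(img, c=1.0):
    """Concatenate a structural histogram (uniform LBP codes) with a
    luminance histogram (normalized luminance magnitudes)."""
    img = img.astype(np.float64)
    # Structural features: histogram of uniform LBP codes (P=8 -> 10 bins).
    lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
    h_struct, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    # Luminance features: histogram of mean-subtracted, contrast-normalized
    # (MSCN-style) luminance values.
    mu = gaussian_filter(img, 7 / 6)
    sigma = np.sqrt(np.abs(gaussian_filter(img ** 2, 7 / 6) - mu ** 2))
    mscn = (img - mu) / (sigma + c)
    h_lum, _ = np.histogram(mscn, bins=20, range=(-3, 3), density=True)
    return np.concatenate([h_struct, h_lum])

# Map features to subjective scores with support vector regression:
# X = np.stack([nrsl_style_features(im) for im in train_images])
# model = SVR(kernel="rbf").fit(X, train_mos)
```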

142 citations


Journal ArticleDOI
TL;DR: A novel Markov decision-based rate adaptation scheme for DASH aiming to maximize the quality of user experience under time-varying channel conditions and a low-complexity sub-optimal greedy algorithm which is suitable for real-time video streaming is proposed.
Abstract: Dynamic adaptive streaming over HTTP (DASH) has recently been widely deployed on the Internet. It does not, however, impose any adaptation logic for selecting the quality of the video fragments requested by clients. In this paper, we propose a novel Markov decision-based rate adaptation scheme for DASH that aims to maximize the quality of user experience under time-varying channel conditions. To this end, our proposed method takes into account the key factors that critically affect visual quality, including video playback quality, video rate switching frequency and amplitude, buffer overflow/underflow, and buffer occupancy. Besides, to reduce computational complexity, we propose a low-complexity sub-optimal greedy algorithm which is suitable for real-time video streaming. Our experiments on a network test-bed and the real-world Internet both demonstrate the good performance of the proposed method in terms of objective and subjective visual quality.
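For illustration, a toy greedy segment-rate selector in the spirit of the low-complexity algorithm (the reward weights and buffer target are made up; the paper's Markov decision formulation is richer):

```python
def greedy_rate_select(rates, throughput, buffer_s, last_rate,
                       w_switch=0.5, w_buffer=1000.0,
                       target_buffer=10.0, seg_dur=2.0):
    """Pick the next segment bitrate (kbps) by trading off playback
    quality, switching amplitude, and buffer risk. rates: sorted ascending."""
    best, best_score = rates[0], float("-inf")
    for r in rates:
        download_s = r * seg_dur / max(throughput, 1e-9)
        next_buffer = buffer_s - download_s + seg_dur
        if next_buffer <= 0:        # would stall: hard-reject
            continue
        score = (r                                            # quality reward
                 - w_switch * abs(r - last_rate)              # switching cost
                 - w_buffer * max(target_buffer - next_buffer, 0))  # buffer risk
        if score > best_score:
            best, best_score = r, score
    return best

# Example: 3000 kbps throughput, 8 s of buffer, previously at 2500 kbps.
# Picks 2500 here: 5000 scores worse because it would drain the buffer.
print(greedy_rate_select([500, 1200, 2500, 5000], throughput=3000,
                         buffer_s=8.0, last_rate=2500))
```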

Journal ArticleDOI
TL;DR: This work proposes a novel multi-modal event topic model (mmETM), which can effectively model social media documents, including long text with related images, and learn the correlations between textual and visual modalities to separate visual-representative topics from non-visual-representative topics.
Abstract: With the massive growth of social events on the Internet, it has become increasingly difficult to find and organize interesting events from massive social media data, a capability that would help users and governments browse, search, and monitor social events. To deal with this problem, we propose a novel multi-modal social event tracking and evolution framework that not only effectively captures the multi-modal topics of social events, but also obtains their evolutionary trends and generates effective event summary details over time. To achieve this goal, we propose a novel multi-modal event topic model (mmETM), which can effectively model social media documents, including long text with related images, and learn the correlations between textual and visual modalities to separate visual-representative topics from non-visual-representative topics. To apply the mmETM model to social event tracking, we adopt an incremental learning strategy, denoted incremental mmETM, which obtains informative textual and visual topics of social events over time to help understand these events and their evolutionary trends. To evaluate the effectiveness of our proposed algorithm, we collect a real-world dataset and conduct various experiments. Both qualitative and quantitative evaluations demonstrate that the proposed mmETM algorithm performs favorably against several state-of-the-art methods.

Journal ArticleDOI
TL;DR: Experiments on five benchmark datasets with eight saliency extraction methods show that the proposed saliency co-fusion-based approach achieves competitive performance even without parameter fine-tuning when compared with the state-of-the-art methods.
Abstract: Most existing high-performance co-segmentation algorithms are complex because of the way they co-label a set of images, as well as the common need to fine-tune a few parameters for effective co-segmentation. In this paper, instead of following the conventional way of co-labeling multiple images, we propose to first exploit inter-image information through co-saliency, and then perform single-image segmentation on each individual image. To make the system robust and to avoid heavy dependence on any single saliency extraction method, we propose to apply multiple existing saliency extraction methods to each image to obtain diverse saliency maps. Our major contribution lies in the proposed method that fuses the obtained diverse saliency maps by exploiting the inter-image information, which we call saliency co-fusion. Experiments on five benchmark datasets with eight saliency extraction methods show that our saliency co-fusion-based approach achieves competitive performance even without parameter fine-tuning when compared with the state-of-the-art methods.
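A single-image toy version of saliency fusion (the paper's co-fusion exploits inter-image information; weighting each map by agreement with the consensus, as below, is only a stand-in):

```python
import numpy as np

def co_fuse(saliency_maps):
    """Fuse diverse saliency maps from different extraction methods.

    saliency_maps: list of (H, W) maps. Each map is min-max normalized,
    then weighted by its correlation with the mean (consensus) map.
    """
    maps = np.stack([(m - m.min()) / (m.max() - m.min() + 1e-12)
                     for m in saliency_maps])
    mean_map = maps.mean(axis=0)
    # Weight each method by how well it agrees with the consensus.
    weights = np.array([np.corrcoef(m.ravel(), mean_map.ravel())[0, 1]
                        for m in maps])
    weights = np.clip(weights, 0, None)
    weights /= weights.sum() + 1e-12
    return np.tensordot(weights, maps, axes=1)   # fused (H, W) map
```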

Journal ArticleDOI
TL;DR: A deep and bidirectional representation learning model is proposed to address the issue of image-text cross-modal retrieval and shows that the proposed architecture is effective and the learned representations have good semantics to achieve superior cross- modal retrieval performance.
Abstract: Cross-modal retrieval emphasizes understanding inter-modality semantic correlations, which is often achieved by designing a similarity function. Generally, one of the most important considerations for the similarity function is how to make the cross-modal similarity computable. In this paper, a deep and bidirectional representation learning model is proposed to address the issue of image-text cross-modal retrieval. Owing to the solid progress of deep learning in computer vision and natural language processing, it is reliable to extract semantic representations from both raw image and text data by using deep neural networks. Therefore, in the proposed model, two convolution-based networks are adopted to accomplish representation learning for images and texts. By passing through the networks, images and texts are mapped to a common space, in which the cross-modal similarity is measured by cosine distance. Subsequently, a bidirectional network architecture is designed to capture the defining property of cross-modal retrieval: bidirectional search. Such an architecture is characterized by simultaneously involving the matched and unmatched image-text pairs for training. Accordingly, a learning framework with a maximum likelihood criterion is finally developed. The network parameters are optimized via backpropagation and stochastic gradient descent. Extensive experiments are conducted to evaluate the proposed method on three publicly released datasets: IAPRTC-12, Flickr30k, and Flickr8k. The overall results clearly show that the proposed architecture is effective and that the learned representations have good semantics and achieve superior cross-modal retrieval performance.
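A common stand-in for training on matched and unmatched pairs in a shared space with cosine similarity is a softmax-over-similarities loss; the sketch below is not the paper's maximum-likelihood objective, and the temperature is an assumption:

```python
import torch
import torch.nn.functional as F

def bidirectional_matching_loss(img_emb, txt_emb, temperature=0.1):
    """img_emb, txt_emb: (B, d) embeddings in the common space; row i of
    each is a matched image-text pair, off-diagonal rows are unmatched."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t() / temperature        # (B, B) cosine similarities
    labels = torch.arange(sim.size(0))
    # Bidirectional: image-to-text search and text-to-image search.
    return (F.cross_entropy(sim, labels) +
            F.cross_entropy(sim.t(), labels)) / 2

# loss = bidirectional_matching_loss(torch.randn(16, 256), torch.randn(16, 256))
```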

Journal ArticleDOI
TL;DR: A novel framework for RDH-EI based on reversible image transformation (RIT), in which, unlike encryption-based frameworks whose ciphertexts may attract the attention of the curious cloud, the data-embedding process executed by the cloud server is independent of the processes of both encryption and decryption.
Abstract: With the popularity of outsourcing data to the cloud, it is vital to protect the privacy of data while enabling the cloud server to easily manage it. Under such demands, reversible data hiding in encrypted images (RDH-EI) has attracted more and more researchers' attention. In this paper, we propose a novel framework for RDH-EI based on reversible image transformation (RIT). Different from all previous encryption-based frameworks, in which the ciphertexts may attract the attention of the curious cloud, the RIT-based framework allows the user to transform the content of the original image into the content of another target image of the same size. The transformed image, which looks like the target image, is used as the "encrypted image" and is outsourced to the cloud. Therefore, the cloud server can easily embed data into the "encrypted image" using any RDH method for plaintext images. Thus, a client-free scheme for RDH-EI can be realized; that is, the data-embedding process executed by the cloud server is independent of the processes of both encryption and decryption. Two RDH methods, a traditional RDH scheme and a unified embedding and scrambling scheme, are adopted to embed the watermark in the encrypted image, satisfying different needs for image quality and large embedding capacity, respectively.

Journal ArticleDOI
TL;DR: An energy-efficient optimization objective function with individual fronthaul capacity and intertier interference constraints is presented in this paper for queue-aware multimedia H-CRANs and demonstrates that a tradeoff between EE and queuing delay can be achieved.
Abstract: The heterogeneous cloud radio access network (H-CRAN) is a promising paradigm that incorporates cloud computing into heterogeneous networks (HetNets), thereby taking full advantage of cloud radio access networks (C-RANs) and HetNets. Characterizing cooperative beamforming with fronthaul capacity and queue stability constraints is critical for multimedia applications to improve the energy efficiency (EE) in H-CRANs. An energy-efficient optimization objective function with individual fronthaul capacity and intertier interference constraints is presented in this paper for queue-aware multimedia H-CRANs. To solve this nonconvex objective function, a stochastic optimization problem is reformulated by introducing the general Lyapunov optimization framework. Under the Lyapunov framework, this optimization problem is equivalent to an optimal network-wide cooperative beamformer design algorithm with instantaneous power, average power, and intertier interference constraints, which can be regarded as a weighted sum EE maximization problem and solved by a generalized weighted minimum mean-square error approach. The mathematical analysis and simulation results demonstrate that a tradeoff between EE and queuing delay can be achieved, and this tradeoff strictly depends on the fronthaul constraint.

Journal ArticleDOI
TL;DR: The proposed user-service rating prediction approach fuses four factors (user personal interest, related to the user and the item's topics; interpersonal interest similarity; interpersonal rating behavior similarity; and interpersonal rating behavior diffusion, related to users' behavior diffusion) into a unified matrix-factorized framework.
Abstract: With the boom of social media, it has become a very popular trend for people to share what they are doing with friends across various social networking platforms. Nowadays, we have a vast amount of descriptions, comments, and ratings for local services. This information is valuable for new users to judge whether a service meets their requirements before partaking. In this paper, we propose a user-service rating prediction approach that explores social users' rating behaviors. In order to predict user-service ratings, we focus on users' rating behaviors. In our opinion, rating behavior in a recommender system is embodied in these aspects: 1) when the user rated the item, 2) what the rating is, 3) what the item is, 4) what user interests can be mined from his/her rating records, and 5) how the user's rating behavior diffuses among his/her social friends. Therefore, we propose the concept of a rating schedule to represent users' daily rating behaviors. In addition, we propose the factor of interpersonal rating behavior diffusion to understand users' rating behaviors more deeply. In the proposed user-service rating prediction approach, we fuse four factors (user personal interest, related to the user and the item's topics; interpersonal interest similarity, related to user interest; interpersonal rating behavior similarity, related to users' rating behavior habits; and interpersonal rating behavior diffusion, related to users' behavior diffusion) into a unified matrix-factorized framework. We conduct a series of experiments on the Yelp dataset and the Douban Movie dataset. Experimental results show the effectiveness of our approach.
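A hedged sketch of fusing side factors with a matrix-factorization predictor (the paper learns the four factors inside one unified objective rather than adding them post hoc; the additive form and all values below are placeholders):

```python
import numpy as np

def predict_rating(u, i, P, Q, mu, factors, weights):
    """Base MF prediction plus a weighted sum of per-(user, item) social
    factor scores. Illustrative only.

    P, Q: user and item latent matrices; mu: global mean rating.
    factors: dict of factor name -> score for this (u, i) pair.
    """
    base = mu + P[u] @ Q[i]
    social = sum(weights[k] * factors[k] for k in weights)
    return base + social

P, Q = np.random.rand(5, 8) * 0.1, np.random.rand(7, 8) * 0.1
factors = {"personal_interest": 0.4, "interest_sim": 0.2,
           "behavior_sim": 0.1, "behavior_diffusion": 0.05}
weights = dict.fromkeys(factors, 0.25)
print(predict_rating(2, 3, P, Q, mu=3.6, factors=factors, weights=weights))
```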

Journal ArticleDOI
TL;DR: A no-reference sparse representation-based image sharpness index that is not sensitive to training images, so a universal dictionary can be used to evaluate the sharpness of images.
Abstract: Recent advances in sparse representation show that overcomplete dictionaries learned from natural images can capture high-level features for image analysis. Since atoms in the dictionaries are typically edge patterns and image blur is characterized by the spread of edges, an overcomplete dictionary can be used to measure the extent of blur. Motivated by this, this paper presents a no-reference sparse representation-based image sharpness index. An overcomplete dictionary is first learned using natural images. The blurred image is then represented using the dictionary in a block manner, and block energy is computed using the sparse coefficients. The sharpness score is defined as the variance-normalized energy over a set of selected high-variance blocks, which is achieved by normalizing the total block energy using the sum of block variances. The proposed method is not sensitive to training images, so a universal dictionary can be used to evaluate the sharpness of images. Experiments on six public image quality databases demonstrate the advantages of the proposed method.
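A sketch of the variance-normalized sparse-energy idea (block size, sparsity level, and the high-variance selection fraction are assumptions; the dictionary is assumed learned offline):

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def sharpness_index(img, dictionary, block=8, n_nonzero=5, top_frac=0.5):
    """Sparse-code high-variance blocks and return their total coefficient
    energy normalized by the sum of block variances (higher = sharper).

    dictionary: (n_atoms, block*block) overcomplete dictionary.
    """
    D = dictionary / (np.linalg.norm(dictionary, axis=1, keepdims=True) + 1e-12)
    h, w = img.shape
    patches, variances = [], []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            p = img[y:y + block, x:x + block].astype(np.float64).ravel()
            patches.append(p - p.mean())        # zero-mean patch
            variances.append(p.var())
    patches, variances = np.array(patches), np.array(variances)
    # Blur is most visible in textured areas: keep high-variance blocks.
    keep = np.argsort(-variances)[: max(1, int(top_frac * len(patches)))]
    coder = SparseCoder(dictionary=D, transform_algorithm="omp",
                        transform_n_nonzero_coefs=n_nonzero)
    codes = coder.transform(patches[keep])      # sparse coefficients
    return (codes ** 2).sum() / (variances[keep].sum() + 1e-12)
```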

Journal ArticleDOI
TL;DR: A data-driven approach is proposed as a baseline of crowd segmentation and estimation of crowd properties for the proposed dataset and extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches for crowd understanding.
Abstract: Crowd understanding has drawn increasing attention from the computer vision community, and its progress is driven by the availability of public crowd datasets. In this paper, we contribute a large-scale benchmark dataset collected from the Shanghai 2010 World Expo. It includes 2630 annotated video sequences captured by 245 surveillance cameras, far larger than any public dataset. It covers a large number of different scenes and is suitable for evaluating the performance of crowd segmentation and estimation of crowd density, collectiveness, and cohesiveness, all of which are universal properties of crowd systems. In total, 53,637 crowd segments are manually annotated with the three crowd properties. This dataset is released to the public to advance research on crowd understanding. The large-scale annotated dataset enables using data-driven approaches for crowd understanding. In this paper, a data-driven approach is proposed as a baseline for crowd segmentation and estimation of crowd properties on the proposed dataset. Novel global and local crowd features are designed to retrieve similar training scenes and to match spatio-temporal crowd patches so that the labels of the training scenes can be accurately transferred to the query image. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art approaches for crowd understanding.

Journal ArticleDOI
TL;DR: A novel cross-modal retrieval approach based on discriminative dictionary learning that is augmented with common label alignment that outperforms several state-of-the-art methods in terms of retrieval accuracy.
Abstract: Cross-modal retrieval has attracted much attention in recent years due to its widespread applications. In this area, how to capture and correlate heterogeneous features originating from different modalities remains a challenge. Most existing methods dealing with cross-modal learning focus only on learning relevant features shared by two distinct feature spaces, thereby overlooking their discriminative feature information. To remedy this issue and explicitly capture discriminative feature information, we propose a novel cross-modal retrieval approach based on discriminative dictionary learning that is augmented with common label alignment. Concretely, a discriminative dictionary is first learned for each modality, which boosts not only the discriminating capability of intra-modality data from different classes but also the relevance of inter-modality data in the same class. Subsequently, all the resulting sparse codes are simultaneously mapped to a common label space, where the cross-modal data samples are characterized and associated. In the label space, the discriminativeness and relevance of the considered cross-modal data can be further strengthened by enforcing a common label alignment. Finally, cross-modal retrieval is performed over the common label space. Experiments conducted on two public cross-modal datasets show that the proposed approach outperforms several state-of-the-art methods in terms of retrieval accuracy.

Journal ArticleDOI
TL;DR: The proposed spatiotemporal object proposal and patch verification framework outperforms the state-of-the-art methods, including the recent Faster-RCNN method, on animal object detection accuracy by up to 4.5%.
Abstract: In this paper, we consider the animal object detection and segmentation from wildlife monitoring videos captured by motion-triggered cameras, called camera-traps. For these types of videos, existing approaches often suffer from low detection rates due to low contrast between the foreground animals and the cluttered background, as well as high false positive rates due to the dynamic background. To address this issue, we first develop a new approach to generate animal object region proposals using multilevel graph cut in the spatiotemporal domain. We then develop a cross-frame temporal patch verification method to determine if these region proposals are true animals or background patches. We construct an efficient feature description for animal detection using joint deep learning and histogram of oriented gradient features encoded with Fisher vectors. Our extensive experimental results and performance comparisons over a diverse set of challenging camera-trap data demonstrate that the proposed spatiotemporal object proposal and patch verification framework outperforms the state-of-the-art methods, including the recent Faster-RCNN method, on animal object detection accuracy by up to 4.5%.
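A sketch of the patch-verification stage alone (the paper encodes HOG with Fisher vectors and joins it with deep features; plain HOG plus a linear classifier below is a simplification):

```python
import numpy as np
from skimage.feature import hog
from sklearn.linear_model import LogisticRegression

def patch_features(patch):
    """HOG description of one candidate region proposal (e.g., 64x64 gray)."""
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_verifier(animal_patches, background_patches):
    """Fit a classifier that decides whether a proposal is a true animal."""
    X = np.stack([patch_features(p)
                  for p in animal_patches + background_patches])
    y = np.array([1] * len(animal_patches) + [0] * len(background_patches))
    return LogisticRegression(max_iter=1000).fit(X, y)

# verifier.predict_proba(patch_features(p)[None])[..., 1] scores a proposal.
```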

Journal ArticleDOI
TL;DR: A comprehensive survey to characterize this fast-growing area and summarize the state-of-the-art techniques for analyzing social media data is presented and existing techniques are classified into two categories: gathering information and understanding user behaviors.
Abstract: The unprecedented availability of social media data offers substantial opportunities for data owners, system operators, solution providers, and end users to explore and understand social dynamics. However, the exponential growth in the volume, velocity, and variability of social media data prevents people from fully utilizing such data. Visual analytics, which is an emerging research direction, has received considerable attention in recent years. Many visual analytics methods have been proposed across disciplines to understand large-scale structured and unstructured social media data. This objective, however, also poses significant challenges for researchers to obtain a comprehensive picture of the area, understand research challenges, and develop new techniques. In this paper, we present a comprehensive survey to characterize this fast-growing area and summarize the state-of-the-art techniques for analyzing social media data. In particular, we classify existing techniques into two categories: gathering information and understanding user behaviors. We aim to provide a clear overview of the research area through the established taxonomy. We then explore the design space and identify the research trends. Finally, we discuss challenges and open questions for future studies.

Journal ArticleDOI
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.
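A toy trend-driven cache, scoring each video by its request count in a recent window and evicting the lowest-trending entry (the paper's online learner comes with regret guarantees that this sketch does not):

```python
from collections import defaultdict

class TrendCache:
    """Evict the cached video with the weakest recent request trend."""

    def __init__(self, capacity, window=1000):
        self.capacity, self.window = capacity, window
        self.cache = set()
        self.hits = defaultdict(list)        # video -> recent request times

    def request(self, video, t):
        self.hits[video].append(t)
        # Trend estimate: number of requests within the recent window.
        self.hits[video] = [x for x in self.hits[video]
                            if t - x <= self.window]
        if video in self.cache:
            return True                      # cache hit
        if len(self.cache) >= self.capacity:
            coldest = min(self.cache, key=lambda v: len(self.hits[v]))
            self.cache.discard(coldest)      # evict lowest-trend video
        self.cache.add(video)
        return False                         # cache miss

cache = TrendCache(capacity=2)
print([cache.request(v, t) for t, v in enumerate("ABABCA")])
```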

Journal ArticleDOI
TL;DR: A game theoretic resource allocation scheme for the media cloud to allocate resources to mobile social users through brokers; results show that each player in the game can obtain the optimal strategy where the Stackelberg equilibrium exists stably.
Abstract: Due to the rapid increases in both the population of mobile social users and the demand for quality of experience (QoE), providing mobile social users with satisfactory multimedia services has become an important issue. The media cloud has been shown to be an efficient solution to this issue, by allowing mobile social users to connect to it through a group of distributed brokers. However, as the resource in the media cloud is limited, how to allocate resources among the media cloud, brokers, and mobile social users becomes a new challenge. Therefore, in this paper, we propose a game theoretic resource allocation scheme for the media cloud to allocate resources to mobile social users through brokers. First, a framework of resource allocation among the media cloud, brokers, and mobile social users is presented. The media cloud can dynamically determine the price of the resource and allocate its resource to brokers. A mobile social user can select his broker to connect to the media cloud by adjusting his strategy to achieve the maximum revenue, based on the social features in the community. Next, we formulate the interactions among the media cloud, brokers, and mobile social users as a four-stage Stackelberg game. In addition, through the backward induction method, we propose an iterative algorithm to implement the proposed scheme and obtain the Stackelberg equilibrium. Finally, simulation results show that each player in the game can obtain the optimal strategy where the Stackelberg equilibrium exists stably.

Journal ArticleDOI
TL;DR: This paper proposes a novel active skeleton representation towards low latency human action recognition that is robust in calculating features related to joint positions, and effective in handling the unsegmented sequences.
Abstract: With the development of depth sensors, low latency 3D human action recognition has become increasingly important in various interaction systems, where response with minimal latency is a critical process. High latency not only significantly degrades the interaction experience of users, but also makes certain interaction systems, e.g., gesture control or electronic gaming, unattractive. In this paper, we propose a novel active skeleton representation towards low latency human action recognition. First, we encode each limb of the human skeleton into a state through a Markov random field. The active skeleton is then represented by aggregating the encoded features of individual limbs. Finally, we propose a multi-channel multiple instance learning with maximum-pattern-margin to further boost the performance of the existing model. Our method is robust in calculating features related to joint positions, and effective in handling unsegmented sequences. Experiments on the MSR Action3D, the MSR DailyActivity3D, and the Huawei/3DLife-2013 datasets demonstrate the effectiveness of the model with the proposed novel representation, and its superiority over the state-of-the-art low latency recognition approaches.

Journal ArticleDOI
TL;DR: This paper investigates the structure of social networks and develops an algorithm for network correlation-based social friend recommendation (NC-based SFR), which recommends friends more precisely than reference methods.
Abstract: Friend recommendation is an important recommender application in social media. Major social websites such as Twitter and Facebook are all capable of recommending friends to individuals. However, most of these websites use simple friend recommendation algorithms such as similarity, popularity, or “friend's friends are friends,” which are intuitive but consider few of the characteristics of the social network. In this paper we investigate the structure of social networks and develop an algorithm for network correlation-based social friend recommendation (NC-based SFR). To accomplish this goal, we correlate different “social role” networks, find their relationships and make friend recommendations. NC-based SFR is characterized by two key components: 1) related networks are aligned by selecting important features from each network, and 2) the network structure should be maximally preserved before and after network alignment. After important feature selection has been made, we recommend friends based on these features. We conduct experiments on the Flickr network, which contains more than ten thousand nodes and over 30 thousand tags covering half a million photos, to show that the proposed algorithm recommends friends more precisely than reference methods.

Journal ArticleDOI
TL;DR: A novel deep relative attributes (DRA) algorithm to learn visual features and the effective nonlinear ranking function to describe the RA of image pairs in a unified framework that consistently and significantly outperforms the state-of-the-art RA learning methods.
Abstract: Relative attribute (RA) learning aims to learn the ranking function describing the relative strength of an attribute. Most current approaches learn a linear ranking function for each attribute using hand-crafted visual features. In contrast, in this paper we propose a novel deep relative attributes (DRA) algorithm that learns visual features and an effective nonlinear ranking function to describe the RA of image pairs in a unified framework. Here, the visual features and the ranking function are learned jointly, and they can benefit each other. The proposed DRA model comprises five convolutional neural layers, five fully connected layers, and a relative loss function which contains the contrastive constraint and the similarity constraint corresponding to ordered and unordered image pairs, respectively. To train the DRA model effectively, we transfer knowledge from large-scale visual recognition on ImageNet [1] to the RA learning task. We evaluate the proposed DRA model on three widely used datasets. Extensive experimental results demonstrate that the proposed DRA model consistently and significantly outperforms the state-of-the-art RA learning methods. On the public OSR, PubFig, and Shoes datasets, compared with the previous RA learning results [2], the average ranking accuracies have been significantly improved by about 8%, 9%, and 14%, respectively.
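The two constraints can be sketched as a pairwise loss over scalar attribute strengths (the DRA model learns features jointly with this; the margin and the exact forms below are assumptions):

```python
import torch
import torch.nn.functional as F

def relative_attribute_loss(s_i, s_j, label, margin=1.0):
    """s_i, s_j: scalar attribute strengths from any ranking network.

    label: +1 if image i shows the attribute more strongly than j,
           -1 if less strongly, 0 if the pair is unordered (similar).
    """
    diff = s_i - s_j
    ordered = label != 0
    loss = diff.new_zeros(())
    if ordered.any():
        # Contrastive constraint: ordered pairs keep a margin in rank order.
        loss = loss + F.relu(margin - label[ordered].float() * diff[ordered]).mean()
    if (~ordered).any():
        # Similarity constraint: unordered pairs get nearly equal strengths.
        loss = loss + diff[~ordered].pow(2).mean()
    return loss

s_i = torch.randn(8, requires_grad=True)
s_j = torch.randn(8)
label = torch.tensor([1, -1, 0, 1, 0, -1, 1, 0])
relative_attribute_loss(s_i, s_j, label).backward()
```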

Journal ArticleDOI
TL;DR: A novel universal framework for salient object detection, which aims to enhance the performance of any existing saliency detection method with distance weighting, adaptive binarization, and morphological closing is proposed.
Abstract: In this paper, we propose a novel universal framework for salient object detection, which aims to enhance the performance of any existing saliency detection method. First, rough salient regions are extracted from any existing saliency detection model with distance weighting, adaptive binarization, and morphological closing. With the superpixel segmentation, a Bayesian decision model is adopted to refine the rough saliency map to obtain a more accurate saliency map. An iterative optimization method is designed to obtain better saliency results by exploiting the characteristics of the output saliency map each time. Through the iterative optimization process, the rough saliency map is updated step by step with better and better performance until an optimal saliency map is obtained. Experimental results on the public salient object detection datasets with ground truth demonstrate the promising performance of the proposed universal framework subjectively and objectively.
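A sketch of the first stage, rough salient-region extraction with distance weighting, adaptive binarization, and morphological closing (the center-bias weighting and structuring-element size are assumptions):

```python
import numpy as np
from scipy.ndimage import binary_closing
from skimage.filters import threshold_otsu

def rough_salient_region(sal_map, sigma=0.33):
    """Turn a saliency map from any existing model into a rough binary
    salient region."""
    h, w = sal_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Distance weighting: down-weight saliency far from the image center.
    d2 = ((ys - h / 2) / h) ** 2 + ((xs - w / 2) / w) ** 2
    weighted = sal_map * np.exp(-d2 / (2 * sigma ** 2))
    # Adaptive binarization with Otsu's threshold.
    mask = weighted > threshold_otsu(weighted)
    # Morphological closing fills small holes in the rough region.
    return binary_closing(mask, structure=np.ones((5, 5)))
```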

Journal ArticleDOI
TL;DR: A provably secure time-domain attribute-based encryption scheme by embedding the time into both the ciphertexts and the keys, such that only users who hold sufficient attributes in a specific time slot can decrypt the video contents.
Abstract: With the ever-increasing demands of multimedia applications, cloud computing, due to its economical but powerful resources, is becoming a natural platform to process, store, and share multimedia content. However, the employment of cloud computing also brings new security and privacy issues, as few public cloud servers can be fully trusted by users. In this paper, we focus on how to securely share video content with a certain group of people during a particular time period in cloud-based multimedia systems, and propose a cryptographic approach, a provably secure time-domain attribute-based access control (TAAC) scheme, to secure cloud-based video content sharing. Specifically, we first propose a provably secure time-domain attribute-based encryption scheme by embedding the time into both the ciphertexts and the keys, such that only users who hold sufficient attributes in a specific time slot can decrypt the video contents. We also propose an efficient attribute updating method to achieve the dynamic change of users' attributes, including granting new attributes, revoking previous attributes, and regranting previously revoked attributes. We further discuss how to control video contents that can be commonly accessed in multiple time slots and how to make special queries on video contents generated in previous time slots. The security analysis and performance evaluation show that TAAC is provably secure in the generic group model and efficient in practice.

Journal ArticleDOI
TL;DR: This paper presents a novel and effective approach to automatic 3D/4D facial expression recognition based on the muscular movement model (MMM), which automatically segments the input 3D face by localizing the corresponding points within each muscular region of the reference using iterative closest normal point.
Abstract: Facial expression is an important channel for human nonverbal communication. This paper presents a novel and effective approach to automatic 3D/4D facial expression recognition based on the muscular movement model (MMM). In contrast to most existing methods, the MMM deals with this issue from the viewpoint of anatomy. It first automatically segments the input 3D face (frame) by localizing the corresponding points within each muscular region of the reference using iterative closest normal point. A set of features with multiple differential quantities, including coordinate, normal, and shape index values, is then extracted to describe the geometry deformation of each segmented region. Meanwhile, we analyze the importance of these muscular areas, and a score-level fusion strategy is exploited to optimize their weights by a genetic algorithm in the learning step. The support vector machine and the hidden Markov model are finally used to predict the expression label in 3D and 4D, respectively. The experiments are conducted on the BU-3DFE and BU-4DFE databases, and the results achieved clearly demonstrate the effectiveness of the proposed method.