
Showing papers in "IEEE Transactions on Multimedia in 2014"


Journal ArticleDOI
TL;DR: This paper proposes to learn affect-salient features for SER using convolutional neural networks (CNN), and shows that this approach leads to stable and robust recognition performance in complex scenes and outperforms several well-established SER features.
Abstract: As an essential way of human emotional behavior understanding, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER heavily depends on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, LIF is used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.

479 citations
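The two-stage training above rests on a reconstruction-penalized sparse objective in the unsupervised stage. Below is a minimal numpy sketch of that kind of objective, not the paper's exact SAE variant; the data shapes, weights, and the sparsity weight are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 200 unlabeled feature patches of 64 dims,
# encoded into 32 hidden units.
X = rng.standard_normal((200, 64))
W_enc = rng.standard_normal((64, 32)) * 0.1
W_dec = rng.standard_normal((32, 64)) * 0.1

def sae_objective(X, W_enc, W_dec, sparsity_weight=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the hidden codes."""
    H = np.maximum(X @ W_enc, 0.0)      # ReLU encoder
    X_hat = H @ W_dec                   # linear decoder
    recon = np.mean((X - X_hat) ** 2)   # reconstruction penalization
    sparsity = np.mean(np.abs(H))       # encourages sparse, local features
    return recon + sparsity_weight * sparsity

loss = sae_objective(X, W_enc, W_dec)
```

Minimizing this objective over the weights (by gradient descent in practice) yields the local invariant features that the second, supervised stage then refines.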


Journal ArticleDOI
TL;DR: The focus of this article is on the issue of reliability and the use of video quality assessment as an example for the proposed best practices, showing that the recommended two-stage QoE crowdtesting design leads to more reliable results.
Abstract: Quality of Experience (QoE) in multimedia applications is closely linked to the end users' perception and therefore its assessment requires subjective user studies in order to evaluate the degree of delight or annoyance as experienced by the users. QoE crowdtesting refers to QoE assessment using crowdsourcing, where anonymous test subjects conduct subjective tests remotely in their preferred environment. The advantages of QoE crowdtesting lie not only in the reduced time and costs for the tests, but also in a large and diverse panel of international, geographically distributed users in realistic user settings. However, conceptual and technical challenges emerge due to the remote test settings. Key issues arising from QoE crowdtesting include the reliability of user ratings, the influence of incentives, payment schemes and the unknown environmental context of the tests on the results. In order to counter these issues, strategies and methods need to be developed, included in the test design, and also implemented in the actual test campaign, while statistical methods are required to identify reliable user ratings and to ensure high data quality. This contribution therefore provides a collection of best practices addressing these issues based on our experience gained in a large set of conducted QoE crowdtesting studies. The focus of this article is in particular on the issue of reliability and we use video quality assessment as an example for the proposed best practices, showing that our recommended two-stage QoE crowdtesting design leads to more reliable results.

278 citations
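A core reliability-screening step in crowdtesting is rejecting workers whose ratings do not track the consensus. The sketch below illustrates one common screening heuristic (leave-one-out correlation against the mean opinion score); it is an illustrative stand-in, not the paper's recommended two-stage design, and all the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 8 workers rate 20 video clips on a 1-5 scale.
true_quality = rng.uniform(1, 5, size=20)
ratings = np.clip(true_quality + rng.normal(0, 0.3, size=(8, 20)), 1, 5)
ratings[7] = 6 - ratings[0]   # one "worker" who inverts the rating scale

def reliable_workers(ratings, min_corr=0.5):
    """Keep workers whose ratings correlate with the leave-one-out mean opinion score."""
    keep = []
    for i in range(len(ratings)):
        others = np.delete(ratings, i, axis=0).mean(axis=0)  # MOS without worker i
        if np.corrcoef(ratings[i], others)[0, 1] >= min_corr:
            keep.append(i)
    return keep

kept = reliable_workers(ratings)
```

Only ratings from the kept workers would then enter the final quality statistics.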


Journal ArticleDOI
TL;DR: This correspondence identifies usable bits suitable for data hiding so that the encrypted bitstream carrying secret data can be correctly decoded and achieve a perfect data extraction and image recovery.
Abstract: This correspondence proposes a framework of reversible data hiding (RDH) in an encrypted JPEG bitstream. Unlike existing RDH methods for encrypted spatial-domain images, the proposed method aims at encrypting a JPEG bitstream into a properly organized structure, and embedding a secret message into the encrypted bitstream by slightly modifying the JPEG stream. We identify usable bits suitable for data hiding so that the encrypted bitstream carrying secret data can be correctly decoded. The secret message bits are encoded with error correction codes to achieve a perfect data extraction and image recovery. The encryption and embedding are controlled by encryption and embedding keys respectively. If a receiver has both keys, the secret bits can be extracted by analyzing the blocking artifacts of the neighboring blocks, and the original bitstream perfectly recovered. In case the receiver only has the encryption key, he/she can still decode the bitstream to obtain the image with good quality without extracting the hidden data.

229 citations
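The abstract notes that the secret message bits are protected with error correction codes before embedding. The paper does not specify the code, so as a hedged illustration here is the classic (7,4) Hamming code, which corrects a single flipped bit per codeword:

```python
import numpy as np

# Generator and parity-check matrices for the (7,4) Hamming code over GF(2).
G = np.array([[1,0,0,0,1,1,0],
              [0,1,0,0,1,0,1],
              [0,0,1,0,0,1,1],
              [0,0,0,1,1,1,1]])
H = np.array([[1,1,0,1,1,0,0],
              [1,0,1,1,0,1,0],
              [0,1,1,1,0,0,1]])

def encode(msg4):
    return (msg4 @ G) % 2

def decode(code7):
    """Correct up to one flipped bit, then return the 4 message bits."""
    syndrome = (H @ code7) % 2
    if syndrome.any():
        # The syndrome equals the column of H at the error position.
        err = int(np.where((H.T == syndrome).all(axis=1))[0][0])
        code7 = code7.copy()
        code7[err] ^= 1
    return code7[:4]

msg = np.array([1, 0, 1, 1])
cw = encode(msg)
cw_noisy = cw.copy(); cw_noisy[2] ^= 1   # one bit corrupted during extraction
recovered = decode(cw_noisy)
```

In the RDH setting, such redundancy is what allows "perfect data extraction" even when a few embedded bits are disturbed.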


Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach outperforms state-of-the-art weakly-supervised image segmentation methods on five popular segmentation data sets and performs competitively with fully-supervised segmentation models.
Abstract: Weakly-supervised image segmentation is a challenging problem with multidisciplinary applications in multimedia content analysis and beyond. It aims to segment an image by leveraging its image-level semantics (i.e., tags). This paper presents a weakly-supervised image segmentation algorithm that learns the distribution of spatially structural superpixel sets from image-level labels. More specifically, we first extract graphlets from a given image, which are small-sized graphs consisting of superpixels and encapsulating their spatial structure. Then, an efficient manifold embedding algorithm is proposed to transfer labels from training images into graphlets. It is further observed that there are numerous redundant graphlets that are not discriminative to semantic categories, which are abandoned by a graphlet selection scheme as they make no contribution to the subsequent segmentation. Thereafter, we use a Gaussian mixture model (GMM) to learn the distribution of the selected post-embedding graphlets (i.e., vectors output from the graphlet embedding). Finally, we propose an image segmentation algorithm, termed representative graphlet cut, which leverages the learned GMM prior to measure the structure homogeneity of a test image. Experimental results show that the proposed approach outperforms state-of-the-art weakly-supervised image segmentation methods on five popular segmentation data sets. Besides, our approach performs competitively with fully-supervised segmentation models.

224 citations


Journal ArticleDOI
TL;DR: A statistical analysis of the properties of LcR is given together with experimental results on some public face databases and surveillance images to show the superiority of the proposed scheme over state-of-the-art face hallucination approaches.
Abstract: Recently, position-patch based approaches have been proposed to replace the probabilistic graph-based or manifold learning-based models for face hallucination. In order to obtain the optimal weights for face hallucination, these approaches represent one image patch through other patches at the same position of training faces by employing least square estimation or sparse coding. However, they cannot provide unbiased approximations or satisfy rational priors, and thus the obtained representation is not satisfactory. In this paper, we propose a simpler yet more effective scheme called Locality-constrained Representation (LcR). Compared with Least Square Representation (LSR) and Sparse Representation (SR), our scheme incorporates a locality constraint into the least square inversion problem to maintain locality and sparsity simultaneously. Our scheme is capable of capturing the non-linear manifold structure of image patch samples while exploiting the sparse property of the redundant data representation. Moreover, when the locality constraint is satisfied, face hallucination is robust to noise, a property that is desirable for video surveillance applications. A statistical analysis of the properties of LcR is given together with experimental results on some public face databases and surveillance images to show the superiority of our proposed scheme over state-of-the-art face hallucination approaches.

218 citations
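The locality constraint described above can be written as a least-squares problem in which each training patch's weight is penalized in proportion to its distance from the input patch. The sketch below is a minimal unconstrained variant of that idea (the paper's exact formulation, including any sum-to-one constraint, may differ); all data are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 50 training patches (columns) of 16 dims, one input patch x
# that lies close to training patch 3.
Y = rng.standard_normal((16, 50))
x = Y[:, 3] + 0.05 * rng.standard_normal(16)

def lcr_weights(x, Y, lam=0.1):
    """Locality-constrained least squares: each weight is ridge-penalized in
    proportion to its patch's squared distance from x, so nearby patches
    dominate the reconstruction (locality and sparsity simultaneously)."""
    d = np.linalg.norm(Y - x[:, None], axis=0)    # locality adaptor
    A = Y.T @ Y + lam * np.diag(d ** 2)
    return np.linalg.solve(A, Y.T @ x)

w = lcr_weights(x, Y)
```

Because patch 3 is far closer to `x` than the rest, its weight dominates and the reconstruction error stays small, which is exactly the behavior the locality constraint is meant to enforce.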


Journal ArticleDOI
TL;DR: A fast pyramid motion divergence (PMD) based CU selection algorithm is presented for HEVC inter prediction, and theoretical analysis shows that PMD can be used to help select the CU size.
Abstract: The newly developed HEVC video coding standard can achieve higher compression performance than the previous video coding standards, such as MPEG-4, H.263 and H.264/AVC. However, HEVC's high computational complexity raises concerns about the computational burden in real-time applications. In this paper, a fast pyramid motion divergence (PMD) based CU selection algorithm is presented for HEVC inter prediction. The PMD features are calculated with estimated optical flow of the downsampled frames. Theoretical analysis shows that PMD can be used to help select the CU size. A k-nearest-neighbor-like method is used to determine the CU splittings. Experimental results show that the fast inter prediction method speeds up the inter coding significantly with negligible loss of the peak signal-to-noise ratio.

218 citations
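The k-nearest-neighbor-like split decision can be pictured with a toy example: given PMD values of previously coded CUs and whether each was split, a new CU's decision is a majority vote among its nearest neighbors. This is an illustrative sketch with made-up scalar PMD values, not the paper's feature set.

```python
import numpy as np

# Hypothetical training data: a scalar PMD feature per CU and whether it was
# split. High motion divergence tends to favor splitting into smaller CUs.
pmd_train   = np.array([0.1, 0.2, 0.15, 0.9, 1.1, 1.0, 0.95, 0.05])
split_train = np.array([0,   0,   0,    1,   1,   1,   1,    0])

def knn_split_decision(pmd, k=3):
    """Vote among the k training samples nearest in PMD distance."""
    idx = np.argsort(np.abs(pmd_train - pmd))[:k]
    return int(split_train[idx].sum() * 2 > k)

decision = knn_split_decision(0.85)   # high divergence -> split
```

Skipping the exhaustive rate-distortion search for CU sizes that the classifier confidently rules out is what yields the reported encoding speed-up.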


Journal ArticleDOI
TL;DR: A novel self-learning based image decomposition framework that automatically determines the undesirable patterns in the derived image components directly from the input image, so that single-image denoising can be addressed; it is shown to outperform state-of-the-art image denoising algorithms.
Abstract: Decomposition of an image into multiple semantic components has been an effective research topic for various image processing applications such as image denoising, enhancement, and inpainting. In this paper, we present a novel self-learning based image decomposition framework. Based on the recent success of sparse representation, the proposed framework first learns an over-complete dictionary from the high spatial frequency parts of the input image for reconstruction purposes. We perform unsupervised clustering on the observed dictionary atoms (and their corresponding reconstructed image versions) via affinity propagation, which allows us to identify image-dependent components with similar context information. When applying the proposed method to image denoising, we are able to automatically determine the undesirable patterns (e.g., rain streaks or Gaussian noise) from the derived image components directly from the input image, so that the task of single-image denoising can be addressed. Different from prior image processing works with sparse representation, our method does not need to collect training image data in advance, nor do we assume image priors such as the relationship between input and output image dictionaries. We conduct experiments on two denoising problems: single-image denoising with Gaussian noise and rain removal. Our empirical results confirm the effectiveness and robustness of our approach, which is shown to outperform state-of-the-art image denoising algorithms.

215 citations


Journal ArticleDOI
TL;DR: This paper proposes to combine the human pose estimation module, the MRF-based color and category inference module and the (super)pixel-level category classifier learning module to generate multiple well-performing category classifiers, which can be directly applied to parse the fashion items in the images.
Abstract: In this paper we address the problem of automatically parsing the fashion images with weak supervision from the user-generated color-category tags such as “red jeans” and “white T-shirt”. This problem is very challenging due to the large diversity of fashion items and the absence of pixel-level tags, which make the traditional fully supervised algorithms inapplicable. To solve the problem, we propose to combine the human pose estimation module, the MRF-based color and category inference module and the (super)pixel-level category classifier learning module to generate multiple well-performing category classifiers, which can be directly applied to parse the fashion items in the images. Besides, all the training images are parsed with color-category labels and the human poses of the images are estimated during the model learning phase in this work. We also construct a new fashion image dataset called Colorful-Fashion, in which all 2,682 images are labeled with pixel-level color-category labels. Extensive experiments on this dataset clearly show the effectiveness of the proposed method for the weakly supervised fashion parsing task.

199 citations


Journal ArticleDOI
TL;DR: A novel multi-view hypergraph-based learning (MHL) method that adaptively integrates click data with varied visual features and outperforms state-of-the-art image re-ranking methods.
Abstract: Image re-ranking is effective in improving performance of text-based image searches. However, improvements from existing re-ranking algorithms are limited by two factors: one is that the associated textual information of images often mismatches their actual visual contents; the other is that visual features cannot accurately describe the semantic similarities between images. In this paper, we adopt click data to bridge the semantic gap. We propose a novel multi-view hypergraph-based learning (MHL) method that adaptively integrates click data with varied visual features. In particular, MHL considers pairwise discriminative constraints from click data to maximally distinguish images with high click counts from images with no click counts, and a semantic manifold is constructed. It then adopts hypergraph learning to build multiple manifolds from varied visual features. Finally, MHL integrates the semantic manifold with visual manifolds through an iterative optimization procedure. The weights of different manifolds and the re-ranking score are simultaneously obtained after using this optimization strategy. We conduct experiments on real world datasets and the results demonstrate that MHL outperforms state-of-the-art image re-ranking methods.

171 citations


Journal ArticleDOI
TL;DR: An in-depth analysis of the state-of-the-art framework of VLAD and Product Quantization proposed by Jegou is made, leading to an enhanced framework that significantly outperforms the previous best reported accuracy results on standard benchmarks and is more efficient.
Abstract: This paper deals with content-based large-scale image retrieval using the state-of-the-art framework of VLAD and Product Quantization proposed by Jegou as a starting point. Demonstrating an excellent accuracy-efficiency trade-off, this framework has attracted increased attention from the community and numerous extensions have been proposed. In this work, we make an in-depth analysis of the framework that aims at increasing our understanding of its different processing steps and boosting its overall performance. Our analysis involves the evaluation of numerous extensions (both existing and novel) as well as the study of the effects of several unexplored parameters. We specifically focus on: a) employing more efficient and discriminative local features; b) improving the quality of the aggregated representation; and c) optimizing the indexing scheme. Our thorough experimental evaluation provides new insights into extensions that consistently contribute, and others that do not, to performance improvement, and sheds light onto the effects of previously unexplored parameters of the framework. As a result, we develop an enhanced framework that significantly outperforms the previous best reported accuracy results on standard benchmarks and is more efficient.

158 citations
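For readers unfamiliar with the VLAD representation that this analysis builds on, a compact numpy sketch follows: local descriptors are assigned to their nearest visual word, the residuals are summed per word, and the concatenated vector is power- and L2-normalized. This is the textbook baseline, not the enhanced framework the paper develops, and the sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def vlad(descriptors, centroids):
    """Aggregate local descriptors into a VLAD vector: per-centroid sums of
    residuals to the nearest centroid, power-normalized then L2-normalized."""
    k, d = centroids.shape
    assign = np.argmin(((descriptors[:, None, :] - centroids[None]) ** 2).sum(-1),
                       axis=1)
    v = np.zeros((k, d))
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centroids[c]
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))   # power normalization (alpha = 0.5)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

descriptors = rng.standard_normal((100, 8))   # e.g. 100 local features of dim 8
centroids = rng.standard_normal((4, 8))       # a K=4 visual vocabulary
v = vlad(descriptors, centroids)              # length K*d = 32
```

The paper's extensions then vary the local features feeding this encoder, the aggregation itself, and the product-quantization index built on top of the resulting vectors.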


Journal ArticleDOI
TL;DR: This paper proposes a suite of measurement techniques to evaluate the QoS of cloud gaming systems and shows that OnLive performs better, because it provides adaptable frame rates, better graphic quality, and shorter server processing delays, while consuming less network bandwidth.
Abstract: Cloud gaming, i.e., real-time game playing via thin clients, relieves users from being forced to upgrade their computers and resolve the incompatibility issues between games and computers. As a result, cloud gaming is generating a great deal of interest among entrepreneurs, venture capitalists, the general public, and researchers. However, given the large design space, it is not yet known which cloud gaming system delivers the best user-perceived Quality of Service (QoS) and what design elements constitute a good cloud gaming system. This study is motivated by the question: How good is the QoS of current cloud gaming systems? Answering the question is challenging because most cloud gaming systems are proprietary and closed, and thus their internal mechanisms are not accessible for the research community. In this paper, we propose a suite of measurement techniques to evaluate the QoS of cloud gaming systems and prove the effectiveness of our schemes using a case study comprising two well-known cloud gaming systems: OnLive and StreamMyGame. Our results show that OnLive performs better, because it provides adaptable frame rates, better graphic quality, and shorter server processing delays, while consuming less network bandwidth. Our measurement techniques are general and can be applied to any cloud gaming system, so that researchers, users, and service providers may systematically quantify the QoS of these systems. To the best of our knowledge, the proposed suite of measurement techniques has never been presented in the literature.

Journal ArticleDOI
TL;DR: This paper surveys the emerging paradigm of cloud mobile media and proposes design principles for cloud-based mobile media using a concrete case study: a cloud-centric media platform (CCMP) developed at Nanyang Technological University.
Abstract: This paper surveys the emerging paradigm of cloud mobile media. We start with two alternative perspectives for cloud mobile media networks: an end-to-end view and a layered view. Summaries of existing research in this area are organized according to the layered service framework: i) cloud resource management and control in infrastructure-as-a-service (IaaS), ii) cloud-based media services in platform-as-a-service (PaaS), and iii) novel cloud-based systems and applications in software-as-a-service (SaaS). We further substantiate our proposed design principles for cloud-based mobile media using a concrete case study: a cloud-centric media platform (CCMP) developed at Nanyang Technological University. Finally, this paper concludes with an outlook of open research problems for realizing the vision of cloud-based mobile media.

Journal ArticleDOI
TL;DR: Subjective evaluations demonstrate that this approach outperforms several representative photo cropping methods, including the previous cropping model that is guided by semantics-free graphlets, and the visualized graphlets explicitly capture photo semantics and global spatial configurations.
Abstract: Photo cropping is widely used in the printing industry, photography, and cinematography. Conventional photo cropping methods suffer from three drawbacks: 1) the semantics used to describe photo aesthetics are determined by the experience of model designers and specific data sets, 2) image global configurations, an essential cue to capture photo aesthetics, are not well preserved in the cropped photo, and 3) multi-channel visual features from an image region contribute differently to human aesthetics, but state-of-the-art photo cropping methods cannot automatically weight them. Owing to the recent progress in the image retrieval community, image-level semantics, i.e., photo labels obtained without much human supervision, can be efficiently and effectively acquired. Thus, we propose weakly supervised photo cropping, where a manifold embedding algorithm is developed to incorporate image-level semantics and image global configurations with graphlets, i.e., small-sized connected subgraphs. After manifold embedding, a Bayesian Network (BN) is proposed. It incorporates the testing photo into the framework derived from the multi-channel post-embedding graphlets of the training data, the importance of which is determined automatically. Based on the BN, photo cropping can be cast as searching the candidate cropped photo that maximally preserves graphlets from the training photos, and the optimal cropping parameter is inferred by Gibbs sampling. Subjective evaluations demonstrate that: 1) our approach outperforms several representative photo cropping methods, including our previous cropping model that is guided by semantics-free graphlets, and 2) the visualized graphlets explicitly capture photo semantics and global spatial configurations.

Journal ArticleDOI
TL;DR: A simple method is proposed and demonstrated to explain the figure of merit (FoM) of a music information retrieval system evaluated in a dataset, specifically, whether the FoM comes from the system using characteristics confounded with the “ground truth” of the dataset.
Abstract: We propose and demonstrate a simple method to explain the figure of merit (FoM) of a music information retrieval (MIR) system evaluated in a dataset, specifically, whether the FoM comes from the system using characteristics confounded with the “ground truth” of the dataset. Akin to the controlled experiments designed to test the supposed mathematical ability of the famous horse “Clever Hans,” we perform two experiments to show how three state-of-the-art MIR systems produce excellent FoM in spite of not using musical knowledge. This provides avenues for improving MIR systems, as well as their evaluation. We make available a reproducible research package so that others can apply the same method to evaluating other MIR systems.

Journal ArticleDOI
TL;DR: A computational framework for the automatic prediction of hirability is presented; psychometric questionnaires often used in the personnel selection process were found unable to predict hirability, suggesting that hirability impressions were formed from the interaction during the interview rather than from questionnaire data.
Abstract: Understanding the basis on which recruiters form hirability impressions for a job applicant is a key issue in organizational psychology and can be addressed as a social computing problem. We approach the problem from a face-to-face, nonverbal perspective where behavioral feature extraction and inference are automated. This paper presents a computational framework for the automatic prediction of hirability. To this end, we collected an audio-visual dataset of real job interviews where candidates were applying for a marketing job. We automatically extracted audio and visual behavioral cues related to both the applicant and the interviewer. We then evaluated several regression methods for the prediction of hirability scores and showed the feasibility of conducting such a task, with ridge regression explaining 36.2% of the variance. Feature groups were analyzed, and two main groups of behavioral cues were predictive of hirability: applicant audio features and interviewer visual cues, showing the predictive validity of cues related not only to the applicant, but also to the interviewer. As a last step, we analyzed the predictive validity of psychometric questionnaires often used in the personnel selection process, and found that these questionnaires were unable to predict hirability, suggesting that hirability impressions were formed based on the interaction during the interview rather than on questionnaire data.
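The key quantitative result above is ridge regression explaining 36.2% of the variance in hirability scores. As a hedged illustration of that methodology (closed-form ridge fit plus explained variance), here is a sketch on synthetic stand-in data; the feature count, noise level, and regularizer are all made up.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical stand-in: 60 interviews, 5 behavioral cues, a noisy hirability score.
X = rng.standard_normal((60, 5))
beta_true = np.array([1.0, -0.5, 0.0, 0.8, 0.0])
y = X @ beta_true + rng.normal(0, 0.7, 60)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: beta = (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta = ridge_fit(X, y)
# Fraction of variance explained (R^2) by the fitted model.
r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

In the paper this R-squared is computed on held-out interviews; the synthetic in-sample version above only shows the mechanics.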

Journal ArticleDOI
TL;DR: This paper proposes an algorithm for pairwise and multi-view range image registration that is accurate and robust to small overlaps, noise, and varying mesh resolutions.
Abstract: Range image registration is a fundamental research topic for 3D object modeling and recognition. In this paper, we propose an accurate and robust algorithm for pairwise and multi-view range image registration. We first extract a set of Rotational Projection Statistics (RoPS) features from a pair of range images, and perform feature matching between them. The two range images are then registered using a transformation estimation method and a variant of the Iterative Closest Point (ICP) algorithm. Based on the pairwise registration algorithm, we propose a shape growing based multi-view registration algorithm. The seed shape is initialized with a selected range image and then sequentially updated by performing pairwise registration between itself and the input range images. All input range images are iteratively registered during the shape growing process. Extensive experiments were conducted to test the performance of our algorithm. The proposed pairwise registration algorithm is accurate, and robust to small overlaps, noise and varying mesh resolutions. The proposed multi-view registration algorithm is also very accurate. Rigorous comparisons with the state-of-the-art show the superiority of our algorithm.
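The transformation estimation inside each ICP-style iteration has a well-known closed form: given corresponding point sets, the least-squares rotation and translation follow from an SVD of the cross-covariance (the Kabsch solution). The sketch below shows that single step on synthetic points with known correspondences; the full pipeline (RoPS matching, iterative re-matching, shape growing) is of course much more involved.

```python
import numpy as np

rng = np.random.default_rng(5)

def rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping point set P onto Q,
    via SVD of the centered cross-covariance (the step inside each ICP iteration)."""
    cp, cq = P.mean(0), Q.mean(0)
    M = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflections
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t

# Synthetic check: transform a toy point set by a known rotation/translation.
P = rng.standard_normal((30, 3))
a = 0.4
R_true = np.array([[np.cos(a), -np.sin(a), 0],
                   [np.sin(a),  np.cos(a), 0],
                   [0, 0, 1]])
t_true = np.array([0.5, -0.2, 1.0])
Q = P @ R_true.T + t_true
R, t = rigid_transform(P, Q)
```

With noiseless exact correspondences the estimate recovers the true transform to machine precision; ICP alternates this step with re-estimating the correspondences.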

Journal ArticleDOI
TL;DR: A robust transfer video indexing (RTVI) model is developed, equipped with a novel sample-specific robust loss function, which employs the confidence score of a Web image as prior knowledge to suppress the influence and control the contribution of this image in the learning process.
Abstract: Semantic video indexing, also known as video annotation or video concept detection in the literature, has been attracting significant attention in recent years. Due to the deficiency of labeled training videos, most of the existing approaches can hardly achieve satisfactory performance. In this paper, we propose a novel semantic video indexing approach, which exploits the abundant user-tagged Web images to help learn robust semantic video indexing classifiers. The following two major challenges are well studied: 1) noisy Web images with imprecise and/or incomplete tags; and 2) domain difference between images and videos. Specifically, we first apply a non-parametric approach to estimate the probabilities of images being correctly tagged as confidence scores. We then develop a robust transfer video indexing (RTVI) model to learn reliable classifiers from a limited number of training videos together with the abundance of user-tagged images. The RTVI model is equipped with a novel sample-specific robust loss function, which employs the confidence score of a Web image as prior knowledge to suppress the influence and control the contribution of this image in the learning process. Meanwhile, the RTVI model discovers an optimal kernel space, in which the mismatch between images and videos is minimized for tackling the domain difference problem. Besides, we devise an iterative algorithm to effectively optimize the proposed RTVI model and a theoretical analysis on the convergence of the proposed algorithm is provided as well. Extensive experiments on various real-world multimedia collections demonstrate the effectiveness of the proposed robust semantic video indexing approach.

Journal ArticleDOI
TL;DR: The experimental results show that SM2H outperforms other methods in terms of mAP and Percentage on two real-world data sets.
Abstract: Learning hash functions across heterogenous high-dimensional features is very desirable for many applications involving multi-modal data objects. In this paper, we propose an approach to obtain the sparse codesets for the data objects across different modalities via joint multi-modal dictionary learning, which we call sparse multi-modal hashing (abbreviated as SM2H). In SM2H, both intra-modality similarity and inter-modality similarity are first modeled by a hypergraph, then multi-modal dictionaries are jointly learned by Hypergraph Laplacian sparse coding. Based on the learned dictionaries, the sparse codeset of each data object is acquired and used for multi-modal approximate nearest neighbor retrieval using a sensitive Jaccard metric. The experimental results show that SM2H outperforms other methods in terms of mAP and Percentage on two real-world data sets.
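Retrieval over sparse codesets hinges on a set-style similarity between codes. The paper's "sensitive Jaccard metric" is its own construction; as a simpler hedged stand-in, here is plain Jaccard similarity on the nonzero supports of two sparse codes (the example codes are made up):

```python
import numpy as np

def jaccard_on_supports(a, b):
    """Jaccard similarity between the nonzero supports of two sparse codes."""
    sa, sb = set(np.nonzero(a)[0]), set(np.nonzero(b)[0])
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical sparse codes for an image and a text object over a shared dictionary.
code_img  = np.array([0.0, 1.2, 0.0, 0.7, 0.0, 0.3])
code_text = np.array([0.0, 0.9, 0.0, 0.0, 0.4, 0.2])

sim = jaccard_on_supports(code_img, code_text)   # shared atoms {1,5} of {1,3,4,5}
```

Because both modalities are coded over jointly learned dictionaries, overlapping supports indicate semantic proximity across modalities, which is what makes this kind of metric usable for cross-modal retrieval.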

Journal ArticleDOI
TL;DR: An efficient implementation based on the K-singular value decomposition (K-SVD) algorithm, in which the exact SVD computation is replaced with a much faster approximation and the straightforward orthogonal matching pursuit algorithm is employed, which is better suited to the proposed self-example-learning-based sparse reconstruction with far fewer signals.
Abstract: In this paper, we propose a novel algorithm for fast single image super-resolution based on self-example learning and sparse representation. We propose an efficient implementation based on the K-singular value decomposition (SVD) algorithm, where we replace the exact SVD computation with a much faster approximation, and we employ the straightforward orthogonal matching pursuit algorithm, which is more suitable for our proposed self-example-learning-based sparse reconstruction with far fewer signals. The patches used for dictionary learning are efficiently sampled from the low-resolution input image itself using our proposed sample mean square error strategy, without an external training set containing a large collection of high-resolution images. Moreover, the l0-optimization-based criterion, which is much faster than l1-optimization-based relaxation, is applied to both the dictionary learning and reconstruction phases. Compared with other super-resolution reconstruction methods, our low-dimensional dictionary is a more compact representation of patch pairs and it is capable of learning global and local information jointly, thereby reducing the computational cost substantially. Our algorithm can generate high-resolution images that have similar quality to other methods but with a more than hundredfold increase in computational efficiency.
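The orthogonal matching pursuit (OMP) step mentioned above is simple enough to sketch in full: greedily select the dictionary atom most correlated with the current residual, re-fit all selected coefficients by least squares, and repeat. The dictionary below (identity plus a normalized Hadamard basis) is a made-up low-coherence example, not a learned patch dictionary.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Orthogonal matching pursuit: greedily pick the atom most correlated with
    the residual, then re-fit all selected coefficients by least squares."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    coef[support] = sol
    return coef

# A two-ortho-basis dictionary (identity + normalized Hadamard), coherence 1/4,
# low enough that OMP provably recovers any 2-sparse signal exactly.
H = np.array([[1.0]])
for _ in range(4):
    H = np.block([[H, H], [H, -H]])        # 16x16 Hadamard matrix
D = np.hstack([np.eye(16), H / 4.0])       # 32 unit-norm atoms in R^16
x = 2.0 * D[:, 5] - 1.5 * D[:, 20]         # a 2-sparse signal
coef = omp(D, x, n_nonzero=2)
```

In the super-resolution pipeline, the same greedy pursuit (under the l0 criterion) codes each low-resolution patch over the learned self-example dictionary.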

Journal ArticleDOI
TL;DR: This paper proposes a multi-level 3-D shape feature extraction framework by using deep learning, where low-level shape descriptors are first encoded into geometric bag-of-words, from which middle-level patterns are discovered to explore geometric relationships among words.
Abstract: 3-D shape analysis has attracted extensive research efforts in recent years, where the major challenge lies in designing an effective high-level 3-D shape feature. In this paper, we propose a multi-level 3-D shape feature extraction framework by using deep learning. The low-level 3-D shape descriptors are first encoded into geometric bag-of-words, from which middle-level patterns are discovered to explore geometric relationships among words. After that, high-level shape features are learned via deep belief networks, which are more discriminative for the tasks of shape classification and retrieval. Experiments on 3-D shape recognition and retrieval demonstrate the superior performance of the proposed method in comparison to the state-of-the-art methods.

Journal ArticleDOI
TL;DR: Topology aggregation and link summarization methods for efficiently acquiring network topology and state information are proposed, together with a general optimization framework for flow-based end-to-end QoS provision over multi-domain networks.
Abstract: This paper presents novel QoS extensions to distributed control plane architectures for multimedia delivery over large-scale, multi-operator Software Defined Networks (SDNs). We foresee that large-scale SDNs shall be managed by a distributed control plane consisting of multiple controllers, where each controller performs optimal QoS routing within its domain and shares summarized (aggregated) QoS routing information with other domain controllers to enable inter-domain QoS routing with reduced problem dimensionality. To this effect, this paper proposes (i) topology aggregation and link summarization methods to efficiently acquire network topology and state information, (ii) a general optimization framework for flow-based end-to-end QoS provision over multi-domain networks, and (iii) two distributed control plane designs by addressing the messaging between controllers for scalable and secure inter-domain QoS routing. We apply these extensions to streaming of layered videos and compare the performance of different control planes in terms of received video quality, communication cost and memory overhead. Our experimental results show that the proposed distributed solution closely approaches the global optimum (with full network state information) and nicely scales to large networks.

Journal ArticleDOI
TL;DR: A generalized equalization model integrating contrast enhancement and white balancing into a unified framework of convex programming of image histogram is established and it is shown that many image enhancement tasks can be accomplished by the proposed model using different configurations of parameters.
Abstract: In this paper, we propose a generalized equalization model for image enhancement. Based on our analysis of the relationships between the image histogram and contrast enhancement/white balancing, we first establish a generalized equalization model integrating contrast enhancement and white balancing into a unified framework of convex programming of the image histogram. We show that many image enhancement tasks can be accomplished by the proposed model using different configurations of parameters. With two defining properties of the histogram transform, namely contrast gain and nonlinearity, the model parameters for different enhancement applications can be optimized. We then derive an optimal image enhancement algorithm that theoretically achieves the best joint contrast enhancement and white balancing result by trading off contrast enhancement against tonal distortion. Subjective and objective experimental results show favorable performance of the proposed algorithm in applications of image enhancement, white balancing, and tone correction. Computational complexity of the proposed method is also analyzed.
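As a rough illustration of histogram-transform-based enhancement (not the paper's convex program), a single parameter can blend an identity mapping with full histogram equalization; `gain` here is a hypothetical stand-in for the model's contrast-gain property, with `gain=0` leaving the image unchanged and `gain=1` giving plain equalization:

```python
import numpy as np

def equalize_channel(channel, gain=1.0):
    """Map 8-bit pixel values through a blend of the identity mapping
    and the cumulative histogram. `gain` is an illustrative knob, not
    the paper's optimized parameter."""
    hist = np.bincount(channel.ravel(), minlength=256).astype(np.float64)
    cdf = np.cumsum(hist) / hist.sum()            # cumulative distribution
    identity = np.arange(256) / 255.0
    mapping = (1 - gain) * identity + gain * cdf  # blend toward equalization
    return np.rint(mapping[channel] * 255).astype(np.uint8)
```

The paper's contribution is precisely in choosing such a transform optimally, jointly with white balancing, rather than fixing the blend by hand.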

Journal ArticleDOI
TL;DR: A robust hand parsing scheme to extract a high-level description of the hand from the depth image is presented and a Superpixel-Markov Random Field (SMRF) parsing scheme is proposed to enforce the spatial smoothness and the label co-occurrence prior to remove the misclassified regions.
Abstract: Hand pose tracking and gesture recognition are useful for human-computer interaction, but a major problem is the lack of discriminative features for compact hand representation. We present a robust hand parsing scheme to extract a high-level description of the hand from the depth image. A novel distance-adaptive selection method is proposed to obtain more discriminative depth-context features. In addition, we propose a Superpixel-Markov Random Field (SMRF) parsing scheme that enforces spatial smoothness and a label co-occurrence prior to remove misclassified regions. Compared to pixel-level filtering, the SMRF scheme is better suited to modeling the misclassified regions. By fusing temporal constraints, its performance can be further improved. Overall, the proposed hand parsing scheme is accurate and efficient. Tests on a synthesized dataset show that it gives much higher accuracy for single-frame parsing and enhanced robustness for continuous sequence parsing compared to benchmarks. Tests on real-world depth images of the hand and human body show our method's robustness to complex hand configurations and its generalization power to different kinds of articulated objects.
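A crude stand-in for superpixel-level smoothing is majority voting within each superpixel; the real SMRF additionally models pairwise terms and the label co-occurrence prior, but the voting step conveys why region-level reasoning removes isolated pixel misclassifications:

```python
from collections import Counter

def superpixel_vote(pixel_labels, superpixels):
    """Toy superpixel smoothing (not the paper's SMRF): every pixel in
    a superpixel receives that superpixel's majority label, wiping out
    isolated misclassified pixels inside coherent regions."""
    out = dict(pixel_labels)
    for sp in superpixels:                         # sp: list of pixel ids
        majority = Counter(pixel_labels[p] for p in sp).most_common(1)[0][0]
        for p in sp:
            out[p] = majority
    return out
```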

Journal ArticleDOI
TL;DR: This work presents a thorough investigation of HEVC-CABAC from an encryption standpoint, and an algorithm is devised for conversion of non-dyadic ES to dyadic, which can be concatenated to form plaintext for AES-CFB.
Abstract: This paper presents one of the first methods allowing the protection of the newly emerging video codec HEVC (High Efficiency Video Coding). Visual protection is achieved through selective encryption (SE) of HEVC-CABAC binstrings in a format-compliant manner. The SE approach developed for HEVC differs from that of H.264/AVC in several aspects. Truncated Rice code is introduced for binarization of quantized transform coefficients (QTCs) instead of truncated unary code. The encryption space (ES) of binstrings of truncated Rice codes is not always dyadic and cannot be represented by an integer number of bits. Hence the binstrings cannot be concatenated directly to create plaintext for the CFB (Cipher Feedback) mode of AES (AES-CFB), which acts as a self-synchronizing stream cipher. Another challenge for SE in HEVC concerns the introduction of context modeling that adapts to the QTCs. This work presents a thorough investigation of HEVC-CABAC from an encryption standpoint. An algorithm is devised for conversion of a non-dyadic ES to a dyadic one, so that binstrings can be concatenated to form plaintext for AES-CFB. For selectively encrypted binstrings, the context of the truncated Rice code for binarization of future syntax elements is guaranteed to remain unchanged. Hence the encrypted bitstream is format-compliant and has exactly the same bit-rate. The proposed technique requires very little processing power and is well suited for playback on handheld devices. The proposed scheme is suitable for DRM in a wide range of applications, since it protects the contour and motion information along with texture. Several benchmark video sequences of different resolutions and diverse contents were used for experimental evaluation of the proposed algorithm. A detailed security analysis verified the validity of the proposed encryption scheme for content protection.
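For context, a plain (untruncated) Golomb-Rice binarization can be sketched as below; the variable codeword lengths are what make the encryption space of such binstrings non-dyadic. This is an illustrative simplification, not the exact HEVC-CABAC routine: the truncated variant used in HEVC caps the unary prefix and is omitted here:

```python
def rice_code(value, k):
    """Golomb-Rice binarization sketch: a unary-coded quotient
    (value >> k ones followed by a terminating zero) and a k-bit
    binary remainder. HEVC's truncated variant, which bounds the
    prefix length, is not modeled."""
    q, r = value >> k, value & ((1 << k) - 1)
    return "1" * q + "0" + (format(r, f"0{k}b") if k else "")
```

For example, `rice_code(5, 1)` splits 5 into quotient 2 and remainder 1, giving the binstring "1101".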

Journal ArticleDOI
TL;DR: An innovative mobile video search system through which users can discover videos by simply pointing their phones at a screen to capture a very few seconds of what they are watching, aiming at instant and progressive video search by leveraging the light-weight computing capacity of mobile devices.
Abstract: The proliferation of mobile devices is producing a new wave of applications that enable users to sense their surroundings with smart phones. People increasingly prefer mobile devices for searching and browsing video content on the move. In this paper, we have developed an innovative mobile video search system through which users can discover videos by simply pointing their phones at a screen to capture a few seconds of what they are watching. Unlike most existing mobile video search applications, the proposed system aims at instant and progressive video search by leveraging the light-weight computing capacity of mobile devices. In particular, the system is able to index large-scale video data using a new layered audio-video indexing approach in the cloud, as well as generate lightweight joint audio-video signatures with progressive transmission and perform progressive search on mobile devices. Furthermore, we showcase that the system can be applied to two novel applications: video entity search and video clip localization. The evaluations on a real-world mobile video query dataset show that our system significantly improves the user's search experience thanks to its high search accuracy, low retrieval latency, and very short required recording duration.

Journal ArticleDOI
TL;DR: A novel Topic-Sensitive Influencer Mining (TSIM) framework in interest-based social media networks that aims to find topical influential users and images and demonstrates the effectiveness of the proposed framework on a real-world dataset.
Abstract: Social media is emerging as a new mainstream means of interacting around online media. Social influence mining in social networks is therefore of critical importance in real-world applications such as friend suggestion and photo recommendation. Social media is inherently multimodal, including rich types of user-contributed content and social link information. Most of the existing research suffers from two limitations: 1) only utilizing the textual information, and/or 2) only analyzing generic influence while ignoring the more important topic-level influence. To address these limitations, in this paper we develop a novel Topic-Sensitive Influencer Mining (TSIM) framework in interest-based social media networks. Specifically, we take Flickr as the study platform. People in Flickr interact with each other through images. TSIM aims to find topical influential users and images. The influence estimation is determined with a hypergraph learning approach. In the hypergraph, the vertices represent users and images, and the hyperedges are utilized to capture multi-type relations, including visual-textual content relations among images and social links between users and images. Algorithmically, TSIM first learns the topic distribution by leveraging user-contributed images, and then infers the influence strength under different topics for each node in the hypergraph. Extensive experiments on a real-world dataset of more than 50 K images and 70 K comment/favorite links from Flickr have demonstrated the effectiveness of our proposed framework. In addition, we also report promising results of friend suggestion and photo recommendation via TSIM on the same dataset.

Journal ArticleDOI
Shiyang Lu1, Zhiyong Wang1, Tao Mei2, Genliang Guan1, David Dagan Feng1 
TL;DR: The proposed Bag-of-Importance (BoI) model for static video summarization is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video.
Abstract: Video summarization helps users obtain quick comprehension of video content. Recently, some studies have utilized local features to represent each video frame and formulate video summarization as a coverage problem of local features. However, the importance of individual local features has not been exploited. In this paper, we propose a novel Bag-of-Importance (BoI) model for static video summarization by identifying the frames with important local features as keyframes, which is one of the first studies formulating video summarization at the local feature level instead of the global feature level. That is, by representing each frame with local features, a video is characterized by a bag of local features weighted with individual importance scores, and the frames with more important local features are more representative, where the representativeness of each frame is the aggregation of the weighted importance of the local features contained in the frame. In addition, we propose to learn a transformation from a raw local feature to a more powerful sparse nonlinear representation for deriving the importance score of each local feature, rather than directly utilizing hand-crafted visual features as most existing approaches do. Specifically, we first employ locality-constrained linear coding (LCC) to project each local feature into a sparse transformed space. LCC is able to take advantage of the manifold geometric structure of the high-dimensional feature space and form the manifold of the low-dimensional transformed space with the coordinates of a set of anchor points. Then we calculate the l2 norm of each anchor point as the importance score of each local feature that is projected to the anchor point. Finally, the distribution of the importance scores of all the local features in a video is obtained as the BoI representation of the video. We further differentiate the importance of local features with a spatial weighting template by taking the perceptual difference among spatial regions of a frame into account. As a result, our proposed video summarization approach is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video. Experimental results on three video datasets across various genres demonstrate that the proposed approach clearly outperforms several state-of-the-art methods.
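A much-simplified sketch of the keyframe-scoring idea follows, with hard nearest-anchor assignment standing in for the paper's LCC coding and the spatial weighting template omitted: each local feature inherits the l2 norm of its anchor as an importance score, and frames with the largest summed importance are kept as keyframes:

```python
import numpy as np

def select_keyframes(frame_features, anchors, num_keyframes=2):
    """Score each frame by the summed importance of its local features.
    As a simplification of LCC, a feature's importance is the l2 norm
    of its single nearest anchor point. Returns sorted indices of the
    top-scoring frames."""
    scores = []
    for feats in frame_features:                   # feats: (n_i, d) array
        # distance of every local feature to every anchor
        d = np.linalg.norm(feats[:, None, :] - anchors[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                 # hard assignment
        importance = np.linalg.norm(anchors[nearest], axis=1)
        scores.append(importance.sum())
    order = np.argsort(scores)[::-1]
    return sorted(order[:num_keyframes].tolist())
```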

Journal ArticleDOI
TL;DR: By extending the BSIF-TOP descriptor to a multiresolution scheme, the descriptor is able to capture the spatio-temporal content of an image sequence at multiple scales, improving its representation capacity.
Abstract: A spatio-temporal descriptor for representation and recognition of time-varying textures, binarized statistical image features on three orthogonal planes (BSIF-TOP), is proposed in this paper. The descriptor, similar in spirit to the well-known local binary patterns on three orthogonal planes approach, estimates histograms of binary-coded image sequences on three orthogonal planes corresponding to spatial/spatio-temporal dimensions. However, unlike some other methods which generate the code in a heuristic fashion, binary code generation in the BSIF-TOP approach is realized by filtering operations on different regions of spatial/spatio-temporal support and by binarizing the filter responses. The filters are learnt via independent component analysis on each of the three planes after preprocessing using a whitening transformation. By extending the BSIF-TOP descriptor to a multiresolution scheme, the descriptor is able to capture the spatio-temporal content of an image sequence at multiple scales, improving its representation capacity. In evaluations on the UCLA, DynTex, and DynTex++ dynamic texture databases, the proposed method achieves very good performance compared to existing approaches.
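The filter-then-binarize-then-histogram step can be sketched for a single plane as below. The filters here are arbitrary placeholders rather than the ICA-learnt ones from the paper, and the slow direct convolution loop is only for clarity on tiny inputs:

```python
import numpy as np

def bsif_histogram(image, filters, threshold=0.0):
    """Single-plane binary-code histogram in the spirit of BSIF:
    correlate the image with each filter, binarize responses at
    `threshold`, pack the n binary responses per pixel into an n-bit
    code, and return the normalized histogram of codes.
    `filters`: (n, h, w) array; in the paper these come from ICA."""
    n = len(filters)
    codes = np.zeros(image.shape, dtype=np.int64)
    for i, f in enumerate(filters):
        pad_h, pad_w = f.shape[0] // 2, f.shape[1] // 2
        padded = np.pad(image, ((pad_h, pad_h), (pad_w, pad_w)), mode="edge")
        resp = np.zeros(image.shape, dtype=np.float64)
        for y in range(image.shape[0]):            # direct correlation
            for x in range(image.shape[1]):
                resp[y, x] = np.sum(padded[y:y + f.shape[0],
                                           x:x + f.shape[1]] * f)
        codes |= (resp > threshold).astype(np.int64) << i  # set bit i
    hist = np.bincount(codes.ravel(), minlength=2 ** n)
    return hist / hist.sum()
```

BSIF-TOP would compute such a histogram on each of the three orthogonal planes of the sequence and concatenate them.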

Journal ArticleDOI
TL;DR: This work proposes a codebook-free algorithm for large scale mobile image search that achieves fast and accurate feature matching free of a huge visual codebook, and demonstrates competitive retrieval accuracy and scalability against four recent retrieval methods in literature.
Abstract: State-of-the-art image retrieval algorithms using local invariant features mostly rely on a large visual codebook to accelerate feature quantization and matching. This codebook typically contains millions of visual words, which not only demands considerable resources to train offline but also consumes a large amount of memory at the online retrieval stage. This is hardly affordable in resource-limited scenarios such as mobile image search applications. To address this issue, we propose a codebook-free algorithm for large-scale mobile image search. In our method, we first employ a novel scalable cascaded hashing scheme to ensure the recall rate of local feature matching. Afterwards, we enhance the matching precision by an efficient verification with the binary signatures of these local features. Consequently, our method achieves fast and accurate feature matching free of a huge visual codebook. Moreover, the quantization and binarizing functions in the proposed scheme are independent of small collections of training images and generalize well for diverse image datasets. Evaluated on two public datasets with a million distractor images, the proposed algorithm demonstrates competitive retrieval accuracy and scalability against four recent retrieval methods in the literature.
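The second-stage verification idea can be sketched as a Hamming-distance filter over binary signatures; the first-stage cascaded hashing that produces the candidate list is assumed to exist, and the signatures are represented here simply as integers:

```python
def hamming_verify(candidates, query_sig, max_dist=8):
    """Verification sketch: after a (hypothetical) hashing stage
    returns candidate matches as (id, signature) pairs, keep only
    those whose binary signature lies within `max_dist` bits of the
    query's signature."""
    def hamming(a, b):
        return bin(a ^ b).count("1")   # popcount of the XOR
    return [c for c, sig in candidates if hamming(sig, query_sig) <= max_dist]
```

Because the signatures are compact bit strings, this check is cheap enough to run on-device, which is the point of avoiding a codebook.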

Journal ArticleDOI
TL;DR: It is demonstrated that CAVVA achieves a good balance between the following seemingly conflicting goals of minimizing the user disturbance because of advertisement insertion while (b) enhancing the user engagement with the advertising content.
Abstract: Advertising is ubiquitous in the online community and more so in the ever-growing and popular online video delivery websites (e.g., YouTube). Video advertising is becoming increasingly popular on these websites. In addition to the existing pre-roll/post-roll advertising and contextual advertising, this paper proposes an in-stream video advertising strategy: Computational Affective Video-in-Video Advertising (CAVVA). Humans, being emotional creatures, are driven by emotions as well as rational thought. We believe that emotions play a major role in influencing the buying behavior of users and hence propose a video advertising strategy which takes into account the emotional impact of the videos as well as the advertisements. Given a video and a set of advertisements, we identify candidate advertisement insertion points (step 1) and also identify the suitable advertisements (step 2) according to theories from marketing and consumer psychology. We formulate this two-part problem as a single optimization function in a non-linear 0-1 integer programming framework and provide a genetic algorithm based solution. We evaluate CAVVA using a subjective user study and an eye-tracking experiment. Through these experiments, we demonstrate that CAVVA achieves a good balance between the seemingly conflicting goals of (a) minimizing the user disturbance caused by advertisement insertion while (b) enhancing the user engagement with the advertising content. We compare our method with existing advertising strategies and show that CAVVA can enhance the user's experience and also help increase the monetization potential of the advertising content.
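A toy version of such a 0-1 assignment problem can make the formulation concrete. Brute-force enumeration stands in for the paper's genetic algorithm (feasible only for tiny instances), and the objective (engagement gain minus a weighted disturbance penalty) is an illustrative assumption, not CAVVA's actual scoring function:

```python
from itertools import product

def best_assignment(gains, disturbances, alpha=0.5):
    """Toy 0-1 program in the spirit of CAVVA: each insertion slot
    picks one ad index or -1 (no ad), each ad used at most once,
    maximizing total engagement gain minus alpha * slot disturbance.
    gains[s][a]: gain of placing ad a at slot s."""
    n_slots, n_ads = len(disturbances), len(gains[0])
    best, best_val = None, float("-inf")
    for choice in product(range(-1, n_ads), repeat=n_slots):
        used = [a for a in choice if a >= 0]
        if len(used) != len(set(used)):           # an ad placed twice
            continue
        val = sum(gains[s][a] - alpha * disturbances[s]
                  for s, a in enumerate(choice) if a >= 0)
        if val > best_val:
            best, best_val = choice, val
    return best, best_val
```

For realistic numbers of slots and advertisements this search space explodes, which is why the paper resorts to a genetic algorithm.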