
Showing papers in "IEEE Transactions on Multimedia in 2012"


Journal ArticleDOI
TL;DR: This work formulates video summarization as a novel dictionary selection problem using sparsity consistency, where a dictionary of key frames is selected such that the original video can be best reconstructed from this representative dictionary.
Abstract: The rapid growth of consumer videos requires an effective and efficient content summarization method to provide a user-friendly way to manage and browse the huge amount of video data. Compared with most previous methods that focus on sports and news videos, the summarization of personal videos is more challenging because of their unconstrained content and the lack of any pre-imposed video structure. We formulate video summarization as a novel dictionary selection problem using sparsity consistency, where a dictionary of key frames is selected such that the original video can be best reconstructed from this representative dictionary. An efficient global optimization algorithm is introduced to solve the dictionary selection model with a convergence rate of O(1/K^2) (where K is the iteration counter), in contrast to the O(1/√K) rate of traditional sub-gradient descent methods. Our method provides a scalable solution for both key frame extraction and video skim generation, because one can select an arbitrary number of key frames to represent the original videos. Experiments on a human-labeled benchmark dataset and comparisons to the state-of-the-art methods demonstrate the advantages of our algorithm.
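The reconstruction idea behind dictionary selection can be illustrated with a simple greedy sketch, not the authors' sparsity-consistent global optimizer: pick, one frame at a time, the frame whose addition to the dictionary most reduces the least-squares reconstruction error of the whole video. The feature representation and function names here are illustrative assumptions.

```python
import numpy as np

def greedy_keyframe_selection(frames, k):
    """Greedily pick k frames whose span best reconstructs all frames.

    frames: (n_frames, dim) array of per-frame feature vectors
    (how features are extracted is left open). Returns the indices
    of the selected key frames.
    """
    n = frames.shape[0]
    selected = []
    for _ in range(k):
        best_idx, best_err = None, np.inf
        for i in range(n):
            if i in selected:
                continue
            D = frames[selected + [i]].T  # candidate dictionary, shape (dim, |S|+1)
            # Least-squares reconstruction of every frame from the dictionary
            coef, *_ = np.linalg.lstsq(D, frames.T, rcond=None)
            err = np.linalg.norm(frames.T - D @ coef)
            if err < best_err:
                best_idx, best_err = i, err
        selected.append(best_idx)
    return selected
```

Because any number of key frames can be requested, the same routine serves both key-frame extraction and, by expanding key frames into shots, skim generation.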

295 citations


Journal ArticleDOI
TL;DR: This paper presents an approach for event driven web video summarization by tag localization and key-shot mining, and provides two types of summaries, i.e., threaded video skimming and visual-textual storyboard.
Abstract: With the explosive growth of web videos on the Internet, it becomes challenging to efficiently browse hundreds or even thousands of videos. When searching an event query, users are often bewildered by the vast quantity of web videos returned by search engines. Exploring such results will be time consuming and it will also degrade user experience. In this paper, we present an approach for event driven web video summarization by tag localization and key-shot mining. We first localize the tags that are associated with each video into its shots. Then, we estimate the relevance of the shots with respect to the event query by matching the shot-level tags with the query. After that, we identify a set of key-shots from the shots that have high relevance scores by exploring the repeated occurrence characteristic of key sub-events. Following the scheme in [6] and [22], we provide two types of summaries, i.e., threaded video skimming and visual-textual storyboard. Experiments are conducted on a corpus that contains 60 queries and more than 10 000 web videos. The evaluation demonstrates the effectiveness of the proposed approach.

290 citations


Journal ArticleDOI
TL;DR: Experimental results have shown that the proposed algorithms can effectively annotate the kin relationships among people in an image and semantic context can further improve the accuracy.
Abstract: There is an urgent need to organize and manage images of people automatically due to the recent explosion of such data on the Web in general and in social media in particular. Beyond face detection and face recognition, which have been extensively studied over the past decade, perhaps the most interesting aspect related to human-centered images is the relationship of people in the image. In this work, we focus on a novel solution to the latter problem, in particular the kin relationships. To this end, we constructed two databases: the first one named UB KinFace Ver2.0, which consists of images of children and of their parents both when young and when old, and the second one named FamilyFace. Next, we develop a transfer subspace learning based algorithm in order to reduce the significant differences in the appearance distributions between facial images of children and of old parents. Moreover, by exploring the semantic relevance of the associated metadata, we propose an algorithm to predict the most likely kin relationships embedded in an image. In addition, human subjects are used in a baseline study on both databases. Experimental results have shown that the proposed algorithms can effectively annotate the kin relationships among people in an image and that semantic context can further improve the accuracy.

239 citations


Journal ArticleDOI
TL;DR: The Gammatone cepstral coefficients (GTCCs), which have been previously employed in the field of speech research, are adapted for non-speech audio classification purposes and are more effective than MFCC in representing the spectral characteristics of non- speech audio signals, especially at low frequencies.
Abstract: In the context of non-speech audio recognition and classification for multimedia applications, it becomes essential to have a set of features able to accurately represent and discriminate among audio signals. Mel frequency cepstral coefficients (MFCCs) have become a de facto standard for audio parameterization. Taking the MFCC computation scheme as a basis, the Gammatone cepstral coefficients (GTCCs) are a biologically inspired modification employing Gammatone filters with equivalent rectangular bandwidth bands. In this letter, the GTCCs, which have been previously employed in the field of speech research, are adapted for non-speech audio classification purposes. Their performance is evaluated on two audio corpora of 4 h each (general sounds and audio scenes), following two cross-validation schemes and four machine learning methods. According to the results, classification accuracies are significantly higher when employing GTCCs rather than other state-of-the-art audio features. As a detailed analysis shows, with a similar computational cost, GTCCs are more effective than MFCCs in representing the spectral characteristics of non-speech audio signals, especially at low frequencies.
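The GTCC pipeline mirrors MFCC computation, with ERB-spaced Gammatone analysis in place of the mel filterbank: filter energies, log compression, then a DCT. The sketch below is a frequency-domain approximation under stated assumptions: Gaussian-shaped band weights stand in for true time-domain Gammatone filters, and the band count and frequency range are illustrative.

```python
import numpy as np

def erb_space(f_min, f_max, n):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg-Moore)."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
    return inv(np.linspace(erb(f_min), erb(f_max), n))

def gtcc_like(frame, sr, n_bands=32, n_coeffs=13):
    """Cepstral coefficients from an ERB-spaced filterbank (simplified GTCC)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    centers = erb_space(50.0, sr / 2.0, n_bands)
    bw = 24.7 * (0.00437 * centers + 1)  # ERB bandwidth for each band
    # Gaussian band weights approximating Gammatone magnitude responses
    weights = np.exp(-0.5 * ((freqs[None, :] - centers[:, None]) / bw[:, None]) ** 2)
    energies = np.log(weights @ spec + 1e-12)  # log filterbank energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_bands)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (n[None, :] + 0.5) / n_bands)
    return basis @ energies
```

Swapping the ERB filterbank back to mel-spaced triangles recovers the usual MFCC computation, which is what makes the two feature sets directly comparable at similar cost.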

209 citations


Journal ArticleDOI
TL;DR: A novel feature selection method that can jointly select the most relevant features from all the data points by using a sparsity-based model and apply it to automatic image annotation is proposed and validated.
Abstract: The number of web images has been explosively growing due to the development of network and storage technology. These images make up a large amount of current multimedia data and are closely related to our daily life. To efficiently browse, retrieve and organize the web images, numerous approaches have been proposed. Since the semantic concepts of the images can be indicated by label information, automatic image annotation becomes one effective technique for image management tasks. Most existing annotation methods use image features that are often noisy and redundant. Hence, feature selection can be exploited for a more precise and compact representation of the images, thus improving the annotation performance. In this paper, we propose a novel feature selection method and apply it to automatic image annotation. There are two appealing properties of our method. First, it can jointly select the most relevant features from all the data points by using a sparsity-based model. Second, it can uncover the shared subspace of original features, which is beneficial for multi-label learning. To solve the objective function of our method, we propose an efficient iterative algorithm. Extensive experiments are performed on large image databases that are collected from the web. The experimental results together with the theoretical analysis have validated the effectiveness of our method for feature selection, thus demonstrating its feasibility of being applied to web image annotation.
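A common concrete form of sparsity-based joint feature selection, offered here as a generic sketch rather than the paper's shared-subspace formulation, is l2,1-regularized least squares solved by proximal gradient descent: the penalty zeroes whole rows of the weight matrix, so features can be ranked by row norm across all labels jointly.

```python
import numpy as np

def l21_feature_select(X, Y, lam=1.0, lr=None, iters=500):
    """Proximal gradient for  min_W ||XW - Y||_F^2 + lam * ||W||_{2,1}.

    X: (n_samples, n_features), Y: (n_samples, n_labels).
    Returns (W, feature_scores), where feature_scores are the row
    l2 norms of W used to rank features.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    if lr is None:
        lr = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # step size from Lipschitz bound
    for _ in range(iters):
        G = 2 * X.T @ (X @ W - Y)                   # gradient of the squared loss
        W = W - lr * G
        # Proximal step: shrink each row toward zero; small rows vanish entirely
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W * np.maximum(0, 1 - lr * lam / np.maximum(norms, 1e-12))
    return W, np.linalg.norm(W, axis=1)
```

Because the row-wise shrinkage acts on all label columns at once, a feature is kept or discarded for the whole multi-label problem, which is the "joint selection over all data points" property the abstract highlights.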

181 citations


Journal ArticleDOI
TL;DR: The results show that the emergent leader is perceived by his/her peers as an active and dominant person; that visual information augments acoustic information; and that adding relational information to the nonverbal cues improves the inference of each participant's leadership rankings in the group.
Abstract: Identifying emergent leaders in organizations is a key issue in organizational behavioral research, and a new problem in social computing. This paper presents an analysis on how an emergent leader is perceived in newly formed, small groups, and then tackles the task of automatically inferring emergent leaders, using a variety of communicative nonverbal cues extracted from audio and video channels. The inference task uses rule-based and collective classification approaches with the combination of acoustic and visual features extracted from a new small group corpus specifically collected to analyze the emergent leadership phenomenon. Our results show that the emergent leader is perceived by his/her peers as an active and dominant person; that visual information augments acoustic information; and that adding relational information to the nonverbal cues improves the inference of each participant's leadership rankings in the group.

174 citations


Journal ArticleDOI
TL;DR: A new saliency detection model based on the human visual sensitivity and the amplitude spectrum of quaternion Fourier transform (QFT) to represent the color, intensity, and orientation distributions for image patches is proposed.
Abstract: With the wide applications of saliency information in visual signal processing, many saliency detection methods have been proposed. However, some key characteristics of the human visual system (HVS) are still neglected in building these saliency detection models. In this paper, we propose a new saliency detection model based on the human visual sensitivity and the amplitude spectrum of quaternion Fourier transform (QFT). We use the amplitude spectrum of QFT to represent the color, intensity, and orientation distributions for image patches. The saliency value for each image patch is calculated by not only the differences between the QFT amplitude spectrum of this patch and other patches in the whole image, but also the visual impacts for these differences determined by the human visual sensitivity. The experiment results show that the proposed saliency detection model outperforms the state-of-the-art detection models. In addition, we apply our proposed model in the application of image retargeting and achieve better performance over the conventional algorithms.

171 citations


Journal ArticleDOI
TL;DR: A new content-based, non-intrusive quality of experience (QoE) prediction model for low bitrate and resolution (QCIF) H.264 encoded videos and its application in video quality adaptation over Universal Mobile Telecommunication Systems (UMTS) networks is illustrated.
Abstract: The primary aim of this paper is to present a new content-based, non-intrusive quality of experience (QoE) prediction model for low bitrate and resolution (QCIF) H.264 encoded videos and to illustrate its application in video quality adaptation over Universal Mobile Telecommunication Systems (UMTS) networks. The success of video applications over UMTS networks very much depends on meeting the QoE requirements of users. Thus, it is highly desirable to be able to predict and, if appropriate, to control video quality to meet such QoE requirements. Video quality is affected by distortions caused both by the encoder and the UMTS access network. The impact of these distortions is content dependent, but this feature is not widely used in non-intrusive video quality prediction models. In the new model, we chose four key parameters that can impact video quality and hence the QoE: content type, sender bitrate, block error rate and mean burst length. The video quality was predicted in terms of the mean opinion score (MOS). Subjective quality tests were carried out to develop and evaluate the model. The performance of the model was evaluated on an unseen dataset with good prediction accuracy (~93%). The model also performed well with the LIVE database which was recently made available to the research community. We illustrate the application of the new model in a novel QoE-driven adaptation scheme at the pre-encoding stage in a UMTS network. Simulation results in NS2 demonstrate the effectiveness of the proposed adaptation scheme, especially at the UMTS access network which is a bottleneck. An advantage of the model is that it is lightweight (and so it can be implemented for real-time monitoring), and it provides a measure of user-perceived quality, but without requiring time-consuming subjective tests.
The model has potential applications in several other areas, including QoE control and optimization in network planning and content provisioning for network/service providers.

169 citations


Journal ArticleDOI
TL;DR: This study reveals that the proposed semantic model vectors representation outperforms, and is complementary to, other low-level visual descriptors for video event modeling, and validates it not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact.
Abstract: We propose semantic model vectors, an intermediate level semantic representation, as a basis for modeling and detecting complex events in unconstrained real-world videos, such as those from YouTube. The semantic model vectors are extracted using a set of discriminative semantic classifiers, each being an ensemble of SVM models trained from thousands of labeled web images, for a total of 280 generic concepts. Our study reveals that the proposed semantic model vectors representation outperforms, and is complementary to, other low-level visual descriptors for video event modeling. We hence present an end-to-end video event detection system, which combines semantic model vectors with other static or dynamic visual descriptors, extracted at the frame, segment, or full clip level. We perform a comprehensive empirical study on the 2010 TRECVID Multimedia Event Detection task (http://www.nist.gov/itl/iad/mig/med10.cfm), which validates the semantic model vectors representation not only as the best individual descriptor, outperforming state-of-the-art global and local static features as well as spatio-temporal HOG and HOF descriptors, but also as the most compact. We also study early and late feature fusion across the various approaches, leading to a 15% performance boost and an overall system performance of 0.46 mean average precision. In order to promote further research in this direction, we made our semantic model vectors for the TRECVID MED 2010 set publicly available for the community to use (http://www1.cs.columbia.edu/~mmerler/SMV.html).

159 citations


Journal ArticleDOI
TL;DR: This paper proposes a new model that efficiently segments common objects from multiple images by segmenting each original image into a number of local regions based on local region similarities and saliency maps and uses the dynamic programming method to solve the co-segmentation problem.
Abstract: Segmenting common objects that have variations in color, texture and shape is a challenging problem. In this paper, we propose a new model that efficiently segments common objects from multiple images. We first segment each original image into a number of local regions. Then, we construct a digraph based on local region similarities and saliency maps. Finally, we formulate the co-segmentation problem as the shortest path problem, and we use the dynamic programming method to solve the problem. The experimental results demonstrate that the proposed model can efficiently segment the common objects from a group of images with generally lower error rate than many existing and conventional co-segmentation methods.
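The shortest-path formulation can be sketched as a stage-wise dynamic program: one stage per image, one node per candidate region, and edge costs encoding region dissimilarity. The cost terms below are placeholders, not the paper's exact similarity and saliency measures.

```python
import numpy as np

def shortest_path_dp(node_costs, edge_costs):
    """Shortest path through a stage-wise DAG by dynamic programming.

    node_costs: list of 1-D arrays, one per image (e.g. an inverse
    saliency score for each candidate region).
    edge_costs[i]: matrix of dissimilarities between regions of
    image i and image i+1.
    Returns (best total cost, chosen region index per image).
    """
    dist = node_costs[0].astype(float)
    back = []
    for i, E in enumerate(edge_costs):
        # Cost of reaching each region of image i+1 via each region of image i
        total = dist[:, None] + E + node_costs[i + 1][None, :]
        back.append(total.argmin(axis=0))   # best predecessor per region
        dist = total.min(axis=0)
    # Backtrack from the cheapest final region to recover the path
    path = [int(dist.argmin())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return float(dist.min()), path[::-1]
```

The selected path picks one region per image, which is exactly the "one common object per image" output of the co-segmentation model.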

147 citations


Journal ArticleDOI
TL;DR: A novel active learning approach based on the optimum experimental design criteria in statistics is proposed that simultaneously exploits sample's local structure, and sample relevance, density, and diversity information, as well as makes use of labeled and unlabeled data.
Abstract: Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design approaches, ours simultaneously exploits each sample's local structure and the sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least squares model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density and diversity information, and sample efficacy in minimizing the parameter variance of the proposed local learning regularized least squares model. We compare the performance between our approach and the state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark. We report superior performance from the proposed approach.

Journal ArticleDOI
TL;DR: The proposed semi-supervised feature analyzing framework is able to learn a classifier for different applications by selecting the discriminating features closely related to the semantic concepts by designing an efficient iterative algorithm with fast convergence, thus making it applicable to practical applications.
Abstract: In this paper, we propose a novel semi-supervised feature analyzing framework for multimedia data understanding and apply it to three different applications: image annotation, video concept detection and 3-D motion data analysis. Our method is built upon two advancements of the state of the art: (1) l2,1-norm regularized feature selection, which can jointly select the most relevant features from all the data points. This feature selection approach was shown to be robust and efficient in the literature as it considers the correlation between different features jointly when conducting feature selection; (2) manifold learning, which analyzes the feature space by exploiting both labeled and unlabeled data. It is a widely used technique to extend many algorithms to semi-supervised scenarios for its capability of leveraging the manifold structure of multimedia data. The proposed method is able to learn a classifier for different applications by selecting the discriminating features closely related to the semantic concepts. The objective function of our method is non-smooth and difficult to solve, so we design an efficient iterative algorithm with fast convergence, thus making it applicable to practical applications. Extensive experiments on image annotation, video concept detection and 3-D motion data analysis are performed on different real-world data sets to demonstrate the effectiveness of our algorithm.

Journal ArticleDOI
TL;DR: A scheme that is able to automatically turn a movie clip to comics by optimizing the information preservation of the movie and generating outputs following the rules and the styles of comics is proposed.
Abstract: As a type of artwork, comics is prevalent and popular around the world. However, despite the availability of assistive software and tools, the creation of comics is still a labor-intensive and time-consuming process. This paper proposes a scheme that is able to automatically turn a movie clip to comics. Two principles are followed in the scheme: 1) optimizing the information preservation of the movie; and 2) generating outputs following the rules and the styles of comics. The scheme mainly contains three components: script-face mapping, descriptive picture extraction, and cartoonization. The script-face mapping utilizes face tracking and recognition techniques to accomplish the mapping between characters' faces and their scripts. The descriptive picture extraction then generates a sequence of frames for presentation. Finally, the cartoonization is accomplished via three steps: panel scaling, stylization, and comics layout design. Experiments are conducted on a set of movie clips and the results have demonstrated the usefulness and the effectiveness of the scheme.

Journal ArticleDOI
TL;DR: A robust watermarking algorithm to watermark JPEG2000 compressed and encrypted images is proposed, using a stream cipher and the embedding capacity, robustness, perceptual quality and security of the proposed algorithm are investigated.
Abstract: Digital asset management systems (DAMS) generally handle media data in a compressed and encrypted form. It is sometimes necessary to watermark these compressed encrypted media items in the compressed-encrypted domain itself for tamper detection or ownership declaration or copyright management purposes. It is a challenge to watermark these compressed encrypted streams as the compression process would have packed the information of raw media into a low number of bits and encryption would have randomized the compressed bit stream. Attempting to watermark such a randomized bit stream can cause a dramatic degradation of the media quality. Thus it is necessary to choose an encryption scheme that is both secure and will allow watermarking in a predictable manner in the compressed encrypted domain. In this paper, we propose a robust watermarking algorithm to watermark JPEG2000 compressed and encrypted images. The encryption algorithm we propose to use is a stream cipher. While the proposed technique embeds watermark in the compressed-encrypted domain, the extraction of watermark can be done in the decrypted domain. We investigate in detail the embedding capacity, robustness, perceptual quality and security of the proposed algorithm, using these watermarking schemes: Spread Spectrum (SS), Scalar Costa Scheme Quantization Index Modulation (SCS-QIM), and Rational Dither Modulation (RDM).
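Of the three schemes investigated, Spread Spectrum is the simplest to sketch: the watermark is added to host coefficients and detected by normalized correlation. The JPEG2000 coding and stream-cipher encryption steps are omitted here; the sketch only shows the additive embed/detect structure, with alpha as an illustrative strength parameter.

```python
import numpy as np

def ss_embed(coeffs, watermark, alpha=0.05):
    """Additive spread-spectrum embedding: c' = c + alpha * w."""
    return coeffs + alpha * watermark

def ss_detect(coeffs, watermark):
    """Normalized correlation between received coefficients and the watermark.

    A high score indicates the watermark is present; in practice the
    score is compared against a threshold chosen for a target false
    alarm rate.
    """
    return float(coeffs @ watermark /
                 (np.linalg.norm(coeffs) * np.linalg.norm(watermark)))
```

Because embedding is a simple addition, its effect on a suitably structured encrypted stream stays predictable, which is the property the paper relies on for watermarking in the compressed-encrypted domain.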

Journal ArticleDOI
TL;DR: An unsupervised salient object segmentation approach based on kernel density estimation (KDE) and two-phase graph cut that efficiently utilizes the information of minimum cut generated using the KDE model based graph cut, and exploits a balancing weight update scheme for convergence of segmentation refinement.
Abstract: In this paper, we propose an unsupervised salient object segmentation approach based on kernel density estimation (KDE) and two-phase graph cut. A set of KDE models are first constructed based on the pre-segmentation result of the input image, and then for each pixel, a set of likelihoods to fit all KDE models are calculated accordingly. The color saliency and spatial saliency of each KDE model are then evaluated based on its color distinctiveness and spatial distribution, and the pixel-wise saliency map is generated by integrating likelihood measures of pixels and saliency measures of KDE models. In the first phase of salient object segmentation, the saliency map based graph cut is exploited to obtain an initial segmentation result. In the second phase, the segmentation is further refined based on an iterative seed adjustment method, which efficiently utilizes the information of minimum cut generated using the KDE model based graph cut, and exploits a balancing weight update scheme for convergence of segmentation refinement. Experimental results on a dataset containing 1000 test images with ground truths demonstrate the better segmentation performance of our approach.
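The per-pixel likelihood step can be sketched with a plain Gaussian kernel density estimate over region colors; the bandwidth and the color space are assumptions, and the paper's saliency weighting and two-phase graph cut are not shown.

```python
import numpy as np

def kde_likelihood(pixels, model_samples, bandwidth=0.1):
    """Likelihood of each pixel under a Gaussian KDE built from one region.

    pixels: (n, 3) pixel colors; model_samples: (m, 3) colors drawn
    from one pre-segmented region. Returns an (n,) array of densities.
    """
    # Squared color distance from every pixel to every model sample
    d2 = ((pixels[:, None, :] - model_samples[None, :, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2 * bandwidth ** 2))
    # Gaussian normalization in 3-D, averaged over the m kernel centers
    norm = (2 * np.pi * bandwidth ** 2) ** 1.5 * len(model_samples)
    return k.sum(axis=1) / norm
```

Evaluating each pixel against every region's KDE yields the likelihood vectors that, combined with the per-model color and spatial saliency measures, form the pixel-wise saliency map.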

Journal ArticleDOI
TL;DR: This work presents a novel approach for visual detection and attribute-based search of vehicles in crowded surveillance scenes, and performs a comprehensive quantitative analysis to validate the approach, showing its usefulness in realistic urban surveillance settings.
Abstract: We present a novel approach for visual detection and attribute-based search of vehicles in crowded surveillance scenes. Large-scale processing is addressed along two dimensions: 1) large-scale indexing, where hundreds of billions of events need to be archived per month to enable effective search and 2) learning vehicle detectors with large-scale feature selection, using a feature pool containing millions of feature descriptors. Our method for vehicle detection also explicitly models occlusions and multiple vehicle types (e.g., buses, trucks, SUVs, cars), while requiring very little manual labeling. It runs quite efficiently at an average of 66 Hz on a conventional laptop computer. Once a vehicle is detected and tracked over the video, fine-grained attributes are extracted and ingested into a database to allow future search queries such as “Show me all blue trucks larger than 7 ft. length traveling at high speed northbound last Saturday, from 2 pm to 5 pm”. We perform a comprehensive quantitative analysis to validate our approach, showing its usefulness in realistic urban surveillance settings.

Journal ArticleDOI
TL;DR: A novel approach, kernel cross-modal factor analysis, is introduced, which identifies the optimal transformations that are capable of representing the coupled patterns between two different subsets of features by minimizing the Frobenius norm in the transformed domain.
Abstract: In this paper, we investigate kernel based methods for multimodal information analysis and fusion. We introduce a novel approach, kernel cross-modal factor analysis, which identifies the optimal transformations that are capable of representing the coupled patterns between two different subsets of features by minimizing the Frobenius norm in the transformed domain. The kernel trick is utilized for modeling the nonlinear relationship between two multidimensional variables. We examine and compare with kernel canonical correlation analysis which finds projection directions that maximize the correlation between two modalities, and kernel matrix fusion which integrates the kernel matrices of respective modalities through algebraic operations. The performance of the introduced method is evaluated on an audiovisual based bimodal emotion recognition problem. We first perform feature extraction from the audio and visual channels respectively. The presented approaches are then utilized to analyze the cross-modal relationship between audio and visual features. A hidden Markov model is subsequently applied for characterizing the statistical dependence across successive time segments, and identifying the inherent temporal structure of the features in the transformed domain. The effectiveness of the proposed solution is demonstrated through extensive experimentation.
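In the linear special case, cross-modal factor analysis reduces to an SVD of the cross-product matrix X^T Y: the leading singular vectors give orthogonal transforms that minimize the Frobenius-norm mismatch between the projected modalities. The paper kernelizes this idea; the sketch below is only the linear version, with illustrative toy data in the usage test.

```python
import numpy as np

def cfa_transforms(X, Y, k=1):
    """Linear cross-modal factor analysis.

    X: (n, p) features from one modality, Y: (n, q) from the other.
    Returns orthonormal W_x (p, k) and W_y (q, k) minimizing
    ||X W_x - Y W_y||_F, obtained from the SVD of X^T Y.
    """
    U, s, Vt = np.linalg.svd(X.T @ Y)
    return U[:, :k], Vt[:k, :].T
```

The transformed features X W_x and Y W_y capture the coupled audio-visual patterns; in the paper's system they are then fed to an HMM to model temporal structure.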

Journal ArticleDOI
TL;DR: This work proposes a method of Ranking based Multi-correlation Tensor Factorization (RMTF), to jointly model the ternary relations among user, image, and tag, and further to precisely reconstruct the user-aware image-tag associations as a result.
Abstract: Large-scale user contributed images with tags are easily available on photo sharing websites. However, the noisy or incomplete correspondence between the images and tags prohibits them from being leveraged for precise image retrieval and effective management. To tackle the problem of tag refinement, we propose a method of Ranking based Multi-correlation Tensor Factorization (RMTF), to jointly model the ternary relations among user, image, and tag, and further to precisely reconstruct the user-aware image-tag associations as a result. Since the user interest or background can be explored to eliminate the ambiguity of image tags, the proposed RMTF is believed to be superior to the traditional solutions, which only focus on the binary image-tag relations. During the model estimation, we employ a ranking based optimization scheme to interpret the tagging data, in which the pair-wise qualitative difference between positive and negative examples is used, instead of the point-wise 0/1 confidence. Specifically, the positive examples are directly decided by the observed user-image-tag interrelations, while the negative ones are collected with respect to the most semantically and contextually irrelevant tags. Extensive experiments on a benchmark Flickr dataset demonstrate the effectiveness of the proposed solution for tag refinement. We also show attractive performances on two potential applications as the by-products of the ternary relation analysis.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed approach not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides satisfactory results for naturalistic expressions.
Abstract: This paper presents an approach to the automatic recognition of human emotions from audio-visual bimodal signals using an error weighted semi-coupled hidden Markov model (EWSC-HMM). The proposed approach combines an SC-HMM with a state-based bimodal alignment strategy and a Bayesian classifier weighting scheme to obtain the optimal emotion recognition result based on audio-visual bimodal fusion. The state-based bimodal alignment strategy in SC-HMM is proposed to align the temporal relation between audio and visual streams. The Bayesian classifier weighting scheme is then adopted to explore the contributions of the SC-HMM-based classifiers for different audio-visual feature pairs in order to obtain the emotion recognition output. For performance evaluation, two databases are considered: the MHMC posed database and the SEMAINE naturalistic database. Experimental results show that the proposed approach not only outperforms other fusion-based bimodal emotion recognition methods for posed expressions but also provides satisfactory results for naturalistic expressions.

Journal ArticleDOI
TL;DR: In this article, the normalized energy density present within windows of varying sizes in the second derivative of the image in the frequency domain is exploited to derive a 19-D feature vector that is used to train an SVM classifier.
Abstract: We propose a new method to detect resampled imagery. The method is based on examining the normalized energy density present within windows of varying size in the second derivative of the image in the frequency domain, and exploiting this characteristic to derive a 19-D feature vector that is used to train an SVM classifier. Experimental results are reported on 7500 raw images from the BOSS database. Comparison with prior work reveals that the proposed algorithm performs similarly for resampling rates greater than 1, and is superior to prior work for resampling rates less than 1. Experiments are performed for both bilinear and bicubic interpolations, and qualitatively similar results are observed for each. Results are also provided for the detection of resampled imagery with noise corruption and JPEG compression. As expected, some degradation in performance is observed as the noise increases or the JPEG quality factor declines.
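The windowed energy-density feature can be sketched as follows. The exact window sizes, derivative direction, and normalization used to build the 19-D vector are assumptions here; only the overall construction (second derivative, 2-D spectrum, energy density in growing central windows) follows the abstract.

```python
import numpy as np

def resampling_features(img, n_windows=19):
    """Energy-density feature vector from the second-derivative spectrum.

    img: 2-D grayscale array. Returns an (n_windows,) vector: the
    normalized energy density of the centered power spectrum inside
    n_windows concentric windows of growing size.
    """
    d2 = np.diff(img.astype(float), n=2, axis=0)        # second derivative along rows
    spec = np.abs(np.fft.fftshift(np.fft.fft2(d2))) ** 2  # centered power spectrum
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    total = spec.sum()
    feats = []
    for i in range(1, n_windows + 1):
        ry, rx = int(cy * i / n_windows), int(cx * i / n_windows)
        win = spec[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1]
        feats.append(win.sum() / (win.size * total))     # normalized energy density
    return np.array(feats)
```

Resampling introduces periodic correlations that concentrate spectral energy, shifting this density profile; the 19-D vectors are then fed to an SVM as in the paper.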

Journal ArticleDOI
TL;DR: A novel asymmetric coding method of multi-view video plus depth (MVD) based 3-D video is proposed with the aim of providing high-quality view rendering, and experimental results show that compared with other methods, the proposed method can obtain higher view rendering performance under the total bitrate constraint.
Abstract: The recent years have witnessed three-dimensional (3-D) video technology become increasingly popular, as it can provide high-quality and immersive experience to end users, where view rendering with the depth-image-based rendering (DIBR) technique is employed to generate the virtual views. Distortions in the depth map may induce geometry changes in the virtual views, and distortions in the texture video may be propagated to the virtual views. Thus, effective compression of both texture videos and depth maps is important for a 3-D video system. From the perspective of bit allocation, asymmetric coding of the texture videos and depth maps is an effective way to get the optimal solution of 3-D video compression and view rendering problems. In this paper, a novel asymmetric coding method of multi-view video plus depth (MVD) based 3-D video is proposed with the aim of providing high-quality view rendering. In the proposed method, two models are proposed to characterize view rendering distortion and binocular suppression in 3-D video. Then, an asymmetric coding method of MVD-based 3-D video is proposed by combining the two models in the encoding framework. Finally, a chrominance reconstruction algorithm is presented to achieve accurate reconstruction. Experimental results show that compared with other methods, the proposed method can obtain higher view rendering performance under the total bitrate constraint. Moreover, the perceptual visual quality of 3-D video is almost unaffected by the proposed method.

Journal ArticleDOI
TL;DR: A face annotation system to automatically collect and label celebrity faces from the web is presented, with a context likelihood proposed to constrain the name assignment process.
Abstract: In this paper, we present a face annotation system to automatically collect and label celebrity faces from the web. With the proposed system, we have constructed a large-scale dataset called “Celebrities on the Web,” which contains 2.45 million distinct images of 421 436 celebrities and is orders of magnitude larger than previous datasets. Collecting and labeling such a large-scale dataset pose great challenges to current multimedia mining methods. In this work, a two-step face annotation approach is proposed to accomplish this task. In the first step, an image annotation system is proposed to label an input image with a list of celebrities. To utilize the noisy textual data, we construct a large-scale celebrity name vocabulary to identify candidate names from the surrounding text. Moreover, we expand the scope of analysis to the surrounding text of webpages hosting near-duplicates of the input image. In the second step, the celebrity names are assigned to the faces by label propagation on a facial similarity graph. To cope with the large variance in facial appearances, a context likelihood is proposed to constrain the name assignment process. In an evaluation on 21 735 faces, both the image annotation system and the name assignment algorithm significantly outperform previous techniques.
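The name-assignment step can be illustrated with generic label propagation on a similarity graph; the row normalization, damping factor, and iteration count below are assumptions, not the paper's exact formulation (which additionally uses the context likelihood):

```python
import numpy as np

def propagate_labels(S, Y0, alpha=0.85, iters=100):
    """Illustrative label propagation on a facial-similarity graph:
    name labels (columns of Y0) spread along graph edges while the
    seed labels keep pulling the solution back."""
    # Row-normalize the similarity matrix into transition weights.
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)
    Y = Y0.astype(float)
    for _ in range(iters):
        Y = alpha * W @ Y + (1 - alpha) * Y0
    return Y
```

Unlabeled faces inherit the name that dominates their neighborhood in the graph.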

Journal ArticleDOI
TL;DR: A unified model for detecting different types of video shot transitions is presented; a frame estimation scheme using the previous and next frames is formulated, and frames are classified with a multilayer perceptron network.
Abstract: We present a unified model for detecting different types of video shot transitions. Based on the proposed model, we formulate a frame estimation scheme using the previous and next frames. Unlike other shot boundary detection algorithms, instead of properties of individual frames, frame transition parameters and frame estimation errors based on global and local features are used for boundary detection and classification. Local features include the scatter matrix of edge strength and the motion matrix. Finally, the frames are classified as no change (within-shot), abrupt change, or gradual change frames using a multilayer perceptron network. The proposed method is relatively less dependent on user-defined thresholds and is free from the sliding window size widely used by various schemes in the literature. Moreover, handling both abrupt and gradual transitions along with non-transition frames under a single framework using model-guided visual features is another unique aspect of this work.
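The frame-estimation idea behind the features can be sketched with a simple averaging predictor; this is an illustrative stand-in for the paper's transition-parameter model, not its actual estimator:

```python
import numpy as np

def estimation_error(prev_f, cur_f, next_f):
    """Predict the current frame from its temporal neighbors and
    measure the residual: within a shot the prediction is good,
    across an abrupt cut it is not. The averaging predictor here
    is an illustrative assumption."""
    pred = 0.5 * (prev_f.astype(float) + next_f.astype(float))
    return float(np.mean((cur_f.astype(float) - pred) ** 2))
```

Feature vectors built from such errors (plus local features) would then be classified by the multilayer perceptron.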

Journal ArticleDOI
TL;DR: The proposed GL-MKL determines the optimal base kernels, including the associated weights and kernel parameters, resulting in improved recognition performance; it can also be extended to address heterogeneous variable selection problems.
Abstract: We propose a novel multiple kernel learning (MKL) algorithm with a group lasso regularizer, called group lasso regularized MKL (GL-MKL), for heterogeneous feature fusion and variable selection. For feature fusion problems, assigning a group of base kernels to each feature type in an MKL framework provides a robust way of fitting data extracted from different feature domains. Adding a mixed-norm constraint (i.e., group lasso) as the regularizer, we can enforce sparsity at the group/feature level and automatically learn a compact feature set for recognition purposes. More precisely, our GL-MKL determines the optimal base kernels, including the associated weights and kernel parameters, and results in improved recognition performance. Moreover, our GL-MKL can be extended to address heterogeneous variable selection problems. For such problems, we aim to select a compact set of variables (i.e., feature attributes) for comparable or improved performance. Our proposed method does not need to exhaustively search the entire variable space as prior sequential variable selection methods do, nor does it require any prior knowledge of the optimal size of the variable subset. To verify the effectiveness and robustness of our GL-MKL, we conduct experiments on video and image datasets for heterogeneous feature fusion, and perform variable selection on various UCI datasets.
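The group-level sparsity induced by the mixed-norm regularizer can be illustrated with the standard group soft-thresholding (proximal) step; this is a generic sketch of the group-lasso effect, not the full GL-MKL solver:

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Group soft-thresholding: the mixed-norm penalty shrinks base-kernel
    weights group by group, zeroing out whole feature groups at once."""
    out = np.zeros_like(beta, dtype=float)
    for g in groups:                       # g: indices of one feature group
        v = beta[g]
        norm = np.linalg.norm(v)
        if norm > lam:
            out[g] = (1 - lam / norm) * v  # group survives, shrunk
        # else: the whole group is set to zero (group-level sparsity)
    return out
```

Applied inside an MKL solver, this step discards entire feature types whose base-kernel weights are too weak, yielding a compact feature set.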

Journal ArticleDOI
TL;DR: The compressive sensing (CS) principles are studied and an alternative CS-based coding paradigm with a large number of descriptions is proposed for high-packet-loss transmission; experimental results show that the proposed CS-based codec is much more robust against lossy channels while achieving higher rate-distortion performance.
Abstract: Multiple description coding (MDC) is one of the widely used mechanisms to combat packet loss in non-feedback systems. However, the number of descriptions in existing MDC schemes is very small (typically 2); as the number of descriptions increases, the coding complexity increases drastically and many decoders would be required. In this paper, the compressive sensing (CS) principles are studied and an alternative coding paradigm with a large number of descriptions is proposed based upon CS for high-packet-loss transmission. A two-dimensional discrete wavelet transform (DWT) is applied for sparse representation. Unlike in typical wavelet coders (e.g., JPEG 2000), the DWT coefficients here are not directly encoded, but re-sampled towards equal importance of information instead. At the decoder side, by fully exploiting the intra-scale and inter-scale correlation of the multiscale DWT, two different CS recovery algorithms are developed for the low-frequency subband and the high-frequency subbands, respectively. The recovery quality depends only on the number of received CS measurements, not on which particular measurements are received. Experimental results show that the proposed CS-based codec is much more robust against lossy channels, while achieving higher rate-distortion (R-D) performance compared with conventional wavelet-based MDC methods and relevant existing CS-based coding schemes.
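The property that recovery depends only on how many measurements arrive can be illustrated with a generic CS decoder; orthogonal matching pursuit below is a stand-in for the paper's subband-specific recovery algorithms, and the measurement setup is an assumption:

```python
import numpy as np

def omp_recover(Phi, y, sparsity):
    """Orthogonal matching pursuit: recover a sparse vector from the
    random measurements (descriptions) that happened to arrive. Any
    sufficiently large subset of measurements works equally well."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        idx = int(np.argmax(np.abs(Phi.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least-squares fit on the current support.
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat
```

Because the rows of Phi are interchangeable, dropping packets simply deletes rows: the decoder runs unchanged on whatever subset arrives.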

Journal ArticleDOI
TL;DR: The proposed full-reference (FR) algorithm is more efficient due to its low complexity without jeopardizing prediction accuracy; cross-database tests have been carried out to put its performance in proper perspective relative to other VQA methods.
Abstract: Objective video quality assessment (VQA) is the use of computational models to evaluate video quality in line with the perception of the human visual system (HVS). It is challenging due to the underlying complexity and the relatively limited understanding of the HVS and its intricate mechanisms. Three important issues arise in objective VQA in comparison with image quality assessment: 1) temporal factors, apart from spatial ones, also need to be considered; 2) the contribution of each factor (spatial and temporal), and their interaction, to the overall video quality needs to be determined; and 3) the computational complexity of the resultant method matters. In this paper, we tackle the first issue by utilizing a worst-case pooling strategy and the variations of spatial quality along the temporal axis, with proper analysis and justification. The second issue is addressed by the use of machine learning; we believe this to be more convincing since the relationship between the factors and the overall quality is derived via training with substantial ground truth (i.e., subjective scores). Experiments conducted using publicly available video databases show the effectiveness of the proposed full-reference (FR) algorithm in comparison to relevant existing VQA schemes. Focus has also been placed on demonstrating the robustness of the proposed method to new and untrained data. To that end, cross-database tests have been carried out to provide a proper perspective on the performance of the proposed scheme as compared to other VQA methods. The third issue, computational cost, also plays a key role in determining the feasibility of a VQA scheme for practical deployment, given the large amount of data that needs to be processed/analyzed in real time. A limitation of many existing VQA algorithms is their high computational complexity. In contrast, the proposed scheme is more efficient due to its low complexity without jeopardizing prediction accuracy.
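The worst-case pooling and temporal-variation cues can be sketched as two features that a learned model would then combine; the 10% worst-frame fraction below is an assumption, and the combination rule is left to the learner as in the paper:

```python
import numpy as np

def temporal_quality_features(frame_scores, worst_pct=0.1):
    """Summarize per-frame spatial quality with (a) the mean of the
    worst frames (worst-case pooling) and (b) the variation of quality
    along the temporal axis. A trained model, not a fixed rule, maps
    these to the overall score."""
    s = np.sort(np.asarray(frame_scores, dtype=float))
    k = max(1, int(len(s) * worst_pct))
    worst_mean = float(s[:k].mean())   # worst-case pooling
    variation = float(np.std(s))       # temporal variation cue
    return worst_mean, variation
```

A short burst of badly degraded frames barely moves the plain mean, but it sharply lowers the worst-case feature and raises the variation feature, matching how viewers judge such videos.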

Journal ArticleDOI
TL;DR: A novel real-time algorithm for head and hand tracking based on data from a range camera, which is exploited to resolve ambiguities and overlaps; the extracted hand trajectories may be used for interactive applications as well as for gesture classification purposes.
Abstract: A novel real-time algorithm for head and hand tracking is proposed in this paper. This approach is based on data from a range camera, which is exploited to resolve ambiguities and overlaps. The position of the head is estimated with a depth-based template matching, its robustness being reinforced with an adaptive search zone. Hands are detected in a bounding box attached to the head estimate, so that the user may move freely in the scene. A simple method to decide whether the hands are open or closed is also included in the proposal. Experimental results show high robustness against partial occlusions and fast movements. Accurate hand trajectories may be extracted from the estimated hand positions, and may be used for interactive applications as well as for gesture classification purposes.
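The depth-based template matching can be sketched as an exhaustive sum-of-squared-differences search over the range image; the paper's adaptive search zone (which restricts this search for speed and robustness) is omitted here for brevity:

```python
import numpy as np

def match_head(depth, template):
    """Slide a head-shaped depth template over the range image and
    return the top-left position with the smallest sum of squared
    differences. Exhaustive search is an illustrative simplification."""
    th, tw = template.shape
    H, W = depth.shape
    best, best_pos = np.inf, (0, 0)
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            patch = depth[y:y + th, x:x + tw]
            err = np.sum((patch - template) ** 2)
            if err < best:
                best, best_pos = err, (y, x)
    return best_pos
```

Hands would then be searched for only inside a bounding box attached to the returned head position.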

Journal ArticleDOI
TL;DR: A novel framework, HodgeRank on Random Graphs, based on paired comparison, for subjective video quality assessment; it is suitable not only for traditional laboratory studies, but also for crowdsourcing experiments on the Internet, where raters are distributed and hard to control with fixed designs.
Abstract: This paper introduces a novel framework, HodgeRank on Random Graphs, based on paired comparison, for subjective video quality assessment. Two types of random graph models are studied, i.e., Erdos-Renyi random graphs and random regular graphs. Hodge decomposition of paired comparison data can derive, from incomplete and imbalanced data, quality scores of videos and the inconsistency of participants' judgments. We demonstrate the effectiveness of the proposed framework on the LIVE video database. Both random designs are promising sampling methods that do not jeopardize the accuracy of the results. In particular, due to balanced sampling, random regular graphs may achieve better performance when sampling rates are small. However, when the number of videos is large or when sampling rates are large, their performances are so close that Erdos-Renyi random graphs, as the simplest independent and identically distributed sampling scheme, provide good approximations to random regular graphs, a dependent sampling scheme. In contrast to traditional deterministic incomplete block designs, our random design is suitable not only for traditional laboratory studies, but also for crowdsourcing experiments on the Internet, where raters are distributed and hard to control with fixed designs.
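The gradient (global score) component of the Hodge decomposition reduces to a least-squares problem on the comparison graph, sketched below; the comparison encoding is an assumption, and the curl/cyclic inconsistency components are omitted:

```python
import numpy as np

def hodge_scores(n, comparisons):
    """Least-squares part of HodgeRank: find global scores s minimizing
    sum over observed pairs of (s[j] - s[i] - y)^2, so incomplete and
    imbalanced paired data still yield a ranking.
    comparisons: list of (i, j, y) meaning item j beat item i by margin y."""
    rows, b = [], []
    for i, j, y in comparisons:
        r = np.zeros(n)
        r[j], r[i] = 1.0, -1.0
        rows.append(r)
        b.append(y)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)
    return s - s.mean()   # scores are defined only up to a constant shift
```

Each comparison contributes one equation, so any connected random graph of pairs (Erdos-Renyi or regular) determines the scores.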

Journal ArticleDOI
TL;DR: A new perspective, the Vicept representation, to address the problems of visual polysemia and concept polymorphism in large-scale semantic image understanding; a novel image distance measurement based on the hierarchical Vicept description is also introduced.
Abstract: This paper proposes a new perspective, the Vicept representation, to address the problems of visual polysemia and concept polymorphism in large-scale semantic image understanding. Vicept characterizes the membership probability distribution between visual appearances and semantic concepts, and forms a hierarchical representation of image semantics from local to global. In the implementation, incorporating group sparse coding, a visual appearance is encoded as a weighted sum of dictionary elements, which yields a more accurate image representation with sparsity at the image level. To obtain discriminative Vicept descriptions with structural sparsity, mixed-norm regularization is adopted in the optimization problem for learning the concept membership distribution of a visual appearance. Furthermore, we introduce a novel image distance measurement based on the hierarchical Vicept description, where different levels of Vicept distance are fused together by multi-level separability analysis. Finally, the wide applicability of the Vicept description is validated in our experiments, including large-scale semantic image search, image annotation, and semantic image re-ranking.
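The membership probability distribution between a visual appearance and semantic concepts can be illustrated with a softmax over concept similarities; this is a hypothetical stand-in for the learned distribution, which the paper obtains via mixed-norm regularized optimization:

```python
import numpy as np

def vicept_membership(similarities, temperature=1.0):
    """Turn concept similarities for one visual appearance into a
    membership probability distribution (soft assignment), so a
    polysemous appearance keeps nonzero mass on several concepts."""
    z = np.asarray(similarities, dtype=float) / temperature
    z -= z.max()                 # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()
```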

Journal ArticleDOI
TL;DR: A preference-aware aesthetic model is proposed to suggest views matching users' preferred photographic styles, where a bottom-up approach constructs an aesthetic feature library with bag-of-aesthetics-preserving features, instead of the top-down methods that implement heuristic guidelines.
Abstract: In this paper, a framework for a real-time view recommendation system is proposed. The system comprises two parts: an offline aesthetic modeling stage and an efficient online aesthetic view finding process. A preference-aware aesthetic model is proposed to suggest views according to users' preferred photographic styles, where a bottom-up approach is developed to construct an aesthetic feature library with bag-of-aesthetics-preserving features, instead of the top-down methods employed in previous works, which implement heuristic guidelines (rule-specific features) from the photography literature. A collection of scenic photos is used as the test set; however, the proposed method can be applied to other types of photo collections according to different application scenarios. The proposed model covers both implicit and explicit aesthetic features and can adapt to users' preferences through a learning process. In the second part, the learned model is employed in a viewfinder to help the user locate the most aesthetic view while taking a photograph. The experimental results show that the proposed features in the library (92.06% accuracy) significantly outperform the state-of-the-art rule-specific features (83.63% accuracy) in the photo aesthetic quality classification task, and the rule-specific features are also shown to be encompassed by the proposed features. Meanwhile, it is observed from experiments that features extracted for contrast information are more effective than those for absolute information, which is consistent with the properties of the human visual system. Furthermore, the user studies for the view recommendation task confirm that the suggested views are consistent with users' preferences (81.25% agreement).