Bio: Rui Zhang is an academic researcher from Ryerson University. The author has contributed to research in topics: Image retrieval & Automatic image annotation. The author has an h-index of 7, co-authored 20 publications receiving 134 citations. Previous affiliations of Rui Zhang include Tianjin University & Nanyang Technological University.
TL;DR: By incorporating class label information of the training samples, the proposed LMCCA ensures that the fused features carry discriminative characteristics of the multimodal information representations and are capable of providing superior recognition performance.
Abstract: The objective of multimodal information fusion is to mathematically analyze information carried in different sources and create a new representation that will be more effectively utilized in pattern recognition and other multimedia information processing tasks. In this paper, we introduce a new method for multimodal information fusion and representation based on the Labeled Multiple Canonical Correlation Analysis (LMCCA). By incorporating class label information of the training samples, the proposed LMCCA ensures that the fused features carry discriminative characteristics of the multimodal information representations and are capable of providing superior recognition performance. We implement a prototype of LMCCA to demonstrate its effectiveness on handwritten digit recognition, face recognition, and object recognition utilizing multiple features, as well as bimodal human emotion recognition involving information from both audio and visual domains. The generic nature of LMCCA allows it to take as input features extracted by any means, including those by deep learning (DL) methods. Experimental results show that the proposed method enhanced the performance of both statistical machine learning methods and methods based on DL.
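As a rough illustration of the correlation-based fusion that LMCCA builds on, the sketch below implements classical (unlabeled) CCA via an SVD of the whitened cross-covariance and concatenates the leading canonical variates into a fused feature. This is a hedged sketch of the underlying CCA machinery only; the label-weighting that distinguishes LMCCA, and the function name `cca`, are not taken from the paper.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Classical CCA via SVD of the whitened cross-covariance.
    A sketch of the machinery LMCCA extends; the class-label
    weighting of LMCCA itself is omitted here."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])  # ridge for stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)                  # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M)                   # s = canonical correlations
    A = np.linalg.solve(Lx.T, U)                  # projection for modality X
    B = np.linalg.solve(Ly.T, Vt.T)               # projection for modality Y
    return A, B, s

# Fused representation: concatenate the leading d canonical variates,
# e.g. Z = np.hstack([X @ A[:, :d], Y @ B[:, :d]])
```

The fused vector Z can then be fed to any downstream classifier, which is consistent with the paper's point that the fusion stage is agnostic to how the input features were extracted.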
12 Oct 2010
TL;DR: It is shown that combining multimodal information provides a more complete and effective description of the intrinsic characteristics of a specific pattern, and produces improved system performance compared with a single modality alone.
Abstract: The effective interpretation and integration of multiple information sources are important for the efficacious utilisation of multimedia in a wide variety of application contexts. The major challenge in multimodal information fusion lies in the difficulty of identifying the complementary and discriminatory representations from individual channels, and in the efficient fusion of the resulting information for the targeted application problem. This paper outlines several multimedia systems that utilise a multimodal approach, and provides a comprehensive review of the state-of-the-art in related areas, including emotion recognition, image annotation and retrieval, and biometrics. Data collected from diverse sources or sensors are employed to improve recognition or classification accuracy. It is shown that combining multimodal information provides a more complete and effective description of the intrinsic characteristics of a specific pattern, and produces improved system performance compared with a single modality alone. In addition, we present a facial fiducial point detection method and a gesture recognition system, which can be incorporated into a multimodal framework. The issues and challenges in the research and development of multimodal systems are discussed, and a cutting-edge application of multimodal information fusion for intelligent robotic systems is presented.
28 Nov 2011
TL;DR: The presented model can be considered as an extension of the probabilistic latent semantic analysis (pLSA) in that it handles data from two different visual feature domains by attaching one more leaf node to the graphical structure of the original pLSA.
Abstract: We study in this paper the problem of combining low-level visual features for image region annotation. The problem is tackled with a novel method that combines texture and color features via a mixture model of their joint distribution. The structure of the presented model can be considered as an extension of the probabilistic latent semantic analysis (pLSA) in that it handles data from two different visual feature domains by attaching one more leaf node to the graphical structure of the original pLSA. Therefore, the proposed approach is referred to as multi-feature pLSA (MF-pLSA). The supervised paradigm is adopted to classify a new image region into one of a few pre-defined object categories using the MF-pLSA. To evaluate the performance, the VOC2009 and LabelMe databases were employed in our experiments, along with various experimental settings in terms of the number of visual words and mixture components. Evaluated based on the average recall and precision, the MF-pLSA is demonstrated to be superior to seven other approaches, including other schemes for visual feature combination.
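To make the two-leaf-node idea concrete, the sketch below fits, by EM, a mixture in the same spirit as MF-pLSA: each image region carries a texture-word count vector and a color-word count vector that are conditionally independent given a latent component. This is an illustrative two-view multinomial mixture, not the exact MF-pLSA graphical model, and all names (`two_view_mixture_em`, `th1`, `th2`) are assumptions for the sketch.

```python
import numpy as np

def two_view_mixture_em(N1, N2, K=2, iters=30, seed=0):
    """EM for a two-view multinomial mixture: region i has texture-word
    counts N1[i] and color-word counts N2[i], conditionally independent
    given latent component z (in the spirit of MF-pLSA's extra leaf node)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                        # P(z)
    th1 = rng.dirichlet(np.ones(N1.shape[1]), K)    # P(texture word | z)
    th2 = rng.dirichlet(np.ones(N2.shape[1]), K)    # P(color word | z)
    ll_hist = []
    for _ in range(iters):
        # E-step: log-responsibilities combine BOTH feature domains
        log_r = (np.log(pi)[None, :]
                 + N1 @ np.log(th1).T
                 + N2 @ np.log(th2).T)              # shape (n_regions, K)
        m = log_r.max(axis=1, keepdims=True)
        ll_hist.append(float((m.ravel()
                              + np.log(np.exp(log_r - m).sum(axis=1))).sum()))
        r = np.exp(log_r - m)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and both word distributions
        pi = r.mean(axis=0)
        th1 = r.T @ N1 + 1e-9
        th1 /= th1.sum(axis=1, keepdims=True)
        th2 = r.T @ N2 + 1e-9
        th2 /= th2.sum(axis=1, keepdims=True)
    return pi, th1, th2, r, ll_hist
```

The monotone increase of the log-likelihood across iterations is the usual EM sanity check; supervised classification as in the paper would fit one such model per object category and pick the category with the highest likelihood.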
11 Sep 2012
TL;DR: A method is presented for detecting body parts in a video stream from a mobile device: a frame is partitioned into observation windows, non-skin-toned pixels are eliminated, and the entropy of the remaining pixels in each window is measured to locate candidate body parts.
Abstract: A method is provided for detecting a body part in a video stream from a mobile device. A video stream of a human subject is received from a camera connected to the mobile device. The video stream has frames. A first frame of the video stream is identified for processing. This first frame is then partitioned into observation windows, each observation window having pixels. In each observation window, non-skin-toned pixels are eliminated; and the remaining pixels are compared to determine a degree of entropy of the pixels in the observation window. In any observation window having a degree of entropy above a predetermined threshold, a bounded area is made around the region of high entropy pixels. The consistency of the entropy is analyzed in the bounded area. If the bounded area has inconsistently high entropy, a body part is determined to be detected at that bounded area.
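The per-window entropy test at the heart of the method can be sketched as follows. This is a simplified grayscale illustration under stated assumptions: the skin-tone elimination and the bounded-area consistency analysis described above are omitted, and the window size, bin count, and threshold are placeholder values, not ones taken from the patent.

```python
import numpy as np

def window_entropy(window, bins=32):
    """Shannon entropy (bits) of the intensity histogram of one window."""
    hist, _ = np.histogram(window, bins=bins, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def detect_high_entropy_windows(frame, win=16, thresh=3.0):
    """Partition a grayscale frame into win x win observation windows and
    flag the top-left corners of those whose entropy exceeds the threshold."""
    h, w = frame.shape
    hits = []
    for y in range(0, h - win + 1, win):
        for x in range(0, w - win + 1, win):
            if window_entropy(frame[y:y + win, x:x + win]) > thresh:
                hits.append((y, x))
    return hits
```

A flat background window has near-zero entropy, while a textured region (skin with shading, edges) scores high, which is what the threshold exploits.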
12 Nov 2007
TL;DR: The proposed method improves the accuracy of texture image retrieval in terms of average retrieval rate, compared with the traditional method using GGD for feature extraction and the Kullback-Leibler divergence for similarity measurement.
Abstract: In this paper, a novel approach to texture retrieval using independent component analysis (ICA) in the wavelet domain is proposed. It is well recognized that the wavelet coefficients in different subbands are statistically correlated, resulting in the fact that the product of the marginal distributions of wavelet coefficients is not accurate enough to characterize the stochastic properties of texture images. To tackle this problem, we employ ICA in feature extraction to decorrelate the analysis coefficients in different subbands, followed by modeling the marginal distributions of the separated sources using the generalized Gaussian density (GGD), and perform similarity measurement based on the maximum likelihood criterion. It is demonstrated by simulation results on a database consisting of 1776 texture images that the proposed method improves the accuracy of texture image retrieval in terms of average retrieval rate, compared with the traditional method using GGD for feature extraction and Kullback-Leibler divergence for similarity measurement.
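The GGD modeling and maximum-likelihood scoring steps can be sketched as below: the shape parameter is estimated by the standard moment-matching (gamma-ratio) method, and a candidate texture is scored by the average log-likelihood of the query coefficients under its fitted GGD. This is a hedged sketch of the generic GGD/ML machinery, not the paper's full ICA pipeline; function names are assumptions.

```python
import math
import numpy as np

def fit_ggd(x):
    """Moment-matching fit of a zero-mean GGD, p(x) ∝ exp(-(|x|/alpha)^beta).
    The shape beta solves Gamma(2/b)^2 / (Gamma(1/b) Gamma(3/b)) = m1^2/m2,
    found here by bisection (the ratio is increasing in b)."""
    m1 = float(np.mean(np.abs(x)))
    m2 = float(np.mean(x ** 2))
    r = m1 * m1 / m2
    f = lambda b: math.gamma(2.0 / b) ** 2 / (math.gamma(1.0 / b) * math.gamma(3.0 / b))
    lo, hi = 0.05, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if f(mid) < r:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = m1 * math.gamma(1.0 / beta) / math.gamma(2.0 / beta)
    return alpha, beta

def ggd_loglik(x, alpha, beta):
    """Average log-likelihood of coefficients x under a fitted GGD — the
    maximum-likelihood similarity score used to rank candidate textures."""
    c = math.log(beta / (2.0 * alpha * math.gamma(1.0 / beta)))
    return c - float(np.mean((np.abs(x) / alpha) ** beta))
```

Retrieval then amounts to ranking database textures by `ggd_loglik` of the query's (decorrelated) coefficients under each candidate's fitted parameters.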
01 Jan 2012
TL;DR: Multi-view representation learning has become a rapidly growing direction in machine learning and data mining; this paper provides a comprehensive survey of multi-view representation learning methods, organized around alignment and fusion.
Abstract: Recently, multi-view representation learning has become a rapidly growing direction in machine learning and data mining areas. This paper introduces two categories for multi-view representation learning: multi-view representation alignment and multi-view representation fusion. Consequently, we first review the representative methods and theories of multi-view representation learning from the perspective of alignment, such as correlation-based alignment; representative examples are canonical correlation analysis (CCA) and its several extensions. Then, from the perspective of representation fusion, we investigate the advancement of multi-view representation learning that ranges from generative methods, including multi-modal topic learning, multi-view sparse coding, and multi-view latent space Markov networks, to neural network-based methods, including multi-modal autoencoders, multi-view convolutional neural networks, and multi-modal recurrent neural networks. We also investigate several important applications of multi-view representation learning. Overall, this survey aims to provide an insightful overview of the theoretical foundations and state-of-the-art developments in the field of multi-view representation learning and to help researchers find the most appropriate tools for particular applications.
11 Nov 2014
TL;DR: A survey on the theoretical and practical work offering new and broad views of the latest research in emotion recognition from bimodal information including facial and vocal expressions is provided.
Abstract: Emotion recognition is the ability to identify what people would think someone is feeling from moment to moment and to understand the connection between his/her feelings and expressions. In today's world, the human–computer interaction (HCI) interface undoubtedly plays an important role in our daily life. Toward a harmonious HCI interface, automated analysis and recognition of human emotion has attracted increasing attention from researchers in multidisciplinary research fields. In this paper, a survey of the theoretical and practical work offering new and broad views of the latest research in emotion recognition from bimodal information, including facial and vocal expressions, is provided. First, the currently available audiovisual emotion databases are described. Facial and vocal features and audiovisual bimodal data fusion methods for emotion recognition are then surveyed and discussed. Specifically, this survey also covers the recent emotion challenges in several conferences. Conclusions outline and address some of the existing emotion recognition issues.
TL;DR: A novel approach, kernel cross-modal factor analysis, is introduced, which identifies the optimal transformations that are capable of representing the coupled patterns between two different subsets of features by minimizing the Frobenius norm in the transformed domain.
Abstract: In this paper, we investigate kernel based methods for multimodal information analysis and fusion. We introduce a novel approach, kernel cross-modal factor analysis, which identifies the optimal transformations that are capable of representing the coupled patterns between two different subsets of features by minimizing the Frobenius norm in the transformed domain. The kernel trick is utilized for modeling the nonlinear relationship between two multidimensional variables. We examine and compare it with kernel canonical correlation analysis, which finds projection directions that maximize the correlation between two modalities, and with kernel matrix fusion, which integrates the kernel matrices of the respective modalities through algebraic operations. The performance of the introduced method is evaluated on an audiovisual bimodal emotion recognition problem. We first perform feature extraction from the audio and visual channels respectively. The presented approaches are then utilized to analyze the cross-modal relationship between audio and visual features. A hidden Markov model is subsequently applied for characterizing the statistical dependence across successive time segments, and identifying the inherent temporal structure of the features in the transformed domain. The effectiveness of the proposed solution is demonstrated through extensive experimentation.
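The linear core of cross-modal factor analysis can be sketched as follows: with full orthogonal transformations, the Frobenius-norm objective ||X Vx − Y Vy||_F reduces to maximizing the coupling term trace(Vx^T X^T Y Vy), which is solved by the SVD of X^T Y. This is a hedged linear sketch only; the kernel version described in the abstract would replace these inner products with kernel evaluations, and the function name `cfa` is an assumption.

```python
import numpy as np

def cfa(X, Y, k=2):
    """Linear cross-modal factor analysis sketch: orthonormal projections
    Vx, Vy coupling paired samples of two modalities, obtained from the
    SVD of the cross-product matrix X^T Y."""
    U, s, Vt = np.linalg.svd(X.T @ Y)
    return U[:, :k], Vt[:k].T, s   # Vx, Vy, singular values

# At the optimum, trace(Vx^T X^T Y Vy) equals the sum of the top-k
# singular values of X^T Y — a convenient correctness check.
```

In the paper's pipeline, the projected audio and visual features would then be passed to a hidden Markov model to capture their temporal structure.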
TL;DR: A deep review of state-of-the-art AIA methods is presented by synthesizing 138 publications from the past two decades, dividing AIA methods into five categories and comparing their performance on benchmark datasets with standard evaluation metrics.
Abstract: In recent years, image annotation has attracted extensive attention due to the explosive growth of image data. With the capability of describing images at the semantic level, image annotation has many applications not only in image analysis and understanding but also in some related disciplines, such as urban management and biomedical engineering. Because of the inherent weaknesses of manual image annotation, Automatic Image Annotation (AIA) has been pursued since the late 1990s. In this paper, a deep review of state-of-the-art AIA methods is presented by synthesizing 138 publications from the past two decades. We classify AIA methods into five categories: 1) Generative model-based image annotation, 2) Nearest neighbor-based image annotation, 3) Discriminative model-based image annotation, 4) Tag completion-based image annotation, and 5) Deep learning-based image annotation. Comparisons of the five types of AIA methods are made on the basis of the underlying idea, main contribution, model framework, computational complexity, computation time, and annotation accuracy. We also give an overview of five publicly available image datasets and four standard evaluation metrics commonly used as benchmarks for evaluating AIA methods. Then the performance of some typical or well-performing models is assessed based on the benchmark datasets and standard evaluation metrics. Finally, we share our viewpoints on the open issues and challenges in AIA as well as research trends in the future.
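For concreteness, the sketch below computes the per-word mean precision, mean recall, and the count of words with at least one correct prediction (often called N+), which are among the evaluation metrics commonly used in the AIA literature; this is a generic illustration under that assumption, and the function name and dictionary format are placeholders, not taken from the survey.

```python
def annotation_metrics(pred, truth):
    """Per-word mean precision/recall over the ground-truth vocabulary,
    plus N+ (number of words recalled at least once).
    pred, truth: dicts mapping image id -> set of annotation words."""
    vocab = set().union(*truth.values())
    prec, rec, nplus = [], [], 0
    for w in sorted(vocab):
        p_imgs = {i for i, tags in pred.items() if w in tags}   # predicted
        t_imgs = {i for i, tags in truth.items() if w in tags}  # ground truth
        tp = len(p_imgs & t_imgs)
        prec.append(tp / len(p_imgs) if p_imgs else 0.0)
        r = tp / len(t_imgs) if t_imgs else 0.0
        rec.append(r)
        nplus += r > 0
    n = len(vocab)
    return sum(prec) / n, sum(rec) / n, nplus
```

Averaging per word rather than per image is what makes these metrics sensitive to rare tags, a recurring point in AIA evaluation.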