Proceedings Article

Multimodal Deep Learning

28 Jun 2011, pp. 689-696
TL;DR: This work presents a series of tasks for multimodal learning and shows how to train deep networks that learn features to address them, demonstrating cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.
Abstract: Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
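
To make the shared-representation idea concrete, here is a minimal bimodal autoencoder sketch in PyTorch. The layer sizes, end-to-end MSE training, and random stand-in features are illustrative assumptions; the paper itself pretrains with sparse RBMs over audio spectrogram and mouth-region video features.

import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, audio_dim=100, video_dim=300, hidden=128, shared=64):
        super().__init__()
        # Modality-specific encoders feed one shared representation.
        self.enc_audio = nn.Sequential(nn.Linear(audio_dim, hidden), nn.Sigmoid())
        self.enc_video = nn.Sequential(nn.Linear(video_dim, hidden), nn.Sigmoid())
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, shared), nn.Sigmoid())
        # Both modalities are reconstructed from the shared code.
        self.dec_audio = nn.Linear(shared, audio_dim)
        self.dec_video = nn.Linear(shared, video_dim)

    def forward(self, audio, video):
        h = torch.cat([self.enc_audio(audio), self.enc_video(video)], dim=1)
        z = self.fuse(h)
        return self.dec_audio(z), self.dec_video(z)

model = BimodalAutoencoder()
audio, video = torch.randn(8, 100), torch.randn(8, 300)
# Training trick from the paper: sometimes zero out one modality while
# still requiring both reconstructions, so the shared code learns to
# carry cross-modal information.
rec_a, rec_v = model(torch.zeros_like(audio), video)
loss = nn.functional.mse_loss(rec_a, audio) + nn.functional.mse_loss(rec_v, video)
loss.backward()

Zeroing out a modality at training time, as in the last lines, is what allows a classifier trained on the shared code with audio-only inputs to still be tested with video-only inputs.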


Citations
Posted Content
TL;DR: A new Deep Adaptation Network (DAN) architecture is proposed, which generalizes deep convolutional neural networks to the domain adaptation scenario, learns transferable features with statistical guarantees, and scales linearly via an unbiased estimate of the kernel embedding.
Abstract: Recent studies reveal that a deep neural network can learn transferable features which generalize well to novel tasks for domain adaptation. However, as deep features eventually transition from general to specific along the network, the feature transferability drops significantly in higher layers with increasing domain discrepancy. Hence, it is important to formally reduce the dataset bias and enhance the transferability in task-specific layers. In this paper, we propose a new Deep Adaptation Network (DAN) architecture, which generalizes deep convolutional neural network to the domain adaptation scenario. In DAN, hidden representations of all task-specific layers are embedded in a reproducing kernel Hilbert space where the mean embeddings of different domain distributions can be explicitly matched. The domain discrepancy is further reduced using an optimal multi-kernel selection method for mean embedding matching. DAN can learn transferable features with statistical guarantees, and can scale linearly by unbiased estimate of kernel embedding. Extensive empirical evidence shows that the proposed architecture yields state-of-the-art image classification error rates on standard domain adaptation benchmarks.
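
The core of DAN is a multi-kernel maximum mean discrepancy (MMD) penalty between source and target hidden features. The sketch below is an illustrative quadratic-time, biased MMD estimate over one batch, not the paper's linear-time unbiased estimator; the bandwidths, feature size, and batch size are assumptions.

import torch

def multi_rbf_kernel(x, y, gammas=(0.5, 1.0, 2.0)):
    # Sum of Gaussian kernels at several bandwidths (the multi-kernel part).
    d2 = torch.cdist(x, y).pow(2)
    return sum(torch.exp(-g * d2) for g in gammas)

def mmd2(source, target):
    # MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    return (multi_rbf_kernel(source, source).mean()
            + multi_rbf_kernel(target, target).mean()
            - 2 * multi_rbf_kernel(source, target).mean())

src = torch.randn(32, 64)        # task-specific layer features, source batch
tgt = torch.randn(32, 64) + 0.5  # target batch from a shifted domain
print(float(mmd2(src, tgt)))     # in DAN this term is added to the classification loss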

3,351 citations


Cites background from "Multimodal Deep Learning"

  • ...Hence, deep neural networks have been explored for domain adaptation (Glorot et al., 2011; Chen et al., 2012), multimodal and multi-source learning problems (Ngiam et al., 2011; Ge et al., 2013), where significant performance gains have been witnessed....


  • ..., 2012), multimodal and multi-source learning problems (Ngiam et al., 2011; Ge et al., 2013), where significant performance gains have been obtained....


Journal ArticleDOI
TL;DR: In this article, a review of deep learning-based object detection frameworks is provided, focusing on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further.
Abstract: Due to object detection’s close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles that combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy, and optimization function. In this paper, we provide a review of deep learning-based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely, the convolutional neural network. Then, we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection, and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network-based learning systems.

3,097 citations

Journal ArticleDOI
TL;DR: This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications of transfer learning; the solutions surveyed are independent of data size and can be applied to big data environments.
Abstract: Machine learning and data mining techniques have been used in numerous real-world applications. An assumption of traditional machine learning methodologies is the training data and testing data are taken from the same domain, such that the input feature space and data distribution characteristics are the same. However, in some real-world machine learning scenarios, this assumption does not hold. There are cases where training data is expensive or difficult to collect. Therefore, there is a need to create high-performance learners trained with more easily obtained data from different domains. This methodology is referred to as transfer learning. This survey paper formally defines transfer learning, presents information on current solutions, and reviews applications applied to transfer learning. Lastly, there is information listed on software downloads for various transfer learning solutions and a discussion of possible future research work. The transfer learning solutions surveyed are independent of data size and can be applied to big data environments.
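
As one concrete (and hypothetical, not drawn from the survey) instance of the methodology defined above, the PyTorch sketch below reuses features learned on a large source domain and retrains only a new output head on scarcer target-domain data, i.e., standard fine-tuning.

import torch
import torch.nn as nn
from torchvision import models

# Backbone trained on the source domain (ImageNet); requires torchvision >= 0.13.
net = models.resnet18(weights="IMAGENET1K_V1")
for p in net.parameters():
    p.requires_grad = False                 # freeze source-domain features
net.fc = nn.Linear(net.fc.in_features, 10)  # new head for the target task

optimizer = torch.optim.Adam(net.fc.parameters(), lr=1e-3)
x = torch.randn(4, 3, 224, 224)             # stand-in target-domain batch
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(net(x), y)
loss.backward()
optimizer.step()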

2,900 citations


Cites methods from "Multimodal Deep Learning"

  • ...Other methods tested include a heterogeneous spectral mapping approach proposed by Shi [105], a method proposed by Vinokourov [120], and a multimodal deep learning approach proposed by Ngiam [79]....


  • ...Ngiam J, Khosla A, Kim M, Nam J, Lee H, Ng AY....


  • ...The results of the experiment from best to worst performance are Zhou [144], Ngiam [79], Vinokourov [120], and Shi [105]....


Book
Li Deng, Dong Yu
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations


Cites background from "Multimodal Deep Learning"

  • ...[268, 269] propose and evaluate an application of deep networks to learn features over audio/speech and image/video modalities....


  • ...The results presented in [269, 268] show that learning video features with both video and audio outperforms that with only video data....


  • ...While the deep generative architecture for multimodal learning described in [268, 269] is based on non-probabilistic autoencoder neural...


Journal ArticleDOI

2,404 citations


Cites background from "Multimodal Deep Learning"

  • ...Compared with bimodal AEs and bimodal DBNs [123,145], the Corr-AEs focus more on the correlation across data than the complementarity learned from different modalities....


References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
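
For reference, scikit-image ships a HOG implementation whose parameters map directly onto the design choices the paper studies; the window size and exact parameter values below are illustrative.

import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)  # stand-in for a 128x64 detection window
features = hog(
    image,
    orientations=9,           # fine orientation binning
    pixels_per_cell=(8, 8),   # relatively coarse spatial binning
    cells_per_block=(2, 2),   # overlapping descriptor blocks
    block_norm="L2-Hys",      # local contrast normalization
)
print(features.shape)  # one flat descriptor, e.g. the input to a linear SVM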

31,952 citations


"Multimodal Deep Learning" refers methods in this paper

  • ...We used an off-the-shelf object detector [12] with median filtering over time to extract the mouth regions....


Journal ArticleDOI
28 Jul 2006, Science
TL;DR: In this article, an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data is described.
Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
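
A minimal PyTorch sketch of the deep autoencoder shape follows: the layer sizes are illustrative, and the paper's key contribution, the RBM-based pretraining that supplies good initial weights before gradient-descent fine-tuning, is omitted.

import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(784, 256), nn.Sigmoid(),
    nn.Linear(256, 64), nn.Sigmoid(),
    nn.Linear(64, 2),  # small central code layer (cf. PCA to 2 dimensions)
)
decoder = nn.Sequential(
    nn.Linear(2, 64), nn.Sigmoid(),
    nn.Linear(64, 256), nn.Sigmoid(),
    nn.Linear(256, 784), nn.Sigmoid(),
)
autoencoder = nn.Sequential(encoder, decoder)  # then fine-tune to reconstruct inputs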

16,717 citations

Journal ArticleDOI
TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.
Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.
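
The greedy layer-at-a-time idea can be sketched with scikit-learn's BernoulliRBM, training each layer on the hidden activations of the one below; the layer sizes here are assumptions, and the contrastive wake-sleep fine-tuning stage from the paper is omitted.

import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.random.rand(500, 784)  # stand-in for pixel intensities in [0, 1]

layers, h = [], X
for n_hidden in (256, 64):
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=10)
    h = rbm.fit_transform(h)  # train this layer, then pass activations upward
    layers.append(rbm)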

15,055 citations


"Multimodal Deep Learning" refers background or methods in this paper

  • ...We first describe the restricted Boltzmann machine (RBM) [5, 6] followed by the sparsity regularization method [8]....


  • ...While self-taught learning was first motivated with sparse coding, recent work on deep learning [5, 6, 7] have examined how deep sigmoidal networks can be trained to produce useful representations for handwritten digits and text....


Proceedings ArticleDOI
05 Jul 2008
TL;DR: This work introduces and motivates a new training principle for unsupervised learning of a representation, based on the idea of making the learned representations robust to partial corruption of the input pattern.
Abstract: Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
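
A single denoising-autoencoder training step, sketched in PyTorch with masking noise (the corruption type, noise level, and layer sizes are illustrative): corrupt the input, but score the reconstruction against the clean original.

import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(784, 128), nn.Sigmoid(),
                   nn.Linear(128, 784), nn.Sigmoid())
x = torch.rand(32, 784)                    # clean inputs
mask = (torch.rand_like(x) > 0.3).float()  # masking noise: zero ~30% of entries
loss = nn.functional.mse_loss(ae(x * mask), x)
loss.backward()                            # gradient w.r.t. reconstructing the clean x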

6,816 citations


"Multimodal Deep Learning" refers methods in this paper

  • ...Instead, we propose a training method inspired by denoising autoencoders [11]....


Journal ArticleDOI
01 Dec 1976, Nature
TL;DR: The study reported here demonstrates a previously unrecognised influence of vision upon speech perception: when shown a film of a young woman's talking head in which repeated utterances of the syllable [ba] had been dubbed onto lip movements for [ga], normal adults reported hearing [da].
Abstract: MOST verbal communication occurs in contexts where the listener can see the speaker as well as hear him. However, speech perception is normally regarded as a purely auditory process. The study reported here demonstrates a previously unrecognised influence of vision upon speech perception. It stems from an observation that, on being shown a film of a young woman's talking head, in which repeated utterances of the syllable [ba] had been dubbed on to lip movements for [ga], normal adults reported hearing [da]. With the reverse dubbing process, a majority reported hearing [bagba] or [gaba]. When these subjects listened to the soundtrack from the film, without visual input, or when they watched untreated film, they reported the syllables accurately as repetitions of [ba] or [ga]. Subsequent replications confirm the reliability of these findings; they have important implications for the understanding of speech perception.

5,506 citations


"Multimodal Deep Learning" refers background in this paper

  • ...The McGurk effect [1] refers to an audio-visual perception phenomenon where a visual /ga/ with an audio /ba/ is perceived as /da/ by most subjects....


  • ...This was first exemplified in the McGurk effect [1] where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects....
