Author
Hao Tang
Other affiliations: Hewlett-Packard, Mitsubishi Electric Research Laboratories, Shanghai Ocean University, and others
Bio: Hao Tang is an academic researcher from the Massachusetts Institute of Technology. The author has contributed to research in topics including hidden Markov models and speaker recognition, has an h-index of 28, and has co-authored 104 publications receiving 2,275 citations. Previous affiliations of Hao Tang include Hewlett-Packard and Mitsubishi Electric Research Laboratories.
Papers
15 Sep 2019
TL;DR: The authors proposed an unsupervised autoregressive neural model for learning generic speech representations, which is designed to preserve information for a wide range of downstream tasks, such as phone classification and speaker verification.
Abstract: This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing the model to benefit from large quantities of unlabeled data. Speech representations learned by our model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches. Further analysis shows that different levels of speech information are captured by our model at different layers. In particular, the lower layers tend to be more discriminative for speakers, while the upper layers provide more phonetic content.
379 citations
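As a concrete illustration of the idea, below is a minimal sketch of an autoregressive predictive model: a unidirectional RNN reads past frames and is trained to predict a frame several steps ahead, so its hidden states can later serve as generic representations. The layer sizes, the 3-frame shift, and the L1 loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an autoregressive predictive model over speech frames.
import torch
import torch.nn as nn

class AutoregressivePredictiveModel(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3, shift=3):
        super().__init__()
        self.shift = shift  # predict `shift` frames into the future
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) surface features, e.g. log Mel spectrograms
        hidden, _ = self.rnn(x)   # hidden states double as learned representations
        pred = self.proj(hidden)  # frame-level predictions
        return pred, hidden

model = AutoregressivePredictiveModel()
x = torch.randn(4, 100, 80)  # a dummy batch of 100-frame utterances
pred, hidden = model(x)
# L1 loss between the prediction at time t and the true frame at time t + shift
loss = nn.functional.l1_loss(pred[:, :-model.shift], x[:, model.shift:])
loss.backward()
```

Because the objective forces the model to reconstruct future surface features rather than discard variability, both speaker and phonetic information survive in the hidden states, consistent with the layer-wise analysis described above.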
23 Jun 2008
TL;DR: A novel automatic feature selection method is proposed, based on maximizing the average relative entropy of marginalized class-conditional feature distributions; it is applied to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in 3D space.
Abstract: In this paper, the problem of person-independent facial expression recognition from 3D facial shapes is investigated. We propose a novel automatic feature selection method based on maximizing the average relative entropy of marginalized class-conditional feature distributions and apply it to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in the 3D space. Using a regularized multi-class AdaBoost classification algorithm, we achieve a 95.1% average recognition rate for six universal facial expressions on the publicly available 3D facial expression database BU-3DFE [1], with the highest average recognition rate, 99.2%, obtained for surprise. We compare these results with those based on a set of manually devised features and demonstrate that the automatically selected features outperform the manual ones. Our results outperform those presented in the previous work [2] and [3], namely average recognition rates of 83.6% and 91.3% on the same database, respectively.
158 citations
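A hedged sketch of the selection criterion follows: each candidate feature (a pairwise landmark distance) is scored by the average divergence between its class-conditional distributions, and the top-scoring features are kept. Modeling the marginals as 1-D Gaussians and using the symmetric KL divergence are simplifying assumptions for illustration; the AdaBoost classification stage is omitted.

```python
# Score candidate features by average class-conditional divergence, keep top-k.
import numpy as np
from itertools import combinations

def gaussian_kl(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)) for 1-D Gaussians
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def select_features(X, y, k=24):
    # X: (n_samples, n_features) candidate features; y: expression labels
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        stats = [(X[y == c, j].mean(), X[y == c, j].var() + 1e-8) for c in classes]
        kls = [gaussian_kl(*a, *b) + gaussian_kl(*b, *a)  # symmetrized KL
               for a, b in combinations(stats, 2)]
        scores[j] = np.mean(kls)  # average divergence across class pairs
    return np.argsort(scores)[::-1][:k]

# Dummy data: 60 faces, 83 landmarks -> 83*82/2 pairwise-distance features
rng = np.random.default_rng(0)
pts = rng.normal(size=(60, 83, 3))
X = np.array([[np.linalg.norm(p[i] - p[j]) for i, j in combinations(range(83), 2)]
              for p in pts])
y = rng.integers(0, 6, size=60)  # six expression classes
print(select_features(X, y, k=10))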
01 Sep 2008
TL;DR: This paper performs person- and gender-independent facial expression recognition based on properties of the line segments connecting certain 3D facial feature points; the normalized distances and slopes of these segments comprise a set of 96 distinguishing features for recognizing six universal facial expressions.
Abstract: The 3D facial geometry contains ample information about human facial expressions. Such information is invariant to pose and lighting conditions, which have imposed serious hurdles on many 2D facial analysis problems. In this paper, we perform person and gender independent facial expression recognition based on properties of the line segments connecting certain 3D facial feature points. The normalized distances and slopes of these line segments comprise a set of 96 distinguishing features for recognizing six universal facial expressions, namely anger, disgust, fear, happiness, sadness, and surprise. Using a multi-class support vector machine (SVM) classifier, an 87.1% average recognition rate is achieved on the publicly available 3D facial expression database BU-3DFE. The highest average recognition rate obtained in our experiments is 99.2% for the recognition of surprise. Our result outperforms the result reported in the prior work, which uses elaborately extracted primitive facial surface features and an LDA classifier and which yields an average recognition rate of 83.6% on the same database.
119 citations
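The classification stage reduces to a standard multi-class SVM over a fixed 96-dimensional feature vector. The toy sketch below shows only that stage, with random stand-in data, since the landmark extraction and segment normalization happen upstream; the RBF kernel and C value are assumptions.

```python
# Multi-class SVM over 96 segment-based features (random stand-in data).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 96))                   # normalized distances and slopes
y = rng.integers(0, len(EXPRESSIONS), size=300)  # one of six expressions

clf = SVC(kernel="rbf", C=1.0)  # SVC handles multi-class one-vs-one internally
print(cross_val_score(clf, X, y, cv=5).mean())
```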
05 Apr 2017
TL;DR: This work hypothesizes that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches.
Abstract: End-to-end training of deep learning-based models allows for implicit learning of intermediate representations based on the final task loss. However, the end-to-end approach ignores the useful domain knowledge encoded in explicit intermediate-level supervision. We hypothesize that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches. We present experiments on conversational speech recognition where we use lower-level tasks, such as phoneme recognition, in a multitask training approach with an encoder-decoder model for direct character transcription. We compare multiple types of lower-level tasks and analyze the effects of the auxiliary tasks. Our results on the Switchboard corpus show that this approach improves recognition accuracy over a standard encoder-decoder model on the Eval2000 test set.
117 citations
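The sketch below illustrates the auxiliary-supervision pattern: a phoneme classifier is attached to a lower encoder layer, and its loss is added, with a weight, to the final task loss. To stay short it reduces both outputs to framewise classifiers, whereas the paper's final task is attention-based character transcription; the 0.3 weight and layer sizes are illustrative.

```python
# Auxiliary phoneme supervision at a lower encoder layer, final loss on top.
import torch
import torch.nn as nn

class MultitaskEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phones=48, n_chars=30):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, batch_first=True)
        self.phone_head = nn.Linear(hidden, n_phones)  # auxiliary supervision
        self.char_head = nn.Linear(hidden, n_chars)    # final task

    def forward(self, x):
        low, _ = self.lower(x)    # lower layer sees the auxiliary signal
        high, _ = self.upper(low)
        return self.phone_head(low), self.char_head(high)

model = MultitaskEncoder()
x = torch.randn(2, 50, 80)
phones = torch.randint(0, 48, (2, 50))
chars = torch.randint(0, 30, (2, 50))
phone_logits, char_logits = model(x)
ce = nn.CrossEntropyLoss()
# weighted sum of the final loss and the lower-level auxiliary loss
loss = ce(char_logits.transpose(1, 2), chars) \
     + 0.3 * ce(phone_logits.transpose(1, 2), phones)
loss.backward()
```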
12 Dec 2008
TL;DR: A local patch method based on sparse representation with respect to coupled overcomplete patch dictionaries is proposed, which can be solved quickly through linear programming and can hallucinate high-quality super-resolution faces.
Abstract: In this paper, we address the problem of hallucinating a high resolution face given a low resolution input face. The problem is approached through sparse coding. To exploit the facial structure, non-negative matrix factorization (NMF) is first employed to learn a localized part-based subspace. This subspace is effective for super-resolving the incoming low resolution face under reconstruction constraints. To further enhance the detailed facial information, we propose a local patch method based on sparse representation with respect to coupled overcomplete patch dictionaries, which can be solved efficiently through linear programming. Experiments demonstrate that our approach can hallucinate high-quality super-resolution faces.
106 citations
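Below is a condensed sketch of the coupled-dictionary patch step: solve for a sparse code of the low-resolution patch over the LR dictionary, then apply the same code to the HR dictionary to synthesize the high-resolution patch. The paper solves the L1 problem with linear programming; scikit-learn's Lasso is used as a stand-in here, and the dictionaries are random rather than jointly learned.

```python
# Coupled-dictionary patch hallucination: shared sparse code, two dictionaries.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_atoms = 512
D_low = rng.normal(size=(25, n_atoms))    # dictionary for 5x5 LR patches
D_high = rng.normal(size=(100, n_atoms))  # coupled dictionary for 10x10 HR patches

def hallucinate_patch(lr_patch):
    # sparse code: min ||D_low a - lr_patch||^2 + lambda * ||a||_1
    coder = Lasso(alpha=0.1, max_iter=5000)
    coder.fit(D_low, lr_patch)
    alpha = coder.coef_
    return D_high @ alpha  # same code, HR dictionary -> HR patch

lr_patch = rng.normal(size=25)
hr_patch = hallucinate_patch(lr_patch)
print(hr_patch.shape)  # (100,) i.e. a 10x10 patch
```

The key design choice is that both dictionaries are indexed by the same atoms, so sparsity found in the low-resolution domain transfers directly to the high-resolution reconstruction.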
Cited by
01 Jan 2006
TL;DR: This textbook covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, continuous latent variables, sequential data, and combining models.
Abstract: Contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.
10,141 citations
18 Apr 2019
TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
2,758 citations
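The two masking operations are simple enough to sketch directly (time warping omitted): zero out a random block of frequency channels and a random block of time steps in the filter-bank features. The mask-size limits below are illustrative, not the paper's published policies.

```python
# Frequency and time masking over a (time, freq) spectrogram.
import numpy as np

def spec_augment(spec, max_f=15, max_t=40, rng=None):
    # spec: (time, freq) filter bank features; returns a masked copy
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    f = rng.integers(0, max_f + 1)   # width of the frequency mask
    f0 = rng.integers(0, F - f + 1)
    spec[:, f0:f0 + f] = 0.0
    t = rng.integers(0, max_t + 1)   # width of the time mask
    t0 = rng.integers(0, T - t + 1)
    spec[t0:t0 + t, :] = 0.0
    return spec

augmented = spec_augment(np.random.randn(300, 80))
print(augmented.shape)  # (300, 80), with one frequency band and one time span zeroed
```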
13 Jun 2010
TL;DR: This work seeks to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules and pooling schemes and shows how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding.
Abstract: Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
1,177 citations
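The two-step pipeline the paper dissects can be sketched compactly: a coding step (hard vector quantization against a codebook, the simplest of the compared variants) followed by a pooling step (average or max) over a region's descriptors. The random codebook below is a stand-in; in practice it would be learned, e.g. with k-means.

```python
# Coding (hard vector quantization) followed by pooling (max or average).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 128))  # 256 codewords for 128-D descriptors

def encode_hard(descriptors):
    # one-hot assignment of each descriptor to its nearest codeword
    d2 = ((descriptors ** 2).sum(1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(1)[None, :])  # squared Euclidean distances
    codes = np.zeros((len(descriptors), len(codebook)))
    codes[np.arange(len(descriptors)), d2.argmin(1)] = 1.0
    return codes

def pool(codes, mode="max"):
    # summarize the coded descriptors of a region into one feature vector
    return codes.max(0) if mode == "max" else codes.mean(0)

descriptors = rng.normal(size=(500, 128))  # e.g. dense SIFT from one image
feature = pool(encode_hard(descriptors), mode="max")
print(feature.shape)  # (256,)
```

With hard one-hot codes, max pooling records whether each codeword fired anywhere in the region while average pooling records how often, which is one lens on the max-versus-average comparison the paper analyzes.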