Author
Hao Tang
Other affiliations: Hewlett-Packard, Mitsubishi Electric Research Laboratories, Shanghai Ocean University, and others
Bio: Hao Tang is an academic researcher from the Massachusetts Institute of Technology. The author has contributed to research in topics including hidden Markov models and speaker recognition, has an h-index of 28, and has co-authored 104 publications receiving 2,275 citations. Previous affiliations of Hao Tang include Hewlett-Packard and Mitsubishi Electric Research Laboratories.
Papers
15 Sep 2019
TL;DR: The authors proposed an unsupervised autoregressive neural model for learning generic speech representations, which is designed to preserve information for a wide range of downstream tasks, such as phone classification and speaker verification.
Abstract: This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is designed to preserve information for a wide range of downstream tasks. In addition, the proposed model does not require any phonetic or word boundary labels, allowing the model to benefit from large quantities of unlabeled data. Speech representations learned by our model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches. Further analysis shows that different levels of speech information are captured by our model at different layers. In particular, the lower layers tend to be more discriminative for speakers, while the upper layers provide more phonetic content.
379 citations
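As a concrete illustration of the idea, below is a minimal sketch of an autoregressive predictive model: a unidirectional RNN reads past frames and is trained to predict a frame several steps ahead, so its hidden states can later serve as generic representations. The layer sizes, the 3-frame shift, and the L1 loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an autoregressive predictive model over speech frames.
import torch
import torch.nn as nn

class AutoregressivePredictiveModel(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3, shift=3):
        super().__init__()
        self.shift = shift  # predict `shift` frames into the future
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        # x: (batch, time, feat_dim) surface features, e.g. log Mel spectrograms
        hidden, _ = self.rnn(x)   # hidden states double as learned representations
        pred = self.proj(hidden)  # frame-level predictions
        return pred, hidden

model = AutoregressivePredictiveModel()
x = torch.randn(4, 100, 80)  # a dummy batch of 100-frame utterances
pred, hidden = model(x)
# L1 loss between the prediction at time t and the true frame at time t + shift
loss = nn.functional.l1_loss(pred[:, :-model.shift], x[:, model.shift:])
loss.backward()
```

Because the objective forces the model to reconstruct future surface features rather than discard variability, both speaker and phonetic information survive in the hidden states, consistent with the layer-wise analysis described above.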
23 Jun 2008
TL;DR: A novel automatic feature selection method is proposed, based on maximizing the average relative entropy of marginalized class-conditional feature distributions; it is applied to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in 3D space.
Abstract: In this paper, the problem of person-independent facial expression recognition from 3D facial shapes is investigated. We propose a novel automatic feature selection method based on maximizing the average relative entropy of marginalized class-conditional feature distributions and apply it to a complete pool of candidate features composed of normalized Euclidean distances between 83 facial feature points in the 3D space. Using a regularized multi-class AdaBoost classification algorithm, we achieve a 95.1% average recognition rate for six universal facial expressions on the publicly available 3D facial expression database BU-3DFE [1], with the highest average recognition rate, 99.2%, obtained for surprise. We compare these results with those based on a set of manually devised features and demonstrate that the automatically selected features outperform the manual ones. Our results outperform those presented in the previous work [2] and [3], namely average recognition rates of 83.6% and 91.3% on the same database, respectively.
158 citations
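A hedged sketch of the selection criterion follows: each candidate feature (a pairwise landmark distance) is scored by the average divergence between its class-conditional distributions, and the top-scoring features are kept. Modeling the marginals as 1-D Gaussians and using the symmetric KL divergence are simplifying assumptions for illustration; the AdaBoost classification stage is omitted.

```python
# Score candidate features by average class-conditional divergence, keep top-k.
import numpy as np
from itertools import combinations

def gaussian_kl(m1, v1, m2, v2):
    # KL(N(m1, v1) || N(m2, v2)) for 1-D Gaussians
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def select_features(X, y, k=24):
    # X: (n_samples, n_features) candidate features; y: expression labels
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        stats = [(X[y == c, j].mean(), X[y == c, j].var() + 1e-8) for c in classes]
        kls = [gaussian_kl(*a, *b) + gaussian_kl(*b, *a)  # symmetrized KL
               for a, b in combinations(stats, 2)]
        scores[j] = np.mean(kls)  # average divergence across class pairs
    return np.argsort(scores)[::-1][:k]

# Dummy data: 60 faces, 83 landmarks -> 83*82/2 pairwise-distance features
rng = np.random.default_rng(0)
pts = rng.normal(size=(60, 83, 3))
X = np.array([[np.linalg.norm(p[i] - p[j]) for i, j in combinations(range(83), 2)]
              for p in pts])
y = rng.integers(0, 6, size=60)  # six expression classes
print(select_features(X, y, k=10))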
01 Sep 2008
TL;DR: This paper performs person- and gender-independent facial expression recognition based on properties of the line segments connecting certain 3D facial feature points; the normalized distances and slopes of these segments comprise a set of 96 distinguishing features for recognizing six universal facial expressions.
Abstract: The 3D facial geometry contains ample information about human facial expressions. Such information is invariant to pose and lighting conditions, which have imposed serious hurdles on many 2D facial analysis problems. In this paper, we perform person and gender independent facial expression recognition based on properties of the line segments connecting certain 3D facial feature points. The normalized distances and slopes of these line segments comprise a set of 96 distinguishing features for recognizing six universal facial expressions, namely anger, disgust, fear, happiness, sadness, and surprise. Using a multi-class support vector machine (SVM) classifier, an 87.1% average recognition rate is achieved on the publicly available 3D facial expression database BU-3DFE. The highest average recognition rate obtained in our experiments is 99.2% for the recognition of surprise. Our result outperforms the result reported in the prior work, which uses elaborately extracted primitive facial surface features and an LDA classifier and which yields an average recognition rate of 83.6% on the same database.
119 citations
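The classification stage reduces to a standard multi-class SVM over a fixed 96-dimensional feature vector. The toy sketch below shows only that stage, with random stand-in data, since the landmark extraction and segment normalization happen upstream; the RBF kernel and C value are assumptions.

```python
# Multi-class SVM over 96 segment-based features (random stand-in data).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

EXPRESSIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 96))                   # normalized distances and slopes
y = rng.integers(0, len(EXPRESSIONS), size=300)  # one of six expressions

clf = SVC(kernel="rbf", C=1.0)  # SVC handles multi-class one-vs-one internally
print(cross_val_score(clf, X, y, cv=5).mean())
```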
05 Apr 2017
TL;DR: This work hypothesizes that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches.
Abstract: End-to-end training of deep learning-based models allows for implicit learning of intermediate representations based on the final task loss. However, the end-to-end approach ignores the useful domain knowledge encoded in explicit intermediate-level supervision. We hypothesize that using intermediate representations as auxiliary supervision at lower levels of deep networks may be a good way of combining the advantages of end-to-end training and more traditional pipeline approaches. We present experiments on conversational speech recognition where we use lower-level tasks, such as phoneme recognition, in a multitask training approach with an encoder-decoder model for direct character transcription. We compare multiple types of lower-level tasks and analyze the effects of the auxiliary tasks. Our results on the Switchboard corpus show that this approach improves recognition accuracy over a standard encoder-decoder model on the Eval2000 test set.
117 citations
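The sketch below illustrates the auxiliary-supervision pattern: a phoneme classifier is attached to a lower encoder layer, and its loss is added, with a weight, to the final task loss. To stay short it reduces both outputs to framewise classifiers, whereas the paper's final task is attention-based character transcription; the 0.3 weight and layer sizes are illustrative.

```python
# Auxiliary phoneme supervision at a lower encoder layer, final loss on top.
import torch
import torch.nn as nn

class MultitaskEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phones=48, n_chars=30):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, batch_first=True)
        self.phone_head = nn.Linear(hidden, n_phones)  # auxiliary supervision
        self.char_head = nn.Linear(hidden, n_chars)    # final task

    def forward(self, x):
        low, _ = self.lower(x)    # lower layer sees the auxiliary signal
        high, _ = self.upper(low)
        return self.phone_head(low), self.char_head(high)

model = MultitaskEncoder()
x = torch.randn(2, 50, 80)
phones = torch.randint(0, 48, (2, 50))
chars = torch.randint(0, 30, (2, 50))
phone_logits, char_logits = model(x)
ce = nn.CrossEntropyLoss()
# weighted sum of the final loss and the lower-level auxiliary loss
loss = ce(char_logits.transpose(1, 2), chars) \
     + 0.3 * ce(phone_logits.transpose(1, 2), phones)
loss.backward()
```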
12 Dec 2008
TL;DR: A local patch method based on sparse representation with respect to coupled overcomplete patch dictionaries is proposed, which can be solved quickly through linear programming and can hallucinate high-quality super-resolution faces.
Abstract: In this paper, we address the problem of hallucinating a high resolution face given a low resolution input face. The problem is approached through sparse coding. To exploit the facial structure, non-negative matrix factorization (NMF) is first employed to learn a localized part-based subspace. This subspace is effective for super-resolving the incoming low resolution face under reconstruction constraints. To further enhance the detailed facial information, we propose a local patch method based on sparse representation with respect to coupled overcomplete patch dictionaries, which can be solved efficiently through linear programming. Experiments demonstrate that our approach can hallucinate high-quality super-resolution faces.
106 citations
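Below is a condensed sketch of the coupled-dictionary patch step: solve for a sparse code of the low-resolution patch over the LR dictionary, then apply the same code to the HR dictionary to synthesize the high-resolution patch. The paper solves the L1 problem with linear programming; scikit-learn's Lasso is used as a stand-in here, and the dictionaries are random rather than jointly learned.

```python
# Coupled-dictionary patch hallucination: shared sparse code, two dictionaries.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_atoms = 512
D_low = rng.normal(size=(25, n_atoms))    # dictionary for 5x5 LR patches
D_high = rng.normal(size=(100, n_atoms))  # coupled dictionary for 10x10 HR patches

def hallucinate_patch(lr_patch):
    # sparse code: min ||D_low a - lr_patch||^2 + lambda * ||a||_1
    coder = Lasso(alpha=0.1, max_iter=5000)
    coder.fit(D_low, lr_patch)
    alpha = coder.coef_
    return D_high @ alpha  # same code, HR dictionary -> HR patch

lr_patch = rng.normal(size=25)
hr_patch = hallucinate_patch(lr_patch)
print(hr_patch.shape)  # (100,) i.e. a 10x10 patch
```

The key design choice is that both dictionaries are indexed by the same atoms, so sparsity found in the low-resolution domain transfers directly to the high-resolution reconstruction.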
Cited by
01 Jan 2006
TL;DR: This textbook covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, continuous latent variables, sequential data, and combining models.
Abstract: Contents: Probability Distributions; Linear Models for Regression; Linear Models for Classification; Neural Networks; Kernel Methods; Sparse Kernel Machines; Graphical Models; Mixture Models and EM; Approximate Inference; Sampling Methods; Continuous Latent Variables; Sequential Data; Combining Models.
10,141 citations
18 Apr 2019
TL;DR: This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
Abstract: We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
2,758 citations
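The two masking operations are simple enough to sketch directly (time warping omitted): zero out a random block of frequency channels and a random block of time steps in the filter-bank features. The mask-size limits below are illustrative, not the paper's published policies.

```python
# Frequency and time masking over a (time, freq) spectrogram.
import numpy as np

def spec_augment(spec, max_f=15, max_t=40, rng=None):
    # spec: (time, freq) filter bank features; returns a masked copy
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    T, F = spec.shape
    f = rng.integers(0, max_f + 1)   # width of the frequency mask
    f0 = rng.integers(0, F - f + 1)
    spec[:, f0:f0 + f] = 0.0
    t = rng.integers(0, max_t + 1)   # width of the time mask
    t0 = rng.integers(0, T - t + 1)
    spec[t0:t0 + t, :] = 0.0
    return spec

augmented = spec_augment(np.random.randn(300, 80))
print(augmented.shape)  # (300, 80), with one frequency band and one time span zeroed
```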
13 Jun 2010
TL;DR: This work seeks to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules and pooling schemes and shows how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding.
Abstract: Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
1,177 citations
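The two-step pipeline the paper dissects can be sketched compactly: a coding step (hard vector quantization against a codebook, the simplest of the compared variants) followed by a pooling step (average or max) over a region's descriptors. The random codebook below is a stand-in; in practice it would be learned, e.g. with k-means.

```python
# Coding (hard vector quantization) followed by pooling (max or average).
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 128))  # 256 codewords for 128-D descriptors

def encode_hard(descriptors):
    # one-hot assignment of each descriptor to its nearest codeword
    d2 = ((descriptors ** 2).sum(1)[:, None]
          - 2.0 * descriptors @ codebook.T
          + (codebook ** 2).sum(1)[None, :])  # squared Euclidean distances
    codes = np.zeros((len(descriptors), len(codebook)))
    codes[np.arange(len(descriptors)), d2.argmin(1)] = 1.0
    return codes

def pool(codes, mode="max"):
    # summarize the coded descriptors of a region into one feature vector
    return codes.max(0) if mode == "max" else codes.mean(0)

descriptors = rng.normal(size=(500, 128))  # e.g. dense SIFT from one image
feature = pool(encode_hard(descriptors), mode="max")
print(feature.shape)  # (256,)
```

With hard one-hot codes, max pooling records whether each codeword fired anywhere in the region while average pooling records how often, which is one lens on the max-versus-average comparison the paper analyzes.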