This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

/pdf/deep-learning-methods-and-applications-5cyi610sjz.pdf

Deep Learning: Methods and Applications

https://bura.brunel.ac.uk/bitstream/2438/14221/1/FullText.pdf

A survey of deep neural network architectures and their applications

A morphable model for the synthesis of 3D faces

In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among the different types of deep neural networks, convolutional neural networks have been the most extensively studied. Leveraging the rapid growth in the amount of annotated data and the great improvements in the power of graphics processing units, research on convolutional neural networks has emerged swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. We detail the improvements of CNNs in different aspects, including layer design, activation function, loss function, regularization, optimization and fast computation. We also introduce various applications of convolutional neural networks in computer vision, speech and natural language processing.

Recent Advances in Convolutional Neural Networks

Given audio of President Barack Obama, we synthesize a high quality video of him speaking with accurate lip sync, composited into a target video clip. Trained on many hours of his weekly address footage, a recurrent neural network learns the mapping from raw audio features to mouth shapes. Given the mouth shape at each time instant, we synthesize high quality mouth texture, and composite it with proper 3D pose matching to change what he appears to be saying in a target video to match the input audio track. Our approach produces photorealistic results.

Synthesizing Obama: learning lip sync from audio

Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goal of helping readers gain a better understanding of the existing techniques and stimulating new work in the burgeoning area of deep learning for parametric speech generation. In speech signal and information processing, many applications have been formulated as machine-learning tasks. ASR is a typical classification task that predicts word sequences from speech waveforms or feature sequences. There are also many regression tasks in speech processing that aim to generate speech signals from various types of inputs. They are referred to as speech generation tasks in this article. Speech generation covers a wide range of research topics in speech processing, such as text-to-speech (TTS) synthesis (generating speech from text), voice conversion (modifying nonlinguistic information of the input speech), speech enhancement (improving speech quality by noise reduction or other processing), and articulatory-to-acoustic mapping (converting articulatory movements to acoustic features). These

Deep Learning for Acoustic Modeling in Parametric Speech Generation

This paper investigates the use of multidistribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on extended recognition networks (ERNs). The ERNs leverage existing automatic speech recognition technology by constraining the search space to include the likely phonetic error patterns of the target words in addition to the canonical transcriptions. MDD is achieved by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has the following issues: (1) learning the error patterns of the target words to generate the ERNs remains a challenging task, and phones or phone errors missing from the ERNs cannot be recognized even with well-trained acoustic models; and (2) acoustic models and phonological rules are trained independently, and hence contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features include acoustic features as well as the corresponding graphemes and canonical transcriptions encoded as binary vectors. The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework, which works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER) for MDD are 4.6%, 30.5%, and 13.5%, respectively. It outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.

/pdf/mispronunciation-detection-and-diagnosis-in-l2-english-1zr8uf24qg.pdf

Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

The deep belief network (DBN) has been shown to be a good generative model in tasks such as handwritten digit image generation. Previous work on DBNs in the speech community mainly focuses on using a generatively pre-trained DBN to initialize a discriminative model for better acoustic modeling in speech recognition (SR). To fully utilize its generative nature, we propose to model the speech parameters, including spectrum and F0, simultaneously and to generate these parameters from the DBN for speech synthesis. Compared with the predominant HMM-based approach, objective evaluation shows that the spectrum generated from the DBN has less distortion. Subjective results also confirm the advantage of the spectrum from the DBN, and the overall quality is comparable to that of context-independent HMM.

/pdf/multi-distribution-deep-belief-network-for-speech-synthesis-2va00z9w3h.pdf

Multi-distribution deep belief network for speech synthesis

This paper presents recent extensions to our ongoing effort in developing speech recognition for automatic mispronunciation detection and diagnosis in the interlanguage of Chinese learners of English. We have developed a set of context-sensitive phonological rules based on cross-language (Cantonese versus English) analysis, which has also been validated against common mispronunciations observed in the learners' interlanguage. These rules are represented as finite state transducers, which can generate an extended recognition network (ERN) based on arbitrary canonical pronunciations. The ERN includes not only standard English pronunciations but also common mispronunciations of learners. Recognition with the ERN enables the speech recognizer to phonetically transcribe the learner’s input speech. This transcription can be compared with the canonical pronunciations to identify the location(s) and type(s) of phonetic differences, thus facilitating mispronunciation detection and diagnosis. We have developed a prototype implementation known as the CHELSEA system and have validated the approach on a new, annotated test set of 600 utterances recorded from 100 Cantonese learners of English. The approach achieves a false rejection rate (i.e., the system identifies a phone as incorrect when it is actually correctly pronounced) of 13.6% and a false acceptance rate (i.e., the system identifies a phone as correct when it is actually mispronounced) of 44.7%. Among the detected errors, the system can correctly diagnose 54.8% of the mispronunciations.

/pdf/implementation-of-an-extended-recognition-network-for-5cr76pe6eq.pdf
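The false rejection rate, false acceptance rate, and diagnosis rate described in the abstract above can be illustrated with a minimal Python sketch. It follows the parenthetical definitions given in the abstract; the class name, field names, and the example counts are hypothetical and chosen only for illustration, and the denominators (all correctly pronounced phones, all mispronounced phones, all detected mispronunciations) are an assumption about how the rates are normalized.

```python
from dataclasses import dataclass


@dataclass
class PhoneCounts:
    """Phone-level tallies. All field names are illustrative assumptions."""
    true_accept: int        # correctly pronounced, judged correct
    false_reject: int       # correctly pronounced, judged incorrect
    false_accept: int       # mispronounced, judged correct
    true_reject: int        # mispronounced, judged incorrect (detected error)
    correct_diagnosis: int  # detected errors whose error type was identified correctly


def false_rejection_rate(c: PhoneCounts) -> float:
    # Share of correctly pronounced phones that the system flags as incorrect.
    return c.false_reject / (c.true_accept + c.false_reject)


def false_acceptance_rate(c: PhoneCounts) -> float:
    # Share of mispronounced phones that the system accepts as correct.
    return c.false_accept / (c.false_accept + c.true_reject)


def diagnostic_accuracy(c: PhoneCounts) -> float:
    # Share of detected mispronunciations that are diagnosed correctly.
    return c.correct_diagnosis / c.true_reject


# Made-up counts, not data from any of the papers listed here.
counts = PhoneCounts(true_accept=900, false_reject=100,
                     false_accept=40, true_reject=60,
                     correct_diagnosis=45)
frr = false_rejection_rate(counts)
far = false_acceptance_rate(counts)
acc = diagnostic_accuracy(counts)
```

Under this assumed normalization, a diagnostic error rate would be one minus the diagnostic accuracy.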

Xiaojun Qian

Papers

Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends

Deep Learning for Acoustic Modeling in Parametric Speech Generation

Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

Multi-distribution deep belief network for speech synthesis

Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training