Bio: Chin-Hui Lee is an academic researcher from Georgia Institute of Technology. The author has contributed to research in topics: Hidden Markov model & Word error rate. The author has an hindex of 67, co-authored 533 publications receiving 20346 citations. Previous affiliations of Chin-Hui Lee include AT&T & Alcatel-Lucent.
Papers published on a yearly basis
TL;DR: A framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented, and Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications.
Abstract: In this paper, a framework for maximum a posteriori (MAP) estimation of hidden Markov models (HMM) is presented. Three key issues of MAP estimation, namely, the choice of prior distribution family, the specification of the parameters of prior densities, and the evaluation of the MAP estimates, are addressed. Using HMM's with Gaussian mixture state observation densities as an example, it is assumed that the prior densities for the HMM parameters can be adequately represented as a product of Dirichlet and normal-Wishart densities. The classical maximum likelihood estimation algorithms, namely, the forward-backward algorithm and the segmental k-means algorithm, are expanded, and MAP estimation formulas are developed. Prior density estimation issues are discussed for two classes of applications/spl minus/parameter smoothing and model adaptation/spl minus/and some experimental results are given illustrating the practical interest of this approach. Because of its adaptive nature, Bayesian learning is shown to serve as a unified approach for a wide range of speech recognition applications. >
TL;DR: The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
Abstract: In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
TL;DR: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture that tends to achieve significant improvements in terms of various objective quality measures.
Abstract: This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises successfully without the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer DNN-based enhanced speech to that obtained with other conventional technique.
TL;DR: The issue of speech recognizer training from a broad perspective with root in the classical Bayes decision theory is discussed, and the superiority of the minimum classification error (MCE) method over the distribution estimation method is shown by providing the results of several key speech recognition experiments.
Abstract: A critical component in the pattern matching approach to speech recognition is the training algorithm, which aims at producing typical (reference) patterns or models for accurate pattern comparison. In this paper, we discuss the issue of speech recognizer training from a broad perspective with root in the classical Bayes decision theory. We differentiate the method of classifier design by way of distribution estimation and the discriminative method of minimizing classification error rate based on the fact that in many realistic applications, such as speech recognition, the real signal distribution form is rarely known precisely. We argue that traditional methods relying on distribution estimation are suboptimal when the assumed distribution form is not the true one, and that "optimality" in distribution estimation does not automatically translate into "optimality" in classifier design. We compare the two different methods in the context of hidden Markov modeling for speech recognition. We show the superiority of the minimum classification error (MCE) method over the distribution estimation method by providing the results of several key speech recognition experiments. In general, the MCE method provides a significant reduction of recognition error rate.
TL;DR: The modifications made to a connected word speech recognition algorithm based on hidden Markov models which allow it to recognize words from a predefined vocabulary list spoken in an unconstrained fashion are described.
Abstract: The modifications made to a connected word speech recognition algorithm based on hidden Markov models (HMMs) which allow it to recognize words from a predefined vocabulary list spoken in an unconstrained fashion are described. The novelty of this approach is that statistical models of both the actual vocabulary word and the extraneous speech and background are created. An HMM-based connected word recognition system is then used to find the best sequence of background, extraneous speech, and vocabulary word models for matching the actual input. Word recognition accuracy of 99.3% on purely isolated speech (i.e., only vocabulary items and background noise were present), and 95.1% when the vocabulary word was embedded in unconstrained extraneous speech, were obtained for the five word vocabulary using the proposed recognition algorithm. >
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).
01 Jan 2006
TL;DR: Probability distributions of linear models for regression and classification are given in this article, along with a discussion of combining models and combining models in the context of machine learning and classification.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
TL;DR: The major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs) are described.
Abstract: Reynolds, Douglas A., Quatieri, Thomas F., and Dunn, Robert B., Speaker Verification Using Adapted Gaussian Mixture Models, Digital Signal Processing10(2000), 19Â?41.In this paper we describe the major elements of MIT Lincoln Laboratory's Gaussian mixture model (GMM)-based speaker verification system used successfully in several NIST Speaker Recognition Evaluations (SREs). The system is built around the likelihood ratio test for verification, using simple but effective GMMs for likelihood functions, a universal background model (UBM) for alternative speaker representation, and a form of Bayesian adaptation to derive speaker models from the UBM. The development and use of a handset detector and score normalization to greatly improve verification performance is also described and discussed. Finally, representative performance benchmarks and system behavior experiments on NIST SRE corpora are presented.