
Showing papers on "Hidden Markov model" published in 1986


Journal ArticleDOI
TL;DR: The purpose of this tutorial paper is to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.
Abstract: The basic theory of Markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. One of the major reasons why speech models, based on Markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the Markov model to match observed signal patterns. Such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. Continued refinements in the theory and implementation of Markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. It is the purpose of this tutorial paper to give an introduction to the theory of Markov models, and to illustrate how they have been applied to problems in speech recognition.

4,546 citations
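
The tutorial's central computation is evaluating the probability of an observation sequence under a model. A minimal sketch of the forward recursion for a discrete-output HMM, with per-frame rescaling to avoid numerical underflow (the toy model values are illustrative, not from the paper):

```python
import numpy as np

def forward_log_likelihood(pi, A, B, obs):
    """log P(obs | model) for a discrete-output HMM.

    pi  : (N,) initial state probabilities
    A   : (N, N) transitions, A[i, j] = P(state j at t+1 | state i at t)
    B   : (N, M) emissions,   B[i, k] = P(symbol k | state i)
    obs : sequence of integer symbols in [0, M)
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    c = alpha.sum(); alpha /= c          # rescale; log P accumulates log c_t
    log_p = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: sum over all state paths
        c = alpha.sum(); alpha /= c
        log_p += np.log(c)
    return log_p

# Toy 2-state, 3-symbol model (illustrative values only).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.2, 0.8]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_log_likelihood(pi, A, B, [0, 1, 2, 2, 1]))
```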


Proceedings ArticleDOI
07 Apr 1986
TL;DR: A method for estimating the parameters of hidden Markov models of speech is described and recognition results are presented comparing this method with maximum likelihood estimation.
Abstract: A method for estimating the parameters of hidden Markov models of speech is described. Parameter values are chosen to maximize the mutual information between an acoustic observation sequence and the corresponding word sequence. Recognition results are presented comparing this method with maximum likelihood estimation.

921 citations
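
The contrast between the two criteria can be written compactly. A sketch in the form usually quoted for maximum mutual information training; the notation here is generic and may differ from the paper's:

```latex
% Maximum likelihood fits the model of the spoken word sequence W alone;
% MMI also pushes down the scores of competing word sequences W'.
\theta_{\mathrm{ML}}  = \arg\max_{\theta}\ \log P_{\theta}(O \mid W)
\qquad
\theta_{\mathrm{MMI}} = \arg\max_{\theta}\
  \log \frac{P_{\theta}(O \mid W)\,P(W)}
            {\sum_{W'} P_{\theta}(O \mid W')\,P(W')}
```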


Journal ArticleDOI
Stephen E. Levinson1
TL;DR: The solution proposed here is to replace the probability distributions of duration with continuous probability density functions to form a continuously variable duration hidden Markov model (CVDHMM); the gamma distribution is ideally suited to specification of the durational density.

512 citations


Journal ArticleDOI
TL;DR: To use probabilistic functions of a Markov chain to model certain parameterizations of the speech signal, an estimation technique of Liporace is extended to the cases of multivariate mixtures, such as Gaussian sums, and products of mixtures.
Abstract: To use probabilistic functions of a Markov chain to model certain parameterizations of the speech signal, we extend an estimation technique of Liporace to the cases of multivariate mixtures, such as Gaussian sums, and products of mixtures. We also show how these problems relate to Liporace's original framework.

244 citations
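
A sketch of the kind of output density the extended reestimation covers: a state's emission probability as a weighted sum of multivariate Gaussians (a Gaussian sum). Values are illustrative; the reestimation formulae themselves are the paper's contribution and are not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_sum_density(o, weights, means, covs):
    """b(o) = sum_m c_m N(o; mu_m, Sigma_m), with sum_m c_m = 1."""
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=S)
               for c, mu, S in zip(weights, means, covs))

# Toy two-component mixture over 2-D feature vectors.
weights = [0.3, 0.7]
means   = [np.zeros(2), np.array([1.0, -1.0])]
covs    = [np.eye(2), 0.5 * np.eye(2)]
print(gaussian_sum_density(np.array([0.5, 0.0]), weights, means, covs))
```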


Proceedings ArticleDOI
07 Apr 1986
TL;DR: Results are given which show that HMMs provide a versatile pattern matching tool suitable for some image processing tasks as well as speech processing problems.
Abstract: A handwritten script recognition system is presented which uses Hidden Markov Models (HMM), a technique widely used in speech recognition. The script is encoded as templates in the form of a sequence of quantised inclination angles of short, equal-length vectors, together with some additional features. An HMM is created for each written word from a set of training data. Incoming templates are recognised by calculating which model has the highest probability of producing that template. The task chosen to test the system is that of handwritten word recognition, where the words are digits written by one person. Results are given which show that HMMs provide a versatile pattern-matching tool suitable for some image processing tasks as well as speech processing problems.

124 citations


Proceedings ArticleDOI
Stephen E. Levinson1
01 Dec 1986
TL;DR: The solution proposed here is to replace the probability distributions of duration with continuous probability density functions to form a continuously variable duration hidden Markov model (CVDHMM); the gamma distribution is ideally suited to specification of the durational density.
Abstract: During the past decade, the applicability of hidden Markov models (HMM) to various facets of speech analysis had been demonstrated in several different experiments. These investigations all rest on the assumption that speech is a quasi-stationary process whose stationary intervals can be identified with the occupancy of a single state of an appropriate HMM. In the traditional form of the HMM, the probability of duration of a state decreases exponentially with time. This behavior does not provide an adequate representation of the temporal structure of speech. The solution proposed here is to replace the probability distributions of duration with continuous probability density functions to form a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution is ideally suited to specification of the durational density since it is one-sided and has only two parameters which, together, define both mean and variance. The main result is a derivation and proof of convergence of reestimation formulae for all the parameters of the CVDHMM. It is interesting to note that if the state durations are gamma distributed, one of the formulae is nonalgebraic but, fortuitously, has properties such that it is easily and rapidly solved numerically to any desired degree of accuracy. Other results are presented including the performance of the formulae on simulated data.

88 citations
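
A small numerical sketch of the durational point made in the abstract: the implicit geometric duration law of a standard HMM decays from its very first value, while a gamma density can place its mode away from d = 1. Parameter values are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import gamma

d = np.arange(1, 41)

# Standard HMM with self-loop probability a: P(d) = (1 - a) * a**(d - 1),
# an exponential decay whose mode is always at d = 1.
a = 0.9
p_geometric = (1 - a) * a**(d - 1)

# CVDHMM duration density: gamma with shape nu and rate eta,
# mean nu/eta and variance nu/eta**2 (two parameters, as noted above).
nu, eta = 4.0, 0.5
p_gamma = gamma.pdf(d, a=nu, scale=1.0 / eta)

print(d[np.argmax(p_geometric)], d[np.argmax(p_gamma)])  # mode 1 vs. a later mode
```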


Journal ArticleDOI
TL;DR: This paper describes an approach to formant tracking based on hidden Markov models and vector quantization of LPC spectra that has been evaluated using portions of the Texas Instruments multidialect connected digits database.
Abstract: This paper describes an approach to formant tracking based on hidden Markov models and vector quantization of LPC spectra. Two general classes of models are developed, differing in whether formants are tracked singly or jointly. The states of a single-formant model are scalar values corresponding to possible formant frequencies. The states of a multiformant model are frequency vectors defining possible formant configurations. Formant detection and estimation are performed simultaneously using the forward-backward algorithm. Model parameters are estimated from handmarked formant tracks. The models have been evaluated using portions of the Texas Instruments multidialect connected digits database. The most accurate configurations exhibited root-mean-square estimation errors of about 70, 95, and 140 Hz for F1, F2, and F3, respectively.

83 citations
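
A sketch of the decoding idea under simplifying assumptions: states are candidate formant frequencies on a grid, transitions favour small frequency jumps, and forward-backward yields per-frame state posteriors from which a track can be read off. How the per-frame observation likelihoods are obtained from the VQ'd LPC spectra is the substance of the paper and is treated as an input here.

```python
import numpy as np

def formant_posteriors(obs_lik, A):
    """Forward-backward state posteriors for a single-formant model.

    obs_lik : (T, N) per-frame likelihood of each candidate frequency state
    A       : (N, N) transition matrix favouring small frequency jumps
    """
    T, N = obs_lik.shape
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = obs_lik[0] / N           # uniform prior over candidate frequencies
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * obs_lik[t]
        alpha[t] /= alpha[t].sum()      # per-frame scaling
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (obs_lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()        # scaling washes out after renormalization
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

# A track could then be read off as the posterior-weighted mean frequency:
# track = formant_posteriors(obs_lik, A) @ freq_grid
```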


Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes the results of the work in designing a system for large-vocabulary word recognition of continuous speech, and generalizes the use of context-dependent Hidden Markov Models of phonemes to take into account word-dependent coarticulatory effects.
Abstract: This paper describes the results of our work in designing a system for large-vocabulary word recognition of continuous speech. We generalize the use of context-dependent Hidden Markov Models (HMM) of phonemes to take into account word-dependent coarticulatory effects. Robustness is assured by smoothing the detailed word-dependent models with less detailed but more robust models. We describe training and recognition algorithms for HMMs of phonemes-in-context. On a task with a 334-word vocabulary and no grammar (i.e., a branching factor of 334), in speaker-dependent mode, we show an average reduction in word error rate from 24% using context-independent phoneme models, to 10% when using robust context-dependent phoneme models.

59 citations
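
The smoothing mentioned in the abstract can be pictured as interpolation between a detailed but sparsely trained distribution and a coarser, robust one. A minimal sketch; the interpolation weight and how it is chosen (in practice it would depend on training-data counts) are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def smooth(p_word_dependent, p_context_independent, lam):
    """Interpolate a detailed model with a robust one; lam in [0, 1]
    would grow with the amount of data behind the detailed model."""
    return lam * p_word_dependent + (1.0 - lam) * p_context_independent

p_detail = np.array([0.7, 0.3, 0.0])   # sharp, but trained on few samples
p_robust = np.array([0.4, 0.4, 0.2])   # flatter, trained on much more data
print(smooth(p_detail, p_robust, lam=0.6))
```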


Journal ArticleDOI
TL;DR: High-quality speech synthesis is used to demonstrate the power of the HMM in preserving the naturalness of the intonational meaning, conveyed by the variation of fundamental frequency and duration.
Abstract: A novel technique is introduced for characterizing prosodic structure and is used for speech synthesis. The mechanism consists of modeling a set of observations as a probabilistic function of a hidden Markov chain. It uses mixtures of Gaussian continuous probability density functions to represent the essential, perceptually relevant structure of intonation by observing movements of fundamental frequency in monosyllabic words of varying phonetic structure. High-quality speech synthesis, using multipulse excitation, is used to demonstrate the power of the HMM in preserving the naturalness of the intonational meaning, conveyed by the variation of fundamental frequency and duration. The fundamental frequency contours are synthesized from the models using a random number generator, and are imposed on a synthesized prototype word which had the intonation of a low fall. The resulting monosyllabic words with imposed synthesized fundamental frequency contours show a high level of naturalness and are found to be perceptually indistinguishable from the original recordings with the same intonation. The results clearly show the high potential of hidden Markov models as a mechanism for the representation of prosodic structure by naturally capturing its essentials.

43 citations
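
A sketch of the generation step described above, simplified to one Gaussian per state (the paper uses Gaussian mixtures): run the hidden chain and draw each frame's fundamental frequency from the current state's density. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_f0_contour(pi, A, means, stds, n_frames):
    """Draw an F0 contour (in Hz) from a Gaussian-output HMM."""
    state = rng.choice(len(pi), p=pi)
    f0 = np.empty(n_frames)
    for t in range(n_frames):
        f0[t] = rng.normal(means[state], stds[state])
        state = rng.choice(len(pi), p=A[state])
    return f0

pi    = np.array([1.0, 0.0, 0.0])
A     = np.array([[0.8, 0.2, 0.0],      # left-to-right chain
                  [0.0, 0.8, 0.2],
                  [0.0, 0.0, 1.0]])
means = np.array([120.0, 180.0, 100.0])  # a rise-fall shape (illustrative)
stds  = np.array([5.0, 8.0, 5.0])
print(sample_f0_contour(pi, A, means, stds, 30).round(1))
```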


Journal ArticleDOI
TL;DR: A unified system for automatically recognizing fluently spoken digit strings based on whole-word reference units is presented, which can use either hidden Markov model (HMM) technology or template-based technology and contains features from both approaches.

43 citations


Proceedings ArticleDOI
07 Apr 1986
TL;DR: In a series of experiments on isolated-word recognition, hidden Markov models with multivariate Gaussian output densities were applied; the best models, obtained with time offsets of 75 or 90 ms, improved on previous algorithms.
Abstract: Hidden Markov modeling has become an increasingly popular technique in automatic speech recognition. Recently, attention has been focused on the application of these models to talker-independent, isolated-word recognition. Initial results using models with discrete output densities for isolated-digit recognition were later improved using models based on continuous output densities. In a series of experiments on isolated-word recognition, we applied hidden Markov models with multivariate Gaussian output densities to the problem. Speech data were represented by feature vectors consisting of eight log area ratios and the log LPC error. A weak measure of vocal-tract dynamics was included in the observations by appending to the feature vector observed at time t the vector observed at time t-δ, for some fixed offset δ. The best models were obtained with offsets of 75 or 90 ms. When a comparison is made on a common database, the resulting error rate of 0.2% for isolated-digit recognition improves on previous algorithms.
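
The feature construction described above is a one-liner. A sketch, with the offset expressed in frames (the mapping from 75-90 ms to a frame count depends on the analysis frame step, which the abstract does not state):

```python
import numpy as np

def append_offset(X, delta):
    """Append to the frame at time t the frame observed at t - delta,
    a weak measure of vocal-tract dynamics. X: (T, d) -> (T - delta, 2d)."""
    return np.hstack([X[delta:], X[:-delta]])

# 9 features per frame: e.g. eight log area ratios plus the log LPC error.
X = np.random.default_rng(1).normal(size=(100, 9))
print(append_offset(X, delta=6).shape)   # (94, 18)
```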

Proceedings ArticleDOI
01 Apr 1986
TL;DR: The definition of this phonetic unit set is presented, along with experimental comparisons with classical diphones and with phoneme-like units; the performance was qualitatively evaluated using the segmentation of the training database provided by the Viterbi algorithm in a forced recognition task.
Abstract: This paper describes the design of a phonetic unit set for recognition of continuous speech where each unit is represented by a Hidden Markov Model. Starting from a unit set definition like classical diphones, many variations were made in order to improve recognition performance and reduce storage requirements. The definition of this unit set is presented, along with experimental comparisons with classical diphones and with phoneme-like units. The performance was qualitatively evaluated using the segmentation of the training database provided by the Viterbi algorithm in a forced recognition task. Classical recognition experiments have also been carried out using different "difficult vocabularies" as test material.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A signal modeling technique based upon finite mixture autoregressive probabilistic functions of Markov chains is developed; the signal modeling methodology is discussed and experimental results on speaker-independent recognition of isolated digits are given.
Abstract: In this paper a signal modeling technique based upon finite mixture autoregressive probabilistic functions of Markov chains is developed and applied to the problem of speech recognition, particularly speaker-independent recognition of isolated digits. Two types of mixture probability densities are investigated: finite mixtures of Gaussian autoregressive densities (GAM) and nearest-neighbor partitioned finite mixtures of Gaussian autoregressive densities (PGAM). In the former (GAM), the observation density in each Markov state is simply a (stochastically constrained) weighted sum of Gaussian autoregressive densities, while in the latter (PGAM) it involves nearest-neighbor decoding which, in effect, defines a set of partitions on the observation space. In this paper we discuss the signal modeling methodology and give experimental results on speaker-independent recognition of isolated digits.

Proceedings ArticleDOI
07 Apr 1986
TL;DR: A Markov model system is proposed in which symbols are replaced by spectral lines generated sequentially over the frequency domain; this drastically reduces the number of states in the Markov chain, and the use of continuous parameters eliminates quantization error completely.
Abstract: In most existing automatic speech recognition systems that make use of Markov models, the outputs of the Markov chain are strings whose symbols belong to a finite alphabet and are generated sequentially over the time domain. We propose a Markov model system in which symbols are replaced by spectral lines generated sequentially over the frequency domain. Each spectral line is represented by a continuous distribution of parameters. Switching from the time domain to the frequency domain drastically reduces the number of states in the Markov chain, and the use of continuous parameters eliminates quantization error completely. An application is presented with experimental results in a multi-speaker environment.

Proceedings ArticleDOI
Serge Soudoplatoff1
07 Apr 1986
TL;DR: The results showed that one can decrease the error rate by switching from a simple labelling scheme to this continuous-parameter model; results of an application of this model to a 5000-word speech recognition system are presented.
Abstract: This paper presents how to avoid the labelling stage of a speech recognition strategy based on hidden Markov models while keeping a stochastic formulation. After briefly recalling how a Markov model can be used for speech recognition, we propose another formulation in which the labels are suppressed, dealing only with continuous parameters. The notion of a speech generator is then introduced, and the formulas for training as well as decoding are rewritten. This new formulation requires that the probability densities p(x | G), where G is a generator and x an acoustic vector, be estimated. We explain our choice of non-parametric methods, using Parzen estimators. Those estimators require a kernel function, which we choose in a simple manner, and a value for the radius of the kernel, which is the key problem. A statistical solution, an information-theoretic solution, and an original topological solution are presented in turn; the last is retained. We finally present the results of an application of this model to a 5000-word speech recognition system. The results show that one can decrease the error rate by switching from a simple labelling scheme to this continuous-parameter model.
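
A sketch of the density estimate at the heart of the approach: a Parzen estimator of p(x | G) with a Gaussian kernel. Choosing the kernel radius h is exactly the key problem the paper discusses; here it is simply a parameter.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen estimate of p(x | G) from acoustic vectors emitted by generator G.

    x       : (d,) query vector
    samples : (N, d) training vectors for G
    h       : kernel radius (the paper's central design question)
    """
    d = samples.shape[1]
    sq = np.sum((samples - x) ** 2, axis=1)
    kernel = np.exp(-sq / (2.0 * h * h)) / (2.0 * np.pi * h * h) ** (d / 2.0)
    return kernel.mean()   # average of Gaussian bumps centred on the samples

rng = np.random.default_rng(2)
samples = rng.normal(size=(500, 3))
print(parzen_density(np.zeros(3), samples, h=0.5))
```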

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage and presents experimental results which were computed speaker-independently.
Abstract: This paper addresses the problem of generating word hypotheses in continuous German speech. It uses an extension of the well-known hidden Markov models in order to model more accurately the properties of the phonetic labeling stage. A powerful scoring function is derived. Experimental results are presented which were computed speaker-independently.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: A new type of very low bit rate speech coder based on a global Discrete Hidden Markov Model (DHMM) of continuous speech for a single speaker is presented here.
Abstract: A new type of very low bit rate speech coder based on a global Discrete Hidden Markov Model (DHMM) of continuous speech for a single speaker is presented here. Several important issues of the training, coding, and decoding procedures are discussed for a 64-state, 1024-observation model. Such a framework is useful in reducing the redundancy in a 10-bit classical Vector Quantizer (VQ), and could lead to a DHMM coder with a bit rate comparable to that of a Segment Vocoder (SV) or a Matrix Quantizer (MQ). This is achieved not only by modelling the long term non-stationarity and the inter-frame time dependencies of the speech, but also by efficiently representing a different kind of information such as vocal tract structure and linguistic patterns.

Proceedings ArticleDOI
K. Sugawara1, M. Nishimura, A. Kuroda
01 Apr 1986
TL;DR: The adaptation method proposed in this paper uses the intermediate results of the last training iteration of an HMM (hidden Markov model) to reduce recognition errors; for different speakers, only a slight improvement was obtained.
Abstract: During the training process, parameters of an HMM (hidden Markov model) are calculated iteratively using the Forward-Backward algorithm. The adaptation method we propose in this paper uses the intermediate results of the last iteration. The amount of storage needed to keep these intermediate results is very small (typically 1/400) compared with that of the entire parameter set. The confidence measures of the initial training and the adaptive training can be reflected in the coefficients used to calculate the new parameters. Experiments were done on (A) the same speaker, with several months between training and adaptive training/decoding, and (B) different speakers. In the case of the same speaker, recognition errors were reduced by 1/2 to 2/3 compared with the non-adaptation case. However, for different speakers, only a slight improvement was obtained.
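
The arithmetic of such a scheme can be sketched as merging saved accumulator totals with new ones before renormalizing. The weights standing in for the paper's "confidence measure" of initial and adaptive training are illustrative names, not the paper's notation.

```python
import numpy as np

def adapt_transitions(counts_old, counts_new, w_old, w_new):
    """Merge Forward-Backward accumulators kept from the last iteration of
    initial training with accumulators gathered from adaptation data,
    then renormalize rows into transition probabilities."""
    merged = w_old * counts_old + w_new * counts_new
    return merged / merged.sum(axis=1, keepdims=True)

counts_old = np.array([[80.0, 20.0], [10.0, 90.0]])  # saved, compact summary
counts_new = np.array([[5.0, 5.0], [2.0, 8.0]])      # from a little adaptation speech
print(adapt_transitions(counts_old, counts_new, w_old=1.0, w_new=4.0))
```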

Patent
13 Aug 1986
TL;DR: In this article, the authors proposed to reduce the number of probability functions determined and stored by assigning each such function to a reduced number of different states used in the Markov models.
Abstract: Speech recognisers which employ hidden Markov models using continuous probability functions require a large number of calculations to be carried out in a short time in order to give real-time recognition. In addition a large amount of electronic storage is also required. The present invention reduces these problems by reducing the number of probability functions determined and stored, by assigning each such function to a reduced number of different states used in the Markov models. In recognising words a minimum distance is computed for each model according to the Viterbi algorithm (operations 45 to 47) but since only a relatively small number of probability functions and states are stored the number of calculations required per unit time is also reduced. Fewer probability functions are stored and thus the amount of storage required is not as great as would otherwise be required. Further, in order to reduce costs and increase the speed of recognition, a specially constructed Viterbi engine is used in determining the probabilities that sounds observed represent various states of the models. Methods of deriving the required number of probability functions and states are also described.
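
The storage and computation saving can be sketched as a small pool of output densities shared by many states: per-frame likelihoods are computed once per pooled density rather than once per state. Names and values below are illustrative, not from the patent.

```python
import numpy as np

def gaussian_pdf(mu, var):
    return lambda x: np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

density_pool = [gaussian_pdf(0.0, 1.0), gaussian_pdf(2.0, 0.5)]  # few stored functions
state_to_density = [0, 0, 1, 1, 0]                               # many states -> few densities

def frame_likelihoods(frame):
    """Evaluate each pooled density once, then fan out to all states."""
    pooled = [b(frame) for b in density_pool]
    return [pooled[k] for k in state_to_density]

print(np.round(frame_likelihoods(0.7), 4))
```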

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper investigates the problem of defining an optimal classification for a given speech decoder, so that broad phonetic classes are recognized as accurately as possible from the speech signal.
Abstract: An approach for supporting large vocabulary in speech recognition is to use broad phonetic classes to reduce the search to a subset of the dictionary. In this paper, we investigate the problem of defining an optimal classification for a given speech decoder, so that these broad phonetic classes are recognized as accurately as possible from the speech signal. More precisely, given Hidden Markov Models of phonemes, we define a similarity measure of the phonetic machines, and use a standard classification algorithm to find the optimal classification. Three measures are proposed, and compared with manual classifications.


Proceedings ArticleDOI
Osaaki Watanuki1, T. Kaneko
01 Apr 1986
TL;DR: This method is applied to the recognition of a 32-Japanese-word vocabulary, and achieves a recognition accuracy comparable to or better than that of conventional approaches.
Abstract: In this paper, a simple and fast method for speaker-independent isolated word recognition is presented. This method is regarded as a simplification of the approach based on the Hidden Markov Model (HMM). In the proposed method, all training and decoding data are transformed into label strings by vector quantization. By segmenting the label strings of utterances into N pieces of equal duration, label histograms are computed in the training mode. In recognition, the label string of an input word is also divided into N equal segments, and the likelihood is computed with the corresponding histogram. It will be shown that the computational cost of this method is relatively low. This method is applied to the recognition of a 32-Japanese-word vocabulary, and achieves a recognition accuracy comparable to or better than that of conventional approaches.
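
A sketch of the training and recognition arithmetic described above, under the assumption (labelled as such) of a simple additive smoothing floor when normalizing the word histograms:

```python
import numpy as np

def segment_histograms(labels, n_segments, codebook_size):
    """Split a VQ label string into n_segments equal-duration pieces and
    count label occurrences in each: returns (n_segments, codebook_size)."""
    parts = np.array_split(np.asarray(labels), n_segments)
    return np.array([np.bincount(p, minlength=codebook_size) for p in parts])

def log_likelihood(token_labels, word_hist, eps=1e-2):
    """Score an input token against a word's accumulated histograms;
    eps is an illustrative smoothing floor, not from the paper."""
    tok = segment_histograms(token_labels, word_hist.shape[0], word_hist.shape[1])
    p = (word_hist + eps) / (word_hist + eps).sum(axis=1, keepdims=True)
    return float(np.sum(tok * np.log(p)))

# Toy example: a word model "trained" on one token, scored on another.
train = [3, 3, 1, 1, 1, 0, 0, 2, 2, 2, 2, 3]
hist = segment_histograms(train, n_segments=4, codebook_size=4)
print(log_likelihood([3, 3, 1, 1, 0, 0, 2, 2, 2, 3], hist))
```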

Proceedings ArticleDOI
01 Apr 1986
TL;DR: An approach to isolated and connected word recognition using a dynamic time warping algorithm formulated as a hidden Markov model; the approximate (Viterbi) algorithm has almost the same performance as the exact algorithm.
Abstract: In this paper, we present an approach to isolated and connected word recognition using a dynamic time warping algorithm formulated as a hidden Markov model. The classification consists of computing the a posteriori probability for each word model and choosing the word model that gives the highest probability. The probability is calculated in two different ways: one is the exact algorithm and the other is the approximate (Viterbi) algorithm. In our system, an input speech signal is first recognized as a string of monosyllables by the syllable-based O(n) DP matching. Second, the recognized string is matched with a monosyllable string of each lexical model, and the word or word sequence with the highest probability is recognized as the input speech by using O(n) DP matching based on a hidden Markov model. Reference patterns consist of 68 monosyllables, and test patterns consist of 90 isolated words and of two- and three-word connected sequences. We conclude from the results of the experiments that: (1) the results using 3 candidates per segment are much better than those using only the best candidate; (2) the approximate algorithm has almost the same performance as the exact algorithm; (3) the extended algorithm for connected word recognition works well.
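
The two scoring rules compared in the paper differ in one operation: the exact score sums over all state paths, while the Viterbi score keeps only the best. A minimal unscaled sketch, suitable only for short toy sequences (illustrative model values):

```python
import numpy as np

def exact_and_viterbi(pi, A, B, obs):
    alpha = pi * B[:, obs[0]]   # exact: forward recursion (sum over paths)
    delta = pi * B[:, obs[0]]   # Viterbi: best single path (max over paths)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        delta = (delta[:, None] * A).max(axis=0) * B[:, o]
    return alpha.sum(), delta.max()

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.1, 0.9]])
B  = np.array([[0.8, 0.2], [0.3, 0.7]])
p_exact, p_viterbi = exact_and_viterbi(pi, A, B, [0, 0, 1, 1])
print(p_exact, p_viterbi)   # the Viterbi score lower-bounds the exact score
```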


Journal ArticleDOI
TL;DR: The two most prominent algorithms, dynamic time-warping and hidden Markov modelling, are described and compared and particular attention is given to the role of dynamic programming in either approach.
Abstract: This article describes the methods which form the basis of contemporary automatic speech recognition systems. The two most prominent algorithms, dynamic time-warping and hidden Markov modelling, are described and compared. Particular attention is given to the role of dynamic programming in either approach.
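
For the first of the two algorithms, a compact sketch of the dynamic-programming recursion that both approaches share in spirit: classic DTW alignment of two feature sequences with Euclidean local distances and symmetric (insertion/deletion/match) steps.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time-warping distance between sequences x: (T1, d) and y: (T2, d)."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[T1, T2]

rng = np.random.default_rng(3)
a, b = rng.normal(size=(20, 5)), rng.normal(size=(24, 5))
print(dtw_distance(a, b))
```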

Proceedings ArticleDOI
C. Wellekens1
07 Apr 1986
TL;DR: A connected speech recognition method based on the Baum forward-backward algorithm is presented; segmentation of the test sentence uses the probability that an acoustic vector lies at the boundary between two speech subunit models.
Abstract: A connected speech recognition method based on the Baum forward-backward algorithm is presented. The segmentation of the test sentence uses the probability that an acoustic vector lies at the boundary between two speech subunit models (hidden Markov models). The labelling rests on the highest probability that a vector has been emitted from the last state of a subunit model. Results are presented for word and phoneme recognition.

Journal ArticleDOI
TL;DR: A computer program for Markov chain analysis is presented and discussed and tests hypotheses about the goodness of fit of first- and second-order Markov models.
Abstract: A computer program for Markov chain analysis is presented and discussed. The program is written in the language of the Statistical Analysis System (SAS) but detailed knowledge of SAS is not required for its use. The program tests hypotheses about the goodness of fit of first- and second-order Markov models. It also tests if transition probabilities are homogeneous between the first and the second half of each sequence.
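
The first-order goodness-of-fit test can be sketched outside SAS as a chi-square test of independence on the transition count matrix: rejecting independence supports first-order dependence. (The program described above also covers second-order models and homogeneity between sequence halves, which this sketch omits.)

```python
import numpy as np
from scipy.stats import chi2_contingency

seq = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 1, 1, 2, 0, 0, 1, 2, 1, 0, 2]  # toy data
n = 3
counts = np.zeros((n, n))
for a, b in zip(seq[:-1], seq[1:]):
    counts[a, b] += 1                       # transition count matrix

chi2, p, dof, _ = chi2_contingency(counts)  # H0: successive symbols are independent
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```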

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This paper describes a family of formant trackers based on hidden Markov models and vector quantization of LPC spectra, differing in whether formants are tracked singly or jointly.
Abstract: This paper describes a family of formant trackers based on hidden Markov models and vector quantization of LPC spectra. Two general classes of models are presented, differing in whether formants are tracked singly or jointly. The states of a single-formant model are scalar values corresponding to possible formant frequencies. The states of a multi-formant model are frequency vectors defining possible formant configurations. Formant detection and estimation are performed simultaneously using the forward-backward algorithm. Model parameters are estimated from hand-marked formant tracks. The models have been evaluated using portions of the Texas Instruments multi-dialect connected digits database. The most accurate configurations exhibited root-mean-square estimation errors of about 70 Hz, 95 Hz, and 140 Hz for F1, F2, and F3, respectively.

Proceedings ArticleDOI
01 Apr 1986
TL;DR: This work looks at the problem at two levels, the first at the sub-word level to find significant segment labels and the second at the grammar level in an attempt to deduce the grammatical units of a given vocabulary from the emission probabilities of a Hidden Markov Model.
Abstract: There has been much work in using Hidden Markov Models to model different types of linguistically defined units such as words, syllables and phonetic-type units. Here we look at the problem from the other direction and try to use the states obtained from a Markov model to find our own linguistic units. We look at the problem at two levels, the first at the sub-word level to find significant segment labels and the second at the grammar level in an attempt to deduce the grammatical units of a given vocabulary from the emission probabilities of a Hidden Markov Model.

Proceedings ArticleDOI
A. Tassy1, L. Miclet
07 Apr 1986
TL;DR: This paper presents a speaker-independent digit recognition system that combines word-based VQ with HMM, the cost of which is low enough to be implemented on a single signal processor available today.
Abstract: Vector Quantization has recently been used in the realization of a speaker-independent digit recognizer, based solely on the spectral content of the speech signal. On the other hand, the Hidden Markov Models proved their ability in modelling temporal distortions between different utterances of a word pronounced by several speakers. In terms of recognition rate, HMMs are as efficient as conventional DTW matching, but they need less computation and memory. This paper presents a speaker-independent digit recognition system that combines word-based VQ with HMM, the cost of which is low enough to be implemented on a single signal processor available today. It is the first result of a cooperation project between ENST and the MATRA company, financially supported by the French government. The proposed recognizer is structured in two parts. First, a VQ-preprocessor, with one vector codebook per vocabulary word, performs a coding of the short-time spectrum of the speech signal and realizes an initial sorting. Then HMMs are used to take the final recognition decision.