
Showing papers on "TIMIT" published in 2000


Journal ArticleDOI
TL;DR: This paper reports experiments on three phonological feature systems: the Sound Pattern of English (SPE) system, a multi-valued (MV) feature system which uses traditional phonetic categories such as manner, place, etc., and Government Phonology, which uses a set of structured primes.

199 citations


Journal ArticleDOI
TL;DR: Experimental results show that small EBF networks with basis function parameters estimated by the EM algorithm outperform the large RBF networks trained in the conventional approach.
Abstract: This paper proposes to incorporate full covariance matrices into the radial basis function (RBF) networks and to use the expectation-maximization (EM) algorithm to estimate the basis function parameters. The resulting networks, referred to as elliptical basis function (EBF) networks, are evaluated through a series of text-independent speaker verification experiments involving 258 speakers from a phonetically balanced, continuous speech corpus (TIMIT). We propose a verification procedure using RBF and EBF networks as speaker models and show that the networks are readily applicable to verifying speakers using LP-derived cepstral coefficients as features. Experimental results show that small EBF networks with basis function parameters estimated by the EM algorithm outperform the large RBF networks trained in the conventional approach. The results also show that the equal error rate achieved by the EBF networks is about two-thirds of that achieved by the vector quantization-based speaker models.
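As a rough illustration of the EBF idea above, here is a minimal sketch, with synthetic data standing in for TIMIT cepstra: EM fits full-covariance Gaussian basis functions, and a linear output layer is solved by least squares. Centre count, feature dimensions, and targets are all assumptions, not the paper's configuration.

```python
# Sketch of an elliptical basis function (EBF) network: Gaussian basis
# functions with full covariance matrices fitted by EM, plus a linear
# output layer trained by least squares.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))                    # stand-in for LP-derived cepstra
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # stand-in targets

# EM estimates means AND full covariances -- the step that turns a radial
# basis function network into an elliptical one.
gmm = GaussianMixture(n_components=8, covariance_type="full",
                      random_state=0).fit(X)

def ebf_hidden(X):
    """Basis-function activations: one full-covariance Gaussian per centre."""
    return np.column_stack([
        multivariate_normal.pdf(X, mean=m, cov=c)
        for m, c in zip(gmm.means_, gmm.covariances_)
    ])

H = np.hstack([ebf_hidden(X), np.ones((len(X), 1))])  # add bias term
w, *_ = np.linalg.lstsq(H, y, rcond=None)             # linear output layer
scores = H @ w                                        # verification-style scores
```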

96 citations


Proceedings Article
01 Oct 2000
TL;DR: The results show that there is useful supplementary information contained in the articulatory data which yields a small but significant improvement in phone recognition accuracy of 2%, but preliminary attempts to estimate the articulatory data from the acoustic signal and use this to supplement the acoustic input have not yielded any significant improvement in phone accuracy.
Abstract: In this paper we show that there is measurable information in the articulatory system which can help to disambiguate the acoustic signal. We measure directly the movement of the lips, tongue, jaw, velum and larynx and parameterise this articulatory feature space using principal components analysis. The parameterisation is developed and evaluated using a speaker dependent phone recognition task on a specially recorded TIMIT corpus of 460 sentences. The results show that there is useful supplementary information contained in the articulatory data which yields a small but significant improvement in phone recognition accuracy of 2%. However, preliminary attempts to estimate the articulatory data from the acoustic signal and use this to supplement the acoustic input have not yielded any significant improvement in phone accuracy.
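A minimal sketch of the parameterisation step described above, with random arrays standing in for real articulograph and acoustic data; the channel counts and number of retained components are assumptions:

```python
# Articulator trajectories reduced with PCA and appended to the acoustic
# features frame by frame, as a supplementary input stream.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
acoustic = rng.normal(size=(1000, 13))       # e.g. one cepstral frame per row
articulatory = rng.normal(size=(1000, 18))   # lips, tongue, jaw, velum, larynx

pca = PCA(n_components=6).fit(articulatory)  # compact articulatory space
art_pcs = pca.transform(articulatory)
combined = np.hstack([acoustic, art_pcs])    # acoustic + articulatory features
print(pca.explained_variance_ratio_.cumsum())
```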

93 citations


01 Jan 2000
TL;DR: The hypothesis is that the integration of acoustic-phonetic information into a state-of-the-art automatic phonetic alignment system will significantly improve its accuracy and robustness.
Abstract: One requirement for researching and building spoken language systems is the availability of speech data that have been labeled and time-aligned at the phonetic level. Although manual phonetic alignment is considered more accurate than automatic methods, it is too time consuming to be commonly used for aligning large corpora. One reason for the greater accuracy of human labeling is that humans are better able to locate distinct events in the speech signal that correspond to specific phonetic characteristics. The development of the proposed method was motivated by the belief that if an automatic alignment method were to use such acoustic-phonetic information, its accuracy would become closer to that of human performance. Our hypothesis is that the integration of acoustic-phonetic information into a state-of-the-art automatic phonetic alignment system will significantly improve its accuracy and robustness. In developing an alignment system that uses acoustic-phonetic information, we use a measure of intensity discrimination in detecting voicing, glottalization, and burst-related impulses. We propose and implement a method of voicing determination that has average accuracy of 97.25% (which is an average 58% reduction in error over a baseline system), a fundamental-frequency extraction method with average absolute error of 3.12 Hz (representing a 45% reduction in error), and a method for detecting burst-related impulses with accuracy of 86.8% on the TIMIT corpus (which is a 45% reduction in error compared to reported results). In addition to these features, we propose a means of using acoustics-dependent transition information in the HMM framework. One aspect of successful implementation of this method is the use of distinctive phonetic features. To evaluate the proposed and baseline phonetic alignment systems, we measure agreement with manual alignments and robustness. On the TIMIT corpus, the proposed method has 92.57% agreement within 20 msec. The average agreement of the proposed method represents a 28% reduction in error over our state-of-the-art baseline system. In measuring robustness, the proposed method has 14% less standard deviation when evaluated on 12 versions of the TIMIT corpus.

70 citations


Journal ArticleDOI
TL;DR: The authors propose a system that automatically selects the most discriminant parts of a speech utterance: the signal is divided into different time–frequency blocks, and a new selection procedure is designed to select the pertinent features.

64 citations


Dissertation
01 Jan 2000
TL;DR: The acoustic theory of speech production was used to predict characteristics of vowels, studies on a speech database tested the predictions, and the resulting data guided the development of an improved vowel landmark detector (VLD).
Abstract: Lexical Access From Features (LAFF) is a proposed knowledge-based speech recognition system which uses landmarks to guide the search for distinctive features. The first stage in LAFF must find vowel landmarks. This task is similar to automatic detection of syllable nuclei (ASD). This thesis adapts and extends ASD algorithms for vowel landmark detection. In addition to existing work on ASD, the acoustic theory of speech production was used to predict characteristics of vowels, and studies were done on a speech database to test the predictions. The resulting data guided the development of an improved vowel landmark detector (VLD). Studies of the TIMIT database showed that about 94% of vowels have a peak of energy in the F1 region, and that about 89% of vowels have a peak in F1 frequency. Energy and frequency peaks were fairly highly correlated, with both peaks tending to appear before the midpoint of the vowel duration (as labeled), and frequency peaks tending to appear before energy peaks. Landmark-based vowel classification was not found to be sensitive to the precise location of the landmark. Energy in a fixed frequency band (300 to 900 Hz) was found to be as good for finding landmarks as the energy at F1, enabling a simple design for a VLD without the complexity of formant tracking. The VLD was based on a peak-picking technique, using a recursive convex hull algorithm. Three acoustic cues (peak-to-dip depth, duration, and level) were combined using a multilayer perceptron with two hidden units. The perceptron was trained by matching landmarks to syllabic nuclei derived from the TIMIT aligned phonetic transcription. Pairs of abutting vowels were allowed to match either one or two landmarks without penalty. The perceptron was trained first by back propagation using mean squared error, and then by gradient descent using error rate. The final VLD's error rate was about 12%, with about 3.5% insertions and 8.5% deletions, which compares favorably to the 6% of vowels without peaks. Most errors occurred in predictable circumstances, such as high vowels adjacent to semivowels, or very reduced schwas. Further work should include improvements to the output confidence score, and error correction as part of vowel quality detection. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
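A toy sketch of the fixed-band landmark idea from this abstract: energy in the 300–900 Hz band is tracked and peaks are picked. scipy's peak prominence is used as a simple stand-in for the thesis's recursive convex-hull peak-to-dip measure, and all thresholds are invented:

```python
# Low-frequency energy contour plus peak picking as vowel landmark candidates.
import numpy as np
from scipy.signal import butter, sosfiltfilt, find_peaks

def vowel_landmarks(x, fs, frame=0.010):
    sos = butter(4, [300, 900], btype="bandpass", fs=fs, output="sos")
    low = sosfiltfilt(sos, x)                       # 300-900 Hz band
    hop = int(frame * fs)
    frames = low[: len(low) // hop * hop].reshape(-1, hop)
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    # prominence ~ peak-to-dip depth, distance ~ minimum duration between nuclei
    peaks, _ = find_peaks(energy_db, prominence=6.0, distance=5)
    return peaks * hop / fs                         # landmark times in seconds

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t) * np.exp(-((t - 0.5) ** 2) / 0.01)  # vowel-like burst
print(vowel_landmarks(x, fs))                       # one landmark near 0.5 s
```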

51 citations


Journal ArticleDOI
TL;DR: The authors present an original approach for automatic speaker identification, especially applicable to environments which cause partial corruption of the frequency spectrum of the signal, and propose a particularly redundant parallel architecture for which most of the correlations are kept.

50 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: An algorithm to hide data in speech signals by inverting the polarity of the signal at every syllable according to the assigned bit; the method was able to successfully hide data and restore it automatically.
Abstract: In this paper we investigate how polarity inversion of speech signals affects human perception, and we apply this technique for data hiding. In most languages, glottal airflow during phonation is uni-directional, causing constant polarity of the speech waveform. On the other hand, the human auditory system cannot discriminate between speech signals with positive and negative polarity. Based on these facts, we developed an algorithm to hide data in speech signals. We assigned one bit to each syllable of speech, and inverted the polarity of the signal at every syllable according to the assigned bit. We performed a test using 20 sentences from the TIMIT corpus to determine both whether a human could distinguish between the original and polarity-inverted signal and whether we could automatically restore the embedded binary data. We found that we were able to successfully hide data and restore it automatically.
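The embedding step is simple enough to sketch directly. One simplification to flag: the paper restores the bits blindly from the speech waveform itself, whereas this toy recovery correlates against the original signal, and syllable boundaries are taken as given:

```python
# Polarity-inversion data hiding: one bit per syllable, bit 1 flips the sign
# of that syllable's samples (inaudible to listeners per the paper).
import numpy as np

def embed_bits(signal, syllable_bounds, bits):
    out = signal.copy()
    for (start, end), bit in zip(syllable_bounds, bits):
        if bit:
            out[start:end] = -out[start:end]   # polarity flip encodes a 1
    return out

def recover_bits(original, embedded, syllable_bounds):
    # Negative correlation with the original => that syllable was inverted.
    return [int(np.dot(original[s:e], embedded[s:e]) < 0)
            for s, e in syllable_bounds]

x = np.random.default_rng(2).normal(size=16000)
bounds = [(0, 4000), (4000, 9000), (9000, 16000)]
y = embed_bits(x, bounds, [1, 0, 1])
print(recover_bits(x, y, bounds))              # -> [1, 0, 1]
```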

30 citations


Proceedings Article
01 Jan 2000
TL;DR: This paper argues that the underlying statistical model of classifier combination should be made explicit, and provides representations of two common combination schemes, the mean and product rules, using directed graphical models (DGMs).
Abstract: Classifier combination is a technique that often provides appreciable accuracy gains. In this paper, we argue that the underlying statistical model of classifier combination should be made explicit. Using directed graphical models (DGMs), we provide representations of two common combination schemes, the mean and product rules. We also introduce new DGMs that yield novel combination rules. We find that these new DGM-inspired rules can achieve significant accuracy gains on the TIMIT phone-classification task relative to existing combination schemes.
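For concreteness, here are the two combination schemes the paper represents as DGMs, applied to toy posteriors:

```python
# Mean rule: average the per-classifier posteriors.
# Product rule: multiply them and renormalise to a distribution.
import numpy as np

posteriors = np.array([          # 3 classifiers x 4 classes
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.50, 0.05, 0.40, 0.05],
])

mean_rule = posteriors.mean(axis=0)

prod = posteriors.prod(axis=0)
product_rule = prod / prod.sum()

print(mean_rule.argmax(), product_rule.argmax())   # combined decisions
```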

24 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: The novel approach here is to encode the clean models using principal component analysis (PCA) and pre-compute the prototype vectors and matrices for the means and covariances in the linear spectral-domain using rectangular DCT and inverse DCT matrices.
Abstract: This paper describes an algorithm to reduce the computational complexity of the parallel model combination (PMC) method for robust speech recognition while retaining the same level of performance. Although PMC is effective in composing a noise-corrupted acoustic model from clean speech and noise models, its intense computational complexity limits its use in real-time applications. The novel approach here is to encode the clean models using principal component analysis (PCA) and pre-compute the prototype vectors and matrices for the means and covariances in the linear spectral domain using rectangular DCT and inverse DCT matrices. Therefore, transformation into the linear spectral domain is reduced to finding the projection of each vector in the eigenspace of means and covariances, followed by a linear combination of vectors and matrices obtained from the projections. Furthermore, the eigenspace allows a better trade-off between computational complexity and accuracy. The computational savings are demonstrated both analytically and through experimental evaluations. Experiments using context-independent phone recognition with TIMIT data show that the new PMC framework outperforms the baseline method by a factor of 1.9 with the same level of accuracy.
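The operation being accelerated is the standard PMC log-add combination of cepstral-domain means, sketched below with made-up vectors; the paper's contribution, the PCA/eigenspace encoding, would replace the explicit per-Gaussian DCT transforms shown here:

```python
# PMC log-add for one Gaussian mean: go to the linear spectral domain via
# inverse DCT and exp, add speech and (scaled) noise, return via log and DCT.
import numpy as np
from scipy.fftpack import dct, idct

def pmc_logadd(mu_speech_cep, mu_noise_cep, gain=1.0):
    lin_s = np.exp(idct(mu_speech_cep, type=2, norm="ortho"))
    lin_n = np.exp(idct(mu_noise_cep, type=2, norm="ortho"))
    return dct(np.log(lin_s + gain * lin_n), type=2, norm="ortho")

rng = np.random.default_rng(10)
mu_s = rng.normal(size=24)   # clean-model cepstral mean (one Gaussian)
mu_n = rng.normal(size=24)   # noise-model cepstral mean
print(pmc_logadd(mu_s, mu_n).shape)
```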

18 citations


Proceedings Article
04 Sep 2000
TL;DR: The influence of GSM speech coding on the performance of a text-independent speaker recognition system based on Gaussian Mixture Models (GMM) is investigated, and feature calculation directly from the GSM EFR encoded parameters is explored.
Abstract: We have investigated the influence of GSM speech coding on the performance of a text-independent speaker recognition system based on Gaussian Mixture Models (GMM). The performance degradation due to the utilization of the three GSM speech coders was assessed using three transcoded databases, obtained by passing the TIMIT corpus through each GSM coder/decoder. The recognition performance was also assessed using the original TIMIT and its 8 kHz downsampled version. Then, different experiments were carried out in order to explore feature calculation directly from the GSM EFR encoded parameters and to measure the degradation introduced by different aspects of the coder.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The investigated error criteria that make non-iterative, closed-form estimator solutions possible are all found to achieve good speaker clustering potential for both male and female subgroups.
Abstract: This paper describes a method for the unsupervised and gender-independent estimation of the average human vocal tract length from the speech waveform, and reports results obtained on Fant's (1960) X-ray vowel data as well as results from experiments performed on multiple sentence utterances of 86 male and 78 female TIMIT speakers, including correlation analyses between the vocal tract length estimates and given body heights. The investigated error criteria that make non-iterative, closed-form estimator solutions possible are all found to achieve good speaker clustering potential for both male and female subgroups.
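As a worked example of a closed-form estimator in this spirit (a textbook uniform-tube reconstruction, not the paper's specific error criteria): for a tube closed at the glottis, formant k sits at F_k = (2k-1)c / (4L), so each measured formant yields a length estimate that can be averaged:

```python
# Vocal tract length from formants under the uniform-tube model.
import numpy as np

C = 35000.0  # speed of sound in cm/s

def vtl_from_formants(formants_hz):
    k = np.arange(1, len(formants_hz) + 1)
    return np.mean((2 * k - 1) * C / (4 * np.asarray(formants_hz)))

# Roughly schwa-like formants for an adult male vocal tract (~17.5 cm):
print(vtl_from_formants([500.0, 1500.0, 2500.0]))   # -> 17.5
```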

Journal ArticleDOI
Young-Sun Yun1, Yung-Hwan Oh
TL;DR: A new trajectory model for characterizing segmental features and their interaction based upon a general framework of hidden Markov models is proposed, which adopts polynomial trajectory modeling to represent the trajectories using a new design matrix that includes transitional information on neighborhood acoustic events.
Abstract: In this letter, we propose a new trajectory model for characterizing segmental features and their interaction based upon a general framework of hidden Markov models. Each segment, a sequence of frame vectors, is represented by a trajectory of observed vector sequences. This trajectory replaces the frame features in the segment and becomes the input of the segmental hidden Markov models (HMM's). In our approach, we adopt polynomial trajectory modeling to represent the trajectories using a new design matrix that includes transitional information on neighborhood acoustic events. To apply this trajectory to the segmental HMM, extra- and intrasegmental variations are modified to contain trajectory information. The presented model is regarded as an extension and generalization of conventional HMM, trajectory-based segmental HMM, and parametric trajectory models. The experimental results are reported on the TIMIT corpus and performance is shown to improve significantly over that of the conventional HMM.
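The core of polynomial trajectory modelling can be sketched as ordinary least squares against a time design matrix; the letter's extended design matrix with neighbourhood-transition columns is omitted here, and the data are synthetic:

```python
# Fit a polynomial trajectory to one segment of frame vectors: design matrix Z
# of normalised-time terms, coefficients B by least squares, fitted path Z @ B.
import numpy as np

def fit_trajectory(frames, order=2):
    """frames: (T, D) segment of feature vectors -> fitted (T, D) trajectory."""
    T = len(frames)
    t = np.linspace(0.0, 1.0, T)
    Z = np.vander(t, order + 1, increasing=True)     # columns [1, t, t^2, ...]
    B, *_ = np.linalg.lstsq(Z, frames, rcond=None)   # (order+1, D)
    return Z @ B

segment = np.cumsum(np.random.default_rng(3).normal(size=(20, 13)), axis=0)
smooth = fit_trajectory(segment, order=2)
print(np.mean((segment - smooth) ** 2))   # intra-segmental residual variance
```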

Proceedings ArticleDOI
05 Jun 2000
TL;DR: A Bayesian method is proposed where model combination and model decomposition are employed for the estimation of parameters required to implement subband LP Wiener filters, which provides advantages in terms of improved parameter estimates and also in restoring the temporal-spectral composition of speech.
Abstract: The performance of Wiener filters in restoring the quality and intelligibility of noisy speech depends on: (i) the accuracy of the estimates of the power spectra or the correlation values of the noise and the speech processes, and (ii) on the Wiener filter structure. In this paper a Bayesian method is proposed where model combination and model decomposition are employed for the estimation of parameters required to implement subband LP Wiener filters. The use of subband LP Wiener filters provides advantages in terms of improved parameter estimates and also in restoring the temporal-spectral composition of speech. The method is evaluated, and compared with the parallel model combination, using the TIMIT continuous speech database with BMW and VOLVO car noise databases.
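The underlying Wiener gain is worth stating concretely: H(f) = S(f) / (S(f) + N(f)) for speech and noise power spectra S and N. The sketch below applies it per frequency bin with toy spectra; the paper's subband LP structure and Bayesian parameter estimation are not reproduced:

```python
# Frequency-domain Wiener gain from estimated speech and noise power spectra.
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-3):
    gain = speech_psd / (speech_psd + noise_psd)
    return np.maximum(gain, floor)        # floor avoids hard spectral zeros

f = np.linspace(0, 8000, 257)
speech_psd = 1.0 / (1.0 + (f / 1000.0) ** 2)   # toy low-pass speech spectrum
noise_psd = np.full_like(f, 0.1)               # toy white noise
H = wiener_gain(speech_psd, noise_psd)
print(H[:5])
```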

Proceedings ArticleDOI
M. Lieb, Reinhold Haeb-Umbach
05 Jun 2000
TL;DR: Significant error rate reductions have been achieved when applying the novel long-range feature filters compared to standard approaches employing cepstral mean normalization and delta and delta-delta features, in particular when facing acoustic echo cancellation scenarios and room reverberation.
Abstract: Amongst several data-driven approaches for designing filters for the time sequence of spectral parameters, the linear discriminant analysis (LDA) based method has been proposed for automatic speech recognition. Here we apply LDA-based filter design to cepstral features, which better match the inherent assumption of this method that feature vector components are uncorrelated. Extensive recognition experiments have been conducted both on the standard TIMIT phone recognition task and on a proprietary 130-word command word task under various adverse environmental conditions, including reverberant data with real-life room impulse responses and data processed by acoustic echo cancellation algorithms. Significant error rate reductions have been achieved when applying the novel long-range feature filters compared to standard approaches employing cepstral mean normalization and delta and delta-delta features, in particular when facing acoustic echo cancellation scenarios and room reverberation. For example, the phone accuracy on reverberated TIMIT data could be increased from 50.7% to 56.0%.
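A minimal sketch of LDA-based filter design as described above, under heavy assumptions (random stand-ins for the cepstral sequence and frame labels, an arbitrary 9-frame window): context windows of one cepstral coefficient are stacked, LDA is fitted against the labels, and the leading discriminant direction is read off as a temporal FIR filter:

```python
# LDA over temporal context windows of a single cepstral coefficient; the
# first discriminant direction serves as a data-driven filter over time.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(8)
c1 = rng.normal(size=5000)                 # time sequence of one cepstrum coef
labels = rng.integers(0, 40, size=5000)    # stand-in phone labels per frame

K = 9                                      # context window length (frames)
X = np.lib.stride_tricks.sliding_window_view(c1, K)
y = labels[K // 2 : K // 2 + len(X)]       # label of the centre frame

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
fir = lda.scalings_[:, 0]                  # discriminant direction = filter taps
filtered = np.convolve(c1, fir[::-1], mode="same")
print(fir.round(3))
```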

01 Jan 2000
TL;DR: In this paper, a multi-talker multi-dialect corpus of spoken American English has been designed to provide researchers who are interested in variation and variability with a large number of speech samples from twenty talkers in each of four cities located in phonologically distinct dialect regions of the United States: West (Los Angeles), South (Atlanta), Midland (Indianapolis), and Northern Cities (Chicago).
Abstract: A multi-talker multi-dialect corpus of spoken American English has been designed to provide researchers who are interested in variation and variability with a large number of speech samples from twenty talkers in each of four cities located in phonologically distinct dialect regions of the United States: West (Los Angeles), South (Atlanta), Midland (Indianapolis), and Northern Cities (Chicago). The speech samples to be collected include word-length, sentence-length, and paragraph-length utterances, and have been designed to elicit phonological forms that differentiate the four regions. Once collected, these materials can be used for a range of perceptual and acoustic studies investigating the perception and production of dialect variation in the United States.

Objectives of the Corpus

The purpose of this project is to create a speech corpus containing recordings from a large number of talkers from phonologically distinct dialect regions in the United States for use in a range of perceptual studies and acoustic analyses. Dialect variation, both regional and social in origin, has been an important topic of research in American English since the 1930’s when plans for a “Linguistic Atlas of North America” were first discussed (Cassidy, 1993). The first studies were primarily concerned with regional variation, focusing on differences in lexical items produced by older males from rural areas (Chambers, 1993). More recently, dialect research has been extended to include studies on social and ethnic variation, such as African American Vernacular English and Appalachian English (Wolfram & Schilling-Estes, 1998). Recent research has also begun to focus on phonological variation, particularly on variation and changes in progress that have been documented in the vowel systems of several American English dialects. For example, vowel shifts such as the Northern Cities Vowel Shift found in urban areas surrounding the Great Lakes and the Southern Vowel Shift found in rural areas of the Southern United States have been described in some detail (Wolfram & Schilling-Estes, 1998). While such phonological variation has been studied via field recordings and transcription, relatively little work has been done to document the acoustic properties of these phenomena or to study their perceptual correlates via playback experiments. While acoustic analysis is a commonly accepted technique for comparing and differentiating the vowel systems of different languages, it is not commonly employed in sociolinguistic research due to Labov’s “observer’s paradox” (Wolfram & Schilling-Estes, 1998). Simply put, the paradox refers to the effect of the observer’s presence (the observer being an experimenter, recording equipment, or any other tool of measurement) on the acoustic properties of speech produced by members of a dialect community of interest. The dialect variation that sociolinguists seek to document is almost always found in forms that appear in speech styles used more frequently in casual conversation, in specific pragmatic or situational contexts, or only with other members of the same dialect community. The intrusion of an experimenter from outside the dialect community and the effect of recording equipment on the formality of the conversational setting are perceived as barriers to the elicitation of the “deepest” form of the dialect in question (Wolfram & Schilling-Estes, 1998).
Thus, the most commonly used method to investigate the properties of American English and other dialects involves making audio recordings of spontaneous speech and then phonetically transcribing those interviews. While such methods are useful in describing relatively gross differences between dialects, they suffer from a number of limitations for researchers interested in the acoustic-phonetic properties of phonological forms of a dialect, and for researchers developing controlled stimulus materials varying in dialect for use in perception experiments. First, the use of spontaneous speech entails a lack of control over the particular stimulus materials elicited. For the experimenter hoping to collect numerous tokens of a particular vowel or word in a common phonetic and prosodic context, it is very difficult to elicit such materials in a natural, spontaneous speech style (cf. Harnsberger & Pisoni, 1999). While certain tasks, such as topically-guided conversations or map tasks, can be used to elicit particular words or prosodic phrases, strict control over the phonetic context of these forms cannot be achieved. Control of phonetic context is crucial for any acoustic analysis, as well as in constructing stimulus materials for use in perception tests. Given these constraints, and given the purposes of this corpus, we have chosen to elicit speech materials in a read speech style, enabling control over the materials elicited. For the purposes of comparison only, we will also elicit a spontaneous sample from each talker, taking the form of a conversation with the experimenter administering the tests. While eliciting read speech undoubtedly limits the range of phonological variability we will observe between the dialects, we hope to ameliorate this problem by selecting American English dialects that have been shown in prior research to differ robustly from one another in terms of phonological patterns. We are also interested in documenting American English dialects that constitute relatively large communities within the United States. We believe that this will make the corpus as a whole more representative of American English dialectal variation than a corpus that is focused on much smaller dialect communities. We have therefore decided to record twenty talkers from each of four cities, representing four phonologically distinct regions: Atlanta (South), Indianapolis (Midland), Chicago (Northern Cities), and Los Angeles (West). For summary descriptions of each of the regional dialects, and for the rationale behind the selection of the boundaries defining these regions, see Wolfram and Schilling-Estes (1998) and Labov, Ash, and Boberg (1997). While we recognize that these four cities are not representative of all dialects of American English, we expect that they will provide us with some degree of phonological variation that is both acoustically and perceptually prominent, from a relatively large sample of talkers. The nature of the controlled stimulus materials, the focus on dialect variation, and the large number of talkers we plan to record are the three main features that set this corpus apart from other existing corpora.
There are at least three existing spoken language corpora that include speech samples from a large number of talkers from a variety of American English dialects: the Santa Barbara Corpus of Spoken American English (LDC Catalog, 2001c), the CALLFRIEND project (LDC Catalog, 2001a; LDC Catalog, 2001b), and the TIMIT Acoustic-Phonetic Continuous Speech Corpus (Zue, Seneff, & Glass, 1990). The Santa Barbara corpus contains spontaneous speech samples from talkers from a wide range of geographic and socioeconomic backgrounds. The CALLFRIEND project contains recordings of telephone conversations between talkers which are grouped into two broad dialect categories: Southern and Non-Southern. The TIMIT corpus contains ten read sentences from each of 630 talkers who come from eight defined dialect regions of the United States. The usefulness of the first two corpora in perceptual studies is limited by the lack of common stimulus materials for all talkers. The usefulness of the TIMIT corpus is also limited because, of the ten sentences read by each talker, only two were read by all 630 talkers. Spoken language corpora that control for stimulus materials also exist. However, they do not necessarily vary the dialect of the talkers in a systematic fashion. For example, corpora used in our lab such as the “Easy-Hard” Word Multi-Talker Speech Database (Torretta, 1995) and the Talker Variability Sentence Database (Karl & Pisoni, 1994) contain fixed sets of stimuli spoken by 10-20 talkers, but no effort was made to identify or control for dialectal variation in the talkers. The new corpus will combine the systematic variation in dialect found in the TIMIT corpus with the control over a range of stimulus materials found in the “Easy-Hard” and Talker Variability databases. Once the corpus has been collected, we plan to use it in our lab for perceptual studies involving dialect identification, categorization, and discrimination by non-native listeners, lexical decision tasks, and voice quality judgement tasks involving dialect manipulations. This corpus will also be used in a series of perceptual learning tasks on dialect intelligibility after laboratory training and dialect manipulations in voice learning. Finally, the corpus will enable us to conduct acoustic-phonetic studies including descriptions of the vowel systems, analyses of diphthongal differences, and investigations into the acoustic correlates of stress across dialects.

Organization of the Corpus

Proceedings Article
01 Jan 2000
TL;DR: A key element of the proposed method of burst detection is the use of a measurement of intensity discrimination based on models from perceptual studies, which is compared to the support vector machine (SVM) method.
Abstract: Detection of burst-related impulses, such as those accompanying plosive stop consonants, is an important problem for accurate measurement of acoustic features for recognition (e.g., voice-onset-time) and for accurate automatic phonetic alignment. The proposed method of burst detection utilizes techniques for identifying and combining information about specific acoustic characteristics of bursts. One key element of the proposed method is the use of a measurement of intensity discrimination based on models from perceptual studies. Our experiments compared the proposed method of burst detection to the support vector machine (SVM) method, described below. The total error rate for the proposed method is 13.2% on the test-set partition of the TIMIT corpus, compared to a total error rate of 24% for the SVM method.

Proceedings ArticleDOI
29 Oct 2000
TL;DR: Simulation results generally indicate improved separation quality, a higher probability in producing distinct source outputs, and robustness in noisy cases.
Abstract: Techniques for blind separation of mixed speech signals (co-channel speech) have been reported in the literature. One computationally simple method for linear mixtures (suitable for real-time separation) employs a gradient search algorithm to maximize the kurtosis of the outputs (hopefully separated speech signals). We report the results of an enhancement to the algorithm which involves a normalization of the correction matrix used in the update of the separation matrix. Simulation results (using the TIMIT speech corpus) generally indicate improved (sometimes significantly) separation quality, a higher probability of producing distinct source outputs, and robustness in noisy cases.
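A compact sketch of gradient-search kurtosis maximization with a normalised correction matrix, in the spirit of the enhancement reported above; sources, mixing matrix, step size, and iteration count are all invented, and a real system would typically add decorrelation between outputs:

```python
# Gradient ascent on output kurtosis for two linear mixtures. The gradient of
# kurt(y) = E[y^4] - 3 E[y^2]^2 w.r.t. each separation row is
# E[y^3 x] - 3 E[y^2] E[y x]; the correction matrix is row-normalised.
import numpy as np

rng = np.random.default_rng(4)
s = rng.laplace(size=(2, 20000))          # super-Gaussian stand-ins for speech
A = np.array([[1.0, 0.6], [0.5, 1.0]])    # mixing matrix
x = A @ s
N = x.shape[1]

W = np.eye(2)
mu = 0.01
for _ in range(300):
    y = W @ x
    grad = (y ** 3) @ x.T / N \
         - 3 * np.mean(y ** 2, axis=1, keepdims=True) * (y @ x.T / N)
    grad /= np.linalg.norm(grad, axis=1, keepdims=True)   # normalised correction
    W += mu * grad
    W /= np.linalg.norm(W, axis=1, keepdims=True)         # keep rows unit norm

print(W @ A)   # ~ permuted, scaled identity when separation succeeds
```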

Proceedings ArticleDOI
30 Aug 2000
TL;DR: The combined system provides significant improvements for phone recognition and classification on the TIMIT corpus and is better than the best context-independent systems in the literature and close to the best context-dependent systems.
Abstract: We have been investigating the possible advantages of a modular/ensemble neural network for acoustic modelling. We report experiments with ensembles of networks trained on data provided by different front-end preprocessing methods. As in previous work, we train a network ensemble for each individual phone and combine the outputs of the ensemble using a further trained network. The combined system provides significant improvements for phone recognition and classification on the TIMIT corpus. Our results are now better than the best context-independent systems in the literature and close to the best context-dependent systems.

01 Jan 2000
TL;DR: The segmentation produced by an entropy rate-based method is compared to the manual phoneme segmentations of the TIMIT and the KIEL corpora to evaluate the potential of the entropy rate contour to identify stationary and non-stationary segments of speech signals.
Abstract: This study evaluates the potential of the entropy rate contour to identify stationary and non-stationary segments of speech signals. The segmentation produced by an entropy rate-based method is compared to the manual phoneme segmentations of the TIMIT and the KIEL corpora. Characteristic points, i.e., the steepest rises and falls of the entropy rate curve and its maxima and minima, are investigated to determine whether they label stationary and non-stationary speech segments. The phonetically labelled speech corpora for American English (TIMIT) and German (Kiel Corpus of Read Speech) serve as references for the corpus-based evaluation.
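For intuition, here is the simplest related contour one can compute: per-frame spectral entropy, with the steepest changes taken as candidate boundaries. The paper's entropy rate estimator is more involved than this stand-in, and the frame sizes are arbitrary:

```python
# Spectral entropy contour; large jumps mark transitions between
# noise-like (high entropy) and tone-like (low entropy) regions.
import numpy as np

def spectral_entropy_contour(x, fs, frame=0.016, hop=0.008):
    n, h = int(frame * fs), int(hop * fs)
    ent = []
    for start in range(0, len(x) - n, h):
        spec = np.abs(np.fft.rfft(x[start:start + n] * np.hanning(n))) ** 2
        p = spec / (spec.sum() + 1e-12)
        ent.append(-(p * np.log(p + 1e-12)).sum())
    ent = np.array(ent)
    boundaries = np.sort(np.argsort(np.abs(np.diff(ent)))[-10:])  # steepest changes
    return ent, boundaries

fs = 16000
x = np.concatenate([np.sin(2 * np.pi * 300 * np.arange(fs) / fs),
                    np.random.default_rng(7).normal(size=fs)])
ent, bnd = spectral_entropy_contour(x, fs)
print(bnd * 0.008)   # boundary times cluster near the 1.0 s tone/noise change
```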

Journal ArticleDOI
TL;DR: The kurtosis maximization ideas for source separation are extended to include delays in the mixing model to at least account for propagation delays from speakers to microphones.
Abstract: Blind source separation of mixtures of speech signals has received considerable attention in the research community over the last two years. One computationally efficient method employs a gradient search algorithm to maximize the kurtosis of the outputs, thereby achieving separation of the source signals. While this method has reported excellent separation results (30–50 dB SIR), it assumes a simple linear mixing model. In the general case, convolutional mixing models are used; however, this is a rather difficult problem due to causality and stability restrictions on the inverse, not to mention length requirements in the FIR approximation. Research results with the general problem are modest at best. In this paper, we extend the kurtosis maximization ideas for source separation to include delays in the mixing model, to at least account for propagation delays from speakers to microphones. The algorithm is designed to first estimate the relative delays of the sources within each mixture using a standard autocorrelation technique. These delay estimates are then used in the kurtosis maximization algorithm, where the separation matrix is now modified to include these delays. Simulation results (using the TIMIT speech corpus) generally indicate good separation quality (10–20 dB) with little additional computational overhead.

Journal ArticleDOI
Ha-Jin Yu, Y.-H. Oh
TL;DR: A non-uniform unit which can model phoneme variations caused by co-articulation spread over several phonemes and between words is introduced to neural networks for speaker-independent continuous speech recognition.

Journal ArticleDOI
TL;DR: In this article, a successful and efficient kurtosis maximization algorithm, previously used for speech separation of two sources from two linear mixtures, is extended for use in problems with arbitrary numbers of sources and mixtures.
Abstract: In many real-world applications of blind source separation, the number of mixture signals, L, available for analysis often differs from the number of sources, M, which may be present. In this paper, we extend a successful and efficient kurtosis maximization algorithm used in speech separation of two sources from two linear mixtures for use in problems with arbitrary numbers of sources and mixtures. We examine three cases: underdetermined (M > L), critically determined (M = L), and overdetermined (M < L). In each of these cases, we present simulation results (using the TIMIT speech corpus) and discuss separation matrix initialization issues and observed algorithm limitations. We find that in the critically determined case, the algorithm performs well (20–40 dB SIR) at separating four sources from four mixtures. For the other cases, our results are mixed. In the overdetermined case (two sources, three mixtures), the algorithm performs well (20–40 dB SIR) and we find that the extra mixtures do not result in better SIR measurements. In the underdetermined case (three sources, two mixtures), we are able to separate out at least one source (sometimes two), with the other output signals each containing pairs of the remaining sources.

Proceedings ArticleDOI
Young-Sun Yun, Yung-Hwan Oh
05 Jun 2000
TL;DR: A parametric trajectory model for characterizing segmental features and their interaction within the segmental HMMs is presented and performance is shown to improve significantly over that of the conventional HMM.
Abstract: We present a parametric trajectory model for characterizing segmental features and their interaction within segmental HMMs. The trajectory is obtained by applying a design matrix which includes transitional information on contiguous frames, and it is characterized as a polynomial regression function. To apply the trajectory to the segmental HMM, the extra- and intra-segmental variations are modified to contain the trajectory information. We performed experiments to examine the characteristics of variances and the variabilities within a segment. The experimental results are reported on the TIMIT corpus and performance is shown to improve significantly over that of the conventional HMM.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: The motivation is that clustering at the finer acoustic level of subspace Gaussians of lower dimension is more effective, resulting in lower distortions and relatively fewer regression classes.
Abstract: In the hidden Markov modeling framework with mixture Gaussians, adaptation is often done by modifying the Gaussian mean vectors using MAP estimation or MLLR transformation. When the amount of adaptation data is scarce or when some speech units are unseen in the data, it is necessary to do adaptation in groups, either with regression classes of Gaussians or via vector field smoothing. In this paper, we propose to derive regression classes of subspace Gaussians for MAP adaptation. The motivation is that clustering at the finer acoustic level of subspace Gaussians of lower dimension is more effective, resulting in lower distortions and relatively fewer regression classes. Experiments in which context-dependent TIMIT HMMs are adapted to the resource management task with a few minutes of speech show improvement of our subspace regression classes over traditional full-space regression classes.
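For reference, the standard MAP mean update that such regression classes tie together (the paper's contribution is how the classes are formed in a Gaussian subspace, not this update itself):

```python
# MAP adaptation of one Gaussian mean: with prior mean mu0, relevance factor
# tau, and per-frame occupation probabilities gamma_t, the adapted mean
# interpolates between the prior and the occupancy-weighted data mean.
import numpy as np

def map_mean(mu0, frames, gammas, tau=10.0):
    """mu0: (D,), frames: (T, D), gammas: (T,) occupation probabilities."""
    occ = gammas.sum()
    data_mean = (gammas[:, None] * frames).sum(axis=0) / max(occ, 1e-8)
    return (tau * mu0 + occ * data_mean) / (tau + occ)

rng = np.random.default_rng(5)
mu0 = np.zeros(4)
frames = rng.normal(loc=1.0, size=(50, 4))
gammas = rng.uniform(size=50)
print(map_mean(mu0, frames, gammas))   # pulled toward the data mean
```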

Proceedings ArticleDOI
07 Mar 2000
TL;DR: Simulation results for classifying the utterances show that the size of the BDRNN required is very small compared to multilayer perceptron networks with time delayed feedforward connections.
Abstract: The objective of this paper is to recognize speech based on speech prediction techniques using a discrete time recurrent neural network (DTRNN) with a block diagonal feedback weight matrix called the block diagonal recurrent neural network (BDRNN). The ability of this network has been investigated for the TIMIT isolated digits spoken by a representative speaker. Simulation results for classifying the utterances show that the size of the BDRNN required is very small compared to multilayer perceptron networks with time delayed feedforward connections.
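The structural constraint is easy to show: the feedback matrix is assembled from small independent blocks, so recurrence happens only within each block. Sizes and the tanh nonlinearity below are illustrative, not the paper's exact configuration:

```python
# A block diagonal recurrent network (BDRNN) state update: the 8x8 feedback
# matrix is built from four independent 2x2 blocks.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(9)
blocks = [rng.normal(scale=0.5, size=(2, 2)) for _ in range(4)]
W_fb = block_diag(*blocks)                  # block diagonal feedback weights
W_in = rng.normal(scale=0.5, size=(8, 13))  # input weights

def bdrnn_states(inputs):
    h = np.zeros(8)
    states = []
    for u in inputs:                        # u: one 13-dim feature frame
        h = np.tanh(W_fb @ h + W_in @ u)    # recurrence stays within blocks
        states.append(h)
    return np.array(states)

print(bdrnn_states(rng.normal(size=(20, 13))).shape)   # (20, 8)
```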

23 Mar 2000
TL;DR: The influence of GSM speech coding on the performance of a text-independent speaker recognition system based on Gaussian Mixture Model (GMM) classifiers is investigated, and feature calculation directly from the encoded parameters is explored.
Abstract: We have investigated the influence of GSM speech coding on the performance of a text-independent speaker recognition system based on Gaussian Mixture Model (GMM) classifiers. The performance degradation due to the utilization of the three GSM speech coders was first assessed using three transcoded databases, obtained by passing the TIMIT corpus through each GSM coder/decoder. The coded databases were used for training and testing the speaker identification system. The speaker recognition performance was also assessed using the original TIMIT and its 8 kHz downsampled version. Then, different experiments were carried out, aimed at exploring feature calculation directly from the encoded parameters and at measuring the degradation introduced by different aspects of the coders.

DOI
01 Jan 2000
TL;DR: Experimental results indicated that none of the proposed methods perform significantly better than the standard method; however, the absolute best result obtained with the proposed front end is comparable to those obtained with current state-of-the-art systems.
Abstract: This dissertation presents an investigation of non-uniform time sampling methods for spectral/temporal feature extraction in speech. Frame-based features were computed based on an encoding of the global spectral shape using a Discrete Cosine Transform. In most current “standard” methods, trajectory (dynamic) features are determined from frame-based parameters using a fixed time sampling, i.e., fixed block length and fixed block spacing. In this research, new methods are proposed and investigated in which block length and/or block spacing are variable. The idea was initially tested with HMM-based isolated word recognition, and a significant performance improvement resulted when a variable block length and variable block spacing method was applied. An accuracy of 97.9% was obtained on an alphabet recognition task using the ISOLET database. This result is by far the highest reported in the literature. The variable block length method was then adapted to accommodate the complexity of continuous speech. Three methods were proposed and each was tested with the TIMIT and NTIMIT databases using HMM recognizers. Phone recognition experiments were conducted using the standard 39-phone set. Tuning of parameters was achieved with monophone models using a simple HMM configuration. The methods were also evaluated with more complex models, such as models with more mixture components, models with a full covariance matrix, and right-context biphone models. Experimental results indicated that none of the proposed methods perform significantly better than the standard method. However, the absolute best result obtained with the proposed front end is comparable to those obtained with current state-of-the-art systems. Also, the performance achieved with monophone models compares favorably to many context-dependent systems which are more complex.
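A sketch of the feature pipeline the abstract describes, under assumptions (synthetic log spectra, linear-fit slopes as the trajectory features): global spectral shape is encoded with a DCT, and dynamic features are computed over blocks whose length may vary rather than a single fixed block:

```python
# DCT encoding of global spectral shape, plus trajectory (slope) features
# computed over fixed- and variable-length blocks of frames.
import numpy as np
from scipy.fftpack import dct

def frame_features(log_spectrum_frames, n_coef=13):
    return dct(log_spectrum_frames, type=2, axis=1, norm="ortho")[:, :n_coef]

def block_slope(cepstra, centre, half_len):
    """Linear-fit slope over a block of 2*half_len+1 frames (a delta feature)."""
    lo, hi = max(0, centre - half_len), min(len(cepstra), centre + half_len + 1)
    t = np.arange(lo, hi) - centre
    denom = np.sum(t ** 2)
    return (t[:, None] * cepstra[lo:hi]).sum(axis=0) / max(denom, 1)

rng = np.random.default_rng(6)
logspec = rng.normal(size=(100, 64))     # stand-in log spectra, one per frame
C = frame_features(logspec)
fixed = block_slope(C, 50, 2)            # standard fixed block length
variable = block_slope(C, 50, 5)         # a longer, variable-length block
print(fixed.shape, variable.shape)
```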