
Showing papers on "TIMIT published in 2004"


Journal ArticleDOI
TL;DR: An algorithm based on group delay processing of the magnitude spectrum is presented that determines segment boundaries in the speech signal, automatically segmenting continuous speech into syllable-like units.

96 citations


Journal ArticleDOI
TL;DR: A new scheme is proposed that removes phase sensitivity by using an analytical description of the ICA-adapted basis functions obtained via the Hilbert transform; because the basis functions are not shift invariant, the scheme is extended with a frequency-based ICA stage that removes redundant time-shift information.

89 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: The modified group delay feature (MODGDF) is used as a front end feature in a Gaussian mixture model (GMM) based speaker identification system and it is shown that the MODGDF has speaker specific properties.
Abstract: In this paper, we explore new methods by which speakers can be identified and discriminated, using features derived from the Fourier transform phase. The modified group delay feature (MODGDF), which is a parameterized form of the modified group delay function, is used as a front end feature in this study. A Gaussian mixture model (GMM) based speaker identification system is built with the MODGDF as the front end feature. The system is tested on both clean (TIMIT) and noisy telephone (NTIMIT) speech. The results obtained are compared with traditional Mel frequency cepstral coefficients (MFCC), which are derived from the Fourier transform magnitude. When both MFCC and MODGDF were combined, the performance improved by about 4%, indicating that phase and magnitude contain complementary information. In an earlier paper (Murthy et al. (2003)), it was shown that the MODGDF does possess phoneme specific characteristics. In this paper, we show that the MODGDF has speaker specific properties. We also make an attempt to understand the speaker-discriminating characteristics of the MODGDF using the nonlinear mapping technique based on Sammon mapping (Sammon (1969)) and find that the MODGDF empirically demonstrates a certain level of linear separability among speakers.
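
For concreteness, here is a minimal numpy sketch of the modified group delay computation, following the standard construction from the DFTs of x(n) and n·x(n) with a cepstrally smoothed denominator; the alpha, gamma, and lifter settings are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

def modgdf(frame, alpha=0.4, gamma=0.9, lifter=8):
    """Hedged sketch of the modified group delay function (MODGDF).
    Parameter values are illustrative, not taken from the paper."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)        # spectrum of x(n)
    Y = np.fft.rfft(n * frame)    # spectrum of n * x(n)
    # Cepstrally smoothed magnitude spectrum tames zeros of X that
    # would otherwise make the group delay spiky.
    cep = np.fft.irfft(np.log(np.abs(X) + 1e-10))
    cep[lifter:len(cep) - lifter] = 0.0      # keep low quefrencies only
    S = np.exp(np.fft.rfft(cep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

# Cepstrum-like features for the GMM are then typically obtained by
# taking a DCT of the MODGDF curve.
```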

73 citations


Journal ArticleDOI
TL;DR: In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system and results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used.
Abstract: Studies by Shannon et al. [Science, 270, 303-304 (1995)], Van Tasell et al. [J. Acoust. Soc. Am. 82, 1152-1161 (1987)], and others show that human listeners can understand important aspects of the speech signal when spectral shape has been significantly degraded. These experiments suggest that temporal information is particularly important in human speech perception when the speech signal is heavily degraded. In this study, a system is developed that extracts linguistically relevant temporal information that can be used in the front end of an automatic speech recognition system. The parameters targeted include energy onsets and offsets (computed using an adaptive algorithm) and measures of periodic and aperiodic content; together these are used to find abrupt acoustic events which signify landmarks. Overall detection rates for strongly robust events, robust events, and weak events in a portion of the TIMIT test database are 98.9%, 94.7%, and 52.1%, respectively. Error rates increase by less than 5% when the speech signals are spectrally impoverished. Use of the four temporal parameters as the front end of a hidden Markov model (HMM)-based system for the automatic recognition of the manner classes "sonorant," "fricative," "stop," and "silence" results in the same recognition accuracy achieved when the standard 39 cepstral-based parameters are used, 70.1%. The combination of the temporal parameters and cepstral parameters results in an accuracy of 74.8%.
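
A hedged illustration of this kind of event detection (not the paper's adaptive algorithm): flag frames where band energy rises or falls sharply. The band edges and the 9 dB jump threshold below are assumed values.

```python
import numpy as np

def energy_landmarks(x, fs, band=(800.0, 2500.0), win=0.01, jump_db=9.0):
    """Mark abrupt band-energy onsets/offsets; thresholds illustrative."""
    hop = int(win * fs)
    frames = np.array([x[i:i + hop] for i in range(0, len(x) - hop, hop)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(hop, 1.0 / fs)
    sel = (freqs >= band[0]) & (freqs < band[1])
    e_db = 10 * np.log10(np.sum(spec[:, sel] ** 2, axis=1) + 1e-10)
    de = np.diff(e_db)                     # coarse rate of energy change
    onsets = np.where(de > jump_db)[0] * hop       # sample indices
    offsets = np.where(de < -jump_db)[0] * hop
    return onsets, offsets
```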

57 citations


Dissertation
01 Jun 2004
TL;DR: A novel approach to decoding for segment models in the form of a stack decoder with A∗ search is introduced, which allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model.
Abstract: The majority of automatic speech recognition (ASR) systems rely on hidden Markov models (HMM), in which the output distribution associated with each state is modelled by a mixture of diagonal covariance Gaussians. Dynamic information is typically included by appending time-derivatives to feature vectors. This approach, whilst successful, makes the false assumption of framewise independence of the augmented feature vectors and ignores the spatial correlations in the parametrised speech signal. This dissertation seeks to address these shortcomings by exploring acoustic modelling for ASR with an application of a form of state-space model, the linear dynamic model (LDM). Rather than modelling individual frames of data, LDMs characterise entire segments of speech. An auto-regressive state evolution through a continuous space gives a Markovian model of the underlying dynamics, and spatial correlations between feature dimensions are absorbed into the structure of the observation process. LDMs have been applied to speech recognition before; however, a smoothed Gauss-Markov form was used which ignored the potential for subspace modelling. The continuous dynamical state means that information is passed along the length of each segment. Furthermore, if the state is allowed to be continuous across segment boundaries, long-range dependencies are built into the system and the assumption of independence of successive segments is loosened. The state provides an explicit model of temporal correlation which sets this approach apart from frame-based and some segment-based models where the ordering of the data is unimportant. The benefits of such a model are examined both within and between segments. LDMs are well suited to modelling smoothly varying, continuous, yet noisy trajectories such as are found in measured articulatory data. Using speaker-dependent data from the MOCHA corpus, the performance of systems which model acoustic, articulatory, and combined acoustic-articulatory features is compared. As well as measured articulatory parameters, experiments use the output of neural networks trained to perform an articulatory inversion mapping. The speaker-independent TIMIT corpus provides the basis for larger scale acoustic-only experiments. Classification tasks provide an ideal means to compare modelling choices without the confounding influence of recognition search errors, and are used to explore issues such as choice of state dimension, front-end acoustic parametrisation and parameter initialisation. Recognition for segment models is typically more computationally expensive than for frame-based models. Unlike frame-level models, it is not always possible to share likelihood calculations for observation sequences which occur within hypothesised segments that have different start and end times. Furthermore, the Viterbi criterion is not necessarily applicable at the frame level. This work introduces a novel approach to decoding for segment models in the form of a stack decoder with A∗ search. Such a scheme allows flexibility in the choice of acoustic and language models since the Viterbi criterion is not integral to the search, and hypothesis generation is independent of the particular language model. Furthermore, the time-asynchronous ordering of the search means that only likely paths are extended, and so a minimum number of models are evaluated. The decoder is used to give full recognition results for feature-sets derived from the MOCHA and TIMIT corpora.
Conventional train/test divisions and choice of language model are used so that results can be directly compared to those in other studies. The decoder is also used to implement Viterbi training, in which model parameters are alternately updated and then used to re-align the training data.
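
The core computation in such a system, the likelihood of a feature segment under an LDM, reduces to a Kalman-filter innovation sum. Below is a minimal sketch assuming the model parameters (F, H, Q, R) have already been estimated, e.g. by EM.

```python
import numpy as np

def ldm_loglik(Y, F, H, Q, R, x0, P0):
    """Segment log-likelihood under x_t = F x_{t-1} + w, y_t = H x_t + v,
    computed with the Kalman filter in innovation form."""
    x, P, ll = x0, P0, 0.0
    for y in Y:                            # Y: (T, obs_dim) segment
        x, P = F @ x, F @ P @ F.T + Q      # time update
        e = y - H @ x                      # innovation
        S = H @ P @ H.T + R                # innovation covariance
        ll += -0.5 * (e @ np.linalg.solve(S, e)
                      + np.linalg.slogdet(S)[1]
                      + len(y) * np.log(2 * np.pi))
        K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
        x, P = x + K @ e, P - K @ H @ P    # measurement update
    return ll

# Classification: score a segment under each phone's LDM and pick the
# model with the highest log-likelihood.
```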

43 citations


Journal ArticleDOI
TL;DR: It is shown that the discriminant capacity of such an encoder can be improved by introducing signal class-membership information from the coding stage onward; the approach fits into the category of discriminant feature extraction (DFE) encoders already proposed in the literature.

24 citations


Proceedings ArticleDOI
23 Aug 2004
TL;DR: This paper presents a novel extension of hidden Markov models (HMMs): type-2 fuzzy HMMs (type-2 FHMMs), which can handle both randomness and fuzziness within the framework of type-2 fuzzy sets and fuzzy logic systems (FLSs).

Abstract: This paper presents a novel extension of hidden Markov models (HMMs): type-2 fuzzy HMMs (type-2 FHMMs). The advantage of this extension is that it can handle both randomness and fuzziness within the framework of type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs). Membership functions (MFs) of type-2 fuzzy sets are three-dimensional. It is the third dimension that provides the additional degrees of freedom that make it possible to handle both uncertainties. We apply the type-2 FHMM as an acoustic model for phoneme recognition on the TIMIT speech database. Experimental results show that the type-2 FHMM performs comparably to the HMM but is more robust to noise, while retaining almost the same computational complexity.
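
A full type-2 FHMM is beyond a short sketch, but the scaled forward recursion it builds on is standard; the closing comment gives one hedged reading of how interval emission scores could bracket the likelihood.

```python
import numpy as np

def forward_loglik(A, B, pi):
    """Scaled HMM forward pass; B[t, j] is the emission likelihood of
    frame t under state j."""
    alpha = pi * B[0]
    ll = np.log(alpha.sum()); alpha /= alpha.sum()
    for t in range(1, len(B)):
        alpha = (alpha @ A) * B[t]
        ll += np.log(alpha.sum()); alpha /= alpha.sum()
    return ll

# Hedged interval type-2 reading: if membership-function uncertainty
# yields emission intervals [B_lo, B_hi], the recursion is monotone in
# B, so forward_loglik(A, B_lo, pi) and forward_loglik(A, B_hi, pi)
# bracket the score; a crisp decision can then use the interval bounds.
```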

22 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A detector is described that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech and the results are compared to a baseline system.
Abstract: In this paper, the states in the speech production process are defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. We extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for the place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
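
The rescoring step is essentially Viterbi decoding over per-frame class posteriors under transition constraints. A minimal single-stream sketch, with the log-transition matrix (encoding the language/duration constraints) assumed given:

```python
import numpy as np

def viterbi(post, logA, log_prior):
    """Decode one articulatory-feature stream from RNN posteriors.
    post: (T, C) class posteriors; logA: (C, C) log transition scores."""
    T, C = post.shape
    emis = np.log(post + 1e-10)
    delta = log_prior + emis[0]
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA    # best predecessor per class
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```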

21 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: The results show that significant changes occur at the prosodic, phoneme space, and word levels for dialect analysis, and that effective dialect classification can be achieved using processing strategies from each domain.
Abstract: In this paper, we present our recent work in the analysis and modeling of speech under dialect. Dialect and accent significantly influence automatic speech recognition performance, and therefore it is critical to detect and classify non-native speech. In this study, we consider three areas: (i) prosodic structure (normalized f0, syllable rate, and sentence duration), (ii) phoneme acoustic space modeling and sub-word classification, and (iii) word-level modeling using large vocabulary data. The corpora used in this study include the NATO N-4 corpus (2 accents, 2 dialects of English), TIMIT (7 dialect regions), and American and British English versions of the WSJ corpus. These corpora were selected because they contain audio material from specific dialects/accents of English (N-4), are phonetically balanced and organized across US dialect regions (TIMIT), or contain significant amounts of read audio material from distinct dialects (WSJ). The results show that significant changes occur at the prosodic, phoneme-space, and word levels for dialect analysis, and that effective dialect classification can be achieved using processing strategies from each domain.
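
As a rough sketch of one of the prosodic measurements (the paper's exact normalization is not reproduced here), a normalized f0 contour can be derived from a crude autocorrelation pitch track:

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate for one voiced frame."""
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    return fs / (lo + int(np.argmax(ac[lo:hi])))

def normalized_f0(f0_track):
    """One plausible per-speaker normalization, so dialect comparisons
    are not dominated by individual pitch range."""
    f0 = np.asarray([f for f in f0_track if f > 0])  # voiced frames only
    return (f0 - f0.mean()) / (f0.std() + 1e-10)
```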

20 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: An extension of hidden Markov models (HMMs) using interval type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs) to produce interval type-2 FHMMs, which perform comparably to the HMM but are more robust to speech variation.

Abstract: This paper presents an extension of the hidden Markov models (HMMs) using interval type-2 fuzzy sets (FSs) and fuzzy logic systems (FLSs) to produce interval type-2 FHMMs. The advantage of this extension is that it can handle both randomness and fuzziness. The membership function (MF) of a type-2 FS is three-dimensional; it is this third dimension that provides the additional degrees of freedom to evaluate the HMM's uncertainties. An attractive property of this extension is that if all uncertainties disappear, the interval type-2 FHMM reduces to the classical HMM. We apply our interval type-2 FHMM as an acoustic model for phoneme recognition on the TIMIT speech database. Experimental results show that the type-2 FHMM performs comparably to the HMM but is more robust to speech variation, while retaining almost the same computational complexity.

17 citations


01 Jan 2004
TL;DR: A new multi-scale voice morphing algorithm that enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content.
Abstract: This paper presents a new multi-scale voice morphing algorithm. This algorithm enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content. The voice morphing algorithm performs the morphing at different subbands by using the theory of wavelets and models the spectral conversion using the theory of Radial Basis Function Neural Networks. The results obtained on the TIMIT speech database demonstrate effective transformation of the speaker identity.
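
A minimal sketch of the subband machinery, using a hand-rolled Haar DWT to stay self-contained (the paper does not specify this particular wavelet); the learned RBF spectral mapping is left abstract.

```python
import numpy as np

def haar_dwt(x):
    """One Haar DWT level: approximation (low band), detail (high band)."""
    x = x[:len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def subbands(x, levels=3):
    """Multi-scale decomposition; morphing would apply a trained
    RBF-network spectral conversion per subband before resynthesis."""
    bands = []
    for _ in range(levels):
        x, d = haar_dwt(x)
        bands.append(d)
    bands.append(x)                        # final approximation band
    return bands
```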

Proceedings ArticleDOI
04 Oct 2004
TL;DR: The in-set speakers are clustered into smaller subsets without merging speaker models and the Anti-Speaker or Background Model is adapted for each subset which minimizes the identification errors of the pseudo impostors during the training stage.
Abstract: In this paper, we propose an approach to address the problem of text-independent open-set speaker identification. The in-set speakers are clustered into smaller subsets without merging speaker models. The Anti-Speaker or Background Model is then adapted for each subset, which minimizes the identification errors of the pseudo impostors during the training stage. Score normalization is applied to align all the in-set speaker score distributions to share a single scale. Finally, confidence measure processing is used to identify in-set versus out-of-set speakers. Experiments with TIMIT and the CU-Accent corpora show average improvements in Equal Error Rate of 20.28 and 8.35 over the baseline performance, respectively. Finally, a probe experiment is also included that considers prosody for in-set speaker detection.
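
Score normalization of the kind described is commonly realized as z-norm over pseudo-impostor scores; a hedged sketch (the statistics and threshold are assumptions, tuned on development data in practice):

```python
import numpy as np

def znorm(score, impostor_scores):
    """Align a speaker's score distribution to a common scale using
    pseudo-impostor statistics collected during training."""
    mu, sigma = np.mean(impostor_scores), np.std(impostor_scores) + 1e-10
    return (score - mu) / sigma

def accept_in_set(score, impostor_scores, theta=0.0):
    """In-set vs. out-of-set decision on the normalized score; theta is
    an illustrative threshold."""
    return znorm(score, impostor_scores) > theta
```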

Proceedings ArticleDOI
17 May 2004
TL;DR: Maximum model distance (MMD) training is applied to speaker identification together with a new strategy for selecting competitive speakers; the results show that identification performance can be improved greatly when the training data is limited.

Abstract: In this paper we apply maximum model distance (MMD) training to speaker identification and propose a new strategy for selecting competitive speakers. The traditional ML method utilizes only the utterances for each speaker model, which can lead to a locally optimal solution. By maximizing the dissimilarities among similar speaker models, MMD adds discriminative capability to the training procedure and thereby improves identification performance. Based on the TIMIT corpus, we designed word and sentence experiments to evaluate the proposed training approach. The results show that identification performance can be improved greatly when the training data is limited.

Proceedings ArticleDOI
17 May 2004
TL;DR: An improved PSM is introduced, dynamic multi-region PSM, that allows a data-driven alignment between observations and the segment trajectory and outperforms HMM and traditional PSM in both phone classification and phone recognition tasks on the TIMIT corpus.
Abstract: One of the difficulties in using the polynomial segment model (PSM) to capture the temporal correlations within a phonetic segment is the lack of an efficient training algorithm comparable with the Baum-Welch algorithm in HMM. In our previous paper, we introduced a recursive likelihood computation algorithm for PSM recognition which can perform Viterbi-style training. In this paper, we extend the recursive likelihood computation into a fast forward-backward PSM training algorithm that maximizes PSM likelihood. In addition, we introduce an improved PSM, dynamic multi-region PSM, that allows a data-driven alignment between observations and the segment trajectory. The dynamic multi-region PSM model outperforms HMM and traditional PSM in both phone classification and phone recognition tasks on the TIMIT corpus.
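
The basic PSM computation, fitting a polynomial trajectory over normalized time and scoring the Gaussian residuals, can be sketched as follows (single region, uniform alignment; the dynamic multi-region model relaxes exactly that fixed alignment):

```python
import numpy as np

def psm_fit_loglik(Y, order=2):
    """Fit a degree-`order` trajectory to a (T, d) segment Y and return
    the coefficients plus a diagonal-Gaussian residual log-likelihood."""
    T = len(Y)
    Z = np.vander(np.linspace(0, 1, T), order + 1)   # [t^r] design matrix
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)        # trajectory coeffs
    resid = Y - Z @ B
    var = resid.var(axis=0) + 1e-10                  # residual variances
    ll = -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var))
    return B, ll
```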

Journal ArticleDOI
TL;DR: It is shown that combining the models during training not only improves performance but also simplifies the fusion process during recognition, particularly for highly constrained fusion schemes such as synchronous model combination.

Proceedings ArticleDOI
17 May 2004
TL;DR: Three significant enhancements to time-domain adaptive decorrelation filtering (ADF) are proposed for effective separation and recognition of simultaneous speech sources in reverberant room conditions and have significantly improved target-to-interference ratio and accuracy of phone recognition.
Abstract: Three significant enhancements to time-domain adaptive decorrelation filtering (ADF) are proposed for effective separation and recognition of simultaneous speech sources in reverberant room conditions. The methods include whitening filtering on cochannel speech prior to ADF to improve condition of adaptive estimation, a novel block-iterative implementation of ADF to speed up convergence rate, and an integration of multiple ADF outputs through optimal post-filtering. Experimental data were generated by convolving TIMIT speech with acoustic path impulse responses measured in a real acoustic environment, with a 2m microphone-source distance and an initial target-to-interference ratio of about 0 dB. The proposed methods are shown to have speeded up the convergence rate of ADF to a level feasible for online applications, and they have significantly improved target-to-interference ratio and accuracy of phone recognition.

Proceedings ArticleDOI
17 May 2004
TL;DR: To prove the concept, guided discriminative training is applied to derive an optimal linear transformation on the mel-filterbank log power spectra to improve TIMIT phoneme classification.
Abstract: In this paper, we investigate guided discriminative training in the context of improving multi-class classification problems. We are interested in applications that require improvement in the classification performance of only a subset of the classes at the possible expense of poorer classification performance of the remaining classes. However, should the classification of the remaining classes deteriorate, it is guaranteed not to be worse than the extent that the user specifies. The problem is formulated as a nonlinear programming problem, which can be translated into an unconstrained nonlinear optimization problem using the barrier method and, in turn, solved by the gradient descent method. To prove the concept, we apply guided discriminative training to derive an optimal linear transformation on the mel-filterbank log power spectra to improve TIMIT phoneme classification. Encouraging results are obtained.
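
A hedged sketch of the formulation described above: gradient descent on the target subset's loss, with a log-barrier keeping the remaining classes' loss within the user-specified budget. Function names and constants are illustrative, not from the paper.

```python
import numpy as np

def num_grad(f, w, h=1e-5):
    """Numerical gradient, to keep the sketch self-contained."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def guided_descent(w0, target_loss, others_loss, eps, mu=1.0,
                   lr=1e-3, steps=1000):
    """Minimize target_loss(w) subject to others_loss(w) <= eps via the
    barrier method; assumes w stays strictly feasible throughout."""
    w = w0.copy()
    for _ in range(steps):
        slack = eps - others_loss(w)          # must remain positive
        g = (num_grad(target_loss, w)
             + (mu / slack) * num_grad(others_loss, w))
        w -= lr * g
    return w
```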

Proceedings ArticleDOI
17 May 2004
TL;DR: A combined fixed/adaptive beamforming algorithm (CFA-BF) for speech enhancement with two single channel methods based on speech spectral constrained iterative processing (Auto-LSP), and an auditory masked threshold based method using equivalent rectangular bandwidth filtering (GMMSE-AMTERB).
Abstract: While a number of studies have investigated various speech enhancement and noise suppression schemes, most consider either a single channel or array processing framework. Clearly there are potential advantages in leveraging the strengths of array processing solutions in suppressing noise from a direction other than the speaker, together with those of single channel methods that include speech spectral constraints or psychoacoustically motivated processing. In this paper, we propose to integrate a combined fixed/adaptive beamforming algorithm (CFA-BF) for speech enhancement with two single channel methods based on speech spectral constrained iterative processing (Auto-LSP), and an auditory masked threshold based method using equivalent rectangular bandwidth filtering (GMMSE-AMTERB). After formulating the method, we evaluate performance on a subset of the TIMIT corpus with four real noise sources. We demonstrate a consistent level of noise suppression and voice communication quality improvement using the proposed method as reflected by an overall average 26 dB increase in SegSNR from the original degraded audio corpus.
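
SegSNR, the figure of merit quoted above, is a frame-averaged SNR clamped to a conventional range so that silent frames do not dominate; a standard sketch:

```python
import numpy as np

def seg_snr(clean, enhanced, fs, win=0.02, floor=(-10.0, 35.0)):
    """Mean per-frame SNR in dB between clean reference and output."""
    hop = int(win * fs)
    snrs = []
    for i in range(0, min(len(clean), len(enhanced)) - hop, hop):
        s = clean[i:i + hop]
        e = s - enhanced[i:i + hop]         # residual noise/distortion
        snr = 10 * np.log10((np.sum(s ** 2) + 1e-10)
                            / (np.sum(e ** 2) + 1e-10))
        snrs.append(np.clip(snr, *floor))
    return float(np.mean(snrs))
```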

Proceedings ArticleDOI
16 Mar 2004
TL;DR: A general method for a text-independent speaker identification system using the discrete wavelet transform (DWT) is presented; the method also achieves 100% successful text-dependent recognition for both speaker databases.

Abstract: A general method for a text-independent speaker identification system using the discrete wavelet transform (DWT) is presented. Identification is based on a predefined threshold in conjunction with a properly selected subset of mother wavelets used to extract the speaker's features. The classification process that produces the speaker codebooks uses the vector quantization (VQ) technique with an algorithm that minimizes the time needed to create the data vectors. The performance of the proposed system is demonstrated on two databases: 28 persons (males and females from 7 different regions) from the international TIMIT database down-sampled to 8 kHz, and a group of 5 local females recorded in a semi-quiet environment. TIMIT results for text-independent identification indicate 85.7% for speakers of the same region and 90.5% for speakers of different regions, whereas 88% and 100% identification are obtained for the local group. Moreover, the proposed method achieves 100% successful text-dependent recognition for both speaker databases.
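
A minimal sketch of the VQ codebook stage (plain k-means over a speaker's feature vectors; the codebook size and distortion-based decision rule are illustrative assumptions):

```python
import numpy as np

def train_codebook(feats, k=32, iters=20):
    """K-means VQ codebook over (n, d) feature vectors."""
    rng = np.random.default_rng(0)
    code = feats[rng.choice(len(feats), k, replace=False)].copy()
    for _ in range(iters):
        d = ((feats[:, None, :] - code[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if np.any(lab == j):
                code[j] = feats[lab == j].mean(0)
    return code

def avg_distortion(feats, code):
    """Identify the speaker whose codebook quantizes the test features
    with the smallest average distortion (thresholded for rejection)."""
    d = ((feats[:, None, :] - code[None]) ** 2).sum(-1)
    return float(d.min(1).mean())
```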

01 Jan 2004
TL;DR: It is observed that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI, and global MMI is found superior to the frame-based criterion for continuous recognition.

Abstract: This paper deals with speaker-independent continuous speech recognition. Our approach is based on continuous density hidden Markov models with a non-linear input feature transformation performed by a multilayer perceptron. We discuss various optimisation criteria and provide results on a TIMIT phoneme recognition task, using single-frame MMI embedded in Viterbi training, and a global MMI criterion. As expected, global MMI is found superior to the frame-based criterion for continuous recognition. We further observe that optimal sentence decoding is essential to achieve maximum recognition rate for models trained by global MMI. Finally, we find that the simple MLP input transformation, with five frames of context information, can increase the recognition rate significantly compared to just using delta parameters.
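
The "five frames of context" input to the MLP can be illustrated with a simple frame-stacking sketch (the edge-padding strategy is an assumption):

```python
import numpy as np

def stack_context(feats, left=2, right=2):
    """Stack +/-2 neighbouring frames (five in total) into one input
    vector per frame for the MLP transformation."""
    T = len(feats)
    pad = np.pad(feats, ((left, right), (0, 0)), mode='edge')
    return np.concatenate([pad[i:i + T] for i in range(left + right + 1)],
                          axis=1)
```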

Proceedings ArticleDOI
01 Jan 2004
TL;DR: It is shown that while classification accuracy using Mel frequency cepstral coefficients as features does not improve with sub-banding, the accuracy increases from 36.1% to 42.0% using sub-banded reconstructed phase spaces to model the phonemes.

Abstract: This paper examines the use of multi-band reconstructed phase spaces as models for phoneme classification. Sub-banding reconstructed phase spaces combines linear, frequency-based techniques with a nonlinear modeling approach to speech recognition. Experiments comparing the effects of filtering speech signals for both reconstructed phase space and traditional speech recognition approaches are presented. These experiments study the use of two non-overlapping subbands for isolated phoneme classification on the TIMIT corpus. It is shown that while classification accuracy using Mel frequency cepstral coefficients as features does not improve with sub-banding, the accuracy increases from 36.1% to 42.0% using sub-banded reconstructed phase spaces to model the phonemes.
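
A reconstructed phase space is a time-delay embedding of the (possibly band-filtered) signal; a minimal sketch, with illustrative dim/tau (in practice chosen via mutual-information and false-nearest-neighbour heuristics):

```python
import numpy as np

def reconstructed_phase_space(x, dim=3, tau=6):
    """Rows are delay vectors [x(n), x(n+tau), ..., x(n+(dim-1)*tau)].
    Sub-banding: band-filter x first, then embed each band separately."""
    N = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau:i * tau + N] for i in range(dim)], axis=1)
```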

Journal ArticleDOI
TL;DR: The results indicate that the supplementary features contain classification characteristics which can be useful in automatic speech recognition.
Abstract: Traditional speech recognition systems use mel‐frequency cepstral coefficients (MFCCs) as acoustic features. The present research aims to study the classification characteristics and the performance of some supplementary features (SFs), such as periodicity, zero crossing rate, log energy and the ratio of low frequency energy to total energy, in a phone recognition system built using the Hidden Markov Model Toolkit. To demonstrate the performance of the SFs, training is done on a subset of the TIMIT database (DR1 data set) on context independent phones using a single mixture. When only the SFs and their first derivatives (a feature set of dimension 8) are used, the recognition accuracy is found to be 42.96%, as compared to 54.65% when 12 MFCCs and their corresponding derivatives are used. The performance of the system improves to 56.49% when the SFs and their derivatives are used along with the MFCCs. A further improvement to 60.34% is observed when the last 4 MFCCs and their derivatives are replaced by the SFs and their derivatives, respectively. These results indicate that the supplementary features contain classification characteristics which can be useful in automatic speech recognition.
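
A hedged per-frame sketch of the four supplementary features; the paper's exact definitions (windowing, normalization, band edge) may differ.

```python
import numpy as np

def supplementary_features(frame, fs, low_cut=1000.0):
    """Log energy, zero crossing rate, low-to-total energy ratio, and a
    crude autocorrelation periodicity score for one frame."""
    log_e = np.log(np.sum(frame ** 2) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    low_ratio = spec[freqs < low_cut].sum() / (spec.sum() + 1e-10)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    periodicity = ac[int(fs / 400):int(fs / 60)].max() / (ac[0] + 1e-10)
    return np.array([log_e, zcr, low_ratio, periodicity])
```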

Proceedings ArticleDOI
02 Nov 2004
TL;DR: The maximum model distance (MMD) algorithm is applied to the Gaussian mixture model (GMM) training and shows that the equal error rate (EER) could be reduced greatly compared with the traditional ML method.
Abstract: This paper presents the design and implementation of text-independent speaker verification. We apply the maximum model distance (MMD) algorithm to the Gaussian mixture model (GMM) training. The traditional maximum likelihood (ML) method only utilizes the labeled utterances for each speaker model, which probably leads to a local optimization solution. By maximizing the model distance between the target and competing speakers, MMD could add the discriminative capability into the training procedure and then improve the verification performance. Based on the TIMIT corpus, we designed the verification experiments and the results show that the equal error rate (EER) could be reduced greatly compared with the traditional ML method.
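
The reported metric, equal error rate, is the point where the false-accept and false-reject rates cross as the decision threshold is swept; a simple sketch over score lists:

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """EER from arrays of genuine and impostor trial scores."""
    best_gap, best_eer = 1.0, 0.0
    for th in np.sort(np.concatenate([target_scores, impostor_scores])):
        far = np.mean(impostor_scores >= th)   # false acceptance rate
        frr = np.mean(target_scores < th)      # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```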

Proceedings ArticleDOI
20 Oct 2004
TL;DR: Experimental results indicate that the ARMA lattice model achieves an improved noise-resistant capability on vowel phonemes and fricative phonemes as compared to the conventional mel-frequency cepstral coefficient (MFCC) method.
Abstract: In this paper, the result of a study on phoneme feature extraction, under a noisy environment, using an auto-regressive moving average (ARMA) lattice model, is presented. The phoneme characteristics are modeled and expressed in the form of ARMA lattice reflection coefficients for classification. Experimental results, based on the TIMIT speech database and NoiseX-92 noise database, indicate that the ARMA lattice model achieves an improved noise-resistant capability on vowel phonemes and fricative phonemes as compared to those of the conventional mel-frequency cepstral coefficient (MFCC) method.
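
Reflection coefficients for the all-pole (AR) part of such a model follow from the Levinson-Durbin recursion, sketched below; the paper's ARMA lattice additionally carries a moving-average branch that this sketch omits.

```python
import numpy as np

def reflection_coeffs(frame, order=12):
    """AR lattice reflection coefficients via Levinson-Durbin."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1); a[0] = 1.0
    e = r[0] + 1e-10                         # prediction error power
    k = np.zeros(order)
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k[m - 1] = -acc / e
        a[1:m + 1] += k[m - 1] * a[m - 1::-1][:m]   # update AR polynomial
        e *= 1.0 - k[m - 1] ** 2
    return k
```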

Proceedings ArticleDOI
26 Aug 2004
TL;DR: This paper analyzes the roles of individual hidden states of the HMM and their associated posterior probabilities that reflect the nature of the components in the observation sequence, and proposes to make a full use of the state-level information.
Abstract: In HMM-based pattern recognition, the structure of the HMM is predetermined according to some prior knowledge. In the recognition process, we usually make our judgment based on the maximum likelihood of the HMM, which unfortunately may lead to incorrect results. In this paper, we analyze the roles of the individual hidden states of the HMM and their associated posterior probabilities, which reflect the nature of the components in the observation sequence and should be taken into consideration. To this end, we propose to make full use of the state-level information, e.g., the distribution of the intersection number of state posterior probability trajectories, in the recognition process. We apply the proposed methods to phoneme classification on the TIMIT speech corpus and show that we are able to achieve about a 2% improvement in recognition rate over the classical HMM.
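
A sketch of the state-level quantities involved: forward-backward posteriors, plus one hedged reading of the "intersection number" of their trajectories (counting crossings between pairs of posterior curves).

```python
import numpy as np

def state_posteriors(A, B, pi):
    """gamma[t, j] = P(state j at frame t | observations), with per-frame
    scaling; B[t, j] is the emission likelihood of frame t in state j."""
    T, N = B.shape
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    alpha[0] = pi * B[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

def intersection_count(g):
    """Count crossings between every pair of posterior trajectories."""
    d = g[:, :, None] - g[:, None, :]        # pairwise differences
    return int((np.diff(np.sign(d), axis=0) != 0).sum() // 2)
```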

Journal ArticleDOI
TL;DR: The performance of MRAN is compared with other well-known RBF and Elliptical Basis Function (EBF) based speaker verification methods in terms of error rates and computational complexity on a series of speaker verification experiments.
Abstract: This paper presents a text-independent speaker verification system based on an online Radial Basis Function (RBF) network referred to as Minimal Resource Allocation Network (MRAN). MRAN is a sequential learning RBF, in which hidden neurons are added or removed as training progresses. LP-derived cepstral coefficients are used as feature vectors during training and verification phases. The performance of MRAN is compared with other well-known RBF and Elliptical Basis Function (EBF) based speaker verification methods in terms of error rates and computational complexity on a series of speaker verification experiments. The experiments use data from 258 speakers from the phonetically balanced continuous speech corpus TIMIT. The results show that MRAN produces comparable error rates to other methods with much less computational complexity.

Proceedings Article
01 Jan 2004
TL;DR: An algorithm called “DOLS” (Dynamic Orthogonal Least Square) is developed to solve this type of problem and is presented in this paper.
Abstract: Introduction: A successful speech recognition system has to determine not only features present in the input pattern at one point in time, but also features of the input pattern that change over time (e.g., Berthold, 1994; Benyettou, 1995). In network design, great importance must be attributed to the correct choice of the number of hidden neurons, which helps avoid overfitting and reduces the time required for training without significantly affecting network performance (e.g., Colla, Reyneri & Sgarbi, 1999); however, such designs never let the architecture adapt to the input. Combining the RBF approach with the shift-invariance features of the TDNN yields a new robust model, the temporal radial basis function network, “TRBF” (e.g., Mesbahi & Benyettou, 2003). To be more efficient, we adapt these networks so that they become more dynamic according to their behaviour and the features of the object under study; this is particularly relevant for continuous speech. To obtain an adaptive TRBF, it was therefore necessary to develop an algorithm that solves this type of problem. This algorithm, called “DOLS” (Dynamic Orthogonal Least Square), is presented in this paper.

Proceedings ArticleDOI
04 Oct 2004
TL;DR: The results indicate that by performing feature extraction at the client end, the bitrate can be reduced significantly to 13.6 kbps with 96% recognition performance.

Abstract: Speech recognition systems are gaining increasing importance with the widespread use of mobile and portable devices and other interactive voice response systems. Because of the resource constraints on such devices and the requirements of specific applications, the need to perform speech recognition over a data network becomes inevitable. The requirements of such a system, with a human at one end and a machine at the other, are clearly asymmetric. The major focus of this work is to enable speaker recognition for information access over the network. Assuming that at the client end the device is either a Personal Digital Assistant (PDA) or a cellphone, an attempt is made to perform part of the computation at the client end, thus conserving bandwidth. Experiments have been performed on both TIMIT data and TIMIT data passed through a speech codec. The results indicate that by performing feature extraction at the client end, the bitrate can be reduced significantly to 13.6 kbps with 96% recognition performance.