
Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2009"


Journal ArticleDOI
TL;DR: An empirical feature analysis for audio environment characterization is performed, and the matching pursuit algorithm is used to obtain effective time-frequency features that yield higher recognition accuracy for environmental sounds.
Abstract: The paper considers the task of recognizing environmental sounds for the understanding of a scene or context surrounding an audio sensor. A variety of features have been proposed for audio recognition, including the popular Mel-frequency cepstral coefficients (MFCCs) which describe the audio spectral shape. Environmental sounds, such as chirpings of insects and sounds of rain which are typically noise-like with a broad flat spectrum, may include strong temporal domain signatures. However, only a few temporal-domain features have previously been developed to characterize such diverse audio signals. Here, we perform an empirical feature analysis for audio environment characterization and propose to use the matching pursuit (MP) algorithm to obtain effective time-frequency features. The MP-based method utilizes a dictionary of atoms for feature selection, resulting in a flexible, intuitive, and physically interpretable set of features. The MP-based features are adopted to supplement the MFCC features to yield higher recognition accuracy for environmental sounds. Extensive experiments are conducted to demonstrate the effectiveness of these joint features for unstructured environmental sound classification, including listening tests to study human recognition capabilities. Our recognition system is shown to produce performance comparable to that of human listeners.
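To make the MP-feature idea concrete, the following is a minimal numpy sketch of matching pursuit over a small Gabor dictionary; the dictionary grid and the summary statistics used as features are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def gabor_atom(n, center, scale, freq):
    """Unit-norm Gabor atom: Gaussian envelope times a cosine carrier."""
    t = np.arange(n)
    g = np.exp(-0.5 * ((t - center) / scale) ** 2) * np.cos(2 * np.pi * freq * t)
    return g / (np.linalg.norm(g) + 1e-12)

def matching_pursuit_features(x, n_atoms=5):
    """Greedy MP: repeatedly pick the dictionary atom best correlated with
    the residual, subtract its contribution, and keep its parameters."""
    n = len(x)
    # Illustrative dictionary: a coarse grid of centers, scales, and normalized frequencies.
    dictionary = [(c, s, f)
                  for c in range(0, n, n // 8)
                  for s in (n / 32, n / 8)
                  for f in (0.01, 0.05, 0.1, 0.2, 0.4)]
    residual = x.astype(float).copy()
    picked = []
    for _ in range(n_atoms):
        best = max(dictionary,
                   key=lambda p: abs(np.dot(residual, gabor_atom(n, *p))))
        atom = gabor_atom(n, *best)
        residual -= np.dot(residual, atom) * atom
        picked.append(best)
    scales = np.array([p[1] for p in picked])
    freqs = np.array([p[2] for p in picked])
    # Simple MP-based descriptors: mean/std of the selected atoms' frequency and scale.
    return np.array([freqs.mean(), freqs.std(), scales.mean(), scales.std()])

# Toy usage: MP descriptors for a noisy tonal frame.
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 0.05 * np.arange(256)) + 0.3 * rng.standard_normal(256)
print(matching_pursuit_features(frame))
```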

626 citations


Journal ArticleDOI
TL;DR: A new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) is proposed, whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms.
Abstract: In this paper, we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here, we investigate six major aspects of speaker adaptation: initial models; the amount of training data for the initial models; the transform functions, estimation criteria, and sensitivity of several linear regression adaptation algorithms; and combination algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of both gender-dependent models with the single use of one gender-dependent model. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and examine methods that combine MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.

373 citations


Journal ArticleDOI
TL;DR: It is shown that relative transfer functions (RTFs), which relate the desired speech sources and the microphones, and a basis for the interference subspace suffice for constructing the beamformer, and that the application of the adaptive ANC contributes to interference reduction, but only when the constraint sets are not completely satisfied.
Abstract: In many practical environments we wish to extract several desired speech signals, which are contaminated by nonstationary and stationary interfering signals. The desired signals may also be subject to distortion imposed by the acoustic room impulse responses (RIRs). In this paper, a linearly constrained minimum variance (LCMV) beamformer is designed for extracting the desired signals from multimicrophone measurements. The beamformer satisfies two sets of linear constraints. One set is dedicated to maintaining the desired signals, while the other set is chosen to mitigate both the stationary and nonstationary interferences. Unlike classical beamformers, which approximate the RIRs as delay-only filters, we take into account the entire RIR [or its respective acoustic transfer function (ATF)]. The LCMV beamformer is then reformulated in a generalized sidelobe canceler (GSC) structure, consisting of a fixed beamformer (FBF), blocking matrix (BM), and adaptive noise canceler (ANC). It is shown that for a spatially white noise field, the beamformer reduces to an FBF, satisfying the constraint sets, without power minimization. It is shown that the application of the adaptive ANC contributes to interference reduction, but only when the constraint sets are not completely satisfied. We show that relative transfer functions (RTFs), which relate the desired speech sources and the microphones, and a basis for the interference subspace suffice for constructing the beamformer. The RTFs are estimated by applying the generalized eigenvalue decomposition (GEVD) procedure to the power spectral density (PSD) matrices of the received signals and the stationary noise. A basis for the interference subspace is estimated by collecting eigenvectors, calculated in segments where nonstationary interfering sources are active and the desired sources are inactive. The rank of the basis is then reduced by the application of the orthogonal triangular decomposition (QRD). This procedure relaxes the common requirement for nonoverlapping activity periods of the interference sources. A comprehensive experimental study in both simulated and real environments demonstrates the performance of the proposed beamformer.
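As an illustration of the constrained part of such a design, here is a minimal numpy sketch of narrowband LCMV weights built from one desired-source constraint and one interference-subspace constraint. The closed form w = R^{-1}C(C^H R^{-1}C)^{-1}g is standard; the microphone count, covariance, and constraint vectors below are synthetic assumptions, not the paper's setup.

```python
import numpy as np

def lcmv_weights(R, C, g):
    """Narrowband LCMV: minimize w^H R w subject to C^H w = g.
    Closed form: w = R^{-1} C (C^H R^{-1} C)^{-1} g."""
    Ri_C = np.linalg.solve(R, C)
    return Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, g)

# Toy example: 4 microphones, one desired-source RTF column and one
# interference-subspace column (both synthetic, for illustration only).
rng = np.random.default_rng(1)
M = 4
rtf_desired = rng.standard_normal(M) + 1j * rng.standard_normal(M)
rtf_interf  = rng.standard_normal(M) + 1j * rng.standard_normal(M)
C = np.column_stack([rtf_desired, rtf_interf])
g = np.array([1.0, 0.0])            # keep the desired source, null the interference
R = np.eye(M) + 0.5 * np.outer(rtf_interf, rtf_interf.conj())  # noise + interference PSD
w = lcmv_weights(R, C, g)
print(np.abs(C.conj().T @ w))       # ~[1, 0]: both constraints are satisfied
```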

285 citations


Journal ArticleDOI
TL;DR: An analysis of the statistics derived from the pitch contour indicates that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape.
Abstract: During expressive speech, the voice is enriched to convey not only the intended semantic message but also the emotional state of the speaker. The pitch contour is one of the important properties of speech that is affected by this emotional modulation. Although pitch features have been commonly used to recognize emotions, it is not clear what aspects of the pitch contour are the most emotionally salient. This paper presents an analysis of the statistics derived from the pitch contour. First, pitch features derived from emotional speech samples are compared with the ones derived from neutral speech, by using the symmetric Kullback-Leibler distance. Then, the emotionally discriminative power of the pitch features is quantified by comparing nested logistic regression models. The results indicate that gross pitch contour statistics such as mean, maximum, minimum, and range are more emotionally prominent than features describing the pitch shape. Also, analyzing the pitch statistics at the utterance level is found to be more accurate and robust than analyzing the pitch statistics for shorter speech regions (e.g., voiced segments). Finally, the best features are selected to build a binary emotion detection system for distinguishing between emotional versus neutral speech. A new two-step approach is proposed. In the first step, reference models for the pitch features are trained with neutral speech, and the input features are contrasted with the neutral model. In the second step, a fitness measure is used to assess whether the input speech is similar to the reference models (in the case of neutral speech) or different from them (in the case of emotional speech). The proposed approach is tested with four acted emotional databases spanning different emotional categories, recording settings, speakers, and languages. The results show that the recognition accuracy of the system is over 77% just with the pitch features (baseline 50%). When compared to conventional classification schemes, the proposed approach performs better in terms of both accuracy and robustness.
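To illustrate the comparison step, here is a small numpy sketch that computes utterance-level pitch statistics and the symmetric Kullback-Leibler distance between Gaussian fits of a single pitch statistic for neutral versus emotional speech; the Gaussian assumption and the toy F0 values are illustrative, not the paper's data.

```python
import numpy as np

def symmetric_kl_gauss(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two univariate Gaussians:
    0.5 * (KL(p||q) + KL(q||p))."""
    kl_pq = 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl_qp = 0.5 * (np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1.0)
    return 0.5 * (kl_pq + kl_qp)

def utterance_pitch_stats(f0):
    """Gross pitch-contour statistics over the voiced frames of an utterance."""
    f0 = f0[f0 > 0]                              # keep voiced frames only
    return {"mean": f0.mean(), "max": f0.max(),
            "min": f0.min(), "range": f0.max() - f0.min()}

# Toy comparison of the utterance-level F0-mean feature, neutral vs. emotional.
rng = np.random.default_rng(2)
neutral_means   = rng.normal(120, 10, 200)       # synthetic utterance-level F0 means (Hz)
emotional_means = rng.normal(180, 30, 200)
d = symmetric_kl_gauss(neutral_means.mean(), neutral_means.var(),
                       emotional_means.mean(), emotional_means.var())
print(f"symmetric KL (F0 mean, neutral vs. emotional): {d:.2f}")
```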

276 citations


Journal ArticleDOI
TL;DR: This paper presents a systematic study on dominance modeling in group meetings from fully automatic nonverbal activity cues, in a multi-camera, multi-microphone setting, and investigates efficient audio and visual activity cues for the characterization of dominant behavior, analyzing single and joint modalities.
Abstract: Dominance - a behavioral expression of power - is a fundamental mechanism of social interaction, expressed and perceived in conversations through spoken words and audiovisual nonverbal cues. The automatic modeling of dominance patterns from sensor data represents a relevant problem in social computing. In this paper, we present a systematic study on dominance modeling in group meetings from fully automatic nonverbal activity cues, in a multi-camera, multi-microphone setting. We investigate efficient audio and visual activity cues for the characterization of dominant behavior, analyzing single and joint modalities. Unsupervised and supervised approaches for dominance modeling are also investigated. Activity cues and models are objectively evaluated on a set of dominance-related classification tasks, derived from an analysis of the variability of human judgment of perceived dominance in group discussions. Our investigation highlights the power of relatively simple yet efficient approaches and the challenges of audiovisual integration. This constitutes the most detailed study on automatic dominance modeling in meetings to date.

227 citations


Journal ArticleDOI
TL;DR: A speaker-adaptive HMM-based speech synthesis system is described that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems.
Abstract: This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called "HTS-2007," employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.

203 citations


Journal ArticleDOI
TL;DR: The accuracy of fundamental frequency estimation by the proposed method is comparable to or even better than that of many existing methods, and the method is also robust against rapid variation of the pitch period or vocal-tract changes.
Abstract: Exploiting the impulse-like nature of excitation in the sequence of glottal cycles, a method is proposed to derive the instantaneous fundamental frequency from speech signals. The method involves passing the speech signal through two ideal resonators located at zero frequency. A filtered signal is derived from the output of the resonators by subtracting the local mean computed over an interval corresponding to the average pitch period. The positive zero crossings in the filtered signal correspond to the locations of the strong impulses in each glottal cycle. Then the instantaneous fundamental frequency is obtained by taking the reciprocal of the interval between successive positive zero crossings. Due to filtering by the zero-frequency resonators, the effects of noise and vocal-tract variations are practically eliminated. For the same reason, the method is also robust to degradation in speech due to additive noise. The accuracy of the fundamental frequency estimation by the proposed method is comparable to or even better than that of many existing methods. Moreover, the proposed method is also robust against rapid variation of the pitch period or vocal-tract changes. The method works well even when the glottal cycles are not periodic or when the speech signals are not correlated in successive glottal cycles.
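A minimal numpy sketch of this zero-frequency-filtering idea follows: a cascade of two zero-frequency resonators, trend removal by repeated local-mean subtraction over roughly the average pitch period, and positive zero crossings as epoch locations. The window length, number of trend-removal passes, and the impulse-train test signal are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def zero_freq_resonator(x):
    """Ideal digital resonator at 0 Hz: y[n] = 2*y[n-1] - y[n-2] + x[n]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (2.0 * y[n - 1] if n >= 1 else 0.0) - (y[n - 2] if n >= 2 else 0.0)
    return y

def instantaneous_f0(x, fs, avg_pitch_period=0.008):
    x = np.asarray(x, dtype=float)
    # Cascade of two zero-frequency resonators applied to the differenced signal.
    y = zero_freq_resonator(zero_freq_resonator(np.diff(x, prepend=x[0])))
    # Remove the slowly varying trend by repeatedly subtracting a local mean
    # computed over roughly the average pitch period (a few passes).
    win = max(3, int(avg_pitch_period * fs) | 1)
    kernel = np.ones(win) / win
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")
    z = y[2 * win: -2 * win]                     # discard edge-distorted samples
    # Positive-going zero crossings mark the strong excitation in each glottal cycle.
    epochs = np.where((z[:-1] < 0) & (z[1:] >= 0))[0]
    return fs / np.diff(epochs)                  # instantaneous F0 per cycle (Hz)

# Toy usage: an impulse-train-like "voiced" signal at 125 Hz, sampled at 8 kHz.
fs = 8000
x = np.zeros(fs)
x[::fs // 125] = 1.0
print(np.median(instantaneous_f0(x, fs)))        # ~125 Hz
```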

201 citations


Journal ArticleDOI
TL;DR: A room impulse response is assumed to consist of three parts: a direct-path response, early reflections, and late reverberations, the last of which are known to be a major cause of ASR performance degradation.
Abstract: A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades automatic speech recognition (ASR) performance. One way to solve this problem is to dereverberate the observed signal prior to ASR. In this paper, a room impulse response is assumed to consist of three parts: a direct-path response, early reflections and late reverberations. Since late reverberations are known to be a major cause of ASR performance degradation, this paper focuses on dealing with the effect of late reverberations. The proposed method first estimates the late reverberations using long-term multi-step linear prediction, and then reduces the late reverberation effect by employing spectral subtraction. The algorithm provided good dereverberation with training data corresponding to the duration of one speech utterance, in our case, less than 6 s. This paper describes the proposed framework for both single-channel and multichannel scenarios. Experimental results showed substantial improvements in ASR performance with real recordings under severe reverberant conditions.
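The following is a simplified numpy sketch of the two stages described above: a late-reverberation estimate from delayed (multi-step) long-term linear prediction, followed by power-spectrum subtraction in the STFT domain. The prediction order, delay, frame settings, and the synthetic room impulse response are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def delayed_lp_coeffs(x, order=200, delay=400):
    """Long-term multi-step linear prediction: predict x[n] from the delayed
    samples x[n-delay], ..., x[n-delay-order+1] (least-squares solution)."""
    rows = [x[n - delay - order + 1: n - delay + 1][::-1]
            for n in range(delay + order, len(x))]
    A = np.asarray(rows)
    b = x[delay + order:]
    return np.linalg.lstsq(A, b, rcond=None)[0]

def dereverb_spectral_subtraction(x, fs, order=200, delay=400,
                                  frame=512, hop=256, floor=0.01):
    x = np.asarray(x, dtype=float)
    # 1) Estimate the late-reverberation component as the multi-step prediction.
    c = delayed_lp_coeffs(x, order, delay)
    late = np.zeros_like(x)
    for n in range(delay + order, len(x)):
        late[n] = np.dot(c, x[n - delay - order + 1: n - delay + 1][::-1])
    # 2) Subtract its power spectrum from the observed power spectrum, frame by frame.
    win = np.hanning(frame)
    out = np.zeros_like(x)
    for start in range(0, len(x) - frame, hop):
        X = np.fft.rfft(win * x[start:start + frame])
        L = np.fft.rfft(win * late[start:start + frame])
        power = np.maximum(np.abs(X) ** 2 - np.abs(L) ** 2,
                           floor * np.abs(X) ** 2)      # spectral floor
        Y = np.sqrt(power) * np.exp(1j * np.angle(X))   # keep the observed phase
        out[start:start + frame] += win * np.fft.irfft(Y, frame)
    return out

# Toy usage: a noise burst convolved with a decaying random impulse response.
rng = np.random.default_rng(10)
clean = rng.standard_normal(8000)
rir = np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000); rir[0] = 1.0
reverberant = np.convolve(clean, rir)[:len(clean)]
print(dereverb_spectral_subtraction(reverberant, fs=8000).shape)
```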

186 citations


Journal ArticleDOI
TL;DR: The performance of the combined method for VOP detection is improved by 2.13% compared to the best performing individual VOP detection method.
Abstract: The vowel onset point (VOP) is the instant at which the onset of a vowel takes place during speech production. Significant changes occur in the energies of the excitation source, spectral peaks, and modulation spectrum at the VOP. This paper demonstrates the independent use of each of these three energies in detecting VOPs. Since each of these energies represents a different aspect of speech production, it may be possible that they contain complementary information about the VOP. The individual evidences are therefore combined for detecting the VOPs. The error rates, measured as the ratio of missing and spurious detections to the total number of VOPs and evaluated on sentences taken from the TIMIT database, are 6.92%, 8.8%, 6.13%, and 4.0% for source, spectral peaks, modulation spectrum, and combined information, respectively. The performance of the combined method for VOP detection is improved by 2.13% compared to the best performing individual VOP detection method.

145 citations


Journal ArticleDOI
TL;DR: The usefulness of six nonlinear chaotic measures based on nonlinear dynamics theory is studied for discriminating between two levels of voice quality: healthy and pathological.
Abstract: In this paper, we propose to quantify the quality of the recorded voice through objective nonlinear measures. Quantification of speech signal quality has been traditionally carried out with linear techniques since the classical model of voice production is a linear approximation. Nevertheless, nonlinear behaviors in the voice production process have been shown. This paper studies the usefulness of six nonlinear chaotic measures based on nonlinear dynamics theory in the discrimination between two levels of voice quality: healthy and pathological. The studied measures are first- and second-order Renyi entropies, the correlation entropy and the correlation dimension. These measures were obtained from the speech signal in the phase-space domain. The values of the first minimum of mutual information function and Shannon entropy were also studied. Two databases were used to assess the usefulness of the measures: a multiquality database composed of four levels of voice quality (healthy voice and three levels of pathological voice); and a commercial database (MEEI Voice Disorders) composed of two levels of voice quality (healthy and pathological voices). A classifier based on standard neural networks was implemented in order to evaluate the measures proposed. Global success rates of 82.47% (multiquality database) and 99.69% (commercial database) were obtained.

134 citations


Journal ArticleDOI
TL;DR: This paper presents reduced-bandwidth MWF-based noise reduction algorithms, where a filtered combination of the contralateral microphone signals is transmitted, and shows that the best performance of the reduced-bandwidth algorithms is obtained by the DB-MWF procedure, whose performance closely approaches the optimal performance of the binaural MWF.
Abstract: In a binaural hearing aid system, output signals need to be generated for the left and the right ear. Using the binaural multichannel Wiener filter (MWF), which exploits all microphone signals from both hearing aids, a significant reduction of background noise can be achieved. However, due to power and bandwidth limitations of the binaural link, it is typically not possible to transmit all microphone signals between the hearing aids. To limit the amount of transmitted information, this paper presents reduced-bandwidth MWF-based noise reduction algorithms, where a filtered combination of the contralateral microphone signals is transmitted. A first scheme uses a signal-independent beamformer, whereas a second scheme uses the output of a monaural MWF on the contralateral microphone signals, and a third scheme involves an iterative distributed MWF (DB-MWF) procedure. It is shown that in the case of a rank-1 speech correlation matrix, corresponding to a single speech source, the DB-MWF procedure converges to the binaural MWF solution. Experimental results compare the noise reduction performance of the reduced-bandwidth algorithms with respect to the benchmark binaural MWF. It is shown that the best performance of the reduced-bandwidth algorithms is obtained by the DB-MWF procedure and that the performance of the DB-MWF procedure closely approaches the optimal performance of the binaural MWF.
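For reference, here is a minimal narrowband numpy sketch of the multichannel Wiener filter that serves as the benchmark in such systems, computed from noisy-speech and noise-only correlation matrices; the microphone count, rank-1 speech model, and covariances are synthetic assumptions for illustration.

```python
import numpy as np

def mwf_weights(Ryy, Rvv, ref=0, mu=1.0):
    """Multichannel Wiener filter for a chosen reference microphone:
    w = (Rxx + mu * Rvv)^{-1} Rxx e_ref, with Rxx = Ryy - Rvv estimated
    from noisy-speech and noise-only segments (mu = 1 gives the standard MWF)."""
    Rxx = Ryy - Rvv
    e = np.zeros(Ryy.shape[0]); e[ref] = 1.0
    return np.linalg.solve(Rxx + mu * Rvv, Rxx @ e)

# Toy narrowband example: 4 mics (2 per hearing aid), rank-1 speech + white noise.
rng = np.random.default_rng(3)
M = 4
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # speech steering/RTF vector
Rvv = np.eye(M, dtype=complex)                              # noise correlation matrix
Ryy = 2.0 * np.outer(a, a.conj()) + Rvv                     # noisy-speech correlation
w_left  = mwf_weights(Ryy, Rvv, ref=0)   # output referenced to a left-ear microphone
w_right = mwf_weights(Ryy, Rvv, ref=2)   # output referenced to a right-ear microphone
print(abs(w_left.conj() @ a), abs(w_right.conj() @ a))
```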

Journal ArticleDOI
C. Joder1, Slim Essid1, Gael Richard1
TL;DR: A number of methods for early and late temporal integration are proposed, together with an in-depth experimental study of their benefit for the task of musical instrument recognition on solo musical phrases.
Abstract: Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe the musical audio content. There is in fact a growing interest in music information retrieval (MIR) applications, amongst which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that purpose, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study on their interest for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed both for fixed and variable frame length analysis. Also, a number of proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system.

Journal ArticleDOI
TL;DR: The concept of a theoretical continuous loudspeaker on a circle is used to derive the discrete loudspeaker aperture functions without matrix inversion, revealing the underlying structure of the solution as a function of the desired soundfield, the loudspeaker positions, and the frequency.
Abstract: Reproduction of a soundfield is a fundamental problem in acoustic signal processing. A common approach is to use an array of loudspeakers to reproduce the desired field, where the least-squares method is used to calculate the loudspeaker weights. However, the least-squares method involves matrix inversion, which may lead to errors if the matrix is poorly conditioned. In this paper, we use the concept of a theoretical continuous loudspeaker on a circle to derive the discrete loudspeaker aperture functions while avoiding matrix inversion. In addition, the aperture function obtained through the continuous loudspeaker method reveals the underlying structure of the solution as a function of the desired soundfield, the loudspeaker positions, and the frequency. This concept can also be applied to 3-D soundfield reproduction using spherical harmonics analysis with a spherical array. Results are verified through computer simulations.

Journal ArticleDOI
TL;DR: It is theoretically and experimentally pointed out that ICA is proficient in noise estimation under a non-point-source noise condition rather than in speech estimation, and a new blind spatial subtraction array (BSSA) is proposed that utilizes ICA as a noise estimator.
Abstract: We propose a new blind spatial subtraction array (BSSA) consisting of a noise estimator based on independent component analysis (ICA) for efficient speech enhancement. In this paper, first, we theoretically and experimentally point out that ICA is proficient in noise estimation under a non-point-source noise condition rather than in speech estimation. Therefore, we propose BSSA, which utilizes ICA as a noise estimator. In BSSA, speech extraction is achieved by subtracting the power spectrum of the noise signals estimated using ICA from the power spectrum of the partly enhanced target speech signal obtained with a delay-and-sum beamformer. This "power-spectrum-domain subtraction" procedure enables better noise reduction than conventional ICA, with robustness to estimation errors. Another benefit of the BSSA architecture is "permutation robustness": although the ICA part in BSSA suffers from the source permutation problem, the BSSA architecture can reduce its negative effect when permutation arises. The results of various speech enhancement tests reveal that the noise reduction and speech recognition performance of the proposed BSSA are superior to those of conventional methods.
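The subtraction stage can be sketched compactly; below is a hedged numpy illustration of power-spectrum-domain subtraction given a delay-and-sum beamformer output and an ICA-based noise estimate in the STFT domain. The subtraction factor, flooring, and synthetic spectra are assumptions, not the paper's parameters.

```python
import numpy as np

def bssa_spectral_subtraction(ds_spec, ica_noise_spec, beta=1.0, floor=0.01):
    """Power-spectrum-domain subtraction: remove the ICA-based noise power
    estimate from the delay-and-sum beamformer output power, keeping the
    beamformer phase."""
    ds_power = np.abs(ds_spec) ** 2
    noise_power = np.abs(ica_noise_spec) ** 2
    enhanced_power = np.maximum(ds_power - beta * noise_power,
                                floor * ds_power)       # flooring avoids negative power
    return np.sqrt(enhanced_power) * np.exp(1j * np.angle(ds_spec))

# Toy usage with synthetic STFT frames (frequency bins x frames).
rng = np.random.default_rng(4)
ds = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
noise = 0.5 * (rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100)))
print(bssa_spectral_subtraction(ds, noise).shape)
```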

Journal ArticleDOI
TL;DR: Issues related to speaker diarization using this information theoretic framework are discussed, such as the criteria for inferring the number of speakers, the tradeoff between quality and compression achieved by the diarization system, and the algorithms for optimizing the objective function.
Abstract: A speaker diarization system based on an information theoretic framework is described. The problem is formulated according to the information bottleneck (IB) principle. Unlike other approaches where the distance between speaker segments is arbitrarily introduced, the IB method seeks the partition that maximizes the mutual information between observations and variables relevant for the problem while minimizing the distortion between observations. This solves the problem of choosing the distance between speech segments, which becomes the Jensen-Shannon divergence as it arises from the IB objective function optimization. We discuss issues related to speaker diarization using this information theoretic framework, such as the criteria for inferring the number of speakers, the tradeoff between quality and compression achieved by the diarization system, and the algorithms for optimizing the objective function. Furthermore, we benchmark the proposed system against a state-of-the-art system on the NIST RT06 (rich transcription) data set for speaker diarization of meetings. The IB-based system achieves a diarization error rate of 23.2% compared to 23.6% for the baseline system. Since this approach is mainly based on nonparametric clustering, it runs significantly faster than the baseline HMM/GMM-based system, resulting in faster-than-real-time diarization.
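The Jensen-Shannon divergence that arises as the segment distance is easy to state in code; here is a minimal numpy sketch of it as a merge cost between two segment-level distributions over relevance variables (the Dirichlet toy data and equal merge weights are assumptions for illustration).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def jensen_shannon(p, q, w_p=0.5, w_q=0.5):
    """Jensen-Shannon divergence between two distributions over the relevance
    variables; in agglomerative IB-style clustering this plays the role of
    the cost of merging the two segments."""
    m = w_p * p + w_q * q
    return w_p * kl(p, m) + w_q * kl(q, m)

# Toy usage: distributions of two speech segments over 8 relevance components.
rng = np.random.default_rng(5)
seg_a = rng.dirichlet(np.ones(8))
seg_b = rng.dirichlet(np.ones(8))
print(jensen_shannon(seg_a, seg_b))
```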

Journal ArticleDOI
TL;DR: The results suggest that high-order models can be essential in morph-based speech recognition, even when lattices are generated for two-pass recognition.
Abstract: Speech recognition systems trained for morphologically rich languages face the problem of vocabulary growth caused by prefixes, suffixes, inflections, and compound words. Solutions proposed in the literature include increasing the size of the vocabulary and segmenting words into morphs. However, in many cases, the methods have only been tested with low-order n-gram models or compared to word-based models that do not have very large vocabularies. In this paper, we study the importance of using high-order variable-length n-gram models when the language models are trained over morphs instead of whole words. Language models trained on a very large vocabulary are compared with models based on different morph segmentations. Speech recognition experiments are carried out on two highly inflecting and agglutinative languages, Finnish and Estonian. The results suggest that high-order models can be essential in morph-based speech recognition, even when lattices are generated for two-pass recognition. The analysis of recognition errors reveals that the high-order morph language models especially improve the recognition of previously unseen words.

Journal ArticleDOI
TL;DR: An unbiased RTF estimator is developed that exploits the nonstationarity and presence probability of the speech signal, and an analytic expression for the estimator variance is derived.
Abstract: In this paper, we present a relative transfer function (RTF) identification method for speech sources in reverberant environments. The proposed method is based on the convolutive transfer function (CTF) approximation, which makes it possible to represent a linear convolution in the time domain as a linear convolution in the short-time Fourier transform (STFT) domain. Unlike the restrictive and commonly used multiplicative transfer function (MTF) approximation, which becomes more accurate when the length of a time frame increases relative to the length of the impulse response, the CTF approximation enables representation of long impulse responses using short time frames. We develop an unbiased RTF estimator that exploits the nonstationarity and presence probability of the speech signal and derive an analytic expression for the estimator variance. Experimental results show that the proposed method is advantageous compared to common RTF identification methods in various acoustic environments, especially when identifying long RTFs typical of real rooms.
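To make the CTF approximation concrete, here is a small numpy sketch that applies a per-frequency-bin convolution across STFT frames; cross-band terms are neglected, and the dimensions and random filter taps are illustrative assumptions.

```python
import numpy as np

def ctf_filter(X, H):
    """Convolutive transfer function (CTF) model in the STFT domain:
    Y[k, t] = sum_l H[k, l] * X[k, t - l], i.e., a convolution across frames
    carried out independently in each frequency bin."""
    n_bins, n_frames = X.shape
    Y = np.zeros_like(X)
    for l in range(H.shape[1]):
        Y[:, l:] += H[:, l:l + 1] * X[:, :n_frames - l]
    return Y

# Toy usage: 257 bins, 50 frames, a 4-tap CTF per bin.
rng = np.random.default_rng(6)
X = rng.standard_normal((257, 50)) + 1j * rng.standard_normal((257, 50))
H = 0.3 * (rng.standard_normal((257, 4)) + 1j * rng.standard_normal((257, 4)))
print(ctf_filter(X, H).shape)
```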

Journal ArticleDOI
TL;DR: The results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features.
Abstract: This paper presents an investigation into ways of integrating articulatory features into hidden Markov model (HMM)-based parametric speech synthesis. In broad terms, this may be achieved by estimating the joint distribution of acoustic and articulatory features during training. This may in turn be used in conjunction with a maximum-likelihood criterion to produce acoustic synthesis parameters for generating speech. Within this broad approach, we explore several variations that are possible in the construction of an HMM-based synthesis system which allow articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. Performance is evaluated using the RMS error of generated acoustic parameters as well as formal listening tests. Our results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. Most significantly, however, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis systems more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of producing acoustic synthesis parameters.

Journal ArticleDOI
TL;DR: A novel statistical user model based on a compact stack-like state representation called a user agenda which allows state transitions to be modeled as sequences of push- and pop-operations and elegantly encodes the dialogue history from a user's point of view is described.
Abstract: A key advantage of taking a statistical approach to spoken dialogue systems is the ability to formalise dialogue policy design as a stochastic optimization problem. However, since dialogue policies are learnt by interactively exploring alternative dialogue paths, conventional static dialogue corpora cannot be used directly for training and instead, a user simulator is commonly used. This paper describes a novel statistical user model based on a compact stack-like state representation called a user agenda which allows state transitions to be modeled as sequences of push- and pop-operations and elegantly encodes the dialogue history from a user's point of view. An expectation-maximisation based algorithm is presented which models the observable user output in terms of a sequence of hidden states and thereby allows the model to be trained on a corpus of minimally annotated data. Experimental results with a real-world dialogue system demonstrate that the trained user model can be successfully used to optimise a dialogue policy which outperforms a hand-crafted baseline in terms of task completion rates and user satisfaction scores.
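The agenda idea (a stack of pending dialogue acts manipulated with push and pop operations) can be sketched in a few lines; the act set, user goal, and response policy below are invented placeholders for illustration, not the paper's model or training procedure.

```python
import random

class AgendaUserSimulator:
    """Minimal sketch of an agenda-based user simulator: the user goal is a set
    of constraints, and the agenda is a stack of pending dialogue acts."""

    def __init__(self, goal):
        self.goal = goal                           # e.g. {"food": "italian", "area": "center"}
        # Initial agenda: inform each constraint, then request the target slot.
        self.agenda = [("request", "phone")]
        self.agenda += [("inform", slot, value) for slot, value in goal.items()]

    def respond(self, system_act):
        # Push corrective acts in reaction to the system (push/pop = state transition).
        if system_act[0] == "confirm":
            slot, value = system_act[1], system_act[2]
            if self.goal.get(slot) != value:
                self.agenda.append(("negate", slot, self.goal.get(slot)))
            else:
                self.agenda.append(("affirm",))
        # Pop one or two acts off the top of the agenda as the user turn.
        n = min(len(self.agenda), random.randint(1, 2))
        return [self.agenda.pop() for _ in range(n)]

# Toy dialogue turn: the system confirms a wrong value, so the user negates it.
random.seed(0)
user = AgendaUserSimulator({"food": "italian", "area": "center"})
print(user.respond(("confirm", "food", "chinese")))
```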

Journal ArticleDOI
TL;DR: This paper studies how combination schemes, in which the outputs of two independent adaptive filters are adaptively mixed together, can be used to increase IPNLMS robustness to channels with different degrees of sparsity, as well as to alleviate the rate-of-convergence versus steady-state misadjustment tradeoff imposed by the selection of the step size.
Abstract: Proportionate adaptive filters, such as those based on the improved proportionate normalized least-mean-square (IPNLMS) algorithm, have been proposed for echo cancellation as an interesting alternative to the normalized least-mean-square (NLMS) filter. Proportionate schemes offer improved performance when the echo path is sparse, but are still subject to some compromises regarding their convergence properties and steady-state error. In this paper, we study how combination schemes, where the outputs of two independent adaptive filters are adaptively mixed together, can be used to increase IPNLMS robustness to channels with different degrees of sparsity, as well as to alleviate the rate of convergence versus steady-state misadjustment tradeoff imposed by the selection of the step size. We also introduce a new block-based combination scheme which is specifically designed to further exploit the characteristics of the IPNLMS filter. The advantages of these combined filters are justified theoretically and illustrated in several echo cancellation scenarios.
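The adaptive mixing of two filters can be illustrated with a standard convex-combination rule; the sketch below combines two NLMS filters with different step sizes as a simplified stand-in for the IPNLMS-based filters in the paper, and the step sizes, filter length, and toy echo path are assumptions.

```python
import numpy as np

def combined_nlms(x, d, L=64, mu1=1.0, mu2=0.1, mu_a=100.0):
    """Adaptive convex combination of two NLMS filters: the mixing weight
    lam = sigmoid(a) is itself adapted to minimize the combined output error."""
    w1, w2 = np.zeros(L), np.zeros(L)
    a = 0.0
    y = np.zeros(len(x))
    for n in range(L, len(x)):
        u = x[n - L + 1: n + 1][::-1]             # regressor (most recent sample first)
        y1, y2 = w1 @ u, w2 @ u
        lam = 1.0 / (1.0 + np.exp(-a))
        y[n] = lam * y1 + (1 - lam) * y2
        e, e1, e2 = d[n] - y[n], d[n] - y1, d[n] - y2
        norm = u @ u + 1e-8
        w1 += mu1 * e1 * u / norm                 # fast filter
        w2 += mu2 * e2 * u / norm                 # slow, low-misadjustment filter
        a += mu_a * e * (y1 - y2) * lam * (1 - lam)   # gradient step on the mixer
        a = np.clip(a, -4.0, 4.0)                 # keep lam away from 0 and 1
    return y

# Toy echo path: a sparse 64-tap channel driven by white noise.
rng = np.random.default_rng(7)
h = np.zeros(64); h[[5, 20]] = [1.0, -0.5]
x = rng.standard_normal(4000)
d = np.convolve(x, h)[:len(x)] + 0.01 * rng.standard_normal(len(x))
y = combined_nlms(x, d)
print(np.mean((d[-500:] - y[-500:]) ** 2))        # residual error power after adaptation
```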

Journal ArticleDOI
Ebru Arisoy1, Dogan Can1, Siddika Parlak1, Hasim Sak1, Murat Saraclar1 
TL;DR: This paper summarizes the recent efforts for building a Turkish Broadcast News transcription and retrieval system using sub-word-based recognition units to solve the OOV problem with moderate size vocabularies and perform even better than a 500 K word vocabulary as far as recognition accuracy is concerned.
Abstract: This paper summarizes our recent efforts for building a Turkish Broadcast News transcription and retrieval system. The agglutinative nature of Turkish leads to a high number of out-of-vocabulary (OOV) words which in turn lower automatic speech recognition (ASR) accuracy. This situation compromises the performance of speech retrieval systems based on ASR output. Therefore using a word-based ASR is not adequate for transcribing speech in Turkish. To alleviate this problem, various sub-word-based recognition units are utilized. These units solve the OOV problem with moderate size vocabularies and perform even better than a 500 K word vocabulary as far as recognition accuracy is concerned. As a novel approach, the interaction between recognition units, words and sub-words, and discriminative training is explored. Sub-word models benefit from discriminative training more than word models do, especially in the discriminative language modeling framework. For speech retrieval, a spoken term detection system based on automata indexation is utilized. As with transcription, retrieval performance is measured under various schemes incorporating words and sub-words. Best results are obtained using a cascade of word and sub-word indexes together with term-specific thresholding.

Journal ArticleDOI
TL;DR: A supervised learning approach to monaural segregation of reverberant voiced speech is proposed, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features.
Abstract: A major source of signal degradation in real environments is room reverberation. Monaural speech segregation in reverberant environments is a particularly challenging problem. Although inverse filtering has been proposed to partially restore the harmonicity of reverberant speech before segregation, this approach is sensitive to specific source/receiver and room configurations. This paper proposes a supervised learning approach to monaural segregation of reverberant voiced speech, which learns to map from a set of pitch-based auditory features to a grouping cue encoding the posterior probability of a time-frequency (T-F) unit being target dominant given observed features. We devise a novel objective function for the learning process, which directly relates to the goal of maximizing signal-to-noise ratio. The models trained using this objective function yield significantly better T-F unit labeling. A segmentation and grouping framework is utilized to form reliable segments under reverberant conditions and organize them into streams. Systematic evaluations show that our approach produces very promising results under various reverberant conditions and generalizes well to new utterances and new speakers.

Journal ArticleDOI
TL;DR: Experimental results on Vietnamese showed that with only a few hours of target language speech data, crosslingual context-independent modeling worked better than crosslingual context-dependent modeling; however, it was outperformed by the latter when more speech data were available, and it was concluded that in both cases, crosslingual systems are better than monolingual baseline systems.
Abstract: This paper presents our work in automatic speech recognition (ASR) in the context of under-resourced languages, with application to Vietnamese. Different techniques for bootstrapping acoustic models are presented. First, we present the use of acoustic-phonetic unit distances and the potential of crosslingual acoustic modeling for under-resourced languages. Experimental results on Vietnamese showed that with only a few hours of target language speech data, crosslingual context-independent modeling worked better than crosslingual context-dependent modeling. However, it was outperformed by the latter when more speech data were available. We concluded, therefore, that in both cases, crosslingual systems are better than monolingual baseline systems. Grapheme-based acoustic modeling, which avoids building a phonetic dictionary, is also investigated in our work. Finally, since the use of sub-word units (morphemes, syllables, characters, etc.) can reduce the high out-of-vocabulary rate and alleviate the lack of text resources in statistical language modeling for under-resourced languages, we propose several methods to decompose, normalize, and combine word and sub-word lattices generated from different ASR systems. The proposed lattice combination scheme results in a relative syllable error rate reduction of 6.6% over the sentence MAP baseline method for a Vietnamese ASR task.

Journal ArticleDOI
TL;DR: The study illustrates the impact of the Lombard effect on speaker recognition, and effective methods to improve system performance for speaker recognition when train/test conditions are mismatched for neutral versus Lombard effect speech.
Abstract: Speech production in the presence of noise results in the Lombard effect, which is known to have a serious impact on speech system performance. In this study, Lombard speech produced under different types and levels of noise is analyzed in terms of duration, energy histogram, and spectral tilt. Acoustic-phonetic differences are shown to exist between different "flavors" of Lombard speech based on analysis of trends from a Gaussian mixture model (GMM)-based Lombard speech type classifier. For the first time, the dependence of Lombard speech on noise type and noise level is established for the purposes of speech processing systems. Also, the impact of the different flavors of Lombard effect on speech system performance is shown with respect to an in-set/out-of-set speaker recognition task. System performance is shown to degrade from an equal error rate (EER) of 7.0% under matched neutral training and testing conditions, to an average EER of 26.92% when trained with neutral and tested with Lombard effect speech. Furthermore, improvement in the performance of in-set/out-of-set speaker recognition is demonstrated by adapting neutral speaker models with Lombard speech data of limited duration. Improved average EERs of 4.75% and 12.37% were achieved for matched and mismatched adaptation and testing conditions, respectively. At the highest noise levels, an EER as low as 1.78% was obtained by adapting neutral speaker models with Lombard speech of limited duration. The study therefore illustrates the impact of Lombard effect on speaker recognition, and effective methods to improve system performance for speaker recognition when train/test conditions are mismatched for neutral versus Lombard effect speech.

Journal ArticleDOI
TL;DR: This paper presents a novel dynamic music similarity measurement strategy that utilizes both content features and user access patterns and significantly improves the music similarity measurement accuracy and performance.
Abstract: Music recommendation is receiving increasing attention as the music industry develops venues to deliver music over the Internet. The goal of music recommendation is to present users with lists of songs that they are likely to enjoy. Collaborative-filtering and content-based recommendations are two widely used approaches that have been proposed for music recommendation. However, both approaches have their own disadvantages: collaborative-filtering methods need a large collection of user history data, and content-based methods lack the ability to understand the interests and preferences of users. To overcome these limitations, this paper presents a novel dynamic music similarity measurement strategy that utilizes both content features and user access patterns. The seamless integration of the two significantly improves the music similarity measurement accuracy and performance. Based on this strategy, recommended songs are obtained by means of label propagation over a graph representing music similarity. Experimental results on a real data set collected from http://www.newwisdom.net demonstrate the effectiveness of the proposed approach.
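Label propagation over a similarity graph can be sketched in a few numpy lines; the spreading rule below follows the common symmetric-normalization formulation, and the tiny similarity matrix and seed preference are invented for illustration, not the paper's data or exact update.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.8, n_iter=50):
    """Label propagation over a song-similarity graph:
    F <- alpha * S @ F + (1 - alpha) * Y, with S = D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d) + 1e-12)       # symmetrically normalized similarities
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
    return F

# Toy usage: 5 songs, song 0 is the "liked" seed; similarities are made up.
W = np.array([[0.0, 0.9, 0.7, 0.1, 0.0],
              [0.9, 0.0, 0.8, 0.1, 0.0],
              [0.7, 0.8, 0.0, 0.2, 0.1],
              [0.1, 0.1, 0.2, 0.0, 0.9],
              [0.0, 0.0, 0.1, 0.9, 0.0]])
Y = np.array([1.0, 0, 0, 0, 0])                   # seed preference labels
scores = label_propagation(W, Y)
print(np.argsort(-scores))                        # songs ranked by propagated preference
```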

Journal ArticleDOI
TL;DR: A new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed, which addresses some limitations of HMMs while maintaining many of the aspects which have made them successful.
Abstract: Acoustic modeling based on hidden Markov models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, their conditional independence properties limit their ability to model spectral phenomena well. In this paper, a new acoustic modeling paradigm based on augmented conditional random fields (ACRFs) is investigated and developed. This paradigm addresses some limitations of HMMs while maintaining many of the aspects which have made them successful. In particular, the acoustic modeling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. Acoustic context modeling is explicitly integrated to handle the sequential phenomena of the speech signal. We present an efficient framework for estimating these models that ensures scalability and generality. In the TIMIT phone recognition task, a phone error rate of 23.0% was recorded on the full test set, a significant improvement over comparable HMM-based systems.

Journal ArticleDOI
TL;DR: Simulation results show that MPM is robust against various common attacks such as noise addition, filtering, echo, and MP3 compression, and that it provides greater robustness and better inaudibility of the watermark insertion.
Abstract: This paper presents a Multiplicative Patchwork Method (MPM) for audio watermarking. The watermark signal is embedded by selecting two subsets of the host signal features and modifying one subset multiplicatively according to the watermark data, while the other subset is left unchanged. The method is implemented in the wavelet domain, and approximation coefficients are used to embed data. In order to achieve error-free detection, the watermark data are inserted only in the frames where the ratio of the energies of the subsets is between two predefined values. Also, in order to control the inaudibility of the watermark insertion, we use an iterative algorithm to reach a desired quality for the watermarked audio signal. The quality of the watermarked signal is evaluated in each iteration using the Perceptual Evaluation of Audio Quality (PEAQ) method. The probability of error is also derived for the watermarking scheme, and simulation results prove the validity of the analytical derivations. Simulation results show that MPM is robust against various common attacks such as noise addition, filtering, echo, and MP3 compression. In comparison to the original patchwork method and its modified versions, and some recent methods, MPM provides greater robustness and better inaudibility of the watermark insertion.
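The core embedding/detection idea can be sketched on a generic coefficient frame; this numpy illustration skips the wavelet transform, the energy-ratio frame selection, and the PEAQ-driven iteration described above, and its embedding strength and key handling are assumptions, not the paper's parameters.

```python
import numpy as np

def embed_patchwork(coeffs, bit, alpha=0.05, seed=0):
    """Multiplicative patchwork embedding on a frame of host coefficients
    (e.g., wavelet approximation coefficients): pick two pseudo-random
    subsets A and B, scale A up for bit=1 (B for bit=0), leave the other unchanged."""
    rng = np.random.default_rng(seed)              # the seed plays the role of a secret key
    idx = rng.permutation(len(coeffs))
    A, B = idx[:len(idx) // 2], idx[len(idx) // 2:]
    marked = coeffs.astype(float).copy()
    marked[A if bit else B] *= (1.0 + alpha)
    return marked

def detect_patchwork(coeffs, seed=0):
    """Blind detection: compare the energies of the two secret subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(coeffs))
    A, B = idx[:len(idx) // 2], idx[len(idx) // 2:]
    return int(np.sum(coeffs[A] ** 2) > np.sum(coeffs[B] ** 2))

# Toy usage on a synthetic coefficient frame.
rng = np.random.default_rng(8)
host = rng.standard_normal(1024)
print(detect_patchwork(embed_patchwork(host, bit=1, alpha=0.1)))  # -> 1 for most frames
```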

Journal ArticleDOI
TL;DR: This paper shows how multimodal classification and learning rules should be adjusted to compensate for feature measurement uncertainty, with application to audiovisual speech recognition, where multiple streams of complementary time-evolving features are integrated.
Abstract: While the accuracy of feature measurements heavily depends on changing environmental conditions, studying the consequences of this fact in pattern recognition tasks has received relatively little attention to date. In this paper, we explicitly take feature measurement uncertainty into account and show how multimodal classification and learning rules should be adjusted to compensate for its effects. Our approach is particularly fruitful in multimodal fusion scenarios, such as audiovisual speech recognition, where multiple streams of complementary time-evolving features are integrated. For such applications, provided that the measurement noise uncertainty for each feature stream can be estimated, the proposed framework leads to highly adaptive multimodal fusion rules which are easy and efficient to implement. Our technique is widely applicable and can be transparently integrated with either synchronous or asynchronous multimodal sequence integration architectures. We further show that multimodal fusion methods relying on stream weights can naturally emerge from our scheme under certain assumptions; this connection provides valuable insights into the adaptivity properties of our multimodal uncertainty compensation approach. We show how these ideas can be practically applied for audiovisual speech recognition. In this context, we propose improved techniques for person-independent visual feature extraction and uncertainty estimation with active appearance models, and also discuss how enhanced audio features along with their uncertainty estimates can be effectively computed. We demonstrate the efficacy of our approach in audiovisual speech recognition experiments on the CUAVE database using either synchronous or asynchronous multimodal integration models.
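A minimal numpy sketch of the underlying idea for Gaussian models follows: when a feature is observed with known (diagonal) measurement-noise variance, the class-conditional variance is inflated accordingly, so uncertain observations contribute less sharply to the decision. The two-class toy numbers are invented for illustration and are not the paper's models.

```python
import numpy as np

def uncertain_log_likelihood(x, mu, var, var_noise):
    """Gaussian log-likelihood with observation-uncertainty compensation:
    the class variance var is inflated to var + var_noise."""
    v = var + var_noise
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x - mu) ** 2 / v)

# Toy 2-class example: an uncertain (e.g., visually degraded) feature vector
# yields a smaller log-odds magnitude than a clean one would.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
var = np.ones(2)
x = np.array([1.2, 1.1])
for var_noise in (np.zeros(2), 4.0 * np.ones(2)):
    d = uncertain_log_likelihood(x, mu2, var, var_noise) \
        - uncertain_log_likelihood(x, mu1, var, var_noise)
    print(f"noise var {var_noise[0]:.0f}: class-2 vs class-1 log-odds = {d:.3f}")
```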

Journal ArticleDOI
TL;DR: This paper describes a new method for EGG-based glottal activity detection which exhibits high accuracy over the entirety of voiced segments, including, in particular, the transition regions, thereby giving significant improvement over existing methods.
Abstract: Accurate estimation of glottal closure instants (GCIs) and opening instants (GOIs) is important for speech processing applications that benefit from glottal-synchronous processing. The majority of existing approaches detect GCIs by comparing the differentiated EGG signal to a threshold and are able to provide accurate results during voiced speech. More recent algorithms use a similar approach across multiple dyadic scales using the stationary wavelet transform. All existing approaches are, however, prone to errors around the transition regions at the end of voiced segments of speech. This paper describes a new method for EGG-based glottal activity detection which exhibits high accuracy over the entirety of voiced segments, including, in particular, the transition regions, thereby giving significant improvement over existing methods. Following a stationary wavelet transform-based preprocessor, detection of excitation due to glottal closure is performed using a group delay function, and then true and false detections are discriminated by Gaussian mixture modeling. GOI detection involves additional processing using the estimated GCIs. The main purpose of our algorithm is to provide a ground truth for GCIs and GOIs. This is essential in order to evaluate algorithms that estimate GCIs and GOIs from the speech signal only, and is also of high value in the analysis of pathological speech where knowledge of GCIs and GOIs is often needed. We compare our algorithm with two previous algorithms against a hand-labeled database. Evaluation has shown an average GCI hit rate of 99.47% and a GOI hit rate of 99.35%, compared to 96.08% and 92.54%, respectively, for the best-performing existing algorithm.

Journal ArticleDOI
TL;DR: A class of AEC algorithms is proposed that can not only work well in both sparse and dispersive circumstances, but also adapt dynamically to the level of sparseness using a new sparseness-controlled approach.
Abstract: In the context of acoustic echo cancellation (AEC), it is shown that the level of sparseness in acoustic impulse responses can vary greatly in a mobile environment. When the response is strongly sparse, convergence of conventional approaches is poor. Drawing on techniques originally developed for network echo cancellation (NEC), we propose a class of AEC algorithms that can not only work well in both sparse and dispersive circumstances, but also adapt dynamically to the level of sparseness using a new sparseness-controlled approach. Simulation results, using white Gaussian noise (WGN) and speech input signals, show improved performance over existing methods. The proposed algorithms achieve these improvements with only a modest increase in computational complexity.
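A hedged numpy sketch of the sparseness-controlled idea follows: an IPNLMS-style proportionate update whose proportionality parameter is driven by a sparseness measure of the current filter estimate. The specific mapping from sparseness to the parameter, the step size, and the toy echo path are illustrative assumptions, not the exact rule proposed in the paper.

```python
import numpy as np

def sparseness(h, eps=1e-12):
    """Sparseness measure of an impulse response, in [0, 1]:
    0 for a uniform (dispersive) response, 1 for a single active tap."""
    L = len(h)
    return (L / (L - np.sqrt(L))) * (1.0 - np.linalg.norm(h, 1)
                                     / (np.sqrt(L) * np.linalg.norm(h) + eps))

def sc_ipnlms(x, d, L=128, mu=0.3, delta=1e-2):
    """IPNLMS-style echo canceller whose proportionality parameter alpha is
    adjusted from the measured sparseness of the current filter estimate."""
    h = np.zeros(L)
    e = np.zeros(len(x))
    for n in range(L, len(x)):
        u = x[n - L + 1: n + 1][::-1]
        e[n] = d[n] - h @ u
        xi = sparseness(h) if np.any(h) else 0.5
        alpha = 2.0 * xi - 1.0                      # sparse estimate -> alpha near +1 (assumed mapping)
        k = (1 - alpha) / (2 * L) + (1 + alpha) * np.abs(h) \
            / (2 * np.linalg.norm(h, 1) + delta)    # proportionate per-tap gains
        h += mu * e[n] * k * u / (u @ (k * u) + delta)
    return h, e

# Toy usage: identify a sparse 128-tap echo path from white-noise input.
rng = np.random.default_rng(9)
true_h = np.zeros(128); true_h[[10, 40]] = [0.8, -0.4]
x = rng.standard_normal(6000)
d = np.convolve(x, true_h)[:len(x)] + 0.001 * rng.standard_normal(len(x))
h_hat, e = sc_ipnlms(x, d)
print(np.linalg.norm(true_h - h_hat))               # misalignment (should be small after adaptation)
```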