
Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2012"


Journal ArticleDOI
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture is described that trains the DNN to produce a distribution over senones (tied triphone states) as its output and can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
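
For illustration only, the sketch below shows the standard hybrid decoding step implied by this kind of architecture: the DNN's senone posteriors are divided by senone priors to obtain scaled likelihoods for the HMM decoder. The tiny network, its random weights, and the uniform priors are placeholders, not the system described in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def senone_scaled_likelihoods(dnn_forward, frames, senone_priors, floor=1e-10):
    """Turn DNN senone posteriors p(s|o) into scaled likelihoods for the HMM.

    In a hybrid DNN-HMM, the emission score is taken as p(s|o) / p(s), where
    p(s) is a senone prior (here a uniform placeholder; in practice it would
    be estimated from forced-aligned training data).
    """
    posteriors = dnn_forward(frames)                       # (T, n_senones)
    return np.log(np.maximum(posteriors, floor)) - np.log(senone_priors)

# --- toy usage with a made-up two-layer network (placeholder weights) ---
rng = np.random.default_rng(0)
n_feat, n_hidden, n_senones, T = 39, 256, 1000, 50
W1, b1 = rng.normal(size=(n_feat, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, n_senones)), np.zeros(n_senones)

def dnn_forward(x):
    h = np.maximum(x @ W1 + b1, 0.0)                       # ReLU hidden layer
    return softmax(h @ W2 + b2)                            # senone posteriors

frames = rng.normal(size=(T, n_feat))                      # fake acoustic features
priors = np.full(n_senones, 1.0 / n_senones)               # uniform prior placeholder
log_likes = senone_scaled_likelihoods(dnn_forward, frames, priors)
print(log_likes.shape)                                     # (50, 1000), fed to the HMM decoder
```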

3,120 citations


Journal ArticleDOI
TL;DR: It is shown that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters.
Abstract: Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.
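
As a rough sketch of the generative pre-training idea (not the authors' exact recipe), the snippet below trains a single Bernoulli-Bernoulli RBM with one-step contrastive divergence; the paper's real-valued spectral inputs would call for a Gaussian-Bernoulli first layer, which is omitted here for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.01, seed=0):
    """One-step contrastive divergence (CD-1) for a Bernoulli-Bernoulli RBM."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    a = np.zeros(n_visible)                                # visible biases
    b = np.zeros(n_hidden)                                 # hidden biases
    n = len(data)
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W + b)                          # positive phase
        h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
        v1 = sigmoid(h0 @ W.T + a)                         # reconstruction
        ph1 = sigmoid(v1 @ W + b)                          # negative phase
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
        a += lr * (v0 - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# Hidden activations of one trained RBM become the "data" for the next RBM,
# stacking layers greedily before discriminative fine-tuning with backprop.
```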

1,767 citations


Journal ArticleDOI
TL;DR: An analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data is presented, and important areas for future research are identified.
Abstract: Speaker diarization is the task of determining “who spoke when?” in an audio or video recording that contains an unknown amount of speech and also an unknown number of speakers. Initially, it was proposed as a research topic related to automatic speech recognition, where speaker diarization serves as an upstream processing step. Over recent years, however, speaker diarization has become an important key technology for many tasks, such as navigation, retrieval, or higher level inference on audio data. Accordingly, many important improvements in accuracy and robustness have been reported in journals and conferences in the area. The application domains, from broadcast news, to lectures and meetings, vary greatly and pose different problems, such as having access to multiple microphones and multimodal information or overlapping speech. The most recent review of existing technology dates back to 2006 and focuses on the broadcast news domain. In this paper, we review the current state-of-the-art, focusing on research developed since 2006 that relates predominantly to speaker diarization for conference meetings. Finally, we present an analysis of speaker diarization performance as reported through the NIST Rich Transcription evaluations on meeting data and identify important areas for future research.

706 citations


Journal ArticleDOI
TL;DR: It is shown that the proposed speech presence probability (SPP) approach maintains the quick noise tracking performance of the bias compensated minimum mean-square error (MMSE)-based approach while exhibiting less overestimation of the spectral noise power and an even lower computational complexity.
Abstract: Recently, it has been proposed to estimate the noise power spectral density by means of minimum mean-square error (MMSE) optimal estimation. We show that the resulting estimator can be interpreted as a voice activity detector (VAD)-based noise power estimator, where the noise power is updated only when speech absence is signaled, compensated with a required bias compensation. We show that the bias compensation is unnecessary when we replace the VAD by a soft speech presence probability (SPP) with fixed priors. Choosing fixed priors also has the benefit of decoupling the noise power estimator from subsequent steps in a speech enhancement framework, such as the estimation of the speech power and the estimation of the clean speech. We show that the proposed speech presence probability (SPP) approach maintains the quick noise tracking performance of the bias compensated minimum mean-square error (MMSE)-based approach while exhibiting less overestimation of the spectral noise power and an even lower computational complexity.
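
A minimal sketch of an SPP-driven noise tracker in the spirit of the approach described above is given below; the fixed a-priori SNR under speech presence, the fixed prior, the smoothing constant, and the absence of the paper's additional safeguards are all simplifying assumptions of this sketch.

```python
import numpy as np

def spp_noise_tracker(noisy_power, alpha=0.8, xi_h1_db=15.0, p_h1=0.5):
    """Track the noise power spectral density with a soft SPP, frame by frame.

    noisy_power : (T, F) array of noisy periodograms |Y(t,f)|^2.
    A fixed a-priori SNR under speech presence (xi_h1) and a fixed prior
    P(H1) are used, so no bias compensation is needed. Constants are
    illustrative, not the paper's.
    """
    xi_h1 = 10.0 ** (xi_h1_db / 10.0)
    T, F = noisy_power.shape
    noise_pow = noisy_power[0].copy()              # initialize from the first frame
    est = np.empty((T, F))
    for t in range(T):
        snr_post = noisy_power[t] / np.maximum(noise_pow, 1e-12)
        # posterior speech presence probability with fixed priors
        spp = 1.0 / (1.0 + ((1 - p_h1) / p_h1) * (1.0 + xi_h1)
                     * np.exp(-snr_post * xi_h1 / (1.0 + xi_h1)))
        # MMSE estimate of the noise periodogram under the two hypotheses
        noise_periodogram = (1.0 - spp) * noisy_power[t] + spp * noise_pow
        # recursive smoothing over time
        noise_pow = alpha * noise_pow + (1.0 - alpha) * noise_periodogram
        est[t] = noise_pow
    return est
```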

528 citations


Journal ArticleDOI
TL;DR: A comparative evaluation of the proposed approach shows that it outperforms current state-of-the-art melody extraction systems in terms of overall accuracy.
Abstract: We present a novel system for the automatic extraction of the main melody from polyphonic music recordings. Our approach is based on the creation and characterization of pitch contours, time continuous sequences of pitch candidates grouped using auditory streaming cues. We define a set of contour characteristics and show that by studying their distributions we can devise rules to distinguish between melodic and non-melodic contours. This leads to the development of new voicing detection, octave error minimization and melody selection techniques. A comparative evaluation of the proposed approach shows that it outperforms current state-of-the-art melody extraction systems in terms of overall accuracy. Further evaluation of the algorithm is provided in the form of a qualitative error analysis and the study of the effect of key parameters and algorithmic components on system performance. Finally, we conduct a glass ceiling analysis to study the current limitations of the method, and possible directions for future work are proposed.

417 citations


Journal ArticleDOI
TL;DR: This paper introduces a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints.
Abstract: Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper, we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to devise and implement new, efficient methods that have not yet been reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named Flexible Audio Source Separation Toolbox (FASST) implementing a baseline version of the framework in Matlab.

296 citations


Journal ArticleDOI
TL;DR: This paper generalizes existing dereverberation methods using subband-domain multi-channel linear prediction filters so that the resultant generalized algorithm can blindly shorten a multiple-input multiple-output room impulse response between an unknown number of sources and a microphone array.
Abstract: The performance of many microphone array processing techniques deteriorates in the presence of reverberation. To provide a widely applicable solution to this longstanding problem, this paper generalizes existing dereverberation methods using subband-domain multi-channel linear prediction filters so that the resultant generalized algorithm can blindly shorten a multiple-input multiple-output (MIMO) room impulse response between an unknown number of sources and a microphone array. Unlike existing dereverberation methods, the presented algorithm is developed without assuming specific acoustic conditions, and provides a firm theoretical underpinning for the applicability of the subband-domain multi-channel linear prediction methods. The generalization is achieved by using a new cost function for estimating the prediction filter and an efficient optimization algorithm. The proposed generalized algorithm makes it easier to understand the common background underlying different dereverberation methods and facilitates future technical development. Indeed, this paper also derives two alternative dereverberation methods from the proposed algorithm, which are advantageous in terms of computational complexity. Experimental results are reported, showing that the proposed generalized algorithm effectively achieves blind MIMO impulse response shortening especially in a mid-to-high frequency range.
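
To make the core idea concrete, here is a toy single-subband sketch of delayed multi-channel linear prediction dereverberation, the family of methods being generalized: past frames of all channels predict, and are used to cancel, the late reverberation in a reference channel. The plain least-squares solution, the delay, and the filter order are illustrative choices, not the paper's algorithm.

```python
import numpy as np

def delayed_mclp_dereverb(X, delay=3, order=10):
    """Dereverberate one STFT subband with delayed multi-channel linear prediction.

    X : array (T, M) of STFT coefficients of M microphones in one subband.
    Frames (t-delay, ..., t-delay-order+1) of all channels are used to predict,
    and then subtract, the late reverberation in channel 0. This is a plain
    least-squares sketch of the delayed-LP idea, without the iterative
    weighting used by full algorithms of this family.
    """
    T, M = X.shape
    rows, targets = [], []
    for t in range(delay + order - 1, T):
        past = X[t - delay - order + 1 : t - delay + 1, :]   # (order, M) past frames
        rows.append(past.ravel())
        targets.append(X[t, 0])
    A = np.asarray(rows)                                     # (T', order*M)
    y = np.asarray(targets)                                  # (T',)
    g, *_ = np.linalg.lstsq(A, y, rcond=None)                # prediction filter
    dereverbed = X[:, 0].copy()
    dereverbed[delay + order - 1 :] = y - A @ g              # remove predicted reverb
    return dereverbed
```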

241 citations


Journal ArticleDOI
TL;DR: In this paper, five state-of-the-art GCI detection algorithms are compared using six different databases, with contemporaneous electroglottographic recordings as ground truth, containing many hours of speech by multiple speakers.
Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires, however, that the precise locations of the glottal closure instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.

241 citations


Journal ArticleDOI
TL;DR: A new feature based on relative phase shift (RPS) is proposed, reliable detection of synthetic speech is demonstrated, and it is shown how this classifier can be used to improve the security of SV systems.
Abstract: In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.

229 citations


Journal ArticleDOI
TL;DR: An audio inpainting framework is proposed that recovers portions of audio data distorted by impairments such as impulsive noise, clipping, and packet loss; the approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.
Abstract: We propose the audio inpainting framework that recovers portions of audio data distorted due to impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable or better than state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.
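
The snippet below sketches the per-frame restoration step described above: a DCT dictionary is restricted to the reliable samples, a sparse code is found with Orthogonal Matching Pursuit, and that code synthesizes the missing samples. The dictionary choice, sparsity level, and the absence of overlap-add are simplifications, and the clipping-specific constraints are not included.

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import orthogonal_mp

def inpaint_frame(frame, observed_mask, n_nonzero=32):
    """Restore the missing samples of one audio frame by sparse coding.

    frame         : (N,) samples, with unreliable entries anywhere (ignored).
    observed_mask : (N,) boolean, True where the sample is reliable.
    """
    N = len(frame)
    D = idct(np.eye(N), axis=0, norm='ortho')            # columns = DCT atoms
    D_obs = D[observed_mask, :]                          # keep reliable rows only
    code = orthogonal_mp(D_obs, frame[observed_mask], n_nonzero_coefs=n_nonzero)
    estimate = D @ code                                  # synthesize the whole frame
    restored = frame.copy()
    restored[~observed_mask] = estimate[~observed_mask]  # keep reliable samples as-is
    return restored

# toy usage: a sinusoid with a short dropout
N = 256
t = np.arange(N)
clean = np.sin(2 * np.pi * 5 * t / N)
mask = np.ones(N, dtype=bool)
mask[100:130] = False                                    # 30 missing samples
restored = inpaint_frame(np.where(mask, clean, 0.0), mask)
```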

229 citations


Journal ArticleDOI
TL;DR: A unified review of digital artificial reverberation approaches is provided, and an improved parametric model for a spring reverberation unit is presented as a case study.
Abstract: The first artificial reverberation algorithms were proposed in the early 1960s, and new, improved algorithms are published regularly. These algorithms have been widely used in music production since the 1970s, and now find applications in new fields, such as game audio. This overview article provides a unified review of the various approaches to digital artificial reverberation. The three main categories have been delay networks, convolution-based algorithms, and physical room models. Delay-network and convolution techniques have been competing in popularity in the music technology field, and are often employed to produce a desired perceptual or artistic effect. In applications including virtual reality, predictive acoustic modeling, and computer-aided design of acoustic spaces, accuracy is desired, and physical models have been mainly used, although, due to their computational complexity, they are currently mainly used for simplified geometries or to generate reverberation impulse responses for use with a convolution method. With the increase of computing power, all these approaches will be available in real time. A recent trend in audio technology is the emulation of analog artificial reverberation units, such as spring reverberators, using signal processing algorithms. As a case study we present an improved parametric model for a spring reverberation unit.
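
As a tiny concrete instance of the delay-network category surveyed above (not a model from the article), the sketch below implements a classic Schroeder reverberator with parallel feedback combs followed by series allpass filters; the delays, gains, and dry/wet mix are arbitrary.

```python
import numpy as np

def feedback_comb(x, delay, g):
    """y[n] = x[n] + g * y[n - delay]  (a recirculating delay line)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n] + (g * y[n - delay] if n >= delay else 0.0)
    return y

def allpass(x, delay, g):
    """Schroeder allpass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x, fs=16000):
    # parallel combs with mutually incommensurate delays, then series allpasses
    comb_delays = [int(fs * d) for d in (0.0297, 0.0371, 0.0411, 0.0437)]
    wet = sum(feedback_comb(x, d, 0.77) for d in comb_delays) / 4.0
    for d, g in ((int(fs * 0.005), 0.7), (int(fs * 0.0017), 0.7)):
        wet = allpass(wet, d, g)
    return 0.7 * x + 0.3 * wet          # dry/wet mix (arbitrary)
```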

Journal ArticleDOI
TL;DR: A phase information extraction method is proposed that normalizes the variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods.
Abstract: In conventional speaker recognition methods based on Mel-frequency cepstral coefficients (MFCCs), phase information has hitherto been ignored. In this paper, we propose a phase information extraction method that normalizes the variation in the phase according to the frame position of the input speech and combines the phase information with MFCCs in text-independent speaker identification and verification methods. There is a problem with the original phase information extraction method when comparing two phase values. For example, the difference between the two values π − θ1 and θ2 = −π + θ1 is 2π − 2θ1. If θ1 ≈ 0, then the difference ≈ 2π, despite the two phases being very similar to one another. To address this problem, we map the phase into coordinates on a unit circle. Speaker identification and verification experiments are performed using the NTT database, which consists of sentences uttered by 35 (22 male and 13 female) Japanese speakers in normal, fast and slow speaking modes during five sessions. Although the phase information-based method performs worse than the MFCC-based method, it augments the MFCCs, and the combination is useful for speaker recognition. The proposed modified phase information is more robust than the original phase information for all speaking modes. By integrating the modified phase information with the MFCCs, the speaker identification rate was improved from 97.4% (MFCC) to 98.8%, and the equal error rate for speaker verification was reduced from 0.72% (MFCC) to 0.45%. We also conducted speaker identification and verification experiments on the large-scale Japanese Newspaper Article Sentences (JNAS) database, and a similar trend to that on the NTT database was obtained.
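
The wrap-around problem and the unit-circle remedy described above can be checked numerically; in this small example the cosine/sine mapping is compared with the naive phase difference (the Euclidean and arccosine comparisons are just two simple ways to measure closeness on the circle).

```python
import numpy as np

theta1 = 0.05
a = np.pi - theta1            # phase just below +pi
b = -np.pi + theta1           # phase just above -pi: physically very close to a
print(abs(a - b))             # naive difference ~ 2*pi - 2*theta1 (looks huge)

# map each phase to a point on the unit circle before comparing
pa = np.array([np.cos(a), np.sin(a)])
pb = np.array([np.cos(b), np.sin(b)])
print(np.linalg.norm(pa - pb))             # small, as it should be
print(np.arccos(np.clip(pa @ pb, -1, 1)))  # true angular difference = 2*theta1
```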

Journal ArticleDOI
TL;DR: Voice conversion methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs).
Abstract: In this paper, we present statistical approaches to enhance body-conducted unvoiced speech for silent speech communication. A body-conductive microphone called nonaudible murmur (NAM) microphone is effectively used to detect very soft unvoiced speech such as NAM or a whispered voice while keeping speech sounds emitted outside almost inaudible. However, body-conducted unvoiced speech is difficult to use in human-to-human speech communication because it sounds unnatural and less intelligible owing to the acoustic change caused by body conduction. To address this issue, voice conversion (VC) methods from NAM to normal speech (NAM-to-Speech) and to a whispered voice (NAM-to-Whisper) are proposed, where the acoustic features of body-conducted unvoiced speech are converted into those of natural voices in a probabilistic manner using Gaussian mixture models (GMMs). Moreover, these methods are extended to convert not only NAM but also a body-conducted whispered voice (BCW) as another type of body-conducted unvoiced speech. Several experimental evaluations are conducted to demonstrate the effectiveness of the proposed methods. The experimental results show that 1) NAM-to-Speech effectively improves intelligibility but it causes degradation of naturalness owing to the difficulty of estimating natural fundamental frequency contours from unvoiced speech; 2) NAM-to-Whisper significantly outperforms NAM-to-Speech in terms of both intelligibility and naturalness; and 3) a single conversion model capable of converting both NAM and BCW is effectively developed in our proposed VC methods.

Journal ArticleDOI
TL;DR: This work investigates CASA for robust speaker identification, introduces a novel speaker feature, the gammatone frequency cepstral coefficient (GFCC), based on an auditory periphery model, and shows that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions.
Abstract: Conventional speaker recognition systems perform poorly under noisy conditions. Inspired by auditory perception, computational auditory scene analysis (CASA) typically segregates speech by producing a binary time-frequency mask. We investigate CASA for robust speaker identification. We first introduce a novel speaker feature, gammatone frequency cepstral coefficient (GFCC), based on an auditory periphery model, and show that this feature captures speaker characteristics and performs substantially better than conventional speaker features under noisy conditions. To deal with noisy speech, we apply CASA separation and then either reconstruct or marginalize corrupted components indicated by a CASA mask. We find that both reconstruction and marginalization are effective. We further combine the two methods into a single system based on their complementary advantages, and this system achieves significant performance improvements over related systems under a wide range of signal-to-noise ratios.

Journal ArticleDOI
TL;DR: Various methods of formulating the trade-off between performance and array effort as a regularization problem have been suggested, and the connection between these formulations is discussed; the design of an array that is robust to variations in the acoustic environment and in driver sensitivity and position leads to a generalization of regularization.
Abstract: As well as being able to reproduce sound in one region of space, it would be useful to reduce the level of reproduced sound in other spatial regions, with a “personal audio” system. For mobile devices this is motivated by issues of privacy for the user and the need to reduce annoyance for other people nearby. Such personal audio systems can be realized with arrays of loudspeakers that become superdirectional at low frequencies, when the array dimensions are small compared with the acoustic wavelength. The design of the array then becomes a compromise between performance and array effort, defined as the sum of mean squared driving signals. Various methods of formulating this tradeoff as a regularization problem have been suggested and the connection between these formulations is discussed. Large array efforts are due to strongly self-cancelling multipole arrays. A concern is then the robustness of such an array to variations in the acoustic environment and driver sensitivity and position. The design of an array that is robust to these uncertainties then leads to a generalization of regularization.
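
A minimal sketch of the regularization trade-off discussed above: Tikhonov-regularized least squares gives the loudspeaker driving signals, and the weight beta trades reproduction error in the bright/dark zones against array effort. The plant matrix and target pressures below are random placeholders, not a modeled array.

```python
import numpy as np

def regularized_drives(G, p_target, beta):
    """Least-squares loudspeaker driving signals with an effort penalty.

    G        : (n_mics, n_sources) complex plant matrix at one frequency.
    p_target : (n_mics,) desired pressures (nonzero in the bright zone,
               zero in the dark zone).
    beta     : regularization weight trading reproduction error against
               array effort ||q||^2 (sum of mean squared driving signals).
    """
    n_src = G.shape[1]
    A = G.conj().T @ G + beta * np.eye(n_src)
    q = np.linalg.solve(A, G.conj().T @ p_target)
    return q, np.real(q.conj() @ q)                 # drives and array effort

rng = np.random.default_rng(1)
G = rng.normal(size=(8, 4)) + 1j * rng.normal(size=(8, 4))
p = np.concatenate([np.ones(4), np.zeros(4)])       # bright / dark zone targets
for beta in (1e-3, 1e-1, 1e1):
    q, effort = regularized_drives(G, p, beta)
    err = np.linalg.norm(G @ q - p)
    print(f"beta={beta:g}  effort={effort:.3f}  error={err:.3f}")
```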

Journal ArticleDOI
TL;DR: It is concluded that while the deep processing structures can provide improvements for this genre, choice of features and the structure with which they are incorporated, including layer width, can also be significant factors.
Abstract: This paper reviews a line of research carried out over the last decade in speech recognition assisted by discriminatively trained, feedforward networks. The particular focus is on the use of multiple layers of processing preceding the hidden Markov model based decoding of word sequences. Emphasis is placed on the use of multiple streams of highly dimensioned layers, which have proven useful for this purpose. This paper ultimately concludes that while the deep processing structures can provide improvements for this genre, choice of features and the structure with which they are incorporated, including layer width, can also be significant factors.

Journal ArticleDOI
TL;DR: In this paper, a dynamic kernel partial least squares (DKPLS) technique is proposed to model nonlinearities and to capture the dynamics in the data; it is based on a kernel transformation of the source features to allow nonlinear modeling and on concatenation of previous and next frames to model the dynamics.
Abstract: A drawback of many voice conversion algorithms is that they rely on linear models and/or require a lot of tuning. In addition, many of them ignore the inherent time-dependency between speech features. To address these issues, we propose to use dynamic kernel partial least squares (DKPLS) technique to model nonlinearities as well as to capture the dynamics in the data. The method is based on a kernel transformation of the source features to allow non-linear modeling and concatenation of previous and next frames to model the dynamics. Partial least squares regression is used to find a conversion function that does not overfit to the data. The resulting DKPLS algorithm is a simple and efficient algorithm and does not require massive tuning. Existing statistical methods proposed for voice conversion are able to produce good similarity between the original and the converted target voices but the quality is usually degraded. The experiments conducted on a variety of conversion pairs show that DKPLS, being a statistical method, enables successful identity conversion while achieving a major improvement in the quality scores compared to the state-of-the-art Gaussian mixture-based model. In addition to enabling better spectral feature transformation, quality is further improved when aperiodicity and binary voicing values are converted using DKPLS with auxiliary information from spectral features.
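
The sketch below combines the two ingredients named above, a kernel transformation of the source features and concatenation of neighboring frames, with partial least squares regression (scikit-learn's PLSRegression). The kernel width, reference vectors, component count, and random "aligned" features are placeholders; this is not the authors' exact DKPLS recipe.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

def stack_context(X):
    """Concatenate previous, current and next frame to capture dynamics."""
    prev_f = np.vstack([X[:1], X[:-1]])
    next_f = np.vstack([X[1:], X[-1:]])
    return np.hstack([prev_f, X, next_f])

def train_dkpls_like(X_src, Y_tgt, n_refs=100, gamma=0.05, n_components=30, seed=0):
    rng = np.random.default_rng(seed)
    refs = X_src[rng.choice(len(X_src), n_refs, replace=False)]  # kernel centers
    K = rbf_kernel(X_src, refs, gamma=gamma)        # nonlinear transform
    Z = stack_context(K)                            # add dynamics
    pls = PLSRegression(n_components=n_components).fit(Z, Y_tgt)
    return pls, refs

def convert(pls, refs, X_src, gamma=0.05):
    K = rbf_kernel(X_src, refs, gamma=gamma)
    return pls.predict(stack_context(K))

# toy usage with random "aligned" source/target spectral features
rng = np.random.default_rng(0)
X_src, Y_tgt = rng.normal(size=(500, 24)), rng.normal(size=(500, 24))
pls, refs = train_dkpls_like(X_src, Y_tgt)
Y_hat = convert(pls, refs, X_src)
```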

Journal ArticleDOI
TL;DR: An overview is given of the AMIDA systems for transcription of conference and lecture room meetings, developed for participation in the Rich Transcription evaluations conducted by the National Institute of Standards and Technology in 2007 and 2009.
Abstract: In this paper, we give an overview of the AMIDA systems for transcription of conference and lecture room meetings. The systems were developed for participation in the Rich Transcription evaluations conducted by the National Institute for Standards and Technology in the years 2007 and 2009 and can process close talking and far field microphone recordings. The paper first discusses fundamental properties of meeting data with special focus on the AMI/AMIDA corpora. This is followed by a description and analysis of improved processing and modeling, with focus on techniques specifically addressing meeting transcription issues such as multi-room recordings or domain variability. In 2007 and 2009, two different strategies of systems building were followed. While in 2007 we used our traditional style system design based on cross adaptation, the 2009 systems were constructed semi-automatically, supported by improved decoders and a new method for system representation. Overall these changes gave a 6%-13% relative reduction in word error rate compared to our 2007 results while at the same time requiring less training material and reducing the real-time factor by five times. The meeting transcription systems are available at www.webasr.org.

Journal ArticleDOI
TL;DR: This work investigates a multi-source localization framework in which monaural source segregation is used as a mechanism to increase the robustness of azimuth estimates from a binaural input, and shows that robust performance can be achieved with limited prior training.
Abstract: Sound source localization from a binaural input is a challenging problem, particularly when multiple sources are active simultaneously and reverberation or background noise are present. In this work, we investigate a multi-source localization framework in which monaural source segregation is used as a mechanism to increase the robustness of azimuth estimates from a binaural input. We demonstrate performance improvement relative to binaural only methods assuming a known number of spatially stationary sources. We also propose a flexible azimuth-dependent model of binaural features that independently captures characteristics of the binaural setup and environmental conditions, allowing for adaptation to new environments or calibration to an unseen binaural setup. Results with both simulated and recorded impulse responses show that robust performance can be achieved with limited prior training.

Journal ArticleDOI
TL;DR: In this paper, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework is presented; these tools have proven effective in several problems related to the modeling and coding of speech signals.
Abstract: The aim of this paper is to provide an overview of Sparse Linear Prediction, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework. These tools have been shown to be effective in several problems related to the modeling and coding of speech signals. For speech analysis, we provide predictors that are accurate in modeling the speech production process and overcome problems related to traditional linear prediction. In particular, the predictors obtained offer a more effective decoupling of the vocal tract transfer function and its underlying excitation, making it a very efficient method for the analysis of voiced speech. For speech coding, we provide predictors that shape the residual according to the characteristics of the sparse encoding techniques, resulting in more straightforward coding strategies. Furthermore, encouraged by the promising application of compressed sensing in signal compression, we investigate its formulation and application to sparse linear predictive coding. The proposed estimators are all solutions to convex optimization problems, which can be solved efficiently and reliably using, e.g., interior-point methods. Extensive experimental results are provided to support the effectiveness of the proposed methods, showing the improvements over traditional linear prediction in both speech analysis and coding.
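
One member of this family can be sketched directly as a convex program: the 1-norm of the prediction residual (plus, optionally, of the coefficients) replaces the usual 2-norm criterion. cvxpy is used here merely as a convenient solver, and the regularization weight is an illustrative value.

```python
import numpy as np
import cvxpy as cp

def sparse_lp(x, order=10, gamma=0.1):
    """Linear prediction with a sparse (1-norm) residual criterion.

    Minimizes ||x - X a||_1 + gamma * ||a||_1, where X holds the past
    samples, instead of the usual 2-norm of the residual.
    """
    N = len(x)
    target = x[order:]
    # column k holds x[n - k] for n = order, ..., N-1
    X = np.column_stack([x[order - k : N - k] for k in range(1, order + 1)])
    a = cp.Variable(order)
    cost = cp.norm1(target - X @ a) + gamma * cp.norm1(a)
    cp.Problem(cp.Minimize(cost)).solve()
    residual = target - X @ a.value
    return a.value, residual

# toy usage on a synthetic voiced-like signal
rng = np.random.default_rng(0)
n = np.arange(400)
x = np.sin(2 * np.pi * 0.05 * n) + 0.01 * rng.normal(size=400)
coeffs, res = sparse_lp(x)
```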

Journal ArticleDOI
TL;DR: This work investigates the problem of locating planar reflectors in rooms, such as walls and furniture, from signals obtained using distributed microphones, by estimating the time of arrival (TOA) of reflected signals through analysis of acoustic impulse responses (AIRs).
Abstract: Acoustic scene reconstruction is a process that aims to infer characteristics of the environment from acoustic measurements. We investigate the problem of locating planar reflectors in rooms, such as walls and furniture, from signals obtained using distributed microphones. Specifically, localization of multiple two- dimensional (2-D) reflectors is achieved by estimation of the time of arrival (TOA) of reflected signals by analysis of acoustic impulse responses (AIRs). The estimated TOAs are converted into elliptical constraints about the location of the line reflector, which is then localized by combining multiple constraints. When multiple walls are present in the acoustic scene, an ambiguity problem arises, which we show can be addressed using the Hough transform. Additionally, the Hough transform significantly improves the robustness of the estimation for noisy measurements. The proposed approach is evaluated using simulated rooms under a variety of different controlled conditions where the floor and ceiling are perfectly absorbing. Results using AIRs measured in a real environment are also given. Additionally, results showing the robustness to additive noise in the TOA information are presented, with particular reference to the improvement achieved through the use of the Hough transform.
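
A toy 2-D sketch of the Hough-voting step is shown below: candidate reflection points, e.g., sampled from the elliptical TOA constraints, vote for line parameters (theta, rho), and the accumulator peak selects the reflector. The discretization and the synthetic wall are placeholders, not the paper's full geometry.

```python
import numpy as np

def hough_line_vote(points, n_theta=180, rho_res=0.05, rho_max=10.0):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta).

    points : (N, 2) candidate reflection points, e.g., sampled from the
    elliptical TOA constraints of several source/microphone pairs. The
    accumulator peak gives the (theta, rho) of the dominant planar reflector.
    """
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.arange(-rho_max, rho_max, rho_res)
    acc = np.zeros((n_theta, len(rhos)), dtype=int)
    for x, y in points:
        r = x * np.cos(thetas) + y * np.sin(thetas)
        idx = np.round((r + rho_max) / rho_res).astype(int)
        ok = (idx >= 0) & (idx < len(rhos))
        acc[np.arange(n_theta)[ok], idx[ok]] += 1
    i, j = np.unravel_index(acc.argmax(), acc.shape)
    return thetas[i], rhos[j]

# toy usage: noisy candidate points scattered near the wall y = 2
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-3, 3, 200), 2.0 + 0.05 * rng.normal(size=200)])
theta, rho = hough_line_vote(pts)
print(theta, rho)    # theta ~ pi/2, rho ~ 2 for a horizontal wall at y = 2
```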

Journal ArticleDOI
TL;DR: The proposed DFW with Amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as is confirmed in extensive objective and subjective testing.
Abstract: In Voice Conversion (VC), the speech of a source speaker is modified to resemble that of a particular target speaker. Currently, standard VC approaches use Gaussian mixture model (GMM)-based transformations that do not generate high-quality converted speech due to “over-smoothing” resulting from weak links between individual source and target frame parameters. Dynamic Frequency Warping (DFW) offers an appealing alternative to GMM-based methods, as more spectral details are maintained in transformation; however, the speaker timbre is less successfully converted because spectral power is not adjusted explicitly. Previous work combines separate GMM- and DFW-transformed spectral envelopes for each frame. This paper proposes a more effective DFW-based approach that (1) does not rely on the baseline GMM methods, and (2) functions on the acoustic class level. To adjust spectral power, an amplitude scaling function is used that compares the average target and warped source log spectra for each acoustic class. The proposed DFW with Amplitude scaling (DFWA) outperforms standard GMM and hybrid GMM-DFW methods for VC in terms of both speech quality and timbre conversion, as is confirmed in extensive objective and subjective testing. Furthermore, by not requiring time-alignment of source and target speech, DFWA is able to perform equally well using parallel or nonparallel corpora, as is demonstrated explicitly.

Journal ArticleDOI
TL;DR: This paper provides detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus and proposes the multitaper method for MFCC extraction with a practical focus.
Abstract: In speech and audio applications, short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage but variance of the spectrum estimate remains high. An elegant extension to windowed DFT is the so-called multitaper method which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We provide, first, detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers with universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing the conventional MFCCs.
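
The multitaper spectrum estimate at the heart of the method can be sketched as below: DPSS tapers give several periodograms that are averaged (uniform weights here) before the usual mel filtering, log, and DCT. The taper count, time-bandwidth product, and the hand-rolled mel filterbank are illustrative choices, not the paper's configuration.

```python
import numpy as np
from scipy.signal.windows import dpss
from scipy.fft import rfft, dct

def mel_filterbank(n_mels, n_fft, sr):
    """Small triangular mel filterbank (illustrative, not a library routine)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def multitaper_mfcc(frame, sr=16000, n_tapers=6, n_mels=26, n_ceps=13):
    """MFCCs from a multitaper spectrum estimate (uniform taper weights)."""
    N = len(frame)
    tapers = dpss(N, NW=4, Kmax=n_tapers)                  # (n_tapers, N)
    spectra = np.abs(rfft(tapers * frame, axis=1)) ** 2    # per-taper periodograms
    mt_spectrum = spectra.mean(axis=0)                     # low-variance estimate
    mel_energies = mel_filterbank(n_mels, N, sr) @ mt_spectrum
    return dct(np.log(np.maximum(mel_energies, 1e-10)), norm='ortho')[:n_ceps]

frame = np.random.default_rng(0).normal(size=512)
print(multitaper_mfcc(frame).shape)    # (13,)
```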

Journal ArticleDOI
TL;DR: A method is proposed for optimizing content-based similarity by learning from a sample of collaborative filter data; the optimized similarity metric can then be applied to answer queries on novel and unpopular items while still maintaining high recommendation accuracy.
Abstract: Many tasks in music information retrieval, such as recommendation, and playlist generation for online radio, fall naturally into the query-by-example setting, wherein a user queries the system by providing a song, and the system responds with a list of relevant or similar song recommendations. Such applications ultimately depend on the notion of similarity between items to produce high-quality results. Current state-of-the-art systems employ collaborative filter methods to represent musical items, effectively comparing items in terms of their constituent users. While collaborative filter techniques perform well when historical data is available for each item, their reliance on historical data impedes performance on novel or unpopular items. To combat this problem, practitioners rely on content-based similarity, which naturally extends to novel items, but is typically outperformed by collaborative filter methods. In this paper, we propose a method for optimizing content-based similarity by learning from a sample of collaborative filter data. The optimized content-based similarity metric can then be applied to answer queries on novel and unpopular items, while still maintaining high recommendation accuracy. The proposed system yields accurate and efficient representations of audio content, and experimental results show significant improvements in accuracy over competing content-based recommendation techniques.

Journal ArticleDOI
TL;DR: The Yet Another GCI/GOI Algorithm (YAGA) is proposed to detect GCIs from speech signals by employing multiscale analysis, the group delay function, and N-best dynamic programming; a novel GOI detector based upon the consistency of the candidates' closed quotients relative to the estimated GCIs is also presented.
Abstract: Accurate estimation of glottal closing instants (GCIs) and opening instants (GOIs) is important for speech processing applications that benefit from glottal-synchronous processing including pitch tracking, prosodic speech modification, speech dereverberation, synthesis and study of pathological voice. We propose the Yet Another GCI/GOI Algorithm (YAGA) to detect GCIs from speech signals by employing multiscale analysis, the group delay function, and N-best dynamic programming. A novel GOI detector based upon the consistency of the candidates' closed quotients relative to the estimated GCIs is also presented. Particular attention is paid to the precise definition of the glottal closed phase, which we define as the analysis interval that produces minimum deviation from an all-pole model of the speech signal with closed-phase linear prediction (LP). A reference algorithm analyzing both electroglottograph (EGG) and speech signals is described for evaluation of the proposed speech-based algorithm. In addition to the development of a GCI/GOI detector, an important outcome of this work is in demonstrating that GOIs derived from the EGG signal are not necessarily well-suited to closed-phase LP analysis. Evaluation of YAGA against the APLAWD and SAM databases shows that GCI identification rates of up to 99.3% can be achieved with an accuracy of 0.3 ms and GOI detection can be achieved equally reliably with an accuracy of 0.5 ms.

Journal ArticleDOI
TL;DR: The method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
Abstract: The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. A too sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. However, a too dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice-versa. Our method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
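
The composite-dictionary step can be sketched per magnitude frame as below: the mixture is sparse-coded over the concatenation of a speech and an interferer dictionary, and the speech estimate keeps only the contribution of the speech atoms. The random dictionaries and the OMP sparsity level (the knob behind the distortion/confusion trade-off) are placeholders.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

def enhance_frame(mixture_mag, D_speech, D_noise, n_nonzero=20):
    """Sparse-code one mixture magnitude frame in a composite dictionary.

    The frame is explained by atoms from the concatenation [D_speech; D_noise];
    the speech estimate keeps only the speech atoms' contribution. n_nonzero
    controls the sparsity of the code, and hence where the result sits between
    source distortion (too sparse) and source confusion (too dense).
    """
    D = np.vstack([D_speech, D_noise])            # (n_atoms_total, n_freq)
    coder = SparseCoder(dictionary=D, transform_algorithm='omp',
                        transform_n_nonzero_coefs=n_nonzero)
    code = coder.transform(mixture_mag[None, :])[0]          # (n_atoms_total,)
    n_sp = D_speech.shape[0]
    speech_est = code[:n_sp] @ D_speech
    noise_est = code[n_sp:] @ D_noise
    return speech_est, noise_est

# toy usage with random (placeholder) dictionaries of unit-norm atoms
rng = np.random.default_rng(0)
n_freq = 257
D_sp = rng.random((100, n_freq)); D_sp /= np.linalg.norm(D_sp, axis=1, keepdims=True)
D_ns = rng.random((100, n_freq)); D_ns /= np.linalg.norm(D_ns, axis=1, keepdims=True)
mix = 0.6 * D_sp[3] + 0.4 * D_ns[7]
s_hat, n_hat = enhance_frame(mix, D_sp, D_ns)
```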

Journal ArticleDOI
TL;DR: Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.
Abstract: An increasingly common scenario in building speech synthesis and recognition systems is training on inhomogeneous data. This paper proposes a new framework for estimating hidden Markov models on data containing both multiple speakers and multiple languages. The proposed framework, speaker and language factorization, attempts to factorize speaker-/language-specific characteristics in the data and then model them using separate transforms. Language-specific factors in the data are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees. Acoustic variations caused by speaker characteristics are handled by transforms based on constrained maximum-likelihood linear regression. Experimental results on statistical parametric speech synthesis show that the proposed framework enables data from multiple speakers in different languages to be used to: train a synthesis system; synthesize speech in a language using speaker characteristics estimated in a different language; and adapt to a new language.

Journal ArticleDOI
TL;DR: The goal is to recognize automatically “who is speaking what” in an online manner for meeting assistance; the techniques and the attempt to achieve low-latency monitoring of meetings are described, and experimental results for real-time meeting transcription are shown.
Abstract: This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to recognize automatically “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and the components are separated into individual speaker's channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve the low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.

Journal ArticleDOI
TL;DR: A novel systematic approach to the design of directivity patterns of higher order differential microphones is proposed; the patterns are obtained by optimizing a convex combination of a front-back energy ratio and uniformity within a frontal sector of interest.
Abstract: A novel systematic approach to the design of directivity patterns of higher order differential microphones is proposed. The directivity patterns are obtained by optimizing a cost function which is a convex combination of a front-back energy ratio and uniformity within a frontal sector of interest. Most of the standard directivity patterns - omnidirectional, cardioid, subcardioid, hypercardioid, supercardioid - are particular solutions of this optimization problem with specific values of two free parameters: the angular width of the frontal sector and the convex combination factor. More general solutions of practical use are obtained by varying these two parameters. Many of these optimal directivity patterns are trigonometric polynomials with complex roots. A new differential array structure that enables the implementation of general higher order directivity patterns, with complex or real roots, is then proposed. The effectiveness of the proposed design framework and the implementation structure are illustrated by design examples, simulations, and measurements.

Journal ArticleDOI
TL;DR: The proposed coherence-based algorithm was found to yield substantially higher intelligibility than that obtained by the beamforming algorithm, particularly when multiple noise sources or competing talker(s) were present.
Abstract: A novel dual-microphone speech enhancement technique is proposed in the present paper. The technique utilizes the coherence between the target and noise signals as a criterion for noise reduction and can be generally applied to arrays with closely spaced microphones, where noise captured by the sensors is highly correlated. The proposed algorithm is simple to implement and requires no estimation of noise statistics. In addition, it offers the capability of coping with multiple interfering sources that might be located at different azimuths. The proposed algorithm was evaluated with normal hearing listeners using intelligibility listening tests and compared against a well-established beamforming algorithm. Results indicated large gains in speech intelligibility relative to the baseline (front microphone) algorithm in both single and multiple-noise source scenarios. The proposed algorithm was found to yield substantially higher intelligibility than that obtained by the beamforming algorithm, particularly when multiple noise sources or competing talker(s) were present. Objective quality evaluation of the proposed algorithm also indicated significant quality improvement over that obtained by the beamforming algorithm. The intelligibility and quality benefits observed with the proposed coherence-based algorithm make it a viable candidate for hearing aid and cochlear implant devices.
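
A stripped-down sketch of the coherence idea is given below: the auto- and cross-power spectra of the two channels are smoothed recursively over STFT frames, the magnitude-squared coherence is computed per bin, and a gain derived from it is applied to the front channel. The smoothing factor and the simple gain rule are illustrative assumptions, not the paper's suppression function.

```python
import numpy as np
from scipy.signal import stft, istft

def coherence_gain_enhance(x_front, x_rear, fs=16000, alpha=0.8, floor=0.1):
    """Dual-microphone enhancement driven by inter-channel coherence.

    The magnitude-squared coherence (MSC) between the two closely spaced
    microphones is used as a simple per-bin gain, G = clip(MSC, floor, 1),
    applied to the front channel; this is only an illustrative rule.
    """
    f, t, X1 = stft(x_front, fs=fs, nperseg=512)
    _, _, X2 = stft(x_rear, fs=fs, nperseg=512)
    P11 = np.zeros(X1.shape[0])
    P22 = np.zeros(X1.shape[0])
    P12 = np.zeros(X1.shape[0], dtype=complex)
    Y = np.zeros_like(X1)
    for m in range(X1.shape[1]):                  # recursive PSD smoothing
        P11 = alpha * P11 + (1 - alpha) * np.abs(X1[:, m]) ** 2
        P22 = alpha * P22 + (1 - alpha) * np.abs(X2[:, m]) ** 2
        P12 = alpha * P12 + (1 - alpha) * X1[:, m] * np.conj(X2[:, m])
        msc = np.abs(P12) ** 2 / np.maximum(P11 * P22, 1e-12)
        gain = np.clip(msc, floor, 1.0)
        Y[:, m] = gain * X1[:, m]
    _, y = istft(Y, fs=fs, nperseg=512)
    return y
```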