
Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2014"


Journal ArticleDOI
TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Abstract: Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure of CNNs, such as local connectivity, weight sharing, and pooling, exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.

1,948 citations
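As a rough, hypothetical illustration of the local connectivity, weight sharing, and pooling that the abstract describes, the sketch below applies a single filter along the frequency axis of a filterbank feature map and then max-pools neighbouring frequency positions. The feature sizes, the filter values, and the pooling width are arbitrary stand-ins, not the configuration used in the paper.

```python
import numpy as np

# Toy filterbank feature map: 40 mel bands x 11 context frames (random stand-in).
rng = np.random.default_rng(0)
features = rng.standard_normal((40, 11))

# One convolutional filter spanning 8 frequency bands (full weight sharing
# along frequency); a real CNN would learn many such filters.
filt = rng.standard_normal((8, 11))
bias = 0.1

# Convolve along the frequency axis only.
n_bands, filt_len = features.shape[0], filt.shape[0]
activations = np.array([
    np.maximum(0.0, np.sum(features[f:f + filt_len] * filt) + bias)  # ReLU
    for f in range(n_bands - filt_len + 1)
])

# Max-pooling over groups of 3 neighbouring frequency positions gives the
# small shift invariance along frequency mentioned in the abstract.
pool = 3
n_pooled = len(activations) // pool
pooled = activations[:n_pooled * pool].reshape(n_pooled, pool).max(axis=1)

print(activations.shape, pooled.shape)  # (33,) (11,)
```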


Journal ArticleDOI
TL;DR: Results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics, and that masking based targets, in general, are significantly better than spectral envelope based targets.
Abstract: Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.

1,046 citations
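For concreteness, here is a minimal sketch of how two of the compared targets, the ideal binary mask (IBM) and the ideal ratio mask (IRM), can be computed when clean speech and noise spectrograms are available during training. The 0 dB local-SNR criterion and the square-root form of the IRM are common conventions, assumed here rather than taken from the paper.

```python
import numpy as np

def ideal_masks(speech_mag, noise_mag, lc_db=0.0):
    """Compute IBM and IRM from clean-speech and noise magnitude spectrograms.

    speech_mag, noise_mag: arrays of shape (freq_bins, frames).
    lc_db: local SNR criterion (dB) for the binary mask.
    """
    speech_pow = speech_mag ** 2
    noise_pow = noise_mag ** 2
    eps = np.finfo(float).eps

    local_snr_db = 10.0 * np.log10((speech_pow + eps) / (noise_pow + eps))
    ibm = (local_snr_db > lc_db).astype(float)                   # 0/1 target per T-F unit
    irm = np.sqrt(speech_pow / (speech_pow + noise_pow + eps))   # soft target in [0, 1]
    return ibm, irm

# Toy example with random magnitudes standing in for STFT magnitudes.
rng = np.random.default_rng(1)
ibm, irm = ideal_masks(rng.random((257, 100)), rng.random((257, 100)))
print(ibm.mean(), irm.mean())
```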


Journal ArticleDOI
TL;DR: A thorough overview of modern noise-robust techniques for ASR developed over the past 30 years is provided and methods that are proven to be successful and that are likely to sustain or expand their future applicability are emphasized.
Abstract: New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.

534 citations


Journal ArticleDOI
TL;DR: The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models; however, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.
Abstract: Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), boosting and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.

430 citations
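The layer-wise pretraining mentioned above is built on Contrastive Divergence; the sketch below shows a single CD-1 update for one binary restricted Boltzmann machine, the building block that is stacked to form a DBN. Layer sizes, the learning rate, and the use of probabilities in the negative phase are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 100, 50, 0.01
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_vis = np.zeros(n_visible)
b_hid = np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid):
    """One Contrastive Divergence (CD-1) step for a binary-binary RBM."""
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0_samp = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    v1_prob = sigmoid(h0_samp @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Gradient approximation: data statistics minus reconstruction statistics.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

batch = (rng.random((16, n_visible)) < 0.3).astype(float)  # toy binary batch
W, b_vis, b_hid = cd1_update(batch, W, b_vis, b_hid)
```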


Journal ArticleDOI
TL;DR: A DNN is used to construct a global non-linear mapping relationship between the spectral envelopes of two speakers to significantly improve the performance in terms of both similarity and naturalness compared to conventional methods.
Abstract: This paper presents a new spectral envelope conversion method using deep neural networks (DNNs). The conventional joint density Gaussian mixture model (JDGMM) based spectral conversion methods perform stably and effectively. However, the speech generated by these methods suffers severe quality degradation due to the following two factors: 1) inadequacy of JDGMM in modeling the distribution of spectral features as well as the non-linear mapping relationship between the source and target speakers, 2) spectral detail loss caused by the use of high-level spectral features such as mel-cepstra. Previously, we have proposed to use the mixture of restricted Boltzmann machines (MoRBM) and the mixture of Gaussian bidirectional associative memories (MoGBAM) to cope with these problems. In this paper, we propose to use a DNN to construct a global non-linear mapping relationship between the spectral envelopes of two speakers. The proposed DNN is generatively trained by cascading two RBMs, which model the distributions of spectral envelopes of source and target speakers respectively, using a Bernoulli BAM (BBAM). Therefore, the proposed training method takes advantage of the strong modeling ability of RBMs in modeling the distribution of spectral envelopes and the superiority of BAMs in deriving the conditional distributions for conversion. Careful comparisons and analysis among the proposed method and some conventional methods are presented in this paper. The subjective results show that the proposed method can significantly improve the performance in terms of both similarity and naturalness compared to conventional methods.

246 citations


Journal ArticleDOI
TL;DR: It is shown that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal to noise ratios, even without explicit amplitude enhancement.
Abstract: The enhancement of speech that is corrupted by noise is commonly performed in the short-time discrete Fourier transform domain. If only a single microphone signal is available, typically only the spectral amplitude is modified. However, it has recently been shown that an improved spectral phase can also be utilized for speech enhancement, e.g., for phase-sensitive amplitude estimation. In this paper, we therefore present a method to reconstruct the spectral phase of voiced speech from only the fundamental frequency and the noisy observation. The importance of the spectral phase is highlighted and we elaborate on the reason why noise reduction can be achieved by modifications of the spectral phase. We show that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal to noise ratios, even without explicit amplitude enhancement.

197 citations
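A minimal sketch of the underlying idea, reconstructing harmonic phases from a fundamental-frequency track by letting the phase of the k-th harmonic advance by 2*pi*k*f0*hop/fs per frame. This is a generic harmonic phase model under simplifying assumptions (frame-wise constant f0, harmonics assigned to their nearest STFT bins), not the complete method of the paper.

```python
import numpy as np

def harmonic_phase(f0_track, fs, n_fft, hop):
    """Reconstruct STFT phases of harmonics from an f0 track (Hz per frame).

    Returns a (n_fft//2 + 1, n_frames) phase matrix; bins without a harmonic
    are left at zero phase. f0_track entries of 0 mark unvoiced frames.
    """
    n_bins = n_fft // 2 + 1
    n_frames = len(f0_track)
    phase = np.zeros((n_bins, n_frames))
    bin_hz = fs / n_fft

    for k in range(1, 1 + int((fs / 2) // max(f0_track.max(), 1.0))):
        acc = 0.0  # running phase of the k-th harmonic
        for n, f0 in enumerate(f0_track):
            if f0 <= 0:          # unvoiced frame: no harmonic structure
                continue
            acc += 2.0 * np.pi * k * f0 * hop / fs
            harm_bin = int(round(k * f0 / bin_hz))
            if harm_bin < n_bins:
                phase[harm_bin, n] = np.angle(np.exp(1j * acc))  # wrap to (-pi, pi]
    return phase

f0 = np.full(200, 120.0)              # toy constant 120 Hz pitch track
phi = harmonic_phase(f0, fs=16000, n_fft=512, hop=128)
print(phi.shape)
```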


Journal ArticleDOI
TL;DR: PEFAC is presented, a fundamental frequency estimation algorithm for speech that is able to identify voiced frames and estimate pitch reliably even at negative signal-to-noise ratios and performs well in both high and low levels of additive noise.
Abstract: We present PEFAC, a fundamental frequency estimation algorithm for speech that is able to identify voiced frames and estimate pitch reliably even at negative signal-to-noise ratios. The algorithm combines a normalization stage, to remove channel dependency and to attenuate strong noise components, with a harmonic summing filter applied in the log-frequency power spectral domain, the impulse response of which is chosen to sum the energy of the fundamental frequency harmonics while attenuating smoothly-varying noise components. Temporal continuity constraints are applied to the selected pitch candidates and a voiced speech probability is computed from the likelihood ratio of two classifiers, one for voiced speech and one for unvoiced speech/silence. We compare the performance of our algorithm with that of other widely used algorithms and demonstrate that it performs well in both high and low levels of additive noise.

188 citations
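The core operation, summing spectral energy at harmonics of each pitch candidate, can be sketched as follows. This is a plain harmonic-summation estimator on a single frame's power spectrum; the log-frequency filtering, normalization, voicing classifier, and temporal continuity stages that PEFAC adds are omitted, and all parameter values are illustrative.

```python
import numpy as np

def harmonic_sum_pitch(frame, fs, fmin=60.0, fmax=400.0, n_harm=5):
    """Estimate pitch of one frame by summing power at harmonic positions."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2

    candidates = np.arange(fmin, fmax, 1.0)          # 1 Hz grid of f0 candidates
    scores = np.zeros_like(candidates)
    for i, f0 in enumerate(candidates):
        harm = f0 * np.arange(1, n_harm + 1)
        bins = np.round(harm / (fs / n_fft)).astype(int)
        bins = bins[bins < len(spectrum)]
        scores[i] = spectrum[bins].sum()
    return candidates[np.argmax(scores)]

# Toy voiced frame: 150 Hz harmonic signal plus noise.
fs, n = 8000, 1024
t = np.arange(n) / fs
frame = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 5))
frame += 0.3 * np.random.default_rng(0).standard_normal(n)
print(harmonic_sum_pitch(frame, fs))   # expected to land near 150 Hz
```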


Journal ArticleDOI
TL;DR: The subjective listening tests indicate that the naturalness of the converted speech by the proposed method is comparable with that by the ML-GMM method with global variance constraint, and the results show the superiority of the method over PLS-based methods.
Abstract: We propose a nonparametric framework for voice conversion, that is, exemplar-based sparse representation with residual compensation. In this framework, a spectrogram is reconstructed as a weighted linear combination of speech segments, called exemplars, which span multiple consecutive frames. The linear combination weights are constrained to be sparse to avoid over-smoothing, and high-resolution spectra are employed in the exemplars directly without dimensionality reduction to maintain spectral details. In addition, a spectral compression factor and a residual compensation technique are included in the framework to enhance the conversion performances. We conducted experiments on the VOICES database to compare the proposed method with a large set of state-of-the-art baseline methods, including the maximum likelihood Gaussian mixture model (ML-GMM) with dynamic feature constraint and the partial least squares (PLS) regression based methods. The experimental results show that the objective spectral distortion of ML-GMM is reduced from 5.19 dB to 4.92 dB, and both the subjective mean opinion score and the speaker identification rate are increased from 2.49 and 73.50% to 3.15 and 79.50%, respectively, by the proposed method. The results also show the superiority of our method over PLS-based methods. In addition, the subjective listening tests indicate that the naturalness of the converted speech by our proposed method is comparable with that by the ML-GMM method with global variance constraint.

179 citations
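The activation estimation at the heart of exemplar-based frameworks of this kind can be sketched as non-negative sparse coding: a spectrogram is approximated by a fixed dictionary of exemplar spectra (single frames here, rather than the multi-frame segments used in the paper) times sparse non-negative activations. The code below uses the standard multiplicative KL-divergence NMF update with an L1 penalty; the spectral compression factor and residual compensation described in the abstract are not reproduced.

```python
import numpy as np

def sparse_activations(V, W, n_iter=100, sparsity=0.1):
    """Estimate non-negative activations H so that V ~ W @ H (KL divergence),
    with an L1 sparsity penalty on H. W holds exemplar spectra in its columns."""
    eps = np.finfo(float).eps
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # Multiplicative update for KL-divergence NMF with a sparsity term.
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + sparsity)
    return H

# Toy data: 20 exemplars of 257-bin spectra, 50-frame "spectrogram" to encode.
rng = np.random.default_rng(1)
W = rng.random((257, 20))
V = W @ (rng.random((20, 50)) * (rng.random((20, 50)) < 0.2))  # sparse ground truth
H = sparse_activations(V, W)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))           # relative error
```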


Journal ArticleDOI
TL;DR: A simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under speaker diarization conditions and state of the art results as measured by the Diarization Error Rate and the Number of Detected Speakers on the LDC CallHome telephone corpus are reported.
Abstract: Speaker clustering is a crucial step for speaker diarization. The short duration of speech segments in telephone speech dialogue and the absence of prior information on the number of clusters dramatically increase the difficulty of this problem in diarizing spontaneous telephone speech conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the cosine distance Mean Shift are compared in an exhaustive practical study. We report state of the art results as measured by the Diarization Error Rate and the Number of Detected Speakers on the LDC CallHome telephone corpus.

167 citations
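A minimal sketch of a mean-shift iteration under the cosine distance, the kind of procedure the clustering above builds on: each point's mode estimate is repeatedly replaced by the normalized mean of the points whose cosine similarity to it exceeds a window threshold, and points converging to nearby modes are grouped. In practice the inputs would be per-segment speaker factors such as i-vectors; the threshold and grouping rule here are illustrative assumptions.

```python
import numpy as np

def cosine_mean_shift(X, sim_threshold=0.8, n_iter=30):
    """Mean shift on unit-normalized vectors with a cosine-similarity window.

    X: (n_points, dim) array, e.g. speaker factors extracted per segment.
    Returns cluster labels obtained by grouping converged modes."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    modes = X.copy()
    for _ in range(n_iter):
        sims = modes @ X.T                       # cosine similarities to all points
        for i in range(len(modes)):
            window = X[sims[i] > sim_threshold]  # points inside the window
            if len(window):
                m = window.mean(axis=0)
                modes[i] = m / (np.linalg.norm(m) + 1e-12)
    # Group points whose converged modes are almost identical.
    labels = -np.ones(len(X), dtype=int)
    next_label = 0
    for i in range(len(X)):
        if labels[i] >= 0:
            continue
        same = ((modes @ modes[i]) > 0.99) & (labels < 0)
        labels[same] = next_label
        next_label += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((30, 50)) + 3 * rng.standard_normal(50)
               for _ in range(3)])               # 3 toy "speakers"
print(cosine_mean_shift(X))
```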


Journal ArticleDOI
TL;DR: This paper proposes a novel multiple-speaker localization technique suitable for environments with high reverberation, based on a spherical microphone array and processing in the spherical harmonics (SH) domain, and validates the robustness of the proposed method in the presence of high reverberations.
Abstract: One of the major challenges encountered when localizing multiple speakers in real world environments is the need to overcome the effect of multipath distortion due to room reverberation A wide range of methods has been proposed for speaker localization, many based on microphone array processing Some of these methods are designed for the localization of coherent sources, typical of multipath environments, and some have even reported limited robustness to reverberation Nevertheless, speaker localization under conditions of high reverberation still remains a challenging task This paper proposes a novel multiple-speaker localization technique suitable for environments with high reverberation, based on a spherical microphone array and processing in the spherical harmonics (SH) domain The non-stationarity and sparsity of speech, as well as frequency smoothing in the SH domain, are exploited in the development of a direct-path dominance test This test can identify time-frequency (TF) bins that contain contributions from only one significant source and no significant contribution from room reflections, such that localization based on these selected TF-bins is performed accurately, avoiding the potential distortion due to other sources and reverberation Computer simulations and an experiment in a real reverberant room validate the robustness of the proposed method in the presence of high reverberation

163 citations


Journal ArticleDOI
TL;DR: A general adaptation scheme for DNNs based on discriminant condition codes is proposed, in which the codes are directly fed to various layers of a pre-trained DNN through a new set of connection weights; the resulting methods are quite effective for adapting large DNN models using only a small amount of adaptation data.
Abstract: Fast adaptation of deep neural networks (DNN) is an important research topic in deep learning. In this paper, we have proposed a general adaptation scheme for DNN based on discriminant condition codes, which are directly fed to various layers of a pre-trained DNN through a new set of connection weights. Moreover, we present several training methods to learn connection weights from training data as well as the corresponding adaptation methods to learn new condition code from adaptation data for each new test condition. In this work, the fast adaptation scheme is applied to supervised speaker adaptation in speech recognition based on either frame-level cross-entropy or sequence-level maximum mutual information training criterion. We have proposed three different ways to apply this adaptation scheme based on the so-called speaker codes: i) Nonlinear feature normalization in feature space; ii) Direct model adaptation of DNN based on speaker codes; iii) Joint speaker adaptive training with speaker codes. We have evaluated the proposed adaptation methods in two standard speech recognition tasks, namely TIMIT phone recognition and large vocabulary speech recognition in the Switchboard task. Experimental results have shown that all three methods are quite effective in adapting large DNN models using only a small amount of adaptation data. For example, the Switchboard results have shown that the proposed speaker-code-based adaptation methods may achieve up to 8-10% relative error reduction using only a few dozen adaptation utterances per speaker. Finally, we have achieved very good performance in Switchboard (12.1% in WER) after speaker adaptation using sequence training criterion, which is very close to the best performance reported in this task ("Deep convolutional neural networks for LVCSR," T. N. Sainath et al., Proc. IEEE Acoust., Speech, Signal Process., 2013).
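A hypothetical sketch of the general mechanism of condition/speaker codes: a small code vector is fed into a hidden layer of a pre-trained network through an extra set of connection weights, and only the code is re-estimated for a new speaker at adaptation time. Layer sizes, the single injection point, and the tanh/softmax choices are illustrative, not the paper's architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_hid, dim_out, dim_code = 440, 1024, 100, 50

# Pre-trained weights (random stand-ins) and the new speaker-code connections A.
W1, b1 = 0.01 * rng.standard_normal((dim_hid, dim_in)), np.zeros(dim_hid)
W2, b2 = 0.01 * rng.standard_normal((dim_out, dim_hid)), np.zeros(dim_out)
A = 0.01 * rng.standard_normal((dim_hid, dim_code))   # learned on training speakers

def forward(x, speaker_code):
    """The hidden activation receives the speaker code through extra weights A.
    At adaptation time only `speaker_code` would be re-estimated for a new
    speaker, while W1, W2, A stay fixed."""
    h = np.tanh(W1 @ x + A @ speaker_code + b1)
    logits = W2 @ h + b2
    return np.exp(logits) / np.exp(logits).sum()      # softmax over output classes

x = rng.standard_normal(dim_in)           # e.g. stacked filterbank frames
code = rng.standard_normal(dim_code)      # speaker code for the current speaker
print(forward(x, code).sum())             # probabilities sum to 1
```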

Journal ArticleDOI
TL;DR: It is shown that GP policy optimization can be implemented for a real world POMDP dialog manager, and it is demonstrated that designer effort can be substantially reduced by basing the policy directly on the full belief space thereby avoiding ad hoc feature space modeling.
Abstract: A partially observable Markov decision process (POMDP) has been proposed as a dialog model that enables automatic optimization of the dialog policy and provides robustness to speech understanding errors. Various approximations allow such a model to be used for building real-world dialog systems. However, they require a large number of dialogs to train the dialog policy and hence they typically rely on the availability of a user simulator. They also require significant designer effort to hand-craft the policy representation. We investigate the use of Gaussian processes (GPs) in policy modeling to overcome these problems. We show that GP policy optimization can be implemented for a real world POMDP dialog manager, and in particular: 1) we examine different formulations of a GP policy to minimize variability in the learning process; 2) we find that the use of GP increases the learning rate by an order of magnitude thereby allowing learning by direct interaction with human users; and 3) we demonstrate that designer effort can be substantially reduced by basing the policy directly on the full belief space thereby avoiding ad hoc feature space modeling. Overall, the GP approach represents an important step forward towards fully automatic dialog policy optimization in real world systems.

Journal ArticleDOI
TL;DR: This study systematically evaluates a range of promising features for classification-based separation using six nonstationary noises at the low SNR level of -5 dB, and proposes a new feature called the multi-resolution cochleagram (MRCG), which, according to experimental results, gives the best classification results among all evaluated features.
Abstract: Speech separation can be formulated as a classification problem. In classification-based speech separation, supervised learning is employed to classify time-frequency units as either speech-dominant or noise-dominant. In very low signal-to-noise ratio (SNR) conditions, acoustic features extracted from a mixture are crucial for correct classification. In this study, we systematically evaluate a range of promising features for classification-based separation using six nonstationary noises at the low SNR level of -5 dB, which is chosen with the goal of improving human speech intelligibility in mind. In addition, we propose a new feature called multi-resolution cochleagram (MRCG). The new feature is constructed by combining four cochleagrams at different spectrotemporal resolutions in order to capture both the local and contextual information. Experimental results show that MRCG gives the best classification results among all evaluated features. In addition, our results indicate that auto-regressive moving average (ARMA) filtering, a post-processing technique for improving automatic speech recognition features, also improves many acoustic features for speech separation.
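The multi-resolution idea can be sketched as follows, assuming a base cochleagram (channels by frames) is already available from a gammatone front end: coarser copies are obtained by local averaging and all resolutions are stacked into one feature matrix. The smoothing window sizes below are illustrative assumptions rather than the exact MRCG configuration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def multi_resolution_stack(cochleagram, window_sizes=(1, 5, 11, 23)):
    """Stack a cochleagram with locally averaged (coarser) copies of itself.

    cochleagram: (channels, frames) array of subband energies.
    Returns an array of shape (len(window_sizes) * channels, frames)."""
    views = [uniform_filter(cochleagram, size=w, mode="nearest")
             for w in window_sizes]            # size=1 leaves the input unchanged
    return np.vstack(views)

cg = np.abs(np.random.default_rng(0).standard_normal((64, 200)))  # toy cochleagram
feat = multi_resolution_stack(cg)
print(feat.shape)   # (256, 200)
```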

Journal ArticleDOI
TL;DR: Imprecise consonant articulation was found to be the most powerful indicator of PD-related dysarthria and is envisaged as the first step towards development of acoustic methods allowing the automated assessment of articulatory features in dysarthrias.
Abstract: Although articulatory deficits represent an important manifestation of dysarthria in Parkinson's disease (PD), the most widely used methods currently available for the automatic evaluation of speech performance are focused on the assessment of dysphonia. The aim of the present study was to design a reliable automatic approach for the precise estimation of articulatory deficits in PD. Twenty-four individuals diagnosed with de novo PD and twenty-two age-matched healthy controls were recruited. Each participant performed diadochokinetic tasks based upon the fast repetition of /pa/-/ta/-/ka/ syllables. All phonemes were manually labeled and an algorithm for their automatic detection was designed. Subsequently, 13 features describing six different articulatory aspects of speech including vowel quality, coordination of laryngeal and supralaryngeal activity, precision of consonant articulation, tongue movement, occlusion weakening, and speech timing were analyzed. In addition, a classification experiment using a support vector machine based on articulatory features was proposed to differentiate between PD patients and healthy controls. The proposed detection algorithm reached approximately 80% accuracy for a 5 ms threshold of absolute difference between manually labeled references and automatically detected positions. When compared to controls, PD patients showed impaired articulatory performance in all investigated speech dimensions (p

Journal ArticleDOI
TL;DR: A robust SID with speaker models trained in selected reverberant conditions is performed, on the basis of bounded marginalization and direct masking, which substantially improves SID performance over related systems in a wide range of reverberation time and signal-to-noise ratios.
Abstract: Robustness of speaker recognition systems is crucial for real-world applications, which typically contain both additive noise and room reverberation. However, the combined effects of additive noise and convolutive reverberation have been rarely studied in speaker identification (SID). This paper addresses this issue in two phases. We first remove background noise through binary masking using a deep neural network classifier. Then we perform robust SID with speaker models trained in selected reverberant conditions, on the basis of bounded marginalization and direct masking. Evaluation results show that the proposed system substantially improves SID performance over related systems in a wide range of reverberation time and signal-to-noise ratios.

Journal ArticleDOI
TL;DR: An in-depth evaluation of such techniques as a front-end for noise-robust automatic speech recognition (ASR) and a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks and HMM are performed.
Abstract: Recently, supervised classification has been shown to work well for the task of speech separation. We perform an in-depth evaluation of such techniques as a front-end for noise-robust automatic speech recognition (ASR). The proposed separation front-end consists of two stages. The first stage removes additive noise via time-frequency masking. The second stage addresses channel mismatch and the distortions introduced by the first stage; a non-linear function is learned that maps the masked spectral features to their clean counterpart. Results show that the proposed front-end substantially improves ASR performance when the acoustic models are trained in clean conditions. We also propose a diagonal feature discriminant linear regression (dFDLR) adaptation that can be performed on a per-utterance basis for ASR systems employing deep neural networks and HMM. Results show that dFDLR consistently improves performance in all test conditions. Surprisingly, the best average results are obtained when dFDLR is applied to models trained using noisy log-Mel spectral features from the multi-condition training set. With no channel mismatch, the best results are obtained when the proposed speech separation front-end is used along with multi-condition training using log-Mel features followed by dFDLR adaptation. Both of these results are among the best on the Aurora-4 dataset.

Journal ArticleDOI
TL;DR: Evaluations and comparisons show that DNN-based binaural classification produces superior segregation performance in a variety of multisource and reverberant conditions.
Abstract: Speech signal degradation in real environments mainly results from room reverberation and concurrent noise. While human listening is robust in complex auditory scenes, current speech segregation algorithms do not perform well in noisy and reverberant environments. We treat the binaural segregation problem as binary classification, and employ deep neural networks (DNNs) for the classification task. The binaural features of the interaural time difference and interaural level difference are used as the main auditory features for classification. The monaural feature of gammatone frequency cepstral coefficients is also used to improve classification performance, especially when interference and target speech are collocated or very close to one another. We systematically examine DNN generalization to untrained spatial configurations. Evaluations and comparisons show that DNN-based binaural classification produces superior segregation performance in a variety of multisource and reverberant conditions.

Journal ArticleDOI
TL;DR: This paper presents low-rank approximation based multichannel Wiener filter algorithms for noise reduction in speech plus noise scenarios, with application in cochlear implants and introduces a more robust rank-1, or more generally rank-R, approximation of the autocorrelation matrix of the speech signal.
Abstract: This paper presents low-rank approximation based multichannel Wiener filter algorithms for noise reduction in speech plus noise scenarios, with application in cochlear implants. In a single speech source scenario, the frequency-domain autocorrelation matrix of the speech signal is often assumed to be a rank-1 matrix, which then allows one to derive different rank-1 approximation based noise reduction filters. In practice, however, the rank of the autocorrelation matrix of the speech signal is usually greater than one. Firstly, the link between the different rank-1 approximation based noise reduction filters and the original speech distortion weighted multichannel Wiener filter is investigated when the rank of the autocorrelation matrix of the speech signal is indeed greater than one. Secondly, in low input signal-to-noise-ratio scenarios, due to noise non-stationarity, the estimation of the auto-correlation matrix of the speech signal can be problematic and the noise reduction filters can deliver unpredictable noise reduction performance. An eigenvalue decomposition based filter and a generalized eigenvalue decomposition based filter are introduced that include a more robust rank-1, or more generally rank-R, approximation of the autocorrelation matrix of the speech signal. These noise reduction filters are demonstrated to deliver a better noise reduction performance especially in low input signal-to-noise-ratio scenarios. The filters are especially useful in cochlear implants, where more speech distortion and hence a more aggressive noise reduction can be tolerated.
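A compact sketch of a rank-1 multichannel Wiener filter of the kind discussed above: the speech correlation matrix is estimated as the difference of noisy and noise correlation matrices, forced to rank one through its dominant eigenpair, and inserted into a speech-distortion-weighted MWF for a chosen reference microphone. The matrices are toy values, and the GEVD-based variant and numerical safeguards are omitted.

```python
import numpy as np

def rank1_mwf(R_noisy, R_noise, mu=1.0, ref=0):
    """Rank-1 speech-distortion-weighted multichannel Wiener filter.

    R_noisy, R_noise: (M, M) spatial correlation matrices of noisy speech and noise.
    mu: speech-distortion vs. noise-reduction trade-off (mu = 1 gives the MWF).
    ref: index of the reference microphone."""
    R_speech = R_noisy - R_noise                 # may be indefinite in practice
    # Force rank one: keep only the dominant eigenpair of the speech estimate.
    eigval, eigvec = np.linalg.eigh(R_speech)
    r1 = eigval[-1] * np.outer(eigvec[:, -1], eigvec[:, -1].conj())
    # w = (Rs + mu*Rn)^{-1} Rs e_ref  (speech-distortion-weighted MWF).
    e_ref = np.zeros(R_noisy.shape[0])
    e_ref[ref] = 1.0
    return np.linalg.solve(r1 + mu * R_noise, r1 @ e_ref)

# Toy 4-microphone example: rank-1 speech (single source) plus diffuse-ish noise.
rng = np.random.default_rng(0)
a = rng.standard_normal(4) + 1j * rng.standard_normal(4)   # steering vector
R_s = 2.0 * np.outer(a, a.conj())
R_n = np.eye(4) + 0.1 * np.ones((4, 4))
w = rank1_mwf(R_s + R_n, R_n)
print(np.abs(w))
```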

Journal ArticleDOI
TL;DR: The proposed quasi closed phase (QCP) analysis method utilizes weighted linear prediction with a specific attenuated main excitation (AME) weight function that attenuates the contribution of the glottal source in the linear prediction model optimization.
Abstract: This study presents a new glottal inverse filtering (GIF) technique based on closed phase analysis over multiple fundamental periods. The proposed quasi closed phase (QCP) analysis method utilizes weighted linear prediction (WLP) with a specific attenuated main excitation (AME) weight function that attenuates the contribution of the glottal source in the linear prediction model optimization. This enables the use of the autocorrelation criterion in linear prediction in contrast to the covariance criterion used in conventional closed phase analysis. The QCP method was compared to previously developed methods by using synthetic vowels produced with the conventional source-filter model as well as with a physical modeling approach. The obtained objective measures show that the QCP method improves the GIF performance in terms of errors in typical glottal source parametrizations for both low- and high-pitched vowels. Additionally, QCP was tested in a physiologically oriented vocoder, where the analysis/synthesis quality was evaluated with a subjective listening test indicating improved perceived quality for normal speaking style.
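The weighted linear prediction step can be sketched as an ordinary weighted least-squares problem: prediction coefficients minimize the weighted squared prediction error, with the weight function de-emphasizing samples around the main excitation. The weight function below is simply supplied by the caller; the AME weight design of the paper is not reproduced.

```python
import numpy as np

def weighted_lpc(signal, order, weights):
    """Weighted linear prediction: minimize sum_n w[n] * (s[n] - sum_k a_k s[n-k])^2.

    Returns prediction coefficients a_1..a_order."""
    n = len(signal)
    # Regression over samples order..n-1: column k holds the k+1 sample lag.
    rows = np.stack([signal[order - k - 1:n - k - 1] for k in range(order)], axis=1)
    target = signal[order:]
    w = np.sqrt(weights[order:])
    a, *_ = np.linalg.lstsq(rows * w[:, None], target * w, rcond=None)
    return a

# Toy example: coefficients of an AR(2) signal recovered with uniform weights.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = 1.3 * x[t - 1] - 0.6 * x[t - 2] + 0.1 * rng.standard_normal()
a = weighted_lpc(x, order=2, weights=np.ones_like(x))
print(a)   # approximately [1.3, -0.6]
```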

Journal ArticleDOI
TL;DR: This paper estimates pitch using supervised learning, where the probabilistic pitch states are directly learned from noisy speech data, and investigates two alternative neural networks modeling pitch state distribution given observations.
Abstract: Pitch determination is a fundamental problem in speech processing, which has been studied for decades. However, it is challenging to determine pitch in strong noise because the harmonic structure is corrupted. In this paper, we estimate pitch using supervised learning, where the probabilistic pitch states are directly learned from noisy speech data. We investigate two alternative neural networks modeling pitch state distribution given observations. The first one is a feedforward deep neural network (DNN), which is trained on static frame-level acoustic features. The second one is a recurrent deep neural network (RNN) which is trained on sequential frame-level features and capable of learning temporal dynamics. Both DNNs and RNNs produce accurate probabilistic outputs of pitch states, which are then connected into pitch contours by Viterbi decoding. Our systematic evaluation shows that the proposed pitch tracking algorithms are robust to different noise conditions and can even be applied to reverberant speech. The proposed approach also significantly outperforms other state-of-the-art pitch tracking algorithms.
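The final step described above, linking frame-wise pitch-state posteriors into a contour with Viterbi decoding, can be sketched as follows. The network outputs are replaced by random posteriors, and the transition model simply favours small jumps between neighbouring pitch states; both are illustrative assumptions.

```python
import numpy as np

def viterbi_pitch(posteriors, jump_std=2.0):
    """Decode a pitch-state sequence from per-frame state posteriors.

    posteriors: (n_frames, n_states) probabilities from a DNN/RNN.
    jump_std: scale of a Gaussian transition prior over state jumps."""
    n_frames, n_states = posteriors.shape
    states = np.arange(n_states)
    # Log transition matrix favouring small changes in pitch state.
    log_trans = -0.5 * ((states[:, None] - states[None, :]) / jump_std) ** 2
    log_trans -= np.log(np.exp(log_trans).sum(axis=1, keepdims=True))

    eps = np.finfo(float).eps
    log_post = np.log(posteriors + eps)
    delta = log_post[0].copy()
    backptr = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans          # (from_state, to_state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    # Backtrack the best path.
    path = np.empty(n_frames, dtype=int)
    path[-1] = delta.argmax()
    for t in range(n_frames - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(68), size=100)   # toy posteriors: 100 frames, 68 states
print(viterbi_pitch(post)[:10])
```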

Journal ArticleDOI
TL;DR: The proposed SCM model is combined with a linear model for magnitudes and the parameter estimation is formulated in a complex-valued non-negative matrix factorization (CNMF) framework and is shown to exceed the performance of existing state of the art separation methods with two sources when evaluated by objective separation quality metrics.
Abstract: This paper addresses the problem of sound source separation from a multichannel microphone array capture via estimation of source spatial covariance matrix (SCM) of a short-time Fourier transformed mixture signal. In many conventional audio separation algorithms the source mixing parameter estimation is done separately for each frequency thus making them prone to errors and leading to suboptimal source estimates. In this paper we propose a SCM model which consists of a weighted sum of direction of arrival (DoA) kernels and estimate only the weights dependent on the source directions. In the proposed algorithm, the spatial properties of the sources become jointly optimized over all frequencies, leading to more coherent source estimates and mitigating the effect of spatial aliasing at high frequencies. The proposed SCM model is combined with a linear model for magnitudes and the parameter estimation is formulated in a complex-valued non-negative matrix factorization (CNMF) framework. Simulations consist of recordings made with an array the size of a hand-held device, with multiple microphones embedded inside the device casing. Separation quality of the proposed algorithm is shown to exceed the performance of existing state-of-the-art separation methods with two sources when evaluated by objective separation quality metrics.

Journal ArticleDOI
TL;DR: A novel geometric formulation is proposed, together with a thorough algebraic analysis and a global optimization solver, to solve the problem of sound-source localization from time-delay estimates using arbitrarily-shaped non-coplanar microphone arrays.
Abstract: This paper addresses the problem of sound-source localization from time-delay estimates using arbitrarily-shaped non-coplanar microphone arrays. A novel geometric formulation is proposed, together with a thorough algebraic analysis and a global optimization solver. The proposed model is thoroughly described and evaluated. The geometric analysis, stemming from the direct acoustic propagation model, leads to necessary and sufficient conditions for a set of time delays to correspond to a unique position in the source space. Such sets of time delays are referred to as feasible sets. We formally prove that every feasible set corresponds to exactly one position in the source space, whose value can be recovered using a closed-form localization mapping. Therefore we seek the optimal feasible set of time delays given, as input, the received microphone signals. This time delay estimation problem is naturally cast into a programming task, constrained by the feasibility conditions derived from the geometric analysis. A global branch-and-bound optimization technique is proposed to solve the problem at hand, hence estimating the best set of feasible time delays and, subsequently, localizing the sound source. Extensive experiments with both simulated and real data are reported; we compare our methodology to four state-of-the-art techniques. This comparison shows that the proposed method combined with the branch-and-bound algorithm outperforms existing methods. This in-depth geometric understanding, together with practical algorithms and encouraging results, opens several opportunities for future work.

Journal ArticleDOI
TL;DR: This paper presents a patchwork-based audio watermarking method to resist de-synchronization attacks such as pitch-scaling, time-scaling, and jitter attacks, and offers much higher embedding capacity than existing methods designed for such attacks.
Abstract: This paper presents a patchwork-based audio watermarking method to resist de-synchronization attacks such as pitch-scaling, time-scaling, and jitter attacks. At the embedding stage, the watermarks are embedded into the host audio signal in the discrete cosine transform (DCT) domain. Then, a set of synchronization bits are implanted into the watermarked signal in the logarithmic DCT (LDCT) domain. At the decoding stage, we analyze the received audio signal in the LDCT domain to find the scaling factor imposed by an attack. Then, we modify the received signal to remove the scaling effect, together with the embedded synchronization bits. After that, watermarks are extracted from the modified signal. Simulation results show that at the embedding rate of 10 bps, the proposed method achieves 98.9% detection rate on average under the considered de-synchronization attacks. At the embedding rate of 16 bps, it can still obtain 94.7% detection rate on average. So, the proposed method is much more robust to de-synchronization attacks than other patchwork watermarking methods. Compared with the audio watermarking methods designed for tackling de-synchronization attacks, our method has much higher embedding capacity.
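A toy illustration of the patchwork principle referred to above: to embed one bit, two pseudo-randomly chosen sets of mid-band DCT coefficients are shifted in opposite directions, and the decoder reads the bit from the sign of the difference of the set means. This is generic patchwork embedding in the DCT domain, not the paper's scheme with logarithmic-DCT synchronization bits and scale estimation.

```python
import numpy as np
from scipy.fft import dct, idct

def embed_bit(segment, bit, strength=0.02, seed=7):
    """Embed one bit into an audio segment via patchwork in the DCT domain."""
    coeffs = dct(segment, norm="ortho")
    rng = np.random.default_rng(seed)
    band = np.arange(len(coeffs) // 8, len(coeffs) // 2)   # mid-band coefficients
    picks = rng.permutation(band)
    set_a, set_b = picks[:200], picks[200:400]
    sign = 1.0 if bit else -1.0
    coeffs[set_a] += sign * strength
    coeffs[set_b] -= sign * strength
    return idct(coeffs, norm="ortho")

def decode_bit(segment, seed=7):
    coeffs = dct(segment, norm="ortho")
    rng = np.random.default_rng(seed)
    band = np.arange(len(coeffs) // 8, len(coeffs) // 2)
    picks = rng.permutation(band)
    set_a, set_b = picks[:200], picks[200:400]
    return coeffs[set_a].mean() - coeffs[set_b].mean() > 0

audio = 0.1 * np.random.default_rng(0).standard_normal(16000)   # 1 s of toy audio
marked = embed_bit(audio, bit=True)
print(decode_bit(marked), decode_bit(audio))   # True, and a coin flip for unmarked
```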

Journal ArticleDOI
TL;DR: This paper introduces an improved general distributed synchronous averaging (IGDSA) algorithm, which can be used in any connected network, and combines that with the DDSB algorithm where multiple node pairs can update their estimates simultaneously, and presents a distributed delay and sum beamformer.
Abstract: In this paper, we investigate the use of randomized gossip for distributed speech enhancement and present a distributed delay and sum beamformer (DDSB). In a randomly connected wireless acoustic sensor network, the DDSB estimates the desired signal at each node by communicating only with its neighbors. We first provide the asynchronous DDSB (ADDSB) where each pair of neighboring nodes updates its data asynchronously. Then, we introduce an improved general distributed synchronous averaging (IGDSA) algorithm, which can be used in any connected network, and combine that with the DDSB algorithm where multiple node pairs can update their estimates simultaneously. For convergence analysis, we first provide bounds for the worst case averaging time of the ADDSB for the best and worst connected networks, and then we compare the convergence rate of the ADDSB with the original synchronous DDSB (OSDDSB) and the improved synchronous DDSB (ISDDSB) in regular networks. This convergence rate comparison is extended to randomly connected non-regular networks using simulations. The simulation results show that the DDSB using the different updating schemes converges to the optimal estimates of the centralized beamformer and that the proposed IGDSA algorithm converges much faster than the original synchronous communication scheme, in particular for non-regular networks. Moreover, comparisons are performed with several existing distributed speech enhancement methods from literature, assuming that the steering vector is given. In the simulated scenario, the proposed method leads to a slight performance improvement at the expense of a higher communication cost. The presented method is not constrained to a certain network topology (e.g., tree connected or fully connected), while this is the case for many of the reference methods.
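The randomized gossip mechanism underlying the distributed beamformer can be sketched in isolation: at each step a randomly chosen node averages its current estimate with a random neighbour, and all nodes converge to the network-wide average without a fusion center. The ring topology and scalar node values are illustrative stand-ins for per-node signal statistics.

```python
import numpy as np

def pairwise_gossip(values, neighbors, n_steps=2000, seed=0):
    """Asynchronous randomized gossip: random neighbouring pairs average values."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).copy()
    nodes = len(x)
    for _ in range(n_steps):
        i = rng.integers(nodes)
        j = rng.choice(neighbors[i])
        x[i] = x[j] = 0.5 * (x[i] + x[j])   # pairwise average update
    return x

# Toy network: 10 nodes on a ring, each holding a local noisy observation.
nodes = 10
neighbors = {i: [(i - 1) % nodes, (i + 1) % nodes] for i in range(nodes)}
local = np.random.default_rng(1).standard_normal(nodes) + 5.0
est = pairwise_gossip(local, neighbors)
print(local.mean(), est)   # every entry of est approaches the network-wide mean
```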

Journal ArticleDOI
TL;DR: The proposed speech enhancement technique for signals corrupted by nonstationary acoustic noises applies the empirical mode decomposition to the noisy speech signal and obtains a set of intrinsic mode functions (IMF) and adopts the Hurst exponent in the selection of IMFs to reconstruct the speech.
Abstract: This paper presents a speech enhancement technique for signals corrupted by nonstationary acoustic noises. The proposed approach applies the empirical mode decomposition (EMD) to the noisy speech signal and obtains a set of intrinsic mode functions (IMF). The main contribution of the proposed procedure is the adoption of the Hurst exponent in the selection of IMFs to reconstruct the speech. This EMD and Hurst-based (EMDH) approach is evaluated in speech enhancement experiments considering environmental acoustic noises with different indices of nonstationarity. The results show that the EMDH improves the segmental signal-to-noise ratio and an overall quality composite measure, encompassing the perceptual evaluation of speech quality (PESQ). Moreover, the short-time objective intelligibility (STOI) measure reinforces the superior performance of EMDH. Finally, the EMDH is also examined in a speaker identification task in noisy conditions. The proposed technique leads to the highest speaker identification rates when compared to the baseline speech enhancement algorithms and also to a multicondition training procedure.
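The IMF selection criterion mentioned above relies on the Hurst exponent; a simple rescaled-range (R/S) estimator is sketched below, to be applied to each IMF (assumed already obtained from an EMD implementation). Both the R/S method and the chunk sizes are generic choices, not necessarily those used in the paper.

```python
import numpy as np

def hurst_rs(x, min_chunk=16):
    """Estimate the Hurst exponent of a 1-D signal with the rescaled-range method."""
    x = np.asarray(x, dtype=float)
    sizes, rs_means = [], []
    size = min_chunk
    while size <= len(x) // 2:
        rs_vals = []
        for start in range(0, len(x) - size + 1, size):
            chunk = x[start:start + size]
            dev = np.cumsum(chunk - chunk.mean())
            r = dev.max() - dev.min()               # range of cumulative deviations
            s = chunk.std()
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            sizes.append(size)
            rs_means.append(np.mean(rs_vals))
        size *= 2
    # Hurst exponent = slope of log(R/S) against log(chunk size).
    slope, _ = np.polyfit(np.log(sizes), np.log(rs_means), 1)
    return slope

rng = np.random.default_rng(0)
white = rng.standard_normal(4096)          # estimate close to 0.5
trend = np.cumsum(white)                   # integrated noise: estimate well above 0.5
print(hurst_rs(white), hurst_rs(trend))
```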

Journal ArticleDOI
TL;DR: The major contributions from the last 14 years of research are summarized, with detailed discussions of the following topics: feature extraction, modeling strategies, model training and datasets, and evaluation strategies.
Abstract: In this overview article, we review research on the task of Automatic Chord Estimation (ACE). The major contributions from the last 14 years of research are summarized, with detailed discussions of the following topics: feature extraction, modeling strategies, model training and datasets, and evaluation strategies. Results from the annual benchmarking evaluation Music Information Retrieval Evaluation eXchange (MIREX) are also discussed as well as developments in software implementations and the impact of ACE within MIR. We conclude with possible directions for future research.

Journal ArticleDOI
TL;DR: The results demonstrate that the optimal performance of the MVDR beamformer occurs when the source is in the endfire directions for diffuse noise and point-source noise while its SNR gain does not depend on the signal incidence angle in spatially white noise.
Abstract: Linear microphone arrays combined with the minimum variance distortionless response (MVDR) beamformer have been widely studied in various applications to acquire desired signals and reduce the unwanted noise. Most of the existing array systems assume that the desired sources are in the broadside direction. In this paper, we study and analyze the performance of the MVDR beamformer as a function of the source incidence angle. Using the signal-to-noise ratio (SNR) and beampattern as the criteria, we investigate its performance in four different scenarios: spatially white noise, diffuse noise, diffuse-plus-white noise, and point-source-plus-white noise. The results demonstrate that the optimal performance of the MVDR beamformer occurs when the source is in the endfire directions for diffuse noise and point-source noise while its SNR gain does not depend on the signal incidence angle in spatially white noise. This indicates that most current systems may not fully exploit the potential of the MVDR beamformer. This analysis does not only help us better understand this algorithm, but also helps us design better array systems for practical applications.
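The SNR-gain analysis can be reproduced numerically in a few lines for a uniform linear array in diffuse noise, using the standard result that the MVDR gain equals d^H Gamma^{-1} d, where d is the steering vector and Gamma the normalized noise coherence matrix; sweeping the incidence angle shows the gain growing toward endfire. The array geometry and frequency below are arbitrary choices.

```python
import numpy as np

def mvdr_snr_gain(angle_deg, n_mics=8, spacing=0.04, freq=2000.0, c=343.0):
    """SNR gain (dB) of an MVDR beamformer on a ULA in spherically diffuse noise.

    Gain = d^H Gamma^{-1} d, with d the steering vector for the source angle
    and Gamma the diffuse-noise spatial coherence matrix."""
    pos = spacing * np.arange(n_mics)
    # Steering vector for a far-field source at angle_deg from broadside.
    delays = pos * np.sin(np.deg2rad(angle_deg)) / c
    d = np.exp(-2j * np.pi * freq * delays)
    # Diffuse-noise coherence: sinc(2 f dist / c) between microphone pairs.
    dist = np.abs(pos[:, None] - pos[None, :])
    gamma = np.sinc(2.0 * freq * dist / c)
    gain = np.real(d.conj() @ np.linalg.solve(gamma, d))
    return 10.0 * np.log10(gain)

for angle in (0, 30, 60, 90):          # 0 = broadside, 90 = endfire
    print(angle, round(mvdr_snr_gain(angle), 2))   # gain is largest toward endfire
```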

Journal ArticleDOI
TL;DR: This work formulates the localization task as a maximum likelihood (ML) parameter estimation problem, solves it by utilizing the expectation-maximization (EM) procedure, and, for tracking, proposes to adapt two recursive EM (REM) variants, the first based on Titterington's scheme.
Abstract: The problem of localizing and tracking a known number of concurrent speakers in noisy and reverberant enclosures is addressed in this paper. We formulate the localization task as a maximum likelihood (ML) parameter estimation problem, and solve it by utilizing the expectation-maximization (EM) procedure. For the tracking scenario, we propose to adapt two recursive EM (REM) variants. The first, based on Titterington's scheme, is a Newton-based recursion. In this work we also extend Titterington's method to deal with constrained maximization, encountered in the problem at hand. The second is based on Cappe and Moulines' scheme. We discuss the similarities and dissimilarities of these two variants and show their applicability to the tracking problem by a simulated experimental study.

Journal ArticleDOI
TL;DR: The design and analysis of an array of higher order microphones that uses 2D wavefield translation to provide a mode matching solution to the height invariant recording problem is presented.
Abstract: Successful recording of large spatial soundfields is a prevailing challenge in acoustic signal processing due to the enormous numbers of microphones required. This paper presents the design and analysis of an array of higher order microphones that uses 2D wavefield translation to provide a mode matching solution to the height invariant recording problem. It is shown that the use of Mth-order microphones significantly reduces the number of microphone units by a factor of 1/(2M+1) at the expense of increased complexity at each microphone unit. Robustness of the proposed array is also analyzed based on the condition number of the translation matrix while discussing array configurations that result in low condition numbers. The white-noise gain (WNG) of the array is then derived to verify that improved WNG can be achieved when the translation matrix is well conditioned. Furthermore, the array's performance is studied for interior soundfield recording as well as exterior soundfield recording using appropriate simulation examples.

Journal ArticleDOI
TL;DR: This work considers sensor location optimization for both general wideband beamforming and frequency invariant beamforming, and sparsity in the tapped delay-line coefficients associated with each sensor is considered in order to reduce the implementation complexity of each TDL.
Abstract: Sparse wideband array design for sensor location optimization is highly nonlinear and it is traditionally solved by genetic algorithms (GAs) or other similar optimization methods. This is an extremely time-consuming process and an optimum solution is not always guaranteed. In this work, this problem is studied from the viewpoint of compressive sensing (CS). Although there have been CS-based methods proposed for the design of sparse narrowband arrays, its extension to the wideband case is not straightforward, as there are multiple coefficients associated with each sensor and they have to be simultaneously minimized in order to discard the corresponding sensor locations. At first, sensor location optimization for both general wideband beamforming and frequency invariant beamforming is considered. Then, sparsity in the tapped delay-line (TDL) coefficients associated with each sensor is considered in order to reduce the implementation complexity of each TDL. Finally, design of robust wideband arrays against norm-bounded steering vector errors is addressed. Design examples are provided to verify the effectiveness of the proposed methods, with comparisons drawn with a GA-based design method.