
Showing papers by "DeLiang Wang published in 2011"


Journal ArticleDOI
TL;DR: This paper proposes a robust algorithm for multipitch tracking in the presence of both background noise and room reverberation, which can reliably detect single and double pitch contours in noisy and reverberant conditions.
Abstract: Multipitch tracking in real environments is critical for speech signal processing. Determining pitch in reverberant and noisy speech is a particularly challenging task. In this paper, we propose a robust algorithm for multipitch tracking in the presence of both background noise and room reverberation. An auditory front-end and a new channel selection method are utilized to extract periodicity features. We derive pitch scores for each pitch state, which estimate the likelihoods of the observed periodicity features given pitch candidates. A hidden Markov model integrates these pitch scores and searches for the best pitch state sequence. Our algorithm can reliably detect single and double pitch contours in noisy and reverberant conditions. Quantitative evaluations show that our approach outperforms existing ones, particularly in reverberant conditions.
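
To make the HMM search concrete, here is a minimal Viterbi sketch in Python. The pitch-score matrix, the number of pitch states, and the smoothness-favoring transition model are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def viterbi_pitch_track(pitch_scores, sigma=2.0):
    """Find the most likely pitch-state sequence.

    pitch_scores: (num_frames, num_states) array; entry (t, s) is the
    likelihood of the observed periodicity features given pitch state s
    (the "pitch scores" of the abstract). Transitions favor smooth pitch
    contours via a distance penalty between neighboring states.
    """
    T, S = pitch_scores.shape
    states = np.arange(S)
    log_trans = -np.abs(states[:, None] - states[None, :]) / sigma
    log_obs = np.log(pitch_scores + 1e-12)

    delta = np.zeros((T, S))           # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev, cur)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path
```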

81 citations


Journal ArticleDOI
Jiangye Yuan, DeLiang Wang, Bo Wu, Lin Yan, Rongxing Li
TL;DR: An automatic method for road extraction from satellite imagery using locally excitatory globally inhibitory oscillator networks (LEGION) and a comparison with other methods shows that the proposed method produces very competitive extraction results.
Abstract: An automatic method for road extraction from satellite imagery is presented. The core of the proposed method is locally excitatory globally inhibitory oscillator networks (LEGION). The road extraction task is decomposed into three stages. The first stage is image segmentation by LEGION. In the second stage, the medial axis of each segment is computed, and the medial axis points corresponding to narrow regions are selected. The third stage is road grouping: alignment-dependent connections between selected points are established, and LEGION is utilized to group well-aligned points, which represent the extracted roads. Due to the selective gating mechanism of LEGION, different roads in an image are grouped separately. Road extraction results on synthetic and real images are presented. A comparison with other methods shows that the proposed method produces very competitive extraction results.
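
As a rough illustration of the second stage, the sketch below computes a segment's medial axis and keeps only the points lying in narrow, road-like regions; the width threshold is an assumption, and the LEGION segmentation itself is taken as given.

```python
import numpy as np
from skimage.morphology import medial_axis

def narrow_region_points(segment_mask, max_half_width=3):
    """Return medial-axis points of a binary segment that lie in narrow
    regions, i.e. where the local half-width is small (road candidates)."""
    skel, dist = medial_axis(segment_mask, return_distance=True)
    # dist holds the distance to the segment boundary at each skeleton point.
    points = np.argwhere(skel & (dist <= max_half_width))
    return points  # (row, col) candidates passed to the grouping stage
```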

60 citations


Journal ArticleDOI
TL;DR: The proposed algorithm is computationally efficient, and systematic evaluation and comparison show that the approach considerably improves the performance of unvoiced speech segregation.
Abstract: While a lot of effort has been made in computational auditory scene analysis to segregate voiced speech from monaural mixtures, unvoiced speech segregation has not received much attention. Unvoiced speech is highly susceptible to interference due to its relatively weak energy and lack of harmonic structure, which makes its segregation extremely difficult. This paper proposes a new approach to segregation of unvoiced speech from nonspeech interference. The proposed system first removes estimated voiced speech and the periodic part of interference based on cross-channel correlation. The resultant interference becomes more stationary, and we estimate the noise energy in unvoiced intervals using segregated speech in neighboring voiced intervals. Then unvoiced speech segregation occurs in two stages: segmentation and grouping. In segmentation, we apply spectral subtraction to generate time-frequency segments in unvoiced intervals. Unvoiced speech segments are subsequently grouped based on frequency characteristics of unvoiced speech, using simple thresholding as well as Bayesian classification. The proposed algorithm is computationally efficient, and systematic evaluation and comparison show that our approach considerably improves the performance of unvoiced speech segregation.
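
A minimal sketch of the spectral-subtraction segmentation step, assuming a cochleagram-like energy matrix for the unvoiced intervals and a per-channel noise estimate taken from neighboring voiced intervals; the names and the subtraction factor are illustrative.

```python
import numpy as np

def unvoiced_segment_map(energy, noise_energy, beta=1.0):
    """energy:       (channels, frames) T-F energies in unvoiced intervals
    noise_energy: (channels,) estimated noise energy per channel
    Returns a binary map; contiguous True regions form T-F segments."""
    residual = energy - beta * noise_energy[:, None]  # spectral subtraction
    return residual > 0
```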

58 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: Systematic evaluations show that the proposed classification approach to the monaural separation problem produces high-quality binary masks and outperforms a previous system in terms of classification accuracy.
Abstract: Monaural speech separation is a very challenging task. CASA-based systems utilize acoustic features to produce a time-frequency (T-F) mask. In this study, we propose a classification approach to the monaural separation problem. Our feature set consists of pitch-based features and amplitude modulation spectrum features, which can discriminate both voiced and unvoiced speech from nonspeech interference. We employ support vector machines (SVMs) followed by a re-thresholding method to classify each T-F unit as either target-dominated or interference-dominated. An auditory segmentation stage is then utilized to improve the SVM-generated results. Systematic evaluations show that our approach produces high-quality binary masks and outperforms a previous system in terms of classification accuracy.
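
A sketch of the unit-level classification with scikit-learn standing in for the SVM; per-unit features and ideal-binary-mask training labels are assumed to be extracted already, and the re-thresholding cutoff is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_unit_classifier(features, ibm_labels):
    """features: (num_units, dim); ibm_labels: 1 if target-dominated."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(features, ibm_labels)
    return clf

def estimate_mask(clf, features, threshold=0.5):
    """Label each T-F unit; the cutoff on the posterior can be re-tuned,
    in the spirit of the re-thresholding method mentioned above."""
    probs = clf.predict_proba(features)[:, 1]
    return probs > threshold
```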

45 citations


Journal ArticleDOI
TL;DR: A neurocomputational model of object-based selection in the framework of oscillatory correlation is presented, which selects salient objects rather than salient locations by segmenting an input scene and integrating the segments with their conspicuity obtained from a saliency map.
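
As a rough sketch of object-based selection, the snippet below integrates a saliency map over precomputed segments and selects the most conspicuous one; the segmentation, the saliency map, and the mean-saliency rule are illustrative assumptions rather than the model's oscillator dynamics.

```python
import numpy as np

def select_object(labels, saliency):
    """labels: (H, W) integer segment ids; saliency: (H, W) saliency map.
    Returns the id of the segment with the highest mean conspicuity."""
    ids = np.unique(labels)
    conspicuity = [saliency[labels == i].mean() for i in ids]
    return ids[int(np.argmax(conspicuity))]
```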

30 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed an algorithm for the separation of convolutive speech mixtures using two-microphone recordings, based on the combination of independent component analysis (ICA) and ideal binary mask (IBM), together with a post-filtering process in the cepstral domain.
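
A minimal sketch of how an ideal-binary-mask-style mask can be derived from the two ICA outputs, assuming their STFTs are available; the energy-comparison rule is the standard one and not necessarily the paper's exact criterion.

```python
import numpy as np

def mask_from_ica_outputs(S1, S2):
    """S1, S2: (freq, frames) STFTs of the two ICA-separated signals.
    A T-F unit is assigned to source 1 where it carries more energy."""
    return (np.abs(S1) > np.abs(S2)).astype(float)
```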

28 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: This work first shows that a recently introduced speaker feature, Gammatone Frequency Cepstral Coefficient, performs substantially better than conventional speaker features under noisy conditions, and then applies CASA separation to either reconstruct or marginalize corrupted components indicated by the CASA mask.
Abstract: Speaker recognition remains a challenging task under noisy conditions. Inspired by auditory perception, computational auditory scene analysis (CASA) typically segregates speech by producing a binary time-frequency mask. We first show that a recently introduced speaker feature, Gammatone Frequency Cepstral Coefficient, performs substantially better than conventional speaker features under noisy conditions. To deal with noisy speech, we apply CASA separation and then either reconstruct or marginalize corrupted components indicated by the CASA mask. Both methods are effective. We further combine them into a single system depending on the detected signal-to-noise ratio (SNR). This system achieves significant performance improvements over related systems under a wide range of SNR conditions.
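
A sketch of the SNR-dependent switch between the two strategies; the threshold and the reconstruct/marginalize callables are hypothetical placeholders, not the paper's configuration.

```python
def recognize(features, casa_mask, detected_snr,
              reconstruct, marginalize, snr_switch=5.0):
    """Route recognition through the method suited to the detected SNR."""
    if detected_snr >= snr_switch:
        # At higher SNRs, reconstructing corrupted components works well.
        return reconstruct(features, casa_mask)
    # At lower SNRs, marginalizing unreliable components is safer.
    return marginalize(features, casa_mask)
```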

25 citations


Journal ArticleDOI
TL;DR: A computational auditory scene analysis approach to monaural segregation of reverberant voiced speech is proposed, which performs multipitch tracking of reverberant mixtures and supervised classification, and has a significant advantage over existing systems.
Abstract: Room reverberation creates a major challenge to speech segregation. We propose a computational auditory scene analysis approach to monaural segregation of reverberant voiced speech, which performs multipitch tracking of reverberant mixtures and supervised classification. Speech and nonspeech models are separately trained, and each learns to map from a set of pitch-based features to a grouping cue which encodes the posterior probability of a time-frequency (T-F) unit being dominated by the source with the given pitch estimate. Because interference may be either speech or nonspeech, a likelihood ratio test selects the correct model for labeling corresponding T-F units. Experimental results show that the proposed system performs robustly in different types of interference and various reverberant conditions, and has a significant advantage over existing systems.
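
A minimal sketch of the likelihood ratio test used to choose the labeling model; the two log-likelihood functions are assumed to come from the separately trained speech and nonspeech models.

```python
def select_model(pitch_features, speech_loglik, nonspeech_loglik):
    """Pick the model that better explains the pitch-based features
    of the interference, then use it to label the corresponding T-F units."""
    llr = speech_loglik(pitch_features) - nonspeech_loglik(pitch_features)
    return "speech" if llr > 0.0 else "nonspeech"
```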

18 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: A trend estimation algorithm is proposed to detect the pitch ranges of a singing voice in each time frame; the detected trend substantially reduces the difficulty of singing pitch detection by removing a large number of wrong pitch candidates produced either by musical instruments or by overtones of the singing voice.
Abstract: Detecting pitch values for a singing voice in the presence of music accompaniment is challenging but useful for many applications. We propose a trend estimation algorithm to detect the pitch ranges of a singing voice in each time frame. The detected trend substantially reduces the difficulty of singing pitch detection by removing a large number of wrong pitch candidates produced either by musical instruments or by overtones of the singing voice. The proposed algorithm can be applied to improve the performance of singing pitch detection. Quantitative evaluations show that the proposed trend estimation improves an existing algorithm significantly. The results from the MIREX 2010 competition show that our system achieves the best overall raw-pitch accuracy for vocal songs.
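
To illustrate how a detected trend prunes candidates, here is a sketch that keeps only candidates within a band around the per-frame trend; the band width in semitones is an illustrative assumption.

```python
def prune_candidates(candidates_hz, trend_hz, band_semitones=4):
    """candidates_hz: candidate pitch frequencies for one frame
    trend_hz:      trend estimate (center frequency) for that frame"""
    lo = trend_hz * 2 ** (-band_semitones / 12)
    hi = trend_hz * 2 ** (band_semitones / 12)
    return [f for f in candidates_hz if lo <= f <= hi]
```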

18 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: In situations where a target of interest is near to the listener while interfering sources are more distant, simple features that capture the directionality of sound energy can be used to attenuate significant undesired signal energy and can be more effective than a strategy based on noise-floor tracking.
Abstract: In this work we describe methods for using the directionality of sound energy as a criterion to estimate single- and multichannel linear filters for suppression of diffuse noise and reverberation in a hearing aid application. We compare conservative strategies where direction of arrival is unknown, and more aggressive strategies where the proposed methods can be used to derive a fast acting post-filter for the output of a beamformer. We show that in situations where a target of interest is near to the listener while interfering sources are more distant, simple features that capture the directionality of sound energy can be used to attenuate significant undesired signal energy and can be more effective than a strategy based on noise-floor tracking.
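
A sketch of one way such a directionality feature can drive a post-filter: inter-microphone coherence is high for a nearby direct source and low for diffuse noise and reverberation, so it can serve as a gain. The smoothing constant, gain floor, and coherence-to-gain mapping are illustrative choices, not the paper's design.

```python
import numpy as np

def coherence_gain(X1, X2, floor=0.1, alpha=0.9):
    """X1, X2: (freq, frames) STFTs of two hearing-aid microphones.
    Returns per-unit gains to apply to a reference channel's STFT."""
    n_freq = X1.shape[0]
    P11 = np.zeros(n_freq)
    P22 = np.zeros(n_freq)
    P12 = np.zeros(n_freq, dtype=complex)
    gains = np.empty(X1.shape, dtype=float)
    for t in range(X1.shape[1]):
        # Recursively smoothed auto- and cross-power spectra.
        P11 = alpha * P11 + (1 - alpha) * np.abs(X1[:, t]) ** 2
        P22 = alpha * P22 + (1 - alpha) * np.abs(X2[:, t]) ** 2
        P12 = alpha * P12 + (1 - alpha) * X1[:, t] * np.conj(X2[:, t])
        msc = np.abs(P12) ** 2 / (P11 * P22 + 1e-12)  # ~1 when directional
        gains[:, t] = np.maximum(msc, floor)
    return gains
```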

5 citations


Proceedings ArticleDOI
22 May 2011
TL;DR: This paper proposes to train multiple prior models of speech, based on distinct characteristics of speech, instead of a single prior model; in this study, they are trained on voicing characteristics.
Abstract: Prior models of speech have been used in robust automatic speech recognition to enhance noisy speech. Typically, a single prior model is trained by pooling the entire training data. In this paper we propose to train multiple prior models of speech instead of a single prior model. The prior models can be trained based on distinct characteristics of speech. In this study, they are trained based on voicing characteristics. The trained prior models are then used to reconstruct noisy speech. Significant improvements are obtained on the Aurora-4 robust speech recognition task when multiple priors are used; in conjunction with an uncertainty transform technique, multiple priors yield a 13.7% absolute improvement in the average word error rate over directly recognizing noisy speech.
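
A sketch of routing frames to voicing-specific priors during reconstruction; the per-frame voicing decisions and the prior objects with a reconstruct method are hypothetical stand-ins for the trained models.

```python
def reconstruct_with_priors(frames, voicing, voiced_prior, unvoiced_prior):
    """frames: iterable of noisy feature frames; voicing: per-frame booleans.
    Each frame is enhanced by the prior matching its voicing class."""
    enhanced = []
    for frame, is_voiced in zip(frames, voicing):
        prior = voiced_prior if is_voiced else unvoiced_prior
        enhanced.append(prior.reconstruct(frame))
    return enhanced
```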

Proceedings ArticleDOI
22 May 2011
TL;DR: An unsupervised approach to sequential organization of cochannel speech is proposed that outperforms a model-based method in terms of speech segregation and is computationally simple.
Abstract: Model-based methods for sequential organization in cochannel speech require pretrained speaker models and often prior knowledge of participating speakers. We propose an unsupervised approach to sequential organization of cochannel speech. Based on cepstral features, we first cluster voiced speech into two speaker groups by maximizing the ratio of between- and within-group distances penalized by within-group concurrent pitches. To group unvoiced speech, we employ an onset/offset based analysis to generate time-frequency segments. Unvoiced segments are then labeled by the complementary portions of segregated voiced speech. Our method does not require any pretrained model and is computationally simple. Evaluations and comparisons show that the proposed method outperforms a model-based method in terms of speech segregation.
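
A sketch of the clustering objective described above: the ratio of between- to within-group scatter, penalized when a group contains concurrent pitches (two simultaneous pitches cannot belong to one talker). The penalty weight is an illustrative assumption.

```python
import numpy as np

def grouping_score(features, labels, frame_ids, penalty=1.0):
    """features: (n, d) cepstral features of voiced segments
    labels:   (n,) 0/1 group assignment to evaluate
    frame_ids:(n,) time frame of each segment"""
    g0, g1 = features[labels == 0], features[labels == 1]
    within = g0.var(axis=0).sum() + g1.var(axis=0).sum()
    between = np.sum((g0.mean(axis=0) - g1.mean(axis=0)) ** 2)
    # Count duplicated frames within a group: concurrent pitches.
    concurrent = sum(
        len(frame_ids[labels == g]) - len(set(frame_ids[labels == g]))
        for g in (0, 1)
    )
    return between / (within + 1e-12) - penalty * concurrent
```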

Journal ArticleDOI
TL;DR: This issue begins the twenty-third year of publication for Neural Networks, which is the leading journal in the world that covers the full range of neural networks and related research from all the areas of psychology and cognitive science, neuroscience and neuropsychology, mathematical and computational analysis, engineering and design, and technology and applications.

Proceedings ArticleDOI
22 May 2011
TL;DR: It is shown that by combining the outputs of classifiers trained on the traditional MFCC features and this novel speech pattern, statistically significant improvements over the baseline MFCC based classifier can be achieved for the task of phonetic classification.
Abstract: Ideal binary masks are binary patterns that encode the masking characteristics of speech in noise. Recent evidence in speech perception suggests that such binary patterns provide sufficient information for human speech recognition. Motivated by these findings, we propose to use ideal binary masks to improve phonetic modeling. We show that by combining the outputs of classifiers trained on the traditional MFCC features and this novel speech pattern, statistically significant improvements over the baseline MFCC-based classifier can be achieved for the task of phonetic classification. Using the combined classifiers, we achieve an error rate of 19.5% on the TIMIT phonetic classification task using multilayer perceptrons as the underlying classifier.
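
A sketch of one simple combination rule for the two classifiers' outputs: averaging log-posteriors from the MFCC-based and mask-based models; the equal weighting is an assumption, not necessarily the paper's scheme.

```python
import numpy as np

def combine_posteriors(mfcc_probs, mask_probs, weight=0.5):
    """mfcc_probs, mask_probs: (num_frames, num_phones) posteriors from
    the two classifiers. Returns the predicted phone index per frame."""
    log_combined = (weight * np.log(mfcc_probs + 1e-12)
                    + (1 - weight) * np.log(mask_probs + 1e-12))
    return log_combined.argmax(axis=1)
```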

Proceedings ArticleDOI
03 Oct 2011
TL;DR: This work proposes an algorithm to automatically identify representative features corresponding to different homogeneous regions, and shows that the number of representative features can be determined by examining the effective rank of a feature matrix.
Abstract: We present a novel method for segmenting images with texture and nontexture regions. Local spectral histograms are feature vectors consisting of histograms of chosen filter responses, which capture both texture and nontexture information. Based on the observation that the local spectral histogram of a pixel location can be approximated through a linear combination of the representative features weighted by the area coverage of each feature, we formulate the segmentation problem as a multivariate linear regression, where the solution is obtained by least squares estimation. Moreover, we propose an algorithm to automatically identify representative features corresponding to different homogeneous regions, and show that the number of representative features can be determined by examining the effective rank of a feature matrix. We present segmentation results on different types of images, and our comparison with another spectral histogram based method shows that the proposed method gives more accurate results.
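
A minimal sketch of the least-squares step: each pixel's local spectral histogram is approximated as a weighted combination of representative features, and the largest weight gives the pixel's label. Feature extraction and the choice of representatives (which the abstract ties to the effective rank of a feature matrix) are assumed done beforehand.

```python
import numpy as np

def segment_by_regression(histograms, representatives):
    """histograms:      (num_pixels, dim) local spectral histograms
    representatives: (num_regions, dim) representative features"""
    Z = representatives.T                         # (dim, num_regions)
    # Least-squares weights per pixel: argmin_w ||Z w - h||^2.
    W, *_ = np.linalg.lstsq(Z, histograms.T, rcond=None)
    return W.argmax(axis=0)                       # region label per pixel
```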