
Showing papers by "DeLiang Wang published in 2006"


Journal Article•DOI•
TL;DR: This book provides a comprehensive treatment of computational auditory scene analysis (CASA), covering its perceptual foundations, multiple-F0 estimation, feature-based and model-based speech segregation, binaural localization and grouping, reverberation, musical audio analysis, robust automatic speech recognition, and neural and perceptual modeling.
Abstract: Foreword. Preface. Contributors. Acronyms.
1. Fundamentals of Computational Auditory Scene Analysis (DeLiang Wang and Guy J. Brown). 1.1 Human Auditory Scene Analysis. 1.1.1 Structure and Function of the Auditory System. 1.1.2 Perceptual Organization of Simple Stimuli. 1.1.3 Perceptual Segregation of Speech from Other Sounds. 1.1.4 Perceptual Mechanisms. 1.2 Computational Auditory Scene Analysis (CASA). 1.2.1 What Is CASA? 1.2.2 What Is the Goal of CASA? 1.2.3 Why CASA? 1.3 Basics of CASA Systems. 1.3.1 System Architecture. 1.3.2 Cochleagram. 1.3.3 Correlogram. 1.3.4 Cross-Correlogram. 1.3.5 Time-Frequency Masks. 1.3.6 Resynthesis. 1.4 CASA Evaluation. 1.4.1 Evaluation Criteria. 1.4.2 Corpora. 1.5 Other Sound Separation Approaches. 1.6 A Brief History of CASA (Prior to 2000). 1.6.1 Monaural CASA Systems. 1.6.2 Binaural CASA Systems. 1.6.3 Neural CASA Models. 1.7 Conclusions. Acknowledgments. References.
2. Multiple F0 Estimation (Alain de Cheveigne). 2.1 Introduction. 2.2 Signal Models. 2.3 Single-Voice F0 Estimation. 2.3.1 Spectral Approach. 2.3.2 Temporal Approach. 2.3.3 Spectrotemporal Approach. 2.4 Multiple-Voice F0 Estimation. 2.4.1 Spectral Approach. 2.4.2 Temporal Approach. 2.4.3 Spectrotemporal Approach. 2.5 Issues. 2.5.1 Spectral Resolution. 2.5.2 Temporal Resolution. 2.5.3 Spectrotemporal Resolution. 2.6 Other Sources of Information. 2.6.1 Temporal and Spectral Continuity. 2.6.2 Instrument Models. 2.6.3 Learning-Based Techniques. 2.7 Estimating the Number of Sources. 2.8 Evaluation. 2.9 Application Scenarios. 2.10 Conclusion. Acknowledgments. References.
3. Feature-Based Speech Segregation (DeLiang Wang). 3.1 Introduction. 3.2 Feature Extraction. 3.2.1 Pitch Detection. 3.2.2 Onset and Offset Detection. 3.2.3 Amplitude Modulation Extraction. 3.2.4 Frequency Modulation Detection. 3.3 Auditory Segmentation. 3.3.1 What Is the Goal of Auditory Segmentation? 3.3.2 Segmentation Based on Cross-Channel Correlation and Temporal Continuity. 3.3.3 Segmentation Based on Onset and Offset Analysis. 3.4 Simultaneous Grouping. 3.4.1 Voiced Speech Segregation. 3.4.2 Unvoiced Speech Segregation. 3.5 Sequential Grouping. 3.5.1 Spectrum-Based Sequential Grouping. 3.5.2 Pitch-Based Sequential Grouping. 3.5.3 Model-Based Sequential Grouping. 3.6 Discussion. Acknowledgments. References.
4. Model-Based Scene Analysis (Daniel P. W. Ellis). 4.1 Introduction. 4.2 Source Separation as Inference. 4.3 Hidden Markov Models. 4.4 Aspects of Model-Based Systems. 4.4.1 Constraints: Types and Representations. 4.4.2 Fitting Models. 4.4.3 Generating Output. 4.5 Discussion. 4.5.1 Unknown Interference. 4.5.2 Ambiguity and Adaptation. 4.5.3 Relations to Other Separation Approaches. 4.6 Conclusions. References.
5. Binaural Sound Localization (Richard M. Stern, Guy J. Brown, and DeLiang Wang). 5.1 Introduction. 5.2 Physical and Physiological Mechanisms Underlying Auditory Localization. 5.2.1 Physical Cues. 5.2.2 Physiological Estimation of ITD and IID. 5.3 Spatial Perception of Single Sources. 5.3.1 Sensitivity to Differences in Interaural Time and Intensity. 5.3.2 Lateralization of Single Sources. 5.3.3 Localization of Single Sources. 5.3.4 The Precedence Effect. 5.4 Spatial Perception of Multiple Sources. 5.4.1 Localization of Multiple Sources. 5.4.2 Binaural Signal Detection. 5.5 Models of Binaural Perception. 5.5.1 Classical Models of Binaural Hearing. 5.5.2 Cross-Correlation-Based Models of Binaural Interaction. 5.5.3 Some Extensions to Cross-Correlation-Based Binaural Models. 5.6 Multisource Sound Localization. 5.6.1 Estimating Source Azimuth from Interaural Cross-Correlation. 5.6.2 Methods for Resolving Azimuth Ambiguity. 5.6.3 Localization of Moving Sources. 5.7 General Discussion. Acknowledgments. References.
6. Localization-Based Grouping (Albert S. Feng and Douglas L. Jones). 6.1 Introduction. 6.2 Classical Beamforming Techniques. 6.2.1 Fixed Beamforming Techniques. 6.2.2 Adaptive Beamforming Techniques. 6.2.3 Independent Component Analysis Techniques. 6.2.4 Other Localization-Based Techniques. 6.3 Location-Based Grouping Using Interaural Time Difference Cue. 6.4 Location-Based Grouping Using Interaural Intensity Difference Cue. 6.5 Location-Based Grouping Using Multiple Binaural Cues. 6.6 Discussion and Conclusions. Acknowledgments. References.
7. Reverberation (Guy J. Brown and Kalle J. Palomaki). 7.1 Introduction. 7.2 Effects of Reverberation on Listeners. 7.2.1 Speech Perception. 7.2.2 Sound Localization. 7.2.3 Source Separation and Signal Detection. 7.2.4 Distance Perception. 7.2.5 Auditory Spatial Impression. 7.3 Effects of Reverberation on Machines. 7.4 Mechanisms Underlying Robustness to Reverberation in Human Listeners. 7.4.1 The Role of Slow Temporal Modulations in Speech Perception. 7.4.2 The Binaural Advantage. 7.4.3 The Precedence Effect. 7.4.4 Perceptual Compensation for Spectral Envelope Distortion. 7.5 Reverberation-Robust Acoustic Processing. 7.5.1 Dereverberation. 7.5.2 Reverberation-Robust Acoustic Features. 7.5.3 Reverberation Masking. 7.6 CASA and Reverberation. 7.6.1 Systems Based on Directional Filtering. 7.6.2 CASA for Robust ASR in Reverberant Conditions. 7.6.3 Systems that Use Multiple Cues. 7.7 Discussion and Conclusions. Acknowledgments. References.
8. Analysis of Musical Audio Signals (Masataka Goto). 8.1 Introduction. 8.2 Music Scene Description. 8.2.1 Music Scene Descriptions. 8.2.2 Difficulties Associated with Musical Audio Signals. 8.3 Estimating Melody and Bass Lines. 8.3.1 PreFEst-front-end: Forming the Observed Probability Density Functions. 8.3.2 PreFEst-core: Estimating the F0's Probability Density Function. 8.3.3 PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture. 8.3.4 Other Methods. 8.4 Estimating Beat Structure. 8.4.1 Estimating Period and Phase. 8.4.2 Dealing with Ambiguity. 8.4.3 Using Musical Knowledge. 8.5 Estimating Chorus Sections and Repeated Sections. 8.5.1 Extracting Acoustic Features and Calculating Their Similarity. 8.5.2 Finding Repeated Sections. 8.5.3 Grouping Repeated Sections. 8.5.4 Detecting Modulated Repetition. 8.5.5 Selecting Chorus Sections. 8.5.6 Other Methods. 8.6 Discussion and Conclusions. 8.6.1 Importance. 8.6.2 Evaluation Issues. 8.6.3 Future Directions. References.
9. Robust Automatic Speech Recognition (Jon Barker). 9.1 Introduction. 9.2 ASA and Speech Perception in Humans. 9.2.1 Speech Perception and Simultaneous Grouping. 9.2.2 Speech Perception and Sequential Grouping. 9.2.3 Speech Schemes. 9.2.4 Challenges to the ASA Account of Speech Perception. 9.2.5 Interim Summary. 9.3 Speech Recognition by Machine. 9.3.1 The Statistical Basis of ASR. 9.3.2 Traditional Approaches to Robust ASR. 9.3.3 CASA-Driven Approaches to ASR. 9.4 Primitive CASA and ASR. 9.4.1 Speech and Time-Frequency Masking. 9.4.2 The Missing-Data Approach to ASR. 9.4.3 Marginalization-Based Missing-Data ASR Systems. 9.4.4 Imputation-Based Missing-Data Solutions. 9.4.5 Estimating the Missing-Data Mask. 9.4.6 Difficulties with the Missing-Data Approach. 9.5 Model-Based CASA and ASR. 9.5.1 The Speech Fragment Decoding Framework. 9.5.2 Coupling Source Segregation and Recognition. 9.6 Discussion and Conclusions. 9.7 Concluding Remarks. References.
10. Neural and Perceptual Modeling (Guy J. Brown and DeLiang Wang). 10.1 Introduction. 10.2 The Neural Basis of Auditory Grouping. 10.2.1 Theoretical Solutions to the Binding Problem. 10.2.2 Empirical Results on Binding and ASA. 10.3 Models of Individual Neurons. 10.3.1 Relaxation Oscillators. 10.3.2 Spike Oscillators. 10.3.3 A Model of a Specific Auditory Neuron. 10.4 Models of Specific Perceptual Phenomena. 10.4.1 Perceptual Streaming of Tone Sequences. 10.4.2 Perceptual Segregation of Concurrent Vowels with Different F0s. 10.5 The Oscillatory Correlation Framework for CASA. 10.5.1 Speech Segregation Based on Oscillatory Correlation. 10.6 Schema-Driven Grouping. 10.7 Discussion. 10.7.1 Temporal or Spatial Coding of Auditory Grouping. 10.7.2 Physiological Support for Neural Time Delays. 10.7.3 Convergence of Psychological, Physiological, and Computational Approaches. 10.7.4 Neural Models as a Framework for CASA. 10.7.5 The Role of Attention. 10.7.6 Schema-Based Organization. Acknowledgments. References.
Index.

940 citations


Journal Article•DOI•
TL;DR: This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception through the use of ideal time-frequency binary masks.
Abstract: When a target speech signal is obscured by an interfering speech waveform, comprehension of the target message depends both on the successful detection of the energy from the target speech waveform and on the successful extraction and recognition of the spectro-temporal energy pattern of the target out of a background of acoustically similar masker sounds. This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception. This was achieved through the use of ideal time-frequency binary masks that retained those spectro-temporal regions of the acoustic mixture that were dominated by the target speech but eliminated those regions that were dominated by the interfering speech. The results suggest that energetic masking plays a relatively small role in the overall masking that occurs when speech is masked by interfering speech but a much more significant role when speech is masked by interfering noise.
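As an illustration of the ideal-mask construction underlying this study, here is a minimal Python sketch, assuming the target and interference are available separately before mixing; the STFT front end and the 0 dB local-SNR criterion are simplifying assumptions (the study itself used an auditory time-frequency decomposition).

```python
# Hedged sketch: ideal binary mask from separately known target and
# interference. The STFT resolution and 0 dB threshold are illustrative.
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, interference, fs, lc_db=0.0):
    """Return the ideal binary mask and the mask-resynthesized mixture."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, I = stft(interference, fs=fs, nperseg=512)
    local_snr = 20 * np.log10((np.abs(T) + 1e-12) / (np.abs(I) + 1e-12))
    mask = (local_snr > lc_db).astype(float)  # 1 where the target dominates
    _, _, M = stft(target + interference, fs=fs, nperseg=512)
    _, masked = istft(mask * M, fs=fs, nperseg=512)
    return mask, masked
```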

388 citations


Journal Article•DOI•
TL;DR: A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that the proposed algorithm performs substantially better.
Abstract: Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Inspired by this observation, we propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects or increase SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that our algorithm performs substantially better.
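The second stage lends itself to a compact sketch. Below is a hedged Python illustration of spectral subtraction against long-term reverberation, in which the late-reverberant power is approximated as a scaled, delayed copy of the earlier short-time power spectrum; the 50 ms delay, the scale factor, and the spectral floor are illustrative placeholders, not the paper's tuned smearing model.

```python
# Hedged sketch of the spectral subtraction stage: subtract an estimate
# of late-reverberant power (a scaled, delayed copy of earlier power).
import numpy as np
from scipy.signal import stft, istft

def subtract_late_reverb(x, fs, delay_ms=50.0, scale=0.3, floor=0.01):
    f, t, X = stft(x, fs=fs, nperseg=512, noverlap=256)
    power = np.abs(X) ** 2
    hop_s = 256 / fs                                # frame hop in seconds
    d = max(1, int(round(delay_ms / 1000.0 / hop_s)))
    late = np.zeros_like(power)
    late[:, d:] = scale * power[:, :-d]             # delayed power as reverb estimate
    clean_power = np.maximum(power - late, floor * power)  # spectral floor
    X_clean = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    _, y = istft(X_clean, fs=fs, nperseg=512, noverlap=256)
    return y
```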

226 citations


Journal Article•DOI•
TL;DR: In this article, a time-varying Wiener filter is used to specify the energy ratio of the target signal to the noisy mixture in each local time-frequency unit; the enhanced signal is then fed to a conventional speech recognizer operating in the cepstral domain.
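As a rough illustration of such a ratio (Wiener-style) mask, the sketch below computes an oracle ratio from separately known target and noise; the paper estimates this quantity rather than assuming access to the clean signal, and the STFT front end is an assumption here.

```python
# Hedged sketch: oracle time-varying Wiener-style ratio mask.
import numpy as np
from scipy.signal import stft

def ratio_mask(target, noise, fs):
    """Per-unit ratio of target energy to mixture energy, in [0, 1]."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    return np.abs(T) ** 2 / (np.abs(T) ** 2 + np.abs(N) ** 2 + 1e-12)
```

The masked mixture spectrogram would then be converted to cepstral features (e.g., MFCCs) before decoding.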

212 citations


Journal Article•DOI•
TL;DR: A method for segmenting images consisting of texture and nontexture regions based on local spectral histograms, which derives probability models from the local spectral histograms of homogeneous regions and iteratively updates the segmentation using the derived probability models.
Abstract: We present a method for segmenting images consisting of texture and nontexture regions based on local spectral histograms. Defined as a vector consisting of marginal distributions of chosen filter responses, local spectral histograms provide a feature statistic for both types of regions. Using local spectral histograms of homogeneous regions, we decompose the segmentation process into three stages. The first is the initial classification stage, where probability models for homogeneous texture and nontexture regions are derived and an initial segmentation result is obtained by classifying local windows. In the second stage, we give an algorithm that iteratively updates the segmentation using the derived probability models. The third is the boundary localization stage, where region boundaries are localized by building refined probability models that are sensitive to spatial patterns in segmented regions. We present segmentation results on texture as well as nontexture images. Our comparison with other methods shows that the proposed method produces more accurate segmentation results.
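The feature itself is simple to state in code. The sketch below computes a local spectral histogram at one pixel, assuming an intensity channel plus two Sobel gradient responses and 8 bins per marginal; the paper's actual filter bank and bin choices may differ.

```python
# Hedged sketch: local spectral histogram feature at one pixel.
import numpy as np
from scipy.ndimage import sobel

def local_spectral_histogram(image, center, win=15, bins=8):
    # Filter responses: intensity plus two gradient channels (assumed set).
    responses = [image, sobel(image, axis=0), sobel(image, axis=1)]
    r, c = center
    h = win // 2                      # window assumed to lie inside the image
    feats = []
    for resp in responses:
        patch = resp[r - h:r + h + 1, c - h:c + h + 1]
        hist, _ = np.histogram(patch, bins=bins, range=(resp.min(), resp.max()))
        feats.append(hist / hist.sum())   # marginal distribution of the response
    return np.concatenate(feats)
```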

123 citations


Journal Article•DOI•
Yang Shao, DeLiang Wang•
TL;DR: This paper extends the traditional SID framework to cochannel speech, derives a joint objective for sequential grouping and SID that leads to a search for the optimum hypothesis, and proposes a hypothesis pruning algorithm based on speaker models to make the search computationally efficient.
Abstract: A human listener has the ability to follow a speaker's voice while others are speaking simultaneously; in particular, the listener can organize the time-frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and improves speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a problem of search for the optimum hypothesis. Subsequently we propose a hypothesis pruning algorithm based on speaker models in order to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.
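To make the joint objective concrete, here is a toy Python sketch that scores every assignment of usable-speech segments to two streams under every ordered speaker pair; exhaustive enumeration is shown only for clarity, since the paper's contribution is precisely a model-based pruning algorithm that avoids this exponential search.

```python
# Toy sketch: score grouping hypotheses by total speaker-model likelihood.
import itertools
import numpy as np

def best_grouping(segment_scores):
    """segment_scores: array [n_segments, n_speakers] of per-segment
    log-likelihoods under each speaker model (n_speakers >= 2)."""
    n_seg, n_spk = segment_scores.shape
    best = (-np.inf, None)
    for labels in itertools.product([0, 1], repeat=n_seg):      # stream assignment
        for a, b in itertools.permutations(range(n_spk), 2):    # speaker pair
            ll = sum(segment_scores[i, a if g == 0 else b]
                     for i, g in enumerate(labels))
            best = max(best, (ll, (labels, (a, b))), key=lambda x: x[0])
    return best
```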

83 citations


Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Physical and Physiological Mechanisms Underlying Auditory Localization, Spatial Perception of Single Sources, Spatial Perception of Multiple Sources, Models of Binaural Perception, Multisource Sound Localization, and General Discussion.
Abstract: This chapter contains sections titled: Introduction, Physical and Physiological Mechanisms Underlying Auditory Localization, Spatial Perception of Single Sources, Spatial Perception of Multiple Sources, Models of Binaural Perception, Multisource Sound Localization, General Discussion, Acknowledgments, and References.

74 citations


Journal Article•DOI•
TL;DR: A binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of the target source is proposed, and comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
Abstract: In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, a binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of the target source is proposed. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
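The binary decision rule can be sketched compactly. Assuming the adaptive target-cancellation module has already produced a residual signal, a T-F unit is labeled target-dominant when cancellation attenuates its energy strongly; the STFT front end and the 6 dB threshold below are illustrative assumptions.

```python
# Hedged sketch: binary mask from the attenuation achieved by a target
# canceler (an adaptive filter trained when only the target is active).
import numpy as np
from scipy.signal import stft

def mask_from_cancellation(mixture, residual, fs, atten_db=6.0):
    _, _, M = stft(mixture, fs=fs, nperseg=512)
    _, _, R = stft(residual, fs=fs, nperseg=512)   # output of target canceler
    attenuation = 20 * np.log10((np.abs(M) + 1e-12) / (np.abs(R) + 1e-12))
    return (attenuation > atten_db).astype(float)  # strong attenuation => target
```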

60 citations


01 Jan 2006
TL;DR: This chapter presents in detail a CASA system that segregates both voiced and unvoiced speech and covers the major stages of CASA, including feature extraction, auditory segmentation, and grouping.
Abstract: A human listener has the remarkable ability to segregate an acoustic mixture and attend to a target sound. This perceptual process is called auditory scene analysis (ASA). Moreover, the listener can accomplish much of auditory scene analysis with only one ear. Research in ASA has inspired many studies in computational auditory scene analysis (CASA) for sound segregation. In this chapter we introduce a CASA approach to monaural speech segregation. After a brief overview of CASA, we present in detail a CASA system that segregates both voiced and unvoiced speech. Our description covers the major stages of CASA, including feature extraction, auditory segmentation, and grouping.

60 citations


Journal Article•DOI•
TL;DR: This work proposes a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to the target location and a pitch-based speech segregation method, and shows that the proposed system results in considerable signal-to-noise ratio gains across different conditions.
Abstract: In everyday listening, both background noise and reverberation degrade the speech signal. Psychoacoustic evidence suggests that human speech perception under reverberant conditions relies mostly on monaural processing. While speech segregation based on periodicity has achieved considerable progress in handling additive noise, little research in monaural segregation has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based segregation algorithm show that an increase in the room reverberation time causes degraded performance due to weakened periodicity in the target signal. We propose a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to the target location and a pitch-based speech segregation method. As a result of the first stage, the harmonicity of a signal arriving from the target direction is partially restored while signals arriving from other directions are further smeared, and this leads to improved segregation. A systematic evaluation shows that the proposed system yields considerable signal-to-noise ratio gains across different conditions. Potential applications of this system include robust automatic speech recognition and hearing aid design.
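The first stage reduces, in essence, to a convolution with an estimated inverse filter, as in the minimal sketch below; estimating that filter blindly is the substantive problem the paper addresses, so it appears here only as a placeholder argument.

```python
# Hedged sketch: applying an (already estimated) inverse filter of the
# room impulse response to partially restore harmonic structure.
import numpy as np

def inverse_filter_stage(reverberant, inv_filter):
    return np.convolve(reverberant, inv_filter, mode="same")
```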

51 citations


Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Human Auditory Scene Analysis, Computational Auditory Scene Analysis (CASA), Basics of CASA Systems, CASA Evaluation, and Other Sound Separation Approaches.
Abstract: This chapter contains sections titled: Human Auditory Scene Analysis, Computational Auditory Scene Analysis (CASA), Basics of CASA Systems, CASA Evaluation, Other Sound Separation Approaches, A Brief History of CASA (Prior to 2000), Conclusions, Acknowledgments, and References.

Proceedings Article•DOI•
Yang Shao, DeLiang Wang•
14 May 2006
TL;DR: By employing a speech segregation system that estimates the ideal binary mask, this study achieves significant improvements over alternative approaches and demonstrates that binary masking represents a promising direction for robust speaker recognition.
Abstract: Conventional speaker recognition systems perform poorly under noisy conditions. In this paper, we evaluate binary time-frequency masks for robust speaker recognition. An ideal binary mask is a priori defined as a binary matrix where 1 indicates that the target is stronger than the interference within the corresponding time-frequency unit and 0 indicates otherwise. We perform speaker identification and verification using a missing data recognizer under cochannel and other noise conditions, and show that the ideal binary mask provides large performance gains. By employing a speech segregation system that estimates the ideal binary mask, we achieve significant improvements over alternative approaches. Our study, thus, demonstrates that the use of binary masking represents a promising direction for robust speaker recognition.

Proceedings Article•
Yipeng Li, DeLiang Wang•
01 Jan 2006
TL;DR: A system to separate singing voice from music accompaniment in monaural recordings is proposed, and quantitative results show that the system performs well in singing voice separation.
Abstract: Separating singing voice from music accompaniment has wide applications in areas such as automatic lyrics recognition and alignment, singer identification, and music information retrieval. Compared to the extensive studies of speech separation, singing voice separation has been little explored. We propose a system to separate singing voice from music accompaniment in monaural recordings. The system has three stages. The singing voice detection stage partitions and classifies an input into vocal and non-vocal portions. Then the predominant pitch detection stage detects the pitch contour of the singing voice for vocal portions. Finally, the separation stage uses the detected pitch contour to group the time-frequency segments of the singing voice. Quantitative results show that the system performs well in singing voice separation.
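The grouping stage can be illustrated with a small sketch: a T-F unit is assigned to the singing voice when its normalized autocorrelation is high at the detected pitch lag, indicating dominance by a periodic signal at that pitch. The 0.85 threshold and the framing below are assumptions for illustration.

```python
# Hedged sketch: label a T-F unit as voice-dominant via its
# autocorrelation at the detected pitch lag.
import numpy as np

def unit_matches_pitch(unit_signal, pitch_lag, threshold=0.85):
    """unit_signal: samples of one T-F unit (bandpass-filtered frame);
    pitch_lag: detected pitch period in samples, 0 < pitch_lag < len(unit_signal)."""
    x = unit_signal - unit_signal.mean()
    energy = np.dot(x, x) + 1e-12
    ac = np.dot(x[:-pitch_lag], x[pitch_lag:]) / energy  # normalized autocorrelation
    return ac > threshold
```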

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Signal Models, Single-Voice F0 Estimation, Multiple-Voice F0 Estimation, Issues, Other Sources of Information, Estimating the Number of Sources, Evaluation, Application Scenarios, and Conclusion.
Abstract: This chapter contains sections titled: Introduction, Signal Models, Single-Voice F0 Estimation, Multiple-Voice F0 Estimation, Issues, Other Sources of Information, Estimating the Number of Sources, Evaluation, Application Scenarios, Conclusion, Acknowledgments, and References.

Proceedings Article•DOI•
01 Jan 2006
TL;DR: A computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise, which estimates the ideal binary time-frequency (T-F) mask that retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit.
Abstract: We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance.

Book Chapter•DOI•
05 Mar 2006
TL;DR: A method for underdetermined blind source separation of convolutive mixtures is proposed that combines blind source separation techniques with binary time-frequency masking; it is applicable to instantaneous as well as convolutive speech mixtures and needs only two microphones.
Abstract: A limitation in many source separation tasks is that the number of source signals has to be known in advance. Further, in order to achieve good performance, the number of sources cannot exceed the number of sensors. In many real-world applications these limitations are too restrictive. We propose a method for underdetermined blind source separation of convolutive mixtures. The proposed framework is applicable for separation of instantaneous as well as convolutive speech mixtures. It is possible to iteratively extract each speech signal from the mixture by combining blind source separation techniques with binary time-frequency masking. In the proposed method, the number of source signals is not assumed to be known in advance and the number of sources is not limited to the number of microphones. Our approach needs only two microphones and the separated sounds are maintained as stereo signals.
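A conceptual sketch of the iterative extraction loop follows. The two-channel BSS stage is passed in as a caller-supplied function (hypothetical here), and the dominance threshold and stopping rule are illustrative assumptions rather than the paper's exact criteria.

```python
# Hedged sketch: iteratively peel off sources by alternating a two-input
# BSS stage with binary T-F masking, keeping the remainder as stereo.
import numpy as np
from scipy.signal import stft, istft

def extract_sources(left, right, bss_two_channel, fs, max_sources=6, tau_db=6.0):
    """bss_two_channel: caller-supplied (hypothetical) function mapping two
    mixture channels to two separated outputs."""
    sources = []
    for _ in range(max_sources):
        y1, y2 = bss_two_channel(left, right)
        _, _, Y1 = stft(y1, fs=fs, nperseg=512)
        _, _, Y2 = stft(y2, fs=fs, nperseg=512)
        ratio = 20 * np.log10((np.abs(Y1) + 1e-12) / (np.abs(Y2) + 1e-12))
        mask = ratio > tau_db                     # units where y1 dominates
        if mask.mean() < 0.05:                    # no clearly dominant source left
            break
        _, _, L = stft(left, fs=fs, nperseg=512)
        _, _, R = stft(right, fs=fs, nperseg=512)
        _, s = istft(mask * L, fs=fs, nperseg=512)
        sources.append(s)
        # Keep the complementary units (still stereo) for the next iteration.
        _, left = istft(~mask * L, fs=fs, nperseg=512)
        _, right = istft(~mask * R, fs=fs, nperseg=512)
    return sources
```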

Journal Article•
TL;DR: In this paper, the pitch strength is measured by deriving the statistics of relative time lags, defined as the distances from the detected pitch periods to the closest peaks in a correlogram.
Abstract: Reverberation corrupts harmonic structure in voiced speech. We observe that the pitch strength of voiced speech segments is indicative of the degree of reverberation. Consequently, we present an estimation method of reverberation time (T60) based on pitch strength. The pitch strength is measured by deriving the statistics of relative time lags, defined as the distances from the detected pitch periods to the closest peaks in a correlogram. The monotonic relationship between the measured pitch strength and reverberation time learned from a corpus of reverberant speech with known reverberation times yields an estimate of T60 up to 0.6 seconds.
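The relative-time-lag statistic is easy to sketch: for each voiced frame, record the distance from the detected pitch period to the nearest autocorrelation peak; reverberation flattens and shifts these peaks, widening the lag distribution. The deliberately simple peak picking below is an assumption.

```python
# Hedged sketch: relative time lag between the detected pitch period and
# the nearest autocorrelation peak of one frame.
import numpy as np
from scipy.signal import argrelmax

def relative_time_lag(frame, pitch_period):
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # nonneg lags
    peaks = argrelmax(ac)[0]                   # local maxima (lag 0 excluded)
    if peaks.size == 0:
        return None
    return int(peaks[np.argmin(np.abs(peaks - pitch_period))]) - pitch_period
```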

Proceedings Article•DOI•
14 May 2006
TL;DR: A novel approach is described for segregating unvoiced speech, which lacks harmonic structure, has weaker energy, and is hence more susceptible to interference.
Abstract: Speech segregation, or the cocktail party problem, has proven to be extremely challenging. While efforts in computational auditory scene analysis have led to considerable progress in voiced speech segregation, little attention has been given to unvoiced speech, which lacks harmonic structure and has weaker energy and is hence more susceptible to interference. We describe a novel approach to address this problem. The segregation process occurs in two stages: segmentation and grouping. In segmentation, our model decomposes the input mixture into contiguous time-frequency segments by analyzing sound onsets and offsets. Grouping of unvoiced segments is based on Bayesian classification of acoustic-phonetic features. The proposed model yields very promising results.
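The grouping step can be caricatured as a Bayesian likelihood-ratio test between two pretrained models of acoustic-phonetic features; the diagonal-Gaussian models and two-class setup below are simplifying assumptions, not the paper's exact classifier.

```python
# Hedged sketch: Bayesian decision for one segment's feature vector
# between speech and interference models (diagonal Gaussians assumed).
import numpy as np

def is_unvoiced_speech(feat, speech_model, noise_model, prior_ratio=1.0):
    """Each model is a dict with 1-D arrays 'mean' and 'var'."""
    def loglik(f, m):
        d = f - m["mean"]
        return -0.5 * np.sum(np.log(2 * np.pi * m["var"]) + d * d / m["var"])
    return loglik(feat, speech_model) - loglik(feat, noise_model) \
        + np.log(prior_ratio) > 0
```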

Proceedings Article•DOI•
14 May 2006
TL;DR: A supervised approach is proposed to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain; it shows substantial improvement over the baseline performance on the Aurora4 task.
Abstract: Recently several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation leads to a smearing of this uncertainty. We propose a supervised approach to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance.
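A hedged sketch of the supervised mapping follows, using an MLP regressor to predict cepstral-domain variances from spectral-domain uncertainty features; the regressor choice, inputs, and targets are assumptions for illustration (training pairs would come from frames where both quantities are known).

```python
# Hedged sketch: learn a nonlinear map from spectral-domain uncertainty
# (e.g., per-band mask values of a frame) to cepstral-domain variances.
from sklearn.neural_network import MLPRegressor

def train_uncertainty_map(spectral_uncert, cepstral_var):
    """spectral_uncert: [n_frames, n_bands]; cepstral_var: [n_frames, n_ceps]."""
    model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(spectral_uncert, cepstral_var)
    return model   # model.predict(new_uncert) gives per-frame cepstral variances
```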

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Music Scene Description, Estimating Melody and Bass Lines, Estimating Beat Structure, Estimating Chorus Sections and Repeated Sections, and Discussion and Conclusions.
Abstract: This chapter contains sections titled: Introduction, Music Scene Description, Estimating Melody and Bass Lines, Estimating Beat Structure, Estimating Chorus Sections and Repeated Sections, Discussion and Conclusions, and References.

Dissertation•
01 Jan 2006
TL;DR: A model is presented that simulates listeners' ability to attend to a target speaker whose speech is degraded by the effects of energetic and informational masking in multitalker environments, and finds that while missing-data recognition outperforms conventional ASR on a small vocabulary task, conventional ASR performs significantly better when the vocabulary size is increased.
Abstract: We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode speech based on unmasked portions and activates word templates that contain the masked phoneme via dynamic time warping. An activated template is then used to restore the masked phoneme. A systematic evaluation shows that the model is able to restore both voiced and unvoiced phonemes with a spectral quality close to that of original phonemes. Missing-data ASR relies on a binary mask generated by bottom-up CASA to label the speech-dominant time-frequency (T-F) regions of a noisy mixture as reliable and the rest as unreliable. However, errors in mask estimation cause degradation in recognition accuracy. Hence, we propose a two-pass ASR system that performs segregation and recognition in tandem. In the first pass, an n-best lattice, consistent with bottom-up speech separation, is generated. The lattice is then re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently. This two-pass system leads to significant improvement in recognition performance. By combining a monaural CASA system with missing-data ASR, we present a model that simulates listeners' ability to attend to a target speaker when degraded by the effects of energetic and informational masking in multitalker environments. Energetic masking refers to the phenomenon that a stronger signal masks a weaker one within a critical band. Informational masking occurs when the listener is unable to segregate target from interference. Missing-data ASR is used to account for energetic masking. The effects of informational masking are modeled by the output degradation of the CASA system in binary mask estimation. The model successfully simulates several quantitative aspects of listener performance including the differential effects of energetic and informational masking on multitalker perception. While missing-data ASR performs well on small vocabulary tasks, previous studies have not examined the effect of vocabulary size. In this dissertation, we investigate the performance of the missing-data ASR on a larger vocabulary task and compare its results to those of conventional ASR. For conventional ASR, we extract the speech signal from a noisy mixture by estimating a Wiener filter based on estimated interaural time and intensity differences within a T-F unit. For missing-data ASR, the same estimation is used to produce a binary T-F mask. We find that while missing-data recognition outperforms conventional ASR on a small vocabulary task, the performance of conventional ASR is significantly better when the vocabulary size is increased. (Abstract shortened by UMI.)
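Since missing-data ASR recurs throughout this work, a minimal sketch of its core operation may help: under a diagonal Gaussian state model, marginalizing the unreliable dimensions of a frame simply drops them from the likelihood (the bounded variant, which also uses the observed value as an upper limit of integration, is omitted for brevity).

```python
# Hedged sketch: missing-data marginalization for one frame under a
# single diagonal Gaussian; only reliable (mask = 1) dimensions count.
import numpy as np

def marginal_log_likelihood(x, mask, mean, var):
    """x, mask, mean, var: 1-D arrays; mask is 1 where x is reliable."""
    r = mask.astype(bool)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + d * d / var[r])
```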

Proceedings Article•DOI•
14 May 2006
TL;DR: A systematic evaluation in terms of automatic speech recognition (ASR) performance shows substantial improvements over the baseline performance and better results than related two-microphone approaches.
Abstract: We present a binaural solution to robust speech recognition in multi-source reverberant environments. We employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. Our system estimates this ideal binary mask at the output of a target cancellation module implemented using adaptive filtering. This mask is used in conjunction with a missing-data algorithm to decode the target utterance. A systematic evaluation in terms of automatic speech recognition (ASR) performance shows substantial improvements over the baseline performance and better results over related two-microphone approaches.

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, The Neural Basis of Auditory Grouping, Models of Individual Neurons, Models of Specific Perceptual Phenomena, The Oscillatory Correlation Framework for CASA, Schema-Driven Grouping, and Discussion.
Abstract: This chapter contains sections titled: Introduction, The Neural Basis of Auditory Grouping, Models of Individual Neurons, Models of Specific Perceptual Phenomena, The Oscillatory Correlation Framework for CASA, Schema-Driven Grouping, Discussion, Acknowledgments, and References.

Book•
01 Nov 2006
TL;DR: Proceedings, Part I, of the 13th International Conference on Neural Information Processing (ICONIP 2006), held in Hong Kong, China, October 3-6, 2006.
Abstract: Neural Information Processing: 13th International Conference, ICONIP 2006, Hong Kong, China, October 3-6, 2006, Proceedings, Part I (Lecture Notes In ... Computer Science And General Issues).



Book Chapter•DOI•
DeLiang Wang•
23 Oct 2006
TL;DR: The auditory environment is typically composed of multiple simultaneous events, and the auditory system is able to disentangle the acoustic mixture and group the sound energy that originates from the same event or source.
Abstract: The acoustic environment is typically composed of multiple simultaneous events. A remarkable achievement of the auditory system is its ability to disentangle the acoustic mixture and group the sound energy that originates from the same event or source. This process of auditory organization is referred to as auditory scene analysis. The cocktail party problem, or segregation of speech from interfering sounds, has proven to be extremely challenging from the computational standpoint.