
Showing papers by "DeLiang Wang published in 2006"


Journal Article•DOI•
TL;DR: This book provides a comprehensive treatment of computational auditory scene analysis (CASA), covering its perceptual foundations, multiple-F0 estimation, feature-based and model-based speech segregation, binaural localization and grouping, reverberation, musical audio analysis, robust automatic speech recognition, and neural and perceptual modeling.
Abstract: Foreword. Preface. Contributors. Acronyms.
1. Fundamentals of Computational Auditory Scene Analysis (DeLiang Wang and Guy J. Brown). 1.1 Human Auditory Scene Analysis. 1.1.1 Structure and Function of the Auditory System. 1.1.2 Perceptual Organization of Simple Stimuli. 1.1.3 Perceptual Segregation of Speech from Other Sounds. 1.1.4 Perceptual Mechanisms. 1.2 Computational Auditory Scene Analysis (CASA). 1.2.1 What Is CASA? 1.2.2 What Is the Goal of CASA? 1.2.3 Why CASA? 1.3 Basics of CASA Systems. 1.3.1 System Architecture. 1.3.2 Cochleagram. 1.3.3 Correlogram. 1.3.4 Cross-Correlogram. 1.3.5 Time-Frequency Masks. 1.3.6 Resynthesis. 1.4 CASA Evaluation. 1.4.1 Evaluation Criteria. 1.4.2 Corpora. 1.5 Other Sound Separation Approaches. 1.6 A Brief History of CASA (Prior to 2000). 1.6.1 Monaural CASA Systems. 1.6.2 Binaural CASA Systems. 1.6.3 Neural CASA Models. 1.7 Conclusions. Acknowledgments. References.
2. Multiple F0 Estimation (Alain de Cheveigne). 2.1 Introduction. 2.2 Signal Models. 2.3 Single-Voice F0 Estimation. 2.3.1 Spectral Approach. 2.3.2 Temporal Approach. 2.3.3 Spectrotemporal Approach. 2.4 Multiple-Voice F0 Estimation. 2.4.1 Spectral Approach. 2.4.2 Temporal Approach. 2.4.3 Spectrotemporal Approach. 2.5 Issues. 2.5.1 Spectral Resolution. 2.5.2 Temporal Resolution. 2.5.3 Spectrotemporal Resolution. 2.6 Other Sources of Information. 2.6.1 Temporal and Spectral Continuity. 2.6.2 Instrument Models. 2.6.3 Learning-Based Techniques. 2.7 Estimating the Number of Sources. 2.8 Evaluation. 2.9 Application Scenarios. 2.10 Conclusion. Acknowledgments. References.
3. Feature-Based Speech Segregation (DeLiang Wang). 3.1 Introduction. 3.2 Feature Extraction. 3.2.1 Pitch Detection. 3.2.2 Onset and Offset Detection. 3.2.3 Amplitude Modulation Extraction. 3.2.4 Frequency Modulation Detection. 3.3 Auditory Segmentation. 3.3.1 What Is the Goal of Auditory Segmentation? 3.3.2 Segmentation Based on Cross-Channel Correlation and Temporal Continuity. 3.3.3 Segmentation Based on Onset and Offset Analysis. 3.4 Simultaneous Grouping. 3.4.1 Voiced Speech Segregation. 3.4.2 Unvoiced Speech Segregation. 3.5 Sequential Grouping. 3.5.1 Spectrum-Based Sequential Grouping. 3.5.2 Pitch-Based Sequential Grouping. 3.5.3 Model-Based Sequential Grouping. 3.6 Discussion. Acknowledgments. References.
4. Model-Based Scene Analysis (Daniel P. W. Ellis). 4.1 Introduction. 4.2 Source Separation as Inference. 4.3 Hidden Markov Models. 4.4 Aspects of Model-Based Systems. 4.4.1 Constraints: Types and Representations. 4.4.2 Fitting Models. 4.4.3 Generating Output. 4.5 Discussion. 4.5.1 Unknown Interference. 4.5.2 Ambiguity and Adaptation. 4.5.3 Relations to Other Separation Approaches. 4.6 Conclusions. References.
5. Binaural Sound Localization (Richard M. Stern, Guy J. Brown, and DeLiang Wang). 5.1 Introduction. 5.2 Physical and Physiological Mechanisms Underlying Auditory Localization. 5.2.1 Physical Cues. 5.2.2 Physiological Estimation of ITD and IID. 5.3 Spatial Perception of Single Sources. 5.3.1 Sensitivity to Differences in Interaural Time and Intensity. 5.3.2 Lateralization of Single Sources. 5.3.3 Localization of Single Sources. 5.3.4 The Precedence Effect. 5.4 Spatial Perception of Multiple Sources. 5.4.1 Localization of Multiple Sources. 5.4.2 Binaural Signal Detection. 5.5 Models of Binaural Perception. 5.5.1 Classical Models of Binaural Hearing. 5.5.2 Cross-Correlation-Based Models of Binaural Interaction. 5.5.3 Some Extensions to Cross-Correlation-Based Binaural Models. 5.6 Multisource Sound Localization. 5.6.1 Estimating Source Azimuth from Interaural Cross-Correlation. 5.6.2 Methods for Resolving Azimuth Ambiguity. 5.6.3 Localization of Moving Sources. 5.7 General Discussion. Acknowledgments. References.
6. Localization-Based Grouping (Albert S. Feng and Douglas L. Jones). 6.1 Introduction. 6.2 Classical Beamforming Techniques. 6.2.1 Fixed Beamforming Techniques. 6.2.2 Adaptive Beamforming Techniques. 6.2.3 Independent Component Analysis Techniques. 6.2.4 Other Localization-Based Techniques. 6.3 Location-Based Grouping Using Interaural Time Difference Cue. 6.4 Location-Based Grouping Using Interaural Intensity Difference Cue. 6.5 Location-Based Grouping Using Multiple Binaural Cues. 6.6 Discussion and Conclusions. Acknowledgments. References.
7. Reverberation (Guy J. Brown and Kalle J. Palomaki). 7.1 Introduction. 7.2 Effects of Reverberation on Listeners. 7.2.1 Speech Perception. 7.2.2 Sound Localization. 7.2.3 Source Separation and Signal Detection. 7.2.4 Distance Perception. 7.2.5 Auditory Spatial Impression. 7.3 Effects of Reverberation on Machines. 7.4 Mechanisms Underlying Robustness to Reverberation in Human Listeners. 7.4.1 The Role of Slow Temporal Modulations in Speech Perception. 7.4.2 The Binaural Advantage. 7.4.3 The Precedence Effect. 7.4.4 Perceptual Compensation for Spectral Envelope Distortion. 7.5 Reverberation-Robust Acoustic Processing. 7.5.1 Dereverberation. 7.5.2 Reverberation-Robust Acoustic Features. 7.5.3 Reverberation Masking. 7.6 CASA and Reverberation. 7.6.1 Systems Based on Directional Filtering. 7.6.2 CASA for Robust ASR in Reverberant Conditions. 7.6.3 Systems that Use Multiple Cues. 7.7 Discussion and Conclusions. Acknowledgments. References.
8. Analysis of Musical Audio Signals (Masataka Goto). 8.1 Introduction. 8.2 Music Scene Description. 8.2.1 Music Scene Descriptions. 8.2.2 Difficulties Associated with Musical Audio Signals. 8.3 Estimating Melody and Bass Lines. 8.3.1 PreFEst-front-end: Forming the Observed Probability Density Functions. 8.3.2 PreFEst-core: Estimating the F0's Probability Density Function. 8.3.3 PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture. 8.3.4 Other Methods. 8.4 Estimating Beat Structure. 8.4.1 Estimating Period and Phase. 8.4.2 Dealing with Ambiguity. 8.4.3 Using Musical Knowledge. 8.5 Estimating Chorus Sections and Repeated Sections. 8.5.1 Extracting Acoustic Features and Calculating Their Similarity. 8.5.2 Finding Repeated Sections. 8.5.3 Grouping Repeated Sections. 8.5.4 Detecting Modulated Repetition. 8.5.5 Selecting Chorus Sections. 8.5.6 Other Methods. 8.6 Discussion and Conclusions. 8.6.1 Importance. 8.6.2 Evaluation Issues. 8.6.3 Future Directions. References.
9. Robust Automatic Speech Recognition (Jon Barker). 9.1 Introduction. 9.2 ASA and Speech Perception in Humans. 9.2.1 Speech Perception and Simultaneous Grouping. 9.2.2 Speech Perception and Sequential Grouping. 9.2.3 Speech Schemes. 9.2.4 Challenges to the ASA Account of Speech Perception. 9.2.5 Interim Summary. 9.3 Speech Recognition by Machine. 9.3.1 The Statistical Basis of ASR. 9.3.2 Traditional Approaches to Robust ASR. 9.3.3 CASA-Driven Approaches to ASR. 9.4 Primitive CASA and ASR. 9.4.1 Speech and Time-Frequency Masking. 9.4.2 The Missing-Data Approach to ASR. 9.4.3 Marginalization-Based Missing-Data ASR Systems. 9.4.4 Imputation-Based Missing-Data Solutions. 9.4.5 Estimating the Missing-Data Mask. 9.4.6 Difficulties with the Missing-Data Approach. 9.5 Model-Based CASA and ASR. 9.5.1 The Speech Fragment Decoding Framework. 9.5.2 Coupling Source Segregation and Recognition. 9.6 Discussion and Conclusions. 9.7 Concluding Remarks. References.
10. Neural and Perceptual Modeling (Guy J. Brown and DeLiang Wang). 10.1 Introduction. 10.2 The Neural Basis of Auditory Grouping. 10.2.1 Theoretical Solutions to the Binding Problem. 10.2.2 Empirical Results on Binding and ASA. 10.3 Models of Individual Neurons. 10.3.1 Relaxation Oscillators. 10.3.2 Spike Oscillators. 10.3.3 A Model of a Specific Auditory Neuron. 10.4 Models of Specific Perceptual Phenomena. 10.4.1 Perceptual Streaming of Tone Sequences. 10.4.2 Perceptual Segregation of Concurrent Vowels with Different F0s. 10.5 The Oscillatory Correlation Framework for CASA. 10.5.1 Speech Segregation Based on Oscillatory Correlation. 10.6 Schema-Driven Grouping. 10.7 Discussion. 10.7.1 Temporal or Spatial Coding of Auditory Grouping. 10.7.2 Physiological Support for Neural Time Delays. 10.7.3 Convergence of Psychological, Physiological, and Computational Approaches. 10.7.4 Neural Models as a Framework for CASA. 10.7.5 The Role of Attention. 10.7.6 Schema-Based Organization. Acknowledgments. References.
Index.

940 citations


Journal Article•DOI•
TL;DR: This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception through the use of ideal time-frequency binary masks.
Abstract: When a target speech signal is obscured by an interfering speech waveform, comprehension of the target message depends both on the successful detection of the energy from the target speech waveform and on the successful extraction and recognition of the spectro-temporal energy pattern of the target out of a background of acoustically similar masker sounds. This study attempted to isolate the effects that energetic masking, defined as the loss of detectable target information due to the spectral overlap of the target and masking signals, has on multitalker speech perception. This was achieved through the use of ideal time-frequency binary masks that retained those spectro-temporal regions of the acoustic mixture that were dominated by the target speech but eliminated those regions that were dominated by the interfering speech. The results suggest that energetic masking plays a relatively small role in the overall masking that occurs when speech is masked by interfering speech but a much more significant role when speech is masked by interfering noise.
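As an illustration of the ideal-mask construction underlying this study, here is a minimal Python sketch, assuming the target and interference are available separately before mixing; the STFT front end and the 0 dB local-SNR criterion are simplifying assumptions (the study itself used an auditory time-frequency decomposition).

```python
# Hedged sketch: ideal binary mask from separately known target and
# interference. The STFT resolution and 0 dB threshold are illustrative.
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, interference, fs, lc_db=0.0):
    """Return the ideal binary mask and the mask-resynthesized mixture."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, I = stft(interference, fs=fs, nperseg=512)
    local_snr = 20 * np.log10((np.abs(T) + 1e-12) / (np.abs(I) + 1e-12))
    mask = (local_snr > lc_db).astype(float)  # 1 where the target dominates
    _, _, M = stft(target + interference, fs=fs, nperseg=512)
    _, masked = istft(mask * M, fs=fs, nperseg=512)
    return mask, masked
```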

388 citations


Journal Article•DOI•
TL;DR: A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that the proposed algorithm performs substantially better.
Abstract: Under noise-free conditions, the quality of reverberant speech is dependent on two distinct perceptual components: coloration and long-term reverberation. They correspond to two physical variables: signal-to-reverberant energy ratio (SRR) and reverberation time, respectively. Inspired by this observation, we propose a two-stage reverberant speech enhancement algorithm using one microphone. In the first stage, an inverse filter is estimated to reduce coloration effects or increase SRR. The second stage employs spectral subtraction to minimize the influence of long-term reverberation. The proposed algorithm significantly improves the quality of reverberant speech. A comparison with a recent enhancement algorithm is made on a corpus of speech utterances in a number of reverberant conditions, and the results show that our algorithm performs substantially better.
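The second stage lends itself to a compact sketch. Below is a hedged Python illustration of spectral subtraction against long-term reverberation, in which the late-reverberant power is approximated as a scaled, delayed copy of the earlier short-time power spectrum; the 50 ms delay, the scale factor, and the spectral floor are illustrative placeholders, not the paper's tuned smearing model.

```python
# Hedged sketch of the spectral subtraction stage: subtract an estimate
# of late-reverberant power (a scaled, delayed copy of earlier power).
import numpy as np
from scipy.signal import stft, istft

def subtract_late_reverb(x, fs, delay_ms=50.0, scale=0.3, floor=0.01):
    f, t, X = stft(x, fs=fs, nperseg=512, noverlap=256)
    power = np.abs(X) ** 2
    hop_s = 256 / fs                                # frame hop in seconds
    d = max(1, int(round(delay_ms / 1000.0 / hop_s)))
    late = np.zeros_like(power)
    late[:, d:] = scale * power[:, :-d]             # delayed power as reverb estimate
    clean_power = np.maximum(power - late, floor * power)  # spectral floor
    X_clean = np.sqrt(clean_power) * np.exp(1j * np.angle(X))
    _, y = istft(X_clean, fs=fs, nperseg=512, noverlap=256)
    return y
```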

226 citations


Journal Article•DOI•
TL;DR: In this article, a time-varying Wiener filter is used to specify the energy ratio of the target signal to the noisy mixture in each local time-frequency unit; the enhanced signal is then fed to a conventional speech recognizer operating in the cepstral domain.
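As a rough illustration of such a ratio (Wiener-style) mask, the sketch below computes an oracle ratio from separately known target and noise; the paper estimates this quantity rather than assuming access to the clean signal, and the STFT front end is an assumption here.

```python
# Hedged sketch: oracle time-varying Wiener-style ratio mask.
import numpy as np
from scipy.signal import stft

def ratio_mask(target, noise, fs):
    """Per-unit ratio of target energy to mixture energy, in [0, 1]."""
    _, _, T = stft(target, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    return np.abs(T) ** 2 / (np.abs(T) ** 2 + np.abs(N) ** 2 + 1e-12)
```

The masked mixture spectrogram would then be converted to cepstral features (e.g., MFCCs) before decoding.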

212 citations


Journal Article•DOI•
TL;DR: A method for segmenting images consisting of texture and nontexture regions based on local spectral histograms, which derives probability models from the local spectral histograms of homogeneous regions and iteratively updates the segmentation using the derived probability models.
Abstract: We present a method for segmenting images consisting of texture and nontexture regions based on local spectral histograms. Defined as a vector consisting of marginal distributions of chosen filter responses, local spectral histograms provide a feature statistic for both types of regions. Using local spectral histograms of homogeneous regions, we decompose the segmentation process into three stages. The first is the initial classification stage, where probability models for homogeneous texture and nontexture regions are derived and an initial segmentation result is obtained by classifying local windows. In the second stage, we give an algorithm that iteratively updates the segmentation using the derived probability models. The third is the boundary localization stage, where region boundaries are localized by building refined probability models that are sensitive to spatial patterns in segmented regions. We present segmentation results on texture as well as nontexture images. Our comparison with other methods shows that the proposed method produces more accurate segmentation results.
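The feature itself is simple to state in code. The sketch below computes a local spectral histogram at one pixel, assuming an intensity channel plus two Sobel gradient responses and 8 bins per marginal; the paper's actual filter bank and bin choices may differ.

```python
# Hedged sketch: local spectral histogram feature at one pixel.
import numpy as np
from scipy.ndimage import sobel

def local_spectral_histogram(image, center, win=15, bins=8):
    # Filter responses: intensity plus two gradient channels (assumed set).
    responses = [image, sobel(image, axis=0), sobel(image, axis=1)]
    r, c = center
    h = win // 2                      # window assumed to lie inside the image
    feats = []
    for resp in responses:
        patch = resp[r - h:r + h + 1, c - h:c + h + 1]
        hist, _ = np.histogram(patch, bins=bins, range=(resp.min(), resp.max()))
        feats.append(hist / hist.sum())   # marginal distribution of the response
    return np.concatenate(feats)
```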

123 citations


Journal Article•DOI•
Yang Shao, DeLiang Wang•
TL;DR: This paper extends the traditional SID framework to cochannel speech, derives a joint objective for sequential grouping and SID that leads to a search for the optimum hypothesis, and proposes a hypothesis pruning algorithm based on speaker models to make the search computationally efficient.
Abstract: A human listener has the ability to follow a speaker's voice while others are speaking simultaneously; in particular, the listener can organize the time-frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and improves speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a problem of search for the optimum hypothesis. Subsequently we propose a hypothesis pruning algorithm based on speaker models in order to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.
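To make the joint objective concrete, here is a toy Python sketch that scores every assignment of usable-speech segments to two streams under every ordered speaker pair; exhaustive enumeration is shown only for clarity, since the paper's contribution is precisely a model-based pruning algorithm that avoids this exponential search.

```python
# Toy sketch: score grouping hypotheses by total speaker-model likelihood.
import itertools
import numpy as np

def best_grouping(segment_scores):
    """segment_scores: array [n_segments, n_speakers] of per-segment
    log-likelihoods under each speaker model (n_speakers >= 2)."""
    n_seg, n_spk = segment_scores.shape
    best = (-np.inf, None)
    for labels in itertools.product([0, 1], repeat=n_seg):      # stream assignment
        for a, b in itertools.permutations(range(n_spk), 2):    # speaker pair
            ll = sum(segment_scores[i, a if g == 0 else b]
                     for i, g in enumerate(labels))
            best = max(best, (ll, (labels, (a, b))), key=lambda x: x[0])
    return best
```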

83 citations


Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Physical and Physiological Mechanisms Underlying Auditory Localization, Spatial Perception of Single Sources, Spatial Perception of Multiple Sources, Models of Binaural Perception, Multisource Sound Localization, and General Discussion.
Abstract: This chapter contains sections titled: Introduction, Physical and Physiological Mechanisms Underlying Auditory Localization, Spatial Perception of Single Sources, Spatial Perception of Multiple Sources, Models of Binaural Perception, Multisource Sound Localization, General Discussion, Acknowledgments, and References.

74 citations


Journal Article•DOI•
TL;DR: A binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of the target source is proposed, and comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
Abstract: In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, a binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of the target source is proposed. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
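The binary decision rule can be sketched compactly. Assuming the adaptive target-cancellation module has already produced a residual signal, a T-F unit is labeled target-dominant when cancellation attenuates its energy strongly; the STFT front end and the 6 dB threshold below are illustrative assumptions.

```python
# Hedged sketch: binary mask from the attenuation achieved by a target
# canceler (an adaptive filter trained when only the target is active).
import numpy as np
from scipy.signal import stft

def mask_from_cancellation(mixture, residual, fs, atten_db=6.0):
    _, _, M = stft(mixture, fs=fs, nperseg=512)
    _, _, R = stft(residual, fs=fs, nperseg=512)   # output of target canceler
    attenuation = 20 * np.log10((np.abs(M) + 1e-12) / (np.abs(R) + 1e-12))
    return (attenuation > atten_db).astype(float)  # strong attenuation => target
```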

60 citations


01 Jan 2006
TL;DR: This chapter presents in detail a CASA system that segregates both voiced and unvoiced speech and covers the major stages of CASA, including feature extraction, auditory segmentation, and grouping.
Abstract: A human listener has the remarkable ability to segregate an acoustic mixture and attend to a target sound. This perceptual process is called auditory scene analysis (ASA). Moreover, the listener can accomplish much of auditory scene analysis with only one ear. Research in ASA has inspired many studies in computational auditory scene analysis (CASA) for sound segregation. In this chapter we introduce a CASA approach to monaural speech segregation. After a brief overview of CASA, we present in detail a CASA system that segregates both voiced and unvoiced speech. Our description covers the major stages of CASA, including feature extraction, auditory segmentation, and grouping.

60 citations


Journal Article•DOI•
TL;DR: This work proposes a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to the target location and a pitch-based speech segregation method, and shows that the proposed system results in considerable signal-to-noise ratio gains across different conditions.
Abstract: In everyday listening, both background noise and reverberation degrade the speech signal. Psychoacoustic evidence suggests that human speech perception under reverberant conditions relies mostly on monaural processing. While speech segregation based on periodicity has achieved considerable progress in handling additive noise, little research in monaural segregation has been devoted to reverberant scenarios. Reverberation smears the harmonic structure of speech signals, and our evaluations using a pitch-based segregation algorithm show that an increase in the room reverberation time causes degraded performance due to weakened periodicity in the target signal. We propose a two-stage monaural separation system that combines the inverse filtering of the room impulse response corresponding to the target location and a pitch-based speech segregation method. As a result of the first stage, the harmonicity of a signal arriving from the target direction is partially restored while signals arriving from other directions are further smeared, and this leads to improved segregation. A systematic evaluation shows that the proposed system yields considerable signal-to-noise ratio gains across different conditions. Potential applications of this system include robust automatic speech recognition and hearing aid design.
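The first stage reduces, in essence, to a convolution with an estimated inverse filter, as in the minimal sketch below; estimating that filter blindly is the substantive problem the paper addresses, so it appears here only as a placeholder argument.

```python
# Hedged sketch: applying an (already estimated) inverse filter of the
# room impulse response to partially restore harmonic structure.
import numpy as np

def inverse_filter_stage(reverberant, inv_filter):
    return np.convolve(reverberant, inv_filter, mode="same")
```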

51 citations


Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Human Auditory Scene Analysis, Computational Auditory Scene Analysis (CASA), Basics of CASA Systems, CASA Evaluation, and Other Sound Separation Approaches.
Abstract: This chapter contains sections titled: Human Auditory Scene Analysis, Computational Auditory Scene Analysis (CASA), Basics of CASA Systems, CASA Evaluation, Other Sound Separation Approaches, A Brief History of CASA (Prior to 2000), Conclusions, Acknowledgments, and References.

Proceedings Article•DOI•
Yang Shao, DeLiang Wang•
14 May 2006
TL;DR: By employing a speech segregation system that estimates the ideal binary mask, this study achieves significant improvements over alternative approaches and demonstrates that binary masking represents a promising direction for robust speaker recognition.
Abstract: Conventional speaker recognition systems perform poorly under noisy conditions. In this paper, we evaluate binary time-frequency masks for robust speaker recognition. An ideal binary mask is a priori defined as a binary matrix where 1 indicates that the target is stronger than the interference within the corresponding time-frequency unit and 0 indicates otherwise. We perform speaker identification and verification using a missing data recognizer under cochannel and other noise conditions, and show that the ideal binary mask provides large performance gains. By employing a speech segregation system that estimates the ideal binary mask, we achieve significant improvements over alternative approaches. Our study, thus, demonstrates that the use of binary masking represents a promising direction for robust speaker recognition.

Proceedings Article•
Yipeng Li, DeLiang Wang•
01 Jan 2006
TL;DR: A system to separate singing voice from music accompaniment in monaural recordings is proposed, and quantitative results show that the system performs well in singing voice separation.
Abstract: Separating singing voice from music accompaniment has wide applications in areas such as automatic lyrics recognition and alignment, singer identification, and music information retrieval. Compared to the extensive studies of speech separation, singing voice separation has been little explored. We propose a system to separate singing voice from music accompaniment in monaural recordings. The system has three stages. The singing voice detection stage partitions and classifies an input into vocal and non-vocal portions. Then the predominant pitch detection stage detects the pitch contour of the singing voice for vocal portions. Finally, the separation stage uses the detected pitch contour to group the time-frequency segments of the singing voice. Quantitative results show that the system performs well in singing voice separation.
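The grouping stage can be illustrated with a small sketch: a T-F unit is assigned to the singing voice when its normalized autocorrelation is high at the detected pitch lag, indicating dominance by a periodic signal at that pitch. The 0.85 threshold and the framing below are assumptions for illustration.

```python
# Hedged sketch: label a T-F unit as voice-dominant via its
# autocorrelation at the detected pitch lag.
import numpy as np

def unit_matches_pitch(unit_signal, pitch_lag, threshold=0.85):
    """unit_signal: samples of one T-F unit (bandpass-filtered frame);
    pitch_lag: detected pitch period in samples, 0 < pitch_lag < len(unit_signal)."""
    x = unit_signal - unit_signal.mean()
    energy = np.dot(x, x) + 1e-12
    ac = np.dot(x[:-pitch_lag], x[pitch_lag:]) / energy  # normalized autocorrelation
    return ac > threshold
```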

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Signal Models, Single-Voice F0 Estimation, Multiple-Voice F0 Estimation, Issues, Other Sources of Information, Estimating the Number of Sources, Evaluation, Application Scenarios, and Conclusion.
Abstract: This chapter contains sections titled: Introduction, Signal Models, Single-Voice F0 Estimation, Multiple-Voice F0 Estimation, Issues, Other Sources of Information, Estimating the Number of Sources, Evaluation, Application Scenarios, Conclusion, Acknowledgments, and References.

Proceedings Article•DOI•
01 Jan 2006
TL;DR: A computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise, which estimates the ideal binary time-frequency (T-F) mask that retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit.
Abstract: We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting T-F masks are used in conjunction with missing-data methods for recognition. Systematic evaluations on a speech separation challenge task show significant improvement over the baseline performance.

Book Chapter•DOI•
05 Mar 2006
TL;DR: A method for underdetermined blind source separation of convolutive mixtures is proposed that combines blind source separation techniques with binary time-frequency masking; it is applicable to instantaneous as well as convolutive speech mixtures and needs only two microphones.
Abstract: A limitation in many source separation tasks is that the number of source signals has to be known in advance. Further, in order to achieve good performance, the number of sources cannot exceed the number of sensors. In many real-world applications these limitations are too restrictive. We propose a method for underdetermined blind source separation of convolutive mixtures. The proposed framework is applicable for separation of instantaneous as well as convolutive speech mixtures. It is possible to iteratively extract each speech signal from the mixture by combining blind source separation techniques with binary time-frequency masking. In the proposed method, the number of source signals is not assumed to be known in advance and the number of sources is not limited to the number of microphones. Our approach needs only two microphones and the separated sounds are maintained as stereo signals.
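A conceptual sketch of the iterative extraction loop follows. The two-channel BSS stage is passed in as a caller-supplied function (hypothetical here), and the dominance threshold and stopping rule are illustrative assumptions rather than the paper's exact criteria.

```python
# Hedged sketch: iteratively peel off sources by alternating a two-input
# BSS stage with binary T-F masking, keeping the remainder as stereo.
import numpy as np
from scipy.signal import stft, istft

def extract_sources(left, right, bss_two_channel, fs, max_sources=6, tau_db=6.0):
    """bss_two_channel: caller-supplied (hypothetical) function mapping two
    mixture channels to two separated outputs."""
    sources = []
    for _ in range(max_sources):
        y1, y2 = bss_two_channel(left, right)
        _, _, Y1 = stft(y1, fs=fs, nperseg=512)
        _, _, Y2 = stft(y2, fs=fs, nperseg=512)
        ratio = 20 * np.log10((np.abs(Y1) + 1e-12) / (np.abs(Y2) + 1e-12))
        mask = ratio > tau_db                     # units where y1 dominates
        if mask.mean() < 0.05:                    # no clearly dominant source left
            break
        _, _, L = stft(left, fs=fs, nperseg=512)
        _, _, R = stft(right, fs=fs, nperseg=512)
        _, s = istft(mask * L, fs=fs, nperseg=512)
        sources.append(s)
        # Keep the complementary units (still stereo) for the next iteration.
        _, left = istft(~mask * L, fs=fs, nperseg=512)
        _, right = istft(~mask * R, fs=fs, nperseg=512)
    return sources
```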

Journal Article•
TL;DR: In this paper, the pitch strength is measured by deriving the statistics of relative time lags, defined as the distances from the detected pitch periods to the closest peaks in a correlogram.
Abstract: Reverberation corrupts harmonic structure in voiced speech. We observe that the pitch strength of voiced speech segments is indicative of the degree of reverberation. Consequently, we present an estimation method of reverberation time (T60) based on pitch strength. The pitch strength is measured by deriving the statistics of relative time lags, defined as the distances from the detected pitch periods to the closest peaks in a correlogram. The monotonic relationship between the measured pitch strength and reverberation time learned from a corpus of reverberant speech with known reverberation times yields an estimate of T60 up to 0.6 seconds.
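The relative-time-lag statistic is easy to sketch: for each voiced frame, record the distance from the detected pitch period to the nearest autocorrelation peak; reverberation flattens and shifts these peaks, widening the lag distribution. The deliberately simple peak picking below is an assumption.

```python
# Hedged sketch: relative time lag between the detected pitch period and
# the nearest autocorrelation peak of one frame.
import numpy as np
from scipy.signal import argrelmax

def relative_time_lag(frame, pitch_period):
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # nonneg lags
    peaks = argrelmax(ac)[0]                   # local maxima (lag 0 excluded)
    if peaks.size == 0:
        return None
    return int(peaks[np.argmin(np.abs(peaks - pitch_period))]) - pitch_period
```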

Proceedings Article•DOI•
14 May 2006
TL;DR: A novel approach is described for segregating unvoiced speech, which lacks harmonic structure, has weaker energy, and is hence more susceptible to interference.
Abstract: Speech segregation, or the cocktail party problem, has proven to be extremely challenging. While efforts in computational auditory scene analysis have led to considerable progress in voiced speech segregation, little attention has been given to unvoiced speech, which lacks harmonic structure and has weaker energy and is hence more susceptible to interference. We describe a novel approach to address this problem. The segregation process occurs in two stages: segmentation and grouping. In segmentation, our model decomposes the input mixture into contiguous time-frequency segments by analyzing sound onsets and offsets. Grouping of unvoiced segments is based on Bayesian classification of acoustic-phonetic features. The proposed model yields very promising results.
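The grouping step can be caricatured as a Bayesian likelihood-ratio test between two pretrained models of acoustic-phonetic features; the diagonal-Gaussian models and two-class setup below are simplifying assumptions, not the paper's exact classifier.

```python
# Hedged sketch: Bayesian decision for one segment's feature vector
# between speech and interference models (diagonal Gaussians assumed).
import numpy as np

def is_unvoiced_speech(feat, speech_model, noise_model, prior_ratio=1.0):
    """Each model is a dict with 1-D arrays 'mean' and 'var'."""
    def loglik(f, m):
        d = f - m["mean"]
        return -0.5 * np.sum(np.log(2 * np.pi * m["var"]) + d * d / m["var"])
    return loglik(feat, speech_model) - loglik(feat, noise_model) \
        + np.log(prior_ratio) > 0
```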

Proceedings Article•DOI•
14 May 2006
TL;DR: A supervised approach is proposed to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain; it shows substantial improvement over the baseline performance on the Aurora4 task.
Abstract: Recently several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation leads to a smearing of this uncertainty. We propose a supervised approach to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance.
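A hedged sketch of the supervised mapping follows, using an MLP regressor to predict cepstral-domain variances from spectral-domain uncertainty features; the regressor choice, inputs, and targets are assumptions for illustration (training pairs would come from frames where both quantities are known).

```python
# Hedged sketch: learn a nonlinear map from spectral-domain uncertainty
# (e.g., per-band mask values of a frame) to cepstral-domain variances.
from sklearn.neural_network import MLPRegressor

def train_uncertainty_map(spectral_uncert, cepstral_var):
    """spectral_uncert: [n_frames, n_bands]; cepstral_var: [n_frames, n_ceps]."""
    model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    model.fit(spectral_uncert, cepstral_var)
    return model   # model.predict(new_uncert) gives per-frame cepstral variances
```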

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, Music Scene Description, Estimating Melody and Bass Lines, Estimating Beat Structure, Estimating Chorus Sections and Repeated Sections, and Discussion and Conclusions.
Abstract: This chapter contains sections titled: Introduction, Music Scene Description, Estimating Melody and Bass Lines, Estimating Beat Structure, Estimating Chorus Sections and Repeated Sections, Discussion and Conclusions, and References.

Dissertation•
01 Jan 2006
TL;DR: A model is presented that simulates listeners' ability to attend to a target speaker whose speech is degraded by the effects of energetic and informational masking in multitalker environments, and finds that while missing-data recognition outperforms conventional ASR on a small vocabulary task, conventional ASR performs significantly better when the vocabulary size is increased.
Abstract: We present a schema-based model for phonemic restoration. The model employs missing-data ASR to decode speech based on unmasked portions and activates word templates that contain the masked phoneme via dynamic time warping. An activated template is then used to restore the masked phoneme. A systematic evaluation shows that the model is able to restore both voiced and unvoiced phonemes with a spectral quality close to that of original phonemes. Missing-data ASR relies on a binary mask generated by bottom-up CASA to label the speech-dominant time-frequency (T-F) regions of a noisy mixture as reliable and the rest as unreliable. However, errors in mask estimation cause degradation in recognition accuracy. Hence, we propose a two-pass ASR system that performs segregation and recognition in tandem. In the first pass, an n-best lattice, consistent with bottom-up speech separation, is generated. The lattice is then re-scored using a model-based hypothesis test to improve mask estimation and recognition accuracy concurrently. This two-pass system leads to significant improvement in recognition performance. By combining a monaural CASA system with missing-data ASR, we present a model that simulates listeners' ability to attend to a target speaker when degraded by the effects of energetic and informational masking in multitalker environments. Energetic masking refers to the phenomenon that a stronger signal masks a weaker one within a critical band. Informational masking occurs when the listener is unable to segregate target from interference. Missing-data ASR is used to account for energetic masking. The effects of informational masking are modeled by the output degradation of the CASA system in binary mask estimation. The model successfully simulates several quantitative aspects of listener performance including the differential effects of energetic and informational masking on multitalker perception. While missing-data ASR performs well on small vocabulary tasks, previous studies have not examined the effect of vocabulary size. In this dissertation, we investigate the performance of the missing-data ASR on a larger vocabulary task and compare its results to those of conventional ASR. For conventional ASR, we extract the speech signal from a noisy mixture by estimating a Wiener filter based on estimated interaural time and intensity differences within a T-F unit. For missing-data ASR, the same estimation is used to produce a binary T-F mask. We find that while missing-data recognition outperforms conventional ASR on a small vocabulary task, the performance of conventional ASR is significantly better when the vocabulary size is increased. (Abstract shortened by UMI.)
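Since missing-data ASR recurs throughout this work, a minimal sketch of its core operation may help: under a diagonal Gaussian state model, marginalizing the unreliable dimensions of a frame simply drops them from the likelihood (the bounded variant, which also uses the observed value as an upper limit of integration, is omitted for brevity).

```python
# Hedged sketch: missing-data marginalization for one frame under a
# single diagonal Gaussian; only reliable (mask = 1) dimensions count.
import numpy as np

def marginal_log_likelihood(x, mask, mean, var):
    """x, mask, mean, var: 1-D arrays; mask is 1 where x is reliable."""
    r = mask.astype(bool)
    d = x[r] - mean[r]
    return -0.5 * np.sum(np.log(2 * np.pi * var[r]) + d * d / var[r])
```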

Proceedings Article•DOI•
14 May 2006
TL;DR: A systematic evaluation in terms of automatic speech recognition (ASR) performance shows substantial improvements over the baseline performance and better results than related two-microphone approaches.
Abstract: We present a binaural solution to robust speech recognition in multi-source reverberant environments. We employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. Our system estimates this ideal binary mask at the output of a target cancellation module implemented using adaptive filtering. This mask is used in conjunction with a missing-data algorithm to decode the target utterance. A systematic evaluation in terms of automatic speech recognition (ASR) performance shows substantial improvements over the baseline performance and better results over related two-microphone approaches.

Book Chapter•DOI•
01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, The Neural Basis of Auditory Grouping, Models of Individual Neurons, Models of Specific Perceptual Phenomena, The Oscillatory Correlation Framework for CASA, Schema-Driven Grouping, and Discussion.
Abstract: This chapter contains sections titled: Introduction, The Neural Basis of Auditory Grouping, Models of Individual Neurons, Models of Specific Perceptual Phenomena, The Oscillatory Correlation Framework for CASA, Schema-Driven Grouping, Discussion, Acknowledgments, and References.

Book•
01 Nov 2006
TL;DR: Proceedings, Part I, of the 13th International Conference on Neural Information Processing (ICONIP 2006), held in Hong Kong, China, October 3-6, 2006.
Abstract: Neural Information Processing: 13th International Conference, ICONIP 2006, Hong Kong, China, October 3-6, 2006, Proceedings, Part I (Lecture Notes In ... Computer Science And General Issues).



Book Chapter•DOI•
DeLiang Wang•
23 Oct 2006
TL;DR: The auditory environment is typically composed of multiple simultaneous events, and the auditory system is able to disentangle the acoustic mixture and group the sound energy that originates from the same event or source.
Abstract: The acoustic environment is typically composed of multiple simultaneous events. A remarkable achievement of the auditory system is its ability to disentangle the acoustic mixture and group the sound energy that originates from the same event or source. This process of auditory organization is referred to as auditory scene analysis. The cocktail party problem, or segregation of speech from interfering sounds, has proven to be extremely challenging from the computational standpoint.