Author

DeLiang Wang

Bio: DeLiang Wang is an academic researcher at Ohio State University. He has contributed to research on speech processing and speech enhancement, has an h-index of 82, and has co-authored 440 publications receiving 23,687 citations. His previous affiliations include the Massachusetts Institute of Technology and Tsinghua University.


Papers
Journal ArticleDOI
TL;DR: Results in various test conditions reveal that the two ratio-mask targets, the IRM and the FFT-MASK, outperform the other targets on objective intelligibility and quality metrics, and that masking-based targets are, in general, significantly better than spectral-envelope-based targets.
Abstract: Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and the large speech intelligibility gains it yields. The supervised learning framework, however, is not restricted to binary targets. In this study, we evaluate and compare separation results using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio-mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking-based targets are, in general, significantly better than spectral-envelope-based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages for supervised speech separation.
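To make the two ratio-mask targets concrete, here is a minimal sketch of how they might be computed from time-aligned clean and noise signals; the STFT helper, window and hop choices, and the beta exponent (0.5 is a common choice) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: computing ratio-mask training targets from clean/noise pairs.
import numpy as np

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude STFT with a Hann window (librosa/scipy offer equivalents)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))  # (frames, bins)

def ideal_ratio_mask(clean, noise, beta=0.5):
    """IRM: per T-F-unit speech-to-mixture energy ratio, bounded in [0, 1]."""
    s2 = stft_mag(clean) ** 2
    n2 = stft_mag(noise) ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta

def fft_mask(clean, mixture):
    """FFT-MASK: clean STFT magnitude divided by noisy STFT magnitude."""
    return stft_mag(clean) / (stft_mag(mixture) + 1e-12)
```

Multiplying an estimated mask point-wise with the noisy spectrogram and inverting the STFT with the noisy phase yields the separated waveform.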

1,046 citations

Journal ArticleDOI
TL;DR: This paper provides a comprehensive overview of deep learning-based supervised speech separation, discussing its three main components: learning machines, training targets, and acoustic features.
Abstract: Speech separation is the task of separating target speech from background interference. Traditionally, speech separation has been studied as a signal processing problem. A more recent approach formulates speech separation as a supervised learning problem, where the discriminative patterns of speech, speakers, and background noise are learned from training data. Over the past decade, many supervised separation algorithms have been put forward. In particular, the recent introduction of deep learning to supervised speech separation has dramatically accelerated progress and boosted separation performance. This paper provides a comprehensive overview of the research on deep learning-based supervised speech separation in the last several years. We first introduce the background of speech separation and the formulation of supervised separation. Then we discuss the three main components of supervised separation: learning machines, training targets, and acoustic features. Much of the overview is devoted to separation algorithms, where we review monaural methods, including speech enhancement (speech-nonspeech separation), speaker separation (multitalker separation), and speech dereverberation, as well as multimicrophone techniques. The important issue of generalization, unique to supervised learning, is discussed. This overview provides a historical perspective on how advances have been made. In addition, we discuss a number of conceptual issues, including what constitutes the target source.
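As a concrete illustration of the supervised formulation described above, the sketch below wires the three components together: log-magnitude spectra as acoustic features, the IRM as the training target, and a small feedforward network as the learning machine. The framework and target come from the overview; the specific architecture, layer sizes, and optimizer here are illustrative assumptions, not any particular published system.

```python
# Sketch: a feedforward mask estimator for supervised speech separation.
import torch
import torch.nn as nn

N_BINS = 257  # rfft bins for a 512-point STFT (assumed front end)

mask_net = nn.Sequential(                    # the "learning machine"
    nn.Linear(N_BINS, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, N_BINS), nn.Sigmoid(),   # IRM values lie in [0, 1]
)
optimizer = torch.optim.Adam(mask_net.parameters(), lr=1e-3)

def train_step(noisy_logmag, irm_target):
    """One step: acoustic features -> estimated mask -> MSE to the target."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(mask_net(noisy_logmag), irm_target)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the estimated mask is applied point-wise to the noisy
# spectrogram and the waveform is resynthesized via the inverse STFT.
```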

1,009 citations

Journal ArticleDOI
TL;DR: This book provides a comprehensive treatment of computational auditory scene analysis (CASA), covering its perceptual foundations, multiple-F0 estimation, feature-based and model-based speech segregation, binaural localization and grouping, reverberation, musical audio analysis, robust automatic speech recognition, and neural and perceptual modeling.
Abstract: Foreword. Preface. Contributors. Acronyms.
1. Fundamentals of Computational Auditory Scene Analysis (DeLiang Wang and Guy J. Brown). 1.1 Human Auditory Scene Analysis. 1.1.1 Structure and Function of the Auditory System. 1.1.2 Perceptual Organization of Simple Stimuli. 1.1.3 Perceptual Segregation of Speech from Other Sounds. 1.1.4 Perceptual Mechanisms. 1.2 Computational Auditory Scene Analysis (CASA). 1.2.1 What Is CASA? 1.2.2 What Is the Goal of CASA? 1.2.3 Why CASA? 1.3 Basics of CASA Systems. 1.3.1 System Architecture. 1.3.2 Cochleagram. 1.3.3 Correlogram. 1.3.4 Cross-Correlogram. 1.3.5 Time-Frequency Masks. 1.3.6 Resynthesis. 1.4 CASA Evaluation. 1.4.1 Evaluation Criteria. 1.4.2 Corpora. 1.5 Other Sound Separation Approaches. 1.6 A Brief History of CASA (Prior to 2000). 1.6.1 Monaural CASA Systems. 1.6.2 Binaural CASA Systems. 1.6.3 Neural CASA Models. 1.7 Conclusions. Acknowledgments. References.
2. Multiple F0 Estimation (Alain de Cheveigne). 2.1 Introduction. 2.2 Signal Models. 2.3 Single-Voice F0 Estimation. 2.3.1 Spectral Approach. 2.3.2 Temporal Approach. 2.3.3 Spectrotemporal Approach. 2.4 Multiple-Voice F0 Estimation. 2.4.1 Spectral Approach. 2.4.2 Temporal Approach. 2.4.3 Spectrotemporal Approach. 2.5 Issues. 2.5.1 Spectral Resolution. 2.5.2 Temporal Resolution. 2.5.3 Spectrotemporal Resolution. 2.6 Other Sources of Information. 2.6.1 Temporal and Spectral Continuity. 2.6.2 Instrument Models. 2.6.3 Learning-Based Techniques. 2.7 Estimating the Number of Sources. 2.8 Evaluation. 2.9 Application Scenarios. 2.10 Conclusion. Acknowledgments. References.
3. Feature-Based Speech Segregation (DeLiang Wang). 3.1 Introduction. 3.2 Feature Extraction. 3.2.1 Pitch Detection. 3.2.2 Onset and Offset Detection. 3.2.3 Amplitude Modulation Extraction. 3.2.4 Frequency Modulation Detection. 3.3 Auditory Segmentation. 3.3.1 What Is the Goal of Auditory Segmentation? 3.3.2 Segmentation Based on Cross-Channel Correlation and Temporal Continuity. 3.3.3 Segmentation Based on Onset and Offset Analysis. 3.4 Simultaneous Grouping. 3.4.1 Voiced Speech Segregation. 3.4.2 Unvoiced Speech Segregation. 3.5 Sequential Grouping. 3.5.1 Spectrum-Based Sequential Grouping. 3.5.2 Pitch-Based Sequential Grouping. 3.5.3 Model-Based Sequential Grouping. 3.6 Discussion. Acknowledgments. References.
4. Model-Based Scene Analysis (Daniel P. W. Ellis). 4.1 Introduction. 4.2 Source Separation as Inference. 4.3 Hidden Markov Models. 4.4 Aspects of Model-Based Systems. 4.4.1 Constraints: Types and Representations. 4.4.2 Fitting Models. 4.4.3 Generating Output. 4.5 Discussion. 4.5.1 Unknown Interference. 4.5.2 Ambiguity and Adaptation. 4.5.3 Relations to Other Separation Approaches. 4.6 Conclusions. References.
5. Binaural Sound Localization (Richard M. Stern, Guy J. Brown, and DeLiang Wang). 5.1 Introduction. 5.2 Physical and Physiological Mechanisms Underlying Auditory Localization. 5.2.1 Physical Cues. 5.2.2 Physiological Estimation of ITD and IID. 5.3 Spatial Perception of Single Sources. 5.3.1 Sensitivity to Differences in Interaural Time and Intensity. 5.3.2 Lateralization of Single Sources. 5.3.3 Localization of Single Sources. 5.3.4 The Precedence Effect. 5.4 Spatial Perception of Multiple Sources. 5.4.1 Localization of Multiple Sources. 5.4.2 Binaural Signal Detection. 5.5 Models of Binaural Perception. 5.5.1 Classical Models of Binaural Hearing. 5.5.2 Cross-Correlation-Based Models of Binaural Interaction. 5.5.3 Some Extensions to Cross-Correlation-Based Binaural Models. 5.6 Multisource Sound Localization. 5.6.1 Estimating Source Azimuth from Interaural Cross-Correlation. 5.6.2 Methods for Resolving Azimuth Ambiguity. 5.6.3 Localization of Moving Sources. 5.7 General Discussion. Acknowledgments. References.
6. Localization-Based Grouping (Albert S. Feng and Douglas L. Jones). 6.1 Introduction. 6.2 Classical Beamforming Techniques. 6.2.1 Fixed Beamforming Techniques. 6.2.2 Adaptive Beamforming Techniques. 6.2.3 Independent Component Analysis Techniques. 6.2.4 Other Localization-Based Techniques. 6.3 Location-Based Grouping Using Interaural Time Difference Cue. 6.4 Location-Based Grouping Using Interaural Intensity Difference Cue. 6.5 Location-Based Grouping Using Multiple Binaural Cues. 6.6 Discussion and Conclusions. Acknowledgments. References.
7. Reverberation (Guy J. Brown and Kalle J. Palomaki). 7.1 Introduction. 7.2 Effects of Reverberation on Listeners. 7.2.1 Speech Perception. 7.2.2 Sound Localization. 7.2.3 Source Separation and Signal Detection. 7.2.4 Distance Perception. 7.2.5 Auditory Spatial Impression. 7.3 Effects of Reverberation on Machines. 7.4 Mechanisms Underlying Robustness to Reverberation in Human Listeners. 7.4.1 The Role of Slow Temporal Modulations in Speech Perception. 7.4.2 The Binaural Advantage. 7.4.3 The Precedence Effect. 7.4.4 Perceptual Compensation for Spectral Envelope Distortion. 7.5 Reverberation-Robust Acoustic Processing. 7.5.1 Dereverberation. 7.5.2 Reverberation-Robust Acoustic Features. 7.5.3 Reverberation Masking. 7.6 CASA and Reverberation. 7.6.1 Systems Based on Directional Filtering. 7.6.2 CASA for Robust ASR in Reverberant Conditions. 7.6.3 Systems that Use Multiple Cues. 7.7 Discussion and Conclusions. Acknowledgments. References.
8. Analysis of Musical Audio Signals (Masataka Goto). 8.1 Introduction. 8.2 Music Scene Description. 8.2.1 Music Scene Descriptions. 8.2.2 Difficulties Associated with Musical Audio Signals. 8.3 Estimating Melody and Bass Lines. 8.3.1 PreFEst-front-end: Forming the Observed Probability Density Functions. 8.3.2 PreFEst-core: Estimating the F0's Probability Density Function. 8.3.3 PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture. 8.3.4 Other Methods. 8.4 Estimating Beat Structure. 8.4.1 Estimating Period and Phase. 8.4.2 Dealing with Ambiguity. 8.4.3 Using Musical Knowledge. 8.5 Estimating Chorus Sections and Repeated Sections. 8.5.1 Extracting Acoustic Features and Calculating Their Similarity. 8.5.2 Finding Repeated Sections. 8.5.3 Grouping Repeated Sections. 8.5.4 Detecting Modulated Repetition. 8.5.5 Selecting Chorus Sections. 8.5.6 Other Methods. 8.6 Discussion and Conclusions. 8.6.1 Importance. 8.6.2 Evaluation Issues. 8.6.3 Future Directions. References.
9. Robust Automatic Speech Recognition (Jon Barker). 9.1 Introduction. 9.2 ASA and Speech Perception in Humans. 9.2.1 Speech Perception and Simultaneous Grouping. 9.2.2 Speech Perception and Sequential Grouping. 9.2.3 Speech Schemes. 9.2.4 Challenges to the ASA Account of Speech Perception. 9.2.5 Interim Summary. 9.3 Speech Recognition by Machine. 9.3.1 The Statistical Basis of ASR. 9.3.2 Traditional Approaches to Robust ASR. 9.3.3 CASA-Driven Approaches to ASR. 9.4 Primitive CASA and ASR. 9.4.1 Speech and Time-Frequency Masking. 9.4.2 The Missing-Data Approach to ASR. 9.4.3 Marginalization-Based Missing-Data ASR Systems. 9.4.4 Imputation-Based Missing-Data Solutions. 9.4.5 Estimating the Missing-Data Mask. 9.4.6 Difficulties with the Missing-Data Approach. 9.5 Model-Based CASA and ASR. 9.5.1 The Speech Fragment Decoding Framework. 9.5.2 Coupling Source Segregation and Recognition. 9.6 Discussion and Conclusions. 9.7 Concluding Remarks. References.
10. Neural and Perceptual Modeling (Guy J. Brown and DeLiang Wang). 10.1 Introduction. 10.2 The Neural Basis of Auditory Grouping. 10.2.1 Theoretical Solutions to the Binding Problem. 10.2.2 Empirical Results on Binding and ASA. 10.3 Models of Individual Neurons. 10.3.1 Relaxation Oscillators. 10.3.2 Spike Oscillators. 10.3.3 A Model of a Specific Auditory Neuron. 10.4 Models of Specific Perceptual Phenomena. 10.4.1 Perceptual Streaming of Tone Sequences. 10.4.2 Perceptual Segregation of Concurrent Vowels with Different F0s. 10.5 The Oscillatory Correlation Framework for CASA. 10.5.1 Speech Segregation Based on Oscillatory Correlation. 10.6 Schema-Driven Grouping. 10.7 Discussion. 10.7.1 Temporal or Spatial Coding of Auditory Grouping. 10.7.2 Physiological Support for Neural Time Delays. 10.7.3 Convergence of Psychological, Physiological, and Computational Approaches. 10.7.4 Neural Models as a Framework for CASA. 10.7.5 The Role of Attention. 10.7.6 Schema-Based Organization. Acknowledgments. References.
Index.

940 citations

Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and in a listening test subjects prefer the proposed approach at a rate of at least 69%.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech and enhance only the magnitude spectrum while leaving the phase spectrum unchanged, because the phase spectrum was long believed to be unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider enhancing both the magnitude and phase spectra. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them with those of related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and in a listening test subjects prefer the proposed approach at a rate of at least 69%.
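For concreteness, here is a minimal sketch of the complex ideal ratio mask that such a network is trained to estimate: the complex STFT ratio S/Y, split into the real and imaginary parts a regression network can predict. Variable names are assumptions, and the bounded compression constants shown are illustrative (the raw mask is unbounded, so some squashing is typically applied before regression).

```python
# Sketch: the complex ideal ratio mask (cIRM) as a training target.
import numpy as np

def complex_irm(S, Y, eps=1e-12):
    """S, Y: complex STFTs of clean and noisy speech (same shape).
    Returns real/imag parts of M = S / Y, via S * conj(Y) / |Y|^2."""
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real, m_imag  # (m_real + 1j * m_imag) * Y recovers S

def compress(m, k=10.0, c=0.1):
    """Tanh-style squashing of the unbounded mask into (-k, k)."""
    return k * (1.0 - np.exp(-c * m)) / (1.0 + np.exp(-c * m))
```

Applying the estimated complex mask to the noisy STFT modifies both magnitude and phase at once, which is what distinguishes this target from real-valued masks such as the IRM.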

699 citations

Book ChapterDOI
01 Jan 2005
TL;DR: This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA problem.
Abstract: In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational-theory level, concerned with the goal of computation and the general processing strategy, must be separated from the algorithm level: the separation of what from how. This chapter is an attempt at a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the CASA problem.

617 citations


Cited by
Book
18 Nov 2016
TL;DR: Deep learning, as presented in this book, is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology, and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and video games. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: The major concepts and results recently achieved in the study of the structure and dynamics of complex networks are reviewed, and the relevant applications of these ideas in many different disciplines are summarized, ranging from nonlinear science to biology, from statistical mechanics to medicine and engineering.

9,441 citations

Journal ArticleDOI
06 Jun 1986 - JAMA
TL;DR: The editors have done a masterful job of weaving together the biologic, the behavioral, and the clinical sciences into a single tapestry in which everyone from the molecular biologist to the practicing psychiatrist can find and appreciate his or her own research.
Abstract: I have developed "tennis elbow" from lugging this book around the past four weeks, but it is worth the pain, the effort, and the aspirin. It is also worth the (relatively speaking) bargain price. Including appendixes, this book contains 894 pages of text. The entire panorama of the neural sciences is surveyed and examined, and it is comprehensive in its scope, from genomes to social behaviors. The editors explicitly state that the book is designed as "an introductory text for students of biology, behavior, and medicine," but it is hard to imagine any audience, interested in any fragment of neuroscience at any level of sophistication, that would not enjoy this book. The editors have done a masterful job of weaving together the biologic, the behavioral, and the clinical sciences into a single tapestry in which everyone from the molecular biologist to the practicing psychiatrist can find and appreciate his or her own research.

7,563 citations

Book
01 Jan 2010
TL;DR: Refocused, revised and renamed to reflect the duality of neural networks and learning machines, this edition recognizes that the subject matter is richer when these topics are studied together.
Abstract: For graduate-level neural network courses offered in the departments of Computer Engineering, Electrical Engineering, and Computer Science. Neural Networks and Learning Machines, Third Edition is renowned for its thoroughness and readability. This well-organized and completely up-to-date text remains the most comprehensive treatment of neural networks from an engineering perspective, and is ideal for professional engineers and research scientists. The MATLAB code used for the computer experiments in the text is available for download at: http://www.pearsonhighered.com/haykin/ Refocused, revised, and renamed to reflect the duality of neural networks and learning machines, this edition recognizes that the subject matter is richer when these topics are studied together. Ideas drawn from neural networks and machine learning are hybridized to perform improved learning tasks beyond the capability of either independently.

4,943 citations

Book
TL;DR: This book develops a geometric theory of biological timing, covering phase singularities, attracting cycles and isochrons, excitable kinetics and media, and circadian and cellular rhythms in chemical, physiological, and developmental systems.
Abstract: 1980 Preface * 1999 Preface * 1999 Acknowledgements * Introduction * 1 Circular Logic * 2 Phase Singularities (Screwy Results of Circular Logic) * 3 The Rules of the Ring * 4 Ring Populations * 5 Getting Off the Ring * 6 Attracting Cycles and Isochrons * 7 Measuring the Trajectories of a Circadian Clock * 8 Populations of Attractor Cycle Oscillators * 9 Excitable Kinetics and Excitable Media * 10 The Varieties of Phaseless Experience: In Which the Geometrical Orderliness of Rhythmic Organization Breaks Down in Diverse Ways * 11 The Firefly Machine * 12 Energy Metabolism in Cells * 13 The Malonic Acid Reagent ('Sodium Geometrate') * 14 Electrical Rhythmicity and Excitability in Cell Membranes * 15 The Aggregation of Slime Mold Amoebae * 16 Numerical Organizing Centers * 17 Electrical Singular Filaments in the Heart Wall * 18 Pattern Formation in the Fungi * 19 Circadian Rhythms in General * 20 The Circadian Clocks of Insect Eclosion * 21 The Flower of Kalanchoe * 22 The Cell Mitotic Cycle * 23 The Female Cycle * References * Index of Names * Index of Subjects

3,424 citations