
Showing papers by "DeLiang Wang published in 2004"


Journal ArticleDOI
TL;DR: This work proposes a novel system for voiced speech segregation that segregates resolved and unresolved harmonics differently, and it yields substantially better performance, especially for the high-frequency part of speech.
Abstract: Segregating speech from one monaural recording has proven to be very challenging. Monaural segregation of voiced speech has been studied in previous systems that incorporate auditory scene analysis principles. A major problem for these systems is their inability to deal with the high-frequency part of speech. Psychoacoustic evidence suggests that different perceptual mechanisms are involved in handling resolved and unresolved harmonics. We propose a novel system for voiced speech segregation that segregates resolved and unresolved harmonics differently. For resolved harmonics, the system generates segments based on temporal continuity and cross-channel correlation, and groups them according to their periodicities. For unresolved harmonics, it generates segments based on common amplitude modulation (AM) in addition to temporal continuity and groups them according to AM rates. Underlying the segregation process is a pitch contour that is first estimated from speech segregated according to dominant pitch and then adjusted according to psychoacoustic constraints. Our system is systematically evaluated and compared with previous systems, and it yields substantially better performance, especially for the high-frequency part of speech.
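One grouping cue mentioned above, cross-channel correlation, can be illustrated with a short sketch (the signals, lag range, and window length below are hypothetical toy values, not taken from the paper): adjacent filterbank channels responding to the same resolved harmonic produce highly correlated autocorrelation patterns, while a noise-driven channel does not.

```python
import numpy as np

def autocorr(frame, max_lag):
    """Normalized autocorrelation of one channel's response within a frame."""
    frame = frame - frame.mean()
    ac = np.array([np.dot(frame[:len(frame) - l], frame[l:]) for l in range(max_lag)])
    return ac / (ac[0] + 1e-12)

def cross_channel_correlation(responses, max_lag=80):
    """Correlation between the autocorrelations of adjacent channels.
    High values suggest the channels respond to the same periodicity."""
    acs = np.array([autocorr(r, max_lag) for r in responses])
    acs = acs - acs.mean(axis=1, keepdims=True)
    acs /= np.linalg.norm(acs, axis=1, keepdims=True) + 1e-12
    return np.sum(acs[:-1] * acs[1:], axis=1)

# Two channels driven by the same 100 Hz periodicity, one driven by noise
fs = 8000
t = np.arange(0, 0.02, 1 / fs)
rng = np.random.default_rng(0)
ch_a = np.sin(2 * np.pi * 100 * t)
ch_b = 0.8 * np.sin(2 * np.pi * 100 * t + 0.3)
ch_c = rng.standard_normal(len(t))
corr = cross_channel_correlation([ch_a, ch_b, ch_c])
# corr[0] (shared periodicity) should exceed corr[1] (periodic vs. noise)
```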

394 citations


Proceedings ArticleDOI
04 Oct 2004
TL;DR: A time-varying Wiener filter specifies the ratio of a target signal and a noisy mixture in a local time-frequency unit, which is used to extract the speech signal and is fed to a conventional speech recognizer operating in the cepstral domain.
Abstract: A time-varying Wiener filter specifies the ratio of a target signal and a noisy mixture in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing-data recognizer that operates in the spectral domain using the time-frequency units that are dominated by speech. To apply the missing-data recognizer, the same binaural processor is used to estimate an ideal binary time-frequency mask, which selects a local time-frequency unit if the speech signal within the unit is stronger than the interference. We find that the performance of the missing data recognizer is better on a small vocabulary recognition task but the performance of the conventional recognizer is substantially better when the vocabulary size is increased.
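The two mask types compared in this abstract can be sketched as follows. This is a minimal illustration with toy spectrogram values; the local SNR criterion `lc_db` and the epsilon guards are assumptions, not parameters from the paper.

```python
import numpy as np

def ratio_mask(target_spec, noise_spec):
    """Wiener-style ratio mask: fraction of energy attributed to the target
    in each time-frequency unit."""
    t2, n2 = np.abs(target_spec) ** 2, np.abs(noise_spec) ** 2
    return t2 / (t2 + n2 + 1e-12)

def ideal_binary_mask(target_spec, noise_spec, lc_db=0.0):
    """Select units where the target exceeds the interference by lc_db dB."""
    snr_db = 10 * np.log10((np.abs(target_spec) ** 2 + 1e-12)
                           / (np.abs(noise_spec) ** 2 + 1e-12))
    return (snr_db > lc_db).astype(float)

# Toy 2x3 magnitude spectrograms for target S and noise N
S = np.array([[3.0, 0.5, 2.0], [0.2, 4.0, 1.0]])
N = np.array([[1.0, 2.0, 2.0], [3.0, 1.0, 0.5]])
rm = ratio_mask(S, N)
ibm = ideal_binary_mask(S, N)
enhanced = rm * (S + N)  # apply the soft mask to a toy mixture magnitude
```

The soft (ratio) mask feeds a conventional cepstral-domain recognizer, while the binary mask marks which spectral units a missing-data recognizer may trust.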

178 citations


Journal ArticleDOI
TL;DR: The binaural auditory model improves speech recognition performance in small room reverberation conditions in the presence of spatially separated noise, particularly for conditions in which the spatial separation is 20° or larger.

119 citations


Proceedings ArticleDOI
17 May 2004
TL;DR: A novel method for binaural sound segregation from acoustic mixtures contaminated by both multiple interference and reverberation is presented, which employs an adaptive filter that performs target cancellation.
Abstract: We present a novel method for binaural sound segregation from acoustic mixtures contaminated by both multiple interference and reverberation. We employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency (T-F) unit. As opposed to classical adaptive filtering, which focuses on the suppression of noise, our model employs an adaptive filter that performs target cancellation. T-F units dominated by a target are largely suppressed at the output of the cancellation unit when compared to units dominated by noise. Consequently, the actual input-to-output attenuation level in each T-F unit is used to estimate an ideal binary mask. A systematic evaluation in terms of automatic speech recognition performance shows that the resulting system produces masks close to ideal binary ones.
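The mask-estimation idea, that T-F units strongly attenuated by the target-cancellation filter are target-dominated, can be sketched as below. The 6 dB threshold and the toy energies are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

def mask_from_attenuation(in_energy, out_energy, theta_db=6.0):
    """Label a T-F unit as target-dominated when the cancellation stage
    attenuates it by more than theta_db decibels (hypothetical threshold)."""
    atten_db = 10 * np.log10((in_energy + 1e-12) / (out_energy + 1e-12))
    return (atten_db > theta_db).astype(int)

# Toy per-unit energies: target-dominated units are strongly cancelled,
# noise-dominated units pass through the cancellation filter largely intact
in_e = np.array([8.0, 1.0, 5.0, 2.0])
out_e = np.array([0.5, 0.9, 0.2, 1.8])
mask = mask_from_attenuation(in_e, out_e)  # 1 = keep unit for the target
```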

30 citations


Journal ArticleDOI
TL;DR: Using a sigmoid interaction results in ~n^2 synchronization time for relaxation oscillators in the sinusoidal and relaxation regimes, indicating that the form of the coupling is a controlling factor in the synchronization rate.
Abstract: Relaxation oscillators arise frequently in physics, electronics, mathematics, and biology. Their mathematical definitions possess a high degree of flexibility in the sense that through appropriate parameter choices relaxation oscillators can be made to exhibit qualitatively different kinds of oscillations. We study numerically four different classes of relaxation oscillators through their synchronization rates in one-dimensional chains with a Heaviside step function interaction and obtain the following results. Relaxation oscillators in the sinusoidal and relaxation regimes both exhibit an average time to synchrony scaling as ~n, where n is the chain length. Relaxation oscillators in the singular limit exhibit ~n^p, where p is a numerically obtained value less than 0.5. Relaxation oscillators in the singular limit with parameters modified so that they resemble spike oscillations exhibit ~log(n) in chains and ~log(L) in two-dimensional square networks of length L. Finally, using a sigmoid interaction results in ~n^2 for relaxation oscillators in the sinusoidal and relaxation regimes, indicating that the form of the coupling is a controlling factor in the synchronization rate.
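A minimal numerical sketch of the setup, a one-dimensional chain of relaxation oscillators coupled through a Heaviside step of their neighbors' activity, might look like the following. It uses the van der Pol oscillator as one representative relaxation oscillator; the parameter values, coupling strength, and simple Euler integration are illustrative assumptions, not those of the paper.

```python
import numpy as np

def simulate_chain(n=5, mu=5.0, c=0.2, theta=0.0, dt=1e-3, t_end=50.0, seed=0):
    """Euler-integrate a 1-D chain of van der Pol relaxation oscillators
    (x' = y, y' = mu*(1 - x^2)*y - x) with Heaviside step coupling:
    each oscillator receives input c whenever a neighbor's x exceeds theta."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, n)
    y = rng.uniform(-1, 1, n)
    steps = int(round(t_end / dt))
    traj = np.empty((steps, n))
    for k in range(steps):
        # Heaviside coupling from left and right neighbors
        s = np.zeros(n)
        s[1:] += (x[:-1] > theta)
        s[:-1] += (x[1:] > theta)
        dx = y
        dy = mu * (1 - x ** 2) * y - x + c * s
        x, y = x + dt * dx, y + dt * dy
        traj[k] = x
    return traj

traj = simulate_chain()  # rows: time steps, columns: oscillators in the chain
```

From trajectories like `traj`, the time to synchrony can be measured (for instance, the first time all oscillators jump within a small window of each other), and the scaling with chain length n estimated by repeating the run for increasing n.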

18 citations


Proceedings Article
01 Jan 2004
TL;DR: Systematic evaluation shows that the proposed auditory segmentation system correctly segments much target speech, including unvoiced speech, and separates target speech and interference well into different segments.
Abstract: Acoustic signals from different sources in a natural environment form an auditory scene. Auditory scene analysis (ASA) is the process in which the auditory system segregates an auditory scene into streams corresponding to different sources. Segmentation is an important stage of ASA where an auditory scene is decomposed into segments, each of which contains signal mainly from one source. We propose a system for auditory segmentation based on analyzing onsets and offsets of auditory events. Our system first detects onsets and offsets, and then generates segments by matching corresponding onsets and offsets. This is achieved through a multiscale approach based on scale-space theory. Systematic evaluation shows that much target speech, including unvoiced speech, is correctly segmented, and target speech and interference are well separated into different segments.
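The onset/offset analysis can be sketched at a single scale as below; the multiscale scheme of the paper would repeat this over several smoothing scales and match detections across scales. All parameters here (the smoothing scale, threshold, and synthetic envelope) are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Discrete Gaussian smoothing kernel, truncated at 4 sigma."""
    radius = int(4 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def detect_onsets_offsets(envelope, sigma=4.0, thresh=0.05):
    """Smooth an intensity envelope at scale sigma; mark onsets as large
    positive peaks of its derivative and offsets as large negative peaks."""
    smoothed = np.convolve(envelope, gaussian_kernel(sigma), mode="same")
    d = np.gradient(smoothed)
    onsets = [i for i in range(1, len(d) - 1)
              if d[i] > thresh and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    offsets = [i for i in range(1, len(d) - 1)
               if d[i] < -thresh and d[i] <= d[i - 1] and d[i] <= d[i + 1]]
    return onsets, offsets

# Synthetic envelope: silence, a burst of energy, silence again
env = np.concatenate([np.zeros(50), np.ones(60), np.zeros(50)])
onsets, offsets = detect_onsets_offsets(env)
# an onset should be detected near index 50 and an offset near index 110
```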

13 citations


Proceedings ArticleDOI
25 Jul 2004
TL;DR: The comparison suggests that a combined network is likely to enhance the overall processing capability, and reveals fundamental differences between CNN and LEGION.
Abstract: CNN and LEGION networks have been extensively studied in recent years. These two frameworks share many common features; both employ continuous-time dynamics, are nonlinear, and emphasize local connectivity. In addition, they both have been successfully applied to visual processing tasks and implemented on analog VLSI chips. This paper investigates the relations between the two frameworks. We present their standard versions, and contrast the underlying dynamics and connectivity. We also describe several tasks where both CNN and LEGION have been applied. The comparison reveals fundamental differences between them. CNN is good for early visual processing, whereas LEGION is good for midlevel visual processing. Furthermore, the comparison suggests that a combined network is likely to enhance the overall processing capability.

12 citations


Journal ArticleDOI
TL;DR: This Special Issue intends to present, in a collective way, research that makes a clear contribution to addressing information processing tasks using temporal coding by providing a comprehensive view of the current approaches and issues to the neural network community.
Abstract: LARGELY motivated by neurobiological discoveries, neural network research is witnessing a significant shift of emphasis towards temporal coding, which uses time as an essential dimension in neural representations. Temporal coding is passionately debated in neuroscience and related fields, but in the last few years a large volume of physiological and behavioral data has emerged that supports a key role for temporal coding in the brain. In neural networks, extensive research is undertaken under the topics of nonlinear dynamics, oscillatory and chaotic networks, spiking neurons, and pulse-coupled networks. Various information processing tasks have been investigated using temporal coding, including scene analysis, figure-ground separation, classification, associative learning, inference, and motor control. Progress has been made that substantially advances the state-of-the-art of neural computing. In many instances, however, neural models incorporating temporal coding are driven merely by the fact that real neurons use impulses. It is often unclear whether, and to what extent, the temporal aspects of the models contribute to information processing capabilities. This Special Issue was conceived in part to assess the role and potential of temporal coding in terms of information processing by providing a comprehensive view of the current approaches and issues to the neural network community. The Special Issue intends to present, in a collective way, research that makes a clear contribution to addressing information processing tasks using temporal coding. The issue serves not only to highlight successful use of temporal coding in neural computation but also clarify outstanding issues for future progress. The Special Issue Call for Papers received a very strong response from the community. A total of 64 manuscripts were submitted for consideration. 
They represent a broad spectrum of research in temporal coding, ranging from the study of the synchronization phenomenon to spatiotemporal processing. Of these submissions, 33 papers were accepted following a rigorous review process coordinated by the guest editors. The accepted papers are organized into the following eight topics. Coincidence detection: Mikula and Niebur study an ideal coincidence detector and give a solution for its steady-state output in response to an arbitrary number of excitatory and inhibitory spike trains. Beroule surveys temporal processing networks and explicitly discusses the time dimension and temporal coding with implications in perception, learning, and memory. Watanabe and Aihara propose an iterative model with time delay for temporal spike coding. By changing local

5 citations


Proceedings ArticleDOI
01 Jan 2004
TL;DR: This work derives a joint computational objective for speaker assignment and cochannel SID, leading to a problem of search for the optimum hypothesis, and proposes a hypothesis pruning method based on speaker models to make the search computationally feasible.
Abstract: It is difficult to directly apply traditional speaker identification (SID) systems to cochannel speech, i.e., mixtures of speech from two speakers. Previous work demonstrates that extraction of usable speech segments significantly improves SID performance if speaker assignment, or the sequential organization of the segments, is known. We derive a joint computational objective for speaker assignment and cochannel SID, leading to a search problem for the optimum hypothesis. We propose a hypothesis pruning method based on speaker models to make the search computationally feasible. Evaluation results show that the proposed algorithm approaches the ceiling SID performance obtained with prior pitch information, and yields significant improvement over alternative approaches on speaker assignment.
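The search over speaker assignments can be sketched as a beam search over per-segment model scores. This is a generic stand-in for the paper's model-based hypothesis pruning; the scores and beam width below are toy values.

```python
import heapq

def best_assignment(seg_scores, beam=4):
    """Beam search over speaker assignments of usable speech segments.
    seg_scores[i] = (logL under speaker 1, logL under speaker 2) for segment i.
    A hypothesis assigns each segment to one speaker (0 or 1); its score is
    the sum of the chosen log-likelihoods. Pruning keeps only the top `beam`
    partial hypotheses after each segment."""
    hyps = [(0.0, ())]  # (score, assignment so far)
    for s1, s2 in seg_scores:
        expanded = []
        for score, assign in hyps:
            expanded.append((score + s1, assign + (0,)))
            expanded.append((score + s2, assign + (1,)))
        hyps = heapq.nlargest(beam, expanded)
    return max(hyps)

# Toy scores: segments 0 and 2 fit speaker 1 better, segment 1 fits speaker 2
scores = [(-1.0, -4.0), (-5.0, -2.0), (-0.5, -3.0)]
best_score, best_assign = best_assignment(scores)
# expected best assignment: (0, 1, 0) with score -3.5
```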

1 citation