Author

Futoshi Asano

Bio: Futoshi Asano is an academic researcher at the National Institute of Advanced Industrial Science and Technology. His research focuses on speech enhancement and microphone arrays. He has an h-index of 22 and has co-authored 78 publications receiving 2,106 citations.


Papers
Journal ArticleDOI
TL;DR: This optimized ATSP (OATSP) has an almost ideal characteristic for measuring impulse responses shorter than its specific length N, and it is newly shown in this paper that OATSP also has a good characteristic for measuring impulse responses longer than N.
Abstract: Transfer functions of acoustic systems often exhibit wide dynamic ranges and very long impulse responses. A "time-stretched" pulse as proposed by Aoshima (ATSP), though originally given in a very specific form, seems to be one of the most promising signals for measuring transfer functions with the characteristics mentioned above. In this paper, this pulse (ATSP) is first generalized and then optimized for the measurement of long impulse responses. The optimized ATSP (OATSP) has an almost ideal characteristic for measuring impulse responses shorter than its specific length N. Moreover, it is newly shown in this paper that OATSP also has a good characteristic for measuring impulse responses longer than N. A discussion is presented on how to design an OATSP suited to a specific measurement situation by analyzing the errors that arise when the pulse is used to measure impulse responses longer than N.

284 citations
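For orientation, below is a minimal sketch of impulse-response measurement with a quadratic-phase time-stretched pulse, in the spirit of the TSP family described above. The phase constant and the parameter m (which controls the stretch) follow a common textbook formulation rather than the paper's exact OATSP design, and `play_and_record` is a hypothetical stand-in for the actual playback/recording chain.

```python
import numpy as np

def make_tsp(N, m):
    """Quadratic-phase (time-stretched) pulse of even length N."""
    k = np.arange(N // 2 + 1)
    H = np.exp(1j * 4.0 * m * np.pi * k**2 / N**2)    # unit magnitude per bin
    return np.fft.irfft(H, n=N), H                    # real pulse + its spectrum

def measure_ir(play_and_record, N=16384, m=2048):
    tsp, H = make_tsp(N, m)
    y = play_and_record(tsp)                  # response of the acoustic system
    Y = np.fft.rfft(y[:N])
    return np.fft.irfft(Y * np.conj(H), n=N)  # inverse filter: 1/H = conj(H)
```

Because |H| = 1 in every bin, the inverse filter is simply the conjugate spectrum, and the recovery is exact whenever the true impulse response is shorter than N (the circular-convolution regime), which is precisely the boundary the paper's error analysis addresses.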

Proceedings Article
01 May 2000
Abstract: LREC2000: the 2nd International Conference on Language Resources and Evaluation, May 31 - June 2, 2000, Athens, Greece.

259 citations

Journal ArticleDOI
TL;DR: Two array signal processing techniques are combined with independent component analysis (ICA) to enhance blind separation of acoustic signals in a reflective environment; the first, the subspace method, reduces the effect of room reflections when the system is used in a room.
Abstract: Two array signal processing techniques are combined with independent component analysis (ICA) to enhance the performance of blind separation of acoustic signals in a reflective environment. The first technique is the subspace method, which reduces the effect of room reflections when the system is used in a room. Room reflection is one of the biggest problems for blind source separation (BSS) in acoustic environments. The second technique is a method of solving the permutation problem. To employ the subspace method, ICA must be performed in the frequency domain, and correct permutation is necessary at all frequencies. In this method, a physical property of the mixing matrix, namely its coherency in adjacent frequencies, is utilized to solve the permutation. Experiments in a meeting room showed that the subspace method improved the automatic speech recognition rate from 50% to 68%, and that the permutation-solving method closely approaches the performance of the correct permutation, differing by only 4% in recognition rate.

164 citations
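The permutation step lends itself to a compact illustration. The sketch below aligns per-bin ICA results using the coherency of the estimated mixing vectors in adjacent frequencies, which is the physical property the abstract refers to. The array layout `A[f]` (one estimated mixing matrix per frequency bin) and the brute-force search over permutations are illustrative simplifications, practical only for a small number of sources.

```python
import numpy as np
from itertools import permutations

def align_permutations(A):
    """A: (F, M, K) per-bin mixing-matrix estimates (M mics, K sources).
    Reorders each bin's columns so source order is consistent across bins."""
    F, M, K = A.shape
    out = A.copy()
    for f in range(1, F):
        def coherence(perm):
            # Sum of normalized inner products between this bin's mixing
            # vectors (under candidate ordering) and the previous bin's.
            return sum(
                abs(np.vdot(out[f - 1, :, k], A[f, :, perm[k]]))
                / (np.linalg.norm(out[f - 1, :, k])
                   * np.linalg.norm(A[f, :, perm[k]]) + 1e-12)
                for k in range(K))
        best = max(permutations(range(K)), key=coherence)
        out[f] = A[f][:, list(best)]
    return out
```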

Journal ArticleDOI
TL;DR: A method of speech enhancement using microphone-array signal processing based on the subspace method: less-directional ambient noise is reduced by eliminating the noise-dominant subspace, and the spectrum of the target source is then extracted from the mixture of directional components remaining in the modified spatial correlation matrix.
Abstract: A method of speech enhancement using microphone-array signal processing based on the subspace method is proposed and evaluated. The method consists of the following two stages corresponding to the different types of noise. In the first stage, less-directional ambient noise is reduced by eliminating the noise-dominant subspace. It is realized by weighting the eigenvalues of the spatial correlation matrix. This is based on the fact that the energy of less-directional noise spreads over all eigenvalues while that of directional components is concentrated on a few dominant eigenvalues. In the second stage, the spectrum of the target source is extracted from the mixture of spectra of the multiple directional components remaining in the modified spatial correlation matrix by using a minimum variance beamformer. Finally, the proposed method is evaluated in both a simulated model environment and a real environment.

126 citations
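A minimal per-frequency-bin sketch of the two stages is given below, assuming R is the measured spatial correlation matrix, n_dir the assumed number of directional components, and a_target the target's steering vector (which in practice must be estimated). The eigenvalue weighting and the minimum-variance beamformer follow standard textbook forms rather than the paper's exact weighting rule.

```python
import numpy as np

def subspace_mv_weights(R, a_target, n_dir=2, floor=1e-3):
    # Stage 1: less-directional ambient noise spreads over all eigenvalues,
    # so shrink all but the n_dir dominant ones (directional components).
    eigvals, V = np.linalg.eigh(R)               # ascending eigenvalues
    g = np.full_like(eigvals, floor)
    g[-n_dir:] = 1.0
    R_mod = (V * (g * eigvals)) @ V.conj().T     # modified spatial correlation

    # Stage 2: minimum-variance beamformer steered at the target,
    # built from the modified matrix.
    Ri = np.linalg.inv(R_mod + 1e-9 * np.eye(R.shape[0]))
    w = Ri @ a_target / (a_target.conj() @ Ri @ a_target)
    return w                                     # per-bin output: y = w.conj() @ x
```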

Proceedings ArticleDOI
28 Sep 2004
TL;DR: In this method, audio information and video information are fused by a Bayesian network to enable the detection of speech events, and the information of detected speech events is utilized in sound separation using adaptive beamforming.
Abstract: For cooperative work of robots and humans in the real world, a communicative function based on speech is indispensable for robots. To realize such a function in a noisy real environment, it is essential that robots be able to extract target speech spoken by humans from a mixture of sounds using their own resources. We have developed a method of detecting and extracting speech events based on the fusion of audio and video information. In this method, audio information (sound localization using a microphone array) and video information (human tracking using a camera) are fused by a Bayesian network to enable the detection of speech events. The information of detected speech events is then utilized in sound separation using adaptive beamforming. In this paper, some basic investigations for applying the above system to the humanoid robot HRP-2 are reported. Input devices, namely a microphone array and a camera, were mounted on the head of HRP-2, and acoustic characteristics relevant to sound localization/separation performance were investigated. The human tracking system was also improved so that it can be used in dynamic situations. Finally, the overall performance of the system was tested via off-line experiments.

124 citations
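The fusion step can be illustrated with a deliberately reduced example: a naive-Bayes combination of one audio cue (localized sound power toward a candidate direction) and one video cue (whether a tracked person occupies that direction). The paper uses a Bayesian network over richer variables; the likelihood models and constants below are assumed placeholders, not values from the paper.

```python
import numpy as np

def speech_event_posterior(audio_power, face_present, prior=0.2):
    """Posterior probability of a speech event, fusing an audio cue and a
    video cue assumed conditionally independent given the event."""
    # Toy likelihood models (assumed forms):
    p_a_speech = 1.0 / (1.0 + np.exp(-(audio_power - 3.0)))  # localized power
    p_a_silence = 1.0 - p_a_speech
    p_v_speech = 0.9 if face_present else 0.1                # tracked person
    p_v_silence = 0.3 if face_present else 0.7

    num = prior * p_a_speech * p_v_speech
    den = num + (1.0 - prior) * p_a_silence * p_v_silence
    return num / den
```

A detection (posterior above a threshold) would then steer the adaptive beamformer toward the corresponding direction for extraction.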


Cited by
Journal ArticleDOI
TL;DR: The context for socially interactive robots is discussed, emphasizing the relationship to other research fields and the different forms of "social robots", and a taxonomy of design methods and system components used to build socially interactive robots is presented.

2,869 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
Abstract: The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.

781 citations
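The augmentation recipe described above reduces to a few lines. The sketch below convolves clean speech with a room impulse response, then adds a point-source noise, itself convolved with an RIR from the same room, at a chosen SNR; the function names and the per-utterance SNR sampling are illustrative, not the paper's exact pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(speech, rir_speech, noise, rir_noise, snr_db):
    """Reverberate speech and add a reverberated point-source noise at snr_db.
    Assumes noise is at least as long as speech."""
    reverberant = fftconvolve(speech, rir_speech)[: len(speech)]
    n = fftconvolve(noise, rir_noise)[: len(speech)]
    # Scale noise to the target SNR relative to the reverberant speech.
    p_s = np.mean(reverberant**2)
    p_n = np.mean(n**2) + 1e-12
    n *= np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
    return reverberant + n

# Typical use: draw snr_db per utterance, e.g. np.random.uniform(0, 20).
```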

Journal ArticleDOI
TL;DR: By utilizing the harmonics of signals, the new method is made robust even at low frequencies, where DOA estimation is inaccurate; it provides an almost perfect solution to the permutation problem for a case where two sources were mixed in a room whose reverberation time was 300 ms.
Abstract: Blind source separation (BSS) for convolutive mixtures can be solved efficiently in the frequency domain, where independent component analysis (ICA) is performed separately in each frequency bin. However, frequency-domain BSS involves a permutation problem: the permutation ambiguity of ICA in each frequency bin should be aligned so that a separated signal in the time-domain contains frequency components of the same source signal. This paper presents a robust and precise method for solving the permutation problem. It is based on two approaches: direction of arrival (DOA) estimation for sources and the interfrequency correlation of signal envelopes. We discuss the advantages and disadvantages of the two approaches, and integrate them to exploit their respective advantages. Furthermore, by utilizing the harmonics of signals, we make the new method robust even for low frequencies where DOA estimation is inaccurate. We also present a new closed-form formula for estimating DOAs from a separation matrix obtained by ICA. Experimental results show that our method provided an almost perfect solution to the permutation problem for a case where two sources were mixed in a room whose reverberation time was 300 ms.

644 citations
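The interfrequency envelope-correlation cue also admits a compact sketch: each bin's output order is chosen to maximize correlation with a smoothed reference envelope accumulated from lower frequencies. The DOA cue, the integration of the two approaches, and the harmonic refinement from the paper are not reproduced; the layout `S` (per-bin separated magnitude spectrograms) and the exponential reference update are illustrative choices.

```python
import numpy as np
from itertools import permutations

def solve_permutation_by_envelope(S):
    """S: (F, K, T) separated spectrogram magnitudes per frequency bin.
    Returns a copy with a consistent source order across bins."""
    F, K, T = S.shape
    out = S.copy()
    ref = out[0].copy()                       # running reference envelopes (K, T)

    def corr(a, b):
        a = a - a.mean(); b = b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    for f in range(1, F):
        best = max(permutations(range(K)),
                   key=lambda p: sum(corr(ref[k], S[f, p[k]]) for k in range(K)))
        out[f] = S[f, list(best)]
        ref = 0.9 * ref + 0.1 * out[f]        # smooth update of the reference
    return out
```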

Journal ArticleDOI
TL;DR: This paper studies the quantitative performance behavior of the Wiener filter in the context of noise reduction and shows that in the single-channel case the a posteriori signal-to-noise ratio (SNR), defined after the Wiener filter, is greater than or equal to the a priori SNR, defined before the Wiener filter, indicating that the Wiener filter is always able to achieve noise reduction.
Abstract: The problem of noise reduction has attracted a considerable amount of research attention over the past several decades. Among the numerous techniques that have been developed, the optimal Wiener filter can be considered one of the most fundamental noise reduction approaches; it has been delineated in different forms and adopted in various applications. Although it is no secret that the Wiener filter may cause detrimental effects to the speech signal (appreciable or even significant degradation in quality or intelligibility), few efforts have been reported to show the inherent relationship between noise reduction and speech distortion. By defining a speech-distortion index to measure the degree to which the speech signal is deformed, and two noise-reduction factors to quantify the amount of noise being attenuated, this paper studies the quantitative performance behavior of the Wiener filter in the context of noise reduction. We show that in the single-channel case the a posteriori signal-to-noise ratio (SNR), defined after the Wiener filter, is greater than or equal to the a priori SNR, defined before the Wiener filter, indicating that the Wiener filter is always able to achieve noise reduction. However, the amount of noise reduction is in general proportional to the amount of speech degradation. This may seem discouraging, as we always expect an algorithm to achieve maximal noise reduction without much speech distortion. Fortunately, we show that speech distortion can be better managed in three different ways. If we have some a priori knowledge (such as the linear prediction coefficients) of the clean speech signal, this knowledge can be exploited to achieve noise reduction while maintaining a low level of speech distortion. When no a priori knowledge is available, we can still achieve better control of noise reduction and speech distortion by properly manipulating the Wiener filter, resulting in a suboptimal Wiener filter. Finally, if we have multiple microphone sensors, the multiple observations of the speech signal can be used to reduce noise with less or even no speech distortion.

563 citations
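The core inequality has a quick numeric illustration. The snippet below checks a frequency-domain analogue of the claim: the per-bin Wiener gain weights high-SNR bins more heavily, so the fullband output SNR can only rise. The abstract states the result for the single-channel Wiener filter itself; the per-bin formulation and toy PSD values here are an assumption for the sake of a short demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
ps = rng.uniform(0.1, 10.0, size=257)   # toy per-bin speech PSD
pn = rng.uniform(0.1, 10.0, size=257)   # toy per-bin noise PSD

H = ps / (ps + pn)                      # Wiener gain H = xi/(1+xi), xi = ps/pn

snr_prior = ps.sum() / pn.sum()                    # fullband a priori SNR
snr_post = (H**2 * ps).sum() / (H**2 * pn).sum()   # fullband a posteriori SNR
assert snr_post >= snr_prior   # noise is attenuated more than speech, always
```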

Journal ArticleDOI
TL;DR: This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Abstract: Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial preprocessing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this paper, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.

452 citations