Author
Kazuo Hiyane
Bio: Kazuo Hiyane is an academic researcher at Mitsubishi Research Institute. The author has contributed to research on sound and microphone arrays, has an h-index of 7, and has co-authored 14 publications receiving 366 citations.
Papers
•
01 May 2000
Abstract: LREC2000: the 2nd International Conference on Language Resources and Evaluation, May 31 - June 2, 2000, Athens, Greece.
259 citations
•
41 citations
•
01 Sep 1999
Abstract: EUROSPEECH1999: the 6th European Conference on Speech Communication and Technology, September 5-9, 1999, Budapest, Hungary.
30 citations
•
01 Jan 2002
21 citations
•
07 Nov 2002
TL;DR: Describes the progress of the sound scene database collection project and its application to environmental sound recognition and hands-free speech recognition.
Abstract: Sound data for open evaluation are necessary for studies such as sound source localization, sound retrieval, sound recognition, and hands-free speech recognition in real acoustic environments. This paper reports on our project for acoustic data collection. There are many kinds of sound scenes in real environments. A sound scene is specified by its sound sources and room acoustics, and the number of combinations of sound sources, source positions, and rooms in real acoustic environments is huge. We assumed that the sound in these environments can be simulated by convolving isolated sound sources with impulse responses. As isolated sound sources, a hundred kinds of environmental sounds and speech sounds were collected. The impulse responses were collected in various acoustic environments, and we additionally collected sounds from a moving source. In this paper, the progress of our sound scene database collection project and its application to environmental sound recognition and hands-free speech recognition are described.
13 citations
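The convolution-based simulation this abstract describes reduces to convolving an isolated (dry) source recording with a measured room impulse response. A minimal sketch in Python, assuming hypothetical mono 16-bit WAV file names:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs_src, dry = wavfile.read("isolated_source.wav")   # dry source recording (hypothetical file)
fs_rir, rir = wavfile.read("room_response.wav")     # measured impulse response (hypothetical file)
assert fs_src == fs_rir, "source and RIR must share one sample rate"

dry = dry.astype(np.float64)
rir = rir.astype(np.float64)

# The reverberant scene is the linear convolution of source and RIR.
scene = fftconvolve(dry, rir)
scene /= np.max(np.abs(scene)) + 1e-12              # normalize to avoid clipping

wavfile.write("simulated_scene.wav", fs_src, (scene * 32767).astype(np.int16))
```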
Cited by
•
05 Mar 2017
TL;DR: It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
Abstract: The environmental robustness of DNN-based acoustic models can be significantly improved by using multi-condition training data. However, as data collection is a costly proposition, simulation of the desired conditions is a frequently adopted strategy. In this paper we detail a data augmentation approach for far-field ASR. We examine the impact of using simulated room impulse responses (RIRs), as real RIRs can be difficult to acquire, and also the effect of adding point-source noises. We find that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added. Further we show that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario. We evaluate our approach on several LVCSR tasks which can adequately represent both scenarios.
781 citations
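The augmentation recipe summarized above amounts to convolving clean speech with a (possibly simulated) RIR, then adding a point-source noise, itself convolved with an RIR from the noise position, at a target SNR. A hedged sketch; the function name, array inputs, and 20 dB default SNR are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(speech, speech_rir, noise, noise_rir, snr_db=20.0):
    """Return a far-field mixture: reverberant speech plus scaled reverberant noise."""
    rev_speech = fftconvolve(speech, speech_rir)
    rev_noise = fftconvolve(noise, noise_rir)

    # Trim to a common length before mixing.
    n = min(len(rev_speech), len(rev_noise))
    rev_speech, rev_noise = rev_speech[:n], rev_noise[:n]

    # Scale the noise so the mixture hits the requested SNR.
    p_speech = np.mean(rev_speech ** 2)
    p_noise = np.mean(rev_noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return rev_speech + gain * rev_noise
```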
•
TL;DR: This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Abstract: Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial preprocessing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this paper, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.
452 citations
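As one concrete point on the survey's spatial-filter-design axis, a frequency-domain delay-and-sum beamformer is about the simplest multichannel enhancement filter. The sketch below assumes a uniform linear array; the geometry, sample rate, and sign convention are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def delay_and_sum(X, angle_deg, mic_spacing=0.05, fs=16000, c=343.0):
    """X: complex STFT tensor of shape (mics, freq_bins, frames). Returns enhanced STFT."""
    n_mics, n_freq, _ = X.shape
    freqs = np.linspace(0.0, fs / 2.0, n_freq)
    # Per-microphone arrival delays for a plane wave from angle_deg.
    delays = np.arange(n_mics) * mic_spacing * np.sin(np.deg2rad(angle_deg)) / c
    # Steering vector: undo each channel's phase delay, then average across mics.
    steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (mics, freq)
    return np.mean(steer[:, :, None] * X, axis=0)
```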
•
18 Nov 2011
TL;DR: Presents stable and fast update rules for independent vector analysis (IVA) based on an auxiliary function technique, which yield faster convergence and better results than natural gradient updates.
Abstract: This paper presents stable and fast update rules for independent vector analysis (IVA) based on an auxiliary function technique. The algorithm consists of two alternating updates: 1) weighted covariance matrix updates and 2) demixing matrix updates, which involve no tuning parameters such as a step size. The monotonic decrease of the objective function at each update is guaranteed. The experimental evaluation shows that the derived update rules yield faster convergence and better results than natural gradient updates.
308 citations
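The two alternating updates summarized above can be sketched compactly. This is a hedged sketch, not the authors' reference implementation; the tensor layout, iteration count, and epsilon are assumptions, and the per-source scale ambiguity is left unresolved:

```python
import numpy as np

def auxiva(X, n_iter=30, eps=1e-12):
    """X: complex STFT tensor, shape (freqs, frames, mics). Returns separated STFT."""
    F, T, M = X.shape
    W = np.tile(np.eye(M, dtype=complex), (F, 1, 1))      # per-frequency demixing matrices
    for _ in range(n_iter):
        Y = np.einsum("fkm,ftm->ftk", W, X)               # current source estimates
        # Source activity r_k[t]: l2-norm across frequency (spherical source model).
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps  # (frames, mics)
        for k in range(M):
            # 1) Weighted covariance matrix update for source k.
            Xw = X / r[None, :, k, None]
            V = np.einsum("ftm,ftn->fmn", Xw, X.conj()) / T
            # 2) Demixing vector update: w_k = (W V_k)^-1 e_k, then normalize.
            w = np.linalg.inv(W @ V)[:, :, k]              # (freqs, mics)
            norm = np.sqrt(np.einsum("fm,fmn,fn->f", w.conj(), V, w).real) + eps
            W[:, k, :] = (w / norm[:, None]).conj()
    return np.einsum("fkm,ftm->ftk", W, X)
```

Note that, matching the abstract, neither update carries a step size, and each iteration is guaranteed not to increase the objective.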
•
TL;DR: This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF) based on conventional multichannel NMF (MNMF), which reveals the relationship between MNMF and IVA.
Abstract: This paper addresses the determined blind source separation problem and proposes a new effective method unifying independent vector analysis (IVA) and nonnegative matrix factorization (NMF). IVA is a state-of-the-art technique that utilizes the statistical independence between sources in a mixture signal, and an efficient optimization scheme has been proposed for IVA. However, since the source model in IVA is based on a spherical multivariate distribution, IVA cannot utilize specific spectral structures such as the harmonic structures of pitched instrumental sounds. To solve this problem, we introduce NMF decomposition as the source model in IVA to capture the spectral structures. The formulation of the proposed method is derived from conventional multichannel NMF (MNMF), which reveals the relationship between MNMF and IVA. The proposed method can be optimized by the update rules of IVA and single-channel NMF. Experimental results show the efficacy of the proposed method compared with IVA and MNMF in terms of separation accuracy and convergence speed.
296 citations
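The proposed method (ILRMA) alternates IVA-style demixing updates, as sketched under the previous entry, with single-channel NMF updates of each source's spectrogram model. A hedged sketch of just the NMF half, using Itakura-Saito multiplicative updates on a power spectrogram; the rank, iteration count, and random seeding are illustrative assumptions:

```python
import numpy as np

def is_nmf(P, rank=4, n_iter=100, eps=1e-12):
    """Itakura-Saito NMF of a power spectrogram P (freqs x frames)."""
    F, N = P.shape
    rng = np.random.default_rng(0)
    T = rng.uniform(size=(F, rank)) + eps   # spectral basis vectors
    V = rng.uniform(size=(rank, N)) + eps   # temporal activations
    for _ in range(n_iter):
        R = T @ V + eps                      # low-rank model of P
        # Multiplicative updates that monotonically decrease the IS divergence.
        T *= np.sqrt(((P / R ** 2) @ V.T) / ((1.0 / R) @ V.T))
        R = T @ V + eps
        V *= np.sqrt((T.T @ (P / R ** 2)) / (T.T @ (1.0 / R)))
    return T, V
```

In ILRMA, the model T @ V replaces the spherical per-source variance used in plain IVA, which is what lets it capture harmonic spectral structure.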
•
TL;DR: A sound event classification framework is outlined that compares auditory image front-end features with spectrogram image-based front-end features, using support vector machine and deep neural network classifiers, and is shown to compare very well with current state-of-the-art classification techniques.
Abstract: The automatic recognition of sound events by computers is an important aspect of emerging applications such as automated surveillance, machine hearing and auditory scene understanding. Recent advances in machine learning, as well as in computational models of the human auditory system, have contributed to advances in this increasingly popular research field. Robust sound event classification, the ability to recognise sounds under real-world noisy conditions, is an especially challenging task. Classification methods translated from the speech recognition domain, using features such as mel-frequency cepstral coefficients, have been shown to perform reasonably well for the sound event classification task, although spectrogram-based or auditory image analysis techniques reportedly achieve superior performance in noise. This paper outlines a sound event classification framework that compares auditory image front-end features with spectrogram image-based front-end features, using support vector machine and deep neural network classifiers. Performance is evaluated on a standard robust classification task in different levels of corrupting noise, and with several system enhancements, and is shown to compare very well with current state-of-the-art classification techniques.
239 citations
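A minimal sketch of the spectrogram-image front end this abstract compares: log-power spectrograms reduced to fixed-size "images", flattened, and fed to an SVM. The STFT settings, feature size, and classifier configuration are assumptions, not the paper's exact setup:

```python
import numpy as np
from scipy.signal import stft
from sklearn.svm import SVC

def spectrogram_image(signal, fs=16000, out_shape=(48, 48)):
    """Fixed-size, normalized log-power spectrogram image, flattened to a vector."""
    _, _, Z = stft(signal, fs=fs, nperseg=512, noverlap=256)
    img = np.log(np.abs(Z) ** 2 + 1e-12)
    # Crude nearest-neighbor downsampling to a fixed-size image.
    rows = np.linspace(0, img.shape[0] - 1, out_shape[0]).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, out_shape[1]).astype(int)
    img = img[np.ix_(rows, cols)]
    return ((img - img.mean()) / (img.std() + 1e-12)).ravel()

# Usage with hypothetical labeled clips:
#   features = np.stack([spectrogram_image(s) for s in clips])
#   clf = SVC().fit(features, labels)
```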