
Showing papers on "Microphone array published in 2019"


Proceedings ArticleDOI
02 May 2019
TL;DR: MilliSonic is built on a novel localization algorithm that can provably achieve sub-millimeter 1D tracking accuracy in the presence of multipath while using only a single beacon with a small 4-microphone array, and it enables two previously infeasible interaction applications.
Abstract: Recent years have seen interest in device tracking and localization using acoustic signals. State-of-the-art acoustic motion tracking systems, however, do not achieve millimeter accuracy and require large separation between microphones and speakers, and as a result do not meet the requirements for many VR/AR applications. Further, tracking multiple concurrent acoustic transmissions from VR devices today requires sacrificing accuracy or frame rate. We present MilliSonic, a novel system that pushes the limits of acoustic-based motion tracking. Our core contribution is a novel localization algorithm that can provably achieve sub-millimeter 1D tracking accuracy in the presence of multipath, while using only a single beacon with a small 4-microphone array. Further, MilliSonic enables concurrent tracking of up to four smartphones without reducing frame rate or accuracy. Our evaluation shows that MilliSonic achieves 0.7mm median 1D accuracy and 2.6mm median 3D accuracy for smartphones, which is 5x more accurate than state-of-the-art systems. MilliSonic enables two previously infeasible interaction applications: a) 3D tracking of VR headsets using the smartphone as a beacon and b) fine-grained 3D tracking for the Google Cardboard VR system using a small microphone array.

60 citations


Journal ArticleDOI
TL;DR: In this article, a modified 3D Kalman (M3K) method is proposed for tracking sound sources and estimating their directions in 3D using a 16-microphone array and low-cost hardware.

58 citations


Proceedings ArticleDOI
Wu Minhua1, Kenichi Kumatani1, Shiva Sundaram1, Nikko Strom1, Bjorn Hoffmeister1 
12 May 2019
TL;DR: In this article, the authors developed new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly.
Abstract: Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input based on an ASR criterion directly. In contrast to conventional methods, we incorporate array processing knowledge into the acoustic model. Moreover, we initialize the network with beamformers’ coefficients. We investigate effects of such MC neural networks through ASR experiments on the real-world far-field data where users are interacting with an ASR system in uncontrolled acoustic environments. We show that our MC acoustic model can reduce a word error rate (WER) by 16.5% compared to a single channel ASR system with the traditional log-mel filter bank energy (LFBE) feature on average. Our result also shows that our network with the spatial filtering layer on two-channel input achieves a relative WER reduction of 9.5% compared to conventional beamforming with seven microphones.
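
As a rough illustration of the idea of a trainable spatial-filtering layer feeding an LSTM acoustic model, the sketch below (not the authors' code; channel count, look directions, feature and output sizes are all assumptions) applies learnable per-frequency beamformer weights to multi-channel STFT input, takes log-power features, and passes them to an LSTM whose output would be trained with an ASR criterion. The weights can be initialized from conventional beamformer coefficients, mirroring the initialization described above.

```python
# Minimal sketch of a multi-channel acoustic model with a learnable spatial-filtering
# layer followed by an LSTM; shapes and initialization are assumptions.
import torch
import torch.nn as nn

class SpatialFilterLSTM(nn.Module):
    def __init__(self, n_mics=2, n_freq=257, n_looks=4, hidden=512, n_out=3000):
        super().__init__()
        # Learnable complex beamformer weights, stored as real/imag parts: (looks, freq, mics).
        self.w_re = nn.Parameter(torch.randn(n_looks, n_freq, n_mics) * 0.01)
        self.w_im = nn.Parameter(torch.randn(n_looks, n_freq, n_mics) * 0.01)
        self.lstm = nn.LSTM(n_looks * n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_out)  # e.g., senone posteriors for the ASR criterion

    def init_from_beamformer(self, w_complex):
        # w_complex: complex array of shape (looks, freq, mics), e.g., delay-and-sum
        # steering vectors computed elsewhere (hypothetical input, not shown here).
        with torch.no_grad():
            self.w_re.copy_(torch.as_tensor(w_complex.real, dtype=torch.float32))
            self.w_im.copy_(torch.as_tensor(w_complex.imag, dtype=torch.float32))

    def forward(self, x_re, x_im):
        # x_re, x_im: (batch, time, freq, mics) real/imag parts of the multi-channel STFT.
        # Beamformer output y = sum_m conj(w_m) * x_m per look direction and frequency bin.
        y_re = torch.einsum('btfm,lfm->btlf', x_re, self.w_re) + torch.einsum('btfm,lfm->btlf', x_im, self.w_im)
        y_im = torch.einsum('btfm,lfm->btlf', x_im, self.w_re) - torch.einsum('btfm,lfm->btlf', x_re, self.w_im)
        logpow = torch.log(y_re ** 2 + y_im ** 2 + 1e-8)   # log power per look/frequency
        feats = logpow.flatten(2)                           # (batch, time, looks * freq)
        h, _ = self.lstm(feats)
        return self.out(h)
```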

38 citations


Journal ArticleDOI
TL;DR: A novel 3-D audio-visual people tracker that exploits visual observations to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker.
Abstract: Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine visual observations, color-based, and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.

37 citations


Proceedings ArticleDOI
13 Apr 2019
TL;DR: A low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone array-based meeting transcription task is proposed by using a new speech separation network architecture combined with a double buffering scheme and by performing enhancement with a set of fixed beamformers followed by a neural post-filter.
Abstract: Speaker independent continuous speech separation (SI-CSS) is a task of converting a continuous audio stream, which may contain overlapping voices of unknown speakers, into a fixed number of continuous signals each of which contains no overlapping speech segment. A separated, or cleaned, version of each utterance is generated from one of SI-CSS’s output channels nondeterministically without being split up and distributed to multiple channels. A typical application scenario is transcribing multi-party conversations, such as meetings, recorded with microphone arrays. The output signals can be simply sent to a speech recognition engine because they do not include speech overlaps. The previous SI-CSS method uses a neural network trained with permutation invariant training and a data-driven beamformer and thus requires much processing latency. This paper proposes a low-latency SI-CSS method whose performance is comparable to that of the previous method in a microphone array-based meeting transcription task. This is achieved (1) by using a new speech separation network architecture combined with a double buffering scheme and (2) by performing enhancement with a set of fixed beamformers followed by a neural post-filter.

36 citations


Journal ArticleDOI
01 Dec 2019
TL;DR: In this article, a convolutional neural network (CNN) is applied as a new algorithm for sound source localization; at high frequencies it can reconstruct source locations with up to 100% test accuracy, although sidelobes may appear in some situations.
Abstract: For phased-microphone-array sound source localization, an algorithm with both high computational efficiency and high precision has long been sought. In this paper, a convolutional neural network (CNN), a form of deep learning, is applied for the first time as a new algorithm. The input of the CNN is only the cross-spectral matrix, while its output is the source distribution. Regarding computing speed in applications, the CNN, once trained, is as fast as conventional beamforming and significantly faster than the best-known deconvolution algorithm, DAMAS. Regarding measurement accuracy, at high frequencies the CNN can reconstruct source locations with up to 100% test accuracy, although sidelobes may appear in some situations. In addition, the CNN has a spatial resolution nearly equal to that of DAMAS and better than that of conventional beamforming. The CNN's test accuracy decreases as frequency decreases; however, in most incorrect samples the CNN results are not far from the correct ones. This encouraging result means that the CNN finds the source distribution directly from the cross-spectral matrix, without the propagation function and microphone positions being given in advance, and thus the CNN deserves to be explored further as a new algorithm.
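
A minimal sketch of the kind of mapping described above: a CNN that takes the real and imaginary parts of the cross-spectral matrix as two image channels and outputs a source distribution over a scan grid. The grid size and layer widths are assumptions, not the paper's architecture.

```python
# Minimal sketch: CNN from a cross-spectral matrix to a source map over a 2-D scan grid.
import torch
import torch.nn as nn

class CSM2SourceMap(nn.Module):
    def __init__(self, grid=41):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, grid * grid),
        )
        self.grid = grid

    def forward(self, csm):
        # csm: (batch, n_mics, n_mics) complex cross-spectral matrix at one frequency.
        x = torch.stack([csm.real, csm.imag], dim=1)        # (batch, 2, n_mics, n_mics)
        return self.net(x).view(-1, self.grid, self.grid)   # source distribution on the grid
```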

31 citations


Journal ArticleDOI
TL;DR: The proposed algorithms improve the iteration speed of non-synchronous-measurement beamforming, which is first demonstrated in simulation; they are then applied to an on-site measurement of a vehicle engine compartment (the data-missing cross-spectral matrix of 15 non-synchronized measurements for a given frequency can be completed in only a few seconds).

31 citations



Posted Content
TL;DR: In this article, a far-field text-dependent speaker verification database named HI-MIA is presented, which contains recordings of 340 people in rooms designed for the far field scenario.
Abstract: This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirement for far-field microphone-array-based speaker verification, since most publicly available databases are single-channel, close-talking and text-independent. The database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by multiple microphone arrays located at different directions and distances from the speaker, plus a high-fidelity close-talking microphone. In addition, we propose a set of end-to-end neural network based baseline systems that adopt single-channel data for training. Moreover, we propose a testing-background-aware enrollment augmentation strategy to further enhance the performance. Results show that the fusion systems achieve 3.29% EER in the far-field enrollment and far-field testing task and 4.02% EER in the close-talking enrollment and far-field testing task.

27 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, two Convolutional Recurrent Neural Networks (CRNNs) are used to perform sound event detection and time difference of arrival estimation on each pair of microphones in a microphone array.
Abstract: This paper proposes sound event localization and detection methods from multichannel recording. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is evaluated with a four-microphone array, and thus combines the results from six pairs of microphones to provide a final classification and a 3-D direction of arrival (DOA) estimate. Results demonstrate that the proposed approach outperforms the DCASE 2019 baseline system.
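
The combination step, fusing one TDOA per microphone pair into a single 3-D direction of arrival, can be illustrated with a small least-squares sketch. The tetrahedral geometry, speed of sound, and synthetic TDOAs below are assumptions for the example; in practice the paper's CRNNs would supply the per-pair TDOAs.

```python
# Minimal sketch: least-squares DOA from pairwise TDOAs under a far-field model.
import numpy as np

def doa_from_pairwise_tdoa(mic_pos, pairs, tdoas, c=343.0):
    """mic_pos: (M, 3) microphone positions [m]; pairs: list of (i, j) index pairs;
    tdoas: per-pair time differences of arrival [s].
    Far-field model used here: tdoa_ij ~ (r_i - r_j) . u / c for unit direction u."""
    A = np.array([mic_pos[i] - mic_pos[j] for i, j in pairs])   # (P, 3)
    b = c * np.asarray(tdoas)                                   # (P,)
    u, *_ = np.linalg.lstsq(A, b, rcond=None)
    return u / np.linalg.norm(u)                                # unit DOA vector

# Example with a tetrahedral 4-mic array (6 pairs) and synthetic TDOAs:
mics = 0.05 * np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]])
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
true_u = np.array([0.3, 0.5, 0.81]); true_u /= np.linalg.norm(true_u)
tdoas = [(mics[i] - mics[j]) @ true_u / 343.0 for i, j in pairs]
print(doa_from_pairwise_tdoa(mics, pairs, tdoas))   # recovers true_u
```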

26 citations


Journal ArticleDOI
TL;DR: The evaluation of lowpass-filtered stimuli shows that the perceived differences occur exclusively at higher frequencies and can therefore be attributed to spatial aliasing.
Abstract: A listening experiment is presented in which subjects rated the perceived differences in terms of spaciousness and timbre between a headphone-based headtracked dummy head auralization of a sound source in different rooms and a headphone-based headtracked auralization of a spherical microphone array recording of the same scenario. The underlying auralizations were based on measured impulse responses to assure equal conditions. Rigid-sphere arrays with different amounts of microphones ranging from 50 to up to 1202 were emulated through sequential measurements, and spherical harmonics orders of up to 12 were tested. The results show that the array auralizations are partially indistinguishable from the direct dummy head auralization at a spherical harmonics order of 8 or higher if the virtual sound source is located at a lateral position. No significant reduction of the perceived differences with increasing order is observed for frontal virtual sound sources. In this case, small differences with respect to both spaciousness and timbre persist. The evaluation of lowpass-filtered stimuli shows that the perceived differences occur exclusively at higher frequencies and can therefore be attributed to spatial aliasing. The room had only a minor effect on the results.

Journal ArticleDOI
Jian Wang1, Hekuo Peng1, Pengwei Zhou1, Jiachen Guo1, Bo Jia1, Hongyan Wu1 
TL;DR: In this article, a fiber-optic array acoustic sensor based on a Michelson interferometer is proposed, which can be applied to sound source localization in harsh environments; the experimental results show that the array has high sensitivity and that the localization accuracy can be as fine as 0.01 m.

Journal ArticleDOI
TL;DR: A new EB-ESPRIT method is proposed which uses three recurrence relations and a joint-diagonalization procedure to estimate the unit-vectors pointing to the source DOAs with higher accuracy and the estimation accuracy does not depend on the sourceDOAs.
Abstract: Several techniques exist to estimate the directions of arrival (DOAs) of sound sources captured with a spherical microphone array. The eigenbeam rotational invariance technique (EB-ESPRIT) uses recurrence relations of spherical harmonics to estimate the DOAs. In this letter, we propose a new EB-ESPRIT method which uses three recurrence relations and a joint-diagonalization procedure to estimate the unit-vectors pointing to the source DOAs. We evaluate the angular estimation errors of the proposed method under noisy and reverberant conditions and compare it to existing EB-ESPRIT methods. We find that our proposed method can estimate the source DOAs with higher accuracy compared to the discussed existing EB-ESPRIT methods, and the estimation accuracy does not depend on the source DOAs.

Journal ArticleDOI
TL;DR: Simulation results demonstrate that the proposed multichannel active noise control (ANC) system can separate the target noise from the disturbance noise and effectively reduce the target noise.

Journal ArticleDOI
TL;DR: The noise of a ducted propeller suitable for installation on a medium-size UAV (wingspan 5–10 m) is analyzed; the duct significantly modifies the noise radiation in both the frequency and the spatial domain.
Abstract: Ducted propellers are an interesting design choice for unmanned aerial vehicle (UAV) concepts due to a potential increase of the propeller efficiency. In such designs, it is commonly assumed that introducing the duct also results in an overall noise reduction. The objective of this work is to experimentally analyze and quantify noise of a ducted propeller suitable to be installed on a medium size UAV (wingspan 5–10 m). A microphone array is used for recording the noise levels at each microphone position and used collectively to localize noise sources with beamforming. Different types of noise sources are considered (an omni-directional source and a propeller). In addition, the effect of the presence of an incoming airflow is assessed. With no incoming airflow, it is found that the duct significantly modifies the noise radiation both in the frequency and the spatial domain. With an incoming airflow, the effect of the duct on the frequency content of the signal is almost eliminated. The fact that for this case the harmonics become lower results in a reduction of the received noise levels. Also the directivity changes. These insights are of importance in efforts towards modeling the effects of ducts for complex noise sources such as propellers.
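
The beamforming used here to localize noise sources can be sketched, under a free-field monopole assumption and for a single frequency, as a conventional frequency-domain scan over a grid; the array geometry and scan grid below are placeholders.

```python
# Minimal sketch: conventional beamforming map from a cross-spectral matrix.
import numpy as np

def beamforming_map(csm, mic_pos, grid_pts, freq, c=343.0):
    """csm: (M, M) cross-spectral matrix; mic_pos: (M, 3); grid_pts: (G, 3) scan points.
    Returns the beamforming output power at each scan point."""
    k = 2 * np.pi * freq / c
    out = np.empty(len(grid_pts))
    for g, p in enumerate(grid_pts):
        r = np.linalg.norm(mic_pos - p, axis=1)          # distances mic -> scan point
        steer = np.exp(-1j * k * r) / r                  # monopole propagation model
        steer /= np.linalg.norm(steer)
        out[g] = np.real(steer.conj() @ csm @ steer)     # w^H C w
    return out
```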

Patent
25 Jan 2019
TL;DR: In this article, the authors propose a method, device, system and apparatus for testing a microphone array, and a storage medium, which comprises the following steps: receiving an audio signal obtained by collecting test audio with the microphone array; processing the audio signal to obtain a single-channel designated parameter for each microphone and a channel-to-channel designated parameter between the microphones; and determining the performance of the microphone array according to the single-channel designated parameter and the channel-to-channel designated parameter, and outputting a performance test result; wherein the single-channel designated parameters include sensitivity level
Abstract: The invention provides a method, device, system and apparatus for testing a microphone array, and a storage medium. The method comprises the following steps: receiving an audio signal obtained by collecting test audio with the microphone array; processing the audio signal to obtain single-channel designated parameters for each microphone and channel-to-channel designated parameters between the microphones; and determining the performance of the microphone array according to the single-channel designated parameters and the channel-to-channel designated parameters, and outputting a performance test result. The single-channel designated parameters include sensitivity level, sensitivity level curve, total harmonic distortion parameter, total harmonic distortion curve, noise level, signal-to-noise ratio, tolerance of frequency response, tightness parameter and truncation parameter; the designated parameters among the channels include frequency response consistency parameters, time delay consistency parameters and correlation parameters. Thus, the present disclosure improves the accuracy of performance test results and reduces the difficulty of performance testing of microphone arrays.
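
Two of the single-channel parameters listed above, total harmonic distortion and signal-to-noise ratio, can be estimated from a recorded calibration tone roughly as in the sketch below; the test-tone frequency, window, and band tolerances are assumptions, and the patent does not prescribe these particular formulas.

```python
# Minimal sketch: estimate THD and SNR of one microphone channel from a 1 kHz test tone.
import numpy as np

def thd_and_snr(x, fs, f0=1000.0, n_harm=5):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

    def band_power(f):                                  # power within +/- 10 Hz of f
        return spec[np.abs(freqs - f) < 10.0].sum()

    p_fund = band_power(f0)
    p_harm = sum(band_power(k * f0) for k in range(2, n_harm + 1))
    p_noise = spec.sum() - p_fund - p_harm
    thd = np.sqrt(p_harm / p_fund)                      # total harmonic distortion (ratio)
    snr_db = 10 * np.log10(p_fund / p_noise)            # signal-to-noise ratio [dB]
    return thd, snr_db
```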

Proceedings ArticleDOI
15 Oct 2019
TL;DR: This work describes audiovisual zooming as a generalized eigenvalue problem and proposes an algorithm for efficient computation on mobile platforms, whereby an auditory FOV is formed to match the visual.
Abstract: When capturing videos on a mobile platform, often the target of interest is contaminated by the surrounding environment. To alleviate the visual irrelevance, camera panning and zooming provide the means to isolate a desired field of view (FOV). However, the captured audio is still contaminated by signals outside the FOV. This effect is unnatural: for human perception, visual and auditory cues must go hand in hand. We present the concept of Audiovisual Zooming, whereby an auditory FOV is formed to match the visual one. Our framework is built around the classic idea of beamforming, a computational approach to enhancing sound from a single direction using a microphone array. Yet beamforming on its own cannot incorporate the auditory FOV, as the FOV may include an arbitrary number of directional sources. We formulate audiovisual zooming as a generalized eigenvalue problem and propose an algorithm for efficient computation on mobile platforms. To inform the algorithmic and physical implementation, we offer a theoretical analysis of our algorithmic components as well as numerical studies for understanding various design choices of microphone arrays. Finally, we demonstrate audiovisual zooming on two different mobile platforms: a mobile smartphone and a 360° spherical imaging system for video conference settings.
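
The generalized-eigenvalue formulation can be illustrated with a small sketch: maximize the output power contributed by directions inside the auditory FOV relative to directions outside it, which leads to the principal generalized eigenvector of the two spatial covariance matrices. The steering-vector inputs and diagonal loading below are assumptions, not the paper's exact construction.

```python
# Minimal sketch: FOV-matched beamformer as a generalized eigenvalue problem,
# w* = argmax (w^H R_in w) / (w^H R_out w).
import numpy as np
from scipy.linalg import eigh

def fov_beamformer(steer_in, steer_out, diag_load=1e-3):
    """steer_in / steer_out: (D, M) steering vectors for directions inside / outside
    the field of view at one frequency. Returns beamformer weights w of shape (M,)."""
    R_in = steer_in.conj().T @ steer_in / len(steer_in)
    R_out = steer_out.conj().T @ steer_out / len(steer_out)
    M = R_out.shape[0]
    R_out = R_out + diag_load * np.trace(R_out).real / M * np.eye(M)  # regularize
    # The largest generalized eigenvector of (R_in, R_out) maximizes the Rayleigh quotient.
    vals, vecs = eigh(R_in, R_out)
    return vecs[:, -1]
```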

Journal ArticleDOI
01 Jun 2019-Energies
TL;DR: In this article, an aerodynamic glove was used to classify the boundary-layer state of a 2 megawatt wind turbine (WT) operating in the northern part of Schleswig-Holstein, Germany.
Abstract: Knowledge about laminar–turbulent transition on operating multi megawatt wind turbine (WT) blades needs sophisticated equipment like hot films or microphone arrays. Contrarily, thermographic pictures can easily be taken from the ground, and temperature differences indicate different states of the boundary layer. Accuracy, however, is still an open question, so that an aerodynamic glove, known from experimental research on airplanes, was used to classify the boundary-layer state of a 2 megawatt WT blade operating in the northern part of Schleswig-Holstein, Germany. State-of-the-art equipment for measuring static surface pressure was used for monitoring lift distribution. To distinguish the laminar and turbulent parts of the boundary layer (suction side only), 48 microphones were applied together with ground-based thermographic cameras from two teams. Additionally, an optical camera mounted on the hub was used to survey vibrations. During start-up (SU) (from 0 to 9 rpm), extended but irregularly shaped regions of a laminar-boundary layer were observed that had the same extension measured both with microphones and thermography. When an approximately constant rotor rotation (9 rpm corresponding to approximately 6 m/s wind speed) was achieved, flow transition was visible at the expected position of 40% chord length on the rotor blade, which was fouled with dense turbulent wedges, and an almost complete turbulent state on the glove was detected. In all observations, quantitative determination of flow-transition positions from thermography and microphones agreed well within their accuracy of less than 1%.

Journal ArticleDOI
TL;DR: The design and fabrication of a resonant microphone array for pre-filtered acoustic signal acquisition, and a novel signal classification algorithm that benefits from its properties are described, suggesting that this technique has wide-ranging applicability to low-power and self-powered wireless sensing and detection systems.
Abstract: This paper reports the implementation and evaluation of an array-based recognition system for the detection of wheezing in breathing. We describe the design and fabrication of a resonant microphone array for pre-filtered acoustic signal acquisition, and outline a novel signal classification algorithm that benefits from its properties. The use of a resonant-transducer array at the input simplifies several of the required digital processing steps necessary for performing spectral filtering and wheezing recognition. The recognizer system is evaluated and compared against a traditional approach, and the experimental results show that recognition processing time (and power consumption) can be reduced 11 times, specifically from 5.11 to 0.46 s. Classification experiments also indicate a robustness in recognition accuracy to low-frequency-dominated background noises such as those emitted by the heart. The findings suggest that this technique has wide-ranging applicability to low-power and self-powered wireless sensing and detection systems.

Proceedings ArticleDOI
12 May 2019
TL;DR: In this paper, the concept of microphone array augmentation with echoes (MIRAGE) is introduced, and a learning-based scheme for echo estimation combined with a physics-based scheme for echo aggregation is proposed.
Abstract: It is commonly observed that acoustic echoes hurt performance of sound source localization (SSL) methods. We introduce the concept of microphone array augmentation with echoes (MIRAGE) and show how estimation of early-echo characteristics can in fact benefit SSL. We propose a learning-based scheme for echo estimation combined with a physics-based scheme for echo aggregation. In a simple scenario involving 2 microphones close to a reflective surface and one source, we show using simulated data that the proposed approach performs similarly to a correlation-based method in azimuth estimation while also retrieving elevation from 2 microphones only, an impossible task in anechoic settings.

Journal ArticleDOI
TL;DR: The weighting function ρ-PHAT-C provides the ellipses with the smallest surface, especially when the arithmetic mean of the GCC is replaced by the geometric mean (GEO), and the acoustic images obtained confirm that this function outperforms the GCC, GCC-PHAT, and GCC ρ-PHAT.
Abstract: The generalized cross correlation (GCC) is an efficient technique for performing acoustic imaging. However, it suffers from important limitations such as a large main-lobe width for noise sources with low frequency content or a high amplitude of side lobes for noise sources with high frequencies. A prefiltering operation on the microphone signals by a weighting function can be used to improve the acoustic image. In this work, two weighting functions based on PHAse Transform (PHAT) improvements are used. The first adds an exponent to the PHAT expression (ρ-PHAT), while the second adds the minimum value of the coherence function to the denominator (ρ-PHAT-C). Numerical acoustic images obtained with the GCC and these weighting functions are compared and quantitatively assessed using a metric based on a covariance ellipse, which surrounds either the main lobe or the side lobes. The weighting function ρ-PHAT-C provides the ellipses with the smallest surface, especially when the arithmetic mean of the GCC is replaced by the geometric mean (GEO). Experimental measurements are carried out in a hemi-anechoic room and a reverberant chamber where two loudspeakers were set in front of the microphone array. The acoustic images obtained confirm that ρ-PHAT-C with the GEO outperforms the GCC, GCC-PHAT, and GCC ρ-PHAT.
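
The ρ-PHAT weighting can be sketched as a standard GCC in which the PHAT normalizer |X1·X2*| is raised to an exponent ρ; the coherence-based ρ-PHAT-C variant and the geometric-mean combination are omitted here, and the default ρ is an assumption.

```python
# Minimal sketch: GCC with a PHAT weighting raised to an exponent rho.
import numpy as np

def gcc_rho_phat(x1, x2, rho=0.8, nfft=None):
    """Return the generalized cross correlation of two microphone signals,
    normalized by |X1 X2*|**rho (rho = 1 recovers plain GCC-PHAT)."""
    nfft = nfft or 2 * max(len(x1), len(x2))
    X1, X2 = np.fft.rfft(x1, nfft), np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    weight = np.abs(cross) ** rho + 1e-12
    cc = np.fft.irfft(cross / weight, nfft)
    return np.fft.fftshift(cc)                 # lag axis centered at zero

# Usage: the index of max(cc) minus nfft // 2 gives the TDOA estimate in samples.
```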

Journal ArticleDOI
T. C. Yang1
TL;DR: Deconvolving the CBF is a method of superdirective beamforming and the improvement in directivity (beam width) and array gain is studied/illustrated using the SWellEx96 horizontal array data where only sub-arrays are used.
Abstract: Arrays employing superdirective beamforming can provide the same directivity and directivity index (or array gain) with less aperture than a larger array using conventional beamforming (CBF). Superdirective arrays offer a practical and significant improvement in the reception of low-frequency signals and are useful for many applications where the array size is limited, such as a miniature microphone array or an underwater acoustic array with a limited aperture. Deconvolving the CBF output is a method of superdirective beamforming. The improvement in directivity (beam width) and array gain is studied and illustrated using the SWellEx96 horizontal array data, where only sub-arrays are used.
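
Deconvolution of a CBF output can be sketched with a DAMAS-like nonnegative Gauss-Seidel iteration over the scan grid, given the array point-spread function; this is a generic illustration of the idea, not the author's exact deconvolution method.

```python
# Minimal sketch: DAMAS-like iterative deconvolution of a conventional beamforming output.
import numpy as np

def deconvolve_cbf(b, psf, n_iter=200):
    """b: (G,) CBF output over the scan grid; psf: (G, G) point-spread function,
    psf[i, j] = CBF response at grid point i to a unit source at grid point j."""
    q = np.zeros_like(b)
    for _ in range(n_iter):                      # nonnegative Gauss-Seidel sweeps
        for i in range(len(b)):
            r = b[i] - psf[i] @ q + psf[i, i] * q[i]   # residual excluding point i
            q[i] = max(r / psf[i, i], 0.0)
        # convergence checks / relaxation omitted for brevity
    return q
```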

Journal ArticleDOI
TL;DR: The results depict an accuracy over 85% for the proposed audio-based recognition system and a case study for calculating productivity rates of a sample piece of equipment is presented at the end.
Abstract: Various activities of construction equipment are associated with distinctive sound patterns (e.g., excavating soil, breaking rocks). Considering this fact, it is possible to extract useful information about construction operations by recording the audio at a jobsite and then processing these data to determine what activities are being performed. Audio-based analysis of construction operations depends mainly on specific hardware and software settings to achieve satisfactory performance. This paper explores the impact of these settings on the ultimate performance of the task of interest. To achieve this goal, an audio-based system has been developed to recognize the routine sounds of construction machinery. The next step evaluates three types of microphones (off-the-shelf, contact, and a multichannel microphone array) and two installation settings (microphones placed in the machines' cabins and microphones installed on the jobsite in relative proximity to the machines). Two different jobsite conditions have been considered: (1) jobsites with a single machine and (2) jobsites with multiple machines operating simultaneously. In terms of software settings, two different SVM classifiers (RBF and linear kernels) and two common frequency-domain feature extraction techniques (STFT and CWT) were selected. Experimental data from several jobsites were gathered, and the results show an accuracy of over 85% for the proposed audio-based recognition system. To better illustrate the practical value of the proposed system, a case study for calculating the productivity rates of a sample piece of equipment is presented at the end.
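
A pared-down version of the software side described above, short-time spectral features from jobsite audio clips fed to an RBF-kernel SVM, might look like the sketch below; the frame size, the simple mean/std pooling, and the helper names are assumptions (the paper also evaluates CWT features and the different microphone setups).

```python
# Minimal sketch: STFT features + SVM classifier for equipment activity sounds.
import numpy as np
from scipy.signal import stft
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def stft_features(audio, fs=16000, frame=1024):
    _, _, Z = stft(audio, fs=fs, nperseg=frame)
    logmag = np.log(np.abs(Z) + 1e-8)
    # One fixed-length feature per clip: mean and std of each frequency bin over time.
    return np.concatenate([logmag.mean(axis=1), logmag.std(axis=1)])

def train_classifier(clips, labels, fs=16000):
    """clips: list of 1-D audio arrays; labels: activity label per clip (e.g., 'excavating')."""
    X = np.stack([stft_features(c, fs) for c in clips])
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    return clf.fit(X, labels)
```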

Journal ArticleDOI
Yusuke Hioka1, Michael Kingan1, Gian Schmid1, Ryan McKay1, Karl Stol1 
TL;DR: Results of subjective listening tests suggest that the quality of recordings made by the designed UAV-mounted system for capturing sound from a targeted source or direction is significantly better than that of recordings made with a shotgun microphone.

Posted Content
TL;DR: A system that generates speaker-annotated transcripts of meetings by using a virtual microphone array, a set of spatially distributed asynchronous recording devices such as laptops and mobile phones, composed of continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization using prior speaker information, and system combination is described.
Abstract: We describe a system that generates speaker-annotated transcripts of meetings by using a virtual microphone array, a set of spatially distributed asynchronous recording devices such as laptops and mobile phones. The system is composed of continuous audio stream alignment, blind beamforming, speech recognition, speaker diarization using prior speaker information, and system combination. When utilizing seven input audio streams, our system achieves a word error rate (WER) of 22.3% and comes within 3% of the close-talking microphone WER on the non-overlapping speech segments. The speaker-attributed WER (SAWER) is 26.7%. The relative gains in SAWER over the single-device system are 14.8%, 20.3%, and 22.4% for three, five, and seven microphones, respectively. The presented system achieves a 13.6% diarization error rate when 10% of the speech duration contains more than one speaker. The contribution of each component to the overall performance is also investigated, and we validate the system with experiments on the NIST RT-07 conference meeting test set.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This paper proposes a data-driven source localization approach for noisy and reverberant environments, using a newly defined feature named relative harmonic coefficients (RHC) in the modal domain; the method is faster than, and achieves performance competitive with, the state-of-the-art algorithm.
Abstract: This paper proposes a data-driven source localization approach for a noisy and reverberant environment, using a newly defined feature named relative harmonic coefficients (RHC) in the modal domain. Being independent of the source signal, the RHC is capable of localizing a sound source(s) located at unknown position(s). Two distinctive multi-view Gaussian process (MVGP) formulations, (i) multi-frequency views and (ii) multi-mode views, are developed for Gaussian process regression (GPR) to reveal the mapping function from the RHC to the corresponding source location. We evaluate the effectiveness of the algorithm for single source localization, while the underlying concepts can be extended to acoustic scenarios where multiple sources are active. Experimental results, using a spherical microphone array, confirm that the proposed algorithm is faster than, and achieves performance competitive with, the state-of-the-art algorithm.
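
The regression step, learning a mapping from the localization feature to the source direction with a Gaussian process, can be sketched as below for a single view; the kernel choice and the azimuth/elevation parameterization are assumptions, and the multi-view combination described in the paper is omitted.

```python
# Minimal sketch: Gaussian process regression from localization features to source DOA.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_localizer(features_train, doas_train):
    """features_train: (N, D) per-frame feature vectors (assumed precomputed, e.g.,
    RHC-style features); doas_train: (N, 2) azimuth/elevation labels in radians."""
    kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    return gpr.fit(features_train, doas_train)

# Usage: fit_localizer(...).predict(features_test) returns azimuth/elevation per frame.
```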

Journal ArticleDOI
TL;DR: A novel speaker-dependent speech separation framework for the challenging CHiME-5 acoustic environments, exploiting advantages of both deep learning based and conventional preprocessing techniques to prepare data effectively for separating target speech from multi-talker mixed speech collected with multiple microphone arrays.
Abstract: We propose a novel speaker-dependent speech separation framework for the challenging CHiME-5 acoustic environments, exploiting advantages of both deep learning based and conventional preprocessing techniques to prepare data effectively for separating target speech from multi-talker mixed speech collected with multiple microphone arrays. First, a series of multi-channel operations is conducted to reduce existing reverberation and noise, and a single-channel deep learning based speech enhancement model is used to predict speech presence probabilities. Next, a two-stage supervised speech separation approach, using oracle speaker diarization information from CHiME-5, is proposed to separate speech of a target speaker from interference speakers in mixed speech. Given a set of three estimated masks of the background noise, the target speaker and the interference speakers from single-channel speech enhancement and separation models, a complex Gaussian mixture model based generalized eigenvalue beamformer is then used for enhancing the signal at the reference array while avoiding the speaker permutation issue. Furthermore, the proposed front-end can generate a large variety of processed data for an ensemble of speech recognition results. Experiments on the development set have shown that the proposed two-stage approach can yield significant improvements of recognition performance over the official baseline system and achieved top accuracies in all four competing evaluation categories among all systems submitted to the CHiME-5 Challenge.

Journal ArticleDOI
TL;DR: An outlier removal method is proposed, which takes the properties of the observed sounds into consideration and leads to establishing system design guidelines that ensure a predictable performance.
Abstract: This paper addresses the problem of 2D sound source localization using multiple microphone arrays in an outdoor environment. Two main issues exist in such localization. Since the localization perfo...

Journal ArticleDOI
TL;DR: This study focused on comparing speech intelligibility as measured in a reverberant reference room with virtual versions of that room, and found auditory modeling might be a fast and efficient way to evaluate virtual sound environments.