
Showing papers on "Microphone array" published in 2020


Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation based on the filter-and-sum network, and shows how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays.
Abstract: An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, while for deep learning-based end-to-end systems those constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves the separation performance with fixed geometry array configuration, further proving the effectiveness of the proposed paradigm in the general problem of multi-microphone speech separation.

102 citations
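
To make the transform-average-concatenate idea concrete, here is a minimal numpy sketch under stated assumptions: one shared projection per channel, a mean-pool across channels, and re-concatenation, so the layer accepts any number of microphones in any order. The layer sizes, weight names, and ReLU choice are illustrative, not the paper's exact architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tac_layer(features, W_transform, W_average, W_output):
    """Transform-average-concatenate over a variable number of channels.

    features: (n_channels, dim) per-channel feature vectors. The same
    weights are shared by every channel, so the layer is invariant to
    channel permutation and channel count.
    """
    # 1) Transform: apply a shared projection to each channel.
    transformed = relu(features @ W_transform)            # (n_ch, hid)
    # 2) Average: pool across channels into one global vector.
    pooled = relu(transformed.mean(axis=0) @ W_average)   # (hid,)
    # 3) Concatenate: append the global vector to each channel, project back.
    stacked = np.concatenate(
        [transformed, np.tile(pooled, (transformed.shape[0], 1))], axis=1)
    return relu(stacked @ W_output)                       # (n_ch, dim)

rng = np.random.default_rng(0)
dim, hid = 64, 128
W_t = rng.standard_normal((dim, hid)) * 0.1
W_a = rng.standard_normal((hid, hid)) * 0.1
W_o = rng.standard_normal((hid * 2, dim)) * 0.1
out4 = tac_layer(rng.standard_normal((4, dim)), W_t, W_a, W_o)  # 4 mics
out6 = tac_layer(rng.standard_normal((6, dim)), W_t, W_a, W_o)  # 6 mics, same weights
print(out4.shape, out6.shape)  # (4, 64) (6, 64)
```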


Proceedings ArticleDOI
04 May 2020
TL;DR: Experiments show that the proposed Beam-TasNet significantly outperforms the conventional TasNet without beamforming and, moreover, successfully achieves a word error rate comparable to an oracle mask-based MVDR beamformer.
Abstract: Recent studies have shown that acoustic beamforming using a microphone array plays an important role in the construction of high-performance automatic speech recognition (ASR) systems, especially for noisy and overlapping speech conditions. In parallel with the success of multichannel beamforming for ASR, in the speech separation field, the time-domain audio separation network (TasNet), which accepts a time-domain mixture as input and directly estimates the time-domain waveforms for each source, achieves remarkable speech separation performance. In light of these two recent trends, the question of whether TasNet can benefit from beamforming to achieve high ASR performance in overlapping speech conditions naturally arises. Motivated by this question, this paper proposes a novel speech separation scheme, i.e., Beam-TasNet, which combines TasNet with the frequency-domain beamformer, i.e., a minimum variance distortionless response (MVDR) beamformer, through spatial covariance computation to achieve better ASR performance. Experiments on the spatialized WSJ0-2mix corpus show that our proposed Beam-TasNet significantly outperforms the conventional TasNet without beamforming and, moreover, successfully achieves a word error rate comparable to an oracle mask-based MVDR beamformer.

75 citations
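
The beamforming step in this scheme hinges on computing MVDR weights from spatial covariance matrices estimated from the separated waveforms. The sketch below implements the common steering-vector-free MVDR formulation in numpy; the random covariance estimates are stand-ins for statistics derived from TasNet's outputs, not the paper's pipeline.

```python
import numpy as np

def mvdr_weights(R_speech, R_noise, ref_mic=0):
    """MVDR beamformer from speech/noise spatial covariance matrices.

    Uses the steering-vector-free formulation
    w = R_n^{-1} R_s e_ref / trace(R_n^{-1} R_s),
    one weight vector per frequency bin.
    """
    n_freq, n_mic, _ = R_speech.shape
    w = np.zeros((n_freq, n_mic), dtype=complex)
    for f in range(n_freq):
        num = np.linalg.solve(R_noise[f], R_speech[f])   # R_n^{-1} R_s
        w[f] = num[:, ref_mic] / (np.trace(num) + 1e-10)
    return w

def apply_beamformer(w, stft_mix):
    """stft_mix: (n_freq, n_frames, n_mic) -> (n_freq, n_frames)."""
    return np.einsum('fm,ftm->ft', w.conj(), stft_mix)

# Toy usage: the covariances would come from the separated sources.
n_freq, n_mic, n_frames = 257, 4, 100
rng = np.random.default_rng(1)
X = (rng.standard_normal((n_freq, n_frames, n_mic))
     + 1j * rng.standard_normal((n_freq, n_frames, n_mic)))
R_s = np.einsum('ftm,ftn->fmn', X, X.conj()) / n_frames
R_n = np.stack([np.eye(n_mic, dtype=complex)] * n_freq)
y = apply_beamformer(mvdr_weights(R_s, R_n), X)
print(y.shape)  # (257, 100)
```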


Proceedings ArticleDOI
25 Oct 2020
TL;DR: This paper describes the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring and achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for tasks 1, 2, and 3, respectively.
Abstract: The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from a single microphone array, far-field text-independent speaker verification from a single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three tasks pose a cross-channel challenge to the participants. To simulate the real-life scenario, the enrollment utterances are recorded with a close-talking cellphone, while the test utterances are recorded with the far-field microphone arrays. In this paper, we describe the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring. For a given utterance, the speaker embeddings of the different channels are equally averaged to form the final embedding. The baseline system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for tasks 1, 2, and 3, respectively.

51 citations
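
The baseline's scoring step, equal averaging of per-channel embeddings followed by cosine similarity, is simple enough to sketch directly; the embedding dimension and random values below are illustrative stand-ins for network outputs.

```python
import numpy as np

def score_trial(enroll_emb, test_channel_embs):
    """FFSVC-style baseline scoring (illustrative sketch).

    test_channel_embs: (n_channels, dim) embeddings from one utterance,
    one per microphone-array channel; they are equally averaged before
    cosine scoring against the enrollment embedding.
    """
    test_emb = test_channel_embs.mean(axis=0)
    return np.dot(enroll_emb, test_emb) / (
        np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb) + 1e-10)

rng = np.random.default_rng(2)
enroll = rng.standard_normal(256)        # close-talk enrollment embedding
test = rng.standard_normal((4, 256))     # 4 far-field channels
print(score_trial(enroll, test))
```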


Proceedings ArticleDOI
16 Apr 2020
TL;DR: This paper develops VoLoc, a system that uses an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection, which together reveal the user's location.
Abstract: Voice assistants such as Amazon Echo (Alexa) and Google Home use microphone arrays to estimate the angle of arrival (AoA) of the human voice. This paper focuses on adding user localization as a new capability to voice assistants. For any voice command, we desire Alexa to be able to localize the user inside the home. The core challenge is two-fold: (1) accurately estimating the AoAs of multipath echoes without the knowledge of the source signal, and (2) tracing back these AoAs to reverse triangulate the user's location. We develop VoLoc, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection. The AoAs and geometric parameters of the nearby wall are then fused to reveal the user's location. Under modest assumptions, we report localization accuracy of 0.44 m across different rooms, clutter, and user/microphone locations. VoLoc runs in near real-time but needs to hear around 15 voice commands before becoming operational.

49 citations
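
Reverse triangulation from a direct-path AoA plus one wall-echo AoA reduces to intersecting two rays, the second aimed at the image source behind the wall. The sketch below works this out for an idealized 2D geometry with a known wall plane; it illustrates the geometric principle only, not VoLoc's align-and-cancel estimation or its error-minimization search for the wall geometry.

```python
import numpy as np

def reverse_triangulate(aoa_direct, aoa_echo, wall_x):
    """Locate a source from the direct-path AoA and one wall-echo AoA.

    Geometry (a sketch, not VoLoc's full pipeline): the microphone array
    sits at the origin, a wall lies along the plane x = wall_x, and the
    echo appears to come from the image source mirrored across that wall.
    Angles are measured from the +x axis, in radians.
    """
    a, b = aoa_direct, aoa_echo
    # Range to the source follows from intersecting the direct ray with
    # the ray toward the image source: t = 2*w*sin(b) / sin(a + b).
    t = 2.0 * wall_x * np.sin(b) / np.sin(a + b)
    return np.array([t * np.cos(a), t * np.sin(a)])

# Sanity check with a synthetic source and wall.
wall_x = 3.0
src = np.array([1.0, 2.0])
img = np.array([2 * wall_x - src[0], src[1]])     # image source behind the wall
aoa_d = np.arctan2(src[1], src[0])
aoa_e = np.arctan2(img[1], img[0])
print(reverse_triangulate(aoa_d, aoa_e, wall_x))  # ~ [1. 2.]
```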


Proceedings ArticleDOI
04 May 2020
TL;DR: A far-field text-dependent speaker verification database named HI-MIA is presented and a set of end-to-end neural network based baseline systems that adopt single-channel data for training are proposed.
Abstract: This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirements of far-field microphone array based speaker verification, since most of the publicly available databases are single-channel, close-talking, and text-independent. The database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by multiple microphone arrays located at different directions and distances from the speaker, as well as by a high-fidelity close-talking microphone. In addition, we propose a set of end-to-end neural network based baseline systems that adopt single-channel data for training. Moreover, we propose a testing-background-aware enrollment augmentation strategy to further enhance the performance. Results show that the fusion systems achieve 3.29% EER in the far-field enrollment and far-field testing task and 4.02% EER in the close-talking enrollment and far-field testing task.

47 citations


Journal ArticleDOI
TL;DR: A new theory of differential beamforming with uniform linear arrays is proposed, which clearly shows the connection between the conventional differential beamforming and the null-constrained differential beamforming methods.
Abstract: This article presents a theoretical study of differential beamforming with uniform linear arrays. By defining a forward spatial difference operator, any order of the spatial difference of the observed signals can be represented as the product of a difference operator matrix and the microphone array observations. Consequently, differential beamforming is implemented in two stages, where the first stage obtains the spatial differences of the observations and the second stage optimizes the beamformer. The major contributions of this article are as follows. First, we propose a new theory of differential beamforming with uniform linear arrays, which clearly shows the connection between the conventional differential beamforming and the null-constrained differential beamforming methods. This provides some new insight into the design of differential beamformers. Second, we deduce some new differential beamformers, of which conventional beamforming may be seen as a particular case. Specifically, we derive the maximum white noise gain (MWNG), maximum directivity factor (MDF), parameterized MDF, and parameterized maximum front-to-back ratio differential beamformers. Third, we further extend the idea of how to design optimal differential beamformers by combining both the observed signals and their spatial differences.

43 citations
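
The first stage, applying a forward spatial difference operator to the array observations, can be sketched as a small matrix construction: each application of the bidiagonal first-difference matrix raises the difference order by one. A minimal numpy illustration (the beamformer optimization stage is omitted):

```python
import numpy as np

def forward_difference_operator(n_mics, order):
    """Order-th forward spatial difference operator for a uniform linear array.

    Returns D of shape (n_mics - order, n_mics) such that D @ y stacks the
    order-th spatial differences of the microphone observations y, as in the
    two-stage differential beamforming formulation (a sketch of the idea).
    """
    D = np.eye(n_mics)
    for _ in range(order):
        rows = D.shape[0] - 1
        step = np.eye(rows, rows + 1, k=1) - np.eye(rows, rows + 1)
        D = step @ D     # one more forward difference
    return D

y = np.array([1.0, 4.0, 9.0, 16.0, 25.0])    # samples of n^2
D2 = forward_difference_operator(len(y), 2)
print(D2 @ y)    # second differences of n^2 are constant: [2. 2. 2.]
```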


Journal ArticleDOI
TL;DR: Sparse Bayesian learning is used to perform localization in 3D space and the use of principal component analysis to denoise the measurement data is examined, demonstrating that the approach offers accurate localization in a 3D domain.
Abstract: The identification of acoustic sources in a three-dimensional (3D) domain based on measurements with an array of microphones is a challenging problem: it entails the estimation of the angular position of the sources (direction of arrival), their distance relative to the array (range), and the quantification of the source amplitudes. A 3D source localization model using a rigid spherical microphone array with spherical wave propagation is proposed. In this study, sparse Bayesian learning is used to perform localization in 3D space, and the use of principal component analysis to denoise the measurement data is examined. The performance of the proposed method is examined numerically and experimentally, both in a free field and in a reverberant environment. The numerical and experimental investigations demonstrate that the approach offers accurate localization in a 3D domain, resolving closely spaced sources and making it possible to identify sources located at different ranges.

32 citations
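
The PCA denoising step can be sketched as a truncated SVD of the array snapshot matrix, keeping the signal subspace and discarding the noise subspace before localization. A minimal numpy illustration with an assumed rank-2 source model (component count and noise level are illustrative):

```python
import numpy as np

def pca_denoise(snapshots, n_components):
    """Keep the strongest principal components of array snapshots.

    snapshots: (n_mics, n_snapshots) complex measurements at one frequency.
    Truncating the SVD suppresses the noise subspace before the
    localization step (a sketch of the denoising idea).
    """
    U, s, Vh = np.linalg.svd(snapshots, full_matrices=False)
    k = n_components
    return (U[:, :k] * s[:k]) @ Vh[:k]

rng = np.random.default_rng(3)
n_mics, n_snap = 32, 200
# Two coherent sources plus sensor noise.
steer = rng.standard_normal((n_mics, 2)) + 1j * rng.standard_normal((n_mics, 2))
amps = rng.standard_normal((2, n_snap)) + 1j * rng.standard_normal((2, n_snap))
clean = steer @ amps
noisy = clean + 0.5 * (rng.standard_normal((n_mics, n_snap))
                       + 1j * rng.standard_normal((n_mics, n_snap)))
denoised = pca_denoise(noisy, n_components=2)
print(np.linalg.norm(noisy - clean) > np.linalg.norm(denoised - clean))  # True
```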


Proceedings ArticleDOI
25 Oct 2020
TL;DR: Speech recognition experimental results show that the proposed neural network based speech separation method significantly outperforms baseline multi-channel speech separation systems.
Abstract: This paper proposes a neural network based speech separation method using spatially distributed microphones. Unlike in traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which hinders the use of conventional multi-channel speech separation neural networks based on fixed-size input. To overcome this, a novel network architecture is proposed that interleaves inter-channel processing layers and temporal processing layers. The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones. The temporal processing layers are based on a bidirectional long short-term memory (BLSTM) model and applied to each channel independently. The proposed network leverages information across time and space by stacking these two kinds of layers alternately. Our network estimates time-frequency (TF) masks for each speaker, which are then used to generate enhanced speech signals either with TF masking or beamforming. Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.

32 citations
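
The key property of the inter-channel layers, that self-attention along the channel dimension works for any number and ordering of microphones, is easy to see in a single-head numpy sketch (dimensions are illustrative, and the BLSTM temporal layers are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(features, Wq, Wk, Wv):
    """Self-attention applied along the channel (microphone) dimension.

    features: (n_channels, dim) features for one time frame. Because the
    attention weights are computed from the data itself, the same layer
    accepts any number of microphones in any order, the property exploited
    for unknown array geometries. (Illustrative single-head sketch.)
    """
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (n_ch, n_ch)
    return scores @ V

rng = np.random.default_rng(4)
dim = 32
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
print(channel_attention(rng.standard_normal((3, dim)), Wq, Wk, Wv).shape)  # (3, 32)
print(channel_attention(rng.standard_normal((7, dim)), Wq, Wk, Wv).shape)  # (7, 32)
```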


Journal ArticleDOI
TL;DR: A new approach that calculates the cross-spectral matrix by numerically solving a transcendental equation is proposed as a computationally more efficient alternative to the steering vector used in the former algorithm.

30 citations


Proceedings ArticleDOI
16 Nov 2020
TL;DR: Symphony is the first approach to tackle the problem of concurrently localizing multiple acoustic sources with a smart device using a single microphone array; it includes a geometry-based filtering module to distinguish signals from different sources along different paths and a coherence-based module to identify signals from the same source.
Abstract: Sound recognition is an important and popular function of smart devices. The location of a sound is basic information associated with the acoustic source. Apart from sound recognition, whether the acoustic sources can be localized largely affects the capability and quality of the smart device's interactive functions. In this work, we study the problem of concurrently localizing multiple acoustic sources with a smart device (e.g., a smart speaker like Amazon Alexa). The existing approaches either can only localize a single source or require deploying a distributed network of microphone arrays to function. Our proposal, called Symphony, is the first approach to tackle the above problem with a single microphone array. The insight behind Symphony is that the geometric layout of microphones on the array determines the unique relationship among signals from the same source along the same arriving path, while the source's location determines the DoAs (directions of arrival) of signals along different arriving paths. Symphony therefore includes a geometry-based filtering module to distinguish signals from different sources along different paths and a coherence-based module to identify signals from the same source. We implement Symphony with different types of commercial off-the-shelf microphone arrays and evaluate its performance under different settings. The results show that Symphony has a median localization error of 0.694 m, which is 68% less than that of the state-of-the-art approach.

29 citations


Journal ArticleDOI
TL;DR: A solution to the problem of acoustic source localization using a microphone array mounted on multirotor unmanned aerial vehicles (UAVs) is proposed, which adopts an efficient beamforming technique for the direction of arrival estimation of an acoustic source and a circular array detached from the multirotor vehicle body in order to reduce the effects of noise generated by the propellers.
Abstract: In this article, we address the problem of acoustic source localization using a microphone array mounted on multirotor unmanned aerial vehicles (UAVs). Conventional localization beamforming techniques are especially challenging in these specific conditions, due to the nature and intensity of the disturbances affecting the recorded acoustic signals. The principal disturbances are the high-frequency, narrowband noise originating from the electric motors and the broadband aerodynamic noise induced by the propellers. A solution to this problem is proposed, which adopts an efficient beamforming technique for the direction of arrival estimation of an acoustic source and a circular array detached from the multirotor vehicle body in order to reduce the effects of noise generated by the propellers. The approach used to localize the source relies on diagonal unloading beamforming with a novel norm-transform frequency fusion. The proposed algorithm is tested on a multirotor UAV equipped with a compact uniform circular array of eight microphones, placed on the bottom of the drone, to localize a target acoustic source placed on the ground while the quadcopter is hovering at different altitudes. The experimental results conducted in outdoor hovering conditions are illustrated, and the localization performance is reported under various recording conditions and source characteristics.
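
Diagonal unloading can be sketched as subtracting a scaled identity from the cross-spectral matrix before a conventional angular scan, which suppresses the self-noise floor concentrated on the diagonal. The unloading factor, array geometry, and frequency below are illustrative choices, not the paper's norm-transform frequency fusion rule:

```python
import numpy as np

def diagonal_unloading_spectrum(R, steering_vectors):
    """Diagonal-unloading beamforming pseudo-spectrum (illustrative sketch).

    R: (n_mics, n_mics) cross-spectral matrix at one frequency.
    steering_vectors: (n_angles, n_mics) candidate look directions.
    Subtracting a scaled identity removes the diffuse/self-noise floor
    (here dominated by rotor noise) before scanning the angles; the
    unloading factor below is one simple choice, not the paper's rule.
    """
    gamma = np.trace(R).real / R.shape[0]        # average diagonal power
    R_u = R - gamma * np.eye(R.shape[0])
    return np.einsum('am,mn,an->a', steering_vectors.conj(),
                     R_u, steering_vectors).real

n_mics, n_angles = 8, 181
rng = np.random.default_rng(5)
angles = np.deg2rad(np.linspace(-90, 90, n_angles))
# Uniform circular array steering at one frequency (radius and f illustrative).
radius, f, c = 0.05, 2000.0, 343.0
mic_phi = 2 * np.pi * np.arange(n_mics) / n_mics
delays = radius / c * np.cos(angles[:, None] - mic_phi[None, :])
sv = np.exp(-2j * np.pi * f * delays) / np.sqrt(n_mics)
x = sv[45]                                        # plane wave from angles[45]
R = 5.0 * np.outer(x, x.conj()) + 0.5 * np.eye(n_mics)
print(angles[np.argmax(diagonal_unloading_spectrum(R, sv))], angles[45])
```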

Journal ArticleDOI
28 Aug 2020-Sensors
TL;DR: The software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (CoNNs), and an analysis of the detection and tracking performance for remotely piloted aircraft systems (RPASs), measured with a dedicated spiral microphone array with MEMS microphones, was performed.
Abstract: The purpose of this paper is to investigate the possibility of developing and using an intelligent, flexible, and reliable acoustic system designed to discover, locate, and transmit the position of unmanned aerial vehicles (UAVs). Such an application is very useful for monitoring sensitive areas and land territories subject to privacy. The software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (CoNNs). An analysis of the detection and tracking performance for remotely piloted aircraft systems (RPASs), measured with a dedicated spiral microphone array with MEMS microphones, was also performed. The detection and tracking algorithms were implemented based on spectrogram decomposition and adaptive filters. In this research, spectrograms with Cohen-class decomposition, log-Mel spectrograms, harmonic-percussive source separation, and raw audio waveforms of the audio samples collected from the spiral microphone array were used as inputs to the concurrent neural networks, in order to determine and classify the number of detected drones in the perimeter of interest.
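
One of the listed input representations, the log-Mel spectrogram, can be computed per channel with the librosa library (a stand-in for whatever tooling the authors used; frame sizes and Mel band count are illustrative):

```python
import numpy as np
import librosa

def log_mel(y, sr, n_fft=1024, hop_length=256, n_mels=64):
    """Log-Mel spectrogram of one microphone channel (parameters illustrative)."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

sr = 22050
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s test tone
features = log_mel(y, sr)
print(features.shape)   # (64, n_frames): one input "image" for the CoNN
```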

Journal ArticleDOI
TL;DR: The experimental results of multi-channel speech coding and enhancement show that the MASS can faithfully simulate the signals encountered in a real room acoustic environment and can be applied to research in the related fields.
Abstract: Multi-channel speech coding and enhancement is an indispensable technology in speech communication. In order to verify the effectiveness of multi-channel speech coding and enhancement methods during research and development, a microphone array speech simulator (MASS) for room acoustic environments is proposed. The proposed MASS is an improvement and extension of existing multi-channel speech simulators. It aims to simulate clean speech, noisy speech, clean speech with reverberation, noisy speech with reverberation, and noise signals as captured by a microphone array, for use in multi-channel speech coding and enhancement in room acoustic environments. The experimental results of multi-channel speech coding and enhancement show that the MASS can faithfully simulate the signals encountered in a real room acoustic environment and can be applied to research in the related fields.
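
The authors built their own MASS tool; a comparable multi-channel room simulation, clean speech convolved with image-source room impulse responses plus sensor noise, can be sketched with the third-party pyroomacoustics library (room geometry, absorption, and array layout below are illustrative):

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
rng = np.random.default_rng(6)
speech = rng.standard_normal(2 * fs)          # stand-in for a clean utterance

# Shoebox room with frequency-flat absorption (values illustrative).
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.35), max_order=12)
room.add_source([2.0, 3.0, 1.5], signal=speech)

# 4-microphone linear array at 1.2 m height.
mic_x = 4.0 + 0.05 * np.arange(4)
mics = np.stack([mic_x, np.full(4, 2.0), np.full(4, 1.2)])  # (3, 4)
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

room.simulate(snr=20)                          # add sensor noise at 20 dB SNR
multichannel = room.mic_array.signals          # (4, n_samples) reverberant mix
print(multichannel.shape)
```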

Journal ArticleDOI
TL;DR: Overall, GIBF and EHR-CLEAN-SC offer the most accurate results when point sources (speakers) are present, and even achieve super-resolution by separating sound sources beyond the Rayleigh resolution limit.

Journal ArticleDOI
TL;DR: A listening experiment evaluating the perceptual improvements of binaural rendering of undersampled SMA data that can be achieved using state-of-the-art mitigation approaches found that most mitigation approaches lead to significant perceptual improvements, even though audible differences to the reference remain.
Abstract: Spherical microphone arrays (SMAs) are widely used to capture spatial sound fields that can then be rendered in various ways as a virtual acoustic environment (VAE) including headphone-based binaural synthesis. Several practical limitations have a significant impact on the fidelity of the rendered VAE. The finite number of microphones of SMAs leads to spatial undersampling of the captured sound field, which, on the one hand, induces spatial aliasing artifacts and, on the other hand, limits the order of the spherical harmonics (SH) representation. Several approaches have been presented in the literature that aim to mitigate the perceptual impairments due to these limitations. In this article, we present a listening experiment evaluating the perceptual improvements of binaural rendering of undersampled SMA data that can be achieved using state-of-the-art mitigation approaches. In particular, we examined the Magnitude Least-Squares algorithm, the Bandwidth Extraction Algorithm for Microphone Arrays, Spherical Head Filters, SH Tapering, and a newly proposed equalization filter. In the experiment, subjects rated the perceived differences between a dummy head and the corresponding SMA auralization. We found that most mitigation approaches lead to significant perceptual improvements, even though audible differences to the reference remain.
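
Of the evaluated mitigation approaches, SH tapering is the simplest to sketch: per-order weights that roll off the highest spherical-harmonic orders of the truncated representation. The cosine-shaped taper below is one plausible instance of the idea, not the exact window the study used:

```python
import numpy as np

def sh_taper_weights(order, kind="hann"):
    """Per-order taper for a truncated spherical harmonic representation.

    Rolling off the highest SH orders reduces the sidelobes caused by hard
    truncation of an undersampled SMA recording (illustrative taper shape).
    """
    n = np.arange(order + 1)
    w = np.ones(order + 1)
    if kind == "hann":
        roll = n > order / 2          # flat up to half the order, then roll off
        w[roll] = 0.5 * (1 + np.cos(np.pi * (n[roll] - order / 2) / (order / 2)))
    # Expand to one weight per SH coefficient: order n has 2n + 1 modes.
    return np.repeat(w, 2 * n + 1)

order = 4
w = sh_taper_weights(order)
print(w.shape)                         # (25,) = (order + 1)**2 coefficients
coeffs = np.ones((order + 1) ** 2)     # stand-in SH-domain signal
print(coeffs * w)                      # orders 0-2 untouched, 3 and 4 attenuated
```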

Journal ArticleDOI
TL;DR: A Bayesian inference method based on Non-synchronous Array Measurements (Bi-NAM) is proposed so as to refine the point spread function (PSF) and break through the beamforming limitation for low-frequency source imaging.
Abstract: Beamforming is a powerful technique to achieve acoustic imaging in the far field. However, its spatial resolution is strongly blurred by the point spread function (PSF) of the phased microphone array. Due to the limitations of array aperture and microphone density, the PSF is far from a Dirac delta function, so it is difficult to obtain a high-resolution beamforming image at low frequencies (e.g., 500-1500 Hz). This paper proposes a Bayesian inference method based on Non-synchronous Array Measurements (Bi-NAM) so as to refine the PSF and break through the beamforming limitation for low-frequency source imaging. Firstly, by sequentially moving the prototype array to different positions, the non-synchronous measurements achieve a sizeable synthetic aperture and a high density of microphones. The synthetic cross-spectral matrix (CSM) can significantly improve the beamforming performance. To confine the approximation error of the synthetic CSM and the uncertainty of the forward model, as well as the noise interference, a Bayesian inference based on joint maximum a posteriori (JMAP) estimation is proposed to solve the ill-posed inverse problem. A Student-t prior is employed to enforce the sparsity of the acoustic strength distribution. The background noise can be adaptively modeled by the Student-t distribution, which is related to several of the typical symmetric distributions. The hyper-parameters in the JMAP inference are then efficiently estimated by the Bayesian hierarchical framework. Through experimental data, the proposed Bi-NAM approach is confirmed to achieve high-resolution acoustic imaging at 1000 Hz and 800 Hz, respectively, even under Laplace noise interference.

Journal ArticleDOI
TL;DR: A pre-processing algorithm which uses time-frequency spatial filtering (TFS) to generate a reference to pre-align the permutation not only improves the performance of clustering and permutation alignment, but also solves the target-channel selection problem for BSS.
Abstract: Acoustic sensing from a multi-rotor drone is heavily degraded by the strong ego-noise produced by the rotating motors and propellers. To address this problem, we propose a blind source separation (BSS) framework that extracts a target sound from noisy multi-channel signals captured by a microphone array mounted on a drone. The proposed method addresses the challenging problem of permutation alignment in extremely low signal-to-noise-ratio scenarios (e.g., SNR ≤ −15 dB) by performing clustering on the time activities of the separated signals across frequencies. Since initialization plays an important role in the success of clustering, we propose a pre-processing algorithm which uses time-frequency spatial filtering (TFS) to generate a reference to pre-align the permutation. The pre-alignment not only improves the performance of clustering and permutation alignment, but also solves the target-channel selection problem for BSS. The proposed method integrates the advantages of both TFS and BSS. Experimental results with real-recorded data show that the proposed method is capable of processing the audio stream continuously in a blockwise manner and also remarkably outperforms the state-of-the-art.

Journal ArticleDOI
TL;DR: The experimental study demonstrates that the proposed spatially selective ANC headphones provide a hear-through capability in the look direction, whilst reducing ambient noise and enabling the wearer to experience reduced noise communication in a noisy environment.
Abstract: This article presents the design and implementation of an active noise control (ANC) headphone system with a directional hear-through capability and compares the performance of this system to that of a standard hear-through headphone system. The directional hear-through ANC headphones are a novel integration of microphone array beamforming and ANC technologies into a pair of headphones, which provide the consumer with additional functionality and new, digitally augmented ways to interact with their acoustic environment. As the microphone array is necessarily compact, superdirective beamforming is utilised to increase its low and mid frequency directional performance. In this unique integration of two current consumer technologies, first, the ANC subsystem attempts to maximise the attenuation and then the beamformer output is added to the control signal and reproduced by the headphones’ loudspeakers, with the appropriate compensation to avoid self-cancellation. The experimental study demonstrates that the proposed spatially selective ANC headphones provide a hear-through capability in the look direction, whilst reducing ambient noise and enabling the wearer to experience reduced noise communication in a noisy environment. The proposed system thus offers the consumer the potential for an electronically enhanced acoustic experience, allowing a selective reduction in environmental noise whilst desired exterior noise remains audible.
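
The superdirective beamformer used for the hear-through path is the MVDR solution against a diffuse (spherically isotropic) noise field, whose coherence matrix has the familiar sinc form. A numpy sketch with an illustrative compact geometry and regularization (not the product's actual design):

```python
import numpy as np

def superdirective_weights(freq, mic_pos, look_dir, c=343.0, mu=1e-2):
    """Superdirective (MVDR-against-diffuse-noise) weights for one frequency.

    The diffuse-field coherence between microphones i and j is
    sinc(2*pi*f*d_ij/c); the regularization mu trades white-noise gain
    for directivity.
    """
    d_ij = np.linalg.norm(mic_pos[:, None] - mic_pos[None, :], axis=-1)
    Gamma = np.sinc(2 * freq * d_ij / c)           # np.sinc includes the pi
    tau = mic_pos @ look_dir / c                   # relative delays
    d = np.exp(-2j * np.pi * freq * tau)           # steering vector
    Gi = np.linalg.solve(Gamma + mu * np.eye(len(mic_pos)), d)
    return Gi / (d.conj() @ Gi)                    # distortionless in look_dir

# Compact 4-mic array on the headphone, looking straight ahead (+x).
mic_pos = np.array([[0.00, 0.0, 0.0], [0.02, 0.0, 0.0],
                    [0.04, 0.0, 0.0], [0.06, 0.0, 0.0]])
w = superdirective_weights(1000.0, mic_pos, np.array([1.0, 0.0, 0.0]))
d_look = np.exp(-2j * np.pi * 1000.0 * (mic_pos @ [1, 0, 0]) / 343.0)
print(np.abs(w.conj() @ d_look))   # ~1.0: unit response in the look direction
```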

Journal ArticleDOI
TL;DR: A class of differential beamformers, including the maximum white noise gain beamformer, the maximum directivity factor beamformer, and optimal compromising beamformers, is derived from a graph perspective.
Abstract: In this article, we study differential beamforming from a graph perspective. The microphone array used for differential beamforming is viewed as a graph, where its sensors correspond to the nodes, the number of microphones corresponds to the order of the graph, and linear spatial difference equations among microphones are related to graph edges. Specifically, for the first-order differential beamforming with an array of $M$ microphones, each pair of adjacent microphones are directly connected, resulting in $M-1$ spatial difference equations. On a graph, each of these equations corresponds to a 2-clique. For the second-order differential beamforming, each three adjacent microphones are directly connected, resulting in $M-2$ second-order spatial difference equations, and each of these equations corresponds to a 3-clique. In an analogous manner, the differential microphone array for any order of differential beamforming can be viewed as a graph. From this perspective, we then derive a class of differential beamformers, including the maximum white noise gain beamformer, the maximum directivity factor one, and optimal compromising beamformers. Simulations are presented to demonstrate the performance of the derived differential beamformers.

Journal ArticleDOI
TL;DR: The proposed data processing methods are able to determine the position of the disturbance mass even with low amounts of training data and show promise for applications where space-frequency information is of the essence.

Journal ArticleDOI
TL;DR: This paper proposes an acoustic-based scheme for positioning and tracking of illegal drones, performing classification with a hidden Markov model (HMM) in order to determine whether a sound comes from a drone or something else.
Abstract: This paper addresses issues with monitoring systems that identify and track illegal drones. The development of drone technologies promotes the widespread commercial application of drones. However, the ability of a drone to carry explosives and other destructive materials may pose serious threats to public safety. In order to reduce these threats, we propose an acoustic-based scheme for positioning and tracking of illegal drones. Our proposed scheme has three main focal points. First, we scan the sky with switched beamforming to find sound sources and record the sounds using a microphone array; second, we perform classification with a hidden Markov model (HMM) in order to determine whether a sound comes from a drone or something else. Finally, if the sound source is a drone, we use its recorded sound as a reference signal for tracking based on adaptive beamforming. Simulations are conducted under both ideal conditions (without background noise and interference sounds) and non-ideal conditions (with background noise and interference sounds), and we evaluate the performance when tracking illegal drones.
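
The classification step, comparing HMM log-likelihoods of competing sound classes, can be sketched with the hmmlearn library (a stand-in; the feature extraction is omitted, and the model sizes and toy data are illustrative):

```python
import numpy as np
from hmmlearn import hmm

# Classify a sound as "drone" vs "other" by comparing HMM log-likelihoods,
# one model per class, trained on acoustic feature sequences (e.g. MFCCs).
rng = np.random.default_rng(11)
drone_feats = rng.standard_normal((500, 13)) + 2.0   # toy training features
other_feats = rng.standard_normal((500, 13)) - 2.0

drone_hmm = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
other_hmm = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
drone_hmm.fit(drone_feats)
other_hmm.fit(other_feats)

test = rng.standard_normal((100, 13)) + 2.0          # unseen drone-like sound
is_drone = drone_hmm.score(test) > other_hmm.score(test)
print("drone" if is_drone else "other")              # -> drone; then track it
```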

Journal ArticleDOI
TL;DR: It is shown that the practical causality constraint limits the performance of the active cloak at lower frequencies, but the causally constrained controller is able to achieve approximately 10 dB of attenuation in the far-field scattered acoustic power, using an array of 9 control actuators.

Proceedings ArticleDOI
21 Apr 2020
TL;DR: Soundr presents a novel interaction technique that leverages the built-in microphone array found in most smart speakers to infer the user's spatial location and head orientation using only their voice; with that information it can figure out users' references to objects, people, and locations based on the speaker's gaze, and also provide relative directions.
Abstract: Although state-of-the-art smart speakers can hear a user's speech, unlike a human assistant these devices cannot figure out users' verbal references based on their head location and orientation. Soundr presents a novel interaction technique that leverages the built-in microphone array found in most smart speakers to infer the user's spatial location and head orientation using only their voice. With that extra information, Soundr can figure out users' references to objects, people, and locations based on the speaker's gaze, and also provide relative directions. To provide training data for our neural network, we collected 751 minutes of data (50x that of the best prior work) from human speakers, leveraging a virtual reality headset to accurately provide head-tracking ground truth. Our results achieve an average positional error of 0.31 m and an orientation angle accuracy of 34.3° for each voice command. A user study evaluating user preferences for controlling IoT appliances by talking at them found this new approach to be fast and easy to use.

Journal ArticleDOI
TL;DR: The intrinsic harmonic structure of the emitted sound is exploited by a pitch detection algorithm coupled with zero-phase selective bandpass filtering to detect the fundamental of the signal and to extract its specific harmonics.
Abstract: In recent years, the technological improvements of unmanned aerial vehicles (UAVs) have made drones more difficult to locate using optical or radio-based systems. However, the sound emitted by the UAV's motors and the aerodynamic whistling of the UAV can be exploited using a microphone array and an adequate real-time signal processing algorithm. The proposed method takes advantage of the characteristics of the sound emitted by the UAV. The intrinsic harmonic structure of the emitted sound is exploited by a pitch detection algorithm coupled with zero-phase selective bandpass filtering to detect the fundamental of the signal and to extract its specific harmonics. Although three-dimensional position errors are smaller when signals are filtered within the antenna bandwidth, experimental measurements show that the localization process can obtain accurate estimates from only a few selected harmonics of the signal. Kalman filtering is used to smooth the estimates.
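
The zero-phase selective bandpass filtering maps naturally onto scipy's filtfilt, which runs a filter forward and backward so the extracted harmonics keep their phase. A sketch assuming the pitch detector has already supplied the fundamental (bandwidth and harmonic count are illustrative):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_harmonics(x, fs, f0, n_harmonics=4, half_bw=30.0):
    """Zero-phase band-pass extraction of a fundamental and its harmonics.

    f0 would come from the pitch detection stage; each harmonic band is
    filtered with filtfilt (forward-backward), i.e. with zero phase shift.
    """
    y = np.zeros_like(x)
    for k in range(1, n_harmonics + 1):
        lo, hi = k * f0 - half_bw, k * f0 + half_bw
        if hi >= fs / 2:
            break
        b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
        y += filtfilt(b, a, x)
    return y

fs = 8000
t = np.arange(2 * fs) / fs
f0 = 190.0                                 # e.g. detected rotor fundamental
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 5))
x += 0.8 * np.random.default_rng(7).standard_normal(t.size)  # broadband noise
y = extract_harmonics(x, fs, f0)
print(np.var(x), np.var(y))                # harmonics kept, noise suppressed
```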

Proceedings ArticleDOI
08 Jun 2020
TL;DR: This paper proposes a method to design differential beamformers with larger arrays consisting of multiple DMAs, taking advantage of the good properties of DMAs for the design of beamformers with any size of microphone array.
Abstract: Differential microphone arrays (DMAs) are very attractive because of their high directional gains and frequency-invariant beampatterns. However, it is generally required that the array aperture be small, such that the DMA can respond to acoustic pressure differentials. In this paper, we propose a method to design differential beamformers with larger arrays consisting of multiple DMAs. In our study, conventional DMAs are considered as elementary units. The beamforming process consists of elementary differential beamformers and an additional beamformer that combines the multiple DMAs' outputs. The steering vector of the global array is written as a Kronecker product of the steering vectors of an elementary DMA unit and the virtual array constructed from all the DMA units. This makes it possible to design the global beamformer as a Kronecker product of the differential beamformer and the beamformer that corresponds to the virtual array. With the proposed method, one can take advantage of the good properties of DMAs for the design of beamformers with any size of microphone array.
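
The central construction, factoring the global steering vector and hence the global beamformer as a Kronecker product of a DMA-unit term and a virtual-array term, is a one-liner to verify numerically. The delay-and-sum sub-beamformers below are toy stand-ins for actual differential designs:

```python
import numpy as np

def kronecker_beamformer(w_virtual, w_dma):
    """Global weights as the Kronecker product of two sub-beamformers.

    w_dma: weights of one elementary differential (DMA) unit.
    w_virtual: weights over the virtual array formed by the DMA units.
    The global steering vector factors the same way, so designing the two
    small beamformers independently designs the large one.
    """
    return np.kron(w_virtual, w_dma)

def ula_steering(n, spacing, freq, angle, c=343.0):
    tau = spacing * np.arange(n) * np.cos(angle) / c
    return np.exp(-2j * np.pi * freq * tau)

freq, angle = 1500.0, np.deg2rad(60)
d_dma = ula_steering(3, 0.01, freq, angle)    # 3-mic DMA unit, 1 cm spacing
d_virt = ula_steering(4, 0.20, freq, angle)   # 4 DMA units, 20 cm apart
d_global = np.kron(d_virt, d_dma)             # steering of the full 12-mic array

w_dma = d_dma / 3                             # toy sub-beamformers (delay-and-sum)
w_virt = d_virt / 4
w = kronecker_beamformer(w_virt, w_dma)
print(np.abs(w.conj() @ d_global))            # ~1.0 in the look direction
```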

Journal ArticleDOI
TL;DR: An end-to-end deep learning model, called DOANet, is proposed, based on a one-dimensional dilated convolutional neural network that computes the azimuth and elevation angles of the target sound source from the raw audio signal; it shows promising results compared to the angular spectrum methods both with and without SCHC.
Abstract: Drone-embedded sound source localization (SSL) has interesting application prospects in challenging search and rescue scenarios due to bad lighting conditions or occlusions. However, the problem is complicated by severe drone ego-noise that may result in negative signal-to-noise ratios in the recorded microphone signals. In this paper, we present our work on drone-embedded SSL using recordings from an 8-channel cube-shaped microphone array embedded in an unmanned aerial vehicle (UAV). We use angular spectrum-based TDOA (time difference of arrival) estimation methods such as generalized cross-correlation phase-transform (GCC-PHAT) and minimum-variance-distortionless-response (MVDR) as baselines, which are state-of-the-art techniques for SSL. Though we improve the baseline methods by reducing ego-noise using a speed-correlated harmonics cancellation (SCHC) technique, our main focus is to utilize deep learning techniques to solve this challenging problem. Here, we propose an end-to-end deep learning model, called DOANet, for SSL. DOANet is based on a one-dimensional dilated convolutional neural network that computes the azimuth and elevation angles of the target sound source from the raw audio signal. The advantage of using DOANet is that it does not require any hand-crafted audio features or ego-noise reduction for DOA estimation. We then evaluate the SSL performance using the proposed and baseline methods and find that DOANet shows promising results compared to the angular spectrum methods both with and without SCHC. To evaluate the different methods, we also introduce a well-known parameter, the area under the curve (AUC) of cumulative histogram plots of angular deviations, as a performance indicator which, to our knowledge, has not been used for this sort of problem before.
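
The GCC-PHAT baseline named in the paper is compact enough to sketch in full: whiten the cross-power spectrum by its magnitude, inverse-transform, and read the TDOA off the correlation peak (padding and peak-search details are conventional choices, not the paper's exact code):

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """GCC-PHAT time-difference-of-arrival estimate between two channels.

    The cross-power spectrum is normalized by its magnitude (the PHAT
    weighting), so the inverse FFT peaks sharply at the true delay even
    in reverberation.
    """
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

fs = 16000
rng = np.random.default_rng(8)
x = rng.standard_normal(fs)
delay = 12                                  # samples
y = np.roll(x, delay)                       # delayed copy at a second mic
print(gcc_phat(y, x, fs) * fs)              # ~12 samples
```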

DissertationDOI
01 Jan 2020
TL;DR: In this thesis, two approaches are presented to overcome the limitations of the beamforming method for low-frequency sound source localization, based on numerically computed transfer functions (NCTFs) and the finite element method (FEM).
Abstract: For taking actions to reduce noise, knowledge of the distribution, position, and strength of the sound sources is necessary. Various sound localization methods can be used for this task. The standard methods are intensity measurement, acoustic near-field holography, and acoustic beamforming. However, these methods are not universally applicable. In contrast to intensity measurements, where an intensity probe is used, near-field holography and beamforming use locally distributed microphones (= microphone array). Depending on the sound source under investigation, the frequency range, and the measurement environment, the different methods have specific strengths and weaknesses. The obtained information can be used for noise reduction tasks as well as for monitoring and failure diagnosis of machines and facilities. Furthermore, sound source localization is an important tool in the development of new products and in acoustic optimization. With the knowledge gained from the localization process, it can be determined which area of the sound source causes the acoustic emissions. In the last years, considerable improvements have been achieved in the localization of sound sources using microphone arrays. However, there are still some limitations. In most cases, a simple source model is applied and the Green's function for free radiation is used as the transfer function between source and microphone. Hence, the actual conditions as given in the measurement setup cannot be fully taken into account. The beamforming method, which is used in this thesis for localization among other things, shows weaknesses with coherent sound sources. Moreover, the determination of the phase information of the sources is not possible. Furthermore, the beamforming method is not well suited for the localization of low-frequency sound sources. In this thesis, two approaches are presented to overcome these limitations. In order to consider the actual conditions as they are given by the measurement setup, first the beamforming method using numerically computed transfer functions (NCTFs) is applied. Here, the steering vector (often the Green's function for free radiation) is replaced by the NCTF, where the finite element method (FEM) is used to determine the NCTF. In this context, a major challenge is the creation of an accurate finite element (FE) model, including the determination of the boundary conditions. The second and more powerful approach is an inverse method, in which the wave equation in the frequency domain (the Helmholtz equation) is solved with the corresponding boundary conditions using the FEM. Then the inverse problem of matching measured (microphone signals) and simulated pressure is solved to determine the source locations. This method identifies the amplitude and phase information of the acoustic sources. With this information, the prevailing sound field can be reconstructed with a high level of accuracy, so that better results regarding the sound field can be achieved than, e.g., with the source distribution obtained by beamforming. The applicability of both approaches is demonstrated through simulation examples and the localization of a low-frequency sound source in a real environment. In this context, the various challenges that arise in practice are also discussed: the accurate modeling of the measurement environment, the determination of the boundary conditions, and the microphone positions in the room are treated in detail.
Since self-built microphones are used, the microphone calibration is also explained.

Journal ArticleDOI
TL;DR: The results presented in this article show that, despite the fact that in-air sonar applications are limited to only one snapshot, more advanced algorithms than Delay-And-Sum beamforming are viable options, which is confirmed with the real-life data captured by the newly developed micro Real Time Imaging Sonar.
Abstract: State-of-the-art autonomous vehicles use all kinds of sensors based on light, such as a camera or LIDAR (Laser Imaging Detection And Ranging). These sensors tend to fail when exposed to airborne particles. Ultrasonic sensors have the ability to work in these environments, since they have longer wavelengths and are based on acoustics, making them able to pass through the aforementioned distortions. However, they have a lower angular resolution compared to their optical counterparts. In this article, a 3D in-air sonar sensor is simulated, consisting of a Uniform Rectangular Array similar to the newly developed micro Real Time Imaging Sonar ($\mu$RTIS) by CoSys-Lab. Different direction of arrival techniques are compared for an 8 by 8 uniform rectangular microphone array in a simulation environment to investigate the influence of different parameters in a completely controlled environment. We investigate the influence of the signal-to-noise ratio and the number of snapshots on the angular and spatial resolution in the directions parallel and perpendicular to the direction of the emitted signal, respectively called the angular and range resolution. We compare these results with real-life imaging results of the $\mu$RTIS. The results presented in this article show that, despite the fact that in-air sonar applications are limited to only one snapshot, more advanced algorithms than Delay-And-Sum beamforming are viable options, which is confirmed with the real-life data captured by the $\mu$RTIS.
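
As a reference point for the comparison the paper performs, Delay-And-Sum on an 8 by 8 uniform rectangular array amounts to scanning steering vectors against a single-snapshot covariance. A numpy sketch with illustrative ultrasonic parameters (not the $\mu$RTIS's actual configuration):

```python
import numpy as np

def das_spectrum(R, freq, grid_az, grid_el, mic_pos, c=343.0):
    """Delay-And-Sum angular power spectrum for a rectangular array (sketch).

    R: (n_mics, n_mics) covariance from a single snapshot (R = x x^H), the
    situation in-air sonar is limited to. More advanced estimators replace
    this scan but need the same steering model.
    """
    P = np.zeros((grid_el.size, grid_az.size))
    for i, el in enumerate(grid_el):
        for j, az in enumerate(grid_az):
            u = np.array([np.cos(el) * np.cos(az),
                          np.cos(el) * np.sin(az), np.sin(el)])
            a = np.exp(-2j * np.pi * freq * (mic_pos @ u) / c)
            a /= np.sqrt(a.size)
            P[i, j] = np.real(a.conj() @ R @ a)
    return P

# 8x8 uniform rectangular array in the y-z plane, half-wavelength spacing.
freq, c = 40000.0, 343.0                      # ultrasonic, as in the sonar
lam = c / freq
gy, gz = np.meshgrid(np.arange(8), np.arange(8))
mic_pos = np.stack([np.zeros(64), gy.ravel() * lam / 2,
                    gz.ravel() * lam / 2], axis=1)

az, el = np.deg2rad(20), np.deg2rad(10)       # true direction
u = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
x = np.exp(-2j * np.pi * freq * (mic_pos @ u) / c)
R = np.outer(x, x.conj())                     # single-snapshot covariance
grid = np.deg2rad(np.arange(-40, 41, 2))
P = das_spectrum(R, freq, grid, grid, mic_pos)
i, j = np.unravel_index(P.argmax(), P.shape)
print(np.rad2deg(grid[j]), np.rad2deg(grid[i]))   # ~ (20, 10)
```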

Journal ArticleDOI
15 May 2020
TL;DR: Two different ways to interpolate the time signals between the microphone locations are proposed, which are tested on synthetic array data from a benchmark test case as well as on experimental data obtained with a spiral array and a five-bladed fan.
Abstract: The characterization of rotating aeroacoustic sources using microphone array methods has been proven to be a useful tool. One technique to identify rotating sources is the virtual rotating array method. The method interpolates the pressure time-data signals between the microphones in a stationary array to compensate for the motion of the rotating sources. One major drawback of the method is the requirement of ring array geometries that are centred around the rotation axis. This contribution extends the virtual rotating array method to arbitrary microphone configurations. Two different ways to interpolate the time signals between the microphone locations are proposed. The first method constructs a mesh between the microphone positions using Delaunay triangulation and interpolates over the mesh faces using piecewise linear functions. The second one is a meshless technique based on radial basis function interpolation. The methods are tested on synthetic array data from a benchmark test case as well as on experimental data obtained with a spiral array and a five-bladed fan.
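
The second, meshless variant can be sketched with scipy's RBFInterpolator: fit radial basis functions over the stationary (arbitrary, non-ring) microphone positions and evaluate at positions rotated by the accumulated rotor angle at each sample. Geometry, kernel, and rotation rate below are illustrative:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def virtual_rotating_array(mic_xy, signals, rot_angle_per_sample):
    """Meshless virtual-rotating-array interpolation (illustrative sketch).

    mic_xy: (n_mics, 2) stationary microphone positions in the array plane.
    signals: (n_mics, n_samples) recorded pressure time data.
    For each sample, the virtual microphones sit at the stationary positions
    rotated by the accumulated rotor angle; radial basis functions
    interpolate the pressure field there, so no ring geometry is required.
    """
    n_mics, n_samples = signals.shape
    out = np.empty_like(signals)
    interp = RBFInterpolator(mic_xy, signals, kernel="thin_plate_spline")
    for t in range(n_samples):
        ang = rot_angle_per_sample * t
        rot = np.array([[np.cos(ang), -np.sin(ang)],
                        [np.sin(ang),  np.cos(ang)]])
        out[:, t] = interp(mic_xy @ rot.T)[:, t]
    return out

rng = np.random.default_rng(9)
mic_xy = rng.uniform(-0.5, 0.5, size=(16, 2))       # arbitrary (non-ring) layout
signals = rng.standard_normal((16, 64))
vra = virtual_rotating_array(mic_xy, signals, rot_angle_per_sample=0.01)
print(vra.shape)   # (16, 64)
```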

Posted Content
TL;DR: This paper proposes to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimates.
Abstract: Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in the form of so-called compressed signals, which are pre-filtered target estimates. We study the performance of this algorithm under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking advantage of their spatial coverage in the room. We also propose to use the compressed signals not only to convey the target estimate but also the noise estimate, in order to exploit the acoustic diversity recorded throughout the microphone array.
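
A "compressed" signal, i.e. a pre-filtered target estimate obtained by masking one local channel, can be sketched with scipy's STFT/ISTFT pair; the threshold mask below is a toy stand-in for the DNN's output, not the paper's estimator:

```python
import numpy as np
from scipy.signal import stft, istft

def compressed_signal(noisy, mask, fs, nperseg=512):
    """Build a node's compressed signal: a mask-filtered target estimate.

    Each node applies its estimated time-frequency mask to one of its own
    channels and shares only this single pre-filtered signal with the
    other nodes of the ad-hoc array.
    """
    f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
    _, x_hat = istft(X * mask, fs=fs, nperseg=nperseg)
    return x_hat

fs = 16000
rng = np.random.default_rng(10)
noisy = rng.standard_normal(fs)                  # 1 s of a node's channel
f, t, X = stft(noisy, fs=fs, nperseg=512)
mask = (np.abs(X) > np.median(np.abs(X))).astype(float)  # toy "DNN" mask
comp = compressed_signal(noisy, mask, fs)
print(comp.shape)                                # roughly (16000,)
```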
Abstract: Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based time-frequency mask estimation scheme that can efficiently use spatial information in form of so-called compressed signals which are pre-filtered target estimations. We study the performance of this algorithm under realistic acoustic conditions and investigate practical aspects of its optimal application. We show that the nodes in the microphone array cooperate by taking profit of their spatial coverage in the room. We also propose to use the compressed signals not only to convey the target estimation but also the noise estimation in order to exploit the acoustic diversity recorded throughout the microphone array.