
Showing papers on "Acoustic source localization published in 2020"


Journal ArticleDOI
TL;DR: The LOCAlization and TrAcking (LOCATA) Challenge, as discussed by the authors, provides an open-access framework for the objective evaluation and benchmarking of broad classes of algorithms for sound source localization and tracking.
Abstract: The ability to localize and track acoustic events is a fundamental prerequisite for equipping machines with the ability to be aware of and engage with humans in their surrounding environment. However, in realistic scenarios, audio signals are adversely affected by reverberation, noise, interference, and periods of speech inactivity. In dynamic scenarios, where the sources and microphone platforms may be moving, the signals are additionally affected by variations in the source-sensor geometries. In practice, approaches to sound source localization and tracking are often impeded by missing estimates of active sources, estimation errors, as well as false estimates. The LOCAlization and TrAcking (LOCATA) Challenge aims to provide an open-access framework for the objective evaluation and benchmarking of broad classes of algorithms for sound source localization and tracking. This article provides a review of relevant localization and tracking algorithms and, within the context of the existing literature, a detailed evaluation and dissemination of the LOCATA submissions. The evaluation highlights achievements in the field and open challenges, and identifies potential future directions.

91 citations


Journal ArticleDOI
TL;DR: A new approach that calculates the cross-spectral matrix by numerically solving a transcendental equation is proposed as a computationally more efficient alternative to the steering vector used in the former algorithm.

30 citations
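
No abstract is shown for this entry, so as background only: beamforming methods of this kind start from a cross-spectral matrix estimated from microphone STFT snapshots. Below is a minimal numpy sketch of that conventional snapshot-averaged estimate; the function name and parameters are illustrative, and the paper's transcendental-equation approach is not reproduced here.

```python
import numpy as np

def cross_spectral_matrix(x, nfft=512, hop=256):
    """Conventional snapshot-averaged cross-spectral matrix (CSM).

    x: (M, N) array of M time-aligned microphone signals.
    Returns (nfft//2+1, M, M): one Hermitian CSM per frequency bin,
    averaged over windowed STFT frames (Welch-style).
    """
    M, N = x.shape
    win = np.hanning(nfft)
    n_frames = (N - nfft) // hop + 1
    csm = np.zeros((nfft // 2 + 1, M, M), dtype=complex)
    for t in range(n_frames):
        seg = x[:, t * hop: t * hop + nfft] * win   # windowed frame
        X = np.fft.rfft(seg, axis=1)                # (M, bins)
        # accumulate rank-1 outer products X_f X_f^H per bin
        csm += np.einsum('mf,nf->fmn', X, X.conj())
    return csm / max(n_frames, 1)
```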


Journal ArticleDOI
TL;DR: A numerical study illustrates how the modified new techniques can localize the acoustic source with sufficient accuracy in an anisotropic plate whose axes-of-symmetry orientation and material properties are unknown.

28 citations


Journal ArticleDOI
TL;DR: A solution to the problem of acoustic source localization using a microphone array mounted on multirotor unmanned aerial vehicles (UAVs) is proposed, which adopts an efficient beamforming technique for direction-of-arrival estimation of an acoustic source and a circular array detached from the multirotor vehicle body in order to reduce the effects of noise generated by the propellers.
Abstract: In this article, we address the problem of acoustic source localization using a microphone array mounted on multirotor unmanned aerial vehicles (UAVs). Conventional localization beamforming techniques are especially challenging in these specific conditions, due to the nature and intensity of the disturbances affecting the recorded acoustic signals. The principal disturbances are related to the high-frequency, narrowband noise originated by the electrical engines, and to the broadband aerodynamic noise induced by the propellers. A solution to this problem is proposed, which adopts an efficient beamforming technique for the direction of arrival estimation of an acoustic source and a circular array detached from the multirotor vehicle body in order to reduce the effects of noise generated by the propellers. The approach used to localize the source relies on a diagonal unloading beamforming with a novel norm transform frequency fusion. The proposed algorithm is tested on a multirotor UAV equipped with a compact uniform circular array of eight microphones, placed on the bottom of the drone to localize the target acoustic source placed on the ground while the quadcopter is hovering at different altitudes. The experimental results, obtained in outdoor hovering conditions, are illustrated, and the localization performance is reported under various recording conditions and source characteristics.

27 citations
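
For readers unfamiliar with diagonal unloading (DU): one published formulation attenuates the signal subspace by subtracting the covariance from a scaled identity, giving a MUSIC-like spectrum without an eigendecomposition. The sketch below assumes that formulation (C = tr(R)·I − R) for a single frequency bin; the paper's novel norm-transform frequency fusion is not detailed in the abstract and is omitted.

```python
import numpy as np

def du_spectrum(csm, steering):
    """Diagonal-unloading pseudo-spectrum for one frequency bin.

    csm:      (M, M) narrowband covariance (cross-spectral) matrix.
    steering: (D, M) steering vectors for D candidate directions.
    Assumes the common DU form C = tr(R) I - R, which de-emphasizes
    the signal subspace much like MUSIC's noise-subspace projection.
    """
    C = np.trace(csm) * np.eye(csm.shape[0]) - csm
    q = np.einsum('dm,mn,dn->d', steering.conj(), C, steering)
    return 1.0 / np.abs(q)   # peaks indicate candidate DOAs
```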


Journal ArticleDOI
Zhu Qinfeng, Zheng Huifeng, Yuebing Wang, Cao Yonggang, Guo Shixu
02 Aug 2020-Sensors
TL;DR: A localization error index for sound imaging instruments is defined, and an acoustic phase cloud map evaluation method based on an improved YOLOv4 algorithm is proposed to directly and objectively evaluate the sound source localization results of a sound imaging instrument.
Abstract: Most sound imaging instruments are currently used as measurement tools which can provide quantitative data; however, a uniform method to directly and comprehensively evaluate the results of combining acoustic and optical images is not available. Therefore, in this study, we define a localization error index for sound imaging instruments, and propose an acoustic phase cloud map evaluation method based on an improved YOLOv4 algorithm to directly and objectively evaluate the sound source localization results of a sound imaging instrument. The evaluation method begins with the image augmentation of acoustic phase cloud maps obtained from the different tests of a sound imaging instrument to produce the dataset required for training the convolutional network. Subsequently, we combine DenseNet with existing clustering algorithms to improve the YOLOv4 algorithm to train the neural network for easier feature extraction. The trained neural network is then used to localize the target sound source and its pseudo-color map in the acoustic phase cloud map to obtain a pixel-level localization error. Finally, a standard chessboard grid is used to obtain the proportional relationship between the size of the acoustic phase cloud map and the actual physical space distance; then, the true lateral and longitudinal positioning error of the sound imaging instrument can be obtained. Experimental results show that the mean average precision of the improved YOLOv4 algorithm in acoustic phase cloud map detection is 96.3%, the F1-score is 95.2%, and the detection speed is up to 34.6 fps. The improved algorithm can rapidly and accurately determine the positioning error of a sound imaging instrument, which can be used to analyze and evaluate its positioning performance.

26 citations


Proceedings ArticleDOI
03 May 2020
TL;DR: An end-to-end deep convolutional neural network operating on multi-channel raw audio data is presented to localize multiple simultaneously active acoustic sources in space; it outperforms a recent time difference of arrival (TDOA) based multiple source localization approach reported in the literature.
Abstract: In this paper, we present an end-to-end deep convolutional neural network operating on multi-channel raw audio data to localize multiple simultaneously active acoustic sources in space. Previously reported deep learning based approaches work well in localizing a single source directly from multi-channel raw-audio, but are not easily extendable to localize multiple sources due to the well known permutation problem. We propose a novel encoding scheme to represent the spatial coordinates of multiple sources, which facilitates 2D localization of multiple sources in an end-to-end fashion, avoiding the permutation problem and achieving arbitrary spatial resolution. Experiments on a simulated data set and real recordings from the AV16.3 Corpus demonstrate that the proposed method generalizes well to unseen test conditions, and outperforms a recent time difference of arrival (TDOA) based multiple source localization approach reported in the literature.

24 citations
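
The abstract does not spell out the proposed encoding scheme, so the sketch below shows a generic permutation-free alternative for intuition only: each source becomes a Gaussian blob on a 2D spatial grid, so the network regresses a single map whose peaks can be read off regardless of source ordering. All names here are hypothetical.

```python
import numpy as np

def sources_to_heatmap(coords, grid=(64, 64), room=(5.0, 5.0), sigma=0.15):
    """Order-independent target map: one Gaussian blob per source.

    coords: iterable of (x, y) source positions in metres.
    Peak-picking on the predicted map at inference recovers the
    source positions without assigning outputs to sources, which is
    one generic way to sidestep the permutation problem.
    """
    xs = np.linspace(0.0, room[0], grid[0])
    ys = np.linspace(0.0, room[1], grid[1])
    gx, gy = np.meshgrid(xs, ys, indexing='ij')
    heat = np.zeros(grid)
    for sx, sy in coords:
        blob = np.exp(-((gx - sx) ** 2 + (gy - sy) ** 2) / (2 * sigma ** 2))
        heat = np.maximum(heat, blob)
    return heat
```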


Journal ArticleDOI
TL;DR: The designed bio-inspired directional microphone (BDM) has been fabricated using the commercially available MEMSCAP PiezoMUMPs process, and the results match the given angle of incidence of sound.
Abstract: The single-tone sound source localization (SSL) offered by the majority of directional microphones inspired by the ears of the fly Ormia ochracea leaves limited choice when an application like a hearing aid (HA) demands broadband SSL. Here, a piezoelectric MEMS directional microphone using a modified mechanical model of the fly's ear is presented, with the primary focus on achieving SSL in the most sensitive audio bands to mitigate the constraints of traditional SSL works. In the modified model, two optimized rectangular diaphragms are pivoted by four optimized torsional beams, while the backside of the whole structure is etched. As a result, the SSL relative to the angular rotation of the incoming sound depicts the cosine dependency of an ideal pressure-gradient sensor. At the same time, the mechanical coupling leads to a magnitude difference between the two diaphragms, which has been exploited for SSL in the frequency domain. The idea behind this work has been analytically simulated first, and with convincing mechanical results, the designed bio-inspired directional microphone (BDM) has been fabricated using the commercially available MEMSCAP PiezoMUMPs process. In an anechoic chamber, the fabricated device has been excited by free-field sound, and the SSL at 1 kHz, at the rocking and bending frequencies, and in between the rocking and bending frequencies has been found in full compliance with the given angle of incidence of sound. With the measured inter-aural sensitivity difference (mISD) and directionality, the developed BDM has been demonstrated as a practical SSL device, and the results match the given angle of incidence of sound. Furthermore, to facilitate SSL in noisy environments, the noise has been optimized in all respects, including the geometry of the diaphragm, the supporting torsional beams, and the sensing. As a result, the A-weighted noise of this work is less than 23 dBA across the audio bands, and the equivalent-input noise (EIN) is 25.52 dB SPL at 1 kHz, which are the lowest ever reported for a similar device. With broadband SSL, in addition to the lowest noise, the developed device can be extended to audio applications such as HA devices.

23 citations


Journal ArticleDOI
Jing Wang, Jin Wang, Kai Qian, Xiang Xie, Jingming Kuang
TL;DR: A data-efficient method based on a deep neural network (DNN) and clustering is proposed to improve binaural localization performance in the mismatched HRTF condition; it performs almost as well as a DNN trained with a large number of HRTFs.
Abstract: Binaural sound source localization is an important and widely used perceptually based method, and it has been applied to machine learning studies by many researchers based on the head-related transfer function (HRTF). Because the HRTF is closely related to human physiological structure, HRTFs vary between individuals. Related machine learning studies to date tend to focus on binaural localization in reverberant or noisy environments, or in conditions with multiple simultaneously active sound sources. In contrast, the mismatched HRTF condition, in which the HRTFs used to generate the training and test sets are different, is rarely studied. This mismatch leads to a degradation of localization performance. A basic solution to this problem is to introduce more data to improve generalization performance, which requires a large amount of data; simply increasing the data volume, however, is data-inefficient. In this paper, we propose a data-efficient method based on a deep neural network (DNN) and clustering to improve binaural localization performance in the mismatched HRTF condition. Firstly, we analyze the relationship between binaural cues and sound source localization with a classification DNN. Different HRTFs are used to generate training and test sets, respectively. On this basis, we study the localization performance of the DNN model trained on each training set on different test sets. The results show that the localization performance of the same model on different test sets differs, while the localization performance of different models on the same test set may be similar. The results also show a clustering trend. Secondly, the different HRTFs are divided into several clusters. Finally, the HRTFs corresponding to each cluster center are selected to generate a new training set and to train a more generalized DNN model. The experimental results show that the proposed method achieves better generalization performance than the baseline methods in the mismatched HRTF condition and performs almost as well as a DNN trained with a large number of HRTFs, which means the proposed method is data-efficient.

22 citations
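
The clustering step can be pictured with a small sketch: flatten each subject's HRTF set into a feature vector, cluster, and keep only the subjects nearest each cluster centre as the training set. The features and distance below are generic stand-ins (the abstract does not specify them), using scikit-learn's KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_hrtfs(hrtf_features, n_clusters=5):
    """Pick a data-efficient HRTF training subset by clustering.

    hrtf_features: (S, F) array, one flattened feature vector per
    subject's HRTF set. Returns indices of the subjects closest to
    each k-means centre; training on only these approximates training
    on the full bank, which is the data-efficiency idea above.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(hrtf_features)
    picks = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        dist = np.linalg.norm(
            hrtf_features[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dist)]))
    return picks
```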


Journal ArticleDOI
TL;DR: Performance evaluation regarding the number of estimation steps shows that higher accuracy can be achieved by longer observation of a stationary sound source; the uncertainty of TDOA and AOA measurements as a function of the measurement-interval length is also investigated.
Abstract: Accurate source localization is an important problem in many research areas as well as in practical applications in wireless communications and acoustic signal processing. This paper presents a passive three-dimensional sound source localization (SSL) method that employs a geometric configuration of three soundfield microphones. Two methods for estimating the angle of arrival (AOA) and time difference of arrival (TDOA) are proposed based on Ambisonics A- and B-format signals. The closed-form solution for sound source location estimation based on two TDOAs and three AOAs is derived. The proposed method is evaluated by simulations and physical experiments in our anechoic chamber. Simulations demonstrate that the estimation method can theoretically attain the Cramer-Rao lower bound for small Gaussian noise present in the AOA and TDOA observations. The uncertainty of TDOA and AOA measurements depending on the length of the measurement interval is also investigated. Experimental results in terms of RMSE indicate that the proposed solution can be used to accurately find the 3D position of a sound source in a free-field environment. Performance evaluation regarding the number of estimation steps shows that higher accuracy can be achieved by longer observation of a stationary sound source.

20 citations
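
The paper's closed-form solution from two TDOAs and three AOAs is specific to its geometry and is not reproduced here. As one standard building block, an AOA can be estimated from first-order Ambisonics (B-format) signals via the time-averaged active intensity vector; B-format sign and normalization conventions vary, so treat this as a sketch.

```python
import numpy as np

def bformat_intensity_aoa(w, x, y, z):
    """AOA estimate from B-format signals via active intensity.

    w: omnidirectional pressure channel; x, y, z: figure-of-eight
    velocity channels (all 1-D time-domain arrays of equal length).
    The averaged intensity I = <w * (x, y, z)> points along the
    propagation direction; depending on channel sign conventions the
    source direction is along I or -I.
    """
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation
```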


Journal ArticleDOI
TL;DR: A novel MUSIC framework is presented for multiple sound source localization (range, elevation, azimuth) in reverberant rooms, incorporating a recently proposed region-to-region room transfer model.
Abstract: This work presents a method that persuades acoustic reflections to be a favorable property for sound source localization. Whilst most real world spatial audio applications utilize prior knowledge of sound source position, estimating such positions in reverberant environments is still considered to be a difficult problem due to acoustic reflections. This article presents a novel MUSIC framework for multiple sound source localization (range, elevation, azimuth) in reverberant rooms by incorporating a recently proposed region-to-region room transfer model. The method is built upon the received signals of a higher order microphone and a spherical harmonic representation of the room transfer function. We demonstrate the method's general applicability and multiple source localization performance through a simulation study across an assortment of reverberant conditions. Additionally, we investigate robustness against various system modeling errors to gauge implementation viability. Finally, we prove the method in a practical experiment inside a real-world room with measured region-to-region transfer function parameters.

19 citations
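
For orientation, the classical narrowband MUSIC spectrum that the framework generalizes looks as follows; the paper's contribution is to replace the free-field steering model below with a region-to-region room transfer representation, which is not reproduced here.

```python
import numpy as np

def music_spectrum(csm, steering, n_src):
    """Classical narrowband MUSIC pseudo-spectrum (free-field model).

    csm:      (M, M) array covariance matrix for one frequency bin.
    steering: (D, M) candidate steering vectors on a search grid.
    n_src:    assumed number of sources.
    """
    _, vecs = np.linalg.eigh(csm)            # eigenvalues ascending
    noise = vecs[:, : csm.shape[0] - n_src]  # noise-subspace basis
    proj = steering.conj() @ noise           # a^H E_n, shape (D, M-n_src)
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=1)
```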


Journal ArticleDOI
TL;DR: A data-driven approach based on a convolutional neural network is presented which, using the GCCs as input, estimates the source location in two steps; the use of the RST as an intermediate representation makes it possible for the network to generalize to data unseen during training.
Abstract: In this article we present a methodology for source localization in reverberant environments from Generalized Cross Correlations (GCCs) computed between spatially distributed individual microphones. Reverberation tends to negatively affect localization based on Time Differences of Arrival (TDOAs), which become inaccurate due to the presence of spurious peaks in the GCC. We therefore adopt a data-driven approach based on a convolutional neural network, which, using the GCCs as input, estimates the source location in two steps. It first computes the Ray Space Transform (RST) from multiple arrays. The RST is a convenient representation of the acoustic rays impinging on the array in a parametric space, called Ray Space. Rays produced by a source are visualized in the RST as patterns, whose position is uniquely related to the source location. The second step consists of estimating the source location through a nonlinear fitting, which estimates the coordinates that best approximate the RST pattern obtained in the first step. It is worth noting that training can be accomplished on simulated data only, thus relaxing the need to actually deploy microphone arrays in the acoustic scene. The localization accuracy of the proposed technique is similar to that of SRP-PHAT; however, our method demonstrates increased robustness across different distributed-microphone configurations. Moreover, the use of the RST as an intermediate representation makes it possible for the network to generalize to data unseen during training.

Journal ArticleDOI
Ran Lee, Minseok Kang, Bo-Hyun Kim, Kang-Ho Park, Sung Q Lee, Hyung-Min Park
TL;DR: This paper describes a method that can efficiently achieve sound source localization in noisy and reverberant environments; it builds on the generalized cross-correlation function with phase-transform weights (GCC-PHAT) for robustness against reverberation and binarizes the “inversed” diffuseness with a very rigorous threshold.
Abstract: Although sound source localization is a desirable technique in many communication systems and intelligence applications, the distortion caused by diffuse noise or reverberation makes the time delay estimation (TDE) between signals acquired by a pair of microphones a complicated and challenging problem. In this paper, we describe a method that can efficiently achieve sound source localization in noisy and reverberant environments. This method is based on the generalized cross-correlation (GCC) function with phase transform (PHAT) weights (GCC-PHAT) to achieve robustness against reverberation. In addition, to make the time delay estimate robust to diffuse components and to further improve the robustness of GCC-PHAT against reverberation, time-frequency (t-f) components of observations directly emitted by a point source are chosen by the “inversed” diffuseness. The diffuseness, which can be estimated from the coherent-to-diffuse power ratio (CDR) based on the spatial coherence between two microphones, represents the contribution of diffuse components on a scale of zero to one, with direct sounds from a source modeled as fully coherent. In particular, the “inversed” diffuseness is binarized with a very rigorous threshold to select highly reliable components for accurate TDE even in noisy and reverberant environments. Experimental results for both simulated and real-recorded data consistently demonstrated the robustness of the presented method against diffuse noise and reverberation.
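
The GCC-PHAT core of the method is standard and easy to sketch; the paper's actual contribution, the CDR-based binary mask applied to time-frequency bins before delay estimation, is omitted here since its details go beyond the abstract.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Time-delay estimate between two microphones via GCC-PHAT.

    PHAT weighting whitens the cross-spectrum so only phase remains,
    which is what gives robustness to reverberation. The paper would
    additionally zero out t-f bins flagged as diffuse by the CDR mask.
    """
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting
    cc = np.fft.irfft(cross, n)
    shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-shift:], cc[:shift + 1]))
    return (np.argmax(np.abs(cc)) - shift) / fs   # delay in seconds
```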

Book
Harunori Ohmori1
01 Jun 2020
TL;DR: Temporal features of sound are encoded in the pathway starting in the nucleus magnocellularis (NM), and ITD is processed in the nucleus laminaris (NL); the contrast of ITD processing in NL is enhanced over a wide range of sound levels through the activity of GABAergic inhibitory systems from both the superior olivary nucleus and local inhibitory neurons that follow NM activity monosynaptically.
Abstract: Sound information is encoded as a series of spikes in the auditory nerve fibers (ANFs) and then transmitted to the brainstem auditory nuclei. Features such as timing and level are extracted from ANF activity and further processed as the interaural time difference (ITD) and the interaural level difference (ILD), respectively. These two interaural difference cues are used for sound source localization by behaving animals. Both cues depend on the head size of the animal and are extremely small, requiring specialized neural properties in order to process them with precision. Moreover, the sound level and timing cues are not processed independently of one another. Neurons in the nucleus angularis (NA) are specialized for coding sound level information in birds, and the ILD is processed in the posterior part of the dorsal lateral lemniscus nucleus (LLDp). Processing of ILD is affected by the phase difference of binaural sound. Temporal features of sound are encoded in the pathway starting in the nucleus magnocellularis (NM), and ITD is processed in the nucleus laminaris (NL). In this pathway a variety of specializations are found in synapse morphology, neuronal excitability, and the distribution of ion channels and receptors along the tonotopic axis, which reduces spike timing fluctuation at the ANF-NM synapse and imparts precise and stable ITD processing to the NL. Moreover, the contrast of ITD processing in NL is enhanced over a wide range of sound levels through the activity of GABAergic inhibitory systems from both the superior olivary nucleus (SON) and local inhibitory neurons that follow NM activity monosynaptically.

Journal ArticleDOI
TL;DR: A new fuzzy-based algorithm is proposed for localizing a sound source with distributed sensor nodes, using fuzzy fusion and a beamforming method; each node can also record audio signals synchronously on an SD card so that different algorithms can be evaluated offline.
Abstract: Sound source localization has always been one of the most challenging subjects in different fields of engineering, one of the most important being the tracking of flying objects. This article focuses on sound source localization using fuzzy fusion and a beamforming method. It proposes a new fuzzy-based algorithm for localizing a sound source using distributed sensor nodes. Eight low-cost sensor nodes have been constructed in this study, each of which consists of a microphone array to capture sound waves. Each node is able to record audio signals synchronously on an SD card to evaluate different algorithms offline. However, the sensor nodes are designed to be able to estimate the location of the sound source in real time. In the proposed algorithm, every node estimates the direction of the sound source. Moreover, a calibration algorithm is used for extracting the orientation of the sensor nodes to calibrate the estimated directions. The calibrated directions are fuzzified and then used for localizing the sound source by fuzzy fusion. An experiment was designed based on localizing a flying quadcopter as a moving sound source to evaluate the performance of the proposed algorithm. The flying trajectory was then estimated and compared with the target trajectory extracted from the GPS module mounted on the quadcopter. Comparing the estimated source location with the target location, a mean distance error of 6.03 m was achieved in a wide-range outdoor environment with a size of 240 × 160 × 80 m³. The achieved mean distance error is reasonable given the mean precision of the GPS module. The practical results illustrate the effectiveness of the proposed approach in localizing a sound source in a wide-range outdoor environment.
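
The fuzzification and fusion rules are the paper's novelty and are not given in the abstract; the purely geometric core they build on can be sketched as a least-squares intersection of the calibrated bearing lines (a crude, non-fuzzy stand-in, with illustrative names).

```python
import numpy as np

def intersect_bearings(node_pos, azimuths_deg, yaws_deg):
    """Least-squares intersection of 2-D bearing lines from nodes.

    node_pos:     list of (x, y) node positions.
    azimuths_deg: DOA estimated by each node in its own frame.
    yaws_deg:     node orientations from the calibration step.
    Minimizes the summed squared perpendicular distance to all rays.
    """
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, az, yaw in zip(node_pos, azimuths_deg, yaws_deg):
        th = np.radians(az + yaw)                # bearing in global frame
        d = np.array([np.cos(th), np.sin(th)])
        P = np.eye(2) - np.outer(d, d)           # projector normal to ray
        A += P
        b += P @ np.asarray(p, dtype=float)
    return np.linalg.solve(A, b)                 # estimated (x, y)
```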

Journal ArticleDOI
09 Jan 2020
TL;DR: This study uses a set of classification measures determined by support vector machines (SVM) to avoid labeling failures and deal with unknown signals; the approach is validated through simulations and field experiments.
Abstract: Drone audition, or auditory processing for drones equipped with a microphone array, is expected to compensate for problems affecting drones’ visual processing, in particular occlusion and poor-illumination conditions. The current state of drone audition still assumes a single sound source. When a drone hears sounds originating from multiple sound sources, its sound-source localization function determines their directions. If two sources are very close to each other, the localization function cannot determine whether they are crossing or approaching-then-departing. This ambiguity in tracking multiple sound sources is resolved by data association. Typical methods of data association use each label of the separated sounds, but are prone to errors due to identification failures. Instead of labeling by classification, this study uses a set of classification measures determined by support vector machines (SVM) to avoid labeling failures and deal with unknown signals. The effectiveness of the proposed approach is validated through simulations and experiments conducted in the field.

Journal ArticleDOI
TL;DR: This brief proposes an extremely low-cost sound source localization system composed of two microphones and the low-power ESP32 microcontroller module, showing excellent performance despite the memory constraints imposed by the platform.
Abstract: The implementation of algorithms for acoustic source localization on edge platforms for the Internet of Things (IoT) is gaining momentum. Applications based on acoustic monitoring can greatly benefit from efficient implementations of such algorithms, enabling novel services for smart homes and buildings or ambient-assisted living. In this context, this brief proposes an extremely low-cost sound source localization system composed of two microphones and the low-power ESP32 microcontroller module. A Direction-Of-Arrival (DOA) algorithm has been implemented taking into account the specific features of this board, showing excellent performance despite the memory constraints imposed by the platform. We have also adapted off-the-shelf low-cost microphone boards to the input requirements of the ESP32 Analog-to-Digital Converter. The processing has been optimized by leveraging both cores of the microcontroller in parallel to capture and process the audio in real time. Our experiments show that we can perform real-time localization, with a processing time below 3.3 ms.
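
The on-device implementation runs in C on the ESP32 and is not shown in the brief; the far-field geometry underlying any two-microphone DOA estimate, though, is just the arcsine relation between inter-mic delay and angle, sketched here in Python.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s at roughly 20 degrees C

def two_mic_doa(tau, mic_distance):
    """Far-field DOA (degrees) from one inter-microphone delay.

    tau: delay in seconds (e.g., from a GCC-PHAT peak);
    mic_distance: microphone spacing in metres. With two microphones
    the angle is front/back ambiguous and confined to [-90, 90] deg.
    """
    s = np.clip(SPEED_OF_SOUND * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))
```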

Journal ArticleDOI
TL;DR: Performance characteristics of the NoiseSpotter system are presented, using data from controlled acoustic transmissions in a quiet environment and ambient noise measurements in an energetic tidal channel in the presence of non-acoustic flow noise.
Abstract: NoiseSpotter is a passive acoustic monitoring system that characterizes, classifies, and geo-locates anthropogenic and natural sounds in near real time. It was developed with the primary goal of supporting the evaluation of potential acoustic effects of offshore renewable energy projects. The system consists of a compact array of three acoustic vector sensors, which measures acoustic pressure and the three-dimensional particle velocity vector associated with the propagation of an acoustic wave, thereby inherently providing bearing information to an underwater source of sound. By utilizing an array of three vector sensors, the application of beamforming techniques can provide sound source localization, allowing for characterization of the acoustic signature of specific underwater acoustic sources. Here, performance characteristics of the system are presented, using data from controlled acoustic transmissions in a quiet environment and ambient noise measurements in an energetic tidal channel in the presence of non-acoustic flow noise. Data quality is demonstrated by the ability to reduce non-acoustic flow noise contamination, while system utility is shown by the ability to characterize and localize sources of sound in the underwater environment.

Journal ArticleDOI
TL;DR: In this paper, a fiber-optic sensor array is obtained with sensor tips formed from a polymer-based diaphragm material, which is low cost and easy to manufacture; the SSL process has been experimentally verified in a measurement medium with a volume of 1 m³.

Journal ArticleDOI
01 Nov 2020
TL;DR: The findings show that sound localization benefits from the presence of a minimal visual spatial frame and confirm the importance of combining kinematic tracking and virtual reality when aiming to reveal the multisensory and motor contributions to spatial-hearing abilities.
Abstract: Studies on audio-visual interactions in sound localization have primarily focused on the relations between the spatial position of sounds and their perceived visual source, as in the famous ventriloquist effect. Much less work has examined the effects on sound localization of seeing aspects of the visual environment. In this study, we took advantage of an innovative method for the study of spatial hearing – based on real sounds, virtual reality and real-time kinematic tracking – to examine the impact of a minimal visual spatial frame on sound localization. We tested sound localization in normal hearing participants (N=36) in two visual conditions: a uniform gray scene and a simple visual environment comprising only a grid. In both cases, no visual cues about the sound sources were provided. During and after sound emission, participants were free to move their head and eyes without restriction. We found that the presence of a visual spatial frame improved hand-pointing in elevation. In addition, it determined faster first-gaze movements to sounds. Our findings show that sound localization benefits from the presence of a minimal visual spatial frame and confirm the importance of combining kinematic tracking and virtual reality when aiming to reveal the multisensory and motor contributions to spatial-hearing abilities.

Journal ArticleDOI
TL;DR: The results show that the proposed SH-DU-FSPT has a DOA estimation performance comparable to that of high resolution state-of-the-art methods with a significant reduction of the computational cost, since the steering directional responses are computed on the broadband frequency smoothing covariance matrix.
Abstract: Spherical microphone arrays allow the sound field analysis in three dimensions with the advantage of having the same resolution in all directions. By considering the frequency-independent character of the steering vectors in the spherical harmonic (SH) domain, we propose a very low-complexity SH diagonal unloading (DU) beamforming with a novel frequency smoothing power transform (FSPT) of the covariance matrices. We consider the direction of arrival (DOA) estimation problem of acoustic sources in reverberant conditions. The DU beamforming provides high resolution directional response since it exploits the subspace orthogonality property of the covariance matrix by the removal or the attenuation of the signal subspaces, obtained through the subtraction of an opportune diagonal matrix from the covariance matrix. The FSPT aims at smoothing the narrowband covariance matrices of the entire set of frequency domain components, and it pursues this goal by minimizing the narrowband error contributions due to reverberation in the broadband frequency smoothing covariance matrix. We analyze the DOA estimation performance using speech signals with simulations and real acoustic data in reverberant conditions. The results show that the proposed SH-DU-FSPT has a DOA estimation performance comparable to that of high resolution state-of-the-art methods with a significant reduction of the computational cost, since the steering directional responses are computed on the broadband frequency smoothing covariance matrix.
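
The cost saving claimed above comes from the frequency independence of SH-domain steering vectors, which allows all narrowband covariances to be averaged into one broadband matrix that is scanned once. A minimal sketch of that plain averaging follows; the paper's power-transform weighting (the PT in FSPT) is not reproduced.

```python
import numpy as np

def frequency_smoothed_cov(sh_stft):
    """Broadband covariance by smoothing over frequency (and time).

    sh_stft: (F, T, Q) STFT of Q spherical-harmonic channels.
    Because SH steering vectors do not depend on frequency, the
    per-bin covariances can be summed and a single DOA scan run on
    the (Q, Q) result instead of one scan per bin.
    """
    F, T, Q = sh_stft.shape
    X = sh_stft.reshape(F * T, Q)                # snapshots as rows
    R = np.einsum('nq,np->qp', X, X.conj()) / (F * T)
    return R
```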

Proceedings ArticleDOI
01 Sep 2020
TL;DR: A deep neural network is proposed to analyze audio signals recorded by 3D microphones and localize sound sources in a spatial sound field; the network exploits the quaternion-valued representation of ambisonic signals and improves localization performance with respect to existing methods.
Abstract: Localization of sound sources in 3D sound fields is an extremely challenging task, especially when the environments are reverberant and involve multiple sources. In this work, we propose a deep neural network to analyze audio signals recorded by 3D microphones and localize sound sources in a spatial sound field. In particular, we consider first-order Ambisonics microphones to capture 3D acoustic signals and represent them by spherical harmonic decomposition in the quaternion domain. Moreover, to improve the localization performance, we use quaternion input features derived from the acoustic intensity, which is strictly related to the direction of arrival (DOA) of a sound source. The proposed network architecture involves both quaternion-valued convolutional and recurrent layers. Results show that the proposed method is able to exploit the quaternion-valued representation of ambisonic signals and to improve the localization performance with respect to existing methods.

DissertationDOI
01 Jan 2020
TL;DR: In this paper, two approaches are presented to overcome the limitations of the beamforming method for low frequency sound source localization; they are based on numerically computed transfer functions (NCTFs) and the finite element method (FEM).
Abstract: For taking actions to reduce noise, knowledge of the distribution, position and strength of the sound sources is necessary. Thereby, various sound localization methods can be used for this task. The standard methods are intensity measurement, acoustic near-field holography and acoustic beamforming. But, these methods are not universally applicable. In contrast to intensity measurements, where an intensity probe is used, near-field holography and beamforming use locally distributed microphones (= microphone array). Depending on the sound source under investigation, frequency range and measurement environment, the different methods have specific strengths and weaknesses. The obtained information can be used for noise reduction tasks as well as for monitoring and failure diagnosis of machines and facilities. Furthermore, sound source location is an important tool in the development of new products and in acoustic optimization. With the knowledge gained from the localization process, it can be determined which area of the sound source causes acoustic emissions. In the last years, considerable improvements have been achieved in the localization of sound sources using microphone arrays. However, there are still some limitations. In most cases, a simple source model is applied and the Green’s function for free radiation is used as transfer function between source and microphone. Hence, the actual conditions as given in the measurement setup can not be fully taken into account. The beamforming method, which is used in this thesis among other things for localization, shows weaknesses with coherent sound sources. Moreover, the determination of the phase information of the sources is not possible. Furthermore, the beamforming method is not well suited for the localization of low frequency sound sources. In this thesis, two approaches are presented to overcome these limitations. In order to consider the actual conditions as they are given by the measurement setup, first the beamforming method using numerically computed transfer functions (NCTFs) is applied. Here, the steering vector (often the Green’s function for free radiation) is replaced by the NCTF. Thereby, the finite element method (FEM) is used to determine the NCTF. In this context, a major challenge is the creation of an accurate finite element (FE) model including the determination of the boundary conditions. The second and more powerful approach is an inverse method, in which the wave equation in the frequency domain (Helmholtz equation) is solved with the corresponding boundary conditions using the FEM. Then the inverse problem of matching measured (microphone signals) and simulated pressure is solved to determine the source locations. This method identifies the amplitude and phase information of the acoustic sources. With this information the prevailing sound field can be reconstructed with high level of accuracy, so that better results regarding the sound field can be achieved than, e.g., with the source distribution obtained by beamforming. The applicability of both approaches will be demonstrated through simulation examples and the localization of a low frequency sound source in a real environment. In this context, the various challenges that arise in practice will also be discussed. Thereby, the accurate modeling of the measurement environment, the determination of the boundary conditions and the microphone positions in the room are discussed in detail. 
Since own-built microphones are used, the microphone calibration is also explained.

Journal ArticleDOI
TL;DR: An end-to-end deep learning model, called DOANet, is proposed, based on a one-dimensional dilated convolutional neural network that computes the azimuth and elevation angles of the target sound source from the raw audio signal; DOANet shows promising results compared with the angular spectrum methods both with and without SCHC.
Abstract: Drone-embedded sound source localization (SSL) has interesting application prospects in challenging search and rescue scenarios due to bad lighting conditions or occlusions. However, the problem gets complicated by severe drone ego-noise that may result in negative signal-to-noise ratios in the recorded microphone signals. In this paper, we present our work on drone-embedded SSL using recordings from an 8-channel cube-shaped microphone array embedded in an unmanned aerial vehicle (UAV). We use angular spectrum-based TDOA (time difference of arrival) estimation methods such as generalized cross-correlation phase transform (GCC-PHAT) and minimum variance distortionless response (MVDR) as baselines, which are state-of-the-art techniques for SSL. Though we improve the baseline methods by reducing ego-noise using the speed correlated harmonics cancellation (SCHC) technique, our main focus is to utilize deep learning techniques to solve this challenging problem. Here, we propose an end-to-end deep learning model, called DOANet, for SSL. DOANet is based on a one-dimensional dilated convolutional neural network that computes the azimuth and elevation angles of the target sound source from the raw audio signal. The advantage of using DOANet is that it does not require any hand-crafted audio features or ego-noise reduction for DOA estimation. We then evaluate the SSL performance using the proposed and baseline methods and find that DOANet shows promising results compared to the angular spectrum methods both with and without SCHC. To evaluate the different methods, we also introduce a well-known parameter—the area under the curve (AUC) of cumulative histogram plots of angular deviations—as a performance indicator which, to our knowledge, has not been used for this sort of problem before.
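
The AUC indicator mentioned at the end is simple to compute: build the cumulative histogram of angular errors and integrate it. A small sketch, with illustrative names:

```python
import numpy as np

def angular_error_auc(errors_deg, max_err=180.0, n_steps=181):
    """AUC of the cumulative histogram of angular deviations.

    For each threshold t, take the fraction of estimates with error
    <= t, then integrate over t and normalise to [0, 1]. Values near
    1 mean errors are concentrated near zero degrees.
    """
    errors = np.asarray(errors_deg, dtype=float)
    t = np.linspace(0.0, max_err, n_steps)
    cdf = np.array([(errors <= ti).mean() for ti in t])
    return float(np.trapz(cdf, t) / max_err)
```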

Journal ArticleDOI
08 Jul 2020
TL;DR: An overview and history of robot audition are provided, together with an introduction to open-source software for robot audition and its wide range of real-world applications, and it is discussed how robot audition contributes to the development of computational auditory scene analysis, that is, the understanding of real-world auditory environments.
Abstract: Robot audition aims at developing robots' ears that work in the real world, that is, machine listening of multiple sound sources. Its critical problem is noise. Speech interfaces have become more familiar and more indispensable as smartphones and artificial intelligence (AI) speakers spread. Their critical problems are noise and multiple simultaneous speakers. Recently, two technological advances have contributed to significantly improving the performance of speech interfaces and robot audition. Emerging deep learning technology has improved the noise robustness of automatic speech recognition, whereas microphone array processing has improved the performance of preprocessing such as noise reduction. Herein, an overview and history of robot audition are provided, together with an introduction to an open-source software platform for robot audition and its wide range of real-world applications. It is also discussed how robot audition contributes to the development of computational auditory scene analysis, that is, the understanding of real-world auditory environments.

Book ChapterDOI
30 Nov 2020
TL;DR: An unsupervised learning system is developed that solves sound source localization by decomposing the task into two steps, and it is shown that visual information is dominant in “sound” source localization when evaluated with the currently adopted benchmark dataset.
Abstract: In sound source localization using both visual and aural information, it presently remains unclear how much the image and sound modalities each contribute to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing this task into two steps: (i) “potential sound source localization”, a step that localizes possible sound sources using only visual information, and (ii) “object selection”, a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in “sound” source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and verify this property through several experiments.

Journal ArticleDOI
TL;DR: A powerful heart sound classification method utilizing spatial-temporal information is proposed to determine the heart murmur type when multiple heart murmurs occur; to evaluate its benefits, it is compared with a murmur classification method based on power spectral density feature extraction.

Journal ArticleDOI
TL;DR: This paper proposes a proficient three-step method for localizing multiple sources from delay estimates, with experiments carried out under different acoustic conditions on synthesized data and on recordings from the SMARD and AV16.3 corpora.
Abstract: The higher computational efficiency of time difference of arrival (TDOA) based sound source localization makes it a preferred choice over steered response power (SRP) methods in real-time applications. However, unlike SRP, its implementation for multiple source localization (MSL) is not straightforward. It involves challenges such as accurate feature extraction in unfavourable acoustic conditions, the association ambiguity in mapping extracted features to the corresponding sources, and the complexity of solving the hyperbolic delay equation to estimate the source coordinates. Moreover, the dominant source and early reverberation make detecting the delays associated with weaker sources even more perplexing. Hence, this paper proposes a proficient three-step method for localizing multiple sources from delay estimates. In step 1, the search space is partitioned into cubic subvolumes, and the delay bound associated with each one is computed. These subvolumes are then grouped such that those whose associated TDOA bounds are enclosed by a specific delay interval are clustered together. In step 2, first the delay segments and then each subvolume contained in the corresponding delay segment are traced for passing through the estimated delay hyperbola. The traced volumes are updated with a weight measuring the likelihood of a source in them, which yields a delay density map over the search space. In the final step, localization enhancement is carried out in the selected volumes using conventional SRP (C-SRP). The proposed approach is validated by experiments under different acoustic conditions on synthesized data and on recordings from the SMARD and AV16.3 corpora.
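
Step 1's per-subvolume delay bound can be pictured with a simple stand-in: sample each cubic subvolume on a grid and take the min/max TDOA a mic pair can observe from points inside it. The paper bounds this analytically; the sampling version below only approximates the true interval.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s

def tdoa_interval(cube_min, cube_max, m1, m2, n=4):
    """Approximate TDOA range attainable inside a cubic subvolume.

    cube_min, cube_max: (3,) opposite corners of the subvolume.
    m1, m2: (3,) microphone positions of the pair.
    Grid-samples the cube and returns (min, max) of the delay
    (|x - m1| - |x - m2|) / c; finer n tightens the approximation.
    """
    axes = [np.linspace(cube_min[i], cube_max[i], n) for i in range(3)]
    pts = np.array(list(itertools.product(*axes)))
    tdoa = (np.linalg.norm(pts - m1, axis=1)
            - np.linalg.norm(pts - m2, axis=1)) / SPEED_OF_SOUND
    return tdoa.min(), tdoa.max()
```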

Journal ArticleDOI
TL;DR: This paper focuses on designing a sniper-positioning system that uses only bullet shockwave signals; the generalized cross-correlation phase transform method is used to calculate the time delays between the microphones in an array.
Abstract: Real-time sniper sound source localization in a real environment has long been an important research topic in the acoustic field. Most positioning systems utilize the muzzle blast signal for positioning, but acquiring a muzzle blast signal with a low signal-to-noise ratio is the main difficulty in unfavorable situations. Detecting the shockwave signal in a real environment is easier due to its higher signal-to-noise ratio and some of its other specific features. This paper focuses on designing a system for sniper positioning based on using only bullet shockwave signals. The first shockwave signal acquired at a microphone array does not come directly from the firing position and specifies only the relevant point on the bullet trajectory. Hence, in this research, using a minimum of two microphone arrays in a two-dimensional real environment, the bullet trajectory is obtained first, and then the decay curve of the Mach number along this trajectory is used to locate the source of fire. The generalized cross-correlation phase transform method is used to calculate the time delays between the microphones in an array. Based on the proposed method, the experimental results demonstrate sniper localization with a precision better than 30 meters at a distance of 400 meters, using an AK-47 weapon with a 7.62 millimeter caliber in different real environmental situations.

Proceedings ArticleDOI
19 Jul 2020
TL;DR: 2-D and 3-D versions of two sparse signal reconstruction methods are developed, and it is shown that the ℓ1-SVD and re-weighted ℓ1-SVD processors outperform the widely used MUSIC and Bartlett processors.
Abstract: Several superresolution source localization algorithms based on the sparse signal reconstruction framework have been developed in recent years. These methods also offer other advantages, such as immunity to noise coherence and robustness to a reduction in the number of snapshots. The application of these methods has mostly been limited to the problem of one-dimensional (1-D) direction-of-arrival estimation. In this paper, we have developed 2-D and 3-D versions of two sparse signal reconstruction methods, viz. ℓ1-SVD and re-weighted ℓ1-SVD, and applied them to the problem of 3-D localization of underwater acoustic sources. A vertical linear array is used for estimation of range and depth, and a horizontal cross-shaped array is used for bearing estimation. It is shown that the ℓ1-SVD and re-weighted ℓ1-SVD processors outperform the widely used MUSIC and Bartlett processors.
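
The ℓ1-SVD machinery (SVD reduction of multiple snapshots, grid refinement, re-weighting) is beyond the abstract; its core idea, a sparse spatial spectrum from an ℓ1-regularised fit over a steering dictionary, can be sketched for a single snapshot with plain ISTA.

```python
import numpy as np

def sparse_spectrum_ista(y, A, lam=0.1, n_iter=500):
    """Single-snapshot l1-regularised spatial spectrum via ISTA.

    y: (M,) array snapshot; A: (M, D) steering vectors over a grid of
    candidate locations. Solves min_x 0.5*||y - Ax||^2 + lam*||x||_1;
    peaks of |x| indicate source locations. This is a simplification
    of the l1-SVD idea, not the paper's full multi-snapshot method.
    """
    mu = 1.0 / (np.linalg.norm(A, 2) ** 2)       # step <= 1/sigma_max^2
    x = np.zeros(A.shape[1], dtype=complex)
    for _ in range(n_iter):
        r = x + mu * (A.conj().T @ (y - A @ x))  # gradient step
        mag = np.abs(r)
        x = r * np.maximum(1.0 - lam * mu / (mag + 1e-12), 0.0)  # soft-threshold
    return np.abs(x)
```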

Journal ArticleDOI
TL;DR: An extended Richardson-Lucy (Ex-RL) algorithm is applied to the two-dimensional shift-variant deconvolution problem in acoustic imaging to improve resolution, and it is compared with the original RL algorithm, the classical deconvolution approach for the mapping of acoustic sources (DAMAS), and non-negative least squares.
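
For context, the classic Richardson-Lucy iteration that Ex-RL extends applies directly once the beamforming map is modelled as y = A @ x with a non-negative (and, in the shift-variant case, column-varying) point-spread matrix A. A generic sketch, not the paper's extension:

```python
import numpy as np

def richardson_lucy(y, A, n_iter=100):
    """Classic Richardson-Lucy deconvolution for y = A @ x, x >= 0.

    y: (N,) non-negative beamforming map (flattened);
    A: (N, G) non-negative point-spread matrix; a shift-variant PSF
    simply means A's columns differ from point to point.
    The multiplicative update preserves non-negativity of x.
    """
    x = np.ones(A.shape[1])
    norm = A.sum(axis=0) + 1e-12                 # A^T 1
    for _ in range(n_iter):
        est = A @ x + 1e-12                      # current prediction
        x *= (A.T @ (y / est)) / norm            # RL multiplicative step
    return x
```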