
Showing papers on "Microphone" published in 2020


Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation based on the filter-and-sum network, and shows how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays.
Abstract: An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, while for deep learning-based end-to-end systems those constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves the separation performance with fixed geometry array configuration, further proving the effectiveness of the proposed paradigm in the general problem of multi-microphone speech separation.

102 citations
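The TAC paradigm can be summarized in a few lines: each channel is transformed by a shared network, the transformed features are averaged across channels, and the average is concatenated back onto every channel, which makes the module invariant to microphone ordering and count. Below is a minimal NumPy sketch of this idea; the layer sizes, ReLU activations, and residual connection are illustrative assumptions rather than the paper's exact FaSNet configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mics, feat_dim, hid_dim = 4, 64, 96          # illustrative sizes

# Shared weights applied identically to every channel -> permutation invariant
W_transform = rng.standard_normal((feat_dim, hid_dim)) * 0.1
W_average   = rng.standard_normal((hid_dim, hid_dim)) * 0.1
W_concat    = rng.standard_normal((2 * hid_dim, feat_dim)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def tac(z):
    """z: (n_mics, feat_dim) per-channel features for one time frame."""
    h = relu(z @ W_transform)                  # transform each channel with shared weights
    g = relu(h.mean(axis=0) @ W_average)       # average across channels (any mic count/order)
    g = np.broadcast_to(g, h.shape)
    out = relu(np.concatenate([h, g], axis=-1) @ W_concat)   # concatenate and project back
    return z + out                             # residual connection back to each channel

features = rng.standard_normal((n_mics, feat_dim))
print(tac(features).shape)                     # (4, 64); works unchanged for any n_mics
```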


Posted Content
TL;DR: A new class of signal injection attacks on microphones by physically converting light to sound is proposed, showing how an attacker can inject arbitrary audio signals to a target microphone by aiming an amplitude-modulated light at the microphone's aperture.
Abstract: We propose a new class of signal injection attacks on microphones by physically converting light to sound. We show how an attacker can inject arbitrary audio signals to a target microphone by aiming an amplitude-modulated light at the microphone's aperture. We then proceed to show how this effect leads to a remote voice-command injection attack on voice-controllable systems. Examining various products that use Amazon's Alexa, Apple's Siri, Facebook's Portal, and Google Assistant, we show how to use light to obtain control over these devices at distances up to 110 meters and from two separate buildings. Next, we show that user authentication on these devices is often lacking, allowing the attacker to use light-injected voice commands to unlock the target's smartlock-protected front doors, open garage doors, shop on e-commerce websites at the target's expense, or even unlock and start various vehicles connected to the target's Google account (e.g., Tesla and Ford). Finally, we conclude with possible software and hardware defenses against our attacks.

74 citations
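The injection mechanism is ordinary amplitude modulation: the audio command modulates the intensity of the light aimed at the microphone port, and the microphone converts the intensity variations back into an audio-band signal. The sketch below only illustrates the modulation step with made-up drive levels; it is not the authors' hardware setup.

```python
import numpy as np

fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
voice = 0.5 * np.sin(2 * np.pi * 440 * t)      # stand-in for a recorded voice command

# Amplitude modulation: the audio waveform rides on the light source's DC drive level,
# so the emitted intensity follows the audio. Values are illustrative only.
dc_bias = 1.0            # normalized DC drive level
mod_depth = 0.4          # modulation index < 1 to avoid clipping the source
light_intensity = dc_bias * (1 + mod_depth * voice)
print(light_intensity.min(), light_intensity.max())
```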


Journal ArticleDOI
TL;DR: In this article, an ultra-high-sensitivity quasi-distributed acoustic sensor based on coherent detection and a cylindrical transducer is proposed and demonstrated, which integrates a series of high-sensitivity sensing units in a single fiber.
Abstract: Highly sensitive distributed acoustic sensors are required in various practical applications. In this article, an ultra-high-sensitivity quasi-distributed acoustic sensor based on coherent detection and a cylindrical transducer is proposed and demonstrated. As the acoustic sensing medium, distributed microstructured optical fiber (DMOF), which contains backscattering enhanced points (BEPs) along the longitudinal direction of the fiber, is utilized to improve the signal-to-noise ratio (SNR) of the signal. In order to increase the acoustic sensitivity, a hollow cylindrical structure is developed for acoustic wave transduction. In addition, coherent phase detection is adopted to achieve high-precision phase signal demodulation and thus realize high-fidelity recovery of the airborne sound wave. Consequently, spatially distributed acoustic sensing can be realized, integrating a series of high-sensitivity sensing units in a single fiber. The field test results of airborne sound detection show an excellent phase sensitivity of −112.5 dB (re 1 rad/μPa) within the flat frequency range of 500 Hz–5 kHz and a peak sensitivity of up to −83.7 dB (re 1 rad/μPa) at 80 Hz. The waveform comparison between the measurement result and the standard signal shows a maximum error of only 3.07%. In addition, distributed audio signal recovery and spatial acoustic imaging are demonstrated, which can be further applied in the fields of fiber-based distributed microphones and urban noise intensity holography.

54 citations
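The reported sensitivities are expressed in dB re 1 rad/μPa, i.e., 20·log10 of the phase response per micropascal. A quick conversion to linear units, using only the figures quoted in the abstract:

```python
def db_re_1rad_per_upa_to_rad_per_pa(level_db):
    """Convert a phase sensitivity given in dB re 1 rad/uPa to rad/Pa."""
    return 10 ** (level_db / 20) * 1e6          # 1 Pa = 1e6 uPa

print(db_re_1rad_per_upa_to_rad_per_pa(-112.5))  # ~2.4 rad/Pa (flat band, 500 Hz - 5 kHz)
print(db_re_1rad_per_upa_to_rad_per_pa(-83.7))   # ~65 rad/Pa  (peak at 80 Hz)
```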


Proceedings ArticleDOI
25 Oct 2020
TL;DR: The database, the challenge, and the baseline system are described, which is based on a ResNet-based deep speaker network with cosine similarity scoring, which achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.
Abstract: The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from single microphone array, far-field text-independent speaker verification from single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three tasks pose a cross-channel challenge to the participants. To simulate the real-life scenario, the enrollment utterances are recorded from close-talk cellphone, while the test utterances are recorded from the far-field microphone arrays. In this paper, we describe the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring. For a given utterance, the speaker embeddings of different channels are equally averaged as the final embedding. The baseline system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.

51 citations
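The scoring back-end described in the abstract is straightforward: channel-wise embeddings are averaged with equal weights and compared against the enrollment embedding by cosine similarity. A minimal NumPy sketch of that step follows; the embedding dimension and the ResNet front-end that produces the embeddings are assumptions, not taken from the paper.

```python
import numpy as np

def cosine_score(enroll_emb, test_embs):
    """enroll_emb: (d,) close-talk embedding; test_embs: (n_channels, d) far-field channels."""
    test_emb = test_embs.mean(axis=0)                          # equal average over channels
    enroll_emb = enroll_emb / np.linalg.norm(enroll_emb)
    test_emb = test_emb / np.linalg.norm(test_emb)
    return float(enroll_emb @ test_emb)                        # cosine similarity score

rng = np.random.default_rng(0)
print(cosine_score(rng.standard_normal(256), rng.standard_normal((4, 256))))
```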


Proceedings ArticleDOI
16 Apr 2020
TL;DR: VoLoc is developed, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection, which reveals the user's location.
Abstract: Voice assistants such as Amazon Echo (Alexa) and Google Home use microphone arrays to estimate the angle of arrival (AoA) of the human voice. This paper focuses on adding user localization as a new capability to voice assistants. For any voice command, we desire Alexa to be able to localize the user inside the home. The core challenge is two-fold: (1) accurately estimating the AoAs of multipath echoes without the knowledge of the source signal, and (2) tracing back these AoAs to reverse triangulate the user's location. We develop VoLoc, a system that proposes an iterative align-and-cancel algorithm for improved multipath AoA estimation, followed by an error-minimization technique to estimate the geometry of a nearby wall reflection. The AoAs and geometric parameters of the nearby wall are then fused to reveal the user's location. Under modest assumptions, we report localization accuracy of 0.44 m across different rooms, clutter, and user/microphone locations. VoLoc runs in near real-time but needs to hear around 15 voice commands before becoming operational.

49 citations
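To see how an AoA pair plus wall geometry can "reverse triangulate" a position, consider a simplified 2-D picture: the microphone sits at the origin, a reflecting wall lies at x = wall_x, and the echo appears to arrive from the mirror image of the source about that wall. The sketch below solves this toy geometry; it only illustrates the principle and is not VoLoc's iterative align-and-cancel or error-minimization algorithm.

```python
import numpy as np

def localize_2d(theta_direct, theta_echo, wall_x):
    """Toy reverse triangulation: mic at origin, reflecting wall at x = wall_x.
    theta_direct: AoA of the direct path; theta_echo: AoA of the wall echo
    (both measured from the x-axis, in radians)."""
    t1, t2 = np.tan(theta_direct), np.tan(theta_echo)
    px = 2 * wall_x * t2 / (t1 + t2)   # intersect direct ray with ray toward the image source
    py = px * t1
    return px, py

# Source actually at (2.0, 1.0) m with the wall 3.0 m from the mic:
print(localize_2d(np.arctan2(1.0, 2.0), np.arctan2(1.0, 4.0), 3.0))   # -> (2.0, 1.0)
```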


Journal ArticleDOI
TL;DR: An active sound control system fitted onto the opening of the domestic window that attenuates the incident sound, achieving a global reduction in the room interior while maintaining natural ventilation is described.
Abstract: Shutting the window is usually the last resort in mitigating environmental noise, at the expense of natural ventilation. We describe an active sound control system fitted onto the opening of a domestic window that attenuates the incident sound, achieving a global reduction in the room interior while maintaining natural ventilation. The incident sound is actively attenuated by an array of control modules (small loudspeakers) distributed optimally across the aperture. A single reference microphone provides advance information for the controller to compute the anti-noise signal input to the loudspeakers in real time. A numerical analysis revealed that the maximum active attenuation potential outperforms the perfect acoustic insulation provided by a fully shut single-glazed window in ideal conditions. To determine the real-world performance of such an active control system, an experimental system was realized in the aperture of a full-sized window installed on a mockup room. Up to 10 dB reduction in energy-averaged sound pressure level was achieved by the active control system in the presence of recorded real-world broadband noise. However, attenuation in the low-frequency range and the maximum power output are limited by the size of the loudspeakers.

47 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: A far-field text-dependent speaker verification database named HI-MIA is presented and a set of end-to-end neural network based baseline systems that adopt single-channel data for training are proposed.
Abstract: This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirement for far-field microphone-array-based speaker verification, since most of the publicly available databases are single-channel, close-talking, and text-independent. The database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by multiple microphone arrays located at different directions and distances from the speaker, as well as by a high-fidelity close-talking microphone. In addition, we propose a set of end-to-end neural network based baseline systems that adopt single-channel data for training. Moreover, we propose a testing-background-aware enrollment augmentation strategy to further enhance the performance. Results show that the fusion systems achieve 3.29% EER in the far-field enrollment and far-field testing task and 4.02% EER in the close-talking enrollment and far-field testing task.

47 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: In this paper, a deep neural network is trained to predict the real and imaginary (RI) components of direct sound from the stacked reverberant (and noisy) RI components of multiple microphones.
Abstract: This study proposes a multi-microphone complex spectral mapping approach for speech dereverberation on a fixed array geometry. In the proposed approach, a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of direct sound from the stacked reverberant (and noisy) RI components of multiple microphones. We also investigate the integration of multi-microphone complex spectral mapping with beamforming and post-filtering. Experimental results on multi-channel speech dereverberation demonstrate the effectiveness of the proposed approach.

46 citations
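Complex spectral mapping simply feeds the network the real and imaginary STFT components of every microphone and asks it to output the RI components of the direct sound at a reference microphone. A minimal sketch of how such an input tensor can be assembled; the frame sizes and array size are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

fs, n_mics, dur = 16_000, 6, 1.0
x = np.random.default_rng(0).standard_normal((n_mics, int(fs * dur)))  # stand-in for a reverberant mixture

# STFT of every microphone; stack real and imaginary parts along the channel axis
_, _, X = stft(x, fs=fs, nperseg=512, noverlap=384)   # X: (n_mics, freq, frames), complex
dnn_input = np.concatenate([X.real, X.imag], axis=0)  # (2 * n_mics, freq, frames)
print(dnn_input.shape)
# A DNN trained on such input predicts the 2 RI maps of the direct sound at a
# reference microphone; the network architecture itself is not specified here.
```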


Proceedings ArticleDOI
21 Apr 2020
TL;DR: A wearable microphone jammer that is capable of disabling microphones in its user's surroundings, including hidden microphones, and which provides stronger privacy in a world in which most devices are constantly eavesdropping on the authors' conversations is engineered.
Abstract: We engineered a wearable microphone jammer that is capable of disabling microphones in its user's surroundings, including hidden microphones. Our device is based on a recent exploit that leverages the fact that when exposed to ultrasonic noise, commodity microphones will leak the noise into the audible range. Unfortunately, ultrasonic jammers are built from multiple transducers and therefore exhibit blind spots, i.e., locations in which transducers destructively interfere and where a microphone cannot be jammed. To solve this, our device exploits a synergy between ultrasonic jamming and the naturally occurring movements that users induce on their wearable devices (e.g., bracelets) as they gesture or walk. We demonstrate that these movements can blur jamming blind spots and increase jamming coverage. Moreover, current jammers are also directional, requiring users to point the jammer at a microphone; instead, our wearable bracelet is built in a ring layout that allows it to jam in multiple directions. This is beneficial in that it allows our jammer to protect against microphones hidden out of sight. We evaluated our jammer in a series of experiments and found that: (1) it jams in all directions, e.g., our device jams over 87% of the words uttered around it in any direction, while existing devices jam only 30% when not pointed directly at the microphone; (2) it exhibits significantly fewer blind spots; and (3) our device induced a feeling of privacy in participants of our user study. We believe our wearable provides stronger privacy in a world in which most devices are constantly eavesdropping on our conversations.

43 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction, and can further reduce the WER by 29% relative using an acoustic model trained on clean and reverberated data.
Abstract: This paper introduces a new method for multi-channel time-domain speech separation in reverberant environments. A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings, with no need for conventional spatial feature extraction. To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied to further improve the separation performance. A spatialized version of the wsj0-2mix dataset has been simulated to evaluate the proposed system. Both the source separation and the speech recognition performance of the separated signals have been evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing to the proposed system can further reduce the WER by 29% relative using an acoustic model trained on clean and reverberated data.

39 citations


Journal ArticleDOI
TL;DR: The effect of room acoustics and background noise on voice parameters appears to be stronger than the type of microphone used for the recording, and an appropriate acoustical clinical space may be more important than the quality of the microphone.

Journal ArticleDOI
TL;DR: A new VS method is proposed, the relative path based VS (RP-VS) method, which estimates both the disturbance signal and the anti-noise signal at the target ZoQ, and an ANC casing is built up with the RP-VS method to reduce a varying broadband fan noise.

Proceedings ArticleDOI
16 Nov 2020
TL;DR: Symphony is the first approach to tackle the problem of concurrently localizing multiple acoustic sources with a smart device with a single microphone array and includes a geometry-based filtering module to distinguish signals from different sources along different paths and a coherence-based module to identify signals from the same source.
Abstract: Sound recognition is an important and popular function of smart devices. The location of sound is basic information associated with the acoustic source. Apart from sound recognition, whether the acoustic sources can be localized largely affects the capability and quality of the smart device's interactive functions. In this work, we study the problem of concurrently localizing multiple acoustic sources with a smart device (e.g., a smart speaker like Amazon Alexa). The existing approaches either can only localize a single source, or require deploying a distributed network of microphone arrays to function. Our proposal called Symphony is the first approach to tackle the above problem with a single microphone array. The insight behind Symphony is that the geometric layout of microphones on the array determines the unique relationship among signals from the same source along the same arriving path, while the source's location determines the DoAs (direction-of-arrival) of signals along different arriving paths. Symphony therefore includes a geometry-based filtering module to distinguish signals from different sources along different paths and a coherence-based module to identify signals from the same source. We implement Symphony with different types of commercial off-the-shelf microphone arrays and evaluate its performance under different settings. The results show that Symphony has a median localization error of 0.694m, which is 68% less than that of the state-of-the-art approach.

Journal ArticleDOI
TL;DR: An all-optical photoacoustic (PA) system for trace gas detection of ethylene (C2H4) in a high-concentration methane (CH4) background has been developed, based on dual light sources and a fiber-optic microphone.
Abstract: An all-optical photoacoustic (PA) system for trace gas detection of ethylene (C2H4) in a high-concentration methane (CH4) background has been developed, based on dual light sources and a fiber-optic microphone (FOM). C2H4 was measured using an infrared thermal radiation emitter. To eliminate the interference of the high-concentration CH4 gas, a near-infrared laser source is used to measure the concentration of CH4 for self-correction. Light from the two sources was incident from opposite ends of the non-resonant PA cell. The concentrations of the two gases were measured by time-division multiplexing. The generated PA pressure signal was detected by a fiber-optic Fabry-Perot microphone and demodulated by a high-speed spectrometer. A lock-in white-light interferometry (WLI) based demodulator was developed for ultra-high-sensitivity detection of the 1f and 2f PA signals. Experimental results showed that the detection limit of C2H4 reached 200 ppb over a range of 0–100% CH4 concentration background.
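Lock-in detection of the 1f and 2f photoacoustic components amounts to multiplying the microphone signal by reference sinusoids at f and 2f and low-pass filtering the products. A minimal numerical sketch with made-up amplitudes and frequencies (not the paper's WLI demodulator) follows.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, f_mod = 20_000, 1_000                 # sample rate and modulation frequency (illustrative)
t = np.arange(0, 0.5, 1 / fs)
rng = np.random.default_rng(0)
pa_signal = 1e-3 * np.sin(2 * np.pi * f_mod * t + 0.3) \
          + 2e-4 * np.sin(2 * np.pi * 2 * f_mod * t + 0.1) + 1e-4 * rng.standard_normal(t.size)

def lock_in(sig, f_ref):
    """Recover the amplitude of the component at f_ref by synchronous demodulation."""
    i = filtfilt(*butter(2, 50 / (fs / 2)), sig * np.cos(2 * np.pi * f_ref * t))
    q = filtfilt(*butter(2, 50 / (fs / 2)), sig * np.sin(2 * np.pi * f_ref * t))
    return 2 * np.hypot(i, q)[t.size // 2]     # steady-state value away from filter edges

print(lock_in(pa_signal, f_mod), lock_in(pa_signal, 2 * f_mod))   # ~1e-3 and ~2e-4
```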

Proceedings Article
01 Jan 2020
TL;DR: In this paper, the authors proposed a new class of signal injection attacks on microphones by physically converting light to sound and showed how an attacker can inject arbitrary audio signals to a target microphone by aiming an amplitude-modulated light at the microphone's aperture.
Abstract: We propose a new class of signal injection attacks on microphones by physically converting light to sound. We show how an attacker can inject arbitrary audio signals to a target microphone by aiming an amplitude-modulated light at the microphone's aperture. We then proceed to show how this effect leads to a remote voice-command injection attack on voice-controllable systems. Examining various products that use Amazon's Alexa, Apple's Siri, Facebook's Portal, and Google Assistant, we show how to use light to obtain control over these devices at distances up to 110 meters and from two separate buildings. Next, we show that user authentication on these devices is often lacking, allowing the attacker to use light-injected voice commands to unlock the target's smartlock-protected front doors, open garage doors, shop on e-commerce websites at the target's expense, or even unlock and start various vehicles connected to the target's Google account (e.g., Tesla and Ford). Finally, we conclude with possible software and hardware defenses against our attacks.

Journal ArticleDOI
TL;DR: In this article, the authors examined the feasibility of designing a capacitive MEMS microphone employing a levitation-based electrode configuration, which could work for large bias voltages without pull-in failure and demonstrated that it is possible to create robust sensors properly working at high DC voltages, which is not feasible for most of the conventional parallel plate electrode-based microscale devices.
Abstract: In this study, we examine the feasibility of designing a MEMS microphone employing a levitation-based electrode configuration. This electrode scheme enables capacitive MEMS sensors that can work at large bias voltages without pull-in failure. Our experiments and simulations indicate that it is possible to create robust sensors that work properly at high DC voltages, which is not feasible for most conventional parallel-plate-electrode-based microscale devices. In addition, the use of larger bias voltages will improve signal-to-noise ratios in MEMS sensors because it increases the signal relative to the noise in the read-out circuits. This study presents the design, fabrication, and testing of a capacitive microphone whose diaphragm is made of approximately 2 μm thick highly doped polysilicon. It has approximately 1 mm² of surface area and incorporates interdigitated sensing electrodes on three of its sides. Right underneath these moving electrodes are fixed fingers held at the same voltage potential as the moving electrodes and separated from them by a 2 μm thick air gap. The electronic output is obtained using a charge amplifier. Measured results obtained on three different microphone chips using bias voltages up to 200 V indicate that pull-in failure is completely avoided. The sensitivity of this initial design was measured to be 16.1 mV/Pa at a 200 V bias voltage, and the bandwidth extended from 100 Hz to 4.9 kHz.

Posted Content
TL;DR: This study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation, and integrates multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation.
Abstract: We propose multi-microphone complex spectral mapping, a simple way of applying deep learning for time-varying non-linear beamforming, for speaker separation in reverberant conditions. We aim at both speaker separation and dereverberation. Our study first investigates offline utterance-wise speaker separation and then extends to block-online continuous speech separation (CSS). Assuming a fixed array geometry between training and testing, we train deep neural networks (DNN) to predict the real and imaginary (RI) components of target speech at a reference microphone from the RI components of multiple microphones. We then integrate multi-microphone complex spectral mapping with minimum variance distortionless response (MVDR) beamforming and post-filtering to further improve separation, and combine it with frame-level speaker counting for block-online CSS. Although our system is trained on simulated room impulse responses (RIR) based on a fixed number of microphones arranged in a given geometry, it generalizes well to a real array with the same geometry. State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
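The MVDR beamformer used in the post-processing stage has the closed form w = R_n^{-1} d / (d^H R_n^{-1} d), which keeps the target direction undistorted while minimizing noise power. Below is a small NumPy sketch of that formula with a synthetic covariance and a placeholder steering vector; how the paper estimates R_n and d from the DNN outputs is not reproduced here.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Classic MVDR beamformer for one frequency bin.
    noise_cov: (M, M) noise spatial covariance; steering: (M,) target steering/RTF vector."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

rng = np.random.default_rng(0)
M = 7
noise = rng.standard_normal((M, 200)) + 1j * rng.standard_normal((M, 200))
noise_cov = noise @ noise.conj().T / 200 + 1e-6 * np.eye(M)   # diagonal loading for stability
d = np.exp(-2j * np.pi * rng.random(M))                       # placeholder steering vector
w = mvdr_weights(noise_cov, d)
print(np.abs(w.conj() @ d))                                   # distortionless response: ~1 toward target
```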

Proceedings ArticleDOI
21 Apr 2020
TL;DR: HomeSound, an in-home sound awareness system for Deaf and hard of hearing (DHH) users, consists of a microphone and display, and uses multiple devices installed in each home, similar to the Echo Show or Nest Hub.
Abstract: We introduce HomeSound, an in-home sound awareness system for Deaf and hard of hearing (DHH) users. Similar to the Echo Show or Nest Hub, HomeSound consists of a microphone and display, and uses multiple devices installed in each home. We iteratively developed two prototypes, both of which sense and visualize sound information in real-time. Prototype 1 provided a floorplan view of sound occurrences with waveform histories depicting loudness and pitch. A three-week deployment in four DHH homes showed an increase in participants' home- and self-awareness but also uncovered challenges due to lack of line of sight and sound classification. For Prototype 2, we added automatic sound classification and smartwatch support for wearable alerts. A second field deployment in four homes showed further increases in awareness but misclassifications and constant watch vibrations were not well received. We discuss findings related to awareness, privacy, and display placement and implications for future home sound awareness technology.

Journal ArticleDOI
TL;DR: In this paper, a nonlinear active control framework for low-frequency sound absorption is proposed, which combines linear feedforward control on front pressure through a first microphone located at the front face of the loudspeaker and nonlinear feedback control on the membrane displacement estimated through the measurement of the pressure inside the back cavity with a second microphone located in the enclosure.
Abstract: The absorption of airborne noise at frequencies below 300 Hz is a particularly vexing problem due to the absence of natural sound-absorbing materials at these frequencies. The prevailing solution for low-frequency sound absorption is the use of passive narrow-band resonators, the absorption level and bandwidth of which can be further enhanced using nonlinear effects. However, these effects are typically triggered at high intensity levels, without much control over the form of the nonlinear absorption mechanism. In this study, we propose, implement, and experimentally demonstrate a nonlinear active control framework on an electroacoustic resonator prototype, allowing for unprecedented control over the form of nonlinearity and arbitrarily low intensity thresholds. More specifically, the proposed architecture combines linear feedforward control on the front pressure through a first microphone located at the front face of the loudspeaker and nonlinear feedback control on the membrane displacement estimated through the measurement of the pressure inside the back cavity with a second microphone located in the enclosure. It is experimentally shown that even at a weak excitation level, it is possible to observe and control the nonlinear behavior of the system. Taking the cubic nonlinearity as an example, we demonstrate numerically and experimentally that in the low-frequency range (50–500 Hz), the nonlinear control law allows improvement of the absorption performance, i.e., enlarging the bandwidth of optimal sound absorption while increasing the maximal absorption coefficient value and producing only a negligible amount of nonlinear distortion. The reported experimental methodology can be extended to implement various types of hybrid linear and/or nonlinear controls, thus opening new avenues for managing wave nonlinearity and achieving nontrivial wave phenomena.

Journal ArticleDOI
28 Aug 2020-Sensors
TL;DR: The software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (CoNNs), and an analysis of the detection and tracking performance for remotely piloted aircraft systems (RPASs), measured with a dedicated spiral microphone array with MEMS microphones, is presented.
Abstract: The purpose of this paper is to investigate the possibility of developing and using an intelligent, flexible, and reliable acoustic system designed to discover, locate, and transmit the position of unmanned aerial vehicles (UAVs). Such an application is very useful for monitoring sensitive areas and land territories subject to privacy. The software functional components of the proposed detection and location algorithm were developed employing acoustic signal analysis and concurrent neural networks (CoNNs). An analysis of the detection and tracking performance for remotely piloted aircraft systems (RPASs), measured with a dedicated spiral microphone array with MEMS microphones, was also performed. The detection and tracking algorithms were implemented based on spectrogram decomposition and adaptive filters. In this research, Cohen-class spectrogram decompositions, log-Mel spectrograms, harmonic-percussive source separation, and raw audio waveforms of the samples collected from the spiral microphone array were used as inputs to the concurrent neural networks in order to determine and classify the number of detected drones in the perimeter of interest.
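One of the input representations mentioned, the log-Mel spectrogram, can be computed with standard tooling; the parameters below (FFT size, hop length, 64 Mel bands) are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
import librosa

sr = 44_100
y = np.random.default_rng(0).standard_normal(sr * 2).astype(np.float32)  # stand-in for a UAV recording

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)      # (64, frames) image fed to a CNN classifier
print(log_mel.shape)
```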

Journal ArticleDOI
TL;DR: This article defines a directivity pattern that can achieve a continuous compromise between the pattern corresponding to the maximum DMA order and the omnidirectional pattern and shows how to determine analytically the proper fractional order of the DMA with a given target beampattern when either the value of the DF or WNG is specified.
Abstract: Differential microphone arrays (DMAs) often encounter white noise amplification, especially at low frequencies. If the array geometry and the number of microphones are fixed, one can improve the white noise amplification problem by reducing the DMA order. With the existing differential beamforming methods, the DMA order can only be a positive integer number. Consequently, with a specified beampattern (or a kind of beampattern), reducing this order may easily lead to over compensation of the white noise gain (WNG) and too much reduction of the directivity factor (DF), which is not optimal. To deal with this problem, we present in this article a general approach to the design of DMAs with fractional orders. The major contributions of this article include but are not limited to: 1) we first define a directivity pattern that can achieve a continuous compromise between the pattern corresponding to the maximum DMA order and the omnidirectional pattern; 2) by approximating the beamformer's beampattern with the Jacobi-Anger expansion, we present a method to find the proper differential beamforming filter so that its beampattern matches closely the target directivity pattern of fractional orders; and 3) we show how to determine analytically the proper fractional order of the DMA with a given target beampattern when either the value of the DF or WNG is specified, which is useful in practice to achieve the desired beampattern and spatial gain while maintaining the robustness of the DMA system.
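For reference, the Jacobi-Anger expansion that underlies the beampattern approximation expands a plane-wave term into cylindrical harmonics:

\[
e^{\,\mathrm{j} x \cos\theta} \;=\; \sum_{n=-\infty}^{\infty} \mathrm{j}^{\,n}\, J_n(x)\, e^{\,\mathrm{j} n \theta},
\]

where J_n is the Bessel function of the first kind of order n. Truncating the sum and matching coefficients against the target directivity pattern is what yields the differential beamforming filter described above.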

Journal ArticleDOI
TL;DR: This paper proposes an improved empirical wavelet transform strategy for compound weak bearing fault diagnosis with acoustic signals; the new fault diagnosis scheme, which utilizes DEWT and SVD, is compared with traditional methods.
Abstract: Most of the current research on the diagnosis of rolling bearing faults is based on vibration signals. However, the location and number of sensors are often limited in some special cases. Thus, a small number of non-contact microphone sensors are a suboptimal choice, but it will result in some problems, e.g., underdetermined compound fault detection from a low signal-to-noise ratio (SNR) acoustic signal. Empirical wavelet transform (EWT) is a signal processing algorithm that has a dimension-increasing characteristic, and is beneficial for solving the underdetermined problem with few microphone sensors. However, there remain some critical problems to be solved for EWT, especially the determination of signal mode numbers, high-frequency modulation and boundary detection. To solve these problems, this paper proposes an improved empirical wavelet transform strategy for compound weak bearing fault diagnosis with acoustic signals. First, a novel envelope demodulation-based EWT (DEWT) is developed to overcome the high frequency modulation, based on which a source number estimation method with singular value decomposition (SVD) is then presented for the extraction of the correct boundary from a low SNR acoustic signal. Finally, the new fault diagnosis scheme that utilizes DEWT and SVD is compared with traditional methods, and the advantages of the proposed method in weak bearing compound fault diagnosis with a single-channel, low SNR, variable speed acoustic signal, are verified.
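The envelope-demodulation step that DEWT adds before boundary detection is commonly realized with the Hilbert transform: the magnitude of the analytic signal strips the high-frequency carrier and exposes the fault repetition rate. A small synthetic sketch follows; the carrier and fault frequencies are made up for illustration.

```python
import numpy as np
from scipy.signal import hilbert

fs = 20_000
t = np.arange(0, 0.2, 1 / fs)
# Stand-in for a resonance-band acoustic signal: a 3 kHz carrier modulated at a 120 Hz fault rate
sig = (1 + 0.8 * np.cos(2 * np.pi * 120 * t)) * np.sin(2 * np.pi * 3000 * t)

envelope = np.abs(hilbert(sig))                  # demodulate before boundary detection
spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
freqs = np.fft.rfftfreq(envelope.size, 1 / fs)
print(freqs[spectrum.argmax()])                  # ~120 Hz, the fault characteristic frequency
```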

Posted Content
TL;DR: Experimental results on multi-channel speech dereverberation demonstrate the effectiveness of the proposed approach and the integration of multi-microphone complex spectral mapping with beamforming and post-filtering is investigated.
Abstract: This study proposes a multi-microphone complex spectral mapping approach for speech dereverberation on a fixed array geometry. In the proposed approach, a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of direct sound from the stacked reverberant (and noisy) RI components of multiple microphones. We also investigate the integration of multi-microphone complex spectral mapping with beamforming and post-filtering. Experimental results on multi-channel speech dereverberation demonstrate the effectiveness of the proposed approach.

Journal ArticleDOI
TL;DR: A listening experiment evaluating the perceptual improvements of binaural rendering of undersampled SMA data that can be achieved using state-of-the-art mitigation approaches found that most mitigation approaches lead to significant perceptual improvements, even though audible differences to the reference remain.
Abstract: Spherical microphone arrays (SMAs) are widely used to capture spatial sound fields that can then be rendered in various ways as a virtual acoustic environment (VAE) including headphone-based binaural synthesis. Several practical limitations have a significant impact on the fidelity of the rendered VAE. The finite number of microphones of SMAs leads to spatial undersampling of the captured sound field, which, on the one hand, induces spatial aliasing artifacts and, on the other hand, limits the order of the spherical harmonics (SH) representation. Several approaches have been presented in the literature that aim to mitigate the perceptual impairments due to these limitations. In this article, we present a listening experiment evaluating the perceptual improvements of binaural rendering of undersampled SMA data that can be achieved using state-of-the-art mitigation approaches. In particular, we examined the Magnitude Least-Squares algorithm, the Bandwidth Extraction Algorithm for Microphone Arrays, Spherical Head Filters, SH Tapering, and a newly proposed equalization filter. In the experiment, subjects rated the perceived differences between a dummy head and the corresponding SMA auralization. We found that most mitigation approaches lead to significant perceptual improvements, even though audible differences to the reference remain.
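The undersampling limitation discussed here is often summarized by the rule of thumb kr ≈ N: an order-N array of radius r represents the sound field reliably only up to roughly

\[
f_{\mathrm{A}} \;\approx\; \frac{N c}{2 \pi r},
\]

so, for example, a fourth-order array of 4.2 cm radius aliases above about 5 kHz. This figure is a generic estimate, not a value from the article.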

Journal ArticleDOI
TL;DR: A novel MUSIC framework for multiple sound source localization (range, elevation, azimuth) in reverberant rooms by incorporating a recently proposed region-to-region room transfer model.
Abstract: This work presents a method that persuades acoustic reflections to be a favorable property for sound source localization. Whilst most real world spatial audio applications utilize prior knowledge of sound source position, estimating such positions in reverberant environments is still considered to be a difficult problem due to acoustic reflections. This article presents a novel MUSIC framework for multiple sound source localization (range, elevation, azimuth) in reverberant rooms by incorporating a recently proposed region-to-region room transfer model. The method is built upon the received signals of a higher order microphone and a spherical harmonic representation of the room transfer function. We demonstrate the method's general applicability and multiple source localization performance through a simulation study across an assortment of reverberant conditions. Additionally, we investigate robustness against various system modeling errors to gauge implementation viability. Finally, we prove the method in a practical experiment inside a real-world room with measured region-to-region transfer function parameters.
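As background, classic narrowband MUSIC projects candidate steering vectors onto the noise subspace of the array covariance and reads source directions off the peaks of the resulting pseudospectrum. The free-field sketch below illustrates that core step; the article replaces the free-field steering model with a spherical-harmonic, region-to-region room transfer representation, which is not reproduced here.

```python
import numpy as np

def music_spectrum(R, steering_vectors, n_sources):
    """Free-field narrowband MUSIC pseudospectrum (simplified illustration).
    R: (M, M) covariance; steering_vectors: (n_angles, M)."""
    eigvals, eigvecs = np.linalg.eigh(R)
    En = eigvecs[:, : R.shape[0] - n_sources]             # noise subspace (smallest eigenvalues)
    proj = steering_vectors.conj() @ En                    # (n_angles, M - n_sources)
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=1)         # peaks indicate source directions

# Example: 8-mic half-wavelength line array, one source at 30 degrees
M, angle = 8, np.deg2rad(30)
a = np.exp(1j * np.pi * np.arange(M) * np.sin(angle))
R = np.outer(a, a.conj()) + 0.01 * np.eye(M)
grid = np.deg2rad(np.arange(-90, 91))
A = np.exp(1j * np.pi * np.arange(M)[None, :] * np.sin(grid)[:, None])
print(np.rad2deg(grid[np.argmax(music_spectrum(R, A, n_sources=1))]))   # ~30
```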

Journal ArticleDOI
22 Jan 2020-Sensors
TL;DR: A design for a versatile electronic device to measure outdoor noise is presented; the device is connected to a commercial microprocessor board, inserted into the infrastructure of an existing outdoor monitoring network, and verified to meet requirements similar to those of type 2 instruments for measuring outdoor noise.
Abstract: Presently, large cities have significant problems with noise pollution due to human activity. Transportation, economic activities, and leisure activities have an important impact on noise pollution. Acoustic noise monitoring must be done with equipment of high quality. Thus, long-term noise monitoring is a high-cost activity for administrations. For this reason, new alternative technological solutions are being used to reduce the cost of measurement instruments. This article presents a design for a versatile electronic device to measure outdoor noise. The device has been designed according to the technical standards for this type of instrument, which impose strict requirements on both the design and the quality of the device's measurements. The instrument has been designed under the original equipment manufacturer (OEM) concept, so the microphone–electronics set can be used as a sensor that can be connected to any microprocessor-based device and therefore easily attached to a monitoring network. To validate the instrument's design, the device has been tested following the regulations of the calibration laboratories for sound level meters (SLM). These tests allowed us to evaluate the behavior of the electronics and the microphone, obtaining different results for these two elements. The results show that the electronics and algorithms implemented fully meet the requirements of type 1 noise measurement instruments. However, the use of an electret microphone reduces the technical features of the designed instrument, which can then only fully meet the requirements of type 2 noise measurement instruments. This situation shows that the microphone is a key element in this kind of instrument and an important part of the overall price. To test the instrument's quality and show how it can be used for monitoring noise in smart wireless acoustic sensor networks, the designed equipment was connected to a commercial microprocessor board and inserted into the infrastructure of an existing outdoor monitoring network. This allowed us to deploy a low-cost sub-network in the city of Malaga (Spain) to analyze noise in areas of conflict due to high levels of leisure noise. The results obtained with this equipment are also shown. It has been verified that this equipment meets requirements similar to those of type 2 instruments for measuring outdoor noise. The designed equipment is a two-channel instrument that simultaneously measures, in real time, 86 sound noise parameters for each channel, such as the equivalent continuous sound level (Leq) (with Z, C, and A frequency weighting); the peak level (with Z, C, and A frequency weighting); the maximum and minimum levels (with Z, C, and A frequency weighting, and with impulse, fast, and slow time weighting); seven percentiles (1%, 5%, 10%, 50%, 90%, 95%, and 99%); as well as continuous equivalent sound pressure levels in the one-third-octave and octave frequency bands.
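The central quantity reported by such an instrument, the equivalent continuous sound pressure level Leq, is simply the dB level of the mean-square pressure referenced to 20 μPa; percentile levels are then statistics of the short-term levels. A minimal sketch with a synthetic, already-calibrated pressure trace (frequency weighting and one-third-octave filtering are omitted):

```python
import numpy as np

P_REF = 20e-6                                   # 20 micropascals, reference sound pressure

def leq_db(pressure, fs, block_s=1.0):
    """Per-block and overall equivalent continuous levels from a calibrated pressure trace in Pa.
    (A, C, or Z frequency weighting would be applied before this step.)"""
    n = int(fs * block_s)
    blocks = pressure[: pressure.size // n * n].reshape(-1, n)
    block_leq = 10 * np.log10(np.mean(blocks ** 2, axis=1) / P_REF ** 2)
    overall = 10 * np.log10(np.mean(pressure ** 2) / P_REF ** 2)
    return block_leq, overall

fs = 48_000
p = 0.02 * np.random.default_rng(0).standard_normal(fs * 10)   # synthetic 10 s pressure signal (~60 dB)
per_second, total = leq_db(p, fs)
print(total, np.percentile(per_second, [1, 5, 10, 50, 90, 95, 99]))   # overall Leq and level statistics
```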

Journal ArticleDOI
Xuecong Sun, Han Jia, Zhe Zhang, Yuzhen Yang, Zhaoyong Sun, Jun Yang
TL;DR: The proposed metamaterial-based single-sensor listening system opens a new way of sound localization and separation, which can be applied to intelligent scene monitoring and robot audition, and the designed system can also be applied in source identification and tracking.
Abstract: Conventional approaches to sound localization and separation are based on microphone arrays in artificial systems. Inspired by the selective perception of the human auditory system, a multisource listening system which can separate simultaneous overlapping sounds and localize the sound sources in 3D space, using only a single microphone with a metamaterial enclosure is designed. The enclosure modifies the frequency response of the microphone in a direction-dependent manner by giving each direction a characteristic signature. Thus, the information about the location and the audio content of sound sources can be experimentally reconstructed from the modulated mixed signals using a compressive sensing algorithm. Due to the low computational complexity of the proposed reconstruction algorithm, the designed system can also be applied in source identification and tracking. The effectiveness of the system in multiple real-life scenarios is evaluated through multiple random listening tests. The proposed metamaterial-based single-sensor listening system opens a new way of sound localization and separation, which can be applied to intelligent scene monitoring and robot audition.

Journal ArticleDOI
20 Oct 2020-PeerJ
TL;DR: Microphone signal-to-noise ratio is a crucial characteristic of a sound recording system, positively affecting the acoustic sampling performance of birds and bats and should be maximised by choosing appropriate microphones, and be quantified independently, especially in the ultrasound range.
Abstract: Background Automated sound recorders are a popular sampling tool in ecology. However, the microphones themselves received little attention so far, and specifications that determine the recordings' sound quality are seldom mentioned. Here, we demonstrate the importance of microphone signal-to-noise ratio for sampling sonant animals. Methods We tested 12 different microphone models in the field and measured their signal-to-noise ratios and detection ranges. We also measured the vocalisation activity of birds and bats that they recorded, the bird species richness, the bat call types richness, as well as the performance of automated detection of bird and bat calls. We tested the relationship of each one of these measures with signal-to-noise ratio in statistical models. Results Microphone signal-to-noise ratio positively affects the sound detection space areas, which increased by a factor of 1.7 for audible sound, and 10 for ultrasound, from the lowest to the highest signal-to-noise ratio microphone. Consequently, the sampled vocalisation activity increased by a factor of 1.6 for birds, and 9.7 for bats. Correspondingly, the species pool of birds and bats could not be completely detected by the microphones with lower signal-to-noise ratio. The performance of automated detection of bird and bat calls, as measured by its precision and recall, increased significantly with microphone signal-to-noise ratio. Discussion Microphone signal-to-noise ratio is a crucial characteristic of a sound recording system, positively affecting the acoustic sampling performance of birds and bats. It should be maximised by choosing appropriate microphones, and be quantified independently, especially in the ultrasound range.

Journal ArticleDOI
TL;DR: The design of the spatial filter is based on a recently proposed frequency-domain design methodology that approximates, in a least-square sense, a target beampattern using the Jacobi-Anger expansion involving Bessel functions, and it is shown that this approximation allows an efficient discrete-time-domain implementation of first-order steerable differential beamformers based on arrays with arbitrary geometries.
Abstract: We present a spatial filtering approach to first-order steerable Differential Microphone Arrays (DMAs) with arbitrary planar geometry. In particular, the design of the spatial filter is based on a recently proposed frequency-domain design methodology that approximates, in a least-square sense, a target beampattern using the Jacobi-Anger expansion involving Bessel functions. Despite the generality of that approach, however, its computational cost turns out to be excessive when working with limited processing resources. The beamforming technique proposed in this manuscript overcomes this issue by exploiting the fact that in DMAs the spacing between sensors is typically smaller than the smallest wavelength of audio signals of interest. This allows us to substitute zero- and first-order Bessel functions with their Taylor series approximation truncated to the first order. Moreover, we show that this approximation allows us to derive an efficient discrete-time-domain implementation of first-order steerable differential beamformers based on arrays with arbitrary geometries.
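The small-argument expansions that justify this simplification are

\[
J_0(x) = 1 - \frac{x^2}{4} + O(x^4), \qquad J_1(x) = \frac{x}{2} - \frac{x^3}{16} + O(x^5),
\]

so when the microphone spacing is much smaller than the wavelength (x = kd ≪ 1), keeping only the leading terms introduces little error while removing the need to evaluate Bessel functions at run time.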

Proceedings ArticleDOI
TL;DR: Mic2Mic, as discussed by the authors, is a machine-learned system component that resides in the inference pipeline of audio models and, in real time, reduces the variability in audio data caused by microphone-specific factors.
Abstract: Mobile and embedded devices are increasingly using microphones and audio-based computational models to infer user context. A major challenge in building systems that combine audio models with commodity microphones is guaranteeing their accuracy and robustness in the real world. Besides many environmental dynamics, a primary factor that impacts the robustness of audio models is microphone variability. In this work, we propose Mic2Mic, a machine-learned system component which resides in the inference pipeline of audio models and, in real time, reduces the variability in audio data caused by microphone-specific factors. Two key considerations for the design of Mic2Mic were: a) to decouple the problem of microphone variability from the audio task, and b) to put a minimal burden on end-users to provide training data. With these in mind, we apply the principles of cycle-consistent generative adversarial networks (CycleGANs) to learn Mic2Mic using unlabeled and unpaired data collected from different microphones. Our experiments show that Mic2Mic can recover between 66% and 89% of the accuracy lost due to microphone variability for two common audio tasks.
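The CycleGAN principle that Mic2Mic builds on trains two generators, G_AB and G_BA, between the two microphones' data domains and constrains them with a cycle-consistency loss so that unpaired recordings suffice:

\[
\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{x \sim A}\!\left[\lVert G_{BA}(G_{AB}(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim B}\!\left[\lVert G_{AB}(G_{BA}(y)) - y \rVert_1\right].
\]

This is the standard CycleGAN formulation; the exact losses and feature representation used by Mic2Mic are not detailed in the abstract.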