
Showing papers on "Audio signal processing published in 2012"


Journal ArticleDOI
TL;DR: This paper introduces a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints.
Abstract: Most audio source separation methods are developed for a particular scenario characterized by the number of sources and channels and the characteristics of the sources and the mixing process. In this paper, we introduce a general audio source separation framework based on a library of structured source models that enable the incorporation of prior knowledge about each source via user-specifiable constraints. While this framework generalizes several existing audio source separation methods, it also makes it possible to imagine and implement new, efficient methods not yet reported in the literature. We first introduce the framework by describing the model structure and constraints, explaining its generality, and summarizing its algorithmic implementation using a generalized expectation-maximization algorithm. Finally, we illustrate the above-mentioned capabilities of the framework by applying it in several new and existing configurations to different source separation problems. We have released a software tool named the Flexible Audio Source Separation Toolbox (FASST), implementing a baseline version of the framework in Matlab.
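The released FASST tool is in Matlab; as a loose illustration of what one instance of such a framework does, the Python sketch below fits one small NMF model per source to a two-source mixture's power spectrogram (multiplicative KL updates) and separates by Wiener-like soft masking. All model sizes, update rules, and names here are illustrative assumptions, not FASST's actual structured models or EM algorithm.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_two_sources(x, fs, n_comp=(8, 8), n_iter=100, seed=0):
    """Toy structured-model separation: one NMF model per source fitted to
    the mixture power spectrogram, followed by Wiener-like soft masking."""
    rng = np.random.default_rng(seed)
    f, t, X = stft(x, fs, nperseg=1024)
    V = np.abs(X) ** 2 + 1e-12                       # mixture power spectrogram
    # One nonnegative dictionary (W) and activation (H) pair per source.
    Ws = [rng.random((V.shape[0], k)) + 0.1 for k in n_comp]
    Hs = [rng.random((k, V.shape[1])) + 0.1 for k in n_comp]
    for _ in range(n_iter):
        Vhat = sum(W @ H for W, H in zip(Ws, Hs)) + 1e-12
        for W, H in zip(Ws, Hs):
            # Multiplicative updates for the KL divergence; the actual
            # framework uses Itakura-Saito-based EM updates.
            H *= (W.T @ (V / Vhat)) / (W.T @ np.ones_like(V) + 1e-12)
            Vhat = sum(Wi @ Hi for Wi, Hi in zip(Ws, Hs)) + 1e-12
            W *= ((V / Vhat) @ H.T) / (np.ones_like(V) @ H.T + 1e-12)
    Vhat = sum(W @ H for W, H in zip(Ws, Hs)) + 1e-12
    outs = []
    for W, H in zip(Ws, Hs):
        mask = (W @ H) / Vhat                        # Wiener-like soft mask
        _, s = istft(mask * X, fs, nperseg=1024)
        outs.append(s)
    return outs
```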

296 citations


Patent
27 Jun 2012
TL;DR: In this article, the authors describe an adaptive audio system that processes audio data comprising a number of independent monophonic audio streams, which are associated with metadata that specifies whether the stream is a channel-based or object-based stream.
Abstract: Embodiments are described for an adaptive audio system that processes audio data comprising a number of independent monophonic audio streams. One or more of the streams has associated with it metadata that specifies whether the stream is a channel-based or object-based stream. Channel-based streams have rendering information encoded by means of channel name; and the object-based streams have location information encoded through location expressions encoded in the associated metadata. A codec packages the independent audio streams into a single serial bitstream that contains all of the audio data. This configuration allows for the sound to be rendered according to an allocentric frame of reference, in which the rendering location of a sound is based on the characteristics of the playback environment (e.g., room size, shape, etc.) to correspond to the mixer's intent. The object position metadata contains the appropriate allocentric frame of reference information required to play the sound correctly using the available speaker positions in a room that is set up to play the adaptive audio content.
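As a rough sketch of what rendering an object-based stream from positional metadata can look like (a generic distance-based panning illustration in room coordinates, not the patent's allocentric rendering algorithm; all names are hypothetical):

```python
import numpy as np

def render_object(sample, obj_xyz, speaker_xyz, power=2.0):
    """Distribute one object-based audio sample over the available speakers
    using distance-based amplitude panning in room (allocentric) coordinates."""
    d = np.linalg.norm(speaker_xyz - obj_xyz, axis=1) + 1e-6
    gains = 1.0 / d ** power                         # closer speakers get more
    gains /= np.linalg.norm(gains)                   # preserve overall power
    return sample * gains                            # one output per speaker
```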

231 citations


Journal ArticleDOI
TL;DR: The audio inpainting framework that recovers portions of audio data distorted due to impairments such as impulsive noise, clipping, and packet loss is proposed and this approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.
Abstract: We propose an audio inpainting framework that recovers portions of audio data distorted by impairments such as impulsive noise, clipping, and packet loss. In this framework, the distorted data are treated as missing and their location is assumed to be known. The signal is decomposed into overlapping time-domain frames and the restoration problem is then formulated as an inverse problem per audio frame. Sparse representation modeling is employed per frame, and each inverse problem is solved using the Orthogonal Matching Pursuit algorithm together with a discrete cosine or a Gabor dictionary. The Signal-to-Noise Ratio performance of this algorithm is shown to be comparable to or better than that of state-of-the-art methods when blocks of samples of variable durations are missing. We also demonstrate that the size of the block of missing samples, rather than the overall number of missing samples, is a crucial parameter for high-quality signal restoration. We further introduce a constrained Matching Pursuit approach for the special case of audio declipping that exploits the sign pattern of clipped audio samples and their maximal absolute value, as well as allowing the user to specify the maximum amplitude of the signal. This approach is shown to outperform state-of-the-art and commercially available methods for audio declipping in terms of Signal-to-Noise Ratio.
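A minimal sketch of the per-frame recovery idea, assuming a square DCT synthesis dictionary and a plain Orthogonal Matching Pursuit solver; the paper's Gabor dictionary, overlap-add reconstruction, and declipping constraints are omitted.

```python
import numpy as np
from scipy.fft import idct

def omp(A, y, n_nonzero, tol=1e-6):
    """Orthogonal Matching Pursuit: greedily pick atoms of A to explain y."""
    residual, support, coefs = y.copy(), [], None
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(A.T @ residual)))   # most correlated atom
        if k not in support:
            support.append(k)
        coefs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coefs
        if np.linalg.norm(residual) < tol:
            break
    x = np.zeros(A.shape[1])
    x[support] = coefs
    return x

def inpaint_frame(frame, known_mask, n_nonzero=32):
    """Recover missing samples of one frame via a sparse DCT model.
    `known_mask` is True where samples are reliable, False where missing."""
    n = len(frame)
    D = idct(np.eye(n), norm='ortho', axis=0)        # DCT synthesis dictionary
    A = D[known_mask]                                # rows at the known samples
    coefs = omp(A, frame[known_mask], n_nonzero)
    estimate = D @ coefs
    out = frame.copy()
    out[~known_mask] = estimate[~known_mask]         # fill only the gaps
    return out
```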

229 citations


Journal ArticleDOI
TL;DR: The Gammatone cepstral coefficients (GTCCs), which have been previously employed in the field of speech research, are adapted for non-speech audio classification purposes and are more effective than MFCC in representing the spectral characteristics of non- speech audio signals, especially at low frequencies.
Abstract: In the context of non-speech audio recognition and classification for multimedia applications, it becomes essential to have a set of features able to accurately represent and discriminate among audio signals. Mel frequency cepstral coefficients (MFCC) have become a de facto standard for audio parameterization. Taking as a basis the MFCC computation scheme, the Gammatone cepstral coefficients (GTCCs) are a biologically inspired modification employing Gammatone filters with equivalent rectangular bandwidth bands. In this letter, the GTCCs, which have been previously employed in the field of speech research, are adapted for non-speech audio classification purposes. Their performance is evaluated on two audio corpora of 4 h each (general sounds and audio scenes), following two cross-validation schemes and four machine learning methods. According to the results, classification accuracies are significantly higher when employing GTCC rather than other state-of-the-art audio features. As a detailed analysis shows, with a similar computational cost, the GTCC are more effective than MFCC in representing the spectral characteristics of non-speech audio signals, especially at low frequencies.
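A hedged sketch of a GTCC-style pipeline (gammatone filterbank energies, then log, then DCT), computing one feature vector per signal for brevity; the ERB constants are the standard Glasberg-Moore values, but the filter construction, spacing, and normalization here are simplifications of the letter's exact scheme.

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import fftconvolve

def gtcc(x, fs, n_filters=32, n_coeffs=13, fmin=50.0):
    """GTCC-style features: gammatone filterbank energies -> log -> DCT.
    ERB-spaced center frequencies; 4th-order gammatone impulse responses."""
    erb = lambda f: 24.7 * (4.37 * f / 1000.0 + 1.0)
    to_erbs = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    from_erbs = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    cfs = from_erbs(np.linspace(to_erbs(fmin), to_erbs(0.45 * fs), n_filters))
    t = np.arange(int(0.05 * fs)) / fs               # 50 ms impulse responses
    energies = []
    for fc in cfs:
        b = 1.019 * erb(fc)                          # bandwidth per Patterson
        g = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        g /= np.sqrt(np.sum(g**2) + 1e-12)           # unit-energy filter
        y = fftconvolve(x, g, mode='same')
        energies.append(np.log(np.mean(y**2) + 1e-12))
    return dct(np.array(energies), norm='ortho')[:n_coeffs]
```

In practice the signal would be framed first and one such vector computed per frame, exactly as with MFCC.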

209 citations


Book
14 Aug 2012
TL;DR: This book provides quick access to different analysis algorithms and allows comparison between different approaches to the same task, making it useful for newcomers to audio signal processing and industry experts alike.
Abstract: With the proliferation of digital audio distribution over digital media, audio content analysis is fast becoming a requirement for designers of intelligent signal-adaptive audio processing systems. Written by a well-known expert in the field, this book provides quick access to different analysis algorithms and allows comparison between different approaches to the same task, making it useful for newcomers to audio signal processing and industry experts alike. A review of relevant fundamentals in audio signal processing, psychoacoustics, and music theory, as well as downloadable MATLAB files are also included. Please visit the companion website: www.AudioContentAnalysis.org

184 citations


Book
07 Apr 2012
TL;DR: In this article, the authors present an introduction to the principles of the fast Fourier transform (FFT) and its applications in video and audio signal processing, including frequency domain filtering.
Abstract: This book presents an introduction to the principles of the fast Fourier transform (FFT). It covers FFTs, frequency domain filtering, and applications to video and audio signal processing. As fields like communications, speech and image processing, and related areas develop rapidly, the FFT, as one of the essential tools of digital signal processing, has come into widespread use, and there is a pressing need from instructors and students for a book dealing with the latest FFT topics. This book provides a thorough and detailed explanation of important and up-to-date FFTs, and adopts modern approaches such as MATLAB examples and projects for a better understanding of diverse FFTs.
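For instance, the simplest kind of frequency-domain filtering the book covers, zeroing FFT bins above a cutoff, fits in a few lines (a generic illustration; the book's own examples are in MATLAB):

```python
import numpy as np

def fft_lowpass(x, fs, cutoff_hz):
    """Zero out FFT bins above a cutoff: the simplest frequency-domain filter.
    (Brick-wall filtering causes ringing; windowed designs behave better.)"""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(X, n=len(x))

# Example: remove a 3 kHz tone from a 440 Hz + 3 kHz mixture.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
y = fft_lowpass(x, fs, cutoff_hz=1000.0)
```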

157 citations


Journal ArticleDOI
TL;DR: The tradeoff between performance and array effort can be formulated as a regularization problem; various such formulations have been suggested, and the connection between them is discussed. Designing an array that is robust to variations in the acoustic environment and in driver sensitivity and position leads to a generalization of regularization.
Abstract: As well as being able to reproduce sound in one region of space, it would be useful to reduce the level of reproduced sound in other spatial regions, with a “personal audio” system. For mobile devices this is motivated by issues of privacy for the user and the need to reduce annoyance for other people nearby. Such personal audio systems can be realized with arrays of loudspeakers that become superdirectional at low frequencies, when the array dimensions are small compared with the acoustic wavelength. The design of the array then becomes a compromise between performance and array effort, defined as the sum of mean squared driving signals. Various methods of formulating this tradeoff as a regularization problem have been suggested and the connection between these formulations is discussed. Large array efforts are due to strongly self-cancelling multipole arrays. A concern is then the robustness of such an array to variations in the acoustic environment and driver sensitivity and position. The design of an array that is robust to these uncertainties then leads to a generalization of regularization.
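The performance-versus-effort compromise described here is commonly posed as Tikhonov-regularized least squares: choose driving signals q minimizing ||p - Gq||^2 + lambda * ||q||^2, where G holds the acoustic transfer functions and lambda penalizes array effort. A sketch under synthetic, assumed transfer functions:

```python
import numpy as np

def regularized_drives(G, p_target, lam):
    """Tikhonov-regularized least squares for array driving signals:
    q = (G^H G + lam * I)^{-1} G^H p_target.
    Larger lam lowers the array effort ||q||^2 at the cost of error."""
    n = G.shape[1]
    return np.linalg.solve(G.conj().T @ G + lam * np.eye(n),
                           G.conj().T @ p_target)

# Synthetic example: 8 loudspeakers, 40 control points (values illustrative).
rng = np.random.default_rng(0)
G = rng.standard_normal((40, 8)) + 1j * rng.standard_normal((40, 8))
p = np.zeros(40, dtype=complex)
p[:10] = 1.0                                  # bright zone; the rest is quiet
for lam in (1e-3, 1e-1, 1e1):
    q = regularized_drives(G, p, lam)
    print(lam, np.linalg.norm(q) ** 2)        # array effort falls as lam grows
```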

143 citations


Patent
06 Dec 2012
TL;DR: In this paper, a signal analyzer for analyzing the audio signal is provided, which determines whether an audio portion is effective in the encoder output signal as a first encoded signal from the first encoding branch or as a second encoded signal from the second encoding branch.
Abstract: An audio encoder for encoding an audio signal has a first coding branch, the first coding branch comprising a first converter for converting a signal from a time domain into a frequency domain. Furthermore, the audio encoder has a second coding branch comprising a second time/frequency converter. Additionally, a signal analyzer for analyzing the audio signal is provided. The signal analyzer, on the one hand, determines whether an audio portion is effective in the encoder output signal as a first encoded signal from the first encoding branch or as a second encoded signal from the second encoding branch. On the other hand, the signal analyzer determines a time/frequency resolution to be applied by the converters when generating the encoded signals. An output interface includes, in addition to the first encoded signal and the second encoded signal, resolution information identifying the resolution used by the first time/frequency converter and by the second time/frequency converter.

128 citations


Patent
31 Dec 2012
TL;DR: In this article, an exemplary system is described, consisting of an equalizer module that analyzes sound characteristics of individual digital audio samples including a discrete signal, a selector module that applies a selection heuristic to select the discrete signal from the individual audio samples based on the sound characteristics, and an audio module that supplies to an output an insert signal generated according to the discrete signal selected by the selection heuristic.
Abstract: An exemplary system comprises a device including a memory with an audio injection application installed thereon. The application comprises an equalizer module that analyzes sound characteristics of individual digital audio samples including a discrete signal, a selector module that applies a selection heuristic to select the discrete signal from the individual digital audio samples based on the sound characteristics, and an audio module that supplies to an output an insert signal generated according to the discrete signal selected by the selection heuristic.

125 citations


Patent
Jen-Po Hsiao1, Ting-Wei Sun1, Hann-Shi Tong1
07 Nov 2012
TL;DR: In this paper, a method for audio intelligibility enhancement and a computing apparatus are described; environment noise is detected by performing voice activity detection on a detected audio signal from at least one microphone of a computing device, and noise information is obtained according to the detected environment noise and a first audio signal.
Abstract: A method and apparatus for audio intelligibility enhancement, and a computing apparatus, are provided. The method includes the following steps. Environment noise is detected by performing voice activity detection on a detected audio signal from at least one microphone of a computing device. Noise information is obtained according to the detected environment noise and a first audio signal. A second audio signal is output by boosting the first audio signal, under an adjustable headroom, by the computing device according to the noise information and the first audio signal.
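A toy reading of the boosting step (the gain rule, names, and thresholds are assumptions, not the patent's method): scale the first audio signal by a noise-dependent gain, capped so the boosted peak stays within the adjustable headroom.

```python
import numpy as np

def boost_with_headroom(audio, noise_rms, target_snr_db=15.0, headroom_db=-1.0):
    """Boost audio toward a target SNR over the measured noise, capped so
    the peak never exceeds the configured headroom below full scale."""
    sig_rms = np.sqrt(np.mean(audio**2) + 1e-12)
    wanted_gain = noise_rms * 10 ** (target_snr_db / 20.0) / sig_rms
    peak = np.max(np.abs(audio)) + 1e-12
    max_gain = 10 ** (headroom_db / 20.0) / peak     # keep peak under headroom
    return audio * min(wanted_gain, max_gain)
```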

Patent
02 Aug 2012
TL;DR: In this article, a method for use in performing acoustic calibration of at least one audio output device for a plurality of listening locations is presented, where an audio input device generates a data signal based on a series of one or more tones output by the audio output devices.
Abstract: An illustrative embodiment includes a method for use in performing acoustic calibration of at least one audio output device for a plurality of listening locations. An audio input device generates a data signal based on a series of one or more tones output by the at least one audio output device. The audio input device wirelessly transmits the data signal to a calibration device. The audio input device is one of a plurality of audio input devices deployed at respective ones of the plurality of listening locations. The data signal is one of a plurality of data signals generated by respective ones of the plurality of audio input devices based on the series of one or more tones output by the at least one audio output device. The plurality of data signals are wirelessly transmitted by the respective ones of the plurality of audio input devices to the calibration device.

Proceedings ArticleDOI
12 Nov 2012
TL;DR: A novel recognition approach called Non-Markovian Ensemble Voting (NEV) able to classify multiple human activities in an online fashion without the need for silence detection or audio stream segmentation is proposed.
Abstract: Human activity recognition is a key component for socially enabled robots to effectively and naturally interact with humans. In this paper we exploit the fact that many human activities produce characteristic sounds from which a robot can infer the corresponding actions. We propose a novel recognition approach called Non-Markovian Ensemble Voting (NEV) able to classify multiple human activities in an online fashion without the need for silence detection or audio stream segmentation. Moreover, the method can deal with activities that are extended over undefined periods in time. In a series of experiments in real reverberant environments, we are able to robustly recognize 22 different sounds that correspond to a number of human activities in a bathroom and kitchen context. Our method outperforms several established classification techniques.
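Very roughly, the voting idea can be pictured as classifying short overlapping frames and accumulating per-class votes over a trailing window, so decisions emerge online without segmentation. The placeholder classifier below is not the paper's ensemble.

```python
import numpy as np
from collections import deque, Counter

class OnlineVoter:
    """Accumulate per-frame class votes over a trailing window and report
    the majority label: no silence detection or segmentation required."""
    def __init__(self, classify_frame, window_frames=20):
        self.classify_frame = classify_frame      # frame -> label (any model)
        self.votes = deque(maxlen=window_frames)

    def push(self, frame):
        self.votes.append(self.classify_frame(frame))
        label, count = Counter(self.votes).most_common(1)[0]
        return label, count / len(self.votes)     # label plus vote fraction

# Usage with a placeholder energy-threshold "classifier":
voter = OnlineVoter(lambda f: 'active' if np.mean(f**2) > 1e-4 else 'background')
```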

Patent
Jae-Hoon Lee1
02 Mar 2012
TL;DR: In this article, an audio/video (A/V) device with a volume control function for external audio reproduction units by using volume control buttons of a remote controller is provided.
Abstract: An audio/video (A/V) device having a volume control function for external audio reproduction units by using volume control buttons of a remote controller is provided. The A/V device includes speakers, an audio output port for externally outputting an audio signal, an audio signal processing unit for reproducing and amplifying the audio signal and applying the amplified audio signal to the speakers or the audio output port, a memory unit for storing volume control values, and a control unit for applying to the audio signal processing unit any of the volume control values stored in the memory based on whether the external audio reproduction unit is plugged in the audio output port. The control unit controls the audio signal processing unit to adjust the volume control values for the audio output port by the volume control buttons when the external audio reproduction unit is plugged in the audio output port.

Patent
12 Mar 2012
TL;DR: In this paper, an active noise control system was proposed to generate sound waves to destructively interfere with an undesired sound in a targeted space, where a speaker was also driven to produce sound waves representative of a desired audio signal.
Abstract: An active noise control system generates an anti-noise signal to drive a speaker to produce sound waves to destructively interfere with an undesired sound in a targeted space. The speaker is also driven to produce sound waves representative of a desired audio signal. Sound waves are detected in the target space and a representative signal is generated. The representative signal is combined with an audio compensation signal to remove a signal component representative of the sound waves based on the desired audio signal and generate an error signal. The active noise control adjusts the anti-noise signal based on the error signal. The active noise control system converts the sample rates of an input signal representative of the undesired sound, the desired audio signal, and the error signal. The active noise control system converts the sample rate of the anti-noise signal.
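The adaptation at the heart of such systems is typically an LMS-family filter. A minimal single-channel LMS sketch, ignoring the secondary-path modeling and the sample-rate conversions the patent describes:

```python
import numpy as np

def lms_anti_noise(reference, disturbance, n_taps=64, mu=1e-3):
    """Adapt an FIR filter so the emitted anti-noise cancels the disturbance
    at the error microphone (secondary path assumed ideal for brevity)."""
    w, buf, residuals = np.zeros(n_taps), np.zeros(n_taps), []
    for x, d in zip(reference, disturbance):
        buf = np.roll(buf, 1)
        buf[0] = x                                # newest reference sample
        y = w @ buf                               # anti-noise sample
        e = d - y                                 # residual at the error mic
        w += mu * e * buf                         # LMS coefficient update
        residuals.append(e)
    return w, np.array(residuals)
```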


Patent
29 Jun 2012
TL;DR: In this article, a system for generating one or more enhanced audio signals such that sound levels corresponding with sounds received from one or multiple sources of sound within an environment may be dynamically adjusted based on contextual information is described.
Abstract: A system for generating one or more enhanced audio signals such that one or more sound levels corresponding with sounds received from one or more sources of sound within an environment may be dynamically adjusted based on contextual information is described. The one or more enhanced audio signals may be generated by a head-mounted display device (HMD) worn by an end user within the environment and outputted to earphones associated with the HMD such that the end user may listen to the one or more enhanced audio signals in real-time. In some cases, each of the one or more sources of sound may correspond with a priority level. The priority level may be dynamically assigned depending on whether the end user of the HMD is focusing on a particular source of sound or has specified a predetermined level of importance corresponding with the particular source of sound.

Patent
20 Feb 2012
TL;DR: In this paper, a computer-implemented method and system allows a remote computer user to listen to teams in a race event, which includes receiving audio signals from a plurality of audio sources at the race event.
Abstract: A computer-implemented method and system allows a remote computer user to listen to teams in a race event. The method includes receiving audio signals from a plurality of audio sources at the race event; transmitting at least some of the audio signals to a remote computer; and filtering the audio signals as a function of the source of at least some of the audio signals so that at least some of the audio signals are not played by the remote computer and heard by the user.

Patent
17 Dec 2012
TL;DR: In this paper, a plurality of audio devices is discovered; the relative positions of, and distances between, the discovered devices are determined and used to select a constellation of audio devices from the discovered plurality.
Abstract: Described herein is the discovery of a plurality of audio devices, for which relative positions and distances between them are determined. The determined relative positions and distances are used to select a constellation of audio devices from the discovered plurality. This constellation is selected for playing or recording of a multi-channel audio file so as to present an audio effect such as a spatial audio effect. Channels of the multi-channel audio file are allocated to different audio devices of the selected constellation, which are controlled to synchronously play back or record their respectively allocated channel or channels. In a specific embodiment, the determined distances are used to automatically select the constellation and include the distance between each pair of audio devices of the plurality. Several embodiments are presented for determining the distances and relative positions.

Patent
26 Jul 2012
TL;DR: In this paper, an audio calibration system and a method that determines optimum placement and/or operating conditions of speakers for an entertainment system is presented, where the system receives an audio signal and transmits the audio signal to a speaker.
Abstract: Described herein is an audio calibration system and method that determines optimum placement and/or operating conditions of speakers for an entertainment system. The system receives an audio signal and transmits the audio signal to a speaker. A recordation of an emanated audio signal from each speaker is made. The system performs a sliding window fast Fourier transform (FFT) comparison of the recorded audio signal temporally and volumetrically with the audio signal. A time delay for each speaker is shifted so that each of the plurality of speakers is synchronized. The individual volumes are then compared for each speaker and are adjusted to collectively match. The method can align and move the convergence point of multiple audio sources. Time differences are measured with respect to a microphone as a function of position. The method uses any audio data and functions with background noise in real time.
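The time-alignment step can be illustrated with plain cross-correlation between the reference signal and each speaker's recording (a generic sketch; the patent's sliding-window FFT comparison is richer):

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def speaker_delay_samples(reference, recording):
    """Estimate how many samples `recording` lags `reference` via the peak
    of their cross-correlation; used to time-align multiple speakers."""
    corr = correlate(recording, reference, mode='full')
    lags = correlation_lags(len(recording), len(reference), mode='full')
    return int(lags[np.argmax(np.abs(corr))])

def align_speakers(reference, recordings):
    """Return per-speaker delays relative to the earliest speaker, so each
    can be shifted to make the wavefronts coincide at the listening point."""
    delays = np.array([speaker_delay_samples(reference, r) for r in recordings])
    return delays - delays.min()
```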

Proceedings ArticleDOI
25 Mar 2012
TL;DR: This work generalizes REPET to permit the processing of complete musical tracks and shows that this method can perform at least as well as a recent competitive music/voice separation method, while being computationally efficient.
Abstract: The separation of the lead vocals from the background accompaniment in audio recordings is a challenging task. Recently, an efficient method called REPET (REpeating Pattern Extraction Technique) has been proposed to extract the repeating background from the non-repeating foreground. While effective on individual sections of a song, REPET does not allow for variations in the background (e.g. verse vs. chorus), and is thus limited to short excerpts only. We overcome this limitation and generalize REPET to permit the processing of complete musical tracks. The proposed algorithm tracks the period of the repeating structure and computes local estimates of the background pattern. Separation is performed by soft time-frequency masking, based on the deviation between the current observation and the estimated background pattern. Evaluation on a dataset of 14 complete tracks shows that this method can perform at least as well as a recent competitive music/voice separation method, while being computationally efficient.
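The masking step can be sketched as follows, assuming the repeating period (in STFT frames) is already known; the generalized method additionally tracks the period over time and adapts the background estimate locally.

```python
import numpy as np
from scipy.signal import stft, istft

def repet_like_separation(x, fs, period_frames, nperseg=1024):
    """Soft time-frequency masking against a median-estimated repeating
    background (fixed period; the generalized method tracks it over time)."""
    f, t, X = stft(x, fs, nperseg=nperseg)
    V = np.abs(X)
    n_per = V.shape[1] // period_frames
    seg = V[:, :n_per * period_frames].reshape(V.shape[0], n_per, period_frames)
    model = np.median(seg, axis=1)                 # repeating pattern estimate
    bg = np.tile(model, (1, n_per + 1))[:, :V.shape[1]]
    mask = np.minimum(bg, V) / (V + 1e-12)         # soft mask for background
    _, background = istft(mask * X, fs, nperseg=nperseg)
    _, foreground = istft((1 - mask) * X, fs, nperseg=nperseg)
    return background, foreground
```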

Proceedings ArticleDOI
01 Nov 2012
TL;DR: This work evaluates an approach that leverages installed loudspeaker infrastructure to perform audio-based localization of devices requesting it in indoor environments, by playing barely audible controlled sounds from multiple speakers at known positions, and reports promising initial results with localization accuracy within half a meter 94% of the time.
Abstract: Audio-based receiver localization in indoor environments has multiple applications including indoor navigation, location tagging, and tracking. Public places like shopping malls and consumer stores often have loudspeakers installed to play music for public entertainment. Similarly, office spaces may have sound conditioning speakers installed to soften other environmental noises. We discuss an approach to leverage this infrastructure to perform audio-based localization of devices requesting localization in such environments, by playing barely audible controlled sounds from multiple speakers at known positions. Our approach can be used to localize devices such as smart-phones, tablets and laptops to sub-meter accuracy. The user does not need to carry any specialized hardware. Unlike acoustic approaches which use high-energy ultrasound waves, the use of barely audible (low energy) signals in our approach poses very different challenges. We discuss these challenges, how we addressed those, and experimental results on two prototypical implementations: a request-play-record localizer, and a continuous tracker. We evaluated our approach in a real world meeting room and report promising initial results with localization accuracy within half a meter 94% of the time. The system has been deployed in multiple zones in our office building and is used on a regular basis.

Journal ArticleDOI
TL;DR: This paper presents an experimental and comparative study of several spherical microphone array eigenbeam (EB) processing techniques for localization of early reflections in room acoustic environments, which is a relevant research topic in both audio signal processing and room acoustics.
Abstract: This paper presents an experimental and comparative study of several spherical microphone array eigenbeam (EB) processing techniques for localization of early reflections in room acoustic environments, which is a relevant research topic in both audio signal processing and room acoustics. This paper focuses on steered beamformer-based and subspace-based localization techniques implemented in the spherical EB domain, including the plane-wave decomposition, eigenbeam delay and sum, eigenbeam minimum variance distortionless response, eigenbeam multiple signal classification (EB-MUSIC), and eigenbeam estimation of signal parameters via rotational invariance techniques (EB-ESPRIT) methods. The directions of arrival of the original sound source and the associated reflection signals in acoustic environments are estimated from acoustic maps of the rooms, which are obtained using a spherical microphone array. The EB-domain-based frequency smoothing and white noise gain control techniques are derived and employed to improve the performance and robustness of reflection localization. The applicability of the presented methods in practice is confirmed by experiments carried out in real rooms.
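Outside the eigenbeam domain, the "acoustic map" idea reduces to steered-response power: steer a beamformer over candidate directions and plot the output power, whose peaks mark the direct sound and strong reflections. A narrowband, free-field sketch with assumed array geometry:

```python
import numpy as np

def srp_map(snapshots, mic_xyz, freq_hz, c=343.0, n_az=360):
    """Steered-response power over azimuth for one narrowband bin.
    `snapshots`: (n_mics, n_frames) complex STFT bins; `mic_xyz`: (n_mics, 3).
    Peaks of the map indicate the direct sound and strong reflections."""
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]  # spatial covariance
    az = np.linspace(0, 2 * np.pi, n_az, endpoint=False)
    k = 2 * np.pi * freq_hz / c                              # wavenumber
    power = np.empty(n_az)
    for i, a in enumerate(az):
        d = np.array([np.cos(a), np.sin(a), 0.0])            # look direction
        steer = np.exp(1j * k * mic_xyz @ d)                 # plane-wave phases
        steer /= np.linalg.norm(steer)
        power[i] = np.real(steer.conj() @ R @ steer)         # beamformer output
    return az, power
```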

Patent
30 Apr 2012
TL;DR: In this paper, a personal audio device includes an adaptive noise canceling (ANC) circuit that adaptively generates an anti-noise signal from a reference microphone signal measuring the ambient audio and an error microphone signal, and injects the anti-noise signal at the transducer output to cause cancellation of ambient audio sounds.
Abstract: A personal audio device, such as a wireless telephone, includes an adaptive noise canceling (ANC) circuit that adaptively generates an anti-noise signal from a reference microphone signal that measures the ambient audio and an error microphone signal that measures the output of an output transducer plus any ambient audio at that location, and injects the anti-noise signal at the transducer output to cause cancellation of ambient audio sounds. A processing circuit uses the reference and error microphone signals to generate the anti-noise signal, which can be produced by an adaptive filter operating at a multiple of the ANC coefficient update rate. Downlink audio can be combined with the high-data-rate anti-noise signal by interpolation. High-pass filters in the control paths reduce DC offset in the ANC circuits, and ANC coefficient adaptation can be halted when downlink audio is not detected.

Journal ArticleDOI
TL;DR: The inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more relevant than any difference between the proposed confidence measures.
Abstract: The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition across different system architectures and confidence measures, leading to an increase in performance more significant than any difference between the proposed confidence measures.
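The fusion itself typically amounts to a reliability-weighted combination of per-stream log-likelihoods, gated by voice activity detection. A generic sketch with assumed inputs and weighting rule:

```python
import numpy as np

def combined_log_likelihoods(ll_audio, ll_video, audio_reliability, vad_active):
    """Weighted fusion of per-class log-likelihoods from two streams.
    `audio_reliability` in [0, 1] sets the audio exponent weight; during
    detected silence the audio stream is heavily discounted regardless."""
    w_audio = audio_reliability if vad_active else 0.1  # silence: trust video
    w_video = 1.0 - w_audio
    return w_audio * ll_audio + w_video * ll_video      # per-class scores

# Usage: pick the class with the highest fused score.
scores = combined_log_likelihoods(np.array([-4.0, -2.5]),
                                  np.array([-3.0, -3.2]),
                                  audio_reliability=0.7, vad_active=True)
best_class = int(np.argmax(scores))
```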

Patent
24 Apr 2012
TL;DR: In this paper, a method of operating an audio system in an automobile includes identifying a user of the audio system and an audio recording playing on the audio systems is identified and stored in memory in association with the identified user and the identified audio recording.
Abstract: A method of operating an audio system in an automobile includes identifying a user of the audio system. An audio recording playing on the audio system is identified. An audio setting entered into the audio system by the identified user while the audio recording is being played by the audio system is sensed. The sensed audio setting is stored in memory in association with the identified user and the identified audio recording. The audio recording is retrieved from memory with the sensed audio setting being embedded in the retrieved audio recording as a watermark signal. The retrieved audio recording is played on the audio system with the embedded sensed audio setting being automatically implemented by the audio system during the playing.

Patent
Jin-hyong Kim1
24 Apr 2012
TL;DR: In this article, an electronic apparatus is described, consisting of a sensor to sense an orientation of the electronic apparatus, an audio processor to divide a stereo audio signal into a left audio signal and a right audio signal, a plurality of speakers disposed at separate locations on a main body of the electronic apparatus, a switching unit to output the left and right audio signals from the audio processor to the speakers, and a controller to control the switching unit.
Abstract: An electronic apparatus is provided. The electronic apparatus includes a sensor to sense an orientation of the electronic apparatus, an audio processor to divide a stereo audio signal into a left audio signal and a right audio signal, a plurality of speakers disposed at separate locations on a main body of the electronic apparatus to output the left audio signal and the right audio signal, a switching unit to output the left audio signal and the right audio signal from the audio processor to the plurality of speakers, and a controller to control the switching unit to switch the output of the left audio signal and the right audio signal between the plurality of speakers based on a change in the orientation of the electronic apparatus.

Proceedings ArticleDOI
12 Nov 2012
TL;DR: This paper sonifies objects that do not intrinsically produce sound, with the purpose of revealing additional information about them, and uses computer vision methods to identify high-level features of interest in an RGB-D stream, which are then sonified as virtual objects at their respective real-world coordinates.
Abstract: Augmented reality applications have focused on visually integrating virtual objects into real environments. In this paper, we propose an auditory augmented reality, where we integrate acoustic virtual objects into the real world. We sonify objects that do not intrinsically produce sound, with the purpose of revealing additional information about them. Using spatialized (3D) audio synthesis, acoustic virtual objects are placed at specific real-world coordinates, obviating the need to explicitly tell the user where they are. Thus, by leveraging the innate human capacity for 3D sound source localization and source separation, we create an audio natural user interface. In contrast with previous work, we do not create acoustic scenes by transducing low-level (for instance, pixel-based) visual information. Instead, we use computer vision methods to identify high-level features of interest in an RGB-D stream, which are then sonified as virtual objects at their respective real-world coordinates. Since our visual and auditory senses are inherently spatial, this technique naturally maps between these two modalities, creating intuitive representations. We evaluate this concept with a head-mounted device, featuring modes that sonify flat surfaces, navigable paths and human faces.
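A crude stand-in for the spatialization step: render a virtual source at a given azimuth using interaural time and level differences (real 3D synthesis uses HRTFs; the Woodworth ITD formula is standard, but the ILD rule and constants here are assumptions).

```python
import numpy as np

def itd_ild_pan(mono, fs, azimuth_rad, head_radius=0.0875, c=343.0):
    """Binaural cue sketch: delay and attenuate the far ear relative to the
    near ear according to source azimuth (0 = front, +pi/2 = hard left)."""
    itd = head_radius / c * (abs(azimuth_rad) + np.sin(abs(azimuth_rad)))
    shift = int(round(itd * fs))                         # Woodworth ITD
    ild = 10 ** (-6.0 * abs(np.sin(azimuth_rad)) / 20.0)  # ~6 dB max ILD
    delayed = np.concatenate([np.zeros(shift), mono])[:len(mono)]
    near, far = mono, delayed * ild
    left, right = (near, far) if azimuth_rad > 0 else (far, near)
    return np.stack([left, right], axis=1)               # (n, 2) stereo
```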

Patent
18 Dec 2012
TL;DR: In this article, an apparatus is described comprising an input configured to receive at least one audio signal from a further apparatus, an input for at least one audio signal associated with the apparatus, and an orientation/location determiner used to determine a relative orientation or location difference between the apparatus and the further apparatus.
Abstract: An apparatus comprising: an input configured to receive at least one audio signal from a further apparatus; an input configured to receive at least one audio signal associated with the apparatus; an orientation/location determiner configured to determine a relative orientation/location difference between the apparatus and the further apparatus; an audio processor configured to process the at least one audio signal from the further apparatus based on the relative orientation/location difference between the apparatus and the further apparatus; and a combiner configured to combine the at least one audio signal from the further apparatus having been processed and the at least one audio signal associated with the apparatus.

Proceedings ArticleDOI
24 Dec 2012
TL;DR: This work proposes two methods, MUSIC based on Generalized Singular Value Decomposition (GSVD-MUSIC) and Hierarchical SSL (H-SSL), which drastically reduce the computational cost while maintaining noise-robustness in localization.
Abstract: Sound Source Localization (SSL) is an essential function for robot audition; it yields the locations and number of sound sources, which are utilized in post-processing such as sound source separation. SSL for a robot in a real environment mainly requires noise-robustness, high resolution, and real-time processing. A microphone array processing technique, Multiple Signal Classification based on Standard Eigenvalue Decomposition (SEVD-MUSIC), is commonly used for localization. We improved its robustness against high-power noise by incorporating Generalized Eigenvalue Decomposition (GEVD). However, GEVD-based MUSIC (GEVD-MUSIC) has two main issues: 1) the resolution of the pre-measured Transfer Functions (TFs) determines the resolution of SSL; 2) its computational cost is too high for real-time processing. For the first issue, we propose a TF interpolation method integrating time-domain and frequency-domain interpolation. The interpolation achieves super-resolution SSL, whose resolution is higher than that of the pre-measured TFs. For the second issue, we propose two methods: MUSIC based on Generalized Singular Value Decomposition (GSVD-MUSIC) and Hierarchical SSL (H-SSL). GSVD-MUSIC drastically reduces the computational cost while maintaining noise-robustness in localization. H-SSL also reduces the computational cost by introducing a hierarchical search algorithm instead of a greedy search. These techniques are integrated into an SSL system using a robot-embedded microphone array. The experimental results showed that the proposed interpolation achieved approximately 1-degree resolution even though TFs were measured only at 30-degree intervals, that GSVD-MUSIC required only 46.4% and 40.6% of the computational cost of SEVD-MUSIC and GEVD-MUSIC, respectively, and that H-SSL reduced the computational cost of localizing a single sound source by 59.2%.
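For reference, the SEVD-MUSIC baseline that this line of work builds on can be sketched for a single frequency bin, along with the GEVD-style noise whitening; the steering vectors stand in for the pre-measured TFs, and nothing below reproduces the paper's interpolation, GSVD, or hierarchical search.

```python
import numpy as np

def music_spectrum(R, steering, n_sources):
    """SEVD-MUSIC pseudospectrum for one frequency bin.
    R: (M, M) spatial covariance; steering: (n_dirs, M) candidate TFs."""
    w, V = np.linalg.eigh(R)                    # eigenvalues in ascending order
    En = V[:, : R.shape[0] - n_sources]         # noise-subspace eigenvectors
    num = np.einsum('dm,dm->d', steering.conj(), steering).real
    proj = steering.conj() @ En                 # (n_dirs, M - n_sources)
    den = np.einsum('dk,dk->d', proj, proj.conj()).real + 1e-12
    return num / den                            # peaks at source directions

def gevd_whiten(R, K):
    """GEVD-MUSIC preprocessing: whiten R by a measured noise covariance K,
    which suppresses known high-power noise before the subspace step."""
    L = np.linalg.cholesky(K)
    Li = np.linalg.inv(L)
    return Li @ R @ Li.conj().T
```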