Showing papers in "arXiv: Sound in 2015"


Posted Content
TL;DR: This report introduces a new corpus of music, speech, and noise suitable for training models for voice activity detection (VAD) and music/speech discrimination and demonstrates use of this corpus on Broadcast news and VAD for speaker identification.
Abstract: This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.

855 citations


Journal ArticleDOI
TL;DR: In this paper, a joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks was proposed, which achieved 2.30-4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30-2.48 dB GNSDR gain and 4.32-5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperformed NMF and DNN baselines in the speech denoising task.
Abstract: Monaural source separation is important for many real world applications. It is challenging because, with only a single channel of information available, without any constraints, an infinite number of solutions are possible. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative criterion for training neural networks to further enhance the separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30--4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30--2.48 dB GNSDR gain and 4.32--5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.

370 citations
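
The extra masking layer mentioned in the abstract above can be illustrated with a soft time-frequency mask. The following is a minimal sketch, not the authors' implementation: it assumes two non-negative magnitude estimates (as a network might produce) stored as plain nested lists, and the function name `soft_mask_separate` is hypothetical. The mask rescales the mixture so that the two outputs always sum back to the mixture, which is one way to read the "reconstruction constraint".

```python
def soft_mask_separate(est1, est2, mixture, eps=1e-12):
    """Apply a soft time-frequency mask to a mixture magnitude spectrogram.

    est1, est2: magnitude estimates for the two sources (hypothetical inputs)
    mixture:    magnitude spectrogram of the mixture, same shape
    Each argument is a list of frames; each frame is a list of frequency bins.
    """
    out1, out2 = [], []
    for frame1, frame2, frame_mix in zip(est1, est2, mixture):
        row1, row2 = [], []
        for a, b, m in zip(frame1, frame2, frame_mix):
            total = abs(a) + abs(b) + eps  # eps avoids division by zero
            mask = abs(a) / total          # share of the bin given to source 1
            row1.append(mask * m)
            row2.append((1.0 - mask) * m)  # the remainder goes to source 2
        out1.append(row1)
        out2.append(row2)
    return out1, out2


# Toy 2-frame, 2-bin example with made-up values.
est1 = [[3.0, 1.0], [0.0, 2.0]]
est2 = [[1.0, 1.0], [4.0, 2.0]]
mix = [[2.0, 4.0], [1.0, 6.0]]
s1, s2 = soft_mask_separate(est1, est2, mix)
```

Because the two masks sum to one in every bin, `s1` and `s2` add up to `mix` regardless of how good the individual estimates are.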


Posted Content
TL;DR: In this paper, a convolutional DNN was used to estimate the ideal binary mask for singing voice separation from real-world musical mixtures, which may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.
Abstract: Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNN) have been used to estimate 'ideal' binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN (of around a billion parameters) to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for 'karaoke' type applications.

79 citations


Posted Content
TL;DR: It is shown that the high-dimensional acoustic samples indeed lie on a low-dimensional manifold and can be embedded into a low-dimensional space, and a semi-supervised source localization algorithm based on two-microphone measurements is proposed, which recovers the inverse mapping between the acoustic samples and their corresponding locations.
Abstract: Conventional speaker localization algorithms, based merely on the received microphone signals, are often sensitive to adverse conditions, such as: high reverberation or low signal to noise ratio (SNR). In some scenarios, e.g. in meeting rooms or cars, it can be assumed that the source position is confined to a predefined area, and the acoustic parameters of the environment are approximately fixed. Such scenarios give rise to the assumption that the acoustic samples from the region of interest have a distinct geometrical structure. In this paper, we show that the high dimensional acoustic samples indeed lie on a low dimensional manifold and can be embedded into a low dimensional space. Motivated by this result, we propose a semi-supervised source localization algorithm which recovers the inverse mapping between the acoustic samples and their corresponding locations. The idea is to use an optimization framework based on manifold regularization, that involves smoothness constraints of possible solutions with respect to the manifold. The proposed algorithm, termed Manifold Regularization for Localization (MRL), is implemented in an adaptive manner. The initialization is conducted with only few labelled samples attached with their respective source locations, and then the system is gradually adapted as new unlabelled samples (with unknown source locations) are received. Experimental results show superior localization performance when compared with a recently presented algorithm based on a manifold learning approach and with the generalized cross-correlation (GCC) algorithm as a baseline.

56 citations


Posted Content
TL;DR: In this paper, the authors conducted an experimental study on recognizing emotions from human speech, including neutral, anger, joy, and sadness, using different classifiers, and found that for better accuracy, one should consider the data collected from one person rather than considering the data from a group of people.
Abstract: Recognizing emotion from speech has become one of the active research themes in speech processing and in applications based on human-computer interaction. This paper conducts an experimental study on recognizing emotions from human speech. The emotions considered for the experiments include neutral, anger, joy and sadness. The distinguishability of emotional features in speech was studied first, followed by emotion classification performed on a custom dataset. The classification was performed for different classifiers. One of the main feature attributes considered in the prepared dataset was the peak-to-peak distance obtained from the graphical representation of the speech signals. After performing the classification tests on a dataset formed from 30 different subjects, it was found that for getting better accuracy, one should consider the data collected from one person rather than considering the data from a group of people.

46 citations


Journal ArticleDOI
TL;DR: This paper proposes an impostor selection algorithm and a universal model adaptation process in a hybrid system based on deep belief networks and deep neural networks to discriminatively model each target speaker, with the aim of filling the performance gap between cosine and PLDA scoring baseline techniques for speaker recognition.
Abstract: The promising performance of Deep Learning (DL) in speech recognition has motivated the use of DL in other speech technology applications such as speaker recognition. Given i-vectors as inputs, the authors proposed an impostor selection algorithm and a universal model adaptation process in a hybrid system based on Deep Belief Networks (DBN) and Deep Neural Networks (DNN) to discriminatively model each target speaker. In order to have more insight into the behavior of DL techniques in both single and multi-session speaker enrollment tasks, some experiments have been carried out in this paper in both scenarios. Additionally, the parameters of the global model, referred to as universal DBN (UDBN), are normalized before adaptation. UDBN normalization facilitates training DNNs specifically with more than one hidden layer. Experiments are performed on the NIST SRE 2006 corpus. It is shown that the proposed impostor selection algorithm and UDBN adaptation process enhance the performance of conventional DNNs by 8-20 % and 16-20 % in terms of EER for the single and multi-session tasks, respectively. In both scenarios, the proposed architectures outperform the baseline systems, obtaining up to 17 % reduction in EER.

46 citations


Posted Content
TL;DR: It is shown that a convolutional neural network trained on raw audio can achieve performance surpassing traditional methods that rely on hand-crafted features.
Abstract: Traditional methods to tackle many music information retrieval tasks typically follow a two-step architecture: feature engineering followed by a simple learning algorithm. In these "shallow" architectures, feature engineering and learning are typically disjoint and unrelated. Additionally, feature engineering is difficult, and typically depends on extensive domain expertise. In this paper, we present an application of convolutional neural networks for the task of automatic musical instrument identification. In this model, feature extraction and learning algorithms are trained together in an end-to-end fashion. We show that a convolutional neural network trained on raw audio can achieve performance surpassing traditional methods that rely on hand-crafted features.

38 citations


Posted Content
TL;DR: In this paper, a parameterization of the room transfer function (RTF) in terms of a modal expansion of 3D basis functions is proposed, with which a finite set of coefficients can be used to calculate the RTF between any two arbitrary points from a predefined spatial region where the source(s) lie and a predefined spatial region where the receiver(s) lie.
Abstract: This paper proposes an efficient parameterization of the Room Transfer Function (RTF). Typically, the RTF rapidly varies with varying source and receiver positions, hence requires an impractical number of point to point measurements to characterize a given room. Therefore, we derive a novel RTF parameterization that is robust to both receiver and source variations with the following salient features: (i) The parameterization is given in terms of a modal expansion of 3D basis functions. (ii) The aforementioned modal expansion can be truncated at a finite number of modes given that the source and receiver locations are from two sizeable spatial regions, which are arbitrarily distributed. (iii) The parameter weights/coefficients are independent of the source/receiver positions. Therefore, a finite set of coefficients is shown to be capable of accurately calculating the RTF between any two arbitrary points from a predefined spatial region where the source(s) lie and a pre-defined spatial region where the receiver(s) lie. A practical method to measure the RTF coefficients is also provided, which only requires a single microphone unit and a single loudspeaker unit, given that the room characteristics remain stationary over time. The accuracy of the above parameterization is verified using appropriate simulation examples.

36 citations


Proceedings ArticleDOI
TL;DR: The joint time-frequency scattering transform is a time-shift-invariant descriptor of time-frequency structure for audio classification, obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time-frequency wavelet scalogram.
Abstract: We introduce the joint time-frequency scattering transform, a time shift invariant descriptor of time-frequency structure for audio classification. It is obtained by applying a two-dimensional wavelet transform in time and log-frequency to a time-frequency wavelet scalogram. We show that this descriptor successfully characterizes complex time-frequency phenomena such as time-varying filters and frequency modulated excitations. State-of-the-art results are achieved for signal reconstruction and phone segment classification on the TIMIT dataset.

31 citations


Journal ArticleDOI
TL;DR: In this article, a variational expectation-maximization (VEM) algorithm was proposed to estimate the time-varying mixing matrix and jointly estimate the source parameters, and sound sources were then separated by Wiener filters constructed with the estimators provided by the VEM algorithm.
Abstract: This paper addresses the problem of separating audio sources from time-varying convolutive mixtures. We propose a probabilistic framework based on the local complex-Gaussian model combined with non-negative matrix factorization. The time-varying mixing filters are modeled by a continuous temporal stochastic process. We present a variational expectation-maximization (VEM) algorithm that employs a Kalman smoother to estimate the time-varying mixing matrix, and that jointly estimates the source parameters. The sound sources are then separated by Wiener filters constructed with the estimators provided by the VEM algorithm. Extensive experiments on simulated data show that the proposed method outperforms a block-wise version of a state-of-the-art baseline method.

30 citations
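
The Wiener filtering step named in the abstract above can be sketched in a heavily simplified form. This is not the paper's multichannel, time-varying model: the sketch assumes a single channel and a single time-frequency bin in which each source has a known variance, so the Wiener gain for a source is its variance divided by the sum of all source variances. The function name `wiener_separate` and the numbers are hypothetical.

```python
def wiener_separate(mixture_bin, source_variances):
    """Split one complex STFT bin among sources with single-channel Wiener gains.

    mixture_bin:      complex STFT coefficient of the mixture
    source_variances: non-negative variance of each source in this bin
                      (in the paper these would come from the fitted model)
    """
    total = sum(source_variances)
    if total == 0:
        # No source has energy in this bin, so every estimate is zero.
        return [0j for _ in source_variances]
    return [(v / total) * mixture_bin for v in source_variances]


# Toy example: the first source holds three quarters of the variance.
estimates = wiener_separate(2 + 2j, [3.0, 1.0])
```

The gains sum to one, so the source estimates add back up to the mixture coefficient.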


Journal ArticleDOI
TL;DR: This study first extracts the predominant melody and applies a novel contour filtering process to eliminate segments of the pitch contour which originate from the guitar accompaniment, and formulates a set of onset detection functions based on volume and pitch characteristics to segment the resulting vocal pitch contours into discrete note events.
Abstract: Automatic note-level transcription is considered one of the most challenging tasks in music information retrieval. The specific case of flamenco singing transcription poses a particular challenge due to its complex melodic progressions, intonation inaccuracies, the use of a high degree of ornamentation and the presence of guitar accompaniment. In this study, we explore the limitations of existing state of the art transcription systems for the case of flamenco singing and propose a specific solution for this genre: We first extract the predominant melody and apply a novel contour filtering process to eliminate segments of the pitch contour which originate from the guitar accompaniment. We formulate a set of onset detection functions based on volume and pitch characteristics to segment the resulting vocal pitch contour into discrete note events. A quantised pitch label is assigned to each note event by combining global pitch class probabilities with local pitch contour statistics. The proposed system outperforms state of the art singing transcription systems with respect to voicing accuracy, onset detection and overall performance when evaluated on flamenco singing datasets.

Posted Content
TL;DR: A new musical instrument classification method using convolutional neural networks (CNNs) is presented, which improves over a system that uses only a spectrogram and outperforms the baseline result from traditional handcrafted features and classifiers.
Abstract: A new musical instrument classification method using convolutional neural networks (CNNs) is presented in this paper. Unlike the traditional methods, we investigated a scheme for classifying musical instruments using the learned features from CNNs. To create the learned features from CNNs, we not only used a conventional spectrogram image, but also proposed multiresolution recurrence plots (MRPs) that contain the phase information of a raw input signal. Consequently, we fed the characteristic timbre of the particular instrument into a neural network, which cannot be extracted using phase-blind representations such as a spectrogram. By combining our proposed MRPs and spectrogram images with a multi-column network, the performance of our proposed classifier system improves over a system that uses only a spectrogram. Furthermore, the proposed classifier also outperforms the baseline result from traditional handcrafted features and classifiers.

Posted Content
TL;DR: This paper addresses the problem of audio scenes classification and contributes to the state of the art by proposing a novel feature by considering histogram of gradients (HOG) of time-frequency representation of an audio scene and evaluating its performances with state-of-the-art competitors.
Abstract: This paper addresses the problem of audio scene classification and contributes to the state of the art by proposing a novel feature. We build this feature by considering histograms of gradients (HOG) of the time-frequency representation of an audio scene. In contrast to classical audio features like MFCC, we make the hypothesis that histograms of gradients are able to encode some relevant information in a time-frequency representation: namely, the local direction of variation (in time and frequency) of the signal spectral power. In addition, in order to gain more invariance and robustness, histograms of gradients are locally pooled. We have evaluated the relevance of the novel feature by comparing its performance with state-of-the-art competitors on several datasets, including a novel one that we provide as part of our contribution. This dataset, which we make publicly available, involves 19 classes and contains about 900 minutes of audio scene recordings. We thus believe that it may be the next standard dataset for evaluating audio scene classification algorithms. Our comparison results clearly show that our HOG-based features outperform their competitors.

Posted Content
TL;DR: A novel approach is proposed for joint estimation of the reverberation time (T60) and the direct-to-reverberation ratio (DRR) from wideband speech in noisy conditions. Applied to reverberant speech signals provided by the Acoustic Characterization of Environments (ACE) Challenge, it outperforms the baseline systems in terms of median errors, and calculation of estimates is 5.8 times faster compared to the baseline.
Abstract: Blind estimation of acoustic room parameters such as the reverberation time $T_\mathrm{60}$ and the direct-to-reverberation ratio ($\mathrm{DRR}$) is still a challenging task, especially in case of blind estimation from reverberant speech signals. In this work, a novel approach is proposed for joint estimation of $T_\mathrm{60}$ and $\mathrm{DRR}$ from wideband speech in noisy conditions. 2D Gabor filters arranged in a filterbank are exploited for extracting features, which are then used as input to a multi-layer perceptron (MLP). The MLP output neurons correspond to specific pairs of $(T_\mathrm{60}, \mathrm{DRR})$ estimates; the output is integrated over time, and a simple decision rule results in our estimate. The approach is applied to single-microphone fullband speech signals provided by the Acoustic Characterization of Environments (ACE) Challenge. Our approach outperforms the baseline systems with median errors of close-to-zero and -1.5 dB for the $T_\mathrm{60}$ and $\mathrm{DRR}$ estimates, respectively, while the calculation of estimates is 5.8 times faster compared to the baseline.
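
The last stage described in the abstract above (per-frame MLP outputs integrated over time, followed by a simple decision rule) can be sketched as follows. The abstract does not spell out the integration or the rule, so this sketch assumes an average over frames and a pick-the-largest rule; the function name, the class pairs, and the frame outputs are all hypothetical.

```python
def integrate_and_decide(frame_outputs, class_pairs):
    """Average per-frame class scores over time and return the winning pair.

    frame_outputs: list of frames, each a list with one score per output neuron
    class_pairs:   the (T60 in seconds, DRR in dB) pair tied to each neuron
    Assumption: integration is a plain mean and the decision rule is argmax.
    """
    n_frames = len(frame_outputs)
    averaged = [sum(column) / n_frames for column in zip(*frame_outputs)]
    best = max(range(len(averaged)), key=averaged.__getitem__)
    return class_pairs[best]


# Three hypothetical output classes and three frames of made-up scores.
class_pairs = [(0.3, 5.0), (0.6, 0.0), (0.9, -5.0)]
frame_outputs = [
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.4, 0.3, 0.3],
]
estimate = integrate_and_decide(frame_outputs, class_pairs)
```

Averaging before deciding means a single noisy frame (the third one here) does not flip the final estimate.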

Proceedings ArticleDOI
TL;DR: This work approaches the challenging problem of generating highlights from sports broadcasts utilizing audio information only by employing a language-independent, multi-stage classification approach for detection of key acoustic events which then act as a platform for summarization of highlight scenes.
Abstract: We approach the challenging problem of generating highlights from sports broadcasts utilizing audio information only. A language-independent, multi-stage classification approach is employed for detection of key acoustic events which then act as a platform for summarization of highlight scenes. Objective results and human experience indicate that our system is highly efficient.

Posted Content
TL;DR: This work trains a convolutional deep neural network, on a two-speaker cocktail party problem, to make probabilistic predictions about binary masks, illustrating that relatively simple deep neural networks are capable of robust binary mask prediction.
Abstract: Separation of competing speech is a key challenge in signal processing and a feat routinely performed by the human auditory brain. A long standing benchmark of the spectrogram approach to source separation is known as the ideal binary mask. Here, we train a convolutional deep neural network, on a two-speaker cocktail party problem, to make probabilistic predictions about binary masks. Our results approach ideal binary mask performance, illustrating that relatively simple deep neural networks are capable of robust binary mask prediction. We also illustrate the trade-off between prediction statistics and separation quality.
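
For readers unfamiliar with the benchmark named in the abstract above, an ideal binary mask can be computed when both source spectrograms are known: a time-frequency bin is kept (1) when the target dominates the interferer and discarded (0) otherwise. The sketch below assumes magnitude spectrograms stored as nested lists and a 0 dB local criterion; the function names are hypothetical, and this illustrates the benchmark rather than the network in the paper.

```python
def ideal_binary_mask(target_mag, interferer_mag):
    """Return 1 where the target magnitude exceeds the interferer, else 0."""
    return [
        [1 if t > i else 0 for t, i in zip(target_frame, interferer_frame)]
        for target_frame, interferer_frame in zip(target_mag, interferer_mag)
    ]


def apply_mask(mask, mixture_mag):
    """Keep only the mixture bins selected by the mask."""
    return [
        [m * x for m, x in zip(mask_frame, mix_frame)]
        for mask_frame, mix_frame in zip(mask, mixture_mag)
    ]


# Toy 2-frame, 3-bin magnitudes with made-up values.
target = [[5.0, 0.5, 2.0], [0.1, 3.0, 1.0]]
interferer = [[1.0, 4.0, 2.0], [2.0, 0.2, 0.5]]
mixture = [[6.0, 4.5, 4.0], [2.1, 3.2, 1.5]]

mask = ideal_binary_mask(target, interferer)
separated = apply_mask(mask, mixture)
```

The mask is "ideal" only in the sense that it needs the true sources; a trained network has to predict it from the mixture alone.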

Posted Content
TL;DR: This paper focuses on the fingerprint design step for which various audio features and their tractable statistical models are discussed, and presents a more up-to-date review of existing algorithms.
Abstract: Audio fingerprinting, also known as audio hashing, is well established as a powerful technique for audio identification and synchronization. It basically involves two major steps: fingerprint (voice pattern) design and matching search. While the first step concerns the derivation of a robust and compact audio signature, the second step usually requires knowledge about database and quick-search algorithms. Though this technique offers a wide range of real-world applications, to the best of the authors' knowledge, a comprehensive survey of existing algorithms appeared more than eight years ago. Thus, in this paper, we present a more up-to-date review and, to emphasize the audio signal processing aspect, we focus our state-of-the-art survey on the fingerprint design step, for which various audio features and their tractable statistical models are discussed.

Posted Content
TL;DR: In this paper, a single channel data-driven method for nonintrusive estimation of full-band reverberation time and direct-to-reverberant ratio is presented.
Abstract: We present a single channel data driven method for non-intrusive estimation of full-band reverberation time and full-band direct-to-reverberant ratio. The method extracts a number of features from reverberant speech and builds a model using a recurrent neural network to estimate the reverberant acoustic parameters. We explore three configurations by including different data and also by combining the recurrent neural network estimates using a support vector machine. Our best method to estimate DRR provides a Root Mean Square Deviation (RMSD) of 3.84 dB and a RMSD of 43.19 % for T60 estimation.

Journal ArticleDOI
TL;DR: An acoustic reverberator consisting of a network of delay lines connected via scattering junctions is proposed, which renders the first-order reflections exactly, while making progressively coarser approximations of higher- order reflections.
Abstract: An acoustic reverberator consisting of a network of delay lines connected via scattering junctions is proposed. All parameters of the reverberator are derived from physical properties of the enclosure it simulates. It allows for simulation of unequal and frequency-dependent wall absorption, as well as directional sources and microphones. The reverberator renders the first-order reflections exactly, while making progressively coarser approximations of higher-order reflections. The rate of energy decay is close to that obtained with the image method (IM) and consistent with the predictions of Sabine and Eyring equations. The time evolution of the normalized echo density, which was previously shown to be correlated with the perceived texture of reverberation, is also close to that of IM. However, its computational complexity is one to two orders of magnitude lower, comparable to the computational complexity of a feedback delay network (FDN), and its memory requirements are negligible.
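
The Sabine and Eyring equations referenced in the abstract above predict the reverberation time from room volume, surface area, and average absorption. A minimal sketch follows, assuming metric units, the commonly used constant 0.161 s/m, and a single average absorption coefficient; the function names and the example room are hypothetical and not taken from the paper.

```python
import math

SABINE_CONSTANT = 0.161  # seconds per metre, approximate value for air


def sabine_rt60(volume_m3, surface_m2, mean_absorption):
    """Sabine estimate: T60 = 0.161 * V / (S * alpha)."""
    return SABINE_CONSTANT * volume_m3 / (surface_m2 * mean_absorption)


def eyring_rt60(volume_m3, surface_m2, mean_absorption):
    """Eyring estimate: T60 = 0.161 * V / (-S * ln(1 - alpha))."""
    return SABINE_CONSTANT * volume_m3 / (-surface_m2 * math.log(1.0 - mean_absorption))


# Hypothetical 5 m x 4 m x 3 m room with an average absorption of 0.2.
length, width, height = 5.0, 4.0, 3.0
volume = length * width * height
surface = 2.0 * (length * width + length * height + width * height)

t60_sabine = sabine_rt60(volume, surface, 0.2)
t60_eyring = eyring_rt60(volume, surface, 0.2)
```

For any absorption between 0 and 1 the Eyring value is shorter than the Sabine value, and the two converge as absorption becomes small.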

Posted Content
TL;DR: This contribution presents four algorithms developed by the authors for single-channel fullband and subband T60 estimation within the ACE challenge, including a new algorithm in which the RT estimates for the lower octave subbands are extrapolated from the RT estimates of the upper subbands by means of a simple model for the frequency-dependency of the subband RT.
Abstract: This contribution presents four algorithms developed by the authors for single-channel fullband and subband T60 estimation within the ACE challenge. The blind estimation of the fullband reverberation time (RT) by maximum-likelihood (ML) estimation based on [15] is considered as the baseline approach. An improvement of this algorithm is devised where an energy-weighted averaging of the upper subband RT estimates is performed using either a DCT or 1/3-octave filter-bank. The evaluation results show that this approach leads to a lower variance for the estimation error in comparison to the baseline approach at the price of an increased computational complexity. Moreover, a new algorithm to estimate the subband RT is devised, where the RT estimates for the lower octave subbands are extrapolated from the RT estimates of the upper subbands by means of a simple model for the frequency-dependency of the subband RT. The evaluation results of the ACE challenge reveal that this approach allows the subband RT to be estimated with an estimation error in a similar range as for the presented fullband RT estimators.

Posted Content
TL;DR: The experimental results revealed that the proposed method was able to effectively estimate the DRR from a recording of a reverberant speech signal which included various environmental noise.
Abstract: A method for estimation of direct-to-reverberant ratio (DRR) using a microphone array is proposed. The proposed method estimates the power spectral density (PSD) of the direct sound and the reverberation using the algorithm "PSD estimation in beamspace" with a microphone array and calculates the DRR of the observed signal. The speech corpus of the ACE (Acoustic Characterisation of Environments) Challenge was utilised for evaluating the practical feasibility of the proposed method. The experimental results revealed that the proposed method was able to effectively estimate the DRR from a recording of a reverberant speech signal which included various environmental noise.

Posted Content
TL;DR: In this article, a cappella flamenco singing in debla and martinete styles is analyzed using a combination of manual and automatic description, and the similarity measure is assessed by inspecting the clusters obtained through phylogenetic algorithms and by relating similarity to categorization in terms of style.
Abstract: This work focuses on the topic of melodic characterization and similarity in a specific musical repertoire: a cappella flamenco singing, more specifically in debla and martinete styles. We propose the combination of manual and automatic description. First, we use a state-of-the-art automatic transcription method to account for general melodic similarity from music recordings. Second, we define a specific set of representative mid-level melodic features, which are manually labeled by flamenco experts. Both approaches are then contrasted and combined into a global similarity measure. This similarity measure is assessed by inspecting the clusters obtained through phylogenetic algorithms and by relating similarity to categorization in terms of style. Finally, we discuss the advantage of combining automatic and expert annotations as well as the need to include repertoire-specific descriptions for meaningful melodic characterization in traditional music collections.

Posted Content
TL;DR: In this paper, an algorithm involving Mel-Frequency Cepstral Coefficients (MFCCs) was provided to perform signal feature extraction for the task of speaker accent recognition.
Abstract: An algorithm involving Mel-Frequency Cepstral Coefficients (MFCCs) is provided to perform signal feature extraction for the task of speaker accent recognition. Then different classifiers are compared based on the MFCC feature. For each signal, the mean vector of the MFCC matrix is used as an input vector for pattern recognition. A sample of 330 signals, containing 165 US voices and 165 non-US voices, is analyzed. By comparison, k-nearest neighbors yields the highest average test accuracy, after using a cross-validation of size 500, and requires the least computation time.
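
The pipeline in the abstract above (mean vector of each MFCC matrix, then a k-nearest-neighbors classifier) can be sketched without any audio library. The sketch assumes the MFCC matrices already exist as lists of frames; the tiny two-dimensional vectors, the labels, and the function names are hypothetical stand-ins, and the cross-validation step from the abstract is not reproduced.

```python
import math
from collections import Counter


def mean_vector(mfcc_frames):
    """Collapse an MFCC matrix (frames x coefficients) to its per-coefficient mean."""
    n_frames = len(mfcc_frames)
    return [sum(column) / n_frames for column in zip(*mfcc_frames)]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    by_distance = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up 2-coefficient "mean MFCC" vectors for four training signals.
train_vectors = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_labels = ["US", "US", "non-US", "non-US"]

# A made-up MFCC matrix with two frames for the signal to classify.
query = mean_vector([[0.0, 0.2], [0.2, 0.0]])
predicted = knn_predict(train_vectors, train_labels, query, k=3)
```

Averaging over frames discards temporal order, which keeps the feature vector small and fixed-length regardless of signal duration.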

Posted Content
TL;DR: In this article, the authors proposed a novel space to represent music pieces by developing: a) a method to adjust a description from its original scale of observation to a general scale, b) the concept of higher order entropy as the entropy associated to the deviations of a frequency ranked symbol profile from a perfect Zipf profile.
Abstract: Polyphonic music files were analyzed using the set of symbols that produced the Minimal Entropy Description, which we call the Fundamental Scale. This allowed us to create a novel space to represent music pieces by developing: a) a method to adjust a description from its original scale of observation to a general scale, b) the concept of higher order entropy as the entropy associated to the deviations of a frequency ranked symbol profile from a perfect Zipf profile. We called this diversity index the "2nd Order Entropy". Applying these methods to a variety of musical pieces showed how the space of "symbolic specific diversity-entropy" and that of "2nd order entropy" capture characteristics that are unique to each music type, style, composer and genre. Some clustering of these properties around each musical category is shown. This method makes it possible to visualize a historic trajectory of academic music across this space, from medieval to contemporary academic music. We show that description of musical structures using entropy and symbolic diversity makes it possible to characterize traditional and popular expressions of music. These classification techniques promise to be useful in other disciplines for pattern recognition and machine learning, for example.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method to estimate the direct-path relative transfer function (DP-RTF) from noisy and reverberant microphone signals in the short-time Fourier transform domain.
Abstract: This paper addresses the problem of binaural localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and reverberations. The direct-path relative transfer function (DP-RTF) is defined as the ratio between the direct-path acoustic transfer functions of the two channels. We propose a method to estimate the DP-RTF from the noisy and reverberant microphone signals in the short-time Fourier transform (STFT) domain. First, the convolutive transfer function approximation is adopted to accurately represent the impulse response of the sensors in the STFT domain. Second, the DP-RTF is estimated by using the auto- and cross-power spectral densities at each frequency and over multiple frames. In the presence of stationary noise, an inter-frame spectral subtraction algorithm is proposed, which enables the estimation of noise-free auto- and cross-power spectral densities. Finally, the estimated DP-RTFs are concatenated across frequencies and used as a feature vector for the localization of the speech source. Experiments with both simulated and real data show that the proposed localization method performs well, even under severe adverse acoustic conditions, and outperforms state-of-the-art localization methods under most of the acoustic conditions.

Posted Content
TL;DR: In this article, the authors used circular statistics to provide a convenient probabilistic estimate of spectrogram phase in a complex convolutional deep neural network (DNN) for source separation.
Abstract: Convolutional deep neural networks (DNN) are state of the art in many engineering problems but have not yet addressed the issue of how to deal with complex spectrograms. Here, we use circular statistics to provide a convenient probabilistic estimate of spectrogram phase in a complex convolutional DNN. In a typical cocktail party source separation scenario, we trained a convolutional DNN to re-synthesize the complex spectrograms of two source speech signals given a complex spectrogram of the monaural mixture - a discriminative deep transform (DT). We then used this complex convolutional DT to obtain probabilistic estimates of the magnitude and phase components of the source spectrograms. Our separation results are on a par with equivalent binary-mask based non-complex separation approaches.

Posted Content
TL;DR: A time-frequency masking based speech enhancement front-end is proposed to suppress the environmental noise utilizing multi-channel coherence and spatial cues and obtains consistent improvements with over 57% relative word error rate reduction on the real-data test set.
Abstract: In this paper, the Lingban entry to the third 'CHiME' speech separation and recognition challenge is presented. A time-frequency masking based speech enhancement front-end is proposed to suppress the environmental noise utilizing multi-channel coherence and spatial cues. The state-of-the-art speech recognition techniques, namely recurrent neural network based acoustic and language modeling, state space minimum Bayes risk based discriminative acoustic modeling, and i-vector based acoustic condition modeling, are carefully integrated into the speech recognition back-end. To further improve the system performance by fully exploiting the advantages of different technologies, the final recognition results are obtained by lattice combination and rescoring. Evaluations carried out on the official dataset prove the effectiveness of the proposed systems. Compared with the best baseline result, the proposed system obtains consistent improvements with over 57% relative word error rate reduction on the real-data test set.

Posted Content
TL;DR: In this paper, the role of delay and the limits imposed by additive Gaussian noise are studied along with the computation of the cumulants and probability density functions of individual quadratic forms and their ratios.
Abstract: The Teager-Kaiser energy operator (TKO) belongs to a class of autocorrelators and their linear combinations that can track the instantaneous energy of a nonstationary sinusoidal signal source. TKO-based monocomponent AM-FM demodulation algorithms work under the basic assumption that the operator outputs are always positive. In the absence of noise, this is assured for pure sinusoidal inputs and the instantaneous property is also guaranteed. Noise invalidates both of these, particularly under small signal conditions. Post-detection filtering and thresholding are of use to reestablish these at the cost of some time to acquire. Key questions are: (a) how many samples must one use and (b) how much noise power at the detector input can one tolerate. Results of a study of the role of delay and the limits imposed by additive Gaussian noise are presented along with the computation of the cumulants and probability density functions of the individual quadratic forms and their ratios.
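For orientation, the discrete-time Teager-Kaiser operator is commonly written as psi[n] = x[n]^2 - x[n-1]*x[n+1]. The sketch below is a minimal illustration of the positivity assumption mentioned in the abstract, not the paper's analysis: the function name `teager_kaiser`, the test-signal parameters, and the three-sample perturbed input are all hypothetical.

```python
import math

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output has two fewer samples than the input, because the first and
    last samples lack a neighbour on one side.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Hypothetical noise-free sinusoid A*cos(omega*n + phase).
amplitude, omega, phase = 2.0, 0.3, 0.1
signal = [amplitude * math.cos(omega * n + phase) for n in range(50)]
energy = teager_kaiser(signal)

# For a pure sinusoid the operator output is the constant A^2 * sin(omega)^2.
expected = (amplitude * math.sin(omega)) ** 2

# A hypothetical perturbed snippet: a small middle sample between two large
# neighbours drives the operator output negative.
perturbed_energy = teager_kaiser([1.0, 0.1, 1.0])
```

For the clean sinusoid every output sample equals `expected` up to rounding, so the output stays positive. The perturbed snippet yields a negative value, which is the kind of failure that the post-detection filtering and thresholding described in the abstract are meant to handle.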

Posted Content
TL;DR: This paper proposes a practical approach to estimate the direct-to-reverberant energy ratio using a spherical microphone array without having knowledge of the source signal, based on a theoretical relationship between the DRR and the coherence estimation function between coincident pressure and particle velocity.
Abstract: This paper proposes a practical approach to estimate the direct-to-reverberant energy ratio (DRR) using a spherical microphone array without having knowledge of the source signal. We base our estimation on a theoretical relationship between the DRR and the coherence estimation function between coincident pressure and particle velocity. We discuss the proposed method's ability to estimate the DRR in a wide variety of room sizes, reverberation times and source-receiver distances with appropriate examples. Test results show that the method can estimate the room DRR for frequencies between 199 and 2511 Hz, with $\pm 3$ dB accuracy.

Posted Content
TL;DR: In this paper, the authors introduce a simple method to model the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns.
Abstract: Many current paradigms for acoustic event detection (AED) are not adapted to the organic variability of natural sounds, and/or they assume a limit on the number of simultaneous sources: often only one source, or one source of each type, may be active. These aspects are highly undesirable for applications such as bird population monitoring. We introduce a simple method modelling the onsets, durations and offsets of acoustic events to avoid intrinsic limits on polyphony or on inter-event temporal patterns. We evaluate the method in a case study with over 3000 zebra finch calls. In comparison against an HMM-based method we find it more accurate at recovering acoustic events, and more robust for estimating calling rates.