
Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2010"


Journal ArticleDOI
TL;DR: In this article, a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals, is considered.
Abstract: We consider inference in a general data-driven object-based model of multichannel audio data, assumed generated as a possibly underdetermined convolutive mixture of source signals. We work in the short-time Fourier transform (STFT) domain, where convolution is routinely approximated as linear instantaneous mixing in each frequency band. Each source STFT is given a model inspired by nonnegative matrix factorization (NMF) with the Itakura-Saito divergence, which underlies a statistical model of superimposed Gaussian components. We address estimation of the mixing and source parameters using two methods. The first one consists of maximizing the exact joint likelihood of the multichannel data using an expectation-maximization (EM) algorithm. The second method consists of maximizing the sum of individual likelihoods of all channels using a multiplicative update algorithm inspired by NMF methodology. Our decomposition algorithms are applied to stereo audio source separation in various settings, covering blind and supervised separation, music and speech sources, synthetic instantaneous and convolutive mixtures, as well as professionally produced music recordings. Our EM method produces results competitive with the state of the art, as illustrated on two tasks from the international Signal Separation Evaluation Campaign (SiSEC 2008).

636 citations
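
For context, the single-channel building block behind the paper's second method is Itakura-Saito NMF with multiplicative updates. Below is a minimal NumPy sketch of those updates; the function name, initialization, and iteration count are illustrative, not taken from the paper.

    import numpy as np

    def is_nmf(V, K, n_iter=200, eps=1e-12):
        """Multiplicative-update NMF minimizing the Itakura-Saito divergence.

        V: (F, N) nonnegative power spectrogram; K: number of components.
        Returns basis spectra W (F, K) and activations H (K, N).
        """
        F, N = V.shape
        rng = np.random.default_rng(0)
        W = rng.random((F, K)) + eps
        H = rng.random((K, N)) + eps
        for _ in range(n_iter):
            Vh = W @ H + eps
            W *= ((V / Vh**2) @ H.T) / ((1.0 / Vh) @ H.T)   # IS-divergence update for W
            Vh = W @ H + eps
            H *= (W.T @ (V / Vh**2)) / (W.T @ (1.0 / Vh))   # IS-divergence update for H
        return W, H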


Journal ArticleDOI
TL;DR: NDLP can robustly estimate an inverse system for late reverberation in the presence of noise without greatly distorting a direct speech signal and can be implemented in a computationally efficient manner in the time-frequency domain.
Abstract: This paper proposes a statistical model-based speech dereverberation approach that can cancel the late reverberation of a reverberant speech signal captured by distant microphones without prior knowledge of the room impulse responses. With this approach, the generative model of the captured signal is composed of a source process, which is assumed to be a Gaussian process with a time-varying variance, and an observation process modeled by a delayed linear prediction (DLP). The optimization objective for the dereverberation problem is derived to be the sum of the squared prediction errors normalized by the source variances; hence, this approach is referred to as variance-normalized delayed linear prediction (NDLP). Inheriting the characteristic of DLP, NDLP can robustly estimate an inverse system for late reverberation in the presence of noise without greatly distorting a direct speech signal. In addition, owing to the use of variance normalization, NDLP allows us to improve the dereverberation result especially with relatively short (of the order of a few seconds) observations. Furthermore, NDLP can be implemented in a computationally efficient manner in the time-frequency domain. Experimental results demonstrate the effectiveness and efficiency of the proposed approach in comparison with two existing approaches.

371 citations
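
The optimization described above alternates between updating the time-varying source variances and re-solving a variance-weighted least-squares problem for the delayed prediction coefficients. A minimal single-channel, single-frequency-bin sketch in NumPy, with illustrative names and hyperparameters:

    import numpy as np

    def ndlp_dereverb(X, delay=3, order=10, n_iter=5, eps=1e-8):
        """Variance-normalized delayed linear prediction for one STFT bin.

        X: (T,) complex STFT sequence of one frequency bin. Late reverberation
        is predicted from frames at least `delay` frames in the past and
        subtracted; the prediction is re-weighted by the source variances.
        """
        T = X.shape[0]
        Xbar = np.zeros((T, order), dtype=complex)   # delayed regression matrix
        for k in range(order):
            shift = delay + k
            Xbar[shift:, k] = X[:T - shift]
        d = X.copy()
        for _ in range(n_iter):
            lam = np.maximum(np.abs(d)**2, eps)      # time-varying source variances
            A = (Xbar.conj().T / lam) @ Xbar         # variance-normalized normal equations
            b = (Xbar.conj().T / lam) @ X
            g = np.linalg.solve(A + eps * np.eye(order), b)
            d = X - Xbar @ g                         # subtract predicted late reverberation
        return d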


Journal ArticleDOI
TL;DR: In this article, the contribution of each source to all mixture channels in the time-frequency domain was modeled as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source.
Abstract: This paper addresses the modeling of reverberant recording environments in the context of under-determined convolutive blind source separation. We model the contribution of each source to all mixture channels in the time-frequency domain as a zero-mean Gaussian random variable whose covariance encodes the spatial characteristics of the source. We then consider four specific covariance models, including a full-rank unconstrained model. We derive a family of iterative expectation-maximization (EM) algorithms to estimate the parameters of each model and propose suitable procedures adapted from the state-of-the-art to initialize the parameters and to align the order of the estimated sources across all frequency bins. Experimental results over reverberant synthetic mixtures and live recordings of speech data show the effectiveness of the proposed approach.

368 citations
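
Under this local Gaussian model, once the source variances and spatial covariances are estimated, each source's spatial image is recovered by a multichannel Wiener filter. A minimal sketch (array shapes and names are assumptions for illustration):

    import numpy as np

    def wiener_source_images(x, v, R):
        """Multichannel Wiener estimates of the source spatial images.

        x: (F, T, M) mixture STFT; v: (J, F, T) source variances;
        R: (J, F, M, M) spatial covariance matrices (full-rank allowed).
        Returns y: (J, F, T, M) estimated source images.
        """
        J, F, T = v.shape
        M = x.shape[-1]
        y = np.zeros((J, F, T, M), dtype=complex)
        for f in range(F):
            for t in range(T):
                Sigma = sum(v[j, f, t] * R[j, f] for j in range(J))  # mixture covariance
                Sigma_inv = np.linalg.inv(Sigma + 1e-9 * np.eye(M))
                for j in range(J):
                    y[j, f, t] = v[j, f, t] * R[j, f] @ Sigma_inv @ x[f, t]
        return y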


Journal ArticleDOI
TL;DR: Experimental results show the proposed measure outperforming three standard algorithms for tasks involving estimation of multiple dimensions of perceived coloration, as well as quality measurement and intelligibility estimation of reverberant and dereverberated speech.
Abstract: A modulation spectral representation is investigated for non-intrusive quality and intelligibility measurement of reverberant and dereverberated speech. The representation is obtained by means of an auditory-inspired filterbank analysis of critical-band temporal envelopes of the speech signal. Modulation spectral insights are used to develop an adaptive measure termed speech to reverberation modulation energy ratio. Experimental results show the proposed measure outperforming three standard algorithms for tasks involving estimation of multiple dimensions of perceived coloration, as well as quality measurement and intelligibility estimation of reverberant and dereverberated speech.

352 citations
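
The measure rests on the ratio of low-frequency to high-frequency modulation energy in the temporal envelopes of acoustic bands. The sketch below is a crude stand-in: it uses STFT bands instead of the paper's auditory-inspired filterbank, and the modulation split frequency is a guess.

    import numpy as np
    from scipy.signal import stft

    def modulation_energy_ratio(x, fs, n_fft=512, hop=128, split_hz=30.0):
        """Rough low-to-high modulation energy ratio of a speech signal."""
        f, t, Z = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)
        env = np.abs(Z)                              # per-band temporal envelopes
        fs_env = fs / hop                            # envelope sampling rate
        env = env - env.mean(axis=1, keepdims=True)
        M = np.abs(np.fft.rfft(env, axis=1))**2      # modulation spectrum per band
        mf = np.fft.rfftfreq(env.shape[1], d=1.0 / fs_env)
        low = M[:, (mf >= 1.0) & (mf < split_hz)].sum()
        high = M[:, mf >= split_hz].sum()
        return low / (high + 1e-12)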


Journal ArticleDOI
TL;DR: This work formally shows that the minimum variance distortionless response (MVDR) filter is a particular case of the PMWF by properly formulating the constrained optimization problem of noise reduction, and proposes new simplified expressions for the PMWF, the MVDR, and the generalized sidelobe canceller that depend on the signals' statistics only.
Abstract: Several contributions have been made so far to develop optimal multichannel linear filtering approaches and show their ability to reduce the acoustic noise. However, there has not been a clear unifying theoretical analysis of their performance in terms of both noise reduction and speech distortion. To fill this gap, we analyze the frequency-domain (non-causal) multichannel linear filtering for noise reduction in this paper. For completeness, we consider the noise reduction constrained optimization problem that leads to the parameterized multichannel non-causal Wiener filter (PMWF). Our contribution is fivefold. First, we formally show that the minimum variance distortionless response (MVDR) filter is a particular case of the PMWF by properly formulating the constrained optimization problem of noise reduction. Second, we propose new simplified expressions for the PMWF, the MVDR, and the generalized sidelobe canceller (GSC) that depend on the signals' statistics only. In contrast to earlier works, these expressions are explicitly independent of the channel transfer function ratios. Third, we quantify the theoretical gains and losses in terms of speech distortion and noise reduction when using the PMWF by establishing new simplified closed-form expressions for three performance measures, namely, the signal distortion index, the noise reduction factor (originally proposed in the paper titled "New insights into the noise reduction Wiener filter," by J. Chen (IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, no. 4, pp. 1218-1234, Jul. 2006) to analyze the single channel time-domain Wiener filter), and the output signal-to-noise ratio (SNR). Fourth, we analyze the effects of coherent and incoherent noise in addition to the benefits of utilizing multiple microphones. Fifth, we propose a new proof for the a posteriori SNR improvement achieved by the PMWF. Finally, we provide some simulation results to corroborate the findings of this work.

317 citations
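
One commonly cited statistics-only form of the PMWF writes the filter directly in terms of the noisy and noise covariance matrices, with the MVDR filter recovered as the zero-distortion special case. A minimal sketch for one frequency bin (the trade-off parameterization and names are assumptions for illustration):

    import numpy as np

    def pmwf(Phi_y, Phi_v, beta=1.0, ref=0):
        """Parameterized multichannel Wiener filter from signal statistics only.

        Phi_y: (M, M) noisy covariance; Phi_v: (M, M) noise covariance;
        beta = 0 gives the MVDR filter, beta = 1 the multichannel Wiener
        filter; ref selects the reference microphone.
        """
        M = Phi_y.shape[0]
        B = np.linalg.solve(Phi_v, Phi_y) - np.eye(M)   # Phi_v^{-1} Phi_y - I
        lam = np.trace(B).real                          # tr(Phi_v^{-1} Phi_y) - M
        return B[:, ref] / (beta + lam)

    # h_mvdr = pmwf(Phi_y, Phi_v, beta=0.0)   # MVDR as a special case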


Journal ArticleDOI
TL;DR: This paper describes a model-based expectation-maximization source separation and localization system for separating and localizing multiple sound sources from an underdetermined reverberant two-channel recording, and creates probabilistic spectrogram masks that can be used for source separation.
Abstract: This paper describes a system, referred to as model-based expectation-maximization source separation and localization (MESSL), for separating and localizing multiple sound sources from an underdetermined reverberant two-channel recording. By clustering individual spectrogram points based on their interaural phase and level differences, MESSL generates masks that can be used to isolate individual sound sources. We first describe a probabilistic model of interaural parameters that can be evaluated at individual spectrogram points. By creating a mixture of these models over sources and delays, the multi-source localization problem is reduced to a collection of single source problems. We derive an expectation-maximization algorithm for computing the maximum-likelihood parameters of this mixture model, and show that these parameters correspond well with interaural parameters measured in isolation. As a byproduct of fitting this mixture model, the algorithm creates probabilistic spectrogram masks that can be used for source separation. In simulated anechoic and reverberant environments, separations using MESSL produced on average a signal-to-distortion ratio 1.6 dB greater and perceptual evaluation of speech quality (PESQ) results 0.27 mean opinion score units greater than four comparable algorithms.

317 citations
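
The interaural cues that MESSL clusters are computed directly from the two-channel STFT. A minimal sketch of just the feature computation; the mixture model over sources and delays, whose per-source posteriors become the probabilistic masks, is omitted.

    import numpy as np

    def interaural_features(L, R, eps=1e-12):
        """Per-bin interaural cues from a two-channel STFT (L, R: (F, T) complex)."""
        ipd = np.angle(L * np.conj(R))     # interaural phase difference (radians)
        ild = 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level diff (dB)
        return ipd, ild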


Journal ArticleDOI
TL;DR: A new method for the estimation of multiple concurrent pitches in piano recordings is presented, which addresses the issue of overlapping overtones by modeling the spectral envelope of the overtones of each note with a smooth autoregressive model.
Abstract: A new method for the estimation of multiple concurrent pitches in piano recordings is presented. It addresses the issue of overlapping overtones by modeling the spectral envelope of the overtones of each note with a smooth autoregressive model. For the background noise, a moving-average model is used, and the combination of both tends to eliminate harmonic and sub-harmonic erroneous pitch estimations. This leads to a complete generative spectral model for simultaneous piano notes, which also explicitly includes the typical deviation from exact harmonicity in a piano overtone series. The pitch set that maximizes an approximate likelihood is then selected from among a restricted number of possible pitch combinations. Tests have been conducted on a large homemade database called MAPS, composed of piano recordings from a real upright piano and from high-quality samples.

314 citations
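
The "typical deviation from exact harmonicity" of piano tones is conventionally modeled by the stiff-string relation f_k = k * f0 * sqrt(1 + B * k^2). A short sketch; the value of the inharmonicity coefficient B below is an illustrative guess, not a value from the paper.

    import numpy as np

    def piano_partials(f0, n_partials, B=4e-4):
        """Partial frequencies of a stiff string: f_k = k * f0 * sqrt(1 + B * k^2)."""
        k = np.arange(1, n_partials + 1)
        return k * f0 * np.sqrt(1.0 + B * k**2)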


Journal ArticleDOI
TL;DR: The CALO-MA architecture and its speech recognition and understanding components, which include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization are presented.
Abstract: The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multiparty meetings, and is part of the larger CALO personal assistant system. This paper presents the CALO-MA architecture and its speech recognition and understanding components, which include real-time and offline speech transcription, dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization.

295 citations


Journal ArticleDOI
TL;DR: An NMF-like algorithm is derived that performs similarly to supervised NMF using pre-trained piano spectra but improves pitch estimation performance by 6% to 10% compared to alternative unsupervised NMF algorithms.
Abstract: Multiple pitch estimation consists of estimating the fundamental frequencies and saliences of pitched sounds over short time frames of an audio signal. This task forms the basis of several applications in the particular context of musical audio. One approach is to decompose the short-term magnitude spectrum of the signal into a sum of basis spectra representing individual pitches scaled by time-varying amplitudes, using algorithms such as nonnegative matrix factorization (NMF). Prior training of the basis spectra is often infeasible due to the wide range of possible musical instruments. Appropriate spectra must then be adaptively estimated from the data, which may result in limited performance due to overfitting issues. In this paper, we model each basis spectrum as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, thus enforcing harmonicity and spectral smoothness while adapting the spectral envelope to each instrument. We derive an NMF-like algorithm to estimate the model parameters and evaluate it on a database of piano recordings, considering several choices for the narrowband spectra. The proposed algorithm performs similarly to supervised NMF using pre-trained piano spectra but improves pitch estimation performance by 6% to 10% compared to alternative unsupervised NMF algorithms.

271 citations
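
A minimal sketch of the constrained basis construction: each narrowband spectrum covers a few adjacent partials of a given pitch, and the pitch's basis spectrum is a nonnegative combination of these columns, which enforces harmonicity while leaving the envelope free. Band counts, partial grouping, and the Gaussian partial shape are assumptions for illustration.

    import numpy as np

    def harmonic_subspectra(f0, freqs, n_bands=8, partials_per_band=3, width=10.0):
        """Narrowband spectra, each spanning a few adjacent partials of pitch f0.

        freqs: (F,) bin frequencies in Hz. Returns an (F, n_bands) nonnegative
        matrix whose nonnegative combinations form smooth harmonic basis spectra.
        """
        E = np.zeros((len(freqs), n_bands))
        for b in range(n_bands):
            for p in range(partials_per_band):
                h = b * partials_per_band + p + 1          # partial index
                E[:, b] += np.exp(-0.5 * ((freqs - h * f0) / width) ** 2)
        return E / (E.max(axis=0, keepdims=True) + 1e-12)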


Journal ArticleDOI
TL;DR: A voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker is proposed and it is demonstrated that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.
Abstract: In this paper, we use artificial neural networks (ANNs) for voice conversion and exploit the mapping abilities of an ANN model to perform mapping of spectral features of a source speaker to those of a target speaker. A comparative study of voice conversion using an ANN model and the state-of-the-art Gaussian mixture model (GMM) is conducted. The results of voice conversion, evaluated using subjective and objective measures, confirm that an ANN-based VC system performs as well as a GMM-based VC system, and that the quality of the transformed speech is intelligible and possesses the characteristics of the target speaker. In this paper, we also address the issue of dependency of voice conversion techniques on parallel data between the source and the target speakers. While there have been efforts to use nonparallel data and speaker adaptation techniques, it is important to investigate techniques which capture speaker-specific characteristics of a target speaker, and avoid any need for the source speaker's data either for training or for adaptation. In this paper, we propose a voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker and demonstrate that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.

269 citations
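
In sketch form, the conversion is a regression from aligned source to target spectral features. Below, scikit-learn's MLPRegressor stands in for the paper's ANN; the feature dimension, layer sizes, and random placeholder data are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X_src = rng.standard_normal((1000, 25))   # aligned source spectral features
    X_tgt = rng.standard_normal((1000, 25))   # corresponding target features

    ann = MLPRegressor(hidden_layer_sizes=(50, 50), activation='tanh',
                       max_iter=500, random_state=0)
    ann.fit(X_src, X_tgt)              # learn the source -> target spectral mapping
    X_conv = ann.predict(X_src)        # converted spectral features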


Journal ArticleDOI
TL;DR: A tandem algorithm is proposed that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively and performs substantially better than previous systems for either pitch extraction or voiced speech segregation.
Abstract: A lot of effort has been made in computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. The performance of current CASA systems on voiced speech segregation is limited by the lack of a robust pitch estimation algorithm. We propose a tandem algorithm that performs pitch estimation of a target utterance and segregation of voiced portions of target speech jointly and iteratively. This algorithm first obtains a rough estimate of the target pitch, and then uses this estimate to segregate target speech using harmonicity and temporal continuity. It then improves both pitch estimation and voiced speech segregation iteratively. Novel methods are proposed for performing segregation with a given pitch estimate and pitch determination with given segregation. Systematic evaluation shows that the tandem algorithm extracts a majority of target speech without including much interference, and it performs substantially better than previous systems for either pitch extraction or voiced speech segregation.

Journal ArticleDOI
TL;DR: A corpus called MIR-1K (multimedia information retrieval lab, 1000 song clips) is constructed, in which all singing voices and music accompaniments were recorded separately, and the performance of separating voiced singing is enhanced via a spectral subtraction method.
Abstract: Monaural singing voice separation is an extremely challenging problem. While efforts in pitch-based inference methods have led to considerable progress in voiced singing voice separation, little attention has been paid to the incapability of such methods to separate unvoiced singing voice due to its inharmonic structure and weaker energy. In this paper, we propose a systematic approach to identify and separate the unvoiced singing voice from the music accompaniment. We have also enhanced the performance of separating voiced singing via a spectral subtraction method. The proposed system follows the framework of computational auditory scene analysis (CASA), which consists of a segmentation stage and a grouping stage. In the segmentation stage, the input song signals are decomposed into small sensory elements in different time-frequency resolutions. The unvoiced sensory elements are then identified by Gaussian mixture models. The experimental results demonstrate that the quality of the separated singing voice is improved for both the unvoiced and voiced parts. Moreover, to deal with the lack of a publicly available dataset for singing voice separation, we have constructed a corpus called MIR-1K (multimedia information retrieval lab, 1000 song clips), where all singing voices and music accompaniments were recorded separately. Each song clip comes with human-labeled pitch values, unvoiced sounds and vocal/non-vocal segments, and lyrics, as well as the speech recording of the lyrics.
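
The voiced-singing enhancement step relies on spectral subtraction; in its textbook form, a scaled accompaniment magnitude estimate is subtracted and a spectral floor applied. A minimal sketch with illustrative parameter values:

    import numpy as np

    def spectral_subtract(Y, N_est, alpha=2.0, beta=0.02):
        """Magnitude spectral subtraction with a spectral floor.

        Y: (F, T) complex STFT of the voiced-singing estimate; N_est: (F,)
        estimated accompaniment magnitude per band. alpha and beta are
        illustrative over-subtraction and floor factors.
        """
        mag = np.abs(Y)
        clean = np.maximum(mag - alpha * N_est[:, None], beta * mag)
        return clean * np.exp(1j * np.angle(Y))    # reuse the noisy phase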

Journal ArticleDOI
TL;DR: Bayesian NMF with harmonicity and temporal continuity constraints is shown to outperform other standard NMF-based transcription systems, providing a meaningful mid-level representation of the data.
Abstract: This paper presents theoretical and experimental results about constrained non-negative matrix factorization (NMF) in a Bayesian framework. A model of superimposed Gaussian components including harmonicity is proposed, while temporal continuity is enforced through an inverse-Gamma Markov chain prior. We then exhibit a space-alternating generalized expectation-maximization (SAGE) algorithm to estimate the parameters. Computational time is reduced by initializing the system with an original variant of multiplicative harmonic NMF, which is described as well. The algorithm is then applied to perform polyphonic piano music transcription. It is compared to other state-of-the-art algorithms, especially NMF-based ones. Convergence issues are also discussed from both a theoretical and an experimental point of view. Bayesian NMF with harmonicity and temporal continuity constraints is shown to outperform other standard NMF-based transcription systems, providing a meaningful mid-level representation of the data. However, temporal smoothness has its drawbacks, as far as transients are concerned in particular, and can be detrimental to transcription performance when it is the only constraint used. Possible improvements of the temporal prior are discussed.

Journal ArticleDOI
TL;DR: A technique is proposed to combine PLS with GMMs, enabling the use of multiple local linear mappings in voice conversion, and to low-pass filter the component posterior probabilities to improve the perceptual quality of the mapping.
Abstract: Voice conversion can be formulated as finding a mapping function which transforms the features of the source speaker to those of the target speaker. Gaussian mixture model (GMM)-based conversion is commonly used, but it is subject to overfitting. In this paper, we propose to use partial least squares (PLS)-based transforms in voice conversion. To prevent overfitting, the degrees of freedom in the mapping can be controlled by choosing a suitable number of components. We propose a technique to combine PLS with GMMs, enabling the use of multiple local linear mappings. To further improve the perceptual quality of the mapping where rapid transitions between GMM components produce audible artefacts, we propose to low-pass filter the component posterior probabilities. The conducted experiments show that the proposed technique results in better subjective and objective quality than the baseline joint density GMM approach. In speech quality conversion preference tests, the proposed method achieved 67% preference score against the smoothed joint density GMM method and 84% preference score against the unsmoothed joint density GMM method. In objective tests the proposed method produced a lower Mel-cepstral distortion than the reference methods.
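
A minimal sketch of the two ingredients named above: scikit-learn's PLSRegression caps the degrees of freedom through its component count, and a moving average stands in for the low-pass filter on component posteriors. Shapes, component counts, filter length, and the random placeholder data are assumptions.

    import numpy as np
    from scipy.signal import lfilter
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 24))    # aligned source features
    Y = rng.standard_normal((2000, 24))    # aligned target features

    pls = PLSRegression(n_components=8)    # few components to limit overfitting
    pls.fit(X, Y)
    Y_hat = pls.predict(X)

    # Low-pass filter GMM component posteriors over time to avoid audible
    # artefacts from rapid switching between local mappings.
    post = rng.random((2000, 4))
    post /= post.sum(axis=1, keepdims=True)
    post_smooth = lfilter(np.ones(5) / 5.0, [1.0], post, axis=0)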

Journal ArticleDOI
TL;DR: A new signal model is proposed in which the leading vocal part is explicitly represented by a specific source/filter model; the approach reaches state-of-the-art performance on all test sets.
Abstract: Extracting the main melody from a polyphonic music recording seems natural even to untrained human listeners. To a certain extent it is related to the concept of source separation, with the human ability of focusing on a specific source in order to extract relevant information. In this paper, we propose a new approach for the estimation and extraction of the main melody (and in particular the leading vocal part) from polyphonic audio signals. To that aim, we propose a new signal model where the leading vocal part is explicitly represented by a specific source/filter model. The proposed representation is investigated in the framework of two statistical models: a Gaussian Scaled Mixture Model (GSMM) and an extended Instantaneous Mixture Model (IMM). For both models, the estimation of the different parameters is done within a maximum-likelihood framework adapted from single-channel source separation techniques. The desired sequence of fundamental frequencies is then inferred from the estimated parameters. The results obtained in a recent evaluation campaign (MIREX08) show that the proposed approaches are very promising and reach state-of-the-art performance on all test sets.

Journal ArticleDOI
TL;DR: Compared to standard probabilistic systems, Weighted Frequency Warping results in a significant increase in quality scores, whereas the conversion scores remain almost unaltered.
Abstract: Any modification applied to speech signals has an impact on their perceptual quality. In particular, voice conversion to modify a source voice so that it is perceived as a specific target voice involves prosodic and spectral transformations that produce significant quality degradation. Choosing among the current voice conversion methods represents a trade-off between the similarity of the converted voice to the target voice and the quality of the resulting converted speech, both rated by listeners. This paper presents a new voice conversion method termed Weighted Frequency Warping that has a good balance between similarity and quality. This method uses a time-varying piecewise-linear frequency warping function and an energy correction filter, and it combines typical probabilistic techniques and frequency warping transformations. Compared to standard probabilistic systems, Weighted Frequency Warping results in a significant increase in quality scores, whereas the conversion scores remain almost unaltered. This paper carefully discusses the theoretical aspects of the method and the details of its implementation, and the results of an international evaluation of the new system are also included.
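
The core operation is a time-varying piecewise-linear warping of the frequency axis. A minimal sketch of applying one such warping to a magnitude spectrum; breakpoint estimation and the energy correction filter from the paper are omitted, and the anchor arrays are illustrative.

    import numpy as np

    def warp_spectrum(mag, freqs, src_anchors, tgt_anchors):
        """Apply a piecewise-linear frequency warping to a magnitude spectrum.

        mag: (F,) magnitudes at bin frequencies freqs (Hz); the warping maps
        src_anchors onto tgt_anchors (both increasing, covering the band).
        """
        inv_warp = np.interp(freqs, tgt_anchors, src_anchors)  # target -> source freq
        return np.interp(inv_warp, freqs, mag)                 # resample the spectrum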

Journal ArticleDOI
TL;DR: This paper proposes an iterative greedy search strategy to estimate F0s one by one, to avoid the combinatorial problem of concurrent F0 estimation, and proposes a polyphony estimation method to terminate the iterative process.
Abstract: This paper presents a maximum-likelihood approach to multiple fundamental frequency (F0) estimation for a mixture of harmonic sound sources, where the power spectrum of a time frame is the observation and the F0s are the parameters to be estimated. When defining the likelihood model, the proposed method models both spectral peaks and non-peak regions (frequencies further than a musical quarter tone from all observed peaks). It is shown that the peak likelihood and the non-peak region likelihood act as a complementary pair. The former helps find F0s that have harmonics that explain peaks, while the latter helps avoid F0s that have harmonics in non-peak regions. Parameters of these models are learned from monophonic and polyphonic training data. This paper proposes an iterative greedy search strategy to estimate F0s one by one, to avoid the combinatorial problem of concurrent F0 estimation. It also proposes a polyphony estimation method to terminate the iterative process. Finally, this paper proposes a postprocessing method to refine polyphony and F0 estimates using neighboring frames. This paper also analyzes the relative contributions of different components of the proposed method. It is shown that the refinement component eliminates many inconsistent estimation errors. Evaluations are done on ten recorded four-part J. S. Bach chorales. Results show that the proposed method achieves superior F0 estimation and polyphony estimation compared to two state-of-the-art algorithms.
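
The greedy strategy can be illustrated with a far cruder score than the paper's peak/non-peak likelihood: rate each candidate F0 by the energy of observed peaks near its harmonics, keep the best, remove the explained peaks, and repeat. All thresholds and names below are illustrative.

    import numpy as np

    def greedy_f0s(peak_freqs, peak_mags, candidates, max_poly=6, n_harm=10, tol=0.03):
        """Greedy one-by-one multiple-F0 search over observed spectral peaks."""
        peaks = list(zip(peak_freqs, peak_mags))
        f0s = []
        for _ in range(max_poly):
            best, best_score, best_used = None, 0.0, []
            for f0 in candidates:
                used = [i for i, (f, m) in enumerate(peaks)
                        if min(abs(f / (h * f0) - 1.0)
                               for h in range(1, n_harm + 1)) < tol]
                score = sum(peaks[i][1] for i in used)
                if score > best_score:
                    best, best_score, best_used = f0, score, used
            if best is None:
                break                      # remaining peaks support no candidate
            f0s.append(best)
            peaks = [p for i, p in enumerate(peaks) if i not in best_used]
        return f0s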

Journal ArticleDOI
TL;DR: This work devises a fully automatic method to simultaneously estimate from an audio waveform the chord sequence including bass notes, the metric positions of chords, and the key, and introduces a measure of segmentation quality, showing that bass and meter modeling are especially beneficial for obtaining the correct level of granularity.
Abstract: Chord labels provide a concise description of musical harmony. In pop and jazz music, a sequence of chord labels is often the only written record of a song, and forms the basis of so-called lead sheets. We devise a fully automatic method to simultaneously estimate from an audio waveform the chord sequence including bass notes, the metric positions of chords, and the key. The core of the method is a six-layered dynamic Bayesian network, in which the four hidden source layers jointly model metric position, key, chord, and bass pitch class, while the two observed layers model low-level audio features corresponding to bass and treble tonal content. Using 109 different chords our method provides substantially more harmonic detail than previous approaches while maintaining a high level of accuracy. We show that with 71% correctly classified chords our method significantly exceeds the state of the art when tested against manually annotated ground truth transcriptions on the 176 audio tracks from the MIREX 2008 Chord Detection Task. We introduce a measure of segmentation quality and show that bass and meter modeling are especially beneficial for obtaining the correct level of granularity.

Journal ArticleDOI
TL;DR: The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction, and shows that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only.
Abstract: The minimum variance distortionless response (MVDR) beamformer, also known as Capon's beamformer, is widely studied in the area of speech enhancement. The MVDR beamformer can be used for both speech dereverberation and noise reduction. This paper provides new insights into the MVDR beamformer. Specifically, the local and global behavior of the MVDR beamformer is analyzed and novel forms of the MVDR filter are derived and discussed. In earlier works it was observed that there is a tradeoff between the amount of speech dereverberation and noise reduction when the MVDR beamformer is used. Here, the tradeoff between speech dereverberation and noise reduction is analyzed thoroughly. The local and global behavior, as well as the tradeoff, is analyzed for different noise fields, such as a mixture of coherent and non-coherent noise fields, entirely non-coherent noise fields, and diffuse noise fields. It is shown that maximum noise reduction is achieved when the MVDR beamformer is used for noise reduction only. The amount of noise reduction that is sacrificed when complete dereverberation is required depends on the direct-to-reverberation ratio of the acoustic impulse response between the source and the reference microphone. The performance evaluation supports the theoretical analysis and demonstrates the tradeoff between speech dereverberation and noise reduction. When desiring both speech dereverberation and noise reduction, the results also demonstrate that the amount of noise reduction that is sacrificed decreases when the number of microphones increases.

Journal ArticleDOI
TL;DR: The two cues, computed from a two-channel time-frequency representation, are combined to estimate the azimuth of sources in binaural recordings; experiments show that the method compares favorably with available techniques.
Abstract: In this paper, we propose a binaural source localization method based on interaural time differences (ITDs) and interaural level differences (ILDs). The two cues, computed from a two-channel time-frequency representation, are combined in order to estimate the azimuth of sources in binaural recordings. We introduce an individual parametric model for the ITD and ILD, and an average parametric model that removes the need to measure the subjects' head-related impulse responses (HRIRs) for sound localization. We conduct several experiments to validate the proposed approach and show that it compares favorably with available techniques.
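
An average parametric model of the kind mentioned above can be as simple as the spherical-head (Woodworth) ITD formula, inverted by table lookup. A sketch under that assumption; the head radius and grid resolution are illustrative, and the paper's actual models are fitted to measured data.

    import numpy as np

    def itd_to_azimuth(itd, head_radius=0.09, c=343.0):
        """Invert the spherical-head model itd = (a / c) * (theta + sin(theta)).

        Returns the azimuth (radians, front half-plane) whose model ITD is
        closest to the observed value.
        """
        thetas = np.linspace(-np.pi / 2, np.pi / 2, 1801)
        itds = head_radius / c * (thetas + np.sin(thetas))
        return thetas[np.argmin(np.abs(itds - itd))]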

Journal ArticleDOI
TL;DR: This paper proposes a new iterative alignment method that allows pairing phonetically equivalent acoustic vectors from nonparallel utterances from different speakers, even under cross-lingual conditions, and it does not require any phonetic or linguistic information.
Abstract: Most existing voice conversion systems, particularly those based on Gaussian mixture models, require a set of paired acoustic vectors from the source and target speakers to learn their corresponding transformation function. The alignment of phonetically equivalent source and target vectors is not problematic when the training corpus is parallel, which means that both speakers utter the same training sentences. However, in some practical situations, such as cross-lingual voice conversion, it is not possible to obtain such parallel utterances. With an aim towards increasing the versatility of current voice conversion systems, this paper proposes a new iterative alignment method that allows pairing phonetically equivalent acoustic vectors from nonparallel utterances from different speakers, even under cross-lingual conditions. This method is based on existing voice conversion techniques, and it does not require any phonetic or linguistic information. Subjective evaluation experiments show that the performance of the resulting voice conversion system is very similar to that of an equivalent system trained on a parallel corpus.
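
In sketch form, iterative alignment alternates nearest-neighbor pairing of converted source frames with target frames and re-estimation of the conversion function on the induced pairs. The single linear map below is a crude stand-in for the voice conversion techniques the method actually builds on, and the dense distance matrix is only practical for small frame sets.

    import numpy as np

    def iterative_align(X, Y, n_iter=10):
        """Iteratively pair nonparallel source frames X (Nx, D) with target
        frames Y (Ny, D) while refitting a linear conversion map."""
        A = np.eye(X.shape[1])
        nn = None
        for _ in range(n_iter):
            Xc = X @ A                                  # currently converted frames
            d = ((Xc[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            nn = d.argmin(axis=1)                       # nearest target frame per source frame
            A, *_ = np.linalg.lstsq(X, Y[nn], rcond=None)  # refit the map on the pairing
        return A, nn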

Journal ArticleDOI
TL;DR: The diffuse reverberation model presented in this paper produces impulse responses that are representative of the specific virtual environment under consideration (within the general assumptions of geometrical room acoustics), in contrast to other artificial reverberation techniques developed on the basis of perceptual measures or assuming a purely exponential energy decay.
Abstract: In many research fields of engineering and acoustics, the image-source model represents one of the most popular tools for the simulation of sound fields in virtual reverberant environments. This can be seen as a result from the relative simplicity and flexibility of this particular method. However, the associated computational costs constitute a well known drawback of image-source implementations, as the required simulation times grow exponentially with the considered reflection order. This paper proposes a method that allows for a fast synthesis of room impulse responses according to the image-source technique. This is achieved by modeling the diffuse reverberation tail as decaying random noise, where the decay envelope for the considered acoustic environment is determined according to a recently proposed method for the prediction of energy decay curves in image-source simulations. The diffuse reverberation model presented in this paper thus produces impulse responses that are representative of the specific virtual environment under consideration (within the general assumptions of geometrical room acoustics), in contrast to other artificial reverberation techniques developed on the basis of perceptual measures or assuming a purely exponential energy decay. Furthermore, since image-source simulations are only used for the computation of the early reflections, the proposed approach achieves a reduction of the computational requirements by up to two orders of magnitude for the simulation of full-length room impulse responses, compared to a standard image-source implementation.
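
A minimal sketch of the hybrid structure: keep the image-source output for the early reflections and append a noise tail shaped by an exponential decay envelope. Deriving the envelope from a single T60 value is a simplification of the paper's predicted energy decay curves, and the level matching below is an illustrative heuristic.

    import numpy as np

    def hybrid_rir(early, t60, fs, length):
        """Append a decaying-noise diffuse tail to image-source early reflections.

        early: (Ne,) early RIR segment; t60: reverberation time in seconds;
        the tail amplitude follows 10^(-3 t / T60), i.e. -60 dB at t = T60.
        """
        rng = np.random.default_rng(0)
        t = np.arange(length - len(early)) / fs
        env = 10.0 ** (-3.0 * t / t60)
        tail = rng.standard_normal(t.shape) * env
        n = min(100, len(early), len(tail))             # crude level matching
        tail *= np.abs(early[-n:]).mean() / (np.abs(tail[:n]).mean() + 1e-12)
        return np.concatenate([early, tail])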

Journal ArticleDOI
TL;DR: Two extensions of the binaural SDW-MWF are proposed to improve the binaural cue preservation and are able to preserve binaural cues for the speech and noise sources, while still achieving significant noise reduction performance.
Abstract: Binaural hearing aids use microphone signals from both the left and the right hearing aid to generate an output signal for each ear. The microphone signals can be processed by a procedure based on speech distortion weighted multichannel Wiener filtering (SDW-MWF) to achieve significant noise reduction in a speech + noise scenario. In binaural procedures, it is also desirable to preserve binaural cues, in particular the interaural time difference (ITD) and interaural level difference (ILD), which are used to localize sounds. It has been shown in previous work that the binaural SDW-MWF procedure only preserves these binaural cues for the desired speech source, but distorts the noise binaural cues. Two extensions of the binaural SDW-MWF have therefore been proposed to improve the binaural cue preservation, namely the MWF with partial noise estimation (MWF-η) and the MWF with interaural transfer function extension (MWF-ITF). In this paper, the binaural cue preservation of these extensions is analyzed theoretically and tested based on objective performance measures. Both extensions are able to preserve binaural cues for the speech and noise sources, while still achieving significant noise reduction performance.

Journal ArticleDOI
Jingen Ni, Feng Li
TL;DR: This paper proposes a variable step-size matrix NSAF (VSSM-NSAF) from another point of view, i.e., recovering the powers of the subband system noises from those of the subband error signals of the adaptive filter, to further improve the performance of the NSAF.
Abstract: The normalized subband adaptive filter (NSAF) presented by Lee and Gan can obtain a faster convergence rate than the normalized least-mean-square (NLMS) algorithm with colored input signals. However, similar to other fixed step-size adaptive filtering algorithms, the NSAF requires a tradeoff between fast convergence rate and low misadjustment. Recently, a set-membership NSAF (SM-NSAF) has been developed to address this problem. Nevertheless, in order to determine the error bound of the SM-NSAF, the power of the system noise should be known. In this paper, we propose a variable step-size matrix NSAF (VSSM-NSAF) from another point of view, i.e., recovering the powers of the subband system noises from those of the subband error signals of the adaptive filter, to further improve the performance of the NSAF. The VSSM-NSAF uses an effective system noise power estimation method, which can also be applied to the under-modeling scenario, and therefore does not need to know the powers of the subband system noises in advance. In addition, the steady-state mean-square behavior of the proposed algorithm is analyzed, which theoretically proves that the VSSM-NSAF can obtain a low misadjustment. Simulation results show good performance of the new algorithm compared to other members of the NSAF family.

Journal ArticleDOI
TL;DR: Three different sets of experiments conducted on the GTZAN and the ISMIR2004 Genre datasets demonstrate the superiority of NMPCA against the aforementioned subspace analysis techniques in extracting more discriminating features, especially when the training set has small cardinality.
Abstract: Motivated by psychophysiological investigations on the human auditory system, a bio-inspired two-dimensional auditory representation of music signals is exploited, which captures the slow temporal modulations. Although each recording is represented by a second-order tensor (i.e., a matrix), a third-order tensor is needed to represent a music corpus. Non-negative multilinear principal component analysis (NMPCA) is proposed for the unsupervised dimensionality reduction of the third-order tensors. The NMPCA maximizes the total tensor scatter while preserving the non-negativity of the auditory representations. An algorithm for NMPCA is derived by exploiting the structure of the Grassmann manifold. The NMPCA is compared against three multilinear subspace analysis techniques, namely non-negative tensor factorization, high-order singular value decomposition, and multilinear principal component analysis, as well as their linear counterparts, i.e., non-negative matrix factorization, singular value decomposition, and principal component analysis, in extracting features that are subsequently classified by either support vector machine or nearest neighbor classifiers. Three different sets of experiments conducted on the GTZAN and the ISMIR2004 Genre datasets demonstrate the superiority of NMPCA over the aforementioned subspace analysis techniques in extracting more discriminating features, especially when the training set has small cardinality. The best classification accuracies reported in the paper exceed those obtained by the state-of-the-art music genre classification algorithms applied to both datasets.

Journal ArticleDOI
TL;DR: The modeling of social tagging data with third-order tensors, which capture cubic (three-way) correlations between users, tags, and music items, indicates the superiority of the proposed approach compared to existing methods that suppress the cubic relationships that are inherent in social tagging data.
Abstract: Social tagging is becoming increasingly popular in music information retrieval (MIR). It allows users to tag music items like songs, albums, or artists. Social tags are valuable to MIR, because they comprise a multifaceted source of information about genre, style, mood, users' opinions, or instrumentation. In this paper, we examine the problem of personalized music recommendation based on social tags. We propose the modeling of social tagging data with third-order tensors, which capture cubic (three-way) correlations between users, tags, and music items. The discovery of latent structure in this model is performed with the Higher Order Singular Value Decomposition (HOSVD), which helps to provide accurate and personalized recommendations, i.e., adapted to the particular users' preferences. To address the sparsity inherent in social tagging data and further improve the quality of recommendation, we propose to enhance the model with a tag-propagation scheme that uses similarity values computed between the music items based on audio features. As a result, the proposed model effectively combines information about both social tags and audio features. The performance of the proposed method is examined experimentally with real data from Last.fm. Our results indicate the superiority of the proposed approach compared to existing methods that suppress the cubic relationships that are inherent in social tagging data. Additionally, our results suggest that the combination of social tagging data with audio features is preferable to the sole use of the former.
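
A minimal NumPy sketch of the truncated HOSVD applied to a users x tags x items tensor; the tag-propagation enhancement is omitted and the ranks are illustrative. The reconstructed low-rank tensor supplies the smoothed, personalized recommendation scores.

    import numpy as np

    def unfold(T, mode):
        """Mode-n unfolding of a third-order tensor."""
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def hosvd3(T, ranks):
        """Truncated HOSVD: factor matrices per mode plus the low-rank reconstruction."""
        U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :r]
             for n, r in enumerate(ranks)]
        core = T
        for n in range(3):                       # project onto the leading subspaces
            core = np.moveaxis(np.tensordot(U[n].T, np.moveaxis(core, n, 0), axes=1), 0, n)
        T_hat = core
        for n in range(3):                       # map back to the original spaces
            T_hat = np.moveaxis(np.tensordot(U[n], np.moveaxis(T_hat, n, 0), axes=1), 0, n)
        return U, T_hat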

Journal ArticleDOI
TL;DR: This correspondence establishes a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used and proposes a new multichannel approach that can significantly increase the detection accuracy.
Abstract: The knowledge of the target speech presence probability in a mixture of signals captured by a speech communication system is of paramount importance in several applications including reliable noise reduction algorithms. In this correspondence, we establish a new expression for speech presence probability when an array of microphones with an arbitrary geometry is used. Our study is based on the assumption of the Gaussian statistical model for all signals and involves the noise and noisy data statistics only. In comparison with the single-channel case, the new proposed multichannel approach can significantly increase the detection accuracy. In particular, when the additive noise is spatially coherent, perfect speech presence detection is theoretically possible, while when the noise is spatially white, a coherent summation of speech components is performed to allow for enhanced speech presence probability estimation.
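
For orientation, the single-channel special case of the Gaussian-model speech presence probability has a well-known closed form, sketched below; the paper's contribution is the generalization of this model to an arbitrary microphone array, which is not reproduced here.

    import numpy as np

    def speech_presence_prob(gamma, xi, q=0.5):
        """Single-channel Gaussian-model speech presence probability.

        gamma: a posteriori SNR; xi: a priori SNR; q: prior probability of
        speech absence.
        """
        v = gamma * xi / (1.0 + xi)
        Lambda = (1.0 - q) / q * np.exp(v) / (1.0 + xi)   # generalized likelihood ratio
        return Lambda / (1.0 + Lambda)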

Journal ArticleDOI
TL;DR: This paper focuses on the investigation of the spatial truncation and discretization of the secondary source distribution occurring in real-world implementations and presents a rigorous analysis of evanescent and propagating components in the reproduced sound field.
Abstract: In this paper, we consider physical reproduction of sound fields via planar and linear distributions of secondary sources (i.e., loudspeakers). The presented approach employs a formulation of the reproduction equation in spatial frequency domain which is explicitly solved for the secondary source driving signals. Wave field synthesis (WFS), the alternative formulation, can be shown to be equivalent under equal assumptions. Unlike the WFS formulation, the presented approach does not employ a far-field approximation when linear secondary source distributions are considered but provides exact results. We focus on the investigation of the spatial truncation and discretization of the secondary source distribution occurring in real-world implementations and present a rigorous analysis of evanescent and propagating components in the reproduced sound field.

Journal ArticleDOI
TL;DR: Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods, sampling by uncertainty and density (SUD) and density-based re-ranking.
Abstract: To solve the knowledge bottleneck problem, active learning has been widely used for its ability to automatically select the most informative unlabeled examples for human annotation. One of the key enabling techniques of active learning is uncertainty sampling, which uses one classifier to identify unlabeled examples with the least confidence. Uncertainty sampling often presents problems when outliers are selected. To solve the outlier problem, this paper presents two techniques, sampling by uncertainty and density (SUD) and density-based re-ranking. Both techniques prefer not only the most informative example in terms of uncertainty criterion, but also the most representative example in terms of density criterion. Experimental results of active learning for word sense disambiguation and text classification tasks using six real-world evaluation data sets demonstrate the effectiveness of the proposed methods.
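
A minimal sketch of sampling by uncertainty and density: score each unlabeled example by its predictive entropy times its average similarity to its K nearest neighbors, so that uncertain outliers are demoted. The normalization, K, and the multiplicative combination are illustrative choices.

    import numpy as np

    def sud_select(probs, X, k=10, knn=20):
        """Pick k examples by uncertainty (entropy) weighted by local density.

        probs: (N, C) classifier posteriors; X: (N, D) L2-normalized features.
        """
        ent = -(probs * np.log(probs + 1e-12)).sum(axis=1)      # uncertainty
        sim = X @ X.T                                           # cosine similarities
        np.fill_diagonal(sim, -np.inf)
        density = np.sort(sim, axis=1)[:, -knn:].mean(axis=1)   # avg sim to K neighbors
        return np.argsort(ent * density)[-k:]                   # top-k scores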

Journal ArticleDOI
TL;DR: A frequency-domain technique based on PARAllel FACtor (PARAFAC) analysis that performs multichannel blind source separation (BSS) of convolutive speech mixtures and a low-complexity adaptive version of the BSS algorithm is proposed that can track changes in the mixing environment.
Abstract: We present a frequency-domain technique based on PARAllel FACtor (PARAFAC) analysis that performs multichannel blind source separation (BSS) of convolutive speech mixtures. PARAFAC algorithms are combined with a dimensionality reduction step to significantly reduce computational complexity. The identifiability potential of PARAFAC is exploited to derive a BSS algorithm for the under-determined case (more speakers than microphones), combining PARAFAC analysis with time-varying Capon beamforming. Finally, a low-complexity adaptive version of the BSS algorithm is proposed that can track changes in the mixing environment. Extensive experiments with realistic and measured data corroborate our claims, including the under-determined case. Signal-to-interference ratio improvements of up to 6 dB are shown compared to state-of-the-art BSS algorithms, at an order of magnitude lower computational complexity.