Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2008"
TL;DR: The evaluation of correlations of several objective measures with these three subjective rating scales is reported on and several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
Abstract: In this paper, we evaluate the performance of several objective measures in terms of predicting the quality of noisy speech enhanced by noise suppression algorithms. The objective measures considered a wide range of distortions introduced by four types of real-world noise at two signal-to-noise ratio levels by four classes of speech enhancement algorithms: spectral subtractive, subspace, statistical-model based, and Wiener algorithms. The subjective quality ratings were obtained using the ITU-T P.835 methodology designed to evaluate the quality of enhanced speech along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports on the evaluation of correlations of several objective measures with these three subjective rating scales. Several new composite objective measures are also proposed by combining the individual objective measures using nonparametric and parametric regression analysis techniques.
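As a rough illustration of the evaluation described above, the sketch below correlates objective measures with subjective ratings and then builds a composite measure by linear regression. It covers only a parametric (least-squares) fit, not the nonparametric techniques the abstract also mentions. The names `measure_a` and `measure_b` and all numeric values are hypothetical placeholders, not data from the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def fit_composite(m1, m2, ratings):
    """Least-squares fit of ratings ~ b0 + b1*m1 + b2*m2 via the normal equations."""
    n = len(ratings)
    a1, a2, ar = sum(m1) / n, sum(m2) / n, sum(ratings) / n
    c1 = [v - a1 for v in m1]
    c2 = [v - a2 for v in m2]
    cr = [v - ar for v in ratings]
    s11 = sum(u * u for u in c1)
    s22 = sum(u * u for u in c2)
    s12 = sum(u * w for u, w in zip(c1, c2))
    s1r = sum(u * w for u, w in zip(c1, cr))
    s2r = sum(u * w for u, w in zip(c2, cr))
    det = s11 * s22 - s12 * s12  # zero only if the two measures are collinear
    b1 = (s1r * s22 - s12 * s2r) / det
    b2 = (s11 * s2r - s12 * s1r) / det
    b0 = ar - b1 * a1 - b2 * a2
    return b0, b1, b2

# Hypothetical toy data: two objective measures and one subjective rating scale.
measure_a = [1.2, 1.8, 2.1, 2.9, 3.4, 3.9]
measure_b = [0.9, 0.4, 1.1, 0.6, 1.5, 1.0]
ratings = [1.5, 1.9, 2.6, 2.8, 3.9, 4.0]

b0, b1, b2 = fit_composite(measure_a, measure_b, ratings)
composite = [b0 + b1 * a + b2 * b for a, b in zip(measure_a, measure_b)]

print("corr(measure_a, ratings):", pearson(measure_a, ratings))
print("corr(measure_b, ratings):", pearson(measure_b, ratings))
print("corr(composite, ratings):", pearson(composite, ratings))
```

On the data used for fitting, the composite cannot correlate worse with the ratings than either individual measure; whether that advantage holds on unseen data has to be checked separately.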
TL;DR: It is shown that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types.
Abstract: We propose a new approach to the problem of estimating the hyperparameters which define the interspeaker variability model in joint factor analysis. We tested the proposed estimation technique on the NIST 2006 speaker recognition evaluation data and obtained 10%-15% reductions in error rates on the core condition and the extended data condition (as measured both by equal error rates and the NIST detection cost function). We show that when a large joint factor analysis model is trained in this way and tested on the core condition, the extended data condition and the cross-channel condition, it is capable of performing at least as well as fusions of multiple systems of other types. (The comparisons are based on the best results on these tasks that have been reported in the literature.) In the case of the cross-channel condition, a factor analysis model with 300 speaker factors and 200 channel factors can achieve equal error rates of less than 3.0%. This is a substantial improvement over the best results that have previously been reported on this task.
TL;DR: The interesting part of the results is that the epoch extraction by the proposed method seems to be robust against degradations like white noise, babble, high-frequency channel, and vehicle noise.
Abstract: Epoch is the instant of significant excitation of the vocal-tract system during production of speech. For most voiced speech, the most significant excitation takes place around the instant of glottal closure. Extraction of epochs from speech is a challenging task due to time-varying characteristics of the source and the system. Most epoch extraction methods attempt to remove the characteristics of the vocal-tract system, in order to emphasize the excitation characteristics in the residual. The performance of such methods depends critically on our ability to model the system. In this paper, we propose a method for epoch extraction which does not depend critically on characteristics of the time-varying vocal-tract system. The method exploits the nature of impulse-like excitation. The proposed zero resonance frequency filter output brings out the epoch locations with high accuracy and reliability. The performance of the method is demonstrated on the CMU-Arctic database, using the epoch information from the electroglottograph as reference. The proposed method performs significantly better than the other methods currently available for epoch extraction. The interesting part of the results is that the epoch extraction by the proposed method seems to be robust against degradations like white noise, babble, high-frequency channel, and vehicle noise.
TL;DR: A computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a text-based query is presented.
Abstract: We present a computer audition system that can both annotate novel audio tracks with semantically meaningful words and retrieve relevant tracks from a database of unlabeled audio content given a text-based query. We consider the related tasks of content-based audio annotation and retrieval as one supervised multiclass, multilabel problem in which we model the joint probability of acoustic features and words. We collect a data set of 1700 human-generated annotations that describe 500 Western popular music tracks. For each word in a vocabulary, we use this data to train a Gaussian mixture model (GMM) over an audio feature space. We estimate the parameters of the model using the weighted mixture hierarchies expectation maximization algorithm. This algorithm is more scalable to large data sets and produces better density estimates than standard parameter estimation techniques. The quality of the music annotations produced by our system is comparable with the performance of humans on the same task. Our "query-by-text" system can retrieve appropriate songs for a large number of musically relevant words. We also show that our audition system is general by learning a model that can annotate and retrieve sound effects.
TL;DR: This paper formulates MER as a regression problem to predict the arousal and valence values (AV values) of each music sample directly, applies the regression approach to detect the emotion variation within a music selection, and finds the prediction accuracy superior to existing works.
Abstract: Content-based retrieval has emerged in the face of content explosion as a promising approach to information access. In this paper, we focus on the challenging issue of recognizing the emotion content of music signals, or music emotion recognition (MER). Specifically, we formulate MER as a regression problem to predict the arousal and valence values (AV values) of each music sample directly. Associated with the AV values, each music sample becomes a point in the arousal-valence plane, so the users can efficiently retrieve the music sample by specifying a desired point in the emotion plane. Because no categorical taxonomy is used, the regression approach is free of the ambiguity inherent to conventional categorical approaches. To improve the performance, we apply principal component analysis to reduce the correlation between arousal and valence, and RReliefF to select important features. An extensive performance study is conducted to evaluate the accuracy of the regression approach for predicting AV values. The best performance evaluated in terms of the R² statistics reaches 58.3% for arousal and 28.1% for valence by employing support vector machine as the regressor. We also apply the regression approach to detect the emotion variation within a music selection and find the prediction accuracy superior to existing works. A group-wise MER scheme is also developed to address the subjectivity issue of emotion perception.
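Two ideas in this abstract are easy to make concrete: the R² statistic used to score the regressor, and retrieval by a desired point in the arousal-valence plane. The sketch below assumes the standard coefficient of determination and a plain Euclidean nearest-neighbor lookup; the abstract specifies neither. The track names and AV values are hypothetical.

```python
from math import hypot

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

def retrieve_nearest(library, arousal, valence, k=1):
    """Return the k track names whose predicted (arousal, valence) lie closest to the query point."""
    ranked = sorted(
        library,
        key=lambda name: hypot(library[name][0] - arousal, library[name][1] - valence),
    )
    return ranked[:k]

# Hypothetical predicted AV values, each in [-1, 1].
library = {
    "track_a": (0.8, 0.6),
    "track_b": (-0.5, 0.2),
    "track_c": (0.1, -0.7),
}

print(retrieve_nearest(library, arousal=0.7, valence=0.5))
print(r_squared([0.2, 0.4, 0.9], [0.25, 0.35, 0.8]))
```

An R² of 1 means the predictions match the annotations exactly, while 0 means the regressor does no better than always predicting the mean annotation.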
TL;DR: It is shown how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream by exploiting the structure of repeating patterns within the speech signal.
Abstract: We present a novel approach to speech processing based on the principle of pattern discovery. Our work represents a departure from traditional models of speech recognition, where the end goal is to classify speech into categories defined by a prespecified inventory of lexical units (i.e., phones or words). Instead, we attempt to discover such an inventory in an unsupervised manner by exploiting the structure of repeating patterns within the speech signal. We show how pattern discovery can be used to automatically acquire lexical entities directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multiword phrases. On a corpus of academic lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical identities are relevant to the underlying audio stream.
TL;DR: A new technique for audio signal comparison based on tonal subsequence alignment and its application to detect cover versions (i.e., different performances of the same underlying musical piece) is presented.
Abstract: We present a new technique for audio signal comparison based on tonal subsequence alignment and its application to detect cover versions (i.e., different performances of the same underlying musical piece). Cover song identification is a task whose popularity has increased in the music information retrieval (MIR) community over time, as it provides a direct and objective way to evaluate music similarity algorithms. This paper first presents a series of experiments carried out with two state-of-the-art methods for cover song identification. We have studied several components of these (such as chroma resolution and similarity, transposition, beat tracking or dynamic time warping constraints), in order to discover which characteristics would be desirable for a competitive cover song identifier. After analyzing many cross-validated results, the importance of these characteristics is discussed, and the best performing ones are finally applied to the newly proposed method. Multiple evaluations of this method confirm a large increase in identification accuracy when comparing it with alternative state-of-the-art approaches.
TL;DR: It is shown that in the context of noise reduction the squared PCC has many appealing properties and can be used as an optimization cost function to derive many optimal and suboptimal noise-reduction filters.
Abstract: Noise reduction, which aims at estimating a clean speech from noisy observations, has attracted a considerable amount of research and engineering attention over the past few decades. In the single-channel scenario, an estimate of the clean speech can be obtained by passing the noisy signal picked up by the microphone through a linear filter/transformation. The core issue, then, is how to find an optimal filter/transformation such that, after the filtering process, the signal-to-noise ratio (SNR) is improved but the desired speech signal is not noticeably distorted. Most of the existing optimal filters (such as the Wiener filter and subspace transformation) are formulated from the mean-square error (MSE) criterion. However, with the MSE formulation, many desired properties of the optimal noise-reduction filters such as the SNR behavior cannot be seen. In this paper, we present a new criterion based on the Pearson correlation coefficient (PCC). We show that in the context of noise reduction the squared PCC (SPCC) has many appealing properties and can be used as an optimization cost function to derive many optimal and suboptimal noise-reduction filters. The clear advantage of using the SPCC over the MSE is that the noise-reduction performance (in terms of the SNR improvement and speech distortion) of the resulting optimal filters can be easily analyzed. This shows that, as far as noise reduction is concerned, the SPCC-based cost function serves as a more natural criterion to optimize as compared to the MSE.
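To make the criterion concrete, the sketch below computes the squared Pearson correlation coefficient between a clean signal and a noisy observation. For zero-mean speech x and uncorrelated noise v with y = x + v, the definition gives SPCC(x, y) = SNR / (1 + SNR), which the toy signals reproduce exactly because they are constructed to be orthogonal. The signals are hypothetical and no filter from the paper is implemented here.

```python
def spcc(a, b):
    """Squared Pearson correlation coefficient between two equal-length signals."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (w - mb) for u, w in zip(a, b)) / n
    var_a = sum((u - ma) ** 2 for u in a) / n
    var_b = sum((w - mb) ** 2 for w in b) / n
    return cov * cov / (var_a * var_b)

# Hypothetical zero-mean "speech" and "noise", chosen so that sum(x*v) == 0.
x = [1.0, -1.0, 1.0, -1.0]
v = [0.5, 0.5, -0.5, -0.5]
y = [xi + vi for xi, vi in zip(x, v)]  # noisy observation

signal_power = sum(s * s for s in x) / len(x)
noise_power = sum(s * s for s in v) / len(v)
snr = signal_power / noise_power

print("SPCC(x, y)     =", spcc(x, y))
print("SNR / (1 + SNR) =", snr / (1.0 + snr))
```

The identity shows why the SPCC exposes SNR behavior directly: raising the SNR of the filtered signal raises its SPCC with the clean speech.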
TL;DR: The proposed method outperformed two reference methods in the evaluations and showed a high level of robustness in processing signals where important parts of the audible spectrum were deleted to simulate bandlimited interference.
Abstract: A method is described for estimating the fundamental frequencies of several concurrent sounds in polyphonic music and multiple-speaker speech signals. The method consists of a computational model of the human auditory periphery, followed by a periodicity analysis mechanism where fundamental frequencies are iteratively detected and canceled from the mixture signal. The auditory model needs to be computed only once, and a computationally efficient strategy is proposed for implementing it. Simulation experiments were made using mixtures of musical sounds and mixed speech utterances. The proposed method outperformed two reference methods in the evaluations and showed a high level of robustness in processing signals where important parts of the audible spectrum were deleted to simulate bandlimited interference. Different system configurations were studied to identify the conditions where pitch analysis using an auditory model is advantageous over conventional time or frequency domain approaches.
TL;DR: Experimental results show that in many cases the resulting segmentations correspond well to conventional notions of musical form, and the constrained clustering approach is shown to extend to prior musical knowledge, input from other machine approaches, or semi-supervision.
Abstract: We describe a method of segmenting musical audio into structural sections based on a hierarchical labeling of spectral features. Frames of audio are first labeled as belonging to one of a number of discrete states using a hidden Markov model trained on the features. Histograms of neighboring frames are then clustered into segment-types representing distinct distributions of states, using a clustering algorithm in which temporal continuity is expressed as a set of constraints modeled by a hidden Markov random field. We give experimental results which show that in many cases the resulting segmentations correspond well to conventional notions of musical form. We show further how the constrained clustering approach can easily be extended to include prior musical knowledge, input from other machine approaches, or semi-supervision.
TL;DR: This paper builds an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates, and focuses on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level.
Abstract: With the advent of prosody annotation standards such as tones and break indices (ToBI), speech technologists and linguists alike have been interested in automatically detecting prosodic events in speech. This is because the prosodic tier provides an additional layer of information over the short-term segment-level features and lexical representation of an utterance. As the prosody of an utterance is closely tied to its syntactic and semantic content in addition to its lexical content, knowledge of the prosodic events within and across utterances can assist spoken language applications such as automatic speech recognition and translation. On the other hand, corpora annotated with prosodic events are useful for building natural-sounding speech synthesizers. In this paper, we build an automatic detector and classifier for prosodic events in American English, based on their acoustic, lexical, and syntactic correlates. Following previous work in this area, we focus on accent (prominence, or "stress") and prosodic phrase boundary detection at the syllable level. Our experiments achieved a performance rate of 86.75% agreement on the accent detection task, and 91.61% agreement on the phrase boundary detection task on the Boston University Radio News Corpus.
TL;DR: A probabilistic generative model unifies the collaborative and content-based data in a principled way; the resulting system accurately recommended pieces, including nonrated ones, from a wide variety of artists and maintained a high degree of accuracy even when new users and rating scores were added.
Abstract: This paper presents a hybrid music recommender system that ranks musical pieces while efficiently maintaining collaborative and content-based data, i.e., rating scores given by users and acoustic features of audio signals. This hybrid approach overcomes the conventional tradeoff between recommendation accuracy and variety of recommended artists. Collaborative filtering, which is used on e-commerce sites, cannot recommend nonrated pieces and provides a narrow variety of artists. Content-based filtering does not have satisfactory accuracy because it is based on the heuristic that the user's favorite pieces will have similar musical content, despite there being exceptions. To attain a higher recommendation accuracy along with a wider variety of artists, we use a probabilistic generative model that unifies the collaborative and content-based data in a principled way. This model can explain the generative mechanism of the observed data in terms of probability theory. The probability distribution over users, pieces, and features is decomposed into three conditionally independent ones by introducing latent variables. This decomposition enables us to efficiently and incrementally adapt the model for increasing numbers of users and rating scores. We evaluated our system by using audio signals of commercial CDs and their corresponding rating scores obtained from an e-commerce site. The results revealed that our system accurately recommended pieces including nonrated ones from a wide variety of artists and maintained a high degree of accuracy even when new users and rating scores were added.
TL;DR: This paper proposes a VSS-APA derived in the context of AEC that aims to recover the near-end signal within the error signal of the adaptive filter and is robust against near-end signal variations (including double-talk).
Abstract: The adaptive algorithms used for acoustic echo cancellation (AEC) have to provide (1) high convergence rates and good tracking capabilities, since the acoustic environments imply very long and time-variant echo paths, and (2) low misadjustment and robustness against background noise variations and double-talk. In this context, the affine projection algorithm (APA) and different versions of it are very attractive choices for AEC. However, an APA with a constant step-size parameter has to compromise between the performance criteria (1) and (2). Therefore, a variable step-size APA (VSS-APA) represents a more reliable solution. In this paper, we propose a VSS-APA derived in the context of AEC. Most of the APAs aim to cancel p (i.e., projection order) previous a posteriori errors at every step of the algorithm. The proposed VSS-APA aims to recover the near-end signal within the error signal of the adaptive filter. Consequently, it is robust against near-end signal variations (including double-talk). This algorithm does not require any a priori information about the acoustic environment, so that it is easy to control in practice. The simulation results indicate the good performance of the proposed algorithm as compared to other members of the APA family.
TL;DR: A novel parametrization of speech based on the AM-FM representation of the speech signal is presented, and the utility of these features is assessed in the context of speaker identification.
Abstract: This paper presents an experimental evaluation of different features for use in speaker identification. The features are tested using speech data provided by the CHAINS corpus, in a closed-set speaker identification task. The main objective of the paper is to present a novel parametrization of speech that is based on the AM-FM representation of the speech signal and to assess the utility of these features in the context of speaker identification. In order to explore the extent to which different instantaneous frequencies due to the presence of formants and harmonics in the speech signal may predict a speaker's identity, this work evaluates three different decompositions of the speech signal within the same AM-FM framework: a first setup has been used previously for formant tracking, a second setup is designed to enhance familiar resonances below 4000 Hz, and a third setup is designed to approximate the bandwidth scaling of the filters conventionally used in the extraction of Mel-frequency cepstral coefficients (MFCCs). From each of the proposed setups, parameters are extracted and used in a closed-set text-independent speaker identification task. The performance of the new featural representation is compared with results obtained adopting MFCC and RASTA-PLP features in the context of a generic Gaussian mixture model (GMM) classification system. In evaluating the novel features, we look selectively at information for speaker identification contained in the frequency range 0-4000 Hz and 4000-8000 Hz, as the instantaneous frequencies revealed by the AM-FM approach suggest the presence of structures not well known from conventional spectrographic analyses. The accuracy obtained using the new parametrization is as good as that of conventional MFCC parameters within the same reference system, when tested and trained on modally voiced speech which is mismatched in both channel and style. When the testing material is whispered speech, the new parameters provide better results than any of the other features tested, although they remain far from ideal in this limiting case.
TL;DR: An acoustic chord transcription system is described that uses symbolic data to train hidden Markov models and gives best-of-class frame-level recognition results; the tonal centroid feature is shown to be robust and to outperform the conventional chroma feature.
Abstract: We describe an acoustic chord transcription system that uses symbolic data to train hidden Markov models and gives best-of-class frame-level recognition results. We avoid the extremely laborious task of human annotation of chord names and boundaries (which must be done to provide machine learning models with ground truth) by performing automatic harmony analysis on symbolic music files. In parallel, we synthesize audio from the same symbolic files and extract acoustic feature vectors which are in perfect alignment with the labels. We, therefore, generate a large set of labeled training data with a minimal amount of human labor. This allows for richer models. Thus, we build 24 key-dependent HMMs, one for each key, using the key information derived from symbolic data. Each key model defines a unique state-transition characteristic and helps avoid confusions seen in the observation vector. Given acoustic input, we identify a musical key by choosing a key model with the maximum likelihood, and we obtain the chord sequence from the optimal state path of the corresponding key model, both of which are returned by a Viterbi decoder. This not only increases the chord recognition accuracy, but also gives key information. Experimental results show the models trained on synthesized data perform very well on real recordings, even though the labels automatically generated from symbolic data are not 100% accurate. We also demonstrate the robustness of the tonal centroid feature, which outperforms the conventional chroma feature.
TL;DR: The proposed noise tracking method can accurately track fast changes in noise power level and improvements in segmental signal-to-noise ratio of more than 1 dB can be obtained for the most nonstationary noise sources at high noise levels.
Abstract: This paper considers estimation of the noise spectral variance from speech signals contaminated by highly nonstationary noise sources. The method can accurately track fast changes in noise power level (up to about 10 dB/s). In each time frame, for each frequency bin, the noise variance estimate is updated recursively with the minimum mean-square error (MMSE) estimate of the current noise power. A time- and frequency-dependent smoothing parameter is used, which is varied according to an estimate of speech presence probability. In this way, the amount of speech power leaking into the noise estimates is kept low. For the estimation of the noise power, a spectral gain function is used, which is found by an iterative data-driven training method. The proposed noise tracking method is tested on various stationary and nonstationary noise sources, for a wide range of signal-to-noise ratios, and compared with two state-of-the-art methods. When used in a speech enhancement system, improvements in segmental signal-to-noise ratio of more than 1 dB can be obtained for the most nonstationary noise sources at high noise levels.
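The recursive update with a speech-presence-dependent smoothing parameter can be sketched for a single frequency bin as follows. This is only an illustration of the general structure: the linear mapping from speech presence probability to the smoothing parameter, the constant `noise_gain` standing in for the trained spectral gain function, and the value of `alpha_min` are all assumptions and are not taken from the paper.

```python
def update_noise_variance(prev_var, noisy_power, speech_prob,
                          noise_gain=1.0, alpha_min=0.8):
    """One recursive noise-variance update for one time-frequency bin.

    prev_var     noise variance estimate from the previous frame
    noisy_power  |Y|^2 of the current noisy spectral coefficient
    speech_prob  estimated speech presence probability in [0, 1]
    noise_gain   hypothetical stand-in for the trained gain function
    alpha_min    hypothetical smoothing value used when speech is absent
    """
    # Assumed mapping: more likely speech -> alpha closer to 1 -> slower update,
    # so less speech power leaks into the noise estimate.
    alpha = alpha_min + (1.0 - alpha_min) * speech_prob
    noise_power_estimate = noise_gain * noisy_power
    return alpha * prev_var + (1.0 - alpha) * noise_power_estimate

# Hypothetical frame sequence: noise-only frames, then a loud speech frame.
var = 1.0
for power, prob in [(2.0, 0.0), (2.0, 0.0), (50.0, 1.0)]:
    var = update_noise_variance(var, power, prob)
    print(var)
```

With this mapping the estimate moves toward the observed power when speech is judged absent and freezes when speech is judged present, which is the behavior the abstract describes as keeping speech leakage low.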
TL;DR: The combination of various deformation- and fault-tolerance mechanisms allows us to employ standard indexing techniques to obtain an efficient, index-based matching procedure, thus providing an important step towards semantically searching large-scale real-world music collections.
Abstract: Given a large audio database of music recordings, the goal of classical audio identification is to identify a particular audio recording by means of a short audio fragment. Even though recent identification algorithms show a significant degree of robustness towards noise, MP3 compression artifacts, and uniform temporal distortions, the notion of similarity is rather close to the identity. In this paper, we address a higher level retrieval problem, which we refer to as audio matching: given a short query audio clip, the goal is to automatically retrieve all excerpts from all recordings within the database that musically correspond to the query. In our matching scenario, as opposed to classical audio identification, we allow semantically motivated variations as they typically occur in different interpretations of a piece of music. To this end, this paper presents an efficient and robust audio matching procedure that works even in the presence of significant variations, such as nonlinear temporal, dynamical, and spectral deviations, where existing algorithms for audio identification would fail. Furthermore, the combination of various deformation- and fault-tolerance mechanisms allows us to employ standard indexing techniques to obtain an efficient, index-based matching procedure, thus providing an important step towards semantically searching large-scale real-world music collections.
TL;DR: An automatic method for measuring content-based music similarity, enhancing the current generation of music search engines and recommender systems, and compatible with locality-sensitive hashing, allowing implementation with retrieval times several orders of magnitude faster than those using exhaustive distance computations.
Abstract: We propose an automatic method for measuring content-based music similarity, enhancing the current generation of music search engines and recommender systems. Many previous approaches to track similarity require brute-force, pair-wise processing between all audio features in a database and therefore are not practical for large collections. However, in an Internet-connected world, where users have access to millions of musical tracks, efficiency is crucial. Our approach uses features extracted from unlabeled audio data and near-neighbor retrieval using a distance threshold, determined by analysis, to solve a range of retrieval tasks. The tasks require temporal features, analogous to the technique of shingling used for text retrieval. To measure similarity, we count pairs of audio shingles, between a query and target track, that are below a distance threshold. The distribution of between-shingle distances is different for each database; therefore, we present an analysis of the distribution of minimum distances between shingles and a method for estimating a distance threshold for optimal retrieval performance. The method is compatible with locality-sensitive hashing (LSH), allowing implementation with retrieval times several orders of magnitude faster than those using exhaustive distance computations. We evaluate the performance of our proposed method on three contrasting music similarity tasks: retrieval of mis-attributed recordings (fingerprint), retrieval of the same work performed by different artists (cover songs), and retrieval of edited and sampled versions of a query track by remix artists (remixes). Our method achieves near-perfect performance in the first two tasks and 75% precision at 70% recall in the third task. Each task was performed on a test database comprising 4.5 million audio shingles.
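The similarity measure described above (count the shingle pairs that fall below a distance threshold) can be sketched in a few lines. This version uses exhaustive Euclidean comparison for clarity, which is precisely the cost that LSH is meant to avoid in the paper. The feature frames, shingle width, and threshold are hypothetical values.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def make_shingles(frames, width):
    """Concatenate each run of `width` consecutive feature frames into one shingle vector."""
    return [
        tuple(value for frame in frames[i:i + width] for value in frame)
        for i in range(len(frames) - width + 1)
    ]

def count_close_pairs(query_shingles, target_shingles, threshold):
    """Count (query, target) shingle pairs whose Euclidean distance is below the threshold."""
    return sum(
        1
        for q in query_shingles
        for t in target_shingles
        if dist(q, t) < threshold
    )

# Hypothetical one-dimensional feature frames for a query and two target tracks.
query = [[0.0], [1.0], [2.0], [3.0]]
same_track = [[0.0], [1.0], [2.0], [3.0]]
other_track = [[9.0], [9.5], [8.0], [7.0]]

q_sh = make_shingles(query, width=2)
print(count_close_pairs(q_sh, make_shingles(same_track, 2), threshold=0.5))
print(count_close_pairs(q_sh, make_shingles(other_track, 2), threshold=0.5))
```

A higher count indicates that more temporal fragments of the query recur in the target, so ranking targets by this count gives a simple retrieval order.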
TL;DR: An improved estimator for the speech presence probability at each time-frequency point in the short-time Fourier transform domain that yields a better tradeoff between speech distortion and noise leakage than state-of-the-art estimators.
Abstract: In this paper, we present an improved estimator for the speech presence probability at each time-frequency point in the short-time Fourier transform domain. In contrast to existing approaches, this estimator does not rely on an adaptively estimated and thus signal-dependent a priori signal-to-noise ratio estimate. It therefore decouples the estimation of the speech presence probability from the estimation of the clean speech spectral coefficients in a speech enhancement task. Using both a fixed a priori signal-to-noise ratio and a fixed prior probability of speech presence, the proposed a posteriori speech presence probability estimator achieves probabilities close to zero for speech absence and probabilities close to one for speech presence. While state-of-the-art speech presence probability estimators use adaptive prior probabilities and signal-to-noise ratio estimates, we argue that these quantities should reflect true a priori information that shall not depend on the observed signal. We present a detection theoretic framework for determining the fixed a priori signal-to-noise ratio. The proposed estimator is conceptually simple and yields a better tradeoff between speech distortion and noise leakage than state-of-the-art estimators.
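A posterior of the kind described here can be sketched with Bayes' rule under a complex Gaussian model for the spectral coefficients, which is a common modeling assumption and is not stated in the abstract. With fixed a priori SNR ξ and fixed prior P(H1), the posterior for one time-frequency point depends only on the ratio of observed power to noise variance. The default of 15 dB for ξ and the prior of 0.5 are illustrative choices, not values quoted from the paper.

```python
from math import exp

def speech_presence_probability(noisy_power, noise_var,
                                xi_fixed_db=15.0, prior_speech=0.5):
    """A posteriori speech presence probability for one time-frequency point.

    Assumes complex Gaussian likelihoods with variance noise_var under speech
    absence and noise_var * (1 + xi) under speech presence.
    """
    xi = 10.0 ** (xi_fixed_db / 10.0)          # fixed a priori SNR, linear scale
    gamma = noisy_power / noise_var            # a posteriori SNR
    prior_ratio = (1.0 - prior_speech) / prior_speech
    likelihood_ratio_inv = (1.0 + xi) * exp(-gamma * xi / (1.0 + xi))
    return 1.0 / (1.0 + prior_ratio * likelihood_ratio_inv)

# Hypothetical observations, with the noise variance normalized to 1.
for power in [0.0, 1.0, 5.0, 20.0]:
    print(power, speech_presence_probability(power, 1.0))
```

Because neither ξ nor the prior is adapted to the signal, the output is a fixed monotonic function of the a posteriori SNR: small for weak observations and approaching one for strong ones.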
TL;DR: Nonnegative matrix factorization is used to derive a novel description for the timbre of musical sounds by factorizing a spectrogram into a characteristic spectral basis; the resulting compression is shown to reduce the noise present in the data set, yielding more stable classification models.
Abstract: Nonnegative matrix factorization (NMF) is used to derive a novel description for the timbre of musical sounds. Using NMF, a spectrogram is factorized providing a characteristic spectral basis. Assuming a set of spectrograms given a musical genre, the space spanned by the vectors of the obtained spectral bases is modeled statistically using mixtures of Gaussians, resulting in a description of the spectral base for this musical genre. This description is shown to improve classification results by up to 23.3% compared to MFCC-based models, while the compression performed by the factorization decreases training time significantly. Using a distance-based stability measure this compression is shown to reduce the noise present in the data set resulting in more stable classification models. In addition, we compare the mean squared errors of the approximation to a spectrogram using independent component analysis and nonnegative matrix factorization, showing the superiority of the latter approach.
TL;DR: A model selection method is developed to determine the optimal mode between Gaussian mixture models (GMM) and vector quantization (VQ) when the amount of training data is different for each species.
Abstract: This paper presents a method for automatic classification of birds into different species based on the audio recordings of their sounds. Each individual syllable segmented from continuous recordings is regarded as the basic recognition unit. To represent the temporal variations as well as sharp transitions within a syllable, a feature set derived from static and dynamic two-dimensional Mel-frequency cepstral coefficients is calculated for the classification of each syllable. Since a bird might generate several types of sounds with variant characteristics, a number of representative prototype vectors are used to model different syllables of identical bird species. For each bird species, a model selection method is developed to determine the optimal mode between Gaussian mixture models (GMM) and vector quantization (VQ) when the amount of training data is different for each species. In addition, a component number selection algorithm is employed to find the most appropriate number of components of GMM or the cluster number of VQ for each species. The mean vectors of GMM or the cluster centroids of VQ will form the prototype vectors of a certain bird species. In the experiments, the best classification accuracy is 84.06% for the classification of 28 bird species.
TL;DR: A complete drum transcription system is described, which combines information from the original music signal and a drum track enhanced version obtained by source separation, and which integrates a large set of features, optimally selected by feature selection.
Abstract: The purpose of this article is to present new advances in music transcription and source separation with a focus on drum signals. A complete drum transcription system is described, which combines information from the original music signal and a drum track enhanced version obtained by source separation. In addition to efficient fusion strategies to take into account these two complementary sources of information, the transcription system integrates a large set of features, optimally selected by feature selection. Concurrently, the problem of drum track extraction from polyphonic music is tackled both by proposing a novel approach based on harmonic/noise decomposition and time/frequency masking and by improving an existing Wiener filtering-based separation method. The separation and transcription techniques presented are thoroughly evaluated on a large public database of music signals. A transcription accuracy between 64.5% and 80.3% is obtained, depending on the drum instrument, for well-balanced mixes, and the efficiency of our drum separation algorithms is illustrated in a comprehensive benchmark.
TL;DR: In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts.
Abstract: Estimation and alignment procedures for word and phrase alignment hidden Markov models (HMMs) are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model 4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model 4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.
TL;DR: A new mid-level representation based on the decomposition of a signal into a small number of sound atoms or molecules bearing explicit musical instrument labels is proposed, and several applications of it are investigated, including polyphonic instrument recognition and music visualization.
Abstract: Several studies have pointed out the need for accurate mid-level representations of music signals for information retrieval and signal processing purposes. In this paper, we propose a new mid-level representation based on the decomposition of a signal into a small number of sound atoms or molecules bearing explicit musical instrument labels. Each atom is a sum of windowed harmonic sinusoidal partials whose relative amplitudes are specific to one instrument, and each molecule consists of several atoms from the same instrument spanning successive time windows. We design efficient algorithms to extract the most prominent atoms or molecules and investigate several applications of this representation, including polyphonic instrument recognition and music visualization.
TL;DR: A novel integrative and discriminative learning technique for spoken utterance classification (SUC) is presented that uses the N-best lists generated by the ASR module to reduce the semantic classification error rate (CER).
Abstract: Traditional methods of spoken utterance classification (SUC) adopt two independently trained phases. In the first phase, an automatic speech recognition (ASR) module returns the most likely sentence for the observed acoustic signal. In the second phase, a semantic classifier transforms the resulting sentence into the most likely semantic class. Since the two phases are isolated from each other, such traditional SUC systems are suboptimal. In this paper, we present a novel integrative and discriminative learning technique for SUC to alleviate this problem, and thereby, reduce the semantic classification error rate (CER). Our approach revolves around the effective use of the N-best lists generated by the ASR module to reduce semantic classification errors. The N-best list sentences are first rescored using all the available knowledge sources. Then, the sentences that are most likely to help reduce the CER are extracted from the N-best lists, as are those sentences that are most likely to increase the CER. These sentences are used to discriminatively train the language and semantic-classifier models to minimize the overall semantic CER. Our experiments resulted in a reduction of CER from its initial value of 4.92% to 4.04% in the standard ATIS task.
TL;DR: A novel Bayesian PLSA framework is presented and an incremental PLSA algorithm is constructed to accomplish the parameter estimation as well as the hyperparameter updating, which is capable of performing dynamic document indexing and modeling.
Abstract: Due to the vast growth of data collections, statistical document modeling has become increasingly important in language processing areas. Probabilistic latent semantic analysis (PLSA) is a popular approach whereby the semantics and statistics can be effectively captured for modeling. However, PLSA is highly sensitive to task domain, which is continuously changing in real-world documents. In this paper, a novel Bayesian PLSA framework is presented. We focus on exploiting the incremental learning algorithm for solving the updating problem of new domain articles. This algorithm is developed to improve document modeling by incrementally extracting up-to-date latent semantic information to match the changing domains at run time. By adequately representing the priors of PLSA parameters using Dirichlet densities, the posterior densities belong to the same distribution so that a reproducible prior/posterior mechanism is activated for incremental learning from constantly accumulated documents. An incremental PLSA algorithm is constructed to accomplish the parameter estimation as well as the hyperparameter updating. Compared to standard PLSA using maximum likelihood estimate, the proposed approach is capable of performing dynamic document indexing and modeling. We also present the maximum a posteriori PLSA for corrective training. Experiments on information retrieval and document categorization demonstrate the superiority of using Bayesian PLSA methods.
TL;DR: This paper presents a novel approach to estimate the time difference of arrival (TDOA) for multiple sources in reverberant environments that resolves ambiguities in TDOA estimation caused by multipath propagation and multiple sources.
Abstract: This paper presents a novel approach to estimate the time difference of arrival (TDOA) for multiple sources in reverberant environments. It resolves ambiguities in TDOA estimation caused by multipath propagation and multiple sources. By exploiting two TDOA constraints, the raster condition and the zero cyclic sum condition, we are able to identify and reject the echo path TDOAs and to assign the direct path TDOAs correctly to different sources. For the latter purpose, an efficient algorithm for the synthesis of approximately consistent TDOA graphs has been developed. A real experiment demonstrates the superior performance of our algorithms.
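The zero cyclic sum condition named in this abstract can be illustrated with a small sketch. The abstract does not define the condition; the version below assumes the common formulation that, for a single source, the TDOAs taken around a closed loop of microphones sum to zero. The helper names `tdoa` and `consistent_triples`, the microphone layout, the source positions, and the tolerance are all hypothetical, and the raster condition is not shown.

```python
import itertools
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def tdoa(src, mic_a, mic_b, c=C):
    """Direct-path TDOA of a source between two microphones, in seconds."""
    return (np.linalg.norm(src - mic_a) - np.linalg.norm(src - mic_b)) / c

def consistent_triples(cands_12, cands_23, cands_31, tol=1e-5):
    """Keep candidate triples whose cyclic sum tau_12 + tau_23 + tau_31 is ~0."""
    return [
        (a, b, g)
        for a, b, g in itertools.product(cands_12, cands_23, cands_31)
        if abs(a + b + g) <= tol
    ]

# Hypothetical geometry: three microphones and two sources in a plane (meters).
mics = np.array([[0.0, 0.0], [0.3, 0.0], [0.0, 0.3]])
src_a = np.array([2.0, 1.0])
src_b = np.array([-1.0, 2.5])

def loop_tdoas(src):
    """TDOAs around the closed microphone loop 1 -> 2 -> 3 -> 1."""
    return (
        tdoa(src, mics[0], mics[1]),
        tdoa(src, mics[1], mics[2]),
        tdoa(src, mics[2], mics[0]),
    )

ta = loop_tdoas(src_a)
tb = loop_tdoas(src_b)

# Each microphone pair yields two unlabeled candidates (one per source).
# Only combinations belonging to the same source pass the cyclic-sum test.
triples = consistent_triples([ta[0], tb[0]], [ta[1], tb[1]], [ta[2], tb[2]])
print(len(triples), "consistent triples out of 8 combinations")
```

In this toy setup the check discards the mixed combinations, which is the sense in which such a constraint can help assign TDOAs to sources; real measurements would need a tolerance matched to the estimation error.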
TL;DR: This correspondence presents a microphone array shape calibration procedure for diffuse noise environments that estimates intermicrophone distances by fitting the measured noise coherence with its theoretical model and then estimates the array geometry using classical multidimensional scaling.
Abstract: This correspondence presents a microphone array shape calibration procedure for diffuse noise environments. The procedure estimates intermicrophone distances by fitting the measured noise coherence with its theoretical model and then estimates the array geometry using classical multidimensional scaling. The technique is validated on noise recordings from two office environments.
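The two-stage procedure in this abstract can be sketched as follows. The abstract does not give the coherence model; the sketch assumes the standard spherically isotropic diffuse-field model, sin(2πfd/c)/(2πfd/c), and replaces measured coherence with noise-free synthetic values. The grid-search fit, the function names, and the array layout are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

C = 343.0  # assumed speed of sound in m/s

def diffuse_coherence(freqs, d, c=C):
    """Assumed diffuse-field model sin(x)/x with x = 2*pi*f*d/c.

    np.sinc(t) computes sin(pi*t)/(pi*t), hence the argument 2*f*d/c.
    """
    return np.sinc(2.0 * freqs * d / c)

def fit_distance(freqs, measured, grid):
    """Grid-search least-squares fit of the distance d to a coherence curve."""
    errs = [np.sum((measured - diffuse_coherence(freqs, d)) ** 2) for d in grid]
    return grid[int(np.argmin(errs))]

def classical_mds(D, dim=2):
    """Classical MDS: coordinates (up to rotation, reflection, shift) from distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)                # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]            # keep the largest `dim`
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Hypothetical 4-microphone rectangular array (meters).
mics = np.array([[0.0, 0.0], [0.2, 0.0], [0.2, 0.15], [0.0, 0.15]])
true_D = np.linalg.norm(mics[:, None] - mics[None, :], axis=-1)

freqs = np.linspace(100.0, 4000.0, 200)
grid = np.linspace(0.01, 0.5, 4901)               # candidate distances, 0.1 mm steps

est_D = np.zeros_like(true_D)
for i in range(len(mics)):
    for j in range(i + 1, len(mics)):
        coh = diffuse_coherence(freqs, true_D[i, j])   # stand-in for measured coherence
        est_D[i, j] = est_D[j, i] = fit_distance(freqs, coh, grid)

X = classical_mds(est_D)
rec_D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print("max distance error (m):", np.abs(rec_D - true_D).max())
```

Because classical MDS recovers the geometry only up to rotation, reflection, and translation, the sketch compares pairwise distances rather than coordinates.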
TL;DR: The proposed maximum entropy acoustic-syntactic model achieves pitch accent and boundary tone detection accuracies of 86.0% and 93.1% on the Boston University Radio News corpus and 79.8% and 90.3% on the Boston Directions corpus, and phrase structure detection through prosodic break index labeling achieves accuracies of 84% and 87% on the two corpora, respectively.
Abstract: In this paper, we describe a maximum entropy-based automatic prosody labeling framework that exploits both language and speech information. We apply the proposed framework to both prominence and phrase structure detection within the Tones and Break Indices (ToBI) annotation scheme. Our framework utilizes novel syntactic features in the form of supertags and a quantized acoustic-prosodic feature representation that is similar to linear parameterizations of the prosodic contour. The proposed model is trained discriminatively and is robust in the selection of appropriate features for the task of prosody detection. The proposed maximum entropy acoustic-syntactic model achieves pitch accent and boundary tone detection accuracies of 86.0% and 93.1% on the Boston University Radio News corpus, and 79.8% and 90.3% on the Boston Directions corpus. The phrase structure detection through prosodic break index labeling provides accuracies of 84% and 87% on the two corpora, respectively. The reported results are significantly better than previously reported results and demonstrate the strength of the maximum entropy model in jointly modeling simple lexical, syntactic, and acoustic features for automatic prosody labeling.
TL;DR: An unsupervised single-channel music source separation algorithm based on average harmonic structure modeling is proposed, and experiments show that this algorithm outperforms the general nonnegative matrix factorization (NMF)-based source separation algorithms, and yields good subjective listening quality.
Abstract: Source separation of musical signals is an appealing but difficult problem, especially in the single-channel case. In this paper, an unsupervised single-channel music source separation algorithm based on average harmonic structure modeling is proposed. Under the assumption of playing in narrow pitch ranges, different harmonic instrumental sources in a piece of music often have different but stable harmonic structures; thus, sources can be characterized uniquely by harmonic structure models. Given the number of instrumental sources, the proposed algorithm learns these models directly from the mixed signal by clustering the harmonic structures extracted from different frames. The corresponding sources are then extracted from the mixed signal using the models. Experiments on several mixed signals, including synthesized instrumental sources, real instrumental sources, and singing voices, show that this algorithm outperforms the general nonnegative matrix factorization (NMF)-based source separation algorithm, and yields good subjective listening quality. As a side effect, this algorithm estimates the pitches of the harmonic instrumental sources. The number of concurrent sounds in each frame is also computed, which is a difficult task for general multipitch estimation (MPE) algorithms.