
Showing papers on "Linear predictive coding published in 2017"


Journal ArticleDOI
TL;DR: Five predictive coding algorithms are covered: linear predictive coding, which has a long and influential history in the signal processing literature; the first neuroscience-related application of predictive coding, to explaining the function of the retina; and three versions of predictive coding that have been proposed to model cortical function.

298 citations


Journal ArticleDOI
TL;DR: An extensive set of acoustic–phonetic features extracted in adverse conditions is investigated; feature combination sets constructed using a sequential floating forward selection algorithm outperform individual features, and optimal feature sets in anechoic conditions are found to differ from those in reverberant conditions.
Abstract: Monaural speech separation is a fundamental problem in speech and signal processing. This problem can be approached from a supervised learning perspective by predicting an ideal time–frequency mask from features of noisy speech. In reverberant conditions at low signal-to-noise ratios (SNRs), accurate mask prediction is challenging and can benefit from effective features. In this paper, we investigate an extensive set of acoustic–phonetic features extracted in adverse conditions. Deep neural networks are used as the learning machine, and separation performance is evaluated using standard objective speech intelligibility metrics. Separation performance is systematically evaluated in both nonspeech and speech interference, in a variety of SNRs, reverberation times, and direct-to-reverberant energy ratios. Considerable performance improvement is observed by using contextual information, likely due to temporal effects of room reverberation. In addition, we construct feature combination sets using a sequential floating forward selection algorithm, and combined features outperform individual ones. We also find that optimal feature sets in anechoic conditions are different from those in reverberant conditions.
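As a concrete illustration of the supervised masking framework summarized above, the sketch below (an assumption for illustration, not the authors' code) derives an ideal-ratio-mask training target from parallel clean-speech and noise signals; a DNN would then be trained to predict this mask from acoustic-phonetic features of the noisy mixture.

```python
# Minimal sketch, assuming parallel clean-speech and noise signals are available.
import numpy as np
import librosa

def ideal_ratio_mask(clean, noise, n_fft=512, hop=256, beta=0.5):
    # Magnitude spectrograms of the clean speech and the interfering noise.
    S = np.abs(librosa.stft(clean, n_fft=n_fft, hop_length=hop))
    N = np.abs(librosa.stft(noise, n_fft=n_fft, hop_length=hop))
    # IRM: fraction of speech energy in each time-frequency unit.
    return (S ** 2 / (S ** 2 + N ** 2 + 1e-12)) ** beta
```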

65 citations


Journal ArticleDOI
TL;DR: The results indicate that the performance of the proposed SFF-based methods for emotional speech is comparable to the results for neutral speech, and is better than the results from many of the standard methods.

54 citations


Journal ArticleDOI
TL;DR: The proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.
Abstract: Steganalysis of the quantization index modulation (QIM) steganography in a low-bit-rate encoded speech stream is conducted in this research. According to speech generation theory and the phoneme distribution properties of language, we first point out that the correlation characteristics of the split vector quantization (VQ) codewords of linear predictive coding filter coefficients are changed by QIM steganography. Based on this observation, we construct a model called the quantization codeword correlation network (QCCN) from the split VQ codewords of adjacent speech frames. The QCCN model is then pruned to yield a stronger correlation network. After quantifying the correlation characteristics of vertices in the pruned network, we obtain feature vectors that are effective for steganalysis. Finally, we build a high-performance detector using a support vector machine (SVM) classifier. Experimental results show that the proposed QCCN steganalysis method can effectively detect QIM steganography in an encoded speech stream when applied to low-bit-rate speech codecs such as G.723.1 and G.729.

50 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: A two-stage algorithm is proposed to deal with the confounding effects of noise and reverberation separately, with denoising and dereverberation conducted sequentially using deep neural networks; it substantially outperforms one-stage enhancement baselines.
Abstract: In daily listening environments, speech is commonly corrupted by room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also severely degrade the performance of automatic speech and speaker recognition systems. In this paper, we propose a two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase information during training. As the objective function emphasizes more important time-frequency (T-F) units, better estimated magnitude is obtained during testing. By jointly training the two-stage model to optimize the proposed objective function, our algorithm improves objective metrics of speech intelligibility and quality significantly, and substantially outperforms one-stage enhancement baselines.

49 citations



Proceedings ArticleDOI
05 Mar 2017
TL;DR: A deep neural network is used to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which yields clean, anechoic speech when applied to a reverberant-noisy mixture; the results show that phase is important for dereverberation and that complex ratio masking outperforms related methods.
Abstract: Traditional speech separation systems enhance the magnitude response of noisy speech. Recent studies, however, have shown that perceptual speech quality is significantly improved when magnitude and phase are both enhanced. These studies, however, have not determined if phase enhancement is beneficial in environments that contain reverberation as well as noise. In this paper, we present an approach that jointly enhances the magnitude and phase of reverberant and noisy speech. We use a deep neural network to estimate the real and imaginary components of the complex ideal ratio mask (cIRM), which results in clean and anechoic speech when applied to a reverberant-noisy mixture. Our results show that phase is important for dereverberation, and that complex ratio masking outperforms related methods.
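A minimal sketch of the cIRM idea described above, under the assumption that paired clean and reverberant-noisy signals are available (published cIRM systems usually also compress the mask before using it as a training target):

```python
import numpy as np
import librosa

def cirm(clean, noisy, n_fft=512, hop=256, eps=1e-12):
    S = librosa.stft(clean, n_fft=n_fft, hop_length=hop)   # clean/anechoic STFT
    Y = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)   # reverberant-noisy STFT
    M = S / (Y + eps)                                       # complex ratio mask
    return M.real, M.imag                                   # real/imaginary DNN targets

def apply_cirm(mask_r, mask_i, noisy, n_fft=512, hop=256):
    Y = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    return librosa.istft((mask_r + 1j * mask_i) * Y, hop_length=hop)
```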

47 citations


Journal ArticleDOI
TL;DR: This paper is concerned with generating intelligible audio speech from a video of a person talking; regression and classification methods are first proposed to estimate static spectral envelope features from active appearance model visual features, and two further methods are developed to incorporate temporal information into the prediction.
Abstract: This paper is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model visual features. Two further methods are then developed to incorporate temporal information into the prediction: A feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimized through objective tests before applying subjective intelligibility tests that determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favorably with a previous regression-based system that serves as a baseline, which achieved a word accuracy of 33%.

44 citations


DOI
02 Nov 2017
TL;DR: In this article, the authors propose the combined use of Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients, which express the basic speech features, to improve the reliability of a speech recognition system.
Abstract: The paper states the automatic speech recognition problem, the goals of speech recognition, and its application fields. For Azerbaijani speech, the design principles of a speech recognition system and the problems arising in such a system are investigated. Algorithms for computing speech features, the main part of a speech recognition system, are analyzed. On this basis, algorithms for determining Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) coefficients, which express the basic speech features, are developed. The combined use of MFCC and LPC cepstral features is suggested to improve the reliability of the speech recognition system. To this end, the recognition system is divided into MFCC-based and LPC-based subsystems. Training and recognition are carried out separately in each subsystem, and the system accepts a decision only when both subsystems produce the same result, which reduces the error rate during recognition. Training and recognition are realized by artificial neural networks, trained with the conjugate gradient method. The paper investigates the problems that the number of speech features poses when training the neural networks of the MFCC-based and LPC-based subsystems, and analyzes the variability of results from neural networks trained from different initial points. A methodology for the combined use of neural networks trained from different initial points is suggested to improve the reliability of the recognition system and increase recognition quality, and practical results are reported.
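To make the LPC half of the feature set concrete, here is a self-contained sketch of the autocorrelation method with the Levinson-Durbin recursion (the window and order 12 are assumptions; MFCCs can be computed with any standard library):

```python
import numpy as np

def lpc(frame, order=12):
    """LPC coefficients [1, a1, ..., a_order] via the autocorrelation method."""
    w = frame * np.hamming(len(frame))
    # Autocorrelation for lags 0..order.
    r = np.array([np.dot(w[:len(w) - k], w[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-10
    for i in range(1, order + 1):                 # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] += k * a[i - 1::-1][:i]        # reflection-coefficient update
        err *= (1.0 - k * k)
    return a
```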

43 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper presents an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter, and shows that both components solely depend on a speech presence probability, which is learned using a deep neural network.
Abstract: In this paper, we present an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter. We show that both components solely depend on a speech presence probability, which we learn using a deep neural network, consisting of a deep autoencoder and a softmax regression layer. To prevent the DNN from learning specific speaker and noise types, we do not use the signal energy as the input feature, but rather the cosine distance between the dominant eigenvectors of consecutive frames of the power spectral density of the noisy speech signal. We compare our system against the BeamformIt toolkit, and state-of-the-art approaches such as the front-end of the best system of the CHiME3 challenge. We show that our system yields superior results, both in terms of perceptual speech quality and classification error.
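The cosine-distance input feature mentioned above can be sketched as follows (a simplified, assumed formulation: per-frame spatial PSD matrices of the noisy signal are given, and the distance between their dominant eigenvectors is the feature):

```python
import numpy as np

def dominant_eigenvector(psd):
    # psd: (n_mics, n_mics) Hermitian spatial PSD matrix for one frame/band.
    _, v = np.linalg.eigh(psd)
    return v[:, -1]                    # eigenvector of the largest eigenvalue

def cosine_distance_feature(psd_prev, psd_curr):
    u = dominant_eigenvector(psd_prev)
    v = dominant_eigenvector(psd_curr)
    cos = np.abs(np.vdot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - cos                   # small when the spatial structure is stable
```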

41 citations


Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed approach, which embeds information during the linear predictive coding (LPC) process based on Matrix Embedding (ME), leads to better performance, with less speech distortion and better security.
Abstract: The extensive use of Voice over IP (VoIP) applications makes the low bit-rate speech stream a very suitable steganographic cover medium. To incorporate steganography into a low bit-rate speech codec, we propose a novel approach that embeds information during the linear predictive coding (LPC) process based on Matrix Embedding (ME). In the proposed method, a mapping table is constructed based on the criterion of minimum distance between linear-predictive-coefficient vectors, and the embedding position and template are selected according to a private key so as to choose the cover frames. The original speech data of the chosen frames are partially encoded to obtain the codewords for embedding, and the codewords that need to be modified are then selected according to the secret bits and the ME algorithm. Each selected codeword is changed into its best replacement codeword according to the mapping table. When embedding k (k > 1) bits into 2^k − 1 codewords, the embedding efficiency of our method is k times that of the LPC-based Quantization Index Modulation method. The performance of the proposed approach is evaluated in two aspects: distortion in speech quality introduced by embedding, and security under steganalysis. The experimental results demonstrate that the proposed approach leads to better performance, with less speech distortion and better security.
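The matrix-embedding step can be illustrated with the classic Hamming-code construction it relies on: k message bits are hidden in 2^k − 1 cover elements while changing at most one of them. The binary-cover sketch below is a simplification of the paper's codeword-replacement setting, with all names chosen for illustration:

```python
import numpy as np

def parity_check_matrix(k):
    # Columns are the binary representations of 1 .. 2**k - 1 (LSB in row 0).
    n = 2 ** k - 1
    return np.array([[(j >> i) & 1 for j in range(1, n + 1)] for i in range(k)])

def me_embed(cover_bits, message_bits):
    k = len(message_bits)
    H = parity_check_matrix(k)                    # cover must hold 2**k - 1 bits
    x = np.array(cover_bits, dtype=int) % 2
    d = ((H @ x) + np.array(message_bits)) % 2    # syndrome difference to the target
    if d.any():
        pos = int(d @ (1 << np.arange(k))) - 1    # column whose bits equal d
        x[pos] ^= 1                               # flip that single position
    return x

def me_extract(stego_bits):
    k = int(np.log2(len(stego_bits) + 1))
    return (parity_check_matrix(k) @ np.array(stego_bits, dtype=int)) % 2
```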

Journal ArticleDOI
TL;DR: The method yields state-of-the-art performance and greatly reduces the effects of reverberation and noise while improving speech quality and preserving speech intelligibility in challenging acoustic environments.
Abstract: This paper proposes an online single-channel speech enhancement method designed to improve the quality of speech degraded by reverberation and noise. Based on an autoregressive model for the reverberation power and on a hidden Markov model for clean speech production, a Bayesian filtering formulation of the problem is derived and online joint estimation of the acoustic parameters and mean speech, reverberation, and noise powers is obtained in mel-frequency bands. From these estimates, a real-valued spectral gain is derived and spectral enhancement is applied in the short-time Fourier transform (STFT) domain. The method yields state-of-the-art performance and greatly reduces the effects of reverberation and noise while improving speech quality and preserving speech intelligibility in challenging acoustic environments.

Journal ArticleDOI
TL;DR: The proposed novel QIM steganography, based on the replacement of the quantization index set in linear predictive coding (LPC), outperforms the state-of-the-art LPC-based approach in a low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.
Abstract: In this paper, we focus on quantization-index-modulation (QIM) steganography in low-bit-rate speech codecs and contribute to improving its steganalysis resistance. A novel QIM steganography is proposed based on the replacement of the quantization index set in linear predictive coding (LPC). In this method, each quantization index set is seen as a point in a quantization index space, and steganography is conducted in that space. Compared with other methods, our algorithm significantly improves the embedding efficiency: at most one quantization index needs to be changed when three binary bits are hidden. The number of alterations introduced by the proposed approach is much lower than that of current methods at the same embedding rate. Due to the fewer cover changes, the proposed steganography is less detectable. Moreover, a division strategy based on a genetic algorithm is proposed to reduce the additional distortion introduced by replacements. In our experiments, ITU-T G.723.1 is selected as the codec, and the results show that the proposed approach outperforms the state-of-the-art LPC-based approach in a low-bit-rate speech codec with respect to both steganographic capacity and steganalysis resistance.
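For background, the QIM primitive that this paper and the steganalysis work above build on can be sketched for a single scalar parameter: the hidden bit selects which of two interleaved quantizers encodes the value. The step size below is an assumed value, and real codecs apply this to LSF/LPC quantization indices rather than raw scalars:

```python
import numpy as np

def qim_embed(value, bit, delta=0.02):
    offset = 0.0 if bit == 0 else delta / 2.0
    return delta * np.round((value - offset) / delta) + offset

def qim_extract(value, delta=0.02):
    d0 = abs(value - qim_embed(value, 0, delta))   # distance to the bit-0 lattice
    d1 = abs(value - qim_embed(value, 1, delta))   # distance to the bit-1 lattice
    return 0 if d0 <= d1 else 1
```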

Journal ArticleDOI
TL;DR: An improved codebook-driven Wiener filter combined with the speech-presence probability is developed, so that the proposed method removes the residual noise between the harmonics of noisy speech.
Abstract: In this paper, we present a novel method for estimating the short-term linear predictive parameters of speech and noise in the codebook-driven Wiener filtering speech enhancement method. We use only a pretrained spectral-shape codebook of speech to model the a priori information about the linear predictive coefficients of speech, while the spectral shape of noise is estimated online directly instead of with a noise codebook, avoiding the problem of noise classification. Unlike existing codebook-driven methods, in which the linear predictive gains of speech and noise are estimated by a maximum-likelihood method, the proposed method exploits a multiplicative update rule to estimate the linear predictive gains more accurately. The estimated gains help preserve more speech components in the enhanced speech. A Bayesian parameter estimator that does not require a noise codebook is also developed. Moreover, we develop an improved codebook-driven Wiener filter combined with the speech-presence probability, so that the proposed method removes the residual noise between the harmonics of noisy speech.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis that takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training.
Abstract: This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training. The ASV is a discriminator trained to distinguish natural and synthetic speech. Since acoustic models for speech synthesis are trained so that the ASV recognizes the synthetic speech parameters as natural speech, the synthetic speech parameters are distributed in the same manner as natural speech parameters. Additionally, we find that the algorithm compensates not only the parameter distributions, but also the global variance and the correlations of synthetic speech parameters. The experimental results demonstrate that 1) the algorithm outperforms the conventional training algorithm in terms of speech quality, and 2) it is robust against the hyper-parameter settings.

Journal ArticleDOI
TL;DR: A novel realization that integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, overcomes the data sparsity problem and significantly outperforms conventional methods that use optimized noise tracking.
Abstract: Conventional speech enhancement methods, based on frame, multiframe, or segment estimation, require knowledge about the noise. This paper presents a new method that aims to reduce or effectively remove this requirement. It is shown that by using the zero-mean normalized correlation coefficient (ZNCC) as the comparison measure, and by extending the effective length of speech segment matching to sentence-long speech utterances, it is possible to obtain an accurate speech estimate from noise without requiring specific knowledge about the noise. The new method, thus, could be used to deal with unpredictable noise or noise without proper training data. This paper is focused on realizing and evaluating this potential. We propose a novel realization that integrates full-sentence speech correlation with clean speech recognition, formulated as a constrained maximization problem, to overcome the data sparsity problem. Then we propose an efficient implementation algorithm to solve this constrained maximization problem to produce speech sentence estimates. For evaluation, we build the new system on one training dataset and test it on two different test datasets across two databases, for a range of different noises including highly nonstationary ones. It is shown that the new approach, without any estimation of the noise, is able to significantly outperform conventional methods that use optimized noise tracking, in terms of various objective measures including automatic speech recognition.
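The comparison measure itself is simple; a small sketch (with plain 1-D feature vectors standing in for whatever representation the authors match over) is:

```python
import numpy as np

def zncc(x, y):
    # Zero-mean normalized correlation coefficient between two equal-length vectors.
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0
```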

Journal ArticleDOI
TL;DR: The objective of this work is to investigate the benefit of the discrete wavelet transform combined with LPC for a speaker identification system applied to the Algerian Berber language, compared to traditional Mel frequency analysis.
Abstract: The objective of this work is to investigate the benefit of the discrete wavelet transform combined with LPC for a speaker identification system applied to the Algerian Berber language, compared to traditional Mel frequency analysis. We have developed a speaker identification system for the Algerian Berber language. The corpus comprises two datasets: the first contains eight isolated words and the second is dedicated to continuous speech repeated by native Algerian Berber speakers. We used MFCC features, their first and second derivatives, and the discrete wavelet transform (DWT) followed by linear predictive coding (LPC) to improve the parameterization phase. Mahalanobis distance, ascendant (hierarchical) classification, and pitch analysis were used to characterize the speech signals. We evaluate the performance of the DWT–LPC features for clean and additive-noise speech. A multilayer perceptron classifier was used for this purpose, and accuracy improved when DWT was combined with LPC feature vectors.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: A high-performance wearable bone-conducted speech enhancement system is developed to reduce the distortion caused by environmental noise; bone conduction can effectively avoid the interference of acoustic noise with the speech.
Abstract: Wearable electronic systems have been and will continue to be utilized in both civil and military applications. Strong noise environments produced by large vehicles (e.g., ships, aircraft, or military tanks) seriously affect the quality of speech communications, especially for wearable systems without hermetic packaging. Bone conduction technology, which acquires skull vibration, is able to obtain voice information and can effectively avoid the interference of acoustic noise with the speech. In this paper, a high-performance wearable bone-conducted speech enhancement system is developed to reduce the distortion caused by environmental noise. Both bone-conducted and air-conducted voices are used to train a deep-neural-network equalization function that maps bone-conducted speech to air-conducted speech, and linear predictive coding spectral coefficients are taken as the feature information for the conversion model.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: A three-step unsupervised approach to zero resource speech processing, which does not require any other information or dataset, is proposed; it outperforms the baselines supplied with the datasets in both tasks without any task-specific modifications.
Abstract: Zero resource speech processing refers to a scenario where no or minimal transcribed data is available. In this paper, we propose a three-step unsupervised approach to zero resource speech processing, which does not require any other information or dataset. In the first step, we segment the speech signal into phoneme-like units, resulting in a large number of varying-length segments. The second step involves clustering the varying-length segments into a finite number of clusters so that each segment can be labeled with a cluster index. The unsupervised transcriptions, thus obtained, can be thought of as a sequence of virtual phone labels. In the third step, a deep neural network classifier is trained to map the feature vectors extracted from the signal to the corresponding virtual phone labels. The virtual phone posteriors extracted from the DNN are used as features in zero resource speech processing. The effectiveness of the proposed approach is evaluated on both ABX and spoken term discovery (STD) tasks using spontaneous American English and Tsonga language datasets, provided as part of the Zero Resource 2015 challenge. The proposed system outperforms the baselines supplied with the datasets in both tasks without any task-specific modifications.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: An audio signal classification system based on Linear Predictive Coding and Random Forests is presented for multiclass classification with imbalanced datasets and achieves an overall correct classification rate of 99.25%.
Abstract: The goal of this work is to present an audio signal classification system based on Linear Predictive Coding and Random Forests. We consider the problem of multiclass classification with imbalanced datasets. The signals under classification belong to the class of sounds from wildlife intruder detection applications: birds, gunshots, chainsaws, human voice and tractors. The proposed system achieves an overall correct classification rate of 99.25%. No false alarms occur in the case of birds or human voices; for the other three classes the false-alarm probability is low, around 0.3%. The false omission rate is also low: around 0.2% for birds and tractors, slightly higher for chainsaws (0.4%), lower for gunshots (0.14%) and zero for human voices.
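A hedged end-to-end sketch of the pipeline described above, with per-recording LPC features feeding a random forest (the frame size, model order, mean pooling, and class-weighting choice are all assumptions for illustration):

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def lpc_features(signal, order=12, frame=1024, hop=512):
    coeffs = []
    for start in range(0, len(signal) - frame + 1, hop):
        a = librosa.lpc(signal[start:start + frame].astype(float), order=order)
        coeffs.append(a[1:])                 # drop the leading 1
    return np.mean(coeffs, axis=0)           # one feature vector per recording

# Hypothetical usage with a list `signals` and labels like "bird", "gunshot", ...
# X = np.stack([lpc_features(s) for s in signals])
# clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
# clf.fit(X, labels)
```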

Journal ArticleDOI
TL;DR: The SBWT is introduced in order to solve the problem of perfect reconstruction associated with the bionic wavelet transform, and the maximum a posteriori estimator of the magnitude-squared spectrum (MSS-MAP) is used for estimation of speech in the SBWT domain.
Abstract: Numerous efforts have focused on the problem of reducing the impact of noise on the performance of various speech systems such as speech coding, speech recognition and speaker recognition. These approaches consider alternative speech features, improved speech modeling, or alternative training for acoustic speech models. In this paper, we propose a new speech enhancement technique, which integrates a newly proposed wavelet transform, which we call the stationary bionic wavelet transform (SBWT), and the maximum a posteriori estimator of the magnitude-squared spectrum (MSS-MAP). The SBWT is introduced in order to solve the problem of perfect reconstruction associated with the bionic wavelet transform. The MSS-MAP estimation is used for estimation of speech in the SBWT domain. The experiments were conducted for various noise types and different speech signals. The results of the proposed technique were compared with those of other popular methods such as Wiener filtering and MSS-MAP estimation in the frequency domain. To test the performance of the proposed speech enhancement system, four objective quality measurement tests [signal-to-noise ratio (SNR), segmental SNR, Itakura–Saito distance and perceptual evaluation of speech quality] were conducted for various noise types and SNRs. Experimental results and objective quality measurements proved the performance of the proposed speech enhancement technique. It provided sufficient noise reduction and good intelligibility and perceptual quality, without causing considerable signal distortion or musical background noise.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: An approach to language identification based on Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficient (MFCC) features is proposed using SVM and Random Forest classification techniques.
Abstract: Speech uttered by human beings contains information about the speaker, the language and the content. The language of an utterance can be identified by extracting language-specific information from it; this task is known as Language Identification (LID). Identifying the language of speech is helpful in translation, speech recognition and speech-activated automatic systems. An LID system may also play an important role in speaker recognition, since identifying the language can be used to reduce the search space. In this paper, an approach to language identification based on Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficient (MFCC) features is proposed using SVM and Random Forest (RF) classification techniques. Both LPC and MFCC are vocal tract features, and when extracted from uttered speech they contain language-related as well as speaker-related information. Identification of language depends heavily on the extraction of language-specific features, and these vocal tract parameters carry more information about the spoken language than other parameters such as excitation source and prosodic parameters; hence their combination performs better than either feature alone. Experiments have been performed on a database obtained from IIIT-Hyderabad consisting of 5000 multilingual clean speech signals (Hindi, Bengali, Telugu, Tamil, Marathi and Malayalam). For training the proposed model, 600 speech signals are taken arbitrarily from this database, and a language model is created for each language. The proposed models are evaluated using another 300 speech signals from the same database, using individual features as well as combined features. Experiments using both features together give better results than using individual features one at a time. With these features, other researchers have so far reported language identification accuracies of no more than 80%; in the proposed approach, the accuracy is improved to 92.6% using the combination of the same features and a random forest model.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: TEO-CB-Auto-Env, a non-linear feature extraction method, is used for emotion recognition, with applications in lie detectors, database access systems, and the military for identifying soldiers' emotional state during war.
Abstract: Detecting emotion by analyzing speech is important for identifying a person's emotional state. This can be done using linear predictive coding (LPC) and related parameters such as pitch, vocal tract spectrum, formant frequencies, duration and MFCCs, which are used to extract features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is performed using the TU-Berlin (Technical University of Berlin) German database. Emotion recognition is carried out for different emotions such as neutral, happiness, disgust, sadness, boredom and anger. Emotion recognition is used in lie detectors, database access systems, and in the military for identifying soldiers' emotional state during war.
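The non-linear part of TEO-CB-Auto-Env is the Teager Energy Operator; a tiny sketch of that operator is below (the critical-band filtering and autocorrelation-envelope stages of the full feature are not shown):

```python
import numpy as np

def teager_energy(x):
    # Teager Energy Operator: Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1),
    # computed for the interior samples of the signal.
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```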

Proceedings ArticleDOI
01 Jun 2017
TL;DR: Segmentation-based preprocessing is applied to the speech signal according to the phonetic transcription of the language, in order to reduce the amount of data supplied to the input of the neural network, which considerably improves its sensitivity to the input data.
Abstract: The article considers the preprocessing of voice signals for voice recognition systems based on artificial neural networks. Segmentation-based preprocessing is applied to the speech signal according to the phonetic transcription of the language, in order to reduce the amount of data supplied to the input of the neural network, which considerably improves its sensitivity to the input data. The application of numerical methods in processing reduces the impact of acoustic noise on speech signal segmentation, so that the regions used for classification are identified more accurately. Simulation results of partitioning the speech signal into its components are shown, i.e., the selection of the phonemes from which the voice message is classified.

Proceedings ArticleDOI
04 Dec 2017
TL;DR: Automatic detection of dysarthria, a motor speech disorder, is presented using extended speech features called Centroid Formants, which are the weighted averages of the formants extracted from a speech signal.
Abstract: This paper presents automatic detection of dysarthria, a motor speech disorder, using extended speech features called Centroid Formants. Centroid Formants are the weighted averages of the formants extracted from a speech signal. This involves extracting the first four formants of a speech signal and averaging their weighted values, where the weights are determined by the peak energies of the resonance bands (formants). The resulting weighted averages are called the Centroid Formants. In our proposed methodology, these centroid formants are used to automatically detect dysarthric speech with a neural network classification technique. The experimental results recorded after testing this algorithm are presented. The experimental data consist of 200 speech samples from 10 dysarthric speakers and 200 speech samples from 10 age-matched healthy speakers. The experimental results show high performance using neural network classification. Possible future research related to this work includes the use of these extended features in speaker identification and recognition of disordered speech.
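A hedged sketch of how such a feature could be computed from LPC roots: formant frequencies come from the angles of the poles, and the weighting by resonance-band peak energy is approximated here by the LPC spectral-envelope magnitude at each formant (the order, window and this particular weighting are assumptions, not the authors' exact recipe):

```python
import numpy as np
import librosa

def centroid_formant(frame, sr, order=12, n_formants=4):
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]         # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))[:n_formants]
    # Weight each formant by the LPC envelope magnitude 1/|A(e^{jw})| at that frequency.
    z = np.exp(-1j * 2 * np.pi * freqs / sr)
    weights = 1.0 / np.abs(np.polyval(a[::-1], z))
    return float(np.dot(weights, freqs) / np.sum(weights))
```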

Journal ArticleDOI
TL;DR: The results indicate that CHLLP outperformed the reference feature extraction methods in almost all the comparisons in the noise-corrupted conditions, and the performance of CHLLP was only slightly inferior to the nonparametric FFT-based spectral modeling in the clean condition.
Abstract: A linear predictive spectral estimation method based on higher-lag autocorrelation coefficients is proposed for noise-robust feature extraction from speech. The method, called higher-lag linear prediction, is derived from a signal prediction model that is optimized in the mean square sense using a cost function that has two prediction error terms, the first of which is similar to that of conventional linear prediction and the second of which is a delayed version introducing an integer delay of M samples. This basic form is developed further into the combined higher-lag linear prediction (CHLLP) model by simultaneously taking advantage of the zero-lag and higher-lag predictions. The CHLLP model was used in the computation of mel-frequency cepstral coefficients and compared with several reference feature extraction methods in speaker recognition. The experiments were conducted using a modern i-vector-based system. Noise corruption was done using additive car, babble, and factory noise in different signal-to-noise ratio conditions, as well as speech recordings from real noisy conditions. The results indicate that CHLLP outperformed the reference feature extraction methods in almost all the comparisons in the noise-corrupted conditions, and the performance of CHLLP was only slightly inferior to the nonparametric FFT-based spectral modeling in the clean condition.

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper presents an approach to lyric-audio alignment that compares synthesized speech with a vocal track extracted from the instrument mixture using source separation, taking a hierarchical approach to the problem.
Abstract: The massive amount of digital music data available necessitates automated methods for processing, classifying and organizing large volumes of songs. As music discovery and interactive music applications become commonplace, the ability to synchronize lyric text information with an audio recording has gained interest. This paper presents an approach for lyric-audio alignment by comparing synthesized speech with a vocal track removed from an instrument mixture using source separation. We take a hierarchical approach to solve the problem, assuming a set of paragraph-music segment pairs is given and focus on within-segment lyric alignment at the word level. A synthesized speech signal is generated to reflect the properties of the music signal by controlling the speech rate and gender. Dynamic time warping finds the shortest path between the synthesized speech and separated vocal. The resulting path is used to calculate the timestamps of words in the original signal. The system results in approximately half a second of misalignment error on average. Finally, we discuss the challenges and suggest improvements to the method.
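The dynamic-time-warping step described above can be sketched as follows (MFCC features, the hop size, and the frame-to-time mapping are assumptions for illustration; the paper additionally controls the speech rate and gender of the synthesized voice):

```python
import numpy as np
import librosa

def word_time_mapper(synth, vocal, sr=22050, hop=512):
    """Return a function that maps a time in the synthesized speech to song time."""
    X = librosa.feature.mfcc(y=synth, sr=sr, hop_length=hop)
    Y = librosa.feature.mfcc(y=vocal, sr=sr, hop_length=hop)
    _, wp = librosa.sequence.dtw(X=X, Y=Y)       # warping path, returned end-to-start
    wp = np.array(wp[::-1])                      # reorder start-to-end
    def map_time(t_synth):
        frame = int(t_synth * sr / hop)
        idx = min(np.searchsorted(wp[:, 0], frame), len(wp) - 1)
        return float(wp[idx, 1]) * hop / sr
    return map_time
```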

Posted Content
TL;DR: In this article, a sampling-based speech parameter generation method using moment-matching networks for deep neural network (DNN)-based speech synthesis is presented, giving synthetic speech the natural inter-utterance variation that typical statistical synthesis lacks.
Abstract: This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech even if we try to express the same linguistic and para-linguistic information, typical statistical speech synthesis produces completely the same speech, i.e., there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters. The DNNs are trained so that they make the moments of generated speech parameters close to those of natural speech parameters. Since the variation of speech parameters is compressed into a low-dimensional simple prior noise vector, our algorithm has lower computation cost than direct sampling of speech parameters. As the first step towards generating synthetic speech that has natural inter-utterance variation, this paper investigates whether or not the proposed sampling-based generation deteriorates synthetic speech quality. In evaluation, we compare speech quality of conventional maximum likelihood-based generation and proposed sampling-based generation. The result demonstrates the proposed generation causes no degradation in speech quality.

Journal ArticleDOI
TL;DR: A decision-directed approach to estimating the speech power spectral density (PSD) matrix for multichannel speech enhancement is proposed; it tracks the time-varying speech characteristics more robustly and improves the noise reduction performance under various noise environments.
Abstract: In this letter, a multichannel decision-directed approach to estimating the speech power spectral density (PSD) matrix for multichannel speech enhancement is proposed. There have been attempts to build multichannel speech enhancement filters that depend only on the speech and noise PSD matrices, for which an accurate estimate of the clean speech PSD matrix is crucial for successful noise reduction. In contrast to the maximum likelihood estimator that has been applied conventionally, the proposed decision-directed method is capable of tracking the time-varying speech characteristics more robustly and improves the noise reduction performance under various noise environments.
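For reference, the classic single-channel decision-directed recursion that this letter generalizes to PSD matrices can be sketched as follows (the smoothing factor and flooring are assumed values, and the multichannel matrix form in the letter differs in detail):

```python
import numpy as np

def decision_directed_snr(S_prev, Y_curr, noise_psd, alpha=0.98, xi_min=1e-3):
    # S_prev: previous-frame clean-speech spectral estimate (complex, per frequency bin)
    # Y_curr: current-frame noisy spectrum; noise_psd: noise PSD estimate per bin
    gamma = np.abs(Y_curr) ** 2 / noise_psd                    # a posteriori SNR
    xi = alpha * np.abs(S_prev) ** 2 / noise_psd \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)        # a priori SNR estimate
    return np.maximum(xi, xi_min)

# Applying a Wiener gain xi / (1 + xi) to Y_curr gives the clean-speech estimate
# that is fed back as S_prev in the next frame.
```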

Proceedings ArticleDOI
15 May 2017
TL;DR: A low-cost system which classifies different road conditions (asphalt, gravel, snowy and stony road) using acoustic signal processing is proposed to estimate road/tire friction forces in the active safety systems.
Abstract: In this study, a low-cost system that classifies different road conditions (asphalt, gravel, snowy and stony roads) using acoustic signal processing is proposed, with the aim of estimating road/tire friction forces for active safety systems. Classical acoustic signal processing methods, namely linear predictive coding (LPC), the power spectrum (PSC) and mel-frequency cepstrum coefficients (MFCC), are used together with a minimum-variance and maximum-distance principle in this system. The classification is performed with a support vector machine (SVM).