Showing papers on "Linear predictive coding published in 2013"


Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed SAD scheme is highly effective and provides superior and consistent performance across various noise types and distortion levels.
Abstract: Effective speech activity detection (SAD) is a necessary first step for robust speech applications. In this letter, we propose a robust and unsupervised SAD solution that leverages four different speech voicing measures combined with a perceptual spectral flux feature, for audio-based surveillance and monitoring applications. Effectiveness of the proposed technique is evaluated and compared against several commonly adopted unsupervised SAD methods under simulated and actual harsh acoustic conditions with varying distortion levels. Experimental results indicate that the proposed SAD scheme is highly effective and provides superior and consistent performance across various noise types and distortion levels.
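The spectral flux ingredient above is easy to sketch: it measures frame-to-frame change in the normalized magnitude spectrum, with speech onsets producing large positive flux. The sketch below is a generic formulation; the letter's exact perceptual weighting and frame settings are not given here, so the normalization and `nperseg` are assumptions.

```python
# Minimal sketch of a spectral-flux feature for speech activity detection.
# Assumes a mono signal `x` at sample rate `fs`; frame sizes are illustrative.
import numpy as np
from scipy.signal import stft

def spectral_flux(x, fs, nperseg=512, noverlap=256):
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    mag = np.abs(Z)
    mag /= (mag.sum(axis=0, keepdims=True) + 1e-12)          # normalize each frame
    d = np.diff(mag, axis=1)
    return np.sqrt((np.maximum(d, 0.0) ** 2).sum(axis=0))    # positive changes only
```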

186 citations


Proceedings ArticleDOI
06 May 2013
TL;DR: An attempt has been made to recognize and classify the speech emotion from three language databases, namely, Berlin, Japan and Thai emotion databases, using Support Vector Machines (SVM) as the classification model.
Abstract: Automatic recognition of emotional states from human speech is a current research topic with a wide range of applications. In this paper an attempt has been made to recognize and classify the speech emotion from three language databases, namely, the Berlin, Japan and Thai emotion databases. Speech features consisting of Fundamental Frequency (F0), Energy, Zero Crossing Rate (ZCR), Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC) from short-time wavelet signals are comprehensively investigated. In this regard, Support Vector Machines (SVM) are utilized as the classification model. Empirical experimentation shows that the combined features of F0, Energy and MFCC provide the highest accuracy on all databases when the linear kernel is used, giving 89.80%, 93.57% and 98.00% classification accuracy for the Berlin, Japan and Thai emotion databases, respectively.
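As a rough illustration of the winning feature combination (F0, energy and MFCC with a linear-kernel SVM), the sketch below uses librosa and scikit-learn; the per-utterance mean/std pooling and the pitch-range settings are assumptions, not details from the paper.

```python
# Hedged sketch: F0 + energy + MFCC features pooled per utterance, linear SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(y, sr):
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # fundamental frequency track
    energy = librosa.feature.rms(y=y)[0]                 # short-time energy proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 MFCCs per frame
    # Summarize each track by mean and std to get a fixed-length vector.
    tracks = [f0, energy] + list(mfcc)
    return np.array([s for t in tracks for s in (np.mean(t), np.std(t))])

# With X (stacked feature vectors) and y_labels (emotion classes) provided:
# clf = SVC(kernel="linear").fit(X, y_labels)
```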

94 citations


Journal ArticleDOI
TL;DR: Experiments show large intelligibility improvements with the proposed method over the unprocessed noisy speech and better performance than one state-of-the art method.
Abstract: In this letter the focus is on linear filtering of speech before degradation due to additive background noise. The goal is to design the filter such that the speech intelligibility index (SII) is maximized when the speech is played back in a known noisy environment. Moreover, a power constraint is taken into account to prevent uncomfortable playback levels and deal with loudspeaker constraints. Previous methods use linear approximations of the SII in order to find a closed-form solution. However, as we show, these linear approximations introduce errors in low SNR regions and are therefore suboptimal. In this work we propose a nonlinear approximation of the SII which is accurate for all SNRs. Experiments show large intelligibility improvements with the proposed method over the unprocessed noisy speech and better performance than one state-of-the-art method.

82 citations


Journal ArticleDOI
TL;DR: Compared to the state-of-the-art GMM-based VQ and recently proposed beta mixture model (BMM) based VQ, DVQ performs better, with even fewer free parameters and lower computational cost.
Abstract: Quantization of the linear predictive coding parameters is an important part of speech coding. Probability density function (PDF)-optimized vector quantization (VQ) has been previously shown to be more efficient than VQ based only on training data. For data with bounded support, some well-defined bounded-support distributions (e.g., the Dirichlet distribution) have been proven to outperform the conventional Gaussian mixture model (GMM), with the same number of free parameters required to describe the model. When exploiting both the boundary and the order properties of the line spectral frequency (LSF) parameters, the distribution of LSF differences (ΔLSF) can be modelled with a Dirichlet mixture model (DMM). We propose a corresponding DMM-based VQ. The elements in a Dirichlet vector variable are highly mutually correlated. Motivated by the Dirichlet vector variable's neutrality property, a practical non-linear transformation scheme for the Dirichlet vector variable can be obtained. Similar to the Karhunen-Loeve transform for Gaussian variables, this non-linear transformation decomposes the Dirichlet vector variable into a set of independent beta-distributed variables. Using high rate quantization theory and under the entropy constraint, the optimal inter- and intra-component bit allocation strategies are proposed. In the implementation of scalar quantizers, we use constrained-resolution coding to approximate the derived constrained-entropy coding. A practical coding scheme for DVQ is designed for the purpose of reducing quantization error accumulation. The theoretical and practical quantization performance of DVQ is evaluated. Compared to the state-of-the-art GMM-based VQ and the recently proposed beta mixture model (BMM) based VQ, DVQ performs better, with even fewer free parameters and lower computational cost.
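The neutrality-based decomposition has a compact form: a Dirichlet vector maps to independent beta-distributed "stick-breaking" ratios, which can then be quantized componentwise. A minimal sketch with illustrative names (not taken from the paper):

```python
# If x ~ Dirichlet(a_1, ..., a_K), then v_k = x_k / (1 - x_1 - ... - x_{k-1})
# is Beta(a_k, sum_{j>k} a_j) distributed, and the v_k are mutually independent.
import numpy as np

def dirichlet_to_betas(x):
    """x: Dirichlet-distributed vector (components sum to 1). Returns K-1 Beta variables."""
    v = np.empty(len(x) - 1)
    remaining = 1.0
    for k in range(len(x) - 1):
        v[k] = x[k] / remaining   # stick-breaking ratio, beta-distributed
        remaining -= x[k]
    return v
```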

54 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed trajectory tiling approach can render speech which is both natural and highly intelligible, and the perceived high quality of rendered speech is also confirmed in both objective and subjective evaluations.
Abstract: It is technically challenging to make a machine talk as naturally as a human so as to facilitate “frictionless” interactions between machine and human. We propose a trajectory tiling-based approach to high-quality speech rendering, where speech parameter trajectories, extracted from natural, processed, or synthesized speech, are used to guide the search for the best sequence of waveform “tiles” stored in a pre-recorded speech database. We test the proposed unified algorithm in both Text-To-Speech (TTS) synthesis and cross-lingual voice transformation applications. Experimental results show that the proposed trajectory tiling approach can render speech which is both natural and highly intelligible. The perceived high quality of rendered speech is also confirmed in both objective and subjective evaluations.

48 citations


Journal ArticleDOI
TL;DR: A very simple algorithm based on the skewness of two excitation signals that significantly reduces the computational load through its simplicity and is observed to exhibit the strongest robustness in both noisy and reverberant environments.
Abstract: Detecting the correct speech polarity is a necessary step prior to several speech processing techniques. An error in its determination could have a dramatic detrimental impact on their performance. As current systems have to deal with increasing amounts of data stemming from multiple devices, the automatic detection of speech polarity has become a crucial problem. For this purpose, we propose a very simple algorithm based on the skewness of two excitation signals. The method is shown on 10 speech corpora (8545 files) to lead to an error rate of only 0.06% in clean conditions and to clearly outperform four state-of-the-art methods. Besides, it significantly reduces the computational load through its simplicity and is observed to exhibit the strongest robustness in both noisy and reverberant environments.
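A hedged, simplified sketch of the idea: estimate the excitation by LP inverse filtering and use the sign of a skewness statistic as the polarity cue. The paper combines the skewness of two excitation signals; the single-residual decision rule below is an assumption for illustration only.

```python
# Simplified polarity cue from the skewness of the LP residual.
import numpy as np
import librosa
from scipy.signal import lfilter
from scipy.stats import skew

def polarity_estimate(y, sr, order=24):
    a = librosa.lpc(y, order=order)          # LP coefficients, a[0] == 1
    residual = lfilter(a, [1.0], y)          # inverse filtering -> excitation estimate
    return 1 if skew(residual) > 0 else -1   # sign convention is an assumption
```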

39 citations


Journal ArticleDOI
TL;DR: The experimental results reveal that the proposed method can be used to help speech-language pathologists in classifying speech disfluencies.

Abstract: Stuttering assessment through the manual classification of speech disfluencies is subjective and inconsistent. The effect of the two parameters (LPC order and frame length) in the LPC- and PLP-based methods on the classification results is also investigated. The experimental results reveal that the proposed method can be used to help speech-language pathologists in classifying speech disfluencies.

38 citations


Journal ArticleDOI
TL;DR: This letter presents a voice activity detection (VAD) approach using non-negative sparse coding to improve the detection performance in low signal-to-noise ratio (SNR) conditions and demonstrates that the VAD approach has a good performance in low SNR conditions.
Abstract: This letter presents a voice activity detection (VAD) approach using non-negative sparse coding to improve the detection performance in low signal-to-noise ratio (SNR) conditions. The basic idea is to use features extracted from a noise-reduced representation of original audio signals. We decompose the magnitude spectrum of an audio signal on a speech dictionary learned from clean speech and a noise dictionary learned from noise samples. Only coefficients corresponding to the speech dictionary are considered and used as the noise-reduced representation of the signal for feature extraction. A conditional random field (CRF) is used to model the correlation between feature sequences and voice activity labels along audio signals. Then, we assign the voice activity labels for a given audio by decoding the CRF. Experimental results demonstrate that our VAD approach has a good performance in low SNR conditions.
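The noise-reduced representation can be sketched as a per-frame nonnegative decomposition on fixed, pre-learned dictionaries, keeping only the speech activations. Plain nonnegative least squares stands in for the paper's sparse coder here, and dictionary learning and the CRF stage are omitted.

```python
# Sketch: decompose each magnitude-spectrum frame on concatenated speech/noise
# dictionaries; keep the speech activations as the feature sequence.
import numpy as np
from scipy.optimize import nnls

def speech_activations(frames, W_speech, W_noise):
    """frames: (F, T) magnitude spectrogram; W_*: (F, K) nonnegative dictionaries."""
    W = np.hstack([W_speech, W_noise])
    Ks = W_speech.shape[1]
    H = np.stack([nnls(W, frames[:, t])[0] for t in range(frames.shape[1])], axis=1)
    return H[:Ks]   # speech-dictionary coefficients only -> input to the CRF stage
```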

36 citations


Journal Article
TL;DR: An effort has been made to highlight the progress made so far in the feature extraction phase of speech recognition system and an overview of technological perspective of an Automatic Speech Recognition system are discussed.
Abstract: Speech has evolved as a primary form of communication between humans. The advent of digital technology gave us highly versatile digital processors with high speed, low cost and high power, which enable researchers to transform analog speech signals into digital speech signals that can be scientifically studied. Achieving higher recognition accuracy, a low word error rate and addressing the issues of sources of variability are the major considerations in developing an efficient Automatic Speech Recognition system. In speech recognition, feature extraction requires much attention because recognition performance depends heavily on this phase. In this paper, an effort has been made to highlight the progress made so far in the feature extraction phase of speech recognition systems, and an overview of the technological perspective of an Automatic Speech Recognition system is discussed.

35 citations


Journal ArticleDOI
TL;DR: Gaussian Mixture Model training using several combinations of auditory perception and speech production features, which include principal components of Lyon’s auditory model features, MFCC, LSF and their first and second differences find that many combinations of these feature sets outperform the ITU-T P.563 Recommendation under the test conditions.
Abstract: Quality estimation of speech is essential for monitoring and maintenance of the quality of service at different nodes of modern telecommunication networks. It is also required in the selection of codecs in speech communication systems. There is no requirement of the original clean speech signal as a reference in non-intrusive speech quality evaluation, and thus it is of importance in evaluating the quality of speech at any node of the communication network. In this paper, non-intrusive speech quality assessment of narrowband speech is done by Gaussian Mixture Model (GMM) training using several combinations of auditory perception and speech production features, which include principal components of Lyon's auditory model features, MFCC, LSF and their first and second differences. Results are obtained and compared for several combinations of auditory features for three sets of databases. The results are also compared with ITU-T Recommendation P.563 for non-intrusive speech quality assessment. It is found that many combinations of these feature sets outperform the ITU-T P.563 Recommendation under the test conditions.

35 citations


Journal ArticleDOI
05 Apr 2013
TL;DR: While the main focus in ASR is to obtain spectral envelope measures, human speech communication efficiently exploits the manipulation of one's vocal-cord vibration rate, and so F0 extraction and its integration into ASR are also reviewed.
Abstract: As a pattern recognition application, automatic speech recognition (ASR) requires the extraction of useful features from its input signal, speech. To help determine relevance, human speech production and acoustic aspects of speech perception are reviewed, to identify acoustic elements likely to be most important for ASR. Common methods of estimating useful aspects of speech spectral envelopes are reviewed, from the point of view of efficiency and reliability in mismatched conditions. Because many speech inputs for ASR have noise and channel degradations, ways to improve robustness in speech parameterization are analyzed. While the main focus in ASR is to obtain spectral envelope measures, human speech communication efficiently exploits the manipulation of one's vocal-cord vibration rate [fundamental frequency (F0)], and so F0 extraction and its integration into ASR are also reviewed. For the acoustic analysis reviewed here for ASR, this work presents modern methods as well as future perspectives on important aspects of speech information processing.

Journal ArticleDOI
TL;DR: A Bayesian STSA algorithm is proposed under a stochastic-deterministic speech model that makes provision for the inclusion of a priori information by considering a non-zero mean and has an improved capability to retain low amplitude voiced speech components in low SNR conditions.
Abstract: A wide range of Bayesian short-time spectral amplitude (STSA) speech enhancement algorithms exist, varying in both the statistical model used for speech and the cost functions considered. Current algorithms of this class consistently assume that the distribution of clean speech short time Fourier transform (STFT) samples are either randomly distributed with zero mean or deterministic. No single distribution function has been considered that captures both deterministic and random signal components. In this paper a Bayesian STSA algorithm is proposed under a stochastic-deterministic (SD) speech model that makes provision for the inclusion of a priori information by considering a non-zero mean. Analytical expressions are derived for the speech STFT magnitude in the MMSE sense, and phase in the maximum-likelihood sense. Furthermore, a practical method of estimating the a priori SD speech model parameters is described based on explicit consideration of harmonically related sinusoidal components in each STFT frame, and variations in both the magnitude and phase of these components between successive STFT frames. Objective tests using the PESQ measure indicate that the proposed algorithm results in superior speech quality when compared to several other speech enhancement algorithms. In particular it is clear that the proposed algorithm has an improved capability to retain low amplitude voiced speech components in low SNR conditions.

Patent
21 Jul 2013
TL;DR: In this article, a speech signal can be characterized and the characterization can be employed to improve ASR performance in a noisy environment, such as a media device such as smart TV.
Abstract: Systems and methods are provided for enhancing speech signal intelligibility and for bettering performance of automatic speech recognition processes, for a speech signal in a noisy environment. Some typical application environments include a media device such as a smart TV. An acoustically coupled loudspeaker signal and signals from one or more microphones can be employed to enhance a near end user speech signal. Some processing can be application-specific, such as specific to applications wherein cleaned speech is employed for human voice communication and/or specific to applications employing Automatic Speech Recognition (ASR) processing. A formant emphasis filter and a spectrum band reconstruction process can be employed to enhance speech quality and/or to improve ASR recognition rate performance. A speech signal can be characterized and the characterization can be employed to improve ASR performance. Some systems and methods apply to devices having a foreground microphone and a background microphone.

Journal ArticleDOI
TL;DR: This paper presents a methodology based on empirical mode decomposition (EMD) for classification of continuous normal and pathological speech signals obtained from a well-known database, and demonstrates the effectiveness of the methodology.
Abstract: Automated classification of normal and pathological speech signals can provide an objective and accurate mechanism for pathological speech diagnosis, and is an active area of research. A large part of this research is based on analysis of acoustic measures extracted from sustained vowels. However, sustained vowels do not reflect real-world attributes of voice as effectively as continuous speech, which can take into account important attributes of speech such as rapid voice onset and termination, changes in voice frequency and amplitude, and sudden discontinuities in speech. This paper presents a methodology based on empirical mode decomposition (EMD) for classification of continuous normal and pathological speech signals obtained from a well-known database. EMD is used to decompose randomly chosen portions of speech signals into intrinsic mode functions, which are then analyzed to extract meaningful temporal and spectral features, including true instantaneous features which can capture discriminative information in signals hidden at local time-scales. A total of six features are extracted, and a linear classifier is used with the feature vector to classify continuous speech portions obtained from a database consisting of 51 normal and 161 pathological speakers. A classification accuracy of 95.7 % is obtained, thus demonstrating the effectiveness of the methodology.
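A minimal sketch of the feature pipeline, assuming the third-party PyEMD package (the paper's own implementation and its six specific features may differ): decompose a speech portion into intrinsic mode functions and summarize each with simple temporal statistics.

```python
# Hedged sketch: EMD-based features from a speech segment using PyEMD.
import numpy as np
from PyEMD import EMD   # assumes the PyEMD package is installed

def imf_features(x, n_imfs=4):
    imfs = EMD().emd(x)[:n_imfs]                 # first few intrinsic mode functions
    feats = []
    for imf in imfs:
        feats.append(np.log(np.mean(imf ** 2) + 1e-12))               # log energy
        feats.append(np.mean(np.abs(np.diff(np.signbit(imf).astype(int)))))  # ZCR proxy
    return np.array(feats)
```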

Journal ArticleDOI
TL;DR: The overall compression system built around this modeling tool is shown to achieve the main goals: improved compression and, even more importantly, faster decoding speeds than the state of the art lossless audio compression methods.
Abstract: We investigate the problem of sparse modeling for predictive coding and introduce an efficient algorithm for computing sparse stereo linear predictors for lossless audio compression. Sparse linear predictive coding offers both improved compression and reduction of decoding complexity compared with non-sparse linear predictive coding. The modeling part amounts to finding the optimal structure of a sparse linear predictor using a fully implementable minimum description length (MDL) approach. The MDL criterion, simplified conveniently under realistic assumptions, is approximately minimized by a greedy algorithm which solves sequentially least squares partial problems, where the LDLT factorization ensures numerically stable solutions and facilitates a quasi-optimal quantization of the parameter vector. The overall compression system built around this modeling tool is shown to achieve the main goals: improved compression and, even more importantly, faster decoding speeds than the state of the art lossless audio compression methods. The optimal MDL sparse predictors are shown to provide parametric spectra that constitute new alternative spectral descriptors, capturing important regularities missed by the optimal MDL non-sparse predictors.
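The greedy structure search can be sketched as follows: predictor taps are added one lag at a time, each step solving a least-squares subproblem on the selected lags. The MDL stopping rule and the LDLT machinery are omitted, so the fixed `n_taps` below is an assumption.

```python
# Greedy sketch of sparse linear prediction over candidate lags.
import numpy as np

def greedy_sparse_predictor(x, max_lag=32, n_taps=8):
    X = np.stack([x[max_lag - k:len(x) - k] for k in range(1, max_lag + 1)], axis=1)
    y = x[max_lag:]
    chosen, residual = [], y.copy()
    for _ in range(n_taps):
        scores = np.abs(X.T @ residual)       # correlation with the current residual
        scores[chosen] = -np.inf              # do not reselect a lag
        chosen.append(int(np.argmax(scores)))
        A = X[:, chosen]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
    return np.array(chosen) + 1, coef         # selected lags and their coefficients
```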

Proceedings ArticleDOI
01 Dec 2013
TL;DR: A speech recognition system has been developed using different feature extraction techniques like MFCC (mel frequency cepstral coefficients) and LPC (linear predictive coding), with an HMM (hidden Markov model) used as the classifier; results show that MFCC performs better than LPC in each and every condition.

Abstract: To utilize a robot's capabilities, it is necessary for us to communicate with it efficiently. Thus, Human Robot Interaction is attracting the attention of most researchers these days. In this paper a speech recognition system has been developed using different feature extraction techniques like MFCC (mel frequency cepstral coefficients) and LPC (linear predictive coding), and an HMM (hidden Markov model) is used as the classifier. Less work has been done for the Hindi language in this field, and with a vocabulary size that is not very large. So, the work in this paper has been done for a Hindi database, with a somewhat extended vocabulary size. The HMM has been implemented using the HTK Toolkit. Afterwards, the performances of both of the techniques used have been compared. The work has been done using Audacity for sound recordings and Cygwin to execute the HTK commands in a Linux-type environment on the Windows platform. The system developed has also been tested in both speaker-dependent and speaker-independent environments, whose performance results, as well as the comparison graph of the system, show that MFCC performs better than LPC in each and every condition.

Patent
04 Oct 2013
TL;DR: In this paper, a machine-learning framework is used to extract and analyze cues pertaining to noisy speech to dynamically generate an appropriate gain mask, which may eliminate the noise components from the input audio signal.
Abstract: Described are noise suppression techniques applicable to various systems including automatic speech processing systems in digital audio pre-processing. The noise suppression techniques utilize a machine-learning framework trained on cues pertaining to reference clean and noisy speech signals, and a corresponding synthetic noisy speech signal combining the clean and noisy speech signals. The machine-learning technique is further used to process audio signals in real time by extracting and analyzing cues pertaining to noisy speech to dynamically generate an appropriate gain mask, which may eliminate the noise components from the input audio signal. The audio signal pre-processed in such a manner may be applied to an automatic speech processing engine for corresponding interpretation or processing. The machine-learning technique may enable extraction of cues associated with clean automatic speech processing features, which may be used by the automatic speech processing engine for various automatic speech processing tasks.

Proceedings ArticleDOI
17 Oct 2013
TL;DR: Several experimental results, signal-to-noise ratio, key space, key sensitivity tests, statistical analysis, chosen/known plaintext attack and time analysis show that the proposed method for speech scrambling performs efficiently and can be applied for secure real time speech communications.
Abstract: This paper presents a chaos-based speech scrambling system. Chaotic maps have been successfully used for large-scale data encryption such as image, audio and video data, due to their good properties such as pseudo-randomness, sensitivity to changes in initial conditions and system parameters and aperiodicity. This paper uses two chaotic maps, circle map and logistic map for speech confusion and diffusion, respectively. In the confusion stage, speech samples are divided into small segments. Then, indices of ordered generated sequence of circle map are used to shuffle the positions of the speech signal segments. Then, a one-time pad generated by the logistic map is used for the diffusion stage. Several experimental results, signal-to-noise ratio, key space, key sensitivity tests, statistical analysis, chosen/known plaintext attack and time analysis show that the proposed method for speech scrambling performs efficiently and can be applied for secure real time speech communications.
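A toy sketch of the two stages: ordering a circle-map orbit yields the segment permutation (confusion), and a logistic-map orbit serves as a one-time pad (diffusion). The map parameters, segment length and the additive combination rule below are illustrative key material, not values from the paper.

```python
# Toy chaos-based scrambler: circle-map permutation + logistic-map keystream.
import numpy as np

def circle_map_perm(n, x0=0.3, omega=0.4, K=0.9):
    seq, x = np.empty(n), x0
    for i in range(n):
        x = (x + omega - (K / (2 * np.pi)) * np.sin(2 * np.pi * x)) % 1.0
        seq[i] = x
    return np.argsort(seq)              # indices of the ordered sequence

def logistic_pad(n, x0=0.7, r=3.99):
    pad, x = np.empty(n), x0
    for i in range(n):
        x = r * x * (1 - x)
        pad[i] = x
    return pad

def scramble(samples, seg_len=160):
    segs = samples[:len(samples) // seg_len * seg_len].reshape(-1, seg_len)
    segs = segs[circle_map_perm(len(segs))]   # confusion: shuffle segment positions
    out = segs.ravel()
    return out + logistic_pad(len(out))       # diffusion: additive keystream
```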

Journal ArticleDOI
TL;DR: A spectral domain speech enhancement algorithm is developed, and hidden Markov model (HMM) based MMSE estimators for speech periodogram coefficients are derived under this gamma assumption in both a high uniform resolution and a reduced-resolution Mel domain.
Abstract: The derivation of MMSE estimators for the DFT coefficients of speech signals, given an observed noisy signal and super-Gaussian prior distributions, has received a lot of interest recently. In this letter, we look at the distribution of the periodogram coefficients of different phonemes, and show that they have a gamma distribution with shape parameters less than one. This verifies that the DFT coefficients for not only the whole speech signal but also for individual phonemes have super-Gaussian distributions. We develop a spectral domain speech enhancement algorithm, and derive hidden Markov model (HMM) based MMSE estimators for speech periodogram coefficients under this gamma assumption in both a high uniform resolution and a reduced-resolution Mel domain. The simulations show that the performance is improved using a gamma distribution compared to the exponential case. Moreover, we show that, even though beneficial in some aspects, the Mel-domain processing does not lead to better results than the algorithms in the high-resolution domain.
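The letter's premise is easy to probe empirically: fit a gamma distribution to the periodogram coefficients of a single frequency bin and inspect the shape parameter, which should come out below one for speech. A sketch, with frame settings as assumptions:

```python
# Fit a gamma distribution to one bin of the periodogram and return its shape.
import numpy as np
from scipy.signal import stft
from scipy.stats import gamma

def bin_gamma_shape(x, fs, bin_idx=20):
    _, _, Z = stft(x, fs=fs, nperseg=512)
    periodogram = np.abs(Z[bin_idx]) ** 2
    shape, _, _ = gamma.fit(periodogram, floc=0)   # location fixed at zero
    return shape                                   # < 1 indicates super-Gaussian DFT coeffs
```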

Proceedings ArticleDOI
26 May 2013
TL;DR: By including the phase information in the two steps of a typical single-channel speech enhancement system, it is possible to improve the perceived signal quality of the enhanced signal significantly with respect to the methods that do not employ thephase information.
Abstract: In this paper, we study the impact of exploiting the spectral phase information to further improve the speech quality of the single-channel speech enhancement algorithms. In particular, we focus on the two required steps in a typical single-channel speech enhancement system, namely: parameter estimation solved by a minimum mean square error (MMSE) estimator of the speech spectral amplitude, followed by signal reconstruction stage, where the observed noisy phase is often used. For the parameter estimation stage, in contrast to conventional Wiener filter, a new MMSE estimator is derived which takes into account the clean phase information as a prior information. In our experiments, we show that by including the phase information in the two steps, it is possible to improve the perceived signal quality of the enhanced signal significantly with respect to the methods that do not employ the phase information.

Journal ArticleDOI
TL;DR: A new computational model to identify the aircraft class with better performance is presented; it introduces time segmentation of the take-off noise signal and increases the model's effectiveness at a lower computational cost.

Abstract: Aircraft noise is one of the most uncomfortable kinds of sounds. That is why many organizations have addressed this problem through noise contours around airports, for which they use the aircraft type as the key element. This paper presents a new computational model to identify the aircraft class with better performance, because it introduces time segmentation of the take-off noise signal. A method for signal segmentation into four segments was created. The aircraft noise patterns are extracted using an LPC (Linear Predictive Coding) based technique and the classification is made by combining the output of four parallel MLP (Multilayer Perceptron) neural networks, one for each segment. The individual accuracy of each network was improved using a wrapper feature selection method, increasing the model's effectiveness at a lower computational cost. The aircraft are grouped into classes depending on the installed engine type. The model works with 13 aircraft categories with an identification level above 85% in real environments.
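Structurally, the model can be sketched as four per-segment LPC feature extractors feeding four parallel MLPs whose outputs are combined. The wrapper feature selection and the exact segmentation rule are omitted, and the sizes below are assumptions.

```python
# Sketch: per-segment LPC features and an ensemble of four MLPs.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def segment_lpc(y, order=16):
    segs = np.array_split(y, 4)                               # four time segments
    return [librosa.lpc(s, order=order)[1:] for s in segs]    # drop the leading 1.0

# Training, with X_seg[i] the LPC features of segment i across recordings and
# y_cls the aircraft classes (both user-provided):
# nets = [MLPClassifier(hidden_layer_sizes=(32,)).fit(X_seg[i], y_cls) for i in range(4)]
# Prediction: average the four per-segment class-probability outputs.
```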

Journal ArticleDOI
TL;DR: A speech pre-enhancement method based on matching the recognized text to the text of the original message that indicates a significant improvement over natural speech and a reference system that optimizes a perceptual-distortion-based objective intelligibility measure.
Abstract: An effective measure of speech intelligibility is the probability of correct recognition of the transmitted message. We propose a speech pre-enhancement method based on matching the recognized text to the text of the original message. The selected criterion is accurately approximated by the probability of the correct transcription given an estimate of the noisy speech features. In the presence of environment noise, and with a decrease in the signal-to-noise ratio, speech intelligibility declines. We implement a speech pre-enhancement system that optimizes the proposed criterion for the parameters of two distinct speech modification strategies under an energy-preservation constraint. The proposed method requires prior knowledge in the form of a transcription of the transmitted message and acoustic speech models from an automatic speech recognition system. Performance results from an open-set subjective intelligibility test indicate a significant improvement over natural speech and a reference system that optimizes a perceptual-distortion-based objective intelligibility measure. The computational complexity of the approach permits use in on-line applications.

Proceedings ArticleDOI
26 May 2013
TL;DR: To use video input as an auxiliary modality for speech processing by applying a new statistical model - the twin hidden Markov model, which greatly outperforms the standard audio-only log-MMSE estimator on all considered instrumental speech quality measures covering spectral and perceptual quality.
Abstract: Most approaches for speech signal processing rely solely on acoustic input, which has the consequence that spectrum estimation becomes exceedingly difficult when the signal-to-noise ratio drops to values near 0 dB. However, alternative sources of information are becoming widely available with increasing use of multimedia data in everyday communication. In the following paper, we suggest to use video input as an auxiliary modality for speech processing by applying a new statistical model - the twin hidden Markov model. The resulting enhancement algorithm for audiovisual data greatly outperforms the standard audio-only log-MMSE estimator on all considered instrumental speech quality measures covering spectral and perceptual quality.

Proceedings ArticleDOI
26 May 2013
TL;DR: This zero-resource system employs a segmental dynamic time warping algorithm for acoustic pattern discovery in conjunction with a probabilistic model which treats the topic and pseudo-word identity of each discovered pattern as hidden variables.
Abstract: Zero-resource speech processing involves the automatic analysis of a collection of speech data in a completely unsupervised fashion without the benefit of any transcriptions or annotations of the data. In this paper, our zero-resource system seeks to automatically discover important words, phrases and topical themes present in an audio corpus. This system employs a segmental dynamic time warping (S-DTW) algorithm for acoustic pattern discovery in conjunction with a probabilistic model which treats the topic and pseudo-word identity of each discovered pattern as hidden variables. By applying an Expectation-Maximization (EM) algorithm, our system estimates the latent probability distributions over the pseudo-words and topics associated with the discovered patterns. Using this information, we produce acoustic summaries of the dominant topical themes of the audio document collection.
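A bare-bones sketch of the alignment primitive that segmental DTW builds on: a dynamic time warping cost between two feature sequences. The paper's S-DTW additionally constrains the warp to diagonal bands and extracts multiple local alignments, which is omitted here.

```python
# Plain DTW alignment cost between two feature sequences (e.g., MFCC frames).
import numpy as np

def dtw_cost(A, B):
    """A: (m, d), B: (n, d) feature sequences. Returns a normalized alignment cost."""
    m, n = len(A), len(B)
    dist = np.linalg.norm(A[:, None] - B[None], axis=2)   # pairwise frame distances
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n] / (m + n)
```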

Book
03 Dec 2013
TL;DR: Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression, and is suitable for beginners pursuing basic research in digital speech processing.
Abstract: Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

Journal ArticleDOI
TL;DR: Two novel recovery methods of speech based on the AM-FM model are presented: one is based on LCT-domain filtering; the other is based on chirp-signal parameter estimation to restore the speech signal in the LCT domain.

Journal ArticleDOI
TL;DR: This paper introduces a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources as found in a typical family living room and approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.

Journal ArticleDOI
TL;DR: A new scheme of data hiding which takes advantage of the masking property of the Human Auditory System to hide a secret signal into a host (speech) signal, and the embedding process is carried out into the wavelet coefficients of the speech signals.

Journal ArticleDOI
TL;DR: A data-driven approach is described, which matches each mixed speech segment against a composite training segment to separate the underlying clean speech segments, and seeks and separates the longest mixed speech segments with matching composite training segments.
Abstract: This paper studies single-channel speech separation, assuming unknown, arbitrary temporal dynamics for the speech signals to be separated. A data-driven approach is described, which matches each mixed speech segment against a composite training segment to separate the underlying clean speech segments. To advance the separation accuracy, the new approach seeks and separates the longest mixed speech segments with matching composite training segments. Lengthening the mixed speech segments to match reduces the uncertainty of the constituent training segments, and hence the error of separation. For convenience, we call the new approach Composition of Longest Segments, or CLOSE. The CLOSE method includes a data-driven approach to model long-range temporal dynamics of speech signals, and a statistical approach to identify the longest mixed speech segments with matching composite training segments. Experiments are conducted on the Wall Street Journal database, for separating mixtures of two simultaneous large-vocabulary speech utterances spoken by two different speakers. The results are evaluated using various objective and subjective measures, including the challenge of large-vocabulary continuous speech recognition. It is shown that the new separation approach leads to significant improvement in all these measures.

Proceedings ArticleDOI
04 Nov 2013
TL;DR: In this paper, authors implemented a speaker recognition system (SRS) using Mel-Frequency Cepstrum Coefficients (MFCC), Linear Prediction Coding (LPC) as feature extraction techniques and Vector Quantization (VQ) as speaker classification technique.
Abstract: Identity verification is a very important issue in the current era of information technology. Conventional means of identity verification using keys or personal identification numbers and passwords can be stolen and are not well suited for such critical purposes. To overcome this problem, new biometric verification methods have emerged. Now, increased computing power and decreased microchip size have given thrust to implementing realistic biometric authentication systems such as speech recognition systems. In this paper, the authors implement a speaker recognition system (SRS) using Mel-Frequency Cepstrum Coefficients (MFCC) and Linear Prediction Coding (LPC) as feature extraction techniques and Vector Quantization (VQ) as the speaker classification technique.
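A minimal sketch of the MFCC-plus-VQ pipeline: learn a per-speaker codebook with k-means and identify a test utterance by the codebook yielding the smallest average quantization distortion. Codebook size and MFCC settings below are illustrative, not values from the paper.

```python
# Classic MFCC + vector-quantization speaker recognition sketch.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def train_codebook(y, sr, n_codes=32):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T          # (frames, 13)
    return KMeans(n_clusters=n_codes, n_init=10).fit(mfcc).cluster_centers_

def distortion(y, sr, codebook):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    d = np.linalg.norm(mfcc[:, None, :] - codebook[None], axis=2)  # frame-to-code distances
    return d.min(axis=1).mean()                                    # average quantization error

# Identification: pick the speaker whose codebook minimizes distortion(test_y, sr, cb).
```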