Showing papers on "Linear predictive coding published in 2011"

PDF

Open Access

Proceedings Article•DOI•

Learning a better representation of speech soundwaves using restricted boltzmann machines

[...]

Navdeep Jaitly¹, Geoffrey E. Hinton¹•Institutions (1)

22 May 2011

TL;DR: A novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a novel type of hidden variable is presented and initial results demonstrate phoneme recognition performance better than the current state-of-the-art for methods based on Mel cepstrum coefficients.

...read moreread less

Abstract: State of the art speech recognition systems rely on preprocessed speech features such as Mel cepstrum or linear predictive coding coefficients that collapse high dimensional speech sound waves into low dimensional encodings. While these have been successfully applied in speech recognition systems, such low dimensional encodings may lose some relevant information and express other information in a way that makes it difficult to use for discrimination. Higher dimensional encodings could both improve performance in recognition tasks, and also be applied to speech synthesis by better modeling the statistical structure of the sound waves. In this paper we present a novel approach for modeling speech sound waves using a Restricted Boltzmann machine (RBM) with a novel type of hidden variable and we report initial results demonstrating phoneme recognition performance better than the current state-of-the-art for methods based on Mel cepstrum coefficients.

...read moreread less

223 citations

Journal Article•DOI•

Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing.

[...]

Søren Jørgensen¹, Torsten Dau•Institutions (1)

Technical University of Denmark¹

02 Sep 2011-Journal of the Acoustical Society of America

TL;DR: The results strongly suggest that the signal-to-noise ratio at the output of a modulation frequency selective process provides a key measure of speech intelligibility.

...read moreread less

Abstract: A model for predicting the intelligibility of processed noisy speech is proposed. The speech-based envelope power spectrum model has a similar structure as the model of Ewert and Dau [(2000). J. Acoust. Soc. Am. 108, 1181-1196], developed to account for modulation detection and masking data. The model estimates the speech-to-noise envelope power ratio, SNR(env), at the output of a modulation filterbank and relates this metric to speech intelligibility using the concept of an ideal observer. Predictions were compared to data on the intelligibility of speech presented in stationary speech-shaped noise. The model was further tested in conditions with noisy speech subjected to reverberation and spectral subtraction. Good agreement between predictions and data was found in all cases. For spectral subtraction, an analysis of the model's internal representation of the stimuli revealed that the predicted decrease of intelligibility was caused by the estimated noise envelope power exceeding that of the speech. The classical concept of the speech transmission index fails in this condition. The results strongly suggest that the signal-to-noise ratio at the output of a modulation frequency selective process provides a key measure of speech intelligibility.

...read moreread less

193 citations

Speech Recognition Using Linear Predictive Coding and Artificial Neural Network for Controlling Movement of Mobile Robot

[...]

Suryo Wijoyo

01 Jan 2011

TL;DR: Implementation of speech recognition system on a mobile robot for controlling movement of the robot is described and the highest recognition rate that can be achieved is 91.4%.

...read moreread less

Abstract: This paper describes about implementation of speech recognition system on a mobile robot for controlling movement of the robot. The methods used for speech recognition system are Linear Predictive Coding (LPC) and Artificial Neural Network (ANN). LPC method is used for extracting feature of a voice signal and ANN is used as the recognition method. Backpropagation method is used to train the ANN. Voice signals are sampled directly from the microphone and then they are processed using LPC method for extracting the features of voice signal. For each voice signal, LPC method produces 576 data. Then, these data become the input of the ANN. The ANN was trained by using 210 data training. This data training includes the pronunciation of the seven words used as the command, which are created from 30 different people. Experimental results show that the highest recognition rate that can be achieved by this system is 91.4%. This result is obtained by using 25 samples per word, 1 hidden layer, 5 neurons for each hidden layer, and learning rate 0.1.

...read moreread less

65 citations

Journal Article•DOI•

Modulation-domain Kalman filtering for single-channel speech enhancement

[...]

Stephen So¹, Kuldip K. Paliwal¹•Institutions (1)

Griffith University¹

01 Jul 2011-Speech Communication

TL;DR: The results from objective experiments and blind subjective listening tests using the NOIZEUS corpus show that the MDKF (with clean speech parameters) outperforms all the acoustic and time- domain enhancement methods that were evaluated, including the time-domain Kalman filter withclean speech parameters.

...read moreread less

64 citations

Journal Article•DOI•

Epoch-based analysis of speech signals

[...]

B. Yegnanarayana¹, Suryakanth V. Gangashetty¹•Institutions (1)

International Institute of Information Technology¹

22 Nov 2011-Sadhana-academy Proceedings in Engineering Sciences

TL;DR: In this paper, the importance of epochs for speech analysis is discussed, and methods to extract the epoch information are reviewed, and applications of epoch extraction for some speech applications are demonstrated.

...read moreread less

Abstract: Speech analysis is traditionally performed using short-time analysis to extract features in time and frequency domains. The window size for the analysis is fixed somewhat arbitrarily, mainly to account for the time varying vocal tract system during production. However, speech in its primary mode of excitation is produced due to impulse-like excitation in each glottal cycle. Anchoring the speech analysis around the glottal closure instants (epochs) yields significant benefits for speech analysis. Epoch-based analysis of speech helps not only to segment the speech signals based on speech production characteristics, but also helps in accurate analysis of speech. It enables extraction of important acoustic-phonetic features such as glottal vibrations, formants, instantaneous fundamental frequency, etc. Epoch sequence is useful to manipulate prosody in speech synthesis applications. Accurate estimation of epochs helps in characterizing voice quality features. Epoch extraction also helps in speech enhancement and multispeaker separation. In this tutorial article, the importance of epochs for speech analysis is discussed, and methods to extract the epoch information are reviewed. Applications of epoch extraction for some speech applications are demonstrated.

...read moreread less

57 citations

Journal Article•DOI•

Enhancement of noisy speech by temporal and spectral processing

[...]

P. Krishnamoorthy¹, S. R. M. Prasanna²•Institutions (2)

Samsung India Software Center¹, Indian Institute of Technology Guwahati²

01 Feb 2011-Speech Communication

TL;DR: A noisy speech enhancement method by combining linear prediction (LP) residual weighting in the time domain and spectral processing in the frequency domain to provide better noise suppression as well as better enhancement in the speech regions is presented.

...read moreread less

49 citations

Journal Article•DOI•

Adaptive Feedback Cancellation With Band-Limited LPC Vocoder in Digital Hearing Aids

[...]

Guilin Ma¹, Fredrik Gran, Finn Jacobsen¹, Finn T. Agerkvist¹•Institutions (1)

University of Copenhagen¹

01 May 2011-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The idea is to replace the hearing-aid output with a synthesized signal, which sounds perceptually the same as or similar to the original signal but is statistically uncorrelated with the external input signal at high frequencies where feedback oscillation usually occurs.

...read moreread less

Abstract: Feedback oscillation is one of the major issues with hearing aids. An effective way of feedback suppression is adaptive feedback cancellation, which uses an adaptive filter to estimate the feedback path. However, when the external input signal is correlated with the receiver input signal, the estimate of the feedback path is biased. This so-called “bias problem” results in a large modeling error and a cancellation of the desired signal. This paper proposes a band-limited linear predictive coding based approach to reduce the bias. The idea is to replace the hearing-aid output with a synthesized signal, which sounds perceptually the same as or similar to the original signal but is statistically uncorrelated with the external input signal at high frequencies where feedback oscillation usually occurs. Simulation results show that the proposed algorithm can effectively reduce the bias and the misalignment between the real and the estimated feedback path. When combined with filtered-X adaptation in the feedback canceller, this approach reduces the misalignment even further.

...read moreread less

43 citations

Patent•

Speech audio processing

[...]

Willem M. Beltman¹, Matías Zañartu, Arijit Raychowdhury, Anand P. Rangarajan, Michael E. Deisher - Show less +1 more•Institutions (1)

Intel¹

30 Jun 2011

TL;DR: In this paper, a speech processing engine is provided that employs Kalman filtering with a particular speaker's glottal information to clean up an audio speech signal for more efficient automatic speech recognition.

...read moreread less

Abstract: A speech processing engine is provided that in some embodiments, employs Kalman filtering with a particular speaker's glottal information to clean up an audio speech signal for more efficient automatic speech recognition.

...read moreread less

39 citations

Journal Article•DOI•

Speech recognition from spectral dynamics

[...]

Hynek Hermansky¹•Institutions (1)

Johns Hopkins University¹

22 Nov 2011-Sadhana-academy Proceedings in Engineering Sciences

TL;DR: The paper points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech, and aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech.

...read moreread less

Abstract: Information is carried in changes of a signal. The paper starts with revisiting Dudley’s concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of gradual infusion of the modulation spectrum concept into Automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.

...read moreread less

28 citations

Journal Article•DOI•

Performance of an Event-Based Instantaneous Fundamental Frequency Estimator for Distant Speech Signals

[...]

G. Seshadri¹, B. Yegnanarayana•Institutions (1)

Indian Institute of Technology Madras¹

01 Sep 2011-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The proposed method depends only on the strengths of impulse-like excitations in the direct component of distant speech signals, and not on the similarity of speech signal in successive glottal cycles, so it is robust to the effects of reverberation and noise.

...read moreread less

Abstract: This paper proposes a method for extracting the fundamental frequency of voiced speech from distant speech signals. The method is based on the impulse-like nature of excitation in voiced speech. The characteristics of impulse-like excitation are extracted by filtering the speech signal through a cascade of resonators located at zero frequency. The resulting filtered signal preserves information specific to the fundamental frequency, in the sequence of positive-to-negative zero crossings. Also, the filtered signal is free from the effects of resonances of the vocal tract. An estimate of the fundamental frequency is derived from the short-time spectrum of the filtered signal. This estimate is used to remove spurious zero crossings in the filtered signal. The proposed method depends only on the strengths of impulse-like excitations in the direct component of distant speech signals, and not on the similarity of speech signal in successive glottal cycles. Hence, the method is robust to the effects of reverberation and noise. Performance of the method is evaluated using a database of close-speaking and distant speech signals. Experiments show that the accuracy of the proposed method is significantly higher than that of existing methods based on time-domain and frequency-domain processing.

...read moreread less

27 citations

Proceedings Article•DOI•

Malaysian English accents identification using LPC and formant analysis

[...]

M. A. Yusnita¹, M. P. Paulraj², Sazali Yaacob², Shahriman Abu Bakar², A. Saidatul² - Show less +1 more•Institutions (2)

Universiti Teknologi MARA¹, Universiti Malaysia Perlis²

01 Nov 2011

TL;DR: The experimental results indicate a highly promising recognition accuracy of 94.2% upon feature fusion sets of LPC, formants and log energy, and the accent identity of a speaker is predicted using K-Nearest Neighbors (KNN) classifier based on the extracted information.

...read moreread less

Abstract: In Malaysia, most people speak several varieties of English known as Malaysian English (MalE) and there is no uniform version because of the existence of multi-ethnic population. It is a common scenario that Malaysians speak a particular local Malay, Chinese or Indian English accent. As most commercial speech recognizers have been developed using a standard English language, it is a challenging task for achieving highly efficient performance when other accented speech are presented to this system. Accent identification (AccID) can be one of the subsystem in speaker independent automatic speech recognition (SI-ASR) system so that deterioration issue in its performance can be tackled. In this paper, the most important speech features of three ethnic groups of MalE speakers are extracted using Linear Predictive Coding (LPC), formant and log energy feature vectors. In the subsequent stage, the accent identity of a speaker is predicted using K-Nearest Neighbors (KNN) classifier based on the extracted information. Prior, the preprocessing parameters and LPC order are investigated to properly extract the speech features. This study is conducted on a small set speech corpus developed as pilot study to determine the feasibility of automatic AccID of MalE speakers which has never been reported before. The experimental results indicate a highly promising recognition accuracy of 94.2% upon feature fusion sets of LPC, formants and log energy.

...read moreread less

Exemplar-based speech enhancement and its application to noise-robust automatic speech recognition

[...]

Jort F. Gemmeke, Tuomas Virtanen¹, Antti Hurmalainen¹•Institutions (1)

Tampere University of Technology¹

01 Jan 2011

TL;DR: In this work an exemplar-based technique for speech enhancement of noisy speech is proposed, which works by finding a sparse representation of the noisy speech in a dictionary containing both speech and noise exemplars, and uses the activated dictionary atoms to create a time-varying filter to enhance the noisyspeech.

...read moreread less

Abstract: In this work an exemplar-based technique for speech enhancement of noisy speech is proposed. The technique works by finding a sparse representation of the noisy speech in a dictionary containing both speech and noise exemplars, and uses the activated dictionary atoms to create a time-varying filter to enhance the noisy speech. The speech enhancement algorithm is evaluated using measured signal to noise ratio (SNR) improvements as well as by using automatic speech recognition. Experiments on the PASCAL CHiME challenge corpus, which contains speech corrupted by both reverberation and authentic living room noise at varying SNRs ranging from 9 to -6 dB, confirm the validity of the proposed technique. Examples of enhanced signals are available at http://www.cs.tut.fi/ ˜tuomasv/. Index Terms: speech enhancement, exemplar-based, noise robustness, sparse representations

...read moreread less

Patent•

Method and apparatus for audio coding and decoding

[...]

Udar Mittal¹, James P. Ashley², Jonathan Alastair Gibbs¹•Institutions (2)

Motorola¹, Google²

26 Jul 2011

TL;DR: In this article, an encoder and decoder for processing an audio signal including generic audio and speech frames are provided, and two encoders and two decoders are utilized by the speech decoder.

...read moreread less

Abstract: An encoder and decoder for processing an audio signal including generic audio and speech frames are provided herein. During operation, two encoders are utilized by the speech coder, and two decoders are utilized by the speech decoder. The two encoders and decoders are utilized to process speech and non-speech (generic audio) respectively. During a transition between generic audio and speech, parameters that are needed by the speech decoder for decoding frame of speech are generated by processing the preceding generic audio (non-speech) frame for the necessary parameters. Because necessary parameters are obtained by the speech coder/decoder, the discontinuities associated with prior-art techniques are reduced when transitioning between generic audio frames and speech frames.

...read moreread less

Patent•

Speaker segmentation in noisy conversational speech

[...]

Jiyong Ma

29 Apr 2011

TL;DR: In this paper, a robust multiple speaker segmentation in noisy conversational speech is presented, where the system identifies one reliable speech segment near the beginning of the conversation and extracts speech features with a short latency, then learns a statistical model from the selected speech segment.

...read moreread less

Abstract: System and methods for robust multiple speaker segmentation in noisy conversational speech are presented. Robust voice activity detection is applied to detect temporal speech events. In order to get robust speech features and detect speech events in a noisy environment, a noise reduction algorithm is applied, using noise tracking. After noise reduction and voice activity detection, the incoming audio/speech is initially labeled as speech segments or silence segments. With no prior knowledge of the number of speakers, the system identifies one reliable speech segment near the beginning of the conversational speech and extracts speech features with a short latency, then learns a statistical model from the selected speech segment. This initial statistical model is used to identify the succeeding speech segments in a conversation. The statistical model is also continuously adapted and expanded with newly identified speech segments that match well to the model. The speech segments with low likelihoods are labeled with a second speaker ID, and a statistical model is learned from them. At the same time, these two trained speaker models are also updated/adapted once a reliable speech segment is identified. If a speech segment does not match well to the two speaker models, the speech segment is temporarily labeled as an outlier or as originating from a third speaker. This procedure is then applied recursively as needed when there are more than two speakers in a conversation.

...read moreread less

Patent•

Systems and Methods for Enhancing Voice Quality in Mobile Device

[...]

Carlo Murgia¹, Scott K. Isabelle¹•Institutions (1)

Audience¹

03 Nov 2011

TL;DR: In this article, a speech-noise classification data with a speech encoder via a shared memory or by a Least Significant Bit (LSB) of a Pulse Code Modulation (PCM) stream is presented.

...read moreread less

Abstract: Provided are methods and systems for enhancing the quality of voice communications. The method and corresponding system may involve classifying an audio signal into speech, and speech and noise and creating speech-noise classification data. The method may further involve sharing the speech-noise classification data with a speech encoder via a shared memory or by a Least Significant Bit (LSB) of a Pulse Code Modulation (PCM) stream. The method and corresponding system may also involve sharing acoustic cues with the speech encoder to improve the speech noise classification and, in certain embodiments, sharing scaling transition factors with the speech encoder to enable the speech encoder gradually change data rate in the transitions between the encoding modes.

...read moreread less

Proceedings Article•

Secure voice communication via GSM network

[...]

Mehmet Ozkan¹, Berna Ors¹, Gokay Saldamli²•Institutions (2)

Istanbul Technical University¹, Boğaziçi University²

01 Dec 2011

TL;DR: A system which communicates through GSM mobile phones and provides protection for interviews against third parties including with service providers developed and formed speech like waveform by the designed coder to transmit through the GSM line.

...read moreread less

Abstract: In this study, a system is developed which communicates through GSM mobile phones and provides protection for interviews against third parties including with service providers developed. GSM line is sensitive to human speech to be more efficient and provide more quality for transmission. In addition, a tool should be used to compress speech to transmit speech over GSM. For these reasons, speech cannot be transmitted to the GSM line directly after encrypted. In this study, the encrypted speech which is a digital data stream, formed speech like waveform by the designed coder to transmit through the GSM line. FPGA implementation of AES is used for encryption of digital data stream. Desired speech characteristics are obtained by scanning the database of NTIMIT, and then LBG algorithm is used to design codebooks which include speech parameters. A coder is designed to synthesize speech like waveforms from the encrypted digital data stream.

...read moreread less

Journal Article•

Malay isolated speech recognition using neural network: a work in finding number of hidden nodes and learning parameters

[...]

Salam, Dzulkifli Mohamad¹, Sheikh Hussain Shaikh Salleh•Institutions (1)

Universiti Teknologi Malaysia¹

01 Oct 2011-The International Arab Journal of Information Technology

TL;DR: The result shows that choosing unsuitable parameters lead to unlearned network while some good parameters set from previous work perform badly in this application.

...read moreread less

Abstract: This paper explains works in speech recognition using neural network. The main objective of the experiment is to choose suitable number of nodes in hidden layer and learning parameters for malay iIsolated digit speech problem through trial and error method. The network used in the experiment is feed forward multilayer perceptron trained with back propagation scheme. Speech data for the study are analyzed using linear predictive coding and log area ratio to represent speech signal for every 20ms through a fixed overlapped windows. The neural network learning operation is greatly influenced by the parameters i.e., momentum, learning rate and number of hidden nodes chosen. The result shows that choosing unsuitable parameters lead to unlearned network while some good parameters set from previous work perform badly in this application. Best recognition rate achieved was 95% using network topology of input nodes, hidden nodes and output nodes of size 320:45:4 respectively while the best momentum rate and learning rate in the experiment were 0.5 and 0.75.

...read moreread less

Journal Article•DOI•

Burst packet loss concealment using multiple codebooks and comfort noise for CELP-type speech coders in wireless sensor networks.

[...]

Nam In Park¹, Hong Kook Kim², Jung Min A², Seong Ro Lee³, Seung Ho Choi¹ - Show less +1 more•Institutions (3)

Gwangju Institute of Science and Technology¹, Mokpo National University², Seoul National University of Science and Technology³

17 May 2011-Sensors

TL;DR: A packet loss concealment (PLC) algorithm for CELP-type speech coders is proposed in order to improve the quality of decoded speech under burst packet loss conditions in a wireless sensor network and provides significantly better speech quality than the PLC algorithm employed in G.729.

...read moreread less

Abstract: In this paper, a packet loss concealment (PLC) algorithm for CELP-type speech coders is proposed in order to improve the quality of decoded speech under burst packet loss conditions in a wireless sensor network. Conventional receiver-based PLC algorithms in the G.729 speech codec are usually based on speech correlation to reconstruct the decoded speech of lost frames by using parameter information obtained from the previous correctly received frames. However, this approach has difficulty in reconstructing voice onset signals since the parameters such as pitch, linear predictive coding coefficient, and adaptive/fixed codebooks of the previous frames are mostly related to silence frames. Thus, in order to reconstruct speech signals in the voice onset intervals, we propose a multiple codebook-based approach that includes a traditional adaptive codebook and a new random codebook composed of comfort noise. The proposed PLC algorithm is designed as a PLC algorithm for G.729 and its performance is then compared with that of the PLC algorithm currently employed in G.729 via a perceptual evaluation of speech quality, a waveform comparison, and a preference test under different random and burst packet loss conditions. It is shown from the experiments that the proposed PLC algorithm provides significantly better speech quality than the PLC algorithm employed in G.729 under all the test conditions.

...read moreread less

Proceedings Article•

An Informed Source Separation System for Speech Signals.

[...]

Shuhua Zhang, Laurent Girin

27 Aug 2011

TL;DR: Compared to the reference MPEG-4 Spatial Audio Object Coding system, the proposedISS system pro- vides much cleaner separated speech signals (consistently 10- 20 dB higher Signal to Interference Ratios), revealing strong potential for audio conference applications.

...read moreread less

Abstract: In two previous papers, we proposed an audio Informed Source Separation (ISS) system which can achieve the separation of I > 2 musical sources from linear instantaneous stationary stereo (2-channel) mixtures, based on audio signal's natural sparsity, pre-mix source signals analysis, and side-information embedding (within the mix signal). In the present paper and for the first time, we apply this system to mixtures of (up to seven) simultaneous speech signals. Compared to the reference MPEG-4 Spatial Audio Object Coding system, our system pro- vides much cleaner separated speech signals (consistently 10- 20 dB higher Signal to Interference Ratios), revealing strong potential for audio conference applications.

...read moreread less

Patent•

Apparatus and method for determining weighting function having low complexity for linear predictive coding (lpc) coefficients quantization

[...]

Ho-Sang Sung¹, Eun-Mi Oh¹•Institutions (1)

Samsung¹

18 Oct 2011

TL;DR: In this paper, a method and apparatus for determining a weighting function for quantizing a linear predictive coding (LPC) coefficient and having a low complexity was proposed, which may convert an LPC coefficient of a mid-subframe of an input signal to one of a immitance spectral frequency (ISF) coefficient, and may determine the weighting functions associated with an importance of the ISF coefficient or the LSF coefficient based on the converted ISF coefficients or LSF coefficients.

...read moreread less

Abstract: Proposed is a method and apparatus for determining a weighting function for quantizing a linear predictive coding (LPC) coefficient and having a low complexity. The weighting function determination apparatus may convert an LPC coefficient of a mid-subframe of an input signal to one of a immitance spectral frequency (ISF) coefficient and a line spectral frequency (LSF) coefficient, and may determine a weighting function associated with an importance of the ISF coefficient or the LSF coefficient based on the converted ISF coefficient or LSF coefficient.

...read moreread less

Proceedings Article•DOI•

Intelligibility enhancement of bone conducted speech by an analysis-synthesis method

[...]

M. Shahidur Rahman¹, Tetsuya Shimamura¹•Institutions (1)

Saitama University¹

23 Sep 2011

TL;DR: In this article, the authors proposed an intelligibility enhancement technique for bone conducted (BC) speech without exploiting any spectral characteristics of normal air conducted (AC) speech, which has proven to be very suitable for military, rescue and security operations.

...read moreread less

Abstract: This paper proposes an intelligibility enhancement technique for bone conducted (BC) speech without exploiting any spectral characteristics of normal air conducted (AC) speech. Due to its robustness against ambient noise, BC speech has recently received a lot of attention. Particularly, it has proven to be very suitable for military, rescue and security operations. However, BC speech suffers from lower intelligibility for it lacks higher frequency components. This appears as a tradeoff for human-machine communication like speech recognition and understanding. The proposed technique enhances the weak higher frequency components by an analysis-synthesis method based on linear prediction. Preliminary listening test and spectrograms produced from the synthesized BC speech demonstrates significant intelligibility enhancement when compared with the original BC speech.

...read moreread less

Patent•

Speech model generating apparatus, speech synthesis apparatus, speech model generating program product, speech synthesis program product, speech model generating method, and speech synthesis method

[...]

Javier Latorre¹, Masami Akamine¹•Institutions (1)

Toshiba¹

21 Sep 2011

TL;DR: In this article, a speech model generating apparatus consisting of a spectrum analyzer, a chunker, a parameterizer, a clustering unit, and a model training unit is described.

...read moreread less

Abstract: According to one embodiment, a speech model generating apparatus includes a spectrum analyzer, a chunker, a parameterizer, a clustering unit, and a model training unit. The spectrum analyzer acquires a speech signal corresponding to text information and calculates a set of spectral coefficients. The chunker acquires boundary information indicating a beginning and an end of linguistic units and chunks the speech signal into linguistic units. The parameterizer calculates a set of spectral trajectory parameters for a trajectory of the spectral trajectory parameters of the linguistic unit on the basis of the spectral coefficients. The clustering unit clusters the spectral trajectory parameters calculated for each of the linguistic units into clusters on the basis of linguistic information. The model training unit obtains a trained spectral trajectory model indicating a characteristic of a cluster based on the spectral trajectory parameters belonging to the same cluster.

...read moreread less

Performance evaluation of speech scrambling methods based on statistical approach

[...]

Sattar B. Sadkhan, Nidaa A. Abbas

01 Jan 2011

TL;DR: A comparison between speech scrambler methods using methods based on statistical metrics called Independent Component Analysis (lCA) and Principal component Analysis (PCA) is proposed.

...read moreread less

Abstract: The speech scrambling plays a great role in many important communication systems, such as military communications and banks communication systems. There are many traditional scrambling methods used in single dimension such as lime or frequency domain scrambling. This paper propose a comparison between speech scrambler methods using methods based on statistical metrics called Independent Component Analysis (lCA) and Principal Component Analysis (PCA). For lCA method, Approximate Diagonalization of Eigen-matrices (JADE) algorithm was implemented while for PCA, traditional PCA algorithm was implemented. The paper takes into consideration the testability of many input speech signals in English in two types of bits 8 bits and 16 bits. The objective test using Linear Predictive Coding (LPC), Signal-to-Noise Ratio (SNR) where applied to evaluate the scrambling systems under consideration.

...read moreread less

Patent•

Method for converting emotional speech by combining rhythm parameters with tone parameters

[...]

Xia Mao, Lin Han

14 Sep 2011

TL;DR: In this paper, a method for converting emotional speech by combining rhythm parameters (fundamental frequency, time length and energy) with a tone parameter (a formant) was proposed, which mainly comprises the following steps of: 1, carrying out extraction and analysis of feature parameters on a Beihang University emotional speech database (BHUDES) emotional speech sample (containing neutral speech and four types of emotional speech of sadness, anger, happiness and surprise).

...read moreread less

Abstract: The invention discloses a method for converting emotional speech by combining rhythm parameters (fundamental frequency, time length and energy) with a tone parameter (a formant), which mainly comprises the following steps of: 1, carrying out extraction and analysis of feature parameters on a Beihang University emotional speech database (BHUDES) emotional speech sample (containing neutral speech and four types of emotional speech of sadness, anger, happiness and surprise); 2, making an emotional speech conversion rule and defining each conversion constant according to the extracted feature parameters; 3, carrying out extraction of the feature parameters and fundamental tone synchronous tagging on the neutral speech to be converted; 4, setting each conversion constant according to the emotional speech conversion rule in the step 2, modifying a fundamental frequency curve, the time length and the energy and synchronously overlaying fundamental tones to synthesize a speech signal; and 5, carrying out linear predictive coding (LPC) analysis on the speech signal in the step 4 and modifying the formant by a pole of a transfer function so as to finally obtain the emotional speech rich in expressive force

...read moreread less

Journal Article•DOI•

Context-adaptive pre-processing scheme for robust speech recognition in fast-varying noise environment

[...]

Iosif Mporas¹, Todor Ganchev¹, Otilia Kocsis¹, Nikos Fakotakis¹•Institutions (1)

University of Patras¹

01 Aug 2011-Signal Processing

TL;DR: A context-adaptive speech pre-processing scheme, which performs adaptive selection of the most advantageous speech enhancement algorithm for each condition, which can be beneficial to spoken interfaces operating in fast-varying noise environments.

...read moreread less

Proceedings Article•

Accuracy of MP3 speech recognition under real-word conditions: Experimental study

[...]

Petr Pollák, Martin Behunek

18 Jul 2011

TL;DR: The realized experiments have proved that although MP3 format is not optimal for speech compression it does not distort speech significantly especially for high or moderate bit rates and high quality of source data.

...read moreread less

Abstract: This paper presents the study of speech recognition accuracy with respect to different levels of MP3 compression. Special attention is focused on the processing of speech signals with different quality, i.e. with different level of background noise and channel distortion. The work was motivated by possible usage of ASR for offline automatic transcription of audio recordings collected by standard wide-spread MP3 devices. The realized experiments have proved that although MP3 format is not optimal for speech compression it does not distort speech significantly especially for high or moderate bit rates and high quality of source data. The accuracy of connected digits ASR decreased consequently very slowly up to the bit rate 24 kbps. For the best case of PLP parameterization in close-talk channel just 3% decrease of recognition accuracy was observed while the size of the compressed file was approximately 10% of the original size. All results were slightly worse under presence of additive background noise and channel distortion in a signal but achieved accuracy was also acceptable in this case especially for PLP features.

...read moreread less

Journal Article•DOI•

Comparative Analysis of Genomic Signal Processing for Microarray Data Clustering

[...]

Robert S. H. Istepanian¹, Ala Sungoor¹, J-C Nebel¹•Institutions (1)

Kingston University¹

07 Dec 2011-IEEE Transactions on Nanobioscience

TL;DR: The results of this study show that the fractal approach provides the best clustering accuracy compared to other digital signal processing and well known statistical methods.

...read moreread less

Abstract: Genomic signal processing is a new area of research that combines advanced digital signal processing methodologies for enhanced genetic data analysis. It has many promising applications in bioinformatics and next generation of healthcare systems, in particular, in the field of microarray data clustering. In this paper we present a comparative performance analysis of enhanced digital spectral analysis methods for robust clustering of gene expression across multiple microarray data samples. Three digital signal processing methods: linear predictive coding, wavelet decomposition, and fractal dimension are studied to provide a comparative evaluation of the clustering performance of these methods on several microarray datasets. The results of this study show that the fractal approach provides the best clustering accuracy compared to other digital signal processing and well known statistical methods.

...read moreread less

Patent•

Frame mapping approach for cross-lingual voice transformation

[...]

Yao Qian¹, Frank K. Soong¹•Institutions (1)

Microsoft¹

04 Apr 2011

TL;DR: In this paper, a formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies.

...read moreread less

Abstract: Frame mapping-based cross-lingual voice transformation may transform a target speech corpus in a particular language into a transformed target speech corpus that remains recognizable, and has the voice characteristics of a target speaker that provided the target speech corpus. A formant-based frequency warping is performed on the fundamental frequencies and the linear predictive coding (LPC) spectrums of source speech waveforms in a first language to produce transformed fundamental frequencies and transformed LPC spectrums. The transformed fundamental frequencies and the transformed LPC spectrums are then used to generate warped parameter trajectories. The warped parameter trajectories are further used to transform the target speech waveforms in the second language to produce transformed target speech waveform with voice characteristics of the first language that nevertheless retain at least some voice characteristics of the target speaker.

...read moreread less

Patent•

Packet loss concealment for audio codec

[...]

Shabestary Turaj Zakizadeh, Grand Tina Le

16 May 2011

TL;DR: In this paper, a speech signal is encoded as a sequence of consecutive frames and the loss is concealed at the receiver by reconstructing audio that would be contained in the lost frame based on other previously received frames.

...read moreread less

Abstract: A speech signal is encoded as a sequence of consecutive frames. When a frame is lost, the loss is concealed at a receiver by reconstructing audio that would be contained in the lost frame based on other previously received frames. The frames contain a residual signal and linear predictive coding parameters representing a segment of audio data. For a lost frame the content of a previous frame is not copied, but is modified to make the reconstructed audio sound natural. The modification includes creating a weighted sum of a quasi-periodic signal derived from the latest two pitch cycles and a pseudo random sequence. The weights are selected based on a determination of whether the previous frame contains voiced or unvoiced utterances.

...read moreread less

Proceedings Article•DOI•

Artificial bandwidth extension of narrowband speech using Gaussian Mixture Model

[...]

D Murali Mohan, Dileep B. Karpur¹, Manoj Narayan¹, J Kishore¹•Institutions (1)

BNM Institute of Technology¹

24 Mar 2011

TL;DR: The proposed method for bandwidth extension is based on statistical recovery using Gaussian Mixture Model (GMM) for spectral envelope parameters and spectral shifting method is used for excitation extension.

...read moreread less

Abstract: Spectrum of speech signals have frequency components from 50Hz to 7 kHz (Wideband speech). However, due to historical reasons speech is band-pass filtered between 300 Hz-3.4 kHz in PSTN networks and this speech is referred to as narrowband speech. The missing bandwidth in narrow band speech contributes to speech quality and intelligibility. This paper addresses the problem of artificial bandwidth extension of narrowband speech to wideband speech. The proposed method for bandwidth extension is based on statistical recovery using Gaussian Mixture Model (GMM) for spectral envelope parameters and spectral shifting method is used for excitation extension.

...read moreread less

Collapse