
Showing papers on "Cepstrum published in 2019"


Journal ArticleDOI
TL;DR: The results show that TQWT performs comparably to or better than the state-of-the-art speech signal processing techniques used in Parkinson's disease (PD) classification, and that the Mel-frequency cepstral and tunable-Q wavelet coefficients, which give the highest accuracies, contain complementary information for the PD classification problem, resulting in an improved system when combined using a filter feature selection technique.

303 citations


Posted Content
TL;DR: In this article, the authors investigated the possibility of using complex cepstrum for glottal flow estimation on a large-scale database and showed that the proposed method has the potential to be used for voice quality analysis.
Abstract: Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met. It is also shown that this complex cepstral decomposition gives similar glottal estimates as obtained with the ZZT method. However, as complex cepstrum uses FFT operations instead of requiring the factoring of high-degree polynomials, the method benefits from a much higher speed. Finally in our tests on a large corpus of real expressive speech, we show that the proposed method has the potential to be used for voice quality analysis.

66 citations
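The causal/anticausal decomposition at the heart of this method can be sketched in a few lines of NumPy. This is a minimal illustration, assuming benign phase unwrapping and omitting the windowing criteria and linear-phase handling the paper shows to be essential:

```python
import numpy as np

def complex_cepstrum(x):
    """Complex cepstrum via FFT of the log spectrum.

    Assumes phase unwrapping is benign (no zeros on the unit circle);
    linear-phase removal, needed for general signals, is omitted.
    """
    X = np.fft.fft(x)
    log_X = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
    return np.fft.ifft(log_X).real

def causal_anticausal_split(c):
    """Positive quefrencies (first half) vs. negative quefrencies (second half)."""
    half = len(c) // 2
    causal = np.zeros_like(c)
    anticausal = np.zeros_like(c)
    causal[:half] = c[:half]        # n >= 0, includes the gain term at n = 0
    anticausal[half:] = c[half:]    # n < 0, wrapped into the upper half
    return causal, anticausal

def inverse_complex_cepstrum(c):
    """Invert the chain: cepstrum -> log spectrum -> spectrum -> signal."""
    return np.fft.ifft(np.exp(np.fft.fft(c))).real
```

Because everything here is FFT-based, it runs far faster than factoring high-degree polynomials as in ZZT, which is exactly the speed advantage the abstract reports.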


Journal ArticleDOI
TL;DR: A new bearing fault classification method based on convolutional neural networks (CNNs) is presented and demonstrated to have strong classification ability under the interference of factory noise and Gaussian noise.
Abstract: Bearing fault diagnosis is an important technique in industrial production, as bearings are among the key components in rotating machines. Complex environmental noise leads to inaccurate diagnostic results, so bearing fault classification methods should be noise-resistant and robust. Previous studies have mainly focused on noise-free conditions, measured signals, and signals with simulated noise, and many effective approaches have been proposed; but in real-world working conditions, strong and complex noise often leads to inaccurate results. Accordingly, this work focuses on bearing fault classification under the influence of factory noise and white Gaussian noise. To eliminate the noise interference and take the possible connection between signal frames into consideration, this paper presents a new bearing fault classification method based on convolutional neural networks (CNNs). Exploiting the sensitivity of spectral kurtosis (SK) to impulses, noise is suppressed by a proposed SK-based filtering approach. Mel-frequency cepstral coefficients (MFCC) and the delta cepstrum are extracted as features because of their satisfactory performance in sound recognition. To capture the connection between frames, a feature arrangement method is presented that transfers feature vectors into feature images, so that the strengths of CNNs in image processing can be exploited. Experiments demonstrate that the proposed method classifies reliably under the interference of both factory noise and Gaussian noise.

53 citations
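The delta cepstrum mentioned above is conventionally the regression delta used in speech front ends; a minimal NumPy sketch (the window half-width N=2 is an assumed default, not necessarily the paper's setting):

```python
import numpy as np

def delta_features(feats, N=2):
    """Regression delta over +/- N frames.

    feats: array of shape (num_frames, num_coeffs), e.g. per-frame MFCCs.
    Edge frames are repeated so the output keeps the same shape.
    """
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n:N + n + len(feats)] -
                   padded[N - n:N - n + len(feats)])
              for n in range(1, N + 1))
    return num / denom
```

On a linearly increasing feature track the interior deltas come out as the constant slope, which is a quick sanity check for the implementation.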


Journal ArticleDOI
TL;DR: Comparative analysis reveals that considerable improvement in the performance of emotion recognition is obtained using DNN with the identified combination of perceptual features.
Abstract: This paper investigates the performance of perceptually based speech features for emotion detection. Mel frequency cepstral coefficients (MFCC), perceptual linear predictive cepstrum (PLPC), Mel frequency perceptual linear prediction cepstrum (MFPLPC), bark frequency cepstral coefficients (BFCC), revised perceptual linear prediction coefficients (RPLP) and inverted Mel frequency cepstral coefficients (IMFCC) are the perceptual features considered. The algorithm using these auditory cues is evaluated with deep neural networks (DNN). The novelty of the work lies in analysing the perceptual features to identify the predominant features that carry significant emotional information about the speaker. The validity of the algorithm is analysed on the publicly available Berlin database with seven emotions, in a 1-dimensional categorical space and in a 2-dimensional continuous space of valence and arousal. Comparative analysis reveals that considerable improvement in emotion recognition performance is obtained using a DNN with the identified combination of perceptual features.

46 citations


Proceedings ArticleDOI
01 Jan 2019
TL;DR: Mel Frequency Cepstrum Coefficient features were extracted from speech signals to detect the underlying emotion of the speech and this approach provides an efficient solution to classifying different emotions using speech signals.
Abstract: Understanding human emotion is a complicated task for humans themselves; however, this has not stopped researchers from trying to make machines capable of understanding human emotions. Many approaches have been followed, and using speech signals to detect emotions has been popular among them. In this study, Mel Frequency Cepstrum Coefficient (MFCC) features were extracted from speech signals to detect the underlying emotion of the speech. Extracted features were used to classify different emotions using an LMT classifier. For each frame of a speech signal, 13-dimensional feature vectors were extracted and Logistic Model Tree (LMT) models were trained using these features. For classifying an unknown speech signal, the 13-dimensional frame features are first extracted from the signal and each frame is classified using the trained model. Using a voting mechanism on the classified frames, the emotion of the speech signal is detected. Experimental results on two datasets, the Berlin Database of Emotional Speech (Emo-DB) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), show that our approach works very well in classifying certain emotions while it struggles to discern the differences between some pairs of emotions. Among the trained models, the maximum accuracy achieved was 70% in detecting 7 different emotions. Considering the small dimension of the feature vectors used, this approach provides an efficient solution to classifying different emotions using speech signals.

45 citations
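The frame-wise classification and voting scheme can be sketched as follows; `classify_frame` stands in for the trained LMT model, which is not reproduced here:

```python
from collections import Counter

def classify_utterance(frame_features, classify_frame):
    """Classify each frame's 13-dim MFCC vector, then majority-vote over frames."""
    votes = [classify_frame(f) for f in frame_features]
    return Counter(votes).most_common(1)[0][0]
```

The emotion labels and the stand-in threshold classifier below are purely illustrative.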


Journal ArticleDOI
TL;DR: The history, current situation and potential future development of the application of cepstral analysis to structural modal analysis are described, an application that appears to be greatly under-utilised.

32 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: A simple 2D-convolution multi-branch network architecture for replay detection, which can model the distortion both in the time and frequency domains and performance can be further improved by combining both magnitude-based and phase-based feature.
Abstract: Automatic Speaker Verification (ASV) technology is vulnerable to various kinds of spoofing attacks, including speech synthesis, voice conversion, and replay. Among them, the replay attack is easy to implement, posing a more severe threat to ASV. The constant-Q cepstrum coefficient (CQCC) feature is effective for detecting replay attacks, but it only utilizes the magnitude of the constant-Q transform (CQT) and discards the phase information. Meanwhile, the commonly used Gaussian mixture model (GMM) cannot model the reverberation present in far-field recordings. In this paper, we incorporate the CQT and the modified group delay function (MGD) in order to utilize the phase of the CQT. We also present a simple 2D-convolution multi-branch network architecture for replay detection, which can model distortion in both the time and frequency domains. The experiment shows that the proposed CQT-based MGD feature outperforms the traditional MGD feature, and performance can be further improved by combining both magnitude-based and phase-based features. Our best fusion system achieves 0.0096 min-tDCF and 0.39% EER on the ASVspoof 2019 Physical Access evaluation set. Compared with the CQCC-GMM baseline system provided by the organizer, the min-tDCF is relatively reduced by 96.09% and the EER by 96.46%. Our system was submitted to the ASVspoof 2019 Physical Access sub-challenge and won 1st place.

26 citations
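The modified group delay function builds on the plain group delay, which can be computed from two FFTs without phase unwrapping; a minimal sketch (the cepstral smoothing and exponent parameters of the full MGD are omitted):

```python
import numpy as np

def group_delay(x, eps=1e-12):
    """Group delay tau(w) = Re[ FFT(n*x) * conj(FFT(x)) ] / |FFT(x)|^2.

    This identity avoids explicit phase unwrapping; eps guards divisions
    near spectral zeros (the full MGD replaces the denominator with a
    cepstrally smoothed spectrum for exactly this reason).
    """
    n = np.arange(len(x))
    X = np.fft.fft(x)
    Xn = np.fft.fft(n * x)
    return (Xn * np.conj(X)).real / (np.abs(X) ** 2 + eps)
```

For a pure delay (an impulse at sample n0), the group delay is constant and equal to n0 at every frequency, a handy sanity check.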


Journal ArticleDOI
TL;DR: A comparative study using various algorithms, i.e., wavelet analysis, cepstrum, fast Fourier transform, and autocorrelation function for heart rate measurement, which achieved relatively good results despite the remarkable amount of motion artifact produced owing to the frequent body movements and/or vibrations of the massage chair during stress relief massage.
Abstract: Nonintrusive monitoring and long-term monitoring of vital signs are essential requirements for early diagnosis and prevention due to many reasons, one of the most important being improving the quality of life. In this paper, we present a comparative study using various algorithms, i.e., wavelet analysis, cepstrum, fast Fourier transform, and autocorrelation function for heart rate measurement. The heart rate was measured from noisy ballistocardiogram signals acquired from 50 subjects in a sitting position using a massage chair. The signals were unobtrusively collected from a microbend fiber-optic sensor embedded within the headrest of the chair and then transmitted to a computer through a Bluetooth connection. The multiresolution analysis of the maximal overlap discrete wavelet transform was implemented for heart rate measurement. The error between the proposed method and the reference electrocardiogram is estimated in beats per minute using the mean absolute error, in which the system achieved relatively good results (10.12 ± 4.69) despite the remarkable amount of motion artifact produced owing to the frequent body movements and/or vibrations of the massage chair during stress relief massage. In contrast, the error between the proposed method and the reference signal was very large when the other algorithms, i.e., cepstrum, fast Fourier transform, and autocorrelation function, were implemented for heart rate measurement.

25 citations
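Of the compared algorithms, the autocorrelation function is the simplest to sketch for heart-rate estimation: the strongest autocorrelation lag within a plausible beat-period range gives the rate. The sampling rate and BPM bounds below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def heart_rate_autocorr(x, fs, bpm_lo=40.0, bpm_hi=150.0):
    """Heart rate (BPM) from the dominant autocorrelation lag.

    The search is limited to lags corresponding to plausible beat periods.
    """
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo = int(fs * 60.0 / bpm_hi)   # shortest plausible beat period, in samples
    hi = int(fs * 60.0 / bpm_lo)   # longest plausible beat period, in samples
    lag = lo + int(np.argmax(ac[lo:hi]))
    return 60.0 * fs / lag
```

On a clean periodic signal this recovers the rate to within the lag quantization; on real ballistocardiograms, motion artifacts are what degrade it, as the abstract notes.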


Journal ArticleDOI
TL;DR: The proposed method is based on the wavelet derived from the popular biorthogonal Cohen-Daubechies-Feauveau 9/7 filter bank, which offers superior frequency selectivity, symmetry, and better time-frequency localization.
Abstract: In this paper, a novel technique based on a wavelet cepstrum feature is discussed for an iris recognition system. The proposed method is based on the wavelet derived from the popular biorthogonal Cohen-Daubechies-Feauveau 9/7 filter bank. Being biorthogonal in nature, it offers superior frequency selectivity, symmetry, and better time-frequency localization. The suggested scheme computes the two-level detail coefficients from the normalized iris template. These detail coefficients are then divided into non-uniform bins in a logarithmic manner, which reduces the dimension of the wavelet coefficients and assigns non-uniform weights to the different frequency components. The discrete cosine transform of the result is then computed, from which the energy feature is extracted. The proposed technique is experimentally validated with publicly available databases: CASIAv3, UBIRISv1, and IITD. The performance of the proposed approach is found to be superior to that of state-of-the-art methods.

25 citations


Journal ArticleDOI
TL;DR: In this article, low frequency frame-wise normalization (LFFN) is proposed as a module in the feature extraction process that is hypothesized to help in capturing the artifacts from playback speech.

24 citations


Journal ArticleDOI
TL;DR: Comparing vibration and AE across the 9 tests of the experimental system, vibration gave better results than AE in 6 tests, specifically for the inner race and rolling element faults; for the remaining 3 tests, corresponding to the outer race fault, AE gave the better result.
Abstract: In this study, an experimental system was built to acquire vibration and acoustic emission (AE) signals from faulted bearings. A methodology based on cepstrum pre-whitening (CPW), previously tested on vibration signals, was applied to both types of signals to compare and enhance results in machine condition monitoring. The methodology was applied to 9 vibration and 9 AE signals from the experimental system database. Of the 18 analyzed signals, in 5 the identification of fault components was easily made, in 12 the fault identification was possible, and in 1 the identification was not completed. Comparing vibration and AE across the 9 tests, vibration gave better results than AE in 6 tests, specifically for the inner race and rolling element faults; for the remaining 3 tests, corresponding to the outer race fault, AE gave the better result.

Journal ArticleDOI
TL;DR: An empirical and automated nonlinear filtering process is proposed in which different components of a signal are decomposed based on their powers to seek the presence of bearing characteristic frequencies; it can be seen as complementary to narrowband amplitude demodulation techniques.

Journal ArticleDOI
16 May 2019-PLOS ONE
TL;DR: This study investigates the granular scattering effect in identification of chemicals with THz spectral absorption features and proposes a signal processing technique in the so-called “quefrency” domain to improve the ability to resolve these spectral features in the diffuse scattered THz images.
Abstract: Terahertz (THz) imaging is a widely used technique in the study and detection of many chemicals and biomolecules in polycrystalline form because the spectral absorption signatures of these target materials often lie in the THz frequencies. When the size of dielectric grain boundaries are comparable to the THz wavelengths, spectral features can be obscured due to electromagnetic scattering. In this study, we first investigate this granular scattering effect in identification of chemicals with THz spectral absorption features. We then will propose a signal processing technique in the so-called "quefrency" domain to improve the ability to resolve these spectral features in the diffuse scattered THz images. We created a pellet with α-lactose monohydrate and riboflavin, two biologically significant materials with well-known vibrational spectral resonances, and buried the pellet in a highly scattering medium. THz transmission measurements were taken at all angles covering the half focal plane. We show that, while spectral features of lactose and riboflavin cannot be distinguished in the scattered image, application of cepstrum filtering can mitigate these scattering effects. By employing our quefrency-domain signal processing technique, we were able to unambiguously detect the dielectric resonance of lactose in the diffused scattering geometries. Finally we will discuss the limitation of the new proposed technique in spectral identification of chemicals.
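The quefrency-domain filtering described here is a form of cepstral liftering: the smooth spectral envelope lives at low quefrencies, while rapid scattering-induced ripple lands at high quefrencies. A generic short-pass lifter on a magnitude spectrum might look like this (the cutoff is an illustrative parameter, not the paper's):

```python
import numpy as np

def lifter_magnitude(mag, cutoff):
    """Short-pass lifter: keep only low-quefrency (smooth) spectral structure.

    mag: magnitude spectrum (1-D, positive); cutoff: quefrency bin limit.
    """
    c = np.fft.ifft(np.log(mag + 1e-12)).real   # real cepstrum of the spectrum
    w = np.zeros(len(c))
    w[:cutoff] = 1.0
    w[len(c) - cutoff + 1:] = 1.0               # mirror the negative quefrencies
    return np.exp(np.fft.fft(c * w).real)
```

Because the lifter works on the log spectrum, multiplicative ripple (as produced by scattering and etalon effects) separates cleanly from the smooth absorption envelope.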

Journal ArticleDOI
TL;DR: A road vehicle recognition and classification approach for intelligent transportation systems using a roadside installed low-cost magnetometer and associated data collection system and a 3-dimensional map algorithm using Vector Quantization to classify vehicle magnetic features to 4 typical types of vehicles in Australian suburbs.
Abstract: This paper presents a road vehicle recognition and classification approach for intelligent transportation systems. The approach uses a roadside-installed low-cost magnetometer and an associated data collection system. The system measures changes in the magnetic field, detects passing vehicles, and recognizes vehicle types. We introduce Mel Frequency Cepstral Coefficients (MFCC) to analyze vehicle magnetic signals and extract vehicle features represented by the cepstrum, frame energy, and gap cepstrum of the magnetic signals. We design a 3-dimensional map algorithm using Vector Quantization (VQ) to classify vehicle magnetic features into 4 vehicle types typical of Australian suburbs: sedan, van, truck, and bus. To train an accurate classifier, training samples are selected using Dynamic Time Warping (DTW). Verification experiments show that our approach achieves a high level of accuracy for vehicle detection and classification.

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed method surpasses most previous studies in terms of classification accuracy and establishes the applicability and efficacy of cepstrum-based features in classifying sEMG signals of hand movements.
Abstract: It is of great importance to effectively process and interpret surface electromyogram (sEMG) signals to actuate the robotic and prosthetic exoskeleton hands needed by hand amputees. In this paper, we propose a cepstrum analysis-based method for classification of basic hand movement sEMG signals. Cepstral analysis, primarily used for acoustic and seismological signals, is exploited to extract features from time-domain sEMG signals by computing mel-frequency cepstral coefficients (MFCCs). The extracted feature vector of MFCCs is then fed to a generalized regression neural network (GRNN) to classify basic hand movements. The proposed method has been tested on the sEMG for Basic Hand movements Data Set and achieved an average accuracy rate of 99.34% for the five individual subjects and an overall mean accuracy rate of 99.23% for the collective (mixed) dataset. The experimental results demonstrate that the proposed method surpasses most previous studies in terms of classification accuracy. The discrimination ability of the cepstral features is quantified using the Kruskal-Wallis statistical test. Evidenced by the experimental results, this study establishes the applicability and efficacy of cepstrum-based features in classifying sEMG signals of hand movements. Owing to the non-iterative training of the adopted neural network type, the proposed method does not demand much time to build the model in the training phase.

Journal ArticleDOI
TL;DR: A new framework to identify and assess progressive structural damage is developed; the new damage feature outperforms the conventional principal component analysis-based feature, and a comprehensive test framework including extensive progressive damage cases validates the proposed technique.
Abstract: This article aims at developing a new framework to identify and assess progressive structural damage. The method relies solely on output measurements to establish the frequency response functions o...


Journal ArticleDOI
TL;DR: It is concluded that the time and frequency domain characteristics of wheezes are not steady; hence, tunable time-scale representations are more successful in discriminating polyphonic and monophonic wheeze types than conventional fixed-resolution representations.

Proceedings ArticleDOI
01 Feb 2019
TL;DR: An ensemble learning method named Gradient Boosting (GB) is proposed to predict future fault classes based on the data obtained from analyzing the recorded fault data, and can detect and classify different types of bearing faults with 99.58% accuracy.
Abstract: Monitoring the condition of rolling element bearings and diagnosing their faults are cumbersome jobs. Fortunately, we have machines to do the burdensome task for us. Contemporary developments in machine learning allow us not only to extract features from fault signals accurately but also to analyze them and predict future bearing faults in a systematic manner. Utilizing an ensemble learning method named Gradient Boosting (GB), our paper proposes a technique to predict future fault classes based on the data obtained from analyzing the recorded fault data. To demonstrate the cogency of the method, we applied it to the REB fault data provided by the Case Western Reserve University (CWRU) Lab. Employing this supervised learning algorithm after preprocessing the fault signals using real cepstrum analysis, we can detect and classify different types of bearing faults with a staggering 99.58% accuracy.

Proceedings ArticleDOI
12 May 2019
TL;DR: The method reported here realizes an inaudible echo-hiding based speech watermarking by using sparse subspace clustering (SSC) and the evaluation results verify the feasibility and effectiveness of this method.
Abstract: The method reported here realizes an inaudible echo-hiding based speech watermarking by using sparse subspace clustering (SSC). Speech signal is first analyzed with SSC to obtain its sparse and low-rank components. Watermarks are embedded as the echoes of the sparse component for robust extraction. Self-compensated echoes consisting of two independent echo kernels are designed to have similar delay offsets but opposite amplitudes. A one-bit watermark is embedded by separately performing the echo kernels on the sparse and low-rank components. As a result, the sound distortion caused by one echo signal can be quickly compensated by the other echo signal, which enables better inaudibility. Since the embedded echoes have the same sparsity as the sparse component, watermarks can be extracted with a basic cepstrum analysis even if the echo kernels are not directly performed on the original speech. The evaluation results verify the feasibility and effectiveness of this method.
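Echo hiding rests on a classic cepstral property: an echo at delay D produces a peak at quefrency D in the real cepstrum. A bare-bones sketch of a single-echo embedder (circular, for simplicity) and detector, without the sparse/low-rank decomposition or the self-compensated kernel pair described above:

```python
import numpy as np

def embed_echo(x, delay, alpha):
    """Add a single echo at the given sample delay (circular, for simplicity)."""
    return x + alpha * np.roll(x, delay)

def detect_echo_delay(y, min_lag, max_lag):
    """The real cepstrum of an echoed signal peaks at the echo delay."""
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(y)) + 1e-12)).real
    return min_lag + int(np.argmax(c[min_lag:max_lag]))
```

In a real watermarking scheme the bit value is encoded by choosing between two candidate delays and comparing the cepstral peaks at each.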

Journal ArticleDOI
TL;DR: Experimental results show that the proposed MOMEDA-based method for fault detection of parallel shaft gearboxes is more effective than traditional methods.
Abstract: In this paper, a new method for fault detection of parallel shaft gearboxes based on Empirical Mode Decomposition (EMD) and Multipoint Optimal Minimum Entropy Deconvolution (MOMEDA) is proposed. MOMEDA overcomes the shortcomings of Minimum Entropy Deconvolution (MED) and Maximum Correlated Kurtosis Deconvolution (MCKD), and is introduced to extract the fault cycle from gearbox signals. Gearbox vibration signals are complex, containing fault signals, noise, and deterministic components such as gear meshing. The fault signal is often buried in these other components, which increases the difficulty of gearbox fault detection; thus, EMD is used to decompose the signal and extract the fault impact components. A preset-fault experiment on a parallel shaft gearbox is carried out to verify the effectiveness of the method. In addition, traditional methods such as the Fourier transform, cepstrum analysis, MED and MCKD are used for comparison. Experimental results show that the proposed method is more effective than the traditional methods.

Journal ArticleDOI
TL;DR: The proposed method uses a Deep Neural Network based regression model to estimate the clean phase and clean amplitude for speech reconstruction; the overall quality of speech improved for factory, restaurant, car, airport and babble noise.
Abstract: In low signal-to-noise-ratio environments, phase information is an important factor, and this article therefore considers the importance of the clean phase in single-channel speech enhancement. The proposed method uses a Deep Neural Network based regression model to estimate the clean phase and clean amplitude for speech reconstruction. Experiments are conducted over five different noises (factory, restaurant, car, airport and babble) at different levels, and results are evaluated using objective quality measures such as Perceptual Evaluation of Speech Quality, Weighted Spectral Slope, Cepstrum Distance, frequency-weighted segmental Signal-to-Noise Ratio and Log Likelihood Ratio. The overall quality of speech improved by 12% for factory noise, 8% for restaurant noise, 13% for car noise, 10% for airport noise and 14% for babble noise.

Journal ArticleDOI
TL;DR: The experimental results show that the proposed feature set not only displays a high recognition rate and excellent anti-noise performance in speech recognition, but also fully characterizes the auditory and energy information in the speech signals.
Abstract: Environmental noise can pose a threat to the stable operation of current speech recognition systems, so it is essential to develop a front-end feature set able to identify speech under low signal-to-noise ratios. In this paper, a robust fusion feature is proposed that can fully characterize speech information. To obtain the cochlear filter cepstral coefficients (CFCC), a novel feature is first extracted by a power-law nonlinear function, which can simulate the auditory characteristics of the human ear. Speech enhancement technology is then introduced into the front end of feature extraction, and the extracted features and their first-order differences are combined into new mixed features. An energy feature, the Teager energy operator cepstral coefficient (TEOCC), is also extracted and combined with the above mixed features to form the fusion feature set. Principal component analysis (PCA) is then applied for feature selection and optimization of the feature set, and the final feature set is used in a speaker-independent, isolated-word, small-vocabulary speech recognition system. Finally, a comparative speech recognition experiment using a support vector machine (SVM) is designed to verify the advantages of the proposed feature set. The experimental results show that the proposed feature set not only displays a high recognition rate and excellent anti-noise performance in speech recognition, but also fully characterizes the auditory and energy information in the speech signals.

Journal ArticleDOI
Jian Zhao1, Weiwen Su1, Jian Jia1, Chao Zhang1, Tingting Lu1 
TL;DR: A multi-modal fusion algorithm based on the speech signal and facial image sequence for depression diagnosis, which can easily be applied at low cost to the hardware and software of existing hospital instruments, is an accurate and effective method for diagnosing depression.
Abstract: Due to the false positive rate of traditional depression diagnosis methods, this paper proposes a multi-modal fusion algorithm based on the speech signal and facial image sequence for depression diagnosis. Spectral subtraction is introduced to enhance the depressed speech signal, and the cepstrum method is used to extract pitch frequency features with a large variation rate and formant features with significant differences; the short-time energy and Mel-frequency cepstral coefficient parameters for different emotional speech are analyzed in both the time and frequency domains, and a model is established for training and identification. Meanwhile, this paper implements the orthogonal matching pursuit algorithm to obtain a sparse linear combination of face test samples, and cascades the voice and facial emotions in proportion. The experimental results show that the recognition rate of the proposed depression detection algorithm fusing speech and facial emotions reaches 81.14%. Compared to doctors' existing accuracy of 47.3%, this is a relative improvement of 71.54%. Additionally, it can easily be applied at low cost to the hardware and software of existing hospital instruments. Therefore, it is an accurate and effective method for diagnosing depression.

Journal ArticleDOI
TL;DR: The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with the noisy inputs of source speakers.
Abstract: This paper presents a noise-robust voice conversion method with high-quefrency boosting via sub-band cepstrum conversion and fusion based on bidirectional long short-term memory (BLSTM) neural networks that can convert vocal tract parameters of a source speaker into those of a target speaker. With the implementation of state-of-the-art machine learning methods, voice conversion has achieved good performance given abundant clean training data. However, the quality and similarity of the converted voice are significantly degraded compared to that of a natural target voice due to various factors, such as limited training data and noisy input speech from the source speaker. To address the problem of noisy input speech, an architecture of voice conversion with statistical filtering and sub-band cepstrum conversion and fusion is introduced. The impact of noise on the converted voice is reduced by the accurate reconstruction of the sub-band cepstrum and the subsequent statistical filtering. By normalizing the mean and variance of the converted cepstrum to those of the target cepstrum in the training phase, a cepstrum filter is constructed to further improve the quality of the converted voice. The experimental results showed that the proposed method significantly improved the naturalness and similarity of the converted voice compared to the baselines, even with noisy inputs from source speakers.
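The statistical filtering step, normalizing the converted cepstrum's mean and variance to the target speaker's, amounts to per-coefficient mean-variance matching; a sketch with illustrative names (the BLSTM conversion itself is not reproduced):

```python
import numpy as np

def match_cepstrum_stats(conv, tgt_mean, tgt_std, eps=1e-8):
    """Per-coefficient mean-variance matching to target speaker statistics.

    conv: converted cepstra, shape (num_frames, num_coeffs);
    tgt_mean / tgt_std: per-coefficient target statistics from training data.
    """
    return (conv - conv.mean(axis=0)) / (conv.std(axis=0) + eps) * tgt_std + tgt_mean
```

After this transform, the output trajectories carry the target speaker's global cepstral statistics regardless of scale or offset errors in the conversion.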

Proceedings ArticleDOI
08 Jul 2019
TL;DR: The multi-channel sEMG based human lower limb motion intention recognition method is reliable and effective, improving the average motion recognition rate from 86.3% ± 8.24% to 93.6% ± 2.6%.
Abstract: The paper presents a multi-channel sEMG based human lower limb motion intention recognition method, aiming at solving the problem of recognizing human lower limb motion intention when using an exoskeletal robot. The cepstrum distance is used to automatically detect the endpoints of the sEMG signal for each motion. The time-domain and frequency-domain characteristic parameters of the multi-channel sEMG signal are extracted and merged to construct a joint feature matrix. The joint feature matrix is reduced by the principal component analysis (PCA) method, and a low-dimensional matrix for each motion is obtained. A traditional back propagation (BP) neural network model is optimized using the particle swarm optimization (PSO) algorithm, and the low-dimensional matrix of each human lower limb motion is identified by the optimized BP neural network model. In the recognition experiment, the improved method raised the average motion recognition rate from 86.3% ± 8.24% to 93.6% ± 2.6% compared with the classical BP neural network algorithm. The multi-channel sEMG based human lower limb motion intention recognition method is reliable and effective.

Proceedings ArticleDOI
01 Apr 2019
TL;DR: An autonomous algorithm is proposed for person identification by analyzing vocal sounds and speech patterns; it correctly identifies the speaker with accuracy, specificity and sensitivity of 83.33%, 86.67% and 80%, respectively.
Abstract: Speech processing has emerged as one of the important and crucial domains over the past decade. Many researchers have worked on voice recognition and verification, and some of the reported work has been done in the field of biometrics. This paper proposes an autonomous algorithm for person identification by analyzing vocal sounds and speech patterns. First, the input voice signal is fed to the proposed system, from which the low-frequency content is extracted using a finite impulse response (FIR) low-pass filter based on a Hamming window. The proposed system then performs a cepstral analysis and extracts two distinct features from the signal spectrum, i.e. the maximum pitch frequency and the maximum cepstrum value. The 2D extracted feature set is passed to a multi-level classification system built on a supervised Support Vector Machine (SVM), which first discriminates the person's gender and then classifies the person within that gender. A total of 120 samples were used to train the proposed classification system, and the system correctly identifies the speaker with accuracy, specificity and sensitivity of 83.33%, 86.67% and 80%, respectively.
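The two cepstral features used here come from standard cepstral pitch analysis: the largest cepstral peak within the plausible pitch-lag range gives both the pitch estimate and the peak value. A sketch (the search range is an assumed default, not the paper's):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Return (pitch_hz, peak_value) from the real cepstrum of one frame.

    The pitch-lag search range [fs/fmax, fs/fmin] is an illustrative default.
    """
    c = np.fft.ifft(np.log(np.abs(np.fft.fft(frame)) + 1e-12)).real
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(c[lo:hi]))
    return fs / lag, c[lag]
```

The returned pair corresponds to the paper's "maximum pitch frequency" and "maximum cepstrum value" features, which then feed the SVM stage.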

Journal ArticleDOI
TL;DR: The new approach exceeds the performance of a formerly introduced classical signal processing-based cepstral excitation manipulation (CEM) method in terms of noise attenuation by about 1.5 dB and shows that this gain also holds true when comparing serial combinations of envelope and excitation enhancement.
Abstract: This contribution aims at speech model-based speech enhancement by exploiting the source-filter model of human speech production. The proposed method enhances the excitation signal in the cepstral domain by making use of a deep neural network (DNN). We investigate two types of target representations along with the significant effects of their normalization. The new approach exceeds the performance of a previously introduced classical signal processing-based cepstral excitation manipulation (CEM) method in terms of noise attenuation by about 1.5 dB. We show that this gain also holds when comparing serial combinations of envelope and excitation enhancement. In the important low-SNR conditions, no significant trade-off in speech component quality or speech intelligibility is induced, while substantially higher noise attenuation is achieved. In total, a traditional purely statistical state-of-the-art speech enhancement system is outperformed by more than 3 dB in noise attenuation.
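The source-filter split that CEM-style excitation enhancement relies on can be illustrated by liftering the real cepstrum: low quefrencies approximate the spectral envelope (vocal tract filter), and the remainder is the excitation part that a DNN would then enhance. The following is a minimal sketch with an assumed 1 ms quefrency cutoff, not the paper's system.

```python
import numpy as np

def cepstral_split(frame, fs, cutoff_ms=1.0):
    """Split a frame's log magnitude spectrum into a smooth envelope part
    (low quefrencies) and an excitation part (the remainder)."""
    n = len(frame)
    log_spec = np.log(np.abs(np.fft.rfft(frame * np.hanning(n))) + 1e-10)
    ceps = np.fft.irfft(log_spec)
    cut = int(fs * cutoff_ms / 1000)       # quefrency cutoff in samples
    lifter = np.zeros(n)
    lifter[:cut] = 1.0                     # positive low quefrencies
    lifter[-(cut - 1):] = 1.0              # mirrored negative quefrencies
    envelope = np.fft.rfft(ceps * lifter).real
    excitation = log_spec - envelope
    return log_spec, envelope, excitation

# toy check: a harmonic-rich frame; the envelope should be much smoother
fs = 8000
t = np.arange(2048) / fs
log_spec, env, exc = cepstral_split(np.sign(np.sin(2 * np.pi * 100 * t)), fs)
print(np.std(np.diff(env)) < np.std(np.diff(log_spec)))
```

Enhancing `exc` (in the paper, with a DNN) and adding `env` back reconstructs an enhanced log spectrum, which is the sense in which envelope and excitation enhancement can be combined serially.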

Proceedings ArticleDOI
15 Sep 2019
TL;DR: In this article, the authors used a simple continuous F0 tracker which does not apply a strict voiced / unvoiced decision, and a convolutional neural network to predict continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum).
Abstract: Recently it was shown that within the Silent Speech Interface (SSI) field, the prediction of F0 is possible from Ultrasound Tongue Images (UTI) as the articulatory input, using Deep Neural Networks for articulatory-to-acoustic mapping. Moreover, text-to-speech synthesizers were shown to produce higher quality speech when using a continuous pitch estimate, which takes non-zero pitch values even when voicing is not present. Therefore, in this paper on UTI-based SSI, we use a simple continuous F0 tracker which does not apply a strict voiced / unvoiced decision. Continuous vocoder parameters (ContF0, Maximum Voiced Frequency and Mel-Generalized Cepstrum) are predicted using a convolutional neural network, with UTI as input. The results demonstrate that during the articulatory-to-acoustic mapping experiments, the continuous F0 is predicted with lower error, and the continuous vocoder produces slightly more natural synthesized speech than the baseline vocoder using standard discontinuous F0.
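The "continuous F0" idea above, a pitch contour that never drops to zero in unvoiced regions, can be imitated with simple interpolation over a conventional discontinuous track. This is a hypothetical sketch only: the paper uses a dedicated continuous F0 tracker, not this post-hoc interpolation.

```python
import numpy as np

def make_continuous_f0(f0):
    """Replace unvoiced (zero) F0 values by linear interpolation between
    neighbouring voiced values, holding the edge values constant."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        return f0
    idx = np.arange(len(f0))
    return np.interp(idx, idx[voiced], f0[voiced])

# a short voiced/unvoiced track in Hz (0 marks unvoiced frames)
track = [0, 0, 110, 115, 0, 0, 0, 130, 0]
cont = make_continuous_f0(track)
print(cont)
```

A contour like `cont` is everywhere non-zero, which is what lets a neural network regress F0 as a single smooth target without a separate voiced/unvoiced classification.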
