
Showing papers on "Cepstrum published in 2022"


Journal ArticleDOI
TL;DR: A novel method is proposed to reconstruct equivalent static transmission error (STE) or geometric transmission error (GTE) from a single dynamic transmission error (DTE) measurement, i.e. from a moderate/high-speed test. The method has been validated on a single-stage spur gear rig and can be useful in gear condition monitoring, gear quality control, and as an aid to gearbox modelling.

12 citations


Journal ArticleDOI
TL;DR: In this article, the authors designed and validated various deep learning systems to improve the diagnosis of infant cry records, including deep feedforward neural networks (DFFNN), long short-term memory (LSTM) neural networks, and convolutional neural networks (CNN).
Abstract: • We design and validate various deep learning systems to improve the diagnosis of infant cry records. • The deep learning systems considered are deep feedforward neural networks (DFFNN), long short-term memory (LSTM) neural networks, and convolutional neural networks (CNN). • All deep learning systems are trained with cepstrum analysis-based coefficients. • Compared to existing models, all deep learning systems were found to be more effective in distinguishing between healthy and unhealthy infant cry records. Nowadays, deep learning architectures are promising artificial intelligence systems in various applications of biomedical engineering. For instance, they can be combined with signal processing techniques to build computer-aided diagnosis systems that help physicians make appropriate decisions related to the diagnosis task. The goal of the current study is to design and validate various deep learning systems to improve the diagnosis of infant cry records. Specifically, deep feedforward neural networks (DFFNN), long short-term memory (LSTM) neural networks, and convolutional neural networks (CNN) are designed, implemented, and trained with cepstrum analysis-based coefficients as inputs to distinguish between healthy and unhealthy infant cry records. All deep learning systems are validated on expiration and inspiration sets separately. The number of convolutional layers and the number of neurons in hidden layers are varied in the CNN and DFFNN, respectively. It is found that the CNN achieved the highest accuracy and sensitivity, followed by the DFFNN; the latter obtained the highest specificity. Compared to similar work in the literature, it is concluded that deep learning systems trained with cepstrum analysis-based coefficients are powerful machines that can be employed for accurate diagnosis of infant cry records, distinguishing between healthy and pathological signals.
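To make the pipeline concrete, here is a minimal sketch of a binary CNN classifier over fixed-size maps of cepstral coefficients, in the spirit of the systems described above; the input shape, layer sizes, and training call are illustrative assumptions, not the authors' architecture.

```python
import tensorflow as tf

# Minimal sketch, assuming inputs are fixed-size maps of cepstral
# coefficients (e.g. 13 coefficients x 100 frames) per cry segment,
# labelled healthy (0) / unhealthy (1). Layer sizes are illustrative.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(13, 100, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)
```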

11 citations


Journal ArticleDOI
TL;DR: In this paper, a cepstrum-based operational modal analysis (OMA) method was proposed to reconstruct equivalent STE or GTE from a single DTE measurement, i.e. from a moderate/high-speed test.

11 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed deep learning-based single-channel multitarget underwater acoustic signal recognition method can effectively recognize synthetic multitarget ship signals when the magnitude STFT spectrum, complex-valued STFT spectrum, and log-mel spectrum are used as network inputs.
Abstract: The radiated noise from ships is of great significance to target recognition, and several deep learning methods have been developed for the recognition of underwater acoustic signals. Previous studies have focused on single-target recognition, with relatively few reports on multitarget recognition. This paper proposes a deep learning-based single-channel multitarget underwater acoustic signal recognition method for an unknown number of targets in the specified category. The proposed method allows the two subproblems of recognizing the unique classes and the duplicate categories of multiple targets to be solved. These two tasks are essentially multilabel binary classification and multilabel multivalued classification, respectively. In this paper, we describe the use of real-valued and complex-valued ResNet and DenseNet convolutional networks to recognize synthetic mixed multitarget signals, which were created by superimposing individual target signals. We compare the performance of various features, including the original audio signal, the complex-valued short-time Fourier transform (STFT) spectrum, the magnitude STFT spectrum, the logarithmic mel spectrum, and mel frequency cepstral coefficients. The experimental results show that our method can effectively recognize synthetic multitarget ship signals when the magnitude STFT spectrum, complex-valued STFT spectrum, and log-mel spectrum are used as network inputs.

11 citations


Journal ArticleDOI
29 Jan 2022-Symmetry
TL;DR: This work investigates two types of new acoustic features to improve the performance of spoofing attack detection; the first consists of two cepstral coefficients and one LogSpec feature extracted from the linear prediction (LP) residual signals.
Abstract: With the rapid development of intelligent speech technologies, automatic speaker verification (ASV) has become one of the most natural and convenient biometric speaker recognition approaches. However, most state-of-the-art ASV systems are vulnerable to spoofing attack techniques, such as speech synthesis, voice conversion, and replay speech. Due to the symmetric distribution characteristic of genuine (true) and spoofed (fake) speech pairs, spoofing attack detection is challenging. Many recent research works have focused on ASV anti-spoofing solutions. This work investigates two types of new acoustic features to improve the performance of spoofing attack detection. The first features consist of two cepstral coefficients and one LogSpec feature, which are extracted from the linear prediction (LP) residual signals. The second feature is a harmonic and noise subband ratio feature, which can reflect the difference in the interaction movement of the vocal tract and glottal airflow between genuine and spoofed speech. The significance of these new features has been investigated in both the t-distributed stochastic neighbor embedding (t-SNE) space and the binary classification modeling space. Experiments on the ASVspoof 2019 database show that the proposed residual features can achieve from 7% to 51.7% relative equal error rate (EER) reduction on the development and evaluation sets over the best single-system baseline. Furthermore, a relative EER reduction of more than 31.2% on both the development and evaluation sets shows that the proposed new features contain substantial information complementary to the source acoustic features.
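As a hedged illustration of the source-based idea (not the paper's exact features), the sketch below derives the LP residual by inverse filtering and computes generic cepstral coefficients from it; the file name, LP order, and the use of MFCCs as the residual's cepstral representation are assumptions.

```python
import librosa
from scipy.signal import lfilter

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file
a = librosa.lpc(y, order=12)       # LP coefficients [1, a1, ..., a12]
residual = lfilter(a, [1.0], y)    # inverse filtering -> LP residual
# Generic cepstral coefficients of the residual; the paper's actual
# residual features (and its LogSpec feature) differ in detail.
res_cc = librosa.feature.mfcc(y=residual, sr=sr, n_mfcc=20)
```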

10 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compare and combine different acoustic features for discriminating subjects with and without voice disorders; the best classification result (86.43% accuracy) was obtained by combining traditional linear and recurrence quantification measures.

9 citations


Journal ArticleDOI
TL;DR: The work assessed the utility of several automatic speech signal analysis methods for diagnosing voice disorders, suggested a strategy for classifying healthy and diseased voices, and designed a deep neural network capable of learning from the retrieved data and producing a highly accurate voice-based disease prediction model.
Abstract: Diseases of internal organs other than the vocal folds can also affect a person's voice. As a result, voice problems are on the rise, even though they are frequently overlooked. According to a recent study, voice pathology detection systems can successfully support the assessment of voice abnormalities and enable the early diagnosis of voice pathology. For instance, automatic systems for distinguishing healthy and diseased voices have received much attention in the early identification and diagnosis of voice problems. As a result, artificial intelligence-assisted voice analysis brings up new possibilities in healthcare. The work aimed to assess the utility of several automatic speech signal analysis methods for diagnosing voice disorders and to suggest a strategy for classifying healthy and diseased voices. The proposed framework integrates the efficacy of three voice characteristics: chroma, mel spectrogram, and mel frequency cepstral coefficients (MFCC). We also designed a deep neural network (DNN) capable of learning from the retrieved data and producing a highly accurate voice-based disease prediction model. The study describes a series of experiments using the Saarbruecken Voice Database (SVD) to detect abnormal voices. The model was developed and tested using the vowels /a/, /i/, and /u/ pronounced in high, low, and average pitches. We also retained the “continuous sentence” audio files from the SVD to assess how well the developed model generalizes to completely new data. The highest accuracy achieved was 77.49%, superior to prior attempts in the same domain. Additionally, the model attains an accuracy of 88.01% by integrating speaker gender information. The model trained on selected diseases can also reach a maximum accuracy of 96.77% (cordectomy × healthy). As a result, the suggested framework is well suited for the healthcare industry.

8 citations


Journal ArticleDOI
30 Oct 2022-Sensors
TL;DR: In this article, a diesel engine acoustic fault diagnosis method based on variational mode decomposition (VMD)-mapped Mel frequency cepstral coefficients (MFCC) and a long short-term memory (LSTM) network is proposed.
Abstract: Diesel engines have a wide range of functions in the industrial and military fields. An urgent problem to be solved is how to diagnose and identify their faults effectively and in a timely manner. In this paper, a diesel engine acoustic fault diagnosis method based on variational mode decomposition (VMD)-mapped Mel frequency cepstral coefficients (MFCC) and a long short-term memory (LSTM) network is proposed. VMD is used to remove noise from the original signal and separate the signal into multiple modes. The sound pressure signals of the different modes are mapped onto the Mel filter bank in the frequency domain, and the Mel frequency cepstral coefficients of the respective mode signals are calculated over the mapped frequency range. The optimized Mel frequency cepstral coefficients are used as the input to a long short-term memory (LSTM) network, which is trained and validated to obtain the diesel engine fault diagnosis model. The experimental part compares the fault diagnosis performance of different feature extraction methods, different modal decomposition methods, and different classifiers, finally verifying the feasibility and effectiveness of the proposed method and providing a solution for fault diagnosis using acoustic signals.
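A minimal sketch of the decomposition-plus-cepstral-feature stage, assuming the third-party vmdpy package; the parameter values are illustrative, and the paper's exact Mel filter bank mapping and LSTM classifier are not reproduced.

```python
import numpy as np
import librosa
from vmdpy import VMD  # third-party VMD implementation (assumption)

y, sr = librosa.load("engine.wav", sr=None)  # hypothetical recording
# Decompose into K band-limited modes; alpha/tau/init/tol are
# illustrative values, not the paper's.
modes, _, _ = VMD(y, alpha=2000, tau=0.0, K=5, DC=0, init=1, tol=1e-7)
# MFCCs of each mode, stacked as a feature map for a sequence model.
features = np.concatenate(
    [librosa.feature.mfcc(y=np.ascontiguousarray(m, dtype=np.float32),
                          sr=sr, n_mfcc=13) for m in modes],
    axis=0)
```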

8 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: Experimental results demonstrate that the proposed speech enhancement method outperforms existing state-of-the-art complex-valued neural network-based methods in terms of both PESQ and eSTOI.
Abstract: Most deep learning-based speech enhancement methods operate directly on time-frequency representations or learned features without making use of the model of speech production. This work proposes a new speech enhancement method based on neural homomorphic synthesis. The speech signal is first decomposed into excitation and vocal tract with complex cepstrum analysis. Then, two complex-valued neural networks are applied to estimate the target complex spectra of the decomposed components. Finally, the time-domain speech signal is synthesized from the estimated excitation and vocal tract. Furthermore, we investigated numerous loss functions and found that the multi-resolution STFT loss, commonly used in TTS vocoders, benefits speech enhancement. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art complex-valued neural network-based methods in terms of both PESQ and eSTOI.
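To illustrate only the homomorphic decomposition step (the paper uses the complex cepstrum and neural estimators; this hedged sketch uses the magnitude-only real cepstrum for brevity):

```python
import numpy as np

def homomorphic_split(frame, n_lift=30):
    """Split a windowed speech frame's magnitude spectrum into a
    smooth vocal-tract envelope (low quefrency) and an excitation
    part (high quefrency). n_lift is an illustrative cutoff."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    ceps = np.fft.ifft(log_mag).real
    lifter = np.zeros_like(ceps)
    lifter[:n_lift] = 1.0
    lifter[-(n_lift - 1):] = 1.0      # keep the symmetric counterpart
    envelope = np.exp(np.fft.fft(ceps * lifter).real)
    excitation = np.exp(np.fft.fft(ceps * (1.0 - lifter)).real)
    return envelope, excitation       # |S| ~= envelope * excitation
```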

7 citations


Journal ArticleDOI
TL;DR: In this article, the distinguishing characteristics of speech production in children affected by autism spectrum disorder (ASD) are examined in comparison to the speech of normally developing children; classification between the acoustic features of the two groups achieved up to 98.17% accuracy.

7 citations


Journal ArticleDOI
TL;DR: A spoken language identification system that operates on sequences of feature vectors and can learn language-specific patterns in various filter-size representations of speech files; it shows higher performance with combined GTCC and MFCC features than with GTCC or MFCC features used individually.
Abstract: Following recent advancements in deep learning and artificial intelligence, spoken language identification applications are playing an increasingly significant role in our day-to-day lives, especially in the domain of multi-lingual speech recognition. In this article, we propose a spoken language identification system that depends on the sequence of feature vectors. The proposed system uses a hybrid Convolutional Recurrent Neural Network (CRNN), which combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), for spoken language identification on seven languages, including Arabic, chosen from subsets of the Mozilla Common Voice (MCV) corpus. The proposed system exploits the advantages of both CNN and RNN architectures to construct the CRNN architecture. At the feature extraction stage, it compares the Gammatone Cepstral Coefficient (GTCC) feature and the Mel Frequency Cepstral Coefficient (MFCC) feature, as well as a combination of both. Finally, the speech signals are represented as frames and used as the input for the CRNN architecture. The experimental results indicate higher performance with combined GTCC and MFCC features compared to GTCC or MFCC features used individually. The average accuracy of the proposed system was 92.81% in the best spoken language identification experiment. Furthermore, the system can learn language-specific patterns in various filter-size representations of speech files.

Journal ArticleDOI
Lijiang Chen, Jie Ren, Pengfei Chen, Xia Mao, Qi Zhao 
TL;DR: In this article, the authors propose a framework that applies only the EGG signal for speech synthesis in a scenario with a limited set of content categories; the content recognition model achieved 91.12% accuracy on the validation set in a 20-class content recognition experiment.
Abstract: This paper proposes a framework for applying only the EGG signal to speech synthesis in a scenario with a limited set of content categories. EGG is a physiological signal that reflects the movement of the vocal cords. Because EGG is acquired differently from speech signals, we explore its application to speech synthesis under two scenarios: (1) synthesizing speech in high-noise circumstances, where clean speech signals are unavailable, and (2) enabling people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two stages: EGG to text, and text to speech. The first is a text content recognition model based on Bi-LSTM, which converts each EGG signal sample into the corresponding text from a limited set of content classes. This model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The second stage synthesizes speech from the corresponding text and the EGG signal. Based on a modified Tacotron-2, our model attains a Mel cepstral distortion (MCD) of 5.877 and a mean opinion score (MOS) of 3.87, which is comparable with state-of-the-art performance, achieving an improvement of 0.42 and a relatively smaller model size than the original Tacotron-2. To introduce the speaker characteristics contained in the EGG into the final synthesized speech, we put forward a fine-grained fundamental frequency modification method, which adjusts the fundamental frequency according to the EGG signal and achieves a lower MCD of 5.781 and a higher MOS of 3.94 than the version without modification.
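For reference, the Mel cepstral distortion reported above is conventionally computed as in the sketch below; frame alignment and exclusion of the 0th coefficient follow common practice and may differ from the paper's exact setup.

```python
import numpy as np

def mcd_db(mc_ref, mc_syn):
    """Frame-averaged mel cepstral distortion in dB between two
    time-aligned cepstral sequences of shape (frames, dims), with
    the 0th (energy) coefficient already removed."""
    diff = mc_ref - mc_syn
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))
```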

Journal ArticleDOI
TL;DR: The improvement in performance indicates that the instantaneous frequencies (IFs) estimated using the ESA-based approach efficiently capture the artefacts produced in the instantaneous phase by speech synthesis (SS)- and voice conversion (VC)-based spoofed signals.

Journal ArticleDOI
01 Mar 2022-Sensors
TL;DR: A very short heart sound signal duration (1 s) weakens the performance of Recurrent Neural Networks (RNNs), whereas no apparent decrease was found in the tested Convolutional Neural Network (CNN) model.
Abstract: Deep learning techniques are the future trend for designing heart sound classification methods, making conventional heart sound segmentation dispensable. However, despite using fixed signal duration for training, no study has assessed its effect on the final performance in detail. Therefore, this study aims at analysing the duration effect on the commonly used deep learning methods to provide insight for future studies in data processing, classifier, and feature selection. The results of this study revealed that (1) very short heart sound signal duration (1 s) weakens the performance of Recurrent Neural Networks (RNNs), whereas no apparent decrease in the tested Convolutional Neural Network (CNN) model was found. (2) RNN outperformed CNN using Mel-frequency cepstrum coefficients (MFCCs) as features. There was no difference between RNN models (LSTM, BiLSTM, GRU, or BiGRU). (3) Adding dynamic information (∆ and ∆²MFCCs) of the heart sound as a feature did not improve the RNNs’ performance, and the improvement on CNN was also minimal (≤2.5% in MAcc). The findings provided a theoretical basis for further heart sound classification using deep learning techniques when selecting the input length.
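The static-plus-dynamic feature set examined above can be sketched in a few lines; the file name and parameter choices are assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("heart_sound.wav", sr=2000)  # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
d1 = librosa.feature.delta(mfcc)             # delta-MFCCs
d2 = librosa.feature.delta(mfcc, order=2)    # delta-delta-MFCCs
features = np.concatenate([mfcc, d1, d2], axis=0)  # static + dynamic
```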

Proceedings ArticleDOI
23 May 2022
TL;DR: A high-frequency singularity detection feature obtained by wavelet transform can explicitly show the location of the tampering operation on the waveform and greatly improves accuracy and generalization.
Abstract: There are many methods for detecting forged audio produced by conversion and synthesis. However, splicing, as a simpler method of forgery, has not attracted widespread attention. Based on the observation that a tampering operation causes singularities in the high-frequency components, we propose a high-frequency singularity detection feature obtained by wavelet transform. The proposed feature can explicitly show the location of the tampering operation on the waveform. Moreover, long short-term memory (LSTM) layers are introduced into the CNN-based LCNN to ensure that the sequence information can be fully learned. The proposed feature is fed to the improved RNN-based LCNN together with the widely used linear frequency cepstral coefficients (LFCC), which serve as a supplement, to learn forgery characteristics. Systematic evaluation and comparison show that the proposed method greatly improves accuracy and generalization.
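As a hedged illustration of the underlying idea (not the authors' exact feature), splice-induced singularities stand out as isolated bursts in the finest-scale wavelet detail coefficients:

```python
import numpy as np
import pywt
import librosa

y, sr = librosa.load("suspect.wav", sr=16000)  # hypothetical file
# Finest-scale detail coefficients; abrupt splice points appear as
# isolated large-magnitude bursts there.
coeffs = pywt.wavedec(y, "db4", level=3)
d1 = np.abs(coeffs[-1])                        # level-1 details
threshold = 8.0 * np.median(d1)                # arbitrary illustrative value
suspect_samples = np.nonzero(d1 > threshold)[0] * 2  # approx. sample index
```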

Journal ArticleDOI
01 Aug 2022-Entropy
TL;DR: This study improves the understanding of infant crying by providing a complete description of its intrinsic dynamics, allowing better evaluation of the infant's health status.
Abstract: Multifractal behavior in the cepstrum representation of healthy and unhealthy infant cry signals is examined by means of wavelet leaders and compared using the Student t-test. The empirical results show that both expiration and inspiration signals exhibit clear evidence of multifractal properties under healthy and unhealthy conditions. In addition, expiration and inspiration signals exhibit more complexity under healthy conditions than under unhealthy conditions. Furthermore, distributions of multifractal characteristics are different across healthy and unhealthy conditions. Hence, this study improves the understanding of infant crying by providing a complete description of its intrinsic dynamics to better evaluate its health status.

Journal ArticleDOI
09 May 2022-PeerJ
TL;DR: In this paper, the authors developed the first emotional speech database of the Urdu language and a system to classify five different emotions: sadness, happiness, neutral, disgust, and anger, using different machine learning algorithms.
Abstract: Emotion recognition from acoustic signals plays a vital role in the field of audio and speech processing. Speech interfaces offer humans an informal and comfortable means to communicate with machines. Emotion recognition from speech signals has a variety of applications in the areas of human-computer interaction (HCI) and human behavior analysis. In this work, we develop the first emotional speech database of the Urdu language. We also develop a system to classify five different emotions: sadness, happiness, neutral, disgust, and anger, using different machine learning algorithms. The Mel Frequency Cepstrum Coefficients (MFCC), Linear Prediction Coefficients (LPC), energy, spectral flux, spectral centroid, spectral roll-off, and zero-crossing rate were used as speech descriptors. The classification tests were performed on the emotional speech corpus collected from 20 different subjects. To evaluate the quality of the speech emotions, subjective listening tests were conducted. The rate of correctly classified emotions on the complete Urdu emotional speech corpus was 66.5% with K-nearest neighbors. It was found that the disgust emotion has a lower recognition rate than the other emotions. Removing the disgust emotion significantly improves the performance of the classifier, to 76.5%.
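A hedged sketch of building an utterance-level descriptor vector from a subset of the listed features and fitting the K-nearest-neighbors classifier; the paths, the mean pooling, and k are assumptions, and LPC and spectral flux are omitted for brevity.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def utterance_vector(path):
    """Pool frame-level descriptors (MFCC means, centroid, roll-off,
    zero-crossing rate, energy) into one vector per utterance."""
    y, sr = librosa.load(path, sr=16000)
    return np.hstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(),
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(),
        librosa.feature.zero_crossing_rate(y).mean(),
        float(np.mean(y**2)),        # crude energy descriptor
    ])

# X = np.stack([utterance_vector(p) for p in wav_paths]); y = labels
knn = KNeighborsClassifier(n_neighbors=5)
# knn.fit(X_train, y_train); knn.score(X_test, y_test)
```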

Journal ArticleDOI
TL;DR: In this paper , the authors evaluated the AVQI and its isolated acoustic measures accuracy in discriminating voices with different degrees of deviation and found that the combination of acoustic measures performed better when discriminating voices having a higher degree of deviation.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the differences in subjective and objective parameters (acoustic, spectral, and cepstral) of the voice in elderly male speakers with and without symptoms of dysphonia.

Journal ArticleDOI
TL;DR: In this article, an optimal heart-sound classification method based on machine learning for cardiovascular disease prediction is presented, consisting of three steps: preprocessing, which sets a 5 s duration for the PhysioNet Challenge 2016 and 2022 datasets; feature extraction using Mel frequency cepstrum coefficients (MFCC); and classification using grid search for hyperparameter tuning of several classifier algorithms, including k-nearest neighbor (K-NN), random forest (RF), artificial neural network (ANN), and support vector machine (SVM).
Abstract: Heart-sound auscultation is one of the most widely used approaches for detecting cardiovascular disorders. Diagnosing abnormalities of heart sounds using a stethoscope depends on the physician's skill and judgment. Several studies have shown promising results in automatically detecting cardiovascular disorders based on heart-sound signals. However, the accuracy performance needs to be enhanced, as automated heart-sound classification aids in the early detection and prevention of the dangerous effects of cardiovascular problems. In this study, an optimal heart-sound classification method based on machine learning technologies for cardiovascular disease prediction is presented. It consists of three steps: preprocessing, which sets a 5 s duration for the PhysioNet Challenge 2016 and 2022 datasets; feature extraction using Mel frequency cepstrum coefficients (MFCC); and classification using grid search for hyperparameter tuning of several classifier algorithms, including k-nearest neighbor (K-NN), random forest (RF), artificial neural network (ANN), and support vector machine (SVM). Five-fold cross-validation was used to evaluate the performance of the proposed method. The best model obtained classification accuracies of 95.78% and 76.31%, assessed using PhysioNet Challenge 2016 and 2022, respectively. The findings demonstrate that the suggested approach obtained excellent classification results on PhysioNet Challenge 2016 and promising results on PhysioNet Challenge 2022. Therefore, the proposed method has the potential to be developed into an additional tool to help medical practitioners diagnose heart-sound abnormalities.
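A minimal sketch of the grid-search stage with two of the listed classifiers; the parameter grids are illustrative rather than the paper's.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# X: MFCC-derived feature vectors per 5 s segment, y: labels.
searches = {
    "svm": GridSearchCV(SVC(),
                        {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
                        cv=5),  # five-fold CV, as in the paper
    "rf": GridSearchCV(RandomForestClassifier(),
                       {"n_estimators": [100, 300],
                        "max_depth": [None, 10]},
                       cv=5),
}
# for name, gs in searches.items():
#     gs.fit(X_train, y_train)
#     print(name, gs.best_params_, gs.best_score_)
```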

Journal ArticleDOI
TL;DR: In this paper , the authors proposed a unique voice spoofing countermeasure which is successful to hit upon the Logical Access (LA) attacks and classify the spoofing structures by the usage of Long Short-Term Reminiscence (LSTM).
Abstract: With the growing number of voice-controlled devices, it is necessary to address the potential vulnerabilities of Automatic Speaker Verification (ASV) against voice spoofing attacks such as Physical Access (PA) and Logical Access (LA) attacks. To improve the reliability of ASV systems, researchers have developed various voice spoofing countermeasures. However, it is hard for the voice anti-spoofing systems to effectively detect the synthetic speech attacks that are generated through powerful spoofing algorithms and have quite different statistical distributions. More importantly, the speedy improvement of voice spoofing structures is producing the most effective attacks that make ASV structures greater vulnerable to stumble on those voice spoofing assaults. In this paper, we proposed a unique voice spoofing countermeasure which is successful to hit upon the LA attacks (i.e., artificial speech and transformed speech) and classify the spoofing structures by the usage of Long Short-Term Reminiscence (LSTM). The novel set of spectral features i.e., Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Cepstral Coefficients (GTCC), and spectral centroid are capable to seize maximum alterations present in the cloned audio. The proposed system achieved remarkable accuracy of 98.93%, precision of 100%, recall of 92.32%, F1-score of 96.01%, and an Equal Error Rate (EER) of 1.30%. Our method achieved 8.5% and 7.02% smaller EER than the baseline methods such as Constant-Q Cepstral Coefficients (CQCC) using Gaussian Mixture Model (GMM) and Linear Frequency Cepstral Coefficients (LFCC) using GMM, respectively. We evaluated the performance of the proposed system on the standard dataset i.e., ASVspoof2019 LA. Experimental results and comparative analysis with other existing state-of-the-art methods illustrate that our method is reliable and effective to be used for the detection of voice spoofing attacks.

Journal ArticleDOI
TL;DR: In this paper , an improved frequency response function curvature method which is both baseline-free and output-only was proposed to eliminate 1/f decay of higher resonance peaks caused by the temporal spread of real impulse excitation.
Abstract: Low-severity multiple damage detection relies on sensing minute deviations in the vibrational or dynamical characteristics of the structure. The problem becomes complicated when the reference vibrational profile of the healthy structure and corresponding input excitation, is unavailable as frequently experienced in real-life scenarios. Detection methods that require neither undamaged vibrational profile (baseline-free) nor excitation information (output-only) constitute state-of-art in structural health monitoring. Unfortunately, their efficacy is ultimately limited by non-ideal input excitation masking crucial attributes of system response such as resonant frequency peaks beyond first (few) natural frequency(ies) which can better resolve the issue of multiple damage detection. This study presents an improved frequency response function curvature method which is both baseline-free and output-only. It employs the cepstrum technique to eliminate 1 / f decay of higher resonance peaks caused by the temporal spread of real impulse excitation. Long-pass liftering screens out the bulk of low-frequency sensor noise along with the excitation. With more visible resonant peaks, the cepstrum purified frequency response functions (regenerated frequency response functions) register finer deviation from an estimated baseline frequency response function and yield an accurate damage index profile. The simulation and experimental results on the beam show that the proposed method can successfully locate multiple damages of severity as low as 5%.
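The core cepstral operation can be sketched as below; the cutoff and lifter shape are illustrative, and the paper's regeneration procedure is more involved.

```python
import numpy as np

def regenerate_frf(log_mag_spectrum, n_cut=30):
    """Long-pass liftering sketch: remove the low-quefrency part of
    the real cepstrum of a measured response spectrum, which carries
    the slowly varying excitation contribution (the 1/f decay), and
    keep the rest, which carries the resonance structure."""
    ceps = np.fft.irfft(log_mag_spectrum)
    ceps[:n_cut] = 0.0
    ceps[-(n_cut - 1):] = 0.0      # symmetric counterpart of the lifter
    return np.fft.rfft(ceps).real  # regenerated log-magnitude FRF
```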

Journal ArticleDOI
TL;DR: In this article, the angle cepstrum comb liftering (ACCL) method was proposed to select all rahmonics of the gearmesh quefrency of a particular meshing gear pair.

Proceedings ArticleDOI
02 Aug 2022
TL;DR: This work proposes a system that uses real and imaginary spectrogram features as complementary input features and models the disjoint subbands separately for audio deepfake detection, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems.
Abstract: Recently, pioneering research works have proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant-Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance and showing that different subbands make different contributions to audio deepfake detection. However, these works lack an explanation of the specific information in each subband, and such features also lose information such as phase. Inspired by the mechanism of speech synthesis, in which fundamental frequency (F0) information is used to improve the quality of synthetic speech, we note that the F0 of synthetic speech is still too average, differing significantly from that of real speech. F0 is therefore expected to serve as important information for discriminating between bonafide and fake speech, but this information cannot be used directly due to the irregular distribution of F0. Instead, the frequency band containing most of the F0 is selected as an input feature. Meanwhile, to make full use of the phase and full-band information, we also propose using real and imaginary spectrogram features as complementary input features and modelling the disjoint subbands separately. Finally, the results of the F0, real, and imaginary spectrogram features are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems.
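A hedged sketch of the input-feature preparation; the band edge and STFT settings are assumptions.

```python
import numpy as np
from scipy.signal import stft

def deepfake_inputs(y, sr, f_max=400.0, nperseg=512):
    """Real and imaginary spectrograms as complementary full-band
    inputs, plus the low-frequency band that contains most F0
    values, to be modelled by its own branch."""
    freqs, _, Z = stft(y, fs=sr, nperseg=nperseg)
    f0_band = np.abs(Z[freqs <= f_max, :])
    return Z.real, Z.imag, f0_band
```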

Journal ArticleDOI
TL;DR: In this paper , the authors investigated the utility of cepstral acoustic analysis for the Japanese language as an indicator of dysphoria and the degree of dysphonia severity in Japanese.

Journal ArticleDOI
TL;DR: This review article evaluates studies of cepstral measures to ascertain whether they are efficient in the diagnosis of dysphonia, and concludes that it is reasonable for voice care teams to use CPP and CPPS in patients' initial assessment and to track the effects of treatment.
Abstract: Introduction: Acoustic analysis is one of the well-known methods for voice evaluation. In recent years, many studies have investigated cepstral measures in comparison with earlier acoustic parameters. This review article evaluates the related studies in the cepstral area to ascertain whether these measures are efficient in the diagnosis of dysphonia. Materials and Methods: We narratively reviewed the research studies available between 2009 and 2021 in the PubMed, Scopus, Google Scholar, and Science Direct databases. The search keywords included “cepstral peak prominence”, “smoothed cepstral peak prominence”, “instrumental acoustic analysis”, “acoustic”, and “diagnosis”. Articles that investigated the power of Cepstral Peak Prominence (CPP) and its smoothed version (CPPS) to differentiate dysphonic from normal voices were included. Interventional studies that considered CPP and CPPS only as adjunct variables, and studies that investigated the relationship of cepstral measures with other parameters, were not included. Results: Recent studies support the efficiency of CPP and CPPS for diagnosing dysphonia. Conclusion: It is reasonable for voice care teams to use CPP and CPPS in patients' initial assessment and to track the effects of treatment. However, given the relatively limited number of studies in this area, more studies are required to clarify the efficacy of cepstral measures in different voice pathologies.
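For concreteness, one common recipe for the cepstral peak prominence reviewed here; the window, pitch-quefrency range, and regression span vary across implementations.

```python
import numpy as np

def cpp(frame, sr, f0_lo=60.0, f0_hi=330.0):
    """CPP in dB: height of the cepstral peak within the expected
    pitch-quefrency range above a regression line fitted to the
    cepstrum. The frame should span several pitch periods."""
    log_power = 10.0 * np.log10(
        np.abs(np.fft.rfft(frame * np.hanning(len(frame))))**2 + 1e-12)
    ceps = np.fft.irfft(log_power)
    half = len(ceps) // 2
    quef = np.arange(half) / sr                # quefrency in seconds
    lo, hi = int(sr / f0_hi), int(sr / f0_lo)  # expected pitch quefrencies
    peak = lo + int(np.argmax(ceps[lo:hi]))
    slope, intercept = np.polyfit(quef[1:half], ceps[1:half], 1)
    return ceps[peak] - (slope * quef[peak] + intercept)
```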

Journal ArticleDOI
TL;DR: In this paper , the authors review the applications that the Mel Frequency Cepstrum Coefficient (MFCC) is used for in addition to some issues that facing the MFCC computation and its impact on the model performance.
Abstract: Feature extraction and representation has significant impact on the performance of any machine learning method. Mel Frequency Cepstrum Coefficient (MFCC) is designed to model features of audio signal and is widely used in various fields. This paper aims to review the applications that the MFCC is used for in addition to some issues that facing the MFCC computation and its impact on the model performance. These issues include the use of MFCC for non-acoustic signals, adopting the MFCC alone or combining it with other features, the use of time series versus global representation of the MFCC, following the standard form of the MFCC computation versus modifying its parameters, and supplying the traditional machine learning methods versus the deep learning methods.
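The standard computation that the review takes as its baseline can be summarized as follows (a sketch; the parameter values are typical defaults, not prescriptions).

```python
import numpy as np
import librosa
from scipy.fft import dct

def mfcc_standard(y, sr, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Textbook MFCC chain: STFT -> power spectrum -> mel filter
    bank -> log -> DCT-II, keeping the first n_mfcc coefficients."""
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))**2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_mel = np.log(mel_fb @ power + 1e-12)
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]
```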

Journal ArticleDOI
TL;DR: In this article, the authors proposed two modifications to improve the robustness and performance of CEM in low signal-to-noise ratio (SNR) cases, which resulted in better preservation of speech harmonics, more refined fine structure, and higher inter-harmonic noise suppression.
Abstract: The periodic nature of voiced speech is often exploited to restore speech harmonics and to increase inter-harmonic noise suppression. In particular, a recent paper proposed to do this by manipulating the speech harmonic frequencies in the cepstral domain. The manipulations were carried out on the cepstrum of the excitation signal, obtained by the source-filter decomposition of speech. This method was termed Cepstral Excitation Manipulation (CEM). In this contribution, we further analyse this method, point out its inherent weakness, and propose means to overcome it. First, it is shown by both illustrative examples and theoretical analysis that the existing method underestimates the excitation, especially in low signal-to-noise ratio (SNR) conditions. This inherent weakness leads to weakened speech harmonics and vocoding artifacts due to insufficient noise suppression in the inter-harmonic regions. We then propose two modifications to improve the robustness and performance of CEM in low-SNR cases. The first modification is to use an instantaneous amplifying factor adapted to the signal, instead of a pre-defined constant, for the excitation cepstrum. The second modification is to smooth the excitation cepstrum to preserve additional fine structure, instead of discarding it. These modifications result in better preservation of speech harmonics, more refined fine structure, and higher inter-harmonic noise suppression. Experimental evaluations using a range of standard instrumental metrics conclusively demonstrate that our proposed modifications clearly outperform the existing method, especially in extremely noisy conditions.
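To make the manipulation concrete, here is a hedged sketch of the basic CEM idea, i.e., a constant amplification at the pitch rahmonic, which the paper's first modification replaces with a signal-adaptive factor.

```python
import numpy as np

def amplify_pitch_rahmonic(exc_ceps, pitch_quef_bin, factor=1.5):
    """Boost the excitation cepstrum around the pitch quefrency to
    restore harmonics; 'factor' stands in for the pre-defined
    constant of the original CEM, which the paper replaces with an
    instantaneous, signal-adapted value."""
    out = exc_ceps.copy()
    lo, hi = pitch_quef_bin - 2, pitch_quef_bin + 3  # illustrative window
    out[lo:hi] *= factor
    return out
```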

Journal ArticleDOI
TL;DR: In this paper, Fisher vector (FV) encoding was used to convert features from the frame level (local descriptors) to the utterance level (global descriptors), and the resulting global descriptors were used to train a support vector machine (SVM) classifier.