Journal ArticleDOI

Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model

29 May 2020-IEEE Access (Institute of Electrical and Electronics Engineers (IEEE))-Vol. 8, pp 101181-101191
TL;DR: An audio depression recognition method based on a convolutional neural network (CNN) and a generative adversarial network (GAN) model that effectively reduces depression recognition error compared with existing methods; the RMSE and MAE values obtained on the two datasets improve on the comparison algorithms by more than 5%.
Abstract: This paper proposes an audio depression recognition method based on a convolutional neural network (CNN) and a generative adversarial network (GAN) model. First, the dataset is preprocessed: long silent segments are removed, and the remaining audio is spliced into a new file. Speech features such as Mel-frequency cepstral coefficients (MFCCs), short-term energy, and spectral entropy are then extracted with an audio difference normalization algorithm. The extracted feature matrices, which capture attributes unique to each subject's voice, form the basis for model training. A depression recognition model, DR AudioNet, is then built by combining the CNN and GAN; it optimizes the preceding model and completes recognition and classification using the normalized features of the two segments adjacent to the current audio segment. Experimental results on the AViD-Corpus and DAIC-WOZ datasets show that the proposed method effectively reduces depression recognition error compared with existing methods, and the RMSE and MAE values obtained on the two datasets improve on the comparison algorithms by more than 5%.
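The feature-extraction stage described above can be sketched with NumPy. The framing parameters and the reading of "difference normalization" as a z-scored first-order difference are illustrative assumptions, not the paper's exact algorithm; MFCC extraction (typically done with a library such as librosa) is omitted.

```python
import numpy as np

def frame_signal(y, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(y) - frame_len) // hop)
    return np.stack([y[i * hop : i * hop + frame_len] for i in range(n)])

def short_term_energy(frames):
    # sum of squared samples per frame
    return np.sum(frames ** 2, axis=1)

def spectral_entropy(frames, eps=1e-12):
    # Shannon entropy of each frame's normalized power spectrum
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    p = spec / (spec.sum(axis=1, keepdims=True) + eps)
    return -np.sum(p * np.log2(p + eps), axis=1)

def diff_normalize(feats, eps=1e-12):
    # assumed reading of "difference normalization": z-score the
    # first-order frame-to-frame difference of each feature track
    d = np.diff(feats, axis=0)
    return (d - d.mean(axis=0)) / (d.std(axis=0) + eps)

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)        # 1 s of noise at 16 kHz stands in for speech
frames = frame_signal(y)              # (98, 400)
ste = short_term_energy(frames)
ent = spectral_entropy(frames)
feats = np.stack([ste, ent], axis=1)  # per-frame feature matrix
norm = diff_normalize(feats)          # (97, 2)
print(frames.shape, norm.shape)
```

The resulting per-frame matrix is the kind of "matrix vector feature data" the abstract describes feeding into model training.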


Citations
Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the use of Convolutional Neural Network based model for identifying whether a person is suffering from dysarthria is proposed, which makes use of several speech features viz. zero crossing rates, MFCCs, spectral centroids, spectral roll off for analysis of the speech signals.
Abstract: Patients suffering from dysarthria have trouble controlling the muscles involved in speaking, leading to spoken speech that is indiscernible. A number of studies have addressed speech impairments; however, additional research is required on speakers who share the same impairment but differ in its severity. Knowing the type of impairment and the level of severity helps in assessing the progression of dysarthria and in planning therapy. This paper proposes a Convolutional Neural Network (CNN)-based model for identifying whether a person is suffering from dysarthria; early diagnosis is a step towards better management of the impairment. The proposed model uses several speech features, viz. zero crossing rate, MFCCs, spectral centroid, and spectral roll-off, to analyze the speech signals. The TORGO speech database is used for training and testing the proposed model. The CNN shows promising results for early diagnosis of dysarthric speech, with an accuracy score of 93.87%.
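The non-MFCC features this abstract lists can be computed directly with NumPy, as sketched below; the sample rate, frame sizes, and 85% roll-off threshold are common illustrative defaults, not values taken from the paper.

```python
import numpy as np

def zero_crossing_rate(frames):
    # fraction of adjacent sample pairs whose sign changes, per frame
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return np.mean(np.abs(np.diff(signs, axis=1)) / 2, axis=1)

def spectral_centroid(frames, sr=16000):
    # magnitude-weighted mean frequency of each frame's spectrum
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    return (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)

def spectral_rolloff(frames, sr=16000, pct=0.85):
    # frequency below which pct of the spectral magnitude is contained
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    cum = np.cumsum(spec, axis=1)
    idx = np.argmax(cum >= pct * cum[:, -1:], axis=1)
    return freqs[idx]

rng = np.random.default_rng(1)
y = rng.standard_normal(16000)        # stand-in for a speech recording
frame_len, hop = 400, 160
n = 1 + (len(y) - frame_len) // hop
frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n)])
zcr = zero_crossing_rate(frames)
cent = spectral_centroid(frames)
roll = spectral_rolloff(frames)
print(zcr.shape, cent.shape, roll.shape)
```

In practice a library such as librosa provides equivalent feature extractors; this sketch only shows what the quantities are.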

4 citations

Journal ArticleDOI
TL;DR: Wang et al. used an audio depression regression model (DR AudioNet), based on a convolutional neural network (CNN) and a long short-term memory (LSTM) network, to identify depression in patients.
Abstract: In recent years, depression has not only caused patients psychological suffering such as self-blame but has also carried a high disability and mortality rate. Early detection, diagnosis, and timely treatment of patients at different levels of severity can improve the cure rate. Quite a few potential depression patients are unaware of their illness; some even suspect they are sick but are unwilling to go to the hospital. In response, this research designed an intelligent depression recognition human-computer interaction system. The main contributions are (1) the use of an audio depression regression model (DR AudioNet), based on a convolutional neural network (CNN) and a long short-term memory (LSTM) network, to identify depression in patients. It uses a multiscale audio differential normalization (MADN) feature extraction algorithm; the MADN feature describes non-personalized speech characteristics, and two network models are designed based on the MADN features of two adjacent segments of audio. Comparative experiments show that the method is effective in identifying depression. (2) Based on the conclusions of the previous step, a human-computer interaction system is designed: after users input their own voice, the final recognition result is output by the network model used in this research. The visual operation is convenient for users and has practical application value.
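The idea of describing each audio segment by the features of its two adjacent segments can be sketched as below. The edge-padding choice (reusing the boundary segment as its own missing neighbor) and the feature layout are illustrative assumptions, not the paper's exact MADN construction.

```python
import numpy as np

def adjacent_segment_features(seg_feats):
    """For each segment i, concatenate the (already normalized) features of
    segments i-1 and i+1; edge segments reuse themselves as the missing
    neighbor."""
    padded = np.concatenate([seg_feats[:1], seg_feats, seg_feats[-1:]], axis=0)
    prev_, next_ = padded[:-2], padded[2:]
    return np.concatenate([prev_, next_], axis=1)

# 5 audio segments, 3 features each
seg_feats = np.arange(15, dtype=float).reshape(5, 3)
ctx = adjacent_segment_features(seg_feats)
print(ctx.shape)   # (5, 6): neighbor features for every segment
```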

3 citations

Journal ArticleDOI
TL;DR: In this article, the authors propose a model to detect mixed-mood episodes, which are characterized by combinations of various bipolar disorder symptoms occurring in a random, unpredictable, and uncertain manner.
Abstract: In the present state of health and wellness, mental illness is often deemed less important than physical illness. In reality, mental illness has serious, multi-dimensional adverse effects on the subject's personal life, social life, and financial stability. Bipolar disorder is one of the most prominent mental illnesses and can be triggered by any external stimulation of the subject. The diagnosis and treatment of bipolar disorder differ greatly from those of other illnesses, and the first impediment is correct diagnosis itself. Standard classifications distinguish discrete forms of bipolar disorder, viz. type-I, type-II, cyclothymic, etc., each characterized by specific moods associated with depression and mania. However, no prior study addresses the detection of mixed-mood episodes, in which various bipolar symptoms combine in a random, unpredictable, and uncertain manner. The proposed model therefore contributes granular information on the dynamics of mood transitions. Simulation in MATLAB shows that the resulting model can detect mixed-mood episodes precisely.

2 citations

Journal ArticleDOI
TL;DR: In this article, the Improved Wasserstein Skip-Connection GAN (IWGAN) is proposed to detect anomalies and hazards in the airport environment using autoencoders and GANs.
Abstract: Anomaly detection is an important research topic in the field of artificial intelligence and visual scene understanding. The most significant challenge in real-world anomaly detection problems is the high imbalance of available data (i.e., non-anomalous versus anomalous data). This limits the use of supervised learning methods. Furthermore, the abnormal—and even normal—datasets in the airport field are relatively insufficient, causing them to be difficult to use to train deep neural networks when conducting experiments. Because generative adversarial networks (GANs) are able to effectively learn the latent vector space of all images, the present study adopted a GAN variant with autoencoders to create a hybrid model for detecting anomalies and hazards in the airport environment. The proposed method, which integrates the Wasserstein-GAN (WGAN) and Skip-GANomaly models to distinguish between normal and abnormal images, is called the Improved Wasserstein Skip-Connection GAN (IWGAN). In the experimental stage, we evaluated different hyper-parameters—including the activation function, learning rate, decay rate, training times of discriminator, and method of label smoothing—to identify the optimal combination. The proposed model’s performance was compared with that of existing models, such as U-Net, GAN, WGAN, GANomaly, and Skip-GANomaly. Our experimental results indicate that the proposed model yields exceptional performance.
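At inference time, GANomaly-family detectors typically score a sample by combining a reconstruction error with a latent-feature error. The sketch below shows that scoring idea only; the weight and the L1/L2 norm choices are illustrative assumptions, not the exact IWGAN formulation.

```python
import numpy as np

def anomaly_score(x, x_hat, z, z_hat, w=0.9):
    """Weighted sum of image-space reconstruction error and latent-feature
    error; higher scores indicate likelier anomalies."""
    recon = np.mean(np.abs(x - x_hat))      # L1 reconstruction error
    latent = np.mean((z - z_hat) ** 2)      # L2 latent error
    return w * recon + (1 - w) * latent

rng = np.random.default_rng(2)
x = rng.random((64, 64))    # stand-in for an input image
z = rng.random(128)         # stand-in for its latent code
normal = anomaly_score(x, x, z, z)                          # perfect reconstruction
abnormal = anomaly_score(x, x + 0.3 * rng.random((64, 64)), z, z)
print(normal, abnormal)
```

A threshold on this score then separates normal from abnormal images.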

1 citations

Journal ArticleDOI
TL;DR: In this paper, a hybrid model is proposed for depression detection using deep learning algorithms; it combines textual and audio features of patients' responses, and the results show that an audio CNN is a good model for depression classification.

References
Journal ArticleDOI
TL;DR: The experimental results show that the designed networks achieve excellent performance on the task of recognizing speech emotion, especially the 2D CNN LSTM network outperforms the traditional approaches, Deep Belief Network (DBN) and CNN on the selected databases.

599 citations


"Recognition of Audio Depression Bas..." refers methods in this paper

  • ...For the audio recognition problem, scholars have proposed many methods, in [11] they constructed a one-dimensional long-short term memory (LSTM) and a two-dimensional LSTM to extract local and global emotion related features in speech, which can improve the accuracy of original model by combining the two features....

    [...]

Journal ArticleDOI
TL;DR: This paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio–visual segment features with Convolutional Neural Networks and 3D-CNN, then fuses audio– visual segment features in a Deep Belief Networks (DBNs).
Abstract: Emotion recognition is challenging due to the emotional gap between emotions and audio–visual features. Motivated by the powerful feature learning ability of deep neural networks, this paper proposes to bridge the emotional gap by using a hybrid deep model, which first produces audio–visual segment features with Convolutional Neural Networks (CNNs) and 3D-CNN, then fuses audio–visual segment features in a Deep Belief Networks (DBNs). The proposed method is trained in two stages. First, CNN and 3D-CNN models pre-trained on corresponding large-scale image and video classification tasks are fine-tuned on emotion recognition tasks to learn audio and visual segment features, respectively. Second, the outputs of CNN and 3D-CNN models are combined into a fusion network built with a DBN model. The fusion network is trained to jointly learn a discriminative audio–visual segment feature representation. After average-pooling segment features learned by DBN to form a fixed-length global video feature, a linear Support Vector Machine is used for video emotion classification. Experimental results on three public audio–visual emotional databases, including the acted RML database, the acted eNTERFACE05 database, and the spontaneous BAUM-1s database, demonstrate the promising performance of the proposed method. To the best of our knowledge, this is an early work fusing audio and visual cues with CNN, 3D-CNN, and DBN for audio–visual emotion recognition.
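The pooling step this pipeline ends with, averaging a variable number of learned segment features into one fixed-length video feature for the linear SVM, reduces to a single mean over the segment axis; the 256-dimension size below is an illustrative assumption.

```python
import numpy as np

def global_video_feature(segment_feats):
    # average-pool a variable number of segment features into one vector
    return segment_feats.mean(axis=0)

rng = np.random.default_rng(3)
short_clip = rng.random((7, 256))    # 7 segments, 256-dim fused features
long_clip = rng.random((31, 256))    # 31 segments
a = global_video_feature(short_clip)
b = global_video_feature(long_clip)
print(a.shape, b.shape)              # both (256,): ready for a linear SVM
```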

249 citations


"Recognition of Audio Depression Bas..." refers background in this paper

  • ...Clinical observations and studies have found that there is a significant correlation between the audio characteristics and the depression degrees [4], [5]....

    [...]

Proceedings ArticleDOI
Xingchen Ma, Hongyu Yang, Qiang Chen, Di Huang, Yunhong Wang
16 Oct 2016
TL;DR: A deep model is proposed, namely DepAudioNet, to encode the depression related characteristics in the vocal channel, combining Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to deliver a more comprehensive audio representation.
Abstract: This paper presents a novel and effective audio-based method for depression classification. It focuses on two important issues, i.e., data representation and sample imbalance, which are not well addressed in the literature. For the former, in contrast to traditional shallow hand-crafted features, we propose a deep model, namely DepAudioNet, to encode the depression-related characteristics in the vocal channel, combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) to deliver a more comprehensive audio representation. For the latter, we introduce a random sampling strategy in the model training phase to balance the positive and negative samples, which largely alleviates the bias caused by uneven sample distribution. Evaluations are carried out on the DAIC-WOZ dataset for the Depression Classification Sub-challenge (DCC) at the 2016 Audio-Visual Emotion Challenge (AVEC), and the experimental results clearly demonstrate the effectiveness of the proposed approach.
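One simple reading of the random sampling strategy, subsampling the majority class down to the minority count so each training epoch sees balanced classes, can be sketched as follows; the exact sampling scheme in DepAudioNet may differ.

```python
import numpy as np

def balanced_sample(X, y, rng):
    """Randomly subsample the majority class to the minority-class count;
    redrawing each epoch exposes different majority subsets."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    idx = np.concatenate([rng.choice(pos, n, replace=False),
                          rng.choice(neg, n, replace=False)])
    rng.shuffle(idx)
    return X[idx], y[idx]

rng = np.random.default_rng(4)
X = np.arange(50, dtype=float).reshape(50, 1)  # toy features
y = np.array([1] * 10 + [0] * 40)              # 1:4 class imbalance
Xb, yb = balanced_sample(X, y, rng)
print(len(yb), int((yb == 1).sum()), int((yb == 0).sum()))
```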

183 citations


"Recognition of Audio Depression Bas..." refers background in this paper

  • ...[27] proposed a binary classification network structure for identifying depression in the 2016 AVEC competition, which is mainly composed of CNN and LSTM....

    [...]

Journal ArticleDOI
TL;DR: The authors discuss several core challenges in embedded and mobile deep learning, as well as recent solutions demonstrating the feasibility of building IoT applications that are powered by effective, efficient, and reliable deep learning models.
Abstract: How can the advantages of deep learning be brought to the emerging world of embedded IoT devices? The authors discuss several core challenges in embedded and mobile deep learning, as well as recent solutions demonstrating the feasibility of building IoT applications that are powered by effective, efficient, and reliable deep learning models.

106 citations


"Recognition of Audio Depression Bas..." refers background or methods in this paper

  • ...The number of filters M is between 20 and 40, and we set M = 40 according to [8]....

    [...]

  • ...Currently, the Beck Depression Inventory II (BDI-II) is the most widely used self-assessment scale for depressive symptoms and is the tool used to assess the degree of depression [8]....

    [...]

Journal ArticleDOI
TL;DR: This study investigates the relationship between rough voice and the presence of subharmonics, which correspond to smaller yet distinct peaks located between two consecutive harmonic peaks in the power spectrum.

62 citations


"Recognition of Audio Depression Bas..." refers methods in this paper

  • ...Based on the measurement method of signal harmonics in the work of Omori et al. [26], we can use the concept of entropy to describe this hypothesis....

    [...]


  • ...[26] K. Omori, H. Kojima, R. Kakani, D. H. Slavit, and S. M. Blaugrund, ‘‘Acoustic characteristics of rough voice: Subharmonics,’’ J. Voice, vol. 11, no. 1, pp. 40–47, Mar. 1997....

    [...]