
Showing papers in "International Journal of Speech Technology in 2020"


Journal ArticleDOI
TL;DR: The main aim of this work is to improve the speech emotion recognition rate of a system using different feature extraction algorithms, validated for the universal emotions of anger, happiness, sadness and neutral.
Abstract: In this digitally growing era, speech emotion recognition plays a significant role in several applications such as human-computer interaction (HCI), lie detection, automotive steering assistance, intelligent tutoring systems, audio mining, security, telecommunication, and interaction between humans and machines at home, in hospitals, in shops, etc. Speech is a unique human characteristic used as a tool to communicate and express one's perspective to others. Speech emotion recognition extracts the emotions of the speaker from his or her speech signal. Feature extraction, feature selection and classification are the three main stages of emotion recognition. The main aim of this work is to improve the speech emotion recognition rate of a system using different feature extraction algorithms. The work emphasizes preprocessing of the received audio samples, where the noise is removed from the speech samples using filters. In the next step, the Mel Frequency Cepstral Coefficients (MFCC), Discrete Wavelet Transform (DWT), pitch, energy and zero crossing rate (ZCR) algorithms are used to extract the features. In the feature selection stage, a global feature algorithm is used to remove redundant information from the features, and machine learning classification algorithms are used to identify the emotions from the extracted features. These feature extraction algorithms are validated for the universal emotions of anger, happiness, sadness and neutral.
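As an illustration of the front end described above, here is a minimal Python sketch that extracts MFCC, DWT, pitch, energy and ZCR features from one utterance. It assumes librosa and PyWavelets are available; the frame settings and the simple mean/std "global" statistics are illustrative, not the authors' exact configuration.

```python
import numpy as np
import librosa
import pywt

def extract_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # spectral envelope
    zcr = librosa.feature.zero_crossing_rate(y)             # zero crossing rate
    energy = librosa.feature.rms(y=y)                       # short-time energy
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)           # pitch contour
    approx, detail = pywt.dwt(y, "db4")                     # one-level DWT
    stats = lambda x: [np.mean(x), np.std(x)]               # utterance-level statistics
    return np.hstack([stats(mfcc), stats(zcr), stats(energy),
                      stats(f0), stats(approx), stats(detail)])
```

The resulting vector can then be passed to a feature selection step and any standard classifier.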

85 citations


Journal ArticleDOI
TL;DR: Algorithms like linear regression, decision tree, random forest, support vector machine (SVM) and convolutional neural networks (CNN) are used for classification and prediction once relevant features are selected from speech signals.
Abstract: Emotion recognition plays a vital role in day-to-day interpersonal human interactions. Understanding the feelings of a person from his or her speech can reveal wonders in shaping social interactions. A person's emotion can be identified from the tone and pitch of the voice. The acoustic speech signal is split into short frames, a fast Fourier transform is applied, and relevant features are extracted using mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features. In this paper, algorithms like linear regression, decision tree, random forest, support vector machine (SVM) and convolutional neural networks (CNN) are used for classification and prediction once relevant features are selected from the speech signals. Human emotions like neutral, calm, happy, sad, fearful, disgust and surprise are classified using decision tree, random forest, SVM and CNN. We have tested our model with the RAVDESS dataset, and the CNN has shown 78.20% accuracy in recognizing emotions compared to decision tree, random forest and SVM.
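A rough sketch of this classification stage in Python (scikit-learn and librosa are assumed; the utterance-level MFCC statistics and hyper-parameters are placeholders, not the authors' exact setup):

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def mfcc_stats(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.hstack([m.mean(axis=1), m.std(axis=1)])        # 2*n_mfcc features per file

# X = np.vstack([mfcc_stats(f) for f in files]); y = emotion labels parsed from file names
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
# for clf in (SVC(kernel="rbf"), RandomForestClassifier(n_estimators=200)):
#     clf.fit(X_tr, y_tr)
#     print(type(clf).__name__, clf.score(X_te, y_te))
```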

34 citations


Journal ArticleDOI
TL;DR: This paper recasts the hearing aid using distributed arithmetic (DA), which enables the implementation of a hearing aid without multipliers, and it is shown that a low-complexity hearing aid architecture can be obtained.
Abstract: In this paper, we propose a low-complexity architectural design for hearing aid applications. We recast the hearing aid using distributed arithmetic (DA), which enables its implementation without multipliers. It is further shown that the high-order filters required for a high-speed hearing aid can be realized using only look-up tables and shift-accumulate operations. A novel approach is proposed to replace the decimation filter of a hearing aid with a multiplierless architecture built around a single DA unit. With proper initialization, it is shown that a low-complexity hearing aid architecture can be obtained. The proposed distributed arithmetic architecture is implemented in ASIC SAED 90 nm technology. The hearing aid application is implemented in MATLAB Simulink and the Xilinx System Generator tool. The obtained results show 20% less area-delay product and 40% less power-delay product when compared with the existing architecture.

33 citations


Journal ArticleDOI
TL;DR: The present work aims at analyzing social media data for code-switching and transliterating it to English using a special kind of recurrent neural network (RNN) called the Long Short-Term Memory (LSTM) network.
Abstract: The present work aims at analyzing social media data for code-switching and transliterating it to English using a special kind of recurrent neural network (RNN) called the Long Short-Term Memory (LSTM) network. During the course of the work, TensorFlow is used to express the LSTM suitably. Twitter data is stored in MongoDB to enable easy handling and processing. The data is parsed through different fields with the aid of a Python script and cleaned using regular expressions. The LSTM model is trained on 1 M data samples and is then used for transliteration and translation of the Twitter data. Translation and transliteration of social media data enable publicizing the content in the language understood by the majority of the population. With this, any content which is anti-social or a threat to law and order can be easily verified and blocked at the source.
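A purely illustrative TensorFlow/Keras sketch of a character-level LSTM sequence model of the kind used for transliteration; the vocabulary sizes and layer widths are assumptions, not values from the paper.

```python
import tensorflow as tf

VOCAB_IN, VOCAB_OUT = 64, 64            # placeholder character vocabularies

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_IN, 128),
    tf.keras.layers.LSTM(256, return_sequences=True),        # one output per input character
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(VOCAB_OUT, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_chars, y_chars, batch_size=128, epochs=10)
#   x_chars, y_chars: integer-encoded, padded source/target character sequences
```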

28 citations


Journal ArticleDOI
TL;DR: A new deep learning speech-based recognition model is presented for automatically recognizing spoken words, and it offers better recognition performance than the other methods.
Abstract: Automatic speaker recognition builds on various models of speaker characterization, pattern analysis and engineering. This work focuses on the effect of classification and feature selection methods on speech emotion recognition. Selecting the right parameters in combination with the classifier is an important part of minimizing the computational complexity of the system, and this becomes essential for models deployed in real-time scenarios. In this paper, a new deep learning speech-based recognition model is presented for automatically recognizing spoken words. The quality of the input source, i.e. the speech sound, has a direct impact on the accuracy attained by the classifier. The Berlin database consists of around 500 utterances from both male and female speakers. On the applied dataset, the presented model achieves a maximum accuracy of 94.21%, 83.54%, 83.65% and 78.13% under MFCC, prosodic, LSP and LPC features, respectively. The presented model offered better recognition performance than the other methods.

27 citations


Journal ArticleDOI
TL;DR: The results show that the proposed design has very high computation speed with total delay of only 20 ns and occupies 20% less area in comparison with the existing designs.
Abstract: Digital signal processing (DSP) systems are becoming popular with the emergence of artificial intelligence and machine learning based applications. The residue number system is one of the most sought-after representations for implementing high-speed DSP systems. This paper presents an efficient implementation of a memoryless distributed arithmetic (MLDA) architecture in a finite impulse response filter with the residue number system. The input data and filter coefficients of the MLDA are in residue number form, and the output data from the MLDA is converted into binary form using the Chinese remainder theorem. In addition, compressor adders are used to reduce the area. For real-time validation, the proposed design has been simulated and synthesized on an application-specific integrated circuit platform using the Synopsys Design Compiler with CMOS 90 nm technology. The results show that the proposed design has very high computation speed, with a total delay of only 20 ns, and occupies 20% less area in comparison with the existing designs.
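The core residue-number-system idea can be sketched in a few lines of Python: a value is represented by its residues modulo pairwise co-prime moduli and recovered with the Chinese remainder theorem. The modulus set below is a common textbook choice, not necessarily the one used in the paper.

```python
from functools import reduce

MODULI = (7, 15, 16)                       # pairwise co-prime; dynamic range 7*15*16 = 1680

def to_rns(x):
    return tuple(x % m for m in MODULI)    # forward conversion to residues

def from_rns(residues):
    M = reduce(lambda a, b: a * b, MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x = (x + r * Mi * pow(Mi, -1, m)) % M   # CRT term with modular inverse of Mi mod m
    return x

assert from_rns(to_rns(1234)) == 1234      # round trip inside the dynamic range
```

In the hardware described above, the filter arithmetic stays in residue form and only the final output passes through the CRT conversion step.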

22 citations


Journal ArticleDOI
Soufiane Hourri1, Jamal Kharroubi1
TL;DR: This study proposes a new way to use deep neural networks (DNNs) in speaker recognition, with the aim of making it easier for the DNN to learn the feature distribution, by transforming the extracted feature vectors into enhanced feature vectors denoted Deep Speaker Features (DeepSFs).
Abstract: Speaker verification (SV) is an important branch of speaker recognition. Several approaches have been investigated within the last few decades. In this context, deep learning has received much interest from speech processing researchers, and it was introduced recently into speaker recognition. In most cases, deep learning models are adapted from speech recognition applications and applied to speaker recognition, and they have shown their capability of competing with the state-of-the-art approaches. Nevertheless, the use of deep learning in speaker recognition is still linked to speech recognition. In this study, we propose a new way to use deep neural networks (DNNs) in speaker recognition, with the aim of making it easier for the DNN to learn the feature distribution. We have been motivated by our previous work, where we proposed a novel scoring method that works well with clean speech but needs improvement under noisy conditions. For this reason, we aim to transform the extracted feature vectors (MFCCs) into enhanced feature vectors that we denote Deep Speaker Features (DeepSFs). Experiments have been conducted on the THUYG-20 SRE corpus, and significant results have been achieved. Moreover, this new method outperformed both i-vector/PLDA and our baseline system in both clean and noisy conditions.

22 citations


Journal ArticleDOI
TL;DR: This work modifies the Weighted TF_IDF (Term Frequency Inverse Document Frequency) algorithm to summarize books into relevant keywords and finds that it is an efficient algorithm to automate text summarization and produce an effective summary, which is then converted from text to speech.
Abstract: Owing to the phenomenal growth in communication technology, most of us hardly have time to read books. The habit of reading is slowly diminishing because of people's busy lives. For visually challenged people, the situation is even worse. In order to address this impediment, we develop a better and more accurate methodology than the existing ones. In this work, in order to save the effort of reading the complete text every time, we modify the Weighted TF_IDF (Term Frequency Inverse Document Frequency) algorithm to summarize books into relevant keywords. Then, we compare the modified algorithm with the existing TextRank, Luhn's, LexRank and Latent Semantic Analysis (LSA) algorithms. From the comparative analysis, we find that Weighted TF_IDF is an efficient algorithm to automate text summarization and produce an effective summary, which is then converted from text to speech. Thus, the proposed algorithm would be highly useful for blind people.
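A rough sketch of TF-IDF-weighted extractive summarization with scikit-learn; the sentence scoring shown here is a generic variant, not the authors' modified weighting scheme.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(sentences, top_k=3):
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(sentences)             # one row per sentence
    scores = np.asarray(tfidf.sum(axis=1)).ravel()   # sentence score = sum of term weights
    best = np.argsort(scores)[::-1][:top_k]
    return [sentences[i] for i in sorted(best)]      # keep original sentence order

# summary = summarize(sentence_list)   # the result can then be fed to a TTS engine
```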

20 citations


Journal ArticleDOI
TL;DR: The performance of Amazigh speech recognition via an interactive voice response system in noisy conditions is described; a degradation of accuracy was observed for all studied words to different degrees, due to word components or the speech coding.
Abstract: This paper describes the performance of Amazigh speech recognition via an interactive voice response system in noisy conditions. The experiments were first conducted on uncoded speech and then repeated for decoded speech in a noisy environment at different signal-to-noise ratios (SNR). In this study, we analyze the effect of noise at different SNR levels on the first ten Amazigh digits, which were collected from 22 Moroccan native speakers, both male and female. Our experimental results show that a degradation of accuracy was observed for all studied words to different degrees, due to word components or the speech coding.
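For tests of this kind, speech is typically corrupted at a chosen SNR; a minimal sketch follows (illustrative only, not the authors' recording or coding pipeline).

```python
import numpy as np

def add_noise(clean, snr_db):
    noise = np.random.randn(len(clean))
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10.0)))   # noise gain for target SNR
    return clean + scale * noise

# noisy = add_noise(digit_waveform, snr_db=5)   # repeat for each SNR level studied
```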

19 citations


Journal ArticleDOI
TL;DR: This paper demonstrates and generalizes a model combining bi-directional long short term memory (LSTM) and convolutional neural network (CNN), where the bi-directional LSTM holds the temporal data for part-of-speech (PoS) tagging and the CNN extracts the potential features.
Abstract: In the past few years, the popularity of social media has increased drastically, and sentiment analysis on reviews, comments and opinions from social media has become a more active research area. High-grade sentiment analysis portrays the opinion about real-time objects, topics, products and tweet reviews. Social trends and customer opinion are better understood with sentiment analysis. The state-of-the-art methods for analyzing sentiments are based on textual features and different neural network models. In this paper, we demonstrate and generalize a model combining bi-directional long short term memory (LSTM) and convolutional neural network (CNN), where the bi-directional LSTM holds the temporal data for part-of-speech (PoS) tagging and the CNN extracts the potential features. The experimental results validate the combined model's performance against the individual models. The combined bi-directional LSTM-CNN technique performs accurately and efficiently, achieving reduced execution time and an increased accuracy rate of 98.6% in sentiment analysis when compared with traditional techniques.
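A Keras sketch of a combined Bi-LSTM + CNN text classifier in the spirit of the model described above; the vocabulary size and layer sizes are assumptions, not the paper's values.

```python
import tensorflow as tf

VOCAB, EMB = 20000, 128                     # placeholder hyper-parameters

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, EMB),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Conv1D(64, 3, activation="relu"),    # local n-gram features
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),      # positive / negative
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```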

19 citations


Journal ArticleDOI
TL;DR: For the first time, the combination of deep belief network (DBN), for extracting features of speech signals, and Deep Bidirectional Long Short-Term Memory (DBLSTM) with Connectionist Temporal Classification (CTC) output layer is used to create an AM on the Farsdat Persian speech data set.
Abstract: Up to now, various methods have been used for Automatic Speech Recognition (ASR), among which the Hidden Markov Model (HMM) and Artificial Neural Networks (ANNs) are the most important ones. One of the existing challenges is increasing the accuracy and efficiency of these systems, and one way to enhance their accuracy is by improving the acoustic model (AM). In this paper, for the first time, the combination of a deep belief network (DBN), for extracting features of speech signals, and a Deep Bidirectional Long Short-Term Memory (DBLSTM) network with a Connectionist Temporal Classification (CTC) output layer is used to create an AM on the Farsdat Persian speech data set. The obtained results show that the use of a deep neural network (DNN) improves the results compared to a shallow network. Also, using a bidirectional network increases the accuracy of the model in comparison with a unidirectional network, in both deep and shallow networks. Comparing the obtained results with the HMM and Kaldi-DNN indicates that using DBLSTM with features extracted from the DBN increases the accuracy of Persian phoneme recognition.

Journal ArticleDOI
TL;DR: A learning-based hybrid model is proposed for speaker-independent emotional voice conversion using a combination of deep belief nets (DBN-DNN) and general regression neural net (GRNN), which shows a significant performance improvement in RMSE and Pearson’s correlation coefficient.
Abstract: Emotional voice conversion systems are used to formulate mapping functions that transform the neutral speech output of text-to-speech systems to the target emotion appropriate to the context. In this work, a learning-based hybrid model is proposed for speaker-independent emotional voice conversion using a combination of deep belief nets (DBN-DNN) and a general regression neural net (GRNN). The main acoustic features considered for mapping are the shape of the vocal tract given by line spectral frequencies (LSF), the glottal excitation given by the LP residual, and long-term prosodic features, viz. pitch contour and energy. The GRNN is used to obtain the transformation function between source and target LSFs. Source and target LP residuals are subjected to a wavelet transform before DBN-DNN training; this helps remove phase-change-induced distortions which may affect the performance of neural networks when transforming the time-domain residual. The low-dimensional pitch (intonation) contour is subjected to feed-forward neural network (ANN) mapping. Energy modification is achieved by taking average transformation scales across the entire utterance. The system is tested on three different datasets, viz. EmoDB (German), IITKGP (Telugu) and SAVEE (English). The relative performance of the proposed model is compared with constrained variance GMM (CV-GMM) using objective and subjective metrics. The results obtained show a significant performance improvement of 41% in RMSE (Hz) and 9.72% in Pearson's correlation coefficient for fundamental frequency (F0) (Fear) compared to CV-GMM across all three datasets. Subjective results indicate a maximum MOS score of 3.85 (Fear) and a CMOS score of 3.9 (Happiness) across the three datasets considered.

Journal ArticleDOI
TL;DR: A novel method is proposed for emotion classification using a deep learning network with transfer learning, which achieves a promising effect on emotion classification with good accuracy and PDA value when compared with other state-of-the-art methods.
Abstract: Emotion is subjective; an image conveys rich semantics and induces different emotions in different individuals. A novel method is proposed for emotion classification using a deep learning network with transfer learning. Transfer learning techniques reuse a model trained on related predictive problems. The purpose of the proposed work is to classify the emotion perceived from images based on visual features. Image augmentation and segmentation are performed to build a powerful classifier. The performance of a deep convolutional neural network (CNN) is effectively improved with transfer learning techniques on a large-scale Image-Emotion dataset. The experiments conducted on this dataset show that the proposed method achieves a promising effect on emotion classification, with good accuracy and PDA value, when compared with other state-of-the-art methods.
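A minimal transfer-learning sketch in Keras: a pretrained ImageNet backbone is frozen and a small emotion-classification head is trained on top. The backbone choice and the number of emotion classes are assumptions, not details from the paper.

```python
import tensorflow as tf

NUM_CLASSES = 8                                   # placeholder number of emotion categories

base = tf.keras.applications.MobileNetV2(input_shape=(224, 224, 3),
                                          include_top=False, weights="imagenet")
base.trainable = False                            # reuse pretrained visual features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Augmentation (random flips, rotations) can be added with tf.keras.layers.RandomFlip, etc.
```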

Journal ArticleDOI
TL;DR: The proposed system provides a dynamic and energy-efficient live VM (virtual machine) migration approach that reduces power wastage by putting idle physical machines into sleep mode, resulting in energy savings.
Abstract: Cloud computing offers unlimited computational resources which are ready to use from anywhere, at any time, on request. The goals of the proposed system are maximized utilization of computational resources (physical and virtual) and minimized energy consumption. The proposed system provides a dynamic and energy-efficient live VM (virtual machine) migration approach. It reduces power wastage by putting idle physical machines into sleep mode, resulting in energy savings. We propose a system consisting of seven modules. (1) The Resource Monitor analyses the energy consumption of resources. (2) The Capacity Distributor distributes the maximum and minimum capacity for the physical machines. (3) The Task Allocator determines overloaded servers. (4) The Optimizer analyses the load on physical machines using an ant colony optimization algorithm. (5) The Local Migration Agent calculates the load of the VMs to be migrated and selects an appropriate physical server. (6) The Migration Orchestrator migrates the VM considering the load. (7) The Energy Manager initiates sleep mode for idle physical machines (PMs).

Journal ArticleDOI
TL;DR: This paper proposes a novel iterative clustering algorithm that makes use of the translated text and reduces error in it and measures the quality of clustering with many real-world benchmark datasets.
Abstract: In recent years, many research methodologies have been proposed to recognize spoken language and translate it to text. In this paper, we propose a novel iterative clustering algorithm that makes use of the translated text and reduces the error in it. The proposed methodology involves three steps executed over many iterations, namely: (1) unknown word probability assignment, (2) multi-probability normalization, and (3) probability filtering. In the first step, each iteration learns the unknown words from previous iterations and assigns a new probability to them based on the temporary results obtained in the previous iteration. This process continues until there are no unknown words left. The second step involves normalization of multiple probabilities assigned to a single word by considering neighbouring word probabilities. The last step eliminates probabilities below a threshold, which ensures the reduction of noise. We measure the quality of clustering with many real-world benchmark datasets. Results show that our optimized algorithm produces more accurate clustering compared to other clustering algorithms.

Journal ArticleDOI
TL;DR: The proposed CBIR algorithm is developed based on different image feature characteristics and structure, emulating the process of visual content transmission and representation in high-level understanding, with the aid of visual enhancement for feature fusion.
Abstract: Retrieving accurate images from a large database in an efficient way is essential in CBIR. We create a new method to improve accuracy in CBIR by combining the MTH (Multi Texton Histogram) and the MSD (Micro Structure Descriptor); it is called the Composite Micro Structure Descriptor (CMSD). The proposed CBIR algorithm is developed based on different image feature characteristics and structure, emulating the process of visual content transmission and representation in high-level understanding, with the aid of visual enhancement for feature fusion. We have used four different kinds of data sets to evaluate the performance of the new method. Our newly designed method outperforms other CBIR methods such as MTH and MSD.

Journal ArticleDOI
TL;DR: A procedure of swapping consonant-graphemes based on phonological similarity is proposed to boost the standard bigram-based orthographic syllabification, which commonly has a low performance for a dataset with many out-of-vocabulary (OOV) bigrams.
Abstract: Swapping one or more consonant-graphemes in a word into other phonologically similar ones, based on both place and manner of articulation, interestingly produces other words without shifting the syllable boundary (or point). For example, in the Indonesian language, swapping consonant-graphemes in the word "ba.ra" (embers) creates three new words: "ba.la" (disaster), "pa.ra" (reference to a group), and "pa.la" (nutmeg) without changing the syllabification points, since both graphemes ⟨b⟩ and ⟨p⟩ are in the same category of plosive-bilabial while both ⟨r⟩ and ⟨l⟩ are trill/lateral-dental. An observation on 50k Indonesian words shows that replacing consonant-graphemes in those words impressively increases the number of unigrams by 16.52 times and significantly increases the number of bigrams by 14.12 times. Therefore, in this paper, a procedure of swapping consonant-graphemes based on phonological similarity is proposed to boost the standard bigram-based orthographic syllabification, which commonly has a low performance for a dataset with many out-of-vocabulary (OOV) bigrams. Examinations on the 50k words using the k-fold cross-validation scheme, with k = 5, prove that the proposed procedure significantly boosts the standard bigram-syllabification, giving a relative reduction of mean syllable error rate (SER) of up to 31.39%. It also shows an improvement for the dataset of 15k named-entities, relatively decreasing the average SER by 9.53%. It is better than a flipping-onsets-based model for both datasets. Compared to a nearest neighbour-based model, its performance is a little worse, but it has much lower complexity. Another important finding is that the proposed model can produce a relatively small SER, even for a tiny training set.
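A toy Python sketch of the swapping idea: consonant-graphemes in the same place/manner class are substituted for each other to generate extra words without moving the syllable boundary. The class table below is a tiny illustrative subset, not the paper's full phonological grouping.

```python
from itertools import product

CLASSES = [{"b", "p"}, {"r", "l"}]   # plosive-bilabial and trill/lateral-dental (subset)

def variants(word):
    options = []
    for ch in word:
        cls = next((c for c in CLASSES if ch in c), {ch})   # same-class alternatives, or the char itself
        options.append(sorted(cls))
    return {"".join(combo) for combo in product(*options)}

print(sorted(variants("bara")))      # ['bala', 'bara', 'pala', 'para']
```

Each generated variant keeps the original word's syllabification points, which is what lets the swapped forms enrich the bigram statistics.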

Journal ArticleDOI
TL;DR: A low-complexity design of a digital finite impulse response (FIR) filter for digital hearing aid applications is presented, and the results show that the proposed architecture uses fewer slices than the best existing designs.
Abstract: A hearing aid is an acoustic device worn by people with hearing loss. To compensate for different types of hearing loss, it is necessary to selectively amplify sounds at the required frequencies. The main aim of the hearing aid is to selectively remove the noise signal such that the processed sound matches one's audiogram. To achieve this, the decimation filter in hearing aids can be designed using a multiplierless architecture that is able to adjust sound levels at arbitrary frequencies within a given spectrum. In hearing aids, the decimation filter plays a key role. This paper presents a low-complexity design of a digital finite impulse response (FIR) filter for digital hearing aid applications. It proposes approximate 4:2 compressor adders in a memoryless DA-based FIR filter architecture. In DA architectures, the area of the ROM grows rapidly as the filter order increases; designing the memoryless DA with compressor adders is a solution for decreasing the power consumption and area of the FIR filters for hearing aid applications. The proposed DA-based FIR filter architecture is synthesized in 90 nm technology using the Synopsys application-specific integrated circuit design compiler. The proposed architecture achieves a 45% reduction in area-delay product compared with the systolic architecture and 10% less ADP compared with the OBC DA architecture. The proposed design is also implemented on a field-programmable gate array, and the results show that the proposed architecture uses fewer slices than the best existing designs. The proposed architecture is used in the decimation filter of hearing aid applications using MATLAB Simulink, which removes the unwanted signal.

Journal ArticleDOI
TL;DR: A novel Speech Emotion Recognition (SER) method based on phonological features is proposed, and the most discriminative features are investigated and some patterns of emotional rhyme are found based on the phonological representations.
Abstract: A novel Speech Emotion Recognition (SER) method based on phonological features is proposed in this paper. Intuitively, as expert knowledge derived from linguistics, phonological features are correlated with emotions. However, they are seldom used as features to improve SER. Motivated by this, we set our goal to utilize phonological features to further advance SER accuracy, since they can provide complementary information for the task. Furthermore, we also explore the relationship between phonological features and emotions. Firstly, instead of relying only on acoustic features, we devise a new SER approach by fusing phonological representations and acoustic features together. A significant improvement in SER performance is demonstrated on a publicly available SER database named Interactive Emotional Dyadic Motion Capture (IEMOCAP). Secondly, the experimental results show that the top-performing method for the task of categorical emotion recognition is a deep learning-based classifier which generates an unweighted average recall (UAR) accuracy of 60.02%. Finally, we investigate the most discriminative features and find some patterns of emotional rhyme based on the phonological representations.

Journal ArticleDOI
TL;DR: A practical dynamic approach to finding the polarity of any sentence and analysing the opinion it expresses is presented, with a comparative result of the sentiment analysis.
Abstract: Sentiment analysis is one of the most common applications of Natural Language Processing (NLP). The term itself refers to identifying the emotions and opinions of people through written text. It is concerned with extracting information from any text based on the polarity of the social behaviour expressed, whether positive, negative or neutral. This paper presents a practical dynamic approach to finding the polarity of any sentence and analysing the opinion it expresses. The proposed Sentimental Analysis of Hindi (SAH) script adopts two different classifiers, a Naive Bayes classifier and a decision tree classifier, for the classification of the text. The validation on positive, neutral and negative results shows a comparative result of the sentiment analysis.
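A sketch of the two classifiers on a bag-of-words representation with scikit-learn; the Hindi corpus, tokenization and polarity labels are assumed, not supplied here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

nb = make_pipeline(CountVectorizer(), MultinomialNB())             # Naive Bayes classifier
dt = make_pipeline(CountVectorizer(), DecisionTreeClassifier())    # decision tree classifier

# train_texts, train_labels: Hindi sentences and their polarity (positive/negative/neutral)
# nb.fit(train_texts, train_labels); dt.fit(train_texts, train_labels)
# print(nb.predict(["यह फिल्म बहुत अच्छी है"]))
```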

Journal ArticleDOI
TL;DR: This paper presents a grapheme-to-phoneme conversion system for Arabic, which constitutes the text processing module of a deep neural network (DNN)-based Arabic TTS system, and gives a higher accuracy rate both for all phonemes and for each class, as well as high precision, recall and F1 score for each class of diacritic signs.
Abstract: Arabic text-to-speech synthesis from non-diacritized text is still a big challenge, because of the unique rules and characteristics of the Arabic language. Indeed, the diacritic and gemination signs, which are special characters representing short vowels and consonant doubling respectively, have a major effect on the accurate pronunciation of Arabic. However, these signs are often not written in texts, since most Arab readers are used to guessing them from the context. To tackle this issue, this paper presents a grapheme-to-phoneme conversion system for Arabic, which constitutes the text processing module of a deep neural network (DNN)-based Arabic TTS system. In the case of Arabic text, this step starts with predicting the diacritic and gemination signs; in this work, this step was fully realized using DNNs. Finally, the grapheme-to-phoneme conversion of the diacritized text was achieved using the Buckwalter code. In comparison to state-of-the-art approaches, the proposed system gives a higher accuracy rate both for all phonemes and for each class, and high precision, recall and F1 score for each class of diacritic signs.

Journal ArticleDOI
TL;DR: Simulation results prove that the DCT is the optimum transform with the suggested methods, while the DWT is the best one with the hybrid method and the spectral subtraction method.
Abstract: This paper presents two pre-processing methods that can be implemented for noise reduction in speaker recognition systems. These methods are the adaptive noise canceller (ANC) and the Savitzky-Golay (SG) filter. Also, the discrete cosine transform (DCT), discrete wavelet transform (DWT) and discrete sine transform (DST) are considered for consistent feature extraction from noisy speech signals. A neural network with only one hidden layer is used as the classifier. The performance of the proposed noise reduction methods is compared with that of a hybrid method comprising empirical mode decomposition (EMD) and spectral subtraction, and also with the spectral subtraction method alone. Recognition rate is taken as the performance metric to evaluate the behavior of the system with these enhancement strategies. Simulation results show that the DCT is the optimum transform with the suggested methods, while the DWT is the best one with the hybrid method and the spectral subtraction method.
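A minimal SciPy sketch of one pre-processing/feature path of this kind: Savitzky-Golay smoothing followed by framewise DCT features. The window, polynomial order and frame length are illustrative, not the paper's settings.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.fft import dct

def sg_dct_features(signal, frame_len=256, n_coeffs=13):
    smoothed = savgol_filter(signal, window_length=31, polyorder=3)   # SG noise reduction
    n_frames = len(smoothed) // frame_len
    frames = smoothed[:n_frames * frame_len].reshape(n_frames, frame_len)
    return dct(frames, type=2, norm="ortho", axis=1)[:, :n_coeffs]    # per-frame DCT features

# feats = sg_dct_features(noisy_speech)   # then feed to a single-hidden-layer neural network
```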

Journal ArticleDOI
TL;DR: A speech database is developed that can be utilized for the recognition of Telugu dialects, and two modeling techniques, the Hidden Markov Model (HMM) and the Gaussian mixture model (GMM), are applied in order to recognize the dialects of the Telugu language using speech independent utterances.
Abstract: Telugu is one of the important languages in the world. The variety of a language spoken by most of the people in a region is called a dialect. These days, speech recognition systems are present in almost all electronic devices, and the dialects of a particular language play a vital role in them. An accurate dialect identification technique not only enhances such systems but is also expected to support modern services in health and telemedicine for older and homebound people. Like any other language, Telugu has diversified itself into different dialects, viz. Telangana, Kostha Andhra, and Rayalaseema. The combination of all the dialects is the language Telugu, and it is a perfect blend of the elegance of Sanskrit and the sweetness of Tamil, along with the essence of Kannada. Dialects can form for different reasons. For speech processing research, until today there has been no standard speech database created for Telugu dialects. In this paper we develop a speech database that can be utilized for the recognition of Telugu dialects, and we apply two modeling techniques, the Hidden Markov Model (HMM) and the Gaussian mixture model (GMM), in order to recognize the dialects of the Telugu language using speech independent utterances. We used Mel-Frequency Cepstral Coefficients (MFCC) for extracting the spectral features from the obtained speech data and observed that the GMM provides more accurate results than the HMM.
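An illustrative GMM-based dialect classifier over MFCC frames (librosa + scikit-learn): one GMM is trained per dialect and a test utterance is assigned to the dialect with the highest average log-likelihood. The settings are assumptions, not the paper's.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T          # frames x 13

def train_gmms(files_by_dialect, n_components=16):
    gmms = {}
    for dialect, files in files_by_dialect.items():
        frames = np.vstack([mfcc_frames(f) for f in files])
        gmms[dialect] = GaussianMixture(n_components, covariance_type="diag").fit(frames)
    return gmms

def classify(gmms, path):
    frames = mfcc_frames(path)
    return max(gmms, key=lambda d: gmms[d].score(frames))         # highest avg log-likelihood
```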

Journal ArticleDOI
TL;DR: The main aim is to create a closed-domain question answering framework which will give a precise and considerably short answer to all inquiries related to the city of Hyderabad, instead of giving a lengthy paragraph or document.
Abstract: A question answering (QA) framework is a framework that gives answers to inquiries raised by the user in natural language. The framework retrieves the small portion of the content, from the collection of documents, that contains the appropriate response to the user's inquiry. In order to retrieve such a response from the repository, information retrieval techniques are needed, and for further processing or comprehension of the user's inquiry, presented in natural language, natural language processing techniques are utilized. To make the retrieval procedure more robust, quick and accurate, the idea of knowledge-based classification is also included in this work; for this reason, utmost care was taken in training the framework. Using Jaccard similarity, the closest answer to the user's inquiry is reached. In addition, WordNet is used to retrieve the appropriate response based on both syntactic and semantic similarities. Utilizing these ideas, we have implemented a QA framework on the domain "Hyderabad Tourism", which gives an overall accuracy of 92%. In this work, our main aim is to create a closed-domain question answering framework which will give a precise and considerably short answer to all inquiries related to the city of Hyderabad, instead of giving a lengthy paragraph or document.
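A simple sketch of the Jaccard-similarity matching step described above; the question-answer pairs and any WordNet-based expansion are placeholders.

```python
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def answer(query, qa_pairs):                   # qa_pairs: list of (question, answer) tuples
    best_question, best_answer = max(qa_pairs, key=lambda qa: jaccard(query, qa[0]))
    return best_answer

# print(answer("Which fort should I visit in Hyderabad?", qa_pairs))
```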

Journal ArticleDOI
Jyotismita Chaki1
TL;DR: The aim of this state-of-the-art review is to produce a summary and guidelines for using the widely used methods, and to identify the challenges as well as future research directions of acoustic signal processing.
Abstract: Audio signal processing is one of the most challenging fields in the current era of audio signal analysis. Audio signal classification (ASC) comprises generating appropriate features from a sound and utilizing these features to distinguish the class the sound is most likely to fit. Depending on the application's classification domain, the feature extraction and classification/clustering algorithms used may be quite diverse. The paper provides a survey of the state of the art for understanding ASC's general research scope, including different types of audio; representations of audio such as acoustic and spectrogram representations; audio feature extraction techniques (physical, perceptual, static, dynamic); audio pattern matching approaches (pattern matching, acoustic phonetic, artificial intelligence); and classification and clustering techniques. The aim of this state-of-the-art paper is to produce a summary and guidelines for using the widely used methods, and to identify the challenges as well as future research directions of acoustic signal processing.

Journal ArticleDOI
TL;DR: This work uses Restricted Boltzmann Machines (RBMs) to extract speaker models as matrices and introduces a new way to model target and non-target speakers in order to perform speaker verification, using a CNN to discriminate between target and non-target matrices.
Abstract: We propose a novel usage of convolutional neural networks (CNNs) for the problem of speaker recognition. While being particularly designed for computer vision problems, CNNs have recently been applied to speaker recognition by using spectrograms as input images. We believe that this approach is not optimal, as it may result in two cumulative errors from solving both a computer vision and a speaker recognition problem. In this work, we aim at integrating CNNs into speaker recognition without relying on images. We use Restricted Boltzmann Machines (RBMs) to extract speaker models as matrices and introduce a new way to model target and non-target speakers in order to perform speaker verification. Thus, we use a CNN to discriminate between target and non-target matrices. Experiments were conducted with the THUYG-20 SRE corpus under three noise conditions: clean, 9 dB, and 0 dB. The results demonstrate that our method outperforms the state-of-the-art approaches by decreasing the error rate by up to 60%.
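A sketch of the verification idea: a small Keras CNN that classifies fixed-size speaker matrices (e.g. RBM-derived models) as target versus non-target. The matrix dimensions and layer sizes are placeholders, not the authors' architecture.

```python
import tensorflow as tf

ROWS, COLS = 60, 40                            # assumed speaker-matrix dimensions

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(ROWS, COLS, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # target (1) vs non-target (0)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```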

Journal ArticleDOI
TL;DR: An overview of deep learning methodologies for commonly used NIDS, such as the Auto Encoder, Deep Belief Network (DBN), Deep Neural Network (DNN), and Restricted Boltzmann Machine (RBM), is presented.
Abstract: The Network Intrusion Detection System (NIDS) is a key technology for information security, and it plays a significant role in classifying various attacks in networks accurately. An NIDS gains an understanding of normal and anomalous behavior by examining the network traffic and can identify unknown and new attacks. Analyzing and identifying unfamiliar attacks is one of the big challenges in network IDS research. Deep learning has received a huge response over the past several years, and deep learning techniques are improved regularly. A deep learning based network intrusion detection approach is highly desirable for improved performance. Nowadays, machine learning algorithms have revolutionized the area of human-computer interaction and achieved significant advances in imitating the human brain. The Convolutional Neural Network (CNN) is a powerful deep learning algorithm for improving machine learning ability in order to achieve high attack classification accuracy and a low false alarm rate. This article presents an overview of deep learning methodologies for commonly used NIDS, such as the Auto Encoder (AE), Deep Belief Network (DBN), Deep Neural Network (DNN), and Restricted Boltzmann Machine (RBM). Moreover, the article introduces the most recent work on network anomaly detection using deep learning techniques, through a widespread literature analysis, to support choosing an appropriate method when implementing an NIDS. The experimental results indicate that the accuracy, false alarm rate, and timeliness of the proposed CNN-NIDS model are superior to those of the traditional algorithms.

Journal ArticleDOI
TL;DR: This manuscript reviews the existing computer-aided methods of predictive analytics defined in relation to precision farming, gaining insights into how distinct precision farming inputs support predictive analytics to help farming communities improve.
Abstract: The scope of sensor networks and the Internet of Things is expanding rapidly to diversified domains including, but not limited to, sports, health, and business trading. In the recent past, sensor- and MEMS-integrated Internet of Things has played a crucial role in diversified farming strategies such as dairy farming, animal farming, and agriculture farming. The use of sensors and IoT technologies in farming is referred to in the contemporary literature as smart farming or precision farming. In the early stages of smart farming, the practices applied in agriculture are limited to collecting data related to the context of the farming, such as soil state, weather state, weed state, crop quality, and seed quality. These collections help the farmers and scientists to determine the positive and negative factors affecting the crop and to initiate the required agricultural practices. However, the impact of the practices taken by the agriculturists depends on their experience. In this regard, computer-aided predictive analytics using machine learning and big data strategies has an inevitable role. The emphasis of this manuscript is on reviewing the existing computer-aided methods of predictive analytics defined in relation to precision farming, gaining insights into how distinct precision farming inputs support predictive analytics to help farming communities improve.

Journal ArticleDOI
TL;DR: A comprehensive assessment for the aforementioned transcription schemes by employing them in building a collection of Arabic ASR systems using the GALE (phase 3) Arabic broadcast news and broadcast conversational speech datasets (LDC 2015), which include 260 h of recorded material.
Abstract: It is well-known that the Arabic language poses non-trivial issues for Automatic Speech Recognition (ASR) systems. This paper is concerned with the problems posed by the complex morphology of the language and the absence of diacritics in its written form. Several acoustic and language models are built using different transcription resources, namely a grapheme-based transcription which uses non-diacriticised text materials, phoneme-based transcriptions obtained from automatic diacritisation tools (SAMA or MADAMIRA), and a predefined dictionary. The paper presents a comprehensive assessment of the aforementioned transcription schemes by employing them in building a collection of Arabic ASR systems using the GALE (phase 3) Arabic broadcast news and broadcast conversational speech datasets (LDC 2015), which include 260 h of recorded material. Contrary to our expectations, the experimental evidence confirms that the use of grapheme-based transcription is superior to the use of phoneme-based transcription. To investigate this further, several modifications are applied to the MADAMIRA analysis by applying a number of simple phonological rules. These improvements have a substantial effect on the systems' performance, but it is still inferior to the use of a simple grapheme-based transcription. The research also examined the use of a manually diacriticised subset of the data in training the ASR system and compared it with the use of grapheme-based transcription and phoneme-based transcription obtained from MADAMIRA. The goal of this step is to validate MADAMIRA's analysis. The results show that using the manually diacriticised text in generating the phonetic transcription can significantly decrease the WER compared to the use of MADAMIRA diacriticised text and also the isolated graphemes. The results obtained strongly indicate that providing the training model with less information about the data (only graphemes) is less damaging than providing it with inaccurate information.

Journal ArticleDOI
TL;DR: The objective of this paper is to examine various contrast functions using the FastICA algorithm and to find the best-performing available contrast function for the application of speech signal analysis in noisy environments.
Abstract: Independent component analysis (ICA) is a thriving tool for separating blind sources from their determined or over-determined instantaneous mixture signals. FastICA is one of the successful algorithms in ICA. The objective of this paper is to examine various contrast functions using the FastICA algorithm and to find the best-performing available contrast function for the application of speech signal analysis in noisy environments. The contrast function is a non-linear function used to measure the independence of the estimated sources from the observed mixture signals in the FastICA algorithm. Kurtosis, negentropy and maximum likelihood functions are used as contrast functions in the FastICA algorithm. The FastICA algorithm using these contrast functions is tested on synthetic instantaneous mixtures and real-time recorded mixture signals. We evaluate the performance of the contrast functions based on signal-to-distortion ratio, signal-to-artifact ratio, signal-to-interference ratio and computational complexity. The results show that the maximum likelihood function performs better than the other contrast functions in noisy environments.
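A sketch of comparing FastICA non-linearities on a synthetic instantaneous mixture with scikit-learn. Note that scikit-learn exposes 'logcosh', 'exp' and 'cube' as built-in contrast functions ('cube' corresponds to a kurtosis-style contrast); this is an illustration, not the authors' evaluation code.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 4000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]       # two independent sources
mixed = sources @ rng.normal(size=(2, 2)).T                   # instantaneous mixture

for fun in ("logcosh", "exp", "cube"):
    est = FastICA(n_components=2, fun=fun, random_state=0).fit_transform(mixed)
    print(fun, est.shape)       # estimated sources; compare with SDR/SIR/SAR metrics
```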