
Showing papers in "International Journal of Speech Technology in 2016"


Journal ArticleDOI
TL;DR: In this study, three types of sustained vowels (/a/, /o/ and /u/) were recorded from each participant and the analyses were performed on these voice samples to discriminate between Parkinson's disease patients and healthy people.
Abstract: In this study, we aimed to discriminate between two groups of people. The database used in this study contains 20 patients with Parkinson's disease (PD) and 20 healthy people. Three types of sustained vowels (/a/, /o/ and /u/) were recorded from each participant, and the analyses were then performed on these voice samples. The technique used in this study extracts a voiceprint from each voice sample using mel frequency cepstral coefficients (MFCCs). The extracted MFCCs were compressed by calculating their average value in order to obtain the voiceprint of each voice recording. Subsequently, classification was performed with support vector machines (SVMs) under a leave-one-subject-out (LOSO) validation scheme. We also validated our results with an independent test on another database containing 28 PD patients. The best classification accuracy obtained using LOSO on the first dataset was 82.50 %, achieved with the MLP kernel of the SVM on the sustained vowel /u/. The maximum classification accuracy using the independent test was 100 %, achieved on the sustained vowel /a/ with both the polynomial and MLP kernels of the SVM, and also on the sustained vowel /o/ with the polynomial kernel.
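
The pipeline above (per-recording MFCC averaging, an SVM, and LOSO validation) can be sketched in a few lines. The sketch below assumes librosa and scikit-learn; the file paths, subject IDs and labels are hypothetical placeholders, and scikit-learn's sigmoid kernel stands in for the paper's "MLP kernel".

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def voiceprint(path, n_mfcc=13):
    """Compress one recording into a single vector: the mean of its MFCC frames."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return mfcc.mean(axis=1)

# Hypothetical metadata: one sustained-vowel recording per subject.
paths = [f"data/subject_{i:02d}_u.wav" for i in range(40)]
labels = np.array([1] * 20 + [0] * 20)  # 1 = PD, 0 = healthy
groups = np.arange(40)                  # subject id per recording, for LOSO

X = np.vstack([voiceprint(p) for p in paths])
clf = SVC(kernel="sigmoid")  # stand-in for the paper's "MLP kernel" (assumption)
scores = cross_val_score(clf, X, labels, groups=groups, cv=LeaveOneGroupOut())
print("LOSO accuracy: %.2f %%" % (100 * scores.mean()))
```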

59 citations


Journal ArticleDOI
TL;DR: In this work, Mel frequency cepstral coefficient (MFCC) features are extracted from both training and test speech samples, and a Gaussian mixture model (GMM) is used to classify the speech by accent.
Abstract: Speech processing is an important research area that includes speaker recognition, speech synthesis, speech coding and speech noise reduction. Many languages have different speaking styles called accents or dialects. Identifying the accent before speech recognition can improve the performance of speech recognition systems, and the more accents a language has, the more crucial accent recognition becomes. Telugu is an Indian language widely spoken in the southern part of India. It has several accents, the main ones being coastal Andhra, Telangana and Rayalaseema. In the present work, speech samples were collected from native speakers of the different Telugu accents for both training and testing. Mel frequency cepstral coefficient (MFCC) features are extracted from each training and test sample, and a Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing the region a speaker belongs to, based on accent, is 91 %.
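
A minimal sketch of this MFCC-plus-GMM recipe, assuming scikit-learn: one GMM is fitted per accent on pooled MFCC frames, and a test utterance is assigned to the accent whose model scores it highest. The accent names and feature arrays are placeholders, not the authors' data.

```python
from sklearn.mixture import GaussianMixture

def train_accent_models(train_frames, n_components=16):
    """Fit one GMM per accent on the pooled MFCC frames of its training speech."""
    models = {}
    for accent, frames in train_frames.items():  # frames: (n_frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[accent] = gmm.fit(frames)
    return models

def classify(models, test_frames):
    """Return the accent whose GMM gives the highest mean log-likelihood."""
    return max(models, key=lambda accent: models[accent].score(test_frames))

# Usage with hypothetical pre-extracted MFCC matrices:
# models = train_accent_models({"coastal": X_co, "telangana": X_te, "rayalaseema": X_ra})
# print(classify(models, X_test))
```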

47 citations


Journal ArticleDOI
TL;DR: An automatic syllable-based technique for segmenting continuous speech signals in Indian languages at syllable boundaries is presented, and its effectiveness in segmenting syllable units from the original speech samples, compared to existing techniques, is shown.
Abstract: Speech recognition is the process by which a computer understands human or natural language speech. A syllable-centric speech recognition system identifies the syllable boundaries in the input speech and converts them into the corresponding written scripts or text units. Appropriate segmentation of the acoustic speech signal into syllabic units is an important task in developing a highly accurate speech recognition system. This paper presents an automatic syllable-based segmentation technique for segmenting continuous speech signals in Indian languages at syllable boundaries. To analyze the performance of the proposed technique, a set of experiments is carried out on speech samples in three Indian languages (Hindi, Bengali and Odia), and the results are compared with the existing group-delay-based segmentation technique as well as with manual segmentation. All experiments show that the proposed technique is more effective than the existing techniques in segmenting syllable units from the original speech samples.
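
The abstract does not spell out the segmentation rule itself, so the sketch below shows the classic baseline idea (syllable boundaries at valleys of the smoothed short-term energy contour) rather than the authors' exact method; it assumes only numpy and scipy, and all window sizes are illustrative.

```python
import numpy as np
from scipy.signal import find_peaks

def syllable_boundaries(y, sr, win=0.025, hop=0.010):
    """Boundary times (s) at valleys of the smoothed short-term energy contour."""
    w, h = int(win * sr), int(hop * sr)
    n_frames = 1 + (len(y) - w) // h
    energy = np.array([np.sum(y[i * h : i * h + w] ** 2) for i in range(n_frames)])
    energy = np.convolve(energy, np.ones(5) / 5, mode="same")  # light smoothing
    # Valleys of the energy contour are peaks of its negation; enforce a
    # minimum gap of ~80 ms between boundaries.
    valleys, _ = find_peaks(-energy, distance=max(1, int(0.08 / hop)))
    return valleys * hop
```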

40 citations


Journal ArticleDOI
TL;DR: A combined feature selection technique has been proposed which uses the reduced feature set produced by a vector quantizer (VQ) in a Radial Basis Function Neural Network (RBFNN) environment for classification.
Abstract: Enhancing the naturalness and efficiency of spoken-language man-machine interfaces through emotional speech identification and classification has been a predominant research area. The reliability and accuracy of such emotion identification greatly depend on feature selection and extraction. In this paper, a combined feature selection technique is proposed which uses the reduced feature set produced by a vector quantizer (VQ) in a Radial Basis Function Neural Network (RBFNN) environment for classification. In the initial stage, Linear Prediction Coefficients (LPC) and the time-frequency Hurst parameter (pH) are used to extract the relevant features, the two carrying complementary information from the emotional speech. Extensive simulations have been carried out on the Berlin Database of Emotional Speech (EMO-DB) with various combinations of feature sets. The experimental results show 76 % accuracy for pH and 68 % for LPC as standalone feature sets, whereas combining the feature sets (LP VQC and pH VQC) raises the average accuracy to 90.55 %.
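
A rough sketch of the front end described above, assuming librosa for frame-wise LPC and k-means as the vector quantizer that compresses a variable-length frame sequence into a fixed-size feature vector; the RBFNN classifier itself is not reproduced here, and all parameter values are illustrative.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def lpc_frames(y, sr, order=12, frame=0.025, hop=0.010):
    """Frame-wise LPC coefficients (the constant leading 1 is dropped)."""
    w, h = int(frame * sr), int(hop * sr)
    coeffs = [librosa.lpc(y[i : i + w], order=order)[1:]
              for i in range(0, len(y) - w, h)]
    return np.array(coeffs)

def vq_features(frames, codebook_size=16):
    """Reduce a variable-length frame sequence to a fixed-size VQ feature vector."""
    km = KMeans(n_clusters=codebook_size, n_init=10).fit(frames)
    return km.cluster_centers_.ravel()  # flattened codebook = utterance features
```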

37 citations


Journal ArticleDOI
TL;DR: Results reveal that SFFS is the better choice as a feature subset selection method: SFS suffers from the nesting problem, whereas SFFS does not fix the set at any stage but lets it float up and down during selection based on the objective function.
Abstract: Feature fusion plays an important role in speech emotion recognition; it improves classification accuracy by combining the most popular acoustic features for speech emotion recognition, such as energy, pitch and mel frequency cepstral coefficients. However, system performance is not optimal because of the computational complexity caused by the high-dimensional, correlated feature set that results from feature fusion. In this paper, a two-stage feature selection method is proposed. In the first stage, appropriate features are selected and fused together for speech emotion recognition. In the second stage, optimal feature subset selection techniques [sequential forward selection (SFS) and sequential floating forward selection (SFFS)] are used to eliminate the curse of dimensionality caused by the high-dimensional feature vector after fusion. Finally, the emotions are classified with several classifiers: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), Support Vector Machine (SVM) and K Nearest Neighbor (KNN). The performance of the overall emotion recognition system is validated on the Berlin and Spanish databases in terms of classification rate. An optimal uncorrelated feature set is obtained using SFS and SFFS individually. The results reveal that SFFS is the better choice as a feature subset selection method because SFS suffers from the nesting problem, i.e., it is difficult to discard a feature once it has been retained in the set. SFFS eliminates this nesting problem by not fixing the set at any stage but letting it float up and down during selection based on the objective function. Experimental results show that the efficiency of the classifier improves by 15-20 % with the two-stage feature selection method compared with the classifier using feature fusion alone.
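
Floating selection is not in scikit-learn, so the second stage can be sketched with the mlxtend library; the classifier, the number of features to keep and the fused feature matrix X_fused are placeholders.

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
# forward=True with floating=True gives SFFS; floating=False gives plain SFS.
sffs = SFS(knn, k_features=20, forward=True, floating=True,
           scoring="accuracy", cv=5)
# sffs = sffs.fit(X_fused, y)           # X_fused: features after fusion
# X_reduced = sffs.transform(X_fused)   # optimal uncorrelated subset
```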

33 citations


Journal ArticleDOI
TL;DR: A method that takes into account tags not included in the training data is proposed, and it is shown that this consideration significantly increases the accuracy of the morphosyntactic analysis.
Abstract: The objective of this work is to develop a POS tagger for the Arabic language. This analyzer uses a very rich tag set that gives syntactic information about proclitics attached to words. The study employs a probabilistic model and a morphological analyzer to identify the right tag in context. Most published research on probabilistic analysis uses only a training corpus to search for the probable tags of each word, and this sometimes affects performance. In this paper, we propose a method that takes into account tags that are not included in the training data. These tags are proposed by the Alkhalil_Morpho_Sys analyzer (Bebah et al. 2011). We show that this consideration significantly increases the accuracy of the morphosyntactic analysis. In addition, the adopted tag set is very rich and contains compound tags that allow analysis of the proclitics attached to words.

29 citations


Journal ArticleDOI
TL;DR: Shape and appearance information is extracted from the jaw and lip region to enhance performance in vehicle environments; the resulting system is more robust than an acoustic-only speech recognizer across all driving conditions.
Abstract: Considering visual speech features along with traditional acoustic features has shown decent performance in uncontrolled auditory environments. However, most existing audio-visual speech recognition (AVSR) systems have been developed under laboratory conditions and rarely address visual-domain problems. This paper presents an active appearance model (AAM) based multiple-camera AVSR experiment. Shape and appearance information is extracted from the jaw and lip region to enhance performance in vehicle environments. First, a series of visual speech recognition (VSR) experiments is carried out to study the impact of each camera on multi-stream VSR. A four-camera in-car audio-visual corpus is used to perform the experiments. The individual camera streams are fused into a four-stream synchronous hidden Markov model visual speech recognizer. Finally, the optimal four-stream VSR is combined with a single-stream acoustic HMM to build a five-stream AVSR system. The dual-modality AVSR system is more robust than the acoustic speech recognizer across all driving conditions.

28 citations


Journal ArticleDOI
TL;DR: This work used the ALICE/AIML chatbot architecture as a platform to develop a range of chatbots covering different languages, genres, text-types, and user-groups, to illustrate qualitative aspects of natural language dialogue system evaluation.
Abstract: Human-computer dialogue systems interact with human users using natural language. We used the ALICE/AIML chatbot architecture as a platform to develop a range of chatbots covering different languages, genres, text-types, and user-groups, to illustrate qualitative aspects of natural language dialogue system evaluation. We present some of the different evaluation techniques used in natural language dialogue systems, including black box and glass box, comparative, quantitative, and qualitative evaluation. Four aspects of NLP dialogue system evaluation are often overlooked: "usefulness" in terms of a user's qualitative needs, "localizability" to new genres and languages, "humanness" or "naturalness" compared to human-human dialogues, and "language benefit" compared to alternative interfaces. We illustrate these aspects with respect to our work on machine-learnt chatbot dialogue systems; we believe these aspects are worthwhile in impressing potential new users and customers.

28 citations


Journal ArticleDOI
TL;DR: In this study, three types of sustained vowels (/a/, /o/ and /u/) were recorded from each participant and the analyses were performed on these voice samples to discriminate between Parkinson's disease patients and healthy people.
Abstract: In this study, we aimed to discriminate between two groups of people. The database used in this study contains 20 patients with Parkinson's disease and 20 healthy people. Three types of sustained vowels (/a/, /o/ and /u/) were recorded from each participant, and the analyses were then performed on these voice samples. First, an initial feature vector was extracted from the time, frequency and cepstral domains. We then used linear and nonlinear feature extraction techniques, principal component analysis (PCA) and nonlinear PCA, to reduce the number of parameters and choose the most effective acoustic features for classification. Support vector machines with different kernels were used for classification. We obtained an accuracy of up to 87.50 % in discriminating between PD patients and healthy people.
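
A compact sketch of this reduce-then-classify stage, assuming scikit-learn; the feature matrix X and labels y are placeholders, and kernel PCA stands in as one possible reading of the paper's "nonlinear PCA".

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

linear_pipe = Pipeline([("scale", StandardScaler()),
                        ("reduce", PCA(n_components=10)),
                        ("svm", SVC(kernel="rbf"))])
# One possible nonlinear variant (an assumption, not the paper's exact method):
nonlinear_pipe = Pipeline([("scale", StandardScaler()),
                           ("reduce", KernelPCA(n_components=10, kernel="rbf")),
                           ("svm", SVC(kernel="rbf"))])
# print(cross_val_score(linear_pipe, X, y, cv=5).mean())  # X, y: features/labels
```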

26 citations


Journal ArticleDOI
TL;DR: A new classifier is developed by combining a deep belief network (DBN) and fractional calculus, trained with multiple features such as tonal power ratio, spectral flux, pitch chroma and Mel frequency cepstral coefficients (MFCC) to make the emotional classes more separable through their spectral characteristics.
Abstract: With the essential demand for understanding human emotional behavior and for human-machine interaction in recent electronic applications, speaker emotion recognition is a key component that has attracted a great deal of attention among researchers. Even though a handful of works on speaker emotion classification are available in the literature, important challenges such as distinct emotions, low-quality recordings and independent affective states still need to be addressed with a good classifier and discriminative features. Accordingly, a new classifier, called the fractional deep belief network (FDBN), is developed by combining a deep belief network (DBN) with fractional calculus. This new classifier is trained with multiple features (tonal power ratio, spectral flux, pitch chroma and Mel frequency cepstral coefficients (MFCC)) to make the emotional classes more separable through their spectral characteristics. The proposed FDBN classifier with integrated feature vectors is tested on two databases: the Berlin database of emotional speech and a real-time Telugu database. The performance of the proposed FDBN and the existing DBN classifiers is validated using False Acceptance Rate (FAR), False Rejection Rate (FRR) and accuracy. The experimental results show that the proposed FDBN achieves accuracies of 98.39 % and 95.88 % on the Berlin and Telugu databases, respectively.

24 citations


Journal ArticleDOI
TL;DR: A new steganalysis method that uses a deep belief network (DBN) as a classifier for audio files gives higher classification rates in most cases than two other steganalysis methods based on SVMs and GMMs.
Abstract: This paper presents a new steganalysis method that uses a deep belief network (DBN) as a classifier for audio files. It has been tested on three steganographic techniques: StegHide, Hide4PGP and FreqSteg. The results were compared with two other existing robust steganalysis methods based on support vector machines (SVMs) and Gaussian mixture models (GMMs). Afterwards, another classification task was carried out, aiming to identify which type of steganography, if any, was applied to the speech signal. The results of this four-way classification show that in most cases the proposed DBN-based steganalysis method gives higher classification rates than the two other steganalysis methods based on SVMs and GMMs.

Journal ArticleDOI
TL;DR: To build a new corpus of the Quran, this work used a semi-automatic technique consisting of the morphosyntactic analyzer of standard Arabic words "AlKhalil Morpho Sys" followed by manual treatment, producing a new Quranic corpus rich in morphosyntactic information.
Abstract: There is not a wide range of annotated Arabic corpora available, which led us to contribute to the enrichment of Arabic corpus resources. In this regard, we decided to start working with correct and carefully selected texts; beginning with the Quranic Arabic text is the best starting point for such an effort. Furthermore, annotated linguistic resources such as a Quranic corpus are important for researchers working in all fields of Arabic natural language processing. To the best of our knowledge, the only available Quranic Arabic corpora are from the University of Leeds, the University of Jordan and the University of Haifa. Unfortunately, these corpora have several problems and do not contain enough grammatical and syntactic information. To build a new corpus of the Quran, this work used a semi-automatic technique, which consists of applying the morphosyntactic analyzer of standard Arabic words "AlKhalil Morpho Sys" followed by a manual treatment. As a result of this work, we have built a new Quranic corpus rich in morphosyntactic information.

Journal ArticleDOI
TL;DR: Experimental results for the identification of 36 bird species from Lake Tonga demonstrate that the proposed TRD–GTECC feature is highly effective and performs satisfactorily compared to the popular front-ends considered in this study.
Abstract: The key to studying birds in their natural habitat is continuous survey using wireless sensor networks (WSN). The final objective of this study is to design a system for monitoring threatened bird species using audio sensor nodes, the principal feature for their recognition being their sound. The main limitations encountered in this process are environmental noise and energy consumption in the sensor nodes. Over the years, a variety of birdsong classification methods have been introduced, but very few have focused on finding one adequate for WSNs. In this paper, a tonal region detector (TRD) using a sigmoid function is proposed. This approach to noise power estimation offers flexibility, since the slope and the mean of the sigmoid function can be adapted autonomously for a better trade-off between noise overestimation and underestimation. Once the tonal regions in the noisy bird sound are detected, gammatone Teager energy cepstral coefficient (GTECC) features, post-processed by quantile-based cepstral normalization, are extracted from these regions for classification with a deep neural network. Experimental results on the identification of 36 bird species from Lake Tonga (northeast Algeria) demonstrate that the proposed TRD–GTECC feature is highly effective and performs satisfactorily compared to the popular front-ends considered in this study. Moreover, recognition performance, noise immunity and energy consumption improve considerably after tonal region detection, indicating that the approach is very suitable for acoustic bird recognition in complex environments with wireless sensor nodes.

Journal ArticleDOI
TL;DR: A novel hybrid recognition algorithm combining learning vector quantization (LVQ) and a hidden Markov model (HMM) achieves an 89 % Arabic phoneme recognition rate.
Abstract: In an attempt to increase the Arabic phoneme recognition rate, we introduce a novel hybrid recognition algorithm composed of learning vector quantization (LVQ) and a hidden Markov model (HMM). The hybrid algorithm is used to recognize Arabic phonemes in continuous open-vocabulary speech. A corpus of modern standard Arabic recorded from different TV news broadcasts was used for training and testing. We employ a data-driven approach to generate training feature vectors that embed frame-neighboring correlation information. Next, we generate the phoneme codebooks using the K-means splitting algorithm and train the generated codebooks with the LVQ algorithm. We achieved a performance of 98.49 % during independent classification training and 90 % during dependent classification training. When using the trained LVQ codebooks in Arabic utterance transcription, the phoneme recognition rate was 72 % with LVQ alone. We then combined the LVQ codebooks with a single-state HMM using an enhanced Viterbi algorithm that incorporates phoneme bigrams. The hybrid LVQ/HMM algorithm achieved an 89 % Arabic phoneme recognition rate.
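
LVQ is not in scikit-learn, so the codebook-refinement step can be illustrated with a minimal LVQ1 training loop written from its textbook update rule: the winning codeword is pulled toward inputs of its own class and pushed away from others. The k-means initialization and all hyperparameters are placeholders.

```python
import numpy as np

def lvq1(codebook, code_labels, X, y, lr=0.05, epochs=20):
    """Refine a k-means codebook with the LVQ1 rule.

    codebook:    (n_codes, dim) initial codewords (e.g. k-means centroids)
    code_labels: (n_codes,) phoneme label assigned to each codeword
    X, y:        training feature vectors and their phoneme labels
    """
    cb = codebook.copy()
    for _ in range(epochs):
        for x, label in zip(X, y):
            w = np.argmin(np.linalg.norm(cb - x, axis=1))    # winning codeword
            sign = 1.0 if code_labels[w] == label else -1.0  # attract or repel
            cb[w] += sign * lr * (x - cb[w])
        lr *= 0.9  # decay the learning rate each epoch
    return cb
```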

Journal ArticleDOI
TL;DR: This paper discusses methods to capture dialect-specific knowledge through vocal tract and prosody information extracted from speech, which can be utilized for automatic identification of dialects.
Abstract: A primary challenge in the field of automatic speech recognition is to understand and create acoustic models that represent individual differences in spoken language. An individual's age and gender, and a speaking style influenced by dialect, may be a few of the reasons for these differences. This work investigates dialectal differences through analysis of variance of acoustic features such as formant frequencies, pitch, pitch slope, duration and intensity for vowel sounds. The paper discusses methods to capture dialect-specific knowledge through vocal tract and prosody information extracted from speech, which can be utilized for automatic identification of dialects. A kernel-based support vector machine is used to measure the dialect-discriminating ability of the acoustic features. For spectral features, shifted delta cepstral coefficients together with Mel frequency cepstral coefficients give a recognition performance of 66.97 %. The combination of prosodic features performs better, with a classification score of 74 %. The model is further evaluated on the combination of spectral and prosodic feature sets and achieves a classification accuracy of 88.77 %. The proposed model is compared with human perception of dialects. The overall work is based on four dialects of Hindi, one of the world's major languages.
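
The variance-analysis step can be sketched with scipy's one-way ANOVA; the per-dialect F1 arrays below are toy data standing in for real formant measurements.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Toy first-formant (F1, Hz) measurements of one vowel for four hypothetical dialects.
dialect_f1 = [rng.normal(mu, 40, size=50) for mu in (700, 720, 760, 780)]

stat, p = f_oneway(*dialect_f1)        # one-way ANOVA across dialect groups
print(f"F = {stat:.2f}, p = {p:.3g}")  # a small p suggests F1 discriminates dialects
```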

Journal ArticleDOI
TL;DR: A corpus of Arabic text was indexed using Arabic WordNet, with word disambiguation performed by the Lesk algorithm, allowing the contribution of this approach to IRS for Arabic texts to be assessed.
Abstract: In the context of information retrieval systems (IRS) and the use of ontologies for indexing documents and queries, we propose and evaluate in this paper the contribution of this approach applied to Arabic texts. To do this, we indexed a corpus of Arabic text using Arabic WordNet, performing word disambiguation with the Lesk algorithm. The results of our experiment allowed us to assess the contribution of this approach to IRS for Arabic texts.
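
The Lesk step has a stock implementation in NLTK, shown below on the English WordNet for illustration (the paper applies the same gloss-overlap idea to Arabic WordNet); it requires the nltk "wordnet" and "punkt" data packages.

```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk  # NLTK's Lesk implementation

context = word_tokenize("I deposited the money at the bank")
sense = lesk(context, "bank")  # picks the synset whose gloss overlaps the context most
print(sense, "-", sense.definition())
```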

Journal ArticleDOI
TL;DR: A new adaptive combination of APSSAFs based on a convex combination scheme is proposed for modeling acoustic paths in impulsive noise environments, and it outperforms existing algorithms when applied to modeling sparse and dispersive systems.
Abstract: The affine projection sign subband adaptive filter (APSSAF) algorithm has attracted much attention because of its fast convergence rate and robustness against impulsive interference. However, a drawback of this algorithm is that its step size entails a compromise between convergence speed and steady-state error. To solve this problem, a new adaptive combination of APSSAFs based on a convex combination scheme is proposed for modeling acoustic paths in impulsive noise environments. Moreover, a weight transfer approach is applied to further improve performance. Simulation results demonstrate that the proposed algorithm outperforms existing algorithms when applied to modeling sparse and dispersive systems.
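
The convex combination scheme itself is standard and can be sketched in numpy: two component filters (here simple sign-LMS stand-ins with fast and slow step sizes, not full APSSAFs) are mixed by lambda = sigmoid(a), and a is adapted by a gradient step on the combined error. All step sizes are illustrative.

```python
import numpy as np

def convex_combination(x, d, L=16, mu_fast=0.05, mu_slow=0.005, mu_a=1.0):
    """Adaptively mix a fast and a slow component filter on input x, desired d."""
    w1, w2 = np.zeros(L), np.zeros(L)
    a = 0.0
    for n in range(L, len(x)):
        u = x[n - L:n][::-1]                      # regressor vector
        y1, y2 = w1 @ u, w2 @ u
        lam = 1.0 / (1.0 + np.exp(-a))            # mixing weight in (0, 1)
        e = d[n] - (lam * y1 + (1.0 - lam) * y2)  # combined error
        # Sign-error updates keep the components robust to impulsive noise.
        w1 += mu_fast * np.sign(d[n] - y1) * u
        w2 += mu_slow * np.sign(d[n] - y2) * u
        # Gradient step on a to reduce the combined squared error; clip so the
        # sigmoid never saturates completely.
        a = np.clip(a + mu_a * e * (y1 - y2) * lam * (1.0 - lam), -4.0, 4.0)
    return w1, w2
```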

Journal ArticleDOI
TL;DR: The proposed method uses distributional semantics to build a word-context matrix representing the distribution of words across contexts and to transform the text into a vector space model (VSM) representation based on word semantic similarity.
Abstract: An efficient method is introduced to represent large Arabic texts in comparatively smaller size without losing significant information. The proposed method uses distributional semantics to build a word-context matrix representing the distribution of words across contexts and to transform the text into a vector space model (VSM) representation based on word semantic similarity. The linguistic features of the Arabic language, in addition to semantic information extracted from different lexical-semantic resources such as Arabic WordNet and named-entity gazetteers, are used to improve the text representation and to create clusters of similar and related words. Distributional similarity measures are used to capture the semantic similarity of words and to create clusters of similar words. The conducted experiments show that the proposed method reduces the size of the text representation by about 27 % compared with the stem-based VSM and by about 50 % compared with the traditional bag-of-words model. The results also show that the amount of dimension reduction depends on the size and shape of the analysis windows as well as on the content of the text.

Journal ArticleDOI
TL;DR: An automatic synonym extraction model is developed to construct a Quranic Arabic WordNet (QAWN) by computing cosine similarities between Quranic words based on textual definitions extracted from traditional Arabic dictionaries.
Abstract: In this paper, we developed an automatic synonym extraction model, which is used to construct our Quranic Arabic WordNet (QAWN) from traditional Arabic dictionaries. In this work, we rely on three resources. First, the Boundary-Annotated Quran Corpus, which contains Quran words, part-of-speech tags, roots and other related information. Second, lexicon resources, which were used to collect a set of derived words for Quranic words. Third, traditional Arabic dictionaries, which were used to extract the meanings of words while distinguishing between different senses. The objective of this work is to link Quranic words of similar meaning in order to generate synonym sets (synsets). To accomplish that, we used term frequency and inverse document frequency in a vector space model, and we then computed cosine similarities between Quranic words based on textual definitions extracted from the traditional Arabic dictionaries. Words with the highest similarity were grouped together to form a synset. Our QAWN consists of 6918 synsets constructed from about 8400 unique word senses, an average of 5 senses per word. In our experimental evaluation, the average recall of the baseline system was 7.01 %, whereas the average recall of the QAWN was 34.13 %, improving the recall of semantic search for Quran concepts by 27 %.
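
The grouping core of this pipeline (TF-IDF vectors of dictionary definitions, pairwise cosine similarity, a threshold to propose synset members) can be sketched with scikit-learn; the words, definitions and threshold below are placeholders, not the paper's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

words = ["word_a", "word_b", "word_c"]            # placeholder lemmas
definitions = ["definition text of word_a",       # dictionary glosses
               "definition text of word_b",
               "definition text of word_c"]

tfidf = TfidfVectorizer().fit_transform(definitions)
sim = cosine_similarity(tfidf)            # pairwise definition similarity
THRESHOLD = 0.5                           # assumed cut-off, not from the paper
for i, j in zip(*np.where(np.triu(sim, k=1) > THRESHOLD)):
    print(words[i], "~", words[j])        # candidate members of one synset
```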

Journal ArticleDOI
TL;DR: The design and implementation of a computational model for Arabic natural language semantics is described: a semantic parser that captures the deep semantic representation of Arabic text, together with a rule-based algorithm to generate an equivalent Arabic FrameNet.
Abstract: This paper describes the design and implementation of a computational model for Arabic natural language semantics: a semantic parser for capturing the deep semantic representation of Arabic text. The parser is a major part of an Interlingua-based machine translation system for translating Arabic text into sign language. It follows a frame-based analysis to capture the overall meaning of Arabic text in a formal representation suitable for NLP applications that need deep semantic representation, such as language generation and machine translation. We show the representational power of this theory for the semantic analysis of texts in Arabic, a language that differs substantially from English in several ways. We also show that integrating WordNet and FrameNet into a single unified knowledge resource can improve disambiguation accuracy. Furthermore, we propose a rule-based algorithm to generate an equivalent Arabic FrameNet, using a lexical resource alignment of FrameNet 1.3 lexical units and WordNet 3.0 synsets for the English language. A pilot study of motion and location verbs was carried out to test our system. Our corpus consists of more than 2000 Arabic sentences in the domain of motion events, collected from Algerian first-level educational Arabic books and other relevant Arabic corpora.

Journal ArticleDOI
TL;DR: It has been observed that the FrFT-based MFCC, with timbral features and SVM, efficiently classifies the two western genres of rock and classical music from the GTZAN dataset, with fewer features and a higher classification accuracy.
Abstract: This paper presents automatic genre classification of Indian Tamil music and western music using timbral features and fractional Fourier transform (FrFT) based Mel frequency cepstral coefficient (MFCC) features. The classifier models for the proposed system have been built using K-nearest neighbours and support vector machine (SVM) classifiers. In this work, the performance of various features extracted from music excerpts has been analyzed to identify the appropriate feature descriptors for the two major genres of Indian Tamil music, namely classical music (Carnatic-based devotional hymn compositions) and folk music. The results show that the combination of spectral roll-off, spectral flux, spectral skewness and spectral kurtosis, together with fractional MFCC features, outperforms all other feature combinations, yielding a classification accuracy of 96.05 %, compared to 84.21 % with conventional MFCC. It has also been observed that the FrFT-based MFCC, with timbral features and SVM, efficiently classifies the two western genres of rock and classical music from the GTZAN dataset, with fewer features and a higher classification accuracy of 96.25 %, compared to 80 % with conventional MFCC.
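
The conventional side of this feature set can be sketched with librosa and scipy (librosa has no fractional Fourier transform, so the FrFT-based variant is not reproduced); the flux, skewness and kurtosis formulas below are common textbook definitions, an assumption rather than the paper's exact ones.

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis

def genre_features(path):
    """Roll-off, flux, spectral skewness/kurtosis plus mean MFCCs for one excerpt."""
    y, sr = librosa.load(path, sr=None)
    S = np.abs(librosa.stft(y))                   # magnitude spectrogram
    rolloff = librosa.feature.spectral_rolloff(S=S, sr=sr).mean()
    flux = np.mean(np.sqrt(np.sum(np.diff(S, axis=1) ** 2, axis=0)))
    sk = np.mean(skew(S, axis=0))                 # spectral skewness
    ku = np.mean(kurtosis(S, axis=0))             # spectral kurtosis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    return np.hstack([rolloff, flux, sk, ku, mfcc])

# Stack genre_features(...) per excerpt and feed to sklearn.svm.SVC, as in the paper.
```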

Journal ArticleDOI
TL;DR: This paper investigates the feed forward back propagation neural network (FFBPNN) and the support vector machine (SVM) for the classification of two Maghrebian dialects: Tunisian and Moroccan.
Abstract: This paper investigates the feed-forward back-propagation neural network (FFBPNN) and the support vector machine (SVM) for the classification of two Maghrebian dialects: Tunisian and Moroccan. The dialect spoken by Moroccans is called "La Darijja" and that of Tunisians is called "Darija". An automatic speech recognition system is implemented to identify the ten Arabic digits (zero to nine). The implementation of our system consists of two phases: feature extraction using a variety of popular hybrid techniques, and classification using the FFBPNN and the SVM separately. The experimental results show that the recognition rates reached 98.3 % with the FFBPNN and 97.5 % with the SVM.

Journal ArticleDOI
TL;DR: This paper presents a method for building an Arabic parser based on an induced PCFG grammar and shows the efficiency of the proposed parser for parsing modern standard Arabic sentences.
Abstract: The importance of the parsing task for NLP applications is well understood. However, developing parsers remains difficult because of the complexity of the Arabic language. Most parsers are based on syntactic grammars that describe the syntactic structures of a language, and the development of these grammars is laborious and time-consuming. In this paper, we present our method for building an Arabic parser based on an induced PCFG grammar. We first induce the PCFG grammar from an Arabic treebank, then implement a parser that assigns a syntactic structure to each input sentence. The parser is tested on 1650 sentences extracted from the treebank, and we calculate precision, recall and F-measure. Our experimental results show the efficiency of the proposed parser for parsing modern standard Arabic sentences (precision: 83.59 %, recall: 82.98 %, F-measure: 83.23 %).
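
The induce-then-parse pipeline has a direct analogue in NLTK, sketched below on NLTK's bundled English treebank sample as a stand-in for the Arabic Treebank (requires the "treebank" data package).

```python
import nltk
from nltk import Nonterminal, induce_pcfg
from nltk.parse import ViterbiParser

# Collect productions from parsed treebank sentences (English sample here;
# the paper induces its grammar from an Arabic treebank instead).
productions = []
for tree in nltk.corpus.treebank.parsed_sents()[:200]:
    tree.collapse_unary(collapsePOS=False)  # normalize the trees
    tree.chomsky_normal_form()
    productions += tree.productions()

grammar = induce_pcfg(Nonterminal("S"), productions)  # PCFG with MLE probabilities
parser = ViterbiParser(grammar)                       # most-probable-parse decoder
# for parse in parser.parse("the market crashed".split()):
#     print(parse)  # tokens must be covered by the induced lexicon
```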

Journal ArticleDOI
TL;DR: A promising approach for integrity verification of recorded audio signals using the discrete cosine transform, based on a self-embedding concept: the audio signal is transformed into a 2-D format and block-based marks extracted from the signal are embedded into other blocks according to a specific algorithm.
Abstract: Audio recordings have been used as evidence for a long time, but advances in multimedia processing make it difficult to be completely sure that what is heard is the truth. This paper presents a promising approach for integrity verification of recorded audio signals using the discrete cosine transform. The approach is based on a self-embedding concept, which embeds block-based marks extracted from the audio signal into other blocks according to a specific algorithm. The 1-D audio signal is converted into a 2-D format using the popular lexicographic ordering scheme utilized in image processing; after the self-embedding process, the data is converted back into 1-D form, which represents the marked audio signal. During integrity verification, the reverse processes are executed to extract the verification marks from the audio signal, and the integrity of the marked audio signal is evaluated from the extracted properties. Different audio processing tasks and attacks are implemented to examine the suitability of the proposed algorithm for verifying the integrity of high-confidentiality recorded audio data. The results show the ability of the proposed approach to verify integrity and detect attacks efficiently.

Journal ArticleDOI
TL;DR: An efficient Arabic TTS system based on a statistical parametric approach and non-uniform unit speech synthesis is presented, together with a new simple stacked neural network approach to improve the accuracy of the acoustic models.
Abstract: The text-to-speech system (TTS), also known as a speech synthesizer, has become one of the important technologies of recent years due to its expanding field of applications. Much work on speech synthesis has been done for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. Arabic speech synthesis has not progressed sufficiently and is still at an early stage, with low speech quality. In fact, speech synthesis systems face several problems (e.g. speech quality, articulatory effects, etc.). Different methods have been proposed to address these issues, such as the use of large and varied unit sizes. This method is mainly implemented with the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on a statistical parametric approach and non-uniform unit speech synthesis. Our system includes a diacritization engine: modern Arabic text is written without the vowels, also called diacritic marks, yet these marks are essential to determine the correct pronunciation of the text, which explains the incorporation of the diacritization engine into our system. In this work, we propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.

Journal ArticleDOI
TL;DR: Evaluating six existing corpus search and analysis tools against eight criteria revealed that three tools, Khawas, Sketch Engine, and aConCorde, met most of the evaluation criteria and achieved the highest benchmark scores.
Abstract: As the number of Arabic corpora is constantly increasing, there is an obvious and growing need for concordancing software for corpus search and analysis that supports as many features of the Arabic language as possible and provides users with a greater number of functions. This paper evaluates six existing corpus search and analysis tools against eight criteria that seem most essential for searching and analysing Arabic corpora, such as displaying Arabic text in its right-to-left direction, normalising diacritics and Hamza, and providing an Arabic user interface. The evaluation revealed that three tools (Khawas, Sketch Engine and aConCorde) met most of the evaluation criteria and achieved the highest benchmark scores. The paper concludes that the developers' conscious consideration of the linguistic features of Arabic when designing these three tools was the most significant factor behind their superiority.

Journal ArticleDOI
TL;DR: This study is part of a broader project that includes the development of software and hardware systems to monitor the bird species that appear in different geographical locations, helping ornithologists monitor environmental conditions with respect to specific bird species.
Abstract: In this paper we focus on automatic bird classification based on sound patterns, which is useful in ornithology for studying bird species and their behavior. The proposed methodology may be used to conduct bird surveys, applying different audio processing and machine learning techniques to classify birds automatically on the basis of their chirping patterns. An effort has been made in this work to map characteristics of birds, such as size, habitat, species and types of call, onto their sounds. This study is also part of a broader project that includes the development of software and hardware systems to monitor the bird species that appear in different geographical locations, which helps ornithologists monitor environmental conditions with respect to specific bird species.

Journal ArticleDOI
TL;DR: A Standard Yorùbá speech-to-text system capable of recognizing isolated words spoken by users, based on previously stored data, was designed and implemented; carefully selected words were recorded, analyzed and annotated using the Praat software.
Abstract: In this paper, a Standard Yorùbá speech-to-text system capable of recognizing isolated words spoken by users, based on previously stored data, was designed and implemented. The system adopts a syllable-based approach: carefully selected words were recorded, analyzed and annotated using the Praat software. An experimental database of six native speakers was collected, each speaking 25 bi-syllabic and 25 tri-syllabic words, in an acoustically controlled room. The meaningful spectral coefficients were extracted using the Mel-frequency cepstral coefficient technique, and the Hidden Markov Model Toolkit was used to implement the system. A graphical user interface was also developed to make the system accessible and more interactive. Furthermore, the system was tested and evaluated based on the perception of native speakers of the language. The overall accuracy for bi-syllabic and tri-syllabic words was 76 % and 84 %, respectively. These results show that the approach is promising and could be adopted for a Standard Yorùbá continuous speech recognition system, which would also make the system usable for foreign speakers.

Journal ArticleDOI
TL;DR: A new approach for Arabic word sense disambiguation is introduced that uses Wikipedia as a lexical resource, with a Vector Space Model representation and cosine similarity between the word context and the senses retrieved from Wikipedia as the measure.
Abstract: In this research we introduce a new approach for Arabic word sense disambiguation that uses Wikipedia as the lexical resource for disambiguation. The nearest sense for an ambiguous word is selected using a Vector Space Model as the representation and the cosine similarity between the word context and the senses retrieved from Wikipedia as the measure. Three experiments were conducted to evaluate the proposed approach: two use the first sentence retrieved from Wikipedia for each sense but with different Vector Space Model representations, while the third uses the first paragraph of the retrieved sense. The experiments show that using the first paragraph is better than the first sentence, and that TF-IDF is better than absolute frequency in the VSM. The proposed approach was also tested on English words, where it gives better results using the first sentence retrieved from Wikipedia for each sense.

Journal ArticleDOI
TL;DR: The experimental results show that the normalization algorithm is effective on the locally collected database, as well as on the eNTERFACE’05 Audio-Visual Emotion Database.
Abstract: In this paper we propose a feature normalization method for speaker-independent speech emotion recognition. The performance of a speech emotion classifier largely depends on the training data, and a large number of unknown speakers may pose a great challenge. To address this problem, we first extract and analyse 481 basic acoustic features. Second, we use principal component analysis and linear discriminant analysis jointly to construct a speaker-sensitive feature space. Third, we classify the emotional utterances into pseudo-speaker groups in the speaker-sensitive feature space using fuzzy k-means clustering. Finally, we normalize the original basic acoustic features of each utterance based on its group information. To verify the normalization algorithm, we adopt a Gaussian mixture model based classifier for the recognition test. The experimental results show that our normalization algorithm is effective on our locally collected database as well as on the eNTERFACE'05 Audio-Visual Emotion Database. The emotional features obtained using our method are robust to speaker change, and an improved recognition rate is observed.