Bio: A.H. Khalil is an academic researcher. The author has contributed to research in topics: Speaker recognition & Word recognition. The author has an h-index of 2, having co-authored 2 publications receiving 26 citations.
01 Jan 2003
TL;DR: A pattern matching algorithm based on HMM is implemented on a Field Programmable Gate Array (FPGA) for isolated Arabic word recognition, achieving recognition accuracy comparable to that of the classical recognition system.
Abstract: In this work we propose a speech recognition system for Arabic speech based on a hardware/software co-design implementation approach. Speech recognition is a computationally demanding task, especially the pattern matching stage. The Hidden Markov Model (HMM) is considered the most powerful modeling and matching technique across the different speech recognition tasks. Implementing the pattern matching algorithm, which is time consuming, in dedicated hardware speeds up the recognition process. In this paper, a pattern matching algorithm based on HMM is implemented using a Field Programmable Gate Array (FPGA). The forward algorithm, the core of HMM-based matching, is analyzed and modified to be more suitable for FPGA implementation. Implementation results showed that the recognition accuracy of the modified algorithm is very close to that of the classical algorithm, while achieving higher speed and occupying less area in the FPGA. The proposed approach is used for isolated Arabic word recognition and achieved recognition accuracy comparable to that of the classical recognition system.
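The abstract's FPGA-friendly modification is not detailed, but the classical forward algorithm it starts from can be sketched as follows (a minimal discrete-observation version; variable names are illustrative):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Classical HMM forward algorithm: returns P(obs | model).
    pi: (N,) initial state probabilities; A: (N, N) transition matrix;
    B: (N, M) emission probabilities; obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: sum over predecessors
    return alpha.sum()                 # termination
```

In practice, FPGA adaptations of this recursion typically target the multiply-accumulate inner loop and the dynamic range of `alpha` (e.g., via scaling or log-domain arithmetic), which is where the paper's modification would apply.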
05 Sep 2004
TL;DR: The nonlinear activation function is adapted to be more suitable for FPGA implementation, and an identification rate of almost 100% is reached using a Multi-Layer Perceptron Neural Network (MLP NN).
Abstract: Speaker identification is a challenging pattern classification task. It is used extensively in many applications such as security systems, information retrieval services, etc. Portable identification systems are expected to be widely used in the future for many purposes, such as mobile applications. Implementing the identification technique in dedicated hardware could be very useful for building smart units. In this context, the FPGA offers an efficient technology for realizing a pattern classification strategy. A speaker identification system can be implemented using many classification approaches; one of these is the artificial neural network (ANN), which is considered one of the most powerful classification techniques. Implementing a neural network on an FPGA is a challenging task because of the complexity of the required arithmetic operations. In this paper, the nonlinear activation function is adapted to be more suitable for FPGA implementation. We have reached an identification rate of almost 100% using a Multi-Layer Perceptron Neural Network (MLP NN).
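The abstract does not specify how the activation function was adapted; a common FPGA-friendly substitution (offered here only as an assumed illustration, not the paper's method) is to replace the exp-based sigmoid with a clipped piecewise-linear approximation that avoids exponentiation and division:

```python
import numpy as np

def pwl_sigmoid(x):
    """Piecewise-linear sigmoid approximation: a clipped line through
    (0, 0.5) with slope 0.25 (the true sigmoid's slope at the origin).
    Hypothetical example of an FPGA-friendly activation; the paper's
    exact adaptation is not given in the abstract."""
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)
```

Such approximations reduce each activation to one multiply, one add, and two comparisons, which map directly onto FPGA arithmetic resources.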
TL;DR: A speech recognition system that allows arm-disabled students to control computers by voice as a helping tool in the educational process, achieving higher recognition rates than other relevant approaches.
Abstract: Over the previous decades, a need has emerged to empower human-machine communication systems, which are essential not only to perform actions but also to obtain information, especially in education applications. Moreover, any communication system has to introduce an efficient and easy way of interaction with the minimum possible error rate. The keyboard, mouse, trackball, touch-screen, and joystick are all examples of tools built to provide mechanical human-to-machine interaction. However, a system able to use oral speech, the natural form of communication between humans, instead of mechanical communication can be more practical for normal students and even a necessity for arm-disabled students who cannot use their arms to handle traditional education tools like pens and notebooks. In this paper, we present a speech recognition system that allows arm-disabled students to control computers by voice as a helping tool in the educational process. When a student speaks through a microphone, the speech is divided into isolated words which are compared with a predefined database of a huge number of spoken words to find a match. After that, each recognized word is translated into its related tasks, which are performed by the computer, like opening a teaching application or renaming a file. The speech recognition process discussed in this paper involves two separate approaches: the first is based on double-threshold voice activity detection and improved Mel-frequency cepstral coefficients (MFCC), while the second is based on the discrete wavelet transform along with a modified MFCC algorithm. Utilizing the best values for all parameters in the aforementioned techniques, our proposed system achieved a recognition rate of 98.7% using the first approach and 98.86% using the second, which is more accurate than the first but slower in processing, a critical point for a real-time system.
Both proposed approaches were compared with other relevant approaches and their recognition rates were noticeably higher.
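The double-threshold voice activity detection named in the first approach can be sketched as a frame-energy pass with a high threshold to open a speech segment and a low threshold to close it (a simplified energy-only sketch; the paper's version and tuned thresholds are not given in the abstract, and classic double-threshold VAD also consults the zero-crossing rate):

```python
import numpy as np

def double_threshold_vad(signal, frame_len=256, high=0.1, low=0.02):
    """Energy-based double-threshold VAD sketch: a frame whose mean
    energy exceeds `high` starts a speech segment; the segment extends
    while energy stays above `low`. Thresholds are illustrative."""
    n = len(signal) // frame_len
    energy = np.array([np.mean(signal[i*frame_len:(i+1)*frame_len]**2)
                       for i in range(n)])
    segments, start = [], None
    for i, e in enumerate(energy):
        if start is None and e > high:
            start = i                      # speech onset
        elif start is not None and e < low:
            segments.append((start, i))    # speech offset
            start = None
    if start is not None:
        segments.append((start, n))
    return segments  # list of (start_frame, end_frame) speech spans
```

The two thresholds give hysteresis: brief energy dips inside a word (between `low` and `high`) do not split it into separate segments, which is what makes the detected spans usable as isolated words.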
TL;DR: The aim is not only to provide the architecture of a speaker identification system but also to reduce the redundant frames at the pre-processing stage, lowering the identification time and computational burden which are vital for real-time implementation.
Abstract: Speaker recognition refers to the task of recognizing persons from their spoken speech. It belongs to the field of biometric person authentication, which also includes authentication by fingerprints, face, and iris. Implementing the identification technique in dedicated hardware like field programmable gate arrays (FPGAs) could be useful for building smart units. The computational complexity and identification time mainly depend on the number of speakers, the number of frame vectors, their dimensionality, and the model order of the classifier. Due to the slow movement of the voice-producing parts, adjacent frame vectors do not vary much in information content. In this paper, we present the design of a speaker identification system with a distance-metric-based frame selection technique. The aim is not only to provide the architecture of a speaker identification system but also to reduce the redundant frames at the pre-processing stage, lowering the identification time and computational burden which are vital for real-time implementation.
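The distance-metric frame selection described above exploits the observation that adjacent frame vectors carry nearly the same information; a minimal sketch (the distance metric and threshold here are illustrative assumptions, not the paper's) keeps a frame only when it has moved sufficiently far from the last kept frame:

```python
import numpy as np

def select_frames(frames, threshold=0.5):
    """Distance-metric frame selection sketch: keep a frame only if its
    Euclidean distance from the most recently kept frame exceeds
    `threshold`, dropping near-duplicate adjacent frames.
    frames: (T, D) array of feature vectors."""
    kept = [frames[0]]
    for f in frames[1:]:
        if np.linalg.norm(f - kept[-1]) > threshold:
            kept.append(f)
    return np.array(kept)
```

Because classifier cost scales with the number of frame vectors, every frame dropped here is work removed from the matching stage, which is the stated motivation for doing the selection during pre-processing.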
TL;DR: The benefit of the overall accuracy of the integrated system (e.g., translation) outweighs the WER increase for the Arabic ASR system and it is recommended to include diacritics for ASR systems when integrated with other systems such as voice-enabled translation.
Abstract: Arabic is the native language of over 300 million speakers and one of the official languages of the United Nations. It has a unique set of diacritics that can alter a word’s meaning. Arabic automatic speech recognition (ASR) has received little attention compared to other languages, and research has been oblivious to the diacritics in most cases. Omitting diacritics circumscribes the Arabic ASR system’s usability for several applications such as voice-enabled translation, text-to-speech, and speech-to-speech. In this paper, we study the effect of diacritics on Arabic ASR systems. Our approach is based on building and comparing diacritized and nondiacritized models for different corpus sizes. In particular, we build Arabic ASR models using state-of-the-art technologies for 1, 2, 5, 10, and 23 hours. Each of these models was trained once with a diacritized corpus and again with a nondiacritized version of the same corpus. The KALDI toolkit and SRILM were used to build eight models for each corpus: GMM-SI, GMM-SAT, GMM-MPE, GMM-MMI, SGMM, SGMM-bMMI, DNN, and DNN-MPE. Eighty different models were created using this experimental setup. Our results show that Word Error Rates (WERs) ranged from 4.68% to 42%. Adding diacritics increased WER by 0.59% to 3.29%. Although diacritics increased WERs, it is recommended to include diacritics for ASR systems when integrated with other systems such as voice-enabled translation. We believe that the benefit to the overall accuracy of the integrated system (e.g., translation) outweighs the WER increase for the Arabic ASR system.
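The WER figures reported above are the standard word-level edit-distance metric, which can be computed as follows (a textbook implementation, not code from the paper):

```python
def wer(ref, hyp):
    """Word Error Rate: minimum word-level edit distance (substitutions,
    insertions, deletions) between reference and hypothesis, divided by
    the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(r)][len(h)] / len(r)
```

Note that diacritization changes the word inventory itself, so the diacritized and nondiacritized WERs in the paper are computed against different reference transcripts of the same audio.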
22 Dec 2008
TL;DR: In this paper, a hardware implemented backend search stage for a speech recognition system is provided, which includes a number of pipelined stages including a fetch stage, an updating stage which may be a Viterbi stage, a transition and prune stage, and a language model stage.
Abstract: A hardware implemented backend search stage, or engine, for a speech recognition system is provided. In one embodiment, the backend search engine includes a number of pipelined stages including a fetch stage, an updating stage which may be a Viterbi stage, a transition and prune stage, and a language model stage. Each active triphone of each active word is represented by a corresponding triphone model. By being pipelined, the stages of the backend search engine are enabled to simultaneously process different triphone models, thereby providing high-rate backend searching for the speech recognition system. In one embodiment, caches may be used to cache frequently and/or recently accessed triphone information utilized by the fetch stage, frequently and/or recently accessed triphone-to-senone mappings utilized by the updating stage, or both.
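The "updating stage which may be a Viterbi stage" performs, per frame and per active model state, a best-predecessor maximization; one such update can be sketched in software as below (an illustrative log-domain sketch of a generic Viterbi step, not the patented pipeline):

```python
import numpy as np

def viterbi_step(scores, logA, logB_col):
    """One Viterbi update over N states: for each state, pick the best
    predecessor score plus log-transition, then add the current frame's
    log-emission score. Returns (new_scores, backpointers).
    scores: (N,) log scores; logA: (N, N) log transitions;
    logB_col: (N,) log emissions for the current observation."""
    cand = scores[:, None] + logA   # cand[i, j]: come from state i to j
    back = cand.argmax(axis=0)      # best predecessor per state
    return cand.max(axis=0) + logB_col, back
```

In the hardware engine described, this per-state max-accumulate is exactly the work that is pipelined across triphone models: while one model is in this updating stage, others can occupy the fetch, transition-and-prune, and language model stages.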