Author

Md. Shariful Alam

Bio: Md. Shariful Alam is an academic researcher from the University of Malaya. The author has contributed to research in the topics of Medicine and Mel-frequency cepstrum. The author has an h-index of 2 and has co-authored 4 publications receiving 17 citations.

Papers
Proceedings ArticleDOI
01 Dec 2014
TL;DR: The proposed phoneme classification technique, which uses the neural responses of a physiologically based computational model of the auditory periphery, outperforms the traditional MFCC-based method under noisy conditions even though it uses fewer features.
Abstract: Human listeners are capable of recognizing speech in noisy environments, while most traditional speech recognition methods do not perform well in the presence of noise. Unlike the traditional Mel-frequency cepstral coefficient (MFCC)-based method, this study proposes a phoneme classification technique using the neural responses of a physiologically based computational model of the auditory periphery. Neurograms were constructed from the responses of the model auditory nerve to speech phonemes. The features of the neurograms were used to train the recognition system using a Gaussian Mixture Model (GMM) classification technique. Performance was evaluated for different types of phonemes such as stops, fricatives and vowels from the TIMIT database, both in quiet and under noisy conditions. Although the performance of the proposed method is comparable with that of the MFCC-based classifier in quiet conditions, the neural-response-based method outperforms the traditional MFCC-based method under noisy conditions even though it uses fewer features. The proposed method could be used in speech recognition applications such as speech-to-text, especially under noisy conditions.
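
A minimal sketch of the classification stage described above, assuming per-class Gaussian mixture models fitted with scikit-learn; the auditory-periphery model that produces the neurograms is not reproduced here, so random arrays stand in for neurogram features, and the component count is an arbitrary choice.

```python
# Hypothetical sketch: GMM-based phoneme classification from neurogram features.
# Random arrays stand in for the auditory-model neurogram features.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_per_class(features_by_class, n_components=8):
    """Fit one diagonal-covariance GMM per phoneme class."""
    return {
        label: GaussianMixture(n_components=n_components,
                               covariance_type="diag",
                               random_state=0).fit(feats)
        for label, feats in features_by_class.items()
    }

def classify(gmms, feature_vector):
    """Pick the class whose GMM gives the highest log-likelihood."""
    scores = {label: gmm.score_samples(feature_vector[None, :])[0]
              for label, gmm in gmms.items()}
    return max(scores, key=scores.get)

# Toy usage with stand-in features (2 phoneme classes, 20-dim vectors)
rng = np.random.default_rng(0)
features_by_class = {"aa": rng.normal(0, 1, (50, 20)),
                     "sh": rng.normal(1, 1, (50, 20))}
gmms = train_gmm_per_class(features_by_class)
print(classify(gmms, rng.normal(1, 1, 20)))
```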

8 citations

Journal ArticleDOI
TL;DR: A phoneme classification technique using simulated neural responses from a physiologically based computational model of the auditory periphery, rather than features taken directly from the acoustic signal, that outperformed most traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions.
Abstract: In order to mimic the ability of human listeners to identify speech in noisy environments, this paper proposes a phoneme classification technique using simulated neural responses from a physiologically based computational model of the auditory periphery instead of features taken directly from the acoustic signal. The 2-D neurograms were constructed from the simulated responses of the auditory-nerve fibers to speech phonemes. The features of the neurograms were extracted using the Radon transform and used to train the classification system using a deep neural network classifier. Classification performance was evaluated in quiet and under noisy conditions for different types of phonemes extracted from the TIMIT database. Based on simulation results, the proposed method outperformed most of the traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions. The proposed method could easily be extended to develop an automatic speech recognition system.
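
The feature-extraction step above can be sketched roughly as follows, using scikit-image's Radon transform and a small scikit-learn MLP as a stand-in for the paper's deep neural network; the neurogram input, number of projection angles, and network size are all assumptions.

```python
# Hedged sketch: Radon-transform features from a 2-D neurogram image, fed to a
# small neural-network classifier. Random arrays stand in for real neurograms.
import numpy as np
from skimage.transform import radon
from sklearn.neural_network import MLPClassifier

def radon_features(neurogram, n_angles=18):
    """Project the neurogram at a few angles and flatten into a feature vector."""
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(neurogram, theta=theta, circle=False)  # shape: (n_bins, n_angles)
    return sinogram.ravel()

rng = np.random.default_rng(1)
X = np.stack([radon_features(rng.random((32, 32))) for _ in range(40)])
y = rng.integers(0, 2, size=40)               # stand-in phoneme labels

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:3]))
```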

8 citations

Journal ArticleDOI
TL;DR: This study proposes a new feature for phoneme classification using neural responses from a physiologically based computational model of the auditory periphery; the feature exhibited better classification accuracy in quiet and under noisy conditions than most existing acoustic-signal-based methods.
Abstract: Classification of speech phonemes is challenging, especially under noisy environments, and hence traditional speech recognition systems do not perform well in the presence of noise. Unlike traditional methods in which features are mostly extracted from the properties of the acoustic signal, this study proposes a new feature for phoneme classification using neural responses from a physiologically based computational model of the auditory periphery. The two-dimensional neurogram was constructed from the simulated responses of auditory-nerve fibres to speech phonemes. Features of neurogram images were extracted using the Discrete Radon Transform, and the dimensionality of features was reduced using an efficient feature selection technique. A standard classifier, Support Vector Machine, was employed to model and test the phoneme classes. Classification performance was evaluated in quiet and under noisy conditions in which test data were corrupted with various environmental distortions such as additive noise, room reverberation, and telephone-channel noise. Performances were also compared with the results from existing methods such as the Mel-frequency cepstral coefficient, Gammatone frequency cepstral coefficient, and frequency-domain linear prediction-based phoneme classification methods. In general, the proposed neural feature exhibited a better classification accuracy in quiet and under noisy conditions compared with the performance of most existing acoustic-signal-based methods.
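
A minimal sketch of the feature-selection-plus-SVM stage described above, wired as a scikit-learn pipeline; the paper's specific selection technique is not given here, so SelectKBest with an ANOVA F-score is only an illustrative stand-in, as are the data shapes.

```python
# Illustrative pipeline: feature selection followed by an SVM, as a stand-in for
# the classification stage described in the abstract. Data are random placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 200))        # stand-in Radon-transform features
y = rng.integers(0, 3, size=120)       # stand-in phoneme class labels

model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=40),   # reduce dimensionality
                      SVC(kernel="rbf", C=1.0))
model.fit(X, y)
print(model.score(X, y))
```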

3 citations

Proceedings ArticleDOI
01 Dec 2017
TL;DR: In this article, the effect of speech-shaped noise on consonant recognition in Malay was investigated using a 22-alternative forced choice task, and the results showed that fricatives were less affected, whereas laterals and nasals were severely affected.
Abstract: This paper presents the effect of speech-shaped noise on consonant recognition in Malay. Scores were measured using a 22-alternative forced choice task. Based on the responses, consonants were grouped into low-, medium- and high-scoring sets, and the results were compared with previous reports on English consonants. Our results showed that fricatives were less affected, whereas laterals and nasals were severely affected by speech-shaped noise. Large differences in consonant recognition scores at unfavorable signal-to-noise ratios (e.g., −10 dB) suggest that speech-shaped noise masked the Malay consonants non-uniformly, a key finding that differs from reports of more uniform consonant masking in white noise. The masking patterns showed some similarities and notable differences between English and Malay. The noted differences may have clinical implications for the design of signal processing strategies for hearing devices that are intended to improve speech understanding in noise for non-English speakers such as Malaysians.
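
For illustration, per-consonant scores in a forced-choice task like the one above can be computed from a confusion matrix and split into low-, medium- and high-scoring sets; the thresholds and the toy matrix below are arbitrary placeholders, not the paper's values.

```python
# Illustrative sketch only: scoring a forced-choice consonant test from its
# confusion matrix and splitting consonants into low/medium/high sets.
import numpy as np

def recognition_scores(confusions):
    """Per-consonant score = correct responses / total presentations (row-wise)."""
    confusions = np.asarray(confusions, dtype=float)
    return np.diag(confusions) / confusions.sum(axis=1)

def group_by_score(labels, scores, low=0.4, high=0.7):
    """Arbitrary placeholder thresholds, not the paper's grouping criteria."""
    groups = {"low": [], "medium": [], "high": []}
    for label, s in zip(labels, scores):
        key = "low" if s < low else "high" if s >= high else "medium"
        groups[key].append(label)
    return groups

# Toy 3-consonant example (rows = presented, columns = responded)
labels = ["m", "s", "l"]
conf = [[12, 5, 3],
        [1, 18, 1],
        [7, 4, 9]]
scores = recognition_scores(conf)
print(dict(zip(labels, scores.round(2))), group_by_score(labels, scores))
```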

2 citations

Journal ArticleDOI
TL;DR: Dilated Convolution and Inception blocks-based U-Net (DCI-UNet), as proposed in this paper, replaces the standard convolution blocks with dilated convolution blocks to extract multi-scale context features with varying sizes of receptive fields and adds a dilated inception block between the encoder and decoder paths to alleviate information recession and the semantic gap between features.
Abstract: Medical image segmentation is critical for efficient diagnosis of diseases and treatment planning. In recent years, convolutional neural networks (CNN)-based methods, particularly U-Net and its variants, have achieved remarkable results on medical image segmentation tasks. However, they do not always work consistently on images with complex structures and large variations in regions of interest (ROI). This could be due to the fixed geometric structure of the receptive fields used for feature extraction and repetitive down-sampling operations that lead to information loss. To overcome these problems, the standard U-Net architecture is modified in this work by replacing the convolution block with a dilated convolution block to extract multi-scale context features with varying sizes of receptive fields, and adding a dilated inception block between the encoder and decoder paths to alleviate the problem of information recession and the semantic gap between features. Furthermore, the input of each dilated convolution block is added to the output through a squeeze and excitation unit, which alleviates the vanishing gradient problem and improves overall feature representation by re-weighting the channel-wise feature responses. The original inception block is modified by reducing the size of the spatial filter and introducing dilated convolution to obtain a larger receptive field. The proposed network was validated on three challenging medical image segmentation tasks with varying size ROIs: lung segmentation on chest x-ray (CXR) images, skin lesion segmentation on dermoscopy images and nucleus segmentation on microscopy cell images. Improved performance compared to state-of-the-art techniques demonstrates the effectiveness and generalisability of the proposed Dilated Convolution and Inception blocks-based U-Net (DCI-UNet).
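
A rough PyTorch sketch (not the authors' code) of one building block described above: a dilated convolution block whose input is added to its output through a squeeze-and-excitation gate. Channel counts, dilation rates, and the reduction ratio are assumptions for illustration.

```python
# Hypothetical sketch of a dilated convolution block with an SE-gated residual
# connection, loosely following the description in the abstract.
import torch
import torch.nn as nn

class SEGate(nn.Module):
    """Squeeze-and-excitation: re-weight channels by a learned gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x).view(x.size(0), -1, 1, 1)
        return x * w

class DilatedConvBlock(nn.Module):
    """Two dilated 3x3 convolutions; the input is added back through an SE gate."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:
            layers += [nn.Conv2d(ch, out_ch, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.se = SEGate(out_ch)

    def forward(self, x):
        return self.body(x) + self.se(self.skip(x))

block = DilatedConvBlock(1, 16)
print(block(torch.randn(2, 1, 64, 64)).shape)   # torch.Size([2, 16, 64, 64])
```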

1 citation


Cited by
01 Jan 2001
TL;DR: The authors find that the 0th cepstral coefficient can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC, and they propose a better analysis, namely auto-regressive analysis, of the frame energy, which outperforms its 1st- and/or 2nd-order differential derivatives.
Abstract: The performance of the Mel-Frequency Cepstrum Coefficients (MFCC) may be affected by (1) the number of filters, (2) the shape of filters, (3) the way in which filters are spaced, and (4) the way in which the power spectrum is warped. In this paper, several comparison experiments are done to find the best implementation. The traditional MFCC calculation excludes the 0th coefficient for the reason that it is regarded as somewhat unreliable. According to the analysis and experiments, the authors find that it can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC. The authors also propose a better analysis, namely auto-regressive analysis, of the frame energy, which outperforms its 1st- and/or 2nd-order differential derivatives. Experiments with the "863" Speech Database show that, compared with the traditional MFCC and its corresponding auto-regressive analysis coefficients, the FBE-MFCC and the frame energy with their corresponding auto-regressive analysis coefficients form the best combination, reducing the Chinese syllable error rate (CSER) by about 10%, while the FBE-MFCC with the corresponding auto-regressive analysis coefficients reduces CSER by 2.5%. Comparison experiments were also done with a quite casual Chinese speech database, named the Chinese Annotated Spontaneous Speech (CASS) corpus. The FBE-MFCC can reduce the error rate by about 2.9% on average.
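
As a loose illustration of the FBE-MFCC idea, the sketch below keeps the 0th cepstral coefficient, treated as a generalized frequency-band energy, alongside the usual coefficients instead of discarding it; it uses librosa on a synthetic tone, and the paper's auto-regressive analysis of the frame energy is not reproduced.

```python
# Illustrative only: retain the 0th MFCC coefficient as an "FBE"-like term.
import numpy as np
import librosa

sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s stand-in signal

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # row 0 is the 0th coefficient
fbe = mfcc[0]            # kept rather than dropped, per the FBE-MFCC idea
cepstra = mfcc[1:]       # conventional MFCCs without the 0th coefficient

print(fbe.shape, cepstra.shape)
```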

263 citations

Proceedings Article
01 Jan 2006
TL;DR: In this paper, the results of a closed-set recognition task for 64 consonant-vowel sounds (16 C × 4 V, spoken by 18 talkers) in speech-weighted noise (−22, −20, −16, −10, −2 dB) and in quiet were presented.
Abstract: This paper presents the results of a closed-set recognition task for 64 consonant-vowel sounds (16 C × 4 V, spoken by 18 talkers) in speech-weighted noise (−22, −20, −16, −10, −2 dB) and in quiet. The confusion matrices were generated using the responses of a homogeneous set of ten listeners, and the confusions were analyzed using a graphical method. In speech-weighted noise the consonants separate into three sets: a low-scoring set C1 (/f/, /θ/, /v/, /ð/, /b/, /m/), a high-scoring set C2 (/t/, /s/, /z/, /ʃ/, /ʒ/) and a set C3 (/n/, /p/, /g/, /k/, /d/) with intermediate scores. The perceptual consonant groups are C1: {/f/-/θ/, /b/-/v/-/ð/, /θ/-/ð/}, C2: {/s/-/z/, /ʃ/-/ʒ/}, and C3: /m/-/n/, while the perceptual vowel groups are /ɑ/-/æ/ and /ɛ/-/ɪ/. The exponential articulation index (AI) model for consonant score works for 12 of the 16 consonants, using a refined expression of the AI. Finally, a comparison with past work shows that white noise masks the consonants more uniformly than speech-weighted noise, and shows that the AI, because it can account for the differences in noise spectra, is a better measure than the wideband signal-to-noise ratio for modeling and comparing the scores with different noise maskers.
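
One way to surface perceptual groups like those above is to flag consonant pairs with high mutual confusion in a row-normalized confusion matrix; this is only a simple stand-in for the paper's graphical method, with an arbitrary threshold and toy data.

```python
# Small sketch, assuming a row-stochastic confusion matrix: report consonant
# pairs whose symmetric confusion rate exceeds a placeholder threshold.
import numpy as np

def confusable_pairs(labels, confusions, threshold=0.15):
    """Return pairs (a, b) that listeners frequently confuse with each other."""
    P = np.asarray(confusions, dtype=float)
    P = P / P.sum(axis=1, keepdims=True)          # row-normalize to probabilities
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if P[i, j] + P[j, i] >= threshold:    # symmetric confusion rate
                pairs.append((labels[i], labels[j]))
    return pairs

labels = ["f", "th", "s", "z"]
conf = [[70, 20,  5,  5],
        [25, 65,  5,  5],
        [ 2,  3, 80, 15],
        [ 1,  4, 18, 77]]
print(confusable_pairs(labels, conf))   # [('f', 'th'), ('s', 'z')]
```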

126 citations

Journal ArticleDOI
TL;DR: This study introduces a method to improve emotion classification performance in clean and noisy environments by combining two types of features: the proposed neural-responses-based features and the traditional INTERSPEECH 2010 paralinguistic emotion challenge features.
Abstract: Recently, increasing attention has been directed to studying and identifying the emotional content of a spoken utterance. This study introduces a method to improve emotion classification performance in clean and noisy environments by combining two types of features: the proposed neural-responses-based features and the traditional INTERSPEECH 2010 paralinguistic emotion challenge features. The neural-responses-based features are represented by the responses of a computational model of the auditory system for listeners with normal hearing. The model simulates the response of an auditory-nerve fibre with a given characteristic frequency to a speech signal. The simulated responses of the model are represented by a 2D neurogram (a time-frequency representation). The neurogram image is sub-divided into non-overlapping blocks and the averaged value of each block is computed. The neurogram features and the traditional emotion features are combined to form the feature vector for each speech signal. The features are trained using support vector machines to predict the emotion of speech. The performance of the proposed method is evaluated on two well-known databases: the eNTERFACE and Berlin emotional speech data sets. The results show that the proposed method performed better than classification using the neurogram and INTERSPEECH features separately.
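
The block-averaging step described above can be sketched as follows; the neurogram is a random stand-in for the auditory-model output, and the block size is an assumption.

```python
# Hedged sketch: split a 2-D neurogram into non-overlapping blocks and take each
# block's mean as a feature, as described in the abstract.
import numpy as np

def block_average_features(neurogram, block_shape=(8, 8)):
    """Mean of each non-overlapping block, flattened into a feature vector."""
    h, w = neurogram.shape
    bh, bw = block_shape
    h, w = (h // bh) * bh, (w // bw) * bw          # trim to whole blocks
    blocks = neurogram[:h, :w].reshape(h // bh, bh, w // bw, bw)
    return blocks.mean(axis=(1, 3)).ravel()

neurogram = np.random.default_rng(3).random((64, 100))   # freq x time stand-in
features = block_average_features(neurogram)
print(features.shape)   # (96,) for an 8x12 grid of blocks
```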

28 citations

Journal ArticleDOI
TL;DR: A new approach improves the accuracy of speaker identification in the presence of interference for robot control applications using a convolutional neural network (CNN), achieving a classification accuracy of up to 97.5%, more than double the performance reported for some traditional speaker identification methods.

9 citations

Journal ArticleDOI
TL;DR: A phoneme classification technique using simulated neural responses from a physiologically based computational model of the auditory periphery, rather than features taken directly from the acoustic signal, that outperformed most traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions.
Abstract: In order to mimic the ability of human listeners to identify speech in noisy environments, this paper proposes a phoneme classification technique using simulated neural responses from a physiologically based computational model of the auditory periphery instead of features taken directly from the acoustic signal. The 2-D neurograms were constructed from the simulated responses of the auditory-nerve fibers to speech phonemes. The features of the neurograms were extracted using the Radon transform and used to train the classification system using a deep neural network classifier. Classification performance was evaluated in quiet and under noisy conditions for different types of phonemes extracted from the TIMIT database. Based on simulation results, the proposed method outperformed most of the traditional acoustic-property-based phoneme classification methods both in quiet and under noisy conditions. The proposed method could easily be extended to develop an automatic speech recognition system.

8 citations