Showing papers on "Word error rate published in 2022"


Journal ArticleDOI
TL;DR: In this article, an enhanced multimodal biometric technique for a smart city that is based on score-level fusion is proposed, where a fuzzy strategy with soft computing techniques known as an optimized fuzzy genetic algorithm is used.
Abstract: Biometric security is a major emerging concern in the field of data security. In recent years, research initiatives in the field of biometrics have grown at an exponential rate. The multimodal biometric technique with enhanced accuracy and recognition rate for smart cities is still a challenging issue. This paper proposes an enhanced multimodal biometric technique for a smart city that is based on score-level fusion. Specifically, the proposed approach provides a solution to the existing challenges by providing a multimodal fusion technique with an optimized fuzzy genetic algorithm providing enhanced performance. Experiments with different biometric environments reveal significant improvements over existing strategies. The result analysis shows that the proposed approach provides better performance in terms of the false acceptance rate, false rejection rate, equal error rate, precision, recall, and accuracy. The proposed scheme provides a higher accuracy rate of 99.88% and a lower equal error rate of 0.18%. The vital part of this approach is the inclusion of a fuzzy strategy with soft computing techniques known as an optimized fuzzy genetic algorithm.
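
The abstract reports FAR, FRR, and EER but gives no implementation details, so the following is a generic sketch of weighted score-level fusion and EER computation on synthetic genuine/impostor scores; the weights and score distributions are illustrative, not the paper's optimized fuzzy genetic algorithm.

```python
import numpy as np

def fuse_scores(modality_scores, weights):
    """Weighted score-level fusion: combine per-modality match scores."""
    weights = np.asarray(weights) / np.sum(weights)
    return np.average(np.asarray(modality_scores), axis=0, weights=weights)

def equal_error_rate(genuine, impostor):
    """Find the threshold where FAR (impostors accepted) ~= FRR (genuines rejected)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2, thresholds[idx]

# Hypothetical match scores for two modalities (e.g., face + fingerprint).
rng = np.random.default_rng(0)
genuine = fuse_scores([rng.normal(0.8, 0.1, 500), rng.normal(0.75, 0.1, 500)], [0.6, 0.4])
impostor = fuse_scores([rng.normal(0.4, 0.1, 500), rng.normal(0.45, 0.1, 500)], [0.6, 0.4])
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {thr:.3f}")
```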

26 citations


Journal ArticleDOI
TL;DR: In this paper, a hybrid framework comprising a deep auto-encoder (AE) with long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) was proposed for an intrusion detection system, using the AE to obtain optimal features and the LSTMs to classify samples as normal or anomalous.
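
The TL;DR describes an AE front end feeding (Bi)LSTM classifiers but gives no architecture details; below is a minimal PyTorch sketch under assumed feature and layer sizes, not the authors' model.

```python
import torch
import torch.nn as nn

class AEBiLSTMDetector(nn.Module):
    """Sketch: an autoencoder compresses flow features, a BiLSTM classifies
    the encoded sequence as normal vs. anomalous (all dimensions are assumed)."""
    def __init__(self, n_features=41, latent=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))
        self.bilstm = nn.LSTM(latent, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 2)   # normal vs. anomaly

    def forward(self, x):               # x: (batch, seq_len, n_features)
        z = self.encoder(x)             # compressed features
        recon = self.decoder(z)         # reconstruction (for AE pre-training loss)
        out, _ = self.bilstm(z)
        logits = self.head(out[:, -1])  # classify from the last time step
        return logits, recon

model = AEBiLSTMDetector()
logits, recon = model(torch.randn(8, 20, 41))
print(logits.shape, recon.shape)   # torch.Size([8, 2]) torch.Size([8, 20, 41])
```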

23 citations


Journal ArticleDOI
TL;DR: This paper used deep learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. "alpha" for "a").
Abstract: Neuroprostheses have the potential to restore communication to people who cannot speak or type due to paralysis. However, it is unclear if silent attempts to speak can be used to control a communication neuroprosthesis. Here, we translated direct cortical signals in a clinical-trial participant (ClinicalTrials.gov; NCT03698149) with severe limb and vocal-tract paralysis into single letters to spell out full sentences in real time. We used deep-learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. "alpha" for "a"). We leveraged broad electrode coverage beyond speech-motor cortex to include supplemental control signals from hand cortex and complementary information from low- and high-frequency signal components to improve decoding accuracy. We decoded sentences using words from a 1,152-word vocabulary at a median character error rate of 6.13% and speed of 29.4 characters per minute. In offline simulations, we showed that our approach generalized to large vocabularies containing over 9,000 words (median character error rate of 8.23%). These results illustrate the clinical viability of a silently controlled speech neuroprosthesis to generate sentences from a large vocabulary through a spelling-based approach, complementing previous demonstrations of direct full-word decoding.
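
The paper quotes a 6.13% median character error rate; for readers unfamiliar with the metric, a plain Levenshtein-based CER function (independent of the authors' decoding pipeline) looks like this:

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + insertions + deletions) / reference length,
    computed with the standard Levenshtein edit distance."""
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

print(character_error_rate("i am thirsty", "i am thirty"))  # one deletion -> ~0.083
```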

17 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, the authors explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN, as a downstream model.
Abstract: The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and thus attract a lot of interest to be applied for various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as a downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. The experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
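
The key front-end idea is averaging all hidden-layer representations with learnable weights before the ECAPA-TDNN; here is a minimal PyTorch sketch of that weighted layer sum (layer count and dimensions assumed, the ASV backend itself omitted).

```python
import torch
import torch.nn as nn

class WeightedLayerSum(nn.Module):
    """Softmax-normalized learnable weights over the hidden layers of a
    pre-trained speech model (layer count assumed; ECAPA-TDNN backend omitted)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):        # list of (batch, frames, dim) tensors
        stacked = torch.stack(hidden_states, dim=0)              # (layers, B, T, D)
        norm_w = torch.softmax(self.weights, dim=0)
        return (norm_w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, D)

# Example with 13 hypothetical transformer layer outputs.
layers = [torch.randn(4, 200, 768) for _ in range(13)]
features = WeightedLayerSum(13)(layers)
print(features.shape)   # torch.Size([4, 200, 768]) -> input features for the ASV backend
```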

15 citations


Journal ArticleDOI
TL;DR: A robust method is proposed to predict human behavior in indoor and outdoor crowd environments, applying a graph mining technique for data optimization and a maximum entropy Markov model for classification and prediction.
Abstract: With the technological change and innovation of the current era, data retrieval and data processing have become more challenging tasks for researchers. In particular, several types of sensors and cameras are used to collect multimedia data from various resources and domains, which have been used in different domains and platforms to analyze settings such as educational and communication setups, emergency services, and surveillance systems. In this paper, we propose a robust method to predict human behavior in indoor and outdoor crowd environments. Taking the crowd-based data as input, some preprocessing steps for noise reduction are performed. Then, human silhouettes are extracted, which eventually help in the identification of human beings. After that, crowd analysis and crowd clustering are applied for more accurate and clearer predictions. This step is followed by feature extraction, in which the deep flow, force interaction matrix, and force flow features are extracted. Moreover, we apply a graph mining technique for data optimization, while a maximum entropy Markov model is applied for classification and prediction. The evaluation of the proposed system showed 87% mean accuracy and a 13% error rate for the Avenue dataset, 89.50% mean accuracy and a 10.50% error rate for the University of Minnesota (UMN) dataset, and 90.50% mean accuracy and a 9.50% error rate for the A Day on Campus (ADOC) dataset. These results show better accuracy and lower error rates compared to state-of-the-art methods.

14 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: A new objective is introduced, with encoder and decoder consistency and contrastive regularization between real and synthesized speech derived from the labeled corpora during the pretraining stage, which leads to more similar representations derived from speech and text that help downstream ASR.
Abstract: An effective way to learn representations from untranscribed speech and unspoken text with linguistic/lexical representations derived from synthesized speech was introduced in tts4pretrain [1]. However, the representations learned from synthesized and real speech are likely to be different, potentially limiting the improvements from incorporating unspoken text. In this paper, we introduce learning from supervised speech earlier on in the training process with consistency-based regularization between real and synthesized speech. This allows for better learning of shared speech and text representations. Thus, we introduce a new objective, with encoder and decoder consistency and contrastive regularization between real and synthesized speech derived from the labeled corpora during the pretraining stage. We show that the new objective leads to more similar representations derived from speech and text that help downstream ASR. The proposed pretraining method yields Word Error Rate (WER) reductions of 7-21% relative on six public corpora (LibriSpeech, AMI, TEDLIUM, Common Voice, Switchboard, and CHiME-6) over a state-of-the-art baseline pretrained with wav2vec2.0, and 2-17% over the previously proposed tts4pretrain. The proposed method outperforms the supervised SpeechStew by up to 17%. Moreover, we show that the proposed method also yields WER reductions on larger data sets by evaluating on a large-resource, in-house Voice Search task and streaming ASR.
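
The exact consistency and contrastive terms are not given in the abstract; the sketch below shows one plausible encoder-consistency term between paired real and synthesized encoder outputs, with an assumed weighting, purely as an illustration of the idea rather than the paper's objective.

```python
import torch
import torch.nn.functional as F

def consistency_loss(real_enc, synth_enc):
    """Penalize the distance between encoder outputs for real speech and TTS
    speech of the same transcript (assumes the two are already time-aligned)."""
    return F.mse_loss(real_enc, synth_enc)

# Hypothetical paired encoder outputs for a batch of labeled utterances.
real_enc = torch.randn(8, 120, 512)
synth_enc = real_enc + 0.1 * torch.randn(8, 120, 512)

ssl_loss = torch.tensor(0.0)   # stands in for the wav2vec-style pretraining loss
alpha = 0.5                    # assumed weighting of the consistency term
total = ssl_loss + alpha * consistency_loss(real_enc, synth_enc)
print(f"consistency-regularized pretraining loss: {total.item():.4f}")
```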

14 citations


Journal ArticleDOI
TL;DR: In this paper, a set of novel modeling techniques, including neural architecture search, data augmentation using spectro-temporal perturbation, model-based speaker adaptation, and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework, was employed to address the above challenges.
Abstract: Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques, including neural architecture search, data augmentation using spectro-temporal perturbation, model-based speaker adaptation and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework, was employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system featuring a 6-way DNN system combination and cross adaptation of systems trained on out-of-domain normal speech data. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers to be performed using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.
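
The 5.4% absolute and 17.6% relative reductions are two views of the same improvement; the arithmetic linking them to the reported 25.21% WER is:

```python
baseline_wer = 25.21 + 5.4   # implied WER of the 2018 baseline system (%)
new_wer = 25.21              # reported WER of the improved system (%)

absolute_reduction = baseline_wer - new_wer
relative_reduction = 100 * absolute_reduction / baseline_wer

print(f"baseline {baseline_wer:.2f}% -> {new_wer:.2f}%: "
      f"{absolute_reduction:.1f}% absolute, {relative_reduction:.1f}% relative")
# baseline 30.61% -> 25.21%: 5.4% absolute, 17.6% relative
```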

12 citations


Journal ArticleDOI
TL;DR: This paper explored whether speakers have a voice-AI-specific register relative to their speech toward an adult human and tested if speakers have targeted error correction strategies for voice-activated AI and human interlocutors.

12 citations


Journal ArticleDOI
01 Mar 2022-Sensors
TL;DR: A novel template-driven KD approach is proposed that optimizes the distillation process so that the student model learns to produce templates similar to those produced by the teacher model; its superiority is demonstrated on intra- and cross-device periocular verification.
Abstract: This work addresses the challenge of building an accurate and generalizable periocular recognition model with a small number of learnable parameters. Deeper (larger) models are typically more capable of learning complex information. For this reason, knowledge distillation (KD) was previously proposed to carry this knowledge from a large model (teacher) into a small model (student). Conventional KD optimizes the student output to be similar to the teacher output (commonly classification output). In biometrics, comparison (verification) and storage operations are conducted on biometric templates, extracted from pre-classification layers. In this work, we propose a novel template-driven KD approach that optimizes the distillation process so that the student model learns to produce templates similar to those produced by the teacher model. We demonstrate our approach on intra- and cross-device periocular verification. Our results demonstrate the superiority of our proposed approach over a network trained without KD and networks trained with conventional (vanilla) KD. For example, the targeted small model achieved an equal error rate (EER) value of 22.2% on cross-device verification without KD. The same model achieved an EER of 21.9% with the conventional KD, and only 14.7% EER when using our proposed template-driven KD.
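
The contrast between conventional output-level KD and the described template-level KD can be illustrated with two loss sketches; the exact distillation loss used in the paper is not quoted here, and the temperature, dimensions, and cosine formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, T=4.0):
    """Conventional KD: match softened classification outputs (KL divergence)."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

def template_kd_loss(student_emb, teacher_emb):
    """Template-driven KD sketch: push the student's pre-classification
    embedding (the biometric template) toward the teacher's template."""
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

# Hypothetical batch: 512-d templates and 100-identity classifier outputs.
teacher_t, student_t = torch.randn(32, 512), torch.randn(32, 512)
teacher_logits, student_logits = torch.randn(32, 100), torch.randn(32, 100)
print(vanilla_kd_loss(student_logits, teacher_logits).item(),
      template_kd_loss(student_t, teacher_t).item())
```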

12 citations


Journal ArticleDOI
TL;DR: This paper examined the impact of inconclusives on error rates from three statistical perspectives: (a) an ideal perspective using objective measurements combined with statistical algorithms, (b) basic sampling theory and practice, and (c) standards of experimental design in human studies.

Proceedings ArticleDOI
23 May 2022
TL;DR: Wang et al. propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning, by feeding original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
Abstract: The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. By doing this, we enforce the network to have consistent predictions for the original and noisy speech, thus allowing it to learn contextualized representations with noise robustness. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2.9-4.9% relative word error rate (WER) reduction on the synthesized noisy LibriSpeech data without deterioration on the original data, and 5.7% on CHiME-4 real 1-channel noisy data compared to a data augmentation baseline even with a strong language model for decoding. Our results on CHiME-4 can match or even surpass those with well-designed speech enhancement components.

Journal ArticleDOI
29 Jan 2022-Symmetry
TL;DR: This work investigates two types of new acoustic features to improve the performance of spoofing attack detection; the first consists of two cepstral coefficients and one LogSpec feature extracted from the linear prediction (LP) residual signals.
Abstract: With the rapid development of intelligent speech technologies, automatic speaker verification (ASV) has become one of the most natural and convenient biometric speaker recognition approaches. However, most state-of-the-art ASV systems are vulnerable to spoofing attack techniques, such as speech synthesis, voice conversion, and replay speech. Due to the symmetry distribution characteristic between the genuine (true) speech and spoof (fake) speech pair, spoofing attack detection is challenging. Many recent research works have been focusing on ASV anti-spoofing solutions. This work investigates two types of new acoustic features to improve the performance of spoofing attack detection. The first features consist of two cepstral coefficients and one LogSpec feature, which are extracted from the linear prediction (LP) residual signals. The second feature is a harmonic and noise subband ratio feature, which can reflect the interaction movement difference of the vocal tract and glottal airflow of the genuine and spoofing speech. The significance of these new features has been investigated in both the t-stochastic neighborhood embedding space and the binary classification modeling space. Experiments on the ASVspoof 2019 database show that the proposed residual features can achieve from 7% to 51.7% relative equal error rate (EER) reduction on the development and evaluation set over the best single system baseline. Furthermore, more than 31.2% relative EER reduction on both the development and evaluation set shows that the proposed new features contain large information complementary to the source acoustic features.
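
The LP residual underlying the first feature set can be obtained with standard tools; the sketch below uses librosa/scipy on a whole utterance (a simplification of the usual frame-wise analysis), with a placeholder file path and an assumed LP order. The paper's cepstral and LogSpec processing of the residual is not reproduced here.

```python
import numpy as np
import librosa
import scipy.signal

# Path is a placeholder; any 16 kHz mono utterance will do.
y, sr = librosa.load("utterance.wav", sr=16000)

order = 16                                 # assumed LP order
a = librosa.lpc(y, order=order)            # LP coefficients [1, a1, ..., ap]
b = np.hstack([[0.0], -a[1:]])             # one-step linear prediction filter
y_hat = scipy.signal.lfilter(b, [1.0], y)  # predicted signal
residual = y - y_hat                       # LP residual (excitation-dominated)

# As a simple illustration, compute a log-magnitude spectrogram of the residual.
logspec = np.log(np.abs(librosa.stft(residual, n_fft=512)) + 1e-8)
print(residual.shape, logspec.shape)
```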

Journal ArticleDOI
TL;DR: In this article, a hybrid convolutional-recurrent neural network (C-RNN) was proposed for heart rate estimation from PPG signals acquired from subjects performing different exercises.

Proceedings ArticleDOI
23 May 2022
TL;DR: The authors propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition that learns complementary information from the enhanced feature and the original noisy feature, achieving an absolute word error rate (WER) reduction of 4.1% over the best baseline on the RATS Channel-A corpus.
Abstract: Speech enhancement (SE) aims to suppress the additive noise from noisy speech signals to improve the speech’s perceptual quality and intelligibility. However, the over-suppression phenomenon in the enhanced speech might degrade the performance of the downstream automatic speech recognition (ASR) task due to the missing latent information. To alleviate this problem, we propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition to learn complementary information from the enhanced feature and the original noisy feature. Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline on the RATS Channel-A corpus. Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
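
The abstract does not reproduce the IFF-Net architecture; one plausible reading of "interactive feature fusion" is a learned gate interpolating between the enhanced and noisy feature streams, sketched below with assumed dimensions. This is a simplification, not the published model.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Sketch of fusing enhanced and original noisy features with a learned
    per-bin gate; a simplification, not the published IFF-Net architecture."""
    def __init__(self, feat_dim=80):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, enhanced, noisy):           # both: (batch, frames, feat_dim)
        g = self.gate(torch.cat([enhanced, noisy], dim=-1))
        return g * enhanced + (1.0 - g) * noisy   # re-inject suppressed information

fused = GatedFeatureFusion()(torch.randn(4, 300, 80), torch.randn(4, 300, 80))
print(fused.shape)   # torch.Size([4, 300, 80]) -> fed to the ASR back-end
```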

Journal ArticleDOI
TL;DR: Deep learning-based techniques are proposed to build a high-performance, versatile CAPT system for mispronunciation detection and diagnosis (MDD) and articulatory feedback generation for non-native Arabic learners, together with the use of cutting-edge neural text-to-speech (TTS) technology to generate a new corpus of high-quality speech from predefined text containing the most common substitution errors among Arabic learners.
Abstract: A high-performance versatile computer-assisted pronunciation training (CAPT) system that provides the learner immediate feedback as to whether their pronunciation is correct is very helpful in learning correct pronunciation and allows learners to practice this at any time and with unlimited repetitions, without the presence of an instructor. In this paper, we propose deep learning-based techniques to build a high-performance versatile CAPT system for mispronunciation detection and diagnosis (MDD) and articulatory feedback generation for non-native Arabic learners. The proposed system can locate the error in pronunciation, recognize the mispronounced phonemes, and detect the corresponding articulatory features (AFs), not only in words but even in sentences. We formulate the recognition of phonemes and corresponding AFs as a multi-label object recognition problem, where the objects are the phonemes and their AFs in a spectral image. Moreover, we investigate the use of cutting-edge neural text-to-speech (TTS) technology to generate a new corpus of high-quality speech from predefined text that has the most common substitution errors among Arabic learners. The proposed model and its various enhanced versions achieved excellent results. We compared the performance of the different proposed models with the state-of-the-art end-to-end technique of MDD, and our system had a better performance. In addition, we proposed using fusion between the proposed model and the end-to-end model and obtained a better performance. Our best model achieved a 3.83% phoneme error rate (PER) in the phoneme recognition task, a 70.53% F1-score in the MDD task, and a detection error rate (DER) of 2.6% for the AF detection task.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, the authors explore reducing computational latency of the 2-pass cascaded encoder model by reducing the size of the causal 1st pass and adding capacity to the non-causal 2nd pass.
Abstract: In this paper, we explore reducing computational latency of the 2-pass cascaded encoder model [1]. Specifically, we experiment with reducing the size of the causal 1st-pass and adding capacity to the non-causal 2nd-pass, such that the overall latency can be reduced without loss of quality. In addition, we explore using a confidence model for deciding to stop 2nd-pass recognition if we are confident in the 1st-pass hypothesis. Overall, we are able to reduce latency by a factor of 1.7X, compared to the baseline cascaded encoder from [1]. Secondly, with the added capacity in the non-causal 2nd-pass, we find that we can improve WER by up to 7% relative using wav2vec and minimum word-error-rate (MWER) training.
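
The confidence-gated second pass described in the abstract can be pictured as follows; the decoders, the confidence model, and the threshold here are stand-ins for illustration, not the paper's models.

```python
from typing import Callable, Tuple

def two_pass_decode(audio,
                    first_pass: Callable[[object], Tuple[str, float]],
                    second_pass: Callable[[object, str], str],
                    confidence_threshold: float = 0.9) -> str:
    """Run the causal 1st pass; only invoke the costlier non-causal 2nd pass
    when the confidence score falls below a threshold (assumed value)."""
    hypothesis, confidence = first_pass(audio)
    if confidence >= confidence_threshold:
        return hypothesis                  # skip the 2nd pass -> lower latency
    return second_pass(audio, hypothesis)  # rescore/redecode with full context

# Stand-in decoders for illustration only.
first = lambda audio: ("play some music", 0.95)
second = lambda audio, hyp: hyp.upper()
print(two_pass_decode("<audio>", first, second))   # confident -> 1st-pass result
```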

Journal ArticleDOI
TL;DR: In this paper , a residual connection-based bidirectional Gated Recurrent Unit (BiGRU) augmented Kalman filtering model was proposed for speech enhancement and recognition, where clean speech and noise signals are modeled as autoregressive process and the parameters are composed of linear prediction coefficients (LPCs) and driving noise variances.
Abstract: With the recent research developments, deep learning models are powerful alternatives for speech enhancement and recognition in many real-world applications. Although state-of-the-art models achieve phenomenal results in terms of the background noise reduction, but the challenge is to design robust models for improving the quality, intelligibility, and word error rate. We propose a novel residual connection-based Bidirectional Gated Recurrent Unit (BiGRU) augmented Kalman filtering model for speech enhancement and recognition. In the proposed model, clean speech and noise signals are modeled as autoregressive process and the parameters are composed of linear prediction coefficients (LPCs) and driving noise variances. Recurrent neural networks are trained to estimate the line spectrum frequencies (LSFs) whereas an optimization problem is solved to attain noise variances such that to minimize the divergence between the modeled and predicted autoregressive spectrums of the noise contaminated speech. Augmented Kalman filtering with the estimated parameters are applied to the noisy speech for background noise reduction such that to improve the speech quality, intelligibility, and word error rates. Bidirectional GRUs network is implemented which predicts parameters both in the future and past contexts of the input sequence and outperform in terms of modeling the long-term dependencies. A compensated phase spectra is used to recover the enhanced speech signals. The Kaldi toolkit is employed to train the automatic speech recognition (ASR) system in order to measure the word error rates (WERs). By using the LibriSpeech dataset, the proposed model improved the quality, intelligibility, and word error rates by 35.52%, 18.79%, and 19.13%, respectively under various noisy environments.

Journal ArticleDOI
TL;DR: In this article, the authors explored three back-end models: a Convolutional Neural Network, a Long Short-Term Memory network, and a hybrid of these two models, with different input feature formats.

Proceedings ArticleDOI
10 Jan 2022
TL;DR: This work explores a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt the RNN-T model to new domains and entities and shows that this method is effective in improving rare word recognition, resulting in a relative improvement in 1-best word error rate (WER) and n-best Oracle WER.
Abstract: End-to-end (E2E) automatic speech recognition models like Recurrent Neural Network Transducers (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants. While E2E models are very effective at learning representations of the training data they are trained on, their accuracy on unseen domains remains a challenging problem. Additionally, these models require paired audio and text training data, are computationally expensive, and are difficult to adapt towards the fast evolving nature of conversational speech. In this work, we explore a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt the RNN-T model to new domains and entities. We show that this method is effective in improving rare word recognition, and results in a relative improvement of 10% in 1-best word error rate (WER) and 10% in n-best Oracle WER (n=8) on multiple out-of-domain datasets without any degradation on a general dataset. We also show that complementing the contextual biasing adaptation with adaptation of a second-pass rescoring model gives additive WER improvements.
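
The likelihood-ratio biasing idea amounts to adding a weighted log(P_domain/P_general) term to the decoder score of each candidate word; below is a toy sketch with illustrative unigram probabilities and an assumed biasing weight, not the paper's implementation.

```python
import math

def biased_score(asr_log_prob: float,
                 word: str,
                 domain_lm: dict,
                 general_lm: dict,
                 weight: float = 0.3,
                 floor: float = 1e-6) -> float:
    """Add a weighted likelihood ratio log(P_domain / P_general) to the ASR score,
    boosting words that are frequent in the target domain's text data."""
    p_d = domain_lm.get(word, floor)
    p_g = general_lm.get(word, floor)
    return asr_log_prob + weight * (math.log(p_d) - math.log(p_g))

# Toy unigram LMs estimated from in-domain vs. general text (illustrative numbers).
domain_lm = {"ibuprofen": 1e-3, "play": 1e-4}
general_lm = {"ibuprofen": 1e-6, "play": 1e-3}
print(biased_score(-4.2, "ibuprofen", domain_lm, general_lm))  # rare word gets a boost
```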

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, instead of suppressing background noise with a conventional cascaded pipeline, the authors employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition, combining a reconstruction module with contrastive learning and performing multi-task continual pre-training on noisy data.
Abstract: Noise robustness is essential for deploying automatic speech recognition (ASR) systems in real-world environments. One way to reduce the effect of noise interference is to employ a preprocessing module that conducts speech enhancement, and then feed the enhanced speech to an ASR backend. In this work, instead of suppressing background noise with a conventional cascaded pipeline, we employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition. We propose to combine a reconstruction module with contrastive learning and perform multi-task continual pre-training on noisy data. The reconstruction module is used for auxiliary learning to improve the noise robustness of the learned representation and thus is not required during inference. Experiments demonstrate the effectiveness of our proposed method. Our model substantially reduces the word error rate (WER) for the synthesized noisy LibriSpeech test sets, and yields around 4.1/7.5% WER reduction on noisy clean/other test sets compared to data augmentation. For the real-world noisy speech from the CHiME-4 challenge (1-channel track), we have obtained state-of-the-art ASR performance without any denoising front-end. Moreover, we achieve comparable performance to the best supervised approach reported with only 16% of labeled data.

Posted ContentDOI
TL;DR: This work aims at surveying the research work that considers NOMA error rate analysis and classifying the contributions of each work, so that work redundancy and overlap can be minimized, research gaps can be identified, and future research directions can be outlined.
Abstract: Non-orthogonal multiple access (NOMA) has received enormous attention in the recent literature due to its potential to improve the spectral efficiency of wireless networks. For several years, most of the research efforts on the performance analysis of NOMA were steered towards the ergodic sum rate and outage probability. More recently, error rate analysis of NOMA has attracted massive attention from researchers aiming to evaluate the error rate of the various NOMA configurations and designs. The large number of publications that appeared in a short time has made it highly challenging for the research community to identify the contribution of the different research articles. Therefore, this work aims at surveying the research work that considers NOMA error rate analysis and classifying the contributions of each work, so that work redundancy and overlap can be minimized, research gaps can be identified, and future research directions can be outlined. Moreover, this work presents the principles of NOMA error rate analysis.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, a loss-gated learning (LGL) strategy was proposed to extract the reliable labels through the fitting ability of the neural network during training, which obtains a 46.3% performance gain over the system without it.
Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn’t always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a 46.3% performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of 1.66% on the VoxCeleb1 original test set.
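
Loss-gated learning keeps only the pseudo-labeled samples the network already fits well; below is a minimal sketch of such a gated training step, with an assumed gate value and hypothetical pseudo-speaker classes.

```python
import torch
import torch.nn.functional as F

def loss_gated_step(logits, pseudo_labels, gate=1.0):
    """Compute per-sample losses and back-propagate only through samples whose
    loss is below the gate, on the premise that low-loss samples carry more
    reliable pseudo labels (the gate value here is an assumption)."""
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    mask = (per_sample < gate).float()
    kept = int(mask.sum().item())
    gated_loss = (per_sample * mask).sum() / max(kept, 1)
    return gated_loss, kept

logits = torch.randn(16, 100)                 # 100 hypothetical pseudo-speaker classes
pseudo = torch.randint(0, 100, (16,))
loss, kept = loss_gated_step(logits, pseudo)
print(f"kept {kept}/16 samples, gated loss = {loss.item():.3f}")
```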

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, a two-pass hybrid and E2E cascading framework is proposed to combine the hybrid and end-to-end models in order to take advantage of both sides.
Abstract: Hybrid and end-to-end (E2E) systems have their individual advantages, with different error patterns in the speech recognition results. By jointly modeling audio and text, the E2E model performs better in matched scenarios and scales well with a large amount of paired audio-text training data. The modularized hybrid model is easier to customize and better at making use of a massive amount of unpaired text data. This paper proposes a two-pass hybrid and E2E cascading (HEC) framework to combine the hybrid and E2E models in order to take advantage of both sides, with hybrid in the first pass and E2E in the second pass. We show that the proposed system achieves 8-10% relative word error rate reduction with respect to each individual system. More importantly, compared with the pure E2E system, we show the proposed system has the potential to keep the advantages of the hybrid system, e.g., customization and segmentation capabilities. We also show the second pass E2E model in HEC is robust with respect to changes in the first pass hybrid model.

Proceedings ArticleDOI
23 May 2022
TL;DR: TitaNet employs 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector).
Abstract: In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and also on speaker diarization tasks with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks.
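
The pooling step that maps variable-length utterances to fixed-length embeddings can be sketched generically as attentive statistics pooling; the following is a common formulation with assumed channel counts, not the exact TitaNet module.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Generic attentive statistics pooling: attention weights over frames,
    then concatenate the weighted mean and std into a fixed-length vector."""
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1), nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1), nn.Softmax(dim=2))

    def forward(self, x):                       # x: (batch, channels, frames)
        w = self.attention(x)                   # frame-level attention weights
        mean = (x * w).sum(dim=2)
        std = torch.sqrt(((x - mean.unsqueeze(2)) ** 2 * w).sum(dim=2).clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)    # (batch, 2 * channels)

emb = AttentiveStatsPooling(192)(torch.randn(4, 192, 350))
print(emb.shape)   # torch.Size([4, 384]) -> projected to the final t-vector
```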

Journal ArticleDOI
TL;DR: In this article, prosodic features were extracted from a Punjabi children's speech corpus, then particular prosodic features were coupled with Mel Frequency Cepstral Coefficient (MFCC) features before being submitted to an ASR framework, and experiments were conducted on both individual and integrated prosodic-based acoustic features.
Abstract: Speech recognition has been an active field of research in the last few decades since it facilitates better human–computer interaction. Native language automatic speech recognition (ASR) systems are still underdeveloped. Punjabi ASR systems are in their infancy stage because most research has been conducted only on adult speech systems; however, less work has been performed on Punjabi children’s ASR systems. This research aimed to build a prosodic feature-based automatic children’s speech recognition system using discriminative modeling techniques. The corpus of Punjabi children’s speech has various runtime challenges, such as acoustic variations with varying speakers’ ages. Efforts were made to implement out-domain data augmentation to overcome such issues using a Tacotron-based text-to-speech synthesizer. The prosodic features were extracted from the Punjabi children’s speech corpus, then particular prosodic features were coupled with Mel Frequency Cepstral Coefficient (MFCC) features before being submitted to an ASR framework. The system modeling process investigated various approaches, which included Maximum Mutual Information (MMI), Boosted Maximum Mutual Information (bMMI), and feature-based Maximum Mutual Information (fMMI). The out-domain data augmentation was performed to enhance the corpus. After that, prosodic features were also extracted from the extended corpus, and experiments were conducted on both individual and integrated prosodic-based acoustic features. It was observed that the fMMI technique exhibited a 20% to 25% relative improvement in word error rate compared with the MMI and bMMI techniques. Further, it was enhanced using an augmented dataset and hybrid front-end features (MFCC + POV + Fo + Voice quality) with a relative improvement of 13% compared with the earlier baseline system.


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article , a hierarchical co-attention fusion approach was proposed to fuse automatic speech recognition (ASR) outputs into the pipeline for joint training speech emotion recognition (SER).
Abstract: Alongside acoustic information, linguistic features based on speech transcripts have been proven useful in Speech Emotion Recognition (SER). However, due to the scarcity of emotion labelled data and the difficulty of recognizing emotional speech, it is hard to obtain reliable linguistic features and models in this research area. In this paper, we propose to fuse Automatic Speech Recognition (ASR) outputs into the pipeline for joint training SER. The relationship between ASR and SER is understudied, and it is unclear what and how ASR features benefit SER. By examining various ASR outputs and fusion methods, our experiments show that in joint ASR-SER training, incorporating both ASR hidden and text output using a hierarchical co-attention fusion approach improves the SER performance the most. On the IEMOCAP corpus, our approach achieves 63.4% weighted accuracy, which is close to the baseline results achieved by combining ground-truth transcripts. In addition, we also present novel word error rate analysis on IEMOCAP and layer-difference analysis of the Wav2vec 2.0 model to better understand the relationship between ASR and SER.

Journal ArticleDOI
TL;DR: Five pre-trained networks, including VGG-16, Inceptionv3, Resnet50, Densenet121, and EfficientNetB7, are used to recognize iris liveness using transfer learning techniques, showing that pre-trained models outperform other current iris biometrics variants.
Abstract: In the recent decade, comprehensive research has been carried out in terms of promising biometrics modalities regarding humans’ physical features for person recognition. This work focuses on iris characteristics and traits for person identification and iris liveness detection. This study used five pre-trained networks, including VGG-16, Inceptionv3, Resnet50, Densenet121, and EfficientNetB7, to recognize iris liveness using transfer learning techniques. These models are compared using three state-of-the-art biometric databases: the LivDet-Iris 2015 dataset, IIITD contact dataset, and ND Iris3D 2020 dataset. Validation accuracy, loss, precision, recall, f1-score, APCER (attack presentation classification error rate), NPCER (normal presentation classification error rate), and ACER (average classification error rate) were used to evaluate the performance of all pre-trained models. According to the observational data, these models have a considerable ability to transfer their experience to the field of iris recognition and to recognize the nanostructures within the iris region. Using the ND Iris3D 2020 dataset, the EfficientNetB7 model achieved 99.97% identification accuracy. Experiments show that pre-trained models outperform other current iris biometrics variants.

Journal ArticleDOI
01 Jan 2022
TL;DR: In this article, the authors presented a method to develop a speech recognition model with minimal resources using the Mozilla DeepSpeech architecture, enabling similar approaches to be carried out for research in low-resourced languages in financially constrained environments.
Abstract: Research in speech recognition is progressing with numerous state-of-the-art results in recent times. However, relatively little research is being carried out in Automatic Speech Recognition (ASR) for languages with low resources. We present a method to develop a speech recognition model with minimal resources using the Mozilla DeepSpeech architecture. We have utilized freely available online computational resources for training, enabling similar approaches to be carried out for research in low-resourced languages in financially constrained environments. We also present novel ways to build an efficient language model from publicly available web resources to improve accuracy in ASR. The proposed ASR model gives a best result of 24.7% Word Error Rate (WER), compared to 55% WER by Google speech-to-text. We have also demonstrated semi-supervised development of a speech corpus using our trained ASR model, indicating a cost-effective approach to building a large vocabulary corpus for a low-resource language. The trained Tamil ASR model and the training sets are released into the public domain and are available on GitHub.
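
For reference, the reported figures are word-level edit-distance rates; with an off-the-shelf toolkit such as jiwer (not part of the paper's setup, and the sentences below are illustrative) the metric is computed as:

```python
import jiwer   # a widely used WER toolkit (pip install jiwer)

reference = "the model was trained on tamil speech data"
hypothesis = "the model was trained on speech data"
print(f"WER = {jiwer.wer(reference, hypothesis):.2%}")   # 1 deletion / 8 words = 12.50%
```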