Showing papers on "Word error rate" published in 2019

Open Access

Proceedings Article•

Common Voice: A Massively-Multilingual Speech Corpus

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers¹, Gregor Weber

Indiana University¹

13 Dec 2019

TL;DR: This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages. For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.

Abstract: The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 ± 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.

539 citations
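
The word and character error rates that recur throughout this listing are both edit-distance ratios. The following is a minimal sketch of the usual definition (Levenshtein distance divided by reference length), not the scoring code used by any of the listed papers; the function names are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion of a reference token
                            curr[j - 1] + 1,       # insertion of a hypothesis token
                            prev[j - 1] + cost))   # substitution or exact match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalised by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: tokens are individual characters."""
    return error_rate(list(ref), list(hyp))
```

For example, `wer("the cat sat", "the cat sit")` counts one substitution against three reference words. Because insertions are counted, either rate can exceed 1.0.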

Proceedings Article•DOI•

Speaker Recognition for Multi-speaker Conversations Using X-vectors

David Snyder¹, Daniel Garcia-Romero¹, Gregory Sell¹, Alan V. McCree¹, Daniel Povey¹, Sanjeev Khudanpur¹

Johns Hopkins University¹

12 May 2019

TL;DR: It is found that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings.

Abstract: Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.

280 citations

Proceedings Article•DOI•

Improving RNN Transducer Modeling for End-to-End Speech Recognition

Jinyu Li¹, Rui Zhao¹, Hu Hu¹, Yifan Gong¹

Microsoft¹

26 Sep 2019

TL;DR: This paper optimizes the training algorithm of RNN-T to reduce memory consumption, allowing larger training minibatches and faster training, and proposes better model structures that yield RNN-T models with very good accuracy but a small footprint.

Abstract: In the last few years, an emerging trend in automatic speech recognition research is the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the most popular three methods. Among these three methods, RNN-T has the advantages to do online streaming which is challenging to AED and it doesn't have CTC's frame-independence assumption. In this paper, we improve the RNN-T training in two aspects. First, we optimize the training algorithm of RNN-T to reduce the memory consumption so that we can have larger training minibatch for faster training speed. Second, we propose better model structures so that we obtain RNN-T models with the very good accuracy but small footprint. Trained with 30 thousand hours anonymized and transcribed Microsoft production data, the best RNN-T model with even smaller model size (216 Megabytes) achieves up-to 11.8% relative word error rate (WER) reduction from the baseline RNN-T model. This best RNN-T model is significantly better than the device hybrid model with similar size by achieving up-to 15.0% relative WER reduction, and obtains similar WERs as the server hybrid model of 5120 Megabytes in size.

191 citations
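
Several abstracts in this listing report "relative" error-rate reductions rather than absolute ones. The sketch below shows the arithmetic behind such a figure. The baseline WER of 10.0% is a hypothetical value chosen only for illustration, since the abstract above does not state absolute WERs, and the function names are assumptions.

```python
def relative_reduction(baseline, improved):
    """Relative error-rate reduction: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


def apply_relative_reduction(baseline, rel):
    """Error rate obtained after applying a relative reduction `rel` (e.g. 0.118)."""
    return baseline * (1.0 - rel)


# Hypothetical baseline of 10.0% WER (not a number from the paper):
hypothetical_baseline = 10.0
after = apply_relative_reduction(hypothetical_baseline, 0.118)
print(f"{hypothetical_baseline:.2f}% -> {after:.2f}% WER")
```

Under that assumed baseline, an 11.8% relative reduction corresponds to an absolute drop of 1.18 percentage points, which is why relative and absolute figures should not be compared directly.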

Journal Article•DOI•

Detecting Parkinson's Disease with Sustained Phonation and Speech Signals using Machine Learning Techniques

Jefferson S. Almeida, Pedro Pedrosa Rebouças Filho¹, Tiago Carneiro², Wei Wei, Robertas Damasevicius³, Rytis Maskeliūnas³, Victor Hugo C. de Albuquerque¹

University of Fortaleza¹, French Institute for Research in Computer Science and Automation², Kaunas University of Technology³

01 Jul 2019-Pattern Recognition Letters

TL;DR: It is shown that the task of phonation was more efficient than speech tasks in the detection of disease and compared with other approaches that use the same data set.

143 citations

Journal Article•DOI•

Intelligent character recognition using fully convolutional neural networks

Raymond Ptucha¹, Felipe Petroski Such¹, Suhas Pillai¹, Frank Brockler², Vatsala Singh², Paul Hutkowski²

Rochester Institute of Technology¹, University of Rochester²

01 Apr 2019-Pattern Recognition

TL;DR: This paper presents a fully convolutional network architecture which outputs arbitrary length symbol streams from handwritten text and is the first to demonstrate state-of-the-art results on both lexicon-based and arbitrary symbol based handwriting recognition benchmarks.

112 citations

Journal Article•DOI•

REAK: Reliability analysis through Error rate-based Adaptive Kriging

Zeyu Wang¹, Abdollah Shafieezadeh¹

Ohio State University¹

01 Feb 2019-Reliability Engineering & System Safety

TL;DR: An extension of the Central Limit Theorem based on Lindeberg condition is adopted here to derive the distribution of the number of design samples with wrong sign estimate and subsequently determine the maximum error rate for failure probability estimates.

110 citations

Proceedings Article•DOI•

FaSNet: Low-Latency Adaptive Beamforming for Multi-Microphone Audio Processing

Yi Luo¹, Cong Han¹, Nima Mesgarani¹, Enea Ceolini², Shih-Chii Liu²

Columbia University¹, University of Zurich²

29 Sep 2019

TL;DR: FaSNet is a two-stage system that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel and then calculates the filters for all remaining channels.

Abstract: Beamforming has been extensively investigated for multi-channel audio processing tasks. Recently, learning-based beamforming methods, sometimes called neural beamformers, have achieved significant improvements in both signal quality (e.g. signal-to-noise ratio (SNR)) and speech recognition (e.g. word error rate (WER)). Such systems are generally non-causal and require a large context for robust estimation of inter-channel features, which is impractical in applications requiring low-latency responses. In this paper, we propose filter-and-sum network (FaSNet), a time-domain, filter-based beamforming approach suitable for low-latency scenarios. FaSNet has a two-stage system design that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel, and then calculate the filters for all remaining channels. The filtered outputs at all channels are summed to generate the final output. Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks. Moreover, when trained with a frequency-domain objective function on the CHiME-3 dataset, FaSNet achieves 14.3% relative word error rate reduction (RWERR) compared with the baseline model. These results show the efficacy of FaSNet particularly in reverberant and noisy signal conditions.

109 citations
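
The FaSNet abstract above evaluates enhancement with scale-invariant signal-to-noise ratio (SI-SNR). As a rough illustration, the commonly used definition projects the zero-mean estimate onto the zero-mean target and compares the energy of that projection with the energy of the residual. The sketch below follows that definition in plain Python; it is not taken from the paper's code, and the `eps` stabiliser is an assumption.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target direction.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    s_target = [dot / t_energy * b for b in t]
    # Everything not explained by the target counts as noise.
    noise = [a - b for a, b in zip(e, s_target)]
    num = sum(x * x for x in s_target)
    den = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(num / den + eps)
```

Rescaling the estimate leaves the score essentially unchanged, which is the property that distinguishes SI-SNR from plain SNR.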

Journal Article•DOI•

Error Probability Analysis of Non-Orthogonal Multiple Access Over Nakagami-$m$ Fading Channels

Lina Bariah¹, Sami Muhaidat¹, Arafat Al-Dweik¹

Khalifa University¹

01 Feb 2019-IEEE Transactions on Communications

TL;DR: This paper focuses on the pairwise error probability (PEP) analysis: exact PEP expressions are derived to characterize the performance of all users under different fading conditions and are then used to derive an exact union bound on the bit error rate (BER).

Abstract: Non-orthogonal multiple access (NOMA) is currently considered as a promising technology for the next-generation wireless networks. In this paper, the error rate performance of NOMA systems is investigated over Nakagami- $m$ fading channels, while considering imperfect successive interference cancelation. In particular, this paper focuses on the pairwise error probability (PEP) analysis, where exact PEP expressions are derived to characterize the performance of all users under different fading conditions. The obtained PEP expressions are then used to derive an exact union bound on the bit error rate (BER). Through the derived PEP expressions, the asymptotic PEP analysis is presented to investigate the maximum achievable diversity gain of NOMA users. Moreover, using the derived BER bound, the power allocation problem for all users in NOMA systems is considered under average power and users BER constraints, which allows realizing the full potential of NOMA. Monte Carlo simulation and numerical results are presented to corroborate the derived analytical expressions and give valuable insights into the error rate performance of each user and the achievable diversity gain.

92 citations

Proceedings Article•

XNAS: Neural Architecture Search with Expert Advice

Niv Nayman¹, Asaf Noy¹, Tal Ridnik¹, Itamar Friedman¹, Rong Jin¹, Lihi Zelnik¹

Alibaba Group¹

01 Jun 2019

TL;DR: This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice, that achieves an optimal worst-case regret bound and suggests the use of multiple learning rates, based on the amount of information carried by the backward gradients.

Abstract: This paper introduces a novel optimization method for differential neural architecture search, based on the theory of prediction with expert advice. Its optimization criterion is well fitted for an architecture-selection, i.e., it minimizes the regret incurred by a sub-optimal selection of operations. Unlike previous search relaxations, that require hard pruning of architectures, our method is designed to dynamically wipe out inferior architectures and enhance superior ones. It achieves an optimal worst-case regret bound and suggests the use of multiple learning-rates, based on the amount of information carried by the backward gradients. Experiments show that our algorithm achieves a strong performance over several image classification datasets. Specifically, it obtains an error rate of 1.6% for CIFAR-10, 23.9% for ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.

86 citations

Journal Article•DOI•

A comparative evaluation of hybrid error correction methods for error-prone long reads

Shuhua Fu¹, Anqi Wang¹, Kin Fai Au¹,²

University of Iowa¹, Ohio State University²

04 Feb 2019-Genome Biology

TL;DR: This paper presents a comparative performance assessment of ten state-of-the-art error-correction methods for long reads, with benchmarks covering sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads.

Abstract: Third-generation sequencing technologies have advanced the progress of the biological research by generating reads that are substantially longer than second-generation sequencing technologies. However, their notorious high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods. Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences. Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals.

86 citations

Proceedings Article•DOI•

Recognizing Long-Form Speech Using Streaming End-to-End Models

Arun Narayanan¹, Rohit Prabhavalkar¹, Chung-Cheng Chiu¹, David Rybach¹, Tara N. Sainath¹, Trevor Strohman¹

Google¹

01 Dec 2019

TL;DR: In this paper, the authors examine the ability of end-to-end (E2E) models to generalize to unseen domains, where they find that models trained on short utterances fail to generalise to long-form speech.

Abstract: All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.

Proceedings Article•DOI•

Advances in Online Audio-Visual Meeting Transcription

Takuya Yoshioka¹, Yan Huang¹, Aviv Hurvitz¹, Li Jiang¹, Sharon Koubi¹, Eyal Krupka¹, Ido Leichter¹, Changliang Liu¹, Partha Parthasarathy¹, Alon Vinnikov¹, Lingfeng Wu¹, Igor Abramovski¹, Xiong Xiao¹, Wayne Xiong¹, Huaming Wang¹, Zhenghao Wang¹, Jun Zhang¹, Yong Zhao¹, Tianyan Zhou¹, Cem Aksoylar¹, Zhuo Chen¹, Moshe David¹, Dimitrios Dimitriadis¹, Yifan Gong¹, Ilya Gurvich¹, Xuedong Huang¹

Microsoft¹

14 Dec 2019

TL;DR: In this article, a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera is described, which can handle overlapped speech.

Abstract: This paper describes a system that generates speaker-annotated transcripts of meetings by using a microphone array and a 360-degree camera. The hallmark of the system is its ability to handle overlapped speech, which has been an unsolved problem in realistic settings for over a decade. We show that this problem can be addressed by using a continuous speech separation approach. In addition, we describe an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification, and, if available, prior speaker information for robustness to various real world challenges. All components are integrated in a meeting transcription framework called SRD, which stands for “separate, recognize, and diarize”. Experimental results using recordings of natural meetings involving up to 11 attendees are reported. The continuous speech separation improves a word error rate (WER) by 16.1% compared with a highly tuned beamformer. When a complete list of meeting attendees is available, the discrepancy between WER and speaker-attributed WER is only 1.0%, indicating accurate word-to-speaker association. This increases marginally to 1.6% when 50% of the attendees are unknown to the system.

Proceedings Article•DOI•

The Speechtransformer for Large-scale Mandarin Chinese Speech Recognition

Zhao Yuanyuan, Jie Li, Xiaorui Wang, Li Yan

12 May 2019

TL;DR: This paper focuses on a large-scale Mandarin Chinese speech recognition task and proposes three optimization strategies to further improve the performance and efficiency of the SpeechTransformer, including a much lower frame rate.

Abstract: Attention-based sequence-to-sequence architectures have made great progress in the speech recognition task. The SpeechTransformer, a no-recurrence encoder-decoder architecture, has shown promising results on small-scale speech recognition data sets in previous works. In this paper, we focus on a large-scale Mandarin Chinese speech recognition task and propose three optimization strategies to further improve the performance and efficiency of the SpeechTransformer. Our first improvement is to use a much lower frame rate, which is shown very beneficial to not only the computation efficiency but also the model performance. The other two strategies are scheduled sampling and focal loss, which are both very effective to reduce the character error rate (CER). On a 8,000 hours task, the proposed improvements yield 10.8%-26.1% relative gain in CER on four different test sets. Compared to a strong hybrid TDNN-LSTM system, which is trained with LF-MMI criterion and decoded with a large 4-gram LM, the final optimized Speech-Transformer gives 12.2%-19.1% relative CER reduction without any explicit language models.
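
The abstract above names focal loss as one of the strategies that reduce CER but does not give its formula. As a hedged illustration, the widely used form of focal loss scales the cross-entropy term by (1 - p_t)^gamma, so that tokens the model already predicts confidently contribute less. How the paper applies it to the SpeechTransformer output is not stated here, so the per-token formulation and the default `gamma` below are assumptions.

```python
import math


def focal_loss(probs, target_index, gamma=2.0):
    """Focal loss for one prediction: -(1 - p_t)^gamma * log(p_t).

    `probs` is a probability distribution over output units and
    `target_index` selects the probability p_t of the correct unit.
    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target_index]
    if not 0.0 < p_t <= 1.0:
        raise ValueError("target probability must be in (0, 1]")
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# A confident correct prediction is down-weighted far more than an uncertain one.
easy = focal_loss([0.9, 0.1], target_index=0)
hard = focal_loss([0.3, 0.7], target_index=0)
print(easy < hard)
```

Setting `gamma=0` is a convenient sanity check, because the result must then equal the plain negative log-likelihood.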

Proceedings Article•DOI•

Cycle-consistency Training for End-to-end Speech Recognition

Takaaki Hori¹, Ramón Fernandez Astudillo², Tomoki Hayashi³, Yu Zhang⁴, Shinji Watanabe⁵, Jonathan Le Roux¹

Mitsubishi Electric Research Laboratories¹, INESC-ID², Nagoya University³, Google⁴, Johns Hopkins University⁵

12 May 2019

TL;DR: In this paper, a cycle-consistency loss based on the speech encoder state sequence instead of the raw speech signal was proposed to mitigate the problem of limited paired data, which reduced the word error rate by 14.7% from an initial model trained with 100-hour paired data.

Abstract: This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial since fundamental information, such as speaker traits, are lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduced the word error rate by 14.7% from an initial model trained with 100-hour paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data mainly for language modeling to further improve the performance in the unpaired data training scenario.

Journal Article•DOI•

Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing

Leonardo Novelli¹, Patricia Wollstadt², Pedro A. M. Mediano³, Michael Wibral⁴, Joseph T. Lizier¹

University of Sydney¹, Honda², Imperial College London³, University of Göttingen⁴

15 Jul 2019

TL;DR: The algorithm presented—as implemented in the IDTxl open-source software—addresses challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization, and was validated on synthetic datasets involving random networks of increasing size.

Abstract: Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present-as implemented in the IDTxl open-source software-addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.

Proceedings Article•DOI•

Self-Attention Transducers for End-to-End Speech Recognition

Zhengkun Tian¹, Jiangyan Yi¹, Jianhua Tao¹, Ye Bai¹, Zhengqi Wen¹

Chinese Academy of Sciences¹

15 Sep 2019

TL;DR: This paper proposes a self-attention transducer (SA-T) for speech recognition that models long-term dependencies inside sequences and can be efficiently parallelized, with a path-aware regularization that helps SA-T learn alignments and improves performance.

Abstract: Recurrent neural network transducers (RNN-T) have been successfully applied in end-to-end speech recognition. However, the recurrent structure makes it difficult for parallelization. In this paper, we propose a self-attention transducer (SA-T) for speech recognition. RNNs are replaced with self-attention blocks, which are powerful to model long-term dependencies inside sequences and able to be efficiently parallelized. Furthermore, a path-aware regularization is proposed to assist SA-T to learn alignments and improve the performance. Additionally, a chunk-flow mechanism is utilized to achieve online decoding. All experiments are conducted on a Mandarin Chinese dataset AISHELL-1. The results demonstrate that our proposed approach achieves a 21.3% relative reduction in character error rate compared with the baseline RNN-T. In addition, the SA-T with chunk-flow mechanism can perform online decoding with only a little degradation of the performance.

Proceedings Article•DOI•

Self-attention Aligner: A Latency-control End-to-end Model for ASR Using Self-attention Network and Chunk-hopping

Linhao Dong¹, Feng Wang¹, Bo Xu¹

Chinese Academy of Sciences¹

12 May 2019

TL;DR: This paper presents an RNN-free end-to-end model, the self-attention aligner (SAA), which applies self-attention networks to a simplified recurrent neural aligner (RNA) framework, and proposes a chunk-hopping mechanism that enables the SAA model to encode segmented frame chunks one after another to support online recognition.

Abstract: Self-attention network, an attention-based feedforward neural network, has recently shown the potential to replace recurrent neural networks (RNNs) in a variety of NLP tasks. However, it is not clear if the self-attention network could be a good alternative of RNNs in automatic speech recognition (ASR), which processes the longer speech sequences and may have online recognition requirements. In this paper, we present a RNN-free end-to-end model: self-attention aligner (SAA), which applies the self-attention networks to a simplified recurrent neural aligner (RNA) framework. We also propose a chunk-hopping mechanism, which enables the SAA model to encode on segmented frame chunks one after another to support online recognition. Experiments on two Mandarin ASR datasets show the replacement of RNNs by the self-attention networks yields a 8.4%-10.2% relative character error rate (CER) reduction. In addition, the chunk-hopping mechanism allows the SAA to have only a 2.5% relative CER degradation with a 320ms latency. After jointly training with a self-attention network language model, our SAA model obtains further error rate reduction on multiple datasets. Especially, it achieves 24.12% CER on the Mandarin ASR benchmark (HKUST), exceeding the best end-to-end model by over 2% absolute CER.

Proceedings Article•DOI•

Deep Residual Neural Networks for Audio Spoofing Detection

Moustafa Alzantot¹, Ziqi Wang², Mani Srivastava¹

University of California, Los Angeles¹, Delft University of Technology²

15 Sep 2019

TL;DR: In this paper, three variants of a residual convolutional neural network that accept different feature representations (MFCC, Log-magnitude STFT, and CQCC) of input were proposed.

Abstract: The state-of-art models for speech synthesis and voice conversion are capable of generating synthetic speech that is perceptually indistinguishable from bonafide human speech. These methods represent a threat to the automatic speaker verification (ASV) systems. Additionally, replay attacks where the attacker uses a speaker to replay a previously recorded genuine human speech are also possible. We present our solution for the ASVSpoof2019 competition, which aims to develop countermeasure systems that distinguish between spoofing attacks and genuine speeches. Our model is inspired by the success of residual convolutional networks in many classification tasks. We build three variants of a residual convolutional neural network that accept different feature representations (MFCC, Log-magnitude STFT, and CQCC) of input. We compare the performance achieved by our model variants and the competition baseline models. In the logical access scenario, the fusion of our models has zero t-DCF cost and zero equal error rate (EER), as evaluated on the development set. On the evaluation set, our model fusion improves the t-DCF and EER by 25% compared to the baseline algorithms. Against physical access replay attacks, our model fusion improves the baseline algorithms t-DCF and EER scores by 71% and 75% on the evaluation set, respectively.
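
The equal error rate (EER) quoted in the abstract above is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (genuine audio rejected). The sketch below estimates it by sweeping a threshold over the observed scores. It is a simplified illustration, not the official ASVspoof scoring tool, and the convention that higher scores mean "more likely genuine" is an assumption.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping a decision threshold over all scores.

    A trial is accepted as genuine when its score is >= the threshold.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    if not genuine_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")
    best_gap, best_eer = None, None
    for th in sorted(set(genuine_scores) | set(spoof_scores)):
        # False rejection: genuine trials scored below the threshold.
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

When the two score distributions do not overlap at all, some threshold separates them perfectly and the function returns 0.0, which matches the "zero EER" case described for the development set.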

Proceedings Article•DOI•

Investigating End-to-end Speech Recognition for Mandarin-english Code-switching

Changhao Shan¹, Chao Weng², Guangsen Wang², Dan Su², Min Luo², Dong Yu², Lei Xie¹

Northwestern Polytechnical University¹, Tencent²

12 May 2019

TL;DR: Three approaches are investigated to improve end-to-end speech recognition on a Mandarin-English code-switching task, including multi-task learning (MTL), which enables the language identity information to facilitate Mandarin-English code-switching ASR.

Abstract: Code-switching is a common phenomenon in many multilingual communities and presents a challenge to automatic speech recognition (ASR). In this paper, three approaches are investigated to improve end-to-end speech recognition on Mandarin-English code-switching task. First, multi-task learning (MTL) is introduced which enables the language identity information to facilitate Mandarin-English code-switching ASR. Second, we explore wordpieces, as opposed to graphemes, as English modeling units to reduce the modeling unit gap between Mandarin and English. Third, we employ transfer learning to utilize larger amount of monolingual Mandarin and English data to compensate the data sparsity issue of a code-switching task. Significant improvements are observed from all three approaches. With all three approaches combined, the final system achieves a character error rate (CER) of 6.49% on a real Mandarin-English code-switching task.

Posted Content•

Recognizing long-form speech using streaming end-to-end models

Arun Narayanan¹, Rohit Prabhavalkar¹, Chung-Cheng Chiu¹, David Rybach¹, Tara N. Sainath¹, Trevor Strohman¹

Google¹

24 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances.

Posted Content•

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Anjuli Kannan¹, Arindrima Datta¹, Tara N. Sainath¹, Eugene Weinstein¹, Bhuvana Ramabhadran², Yonghui Wu¹, Ankur Bapna¹, Zhifeng Chen¹, Seungji Lee¹

Google¹, IBM²

11 Sep 2019-arXiv: Audio and Speech Processing

TL;DR: This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages.

Abstract: Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages. Using nine Indic languages, we compare a variety of techniques, and find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model achieves a lower word error rate (WER) than both monolingual E2E models (eight of nine languages) and monolingual conventional systems (all nine languages).

Journal Article•DOI•

Kernel Approximation Methods for Speech Recognition

Avner May¹, Alireza Bagheri Garakani², Zhiyun Lu², Dong Guo², Kuan Liu², Aurélien Bellet, Linxi Fan³, Michael Collins¹, Daniel Hsu¹, Brian Kingsbury⁴, Michael Picheny⁴, Fei Sha²•Institutions (4)

Columbia University¹, University of Southern California², Stanford University³, IBM⁴

01 Jan 2019-Journal of Machine Learning Research

TL;DR: In this paper, the authors compare the performance of deep neural networks (DNNs) and kernel models on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, using frame-level performance metrics (accuracy, cross-entropy) as well as recognition metrics (word/character error rate).

Abstract: We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht [2007]. We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. [2013a] improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.
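
The random Fourier feature method mentioned above replaces a shift-invariant kernel with an explicit feature map whose inner products approximate the kernel. Below is a minimal pure-Python sketch for the Gaussian (RBF) kernel; it is an illustration, not the authors' implementation, and the names `make_rff`, `rbf_kernel`, `gamma`, and `seed` are assumptions of this sketch.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_rff(dim, num_features, gamma, seed=0):
    """Return a feature map z with z(x) . z(y) ~= rbf_kernel(x, y, gamma).

    Frequencies are drawn from the kernel's Fourier transform, a Gaussian
    with standard deviation sqrt(2 * gamma); phases are uniform on [0, 2*pi).
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def z(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return z
```

A linear model trained on `z(x)` then behaves approximately like a kernel machine, with cost growing with the number of random features rather than with the number of training frames, which is why reducing the number of features required matters at this scale.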

Proceedings Article•DOI•

Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation

Lukas Drude¹, Daniel Hasenklever¹, Reinhold Haeb-Umbach¹•Institutions (1)

University of Paderborn¹

10 May 2019

TL;DR: A training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable is proposed and it is demonstrated that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system.

Abstract: We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitation. We demonstrate that (a) the single-channel deep clustering system trained according to the proposed scheme alone is able to achieve performance similar to that of the multi-channel teacher in terms of word error rates and (b) initializing the spatial clustering approach with the deep clustering result yields a relative word error rate reduction of 26% over the unsupervised teacher.
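
A relative word error rate reduction, as quoted above, is conventionally the drop in WER divided by the baseline WER. A minimal Python sketch of that arithmetic, assuming this convention; the function name is hypothetical and the abstract itself reports no absolute WERs.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

For instance, a purely hypothetical drop from 20.0% to 14.8% WER is a 5.2-point absolute reduction but a 26% relative one, so relative figures should always be read against the baseline they refer to.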

Proceedings Article•DOI•

State-of-the-Art Speech Recognition Using Multi-Stream Self-Attention with Dilated 1D Convolutions

Kyu Jeong Han, Ramon Prieto, Tao Ma

01 Oct 2019

TL;DR: A new neural network model architecture, multi-stream self-attention, is proposed to make the self-attention mechanism more effective on highly correlated speech frames; it achieves a word error rate of 2.2% on the test-clean dataset of the LibriSpeech corpus.

Abstract: Self-attention has been a huge success for many downstream tasks in NLP, which has led to exploration of applying self-attention to speech problems as well. The efficacy of self-attention in speech applications, however, does not yet seem fully realized, since it is challenging to handle highly correlated speech frames in the context of self-attention. In this paper we propose a new neural network model architecture, namely multi-stream self-attention, to address this issue and thus make the self-attention mechanism more effective for speech recognition. The proposed model architecture consists of parallel streams of self-attention encoders, and each stream has layers of 1D convolutions with dilated kernels whose dilation rates are unique to that stream, followed by a self-attention layer. The self-attention mechanism in each stream pays attention to only one resolution of input speech frames and the attentive computation can be more efficient. In a later stage, outputs from all the streams are concatenated and then linearly projected to the final embedding. By stacking the proposed multi-stream self-attention encoder blocks and rescoring the resultant lattices with neural network language models, we achieve a word error rate of 2.2% on the test-clean dataset of the LibriSpeech corpus, the best number reported thus far on the dataset.
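
The per-stream dilated convolutions described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: frames are scalars rather than feature vectors, the self-attention layer and the final linear projection are omitted, and streams are aligned by cropping to the shortest one. The function names are hypothetical.

```python
def dilated_conv1d(frames, kernel, dilation):
    """'Valid' 1D convolution (cross-correlation form) with a dilated kernel."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(k * frames[t + i * dilation] for i, k in enumerate(kernel))
        for t in range(len(frames) - span)
    ]


def multi_stream(frames, kernel, dilations):
    """One dilated convolution per stream, concatenated frame by frame."""
    streams = [dilated_conv1d(frames, kernel, d) for d in dilations]
    length = min(len(s) for s in streams)  # crop to the shortest stream
    return [[s[t] for s in streams] for t in range(length)]
```

With a three-tap kernel, a dilation of 1 looks at three adjacent frames while a dilation of 2 covers five frames using the same number of weights, which is how each stream can attend to a different temporal resolution of the input.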

Journal Article•DOI•

Speech Enhancement Based on Teacher–Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition

Yan-Hui Tu¹, Jun Du¹, Chin-Hui Lee²•Institutions (2)

University of Science and Technology of China¹, Georgia Institute of Technology²

01 Dec 2019-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise.

Abstract: In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. The student model can be compactly designed in a causal processing mode having no latency with the guidance of a complex and noncausal teacher model. Moreover, the clean speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model, implying that noisy speech data can be directly used to adapt the regression-based enhancement model to further improve speech recognition accuracies for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model with bidirectional gated recurrent units (BGRUs) can achieve a relative word error rate (WER) reduction of 18.85% for the real test set when compared to the unprocessed system without acoustic model retraining. However, the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal mode having no latency yields a relative WER reduction of 7.94% over the unprocessed system with 670 times fewer computing cycles when compared to the BGRU-equipped student model. Finally, the conventional speech enhancement and IRM-based deep learning methods degraded the ASR performance when the recognition system became more powerful, whereas our proposed approach could still improve the ASR performance even with the more powerful recognition system.
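
The ideal ratio mask that the teacher model learns can be illustrated with a short Python sketch. The abstract does not give the exact formula, so this assumes one common definition, the per-bin ratio of clean-speech power to total power raised to an exponent `beta` (often 0.5); the function name and the inputs are hypothetical.

```python
def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Per-bin mask (S / (S + N)) ** beta for speech power S and noise power N."""
    mask = []
    for s, n in zip(speech_power, noise_power):
        total = s + n
        # A bin with no energy at all is masked out rather than divided by zero.
        mask.append((s / total) ** beta if total > 0 else 0.0)
    return mask
```

The mask is close to 1 where speech dominates a time-frequency bin and close to 0 where noise dominates. Computing it requires the clean and noise signals separately, which is why such targets are built from simulated clean and noisy training pairs.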

Posted Content•DOI•

A comprehensive evaluation of long read error correction methods

Haowen Zhang¹, Chirag Jain¹, Srinivas Aluru¹•Institutions (1)

Georgia Institute of Technology¹

13 Jan 2019-bioRxiv

TL;DR: This paper presents a categorization and review of long read error correction methods, and provides a comprehensive evaluation of the corresponding long read error correction tools.

Abstract: Background: Third-generation single molecule sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle this problem through sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies that assess these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or diversity of evaluation measures used. Results: In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research. Conclusions: Despite the high error rate of long reads, the state-of-the-art correction tools can achieve high correction quality. When short reads are available, the best hybrid methods outperform non-hybrid methods in terms of correction quality and computing resource usage. When choosing tools for use, practitioners are advised to be careful with a few correction tools that discard reads, and to check the effect of error correction tools on downstream analysis. Our evaluation code is available as open-source at https://github.com/haowenz/LRECE.

Journal Article•DOI•

Voice Activity Detection: Merging Source and Filter-based Information

Thomas Drugman¹, Yannis Stylianou¹, Yusuke Kida¹, Masami Akamine¹•Institutions (1)

Toshiba¹

07 Mar 2019-arXiv: Sound

TL;DR: A mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones, and two strategies are proposed to merge source and filter information: feature and decision fusion.

Abstract: Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density; others exploit the periodicity of the signal. The goal of this paper is to investigate the joint use of source and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. The features are then fed to an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature and decision fusion. Our experiments indicate an absolute reduction of 3% in the equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training and its efficient information fusion, the proposed system yields a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
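
The equal error rate and decision fusion mentioned above can be sketched in a few lines of Python. This is an illustration only: the abstract does not specify the fusion rule, so averaging the two classifiers' speech posteriors is an assumption, and the function names, scores, and labels are hypothetical.

```python
def fuse_decisions(source_probs, filter_probs):
    """Decision fusion by averaging two speech posteriors (assumed rule)."""
    return [(a + b) / 2.0 for a, b in zip(source_probs, filter_probs)]


def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds over the observed scores and return
    the mean of miss and false-alarm rates where the two are closest."""
    speech = [s for s, y in zip(scores, labels) if y == 1]
    noise = [s for s, y in zip(scores, labels) if y == 0]
    if not speech or not noise:
        raise ValueError("need both speech and non-speech examples")
    best = None
    for thr in sorted(set(scores)):
        miss = sum(s < thr for s in speech) / len(speech)        # speech rejected
        false_alarm = sum(s >= thr for s in noise) / len(noise)  # noise accepted
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2.0)
    return best[1]
```

Because the EER is read off at the threshold where misses and false alarms balance, it summarises a detector in one number without committing to a particular operating point.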

Proceedings Article•DOI•

SignSpeaker: A Real-time, High-Precision SmartWatch-based Sign Language Translator

Jiahui Hou¹, Xiang-Yang Li¹, Peide Zhu¹, Zefan Wang¹, Yu Wang², Jianwei Qian³, Panlong Yang¹•Institutions (3)

University of Science and Technology of China¹, University of North Carolina at Charlotte², Illinois Institute of Technology³

05 Aug 2019

TL;DR: Inspired by previous works on motion detection with wearable devices, this work proposes SignSpeaker, a real-time, robust, and user-friendly American sign language recognition (ASLR) system with affordable and portable commodity mobile devices.

Abstract: Sign language is a natural and fully-formed communication method for deaf or hearing-impaired people. Unfortunately, most of the state-of-the-art sign recognition technologies are limited by either high energy consumption or expensive device costs and have a difficult time providing a real-time service in a daily-life environment. Inspired by previous works on motion detection with wearable devices, we propose SignSpeaker, a real-time, robust, and user-friendly American sign language recognition (ASLR) system with affordable and portable commodity mobile devices. SignSpeaker is deployed on a smartwatch along with a smartphone; the smartwatch collects the sign signals and the smartphone outputs translation through an inbuilt loudspeaker. We implement a prototype system and run a series of experiments that demonstrate the promising performance of our system. For example, the average translation time is approximately 1.1 seconds for a sentence with eleven words. The average detection ratio and reliability of sign recognition are 99.2% and 99.5%, respectively. The average word error rate of continuous sentence recognition is 1.04%.

Proceedings Article•

Orthogonal random forest for causal inference

Miruna Oprescu¹, Vasilis Syrgkanis¹, Zhiwei Steven Wu•Institutions (1)

Microsoft¹

24 May 2019

TL;DR: The orthogonal random forest (ORF) combines Neyman-orthogonality, which reduces sensitivity to estimation error in nuisance parameters, with generalized random forests (Athey et al., 2017).

Abstract: We propose the orthogonal random forest, an algorithm that combines Neyman-orthogonality to reduce sensitivity with respect to estimation error of nuisance parameters with generalized random forests (Athey et al., 2017), a flexible nonparametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumptions on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local ℓ1-penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete treatments or continuous treatments, and we show that, unlike prior work, our method provably allows controlling for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic and real data.

Proceedings Article•DOI•

Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-student Learning

Ladislav Mosner¹, Minhua Wu², Anirudh Raju², Sree Hari Krishnan Parthasarathi², Kenichi Kumatani², Shiva Sundaram², Roland Maas², Bjorn Hoffmeister²•Institutions (2)

Brno University of Technology¹, Amazon.com²

12 May 2019

TL;DR: This paper adopted the teacher-student learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise and applied a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher and to reduce bandwidth needed for transferring data.

Abstract: For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values to prevent wrong emphasis of knowledge from the teacher and to reduce bandwidth needed for transferring data. We incorporate up to 8000 hours of untranscribed data for training and present our results on sequence trained models apart from cross entropy trained ones. The best sequence trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets, respectively, compared to a sequence trained teacher.
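
The logits selection step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: renormalising the kept logits with a softmax and training the student with cross-entropy against those targets are assumptions of this sketch, and all function names are hypothetical.

```python
import math


def top_k_softmax(logits, k):
    """Keep the k largest teacher logits and renormalise them with a softmax;
    every other class gets zero target probability."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def student_loss(teacher_logits, student_logits, k):
    """Cross-entropy between top-k teacher targets and student posteriors."""
    targets = top_k_softmax(teacher_logits, k)
    log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(targets, log_probs))
```

In a parallel-corpus setup the teacher would score the clean utterance and the student its noisy counterpart. Storing only k values per frame instead of a full posterior vector is what reduces the data that has to be transferred.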
