
Showing papers by "Dong Yu published in 2018"


Proceedings ArticleDOI
Na Li, Deyi Tuo, Dan Su, Zhifeng Li, Dong Yu
02 Sep 2018
TL;DR: A novel algorithm to learn more discriminative utterance-level embeddings based on the Inception-ResNet speaker classifier is proposed, which outperforms the i-vector/PLDA framework for short utterances and is effective for long utterances.
Abstract: Embedding-based deep convolutional neural networks (CNNs) have proven effective for text-independent speaker verification with short utterances. However, the duration robustness of existing deep CNN based algorithms has not been investigated when dealing with utterances of arbitrary duration. To improve the robustness of embedding-based deep CNNs for longer utterances, we propose a novel algorithm to learn more discriminative utterance-level embeddings based on the Inception-ResNet speaker classifier. Specifically, the discriminability of the embeddings is enhanced by reducing intra-speaker variation with a center loss while simultaneously increasing inter-speaker discrepancy with a softmax loss. To further improve performance when long utterances are available, at the test stage long utterances are segmented into shorter ones, and utterance-level speaker embeddings are extracted by an average pooling layer. Experimental results show that when cosine distance is employed as the similarity measure for a trial, the proposed method outperforms the i-vector/PLDA framework for short utterances and remains effective for long utterances.

71 citations
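The combined center-loss-plus-softmax objective described above can be illustrated with a short PyTorch sketch; the class name, embedding size, and loss weight below are illustrative assumptions, not the paper's exact configuration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss: pulls each embedding toward its speaker's learned center."""
    def __init__(self, num_speakers, embed_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_speakers, embed_dim))

    def forward(self, embeddings, labels):
        # Squared distance between each embedding and its speaker's center.
        return ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()

# Illustrative joint objective: softmax CE (inter-speaker) + lambda * center loss (intra-speaker).
num_speakers, embed_dim, lam = 1000, 512, 0.01
classifier = nn.Linear(embed_dim, num_speakers)
center_loss = CenterLoss(num_speakers, embed_dim)

def joint_loss(embeddings, labels):
    logits = classifier(embeddings)
    return F.cross_entropy(logits, labels) + lam * center_loss(embeddings, labels)

At test time, segment-level embeddings of a long utterance would be averaged into a single utterance-level embedding (e.g. segment_embeddings.mean(dim=0)) before cosine scoring, as the abstract describes.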


Journal ArticleDOI
Yanmin Qian, Chao Weng, Xuankai Chang, Shuai Wang, Dong Yu
TL;DR: This overview paper focuses on the speech separation problem given its central role in the cocktail party environment, and describes the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, and the newly developed deep learning-based techniques.
Abstract: The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades for attacking this problem. We focus our discussion on the speech separation problem given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.

71 citations


Proceedings ArticleDOI
Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, Dong Yu
02 Sep 2018
TL;DR: This work proposes an input-feeding architecture, which feeds not only the previous context vector but also the previous decoder hidden state as inputs to the decoder, together with a better hypothesis generation scheme for sequential minimum Bayes risk (MBR) training of sequence-to-sequence models.
Abstract: In this work, we propose two improvements to attention based sequence-to-sequence models for end-to-end speech recognition systems. For the first improvement, we propose to use an input-feeding architecture which feeds not only the previous context vector but also the previous decoder hidden state information as inputs to the decoder. The second improvement is based on a better hypothesis generation scheme for sequential minimum Bayes risk (MBR) training of sequence-to-sequence models, where we introduce softmax smoothing into N-best generation during MBR training. We conduct experiments on both the Switchboard-300hrs and Switchboard+Fisher-2000hrs datasets and observe significant gains from both proposed improvements. Together with other training strategies such as dropout and scheduled sampling, our best model achieves WERs of 8.3%/15.5% on the Switchboard/CallHome subsets of Eval2000 without any external language model, which is highly competitive among state-of-the-art English conversational speech recognition systems.

67 citations
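A minimal sketch of the softmax-smoothing idea for N-best generation during MBR training, together with the expected-risk computation over an N-best list; the function names and the smoothing factor are illustrative assumptions, not taken from the paper:

import torch
import torch.nn.functional as F

def smoothed_softmax(logits, beta=0.8):
    # Softmax smoothing for hypothesis generation: beta < 1 flattens the
    # decoder's output distribution so beam search yields more diverse
    # N-best lists; beta = 1 recovers the standard softmax.
    return F.softmax(beta * logits, dim=-1)

def mbr_risk(hyp_log_probs, hyp_word_errors):
    # Expected word-error count over the N-best list: hypothesis posteriors
    # are obtained by renormalizing the sequence log-probabilities.
    posteriors = torch.softmax(hyp_log_probs, dim=0)
    return (posteriors * hyp_word_errors).sum()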


Proceedings ArticleDOI
Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, Dong Yu
24 Jul 2018
TL;DR: A novel "deep extractor network" is presented, which creates an extractor point for the target speaker in a canonical high-dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker.
Abstract: Speaker-aware source separation methods are promising workarounds for major difficulties such as arbitrary source permutation and an unknown number of sources. However, it remains challenging to achieve satisfactory performance given only a very short target speaker utterance (anchor). Here we present a novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high-dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker. The proposed model differs from prior work in that the canonical embedding space encodes knowledge of both the anchor and the mixture during an end-to-end training phase: first, embeddings for the anchor and the mixture speech are separately constructed in a primary embedding space, and then combined as input to feed-forward layers that transform them to a canonical embedding space, which we find to be more stable than the primary one. Experimental results show that given a very short utterance, the proposed model can efficiently recover high-quality target speech from a mixture, outperforming various baseline models with 5.2% and 6.6% relative improvements in SDR and PESQ respectively compared with a baseline oracle deep attractor model. Meanwhile, we show that it generalizes well to more than one interfering speaker.

63 citations
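A simplified sketch of the idea, assuming a DANet-style per-bin embedding: the mixture is embedded per time-frequency bin, an anchor embedding is concatenated to every bin and projected into a canonical space, and a soft mask is derived from similarity to an extractor point. Layer types, sizes, and the way the extractor point is formed here are assumptions for illustration, not the paper's exact design:

import torch
import torch.nn as nn

class DeepExtractorSketch(nn.Module):
    """Illustrative sketch: anchor-conditioned mask estimation in a learned
    embedding space (not the paper's exact architecture)."""
    def __init__(self, n_freq=129, embed_dim=20, hidden=300):
        super().__init__()
        self.mix_rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.mix_proj = nn.Linear(hidden, n_freq * embed_dim)
        self.anchor_rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.anchor_proj = nn.Linear(hidden, embed_dim)
        self.to_canonical = nn.Linear(2 * embed_dim, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, mix_spec, anchor_spec):
        # mix_spec: (batch, time, freq), anchor_spec: (batch, anchor_time, freq)
        B, T, Freq = mix_spec.shape
        h, _ = self.mix_rnn(mix_spec)
        v = self.mix_proj(h).view(B, T, Freq, self.embed_dim)        # per-bin embeddings
        _, (ah, _) = self.anchor_rnn(anchor_spec)
        a = self.anchor_proj(ah[-1])                                  # anchor embedding (B, D)
        a_tiled = a.view(B, 1, 1, -1).expand(B, T, Freq, self.embed_dim)
        canonical = torch.tanh(self.to_canonical(torch.cat([v, a_tiled], dim=-1)))
        extractor = canonical.mean(dim=(1, 2), keepdim=True)          # extractor point
        mask = torch.sigmoid((canonical * extractor).sum(dim=-1))     # (B, T, Freq)
        return mask  # applied to the mixture spectrogram to recover the target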


Journal ArticleDOI
TL;DR: In this article, the authors extend permutation invariant training (PIT) by introducing a front-end feature separation module trained with the minimum mean square error (MSE) criterion and a back-end recognition module trained with the cross-entropy (CE) criterion.

56 citations
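The front-end separation criterion can be sketched as a permutation-invariant MSE: for each utterance, the MSE is computed under every speaker permutation and the minimum is taken. This is a minimal sketch; the tensor layout is an assumption:

import itertools
import torch
import torch.nn.functional as F

def pit_mse(estimates, targets):
    """Permutation-invariant MSE for a front-end separation module.

    estimates, targets: (batch, num_speakers, time, freq). For each utterance,
    the loss is the minimum MSE over all speaker permutations.
    """
    num_spk = estimates.shape[1]
    losses = []
    for perm in itertools.permutations(range(num_spk)):
        permuted = targets[:, list(perm)]
        losses.append(F.mse_loss(estimates, permuted, reduction="none")
                        .mean(dim=(1, 2, 3)))            # per-utterance MSE
    return torch.stack(losses, dim=1).min(dim=1).values.mean()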


Proceedings ArticleDOI
Yanmin Qian, Dong Yu
02 Sep 2018
TL;DR: A novel model architecture that incorporates the attention mechanism and gated convolutional network (GCN) into the previously developed permutation invariant training based multi-talker speech recognition system (PIT-ASR) is proposed.
Abstract: Provided are a speech recognition training processing method and an apparatus including the same. The speech recognition training processing method includes acquiring multi-talker mixed speech sequence data corresponding to a plurality of speakers, encoding the multi-talker mixed speech sequence data into embedded sequence data, generating speaker-specific context vectors at each frame based on the embedded sequence data, generating senone posteriors for each of the speakers based on the speaker-specific context vectors, and updating an acoustic model by performing permutation invariant training (PIT) based on the senone posteriors.

30 citations
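A gated convolutional layer of the kind referred to in the TL;DR can be sketched as a gated linear unit, where one convolution produces features and a second convolution produces a sigmoid gate; the kernel size and channel counts below are illustrative:

import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Gated convolution (GLU-style): output = conv(x) * sigmoid(gate_conv(x))."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):          # x: (batch, channels, time)
        return self.conv(x) * torch.sigmoid(self.gate(x))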


Proceedings ArticleDOI
Chengzhu Yu, Chunlei Zhang, Chao Weng, Jia Cui, Dong Yu
02 Sep 2018
TL;DR: This study empirically investigates advanced model initializations and training strategies to achieve competitive speech recognition performance on the 300-hour subset of the Switchboard task (SWB-300Hr), and explores the use of hierarchical CTC pretraining for improved model initialization.
Abstract: Acoustic-to-word (A2W) prediction models based on the Connectionist Temporal Classification (CTC) criterion have gained increasing interest in recent studies. Although previous studies have shown that A2W systems can achieve competitive Word Error Rates (WER), there is still a performance gap compared with conventional speech recognition systems when the amount of training data is not exceptionally large. In this study, we empirically investigate advanced model initializations and training strategies to achieve competitive speech recognition performance on the 300-hour subset of the Switchboard task (SWB-300Hr). We first investigate the use of hierarchical CTC pretraining for improved model initialization. We also explore a curriculum training strategy to gradually increase the target vocabulary size from 10k to 20k. Finally, joint CTC and Cross Entropy (CE) training techniques are studied to further improve the performance of the A2W system. The combination of hierarchical-CTC model initialization, curriculum training and joint CTC-CE training translates to a relative 12.1% reduction in WER. Our final A2W system, evaluated on the Hub5-2000 test sets, achieves WERs of 11.4%/20.8% for the Switchboard and CallHome parts without using a language model or a complex decoder.

27 citations
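The joint CTC and CE training described above can be sketched as a weighted sum of a word-level CTC loss and a frame-level cross-entropy loss against a word alignment; the interpolation weight, tensor shapes, and function names are assumptions, not the paper's exact recipe:

import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
ce = nn.CrossEntropyLoss(ignore_index=-1)

def joint_ctc_ce(log_probs, word_targets, input_lens, target_lens,
                 frame_logits, frame_word_alignment, alpha=0.5):
    """Weighted combination of word-level CTC and frame-level CE.

    log_probs:            (T, B, V) log-softmax outputs for the CTC branch
    frame_logits:         (B, T, V) per-frame logits for the CE branch
    frame_word_alignment: (B, T) word labels per frame (-1 where unaligned)
    alpha is an illustrative interpolation weight, not the paper's value.
    """
    ctc_loss = ctc(log_probs, word_targets, input_lens, target_lens)
    ce_loss = ce(frame_logits.reshape(-1, frame_logits.size(-1)),
                 frame_word_alignment.reshape(-1))
    return alpha * ctc_loss + (1 - alpha) * ce_loss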


Posted Content
Jun Wang, Jie Chen, Dan Su, Lianwu Chen, Meng Yu, Yanmin Qian, Dong Yu
TL;DR: In this article, a deep extractor network is proposed, which creates an extractor point for the target speaker in a canonical high-dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker.
Abstract: Speaker-aware source separation methods are promising workarounds for major difficulties such as arbitrary source permutation and an unknown number of sources. However, it remains challenging to achieve satisfactory performance given only a very short target speaker utterance (anchor). Here we present a novel "deep extractor network" which creates an extractor point for the target speaker in a canonical high-dimensional embedding space, and pulls together the time-frequency bins corresponding to the target speaker. The proposed model differs from prior work in that the canonical embedding space encodes knowledge of both the anchor and the mixture during an end-to-end training phase: first, embeddings for the anchor and the mixture speech are separately constructed in a primary embedding space, and then combined as input to feed-forward layers that transform them to a canonical embedding space, which we find to be more stable than the primary one. Experimental results show that given a very short utterance, the proposed model can efficiently recover high-quality target speech from a mixture, outperforming various baseline models with 5.2% and 6.6% relative improvements in SDR and PESQ respectively compared with a baseline oracle deep attractor model. Meanwhile, we show that it generalizes well to more than one interfering speaker.

24 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: The experimental results show that teacher-student training can cut the word error rate (WER) by 20% relative against the baseline PIT model, and the unsupervised domain adaptation method achieved a 30% relative WER reduction against the AMI PIT model.
Abstract: This paper proposes a framework that combines teacher-student training and permutation invariant training (PIT) for single-channel multi-talker speech recognition. In contrast to most conventional teacher-student training methods, which aim at compressing the model, the proposed method distills knowledge from the single-talker model to improve the multi-talker model in the PIT framework. The inputs to the teacher and student networks are the single-talker clean speech and the multi-talker mixed speech, respectively. The knowledge is transferred to the student through the soft labels generated by the teacher. Furthermore, an ensemble of multiple teachers is exploited with a progressive training scheme to further improve the system. In this framework it is easy to take advantage of data augmentation and to perform domain adaptation for multi-talker speech recognition using only untranscribed data. The proposed techniques were evaluated on artificially mixed two-talker AMI speech data. The experimental results show that teacher-student training can cut the word error rate (WER) by 20% relative against the baseline PIT model. We also evaluated our unsupervised domain adaptation method on an artificially mixed WSJ0 corpus and achieved a 30% relative WER reduction against the AMI PIT model.

23 citations
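A minimal sketch of the distillation objective, assuming the teacher produces soft senone posteriors from each clean single-talker source and the student's output streams are assigned by the best permutation, as in PIT; the temperature and tensor shapes are illustrative assumptions:

import itertools
import torch
import torch.nn.functional as F

def ts_pit_loss(student_logits, teacher_logits, temperature=1.0):
    """Teacher-student loss with permutation-invariant stream assignment.

    student_logits: (batch, num_streams, time, senones) from the mixed-speech student
    teacher_logits: (batch, num_streams, time, senones) from the clean-speech teacher,
                    one stream per underlying clean source.
    """
    num_streams = student_logits.shape[1]
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    losses = []
    for perm in itertools.permutations(range(num_streams)):
        log_q = F.log_softmax(student_logits[:, list(perm)] / temperature, dim=-1)
        # KL(teacher || student), averaged per utterance
        kl = (soft_targets * (soft_targets.clamp_min(1e-8).log() - log_q)).sum(-1)
        losses.append(kl.mean(dim=(1, 2)))
    return torch.stack(losses, dim=1).min(dim=1).values.mean()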


Proceedings ArticleDOI
01 Nov 2018
TL;DR: This work introduces a novel auto-regressive method for the speech super-resolution task, which utilizes WaveNet to model the distribution of the target high-resolution signal conditioned on the log-scale mel-spectrogram of the low-resolution signal.
Abstract: Audio super-resolution is the task of increasing the sampling rate of a given low-resolution (i.e. low sampling rate) audio signal. One of the most popular approaches for audio super-resolution is to minimize the squared Euclidean distance between the reconstructed signal and the high sampling rate signal in a point-wise manner. However, such an approach has intrinsic limitations, such as the regression-to-mean problem. In this work, we introduce a novel auto-regressive method for the speech super-resolution task, which utilizes WaveNet to model the distribution of the target high-resolution signal conditioned on the log-scale mel-spectrogram of the low-resolution signal. As an auto-regressive neural network, WaveNet uses the negative log-likelihood as the objective function, which is much more suitable than the Euclidean distance for highly stochastic processes such as speech waveforms. We also train a parallel WaveNet to speed up the generation process to real time. In the experiments, we perform speech super-resolution by increasing the sampling rate from 4kHz to 16kHz on the VCTK corpus. The proposed method achieves an improvement of ∼2 dB over the baseline deep residual convolutional neural network (CNN) under the Log-Spectral Distance (LSD) metric.

18 citations
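A sketch of preparing the conditioning input, assuming librosa is used to compute a log-scale mel-spectrogram from a simulated 4 kHz signal which a conditional WaveNet would then consume; the file name, FFT size, hop length, and mel-band count are illustrative assumptions:

import numpy as np
import librosa

def logmel_condition(lowres_wav, sr=4000, n_mels=80, hop_length=50):
    """Log-scale mel-spectrogram of the low-resolution signal, used as the
    local conditioning input of the WaveNet (parameter values are illustrative)."""
    mel = librosa.feature.melspectrogram(
        y=lowres_wav, sr=sr, n_fft=400, hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)

# Example: load a 16 kHz utterance, simulate the 4 kHz input, build conditioning.
wav16, _ = librosa.load("utt.wav", sr=16000)        # hypothetical file name
wav4 = librosa.resample(wav16, orig_sr=16000, target_sr=4000)
cond = logmel_condition(wav4)                       # (n_mels, frames)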


Proceedings ArticleDOI
02 Sep 2018
TL;DR: This paper encodes the residual error into a style embedding via a neural network-based error encoder, which enables rapid adaptation to the desired style with only a single adaptation utterance.
Abstract: Synthesizing expressive speech with appropriate prosodic variations, e.g., various styles, still has much room for improvement. Previous methods have explored using manual annotations as conditioning attributes to provide variation information. However, the related training data are expensive to obtain and the annotated style codes can be ambiguous and unreliable. In this paper, we explore utilizing the residual error as a conditioning attribute. The residual error is the difference between the prediction of a trained average model and the ground truth. We encode the residual error into a style embedding via a neural network-based error encoder. The style embedding is then fed to the target synthesis model to provide information for modeling various style distributions more accurately. The average model and the error encoder are jointly optimized with the target synthesis model. Our proposed method has two advantages: 1) the embedding is automatically learned with no need for manual style annotations, which helps overcome data sparsity and ambiguity limitations; 2) for any unseen audio utterance, the style embedding can be efficiently generated, which enables rapid adaptation to the desired style with only a single adaptation utterance. Experimental results show that our proposed method outperforms the baseline model in both speech quality and style similarity.
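A minimal sketch of the residual-error style embedding, assuming mel-spectrogram-like acoustic features and a GRU encoder; the module names, feature and embedding sizes are assumptions, and the average model and synthesis model are placeholders:

import torch
import torch.nn as nn

class ErrorEncoder(nn.Module):
    """Encodes the residual (ground truth minus average-model prediction)
    into a fixed-size style embedding."""
    def __init__(self, feat_dim=80, style_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, style_dim)

    def forward(self, residual):            # (batch, frames, feat_dim)
        _, h = self.rnn(residual)
        return self.proj(h[-1])             # (batch, style_dim)

# Conceptual usage (average_model and synthesis_model are placeholders):
#   residual = target_feats - average_model(text_inputs)
#   style = ErrorEncoder()(residual)
#   predicted = synthesis_model(text_inputs, style)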

Proceedings Article
27 Sep 2018
TL;DR: In this article, a fully unsupervised learning algorithm was proposed to train a phoneme classifier for a given set of phoneme segmentation boundaries and refine the phoneme boundaries based on a given classifier.
Abstract: We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although this is still far from state-of-the-art supervised systems, we show that with oracle boundaries and a matching language model, the PER can be improved to 32.5%. This performance approaches the supervised system of the same model architecture, demonstrating the great potential of the proposed method.
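A simplified, unigram-only sketch of the distribution-matching idea behind the unsupervised cost: the classifier's batch-averaged output distribution over hypothesized segments is matched to a phoneme prior derived from the language model. The paper's Segmental Empirical Output Distribution Matching additionally handles segmental structure and richer language-model statistics:

import torch
import torch.nn.functional as F

def output_distribution_matching(segment_logits, lm_prior):
    """Simplified (unigram) empirical output distribution matching.

    segment_logits: (num_segments, num_phonemes) classifier outputs on the
                    hypothesized segments in a batch
    lm_prior:       (num_phonemes,) phoneme distribution from the language model
    Matches the batch-averaged predicted distribution to the prior via KL.
    """
    avg_pred = F.softmax(segment_logits, dim=-1).mean(dim=0)       # empirical output dist.
    return (lm_prior * (lm_prior.clamp_min(1e-8).log()
                        - avg_pred.clamp_min(1e-8).log())).sum()   # KL(prior || predicted)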

Proceedings ArticleDOI
Jia Cui, Chao Weng, Guangsen Wang, Jun Wang, Peidong Wang, Chengzhu Yu, Dan Su, Dong Yu
01 Dec 2018
TL;DR: A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss, and the sequence-based minimum Bayes risk (MBR) loss is also investigated; both loss functions significantly improve the baseline model performance.
Abstract: The acoustic model and the language model (LM) have been the two major components in conventional speech recognition systems. They are normally trained independently, but recently there has been a trend to optimize both components simultaneously in a unified end-to-end (E2E) framework. However, the performance gap between E2E systems and traditional hybrid systems suggests that some knowledge has not yet been fully utilized in the new framework. An observation is that current attention-based E2E systems can produce better recognition results when decoded with LMs which are independently trained on the same resource. In this paper, we focus on how to improve attention-based E2E systems without increasing model complexity or resorting to extra data. A novel training strategy is proposed for multi-task training with the connectionist temporal classification (CTC) loss. The sequence-based minimum Bayes risk (MBR) loss is also investigated. Our experiments on SWB 300hrs showed that both loss functions significantly improve the baseline model performance. The additional gain from joint-LM decoding remains the same for the CTC-trained model but is only marginal for the MBR-trained model. This implies that while the CTC loss function is able to capture more acoustic knowledge, the MBR loss function exploits more word/character dependency.

Proceedings ArticleDOI
01 Apr 2018
TL;DR: This paper proposes to adapt the PIT models with auxiliary features such as pitch and i-vector, and to exploit gender information with multi-task learning which jointly optimizes the speech recognition and speaker-pair prediction.
Abstract: In this paper, we extend our previous work on direct recognition of single-channel multi-talker mixed speech using permutation invariant training (PIT). We propose to adapt the PIT models with auxiliary features such as pitch and i-vector, and to exploit gender information with multi-task learning which jointly optimizes the speech recognition and speaker-pair prediction. We also compare CNN-BLSTMs against the BLSTM-RNNs used in our previous PIT-ASR model. The experimental results on the artificially mixed two-talker AMI data indicate that our proposed model improvements can reduce the word error rate (WER) by ∼10.0% relative to our previous work for both speakers in the mixed speech. Our results also confirm that PIT can be easily combined with advanced techniques to improve the performance of multi-talker speech recognition.

Proceedings ArticleDOI
Meng Yu, Xuan Ji, Gao Yi, Lianwu Chen, Jie Chen, Zheng Jimeng, Dan Su, Dong Yu
02 Sep 2018
TL;DR: It is demonstrated that KWD with TDSE frontend significantly outperforms the baseline KWD system with or without a generic speech enhancement in terms of equal error rate (EER) in the keyword detection evaluation.
Abstract: Keyword detection (KWD), also known as keyword spotting, is in great demand for small devices in the era of the Internet of Things. Despite recent progress, the performance of KWD, measured in terms of precision and recall rate, may still degrade significantly when either non-speech ambient noise or human voice and speech-like interference (e.g., TV, background competing talkers) is present. In this paper, we propose a general solution to address all kinds of environmental interferences. A novel text-dependent speech enhancement (TDSE) technique using a recurrent neural network (RNN) with long short-term memory (LSTM) is presented for improving the robustness of the small-footprint KWD task in the presence of environmental noises and interfering talkers. On our large simulated and recorded noisy and far-field evaluation sets, we show that TDSE significantly improves the quality of the target keyword speech and performs particularly well under speech interference conditions. We demonstrate that KWD with a TDSE frontend significantly outperforms the baseline KWD system, with or without a generic speech enhancement, in terms of equal error rate (EER) in the keyword detection evaluation.

Proceedings ArticleDOI
Chunlei Zhang, Chengzhu Yu, Chao Weng, Jia Cui, Dong Yu
01 Dec 2018
TL;DR: This study systematically explores using the word as the acoustic modeling unit for conversational speech recognition; by replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing lexicon-free weighted finite-state transducer (WFST) based decoding, the conventional hybrid speech recognition system is greatly simplified.
Abstract: Conventional acoustic models for automatic speech recognition (ASR) are usually constructed from sub-word units (e.g., context-dependent phonemes, graphemes, wordpieces, etc.). Recent studies demonstrate that connectionist temporal classification (CTC) based acoustic-to-word (A2W) models are also promising for ASR. Such structures have drawn increasing attention as they can directly target words as output units, which simplifies the ASR pipeline by avoiding an additional pronunciation lexicon, or even a language model. In this study, we systematically explore using the word as the acoustic modeling unit for conversational speech recognition. By replacing senone alignment with word alignment in a convolutional bidirectional LSTM architecture and employing lexicon-free weighted finite-state transducer (WFST) based decoding, we greatly simplify the conventional hybrid speech recognition system. On the Hub5-2000 Switchboard/CallHome test sets with 300 hours of training data, we achieve a WER that is close to that of senone-based hybrid systems with WFST-based decoding.

Proceedings ArticleDOI
Lianwu Chen, Meng Yu, Yanmin Qian, Dan Su, Dong Yu
02 Sep 2018
TL;DR: It is found that SSGAN-PIT outperforms SSGAN without PIT and neural network based speech separation with or without PIT, which confirms the feasibility of the proposed model and training approach for efficient speech separation.
Abstract: We explore generative adversarial networks (GANs) for speech separation, particularly with permutation invariant training (SSGAN-PIT). Prior work [1] demonstrates that GANs can be implemented to suppress additive noise in noisy speech waveforms and improve perceptual speech quality. In this work, we train GANs for speech separation that enhance multiple speech sources simultaneously, with the permutation issue addressed by utterance-level PIT in the training of the generator network. We propose operating GANs on the power spectrum domain instead of waveforms to reduce computation. To better exploit time dependencies, recurrent neural networks (RNNs) with long short-term memory (LSTM) are adopted for both the generator and the discriminator in this study. We evaluated SSGAN-PIT on the WSJ0 two-talker mixed speech separation task and found that SSGAN-PIT outperforms SSGAN without PIT and neural network based speech separation with or without PIT. The evaluation confirms the feasibility of the proposed model and training approach for efficient speech separation. The convergence behaviors of permutation invariant training and adversarial training are also analyzed.
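A rough sketch of the generator-side objective, assuming an LSTM discriminator over power spectra and a separate utterance-level PIT separation loss (for example, the permutation-invariant MSE sketched earlier in this listing); the adversarial form and weight are illustrative assumptions:

import torch
import torch.nn as nn

class SpectrumDiscriminator(nn.Module):
    """LSTM discriminator over power spectra (sizes are illustrative)."""
    def __init__(self, n_freq=129, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, spec):                       # (batch, time, freq)
        h, _ = self.rnn(spec)
        return torch.sigmoid(self.out(h[:, -1]))   # real/fake score per utterance

def generator_loss(d_fake_scores, pit_separation_loss, adv_weight=0.05):
    """Generator objective: utterance-level PIT separation loss plus an
    adversarial term pushing separated spectra toward the 'real' decision.
    adv_weight is an illustrative trade-off, not the paper's value."""
    adversarial = -torch.log(d_fake_scores.clamp_min(1e-8)).mean()
    return pit_separation_loss + adv_weight * adversarial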

Journal ArticleDOI
Yanmin Qian, Chao Weng, Xuankai Chang, Shuai Wang, Dong Yu
TL;DR: In the original version of this article, the affiliations were incorrect.
Abstract: In the original version of this article, the affiliations are incorrect. The correct affiliations are given above. The corresponding author’s E-mail address should be yanminqian@sjtu.edu.cn.

Posted Content
Chao Weng, Dong Yu
TL;DR: It is demonstrated that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using the LF-bMMI or LF-sMBR criterion, without the need for cross-entropy pre-training.
Abstract: In this work, three lattice-free (LF) discriminative training criteria for purely sequence-trained neural network acoustic models are compared on LVCSR tasks, namely maximum mutual information (MMI), boosted maximum mutual information (bMMI) and state-level minimum Bayes risk (sMBR). We demonstrate that, analogous to LF-MMI, a neural network acoustic model can also be trained from scratch using the LF-bMMI or LF-sMBR criterion without the need for cross-entropy pre-training. Furthermore, experimental results on the Switchboard-300hrs and Switchboard+Fisher-2100hrs datasets show that models trained with LF-bMMI consistently outperform those trained with plain LF-MMI and achieve a relative word error rate (WER) reduction of 5% over competitive temporal convolution projected LSTM (TDNN-LSTMP) LF-MMI baselines.
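For reference, the boosted MMI objective that the lattice-free variant optimizes can be written in its standard form (Povey et al., 2008); notation may differ from the paper's:

\mathcal{F}_{\mathrm{bMMI}}(\theta) = \sum_{u} \log \frac{p_\theta(X_u \mid S_{W_u})^{\kappa}\, P(W_u)}{\sum_{W} p_\theta(X_u \mid S_W)^{\kappa}\, P(W)\, e^{-b\, A(W, W_u)}}

where $X_u$ is the observation sequence of utterance $u$, $W_u$ its reference word sequence, $S_W$ the state sequence of hypothesis $W$, $\kappa$ the acoustic scale, $b$ the boosting factor, and $A(W, W_u)$ the raw accuracy of $W$ against the reference. Setting $b = 0$ recovers plain MMI, and in the lattice-free setting the denominator sum is computed over a denominator graph rather than word lattices.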

Posted Content
TL;DR: In this article, a fully unsupervised learning algorithm was proposed to train a phoneme classifier for a given set of phoneme segmentation boundaries and refine the phoneme boundaries based on a given classifier.
Abstract: We consider the problem of training speech recognition systems without using any labeled data, under the assumption that the learner can only access the input utterances and a phoneme language model estimated from a non-overlapping corpus. We propose a fully unsupervised learning algorithm that alternates between solving two sub-problems: (i) learning a phoneme classifier for a given set of phoneme segmentation boundaries, and (ii) refining the phoneme boundaries based on a given classifier. To solve the first sub-problem, we introduce a novel unsupervised cost function named Segmental Empirical Output Distribution Matching, which generalizes the work in (Liu et al., 2017) to segmental structures. For the second sub-problem, we develop an approximate MAP approach to refining the boundaries obtained from Wang et al. (2017). Experimental results on the TIMIT dataset demonstrate the success of this fully unsupervised phoneme recognition system, which achieves a phone error rate (PER) of 41.6%. Although this is still far from state-of-the-art supervised systems, we show that with oracle boundaries and a matching language model, the PER can be improved to 32.5%. This performance approaches the supervised system of the same model architecture, demonstrating the great potential of the proposed method.