Conference

International Conference on Acoustics, Speech, and Signal Processing 

About: The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) is an academic conference. It publishes primarily in the areas of speech processing and adaptive filtering. Over its lifetime, the conference has published 46,388 papers, which have received 886,494 citations.


Papers
Proceedings ArticleDOI
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

7,316 citations
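To make the architecture concrete, here is a minimal PyTorch sketch of a stacked bidirectional LSTM trained with CTC loss, in the spirit of the deep RNN + CTC setup described above. It is not the authors' implementation; the feature dimension, layer count, hidden size, and symbol inventory are illustrative assumptions.

# Hedged sketch of a deep bidirectional LSTM acoustic model with CTC loss,
# loosely following the deep RNN + CTC recipe described in the abstract above.
# Dimensions (40-dim filterbanks, 3 layers, 250 hidden units, 62 output symbols)
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class DeepBLSTMCTC(nn.Module):
    def __init__(self, n_feats=40, hidden=250, layers=3, n_symbols=62):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, num_layers=layers,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_symbols)  # n_symbols includes the CTC blank

    def forward(self, x):
        # x: (batch, time, n_feats) -> per-frame log-probabilities over symbols
        out, _ = self.rnn(x)
        return self.proj(out).log_softmax(dim=-1)

model = DeepBLSTMCTC()
ctc = nn.CTCLoss(blank=0)

# Dummy batch: 2 utterances of 100 frames, target sequences of length 20.
feats = torch.randn(2, 100, 40)
targets = torch.randint(1, 62, (2, 20))
log_probs = model(feats).transpose(0, 1)           # CTCLoss expects (time, batch, classes)
input_lengths = torch.full((2,), 100, dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()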

Proceedings ArticleDOI
19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.

4,770 citations
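As a quick orientation for working with the corpus, the sketch below pulls a LibriSpeech split through torchaudio's dataset wrapper. This is a later convenience API, not the Kaldi scripts released with the paper, and the split name and returned field order should be checked against the torchaudio version in use.

# Hedged sketch: downloading and iterating a LibriSpeech split via torchaudio.
# Split names such as "train-clean-100" mirror the corpus layout.
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True)

waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate)        # 16000 Hz, as described in the abstract
print(waveform.shape)     # (channels, samples)
print(transcript[:60])    # normalized transcript text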

Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Abstract: In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.

2,300 citations
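The additive-noise half of this kind of augmentation can be sketched in a few lines (the reverberation half would convolve the waveform with a room impulse response). The snippet below is a generic illustration of mixing at a target SNR, not the recipe used in the paper.

# Hedged sketch of additive-noise augmentation at a target SNR, in the spirit of
# the noise/reverberation augmentation described above. Generic illustration only.
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` so the result has roughly the requested SNR."""
    noise = np.resize(noise, speech.shape)            # tile/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # 1 s of placeholder 16 kHz audio
babble = rng.standard_normal(8000)                    # shorter noise clip
noisy = add_noise(clean, babble, snr_db=10.0)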

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers is presented.
Abstract: We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters, and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.

2,279 citations
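A small sketch of the listener's key ingredient follows: a pyramidal BLSTM layer that concatenates adjacent frames and so halves the time resolution before each recurrent layer. Layer sizes are illustrative assumptions, not the paper's configuration.

# Hedged sketch of one "pyramidal" BLSTM layer as described for the LAS listener:
# adjacent frames are concatenated before the layer, halving the time resolution.
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    def __init__(self, in_dim, hidden):
        super().__init__()
        self.rnn = nn.LSTM(2 * in_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, x):
        # x: (batch, time, in_dim); drop a trailing frame if time is odd.
        b, t, d = x.shape
        x = x[:, : t - (t % 2), :].reshape(b, t // 2, 2 * d)
        out, _ = self.rnn(x)
        return out                                    # (batch, time // 2, 2 * hidden)

feats = torch.randn(4, 200, 40)                       # e.g. 200 filter bank frames
layer1 = PyramidalBLSTM(40, 256)
layer2 = PyramidalBLSTM(512, 256)                     # input dim = 2 * hidden of layer 1
h = layer2(layer1(feats))
print(h.shape)                                        # torch.Size([4, 50, 512])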

Proceedings ArticleDOI
05 Mar 2017
TL;DR: The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations
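For reference, the released Audio Set segment lists pair a YouTube ID and a 10-second window with one or more ontology label IDs. The sketch below parses that CSV layout; the column order follows the published release format and should be verified against the current files.

# Hedged sketch of parsing an Audio Set segments CSV (e.g. balanced_train_segments.csv),
# whose rows look like:  YTID, start_seconds, end_seconds, "/m/09x0r,/m/04rlf"
import csv

def read_segments(path):
    segments = []
    with open(path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):      # skip header/comment lines
                continue
            ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
            segments.append((ytid, start, end, labels.split(",")))
    return segments

# Example: count how many 10 s segments carry the "Speech" label (/m/09x0r).
# segs = read_segments("balanced_train_segments.csv")
# n_speech = sum("/m/09x0r" in labels for _, _, _, labels in segs)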

Performance
Metrics
No. of papers from the Conference in previous years
Year    Papers
2021    1,723
2020    1,863
2019    1,738
2018    1,390
2017    1,333
2016    1,325