Showing papers by "Vincent Vanhoucke published in 2013"

PDF

Open Access

Proceedings Article•DOI•

On rectified linear units for speech processing

[...]

Matthew D. Zeiler¹, Marc'Aurelio Ranzato², Rajat Monga², Mark Z. Mao², Ke Yang², Quoc V. Le², Patrick Nguyen², Andrew W. Senior², Vincent Vanhoucke², Jeffrey Dean², Geoffrey E. Hinton³ - Show less +7 more•Institutions (3)

New York University¹, Google², University of Toronto³

26 May 2013

TL;DR: This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.

...read moreread less

Abstract: Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function In this work, we show that we can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units These units are linear when their input is positive and zero otherwise In a supervised setting, we can successfully train very deep nets from random initialization on a large vocabulary speech recognition task achieving lower word error rates than using a logistic network with the same topology Similarly in an unsupervised setting, we show how we can learn sparse features that can be useful for discriminative tasks All our experiments are executed in a distributed environment using several hundred machines and several hundred hours of speech data

...read moreread less

541 citations

Proceedings Article•DOI•

Multilingual acoustic models using distributed deep neural networks

[...]

Georg Heigold¹, Vincent Vanhoucke¹, Andrew W. Senior¹, Patrick Nguyen¹, Marc'Aurelio Ranzato¹, Matthieu Devin¹, Jeffrey Dean¹ - Show less +3 more•Institutions (1)

Google¹

26 May 2013

TL;DR: Experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total show average relative gains over the monolingual baselines, but additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks.

...read moreread less

Abstract: Today's speech recognition technology is mature enough to be useful for many practical applications. In this context, it is of paramount importance to train accurate acoustic models for many languages within given resource constraints such as data, processing power, and time. Multilingual training has the potential to solve the data issue and close the performance gap between resource-rich and resource-scarce languages. Neural networks lend themselves naturally to parameter sharing across languages, and distributed implementations have made it feasible to train large networks. In this paper, we present experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total. The average relative gains over the monolingual baselines are 4%/2% (data-scarce/data-rich languages) for cross- and 7%/2% for multi-lingual training. However, the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks, compared to two weeks (monolingual) and one week (crosslingual).

...read moreread less

330 citations

Proceedings Article•DOI•

Multiframe deep neural networks for acoustic modeling

[...]

Vincent Vanhoucke¹, Matthieu Devin¹, Georg Heigold¹•Institutions (1)

Google¹

26 May 2013

TL;DR: This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neuralnetwork activations.

...read moreread less

Abstract: Deep neural networks have been shown to perform very well as acoustic models for automatic speech recognition. Compared to Gaussian mixtures however, they tend to be very expensive computationally, making them challenging to use in real-time applications. One key advantage of such neural networks is their ability to learn from very long observation windows going up to 400 ms. Given this very long temporal context, it is tempting to wonder whether one can run neural networks at a lower frame rate than the typical 10 ms, and whether there might be computational benefits to doing so. This paper describes a method of tying the neural network parameters over time which achieves comparable performance to the typical frame-synchronous model, while achieving up to a 4X reduction in the computational cost of the neural network activations.

...read moreread less

61 citations

Patent•

Frame-level combination of deep neural network and gaussian mixture models

[...]

Hui Lin¹, Xin Lei¹, Vincent Vanhoucke¹•Institutions (1)

Google¹

12 Feb 2013

TL;DR: In this paper, a method and system for frame-level merging of HMM state predictions determined by different techniques is disclosed, where audio input signals are transformed into a first and second sequence of feature vector, the sequences corresponding to each other and to a temporal sequence of frames of the audio input signal on a frame-by-frame basis.

...read moreread less

Abstract: A method and system for frame-level merging of HMM state predictions determined by different techniques is disclosed. An audio input signal may be transformed into a first and second sequence of feature vector, the sequences corresponding to each other and to a temporal sequence of frames of the audio input signal on a frame-by-frame basis. The first sequence may be processed by a neural network (NN) to determine NN-based state predictions, and the second sequence may be processed by a Gaussian mixture model (GMM) to determine GMM-based state predictions. The NN-based and GMM-based state predictions may be merged as weighted sums for each of a plurality of HMM state on a frame-by-frame basis to determine merged state predictions. The merged state predictions may then be applied to the HMMs to speech content of the audio input signal.

...read moreread less

21 citations

Patent•

Multilingual, acoustic deep neural networks

[...]

Vincent Vanhoucke¹, Jeffrey Dean¹, Georg Heigold¹, Marc'Aurelio Ranzato¹, Matthieu Devin¹, Patrick Nguyen¹, Andrew W. Senior¹ - Show less +3 more•Institutions (1)

Google¹

15 Apr 2013

TL;DR: In this paper, a multilingual deep neural network (DNN) acoustic model may be processed based on the training data and weights associated with the multiple layers of one or more nodes of the processed multilingual DNN acoustic model can be stored in a database.

...read moreread less

Abstract: Methods and systems for processing multilingual DNN acoustic models are described. An example method may include receiving training data that includes a respective training data set for each of two or more or languages. A multilingual deep neural network (DNN) acoustic model may be processed based on the training data. The multilingual DNN acoustic model may include a feedforward neural network having multiple layers of one or more nodes. Each node of a given layer may connect with a respective weight to each node of a subsequent layer, and the multiple layers of one or more nodes may include one or more shared hidden layers of nodes and a language-specific output layer of nodes corresponding to each of the two or more languages. Additionally, weights associated with the multiple layers of one or more nodes of the processed multilingual DNN acoustic model may be stored in a database.

...read moreread less

15 citations

Patent•

Keyword detection without decoding

[...]

Vincent Vanhoucke¹, Oriol Vinyals¹, Patrick Nguyen¹, Maria Carolina Parada San Martin¹, Johan Schalkwyk¹ - Show less +1 more•Institutions (1)

Google¹

11 Apr 2013

TL;DR: In this article, an audio waveform is received at a mobile device, followed by acoustic modeling, high level feature extraction, and output classification to detect the keyword, using a neural network or a vector quantization dictionary.

...read moreread less

Abstract: Embodiments pertain to automatic speech recognition in mobile devices to establish the presence of a keyword. An audio waveform is received at a mobile device. Front-end feature extraction is performed on the audio waveform, followed by acoustic modeling, high level feature extraction, and output classification to detect the keyword. Acoustic modeling may use a neural network or a vector quantization dictionary and high level feature extraction may use pooling.

...read moreread less

10 citations