Showing papers by "Vincent Vanhoucke published in 2012"

PDF

Open Access

Journal Article•DOI•

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

[...]

Geoffrey E. Hinton¹, Li Deng², Dong Yu², George E. Dahl¹, Abdelrahman Mohamed¹, Navdeep Jaitly¹, Andrew W. Senior³, Vincent Vanhoucke³, Patrick Nguyen³, Tara N. Sainath⁴, Brian Kingsbury⁴ - Show less +7 more•Institutions (4)

University of Toronto¹, Microsoft², Google³, IBM⁴

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

9,091 citations

Journal Article•

Deep Neural Networks for Acoustic Modeling in Speech Recognition

[...]

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury - Show less +7 more

01 Nov 2012-IEEE Signal Processing Magazine

TL;DR: This paper provides an overview of this progress and repres nts the shared views of four research groups who have had recent successes in using deep neural networks for a coustic modeling in speech recognition.

...read moreread less

2,527 citations

Proceedings Article•

Application Of Pretrained Deep Neural Networks To Large Vocabulary Speech Recognition

[...]

Navdeep Jaitly¹, Patrick Nguyen², Andrew W. Senior², Vincent Vanhoucke²•Institutions (2)

University of Toronto¹, Google²

01 Jan 2012

TL;DR: This paper reports results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously, and outperforms the best Gaussian Mixture Model Hidden Markov Model baseline.

...read moreread less

Abstract: The use of Deep Belief Networks (DBN) to pretrain Neural Networks has recently led to a resurgence in the use of Artificial Neural Network Hidden Markov Model (ANN/HMM) hybrid systems for Automatic Speech Recognition (ASR). In this paper we report results of a DBN-pretrained context-dependent ANN/HMM system trained on two datasets that are much larger than any reported previously with DBN-pretrained ANN/HMM systems 5870 hours of Voice Search and 1400 hours of YouTube data. On the first dataset, the pretrained ANN/HMM system outperforms the best Gaussian Mixture Model Hidden Markov Model (GMM/HMM) baseline, built with a much larger dataset by 3.7% absolute WER, while on the second dataset, it outperforms the GMM/HMM baseline by 4.7% absolute. Maximum Mutual Information (MMI) fine tuning and model combination using Segmental Conditional Random Fields (SCARF) give additional gains of 0.1% and 0.4% on the first dataset and 0.5% and 0.9% absolute on the second dataset.

...read moreread less

273 citations

Patent•

System and method for using image analysis and search in E-commerce

[...]

Salih Burak Gokturk, Baris Sumengen¹, Diem Vu, Navneet Dalal¹, Danny Yang¹, Xiaofan Lin¹, Azhar Khan¹, Munjal Shah¹, Dragomir Anguelov¹, Lorenzo Torresani¹, Vincent Vanhoucke¹ - Show less +7 more•Institutions (1)

Google¹

16 Feb 2012

TL;DR: In this paper, an image of a merchandise item is obtained and the image is programmatically analyzed to determine information about the item and the information is used to generate a presentation that includes the item.

...read moreread less

Abstract: Embodiments described herein provide a system and method for providing merchandise items at a network site. According to an embodiment, an image of a merchandise item is obtained. The image is programmatically analyzed to determine information about the merchandise item. The information is used to generate a presentation that includes the merchandise item.

...read moreread less

129 citations

Patent•

Computer-Implemented Method for Performing Similarity Searches

[...]

Vincent Vanhoucke, Salih Burak Gokturk¹, Dragomir Anguelov, Kuang-Chih Lee¹, Munjal Shah¹, Ashwin Tengli¹ - Show less +2 more•Institutions (1)

Google¹

14 Sep 2012

TL;DR: In this article, a similarity search is performed on the image of a person, using visual characteristics and information that is known about the person, and the search identifies images of other persons that are similar in appearance to the person in the image.

...read moreread less

Abstract: A similarity search may be performed on the image of a person, using visual characteristics and information that is known about the person. The search identifies images of other persons that are similar in appearance to the person in the image.

...read moreread less

126 citations

The shared views of four research groups )

[...]

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew W. Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, Brian Kingsbury - Show less +7 more

01 Jan 2012

84 citations

Patent•DOI•

Multi-frame prediction for hybrid neural network/hidden Markov models

[...]

Vincent Vanhoucke¹•Institutions (1)

Google¹

27 Jul 2012-Journal of the Acoustical Society of America

TL;DR: A method and system for multi-frame prediction in a hybrid neural network/hidden Markov model automatic speech recognition (ASR) system is disclosed.

...read moreread less

Abstract: A method and system for multi-frame prediction in a hybrid neural network/hidden Markov model automatic speech recognition (ASR) system is disclosed. An audio input signal may be transformed into a time sequence of feature vectors, each corresponding to respective temporal frame of a sequence of periodic temporal frames of the audio input signal. The time sequence of feature vectors may be concurrently input to a neural network, which may process them concurrently. In particular, the neural network may concurrently determine for the time sequence of feature vectors a set of emission probabilities for a plurality of hidden Markov models of the ASR system, where the set of emission probabilities are associated with the temporal frames. The set of emission probabilities may then be concurrently applied to the hidden Markov models for determining speech content of the audio input signal.

...read moreread less

77 citations

Application of Pretrained Deep Neural Networks to Large Vocabulary Conversational Speech Recognition

[...]

Navdeep Jaitly, Patrick Nguyen, Andrew W. Senior, Vincent Vanhoucke

01 Jan 2012

48 citations

Patent•

Adaptive auto-encoders

[...]

Vincent Vanhoucke¹•Institutions (1)

Google¹

27 Jul 2012

TL;DR: In this article, an adaptive auto-encoder is proposed to compensate for supra-phonetic features by reducing the magnitude of an error signal corresponding to a difference between the normalized signal and the recovered form of the quantitative measures.

...read moreread less

Abstract: A method and system for adaptive auto-encoders is disclosed. An input audio training signal may be transformed into a sequence of feature vectors, each bearing quantitative measures of acoustic properties of the input audio training signal. An auto-encoder may process the feature vectors to generate an encoded form of the quantitative measures, and a recovered form of the quantitative measures based on an inverse operation by the auto-encoder on the encoded form of the quantitative measures. A duplicate copy of the sequence of feature vectors may be normalized to form a normalized signal in which supra-phonetic acoustic properties are reduced in comparison with phonetic acoustic properties of the input audio training signal. The auto-encoder may then be trained to compensate for supra-phonetic features by reducing the magnitude of an error signal corresponding to a difference between the normalized signal and the recovered form of the quantitative measures.

...read moreread less

41 citations

Proceedings Article•DOI•

Investigations on exemplar-based features for speech recognition towards thousands of hours of unsupervised, noisy data

[...]

Georg Heigold¹, Patrick Nguyen¹, Mitchel Weintraub¹, Vincent Vanhoucke¹•Institutions (1)

Google¹

25 Mar 2012

TL;DR: A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model and focuses on the refined modeling of words with sufficient data.

...read moreread less

Abstract: The acoustic models in state-of-the-art speech recognition systems are based on phones in context that are represented by hidden Markov models. This modeling approach may be limited in that it is hard to incorporate long-span acoustic context. Exemplar-based approaches are an attractive alter-native, in particular if massive data and computational power are available. Yet, most of the data at Google are unsupervised and noisy. This paper investigates an exemplar-based approach under this yet not well understood data regime. A log-linear rescoring framework is used to combine the exemplar-based features on the word level with the first-pass model. This approach guarantees at least baseline performance and focuses on the refined modeling of words with sufficient data. Experimental results for the Voice Search and the YouTube tasks are presented.

...read moreread less

17 citations

Patent•

Speech recognition process

[...]

Georg Heigold¹, Patrick Nguyen¹, Mitchel Weintraub¹, Vincent Vanhoucke¹•Institutions (1)

Google¹

31 Oct 2012

TL;DR: In this article, a speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio, generating first templates corresponding to the audio, where each first template includes a number of elements, and selecting second templates representing second audio, and where each second template includes elements that correspond to the elements in the first templates.

...read moreread less

Abstract: A speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing comprises includes similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio.

...read moreread less