Author

Dimitri Kanevsky

Bio: Dimitri Kanevsky is an academic researcher from Google. He has contributed to research on topics including Speaker recognition and Sparse approximation, has an h-index of 62, and has co-authored 362 publications receiving 14,072 citations. His previous affiliations include GlobalFoundries and Nuance Communications.


Papers
Patent
Dimitri Kanevsky
06 Jul 1998
TL;DR: A web page adaptation system and method, as described in this patent, organizes viewing material associated with web sites for the visual displays and windows on which home pages are viewed; a different viewing-access strategy is provided for visual devices ranging from standard PC monitors, laptop screens and palmtops to web phone and digital camera displays, and from large windows to small windows.
Abstract: A web page adaptation system and method provides organization of viewing material associated with web sites for visual displays and windows on which these home pages are being viewed. A different viewing-access strategy is provided for such visual devices varying, for example, from standard PC monitors, laptop screens and palmtops to web phone and digital camera displays and from large windows to small windows. A new web site design incorporates features that permit automatic display of the content of home pages in the friendliest manner for a user viewing this content from a screen or window of a given size. For example, if the size of a display screen or window allows, links are displayed with some text or pictures to which they are linked. Conversely, if the size of a screen or window does not allow display of all textual and icon information on a whole screen or window, the home page is mapped into hierarchically linked new smaller pages that fully fit the current display or window. The unique display strategy of the invention is provided by a web page adaptation scheme that is implemented on a web site server, incorporated in a web browser (e.g., as a Java applet), or both. This adaptation strategy employs variables that provide the size of the screen and/or window from which a call to a web site was initiated.

744 citations
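
The decision logic in this abstract (show full content when it fits, otherwise split the home page into hierarchically linked smaller pages) can be sketched roughly as follows. This is an illustrative Python sketch under assumed names and metrics (Section, content_height, adapt_page), not the patented implementation.

```python
# Illustrative sketch (not the patented implementation): adapt a home page to a
# viewport by either rendering it in full or splitting it into linked sub-pages.

from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    content_height: int                 # rendered height in pixels (hypothetical metric)
    children: list = field(default_factory=list)


def adapt_page(sections, viewport_height, header_height=40):
    """Return a page layout that fits the viewport.

    If all sections fit, render them in full; otherwise emit a hierarchical
    index page whose entries link to smaller per-section pages.
    """
    total = header_height + sum(s.content_height for s in sections)
    if total <= viewport_height:
        return {"type": "full_page", "sections": [s.title for s in sections]}

    if len(sections) == 1 and not sections[0].children:
        # A single leaf section that still overflows is shown on its own scrollable page.
        return {"type": "scroll_page", "section": sections[0].title}

    # Content does not fit: map the home page onto hierarchically linked sub-pages.
    index = {"type": "index_page", "links": []}
    for s in sections:
        sub_page = adapt_page(s.children or [s], viewport_height, header_height)
        index["links"].append({"label": s.title, "target": sub_page})
    return index


if __name__ == "__main__":
    home = [Section("News", 500), Section("Products", 900), Section("Contact", 120)]
    print(adapt_page(home, viewport_height=600))
```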

PatentDOI
Dimitri Kanevsky, Stephane H. Maes
TL;DR: In this article, a method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features is presented.
Abstract: A method and apparatus for securing access to a service or facility employing automatic speech recognition, text-independent speaker identification, natural language understanding techniques and additional dynamic and static features. The method includes the steps of receiving and decoding speech containing indicia of the speaker such as a name, address or customer number; accessing a database containing information on candidate speakers; questioning the speaker based on the information; receiving, decoding and verifying an answer to the question; obtaining a voice sample of the speaker and verifying the voice sample against a model; generating a score based on the answer and the voice sample; and granting access if the score is equal to or greater than a threshold. Alternatively, the method includes the steps of receiving and decoding speech containing indicia of the speaker; generating a sub-list of speaker candidates having indicia substantially matching the speaker; activating databases containing information about the speaker candidates in the sub-list; performing voice classification analysis; eliminating speaker candidates based on the voice classification analysis; questioning the speaker regarding the information; eliminating speaker candidates based on the answer; and iteratively repeating the prior steps until either one speaker candidate remains (in which case the speaker is granted access) or no speaker candidate remains (in which case access is denied).

474 citations
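
The score-and-threshold decision described above can be illustrated with a minimal sketch. The fusion weights, threshold, and helper names below are assumptions made for illustration; the patent does not specify this particular scoring function.

```python
# Hedged sketch of the score-and-threshold decision described in the abstract:
# combine a knowledge-question score with a voice-verification score and grant
# access when the fused score reaches a threshold. Weights, threshold, and the
# helper names are illustrative assumptions, not the patented algorithm.

def fuse_scores(answer_correctness, voice_match_score, w_answer=0.4, w_voice=0.6):
    """Weighted combination of the two evidence sources, each in [0, 1]."""
    return w_answer * answer_correctness + w_voice * voice_match_score


def grant_access(answer_correctness, voice_match_score, threshold=0.75):
    """Access is granted only if the fused score meets or exceeds the threshold."""
    return fuse_scores(answer_correctness, voice_match_score) >= threshold


if __name__ == "__main__":
    # Correct answer, strong voice match against the enrolled model -> access granted.
    print(grant_access(answer_correctness=1.0, voice_match_score=0.8))   # True
    # Correct answer but weak voice match -> access denied.
    print(grant_access(answer_correctness=1.0, voice_match_score=0.3))   # False
```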

Proceedings ArticleDOI
12 May 2008
TL;DR: A modified form of the maximum mutual information (MMI) objective function gives improved results for discriminative training by boosting the likelihoods of paths in the denominator lattice that have a higher phone error relative to the correct transcript.
Abstract: We present a modified form of the maximum mutual information (MMI) objective function which gives improved results for discriminative training. The modification consists of boosting the likelihoods of paths in the denominator lattice that have a higher phone error relative to the correct transcript, by using the same phone accuracy function that is used in Minimum Phone Error (MPE) training. We combine this with another improvement to our implementation of the Extended Baum-Welch update equations for MMI, namely the canceling of any shared part of the numerator and denominator statistics on each frame (a procedure that is already done in MPE). This change affects the Gaussian-specific learning rate. We also investigate another modification whereby we replace I-smoothing to the ML estimate with I-smoothing to the previous iteration's value. Boosted MMI gives better results than MPE in both model and feature-space discriminative training, although not consistently.

441 citations
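
The boosting described in the abstract is usually written as the objective below (this is the commonly cited form of boosted MMI; notation may differ slightly from the paper): denominator hypotheses s have their likelihoods scaled by exp(-b·A(s, s_r)), where A is the phone accuracy of s against the reference s_r, b is the boosting factor, and kappa is the acoustic scale.

```latex
% Boosted MMI objective in its commonly cited form (notation may differ from the paper):
% denominator paths with higher phone error relative to the reference s_r are boosted
% via the factor exp(-b * A(s, s_r)).
\[
\mathcal{F}_{\text{bMMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p_{\lambda}(\mathbf{x}_r \mid s_r)^{\kappa}\, P(s_r)}
         {\sum_{s} p_{\lambda}(\mathbf{x}_r \mid s)^{\kappa}\, P(s)\,
          e^{-b\, A(s,\, s_r)}}
\]
```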

Patent
23 Oct 1995
TL;DR: In this patent, a method of automatically aligning a written transcript with speech in video and audio clips is presented: an automatic speech recognizer decodes the recorded speech, and the decoded text is matched against the original written transcript to produce the alignment.
Abstract: A method of automatically aligning a written transcript with speech in video and audio clips. The disclosed technique involves, as a basic component, an automatic speech recognizer. The automatic speech recognizer decodes speech (recorded on a tape) and produces a file with a decoded text. This decoded text is then matched with the original written transcript via identification of similar words or clusters of words. The result of this matching is an alignment of the speech with the original transcript. The method can be used (a) to create indexing of video clips, (b) for "teleprompting" (i.e., showing the next portion of text when someone is reading from a television screen), or (c) to enhance editing of a text that was dictated to a stenographer or recorded on a tape for its subsequent textual reproduction by a typist.

376 citations
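
The matching step (identifying similar words or word clusters shared by the decoded text and the original transcript) can be approximated with a generic sequence matcher. The sketch below uses Python's difflib as a stand-in; the patent's own matching procedure may differ.

```python
# Illustrative sketch of the matching step: align words decoded by a speech
# recognizer with the original written transcript by finding shared word runs.
# difflib is used here as a generic matcher, not as the patented method.

from difflib import SequenceMatcher


def align(decoded_words, transcript_words):
    """Return (decoded_index, transcript_index, length) triples of matching word runs."""
    matcher = SequenceMatcher(a=decoded_words, b=transcript_words, autojunk=False)
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


if __name__ == "__main__":
    decoded = "the quick brown socks jumps over the lazy dog".split()   # recognizer output with an error
    original = "the quick brown fox jumps over the lazy dog".split()    # original transcript
    for d, t, n in align(decoded, original):
        print(f"decoded[{d}:{d+n}] ~ transcript[{t}:{t+n}]: {original[t:t+n]}")
```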

Patent
31 Aug 2004
TL;DR: In this patent, an improved apparatus and method are provided for operating devices and systems in a motor vehicle while reducing vehicle operator distractions: one or more touch-sensitive pads are mounted on the steering wheel of the motor vehicle.
Abstract: An improved apparatus and method is provided for operating devices and systems in a motor vehicle, while at the same time reducing vehicle operator distractions. One or more touch sensitive pads are mounted on the steering wheel of the motor vehicle, and the vehicle operator touches the pads in a pre-specified synchronized pattern, to perform functions such as controlling operation of the radio or adjusting a window. At least some of the touch patterns used to generate different commands may be selected by the vehicle operator. Usefully, the system of touch pad sensors and the signals generated thereby are integrated with speech recognition and/or facial gesture recognition systems, so that commands may be generated by synchronized multi-mode inputs.

361 citations
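
A minimal sketch of mapping pre-specified touch-pad sequences to commands is given below. The pad names, patterns, and commands are invented for illustration; the patent additionally covers fusing such input with speech and facial-gesture recognition.

```python
# Hedged sketch: look up a pre-specified touch-pad pattern and return the vehicle
# command it encodes. Pattern and command names are hypothetical examples.

TOUCH_COMMANDS = {
    ("left_pad", "left_pad"): "radio_volume_down",
    ("right_pad", "right_pad"): "radio_volume_up",
    ("left_pad", "right_pad"): "window_down",
    ("right_pad", "left_pad"): "window_up",
}


def decode_touch_sequence(touches):
    """Return the command for a recorded touch sequence, or None if it matches nothing."""
    return TOUCH_COMMANDS.get(tuple(touches))


if __name__ == "__main__":
    print(decode_touch_sequence(["left_pad", "right_pad"]))  # window_down
    print(decode_touch_sequence(["left_pad"]))               # None
```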


Cited by
Journal ArticleDOI
TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

9,091 citations
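
In hybrid systems of the kind this overview describes, the network's posterior over HMM states is commonly converted to a scaled likelihood by dividing by the state prior, so that it can replace the GMM likelihood during decoding (a standard relation; the notation below is mine, not quoted from the article):

```latex
% Standard hybrid-system relation (notation mine): the DNN posterior over HMM
% state s given acoustic frame x_t is divided by the state prior to obtain a
% scaled likelihood usable in place of the GMM likelihood during decoding.
\[
p(\mathbf{x}_t \mid s) \;\propto\; \frac{P(s \mid \mathbf{x}_t)}{P(s)}
\]
```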

Proceedings ArticleDOI
19 Apr 2015
TL;DR: It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
Abstract: This paper introduces a new corpus of read English speech, suitable for training and evaluating speech recognition systems. The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz. We have made the corpus freely available for download, along with separately prepared language-model training data and pre-built language models. We show that acoustic models trained on LibriSpeech give lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself. We are also releasing Kaldi scripts that make it easy to build these systems.

4,770 citations
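
Outside the Kaldi recipes released with the corpus, one common way to load LibriSpeech is through torchaudio's built-in dataset wrapper, sketched below (the choice of the test-clean split and the local path are assumptions for illustration):

```python
# Minimal sketch, assuming torchaudio is installed and the test-clean split is
# downloadable: each item is a 16 kHz waveform plus its reference transcript.

import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)

waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate)       # 16000, matching the corpus description
print(transcript[:60])   # beginning of the reference transcript
```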

Book
01 Jan 1996
TL;DR: The Bayes error and Vapnik-Chervonenkis theory are applied as guides for empirical classifier selection, alongside treatments of the maximum likelihood principle, nearest neighbor and kernel rules, tree classifiers, and neural networks.
Abstract: Preface * Introduction * The Bayes Error * Inequalities and alternate distance measures * Linear discrimination * Nearest neighbor rules * Consistency * Slow rates of convergence * Error estimation * The regular histogram rule * Kernel rules * Consistency of the k-nearest neighbor rule * Vapnik-Chervonenkis theory * Combinatorial aspects of Vapnik-Chervonenkis theory * Lower bounds for empirical classifier selection * The maximum likelihood principle * Parametric classification * Generalized linear discrimination * Complexity regularization * Condensed and edited nearest neighbor rules * Tree classifiers * Data-dependent partitioning * Splitting the data * The resubstitution estimate * Deleted estimates of the error probability * Automatic kernel rules * Automatic nearest neighbor rules * Hypercubes and discrete spaces * Epsilon entropy and totally bounded sets * Uniform laws of large numbers * Neural networks * Other error estimates * Feature extraction * Appendix * Notation * References * Index

3,598 citations
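
The "Bayes Error" chapter refers to the standard quantities below (binary case; the notation is mine): with posterior η(x) = P(Y = 1 | X = x), the Bayes classifier and its error, the lowest error probability any classifier can achieve, are

```latex
% Standard definitions behind the "Bayes Error" chapter (binary case, notation mine):
% the Bayes classifier thresholds the posterior at 1/2, and no classifier can have
% error probability below L*.
\[
g^{*}(x) = \mathbf{1}\{\eta(x) > 1/2\},
\qquad
L^{*} = \mathbb{E}\!\left[\min\bigl(\eta(X),\, 1 - \eta(X)\bigr)\right]
\]
```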

Journal ArticleDOI
TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture trains the DNN to produce a distribution over senones (tied triphone states) as its output and can significantly outperform conventional context-dependent Gaussian mixture model (GMM)-HMMs.
Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

3,120 citations
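
As a quick sanity check (my arithmetic, not stated in the abstract), the relative error reduction equals the absolute accuracy gain divided by the baseline sentence error rate, so the reported figures imply baseline error rates of roughly 36% and 40%:

```latex
% Sanity check (my arithmetic, not quoted from the paper): relative error reduction
% = absolute accuracy gain / baseline sentence error rate, so the reported numbers imply
\[
\frac{5.8\%}{0.160} \approx 36.3\% \ \ (\text{MPE-trained CD-GMM-HMM baseline}),
\qquad
\frac{9.2\%}{0.232} \approx 39.7\% \ \ (\text{ML-trained CD-GMM-HMM baseline}).
\]
```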

Book
Li Deng, Dong Yu
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations