
Showing papers by "Ron Weiss" published in 2016


Posted Content
TL;DR: This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, varying the size of both the training set and the label vocabulary. Analogs of the CNNs used in image classification do well on this audio classification task, and larger training and label sets help up to a point.
Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
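
The paper does not include code, but the basic recipe (treat a window of log-mel spectrogram frames as an image and run an image-style CNN with a multi-label output over it) can be sketched as below. This is a minimal sketch: the 96x64 input patch, layer widths, and optimizer are assumptions, and only the 30,871-label vocabulary size comes from the abstract.

```python
# Minimal sketch (not the paper's exact architecture): an image-style CNN
# applied to log-mel spectrogram patches for multi-label audio classification.
# Input patch size and layer widths are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 30_871  # video-level label vocabulary size from the abstract

def build_audio_cnn(frames=96, mel_bins=64, num_classes=NUM_CLASSES):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(frames, mel_bins, 1)),
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        # Sigmoid rather than softmax: each clip can carry several labels at once.
        tf.keras.layers.Dense(num_classes, activation="sigmoid"),
    ])

model = build_audio_cnn()
# Multi-label targets -> per-class binary cross-entropy.
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The penultimate activations of such a classifier are the kind of embeddings the abstract reports reusing for the Audio Set AED task.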

487 citations


Journal ArticleDOI
TL;DR: This work presents a novel approach for generating and then co-differentiating hiPSC-derived progenitors, demonstrating the derivation of complex tissues from a single autologous hiPSC source and generating a range of stromal cells that co-develop with parenchymal cells to form tissues.
Abstract: Human induced pluripotent stem cells (hiPSCs) have potential for personalized and regenerative medicine. While most of the methods using these cells have focused on deriving homogeneous populations of specialized cells, there has been modest success in producing hiPSC-derived organotypic tissues or organoids. Here we present a novel approach for generating and then co-differentiating hiPSC-derived progenitors. With a genetically engineered pulse of GATA-binding protein 6 (GATA6) expression, we initiate rapid emergence of all three germ layers as a complex function of GATA6 expression levels and tissue context. Within 2 weeks we obtain a complex tissue that recapitulates early developmental processes and exhibits a liver bud-like phenotype, including haematopoietic and stromal cells as well as a neuronal niche. Collectively, our approach demonstrates derivation of complex tissues from a single autologous hiPSC source and generates a range of stromal cells that co-develop with parenchymal cells to form tissues.

132 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: A neural network adaptive beamforming (NAB) technique that uses LSTM layers to predict time domain beamforming filter coefficients at each input frame and achieves a 12.7% relative improvement in WER over a single channel model.
Abstract: Joint multichannel enhancement and acoustic modeling using neural networks has shown promise over the past few years. However, one shortcoming of previous work [1, 2, 3] is that the filters learned during training are fixed for decoding, potentially limiting the ability of these models to adapt to previously unseen or changing conditions. In this paper we explore a neural network adaptive beamforming (NAB) technique to address this issue. Specifically, we use LSTM layers to predict time domain beamforming filter coefficients at each input frame. These filters are convolved with the framed time domain input signal and summed across channels, essentially performing FIR filter-and-sum beamforming using the dynamically adapted filter. The beamformer output is passed into a waveform CLDNN acoustic model [4] which is trained jointly with the filter prediction LSTM layers. We find that the proposed NAB model achieves a 12.7% relative improvement in WER over a single channel model [4] and reaches similar performance to a “factored” model architecture which utilizes several fixed spatial filters [3] on a 2,000-hour Voice Search task, with a 17.9% decrease in computational cost.
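
The front-end operation itself, per-frame FIR filters (one per channel) convolved with the framed time-domain input and summed across channels, is easy to isolate. A minimal sketch, assuming the filters are already given (in the paper they are predicted each frame by LSTM layers) and using illustrative frame and filter lengths:

```python
# Sketch of FIR filter-and-sum beamforming with per-frame, per-channel filters.
# In the paper the filters come from an LSTM predictor; here they are inputs.
import numpy as np

def filter_and_sum(frames, filters):
    """frames:  (num_frames, num_channels, frame_len) framed time-domain input
       filters: (num_frames, num_channels, filter_len) adapted FIR coefficients
       returns: (num_frames, frame_len + filter_len - 1) single-channel output"""
    num_frames, num_channels, frame_len = frames.shape
    filter_len = filters.shape[-1]
    out = np.zeros((num_frames, frame_len + filter_len - 1))
    for t in range(num_frames):
        for c in range(num_channels):
            # Convolve each channel with its own filter, then sum across channels.
            out[t] += np.convolve(frames[t, c], filters[t, c])
    return out

# Toy usage: 2-channel input, 10 frames of 400 samples, 25-tap filters.
rng = np.random.default_rng(0)
y = filter_and_sum(rng.standard_normal((10, 2, 400)),
                   rng.standard_normal((10, 2, 25)))
print(y.shape)  # (10, 424)
```

Because the filters change from frame to frame, the beamformer can re-steer as conditions change, which is the adaptivity the fixed-filter models in prior work lack.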

120 citations


Proceedings ArticleDOI
Tara N. Sainath1, Ron Weiss1, Kevin W. Wilson1, Arun Narayanan1, Michiel Bacchiani1 
20 Mar 2016
TL;DR: This paper explores factoring multichannel enhancement operations into separate layers in the network, and uses multi-task learning (MTL) as a proxy for postfiltering, where the network is trained to predict "clean" features as well as context-dependent states.
Abstract: Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. Recently, we explored doing multichannel enhancement jointly with acoustic modeling, where beamforming and frequency decomposition were folded into one layer of the neural network [1, 2]. In this paper, we explore factoring these operations into separate layers in the network. Furthermore, we explore using multi-task learning (MTL) as a proxy for postfiltering, where we train the network to predict "clean" features as well as context-dependent states. We find that with the factored architecture, we can achieve a 10% relative improvement in WER over a single channel and a 5% relative improvement over the unfactored model from [1] on a 2,000-hour Voice Search task. In addition, by incorporating MTL, we can achieve 11% and 7% relative improvements over single channel and unfactored multichannel models, respectively.
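
The MTL part of the recipe amounts to a shared trunk with two output heads, one regressing "clean" features and one classifying context-dependent states, trained with a weighted sum of the two losses. A minimal sketch, assuming feed-forward layers, made-up dimensions, and a 0.1 auxiliary-loss weight (none of these values are from the paper):

```python
# Sketch of multi-task learning as a proxy for postfiltering: a shared trunk with
# two heads, one regressing "clean" features, one classifying CD states.
# Dimensions and the loss weighting are illustrative assumptions.
import tensorflow as tf

NOISY_DIM, CLEAN_DIM, NUM_CD_STATES = 128, 40, 8192  # assumed sizes

inputs = tf.keras.Input(shape=(NOISY_DIM,))
trunk = tf.keras.layers.Dense(1024, activation="relu")(inputs)
trunk = tf.keras.layers.Dense(1024, activation="relu")(trunk)

clean_head = tf.keras.layers.Dense(CLEAN_DIM, name="clean_features")(trunk)
state_head = tf.keras.layers.Dense(NUM_CD_STATES, activation="softmax",
                                   name="cd_states")(trunk)

model = tf.keras.Model(inputs, [clean_head, state_head])
model.compile(
    optimizer="adam",
    loss={"clean_features": "mse",
          "cd_states": "sparse_categorical_crossentropy"},
    # The denoising head acts as a training-time regularizer; typically only the
    # CD-state head is used at decoding time.
    loss_weights={"clean_features": 0.1, "cd_states": 1.0},
)
```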

75 citations


Patent
08 Jul 2016
TL;DR: In this article, the authors propose a method for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques, which includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels to generate convolution outputs.
Abstract: Methods, including computer programs encoded on a computer storage medium, for enhancing the processing of audio waveforms for speech recognition using various neural network processing techniques. In one aspect, a method includes: receiving multiple channels of audio data corresponding to an utterance; convolving each of multiple filters, in a time domain, with each of the multiple channels of audio waveform data to generate convolution outputs, wherein the multiple filters have parameters that have been learned during a training process that jointly trains the multiple filters and trains a deep neural network as an acoustic model; combining, for each of the multiple filters, the convolution outputs for the filter for the multiple channels of audio waveform data; inputting the combined convolution outputs to the deep neural network trained jointly with the multiple filters; and providing a transcription that is determined for the utterance.

73 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: This paper presents several approaches to reduce the complexity of this multichannel neural network model by reducing the stride of the convolution operation and by implementing the filters in the frequency domain.
Abstract: Recently, we presented a multichannel neural network model trained to perform speech enhancement jointly with acoustic modeling [1], directly from raw waveform input signals. While this model achieved over a 10% relative improvement compared to a single channel model, it came at a large cost in computational complexity, particularly in the convolutions used to implement a time-domain filterbank. In this paper we present several different approaches to reduce the complexity of this model by reducing the stride of the convolution operation and by implementing filters in the frequency domain. These optimizations reduce the computational complexity of the model by a factor of 3 with no loss in accuracy on a 2,000 hour Voice Search task.
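
The frequency-domain optimization rests on the convolution theorem: a long time-domain convolution can be replaced by an FFT of the frame followed by a pointwise multiply, which is cheaper when the filters are long. A small numpy check of that equivalence, with arbitrary frame and filter lengths:

```python
# Sketch: linear convolution of one frame with one FIR filter computed two ways.
# Zero-padding to at least frame_len + filter_len - 1 samples makes the
# FFT-based (circular) convolution equal to the linear one.
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(400)   # one frame of raw waveform (assumed length)
fir = rng.standard_normal(25)      # one learned time-domain filter (assumed length)

time_domain = np.convolve(frame, fir)   # direct convolution, O(frame_len * filter_len)

n = len(frame) + len(fir) - 1
freq_domain = np.fft.irfft(np.fft.rfft(frame, n) * np.fft.rfft(fir, n), n)

print(np.allclose(time_domain, freq_domain))  # True
```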

21 citations


Patent
25 Mar 2016

17 citations


Patent
Ehsan Variani1, Kevin W. Wilson1, Ron Weiss1, Tara N. Sainath1, Arun Narayanan1 
14 Nov 2016
TL;DR: In this paper, the authors describe computer-implemented methods and systems that use a neural network of a speech recognition system to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
Abstract: This specification describes computer-implemented methods and systems. One method includes receiving, by a neural network of a speech recognition system, first data representing a first raw audio signal and second data representing a second raw audio signal. The first raw audio signal and the second raw audio signal describe audio occurring at a same period of time. The method further includes generating, by a spatial filtering layer of the neural network, a spatial filtered output using the first data and the second data, and generating, by a spectral filtering layer of the neural network, a spectral filtered output using the spatial filtered output. Generating the spectral filtered output comprises processing frequency-domain data representing the spatial filtered output. The method still further includes processing, by one or more additional layers of the neural network, the spectral filtered output to predict sub-word units encoded in both the first raw audio signal and the second raw audio signal.
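
A minimal sketch of the two-stage front end the patent describes: a spatial filtering layer that filters and sums across channels for several learned look directions, followed by a spectral filtering layer that operates on frequency-domain data. The shapes, the number of look directions, and the power/log pooling below are illustrative assumptions:

```python
# Sketch of a spatial filtering layer followed by a spectral filtering layer.
# All sizes and the pooling choices are illustrative assumptions.
import numpy as np

def spatial_layer(frame, spatial_filters):
    """frame: (channels, frame_len); spatial_filters: (P, channels, taps).
       Returns one filtered-and-summed signal per look direction, shape (P, n)."""
    P, channels, taps = spatial_filters.shape
    out = np.zeros((P, frame.shape[1] + taps - 1))
    for p in range(P):
        for c in range(channels):
            out[p] += np.convolve(frame[c], spatial_filters[p, c])
    return out

def spectral_layer(look_dirs, band_weights):
    """look_dirs: (P, n); band_weights: (F, n_fft_bins) learned filterbank.
       Returns (P, F) log power features computed from frequency-domain data."""
    power = np.abs(np.fft.rfft(look_dirs, axis=-1)) ** 2   # (P, n_fft_bins)
    return np.log(power @ band_weights.T + 1e-6)           # (P, F)

rng = np.random.default_rng(0)
frame = rng.standard_normal((2, 400))                              # two-channel frame
spatial = spatial_layer(frame, rng.standard_normal((5, 2, 25)))    # 5 look directions
n_bins = spatial.shape[-1] // 2 + 1                                # rfft bin count
feats = spectral_layer(spatial, rng.random((40, n_bins)))          # 40 learned bands
print(feats.shape)  # (5, 40)
```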

16 citations


Patent
04 Jan 2016

11 citations


Patent
01 Sep 2016
TL;DR: In this article, the authors describe a regulatory system composed of a CRISPR-associated nuclease and at least two distinct guide RNAs (gRNAs), which modulates cleavage and transcription, including repression and activation, in a mammalian cell such as a human cell.
Abstract: Aspects of the disclosure relate to synthetic regulatory systems composed of a multifunctional Cas [clustered regularly interspaced short palindromic repeat (CRISPR)-associated (Cas)] nuclease and at least two distinct guide RNAs (gRNAs). The synthetic regulatory system modulates cleavage and transcription, including repression and activation, in a mammalian cell such as a human cell.

8 citations


Patent
10 Nov 2016
TL;DR: In this article, the authors present modular transcriptional architectures and methods for regulated expression of guide RNAs in cells, such as human cells, based on Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) systems.
Abstract: Provided herein, in some embodiments, are modular transcriptional architectures and methods for regulated expression of guide RNAs in cells, such as human cells, which are based on Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) systems.