scispace - formally typeset
Search or ask a question

Showing papers by "Ruslan Salakhutdinov published in 2018"


Proceedings ArticleDOI
25 Sep 2018
TL;DR: HotpotQA as discussed by the authors is a dataset with 113k Wikipedia-based question-answer pairs with four key features: finding and reasoning over multiple supporting documents to answer; the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; providing sentence-level supporting facts required for reasoning; and offering a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison.
Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions

850 citations


Posted Content
TL;DR: It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
Abstract: Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

574 citations


Proceedings ArticleDOI
24 Apr 2018
TL;DR: This paper used adversarial generation techniques to make the output match the desired style, which showed improvements both in automatic evaluation of style transfer and in manual evaluation of meaning preservation and fluency.
Abstract: Style transfer is the task of rephrasing the text to contain specific stylistic properties without changing the intent or affect within the context. This paper introduces a new method for automatic style transfer. We first learn a latent representation of the input sentence which is grounded in a language translation model in order to better preserve the meaning of the sentence while reducing stylistic properties. Then adversarial generation techniques are used to make the output match the desired style. We evaluate this technique on three different style transformations: sentiment, gender and political slant. Compared to two state-of-the-art style transfer modeling techniques we show improvements both in automatic evaluation of style transfer and in manual evaluation of meaning preservation and fluency.

269 citations


Proceedings ArticleDOI
01 Jan 2018
TL;DR: GRAFT-Net as mentioned in this paper uses graph representation learning to extract answers from a question-specific subgraph containing text and knowledge base entities and relations, which is appropriate when an incomplete knowledge base is available with a large text corpus.
Abstract: Open Domain Question Answering (QA) is evolving from complex pipelined systems to end-to-end deep neural networks. Specialized neural models have been developed for extracting answers from either text alone or Knowledge Bases (KBs) alone. In this paper we look at a more practical setting, namely QA over the combination of a KB and entity-linked text, which is appropriate when an incomplete KB is available with a large text corpus. Building on recent advances in graph representation learning we propose a novel model, GRAFT-Net, for extracting answers from a question-specific subgraph containing text and KB entities and relations. We construct a suite of benchmark tasks for this problem, varying the difficulty of questions, the amount of training data, and KB completeness. We show that GRAFT-Net is competitive with the state-of-the-art when tested using either KBs or text alone, and vastly outperforms existing methods in the combined setting.

265 citations


Proceedings Article
16 Jun 2018
TL;DR: A model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors is introduced that demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance.
Abstract: Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

179 citations


Posted Content
TL;DR: A novel model is proposed, GRAFT-Net, for extracting answers from a question-specific subgraph containing text and Knowledge Bases entities and relations that is competitive with the state-of-the-art when tested using either KBs or text alone, and vastly outperforms existing methods in the combined setting.
Abstract: Open Domain Question Answering (QA) is evolving from complex pipelined systems to end-to-end deep neural networks. Specialized neural models have been developed for extracting answers from either text alone or Knowledge Bases (KBs) alone. In this paper we look at a more practical setting, namely QA over the combination of a KB and entity-linked text, which is appropriate when an incomplete KB is available with a large text corpus. Building on recent advances in graph representation learning we propose a novel model, GRAFT-Net, for extracting answers from a question-specific subgraph containing text and KB entities and relations. We construct a suite of benchmark tasks for this problem, varying the difficulty of questions, the amount of training data, and KB completeness. We show that GRAFT-Net is competitive with the state-of-the-art when tested using either KBs or text alone, and vastly outperforms existing methods in the combined setting. Source code is available at this https URL .

119 citations


Posted Content
TL;DR: A two fold modification to GAN algorithm for learning to generate point clouds (PC-GAN), which combines ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process and defines a generic framework that can incorporate many existing GAN algorithms.
Abstract: Generative Adversarial Networks (GAN) can achieve promising performance on learning complex data distributions on different types of data. In this paper, we first show a straightforward extension of existing GAN algorithm is not applicable to point clouds, because the constraint required for discriminators is undefined for set data. We propose a two fold modification to GAN algorithm for learning to generate point clouds (PC-GAN). First, we combine ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process. A key component of our method is that we train a posterior inference network for the hidden variables. Second, instead of using only state-of-the-art Wasserstein GAN objective, we propose a sandwiching objective, which results in a tighter Wasserstein distance estimate than the commonly used dual form. Thereby, PC-GAN defines a generic framework that can incorporate many existing GAN algorithms. We validate our claims on ModelNet40 benchmark dataset. Using the distance between generated point clouds and true meshes as metric, we find that PC-GAN trained by the sandwiching objective achieves better results on test data than the existing methods. Moreover, as a byproduct, PC- GAN learns versatile latent representations of point clouds, which can achieve competitive performance with other unsupervised learning algorithms on object recognition task. Lastly, we also provide studies on generating unseen classes of objects and transforming image to point cloud, which demonstrates the compelling generalization capability and potentials of PC-GAN.

108 citations


Proceedings ArticleDOI
01 Apr 2018
TL;DR: The authors proposed a coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster and incorporated this layer into a state-of-the-art reading comprehension model.
Abstract: Many problems in NLP require aggregating information from multiple mentions of the same entity which may be far apart in the text. Existing Recurrent Neural Network (RNN) layers are biased towards short-term dependencies and hence not suited to such tasks. We present a recurrent layer which is instead biased towards coreferent dependencies. The layer uses coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster. Incorporating this layer into a state-of-the-art reading comprehension model improves performance on three datasets – Wikihop, LAMBADA and the bAbi AI tasks – with large gains when training data is scarce.

102 citations


Posted Content
TL;DR: This article proposed a generative discriminative-discriminative objective across multimodal data and labels, which can learn the complex intra-modal and crossmodal interactions for prediction and robust to unexpected missing or noisy modalities during testing.
Abstract: Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning.

94 citations


Posted Content
TL;DR: This paper reframe the problem of architecture selection as understanding how data determines the most expressive and generalizable architectures suited to that data, beyond inductive bias, and provides the first empirical characterization of the topological capacity of neural networks.
Abstract: The learnability of different neural architectures can be characterized directly by computable measures of data complexity In this paper, we reframe the problem of architecture selection as understanding how data determines the most expressive and generalizable architectures suited to that data, beyond inductive bias After suggesting algebraic topology as a measure for data complexity, we show that the power of a network to express the topological complexity of a dataset in its decision region is a strictly limiting factor in its ability to generalize We then provide the first empirical characterization of the topological capacity of neural networks Our empirical analysis shows that at every level of dataset complexity, neural networks exhibit topological phase transitions This observation allowed us to connect existing theory to empirically driven conjectures on the choice of architectures for fully-connected neural networks

80 citations


Proceedings Article
27 Sep 2018
TL;DR: Zhang et al. as mentioned in this paper proposed a two fold modification to GAN algorithm for learning to generate point clouds (PC-GAN), which combines ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process.
Abstract: Generative Adversarial Networks (GAN) can achieve promising performance on learning complex data distributions on different types of data. In this paper, we first show a straightforward extension of existing GAN algorithm is not applicable to point clouds, because the constraint required for discriminators is undefined for set data. We propose a two fold modification to GAN algorithm for learning to generate point clouds (PC-GAN). First, we combine ideas from hierarchical Bayesian modeling and implicit generative models by learning a hierarchical and interpretable sampling process. A key component of our method is that we train a posterior inference network for the hidden variables. Second, instead of using only state-of-the-art Wasserstein GAN objective, we propose a sandwiching objective, which results in a tighter Wasserstein distance estimate than the commonly used dual form. Thereby, PC-GAN defines a generic framework that can incorporate many existing GAN algorithms. We validate our claims on ModelNet40 benchmark dataset. Using the distance between generated point clouds and true meshes as metric, we find that PC-GAN trained by the sandwiching objective achieves better results on test data than the existing methods. Moreover, as a byproduct, PC- GAN learns versatile latent representations of point clouds, which can achieve competitive performance with other unsupervised learning algorithms on object recognition task. Lastly, we also provide studies on generating unseen classes of objects and transforming image to point cloud, which demonstrates the compelling generalization capability and potentials of PC-GAN.

Proceedings ArticleDOI
01 Jun 2018
TL;DR: A new, differentiable architecture, Neural Graph Optimizer, progressing towards a complete neural network solution for SLAM by designing a system composed of a local pose estimation model, a novel pose selection module, and a novel graph optimization process.
Abstract: The ability for an agent to localize itself within an environment is crucial for many real-world applications. For unknown environments, Simultaneous Localization and Mapping (SLAM) enables incremental and concurrent building of and localizing within a map. We present a new, differentiable architecture, Neural Graph Optimizer, progressing towards a complete neural network solution for SLAM by designing a system composed of a local pose estimation model, a novel pose selection module, and a novel graph optimization process. The entire architecture is trained in an end-to-end fashion, enabling the network to automatically learn domain-specific features relevant to the visual odometry and avoid the involved process of feature engineering. We demonstrate the effectiveness of our system on a simulated 2D maze and the 3D ViZ-Doom environment.

Posted Content
TL;DR: This work attempts to systematically characterize methods for density estimation, and proposes multiple novel methods to model non-Markovian dependencies, and introduces a novel data driven framework for learning a family of distributions.
Abstract: The fundamental task of general density estimation $p(x)$ has been of keen interest to machine learning In this work, we attempt to systematically characterize methods for density estimation Broadly speaking, most of the existing methods can be categorized into either using: \textit{a}) autoregressive models to estimate the conditional factors of the chain rule, $p(x_{i}\, |\, x_{i-1}, \ldots)$; or \textit{b}) non-linear transformations of variables of a simple base distribution Based on the study of the characteristics of these categories, we propose multiple novel methods for each category For example we proposed RNN based transformations to model non-Markovian dependencies Further, through a comprehensive study over both real world and synthetic data, we show for that jointly leveraging transformations of variables and autoregressive conditional models, results in a considerable improvement in performance We illustrate the use of our models in outlier detection and image modeling Finally we introduce a novel data driven framework for learning a family of distributions

Proceedings Article
17 Jun 2018
TL;DR: It is argued that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN and the resulting architecture, which is shown to empirically outperform VIN on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization.
Abstract: Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used by agents to perform navigation while still maintaining end-to-end differentiability of the entire architecture. Despite their effectiveness, they suffer from several disadvantages including training instability, random seed sensitivity, and other optimization problems. In this work, we reframe VINs as recurrent-convolutional networks which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. From this perspective, we argue that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN. The resulting architecture, which we call the Gated Path Planning Network, is shown to empirically outperform VIN on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization. Furthermore, we show that this performance gap is consistent across different maze transition types, maze sizes and even show success on a challenging 3D environment, where the planner is only provided with first-person RGB images.

Posted Content
TL;DR: The proposed model incorporates ideas of traditional filtering-based localization methods, by using a structured belief of the state with multiplicative interactions to propagate belief, and combines it with a policy model to localize accurately while minimizing the number of steps required.
Abstract: Localization is the problem of estimating the location of an autonomous agent from an observation and a map of the environment. Traditional methods of localization, which filter the belief based on the observations, are sub-optimal in the number of steps required, as they do not decide the actions taken by the agent. We propose "Active Neural Localizer", a fully differentiable neural network that learns to localize accurately and efficiently. The proposed model incorporates ideas of traditional filtering-based localization methods, by using a structured belief of the state with multiplicative interactions to propagate belief, and combines it with a policy model to localize accurately while minimizing the number of steps required for localization. Active Neural Localizer is trained end-to-end with reinforcement learning. We use a variety of simulation environments for our experiments which include random 2D mazes, random mazes in the Doom game engine and a photo-realistic environment in the Unreal game engine. The results on the 2D environments show the effectiveness of the learned policy in an idealistic setting while results on the 3D environments demonstrate the model's capability of learning the policy and perceptual model jointly from raw-pixel based RGB observations. We also show that a model trained on random textures in the Doom environment generalizes well to a photo-realistic office space environment in the Unreal engine.


Posted Content
TL;DR: In this article, a mathematical correspondence between posterior regularization (PR) and reinforcement learning (RL) is established and, based on the connection, expand PR to learn constraints as the extrinsic reward in RL.
Abstract: The broad set of deep generative models (DGMs) has achieved remarkable advances. However, it is often difficult to incorporate rich structured domain knowledge with the end-to-end DGMs. Posterior regularization (PR) offers a principled framework to impose structured constraints on probabilistic models, but has limited applicability to the diverse DGMs that can lack a Bayesian formulation or even explicit density evaluation. PR also requires constraints to be fully specified a priori, which is impractical or suboptimal for complex knowledge with learnable uncertain parts. In this paper, we establish mathematical correspondence between PR and reinforcement learning (RL), and, based on the connection, expand PR to learn constraints as the extrinsic reward in RL. The resulting algorithm is model-agnostic to apply to any DGMs, and is flexible to adapt arbitrary constraints with the model jointly. Experiments on human image generation and templated sentence generation show models with learned knowledge constraints by our algorithm greatly improve over base generative models.

Posted Content
TL;DR: The authors proposed a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions and further utilize the information in the WordNet by assigning a non-uniform prior to synset distribution over words and a logistic-normal prior for document distribution over synsets.
Abstract: Word Sense Disambiguation is an open problem in Natural Language Processing which is particularly challenging and useful in the unsupervised setting where all the words in any given text need to be disambiguated without using any labeled data. Typically WSD systems use the sentence or a small window of words around the target word as the context for disambiguation because their computational complexity scales exponentially with the size of the context. In this paper, we leverage the formalism of topic model to design a WSD system that scales linearly with the number of words in the context. As a result, our system is able to utilize the whole document as the context for a word to be disambiguated. The proposed method is a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions. We further utilize the information in the WordNet by assigning a non-uniform prior to synset distribution over words and a logistic-normal prior for document distribution over synsets. We evaluate the proposed method on Senseval-2, Senseval-3, SemEval-2007, SemEval-2013 and SemEval-2015 English All-Word WSD datasets and show that it outperforms the state-of-the-art unsupervised knowledge-based WSD system by a significant margin.

Proceedings Article
24 Jan 2018
TL;DR: Active Neural Localizer as mentioned in this paper incorporates ideas of traditional filtering-based localization methods by using a structured belief of the state with multiplicative interactions to propagate belief, and combines it with a policy model to localize accurately while minimizing the number of steps required for localization.
Abstract: Localization is the problem of estimating the location of an autonomous agent from an observation and a map of the environment. Traditional methods of localization, which filter the belief based on the observations, are sub-optimal in the number of steps required, as they do not decide the actions taken by the agent. We propose "Active Neural Localizer", a fully differentiable neural network that learns to localize accurately and efficiently. The proposed model incorporates ideas of traditional filtering-based localization methods, by using a structured belief of the state with multiplicative interactions to propagate belief, and combines it with a policy model to localize accurately while minimizing the number of steps required for localization. Active Neural Localizer is trained end-to-end with reinforcement learning. We use a variety of simulation environments for our experiments which include random 2D mazes, random mazes in the Doom game engine and a photo-realistic environment in the Unreal game engine. The results on the 2D environments show the effectiveness of the learned policy in an idealistic setting while results on the 3D environments demonstrate the model's capability of learning the policy and perceptual model jointly from raw-pixel based RGB observations. We also show that a model trained on random textures in the Doom environment generalizes well to a photo-realistic office space environment in the Unreal engine.

Posted Content
21 May 2018
TL;DR: It is shown that for learning an $m-dimensional convolutional filter with linear activation acting on a $d$-dimensional input, the sample complexity of achieving population prediction error of $\epsilon$ is $\widetilde{O} (m/\Epsilon^2)$, whereas its FNN counterpart needs at least $\Omega(d/\epsil on)$ samples.
Abstract: It is widely believed that the practical success of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) owes to the fact that CNNs and RNNs use a more compact parametric representation than their Fully-Connected Neural Network (FNN) counterparts, and consequently require fewer training examples to accurately estimate their parameters. We initiate the study of rigorously characterizing the sample-complexity of estimating CNNs and RNNs. We show that the sample-complexity to learn CNNs and RNNs scales linearly with their intrinsic dimension and this sample-complexity is much smaller than for their FNN counterparts. For both CNNs and RNNs, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main technical tools for deriving these results are a localized empirical process analysis and a new technical lemma characterizing the convolutional and recurrent structure. We believe that these tools may inspire further developments in understanding CNNs and RNNs.

Posted Content
TL;DR: This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units from large-scale unlabeled data and transferring the graphs to downstream tasks, and shows that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have been trained.
Abstract: Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden unit), or embedding-free units such as image pixels.

Proceedings Article
22 Feb 2018
TL;DR: This work proposes a new neural network architecture for the policy network representation that is simple yet effective, and demonstrates much improved performance for locomotion tasks by emulating the biological central pattern generators as the nonlinear part of the architecture.
Abstract: In recent years, Deep Reinforcement Learning has made impressive advances in solving several important benchmark problems for sequential decision making. Many control applications use a generic multilayer perceptron (MLP) for non-vision parts of the policy network. In this work, we propose a new neural network architecture for the policy network representation that is simple yet effective. The proposed Structured Control Net (SCN) splits the generic MLP into two separate sub-modules: a nonlinear control module and a linear control module. Intuitively, the nonlinear control is for forward-looking and global control, while the linear control stabilizes the local dynamics around the residual of global control. We hypothesize that this will bring together the benefits of both linear and nonlinear policies: improve training sample efficiency, final episodic reward, and generalization of learned policy, while requiring a smaller network and being generally applicable to different training methods. We validated our hypothesis with competitive results on simulations from OpenAI MuJoCo, Roboschool, Atari, and a custom 2D urban driving environment, with various ablation and generalization tests, trained with multiple black-box and policy gradient training methods. The proposed architecture has the potential to improve upon broader control tasks by incorporating problem specific priors into the architecture. As a case study, we demonstrate much improved performance for locomotion tasks by emulating the biological central pattern generators (CPGs) as the nonlinear part of the architecture.

Posted Content
TL;DR: Gated Path Planning Network as discussed by the authors reframes value iteration networks (VINs) as recurrent-convolutional networks which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation.
Abstract: Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used by agents to perform navigation while still maintaining end-to-end differentiability of the entire architecture. Despite their effectiveness, they suffer from several disadvantages including training instability, random seed sensitivity, and other optimization problems. In this work, we reframe VINs as recurrent-convolutional networks which demonstrates that VINs couple recurrent convolutions with an unconventional max-pooling activation. From this perspective, we argue that standard gated recurrent update equations could potentially alleviate the optimization issues plaguing VIN. The resulting architecture, which we call the Gated Path Planning Network, is shown to empirically outperform VIN on a variety of metrics such as learning speed, hyperparameter sensitivity, iteration count, and even generalization. Furthermore, we show that this performance gap is consistent across different maze transition types, maze sizes and even show success on a challenging 3D environment, where the planner is only provided with first-person RGB images.

Proceedings Article
03 Dec 2018
TL;DR: The study of rigorously characterizing the sample complexity of estimating CNNs is initiated, showing that for an $m$-dimensional convolutional filter with linear activation acting on a d-dimensional input, the samplecomplexity of achieving population prediction error of $\epsilon$ is $\widetilde{O(m/\ep silon^2)$, whereas the sample-complexity for its FNN counterpart is lower bounded by $\Omega(d/\Epsilon
Abstract: A widespread folklore for explaining the success of convolutional neural network (CNN) is that CNN is a more compact representation than the fully connected neural network (FNN) and thus requires fewer samples for learning. We initiate the study of rigorously characterizing the sample complexity of learning convolutional neural networks. We show that for learning an m-dimensional convolutional filter with linear activation acting on a d-dimensional input, the sample complexity of achieving population prediction error of ϵ is ˜ O (m/ϵ2) whereas its FNN counterpart needs at least Ω(d/ϵ2) samples. Since m≪d, this result demonstrates the advantage of using CNN. We further consider the sample complexity of learning a one-hidden-layer CNN with linear activation where both the m-dimensional convolutional filter and the r-dimensional output weights are unknown. For this model, we show the sample complexity is ˜ O ((m+r)/ϵ2) when the ratio between the stride size and the filter size is a constant. For both models, we also present lower bounds showing our sample complexities are tight up to logarithmic factors. Our main tools for deriving these results are localized empirical process and a new lemma characterizing the convolutional structure. We believe these tools may inspire further developments in understanding CNN.

Proceedings Article
01 Jan 2018
TL;DR: This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units from large-scale unlabeled data and transferring the graphs to downstream tasks, and shows that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have been trained.
Abstract: Modern deep transfer learning approaches have mainly focused on learning generic feature vectors from one task that are transferable to other tasks, such as word embeddings in language and pretrained convolutional features in vision. However, these approaches usually transfer unary features and largely ignore more structured graphical representations. This work explores the possibility of learning generic latent relational graphs that capture dependencies between pairs of data units (e.g., words or pixels) from large-scale unlabeled data and transferring the graphs to downstream tasks. Our proposed transfer learning framework improves performance on various tasks including question answering, natural language inference, sentiment analysis, and image classification. We also show that the learned graphs are generic enough to be transferred to different embeddings on which the graphs have not been trained (including GloVe embeddings, ELMo embeddings, and task-specific RNN hidden units), or embedding-free units such as image pixels.

Proceedings Article
03 Jul 2018
TL;DR: In this article, a comprehensive study over both real world and synthetic data, the authors show for that jointly leveraging transformations of variables and autoregressive conditional models, results in a considerable improvement in performance.
Abstract: The fundamental task of general density estimation $p(x)$ has been of keen interest to machine learning. In this work, we attempt to systematically characterize methods for density estimation. Broadly speaking, most of the existing methods can be categorized into either using: \textit{a}) autoregressive models to estimate the conditional factors of the chain rule, $p(x_{i}\, |\, x_{i-1}, \ldots)$; or \textit{b}) non-linear transformations of variables of a simple base distribution. Based on the study of the characteristics of these categories, we propose multiple novel methods for each category. For example we proposed RNN based transformations to model non-Markovian dependencies. Further, through a comprehensive study over both real world and synthetic data, we show for that jointly leveraging transformations of variables and autoregressive conditional models, results in a considerable improvement in performance. We illustrate the use of our models in outlier detection and image modeling. Finally we introduce a novel data driven framework for learning a family of distributions.

Posted Content
TL;DR: This article proposed a coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster and incorporated this layer into a state-of-the-art reading comprehension model.
Abstract: Many problems in NLP require aggregating information from multiple mentions of the same entity which may be far apart in the text. Existing Recurrent Neural Network (RNN) layers are biased towards short-term dependencies and hence not suited to such tasks. We present a recurrent layer which is instead biased towards coreferent dependencies. The layer uses coreference annotations extracted from an external system to connect entity mentions belonging to the same cluster. Incorporating this layer into a state-of-the-art reading comprehension model improves performance on three datasets -- Wikihop, LAMBADA and the bAbi AI tasks -- with large gains when training data is scarce.

Posted Content
TL;DR: The normalized gradient methods having constant step size with occasionally decay, such as SGD with momentum, have better performance in the deep convolution neural networks, while those with adaptive step sizes perform better in recurrent neural networks.
Abstract: In this paper, we propose a generic and simple strategy for utilizing stochastic gradient information in optimization. The technique essentially contains two consecutive steps in each iteration: 1) computing and normalizing each block (layer) of the mini-batch stochastic gradient; 2) selecting appropriate step size to update the decision variable (parameter) towards the negative of the block-normalized gradient. We conduct extensive empirical studies on various non-convex neural network optimization problems, including multi-layer perceptron, convolution neural networks and recurrent neural networks. The results indicate the block-normalized gradient can help accelerate the training of neural networks. In particular, we observe that the normalized gradient methods having constant step size with occasionally decay, such as SGD with momentum, have better performance in the deep convolution neural networks, while those with adaptive step sizes, such as Adam, perform better in recurrent neural networks. Besides, we also observe this line of methods can lead to solutions with better generalization properties, which is confirmed by the performance improvement over strong baselines.

Proceedings Article
01 Jan 2018
TL;DR: The proposed method is a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions and it outperforms the state-of-the-art unsupervised knowledge-based WSD system by a significant margin.
Abstract: Word Sense Disambiguation is an open problem in Natural Language Processing which is particularly challenging and useful in the unsupervised setting where all the words in any given text need to be disambiguated without using any labeled data. Typically WSD systems use the sentence or a small window of words around the target word as the context for disambiguation because their computational complexity scales exponentially with the size of the context. In this paper, we leverage the formalism of topic model to design a WSD system that scales linearly with the number of words in the context. As a result, our system is able to utilize the whole document as the context for a word to be disambiguated. The proposed method is a variant of Latent Dirichlet Allocation in which the topic proportions for a document are replaced by synset proportions. We further utilize the information in the WordNet by assigning a non-uniform prior to synset distribution over words and a logistic-normal prior for document distribution over synsets. We evaluate the proposed method on Senseval-2, Senseval-3, SemEval-2007, SemEval-2013 and SemEval-2015 English All-Word WSD datasets and show that it outperforms the state-of-the-art unsupervised knowledge-based WSD system by a significant margin.

Posted Content
TL;DR: This article used adversarial generation techniques to make the output match the desired style, which showed improvements both in automatic evaluation of style transfer and in manual evaluation of meaning preservation and fluency.
Abstract: Style transfer is the task of rephrasing the text to contain specific stylistic properties without changing the intent or affect within the context. This paper introduces a new method for automatic style transfer. We first learn a latent representation of the input sentence which is grounded in a language translation model in order to better preserve the meaning of the sentence while reducing stylistic properties. Then adversarial generation techniques are used to make the output match the desired style. We evaluate this technique on three different style transformations: sentiment, gender and political slant. Compared to two state-of-the-art style transfer modeling techniques we show improvements both in automatic evaluation of style transfer and in manual evaluation of meaning preservation and fluency.