
Showing papers by "Kyunghyun Cho published in 2018"


Journal ArticleDOI
TL;DR: In this article, a deep learning model based on Gated Recurrent Unit (GRU) is proposed to exploit the missing values and their missing patterns for effective imputation and improving prediction performance.
Abstract: Multivariate time series data in practical applications, such as health care, geoscience, and biology, are characterized by a variety of missing values. In time series prediction and other related tasks, it has been noted that missing values and their missing patterns are often correlated with the target labels, a.k.a. informative missingness. There is very limited work on exploiting the missing patterns for effective imputation and improving prediction performance. In this paper, we develop novel deep learning models, namely GRU-D, as one of the early attempts. GRU-D is based on Gated Recurrent Unit (GRU), a state-of-the-art recurrent neural network. It takes two representations of missing patterns, i.e., masking and time interval, and effectively incorporates them into a deep model architecture so that it not only captures the long-term temporal dependencies in time series, but also utilizes the missing patterns to achieve better prediction results. Experiments on time series classification tasks using real-world clinical datasets (MIMIC-III, PhysioNet) and synthetic datasets demonstrate that our models achieve state-of-the-art performance and provide useful insights for better understanding and utilization of missing values in time series analysis.

1,085 citations
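As a rough illustration of the mechanism described above, here is a minimal NumPy sketch of GRU-D's decay-based input imputation; the weight names (`W_gamma`, `b_gamma`) and the exact update rule are simplified assumptions, not the authors' full parameterization:

```python
import numpy as np

def grud_impute(x, mask, delta, x_last, x_mean, W_gamma, b_gamma):
    """Decay-based imputation fed into the GRU cell.

    x      : current observations (missing entries already zero-filled), (d,)
    mask   : 1 where observed, 0 where missing, (d,)
    delta  : time since each variable was last observed, (d,)
    x_last : last observed value per variable, (d,)
    x_mean : empirical mean per variable, (d,)
    """
    # Decay in (0, 1]; longer gaps pull the input toward the empirical mean.
    gamma = np.exp(-np.maximum(0.0, W_gamma * delta + b_gamma))
    x_hat = gamma * x_last + (1.0 - gamma) * x_mean
    return mask * x + (1.0 - mask) * x_hat

d = 4
x = np.array([0.5, 0.0, 1.2, 0.0])      # second and fourth values missing
mask = np.array([1.0, 0.0, 1.0, 0.0])
out = grud_impute(x, mask, np.ones(d), np.zeros(d), np.zeros(d),
                  np.ones(d), np.zeros(d))
```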


Posted Content
TL;DR: This paper proposes a conditional non-autoregressive neural sequence model based on iterative refinement, which is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task.
Abstract: We propose a conditional non-autoregressive neural sequence model based on iterative refinement. The proposed model is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task. We extensively evaluate the proposed model on machine translation (En-De and En-Ro) and image caption generation, and observe that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.

272 citations
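Conceptually, decoding amounts to emitting a full-length draft in parallel and then repeatedly denoising it. A schematic sketch, where `decoder` is a placeholder for the trained model rather than the paper's exact architecture:

```python
def iterative_refinement_decode(decoder, src, max_iters=10):
    """Non-autoregressive decoding by iterative refinement.

    decoder(src, draft) must return a full target draft in one shot;
    all positions are predicted in parallel, then repeatedly refined.
    """
    draft = decoder(src, None)          # initial parallel guess
    for _ in range(max_iters):
        refined = decoder(src, draft)   # denoise the previous draft
        if refined == draft:            # stop once the draft is stable
            break
        draft = refined
    return draft
```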


Proceedings ArticleDOI
01 Jan 2018
TL;DR: The proposed conditional non-autoregressive neural sequence model is evaluated on machine translation and image caption generation, and it is observed that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.
Abstract: We propose a conditional non-autoregressive neural sequence model based on iterative refinement. The proposed model is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task. We extensively evaluate the proposed model on machine translation (En-De and En-Ro) and image caption generation, and observe that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.

261 citations


Proceedings Article
01 Jan 2018
TL;DR: Empirical evaluation on three language pairs shows that the proposed approach significantly outperforms the baseline approach, and that the improvement is more significant when more relevant sentence pairs are retrieved.
Abstract: In this paper, we extend an attention-based neural machine translation (NMT) model by allowing it to access an entire training set of parallel sentence pairs even after training. The proposed approach consists of two stages. In the first stage, the retrieval stage, an off-the-shelf, black-box search engine is used to retrieve a small subset of sentence pairs from a training set given a source sentence. These pairs are further filtered using a fuzzy matching score based on edit distance. In the second stage, the translation stage, a novel translation model, called search engine guided NMT (SEG-NMT), seamlessly uses both the source sentence and a set of retrieved sentence pairs to perform the translation. Empirical evaluation on three language pairs (En-Fr, En-De, and En-Es) shows that the proposed approach significantly outperforms the baseline approach, and that the improvement is more significant when more relevant sentence pairs are retrieved.

150 citations
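The retrieval stage can be imitated with any search backend plus an edit-distance-style filter. A minimal sketch using Python's standard library, with `difflib`'s similarity ratio standing in for the paper's fuzzy matching score:

```python
import difflib

def retrieve_pairs(source, training_pairs, k=5, threshold=0.5):
    """Return up to k (src, tgt) pairs whose source side fuzzily
    matches `source`, ranked by a difflib similarity ratio."""
    scored = []
    for src, tgt in training_pairs:
        score = difflib.SequenceMatcher(None, source, src).ratio()
        if score >= threshold:
            scored.append((score, src, tgt))
    scored.sort(reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:k]]

pairs = [("the cat sat", "le chat était assis"),
         ("the dog ran", "le chien a couru")]
print(retrieve_pairs("the cat sat down", pairs, k=1))
```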


Journal ArticleDOI
TL;DR: The authors proposed a fine-grained (or 2D) attention mechanism where each dimension of a context vector will receive a separate attention score, which improves the translation quality in terms of BLEU score.

141 citations
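The idea is that the softmax runs per hidden dimension instead of once per source position. A NumPy sketch of the scoring difference; the dot-product scorer and shapes are illustrative assumptions:

```python
import numpy as np

def fine_grained_context(H, q):
    """H: source annotations, shape (T, d); q: decoder query, shape (d,).
    Standard attention gives one weight per position; 2D attention
    gives a separate weight per (position, dimension) pair."""
    scores = H * q                                    # (T, d) per-dimension scores
    alpha = np.exp(scores - scores.max(axis=0, keepdims=True))
    alpha /= alpha.sum(axis=0, keepdims=True)         # softmax over positions, per dim
    return (alpha * H).sum(axis=0)                    # (d,) context vector

H = np.random.randn(5, 8)
q = np.random.randn(8)
print(fine_grained_context(H, q).shape)               # (8,)
```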


Proceedings ArticleDOI
01 Jan 2018
TL;DR: A transferable architecture of structural and compositional neural networks is designed to jointly represent and map event mentions and types into a shared semantic space and can select, for each event mention, the event type which is semantically closest in this space as its type.
Abstract: Most previous supervised event extraction methods have relied on features derived from manual annotations, and thus cannot be applied to new event types without extra annotation effort. We take a fresh look at event extraction and model it as a generic grounding problem: mapping each event mention to a specific type in a target event ontology. We design a transferable architecture of structural and compositional neural networks to jointly represent and map event mentions and types into a shared semantic space. Based on this new framework, we can select, for each event mention, the event type which is semantically closest in this space as its type. By leveraging manual annotations available for a small set of existing event types, our framework can be applied to new unseen event types without additional manual annotations. When tested on 23 unseen event types, our zero-shot framework, without manual annotations, achieved performance comparable to a supervised model trained from 3,000 sentences annotated with 500 event mentions.

110 citations
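Once mentions and types share a semantic space, type assignment reduces to a nearest-neighbor lookup. A minimal sketch (cosine similarity assumed; the paper's structural and compositional encoders are not reproduced):

```python
import numpy as np

def nearest_event_type(mention_vec, type_vecs, type_names):
    """Assign the event type whose embedding is closest (cosine)
    to the event-mention embedding in the shared semantic space."""
    m = mention_vec / np.linalg.norm(mention_vec)
    T = type_vecs / np.linalg.norm(type_vecs, axis=1, keepdims=True)
    return type_names[int(np.argmax(T @ m))]

types = ["Attack", "Transport", "Meet"]
vecs = np.random.randn(3, 16)
print(nearest_event_type(np.random.randn(16), vecs, types))
```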


Journal ArticleDOI
TL;DR: In this article, the authors presented an automatic proximal femur segmentation method based on deep convolutional neural networks (CNNs), which achieved a high Dice similarity score of 0.95.
Abstract: Magnetic resonance imaging (MRI) has been proposed as a complementary method to measure bone quality and assess fracture risk. However, manual segmentation of MR images of bone is time-consuming, limiting the use of MRI measurements in clinical practice. The purpose of this paper is to present an automatic proximal femur segmentation method that is based on deep convolutional neural networks (CNNs). This study had institutional review board approval and written informed consent was obtained from all subjects. A dataset of volumetric structural MR images of the proximal femur from 86 subjects was manually segmented by an expert. We performed experiments by training two different CNN architectures with varying numbers of initial feature maps, layers, and dilation rates, and tested their segmentation performance against the gold standard of manual segmentation using four-fold cross-validation. Automatic segmentation of the proximal femur using CNNs achieved a high Dice similarity score of 0.95 ± 0.02, with precision = 0.95 ± 0.02 and recall = 0.95 ± 0.03. The high segmentation accuracy provided by CNNs has the potential to help bring the use of structural MRI measurements of bone quality into clinical practice for the management of osteoporosis.

108 citations
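The reported Dice similarity score is the standard overlap measure between predicted and manual segmentation masks; a small NumPy sketch of how it is computed:

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

a = np.random.rand(64, 64, 64) > 0.5
b = np.random.rand(64, 64, 64) > 0.5
print(round(dice_score(a, b), 3))   # about 0.5 in expectation for random masks
```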


Proceedings ArticleDOI
TL;DR: It is demonstrated that the increased flexibility offered by the product-of-experts model allowed it to achieve state-of-the-art performance on the Amazon review dataset, outperforming the LDA-based approach; interestingly, however, the greater modeling power offered by the recurrent neural network appears to undermine the model's ability to act as a regularizer of the product representations.
Abstract: Recent work has shown that collaborative filter-based recommender systems can be improved by incorporating side information, such as natural language reviews, as a way of regularizing the derived product representations. Motivated by the success of this approach, we introduce two different models of reviews and study their effect on collaborative filtering performance. While the previous state-of-the-art approach is based on a latent Dirichlet allocation (LDA) model of reviews, the models we explore are neural network based: a bag-of-words product-of-experts model and a recurrent neural network. We demonstrate that the increased flexibility offered by the product-of-experts model allowed it to achieve state-of-the-art performance on the Amazon review dataset, outperforming the LDA-based approach. However, interestingly, the greater modeling power offered by the recurrent neural network appears to undermine the model's ability to act as a regularizer of the product representations.

94 citations
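A product-of-experts combines expert distributions multiplicatively and renormalizes. A minimal NumPy sketch of that combination rule (the experts here are toy distributions, not the paper's bag-of-words model):

```python
import numpy as np

def product_of_experts(expert_log_probs):
    """Combine experts by summing log-probabilities and renormalizing:
    p(x) ∝ ∏_k p_k(x)."""
    logp = np.sum(expert_log_probs, axis=0)
    logp -= logp.max()                    # numerical stability
    p = np.exp(logp)
    return p / p.sum()

experts = np.log(np.array([[0.7, 0.2, 0.1],
                           [0.4, 0.4, 0.2]]))
print(product_of_experts(experts))
```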


Proceedings Article
31 May 2018
TL;DR: The authors proposed DialogWAE, a conditional Wasserstein autoencoder specially designed for dialogue modeling, which models the distribution of data by training a GAN within the latent variable space.
Abstract: Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks, and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms the state-of-the-art approaches in generating more coherent, informative and diverse responses.

79 citations
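Sampling from a Gaussian mixture prior, the ingredient that enriches the latent space, can be sketched as follows; in the actual model the mixture parameters are produced by a prior network from the dialogue context, whereas here they are plain arrays:

```python
import numpy as np

def sample_gmm_prior(pis, mus, sigmas, rng=np.random.default_rng()):
    """Draw one latent z from a Gaussian mixture prior.

    pis    : mixture weights, shape (K,), summing to 1
    mus    : component means, shape (K, d)
    sigmas : component stddevs, shape (K, d)
    """
    k = rng.choice(len(pis), p=pis)                 # pick a mode
    return mus[k] + sigmas[k] * rng.standard_normal(mus.shape[1])

z = sample_gmm_prior(np.array([0.3, 0.7]),
                     np.zeros((2, 8)), np.ones((2, 8)))
```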


Journal ArticleDOI
TL;DR: In this paper, a semi-supervised variational autoencoder is proposed to generate new molecules with desired properties by sampling from the generative distribution estimated by the model.
Abstract: Although machine learning has been successfully used to propose novel molecules that satisfy desired properties, it is still challenging to explore a large chemical space efficiently. In this paper, we present a conditional molecular design method that facilitates generating new molecules with desired properties. The proposed model, which simultaneously performs both property prediction and molecule generation, is built as a semi-supervised variational autoencoder trained on a set of existing molecules with only a partial annotation. We generate new molecules with desired properties by sampling from the generative distribution estimated by the model. We demonstrate the effectiveness of the proposed model by evaluating it on drug-like molecules. The model improves the performance of property prediction by exploiting unlabeled molecules, and efficiently generates novel molecules fulfilling various target conditions.

57 citations
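Conditional generation then amounts to sampling latent codes from the prior and decoding them together with the target property. A schematic sketch, where `decoder` is a placeholder for the trained decoder network and the latent dimensionality is arbitrary:

```python
import numpy as np

def conditional_sample(decoder, y_target, latent_dim=56, n=10,
                       rng=np.random.default_rng()):
    """Sample n candidate molecules with the desired property y_target
    by drawing latent codes from the prior and decoding them together
    with the property condition."""
    # decoder(z, y) -> molecule (e.g., a SMILES string); a placeholder here.
    return [decoder(rng.standard_normal(latent_dim), y_target)
            for _ in range(n)]
```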


Posted Content
TL;DR: Pommerman, a multi-agent environment based on the classic console game Bomberman, consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects.
Abstract: We present Pommerman, a multi-agent environment based on the classic console game Bomberman. Pommerman consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects. We believe that success in Pommerman will require a diverse set of tools and methods, including planning, opponent/teammate modeling, game theory, and communication, and consequently can serve well as a multi-agent benchmark. To date, we have already hosted one competition, and our next one will be featured in the NIPS 2018 competition track.

Posted Content
TL;DR: A significant gap is observed between greedy search and the proposed iterative beam search augmented with selection scoring, demonstrating the importance of the search algorithm in neural dialogue generation.
Abstract: Search strategies for generating a response from a neural dialogue model have received relatively little attention compared to improving network architectures and learning algorithms in recent years. In this paper, we consider a standard neural dialogue model based on recurrent networks with an attention mechanism, and focus on evaluating the impact of the search strategy. We compare four search strategies: greedy search, beam search, iterative beam search and iterative beam search followed by selection scoring. We evaluate these strategies using human evaluation of full conversations and compare them using automatic metrics including log-probabilities, scores and diversity metrics. We observe a significant gap between greedy search and the proposed iterative beam search augmented with selection scoring, demonstrating the importance of the search algorithm in neural dialogue generation.
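Of the four strategies compared, plain beam search is the common baseline. A minimal sketch over a generic next-token scorer; `log_probs_fn` is a placeholder for the dialogue model, and length normalization is omitted:

```python
def beam_search(log_probs_fn, eos, beam_width=4, max_len=20):
    """Plain beam search; log_probs_fn(prefix) -> {token: log_prob}."""
    beams = [([], 0.0)]                          # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:     # keep finished hypotheses
                candidates.append((prefix, score))
                continue
            for tok, lp in log_probs_fn(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(p and p[-1] == eos for p, _ in beams):
            break
    return beams[0][0]                           # highest-scoring hypothesis
```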

Proceedings ArticleDOI
15 Apr 2018
TL;DR: This work explored the limits of breast density classification with a data set coming from over 200,000 breast cancer screening exams and found that a strong convolutional neural network classifier can perform this task comparably to a human expert.
Abstract: Breast density classification is an essential part of breast cancer screening. Although a lot of prior work considered this problem as a task for learning algorithms, to our knowledge, all of them used small and not clinically realistic data both for training and evaluation of their models. In this work, we explored the limits of this task with a data set coming from over 200,000 breast cancer screening exams. We used this data to train and evaluate a strong convolutional neural network classifier. In a reader study, we found that our model can perform this task comparably to a human expert.

Proceedings ArticleDOI
29 Nov 2018
TL;DR: In this paper, the authors empirically investigate the effect of audio preprocessing on music tagging with deep neural networks and show that many commonly used input preprocessing techniques are redundant except magnitude compression.
Abstract: In this paper, we empirically investigate the effect of audio preprocessing on music tagging with deep neural networks. We perform comprehensive experiments involving audio preprocessing using different time-frequency representations, logarithmic magnitude compression, frequency weighting, and scaling. We show that many commonly used input preprocessing techniques are redundant except magnitude compression.
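Magnitude compression, the one preprocessing step found to be non-redundant, is an elementwise logarithm on the (mel-)spectrogram. A minimal sketch, with a random array standing in for a real spectrogram:

```python
import numpy as np

def compress_magnitude(spec, eps=1e-7):
    """Logarithmic magnitude compression of a (mel-)spectrogram,
    the one preprocessing step the paper finds non-redundant."""
    return np.log(spec + eps)

mel = np.abs(np.random.randn(96, 1366))   # stand-in for a mel-spectrogram
log_mel = compress_magnitude(mel)
```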

Journal ArticleDOI
TL;DR: A novel approach is proposed for finding linkages/associations between multimodal brain imaging data, such as structural MRI (sMRI) and functional MRI (fMRI); it employs a deep learning model that treats two different imaging views of the same brain like two different languages conveying some common facts.

Journal ArticleDOI
TL;DR: The proposed model outperforms long short-term memory and NTM variants; further experimental results are provided on the sequential MNIST, Stanford Natural Language Inference, associative recall, and copy tasks.
Abstract: We extend the neural Turing machine (NTM) model into a dynamic neural Turing machine (D-NTM) by introducing trainable address vectors. This addressing scheme maintains for each memory cell two separate vectors, content and address vectors.
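A rough sketch of addressing with trainable address vectors: each cell is scored by matching a query key against the concatenation of its content and address parts. The dimensions and the dot-product scorer are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def dntm_address(memory_content, memory_address, key):
    """Soft addressing over memory cells, each represented by a content
    part (written by the controller) and a trainable address part."""
    cells = np.concatenate([memory_content, memory_address], axis=1)
    scores = cells @ key                        # similarity per cell
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()              # attention over cells

C = np.random.randn(128, 20)    # content vectors
A = np.random.randn(128, 8)     # trainable address vectors
w = dntm_address(C, A, np.random.randn(28))
```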

01 Sep 2018
TL;DR: Pommerman, as mentioned in this paper, is a multi-agent environment based on the classic console game Bomberman, which consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects.
Abstract: We present Pommerman, a multi-agent environment based on the classic console game Bomberman. Pommerman consists of a set of scenarios, each having at least four players and containing both cooperative and competitive aspects. We believe that success in Pommerman will require a diverse set of tools and methods, including planning, opponent/teammate modeling, game theory, and communication, and consequently can serve well as a multi-agent benchmark. To date, we have already hosted one competition, and our next one will be featured in the NIPS 2018 competition track.

Posted Content
TL;DR: DialogWAE is proposed, a conditional Wasserstein autoencoder specially designed for dialogue modeling that models the distribution of data by training a GAN within the latent variable space and develops a Gaussian mixture prior network to enrich the latent space.
Abstract: Variational autoencoders (VAEs) have shown promise in data-driven conversation modeling. However, most VAE conversation models match the approximate posterior distribution over the latent variables to a simple prior such as the standard normal distribution, thereby restricting the generated responses to a relatively simple (e.g., unimodal) scope. In this paper, we propose DialogWAE, a conditional Wasserstein autoencoder (WAE) specially designed for dialogue modeling. Unlike VAEs that impose a simple distribution over the latent variables, DialogWAE models the distribution of data by training a GAN within the latent variable space. Specifically, our model samples from the prior and posterior distributions over the latent variables by transforming context-dependent random noise using neural networks, and minimizes the Wasserstein distance between the two distributions. We further develop a Gaussian mixture prior network to enrich the latent space. Experiments on two popular datasets show that DialogWAE outperforms the state-of-the-art approaches in generating more coherent, informative and diverse responses.

Posted Content
TL;DR: The approach, Backplay, uses a single demonstration to construct a curriculum for a given task; the authors analytically characterize the types of environments where Backplay can improve training speed and show that it compares favorably to other competitive methods known to improve sample efficiency.
Abstract: Model-free reinforcement learning (RL) requires a large number of trials to learn a good policy, especially in environments with sparse rewards. We explore a method to improve the sample efficiency when we have access to demonstrations. Our approach, Backplay, uses a single demonstration to construct a curriculum for a given task. Rather than starting each training episode in the environment's fixed initial state, we start the agent near the end of the demonstration and move the starting point backwards during the course of training until we reach the initial state. Our contributions are that we analytically characterize the types of environments where Backplay can improve training speed, demonstrate the effectiveness of Backplay both in large grid worlds and a complex four player zero-sum game (Pommerman), and show that Backplay compares favorably to other competitive methods known to improve sample efficiency. This includes reward shaping, behavioral cloning, and reverse curriculum generation.
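The curriculum itself is just a schedule over demonstration indices: early episodes start near the demonstration's end, later ones progressively earlier. A minimal sketch; the linear schedule and uniform sampling window are assumptions, as the paper may anneal differently:

```python
import random

def backplay_start_index(demo_len, step, total_steps):
    """Pick a start state from a demonstration: begin near its end and
    move the starting window backwards as training progresses."""
    frac = min(1.0, step / total_steps)          # 0 at the start of training
    earliest = int((1.0 - frac) * (demo_len - 1))
    return random.randint(earliest, demo_len - 1)

# Early in training we start near the goal; later, from anywhere.
print(backplay_start_index(100, step=0, total_steps=1000))     # 99
print(backplay_start_index(100, step=1000, total_steps=1000))  # 0..99
```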

Journal ArticleDOI
23 Mar 2018
TL;DR: In this article, the effects of noisy labels on deep neural networks for music classification are investigated. The authors show that networks can be effective despite relatively large error rates in ground-truth datasets, and conjecture that label noise may be the cause of varying tag-wise performance differences.
Abstract: Deep neural networks (DNNs) have been successfully applied to music classification, including music tagging. However, there are several open questions regarding the training, evaluation, and analysis of DNNs. In this paper, we investigate a specific aspect of neural networks, the effects of noisy labels, to deepen our understanding of their properties. We analyze and (re-)validate a large music tagging dataset to investigate the reliability of training and evaluation. Using a trained network, we compute label vector similarities, which are compared to ground-truth similarity. The results highlight several important aspects of music tagging and neural networks. We show that networks can be effective despite relatively large error rates in ground-truth datasets, while conjecturing that label noise can be the cause of varying tag-wise performance differences. Finally, the analysis of our trained network provides valuable insight into the relationships between music tags. These results highlight the benefit of using data-driven methods to address automatic music tagging.
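The label-vector analysis boils down to comparing pairwise tag similarities from the network against similarities from the ground truth. A small NumPy sketch of the similarity matrix computation (cosine similarity assumed):

```python
import numpy as np

def label_similarity_matrix(label_vectors):
    """Pairwise cosine similarities between per-tag vectors (e.g. the
    network's output co-activations), comparable against a ground-truth
    tag co-occurrence similarity."""
    V = label_vectors / np.linalg.norm(label_vectors, axis=1, keepdims=True)
    return V @ V.T

tags = np.random.rand(50, 100)    # 50 tags, 100-dim vectors
S = label_similarity_matrix(tags)
```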

Proceedings ArticleDOI
01 Jul 2018
TL;DR: This paper describes the authors' work for the CALCS 2018 shared task on named entity recognition on code-switched data, where the system ranked first place for MS Arabic-Egyptian named entity recognition and third place for English-Spanish.
Abstract: We describe our work for the CALCS 2018 shared task on named entity recognition on code-switched data. Our system ranked first place for MS Arabic-Egyptian named entity recognition and third place for English-Spanish.

Posted Content
TL;DR: This paper proposes a meta-learning approach for low-resource NMT, which uses the universal lexical representation to overcome the input-output mismatch across different languages, and shows that the proposed approach significantly outperforms the multilingual, transfer learning based approach.
Abstract: In this paper, we propose to extend the recently introduced model-agnostic meta-learning algorithm (MAML) for low-resource neural machine translation (NMT). We frame low-resource translation as a meta-learning problem, and we learn to adapt to low-resource languages based on multilingual high-resource language tasks. We use the universal lexical representation (Gu et al., 2018) to overcome the input-output mismatch across different languages. We evaluate the proposed meta-learning strategy using eighteen European languages (Bg, Cs, Da, De, El, Es, Et, Fr, Hu, It, Lt, Nl, Pl, Pt, Sk, Sl, Sv and Ru) as source tasks and five diverse languages (Ro, Lv, Fi, Tr and Ko) as target tasks. We show that the proposed approach significantly outperforms the multilingual, transfer learning based approach (Zoph et al., 2016) and enables us to train a competitive NMT system with only a fraction of training examples. For instance, the proposed approach can achieve as high as 22.04 BLEU on Romanian-English WMT'16 by seeing only 16,000 translated words (~600 parallel sentences).
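At its core, the method runs MAML's inner/outer loop over simulated low-resource translation tasks. A first-order NumPy sketch with a toy scalar loss; the real method operates on NMT parameters with a universal lexical representation:

```python
import numpy as np

def maml_step(theta, tasks, grad_fn, inner_lr=0.01, outer_lr=0.001):
    """One first-order MAML meta-update over a batch of tasks.

    grad_fn(theta, task) -> gradient of that task's loss at theta.
    """
    meta_grad = np.zeros_like(theta)
    for task in tasks:
        adapted = theta - inner_lr * grad_fn(theta, task)   # inner adaptation
        meta_grad += grad_fn(adapted, task)                 # post-adaptation signal
    return theta - outer_lr * meta_grad / len(tasks)

# Toy check: each "task" is a target point with squared-distance loss.
tasks = [np.array([1.0]), np.array([-2.0])]
grad_fn = lambda th, t: 2.0 * (th - t)
theta = maml_step(np.zeros(1), tasks, grad_fn)
```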

Posted Content
TL;DR: This work proposes a retrieval-augmented convolutional network trained with local mixup, a novel variant of the recently proposed mixup algorithm; local mixup addresses on-manifold adversarial examples by explicitly encouraging the classifier to locally behave linearly on the data manifold.
Abstract: We propose a retrieval-augmented convolutional network and propose to train it with local mixup, a novel variant of the recently proposed mixup algorithm. The proposed hybrid architecture combining a convolutional network and an off-the-shelf retrieval engine was designed to mitigate the adverse effect of off-manifold adversarial examples, while the proposed local mixup addresses on-manifold ones by explicitly encouraging the classifier to locally behave linearly on the data manifold. Our evaluation of the proposed approach against five readily-available adversarial attacks on three datasets (CIFAR-10, SVHN, and ImageNet) demonstrates improved robustness compared to the vanilla convolutional network.
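Standard mixup interpolates arbitrary training pairs; the local variant restricts interpolation to nearby points on the data manifold, here assumed to be retrieved neighbors. A NumPy sketch of one mixed sample (the paper's exact neighborhood scheme may differ):

```python
import numpy as np

def local_mixup(x, y, neighbors_x, neighbors_y, alpha=0.2,
                rng=np.random.default_rng()):
    """Mix an example only with one of its retrieved neighbors,
    encouraging locally linear behavior on the data manifold."""
    j = rng.integers(len(neighbors_x))
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x + (1.0 - lam) * neighbors_x[j]
    y_mix = lam * y + (1.0 - lam) * neighbors_y[j]
    return x_mix, y_mix

x, y = np.random.randn(32), np.array([1.0, 0.0])
nx, ny = np.random.randn(5, 32), np.tile([0.0, 1.0], (5, 1))
x_mix, y_mix = local_mixup(x, y, nx, ny)
```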

Journal ArticleDOI
TL;DR: In this paper, a novel recurrent neural network (RNN) approach is introduced to account for temporal dynamics and dependencies in brain networks observed via functional magnetic resonance imaging (fMRI).
Abstract: We introduce a novel recurrent neural network (RNN) approach to account for temporal dynamics and dependencies in brain networks observed via functional magnetic resonance imaging (fMRI). Our approach directly parameterizes temporal dynamics through recurrent connections, which can be used to formulate blind source separation with a conditional (rather than marginal) independence assumption, which we call RNN-ICA. This formulation enables us to visualize the temporal dynamics of both first order (activity) and second order (directed connectivity) information in brain networks that are widely studied in a static sense, but not well-characterized dynamically. RNN-ICA predicts dynamics directly from the recurrent states of the RNN in both task and resting state fMRI. Our results show both task-related and group-differentiating directed connectivity.

Proceedings Article
15 Feb 2018
TL;DR: In this paper, a novel multiset loss function is proposed by viewing multiset prediction from the perspective of sequential decision making; it is empirically evaluated on two families of datasets, one synthetic and the other real, with varying levels of difficulty, against various baseline loss functions.
Abstract: We study the problem of multiset prediction. The goal of multiset prediction is to train a predictor that maps an input to a multiset consisting of multiple items. Unlike existing problems in supervised learning, such as classification, ranking and sequence generation, there is no known order among items in a target multiset, and each item in the multiset may appear more than once, making this problem extremely challenging. In this paper, we propose a novel multiset loss function by viewing this problem from the perspective of sequential decision making. The proposed multiset loss function is empirically evaluated on two families of datasets, one synthetic and the other real, with varying levels of difficulty, against various baseline loss functions including reinforcement learning, sequence, and aggregated distribution matching loss functions. The experiments reveal the effectiveness of the proposed loss function over the others.
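In the sequential view, the predictor emits one item at a time, and the per-step target is the uniform distribution over the items still remaining in the multiset. A sketch of that per-step loss; bookkeeping across steps is omitted:

```python
import numpy as np
from collections import Counter

def multiset_step_loss(log_probs, remaining):
    """Cross-entropy between the uniform distribution over items still
    in the target multiset and the model's predictive distribution.

    log_probs : dict item -> log-probability from the model
    remaining : Counter of items not yet predicted
    """
    total = sum(remaining.values())
    return -sum((cnt / total) * log_probs[item]
                for item, cnt in remaining.items())

remaining = Counter({"dog": 2, "cat": 1})
log_probs = {"dog": np.log(0.5), "cat": np.log(0.3), "bird": np.log(0.2)}
print(round(multiset_step_loss(log_probs, remaining), 3))
```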

Proceedings Article
15 Feb 2018
TL;DR: A method is introduced for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator; the effectiveness of the proposed algorithm is demonstrated on discrete image and character-based natural language generation.
Abstract: Generative adversarial networks are a learning framework that rely on training a discriminator to estimate a measure of difference between a target and generated distributions. GANs, as normally formulated, rely on the generated samples being completely differentiable with respect to the generative parameters, and thus do not work for discrete data. We introduce a method for training GANs with discrete data that uses the estimated difference measure from the discriminator to compute importance weights for generated samples, thus providing a policy gradient for training the generator. The importance weights have a strong connection to the decision boundary of the discriminator, and we call our method boundary-seeking GANs (BGANs). We demonstrate the effectiveness of the proposed algorithm with discrete image and character-based natural language generation. In addition, the boundary-seeking objective extends to continuous data, which can be used to improve stability of training, and we demonstrate this on CelebA, Large-scale Scene Understanding (LSUN) bedrooms, and ImageNet without conditioning.
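The generator's policy gradient is weighted by a density ratio estimated from the discriminator. A minimal sketch of the importance-weight computation, self-normalized over a batch, with D giving the probability a sample is real:

```python
import numpy as np

def bgan_importance_weights(d_probs, eps=1e-7):
    """Importance weights w = D / (1 - D) for generated samples,
    normalized over the batch; these weight the policy gradient
    that trains the generator on discrete data."""
    d = np.clip(d_probs, eps, 1.0 - eps)
    w = d / (1.0 - d)
    return w / w.sum()

print(bgan_importance_weights(np.array([0.2, 0.5, 0.8])))
```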

Proceedings ArticleDOI
01 Jan 2018
TL;DR: This work addresses concerns with a model that incorporates document covariates to estimate conditional word embedding distributions and allows for hypothesis tests about the meanings of terms, and assessments as to whether a word is near or far from another conditioned on different covariate values.
Abstract: Conventional word embedding models do not leverage information from document meta-data, and they do not model uncertainty. We address these concerns with a model that incorporates document covariates to estimate conditional word embedding distributions. Our model allows for (a) hypothesis tests about the meanings of terms, (b) assessments as to whether a word is near or far from another conditioned on different covariate values, and (c) assessments as to whether estimated differences are statistically significant.

Posted Content
TL;DR: This paper proposes a simple baseline method that allows the amount of copying to be controlled without retraining; experiments indicate that the method provides a strong baseline for abstractive systems looking to obtain high ROUGE scores while minimizing overlap with the source article.
Abstract: Attention-based neural abstractive summarization systems equipped with copy mechanisms have shown promising results. Despite this success, it has been noticed that such a system generates a summary by mostly, if not entirely, copying over phrases, sentences, and sometimes multiple consecutive sentences from an input paragraph, effectively performing extractive summarization. In this paper, we verify this behavior using the latest neural abstractive summarization system, a pointer-generator network. We propose a simple baseline method that allows us to control the amount of copying without retraining. Experiments indicate that the method provides a strong baseline for abstractive systems looking to obtain high ROUGE scores while minimizing overlap with the source article, substantially reducing the n-gram overlap with the original article while keeping within 2 points of the original model's ROUGE score.
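The amount of copying can be quantified directly as n-gram overlap with the source article. A small sketch of that measurement (the control method itself is not reproduced here):

```python
def ngram_overlap(summary, article, n=3):
    """Fraction of the summary's n-grams that also occur verbatim
    in the source article."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    s, a = ngrams(summary.split(), n), ngrams(article.split(), n)
    return len(s & a) / len(s) if s else 0.0

article = "the quick brown fox jumps over the lazy dog"
print(ngram_overlap("the quick brown fox sleeps", article))  # 2/3
```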

Posted Content
19 Apr 2018
TL;DR: This work considers self-driving cars coordinating with each other and focuses on how communication influences the agents' collective behavior, finding that communication helps (most) with adverse conditions.
Abstract: Interest in emergent communication has recently surged in Machine Learning. The focus of this interest has largely been either on investigating the properties of the learned protocol or on utilizing emergent communication to better solve problems that already have a viable solution. Here, we consider self-driving cars coordinating with each other and focus on how communication influences the agents' collective behavior. Our main result is that communication helps (most) with adverse conditions.

15 Feb 2018
TL;DR: In this paper, the authors propose a new policy, called a nearest neighbor policy, which does not require any optimization for simple, low-dimensional continuous control tasks and allows them to investigate the underlying difficulty of a task without being distracted by optimization difficulty.
Abstract: We design a new policy, called a nearest neighbor policy, that does not require any optimization for simple, low-dimensional continuous control tasks. As this policy does not require any optimization, it allows us to investigate the underlying difficulty of a task without being distracted by the optimization difficulty of a learning algorithm. We propose two variants: one that retrieves an entire trajectory based on a pair of initial and goal states, and the other retrieving a partial trajectory based on a pair of current and goal states. We test the proposed policies on five widely-used benchmark continuous control tasks with a sparse reward: Reacher, Half Cheetah, Double Pendulum, Cart Pole and Mountain Car. We observe that the majority (the first four) of these tasks, which have been considered difficult, are easily solved by the proposed policies with high success rates, indicating that their reported difficulties may have likely been due to optimization difficulty. Our work suggests that it is necessary to evaluate any sophisticated policy learning algorithm on more challenging problems in order to truly assess the advances from it.
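The first variant reduces to retrieving the stored trajectory whose (initial, goal) pair is nearest to the query and replaying it. A minimal NumPy sketch; the Euclidean key and verbatim replay are assumptions consistent with the description above:

```python
import numpy as np

def nearest_neighbor_policy(buffer, start, goal):
    """Retrieve the stored trajectory whose (initial, goal) pair is
    closest to the query and replay its action sequence verbatim.

    buffer : list of ((initial_state, goal_state), actions)
    """
    key = np.concatenate([start, goal])
    dists = [np.linalg.norm(np.concatenate([s, g]) - key)
             for (s, g), _ in buffer]
    return buffer[int(np.argmin(dists))][1]

buffer = [((np.zeros(2), np.ones(2)), ["up", "right"]),
          ((np.ones(2), np.zeros(2)), ["down", "left"])]
print(nearest_neighbor_policy(buffer, np.zeros(2), np.ones(2)))
```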