
Showing papers on "Representation (systemics)" published in 2018


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, the authors argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and they propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
Abstract: The thud of a bouncing ball, the onset of speech as lips open—when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator’s voice from a foreign official’s speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory.
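
A minimal sketch of the self-supervised objective described above (not the authors' architecture): a small fused network is trained to predict whether a video clip and an audio clip are temporally aligned, with misaligned negatives produced by pairing each clip with another clip's audio. Tensor shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AlignmentNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.video = nn.Sequential(            # frames: (B, 3, T, H, W)
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))
        self.audio = nn.Sequential(            # log-mel spectrogram: (B, 1, T_a, F)
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, 1)           # fused multisensory feature -> aligned?

    def forward(self, frames, spec):
        v = self.video(frames).flatten(1)
        a = self.audio(spec).flatten(1)
        return self.head(torch.cat([v, a], dim=1)).squeeze(1)

# Self-supervised labels: real pairs are positives, audio rolled within the batch
# gives "misaligned" negatives -- no human annotation required.
net, loss_fn = AlignmentNet(), nn.BCEWithLogitsLoss()
frames = torch.randn(4, 3, 8, 64, 64)
spec = torch.randn(4, 1, 32, 40)
logits_pos = net(frames, spec)
logits_neg = net(frames, spec.roll(1, dims=0))   # pair each clip with another clip's audio
loss = loss_fn(logits_pos, torch.ones(4)) + loss_fn(logits_neg, torch.zeros(4))
loss.backward()
```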

652 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a simple integral operation is shown to relate and unify the heat map representation and joint regression, thus avoiding the non-differentiable post-processing and quantization error of heat map based human pose estimation.
Abstract: State-of-the-art human pose estimation methods are based on heat map representation. In spite of the good performance, the representation has a few issues in nature, such as non-differentiable post-processing and quantization error. This work shows that a simple integral operation relates and unifies the heat map representation and joint regression, thus avoiding the above issues. It is differentiable, efficient, and compatible with any heat map based methods. Its effectiveness is convincingly validated via comprehensive ablation experiments under various settings, specifically on 3D pose estimation, for the first time.
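
A minimal sketch of the integral ("soft-argmax") operation the abstract describes: the heat map is normalized with a softmax and the joint coordinate is recovered as the expected value over pixel locations, which is differentiable and avoids the quantization error of a hard argmax. Shapes are illustrative.

```python
import torch

def integral_regression(heatmaps):
    """heatmaps: (B, J, H, W) -> joint coordinates (B, J, 2) in pixel units."""
    b, j, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, j, -1), dim=-1).reshape(b, j, h, w)
    ys = torch.arange(h, dtype=probs.dtype)            # row indices
    xs = torch.arange(w, dtype=probs.dtype)            # column indices
    x = (probs.sum(dim=2) * xs).sum(dim=-1)            # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)            # expectation over rows
    return torch.stack([x, y], dim=-1)

coords = integral_regression(torch.randn(2, 17, 64, 48))   # e.g. 17 COCO joints
print(coords.shape)   # torch.Size([2, 17, 2])
```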

536 citations


BookDOI
01 Sep 2018
TL;DR: The Oxford Handbook of 4E Cognition, as mentioned in this paper, provides a systematic overview of the state of the art in the field of 4E cognition, including recent trends such as Bayesian inference and predictive coding, new insights and findings regarding social understanding, and new theoretical paradigms for understanding emotions and conceptualizing the interaction between cognition, language and culture.
Abstract: The Oxford Handbook of 4E Cognition provides a systematic overview of the state of the art in the field of 4E cognition: it includes chapters on hotly debated topics, for example, on the nature of cognition and the relation between cognition, perception and action; it discusses recent trends such as Bayesian inference and predictive coding; it presents new insights and findings regarding social understanding, including the development of false belief understanding, and introduces new theoretical paradigms for understanding emotions and conceptualizing the interaction between cognition, language and culture. Each thematic section ends with a critical note to foster fruitful discussion. In addition, the final section of the book is dedicated to applications of 4E cognition approaches in disciplines such as psychiatry and robotics. This is a book with high relevance for philosophers, psychologists, psychiatrists, and neuroscientists, as well as for anyone with an interest in the study of cognition and a wider audience interested in 4E cognition approaches.

391 citations


Posted Content
TL;DR: An in-depth review of recent advances in representation learning with a focus on autoencoder-based models, organized around meta-priors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features.
Abstract: Learning useful representations with little or no supervision is a key challenge in artificial intelligence. We provide an in-depth review of recent advances in representation learning with a focus on autoencoder-based models. To organize these results we make use of meta-priors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features. In particular, we uncover three main mechanisms to enforce such properties, namely (i) regularizing the (approximate or aggregate) posterior distribution, (ii) factorizing the encoding and decoding distribution, or (iii) introducing a structured prior distribution. While there are some promising results, implicit or explicit supervision remains a key enabler and all current methods use strong inductive biases and modeling assumptions. Finally, we provide an analysis of autoencoder-based representation learning through the lens of rate-distortion theory and identify a clear tradeoff between the amount of prior knowledge available about the downstream tasks, and how useful the representation is for this task.
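
A minimal sketch of mechanism (i) above, regularizing the approximate posterior of an autoencoder, in the style of a beta-VAE. The architecture and the beta value are illustrative assumptions, not a specific model from the survey.

```python
import torch
import torch.nn as nn

class SmallVAE(nn.Module):
    def __init__(self, d_in=784, d_z=10):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_z)     # outputs mean and log-variance
        self.dec = nn.Linear(d_z, d_in)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        return self.dec(z), mu, logvar

def beta_vae_loss(x, recon, mu, logvar, beta=4.0):
    recon_term = ((recon - x) ** 2).sum(dim=-1).mean()
    # KL(q(z|x) || N(0, I)) -- the posterior regularizer that encourages disentanglement
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=-1).mean()
    return recon_term + beta * kl

model = SmallVAE()
x = torch.rand(32, 784)
loss = beta_vae_loss(x, *model(x))
loss.backward()
```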

307 citations


Journal ArticleDOI
TL;DR: This survey covers the recent state of the art in state representation learning (SRL), reviewing SRL methods that involve interaction with the environment, their implementations, and their applications to robotics control tasks (simulated or real).

274 citations


Posted Content
TL;DR: It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
Abstract: The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned. We use this learned representation for three applications: (a) sound source localization, i.e. visualizing the source of sound in a video; (b) audio-visual action recognition; and (c) on/off-screen audio source separation, e.g. removing the off-screen translator's voice from a foreign official's speech. Code, models, and video results are available on our webpage: http://andrewowens.com/multisensory

252 citations


Journal ArticleDOI
TL;DR: The results establish a principled link between high-level actions and abstract representations, a concrete theoretical foundation for constructing abstract representations with provable properties, and a practical mechanism for autonomously learning abstract high- level representations.
Abstract: We consider the problem of constructing abstract representations for planning in high-dimensional, continuous environments. We assume an agent equipped with a collection of high-level actions, and construct representations provably capable of evaluating plans composed of sequences of those actions. We first consider the deterministic planning case, and show that the relevant computation involves set operations performed over sets of states. We define the specific collection of sets that is necessary and sufficient for planning, and use them to construct a grounded abstract symbolic representation that is provably suitable for deterministic planning. The resulting representation can be expressed in PDDL, a canonical high-level planning domain language; we construct such a representation for the Playroom domain and solve it in milliseconds using an off-the-shelf planner. We then consider probabilistic planning, which we show requires generalizing from sets of states to distributions over states. We identify the specific distributions required for planning, and use them to construct a grounded abstract symbolic representation that correctly estimates the expected reward and probability of success of any plan. In addition, we show that learning the relevant probability distributions corresponds to specific instances of probabilistic density estimation and probabilistic classification. We construct an agent that autonomously learns the correct abstract representation of a computer game domain, and rapidly solves it. Finally, we apply these techniques to create a physical robot system that autonomously learns its own symbolic representation of a mobile manipulation task directly from sensorimotor data---point clouds, map locations, and joint angles---and then plans using that representation. Together, these results establish a principled link between high-level actions and abstract representations, a concrete theoretical foundation for constructing abstract representations with provable properties, and a practical mechanism for autonomously learning abstract high-level representations.
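
A toy sketch of the set-operation view of deterministic planning described above: each high-level action is summarized by the set of states in which it can run (precondition) and the set of states it can produce (image). A plan is feasible from a start set if, chaining these sets, every step's precondition is reachable. The states, actions, and sets here are illustrative toy assumptions, not the paper's domains.

```python
def plan_is_feasible(start_states, plan, actions):
    reachable = set(start_states)
    for name in plan:
        precondition, image = actions[name]
        if not (reachable & precondition):      # no reachable state satisfies the precondition
            return False
        reachable = set(image)                  # after the action, only its image is reachable
    return True

# Toy domain: states are labels; "open_door" requires being at the door.
actions = {
    "walk_to_door": ({"start"}, {"at_door"}),
    "open_door":    ({"at_door"}, {"door_open"}),
}
print(plan_is_feasible({"start"}, ["walk_to_door", "open_door"], actions))  # True
print(plan_is_feasible({"start"}, ["open_door"], actions))                  # False
```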

234 citations


Book ChapterDOI
08 Sep 2018
TL;DR: In this article, a weakly supervised method for 3D human pose estimation is proposed that learns a geometry-aware body representation from multi-view images without annotations, reducing the number of samples with 3D annotations needed for learning to succeed.
Abstract: Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods require less supervision, by utilizing 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations.

211 citations


23 Oct 2018
TL;DR: In this article, a self-supervised dense object representation for visual understanding and manipulation is proposed, which can be trained in approximately 20 minutes for a wide variety of previously unseen and potentially non-rigid objects.
Abstract: What is the right object representation for manipulation? We would like robots to visually perceive scenes and learn an understanding of the objects in them that (i) is task-agnostic and can be used as a building block for a variety of manipulation tasks, (ii) is generally applicable to both rigid and non-rigid objects, (iii) takes advantage of the strong priors provided by 3D vision, and (iv) is entirely learned from self-supervision. This is hard to achieve with previous methods: much recent work in grasping does not extend to grasping specific objects or other tasks, whereas task-specific learning may require many trials to generalize well across object configurations or other tasks. In this paper we present Dense Object Nets, which build on recent developments in self-supervised dense descriptor learning, as a consistent object representation for visual understanding and manipulation. We demonstrate they can be trained quickly (approximately 20 minutes) for a wide variety of previously unseen and potentially non-rigid objects. We additionally present novel contributions to enable multi-object descriptor learning, and show that by modifying our training procedure, we can either acquire descriptors which generalize across classes of objects, or descriptors that are distinct for each object instance. Finally, we demonstrate the novel application of learned dense descriptors to robotic manipulation. We demonstrate grasping of specific points on an object across potentially deformed object configurations, and demonstrate using class general descriptors to transfer specific grasps across objects in a class.
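
A minimal sketch of how a dense descriptor image (as produced by Dense Object Nets) can be used for correspondence: given a pixel in image A, find the pixel in image B whose descriptor is closest. The descriptor tensors here are random stand-ins; in practice they would come from the trained fully-convolutional descriptor network.

```python
import numpy as np

def best_match(desc_a, desc_b, u, v):
    """desc_*: (H, W, D) descriptor images; (u, v) = (row, col) query pixel in image A."""
    query = desc_a[u, v]                                    # (D,) descriptor at the query pixel
    dists = np.linalg.norm(desc_b - query, axis=-1)         # (H, W) distance field over image B
    return np.unravel_index(np.argmin(dists), dists.shape)  # best-matching pixel in image B

desc_a = np.random.rand(60, 80, 3).astype(np.float32)
desc_b = np.random.rand(60, 80, 3).astype(np.float32)
print(best_match(desc_a, desc_b, u=10, v=20))
```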

184 citations



Proceedings ArticleDOI
Kyle Gao, Achille Fokoue, Heng Luo, Arun Iyengar, Sanjoy Dey, Ping Zhang
01 Jul 2018
TL;DR: This work proposes an end-to-end neural network model that predicts DTIs directly from low-level representations and provides biological interpretation using a two-way attention mechanism.
Abstract: The identification of drug-target interactions (DTIs) is a key task in drug discovery, where drugs are chemical compounds and targets are proteins. Traditional DTI prediction methods are either time consuming (simulation-based methods) or heavily dependent on domain expertise (similarity-based and feature-based methods). In this work, we propose an end-to-end neural network model that predicts DTIs directly from low-level representations. In addition to making predictions, our model provides biological interpretation using a two-way attention mechanism. Instead of using simplified settings where a dataset is evaluated as a whole, we designed an evaluation dataset from BindingDB following more realistic settings where predictions of unobserved examples (proteins and drugs) have to be made. We experimentally compared our model with matrix factorization, similarity-based methods, and a previous deep learning approach. Overall, the results show that our model outperforms other approaches without requiring domain knowledge and feature engineering. In a case study, we illustrated the ability of our approach to provide biological insights to interpret the predictions.
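
A minimal sketch of the end-to-end idea described above: embed the raw drug (e.g. SMILES tokens) and protein (residue) sequences, let each attend over the other ("two-way" attention), and predict interaction from the pooled result. Vocabulary sizes, dimensions, and layers are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TwoWayDTI(nn.Module):
    def __init__(self, n_drug_tokens=64, n_residues=26, d=32):
        super().__init__()
        self.drug_emb = nn.Embedding(n_drug_tokens, d)
        self.prot_emb = nn.Embedding(n_residues, d)
        self.drug_to_prot = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.prot_to_drug = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Linear(2 * d, 1)

    def forward(self, drug_ids, prot_ids):
        dr, pr = self.drug_emb(drug_ids), self.prot_emb(prot_ids)
        d_ctx, d_attn = self.drug_to_prot(dr, pr, pr)   # drug tokens attend to residues
        p_ctx, p_attn = self.prot_to_drug(pr, dr, dr)   # residues attend to drug tokens
        pooled = torch.cat([d_ctx.mean(1), p_ctx.mean(1)], dim=-1)
        # the attention maps give per-token/per-residue interpretability
        return self.out(pooled).squeeze(-1), (d_attn, p_attn)

logits, attn = TwoWayDTI()(torch.randint(0, 64, (8, 40)), torch.randint(0, 26, (8, 200)))
```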

Journal ArticleDOI
TL;DR: This work provides the first evidence for separable scales of representation along the human hippocampal anteroposterior axis by showing greater similarity among voxel time courses and higher temporal autocorrelation in anterior hippocampus (aHPC) relative to posterior hippocampus (pHPC), the human homologs of ventral and dorsal rodent hippocampus.

Journal ArticleDOI
TL;DR: A review of different neuro-fuzzy systems based on the classification of research articles from 2000 to 2017 is proposed to help readers gain a general overview of the state of the art of neuro-fuzzy systems and easily identify suitable methods according to their research interests.
Abstract: Neuro-fuzzy systems have attracted the growing interest of researchers in various scientific and engineering areas due to their effective learning and reasoning capabilities. Neuro-fuzzy systems combine the learning power of artificial neural networks with the explicit knowledge representation of fuzzy inference systems. This paper proposes a review of different neuro-fuzzy systems based on the classification of research articles from 2000 to 2017. The main purpose of this survey is to help readers gain a general overview of the state of the art of neuro-fuzzy systems and easily identify suitable methods according to their research interests. Different neuro-fuzzy models are compared, and a table is presented summarizing the different learning structures and learning criteria with their applications.
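
A minimal sketch of the combination the abstract describes: fuzzy rules with learnable parameters trained by gradient descent, here as a tiny Takagi-Sugeno-style layer. The number of rules and the Gaussian memberships are illustrative assumptions, not any specific system from the review.

```python
import torch
import torch.nn as nn

class TinyNeuroFuzzy(nn.Module):
    def __init__(self, n_inputs=2, n_rules=4):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_rules, n_inputs))   # membership centers
        self.widths = nn.Parameter(torch.ones(n_rules, n_inputs))     # membership widths
        self.consequents = nn.Linear(n_inputs, n_rules)               # one linear output per rule

    def forward(self, x):                                    # x: (B, n_inputs)
        diff = x.unsqueeze(1) - self.centers                 # (B, n_rules, n_inputs)
        membership = torch.exp(-(diff / self.widths) ** 2)   # Gaussian membership per input
        firing = membership.prod(dim=-1)                     # rule firing strengths (B, n_rules)
        weights = firing / firing.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return (weights * self.consequents(x)).sum(dim=-1)   # weighted rule outputs

model = TinyNeuroFuzzy()
y = model(torch.randn(16, 2))   # trainable end to end with any regression loss
```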

Proceedings Article
26 Apr 2018
TL;DR: Results show that the proposed reinforcement learning method can learn task-friendly representations by identifying important words or task-relevant structures without explicit structure annotations, and thus yields competitive performance.
Abstract: Representation learning is a fundamental problem in natural language processing. This paper studies how to learn a structured representation for text classification. Unlike most existing representation models that either use no structure or rely on pre-specified structures, we propose a reinforcement learning (RL) method to learn sentence representation by discovering optimized structures automatically. We demonstrate two attempts to build structured representation: Information Distilled LSTM (ID-LSTM) and Hierarchically Structured LSTM (HS-LSTM). ID-LSTM selects only important, task-relevant words, and HS-LSTM discovers phrase structures in a sentence. Structure discovery in the two representation models is formulated as a sequential decision problem: current decision of structure discovery affects following decisions, which can be addressed by policy gradient RL. Results show that our method can learn task-friendly representations by identifying important words or task-relevant structures without explicit structure annotations, and thus yields competitive performance.
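
A minimal sketch of the ID-LSTM idea as described above: a policy decides which words to keep, the kept words form the sentence representation, and the classification reward trains the policy with a REINFORCE-style update. For brevity the policy scores word embeddings directly instead of running an LSTM; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(1000, 32)
policy = nn.Linear(32, 1)          # keep/drop logit per word
classifier = nn.Linear(32, 2)      # sentence representation -> class logits

tokens = torch.randint(0, 1000, (1, 12))            # one 12-word sentence
label = torch.tensor([1])

e = emb(tokens)                                      # (1, 12, 32)
keep_prob = torch.sigmoid(policy(e)).squeeze(-1)     # (1, 12)
keep = torch.bernoulli(keep_prob)                    # sampled structure: which words survive
rep = (e * keep.unsqueeze(-1)).sum(1) / keep.sum(1, keepdim=True).clamp_min(1.0)

log_probs = torch.log(keep * keep_prob + (1 - keep) * (1 - keep_prob) + 1e-9).sum(1)
reward = -nn.functional.cross_entropy(classifier(rep), label)    # better classification = higher reward
loss = -(reward.detach() * log_probs).mean() + nn.functional.cross_entropy(classifier(rep), label)
loss.backward()    # policy-gradient update for the structure policy + supervised classifier update
```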

Journal ArticleDOI
TL;DR: The model accounts for executive influences on semantics by including a controlled retrieval mechanism that provides top-down input to amplify weak semantic relationships and successfully codes knowledge for abstract and concrete words, associative and taxonomic relationships, and the multiple meanings of homonyms, within a single representational space.
Abstract: Semantic cognition requires conceptual representations shaped by verbal and nonverbal experience and executive control processes that regulate activation of knowledge to meet current situational demands. A complete model must also account for the representation of concrete and abstract words, of taxonomic and associative relationships, and for the role of context in shaping meaning. We present the first major attempt to assimilate all of these elements within a unified, implemented computational framework. Our model combines a hub-and-spoke architecture with a buffer that allows its state to be influenced by prior context. This hybrid structure integrates the view, from cognitive neuroscience, that concepts are grounded in sensory-motor representation with the view, from computational linguistics, that knowledge is shaped by patterns of lexical co-occurrence. The model successfully codes knowledge for abstract and concrete words, associative and taxonomic relationships, and the multiple meanings of homonyms, within a single representational space. Knowledge of abstract words is acquired through (a) their patterns of co-occurrence with other words and (b) acquired embodiment, whereby they become indirectly associated with the perceptual features of co-occurring concrete words. The model accounts for executive influences on semantics by including a controlled retrieval mechanism that provides top-down input to amplify weak semantic relationships. The representational and control elements of the model can be damaged independently, and the consequences of such damage closely replicate effects seen in neuropsychological patients with loss of semantic representation versus control processes. Thus, the model provides a wide-ranging and neurally plausible account of normal and impaired semantic cognition.

Proceedings ArticleDOI
15 Oct 2018
TL;DR: A novel Attribute-Aware Attention Model is proposed, which can learn local attribute representations and a global category representation simultaneously in an end-to-end manner, yielding features that contain more intrinsic information for image recognition instead of noisy and irrelevant features.
Abstract: How to learn a discriminative fine-grained representation is a key point in many computer vision applications, such as person re-identification, fine-grained classification, fine-grained image retrieval, etc. Most of the previous methods focus on learning metrics or ensembles to derive a better global representation, which usually lacks local information. Based on the considerations above, we propose a novel Attribute-Aware Attention Model ($A^3M$), which can learn local attribute representation and global category representation simultaneously in an end-to-end manner. The proposed model contains two attention models: the attribute-guided attention module uses attribute information to help select category features in different regions, while the category-guided attention module selects local features of different attributes with the help of category cues. Through this attribute-category reciprocal process, local and global features benefit from each other. Finally, the resulting feature contains more intrinsic information for image recognition instead of noisy and irrelevant features. Extensive experiments conducted on Market-1501, CompCars, CUB-200-2011 and CARS196 demonstrate the effectiveness of our $A^3M$.
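
A minimal sketch of the attribute-guided half of the attention scheme described above: an attribute embedding produces spatial attention over a convolutional feature map, and the attended feature feeds the category classifier; the reverse (category-guided) branch would mirror this. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeGuidedAttention(nn.Module):
    def __init__(self, n_attributes=30, channels=256, n_classes=100):
        super().__init__()
        self.attr_emb = nn.Embedding(n_attributes, channels)
        self.classify = nn.Linear(channels, n_classes)

    def forward(self, feat_map, attr_id):
        # feat_map: (B, C, H, W) from a CNN backbone; attr_id: (B,) attribute index
        q = self.attr_emb(attr_id)                          # (B, C) attribute query
        scores = torch.einsum("bc,bchw->bhw", q, feat_map)  # attribute/region affinity
        attn = torch.softmax(scores.flatten(1), dim=1).reshape_as(scores)
        attended = torch.einsum("bhw,bchw->bc", attn, feat_map)
        return self.classify(attended)

logits = AttributeGuidedAttention()(torch.randn(4, 256, 7, 7), torch.randint(0, 30, (4,)))
```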

Journal ArticleDOI
04 Sep 2018-eLife
TL;DR: This account shows how previously reported neural responses can map onto higher cognitive function in a modular way, predicts new cell types (egocentric and head-direction-modulated boundary- and object-vector cells), and predicts how these neural populations should interact across multiple brain regions to support spatial memory.
Abstract: We present a model of how neural representations of egocentric spatial experiences in parietal cortex interface with viewpoint-independent representations in medial temporal areas, via retrosplenial cortex, to enable many key aspects of spatial cognition. This account shows how previously reported neural responses (place, head-direction and grid cells, allocentric boundary- and object-vector cells, gain-field neurons) can map onto higher cognitive function in a modular way, and predicts new cell types (egocentric and head-direction-modulated boundary- and object-vector cells). The model predicts how these neural populations should interact across multiple brain regions to support spatial memory, scene construction, novelty-detection, 'trace cells', and mental navigation. Simulated behavior and firing rate maps are compared to experimental data, for example showing how object-vector cells allow items to be remembered within a contextual representation based on environmental boundaries, and how grid cells could update the viewpoint in imagery during planning and short-cutting by driving sequential place cell activity.


Proceedings ArticleDOI
18 Jun 2018
TL;DR: Charades-Ego is introduced, a large-scale dataset of paired first-person and third-person videos involving 112 people and 4000 paired videos, which enables learning the link between the actor and observer perspectives and addresses one of the biggest bottlenecks facing egocentric vision research.
Abstract: Several theories in cognitive neuroscience suggest that when people interact with the world, or simulate interactions, they do so from a first-person egocentric perspective, and seamlessly transfer knowledge between third-person (observer) and first-person (actor). Despite this, learning such models for human action recognition has not been achievable due to the lack of data. This paper takes a step in this direction, with the introduction of Charades-Ego, a large-scale dataset of paired first-person and third-person videos, involving 112 people, with 4000 paired videos. This enables learning the link between the two, actor and observer perspectives. Thereby, we address one of the biggest bottlenecks facing egocentric vision research, providing a link from first-person to the abundant third-person data on the web. We use this data to learn a joint representation of first and third-person videos, with only weak supervision, and show its effectiveness for transferring knowledge from the third-person to the first-person domain.

Posted Content
TL;DR: In this paper, a geometry-aware body representation is learned from multi-view images without annotations, using an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint.
Abstract: Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods require less supervision, by utilizing 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.
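
A minimal sketch of the training signal described above: an encoder maps an image from one viewpoint to a latent body representation, and a decoder must render the same moment from another viewpoint, so the latent is forced to encode 3D geometry. The network sizes and image resolution are illustrative assumptions, and the paper's additional conditioning of the decoder on the relative camera rotation is omitted for brevity.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(64 * 16 * 16, 128))          # 128-d latent body representation
decoder = nn.Sequential(
    nn.Linear(128, 64 * 16 * 16), nn.Unflatten(1, (64, 16, 16)), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))

view_a = torch.rand(8, 3, 64, 64)     # image from camera A
view_b = torch.rand(8, 3, 64, 64)     # same instant seen from camera B (no pose labels needed)
loss = ((decoder(encoder(view_a)) - view_b) ** 2).mean()
loss.backward()
# For semi-supervised pose estimation, a small regressor is then trained on the
# 128-d representation using only the few images that do carry 3D annotations.
```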

Proceedings Article
26 Apr 2018
TL;DR: An unsupervised representation learning approach is explored, for the first time, to capture the long-term global motion dynamics in skeleton sequences, using a conditional skeleton inpainting architecture to learn a fixed-dimensional representation.
Abstract: In recent years, skeleton based action recognition is becoming an increasingly attractive alternative to existing video-based approaches, beneficial from its robust and comprehensive 3D information. In this paper, we explore an unsupervised representation learning approach for the first time to capture the long-term global motion dynamics in skeleton sequences. We design a conditional skeleton inpainting architecture for learning a fixed-dimensional representation, guided by additional adversarial training strategies. We quantitatively evaluate the effectiveness of our learning approach on three well-established action recognition datasets. Experimental results show that our learned representation is discriminative for classifying actions and can substantially reduce the sequence inpainting errors.
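
A minimal sketch of the inpainting pretext task described above: frames of a skeleton sequence are masked and an encoder-decoder must reconstruct them, so the fixed-dimensional encoding has to capture long-term motion dynamics. The GRUs, mask rate, and joint count are illustrative assumptions, and the adversarial training strategies used in the paper are omitted.

```python
import torch
import torch.nn as nn

n_joints, d_in, d_rep = 25, 25 * 3, 128           # 25 joints with (x, y, z) each
encoder = nn.GRU(d_in, d_rep, batch_first=True)
decoder = nn.GRU(d_in, d_rep, batch_first=True)
readout = nn.Linear(d_rep, d_in)

seq = torch.randn(16, 60, d_in)                   # 60-frame skeleton sequences
mask = (torch.rand(16, 60, 1) > 0.3).float()      # drop roughly 30% of frames
_, h = encoder(seq * mask)                        # h: (1, 16, d_rep) fixed-dimensional representation
out, _ = decoder(seq * mask, h)                   # decode conditioned on that representation
recon = readout(out)
loss = (((recon - seq) ** 2) * (1 - mask)).mean() # penalize errors on the masked frames
loss.backward()
# After pretraining, h.squeeze(0) serves as the action feature for a simple classifier.
```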

Journal ArticleDOI
31 Jan 2018
TL;DR: Using autonomous racing tests in the TORCS simulator, it is shown how the integrated methods quickly learn policies that generalize to new environments much better than deep reinforcement learning without state representation learning.
Abstract: Most deep reinforcement learning techniques are unsuitable for robotics, as they require too much interaction time to learn useful, general control policies. This problem can be largely attributed to the fact that a state representation needs to be learned as a part of learning control policies, which can only be done through fitting expected returns based on observed rewards. While the reward function provides information on the desirability of the state of the world, it does not necessarily provide information on how to distill a good, general representation of that state from the sensory observations. State representation learning objectives can be used to help learn such a representation. While many of these objectives have been proposed, they are typically not directly combined with reinforcement learning algorithms. We investigate several methods for integrating state representation learning into reinforcement learning. In these methods, the state representation learning objectives help regularize the state representation during the reinforcement learning, and the reinforcement learning itself is viewed as a crucial state representation learning objective and allowed to help shape the representation. Using autonomous racing tests in the TORCS simulator, we show how the integrated methods quickly learn policies that generalize to new environments much better than deep reinforcement learning without state representation learning.
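
A minimal sketch of the integration described above: the same encoder feeds both the reinforcement-learning head and an auxiliary state-representation objective (here, observation reconstruction), and the two losses are optimized jointly so the auxiliary term regularizes the representation. The architecture, the choice of reconstruction as the SRL objective, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
q_head = nn.Linear(32, 4)            # Q-values for 4 discrete actions
recon_head = nn.Linear(32, 64)       # auxiliary decoder back to the observation

obs = torch.randn(32, 64)
q_target = torch.randn(32, 4)        # stand-in for bootstrapped TD targets

state = encoder(obs)
rl_loss = ((q_head(state) - q_target) ** 2).mean()     # the usual RL objective
srl_loss = ((recon_head(state) - obs) ** 2).mean()     # state representation learning objective
(rl_loss + 0.1 * srl_loss).backward()                  # joint update shapes the shared encoder
```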

Posted Content
TL;DR: A notion of sub-optimality of a representation is developed, defined in terms of the expected reward of the optimal hierarchical policy using this representation; results on a number of difficult continuous-control tasks show that the derived representation learning objectives yield qualitatively better representations as well as quantitatively better hierarchical policies compared to existing methods.
Abstract: We study the problem of representation learning in goal-conditioned hierarchical reinforcement learning. In such hierarchical structures, a higher-level controller solves tasks by iteratively communicating goals which a lower-level policy is trained to reach. Accordingly, the choice of representation -- the mapping of observation space to goal space -- is crucial. To study this problem, we develop a notion of sub-optimality of a representation, defined in terms of expected reward of the optimal hierarchical policy using this representation. We derive expressions which bound the sub-optimality and show how these expressions can be translated to representation learning objectives which may be optimized in practice. Results on a number of difficult continuous-control tasks show that our approach to representation learning yields qualitatively better representations as well as quantitatively better hierarchical policies, compared to existing methods (see videos at this https URL).
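
A minimal sketch of the setup the abstract describes: a representation function maps observations to a goal space, the high-level controller emits goals in that space, and the low-level policy is rewarded for reaching them. The networks and the distance-based intrinsic reward are illustrative assumptions; the paper's contribution, the sub-optimality bound used to train the representation, is not reproduced here.

```python
import torch
import torch.nn as nn

repr_fn = nn.Linear(20, 4)                        # observation -> learned goal space
high_level = nn.Linear(20, 4)                     # observation -> goal (in goal space)
low_level = nn.Linear(20 + 4, 6)                  # (observation, goal) -> action logits

obs = torch.randn(1, 20)
goal = high_level(obs)                            # high level picks a goal every k steps
action_logits = low_level(torch.cat([obs, goal], dim=-1))

next_obs = torch.randn(1, 20)                     # stand-in for an environment step
intrinsic_reward = -torch.norm(repr_fn(next_obs) - goal, dim=-1)   # low-level reward for reaching the goal
```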

Posted Content
TL;DR: As mentioned in this paper, more than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics.
Abstract: Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.

Posted Content
TL;DR: A neural network based representation for open set recognition is presented in which instances from the same class are close to each other while instances from different classes are further apart, resulting in statistically significant improvement over other approaches on three datasets from two different domains.
Abstract: Open set recognition problems exist in many domains. For example in security, new malware classes emerge regularly; therefore malware classification systems need to identify instances from unknown classes in addition to discriminating between known classes. In this paper we present a neural network based representation for addressing the open set recognition problem. In this representation instances from the same class are close to each other while instances from different classes are further apart, resulting in statistically significant improvement when compared to other approaches on three datasets from two different domains.
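
A minimal sketch of how a distance-preserving representation like the one described can be used for open set recognition: embed the training data, keep one centroid per known class, and flag a test instance as "unknown" when it is far from every centroid. The embedding, its training loss, and the threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def predict_open_set(embedding, centroids, threshold=1.0):
    """embedding: (D,) test vector; centroids: dict class -> (D,) mean embedding of that class."""
    dists = {c: np.linalg.norm(embedding - mu) for c, mu in centroids.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] <= threshold else "unknown"

centroids = {"benign": np.array([0.0, 0.0]), "trojan": np.array([3.0, 3.0])}
print(predict_open_set(np.array([0.2, -0.1]), centroids))   # -> benign
print(predict_open_set(np.array([8.0, -7.0]), centroids))   # -> unknown
```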

Proceedings ArticleDOI
01 Jan 2018
TL;DR: A Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) is proposed to learn an image representation that is explicitly disentangled from identity information while still allowing expression-equivalent face image synthesis.
Abstract: Reliable facial expression recognition plays a critical role in human-machine interactions. However, most of the facial expression analysis methodologies proposed to date pay little or no attention to the protection of a user's privacy. In this paper, we propose a Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) to learn an image representation that is explicitly disentangled from the identity information. At the same time, this representation is discriminative from the standpoint of facial expression recognition and generative as it allows expression-equivalent face image synthesis. We evaluate the proposed model on two public datasets under various threat scenarios. Quantitative and qualitative results demonstrate that our approach strikes a balance between the preservation of privacy and data utility. We further demonstrate that our model can be effectively applied to other tasks such as expression morphing and image completion.
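
A simplified sketch of one standard way to obtain an identity-disentangled representation: an adversarial identity classifier whose gradient is reversed into the encoder. This is shown only to illustrate the disentanglement idea; the paper's full variational-GAN setup with expression-equivalent image synthesis is not reproduced here, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -g                          # flip the gradient flowing back into the encoder

encoder = nn.Linear(2048, 128)             # face feature -> representation
expr_head = nn.Linear(128, 7)              # 7 basic expressions
id_head = nn.Linear(128, 50)               # adversary tries to recover identity

feat = torch.randn(16, 2048)
expr_y = torch.randint(0, 7, (16,))
id_y = torch.randint(0, 50, (16,))

z = encoder(feat)
loss = (nn.functional.cross_entropy(expr_head(z), expr_y)                     # keep expression information
        + nn.functional.cross_entropy(id_head(GradReverse.apply(z)), id_y))   # remove identity information
loss.backward()
```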

Journal ArticleDOI
TL;DR: A new remaining useful life prediction approach based on deep feature representation and a long short-term memory neural network is proposed, along with a new criterion, the support vector data normalized correlation coefficient, to automatically divide the whole bearing life into a normal state and a fast degradation state.
Abstract: For the bearing remaining useful life prediction problem, traditional machine-learning-based methods are generally short of feature representation ability and incapable of adaptive feature extraction...


Journal ArticleDOI
TL;DR: Through critical discourse analysis of over 450 media texts produced between 2009 and 2017, the authors report the conceptual understanding of digital-free tourism, trace how its media representation has changed over time, and explore the broader social context and debates in which the concept is embedded.

Journal ArticleDOI
TL;DR: In this paper, gender differences in online gamers' experiences with feedback from other players and spectators during online play were analyzed, and the findings suggest a mixed experience for women that includes more sexual harassment in online gaming than men encounter.
Abstract: Despite the growing popularity of eSports, the poor representation of women players points to a need to understand the experiences of female players during competitive gaming online. The present study focuses on female gamers’ experiences with positive and negative feedback and sexual harassment in the male-dominated space of eSports. In Study 1, gender differences were analyzed in online gamers’ experience with feedback from other players and spectators during online play. In Study 2, gender differences were analyzed in observations of real gameplay that focused on the types of comments spectators directed toward female and male gamers on Twitch (a popular video game streaming website). The findings suggest a mixed experience for women that includes more sexual harassment in online gaming compared with men.