
Showing papers by Marie-Francine Moens published in 2017


Proceedings Article
01 Jan 2017
TL;DR: This paper presents a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations, providing a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory.
Abstract: Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped (or imagined) vectors not only help to fuse multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more human-like judgments. Ultimately, the present work sheds light on fundamental questions of natural language understanding concerning the fusion of vision and language such as the plausibility of more associative and reconstructive approaches.
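
To make the idea concrete, here is a minimal sketch of the mapping-and-concatenation scheme in Python, assuming word embeddings and visual feature vectors are already available as arrays; a closed-form ridge regression stands in for the learned language-to-vision mapping, so the architecture and training details differ from the paper's.

```python
# Minimal sketch of "imagined" multimodal representations. Assumption: ridge
# regression replaces the paper's learned language-to-vision network.
import numpy as np

def fit_language_to_vision(W_text, V_img, lam=1.0):
    """Learn M such that W_text @ M approximates V_img (ridge regression)."""
    d = W_text.shape[1]
    return np.linalg.solve(W_text.T @ W_text + lam * np.eye(d), W_text.T @ V_img)

def multimodal_embedding(w, M):
    """Concatenate a word vector with its 'imagined' visual prediction."""
    imagined = w @ M
    return np.concatenate([w, imagined])

# Toy usage with random arrays standing in for real text/visual embeddings.
rng = np.random.default_rng(0)
W_text = rng.normal(size=(500, 300))   # text embeddings of concepts with images
V_img = rng.normal(size=(500, 128))    # corresponding visual feature vectors
M = fit_language_to_vision(W_text, V_img)
vec = multimodal_embedding(W_text[0], M)
```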

91 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This study employs a structured perceptron with integer linear programming constraints for document-level inference during training and prediction, exploiting relational properties of temporality and learning the relations globally at the document level.
Abstract: We propose a scalable structured learning model that jointly predicts temporal relations between events and temporal expressions (TLINKS), and the relation between these events and the document creation time (DCTR). We employ a structured perceptron together with integer linear programming constraints for document-level inference during training and prediction, which exploits relational properties of temporality and enables global learning of the relations at the document level. Moreover, this study gives insights into the results of integrating constraints for temporal relation extraction when using structured learning and prediction. Our best system outperforms the state-of-the-art on both the CONTAINS TLINK task and the DCTR task.
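
As a rough illustration of the learning scheme, the sketch below implements a plain structured perceptron over toy documents; the `decode` function is an unconstrained per-pair argmax that merely stands in for the paper's ILP-based document-level inference, and the hashed features are hypothetical.

```python
# Minimal structured-perceptron sketch. Assumptions: a "document" is reduced to
# a dict of candidate pairs -> feature strings plus gold labels; `decode` is an
# unconstrained stand-in for ILP inference; features are hashed for simplicity.
import numpy as np

LABELS = ["BEFORE", "CONTAINS", "OVERLAP", "AFTER", "NONE"]
DIM = 2048

def phi(feats, label):
    v = np.zeros(DIM)
    for f in feats:
        v[hash((f, label)) % DIM] += 1.0
    return v

def decode(doc_feats, w):
    # Stand-in for constrained document-level inference: per-pair argmax.
    return {p: max(LABELS, key=lambda y: w @ phi(f, y)) for p, f in doc_feats.items()}

def train(docs, epochs=5):
    # docs: list of (doc_feats, gold) where gold maps pair -> label.
    w = np.zeros(DIM)
    for _ in range(epochs):
        for doc_feats, gold in docs:
            pred = decode(doc_feats, w)
            for p, y in gold.items():
                if pred[p] != y:
                    w += phi(doc_feats[p], y) - phi(doc_feats[p], pred[p])
    return w

# Toy document: one event/timex pair with a couple of surface features.
docs = [({("e1", "t1"): ["lemma=admitted", "tense=PAST"]}, {("e1", "t1"): "CONTAINS"})]
w = train(docs)
```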

61 citations


Journal ArticleDOI
TL;DR: A number of ways are proposed to improve the learning of common sense and world knowledge by exploiting textual and visual data, and the article touches upon how to integrate the learned knowledge in the argumentation mining process.
Abstract: Argumentation mining is an advanced form of human language understanding by the machine. This is a challenging task for a machine. When sufficient explicit discourse markers are present in the language utterances, the argumentation can be interpreted by the machine with an acceptable degree of accuracy. However, in many real settings, the mining task is difficult due to the lack or ambiguity of the discourse markers, and the fact that a substantial amount of knowledge needed for the correct recognition of the argumentation, its composing elements and their relationships is not explicitly present in the text, but makes up the background knowledge that humans possess when interpreting language. In this article we focus on how the machine can automatically acquire the needed common sense and world knowledge. As very little research has been done in this respect, many of the ideas proposed in this article are tentative, but are starting to be researched. We give an overview of the latest methods for human language understanding that map language to a formal knowledge representation that facilitates other tasks (for instance, a representation that is used to visualize the argumentation or that is easily shared in a decision or argumentation support system). Most current systems are trained on texts that are manually annotated. We then go deeper into the new field of representation learning, which is nowadays widely studied in computational linguistics. This field investigates methods for representing language as statistical concepts or as vectors, allowing straightforward methods of compositionality. The methods often use deep learning and its underlying neural network technologies to learn concepts from large text collections in an unsupervised way (i.e., without the need for manual annotations). We show how these methods can help the argumentation mining process, but also demonstrate that these methods need further research to automatically acquire the necessary background knowledge, and more specifically common sense and world knowledge. We propose a number of ways to improve the learning of common sense and world knowledge by exploiting textual and visual data, and touch upon how we can integrate the learned knowledge in the argumentation mining process.

42 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: The results show that word- and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model.
Abstract: We study the problem of bilingual lexicon induction (BLI) in a setting where some translation resources are available, but unknown translations are sought for certain, possibly domain-specific terminology. We frame BLI as a classification problem for which we design a neural network based classification architecture composed of recurrent long short-term memory and deep feed-forward networks. The results show that word- and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model.
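
The following sketch outlines one possible shape of such a classifier in PyTorch, assuming illustrative embedding sizes and a single character-level LSTM shared across languages; the paper's actual architecture and training procedure are not reproduced.

```python
# Illustrative BLI classifier combining word- and character-level signals.
# Assumptions: dimensions, shared character LSTM and network layout are
# placeholders, not the paper's exact model.
import torch
import torch.nn as nn

class BLIClassifier(nn.Module):
    def __init__(self, n_chars, word_dim=300, char_dim=32, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden, batch_first=True)
        # Deep feed-forward part over [source word vec; target word vec; char states].
        self.ff = nn.Sequential(
            nn.Linear(2 * word_dim + 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def encode_chars(self, char_ids):
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        return h[-1]                      # final hidden state per word

    def forward(self, src_vec, tgt_vec, src_chars, tgt_chars):
        c = torch.cat([self.encode_chars(src_chars), self.encode_chars(tgt_chars)], dim=-1)
        x = torch.cat([src_vec, tgt_vec, c], dim=-1)
        return torch.sigmoid(self.ff(x))  # probability that tgt translates src

# Toy forward pass with random tensors standing in for real embeddings/characters.
model = BLIClassifier(n_chars=50)
p = model(torch.randn(4, 300), torch.randn(4, 300),
          torch.randint(0, 50, (4, 12)), torch.randint(0, 50, (4, 12)))
```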

32 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: The experimental results for two real-world applications, link prediction in social trust networks and user profiling in social networks, demonstrate that the use of soft quantifiers not only allows for a natural and intuitive formulation of domain knowledge, but also improves inference accuracy.
Abstract: We present a new statistical relational learning (SRL) framework that supports reasoning with soft quantifiers, such as “most” and “a few.” We define the syntax and the semantics of this language, which we call PSL^Q, and present a most probable explanation inference algorithm for it. To the best of our knowledge, PSL^Q is the first SRL framework that combines soft quantifiers with first-order logic rules for modelling uncertain relational data. Our experimental results for two real-world applications, link prediction in social trust networks and user profiling in social networks, demonstrate that the use of soft quantifiers not only allows for a natural and intuitive formulation of domain knowledge, but also improves inference accuracy.
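
For intuition, the sketch below gives one way to score a soft quantifier such as "most" over soft truth values, using an assumed piecewise-linear membership function; the actual PSL^Q semantics and its most probable explanation inference are not shown.

```python
# Illustrative fuzzy semantics for a soft quantifier. Assumptions: the
# membership thresholds and aggregation by averaging are placeholders for the
# framework's actual definitions.
def most(lo=0.3, hi=0.8):
    # Membership function: 0 below `lo`, 1 above `hi`, linear in between.
    def mu(proportion):
        if proportion <= lo:
            return 0.0
        if proportion >= hi:
            return 1.0
        return (proportion - lo) / (hi - lo)
    return mu

def quantified_truth(truth_values, quantifier):
    # Degree to which "QUANTIFIER of the groundings are true", from soft truths.
    proportion = sum(truth_values) / len(truth_values)
    return quantifier(proportion)

# "Most of B's friends trust C", with a soft trust score per friend.
trusts = [0.9, 0.8, 0.4, 0.95]
degree = quantified_truth(trusts, most())
```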

25 citations


Journal ArticleDOI
TL;DR: A sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks.
Abstract: Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error bound ε, the confidence level (1 − δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system for recommending similar authors to demonstrate the effectiveness of our method.
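
A bare-bones version of random-path similarity estimation might look like the following; simple co-occurrence counting on sampled paths stands in for the paper's estimator, and the ε/δ sample-size analysis is not reproduced.

```python
# Illustrative random-path similarity estimation. Assumptions: the estimator
# counts co-occurrences of vertices on sampled paths; weighting and error
# bounds from the paper are not reproduced.
import random
from collections import defaultdict
from itertools import combinations

def sample_path(adj, start, T):
    path, v = [start], start
    for _ in range(T - 1):
        if not adj[v]:
            break
        v = random.choice(adj[v])
        path.append(v)
    return path

def path_similarity(adj, n_samples=10000, T=5):
    co = defaultdict(float)
    nodes = list(adj)
    for _ in range(n_samples):
        path = sample_path(adj, random.choice(nodes), T)
        for u, v in combinations(set(path), 2):
            co[frozenset((u, v))] += 1.0 / n_samples
    return co

# Toy graph given as an adjacency-list dict.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
sim = path_similarity(adj)
top = sorted(sim.items(), key=lambda kv: -kv[1])[:3]
```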

21 citations


Proceedings Article
01 Jan 2017
TL;DR: A neural network which learns intermodal representations for fashion attributes to be utilized in a cross-modal search tool; training this model with the proposed objective function on image fragments acquired with a rule-based segmentation approach improves the results of image search with textual queries.
Abstract: In this paper we develop a neural network which learns intermodal representations for fashion attributes to be utilized in a cross-modal search tool. Our neural network learns from organic e-commerce data, which is characterized by clean image material, but noisy and incomplete product descriptions. First, we experiment with techniques to segment e-commerce images and their product descriptions into image and text fragments, respectively, denoting fashion attributes. Here, we propose a rule-based image segmentation approach which exploits the cleanness of e-commerce images. Next, we design an objective function which encourages our model to induce a common embedding space where a semantically related image fragment and text fragment have a high inner product. This objective function incorporates similarity information of image fragments to obtain better intermodal representations. A key insight is that similar looking image fragments should be described with the same text fragments. We explicitly require this in our objective function, and as such recover information which was lost due to noise and incompleteness in the product descriptions. We evaluate the inferred intermodal representations in cross-modal search. We demonstrate that the neural network model trained with our objective function on image fragments acquired with our rule-based segmentation approach improves the results of image search with textual queries by 198% for recall@1 and by 181% for recall@5 compared to results obtained by a state-of-the-art image search system on the same benchmark dataset.
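
The sketch below illustrates an inner-product alignment objective of this general flavor in PyTorch: a margin ranking term plus an assumed term that pulls visually similar image fragments toward the same text fragment; the paper's exact objective and weighting are not reproduced.

```python
# Illustrative fragment-alignment objective. Assumptions: the margin loss, the
# visual-similarity regularizer and the hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def alignment_loss(img_frag, txt_frag, pos_pairs, img_sim, margin=0.1, alpha=0.5):
    # img_frag: (Ni, d) image-fragment embeddings; txt_frag: (Nt, d) text-fragment
    # embeddings; pos_pairs: list of (i, t) ground-truth alignments;
    # img_sim: (Ni, Ni) visual similarity between image fragments.
    scores = img_frag @ txt_frag.t()                      # inner products
    loss = 0.0
    for i, t in pos_pairs:
        # Rank the matching text fragment above all others for image fragment i.
        loss = loss + F.relu(margin + scores[i] - scores[i, t]).mean()
        # Encourage fragments that look like i to also score highly on t.
        loss = loss + alpha * (img_sim[i] * F.relu(margin - scores[:, t])).mean()
    return loss / max(len(pos_pairs), 1)

# Toy usage with random embeddings.
img = torch.randn(6, 64)
txt = torch.randn(4, 64)
sim = torch.sigmoid(img @ img.t())
loss = alignment_loss(img, txt, pos_pairs=[(0, 1), (2, 3)], img_sim=sim)
```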

18 citations


Posted Content
TL;DR: This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question, and investigates the robustness of an end-to-end audio model and an ASR-based pipeline by injecting various levels of noise into the spoken question.
Abstract: This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question. Two methods are studied: an end-to-end, deep neural network that directly uses audio waveforms as input versus a pipelined approach that performs ASR (Automatic Speech Recognition) on the question, followed by text-based visual question answering. Furthermore, we investigate the robustness of both methods by injecting various levels of noise into the spoken question and find that both methods tolerate noise at similar levels.

16 citations


Journal ArticleDOI
02 Aug 2017
TL;DR: An overview of the workshop is presented, including the motivations behind organizing it, summaries of the research papers and keynote talks, and reflections on future directions as inferred from the discussion sessions during the workshop.
Abstract: The first international workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with the 2017 European Conference on Information Retrieval (ECIR) in Aberdeen, Scotland, UK. The aim of the workshop was to explore various technologies for extracting useful information from social media content in disaster situations. The workshop included a peer-reviewed research paper track, a data challenge, two keynote talks, and discussion sessions on the relevant open research challenges. This report presents an overview of the workshop, including the motivations behind organizing the workshop, and summaries of the research papers and keynote talks at the workshop. We also reflect on the future directions as inferred from discussion sessions during the workshop.

13 citations


Proceedings Article
01 Nov 2017
TL;DR: This paper uses a predictive recurrent neural semantic frame model (PRNSFM) to learn the probability of a sequence of semantic arguments given a predicate, and leverages the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments.
Abstract: Implicit semantic role labeling (iSRL) is the task of predicting the semantic roles of a predicate that do not appear as explicit arguments, but rather regard common sense knowledge or are mentioned earlier in the discourse. We introduce an approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate. We leverage the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments. On the NomBank iSRL test set, our approach improves state-of-the-art performance on implicit semantic role labeling with less reliance than prior work on manually constructed language resources.
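
To illustrate the selectional-preference idea, the sketch below substitutes a simple count-based model for the neural PRNSFM: it estimates the probability of a role:filler argument given a predicate from frames extracted from an (assumed) automatically parsed corpus.

```python
# Illustrative selectional-preference estimation. Assumption: a count-based
# model stands in for the neural PRNSFM and its sequence probabilities.
from collections import Counter, defaultdict

def train_frame_model(frames):
    # frames: lists like ["give", "A0:teacher", "A1:book", "A2:student"]
    counts, totals = defaultdict(Counter), Counter()
    for frame in frames:
        pred = frame[0]
        for arg in frame[1:]:
            counts[pred][arg] += 1
            totals[pred] += 1
    return counts, totals

def selectional_preference(counts, totals, pred, role, filler):
    # P(role:filler | predicate) under the stand-in model.
    return counts[pred][f"{role}:{filler}"] / max(totals[pred], 1)

# Toy "unannotated corpus" already run through an SRL parser.
frames = [["sell", "A0:company", "A1:shares"], ["sell", "A0:investor", "A1:shares"]]
counts, totals = train_frame_model(frames)
score = selectional_preference(counts, totals, "sell", "A1", "shares")  # -> 0.5
```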

12 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This task is a multimodal extension of the spatial role labeling task, which was previously introduced as a semantic evaluation task in the SemEval series; its multimodal aspect makes it appropriate for the CLEF lab series.
Abstract: The extraction of spatial semantics is important in many real-world applications such as geographical information systems, robotics and navigation, semantic search, etc. Moreover, spatial semantics are the most relevant semantics related to the visualization of language. The goal of multimodal spatial role labeling task is to extract spatial information from free text while exploiting accompanying images. This task is a multimodal extension of spatial role labeling task which has been previously introduced as a semantic evaluation task in the SemEval series. The multimodal aspect of the task makes it appropriate for the CLEF lab series. In this paper, we provide an overview of the task of multimodal spatial role labeling. We describe the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant.

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter introduces a spatial annotation scheme, built upon previous research, that supports various aspects of spatial semantics, including static and dynamic spatial relations, and produces a rich spatial language corpus.
Abstract: Spatial information extraction from natural language is important for many applications including geographical information systems, human-computer interaction, providing navigational instructions to robots, and visualization or text-to-scene conversion. The main obstacles for corpus-based approaches to perform such extractions have been: (a) the lack of an agreement on a unique semantic model for spatial information; (b) the diversity of formal spatial representation models; (c) the gap between the expressiveness of natural language and formal spatial representation models; and consequently, (d) the lack of annotated data on which machine learning can be employed to learn and extract the spatial relations. These items drive the direction of the contributions on which this chapter is built. In this chapter we introduce a spatial annotation scheme built upon the previous research that supports various aspects of spatial semantics, including static and dynamic spatial relations. The annotation scheme is based on the ideas of holistic spatial semantics as well as qualitative spatial reasoning models. Spatial roles, their relations and indicators, along with their multiple formal meanings, are tagged using the annotation scheme, producing a rich spatial language corpus. The goal of building such a corpus is to produce a resource for training machine learning methods for mapping the language to formal spatial representation models, and to use it as ground-truth data for evaluation.

Journal ArticleDOI
TL;DR: A novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language and shows that this integrated modeling yields significantly better performance over text-based and vision-based approaches.
Abstract: We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address the questions: a) What do the textual entity mentions refer to? and b) What/who are in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. This MRF model incorporates beliefs using independent methods for the textual and visual entities. These beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance over text-based and vision-based approaches. We show that textual mentions that cannot be resolved using text-only methods are resolved correctly using our method. The approaches described here bring us closer to automated multimedia indexing.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The authors' system performed above average for all subtasks in both phases of Clinical TempEval 2017, using a combination of Support Vector Machines (SVM) for event and temporal expression detection, and a structured perceptron for extracting temporal relations.
Abstract: In this paper, we describe the system of the KULeuven-LIIR submission for Clinical TempEval 2017. We participated in all six subtasks, using a combination of Support Vector Machines (SVM) for event and temporal expression detection, and a structured perceptron for extracting temporal relations. Moreover, we present and analyze the results from our submissions, and verify the effectiveness of several system components. Our system performed above average for all subtasks in both phases.

Proceedings ArticleDOI
15 Jul 2017
TL;DR: The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators to generalize the selection of the best classifier for a given text dataset.
Abstract: This paper presents an evolutionary method for learning lists of meta-rules for generalizing the selection of the best classifier for a given text dataset. The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators. Once the rules are learned, they are tested on a different set of datasets to demonstrate their accuracy and generality. Our experiments show encouraging results.

Journal ArticleDOI
TL;DR: This paper focuses on the mapping of natural language sentences in written stories to a structured knowledge representation, and proposes a mapping framework able to reason with uncertainty and to integrate supervision and evidence from external sources, yielding performance gains in predicting the most likely structured representations of sentences compared with a baseline algorithm.
Abstract: This paper focuses on the mapping of natural language sentences in written stories to a structured knowledge representation. This process yields an exponential explosion of instance combinations, since each sentence may contain a set of ambiguous terms, each giving rise to a set of instance candidates. The selection of the best combination of instances is a structured classification problem that yields a highly demanding combinatorial optimization problem, which, in this paper, is approached by a novel and efficient formulation of a genetic algorithm that is able to exploit the conditional independence among variables while improving parallel scalability. The automatic rating of the resulting set of instance combinations, i.e., possible text interpretations, demands an exhaustive exploitation of state-of-the-art resources in natural language processing to feed the system with pieces of evidence to be fused by the proposed framework. In this sense, a mapping framework able to reason with uncertainty and to integrate supervision and evidence from external sources was adopted. To improve the generalization capacity while learning from a limited amount of annotated data, a new constrained learning algorithm for Bayesian networks is introduced. This algorithm bounds the search space through a set of constraints which encode information on mutually exclusive values. The mapping of natural language utterances to a structured knowledge representation is important in the context of game construction, e.g., in an RPG setting, as it alleviates the manual knowledge acquisition bottleneck. The effectiveness of the proposed algorithm is evaluated on a set of three stories, yielding nine experiments. Our mapping framework yields performance gains in predicting the most likely structured representations of sentences when compared with a baseline algorithm.
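
The following is a generic sketch of a genetic algorithm over instance combinations, with a toy fitness function; the paper's efficient formulation, which exploits conditional independence among variables and parallel scalability, is not reproduced.

```python
# Generic genetic-algorithm sketch over instance combinations. Assumptions:
# the candidate sets and fitness function are toy placeholders.
import random

def evolve(candidates, fitness, pop_size=30, gens=50, p_mut=0.1):
    # candidates[i] is the list of instance options for the i-th ambiguous term.
    def random_individual():
        return [random.choice(opts) for opts in candidates]
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(candidates))        # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(len(child)):                       # mutation
                if random.random() < p_mut:
                    child[i] = random.choice(candidates[i])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy problem: prefer combinations whose (numeric) instance ids sum highest.
cands = [[1, 2, 3], [0, 5], [2, 4]]
best = evolve(cands, fitness=sum)
```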

Posted Content
TL;DR: An approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate and leverages the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments.
Abstract: Implicit semantic role labeling (iSRL) is the task of predicting the semantic roles of a predicate that do not appear as explicit arguments, but rather regard common sense knowledge or are mentioned earlier in the discourse. We introduce an approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate. We leverage the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments. On the NomBank iSRL test set, our approach improves state-of-the-art performance on implicit semantic role labeling with less reliance than prior work on manually constructed language resources.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: Different image representations and models are investigated, including a support vector machine on top of activations of a pretrained convolutional neural network, and a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately.
Abstract: We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a particularly challenging setting, since i) the animals occur in their natural habitat and are often largely occluded and ii) subtitles are to a large degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to most work on integrated vision and language in the literature, where textual descriptions are tightly linked to the image content, and often generated in a curated fashion for the task at hand. In particular, we investigate different image representations and models, including a support vector machine on top of activations of a pretrained convolutional neural network, as well as a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately. This representation allows key components in the image to be isolated, in spite of largely varying backgrounds and image clutter, without an object detection or image segmentation step. The methods are evaluated based on how well they transfer to unseen camera-trap images captured across diverse topographical regions under different environmental conditions and illumination settings, involving a large domain shift.
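
As a concrete stand-in, the sketch below trains the two kinds of classifiers mentioned above with scikit-learn on random vectors that take the place of pretrained CNN activations; feature extraction, the noisy subtitle-derived labels and the camera-trap evaluation are outside its scope.

```python
# Illustrative classifiers over CNN activations. Assumptions: random vectors
# replace real activations and labels; only the classifier step is shown.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.random((200, 512))          # stand-in for CNN activations of key frames
y = rng.integers(0, 5, 200)         # stand-in for animal labels from subtitles

svm = LinearSVC().fit(X, y)          # SVM on the full activation vector

# 'Bag-of-activations': treat each activation dimension as an independent
# count-like feature, as in a Naive Bayes bag model.
nb = MultinomialNB().fit(X, y)       # requires non-negative features

pred_svm, pred_nb = svm.predict(X[:3]), nb.predict(X[:3])
```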

Posted Content
TL;DR: This paper presents a simple method to build multimodal representations by learning a language-to-vision mapping and using its output to build multimodal embeddings, providing a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory.
Abstract: Integrating visual and linguistic information into a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple method to build multimodal representations by learning a language-to-vision mapping and using its output to build multimodal embeddings. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped vectors not only implicitly encode multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more "human-like" judgments, particularly in zero-shot settings.

Posted Content
TL;DR: This work introduces the task of predicting spatial templates for two objects under a relationship, and presents two simple neural-based models that leverage annotated images and structured text to learn this task, demonstrating that spatial locations are to a large extent predictable from implicit spatial language.
Abstract: Spatial understanding is a fundamental problem with wide-reaching real-world applications. The representation of spatial knowledge is often modeled with spatial templates, i.e., regions of acceptability of two objects under an explicit spatial relationship (e.g., "on", "below", etc.). In contrast with prior work that restricts spatial templates to explicit spatial prepositions (e.g., "glass on table"), here we extend this concept to implicit spatial language, i.e., those relationships (generally actions) for which the spatial arrangement of the objects is only implicitly implied (e.g., "man riding horse"). In contrast with explicit relationships, predicting spatial arrangements from implicit spatial language requires significant common sense spatial understanding. Here, we introduce the task of predicting spatial templates for two objects under a relationship, which can be seen as a spatial question-answering task with a (2D) continuous output ("where is the man w.r.t. a horse when the man is walking the horse?"). We present two simple neural-based models that leverage annotated images and structured text to learn this task. The good performance of these models reveals that spatial locations are to a large extent predictable from implicit spatial language. Crucially, the models attain similar performance in a challenging generalized setting, where the object-relation-object combinations (e.g., "man walking dog") have never been seen before. Next, we go one step further by presenting the models with unseen objects (e.g., "dog"). In this scenario, we show that leveraging word embeddings enables the models to output accurate spatial predictions, proving that the models acquire solid common sense spatial knowledge allowing for such generalization.
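
One minimal form such a model could take is sketched below in PyTorch: a feed-forward network over the concatenated embeddings of the two objects and the relation, predicting a single 2D offset rather than a full spatial template; the dimensions and output parameterization are illustrative assumptions.

```python
# Illustrative spatial-prediction model. Assumptions: embedding lookup, sizes
# and the 2D-offset output are placeholders for the paper's models.
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))   # predicted (x, y) of first object w.r.t. second

    def forward(self, subj_vec, rel_vec, obj_vec):
        return self.net(torch.cat([subj_vec, rel_vec, obj_vec], dim=-1))

# Toy usage: word embeddings would normally come from a pretrained space,
# which is what allows generalization to unseen objects.
model = SpatialTemplateNet()
xy = model(torch.randn(1, 300), torch.randn(1, 300), torch.randn(1, 300))
```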

01 Jan 2017
TL;DR: This paper describes the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant, and describes the multimodal aspect of the task, which makes it appropriate for the CLEF lab series.
Abstract: The extraction of spatial semantics is important in many real-world applications such as geographical information systems, robotics and navigation, semantic search, etc. Moreover, spatial semantics are the most relevant semantics related to the visualization of language. The goal of multimodal spatial role labeling task is to extract spatial information from free text while exploiting accompanying images. This task is a multimodal extension of spatial role labeling task which has been previously introduced as a semantic evaluation task in the SemEval series. The multimodal aspect of the task makes it appropriate for the CLEF lab series. In this paper, we provide an overview of the task of multimodal spatial role labeling. We describe the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant.

01 Jan 2017
TL;DR: This work extends spatial templates for explicit spatial language to implicit spatial language, i.e., those relationships that, unlike explicit prepositions (e.g., "dog under table"), define the relative location of the two objects only implicitly (e.g., "girl riding horse").
Abstract: Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates or regions of acceptability for two objects under an explicit spatial preposition such as “left” or “below” (Logan and Sadler 1996). Contrary to previous work that defines spatial templates for explicit spatial language only (Malinowski and Fritz 2014; Moratz and Tenbrink 2006), we extend this concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (as in “dog under table”) but only implicitly (e.g., “girl riding horse”). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverages common sense spatial knowledge to solve tasks such as visual paraphrasing (Lin and Parikh 2015) or object labeling (Shiang et al. 2017) does not aim to predict (unseen) spatial configurations.