
Showing papers by Marie-Francine Moens published in 2017


Proceedings Article
01 Jan 2017
TL;DR: This paper presents a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations, providing a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory.
Abstract: Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped (or imagined) vectors not only help to fuse multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more human-like judgments. Ultimately, the present work sheds light on fundamental questions of natural language understanding concerning the fusion of vision and language such as the plausibility of more associative and reconstructive approaches.
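
To make the idea concrete, here is a minimal sketch of the mapping-and-concatenation scheme in Python, assuming word embeddings and visual feature vectors are already available as arrays; a closed-form ridge regression stands in for the learned language-to-vision mapping, so the architecture and training details differ from the paper's.

```python
# Minimal sketch of "imagined" multimodal representations. Assumption: ridge
# regression replaces the paper's learned language-to-vision network.
import numpy as np

def fit_language_to_vision(W_text, V_img, lam=1.0):
    """Learn M such that W_text @ M approximates V_img (ridge regression)."""
    d = W_text.shape[1]
    return np.linalg.solve(W_text.T @ W_text + lam * np.eye(d), W_text.T @ V_img)

def multimodal_embedding(w, M):
    """Concatenate a word vector with its 'imagined' visual prediction."""
    imagined = w @ M
    return np.concatenate([w, imagined])

# Toy usage with random arrays standing in for real text/visual embeddings.
rng = np.random.default_rng(0)
W_text = rng.normal(size=(500, 300))   # text embeddings of concepts with images
V_img = rng.normal(size=(500, 128))    # corresponding visual feature vectors
M = fit_language_to_vision(W_text, V_img)
vec = multimodal_embedding(W_text[0], M)
```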

91 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This study employs a structured perceptron with integer linear programming constraints for document-level inference during training and prediction, exploiting relational properties of temporality and learning the relations globally at the document level.
Abstract: We propose a scalable structured learning model that jointly predicts temporal relations between events and temporal expressions (TLINKS), and the relation between these events and the document creation time (DCTR). We employ a structured perceptron together with integer linear programming constraints for document-level inference during training and prediction, which exploits relational properties of temporality and enables global learning of the relations at the document level. Moreover, this study gives insights into the results of integrating constraints for temporal relation extraction when using structured learning and prediction. Our best system outperforms the state-of-the-art on both the CONTAINS TLINK task and the DCTR task.
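
As a rough illustration of the learning scheme, the sketch below implements a plain structured perceptron over toy documents; the `decode` function is an unconstrained per-pair argmax that merely stands in for the paper's ILP-based document-level inference, and the hashed features are hypothetical.

```python
# Minimal structured-perceptron sketch. Assumptions: a "document" is reduced to
# a dict of candidate pairs -> feature strings plus gold labels; `decode` is an
# unconstrained stand-in for ILP inference; features are hashed for simplicity.
import numpy as np

LABELS = ["BEFORE", "CONTAINS", "OVERLAP", "AFTER", "NONE"]
DIM = 2048

def phi(feats, label):
    v = np.zeros(DIM)
    for f in feats:
        v[hash((f, label)) % DIM] += 1.0
    return v

def decode(doc_feats, w):
    # Stand-in for constrained document-level inference: per-pair argmax.
    return {p: max(LABELS, key=lambda y: w @ phi(f, y)) for p, f in doc_feats.items()}

def train(docs, epochs=5):
    # docs: list of (doc_feats, gold) where gold maps pair -> label.
    w = np.zeros(DIM)
    for _ in range(epochs):
        for doc_feats, gold in docs:
            pred = decode(doc_feats, w)
            for p, y in gold.items():
                if pred[p] != y:
                    w += phi(doc_feats[p], y) - phi(doc_feats[p], pred[p])
    return w

# Toy document: one event/timex pair with a couple of surface features.
docs = [({("e1", "t1"): ["lemma=admitted", "tense=PAST"]}, {("e1", "t1"): "CONTAINS"})]
w = train(docs)
```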

61 citations


Journal ArticleDOI
TL;DR: A number of ways are proposed to improve the learning of common sense and world knowledge by exploiting textual and visual data, and the article touches upon how to integrate the learned knowledge in the argumentation mining process.
Abstract: Argumentation mining is an advanced form of human language understanding by the machine. This is a challenging task for a machine. When sufficient explicit discourse markers are present in the language utterances, the argumentation can be interpreted by the machine with an acceptable degree of accuracy. However, in many real settings, the mining task is difficult due to the lack or ambiguity of the discourse markers, and the fact that a substantial amount of knowledge needed for the correct recognition of the argumentation, its composing elements and their relationships is not explicitly present in the text, but makes up the background knowledge that humans possess when interpreting language. In this article we focus on how the machine can automatically acquire the needed common sense and world knowledge. As very little research has been done in this respect, many of the ideas proposed in this article are tentative, but are starting to be researched. We give an overview of the latest methods for human language understanding that map language to a formal knowledge representation that facilitates other tasks (for instance, a representation that is used to visualize the argumentation or that is easily shared in a decision or argumentation support system). Most current systems are trained on texts that are manually annotated. We then go deeper into the new field of representation learning, which is nowadays widely studied in computational linguistics. This field investigates methods for representing language as statistical concepts or as vectors, allowing straightforward methods of compositionality. The methods often use deep learning and its underlying neural network technologies to learn concepts from large text collections in an unsupervised way (i.e., without the need for manual annotations). We show how these methods can help the argumentation mining process, but also demonstrate that these methods need further research to automatically acquire the necessary background knowledge, and more specifically common sense and world knowledge. We propose a number of ways to improve the learning of common sense and world knowledge by exploiting textual and visual data, and touch upon how we can integrate the learned knowledge in the argumentation mining process.

42 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: The results show that word- and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model.
Abstract: We study the problem of bilingual lexicon induction (BLI) in a setting where some translation resources are available, but unknown translations are sought for certain, possibly domain-specific terminology. We frame BLI as a classification problem for which we design a neural network based classification architecture composed of recurrent long short-term memory and deep feed-forward networks. The results show that word- and character-level representations each improve state-of-the-art results for BLI, and the best results are obtained by exploiting the synergy between these word- and character-level representations in the classification model.
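
The following sketch outlines one possible shape of such a classifier in PyTorch, assuming illustrative embedding sizes and a single character-level LSTM shared across languages; the paper's actual architecture and training procedure are not reproduced.

```python
# Illustrative BLI classifier combining word- and character-level signals.
# Assumptions: dimensions, shared character LSTM and network layout are
# placeholders, not the paper's exact model.
import torch
import torch.nn as nn

class BLIClassifier(nn.Module):
    def __init__(self, n_chars, word_dim=300, char_dim=32, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, hidden, batch_first=True)
        # Deep feed-forward part over [source word vec; target word vec; char states].
        self.ff = nn.Sequential(
            nn.Linear(2 * word_dim + 2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def encode_chars(self, char_ids):
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        return h[-1]                      # final hidden state per word

    def forward(self, src_vec, tgt_vec, src_chars, tgt_chars):
        c = torch.cat([self.encode_chars(src_chars), self.encode_chars(tgt_chars)], dim=-1)
        x = torch.cat([src_vec, tgt_vec, c], dim=-1)
        return torch.sigmoid(self.ff(x))  # probability that tgt translates src

# Toy forward pass with random tensors standing in for real embeddings/characters.
model = BLIClassifier(n_chars=50)
p = model(torch.randn(4, 300), torch.randn(4, 300),
          torch.randint(0, 50, (4, 12)), torch.randint(0, 50, (4, 12)))
```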

32 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: The experimental results for two real-world applications, link prediction in social trust networks and user profiling in social networks, demonstrate that the use of soft quantifiers not only allows for a natural and intuitive formulation of domain knowledge, but also improves inference accuracy.
Abstract: We present a new statistical relational learning (SRL) framework that supports reasoning with soft quantifiers, such as “most” and “a few.” We define the syntax and the semantics of this language, which we call PSL^Q, and present a most probable explanation inference algorithm for it. To the best of our knowledge, PSL^Q is the first SRL framework that combines soft quantifiers with first-order logic rules for modelling uncertain relational data. Our experimental results for two real-world applications, link prediction in social trust networks and user profiling in social networks, demonstrate that the use of soft quantifiers not only allows for a natural and intuitive formulation of domain knowledge, but also improves inference accuracy.
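
For intuition, the sketch below gives one way to score a soft quantifier such as "most" over soft truth values, using an assumed piecewise-linear membership function; the actual PSL^Q semantics and its most probable explanation inference are not shown.

```python
# Illustrative fuzzy semantics for a soft quantifier. Assumptions: the
# membership thresholds and aggregation by averaging are placeholders for the
# framework's actual definitions.
def most(lo=0.3, hi=0.8):
    # Membership function: 0 below `lo`, 1 above `hi`, linear in between.
    def mu(proportion):
        if proportion <= lo:
            return 0.0
        if proportion >= hi:
            return 1.0
        return (proportion - lo) / (hi - lo)
    return mu

def quantified_truth(truth_values, quantifier):
    # Degree to which "QUANTIFIER of the groundings are true", from soft truths.
    proportion = sum(truth_values) / len(truth_values)
    return quantifier(proportion)

# "Most of B's friends trust C", with a soft trust score per friend.
trusts = [0.9, 0.8, 0.4, 0.95]
degree = quantified_truth(trusts, most())
```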

25 citations


Journal ArticleDOI
TL;DR: A sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks.
Abstract: Similarity search is a fundamental problem in network analysis and can be applied in many applications, such as collaborator recommendation in coauthor networks, friend recommendation in social networks, and relation prediction in medical information networks. In this article, we propose a sampling-based method using random paths to estimate the similarities based on both common neighbors and structural contexts efficiently in very large homogeneous or heterogeneous information networks. We give a theoretical guarantee that the sampling size depends on the error bound ε, the confidence level (1 − δ), and the path length T of each random walk. We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We develop a prototype system for recommending similar authors to demonstrate the effectiveness of our method.
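
A bare-bones version of random-path similarity estimation might look like the following; simple co-occurrence counting on sampled paths stands in for the paper's estimator, and the ε/δ sample-size analysis is not reproduced.

```python
# Illustrative random-path similarity estimation. Assumptions: the estimator
# counts co-occurrences of vertices on sampled paths; weighting and error
# bounds from the paper are not reproduced.
import random
from collections import defaultdict
from itertools import combinations

def sample_path(adj, start, T):
    path, v = [start], start
    for _ in range(T - 1):
        if not adj[v]:
            break
        v = random.choice(adj[v])
        path.append(v)
    return path

def path_similarity(adj, n_samples=10000, T=5):
    co = defaultdict(float)
    nodes = list(adj)
    for _ in range(n_samples):
        path = sample_path(adj, random.choice(nodes), T)
        for u, v in combinations(set(path), 2):
            co[frozenset((u, v))] += 1.0 / n_samples
    return co

# Toy graph given as an adjacency-list dict.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
sim = path_similarity(adj)
top = sorted(sim.items(), key=lambda kv: -kv[1])[:3]
```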

21 citations


Proceedings Article
01 Jan 2017
TL;DR: A neural network which learns intermodal representations for fashion attributes to be utilized in a cross-modal search tool; training this model with the proposed objective function on image fragments acquired with a rule-based segmentation approach improves the results of image search with textual queries.
Abstract: In this paper we develop a neural network which learns intermodal representations for fashion attributes to be utilized in a cross-modal search tool. Our neural network learns from organic e-commerce data, which is characterized by clean image material, but noisy and incomplete product descriptions. First, we experiment with techniques to segment e-commerce images and their product descriptions into image and text fragments, respectively, denoting fashion attributes. Here, we propose a rule-based image segmentation approach which exploits the cleanness of e-commerce images. Next, we design an objective function which encourages our model to induce a common embedding space where a semantically related image fragment and text fragment have a high inner product. This objective function incorporates similarity information of image fragments to obtain better intermodal representations. A key insight is that similar looking image fragments should be described with the same text fragments. We explicitly require this in our objective function, and as such recover information which was lost due to noise and incompleteness in the product descriptions. We evaluate the inferred intermodal representations in cross-modal search. We demonstrate that the neural network model trained with our objective function on image fragments acquired with our rule-based segmentation approach improves the results of image search with textual queries by 198% for recall@1 and by 181% for recall@5 compared to results obtained by a state-of-the-art image search system on the same benchmark dataset.
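
The sketch below illustrates an inner-product alignment objective of this general flavor in PyTorch: a margin ranking term plus an assumed term that pulls visually similar image fragments toward the same text fragment; the paper's exact objective and weighting are not reproduced.

```python
# Illustrative fragment-alignment objective. Assumptions: the margin loss, the
# visual-similarity regularizer and the hyperparameters are placeholders.
import torch
import torch.nn.functional as F

def alignment_loss(img_frag, txt_frag, pos_pairs, img_sim, margin=0.1, alpha=0.5):
    # img_frag: (Ni, d) image-fragment embeddings; txt_frag: (Nt, d) text-fragment
    # embeddings; pos_pairs: list of (i, t) ground-truth alignments;
    # img_sim: (Ni, Ni) visual similarity between image fragments.
    scores = img_frag @ txt_frag.t()                      # inner products
    loss = 0.0
    for i, t in pos_pairs:
        # Rank the matching text fragment above all others for image fragment i.
        loss = loss + F.relu(margin + scores[i] - scores[i, t]).mean()
        # Encourage fragments that look like i to also score highly on t.
        loss = loss + alpha * (img_sim[i] * F.relu(margin - scores[:, t])).mean()
    return loss / max(len(pos_pairs), 1)

# Toy usage with random embeddings.
img = torch.randn(6, 64)
txt = torch.randn(4, 64)
sim = torch.sigmoid(img @ img.t())
loss = alignment_loss(img, txt, pos_pairs=[(0, 1), (2, 3)], img_sim=sim)
```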

18 citations


Posted Content
TL;DR: This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question, and investigates the robustness of an end-to-end audio model and an ASR-based pipeline by injecting various levels of noise into the spoken question.
Abstract: This paper introduces speech-based visual question answering (VQA), the task of generating an answer given an image and a spoken question. Two methods are studied: an end-to-end, deep neural network that directly uses audio waveforms as input versus a pipelined approach that performs ASR (Automatic Speech Recognition) on the question, followed by text-based visual question answering. Furthermore, we investigate the robustness of both methods by injecting various levels of noise into the spoken question and find that both methods tolerate noise at similar levels.

16 citations


Journal ArticleDOI
02 Aug 2017
TL;DR: An overview of the workshop is presented, including the motivations behind organizing it, summaries of the research papers and keynote talks, and reflections on future directions as inferred from the discussion sessions during the workshop.
Abstract: The first international workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with the 2017 European Conference on Information Retrieval (ECIR) in Aberdeen, Scotland, UK. The aim of the workshop was to explore various technologies for extracting useful information from social media content in disaster situations. The workshop included a peer-reviewed research paper track, a data challenge, two keynote talks, and discussion sessions on the relevant open research challenges. This report presents an overview of the workshop, including the motivations behind organizing the workshop, and summaries of the research papers and keynote talks at the workshop. We also reflect on the future directions as inferred from discussion sessions during the workshop.

13 citations


Proceedings Article
01 Nov 2017
TL;DR: This paper uses a predictive recurrent neural semantic frame model (PRNSFM) to learn the probability of a sequence of semantic arguments given a predicate, and leverages the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments.
Abstract: Implicit semantic role labeling (iSRL) is the task of predicting the semantic roles of a predicate that do not appear as explicit arguments, but rather regard common sense knowledge or are mentioned earlier in the discourse. We introduce an approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate. We leverage the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments. On the NomBank iSRL test set, our approach improves state-of-the-art performance on implicit semantic role labeling with less reliance than prior work on manually constructed language resources.
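
To illustrate the selectional-preference idea, the sketch below substitutes a simple count-based model for the neural PRNSFM: it estimates the probability of a role:filler argument given a predicate from frames extracted from an (assumed) automatically parsed corpus.

```python
# Illustrative selectional-preference estimation. Assumption: a count-based
# model stands in for the neural PRNSFM and its sequence probabilities.
from collections import Counter, defaultdict

def train_frame_model(frames):
    # frames: lists like ["give", "A0:teacher", "A1:book", "A2:student"]
    counts, totals = defaultdict(Counter), Counter()
    for frame in frames:
        pred = frame[0]
        for arg in frame[1:]:
            counts[pred][arg] += 1
            totals[pred] += 1
    return counts, totals

def selectional_preference(counts, totals, pred, role, filler):
    # P(role:filler | predicate) under the stand-in model.
    return counts[pred][f"{role}:{filler}"] / max(totals[pred], 1)

# Toy "unannotated corpus" already run through an SRL parser.
frames = [["sell", "A0:company", "A1:shares"], ["sell", "A0:investor", "A1:shares"]]
counts, totals = train_frame_model(frames)
score = selectional_preference(counts, totals, "sell", "A1", "shares")  # -> 0.5
```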

12 citations


Book ChapterDOI
11 Sep 2017
TL;DR: This task is a multimodal extension of the spatial role labeling task, which was previously introduced as a semantic evaluation task in the SemEval series; its multimodal aspect makes it appropriate for the CLEF lab series.
Abstract: The extraction of spatial semantics is important in many real-world applications such as geographical information systems, robotics and navigation, semantic search, etc. Moreover, spatial semantics are the most relevant semantics related to the visualization of language. The goal of multimodal spatial role labeling task is to extract spatial information from free text while exploiting accompanying images. This task is a multimodal extension of spatial role labeling task which has been previously introduced as a semantic evaluation task in the SemEval series. The multimodal aspect of the task makes it appropriate for the CLEF lab series. In this paper, we provide an overview of the task of multimodal spatial role labeling. We describe the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant.

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter introduces a spatial annotation scheme, built upon previous research, that supports various aspects of spatial semantics, including static and dynamic spatial relations, and produces a rich spatial language corpus.
Abstract: Spatial information extraction from natural language is important for many applications including geographical information systems, human-computer interaction, providing navigational instructions to robots, and visualization or text-to-scene conversion. The main obstacles for corpus-based approaches to perform such extractions have been: (a) the lack of an agreement on a unique semantic model for spatial information; (b) the diversity of formal spatial representation models; (c) the gap between the expressiveness of natural language and formal spatial representation models; and consequently, (d) the lack of annotated data on which machine learning can be employed to learn and extract the spatial relations. These items drive the direction of the contributions on which this chapter is built. In this chapter we introduce a spatial annotation scheme built upon the previous research that supports various aspects of spatial semantics, including static and dynamic spatial relations. The annotation scheme is based on the ideas of holistic spatial semantics as well as qualitative spatial reasoning models. Spatial roles, their relations and indicators, along with their multiple formal meanings, are tagged using the annotation scheme, producing a rich spatial language corpus. The goal of building such a corpus is to produce a resource for training machine learning methods for mapping the language to formal spatial representation models, and to use it as ground-truth data for evaluation.

Journal ArticleDOI
TL;DR: A novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language and shows that this integrated modeling yields significantly better performance over text-based and vision-based approaches.
Abstract: We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address the questions: a) What do the textual entity mentions refer to? and b) What/who are in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. This MRF model incorporates beliefs using independent methods for the textual and visual entities. These beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance over text-based and vision-based approaches. We show that textual mentions that cannot be resolved using text-only methods are resolved correctly using our method. The approaches described here bring us closer to automated multimedia indexing.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The authors' system performed above average for all subtasks in both phases of Clinical TempEval 2017, using a combination of Support Vector Machines (SVM) for event and temporal expression detection, and a structured perceptron for extracting temporal relations.
Abstract: In this paper, we describe the system of the KULeuven-LIIR submission for Clinical TempEval 2017. We participated in all six subtasks, using a combination of Support Vector Machines (SVM) for event and temporal expression detection, and a structured perceptron for extracting temporal relations. Moreover, we present and analyze the results from our submissions, and verify the effectiveness of several system components. Our system performed above average for all subtasks in both phases.

Proceedings ArticleDOI
15 Jul 2017
TL;DR: The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators to generalize the selection of the best classifier for a given text dataset.
Abstract: This paper presents an evolutionary method for learning lists of meta-rules for generalizing the selection of the best classifier for a given text dataset. The method builds rules based on features of a set of training text datasets, and evolves them using special crossover and mutation operators. Once the rules are learned, they are tested on a different set of datasets to demonstrate their accuracy and generality. Our experiments show encouraging results.

Journal ArticleDOI
TL;DR: This paper focuses on the mapping of natural language sentences in written stories to a structured knowledge representation, and proposes a mapping framework able to reason with uncertainty and to integrate supervision and evidence from external sources, yielding performance gains in predicting the most likely structured representations of sentences compared with a baseline algorithm.
Abstract: This paper focuses on the mapping of natural language sentences in written stories to a structured knowledge representation. This process yields an exponential explosion of instance combinations, since each sentence may contain a set of ambiguous terms, each giving rise to a set of instance candidates. The selection of the best combination of instances is a structured classification problem that yields a highly demanding combinatorial optimization problem, which, in this paper, is approached by a novel and efficient formulation of a genetic algorithm that is able to exploit the conditional independence among variables while improving parallel scalability. The automatic rating of the resulting set of instance combinations, i.e., possible text interpretations, demands an exhaustive exploitation of state-of-the-art resources in natural language processing to feed the system with pieces of evidence to be fused by the proposed framework. In this sense, a mapping framework able to reason with uncertainty and to integrate supervision and evidence from external sources was adopted. To improve the generalization capacity while learning from a limited amount of annotated data, a new constrained learning algorithm for Bayesian networks is introduced. This algorithm bounds the search space through a set of constraints which encode information on mutually exclusive values. The mapping of natural language utterances to a structured knowledge representation is important in the context of game construction, e.g., in an RPG setting, as it alleviates the manual knowledge acquisition bottleneck. The effectiveness of the proposed algorithm is evaluated on a set of three stories, yielding nine experiments. Our mapping framework yields performance gains in predicting the most likely structured representations of sentences when compared with a baseline algorithm.
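
The following is a generic sketch of a genetic algorithm over instance combinations, with a toy fitness function; the paper's efficient formulation, which exploits conditional independence among variables and parallel scalability, is not reproduced.

```python
# Generic genetic-algorithm sketch over instance combinations. Assumptions:
# the candidate sets and fitness function are toy placeholders.
import random

def evolve(candidates, fitness, pop_size=30, gens=50, p_mut=0.1):
    # candidates[i] is the list of instance options for the i-th ambiguous term.
    def random_individual():
        return [random.choice(opts) for opts in candidates]
    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(candidates))        # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(len(child)):                       # mutation
                if random.random() < p_mut:
                    child[i] = random.choice(candidates[i])
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy problem: prefer combinations whose (numeric) instance ids sum highest.
cands = [[1, 2, 3], [0, 5], [2, 4]]
best = evolve(cands, fitness=sum)
```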

Posted Content
TL;DR: An approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate and leverages the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments.
Abstract: Implicit semantic role labeling (iSRL) is the task of predicting the semantic roles of a predicate that do not appear as explicit arguments, but rather regard common sense knowledge or are mentioned earlier in the discourse. We introduce an approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate. We leverage the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments. On the NomBank iSRL test set, our approach improves state-of-the-art performance on implicit semantic role labeling with less reliance than prior work on manually constructed language resources.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: Different image representations and models are investigated, including a support vector machine on top of activations of a pretrained convolutional neural network, and a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately.
Abstract: We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a particularly challenging setting, since i) the animals occur in their natural habitat and are often largely occluded and ii) subtitles are to a large degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to most work on integrated vision and language in the literature, where textual descriptions are tightly linked to the image content, and often generated in a curated fashion for the task at hand. In particular, we investigate different image representations and models, including a support vector machine on top of activations of a pretrained convolutional neural network, as well as a Naive Bayes framework on a ‘bag-of-activations’ image representation, where each element of the bag is considered separately. This representation allows key components in the image to be isolated, in spite of largely varying backgrounds and image clutter, without an object detection or image segmentation step. The methods are evaluated based on how well they transfer to unseen camera-trap images captured across diverse topographical regions under different environmental conditions and illumination settings, involving a large domain shift.
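
As a concrete stand-in, the sketch below trains the two kinds of classifiers mentioned above with scikit-learn on random vectors that take the place of pretrained CNN activations; feature extraction, the noisy subtitle-derived labels and the camera-trap evaluation are outside its scope.

```python
# Illustrative classifiers over CNN activations. Assumptions: random vectors
# replace real activations and labels; only the classifier step is shown.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.random((200, 512))          # stand-in for CNN activations of key frames
y = rng.integers(0, 5, 200)         # stand-in for animal labels from subtitles

svm = LinearSVC().fit(X, y)          # SVM on the full activation vector

# 'Bag-of-activations': treat each activation dimension as an independent
# count-like feature, as in a Naive Bayes bag model.
nb = MultinomialNB().fit(X, y)       # requires non-negative features

pred_svm, pred_nb = svm.predict(X[:3]), nb.predict(X[:3])
```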

Posted Content
TL;DR: This paper presents a simple method to build multimodal representations by learning a language-to-vision mapping and using its output to build multimodal embeddings, providing a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory.
Abstract: Integrating visual and linguistic information into a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple method to build multimodal representations by learning a language-to-vision mapping and using its output to build multimodal embeddings. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently reconstructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped vectors not only implicitly encode multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more "human-like" judgments, particularly in zero-shot settings.

Posted Content
TL;DR: This work introduces the task of predicting spatial templates for two objects under a relationship, and presents two simple neural-based models that leverage annotated images and structured text to learn this task, demonstrating that spatial locations are to a large extent predictable from implicit spatial language.
Abstract: Spatial understanding is a fundamental problem with wide-reaching real-world applications. The representation of spatial knowledge is often modeled with spatial templates, i.e., regions of acceptability of two objects under an explicit spatial relationship (e.g., "on", "below", etc.). In contrast with prior work that restricts spatial templates to explicit spatial prepositions (e.g., "glass on table"), here we extend this concept to implicit spatial language, i.e., those relationships (generally actions) for which the spatial arrangement of the objects is only implicitly implied (e.g., "man riding horse"). In contrast with explicit relationships, predicting spatial arrangements from implicit spatial language requires significant common sense spatial understanding. Here, we introduce the task of predicting spatial templates for two objects under a relationship, which can be seen as a spatial question-answering task with a (2D) continuous output ("where is the man w.r.t. a horse when the man is walking the horse?"). We present two simple neural-based models that leverage annotated images and structured text to learn this task. The good performance of these models reveals that spatial locations are to a large extent predictable from implicit spatial language. Crucially, the models attain similar performance in a challenging generalized setting, where the object-relation-object combinations (e.g., "man walking dog") have never been seen before. Next, we go one step further by presenting the models with unseen objects (e.g., "dog"). In this scenario, we show that leveraging word embeddings enables the models to output accurate spatial predictions, proving that the models acquire solid common sense spatial knowledge allowing for such generalization.
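
One minimal form such a model could take is sketched below in PyTorch: a feed-forward network over the concatenated embeddings of the two objects and the relation, predicting a single 2D offset rather than a full spatial template; the dimensions and output parameterization are illustrative assumptions.

```python
# Illustrative spatial-prediction model. Assumptions: embedding lookup, sizes
# and the 2D-offset output are placeholders for the paper's models.
import torch
import torch.nn as nn

class SpatialTemplateNet(nn.Module):
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))   # predicted (x, y) of first object w.r.t. second

    def forward(self, subj_vec, rel_vec, obj_vec):
        return self.net(torch.cat([subj_vec, rel_vec, obj_vec], dim=-1))

# Toy usage: word embeddings would normally come from a pretrained space,
# which is what allows generalization to unseen objects.
model = SpatialTemplateNet()
xy = model(torch.randn(1, 300), torch.randn(1, 300), torch.randn(1, 300))
```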

01 Jan 2017
TL;DR: This paper describes the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant, and describes the multimodal aspect of the task, which makes it appropriate for the CLEF lab series.
Abstract: The extraction of spatial semantics is important in many real-world applications such as geographical information systems, robotics and navigation, semantic search, etc. Moreover, spatial semantics are the most relevant semantics related to the visualization of language. The goal of multimodal spatial role labeling task is to extract spatial information from free text while exploiting accompanying images. This task is a multimodal extension of spatial role labeling task which has been previously introduced as a semantic evaluation task in the SemEval series. The multimodal aspect of the task makes it appropriate for the CLEF lab series. In this paper, we provide an overview of the task of multimodal spatial role labeling. We describe the task, sub-tasks, corpora, annotations, evaluation metrics, and the results of the baseline and the task participant.

01 Jan 2017
TL;DR: This work extends spatial templates for explicit spatial language to implicit spatial language, i.e., those relationships that, unlike explicit prepositions (e.g., "dog under table"), define the relative location of the two objects only implicitly (e.g., "girl riding horse").
Abstract: Spatial understanding is crucial for any agent that navigates in a physical world. Computational and cognitive frameworks often model spatial representations as spatial templates or regions of acceptability for two objects under an explicit spatial preposition such as “left” or “below” (Logan and Sadler 1996). Contrary to previous work that defines spatial templates for explicit spatial language only (Malinowski and Fritz 2014; Moratz and Tenbrink 2006), we extend this concept to implicit spatial language, i.e., those relationships (usually actions) that do not explicitly define the relative location of the two objects (as in “dog under table”) but only implicitly (e.g., “girl riding horse”). Unlike explicit relationships, predicting spatial arrangements from implicit spatial language requires spatial common sense knowledge about the objects and actions. Furthermore, prior work that leverages common sense spatial knowledge to solve tasks such as visual paraphrasing (Lin and Parikh 2015) or object labeling (Shiang et al. 2017) does not aim to predict (unseen) spatial configurations.