
Showing papers by "Marie-Francine Moens" published in 2019


Proceedings ArticleDOI
01 Sep 2019
TL;DR: This work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars, and provides a detailed comparison with related datasets such as ReferIt, RefCOCO, RefCOCO+, RefCOCOg, Cityscape-Ref and CLEVR-Ref.
Abstract: A long-term goal of artificial intelligence is to have an agent execute commands communicated through natural language. In many cases the commands are grounded in a visual environment shared by the human who gives the command and the agent. Execution of the command then requires mapping the command into the physical visual space, after which the appropriate action can be taken. In this paper we consider the former; more specifically, we consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars. We provide a detailed comparison with related datasets such as ReferIt, RefCOCO, RefCOCO+, RefCOCOg, Cityscape-Ref and CLEVR-Ref. Additionally, we include a performance analysis using strong state-of-the-art models. The results show that the proposed object referral task is a challenging one for which the models show promising results but still require additional research in natural language processing, computer vision and the intersection of these fields. The dataset can be found on our website: http://macchina-ai.eu/

45 citations
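To make the object referral task above concrete, here is a minimal sketch of how a command could be grounded to one of several candidate object regions: the command is encoded with a recurrent network and each region's visual features are scored against it. The architecture, dimensions and names are illustrative assumptions, not the models evaluated in the paper.

```python
# Minimal sketch of the object-referral setup described above: given a natural-
# language command and a set of candidate object regions from a street scene,
# score each region against the command and return the most likely referent.
# This is an illustrative baseline, not the models evaluated in the paper.
import torch
import torch.nn as nn


class CommandGrounder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, region_dim=2048, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.region_proj = nn.Linear(region_dim, hidden)  # project CNN region features

    def forward(self, command_ids, region_feats):
        # command_ids: (1, T) token ids; region_feats: (N, region_dim) per candidate box
        _, (h, _) = self.encoder(self.embed(command_ids))
        cmd = h[-1]                                  # (1, hidden) command encoding
        regions = self.region_proj(region_feats)     # (N, hidden)
        scores = regions @ cmd.t()                   # (N, 1) similarity per region
        return scores.squeeze(1)


# Toy usage: 5 candidate regions, a 6-token command.
model = CommandGrounder(vocab_size=1000)
command = torch.randint(0, 1000, (1, 6))
regions = torch.randn(5, 2048)
scores = model(command, regions)
print("referred region:", scores.argmax().item())
```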


Proceedings ArticleDOI
01 Jun 2019
TL;DR: This work proposes a new robust framework for learning unsupervised multilingual word embeddings that mitigates the instability issues and yields results that are on par with the state-of-the-art methods in the bilingual lexicon induction (BLI) task, while simultaneously obtaining state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing.
Abstract: Recent research has discovered that a shared bilingual word embedding space can be induced by projecting monolingual word embedding spaces from two languages using a self-learning paradigm without any bilingual supervision. However, it has also been shown that for distant language pairs such fully unsupervised self-learning methods are unstable and often get stuck in poor local optima due to reduced isomorphism between starting monolingual spaces. In this work, we propose a new robust framework for learning unsupervised multilingual word embeddings that mitigates the instability issues. We learn a shared multilingual embedding space for a variable number of languages by incrementally adding new languages one by one to the current multilingual space. Through the gradual language addition the method can leverage the interdependencies between the new language and all other languages in the current multilingual space. We find that it is beneficial to project more distant languages later in the iterative process. Our fully unsupervised multilingual embedding spaces yield results that are on par with the state-of-the-art methods in the bilingual lexicon induction (BLI) task, and simultaneously obtain state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing, outperforming even supervised baselines. This finding also accentuates the need to establish evaluation protocols for cross-lingual word embeddings beyond the omnipresent intrinsic BLI task in future work.

29 citations
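The incremental procedure described above can be sketched as follows: languages are added one at a time, each new monolingual space being mapped into the current multilingual space with an orthogonal (Procrustes) projection. The seed-pair induction stands in for the unsupervised self-learning step; the helper `induce_seed_pairs` and all data are purely hypothetical.

```python
# Illustrative sketch of incremental multilingual embedding alignment: new
# languages are projected one by one into the current shared space with an
# orthogonal (Procrustes) mapping.  Seed-pair induction stands in for the
# unsupervised self-learning step and is only a placeholder here.
import numpy as np


def procrustes(X, Y):
    """Orthogonal map W minimising ||XW - Y||_F (solved via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def induce_seed_pairs(space_a, space_b, k=1000):
    # Placeholder for unsupervised seed induction; here random pairs purely
    # for illustration.
    rng = np.random.default_rng(0)
    return rng.integers(0, len(space_a), k), rng.integers(0, len(space_b), k)


def add_language(shared_space, new_space):
    """Map a new monolingual space into the current multilingual space."""
    idx_new, idx_shared = induce_seed_pairs(new_space, shared_space)
    W = procrustes(new_space[idx_new], shared_space[idx_shared])
    return new_space @ W


# Toy usage: start from one language and add two more, closer languages first.
dim = 300
shared = np.random.randn(5000, dim)           # embeddings of the first language
for lang_space in [np.random.randn(5000, dim), np.random.randn(5000, dim)]:
    projected = add_language(shared, lang_space)
    shared = np.vstack([shared, projected])    # grow the multilingual space
print(shared.shape)
```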


Journal ArticleDOI
TL;DR: This article presents a comprehensive survey of the research from the past decades on temporal reasoning for automatic temporal information extraction from text, providing a case study on the integration of symbolic reasoning with machine learning-based information extraction systems.
Abstract: Time is deeply woven into how people perceive, and communicate about, the world. Almost unconsciously, we provide our language utterances with temporal cues, like verb tenses, and we can hardly produce sentences without such cues. Extracting temporal cues from text, and constructing a global temporal view about the order of described events, is a major challenge of automatic natural language understanding. Temporal reasoning, the process of combining different temporal cues into a coherent temporal view, plays a central role in temporal information extraction. This article presents a comprehensive survey of the research from the past decades on temporal reasoning for automatic temporal information extraction from text, providing a case study on how combining symbolic reasoning with machine learning-based information extraction systems can improve performance. It gives a clear overview of the methodologies used for temporal reasoning, and explains how temporal reasoning can be, and has been, successfully integrated into temporal information extraction systems. Based on the distillation of existing work, this survey also suggests currently unexplored research areas. We argue that the level of temporal reasoning that current systems use is still incomplete for the full task of temporal information extraction, and that a deeper understanding of how the various types of temporal information can be integrated into temporal reasoning is required to drive future research in this area.

22 citations
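A toy example of the kind of temporal reasoning the survey covers: pairwise relations extracted from text are composed until a fixed point, exposing orderings that were never stated explicitly. Real systems use richer algebras (e.g. Allen's interval relations) and handle uncertainty and disjunctions; this sketch only illustrates the idea.

```python
# Toy illustration of temporal reasoning by relation composition: extracted
# pairwise relations between events are combined (e.g. BEFORE o BEFORE =
# BEFORE) until a fixed point, revealing implicit orderings.
from itertools import product

COMPOSE = {
    ("BEFORE", "BEFORE"): "BEFORE",
    ("AFTER", "AFTER"): "AFTER",
}


def temporal_closure(relations):
    """relations: dict {(event_a, event_b): relation}.  Returns the closure."""
    closed = dict(relations)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b != c or (a, d) in closed:
                continue
            inferred = COMPOSE.get((closed[(a, b)], closed[(c, d)]))
            if inferred:
                closed[(a, d)] = inferred
                changed = True
    return closed


# "The meeting happened before lunch; lunch happened before the flight."
facts = {("meeting", "lunch"): "BEFORE", ("lunch", "flight"): "BEFORE"}
print(temporal_closure(facts).get(("meeting", "flight")))  # -> BEFORE
```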


Journal ArticleDOI
TL;DR: It is demonstrated through experiments that NER models trained on labeled data from a source domain can be used as base models and then be fine-tuned with a small amount of labeled data for recognition of different named entity classes in a target domain.
Abstract: Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is scarce. In this paper we consider two use cases: generic entity extraction from financial and from biomedical documents. First, we have developed a character-based model for NER in financial documents and a word- and character-based model with attention for NER in biomedical documents. Further, we have analyzed how transfer learning addresses the problem of limited training data in a target domain. We demonstrate through experiments that NER models trained on labeled data from a source domain can be used as base models and then be fine-tuned with a small amount of labeled data for recognition of different named entity classes in a target domain. We also observe a growing interest in language models to improve NER as a way of coping with limited labeled data. The currently most successful language model is BERT. Because of its success in state-of-the-art models, we integrate representations based on BERT in our biomedical NER model along with word and character information. The results are compared with a state-of-the-art model applied to a benchmarking biomedical corpus.

21 citations
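The transfer-learning recipe described above can be sketched as: reuse the encoder of a tagger trained on the source domain, replace its output layer to match the target domain's entity classes, and fine-tune on the small amount of target-domain data. The word-level BiLSTM below is an illustrative assumption, not the paper's exact architecture (which also uses character information and, for the biomedical case, BERT representations).

```python
# Hedged sketch of the transfer-learning recipe described above.
import torch
import torch.nn as nn


class WordTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)  # per-token tag scores

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)


# 1) Pretend this model was trained on the (large) source-domain corpus.
source_model = WordTagger(vocab_size=20000, num_tags=9)

# 2) Reuse its encoder weights and swap in a new tag layer for the target
#    domain's entity classes (e.g. biomedical entities).
target_model = WordTagger(vocab_size=20000, num_tags=5)
target_model.embed.load_state_dict(source_model.embed.state_dict())
target_model.lstm.load_state_dict(source_model.lstm.state_dict())

# 3) Fine-tune on the few labelled target sentences (toy random batch here).
optimiser = torch.optim.Adam(target_model.parameters(), lr=1e-4)
tokens = torch.randint(0, 20000, (8, 30))          # 8 sentences of 30 tokens
tags = torch.randint(0, 5, (8, 30))
loss = nn.functional.cross_entropy(
    target_model(tokens).reshape(-1, 5), tags.reshape(-1))
loss.backward()
optimiser.step()
print(float(loss))
```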


Proceedings ArticleDOI
15 Oct 2019
TL;DR: This paper proposes an artwork-type-enriched image captioning model where the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption based on the input image vector.
Abstract: The neural encoder-decoder framework is widely adopted for image captioning of natural images. However, few works have contributed to generating captions for cultural images using this scheme. In this paper, we propose an artwork-type-enriched image captioning model where the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption based on the input image vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We investigate multiple approaches to integrating the artwork type into the captioning model, among which is one that applies a step-wise weighted sum of the artwork type vector and the hidden representation vector of the decoder. This model outperforms three baseline image captioning models on all evaluation metrics for a Chinese art image captioning dataset. One of the baselines is a state-of-the-art approach fusing textual image attributes into the captioning model for natural images. The proposed model also obtains promising results on another Egyptian art image captioning dataset.

16 citations
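The step-wise weighted sum mentioned in the abstract can be illustrated with a single decoding step: at each step the artwork-type vector is mixed with the decoder hidden state before the next word is predicted. The gating choice, dimensions and names are assumptions made for the sketch, not the paper's exact formulation.

```python
# Sketch of the fusion idea: at every decoding step the artwork-type vector is
# merged into the decoder state with a weighted sum before predicting the word.
import torch
import torch.nn as nn


class TypeFusedDecoderStep(nn.Module):
    def __init__(self, vocab_size, hidden=512, num_types=10):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.type_embed = nn.Embedding(num_types, hidden)
        self.cell = nn.GRUCell(hidden, hidden)
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learned mixing weight
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_word, h, artwork_type):
        h = self.cell(self.word_embed(prev_word), h)
        t = self.type_embed(artwork_type)
        fused = self.alpha * t + (1 - self.alpha) * h  # step-wise weighted sum
        return self.out(fused), h


# Toy usage: one decoding step for a batch of 2 images.
step = TypeFusedDecoderStep(vocab_size=5000)
h = torch.zeros(2, 512)            # e.g. initialised from the 512-d image vector
logits, h = step(torch.tensor([1, 2]), h, torch.tensor([3, 7]))
print(logits.shape)                # (2, 5000)
```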


Proceedings ArticleDOI
TL;DR: The Talk2Car dataset, presented in this paper, is the first object referral dataset that contains commands written in natural language for self-driving cars, in which a passenger requests an action that can be associated with an object found in a street scene.
Abstract: A long-term goal of artificial intelligence is to have an agent execute commands communicated through natural language. In many cases the commands are grounded in a visual environment shared by the human who gives the command and the agent. Execution of the command then requires mapping the command into the physical visual space, after which the appropriate action can be taken. In this paper we consider the former; more specifically, we consider the problem in an autonomous driving setting, where a passenger requests an action that can be associated with an object found in a street scene. Our work presents the Talk2Car dataset, which is the first object referral dataset that contains commands written in natural language for self-driving cars. We provide a detailed comparison with related datasets such as ReferIt, RefCOCO, RefCOCO+, RefCOCOg, Cityscape-Ref and CLEVR-Ref. Additionally, we include a performance analysis using strong state-of-the-art models. The results show that the proposed object referral task is a challenging one for which the models show promising results but still require additional research in natural language processing, computer vision and the intersection of these fields. The dataset can be found on our website: http://macchina-ai.eu/

14 citations


Journal ArticleDOI
TL;DR: An integrated neural network approach across visual and textual data is proposed to both determine and justify a medical diagnosis; it achieves excellent diagnosis accuracy and captioning quality when compared to current state-of-the-art single-task methods.

13 citations
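Only the TL;DR is available for this entry, so the following is a loose sketch of the kind of multi-task network it describes: a shared image encoder feeds both a diagnosis classifier and a caption decoder that produces the textual justification. Everything in the sketch, including the crude unrolled decoder, is an illustrative assumption rather than the paper's architecture.

```python
# Loose sketch of a joint diagnose-and-justify network: one shared encoder,
# one classification head (diagnosis) and one caption decoder (justification).
import torch
import torch.nn as nn


class DiagnoseAndJustify(nn.Module):
    def __init__(self, num_diagnoses=14, vocab_size=8000, feat=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(16, feat))
        self.classifier = nn.Linear(feat, num_diagnoses)      # diagnosis head
        self.decoder = nn.GRU(feat, feat, batch_first=True)   # justification head
        self.word_out = nn.Linear(feat, vocab_size)

    def forward(self, image, caption_len=20):
        z = self.encoder(image)                                # shared representation
        diagnosis = self.classifier(z)
        steps = z.unsqueeze(1).repeat(1, caption_len, 1)       # crude unrolling
        words, _ = self.decoder(steps)
        return diagnosis, self.word_out(words)


model = DiagnoseAndJustify()
diag, caption_logits = model(torch.randn(2, 3, 224, 224))
print(diag.shape, caption_logits.shape)   # (2, 14) (2, 20, 8000)
```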


Journal ArticleDOI
TL;DR: The paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment.
Abstract: When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.

7 citations
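One of the SCATE topics, fuzzy matching for translation memories, can be illustrated with a simple baseline: stored source segments are scored by string similarity to the new sentence and the best match is returned if it clears a threshold. The plain edit-distance-style ratio below is a baseline sketch with made-up data, not SCATE's improved matcher.

```python
# Baseline fuzzy matching against a tiny translation memory.
from difflib import SequenceMatcher

TM = {
    "The battery lasts eight hours.": "La batterie dure huit heures.",
    "Press the power button twice.": "Appuyez deux fois sur le bouton.",
}


def fuzzy_match(sentence, memory, threshold=0.7):
    best_src, best_score = None, 0.0
    for src in memory:
        score = SequenceMatcher(None, sentence.lower(), src.lower()).ratio()
        if score > best_score:
            best_src, best_score = src, score
    if best_score >= threshold:
        return memory[best_src], best_score
    return None, best_score


print(fuzzy_match("The battery lasts ten hours.", TM))
```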


Book ChapterDOI
21 Jul 2019
TL;DR: A multimodal neural machine translation model in which the decoder that generates the translation attends to visually grounded representations that capture both the semantics of the fashion words in the source language and regions in the fashion image.
Abstract: Neural networks have become extremely popular in artificial intelligence. In this paper we show how they aid in automatically translating fashion item descriptions and how they use fashion images to generate the translations. More specifically, we propose a multimodal neural machine translation model in which the decoder that generates the translation attends to visually grounded representations that capture both the semantics of the fashion words in the source language and regions in the fashion image. We introduce this novel neural architecture in the context of fashion e-commerce, where product descriptions need to be available in multiple languages. We report state-of-the-art multimodal translation results on a real-world fashion e-commerce dataset.

4 citations
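The attention step described above can be sketched in a few lines: each source word is paired with attended image-region features to form a visually grounded representation, over which a decoder step then attends. The shapes, random data and simple dot-product attention are illustrative assumptions, not the paper's model.

```python
# Sketch: a decoder step attending over visually grounded source representations.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
src_len, img_regions, d = 7, 4, 64

word_enc = torch.randn(src_len, d)            # encoder states for the source words
region_feats = torch.randn(img_regions, d)    # pooled CNN features per image region

# One grounded vector per source word: word encoding + its attended image context.
word_to_region = F.softmax(word_enc @ region_feats.t(), dim=-1)           # (src_len, regions)
grounded = torch.cat([word_enc, word_to_region @ region_feats], dim=-1)   # (src_len, 2d)

# A single decoder step attends over the grounded representations.
dec_state = torch.randn(1, 2 * d)
attn = F.softmax(dec_state @ grounded.t(), dim=-1)                        # (1, src_len)
context = attn @ grounded                                                 # (1, 2d)
print(context.shape)
```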


Book ChapterDOI
TL;DR: This paper shows that incorporating content from related written sources in the training of the classification model is beneficial, and qualitatively demonstrates that these representations, to a certain extent, indirectly correct the transcription noise.
Abstract: Today's content, including user-generated content, is increasingly found in multimedia format. It is known that speech data are sometimes incorrectly transcribed, especially when they are spoken by voices on which the transcribers have not been trained or when they contain unfamiliar words. A familiar mining task that helps in storage, indexing and retrieval is automatic classification with predefined category labels. Although state-of-the-art classifiers like neural networks, support vector machines (SVM) and logistic regression classifiers perform quite satisfactorily when categorizing written text, their performance degrades when applied to speech data transcribed by automatic speech recognition (ASR) due to transcription errors like insertion and deletion of words, grammatical errors and words that are just transcribed wrongly. In this paper, we show that incorporating content from related written sources in the training of the classification model is beneficial. We especially focus on and compare different representations that make this integration possible, such as representations of speech data that embed content from the written text and simple concatenation of speech and written content. In addition, we qualitatively demonstrate that these representations, to a certain extent, indirectly correct the transcription noise.

3 citations
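One of the representations the abstract mentions, simple concatenation of speech and written content, can be sketched by concatenating feature vectors of the (noisy) ASR transcript and of related written text and training an ordinary classifier on the result. The toy sentences, labels and the TF-IDF features are illustrative assumptions, not the paper's setup.

```python
# Sketch: concatenate transcript features with related written-text features
# and train a standard classifier on the combined vector.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

transcripts = ["the patient reported chest pane", "stock prizes fell sharply today"]
written = ["chest pain and cardiac symptoms", "stock prices and market reports"]
labels = [0, 1]   # 0 = medical, 1 = finance

vec_speech = TfidfVectorizer().fit(transcripts)
vec_text = TfidfVectorizer().fit(written)

X = np.hstack([vec_speech.transform(transcripts).toarray(),
               vec_text.transform(written).toarray()])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```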


Book ChapterDOI
14 Apr 2019
TL;DR: A novel approach to passage retrieval for multimodal question answering of ancient artworks, where the caption of the query image is provided as additional evidence to state-of-the-art retrieval models in the cultural heritage domain trained on a small dataset.
Abstract: Passage retrieval for multimodal question answering, spanning natural language processing and computer vision, is a challenging task, particularly when the documentation to search from contains poor punctuation or obsolete word forms and little labeled training data is available. Here, we introduce a novel approach to passage retrieval for multimodal question answering of ancient artworks, where the caption of the query image is provided as additional evidence to state-of-the-art retrieval models in the cultural heritage domain trained on a small dataset. The query image caption is generated with an advanced image captioning model trained on an external dataset. Consequently, the retrieval model obtains transferred knowledge from the external dataset. Extensive experiments demonstrate the effectiveness of this approach on a benchmark dataset compared to state-of-the-art approaches.
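The caption-as-extra-evidence idea can be sketched as follows: the caption generated for the query image is appended to the textual question before passages are scored. TF-IDF cosine similarity stands in for the retrieval models used in the paper; the passages and the caption are made up for illustration.

```python
# Sketch: augment the textual question with the generated image caption,
# then score passages with TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "The limestone relief shows a pharaoh making an offering to the gods.",
    "The tapestry was woven in the sixteenth century in Brussels.",
]
question = "Who is depicted making the offering?"
generated_caption = "a limestone relief of a pharaoh offering to the gods"

vectorizer = TfidfVectorizer().fit(passages + [question, generated_caption])
query = vectorizer.transform([question + " " + generated_caption])
scores = cosine_similarity(query, vectorizer.transform(passages))[0]
print(passages[scores.argmax()])
```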

DOI
01 Jan 2019
TL;DR: This report documents the program and the outcomes of Dagstuhl Seminar 19021 "Joint Processing of Language and Visual Data for Better Automated Understanding".
Abstract: This report documents the program and the outcomes of Dagstuhl Seminar 19021 "Joint Processing of Language and Visual Data for Better Automated Understanding". It includes a discussion of the motivation and overall organization, the abstracts of the talks, and a report of each working group.


Journal ArticleDOI
17 Jan 2019
TL;DR: The Second Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with The Web Conference (WWW) 2018 at Lyon, France to promote multi-modal and multi-view information retrieval from the social media content in disaster situations.
Abstract: The Second Workshop on Exploitation of Social Media for Emergency Relief and Preparedness (SMERP) was held in conjunction with The Web Conference (WWW) 2018 at Lyon, France. A primary aim of the workshop was to promote multi-modal and multi-view information retrieval from the social media content in disaster situations. The workshop programme included keynote talks, a peer-reviewed paper track, and a panel discussion on the relevant research problems in the scope of the workshop.