
Showing papers by "Marie-Francine Moens published in 2021"


Journal ArticleDOI
01 Jan 2021
TL;DR: It is argued that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence.
Abstract: Discrete and continuous representations of content (e.g., of language or images) have interesting properties to be explored for the understanding of or reasoning with this content by machines. This position paper puts forward our opinion on the role of discrete and continuous representations and their processing in the deep learning field. Current neural network models compute continuous-valued data. Information is compressed into dense, distributed embeddings. By stark contrast, humans use discrete symbols in their communication with language. Such symbols represent a compressed version of the world that derives its meaning from shared contextual information. Additionally, human reasoning involves symbol manipulation at a cognitive level, which facilitates abstract reasoning, the composition of knowledge and understanding, generalization and efficient learning. Motivated by these insights, in this paper we argue that combining discrete and continuous representations and their processing will be essential to build systems that exhibit a general form of intelligence. We suggest and discuss several avenues that could improve current neural networks with the inclusion of discrete elements to combine the advantages of both types of representations.

12 citations
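
The paper does not commit to a particular mechanism, but one concrete way to inject discrete symbols into a continuous network is vector quantization. The sketch below (plain PyTorch, illustrative only; the codebook size and the straight-through trick are our assumptions, not the paper's proposal) snaps continuous embeddings to their nearest codebook entry while keeping the model trainable.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous embeddings to discrete codebook entries; a
    straight-through estimator keeps the network trainable end to end."""
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                               # z: (batch, dim)
        dists = torch.cdist(z, self.codebook.weight)    # (batch, num_codes)
        codes = dists.argmin(dim=-1)                    # discrete symbol ids
        z_q = self.codebook(codes)                      # continuous lookup
        # straight-through: forward uses z_q, gradients flow through z
        z_q = z + (z_q - z).detach()
        return z_q, codes

vq = VectorQuantizer()
z_q, codes = vq(torch.randn(8, 64))
print(codes.shape, z_q.shape)   # torch.Size([8]) torch.Size([8, 64])
```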


Journal ArticleDOI
TL;DR: In this paper, the authors present a survey of state-of-the-art methods to identify causal relationships between events or entities within biomedical texts, including multiview CNN, attention-based BiLSTM, and graph LSTM.

11 citations


Journal ArticleDOI
TL;DR: In this article, the authors investigate the hypothesis that the timestamp of a Web page is crucial to how it should be ranked for a given claim, and they delineate four temporal ranking methods that constrain evidence ranking differently and simulate hypothesis-specific evidence rankings given the evidence timestamps as gold standard.

11 citations
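
The abstract does not spell out the four temporal ranking methods, so as a hedged illustration, here is one plausible way to constrain evidence ranking by timestamps: decay a base relevance score by the gap between the page date and the claim date (the exponential-decay form and the half-life are our assumptions).

```python
import math
from datetime import datetime

def time_aware_score(relevance: float, page_date: datetime,
                     claim_date: datetime, half_life_days: float = 30.0) -> float:
    """Downweights evidence the further its timestamp lies from the claim
    date (one of many possible temporal constraints on a base ranker)."""
    gap = abs((claim_date - page_date).days)
    return relevance * math.exp(-math.log(2) * gap / half_life_days)

# Rank (base-ranker relevance, page timestamp) pairs for a claim:
claim = datetime(2021, 3, 1)
evidence = [(0.9, datetime(2019, 3, 1)), (0.7, datetime(2021, 2, 20))]
ranked = sorted(evidence, key=lambda e: -time_aware_score(e[0], e[1], claim))
print(ranked[0][1])   # the fresher page wins despite lower base relevance
```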


Book ChapterDOI
01 Jan 2021
TL;DR: In this paper, a multinomial sequence classifier for dialogue breakdown detection was proposed; the best-performing model was selected and compared with the best model and the majority baseline from the previous challenge.
Abstract: One of the principal problems of human-computer interaction is miscommunication. Occurring mainly on behalf of the dialogue system, miscommunication can lead to dialogue breakdown, i.e., a point at which the dialogue cannot be continued. Detecting breakdown can facilitate its prevention, or recovery after a breakdown has occurred. In this paper, we propose a multinomial sequence classifier for dialogue breakdown detection. We explore several LSTM models that differ in model type and in the word embedding models they use. We select our best-performing model and compare it with the best model and the majority baseline from the previous challenge. We conclude that our detector outperforms the baselines in offline testing.

8 citations
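
As a rough sketch of the kind of model described (shapes, hyperparameters, and the three-way label set are our assumptions, loosely following the dialogue breakdown detection challenge's breakdown / possible breakdown / no breakdown scheme), an LSTM over embedded dialogue turns with a per-turn multinomial head could look like this:

```python
import torch
import torch.nn as nn

class BreakdownDetector(nn.Module):
    """Per-turn multinomial classifier over a dialogue: for each system
    utterance, predict one of 3 breakdown classes."""
    def __init__(self, emb_dim: int = 300, hidden: int = 128, n_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, turn_embs):            # (batch, turns, emb_dim)
        out, _ = self.lstm(turn_embs)
        return self.head(out)                # logits per turn

model = BreakdownDetector()
logits = model(torch.randn(2, 10, 300))      # 2 dialogues, 10 turns each
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3), torch.randint(0, 3, (20,)))
```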


Journal ArticleDOI
TL;DR: It is proposed that the IR community start working on a road map for transitioning the IR literature to a fully, "diamond", open access model.
Abstract: Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community starts working on a road map for transitioning the IR literature to a fully, "diamond", open access model.

7 citations


Journal ArticleDOI
TL;DR: In this paper, the authors investigate the predictive power of solely processing spatial clues for scene understanding in 2D images and compare such an approach with visual appearance, and propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation.
Abstract: Humans often leverage spatial clues to categorize scenes in a fraction of a second. This form of intelligence is very relevant in time-critical situations (e.g., when driving a car) and valuable to transfer to automated systems. This work investigates the predictive power of solely processing spatial clues for scene understanding in 2D images and compares such an approach with the predictive power of visual appearance. To this end, we design the laboratory task of predicting the identity of two objects (e.g., “man” and “horse”) and their relationship or predicate (e.g., “riding”) given exclusively the ground truth bounding box coordinates of both objects. We also measure the performance attainable in Human Object Interaction (HOI) detection, a real-world spatial task, which includes a setting where ground truth boxes are not available at test time. An additional goal is to identify the principles necessary to effectively represent a spatial template, that is, the visual region in which two objects involved in a relationship expressed by a predicate occur. We propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation. Tests in two benchmarks reveal: (1) High performance is attainable by using exclusively spatial information in all tasks. (2) In HOI detection, the canonical template outperforms the rest of spatial, visual, and several state-of-the-art baselines. (3) Simple fusion of visual and spatial features substantially improves performance. (4) Our methods fare remarkably well with a small amount of data and rare categories. Our results obtained on the Visual Genome (VG) and the Humans Interacting with Common Objects - Detection (HICO-DET) datasets indicate that great predictive power can be obtained from spatial clues alone, opening up possibilities for performing fast scene understanding at a glance.

4 citations
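
The paper's exact parameterization of the canonical template is not reproduced here, but a minimal sketch of a scale-, mirror-, and translation-invariant encoding of a box pair might anchor the frame on the subject box and mirror the pair so the object always lies on one side (all specifics below are our assumptions):

```python
import numpy as np

def canonical_pair(subj, obj):
    """Encode two boxes (x1, y1, x2, y2) in a translation-, scale- and
    mirror-invariant frame anchored on the subject box."""
    subj, obj = np.asarray(subj, float), np.asarray(obj, float)
    cx, cy = (subj[0] + subj[2]) / 2, (subj[1] + subj[3]) / 2
    scale = max(subj[2] - subj[0], subj[3] - subj[1])
    boxes = np.stack([subj, obj])
    boxes[:, [0, 2]] = (boxes[:, [0, 2]] - cx) / scale   # translate + scale x
    boxes[:, [1, 3]] = (boxes[:, [1, 3]] - cy) / scale   # translate + scale y
    obj_cx = (boxes[1, 0] + boxes[1, 2]) / 2
    if obj_cx < 0:                                       # mirror so the object
        boxes[:, [0, 2]] = -boxes[:, [2, 0]]             # sits on the right
    return boxes.flatten()                               # 8-dim feature

print(canonical_pair([10, 10, 50, 90], [60, 40, 100, 80]))
```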


Journal ArticleDOI
TL;DR: This work proposes a weakly supervised alignment model where the correspondence between the input training visual and textual fragments is not known but their corresponding units that refer to the same artwork are treated as a positive pair.
Abstract: In this paper, we target the tasks of fine-grained image–text alignment and cross-modal retrieval in the cultural heritage domain as follows: (1) given an image fragment of an artwork, we retrieve the noun phrases that describe it; (2) given a noun phrase artifact attribute, we retrieve the corresponding image fragment it specifies. To this end, we propose a weakly supervised alignment model where the correspondence between the input training visual and textual fragments is not known but their corresponding units that refer to the same artwork are treated as a positive pair. The model exploits the latent alignment between fragments across modalities using attention mechanisms by first projecting them into a shared common semantic space; the model is then trained by increasing the image–text similarity of the positive pair in the common space. During this process, we encode the inputs of our model with hierarchical encodings and remove irrelevant fragments with different indicator functions. We also study techniques to augment the limited training data with synthetic relevant textual fragments and transformed image fragments. The model is later fine-tuned on a limited set of small-scale image–text fragment pairs. We rank the test image fragments and noun phrases by their intermodal similarity in the learned common space. Extensive experiments demonstrate that our proposed models outperform two state-of-the-art methods adapted to fine-grained cross-modal retrieval of cultural items for two benchmark datasets.

2 citations
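
As a hedged sketch of the training signal described, attention-based fragment alignment in a shared space combined with a margin loss over positive (same artwork) and negative pairs could look as follows (the pooling scheme and margin value are our assumptions):

```python
import torch
import torch.nn.functional as F

def align_similarity(img_frags, txt_frags):
    """Latent alignment: each text fragment attends over image fragments;
    the image--text score is the mean of the attended similarities."""
    img = F.normalize(img_frags, dim=-1)      # (n_img, d) in common space
    txt = F.normalize(txt_frags, dim=-1)      # (n_txt, d)
    sims = txt @ img.T                        # (n_txt, n_img)
    attn = sims.softmax(dim=-1)               # soft fragment alignment
    return (attn * sims).sum(-1).mean()

def hinge_loss(pos_img, pos_txt, neg_txt, margin=0.2):
    """Push a matching artwork pair above a non-matching one by a margin."""
    s_pos = align_similarity(pos_img, pos_txt)
    s_neg = align_similarity(pos_img, neg_txt)
    return torch.clamp(margin - s_pos + s_neg, min=0)

loss = hinge_loss(torch.randn(6, 128), torch.randn(4, 128), torch.randn(5, 128))
```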


Proceedings ArticleDOI
01 Aug 2021
TL;DR: In this paper, a system combining pretrained multimodal models (CLIP) and chained classifiers was proposed to detect persuasion techniques in multimodal content (memes) for SemEval-2021 Task 6.
Abstract: We describe our approach for SemEval-2021 Task 6 on the detection of persuasion techniques in multimodal content (memes). Our system combines pretrained multimodal models (CLIP) and chained classifiers. We also propose to enrich the data with a data augmentation technique. Our submission achieves a rank of 8/16 in terms of F1-micro and 9/16 in terms of F1-macro on the test set.

2 citations
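
A minimal sketch of the described pipeline, assuming precomputed CLIP features and scikit-learn's ClassifierChain as the chained classifier (the base estimator, feature dimension, and label count are placeholders, not the authors' exact setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Stand-ins for precomputed CLIP embeddings of memes and multi-label
# persuasion-technique annotations (illustrative shapes only).
X = np.random.randn(200, 512)             # one CLIP feature vector per meme
Y = np.random.randint(0, 2, (200, 20))    # 20 persuasion-technique labels

# A chain of binary classifiers: each label's prediction is fed as an
# extra feature to the next, modelling dependencies between techniques.
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X, Y)
print(chain.predict(X[:5]).shape)         # (5, 20)
```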


Posted Content
TL;DR: This article showed that by using a training method that is stable with respect to linear mode connectivity, large networks can also be entirely rewound to initialization, which raises doubts about the use of the lottery ticket hypothesis.
Abstract: The lottery ticket hypothesis states that sparse subnetworks exist in randomly initialized dense networks that can be trained to the same accuracy as the dense network they reside in. However, subsequent work has failed to replicate this on large-scale models and has required rewinding to an early stable state instead of to initialization. We show that by using a training method that is stable with respect to linear mode connectivity, large networks can also be entirely rewound to initialization. Our subsequent experiments on common vision tasks give strong credence to the hypothesis in Evci et al. (2020b) that lottery tickets simply retrain to the same regions (although not necessarily to the same basin). These results imply that existing lottery tickets could not have been found without the preceding dense training by iterative magnitude pruning, raising doubts about the use of the lottery ticket hypothesis.

1 citation
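
For orientation, iterative magnitude pruning with rewinding to initialization, the procedure the paper revisits, can be sketched as follows (the pruning fraction, per-layer thresholding, and the `train_fn` hook are our assumptions):

```python
import copy
import torch

def imp_rewind_to_init(model, train_fn, rounds=3, prune_frac=0.2):
    """Iterative magnitude pruning: train, prune the smallest surviving
    weights per layer, rewind the remainder to their initial values, repeat.
    `train_fn(model, masks)` is assumed to train while respecting the masks."""
    init_state = copy.deepcopy(model.state_dict())      # theta_0
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                          # masked training
        for n, p in model.named_parameters():
            alive = p.abs()[masks[n].bool()]
            thresh = alive.quantile(prune_frac)         # drop 20% of survivors
            masks[n] = masks[n] * (p.abs() > thresh).float()
        model.load_state_dict(init_state)               # rewind to init
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.mul_(masks[n])                        # apply sparsity mask
    return model, masks
```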


Posted Content
TL;DR: This article proposed a contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure, which achieved state-of-the-art results on the relation extraction task using only a simple KNN classifier.
Abstract: Though language model text embeddings have revolutionized NLP research, their ability to capture high-level semantic information, such as relations between entities in text, is limited. In this paper, we propose a novel contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure. Given a sentence (unstructured text) and its graph, we use contrastive learning to impose relation-related structure on the token-level representations of the sentence obtained with a CharacterBERT (El Boukkouri et al.,2020) model. The resulting relation-aware sentence embeddings achieve state-of-the-art results on the relation extraction task using only a simple KNN classifier, thereby demonstrating the success of the proposed method. Additional visualization by a tSNE analysis shows the effectiveness of the learned representation space compared to baselines. Furthermore, we show that we can learn a different space for named entity recognition, again using a contrastive learning objective, and demonstrate how to successfully combine both representation spaces in an entity-relation task.

1 citation
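
The paper's objective operates on CharacterBERT token-level representations; as a simplified stand-in, a generic supervised contrastive loss over sentence embeddings with relation labels, followed by a KNN classifier, captures the overall recipe (temperature, dimensions, and label set are our assumptions):

```python
import torch
import torch.nn.functional as F
from sklearn.neighbors import KNeighborsClassifier

def relation_contrastive_loss(emb, rel_labels, temp=0.1):
    """Pull together embeddings of sentences expressing the same relation,
    push apart the rest (generic supervised contrastive objective)."""
    emb = F.normalize(emb, dim=-1)
    sims = emb @ emb.T / temp
    self_mask = torch.eye(len(emb), dtype=torch.bool)
    sims = sims.masked_fill(self_mask, -1e9)            # exclude self-pairs
    pos = (rel_labels[None, :] == rel_labels[:, None]) & ~self_mask
    return -sims.log_softmax(dim=-1)[pos].mean()

emb = torch.randn(32, 256, requires_grad=True)          # sentence embeddings
labels = torch.randint(0, 4, (32,))                     # relation types
loss = relation_contrastive_loss(emb, labels)           # minimized in training

# After training, relation extraction reduces to a simple KNN lookup:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(emb.detach().numpy(), labels.numpy())
```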


Proceedings ArticleDOI
07 Jun 2021
TL;DR: In this article, a neural network approach that can aid the process of medical document classification is presented, where a human annotator can correct the model predictions for a new training sample which can then be used for training.
Abstract: Hospitals have to deal with continually arriving clinical data. In this paper, we present a neural network approach that can aid the process of medical document classification. In this scenario, a human annotator can correct the model's predictions for a new training sample, which can then be used for training. This data needs to be classified into ICD categories, and the newly obtained knowledge should be captured by the model with minimal loss of already acquired knowledge. More specifically, different strategies are proposed and evaluated for constructing a replay dataset in a continual learning setting. The presented methodology alternates an incremental learning phase with a full retraining on all training samples seen so far. In this manner, a balance can be found where most of the time newly obtained knowledge can immediately be added to the model, but not to the extent that it loses a vital part of previously obtained knowledge.
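
The abstract compares several replay-construction strategies without detailing them; one simple, hypothetical instantiation is a class-balanced replay buffer with random replacement that is mixed into each incremental update:

```python
import random

class ReplayBuffer:
    """Bounded, class-balanced store of past documents that gets mixed into
    every incremental update (one of several possible replay strategies)."""
    def __init__(self, capacity_per_class: int = 50):
        self.capacity = capacity_per_class
        self.store = {}                              # icd_code -> samples

    def add(self, doc: dict, icd_code: str):
        bucket = self.store.setdefault(icd_code, [])
        if len(bucket) < self.capacity:
            bucket.append(doc)
        else:                                        # replace a random old one
            bucket[random.randrange(self.capacity)] = doc

    def sample(self, k: int):
        pool = [d for b in self.store.values() for d in b]
        return random.sample(pool, min(k, len(pool)))

buffer = ReplayBuffer()
buffer.add({"text": "discharge note ...", "label": "I25.1"}, "I25.1")
new_sample = {"text": "new annotator-corrected note ...", "label": "J18.9"}
train_batch = [new_sample] + buffer.sample(16)       # new + replayed knowledge
```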

Proceedings ArticleDOI
14 Apr 2021
TL;DR: TIEVis as mentioned in this paper is a visual analytics dashboard that visualizes event-timelines extracted from clinical reports, highlighting the importance of seeing events in their context, and the ability to manually verify and update critical events in a patient history.
Abstract: Clinical reports, as unstructured texts, contain important temporal information. However, it remains a challenge for natural language processing (NLP) models to accurately combine temporal cues into a single coherent temporal ordering of described events. In this paper, we present TIEVis, a visual analytics dashboard that visualizes event-timelines extracted from clinical reports. We present the findings of a pilot study in which healthcare professionals explored and used the dashboard to complete a set of tasks. Results highlight the importance of seeing events in their context, and the ability to manually verify and update critical events in a patient history, as a basis to increase user trust.

Journal ArticleDOI
TL;DR: In this article, the authors propose a model that detects uncertain situations when a command is given, finds the visual objects causing the uncertainty, and generates a question describing those objects.

Posted Content
TL;DR: The authors proposed to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations, which is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network.
Abstract: Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.
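
As a hedged sketch of the mechanism described, an auxiliary head attached to an intermediate layer can be trained to predict the representation of the *next* sentence, adding a predictive-coding-style signal to the usual objective (the head architecture and loss weighting are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownSentencePredictor(nn.Module):
    """Auxiliary head on an intermediate layer that tries to predict the
    representation of the next sentence (a predictive-coding-style signal)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                  nn.Linear(hidden, hidden))

    def forward(self, cur_sent_repr, next_sent_repr):
        pred = self.proj(cur_sent_repr)              # top-down prediction
        return 1 - F.cosine_similarity(pred, next_sent_repr, dim=-1).mean()

# intermediate-layer [CLS] vectors for sentences t and t+1 (stand-ins)
head = TopDownSentencePredictor()
aux_loss = head(torch.randn(8, 768), torch.randn(8, 768))
# total_loss = mlm_loss + lambda_pc * aux_loss   (weighting is an assumption)
```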

Posted Content
TL;DR: In this article, a multimodal learning algorithm for disinformation detection in online news articles is proposed, which is guided by the profile of users who prefer content similar to the news article that is evaluated, and this effect is reinforced if content is shared among different users.
Abstract: User-generated content (e.g., tweets and profile descriptions) and shared content between users (e.g., news articles) reflect a user's online identity. This paper investigates whether correlations between user-generated and user-shared content can be leveraged for detecting disinformation in online news articles. We develop a multimodal learning algorithm for disinformation detection. The latent representations of news articles and user-generated content allow the model to be guided during training by the profile of users who prefer content similar to the news article being evaluated; this effect is reinforced if that content is shared among different users. By only leveraging user information during model optimization, the model does not rely on user profiling when predicting an article's veracity. The algorithm is successfully applied to three widely used neural classifiers, and results are obtained on different datasets. Visualization techniques show that the proposed model learns feature representations of unseen news articles that better discriminate between fake and real news texts.
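
A minimal sketch of the training scheme described, in which user information shapes the article representation via an auxiliary loss during training but is not needed at inference (the encoders, dimensions, and guidance weight are stand-ins, not the authors' exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArticleClassifier(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.enc = nn.Linear(768, dim)      # stand-in article encoder
        self.clf = nn.Linear(dim, 2)        # fake vs. real

    def forward(self, article_feat):
        z = torch.tanh(self.enc(article_feat))
        return z, self.clf(z)

model = ArticleClassifier()
art = torch.randn(16, 768)                  # article features
usr = torch.randn(16, 256)                  # matched user-profile embeddings
labels = torch.randint(0, 2, (16,))

z, logits = model(art)
ce = F.cross_entropy(logits, labels)
guide = 1 - F.cosine_similarity(z, usr, dim=-1).mean()  # user guidance term
loss = ce + 0.5 * guide    # user signal used during training only;
                           # inference calls model(art) alone
```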

Proceedings ArticleDOI
01 Apr 2021
TL;DR: This paper proposed two soft constraints that can improve the model's ability to resolve coreference relations in dialog in an unsupervised way, achieving state-of-the-art performance on the VisDial v1.0 dataset.
Abstract: Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on the understanding of the dialog history and the image. The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. Most previous works have focused on learning better multi-modal representations or on exploring different ways of fusing visual and language features, while the coreferences in the dialog are mainly ignored. In this paper, based on linguistic knowledge and discourse features of human dialog, we propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way. Experimental results on the VisDial v1.0 dataset show that our model, which integrates two novel and linguistically inspired soft constraints in a deep transformer neural architecture, obtains new state-of-the-art performance in terms of recall at 1 and other evaluation metrics compared to existing models, and this without pretraining on other vision-language datasets. Our qualitative results also demonstrate the effectiveness of the proposed method.

Proceedings Article
02 Sep 2021
TL;DR: This paper proposed a contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure, which achieved state-of-the-art results on the relation extraction task using only a simple KNN classifier.
Abstract: Though language model text embeddings have revolutionized NLP research, their ability to capture high-level semantic information, such as relations between entities in text, is limited. In this paper, we propose a novel contrastive learning framework that trains sentence embeddings to encode the relations in a graph structure. Given a sentence (unstructured text) and its graph, we use contrastive learning to impose relation-related structure on the token level representations of the sentence obtained with a CharacterBERT (El Boukkouri et al., 2020) model. The resulting relation-aware sentence embeddings achieve state-of-the-art results on the relation extraction task using only a simple KNN classifier, thereby demonstrating the success of the proposed method. Additional visualization by a tSNE analysis shows the effectiveness of the learned representation space compared to baselines. Furthermore, we show that we can learn a different space for named entity recognition, again using a contrastive learning objective, and demonstrate how to successfully combine both representation spaces in an entity-relation task.

Book ChapterDOI
28 Mar 2021
TL;DR: This paper studied the effect of simple feature transforms (e.g., standardizing) in 25 datasets with 6 tasks covering semantic similarity and text and image retrieval, and found that some feature transforms provide solid improvements, suggesting their default adoption; cosine similarity fares better than Euclidean similarity.
Abstract: Practitioners often resort to off-the-shelf feature extractors such as language models (e.g., BERT or Glove) for text or pre-trained CNNs for images. These features are often used without further supervision in tasks such as text or image retrieval and semantic similarity with cosine-based semantic match. Although cosine similarity is sensitive to centering and other feature transforms, their impact on task performance has not been systematically studied. Prior studies are limited to a single domain (e.g., bilingual embeddings) and one data modality (text). Here, we systematically study the effect of simple feature transforms (e.g., standardizing) in 25 datasets with 6 tasks covering semantic similarity and text and image retrieval. We further back up our claims in ad-hoc laboratory experiments. We include 15 (8 image + 7 text) embeddings, covering the state-of-the-art models. Our second goal is to determine whether the common practice of defaulting to the cosine similarity is empirically supported. Our findings reveal that: (i) some feature transforms provide solid improvements, suggesting their default adoption; (ii) cosine similarity fares better than Euclidean similarity, thus backing up standard practices. Ultimately, our takeaways provide actionable advice for practitioners.
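
To make the studied setup concrete, here is a minimal NumPy sketch of one of the simple transforms (standardizing) applied consistently to corpus and query embeddings before cosine-based matching:

```python
import numpy as np

def standardize(X, mean=None, std=None):
    """Center each feature and scale to unit variance (one of the simple
    transforms studied; statistics come from a reference/training set)."""
    mean = X.mean(0) if mean is None else mean
    std = X.std(0) + 1e-8 if std is None else std
    return (X - mean) / std, mean, std

def cosine_sim(a, B):
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

emb = np.random.randn(1000, 300)           # off-the-shelf embeddings
emb_t, mu, sd = standardize(emb)
query_t = (emb[0] - mu) / sd               # transform queries consistently
scores = cosine_sim(query_t, emb_t)        # rank by cosine in the new space
print(scores.argsort()[::-1][:5])
```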

Posted Content
TL;DR: In this paper, a compositional/few-shot action recognition approach is proposed, where multi-head attention is used over spatio-temporal layouts, i.e., configurations of object bounding boxes.
Abstract: Recognizing human actions is fundamentally a spatio-temporal reasoning problem, and should be, at least to some extent, invariant to the appearance of the human and the objects involved. Motivated by this hypothesis, in this work, we take an object-centric approach to action recognition. Multiple works have studied this setting before, yet it remains unclear (i) how well a carefully crafted, spatio-temporal layout-based method can recognize human actions, and (ii) how, and when, to fuse the information from layout and appearance-based models. The main focus of this paper is compositional/few-shot action recognition, where we advocate the usage of multi-head attention (proven to be effective for spatial reasoning) over spatio-temporal layouts, i.e., configurations of object bounding boxes. We evaluate different schemes to inject video appearance information into the system, and benchmark our approach on background-cluttered action recognition. On the Something-Else and Action Genome datasets, we demonstrate (i) how to extend multi-head attention for spatio-temporal layout-based action recognition, (ii) how to improve the performance of appearance-based models by fusion with layout-based models, and (iii) that even on non-compositional, background-cluttered video datasets, a fusion between layout- and appearance-based models improves performance.
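
As a hedged sketch of the layout-based branch, each object bounding box in each frame can be projected to a token and fed to a transformer encoder (positional/temporal and object-identity embeddings are omitted for brevity; all sizes below are our assumptions, e.g. a 174-class action head):

```python
import torch
import torch.nn as nn

class LayoutActionModel(nn.Module):
    """Treats each (frame, object) bounding box as a token and lets
    multi-head self-attention reason over the spatio-temporal layout."""
    def __init__(self, d: int = 128, heads: int = 8, n_actions: int = 174):
        super().__init__()
        self.box_proj = nn.Linear(4, d)       # (x1, y1, x2, y2) -> token
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_actions)

    def forward(self, boxes):                 # (batch, frames*objects, 4)
        tokens = self.box_proj(boxes)
        enc = self.encoder(tokens)
        return self.head(enc.mean(dim=1))     # clip-level action logits

model = LayoutActionModel()
logits = model(torch.rand(2, 16 * 4, 4))      # 16 frames x 4 object boxes
```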

Book ChapterDOI
01 Jan 2021
TL;DR: The authors argue that modern machine learning approaches fail to adequately address how grammar and common sense should be learned, and advocate experiments with abstract, confined world environments where agents interact, with an emphasis on learning world models.
Abstract: In this position paper we argue that modern machine learning approaches fail to adequately address how grammar and common sense should be learned. State-of-the-art language models achieve impressive results in a range of specialized tasks but lack underlying world understanding. We advocate experiments with abstract, confined world environments where agents interact, with an emphasis on learning world models. Agents are induced to learn the grammar needed to navigate the environment, hence their grammar will be grounded in this abstracted world. We believe that this grounded grammar will therefore facilitate a more realistic, interpretable and human-like form of common sense.

Proceedings Article
01 Nov 2021
TL;DR: This article proposed to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations, which is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network.
Abstract: Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.