
Showing papers by "Marie-Francine Moens published in 2013"


Proceedings Article
28 Jun 2013
TL;DR: This paper explores the use of machine learning techniques for inferring a user's personality traits from their Facebook status updates and indicates that personality trait recognition generalises across social media platforms.
Abstract: Gaining insight into a web user's personality is very valuable for applications that rely on personalisation, such as recommender systems and personalised advertising. In this paper we explore the use of machine learning techniques for inferring a user's personality traits from their Facebook status updates. Even with a small set of training examples we can outperform the majority-class baseline algorithm. Furthermore, the results are improved by adding training examples from another source. This is an interesting result because it indicates that personality trait recognition generalises across social media platforms.
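
To make the setup concrete, here is a minimal sketch of such a trait classifier in Python with scikit-learn, compared against the majority-class baseline mentioned above; the status updates, labels and model choices are hypothetical placeholders, not the paper's actual data or features.

    # Minimal sketch (hypothetical data): classify one binary personality
    # trait from status updates and compare with a majority-class baseline.
    from sklearn.dummy import DummyClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    updates = [
        "had a great night out with friends",   # toy examples, not real data
        "stayed in and read all weekend",
        "party at my place, everyone welcome",
        "quiet day, just me and my books",
    ]
    labels = ["extravert", "introvert", "extravert", "introvert"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    baseline = DummyClassifier(strategy="most_frequent")

    model.fit(updates, labels)
    baseline.fit(updates, labels)
    # With a real corpus one would compare cross-validated scores of the two.
    print(model.predict(["movie marathon alone tonight"]))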

102 citations


Journal ArticleDOI
TL;DR: The main importance of this work lies in the fact that it provides novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without use of any additional external resources such as parallel corpora or machine-readable dictionaries.
Abstract: In this paper, we study different applications of cross-language latent topic models trained on comparable corpora. The first focus lies on the task of cross-language information retrieval (CLIR). The bilingual latent Dirichlet allocation (BiLDA) model allows us to create an interlingual, language-independent representation of both queries and documents. We construct several BiLDA-based document models for CLIR, where no additional translation resources are used. The second focus lies on the methods for extracting translation candidates and semantically related words using only per-topic word distributions of the cross-language latent topic model. As the main contribution, we combine the two former steps, blending the evidence from the per-document topic distributions and the per-topic word distributions of the topic model with the knowledge from the extracted lexicon. We design and evaluate a novel evidence-rich statistical model for CLIR, and prove that such a model, which combines various (only internal) sources of evidence, obtains the best scores for experiments performed on the standard test collections of the CLEF 2001–2003 campaigns. We confirm these findings in an alternative evaluation, where we automatically generate queries and perform known-item search on a test subset of Wikipedia articles. The main importance of this work lies in the fact that we train translation resources from comparable document-aligned corpora and provide novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without the use of any additional external resources such as parallel corpora or machine-readable dictionaries.
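
The simplest BiLDA-based document model can be sketched as follows: a query word and a target-language document meet in the shared topic space, so no translation dictionary is needed. The topic matrices below are random stand-ins for trained BiLDA output, so the numbers are illustrative only.

    import numpy as np

    # Hypothetical BiLDA output: K shared topics over a vocabulary of V words
    # in the query's language (per-topic word distributions), plus the topic
    # mixture of one target-language document.
    K, V = 4, 6
    rng = np.random.default_rng(0)
    phi = rng.dirichlet(np.ones(V), size=K)    # P(w | z_k), shape (K, V)
    theta = rng.dirichlet(np.ones(K))          # P(z_k | d) for one document

    def query_log_likelihood(query_word_ids, theta, phi):
        # P(w | d) = sum_k P(w | z_k) * P(z_k | d): the interlingual topics
        # z_k bridge the query language and the document language.
        word_probs = phi[:, query_word_ids].T @ theta
        return float(np.sum(np.log(word_probs)))

    print(query_log_likelihood([0, 3], theta, phi))  # rank documents by this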

62 citations


Proceedings Article
01 Jun 2013
TL;DR: It is shown that in cross-lingual settings without any language-pair-dependent knowledge the response-based method of similarity is more robust and outperforms current state-of-the-art methods that directly operate in the semantic space of latent cross-lingual concepts/topics.
Abstract: We propose a new approach to identifying semantically similar words across languages. The approach is based on the idea that two words in different languages are similar if they are likely to generate similar words (including both source and target language words) as their top semantic word responses. Semantic word responding is a concept from cognitive science which addresses detecting the words that humans are most likely to output as free word associations given some cue word. The method consists of two main steps: (1) it utilizes a probabilistic multilingual topic model trained on comparable data to learn and quantify the semantic word responses; (2) it provides ranked lists of similar words according to the similarity of their semantic word response vectors. We evaluate our approach in the task of bilingual lexicon extraction (BLE) for a variety of language pairs. We show that in cross-lingual settings without any language-pair-dependent knowledge the response-based method of similarity is more robust and outperforms current state-of-the-art methods that directly operate in the semantic space of latent cross-lingual concepts/topics.
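
In code, the two steps might look like the sketch below, with random stand-ins for the trained multilingual topic model and cosine similarity as one possible choice for comparing response vectors.

    import numpy as np

    # Hypothetical topic-model output over a joint source+target vocabulary.
    K, V = 5, 8
    rng = np.random.default_rng(1)
    phi = rng.dirichlet(np.ones(V), size=K)               # P(w' | z_k)
    topic_given_word = rng.dirichlet(np.ones(K), size=V)  # P(z_k | w)

    # Step 1: semantic word responses, P(w' | w) = sum_k P(w' | z_k) P(z_k | w).
    responses = topic_given_word @ phi                    # shape (V, V)

    # Step 2: rank candidates by similarity of their response vectors.
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def most_similar(word_id, candidate_ids):
        return sorted(candidate_ids,
                      key=lambda c: -cosine(responses[word_id], responses[c]))

    print(most_similar(0, [4, 5, 6, 7]))  # hypothetical target-word ids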

56 citations


Proceedings Article
01 Oct 2013
TL;DR: A new language-pair-agnostic approach to inducing bilingual vector spaces from non-parallel data, without any other resource, in a bootstrapping fashion; it outperforms the best-performing fully corpus-based BLE methods on the test sets.
Abstract: We present a new language-pair-agnostic approach to inducing bilingual vector spaces from non-parallel data without any other resource in a bootstrapping fashion. The paper systematically introduces and describes all key elements of the bootstrapping procedure: (1) the starting point or seed lexicon, (2) the confidence estimation and selection of new dimensions of the space, and (3) convergence. We test the quality of the induced bilingual vector spaces, and analyze the influence of the different components of the bootstrapping approach in the task of bilingual lexicon extraction (BLE) for two language pairs. Results reveal that, contrary to conclusions from prior work, the seeding of the bootstrapping process has a heavy impact on the quality of the learned lexicons. We also show that our approach outperforms the best-performing fully corpus-based BLE methods on these test sets.
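
A schematic version of the three elements of the bootstrapping loop follows; the co-occurrence counts, the cosine scoring and the margin-based confidence check are illustrative assumptions, not the paper's exact choices.

    import math

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0

    def bootstrap(seed, src_words, tgt_words, cooc_src, cooc_tgt,
                  margin=0.1, max_iters=10):
        dims = list(seed)                  # (1) seed lexicon = first dimensions
        for _ in range(max_iters):
            added = []
            for s in src_words:
                vs = [cooc_src[s].get(a, 0) for a, _ in dims]
                scored = sorted(
                    (cosine(vs, [cooc_tgt[t].get(b, 0) for _, b in dims]), t)
                    for t in tgt_words)
                # (2) confidence: clear margin over the runner-up translation
                if scored[-1][0] - scored[-2][0] > margin:
                    added.append((s, scored[-1][1]))
            new = [p for p in added if p not in dims]
            if not new:                    # (3) convergence: nothing new added
                return dims
            dims += new                    # new pairs become new dimensions
        return dims

    # Toy co-occurrence counts (hypothetical) and a two-pair seed lexicon.
    cooc_src = {"hond": {"eten": 4, "blaffen": 6}, "kat": {"eten": 5, "blaffen": 0}}
    cooc_tgt = {"dog": {"food": 4, "bark": 6}, "cat": {"food": 5, "bark": 0}}
    print(bootstrap([("eten", "food"), ("blaffen", "bark")],
                    ["hond", "kat"], ["dog", "cat"], cooc_src, cooc_tgt))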

51 citations


Proceedings ArticleDOI
04 Dec 2013
TL;DR: This paper gives a short overview of the state-of-the-art and goals of argumentation mining and it provides ideas for further research.
Abstract: This paper gives a short overview of the state of the art and goals of argumentation mining, and it provides ideas for further research. Its content is based on two invited lectures on argumentation mining: one at the FIRE 2013 conference at the India International Center in New Delhi, India, and one given as SICSA distinguished visitor at the University of Dundee, UK, in the summer of 2014.

42 citations


Journal ArticleDOI
TL;DR: A qualitative evaluation with professional video searchers shows that the combination of automatic video indexing, interactive visualisations and user-centred design can result in increased usability, user satisfaction and productivity.
Abstract: Professional video searchers typically have to search for particular video fragments in a vast video archive that contains many hours of video data. Without the right video archive exploration tools, this is a difficult and time-consuming task that induces hours of video skimming. We propose the video archive explorer, a video exploration tool that provides visual representations of automatically detected concepts to facilitate individual and collaborative video search tasks. This video archive explorer is developed by employing a user-centred methodology, which ensures that the tool is more likely to fit the end users' needs. A qualitative evaluation with professional video searchers shows that the combination of automatic video indexing, interactive visualisations and user-centred design can result in increased usability, user satisfaction and productivity.

33 citations


Proceedings Article
01 Jan 2013
TL;DR: This paper describes the task at Semantic Evaluations 2013: the annotation schema, the corpora, the participants, and the methods and results obtained by the participants.
Abstract: Many NLP applications require information about the locations of objects referenced in text, or the relations between them in space. For example, the phrase a book on the desk contains information about the location of the object book, as trajector, with respect to another object desk, as landmark. Spatial Role Labeling (SpRL) is an evaluation task in the information extraction domain whose goal is to automatically process text and identify the objects of spatial scenes and the relations between them. This paper describes the task at Semantic Evaluations 2013: the annotation schema, the corpora, the participants, and the methods and results obtained by the participants.
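
The abstract's example can be rendered as a minimal SpRL-style annotation; the field names below are illustrative, not the official task schema.

    from dataclasses import dataclass

    @dataclass
    class SpatialRelation:
        trajector: str           # the object whose location is described
        spatial_indicator: str   # the trigger expressing the spatial relation
        landmark: str            # the reference object

    # "a book on the desk" as a spatial-role triple:
    print(SpatialRelation(trajector="book", spatial_indicator="on", landmark="desk"))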

29 citations


Book ChapterDOI
01 Jan 2013
TL;DR: One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology.
Abstract: One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology. A semantic database makes it possible to go from words to concepts and, consequently, to develop technologies that access and use knowledge rather than textual representations.

22 citations



Book ChapterDOI
24 Mar 2013
TL;DR: In this article, the authors explore the potential of probabilistic topic modeling within the relevance modeling framework for both monolingual and cross-lingual ad-hoc retrieval, and integrate the topical knowledge into a unified relevance modelling framework.
Abstract: We explore the potential of probabilistic topic modeling within the relevance modeling framework for both monolingual and cross-lingual ad-hoc retrieval. Multilingual topic models provide a way to represent documents in a structured and coherent way, regardless of their actual language, by means of language-independent concepts, that is, cross-lingual topics. We show how to integrate the topical knowledge into a unified relevance modeling framework in order to build quality retrieval models in monolingual and cross-lingual contexts. The proposed modeling framework processes all documents uniformly and does not make any conceptual distinction between monolingual and cross-lingual modeling. Our results obtained from the experiments conducted on the standard CLEF test collections reveal that fusing the topical knowledge and relevance modeling leads to building monolingual and cross-lingual retrieval models that outperform several strong baselines. We show that the topical knowledge coming from a general Web-generated corpus boosts retrieval scores. Additionally, we show that within this framework the estimation of cross-lingual relevance models may be performed by exploiting only a general non-parallel corpus.

11 citations


Journal ArticleDOI
TL;DR: It is found that simple methods can still outperform the current state-of-the-art techniques for representing, fusing and comparing content representations of news documents.
Abstract: We study several techniques for representing, fusing and comparing content representations of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All used methods are completely unsupervised. We find that simple methods can still outperform the current state-of-the-art techniques.
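
As a flavor of the compared model families, this sketch contrasts a plain term vector-space similarity with an LDA topic-mixture similarity on toy news snippets; the settings are illustrative, not those of the study.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [  # hypothetical news snippets
        "parliament passed the new budget law",
        "the budget law cleared parliament today",
        "the striker scored twice in the cup final",
    ]

    # Term setting: cosine similarity in a TF-IDF vector space.
    tfidf = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(tfidf)[0])

    # Topic setting: cosine similarity of LDA topic mixtures.
    counts = CountVectorizer().fit_transform(docs)
    theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
    print(cosine_similarity(theta)[0])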

01 Jan 2013
TL;DR: This study explores the use of machine learning techniques to automatically infer users' personality traits based on their Facebook status updates.
Abstract: User-generated content in online social networking sites provides a potentially rich source of information for applications that rely on personalisation, such as on-line marketing. In this study we contribute to this line of work by exploring the use of machine learning (ML) techniques to automatically infer users' personality traits based on their Facebook status updates (i.e., text messages used to communicate with friends).

Proceedings ArticleDOI
28 Oct 2013
TL;DR: This paper implements and evaluates several information retrieval models for linking the texts of pins of Pinterest to webpages of Amazon, and ranking the pages according to the personal interest of the pinner.
Abstract: The information that users of social network sites post often points towards their interests and hobbies. It can be used to recommend relevant products to users. In this paper we implement and evaluate several information retrieval models for linking the texts of pins of Pinterest to webpages of Amazon, and ranking the pages (which we call webshops) according to the personal interest of the pinner. The results show that models that combine latent concepts composed of related terms with single words yield the best performance.
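
One common way to realize the winning combination (latent concepts plus single words) is to interpolate a unigram document model with a topic-based one; the mixing weight and the random model parameters below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    K, V = 3, 10                              # hypothetical sizes
    phi = rng.dirichlet(np.ones(V), size=K)   # P(w | z_k): latent concepts
    theta_d = rng.dirichlet(np.ones(K))       # P(z_k | d) for one webshop page
    unigram_d = rng.dirichlet(np.ones(V))     # single-word model P(w | d)

    def combined_score(query_ids, lam=0.5):
        # P(w|d) = lam * P_unigram(w|d) + (1 - lam) * sum_k P(w|z_k) P(z_k|d)
        topic_part = phi[:, query_ids].T @ theta_d
        return float(np.sum(np.log(lam * unigram_d[query_ids]
                                   + (1 - lam) * topic_part)))

    print(combined_score([1, 4]))  # score of a pin's terms against one page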

Proceedings Article
01 Jan 2013
TL;DR: This paper describes the entry in the Gene Regulation Network in Bacteria (GRN) part, for which the system finished in second place, and employs a basic Support Vector Machine framework to tackle this relation extraction task.
Abstract: The BioNLP Shared Task 2013 is organised to further advance the field of information extraction in biomedical texts. This paper describes our entry in the Gene Regulation Network in Bacteria (GRN) part, for which our system finished in second place (out of five). To tackle this relation extraction task, we employ a basic Support Vector Machine framework. We discuss our findings in constructing local and contextual features, which improve our precision by as much as 7.5%. We touch upon the interaction type hierarchy inherent in the problem, and the importance of the evaluation procedure in encouraging exploration of that structure.
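
The shape of such a feature-based SVM pipeline might be as follows; the features, trigger-word list and labels are toy stand-ins, not the system's actual feature set.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def pair_features(tokens, i, j):
        # Toy local + contextual features for a candidate gene pair (i, j).
        between = tokens[i + 1:j]
        return {
            "head_1": tokens[i], "head_2": tokens[j],   # local features
            "distance": j - i,                          # contextual features
            "trigger_between": any(t in {"activates", "represses", "regulates"}
                                   for t in between),
        }

    toks_a = "sigB represses yvyD expression".split()
    toks_b = "sigB activates katX transcription".split()
    X = [pair_features(toks_a, 0, 2), pair_features(toks_b, 0, 2)]
    y = ["Repression", "Activation"]          # hypothetical interaction types

    clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
    clf.fit(X, y)
    print(clf.predict([pair_features("sigW represses abcD".split(), 0, 2)]))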

Proceedings ArticleDOI
28 Oct 2013
TL;DR: This work evaluates different textual representations and retrieval models that aim to make sense of social media data for retail applications and shows that document representations that combine latent concepts with single words yield the best performance.
Abstract: User-generated content offers opportunities to learn about people's interests and hobbies. We can leverage this information to help users find interesting shops and businesses find interested users. However, as posted on social media sites and blogs, this content is highly noisy and unstructured. In this work we evaluate different textual representations and retrieval models that aim to make sense of social media data for retail applications. Our task is to link the text of pins (from Pinterest.com) to online shops (formed by clustering Amazon.com's products). Our results show that document representations that combine latent concepts with single words yield the best performance.

Book ChapterDOI
24 Mar 2013
TL;DR: This tutorial demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models, and presents how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words.
Abstract: Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for their easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models.

Proceedings Article
05 Dec 2013
TL;DR: A general framework for designing machine learning models that deal with constructing complex structures in the output space is proposed, based on generalized linear training techniques, and exploits techniques from combinatorial optimization.
Abstract: We propose a general framework for designing machine learning models that deal with constructing complex structures in the output space. The goal is to provide an abstraction layer to easily represent and design constructive learning models. The learning approach is based on generalized linear training techniques, and exploits techniques from combinatorial optimization to deal with the complexity of the underlying inference required in this type of model. This approach also allows us to consider global structural characteristics and constraints over the output elements in an efficient training and prediction setting. The use case focuses on building spatial meaning representations from text to instantiate a virtual world.
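
The training regime the abstract describes, generalized linear scoring plus combinatorial inference over constrained output structures, can be caricatured with a structured-perceptron step; the features, the global constraint and the toy task are assumptions for illustration.

    from itertools import product
    import numpy as np

    def features(x, y):
        # Joint feature vector for input tokens x and candidate structure y.
        return np.array([
            sum(1 for xi, yi in zip(x, y) if yi == 1 and xi.istitle()),
            sum(y),                   # a global structural characteristic
        ], dtype=float)

    def predict(w, x):
        # Toy combinatorial inference: enumerate structures satisfying a
        # global constraint (at most one token labeled 1), take the argmax.
        candidates = [y for y in product([0, 1], repeat=len(x)) if sum(y) <= 1]
        return max(candidates, key=lambda y: float(w @ features(x, y)))

    w = np.zeros(2)
    x, gold = ["the", "Alice", "ran"], (0, 1, 0)
    pred = predict(w, x)
    if pred != gold:                  # perceptron update toward the gold
        w += features(x, gold) - features(x, pred)
    print(w, predict(w, x))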


Book ChapterDOI
14 Dec 2013
TL;DR: This paper presents a framework for automatically labelling messages from discussion forums with the categories of Bloom's taxonomy, and shows that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.
Abstract: Labeling discussion forums with the cognitive levels of Bloom's taxonomy is a time-consuming and very expensive task, due to the large amount of information that needs to be labeled and the need for an expert in the educational field to apply the taxonomy to the messages of the forums. In this paper we present a framework for automatically labeling messages from discussion forums with the categories of Bloom's taxonomy. Several models were created using three kinds of machine learning approaches: linear, rule-based and combined classifiers. The models are evaluated using accuracy, the F1-measure and the area under the ROC curve. Additionally, the statistical significance of the results is assessed using a McNemar test in order to validate them. The results show that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.

Proceedings Article
01 Jan 2013
TL;DR: This research aims at "bringing a given text to life" via an immersive environment where the user can freely explore the surroundings and increase their understanding of the given text, integrating natural language processing and the development of virtual immersive environments.
Abstract: This paper describes our research-in-progress, which integrates several domains, in particular natural language processing and the development of virtual immersive environments. Our research aims at "bringing a given text to life" via an immersive environment where the user can freely explore the surroundings and increase their understanding of the given text. We describe some important challenges in achieving this goal and outline our current research results. Our work is practically oriented, aiming at fulfilling some societal needs related to education, on which we also report.

10 Jul 2013
TL;DR: It is argued that this work is an important step towards automatically describing text with semantic labels that form a structured ontological representation of the content.
Abstract: We propose a novel structured learning framework for mapping natural language to spatial ontologies. The applied spatial ontology contains spatial concepts, relations and their multiple semantic types based on qualitative spatial calculi models. To achieve a tractable structured learning model, we propose an efficient inference approach based on constraint optimization techniques. In particular, we decompose the inference into subproblems, each of which is solved using LP-relaxation. This is done for both training-time and prediction-time inference. In this framework, ontology components are learnt while taking into account the ontological constraints and linguistic dependencies among components. In particular, we formulate complex relationships such as composed-of, is-a and mutual exclusivity during learning, while previous structured learning models in similar tasks do not go beyond hierarchical relationships. Our experimental results show that jointly learning the output components considering the above-mentioned constraints and relationships improves the results compared to ignoring them. The application of the proposed learning model for mapping to ontologies is not limited to the extraction of spatial semantics; it could be used to populate any ontology. We argue therefore that this work is an important step towards automatically describing text with semantic labels that form a structured ontological representation of the content.
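
A toy instance of the LP-relaxed inference: maximize the classifier's label scores subject to an is-a constraint and a mutual-exclusivity constraint; the labels, scores and constraints are invented for illustration.

    import numpy as np
    from scipy.optimize import linprog

    # Indicator variables x for three candidate labels of one phrase:
    # PLACE, REGION (a subtype of PLACE) and PATH. Hypothetical scores.
    scores = np.array([0.2, 0.9, 0.3])
    c = -scores                  # linprog minimizes, so negate to maximize
    A_ub = [
        [-1, 1, 0],              # is-a: x_REGION <= x_PLACE
        [0, 1, 1],               # mutual exclusivity: x_REGION + x_PATH <= 1
    ]
    b_ub = [0, 1]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 3)
    print(res.x)                 # relaxation is integral here: PLACE + REGION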

Journal ArticleDOI
TL;DR: This paper reports on naming faces in soap series using the weak supervision of fan-drafted narrative texts that describe the events in the video.

Proceedings Article
01 Jun 2013
TL;DR: A central feature of the proposed system is temporal parsing – an approach which identifies temporal relation arguments (event-event and event-timex pairs) and the semantic label of the relation as a single decision.
Abstract: This paper describes a system for temporal processing of text, which participated in the Temporal Evaluations 2013 campaign. The system employs a number of machine learning classifiers to perform the core tasks of: identification of time expressions and events, recognition of their attributes, and estimation of temporal links between recognized events and times. The central feature of the proposed system is temporal parsing – an approach which identifies temporal relation arguments (event-event and event-timex pairs) and the semantic label of the relation as a single decision.
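
The "single decision" can be emulated by classifying every candidate pair directly into a relation label or a NO-LINK class, so that argument identification and relation labeling collapse into one step; the features and label set below are illustrative placeholders.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each candidate (event, event) or (event, timex) pair becomes one
    # instance; predicting a relation label or NO-LINK decides jointly
    # whether a link exists and what its semantic label is.
    pairs = [
        {"arg1_pos": "VBD", "arg2_type": "TIMEX", "same_sentence": True},
        {"arg1_pos": "VBD", "arg2_type": "EVENT", "same_sentence": False},
    ]
    labels = ["IS_INCLUDED", "NO-LINK"]       # toy label set

    clf = make_pipeline(DictVectorizer(), LogisticRegression())
    clf.fit(pairs, labels)
    print(clf.predict([{"arg1_pos": "VBN", "arg2_type": "TIMEX",
                        "same_sentence": True}]))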

Proceedings ArticleDOI
22 Oct 2013
TL;DR: An unsupervised framework for recognizing animals in videos using subtitles is proposed, based on an Expectation Maximization algorithm which is adapted to two very different circumstances: when bounding boxes are available and when the frame as a whole is used instead of bounding boxes.
Abstract: We propose an unsupervised framework for recognizing animals in videos using subtitles. In this framework, the alignment between animals and their names is performed using an Expectation Maximization algorithm which is adapted to two very different circumstances: (1) when bounding boxes are available and (2) when the frame as a whole is used instead of bounding boxes. With the goal of maximizing precision, recall and F-measure, the experiments compare a multitude of natural language processing approaches and visual features when associating animal names in the subtitles with visual patterns. The proposed unsupervised methods obtain 83.1% F1 using bounding boxes and 65.7% F1 without bounding boxes in a fully automated pipeline.
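
A minimal EM loop for the alignment step, in the spirit of a word-alignment model: each shot pairs the animal names in its subtitle with the visual clusters in its frames; all counts and data are toy assumptions.

    import numpy as np

    # Hypothetical shots: (names in the subtitle, visual clusters in frames).
    shots = [(["tiger"], [0, 1]),
             (["tiger", "deer"], [0, 2]),
             (["deer"], [2, 1])]
    names, n_clusters = ["tiger", "deer"], 3
    t = np.full((len(names), n_clusters), 1.0 / n_clusters)  # P(cluster | name)

    for _ in range(30):
        expected = np.zeros_like(t)
        for shot_names, clusters in shots:
            for v in clusters:
                # E-step: split cluster v's evidence over the candidate names.
                post = np.array([t[names.index(n), v] for n in shot_names])
                post /= post.sum()
                for n, p in zip(shot_names, post):
                    expected[names.index(n), v] += p
        # M-step: renormalize the expected counts into probabilities.
        t = expected / expected.sum(axis=1, keepdims=True)

    print(t.round(2))  # "tiger" gravitates to cluster 0, "deer" to cluster 2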

01 Jan 2013
TL;DR: This paper proposes several novel methods for summarization and expansion of facets using the ranked list of search results generated from a keyword search, coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes.
Abstract: Combination of a ‘keyword’ and a ‘faceted’ search has the potential to enhance user experience by providing a better arrangement of search results and aiding further search exploration. However, in such a framework, two key problems exist: 1) a given query may cover several facets, requiring an aggregation or summarization of the most relevant ones; 2) a query may cover too few facets, necessitating an expansion to include additional facets. In this paper, we propose several novel methods for summarization and expansion of facets. Using the ranked list of search results generated from a keyword search, coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes, we dynamically extract key facets. An evaluation of the different methods based on the relevance and diversity of the facets indicates that the Subtree density model performs best for both summarization and expansion.

Journal Article
TL;DR: This work proposes a new approach to improving named entity recognition (NER) in broadcast news speech data that is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags.
Abstract: We propose a new approach to improving named entity recognition (NER) in broadcast news speech data. The approach proceeds in two key steps: (1) we automatically detect document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web; (2) we employ term expansion techniques commonly used in information retrieval to recover named entities that were initially missed by the speech transcriber. We show that our method is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags. Consequently, our novel approach improves state-of-the-art NER results from speech data both in terms of recall and precision.
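
The two steps might be glued together as in this sketch: TF-IDF alignment of transcripts to web stories, then recovery of entities the transcriber missed; the data, the NER output on the stories and the string-matching rule are all illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    transcripts = ["the president met hu jintao in beijing today"]  # toy ASR output
    stories = ["President Hu Jintao arrived in Beijing for talks",
               "The local team won the cup final on Sunday"]
    story_entities = [["Hu Jintao", "Beijing"], ["Sunday"]]  # NER on clean text

    # Step 1: align each transcript with its most similar written story.
    vec = TfidfVectorizer().fit(transcripts + stories)
    sims = cosine_similarity(vec.transform(transcripts), vec.transform(stories))

    # Step 2: recover entities present in the transcript but left untagged.
    for t_id, s_id in enumerate(sims.argmax(axis=1)):
        recovered = [e for e in story_entities[s_id]
                     if e.lower() in transcripts[t_id]]
        print(recovered)  # ['Hu Jintao', 'Beijing']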

Book ChapterDOI
01 Jan 2013
TL;DR: The DAISY project especially focuses on paraphrasing and compression of Dutch sentences and on the rhetorical classification of content blocks and sentences in web pages; both the sentence compression and the rhetorical classification use an Integer Linear Programming optimization strategy.
Abstract: During the DAISY project we have developed essential technology for automatic summarisation of Dutch informative web pages. The project especially focuses on paraphrasing and compression of Dutch sentences, and on the rhetorical classification of content blocks and sentences in the web pages. For the paraphrasing and compression we rely on language models and syntactic constraints. In addition, the Alpino parser for Dutch was extended with a fluency component. Because the rhetorical role of a sentence depends on the roles of its surrounding sentences, we improve the rhetorical classification by finding a globally optimal assignment for all the sentences in a web page. Both the sentence compression and the rhetorical classification use an Integer Linear Programming optimization strategy.

Book ChapterDOI
07 Oct 2013
TL;DR: The authors propose a new approach to improving named entity recognition (NER) in broadcast news speech data by automatically detecting document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web.
Abstract: We propose a new approach to improving named entity recognition (NER) in broadcast news speech data. The approach proceeds in two key steps: (1) we automatically detect document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web; (2) we employ term expansion techniques commonly used in information retrieval to recover named entities that were initially missed by the speech transcriber. We show that our method is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags. Consequently, our novel approach improves state-of-the-art NER results from speech data both in terms of recall and precision.

Proceedings ArticleDOI
15 May 2013
TL;DR: A novel method is proposed that uses the spatial distribution of relevant categories in a taxonomy, together with an 'importance' score obtained from a ranked list of search results, to derive a metric that represents the breadth of a query.
Abstract: This paper proposes a novel method to estimate the breadth of a search query. Using the spatial distribution of relevant categories in a taxonomy, together with an 'importance' score obtained using a ranked list of search results, we derive a metric that represents the breadth of a query. Several experiments have been performed on the DMOZ hierarchy with two different sets of queries and benchmarked against state-of-the-art results. Evaluation of the method based on metrics such as F-measure and accuracy indicates better agreement with human judgements.
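
For intuition only, a breadth score in this spirit can be computed as the entropy of a rank-weighted category distribution over the results; this is an illustrative stand-in, not the paper's actual metric.

    import math
    from collections import defaultdict

    def query_breadth(result_categories):
        # Entropy of a rank-weighted DMOZ-category distribution (illustrative).
        # `result_categories` lists each result's category, best-ranked first.
        weights = defaultdict(float)
        for rank, cat in enumerate(result_categories, start=1):
            weights[cat] += 1.0 / rank          # rank-based importance
        total = sum(weights.values())
        return -sum(w / total * math.log2(w / total) for w in weights.values())

    print(query_breadth(["Arts/Music"] * 8))                       # narrow: 0.0
    print(query_breadth(["Arts/Music", "Science/Biology",
                         "Sports/Soccer", "Home/Cooking"] * 2))    # broader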

01 Jan 2013
TL;DR: A novel method for the summarization and expansion of search facets is presented, in which the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes to dynamically extract key facets.
Abstract: We present a novel method for summarization and expansion of search facets. To dynamically extract key facets, the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes. An evaluation of the method based on the relevance and diversity of the produced facets indicates its effectiveness for both summarization and expansion.