
Showing papers by "Marie-Francine Moens published in 2013"


Proceedings Article
28 Jun 2013
TL;DR: This paper explores the use of machine learning techniques for inferring a user's personality traits from their Facebook status updates and indicates that personality trait recognition generalises across social media platforms.
Abstract: Gaining insight into a web user's personality is very valuable for applications that rely on personalisation, such as recommender systems and personalised advertising. In this paper we explore the use of machine learning techniques for inferring a user's personality traits from their Facebook status updates. Even with a small set of training examples we can outperform the majority-class baseline algorithm. Furthermore, the results are improved by adding training examples from another source. This is an interesting result because it indicates that personality trait recognition generalises across social media platforms.
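
To make the setup concrete, here is a minimal sketch of such a trait classifier in Python with scikit-learn, compared against the majority-class baseline mentioned above; the status updates, labels and model choices are hypothetical placeholders, not the paper's actual data or features.

    # Minimal sketch (hypothetical data): classify one binary personality
    # trait from status updates and compare with a majority-class baseline.
    from sklearn.dummy import DummyClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    updates = [
        "had a great night out with friends",   # toy examples, not real data
        "stayed in and read all weekend",
        "party at my place, everyone welcome",
        "quiet day, just me and my books",
    ]
    labels = ["extravert", "introvert", "extravert", "introvert"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    baseline = DummyClassifier(strategy="most_frequent")

    model.fit(updates, labels)
    baseline.fit(updates, labels)
    # With a real corpus one would compare cross-validated scores of the two.
    print(model.predict(["movie marathon alone tonight"]))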

102 citations


Journal ArticleDOI
TL;DR: The main importance of this work lies in the fact that it provides novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without use of any additional external resources such as parallel corpora or machine-readable dictionaries.
Abstract: In this paper, we study different applications of cross-language latent topic models trained on comparable corpora. The first focus lies on the task of cross-language information retrieval (CLIR). The bilingual latent Dirichlet allocation (BiLDA) model allows us to create an interlingual, language-independent representation of both queries and documents. We construct several BiLDA-based document models for CLIR, where no additional translation resources are used. The second focus lies on the methods for extracting translation candidates and semantically related words using only per-topic word distributions of the cross-language latent topic model. As the main contribution, we combine the two former steps, blending the evidence from the per-document topic distributions and the per-topic word distributions of the topic model with the knowledge from the extracted lexicon. We design and evaluate a novel evidence-rich statistical model for CLIR, and prove that such a model, which combines various (only internal) sources of evidence, obtains the best scores for experiments performed on the standard test collections of the CLEF 2001–2003 campaigns. We confirm these findings in an alternative evaluation, where we automatically generate queries and perform known-item search on a test subset of Wikipedia articles. The main importance of this work lies in the fact that we train translation resources from comparable document-aligned corpora and provide novel CLIR statistical models that exhaustively exploit as many cross-lingual clues as possible in the quest for better CLIR results, without the use of any additional external resources such as parallel corpora or machine-readable dictionaries.
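
The simplest BiLDA-based document model can be sketched as follows: a query word and a target-language document meet in the shared topic space, so no translation dictionary is needed. The topic matrices below are random stand-ins for trained BiLDA output, so the numbers are illustrative only.

    import numpy as np

    # Hypothetical BiLDA output: K shared topics over a vocabulary of V words
    # in the query's language (per-topic word distributions), plus the topic
    # mixture of one target-language document.
    K, V = 4, 6
    rng = np.random.default_rng(0)
    phi = rng.dirichlet(np.ones(V), size=K)    # P(w | z_k), shape (K, V)
    theta = rng.dirichlet(np.ones(K))          # P(z_k | d) for one document

    def query_log_likelihood(query_word_ids, theta, phi):
        # P(w | d) = sum_k P(w | z_k) * P(z_k | d): the interlingual topics
        # z_k bridge the query language and the document language.
        word_probs = phi[:, query_word_ids].T @ theta
        return float(np.sum(np.log(word_probs)))

    print(query_log_likelihood([0, 3], theta, phi))  # rank documents by this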

62 citations


Proceedings Article
01 Jun 2013
TL;DR: It is shown that in cross-lingual settings without any language-pair-dependent knowledge the response-based method of similarity is more robust and outperforms current state-of-the-art methods that directly operate in the semantic space of latent cross-lingual concepts/topics.
Abstract: We propose a new approach to identifying semantically similar words across languages. The approach is based on the idea that two words in different languages are similar if they are likely to generate similar words (including both source and target language words) as their top semantic word responses. Semantic word responding is a concept from cognitive science which addresses detecting the words that humans are most likely to output as free word associations given some cue word. The method consists of two main steps: (1) it utilizes a probabilistic multilingual topic model trained on comparable data to learn and quantify the semantic word responses; (2) it provides ranked lists of similar words according to the similarity of their semantic word response vectors. We evaluate our approach in the task of bilingual lexicon extraction (BLE) for a variety of language pairs. We show that in cross-lingual settings without any language-pair-dependent knowledge the response-based method of similarity is more robust and outperforms current state-of-the-art methods that directly operate in the semantic space of latent cross-lingual concepts/topics.
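
In code, the two steps might look like the sketch below, with random stand-ins for the trained multilingual topic model and cosine similarity as one possible choice for comparing response vectors.

    import numpy as np

    # Hypothetical topic-model output over a joint source+target vocabulary.
    K, V = 5, 8
    rng = np.random.default_rng(1)
    phi = rng.dirichlet(np.ones(V), size=K)               # P(w' | z_k)
    topic_given_word = rng.dirichlet(np.ones(K), size=V)  # P(z_k | w)

    # Step 1: semantic word responses, P(w' | w) = sum_k P(w' | z_k) P(z_k | w).
    responses = topic_given_word @ phi                    # shape (V, V)

    # Step 2: rank candidates by similarity of their response vectors.
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def most_similar(word_id, candidate_ids):
        return sorted(candidate_ids,
                      key=lambda c: -cosine(responses[word_id], responses[c]))

    print(most_similar(0, [4, 5, 6, 7]))  # hypothetical target-word ids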

56 citations


Proceedings Article
01 Oct 2013
TL;DR: A new language-pair-agnostic approach to inducing bilingual vector spaces from non-parallel data, without any other resource, in a bootstrapping fashion; it outperforms the best-performing fully corpus-based BLE methods on the test sets.
Abstract: We present a new language-pair-agnostic approach to inducing bilingual vector spaces from non-parallel data without any other resource in a bootstrapping fashion. The paper systematically introduces and describes all key elements of the bootstrapping procedure: (1) the starting point or seed lexicon, (2) the confidence estimation and selection of new dimensions of the space, and (3) convergence. We test the quality of the induced bilingual vector spaces, and analyze the influence of the different components of the bootstrapping approach in the task of bilingual lexicon extraction (BLE) for two language pairs. Results reveal that, contrary to conclusions from prior work, the seeding of the bootstrapping process has a heavy impact on the quality of the learned lexicons. We also show that our approach outperforms the best-performing fully corpus-based BLE methods on these test sets.
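
A schematic version of the three elements of the bootstrapping loop follows; the co-occurrence counts, the cosine scoring and the margin-based confidence check are illustrative assumptions, not the paper's exact choices.

    import math

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0

    def bootstrap(seed, src_words, tgt_words, cooc_src, cooc_tgt,
                  margin=0.1, max_iters=10):
        dims = list(seed)                  # (1) seed lexicon = first dimensions
        for _ in range(max_iters):
            added = []
            for s in src_words:
                vs = [cooc_src[s].get(a, 0) for a, _ in dims]
                scored = sorted(
                    (cosine(vs, [cooc_tgt[t].get(b, 0) for _, b in dims]), t)
                    for t in tgt_words)
                # (2) confidence: clear margin over the runner-up translation
                if scored[-1][0] - scored[-2][0] > margin:
                    added.append((s, scored[-1][1]))
            new = [p for p in added if p not in dims]
            if not new:                    # (3) convergence: nothing new added
                return dims
            dims += new                    # new pairs become new dimensions
        return dims

    # Toy co-occurrence counts (hypothetical) and a two-pair seed lexicon.
    cooc_src = {"hond": {"eten": 4, "blaffen": 6}, "kat": {"eten": 5, "blaffen": 0}}
    cooc_tgt = {"dog": {"food": 4, "bark": 6}, "cat": {"food": 5, "bark": 0}}
    print(bootstrap([("eten", "food"), ("blaffen", "bark")],
                    ["hond", "kat"], ["dog", "cat"], cooc_src, cooc_tgt))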

51 citations


Proceedings ArticleDOI
04 Dec 2013
TL;DR: This paper gives a short overview of the state-of-the-art and goals of argumentation mining and it provides ideas for further research.
Abstract: This paper gives a short overview of the state of the art and goals of argumentation mining, and it provides ideas for further research. Its content is based on two invited lectures on argumentation mining: one at the FIRE 2013 conference at the India International Center in New Delhi, India, and one given as SICSA distinguished visitor at the University of Dundee, UK, in the summer of 2014.

42 citations


Journal ArticleDOI
TL;DR: A qualitative evaluation with professional video searchers shows that the combination of automatic video indexing, interactive visualisations and user-centred design can result in increased usability, user satisfaction and productivity.
Abstract: Professional video searchers typically have to search for particular video fragments in a vast video archive that contains many hours of video data. Without the right video archive exploration tools, this is a difficult and time-consuming task that induces hours of video skimming. We propose the video archive explorer, a video exploration tool that provides visual representations of automatically detected concepts to facilitate individual and collaborative video search tasks. This video archive explorer is developed by employing a user-centred methodology, which ensures that the tool is more likely to fit the end users' needs. A qualitative evaluation with professional video searchers shows that the combination of automatic video indexing, interactive visualisations and user-centred design can result in increased usability, user satisfaction and productivity.

33 citations


Proceedings Article
01 Jan 2013
TL;DR: This paper describes the task at Semantic Evaluations 2013: the annotation schema, the corpora, the participants, and the methods and results obtained by the participants.
Abstract: Many NLP applications require information about the locations of objects referenced in text, or the relations between them in space. For example, the phrase a book on the desk contains information about the location of the object book, as trajector, with respect to another object desk, as landmark. Spatial Role Labeling (SpRL) is an evaluation task in the information extraction domain whose goal is to automatically process text and identify the objects of spatial scenes and the relations between them. This paper describes the task at Semantic Evaluations 2013: the annotation schema, the corpora, the participants, and the methods and results obtained by the participants.
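
The abstract's example can be rendered as a minimal SpRL-style annotation; the field names below are illustrative, not the official task schema.

    from dataclasses import dataclass

    @dataclass
    class SpatialRelation:
        trajector: str           # the object whose location is described
        spatial_indicator: str   # the trigger expressing the spatial relation
        landmark: str            # the reference object

    # "a book on the desk" as a spatial-role triple:
    print(SpatialRelation(trajector="book", spatial_indicator="on", landmark="desk"))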

29 citations


Book ChapterDOI
01 Jan 2013
TL;DR: One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology.
Abstract: One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology. A semantic database makes it possible to go from words to concepts and, consequently, to develop technologies that access and use knowledge rather than textual representations.

22 citations



Book ChapterDOI
24 Mar 2013
TL;DR: In this article, the authors explore the potential of probabilistic topic modeling within the relevance modeling framework for both monolingual and cross-lingual ad-hoc retrieval, and integrate the topical knowledge into a unified relevance modelling framework.
Abstract: We explore the potential of probabilistic topic modeling within the relevance modeling framework for both monolingual and cross-lingual ad-hoc retrieval. Multilingual topic models provide a way to represent documents in a structured and coherent way, regardless of their actual language, by means of language-independent concepts, that is, cross-lingual topics. We show how to integrate the topical knowledge into a unified relevance modeling framework in order to build quality retrieval models in monolingual and cross-lingual contexts. The proposed modeling framework processes all documents uniformly and does not make any conceptual distinction between monolingual and cross-lingual modeling. Our results obtained from the experiments conducted on the standard CLEF test collections reveal that fusing the topical knowledge and relevance modeling leads to building monolingual and cross-lingual retrieval models that outperform several strong baselines. We show that the topical knowledge coming from a general Web-generated corpus boosts retrieval scores. Additionally, we show that within this framework the estimation of cross-lingual relevance models may be performed by exploiting only a general non-parallel corpus.

11 citations


Journal ArticleDOI
TL;DR: It is found that simple methods can still outperform the current state-of-the-art techniques for representing, fusing and comparing content representations of news documents.
Abstract: We study several techniques for representing, fusing and comparing content representations of news documents. As underlying models we consider the vector space model (both in a term setting and in a latent semantic analysis setting) and probabilistic topic models based on latent Dirichlet allocation. Content terms can be classified as topical terms or named entities, yielding several models for content fusion and comparison. All used methods are completely unsupervised. We find that simple methods can still outperform the current state-of-the-art techniques.
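
As a flavor of the compared model families, this sketch contrasts a plain term vector-space similarity with an LDA topic-mixture similarity on toy news snippets; the settings are illustrative, not those of the study.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [  # hypothetical news snippets
        "parliament passed the new budget law",
        "the budget law cleared parliament today",
        "the striker scored twice in the cup final",
    ]

    # Term setting: cosine similarity in a TF-IDF vector space.
    tfidf = TfidfVectorizer().fit_transform(docs)
    print(cosine_similarity(tfidf)[0])

    # Topic setting: cosine similarity of LDA topic mixtures.
    counts = CountVectorizer().fit_transform(docs)
    theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
    print(cosine_similarity(theta)[0])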

01 Jan 2013
TL;DR: This study explores the use of machine learning techniques to automatically infer users' personality traits based on their Facebook status updates.
Abstract: User-generated content in online social networking sites provides a potentially rich source of information for applications that rely on personalisation, such as on-line marketing. In this study we contribute to this line of work by exploring the use of machine learning (ML) techniques to automatically infer users' personality traits based on their Facebook status updates (i.e., text messages used to communicate with friends).

Proceedings ArticleDOI
28 Oct 2013
TL;DR: This paper implements and evaluates several information retrieval models for linking the texts of pins of Pinterest to webpages of Amazon, and ranking the pages according to the personal interest of the pinner.
Abstract: The information that users of social network sites post often points towards their interests and hobbies. It can be used to recommend relevant products to users. In this paper we implement and evaluate several information retrieval models for linking the texts of pins of Pinterest to webpages of Amazon, and ranking the pages (which we call webshops) according to the personal interest of the pinner. The results show that models that combine latent concepts composed of related terms with single words yield the best performance.
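
One common way to realize the winning combination (latent concepts plus single words) is to interpolate a unigram document model with a topic-based one; the mixing weight and the random model parameters below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    K, V = 3, 10                              # hypothetical sizes
    phi = rng.dirichlet(np.ones(V), size=K)   # P(w | z_k): latent concepts
    theta_d = rng.dirichlet(np.ones(K))       # P(z_k | d) for one webshop page
    unigram_d = rng.dirichlet(np.ones(V))     # single-word model P(w | d)

    def combined_score(query_ids, lam=0.5):
        # P(w|d) = lam * P_unigram(w|d) + (1 - lam) * sum_k P(w|z_k) P(z_k|d)
        topic_part = phi[:, query_ids].T @ theta_d
        return float(np.sum(np.log(lam * unigram_d[query_ids]
                                   + (1 - lam) * topic_part)))

    print(combined_score([1, 4]))  # score of a pin's terms against one page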

Proceedings Article
01 Jan 2013
TL;DR: This paper describes the entry in the Gene Regulation Network in Bacteria (GRN) part, for which the system finished in second place, and employs a basic Support Vector Machine framework to tackle this relation extraction task.
Abstract: The BioNLP Shared Task 2013 is organised to further advance the field of information extraction in biomedical texts. This paper describes our entry in the Gene Regulation Network in Bacteria (GRN) part, for which our system finished in second place (out of five). To tackle this relation extraction task, we employ a basic Support Vector Machine framework. We discuss our findings in constructing local and contextual features, which improve our precision by as much as 7.5%. We touch upon the interaction type hierarchy inherent in the problem, and the importance of the evaluation procedure in encouraging exploration of that structure.
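
The shape of such a feature-based SVM pipeline might be as follows; the features, trigger-word list and labels are toy stand-ins, not the system's actual feature set.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def pair_features(tokens, i, j):
        # Toy local + contextual features for a candidate gene pair (i, j).
        between = tokens[i + 1:j]
        return {
            "head_1": tokens[i], "head_2": tokens[j],   # local features
            "distance": j - i,                          # contextual features
            "trigger_between": any(t in {"activates", "represses", "regulates"}
                                   for t in between),
        }

    toks_a = "sigB represses yvyD expression".split()
    toks_b = "sigB activates katX transcription".split()
    X = [pair_features(toks_a, 0, 2), pair_features(toks_b, 0, 2)]
    y = ["Repression", "Activation"]          # hypothetical interaction types

    clf = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
    clf.fit(X, y)
    print(clf.predict([pair_features("sigW represses abcD".split(), 0, 2)]))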

Proceedings ArticleDOI
28 Oct 2013
TL;DR: This work evaluates different textual representations and retrieval models that aim to make sense of social media data for retail applications and shows that document representations that combine latent concepts with single words yield the best performance.
Abstract: User-generated content offers opportunities to learn about people's interests and hobbies. We can leverage this information to help users find interesting shops and businesses find interested users. However, as posted on social media sites and blogs, this content is highly noisy and unstructured. In this work we evaluate different textual representations and retrieval models that aim to make sense of social media data for retail applications. Our task is to link the text of pins (from Pinterest.com) to online shops (formed by clustering Amazon.com's products). Our results show that document representations that combine latent concepts with single words yield the best performance.

Book ChapterDOI
24 Mar 2013
TL;DR: This tutorial demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models, and presents how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words.
Abstract: Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for their easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models.

Proceedings Article
05 Dec 2013
TL;DR: A general framework for designing machine learning models that deal with constructing complex structures in the output space is proposed, based on generalized linear training techniques, and exploits techniques from combinatorial optimization.
Abstract: We propose a general framework for designing machine learning models that deal with constructing complex structures in the output space. The goal is to provide an abstraction layer to easily represent and design constructive learning models. The learning approach is based on generalized linear training techniques, and exploits techniques from combinatorial optimization to deal with the complexity of the underlying inference required in this type of model. This approach also allows us to consider global structural characteristics and constraints over the output elements in an efficient training and prediction setting. The use case focuses on building spatial meaning representations from text to instantiate a virtual world.
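
The training regime the abstract describes, generalized linear scoring plus combinatorial inference over constrained output structures, can be caricatured with a structured-perceptron step; the features, the global constraint and the toy task are assumptions for illustration.

    from itertools import product
    import numpy as np

    def features(x, y):
        # Joint feature vector for input tokens x and candidate structure y.
        return np.array([
            sum(1 for xi, yi in zip(x, y) if yi == 1 and xi.istitle()),
            sum(y),                   # a global structural characteristic
        ], dtype=float)

    def predict(w, x):
        # Toy combinatorial inference: enumerate structures satisfying a
        # global constraint (at most one token labeled 1), take the argmax.
        candidates = [y for y in product([0, 1], repeat=len(x)) if sum(y) <= 1]
        return max(candidates, key=lambda y: float(w @ features(x, y)))

    w = np.zeros(2)
    x, gold = ["the", "Alice", "ran"], (0, 1, 0)
    pred = predict(w, x)
    if pred != gold:                  # perceptron update toward the gold
        w += features(x, gold) - features(x, pred)
    print(w, predict(w, x))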


Book ChapterDOI
14 Dec 2013
TL;DR: This paper presents a framework for automatically labelling messages from discussion forums with the categories of Bloom's taxonomy, and shows that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.
Abstract: Labeling discussion forums with the cognitive levels of Bloom's taxonomy is a time-consuming and very expensive task, due to the large amount of information that needs to be labeled and the need for an expert in the educational field to apply the taxonomy to the messages of the forums. In this paper we present a framework for automatically labeling messages from discussion forums with the categories of Bloom's taxonomy. Several models were created using three kinds of machine learning approaches: linear, rule-based and combined classifiers. The models are evaluated using accuracy, the F1-measure and the area under the ROC curve. Additionally, the statistical significance of the results is assessed using a McNemar test in order to validate them. The results show that the combination of a linear classifier with a rule-based classifier yields very good and promising results for this difficult task.

Proceedings Article
01 Jan 2013
TL;DR: This research aims at "bringing a given text to life" via an immersive environment where the user can freely explore the surroundings and increase their understanding of the given text, integrating natural language processing and the development of virtual immersive environments.
Abstract: This paper describes our research-in-progress, which integrates several domains, in particular natural language processing and the development of virtual immersive environments. Our research aims at "bringing a given text to life" via an immersive environment where the user can freely explore the surroundings and increase their understanding of the given text. We describe some important challenges in achieving this goal and outline our current research results. Our work is practically oriented, aiming at fulfilling some societal needs related to education, on which we also report.

10 Jul 2013
TL;DR: It is argued that this work is an important step towards automatically describing text with semantic labels that form a structured ontological representation of the content.
Abstract: We propose a novel structured learning framework for mapping natural language to spatial ontologies. The applied spatial ontology contains spatial concepts, relations and their multiple semantic types based on qualitative spatial calculi models. To achieve a tractable structured learning model, we propose an efficient inference approach based on constraint optimization techniques. In particular, we decompose the inference into subproblems, each of which is solved using LP-relaxation. This is done for both training-time and prediction-time inference. In this framework, ontology components are learnt while taking into account the ontological constraints and linguistic dependencies among components. In particular, we formulate complex relationships such as composed-of, is-a and mutual exclusivity during learning, while previous structured learning models in similar tasks do not go beyond hierarchical relationships. Our experimental results show that jointly learning the output components considering the above-mentioned constraints and relationships improves the results compared to ignoring them. The application of the proposed learning model for mapping to ontologies is not limited to the extraction of spatial semantics; it could be used to populate any ontology. We argue therefore that this work is an important step towards automatically describing text with semantic labels that form a structured ontological representation of the content.
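
A toy instance of the LP-relaxed inference: maximize the classifier's label scores subject to an is-a constraint and a mutual-exclusivity constraint; the labels, scores and constraints are invented for illustration.

    import numpy as np
    from scipy.optimize import linprog

    # Indicator variables x for three candidate labels of one phrase:
    # PLACE, REGION (a subtype of PLACE) and PATH. Hypothetical scores.
    scores = np.array([0.2, 0.9, 0.3])
    c = -scores                  # linprog minimizes, so negate to maximize
    A_ub = [
        [-1, 1, 0],              # is-a: x_REGION <= x_PLACE
        [0, 1, 1],               # mutual exclusivity: x_REGION + x_PATH <= 1
    ]
    b_ub = [0, 1]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 3)
    print(res.x)                 # relaxation is integral here: PLACE + REGION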

Journal ArticleDOI
TL;DR: This paper reports on naming faces in soap series using the weak supervision of fan-drafted narrative texts that describe the events in the video.

Proceedings Article
01 Jun 2013
TL;DR: A central feature of the proposed system is temporal parsing – an approach which identifies temporal relation arguments (event-event and event-timex pairs) and the semantic label of the relation as a single decision.
Abstract: This paper describes a system for temporal processing of text, which participated in the Temporal Evaluations 2013 campaign. The system employs a number of machine learning classifiers to perform the core tasks of: identification of time expressions and events, recognition of their attributes, and estimation of temporal links between recognized events and times. The central feature of the proposed system is temporal parsing – an approach which identifies temporal relation arguments (event-event and event-timex pairs) and the semantic label of the relation as a single decision.
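
The "single decision" can be emulated by classifying every candidate pair directly into a relation label or a NO-LINK class, so that argument identification and relation labeling collapse into one step; the features and label set below are illustrative placeholders.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Each candidate (event, event) or (event, timex) pair becomes one
    # instance; predicting a relation label or NO-LINK decides jointly
    # whether a link exists and what its semantic label is.
    pairs = [
        {"arg1_pos": "VBD", "arg2_type": "TIMEX", "same_sentence": True},
        {"arg1_pos": "VBD", "arg2_type": "EVENT", "same_sentence": False},
    ]
    labels = ["IS_INCLUDED", "NO-LINK"]       # toy label set

    clf = make_pipeline(DictVectorizer(), LogisticRegression())
    clf.fit(pairs, labels)
    print(clf.predict([{"arg1_pos": "VBN", "arg2_type": "TIMEX",
                        "same_sentence": True}]))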

Proceedings ArticleDOI
22 Oct 2013
TL;DR: An unsupervised framework for recognizing animals in videos using subtitles is proposed, based on an Expectation Maximization algorithm which is adapted to two very different circumstances: when bounding boxes are available and when the frame as a whole is used instead of bounding boxes.
Abstract: We propose an unsupervised framework for recognizing animals in videos using subtitles. In this framework, the alignment between animals and their names is performed using an Expectation Maximization algorithm which is adapted to two very different circumstances: (1) when bounding boxes are available and (2) when the frame as a whole is used instead of bounding boxes. With the goal of maximizing precision, recall and F-measure, the experiments compare a multitude of natural language processing approaches and visual features when associating animal names in the subtitles with visual patterns. The proposed unsupervised methods obtain 83.1% F1 using bounding boxes and 65.7% F1 without bounding boxes in a fully automated pipeline.
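
A minimal EM loop for the alignment step, in the spirit of a word-alignment model: each shot pairs the animal names in its subtitle with the visual clusters in its frames; all counts and data are toy assumptions.

    import numpy as np

    # Hypothetical shots: (names in the subtitle, visual clusters in frames).
    shots = [(["tiger"], [0, 1]),
             (["tiger", "deer"], [0, 2]),
             (["deer"], [2, 1])]
    names, n_clusters = ["tiger", "deer"], 3
    t = np.full((len(names), n_clusters), 1.0 / n_clusters)  # P(cluster | name)

    for _ in range(30):
        expected = np.zeros_like(t)
        for shot_names, clusters in shots:
            for v in clusters:
                # E-step: split cluster v's evidence over the candidate names.
                post = np.array([t[names.index(n), v] for n in shot_names])
                post /= post.sum()
                for n, p in zip(shot_names, post):
                    expected[names.index(n), v] += p
        # M-step: renormalize the expected counts into probabilities.
        t = expected / expected.sum(axis=1, keepdims=True)

    print(t.round(2))  # "tiger" gravitates to cluster 0, "deer" to cluster 2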

01 Jan 2013
TL;DR: This paper proposes several novel methods for summarization and expansion of facets using the ranked list of search results generated from a keyword search, coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes.
Abstract: Combination of a ‘keyword’ and a ‘faceted’ search has the potential to enhance user experience by providing a better arrangement of search results and aiding further search exploration. However, in such a framework, two key problems exist: 1) a given query may cover several facets, requiring an aggregation or summarization of the most relevant ones; 2) a query may cover too few facets, necessitating an expansion to include additional facets. In this paper, we propose several novel methods for summarization and expansion of facets. Using the ranked list of search results generated from a keyword search, coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes, we dynamically extract key facets. An evaluation of the different methods based on the relevance and diversity of the facets indicates that the Subtree density model performs best for both summarization and expansion.

Journal Article
TL;DR: This work proposes a new approach to improving named entity recognition (NER) in broadcast news speech data that is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags.
Abstract: We propose a new approach to improving named entity recognition (NER) in broadcast news speech data. The approach proceeds in two key steps: (1) we automatically detect document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web; (2) we employ term expansion techniques commonly used in information retrieval to recover named entities that were initially missed by the speech transcriber. We show that our method is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags. Consequently, our novel approach improves state-of-the-art NER results from speech data both in terms of recall and precision.
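
The two steps might be glued together as in this sketch: TF-IDF alignment of transcripts to web stories, then recovery of entities the transcriber missed; the data, the NER output on the stories and the string-matching rule are all illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    transcripts = ["the president met hu jintao in beijing today"]  # toy ASR output
    stories = ["President Hu Jintao arrived in Beijing for talks",
               "The local team won the cup final on Sunday"]
    story_entities = [["Hu Jintao", "Beijing"], ["Sunday"]]  # NER on clean text

    # Step 1: align each transcript with its most similar written story.
    vec = TfidfVectorizer().fit(transcripts + stories)
    sims = cosine_similarity(vec.transform(transcripts), vec.transform(stories))

    # Step 2: recover entities present in the transcript but left untagged.
    for t_id, s_id in enumerate(sims.argmax(axis=1)):
        recovered = [e for e in story_entities[s_id]
                     if e.lower() in transcripts[t_id]]
        print(recovered)  # ['Hu Jintao', 'Beijing']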

Book ChapterDOI
01 Jan 2013
TL;DR: The DAISY project especially focuses on paraphrasing and compression of Dutch sentences and on the rhetorical classification of content blocks and sentences in web pages; both the sentence compression and the rhetorical classification use an Integer Linear Programming optimization strategy.
Abstract: During the DAISY project we have developed essential technology for automatic summarisation of Dutch informative web pages. The project especially focuses on paraphrasing and compression of Dutch sentences, and on the rhetorical classification of content blocks and sentences in the web pages. For the paraphrasing and compression we rely on language models and syntactic constraints. In addition, the Alpino parser for Dutch was extended with a fluency component. Because the rhetorical role of a sentence depends on the roles of its surrounding sentences, we improve the rhetorical classification by finding a globally optimal assignment for all the sentences in a web page. Both the sentence compression and the rhetorical classification use an Integer Linear Programming optimization strategy.

Book ChapterDOI
07 Oct 2013
TL;DR: The authors propose a new approach to improving named entity recognition (NER) in broadcast news speech data by automatically detecting document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web.
Abstract: We propose a new approach to improving named entity recognition (NER) in broadcast news speech data. The approach proceeds in two key steps: (1) we automatically detect document alignments between highly similar speech documents and corresponding written news stories that are easily obtainable from the Web; (2) we employ term expansion techniques commonly used in information retrieval to recover named entities that were initially missed by the speech transcriber. We show that our method is able to find named entities missing in the transcribed speech data, and additionally to correct incorrectly assigned named entity tags. Consequently, our novel approach improves state-of-the-art NER results from speech data both in terms of recall and precision.

Proceedings ArticleDOI
15 May 2013
TL;DR: A novel method is proposed that uses the spatial distribution of relevant categories in a taxonomy, together with an 'importance' score obtained from a ranked list of search results, to derive a metric that represents the breadth of a query.
Abstract: This paper proposes a novel method to estimate the breadth of a search query. Using the spatial distribution of relevant categories in a taxonomy, together with an 'importance' score obtained using a ranked list of search results, we derive a metric that represents the breadth of a query. Several experiments have been performed on the DMOZ hierarchy with two different sets of queries and benchmarked against state-of-the-art results. Evaluation of the method based on metrics such as F-measure and accuracy indicates better agreement with human judgements.
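
For intuition only, a breadth score in this spirit can be computed as the entropy of a rank-weighted category distribution over the results; this is an illustrative stand-in, not the paper's actual metric.

    import math
    from collections import defaultdict

    def query_breadth(result_categories):
        # Entropy of a rank-weighted DMOZ-category distribution (illustrative).
        # `result_categories` lists each result's category, best-ranked first.
        weights = defaultdict(float)
        for rank, cat in enumerate(result_categories, start=1):
            weights[cat] += 1.0 / rank          # rank-based importance
        total = sum(weights.values())
        return -sum(w / total * math.log2(w / total) for w in weights.values())

    print(query_breadth(["Arts/Music"] * 8))                       # narrow: 0.0
    print(query_breadth(["Arts/Music", "Science/Biology",
                         "Sports/Soccer", "Home/Cooking"] * 2))    # broader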

01 Jan 2013
TL;DR: A novel method for the summarization and expansion of search facets is presented, in which the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes to dynamically extract key facets.
Abstract: We present a novel method for summarization and expansion of search facets. To dynamically extract key facets, the ranked list of search results generated from a keyword search is coupled with the spatial distribution of relevant documents in a hierarchical taxonomy of subject classes. An evaluation of the method based on the relevance and diversity of the produced facets indicates its effectiveness for both summarization and expansion.