Showing papers by "Marie-Francine Moens" published in 2010


Journal ArticleDOI
TL;DR: A number of novel features that are particularly well-suited to identify phishing emails are described, including statistical models for the low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos as well as indicators for hidden salting.
Abstract: Phishing emails usually contain a message from a credible-looking source requesting a user to click a link to a website where he or she is asked to enter a password or other confidential information. Most phishing emails aim at withdrawing money from financial institutions or getting access to private information. Phishing has increased enormously in recent years and is a serious threat to global security and the economy. There are a number of possible countermeasures to phishing, ranging from communication-oriented approaches such as authentication protocols, through blacklisting, to content-based filtering. We argue that the first two approaches are currently not broadly implemented or exhibit deficits; content-based phishing filters are therefore necessary and widely used to increase communication security. A number of features are extracted that capture the content and structural properties of the email. Subsequently, a statistical classifier is trained on these features using a training set of emails labeled as ham (legitimate), spam, or phishing. This classifier may then be applied to an email stream to estimate the classes of new incoming emails. In this paper we describe a number of novel features that are particularly well suited to identifying phishing emails. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators of hidden salting. Hidden salting is the intentional addition or distortion of content not perceivable by the reader. For empirical evaluation we have obtained a large realistic corpus of emails prelabeled as spam, phishing, and ham (legitimate). In experiments our methods outperform other published approaches for classifying phishing emails. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. Finally, we describe a strategy for how the filters may be updated and adapted to new types of phishing.
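
To make the pipeline concrete, here is a minimal sketch of a content-based email filter of the kind the abstract describes: a few hand-rolled content and structural features feed a statistical classifier. The features, keywords, and toy training emails are illustrative assumptions; the paper's actual features (topic models, link sequence analysis, logo detection, hidden-salting indicators) are far richer.

```python
# Sketch of a content-based phishing filter: extract features, train a classifier.
import re
from sklearn.linear_model import LogisticRegression

def extract_features(email_text):
    """A few toy content/structure features (illustrative, not the paper's)."""
    links = re.findall(r'https?://[^\s"<>]+', email_text)
    return [
        len(links),                                    # number of external links
        sum(('@' in u) or bool(re.search(r'\d+\.\d+\.\d+\.\d+', u))
            for u in links),                           # obfuscated or IP-based URLs
        int(bool(re.search(r'verify|password|account', email_text, re.I))),
        email_text.count('<img'),                      # embedded images / logos
    ]

train = [
    ("Dear user, please verify your account at http://192.0.2.1/login", 1),
    ("Hi Bob, lunch tomorrow? Menu at http://example.org/menu", 0),
    ("Your password expires! Renew at http://bank.example@evil.test", 1),
    ("Meeting notes attached, thanks for joining today.", 0),
]
X = [extract_features(text) for text, _ in train]
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

print(clf.predict([extract_features(
    "Please verify your password at http://192.0.2.7/secure")]))  # -> [1]
```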

137 citations


Proceedings Article
01 Jan 2010
TL;DR: The goal of this paper is to automatically transform text into simpler text, so that it is easier for children to understand; by including information from a language model in the lexical simplification step, better results are obtained than with a baseline method.

Abstract: The goal of this paper is to automatically transform text into simpler text, so that it is easier for children to understand. We perform syntactic simplification, i.e. the splitting of sentences, and lexical simplification, i.e. replacing difficult words with easier synonyms. We test the performance of this approach for each component separately on a per-sentence basis, and globally with the automatic construction of simplified news articles and encyclopedia articles. By including information from a language model in the lexical simplification step, we obtain better results than a baseline method. The syntactic simplification shows that some phenomena are difficult for a parser to recognize, and that errors are often introduced. Although the reading difficulty goes down, it still does not reach the level required for young children.
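
As a concrete illustration of the lexical simplification step, the sketch below replaces rare words with the synonym a language model scores highest. A toy unigram frequency table stands in for the paper's language model, and the synonym table and difficulty threshold are invented for the example.

```python
# Lexical simplification sketch: swap "difficult" (rare) words for frequent synonyms.
freq = {"buy": 900, "purchase": 120, "acquire": 60,
        "help": 800, "assist": 150, "big": 700, "substantial": 90}
synonyms = {"purchase": ["buy", "acquire"], "assist": ["help"],
            "substantial": ["big"]}
DIFFICULTY_THRESHOLD = 200   # words rarer than this count as "difficult" (assumed)

def simplify(sentence):
    out = []
    for word in sentence.split():
        if freq.get(word, 0) < DIFFICULTY_THRESHOLD and word in synonyms:
            # pick the candidate the (unigram) model considers most likely
            best = max(synonyms[word], key=lambda w: freq.get(w, 0))
            if freq.get(best, 0) >= DIFFICULTY_THRESHOLD:
                word = best
        out.append(word)
    return " ".join(out)

print(simplify("they purchase a substantial house"))
# -> "they buy a big house"
```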

136 citations


Book ChapterDOI
01 Jan 2010
TL;DR: It is shown how a Context-Free Grammar can be used to extract arguments, and how ontologies and Natural Language Processing can identify complex information such as case factors and participant roles.
Abstract: This paper describes recent approaches using text-mining to automatically profile and extract arguments from legal cases. We outline some of the background context and motivations. We then turn to consider issues related to the construction and composition of corpora of legal cases. We show how a Context-Free Grammar can be used to extract arguments, and how ontologies and Natural Language Processing can identify complex information such as case factors and participant roles. Together the results bring us closer to automatic identification of legal arguments.
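
The following toy sketch (using NLTK) illustrates the idea of recognizing argument structure with a context-free grammar: sentences are abstracted to 'premise'/'conclusion' labels and parsed into an argument tree. The grammar and label sequence are invented for illustration; the chapter's grammar is defined over much richer annotations of legal text.

```python
# Toy CFG over sentence-level argument labels, parsed with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
  ARG  -> PREM CONC | PREM ARG
  PREM -> 'premise'
  CONC -> 'conclusion'
""")
parser = nltk.ChartParser(grammar)

# A legal argument flattened to a sequence of sentence-level labels.
sentence_labels = ['premise', 'premise', 'conclusion']
for tree in parser.parse(sentence_labels):
    tree.pretty_print()
```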

130 citations


Proceedings Article
01 May 2010
TL;DR: This paper introduces the task of spatial role labeling and proposes an annotation scheme that is language-independent and facilitates the application of machine learning techniques.
Abstract: One of the essential functions of natural language is to talk about spatial relationships between objects. Linguistic constructs can express highly complex, relational structures of objects, spatial relations between them, and patterns of motion through spaces relative to some reference point. Learning how to map this information onto a formal representation from a text is a challenging problem. At present no well-defined framework for automatic spatial information extraction exists that can handle all of these issues. In this paper we introduce the task of spatial role labeling and propose an annotation scheme that is language-independent and facilitates the application of machine learning techniques. Our framework consists of a set of spatial roles based on the theory of holistic spatial semantics with the intent of covering all aspects of spatial concepts, including both static and dynamic spatial relations. We illustrate our annotation scheme with many examples throughout the paper, and in addition we highlight how to connect to spatial calculi such as region connection calculus and also how our approach fits into related work.
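
A minimal sketch of the kind of language-independent annotation such a scheme produces, with one record per spatial expression. The field names follow the holistic-spatial-semantics terminology used in the paper, but the exact tag set shown is an assumption for illustration.

```python
# One annotation record per spatial expression: roles plus a coarse relation type.
from dataclasses import dataclass

@dataclass
class SpatialRelation:
    trajector: str           # object whose location or motion is described
    landmark: str            # reference object
    spatial_indicator: str   # trigger word, typically a preposition
    relation_type: str       # e.g. "region" (static) vs. "path" (dynamic)

# "The book is on the table."
print(SpatialRelation(trajector="book", landmark="table",
                      spatial_indicator="on", relation_type="region"))
```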

75 citations


Journal ArticleDOI
TL;DR: The results are competitive with state-of-the-art performance on the "Labeled Faces in the Wild" dataset in terms of recall, include excellent precision values, and show the value of text and image analysis for estimating the probability of being pictured or named in the alignment process.

Abstract: In this paper we report on our experiments on aligning names and faces as found in images and captions on online news websites. Developing accurate technologies for linking names and faces is valuable when retrieving or mining information from multimedia collections. We perform exhaustive and systematic experiments exploiting the (a)symmetry between the visual and textual modalities. This leads to different schemes for assigning names to faces, assigning faces to names, and establishing name-face link pairs. On top of that, we investigate generic approaches to the use of textual and visual structural information to predict the presence of the corresponding entity in the other modality. The proposed methods are completely unsupervised and are inspired by methods for aligning phrases and words in texts of different languages, developed for constructing dictionaries for machine translation. The results are competitive with state-of-the-art performance on the "Labeled Faces in the Wild" dataset in terms of recall values (now reported on the complete dataset), include excellent precision values, and show the value of text and image analysis for estimating the probability of being pictured or named in the alignment process.
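
A very small sketch of unsupervised name-face alignment by co-occurrence, loosely in the spirit of the word-alignment methods the abstract cites as inspiration: faces are assumed pre-clustered into IDs, and each face cluster is assigned the name whose normalized co-occurrence with it is highest. The data and scoring rule are illustrative assumptions, not the paper's models.

```python
# Unsupervised name-face alignment via normalized co-occurrence counts.
from collections import Counter
from itertools import product

# (names in the caption, face-cluster ids detected in the image) -- toy data
documents = [
    ({"Obama", "Merkel"}, {"f1", "f2"}),
    ({"Obama"}, {"f1"}),
    ({"Merkel"}, {"f2"}),
]

cooc, name_count = Counter(), Counter()
for names, faces in documents:
    for name, face in product(names, faces):
        cooc[(name, face)] += 1
    name_count.update(names)

def best_name(face):
    """Assign the name with the highest normalized co-occurrence."""
    return max(name_count, key=lambda n: cooc[(n, face)] / name_count[n])

for face in ("f1", "f2"):
    print(face, "->", best_name(face))   # f1 -> Obama, f2 -> Merkel
```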

47 citations


Proceedings Article
15 Jul 2010
TL;DR: A system for the recognition and normalization of temporal expressions (Task 13: TempEval-2, Task A), in which recognition is approached as a classification problem over sentence constituents and normalization is implemented in a rule-based manner.

Abstract: In this paper we describe a system for the recognition and normalization of temporal expressions (Task 13: TempEval-2, Task A). The recognition task is approached as a classification problem over sentence constituents, and the normalization is implemented in a rule-based manner. A distinctive feature of the system is the extension of positive annotations in the corpus with semantically similar words automatically obtained from a large unannotated textual corpus. The best results obtained by the system are 0.85 precision and 0.84 recall for the recognition of temporal expressions; accuracy values of 0.91 and 0.55 were obtained for the attributes type and val, respectively.
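
Below is a rule-based normalization sketch for a few temporal expression patterns, echoing the Task A setup (the ML-based recognition step is omitted). The rules, the fixed document creation time, and the handled patterns are a tiny illustrative subset.

```python
# Rule-based normalization of a few temporal expressions to ISO dates.
import re
from datetime import date, timedelta

DCT = date(2010, 7, 15)  # document creation time (assumed)
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def normalize(expr):
    expr = expr.lower()
    if expr == "today":
        return DCT.isoformat()
    if expr == "yesterday":
        return (DCT - timedelta(days=1)).isoformat()
    m = re.fullmatch(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})", expr)
    if m:
        return date(int(m.group(3)), MONTHS.index(m.group(2)) + 1,
                    int(m.group(1))).isoformat()
    return "UNKNOWN"

print(normalize("yesterday"))      # -> 2010-07-14
print(normalize("15 July 2010"))   # -> 2010-07-15
```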

27 citations


Proceedings ArticleDOI
26 Oct 2010
TL;DR: AgeRank, a link-based algorithm that ranks web pages according to their appropriateness for young audiences, is designed and shown to be accurate in page labeling, broad in page coverage, and highly promising for improving children's search.

Abstract: Though children frequently use web search engines to learn, interact, and be entertained, modern web search engines are poorly suited to children's needs, requiring relatively complex querying and filtering of results in order to find pages oriented to young audiences. To address this limitation, we designed AgeRank, a link-based algorithm that ranks web pages according to their appropriateness for young audiences. We show its effectiveness through a multipart evaluation that demonstrates AgeRank to be accurate in page labeling, broad in page coverage, and highly promising for improving children's search. As a fast, scalable, and effective algorithm, AgeRank can be adopted by search engines seeking to address the needs of young users more effectively, or easily combined with complementary machine-learning-based classification approaches.
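
Since the abstract does not spell out the update rule, the sketch below shows a generic personalized-PageRank-style propagation of child-suitability from seed pages over a toy link graph, which conveys the flavor of a link-based age ranking. The damping factor, seed, and graph are all assumptions, not AgeRank's published definition.

```python
# Personalized-PageRank-style propagation of "child-suitability" over a link graph.
graph = {                           # page -> outgoing links (toy web graph)
    "kids-portal": ["games", "homework-help"],
    "games": ["kids-portal"],
    "homework-help": ["games", "news"],
    "news": ["homework-help"],
}
seed = {page: 0.0 for page in graph}
seed["kids-portal"] = 1.0           # a page known to be child-oriented
score, d = dict(seed), 0.85         # damping factor (assumed)

for _ in range(50):                 # power iteration until (near) convergence
    score = {page: (1 - d) * seed[page]
             + d * sum(score[src] / len(graph[src])
                       for src, outs in graph.items() if page in outs)
             for page in graph}

for page, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {s:.3f}")
```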

26 citations


Proceedings ArticleDOI
19 Jul 2010
TL;DR: A face naming method is developed that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs, and yields better results than a Support Vector Machine classifier trained on the same labeled data.

Abstract: Labeling persons appearing in video frames with names detected from the video transcript helps improve video content identification and search. We develop a face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs. The advantage of this method is that it can use very few labeled data points and incorporate the unlabeled data points during the learning process. Anchor detection and metric learning for face classification are incorporated into the label propagation process to help boost the face naming performance. On BBC News videos, the label propagation algorithm yields better results than a Support Vector Machine classifier trained on the same labeled data.
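
Here is a minimal numeric sketch of iterative label propagation in the spirit of the abstract: a handful of faces, two of them labeled, a Gaussian-kernel similarity graph, and repeated propagate-then-clamp updates. The face vectors and kernel width are toy assumptions; the paper additionally learns a face metric and handles anchors.

```python
# Iterative label propagation on a face-similarity graph.
import numpy as np

X = np.array([[0.0, 0.1], [0.1, 0.0],      # faces of person A (toy vectors)
              [1.0, 0.9], [0.9, 1.0]])     # faces of person B
labels = {0: "Alice", 2: "Bob"}            # the few labeled examples

# similarity graph: Gaussian kernel, row-normalized into transition matrix P
W = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2 / 0.5)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)

names = sorted(set(labels.values()))
F = np.zeros((len(X), len(names)))
for i, name in labels.items():
    F[i, names.index(name)] = 1.0

for _ in range(50):                        # propagate, then clamp labeled rows
    F = P @ F
    for i, name in labels.items():
        F[i] = 0.0
        F[i, names.index(name)] = 1.0

for i in range(len(X)):
    print(f"face {i} -> {names[int(F[i].argmax())]}")
```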

24 citations


15 Aug 2010
TL;DR: This work considers mapping unrestricted natural language to formal spatial representations and describes ongoing work on a two-level machine learning approach, whose first level deals with the extraction of spatial information from natural language sentences and is called spatial role labeling.

Abstract: We consider mapping unrestricted natural language to formal spatial representations, and describe ongoing work on a two-level machine learning approach. The first level is linguistic: it deals with the extraction of spatial information from natural language sentences and is called spatial role labeling. The second level is ontological in nature and deals with mapping this linguistic spatial information to formal spatial calculi. Our main obstacles are the lack of annotated data for training machine learning algorithms for these tasks, and the difficulty of selecting an appropriate abstraction level for the spatial information. For the linguistic part, we approach the problem in a gradual way: we make use of existing resources such as The Preposition Project (TPP) and the validation data of the General Upper Model (GUM) ontology, and we show some computational results. For the ontological part, we describe the machine learning challenges and discuss our proposed approach.
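
As a sketch of the second, ontological level, the snippet below maps a spatial indicator to an RCC-8-style relation through a lookup table. The mapping shown is an illustrative assumption: choosing the right abstraction level for such mappings is precisely the difficulty the paper discusses.

```python
# Illustrative preposition -> RCC-8 mapping (an assumption; the appropriate
# abstraction level is itself a research question in the paper).
RCC8 = {
    "in": "NTPP",        # non-tangential proper part
    "inside": "NTPP",
    "on": "EC",          # externally connected (surface contact)
    "touching": "EC",
    "overlapping": "PO", # partial overlap
}

def to_rcc(spatial_indicator):
    return RCC8.get(spatial_indicator, "DC")   # default: disconnected

trajector, indicator, landmark = "book", "on", "table"
print(f"{to_rcc(indicator)}({trajector}, {landmark})")   # -> EC(book, table)
```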

19 citations


Journal ArticleDOI
TL;DR: This work suggests that it is possible to train robust story segmenters for news video using only a handful of broadcasts, provided a good initial feature selection is made.
Abstract: In this paper, we describe an approach to segmenting news video based on the perceived shift in content, using features spanning multiple modalities. We investigate a number of multimedia features, which serve as potential indicators of a change in story, in order to determine which are the most effective. The efficacy of our approach is demonstrated by the performance of our prototype, where a number of feature combinations yield up to an 18% improvement in WindowDiff score compared to other state-of-the-art story segmenters. In our investigation there is no single clearly superior feature; rather, the best segmentation occurs when there is synergy between multiple features. A further investigation into the effect on segmentation performance of varying the number of training examples versus the number of features used reveals that having better feature combinations is more important than having more training examples. Our work suggests that it is possible to train robust story segmenters for news video using only a handful of broadcasts, provided a good initial feature selection is made.
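
For reference, the WindowDiff metric mentioned above (Pevzner and Hearst, 2002) can be computed as below: slide a window over the reference and hypothesis boundary sequences and count windows where the number of boundaries disagrees. Index conventions vary slightly across implementations; this is one common array-based formulation.

```python
# WindowDiff: fraction of sliding windows whose boundary counts disagree.
def window_diff(reference, hypothesis, k):
    """reference/hypothesis: lists of 0/1 boundary indicators between units."""
    windows = range(len(reference) - k)
    errors = sum(sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
                 for i in windows)
    return errors / len(windows)

ref = [0, 1, 0, 0, 1, 0, 0, 0]   # reference story boundaries
hyp = [0, 0, 1, 0, 1, 0, 0, 0]   # hypothesis: first boundary off by one
print(f"WindowDiff = {window_diff(ref, hyp, k=3):.3f}")   # -> 0.200
```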

19 citations


Patent
18 Nov 2010
TL;DR: The Latent Words Language Model (LWLM) as mentioned in this paper automatically determines context-dependent word distributions (called hidden or latent words) for each word of a text, which reflect the probability that another word of the vocabulary of a language would occur at that position in the text.
Abstract: Described is a method, the Latent Words Language Model (LWLM), that automatically determines context-dependent word distributions (called hidden or latent words) for each word of a text. The probabilistic word distributions reflect the probability that another word of the vocabulary of a language would occur at that position in the text. Furthermore, a method is described for using these word distributions in statistical language processing applications, such as information extraction (for example, semantic role labeling and named entity recognition), automatic machine translation, textual entailment, paraphrasing, information retrieval, and speech recognition.
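
An illustrative sketch of the core idea: for each token position, estimate a distribution over vocabulary words that could plausibly occur there given the context. A toy bigram count table stands in for the model the patent actually describes, which is learned by Bayesian inference over much larger contexts.

```python
# Toy "latent words": P(w | left, right) from bigram counts on a tiny corpus.
from collections import defaultdict

bigrams = defaultdict(int)
corpus = "the cat sat on the mat the dog sat on the rug".split()
for a, b in zip(corpus, corpus[1:]):
    bigrams[(a, b)] += 1
vocab = sorted(set(corpus))

def latent_words(left, right):
    """P(w | left, right) proportional to count(left, w) * count(w, right)."""
    scores = {w: bigrams[(left, w)] * bigrams[(w, right)] for w in vocab}
    total = sum(scores.values()) or 1
    return {w: c / total for w, c in scores.items() if c}

# latent words for the position of "cat" in "the cat sat"
print(latent_words("the", "sat"))   # -> {'cat': 0.5, 'dog': 0.5}
```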

Proceedings ArticleDOI
19 Jul 2010
TL;DR: This work presents a simple and effective approach to complement search results for children's web queries with child-oriented multimedia results, such as coloring pages and music sheets, and shows its effectiveness through an online user evaluation.
Abstract: We present a simple and effective approach to complement search results for children's web queries with child-oriented multimedia results, such as coloring pages and music sheets. Our approach determines appropriate media types for a query by searching Google's database of frequent queries for co-occurrences of a query's terms (e.g., "dinosaurs") with preselected multimedia terms (e.g., "coloring pages"). We show the effectiveness of this approach through an online user evaluation.
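
Below is a sketch of the co-occurrence test the abstract describes: a media type is suggested for a query when the query's terms co-occur often enough with a preselected multimedia term among frequent queries. The toy query log and threshold are stand-ins for Google's query database and the paper's actual criteria.

```python
# Suggest media types whose terms co-occur with the query in frequent queries.
frequent_queries = [
    "dinosaurs coloring pages", "dinosaurs facts", "dinosaurs games",
    "beethoven music sheets", "dinosaurs coloring pages printable",
]
MEDIA_TERMS = ["coloring pages", "music sheets"]
MIN_COOCCURRENCES = 2   # assumed threshold

def media_types_for(query):
    return [m for m in MEDIA_TERMS
            if sum(query in q and m in q for q in frequent_queries)
               >= MIN_COOCCURRENCES]

print(media_types_for("dinosaurs"))   # -> ['coloring pages']
```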

Book ChapterDOI
08 Sep 2010
TL;DR: The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and robustly retains its discriminative value over time.
Abstract: This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples while pushing away negative ones. This method is an extension of Linear Discriminant Analysis (LDA), but introduces a different transformation to improve the separation between classes, and it has not previously been applied to text mining tasks. We successfully test BDA under two schemes. The first is a traditional classification scenario using 10-fold cross-validation on four ground-truth standard corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second scheme, we test the anticipatory properties of the statistical features on the TREC 2007 spam corpus. The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and robustly retains its discriminative value over time.
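
Here is a small numeric sketch of BDA as commonly formulated: maximize the scatter of negative examples around the positive centroid relative to the scatter of the positive examples, by solving a generalized eigenproblem. The toy data, regularization, and this particular formulation are assumptions for illustration, not the paper's exact procedure.

```python
# BDA sketch: generalized eigenproblem on biased scatter matrices.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
pos = rng.normal([0, 0], 0.3, size=(20, 2))   # positive class, e.g. spam (toy)
neg = rng.normal([2, 1], 1.0, size=(20, 2))   # negative class, e.g. ham (toy)

mu_pos = pos.mean(axis=0)
S_pos = (pos - mu_pos).T @ (pos - mu_pos)     # positive within-class scatter
S_neg = (neg - mu_pos).T @ (neg - mu_pos)     # negatives around positive mean
S_pos += 1e-6 * np.eye(2)                     # regularize (assumed)

# generalized eigenproblem S_neg w = lambda * S_pos w; take the top eigenvector
eigvals, eigvecs = eigh(S_neg, S_pos)
w = eigvecs[:, -1]                            # direction of best separation

print("projection of positive mean:", float(mu_pos @ w))
print("projection of negative mean:", float(neg.mean(axis=0) @ w))
```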

Journal ArticleDOI
TL;DR: A method to detect portions of a digital text source that are invisible to the end user when rendered on a visual medium, with an assessment of its effectiveness in a spam filtering task.

Abstract: Hidden salting in digital media involves the intentional addition or distortion of content patterns with the purpose of circumventing content filtering. We propose a method to detect portions of a digital text source which are invisible to the end user when they are rendered on a visual medium (such as a computer monitor). The method consists of “tapping” into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) that will be invisible to a human reader, using criteria based on text and background colors, font size, overlapping characters, etc. Moreover, the text deemed visible (covertext) is reconstructed from the rendering commands, and the character reading order is then identified, which may differ from the rendering order. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of this method in a spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.
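
Below is a sketch of the visibility test at the core of such a method: given simplified rendering commands, flag text a human reader could not see. The command format, color-distance test, and thresholds are assumptions; the real system taps an actual rendering process and also reconstructs reading order.

```python
# Flag rendered text that is invisible (tiny font, or text color ~ background).
MIN_VISIBLE_FONT_PT = 4     # assumed threshold
MAX_COLOR_DISTANCE = 30     # text close to background color => invisible (assumed)

def color_distance(c1, c2):
    return sum(abs(a - b) for a, b in zip(c1, c2))

def is_hidden(cmd, background=(255, 255, 255)):
    if cmd["font_size"] < MIN_VISIBLE_FONT_PT:
        return True
    return color_distance(cmd["color"], background) <= MAX_COLOR_DISTANCE

commands = [
    {"text": "Cheap meds now!", "color": (0, 0, 0), "font_size": 12},
    {"text": "harmless padding words", "color": (250, 250, 250), "font_size": 12},
    {"text": "more salting text", "color": (0, 0, 0), "font_size": 1},
]
covertext = " ".join(c["text"] for c in commands if not is_hidden(c))
salting = " ".join(c["text"] for c in commands if is_hidden(c))
print("visible:", covertext)
print("hidden salting:", salting)
```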

Book ChapterDOI
21 Mar 2010
TL;DR: This paper uses an integer linear programming (ILP) approach to compress sentences from news articles in Dutch and Flemish newspapers, relying on the Alpino parser for Dutch and the Latent Words Language Model.

Abstract: Sentence compression is a valuable task in the framework of text summarization. In this paper we compress sentences, written in Dutch, from news articles in Dutch and Flemish newspapers using an integer linear programming approach. We rely on the Alpino parser available for Dutch and on the Latent Words Language Model. We demonstrate that the integer linear programming approach yields good results for compressing Dutch sentences, despite the large freedom in word order.
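
Below is a tiny sentence-compression ILP in the spirit of the paper, using the PuLP solver: one binary variable per token, an importance-maximizing objective, a length budget, and dependency constraints that keep a modifier only if its head is kept. The token scores, toy dependency arcs, and budget are illustrative assumptions (the paper scores Dutch sentences using Alpino parses and the Latent Words Language Model).

```python
# ILP sentence compression: keep/drop tokens under budget and dependency constraints.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, PULP_CBC_CMD

tokens = ["the", "tired", "dog", "slept", "very", "soundly"]
score = [0.2, 0.4, 1.0, 1.0, 0.3, 0.6]   # assumed token importance scores
head = {0: 2, 1: 2, 4: 5, 5: 3}          # modifier -> head (toy dependency arcs)
BUDGET = 4                               # keep at most 4 tokens

x = [LpVariable(f"x{i}", cat="Binary") for i in range(len(tokens))]
prob = LpProblem("compression", LpMaximize)
prob += lpSum(score[i] * x[i] for i in range(len(tokens)))   # objective
prob += lpSum(x) <= BUDGET                                   # length budget
for mod, hd in head.items():             # keep a modifier only if its head is kept
    prob += x[mod] <= x[hd]

prob.solve(PULP_CBC_CMD(msg=False))
print(" ".join(t for t, v in zip(tokens, x) if v.value() == 1))
# -> "tired dog slept soundly"
```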

DOI
28 Apr 2010
TL;DR: A face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs, complemented by an unsupervised model for naming anchor persons in the news.

Abstract: In this paper we report our experiments on assigning person names to faces as found in video frames and transcripts of news broadcasts. We develop a face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs. The advantage of this method is that it can use very few labeled data points and incorporate the unlabeled data points during the learning process. The label propagation algorithm yields better results than a Support Vector Machine classifier trained on the same labeled data. We improve the face labeling performance by learning and using a similarity metric for comparing faces. Anchors may be problematic, since their names are typically mentioned only once, at the very beginning of the news broadcast, while their faces occur quite frequently. If the name-face pairs corresponding to the anchors can be identified separately, the accuracy of the overall alignment can be boosted. Hence, we develop an unsupervised model for naming anchor persons in the news.
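
Here is a sketch of one plausible unsupervised anchor-detection heuristic consistent with the abstract's observation: anchors recur across the whole broadcast while other faces are confined to single stories, so the face cluster spanning the most story segments is taken as the anchor. The heuristic and toy data are assumptions, not necessarily the paper's model.

```python
# Unsupervised anchor detection: the face cluster with the widest temporal spread.
from collections import defaultdict

# (face cluster id, story segment index) detections across one broadcast (toy)
detections = [("f1", 0), ("f2", 0), ("f1", 1), ("f3", 1),
              ("f1", 2), ("f4", 3), ("f1", 3)]

segments = defaultdict(set)
for face, seg in detections:
    segments[face].add(seg)

anchor = max(segments, key=lambda f: len(segments[f]))
print("anchor face cluster:", anchor)   # -> f1 (appears in all 4 segments)
```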
