Showing papers by "Marie-Francine Moens" published in 2010


Journal ArticleDOI
TL;DR: A number of novel features that are particularly well-suited to identify phishing emails are described, including statistical models for the low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos as well as indicators for hidden salting.
Abstract: Phishing emails usually contain a message from a credible-looking source requesting a user to click a link to a website where he or she is asked to enter a password or other confidential information. Most phishing emails aim at withdrawing money from financial institutions or getting access to private information. Phishing has increased enormously in recent years and is a serious threat to global security and the economy. There are a number of possible countermeasures to phishing, ranging from communication-oriented approaches such as authentication protocols, through blacklisting, to content-based filtering. We argue that the first two approaches are currently not broadly implemented or exhibit deficits; content-based phishing filters are therefore necessary and widely used to increase communication security. A number of features are extracted that capture the content and structural properties of the email. Subsequently, a statistical classifier is trained on these features using a training set of emails labeled as ham (legitimate), spam, or phishing. This classifier may then be applied to an email stream to estimate the classes of new incoming emails. In this paper we describe a number of novel features that are particularly well suited to identifying phishing emails. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators of hidden salting. Hidden salting is the intentional addition or distortion of content not perceivable by the reader. For empirical evaluation we have obtained a large realistic corpus of emails prelabeled as spam, phishing, and ham (legitimate). In experiments our methods outperform other published approaches for classifying phishing emails. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. Finally, we describe a strategy for how the filters may be updated and adapted to new types of phishing.
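
To make the pipeline concrete, here is a minimal sketch of a content-based email filter of the kind the abstract describes: a few hand-rolled content and structural features feed a statistical classifier. The features, keywords, and toy training emails are illustrative assumptions; the paper's actual features (topic models, link sequence analysis, logo detection, hidden-salting indicators) are far richer.

```python
# Sketch of a content-based phishing filter: extract features, train a classifier.
import re
from sklearn.linear_model import LogisticRegression

def extract_features(email_text):
    """A few toy content/structure features (illustrative, not the paper's)."""
    links = re.findall(r'https?://[^\s"<>]+', email_text)
    return [
        len(links),                                    # number of external links
        sum(('@' in u) or bool(re.search(r'\d+\.\d+\.\d+\.\d+', u))
            for u in links),                           # obfuscated or IP-based URLs
        int(bool(re.search(r'verify|password|account', email_text, re.I))),
        email_text.count('<img'),                      # embedded images / logos
    ]

train = [
    ("Dear user, please verify your account at http://192.0.2.1/login", 1),
    ("Hi Bob, lunch tomorrow? Menu at http://example.org/menu", 0),
    ("Your password expires! Renew at http://bank.example@evil.test", 1),
    ("Meeting notes attached, thanks for joining today.", 0),
]
X = [extract_features(text) for text, _ in train]
y = [label for _, label in train]
clf = LogisticRegression().fit(X, y)

print(clf.predict([extract_features(
    "Please verify your password at http://192.0.2.7/secure")]))  # -> [1]
```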

137 citations


Proceedings Article
01 Jan 2010
TL;DR: The goal of this paper is to automatically transform text into simpler text, so that it is easier for children to understand; by including information from a language model in the lexical simplification step, better results are obtained than with a baseline method.

Abstract: The goal of this paper is to automatically transform text into simpler text, so that it is easier for children to understand. We perform syntactic simplification, i.e. the splitting of sentences, and lexical simplification, i.e. replacing difficult words with easier synonyms. We test the performance of this approach for each component separately on a per-sentence basis, and globally with the automatic construction of simplified news articles and encyclopedia articles. By including information from a language model in the lexical simplification step, we obtain better results than a baseline method. The syntactic simplification shows that some phenomena are difficult for a parser to recognize, and that errors are often introduced. Although the reading difficulty goes down, it still does not reach the level required for young children.
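
As a concrete illustration of the lexical simplification step, the sketch below replaces rare words with the synonym a language model scores highest. A toy unigram frequency table stands in for the paper's language model, and the synonym table and difficulty threshold are invented for the example.

```python
# Lexical simplification sketch: swap "difficult" (rare) words for frequent synonyms.
freq = {"buy": 900, "purchase": 120, "acquire": 60,
        "help": 800, "assist": 150, "big": 700, "substantial": 90}
synonyms = {"purchase": ["buy", "acquire"], "assist": ["help"],
            "substantial": ["big"]}
DIFFICULTY_THRESHOLD = 200   # words rarer than this count as "difficult" (assumed)

def simplify(sentence):
    out = []
    for word in sentence.split():
        if freq.get(word, 0) < DIFFICULTY_THRESHOLD and word in synonyms:
            # pick the candidate the (unigram) model considers most likely
            best = max(synonyms[word], key=lambda w: freq.get(w, 0))
            if freq.get(best, 0) >= DIFFICULTY_THRESHOLD:
                word = best
        out.append(word)
    return " ".join(out)

print(simplify("they purchase a substantial house"))
# -> "they buy a big house"
```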

136 citations


Book ChapterDOI
01 Jan 2010
TL;DR: It is shown how a Context-Free Grammar can be used to extract arguments, and how ontologies and Natural Language Processing can identify complex information such as case factors and participant roles.
Abstract: This paper describes recent approaches using text-mining to automatically profile and extract arguments from legal cases. We outline some of the background context and motivations. We then turn to consider issues related to the construction and composition of corpora of legal cases. We show how a Context-Free Grammar can be used to extract arguments, and how ontologies and Natural Language Processing can identify complex information such as case factors and participant roles. Together the results bring us closer to automatic identification of legal arguments.
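
The following toy sketch (using NLTK) illustrates the idea of recognizing argument structure with a context-free grammar: sentences are abstracted to 'premise'/'conclusion' labels and parsed into an argument tree. The grammar and label sequence are invented for illustration; the chapter's grammar is defined over much richer annotations of legal text.

```python
# Toy CFG over sentence-level argument labels, parsed with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
  ARG  -> PREM CONC | PREM ARG
  PREM -> 'premise'
  CONC -> 'conclusion'
""")
parser = nltk.ChartParser(grammar)

# A legal argument flattened to a sequence of sentence-level labels.
sentence_labels = ['premise', 'premise', 'conclusion']
for tree in parser.parse(sentence_labels):
    tree.pretty_print()
```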

130 citations


Proceedings Article
01 May 2010
TL;DR: This paper introduces the task of spatial role labeling and proposes an annotation scheme that is language-independent and facilitates the application of machine learning techniques.
Abstract: One of the essential functions of natural language is to talk about spatial relationships between objects. Linguistic constructs can express highly complex, relational structures of objects, spatial relations between them, and patterns of motion through spaces relative to some reference point. Learning how to map this information onto a formal representation from a text is a challenging problem. At present no well-defined framework for automatic spatial information extraction exists that can handle all of these issues. In this paper we introduce the task of spatial role labeling and propose an annotation scheme that is language-independent and facilitates the application of machine learning techniques. Our framework consists of a set of spatial roles based on the theory of holistic spatial semantics with the intent of covering all aspects of spatial concepts, including both static and dynamic spatial relations. We illustrate our annotation scheme with many examples throughout the paper, and in addition we highlight how to connect to spatial calculi such as region connection calculus and also how our approach fits into related work.
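
A minimal sketch of the kind of language-independent annotation such a scheme produces, with one record per spatial expression. The field names follow the holistic-spatial-semantics terminology used in the paper, but the exact tag set shown is an assumption for illustration.

```python
# One annotation record per spatial expression: roles plus a coarse relation type.
from dataclasses import dataclass

@dataclass
class SpatialRelation:
    trajector: str           # object whose location or motion is described
    landmark: str            # reference object
    spatial_indicator: str   # trigger word, typically a preposition
    relation_type: str       # e.g. "region" (static) vs. "path" (dynamic)

# "The book is on the table."
print(SpatialRelation(trajector="book", landmark="table",
                      spatial_indicator="on", relation_type="region"))
```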

75 citations


Journal ArticleDOI
TL;DR: The results are competitive with state-of-the-art performance on the "Labeled Faces in the Wild" dataset in terms of recall, include excellent precision values, and show the value of text and image analysis for estimating the probability of being pictured or named in the alignment process.

Abstract: In this paper we report on our experiments on aligning names and faces as found in images and captions on online news websites. Developing accurate technologies for linking names and faces is valuable when retrieving or mining information from multimedia collections. We perform exhaustive and systematic experiments exploiting the (a)symmetry between the visual and textual modalities. This leads to different schemes for assigning names to faces, assigning faces to names, and establishing name-face link pairs. On top of that, we investigate generic approaches to the use of textual and visual structural information to predict the presence of the corresponding entity in the other modality. The proposed methods are completely unsupervised and are inspired by methods for aligning phrases and words in texts of different languages, developed for constructing dictionaries for machine translation. The results are competitive with state-of-the-art performance on the "Labeled Faces in the Wild" dataset in terms of recall values (now reported on the complete dataset), include excellent precision values, and show the value of text and image analysis for estimating the probability of being pictured or named in the alignment process.
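
A very small sketch of unsupervised name-face alignment by co-occurrence, loosely in the spirit of the word-alignment methods the abstract cites as inspiration: faces are assumed pre-clustered into IDs, and each face cluster is assigned the name whose normalized co-occurrence with it is highest. The data and scoring rule are illustrative assumptions, not the paper's models.

```python
# Unsupervised name-face alignment via normalized co-occurrence counts.
from collections import Counter
from itertools import product

# (names in the caption, face-cluster ids detected in the image) -- toy data
documents = [
    ({"Obama", "Merkel"}, {"f1", "f2"}),
    ({"Obama"}, {"f1"}),
    ({"Merkel"}, {"f2"}),
]

cooc, name_count = Counter(), Counter()
for names, faces in documents:
    for name, face in product(names, faces):
        cooc[(name, face)] += 1
    name_count.update(names)

def best_name(face):
    """Assign the name with the highest normalized co-occurrence."""
    return max(name_count, key=lambda n: cooc[(n, face)] / name_count[n])

for face in ("f1", "f2"):
    print(face, "->", best_name(face))   # f1 -> Obama, f2 -> Merkel
```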

47 citations


Proceedings Article
15 Jul 2010
TL;DR: A system for the recognition and normalization of temporal expressions (Task 13: TempEval-2, Task A), in which recognition is approached as a classification problem over sentence constituents and normalization is implemented in a rule-based manner.

Abstract: In this paper we describe a system for the recognition and normalization of temporal expressions (Task 13: TempEval-2, Task A). The recognition task is approached as a classification problem over sentence constituents, and the normalization is implemented in a rule-based manner. A distinctive feature of the system is the extension of positive annotations in the corpus with semantically similar words automatically obtained from a large unannotated textual corpus. The best results obtained by the system are 0.85 precision and 0.84 recall for the recognition of temporal expressions; accuracy values of 0.91 and 0.55 were obtained for the attributes type and val, respectively.
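
Below is a rule-based normalization sketch for a few temporal expression patterns, echoing the Task A setup (the ML-based recognition step is omitted). The rules, the fixed document creation time, and the handled patterns are a tiny illustrative subset.

```python
# Rule-based normalization of a few temporal expressions to ISO dates.
import re
from datetime import date, timedelta

DCT = date(2010, 7, 15)  # document creation time (assumed)
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

def normalize(expr):
    expr = expr.lower()
    if expr == "today":
        return DCT.isoformat()
    if expr == "yesterday":
        return (DCT - timedelta(days=1)).isoformat()
    m = re.fullmatch(r"(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})", expr)
    if m:
        return date(int(m.group(3)), MONTHS.index(m.group(2)) + 1,
                    int(m.group(1))).isoformat()
    return "UNKNOWN"

print(normalize("yesterday"))      # -> 2010-07-14
print(normalize("15 July 2010"))   # -> 2010-07-15
```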

27 citations


Proceedings ArticleDOI
26 Oct 2010
TL;DR: AgeRank, a link-based algorithm that ranks web pages according to their appropriateness for young audiences, is designed and shown to be accurate in page labeling, broad in page coverage, and highly promising for improving children's search.

Abstract: Though children frequently use web search engines to learn, interact, and be entertained, modern web search engines are poorly suited to children's needs, requiring relatively complex querying and filtering of results in order to find pages oriented to young audiences. To address this limitation, we designed AgeRank, a link-based algorithm that ranks web pages according to their appropriateness for young audiences. We show its effectiveness through a multipart evaluation that demonstrates AgeRank to be accurate in page labeling, broad in page coverage, and highly promising for improving children's search. As a fast, scalable, and effective algorithm, AgeRank can be adopted by search engines seeking to address the needs of young users more effectively, or easily combined with complementary machine-learning-based classification approaches.
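
Since the abstract does not spell out the update rule, the sketch below shows a generic personalized-PageRank-style propagation of child-suitability from seed pages over a toy link graph, which conveys the flavor of a link-based age ranking. The damping factor, seed, and graph are all assumptions, not AgeRank's published definition.

```python
# Personalized-PageRank-style propagation of "child-suitability" over a link graph.
graph = {                           # page -> outgoing links (toy web graph)
    "kids-portal": ["games", "homework-help"],
    "games": ["kids-portal"],
    "homework-help": ["games", "news"],
    "news": ["homework-help"],
}
seed = {page: 0.0 for page in graph}
seed["kids-portal"] = 1.0           # a page known to be child-oriented
score, d = dict(seed), 0.85         # damping factor (assumed)

for _ in range(50):                 # power iteration until (near) convergence
    score = {page: (1 - d) * seed[page]
             + d * sum(score[src] / len(graph[src])
                       for src, outs in graph.items() if page in outs)
             for page in graph}

for page, s in sorted(score.items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {s:.3f}")
```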

26 citations


Proceedings ArticleDOI
19 Jul 2010
TL;DR: A face naming method is developed that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs, and yields better results than a Support Vector Machine classifier trained on the same labeled data.

Abstract: Labeling persons appearing in video frames with names detected from the video transcript helps improve video content identification and search. We develop a face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs. The advantage of this method is that it can use very few labeled data points and incorporate the unlabeled data points during the learning process. Anchor detection and metric learning for face classification are incorporated into the label propagation process to help boost the face naming performance. On BBC News videos, the label propagation algorithm yields better results than a Support Vector Machine classifier trained on the same labeled data.
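
Here is a minimal numeric sketch of iterative label propagation in the spirit of the abstract: a handful of faces, two of them labeled, a Gaussian-kernel similarity graph, and repeated propagate-then-clamp updates. The face vectors and kernel width are toy assumptions; the paper additionally learns a face metric and handles anchors.

```python
# Iterative label propagation on a face-similarity graph.
import numpy as np

X = np.array([[0.0, 0.1], [0.1, 0.0],      # faces of person A (toy vectors)
              [1.0, 0.9], [0.9, 1.0]])     # faces of person B
labels = {0: "Alice", 2: "Bob"}            # the few labeled examples

# similarity graph: Gaussian kernel, row-normalized into transition matrix P
W = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2 / 0.5)
np.fill_diagonal(W, 0.0)
P = W / W.sum(axis=1, keepdims=True)

names = sorted(set(labels.values()))
F = np.zeros((len(X), len(names)))
for i, name in labels.items():
    F[i, names.index(name)] = 1.0

for _ in range(50):                        # propagate, then clamp labeled rows
    F = P @ F
    for i, name in labels.items():
        F[i] = 0.0
        F[i, names.index(name)] = 1.0

for i in range(len(X)):
    print(f"face {i} -> {names[int(F[i].argmax())]}")
```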

24 citations


15 Aug 2010
TL;DR: This work considers mapping unrestricted natural language to formal spatial representations and describes ongoing work on a two-level machine learning approach, whose first level deals with the extraction of spatial information from natural language sentences and is called spatial role labeling.

Abstract: We consider mapping unrestricted natural language to formal spatial representations, and describe ongoing work on a two-level machine learning approach. The first level is linguistic: it deals with the extraction of spatial information from natural language sentences and is called spatial role labeling. The second level is ontological in nature and deals with mapping this linguistic spatial information to formal spatial calculi. Our main obstacles are the lack of annotated data for training machine learning algorithms for these tasks, and the difficulty of selecting an appropriate abstraction level for the spatial information. For the linguistic part, we approach the problem in a gradual way: we make use of existing resources such as The Preposition Project (TPP) and the validation data of the General Upper Model (GUM) ontology, and we show some computational results. For the ontological part, we describe the machine learning challenges and discuss our proposed approach.
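
As a sketch of the second, ontological level, the snippet below maps a spatial indicator to an RCC-8-style relation through a lookup table. The mapping shown is an illustrative assumption: choosing the right abstraction level for such mappings is precisely the difficulty the paper discusses.

```python
# Illustrative preposition -> RCC-8 mapping (an assumption; the appropriate
# abstraction level is itself a research question in the paper).
RCC8 = {
    "in": "NTPP",        # non-tangential proper part
    "inside": "NTPP",
    "on": "EC",          # externally connected (surface contact)
    "touching": "EC",
    "overlapping": "PO", # partial overlap
}

def to_rcc(spatial_indicator):
    return RCC8.get(spatial_indicator, "DC")   # default: disconnected

trajector, indicator, landmark = "book", "on", "table"
print(f"{to_rcc(indicator)}({trajector}, {landmark})")   # -> EC(book, table)
```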

19 citations


Journal ArticleDOI
TL;DR: This work suggests that it is possible to train robust story segmenters for news video using only a handful of broadcasts, provided a good initial feature selection is made.
Abstract: In this paper, we describe an approach to segmenting news video based on the perceived shift in content, using features spanning multiple modalities. We investigate a number of multimedia features, which serve as potential indicators of a change in story, in order to determine which are the most effective. The efficacy of our approach is demonstrated by the performance of our prototype, where a number of feature combinations yield up to an 18% improvement in WindowDiff score compared to other state-of-the-art story segmenters. In our investigation there is no single clearly superior feature; rather, the best segmentation occurs when there is synergy between multiple features. A further investigation into the effect on segmentation performance of varying the number of training examples versus the number of features used reveals that having better feature combinations is more important than having more training examples. Our work suggests that it is possible to train robust story segmenters for news video using only a handful of broadcasts, provided a good initial feature selection is made.
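
For reference, the WindowDiff metric mentioned above (Pevzner and Hearst, 2002) can be computed as below: slide a window over the reference and hypothesis boundary sequences and count windows where the number of boundaries disagrees. Index conventions vary slightly across implementations; this is one common array-based formulation.

```python
# WindowDiff: fraction of sliding windows whose boundary counts disagree.
def window_diff(reference, hypothesis, k):
    """reference/hypothesis: lists of 0/1 boundary indicators between units."""
    windows = range(len(reference) - k)
    errors = sum(sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
                 for i in windows)
    return errors / len(windows)

ref = [0, 1, 0, 0, 1, 0, 0, 0]   # reference story boundaries
hyp = [0, 0, 1, 0, 1, 0, 0, 0]   # hypothesis: first boundary off by one
print(f"WindowDiff = {window_diff(ref, hyp, k=3):.3f}")   # -> 0.200
```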

19 citations


Patent
18 Nov 2010
TL;DR: The Latent Words Language Model (LWLM) as mentioned in this paper automatically determines context-dependent word distributions (called hidden or latent words) for each word of a text, which reflect the probability that another word of the vocabulary of a language would occur at that position in the text.
Abstract: Described is a method, the Latent Words Language Model (LWLM), that automatically determines context-dependent word distributions (called hidden or latent words) for each word of a text. The probabilistic word distributions reflect the probability that another word of the vocabulary of a language would occur at that position in the text. Furthermore, a method is described for using these word distributions in statistical language processing applications, such as information extraction (for example, semantic role labeling and named entity recognition), automatic machine translation, textual entailment, paraphrasing, information retrieval, and speech recognition.
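
An illustrative sketch of the core idea: for each token position, estimate a distribution over vocabulary words that could plausibly occur there given the context. A toy bigram count table stands in for the model the patent actually describes, which is learned by Bayesian inference over much larger contexts.

```python
# Toy "latent words": P(w | left, right) from bigram counts on a tiny corpus.
from collections import defaultdict

bigrams = defaultdict(int)
corpus = "the cat sat on the mat the dog sat on the rug".split()
for a, b in zip(corpus, corpus[1:]):
    bigrams[(a, b)] += 1
vocab = sorted(set(corpus))

def latent_words(left, right):
    """P(w | left, right) proportional to count(left, w) * count(w, right)."""
    scores = {w: bigrams[(left, w)] * bigrams[(w, right)] for w in vocab}
    total = sum(scores.values()) or 1
    return {w: c / total for w, c in scores.items() if c}

# latent words for the position of "cat" in "the cat sat"
print(latent_words("the", "sat"))   # -> {'cat': 0.5, 'dog': 0.5}
```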

Proceedings ArticleDOI
19 Jul 2010
TL;DR: This work presents a simple and effective approach to complement search results for children's web queries with child-oriented multimedia results, such as coloring pages and music sheets, and shows its effectiveness through an online user evaluation.
Abstract: We present a simple and effective approach to complement search results for children's web queries with child-oriented multimedia results, such as coloring pages and music sheets. Our approach determines appropriate media types for a query by searching Google's database of frequent queries for co-occurrences of a query's terms (e.g., "dinosaurs") with preselected multimedia terms (e.g., "coloring pages"). We show the effectiveness of this approach through an online user evaluation.
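
Below is a sketch of the co-occurrence test the abstract describes: a media type is suggested for a query when the query's terms co-occur often enough with a preselected multimedia term among frequent queries. The toy query log and threshold are stand-ins for Google's query database and the paper's actual criteria.

```python
# Suggest media types whose terms co-occur with the query in frequent queries.
frequent_queries = [
    "dinosaurs coloring pages", "dinosaurs facts", "dinosaurs games",
    "beethoven music sheets", "dinosaurs coloring pages printable",
]
MEDIA_TERMS = ["coloring pages", "music sheets"]
MIN_COOCCURRENCES = 2   # assumed threshold

def media_types_for(query):
    return [m for m in MEDIA_TERMS
            if sum(query in q and m in q for q in frequent_queries)
               >= MIN_COOCCURRENCES]

print(media_types_for("dinosaurs"))   # -> ['coloring pages']
```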

Book ChapterDOI
08 Sep 2010
TL;DR: The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and robustly retains its discriminative value over time.
Abstract: This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples while pushing away negative ones. This method is an extension of Linear Discriminant Analysis (LDA), but introduces a different transformation to improve the separation between classes, and it has not previously been applied to text mining tasks. We successfully test BDA under two schemes. The first is a traditional classification scenario using 10-fold cross-validation on four ground-truth standard corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second scheme, we test the anticipatory properties of the statistical features on the TREC 2007 spam corpus. The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and robustly retains its discriminative value over time.
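
Here is a small numeric sketch of BDA as commonly formulated: maximize the scatter of negative examples around the positive centroid relative to the scatter of the positive examples, by solving a generalized eigenproblem. The toy data, regularization, and this particular formulation are assumptions for illustration, not the paper's exact procedure.

```python
# BDA sketch: generalized eigenproblem on biased scatter matrices.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
pos = rng.normal([0, 0], 0.3, size=(20, 2))   # positive class, e.g. spam (toy)
neg = rng.normal([2, 1], 1.0, size=(20, 2))   # negative class, e.g. ham (toy)

mu_pos = pos.mean(axis=0)
S_pos = (pos - mu_pos).T @ (pos - mu_pos)     # positive within-class scatter
S_neg = (neg - mu_pos).T @ (neg - mu_pos)     # negatives around positive mean
S_pos += 1e-6 * np.eye(2)                     # regularize (assumed)

# generalized eigenproblem S_neg w = lambda * S_pos w; take the top eigenvector
eigvals, eigvecs = eigh(S_neg, S_pos)
w = eigvecs[:, -1]                            # direction of best separation

print("projection of positive mean:", float(mu_pos @ w))
print("projection of negative mean:", float(neg.mean(axis=0) @ w))
```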

Journal ArticleDOI
TL;DR: A method to detect portions of a digital text source that are invisible to the end user when rendered on a visual medium, with an assessment of its effectiveness in a spam filtering task.

Abstract: Hidden salting in digital media involves the intentional addition or distortion of content patterns with the purpose of circumventing content filtering. We propose a method to detect portions of a digital text source which are invisible to the end user when they are rendered on a visual medium (such as a computer monitor). The method consists of “tapping” into the rendering process and analyzing the rendering commands to identify portions of the source text (plaintext) that will be invisible to a human reader, using criteria based on text and background colors, font size, overlapping characters, etc. Moreover, the text deemed visible (covertext) is reconstructed from the rendering commands, and the character reading order is then identified, which may differ from the rendering order. The detection and resolution of hidden salting is evaluated on two e-mail corpora, and the effectiveness of this method in a spam filtering task is assessed. We provide a solution to a relevant open problem in content filtering applications, namely the presence of tricks aimed at circumventing automatic filters.
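
Below is a sketch of the visibility test at the core of such a method: given simplified rendering commands, flag text a human reader could not see. The command format, color-distance test, and thresholds are assumptions; the real system taps an actual rendering process and also reconstructs reading order.

```python
# Flag rendered text that is invisible (tiny font, or text color ~ background).
MIN_VISIBLE_FONT_PT = 4     # assumed threshold
MAX_COLOR_DISTANCE = 30     # text close to background color => invisible (assumed)

def color_distance(c1, c2):
    return sum(abs(a - b) for a, b in zip(c1, c2))

def is_hidden(cmd, background=(255, 255, 255)):
    if cmd["font_size"] < MIN_VISIBLE_FONT_PT:
        return True
    return color_distance(cmd["color"], background) <= MAX_COLOR_DISTANCE

commands = [
    {"text": "Cheap meds now!", "color": (0, 0, 0), "font_size": 12},
    {"text": "harmless padding words", "color": (250, 250, 250), "font_size": 12},
    {"text": "more salting text", "color": (0, 0, 0), "font_size": 1},
]
covertext = " ".join(c["text"] for c in commands if not is_hidden(c))
salting = " ".join(c["text"] for c in commands if is_hidden(c))
print("visible:", covertext)
print("hidden salting:", salting)
```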

Book ChapterDOI
21 Mar 2010
TL;DR: This paper uses an integer linear programming (ILP) approach to compress sentences from news articles in Dutch and Flemish newspapers, relying on the Alpino parser for Dutch and the Latent Words Language Model.

Abstract: Sentence compression is a valuable task in the framework of text summarization. In this paper we compress sentences, written in Dutch, from news articles in Dutch and Flemish newspapers using an integer linear programming approach. We rely on the Alpino parser available for Dutch and on the Latent Words Language Model. We demonstrate that the integer linear programming approach yields good results for compressing Dutch sentences, despite the large freedom in word order.
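
Below is a tiny sentence-compression ILP in the spirit of the paper, using the PuLP solver: one binary variable per token, an importance-maximizing objective, a length budget, and dependency constraints that keep a modifier only if its head is kept. The token scores, toy dependency arcs, and budget are illustrative assumptions (the paper scores Dutch sentences using Alpino parses and the Latent Words Language Model).

```python
# ILP sentence compression: keep/drop tokens under budget and dependency constraints.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, PULP_CBC_CMD

tokens = ["the", "tired", "dog", "slept", "very", "soundly"]
score = [0.2, 0.4, 1.0, 1.0, 0.3, 0.6]   # assumed token importance scores
head = {0: 2, 1: 2, 4: 5, 5: 3}          # modifier -> head (toy dependency arcs)
BUDGET = 4                               # keep at most 4 tokens

x = [LpVariable(f"x{i}", cat="Binary") for i in range(len(tokens))]
prob = LpProblem("compression", LpMaximize)
prob += lpSum(score[i] * x[i] for i in range(len(tokens)))   # objective
prob += lpSum(x) <= BUDGET                                   # length budget
for mod, hd in head.items():             # keep a modifier only if its head is kept
    prob += x[mod] <= x[hd]

prob.solve(PULP_CBC_CMD(msg=False))
print(" ".join(t for t, v in zip(tokens, x) if v.value() == 1))
# -> "tired dog slept soundly"
```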

DOI
28 Apr 2010
TL;DR: A face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs, complemented by an unsupervised model for naming anchor persons in the news.

Abstract: In this paper we report our experiments on assigning person names to faces as found in video frames and transcripts of news broadcasts. We develop a face naming method that learns from labeled and unlabeled examples using iterative label propagation in a graph of connected faces or name-face pairs. The advantage of this method is that it can use very few labeled data points and incorporate the unlabeled data points during the learning process. The label propagation algorithm yields better results than a Support Vector Machine classifier trained on the same labeled data. We improve the face labeling performance by learning and using a similarity metric for comparing faces. Anchors may be problematic, since their names are typically mentioned only once, at the very beginning of the news broadcast, while their faces occur quite frequently. If the name-face pairs corresponding to the anchors can be identified separately, the accuracy of the overall alignment can be boosted. Hence, we develop an unsupervised model for naming anchor persons in the news.
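
Here is a sketch of one plausible unsupervised anchor-detection heuristic consistent with the abstract's observation: anchors recur across the whole broadcast while other faces are confined to single stories, so the face cluster spanning the most story segments is taken as the anchor. The heuristic and toy data are assumptions, not necessarily the paper's model.

```python
# Unsupervised anchor detection: the face cluster with the widest temporal spread.
from collections import defaultdict

# (face cluster id, story segment index) detections across one broadcast (toy)
detections = [("f1", 0), ("f2", 0), ("f1", 1), ("f3", 1),
              ("f1", 2), ("f4", 3), ("f1", 3)]

segments = defaultdict(set)
for face, seg in detections:
    segments[face].add(seg)

anchor = max(segments, key=lambda f: len(segments[f]))
print("anchor face cluster:", anchor)   # -> f1 (appears in all 4 segments)
```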
