
Showing papers by "Walter Daelemans published in 2012"


Journal ArticleDOI
TL;DR: Pattern is a package for Python 2.4+ with functionality for web mining, natural language processing, machine learning, and network analysis, including graph centrality and visualization.
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.
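
A minimal sketch of the three layers mentioned in the abstract, assuming Pattern 2.x under Python 2; module and function names follow the online documentation at the time and may differ across versions.

    # Hedged sketch of Pattern's NLP and machine learning layers; API names
    # are taken from the Pattern 2.x documentation and may vary by version.
    from pattern.en import parse, sentiment
    from pattern.vector import Document, KNN

    # Natural language processing: part-of-speech tagging / chunking.
    print(parse("The quick brown fox jumps over the lazy dog."))

    # Sentiment analysis: returns a (polarity, subjectivity) pair.
    print(sentiment("A wonderfully entertaining and well-documented package."))

    # Machine learning: a tiny k-NN text classifier on bag-of-words documents.
    knn = KNN()
    for text, label in (("great fast reliable", "pos"), ("slow buggy broken", "neg")):
        knn.train(Document(text, type=label))
    print(knn.classify(Document("fast and reliable")))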

348 citations


Proceedings Article
01 May 2012
TL;DR: A new open source subjectivity lexicon for Dutch adjectives is presented: a dictionary of 1,100 adjectives that occur frequently in online product reviews, manually annotated per word sense with polarity strength, subjectivity and intensity.
Abstract: We present a new open source subjectivity lexicon for Dutch adjectives. The lexicon is a dictionary of 1,100 adjectives that occur frequently in online product reviews, manually annotated with polarity strength, subjectivity and intensity, for each word sense. We discuss two machine learning methods (using distributional extraction and synset relations) to automatically expand the lexicon to 5,500 words. We evaluate the lexicon by comparing it to the user-given star rating of online product reviews. We show promising results in both in-domain and cross-domain evaluation. The lexicon is publicly available as part of the PATTERN software package (http://www.clips.ua.ac.be/pages/pattern).
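
Since the lexicon ships as part of the PATTERN package, it can be queried through Pattern's Dutch module. The sketch below assumes pattern.nl exposes a sentiment() function analogous to pattern.en; the exact signature is an assumption based on the Pattern documentation.

    # Hedged sketch: assumes the Dutch lexicon is exposed via pattern.nl in
    # the same way pattern.en exposes its English lexicon.
    from pattern.nl import sentiment

    # Returns an approximate (polarity, subjectivity) pair.
    print(sentiment("Een geweldig product, werkt uitstekend."))
    print(sentiment("Waardeloos toestel, ging meteen kapot."))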

58 citations


Proceedings Article
01 May 2012
TL;DR: ConanDoyle-neg is presented, a corpus of stories by Conan Doyle annotated with negation information, in which the negation cues and their scope, as well as the negated event or property, have been annotated by two annotators.
Abstract: In this paper we present ConanDoyle-neg, a corpus of stories by Conan Doyle annotated with negation information. The negation cues and their scope, as well as the event or property that is negated have been annotated by two annotators. The inter-annotator agreement is measured in terms of F-scores at scope level. It is higher for cues (94.88 and 92.77), less high for scopes (85.04 and 77.31), and lower for the negated event (79.23 and 80.67). The corpus is publicly available.

55 citations


Journal ArticleDOI
TL;DR: The suggested framework introduces a generic text mining approach to analyse media coverage on political issues, including a set of methodological guidelines, evaluation metrics, as well as open source opinion mining tools.
Abstract: At the end of 2011, Belgium formed a government after a world-record-breaking period of 541 days of negotiations. We have gathered and analysed 68,000 related on-line news articles published in 2011 in Flemish newspapers. These articles were analysed by a custom-built expert system. The results of our text mining analyses show interesting differences in media coverage and votes for several political parties and politicians. With opinion mining, we are able to automatically detect the sentiment of each article, allowing us to visualise how the tone of reporting evolved throughout the year at the party, politician and newspaper level. Our suggested framework introduces a generic text mining approach to analysing media coverage of political issues, including a set of methodological guidelines, evaluation metrics, and open source opinion mining tools. Since all analyses are based on automated text mining algorithms, an objective overview of the manner of reporting is provided. The analysis shows peaks of positive and negative sentiment during key moments in the negotiation process.
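
For illustration only (this is not the authors' expert system): once each article carries a sentiment score, tracking the tone of reporting per party and per month is a simple aggregation. Column names and values below are placeholders.

    # Illustrative aggregation of per-article sentiment by party and month;
    # the dataframe columns and scores are hypothetical.
    import pandas as pd

    articles = pd.DataFrame({
        "date":      pd.to_datetime(["2011-01-10", "2011-01-25", "2011-06-30"]),
        "party":     ["N-VA", "PS", "N-VA"],
        "sentiment": [0.12, -0.35, -0.08],   # e.g. polarity in [-1, 1]
    })

    monthly_tone = (articles
                    .set_index("date")
                    .groupby("party")
                    .resample("M")["sentiment"]
                    .mean())
    print(monthly_tone)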

49 citations


Journal ArticleDOI
TL;DR: In this paper, the authors stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature, and apply the technique to authorship verification across genres, an extremely complex text categorization problem.
Abstract: In this paper we will stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature. The technique envisages an experimental set-up commonly referred to as "authorship verification", a task generally deemed more difficult than so-called "authorship attribution". We will apply the technique to authorship verification across genres, an extremely complex text categorization problem that so far has remained unexplored. We focus on five representative contemporary English-language authors. For each of them, the corpus under scrutiny contains several texts in two genres (literary prose and theatre plays). Our research confirms that unmasking is an interesting technique for computational authorship verification, especially yielding reliable results within the genre of (larger) prose works in our corpus. Authorship verification, however, proves much more difficult in the theatrical part of the corpus.
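
A minimal sketch of the unmasking procedure in the spirit of Koppel and Schler, not the authors' exact configuration: split two texts into chunks, repeatedly train a linear classifier to tell them apart, record its cross-validated accuracy, and drop the most discriminative features. For same-author pairs the accuracy curve tends to degrade quickly.

    # Sketch of unmasking with scikit-learn; assumes each text yields at
    # least five chunks so that 5-fold cross-validation is possible.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmask(chunks_a, chunks_b, rounds=10, drop=3):
        y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
        vec = CountVectorizer(max_features=250)            # frequent words only
        X = vec.fit_transform(chunks_a + chunks_b).toarray().astype(float)
        curve = []
        for _ in range(rounds):
            clf = LinearSVC()
            curve.append(cross_val_score(clf, X, y, cv=5).mean())
            clf.fit(X, y)
            w = clf.coef_[0]
            # zero out the strongest features in both directions
            worst = np.argsort(w)[:drop].tolist() + np.argsort(w)[-drop:].tolist()
            X[:, worst] = 0.0
        return curve                                       # degradation curve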

47 citations


Proceedings Article
12 Jul 2012
TL;DR: This paper presents an approach to automatically annotate sentences in medical abstracts with these labels using kLog, a new language for statistical relational learning with kernels, and shows a clear improvement with respect to state-of-the-art systems.
Abstract: Evidence-based medicine is an approach whereby clinical decisions are supported by the best available findings gained from scientific research. This requires efficient access to such evidence. To this end, abstracts in evidence-based medicine can be labeled using a set of predefined medical categories, the so-called PICO criteria. This paper presents an approach to automatically annotate sentences in medical abstracts with these labels. Since both structural and sequential information are important for this classification task, we use kLog, a new language for statistical relational learning with kernels. Our results show a clear improvement with respect to state-of-the-art systems.

44 citations


Journal ArticleDOI
30 Jan 2012
TL;DR: A system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed is presented.
Abstract: We present a system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed. With this system, we competed in Track 2 of the 2011 Medical NLP Challenge, where the task was to distinguish between fifteen emotion labels, from guilt, sorrow, and hopelessness to hopefulness and happiness. Since a sentence can be annotated with multiple emotions, we designed a thresholding approach that enables assigning multiple labels to a single instance. We rely on the probability estimates returned by an SVM classifier and experimentally set thresholds on these probabilities. Emotion labels are assigned only if their probability exceeds a certain threshold and if the probability of the sentence being emotion-free is low enough. We show the advantages of this thresholding approach by comparing it to a naive system that assigns only the most probable label to each test sentence, and to a system trained on emotion-carrying sentences only.
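
A sketch of the thresholding idea only; the label set, the probability estimates and the threshold values below are placeholders, not the values tuned in the paper.

    # Assign every emotion label whose probability exceeds a threshold,
    # provided the "no emotion" probability is low enough; otherwise fall
    # back to the single most probable emotion. All values are illustrative.
    def assign_emotions(probs, label_threshold=0.3, no_emotion_threshold=0.5,
                        no_emotion_label="no_emotion"):
        """probs: dict mapping each label to the classifier's probability."""
        if probs[no_emotion_label] >= no_emotion_threshold:
            return []                     # treated as an emotion-free sentence
        labels = [l for l, p in probs.items()
                  if l != no_emotion_label and p >= label_threshold]
        if not labels:                    # fall back to the single best label
            best = max((l for l in probs if l != no_emotion_label),
                       key=lambda l: probs[l])
            labels = [best]
        return labels

    print(assign_emotions({"guilt": 0.42, "sorrow": 0.31,
                           "hopefulness": 0.05, "no_emotion": 0.10}))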

27 citations


Proceedings Article
01 Jan 2012
TL;DR: A dictionary of words and expressions relating to predators' grooming stages is described, which was used to identify which posts in the predators' conversations were most distinctive for their grooming behavior.
Abstract: In this paper we present a new approach for detecting online pedophiles in chat rooms that combines the results of predictions on the level of the individual post, the level of the user and the level of the entire conversation, and describe the results of this three-stage system in the PAN 2012 competition. Also, we describe a resampling and a filtering strategy to circumvent issues regarding the unbalanced dataset. Finally, we describe the creation of a dictionary of words and expressions relating to predators' grooming stages, which we used to identify which posts in the predators' conversations were most distinctive for their grooming behavior.
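
The abstract does not spell out the resampling strategy; as one common illustration of rebalancing such data, the sketch below randomly undersamples the (much larger) non-predator class. It is an assumption for illustration, not the authors' exact procedure.

    # Random undersampling of the majority class as one way to rebalance an
    # unbalanced training set (illustrative, not the paper's exact strategy).
    import random

    def undersample(majority, minority, ratio=1.0, seed=42):
        """Keep all minority examples and about ratio * len(minority) majority ones."""
        rng = random.Random(seed)
        keep = rng.sample(majority, min(len(majority), int(ratio * len(minority))))
        sample = keep + list(minority)
        rng.shuffle(sample)
        return sample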

27 citations


01 Jan 2012
TL;DR: deLearyous, as discussed by the authors, is a proof-of-concept of a serious game that will assist in the training of communication skills following the Interpersonal Circumplex (also known as Leary's Rose), a framework for interpersonal communication.
Abstract: We describe project deLearyous, in which the goal is to develop a proof-of-concept of a serious game that will assist in the training of communication skills following the Interpersonal Circumplex (also known as Leary's Rose), a framework for interpersonal communication. Users will interact with the application using unconstrained written natural language input and will engage in conversation with a 3D virtual agent. The application will thus alleviate the need for expensive communication coaching and will offer players a non-threatening environment in which to practice their communication skills. We outline the preliminary data collection procedure, as well as the workings of each of the modules that make up the application pipeline. We evaluate the modules' performance and offer our thoughts on what can be expected from the final proof-of-concept application.

To get a firm grasp on the structure and dynamics of human-to-human conversations, we first gathered data from a series of "Wizard of Oz" experiments in which the virtual agent was replaced with a human actor. All data was subsequently transcribed, analysed and annotated. This data functioned as the basis for all modules in the application pipeline: the NLP module, the scenario engine, the visualization module, and the audio module.

The freeform, unconstrained text input from the player is first processed by a Natural Language Processing (NLP) module, which uses machine learning to automatically identify the position of the player on the Interpersonal Circumplex. The NLP module also identifies the topic of the player's input using a keyword-based approach. The output of the NLP module is sent to the scenario engine, which implements the virtual agent's conversation options as a finite state machine. Given the virtual agent's previous state and Circumplex position, it predicts the most likely follow-up state. The follow-up state is then realized by the visualization and audio modules. The visualization module takes care of displaying the 3D virtual agent's facial and torso animations, while the audio module looks up and plays the appropriate pre-recorded audio responses.

In terms of performance, the NLP module appears to be a bottleneck, as finding the position of the player on the Interpersonal Circumplex is a very difficult problem to solve automatically. However, we show that human agreement on this task is also very low, indicating that there is not always a single "correct" way to interpret Circumplex positions. We conclude by stating that applications like deLearyous show promise, but we also readily admit that the technology still has some way to go before such applications can be used without human supervision.
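
A toy illustration of the scenario engine described above as a finite state machine; the states, Circumplex positions and transitions are invented for the example and are not taken from the project's actual scenario.

    # Hypothetical scenario engine: (current state, player's Circumplex
    # position) determines the virtual agent's next conversational state.
    TRANSITIONS = {
        ("greeting", "dominant"):        "defensive_reply",
        ("greeting", "submissive"):      "reassuring_reply",
        ("defensive_reply", "together"): "cooperative_reply",
    }

    def next_state(current, circumplex_position, default="clarifying_question"):
        return TRANSITIONS.get((current, circumplex_position), default)

    print(next_state("greeting", "dominant"))   # -> defensive_reply
    print(next_state("greeting", "opposed"))    # -> clarifying_question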

25 citations


Proceedings Article
01 Jan 2012
TL;DR: The pilot task Processing modality and negation, organized in the framework of the Question Answering for Machine Reading Evaluation Lab at CLEF 2012, is defined as an annotation exercise consisting of determining whether an event mentioned in a text is presented as negated, modalised (i.e. affected by an expression of modality), or both.
Abstract: This paper describes the pilot task Processing modality and negation, which was organized in the framework of the Question Answering for Machine Reading Evaluation Lab at CLEF 2012. The task was defined as an annotation exercise consisting of determining whether an event mentioned in a text is presented as negated, modalised (i.e. affected by an expression of modality), or both. Three teams participated in the task, submitting a total of 6 runs. The highest score obtained by a system was a macro-averaged F1 measure of 0.6368.
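
The reported score is a macro-averaged F1: the F1 of each class is computed separately and the results averaged. The helper below shows the computation; the per-class counts are illustrative, not the task's data.

    # Macro-averaged F1 from per-class (true positive, false positive,
    # false negative) counts; the example counts are made up.
    def f1(tp, fp, fn):
        p = tp / float(tp + fp) if tp + fp else 0.0
        r = tp / float(tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_f1(per_class_counts):
        scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
        return sum(scores) / len(scores)

    # e.g. classes: negated, modalised, both, neither
    print(macro_f1([(50, 10, 12), (30, 20, 15), (8, 6, 9), (200, 25, 18)]))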

21 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: The paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction, and encouraging results of early implementations of these vocabulary and grammar learning components are described.
Abstract: This paper introduces research within the ALADIN project, which aims to develop an assistive vocal interface for people with a physical impairment. In contrast to existing approaches, the vocal interface is self-learning which means it can be used with any language, dialect, vocabulary and grammar. The paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction. In addition, the paper describes encouraging results of early implementations of these vocabulary and grammar learning components, applied to recorded sessions of a vocally guided card game, patience.

Proceedings Article
01 Dec 2012
TL;DR: It is shown that classification performance is significantly higher when the inflective character of the language is taken into account by using character n-grams as opposed to the more common bag-of-words approach, indicating that topic classification is possible even for languages for which automatic grammatical tools are not available.
Abstract: Despite the existence of many effective methods to solve topic classification tasks for such widely used languages as English, there is no clear answer whether these methods are suitable for languages that are substantially different. We attempt to solve a topic classification task for Lithuanian, a relatively resource-scarce language that is highly inflective, has a rich vocabulary, and a complex word derivation system. We show that classification performance is significantly higher when the inflective character of the language is taken into account by using character n-grams as opposed to the more common bag-of-words approach. These results are not only promising for Lithuanian, but also for other languages with similar properties. We show that the performance of classifiers based on character n-grams even surpasses that of classifiers built on stemmed or lemmatized text. This indicates that topic classification is possible even for languages for which automatic grammatical tools are not available.
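
A minimal contrast between the two feature sets discussed above, using scikit-learn rather than the authors' own pipeline; the corpus snippets and labels are placeholders.

    # Bag-of-words vs. character n-gram features for topic classification;
    # the tiny Lithuanian-like corpus is invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    bow_clf = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())
    char_clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # char n-grams
        MultinomialNB())

    texts = ["sportas ir krepsinis", "politika ir vyriausybe",
             "krepsinio rungtynes", "vyriausybes politika"]
    labels = ["sport", "politics", "sport", "politics"]
    for clf in (bow_clf, char_clf):
        clf.fit(texts, labels)
        print(clf.predict(["naujos krepsinio taisykles"]))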

01 Jan 2012
TL;DR: The task of machine reading of biomedical texts about Alzheimer's disease, which is a pilot task of the Question Answering for Machine Reading Evaluation (QA4MRE) Lab at CLEF 2012 as discussed by the authors, aims at exploring the ability of a machine reading system to answer questions about a scientific topic.
Abstract: This report describes the task Machine reading of biomedical texts about Alzheimer’s disease, which is a pilot task of the Question Answering for Machine Reading Evaluation (QA4MRE) Lab at CLEF 2012. The task aims at exploring the ability of a machine reading system to answer questions about a scientific topic, namely Alzheimer’s disease. As in the QA4MRE task, participant systems were asked to read a document and identify the answers to a set of questions about information that is stated or implied in the text. A background collection was provided for systems to acquire background knowledge. The background collection is a corpus newly compiled for this task, the Alzheimer’s Disease Literature Corpus. Seven teams participated in the task submitting a total of 43 runs. The highest score obtained by a team was 0.55 c@1, which is clearly above baseline.
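
The c@1 measure rewards leaving a question unanswered over answering it wrongly. The helper below follows the standard definition of c@1 (Penas and Rodrigo, 2011); the counts are placeholders, not the task results.

    # c@1 = (n_correct + n_unanswered * n_correct / n) / n, where n is the
    # total number of questions. Example counts are invented.
    def c_at_1(n_correct, n_unanswered, n_total):
        return (n_correct + n_unanswered * (float(n_correct) / n_total)) / n_total

    print(c_at_1(n_correct=20, n_unanswered=10, n_total=40))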

01 Jan 2012
TL;DR: Annotation methods are presented that provide a training set of compounds labelled with the appropriate semantic class, and a support vector machine uses a distributional lexical semantics representation of the compound's constituents to make its classification decision.
Abstract: This article presents initial results on a supervised machine learning approach to determine the semantics of noun compounds in Dutch and Afrikaans. After a discussion of previous research on the topic, we present our annotation methods used to provide a training set of compounds with the appropriate semantic class. The support vector machine method used for this classification experiment utilizes a distributional lexical semantics representation of the compound's constituents to make its classification decision. The collection of words that occur in the near context of a constituent is considered an implicit representation of the semantics of this constituent. F-scores of 47.8% for Dutch and 51.1% for Afrikaans were reached. Keywords: compound semantics; Afrikaans; Dutch; machine learning; distributional methods
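
A toy sketch of the general idea only, not the authors' feature set: represent each constituent by the words co-occurring with it in a corpus, combine the two constituent representations, and classify the compound's semantic relation with an SVM. The corpus sentences, compounds and class labels are invented.

    # Distributional context vectors for compound constituents + SVM classifier.
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import SVC

    def context_vector(word, sentences, window=2, prefix=""):
        """Bag of words co-occurring with `word` within a small window."""
        counts = Counter()
        for sent in sentences:
            toks = sent.split()
            for i, t in enumerate(toks):
                if t == word:
                    for c in toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]:
                        counts[prefix + c] += 1
        return counts

    corpus = ["de verse appel ligt op tafel",
              "de taart staat in de oven",
              "een huis met een grote deur",
              "de deur van het huis is rood"]

    # (modifier, head, hypothetical semantic class)
    compounds = [("appel", "taart", "MADE_OF"), ("huis", "deur", "PART_OF")]

    features = []
    for mod, head, _ in compounds:
        f = context_vector(mod, corpus, prefix="mod_")
        f.update(context_vector(head, corpus, prefix="head_"))
        features.append(f)

    X = DictVectorizer().fit_transform(features)
    y = [label for _, _, label in compounds]
    clf = SVC(kernel="linear").fit(X, y)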

Proceedings Article
01 Jan 2012
TL;DR: This paper presents a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog and proposes to normalize this ‘anomalous' input into a format suitable for existing NLP solutions for standard Dutch.
Abstract: Although in recent years numerous forms of Internet communication, such as e-mail, blogs, chat rooms and social network environments, have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this 'anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.
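
A toy illustration of normalization as dictionary lookup; the chatspeak spellings and their standard Dutch forms below are invented examples, not entries from the Chatty gold standard, and real systems need to handle far more variation than a fixed word list.

    # Dictionary-based token normalization (illustrative mapping only).
    NORMALIZATION = {
        "ni":  "niet",
        "da":  "dat",
        "egt": "echt",
        "w8":  "wacht",
    }

    def normalize(post):
        return " ".join(NORMALIZATION.get(tok, tok) for tok in post.lower().split())

    print(normalize("da was egt ni leuk"))   # -> "dat was echt niet leuk"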

01 Jan 2012
TL;DR: A case study of the learnability of this task is presented on the basis of a corpus of commands for the card game patience, followed by results of preliminary experiments using a shallow concept-tagging approach.
Abstract: This paper describes research within the ALADIN project, which aims to develop an adaptive, assistive vocal interface for people with a physical impairment. One of the components in this interface is a self-learning grammar module, which maps a user's utterance to its intended meaning. This paper describes a case study of the learnability of this task on the basis of a corpus of commands for the card game patience. The collection, transcription and annotation of this corpus are outlined in this paper, followed by results of preliminary experiments using a shallow concept-tagging approach. Encouraging results are observed during learning curve experiments, which gauge the minimal amount of training data needed to trigger accurate concept tagging of previously unseen utterances.


Journal ArticleDOI
TL;DR: It is demonstrated for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words, although they are relatively content-independent and well spread over texts, should not be exaggerated, since their distribution significantly correlates with the internal structure of that oeuvre.
Abstract: We explore the application of stylometric methods developed for modern texts to rhymed medieval narratives (Jacob van Maerlant and Lodewijk van Velthem, ca. 1260–1330). Because of the peculiarities of medieval text transmission, we propose to use highly frequent rhyme words for authorship attribution. First, we shall demonstrate that these offer important benefits, being relatively content-independent and well spread over texts. Subsequent experimentation shows that correspondence analyses can indeed detect authorial differences using highly frequent rhyme words. Finally, we demonstrate for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words should not be exaggerated, since their distribution significantly correlates with the internal structure of that oeuvre.
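
A compact numpy sketch of correspondence analysis applied to a texts-by-rhyme-words frequency table, as the abstract describes; the counts are invented, not the study's data, and a full analysis would of course use many more texts and rhyme words.

    # Correspondence analysis: standardized residuals of the contingency
    # table, SVD, then row (text) principal coordinates.
    import numpy as np

    N = np.array([[30.0, 5.0, 12.0],      # rows: texts, columns: rhyme words
                  [28.0, 7.0, 10.0],
                  [ 6.0, 25.0, 4.0]])
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (np.diag(1 / np.sqrt(r)) @ U) * sv   # principal coordinates
    print(row_coords[:, :2])               # texts plotted on the first two axes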

Proceedings Article
12 Jul 2012
TL;DR: The overall learning framework, the two components that will provide vocabulary learning and grammar induction, and encouraging results of early implementations of these vocabulary and grammar learning components are described, applied to recorded sessions of a vocally guided card game, Patience.
Abstract: This paper introduces research within the ALADIN project, which aims to develop an assistive vocal interface for people with a physical impairment. In contrast to existing approaches, the vocal interface is self-learning, which means it can be used with any language, dialect, vocabulary and grammar. This paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction. In addition, the paper describes encouraging results of early implementations of these vocabulary and grammar learning components, applied to recorded sessions of a vocally guided card game, Patience.

01 Jan 2012
TL;DR: The possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods is explored, and it is concluded that the robustness of the Afrikaans genre classification system needs improvement.
Abstract: Resource scarcity is a topic that is continually researched by the HLT community, especially in the South African context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system to Dutch texts and see encouraging results of 63.1% when classifying raw Dutch texts. We attempt to optimise the performance by employing a machine translation pre-processing step, boosting the performance of the Afrikaans system on Dutch data to 67.2%. Further investigation is required, as we conclude that the robustness of the Afrikaans genre classification system needs improvement.

23 Apr 2012
TL;DR: The 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), as mentioned in this paper, is comparable to previous successful editions, both in terms of the number of papers submitted and in terms of attendance.
Abstract: Welcome to EACL 2012, the 13th Conference of the European Chapter of the Association for Computational Linguistics. We are happy that despite strong competition from other Computational Linguistics events and economic turmoil in many European countries, this EACL is comparable to the successful previous ones, both in terms of the number of papers submitted and in terms of attendance. We have a strong scientific program, including ten workshops, four tutorials, a demos session and a student research workshop. I am convinced that you will appreciate our program.

Posted Content
TL;DR: Using a model built from a combination of text mining and time series prediction, a novel sentiment mining technique is applied in the design of the model, and the usefulness of state-of-the-art explanation-based techniques to validate the resulting models is shown.
Abstract: The efficient market hypothesis and related theories claim that it is impossible to predict future stock prices. Even so, empirical research has countered this claim by achieving better than random prediction performance. Using a model built from a combination of text mining and time series prediction, we provide further evidence to counter the efficient market hypothesis. We discuss the difficulties in evaluating such models by investigating the drawbacks of the common choices of evaluation metrics used in these empirical studies. We continue by suggesting alternative techniques to validate stock prediction models, circumventing these shortcomings. Finally, a trading system is built for the Euronext Brussels stock exchange market. In our framework, we applied a novel sentiment mining technique in the design of the model and show the usefulness of state-of-the-art explanation-based techniques to validate the resulting models.