
Showing papers by "Walter Daelemans published in 2012"


Journal ArticleDOI
TL;DR: Pattern is a package for Python 2.4+ with functionality for web mining, natural language processing, machine learning, and network analysis, including graph centrality and visualization.
Abstract: Pattern is a package for Python 2.4+ with functionality for web mining (Google + Twitter + Wikipedia, web spider, HTML DOM parser), natural language processing (tagger/chunker, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, k-means clustering, Naive Bayes + k-NN + SVM classifiers) and network analysis (graph centrality and visualization). It is well documented and bundled with 30+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.
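
A minimal sketch of the three layers mentioned in the abstract, assuming Pattern 2.x under Python 2; module and function names follow the online documentation at the time and may differ across versions.

    # Hedged sketch of Pattern's NLP and machine learning layers; API names
    # are taken from the Pattern 2.x documentation and may vary by version.
    from pattern.en import parse, sentiment
    from pattern.vector import Document, KNN

    # Natural language processing: part-of-speech tagging / chunking.
    print(parse("The quick brown fox jumps over the lazy dog."))

    # Sentiment analysis: returns a (polarity, subjectivity) pair.
    print(sentiment("A wonderfully entertaining and well-documented package."))

    # Machine learning: a tiny k-NN text classifier on bag-of-words documents.
    knn = KNN()
    for text, label in (("great fast reliable", "pos"), ("slow buggy broken", "neg")):
        knn.train(Document(text, type=label))
    print(knn.classify(Document("fast and reliable")))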

348 citations


Proceedings Article
01 May 2012
TL;DR: A new open source subjectivity lexicon for Dutch adjectives is presented: a dictionary of 1,100 adjectives that occur frequently in online product reviews, manually annotated per word sense with polarity strength, subjectivity and intensity.
Abstract: We present a new open source subjectivity lexicon for Dutch adjectives. The lexicon is a dictionary of 1,100 adjectives that occur frequently in online product reviews, manually annotated with polarity strength, subjectivity and intensity, for each word sense. We discuss two machine learning methods (using distributional extraction and synset relations) to automatically expand the lexicon to 5,500 words. We evaluate the lexicon by comparing it to the user-given star rating of online product reviews. We show promising results in both in-domain and cross-domain evaluation. The lexicon is publicly available as part of the PATTERN software package (http://www.clips.ua.ac.be/pages/pattern).
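
Since the lexicon ships as part of the PATTERN package, it can be queried through Pattern's Dutch module. The sketch below assumes pattern.nl exposes a sentiment() function analogous to pattern.en; the exact signature is an assumption based on the Pattern documentation.

    # Hedged sketch: assumes the Dutch lexicon is exposed via pattern.nl in
    # the same way pattern.en exposes its English lexicon.
    from pattern.nl import sentiment

    # Returns an approximate (polarity, subjectivity) pair.
    print(sentiment("Een geweldig product, werkt uitstekend."))
    print(sentiment("Waardeloos toestel, ging meteen kapot."))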

58 citations


Proceedings Article
01 May 2012
TL;DR: ConanDoyle-neg is presented, a corpus of stories by Conan Doyle annotated with negation information, in which the negation cues and their scope, as well as the negated event or property, have been annotated by two annotators.
Abstract: In this paper we present ConanDoyle-neg, a corpus of stories by Conan Doyle annotated with negation information. The negation cues and their scope, as well as the event or property that is negated have been annotated by two annotators. The inter-annotator agreement is measured in terms of F-scores at scope level. It is higher for cues (94.88 and 92.77), less high for scopes (85.04 and 77.31), and lower for the negated event (79.23 and 80.67). The corpus is publicly available.

55 citations


Journal ArticleDOI
TL;DR: The suggested framework introduces a generic text mining approach to analyse media coverage on political issues, including a set of methodological guidelines, evaluation metrics, as well as open source opinion mining tools.
Abstract: At the end of 2011, Belgium formed a government after a world-record-breaking period of 541 days of negotiations. We have gathered and analysed 68,000 related on-line news articles published in 2011 in Flemish newspapers. These articles were analysed by a custom-built expert system. The results of our text mining analyses show interesting differences in media coverage and votes for several political parties and politicians. With opinion mining, we are able to automatically detect the sentiment of each article, allowing us to visualise how the tone of reporting evolved throughout the year at the party, politician and newspaper level. Our suggested framework introduces a generic text mining approach to analysing media coverage of political issues, including a set of methodological guidelines, evaluation metrics, and open source opinion mining tools. Since all analyses are based on automated text mining algorithms, an objective overview of the manner of reporting is provided. The analysis shows peaks of positive and negative sentiment during key moments in the negotiation process.
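
For illustration only (this is not the authors' expert system): once each article carries a sentiment score, tracking the tone of reporting per party and per month is a simple aggregation. Column names and values below are placeholders.

    # Illustrative aggregation of per-article sentiment by party and month;
    # the dataframe columns and scores are hypothetical.
    import pandas as pd

    articles = pd.DataFrame({
        "date":      pd.to_datetime(["2011-01-10", "2011-01-25", "2011-06-30"]),
        "party":     ["N-VA", "PS", "N-VA"],
        "sentiment": [0.12, -0.35, -0.08],   # e.g. polarity in [-1, 1]
    })

    monthly_tone = (articles
                    .set_index("date")
                    .groupby("party")
                    .resample("M")["sentiment"]
                    .mean())
    print(monthly_tone)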

49 citations


Journal ArticleDOI
TL;DR: In this paper, the authors stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature, and apply the technique to authorship verification across genres, an extremely complex text categorization problem.
Abstract: In this paper we will stress-test a recently proposed technique for computational authorship verification, "unmasking", which has been well received in the literature. The technique envisages an experimental set-up commonly referred to as "authorship verification", a task generally deemed more difficult than so-called "authorship attribution". We will apply the technique to authorship verification across genres, an extremely complex text categorization problem that so far has remained unexplored. We focus on five representative contemporary English-language authors. For each of them, the corpus under scrutiny contains several texts in two genres (literary prose and theatre plays). Our research confirms that unmasking is an interesting technique for computational authorship verification, especially yielding reliable results within the genre of (larger) prose works in our corpus. Authorship verification, however, proves much more difficult in the theatrical part of the corpus.
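
A minimal sketch of the unmasking procedure in the spirit of Koppel and Schler, not the authors' exact configuration: split two texts into chunks, repeatedly train a linear classifier to tell them apart, record its cross-validated accuracy, and drop the most discriminative features. For same-author pairs the accuracy curve tends to degrade quickly.

    # Sketch of unmasking with scikit-learn; assumes each text yields at
    # least five chunks so that 5-fold cross-validation is possible.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmask(chunks_a, chunks_b, rounds=10, drop=3):
        y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
        vec = CountVectorizer(max_features=250)            # frequent words only
        X = vec.fit_transform(chunks_a + chunks_b).toarray().astype(float)
        curve = []
        for _ in range(rounds):
            clf = LinearSVC()
            curve.append(cross_val_score(clf, X, y, cv=5).mean())
            clf.fit(X, y)
            w = clf.coef_[0]
            # zero out the strongest features in both directions
            worst = np.argsort(w)[:drop].tolist() + np.argsort(w)[-drop:].tolist()
            X[:, worst] = 0.0
        return curve                                       # degradation curve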

47 citations


Proceedings Article
12 Jul 2012
TL;DR: This paper presents an approach to automatically annotate sentences in medical abstracts with these labels using kLog, a new language for statistical relational learning with kernels, and shows a clear improvement with respect to state-of-the-art systems.
Abstract: Evidence-based medicine is an approach whereby clinical decisions are supported by the best available findings gained from scientific research. This requires efficient access to such evidence. To this end, abstracts in evidence-based medicine can be labeled using a set of predefined medical categories, the so-called PICO criteria. This paper presents an approach to automatically annotate sentences in medical abstracts with these labels. Since both structural and sequential information are important for this classification task, we use kLog, a new language for statistical relational learning with kernels. Our results show a clear improvement with respect to state-of-the-art systems.

44 citations


Journal ArticleDOI
30 Jan 2012
TL;DR: A system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed is presented.
Abstract: We present a system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed. With this system, we competed in Track 2 of the 2011 Medical NLP Challenge, where the task was to distinguish between fifteen emotion labels, from guilt, sorrow, and hopelessness to hopefulness and happiness. Since a sentence can be annotated with multiple emotions, we designed a thresholding approach that enables assigning multiple labels to a single instance. We rely on the probability estimates returned by an SVM classifier and experimentally set thresholds on these probabilities. Emotion labels are assigned only if their probability exceeds a certain threshold and if the probability of the sentence being emotion-free is low enough. We show the advantages of this thresholding approach by comparing it to a naive system that assigns only the most probable label to each test sentence, and to a system trained on emotion-carrying sentences only.
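
A sketch of the thresholding idea only; the label set, the probability estimates and the threshold values below are placeholders, not the values tuned in the paper.

    # Assign every emotion label whose probability exceeds a threshold,
    # provided the "no emotion" probability is low enough; otherwise fall
    # back to the single most probable emotion. All values are illustrative.
    def assign_emotions(probs, label_threshold=0.3, no_emotion_threshold=0.5,
                        no_emotion_label="no_emotion"):
        """probs: dict mapping each label to the classifier's probability."""
        if probs[no_emotion_label] >= no_emotion_threshold:
            return []                     # treated as an emotion-free sentence
        labels = [l for l, p in probs.items()
                  if l != no_emotion_label and p >= label_threshold]
        if not labels:                    # fall back to the single best label
            best = max((l for l in probs if l != no_emotion_label),
                       key=lambda l: probs[l])
            labels = [best]
        return labels

    print(assign_emotions({"guilt": 0.42, "sorrow": 0.31,
                           "hopefulness": 0.05, "no_emotion": 0.10}))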

27 citations


Proceedings Article
01 Jan 2012
TL;DR: A dictionary of words and expressions relating to predators' grooming stages is described, which was used to identify which posts in the predators' conversations were most distinctive for their grooming behavior.
Abstract: In this paper we present a new approach for detecting online pedophiles in chat rooms that combines the results of predictions on the level of the individual post, the level of the user and the level of the entire conversation, and describe the results of this three-stage system in the PAN 2012 competition. Also, we describe a resampling and a filtering strategy to circumvent issues regarding the unbalanced dataset. Finally, we describe the creation of a dictionary of words and expressions relating to predators' grooming stages, which we used to identify which posts in the predators' conversations were most distinctive for their grooming behavior.
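
The abstract does not spell out the resampling strategy; as one common illustration of rebalancing such data, the sketch below randomly undersamples the (much larger) non-predator class. It is an assumption for illustration, not the authors' exact procedure.

    # Random undersampling of the majority class as one way to rebalance an
    # unbalanced training set (illustrative, not the paper's exact strategy).
    import random

    def undersample(majority, minority, ratio=1.0, seed=42):
        """Keep all minority examples and about ratio * len(minority) majority ones."""
        rng = random.Random(seed)
        keep = rng.sample(majority, min(len(majority), int(ratio * len(minority))))
        sample = keep + list(minority)
        rng.shuffle(sample)
        return sample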

27 citations


01 Jan 2012
TL;DR: deLearyous, as discussed by the authors, is a proof-of-concept of a serious game that will assist in the training of communication skills following the Interpersonal Circumplex (also known as Leary's Rose), a framework for interpersonal communication.
Abstract: We describe project deLearyous, in which the goal is to develop a proof-of-concept of a serious game that will assist in the training of communication skills following the Interpersonal Circumplex (also known as Leary's Rose), a framework for interpersonal communication. Users will interact with the application using unconstrained written natural language input and will engage in conversation with a 3D virtual agent. The application will thus alleviate the need for expensive communication coaching and will offer players a non-threatening environment in which to practice their communication skills. We outline the preliminary data collection procedure, as well as the workings of each of the modules that make up the application pipeline. We evaluate the modules' performance and offer our thoughts on what can be expected from the final proof-of-concept application.

To get a firm grasp on the structure and dynamics of human-to-human conversations, we first gathered data from a series of "Wizard of Oz" experiments in which the virtual agent was replaced with a human actor. All data was subsequently transcribed, analysed and annotated. This data functioned as the basis for all modules in the application pipeline: the NLP module, the scenario engine, the visualization module, and the audio module.

The freeform, unconstrained text input from the player is first processed by a Natural Language Processing (NLP) module, which uses machine learning to automatically identify the position of the player on the Interpersonal Circumplex. The NLP module also identifies the topic of the player's input using a keyword-based approach. The output of the NLP module is sent to the scenario engine, which implements the virtual agent's conversation options as a finite state machine. Given the virtual agent's previous state and Circumplex position, it predicts the most likely follow-up state. The follow-up state is then realized by the visualization and audio modules. The visualization module takes care of displaying the 3D virtual agent's facial and torso animations, while the audio module looks up and plays the appropriate pre-recorded audio responses.

In terms of performance, the NLP module appears to be a bottleneck, as finding the position of the player on the Interpersonal Circumplex is a very difficult problem to solve automatically. However, we show that human agreement on this task is also very low, indicating that there is not always a single "correct" way to interpret Circumplex positions. We conclude by stating that applications like deLearyous show promise, but we also readily admit that the technology still has some way to go before such applications can be used without human supervision.
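
A toy illustration of the scenario engine described above as a finite state machine; the states, Circumplex positions and transitions are invented for the example and are not taken from the project's actual scenario.

    # Hypothetical scenario engine: (current state, player's Circumplex
    # position) determines the virtual agent's next conversational state.
    TRANSITIONS = {
        ("greeting", "dominant"):        "defensive_reply",
        ("greeting", "submissive"):      "reassuring_reply",
        ("defensive_reply", "together"): "cooperative_reply",
    }

    def next_state(current, circumplex_position, default="clarifying_question"):
        return TRANSITIONS.get((current, circumplex_position), default)

    print(next_state("greeting", "dominant"))   # -> defensive_reply
    print(next_state("greeting", "opposed"))    # -> clarifying_question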

25 citations


Proceedings Article
01 Jan 2012
TL;DR: The pilot task Processing modality and negation, organized in the framework of the Question Answering for Machine Reading Evaluation Lab at CLEF 2012, is defined as an annotation exercise consisting of determining whether an event mentioned in a text is presented as negated, modalised (i.e. affected by an expression of modality), or both.
Abstract: This paper describes the pilot task Processing modality and negation, which was organized in the framework of the Question Answering for Machine Reading Evaluation Lab at CLEF 2012. The task was defined as an annotation exercise consisting of determining whether an event mentioned in a text is presented as negated, modalised (i.e. affected by an expression of modality), or both. Three teams participated in the task, submitting a total of 6 runs. The highest score obtained by a system was a macro-averaged F1 measure of 0.6368.
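
The reported score is a macro-averaged F1: the F1 of each class is computed separately and the results averaged. The helper below shows the computation; the per-class counts are illustrative, not the task's data.

    # Macro-averaged F1 from per-class (true positive, false positive,
    # false negative) counts; the example counts are made up.
    def f1(tp, fp, fn):
        p = tp / float(tp + fp) if tp + fp else 0.0
        r = tp / float(tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_f1(per_class_counts):
        scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
        return sum(scores) / len(scores)

    # e.g. classes: negated, modalised, both, neither
    print(macro_f1([(50, 10, 12), (30, 20, 15), (8, 6, 9), (200, 25, 18)]))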

21 citations


Proceedings ArticleDOI
01 Jan 2012
TL;DR: The paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction, and encouraging results of early implementations of these vocabulary and grammar learning components are described.
Abstract: This paper introduces research within the ALADIN project, which aims to develop an assistive vocal interface for people with a physical impairment. In contrast to existing approaches, the vocal interface is self-learning which means it can be used with any language, dialect, vocabulary and grammar. The paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction. In addition, the paper describes encouraging results of early implementations of these vocabulary and grammar learning components, applied to recorded sessions of a vocally guided card game, patience.

Proceedings Article
01 Dec 2012
TL;DR: It is shown that classification performance is significantly higher when the inflective character of the language is taken into account by using character n-grams as opposed to the more common bag-of-words approach, indicating that topic classification is possible even for languages for which automatic grammatical tools are not available.
Abstract: Despite the existence of many effective methods to solve topic classification tasks for such widely used languages as English, there is no clear answer whether these methods are suitable for languages that are substantially different. We attempt to solve a topic classification task for Lithuanian, a relatively resource-scarce language that is highly inflective, has a rich vocabulary, and a complex word derivation system. We show that classification performance is significantly higher when the inflective character of the language is taken into account by using character n-grams as opposed to the more common bag-of-words approach. These results are not only promising for Lithuanian, but also for other languages with similar properties. We show that the performance of classifiers based on character n-grams even surpasses that of classifiers built on stemmed or lemmatized text. This indicates that topic classification is possible even for languages for which automatic grammatical tools are not available.
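
A minimal contrast between the two feature sets discussed above, using scikit-learn rather than the authors' own pipeline; the corpus snippets and labels are placeholders.

    # Bag-of-words vs. character n-gram features for topic classification;
    # the tiny Lithuanian-like corpus is invented for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    bow_clf = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())
    char_clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),  # char n-grams
        MultinomialNB())

    texts = ["sportas ir krepsinis", "politika ir vyriausybe",
             "krepsinio rungtynes", "vyriausybes politika"]
    labels = ["sport", "politics", "sport", "politics"]
    for clf in (bow_clf, char_clf):
        clf.fit(texts, labels)
        print(clf.predict(["naujos krepsinio taisykles"]))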

01 Jan 2012
TL;DR: The task of machine reading of biomedical texts about Alzheimer's disease, which is a pilot task of the Question Answering for Machine Reading Evaluation (QA4MRE) Lab at CLEF 2012 as discussed by the authors, aims at exploring the ability of a machine reading system to answer questions about a scientific topic.
Abstract: This report describes the task Machine reading of biomedical texts about Alzheimer’s disease, which is a pilot task of the Question Answering for Machine Reading Evaluation (QA4MRE) Lab at CLEF 2012. The task aims at exploring the ability of a machine reading system to answer questions about a scientific topic, namely Alzheimer’s disease. As in the QA4MRE task, participant systems were asked to read a document and identify the answers to a set of questions about information that is stated or implied in the text. A background collection was provided for systems to acquire background knowledge. The background collection is a corpus newly compiled for this task, the Alzheimer’s Disease Literature Corpus. Seven teams participated in the task submitting a total of 43 runs. The highest score obtained by a team was 0.55 c@1, which is clearly above baseline.
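
The c@1 measure rewards leaving a question unanswered over answering it wrongly. The helper below follows the standard definition of c@1 (Penas and Rodrigo, 2011); the counts are placeholders, not the task results.

    # c@1 = (n_correct + n_unanswered * n_correct / n) / n, where n is the
    # total number of questions. Example counts are invented.
    def c_at_1(n_correct, n_unanswered, n_total):
        return (n_correct + n_unanswered * (float(n_correct) / n_total)) / n_total

    print(c_at_1(n_correct=20, n_unanswered=10, n_total=40))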

01 Jan 2012
TL;DR: Annotation methods are presented that provide a training set of compounds labelled with the appropriate semantic class, and a support vector machine uses a distributional lexical semantics representation of the compound's constituents to make its classification decision.
Abstract: This article presents initial results on a supervised machine learning approach to determine the semantics of noun compounds in Dutch and Afrikaans. After a discussion of previous research on the topic, we present our annotation methods used to provide a training set of compounds with the appropriate semantic class. The support vector machine method used for this classification experiment utilizes a distributional lexical semantics representation of the compound's constituents to make its classification decision. The collection of words that occur in the near context of a constituent is considered an implicit representation of the semantics of this constituent. F-scores of 47.8% for Dutch and 51.1% for Afrikaans were reached. Keywords: compound semantics; Afrikaans; Dutch; machine learning; distributional methods
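
A toy sketch of the general idea only, not the authors' feature set: represent each constituent by the words co-occurring with it in a corpus, combine the two constituent representations, and classify the compound's semantic relation with an SVM. The corpus sentences, compounds and class labels are invented.

    # Distributional context vectors for compound constituents + SVM classifier.
    from collections import Counter
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import SVC

    def context_vector(word, sentences, window=2, prefix=""):
        """Bag of words co-occurring with `word` within a small window."""
        counts = Counter()
        for sent in sentences:
            toks = sent.split()
            for i, t in enumerate(toks):
                if t == word:
                    for c in toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]:
                        counts[prefix + c] += 1
        return counts

    corpus = ["de verse appel ligt op tafel",
              "de taart staat in de oven",
              "een huis met een grote deur",
              "de deur van het huis is rood"]

    # (modifier, head, hypothetical semantic class)
    compounds = [("appel", "taart", "MADE_OF"), ("huis", "deur", "PART_OF")]

    features = []
    for mod, head, _ in compounds:
        f = context_vector(mod, corpus, prefix="mod_")
        f.update(context_vector(head, corpus, prefix="head_"))
        features.append(f)

    X = DictVectorizer().fit_transform(features)
    y = [label for _, _, label in compounds]
    clf = SVC(kernel="linear").fit(X, y)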

Proceedings Article
01 Jan 2012
TL;DR: This paper presents a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog and proposes to normalize this ‘anomalous' input into a format suitable for existing NLP solutions for standard Dutch.
Abstract: Although in recent years numerous forms of Internet communication, such as e-mail, blogs, chat rooms and social network environments, have emerged, balanced corpora of Internet speech with trustworthy meta-information (e.g. age and gender) or linguistic annotations are still limited. In this paper we present a large corpus of Flemish Dutch chat posts that were collected from the Belgian online social network Netlog. For all of these posts we also acquired the users' profile information, making this corpus a unique resource for computational and sociolinguistic research. However, for analyzing such a corpus on a large scale, NLP tools are required, e.g. for automatic POS tagging or lemmatization. Because many NLP tools fail to correctly analyze the surface forms of chat language usage, we propose to normalize this 'anomalous' input into a format suitable for existing NLP solutions for standard Dutch. Additionally, we have annotated a substantial part of the corpus (i.e. the Chatty subset) to provide a gold standard for the evaluation of future approaches to automatic (Flemish) chat language normalization.
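
A toy illustration of normalization as dictionary lookup; the chatspeak spellings and their standard Dutch forms below are invented examples, not entries from the Chatty gold standard, and real systems need to handle far more variation than a fixed word list.

    # Dictionary-based token normalization (illustrative mapping only).
    NORMALIZATION = {
        "ni":  "niet",
        "da":  "dat",
        "egt": "echt",
        "w8":  "wacht",
    }

    def normalize(post):
        return " ".join(NORMALIZATION.get(tok, tok) for tok in post.lower().split())

    print(normalize("da was egt ni leuk"))   # -> "dat was echt niet leuk"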

01 Jan 2012
TL;DR: A case study of the learnability of this task is presented on the basis of a corpus of commands for the card game patience, followed by results of preliminary experiments using a shallow concept-tagging approach.
Abstract: This paper describes research within the ALADIN project, which aims to develop an adaptive, assistive vocal interface for people with a physical impairment. One of the components in this interface is a self-learning grammar module, which maps a user's utterance to its intended meaning. This paper describes a case study of the learnability of this task on the basis of a corpus of commands for the card game patience. The collection, transcription and annotation of this corpus are outlined in this paper, followed by results of preliminary experiments using a shallow concept-tagging approach. Encouraging results are observed during learning curve experiments, which gauge the minimal amount of training data needed to trigger accurate concept tagging of previously unseen utterances.


Journal ArticleDOI
TL;DR: It is demonstrated for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words, although they are relatively content-independent and well spread over texts, should not be exaggerated, since their distribution significantly correlates with the internal structure of that oeuvre.
Abstract: We explore the application of stylometric methods developed for modern texts to rhymed medieval narratives (Jacob van Maerlant and Lodewijk van Velthem, ca. 1260–1330). Because of the peculiarities of medieval text transmission, we propose to use highly frequent rhyme words for authorship attribution. First, we shall demonstrate that these offer important benefits, being relatively content-independent and well spread over texts. Subsequent experimentation shows that correspondence analyses can indeed detect authorial differences using highly frequent rhyme words. Finally, we demonstrate for Maerlant's oeuvre that the stylistic stability of these highly frequent rhyme words should not be exaggerated, since their distribution significantly correlates with the internal structure of that oeuvre.
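
A compact numpy sketch of correspondence analysis applied to a texts-by-rhyme-words frequency table, as the abstract describes; the counts are invented, not the study's data, and a full analysis would of course use many more texts and rhyme words.

    # Correspondence analysis: standardized residuals of the contingency
    # table, SVD, then row (text) principal coordinates.
    import numpy as np

    N = np.array([[30.0, 5.0, 12.0],      # rows: texts, columns: rhyme words
                  [28.0, 7.0, 10.0],
                  [ 6.0, 25.0, 4.0]])
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses
    c = P.sum(axis=0)                      # column masses
    S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (np.diag(1 / np.sqrt(r)) @ U) * sv   # principal coordinates
    print(row_coords[:, :2])               # texts plotted on the first two axes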

Proceedings Article
12 Jul 2012
TL;DR: The overall learning framework, the two components that will provide vocabulary learning and grammar induction, and encouraging results of early implementations of these vocabulary and grammar learning components are described, applied to recorded sessions of a vocally guided card game, Patience.
Abstract: This paper introduces research within the ALADIN project, which aims to develop an assistive vocal interface for people with a physical impairment. In contrast to existing approaches, the vocal interface is self-learning, which means it can be used with any language, dialect, vocabulary and grammar. This paper describes the overall learning framework, and the two components that will provide vocabulary learning and grammar induction. In addition, the paper describes encouraging results of early implementations of these vocabulary and grammar learning components, applied to recorded sessions of a vocally guided card game, Patience.

01 Jan 2012
TL;DR: The possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods is explored, and it is concluded that the robustness of the Afrikaans genre classification system needs improvement.
Abstract: Resource scarcity is a topic that is continually researched by the HLT community, especially in the South African context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system to Dutch texts and see encouraging results of 63.1% when classifying raw Dutch texts. We attempt to optimise the performance by employing a machine translation pre-processing step, boosting the performance of the Afrikaans system on Dutch data to 67.2%. Further investigation is required, as we conclude that the robustness of the Afrikaans genre classification system needs improvement.

23 Apr 2012
TL;DR: The 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2012), as mentioned in this paper, is comparable to previous successful editions, both in terms of the number of papers submitted and in terms of attendance.
Abstract: Welcome to EACL 2012, the 13th Conference of the European Chapter of the Association for Computational Linguistics. We are happy that despite strong competition from other Computational Linguistics events and economic turmoil in many European countries, this EACL is comparable to the successful previous ones, both in terms of the number of papers submitted and in terms of attendance. We have a strong scientific program, including ten workshops, four tutorials, a demos session and a student research workshop. I am convinced that you will appreciate our program.

Posted Content
TL;DR: Using a model built from a combination of text mining and time series prediction, a novel sentiment mining technique is applied in the design of the model, and the usefulness of state-of-the-art explanation-based techniques to validate the resulting models is shown.
Abstract: The efficient market hypothesis and related theories claim that it is impossible to predict future stock prices. Even so, empirical research has countered this claim by achieving better than random prediction performance. Using a model built from a combination of text mining and time series prediction, we provide further evidence to counter the efficient market hypothesis. We discuss the difficulties in evaluating such models by investigating the drawbacks of the common choices of evaluation metrics used in these empirical studies. We continue by suggesting alternative techniques to validate stock prediction models, circumventing these shortcomings. Finally, a trading system is built for the Euronext Brussels stock exchange market. In our framework, we applied a novel sentiment mining technique in the design of the model and show the usefulness of state-of-the-art explanation-based techniques to validate the resulting models.