Author

Konstantinos Chandrinos

Bio: Konstantinos Chandrinos is an academic researcher. The author has contributed to research in topics: Naive Bayes classifier & The Internet. The author has an h-index of 7 and has co-authored 8 publications receiving 1641 citations.

Papers
Posted Content
TL;DR: The authors conclude that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.
Abstract: It has recently been argued that a Naive Bayesian classifier can be used to filter unsolicited bulk e-mail (“spam”). We conduct a thorough evaluation of this proposal on a corpus that we make publicly available, contributing towards standard benchmarks. At the same time we investigate the effect of attribute-set size, training-corpus size, lemmatization, and stop-lists on the filter’s performance, issues that had not been previously explored. After introducing appropriate cost-sensitive evaluation measures, we reach the conclusion that additional safety nets are needed for the Naive Bayesian anti-spam filter to be viable in practice.

641 citations

Posted Content
TL;DR: In this article, a Naive Bayesian classifier is trained automatically to detect spam messages, and a large collection of personal e-mail messages is made publicly available in "encrypted" form, contributing towards standard benchmarks.
Abstract: The growing problem of unsolicited bulk e-mail, also known as "spam", has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in "encrypted" form contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

464 citations

Proceedings ArticleDOI
01 Jul 2000
TL;DR: This work introduces appropriate cost-sensitive measures, and investigates at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments.
Abstract: The growing problem of unsolicited bulk e-mail, also known as “spam”, has generated a need for reliable anti-spam e-mail filters. Filters of this type have so far been based mostly on manually constructed keyword patterns. An alternative approach has recently been proposed, whereby a Naive Bayesian classifier is trained automatically to detect spam messages. We test this approach on a large collection of personal e-mail messages, which we make publicly available in “encrypted” form contributing towards standard benchmarks. We introduce appropriate cost-sensitive measures, investigating at the same time the effect of attribute-set size, training-corpus size, lemmatization, and stop lists, issues that have not been explored in previous experiments. Finally, the Naive Bayesian filter is compared, in terms of performance, to a filter that uses keyword patterns, and which is part of a widely used e-mail reader.

448 citations

Book ChapterDOI
04 Jul 1998
TL;DR: This work describes Trafficopter, a multi-agent system that collects and propagates traffic information in an urban setting using distributed methods, built with a traffic simulator and a set of PDAs/WindowsCE-based terminals equipped with GPS and wireless transceivers.
Abstract: We describe Trafficopter, a multi-agent system that collects and propagates traffic information in an urban setting using distributed methods. Agents inside the vehicles themselves collect and propagate traffic-related information in a decentralized, self-organizing fashion with no single point of failure. The key ideas in this system are the use of the vehicles/agents themselves as a way of collecting traffic data and the way those data are distributed to the interested vehicles. The tools used are a traffic simulator and a set of PDAs/WindowsCE-based terminals equipped with GPS and wireless transceivers. The simulator is used to investigate the validity of traffic control and information propagation algorithms in a distributed environment, and the WindowsCE terminals are used to apply these ideas in the real world.

42 citations

Proceedings ArticleDOI
25 May 2011
TL;DR: ELS, a new method for entity-level sentiment classification using sequence modeling by Conditional Random Fields, performs better than the common bag-of-words approaches, especially when the authors target the local sentiment in small parts of a larger document.
Abstract: We introduce ELS, a new method for entity-level sentiment classification using sequence modeling by Conditional Random Fields (CRF). The CRF is trained to identify the sentiment of each word in a document, which is then used to determine the sentiment for the entity, based on where it appears in the text. Due to its sequential nature, the CRF classifier performs better than the common bag-of-words approaches, especially when we target the local sentiment in small parts of a larger document. Identifying the sentiment about a specific entity, mentioned in a blog post or a larger product review, is a special case of such local sentiment classification. Furthermore, the proposed approach performs well even in short pieces of text, where bag-of-words approaches usually fail, due to the sparseness of the resulting feature vector. We have implemented and tested the proposed method on a publicly available benchmark corpus of short product reviews in English. The results that we present in this paper improve significantly upon published results on the same data, thus confirming our intuition about the approach.

30 citations


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. 
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal ArticleDOI
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

7,539 citations

Book
01 Dec 2006
TL;DR: Providing an in-depth examination of core text mining and link detection algorithms and operations, this text examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches.
Abstract: 1. Introduction to text mining; 2. Core text mining operations; 3. Text mining preprocessing techniques; 4. Categorization; 5. Clustering; 6. Information extraction; 7. Probabilistic models for information extraction; 8. Preprocessing applications using probabilistic and hybrid approaches; 9. Presentation-layer considerations for browsing and query refinement; 10. Visualization approaches; 11. Link analysis; 12. Text mining applications; Appendix; Bibliography.

1,628 citations

Book
12 Mar 2012
TL;DR: Comprehensive and coherent, this hands-on text develops everything from basic reasoning to advanced techniques within the framework of graphical models, and helps students develop analytical and problem-solving skills that equip them for the real world.
Abstract: Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly. People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques; they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online.

1,474 citations

Journal ArticleDOI
TL;DR: This survey of MAS is intended to serve as an introduction to the field and as an organizational framework, and highlights how multiagent systems can be and have been used to build complex systems.
Abstract: Distributed Artificial Intelligence (DAI) has existed as a subfield of AI for less than two decades. DAI is concerned with systems that consist of multiple independent entities that interact in a domain. Traditionally, DAI has been divided into two sub-disciplines: Distributed Problem Solving (DPS) focuses on the information management aspects of systems with several components working together towards a common goal; Multiagent Systems (MAS) deals with behavior management in collections of several independent entities, or agents. This survey of MAS is intended to serve as an introduction to the field and as an organizational framework. A series of general multiagent scenarios are presented. For each scenario, the issues that arise are described along with a sampling of the techniques that exist to deal with them. The presented techniques are not exhaustive, but they highlight how multiagent systems can be and have been used to build complex systems. When options exist, the techniques presented are biased towards machine learning approaches. Additional opportunities for applying machine learning to MAS are highlighted and robotic soccer is presented as an appropriate test bed for MAS. This survey does not focus exclusively on robotic systems. However, we believe that much of the prior research in non-robotic MAS is relevant to robotic MAS, and we explicitly discuss several robotic MAS, including all of those presented in this issue.

1,073 citations