Journal ArticleDOI

An ontology enhanced parallel SVM for scalable spam filter training

01 May 2013-Neurocomputing (Elsevier)-Vol. 108, pp 45-57
TL;DR: Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.
About: This article is published in Neurocomputing. The article was published on 2013-05-01 and is currently open access. It has received 46 citations to date. The article focuses on the topics: Support vector machine & Ontology (information science).
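As a rough orientation only, the sketch below shows the general idea behind partition-based ("cascade"-style) parallel SVM training, not the paper's actual implementation or its ontology-based feature augmentation; it assumes scikit-learn and uses a synthetic dataset as a stand-in for a vectorised spam corpus.

```python
# A minimal sketch of partition-based parallel SVM training (cascade style),
# NOT the cited paper's exact algorithm: split the data, train an SVM per
# chunk, then retrain a global SVM on the union of the chunks' support vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_partition(X_part, y_part):
    """Train a local SVM and return only its support vectors."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_part, y_part)
    return X_part[clf.support_], y_part[clf.support_]


# Toy stand-in for a vectorised spam corpus.
X, y = make_classification(n_samples=4000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Map" step: each partition could run on a separate worker or node.
n_partitions = 4
sv_X, sv_y = [], []
for X_part, y_part in zip(np.array_split(X_train, n_partitions),
                          np.array_split(y_train, n_partitions)):
    Xs, ys = train_partition(X_part, y_part)
    sv_X.append(Xs)
    sv_y.append(ys)

# "Reduce" step: merge support vectors and train the final model.
final_svm = SVC(kernel="linear", C=1.0)
final_svm.fit(np.vstack(sv_X), np.concatenate(sv_y))
print("held-out accuracy:", final_svm.score(X_test, y_test))
```

The point of the sketch is that each partition is trained independently and only its support vectors are merged, which is what makes the training step parallelisable.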
Citations
Journal ArticleDOI
TL;DR: CACM is really essential reading; it keeps tabs on the latest in computer science and is a valuable asset for us students, who tend to delve deep into a particular area of CS and forget everything that is happening around us.
Abstract: Communications of the ACM (CACM for short, not the best sounding acronym around) is the ACM’s flagship magazine. Started in 1957, CACM is handy for keeping up to date on current research being carried out across all topics of computer science and real-world applications. CACM has had an illustrious past with many influential pieces of work and debates started within its pages. These include Hoare’s presentation of the Quicksort algorithm; Rivest, Shamir and Adleman’s description of the first public-key cryptosystem RSA; and Dijkstra’s famous letter against the use of GOTO. In addition to the print edition, which is released monthly, there is a fantastic website (http://cacm.acm.org/) that showcases not only the most recent edition but all previous CACM articles as well, readable online as well as downloadable as a PDF. In addition, the website lets you browse for articles by subject, a handy feature if you want to focus on a particular topic. CACM is really essential reading. Pretty much guaranteed to contain content that is interesting to anyone, it keeps tabs on the latest in computer science. It is a valuable asset for us students, who tend to delve deep into a particular area of CS and forget everything that is happening around us. — Daniel Gooch Undergraduate research is like a box of chocolates: You never know what kind of project you will get. That being said, there are still a few things you should know to get the most out of the experience.

856 citations

Journal ArticleDOI
TL;DR: A novel distributed partitioning methodology for prototype reduction in nearest neighbor classification that enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss, making it a suitable tool for enhancing the performance of the nearest neighbor classifier on big data.

212 citations


Cites methods from "An ontology enhanced parallel SVM f..."

  • ...To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one....

    [...]
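As a rough illustration of the map/reduce split described in the quote above, the sketch below reduces each data partition independently and merges the partial prototype sets before fitting a nearest neighbour classifier; the per-class k-means reduction is a simplified stand-in for the cited paper's prototype reduction algorithms, and scikit-learn plus synthetic data are assumed.

```python
# Hypothetical sketch of distributed prototype reduction for k-NN (not the
# cited paper's exact algorithms): each partition is reduced independently and
# the partial prototype sets are merged into a single reduced training set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def reduce_partition(X_part, y_part, prototypes_per_class=10):
    """Map step: replace each class in the partition by a few centroids."""
    proto_X, proto_y = [], []
    for label in np.unique(y_part):
        Xc = X_part[y_part == label]
        k = min(prototypes_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        proto_X.append(km.cluster_centers_)
        proto_y.append(np.full(k, label))
    return np.vstack(proto_X), np.concatenate(proto_y)


X, y = make_classification(n_samples=6000, n_features=20, random_state=1)

# Map: reduce each partition; Reduce: concatenate the partial solutions.
parts = zip(np.array_split(X, 6), np.array_split(y, 6))
reduced = [reduce_partition(Xp, yp) for Xp, yp in parts]
RX = np.vstack([r[0] for r in reduced])
Ry = np.concatenate([r[1] for r in reduced])

knn = KNeighborsClassifier(n_neighbors=1).fit(RX, Ry)
print("prototypes kept:", len(RX), "training accuracy:", knn.score(X, y))
```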

Journal ArticleDOI
TL;DR: This work describes the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem, named ROSEFW-RF, which is based on several MapReduce approaches to balance the class distribution through random oversampling and to detect the most relevant features via an evolutionary feature weighting process.
Abstract: The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that must be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, is not straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data, and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition through an extensive experimental study that characterizes how our methodology works. From this analysis we conclude that the approach is well suited to large-scale bioinformatics classification problems.

126 citations


Cites background from "An ontology enhanced parallel SVM f..."

  • ...The latter aims at preserving the original semantics of the variable by choosing a subset of the original set of features....

    [...]
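A loose single-machine sketch of the four ROSEFW-RF stages listed in the abstract above is given below; the evolutionary feature weighting step is only approximated here by impurity-based importances, the MapReduce distribution is omitted, and scikit-learn plus a synthetic imbalanced dataset are assumptions of the example rather than the paper's setup.

```python
# Hypothetical, simplified ROSEFW-RF-style pipeline: (1) random oversampling,
# (2) feature weighting + threshold, (3) Random Forest training, (4) prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# (1) Random oversampling: duplicate minority examples until classes balance.
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = np.random.default_rng(2).choice(minority, size=len(majority) - len(minority))
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# (2) Feature weighting + threshold (stand-in for the evolutionary step).
probe = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_bal, y_bal)
keep = probe.feature_importances_ >= np.median(probe.feature_importances_)

# (3) Build the final Random Forest on the selected features.
rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_bal[:, keep], y_bal)

# (4) Classify the test data.
print("test accuracy:", rf.score(X_te[:, keep], y_te))
```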

BookDOI
01 Jan 2004
TL;DR: It is argued that formulations of better linguistic quality are beneficial for both answer extraction and HCI in the context of QA.
Abstract: In this paper, we describe our experiments in answer formulation for question-answering (QA) systems. In the context of QA, answer formulation can serve two purposes: improving answer extraction or improving human-computer interaction (HCI). Each purpose has different precision/recall requirements. We present our experiments for both purposes and argue that formulations of better linguistic quality are beneficial for both answer extraction and HCI.

100 citations

Journal ArticleDOI
TL;DR: Experimental results show that DTFS outperforms other methods, such as Chi-square, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, a two-step hybrid feature selection method, and an improved term frequency inverse document frequency method, on six corpora.
Abstract: Feature selection is often used in email classification to reduce the dimensionality of the feature space. In this study, a new document frequency and term frequency combined feature selection method (DTFS) is proposed to improve the performance of email classification. First, an existing optimal document frequency based feature selection method (ODFFS) and a predetermined threshold are applied to select the most discriminative features. Second, an existing optimal term frequency based feature selection method (OTFFS) and another predetermined threshold are applied to select more discriminative features. Finally, ODFFS and OTFFS are combined to select the remaining features. To improve the convergence rate of parameter optimization, a metaheuristic method, namely global best harmony oriented harmony search (GBHS), is proposed to search for these optimal predetermined thresholds. Experiments with fuzzy Support Vector Machine (FSVM) and Naive Bayesian (NB) classifiers are conducted on six corpora: PU2, CSDMC2010, PU3, Lingspam, Enron-spam and Trec2007. Experimental results show that DTFS outperforms other methods, such as Chi-square, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, a two-step hybrid feature selection method, and an improved term frequency inverse document frequency method, on the six corpora.

64 citations


Cites methods from "An ontology enhanced parallel SVM f..."

  • ...Currently, much work on email classification has been done using the techniques such as decision trees (DT) [3], Naïve Bayesian classifiers [4], neural networks (NN) [5], Support Vector Machines (SVM) [6], Boosting [7], k-nearest neighbor (KNN) [8] and so on....

    [...]
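The toy sketch below illustrates the general idea of combining a document-frequency criterion with a term-frequency criterion before training a spam classifier; the thresholds are fixed by hand rather than tuned with the paper's GBHS metaheuristic, and the tiny corpus, labels, and scikit-learn usage are illustrative assumptions only.

```python
# Illustrative only: pick terms that pass both a document-frequency and a
# term-frequency threshold, then train a Naive Bayes spam classifier on the
# reduced feature set. Not the cited DTFS algorithm itself.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "buy cheap meds now",
        "meeting at noon tomorrow", "lunch meeting moved to tomorrow",
        "win money now", "project deadline friday"]
labels = np.array([1, 1, 0, 0, 1, 0])     # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs).toarray()

df = (X > 0).sum(axis=0)                  # document frequency per term
tf = X.sum(axis=0)                        # corpus-wide term frequency per term
keep = (df >= 2) & (tf >= 2)              # toy thresholds, not tuned values

nb = MultinomialNB().fit(X[:, keep], labels)
print("kept", int(keep.sum()), "of", X.shape[1], "terms; train accuracy:",
      nb.score(X[:, keep], labels))
```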

References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations
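For readers who want to try the model described above, a quick usage sketch with scikit-learn's LatentDirichletAllocation follows; it is not the paper's variational EM code, just an assumed, minimal illustration of fitting LDA to a tiny bag-of-words corpus and reading off per-document topic mixtures.

```python
# Minimal LDA usage sketch: fit a 2-topic model on a toy corpus and inspect
# each document's mixture over the latent topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock prices fell sharply", "markets rallied as prices rose"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one document's mixture over the 2 latent topics.
print(lda.transform(counts).round(2))
```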

Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations


"An ontology enhanced parallel SVM f..." refers methods in this paper

  • ...We also evaluate the effect of conventional bagging and boosting techniques on the performance of the parallel SVM. Section 6 concludes the paper and points out some future work....

    [...]

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations
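The abstract's map/reduce programming model can be illustrated with a single-process word-count toy; the sketch below only mimics the map, shuffle, and reduce phases in plain Python and ignores the cluster distribution, fault tolerance, and scheduling that the real system provides.

```python
# Single-process illustration of the MapReduce programming model: a map
# function emits (key, value) pairs, a shuffle groups values by key, and a
# reduce function aggregates the values for each key.
from collections import defaultdict


def map_fn(document):
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    return word, sum(counts)


documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
```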