Journal ArticleDOI

An ontology enhanced parallel SVM for scalable spam filter training

01 May 2013-Neurocomputing (Elsevier)-Vol. 108, pp 45-57
TL;DR: Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.
About: This article is published in Neurocomputing. The article was published on 2013-05-01 and is currently open access. It has received 46 citations to date. The article focuses on the topics: Support vector machine & Ontology (information science).
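As a rough orientation only, the sketch below shows the general idea behind partition-based ("cascade"-style) parallel SVM training, not the paper's actual implementation or its ontology-based feature augmentation; it assumes scikit-learn and uses a synthetic dataset as a stand-in for a vectorised spam corpus.

```python
# A minimal sketch of partition-based parallel SVM training (cascade style),
# NOT the cited paper's exact algorithm: split the data, train an SVM per
# chunk, then retrain a global SVM on the union of the chunks' support vectors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def train_partition(X_part, y_part):
    """Train a local SVM and return only its support vectors."""
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_part, y_part)
    return X_part[clf.support_], y_part[clf.support_]


# Toy stand-in for a vectorised spam corpus.
X, y = make_classification(n_samples=4000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Map" step: each partition could run on a separate worker or node.
n_partitions = 4
sv_X, sv_y = [], []
for X_part, y_part in zip(np.array_split(X_train, n_partitions),
                          np.array_split(y_train, n_partitions)):
    Xs, ys = train_partition(X_part, y_part)
    sv_X.append(Xs)
    sv_y.append(ys)

# "Reduce" step: merge support vectors and train the final model.
final_svm = SVC(kernel="linear", C=1.0)
final_svm.fit(np.vstack(sv_X), np.concatenate(sv_y))
print("held-out accuracy:", final_svm.score(X_test, y_test))
```

The point of the sketch is that each partition is trained independently and only its support vectors are merged, which is what makes the training step parallelisable.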
Citations
Journal ArticleDOI
TL;DR: CACM is really essential reading; it keeps tabs on the latest in computer science and is a valuable asset for us students, who tend to delve deep into a particular area of CS and forget everything that is happening around us.
Abstract: Communications of the ACM (CACM for short, not the best sounding acronym around) is the ACM’s flagship magazine. Started in 1957, CACM is handy for keeping up to date on current research being carried out across all topics of computer science and real-world applications. CACM has had an illustrious past with many influential pieces of work and debates started within its pages. These include Hoare’s presentation of the Quicksort algorithm; Rivest, Shamir and Adleman’s description of the first public-key cryptosystem RSA; and Dijkstra’s famous letter against the use of GOTO. In addition to the print edition, which is released monthly, there is a fantastic website (http://cacm.acm.org/) that showcases not only the most recent edition but all previous CACM articles as well, readable online as well as downloadable as a PDF. In addition, the website lets you browse for articles by subject, a handy feature if you want to focus on a particular topic. CACM is really essential reading. Pretty much guaranteed to contain content that is interesting to anyone, it keeps tabs on the latest in computer science. It is a valuable asset for us students, who tend to delve deep into a particular area of CS and forget everything that is happening around us. — Daniel Gooch Undergraduate research is like a box of chocolates: You never know what kind of project you will get. That being said, there are still a few things you should know to get the most out of the experience.

856 citations

Journal ArticleDOI
TL;DR: A novel distributed partitioning methodology for prototype reduction in nearest neighbor classification that enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss, making it a suitable tool for enhancing the performance of the nearest neighbor classifier on big data.

212 citations


Cites methods from "An ontology enhanced parallel SVM f..."

  • ...To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms through a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one....

    [...]
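As a rough illustration of the map/reduce split described in the quote above, the sketch below reduces each data partition independently and merges the partial prototype sets before fitting a nearest neighbour classifier; the per-class k-means reduction is a simplified stand-in for the cited paper's prototype reduction algorithms, and scikit-learn plus synthetic data are assumed.

```python
# Hypothetical sketch of distributed prototype reduction for k-NN (not the
# cited paper's exact algorithms): each partition is reduced independently and
# the partial prototype sets are merged into a single reduced training set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def reduce_partition(X_part, y_part, prototypes_per_class=10):
    """Map step: replace each class in the partition by a few centroids."""
    proto_X, proto_y = [], []
    for label in np.unique(y_part):
        Xc = X_part[y_part == label]
        k = min(prototypes_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xc)
        proto_X.append(km.cluster_centers_)
        proto_y.append(np.full(k, label))
    return np.vstack(proto_X), np.concatenate(proto_y)


X, y = make_classification(n_samples=6000, n_features=20, random_state=1)

# Map: reduce each partition; Reduce: concatenate the partial solutions.
parts = zip(np.array_split(X, 6), np.array_split(y, 6))
reduced = [reduce_partition(Xp, yp) for Xp, yp in parts]
RX = np.vstack([r[0] for r in reduced])
Ry = np.concatenate([r[1] for r in reduced])

knn = KNeighborsClassifier(n_neighbors=1).fit(RX, Ry)
print("prototypes kept:", len(RX), "training accuracy:", knn.score(X, y))
```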

Journal ArticleDOI
TL;DR: This work describes the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem, named ROSEFW-RF, which is based on several MapReduce approaches to balance the class distribution through random oversampling and to detect the most relevant features via an evolutionary feature weighting process.
Abstract: The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. Rapid advances in biotechnology allow us to obtain and store large quantities of data about cells, proteins, genes, etc., that must be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, is not straightforward for most standard machine learning methods. In this work we describe the methodology that won the ECBDL'14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data, and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition through an extensive experimental study that characterizes how our methodology works. From this analysis we conclude that the approach is well suited to large-scale bioinformatics classification problems.

126 citations


Cites background from "An ontology enhanced parallel SVM f..."

  • ...The latter aims at preserving the original semantics of the variable by choosing a subset of the original set of features....

    [...]
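A loose single-machine sketch of the four ROSEFW-RF stages listed in the abstract above is given below; the evolutionary feature weighting step is only approximated here by impurity-based importances, the MapReduce distribution is omitted, and scikit-learn plus a synthetic imbalanced dataset are assumptions of the example rather than the paper's setup.

```python
# Hypothetical, simplified ROSEFW-RF-style pipeline: (1) random oversampling,
# (2) feature weighting + threshold, (3) Random Forest training, (4) prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

# (1) Random oversampling: duplicate minority examples until classes balance.
minority = np.flatnonzero(y_tr == 1)
majority = np.flatnonzero(y_tr == 0)
extra = np.random.default_rng(2).choice(minority, size=len(majority) - len(minority))
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# (2) Feature weighting + threshold (stand-in for the evolutionary step).
probe = RandomForestClassifier(n_estimators=100, random_state=2).fit(X_bal, y_bal)
keep = probe.feature_importances_ >= np.median(probe.feature_importances_)

# (3) Build the final Random Forest on the selected features.
rf = RandomForestClassifier(n_estimators=200, random_state=2).fit(X_bal[:, keep], y_bal)

# (4) Classify the test data.
print("test accuracy:", rf.score(X_te[:, keep], y_te))
```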

BookDOI
01 Jan 2004
TL;DR: It is argued that formulations of better linguistic quality are beneficial for both answer extraction and HCI in the context of QA.
Abstract: In this paper, we describe our experiments in answer formulation for question-answering (QA) systems. In the context of QA, answer formulation can serve two purposes: improving answer extraction or improving human-computer interaction (HCI). Each purpose has different precision/recall requirements. We present our experiments for both purposes and argue that formulations of better linguistic quality are beneficial for both answer extraction and HCI.

100 citations

Journal ArticleDOI
TL;DR: Experimental results show that DTFS outperforms other methods, such as Chi-square, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, a two-step hybrid feature selection method, and an improved term frequency inverse document frequency method, on six corpora.
Abstract: Feature selection is often used in email classification to reduce the dimensionality of the feature space. In this study, a new document frequency and term frequency combined feature selection method (DTFS) is proposed to improve the performance of email classification. First, an existing optimal document frequency based feature selection method (ODFFS) and a predetermined threshold are applied to select the most discriminative features. Second, an existing optimal term frequency based feature selection method (OTFFS) and another predetermined threshold are applied to select more discriminative features. Finally, ODFFS and OTFFS are combined to select the remaining features. To improve the convergence rate of parameter optimization, a metaheuristic method, namely global best harmony oriented harmony search (GBHS), is proposed to search for these optimal predetermined thresholds. Experiments with fuzzy Support Vector Machine (FSVM) and Naive Bayesian (NB) classifiers are conducted on six corpora: PU2, CSDMC2010, PU3, Lingspam, Enron-spam and Trec2007. Experimental results show that DTFS outperforms other methods, such as Chi-square, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, a two-step hybrid feature selection method, and an improved term frequency inverse document frequency method, on the six corpora.

64 citations


Cites methods from "An ontology enhanced parallel SVM f..."

  • ...Currently, much work on email classification has been done using the techniques such as decision trees (DT) [3], Naïve Bayesian classifiers [4], neural networks (NN) [5], Support Vector Machines (SVM) [6], Boosting [7], k-nearest neighbor (KNN) [8] and so on....

    [...]
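The toy sketch below illustrates the general idea of combining a document-frequency criterion with a term-frequency criterion before training a spam classifier; the thresholds are fixed by hand rather than tuned with the paper's GBHS metaheuristic, and the tiny corpus, labels, and scikit-learn usage are illustrative assumptions only.

```python
# Illustrative only: pick terms that pass both a document-frequency and a
# term-frequency threshold, then train a Naive Bayes spam classifier on the
# reduced feature set. Not the cited DTFS algorithm itself.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "buy cheap meds now",
        "meeting at noon tomorrow", "lunch meeting moved to tomorrow",
        "win money now", "project deadline friday"]
labels = np.array([1, 1, 0, 0, 1, 0])     # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(docs).toarray()

df = (X > 0).sum(axis=0)                  # document frequency per term
tf = X.sum(axis=0)                        # corpus-wide term frequency per term
keep = (df >= 2) & (tf >= 2)              # toy thresholds, not tuned values

nb = MultinomialNB().fit(X[:, keep], labels)
print("kept", int(keep.sum()), "of", X.shape[1], "terms; train accuracy:",
      nb.score(X[:, keep], labels))
```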

References
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.

30,570 citations
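For readers who want to try the model described above, a quick usage sketch with scikit-learn's LatentDirichletAllocation follows; it is not the paper's variational EM code, just an assumed, minimal illustration of fitting LDA to a tiny bag-of-words corpus and reading off per-document topic mixtures.

```python
# Minimal LDA usage sketch: fit a 2-topic model on a toy corpus and inspect
# each document's mixture over the latent topics.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock prices fell sharply", "markets rallied as prices rose"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one document's mixture over the 2 latent topics.
print(lda.transform(counts).round(2))
```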

Proceedings Article
03 Jan 2001
TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
  • Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
  • Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
  • Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations


"An ontology enhanced parallel SVM f..." refers methods in this paper

  • ...We also evaluate the effect of conventional bagging and boosting techniques on the performance of the parallel SVM. Section 6 concludes the paper and points out some future work....

    [...]

Journal ArticleDOI
TL;DR: This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.
Abstract: More than twelve years have elapsed since the first public release of WEKA. In that time, the software has been rewritten entirely from scratch, evolved substantially and now accompanies a text on data mining [35]. These days, WEKA enjoys widespread acceptance in both academia and business, has an active community, and has been downloaded more than 1.4 million times since being placed on SourceForge in April 2000. This paper provides an introduction to the WEKA workbench, reviews the history of the project, and, in light of the recent 3.6 stable release, briefly discusses what has been added since the last stable version (Weka 3.4) released in 2003.

19,603 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations
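The abstract's map/reduce programming model can be illustrated with a single-process word-count toy; the sketch below only mimics the map, shuffle, and reduce phases in plain Python and ignores the cluster distribution, fault tolerance, and scheduling that the real system provides.

```python
# Single-process illustration of the MapReduce programming model: a map
# function emits (key, value) pairs, a shuffle groups values by key, and a
# reduce function aggregates the values for each key.
from collections import defaultdict


def map_fn(document):
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    return word, sum(counts)


documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle phase: group intermediate values by key.
groups = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        groups[key].append(value)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
```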