Journal Article

Evaluating data mining procedures: techniques for generating artificial data sets

Paul D. Scott, +1 more
25 Jun 1999, Vol. 41, Iss. 9, pp. 579-587
TL;DR: It is argued that tests done with real data sets cannot provide all the information needed for a thorough assessment of the performance characteristics of data mining procedures, and that artificial data sets are therefore essential.
Abstract
In this article, we discuss the need to evaluate the performance of data mining procedures and argue that tests done with real data sets cannot provide all the information needed for a thorough assessment of their performance characteristics. We argue that artificial data sets are therefore essential. After a discussion of the desirable characteristics of such artificial data, we describe two pseudo-random generators. The first is based on the multi-variate normal distribution and gives the investigator full control of the degree of correlation between the variables in the artificial data sets. The second is inspired by fractal techniques for synthesizing artificial landscapes and can produce data whose classification complexity can be controlled by a single parameter. We conclude with a discussion of the additional work necessary to achieve the ultimate goal of a method of matching data sets to the most appropriate data mining technique.
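The first generator described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard Cholesky construction for drawing from a multivariate normal distribution, and the function names (`cholesky`, `generate_correlated`, `sample_correlation`) and the example covariance matrix are assumptions introduced here for illustration. The investigator chooses the covariance matrix, which fixes the correlation between every pair of variables.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L such that L * L^T equals cov.

    cov must be a symmetric positive-definite matrix (list of lists).
    """
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = cov[i][i] - s
                if d <= 0.0:
                    raise ValueError("covariance matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def generate_correlated(n_rows, means, cov, seed=None):
    """Draw n_rows samples from N(means, cov) as x = means + L z, z ~ N(0, I)."""
    rng = random.Random(seed)  # seeded so an experiment can be repeated
    L = cholesky(cov)
    dim = len(means)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rows.append(
            [means[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(dim)]
        )
    return rows


def sample_correlation(rows, a, b):
    """Pearson correlation between columns a and b of the generated rows."""
    n = len(rows)
    mean_a = sum(r[a] for r in rows) / n
    mean_b = sum(r[b] for r in rows) / n
    cov_ab = sum((r[a] - mean_a) * (r[b] - mean_b) for r in rows)
    var_a = sum((r[a] - mean_a) ** 2 for r in rows)
    var_b = sum((r[b] - mean_b) ** 2 for r in rows)
    return cov_ab / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Hypothetical target: two unit-variance variables with correlation 0.8.
    data = generate_correlated(5000, [0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], seed=42)
    print(round(sample_correlation(data, 0, 1), 2))
```

With unit variances the off-diagonal entries of the covariance matrix are the correlations themselves, so the sample correlation of the generated data should approach the requested value as the number of rows grows.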


Citations
Proceedings Article

Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization

TL;DR: A reliable dataset is produced that contains benign traffic and seven common attack network flows, meets real-world criteria, and is publicly available. The paper also evaluates the performance of a comprehensive set of network traffic features and machine learning algorithms to indicate the best set of features for detecting certain attack categories.
Journal Article

Towards a Reliable Intrusion Detection Benchmark Dataset

TL;DR: A comprehensive evaluation of the existing datasets using the proposed criteria, a design and evaluation framework for IDS and IPS datasets, and a dataset generation model to create a reliable IDS or IPS benchmark dataset are presented.
Proceedings Article

An Evaluation Framework for Intrusion Detection Dataset

TL;DR: This paper presents a comprehensive evaluation of the existing datasets using the proposed criteria, and proposes an evaluation framework for IDS and IPS datasets.
Journal Article

Toward data mining engineering: A software engineering approach

TL;DR: It is proposed to reuse ideas and concepts underlying the IEEE Std 1074 and ISO 12207 software engineering model processes to redefine and add to the CRISP-DM process and make it a data mining engineering standard.
Journal Article

Elements of simulation
