
Showing papers in "SIGKDD Explorations in 2000"


Journal ArticleDOI
TL;DR: Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications; its three phases of preprocessing, pattern discovery, and pattern analysis are described in detail.
Abstract: Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

2,227 citations


Journal ArticleDOI
TL;DR: This paper surveys research in the area of Web mining, points out some confusion regarding the usage of the term Web mining, suggests three Web mining categories, and then situates some of the research with respect to these three categories.
Abstract: With the huge amount of information available online, the World Wide Web is a fertile area for data mining research. Web mining research is at the crossroads of research from several research communities, such as database, information retrieval, and within AI, especially the sub-areas of machine learning and natural language processing. However, there is a lot of confusion when comparing research efforts from different points of view. In this paper, we survey the research in the area of Web mining, point out some confusion regarding the usage of the term Web mining, and suggest three Web mining categories. Then we situate some of the research with respect to these three categories. We also explore the connection between the Web mining categories and the related agent paradigm. For the survey, we focus on representation issues, on the process, on the learning algorithm, and on the application of the recent works as the criteria. We conclude the paper with some research issues.

1,699 citations


Journal ArticleDOI
TL;DR: The fundamentals of association rule mining are explained, a general framework is derived, and it turns out that the runtime behavior of the algorithms is more similar than might be expected.
Abstract: Today there are several efficient algorithms that cope with the popular and computationally expensive task of association rule mining. Actually, these algorithms are more or less described on their own. In this paper we explain the fundamentals of association rule mining and moreover derive a general framework. Based on this we describe today's approaches in context by pointing out common aspects and differences. After that we thoroughly investigate their strengths and weaknesses and carry out several runtime experiments. It turns out that the runtime behavior of the algorithms is much more similar than might be expected.

1,040 citations
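The survey above treats today's algorithms as variations on one candidate generation-and-test scheme. As a rough, minimal sketch of that common core (not a reconstruction of any particular algorithm compared in the paper; the transactions and support threshold below are invented), an Apriori-style miner in Python might look like this:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori-style frequent-itemset miner (illustrative sketch)."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}          # level-1 candidates
    frequent = {}
    while current:
        # Test phase: count candidate supports in one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generation phase: join frequent k-itemsets into (k+1)-candidates and
        # prune any candidate with an infrequent k-subset (the Apriori trick).
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        current = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        current = {c for c in current
                   if all(frozenset(s) in level for s in combinations(c, size - 1))}
    return frequent

if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(apriori(data, min_support=3))
```

The algorithms compared in the paper differ mainly in how they organize the counting pass and the candidate sets, which is exactly where the runtime differences the authors measure come from.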


Journal ArticleDOI
TL;DR: An intuitive explanation of SVMs from a geometric perspective is provided and the classification problem is used to investigate the basic concepts behind SVMs and to examine their strengths and weaknesses from a data mining perspective.
Abstract: Support Vector Machines (SVMs) and related kernel methods have become increasingly popular tools for data mining tasks such as classification, regression, and novelty detection. The goal of this tutorial is to provide an intuitive explanation of SVMs from a geometric perspective. The classification problem is used to investigate the basic concepts behind SVMs and to examine their strengths and weaknesses from a data mining perspective. While this overview is not comprehensive, it does provide resources for those interested in further exploring SVMs.

707 citations
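The tutorial's geometric view of the maximum-margin separator is easy to reproduce numerically. Below is a small, hedged illustration, assuming scikit-learn is available; the two synthetic point clouds are not data from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(+2.0, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# For the separating hyperplane w.x + b = 0, the geometric margin is 2 / ||w||.
w = clf.coef_[0]
print("number of support vectors:", len(clf.support_vectors_))
print("margin width:", 2.0 / np.linalg.norm(w))
```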


Journal ArticleDOI
TL;DR: It is shown that the support of frequent non-key patterns can be inferred from frequent key patterns without accessing the database, and PASCAL is among the most efficient algorithms for mining frequent patterns.
Abstract: In this paper, we propose the algorithm PASCAL which introduces a novel optimization of the well-known algorithm Apriori. This optimization is based on a new strategy called pattern counting inference that relies on the concept of key patterns. We show that the support of frequent non-key patterns can be inferred from frequent key patterns without accessing the database. Experiments comparing PASCAL to the three algorithms Apriori, Close and Max-Miner, show that PASCAL is among the most efficient algorithms for mining frequent patterns.

335 citations


Journal ArticleDOI
TL;DR: Recent advances in learning and mining problems related to hypertext in general and the Web in particular are surveyed and the continuum of supervised to semi-supervised to unsupervised learning problems is reviewed.
Abstract: With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of recent and ongoing research.

331 citations


Journal ArticleDOI
TL;DR: KDD-Cup 2000, the yearly competition in data mining, is described, for the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found.
Abstract: We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to "peel the onion" and drill deeper into the reasons for the initial patterns found. We chronicle the data generation phase starting from the collection at the site through its conversion to a star schema in a warehouse through data cleansing, data obfuscation for privacy protection, and data aggregation. We describe the information given to the participants, including the questions, site structure, the marketing calendar, and the data schema. Finally, we discuss interesting insights, common mistakes, and lessons learned. Three winners were announced and they describe their own experiences and lessons in the pages following this paper.

303 citations


Journal ArticleDOI
TL;DR: The task for the classifier learning contest organized in conjunction with the KDD’99 conference was to learn a predictive model capable of distinguishing between legitimate and illegitimate connections in a computer network.
Abstract: The task for the classifier learning contest organized in conjunction with the KDD’99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network. Here is a detailed description of the task. The training and test data were generously made available by Prof. Sal Stolfo of Columbia University and Prof. Wenke Lee of North Carolina State University.

299 citations


Journal ArticleDOI
TL;DR: It is shown that frequent pattern growth is efficient at mining large databases and its further development may lead to scalable mining of many other kinds of patterns as well.
Abstract: Mining frequent patterns has been a focused topic in data mining research in recent years, with the development of numerous interesting algorithms for mining association, correlation, causality, sequential patterns, partial periodicity, constraint-based frequent pattern mining, associative classification, emerging patterns, etc. Most of the previous studies adopt an Apriori-like, candidate generation-and-test approach. However, based on our analysis, candidate generation and test may still be expensive, especially when encountering long and numerous patterns. A new methodology, called frequent pattern growth, which mines frequent patterns without candidate generation, has been developed. The method adopts a divide-and-conquer philosophy to project and partition databases based on the currently discovered frequent patterns and grow such patterns to longer ones in the projected databases. Moreover, efficient data structures have been developed for effective database compression and fast in-memory traversal. Such a methodology may eliminate or substantially reduce the number of candidate sets to be generated and also reduce the size of the database to be iteratively examined, and, therefore, lead to high performance. In this paper, we provide an overview of this approach and examine its methodology and implications for mining several kinds of frequent patterns, including association, frequent closed itemsets, max-patterns, sequential patterns, and constraint-based mining of frequent patterns. We show that frequent pattern growth is efficient at mining large databases and its further development may lead to scalable mining of many other kinds of patterns as well.

286 citations
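The pattern-growth idea, mining without candidate generation by recursively projecting the database on each frequent item, can be conveyed in a few lines. The sketch below works on plain transaction lists rather than the paper's compressed FP-tree, so it only illustrates the divide-and-conquer recursion; the example transactions and threshold are invented:

```python
from collections import Counter

def pattern_growth(transactions, min_support, suffix=frozenset()):
    """Frequent-pattern growth by recursive database projection (illustrative)."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {}
    # Fix an item order so every pattern is grown exactly once.
    for item in sorted(i for i, c in counts.items() if c >= min_support):
        pattern = suffix | {item}
        frequent[pattern] = counts[item]
        # Conditional (projected) database: transactions containing `item`,
        # restricted to the items that come after it in the ordering.
        projected = [[j for j in t if j > item] for t in transactions if item in t]
        frequent.update(pattern_growth(projected, min_support, pattern))
    return frequent

if __name__ == "__main__":
    data = [{"f", "a", "c", "m", "p"}, {"f", "a", "c", "b", "m"},
            {"f", "b"}, {"c", "b", "p"}, {"f", "a", "c", "m", "p"}]
    print(pattern_growth(data, min_support=3))
```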


Journal ArticleDOI
TL;DR: A simple new algorithm that performs k-means clustering in one scan of a dataset, while using a buffer for points from the dataset of fixed size is presented, and experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality.
Abstract: This paper presents a simple new algorithm that performs k-means clustering in one scan of a dataset, while using a buffer for points from the dataset of fixed size. Experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality. The new method is a simplification of an algorithm due to Bradley, Fayyad and Reina that uses several data compression techniques in an attempt to improve speed and clustering quality. Unfortunately, the overhead of these techniques makes the original algorithm several times slower than standard k-means on materialized datasets, even though standard k-means scans a dataset multiple times. Also, lesion studies show that the compression techniques do not improve clustering quality. All results hold for 400 megabyte synthetic datasets and for a dataset created from the real-world data used in the 1998 KDD data mining contest. All algorithm implementations and experiments are designed so that results generalize to datasets of many gigabytes and larger.

263 citations
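The abstract describes the method only at a high level (one scan, fixed-size buffer). The following is a much simplified, hedged sketch of that idea rather than the authors' algorithm: data is streamed in chunks, and each full buffer is folded into per-cluster sufficient statistics (count and sum) so memory never grows:

```python
import numpy as np

def one_scan_kmeans(stream, k, dim, buffer_size=1000, seed=0):
    """Simplified one-pass k-means: whenever the buffer fills, its points are
    folded into per-cluster sufficient statistics (count, sum). Sketch only."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(k, dim))
    counts = np.zeros(k)                  # retained weight per cluster
    sums = np.zeros((k, dim))             # retained weighted sum per cluster
    buffer = []
    for point in stream:                  # a single scan of the data
        buffer.append(point)
        if len(buffer) >= buffer_size:
            counts, sums, centers = _flush(np.array(buffer), counts, sums, centers)
            buffer = []
    if buffer:
        counts, sums, centers = _flush(np.array(buffer), counts, sums, centers)
    return centers

def _flush(block, counts, sums, centers):
    # Assign buffered points to their nearest center, absorb them into the
    # sufficient statistics, then recompute the centers from those statistics.
    dist = np.linalg.norm(block[:, None, :] - centers[None, :, :], axis=2)
    assign = dist.argmin(axis=1)
    for j in range(len(centers)):
        pts = block[assign == j]
        counts[j] += len(pts)
        sums[j] += pts.sum(axis=0)
        if counts[j] > 0:
            centers[j] = sums[j] / counts[j]
    return counts, sums, centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    stream = (rng.normal(loc, 0.3, size=2) for loc in rng.choice([-3.0, 0.0, 3.0], 5000))
    print(one_scan_kmeans(stream, k=3, dim=2, buffer_size=500))
```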


Journal ArticleDOI
TL;DR: The standard sampling with replacement methodology of bagging was modified to put a specific focus on the smaller but expensive-if-predicted-wrongly classes.
Abstract: We briefly describe our approach for the KDD99 Classification Cup. The solution is essentially a mixture of bagging and boosting. Additionally, asymmetric error costs are taken into account by minimizing the so-called conditional risk. Furthermore, the standard sampling with replacement methodology of bagging was modified to put a specific focus on the smaller but expensive-if-predicted-wrongly classes.
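The described modification of bagging, resampling with replacement but with extra weight on the small, expensive-if-predicted-wrongly classes, can be sketched roughly as follows; the base learner, class weights, and majority-vote combiner are placeholders, not the settings used in the contest entry:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cost_weighted_bagging(X, y, class_weight, n_estimators=10, seed=0):
    """Bagging in which each example's bootstrap sampling probability is
    proportional to the cost weight of its class (sketch, not the contest code)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    p = np.array([class_weight[c] for c in y], dtype=float)
    p /= p.sum()
    models = []
    for _ in range(n_estimators):
        idx = rng.choice(n, size=n, replace=True, p=p)     # biased bootstrap
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict(models, X):
    # Plain majority vote over the ensemble (assumes integer class labels).
    votes = np.array([m.predict(X) for m in models])
    return np.array([np.bincount(col).argmax() for col in votes.T])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.2).astype(int)  # rare class 1
    models = cost_weighted_bagging(X, y, class_weight={0: 1.0, 1: 5.0})
    print("predicted positives:", int(predict(models, X).sum()))
```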

Journal ArticleDOI
TL;DR: The UCI KDD Archive is described, which is a new online archive of large and complex data sets that encompasses a wide variety of data types, analysis tasks, and application areas and draws parallels with the development of the UCI Machine Learning Repository.
Abstract: Advances in data collection and storage have allowed organizations to create massive, complex and heterogeneous databases, which have stymied traditional methods of data analysis. This has led to the development of new analytical tools that often combine techniques from a variety of fields such as statistics, computer science, and mathematics to extract meaningful knowledge from the data. To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive (http://kdd.ics.uci.edu), which is a new online archive of large and complex data sets that encompasses a wide variety of data types, analysis tasks, and application areas. This article describes the objectives and philosophy of the UCI KDD Archive. We draw parallels with the development of the UCI Machine Learning Repository and its effect on the Machine Learning community.

Journal ArticleDOI
TL;DR: The Kernel Miner's approach and method used for solving the contest task is described and the received results are analyzed and explained.
Abstract: Kernel Miner is a new data-mining tool based on building the optimal decision forest. The tool won second place in the KDD99 Classifier Learning Contest, August 1999. We describe Kernel Miner's approach and the method used for solving the contest task. The results obtained are analyzed and explained.

Journal ArticleDOI
TL;DR: It is demonstrated in this paper that even for relatively small problem sizes, it can be more cost effective to cluster the data in-place using an exact distributed algorithm than to collect the data at one central location for clustering.
Abstract: Data clustering is one of the fundamental techniques in scientific data analysis and data mining. It partitions a data set into groups of similar items, as measured by some distance metric. Over the years, data set sizes have grown rapidly with the exponential growth of computer storage and increasingly automated business and manufacturing processes. Many of these datasets are geographically distributed across multiple sites, e.g. different sales or warehouse locations. To cluster such large and distributed data sets, efficient distributed algorithms are called for to reduce the communication overhead, central storage requirements, and computation time, as well as to bring the resources of multiple machines to bear on a given problem as the data set sizes scale-up. We describe a technique for parallelizing a family of center-based data clustering algorithms. The central idea is to communicate only sufficient statistics, yielding linear speed-up with excellent efficiency. The technique does not involve approximation and may be used orthogonally in conjunction with sampling or aggregation-based methods, such as BIRCH, to lessen the quality degradation of their approximation or to handle larger data sets. We demonstrate in this paper that even for relatively small problem sizes, it can be more cost effective to cluster the data in-place using an exact distributed algorithm than to collect the data in one central location for clustering.
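The central claim is that exact distributed clustering only requires each site to exchange sufficient statistics, never raw points. A minimal sketch of one such k-means scheme follows, with sites modeled as local arrays; in practice the (count, sum) pairs would be messages over a network:

```python
import numpy as np

def local_stats(points, centers):
    """Per-site sufficient statistics for one k-means step: for every cluster,
    the count of local points assigned to it and their coordinate-wise sum."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assign = dist.argmin(axis=1)
    k = len(centers)
    counts = np.array([np.sum(assign == j) for j in range(k)])
    sums = np.array([points[assign == j].sum(axis=0) for j in range(k)])
    return counts, sums

def distributed_kmeans(sites, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(k, sites[0].shape[1]))
    for _ in range(n_iter):
        # Each site contributes only its (counts, sums); the raw points never move.
        stats = [local_stats(points, centers) for points in sites]
        counts = sum(c for c, _ in stats)
        sums = sum(s for _, s in stats)
        nonempty = counts > 0
        centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centers

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    sites = [rng.normal(loc, 0.3, size=(100, 2)) for loc in (-3, 0, 3)]
    print(distributed_kmeans(sites, k=3))
```

Because the aggregated statistics are exactly those standard k-means would compute on the pooled data, the result matches a centralized run while only k (count, sum) pairs per site cross the network each iteration.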

Journal ArticleDOI
TL;DR: The approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process, and statistical techniques are applied on concepts in order to find interesting patterns in concept distributions or associations.
Abstract: This paper presents an approach for knowledge discovery in texts extracted from the Web. Instead of analyzing words or attribute values, the approach is based on concepts, which are extracted from texts to be used as characteristics in the mining process. Statistical techniques are applied on concepts in order to find interesting patterns in concept distributions or associations. In this way, users can perform discovery at a high level, since concepts describe real world events, objects, thoughts, etc. For identifying concepts in texts, a categorization algorithm is used, associated to a previous classification task for concept definitions. Two experiments are presented: one for political analysis and the other for competitive intelligence. At the end, the approach is discussed, examining its problems and advantages in the Web context.

Journal ArticleDOI
TL;DR: It is argued that the classification task can be considered an ill-defined, nondeterministic task, which is unavoidable given the fact that it involves prediction; while the standard association task can still be considered a well- defined, deterministic, relatively simple task.
Abstract: The goal of this position paper is to contribute to a clear understanding of the profound differences between the association-rule discovery and the classification tasks. We argue that the classification task can be considered an ill-defined, nondeterministic task, which is unavoidable given the fact that it involves prediction; while the standard association task can be considered a well-defined, deterministic, relatively simple task, which does not involve prediction in the same sense as the classification task does.

Journal ArticleDOI
Heikki Mannila
TL;DR: In this paper I present some possible theoretical approaches to data mining; the area is in its infancy, and there are probably more questions than answers.
Abstract: Research in data mining and knowledge discovery in databases has mostly concentrated on developing good algorithms for various data mining tasks (see for example the recent proceedings of KDD conferences). Some parts of the research effort have gone to investigating the data mining process, user interface issues, database topics, or visualization [7]. Relatively little has been published about the theoretical foundations of data mining. In this paper I present some possible theoretical approaches to data mining. The area is in its infancy, and there probably are more questions than answers in this paper.

Journal ArticleDOI
TL;DR: The past 10 years of KDD are described and predictions for the next 10 years are outlined, and it is suggested that KDD should be renamed KDD2 or KDD3 to avoid confusion.
Abstract: In this paper, we describe the past 10 years of KDD and outline predictions for the next 10 years.

Journal ArticleDOI
TL;DR: This paper proposes a measure, called L-quality, which indicates how close the lift curve is to the random and perfect models, and examines the practical issues of computing L- quality from discrete quantiles available in a typical lift table.
Abstract: Database marketers often select predictive models based on the lift in the top 5%, 10%, or 20%. However, different models may be better at different thresholds. Absent a good cost function, or when multiple cost functions are possible, we want a measure that helps to compare models by looking at the entire lift curve. In this paper, we propose such a measure, called L-quality, which indicates how close the lift curve is to the random and perfect models. We also examine the practical issues of computing L-quality from discrete quantiles available in a typical lift table.
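The abstract does not spell out the formula, so the sketch below only illustrates the ingredients it names: a cumulative gains (lift) curve computed from model scores, with the area above the random model normalized by the area a perfect model would achieve. It should not be read as the paper's exact definition of L-quality:

```python
import numpy as np

def lift_quality(scores, labels):
    """Area between the cumulative-gains curve and the random model, normalized
    by the area a perfect model would achieve. Illustrative, not the paper's formula."""
    order = np.argsort(-np.asarray(scores))            # contact best-scored customers first
    y = np.asarray(labels)[order].astype(float)
    n, pos = len(y), y.sum()
    depth = np.arange(1, n + 1) / n                    # fraction of the list contacted
    gains = np.cumsum(y) / pos                         # fraction of responders captured
    perfect = np.minimum(depth * n / pos, 1.0)         # all responders ranked first
    # depth is an even grid on (0, 1], so a mean approximates the integral.
    return (gains - depth).mean() / (perfect - depth).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.random(1000) < 0.1                    # ~10% responders
    scores = labels * 0.5 + rng.random(1000)           # noisy but informative scores
    print(round(lift_quality(scores, labels), 3))
```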

Journal ArticleDOI
TL;DR: A new heuristic is presented that implements an iterative deepening search wherein the set of rules is incrementally augmented by first exploring trails with high probability; by setting the value of the stopping criterion, the analyst can determine the number and quality of rules induced.
Abstract: In previous work we have proposed a statistical model to capture the user behaviour when browsing the web. The user navigation information obtained from web logs is modelled as a hypertext probabilistic grammar (HPG) which is within the class of regular probabilistic grammars. The set of highest probability strings generated by the grammar corresponds to the user preferred navigation trails. We have previously conducted experiments with a Breadth-First Search algorithm (BFS) to perform the exhaustive computation of all the strings with probability above a specified cut-point, which we call the rules. Although the algorithm’s running time varies linearly with the number of grammar states, it has the drawbacks of returning a large number of rules when the cut-point is small and a small set of very short rules when the cut-point is high. In this work, we present a new heuristic that implements an iterative deepening search wherein the set of rules is incrementally augmented by first exploring trails with high probability. A stopping parameter is provided which measures the distance between the current rule-set and its corresponding maximal set obtained by the BFS algorithm. When the stopping parameter takes the value zero the heuristic corresponds to the BFS algorithm and as the parameter takes values closer to one the number of rules obtained decreases accordingly. Experiments were conducted with both real and synthetic data and the results show that for a given cut-point the number of rules induced increases smoothly with the decrease of the stopping criterion. Therefore, by setting the value of the stopping criterion the analyst can determine the number and quality of rules to be induced; the quality of a rule is measured by both its length and probability.
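Without the hypertext probabilistic grammar formalism at hand, the sketch below only illustrates the underlying search problem: given start and link-traversal probabilities, enumerate navigation trails whose probability stays above a cut-point, expanding the most probable trails first. The graph, probabilities, and cut-point are invented, and the paper's stopping parameter is not reproduced:

```python
import heapq

def high_probability_trails(start_probs, link_probs, cutpoint, max_len=10):
    """Enumerate navigation trails with probability >= cutpoint, best-first."""
    # Max-heap ordered by trail probability (negated because heapq is a min-heap).
    heap = [(-p, [s]) for s, p in start_probs.items() if p >= cutpoint]
    heapq.heapify(heap)
    rules = []
    while heap:
        neg_p, trail = heapq.heappop(heap)
        p = -neg_p
        rules.append((trail, p))
        if len(trail) >= max_len:
            continue
        for nxt, q in link_probs.get(trail[-1], {}).items():
            if p * q >= cutpoint:                      # extend only promising trails
                heapq.heappush(heap, (-(p * q), trail + [nxt]))
    return rules

if __name__ == "__main__":
    starts = {"home": 0.7, "search": 0.3}
    links = {"home": {"products": 0.6, "about": 0.4},
             "products": {"cart": 0.5, "home": 0.5},
             "search": {"products": 1.0}}
    for trail, p in high_probability_trails(starts, links, cutpoint=0.2):
        print(" -> ".join(trail), round(p, 3))
```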

Journal ArticleDOI
TL;DR: A debate is raised on the usefulness of providing data mining models as services on the internet and how this can be made accessible to a wider audience instead of being limited to people with the data and the expertise.
Abstract: The goal of this article is to raise a debate on the usefulness of providing data mining models as services on the internet. These services can be provided by anyone with adequate data and expertise and made available on the internet for anyone to use. For instance, Yahoo or Altavista, given their huge categorized document collection, can train a document classifier and provide the model as a service on the internet. This way data mining can be made accessible to a wider audience instead of being limited to people with the data and the expertise. A host of practical problems need to be solved before this idea can be made to work. We identify them and close with an invitation for further debate and investigation.

Journal ArticleDOI
TL;DR: The chi-squared test succeeds in measuring the cell dependencies in a 2x2 contingency table, however, it can be misleading in cases of bigger contingency tables and a more appropriate reliability measure of association rules is proposed.
Abstract: In their paper [1], S. Brin, R. Motwani and C. Silverstein discussed measuring significance of (generalized) association rules via the support and the chi-squared test for correlation. They provided some illustrative examples and pointed out that the chi-squared test needs to be augmented by a measure of interest, which they also suggested. This paper presents a further elaboration and extension of their discussion. As suggested by Brin et al., the chi-squared test succeeds in measuring the cell dependencies in a 2x2 contingency table. However, it can be misleading in cases of bigger contingency tables. We will give some illustrative examples based on those presented in [1]. We will also propose a more appropriate reliability measure of association rules.
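For a 2x2 contingency table, the dependence test discussed here is a one-liner with scipy; the counts below are made up rather than taken from the paper's examples:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: baskets containing / not containing item A.
# Columns: baskets containing / not containing item B.
table = np.array([[300, 100],
                  [150, 450]])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p_value:.3g}, dof={dof}")
# A small p-value rejects independence of A and B. The point made above is that
# for tables larger than 2x2 this single statistic can mislead, motivating a
# finer-grained reliability measure per association rule.
```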

Journal ArticleDOI
TL;DR: The results of knowledge discovery and modeling on the data of the 1997 donation campaign of an American charitable organization are described; a total net donation of around $10,500 results from the "mail to all" policy.
Abstract: This report describes the results of our knowledge discovery and modeling on the data of the 1997 donation campaign of an American charitable organization. The two data sets (training and evaluation) contained about 95,000 customers each, with an average net donation of slightly over 11 cents per customer; hence a total net donation of around $10,500 results from the "mail to all" policy. The main tool we utilized for the knowledge discovery task is Amdocs' Information Analysis Environment, which allows standard 2-class knowledge discovery and modeling, but also Value Weighted Analysis (VWA). In VWA, the discovered segments and models attempt to optimize the value and class membership simultaneously. Thus, our modeling was based on a 1-stage model rather than a separate analysis for donation probability and expected donation (the approach taken by all of KDD-Cup 98's reported modeling efforts except our own). We concentrate the first two parts of the report on introducing the knowledge and models we have discovered. The third part deals with the methods, algorithms and comments about the results. In doing the analysis and modeling we used only the training data set of KDD-Cup 98, reserving the evaluation data set for final unbiased model evaluation for our 5 suggested models only. If our goal had been only knowledge discovery, it might have been useful to utilize the evaluation data too, especially the donors. It is probably possible to find more interesting phenomena with almost 10,000 donors than with under 5,000.

Journal ArticleDOI
TL;DR: A novel structure called segment support map is proposed to help mining of frequent itemsets of the various forms by improving the performance of frequent-set mining algorithms by obtaining sharper bounds on the support of itemsets and better exploiting properties of constraints.
Abstract: Since its introduction, frequent set mining has been generalized to many forms, including online mining with Carma, and constrained mining with CAP. Regardless, scalability is always an important aspect of the development. In this paper, we propose a novel structure called segment support map to help mining of frequent itemsets of the various forms. A light-weight structure, the segment support map improves the performance of frequent-set mining algorithms by: (i) obtaining sharper bounds on the support of itemsets, and/or (ii) better exploiting properties of constraints. Our experimental results show the effectiveness of the segment support map.
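The light-weight structure amounts to keeping per-segment item counts and using them to bound an itemset's support before any counting pass. A hedged sketch of that bound follows; the segmentation and transactions are illustrative and the paper's further refinements are not reproduced:

```python
from collections import Counter

def segment_support_map(transactions, n_segments):
    """Split the database into segments and record each item's support per segment."""
    size = max(1, len(transactions) // n_segments)
    segments = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    return [Counter(item for t in seg for item in t) for seg in segments]

def support_upper_bound(itemset, ssm):
    # Within one segment an itemset occurs at most as often as its rarest item;
    # summing that bound per segment is tighter than the global min-item bound.
    return sum(min(seg[item] for item in itemset) for seg in ssm)

if __name__ == "__main__":
    data = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}, {"c"}, {"a", "c"}]
    ssm = segment_support_map(data, n_segments=3)
    print(support_upper_bound({"a", "b"}, ssm))   # bound on the support of {a, b}
```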

Journal ArticleDOI
TL;DR: This article surveys the contents of the workshop Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics within KDD-2000: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Abstract: This article surveys the contents of the workshop Post-Processing in Machine Learning and Data Mining: Interpretation, Visualization, Integration, and Related Topics within KDD-2000: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20-23 August 2000. The corresponding web site is at www.acm.org/sigkdd/kdd2000. First, this survey paper introduces the state of the art of the workshop topics, emphasizing that post-processing forms a significant component in Knowledge Discovery in Databases (KDD). Next, the article brings up a report on the contents, analysis, discussion, and other aspects regarding this workshop. Afterwards, we survey all the workshop papers. They can be found at (and downloaded from) www.cas.mcmaster.ca/~bruha/kdd2000/kddrep.html. The authors of this report worked as the organizers of the workshop; the programme committee was formed by three additional researchers in this field.

Journal ArticleDOI
TL;DR: This article points out some very serious misconceptions about the brain in connectionism and artificial neural networks and argues that a very convincing argument can be made for a "control theoretic" approach to understanding the brain.
Abstract: This article points out some very serious misconceptions about the brain in connectionism and artificial neural networks. Some of the connectionist ideas have been shown to have logical flaws, while others are inconsistent with some commonly observed human learning processes and behavior. For example, the connectionist ideas have absolutely no provision for learning from stored information, something that humans do all the time. The article also argues that there is definitely a need for some new ideas about the internal mechanisms of the brain. It points out that a very convincing argument can be made for a "control theoretic" approach to understanding the brain. A "control theoretic" approach is actually used in all connectionist and neural network algorithms and it can also be justified from recent neurobiological evidence. A control theoretic approach proposes that there are subsystems within the brain that control other subsystems. Hence a similar approach can be taken in constructing learning algorithms and other intelligent systems.

Journal ArticleDOI
TL;DR: The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space and "magnifying lens" techniques.
Abstract: The MP13 method is best summarized as recognition based on voting decision trees using "pipes" in potential space.

Journal ArticleDOI
TL;DR: An overview of the KDD’2000 Workshop on Text Mining that was held in Boston, MA on August 20, 2000 is given.
Abstract: In this paper we give an overview of the KDD’2000 Workshop on Text Mining that was held in Boston, MA on August 20, 2000. We report in detail on the research issues covered in the papers presented at the Workshop and during the group discussion held in the final session of the Workshop.

Journal ArticleDOI
TL;DR: This paper presents a system (called DS-Web) that uses the web to facilitate delivering and interpreting the discovered rules, and shows that DS-WEB is much more powerful than a conventional system.
Abstract: The web not only contains a vast amount of useful information, but also provides a powerful infrastructure for communication and information sharing. In this paper, we present a system (called DS-Web) that uses the web to help data mining. Specifically, we use the web to facilitate delivering and interpreting the discovered rules. Interpreting the discovered rules to gain a good understanding of the domain is an important phase of data mining. It is also a very difficult task because the number of rules involved is often very large. This problem has been regarded as a major obstacle to the use of data mining results. DS-WEB assists the user in understanding a set of discovered rules in two steps. First, it finds a special subset (or a summary) of the rules that represents the essential relationships of the domain to build a hierarchical structure of the rules. It then publishes this hierarchy of rules via multiple web pages connected using hyperlinks. By using the web, we inherit the advantages of the web, e.g., accessibility, multi-user communication and friendly interface. DS-WEB not only allows the user to browse the rules easily, but also allows us to create a virtual workspace where multiple users can share opinions on the rules. This ultimately contributes towards comprehension of the domain. Our application experiences show that DS-WEB is much more powerful than a conventional system.

Journal ArticleDOI
TL;DR: The main claim made by Jagadish was that techniques that can find SOME interesting patterns cheaply are much more valuable than an exhaustive enumeration of ALL patterns, which can be extremely expensive.
Abstract: Invited Talks. The first invited talk by H. V. Jagadish was titled "Incompleteness in Data Mining". The main claim made by Jagadish was that techniques that can find SOME interesting patterns cheaply are much more valuable than an exhaustive enumeration of ALL patterns, which can be extremely expensive. Further, knowledge discovery is most effective when the human analyst is involved in the endeavor. For this, data mining techniques need to be interactive, delivering real-time responses and feedback, which is only possible with incomplete and approximate answers. The case for incompleteness was made using the notion of fascicles, which are subsets having very similar values for many attributes. Jagadish's experience was that randomized algorithms for finding fascicles were as effective as exhaustive algorithms. Jagadish concluded his talk by listing the following four desiderata for frameworks that permit incompleteness to be exploited: (1) Tunability, or the ability to specify the degree of incompleteness; (2) Incrementality, or the ability to exploit previous computations in subsequent iterations; (3) Focusing, or the ability to incorporate user-specified constraints into the computation; and (4) Quality loss guarantees, or the ability to quantify analytically the loss of quality corresponding to a chosen degree of incompleteness.