Showing papers in "Sigkdd Explorations in 2003"


Journal ArticleDOI
TL;DR: This survey describes several approaches to defining positive definite kernels directly on structured instances, avoiding the extensive pre-processing otherwise needed to embed such data in a real vector space and thus in a single table.
Abstract: Kernel methods in general and support vector machines in particular have been successful in various learning tasks on data represented in a single table. Much 'real-world' data, however, is structured - it has no natural representation in a single table. Usually, to apply kernel methods to 'real-world' data, extensive pre-processing is performed to embed the data into a real vector space and thus in a single table. This survey describes several approaches of defining positive definite kernels on structured instances directly.

507 citations
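
As a rough illustration of what "defining a kernel on structured instances directly" can mean, the sketch below implements a k-spectrum kernel on strings: the kernel value is the inner product of substring-count vectors, so it is positive definite by construction, yet no fixed-width table is ever built. This is a standard textbook example chosen here for illustration only; the listing does not say which kernels the survey covers, and the function names `spectrum_features` and `spectrum_kernel` are hypothetical.

```python
from collections import Counter


def spectrum_features(s, k):
    """Count every contiguous substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s, t, k=2):
    """Inner product of the two substring-count vectors.

    Because it is an explicit inner product in feature space,
    the resulting kernel is symmetric and positive definite.
    """
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Counter returns 0 for substrings that t does not contain.
    return sum(count * ft[sub] for sub, count in fs.items())


if __name__ == "__main__":
    print(spectrum_kernel("abab", "baba"))
```

A value like this could be plugged into any kernel method that accepts a precomputed Gram matrix; the strings themselves never need a tabular encoding.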


Journal ArticleDOI
TL;DR: This article introduces the theoretical basis of graph based data mining and surveys the state of the art of graph-based data mining.
Abstract: The need for mining structured data has increased in the past few years. Graphs are among the best-studied data structures in computer science and discrete mathematics. It can therefore be no surprise that graph-based data mining has become quite popular in the last few years. This article introduces the theoretical basis of graph-based data mining and surveys the state of the art of graph-based data mining. Brief descriptions of some representative approaches are provided as well.

480 citations


Journal ArticleDOI
TL;DR: This paper surveys the 2003 KDD Cup, a competition focused on mining the complex real-life social network inherent in the e-print arXiv (arXiv.org), and describes the four KDD Cup tasks: citation prediction, download prediction, data cleaning, and an open task.
Abstract: This paper surveys the 2003 KDD Cup, a competition held in conjunction with the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) in August 2003. The competition focused on mining the complex real-life social network inherent in the e-print arXiv (arXiv.org). We describe the four KDD Cup tasks: citation prediction, download prediction, data cleaning, and an open task.

309 citations


Journal ArticleDOI
TL;DR: This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.
Abstract: Data mining algorithms look for patterns in data. While most existing data mining approaches look for patterns in a single data table, multi-relational data mining (MRDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in data mining have been extended to the multi-relational case and MRDM now encompasses multi-relational (MR) association rule discovery, MR decision trees and MR distance-based methods, among others. MRDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.

292 citations


Journal ArticleDOI
TL;DR: A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way and links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models.
Abstract: A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.

205 citations


Journal ArticleDOI
TL;DR: The different patterns of gene expression, which follow carefully tuned biological programs according to tissue type, developmental stage, environment and genetic background, account for the huge variety of different cell states and types.
Abstract: All organisms on Earth, except for viruses, consist of cells. Yeast, for example, has one cell, while humans have trillions of cells. All cells have a nucleus, and inside the nucleus there is DNA, which encodes the “program” for making future organisms. DNA has coding and non-coding segments, and coding segments, called “genes”, specify the structure of proteins, which are large molecules, like hemoglobin, that do the essential work in every organism. Practically all cells in the same organism have the same genes, but these genes can be expressed differently at different times and under different conditions. Genes make proteins in two steps. First, DNA is transcribed into messenger RNA or mRNA, which in turn is translated into proteins. The different patterns of gene expression, which follow carefully tuned biological programs according to tissue type, developmental stage, environment and genetic background, account for the huge variety of different cell states and types. Virtually all major differences in cell state or type are correlated with changes in the mRNA levels of many genes.

196 citations


Journal ArticleDOI
Tom Fawcett
TL;DR: This paper demonstrates some of the characteristics that make in vivo spam filtering a rich and challenging domain for data mining, and argues that researchers should pursue it as an accessible domain for investigating those characteristics.
Abstract: Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

139 citations


Journal ArticleDOI
TL;DR: For the KDD Cup 2003 competition's "Open Task," it was examined how well various automatic matching techniques could identify authors within the competition's very large archive of research papers.
Abstract: Prior studies have questioned the degree of anonymity of the double-blind review process for scholarly research articles. For example, one study based on a survey of reviewers concluded that authors often could be identified by reviewers using a combination of the author's reference list and the referee's personal background knowledge. For the KDD Cup 2003 competition's "Open Task," we examined how well various automatic matching techniques could identify authors within the competition's very large archive of research papers. This paper describes the issues surrounding author identification, how these issues motivated our study, and the results we obtained. The best method, based on discriminative self-citations, identified authors correctly 40-45% of the time. One main motivation for double-blind review is to eliminate bias in favor of well-known authors. However, identification accuracy for authors with substantial publication history is even better (60% accuracy for the top-10% most prolific authors, 85% for authors with 100 or more prior papers).

109 citations


Journal ArticleDOI
TL;DR: This paper presents the experiences of the authors and others in applying exploratory data mining techniques to medical, health and clinical data and provides pointers to possible areas of future research in data mining and knowledge discovery more broadly.
Abstract: The application of data mining and knowledge discovery techniques to medical and health datasets is a rewarding but highly challenging area. Not only are the datasets large, complex, heterogeneous, hierarchical, time-varying and of varying quality, but there exists a substantial medical knowledge base which demands a robust collaboration between the data miner and the health professional(s) if useful information is to be extracted. This paper presents the experiences of the authors and others in applying exploratory data mining techniques to medical, health and clinical data. In so doing, it elicits a number of general issues and provides pointers to possible areas of future research in data mining and knowledge discovery more broadly.

104 citations


Journal ArticleDOI
TL;DR: This panel was an attempt to address the possible future directions for Data Mining and KDD.
Abstract: The goal of the panel was to gather representatives from academia and industry and to ponder where the field stands after nearly a decade and a half of KDD meetings. We all have seen a significant growth in demand for data mining technology driven by a glut in data. We have observed data mining growing as a healthy research community. However, we still struggle on two important fronts: the scientific and the commercial. On the scientific front, Data Mining still needs to reach a stronger level of attracting steady contributions from the related fields. On the commercial front, the huge opportunity has not yet been met with adequate tools and solutions. This panel was an attempt to address the possible future directions for Data Mining and KDD.

100 citations


Journal ArticleDOI
TL;DR: This short paper argues that multi-relational data mining has a key role to play in the growth of KDD, and briefly surveys some of the main drivers, research problems, and opportunities in this emerging field.
Abstract: This short paper argues that multi-relational data mining has a key role to play in the growth of KDD, and briefly surveys some of the main drivers, research problems, and opportunities in this emerging field.

Journal ArticleDOI
TL;DR: This article attempts to present a structured overview of theoretical results, algorithms and implementations that aim at improving the efficiency and scalability of multi-relational data mining approaches.
Abstract: Efficiency and scalability have always been important concerns in the field of data mining, and are even more so in the multi-relational context, which is inherently more complex. The issue has been receiving an increasing amount of attention during the last few years, and quite a number of theoretical results, algorithms and implementations have been presented that explicitly aim at improving the efficiency and scalability of multi-relational data mining approaches. With this article we attempt to present a structured overview.

Journal ArticleDOI
TL;DR: This work analyzes publication patterns in theoretical high-energy physics using a relational learning approach, focusing on understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text.
Abstract: We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics.

Journal ArticleDOI
TL;DR: This paper presents experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions.
Abstract: This paper addresses the problem of improving accuracy in the machine-learning task of classification from microarray data. One of the known issues specifically related to microarray data is the large number of inputs (genes) versus the small number of available samples (conditions). A promising direction of research to decrease the generalization error of classification algorithms is to perform gene selection so as to identify those genes which are potentially most relevant for the classification. Classical feature selection methods are based on direct statistical methods. We present a reduction algorithm based on the notion of a prototype gene. Each prototype represents a set of similar genes according to a given clustering method. We present experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions.

Journal ArticleDOI
TL;DR: This paper presents several applications of multi-relational data mining to biological data, taking care to cover a broad range of multi-relational data mining techniques.
Abstract: Biological databases contain a wide variety of data types, often with rich relational structure. Consequently multi-relational data mining techniques frequently are applied to biological data. This paper presents several applications of multi-relational data mining to biological data, taking care to cover a broad range of multi-relational data mining techniques.

Journal ArticleDOI
TL;DR: Some of the key aspects of p>>n problems for identifying informative features and developing accurate classifiers are highlighted.
Abstract: New genomic and proteomic technologies provide measurements of thousands of features for each case. This provides a context for enhanced discovery and false discovery. Most statistical and machine learning procedures were not developed for the p>>n setting, and the literature of DNA microarray studies contains many examples of misuse of analytic and computational methods such as cross-validation. This paper highlights some of the key aspects of p>>n problems for identifying informative features and developing accurate classifiers.

Journal ArticleDOI
TL;DR: This paper explores how an integrated analysis of expression data and literature-extracted information can reveal biologically meaningful clusters not identified when using microarray information alone, validated in terms of transcriptional regulation.
Abstract: The current tendency in the life sciences to spawn ever growing amounts of high-throughput assays has led to a situation where the interpretation of data and the formulation of hypotheses lag the pace at which information is produced. Although the first generation of statistical algorithms scrutinizing single, large-scale data sets found their way into the biological community, the great challenge to connect their results to existing knowledge still remains. Despite the fairly large number of biological databases that are currently available, a lot of relevant information is found in free-text format (such as textual annotations, scientific abstracts and full publications). In this paper we explore how an integrated analysis of expression data and literature-extracted information can reveal biologically meaningful clusters not identified when using microarray information alone. The joint analysis is validated in terms of transcriptional regulation.

Journal ArticleDOI
TL;DR: This paper focuses on identifying novel, not necessarily most frequent, patterns in a graph-theoretic representation of data, which provides both simplifications and challenges over frequency-based approaches to graph-based data mining.
Abstract: Graph-based relational learning (GBRL) differs from logic-based relational learning, as addressed by inductive logic programming techniques, and differs from frequent subgraph discovery, as addressed by many graph-based data mining techniques. Learning from graphs, rather than logic, presents representational issues both in input data preparation and output pattern language. While a form of graph-based data mining, GBRL focuses on identifying novel, not necessarily most frequent, patterns in a graph-theoretic representation of data. This approach to graph-based data mining provides both simplifications and challenges over frequency-based approaches. In this paper we discuss these issues and future directions of graph-based relational learning.

Journal ArticleDOI
TL;DR: A novel procedure called "nearest shrunken centroids" is described that has successfully detected clinically relevant genetic differences in cancer patients and has the potential to become a powerful tool for diagnosing and treating cancer.
Abstract: The morbidity rate of cancer victims varies greatly for similar patients who receive similar treatments. It is hypothesized that this variation can be explained by the genetic heterogeneity of the disease. DNA Microarrays, which can simultaneously measure the expression level of thousands of different genes, have been successfully used to identify such genetic differences. However, microarray data typically has a large number of features and relatively few observations, meaning that conventional machine learning tools can fail when applied to such data. We describe a novel procedure called "nearest shrunken centroids" that has successfully detected clinically relevant genetic differences in cancer patients. This procedure has the potential to become a powerful tool for diagnosing and treating cancer.
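
The abstract names the procedure but does not spell it out, so the following is only a simplified sketch of the core idea behind shrunken centroids: each class centroid is pulled toward the overall centroid by soft-thresholding, which zeroes out genes whose class difference is small, and a sample is assigned to the nearest shrunken centroid. It assumes plain squared Euclidean distance and deliberately omits the per-gene standardization and class priors used in the published method; `delta` and all function names are hypothetical.

```python
def soft_threshold(x, delta):
    """Move x toward zero by delta, clamping at zero."""
    if x > delta:
        return x - delta
    if x < -delta:
        return x + delta
    return 0.0


def fit_shrunken_centroids(X, y, delta):
    """Return {label: shrunken centroid} for samples X with labels y."""
    n_features = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_mean = [sum(r[j] for r in rows) / len(rows)
                      for j in range(n_features)]
        # Shrink the class-vs-overall difference; small differences vanish,
        # so the corresponding gene stops influencing classification.
        centroids[label] = [
            overall[j] + soft_threshold(class_mean[j] - overall[j], delta)
            for j in range(n_features)
        ]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the nearest shrunken centroid."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )
```

With a larger `delta`, more genes end up with identical centroid values across classes, which is how the approach copes with many features and few observations.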

Journal ArticleDOI
TL;DR: A set of unsupervised link discovery methods that compute interestingness based on a notion of "rarity" and "abnormality" are developed and show that these approaches are able to automatically uncover interesting hidden connections and unexpected facts in the data.
Abstract: This paper describes a submission to the Open Task of the 2003 KDD Cup. For this task contestants were asked to devise their own questions about the HEP-Th bibliography dataset, and the most interesting result would be selected as the winner. Instead of taking a more traditional approach such as starting with an inspection of the data, formulating questions or hypotheses interesting to us and then devising an analysis and approach to answer these questions, we tried to go a different route: can we develop a program that automatically finds interesting facts and connections in the data? To do this we developed a set of unsupervised link discovery methods that compute interestingness based on a notion of "rarity" and "abnormality". The experiments performed on the HEP-Th dataset show that our approaches are able to automatically uncover interesting hidden connections (e.g. significant relationships between people) and unexpected facts (e.g. citation loops) without the support of any prerequisite knowledge or training examples. The interestingness of some of our results is self-evident. For others we were able to verify them by looking for supporting evidence on the World-Wide-Web, which shows that our methods can find connections between entities that actually are interestingly connected in the real world in an unsupervised way.

Journal ArticleDOI
TL;DR: A gene-ranking algorithm whose main novelty is the use of bootstrapped P-values is proposed, showing how it takes account of small-sample variability in observed values of the test statistic, in a way conventional statistical tests cannot.
Abstract: Recent research has shown that it is possible to find genes involved in the pathogenesis of a particular condition on the basis of microarray experiments. Genes which are differentially expressed, for example between healthy and diseased tissues, are likely to be relevant to the disease under study. Some of the properties of microarray datasets make the task of finding these genes a challenging one. This paper proposes a gene-ranking algorithm whose main novelty is the use of bootstrapped P-values. We present an analysis of the algorithm, showing how it takes account of small-sample variability in observed values of the test statistic, in a way conventional statistical tests cannot. Experimental results show that our algorithm outperforms the widely-used two-sample t-test on challenging artificial data. Gene ranking is then performed on two well-known microarray datasets, with encouraging results. For example, a number of genes from one of the datasets, whose differential expression was subsequently confirmed by a more reliable biochemical analysis, are found to be ranked higher by the bootstrapped algorithm than by the conventional t-test, suggesting that the proposed algorithm may be better able to exploit the limited data available to infer biologically useful information.
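
For context, the conventional baseline the abstract compares against can be sketched as follows: genes are ranked by the absolute value of a two-sample t statistic between healthy and diseased samples. This is not the paper's bootstrapped algorithm, whose details the abstract does not give. The sketch assumes the Welch (unequal-variance) form of the statistic and at least two samples per group; `welch_t` and `rank_genes` are hypothetical names.

```python
import math
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic for samples a and b."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    diff = statistics.mean(a) - statistics.mean(b)
    if se == 0.0:
        # Both groups are constant: no evidence if the means agree,
        # otherwise an unbounded statistic.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_genes(expression):
    """Rank genes by |t|, most differentially expressed first.

    expression maps gene name -> (healthy_values, diseased_values).
    """
    scores = {g: abs(welch_t(h, d)) for g, (h, d) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With only a handful of samples per group, the variance estimates inside `welch_t` are themselves noisy, which is the small-sample variability the abstract says conventional tests do not account for.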

Journal ArticleDOI
TL;DR: An interactive exploration system GeneX (Gene eXplorer) is introduced for mining coherent expression patterns and a novel coherent pattern index graph is developed to provide highly confident indications of the existence of coherent patterns.
Abstract: Analyzing coherent gene expression patterns is an important task in bioinformatics research and biomedical applications. Recently, various clustering methods have been adapted or proposed to identify clusters of co-expressed genes and recognize coherent expression patterns as the centroids of the clusters. However, the interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge, which presents several challenges for coherent pattern mining and cannot be solved by most existing clustering approaches. In this paper, we introduce an interactive exploration system GeneX (Gene eXplorer) for mining coherent expression patterns. We develop a novel coherent pattern index graph to provide highly confident indications of the existence of coherent patterns. Typical exploration operations are supported based on the index graph. We also provide several graphical views as the user interface to visualize the data set and facilitate the interactive operations. To help users to interpret and validate the mining results, we design the gene annotation panel that connects the genes with some public annotation databases. The experimental results show that our approach is more effective than the state-of-the-art methods in mining real gene expression data sets.

Journal ArticleDOI
TL;DR: Opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing are explored.
Abstract: Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.

Journal ArticleDOI
TL;DR: This article describes the authors' experiences in building the winning system for KDD Cup 2003, Task 1, whose goal was to predict the change in the number of citations to each paper in a very large archive of research papers over time.
Abstract: In this article we describe our experiences in building the winning system for KDD Cup 2003, Task 1. This year's competition was based on a very large archive of research papers that provides an unusually comprehensive snapshot of a particular social network in action; in addition to the full text of research papers, it includes both explicit citation structure and partial data on the downloading of papers by users. It provides a framework for testing general network and usage mining techniques, which can be explored via four varied and interesting tasks. Each task is a separate competition with its own specific goal. In task 1 the goal is to predict the change in number of citations to each paper in the archive over time. The contest was very challenging because the given data was not in a format suitable for conventional data mining techniques, so we had to do a considerable amount of data processing. There were also different sources of data, such as tex files, the citation graph and the slac-data database, so we had to decide which sources to use and how much to use them.

Journal ArticleDOI
TL;DR: A new probabilistic framework for analyzing a metabolic pathway with microarray expression profiles is presented, and it is found that this method significantly outperformed a comparison method trained on microarray data only.
Abstract: We present a new probabilistic framework for analyzing a metabolic pathway with microarray expression profiles. Our purpose is to find biologically significant paths and patterns in a given metabolic pathway. Our approach first builds a Markov model using a graph structure of a known metabolic pathway, and then estimates parameters of a mixture of the Markov models using microarray data, based on an EM algorithm. In our experiments, we used a main pathway of glycolysis to evaluate the effectiveness of our method. We first compared the performance of our method with that of another method, in a supervised learning manner, and found that our method significantly outperformed the other method, which was trained by microarray data only. We further analyzed the trained models and obtained a number of new biological findings on frequent patterns (paths) and long-range correlations in a metabolic pathway.

Journal ArticleDOI
TL;DR: This article provides an overview of the methodology and describes its application to the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures.
Abstract: Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators; (ii) an approach for selecting an optimal estimator among these candidates; and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using this (or possibly another) loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. This general estimation framework encompasses a number of problems which have traditionally been treated separately in the statistical literature, including multivariate outcome prediction and density estimation based on either uncensored or censored data. This article provides an overview of the methodology and describes its application to the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures.
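
The selection step described above, choosing among candidate estimators by cross-validated risk under a loss function, can be sketched in a few lines. This is a minimal illustration under assumptions not taken from the abstract: squared-error loss, uncensored one-dimensional data, and two toy candidates (a constant predictor and a least-squares line). All names are hypothetical.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_risk(fit, xs, ys, k=5):
    """Average squared-error loss of `fit` over held-out folds."""
    total = 0.0
    for fold in k_fold_indices(len(xs), k):
        held = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        predictor = fit(train_x, train_y)
        total += sum((predictor(xs[i]) - ys[i]) ** 2 for i in fold)
    return total / len(xs)


def fit_mean(xs, ys):
    """Candidate 1: predict the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate 2: simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def select_estimator(candidates, xs, ys, k=5):
    """Return the candidate name with the lowest cross-validated risk."""
    risks = {name: cv_risk(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(risks, key=risks.get), risks
```

Calling `select_estimator({"mean": fit_mean, "line": fit_line}, xs, ys)` returns the winning name together with the estimated risks, so the same numbers serve both selection and a first look at performance; an honest performance estimate for the selected estimator would need a further held-out split.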

Journal ArticleDOI
TL;DR: It is shown that, with the integration of the IFs to the Golub and Slonim (GS) and k-nearest neighbors (kNN) classifiers, the classifiers can be further improved on the classification accuracy of heterogeneous samples.
Abstract: Recent advanced technologies in DNA microarray analysis are intensively applied in disease classification, especially for cancer classification. Most recently proposed gene expression classifiers can successfully classify testing samples obtained from the same microarray experiment as training samples with the assumption that the symmetric errors are constant among training and testing samples. However, the classification performance is degraded with heterogeneous testing samples obtained from different microarray experiments. In this paper, we propose the "impact factors" (IFs) to measure the variations between individual classes in training samples and heterogeneous testing samples, and integrate the IFs to classifiers for classification of heterogeneous samples. Two publicly available lung adenocarcinomas gene expression data sets are used in our experiments to demonstrate the effectiveness of the IFs. The results show that, with the integration of the IFs to the Golub and Slonim (GS) and k-nearest neighbors (kNN) classifiers, the classifiers can be further improved on the classification accuracy of heterogeneous samples. Moreover, the classification accuracy of the integrated GS classifier is around 90%.

Journal ArticleDOI
TL;DR: This paper provides a review of statistical analysis approaches to the analysis of data from microarray experiments, including discussion of experimental design, data management, preprocessing, differential expression, clustering and class prediction, reporting and annotation.
Abstract: Microarrays are a powerful experimental platform, allowing simultaneous studies of gene expression for thousands of genes under different experimental conditions. However, there is much biological variability induced throughout the experimental process that can obscure the biological signals of interest. As such, the need for experimental design, replication and statistical rigor is now widely recognized. Statistical hypothesis testing has become the accepted differential expression analysis approach and many classification and prediction methods used in class discovery and class prediction now incorporate stochastic modeling components. This paper provides a review of statistical analysis approaches to the analysis of data from microarray experiments. This includes discussion of experimental design, data management, preprocessing, differential expression, clustering and class prediction, reporting and annotation. The review is illustrated with the analysis of an experiment with 3 experimental conditions using the Affymetrix murine chip mgu 74av2; and with descriptions of available functionality in the statistical analysis software S-PLUS and its associated module for microarray analysis, S+ArrayAnalyzer.

Journal ArticleDOI
TL;DR: The process of creating a citation graph from a given repository of physics publications in LaTeX format involved a series of information extraction, data cleaning, matching and ranking steps.
Abstract: In this paper, we describe our process of creating a citation graph from a given repository of physics publications in LaTeX format. The task involved a series of information extraction, data cleaning, matching and ranking steps. This paper describes the challenges we faced along the way and the issues involved in resolving them.

Journal ArticleDOI
TL;DR: This paper describes the work on the Download Estimation task for KDD Cup 2003, based on an extension of the bag-of-words model, with linear SVM regression as the learning algorithm, and focuses particularly on issues of feature construction and weighting.
Abstract: This paper describes our work on the Download Estimation task for KDD Cup 2003. The task requires us to estimate how many times a paper has been downloaded in the first 60 days after it has been published on arXiv.org, a preprint server for papers on physics and related areas. The training data consists of approximately 29000 papers, the citation graph, and information about the downloads of a subset of these papers. Our approach is based on an extension of the bag-of-words model, with linear SVM regression as the learning algorithm. We describe our experiments with various kinds of features. We focus particularly on issues of feature construction and weighting, which turn out to be quite important for this task.