Showing papers in "Sigkdd Explorations in 2003"

Journal Article•DOI•

A survey of kernels for structured data

Thomas Gärtner¹•Institutions (1)

University of Bristol¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This survey describes several approaches to defining positive definite kernels directly on structured instances, without first embedding the data into a real vector space and thus into a single table.

Abstract: Kernel methods in general and support vector machines in particular have been successful in various learning tasks on data represented in a single table. Much 'real-world' data, however, is structured - it has no natural representation in a single table. Usually, to apply kernel methods to 'real-world' data, extensive pre-processing is performed to embed the data into a real vector space and thus in a single table. This survey describes several approaches of defining positive definite kernels on structured instances directly.
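
To make the idea of a kernel defined directly on structured instances concrete, here is a minimal sketch of a k-spectrum kernel on strings. It is an illustrative assumption, not an example taken from the survey, and the function names `spectrum_features` and `spectrum_kernel` are hypothetical. The kernel value is the inner product of the two strings' substring-count vectors, so it is positive semi-definite by construction and needs no hand-built feature table.

```python
from collections import Counter


def spectrum_features(s: str, k: int) -> Counter:
    """Count every contiguous length-k substring of s (hypothetical helper)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 2) -> int:
    """Inner product of the substring-count vectors of s and t."""
    fs = spectrum_features(s, k)
    ft = spectrum_features(t, k)
    # Counter returns 0 for substrings that never occur in t.
    return sum(count * ft[sub] for sub, count in fs.items())


if __name__ == "__main__":
    print(spectrum_kernel("abab", "baba"))
    print(spectrum_kernel("abab", "abab"))
```

A kernel method such as a support vector machine could consume the resulting Gram matrix directly; the strings themselves are never converted into a fixed-width table by the user.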

507 citations

Journal Article•DOI•

State of the art of graph-based data mining

Takashi Washio¹, Hiroshi Motoda¹•Institutions (1)

Osaka University¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This article introduces the theoretical basis of graph based data mining and surveys the state of the art of graph-based data mining.

Abstract: The need for mining structured data has increased in the past few years. Graphs are among the best-studied data structures in computer science and discrete mathematics. It can therefore be no surprise that graph-based data mining has become quite popular in the last few years. This article introduces the theoretical basis of graph-based data mining and surveys the state of the art of graph-based data mining. Brief descriptions of some representative approaches are provided as well.

480 citations

Journal Article•DOI•

Overview of the 2003 KDD Cup

Johannes Gehrke¹, Paul Ginsparg¹, Jon Kleinberg¹•Institutions (1)

Cornell University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper surveys the 2003 KDD Cup, a competition focused on mining the complex real-life social network inherent in the e-print arXiv (arXiv.org), and describes the four KDD Cup tasks: citation prediction, download prediction, data cleaning, and an open task.

Abstract: This paper surveys the 2003 KDD Cup, a competition held in conjunction with the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) in August 2003. The competition focused on mining the complex real-life social network inherent in the e-print arXiv (arXiv.org). We describe the four KDD Cup tasks: citation prediction, download prediction, data cleaning, and an open task.

309 citations

Journal Article•DOI•

Multi-relational data mining: an introduction

Sašo Džeroski¹•Institutions (1)

Jožef Stefan Institute¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.

Abstract: Data mining algorithms look for patterns in data. While most existing data mining approaches look for patterns in a single data table, multi-relational data mining (MRDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in data mining have been extended to the multi-relational case and MRDM now encompasses multi-relational (MR) association rule discovery, MR decision trees and MR distance-based methods, among others. MRDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.

292 citations

Journal Article•DOI•

Link mining: a new data mining challenge

Lise Getoor¹•Institutions (1)

University of Maryland, College Park¹

01 Jul 2003-Sigkdd Explorations

TL;DR: A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way and links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models.

Abstract: A key challenge for data mining is tackling the problem of mining richly structured datasets, where the objects are linked in some way. Links among the objects may demonstrate certain patterns, which can be helpful for many data mining tasks and are usually hard to capture with traditional statistical models. Recently there has been a surge of interest in this area, fueled largely by interest in web and hypertext mining, but also by interest in mining social networks, security and law enforcement data, bibliographic citations and epidemiological records.

205 citations

Journal Article•DOI•

Microarray data mining: facing the challenges

Gregory Piatetsky-Shapiro, Pablo Tamayo¹•Institutions (1)

Broad Institute¹

01 Dec 2003-Sigkdd Explorations

TL;DR: Different patterns of gene expression, following carefully tuned biological programs according to tissue type, developmental stage, environment, and genetic background, account for the huge variety of cell states and types.

Abstract: All organisms on Earth, except for viruses, consist of cells. Yeast, for example, has one cell, while humans have trillions of cells. All cells have a nucleus, and inside the nucleus there is DNA, which encodes the “program” for making future organisms. DNA has coding and non-coding segments, and coding segments, called “genes”, specify the structure of proteins, which are large molecules, like hemoglobin, that do the essential work in every organism. Practically all cells in the same organism have the same genes, but these genes can be expressed differently at different times and under different conditions. Genes make proteins in two steps. First, DNA is transcribed into messenger RNA or mRNA, which in turn is translated into proteins. The different patterns of gene expression, following carefully tuned biological programs according to tissue type, developmental stage, environment and genetic background, account for the huge variety of different cell states and types. Virtually all major differences in cell state or type are correlated with changes in the mRNA levels of many genes.

196 citations

Journal Article•DOI•

"In vivo" spam filtering: a challenge problem for KDD

Tom Fawcett¹•Institutions (1)

Hewlett-Packard¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper demonstrates some of the characteristics that make in vivo spam filtering a rich and challenging domain for data mining, and argues that researchers should pursue it as an accessible domain for investigating those characteristics.

Abstract: Spam, also known as Unsolicited Commercial Email (UCE), is the bane of email communication. Many data mining researchers have addressed the problem of detecting spam, generally by treating it as a static text classification problem. True in vivo spam filtering has characteristics that make it a rich and challenging domain for data mining. Indeed, real-world datasets with these characteristics are typically difficult to acquire and to share. This paper demonstrates some of these characteristics and argues that researchers should pursue in vivo spam filtering as an accessible domain for investigating them.

139 citations

Journal Article•DOI•

The myth of the double-blind review?: author identification using only citations

Shawndra Hill¹, Foster Provost¹•Institutions (1)

New York University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: For the KDD Cup 2003 competition's "Open Task," it was examined how well various automatic matching techniques could identify authors within the competition's very large archive of research papers.

Abstract: Prior studies have questioned the degree of anonymity of the double-blind review process for scholarly research articles. For example, one study based on a survey of reviewers concluded that authors often could be identified by reviewers using a combination of the author's reference list and the referee's personal background knowledge. For the KDD Cup 2003 competition's "Open Task," we examined how well various automatic matching techniques could identify authors within the competition's very large archive of research papers. This paper describes the issues surrounding author identification, how these issues motivated our study, and the results we obtained. The best method, based on discriminative self-citations, identified authors correctly 40-45% of the time. One main motivation for double-blind review is to eliminate bias in favor of well-known authors. However, identification accuracy for authors with substantial publication history is even better (60% accuracy for the top-10% most prolific authors, 85% for authors with 100 or more prior papers).

109 citations

Journal Article•DOI•

Exploratory medical knowledge discovery: experiences and issues

John F. Roddick¹, Peter Fule¹, Warwick J. Graco•Institutions (1)

Flinders University¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This paper presents the experiences of the authors and others in applying exploratory data mining techniques to medical, health and clinical data and provides pointers to possible areas of future research in data mining and knowledge discovery more broadly.

Abstract: The application of data mining and knowledge discovery techniques to medical and health datasets is a rewarding but highly challenging area. Not only are the datasets large, complex, heterogeneous, hierarchical, time-varying and of varying quality but there exists a substantial medical knowledge base which demands a robust collaboration between the data miner and the health professional(s) if useful information is to be extracted. This paper presents the experiences of the authors and others in applying exploratory data mining techniques to medical, health and clinical data. In so doing, it elicits a number of general issues and provides pointers to possible areas of future research in data mining and knowledge discovery more broadly.

104 citations

Journal Article•DOI•

Summary from the KDD-03 panel: data mining: the next 10 years

Usama M. Fayyad, Gregory Piatetsky-Shapiro, Ramasamy Uthurusamy¹•Institutions (1)

General Motors¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This panel was an attempt to address the possible future directions for Data Mining and KDD.

Abstract: The goal of the panel was to gather representatives from academia and industry and to ponder where the field stands after nearly a decade and a half of KDD meetings. We all have seen a significant growth in demand for data mining technology driven by a glut in data. We have observed data mining growing as a healthy research community. However, we still struggle on two important fronts: the scientific and the commercial. On the scientific front, Data Mining still needs to reach a stronger level of attracting steady contributions from the related fields. On the commercial front, the huge opportunity has not yet been met with adequate tools and solutions. This panel was an attempt to address the possible future directions for Data Mining and KDD.

100 citations

Journal Article•DOI•

Prospects and challenges for multi-relational data mining

Pedro Domingos¹•Institutions (1)

University of Washington¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This short paper argues that multi-relational data mining has a key role to play in the growth of KDD, and briefly surveys some of the main drivers, research problems, and opportunities in this emerging field.

Abstract: This short paper argues that multi-relational data mining has a key role to play in the growth of KDD, and briefly surveys some of the main drivers, research problems, and opportunities in this emerging field.

Journal Article•DOI•

Scalability and efficiency in multi-relational data mining

Hendrik Blockeel¹, Michèle Sebag²•Institutions (2)

Katholieke Universiteit Leuven¹, University of Paris-Sud²

01 Jul 2003-Sigkdd Explorations

TL;DR: This article attempts to present a structured overview of the theoretical results, algorithms and implementations that aim at improving the efficiency and scalability of multi-relational data mining approaches.

Abstract: Efficiency and scalability have always been important concerns in the field of data mining, and are even more so in the multi-relational context, which is inherently more complex. The issue has been receiving an increasing amount of attention during the last few years, and quite a number of theoretical results, algorithms and implementations have been presented that explicitly aim at improving the efficiency and scalability of multi-relational data mining approaches. With this article we attempt to present a structured overview.

Journal Article•DOI•

Exploiting relational structure to understand publication patterns in high-energy physics

Amy McGovern¹, Lisa Friedland¹, Michael Hay¹, Brian Gallagher¹, Andrew S. Fast¹, Jennifer Neville¹, David Jensen¹•Institutions (1)

University of Massachusetts Amherst¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This work analyzes publication patterns in theoretical high-energy physics using a relational learning approach, focusing on understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text.

Abstract: We analyze publication patterns in theoretical high-energy physics using a relational learning approach. We focus on four related areas: understanding and identifying patterns of citations, examining publication patterns at the author level, predicting whether a paper will be accepted by specific journals, and identifying research communities from the citation patterns and paper text. Each of these analyses contributes to an overall understanding of theoretical high-energy physics.

Journal Article•DOI•

Improving classification of microarray data using prototype-based feature selection

Blaise Hanczar¹, Mélanie Courtine¹, Arriel Benis¹, Corneliu Hennegar¹, Karine Clément², Jean-Daniel Zucker¹•Institutions (2)

Centre national de la recherche scientifique¹, French Institute of Health and Medical Research²

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper presents experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions.

Abstract: This paper addresses the problem of improving accuracy in the machine-learning task of classification from microarray data. One of the known issues specifically related to microarray data is the large number of inputs (genes) versus the small number of available samples (conditions). A promising direction of research to decrease the generalization error of classification algorithms is to perform gene selection so as to identify those genes which are potentially most relevant for the classification. Classical feature selection methods are based on direct statistical methods. We present a reduction algorithm based on the notion of prototype gene. Each prototype represents a set of similar genes according to a given clustering method. We present experimental evidence of the usefulness of combining prototype-based feature selection with statistical gene selection methods for the task of classifying adenocarcinoma from gene expressions.

Journal Article•DOI•

Biological applications of multi-relational data mining

David C. Page¹, Mark Craven¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This paper presents several applications of multi-relational data mining to biological data, taking care to cover a broad range of multi-relational data mining techniques.

Abstract: Biological databases contain a wide variety of data types, often with rich relational structure. Consequently multi-relational data mining techniques frequently are applied to biological data. This paper presents several applications of multi-relational data mining to biological data, taking care to cover a broad range of multi-relational data mining techniques.

Journal Article•DOI•

Supervised analysis when the number of candidate features (p) greatly exceeds the number of cases (n)

Richard Simon¹•Institutions (1)

National Institutes of Health¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper highlights some of the key aspects of p>>n problems for identifying informative features and developing accurate classifiers.

Abstract: New genomic and proteomic technologies provide measurements of thousands of features for each case. This provides a context for enhanced discovery and false discovery. Most statistical and machine learning procedures were not developed for the p>>n setting and the literature of DNA microarray studies contains many examples of misuse of analytic and computational methods such as cross-validation. This paper highlights some of the key aspects of p>>n problems for identifying informative features and developing accurate classifiers.

Journal Article•DOI•

Meta-clustering of gene expression data and literature-based information

Patrick Glenisson¹, Janick Mathys¹, Bart De Moor¹•Institutions (1)

Katholieke Universiteit Leuven¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper explores how an integrated analysis of expression data and literature-extracted information can reveal biologically meaningful clusters not identified when using microarray information alone, validated in terms of transcriptional regulation.

Abstract: The current tendency in the life sciences to spawn ever growing amounts of high-throughput assays has led to a situation where the interpretation of data and the formulation of hypotheses lag the pace at which information is produced. Although the first generation of statistical algorithms scrutinizing single, large-scale data sets found their way into the biological community, the great challenge to connect their results to existing knowledge still remains. Despite the fairly large number of biological databases that is currently available, a lot of relevant information is found in free-text format (such as textual annotations, scientific abstracts and full publications). In this paper we explore how an integrated analysis of expression data and literature-extracted information can reveal biologically meaningful clusters not identified when using microarray information alone. The joint analysis is validated in terms of transcriptional regulation.

Journal Article•DOI•

Graph-based relational learning: current and future directions

Lawrence B. Holder¹, Diane J. Cook¹•Institutions (1)

University of Texas at Arlington¹

01 Jul 2003-Sigkdd Explorations

TL;DR: This paper focuses on identifying novel, not necessarily most frequent, patterns in a graph-theoretic representation of data, which provides both simplifications and challenges over frequency-based approaches to graph-based data mining.

Abstract: Graph-based relational learning (GBRL) differs from logic-based relational learning, as addressed by inductive logic programming techniques, and differs from frequent subgraph discovery, as addressed by many graph-based data mining techniques. Learning from graphs, rather than logic, presents representational issues both in input data preparation and output pattern language. While a form of graph-based data mining, GBRL focuses on identifying novel, not necessarily most frequent, patterns in a graph-theoretic representation of data. This approach to graph-based data mining provides both simplifications and challenges over frequency-based approaches. In this paper we discuss these issues and future directions of graph-based relational learning.

Journal Article•DOI•

Machine learning methods applied to DNA microarray data can improve the diagnosis of cancer

Eric Bair¹, Robert Tibshirani¹•Institutions (1)

Stanford University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: A novel procedure called "nearest shrunken centroids" is described that has successfully detected clinically relevant genetic differences in cancer patients and has the potential to become a powerful tool for diagnosing and treating cancer.

Abstract: The morbidity rate of cancer victims varies greatly for similar patients who receive similar treatments. It is hypothesized that this variation can be explained by the genetic heterogeneity of the disease. DNA Microarrays, which can simultaneously measure the expression level of thousands of different genes, have been successfully used to identify such genetic differences. However, microarray data typically has a large number of features and relatively few observations, meaning that conventional machine learning tools can fail when applied to such data. We describe a novel procedure called "nearest shrunken centroids" that has successfully detected clinically relevant genetic differences in cancer patients. This procedure has the potential to become a powerful tool for diagnosing and treating cancer.
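
The abstract names the procedure without describing it, so the following is only a simplified sketch of the general shrunken-centroid idea, under stated assumptions: each class centroid is pulled toward the overall centroid by soft thresholding, and a sample is assigned to the nearest shrunken centroid. The published procedure also standardizes each gene's deviation by a within-class standard deviation; that step is omitted here for brevity. All function names and the threshold `delta` are hypothetical choices for illustration.

```python
def soft_threshold(d: float, delta: float) -> float:
    """Shrink d toward zero by delta; anything within delta of zero becomes 0."""
    magnitude = max(abs(d) - delta, 0.0)
    return magnitude if d >= 0 else -magnitude


def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def shrunken_centroids(samples, labels, delta):
    """Return {class: centroid} with each centroid shrunk toward the overall mean.

    Simplification (assumption): no per-feature standardization is applied.
    """
    overall = mean_vector(samples)
    centroids = {}
    for cls in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == cls]
        class_mean = mean_vector(rows)
        centroids[cls] = [
            o + soft_threshold(m - o, delta)
            for m, o in zip(class_mean, overall)
        ]
    return centroids


def classify(x, centroids):
    """Assign x to the class whose shrunken centroid is closest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


if __name__ == "__main__":
    samples = [[5.0, 1.0], [7.0, 1.2], [1.0, 1.1], [3.0, 0.9]]
    labels = ["a", "a", "b", "b"]
    cents = shrunken_centroids(samples, labels, delta=0.5)
    print(cents)
    print(classify([6.0, 1.0], cents))
```

In this toy data the second feature barely differs between classes, so its deviation is thresholded to zero and it stops influencing the classification. That is the feature-selection effect that makes this family of methods attractive when features far outnumber observations.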

Journal Article•DOI•

Using unsupervised link discovery methods to find interesting facts and connections in a bibliography dataset

Shou-De Lin¹, Hans Chalupsky¹•Institutions (1)

University of Southern California¹

01 Dec 2003-Sigkdd Explorations

TL;DR: A set of unsupervised link discovery methods that compute interestingness based on a notion of "rarity" and "abnormality" is developed, and experiments show that these approaches are able to automatically uncover interesting hidden connections and unexpected facts in the data.

Abstract: This paper describes a submission to the Open Task of the 2003 KDD Cup. For this task contestants were asked to devise their own questions about the HEP-Th bibliography dataset, and the most interesting result would be selected as the winner. Instead of taking a more traditional approach such as starting with an inspection of the data, formulating questions or hypotheses interesting to us and then devising an analysis and approach to answer these questions, we tried to go a different route: can we develop a program that automatically finds interesting facts and connections in the data? To do this we developed a set of unsupervised link discovery methods that compute interestingness based on a notion of "rarity" and "abnormality". The experiments performed on the HEP-Th dataset show that our approaches are able to automatically uncover interesting hidden connections (e.g. significant relationships between people) and unexpected facts (e.g. citation loops) without the support of any prerequisite knowledge or training examples. The interestingness of some of our results is self-evident. For others we were able to verify them by looking for supporting evidence on the World-Wide-Web, which shows that our methods can find connections between entities that actually are interestingly connected in the real world in an unsupervised way.

Journal Article•DOI•

Gene ranking using bootstrapped P-values

Sach Mukherjee¹, Stephen J. Roberts¹, Peter Sykacek¹, Sarah J. Gurr¹•Institutions (1)

University of Oxford¹

01 Dec 2003-Sigkdd Explorations

TL;DR: A gene-ranking algorithm whose main novelty is the use of bootstrapped P-values is proposed, showing how it takes account of small-sample variability in observed values of the test statistic, in a way conventional statistical tests cannot.

Abstract: Recent research has shown that it is possible to find genes involved in the pathogenesis of a particular condition on the basis of microarray experiments. Genes which are differentially expressed, for example between healthy and diseased tissues, are likely to be relevant to the disease under study. Some of the properties of microarray datasets make the task of finding these genes a challenging one. This paper proposes a gene-ranking algorithm whose main novelty is the use of bootstrapped P-values. We present an analysis of the algorithm, showing how it takes account of small-sample variability in observed values of the test statistic, in a way conventional statistical tests cannot. Experimental results show that our algorithm outperforms the widely-used two-sample t-test on challenging artificial data. Gene ranking is then performed on two well-known microarray datasets, with encouraging results. For example, a number of genes from one of the datasets, whose differential expression was subsequently confirmed by a more reliable biochemical analysis, are found to be ranked higher by the bootstrapped algorithm than by the conventional t-test, suggesting that the proposed algorithm may be better able to exploit the limited data available to infer biologically useful information.

Journal Article•DOI•

Towards interactive exploration of gene expression patterns

Daxin Jiang¹, Jian Pei¹, Aidong Zhang¹•Institutions (1)

University at Buffalo¹

01 Dec 2003-Sigkdd Explorations

TL;DR: An interactive exploration system GeneX (Gene eXplorer) is introduced for mining coherent expression patterns and a novel coherent pattern index graph is developed to provide highly confident indications of the existence of coherent patterns.

Abstract: Analyzing coherent gene expression patterns is an important task in bioinformatics research and biomedical applications. Recently, various clustering methods have been adapted or proposed to identify clusters of co-expressed genes and recognize coherent expression patterns as the centroids of the clusters. However, the interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge, which presents several challenges for coherent pattern mining and cannot be solved by most existing clustering approaches. In this paper, we introduce an interactive exploration system GeneX (Gene eXplorer) for mining coherent expression patterns. We develop a novel coherent pattern index graph to provide highly confident indications of the existence of coherent patterns. Typical exploration operations are supported based on the index graph. We also provide a bunch of graphical views as the user interface to visualize the data set and facilitate the interactive operations. To help users to interpret and validate the mining results, we design the gene annotation panel that connects the genes with some public annotation databases. The experimental results show that our approach is more effective than the state-of-the-art methods in mining real gene expression data sets.

Journal Article•DOI•

Machine learning in low-level microarray analysis

Benjamin I. P. Rubinstein¹, Jon McAuliffe², Simon Cawley³, Marimuthu Palaniswami¹, Kotagiri Ramamohanarao¹, Terence P. Speed⁴•Institutions (4)

University of Melbourne¹, University of California, Berkeley², Affymetrix³, Walter and Eliza Hall Institute of Medical Research⁴

01 Dec 2003-Sigkdd Explorations

TL;DR: Opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing are explored.

Abstract: Machine learning and data mining have found a multitude of successful applications in microarray analysis, with gene clustering and classification of tissue samples being widely cited examples. Low-level microarray analysis -- often associated with the pre-processing stage within the microarray life-cycle -- has increasingly become an area of active research, traditionally involving techniques from classical statistics. This paper explores opportunities for the application of machine learning and data mining methods to several important low-level microarray analysis problems: monitoring gene expression, transcript discovery, genotyping and resequencing. Relevant methods and ideas from the machine learning community include semi-supervised learning, learning from heterogeneous data, and incremental learning.

Journal Article•DOI•

Citation prediction using time series approach KDD Cup 2003 (task 1)

J. N. Manjunatha¹, K. R. Sivaramakrishnan¹, Raghavendra Kumar Pandey¹, M. Narasimha Murthy¹•Institutions (1)

Indian Institute of Science¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This article describes the authors' experiences in building the winning system for KDD Cup 2003, Task 1, whose goal was to predict the change in the number of citations to each paper in a very large archive of research papers over time.

Abstract: In this article we describe our experiences in building the winning system for KDD Cup, 2003, Task 1. This year's competition was based on a very large archive of research papers that provides an unusually comprehensive snapshot of a particular social network in action; in addition to the full text of research papers, it includes both explicit citation structure and partial data on the downloading of papers by users. It provides a framework for testing general network and usage mining techniques, which can be explored via four varied and interesting tasks. Each task is a separate competition with its own specific goal. In task 1 the goal is to predict the change in number of citations to each paper in the archive over time. The contest was very challenging because the given data was not in a format suitable for conventional data mining techniques. So we had to do a considerable amount of data processing. Also there were different sources of data like tex files, citation graph, slac-data database. So we had to make a decision about which sources to use and how much to use.

Journal Article•DOI•

Mining biologically active patterns in metabolic pathways using microarray expression profiles

Hiroshi Mamitsuka¹, Yasushi Okuno¹, Atsuko Yamaguchi¹•Institutions (1)

Kyoto University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: A new probabilistic framework for analyzing a metabolic pathway with microarray expression profiles is presented, and it is found that this method significantly outperformed another method, which was trained by microarray data only.

Abstract: We present a new probabilistic framework for analyzing a metabolic pathway with microarray expression profiles. Our purpose is to find biologically significant paths and patterns in a given metabolic pathway. Our approach first builds a Markov model using a graph structure of a known metabolic pathway, and then estimates parameters of a mixture of the Markov models using microarray data, based on an EM algorithm. In our experiments, we used a main pathway of glycolysis to evaluate the effectiveness of our method. We first measured the performance of our method comparing with that of another method, in a supervised learning manner, and found that our method significantly outperformed another method, which was trained by microarray data only. We further analyzed the trained models and obtained a number of new biological findings on frequent patterns (paths) and long-range correlations in a metabolic pathway.

Journal Article•DOI•

Loss-based estimation with cross-validation: applications to microarray data analysis

Sandrine Dudoit¹, Mark J. van der Laan¹, Sunduz Keles¹, Annette M. Molinaro¹, Sandra E. Sinisi¹, Siew Leng Teng¹•Institutions (1)

University of California, Berkeley¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This article provides an overview of the methodology and describes its application to the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures.

Abstract: Current statistical inference problems in genomic data analysis involve parameter estimation for high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables. Addressing these inference questions satisfactorily requires: (i) an intensive and thorough search of the parameter space to generate good candidate estimators; (ii) an approach for selecting an optimal estimator among these candidates; and (iii) a method for reliably assessing the performance of the resulting estimator. We propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In this approach, the parameter of interest is defined as the risk minimizer for a suitable loss function and candidate estimators are generated using this (or possibly another) loss function. Cross-validation is applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. This general estimation framework encompasses a number of problems which have traditionally been treated separately in the statistical literature, including multivariate outcome prediction and density estimation based on either uncensored or censored data. This article provides an overview of the methodology and describes its application to the prediction of biological and clinical outcomes (possibly censored) using microarray gene expression measures.

Journal Article•DOI•

Classification of heterogeneous gene expression data

Benny Y. M. Fung¹, Vincent To Yee Ng¹•Institutions (1)

Hong Kong Polytechnic University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: It is shown that, with the integration of the IFs to the Golub and Slonim (GS) and k-nearest neighbors (kNN) classifiers, the classifiers can be further improved on the classification accuracy of heterogeneous samples.

Abstract: Recent advanced technologies in DNA microarray analysis are intensively applied in disease classification, especially for cancer classification. Most recently proposed gene expression classifiers can successfully classify testing samples obtained from the same microarray experiment as training samples with the assumption that the symmetric errors are constant among training and testing samples. However, the classification performance is degraded with heterogeneous testing samples obtained from different microarray experiments. In this paper, we propose the "impact factors" (IFs) to measure the variations between individual classes in training samples and heterogeneous testing samples, and integrate the IFs into classifiers for classification of heterogeneous samples. Two publicly available lung adenocarcinoma gene expression data sets are used in our experiments to demonstrate the effectiveness of the IFs. The results show that integrating the IFs into the Golub and Slonim (GS) and k-nearest neighbors (kNN) classifiers further improves their classification accuracy on heterogeneous samples. Moreover, the classification accuracy of the integrated GS classifier is around 90%.

Journal Article•DOI•

Differential expression, class discovery and class prediction using S-PLUS and S+ArrayAnalyzer

Michael O'Connell¹•Institutions (1)

North Carolina State University¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper provides a review of statistical analysis approaches to the analysis of data from microarray experiments, including discussion of experimental design, data management, preprocessing, differential expression, clustering and class prediction, reporting and annotation.

Abstract: Microarrays are a powerful experimental platform, allowing simultaneous studies of gene expression for thousands of genes under different experimental conditions. However, there is much biological variability induced throughout the experimental process that can obscure the biological signals of interest. As such, the need for experimental design, replication and statistical rigor is now widely recognized. Statistical hypothesis testing has become the accepted differential expression analysis approach and many classification and prediction methods used in class discovery and class prediction now incorporate stochastic modeling components. This paper provides a review of statistical analysis approaches to the analysis of data from microarray experiments. This includes discussion of experimental design, data management, preprocessing, differential expression, clustering and class prediction, reporting and annotation. The review is illustrated with the analysis of an experiment with 3 experimental conditions using the Affymetrix murine chip mgu 74av2; and with descriptions of available functionality in the statistical analysis software S-PLUS and its associated module for microarray analysis, S+ArrayAnalyzer.

Journal Article•DOI•

Resolving citations in a paper repository

Sunita Sarawagi, V. G. Vinod Vydiswaran, Sumana Srinivasan, Kapil Bhudhia

01 Dec 2003-Sigkdd Explorations

TL;DR: The process of creating a citation graph from a given repository of physics publications in LaTeX format involved a series of information extraction, data cleaning, matching and ranking steps.

Abstract: In this paper, we describe our process of creating a citation graph from a given repository of physics publications in LaTeX format. The task involved a series of information extraction, data cleaning, matching and ranking steps. This paper describes the challenges we faced along the way and the issues involved in resolving them.

Journal Article•DOI•

The Download Estimation task on KDD Cup 2003

Janez Brank¹, Jure Leskovec¹•Institutions (1)

Jožef Stefan Institute¹

01 Dec 2003-Sigkdd Explorations

TL;DR: This paper describes the work on the Download Estimation task for KDD Cup 2003, based on an extension of the bag-of-words model, with linear SVM regression as the learning algorithm, and focuses particularly on issues of feature construction and weighting.

Abstract: This paper describes our work on the Download Estimation task for KDD Cup 2003. The task requires us to estimate how many times a paper has been downloaded in the first 60 days after it has been published on arXiv.org, a preprint server for papers on physics and related areas. The training data consists of approximately 29000 papers, the citation graph, and information about the downloads of a subset of these papers. Our approach is based on an extension of the bag-of-words model, with linear SVM regression as the learning algorithm. We describe our experiments with various kinds of features. We focus particularly on issues of feature construction and weighting, which turns out to be quite important for this task.
