Author

Sayan Mukherjee

Bio: Sayan Mukherjee is an academic researcher at Duke University who has co-authored 291 publications receiving 41,660 citations. The author has an h-index of 55. Previous affiliations of Sayan Mukherjee include University of Illinois at Chicago and Florida State University College of Arts and Sciences. The author has done significant research on the topics of Markov chain Monte Carlo and cancer.

Papers

Open access | Journal Article | DOI: 10.1073/PNAS.0506580102
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

26,320 Citations


Open access | Journal Article | DOI: 10.1023/A:1012450327387
11 Mar 2002 - Machine Learning
Abstract: The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVMs) is considered. This is done by minimizing some estimates of the generalization error of SVMs using a gradient descent algorithm over the set of parameters. Usual methods for choosing parameters, based on exhaustive search become intractable as soon as the number of parameters exceeds two. Some experimental results assess the feasibility of our approach for a large number of parameters (more than 100) and demonstrate an improvement of generalization performance.

Topics: Support vector machine (56%), Gradient descent (54%), Generalization (52%)

2,234 Citations


Open access | Journal Article | DOI: 10.1073/PNAS.211566398
Sridhar Ramaswamy, Pablo Tamayo, Ryan Rifkin, Sayan Mukherjee, and 12 more authors (3 institutions)
Abstract: The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78%, far exceeding the accuracy of random classification (9%). Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.

2,044 Citations


Open access | Proceedings Article
01 Jan 2000
Abstract: We introduce a method of feature selection for Support Vector Machines. The method is based upon finding those features which minimize bounds on the leave-one-out error. This search can be efficiently performed via gradient descent. The resulting algorithms are shown to be superior to some standard feature selection algorithms on both toy data and real-life problems of face recognition, pedestrian detection and analyzing DNA microarray data.

Topics: Feature scaling (65%), Feature selection (64%), Feature vector (62%)

1,087 Citations


Journal Article | DOI: 10.1056/NEJMOA060467
Abstract: Background Clinical trials have indicated a benefit of adjuvant chemotherapy for patients with stage IB, II, or IIIA — but not stage IA — non–small-cell lung cancer (NSCLC). This classification scheme is probably an imprecise predictor of the prognosis of an individual patient. Indeed, approximately 25 percent of patients with stage IA disease have a recurrence after surgery, suggesting the need to identify patients in this subgroup for more effective therapy. Methods We identified gene-expression profiles that predicted the risk of recurrence in a cohort of 89 patients with early-stage NSCLC (the lung metagene model). We evaluated the predictor in two independent groups of 25 patients from the American College of Surgeons Oncology Group (ACOSOG) Z0030 study and 84 patients from the Cancer and Leukemia Group B (CALGB) 9761 study. Results The lung metagene model predicted recurrence for individual patients significantly better than did clinical prognostic factors and was consistent across all early stages ...

Topics: Lung cancer (55%), Cancer (54%)

594 Citations


Cited by

Journal Article | DOI: 10.1038/NPROT.2008.211
01 Jan 2009 - Nature Protocols
Abstract: DAVID bioinformatics resources consist of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers, followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering, and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.

27,356 Citations


Open access | Journal Article | DOI: 10.1073/PNAS.0506580102
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

26,320 Citations


Open access
28 Jul 2005
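Abstract: Antigenic variation allows many pathogenic microorganisms to evade the host immune response. Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1), expressed on the surface of infected erythrocytes, interacts with one or more receptors on infected erythrocytes, endothelial cells, dendritic cells, and the placenta, and plays a key role in adhesion and immune evasion. The var gene family encodes about 60 members per haploid genome, and initiating transcription of different var gene variants provides the molecular basis for antigenic variation.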

18,940 Citations


Journal Article | DOI: 10.1023/A:1009715923555
Christopher John Burges (1 institution)
Abstract: The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and non-separable data, working through a non-trivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.

14,909 Citations


Open access | Journal Article | DOI: 10.1093/NAR/GKV007
Matthew E. Ritchie, Belinda Phipson, Di Wu, Yifang Hu, and 4 more authors (5 institutions)
Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Topics: Microarray databases (61%), Bioconductor (51%)

13,819 Citations


Performance
Metrics

Author's H-index: 55

No. of papers from the Author in previous years
Year    Papers
2021    27
2020    23
2019    22
2018    20
2017    19
2016    20
201620

Top Attributes

Author's top 5 most impactful journals

bioRxiv

24 papers, 151 citations

arXiv: Statistics Theory

17 papers, 104 citations

arXiv: Methodology

9 papers, 109 citations

arXiv: Machine Learning

8 papers, 18 citations

Journal of Machine Learning Research

7 papers, 410 citations

Network Information
Related Authors (5)
Lorin Crawford

53 papers, 616 citations

80% related
Elena Edelman

4 papers, 252 citations

76% related
Joseph R. Nevins

261 papers, 42.9K citations

74% related
Ryan Rifkin

55 papers, 8.8K citations

74% related
Kimberly Roche

10 papers, 122 citations

73% related