
Showing papers by "Nello Cristianini" published in 2005


Journal ArticleDOI
TL;DR: A model of stochastic birth and death for gene family evolution is used and it is shown that it can be efficiently applied to multispecies genome comparisons and offers both the opportunity to identify large-scale patterns in genome evolution and the ability to make stronger inferences regarding the role of natural selection.
Abstract: Comparison of whole genomes has revealed that changes in the size of gene families among organisms are quite common. However, there are as yet no models of gene family evolution that make it possible to estimate ancestral states or to infer upon which lineages gene families have contracted or expanded. In addition, large differences in family size have generally been attributed to the effects of natural selection, without a strong statistical basis for these conclusions. Here we use a model of stochastic birth and death for gene family evolution and show that it can be efficiently applied to multispecies genome comparisons. This model takes into account the lengths of branches on phylogenetic trees, as well as duplication and deletion rates, and hence provides expectations for divergence in gene family size among lineages. The model offers both the opportunity to identify large-scale patterns in genome evolution and the ability to make stronger inferences regarding the role of natural selection in gene family expansion or contraction. We apply our method to data from the genomes of five yeast species to show its applicability.

285 citations
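
The birth and death model in the entry above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' method: it assumes equal per-gene duplication and loss rates (the abstract mentions both rates but does not say they are equal), and the function name `simulate_family_size` and all parameter values are hypothetical.

```python
import random

def simulate_family_size(n0, rate, branch_length, rng):
    """Simulate gene family size along one branch of a tree.

    Assumption of this sketch: each gene copy duplicates at `rate` and is
    lost at `rate`, so the total event rate for a family of size n is
    2 * rate * n.
    """
    n, t = n0, 0.0
    while n > 0 and rate > 0:
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(2.0 * rate * n)
        if t > branch_length:
            break
        # With equal rates, duplication and loss are equally likely.
        n += 1 if rng.random() < 0.5 else -1
    return n

# Hypothetical values: family of 10 genes, rate 0.02, branch length 5.
rng = random.Random(0)
sizes = [simulate_family_size(10, 0.02, 5.0, rng) for _ in range(2000)]
mean_size = sum(sizes) / len(sizes)
print(mean_size)
```

Under equal rates the expected family size stays at its starting value, so the spread of `sizes` around 10 gives a feel for how much divergence the model expects from chance alone on a branch of that length.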


Journal ArticleDOI
TL;DR: The differences between the two spectra are bounded and a performance bound on kernel principal component analysis (PCA) is provided showing that good performance can be expected even in very-high-dimensional feature spaces provided the sample eigenvalues fall sufficiently quickly.
Abstract: In this paper, the relationships between the eigenvalues of the $m \times m$ Gram matrix $K$ for a kernel $\kappa(\cdot,\cdot)$ corresponding to a sample $x_1,\ldots,x_m$ drawn from a density $p(x)$ and the eigenvalues of the corresponding continuous eigenproblem are analyzed. The differences between the two spectra are bounded and a performance bound on kernel principal component analysis (PCA) is provided showing that good performance can be expected even in very-high-dimensional feature spaces provided the sample eigenvalues fall sufficiently quickly.

182 citations
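
To make the Gram matrix in the entry above concrete, the following sketch builds $K$ for a small one-dimensional sample and estimates its leading eigenvalue by power iteration. The Gaussian kernel, the sample size, and the helper names (`rbf`, `gram_matrix`, `top_eigenvalue`) are assumptions chosen for illustration; the paper's bounds themselves are not reproduced here.

```python
import math
import random

def rbf(a, b, width=1.0):
    """Gaussian kernel on scalars (an assumed choice of kappa)."""
    return math.exp(-((a - b) ** 2) / (2.0 * width ** 2))

def gram_matrix(xs, kernel):
    """m x m matrix with entries K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(a, b) for b in xs] for a in xs]

def top_eigenvalue(K, iters=200):
    """Estimate the largest eigenvalue of a symmetric PSD matrix."""
    m = len(K)
    v = [1.0 / math.sqrt(m)] * m  # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]
        lam = math.sqrt(sum(x * x for x in w))  # norm of K v
        v = [x / lam for x in w]
    return lam

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(30)]  # draws from p(x)
K = gram_matrix(sample, rbf)
lam1 = top_eigenvalue(K)
trace = sum(K[i][i] for i in range(len(K)))
print(lam1 / trace)  # share of the spectrum held by the top eigenvalue
```

The ratio `lam1 / trace` is one rough indicator of how quickly the sample eigenvalues fall: the closer it is to 1, the more of the spectrum sits in the first component.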


Book ChapterDOI
01 Jan 2005
TL;DR: This chapter describes a large class of pattern analysis methods based on the use of generalized eigenproblems, which reduce to solving the equation $Aw = \lambda Bw$.
Abstract: The task of studying the properties of configurations of points embedded in a metric space has long been a central task in pattern recognition, but has acquired even greater importance after the recent introduction of kernel-based learning methods. These methods work by virtually embedding general types of data in a vector space, and then analyzing the properties of the resulting data cloud. While a number of techniques for this task have been developed in fields as diverse as multivariate statistics, neural networks, and signal processing, many of them show an underlying unity. In this chapter we describe a large class of pattern analysis methods based on the use of generalized eigenproblems, which reduce to solving the equation Aw =

125 citations




Journal ArticleDOI
TL;DR: Results indicate that the NTAR system could assist neuroscientists with thesauri creation for closely related, highly detailed neuroanatomical domains.
Abstract: Generating informational thesauri that classify, cross-reference, and retrieve diverse and highly detailed neuroscientific information requires identifying related neuroanatomical terms and acronyms within and between species (Gorin et al., 2001). Manual construction of such informational thesauri is laborious, and we describe implementing and evaluating a neuroanatomical term and acronym reconciliation (NTAR) system to assist domain experts with this task. NTAR is composed of two modules. The neuroanatomical term extraction (NTE) module employs a hidden Markov model (HMM) in conjunction with lexical rules to extract neuroanatomical terms (NT) and acronyms (NA) from textual material. The output of the NTE is formatted into collections of term- or acronym-indexed documents composed of sentences and word phrases extracted from textual material. The second module, for information retrieval (IR), utilizes a vector space model (VSM) and includes a novel, automated relevance feedback algorithm. The IR module retrieves statistically related neuroanatomical terms and acronyms in response to queried neuroanatomical terms and acronyms. Retrieval of neuroanatomical terms and acronyms from term-based inquiries was compared with (1) term retrieval obtained by including automated relevance feedback and with (2) term retrieval using "document-to-document" comparisons (context-based VSM). The retrieval of synonymous and similar primate and macaque thalamic terms and acronyms in response to a query list of human thalamic terminology by these three IR approaches was compared against a previously published, manually constructed concordance table of homologous cross-species terms and acronyms. Term-based VSM with automated relevance feedback retrieved 70% and 80% of these primate and macaque terms and acronyms, respectively, listed in the concordance table. The automated feedback algorithm correctly identified 87% of the macaque terms and acronyms that were independently selected by a domain expert as being appropriate for manual relevance feedback. Context-based VSM correctly retrieved 97% and 98% of the primate and macaque terms and acronyms listed in the term homology table. These results indicate that the NTAR system could assist neuroscientists with thesauri creation for closely related, highly detailed neuroanatomical domains.

7 citations
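
The vector space model mentioned in the entry above can be sketched as ranking documents by cosine similarity to a query. This is a minimal illustration, not the NTAR implementation: it assumes raw term counts as weights, omits the relevance feedback step, and uses made-up toy documents and hypothetical function names.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return names of documents with nonzero similarity, best first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), name)
              for name, text in documents.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

# Toy, made-up documents for illustration only.
docs = {
    "doc_a": "thalamic nucleus projects to cortex",
    "doc_b": "ventral thalamic nucleus",
    "doc_c": "hidden markov model",
}
ranking = rank("thalamic nucleus", docs)
print(ranking)
```

Shorter documents that share the same query terms score higher here because cosine similarity normalizes by vector length; a production system would typically also weight terms by how rare they are.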