
Showing papers by "Nello Cristianini" published in 2006


Journal ArticleDOI
TL;DR: This paper presents CAFE (Computational Analysis of gene Family Evolution), a tool for the statistical analysis of the evolution of the size of gene families.
Abstract: Summary: We present CAFE (Computational Analysis of gene Family Evolution), a tool for the statistical analysis of the evolution of the size of gene families. It uses a stochastic birth and death process to model the evolution of gene family sizes over a phylogeny. For a specified phylogenetic tree, and given the gene family sizes in the extant species, CAFE can estimate the global birth and death rate of gene families, infer the most likely gene family size at all internal nodes, identify gene families that have accelerated rates of gain and loss (quantified by a p-value) and identify which branches cause the p-value to be small for significant families. Availability: Software is available from http://www.bio.indiana.edu/~hahnlab/Software.html Contact: mwh@indiana.edu

1,170 citations
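
The CAFE abstract above describes a stochastic birth and death process over gene family sizes but gives no formula. As a rough illustration, the sketch below evaluates the textbook transition probability of a linear birth-death process with equal birth and death rates. That choice of model, the function name `transition_prob`, and the parameters `lam` (rate per gene per unit time) and `t` (branch length) are assumptions for illustration and are not taken from the CAFE software.

```python
from math import comb


def transition_prob(s, c, lam, t):
    """P(size c after time t | size s at time 0) for a linear
    birth-death process with equal birth and death rate lam.

    Assumed textbook model, not code from CAFE itself.
    """
    if s == 0:
        # A family with no genes cannot gain any: size 0 is absorbing.
        return 1.0 if c == 0 else 0.0
    alpha = lam * t / (1.0 + lam * t)
    total = 0.0
    for j in range(min(s, c) + 1):
        total += (
            comb(s, j)
            * comb(s + c - j - 1, s - 1)
            * alpha ** (s + c - 2 * j)
            * (1.0 - 2.0 * alpha) ** j
        )
    return total


def size_distribution(s, lam, t, max_size):
    """Probabilities of ending at sizes 0..max_size, starting from size s."""
    return [transition_prob(s, c, lam, t) for c in range(max_size + 1)]
```

Given such probabilities along every branch of a tree, a likelihood for the observed family sizes in extant species could in principle be built up node by node; how CAFE actually does this is not stated in the abstract.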


Journal ArticleDOI
20 Dec 2006-PLOS ONE
TL;DR: Analysis of the gene families contained within the whole genomes of human, chimpanzee, mouse, rat, and dog finds that more than half of the 9,990 families present in the mammalian common ancestor have either expanded or contracted along at least one lineage.
Abstract: Gene families are groups of homologous genes that are likely to have highly similar functions. Differences in family size due to lineage-specific gene duplication and gene loss may provide clues to the evolutionary forces that have shaped mammalian genomes. Here we analyze the gene families contained within the whole genomes of human, chimpanzee, mouse, rat, and dog. In total we find that more than half of the 9,990 families present in the mammalian common ancestor have either expanded or contracted along at least one lineage. Additionally, we find that a large number of families are completely lost from one or more mammalian genomes, and a similar number of gene families have arisen subsequent to the mammalian common ancestor. Along the lineage leading to modern humans we infer the gain of 689 genes and the loss of 86 genes since the split from chimpanzees, including changes likely driven by adaptive natural selection. Our results imply that humans and chimpanzees differ by at least 6% (1,418 of 22,000 genes) in their complement of genes, which stands in stark contrast to the oft-cited 1.5% difference between orthologous nucleotide sequences. This genomic "revolving door" of gene gain and loss represents a large number of genetic differences separating humans from our closest relatives.

356 citations


Journal ArticleDOI
TL;DR: A new and fast SDP relaxation of the normalized graph cut problem is presented, and its usefulness in unsupervised and semi-supervised learning is investigated, providing a convex algorithm for transduction, as well as approaches to clustering.
Abstract: The rise of convex programming has changed the face of many research fields in recent years, machine learning being one of the ones that benefitted the most. A very recent development, the relaxation of combinatorial problems to semi-definite programs (SDP), has gained considerable attention over the last decade (Helmberg, 2000; De Bie and Cristianini, 2004a). Although SDP problems can be solved in polynomial time, for many relaxations the exponent in the polynomial complexity bounds is too high for scaling to large problem sizes. This has hampered their uptake as a powerful new tool in machine learning. In this paper, we present a new and fast SDP relaxation of the normalized graph cut problem, and investigate its usefulness in unsupervised and semi-supervised learning. In particular, this provides a convex algorithm for transduction, as well as approaches to clustering. We further propose a whole cascade of fast relaxations that all lie between older spectral relaxations and the new SDP relaxation, allowing one to trade off computational cost versus relaxation accuracy. Finally, we discuss how the methodology developed in this paper can be applied to other combinatorial problems in machine learning, and we treat the max-cut problem as an example.

57 citations
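
For readers unfamiliar with the normalized graph cut problem named in the abstract above, the following sketch computes the usual normalized cut objective, cut(A,B)/vol(A) + cut(A,B)/vol(B), and minimizes it by exhaustive search on a tiny graph. This definition is the standard one from the spectral clustering literature and is assumed here, since the abstract does not spell it out. The brute-force search is not the paper's SDP relaxation; it only shows the combinatorial problem that such relaxations approximate. All function names are hypothetical.

```python
from itertools import combinations


def normalized_cut(weights, part_a):
    """Normalized cut value of the split (part_a, rest) of a weighted graph.

    weights is a symmetric adjacency matrix (list of lists); every node
    is assumed to have positive degree so the volumes are nonzero.
    """
    nodes = range(len(weights))
    a = set(part_a)
    b = [j for j in nodes if j not in a]
    cut = sum(weights[i][j] for i in a for j in b)
    vol_a = sum(weights[i][j] for i in a for j in nodes)
    vol_b = sum(weights[i][j] for i in b for j in nodes)
    return cut / vol_a + cut / vol_b


def brute_force_ncut(weights):
    """Try every two-way split; the cost grows exponentially with graph size."""
    n = len(weights)
    best_value, best_part = float("inf"), None
    for size in range(1, n // 2 + 1):
        for part in combinations(range(n), size):
            value = normalized_cut(weights, part)
            if value < best_value:
                best_value, best_part = value, part
    return best_value, best_part


def two_triangles():
    """Toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3."""
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    w = [[0] * 6 for _ in range(6)]
    for i, j in edges:
        w[i][j] = w[j][i] = 1
    return w
```

On the toy graph, the search separates the two triangles, cutting only the single bridging edge.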


Book Chapter
01 Jan 2006
TL;DR: This chapter discusses an alternative approach based on a convex relaxation of the optimization problem associated to support vector machine transduction which can be optimized in polynomial time and extends the formulation to more general settings of semi-supervised learning, where equivalence and inequivalence constraints are given on labels of some of the samples.
Abstract: We discuss the problem of support vector machine (SVM) transduction, which is a combinatorial problem with exponential computational complexity in the number of unlabeled samples. Different approaches to such combinatorial problems exist, among which are exact integer programming approaches (only feasible for very small sample sizes, e.g. [1]) and local search heuristics starting from a suitably chosen start value such as the approach explained in Chapter 5, Transductive Support Vector Machines, and introduced in [2] (scalable to large problem sizes, but sensitive to local optima). In this chapter, we discuss an alternative approach introduced in [3], which is based on a convex relaxation of the optimization problem associated to support vector machine transduction. The result is a semi-definite programming (SDP) problem which can be optimized in polynomial time, the solution of which is an approximation of the optimal labeling as well as a bound on the true optimum of the original transduction objective function. To further decrease the computational complexity, we propose an approximation that makes it possible to solve transduction problems with up to 1000 unlabeled samples. Lastly, we extend the formulation to more general settings of semi-supervised learning, where equivalence and inequivalence constraints are given on labels of some of the samples.

53 citations
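
The exponential growth mentioned in the chapter abstract above, and the role of equivalence and inequivalence constraints, can be illustrated with a small enumeration. The sketch assumes binary labels in {-1, +1} and encodes an equivalence constraint on samples i and j as y[i] * y[j] == 1 and an inequivalence constraint as y[i] * y[j] == -1. This encoding and the function name are illustrative assumptions, not the chapter's formulation.

```python
from itertools import product


def consistent_labelings(n, must_equal=(), must_differ=()):
    """Return all +/-1 labelings of n samples that satisfy the constraints.

    must_equal and must_differ are iterables of index pairs (i, j).
    Enumeration visits 2**n candidates, which is why exact search is
    only practical for very small n.
    """
    result = []
    for y in product((-1, 1), repeat=n):
        same_ok = all(y[i] * y[j] == 1 for i, j in must_equal)
        diff_ok = all(y[i] * y[j] == -1 for i, j in must_differ)
        if same_ok and diff_ok:
            result.append(y)
    return result
```

Each independent pairwise constraint halves the set of admissible labelings, but the number of candidates still doubles with every additional unconstrained sample.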


Journal ArticleDOI
09 Oct 2006
TL;DR: It is argued that the energy currently consumed in transport is poorly exploited for a number of reasons that the application of intelligent infrastructure analysis can significantly address: traffic congestion implies reduced fuel efficiency; the average number of people travelling per car is very low; and public/private transport combination options are difficult to plan.
Abstract: The potential impact of advances in data mining, data fusion and information management on the efficient exploitation of the transport infrastructure is addressed. It is argued that the energy currently consumed in transport is poorly exploited for a number of reasons that the application of intelligent infrastructure analysis can significantly address: traffic congestion implies reduced fuel efficiency; the average number of people travelling per car is very low; and public/private transport combination options are difficult to plan. Relevant advances in intelligent information systems are reviewed, and a number of scenarios illustrating how these weaknesses might be addressed are presented. For each, the feasibility of the required technology and the timescales for deployment are discussed.

33 citations


Proceedings ArticleDOI
01 Mar 2006
TL;DR: A "statistical signature" of a language is developed, analogous to the genetic signature proposed by Karlin in biology, and its stability within languages and its discriminative power between languages are shown.
Abstract: We propose to address a series of questions related to the evolution of languages by statistical analysis of written text. We develop a "statistical signature" of a language, analogous to the genetic signature proposed by Karlin in biology, and we show its stability within languages and its discriminative power between languages. Using this representation, we address the question of its trajectory during language evolution. We first reconstruct a phylogenetic tree of IE languages using this property, in this way showing that it also contains enough information to act as a "tracking" tag for a language during its evolution. One advantage of this kind of phylogenetic tree is that it does not depend on any semantic assessment or on any choice of words. We use the "statistical signature" to analyze a time series of documents from four Romance languages, following their transition from Latin. The languages are Italian, French, Spanish and Portuguese, and the time points correspond to all centuries from III BC to XX AD.

3 citations
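
The abstract above does not define the "statistical signature". One plausible reading, by analogy with dinucleotide relative abundance in genomics, is the ratio of each letter-pair frequency to the product of its single-letter frequencies. The sketch below implements that reading; the ratio, the mean-absolute-difference distance, and both function names are assumptions for illustration, not the authors' definition.

```python
import re
from collections import Counter


def bigram_signature(text):
    """Map each within-word letter pair xy to f(xy) / (f(x) * f(y)).

    Assumed signature modeled on dinucleotide relative abundance;
    the paper may define its signature differently.
    """
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    letters = Counter(ch for w in words for ch in w)
    bigrams = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    n_letters = sum(letters.values())
    n_bigrams = sum(bigrams.values())
    return {
        bg: (count / n_bigrams)
        / ((letters[bg[0]] / n_letters) * (letters[bg[1]] / n_letters))
        for bg, count in bigrams.items()
    }


def signature_distance(sig_a, sig_b):
    """Mean absolute difference over all pairs seen in either signature."""
    keys = set(sig_a) | set(sig_b)
    if not keys:
        return 0.0
    total = sum(abs(sig_a.get(k, 0.0) - sig_b.get(k, 0.0)) for k in keys)
    return total / len(keys)
```

A matrix of such pairwise distances between texts could then be passed to any standard tree-building method; which method the authors used is not stated in the abstract.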


Book ChapterDOI
01 Jan 2006

1 citation


Book ChapterDOI
01 Jan 2006

1 citation