
Showing papers by "Michael I. Jordan" published in 2005


Journal ArticleDOI
12 Jun 2005
TL;DR: A statistical debugging algorithm is presented that isolates bugs in programs containing multiple undiagnosed bugs, identifying predictors associated with individual bugs that reveal both the circumstances under which bugs occur and the frequencies of failure modes, making it easier to prioritize debugging efforts.
Abstract: We present a statistical debugging algorithm that isolates bugs in programs containing multiple undiagnosed bugs. Earlier statistical algorithms that focus solely on identifying predictors that correlate with program failure perform poorly when there are multiple bugs. Our new technique separates the effects of different bugs and identifies predictors that are associated with individual bugs. These predictors reveal both the circumstances under which bugs occur as well as the frequencies of failure modes, making it easier to prioritize debugging efforts. Our algorithm is validated using several case studies, including examples in which the algorithm identified previously unknown, significant crashing bugs in widely used systems.

851 citations
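To make the flavor of predicate-based statistical debugging concrete, here is a minimal Python sketch of the Increase(P) scoring heuristic commonly described in this line of work: a predicate scores highly when observing it true raises the probability of failure well above the baseline for runs that merely reached it. The counts, class name, and toy numbers are illustrative assumptions, and the paper's multi-bug algorithm goes well beyond this single-predicate score.

# A hedged sketch of single-predicate scoring in the spirit of statistical
# debugging; the counts and the Increase(P) heuristic are simplified
# illustrations, not the paper's multi-bug algorithm.
from dataclasses import dataclass

@dataclass
class PredicateCounts:
    obs_success: int    # runs where the predicate site was reached and the run succeeded
    obs_failure: int    # runs where the predicate site was reached and the run failed
    true_success: int   # runs where the predicate was observed true and the run succeeded
    true_failure: int   # runs where the predicate was observed true and the run failed

def increase_score(c: PredicateCounts) -> float:
    """Failure(P) - Context(P): how much observing P true raises the failure probability."""
    failure = c.true_failure / (c.true_failure + c.true_success)
    context = c.obs_failure / (c.obs_failure + c.obs_success)
    return failure - context

# Toy example: a predicate that is true almost exclusively in failing runs scores high.
counts = PredicateCounts(obs_success=500, obs_failure=50, true_success=3, true_failure=40)
print(f"Increase(P) = {increase_score(counts):.3f}")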


01 Jan 2005
TL;DR: A probabilistic interpretation of canonical correlation analysis (CCA) is given as a latent variable model for two Gaussian random vectors, and Fisher linear discriminant analysis is cast within the CCA framework.
Abstract: We give a probabilistic interpretation of canonical correlation analysis (CCA) as a latent variable model for two Gaussian random vectors. Our interpretation is similar to the probabilistic interpretation of principal component analysis (Tipping and Bishop, 1999; Roweis, 1998). In addition, we cast Fisher linear discriminant analysis (LDA) within the CCA framework.

490 citations
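As a quick illustration of the latent-variable view described above, the following Python sketch draws two Gaussian views from a shared latent variable and checks that classical CCA finds a highly correlated pair of projections. The dimensions, noise level, and use of scikit-learn's CCA are illustrative assumptions, not the paper's construction.

# A minimal sketch of the generative picture behind probabilistic CCA:
# a shared latent z drives two observed Gaussian views.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n, d1, d2 = 2000, 5, 4

z = rng.normal(size=(n, 1))                     # shared latent variable
W1 = rng.normal(size=(1, d1))
W2 = rng.normal(size=(1, d2))
X1 = z @ W1 + 0.3 * rng.normal(size=(n, d1))    # view 1: W1 z + noise
X2 = z @ W2 + 0.3 * rng.normal(size=(n, d2))    # view 2: W2 z + noise

cca = CCA(n_components=1)
cca.fit(X1, X2)
U, V = cca.transform(X1, X2)
print("first canonical correlation:", np.corrcoef(U[:, 0], V[:, 0])[0, 1])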


Proceedings ArticleDOI
07 Aug 2005
TL;DR: This paper presents an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices, and presents simulation results showing that the algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.
Abstract: Low-rank matrix decompositions are essential tools in the application of kernel methods to large-scale learning problems. These decompositions have generally been treated as black boxes---the decomposition of the kernel matrix that they deliver is independent of the specific learning task at hand---and this is a potentially significant source of inefficiency. In this paper, we present an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices. Our algorithm has the same favorable scaling as state-of-the-art methods such as incomplete Cholesky decomposition---it is linear in the number of data points and quadratic in the rank of the approximation. We present simulation results that show that our algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.

268 citations
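For context on the baseline named in the abstract, here is a Python sketch of standard pivoted incomplete Cholesky applied to a kernel matrix; it is task-independent (no side information), and the kernel, data, and rank are illustrative assumptions.

# Standard pivoted incomplete Cholesky: K is approximated by G @ G.T with few columns.
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).astype(float).copy()     # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))               # pivot on the largest residual diagonal entry
        if d[i] < tol:
            return G[:, :j]
        G[:, j] = (K[:, i] - G @ G[i, :]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

# Toy demo on an RBF kernel matrix.
X = np.random.default_rng(1).normal(size=(200, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
G = incomplete_cholesky(K, rank=30)
print("relative approximation error:", np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))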


Proceedings Article
01 Jan 2005
TL;DR: A semiparametric model for regression and classification problems involving multiple response variables makes use of a set of Gaussian processes to model the relationship to the inputs in a nonparametric fashion.
Abstract: We propose a semiparametric model for regression and classification problems involving multiple response variables. The model makes use of a set of Gaussian processes to model the relationship to the inputs in a nonparametric fashion. Conditional dependencies between the responses can be captured through a linear mixture of the driving processes. This feature becomes important if some of the responses of predictive interest are less densely supplied by observed data than related auxiliary ones. We propose an efficient approximate inference scheme for this semiparametric model whose complexity is linear in the number of training data points.

245 citations
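A minimal Python sketch of the "linear mixture of driving Gaussian processes" idea mentioned above: two responses are coupled by a mixing matrix applied to independent latent GP draws, which is what lets a sparsely observed response borrow strength from a related one. The kernel, mixing matrix, and noise level are illustrative assumptions, and no approximate inference is performed here.

import numpy as np

def rbf_kernel(x, lengthscale=0.2):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
K = rbf_kernel(x) + 1e-8 * np.eye(len(x))                   # jitter for numerical stability

U = rng.multivariate_normal(np.zeros(len(x)), K, size=2)    # two independent latent GP draws

A = np.array([[1.0, 0.0],                                   # mixing matrix couples the responses
              [0.8, 0.6]])
Y = A @ U + 0.05 * rng.normal(size=(2, len(x)))             # two observed response curves

print("empirical correlation between the responses:", np.corrcoef(Y[0], Y[1])[0, 1])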


Journal ArticleDOI
TL;DR: PS-containing liposomes did not appear to directly inhibit dendritic cell maturation in vitro in response to a variety of stimuli, nor did they prevent dendritic cell migration to regional lymph nodes in vivo, suggesting that the inhibitory effects may have resulted from complicated interactions between tissue cells and dendritic cells that subsequently inhibit their ability to productively activate T lymphocytes.
Abstract: Phosphatidylserine (PS) on apoptotic cells promotes their uptake and induces anti-inflammatory responses in phagocytes, including TGF-β release. Little is known regarding the effects of PS on adaptive immune responses. We therefore investigated the effects of PS-containing liposomes on immune responses in mice in vivo. PS liposomes specifically inhibited responses to Ags as determined by decreased draining lymph node tissue mass, with reduced numbers of total leukocytes and Ag-specific CD4+ T cells. There was also a decrease in formation and size of germinal centers in spleen and lymph nodes, accompanied by decreased levels of Ag-specific IgG in blood. Many of these effects were mimicked by an agonistic Ab specific for the PS receptor. TGF-β appears to play a critical role in this inhibition, as the inhibitory effects of PS were reversed by in vivo administration of anti-TGF-β Ab. PS-containing liposomes did not appear to directly inhibit dendritic cell maturation in vitro in response to a variety of stimuli, nor did they prevent their migration to regional lymph nodes in vivo, suggesting that the inhibitory effects may have resulted from complicated interactions between tissue cells and dendritic cells, subsequently inhibiting their ability to productively activate T lymphocytes.

212 citations


Journal ArticleDOI
TL;DR: On the deaminase family dataset, SIFTER's accuracy (96%) is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%).
Abstract: We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.

205 citations


Journal ArticleDOI
TL;DR: It is shown that the coarse-grained and fine-grained localization problems for ad hoc sensor networks can be posed and solved as a pattern recognition problem using kernel methods from statistical learning theory, and a simple and effective localization algorithm is derived.
Abstract: We show that the coarse-grained and fine-grained localization problems for ad hoc sensor networks can be posed and solved as a pattern recognition problem using kernel methods from statistical learning theory. This stems from an observation that the kernel function, which is a similarity measure critical to the effectiveness of a kernel-based learning algorithm, can be naturally defined in terms of the matrix of signal strengths received by the sensors. Thus we work in the natural coordinate system provided by the physical devices. This not only allows us to sidestep the difficult ranging procedure required by many existing localization algorithms in the literature, but also enables us to derive a simple and effective localization algorithm. The algorithm is particularly suitable for networks with densely distributed sensors, most of whose locations are unknown. The computations are initially performed at the base sensors, and the computation cost depends only on the number of base sensors. The localization step for each sensor of unknown location is then performed locally in linear time. We present an analysis of the localization error bounds, and provide an evaluation of our algorithm on both simulated and real sensor networks.

198 citations
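The following Python sketch illustrates the core idea of treating localization as supervised learning on received-signal-strength vectors: nodes with known positions provide training pairs, and a kernel machine maps a strength vector to a position. The toy path-loss model, kernel bandwidth, and use of kernel ridge regression are illustrative assumptions, not the algorithm of the paper.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
bases = rng.uniform(0, 10, size=(6, 2))          # base sensors at known positions
nodes = rng.uniform(0, 10, size=(300, 2))        # sensors to be localized

def signal_strengths(points):
    d = np.linalg.norm(points[:, None, :] - bases[None, :, :], axis=-1)
    return np.exp(-d / 3.0) + 0.02 * rng.normal(size=d.shape)   # toy path loss + noise

S = signal_strengths(nodes)
train, test = slice(0, 200), slice(200, 300)     # nodes with / without known positions

model = KernelRidge(kernel="rbf", gamma=2.0, alpha=1e-2)
model.fit(S[train], nodes[train])                # learn strengths -> position
pred = model.predict(S[test])
print("mean localization error:", np.linalg.norm(pred - nodes[test], axis=1).mean())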


Journal ArticleDOI
TL;DR: Clustering the data for 12 distinct compounds uncovered both known and novel functional interactions that comprise the DNA-damage response and allowed us to define the genetic determinants required for repair of interstrand cross-links.
Abstract: The mechanistic and therapeutic differences in the cellular response to DNA-damaging compounds are not completely understood, despite intense study. To expand our knowledge of DNA damage, we assayed the effects of 12 closely related DNA-damaging agents on the complete pool of ~4,700 barcoded homozygous deletion strains of Saccharomyces cerevisiae. In our protocol, deletion strains are pooled together and grown competitively in the presence of compound. Relative strain sensitivity is determined by hybridization of PCR-amplified barcodes to an oligonucleotide array carrying the barcode complements. These screens identified genes in well-characterized DNA-damage-response pathways as well as genes whose role in the DNA-damage response had not been previously established. High-throughput individual growth analysis was used to independently confirm microarray results. Each compound produced a unique genome-wide profile. Analysis of these data allowed us to determine the relative importance of DNA-repair modules for resistance to each of the 12 profiled compounds. Clustering the data for 12 distinct compounds uncovered both known and novel functional interactions that comprise the DNA-damage response and allowed us to define the genetic determinants required for repair of interstrand cross-links. Further genetic analysis allowed determination of epistasis for one of these functional groups.

172 citations


Journal ArticleDOI
TL;DR: Global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth were determined in adapted batch cultures and cultures subjected to nutrient shifts; using two limitations helped to distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate.
Abstract: We determined global transcriptional responses of Escherichia coli K-12 to sulfur (S)- or nitrogen (N)-limited growth in adapted batch cultures and cultures subjected to nutrient shifts. Using two limitations helped to distinguish between nutrient-specific changes in mRNA levels and common changes related to the growth rate. Both homeostatic and slow growth responses were amplified upon shifts. This made detection of these responses more reliable and increased the number of genes that were differentially expressed. We analyzed microarray data in several ways: by determining expression changes after use of a statistical normalization algorithm, by hierarchical and k-means clustering, and by visual inspection of aligned genome images. Using these tools, we confirmed known homeostatic responses to global S limitation, which are controlled by the activators CysB and Cbl, and found that S limitation propagated into methionine metabolism, synthesis of FeS clusters, and oxidative stress. In addition, we identified several open reading frames likely to respond specifically to S availability. As predicted from the fact that the ddp operon is activated by NtrC, synthesis of cross-links between diaminopimelate residues in the murein layer was increased under N-limiting conditions, as was the proportion of tripeptides. Both of these effects may allow increased scavenging of N from the dipeptide d-alanine-d-alanine, the substrate of the Ddp system.

106 citations


Proceedings ArticleDOI
13 Jun 2005
TL;DR: A set of tools is introduced that augments the ability of operators to perceive the presence of failure: an automatic anomaly detector scours HTTP access logs to find changes in user behavior that are indicative of site failures, and a visualizer helps operators rapidly detect and diagnose problems.
Abstract: Web applications suffer from software and configuration faults that lower their availability. Recovering from failure is dominated by the time interval between when these faults appear and when they are detected by site operators. We introduce a set of tools that augment the ability of operators to perceive the presence of failure: an automatic anomaly detector scours HTTP access logs to find changes in user behavior that are indicative of site failures, and a visualizer helps operators rapidly detect and diagnose problems. Visualization addresses a key question of autonomic computing: how to win operators' confidence so that new tools will be embraced. Evaluation performed using HTTP logs from Ebates.com demonstrates that these tools can enhance the detection of failure as well as shorten detection time. Our approach is application-generic and can be applied to any Web application without the need for instrumentation.

95 citations
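As a hedged illustration of the kind of signal such a detector looks for, the following Python sketch compares the distribution of page hits in a current window of an HTTP access log against a historical reference window using a chi-square statistic. The statistic, thresholds, and toy counts are assumptions; the paper's anomaly detector and visualizer are considerably more elaborate.

from collections import Counter
from scipy.stats import chi2

def anomaly_score(reference, current):
    """Chi-square statistic of current page-hit counts against reference proportions."""
    total_ref = sum(reference.values())
    total_cur = sum(current.values())
    stat, dof = 0.0, 0
    for path, ref_count in reference.items():
        expected = total_cur * ref_count / total_ref
        observed = current.get(path, 0)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
            dof += 1
    return stat, chi2.sf(stat, max(dof - 1, 1))   # statistic and p-value

reference = Counter({"/home": 900, "/buy": 80, "/help": 20})    # normal traffic mix
current = Counter({"/home": 850, "/buy": 10, "/help": 140})     # help-page spike, /buy drop
print(anomaly_score(reference, current))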


Journal ArticleDOI
TL;DR: This work proposes a novel algorithm for decentralized detection using the framework of empirical risk minimization and marginalized kernels, and analyzes its computational and statistical properties both theoretically and empirically.
Abstract: We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm and demonstrate its performance on both simulated and real data sets.

Proceedings Article
05 Dec 2005
TL;DR: A method for robust experiment design based on a semidefinite programming relaxation is presented and applied to the design of experiments for a complex calcium signal transduction pathway, where the parameter estimates obtained from the robust design are found to be better than those obtained from an "optimal" design.
Abstract: We address the problem of robust, computationally-efficient design of biological experiments. Classical optimal experiment design methods have not been widely adopted in biological practice, in part because the resulting designs can be very brittle if the nominal parameter estimates for the model are poor, and in part because of computational constraints. We present a method for robust experiment design based on a semidefinite programming relaxation. We present an application of this method to the design of experiments for a complex calcium signal transduction pathway, where we have found that the parameter estimates obtained from the robust design are better than those obtained from an "optimal" design.
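Because the paper contrasts its robust approach with classical optimal design, here is a Python sketch of the classical D-optimal design problem solved with the standard multiplicative weight update; the candidate experiments are random, and the robust semidefinite-programming relaxation proposed in the paper is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))          # 20 candidate experiments, 3 unknown parameters

w = np.full(len(A), 1.0 / len(A))     # start from the uniform design
for _ in range(500):
    M = A.T @ (w[:, None] * A)        # information matrix sum_i w_i a_i a_i^T
    d = np.einsum("ij,jk,ik->i", A, np.linalg.inv(M), A)   # variances a_i^T M^{-1} a_i
    w *= d / A.shape[1]               # multiplicative update; weights stay normalized

print("design weights:", np.round(w, 3))
print("log det of the information matrix:", np.linalg.slogdet(A.T @ (w[:, None] * A))[1])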

Journal ArticleDOI
TL;DR: A general probabilistic model is developed that clusters genes and experiments without requiring that a given gene or drug appear in only one cluster, and the model is shown to be useful for summarizing the relationship among treatments and genes affected by those treatments in a compendium of microarray profiles.
Abstract: Motivation: In haploinsufficiency profiling data, pleiotropic genes are often misclassified by clustering algorithms that impose the constraint that a gene or experiment belong to only one cluster. We have developed a general probabilistic model that clusters genes and experiments without requiring that a given gene or drug only appear in one cluster. The model also incorporates the functional annotation of known genes to guide the clustering procedure. Results: We applied our model to the clustering of 79 chemogenomic experiments in yeast. Known pleiotropic genes PDR5 and MAL11 are more accurately represented by the model than by a clustering procedure that requires genes to belong to a single cluster. Drugs such as miconazole and fenpropimorph that have different targets but similar off-target genes are clustered more accurately by the model-based framework. We show that this model is useful for summarizing the relationship among treatments and genes affected by those treatments in a compendium of microarray profiles. Availability: Supplementary information and computer code at http://genomics.lbl.gov/llda Contact: flaherty@berkeley.edu

Proceedings Article
05 Dec 2005
TL;DR: A simple and scalable algorithm is presented for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models, with linear convergence using simple gradient and projection calculations.
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
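Since the heart of the method is the predictor-corrector extragradient update, here is a minimal Python sketch applying it to a toy bilinear saddle-point problem over box constraints, where the projection is just clipping. The problem data, step size, and iteration count are illustrative assumptions; the structured projections onto flow polytopes used in the paper are not shown.

import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 6
A = rng.normal(size=(n, m))
c = rng.normal(size=n)
b = rng.normal(size=m)
# Saddle-point problem: min over x in [0,1]^n, max over y in [0,1]^m of
#   L(x, y) = x^T A y + c^T x - b^T y.

def proj(v):                                   # projection onto the box [0, 1]
    return np.clip(v, 0.0, 1.0)

def gap(x, y):                                 # duality gap; zero at a saddle point
    best_y = c @ x + np.maximum(A.T @ x - b, 0.0).sum()
    best_x = -b @ y + np.minimum(A @ y + c, 0.0).sum()
    return best_y - best_x

x, y = np.full(n, 0.5), np.full(m, 0.5)
eta = 0.9 / np.linalg.norm(A, 2)               # step size below 1/L
for _ in range(5000):
    xh = proj(x - eta * (A @ y + c))           # predictor step
    yh = proj(y + eta * (A.T @ x - b))
    x = proj(x - eta * (A @ yh + c))           # corrector step at the predicted point
    y = proj(y + eta * (A.T @ xh - b))

print("duality gap after extragradient:", gap(x, y))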

Proceedings ArticleDOI
18 Mar 2005
TL;DR: A multiple pitch tracking algorithm is presented based on direct probabilistic modeling of the spectrogram of the signal with a factorial hidden Markov model whose parameters are learned discriminatively from the Keele pitch database.
Abstract: We present a multiple pitch tracking algorithm that is based on direct probabilistic modeling of the spectrogram of the signal. The model is a factorial hidden Markov model whose parameters are learned discriminatively from the Keele pitch database. Our algorithm can track several pitches and determines the number of pitches that are active at any given time. We present simulation results on mixtures of several speech signals and noise, showing the robustness of our approach.

Journal ArticleDOI
TL;DR: In this article, the authors characterize responses of Escherichia coli to slow growth per se that are not nutrient-specific; these responses help to coordinate the slowing of growth and, in the case of down-regulated genes, to conserve scarce N or S for other purposes.
Abstract: We previously characterized nutrient-specific transcriptional changes in Escherichia coli upon limitation of nitrogen (N) or sulfur (S). These global homeostatic responses presumably minimize the slowing of growth under a particular condition. Here, we characterize responses to slow growth per se that are not nutrient-specific. The latter help to coordinate the slowing of growth, and in the case of down-regulated genes, to conserve scarce N or S for other purposes. Three effects were particularly striking. First, although many genes under control of the stationary phase sigma factor RpoS were induced and were apparently required under S-limiting conditions, one or more was inhibitory under N-limiting conditions, or RpoS itself was inhibitory. RpoS was, however, universally required during nutrient downshifts. Second, limitation for N and S greatly decreased expression of genes required for synthesis of flagella and chemotaxis, and the motility of E. coli was decreased. Finally, unlike the response of all other met genes, transcription of metE was decreased under S- and N-limiting conditions. The metE product, a methionine synthase, is one of the most abundant proteins in E. coli grown aerobically in minimal medium. Responses of metE to S and N limitation pointed to an interesting physiological rationale for the regulatory subcircuit controlled by the methionine activator MetR.

Journal Article
TL;DR: In this paper, the informative vector machine (IVM) is extended in several ways: a novel noise model allows the IVM to be applied to a mixture of labeled and unlabeled data, a block-diagonal covariance matrix supports learning from related tasks, and prior knowledge from known invariances is incorporated.
Abstract: The informative vector machine (IVM) is a practical method for Gaussian process regression and classification. The IVM produces a sparse approximation to a Gaussian process by combining assumed density filtering with a heuristic for choosing points based on minimizing posterior entropy. This paper extends IVM in several ways. First, we propose a novel noise model that allows the IVM to be applied to a mixture of labeled and unlabeled data. Second, we use IVM on a block-diagonal covariance matrix, for learning to learn from related tasks. Third, we modify the IVM to incorporate prior knowledge from known invariances. All of these extensions are tested on artificial and real data.
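For a rough feel for the greedy sparsification the IVM performs, the following Python sketch grows an active set for a Gaussian process by repeatedly adding the point with the largest posterior predictive variance. This variance criterion is a simplified stand-in for the IVM's entropy-based scoring with assumed density filtering, and the kernel, noise level, and data are illustrative assumptions.

import numpy as np

def rbf(X, Z, ls=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def greedy_active_set(X, size, noise=0.1):
    """Greedily add the point with the largest predictive variance given the active set."""
    active = []
    for _ in range(size):
        if not active:
            var = np.ones(len(X))                      # prior variance of the RBF kernel
        else:
            A = np.array(active)
            Kaa = rbf(X[A], X[A]) + noise * np.eye(len(A))
            Kxa = rbf(X, X[A])
            var = 1.0 - np.einsum("ij,ij->i", Kxa @ np.linalg.inv(Kaa), Kxa)
        if active:
            var[active] = -np.inf                      # never re-select a point
        active.append(int(np.argmax(var)))
    return active

X = np.random.default_rng(0).uniform(0, 5, size=(200, 1))
print("selected active points:", greedy_active_set(X, size=10))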

Journal ArticleDOI
TL;DR: This work introduces a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites, and suggests that marsupials are prime sequencing candidates.
Abstract: Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.

Journal ArticleDOI
TL;DR: This work considers an elaboration of binary classification in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q, and gives conditions on loss functions that make it possible to pick out the (strict) subset of surrogate loss functions yielding Bayes consistency for joint estimation of the discriminant function and the quantizer.
Abstract: The goal of binary classification is to estimate a discriminant function $\gamma$ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer $Q$. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or $f$-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93--102. Univ. California Press, Berkeley] for the 0--1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.
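For readers less familiar with Ali-Silvey functionals, the following LaTeX lines record the standard definition of an f-divergence and two classical members of the family; these are textbook facts used throughout this line of work, not results specific to the paper.

% Standard definition of an f-divergence between distributions P and Q with
% densities p and q, together with two classical instances.
\[
  D_f(P, Q) \;=\; \int f\!\left(\frac{p(x)}{q(x)}\right) q(x)\, dx ,
  \qquad f \ \text{convex},\ f(1) = 0 .
\]
\[
  f(t) = t \log t \;\Rightarrow\; \mathrm{KL}(P \,\|\, Q),
  \qquad
  f(t) = \tfrac{1}{2}\,\lvert t - 1 \rvert \;\Rightarrow\;
  \tfrac{1}{2}\!\int \lvert p(x) - q(x) \rvert\, dx \ \ \text{(variational distance)} .
\]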

Posted Content
25 Oct 2005
TL;DR: A general correspondence is developed between a family of loss functions that act as surrogates to 0-1 loss and the class of Ali-Silvey or f-divergence functionals, providing a basis for choosing and evaluating surrogate losses frequently used in statistical learning.
Abstract: We develop a general correspondence between a family of loss functions that act as surrogates to 0-1 loss, and the class of Ali-Silvey or f-divergence functionals. This correspondence provides the basis for choosing and evaluating various surrogate losses frequently used in statistical learning (e.g., hinge loss, exponential loss, logistic loss); conversely, it provides a decision-theoretic framework for the choice of divergences in signal processing and quantization theory. We exploit this correspondence to characterize the statistical behavior of nonparametric decentralized hypothesis testing algorithms that operate by minimizing convex surrogate loss functions. In particular, we specify the family of loss functions that are equivalent to 0-1 loss in the sense of producing the same quantization rules and discriminant functions.

Proceedings Article
05 Dec 2005
TL;DR: A general theorem is provided that establishes a correspondence between surrogate loss functions in classification and the family of f-divergences and leverages the results to prove consistency of a procedure for learning a classifier under decentralization requirements.
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f-divergences. Moreover, we provide constructive procedures for determining the f-divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f-divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f-divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements.

01 Jan 2005
TL;DR: For single-bug programs, this work designs a utility function whose components may be adjusted based on the suspected level of determinism of the bug, and for the multiple-bug case it presents an iterative predicate scoring algorithm that clusters runs and identifies predicates pointing to the underlying bugs.
Abstract: Statistical debugging is a combination of statistical machine learning and software debugging. Given sampled run-time profiles from both successful and failed runs, our task is to select a small set of program predicates that can succinctly capture the failure modes, thereby leading to the locations of the bugs. Given the diverse nature of software bugs and coding structure, this is not a trivial task. We start by assuming that there is only one bug in the program. This allows us to concentrate on the problem of non-deterministic bugs. We design a utility function whose components may be adjusted based on the suspected level of determinism of the bug. The algorithm proves to work well on two real-world programs. The problem becomes much more complicated once we do away with the single-bug assumption. The original single-bug algorithm does not perform well in the presence of multiple bugs. Our initial attempts at clustering fall short of an effective solution. After identifying the main problems in the multi-bug case, we present an iterative predicate scoring algorithm. We demonstrate the algorithm at work on five real-world programs, where it successfully clusters runs and identifies important predicates that clearly point to many of the underlying bugs.


Proceedings Article
26 Jul 2005
TL;DR: In this article, a hierarchy for approximate inference based on the Dobrushin, Lanford, Ruelle (DLR) equations is proposed; it includes existing algorithms, such as belief propagation, and also motivates novel algorithms such as factorized neighbors (FN) algorithms and variants of mean field (MF) algorithms.
Abstract: We propose a hierarchy for approximate inference based on the Dobrushin, Lanford, Ruelle (DLR) equations. This hierarchy includes existing algorithms, such as belief propagation, and also motivates novel algorithms such as factorized neighbors (FN) algorithms and variants of mean field (MF) algorithms. In particular, we show that extrema of the Bethe free energy correspond to approximate solutions of the DLR equations. In addition, we demonstrate a close connection between these approximate algorithms and Gibbs sampling. Finally, we compare and contrast various of the algorithms in the DLR hierarchy on spin-glass problems. The experiments show that algorithms higher up the hierarchy give more accurate results when they converge but tend to be less stable.
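To ground the spin-glass experiments mentioned above, here is a minimal Python sketch of a damped naive mean-field fixed-point iteration on a small Ising spin glass, one of the simplest members of the family of approximate schemes the hierarchy contains; the couplings, fields, damping factor, and iteration count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 20
J = rng.normal(scale=0.3, size=(n, n))
J = np.triu(J, 1)
J = J + J.T                                # symmetric couplings, zero diagonal
h = rng.normal(scale=0.1, size=n)          # local fields

m = np.zeros(n)                            # mean-field magnetizations
for _ in range(500):
    m_new = np.tanh(J @ m + h)             # naive mean-field fixed-point update
    m = 0.5 * m + 0.5 * m_new              # damping to aid convergence
print("mean-field magnetizations:", np.round(m, 2))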

01 Jan 2005
TL;DR: The correspondence between distance measures and surrogate loss functions in the context of decentralized binary hypothesis testing is shown and a notion of equivalence among distance measures, and among loss functions is developed.
Abstract: In this paper, we show the correspondence between distance measures and surrogate loss functions in the context of decentralized binary hypothesis testing. This correspondence helps explicate the use of various distance measures in signal processing and quantization theory, as well as explain the behavior of surrogate loss functions often used in machine learning and statistics. We then develop a notion of equivalence among distance measures, and among loss functions. Finally, we investigate the statistical behavior of a nonparametric decentralized hypothesis testing algorithm by minimizing convex surrogate loss functions that are equivalent to the 0-1 loss.

01 Jan 2005
TL;DR: It is shown that extrema of the Bethe free energy correspond to approximate solutions of the DLR equations and a close connection between these approximate algorithms and Gibbs sampling is demonstrated.
Abstract: We propose a hierarchy for approximate inference based on the Dobrushin, Lanford, Ruelle (DLR) equations. This hierarchy includes existing algorithms, such as belief propagation, and also motivates novel algorithms such as factorized neighbors (FN) algorithms and variants of mean field (MF) algorithms. In particular, we show that extrema of the Bethe free energy correspond to approximate solutions of the DLR equations. In addition, we demonstrate a close connection between these approximate algorithms and Gibbs sampling. Finally, we compare and contrast various of the algorithms in the DLR hierarchy on spin-glass problems. The experiments show that algorithms higher up in the hierarchy give more accurate results when they converge but tend to be less stable.

Posted Content
TL;DR: This work introduces a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites, and suggests that marsupials are prime sequencing candidates.
Abstract: Sequence comparison across multiple organisms aids in the detection of regions under selection. However, resource limitations require a prioritization of genomes to be sequenced. This prioritization should be grounded in two considerations: the lineal scope encompassing the biological phenomena of interest, and the optimal species within that scope for detecting functional elements. We introduce a statistical framework for optimal species subset selection, based on maximizing power to detect conserved sites. Analysis of a phylogenetic star topology shows theoretically that the optimal species subset is not in general the most evolutionarily diverged subset. We then demonstrate this finding empirically in a study of vertebrate species. Our results suggest that marsupials are prime sequencing candidates.

Proceedings ArticleDOI
18 Mar 2005
TL;DR: A dynamic graphical model (DGM) for automated multi-instrument musical transcription is presented; the system models two musical instruments, each capable of playing at most one note at a time.
Abstract: We present a dynamic graphical model (DGM) for automated multi-instrument musical transcription. By multi-instrument transcription, we mean a system capable of listening to a recording in which two or more instruments are playing, and identifying both the notes that were played and the instruments that played them. Our transcription system models two musical instruments, each capable of playing at most one note at a time. We present results for two-instrument transcription on piano and violin sounds.

Proceedings Article
01 Jan 2005
TL;DR: A general correspondence is established between two classes of statistical functions, Ali-Silvey distances (also known as f-divergences) and surrogate loss functions, showing how to determine the unique f-divergence induced by a given surrogate loss and characterizing all surrogate loss functions that realize a given f-divergence.
Abstract: We establish a general correspondence between two classes of statistical functions: Ali-Silvey distances (also known as f-divergences) and surrogate loss functions. Ali-Silvey distances play an important role in signal processing and information theory, for instance as error exponents in hypothesis testing problems. Surrogate loss functions (e.g., hinge loss, exponential loss) are the basis of recent advances in statistical learning methods for classification (e.g., the support vector machine, AdaBoost). We provide a connection between these two lines of research, showing how to determine the unique f-divergence induced by a given surrogate loss, and characterizing all surrogate loss functions that realize a given f-divergence. The correspondence between f-divergences and surrogate loss functions has applications to the problem of designing quantization rules for decentralized hypothesis testing in the framework of statistical learning (i.e., when the underlying distributions are unknown, but the learner has access to labeled samples).

01 Jan 2005
TL;DR: SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy.
Abstract: We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.