Showing papers by "David C. Page" published in 2005


Journal Article
01 Jan 2005-Nature
TL;DR: In this article, the authors compare the DNA sequences of unique, Y-linked genes in chimpanzee and human, which diverged about six million years ago, and find evidence that in the human lineage, all such genes were conserved through purifying selection.
Abstract: The human Y chromosome, transmitted clonally through males, contains far fewer genes than the sexually recombining autosome from which it evolved. The enormity of this evolutionary decline has led to predictions that the Y chromosome will be completely bereft of functional genes within ten million years [1,2]. Although recent evidence of gene conversion within massive Y-linked palindromes runs counter to this hypothesis, most unique Y-linked genes are not situated in palindromes and have no gene conversion partners [3,4]. The 'impending demise' hypothesis thus rests on understanding the degree of conservation of these genes. Here we find, by systematically comparing the DNA sequences of unique, Y-linked genes in chimpanzee and human, which diverged about six million years ago, evidence that in the human lineage, all such genes were conserved through purifying selection. In the chimpanzee lineage, by contrast, several genes have sustained inactivating mutations. Gene decay in the chimpanzee lineage might be a consequence of positive selection focused elsewhere on the Y chromosome and driven by sperm competition.

172 citations


Journal ArticleDOI
01 Sep 2005-Nature
TL;DR: DNA sequences of unique, Y-linked genes in chimpanzee and human, which diverged about six million years ago, are compared to find evidence that in the human lineage, all such genes were conserved through purifying selection.
Abstract: The human Y chromosome, transmitted clonally through males, contains far fewer genes than the sexually recombining autosome from which it evolved. The enormity of this evolutionary decline has led to predictions that the Y chromosome will be completely bereft of functional genes within ten million years. Although recent evidence of gene conversion within massive Y-linked palindromes runs counter to this hypothesis, most unique Y-linked genes are not situated in palindromes and have no gene conversion partners. The 'impending demise' hypothesis thus rests on understanding the degree of conservation of these genes. Here we find, by systematically comparing the DNA sequences of unique, Y-linked genes in chimpanzee and human, which diverged about six million years ago, evidence that in the human lineage, all such genes were conserved through purifying selection. In the chimpanzee lineage, by contrast, several genes have sustained inactivating mutations. Gene decay in the chimpanzee lineage might be a consequence of positive selection focused elsewhere on the Y chromosome and driven by sperm competition.

168 citations


Journal ArticleDOI
TL;DR: It is proposed that Dazl is required as early as E12.5-E13.5, shortly after its expression is first detected, and that inbred Dazl-/- mice of C57BL/6 background provide a reproducible standard for exploring Dazl's roles in embryonic germ cell development.

166 citations


Journal ArticleDOI
TL;DR: The data demonstrate that, like sex-linked housekeeping genes, germ-cell-specific sex-linked genes are subject to meiotic sex-chromosome inactivation (MSCI), that the chromosome-wide repression imposed by MSCI is limited to meiotic spermatocytes, and that postmeiotic expression of sex-linked genes is variable.
Abstract: We have examined expression during spermatogenesis in the mouse of three Y-linked genes, 11 X-linked genes and 22 autosomal genes, all previously shown to be germ-cell-specific and expressed in premeiotic spermatogonia, plus another 21 germ-cell-specific autosomal genes that initiate expression in meiotic spermatocytes. Our data demonstrate that, like sex-linked housekeeping genes, germ-cell-specific sex-linked genes are subject to meiotic sex-chromosome inactivation (MSCI). Although all the sex-linked genes we investigated underwent MSCI, 14 of the 22 autosomal genes expressed in spermatogonia showed no decrease in expression in meiotic spermatocytes. This along with our observation that an additional 21 germ-cell-specific autosomal genes initiate or significantly up-regulate expression in spermatocytes confirms that MSCI is indeed a sex-chromosome-specific effect. Our results further demonstrate that the chromosome-wide repression imposed by MSCI is limited to meiotic spermatocytes and that postmeiotic expression of sex-linked genes is variable. Thus, 13 of the 14 sex-linked genes we examined showed some degree of postmeiotic reactivation. The extent of postmeiotic reactivation of germ-cell-specific X-linked genes did not correlate with proximity to the X inactivation center or the Xist gene locus. The implications of these findings are discussed with respect to differential gene regulation and the function of MSCI during spermatogenesis, including epigenetic programming of the future paternal genome during spermatogenesis.

142 citations


Journal ArticleDOI
TL;DR: RNF17 is a component of a novel germ cell nuage, the RNF17 granule, which is distinguishable from other known nuages such as chromatoid bodies, and RNF17 is required for differentiation of male germ cells.
Abstract: Nuages are found in the germ cells of diverse organisms. However, nuages in postnatal male germ cells of mice are poorly studied. Previously, we cloned a germ cell-specific gene named Rnf17, which encodes a protein containing both a RING finger and tudor domains. Here, we report that RNF17 is a component of a novel nuage in male germ cells, the RNF17 granule, which is an electron-dense, non-membrane-bound spherical organelle with a diameter of 0.5 μm. RNF17 granules are prominent in late pachytene and diplotene spermatocytes, and in elongating spermatids. RNF17 granules are distinguishable from other known nuages, such as chromatoid bodies. RNF17 is able to form dimers or polymers both in vitro and in vivo, indicating that it may play a role in the assembly of RNF17 granules. Rnf17-deficient male mice were sterile and exhibited a complete arrest in round spermatids, demonstrating that Rnf17 encodes a novel key regulator of spermiogenesis. Rnf17-null round spermatids advanced to step 4 but failed to produce sperm. These results have shown that RNF17 is a component of a novel germ cell nuage and is required for differentiation of male germ cells.

124 citations


Proceedings ArticleDOI
07 Aug 2005
TL;DR: A novel algorithm for decision tree learning in the multi-instance setting as originally defined by Dietterich et al. is introduced and it is shown that the resulting system outperforms the existing multi-instance decision tree learners.
Abstract: We introduce a novel algorithm for decision tree learning in the multi-instance setting as originally defined by Dietterich et al. It differs from existing multi-instance tree learners in a few crucial, well-motivated details. Experiments on synthetic and real-life datasets confirm the beneficial effect of these differences and show that the resulting system outperforms the existing multi-instance decision tree learners.
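
For orientation, the multi-instance setting is usually read as follows: examples arrive as bags of instances, and a bag is labeled positive if and only if at least one of its instances is positive. The Python sketch below illustrates only that labeling assumption. The `concept` function and the toy bags are hypothetical, and nothing here reproduces the tree learner introduced in the paper.

```python
from typing import Callable, Sequence

Instance = Sequence[float]
Bag = Sequence[Instance]


def bag_label(bag: Bag, instance_concept: Callable[[Instance], bool]) -> bool:
    """Standard multi-instance assumption: a bag is positive
    iff at least one of its instances satisfies the concept."""
    return any(instance_concept(x) for x in bag)


# Hypothetical instance-level concept, standing in for a learned tree.
def concept(x: Instance) -> bool:
    return x[0] > 0.5 and x[1] <= 0.2


positive_bag = [(0.1, 0.9), (0.7, 0.1)]  # second instance satisfies the concept
negative_bag = [(0.1, 0.9), (0.7, 0.8)]  # no instance satisfies the concept
```

A learner in this setting sees only the bag labels, never which instance inside a positive bag was responsible.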

98 citations


Book ChapterDOI
03 Oct 2005
TL;DR: This work proposes SAYU (Score As You Use), an algorithm that interleaves the two steps of rule generation and classifier learning by incrementally building a Bayes net during rule learning, and reports a significant improvement in two out of the four applications tested.
Abstract: Inductive Logic Programming (ILP) is a popular approach for learning rules for classification tasks. An important question is how to combine the individual rules to obtain a useful classifier. In some instances, converting each learned rule into a binary feature for a Bayes net learner improves the accuracy compared to the standard decision list approach [3,4,14]. This results in a two-step process, where rules are generated in the first phase, and the classifier is learned in the second phase. We propose an algorithm that interleaves the two steps, by incrementally building a Bayes net during rule learning. Each candidate rule is introduced into the network, and scored by whether it improves the performance of the classifier. We call the algorithm SAYU for Score As You Use. We evaluate two structure learning algorithms, Naive Bayes and Tree Augmented Naive Bayes. We test SAYU on four different datasets and see a significant improvement in two out of the four applications. Furthermore, the theories that SAYU learns tend to consist of far fewer rules than the theories in the two-step approach.

69 citations


Proceedings ArticleDOI
21 Aug 2005
TL;DR: The accuracy of the trained SVM estimated by leave-one-out cross-validation is significantly greater than random guessing, and this result is particularly encouraging since only 3000 SNPs were used in profiling, whereas several million SNPs are known.
Abstract: This paper asks whether susceptibility to early-onset (diagnosis before age 40) of a particularly deadly form of cancer, Multiple Myeloma, can be predicted from single-nucleotide polymorphism (SNP) profiles with an accuracy greater than chance. Specifically, given SNP profiles for 80 Multiple Myeloma patients -- of which we believe 40 to have high susceptibility and 40 to have lower susceptibility -- we train a support vector machine (SVM) to predict age at diagnosis. We chose SVMs for this task because they are well suited to deal with interactions among features and redundant features. The accuracy of the trained SVM estimated by leave-one-out cross-validation is 71%, significantly greater than random guessing. This result is particularly encouraging since only 3000 SNPs were used in profiling, whereas several million SNPs are known.
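
The evaluation protocol named in the abstract, leave-one-out cross-validation, can be sketched in a few lines of Python. To keep the example free of external libraries, a nearest-centroid classifier is swapped in for the SVM. The learner, the function names, and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, int]


def train_nearest_centroid(data: List[Example]) -> Callable[[Vector], int]:
    """Stand-in learner (NOT an SVM): predict the class whose mean vector is closest."""
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x: Vector) -> int:
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    return predict


def leave_one_out_accuracy(data: List[Example], train) -> float:
    """Hold out each example once, train on the rest, and test on the held-out one."""
    correct = 0
    for i, (x, y) in enumerate(data):
        model = train(data[:i] + data[i + 1:])
        correct += int(model(x) == y)
    return correct / len(data)


# Hypothetical toy profiles: two well-separated groups of three examples each.
toy = [
    ((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.3), 0),
    ((1.0, 0.9), 1), ((0.8, 1.1), 1), ((0.9, 1.0), 1),
]
loo_accuracy = leave_one_out_accuracy(toy, train_nearest_centroid)
```

With 80 patients, as in the abstract, this loop would train 80 models, each on 79 profiles.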

66 citations


Proceedings Article
30 Jul 2005
TL;DR: This work provides statistical relational learning (SRL) with the capability of learning new "views" of a relational database, in effect defining new fields or tables that can be highly useful in learning.
Abstract: Statistical relational learning (SRL) constructs probabilistic models from relational databases. A key capability of SRL is the learning of arcs (in the Bayes net sense) connecting entries in different rows of a relational table, or in different tables. Nevertheless, SRL approaches currently are constrained to use the existing database schema. For many database applications, users find it profitable to define alternative "views" of the database, in effect defining new fields or tables. Such new fields or tables can also be highly useful in learning. We provide SRL with the capability of learning new views.

54 citations


Proceedings Article
01 Jan 2005
TL;DR: Using a database from a breast imaging practice containing patient risk factors, imaging findings, and biopsy results, inductive logic programming (ILP) discovered two hypotheses that were judged as interesting by a subspecialty-trained mammographer and validated by analysis of the data itself.
Abstract: The development of large mammography databases provides an opportunity for knowledge discovery and data mining techniques to recognize patterns not previously appreciated. Using a database from a breast imaging practice containing patient risk factors, imaging findings, and biopsy results, we tested whether inductive logic programming (ILP) could discover interesting hypotheses that could subsequently be tested and validated. The ILP algorithm discovered two hypotheses from the data that were 1) judged as interesting by a subspecialty-trained mammographer and 2) validated by analysis of the data itself.

32 citations


Book ChapterDOI
03 Oct 2005
TL;DR: It is shown how to exploit the links between objects in multi-relational data to help a first-order rule learning system direct the search by explicitly traversing these links to find paths between variables of interest.
Abstract: Learning from multi-relational domains has gained increasing attention over the past few years. Inductive logic programming (ILP) systems, which often rely on hill-climbing heuristics in learning first-order concepts, have been a dominating force in the area of multi-relational concept learning. However, hill-climbing heuristics are susceptible to local maxima and plateaus. In this paper, we show how we can exploit the links between objects in multi-relational data to help a first-order rule learning system direct the search by explicitly traversing these links to find paths between variables of interest. Our contributions are twofold: (i) we extend the pathfinding algorithm by Richards and Mooney [12] to make use of mode declarations, which specify the mode of call (input or output) for predicate variables, and (ii) we apply our extended path finding algorithm to saturated bottom clauses, which anchor one end of the search space, allowing us to make use of background knowledge used to build the saturated clause to further direct search. Experimental results on a medium-sized dataset show that path finding allows one to consider interesting clauses that would not easily be found by Aleph.

01 May 2005
TL;DR: This work uses Inductive Logic Programming to find a set of rules that are predictive of aliases, then treats each learned rule as a random variable in a Bayesian Network that assigns a probability that two identities are aliases.
Abstract: Identity Equivalence or Alias Detection is an important topic in Intelligence Analysis. Often, terrorists will use multiple different identities to avoid detection. We apply machine learning to the task of determining Identity Equivalence. Two challenges exist in this domain. First, data can be spread across multiple tables. Second, we need to limit the number of false positives. We present a two step approach to combat these issues. First, we use Inductive Logic Programming to find a set of rules that are predictive of aliases. In the second step, we treat each learned rule as a random variable in a Bayesian Network. We use the Bayesian Network to assign a probability that two identities are aliases. We evaluate our technique on several data sets and find that layering Bayesian Network over the rules significantly increases the precision of our system.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: It is proved that, in an idealized setting, for any function and choice of skew parameters, skewing finds relevant variables with probability 1.
Abstract: We analyze skewing, an approach that has been empirically observed to enable greedy decision tree learners to learn "difficult" Boolean functions, such as parity, in the presence of irrelevant variables. We prove that, in an idealized setting, for any function and choice of skew parameters, skewing finds relevant variables with probability 1. We present experiments exploring how different parameter choices affect the success of skewing in empirical settings. Finally, we analyze a variant of skewing called Sequential Skewing.
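
The effect the abstract describes can be illustrated on a small parity function. The Python sketch below assumes the usual description of skewing: each variable gets a preferred value, and examples are reweighted so that agreeing with a preferred value multiplies the weight by `s` and disagreeing by `1 - s`. The names `preferred` and `s`, and the use of the full truth table in place of a data sample, are assumptions made for illustration, not details taken from the paper.

```python
from itertools import product
from math import log2


def entropy(p: float) -> float:
    """Binary entropy in bits; 0 when the distribution is degenerate."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -(p * log2(p) + (1 - p) * log2(1 - p))


def weighted_gains(n_vars, target, weight):
    """Information gain of each variable over the full truth table, under example weights."""
    rows = [(x, target(x), weight(x)) for x in product((0, 1), repeat=n_vars)]
    total = sum(w for _, _, w in rows)
    base = entropy(sum(w for _, y, w in rows if y) / total)
    gains = []
    for i in range(n_vars):
        remainder = 0.0
        for v in (0, 1):
            part = [(y, w) for x, y, w in rows if x[i] == v]
            mass = sum(w for _, w in part)
            if mass > 0:
                remainder += (mass / total) * entropy(sum(w for y, w in part if y) / mass)
        gains.append(base - remainder)
    return gains


def skew_weight(preferred, s):
    """Weight function that favors examples agreeing with the preferred values."""
    def weight(x):
        w = 1.0
        for xi, pi in zip(x, preferred):
            w *= s if xi == pi else 1.0 - s
        return w
    return weight


def parity(x):
    return x[0] ^ x[1]  # x[2] is irrelevant to the target


uniform = weighted_gains(3, parity, lambda x: 1.0)
skewed = weighted_gains(3, parity, skew_weight(preferred=(1, 0, 1), s=0.75))
```

Under uniform weights every variable has zero gain, so a greedy learner cannot tell the relevant variables from the irrelevant one. Under the skewed weights the two relevant variables acquire positive gain while the irrelevant variable stays at zero.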

Patent
01 Aug 2005
TL;DR: Novel sequence tagged sites (STSs), probes and primers useful, e.g., for detecting the presence or absence of an STS in a sample, and methods of using these STSs, probes and primers in methods of detecting alterations in the Y chromosome, are disclosed.
Abstract: Novel sequence tagged sites (STSs), probes and primers useful, e.g., for detecting the presence or absence of an STS in a sample, and methods of using these STSs, probes and primers, e.g., in methods of detecting alterations in the Y chromosome are disclosed. These compositions are also useful in methods of diagnosing or aiding in the diagnosis and/or cause of reduced sperm count and in methods of predicting or aiding in the prediction of the likelihood of success of infertility treatments.

01 Jan 2005
TL;DR: This thesis develops and evaluates machine learning algorithms that can learn effectively from data with complex interactions and ambiguous labels, including approximation algorithms for MI regression evaluated on synthetic and real-world drug activity prediction problems.
Abstract: In this thesis, we develop and evaluate machine learning algorithms that can learn effectively from data with complex interactions and ambiguous labels. The need for such algorithms is motivated by such problems as protein-protein binding and drug activity prediction. In the first part of the thesis, we focus on the problem of myopia. This problem arises when greedy learning strategies are applied to learn from data with complex interactions. We present skewing, our approach to alleviating myopia. We describe theoretical results and empirical results on Boolean data that show that our approach can learn effectively from data with complex interactions. We investigate the effects of various parameter choices on our approach, and the effects of dimensionality and class-label noise. We then propose and evaluate a variant that scales better to high-dimensional data. Finally, we propose and evaluate an extension that is able to learn from non-Boolean data with similar complex interactions as in the Boolean case. In the second part of the thesis, we focus on the multiple-instance (MI) problem. This problem arises when the class labels or responses of individual instances are unknown, but there are constraints relating the labels of collections of instances (bags). We first describe an empirical evaluation of several multiple-instance and supervised learning methods on several MI datasets. From our study, we derive several useful observations about the accuracy of supervised and MI methods on MI data. We next design and evaluate an approach to learning combining functions from data. These functions are used to combine predictions on each instance into a prediction for a bag. Finally, we consider the problem of regression in a multiple-instance setting. We show that an exact solution to this problem is NP-hard, and develop and evaluate approximation algorithms for MI regression on synthetic and real-world drug activity prediction problems. 
Our experiments show that there is value in considering the MI setting in regression as well as in learning combining functions from data.

Proceedings ArticleDOI
21 Aug 2005
TL;DR: In this approach, rules are scored by how much they improve the classifier, providing a tight coupling between rule generation and rule usage; this novel methodology is called Score As You Use (SAYU).
Abstract: Inductive Logic Programming (ILP) is a popular approach for learning in a relational environment. Given a set of positive and negative examples, an ILP system finds a logical description of the underlying data model that differentiates between the positive and negative examples. The key question becomes how to combine a set of rules to obtain a useful classifier. Previous work has shown that an effective approach is to treat each learned rule as an attribute in a propositional learner, and to use the classifier to determine the final label of an example [3]. This methodology defines a two step process. In the first step, an ILP algorithm learns a set of rules. In the second step, a classifier combines the learned rules. One weakness of this approach is that the rules learned in the first step are being evaluated by a different metric than how they are ultimately scored in the second step. ILP traditionally scores clauses through a coverage score or compression metric. Thus we have no guarantee that the rule learning process will select the rules that best contribute to the final classifier. We propose an alternative approach, based on the idea of constructing the classifier as we learn the rules [2, 4]. In our approach, rules are scored by how much they improve the classifier, providing a tight coupling between rule generation and rule usage. We call this novel methodology Score As You Use (SAYU) [2]. In order to implement SAYU, we first defined an interface that allows an ILP algorithm to control a propositional learner. Second, we developed a greedy algorithm that uses the interface to decide whether to retain a candidate clause. We implemented this interface using Aleph to learn ILP rules, and Bayesian networks as the combining mechanism. We used two different Bayes net structure learning algorithms, Naive Bayes and Tree Augmented Naive Bayes (TAN) as propositional learners. 
We score the network by computing area under the precision recall curve for levels of recall greater than 0.2. Aleph proposes a candidate clause, which is introduced as a new feature in the training set. A new network topology is learned using the new training set, and then the new network is evaluated on a tuning set. If the score of the new network exceeds the previous score we retain the new rule in the training set. Otherwise the rule is discarded. The figure compares performance on the Breast Cancer dataset [1]. These results show that, given the same amount of CPU time, SAYU can clearly outperform the original two step approach. Furthermore, SAYU learns smaller theories. These results were obtained even though SAYU considers far fewer rules than standard ILP.
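
The greedy retention loop described in this abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions: candidate clauses are a fixed list of hypothetical Boolean functions rather than proposals from Aleph, the propositional learner is a small Naive Bayes with Laplace smoothing, and tuning-set accuracy is swapped in for the area under the precision-recall curve used in the paper.

```python
from math import log
from typing import Callable, Dict, List, Tuple

Example = Dict[str, int]          # hypothetical propositional stand-in for a relational example
Rule = Callable[[Example], bool]  # a candidate clause, viewed as a binary feature
Labeled = List[Tuple[Example, int]]


def train_naive_bayes(rules: List[Rule], data: Labeled) -> Callable[[Example], int]:
    """Naive Bayes over binary rule features, with Laplace smoothing."""
    n = {0: 0, 1: 0}
    fires = {0: [0] * len(rules), 1: [0] * len(rules)}
    for ex, y in data:
        n[y] += 1
        for j, rule in enumerate(rules):
            fires[y][j] += int(rule(ex))

    def predict(ex: Example) -> int:
        best, best_score = 0, float("-inf")
        for y in (0, 1):
            score = log((n[y] + 1) / (n[0] + n[1] + 2))
            for j, rule in enumerate(rules):
                p = (fires[y][j] + 1) / (n[y] + 2)
                score += log(p if rule(ex) else 1 - p)
            if score > best_score:
                best, best_score = y, score
        return best

    return predict


def tuning_score(model: Callable[[Example], int], tune: Labeled) -> float:
    # Accuracy is a simple stand-in here; the abstract scores by area under the PR curve.
    return sum(model(ex) == y for ex, y in tune) / len(tune)


def sayu(candidates: List[Rule], train: Labeled, tune: Labeled):
    """Keep a candidate rule only if it raises the classifier's tuning-set score."""
    kept: List[Rule] = []
    best = tuning_score(train_naive_bayes(kept, train), tune)
    for rule in candidates:
        trial = kept + [rule]
        score = tuning_score(train_naive_bayes(trial, train), tune)
        if score > best:
            kept, best = trial, score  # retain the new rule
        # otherwise the rule is discarded
    return kept, best


def rule_a(ex: Example) -> bool:  # hypothetical predictive rule
    return ex["a"] == 1


def rule_b(ex: Example) -> bool:  # hypothetical uninformative rule
    return ex["b"] == 1


train = [({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 1),
         ({"a": 0, "b": 0}, 0), ({"a": 0, "b": 1}, 0)]
tune = [({"a": 1, "b": 1}, 1), ({"a": 0, "b": 0}, 0),
        ({"a": 1, "b": 0}, 1), ({"a": 0, "b": 1}, 0)]
kept, best = sayu([rule_b, rule_a, rule_b], train, tune)
```

On this toy data only `rule_a` is retained. Because every candidate is judged by the metric of the final classifier, rules that add nothing to it are dropped, which is consistent with the smaller theories the abstract reports.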

Proceedings ArticleDOI
07 Aug 2005
TL;DR: The results indicate that skewing, extended to directly handle functions of continuous and nominal variables, almost always outperforms an Information Gain-based decision tree learner.
Abstract: This paper extends previous work on skewing, an approach to problematic functions in decision tree induction. The previous algorithms were applicable only to functions of binary variables. In this paper, we extend skewing to directly handle functions of continuous and nominal variables. We present experiments with randomly generated functions and a number of real world datasets to evaluate the algorithm's accuracy. Our results indicate that our algorithm almost always outperforms an Information Gain-based decision tree learner.

Book ChapterDOI
10 Aug 2005
TL;DR: A new approach to Inductive Logic Programming is proposed that systematically exploits caching and offers a number of advantages over current systems, which avoids redundant computation, is more amenable to the use of set-oriented generation and evaluation of hypotheses, and allows relational DBMS technology to be more easily applied to ILP systems.
Abstract: We propose a new approach to Inductive Logic Programming that systematically exploits caching and offers a number of advantages over current systems. It avoids redundant computation, is more amenable to the use of set-oriented generation and evaluation of hypotheses, and allows relational DBMS technology to be more easily applied to ILP systems. Further, our approach opens up new avenues such as probabilistically scoring rules during search and the generation of probabilistic rules. As a first example of the benefits of our ILP framework, we propose a scheme for defining the hypothesis search space through Inverse Entailment using multiple example seeds.

Proceedings ArticleDOI
21 Aug 2005
TL;DR: It is shown how the links between objects in multi-relational data can be exploited to help a first-order rule learning system to direct the search by explicitly traversing these links to find paths between variables of interest.
Abstract: Learning in multi-relational domains has gained in popularity over the past few years, contributing to applications in diverse areas. Typically, learning from multi-relational domains has involved learning rules about distinct entities so that they can be classified into one category or another. However, there are also interesting applications that are concerned with the problem of learning whether a number of entities are connected. Examples of these include determining whether two proteins interact in a cell, whether two identifiers are aliases, or whether a web page will refer to another one; these are known as link mining [3]. Inductive logic programming (ILP) systems, which often rely on hill-climbing heuristics in learning first-order concepts, have been a dominating force in the area of multi-relational concept learning. However, hill-climbing heuristics are susceptible to local maxima and plateaus, which is especially a factor for large datasets where the branching factor per node can be very large [2, 1]. Ideally, saturation-based search and a good scoring method should eventually lead us to the interesting clauses; however, the search space can grow so quickly that we risk never reaching an interesting path in a reasonable amount of time. This prompted us to consider alternative ways, such as pathfinding [4], to constrain the search space. Richards and Mooney realized that the problem of learning first-order concepts could be represented using graphs, and using the intuition that if two nodes interact there must exist an explanation, proposed that the explanation should be a connected path linking the two nodes. We agree with the idea and propose to use pathfinding on the saturated clause instead. The original pathfinding algorithm assumes the background knowledge forms an undirected graph. 
In contrast, the saturated clause is obtained by using mode declarations: in a nutshell, a literal can only be added to a clause if the literal's input variables are known to be bound. Mode declarations thus embed directionality in the graph formed by literals. We show how we can exploit the links between objects in multi-relational data to help a first-order rule learning system to direct the search by explicitly traversing these links to find paths between variables of interest. Specifically, we extend the pathfinding algorithm by Richards and Mooney [4] to make use of mode declarations to find paths in the saturated bottom clause, which anchors one end of the search space based on background knowledge. Our major insight is that a saturated clause for a moded program can be described as a directed hypergraph, which consists of nodes and hyperarcs that connect a nonempty set of nodes to one target node. Given this, we show that path finding can be reduced to reachability in the hypergraph, whereby each hyperpath will correspond to a hypothesis. However, we may be interested in non-minimal paths and in the composition of paths. We thus propose and evaluate an algorithm that can enumerate all such hyperpaths according to some heuristic and test it on the UW-CSE dataset by Richardson and Domingos [5]. Experimental results on a medium-sized dataset show that path finding allows one to consider interesting clauses that would not easily be found by Aleph.
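
The reduction to reachability can be sketched under one plausible reading of the abstract, in which nodes stand for variables and a hyperarc fires only once all of its source nodes are bound. The Python below is a minimal forward-chaining illustration. The node names and the `reachable` helper are hypothetical, and the heuristic enumeration of hyperpaths that the abstract mentions is not covered.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Hyperarc = Tuple[FrozenSet[str], str]  # (nonempty set of source nodes, single target node)


def reachable(sources: Iterable[str], hyperarcs: List[Hyperarc]) -> Set[str]:
    """Nodes reachable from `sources` in a directed hypergraph.

    A hyperarc contributes its target only when every one of its
    source nodes has already been reached.
    """
    reached = set(sources)
    changed = True
    while changed:
        changed = False
        for tails, head in hyperarcs:
            if head not in reached and tails <= reached:
                reached.add(head)
                changed = True
    return reached


# Hypothetical hypergraph over variables A..E.
arcs: List[Hyperarc] = [
    (frozenset({"A"}), "B"),       # B becomes bound once A is bound
    (frozenset({"A", "B"}), "C"),  # C needs both A and B
    (frozenset({"D"}), "E"),       # E needs D, which is never bound from A
]
from_a = reachable({"A"}, arcs)
```

Starting from `A`, the result contains `A`, `B`, and `C` but not `E`, since the only hyperarc into `E` requires `D`.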