
Showing papers by "David C. Page published in 2012"


Journal ArticleDOI
01 Mar 2012-Nature
TL;DR: An empirical reconstruction of human MSY evolution is presented, in which each stratum transitioned from rapid, exponential loss of ancestral genes to strict conservation through purifying selection.
Abstract: This evolutionary decay was driven by a series of five ‘stratification’ events. Each event suppressed X–Y crossing over within a chromosome segment or ‘stratum’, incorporated that segment into the MSY and subjected its genes to the erosive forces that attend the absence of crossing over2,6. The last of these events occurred 30 million years ago, 5 million years before the human and Old World monkey lineages diverged. Although speculation abounds regarding ongoing decay and looming extinction of the human Y chromosome7–10, remarkably little is known about how many MSY genes were lost in the human lineage in the 25 million years that have followed its separation from the Old World monkey lineage. To investigate this question, we sequenced the MSY of the rhesus macaque, an Old World monkey, and compared it to the human MSY. We discovered that during the last 25 million years MSY gene loss in the human lineage was limited to the youngest stratum (stratum 5), which comprises three percent of the human MSY. In the older strata, which collectively comprise the bulk of the human MSY, gene loss evidently ceased more than 25 million years ago. Likewise, the rhesus MSY has not lost any older genes (from strata 1–4) during the past 25 million years, despite its major structural differences from the human MSY. The rhesus MSY is simpler, with few amplified gene families or palindromes that might enable intrachromosomal recombination and repair. We present an empirical reconstruction of human MSY evolution in which each stratum transitioned from rapid, exponential loss of ancestral genes to strict conservation through purifying selection. The human Y chromosome no longer engages in crossing over with its once-identical partner, the X chromosome, except in its pseudoautosomal regions. During evolution, X–Y crossing over was suppressed in five different chromosomal regions at five different times, each probably resulting from an inversion in the Y chromosome2,3.
Each of these regions of the Y chromosome then began its own individual course of degeneration, experiencing deletions and gene loss. Comparison of the present-day X and Y chromosomes enables identification of these five evolutionary ‘strata’ in the MSY (and X chromosome); their distinctive degrees of X–Y differentiation indicate their evolutionary ages2,3. The oldest stratum (stratum 1) dates back over 240 million years (Myr)2 and is the most highly differentiated, and the youngest stratum (stratum 5) originated only 30 Myr ago and displays the highest X–Y nucleotide sequence similarity within the MSY3. The five strata and their respective decay processes, over tens to hundreds of millions of years of mammalian evolution, offer replicate experiments of nature from which to reconstruct the trajectories and kinetics of gene loss in the MSY. Only the human and chimpanzee MSYs had been sequenced before the present study, and they are separated by just 6 Myr of evolution. We decided to examine the MSY of a much more distant relative, the rhesus macaque (Macaca mulatta), to enable us to reconstruct gene loss and conservation in the MSY during the past 25 Myr. We sequenced the rhesus MSY using bacterial artificial chromosome (BAC) clones and the SHIMS (single-haplotype iterative mapping and sequencing) strategy that has previously been used in the human and chimpanzee MSYs4,11–13 as well as in the chicken Z chromosome5. The resulting sequence comprises 11.0 megabases (Mb), is complete aside from three small gaps, and has an error rate of about one nucleotide per Mb. We ordered and oriented the finished sequence contigs by fluorescence in situ hybridization and radiation hybrid mapping (Supplementary Figs 1–6, Supplementary Table 1, Supplemen…
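The kinetics described above — rapid, exponential loss of ancestral genes followed by strict conservation under purifying selection — can be sketched as a simple decay-to-baseline model. The function and all parameter values below are illustrative only, not the paper's fitted estimates:

```python
import math

def surviving_genes(g0, g_conserved, k, t):
    """Illustrative stratum decay model: of g0 ancestral genes, those not
    under purifying selection are lost exponentially at rate k (per Myr),
    while g_conserved genes persist indefinitely."""
    return g_conserved + (g0 - g_conserved) * math.exp(-k * t)

# Hypothetical stratum: 100 ancestral genes, 15 ultimately conserved,
# decay constant 0.05 per Myr.
at_start = surviving_genes(100, 15, 0.05, 0)         # all 100 present
after_200_myr = surviving_genes(100, 15, 0.05, 200)  # near the conserved plateau
```

Under such a model, each stratum's gene count is effectively flat after a few multiples of 1/k, which is consistent with the observation that loss in the older strata ceased more than 25 Myr ago.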

258 citations


Journal ArticleDOI
TL;DR: The generation of induced embryonic Sertoli-like cells (ieSCs) by ectopic expression of five transcription factors is demonstrated, the role of specific transcription factor combinations in the transition from fibroblasts to ieSCs is characterized, and key steps in the process are identified.

148 citations


Journal ArticleDOI
TL;DR: In this paper, the authors compare three organisms, Caenorhabditis elegans, Drosophila melanogaster and the mouse, and highlight themes from the comparison of these three alternative strategies for navigating the fundamental cycle of sexual reproduction.
Abstract: The germ line represents a continuous cellular link between generations and between species, but the germ cells themselves develop in a specialized, organism-specific context. The model organisms Caenorhabditis elegans, Drosophila melanogaster and the mouse display striking similarities, as well as major differences, in the means by which they control germ cell development. Recent developments in genetic technologies allow a more detailed comparison of the germ cells of these three organisms than has previously been possible, shedding light not only on universal aspects of germline regulation, but also on the control of the pluripotent state in vivo and on the earliest steps of embryogenesis. Here, we highlight themes from the comparison of these three alternative strategies for navigating the fundamental cycle of sexual reproduction.

127 citations


Journal ArticleDOI
TL;DR: In this paper, the authors found that deletions involving the Y chromosome's AZFc region are the most common known genetic cause of severe spermatogenic failure (SSF).
Abstract: Deletions involving the Y chromosome’s AZFc region are the most common known genetic cause of severe spermatogenic failure (SSF). Six recurrent interstitial deletions affecting the region have been reported, but their population genetics are largely unexplored. We assessed the deletions’ prevalence in 20,884 men in five populations and found four of the six deletions (presented here in descending order of prevalence): gr/gr, b2/b3, b1/b3, and b2/b4. One of every 27 men carried one of these four deletions. The 1.6 Mb gr/gr deletion, found in one of every 41 men, almost doubles the risk of SSF and accounts for ∼2% of SSF, although <2% of men with the deletion are affected. The 1.8 Mb b2/b3 deletion, found in one of every 90 men, does not appear to be a risk factor for SSF. The 1.6 Mb b1/b3 deletion, found in one of every 994 men, appears to increase the risk of SSF by a factor of 2.5, although <2% of men with the deletion are affected, and it accounts for only 0.15% of SSF. The 3.5 Mb b2/b4 deletion, found in one of every 2,320 men, increases the risk of SSF 145 times and accounts for ∼6% of SSF; the observed prevalence should approximate the rate at which the deletion arises anew in each generation. We conclude that a single rare variant of major effect (the b2/b4 deletion) and a single common variant of modest effect (the gr/gr deletion) are largely responsible for the AZFc region’s contribution to SSF in the population.
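The prevalence figures quoted above can be checked against the combined "one of every 27 men" figure; a quick sanity check using only the rounded numbers in the abstract:

```python
# Rounded prevalences from the abstract: gr/gr 1/41, b2/b3 1/90,
# b1/b3 1/994, b2/b4 1/2320.
prevalences = {"gr/gr": 1 / 41, "b2/b3": 1 / 90,
               "b1/b3": 1 / 994, "b2/b4": 1 / 2320}

# The deletions are (nearly) mutually exclusive, so the combined carrier
# frequency is approximately the sum of the individual prevalences.
combined = sum(prevalences.values())
one_in = 1 / combined  # ~27, matching "one of every 27 men"
```

The sum works out to about 0.037, i.e. roughly one carrier in every 27 men.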

110 citations


Journal ArticleDOI
TL;DR: Two statistical relational learning algorithms are applied to the task of predicting primary myocardial infarction and it is shown that one SRL algorithm, relational functional gradient boosting, outperforms propositional learners particularly in the medically-relevant high recall region.
Abstract: Electronic health records (EHRs) are an emerging relational domain with large potential to improve clinical outcomes. We apply two statistical relational learning (SRL) algorithms to the task of predicting primary myocardial infarction. We show that one SRL algorithm, relational functional gradient boosting, outperforms propositional learners particularly in the medically-relevant high recall region. We observe that both SRL algorithms predict outcomes better than their propositional analogs and suggest how our methods can augment current epidemiological practices.

70 citations


Proceedings Article
26 Jun 2012
TL;DR: This article showed that there is a region of precision-recall space that is completely unachievable, that the size of this region depends only on the skew of the data set, and discussed the implications for empirical evaluation methodology in machine learning.
Abstract: Precision-recall (PR) curves and the areas under them are widely used to summarize machine learning results, especially for data sets exhibiting class skew. They are often used analogously to ROC curves and the area under ROC curves. It is known that PR curves vary as class skew changes. What was not recognized before this paper is that there is a region of PR space that is completely unachievable, and the size of this region depends only on the skew. This paper precisely characterizes the size of that region and discusses its implications for empirical evaluation methodology in machine learning.
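The boundary of the unachievable region follows from the definitions of precision and recall: for positive-class fraction (skew) π, no classifier can achieve precision below πr/(πr + 1 − π) at recall r. Integrating that boundary gives the region's area, 1 + (1 − π)ln(1 − π)/π. The sketch below is my own derivation-check, not code from the paper; it integrates the boundary numerically and compares it to the closed form:

```python
import math

def min_precision(r, pi):
    # Lowest achievable precision at recall r for positive-class fraction pi.
    return pi * r / (pi * r + (1 - pi))

def unachievable_area(pi, steps=100_000):
    # Midpoint-rule integration of the boundary curve over recall in (0, 1).
    return sum(min_precision((i + 0.5) / steps, pi) for i in range(steps)) / steps

pi = 0.1  # heavily skewed data set: 10% positives
closed_form = 1 + (1 - pi) * math.log(1 - pi) / pi
numeric = unachievable_area(pi)
# numeric agrees with closed_form, and the region grows as skew increases.
```

For π = 0.1 the unachievable region covers only about 5% of PR space, but for π = 0.5 it covers about 31% — which is why comparing areas under PR curves across data sets with different skews is hazardous.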

65 citations


Journal ArticleDOI
TL;DR: Results show that deregulation of the mitotic-meiotic switch in XY germ cells contributes to teratoma initiation, and direct evidence that premature initiation of the meiotic program contributes to tumorigenesis is provided.
Abstract: Testicular teratomas result from anomalies in germ cell development during embryogenesis. In the 129 family of inbred strains of mice, teratomas initiate around embryonic day (E) 13.5, during the same developmental period in which female germ cells initiate meiosis and male germ cells enter mitotic arrest. Here, we report that three germ cell developmental abnormalities, namely continued proliferation, retention of pluripotency, and premature induction of differentiation, associate with teratoma susceptibility. Using mouse strains with low versus high teratoma incidence (129 versus 129-Chr19MOLF/Ei), as well as a strain resistant to teratoma formation (FVB), we found that germ cell proliferation and expression of the pluripotency factor Nanog at a specific time point, E15.5, were directly related to increased tumor risk. Additionally, we discovered that genes expressed in pre-meiotic embryonic female and adult male germ cells, including cyclin D1 (Ccnd1) and stimulated by retinoic acid 8 (Stra8), were prematurely expressed in teratoma-susceptible germ cells and, in rare instances, induced entry into meiosis. As with Nanog, expression of differentiation-associated factors at a specific time point, E15.5, increased with tumor risk. Furthermore, Nanog and Ccnd1, genes with known roles in testicular cancer risk and tumorigenesis, respectively, were co-expressed in teratoma-susceptible germ cells and tumor stem cells, suggesting that retention of pluripotency and premature germ cell differentiation both contribute to tumorigenesis. Importantly, Stra8-deficient mice had an 88% decrease in teratoma incidence, providing direct evidence that premature initiation of the meiotic program contributes to tumorigenesis. These results show that deregulation of the mitotic-meiotic switch in XY germ cells contributes to teratoma initiation.

61 citations


Journal ArticleDOI
TL;DR: In this article, a method for identifying sequences that are W-specific was developed and applied to identify sequences of the female genome that are underrepresented in the male genome and are therefore likely to be female specific.
Abstract: The female-specific W chromosomes and male-specific Y chromosomes have proven difficult to assemble with whole-genome shotgun methods, creating a demand for new approaches to identify sequence contigs specific to these sex chromosomes. Here, we develop and apply a novel method for identifying sequences that are W-specific. Using the Illumina Genome Analyzer, we generated sequence reads from a male domestic chicken (ZZ) and mapped them to the existing female (ZW) genome sequence. This method allowed us to identify segments of the female genome that are underrepresented in the male genome and are therefore likely to be female specific. We developed a Bayesian classifier to automate the calling of W-linked contigs and successfully identified more than 60 novel W-specific sequences. Our classifier can be applied to improve heterogametic whole-genome shotgun assemblies of the W or Y chromosome of any organism. This study greatly improves our knowledge of the W chromosome and will enhance future studies of avian sex determination and sex chromosome evolution.
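The core idea above — map male (ZZ) reads to the female (ZW) assembly and flag contigs that draw far fewer male reads than expected — can be sketched as a tiny Bayesian call. This is an illustrative Poisson parameterization of my own, not the paper's actual classifier; the mismapping rate and prior are made-up values:

```python
import math

def posterior_w_linked(male_reads, expected_depth, prior_w=0.05):
    """Toy Bayesian call: a W-linked contig should attract ~0 male reads
    (a small epsilon allows for mismapping), while an autosomal or Z contig
    should attract ~expected_depth reads. Poisson likelihoods throughout."""
    lam_w = 0.05 * expected_depth   # assumed residual mismapping rate
    lam_bg = expected_depth
    def pois(lam, k):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    num = prior_w * pois(lam_w, male_reads)
    den = num + (1 - prior_w) * pois(lam_bg, male_reads)
    return num / den

# A contig drawing 0 male reads where 20 were expected is almost surely W-linked;
# one drawing the full expected 20 almost surely is not.
p_zero = posterior_w_linked(0, 20)
p_full = posterior_w_linked(20, 20)
```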

44 citations


Journal ArticleDOI
TL;DR: It is reported that DAZ arrived on the Y chromosome about 38 million years ago via the transposition of at least 1.1 megabases of autosomal DNA; the transposition also brought five additional genes to the Y chromosome, but all five were subsequently lost through mutation or deletion.
Abstract: Studies of Y chromosome evolution often emphasize gene loss, but this loss has been counterbalanced by addition of new genes. The DAZ genes, which are critical to human spermatogenesis, were acquired by the Y chromosome in the ancestor of Old World monkeys and apes. We and our colleagues recently sequenced the rhesus macaque Y chromosome, and comparison of this sequence to human and chimpanzee enables us to reconstruct much of the evolutionary history of DAZ. We report that DAZ arrived on the Y chromosome about 38 million years ago via the transposition of at least 1.1 megabases of autosomal DNA. This transposition also brought five additional genes to the Y chromosome, but all five genes were subsequently lost through mutation or deletion. As the only surviving gene, DAZ experienced extensive restructuring, including intragenic amplification and gene duplication, and has been the target of positive selection in the chimpanzee lineage.

Proceedings Article
03 Dec 2012
TL;DR: This work develops a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits, and shows how to update the forest likelihood in closed form, producing efficient model updates.
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
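The multiplicative assumption mentioned above means a variable's transition intensity is the product of the leaf values its current parent state reaches in each tree of the forest, so parameters grow with the number of splits rather than exponentially in the number of parents. A minimal sketch with hypothetical trees (represented here as plain functions, not the paper's learned structures):

```python
# Each "regression tree" maps the current state of the parents to a leaf
# value; the transition intensity is the product over all trees.
def tree_pain(state):
    return 2.0 if state["pain"] else 0.5

def tree_drug(state):
    return 3.0 if state["on_drug"] else 1.0

forest = [tree_pain, tree_drug]

def intensity(state, forest):
    q = 1.0
    for tree in forest:
        q *= tree(state)  # multiplicative combination of leaf values
    return q

q_base = intensity({"pain": True, "on_drug": False}, forest)  # 2.0 * 1.0
q_both = intensity({"pain": True, "on_drug": True}, forest)   # 2.0 * 3.0
```

Because the factors multiply, adding a new conditioning variable adds one tree's worth of parameters instead of doubling a conditional intensity matrix, which is the scalability gain the abstract describes.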

Proceedings Article
22 Jul 2012
TL;DR: This paper casts the problem of post-marketing surveillance of drugs to identify previously-unanticipated ADEs as a reverse machine learning task, related to relational subgroup discovery, and provides an initial evaluation of this approach based on experiments with an actual EMR/EHR and known adverse drug events.
Abstract: The pharmaceutical industry, consumer protection groups, users of medications and government oversight agencies are all strongly interested in identifying adverse reactions to drugs. While a clinical trial of a drug may use only a thousand patients, once a drug is released on the market it may be taken by millions of patients. As a result, in many cases adverse drug events (ADEs) are observed in the broader population that were not identified during clinical trials. Therefore, there is a need for continued, post-marketing surveillance of drugs to identify previously-unanticipated ADEs. This paper casts this problem as a reverse machine learning task, related to relational subgroup discovery, and provides an initial evaluation of this approach based on experiments with an actual EMR/EHR and known adverse drug events.

Journal ArticleDOI
TL;DR: This paper shows that Inductive Logic Programming implemented in ProGolem can derive rules giving structural features of protein/ligand interactions, and several of these rules are consistent with descriptions in the literature.
Abstract: There is a need for automated methods to learn general features of the interactions of a ligand class with its diverse set of protein receptors. An appropriate machine learning approach is Inductive Logic Programming (ILP), which automatically generates comprehensible rules in addition to prediction. The development of ILP systems which can learn rules of the complexity required for studies on protein structure remains a challenge. In this work we use a new ILP system, ProGolem, and demonstrate its performance on learning features of hexose-protein interactions. The rules induced by ProGolem detect interactions mediated by aromatics and by planar-polar residues, in addition to less common features such as the aromatic sandwich. The rules also reveal a previously unreported dependency for residues cys and leu. They also specify interactions involving aromatic and hydrogen bonding residues. This paper shows that Inductive Logic Programming implemented in ProGolem can derive rules giving structural features of protein/ligand interactions. Several of these rules are consistent with descriptions in the literature. In addition to confirming literature results, ProGolem’s model has a 10-fold cross-validated predictive accuracy that is superior, at the 95% confidence level, to another ILP system previously used to study protein/hexose interactions and is comparable with state-of-the-art statistical learners.
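The 95%-confidence comparison of 10-fold cross-validated accuracies reported above is conventionally done with a paired test over folds. A sketch of that comparison using a hand-rolled paired t statistic and hypothetical per-fold accuracies (not the paper's numbers):

```python
import math

def paired_t(xs, ys):
    """Paired t statistic over per-fold accuracy differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical 10-fold accuracies for two learners.
a = [0.82, 0.80, 0.85, 0.81, 0.83, 0.84, 0.80, 0.86, 0.82, 0.83]
b = [0.78, 0.77, 0.80, 0.79, 0.78, 0.81, 0.76, 0.82, 0.79, 0.80]
t = paired_t(a, b)
# With 9 degrees of freedom, |t| > 2.262 indicates a difference at the
# 95% level (two-sided).
significant = abs(t) > 2.262
```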

Proceedings Article
14 Aug 2012
TL;DR: This work proposes a multiple testing procedure based on a Markov-random-field-coupled mixture model, which is applied to a real-world genome-wide association study on breast cancer, and identifies several SNPs with strong association evidence.
Abstract: Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests is still a challenging and important problem in statistics. With recent advances in graphical models, it is feasible to use them to perform multiple testing under dependence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled mixture model. The ground truth of hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters in our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that the numerical performance of multiple testing can be improved substantially by using our procedure. We apply the procedure to a real-world genome-wide association study on breast cancer, and we identify several SNPs with strong association evidence.
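Once the local index of significance (LIS, the posterior null probability) of each hypothesis has been inferred, FDR control reduces to a simple thresholding step: reject the hypotheses with the smallest LIS values as long as their running average stays below the target level. The sketch below shows only this step (the MRF/EM/MCMC inference is omitted) with made-up LIS values:

```python
def reject_by_lis(lis, alpha):
    """Reject the k hypotheses with smallest LIS values such that the
    cumulative mean of the rejected LIS values stays <= alpha."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    rejected, total = [], 0.0
    for rank, i in enumerate(order, start=1):
        total += lis[i]
        if total / rank <= alpha:
            rejected.append(i)
        else:
            break
    return rejected

# Hypothetical posterior null probabilities for five hypotheses.
hits = reject_by_lis([0.01, 0.90, 0.02, 0.60, 0.04], alpha=0.05)
```

Because the cumulative mean of the rejected LIS values estimates the FDR of the rejection set, stopping when it exceeds alpha keeps the FDR controlled.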

Proceedings ArticleDOI
04 Oct 2012
TL;DR: This work builds the first BI-RADS parser for Portuguese free texts, modeled after existing approaches to extract BI-RADS features from English medical records, and compares the performance of the algorithm to manual annotation by a specialist in mammography.
Abstract: In this work we build the first BI-RADS parser for Portuguese free texts, modeled after existing approaches to extract BI-RADS features from English medical records. Our concept finder uses a semantic grammar based on the BI-RADS lexicon and on iteratively transferred expert knowledge. We compare the performance of our algorithm to manual annotation by a specialist in mammography. Our results show that our parser's performance is comparable to the manual method.

Book ChapterDOI
24 Sep 2012
TL;DR: This work presents the first multi-relational differential prediction (aka uplift modeling) system, and proposes three different approaches to learn differential predictive rules within the Inductive Logic Programming framework.
Abstract: A typical classification problem involves building a model to correctly segregate instances of two or more classes. Such a model exhibits differential prediction with respect to given data subsets when its performance is significantly different over these subsets. Driven by a mammography application, we aim at learning rules that predict breast cancer stage while maximizing differential prediction over age-stratified data. In this work, we present the first multi-relational differential prediction (aka uplift modeling) system, and propose three different approaches to learn differential predictive rules within the Inductive Logic Programming framework. We first test and validate our methods on synthetic data, then apply them on a mammography dataset for breast cancer stage differential prediction rule discovery. We mine a novel rule linking calcification to in situ breast cancer in older women.

Proceedings Article
03 Nov 2012
TL;DR: This work introduces novel machine learning algorithms to improve diagnostic accuracy of breast cancer in aging populations, and develops a novel algorithm, Logical Differential Prediction Bayes Net, that calculates the risk of breast disease based on mammography findings.
Abstract: Overdiagnosis is a phenomenon in which screening identifies cancer which may not go on to cause symptoms or death. Women over 65 who develop breast cancer bear the heaviest burden of overdiagnosis. This work introduces novel machine learning algorithms to improve diagnostic accuracy of breast cancer in aging populations. At the same time, we aim at minimizing unnecessary invasive procedures (thus decreasing false positives) and concomitantly addressing overdiagnosis. We develop a novel algorithm, Logical Differential Prediction Bayes Net (LDP-BN), that calculates the risk of breast disease based on mammography findings. LDP-BN uses Inductive Logic Programming (ILP) to learn relational rules, selects older-specific differentially predictive rules, and incorporates them into a Bayes Net, significantly improving its performance. In addition, LDP-BN offers valuable insight into the classification process, revealing novel older-specific rules that link mass presence to invasive cancer, and calcification presence with no detectable mass to DCIS.

Proceedings Article
26 Jun 2012
TL;DR: In this paper, the authors propose a demand-driven approach that, instead of pre-clustering the objects, performs a relational clustering during learning in order to capture the latent structure of EMRs.
Abstract: Learning from electronic medical records (EMR) is challenging due to their relational nature and the uncertain dependence between a patient's past and future health status. Statistical relational learning is a natural fit for analyzing EMRs but is less adept at handling their inherent latent structure, such as connections between related medications or diseases. One way to capture the latent structure is via a relational clustering of objects. We propose a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during learning. We evaluate our algorithm on three real-world tasks where the goal is to use EMRs to predict whether a patient will have an adverse reaction to a medication. We find that our approach is more accurate than performing no clustering, pre-clustering, and using expert-constructed medical heterarchies.

01 Jan 2012
TL;DR: This work proposes the concept of a feature relevance network, a binary Markov random field that represents the relevance of each individual feature by potentials on the nodes and the correlation structure by potentials on the edges, and shows its superior performance over common feature selection methods in terms of prediction error and recovery of the truly relevant features on real-world and synthetic data.
Abstract: Feature screening is a useful feature selection approach for high-dimensional data when the goal is to identify all the features relevant to the response variable. However, common feature screening methods do not take into account the correlation structure of the covariate space. We propose the concept of a feature relevance network, a binary Markov random field to represent the relevance of each individual feature by potentials on the nodes, and represent the correlation structure by potentials on the edges. By performing inference on the feature relevance network, we can accordingly select relevant features. Our algorithm does not yield sparsity, which is different from the particular popular family of feature selection approaches based on penalized least squares or penalized pseudo-likelihood. We give one concrete algorithm under this framework and show its superior performance over common feature selection methods in terms of prediction error and recovery of the truly relevant features on real-world data and synthetic data.
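The structure described above — node potentials encoding each feature's marginal evidence of relevance, edge potentials encouraging correlated features to share relevance — can be sketched on a toy network small enough for exact inference by enumeration. All potentials below are hypothetical values of my own, chosen only to illustrate the mechanism:

```python
import itertools
import math

# Binary relevance z_i per feature. Node potentials: positive => evidence
# of relevance; negative => evidence against. One edge couples two
# correlated features (hypothetical weights, for illustration).
node = [1.2, -0.5, 1.0]
edges = {(0, 2): 0.8}  # features 0 and 2 are strongly correlated

def score(z):
    s = sum(node[i] * z[i] for i in range(len(z)))
    s += sum(w * (1 if z[i] == z[j] else -1) for (i, j), w in edges.items())
    return math.exp(s)

# Exact inference: enumerate all 2^d states (fine for d <= ~15).
states = list(itertools.product([0, 1], repeat=len(node)))
Z = sum(score(z) for z in states)
marginal = [sum(score(z) for z in states if z[i]) / Z for i in range(len(node))]
selected = [i for i, p in enumerate(marginal) if p > 0.5]
```

Note the edge pulls the marginals of features 0 and 2 toward each other, which is exactly the correlation-aware behavior plain feature screening lacks; in practice one would replace enumeration with approximate MRF inference.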


Posted Content
TL;DR: The CLP(BN) language represents the joint probability distribution over missing values in a database or logic program by using constraints to represent Skolem functions.
Abstract: We present CLP(BN), a novel approach that aims at expressing Bayesian networks through the constraint logic programming framework. Arguably, an important limitation of traditional Bayesian networks is that they are propositional, and thus cannot represent relations between multiple similar objects in multiple contexts. Several researchers have therefore proposed first-order languages to describe such networks. One very successful example of this approach is Probabilistic Relational Models (PRMs), which combine Bayesian networks with relational database technology. The key difficulty we had to address when designing CLP(BN) is that logic-based representations use ground terms to denote objects. With probabilistic data, we need to be able to uniquely represent an object whose value we are not sure about. We use Skolem functions as unique new symbols that uniquely represent objects with unknown value. The semantics of CLP(BN) programs then follow naturally from the general framework of constraint logic programming, as applied to a specific domain where we have probabilistic data. This paper introduces and defines CLP(BN), and it describes an implementation and initial experiments. The paper also shows how CLP(BN) relates to Probabilistic Relational Models (PRMs), Ngo and Haddawy's Probabilistic Logic Programs, and Kersting and De Raedt's Bayesian Logic Programs.

Posted Content
TL;DR: In this article, the Sparse Candidate algorithm is extended with a technique called "skewing", which is able to discover relationships between random variables that are approximately correlation-immune, with a significantly lower computational cost than the alternative of considering multiple parents of a node at a time.
Abstract: Searching the complete space of possible Bayesian networks is intractable for problems of interesting size, so Bayesian network structure learning algorithms, such as the commonly used Sparse Candidate algorithm, employ heuristics. However, these heuristics also restrict the types of relationships that can be learned exclusively from data. They are unable to learn relationships that exhibit "correlation-immunity", such as parity. To learn Bayesian networks in the presence of correlation-immune relationships, we extend the Sparse Candidate algorithm with a technique called "skewing". This technique uses the observation that relationships that are correlation-immune under a specific input distribution may not be correlation-immune under another, sufficiently different distribution. We show that by extending Sparse Candidate with this technique we are able to discover relationships between random variables that are approximately correlation-immune, with a significantly lower computational cost than the alternative of considering multiple parents of a node at a time.
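The parity observation can be checked in a few lines: under the uniform distribution a parent of an XOR target has exactly zero correlation with it, but under a sufficiently different distribution the correlation reappears. The weights below are illustrative, not the paper's skewing procedure:

```python
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation under a reweighted (skewed) input distribution."""
    w = w / w.sum()
    mx, my = (w * x).sum(), (w * y).sum()
    cov = (w * (x - mx) * (y - my)).sum()
    return cov / np.sqrt((w * (x - mx) ** 2).sum() * (w * (y - my) ** 2).sum())

# Parity target: y = x1 XOR x2, fully dependent on x1 yet
# correlation-immune under the uniform distribution.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
c_uniform = weighted_corr(X[:, 0], y, np.ones(4))                     # exactly 0
# Skew toward x2 = 1 (illustrative weights): the dependence shows up.
c_skewed = weighted_corr(X[:, 0], y, np.array([1.0, 3.0, 1.0, 3.0]))  # -0.5
```

This is why a greedy, correlation-driven structure search misses parity-like parents under the observed distribution but can find them after skewing.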

Posted Content
TL;DR: This work proposes a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during learning, and finds that this approach is more accurate than performing no clustering, pre-Clustering, and using expert-constructed medical heterarchies.
Abstract: Learning from electronic medical records (EMR) is challenging due to their relational nature and the uncertain dependence between a patient's past and future health status. Statistical relational learning is a natural fit for analyzing EMRs but is less adept at handling their inherent latent structure, such as connections between related medications or diseases. One way to capture the latent structure is via a relational clustering of objects. We propose a novel approach that, instead of pre-clustering the objects, performs a demand-driven clustering during learning. We evaluate our algorithm on three real-world tasks where the goal is to use EMRs to predict whether a patient will have an adverse reaction to a medication. We find that our approach is more accurate than performing no clustering, pre-clustering, or using expert-constructed medical heterarchies.


Proceedings ArticleDOI
07 Oct 2012
TL;DR: This work proposes CollectRank, a ranking approach which allows SNPs to reinforce one another via the correlation structure, loosely analogous to the well-known PageRank algorithm, and evaluates it on synthetic data generated from a variety of genetic models under different settings.
Abstract: Genome-wide association studies (GWAS) analyze genetic variation (SNPs) across the entire human genome, searching for SNPs that are associated with certain phenotypes, most often diseases, such as breast cancer. In GWAS, we seek a ranking of SNPs in terms of their relevance to the given phenotype. However, because certain SNPs are known to be highly correlated with one another across individuals, it can be beneficial to take into account these correlations when ranking. If a SNP appears associated with the phenotype, and we question whether this association is real, the extent to which its neighbors (correlated SNPs) also appear associated can be informative. Therefore, we propose CollectRank, a ranking approach which allows SNPs to reinforce one another via the correlation structure. CollectRank is loosely analogous to the well-known PageRank algorithm. We first evaluate CollectRank on synthetic data generated from a variety of genetic models under different settings. The numerical results suggest CollectRank can significantly outperform common GWAS methods at the cost of a small amount of extra computation. We further evaluate CollectRank on two real-world GWAS on breast cancer and atrial fibrillation/flutter, and CollectRank performs well in both studies. We finally provide a theoretical analysis that also suggests CollectRank's advantages.
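A loose reading of the abstract suggests a PageRank-style fixed point; the sketch below is a hypothetical rendering of that idea, not the paper's exact formulation. Each SNP's score mixes its own marginal association with the scores of its correlated neighbors, propagated through the row-normalized absolute-correlation matrix:

```python
import numpy as np

def collect_rank_scores(assoc, corr, damping=0.15, n_iter=100):
    """PageRank-style reinforcement: each SNP's score mixes its own
    marginal association `assoc` with its neighbors' scores, pushed
    through the row-normalized absolute-correlation matrix `corr`.
    (An illustrative sketch under stated assumptions.)"""
    assoc = np.asarray(assoc, dtype=float)
    W = np.abs(np.asarray(corr, dtype=float))
    np.fill_diagonal(W, 0.0)
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    s = assoc.copy()
    for _ in range(n_iter):
        s = (1 - damping) * assoc + damping * (W @ s)
    return s
```

With two SNPs correlated at 0.9 and a third, isolated SNP, a weakly associated SNP sitting next to a strongly associated neighbor ends up scored above an equally associated but isolated SNP, which is the reinforcement effect the abstract describes.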

01 Dec 2012
TL;DR: The authors report that DAZ arrived on the Y chromosome about 36 million years ago via the transposition of at least 1.1 megabases of autosomal DNA; the transposition also brought five additional genes, all of which were subsequently lost through mutation or deletion.
Abstract: Studies of Y chromosome evolution often emphasize gene loss, but this loss has been counterbalanced by addition of new genes. The DAZ genes, which are critical to human spermatogenesis, were acquired by the Y chromosome in the ancestor of Old World monkeys and apes. We and our colleagues recently sequenced the rhesus macaque Y chromosome, and comparison of this sequence to human and chimpanzee enables us to reconstruct much of the evolutionary history of DAZ. We report that DAZ arrived on the Y chromosome about 36 million years ago via the transposition of at least 1.1 megabases of autosomal DNA. This transposition also brought five additional genes to the Y chromosome, but all five genes were subsequently lost through mutation or deletion. As the only surviving gene, DAZ experienced extensive restructuring, including intragenic amplification and gene duplication, and has been the target of positive selection in the chimpanzee lineage.