
Showing papers by "David C. Page" published in 2017


Journal ArticleDOI
TL;DR: An inducible Y centromere-selective inactivation strategy is developed by exploiting a CENP-A/histone H3 chimaera to directly examine the fate of missegregated chromosomes in otherwise diploid human cells, and initial errors in cell division can provoke further genomic instability through fragmentation of micronuclear DNAs coupled to NHEJ-mediated reassembly in the subsequent interphase.
Abstract: Ly et al. establish a method to selectively inactivate the centromere of the Y chromosome to follow chromosome shattering and micronuclei formation through several cell cycles, and suggest re-ligation of chromosome fragments is dependent on non-homologous end joining.

190 citations


Journal ArticleDOI
TL;DR: In this paper, the chicken W chromosome was sequenced, its gene content was compared with a reconstruction of the ancestral autosomes, and the evolutionary trajectory of ancestral W-linked genes was followed across birds; the authors propose that the chicken W chromosome is essential for embryonic viability of the heterogametic sex.
Abstract: After birds diverged from mammals, different ancestral autosomes evolved into sex chromosomes in each lineage. In birds, females are ZW and males are ZZ, but in mammals females are XX and males are XY. We sequenced the chicken W chromosome, compared its gene content with our reconstruction of the ancestral autosomes, and followed the evolutionary trajectory of ancestral W-linked genes across birds. Avian W chromosomes evolved in parallel with mammalian Y chromosomes, preserving ancestral genes through selection to maintain the dosage of broadly expressed regulators of key cellular processes. We propose that, like the human Y chromosome, the chicken W chromosome is essential for embryonic viability of the heterogametic sex. Unlike other sequenced sex chromosomes, the chicken W chromosome did not acquire and amplify genes specifically expressed in reproductive tissues. We speculate that the pressures that drive the acquisition of reproduction-related genes on sex chromosomes may be specific to the male germ line.

136 citations


Journal ArticleDOI
TL;DR: It is shown that pachytene spermatocytes, which express an RA-synthesizing enzyme, Aldh1a2, contribute directly and significantly to RA production in testes, and it is concluded that the premeiotic transitions are coordinated by RA from Sertoli (somatic) cells.
Abstract: Mammalian spermatogenesis is an elaborately organized differentiation process, starting with diploid spermatogonia, which include germ-line stem cells, and ending with haploid spermatozoa. The process involves four pivotal transitions occurring in physical proximity: spermatogonial differentiation, meiotic initiation, initiation of spermatid elongation, and release of spermatozoa. We report how the four transitions are coordinated in mice. Two premeiotic transitions, spermatogonial differentiation and meiotic initiation, were known to be coregulated by an extrinsic signal, retinoic acid (RA). Our chemical manipulations of RA levels in mouse testes now reveal that RA also regulates the two postmeiotic transitions: initiation of spermatid elongation and spermatozoa release. We measured RA concentrations and found that they changed periodically, as also reflected in the expression patterns of an RA-responsive gene, STRA8; RA levels were low before the four transitions, increased when the transitions occurred, and remained elevated thereafter. We found that pachytene spermatocytes, which express an RA-synthesizing enzyme, Aldh1a2, contribute directly and significantly to RA production in testes. Indeed, chemical and genetic depletion of pachytene spermatocytes revealed that RA from pachytene spermatocytes was required for the two postmeiotic transitions, but not for the two premeiotic transitions. We conclude that the premeiotic transitions are coordinated by RA from Sertoli (somatic) cells. Once germ cells enter meiosis, pachytene spermatocytes produce RA to coordinate the two postmeiotic transitions. In combination, these elements underpin the spatiotemporal coordination of spermatogenesis and ensure its prodigious output in adult males.

88 citations


Journal ArticleDOI
TL;DR: It is shown that, in mice, maintenance of an extended meiotic prophase I requires the gene Meioc, a germ-cell specific factor conserved in most metazoans that promotes a meiotic (as opposed to mitotic) cell cycle program via post-transcriptional control of their target transcripts.
Abstract: The meiosis-specific chromosomal events of homolog pairing, synapsis, and recombination occur over an extended meiotic prophase I that is many times longer than prophase of mitosis. Here we show that, in mice, maintenance of an extended meiotic prophase I requires the gene Meioc, a germ-cell specific factor conserved in most metazoans. In mice, Meioc is expressed in male and female germ cells upon initiation of and throughout meiotic prophase I. Mouse germ cells lacking Meioc initiate meiosis: they undergo pre-meiotic DNA replication, they express proteins involved in synapsis and recombination, and a subset of cells progress as far as the zygotene stage of prophase I. However, cells in early meiotic prophase (as early as the preleptotene stage) proceed to condense their chromosomes and assemble a spindle, as if having progressed to metaphase. Meioc-deficient spermatocytes that have initiated synapsis mis-express CYCLIN A2, which is normally expressed in mitotic spermatogonia, suggesting a failure to properly transition to a meiotic cell cycle program. MEIOC interacts with YTHDC2, and the two proteins pull down an overlapping set of mitosis-associated transcripts. We conclude that when the meiotic chromosomal program is initiated, Meioc is simultaneously induced so as to extend meiotic prophase. Specifically, MEIOC, together with YTHDC2, promotes a meiotic (as opposed to mitotic) cell cycle program via post-transcriptional control of their target transcripts.

85 citations


Journal ArticleDOI
TL;DR: This work is the first to investigate a big data machine learning strategy for ADE discovery on massive datasets downloaded from PubMed Central and social media and shows possible capacities in big data biomedical text analysis using advanced computational methods with real-time update from new data published on a daily basis.
Abstract: Background: The study of adverse drug events (ADEs) is a tenured topic in medical literature. In recent years, increasing numbers of scientific articles and health-related social media posts have been generated and shared daily, albeit with very limited use for ADE study and with little known about the content with respect to ADEs. Objective: The aim of this study was to develop a big data analytics strategy that mines the content of scientific articles and health-related Web-based social media to detect and identify ADEs. Methods: We analyzed the following two data sources: (1) biomedical articles and (2) health-related social media blog posts. We developed an intelligent and scalable text mining solution on big data infrastructures composed of Apache Spark, natural language processing, and machine learning. This was combined with an Elasticsearch No-SQL distributed database to explore and visualize ADEs. Results: The accuracy, precision, recall, and area under receiver operating characteristic of the system were 92.7%, 93.6%, 93.0%, and 0.905, respectively, and showed better results in comparison with traditional approaches in the literature. This work not only detected and classified ADE sentences from big data biomedical literature but also scientifically visualized ADE interactions. Conclusions: To the best of our knowledge, this work is the first to investigate a big data machine learning strategy for ADE discovery on massive datasets downloaded from PubMed Central and social media. This contribution illustrates possible capacities in big data biomedical text analysis using advanced computational methods with real-time update from new data published on a daily basis.

48 citations



Journal ArticleDOI
TL;DR: The authors directly examined the fate of missegregated chromosomes in otherwise diploid human cells (the male human DLD-1 cell line) and developed an inducible Y centromere-selective inactivation strategy by exploiting a CENP-A/histone H3 chimaera.
Abstract: Ly et al. establish a method to selectively inactivate the centromere of the Y chromosome to follow chromosome shattering and micronuclei formation through several cell cycles, and suggest re-ligation of chromosome fragments is dependent on non-homologous end joining.

39 citations


26 Jul 2017
TL;DR: A simple text mining method that is easy to implement, requires minimal data collection and preparation, and is easy to use for proposing ranked associations between a list of target terms and a key phrase is presented.
Abstract: We present a simple text mining method that is easy to implement, requires minimal data collection and preparation, and is easy to use for proposing ranked associations between a list of target terms and a key phrase. We call this method KinderMiner, and apply it to two biomedical applications. The first application is to identify relevant transcription factors for cell reprogramming, and the second is to identify potential drugs for investigation in drug repositioning. We compare the results from our algorithm to existing data and state-of-the-art algorithms, demonstrating compelling results for both application areas. While we apply the algorithm here for biomedical applications, we argue that the method is generalizable to any available corpus of sufficient size.

27 citations


06 Nov 2017
TL;DR: Experimental results on a large-scale cohort of real-world EHRs demonstrate that the proposed method outperforms a leading approach, multiple self-controlled case series (Simpson et al., 2013), in identifying benchmark ADRs defined by the Observational Medical Outcomes Partnership.
Abstract: Adverse drug reaction (ADR) discovery is the task of identifying unexpected and negative events caused by pharmaceutical products. This paper describes a log-linear Hawkes process model for ADR discovery from longitudinal observational data such as electronic health records (EHRs). The proposed method leverages the irregular time-stamped events in EHRs to represent the time-varying effect of various drugs on the occurrence rate of adverse events. Experimental results on a large-scale cohort of real-world EHRs demonstrate that the proposed method outperforms a leading approach, multiple self-controlled case series (Simpson et al., 2013), in identifying benchmark ADRs defined by the Observational Medical Outcomes Partnership.

20 citations


Proceedings ArticleDOI
01 Dec 2017
TL;DR: An open-source big data neural network toolkit, namely bigNN, is designed and developed which tackles the problem of large-scale biomedical text classification in an efficient fashion, facilitating fast prototyping and reproducible text analytics research.
Abstract: Every single day, a massive amount of text data is generated by different medical data sources, such as scientific literature, medical web pages, health-related social media, clinical notes, and drug reviews. Processing this wealth of data is indeed a daunting task, and it forces us to adopt smart and scalable computational strategies, including machine intelligence, big data analytics, and distributed architecture. In this contribution, we designed and developed an open-source big data neural network toolkit, namely bigNN, which tackles the problem of large-scale biomedical text classification in an efficient fashion, facilitating fast prototyping and reproducible text analytics research. bigNN scales up a word2vec-based neural network model over Apache Spark 2.10 and Hadoop Distributed File System (HDFS) 2.7.3, allowing for more efficient big data sentence classification. The toolkit supports big data computing, and simplifies rapid application development in sentence analysis by allowing users to configure and examine different internal parameters of both Apache Spark and the neural network model. bigNN is fully documented, and it is publicly and freely available at https://github.com/bircatmcri/bigNN.

17 citations


Journal ArticleDOI
TL;DR: Similar AED reporting rates were observed for the AG and generic comparisons for most outcomes and drugs, suggesting that brands and generics have similar reporting rates after accounting for generic perception biases.

Proceedings Article
01 Dec 2017
TL;DR: A screening rule is discovered for ℓ1-regularized Ising model estimation that is especially suitable for large-scale exploratory data analysis, where the number of variables in the dataset can be thousands while only the relationship among a handful of variables within moderate-size clusters is of interest for interpretability.
Abstract: We discover a screening rule for ℓ1-regularized Ising model estimation. The simple closed-form screening rule is a necessary and sufficient condition for exactly recovering the blockwise structure of a solution under any given regularization parameters. With enough sparsity, the screening rule can be combined with various optimization procedures to deliver solutions efficiently in practice. The screening rule is especially suitable for large-scale exploratory data analysis, where the number of variables in the dataset can be thousands while we are only interested in the relationship among a handful of variables within moderate-size clusters for interpretability. Experimental results on various datasets demonstrate the efficiency and insights gained from the introduction of the screening rule.

Journal ArticleDOI
TL;DR: Eight essential components of team membership identified included a) demonstrates followership, b) maintains situational awareness, c) demonstrates appreciative inquiry, d) does not freelance, e) is an active listener, f) accurately performs tasks in a timely manner, and h) leaves ego and rank at the door.

Journal ArticleDOI
TL;DR: This work proposes and studies a state-of-the-art NLP-based extraction of ADEs from text based on the observation that text sources such as the Medline/Medinfo library provide a wealth of information on human health.
Abstract: Adverse drug events (ADEs) are a major concern and point of emphasis for the medical profession, government, and society. A diverse set of techniques from epidemiology, statistics, and computer science are being proposed and studied for ADE discovery from observational health data (e.g., EHR and claims data), social network data (e.g., Google and Twitter posts), and other information sources. Methodologies are needed for evaluating, quantitatively measuring and comparing the ability of these various approaches to accurately discover ADEs. This work is motivated by the observation that text sources such as the Medline/Medinfo library provide a wealth of information on human health. Unfortunately, ADEs often result from unexpected interactions, and the connection between conditions and drugs is not explicit in these sources. Thus, in this work, we address the question of whether we can quantitatively estimate relationships between drugs and conditions from the medical literature. This paper proposes and studies a state-of-the-art NLP-based extraction of ADEs from text.

Journal ArticleDOI
TL;DR: A proof-of-concept database and website has been developed to share results from both in vivo inhalation studies and in vitro studies conducted by Philip Morris International R&D to assess candidate MRTPs, and the goal is to establish a public repository for 21st-century preclinical systems toxicology MRTP assessment data and results that supports open data principles.
Abstract: The US FDA defines modified risk tobacco products (MRTPs) as products that aim to reduce harm or the risk of tobacco-related disease associated with commercially marketed tobacco products. Establishing a product's potential as an MRTP requires scientific substantiation including toxicity studies and measures of disease risk relative to those of cigarette smoking. Best practices encourage verification of the data from such studies through sharing and open standards. Building on the experience gained from the OpenTox project, a proof-of-concept database and website (INTERVALS) has been developed to share results from both in vivo inhalation studies and in vitro studies conducted by Philip Morris International R&D to assess candidate MRTPs. As datasets are often generated by diverse methods and standards, they need to be traceable, curated, and the methods used well described so that knowledge can be gained using data science principles and tools. The data-management framework described here accounts for the latest standards of data sharing and research reproducibility. Curated data and methods descriptions have been prepared in ISA-Tab format and stored in a database accessible via a search portal on the INTERVALS website. The portal allows users to browse the data by study or mechanism (e.g., inflammation, oxidative stress) and obtain information relevant to study design, methods, and the most important results. Given the successful development of the initial infrastructure, the goal is to grow this initiative and establish a public repository for 21st-century preclinical systems toxicology MRTP assessment data and results that supports open data principles.

Proceedings ArticleDOI
01 Aug 2017
TL;DR: The performance of two models to predict breast cancer one year in advance based on diagnosis codes in three levels of data representation demonstrates that EHR data can be used to predict breast cancer risk, which provides the possibility to personalize care in clinical practice.
Abstract: Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Based on the hierarchical structure of ICD-9 codes, which are composed of 3-5 digits, three levels of data representation were studied: level 0, using only the first 3 digits; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits of each code. We created two models to predict breast cancer one year in advance based on diagnosis codes in three levels of data representation: logistic regression (LR) and LASSO logistic regression (LR+Lasso). Area under the ROC curve (AUC) was used to assess model performance. The LR+Lasso model demonstrated significantly higher predictive performance than the LR model when using the level 2 feature representation (0.648 vs 0.603, p=0.013). For both the level 1 representation and the level 0 representation, the predictive difference between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed modestly across the three levels. For the LR+Lasso model, predictive performance also changed modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes contains the most valuable information that may contribute to breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which provides the possibility to personalize care in clinical practice. In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
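
The three levels of data representation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `truncate_icd9`, the example codes, and the restriction to purely numeric ICD-9 codes (no V or E codes) are assumptions made for illustration.

```python
def truncate_icd9(code: str, level: int) -> str:
    """Keep the first 3 + level digits of a numeric ICD-9 code.

    level 0 -> first 3 digits, level 1 -> up to 4 digits, level 2 -> up to 5 digits.
    Assumption: the code is purely numeric (V and E codes are not handled).
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    digits = code.replace(".", "").strip()
    kept = digits[: 3 + level]
    # Re-insert the decimal point after the three-digit category when digits follow it.
    return kept if len(kept) <= 3 else f"{kept[:3]}.{kept[3:]}"


# Hypothetical example codes, chosen only to show codes of 3, 4, and 5 digits.
codes = ["174.9", "250.01", "401"]
by_level = {level: [truncate_icd9(c, level) for c in codes] for level in (0, 1, 2)}

for level, truncated in by_level.items():
    print(level, truncated)
```

Codes shorter than the requested level are left as they are, which matches the abstract's wording of "up to" 4 or 5 digits.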

Journal ArticleDOI
TL;DR: Differentiation of FAERS reports as brand versus generic requires careful attention to risk of product misclassification, but the relative stability of findings across varying assumptions supports the utility of these approaches for potential signal detection.
Abstract: The US Food and Drug Administration Adverse Event Reporting System (FAERS), a post-marketing safety database, can be used to differentiate brand versus generic safety signals. To explore the methods for identifying and analyzing brand versus generic adverse event (AE) reports. Public release FAERS data from January 2004 to March 2015 were analyzed using alendronate and carbamazepine as examples. Reports were classified as brand, generic, and authorized generic (AG). Disproportionality analyses compared reporting odds ratios (RORs) of selected known labeled serious adverse events stratifying by brand, generic, and AG. The homogeneity of these RORs was compared using the Breslow-Day test. The AG versus generic was the primary focus since the AG is identical to brand but marketed as a generic, therefore minimizing generic perception bias. Sensitivity analyses explored how methodological approach influenced results. Based on 17,521 US event reports involving alendronate and 3733 US event reports involving carbamazepine (immediate and extended release), no consistently significant differences were observed across RORs for the AGs versus generics. Similar results were obtained when comparing reporting patterns over all time and just after generic entry. The most restrictive approach for classifying AE reports yielded smaller report counts but similar results. Differentiation of FAERS reports as brand versus generic requires careful attention to risk of product misclassification, but the relative stability of findings across varying assumptions supports the utility of these approaches for potential signal detection.
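
As a rough illustration of the disproportionality measure named in this abstract, the sketch below computes a reporting odds ratio from a 2x2 table of report counts, with the usual Wald 95% confidence interval on the log scale. The function name, the cell labels `a` to `d`, and the counts are hypothetical; the study's actual report classification and Breslow-Day homogeneity test are not reproduced here.

```python
import math


def reporting_odds_ratio(a: int, b: int, c: int, d: int):
    """Return (ROR, lower, upper) for a 2x2 table of report counts.

    a: reports for the product of interest that mention the event
    b: reports for the product of interest that do not mention the event
    c: reports for comparator products that mention the event
    d: reports for comparator products that do not mention the event
    All four cells are assumed to be positive.
    """
    ror = (a * d) / (b * c)
    # Standard error of log(ROR) under the usual Wald approximation.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper


# Hypothetical counts, for illustration only.
ror, lower, upper = reporting_odds_ratio(20, 80, 10, 90)
print(f"ROR = {ror:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Computing this separately for brand, generic, and authorized generic reports gives the stratified RORs whose homogeneity the abstract describes comparing.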

Posted ContentDOI
21 Sep 2017-bioRxiv
TL;DR: A more complete view of the role of dosage sensitivity in shaping the mammalian and avian sex chromosomes is provided, and an important role for post-transcriptional regulatory sequences (miRNA target sites) in sex chromosome evolution is revealed.
Abstract: Mammalian X and Y chromosomes evolved from an ordinary autosomal pair. Genetic decay of the Y led to X chromosome inactivation (XCI) in females, but some Y-linked genes were retained during the course of sex chromosome evolution, and many X-linked genes did not become subject to XCI. We reconstructed gene-by-gene dosage sensitivities on the ancestral autosomes through phylogenetic analysis of microRNA (miRNA) target sites and compared these preexisting characteristics to the current status of Y-linked and X-linked genes in mammals. Preexisting heterogeneities in dosage sensitivity, manifesting as differences in the extent of miRNA-mediated repression, predicted either the retention of a Y homolog or the acquisition of XCI following Y gene decay. Analogous heterogeneities among avian Z-linked genes predicted the retention of a W homolog. Genome-wide analyses of human copy number variation indicate that these heterogeneities consisted of sensitivity to both increases and decreases in dosage. We propose a model of XY/ZW evolution incorporating such preexisting dosage sensitivities in determining the evolutionary fates of individual genes. Our findings thus provide a more complete view of the role of dosage sensitivity in shaping the mammalian and avian sex chromosomes, and reveal an important role for post-transcriptional regulatory sequences (miRNA target sites) in sex chromosome evolution.

Book ChapterDOI
21 Jun 2017
TL;DR: This work presents a machine learning approach that uses a definite set of features obtained from the Parkinsons Progression Markers Initiative study as input and classifies patients with Parkinson's disease into one of two classes: PD and HC.
Abstract: Parkinson's, a progressive neural disorder, is difficult to identify due to the hidden nature of the symptoms associated. We present a machine learning approach that uses a definite set of features obtained from the Parkinsons Progression Markers Initiative (PPMI) study as input and classifies them into one of two classes: PD (Parkinson's disease) and HC (Healthy Control). As far as we know this is the first work in applying machine learning algorithms for classifying patients with Parkinson's disease with the involvement of domain expert during the feature selection process. We evaluate our approach on 1194 patients acquired from Parkinsons Progression Markers Initiative and show that it achieves a state-of-the-art performance with minimal feature engineering.

01 Jan 2017
TL;DR: It is proposed that, like the human Y chromosome, the chicken W chromosome is essential for embryonic viability of the heterogametic sex and it is speculated that the pressures that drive the acquisition of reproduction-related genes on sex chromosomes may be specific to the male germ line.
Abstract: After birds diverged from mammals, different ancestral autosomes evolved into sex chromosomes in each lineage. In birds, females are ZW and males are ZZ, but in mammals females are XX and males are XY. We sequenced the chicken W chromosome, compared its gene content with our reconstruction of the ancestral autosomes, and followed the evolutionary trajectory of ancestral W-linked genes across birds. Avian W chromosomes evolved in parallel with mammalian Y chromosomes, preserving ancestral genes through selection to maintain the dosage of broadly expressed regulators of key cellular processes. We propose that, like the human Y chromosome, the chicken W chromosome is essential for embryonic viability of the heterogametic sex. Unlike other sequenced sex chromosomes, the chicken W chromosome did not acquire and amplify genes specifically expressed in reproductive tissues. We speculate that the pressures that drive the acquisition of reproduction-related genes on sex chromosomes may be specific to the male germ line.

Proceedings ArticleDOI
13 Aug 2017
TL;DR: Baseline regularization is proposed, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times, which helps to improve the performance in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth.
Abstract: Several prominent public health incidents that occurred at the beginning of this century due to adverse drug events (ADEs) have raised international awareness of governments and industries about pharmacovigilance (PhV), the science and activities to monitor and prevent adverse events caused by pharmaceutical products after they are introduced to the market. A major data source for PhV is large-scale longitudinal observational databases (LODs) such as electronic health records (EHRs) and medical insurance claim databases. Inspired by the Multiple Self-Controlled Case Series (MSCCS) model, arguably the leading method for ADE discovery from LODs, we propose baseline regularization, a regularized generalized linear model that leverages the diverse health profiles available in LODs across different individuals at different times. We apply the proposed method as well as MSCCS to the Marshfield Clinic EHR. Experimental results suggest that incorporating the heterogeneity among different patients and different times helps to improve the performance in identifying benchmark ADEs from the Observational Medical Outcomes Partnership ground truth.

Posted ContentDOI
29 Jun 2017-bioRxiv
TL;DR: SHIMS 2.0 is introduced, an improved SHIMS protocol to allow even a small laboratory to generate high-quality reference sequence from complex genomic regions, and reduces the cost and time required by two orders of magnitude, while preserving high sequencing accuracy.
Abstract: Reference sequence of structurally complex regions can only be obtained through highly accurate clone-based approaches. We and others have successfully employed Single-Haplotype Iterative Mapping and Sequencing (SHIMS 1.0) to assemble structurally complex regions across the sex chromosomes of several vertebrate species and in targeted improvements to the reference sequences of human autosomes. However, SHIMS 1.0 was expensive and time consuming, requiring the resources that only a genome center could command. Here we introduce SHIMS 2.0, an improved SHIMS protocol to allow even a small laboratory to generate high-quality reference sequence from complex genomic regions. Using a streamlined and parallelized library preparation protocol, and taking advantage of high-throughput, inexpensive, short-read sequencing technologies, a small group can sequence and assemble hundreds of clones in a week. Relative to SHIMS 1.0, SHIMS 2.0 reduces the cost and time required by two orders of magnitude, while preserving high sequencing accuracy.

Posted ContentDOI
20 Mar 2017-bioRxiv
TL;DR: A model of XY/ZW evolution incorporating preexisting dosage sensitivities of individual genes in determining their evolutionary fates, and ultimately shaping the mammalian and avian sex chromosomes is proposed.
Abstract: Mammalian X and Y chromosomes evolved from an ordinary autosomal pair; genetic decay decimated the Y, which in turn necessitated X chromosome inactivation (XCI). Genes of the ancestral autosomes are often assumed to have undertaken these transitions on uniform terms, but we hypothesized that they varied in their dosage constraints. We inferred such constraints from conservation of microRNA (miRNA)-mediated repression, validated by analysis of experimental data. X-linked genes with a surviving Y homolog have the most conserved miRNA target sites, followed by genes with no Y homolog and subject to XCI, and then genes with no Y homolog but escaping XCI; this heterogeneity existed on the ancestral autosomes. Similar results for avian Z-linked genes, with or without a W homolog, lead to a model of XY/ZW evolution incorporating preexisting dosage sensitivities of individual genes in determining their evolutionary fates, and ultimately shaping the mammalian and avian sex chromosomes.