
Showing papers in "BMC Medical Genomics in 2016"


Journal ArticleDOI
TL;DR: In this paper, the authors applied whole exome sequencing (WES) for molecular diagnosis and in silico analysis to identify novel disease gene candidates in a cohort from Saudi Arabia with primarily Mendelian neurologic diseases.
Abstract: Neurodevelopment is orchestrated by a wide range of genes, and the genetic causes of neurodevelopmental disorders are thus heterogeneous. We applied whole exome sequencing (WES) for molecular diagnosis and in silico analysis to identify novel disease gene candidates in a cohort from Saudi Arabia with primarily Mendelian neurologic diseases. We performed WES in 31 mostly consanguineous Arab families and analyzed both single nucleotide and copy number variants (CNVs) from WES data. Interaction/expression network and pathway analyses, as well as paralog studies were utilized to investigate potential pathogenicity and disease association of novel candidate genes. Additional cases for candidate genes were identified through the clinical WES database at Baylor Miraca Genetics Laboratories and GeneMatcher. We found known pathogenic or novel variants in known disease genes with phenotypic expansion in 6 families, disease-associated CNVs in 2 families, and 12 novel disease gene candidates in 11 families, including KIF5B, GRM7, FOXP4, MLLT1, and KDM2B. Overall, a potential molecular diagnosis was provided by variants in known disease genes in 17 families (54.8 %) and by novel candidate disease genes in an additional 11 families, making the potential molecular diagnostic rate ~90 %. Molecular diagnostic rate from WES is improved by exome-predicted CNVs. Novel candidate disease gene discovery is facilitated by paralog studies and through the use of informatics tools and available databases to identify additional evidence for pathogenicity.

76 citations
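
The diagnostic rates quoted in the abstract above follow directly from the family counts it reports. A minimal sketch of that arithmetic is shown here; the function and variable names are illustrative assumptions and do not come from the paper.

```python
def diagnostic_rate(solved_families: int, total_families: int) -> float:
    """Return the percentage of families with a potential molecular diagnosis."""
    if total_families <= 0:
        raise ValueError("total_families must be positive")
    return 100.0 * solved_families / total_families


# Counts as reported in the abstract.
TOTAL_FAMILIES = 31           # families that underwent WES
KNOWN_GENE_FAMILIES = 17      # diagnosis from variants in known disease genes
CANDIDATE_GENE_FAMILIES = 11  # additional families with novel candidate genes

known_rate = diagnostic_rate(KNOWN_GENE_FAMILIES, TOTAL_FAMILIES)
combined_rate = diagnostic_rate(
    KNOWN_GENE_FAMILIES + CANDIDATE_GENE_FAMILIES, TOTAL_FAMILIES
)

print(f"Known disease genes: {known_rate:.1f} %")
print(f"Including candidate genes: {combined_rate:.1f} %")
```

The first figure reproduces the 54.8 % in the abstract, and the combined figure (28 of 31 families) is consistent with the stated rate of about 90 %.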


Journal ArticleDOI
TL;DR: Two osteoblast miRNAs were identified over-expressed in osteoporotic samples and expressed in primary osteoblasts, which opens novel prospects for research and therapy.
Abstract: MicroRNAs (miRNAs) are important regulators of gene expression, with documented roles in bone metabolism and osteoporosis, suggesting potential therapeutic targets. Our aim was to identify miRNAs differentially expressed in fractured vs nonfractured bones. Additionally, we performed a miRNA profiling of primary osteoblasts to assess the origin of these differentially expressed miRNAs. Total RNA was extracted from (a) fresh femoral neck trabecular bone from women undergoing hip replacement due to either osteoporotic fracture (OP group, n = 6) or osteoarthritis in the absence of osteoporosis (Control group, n = 6), matching the two groups by age and body mass index, and (b) primary osteoblasts obtained from knee replacement due to osteoarthritis (n = 4). Samples were hybridized to a microRNA array containing more than 1900 miRNAs. Principal component analysis (PCA) plots and heat map hierarchical clustering were performed. For comparison of expression levels, the threshold was set at log fold change > 1.5 and a p-value < 0.05 (corrected for multiple testing). Both PCA and heat map analyses showed that the samples clustered according to the presence or absence of fracture. Overall, 790 and 315 different miRNAs were detected in fresh bone samples and in primary osteoblasts, respectively, 293 of which were common to both groups. A subset of 82 miRNAs was differentially expressed (p < 0.05) between osteoporotic and control osteoarthritic samples. The eight miRNAs with the lowest p-values (and for which a validated miRNA qPCR assay was available) were assayed, and two were confirmed: miR-320a and miR-483-5p. Both were over-expressed in the osteoporotic samples and expressed in primary osteoblasts. miR-320a is known to target CTNNB1 and predicted to regulate RUNX2 and LEPR, while miR-483-5p down-regulates IGF2. We observed a reduction trend for this target gene in the osteoporotic bone. 
We identified two osteoblast miRNAs over-expressed in osteoporotic fractures, which opens novel prospects for research and therapy.

75 citations
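
The selection rule described in the abstract above (log fold change > 1.5 and a multiple-testing-corrected p-value < 0.05) can be sketched as a simple filter. This is an illustration only: the miRNA names and values are hypothetical, and applying the threshold to the absolute log fold change is an assumption, since the abstract does not say how direction of change was handled.

```python
from typing import List, NamedTuple


class MirnaResult(NamedTuple):
    name: str
    log_fc: float  # log fold change, OP group vs control group
    adj_p: float   # p-value after multiple-testing correction


def select_differential(
    results: List[MirnaResult],
    min_abs_log_fc: float = 1.5,
    max_adj_p: float = 0.05,
) -> List[MirnaResult]:
    """Keep miRNAs passing both thresholds (absolute log FC is an assumption)."""
    return [
        r for r in results
        if abs(r.log_fc) > min_abs_log_fc and r.adj_p < max_adj_p
    ]


# Hypothetical values for illustration; not data from the study.
toy_results = [
    MirnaResult("miR-A", 2.1, 0.01),   # passes both thresholds
    MirnaResult("miR-B", 1.2, 0.001),  # fold change too small
    MirnaResult("miR-C", -1.8, 0.03),  # passes (down-regulated)
    MirnaResult("miR-D", 2.5, 0.20),   # not significant
]

selected = select_differential(toy_results)
print([r.name for r in selected])
```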


Journal ArticleDOI
TL;DR: Genomic testing should not be offered with the explicit aim to reduce uncertainty, but rather, uncertainty should be appraised, adapted to and communicated about as part of the process of offering and providing genomic information.
Abstract: Genomic testing has reached the point where, technically at least, it can be cheaper to undertake panel-, exome- or whole genome testing than it is to sequence a single gene. An attribute of these approaches is that information gleaned will often have uncertain significance. In addition to the challenges this presents for pre-test counseling and informed consent, a further consideration emerges over how - ethically - we should conceive of and respond to this uncertainty. To date, the ethical aspects of uncertainty in genomics have remained under-explored. In this paper, we draft a conceptual and ethical response to the question of how to conceive of and respond to uncertainty in genomic medicine. After introducing the problem, we articulate a concept of ‘genomic uncertainty’. Drawing on this, together with exemplar clinical cases and related empirical literature, we then critique the presumption that uncertainty is always problematic and something to be avoided, or eradicated. We conclude by outlining an ‘ethics of genomic uncertainty’; describing how we might handle uncertainty in genomic medicine. This involves fostering resilience, welfare, autonomy and solidarity. Uncertainty will be an inherent aspect of clinical practice in genomics for some time to come. Genomic testing should not be offered with the explicit aim to reduce uncertainty. Rather, uncertainty should be appraised, adapted to and communicated about as part of the process of offering and providing genomic information.

71 citations


Journal ArticleDOI
TL;DR: This 19-gene signature was able to significantly predict the survival of patients with colorectal cancer compared to the conventional Dukes’ classification in both training and test sets and was validated as a significant independent predictor of survival.
Abstract: Histopathological assessment has a low potential to predict clinical outcome in patients with the same stage of colorectal cancer. More specific and sensitive biomarkers to determine patients’ survival are needed. We aimed to determine gene expression signatures as reliable prognostic markers that could predict survival of colorectal cancer patients with Dukes’ B and C. We examined microarray gene expression profiles of 78 archived tissues of patients with Dukes’ B and C using the Illumina DASL assay. The gene expression data were analyzed using the GeneSpring software and R programming. The outliers were detected and replaced with randomly chosen genes from the 90 % confidence interval of the robust mean for each group. We applied three statistical methods (SAM, LIMMA and t-test) to identify significant genes. Nineteen significant genes common to all three methods were identified from microarray data that had been permuted 100 times, namely NOTCH2, ITPRIP, FRMD6, GFRA4, OSBPL9, CPXCR1, SORCS2, PDC, C12orf66, SLC38A9, OR10H5, TRIP13, MRPL52, DUSP21, BRCA1, ELTD1, SPG7, LASS6 and DUOX2. This 19-gene signature was able to significantly predict the survival of patients with colorectal cancer compared to the conventional Dukes’ classification in both training and test sets (p < 0.05). The performance of this signature was further validated as a significant independent predictor of survival using patient cohorts from Australia (n = 185), USA (n = 114), Denmark (n = 37) and Norway (n = 95) (p < 0.05). Validation using quantitative PCR confirmed similar expression pattern for the six selected genes. Profiling of these 19 genes may provide a more accurate method to predict survival of patients with colorectal cancer and assist in identifying patients who require more intensive treatment.

65 citations


Journal ArticleDOI
TL;DR: WES is already used in the clinical setting, and may soon be considered the standard of care for specific medical conditions, yet technology users are calling for certain standards and guidelines to be published before this technology replaces more focused approaches such as gene panel sequencing.
Abstract: Whole-exome sequencing (WES) consists of the capture, sequencing and analysis of all exons in the human genome. Originally developed in the research context, this technology is now increasingly used clinically to inform patient care. The implementation of WES into healthcare poses significant organizational, regulatory, and ethical hurdles, which are widely discussed in the literature. In order to inform future policy decisions on the integration of WES into standard clinical practice, we performed a systematic literature review to identify the most important challenges directly reported by technology users. Out of 2094 articles, we selected and analyzed 147 which reported a total of 23 different challenges linked to the production, analysis, reporting and sharing of patients’ WES data. Interpretation of variants of unknown significance, incidental findings, and the cost and reimbursement of WES-based tests were the most reported challenges across all articles. WES is already used in the clinical setting, and may soon be considered the standard of care for specific medical conditions. Yet, technology users are calling for certain standards and guidelines to be published before this technology replaces more focused approaches such as gene panel sequencing. In addition, a number of infrastructural adjustments will have to be made for clinics to store, process and analyze the amounts of data produced by WES.

62 citations


Journal ArticleDOI
TL;DR: This paper discusses several studies that show how primary care physicians and clinicians in general feel underequipped to interpret genetic tests and direct-to-consumer genomic tests and proposes several strategies that involve medical curriculum reforms, specialist training, and ongoing physician training.
Abstract: A new paradigm in disease classification, diagnosis and treatment is rapidly approaching. Known as precision medicine, this new healthcare model incorporates and integrates genetic information, microbiome data, and information on patients’ environment and lifestyle to better identify and classify disease processes, and to provide custom-tailored therapeutic solutions. In spite of its promises, precision medicine faces several challenges that need to be overcome to successfully implement this new healthcare model. In this paper we identify four main areas that require attention: data, tools and systems, regulations, and people. While there are important ongoing efforts for addressing the first three areas, we argue that the human factor needs to be taken into consideration as well. In particular, we discuss several studies that show how primary care physicians and clinicians in general feel underequipped to interpret genetic tests and direct-to-consumer genomic tests. Considering the importance of genetic information for precision medicine applications, this is a pressing issue that needs to be addressed. To increase the number of professionals with the necessary expertise to correctly interpret the genomics profiles of their patients, we propose several strategies that involve medical curriculum reforms, specialist training, and ongoing physician training.

58 citations


Journal ArticleDOI
TL;DR: The results of the second Critical Assessment of Data Privacy and Protection competition indicated that secure computation techniques can enable comparative analysis of human genomes, but greater efficiency is needed before they are sufficiently practical for real world environments.
Abstract: The outsourcing of genomic data into public cloud computing settings raises concerns over privacy and security. Significant advancements in secure computation methods have emerged over the past several years, but such techniques need to be rigorously evaluated for their ability to support the analysis of human genomic data in an efficient and cost-effective manner. With respect to public cloud environments, there are concerns about the inadvertent exposure of human genomic data to unauthorized users. In analyses involving multiple institutions, there is additional concern about data being used beyond agreed research scope and being processed in untrusted computational environments, which may not satisfy institutional policies. To systematically investigate these issues, the NIH-funded National Center for Biomedical Computing iDASH (integrating Data for Analysis, ‘anonymization’ and SHaring) hosted the second Critical Assessment of Data Privacy and Protection competition to assess the capacity of cryptographic technologies for protecting computation over human genomes in the cloud and promoting cross-institutional collaboration. Data scientists were challenged to design and engineer practical algorithms for secure outsourcing of genome computation tasks in working software, whereby analyses are performed only on encrypted data. They were also challenged to develop approaches to enable secure collaboration on data from genomic studies generated by multiple organizations (e.g., medical centers) to jointly compute aggregate statistics without sharing individual-level records. The results of the competition indicated that secure computation techniques can enable comparative analysis of human genomes, but greater efficiency (in terms of compute time and memory utilization) is needed before they are sufficiently practical for real world environments.

53 citations


Journal ArticleDOI
TL;DR: The meta-analysis of reprogrammed NSCLC cell lines identified SFRP1 as a promising target of epigenetic therapy for NSCLC, and numerical computation validated the binding of SFRP1 to WNT1 to suppress Wnt signalling pathway activation in NSCLC.
Abstract: Non-small cell lung cancer (NSCLC) remains a lethal disease despite many proposed treatments. Recent studies have indicated that epigenetic therapy, which targets epigenetic effects, might be a new therapeutic methodology for NSCLC. However, it is not clear which objects (e.g., genes) this treatment specifically targets. Secreted frizzled-related proteins (SFRPs) are promising candidates for epigenetic therapy in many cancers, but there have been no reports of SFRPs targeted by epigenetic therapy for NSCLC. This study performed a meta-analysis of reprogrammed NSCLC cell lines instead of the direct examination of epigenetic therapy treatment to identify epigenetic therapy targets. In addition, mRNA expression/promoter methylation profiles were processed by recently proposed principal component analysis based unsupervised feature extraction and categorical regression analysis based feature extraction. The Wnt/β-catenin signalling pathway was extensively enriched among 32 genes identified by feature extraction. Among the genes identified, SFRP1 was specifically indicated to target β-catenin, and thus might be targeted by epigenetic therapy in NSCLC cell lines. A histone deacetylase inhibitor might reactivate SFRP1 based upon the re-analysis of a public domain data set. Numerical computation validated the binding of SFRP1 to WNT1 to suppress Wnt signalling pathway activation in NSCLC. The meta-analysis of reprogrammed NSCLC cell lines identified SFRP1 as a promising target of epigenetic therapy for NSCLC.

40 citations


Journal ArticleDOI
TL;DR: The development, testing and application of a promising new approach to repositioning based on mining a human functional linkage network for inversely correlated modules of drug and disease gene targets appears to offer promise for the identification of multi-targeted drug candidates that can correct aberrant cellular functions.
Abstract: The high cost and the long time required to bring drugs into commerce is driving efforts to repurpose FDA approved drugs—to find new uses for which they weren’t intended, and to thereby reduce the overall cost of commercialization, and shorten the lag between drug discovery and availability. We report on the development, testing and application of a promising new approach to repositioning. Our approach is based on mining a human functional linkage network for inversely correlated modules of drug and disease gene targets. The method takes account of multiple information sources, including gene mutation, gene expression, and functional connectivity and proximity of within module genes. The method was used to identify candidates for treating breast and prostate cancer. We found that (i) the recall rate for FDA approved drugs for breast (prostate) cancer is 20/20 (10/11), while the rates for drugs in clinical trials were 131/154 and 82/106; (ii) the ROC/AUC performance substantially exceeds that of comparable methods; (iii) preliminary in vitro studies indicate that 5/5 candidates have therapeutic indices superior to that of Doxorubicin in MCF7 and SUM149 cancer cell lines. We briefly discuss the biological plausibility of the candidates at a molecular level in the context of the biological processes that they mediate. Our method appears to offer promise for the identification of multi-targeted drug candidates that can correct aberrant cellular functions. In particular the computational performance exceeded that of other CMap-based methods, and in vitro experiments indicate that 5/5 candidates have therapeutic indices superior to that of Doxorubicin in MCF7 and SUM149 cancer cell lines. The approach has the potential to provide a more efficient drug discovery pipeline.

37 citations
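
The recall figures in the abstract above are given as raw counts. The sketch below converts them to percentages. It assumes that the clinical-trial counts 131/154 and 82/106 refer to breast and prostate cancer respectively, which the abstract implies by ordering but does not state outright; the dictionary layout and function name are illustrative.

```python
from fractions import Fraction

# (recovered, total) counts as quoted in the abstract.
# Mapping of the clinical-trial counts to cancer type is an assumption.
recall_counts = {
    ("breast", "FDA approved"): (20, 20),
    ("prostate", "FDA approved"): (10, 11),
    ("breast", "clinical trials"): (131, 154),
    ("prostate", "clinical trials"): (82, 106),
}


def recall_percent(recovered: int, total: int) -> float:
    """Recall as a percentage: drugs recovered by the method / all known drugs."""
    return float(100 * Fraction(recovered, total))


recall_table = {
    key: recall_percent(*counts) for key, counts in recall_counts.items()
}

for (cancer, status), pct in recall_table.items():
    print(f"{cancer:8s} {status:16s} {pct:5.1f} %")
```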


Journal ArticleDOI
TL;DR: Differential network analysis of allergen-induced CD4 T cell responses can unmask covert disease-associated genes and pinpoint novel therapeutic targets.
Abstract: Asthma is strongly associated with allergic sensitization, but the mechanisms that determine why only a subset of atopics develop asthma are not well understood. The aim of this study was to test the hypothesis that variations in allergen-driven CD4 T cell responses are associated with susceptibility to expression of asthma symptoms. The study population consisted of house dust mite (HDM) sensitized atopics with current asthma (n = 22), HDM-sensitized atopics without current asthma (n = 26), and HDM-nonsensitized controls (n = 24). Peripheral blood mononuclear cells from these groups were cultured in the presence or absence of HDM extract for 24 h. CD4 T cells were then isolated by immunomagnetic separation, and gene expression patterns were profiled on microarrays. Differential network analysis of HDM-induced CD4 T cell responses in sensitized atopics with or without asthma unveiled a cohort of asthma-associated genes that escaped detection by more conventional data analysis techniques. These asthma-associated genes were enriched for targets of STAT6 signaling, and they were nested within a larger coexpression module comprising 406 genes. Upstream regulator analysis suggested that this module was driven primarily by IL-2, IL-4, and TNF signaling; reconstruction of the wiring diagram of the module revealed a series of hub genes involved in inflammation (IL-1B, NFkB, STAT1, STAT3), apoptosis (BCL2, MYC), and regulatory T cells (IL-2Ra, FoxP3). Finally, we identified several negative regulators of asthmatic CD4 T cell responses to allergens (e.g. IL-10, type I interferons, microRNAs, drugs, metabolites), and these represent logical candidates for therapeutic intervention. Differential network analysis of allergen-induced CD4 T cell responses can unmask covert disease-associated genes and pinpoint novel therapeutic targets.

37 citations


Journal ArticleDOI
TL;DR: Novel associations of DNA methylation with oxidative stress are reported, some of which show evidence of a relation with T2D incidence, and the contribution of oxidative stress-associated CpGs to development of cardiometabolic disease is investigated.
Abstract: Oxidative stress has been related to type 2 diabetes (T2D) and cardiovascular disease (CVD), the leading global cause of death. Contributions of environmental factors such as oxidative stress on complex traits and disease may be partly mediated through changes in epigenetic marks (e.g. DNA methylation). Studies relating differential methylation with intermediate phenotypes and disease endpoints may be useful in identifying additional candidate genes and mechanisms involved in disease. To investigate the role of epigenetic variation in oxidative stress marker levels and subsequent development of CVD and T2D, we performed analyses of genome-wide DNA methylation in blood, ten markers of oxidative stress (total glutathione [TGSH], reduced glutathione [GSH], oxidised glutathione [GSSG], GSSG to GSH ratio, homocysteine [HCY], oxidised low-density lipoprotein (oxLDL), antibodies against oxLDL [OLAB], conjugated dienes [CD], baseline conjugated dienes [BCD]-LDL and total antioxidant capacity [TAOC]) and incident disease in up to 966 age-matched individuals. In total, we found 66 cytosine-guanine (CpG) sites associated with one or more oxidative stress markers (false discovery rate [FDR] <0.05). These sites were enriched in regulatory regions of the genome. Genes annotated to CpG sites showed enrichment in annotation clusters relating to phospho-metabolism and proteins with pleckstrin domains. We investigated the contribution of oxidative stress-associated CpGs to development of cardiometabolic disease. Methylation variation at CpGs in the 3'-UTR of HIST1H4D (cg08170869; histone cluster 1, H4d) and in the body of DVL1 (cg03465880; dishevelled-1) were associated with incident T2D events during 10 years of follow-up (all permutation p-values <0.01), indicating a role of epigenetic regulation in oxidative stress processes leading to development or progression of diabetes. 
Methylation QTL (meQTL) analysis showed significant associations with genetic sequence variants in cis at 28 (42%) of oxidative stress phenotype-associated sites (FDR < 0.05). Integrating cis-meQTLs with genotype-phenotype associations indicated that genetic effects on oxidative stress phenotype at one locus (cg07547695; BCL2L11) may be mediated through DNA methylation. In conclusion, we report novel associations of DNA methylation with oxidative stress, some of which also show evidence of a relation with T2D incidence.

Journal ArticleDOI
TL;DR: The proposed consensus strategy represents an efficient and biologically relevant approach for gene prioritization tasks providing a valuable decision-making tool for the study of PD pathogenesis and the development of disease-modifying PD therapeutics.
Abstract: The systemic information enclosed in microarray data encodes relevant clues to overcome the poorly understood combination of genetic and environmental factors in Parkinson’s disease (PD), which represents the major obstacle to understand its pathogenesis and to develop disease-modifying therapeutics. While several gene prioritization approaches have been proposed, none dominate over the rest. Instead, hybrid approaches seem to outperform individual approaches. A consensus strategy is proposed for PD related gene prioritization from mRNA microarray data based on the combination of three independent prioritization approaches: Limma, machine learning, and weighted gene co-expression networks. The consensus strategy outperformed the individual approaches in terms of statistical significance, overall enrichment and early recognition ability. In addition to a significant biological relevance, the set of 50 genes prioritized exhibited an excellent early recognition ability (6 of the top 10 genes are directly associated with PD). 40 % of the prioritized genes were previously associated with PD including well-known PD related genes such as SLC18A2, TH or DRD2. Eight genes (CCNH, DLK1, PCDH8, SLIT1, DLD, PBX1, INSM1, and BMI1) were found to be significantly associated to biological process affected in PD, representing potentially novel PD biomarkers or therapeutic targets. Additionally, several metrics of standard use in chemoinformatics are proposed to evaluate the early recognition ability of gene prioritization tools. The proposed consensus strategy represents an efficient and biologically relevant approach for gene prioritization tasks providing a valuable decision-making tool for the study of PD pathogenesis and the development of disease-modifying PD therapeutics.

Journal ArticleDOI
TL;DR: It is demonstrated that RPL5 is required for both primitive and definitive hematopoiesis processes in DBA pathology, and the results provide a comprehensive basis for the study of molecular pathogenesis of RPL5-mediated DBA and other ribosomopathies.
Abstract: Diamond–Blackfan anemia (DBA) was the first ribosomopathy associated with mutations in ribosomal protein (RP) genes. The clinical phenotypes of DBA include failure of erythropoiesis, congenital anomalies and cancer predisposition. Mutations in RPL5 are reported in approximately 9–21 % of DBA patients, which represents the most common pathological condition related to a large-subunit ribosomal protein. However, it remains unclear how RPL5 downregulation results in severe phenotypes of this disease. In this study, we generated a zebrafish model of DBA with RPL5 morphants and implemented high-throughput RNA-seq and ncRNA-seq to identify key genes, lncRNAs, and miRNAs during zebrafish development and hematopoiesis. We demonstrated that RPL5 is required for both primitive and definitive hematopoiesis processes. By comparing with other DBA zebrafish models and processing functional coupling network, we identified some commonly regulated genes, lncRNAs and miRNAs that might play important roles in development and hematopoiesis. Ribosome biogenesis and translation process were affected more in RPL5 MO than in other RP MOs. Both P53 dependent (for example, cell cycle pathway) and independent pathways (such as Aminoacyl-tRNA biosynthesis pathway) play important roles in DBA pathology. Our results therefore provide a comprehensive basis for the study of molecular pathogenesis of RPL5-mediated DBA and other ribosomopathies.

Journal ArticleDOI
TL;DR: The epigenetic profiles of DNA differential methylation from schizophrenic brain samples were investigated to understand the regulatory roles of SDMGs, and increasing methylation on these promoters is proposed as a novel therapeutic approach for schizophrenia.
Abstract: Epigenetics of schizophrenia provides important information on how the environmental factors affect the genetic architecture of the disease. DNA methylation plays a pivotal role in the etiology of schizophrenia. Previous studies have focused mostly on the discovery of schizophrenia-associated SNPs or genetic variants. As postmortem brain samples became available, more and more recent studies surveyed transcriptomics of the disease. In this study, we constructed a protein-protein interaction (PPI) network using the disease-associated SNPs (or genetic variants), differentially expressed disease genes and differentially methylated disease genes (or promoters). By combining the different datasets and topological analyses of the PPI network, we established a more comprehensive understanding of the development and genetics of this devastating mental illness. We analyzed the previously published DNA methylation profiles of prefrontal cortex from 335 healthy controls and 191 schizophrenic patients. These datasets revealed 2014 CpGs identified as GWAS risk loci with the differential methylation profile in schizophrenia, and 1689 schizophrenic differential methylated genes (SDMGs) identified with predominant hypomethylation. These SDMGs, combined with the PPIs of these genes, were constructed into the schizophrenic differential methylation network (SDMN). On the SDMN, there are 10 hypermethylated SDMGs, including GNA13, CAPNS1, GABPB2, GIT2, LEFTY1, NDUFA10, MIOS, MPHOSPH6, PRDM14 and RFWD2. The hypermethylation to differential expression network (HyDEN) was constructed to determine how the hypermethylated promoters regulate gene expression.
The enrichment analyses of biochemical pathways in HyDEN, including TNF alpha, PDGFR-beta signaling, TGF beta Receptor, VEGFR1 and VEGFR2 signaling, regulation of telomerase, hepatocyte growth factor receptor signaling, ErbB1 downstream signaling and mTOR signaling pathway, suggested that the malfunctioning of these pathways contributes to the symptoms of schizophrenia. The epigenetic profiles of DNA differential methylation from schizophrenic brain samples were investigated to understand the regulatory roles of SDMGs. The SDMGs interplay with SCZCGs in a coordinated fashion in the disease mechanism of schizophrenia. The protein complexes and pathways involved in SDMN may underlie the etiology and offer potential treatment targets. The SDMG promoters are predominantly hypomethylated. Increasing methylation on these promoters is proposed as a novel therapeutic approach for schizophrenia.

Journal ArticleDOI
TL;DR: The data suggest that few of the SNPs in biogenesis genes evaluated alter levels of mRNA transcription or colon cancer risk, and it is likely that SNPs influencing cancer do not do so through miRNAs.
Abstract: MicroRNAs (miRNAs) have been implicated in the incidence and progression of cancer. It has been proposed that single nucleotide polymorphisms (SNPs) influence cancer risk due to their position within genes involved in miRNA synthesis and regulation. Genes directly and indirectly involved in miRNA biogenesis were identified from the literature. We then identified SNPs within these regions. Using genome-wide association study data we evaluated associations between biogenesis-related SNPs with colon cancer risk and their corresponding mRNA expression in normal colonic mucosa and carcinoma and difference in expression between the two tissues. SNPs that were associated with either altered colon cancer risk or with mRNA expression were evaluated for associations with altered miRNA expression. Eleven SNPs were associated (P < 0.05) with colon cancer risk, and two of these variants remained significant after correction for multiple comparisons (PHolm < 0.05): rs1967327 (PRKRA) (ORdom = 0.78, 95 % CI 0.66–0.92) and rs4548444 (MAPKAP2) (ORrec = 1.67, 95 % CI 1.12–2.48). Of these two SNPs, rs4548444 (MAPKAP2), was associated with significantly altered miRNA expression levels in normal colonic mucosa, with nine miRNAs upregulated among individuals homozygous rare (GG) for rs4548444. One SNP associated with cancer prior to adjustment for multiple comparisons, rs11089328 (DGCR8), was associated with altered levels of hsa-miR-645 in differential tissue under the dominant model. Three SNPs, rs2740349 (GEMIN4) in carcinoma tissue, and rs235768 (BMP2) and rs2059691 (PRKRA) in normal mucosa, were significantly associated with altered mRNA expression levels across genotypes after multiple comparison adjustment. Rs2740349 (GEMIN4) and rs235768 (BMP2) were significantly associated with the upregulation of six and nine individual miRNAs in normal colonic mucosa, respectively. 
Our data suggest that few of the SNPs in biogenesis genes we evaluated alter levels of mRNA transcription or colon cancer risk. As only one SNP alters both colon cancer risk and miRNA expression, it is likely that SNPs influencing cancer do not do so through miRNAs. Because the significant SNPs were associated with downregulated mRNAs and upregulated miRNAs, and because each SNP was associated with unique miRNAs, it is possible that other mechanisms influence mature miRNA levels.
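
The abstract above reports significance after Holm correction for multiple comparisons (PHolm < 0.05). As a reminder of how that step-down adjustment works, here is a minimal sketch of the standard Holm-Bonferroni procedure. The p-values are hypothetical and the function name is illustrative; this is not the authors' code or data.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm-adjusted p-values in the same order as the input.

    The k-th smallest raw p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for four SNPs; not values from the study.
raw_p = [0.01, 0.04, 0.03, 0.005]
holm_p = holm_adjust(raw_p)
significant = [p < 0.05 for p in holm_p]
print(holm_p)
print(significant)
```

In this toy example two of the four tests stay below 0.05 after adjustment, which illustrates how a set of nominally significant SNPs can shrink once multiple comparisons are accounted for.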

Journal ArticleDOI
TL;DR: It is shown that Immunoseq is a powerful approach to detect novel rare variants in regulatory regions and also demonstrate that these novel variants have a potential functional role in immune cells.
Abstract: The observation that the genetic variants identified in genome-wide association studies (GWAS) frequently lie in non-coding regions of the genome that contain cis-regulatory elements suggests that altered gene expression underlies the development of many complex traits. In order to efficiently make a comprehensive assessment of the impact of non-coding genetic variation in immune related diseases we emulated the whole-exome sequencing paradigm and developed a custom capture panel for the known DNase I hypersensitive site (DHS) in immune cells – “Immunoseq”. We performed Immunoseq in 30 healthy individuals where we had existing transcriptome data from T cells. We identified a large number of novel non-coding variants in these samples. Relying on allele specific expression measurements, we also showed that our selected capture regions are enriched for functional variants that have an impact on differential allelic gene expression. The results from a replication set with 180 samples confirmed our observations. We show that Immunoseq is a powerful approach to detect novel rare variants in regulatory regions. We also demonstrate that these novel variants have a potential functional role in immune cells.

Journal ArticleDOI
TL;DR: A novel semi-supervised learning method based on the Cox and AFT models to accurately predict the treatment risk and the survival time of the patients and adopt the efficient L1/2 regularization approach to select the relevant genes, which are significantly associated with the disease.
Abstract: One of the most important objectives of clinical cancer research is to diagnose cancer more accurately based on patients’ gene expression profiles. Both the Cox proportional hazards model (Cox) and the accelerated failure time model (AFT) have been widely adopted for high-risk and low-risk classification or survival time prediction in patients’ clinical treatment. Nevertheless, two main dilemmas limit the accuracy of these prediction methods. One is that small sample sizes and censored data remain a bottleneck for training a robust and accurate Cox classification model. The other is that tumours with similar phenotypes and prognoses can be completely different diseases at the genotype and molecular level. Thus, the utility of the AFT model for survival time prediction is limited when such biological differences between the diseases have not been previously identified. To try to overcome these two main dilemmas, we proposed a novel semi-supervised learning method based on the Cox and AFT models to accurately predict the treatment risk and the survival time of the patients. Moreover, we adopted the efficient L1/2 regularization approach in the semi-supervised learning method to select the relevant genes, which are significantly associated with the disease. The results of the simulation experiments show that the semi-supervised learning model can significantly improve the predictive performance of Cox and AFT models in survival analysis. The proposed procedures have been successfully applied to four real microarray gene expression and artificial evaluation datasets. The advantages of our proposed semi-supervised learning method include: 1) a significant increase in the available training samples from censored data; 2) high capability for identifying the survival risk classes of patients in the Cox model; 3) high predictive accuracy for patients’ survival time in the AFT model; 4) strong capability for selecting relevant biomarkers.
Consequently, our proposed semi-supervised learning model is a more appropriate tool for survival analysis in clinical cancer research.

Journal ArticleDOI
TL;DR: This phenome-wide association study (PheWAS) exploring the association between a selected list of functional stop-gain genetic variants and an extensive group of diagnoses revealed novel associations of stop-gained variants with interesting phenotypes (ICD-9 codes) along with pleiotropic effects.
Abstract: We explored premature stop-gain variants to test the hypothesis that variants, which are likely to have a consequence on protein structure and function, will reveal important insights with respect to the phenotypes associated with them. We performed a phenome-wide association study (PheWAS) exploring the association between a selected list of functional stop-gain genetic variants (variation resulting in truncated proteins or in nonsense-mediated decay) and an extensive group of diagnoses to identify novel associations and uncover potential pleiotropy. In this study, we selected 25 stop-gain variants: 5 stop-gain variants with previously reported phenotypic associations, and a set of 20 putative stop-gain variants identified using dbSNP. For the PheWAS, we used data from the electronic MEdical Records and GEnomics (eMERGE) Network across 9 sites with a total of 41,057 unrelated patients. We divided all these samples into two datasets by equal proportion of eMERGE site, sex, race, and genotyping platform. We calculated single effect associations between these 25 stop-gain variants and ICD-9 defined case-control diagnoses. We also performed stratified analyses for samples of European and African ancestry. Associations were adjusted for sex, site, genotyping platform and the first three principal components to account for global ancestry. We identified previously known associations, such as variants in LPL associated with hyperglyceridemia, indicating that our approach was robust. We also found a total of three significant associations with p < 0.01 in both datasets, with the most significant replicating result being LPL SNP rs328 and ICD-9 code 272.1 “Disorder of Lipoid metabolism” (p_discovery = 2.59 × 10⁻⁶, p_replicating = 2.7 × 10⁻⁴).
The other two significant replicated associations identified by this study are: variant rs1137617 in the KCNH2 gene associated with ICD-9 code category 244 “Acquired Hypothyroidism” (p_discovery = 5.31 × 10⁻³, p_replicating = 1.15 × 10⁻³) and variant rs12060879 in the DPT gene associated with ICD-9 code category 996 “Complications peculiar to certain specified procedures” (p_discovery = 8.65 × 10⁻³, p_replicating = 4.16 × 10⁻³). In conclusion, this PheWAS revealed novel associations of stop-gained variants with interesting phenotypes (ICD-9 codes) along with pleiotropic effects.

Journal ArticleDOI
TL;DR: The results suggest that eCGIs may constitute a distinct class of enhancers and perform a more instrumental role in tumorigenesis than typical CGIs in gene promoters.
Abstract: CpG islands (CGIs) are interspersed DNA sequences that have unusually high CpG ratios and GC contents. CGIs are typically located in the promoter of protein-coding genes. They normally lack DNA methylation but become hypermethylated and induce repression of associated genes in cancer. However, the biological functions of non-promoter CGIs (orphan CGIs) largely remain unclear. Here, we identify orphan CGIs that do not map to the promoter of any protein-coding or non-coding transcripts but possess chromatin and transcriptional marks that reflect enhancer activity (termed eCGIs). They exhibit three-dimensional chromatin looping toward multiple target genes with high affinity. Intriguingly, transcription regulators were frequently associated with such CGI-containing enhancers. Remarkably, our analyses in cell lines and clinical tissues showed that eCGIs have more dynamic DNA methylation changes in cancer relative to promoter CGIs. The observed eCGI hypermethylation was accompanied by a loss of enhancer marks and transcriptional inactivation of the target genes. Our results suggest that eCGIs may constitute a distinct class of enhancers and perform a more instrumental role in tumorigenesis than typical CGIs in gene promoters.

Journal ArticleDOI
TL;DR: In spite of inferior purity, the performance of saliva DNAs for microarray genotyping was excellent, and these results agree with other studies concluding that saliva collection is a viable alternative to blood.
Abstract: The question of whether DNA obtained from saliva is an acceptable alternative to DNA from blood is a topic of considerable interest for large genetics studies. We compared the yields, quality and performance of DNAs from saliva and blood from a mostly elderly study population. Two thousand nine hundred ten DNAs from primarily elderly subjects (mean age ± standard deviation (SD): 65 ± 12 years), collected for the Primary Open-Angle African-American Glaucoma Genetics (POAAGG) study, were evaluated by fluorometry and/or spectroscopy. These included 566 DNAs from blood and 2344 from saliva. Subsets of these were evaluated by Sanger sequencing (n = 1555), and by microarray SNP genotyping (n = 94) on an Illumina OmniExpress bead chip platform. The mean age of subjects was 65, and 68 % were female in both the blood and saliva groups. The mean ± SD of DNA yield per ml of requested specimen was significantly higher for saliva (17.6 ± 17.8 μg/ml) than blood (13.2 ± 8.5 μg/ml), but the mean ± SD of total DNA yield obtained per saliva specimen (35 ± 36 μg from 2 ml maximum specimen volume) was approximately three-fold lower than from blood (106 ± 68 μg from 8 ml maximum specimen volume). The average genotyping call rates were >99 % for 43 of 44 saliva DNAs and >99 % for 50 of 50 for blood DNAs. For 22 of 23 paired blood and saliva samples from the same individuals, the average genotyping concordance rate was 99.996 %. High quality PCR Sanger sequencing was obtained from ≥ 98 % of blood (n = 297) and saliva (n = 1258) DNAs. DNA concentrations ≥10 ng/μl, corresponding to total yields ≥ 2 μg, were obtained for 94 % of the saliva specimens (n = 2344). In spite of inferior purity, the performance of saliva DNAs for microarray genotyping was excellent. Our results agree with other studies concluding that saliva collection is a viable alternative to blood. 
The potential to boost study enrollments and reduce subject discomfort is not necessarily offset by a reduction in genotyping efficiency. Saliva DNAs performed comparably to blood DNAs for PCR Sanger sequencing.

Journal ArticleDOI
TL;DR: This paper proposes an approach which utilizes neural network model based on dependency-based word embedding to automatically learn significant features from raw input for trigger classification, and achieves the semantic distributed representation of every trigger word.
Abstract: In biomedical research, events revealing complex relations between entities play an important role. Biomedical event trigger identification has become a research hotspot because of its important role in biomedical event extraction. Traditional machine learning methods, such as support vector machines (SVM) and maxent classifiers, rely on manually designed features fed to the classifiers; they depend on an understanding of the specific task and cannot generalize to new domains or new examples. In this paper, we propose an approach which utilizes a neural network model based on dependency-based word embedding to automatically learn significant features from raw input for trigger classification. First, we employ Word2vecf, the modified version of Word2vec, to learn word embeddings with rich semantic and functional information based on the dependency relation tree. Then a neural network architecture is used to learn a more significant feature representation based on the raw dependency-based word embeddings. Meanwhile, we dynamically adjust the embeddings during training to adapt them to the trigger classification task. Finally, a softmax classifier labels the examples with a specific trigger class using the features learned by the model. The experimental results show that our approach achieves a micro-averaging F1 score of 78.27 % and a macro-averaging F1 score of 76.94 % in significant trigger classes, and performs better than baseline methods. In addition, we can obtain a semantic distributed representation of every trigger word.
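
The difference between the micro-averaged and macro-averaged F1 scores reported in this abstract can be shown with a small sketch. The class names and per-class counts below are invented for illustration and are not results from the paper.

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps class name -> (tp, fp, fn); returns (micro_f1, macro_f1)."""
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts over all classes first, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical trigger classes with made-up counts (tp, fp, fn).
toy_counts = {"Binding": (8, 2, 2), "Regulation": (1, 1, 3)}
micro_f1, macro_f1 = micro_macro_f1(toy_counts)
```

Because the macro average gives a rare, poorly predicted class the same weight as a frequent one, it is usually lower than the micro average on imbalanced label sets, which is the pattern seen in the two figures quoted above.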

Journal ArticleDOI
TL;DR: Of the five technology platforms tested, NanoString technology provides a more faithful translation of the RAS pathway gene expression signature from FF to FFPE than the Affymetrix GeneChip and multiple RNASeq technologies.
Abstract: The KRAS gene is mutated in about 40 % of colorectal cancer (CRC) cases, and KRAS mutation has been clinically validated as a predictive marker of intrinsic resistance to anti-EGFR inhibitor (EGFRi) therapy. Since nearly 60 % of patients with wild type KRAS fail to respond to EGFRi combination therapies, there is a need to develop more reliable molecular signatures to better predict response. Here we address the challenge of adapting a gene expression signature predictive of RAS pathway activation, created using fresh frozen (FF) tissues, for use with more widely available formalin fixed paraffin-embedded (FFPE) tissues. In this study, we evaluated the translation of an 18-gene RAS pathway signature score from FF to FFPE in 54 CRC cases, using a head-to-head comparison of five technology platforms. FFPE-based technologies included the Affymetrix GeneChip (Affy), NanoString nCounter™ (NanoS), Illumina whole genome RNASeq (RNA-Acc), Illumina targeted RNASeq (t-RNA), and Illumina stranded Total RNA-rRNA-depletion (rRNA). Using Affy_FF as the “gold” standard, initial analysis of the 18-gene RAS scores on all 54 samples shows varying pairwise Spearman correlations: (1) Affy_FFPE (r = 0.233, p = 0.090); (2) NanoS_FFPE (r = 0.608, p < 0.0001); (3) RNA-Acc_FFPE (r = 0.175, p = 0.21); (4) t-RNA_FFPE (r = −0.237, p = 0.085); and (5) rRNA_FFPE (r = −0.012, p = 0.93). These results suggest that only NanoString has successful FF to FFPE translation. The subsequent removal of identified “problematic” samples (n = 15) and genes (n = 2) further improves the correlations of Affy_FF with three of the five technologies: Affy_FFPE (r = 0.672, p < 0.0001); NanoS_FFPE (r = 0.738, p < 0.0001); and RNA-Acc_FFPE (r = 0.483, p = 0.002). Of the five technology platforms tested, NanoString technology provides a more faithful translation of the RAS pathway gene expression signature from FF to FFPE than the Affymetrix GeneChip and multiple RNASeq technologies.
Moreover, NanoString was the most forgiving technology in the analysis of samples with presumably poor RNA quality. Using this approach, the RAS signature score may now be reasonably applied to FFPE clinical samples.

Journal ArticleDOI
TL;DR: A strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression is presented and large numbers of variants detected by NGS are distilled to a limited set of variants prioritized as potential deleterious changes.
Abstract: Sequencing of both healthy and disease singletons yields many novel and low frequency variants of uncertain significance (VUS). Complete gene and genome sequencing by next generation sequencing (NGS) significantly increases the number of VUS detected. While prior studies have emphasized protein coding variants, non-coding sequence variants have also been proven to significantly contribute to high penetrance disorders, such as hereditary breast and ovarian cancer (HBOC). We present a strategy for analyzing different functional classes of non-coding variants based on information theory (IT) and prioritizing patients with large intragenic deletions. We captured and enriched for coding and non-coding variants in genes known to harbor mutations that increase HBOC risk. Custom oligonucleotide baits spanning the complete coding, non-coding, and intergenic regions 10 kb up- and downstream of ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, and TP53 were synthesized for solution hybridization enrichment. Unique and divergent repetitive sequences were sequenced in 102 high-risk, anonymized patients without identified mutations in BRCA1/2. Aside from protein coding and copy number changes, IT-based sequence analysis was used to identify and prioritize pathogenic non-coding variants that occurred within sequence elements predicted to be recognized by proteins or protein complexes involved in mRNA splicing, transcription, and untranslated region (UTR) binding and structure. This approach was supplemented by in silico and laboratory analysis of UTR structure. 15,311 unique variants were identified, of which 245 occurred in coding regions. With the unified IT-framework, 132 variants were identified and 87 functionally significant VUS were further prioritized. An intragenic 32.1 kb interval in BRCA2 that was likely hemizygous was detected in one patient. We also identified 4 stop-gain variants and 3 reading-frame altering exonic insertions/deletions (indels). 
We have presented a strategy for complete gene sequence analysis followed by a unified framework for interpreting non-coding variants that may affect gene expression. This approach distills large numbers of variants detected by NGS to a limited set of variants prioritized as potential deleterious changes.

Journal ArticleDOI
TL;DR: Although the enrichment profiles of bivalent nucleosomes show a clear dependency on CpG island content, they demonstrate a stark anti-correlation with methylation status, and are focally enriched in the vicinity of the transcription start site (TSS).
Abstract: Bivalent chromatin refers to overlapping regions containing activating histone H3 Lys4 trimethylation (H3K4me3) and inactivating H3K27me3 marks. Existence of such bivalent marks on the same nucleosome has only recently been suggested. Previous genome-wide efforts to characterize bivalent chromatin have focused primarily on individual marks to define overlapping zones of bivalency rather than mapping positions of truly bivalent mononucleosomes. Here, we developed an efficacious sequential ChIP technique for examining global positioning of individual bivalent nucleosomes. Using next generation sequencing approaches we show that although individual H3K4me3 and H3K27me3 marks overlap in broad zones, bivalent nucleosomes are focally enriched in the vicinity of the transcription start site (TSS). These seem to occupy the H2A.Z nucleosome positions previously described as salt-labile nucleosomes, and are correlated with low gene expression. Although the enrichment profiles of bivalent nucleosomes show a clear dependency on CpG island content, they demonstrate a stark anti-correlation with methylation status. We show that regional overlap of H3K4me3 and H3K27me3 chromatin tend to be upstream to the TSS, while bivalent nucleosomes with both marks are mainly promoter proximal near the TSS of CpG island-containing genes with poised/low expression. We discuss the implications of the focal enrichment of bivalent nucleosomes around the TSS on the poised chromatin state of promoters in stem cells.

Journal ArticleDOI
TL;DR: The result provides researchers with another way into the regulation mechanism through which IVIG represses excessive inflammatory responses; the genes regulated by CpG markers altered by IVIG administration are enriched in pathways associated with inflammatory immune response.
Abstract: Kawasaki disease (KD) is an autoimmune disease that preferentially affects children younger than five years worldwide. So far, the principal treatment for KD is the administration of intravenous immunoglobulin (IVIG). Although DNA methylation plays important regulatory roles in disease, few studies have investigated its regulatory role in KD. In this study, we focused not only on the DNA methylation alterations resulting from KD onset but also on those resulting from IVIG administration. To do so, we investigated the effects of KD onset and IVIG administration on the methylation of CpG markers. We first found that DNA methylation alterations reflecting disease onset or IVIG administration are contributed mainly by CpG markers on autosomes. In addition, we demonstrated that some CpG markers show methylation alterations among samples, and that the expression abundance of the downstream genes is correspondingly altered and negatively correlated with the methylation profile. Finally, compared with KD onset, IVIG administration has a stronger impact on methylation. These alterations arise mainly from hypermethylation of CpG markers, which keeps the corresponding genes at lower expression levels. Moreover, the genes regulated by the CpG markers altered by IVIG administration are enriched in pathways associated with inflammatory immune response. In summary, our result provides researchers with another way into the regulation mechanism through which IVIG represses excessive inflammatory responses.

Journal ArticleDOI
TL;DR: In this article, the authors performed a rare variant association analysis of PSEN1 with quantitative biomarkers of LOAD using whole genome sequencing (WGS) by integrating bioinformatics and imaging informatics.
Abstract: Pathogenic mutations in PSEN1 are known to cause familial early-onset Alzheimer’s disease (EOAD) but common variants in PSEN1 have not been found to strongly influence late-onset AD (LOAD). The association of rare variants in PSEN1 with LOAD-related endophenotypes has received little attention. In this study, we performed a rare variant association analysis of PSEN1 with quantitative biomarkers of LOAD using whole genome sequencing (WGS) by integrating bioinformatics and imaging informatics. A WGS data set (N = 815) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort was used in this analysis. 757 non-Hispanic Caucasian participants underwent WGS from a blood sample and high resolution T1-weighted structural MRI at baseline. An automated MRI analysis technique (FreeSurfer) was used to measure cortical thickness and volume of neuroanatomical structures. We assessed imaging and cerebrospinal fluid (CSF) biomarkers as LOAD-related quantitative endophenotypes. Single variant analyses were performed using PLINK and gene-based analyses of rare variants were performed using the optimal Sequence Kernel Association Test (SKAT-O). A total of 839 rare variants (MAF < 1/√(2N) = 0.0257) were found within a region of ±10 kb from PSEN1. Among them, six exonic (three non-synonymous) variants were observed. A single variant association analysis showed that the PSEN1 p.E318G variant increases the risk of LOAD only in participants carrying the APOE ε4 allele, where individuals carrying the minor allele of this PSEN1 risk variant have lower CSF Aβ1–42 and higher CSF tau. A gene-based analysis resulted in a significant association of rare but not common (MAF ≥ 0.0257) PSEN1 variants with bilateral entorhinal cortical thickness. This is the first study to show that PSEN1 rare variants collectively show a significant association with brain atrophy in regions preferentially affected by LOAD, providing further support for a role of PSEN1 in LOAD.
The PSEN1 p.E318G variant increases the risk of LOAD only in APOE ε4 carriers. Integrating bioinformatics with imaging informatics for identification of rare variants could help explain the missing heritability in LOAD.
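
The rare-variant cutoff quoted in this abstract, MAF < 1/√(2N), can be checked with a few lines of arithmetic. The sketch below assumes N is the 757 participants with both WGS and MRI data, since that value reproduces the stated 0.0257; the function name is hypothetical.

```python
import math


def rare_variant_maf_threshold(n_samples):
    """Minor allele frequency cutoff 1 / sqrt(2 * N) for N diploid samples."""
    return 1.0 / math.sqrt(2 * n_samples)


# Assumption: N = 757, the non-Hispanic Caucasian participants in the abstract.
threshold = rare_variant_maf_threshold(757)
threshold_rounded = round(threshold, 4)
```

Using the full WGS data set size of 815 instead would give a slightly smaller cutoff, so the choice of N matters when reproducing the variant count.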

Journal ArticleDOI
TL;DR: SOX5 has a strong inhibitory effect on MITF expression and seems to have a decisive clinical impact on melanoma during tumor progression, while double knockdown with SOX10 showed a rescue effect; both effects were validated by reporter assays.
Abstract: Melanoma is a cancer with rising incidence and new therapeutics are needed. For this, it is necessary to understand the molecular mechanisms of melanoma development and progression. Melanoma differs from other cancers by its ability to produce the pigment melanin via melanogenesis; this biosynthesis is essentially regulated by microphthalmia-associated transcription factor (MITF). MITF regulates various processes such as cell cycling and differentiation. MITF shows an ambivalent role, since high levels inhibit cell proliferation and low levels promote invasion. Hence, well-balanced MITF homeostasis is important for the progression and spread of melanoma. Therefore, it is difficult to use MITF itself for targeted therapy, but elucidating its complex regulation may lead to a promising melanoma-cell specific therapy. We systematically analyzed the regulation of MITF with a novel established transcription factor based gene regulatory network model. Starting from comparative transcriptomics analysis using data from cells originating from nine different tumors and a melanoma cell dataset, we predicted the transcriptional regulators of MITF employing ChIP binding information from a comprehensive set of databases. The most striking regulators were experimentally validated by functional assays and an MITF-promoter reporter assay. Finally, we analyzed the impact of the expression of the identified regulators on clinically relevant parameters of melanoma, i.e. the thickness of primary tumors and patient overall survival. Our model predictions identified SOX10 and SOX5 as regulators of MITF. We experimentally confirmed the role of the already well-known regulator SOX10. Additionally, we found that SOX5 knockdown led to MITF up-regulation in melanoma cells, while double knockdown with SOX10 showed a rescue effect; both effects were validated by reporter assays. 
Regarding clinical samples, SOX5 expression was distinctively up-regulated in metastatic compared to primary melanoma. In contrast, survival analysis of melanoma patients with predominantly metastatic disease revealed that low SOX5 levels were associated with a poor prognosis. MITF regulation by SOX5 has been shown only in murine cells, but not yet in human melanoma cells. SOX5 has a strong inhibitory effect on MITF expression and seems to have a decisive clinical impact on melanoma during tumor progression.

Journal ArticleDOI
TL;DR: This study presents a modified Artificial Bee Colony Algorithm to select minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy and can provide subset of genes leading to more accurate classification results while the number of selected genes is smaller.
Abstract: Development of biologically relevant models from gene expression data notably, microarray data has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improvement of the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. This study presents a modified Artificial Bee Colony Algorithm (ABC) to select minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones which is one of the major components of Ant Colony Optimization (ACO) algorithm and a new operation in which successive bees communicate to share their findings. The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is proved to be superior. The method presented in this paper can provide subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.

Journal ArticleDOI
TL;DR: The identification of 24 promoter associated CpG sites that correlated with change in SBP after RYGB surgery may contribute to a further understanding of the epigenetic regulatory mechanisms underlying the development of essential hypertension.
Abstract: Essential hypertension is a significant risk factor for cardiovascular diseases. Emerging research suggests a role of DNA methylation in blood pressure physiology. We aimed to investigate epigenetic associations of promoter related CpG sites to essential hypertension in a genome-wide methylation approach.

Journal ArticleDOI
TL;DR: A novel deep learning model based on generative stochastic networks and a hidden Markov chain is presented to classify observed samples, genotyped for SNPs at five loci in two genes, into populations vulnerable to 14 types of adverse reactions.
Abstract: Genomic variations are associated with the metabolism of, and the occurrence of adverse reactions to, many therapeutic agents. Polymorphisms at over 2000 locations in cytochrome P450 (CYP) enzyme genes, arising from factors such as ethnicity, mutation, and inheritance, contribute to the diversity of responses to and side effects of various drugs. The associations among single nucleotide polymorphisms (SNPs), internal pharmacokinetic patterns, and vulnerability to specific adverse reactions have become one of the research interests of pharmacogenomics. Conventional genome-wide association studies (GWAS) mainly focus on the relation of single or multiple SNPs to a specific risk factor, which is a one-to-many relation. However, there are no robust methods to establish a many-to-many network that can combine the direct and indirect associations between multiple SNPs and a series of events (e.g. adverse reactions, metabolic patterns, prognostic factors etc.). In this paper, we present a novel deep learning model based on generative stochastic networks and a hidden Markov chain to classify observed samples, genotyped for SNPs at five loci in two genes (CYP2D6 and CYP1A2), into populations vulnerable to 14 types of adverse reactions. A supervised deep learning model is proposed in this study. A revised generative stochastic network (GSN) model, with a hidden Markov chain as its transition operator, is used. The data of the training set are collected from clinical observation. The training set is composed of 83 observations of blood samples genotyped at CYP2D6*2, *10, *14 and CYP1A2*1C, *1F. The samples are genotyped by the polymerase chain reaction (PCR) method. A hidden Markov chain is used as the transition operator to simulate the probabilistic distribution.
The model can perform learning at lower cost than the conventional maximum likelihood method because the transition distribution is conditional on the previous state of the hidden Markov chain. A least absolute shrinkage and selection operator (LASSO) algorithm and a k-Nearest Neighbors (kNN) algorithm are used as baselines for comparison and to evaluate the performance of our proposed deep learning model. There are 53 adverse reactions reported during the observation. They are assigned to 14 categories. In the comparison of classification accuracy, the deep learning model shows superiority over the LASSO and kNN models, with a rate over 80 %. In the comparison of reliability, the deep learning model shows the best stability among the three models. Machine learning provides a new method to explore the complex associations among genomic variations and multiple events in pharmacogenomics studies. The new deep learning algorithm is capable of classifying various SNPs to the corresponding adverse reactions. We expect that as more genomic variations are added as features and more observations are made, the deep learning model can improve its performance and can act as a black-box but reliable verifier for other GWAS.