
Pathway Association Studies Tool

TL;DR: PAST is a more accessible and user-friendly implementation of a pathway analysis method that interprets GWAS data to find biological mechanisms associated with traits of interest; it produces results similar to those of the previous implementation in significantly less time.
Abstract:
Background: In recent years, a bioinformatics method for interpreting GWAS data using metabolic pathway analysis has been developed and successfully used to find significant pathways and mechanisms explaining phenotypic traits of interest in plants. However, the many scripts implementing this method were not straightforward to use, had to be customized for each project, required user supervision, and took more than 24 hours to process data. PAST (Pathway Association Study Tool), a new implementation of this method, has been developed to address these concerns.
Results: PAST is implemented as a package for the R language. Two user interfaces are provided; PAST can be run by loading the package in R and calling its methods, or by using an R Shiny guided user interface. In testing, PAST completed analyses in approximately one hour by processing data in parallel. PAST has many user-specified options for maximum customization. PAST produces the same results as the previously developed method.
Conclusions: In order to promote a powerful new method of pathway analysis that interprets GWAS data to find biological mechanisms associated with traits of interest, we developed a more accessible and user-friendly tool. This tool is more efficient and requires less knowledge of programming languages to use than previous methods. Moreover, it produces similar results in significantly less time. These attributes make PAST accessible to researchers interested in associating metabolic pathways with GWAS datasets to better understand the genetic architecture and mechanisms affecting phenotype.

Availability and requirements

  • Project name: PAST
  • Project home page: https://github.com/IGBB/PAST/
  • Shiny apps: https://apps.maizegdb.org/PAST/ , https://apps.maizegdb.org/maizeGDB_PAST/ , <CyVerse link once the authors have it>
  • Operating system(s): Platform independent
  • Programming language: R 3.6
  • Any restrictions to use by non-academics: None


PAST
Pathway Association Studies Tool
Adam Thrash (1), Juliet D. Tang (2), Mason DeOrnellis (3, 4), Daniel G. Peterson (1), Marilyn L. Warburton (5)
(1) Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, Mississippi State, MS, USA
(2) USDA-FS Forest Products Laboratory, Starkville, MS, USA
(3) Humanities and Fine Arts Division, East Mississippi Community College, Mayhew, MS, USA
(4) Department of Computer Science and Engineering, Mississippi State University, Mississippi State, MS, USA
(5) USDA-ARS Corn Host Plant Resistance Research Unit, Mississippi State, MS, USA
Keywords
Metabolic pathway analysis, Genome-wide association study (GWAS), maize (Zea mays L.)
bioRxiv preprint doi: https://doi.org/10.1101/691964; this version posted July 13, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.
Background
Genome-wide association study (GWAS) of complex traits in maize and other crops has become a very
popular way to identify regions of the genome that influence these traits [1, 2, 3]. In general, hundreds of
thousands of single nucleotide polymorphism (SNP) markers are each tested using F statistics for
association with the trait, which assigns a p-value for the SNP-trait association. Individual marker-trait
associations that meet the threshold set for the false discovery rate (FDR, the proportion of false positives
among all significant results for some level α) are then studied in more detail to uncover hints as to the
genetic architecture of the trait, and how best to improve it in the future. Many true associations may be
missed in GWAS, however, because the threshold for FDR could be as low as α divided by the total
number of SNPs being tested. In complex, polygenic traits, the effects of genes that exert only small
effects on a trait may not meet the FDR threshold, especially if the effect value of the association is
influenced by the environment. Additionally, alleles of many genes may be expressed only in specific
genetic backgrounds and will only be useful when found in combination with the positive alleles of other
genes in the same pathway [3]. These allelic combinations may not exist in the limited number of
individuals in the GWAS panel. Thus, the statistical power of GWAS for detecting genes of small effect
is limited by the strict levels set for FDR and by insufficient numbers of high-frequency polymorphisms
found in most panels.
Metabolic pathway analysis focuses on the combined effects of many genes that are grouped according to
their shared biological function [4, 5, 6]. This is a promising approach that can complement GWAS to
give clues to the genetic basis of a trait. Originally developed to study differences in gene expression data
in human disease studies [7], pathway analysis and association mapping have been used in medical
research to find biological insights missed when focusing on only one or a few genes that have highly
significant associations with a trait of interest [8, 9, 5, 10]. Pathway analysis has only recently begun to be
used in animal studies as well [11, 12]. In addition, biologically relevant pathways can be used to guide
interpretation of large data sets produced by other high-throughput approaches like RNA sequencing,
proteomics, and metabolomics.
More recently, GWAS-based metabolic pathway analysis has been used as a discovery tool to investigate
the genetic basis of complex traits in plants. A pathway-based approach was used to study aflatoxin
accumulation [13], corn ear worm resistance [14] and oil biosynthesis [15] in maize. Combining GWAS
analysis with metabolic pathway analysis considers all genetic sequences positively associated with the
trait of interest, regardless of magnitude, and jointly may highlight which sequences lead to mechanisms
for crop improvement and which warrant further study and manipulation, for example, by gene editing.
While combined GWAS and pathway analyses were highly successful in uncovering associated
pathways, the analyses were slow and cumbersome, as the analysis tools were written in a combination of
R, Perl, and Bash, and the output of each analysis was manually input into the next analysis. A single,
unified and user friendly tool to accomplish this pathway analysis was lacking.
The Pathway Association Study Tool (PAST) was developed to facilitate easier and more efficient
GWAS-based metabolic pathway analysis. PAST was designed for use with maize but is usable for other
species as well. It tracks all SNP marker-trait associations, regardless of significance or magnitude.
PAST groups SNPs into linkage blocks based on linkage disequilibrium (LD) data and identifies a
tagSNP from each block. PAST then identifies genes within a user-defined distance of the tagSNPs, and
transfers the attributes of the tagSNP to the gene(s), including the allele effect, R² and p-value of the
original SNP-trait association found from the GWAS analysis. Finally, PAST uses the gene effect values
to calculate an enrichment score (ES) and p-value for each pathway. PAST is easy to use as an online
tool, standalone R script, or as a downloadable R Shiny application. It uses as input TASSEL [16] files
that are generated as output from the General Linear or Mixed Linear Models (GLM and MLM), or files
from any association analysis that has been similarly formatted, as well as genome annotations in GFF
format, and a metabolic pathways file.
Implementation
PAST is implemented as an R package and is available through Bioconductor 3.9. PAST is based on a
method developed by our research group [6]. The original method was subsequently used in two other
maize studies [14, 15], but required users to customize Perl and R scripts and run Bash scripts. PAST is
implemented entirely in R; a user only needs to install the package and never has to edit the
source code. Two graphical user interfaces are available in the form of R Shiny applications. A generic
version is available on CyVerse and Github, while a maize-specific version can be run on MaizeGDB
[17] (explained below).
PAST processes data through four main steps. First, GWAS output data is loaded into PAST. This data
comes in the form of statistics that describe the association of specific loci (e.g., SNPs) with one or more
traits of interest, together with LD data between loci. The SNPs are then associated with genes based on the allelic effects, p-values,
and genomic distance between SNPs and genes. The genes and their effects are used to find significant
pathways and calculate a running ES. Finally, the genes in these pathways are plotted to show a running
ES for each pathway. A flowchart in Figure 1 shows the process.
Loading Data
During the process of loading data, the GWAS data is filtered to remove any non-biallelic data: any
marker with more or fewer than two alleles associated with it is discarded. Data without an R² value
(the coefficient of determination of the SNP-trait association) is removed as well, since later calculations
rely on the R² value. The effects data is then joined to the statistics data in order to collect all data about
a marker into a single dataframe.
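The marker filtering described above can be sketched in a few lines. This is a minimal illustration in Python, not PAST's R code; the function name `filter_markers` and the field names (`marker`, `allele`, `effect`, `p`, `r2`) are hypothetical and do not reflect the actual TASSEL column headers.

```python
def filter_markers(stats_rows, effects_rows):
    """Keep biallelic markers that have an R^2 value and merge effects into stats.

    Field names are hypothetical; real TASSEL output uses different headers.
    """
    # Collect the per-allele effect rows for each marker.
    effects_by_marker = {}
    for row in effects_rows:
        effects_by_marker.setdefault(row["marker"], []).append(row)

    merged = []
    for stat in stats_rows:
        alleles = effects_by_marker.get(stat["marker"], [])
        if len(alleles) != 2:
            continue  # more or fewer than two alleles: not biallelic
        if stat.get("r2") is None:
            continue  # later calculations rely on R^2
        record = dict(stat)
        record["effects"] = {a["allele"]: a["effect"] for a in alleles}
        merged.append(record)
    return merged


stats = [
    {"marker": "S1_100", "p": 0.01, "r2": 0.05},
    {"marker": "S1_200", "p": 0.20, "r2": None},   # dropped: no R^2
    {"marker": "S1_300", "p": 0.03, "r2": 0.02},   # dropped: three alleles
]
effects = [
    {"marker": "S1_100", "allele": "A", "effect": 0.4},
    {"marker": "S1_100", "allele": "G", "effect": 0.0},
    {"marker": "S1_200", "allele": "C", "effect": -0.1},
    {"marker": "S1_200", "allele": "T", "effect": 0.0},
    {"marker": "S1_300", "allele": "A", "effect": 0.2},
    {"marker": "S1_300", "allele": "C", "effect": 0.1},
    {"marker": "S1_300", "allele": "T", "effect": 0.0},
]
kept = filter_markers(stats, effects)
print([r["marker"] for r in kept])
```

Merging the allele effects into the statistics record mirrors the single-dataframe step, so every later stage can read a marker's p-value, R² and effects from one place.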
The LD data is filtered to drop rows where the loci are not the same, and then extra columns are dropped.
Only data about the locus, the positions, the sites, the distance between the sites, and the r² value
(coefficient of determination for LD) is retained. The remaining data is split into groups based on the
locus.
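A rough sketch of the LD filtering and grouping step follows, again in Python for illustration only. The column names (`locus1`, `position1`, `site1`, and so on) and the extra `dprime` column are assumptions made for this example, not PAST's actual input schema.

```python
from collections import defaultdict

# Hypothetical names for the columns that are retained.
KEPT_COLUMNS = ("position1", "site1", "position2", "site2", "distance", "r2")


def group_ld_by_locus(ld_rows):
    """Drop cross-locus pairs, trim extra columns, and group rows by locus."""
    groups = defaultdict(list)
    for row in ld_rows:
        if row["locus1"] != row["locus2"]:
            continue  # the two sites are on different loci
        slim = {"locus": row["locus1"]}
        for column in KEPT_COLUMNS:
            slim[column] = row[column]
        groups[slim["locus"]].append(slim)
    return dict(groups)


ld_rows = [
    {"locus1": "1", "position1": 100, "site1": 0, "locus2": "1",
     "position2": 250, "site2": 1, "distance": 150, "r2": 0.91, "dprime": 1.0},
    {"locus1": "1", "position1": 100, "site1": 0, "locus2": "2",
     "position2": 90, "site2": 7, "distance": None, "r2": 0.02, "dprime": 0.1},
    {"locus1": "2", "position1": 90, "site1": 7, "locus2": "2",
     "position2": 400, "site2": 8, "distance": 310, "r2": 0.40, "dprime": 0.8},
]
groups = group_ld_by_locus(ld_rows)
print(sorted(groups))
```

Splitting by locus yields independent groups, which is one natural unit for the parallel processing the paper mentions.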
Assigning Genes
Genes are assigned the attributes of linked SNPs according to the method described in Tang et al. [6].
SNPs are parsed into linked groups by identifying all pairs of SNPs with LD data that exceed a set cutoff
r² for linkage. SNP blocks occur when multiple SNPs are linked to one SNP in common. SNPs that are
only linked to one other SNP are considered singly-linked SNPs, and SNPs not linked to any other SNPs
are unlinked. In all cases, PAST follows an algorithm to identify one tagSNP to represent all linked SNPs
in order to reduce the dimensionality of the dataset and identify which allele effect, p-value and R² to transfer to
the physically linked gene(s). Unlinked SNPs are by default identified as the tagSNP. For SNPs that are
linked to a single other SNP, if both have the same effect sign (positive or negative), PAST identifies the
one associated with the largest effect (absolute value) as the tagSNP. If the effects are equal, the second
(more downstream) SNP is used. If, however, the effect signs are different, the SNP with the lowest
p-value is used. If the p-values are the same and the signs are different, the pair is labeled as problematic,
since no assignment can be made, and no tagSNP is identified; these are dropped from the analysis.
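The rules for a singly-linked pair amount to a small decision procedure, sketched below in Python. The function `tag_pair` and the record fields are hypothetical, and treating a zero effect as positive is an assumption of this sketch that the text does not address.

```python
def tag_pair(upstream, downstream):
    """Pick the tagSNP for two singly-linked SNPs, or None if problematic.

    Each SNP is a dict with 'effect' and 'p'. A zero effect is treated as
    positive here, which is an assumption of this sketch.
    """
    same_sign = (upstream["effect"] >= 0) == (downstream["effect"] >= 0)
    if same_sign:
        # Largest absolute effect wins; on a tie the downstream SNP is used.
        if abs(upstream["effect"]) > abs(downstream["effect"]):
            return upstream
        return downstream
    # Opposite signs: the lowest p-value wins.
    if upstream["p"] < downstream["p"]:
        return upstream
    if downstream["p"] < upstream["p"]:
        return downstream
    return None  # equal p-values with opposite signs: dropped from analysis


a = {"name": "a", "effect": 0.2, "p": 0.04}
b = {"name": "b", "effect": 0.5, "p": 0.10}
c = {"name": "c", "effect": -0.5, "p": 0.04}
print(tag_pair(a, b)["name"], tag_pair(a, c))
```

Returning `None` for the problematic case lets the caller drop the pair explicitly rather than silently choosing one of the two SNPs.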
The tagSNP within blocks of SNPs is identified by first counting the number of positive and negative
effects in each linkage block. If the number of positive effects is greater, then the SNP with the largest
positive effect is chosen. If the number of negative effects is greater, then the SNP with the largest
negative effect is chosen. Ties between the number of negative and positive effects are broken by checking
the sign of the SNP in common defining the block. The tagSNP is then the one with the largest effect and
the same sign, and it is marked to indicate the number of SNPs in the block. Once all blocks have been
reduced to a single tagSNP, the tagSNP is used to locate the nearby gene(s).
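The block rule can be sketched the same way. In this Python illustration the common SNP is counted as a member of its own block, which is an assumption; `tag_block` and the `linked_snps` field are hypothetical names.

```python
def tag_block(common, linked):
    """Pick the tagSNP for a block defined by one common SNP and its linked SNPs.

    The common SNP is counted as part of the block (an assumption of this
    sketch). Returns a copy of the chosen SNP annotated with the block size,
    or None if no SNP carries the chosen sign.
    """
    block = [common] + list(linked)
    n_pos = sum(1 for snp in block if snp["effect"] > 0)
    n_neg = sum(1 for snp in block if snp["effect"] < 0)
    if n_pos > n_neg:
        sign = 1
    elif n_neg > n_pos:
        sign = -1
    else:
        # Tie: use the sign of the SNP that defines the block.
        sign = 1 if common["effect"] > 0 else -1
    candidates = [snp for snp in block if snp["effect"] * sign > 0]
    if not candidates:
        return None
    tag = max(candidates, key=lambda snp: abs(snp["effect"]))
    return dict(tag, linked_snps=len(block))


common = {"name": "c0", "effect": 0.2, "p": 0.03}
linked = [
    {"name": "s1", "effect": -0.5, "p": 0.01},
    {"name": "s2", "effect": 0.3, "p": 0.02},
    {"name": "s3", "effect": -0.1, "p": 0.40},
]
tag = tag_block(common, linked)
print(tag["name"], tag["linked_snps"])
```

In this example the counts tie at two positive and two negative effects, so the positive sign of the common SNP decides, and the largest positive effect (`s2`) becomes the tagSNP.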
Once tagSNPs have been identified, the annotation files are checked to look for genes within a physical
distance window provided by the user. The effect and the p-value of the tagSNP are transferred to the gene.
The SNP-gene assignments are grouped by gene name, and if more than one SNP block or unlinked SNP
was found to be linked to the same gene, each gene is tagged by counting the number of negative effect
and the number of positive effect associations in the blocks linked to the same gene. If there are more
negative effects, the most negative effect and its p-value are assigned to the gene. If there are more positive
effects, the most positive effect and its p-value are assigned to the gene. If more than one effect is equally positive
or equally negative, the effect with the lowest p-value is chosen and assigned to the gene. If there
are an equal number of negative and positive effects, the effect with the greatest absolute value is
selected. The number of linked SNPs is set to the total number of SNPs (SNPs within blocks plus blocks
within genes) linked to that gene. Once all of the blocks of genes have been processed, the effects of each
gene are used to find significant pathways.
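A compact sketch of the gene-level assignment is given below in Python. It is illustrative only: `assign_gene_effect` is a hypothetical name, and summing the per-tagSNP `linked_snps` counts is one reading of how the total number of linked SNPs is obtained.

```python
def assign_gene_effect(tags):
    """Reduce all tagSNPs assigned to one gene to a single effect and p-value.

    `tags` is a list of dicts with 'effect', 'p' and 'linked_snps'. Summing
    'linked_snps' is an assumption about how the total is computed.
    """
    n_pos = sum(1 for t in tags if t["effect"] > 0)
    n_neg = sum(1 for t in tags if t["effect"] < 0)
    if n_neg > n_pos:
        pool = [t for t in tags if t["effect"] < 0]
    elif n_pos > n_neg:
        pool = [t for t in tags if t["effect"] > 0]
    else:
        pool = list(tags)  # equal counts: greatest absolute value wins
    # Largest magnitude first; equal magnitudes are resolved by lowest p-value.
    best = min(pool, key=lambda t: (-abs(t["effect"]), t["p"]))
    return {
        "effect": best["effect"],
        "p": best["p"],
        "linked_snps": sum(t["linked_snps"] for t in tags),
    }


tags = [
    {"effect": -0.3, "p": 0.04, "linked_snps": 3},
    {"effect": -0.3, "p": 0.01, "linked_snps": 1},
    {"effect": 0.8, "p": 0.20, "linked_snps": 2},
]
gene = assign_gene_effect(tags)
print(gene)
```

Here negative effects outnumber positive ones, the two negative effects are equally extreme, and the lower p-value decides between them.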
Finding Significant Metabolic Pathways
Significant pathways are found by using a previously described method [4, 6, 7]. User input determines
the minimum number of genes that a pathway must contain to be retained for processing (to avoid small
sample size bias), the number of times the effects data are randomly sampled with replacement to
generate a null distribution of ES, and the pathways database that is being used.
For each gene effects column (observed and randomly sampled), the effects are sorted and ranked from
best to worst (and whether this is in increasing or decreasing order depends on the trait under study). The
ES running sum statistic increases for genes in the pathway and decreases for genes not in the pathway.
The amount of increase for genes in the pathway is weighted by the absolute value of the effect. The
pathway ES is the largest positive value calculated for the running sum statistic.
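The running sum can be illustrated with a short Python sketch. The text states only that the statistic rises for pathway genes (weighted by absolute effect) and falls otherwise; the specific step sizes used here, which follow the normalization commonly used in gene set enrichment analysis, are an assumption, as are the function and parameter names.

```python
def enrichment_score(gene_effects, pathway, descending=True):
    """Return (ES, running-sum curve) for one pathway.

    Hits step up by |effect| / (sum of |effect| over pathway genes) and misses
    step down by 1 / (number of non-pathway genes). These step sizes are an
    assumption of this sketch, not taken from the paper.
    """
    ranked = sorted(gene_effects.items(), key=lambda item: item[1],
                    reverse=descending)
    hit_total = sum(abs(effect) for gene, effect in ranked if gene in pathway)
    n_miss = sum(1 for gene, _ in ranked if gene not in pathway)
    if hit_total == 0 or n_miss == 0:
        return 0.0, []
    running = 0.0
    curve = []
    for gene, effect in ranked:
        if gene in pathway:
            running += abs(effect) / hit_total
        else:
            running -= 1.0 / n_miss
        curve.append(running)
    return max(curve), curve


gene_effects = {"g1": 0.9, "g2": 0.5, "g3": 0.4, "g4": 0.1}
es, curve = enrichment_score(gene_effects, pathway={"g1", "g2"})
print(round(es, 6), [round(v, 3) for v in curve])
```

The `descending` flag stands in for the trait-dependent choice of whether large or small effects count as "best"; with these step sizes the curve returns to zero after the last gene, so the maximum is never negative.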
Pathway significance is determined by comparing the observed ES with the ES for the null distribution.
The mean and standard deviation of the null distribution are used to normalize the observed ES so that z
scores can be obtained. P-values are computed from the z scores as 1 - pnorm(z), where pnorm is the
standard normal cumulative distribution function in R. Since
multiple hypothesis testing is still a concern, an FDR-adjusted p-value (known as a q-value) is calculated
using the qvalue package in R [18].
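The normalization and upper-tail p-value can be written out as a short Python sketch. `statistics.NormalDist().cdf` plays the role of R's `pnorm`; using the sample standard deviation is an assumption, and the q-value step is not reproduced here.

```python
from statistics import NormalDist, mean, stdev


def pathway_p_value(observed_es, null_es):
    """Return (z, p) for an observed ES against a null distribution of ES values.

    Uses the sample standard deviation (an assumption of this sketch) and the
    upper tail of the standard normal, i.e. p = 1 - Phi(z).
    """
    mu = mean(null_es)
    sigma = stdev(null_es)
    z = (observed_es - mu) / sigma
    p = 1.0 - NormalDist().cdf(z)
    return z, p


# Hypothetical null ES values from resampled gene effects.
null_es = [0.1, 0.2, 0.3, 0.4, 0.5]
z_mid, p_mid = pathway_p_value(0.3, null_es)
z_high, p_high = pathway_p_value(0.8, null_es)
print(round(z_high, 3), round(p_high, 5))
```

An observed ES equal to the null mean gives z = 0 and p = 0.5, and p shrinks as the observed ES moves into the upper tail of the null distribution.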
Plotting
Based on user input, the pathways can be filtered for significance (either p-value or q-value), or the top n
pathways can be selected. Rugplots for each pathway in the set of significant pathways are plotted as the
last step. The x-axis shows the rank of each gene effect value; the y-axis shows the value of the ES
running sum statistic as each consecutive gene effect value is processed. An x-intercept line indicates the
highest point of the ES. Small hatch marks at the top of the image indicate the position of the effect of all
genes in the pathway. An example rugplot is provided in Figure 2.
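The selection that precedes plotting can be sketched as a simple filter. This Python illustration uses hypothetical names (`select_pathways`, `by`, `threshold`, `top_n`) and made-up pathway records; it does not reproduce PAST's plotting interface.

```python
def select_pathways(results, by="pvalue", threshold=0.05, top_n=None):
    """Choose pathways to plot: the top n by significance, or all under a cutoff.

    `results` is a list of dicts holding 'pathway', 'pvalue' and 'qvalue'.
    """
    ranked = sorted(results, key=lambda row: row[by])
    if top_n is not None:
        return ranked[:top_n]
    return [row for row in ranked if row[by] <= threshold]


# Made-up records for illustration only.
results = [
    {"pathway": "PWY-A", "pvalue": 0.001, "qvalue": 0.02},
    {"pathway": "PWY-B", "pvalue": 0.030, "qvalue": 0.20},
    {"pathway": "PWY-C", "pvalue": 0.400, "qvalue": 0.60},
]
by_p = select_pathways(results, by="pvalue", threshold=0.05)
by_q = select_pathways(results, by="qvalue", threshold=0.05)
top2 = select_pathways(results, top_n=2)
print([row["pathway"] for row in by_p])
```

Filtering on q-values is stricter than filtering on p-values at the same cutoff, which is why the two filters can return different sets from the same results.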
R Shiny Applications
Two versions of an R Shiny application that use PAST have been developed. These R Shiny applications
provide a guided user interface that sets analysis parameters in PAST; they can also load a saved set of
results to explore again. The version available on Github and CyVerse allows a user to run a new analysis
by selecting their data, annotations, and pathways depending on the species being studied. The version
available on MaizeGDB [17] allows a user to upload their data and select specific versions of the maize
annotation and pathways databases available on MaizeGDB. A screenshot of the generic R Shiny
application is provided in Figure 3.
Results and Discussion
PAST is run by calling its functions with GWAS data from within an R script or by using an included R
Shiny interface. PAST will allow a new interpretation of GWAS results, which should identify associated
pathways either when one or a few genes are highly associated with the trait (these would have been
identified by the GWAS analysis directly); or when many genes in the pathway are moderately associated
with the trait (these would not necessarily have been identified by the GWAS analysis). Such an
interpretation will add both additional results and biological meaning to the association data, as was seen
with oil biosynthesis in studies by Li et al. [19, 15]. While PAST may be useful in bringing biologically
useful insights to a GWAS analysis, it will not be able to find order in a chaotic dataset if
environmental variation, experimental error, or improper analysis models compromised the association
analysis. For strong data sets, however, it may find pathways where GWAS found few or no significant
associations which, taken in isolation, shed no real light on the genetic mechanisms underlying the traits
under study. PAST may be able to overcome this limitation, and may in addition be able to identify
epistatic interactions between genes in the same pathway [20].
PAST was tested using data from three previous corn GWAS on kernel color (261147 SNPs), aflatoxin
resistance (261184 SNPs), and total oil production (525107 SNPs). Kernel color and aflatoxin analyses
were run on a desktop computer with 32 GB of memory, a 4 GHz Intel Core i7 with four processors, and
solid-state storage. All four processors were used when testing PAST. The kernel color test completed in
63 minutes; the aflatoxin test completed in 62 minutes. The total oil test was run on a system with the
following specifications: Intel(R) Xeon(R) CPU E5-2680 v2 at 2.80 GHz with 64 GB of RAM connected
to a Lustre v.2.5.3 file system via an FDR InfiniBand interconnect. The analysis of the total oil trait
completed in 71 minutes. Using the previous method, these analyses took 24 hours or more, depending on
how attentive the user was to starting the next step in the process. The results of the analyses of all three
traits were comparable when generated with PAST or with the previous method.
Metabolic pathway analysis has been used extensively to derive functional meaning from GWAS results
in human disease studies, and methodologies and tools similar to PAST have been published
for use with annotated human pathways. Some methodologies reviewed by Kwak and Pan [21] include
GATES-Simes, HYST, and MAGMA. Two tools for human GWAS pathway studies, GSA-SNP2 [10] and
Pascal (a Pathway scoring algorithm) [22], have also been published. However, these
tools would need to be extensively modified to work with any set of user-supplied pathways. Some of their
analysis procedures, however, should be considered in future versions of PAST. An analysis with PAST
should be illuminating for any outcrossing plant species and any animal and human datasets for which
annotated pathway/genome databases (or related model organism databases) and GFF annotations are
available. However, for inbreeding and polyploid species, the assignment of SNPs to genes may be
complicated by very long LD blocks which may contain multiple, equally linked genes, or homology to
more than one genome. Tests will be run to see if these problems can be overcome.
Conclusions
In conclusion, we present PAST, a tool designed to use GWAS data to perform pathway analysis. PAST
is faster and more user-friendly than previous methods, requires minimal knowledge of programming
languages, and is publicly available at Github, Bioconductor, CyVerse and MaizeGDB.
Citations
More filters
16 Dec 2019
TL;DR: Esta investigacion plantea the integracion of the aproximacion of rutas biologicas “biological pathway” en un estudio de asociacion genetica de compuestos de antocianinas en papa.
Abstract: La papa es considerada un cultivo basico a nivel mundial y constituye la tercera fuente de antioxidantes en la dieta humana, debido a su consumo per capita. Algunos de los compuestos antioxidantes mas relevantes en papa dentro de los polifenoles dietarios son los pigmentos antocianicos. La sintesis de los compuestos antocianicos implica una matriz de senales reguladoras potencialmente superpuestas, por lo que, no se puede ver su sintesis exclusivamente a partir de la transcripcion de los genes incluidos en la sintesis de las enzimas involucradas en la ruta biosintetica, sino que, es importante considerar diferentes variantes alelicas no asociadas en las rutas y/o factores de transcripcion que pueden estar desempenando un papel importante en la regulacion y acumulacion de estos compuestos en diferentes tejidos vegetales. Para entender la arquitectura genetica que gobierna rasgos complejos relacionados con la composicion nutricional de los alimentos, como es la composicion de antocianinas en diferentes especies y en papa; se han implementado estrategias como los estudios de mapeo de asociacion amplia del genoma (GWAS), en programas de mejoramiento genetico, con el fin de identificar loci de caracteres cuantitativos asociados al rasgo, para luego aplicar en programas de mejoramiento genetico. En el caso de polifenoles dietarios, debido a que su sintesis y acumulacion en plantas esta regulada por multiples genes que se organizan dentro de redes biologicas complejas, la metodologia de GWAS presenta una limitante y es que no ofrece evidencia directa acerca del proceso biologico que liga la variante asociada con el rasgo. Esta investigacion plantea la integracion de la aproximacion de rutas biologicas “biological pathway” en un estudio de asociacion genetica de compuestos de antocianinas en papa. 
El objetivo de esta investigacion fue la integracion de variantes alelicas/genes obtenidos por un estudio de asociacion amplia del genoma y variantes alelicas de genes putativamente relacionados en la ruta biosintetica de los compuestos antocianicos, integrando la participacion genica de cada una de las regiones genomicas asociadas, permitiendo una interpretacion biologica de cada asociacion. Para lo cual, se empleo un panel de asociacion de 109 genotipos de papa diploide Grupo Phureja, fenotipado para el rasgo del contenido de cinco antocianindinas por medio de cromatografia liquida de alta precision (HPLC) y, genotipado bajo la metodologia de “genotyping by sequencing” con una matriz de 87,657 marcadores polimorficos de un solo nucleotido. Se establecieron modelos de rutas biosinteticas promisorias que integraron la participacion genica de cada gen asociado. Como resultado se identifico una region de interes en el cromosoma 10, identificando genes asociados a la ruta biosintetica de las antocianinas en una region de 4Mpb en el brazo final del cromosoma 10; un marcador ligado al gen de fenilalanina amoniaco-liasa, que codifica la primera enzima en la ruta biosintetica de fenilpropanoides, se asocio a los cinco compuestos antocianicos evaluados explicando la mayor variacion fenotipica del rasgo. Se identifico que la ruta biosintetica de la L-metionina es importante para la ruta tardia de las antocianinas. Esta investigacion confirmo regiones genomicas cuya variabilidad alelica esta asociada con los compuestos analizados y que ya previamente habian sido detectadas. Tambien permitio la identificacion de otras nuevas regiones genomicas en un enfoque de “rutas biologicas” complementando el conocimiento existente. Los resultados contribuyen a la comprension de la regulacion de las antocianinas en la papa y pueden usarse en futuros estudios para la integracion en programas de mejoramiento de papa.

81 citations

Posted ContentDOI
15 Jan 2020-bioRxiv
TL;DR: While most of the identified pathways recapitulate the pathophysiology of T2D, the results show that incorporating SNP functional properties and protein-protein interaction networks into GWAS can dissect leading molecular pathways that cannot be picked up using traditional analyses.
Abstract: Diabetes mellitus (DM) is a group of metabolic disorders characterized by dysfunction of the insulin-producing beta cells and glucagon-secreting alpha cells of the pancreas, and by hyperglycemia related to insulin resistance or impaired insulin function. Type 2 diabetes mellitus (T2D), which constitutes 90% of diabetes cases, is a complex multifactorial disease. In the last decade, genome-wide association studies (GWASs) for T2D successfully pinpointed the genetic variants (typically single nucleotide polymorphisms, SNPs) that associate with disease risk. However, traditional GWASs focus on the ‘tip of the iceberg’ SNPs, and SNPs with mild effects are discarded. In order to diminish the burden of multiple testing in GWAS, researchers attempted to evaluate the collective effects of interesting variants. In this regard, pathway-based analyses of GWAS became popular for discovering novel multigenic functional associations. Still, to reveal the unaccounted 85 to 90% of T2D variation, which lies hidden in GWAS datasets, new post-GWAS strategies need to be developed. In this respect, here we reanalyze three GWAS meta-analysis datasets for T2D, using the methodology that we have developed to identify disease-associated pathways by combining nominally significant evidence of genetic association with the known biochemical pathways, protein-protein interaction (PPI) networks, and the functional information of selected SNPs. In this research effort, to shed light on the molecular mechanisms underlying T2D development and progression, we integrated different in silico approaches that proceed in a top-down and a bottom-up manner, and hence presented a comprehensive analysis at the protein subnetwork, pathway, and pathway subnetwork levels. Our network- and pathway-oriented approach is based on both the significance level of an affected pathway and its topological relationship with its neighbor pathways.
Using the mutual information based on the shared genes, the identified protein subnetworks and the affected pathways of each dataset were compared. While most of the identified pathways recapitulate the pathophysiology of T2D, our results show that incorporating SNP functional properties and protein-protein interaction networks into GWAS can dissect leading molecular pathways that cannot be picked up using traditional analyses. We hope to bridge the knowledge gap from sequence to consequence.

2 citations


Cites background from "Pathway Association Studies Tool"

  • ...…to disease mechanisms, because identifying the accumulation of small genetic effects acting in a common pathway is often easier than mapping the individual genes within the pathway that contribute to disease susceptibility remarkably (Kao et al., 2017; Lamparter et al., 2016; Thrash et al., 2019)....

Journal ArticleDOI
TL;DR: In this article, the authors reanalyze three GWAS meta-analysis datasets for type 2 diabetes mellitus (T2D), using the methodology that they have developed to identify disease-associated pathways by combining nominally significant evidence of genetic association with the known biochemical pathways, protein-protein interaction (PPI) networks, and the functional information of selected SNPs.
Abstract: Type 2 diabetes mellitus (T2D) constitutes 90% of diabetes cases, and it is a complex multifactorial disease. In the last decade, genome-wide association studies (GWASs) for T2D successfully pinpointed the genetic variants (typically single nucleotide polymorphisms, SNPs) that associate with disease risk. In order to diminish the burden of multiple testing in GWAS, researchers attempted to evaluate the collective effects of interesting variants. In this regard, pathway-based analyses of GWAS became popular for discovering novel multigenic functional associations. Still, to reveal the unaccounted 85 to 90% of T2D variation, which lies hidden in GWAS datasets, new post-GWAS strategies need to be developed. In this respect, here we reanalyze three GWAS meta-analysis datasets for T2D, using the methodology that we have developed to identify disease-associated pathways by combining nominally significant evidence of genetic association with the known biochemical pathways, protein-protein interaction (PPI) networks, and the functional information of selected SNPs. In this research effort, to shed light on the molecular mechanisms underlying T2D development and progression, we integrated different in silico approaches that proceed in a top-down and a bottom-up manner, and presented a comprehensive analysis at the protein subnetwork, pathway, and pathway subnetwork levels. Using the mutual information based on the shared genes, the identified protein subnetworks and the affected pathways of each dataset were compared. While most of the identified pathways recapitulate the pathophysiology of T2D, our results show that incorporating SNP functional properties and PPI networks into GWAS can dissect leading molecular pathways, and that it could offer an improvement over traditional enrichment strategies.
References
Journal ArticleDOI
TL;DR: The Gene Set Enrichment Analysis (GSEA) method as discussed by the authors focuses on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation.
Abstract: Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.

34,830 citations

Journal ArticleDOI
TL;DR: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) implements general linear model and mixed linear model approaches for controlling population and family structure and allows for linkage disequilibrium statistics to be calculated and visualized graphically.
Abstract: Summary: Association analyses that exploit the natural diversity of a genome to map at very high resolutions are becoming increasingly important. In most studies, however, researchers must contend with the confounding effects of both population and family structure. TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) implements general linear model and mixed linear model approaches for controlling population and family structure. For result interpretation, the program allows for linkage disequilibrium statistics to be calculated and visualized graphically. Database browsing and data importation is facilitated by integrated middleware. Other features include analyzing insertions/deletions, calculating diversity statistics, integration of phenotypic and genotypic data, imputing missing data and calculating principal components. Availability: The TASSEL executable, user manual, example data sets and tutorial document are freely available at http://www.maizegenetics.net/tassel. The source code for TASSEL can be found at http://sourceforge.net/projects/tassel.

5,460 citations


"Pathway Association Studies Tool" refers methods in this paper

  • ...It uses as input TASSEL [16] files that are generated as output from the General Linear or Mixed Linear Models (GLM and MLM), or files from any association analysis that has been similarly formatted, as well as genome annotations in GFF format, and a metabolic pathways file....

Journal ArticleDOI
TL;DR: It is demonstrated that pathway-based approaches, which jointly consider multiple contributing factors in the same pathway, might complement the most-significant SNPs/genes approach and provide additional insights into interpretation of GWA data on complex diseases.
Abstract: Published genomewide association (GWA) studies typically analyze and report single-nucleotide polymorphisms (SNPs) and their neighboring genes with the strongest evidence of association (the "most-significant SNPs/genes" approach), while paying little attention to the rest. Borrowing ideas from microarray data analysis, we demonstrate that pathway-based approaches, which jointly consider multiple contributing factors in the same pathway, might complement the most-significant SNPs/genes approach and provide additional insights into interpretation of GWA data on complex diseases.

889 citations


"Pathway Association Studies Tool" refers background or methods in this paper

  • ...Finding Significant Metabolic Pathways Significant pathways are found by using a previously described method [4, 6, 7]....

  • ...Metabolic pathway analysis focuses on the combined effects of many genes that are grouped according to their shared biological function [4, 5, 6]....

Journal ArticleDOI
TL;DR: The genetic architecture of maize oil biosynthesis is extensively examined in a genome-wide association study using 1.03 million SNPs characterized in 368 maize inbred lines, including 'high-oil' lines, to provide insights into the genetic basis of oil biosynthesis in maize kernels.
Abstract: Maize kernel oil is a valuable source of nutrition. Here we extensively examine the genetic architecture of maize oil biosynthesis in a genome-wide association study using 1.03 million SNPs characterized in 368 maize inbred lines, including ‘high-oil’ lines. We identified 74 loci significantly associated with kernel oil concentration and fatty acid composition (P < 1.8 × 10⁻⁶), which we subsequently examined using expression quantitative trait loci (QTL) mapping, linkage mapping and coexpression analysis. More than half of the identified loci localized in mapped QTL intervals, and one-third of the candidate genes were annotated as enzymes in the oil metabolic pathway. The 26 loci associated with oil concentration could explain up to 83% of the phenotypic variation using a simple additive model. Our results provide insights into the genetic basis of oil biosynthesis in maize kernels and may facilitate marker-based breeding for oil quantity and quality.

643 citations


Additional excerpts

  • ...[19, 15]....

Journal ArticleDOI
27 May 2004-Nature
TL;DR: Association studies of this type have good prospects for dissecting the genetics of common disease, but they currently face a number of challenges, including problems with multiple testing and study design, definition of intermediate phenotypes and interaction between polymorphisms.
Abstract: Identification of the genetic polymorphisms that contribute to susceptibility for common diseases such as type 2 diabetes and schizophrenia will aid in the development of diagnostics and therapeutics. Previous studies have focused on the technique of genetic linkage, but new technologies and experimental resources make whole-genome association studies more feasible. Association studies of this type have good prospects for dissecting the genetics of common disease, but they currently face a number of challenges, including problems with multiple testing and study design, definition of intermediate phenotypes and interaction between polymorphisms.

643 citations


"Pathway Association Studies Tool" refers background in this paper

  • ...Originally developed to study differences in gene expression data in human disease studies [7], pathway analysis and association mapping have been used in medical research to find biological insights missed when focusing on only one or a few genes that have highly significant associations with a trait of interest [8, 9, 5, 10]....

Frequently Asked Questions (2)
Q1. What contributions have the authors mentioned in the paper "Pathway association studies tool"?

However, the many scripts implementing this method were not straightforward to use, had to be customized for each project, required user supervision, and took more than 24 hours to process data. 

Some of the analysis procedures, however, should be considered in future versions of PAST.