An Improved Search Algorithm to Find G-Quadruplexes in Genome Sequences

TL;DR: An improved (broadened) GQ-search algorithm is developed that accounts for the recently reported new types of GQs and confirms the GQ-forming potential of naturally occurring and model single-stranded DNA fragments defying the G3+NL1G3+NL2G3+NL3G3+ formula.
Abstract: A growing body of data suggests that the secondary structures adopted by G-rich polynucleotides may be more diverse than previously thought and that the definition of G-quadruplex-forming sequences should be broadened. We studied solution structures of a series of naturally occurring and model single-stranded DNA fragments defying the G3+NL1G3+NL2G3+NL3G3+ formula, which is used in most of the current GQ-search algorithms. The results confirm the GQ-forming potential of such sequences and suggest the existence of new types of GQs. We developed an improved (broadened) GQ-search algorithm (http://niifhm.ru/nauchnye-issledovanija/otdel-molekuljarnoj-biologii-i-genetiki/laboratorija-iskusstvennogo-antitelogeneza/497-2/) that accounts for the recently reported new types of GQs.

Summary (2 min read)

INTRODUCTION

  • Non-canonical polynucleotide structures play an important role in biogenesis processes, such as transcription, DNA repair, replication, translocation and RNA splicing (Saini et al. 2013).
  • A clear view of DNA/RNA secondary structures and dynamics is necessary to understand the mechanisms of genomic regulation and to identify new biomarkers of pathology and drug targets.
  • GQs with mismatches (mGQs) contain one or more substitutions of G for other nucleotides in the tetrads.
  • (The mismatching nucleosides may participate in stacking).
  • The authors present here the first GQ-search tool, imGQfinder, that accounts for noncanonical (‘imperfect’) quadruplex structures (imGQs; i.e., bGQs and mGQs) in addition to canonical GQs.

RESULTS

  • ImGQ-motif definition and ImGQfinder interface (algorithm implementation).
  • The imGQ motif definition for imGQs with single defects is presented in Table 1.
  • Some imGQs may also turn out to be ‘perfect’.
  • The graphical user interface was developed using the Tk library.
  • The inputs include the queried nucleotide sequence in fasta format, the number of tetrads and defects and the maximum loop length.

Structural studies (algorithm verification)

  • Such structures are still relatively new, and there are few examples of well-characterized imGQs.
  • The sequences Bcl, Ct1 and PSTP were taken from the human genome.
  • For the UV-melting profiles, molecularity analysis, CD spectra of the model ONs G3, G4 and their mutants and all the corresponding experimental procedures, see the supporting information.
  • The authors only considered 4-tetrad GQs and imGQs, which are generally more stable than 2- and 3-tetrad GQs according to the literature and their own physicochemical data.
  • As expected, imGQs are substantially more abundant than GQs (Table 3).

DISCUSSION

  • A new GQ-search algorithm, which is based on a broadened definition of quadruplex-forming sequences, and the user-friendly online tool ImGQfinder were developed.
  • The algorithm was verified by structural studies of a series of ONs whose imGQ-forming potential was predicted by ImGQfinder.
  • G3 demonstrated extreme stability in potassium, and the stability was even superior to that of G4.
  • Importantly, the physicochemical properties of ImGQs and GQ, such as the thermal stability under physiological conditions, appear to be rather similar.
  • Large clusters of putative GQ/imGQ sites were found in the introns near the intron/exon boundaries and in the promoters that are approximately 100 bp downstream of the transcription start site.
  • The results of several recent studies suggest 5’-UTR GQ participation in transcription and translational regulation (Huppert et al. 2008).

METHODS

  • The ON synthesis and purification, the MS analysis and the UV-melting, CD and rotational relaxation time measurements were performed as previously described (Varizhuk et al. 2013).
  • For the analysis of GQ/imGQ abundance and distribution in the human genome, RefSeq genomic sequences (http://www.ncbi.nlm.nih.gov/refseq/rsg/about/) were used.

NMR studies

  • The 1H chemical shifts were referenced relative to an external standard, sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS).
  • The spectra were recorded using presaturation or pulsed-field gradient WATERGATE W5 pulse sequences (zgprsp and zggpw5 from the Bruker library, respectively) for H2O suppression.

Molecular modeling

  • The starting positions of the GQ core atoms were obtained from the PDB (139D and 2KQH).
  • The core of every GQ was created using SwissPDB Viewer.
  • The electrostatic contribution to the hydration energy Gpolar was computed using the Generalized Born (GB) method (Onufriev et al. 2000) using the algorithm developed by Onufriev et al.
  • SASA was computed using the LCPO method (Srinivasan et al. 1998) with α = 0.00542 kcal mol⁻¹ Å⁻². The entropic term was not calculated explicitly, but it was accounted for implicitly via the GQ conformational mobility.
  • Snapshots taken from a single trajectory of the MD simulation of the complex were used for the calculations of the binding free energy.
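
The energy terms named in the bullets above are often combined as in the sketch below. This is an illustration under assumptions, not the authors' stated protocol: the source names only $G_{polar}$, SASA and $\alpha$, so the molecular-mechanics term $E_{MM}$, the symbol $G_{nonpolar}$, the snapshot average $\langle\cdot\rangle$ and the absence of a constant offset in the nonpolar term are all hypothetical.

$$
G \approx \langle E_{MM} \rangle + \langle G_{polar} \rangle + \langle G_{nonpolar} \rangle,
\qquad
G_{nonpolar} = \alpha \cdot \mathrm{SASA},
\qquad
\alpha = 0.00542\ \mathrm{kcal\ mol^{-1}\ \AA^{-2}}
$$

Here $\langle\cdot\rangle$ denotes an average over the MD snapshots, $G_{polar}$ is the Generalized Born term and SASA is the LCPO surface area. No explicit entropic term appears, in line with the statement that entropy was accounted for only implicitly.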

FIGURE LEGENDS

  • For each of the two imGQ types, a single example is shown.
  • Both mGQ and bGQ structures are diverse and can theoretically contain more than one mismatch or bulge.
  • Figure 2, left: CD spectra of the ONs Bcl, Ct1 and their mutants.
  • The ellipticity is given per mole of nucleotide.
  • Figure 2, right: 1H NMR spectra fragments of several GQ- and imGQ-ONs.

An Improved Search Algorithm to Find G-Quadruplexes in Genome Sequences
Anna Varizhuk (1,2), Dmitry Ischenko (1), Igor Smirnov (1), Olga Tatarinova (1), Vyacheslav Severov (1), Roman Novikov (2,3), Vladimir Tsvetkov (1,4), Vladimir Naumov (1), Dmitry Kaluzhny (2), Galina Pozmogova (1*)
(1) Institute for Physical-Chemical Medicine, Malaya Pirogovskaya Str., 1a, Moscow 119435, Russia;
(2) Engelhardt Institute of Molecular Biology, Vavilov Str., 32, Moscow 119991, Russia;
(3) N. D. Zelinsky Institute of Organic Chemistry, Leninsky prosp. 47, Moscow 119991, Russia;
(4) Topchiev Institute of Petrochemical Synthesis, Leninsky Prospect, 29, Moscow 119991, Russia.
*Corresponding author
Keywords: noncanonical DNA structures, G-quadruplexes, search algorithm
bioRxiv preprint, this version posted January 23, 2014. https://doi.org/10.1101/001990

ABSTRACT
A growing body of data suggests that the secondary structures adopted by G-rich polynucleotides
may be more diverse than previously thought and that the definition of G-quadruplex-forming
sequences should be broadened. We studied solution structures of a series of naturally occurring
and model single-stranded DNA fragments defying the $G_{3+}N_{L1}G_{3+}N_{L2}G_{3+}N_{L3}G_{3+}$ formula, which
is used in most of the current GQ-search algorithms. The results confirm the GQ-forming
potential of such sequences and suggest the existence of new types of GQs. We developed an
improved (broadened) GQ-search algorithm (http://niifhm.ru/nauchnye-issledovanija/otdel-molekuljarnoj-biologii-i-genetiki/laboratorija-iskusstvennogo-antitelogeneza/497-2/) that
accounts for the recently reported new types of GQs.
INTRODUCTION
Non-canonical polynucleotide structures play an important role in biogenesis processes, such as
transcription, DNA repair, replication, translocation and RNA splicing (Saini et al. 2013). A
clear view of DNA/RNA secondary structures and dynamics is necessary to understand the
mechanisms of genomic regulation and to identify new biomarkers of pathology and drug
targets. A growing body of data suggests that the secondary structures adopted by G-rich
polynucleotides may be more diverse than previously thought (Kaluzhny et al. 2009; Tomasko et
al. 2009; Guedin et al. 2010; Amrane et al. 2012; Beaudoin et al. 2013; Mukundan and Phan
2013). For instance, two new types of G-quadruplexes (GQs) have recently been reported: GQs
with mismatches (Tomasko et al. 2009) and GQs with bulges (Mukundan and Phan 2013)
(Figure 1). GQs with bulges (bGQs) are GQs in which two stacked tetrad-forming guanosines in
one column are separated by a projecting nucleoside. GQs with mismatches (mGQs) contain one
or more substitutions of G for other nucleotides in the tetrads. (The mismatching nucleosides
may participate in stacking). Both types of structures appeared to be stable under physiological
conditions. These findings have led the researchers to the conclusion that the definition of GQ-
forming sequences should be broadened. All currently available online search tools for GQs
(Quad finder (Scaria et al. 2006), QGRS Mapper (Kikin et al. 2006) and QGRS predictor
(Menendez et al. 2012)) employ the $G_{3+}N_{L1}G_{3+}N_{L2}G_{3+}N_{L3}G_{3+}$ formula, which only defines
canonical (‘perfect’) GQs.
We present here the first GQ-search tool, ImGQfinder, that accounts for noncanonical
(‘imperfect’) quadruplex structures (imGQs; i.e., bGQs and mGQs) in addition to canonical
GQs. The ImGQfinder tool is freely accessible at the URL
http://niifhm.ru/nauchnye-issledovanija/otdel-molekuljarnoj-biologii-i-genetiki/laboratorija-iskusstvennogo-antitelogeneza/497-2/.
Structural studies of a series of ($G_{3+}N_{L1}G_{3+}N_{L2}G_{3+}N_{L3}G_{3+}$)-defying
oligonucleotides (ONs), whose imGQ-forming potential was predicted by ImGQfinder, were
performed to verify our improved GQ-search algorithm. We also utilized ImGQfinder for
statistical analysis of the imGQ-motif distribution in the human genome.
RESULTS
ImGQ-motif definition and ImGQfinder interface (algorithm implementation)
In our broadened algorithm, we search for G-runs, determine the distance between them and
select fragments that comply with the predetermined conditions for the maximum length of GQ
loops and the minimum number of nucleotides in a G-run (i.e., the number of tetrads). The
imGQ motif definition for imGQs with single defects is presented in Table 1. ImGQs with
multiple defects can also be analyzed, but the corresponding set of formulas is not shown. Apparently,
most imGQ motifs can be interpreted as both putative bGQs and mGQs. Some imGQs may also
turn out to be ‘perfect’ GQs with fewer tetrads, e.g., a putative 3-tetrad imGQ with a bulge or a
mismatch in the external tetrad can theoretically adopt the canonical 2-tetrad GQ conformation.
ImGQfinder searches for all GQ and imGQ motifs, including overlapping ones. The program is
implemented in Perl. The graphical user interface was developed using the Tk library. The inputs
include the queried nucleotide sequence in FASTA format, the number of tetrads and defects and
the maximum loop length. The hits are displayed in a table. The coordinates of each G-run start
(G1, G2, G3 and G4) and the positions of the defects (DEFECT) in each imGQ-forming
fragment are shown. The user can also see the full sequences of the putative GQs or imGQs (the
‘Add sequence to output’ option).
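
The search described above (find G-runs, check the loop lengths between them, allow a limited number of defects) can be sketched as follows. This is a minimal illustration, not the ImGQfinder code, which is written in Perl and follows the Table 1 formulas that are not reproduced here. The defect model is an assumption: a mismatch is taken to be one non-G nucleotide inside a run of `tetrads` positions, and a bulge is taken to be one extra non-G nucleotide inside a run of `tetrads` guanines. The names `candidate_runs` and `find_imgq_motifs` and the default loop limit of 7 are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (start index, run length, number of defects)


def candidate_runs(seq: str, i: int, tetrads: int) -> List[Run]:
    """Return the G-runs that may start at index i under the assumed defect model."""
    runs: List[Run] = []
    window = seq[i:i + tetrads]
    if len(window) == tetrads and window[0] == "G" and window[-1] == "G":
        non_g = tetrads - window.count("G")
        if non_g == 0:
            runs.append((i, tetrads, 0))      # perfect run
        elif non_g == 1:
            runs.append((i, tetrads, 1))      # one mismatch inside the run
    bulged = seq[i:i + tetrads + 1]
    if (len(bulged) == tetrads + 1 and bulged[0] == "G" and bulged[-1] == "G"
            and bulged.count("G") == tetrads):
        runs.append((i, tetrads + 1, 1))      # one bulged nucleotide
    return runs


def find_imgq_motifs(seq: str, tetrads: int = 3, max_loop: int = 7,
                     max_defects: int = 1) -> List[Tuple[Run, ...]]:
    """Enumerate all chains of four runs (overlapping hits included)."""
    seq = seq.upper()
    hits: List[Tuple[Run, ...]] = []

    def extend(chain: List[Run], defects: int) -> None:
        if len(chain) == 4:
            hits.append(tuple(chain))
            return
        start, length, _ = chain[-1]
        # Try every allowed loop length between this run and the next one.
        for loop in range(1, max_loop + 1):
            for run in candidate_runs(seq, start + length + loop, tetrads):
                if defects + run[2] <= max_defects:
                    extend(chain + [run], defects + run[2])

    for i in range(len(seq)):
        for run in candidate_runs(seq, i, tetrads):
            if run[2] <= max_defects:
                extend([run], run[2])
    return hits
```

Each hit lists the four runs with their start coordinates and defect counts, loosely mirroring the G1 to G4 and DEFECT columns described above. Setting `max_defects=0` reduces the search to perfect runs only.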
In addition, we offer an application that can determine overlapping GQ/imGQ sites in the form
of a single lengthy fragment with GQ/imGQ-forming potential (the ‘Add intersected output’
option). This feature is useful for estimating the maximum number of quadruplexes that can exist
simultaneously.
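
The idea behind the ‘Add intersected output’ option, collapsing overlapping hits into single fragments, can be illustrated with a standard interval merge. This is a sketch under assumptions: hits are represented as half-open `(start, end)` coordinate pairs, and the function name `merge_overlapping` is hypothetical. The actual ImGQfinder implementation is not shown in the text.

```python
from typing import List, Tuple


def merge_overlapping(spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge half-open (start, end) spans that share at least one position."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous fragment: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The number of merged fragments is the number of separate sites with GQ/imGQ-forming potential once each overlapping group is counted a single time.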
Structural studies (algorithm verification)
Although several recent publications contain direct evidence for the existence of imGQs that are
stable under physiological conditions, such structures are still relatively new, and there are few
examples of well-characterized imGQs. To complement the studies on imGQ structures and to
verify our search algorithm for imGQs, we synthesized a set of naturally occurring and model
single-stranded DNA fragments that were defined by ImGQfinder to be putative imGQs and
GQs, and we analyzed their conformations in solution using physicochemical methods. The ONs
are listed in Table 2.
The sequences Bcl, Ct1 and PSTP were taken from the human genome. Bcl is located in the
BCL2 promoter region 42 nucleotides upstream of the translation start site (NCBI Reference
Sequence: NC_000018.9, chr18: -60985942 to -60985966). Ct1 is located in the intron of the
CTIF gene (NCBI Reference Sequence: NC_000018.9, chr18: +46379322 to +46379344). PSTP
is located at the PSTPIP2 intron/exon boundary (NCBI Reference Sequence: NC_000018.9, chr18:
-43572049 to -43572072). BclG, BclA, BclT, Bcl-tr (truncated), Ct2, Ct3, Ct4, CtA, CtC and
CtG are mutants of Bcl and Ct1. G3, G3A, G4, G4A and G4AA are model sequences. The
solution structures of the ONs were investigated using UV-melting experiments, CD
spectroscopy and NMR spectroscopy. The rotational relaxation times of EtBr in complex with
the ONs are proportional to the hydrodynamic volumes of the molecules, and these times were
estimated to distinguish between monomolecular and intermolecular quadruplexes. The melting
temperatures of monomolecular GQs/imGQs and the GQ characteristics determined from the CD
data (parallel, antiparallel or mixed GQ folding) are given in Table 1. Fragments of the ¹H-NMR
spectra and CD spectra of the ONs Bcl, Ct1 and their mutants are shown in Figure 2. For the
UV-melting profiles, molecularity analysis, CD spectra of the model ONs G3, G4 and their
mutants and all the corresponding experimental procedures, see the supporting information.
The ONs G4, G3, CtG and BclG can fold into perfect 4-tetrad (G4, CtG and BclG) and 3-tetrad
(G3) GQs according to the conventional GQ definition. Indeed, all of them formed highly stable
parallel (G3, G4 and CtG; see footnote 1 for G3) or mixed (BclG) GQs in the presence of
potassium salt, as evidenced by the CD spectra and the UV-melting profiles. The imino region of
the BclG ¹H-NMR spectrum (Figure 2) contains 16 signals, which is consistent with 4 G-tetrads.
The ONs BclT, G4A, G4AA, Ct1-Ct4, CtA, CtC and PSTP defy the conventional
$G_{3+}N_{L1}G_{3+}N_{L2}G_{3+}N_{L3}G_{3+}$ formula and would be omitted by the currently existing GQ-search
algorithms. ImGQfinder defines all of these sequences to be putative imGQs. Indeed, all these
ONs form stable monomolecular quadruplexes in the presence of potassium salt. Fourteen
signals in the imino regions of the ¹H-NMR spectra of the ONs Ct1 and PSTP are consistent with
4-tetrad mGQ structures with one imperfect tetrad (see footnote 2). Twelve signals in the
imino-spectrum region of ribo-Ct1 most likely suggest a 3-tetrad bGQ structure. (Molecular
modeling studies were performed to clarify the Ct1 structure; for more information, see the
supporting data.) Importantly, the ONs BclT and G4AA cannot even form 2-tetrad GQs according to the
conventional GQ definition. However, these ONs appear to fold into rather stable GQ-like
structures under physiological conditions. These results confirm that ImGQfinder can be used to
predict the possibility of bGQ and mGQ formation.
ImGQs in the human genome: statistical analysis (algorithm application)
ImGQfinder was utilized to reassess the abundance and to analyze the distribution of putative
quadruplex sites in the human genome. We only considered 4-tetrad GQs and imGQs, which are
generally more stable than 2- and 3-tetrad GQs according to the literature and our own
physicochemical data. The sequences representing overlapping GQ/imGQ sites were counted
only once (this was performed using an application feature of ImGQfinder). Sites with both GQ-
and imGQ-folding potentials were regarded as putative GQs because the latter are generally
more stable. As expected, imGQs are substantially more abundant than GQs (Table 3). Thus, the
maximum overall number of quadruplex-like structures realized in vivo may be significantly
higher than previously thought. The distribution of putative imGQs and GQs within RefSeq
genes (genomic sequences used as reference standards for well-characterized genes;
http://www.ncbi.nlm.nih.gov/refseq/rsg/about/) is shown in Figure 3 (for a more detailed
analysis, see the supporting information).
To additionally validate the ImGQfinder algorithm, we also calculated the number of all putative
non-overlapping ‘perfect’ 3-tetrad quadruplexes in the human genome and compared it with the
literature data. The obtained value (359 k) is close to the previous estimate (376 k; a
difference of about 4.5%) (Huppert and Balasubramanian 2005).
DISCUSSION
A new GQ-search algorithm, which is based on a broadened definition of quadruplex-forming
sequences, and the user-friendly online tool ImGQfinder were developed. The algorithm was
verified by structural studies of a series of ONs whose imGQ-forming potential was predicted by
ImGQfinder. Importantly, the physicochemical properties of imGQs and GQs, such as the thermal
stability under physiological conditions, appear to be rather similar.

Footnote 1. G3 demonstrated extreme stability in potassium, and the stability was even superior to that of G4. We attribute this
stability to the fact that single-nucleotide fragments separating G runs fit perfectly well in the diagonal loops of
3-tetrad GQs but may be slightly too short for 4-tetrad diagonal loops.
Footnote 2. One imperfect tetrad contains three Hoogsteen-bound Gs with two imino G protons that participate in H-bonding,
which results in two additional signals in the corresponding region of the ¹H-NMR spectrum.

Reassessment of the abundance of putative quadruplex sites in the human genome with
imGQfinder revealed that the maximum number of G4 structures that could be simultaneously
realized has been underestimated. As is evident from Figure 3, putative GQ and imGQ sites have
basically similar distributions within RefSeq genes. Exons tend to be depleted of both GQs and
imGQs. Large clusters of putative GQ/imGQ sites were found in the introns near the intron/exon
boundaries and in the promoters that are approximately 100 bp downstream of the transcription
start site. GQ clustering in 5’ untranslated regions is consistent with literature data (Huppert and
Balasubramanian 2007; Maizels and Gray 2013). The results of several recent studies suggest
5’-UTR GQ participation in transcription and translational regulation (Huppert et al. 2008). GQs at
intron/exon boundaries may play a role in splicing. Although known enhancer/silencer splicing
element motifs (Wang et al. 2005) do not have GQ/imGQ-folding potential, recent publications
suggest that GQ-like structures may influence splicing (Han et al. 2005; Fisette et al. 2012) and
that the genes that undergo alternative splicing are enriched with GQs (Kostadinov et al. 2006).
In conclusion, the broadened GQ-search algorithm opens up new opportunities in the prediction
of DNA/RNA structure and allows thorough analysis of all possible conformations adopted by
polynucleotides.
METHODS
The ON synthesis and purification, the MS analysis and the UV-melting, CD and rotational
relaxation time measurements were performed as previously described (Varizhuk et al. 2013).
For the analysis of GQ/imGQ abundance and distribution in the human genome, RefSeq genomic
sequences (http://www.ncbi.nlm.nih.gov/refseq/rsg/about/) were used.
NMR studies
NMR samples were prepared at a concentration of ~0.1 mM in 0.6 ml H2O+D2O (10%) buffer
solution containing 20 mM Tris-HCl (pH 7.5) and 100 mM KCl and annealed (heated to 90 °C for
3 minutes, then cooled quickly on ice) prior to spectral measurements to ensure unimolecular
quadruplex folding. ¹H-NMR spectra were recorded with Bruker AVANCE II 300 (300.1 MHz),
Bruker AMX III (400.1 MHz) and Bruker AVANCE II 600 (600.1 MHz) spectrometers. The ¹H
chemical shifts were referenced relative to an external standard, sodium
2,2-dimethyl-2-silapentane-5-sulfonate (DSS). The spectra were recorded using presaturation or
pulsed-field gradient WATERGATE W5 pulse sequences (zgprsp and zggpw5 from the Bruker library,
respectively) for H2O suppression.
Molecular modeling
Ct1 GQ models 1 and 2 were created as follows. The starting positions of the GQ core atoms
were obtained from the PDB (139D and 2KQH). The core of every GQ was created using
Swiss-PDB Viewer. Then, loops were added step by step as described further by utilizing the SYBYL
8.0 molecular modeling package. To remove unfavorable van der Waals interactions, the models
were reoptimized after attaching each loop using SYBYL 8.0 and the Powell method with the
following parameters: Gasteiger-Hückel charges, TRIPOS force field, non-bonded cut-off

References
Journal ArticleDOI
TL;DR: There is a significant repression of quadruplexes in the coding strand of exonic regions, which suggests that quadruplex-forming patterns are disfavoured in sequences that will form RNA.
Abstract: Guanine-rich DNA sequences of a particular form have the ability to fold into four-stranded structures called G-quadruplexes. In this paper, we present a working rule to predict which primary sequences can form this structure, and describe a search algorithm to identify such sequences in genomic DNA. We count the number of quadruplexes found in the human genome and compare that with the figure predicted by modelling DNA as a Bernoulli stream or as a Markov chain, using windows of various sizes. We demonstrate that the distribution of loop lengths is significantly different from what would be expected in a random case, providing an indication of the number of potentially relevant quadruplex-forming sequences. In particular, we show that there is a significant repression of quadruplexes in the coding strand of exonic regions, which suggests that quadruplex-forming patterns are disfavoured in sequences that will form RNA.

1,493 citations


"An Improved Search Algorithm to Fin..." refers result in this paper

  • ...The obtained value (359 k) is close to the previous estimations (376 k) (Huppert and Balasubramanian 2005)....

    [...]

Journal ArticleDOI
TL;DR: It is shown that the promoter regions (1 kb upstream of the transcription start site TSS) of genes are significantly enriched in quadruplex motifs relative to the rest of the genome, with >40% of human gene promoters containing one or more quadruplex motifs.
Abstract: Certain G-rich DNA sequences readily form four-stranded structures called G-quadruplexes. These sequence motifs are located in telomeres as a repeated unit, and elsewhere in the genome, where their function is currently unknown. It has been proposed that G-quadruplexes may be directly involved in gene regulation at the level of transcription. In support of this hypothesis, we show that the promoter regions (1 kb upstream of the transcription start site TSS) of genes are significantly enriched in quadruplex motifs relative to the rest of the genome, with >40% of human gene promoters containing one or more quadruplex motif. Furthermore, these promoter quadruplexes strongly associate with nuclease hypersensitive sites identified throughout the genome via biochemical measurement. Regions of the human genome that are both nuclease hypersensitive and within promoters show a remarkable (230-fold) enrichment of quadruplex elements, compared to the rest of the genome. These quadruplex motifs identified in promoter regions also show an interesting structural bias towards more stable forms. These observations support the proposal that promoter G-quadruplexes are directly involved in the regulation of gene expression.

1,145 citations


"An Improved Search Algorithm to Fin..." refers background in this paper

  • ...The results of several recent studies suggest 5’-UTR GQ participation in transcription and translational regulation (Huppert et al. 2008)....

    [...]

Journal ArticleDOI
TL;DR: The analytic generalized Born approximation is modified to permit a more accurate description of large macromolecules, while its established performance on small compounds is nearly unaffected, and is adapted to describe molecules with an interior dielectric constant not equal to unity.
Abstract: The analytic generalized Born approximation is an efficient electrostatic model that describes molecules in solution. Here it is modified to permit a more accurate description of large macromolecul...

982 citations


"An Improved Search Algorithm to Fin..." refers methods in this paper

  • ...The electrostatic contribution to the hydration energy Gpolar was computed using the Generalized Born (GB) method (Onufriev et al. 2000) using the algorithm developed by Onufriev et al. (Weiser et al. 1999; Onufriev et al. 2002) for calculating the effective Born radii....

    [...]

Journal ArticleDOI
TL;DR: In this article, a fast analytical formula was derived for the calculation of approximate atomic and molecular van der Waals (vdWSA), and solvent-accessible surface areas (SASAs), as well as the first and second derivatives of these quantities with respect to atomic coordinates.
Abstract: A fast analytical formula was derived for the calculation of approximate atomic and molecular van der Waals (vdWSA), and solvent-accessible surface areas (SASAs), as well as the first and second derivatives of these quantities with respect to atomic coordinates. This method makes use of linear combinations of terms composed from pairwise overlaps of hard spheres; therefore, we term this the LCPO method for linear combination of pairwise overlaps. For higher performance, neighbor-list reduction (NLR) was applied as a preprocessing step. Eighteen compounds of different sizes (8–2366 atoms) and classes (organic, proteins, DNA, and various complexes) were chosen as representative test cases. LCPO/NLR computed the SASA and first derivatives of penicillopepsin, a protein with 2366 atoms, in 0.87 s (0.22 s for the creation of the neighbor list, 0.35 s for NLR, and 0.30 s for SASA and first derivatives) on an SGI R10000/194 MHz processor. This appears comparable to or better than timings reported previously for other algorithms. The vdWSAs were in good agreement with the numerical results: relative errors for total molecular surface areas ranged from 0.1 to 2.0% and average absolute atomic surface area deviations from 0.3 to 0.7 Å². For SASAs without NLR, the LCPO method exhibited relative errors in the range of 0.4–9.2% for total molecular surface areas and average absolute atomic surface area deviations of 2.0–2.7 Å²; with NLR the relative molecular errors ranged from 0.1 to 7.8% and the average absolute atomic surface area deviation from 1.6 to 3.0 Å². ©1999 John Wiley & Sons, Inc. J Comput Chem 20: 217–230, 1999

935 citations

Journal ArticleDOI
TL;DR: A web-based server that predicts quadruplex forming G-rich sequences (QGRS) in nucleotide sequences and features interactive graphic representation of the data is developed, very useful for investigating the functional relevance of G-quadruplex structure, in particular its role in regulating the gene expression by alternative processing.
Abstract: The quadruplex structures formed by guanine-rich nucleic acid sequences have received significant attention recently because of growing evidence for their role in important biological processes and as therapeutic targets. G-quadruplex DNA has been suggested to regulate DNA replication and may control cellular proliferation. Sequences capable of forming G-quadruplexes in the RNA have been shown to play significant roles in regulation of polyadenylation and splicing events in mammalian transcripts. Whether quadruplex structure directly plays a role in regulating RNA processing requires investigation. Computational approaches to study G-quadruplexes allow detailed analysis of mammalian genomes. There are no known easily accessible user-friendly tools that can compute G-quadruplexes in the nucleotide sequences. We have developed a web-based server, QGRS Mapper, that predicts quadruplex forming G-rich sequences (QGRS) in nucleotide sequences. It is a user-friendly application that provides many options for defining and studying G-quadruplexes. It performs analysis of the user provided genomic sequences, e.g. promoter and telomeric regions, as well as RNA sequences. It is also useful for predicting G-quadruplex structures in oligonucleotides. The program provides options to search and retrieve desired gene/nucleotide sequence entries from NCBI databases for mapping G-quadruplexes in the context of RNA processing sites. This feature is very useful for investigating the functional relevance of G-quadruplex structure, in particular its role in regulating the gene expression by alternative processing. In addition to providing data on composition and locations of QGRS relative to the processing sites in the pre-mRNA sequence, QGRS Mapper features interactive graphic representation of the data. The user can also use the graphics module to visualize QGRS distribution patterns among all the alternative RNA products of a gene simultaneously on a single screen. 
QGRS Mapper can be accessed at http://bioinformatics.ramapo.edu/QGRS/.

730 citations


"An Improved Search Algorithm to Fin..." refers methods in this paper

  • ...All currently available online search tools for GQs (Quad finder (Scaria et al. 2006), QGRS Mapper (Kikin et al. 2006) and QGRS predictor (Menendez et al. 2012)) employ the G3+NL1G3+NL2G3+NL3G3+ formula, which only defines canonical (‘perfect’) GQs. Figure 1....

    [...]
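The G3+NL1G3+NL2G3+NL3G3+ formula quoted in the excerpt above maps naturally onto a regular expression. The following is a minimal sketch, not the code of any of the tools named: the page does not state loop-length limits, so the 1 to 7 nucleotide bounds and the names `build_pattern` and `find_canonical_gqs` are assumptions made for illustration.

```python
import re

# Loop-length bounds are an assumption; the excerpt does not specify them.
MIN_LOOP, MAX_LOOP = 1, 7


def build_pattern(min_loop=MIN_LOOP, max_loop=MAX_LOOP):
    """Compile G3+ N(L1) G3+ N(L2) G3+ N(L3) G3+ as a regular expression."""
    loop = rf"[ACGTU]{{{min_loop},{max_loop}}}"
    return re.compile(rf"G{{3,}}(?:{loop}G{{3,}}){{3}}", re.IGNORECASE)


def find_canonical_gqs(seq):
    """Return (start, matched_sequence) for non-overlapping canonical hits."""
    return [(m.start(), m.group(0)) for m in build_pattern().finditer(seq)]


if __name__ == "__main__":
    # Four GGG tracts separated by TTA loops: matches the formula.
    print(find_canonical_gqs("AAGGGTTAGGGTTAGGGTTAGGGAA"))
    # One tract shortened to GG: the formula no longer matches.
    print(find_canonical_gqs("GGGTTAGGTTAGGGTTAGGG"))
```

The second call returns no hits because one G-tract has only two guanines. Sequences of that kind fall outside the canonical formula, which is the class of candidates the paper argues a broadened search should also consider.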

Frequently Asked Questions (1)
Q1. What are the contributions mentioned in the paper "An Improved Search Algorithm to Find G-Quadruplexes in Genome Sequences"?

The paper presents imGQfinder, a GQ-search tool that accounts for noncanonical ('imperfect') quadruplex structures (imGQs) in addition to canonical GQs.