Author

Sergey L. Sheetlin

Other affiliations: National Institute of Advanced Industrial Science and Technology

Bio: Sergey L. Sheetlin is an academic researcher at the National Institutes of Health. The author has contributed to research on topics including alignment-free sequence analysis and the Gumbel distribution. The author has an h-index of 8 and has co-authored 13 publications receiving 185 citations. Previous affiliations of Sergey L. Sheetlin include the National Institute of Advanced Industrial Science and Technology.

Papers

Journal Article•DOI•

Alignments anchored on genomic landmarks can aid in the identification of regulatory elements

Kannan Tharakaraman¹, Leonardo Mariño-Ramírez¹, Sergey L. Sheetlin¹, David Landsman¹, John L. Spouge¹•Institutions (1)

National Institutes of Health¹

01 Jan 2005-Bioinformatics

TL;DR: A program A-GLAM, an extension of the GLAM program, uses significant word positions as new 'anchors' to realign the sequences, which locates putative cis-acting regulatory elements by their positional preferences.

Abstract: Motivation: The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter's TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new 'anchors' to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. However, since the positional anchors focus A-GLAM on the motif of interest, masking DNA repeats during Gibbs sampling becomes unnecessary. Results: In a set of human DNA sequences with experimentally characterized TSSs, the placement of 791 octonucleotide words was unusually consistent (multiple test corrected P < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE. Availability: The A-GLAM program and a list of statistically significant words are available at ftp://ftp.ncbi.nih.gov/pub/spouge/papers/archive/AGLAM/. Contact: spouge@ncbi.nlm.nih.gov

35 citations
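
The first step in the abstract above (gapless alignment of promoters anchored on the TSS, then asking whether a word sits at a consistent offset) can be illustrated with a minimal sketch. It is not the A-GLAM implementation and does not include the paper's mask or clustering test. The function name `word_offsets`, the toy sequences, and the TSS indices are all hypothetical.

```python
from collections import Counter


def word_offsets(promoters, word):
    """Tally where `word` starts relative to each promoter's TSS.

    promoters: iterable of (sequence, tss_index) pairs (hypothetical format).
    Returns a Counter mapping offset (word start - tss_index) to a count.
    """
    counts = Counter()
    for seq, tss in promoters:
        start = seq.find(word)
        while start != -1:
            counts[start - tss] += 1
            # Resume one base later so overlapping occurrences are counted.
            start = seq.find(word, start + 1)
    return counts


# Toy promoters (made up for illustration), each paired with its TSS index.
promoters = [
    ("GGCTATAAAAGGCACGT", 12),
    ("ACTATAAAAGCCTTGCA", 11),
    ("TTTCTATAAAAGTTACG", 13),
]
counts = word_offsets(promoters, "TATAAAAG")
print(counts.most_common(3))
```

A word whose counts pile up at one offset is a candidate anchor. Deciding whether that pile-up is statistically significant needs a test such as the one the paper describes, which this sketch leaves out.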

Journal Article•DOI•

Frameshift alignment: statistics and post-genomic applications

Sergey L. Sheetlin¹, Yonil Park¹, Martin C. Frith¹, John L. Spouge¹•Institutions (1)

National Institute of Advanced Industrial Science and Technology¹

15 Dec 2014-Bioinformatics

TL;DR: A method to estimate the statistical significance of frameshift alignments, similar to classic BLAST statistics, is described, and the results suggest that metagenomic analysis needs to use frameshift alignment to derive accurate results.

Abstract: Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score. Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results. Availability and implementation: The statistical calculation is available in FALP (http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/index/software.html), and giga-scale frameshift alignment is available in LAST (http://last.cbrc.jp/falp). Contact: spouge@ncbi.nlm.nih.gov or martin@cbrc.jp Supplementary information: Supplementary data are available at Bioinformatics online.

32 citations

Journal Article•DOI•

The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment.

Sergey L. Sheetlin¹, Yonil Park¹, John L. Spouge¹•Institutions (1)

National Institutes of Health¹

01 Jan 2005-Nucleic Acids Research

TL;DR: This paper relates the Gumbel pre-factor k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%).

Abstract: The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter lambda and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter lambda can be as much as five times faster, if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243-260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters lambda and k within the errors required (lambda, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.

27 citations
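
To show how the two Gumbel parameters in the abstract above are used once they have been estimated, here is a minimal sketch of the standard conversion from an alignment score to an E-value and a P-value, E = k·m·n·exp(−λS) and P = 1 − exp(−E). It ignores finite-size (edge) corrections. The function names and the numeric values of `lam` and `k` are placeholders chosen for illustration, not values taken from the paper.

```python
import math


def gumbel_evalue(score, m, n, lam, k):
    """Expected number of chance local alignments scoring >= score.

    m, n: lengths of the two sequences; lam, k: Gumbel scale and pre-factor.
    """
    return k * m * n * math.exp(-lam * score)


def gumbel_pvalue(score, m, n, lam, k):
    """Probability that at least one chance alignment scores >= score."""
    # 1 - exp(-E), written with expm1 so tiny P-values keep their precision.
    return -math.expm1(-gumbel_evalue(score, m, n, lam, k))


# Placeholder parameters (assumed for illustration only).
lam, k = 0.267, 0.041
e = gumbel_evalue(score=60, m=140, n=140, lam=lam, k=k)
p = gumbel_pvalue(score=60, m=140, n=140, lam=lam, k=k)
print(f"E = {e:.3g}, P = {p:.3g}")
```

Because the score enters through exp(−λS) while k enters only as a multiplier, a given relative error in λ changes the E-value far more than the same relative error in k. This is consistent with the abstract's tighter tolerance on λ (0.8%) than on k (10%).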

Journal Article•DOI•

The whole alignment and nothing but the alignment: the problem of spurious alignment flanks

Martin C. Frith, Yonil Park, Sergey L. Sheetlin, John L. Spouge

01 Oct 2008-Nucleic Acids Research

TL;DR: This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length, and provides a simple ‘overalignment’ P-value that can identify spurious alignment flanks.

Abstract: Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, e.g. spurious flanks probably comprise >18% of the human–fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple ‘overalignment’ P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.

24 citations

Journal Article•DOI•

MsDetector: toward a standard computational tool for DNA microsatellites detection

Hani Z. Girgis¹, Sergey L. Sheetlin¹•Institutions (1)

National Institutes of Health¹

01 Jan 2013-Nucleic Acids Research

TL;DR: MsDetector is a standard computational tool for detecting microsatellites based on a hidden Markov model and a general linear model that has a very low false-positive rate and is expected to produce consistent results across studies analyzing the same sequence.

Abstract: Microsatellites (MSs) are DNA regions consisting of repeated short motif(s). MSs are linked to several diseases and have important biomedical applications. Thus, researchers have developed several computational tools to detect MSs. However, the currently available tools require adjusting many parameters, or depend on a list of motifs or on a library of known MSs. Therefore, two laboratories analyzing the same sequence with the same computational tool may obtain different results due to the user-adjustable parameters. Recent studies have indicated the need for a standard computational tool for detecting MSs. To this end, we applied machine-learning algorithms to develop a tool called MsDetector. The system is based on a hidden Markov model and a general linear model. The user is not obligated to optimize the parameters of MsDetector. Neither a list of motifs nor a library of known MSs is required. MsDetector is memory- and time-efficient. We applied MsDetector to several species. MsDetector located the majority of MSs found by other widely used tools. In addition, MsDetector identified novel MSs. Furthermore, the system has a very low false-positive rate resulting in a precision of up to 99%. MsDetector is expected to produce consistent results across studies analyzing the same sequence.

23 citations

Cited by

Introducing the Special Issue on “Bioinformatics”

장병탁, 김삼묘, 허철구

01 Aug 2000

TL;DR: Assessment of medical technology in the context of commercialization with Bioentrepreneur course, which addresses many issues unique to biomedical products.

Abstract: BIOE 402. Medical Technology Assessment. 2 or 3 hours. Bioentrepreneur course. Assessment of medical technology in the context of commercialization. Objectives, competition, market share, funding, pricing, manufacturing, growth, and intellectual property; many issues unique to biomedical products. Course Information: 2 undergraduate hours. 3 graduate hours. Prerequisite(s): Junior standing or above and consent of the instructor.

4,833 citations

On robust estimation of the location parameter

Frederick R. Forst

01 Jan 1980

3,652 citations

Journal Article•DOI•

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

Martin Steinegger¹, Johannes Söding¹•Institutions (1)

Max Planck Society¹

16 Oct 2017-Nature Biotechnology

TL;DR: Because MMseqs2 needs no random memory access in its innermost loop, its runtime scales almost inversely with the number of cores used, which enables sensitive protein sequence searching for the analysis of massive data sets.

Abstract: Sequencing costs have dropped much faster than Moore's law in the past decade, and sensitive sequence searching has become the main bottleneck in the analysis of large (meta)genomic datasets. While previous methods sacrificed sensitivity for speed gains, the parallelized, open-source software MMseqs2 overcomes this trade-off: In three-iteration profile searches it reaches 50% higher sensitivity than BLAST at 83-fold speed and the same sensitivity as PSI-BLAST at 270 times its speed. MMseqs2 therefore offers great potential to increase the fraction of annotatable (meta)genomic sequences.

1,371 citations

Journal Article•DOI•

Applied Probability and Queues

Upendra Dave

01 Nov 1987-Journal of the Operational Research Society

TL;DR: In this paper, applied probability and queuing in the field of applied probabilistic analysis is discussed. But the authors focus on the application of queueing in the context of road traffic.

Abstract: (1987). Applied Probability and Queues. Journal of the Operational Research Society: Vol. 38, No. 11, pp. 1095-1096.

1,121 citations

Journal Article•DOI•

COBALT: constraint-based alignment tool for multiple protein sequences

Jason S. Papadopoulos¹, Richa Agarwala¹•Institutions (1)

National Institutes of Health¹

01 May 2007-Bioinformatics

TL;DR: It is shown that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT's alignment quality and has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems.

Abstract: Motivation: A tool that simultaneously aligns multiple protein sequences, automatically utilizes information about protein domains, and has a good compromise between speed and accuracy will have practical advantages over current tools. Results: We describe COBALT, a constraint based alignment tool that implements a general framework for multiple alignment of protein sequences. COBALT finds a collection of pairwise constraints derived from database searches, sequence similarity and user input, combines these pairwise constraints, and then incorporates them into a progressive multiple alignment. We show that using constraints derived from the conserved domain database (CDD) and PROSITE protein-motif database improves COBALT’s alignment quality. We also show that COBALT has reasonable runtime performance and alignment accuracy comparable to or exceeding that of other tools for a broad range of problems. Availability: COBALT is included in the NCBI C++ toolkit. A Linux executable for COBALT, and CDD and PROSITE data used is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt Contact: richa@helix.nih.gov

909 citations
