HTSeq—a Python framework to work with high-throughput sequencing data

doi:10.1093/BIOINFORMATICS/BTU638

Journal Article

Simon Anders, Paul Theodor Pyl, Wolfgang Huber

15 Jan 2015-Bioinformatics (Oxford University Press)-Vol. 31, Iss: 2, pp 166-169

TL;DR: This work presents HTSeq, a Python library that facilitates the rapid development of custom scripts for high-throughput sequencing data analysis, and htseq-count, a tool built with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard workflows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data, such as genomic coordinates, sequences, sequencing reads, alignments, gene model information and variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability and implementation: HTSeq is released as open-source software under the GNU General Public Licence and is available from http://www-huber.embl.de/HTSeq or from the Python Package Index at https://pypi.python.org/pypi/HTSeq. Contact: sanders@fs.tum.de
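
The idea of counting the overlap of reads with genes can be illustrated with a minimal, plain-Python sketch. It does not use HTSeq and is not htseq-count's implementation; the function names, the half-open coordinate convention, and the `no_feature` and `ambiguous` bins are assumptions made for illustration only.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def count_reads(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene name -> (chrom, start, end)   (hypothetical layout)
    reads: iterable of (chrom, start, end) aligned reads    (hypothetical layout)
    A read overlapping exactly one gene is counted for that gene; reads
    overlapping no gene or several genes go to separate bins.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and overlaps(start, end, g_start, g_end)
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return dict(counts)


# Toy data: two partially overlapping genes and four reads.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 150),  # only geneA
    ("chr1", 190, 195),  # geneA and geneB
    ("chr1", 400, 450),  # no gene
    ("chr1", 250, 290),  # only geneB
]
result = count_reads(genes, reads)
```

The linear scan over all genes for every read keeps the sketch short; a real tool would use an indexed structure that can be queried by genomic coordinates, which is the kind of data structure the abstract describes.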

Citations

Journal Article

A pathology atlas of the human cancer transcriptome

Mathias Uhlén¹,², Cheng Zhang¹, Sunjae Lee¹, Evelina Sjöstedt¹,³, Linn Fagerberg¹, Gholamreza Bidkhori¹, Rui Benfeitas¹, Muhammad Arif¹, Zhengtao Liu¹, Fredrik Edfors¹, Kemal Sanli¹, Kalle von Feilitzen¹, Per Oksvold¹, Emma Lundberg¹, Sophia Hober¹, Peter Nilsson¹, Johanna Sofia Margareta Mattsson³, Jochen M. Schwenk¹, Hans Brunnström⁴, Bengt Glimelius³, Tobias Sjöblom³, Per-Henrik Edqvist³, Dijana Djureinovic³, Patrick Micke³, Cecilia Lindskog³, Adil Mardinoglu¹,⁵, Fredrik Pontén³

Royal Institute of Technology¹, Technical University of Denmark², Uppsala University³, Lund University⁴, Chalmers University of Technology⁵

18 Aug 2017-Science

TL;DR: A Human Pathology Atlas has been created as part of the Human Protein Atlas program to explore the prognostic role of each protein-coding gene in 17 different cancers, and reveals that gene expression of individual tumors within a particular cancer varied considerably and could exceed the variation observed between distinct cancer types.

Abstract: Cancer is one of the leading causes of death, and there is great interest in understanding the underlying molecular mechanisms involved in the pathogenesis and progression of individual tumors. We used systems-level approaches to analyze the genome-wide transcriptome of the protein-coding genes of 17 major cancer types with respect to clinical outcome. A general pattern emerged: Shorter patient survival was associated with up-regulation of genes involved in cell growth and with down-regulation of genes involved in cellular differentiation. Using genome-scale metabolic models, we show that cancer patients have widespread metabolic heterogeneity, highlighting the need for precise and personalized medicine for cancer treatment. All data are presented in an interactive open-access database (www.proteinatlas.org/pathology) to allow genome-wide exploration of the impact of individual proteins on clinical outcomes.

2,276 citations

Journal Article

A survey of best practices for RNA-seq data analysis

Ana Conesa¹, Pedro Madrigal²,³, Sonia Tarazona⁴, David Gomez-Cabrero, Alejandra Cervera⁵, Andrew McPherson⁶, Michał Wojciech Szcześniak⁷, Daniel J. Gaffney³, Laura L. Elo⁸, Xuegong Zhang⁹, Ali Mortazavi¹⁰

University of Florida¹, University of Cambridge², Wellcome Trust Sanger Institute³, Polytechnic University of Valencia⁴, University of Helsinki⁵, Simon Fraser University⁶, Adam Mickiewicz University in Poznań⁷, Åbo Akademi University⁸, Tsinghua University⁹, University of California, Irvine¹⁰

26 Jan 2016-Genome Biology

TL;DR: All of the major steps in RNA-seq data analysis are reviewed, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping.

Abstract: RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics.

1,963 citations

Cites methods from "HTSeq—a Python framework to work wi..."

...The simplest approach to quantification is to aggregate raw counts of mapped reads using programs such as HTSeq-count [35] or featureCounts [36]....
[...]
...This application is primarily based on the number of reads that map to each transcript sequence, although there are algorithms such as Sailfish that rely on k-mer counting in reads without the need for mapping [34]....
[...]
...These methods allocate multi-mapping reads among transcript and output within-sample normalized values corrected for sequencing biases [35, 41, 43]....
[...]
...Algorithms that quantify expression from transcriptome mappings include RSEM (RNA-Seq by Expectation Maximization) [40], eXpress [41], Sailfish [35] and kallisto [42] among others....
[...]

Journal Article

The single-cell transcriptional landscape of mammalian organogenesis

Junyue Cao¹, Malte Spielmann¹, Xiaojie Qiu¹, Xingfan Huang¹, Daniel M. Ibrahim²,³, Andrew J. Hill¹, Fan Zhang⁴, Stefan Mundlos²,³, Lena Christiansen⁴, Frank J. Steemers⁴, Cole Trapnell¹, Jay Shendure

University of Washington¹, Charité², Max Planck Society³, Illumina⁴

01 Feb 2019-Nature

TL;DR: A cell atlas of mouse organogenesis provides a global view of developmental processes occurring during this critical period, including focused analyses of the apical ectodermal ridge, limb mesenchyme and skeletal muscle.

Abstract: Mammalian organogenesis is a remarkable process. Within a short timeframe, the cells of the three germ layers transform into an embryo that includes most of the major internal and external organs. Here we investigate the transcriptional dynamics of mouse organogenesis at single-cell resolution. Using single-cell combinatorial indexing, we profiled the transcriptomes of around 2 million cells derived from 61 embryos staged between 9.5 and 13.5 days of gestation, in a single experiment. The resulting ‘mouse organogenesis cell atlas’ (MOCA) provides a global view of developmental processes during this critical window. We use Monocle 3 to identify hundreds of cell types and 56 trajectories, many of which are detected only because of the depth of cellular coverage, and collectively define thousands of corresponding marker genes. We explore the dynamics of gene expression within cell types and trajectories over time, including focused analyses of the apical ectodermal ridge, limb mesenchyme and skeletal muscle. Data from single-cell combinatorial-indexing RNA-sequencing analysis of 2 million cells from mouse embryos between embryonic days 9.5 and 13.5 are compiled in a cell atlas of mouse organogenesis, which provides a global view of developmental processes occurring during this critical period.

1,865 citations

Journal Article

Visualization and analysis of gene expression in tissue sections by spatial transcriptomics

Patrik L. Ståhl¹,², Fredrik Salmén², Sanja Vickovic², Anna Lundmark¹,², José Fernández Navarro¹,², Jens P. Magnusson¹, Stefania Giacomello², Michaela Asp², Jakub Orzechowski Westholm³, Mikael Huss³, Annelie Mollbrink², Sten Linnarsson¹, Simone Codeluppi¹, Åke Borg⁴, Fredrik Pontén⁵, Paul I. Costea², Pelin Sahlén², Jan Mulder³, Olaf Bergmann¹, Joakim Lundeberg², Jonas Frisén¹

Karolinska Institutet¹, Royal Institute of Technology², Science for Life Laboratory³, Lund University⁴, Uppsala University⁵

01 Jul 2016-Science

TL;DR: By positioning histological sections on arrayed reverse transcription primers with unique positional barcodes, this work demonstrates high-quality RNA-sequencing data with maintained two-dimensional positional information from the mouse brain and human breast cancer.

Abstract: Analysis of the pattern of proteins or messenger RNAs (mRNAs) in histological tissue sections is a cornerstone in biomedical research and diagnostics. This typically involves the visualization of a few proteins or expressed genes at a time. We have devised a strategy, which we call “spatial transcriptomics,” that allows visualization and quantitative analysis of the transcriptome with spatial resolution in individual tissue sections. By positioning histological sections on arrayed reverse transcription primers with unique positional barcodes, we demonstrate high-quality RNA-sequencing data with maintained two-dimensional positional information from the mouse brain and human breast cancer. Spatial transcriptomics provides quantitative gene expression data and visualization of the distribution of mRNAs within tissue sections and enables novel types of bioinformatics analyses, valuable in research and diagnostics.

1,741 citations

Journal Article

A circadian gene expression atlas in mammals: Implications for biology and medicine

Ray Zhang¹, Nicholas F. Lahens¹, Heather I. Ballance¹, Michael E. Hughes², John B. Hogenesch¹

University of Pennsylvania¹, University of Missouri–St. Louis²

11 Nov 2014-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: High-resolution multiorgan expression data is generated showing that nearly half of all genes in the mouse genome oscillate with circadian rhythm somewhere in the body, and the majority of best-selling drugs and World Health Organization essential medicines directly target the products of rhythmic genes.

Abstract: To characterize the role of the circadian clock in mouse physiology and behavior, we used RNA-seq and DNA arrays to quantify the transcriptomes of 12 mouse organs over time. We found 43% of all protein coding genes showed circadian rhythms in transcription somewhere in the body, largely in an organ-specific manner. In most organs, we noticed the expression of many oscillating genes peaked during transcriptional “rush hours” preceding dawn and dusk. Looking at the genomic landscape of rhythmic genes, we saw that they clustered together, were longer, and had more spliceforms than nonoscillating genes. Systems-level analysis revealed intricate rhythmic orchestration of gene pathways throughout the body. We also found oscillations in the expression of more than 1,000 known and novel noncoding RNAs (ncRNAs). Supporting their potential role in mediating clock function, ncRNAs conserved between mouse and human showed rhythmic expression in similar proportions as protein coding genes. Importantly, we also found that the majority of best-selling drugs and World Health Organization essential medicines directly target the products of rhythmic genes. Many of these drugs have short half-lives and may benefit from timed dosage. In sum, this study highlights critical, systemic, and surprising roles of the mammalian circadian clock and provides a blueprint for advancement in chronotherapy.

1,642 citations

References

Journal Article

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I. Love¹,², Wolfgang Huber, Simon Anders

Max Planck Society¹, Harvard University²

05 Dec 2014-Genome Biology

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations

Journal Article

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, a variant caller and an alignment viewer, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations

"HTSeq—a Python framework to work wi..." refers background in this paper

...…is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. functionality from PySam…...
[...]

Journal Article

Trimmomatic: a flexible trimmer for Illumina sequence data

Anthony Bolger¹, Marc Lohse¹, Bjoern Usadel¹

Max Planck Society¹

01 Aug 2014-Bioinformatics

TL;DR: Trimmomatic is developed as a more flexible and efficient preprocessing tool that correctly handles paired-end data, and is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested.

Abstract: Motivation: Although many next-generation sequencing (NGS) read preprocessing tools already existed, we could not find any tool or combination of tools that met our requirements in terms of flexibility, correct handling of paired-end data and high performance. We have developed Trimmomatic as a more flexible and efficient preprocessing tool, which could correctly handle paired-end data. Results: The value of NGS read preprocessing is demonstrated for both reference-based and reference-free tasks. Trimmomatic is shown to produce output that is at least competitive with, and in many cases superior to, that produced by other tools, in all scenarios tested. Availability and implementation: Trimmomatic is licensed under GPL V3. It is cross-platform (Java 1.5+ required) and available at http://www.usadellab.org/cms/index.php?page=trimmomatic Contact: usadel@bio1.rwth-aachen.de Supplementary information: Supplementary data are available at Bioinformatics online.

39,291 citations

Journal Article

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D. Robinson¹, Davis J. McCarthy¹, Gordon K. Smyth¹

Walter and Eliza Hall Institute of Medical Research¹

01 Jan 2010-Bioinformatics

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations

"HTSeq—a Python framework to work wi..." refers methods in this paper

...These counts can then be used for gene-level differential expression analyses using methods such as DESeq2 (Anders and Huber, 2010) or edgeR (Robinson et al., 2010)....
[...]

Journal Article

BEDTools: a flexible suite of utilities for comparing genomic features

Aaron R. Quinlan¹, Ira M. Hall¹

University of Virginia¹

15 Mar 2010-Bioinformatics

TL;DR: BEDTools is a software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) formats, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.

Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations

"HTSeq—a Python framework to work wi..." refers background in this paper

...Interval queries are a recurring task in HTS analysis problems, and several libraries now offer solutions for different programming languages, including BEDtools (Quinlan and Hall, 2010; Dale et al., 2011) and IRanges/GenomicRanges (Lawrence et al., 2013)....
[...]
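
The interval queries mentioned in the excerpt above can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the approach used by HTSeq, BEDtools or GenomicRanges; the class name `IntervalIndex`, its `query` method and the half-open coordinate convention are assumptions introduced here.

```python
import bisect


class IntervalIndex:
    """Toy index over half-open intervals (start, end, label) on one chromosome."""

    def __init__(self, intervals):
        # Sort by start so that a binary search can bound the candidates.
        self._items = sorted(intervals)
        self._starts = [start for start, _, _ in self._items]

    def query(self, start, end):
        """Return labels of all stored intervals overlapping [start, end)."""
        # Every interval at position >= hi starts at or after the query end,
        # so it cannot overlap; only the prefix needs to be checked.
        hi = bisect.bisect_left(self._starts, end)
        return [label for s, e, label in self._items[:hi] if e > start]


index = IntervalIndex([(100, 200, "a"), (150, 250, "b"), (300, 400, "c")])
both = index.query(190, 210)      # overlaps "a" and "b"
nothing = index.query(250, 300)   # falls in the gap between "b" and "c"
last = index.query(399, 500)      # overlaps only "c"
```

The binary search prunes intervals that start after the query, but the remaining prefix is still scanned linearly; dedicated libraries typically rely on interval trees or similar structures to avoid that scan.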