Author

Baikang Pei

Other affiliations: University of Connecticut

Bio: Baikang Pei is an academic researcher from Yale University. The author has contributed to research in topics including GENCODE and Genome, has an h-index of 11, and has co-authored 18 publications that have received 8,577 citations. Previous affiliations of Baikang Pei include the University of Connecticut.

Topics: GENCODE, Genome, Pseudogene, Microarray analysis techniques, Gene ...

Papers

Journal Article•DOI•

GENCODE: The reference human genome annotation for The ENCODE Project

Jennifer Harrow¹, Adam Frankish¹, José M. González¹, Electra Tapanari¹, Mark Diekhans², Felix Kokocinski¹, Bronwen Aken¹, Daniel Barrell¹, Amonida Zadissa¹, Stephen M. J. Searle¹, If H. A. Barnes¹, Alexandra Bignell¹, Veronika Boychenko¹, Toby Hunt¹, M. Kay¹, Gaurab Mukherjee¹, Jeena Rajan¹, Gloria Despacio-Reyes¹, Gary Saunders¹, Charles A. Steward¹, Rachel A. Harte², Michael F. Lin³, Cédric Howald⁴, Andrea Tanzer, Thomas Derrien⁴, Jacqueline Chrast⁴, Nathalie Walters⁴, Suganthi Balasubramanian⁵, Baikang Pei⁵, Michael L. Tress, Jose Manuel Rodriguez, Iakes Ezkurdia, Jeltje Van Baren, Michael R. Brent, David Haussler², Manolis Kellis³, Alfonso Valencia, Alexandre Reymond⁴, Mark Gerstein⁵, Roderic Guigó, Tim Hubbard¹

Wellcome Trust Sanger Institute¹, University of California, Santa Cruz², Massachusetts Institute of Technology³, University of Lausanne⁴, Yale University⁵

01 Sep 2012-Genome Research

TL;DR: This work examined the completeness of the GENCODE transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters, 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.

Abstract: The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

4,281 citations

An integrated encyclopedia of DNA elements in the human genome

Ian Dunham, Anshul Kundaje, Shelley Force Aldred, Patrick J. Collins +439 more

01 Sep 2012

TL;DR: The Encyclopedia of DNA Elements (ENCODE) project provides new insights into the organization and regulation of human genes and the genome, and is an expansive resource of functional annotations for biomedical research.

2,767 citations

Journal Article•DOI•

GENCODE reference annotation for the human and mouse genomes.

Adam Frankish¹, Mark Diekhans², Anne-Maud Ferreira³, Rory Johnson⁴, Irwin Jungreis⁵, Irwin Jungreis⁶, Jane E. Loveland¹, Jonathan M. Mudge¹, Cristina Sisu⁷, Cristina Sisu⁸, James C. Wright, Joel Armstrong², If Barnes¹, Andrew Berry¹, Alexandra Bignell¹, Silvia Carbonell Sala, Jacqueline Chrast³, Fiona Cunningham¹, Tomás Di Domenico, Sarah Donaldson¹, Ian T. Fiddes², Carlos García Girón¹, Jose Manuel Gonzalez¹, Tiago Grego¹, Matthew P. Hardy¹, Thibaut Hourlier¹, Toby Hunt¹, Osagie G. Izuogu¹, Julien Lagarde, Fergal J. Martin¹, Laura Martinez, Shamika Mohanan¹, Paul R. Muir⁸, Fabio C. P. Navarro⁸, Anne Parker¹, Baikang Pei⁸, Fernando Pozo, Magali Ruffier¹, Bianca M. Schmitt¹, Eloise Stapleton¹, Marie-Marthe Suner¹, Irina Sycheva¹, Barbara Uszczynska-Ratajczak⁹, Jinrui Xu⁸, Andrew D. Yates¹, Daniel R. Zerbino¹, Yan Zhang¹⁰, Yan Zhang⁸, Bronwen Aken¹, Jyoti S. Choudhary, Mark Gerstein⁸, Roderic Guigó¹¹, Tim Hubbard¹², Manolis Kellis⁵, Manolis Kellis⁶, Benedict Paten², Alexandre Reymond³, Michael L. Tress, Paul Flicek¹

European Bioinformatics Institute¹, University of California, Santa Cruz², University of Lausanne³, University of Bern⁴, Massachusetts Institute of Technology⁵, Broad Institute⁶, Brunel University London⁷, Yale University⁸, University of Warsaw⁹, Ohio State University¹⁰, Pompeu Fabra University¹¹, King's College London¹²

08 Jan 2019-Nucleic Acids Research

TL;DR: This work generates primary data, creates bioinformatics tools and provides analysis to support the work of expert manual gene annotators and automated gene annotation pipelines to identify and characterise gene loci to the highest standard.

Abstract: The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

2,095 citations

Journal Article•DOI•

GENCODE 2021

Adam Frankish, Mark Diekhans, Irwin Jungreis, Julien Lagarde, Jane E. Loveland, Jonathan M. Mudge, Cristina Sisu, James C. Wright, Joel Armstrong, If Barnes, Andrew Berry, Alexandra Bignell, Carles Boix, S. Carbonell Sala, Fiona Cunningham, T. Di Domenico, Sarah Donaldson, Ian T. Fiddes, C. Garcia Giron, José M. González, Tiago Grego, Matthew Hardy, Thibaut Hourlier, Kerstin Howe, Toby Hunt, Osagie G. Izuogu, Rory Johnson, Fergal J. Martin, Laura Martinez, S. Mohanan, Paul R. Muir, Fabio C. P. Navarro, Anne Parker, Baikang Pei, Fernando Pozo, F. C. Riera, Magali Ruffier, Bianca M. Schmitt, E. Stapleton, Marie Marthe Suner, I. Sycheva, Barbara Uszczynska-Ratajczak, Maxim Y Wolf, Jinrui Xu, Y. T. Yang, Andrew D. Yates, Daniel R. Zerbino, Yan Zhang, Jyoti S. Choudhary, Mark Gerstein, Roderic Guigó, Tim Hubbard, Manolis Kellis, Benedict Paten, Michael L. Tress, Paul Flicek

01 Jan 2020-Nucleic Acids Research

TL;DR: The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. Its annotation processes use primary data and bioinformatic tools and analysis, generated both within the consortium and externally, to support the creation of transcript structures and the determination of their function.

Abstract: The GENCODE project annotates human and mouse genes and transcripts supported by experimental data with high accuracy, providing a foundational resource that supports genome biology and clinical genomics. GENCODE annotation processes make use of primary data and bioinformatic tools and analysis generated both within the consortium and externally to support the creation of transcript structures and the determination of their function. Here, we present improvements to our annotation infrastructure, bioinformatics tools, and analysis, and the advances they support in the annotation of the human and mouse genomes including: the completion of first pass manual annotation for the mouse reference genome; targeted improvements to the annotation of genes associated with SARS-CoV-2 infection; collaborative projects to achieve convergence across reference annotation databases for the annotation of human and mouse protein-coding genes; and the first GENCODE manually supervised automated annotation of lncRNAs. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.

371 citations

Journal Article•DOI•

The GENCODE pseudogene resource

Baikang Pei¹, Cristina Sisu¹, Adam Frankish², Cédric Howald³, Lukas Habegger¹, Xinmeng Jasmine Mu¹, Rachel A. Harte⁴, Suganthi Balasubramanian¹, Andrea Tanzer, Mark Diekhans⁴, Alexandre Reymond³, Tim Hubbard², Jennifer Harrow², Mark Gerstein¹

Yale University¹, Wellcome Trust Sanger Institute², University of Lausanne³, University of California, Santa Cruz⁴

05 Sep 2012-Genome Biology

TL;DR: This work presents the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines, and determines the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene.

Abstract: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

309 citations

Cited by

Journal Article•DOI•

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin¹, Carrie A. Davis¹, Felix Schlesinger¹, Jorg Drenkow¹, Chris Zaleski¹, Sonali Jha¹, Philippe Batut¹, Mark Chaisson¹, Thomas R. Gingeras¹

Cold Spring Harbor Laboratory¹

01 Jan 2013-Bioinformatics

TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software, based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure, outperforms other aligners by a factor of >50 in mapping speed.

Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations

Journal Article•DOI•

An integrated encyclopedia of DNA elements in the human genome

Principal investigators¹, NHGRI groups², Data production leads³, Lead analysts³

Wellcome Trust¹, University of Washington², Pennsylvania State University³

06 Sep 2012-Nature

Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article•DOI•

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Daehwan Kim¹, Daehwan Kim², Geo Pertea³, Cole Trapnell⁴, Cole Trapnell⁵, Harold Pimentel⁶, Kelley Ryan Matthew⁷, Steven L. Salzberg², Steven L. Salzberg³

University of Maryland, College Park¹, Johns Hopkins University School of Medicine², Johns Hopkins University³, Broad Institute⁴, Harvard University⁵, University of California, Berkeley⁶, Illumina⁷

25 Apr 2013-Genome Biology

TL;DR: TopHat2 is described, which incorporates many significant enhancements to TopHat, and combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes.

Abstract: TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.

11,380 citations

Journal Article•DOI•

Tissue-based map of the human proteome

Mathias Uhlén¹, Mathias Uhlén², Linn Fagerberg², Björn M. Hallström², Cecilia Lindskog³, Per Oksvold², Adil Mardinoglu⁴, Åsa Sivertsson², Caroline Kampf³, Evelina Sjöstedt³, Evelina Sjöstedt², Anna Asplund³, IngMarie Olsson³, Karolina Edlund, Emma Lundberg², Sanjay Navani, Cristina Al-Khalili Szigyarto², Jacob Odeberg², Dijana Djureinovic³, Jenny Ottosson Takanen², Sophia Hober², Tove Alm², Per-Henrik Edqvist³, Holger Berling², Hanna Tegel², Jan Mulder³, Johan Rockberg², Peter Nilsson², Jochen M. Schwenk², Marica Hamsten², Kalle von Feilitzen², Mattias Forsberg², Lukas Persson², Fredric Johansson², Martin Zwahlen², Gunnar von Heijne⁵, Jens Nielsen⁴, Jens Nielsen¹, Fredrik Pontén³

Technical University of Denmark¹, Royal Institute of Technology², Science for Life Laboratory³, Chalmers University of Technology⁴, Stockholm University⁵

23 Jan 2015-Science

TL;DR: This paper presents a map of the human tissue proteome based on an integrated omics approach that combines quantitative transcriptomics at the tissue and organ level with tissue microarray-based immunohistochemistry to achieve spatial localization of proteins down to the single-cell level.

Abstract: Resolving the molecular details of proteome variation in the different tissues and organs of the human body will greatly increase our knowledge of human biology and disease. Here, we present a map of the human tissue proteome based on an integrated omics approach that involves quantitative transcriptomics at the tissue and organ level, combined with tissue microarray-based immunohistochemistry, to achieve spatial localization of proteins down to the single-cell level. Our tissue-based analysis detected more than 90% of the putative protein-coding genes. We used this approach to explore the human secretome, the membrane proteome, the druggable proteome, the cancer proteome, and the metabolic functions in 32 different tissues and organs. All the data are integrated in an interactive Web-based database that allows exploration of individual proteins, as well as navigation of global expression patterns, in all major tissues and organs in the human body.

9,745 citations

Journal Article•

An integrated encyclopedia of DNA elements in the human genome.

ENCODE Consortium

01 Jan 2012-Nature

8,106 citations
