Author

Danielle Thierry-Mieg

Other affiliations: Cornell University, Centre national de la recherche scientifique, University of Washington

Bio: Danielle Thierry-Mieg is an academic researcher at the National Institutes of Health. The author has contributed to research on the topics Human genome and Transcriptome. The author has an h-index of 26 and has co-authored 32 publications that have received 27,971 citations. Previous affiliations of Danielle Thierry-Mieg include Cornell University and Centre national de la recherche scientifique.

Topics: Human genome, Transcriptome, Gene, Genome, Alternative splicing

Papers

Journal Article

Initial sequencing and analysis of the human genome.

Eric S. Lander¹, Lauren Linton¹, Bruce W. Birren¹, Chad Nusbaum¹ +245 more•Institutions (29)

15 Feb 2001-Nature

TL;DR: The results of an international collaboration to produce and make freely available a draft sequence of the human genome are reported and an initial analysis is presented, describing some of the insights that can be gleaned from the sequence.

Abstract: The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

22,269 citations

Journal Article

The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements

Leming Shi¹, Laura H. Reid, Wendell D. Jones, Richard Shippy², Janet A. Warrington³, Shawn C. Baker⁴, Patrick J. Collins⁵, Francoise de Longueville, Ernest S. Kawasaki⁶, Kathleen Y. Lee⁷, Yuling Luo, Yongming Andrew Sun⁷, James C. Willey⁸, Robert Setterquist⁷, Gavin M. Fischer⁹, Weida Tong¹, Yvonne P. Dragan¹, David J. Dix¹⁰, Felix W. Frueh¹, Federico Goodsaid¹, Damir Herman⁶, Roderick V. Jensen¹¹, Charles D. Johnson, Edward K. Lobenhofer¹², Raj K. Puri¹, Uwe Scherf¹, Jean Thierry-Mieg⁶, Charles Wang¹³, Michael A Wilson⁷, Paul K. Wolber⁵, Lu Zhang⁷, William Slikker¹, Shashi Amur¹, Wenjun Bao¹⁴, Catalin Barbacioru⁷, Anne Bergstrom Lucas⁵, Vincent Bertholet, Cecilie Boysen, Bud Bromley, Donna Brown, Alan Brunner², Roger D. Canales⁷, Xiaoxi Megan Cao, Thomas A. Cebula¹, James J. Chen¹, Jing Cheng, Tzu Ming Chu¹⁴, Eugene Chudin⁴, John F. Corson⁵, J. Christopher Corton¹⁰, Lisa J. Croner¹⁵, Christopher Davies³, Timothy Davison, Glenda C. Delenstarr⁵, Xutao Deng¹³, David Dorris⁷, Aron Charles Eklund¹¹, Xiaohui Fan¹, Hong Fang, Stephanie Fulmer-Smentek⁵, James C. Fuscoe¹, Kathryn Gallagher¹⁰, Weigong Ge¹, Lei Guo¹, Xu Guo³, Janet Hager¹⁶, Paul K. Haje, Jing Han¹, Tao Han¹, Heather Harbottle¹, Stephen C. Harris¹, Eli Hatchwell¹⁷, Craig A. Hauser¹⁸, Susan D. Hester¹⁰, Huixiao Hong, Patrick Hurban¹², Scott A. Jackson¹, Hanlee P. Ji¹⁹, Charles R. Knight, Winston Patrick Kuo²⁰, J. Eugene LeClerc¹, Shawn Levy²¹, Quan Zhen Li, Chunmei Liu³, Ying Liu²², Michael Lombardi¹¹, Yunqing Ma, Scott R. Magnuson, Botoul Maqsodi, Timothy K. McDaniel³, Nan Mei¹, Ola Myklebost²³, Baitang Ning¹, Natalia Novoradovskaya⁹, Michael S. Orr¹, Terry Osborn, Adam Papallo¹¹, Tucker A. Patterson¹, Roger Perkins, Elizabeth Herness Peters, Ron L. Peterson²⁴, Kenneth L. Philips¹², P. Scott Pine¹, Lajos Pusztai²⁵, Feng Qian, Hongzu Ren¹⁰, Mitch Rosen¹⁰, Barry A. Rosenzweig¹, Raymond R. Samaha⁷, Mark Schena, Gary P. Schroth, Svetlana Shchegrova⁵, Dave D. Smith²⁶, Frank Staedtler²⁴, Zhenqiang Su¹, Hongmei Sun, Zoltan Szallasi²⁰, Zivana Tezak¹, Danielle Thierry-Mieg⁶, Karol L. Thompson¹, Irina Tikhonova¹⁶, Yaron Turpaz³, Beena Vallanat¹⁰, Christophe Van, Stephen J. Walker²⁷, Sue Jane Wang¹, Yonghong Wang⁶, Russell D. Wolfinger¹⁴, Alexander Wong⁵, Jie Wu, Chunlin Xiao⁷, Qian Xie, Jun Xu¹³, Wen Yang, Liang Zhang, Sheng Zhong²⁸, Yaping Zong•Institutions (28)

Food and Drug Administration¹, GE Healthcare², Thermo Fisher Scientific³, Illumina⁴, Agilent Technologies⁵, National Institutes of Health⁶, Applied Biosystems⁷, University of Toledo⁸, Stratagene⁹, United States Environmental Protection Agency¹⁰, University of Massachusetts Boston¹¹, Clinical Data, Inc¹², University of California, Los Angeles¹³, SAS Institute¹⁴, Biogen Idec¹⁵, Yale University¹⁶, Cold Spring Harbor Laboratory¹⁷, Discovery Institute¹⁸, Stanford University¹⁹, Harvard University²⁰, Vanderbilt University²¹, University of Texas at Dallas²², University of Oslo²³, Novartis²⁴, University of Texas MD Anderson Cancer Center²⁵, Luminex Corporation²⁶, Wake Forest University²⁷, University of Illinois at Urbana–Champaign²⁸

01 Sep 2006-Nature Biotechnology

TL;DR: This study describes the experimental design and probe mapping efforts behind the MicroArray Quality Control project and shows intraplatform consistency across test sites as well as a high level of interplatform concordance in terms of genes identified as differentially expressed.

Abstract: Over the last decade, the introduction of microarray technology has had a profound impact on gene expression research. The publication of studies with dissimilar or altogether contradictory results, obtained using different microarray platforms to analyze identical RNA samples, has raised concerns about the reliability of this technology. The MicroArray Quality Control (MAQC) project was initiated to address these concerns, as well as other performance and data analysis issues. Expression data on four titration pools from two distinct reference RNA samples were generated at multiple test sites using a variety of microarray-based and alternative technology platforms. Here we describe the experimental design and probe mapping efforts behind the MAQC project. We show intraplatform consistency across test sites as well as a high level of interplatform concordance in terms of genes identified as differentially expressed. This study provides a resource that represents an important first step toward establishing a framework for the use of microarrays in clinical and regulatory settings.

1,987 citations

Journal Article

A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium

Zhenqiang Su, Paweł P. Łabaj¹, Sheng Li², Jean Thierry-Mieg³ +161 more•Institutions (54)

01 Sep 2014-Nature Biotechnology

TL;DR: The complete SEQC data sets, comprising >100 billion reads, provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings, and measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling.

Abstract: We present primary results from the Sequencing Quality Control (SEQC) project, coordinated by the US Food and Drug Administration. Examining Illumina HiSeq, Life Technologies SOLiD and Roche 454 platforms at multiple laboratory sites using reference RNA samples with built-in controls, we assess RNA sequencing (RNA-seq) performance for junction discovery and differential expression profiling and compare it to microarray and quantitative PCR (qPCR) data using complementary metrics. At all sequencing depths, we discover unannotated exon-exon junctions, with >80% validated by qPCR. We find that measurements of relative expression are accurate and reproducible across sites and platforms if specific filters are used. In contrast, RNA-seq and microarrays do not provide accurate absolute measurements, and gene-specific biases are observed for all examined platforms, including qPCR. Measurement performance depends on the platform and data analysis pipeline, and variation is large for transcript-level profiling. The complete SEQC data sets, comprising >100 billion reads (10Tb), provide unique resources for evaluating RNA-seq analyses for clinical and regulatory settings.

853 citations

Journal Article

The Microarray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models

Leming Shi¹, Gregory Campbell¹, Wendell D. Jones, Fabien Campagne² +198 more•Institutions (55)

01 Aug 2010-Nature Biotechnology

TL;DR: Predictive models are generated for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans.

Abstract: Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.

753 citations

Journal Article

AceView: a comprehensive cDNA-supported gene and transcripts annotation

Danielle Thierry-Mieg¹, Jean Thierry-Mieg¹•Institutions (1)

National Institutes of Health¹

07 Aug 2006-Genome Biology

TL;DR: The driving principles of AceView are described, along with how, by performing hand-supervised automatic annotation, it solves the combinatorial splicing problem and summarizes all of GenBank, dbEST and RefSeq into a genome-wide non-redundant but comprehensive cDNA-supported transcriptome.

Abstract: Regions covering one percent of the genome, selected by ENCODE for extensive analysis, were annotated by the HAVANA/Gencode group with high quality transcripts, thus defining a benchmark. The ENCODE Genome Annotation Assessment Project (EGASP) competition aimed at reproducing Gencode and finding new genes. The organizers evaluated the protein predictions in depth. We present a complementary analysis of the mRNAs, including alternative transcript variants. We evaluate 25 gene tracks from the University of California Santa Cruz (UCSC) genome browser. We either distinguish or collapse the alternative splice variants, and compare the genomic coordinates of exons, introns and nucleotides. Whole mRNA models, seen as chains of introns, are sorted to find the best matching pairs, and compared so that each mRNA is used only once. At the mRNA level, AceView is by far the closest to Gencode: the vast majority of transcripts of the two methods, including alternative variants, are identical. At the protein level, however, due to a lack of experimental data, our predictions differ: Gencode annotates proteins in only 41% of the mRNAs whereas AceView does so in virtually all. We describe the driving principles of AceView, and how, by performing hand-supervised automatic annotation, we solve the combinatorial splicing problem and summarize all of GenBank, dbEST and RefSeq into a genome-wide non-redundant but comprehensive cDNA-supported transcriptome. AceView accuracy is now validated by Gencode. Relative to a consensus mRNA catalog constructed from all evidence-based annotations, Gencode and AceView have 81% and 84% sensitivity, and 74% and 73% specificity, respectively. This close agreement validates a richer view of the human transcriptome, with three to five times more transcripts than in UCSC Known Genes (sensitivity 28%), RefSeq (sensitivity 21%) or Ensembl (sensitivity 19%).

657 citations

Cited by

Journal Article

limma powers differential expression analyses for RNA-sequencing and microarray studies

Matthew E. Ritchie¹, Belinda Phipson², Di Wu³, Yifang Hu¹, Charity W. Law⁴, Wei Shi¹, Gordon K. Smyth¹,⁵•Institutions (5)

Walter and Eliza Hall Institute of Medical Research¹, Royal Children's Hospital², Harvard University³, University of Zurich⁴, University of Melbourne⁵

20 Apr 2015-Nucleic Acids Research

TL;DR: The philosophy and design of the limma package is reviewed, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

Abstract: limma is an R/Bioconductor software package that provides an integrated solution for analysing data from gene expression experiments. It contains rich features for handling complex experimental designs and for information borrowing to overcome the problem of small sample sizes. Over the past decade, limma has been a popular choice for gene discovery through differential expression analyses of microarray and high-throughput PCR data. The package contains particularly strong facilities for reading, normalizing and exploring such data. Recently, the capabilities of limma have been significantly expanded in two important directions. First, the package can now perform both differential expression and differential splicing analyses of RNA sequencing (RNA-seq) data. All the downstream analysis tools previously restricted to microarray data are now available for RNA-seq as well. These capabilities allow users to analyse both RNA-seq and microarray data with very similar pipelines. Second, the package is now able to go past the traditional gene-wise expression analyses in a variety of ways, analysing expression profiles in terms of co-regulated sets of genes or in terms of higher-order expression signatures. This provides enhanced possibilities for biological interpretation of gene expression differences. This article reviews the philosophy and design of the limma package, summarizing both new and historical features, with an emphasis on recent enhancements and features that have not been previously described.

22,147 citations

Journal Article

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome

Bo Li¹, Colin N. Dewey¹•Institutions (1)

University of Wisconsin-Madison¹

04 Aug 2011-BMC Bioinformatics

TL;DR: It is shown that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads, and estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.

Abstract: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.

14,524 citations

Journal Article

featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features

Yang Liao¹, Gordon K. Smyth¹, Wei Shi¹•Institutions (1)

Walter and Eliza Hall Institute of Medical Research¹

01 Apr 2014-Bioinformatics

TL;DR: featureCounts is a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments; it implements highly efficient chromosome hashing and feature blocking techniques.

Abstract: MOTIVATION: Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. RESULTS: We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. AVAILABILITY AND IMPLEMENTATION: featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.

14,103 citations

Journal Article

The Pfam protein families database

Marco Punta¹, Penny Coggill¹, Ruth Y. Eberhardt¹, Jaina Mistry¹, John Tate¹, Chris Boursnell¹, Ningze Pang¹, Kristoffer Forslund¹, Goran Ceric¹, Jody Clements¹, Andreas Heger¹, Liisa Holm¹, Erik L. L. Sonnhammer¹, Sean R. Eddy¹, Alex Bateman¹, Robert D. Finn¹•Institutions (1)

Wellcome Trust Sanger Institute¹

01 Jan 2000-Nucleic Acids Research

TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.

Abstract: Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

14,075 citations

Journal Article

The sequence of the human genome.

J. Craig Venter¹, Mark Raymond Adams¹, Eugene W. Myers¹, Peter W. Li¹ +269 more•Institutions (12)

16 Feb 2001-Science

TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems.

Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations
