Showing papers in "bioRxiv in 2014"

Posted Content•DOI•

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I. Love¹, Wolfgang Huber, Simon Anders•Institutions (1)

17 Nov 2014-bioRxiv

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.

17,014 citations

Posted Content•DOI•

qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

Stephen D. Turner¹•Institutions (1)

University of Virginia¹

14 May 2014-bioRxiv

TL;DR: The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest.

Abstract: Summary: Genome-wide association studies (GWAS) have identified thousands of human trait-associated single nucleotide polymorphisms. Here, I describe a freely available R package for visualizing GWAS results using Q-Q and manhattan plots. The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest. Availability: qqman is released under the GNU General Public License, and is freely available on the Comprehensive R Archive Network (http://cran.r-project.org/package=qqman). The source code is available on GitHub (https://github.com/stephenturner/qqman).

713 citations

Posted Content•DOI•

HTSeq - A Python framework to work with high-throughput sequencing data

Simon Anders, Paul Theodor Pyl, Wolfgang Huber

19 Aug 2014-bioRxiv

TL;DR: This work presents HTSeq, a Python library to facilitate the rapid development of custom scripts for high-throughput sequencing data analysis, and htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes.

Abstract: Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard work flows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq
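
The read-counting step described in this abstract can be illustrated with a minimal sketch. It does not use the HTSeq API. The gene table, the half-open coordinate convention, and the names `overlapping_genes` and `count_reads` are assumptions made for illustration, and the rule shown (count a read only when it overlaps exactly one gene) is a simplified reading of overlap counting rather than htseq-count's actual implementation.

```python
from collections import Counter

# Hypothetical toy annotation: gene name -> (chromosome, start, end), half-open.
GENES = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 180, 300),
    "geneC": ("chr2", 50, 150),
}


def overlapping_genes(chrom, start, end, genes=GENES):
    """Return the set of genes whose interval overlaps [start, end) on chrom."""
    return {
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and start < g_end and g_start < end
    }


def count_reads(reads, genes=GENES):
    """Assign each aligned read to a gene, or to a special category."""
    counts = Counter()
    for chrom, start, end in reads:
        hits = overlapping_genes(chrom, start, end, genes)
        if not hits:
            counts["no_feature"] += 1      # read overlaps no annotated gene
        elif len(hits) == 1:
            counts[next(iter(hits))] += 1  # unambiguous assignment
        else:
            counts["ambiguous"] += 1       # read overlaps several genes
    return counts


# Hypothetical aligned reads as (chromosome, start, end).
reads = [
    ("chr1", 110, 160),
    ("chr1", 185, 195),
    ("chr1", 250, 290),
    ("chr2", 200, 250),
    ("chr2", 60, 110),
]
counts = count_reads(reads)
```

A linear scan over all genes per read is fine for a toy table; a real tool would use an interval index so that lookups stay fast on whole-genome annotations.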

433 citations

Posted Content•DOI•

FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data

Daniel Nicorici¹, Satalan M², Henrik Edgren³, Sara Kangaspeska³, Astrid Murumägi³, Olli Kallioniemi³, Virtanen S¹, Kilkku O¹•Institutions (3)

Orion Corporation¹, Florida Hospital Heartland Medical Center², University of Helsinki³

19 Nov 2014-bioRxiv

TL;DR: FusionCatcher is a software tool for finding somatic fusion genes in paired-end RNA-sequencing data from human or other vertebrates that achieves competitive detection rates and real-time PCR validation rates in RNA-sequencing data from tumor cells.

Abstract: FusionCatcher is a software tool for finding somatic fusion genes in paired-end RNA-sequencing data from human or other vertebrates. FusionCatcher achieves competitive detection rates and real-time PCR validation rates in RNA-sequencing data from tumor cells. FusionCatcher is available at http://code.google.com/p/fusioncatcher

335 citations

Posted Content•DOI•

Improved vectors and genome-wide libraries for CRISPR screening

Neville E. Sanjana¹, Ophir Shalem¹, Feng Zhang¹•Institutions (1)

Broad Institute¹

28 Jun 2014-bioRxiv

TL;DR: A Genome-scale CRISPR Knock-Out (GeCKO) library is used to identify loss-of-function mutations conferring vemurafenib resistance in a melanoma model and reveal new mechanisms in diverse biological models.

Abstract: Genome-wide, targeted loss-of-function pooled screens using the CRISPR (clustered regularly interspaced short palindrome repeats)-associated nuclease Cas9 in human and mouse cells provide an alternative screening system to RNA interference (RNAi). Initial lentiviral delivery systems for CRISPR screening had low viral titer or required a cell line already expressing Cas9, limiting the range of biological systems amenable to screening. In this work, we present 1- and 2-vector lentiCRISPR systems capable of producing higher viral titers and, in these vectors, new human and mouse libraries for genome-scale CRISPR knock-out (GeCKO) screening.

309 citations

Posted Content•DOI•

Characterization of directed differentiation by high-throughput single-cell RNA-Seq

Magali Soumillon¹, Davide Cacchiarelli¹, Stefan Semrau², van Oudenaarden A, Tarjei S. Mikkelsen¹•Institutions (2)

Broad Institute¹, Massachusetts Institute of Technology²

05 Mar 2014-bioRxiv

TL;DR: An efficient digital gene expression profiling protocol that enables surveying of mRNA in thousands of single cells at a time is developed and applied to profile 12,832 cells collected at multiple time points during directed adipogenic differentiation of human adipose-derived stem/stromal cells in vitro.

Abstract: Directed differentiation of cells in vitro is a powerful approach for dissection of developmental pathways, disease modeling and regenerative medicine, but analysis of such systems is complicated by heterogeneous and asynchronous cellular responses to differentiation-inducing stimuli. To enable deep characterization of heterogeneous cell populations, we developed an efficient digital gene expression profiling protocol that enables surveying of mRNA in thousands of single cells at a time. We then applied this protocol to profile 12,832 cells collected at multiple time points during directed adipogenic differentiation of human adipose-derived stem/stromal cells in vitro. The resulting data reveal the major axes of cell-to-cell variation within and between time points, and an inverse relationship between inflammatory gene expression and lipid accumulation across cells from a single donor.

251 citations

Posted Content•DOI•

Highly-efficient Cas9-mediated transcriptional programming

Alejandro Chavez¹, Jonathan Scheiman², Suhani Vora³, Benjamin W. Pruitt², Marcelle Tuttle², Eswar Prasad Ramachandran Iyer², Samira Kiani³, Christopher D. Guzman², Daniel J. Wiegand², Dmitry Ter-Ovanesyan², Jonathan L. Braff², Noah Davidsohn², Ron Weiss³, John Aach², James J. Collins³, George M. Church¹•Institutions (3)

Harvard University¹, Wyss Institute for Biologically Inspired Engineering², Massachusetts Institute of Technology³

20 Dec 2014-bioRxiv

TL;DR: The development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9 is described, and its utility is demonstrated in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).

Abstract: The RNA-guided bacterial nuclease Cas9 can be reengineered as a programmable transcription factor by a series of changes to the Cas9 protein in addition to the fusion of a transcriptional activation domain (AD). However, the modest levels of gene activation achieved by current Cas9 activators have limited their potential applications. Here we describe the development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9. We demonstrate its utility in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).

198 citations

Posted Content•DOI•

Inferring human population size and separation history from multiple genome sequences

Stephan Schiffels¹, Richard Durbin¹•Institutions (1)

Wellcome Trust Sanger Institute¹

21 May 2014-bioRxiv

TL;DR: Results from applying Multiple Sequentially Markovian Coalescent (MSMC) to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago.

Abstract: The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe.

189 citations

Posted Content•DOI•

Arousal and locomotion make distinct contributions to cortical activity patterns and visual encoding

Martin Vinck¹, Renata Batista-Brito¹, Ulf Knoblich¹, Jessica A. Cardin¹•Institutions (1)

Yale University¹

30 Oct 2014-bioRxiv

TL;DR: Arousal suppressed spontaneous firing in mouse V1, whereas increased firing in anticipation of and during movement was attributable to locomotion effects; the findings suggest complementary roles of arousal and locomotion in promoting functional flexibility in cortical circuits.

Abstract: Spontaneous and sensory-evoked cortical activity is highly state-dependent, yet relatively little is known about transitions between distinct waking states. Patterns of activity in mouse V1 differ dramatically between quiescence and locomotion, but this difference could be explained by either motor feedback or a change in arousal levels. We recorded single cells and local field potentials from area V1 in mice head-fixed on a running wheel and monitored pupil diameter to assay arousal. Using naturally occurring and induced state transitions, we dissociated arousal and locomotion effects in V1. Arousal suppressed spontaneous firing and strongly altered the temporal patterning of population activity. Moreover, heightened arousal increased the signal-to-noise ratio of visual responses and reduced noise correlations. In contrast, increased firing in anticipation of and during movement was attributable to locomotion effects. Our findings suggest complementary roles of arousal and locomotion in promoting functional flexibility in cortical circuits.

189 citations

Posted Content•DOI•

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

Konstantin Berlin¹, Sergey Koren, Chen-Shan Chin², James P Drake², Jane M. Landolin², Adam M. Phillippy•Institutions (2)

University of Maryland, College Park¹, Pacific Biosciences²

14 Aug 2014-bioRxiv

TL;DR: The MinHash Alignment Process (MHAP) is introduced for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing, and the results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.

Abstract: We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
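
The general MinHash idea behind this kind of locality-sensitive overlap detection can be sketched as follows. This is a generic illustration, not MHAP's implementation. The k-mer size, the number of hash functions, the seeded SHA-1 hashing, and all function names are assumptions chosen for the example. Each read is reduced to a short sketch of minimum hash values over its k-mers, and the fraction of matching sketch positions estimates the Jaccard similarity of the two k-mer sets.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=8, num_hashes=64):
    """For each hash function, keep the minimum hash value over all k-mers."""
    kmer_set = kmers(seq, k)
    return [min(seeded_hash(km, seed) for km in kmer_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


# Hypothetical reads: read_b overlaps the second half of read_a.
read_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGATTACGGCTA"
read_b = "CTAACGTTAGCCGATCGATTACGGCTATTGACCGGTAGCATCGGATCCA"

similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
```

Comparing fixed-size sketches costs the same regardless of read length, which is what makes this style of filtering attractive before running a more expensive alignment on candidate pairs.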

186 citations

Posted Content•DOI•

DensiTree 2: Seeing Trees Through the Forest

Remco R. Bouckaert¹, Joseph Heled¹•Institutions (1)

University of Auckland¹

08 Dec 2014-bioRxiv

TL;DR: DensiTree is extended to allow visualisation of meta-data associated with branches such as population size and evolutionary rates and various methods for positioning internal nodes are explored, which make it easier to comprehend distributions over trees.

Abstract: Motivation: Phylogenetic analyses such as Bayesian MCMC or bootstrapping result in a collection of trees. Trees are discrete objects and it is generally difficult to get a mental grip on distributions over trees. Visualisation tools like DensiTree can give good intuition on tree distributions. It works by drawing all trees in the set transparently, thus highlighting areas where the trees in the set agree. In this way, both uncertainty in clade heights and uncertainty in topology can be visualised. In our experience, a vanilla DensiTree can turn out to be misleading in that it shows too much uncertainty due to wrongly ordering taxa or due to unlucky placement of internal nodes. Results: DensiTree is extended to allow visualisation of meta-data associated with branches such as population size and evolutionary rates. Furthermore, geographic locations of taxa can be shown on a map, making it easy to visually check whether there is some geographic pattern in a phylogeny. Taxa orderings have a large impact on the layout of the tree set, and advances have been made in finding better orderings resulting in significantly more informative visualisations. We also explored various methods for positioning internal nodes, which can improve the quality of the image. Together, these advances make it easier to comprehend distributions over trees. Availability: DensiTree is freely available from http://compevol.auckland.ac.nz/software/.

Posted Content•DOI•

SRST2: Rapid genomic surveillance for public health and hospital microbiology labs

Michael Inouye¹, Harriet Dashnow¹, Lesley Raven¹, Mark B. Schultz¹, Bernard J. Pope², Takehiro Tomita¹, Justin Zobel¹, Kathryn E. Holt¹•Institutions (2)

University of Melbourne¹, Victorian Life Sciences Computation Initiative²

28 Jul 2014-bioRxiv

TL;DR: Using >900 genomes from common pathogens, SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment and represents a powerful tool for rapidly extracting clinically useful information from raw WGS data.

Abstract: Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a read mapping-based tool for fast and accurate detection of genes, alleles and multi-locus sequence types (MLST) from WGS data. Using >900 genomes from common pathogens, we show SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment. Here we have demonstrated the use of SRST2 for microbial genome surveillance in a variety of public health and hospital settings. In the face of rising threats of antimicrobial resistance and emerging virulence amongst bacterial pathogens, SRST2 represents a powerful tool for rapidly extracting clinically useful information from raw WGS data. Source code is available from http://katholt.github.io/srst2/

Posted Content•DOI•

Measuring error rates in genomic perturbation screens: gold standards for human functional genomics

Traver Hart¹, Kevin R. Brown¹, Fabrice Sircoulomb², Robert Rottapel², Jason Moffat¹•Institutions (2)

University of Toronto¹, University Health Network²

17 Mar 2014-bioRxiv

TL;DR: The results indicate that CRISPR technology is more sensitive than RNAi, and that both techniques have nontrivial false discovery rates that can be mitigated by rigorous analytical methods.

Abstract: Technological advancement has opened the door to systematic genetics in mammalian cells. Genome-scale loss-of-function screens can assay fitness defects induced by partial gene knockdown, using RNA interference, or complete gene knockout, using new CRISPR techniques. These screens can reveal the basic blueprint required for cellular proliferation. Moreover, comparing healthy to cancerous tissue can uncover genes that are essential only in the tumor; these genes are targets for the development of specific anticancer therapies. Unfortunately, progress in this field has been hampered by off-target effects of perturbation reagents and poorly quantified error rates in large-scale screens. To improve the quality of information derived from these screens, and to provide a framework for understanding the capabilities and limitations of CRISPR technology, we derive gold-standard reference sets of essential and nonessential genes, and provide a Bayesian classifier of gene essentiality that outperforms current methods on both RNAi and CRISPR screens. Our results indicate that CRISPR technology is more sensitive than RNAi, and that both techniques have nontrivial false discovery rates that can be mitigated by rigorous analytical methods.

Posted Content•DOI•

Bonsai: An event-based framework for processing and controlling data streams

Gonçalo Lopes, Niccolò Bonacchi, João Frazão, Joana P. Neto, Bassam V. Atallah, Sofia Soares, Luís Moreira, Sara Matias, Pavel M. Itskov, Patrícia A. Correia, Roberto E. Medina, Lorenza Calcaterra, Elena Dreosti¹, Joseph J. Paton, Adam R. Kampff•Institutions (1)

University College London¹

02 Jul 2014-bioRxiv

TL;DR: This work describes Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams, and demonstrates how it allows for flexible and rapid prototyping of integrated experimental designs in neuroscience.

Abstract: The design of modern scientific experiments requires the control and monitoring of many parallel data streams. However, the serial execution of programming instructions in a computer makes it a challenge to develop software that can deal with the asynchronous, parallel nature of scientific data. Here we present Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams. We describe Bonsai's core principles and architecture and demonstrate how it allows for flexible and rapid prototyping of integrated experimental designs in neuroscience. We specifically highlight different possible applications which require the combination of many different hardware and software components, including behaviour video tracking, electrophysiology and closed-loop control of stimulation parameters.

Posted Content•DOI•

Ancient human genomes suggest three ancestral populations for present-day Europeans

Iosif Lazaridis¹, Nick Patterson², Alissa Mittnik³, Gabriel Renaud⁴, Swapan Mallick¹, Karola Kirsanow⁵, Peter H. Sudmant⁶, Joshua G. Schraiber⁷, Sergi Castellano⁴, Mark Lipson⁸, Bonnie Berger⁸, Christos Economou⁹, Ruth Bollongino⁵, Qiaomei Fu⁴, Kirsten I. Bos³, Susanne Nordenfelt¹, Heng Li², Cesare de Filippo⁴, Kay Prüfer⁴, Susanna Sawyer⁴, Cosimo Posth³, Wolfgang Haak¹⁰, Fredrik Hallgren¹¹, Elin Fornander¹¹, Nadin Rohland¹, Dominique Delsate¹², Michael Francken³, Jean-Michel Guinet¹³, Joachim Wahl, George Ayodo, Hamza A. Babiker¹⁴, Graciela Bailliet¹⁵, Elena Balanovska, Oleg Balanovsky, Ramiro Barrantes¹⁶, Gabriel Bedoya¹⁷, Haim Ben-Ami¹⁸, Judit Bene¹⁹, Fouad Berrada²⁰, Claudio M. Bravi¹⁵, Francesca Brisighelli²¹, George B.J. Busby²², Francesco Calì, Mikhail Churnosov²³, David E. C. Cole²⁴, Daniel Corach²⁵, Larissa Damba²⁶, George van Driem²⁷, Stanislav Dryomov²⁶, Jean-Michel Dugoujon²⁸, Sardana A. Fedorova²⁹, Irene Gallego Romero³⁰, Marina Gubina³¹, Michael F. Hammer³², Brenna M. Henn³³, Tor Hervig³⁴, Ugur Hodoglugil³⁵, Aashish R. Jha³⁰, Sena Karachanak-Yankova³⁶, Rita Khusainova³¹, Elza Khusnutdinova³¹, Rick A. Kittles³⁷, Toomas Kivisild³⁸, William Klitz⁷, Vaidutis Kučinskas³⁹, Alena Kushniarevich⁴⁰, Leila Laredj⁴¹, Sergey Litvinov³¹, Theologos Loukidis⁴², Robert W. Mahley⁴³, Béla Melegh¹⁹, Ene Metspalu⁴⁴, Julio Molina, Joanna L. Mountain, Klemetti Näkkäläjärvi⁴⁵, Desislava Nesheva³⁶, Thomas B. Nyambo⁴⁶, Ludmila P. Osipova³¹, Jüri Parik⁴⁴, Fedor Platonov²⁹, Olga L. Posukh³¹, Valentino Romano⁴⁷, Francisco Rothhammer⁴⁸, Igor Rudan¹⁴, Ruslan Ruizbakiev⁴⁹, Hovhannes Sahakyan⁴⁰, Antti Sajantila⁵⁰, Antonio Salas⁵¹, Elena B. Starikovskaya³¹, Ayele Tarekegn, Draga Toncheva³⁶, Shahlo Turdikulova⁴⁹, Ingrida Uktveryte³⁹, Olga Utevska⁵², René Vasquez⁵³, Mercedes Villena⁵³, Mikhail Voevoda³¹, Cheryl A. Winkler⁵⁴, Levon Yepiskoposyan⁵⁵, Pierre Zalloua⁵⁶, Tatijana Zemunik⁵⁷, Alan Cooper¹⁰, Cristian Capelli²², Mark G. Thomas⁵⁸, Andres Ruiz-Linares⁵⁸, Sarah A. Tishkoff⁵⁹, Lalji Singh⁶⁰, Kumarasamy Thangaraj⁶⁰, Richard Villems⁴⁰, David Comas⁶¹, Rem I. Sukernik³¹, Mait Metspalu⁴⁰, Matthias Meyer⁴, Evan E. Eichler⁶, Joachim Burger⁵, Montgomery Slatkin⁷, Svante Pääbo⁴, Janet Kelso⁴, David Reich¹, Johannes Krause³•Institutions (61)

Harvard University¹, Broad Institute², University of Tübingen³, Max Planck Society⁴, University of Mainz⁵, University of Washington⁶, University of California, Berkeley⁷, Massachusetts Institute of Technology⁸, Stockholm University⁹, University of Adelaide¹⁰, The Heritage Foundation¹¹, National Museum of Natural History¹², American Museum of Natural History¹³, University of Edinburgh¹⁴, National Scientific and Technical Research Council¹⁵, University of Costa Rica¹⁶, University of Antioquia¹⁷, Rambam Health Care Campus¹⁸, University of Pécs¹⁹, Al Akhawayn University²⁰, Catholic University of the Sacred Heart²¹, University of Oxford²², Belgorod State University²³, University of Toronto²⁴, University of Buenos Aires²⁵, Russian Academy²⁶, University of Bern²⁷, Paul Sabatier University²⁸, North-Eastern Federal University²⁹, University of Chicago³⁰, Russian Academy of Sciences³¹, University of Arizona³², Stony Brook University³³, University of Bergen³⁴, Illumina³⁵, Sofia Medical University³⁶, University of Illinois at Chicago³⁷, University of Cambridge³⁸, Vilnius University³⁹, Estonian Biocentre⁴⁰, University of Strasbourg⁴¹, Amgen⁴², Gladstone Institutes⁴³, University of Tartu⁴⁴, University of Oulu⁴⁵, Muhimbili University of Health and Allied Sciences⁴⁶, University of Palermo⁴⁷, University of Tarapacá⁴⁸, Academy of Sciences of Uzbekistan⁴⁹, University of Helsinki⁵⁰, University of Santiago de Compostela⁵¹, University of Kharkiv⁵², Higher University of San Andrés⁵³, Leidos⁵⁴, Armenian National Academy of Sciences⁵⁵, Lebanese American University⁵⁶, University of Split⁵⁷, University College London⁵⁸, University of Pennsylvania⁵⁹, Centre for Cellular and Molecular Biology⁶⁰, Pompeu Fabra University⁶¹

02 Apr 2014-bioRxiv

TL;DR: It is shown that the great majority of present-day Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE); and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry.

Abstract: We sequenced genomes from a ~7,000 year old early farmer from Stuttgart in Germany, an ~8,000 year old hunter-gatherer from Luxembourg, and seven ~8,000 year old hunter-gatherers from southern Sweden. We analyzed these data together with other ancient genomes and 2,345 contemporary humans to show that the great majority of present-day Europeans derive from at least three highly differentiated populations: West European Hunter-Gatherers (WHG), who contributed ancestry to all Europeans but not to Near Easterners; Ancient North Eurasians (ANE), who were most closely related to Upper Paleolithic Siberians and contributed to both Europeans and Near Easterners; and Early European Farmers (EEF), who were mainly of Near Eastern origin but also harbored WHG-related ancestry. We model these populations' deep relationships and show that EEF had ~44% ancestry from a "Basal Eurasian" lineage that split prior to the diversification of all other non-African lineages.

Posted Content•DOI•

Concerning RNA-Guided Gene Drives for the Alteration of Wild Populations

Kevin M. Esvelt¹, Andrea L. Smidler², Flaminia Catteruccia², George M. Church¹•Institutions (2)

Wyss Institute for Biologically Inspired Engineering¹, Harvard University²

17 Jul 2014-bioRxiv

TL;DR: The potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations is considered.

Abstract: Gene drives may be capable of addressing ecological problems by altering entire populations of wild organisms, but their use has remained largely theoretical due to technical constraints. Here we consider the potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations. We detail likely capabilities, discuss limitations, and provide novel precautionary strategies to control the spread of gene drives and reverse genomic changes. The ability to edit populations of sexual species would offer substantial benefits to humanity and the environment. For example, RNA-guided gene drives could potentially prevent the spread of disease, support agriculture by reversing pesticide and herbicide resistance in insects and weeds, and control damaging invasive species. However, the possibility of unwanted ecological effects and near-certainty of spread across political borders demand careful assessment of each potential application. We call for thoughtful, inclusive, and well-informed public discussions to explore the responsible use of this currently theoretical technology.

Posted Content•DOI•

svaseq: removing batch effects and other unwanted noise from sequencing data

Jeffrey T. Leek¹•Institutions (1)

Johns Hopkins University¹

26 Jun 2014-bioRxiv

TL;DR: This work describes an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments, along with the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts.

Abstract: It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors. Here I describe an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. These updates are available through the surrogate variable analysis (sva) Bioconductor package.

Posted Content•DOI•

Visualizing spatial population structure with estimated effective migration surfaces

Desislava Petkova¹, John Novembre¹, Matthew Stephens¹•Institutions (1)

University of Chicago¹

26 Nov 2014-bioRxiv

TL;DR: This work presents a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data.

Abstract: Genetic data often exhibit patterns that are broadly consistent with "isolation by distance" - a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of "effective migration" to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.

Posted Content•DOI•

Error correction and assembly complexity of single molecule sequencing reads.

Hayan Lee¹, James Gurtowski¹, Shinjae Yoo², Shoshana Marcus¹, W. R. McCombie¹, Michael C. Schatz¹•Institutions (2)

Cold Spring Harbor Laboratory¹, Brookhaven National Laboratory²

18 Jun 2014-bioRxiv

TL;DR: A new data-driven model using support vector regression that can accurately predict assembly performance is developed and applied to several prokaryotic and eukaryotic genomes, and can achieve near-perfect assemblies of small genomes and substantially improved assemblies of larger ones.

Abstract: Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.

Posted Content•DOI•

MDTraj: a modern, open library for the analysis of molecular dynamics trajectories

Robert T. McGibbon¹, Kyle A. Beauchamp², Christian R. Schwantes¹, Lee-Ping Wang¹, Carlos X. Hernández¹, Matthew P. Harrigan¹, Thomas J. Lane¹, Jason M. Swails³, Vijay S. Pande¹•Institutions (3)

Stanford University¹, Kettering University², Rutgers University³

09 Sep 2014-bioRxiv

TL;DR: The MDTraj package is a modern, lightweight and efficient software package for analyzing molecular dynamics simulations, bridging the gap between molecular dynamics data and the rapidly-growing collection of industry-standard statistical analysis and visualization tools in Python.

Abstract: Summary: MDTraj is a modern, lightweight and efficient software package for analyzing molecular dynamics simulations. MDTraj reads trajectory data from a wide variety of commonly used formats. It provides a large number of trajectory analysis capabilities including RMSD, DSSP secondary structure assignment and the extraction of common order parameters. The package has a strong focus on interoperability with the wider scientific Python ecosystem, bridging the gap between molecular dynamics data and the rapidly-growing collection of industry-standard statistical analysis and visualization tools in Python. Availability: Package downloads, detailed examples and full documentation are available at http://mdtraj.org. The source code is distributed under the GNU Lesser General Public License at https://github.com/simtk/mdtraj.

Posted Content•DOI•

An experimentally determined evolutionary model dramatically improves phylogenetic fit

Jesse D. Bloom¹•Institutions (1)

Fred Hutchinson Cancer Research Center¹

02 Mar 2014-bioRxiv

TL;DR: An experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing is demonstrated: an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters.

Abstract: All modern approaches to molecular phylogenetics require a quantitative model for how genes evolve. Unfortunately, existing evolutionary models do not realistically represent the site-heterogeneous selection that governs actual sequence change. Attempts to remedy this problem have involved augmenting these models with a burgeoning number of free parameters. Here I demonstrate an alternative: experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here provide fundamentally new information that has the potential to transform the sensitivity of phylogenetic and genetic analyses.

Posted Content•DOI•

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Durand, Chuong B. Do, Joanna L. Mountain, John Michael Macpherson

18 Oct 2014-bioRxiv

TL;DR: Ancestry Composition is described, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals and achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.

Abstract: Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are typically limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.

Posted Content•DOI•

Global Epistasis Makes Adaptation Predictable Despite Sequence-Level Stochasticity

Sergey Kryazhimskiy¹, Daniel P. Rice¹, Elizabeth R. Jerison¹, Michael M. Desai¹•Institutions (1)

Harvard University¹

13 Jan 2014-bioRxiv

TL;DR: Experimental evolution in Saccharomyces cerevisiae is used to quantify the effect of epistasis, finding dramatic differences in adaptability between 64 closely related genotypes, suggesting that many beneficial mutations affecting a variety of biological processes are globally coupled.

Abstract: Epistasis can make adaptation highly unpredictable, rendering evolutionary trajectories contingent on the chance effects of initial mutations. We used experimental evolution in Saccharomyces cerevisiae to quantify this effect, finding dramatic differences in adaptability between 64 closely related genotypes. Despite these differences, sequencing of 105 evolved clones showed no significant effect of initial genotype on future sequence-level evolution. Instead, reconstruction experiments revealed a consistent pattern of diminishing returns epistasis. Our results suggest that many beneficial mutations affecting a variety of biological processes are globally coupled: they interact strongly, but only through their combined effect on fitness. Sequence-level adaptation is thus highly stochastic. Nevertheless, fitness evolution is strikingly predictable because differences in adaptability are determined only by global fitness-mediated epistasis, not by the identity of individual mutations.

Posted Content•DOI•

Species Delimitation using Genome-Wide SNP Data

Adam D. Leaché¹, Matthew K. Fujita², Vladimir N. Minin¹, Remco R. Bouckaert³•Institutions (3)

University of Washington¹, University of Texas at Arlington², University of Auckland³

04 Jan 2014-bioRxiv

TL;DR: This work combines a recently introduced dynamic programming algorithm for estimating species trees, which bypasses MCMC integration over gene trees, with sophisticated methods for estimating the marginal likelihoods needed for Bayesian model selection, providing a rigorous and computationally tractable technique for genome-wide species delimitation.

Abstract: The multi-species coalescent has provided important progress for evolutionary inferences, including increasing the statistical rigor and objectivity of comparisons among competing species delimitation models. However, Bayesian species delimitation methods typically require brute force integration over gene trees via Markov chain Monte Carlo (MCMC), which introduces a large computational burden and precludes their application to genomic-scale data. Here we combine a recently introduced dynamic programming algorithm for estimating species trees that bypasses MCMC integration over gene trees with sophisticated methods for estimating marginal likelihoods, needed for Bayesian model selection, to provide a rigorous and computationally tractable technique for genome-wide species delimitation. We provide a critical yet simple correction that brings the likelihoods of different species trees, and more importantly their corresponding marginal likelihoods, to the same common denominator, which enables direct and accurate comparisons of competing species delimitation models using Bayes factors. We test this approach, which we call Bayes factor delimitation (*with genomic data; BFD*), using common species delimitation scenarios with computer simulations. Varying the number of loci and the number of samples suggests that the approach can distinguish the true model even with few loci and limited samples per species. Misspecification of the prior for population size θ has little impact on support for the true model. We apply the approach to West African forest geckos (Hemidactylus fasciatus complex) using genome-wide SNP data. This new Bayesian method for species delimitation builds on a growing trend for objective species delimitation methods with explicit model assumptions that are easily tested.

Posted Content•DOI•

Synthesis of phylogeny and taxonomy into a comprehensive tree of life

Cody E. Hinchliff¹, Stephen A. Smith¹, James F. Allman, J. Gordon Burleigh², Ruchi Chaudhary², Lyndon M. Coghill³, Keith A. Crandall⁴, Jiabin Deng², Bryan T. Drew², Romina Gazis⁵, Karl Gude⁶, David S. Hibbett⁵, Laura A. Katz⁷, H. Dail Laughinghouse⁷, Emily Jane McTavish⁸, Peter E. Midford³, Christopher L. Owen⁴, Richard H. Ree³, Jonathan Rees⁹, Douglas E. Soltis², Tiffani L. Williams¹⁰⁻¹⁵, Karen Cranston⁹•Institutions (15)

University of Michigan¹, University of Florida², Field Museum of Natural History³, George Washington University⁴, Clark University⁵, Michigan State University⁶, Smith College⁷, University of Kansas⁸, National Evolutionary Synthesis Center⁹, Texas A&M Health Science Center College of Medicine¹⁰, Hospital Corporation of America¹¹, Texas College¹², Lanzhou University¹³, Texas Tech University¹⁴, Texas A&M University¹⁵

08 Dec 2014-bioRxiv

TL;DR: This comprehensive tree will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change, agriculture, and genomics.

Abstract: Reconstructing the phylogenetic relationships that unite all biological lineages (the tree of life) is a grand challenge of biology. However, the paucity of readily available homologous character data across disparately related lineages renders direct phylogenetic inference currently untenable. Our best recourse towards realizing the tree of life is therefore the synthesis of existing collective phylogenetic knowledge available from the wealth of published primary phylogenetic hypotheses, together with taxonomic hierarchy information for unsampled taxa. We combined phylogenetic and taxonomic data to produce a draft tree of life, the Open Tree of Life, containing 2.3 million tips. Realization of this draft tree required the assembly of two resources that should prove valuable to the community: 1) a novel comprehensive global reference taxonomy, and 2) a database of published phylogenetic trees mapped to this common taxonomy. Our open source framework facilitates community comment and contribution, enabling a continuously updatable tree when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminate significant gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point from which we can continue to improve through community contributions. Having a comprehensive tree of life will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change studies, agriculture, and genomics.

Posted Content•DOI•

U2AF1 mutations alter splice site recognition in hematological malignancies

Janine O. Ilagan¹, Aravind Ramakrishnan¹, Brian Hayes¹, Michele E. Murphy¹, Ahmad S. Zebari¹, Philip Bradley¹, Robert K. Bradley¹•Institutions (1)

Fred Hutchinson Cancer Research Center¹

28 Jun 2014-bioRxiv

TL;DR: It is found that U2AF1 mutations influence the similarity of splicing programs in leukemias, but do not give rise to widespread splicing failure.

Abstract: Whole-exome sequencing studies have identified common mutations affecting genes encoding components of the RNA splicing machinery in hematological malignancies. Here, we sought to determine how mutations affecting the 3' splice site recognition factor U2AF1 alter its normal role in RNA splicing. We find that U2AF1 mutations influence the similarity of splicing programs in leukemias, but do not give rise to widespread splicing failure. U2AF1 mutations cause differential splicing of hundreds of genes, affecting biological pathways implicated in myeloid disease such as DNA methylation (DNMT3B), X chromosome inactivation (H2AFY), the DNA damage response (ATR, FANCA), and apoptosis (CASP8). We show that U2AF1 mutations alter the preferred 3' splice site motif in patients, in cell culture, and in vitro. Mutations affecting the first and second zinc fingers give rise to different alterations in splice site preference and largely distinct downstream splicing programs. These allele-specific effects are consistent with a computationally predicted model of U2AF1 in complex with RNA. Our findings suggest that U2AF1 mutations contribute to pathogenesis by causing quantitative changes in splicing that affect diverse cellular pathways, and give insight into the normal function of U2AF1's zinc finger domains.

Posted Content•DOI•

CasFinder: Flexible algorithm for identifying specific Cas9 targets in genomes

John Aach¹, Prashant Mali¹, George M. Church¹•Institutions (1)

Harvard University¹

12 May 2014-bioRxiv

TL;DR: The CasFinder system is demonstrated by generating human and mouse exome-wide catalogs of specific sites for three varieties of Cas9 – S. pyogenes, S. thermophilus (ST1), and N. meningitidis – that each target 56-74% of all exons.

Abstract: CRISPR/Cas9 systems enable many molecular activities to be efficiently directed in vivo to user-specifiable DNA sequences of interest, including generation of dsDNA cuts and nicks, transcriptional activation and repression, and fluorescence. CRISPR targeting relies on base pairing of short RNA transcripts with their target DNA sequences that must also be adjacent to fixed DNA motifs. However, rules for Cas9 targeting specificity are incompletely known. With increasing numbers of Cas9 systems being developed and deployed in more and more organisms, there is now strong need for a flexible and rational method for finding Cas9 sites with low off-targeting potential. We address this through the CasFinder system, which we demonstrate by generating human and mouse exome-wide catalogs of specific sites for three varieties of Cas9 - S. pyogenes, S. thermophilus (ST1), and N. meningitidis - that each target 56-74% of all exons. We also generate reduced sets of up to 3 targets per gene for use in high-throughput Cas9-based gene knockout screens that target 75-80% of all genes.

Posted Content•DOI•

READemption - A tool for the computational analysis of deep-sequencing-based transcriptome data

Konrad U. Förstner¹, Jörg Vogel¹, Cynthia M. Sharma¹•Institutions (1)

University of Würzburg¹

15 Apr 2014-bioRxiv

TL;DR: This work has successfully applied READemption to analyze RNA-Seq reads from other sample types, including whole transcriptomes, RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea.

Abstract: Summary: RNA-Seq has become a potent and widely used method to qualitatively and quantitatively study transcriptomes. In order to draw biological conclusions based on RNA-Seq data, several steps, some of which are computationally intensive, have to be taken. Our READemption pipeline takes care of these individual tasks and integrates them into an easy-to-use tool with a command line interface. To leverage the full power of modern computers, most subcommands of READemption offer parallel data processing. While READemption was mainly developed for the analysis of bacterial primary transcriptomes, we have successfully applied it to analyze RNA-Seq reads from other sample types, including whole transcriptomes, RNA immunoprecipitated with proteins, not only from bacteria, but also from eukaryotes and archaea. Availability and Implementation: READemption is implemented in Python and is published under the ISC open source license. The tool and documentation are hosted at http://pythonhosted.org/READemption (DOI:10.6084/m9.figshare.977849).

Posted Content•DOI•

Reagent contamination can critically impact sequence-based microbiome analyses

Susannah J. Salter¹, Michael J. Cox², Elena M. Turek², Szymon T. Calus³, William O.C.M. Cookson², Miriam F. Moffatt², Paul Turner⁴, Julian Parkhill¹, Nicholas J. Loman³, Alan W. Walker⁵•Institutions (5)

Wellcome Trust Sanger Institute¹, National Institutes of Health², University of Birmingham³, University of Oxford⁴, University of Aberdeen⁵

16 Jul 2014-bioRxiv

TL;DR: It is demonstrated that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass.

Abstract: The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.

Posted Content•DOI•

Reconstructing Austronesian population history in Island Southeast Asia

Mark Lipson¹, Po-Ru Loh¹, Nick Patterson², Priya Moorjani³, Ying-Chin Ko⁴, Mark Stoneking⁵, Bonnie Berger¹, David Reich³•Institutions (5)

Massachusetts Institute of Technology¹, Broad Institute², Harvard University³, China Medical University (PRC)⁴, Max Planck Society⁵

27 May 2014-bioRxiv

TL;DR: It is shown that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population, suggesting that either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesians migrated to and through the mainland, admixing there before continuing to western Indonesia.

Abstract: Austronesian languages are spread across half the globe, from Easter Island to Madagascar. Evidence from linguistics and archaeology indicates that the "Austronesian expansion," which began 4-5 thousand years ago, likely had roots in Taiwan, but the ancestry of present-day Austronesian-speaking populations remains controversial. Here, focusing primarily on Island Southeast Asia, we analyze genome-wide data from 56 populations using new methods for tracing ancestral gene flow. We show that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population. Surprisingly, western Island Southeast Asian populations have also inherited ancestry from a source nested within the variation of present-day populations speaking Austro-Asiatic languages, which have historically been nearly exclusive to the mainland. Thus, either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesian speakers migrated to and through the mainland, admixing there before continuing to western Indonesia.
