scispace - formally typeset

Showing papers in "arXiv: Quantitative Methods in 2016"


Posted Content
TL;DR: The power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses is demonstrated, by combining the deep learning system's predictions with the human pathologist's diagnoses.
Abstract: The International Symposium on Biomedical Imaging (ISBI) held a grand challenge to evaluate computational systems for the automated detection of metastatic breast cancer in whole slide images of sentinel lymph node biopsies. Our team won both competitions in the grand challenge, obtaining an area under the receiver operating curve (AUC) of 0.925 for the task of whole slide image classification and a score of 0.7051 for the tumor localization task. A pathologist independently reviewed the same images, obtaining a whole slide image classification AUC of 0.966 and a tumor localization score of 0.733. Combining our deep learning system's predictions with the human pathologist's diagnoses increased the pathologist's AUC to 0.995, representing an approximately 85 percent reduction in human error rate. These results demonstrate the power of using deep learning to produce significant improvements in the accuracy of pathological diagnoses.

739 citations
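
The roughly 85 percent figure quoted in the abstract above can be reproduced if the error rate is taken to be 1 - AUC. That reading is an assumption, since the abstract does not say how the reduction was computed. A minimal check in Python:

```python
def error_reduction(auc_before: float, auc_after: float) -> float:
    """Relative reduction in error, taking error = 1 - AUC (an assumption)."""
    err_before = 1.0 - auc_before
    err_after = 1.0 - auc_after
    return (err_before - err_after) / err_before


# Pathologist alone (AUC 0.966) versus pathologist combined with the
# deep learning system (AUC 0.995), both values taken from the abstract.
reduction = error_reduction(0.966, 0.995)
print(f"{reduction:.1%}")
```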


Posted Content
TL;DR: To realize the full impact of machine learning for tomographic imaging, major theoretical, technical and translational efforts are immediately needed.
Abstract: The combination of tomographic imaging and deep learning, or machine learning in general, promises to empower not only image analysis but also image reconstruction. The latter aspect is considered in this perspective article with an emphasis on medical imaging to develop a new generation of image reconstruction theories and techniques. This direction might lead to intelligent utilization of domain knowledge from big data, innovative approaches for image reconstruction, and superior performance in clinical and preclinical applications. To realize the full impact of machine learning on medical imaging, major challenges must be addressed.

234 citations


Journal ArticleDOI
Yuxiang Jiang, Tal Ronnen Oron, Wyatt T. Clark, Asma R. Bankapur, Daniel D'Andrea, Rosalba Lepore, Christopher S. Funk, Indika Kahanda, Karin Verspoor, Asa Ben-Hur, Emily Koo, Duncan Penfold-Brown, Dennis Shasha, Noah Youngs, Richard Bonneau, Alexandra Lin, Sayed M. E. Sahraeian, Pier Luigi Martelli, Giuseppe Profiti, Rita Casadio, Renzhi Cao, Zhaolong Zhong, Jianlin Cheng, Adrian M. Altenhoff, Nives Škunca, Christophe Dessimoz, Tunca Doğan, Kai Hakala, Suwisa Kaewphan, Farrokh Mehryary, Tapio Salakoski, Filip Ginter, Hai Fang, Ben Smithers, Matt E. Oates, Julian Gough, Petri Törönen, Patrik Koskinen, Liisa Holm, Ching-Tai Chen, Wen-Lian Hsu, Kevin Bryson, Domenico Cozzetto, Federico Minneci, David T. Jones, Samuel Chapman, Ishita K. Khan, Daisuke Kihara, Dan Ofer, Nadav Rappoport, Amos Stern, Elena Cibrian-Uhalte, Paul Denny, Rebecca E. Foulger, Reija Hieta, Duncan Legge, Ruth C. Lovering, Michele Magrane, Anna N. Melidoni, Prudence Mutowo-Meullenet, Klemens Pichler, Aleksandra Shypitsyna, Biao Li, Pooya Zakeri, Sarah ElShal, Léon-Charles Tranchevent, Sayoni Das, Natalie L. Dawson, David A. Lee, Jonathan G. Lees, Ian Sillitoe, Prajwal Bhat, Tamás Nepusz, Alfonso E. Romero, Rajkumar Sasidharan, Haixuan Yang, Alberto Paccanaro, Jesse Gillis, Adriana E. Sedeno-Cortes, Paul Pavlidis, Shou Feng, Juan Miguel Cejuela, Tatyana Goldberg, Tobias Hamp, Lothar Richter, Asaf Salamov, Toni Gabaldón, Marina Marcet-Houben, Fran Supek, Qingtian Gong, Wei Ning, Yuanpeng Zhou, Weidong Tian, Marco Falda, Paolo Fontana, Enrico Lavezzo, Stefano Toppo, Carlo Ferrari, Manuel Giollo, Damiano Piovesan, Silvio C. E. Tosatto, Angela del Pozo, José M. Fernández, Paolo Maietta, Alfonso Valencia, Michael L. Tress, Alfredo Benso, Stefano Di Carlo, Gianfranco Politano, Alessandro Savino, Hafeez Ur Rehman, Matteo Re, Marco Mesiti, Giorgio Valentini, Joachim W. Bargsten, Aalt D. J. van Dijk, Branislava Gemovic, Sanja Glisic, Vladmir Perovic, Veljko Veljkovic, Nevena Veljkovic, Danillo C Almeida-E-Silva, Ricardo Z. N. Vêncio, Malvika Sharan, Jörg Vogel, Lakesh Kansakar, Shanshan Zhang, Slobodan Vucetic, Zheng Wang, Michael J.E. Sternberg, Mark N. Wass, Rachael P. Huntley, Maria Jesus Martin, Claire O'Donovan, Peter N. Robinson, Yves Moreau, Anna Tramontano, Patricia C. Babbitt, Steven E. Brenner, Michal Linial, Christine A. Orengo, Burkhard Rost, Casey S. Greene, Sean D. Mooney, Iddo Friedberg, Predrag Radivojac
TL;DR: The second Critical Assessment of Functional Annotation (CAFA2) was a timed challenge to assess computational methods that automatically assign protein function. Its top-performing methods outperformed the best methods from CAFA1, showing that computational function prediction is improving.
Abstract: Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to a discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2. Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

200 citations


Journal ArticleDOI
TL;DR: In this article, a self-contained introduction to modeling, approximations and inference methods for stochastic chemical kinetics is given, as well as a comparison of several of these methods by means of a numerical case study.
Abstract: Stochastic fluctuations of molecule numbers are ubiquitous in biological systems. Important examples include gene expression and enzymatic processes in living cells. Such systems are typically modelled as chemical reaction networks whose dynamics are governed by the Chemical Master Equation. Despite its simple structure, no analytic solutions to the Chemical Master Equation are known for most systems. Moreover, stochastic simulations are computationally expensive, making systematic analysis and statistical inference a challenging task. Consequently, significant effort has been spent in recent decades on the development of efficient approximation and inference methods. This article gives an introduction to basic modelling concepts as well as an overview of state-of-the-art methods. First, we motivate and introduce deterministic and stochastic methods for modelling chemical networks, and give an overview of simulation and exact solution methods. Next, we discuss several approximation methods, including the chemical Langevin equation, the system size expansion, moment closure approximations, time-scale separation approximations and hybrid methods. We discuss their various properties and review recent advances and remaining challenges for these methods. We present a comparison of several of these methods by means of a numerical case study and highlight some of their respective advantages and disadvantages. Finally, we discuss the problem of inference from experimental data in the Bayesian framework and review recent methods developed in the literature. In summary, this review gives a self-contained introduction to modelling, approximations and inference methods for stochastic chemical kinetics.

153 citations
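
To make the stochastic simulation mentioned in the abstract above concrete, the following is a minimal sketch of Gillespie's stochastic simulation algorithm, a standard way to sample trajectories of the Chemical Master Equation. The birth-death network, the rate constants and the function name are hypothetical choices for illustration and are not taken from the article.

```python
import random


def gillespie_birth_death(k_birth, k_death, n0, t_end, seed=0):
    """Sample one trajectory of 0 -> X (rate k_birth) and X -> 0 (rate k_death * n).

    The network and its parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_birth = k_birth          # propensity of the birth reaction
        a_death = k_death * n      # propensity of the death reaction
        a_total = a_birth + a_death
        if a_total == 0.0:
            break                  # no reaction can fire any more
        # The waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


times, counts = gillespie_birth_death(k_birth=10.0, k_death=1.0, n0=0, t_end=50.0)
print(len(times), counts[-1])
```

For this toy network the molecule number fluctuates around k_birth / k_death. Each run with a different seed gives a different trajectory, which is why many runs are needed for statistics and why such simulations become expensive.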


Journal ArticleDOI
TL;DR: The generalized-growth model is introduced to characterize the early growth profile of outbreaks and estimate the effective reproduction number, with no need for explicit assumptions about the shape of epidemic growth, and provides a compelling argument for the unexpected extinction of certain emerging disease outbreaks during the early ascending phase.
Abstract: Early estimates of the transmission potential of emerging and re-emerging infections are increasingly used to inform public health authorities on the level of risk posed by outbreaks. Existing methods to estimate the reproduction number generally assume exponential growth in case incidence in the first few disease generations, before susceptible depletion sets in. In reality, outbreaks can display sub-exponential (i.e., polynomial) growth in the first few disease generations, owing to clustering in contact patterns, spatial effects, inhomogeneous mixing, reactive behavior changes, or other mechanisms. Here, we introduce the generalized growth model to characterize the early growth profile of outbreaks and estimate the effective reproduction number, with no need for explicit assumptions about the shape of epidemic growth. We demonstrate this phenomenological approach using analytical results and simulations from mechanistic models, and provide validation against a range of empirical disease datasets. Our results suggest that sub-exponential growth in the early phase of an epidemic is the rule rather than the exception. For empirical outbreaks, the generalized-growth model consistently outperforms the exponential model for a variety of directly and indirectly transmitted disease datasets, with model estimates supporting sub-exponential growth dynamics. The rapid decline in effective reproduction number predicted by analytical results and observed in real and synthetic datasets within 3-5 disease generations contrasts with the expectation of invariant reproduction number in epidemics obeying exponential growth. Overall, our approach promotes a more reliable and data-driven characterization of the early epidemic phase, which is important for accurate estimation of the reproduction number and prediction of disease impact.

89 citations
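
The abstract above does not state the model equation. The form commonly used for the generalized-growth model is dC/dt = r * C^p, where C is the cumulative case count, r a growth rate and p a deceleration parameter between 0 and 1. The sketch below assumes that form and evaluates its closed-form solution. The parameter values are hypothetical.

```python
import math


def generalized_growth(t, r, p, c0):
    """Cumulative cases C(t) solving dC/dt = r * C**p with C(0) = c0.

    Assumes the commonly used form of the generalized-growth model;
    p = 1 gives exponential growth, p < 1 gives polynomial growth.
    """
    if p == 1.0:
        return c0 * math.exp(r * t)
    base = c0 ** (1.0 - p) + r * (1.0 - p) * t
    return base ** (1.0 / (1.0 - p))


# Compare sub-exponential (p < 1) and exponential (p = 1) growth.
for p in (0.5, 0.8, 1.0):
    curve = [generalized_growth(t, r=0.5, p=p, c0=1.0) for t in (0, 5, 10)]
    print(p, [round(c, 1) for c in curve])
```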


Journal ArticleDOI
TL;DR: This paper investigates the hypothesis that information attached to miRNAs and diseases can be revealed by distributional semantics, and proposes representing distributional information on miRNAs and diseases in a high-dimensional vector space.
Abstract: MicroRNAs play critical roles in many physiological processes. Their dysregulations are also closely related to the development and progression of various human diseases, including cancer. Therefore, identifying new microRNAs that are associated with diseases contributes to a better understanding of pathogenicity mechanisms. MicroRNAs also represent a tremendous opportunity in biotechnology for early diagnosis. To date, several in silico methods have been developed to address the issue of microRNA-disease association prediction. However, these methods have various limitations. In this study, we investigate the hypothesis that information attached to miRNAs and diseases can be revealed by distributional semantics. Our basic approach is to represent distributional information on miRNAs and diseases in a high-dimensional vector space and to define associations between miRNAs and diseases in terms of their vector similarity. Cross validations performed on a dataset of known miRNA-disease associations demonstrate the excellent performance of our method. Moreover, the case study focused on breast cancer confirms the ability of our method to discover new disease-miRNA associations and to identify putative false associations reported in databases.

88 citations
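
The abstract above defines associations in terms of vector similarity without naming the measure. As a minimal sketch, the example below assumes cosine similarity and ranks candidate miRNAs against a disease vector. The names miR-A and miR-B and all vector values are hypothetical toy data, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical low-dimensional vectors; real ones would be high-dimensional.
mirna_vectors = {"miR-A": [0.9, 0.1, 0.0], "miR-B": [0.0, 0.2, 0.9]}
disease_vector = [0.8, 0.2, 0.1]

# Rank miRNAs by similarity to the disease, most similar first.
ranked = sorted(
    mirna_vectors,
    key=lambda name: cosine_similarity(mirna_vectors[name], disease_vector),
    reverse=True,
)
print(ranked)
```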


Posted Content
TL;DR: Biologists and NMR spectroscopists can easily interact and develop synergies by visualizing the NMR spectra along with their corresponding experimental-factor levels, thus setting a bridge between experimental design and subsequent statistical analyses.
Abstract: In NMR-based metabolomics, 1D spectra processing often requires an expert eye for disentangling intertwined peaks, and so far the best way is to proceed interactively with a spectra viewer. NMRProcFlow is a graphical and interactive 1D NMR (1H & 13C) spectra processing tool dedicated to metabolic fingerprinting and targeted metabolomics, covering all spectra processing steps including baseline correction, chemical shift calibration and alignment. It does not require programming skills. Biologists and NMR spectroscopists can easily interact and develop synergies by visualizing the NMR spectra along with their corresponding experimental-factor levels, thus setting a bridge between experimental design and subsequent statistical analyses.

86 citations


Journal ArticleDOI
TL;DR: The proposed deep neural network, DeepVS, uses the output of a docking program and learns how to extract relevant features from basic data such as atom and residue types obtained from protein-ligand complexes. It achieves an AUC ROC of 0.81, which the authors believe is the best AUC reported so far for virtual screening using the 40 receptors from the DUD.
Abstract: In this work, we propose a deep learning approach to improve docking-based virtual screening. The introduced deep neural network, DeepVS, uses the output of a docking program and learns how to extract relevant features from basic data such as atom and residue types obtained from protein-ligand complexes. Our approach introduces the use of atom and amino acid embeddings and implements an effective way of creating distributed vector representations of protein-ligand complexes by modeling the compound as a set of atom contexts that is further processed by a convolutional layer. One of the main advantages of the proposed method is that it does not require feature engineering. We evaluate DeepVS on the Directory of Useful Decoys (DUD), using the output of two docking programs: Autodock Vina 1.1.2 and Dock 6.6. Using a strict evaluation with leave-one-out cross-validation, DeepVS outperforms the docking programs in both AUC ROC and enrichment factor. Moreover, using the output of Autodock Vina 1.1.2, DeepVS achieves an AUC ROC of 0.81, which, to the best of our knowledge, is the best AUC reported so far for virtual screening using the 40 receptors from DUD.

79 citations


Posted Content
TL;DR: The good performance of Qprob shows that this new probability density distribution based method is effective for protein single-model quality assessment and is useful for protein structure prediction.
Abstract: Protein quality assessment (QA) has played an important role in protein structure prediction. We developed a novel single-model quality assessment method, Qprob. Qprob calculates the absolute error for each protein feature value against the true quality scores (i.e. GDT-TS scores) of protein structural models, and uses them to estimate its probability density distribution for quality assessment. Qprob was blindly tested in the 11th Critical Assessment of Techniques for Protein Structure Prediction (CASP11) as the MULTICOM-NOVEL server. The official CASP result shows that Qprob ranks as one of the top single-model QA methods. In addition, Qprob contributes to our protein tertiary structure predictor MULTICOM, which is officially ranked 3rd out of 143 predictors. The good performance shows that Qprob is good at assessing the quality of models of hard targets. These results demonstrate that this new probability density distribution based method is effective for protein single-model quality assessment and is useful for protein structure prediction. The webserver and software packages of Qprob are available at: this http URL.

73 citations


Journal ArticleDOI
TL;DR: In this article, the authors take the single enzyme perspective and rebuild the theory of enzymatic inhibition from the bottom up, finding that accounting for multi-conformational enzyme structure and intrinsic randomness cannot undermine the validity of classical results in the case of competitive inhibition; but that it should strongly change our view on the uncompetitive and mixed modes of inhibition.
Abstract: The classical theory of enzymatic inhibition aims to quantitatively describe the effect of certain molecules -- called inhibitors -- on the progression of enzymatic reactions, but growing signs indicate that it must be revised to keep pace with the single-molecule revolution that is sweeping through the sciences. Here, we take the single enzyme perspective and rebuild the theory of enzymatic inhibition from the bottom up. We find that accounting for multi-conformational enzyme structure and intrinsic randomness cannot undermine the validity of classical results in the case of competitive inhibition; but that it should strongly change our view on the uncompetitive and mixed modes of inhibition. There, stochastic fluctuations on the single-enzyme level could give rise to inhibitor-activator duality -- a phenomenon in which, under some conditions, the introduction of a molecule whose binding shuts down enzymatic catalysis will counterintuitively work to facilitate product formation. We state -- in terms of experimentally measurable quantities -- a mathematical condition for the emergence of inhibitor-activator duality, and propose that it could explain why certain molecules that act as inhibitors when substrate concentrations are high elicit a non-monotonic dose response when substrate concentrations are low. The fundamental and practical implications of our findings are thoroughly discussed.

58 citations


Posted Content
TL;DR: Methodological details used by WHO in 2015 to estimate TB incidence, prevalence and mortality are described and methods to derive MDR-TB burden indicators are detailed.
Abstract: This paper describes methodological details used by WHO in 2015 to estimate TB incidence, prevalence and mortality. Incidence and mortality are disaggregated by HIV status, age and sex. Methods to derive MDR-TB burden indicators are detailed. Four main methods were used to derive incidence: (i) case notification data combined with expert opinion about case detection gaps (120 countries representing 51% of global incidence); (ii) results from national TB prevalence surveys (19 countries, 46% of global incidence); (iii) notifications in high-income countries adjusted by a standard factor to account for under-reporting and underdiagnosis (73 countries, 3% of global incidence) and (iv) capture-recapture modelling (5 countries, 0.5% of global incidence). Prevalence was obtained from results of national prevalence surveys (21 countries, representing 69% of global prevalence). In other countries, prevalence was estimated from incidence and disease duration. Mortality was obtained from national vital registration systems or mortality surveys in 129 countries (43% of global HIV-negative TB mortality). In other countries, mortality was derived indirectly from incidence and case fatality ratio.

Posted Content
TL;DR: It is shown that complex models such as random forest can be made interpretable using the Model-Agnostic Explanations algorithm. ICU mortality was predicted with 80% balanced accuracy, and the relative effect of the features on the prediction could be interpreted at the individual level.
Abstract: Interpretability of machine learning models is critical for data-driven precision medicine efforts. However, highly predictive models are generally complex and are difficult to interpret. Here, using the Model-Agnostic Explanations algorithm, we show that complex models such as random forest can be made interpretable. Using the MIMIC-II dataset, we successfully predicted ICU mortality with 80% balanced accuracy and were also able to interpret the relative effect of the features on prediction at the individual level.
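
Balanced accuracy, the metric reported in the abstract above, is usually defined as the mean of sensitivity and specificity. The sketch below assumes that definition and uses hypothetical toy labels to show how it differs from plain accuracy when classes are imbalanced. It also assumes both classes occur in the true labels.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    return (sensitivity + specificity) / 2.0


# Hypothetical imbalanced example: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(accuracy, bal_acc)
```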

Posted Content
TL;DR: A novel method is proposed to extract the fetal ECG signal from the single channel maternal abdominal ECG signal, without any additional measurement, and could be applied to solve other detection and source separation problems.
Abstract: The multiple fundamental frequency detection problem and the source separation problem from a single-channel signal containing multiple oscillatory components and a nonstationary noise are both challenging tasks. To extract the fetal electrocardiogram (ECG) from a single-lead maternal abdominal ECG, we face both challenges. In this paper, we propose a novel method to extract the fetal ECG signal from the single channel maternal abdominal ECG signal, without any additional measurement. The algorithm is composed of three main ingredients. First, the maternal and fetal heart rates are estimated by the de-shape short time Fourier transform, which is a recently proposed nonlinear time-frequency analysis technique; second, the beat tracking technique is applied to accurately obtain the maternal and fetal R peaks; third, the maternal and fetal ECG waveforms are established by the nonlocal median. The algorithm is evaluated on a simulated fetal ECG signal database (fecgsyn database), and tested on two real databases with the annotation provided by experts (adfecgdb database and CinC2013 database). In general, the algorithm could be applied to solve other detection and source separation problems, and reconstruct the time-varying wave-shape function of each oscillatory component.

Posted Content
TL;DR: In this article, an extension of the edge-based compartmental model for epidemics with arbitrary distributions of transmission and recovery times is presented, and a new pairwise-like model with Markovian transmission and an arbitrary recovery period is derived.
Abstract: This paper presents a novel extension of the edge-based compartmental model for epidemics with arbitrary distributions of transmission and recovery times. Using the message passing approach we also derive a new pairwise-like model for epidemics with Markovian transmission and an arbitrary recovery period. The new pairwise-like model allows one to formally prove that the message passing and edge-based compartmental models are equivalent in the case of Markovian transmission and arbitrary recovery processes. The edge-based and message passing models are conjectured to also be equivalent for arbitrary transmission processes; we show the first step of a full proof of this. The new pairwise-like model encompasses many existing well-known models that can be obtained by appropriate reductions. It is also amenable to a relatively straightforward numerical implementation. We test the theoretical results by comparing the numerical solutions of the various pairwise-like models to results based on explicit stochastic network simulations.

Posted Content
TL;DR: This work presents a viable real-time solution for connectomics: a multi-pass pipeline optimized for shared-memory multicore systems, capable of processing data at near the terabyte-per-hour pace of multi-beam electron microscopes. It also demonstrates the accuracy of a sparse slow-pass reconstruction algorithm and suggests new methods for detecting morphological errors.
Abstract: The field of connectomics faces unprecedented "big data" challenges. To reconstruct neuronal connectivity, automated pixel-level segmentation is required for petabytes of streaming electron microscopy data. Existing algorithms provide relatively good accuracy but are unacceptably slow, and would require years to extract connectivity graphs from even a single cubic millimeter of neural tissue. Here we present a viable real-time solution, a multi-pass pipeline optimized for shared-memory multicore systems, capable of processing data at near the terabyte-per-hour pace of multi-beam electron microscopes. The pipeline makes an initial fast-pass over the data, and then makes a second slow-pass to iteratively correct errors in the output of the fast-pass. We demonstrate the accuracy of a sparse slow-pass reconstruction algorithm and suggest new methods for detecting morphological errors. Our fast-pass approach provided many algorithmic challenges, including the design and implementation of novel shallow convolutional neural nets and the parallelization of watershed and object-merging techniques. We use it to reconstruct, from image stack to skeletons, the full dataset of Kasthuri et al. (463 GB capturing 120,000 cubic microns) in a matter of hours on a single multicore machine rather than the weeks it has taken in the past on much larger distributed systems.

Posted Content
TL;DR: A deep convolutional neural network is developed and applied to thoracic CT images for the classification of lung nodules, and it is found that simplistic geometric nodules cannot capture the important features of lung nodules.
Abstract: Deep learning, as a promising new area of machine learning, has attracted rapidly increasing attention in the field of medical imaging. Compared to conventional machine learning methods, deep learning requires no hand-tuned feature extractor, and has shown superior performance in many visual object recognition applications. In this study, we develop a deep convolutional neural network (CNN) and apply it to thoracic CT images for the classification of lung nodules. We present the CNN architecture and classification accuracy for the original images of lung nodules. In order to understand the features of lung nodules, we further construct new datasets, based on the combination of artificial geometric nodules and some transformations of the original images, as well as a stochastic nodule shape model. It is found that simplistic geometric nodules cannot capture the important features of lung nodules.

Posted Content
TL;DR: It is suggested that stochastic resonance (SR) plays a key role in both short- and long-term plasticity within the auditory system and that SR is the ultimate cause of neuronal hyperactivity and tinnitus.
Abstract: Subjective tinnitus (ST) is generally assumed to be a consequence of hearing loss (HL). In animal studies, acoustic trauma can lead to behavioral signs of ST; in human studies, ST patients without increased hearing thresholds were found to suffer from so-called hidden HL. Additionally, ST is correlated with pathologically increased spontaneous firing rates and neuronal hyperactivity (NH) along the auditory pathway. Homeostatic plasticity (HP) has been proposed as a compensation mechanism leading to the development of NH, arguing that after HL initially decreased mean firing rates of neurons are subsequently restored by increased spontaneous rates. However, all HP models fundamentally lack explanatory power, since the function of keeping the mean firing rate constant remains elusive, as does the benefit this might have in terms of information processing. Furthermore, the neural circuitry able to perform the comparison of preferred with actual mean firing rate remains unclear. Here we propose an entirely new interpretation of the ST-related development of NH in terms of information theory. We suggest that stochastic resonance (SR) plays a key role in short- and long-term plasticity within the auditory system and is the ultimate cause of NH and ST. SR has been found to be ubiquitous in neuroscience and refers to the phenomenon that sub-threshold, unperceivable signals can be transmitted by adding noise to sensor input. We argue that after HL, SR serves to lift signals above the increased hearing threshold, hence subsequently decreasing thresholds again. The increased amount of internal noise is the correlate of the NH, which finally leads to the development of ST, due to neuronal plasticity along the auditory pathway. We demonstrate the plausibility of our hypothesis by using a computational model and provide exemplary findings of human and animal studies that are consistent with our model.
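
The stochastic resonance effect described in the abstract above can be illustrated with a simple threshold detector. This is a toy sketch only and is not the authors' computational model. The sine-wave signal, the threshold, the noise level and the function name are all hypothetical choices.

```python
import math
import random


def detections(noise_sd, threshold=1.0, amplitude=0.8, n=2000, period=100, seed=1):
    """Return the signal values at samples where signal + noise exceeds the threshold."""
    rng = random.Random(seed)
    hits = []
    for i in range(n):
        signal = amplitude * math.sin(2.0 * math.pi * i / period)
        observed = signal + rng.gauss(0.0, noise_sd)
        if observed > threshold:
            hits.append(signal)
    return hits


# The signal peaks at 0.8, below the threshold of 1.0, so without noise
# the detector never fires.
silent = detections(noise_sd=0.0)

# With moderate noise, crossings occur, mostly near the signal peaks,
# so the detector output carries information about the hidden signal.
noisy = detections(noise_sd=0.3)
print(len(silent), len(noisy), sum(noisy) / len(noisy))
```

A positive mean signal value at the detection times indicates that the crossings cluster around the peaks of the sub-threshold signal rather than occurring at random phases.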

Journal ArticleDOI
TL;DR: In this paper, the collective motion and the response to visual stimuli in two morphologically different strains (TL and AB) of zebrafish are analysed, providing new insight into the need to take into account the individual variability of zebrafish strains when studying collective behaviour.
Abstract: Recent studies show differences in individual motion and shoaling tendency between strains of the same species. Here, we analyse the collective motion and the response to visual stimuli in two morphologically different strains (TL and AB) of zebrafish. For both strains, we observe 10 groups of 5 and 10 zebrafish swimming freely in a large experimental tank with two identical attractive landmarks (cylinders or disks) for one hour. We track the positions of the fish by an automated tracking method and compute several metrics at the group level. First, the probability of presence shows that both strains avoid free space and are more likely to swim in the vicinity of the walls of the tank and the attractive landmarks. Second, the analysis of landmark occupancy shows that AB zebrafish are more present in their vicinity than TL zebrafish and that both strains regularly transit from one landmark to the other with no preference over the long duration. Finally, TL zebrafish show a higher cohesion than AB zebrafish. Thus, the landmarks and the duration of the replicates reveal collective behavioural variabilities among different strains of zebrafish. These results provide a new insight into the need to take into account individual variability of zebrafish strains for studying collective behaviour.

Posted Content
TL;DR: The ability to combine direct observations of animal activity with statistical models that account for the features of accelerometer data offers a new way to quantify animal behaviour and energetic expenditure, and to deepen insights into individual behaviour as a constituent of populations and ecosystems.
Abstract: Use of accelerometers is now widespread within animal biotelemetry as they provide a means of measuring an animal's activity in a meaningful and quantitative way where direct observation is not possible. In sequential acceleration data there is a natural dependence between observations of movement or behaviour, a fact that has been largely ignored in most analyses. Analyses of acceleration data where serial dependence has been explicitly modelled have largely relied on hidden Markov models (HMMs). Depending on the aim of an analysis, either a supervised or an unsupervised learning approach can be applied. In a supervised context, an HMM is trained to classify unlabelled acceleration data into a finite set of pre-specified categories, whereas we will demonstrate how an unsupervised learning approach can be used to infer new aspects of animal behaviour. We will provide the details necessary to implement and assess an HMM in both the supervised and unsupervised context, and discuss the data requirements of each case. We outline two applications to marine and aerial systems (sharks and eagles) taking the unsupervised approach, which is more readily applicable to animal activity measured in the field. HMMs were used to infer the effects of temporal, atmospheric and tidal inputs on animal behaviour. Animal accelerometer data allow ecologists to identify important correlates and drivers of animal activity (and hence behaviour). The HMM framework is well suited to deal with the main features commonly observed in accelerometer data. The ability to combine direct observations of animal activity with statistical models that account for the features of accelerometer data offers a new way to quantify animal behaviour and energetic expenditure, and to deepen our insights into individual behaviour as a constituent of populations and ecosystems.
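
As a minimal sketch of how an HMM assigns behavioural states to a serially dependent activity series, the example below decodes the most likely state sequence with the Viterbi algorithm for a two-state model with Gaussian emissions. All parameter values and the activity series are hypothetical. In practice the parameters would be estimated from data, a step this sketch leaves out, and the paper's own models may differ.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def viterbi(obs, means, sds, log_trans, log_init):
    """Most likely hidden state sequence for a Gaussian-emission HMM."""
    n_states = len(means)
    # score[s] is the best log-probability of any path ending in state s.
    score = [log_init[s] + gaussian_logpdf(obs[0], means[s], sds[s])
             for s in range(n_states)]
    back = []
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda q: prev[q] + log_trans[q][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + gaussian_logpdf(x, means[s], sds[s]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: state 0 = low activity, state 1 = high activity.
means, sds = [0.15, 2.0], [0.5, 0.5]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]

activity = [0.1, 0.2, 0.1, 2.0, 2.2, 1.9]  # toy acceleration summary values
states = viterbi(activity, means, sds, log_trans, log_init)
print(states)
```

The high probability of staying in the same state is what encodes the serial dependence between successive observations.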

Posted Content
TL;DR: DeepPicker employs a cross-molecule training strategy to capture common features of particles from previously analyzed micrographs, and thus does not require any human intervention during particle picking.
Abstract: Particle picking is a time-consuming step in single-particle analysis and often requires significant interventions from users, which has become a bottleneck for future automated electron cryo-microscopy (cryo-EM). Here we report a deep learning framework, called DeepPicker, to address this problem and fill the current gaps toward a fully automated cryo-EM pipeline. DeepPicker employs a novel cross-molecule training strategy to capture common features of particles from previously analyzed micrographs, and thus does not require any human intervention during particle picking. Tests on the recently published cryo-EM data of three complexes have demonstrated that our deep-learning-based scheme can successfully accomplish the human-level particle picking process and identify a sufficient number of particles that are comparable to those picked manually by human experts. These results indicate that DeepPicker can provide a practically useful tool to significantly reduce the time and manual effort spent in single-particle analysis and thus greatly facilitate high-resolution cryo-EM structure determination.

Journal ArticleDOI
TL;DR: In this article, the authors use persistent homology with a weight rank clique filtration to study functional networks and use persistence landscapes to interpret their results, showing that the position of the features in a filtration can sometimes play a more vital role than persistence in the interpretation of topological features.
Abstract: We use topological data analysis to study "functional networks" that we construct from time-series data from both experimental and synthetic sources. We use persistent homology with a weight rank clique filtration to gain insights into these functional networks, and we use persistence landscapes to interpret our results. Our first example uses time-series output from networks of coupled Kuramoto oscillators. Our second example consists of biological data in the form of functional magnetic resonance imaging (fMRI) data that was acquired from human subjects during a simple motor-learning task in which subjects were monitored on three days in a five-day period. With these examples, we demonstrate that (1) using persistent homology to study functional networks provides fascinating insights into their properties and (2) the position of the features in a filtration can sometimes play a more vital role than persistence in the interpretation of topological features, even though conventionally the latter is used to distinguish between signal and noise. We find that persistent homology can detect differences in synchronization patterns in our data sets over time, giving insight both on changes in community structure in the networks and on increased synchronization between brain regions that form loops in a functional network during motor learning. For the motor-learning data, persistence landscapes also reveal that on average the majority of changes in the network loops take place on the second of the three days of the learning process.

Posted Content
TL;DR: In this article, the authors propose a model for reproducible research called "science in the cloud" which leverages existing technologies and standards, such as containers, cloud computing and cloud data services.
Abstract: Modern technologies are enabling scientists to collect extraordinary amounts of complex and sophisticated data across a huge range of scales like never before. With this onslaught of data, we can allow the focal point to shift towards answering the question of how we can analyze and understand the massive amounts of data in front of us. Unfortunately, lack of standardized sharing mechanisms and practices often makes reproducing or extending scientific results very difficult. With the creation of data organization structures and tools which drastically improve code portability, we now have the opportunity to design such a framework for communicating extensible scientific discoveries. Our proposed solution leverages these existing technologies and standards, and provides an accessible and extensible model for reproducible research, called "science in the cloud" (sic). Exploiting scientific containers, cloud computing and cloud data services, we show the capability to launch a computer in the cloud and run a web service which enables intimate interaction with the tools and data presented. We hope this model will inspire the community to produce reproducible and, importantly, extensible results which will enable us to collectively accelerate the rate at which scientific breakthroughs are discovered, replicated, and extended.

Posted Content
TL;DR: This work proposes to use domains of potential phenotypes for the search of an optimum, taking into account correlations between traits to ground numerical experiments in biological reality, and shows that it could improve trait-based breeding methods with paths describing desirable trait modifications both in direction and intensity.
Abstract: Simulation models can be used to predict the effect on plant performance of plant trait modifications resulting from genetic variation (and its interaction with the environment), and are hence gaining momentum in the plant breeding process. Optimization methods complement those models in finding ideal values of a set of plant traits, maximizing a defined criterion (e.g. crop yield, light interception). However, using such methods carelessly may lead to misleading solutions, missing the appropriate traits or phenotypes. Therefore, we propose to use domains of potential phenotypes for the search of an optimum, taking into account correlations between traits to ground numerical experiments in biological reality. In addition, we propose a multi-objective optimization formulation using a metric of performance returned by the numerical model and a metric of feasibility based on field observations. This can be solved with standard optimization algorithms without any model modification. We applied our approach to two contrasting simulation models: a process-based crop model of sunflower and a structural-functional plant model of apple tree. In both cases, we were able to characterize key plant traits and a continuum of optimal solutions, ranging from the most feasible to the most efficient. The present study thus provides a proof of concept for this approach and shows that it could improve trait-based breeding methods with paths describing desirable trait modifications both in direction and intensity.

Journal ArticleDOI
TL;DR: In this paper, the authors present a framework to model gene transcription in populations of cells with time-varying (stochastic or deterministic) transcription and degradation rates, which can be understood as upstream cellular drives representing the effect of different aspects of the cellular environment.
Abstract: Gene transcription is a highly stochastic and dynamic process. As a result, the mRNA copy number of a given gene is heterogeneous both between cells and across time. We present a framework to model gene transcription in populations of cells with time-varying (stochastic or deterministic) transcription and degradation rates. Such rates can be understood as upstream cellular drives representing the effect of different aspects of the cellular environment. We show that the full solution of the master equation contains two components: a model-specific, upstream effective drive, which encapsulates the effect of cellular drives (e.g., entrainment, periodicity or promoter randomness), and a downstream transcriptional Poissonian part, which is common to all models. Our analytical framework treats cell-to-cell and dynamic variability consistently, unifying several approaches in the literature. We apply the obtained solution to characterise different models of experimental relevance, and to explain the influence on gene transcription of synchrony, stationarity, ergodicity, as well as the effect of time-scales and other dynamic characteristics of drives. We also show how the solution can be applied to the analysis of noise sources in single-cell data, and to reduce the computational cost of stochastic simulations.
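
The two-component structure described above can be illustrated with a Poisson mixture: a Poisson distribution whose rate is itself drawn from an upstream distribution. The sketch below is only an illustration under stated assumptions, not the authors' solution. The two-level drive, its values and weights, and the helper names (`poisson_pmf`, `mixture_pmf`, `mixture_moments`) are all hypothetical.

```python
import math


def poisson_pmf(n, lam):
    """Poisson probability of observing n copies at rate lam (log form for stability)."""
    if lam == 0:
        return 1.0 if n == 0 else 0.0
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def mixture_pmf(n, drive_values, drive_weights):
    """Poisson part averaged over a discrete (hypothetical) upstream drive distribution."""
    return sum(w * poisson_pmf(n, lam) for lam, w in zip(drive_values, drive_weights))


def mixture_moments(drive_values, drive_weights, n_max=400):
    """Mean and variance of the mixture, summed over copy numbers 0..n_max."""
    pmf = [mixture_pmf(n, drive_values, drive_weights) for n in range(n_max + 1)]
    mean = sum(n * p for n, p in enumerate(pmf))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
    return mean, var


# Assumed toy drive: half the cells sit at a low rate, half at a high rate.
drive_values = [2.0, 20.0]
drive_weights = [0.5, 0.5]

mean, var = mixture_moments(drive_values, drive_weights)
fano = var / mean  # equals 1 for a pure Poisson; larger when the drive varies
print(f"mean={mean:.3f} variance={var:.3f} Fano factor={fano:.3f}")
```

For any Poisson mixture the variance equals the mean plus the variance of the rate, so variability in the drive shows up as a Fano factor above 1.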

Journal ArticleDOI
TL;DR: The HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN), is discussed, and it is implemented and tested on the INTEL XEON E7xxx and E5xxx family of the CRESCO structure.
Abstract: Background: The huge quantity of data produced in biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard-to-solve parallelization and load balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), that form a probability distribution, allowing the selection of the term literals. Its great versatility makes U-BRAIN applicable in many of the fields in which there are data to be analyzed. However, the memory and the execution time required to run it are of order O(n^3) and O(n^5), respectively, so the algorithm is unaffordable for huge data sets.

Posted Content
TL;DR: Simulation results indicate that the MPI-based, parallel operator-splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality.
Abstract: Stochastic, spatial reaction-diffusion simulations have been widely used in systems biology and computational neuroscience. However, the increasing scale and complexity of simulated models and morphologies have exceeded the capacity of any serial implementation. This has led to the development of parallel solutions that benefit from the boost in performance of modern large-scale supercomputers. In this paper, we describe an MPI-based, parallel Operator-Splitting implementation for stochastic spatial reaction-diffusion simulations with irregular tetrahedral meshes. The performance of our implementation is first examined and analyzed with simulations of a simple model. We then demonstrate its usage in real-world research by simulating the reaction-diffusion components of a published calcium burst model in both Purkinje neuron sub-branch and full dendrite morphologies. Simulation results indicate that our implementation is capable of achieving super-linear speedup for balanced loading simulations with reasonable molecule density and mesh quality. In the best scenario, a parallel simulation with 2000 processes achieves a speedup of more than 3600 times relative to its serial SSA counterpart and of more than 20 times relative to a parallel simulation with 100 processes. While simulation performance is affected by unbalanced loading, a substantial speedup can still be observed without any special treatment.
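
The super-linear claim can be checked with simple arithmetic: parallel efficiency is the speedup divided by the factor by which the process count grew, and a value above 1 means super-linear scaling. The sketch below uses the figures quoted in the abstract as lower bounds, since both are reported as "more than". The function name `parallel_efficiency` is a hypothetical helper, not part of the described implementation.

```python
def parallel_efficiency(speedup, n_processes, baseline_processes=1):
    """Speedup divided by the increase in process count (values > 1 are super-linear)."""
    if n_processes <= 0 or baseline_processes <= 0:
        raise ValueError("process counts must be positive")
    return speedup / (n_processes / baseline_processes)


# Lower bounds taken from the abstract's best-case scenario.
eff_vs_serial = parallel_efficiency(3600, 2000)                    # vs. serial SSA run
eff_vs_100 = parallel_efficiency(20, 2000, baseline_processes=100)  # vs. 100-process run

print(f"efficiency vs serial SSA: at least {eff_vs_serial:.2f}")
print(f"efficiency vs 100 processes: at least {eff_vs_100:.2f}")
```

Note that the serial baseline is an SSA implementation rather than the operator-splitting method, so the first figure compares different algorithms as well as different process counts.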

Posted Content
TL;DR: The framework is provided as a ready-to-use R package to easily test the approach, providing a versatile unified framework for partitioning biological diversity and permitting the direct comparison of subcommunities.
Abstract: Diversity measurement underpins the study of biological systems, but measures used vary across disciplines. Despite their common use and broad utility, no unified framework has emerged for measuring, comparing and partitioning diversity. The introduction of information theory into diversity measurement has laid the foundations, but the framework is incomplete without the ability to partition diversity, which is central to fundamental questions across the life sciences: How do we prioritise communities for conservation? How do we identify reservoirs and sources of pathogenic organisms? How do we measure ecological disturbance arising from climate change? The lack of a common framework means that diversity measures from different fields have conflicting fundamental properties, allowing conclusions reached to depend on the measure chosen. This conflict is unnecessary and unhelpful. A mathematically consistent framework would transform disparate fields by delivering scientific insights in a common language. It would also allow the transfer of theoretical and practical developments between fields. We meet this need, providing a versatile unified framework for partitioning biological diversity. It encompasses any kind of similarity between individuals, from functional to genetic, allowing comparisons between qualitatively different kinds of diversity. Where existing partitioning measures aggregate information across the whole population, our approach permits the direct comparison of subcommunities, allowing us to pinpoint distinct, diverse or representative subcommunities and investigate population substructure. The framework is provided as a ready-to-use R package to easily test our approach.

Posted Content
TL;DR: In this article, the authors examined established formulas for the estimation of intrinsic and extrinsic noise and provided interpretations of them in terms of a hierarchical model, which allows them to derive corrections that minimize the mean squared error, an objective that may be important when sample sizes are small.
Abstract: Gene expression is stochastic and displays variation ("noise") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identical gene pairs in single cells. We examine established formulas for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model. This allows us to derive corrections that minimize the mean squared error, an objective that may be important when sample sizes are small. The statistical framework also highlights the need for quantile normalization, and provides justification for the use of the sample correlation between the two reporter expression levels to estimate the percent contribution of extrinsic noise to the total noise. Finally, we provide a geometric interpretation of these results that clarifies the current interpretation.
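
As a rough illustration of the decomposition, the sketch below computes the plug-in estimators commonly used for two-reporter data: squared intrinsic, extrinsic and total noise, each normalized by the product of the reporter means. These are the widely used uncorrected formulas, not the mean-squared-error corrections derived in the paper. The function name `noise_decomposition` and the example measurements are hypothetical.

```python
from statistics import fmean


def noise_decomposition(c, y):
    """Return (intrinsic, extrinsic, total) squared noise from paired reporter levels.

    c[i] and y[i] are the two reporter measurements taken in the same cell i.
    """
    if len(c) != len(y) or len(c) == 0:
        raise ValueError("c and y must be non-empty and of equal length")
    norm = fmean(c) * fmean(y)
    # Intrinsic: half the mean squared difference between the two reporters.
    intrinsic = fmean([(ci - yi) ** 2 for ci, yi in zip(c, y)]) / (2 * norm)
    # Extrinsic: covariance between the two reporters across cells.
    extrinsic = (fmean([ci * yi for ci, yi in zip(c, y)]) - norm) / norm
    # Total: by construction the sum of the two parts above.
    total = (fmean([ci**2 + yi**2 for ci, yi in zip(c, y)]) - 2 * norm) / (2 * norm)
    return intrinsic, extrinsic, total


# Made-up example measurements for five cells (illustration only).
reporter_c = [10.0, 12.0, 8.0, 15.0, 11.0]
reporter_y = [11.0, 13.0, 7.0, 14.0, 12.0]

eta_int, eta_ext, eta_tot = noise_decomposition(reporter_c, reporter_y)
print(f"intrinsic={eta_int:.5f} extrinsic={eta_ext:.5f} total={eta_tot:.5f}")
```

With these estimators the intrinsic and extrinsic parts add up exactly to the total, which mirrors the law of total variance used in the abstract.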

Posted Content
TL;DR: The results highlight how differences in the social responsiveness between individuals can give rise to leadership in free moving groups and demonstrate how the movement characteristics of groups depend on the spatial configuration of individuals within them.
Abstract: Collective movement can be achieved when individuals respond to the local movements and positions of their neighbours. Some individuals may disproportionately influence group movement if they occupy particular spatial positions in the group, for example, positions at the front of the group. We asked, therefore, what led individuals in moving pairs of fish (Gambusia holbrooki) to occupy a position in front of their partner. Individuals adjusted their speed and direction differently in response to their partner's position, resulting in individuals occupying different positions in the group. Individuals that were found most often at the front of the pair had greater mean changes in speed than their partner, and were less likely to turn towards their partner, compared to those individuals most often found at the back of the pair. The pair moved faster when led by the individual that was usually at the front. Our results highlight how differences in the social responsiveness between individuals can give rise to leadership in free moving groups. They also demonstrate how the movement characteristics of groups depend on the spatial configuration of individuals within them.

Posted Content
TL;DR: In this article, the authors analyse a high-frequency movement dataset for a group of grazing cattle and investigate their spatiotemporal patterns using a simple two-state 'stop-and-move' mobility model.
Abstract: In this study, we analyse a high-frequency movement dataset for a group of grazing cattle and investigate their spatiotemporal patterns using a simple two-state 'stop-and-move' mobility model. We find that the dispersal kernel in the moving state is best described by a mixture exponential distribution, indicating the hierarchical nature of the movement. On the other hand, the waiting time appears to be scale-invariant below a certain cut-off and is best described by a truncated power-law distribution, suggesting heterogeneous dynamics in the non-moving state. We explore possible explanations for the observed phenomena, covering factors that can play a role in the generation of mobility patterns, such as the context of grazing environment, the intrinsic decision-making mechanism or the energy status of different activities. In particular, we propose a new hypothesis that the underlying movement pattern can be attributed to the most probable observable energy status under the maximum entropy configuration. These results are not only valuable for modelling cattle movement but also provide new insights for understanding the underlying biological basis of grazing behaviour.