
Showing papers on "Software" published in 2015


Book
01 May 2015
TL;DR: An acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm, which computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment.
Abstract: Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the "multiple segment Viterbi" (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call "sparse rescaling". These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.

4,492 citations
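
A minimal sketch of the core idea that MSV builds on: scoring ungapped local alignment segments as the best-scoring contiguous run of per-position match scores along a diagonal. This is a deliberately simplified toy in plain Python, not HMMER3's striped vector-parallel implementation, and the tiny profile and scores below are made up for illustration.

```python
# Toy illustration (not HMMER3's SIMD code): score the best single ungapped
# local alignment segment between a small "profile" and a target sequence by
# applying Kadane's rule along each diagonal. MSV generalizes this idea to an
# optimal sum of multiple such segments.

def best_ungapped_segment(profile_scores, target):
    """profile_scores: dict mapping (position, residue) -> toy match score.
    target: string of residues. Returns the best single-segment score."""
    qlen = 1 + max(pos for pos, _ in profile_scores)
    best = 0.0
    # Each diagonal corresponds to a fixed offset between target and profile positions.
    for offset in range(-(qlen - 1), len(target)):
        run = 0.0
        for j in range(qlen):
            i = j + offset
            if 0 <= i < len(target):
                run = max(0.0, run + profile_scores.get((j, target[i]), -1.0))
                best = max(best, run)
    return best

# Example with a 3-position toy profile favouring the motif "ACG":
toy_profile = {(0, "A"): 2.0, (1, "C"): 2.0, (2, "G"): 2.0}
print(best_ungapped_segment(toy_profile, "TTACGTT"))  # -> 6.0
```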


Journal ArticleDOI
TL;DR: The ImageJ project is used as a case study of how open‐source software fosters its suites of software tools, making multitudes of image‐analysis technology easily accessible to the scientific community.
Abstract: Technology in microscopy advances rapidly, enabling increasingly affordable, faster, and more precise quantitative biomedical imaging, which necessitates correspondingly more-advanced image processing and analysis techniques. A wide range of software is available-from commercial to academic, special-purpose to Swiss army knife, small to large-but a key characteristic of software that is suitable for scientific inquiry is its accessibility. Open-source software is ideal for scientific endeavors because it can be freely inspected, modified, and redistributed; in particular, the open-software platform ImageJ has had a huge impact on the life sciences, and continues to do so. From its inception, ImageJ has grown significantly due largely to being freely available and its vibrant and helpful user community. Scientists as diverse as interested hobbyists, technical assistants, students, scientific staff, and advanced biology researchers use ImageJ on a daily basis, and exchange knowledge via its dedicated mailing list. Uses of ImageJ range from data visualization and teaching to advanced image processing and statistical analysis. The software's extensibility continues to attract biologists at all career stages as well as computer scientists who wish to effectively implement specific image-processing algorithms. In this review, we use the ImageJ project as a case study of how open-source software fosters its suites of software tools, making multitudes of image-analysis technology easily accessible to the scientific community. We specifically explore what makes ImageJ so popular, how it impacts the life sciences, how it inspires other projects, and how it is self-influenced by coevolving projects within the ImageJ ecosystem.

2,081 citations


Journal ArticleDOI
07 May 2015
TL;DR: This paper discusses aspects of recruiting subjects for economic laboratory experiments, and shows how the Online Recruitment System for Economic Experiments can help.
Abstract: This paper discusses aspects of recruiting subjects for economic laboratory experiments, and shows how the Online Recruitment System for Economic Experiments can help. The software package provides experimenters with a free, convenient, and very powerful tool to organize their experiments and sessions.

1,974 citations


Journal ArticleDOI
TL;DR: The RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines and offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job.
Abstract: The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

1,666 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present the LHAPDF-6 library, a ground-up re-engineering of the PDFLIB/LHAPDF paradigm for PDF access which removes all limits on use of concurrent PDF sets, massively reduces static memory requirements, offers improved CPU performance, and fixes fundamental bugs in multi-set access to PDF metadata.
Abstract: The Fortran LHAPDF library has been a long-term workhorse in particle physics, providing standardised access to parton density functions for experimental and phenomenological purposes alike, following on from the venerable PDFLIB package. During Run 1 of the LHC, however, several fundamental limitations in LHAPDF’s design have become deeply problematic, restricting the usability of the library for important physics-study procedures and providing dangerous avenues by which to silently obtain incorrect results. In this paper we present the LHAPDF 6 library, a ground-up re-engineering of the PDFLIB/LHAPDF paradigm for PDF access which removes all limits on use of concurrent PDF sets, massively reduces static memory requirements, offers improved CPU performance, and fixes fundamental bugs in multi-set access to PDF metadata. The new design, restricted for now to interpolated PDFs, uses centralised numerical routines and a powerful cascading metadata system to decouple software releases from provision of new PDF data and allow completely general parton content. More than 200 PDF sets have been migrated from LHAPDF 5 to the new universal data format, via a stringent quality control procedure. LHAPDF 6 is supported by many Monte Carlo generators and other physics programs, in some cases via a full set of compatibility routines, and is recommended for the demanding PDF access needs of LHC Run 2 and beyond.

1,563 citations
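
LHAPDF 6 ships Python bindings alongside its C++ API; a minimal sketch of querying an interpolated PDF is shown below. It assumes the lhapdf module and the named PDF set (CT14nlo, used here only as an example) are installed locally.

```python
# Minimal sketch of querying an interpolated PDF through LHAPDF 6's Python
# bindings (assumes the lhapdf module and the example set are installed).
import lhapdf

pdf = lhapdf.mkPDF("CT14nlo", 0)       # central member of an installed set (set name is an example)
x, Q = 1e-3, 100.0                     # momentum fraction and scale in GeV
gluon_xf = pdf.xfxQ(21, x, Q)          # x*f(x, Q) for PDG ID 21 (gluon)
alpha_s = pdf.alphasQ(Q)               # strong coupling associated with the set

print(f"x*g(x={x}, Q={Q} GeV) = {gluon_xf:.4g}, alpha_s(Q) = {alpha_s:.4f}")
```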


Journal ArticleDOI
TL;DR: MDTraj is a modern, lightweight, and fast software package for analyzing MD simulations that simplifies the analysis of MD data and connects these datasets with the modern interactive data science software ecosystem in Python.

1,480 citations
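
A minimal sketch of a typical MDTraj analysis in Python; the trajectory and topology file names are placeholders.

```python
# Minimal sketch of a typical MDTraj workflow (file names are placeholders).
import mdtraj as md

traj = md.load("trajectory.xtc", top="topology.pdb")    # load an MD trajectory
print(traj)                                             # frames, atoms, time step

ca = traj.topology.select("name CA")                    # alpha-carbon atom indices
rmsd = md.rmsd(traj, traj, frame=0, atom_indices=ca)    # RMSD to the first frame, in nm
rg = md.compute_rg(traj)                                # radius of gyration per frame
print(rmsd[:5], rg[:5])
```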


Journal ArticleDOI
TL;DR: Ncorr is an open-source subset-based 2D DIC package that amalgamates modern DIC algorithms proposed in the literature with additional enhancements and several applications of Ncorr that both validate it and showcase its capabilities are discussed.
Abstract: Digital Image Correlation (DIC) is an important and widely used non-contact technique for measuring material deformation. Considerable progress has been made in recent decades in both developing new experimental DIC techniques and in enhancing the performance of the relevant computational algorithms. Despite this progress, there is a distinct lack of freely available, high-quality, flexible DIC software. This paper documents a new DIC software package, Ncorr, that is meant to fill that crucial gap. Ncorr is an open-source subset-based 2D DIC package that amalgamates modern DIC algorithms proposed in the literature with additional enhancements. Several applications of Ncorr that both validate it and showcase its capabilities are discussed.

1,184 citations
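
Ncorr itself is a MATLAB package; the toy Python sketch below only illustrates the basic subset-matching idea behind 2D DIC (locating a reference subset in a deformed image by maximizing normalized cross-correlation over integer shifts), not Ncorr's actual algorithms, which add subpixel interpolation and more robust correlation criteria.

```python
# Toy illustration of subset matching in 2D DIC (integer shifts only, no
# boundary handling, not Ncorr's implementation).
import numpy as np

def match_subset(reference, deformed, center, half):
    """Return the integer displacement (du, dv) of the (2*half+1)^2 subset
    centered at `center` in `reference` that best matches `deformed`."""
    r, c = center
    sub = reference[r - half:r + half + 1, c - half:c + half + 1].astype(float)
    sub = (sub - sub.mean()) / (sub.std() + 1e-12)
    best, best_shift = -np.inf, (0, 0)
    for du in range(-5, 6):              # small search window for the toy example
        for dv in range(-5, 6):
            win = deformed[r + du - half:r + du + half + 1,
                           c + dv - half:c + dv + half + 1].astype(float)
            if win.shape != sub.shape:
                continue
            win = (win - win.mean()) / (win.std() + 1e-12)
            score = float((sub * win).mean())    # normalized cross-correlation
            if score > best:
                best, best_shift = score, (du, dv)
    return best_shift

# Synthetic check: shift a random image by (2, 3) pixels and recover the shift.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
defo = np.roll(ref, shift=(2, 3), axis=(0, 1))
print(match_subset(ref, defo, center=(32, 32), half=8))  # -> (2, 3)
```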


Journal ArticleDOI
TL;DR: Dioptas is a Python-based program for on-the-fly data processing and exploration of two-dimensional X-ray diffraction area detector data, specifically designed for the large amount of data collected at XRD beamlines at synchrotrons.
Abstract: The amount of data collected during synchrotron X-ray diffraction (XRD) experiments is constantly increasing. Most of the time, the data are collected with image detectors, which necessitates the use of image reduction/integration routines to extract structural information from measured XRD patterns. This step turns out to be a bottleneck in the data processing procedure due to a lack of suitable software packages. In particular, fast-running synchrotron experiments require online data reduction and analysis in real time so that experimental parameters can be adjusted interactively. Dioptas is a Python-based program for on-the-fly data processing and exploration of two-dimensional X-ray diffraction area detector data, specifically designed for the large amount of data collected at XRD beamlines at synchrotrons. Its fast data reduction algorithm and graphical data exploration capabilities make it ideal for online data processing during XRD experiments and batch post-processing of large numbers of images.

1,163 citations
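
Dioptas performs its image integration with the pyFAI azimuthal-integration library; the sketch below shows the corresponding 2D-to-1D reduction step done directly with pyFAI, with placeholder file names for the calibration and detector image.

```python
# Minimal sketch of the 2D -> 1D reduction step that Dioptas performs,
# expressed directly with pyFAI (file names are placeholders).
import fabio
import pyFAI

ai = pyFAI.load("calibration.poni")               # detector geometry from a pyFAI .poni file
image = fabio.open("detector_image.tif").data     # 2D area-detector image

# Azimuthal integration of the area-detector image into a 1D diffraction pattern
two_theta, intensity = ai.integrate1d(image, npt=2000, unit="2th_deg")
```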


Journal ArticleDOI
TL;DR: Qualimap 2 represents a next step in the QC analysis of HTS data, along with comprehensive single-sample analysis of alignment data, and includes new modes that allow simultaneous processing and comparison of multiple samples.
Abstract: Motivation: Detection of random errors and systematic biases is a crucial step of a robust pipeline for processing high-throughput sequencing (HTS) data. Bioinformatics software tools capable of performing this task are available, either for general analysis of HTS data or targeted to a specific sequencing technology. However, most of the existing QC instruments only allow processing of one sample at a time. Results: Qualimap 2 represents a next step in the QC analysis of HTS data. Along with comprehensive single-sample analysis of alignment data, it includes new modes that allow simultaneous processing and comparison of multiple samples. As with the first version, the new features are available via both graphical and command line interface. Additionally, it includes a large number of improvements proposed by the user community. Availability and implementation: The implementation of the software along with documentation is freely available at http://www.qualimap.org. Contact: meyer@mpiib-berlin.mpg.de. Supplementary information: Supplementary data are available at Bioinformatics online.

1,154 citations


Journal ArticleDOI
TL;DR: An extension of a set previously used by the CheckMol software, covering in addition heterocyclic compound classes and periodic table groups, is described; the article demonstrates that EFG can be efficiently used to develop and interpret structure-activity relationship models.
Abstract: The article describes a classification system termed "extended functional groups" (EFG), which is an extension of a set previously used by the CheckMol software and additionally covers heterocyclic compound classes and periodic table groups. The functional groups are defined as SMARTS patterns and are available as part of the ToxAlerts tool (http://ochem.eu/alerts) of the On-line CHEmical database and Modeling (OCHEM) environment platform. The article describes the motivation and the main ideas behind this extension and demonstrates that EFG can be efficiently used to develop and interpret structure-activity relationship models.

1,024 citations
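
Since the EFG definitions are SMARTS patterns, matching them against structures can be done with any cheminformatics toolkit that understands SMARTS. The sketch below uses RDKit (not part of the paper's OCHEM/ToxAlerts platform) and a generic carboxylic-acid SMARTS as the example pattern, not an actual EFG entry.

```python
# Sketch of matching a SMARTS-defined functional group against a structure
# with RDKit; the SMARTS below is a generic carboxylic-acid pattern used only
# for illustration, not an EFG definition from the paper.
from rdkit import Chem

carboxylic_acid = Chem.MolFromSmarts("[CX3](=O)[OX2H1]")
aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

print(aspirin.HasSubstructMatch(carboxylic_acid))    # True
print(aspirin.GetSubstructMatches(carboxylic_acid))  # matched atom indices
```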


Journal ArticleDOI
Pierre Hirel
TL;DR: Atomsk is a unified program that allows one to generate, convert, and transform atomic systems for the purposes of ab initio calculations, classical atomistic simulations, or visualization, in the areas of computational physics and chemistry.

Journal ArticleDOI
TL;DR: In such nonstationary environments, where the probabilistic properties of the data change over time, a non-adaptive model trained under the false stationarity assumption is bound to become obsolete in time, and perform sub-optimally at best, or fail catastrophically at worst.
Abstract: The prevalence of mobile phones, the internet-of-things technology, and networks of sensors has led to an enormous and ever increasing amount of data that are now more commonly available in a streaming fashion [1]-[5]. Often, it is assumed - either implicitly or explicitly - that the process generating such a stream of data is stationary, that is, the data are drawn from a fixed, albeit unknown probability distribution. In many real-world scenarios, however, such an assumption is simply not true, and the underlying process generating the data stream is characterized by an intrinsic nonstationary (or evolving or drifting) phenomenon. The nonstationarity can be due, for example, to seasonality or periodicity effects, changes in the users' habits or preferences, hardware or software faults affecting a cyber-physical system, thermal drifts or aging effects in sensors. In such nonstationary environments, where the probabilistic properties of the data change over time, a non-adaptive model trained under the false stationarity assumption is bound to become obsolete in time, and perform sub-optimally at best, or fail catastrophically at worst.
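
As a toy illustration of why a fixed model degrades in a nonstationary stream, the sketch below monitors a data stream and flags windows whose mean departs from a reference window. This simple z-style check is only meant to make the drift idea concrete; it is not a method proposed in the article.

```python
# Toy drift monitor: flag stream windows whose mean departs from a reference
# window (deliberately simple; not a method from the article).
import numpy as np

def detect_mean_drift(stream, ref_size=200, win_size=200, threshold=4.0):
    ref = np.asarray(stream[:ref_size], dtype=float)
    alerts = []
    for end in range(ref_size + win_size, len(stream) + 1, win_size):
        win = np.asarray(stream[end - win_size:end], dtype=float)
        # standard error of the difference of the two window means
        se = np.sqrt(ref.var() / ref_size + win.var() / win_size)
        z = abs(win.mean() - ref.mean()) / (se + 1e-12)
        if z > threshold:
            alerts.append(end)
    return alerts

rng = np.random.default_rng(1)
stationary = rng.normal(0.0, 1.0, 1000)
drifted = rng.normal(1.5, 1.0, 1000)          # abrupt change in the generating distribution
print(detect_mean_drift(np.concatenate([stationary, drifted])))
```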

Proceedings ArticleDOI
17 Aug 2015
TL;DR: The authors built a centralized control mechanism based on a global configuration pushed to all datacenter switches; modular hardware design coupled with simple, robust software allowed the design to also support inter-cluster and wide-area networks.
Abstract: We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over ten years to more than 1Pbps of bisection bandwidth.
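
A rough sketch of the scaling arithmetic behind multi-stage Clos fabrics built from identical commodity switches, here for a two-stage leaf-spine layout; the radix, uplink count, and link speed are illustrative numbers, not the parameters of the networks described in the paper.

```python
# Toy sketch of leaf-spine (two-stage Clos) capacity arithmetic with identical
# commodity switches (numbers are illustrative only).
def leaf_spine_capacity(radix, uplinks_per_leaf, link_gbps):
    """Each leaf splits its `radix` ports between servers and spine-facing uplinks."""
    server_ports_per_leaf = radix - uplinks_per_leaf
    num_spines = uplinks_per_leaf              # one uplink per spine from each leaf
    num_leaves = radix                         # each spine port hosts one leaf
    servers = num_leaves * server_ports_per_leaf
    bisection_gbps = num_leaves * uplinks_per_leaf * link_gbps / 2
    oversubscription = server_ports_per_leaf / uplinks_per_leaf
    return servers, bisection_gbps, oversubscription

servers, bisection, oversub = leaf_spine_capacity(radix=64, uplinks_per_leaf=32, link_gbps=40)
print(f"{servers} servers, {bisection / 1000:.1f} Tb/s bisection, {oversub:.1f}:1 oversubscription")
```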

Journal ArticleDOI
TL;DR: The development and the present state of the “tps” series of software for use in geometric morphometrics on Windows-based computers are described; these programs have been used in hundreds of studies of mammals and other organisms.
Abstract: The development and the present state of the “tps” series of software for use in geometric morphometrics on Windows-based computers are described. These programs have been used in hundreds of studies in mammals and other organisms.

Posted Content
TL;DR: oTree is open-source, online software for implementing interactive experiments in the laboratory, online, in the field, or combinations thereof; www.oTree.org provides the source code, a library of standard game templates, and demo games that can be played by anyone.
Abstract: oTree is an open-source and online software for implementing interactive experiments in the laboratory, online, the field or combinations thereof. oTree does not require installation of software on subjects’ devices; it can run on any device that has a web browser, be that a desktop computer, a tablet or a smartphone. Deployment can be internet-based without a shared local network, or local-network-based even without internet access. For coding, Python is used, a popular, open-source programming language. www.oTree.org provides the source code, a library of standard game templates and demo games which can be played by anyone.
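
A minimal sketch of the models module of an oTree app, following the structure documented for oTree; exact field options differ between oTree versions, and the app and field names here are invented for illustration.

```python
# Minimal sketch of an oTree app's models module (classic format; the app and
# field names are made up, and details vary across oTree versions).
from otree.api import (
    models, BaseConstants, BaseSubsession, BaseGroup, BasePlayer,
)

class Constants(BaseConstants):
    name_in_url = 'guessing_game'
    players_per_group = 3
    num_rounds = 1

class Subsession(BaseSubsession):
    pass

class Group(BaseGroup):
    average_guess = models.FloatField()

class Player(BasePlayer):
    guess = models.IntegerField(min=0, max=100)
```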

Journal ArticleDOI
TL;DR: DIA-Umpire enables targeted extraction of quantitative information based on peptides initially identified in only a subset of the samples, resulting in more consistent quantification across multiple samples.
Abstract: As a result of recent improvements in mass spectrometry (MS), there is increased interest in data-independent acquisition (DIA) strategies in which all peptides are systematically fragmented using wide mass-isolation windows ('multiplex fragmentation'). DIA-Umpire (http://diaumpire.sourceforge.net/), a comprehensive computational workflow and open-source software for DIA data, detects precursor and fragment chromatographic features and assembles them into pseudo-tandem MS spectra. These spectra can be identified with conventional database-searching and protein-inference tools, allowing sensitive, untargeted analysis of DIA data without the need for a spectral library. Quantification is done with both precursor- and fragment-ion intensities. Furthermore, DIA-Umpire enables targeted extraction of quantitative information based on peptides initially identified in only a subset of the samples, resulting in more consistent quantification across multiple samples. We demonstrated the performance of the method with control samples of varying complexity and publicly available glycoproteomics and affinity purification-MS data.

01 Jan 2015
TL;DR: MNE-Python is an open-source software package that addresses this challenge by providing state-of-the-art algorithms implemented in Python that cover multiple methods of data preprocessing, source localization, statistical analysis, and estimation of functional connectivity between distributed brain regions.
Abstract: Magnetoencephalography and electroencephalography (M/EEG) measure the weak electromagnetic signals generated by neuronal activity in the brain. Using these signals to characterize and locate neural activation in the brain is a challenge that requires expertise in physics, signal processing, statistics, and numerical methods. As part of the MNE software suite, MNE-Python is an open-source software package that addresses this challenge by providing state-of-the-art algorithms implemented in Python that cover multiple methods of data preprocessing, source localization, statistical analysis, and estimation of functional connectivity between distributed brain regions. All algorithms and utility functions are implemented in a consistent manner with well-documented interfaces, enabling users to create M/EEG data analysis pipelines by writing Python scripts. Moreover, MNE-Python is tightly integrated with the core Python libraries for scientific computation (NumPy, SciPy) and visualization (matplotlib and Mayavi), as well as the greater neuroimaging ecosystem in Python via the Nibabel package. The code is provided under the new BSD license allowing code reuse, even in commercial products. Although MNE-Python has only been under heavy development for a couple of years, it has rapidly evolved with expanded analysis capabilities and pedagogical tutorials because multiple labs have collaborated during code development to help share best practices. MNE-Python also gives easy access to preprocessed datasets, helping users to get started quickly and facilitating reproducibility of methods by other researchers. Full documentation, including dozens of examples, is available at http://martinos.org/mne.
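
A minimal sketch of a basic MNE-Python preprocessing and epoching pipeline; the FIF file name is a placeholder (it is assumed to contain a stim channel), and the filter and epoch settings are generic example values.

```python
# Minimal sketch of an MNE-Python preprocessing/epoching pipeline
# (file name is a placeholder for an M/EEG recording in FIF format).
import mne

raw = mne.io.read_raw_fif("sample_raw.fif", preload=True)
raw.filter(l_freq=1.0, h_freq=40.0)                  # band-pass filter

events = mne.find_events(raw)                        # trigger events from the stim channel
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=0.5,
                    baseline=(None, 0), preload=True)
evoked = epochs.average()                            # evoked response
evoked.plot()
```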

Journal ArticleDOI
01 Feb 2015
TL;DR: Machine learning techniques have the ability to predict software fault proneness and can be used by software practitioners and researchers; however, the application of machine learning techniques in software fault prediction is still limited, and more studies should be carried out in order to obtain well-formed and generalizable results.
Abstract: Highlights: Reviews studies from 1991 to 2013 to assess the application of ML techniques for SFP. Identifies seven categories of ML techniques. Identifies 64 studies to answer the established research questions. Selects primary studies according to a quality assessment of the studies. The systematic literature review performs the following: summarizes ML techniques for SFP models; assesses the performance accuracy and capability of ML techniques for constructing SFP models; compares the ML and statistical techniques; compares the performance accuracy of different ML techniques; summarizes the strengths and weaknesses of the ML techniques; and provides future guidelines to software practitioners and researchers. Background: Software fault prediction is the process of developing models that can be used by software practitioners in the early phases of the software development life cycle for detecting faulty constructs such as modules or classes. Various machine learning techniques have been used in the past for predicting faults. Method: In this study we perform a systematic review of studies from January 1991 to October 2013 in the literature that use machine learning techniques for software fault prediction. We assess the performance capability of the machine learning techniques in existing research for software fault prediction. We also compare the performance of the machine learning techniques with the statistical techniques and with other machine learning techniques. Further, the strengths and weaknesses of the machine learning techniques are summarized. Results: In this paper we have identified 64 primary studies and seven categories of machine learning techniques. The results prove the prediction capability of the machine learning techniques for classifying modules/classes as fault prone or not fault prone. The models using the machine learning techniques for estimating software fault proneness outperform the traditional statistical models. Conclusion: Based on the results obtained from the systematic review, we conclude that machine learning techniques have the ability to predict software fault proneness and can be used by software practitioners and researchers. However, the application of machine learning techniques in software fault prediction is still limited, and more studies should be carried out in order to obtain well-formed and generalizable results. We provide future guidelines to practitioners and researchers based on the results obtained in this work.
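
As a concrete (and heavily simplified) illustration of the kind of model the reviewed studies build, the sketch below trains a fault-proneness classifier on static code metrics with scikit-learn; the CSV file and column names are hypothetical placeholders, and the setup is not taken from any particular primary study.

```python
# Illustrative sketch (not from the review itself): training and evaluating a
# fault-proneness classifier on per-module code metrics with scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("module_metrics.csv")          # hypothetical file, one row per module
X = data.drop(columns=["defective"])              # e.g. LOC, cyclomatic complexity, coupling
y = data["defective"]                             # 1 = fault prone, 0 = not fault prone

clf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
print(f"10-fold mean AUC: {auc.mean():.3f}")
```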

Journal ArticleDOI
TL;DR: Practical guidelines for verification and validation of NMS models and simulations are established that researchers, clinicians, reviewers, and others can adopt to evaluate the accuracy and credibility of modeling studies.
Abstract: Computational modeling and simulation of neuromusculoskeletal (NMS) systems enables researchers and clinicians to study the complex dynamics underlying human and animal movement. NMS models use equations derived from physical laws and biology to help solve challenging real-world problems, from designing prosthetics that maximize running speed to developing exoskeletal devices that enable walking after a stroke. NMS modeling and simulation has proliferated in the biomechanics research community over the past 25 years, but the lack of verification and validation standards remains a major barrier to wider adoption and impact. The goal of this paper is to establish practical guidelines for verification and validation of NMS models and simulations that researchers, clinicians, reviewers, and others can adopt to evaluate the accuracy and credibility of modeling studies. In particular, we review a general process for verification and validation applied to NMS models and simulations, including careful formulation of a research question and methods, traditional verification and validation steps, and documentation and sharing of results for use and testing by other researchers. Modeling the NMS system and simulating its motion involves methods to represent neural control, musculoskeletal geometry, muscle-tendon dynamics, contact forces, and multibody dynamics. For each of these components, we review modeling choices and software verification guidelines; discuss variability, errors, uncertainty, and sensitivity relationships; and provide recommendations for verification and validation by comparing experimental data and testing robustness. We present a series of case studies to illustrate key principles. In closing, we discuss challenges the community must overcome to ensure that modeling and simulation are successfully used to solve the broad spectrum of problems that limit human mobility.

Proceedings ArticleDOI
09 Nov 2015
TL;DR: In this paper, a comparison of the main existing test input generation tools for Android apps is presented, based on four metrics: ease of use, ability to work on multiple platforms, code coverage, and ability to detect faults.
Abstract: Like all software, mobile applications ("apps") must be adequately tested to gain confidence that they behave correctly. Therefore, in recent years, researchers and practitioners alike have begun to investigate ways to automate app testing. In particular, because of Android's open source nature and its large share of the market, a great deal of research has been performed on input generation techniques for apps that run on the Android operating system. At this point in time, there are in fact a number of such techniques in the literature, which differ in the way they generate inputs, the strategy they use to explore the behavior of the app under test, and the specific heuristics they use. To better understand the strengths and weaknesses of these existing approaches, and to get general insight on ways they could be made more effective, in this paper we perform a thorough comparison of the main existing test input generation tools for Android. In our comparison, we evaluate the effectiveness of these tools, and their corresponding techniques, according to four metrics: ease of use, ability to work on multiple platforms, code coverage, and ability to detect faults. Our results provide a clear picture of the state of the art in input generation for Android apps and identify future research directions that, if suitably investigated, could lead to more effective and efficient testing tools for Android.

Journal ArticleDOI
TL;DR: An open platform using commodity vehicles and sensors is introduced to facilitate the development of autonomous vehicles and presents algorithms, software libraries, and datasets required for scene recognition, path planning, and vehicle control.
Abstract: Autonomous vehicles are an emerging application of automotive technology. They can recognize the scene, plan the path, and control the motion by themselves while interacting with drivers. Although they receive considerable attention, components of autonomous vehicles are not accessible to the public but instead are developed as proprietary assets. To facilitate the development of autonomous vehicles, this article introduces an open platform using commodity vehicles and sensors. Specifically, the authors present algorithms, software libraries, and datasets required for scene recognition, path planning, and vehicle control. This open platform allows researchers and developers to study the basis of autonomous vehicles, design new algorithms, and test their performance using the common interface.

Proceedings ArticleDOI
07 Dec 2015
TL;DR: The Panoptic Studio is a system organized around the thesis that social interactions should be measured through the perceptual integration of a large variety of view points, consisting of integrated structural, hardware, and software innovations.
Abstract: We present an approach to capture the 3D structure and motion of a group of people engaged in a social interaction. The core challenges in capturing social interactions are: (1) occlusion is functional and frequent, (2) subtle motion needs to be measured over a space large enough to host a social group, and (3) human appearance and configuration variation is immense. The Panoptic Studio is a system organized around the thesis that social interactions should be measured through the perceptual integration of a large variety of view points. We present a modularized system designed around this principle, consisting of integrated structural, hardware, and software innovations. The system takes, as input, 480 synchronized video streams of multiple people engaged in social activities, and produces, as output, the labeled time-varying 3D structure of anatomical landmarks on individuals in the space. The algorithmic contributions include a hierarchical approach for generating skeletal trajectory proposals, and an optimization framework for skeletal reconstruction with trajectory re-association.

Journal ArticleDOI
TL;DR: This work focuses on the computational aspects of super-resolution microscopy and presents a comprehensive evaluation of localization software packages, reflecting the various tradeoffs of SMLM software packages and helping users to choose the software that fits their needs.
Abstract: The quality of super-resolution images obtained by single-molecule localization microscopy (SMLM) depends largely on the software used to detect and accurately localize point sources. In this work, we focus on the computational aspects of super-resolution microscopy and present a comprehensive evaluation of localization software packages. Our philosophy is to evaluate each package as a whole, thus maintaining the integrity of the software. We prepared synthetic data that represent three-dimensional structures modeled after biological components, taking excitation parameters, noise sources, point-spread functions and pixelation into account. We then asked developers to run their software on our data; most responded favorably, allowing us to present a broad picture of the methods available. We evaluated their results using quantitative and user-interpretable criteria: detection rate, accuracy, quality of image reconstruction, resolution, software usability and computational resources. These metrics reflect the various tradeoffs of SMLM software packages and help users to choose the software that fits their needs.
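
A toy sketch of the sort of metrics used in such comparisons: match detections to ground-truth emitter positions within a tolerance, then report detection rate and localization RMSE. This simple nearest-neighbour matching is for illustration only and is not the evaluation pipeline used in the paper.

```python
# Toy evaluation of a set of localizations against ground truth
# (nearest-neighbour matching within a tolerance; illustrative only).
import numpy as np
from scipy.spatial import cKDTree

def evaluate_localizations(truth, detected, tol=30.0):
    """truth, detected: (N, 2) arrays of positions in nm; tol: match radius in nm."""
    tree = cKDTree(detected)
    dist, _ = tree.query(truth, distance_upper_bound=tol)   # inf where nothing lies within tol
    matched = np.isfinite(dist)
    detection_rate = matched.mean()
    rmse = np.sqrt(np.mean(dist[matched] ** 2)) if matched.any() else float("nan")
    return detection_rate, rmse

rng = np.random.default_rng(2)
truth = rng.uniform(0, 10_000, size=(100, 2))               # emitters in a 10 um field
detected = truth[:80] + rng.normal(0, 10, size=(80, 2))     # 80 % detected, ~10 nm localization error
print(evaluate_localizations(truth, detected))
```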

Book
04 Nov 2015
TL;DR: Evidence-Based Software Engineering and Systematic Reviews provides a clear introduction to the use of an evidence-based model for software engineering research and practice, explaining the roles of primary studies as elements of an over-arching evidence model, rather than as disjointed elements in the empirical spectrum.
Abstract: In the decade since the idea of adapting the evidence-based paradigm for software engineering was first proposed, it has become a major tool of empirical software engineering. Evidence-Based Software Engineering and Systematic Reviews provides a clear introduction to the use of an evidence-based model for software engineering research and practice. The book explains the roles of primary studies (experiments, surveys, case studies) as elements of an over-arching evidence model, rather than as disjointed elements in the empirical spectrum. Supplying readers with a clear understanding of empirical software engineering best practices, it provides up-to-date guidance on how to conduct secondary studies in software engineering, replacing the existing 2004 and 2007 technical reports. The book is divided into three parts. The first part discusses the nature of evidence and the evidence-based practices centered on a systematic review, both in general and as applying to software engineering. The second part examines the different elements that provide inputs to a systematic review (usually considered as forming a secondary study), especially the main forms of primary empirical study currently used in software engineering. The final part provides practical guidance on how to conduct systematic reviews (the guidelines), drawing together accumulated experiences to guide researchers and students in planning and conducting their own studies. The book includes an extensive glossary and an appendix that provides a catalogue of reviews that may be useful for practice and teaching.

Journal ArticleDOI
TL;DR: In this paper, the authors show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model.
Abstract: Missing covariate data commonly occur in epidemiological and clinical research, and are often dealt with using multiple imputation. Imputation of partially observed covariates is complicated if the substantive model is non-linear (e.g. Cox proportional hazards model), or contains non-linear (e.g. squared) or interaction terms, and standard software implementations of multiple imputation may impute covariates from models that are incompatible with such substantive models. We show how imputation by fully conditional specification, a popular approach for performing multiple imputation, can be modified so that covariates are imputed from models which are compatible with the substantive model. We investigate through simulation the performance of this proposal, and compare it with existing approaches. Simulation results suggest our proposal gives consistent estimates for a range of common substantive models, including models which contain non-linear covariate effects or interactions, provided data are missing at random and the assumed imputation models are correctly specified and mutually compatible. Stata software implementing the approach is freely available.
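
The paper's compatible-imputation approach is distributed as Stata software; the Python sketch below only illustrates standard fully conditional specification (chained-equations) imputation with scikit-learn's IterativeImputer, not the substantive-model-compatible variant the authors propose.

```python
# Sketch of standard fully conditional specification (chained-equations)
# imputation; the paper's substantive-model-compatible variant is provided as
# Stata software and is not reproduced here.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
X[:, 2] = 0.5 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

X_missing = X.copy()
X_missing[rng.random(500) < 0.3, 0] = np.nan          # 30 % of covariate 0 missing at random

# One stochastic imputation; repeating with different random states would give
# the multiple imputations needed for Rubin's combining rules.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X_missing)
print(np.isnan(X_imputed).sum())                      # 0: all values filled in
```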

Journal ArticleDOI
TL;DR: Bonsai is described, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams and demonstrated how it allows for the rapid and flexible prototyping of integrated experimental designs in neuroscience.
Abstract: The design of modern scientific experiments requires the control and monitoring of many different data streams. However, the serial execution of programming instructions in a computer makes it a challenge to develop software that can deal with the asynchronous, parallel nature of scientific data. Here we present Bonsai, a modular, high-performance, open-source visual programming framework for the acquisition and online processing of data streams. We describe Bonsai's core principles and architecture and demonstrate how it allows for the rapid and flexible prototyping of integrated experimental designs in neuroscience. We specifically highlight some applications that require the combination of many different hardware and software components, including video tracking of behavior, electrophysiology and closed-loop control of stimulation.

Journal ArticleDOI
TL;DR: Practical issues that arise in the processing, management, translation, and analysis of textual data are discussed with a particular focus on how procedures differ across languages.
Abstract: Recent advances in research tools for the systematic analysis of textual data are enabling exciting new research throughout the social sciences. For comparative politics scholars, who are often interested in non-English and possibly multilingual textual datasets, these advances may be difficult to access. This article discusses practical issues that arise in the processing, management, translation, and analysis of textual data with a particular focus on how procedures differ across languages. These procedures are combined in two applied examples of automated text analysis using the recently introduced Structural Topic Model. We also show how the model can be used to analyze data that have been translated into a single language via machine translation tools. All the methods we describe here are implemented in open-source software packages available from the authors.

Book
25 Nov 2015
TL;DR: How to do Linguistics with R: Data exploration and statistical analysis is unique in its scope, as it covers a wide range of classical and cutting-edge statistical methods, including different flavours of regression analysis and ANOVA, random forests and conditional inference trees, as well as specific linguistic approaches.
Abstract: This book provides a linguist with a statistical toolkit for exploration and analysis of linguistic data. It employs R, a free software environment for statistical computing, which is increasingly popular among linguists. How to do Linguistics with R: Data exploration and statistical analysis is unique in its scope, as it covers a wide range of classical and cutting-edge statistical methods, including different flavours of regression analysis and ANOVA, random forests and conditional inference trees, as well as specific linguistic approaches, among which are Behavioural Profiles, Vector Space Models and various measures of association between words and constructions. The statistical topics are presented comprehensively, but without too much technical detail, and illustrated with linguistic case studies that answer non-trivial research questions. The book also demonstrates how to visualize linguistic data with the help of attractive informative graphs, including the popular ggplot2 system and Google visualization tools.

Journal ArticleDOI
TL;DR: Tackling software data issues, including redundancy, correlation, feature irrelevance and missing samples, with the proposed combined learning model resulted in remarkable classification performance paving the way for successful quality control.
Abstract: Context: Several issues hinder software defect data, including redundancy, correlation, feature irrelevance and missing samples. It is also hard to ensure balanced distribution between data pertaining to defective and non-defective software. In most experimental cases, data related to the latter software class is dominantly present in the dataset. Objective: The objectives of this paper are to demonstrate the positive effects of combining feature selection and ensemble learning on the performance of defect classification. Along with efficient feature selection, a new two-variant (with and without feature selection) ensemble learning algorithm is proposed to provide robustness to both data imbalance and feature redundancy. Method: We carefully combine selected ensemble learning models with efficient feature selection to address these issues and mitigate their effects on the defect classification performance. Results: Forward selection showed that only a few features contribute to high area under the receiver-operating curve (AUC). On the tested datasets, the greedy forward selection (GFS) method outperformed other feature selection techniques such as Pearson’s correlation. This suggests that features are highly unstable. However, ensemble learners like random forests and the proposed algorithm, average probability ensemble (APE), are not as affected by poor features as in the case of weighted support vector machines (W-SVMs). Moreover, the APE model combined with greedy forward selection (enhanced APE) achieved AUC values of approximately 1.0 for the NASA datasets: PC2, PC4, and MC1. Conclusion: This paper shows that features of a software dataset must be carefully selected for accurate classification of defective components. Furthermore, tackling the software data issues mentioned above with the proposed combined learning model resulted in remarkable classification performance, paving the way for successful quality control.
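
An illustrative sketch of greedy forward feature selection driven by cross-validated AUC, in the spirit of the GFS step described above; the APE ensemble itself is not reproduced here, and a random forest on synthetic imbalanced data stands in as the learner.

```python
# Illustrative greedy forward selection by cross-validated AUC (not the
# paper's APE ensemble; a random forest stands in as the learner).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)   # imbalanced, like defect data

selected, remaining, best_auc = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {}
    for f in remaining:
        cols = selected + [f]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores[f] = cross_val_score(clf, X[:, cols], y, cv=5, scoring="roc_auc").mean()
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_auc + 1e-3:     # stop when AUC no longer improves
        break
    best_auc = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("selected features:", selected, "AUC:", round(best_auc, 3))
```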

Journal ArticleDOI
TL;DR: In this paper, the authors surveyed 122 peer-reviewed journal articles and conference proceedings that used UGC as a data source and investigated the scope of the tourism and hospitality issues that are addressed using available UGC; the methods that have been applied to UGC data to achieve research objectives; and the software that has been used to collect UGC and extract information from large data sets.
Abstract: The rapid growth of information generated by consumers of tourism and hospitality services calls for a systematic review of how user-generated content (UGC) has been applied in tourism and hospitality research. This study surveyed 122 peer-reviewed journal articles and conference proceedings that used UGC as a data source. The study investigates (a) the scope of the tourism and hospitality issues that are addressed using available UGC; (b) the methods that have been applied to UGC data to achieve research objectives; and (c) the software that has been used to collect UGC and extract information from large UGC data sets. The study also presents the emerging topics and challenges in UGC research.