
Showing papers on "Data quality" published in 2018


Journal ArticleDOI
TL;DR: This tutorial review will guide the reader through the use of system suitability and QC samples, why these samples should be applied and how the quality of data can be reported.
Abstract: Quality assurance (QA) and quality control (QC) are two quality management processes that are integral to the success of metabolomics including their application for the acquisition of high quality data in any high-throughput analytical chemistry laboratory. QA defines all the planned and systematic activities implemented before samples are collected, to provide confidence that a subsequent analytical process will fulfil predetermined requirements for quality. QC can be defined as the operational techniques and activities used to measure and report these quality requirements after data acquisition. This tutorial review will guide the reader through the use of system suitability and QC samples, why these samples should be applied and how the quality of data can be reported. System suitability samples are applied to assess the operation and lack of contamination of the analytical platform prior to sample analysis. Isotopically-labelled internal standards are applied to assess system stability for each sample analysed. Pooled QC samples are applied to condition the analytical platform, perform intra-study reproducibility measurements (QC) and to correct mathematically for systematic errors. Standard reference materials and long-term reference QC samples are applied for inter-study and inter-laboratory assessment of data.
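The drift-correction role of pooled QC samples described above can be sketched in a few lines. The following is a minimal illustration, not the authors' protocol: it assumes a single metabolite feature, a known injection order, and a simple polynomial drift model, whereas dedicated tools typically use more robust smoothers such as LOESS.

```python
import numpy as np

def qc_drift_correct(intensity, injection_order, is_qc, degree=2):
    """Correct intra-batch signal drift for one metabolite feature.

    A low-order polynomial is fitted to the pooled-QC intensities as a
    function of injection order; every sample is then rescaled by the
    fitted trend, normalised to the median QC intensity.
    """
    intensity = np.asarray(intensity, dtype=float)
    order = np.asarray(injection_order, dtype=float)
    qc = np.asarray(is_qc, dtype=bool)

    coeffs = np.polyfit(order[qc], intensity[qc], deg=degree)  # fit on QCs only
    trend = np.polyval(coeffs, order)
    return intensity * np.median(intensity[qc]) / trend

# Toy batch: 20 injections, every 5th one a pooled QC, with upward drift
rng = np.random.default_rng(0)
order = np.arange(20)
is_qc = order % 5 == 0
raw = 1000 + 8 * order + rng.normal(0, 20, size=20)
print(qc_drift_correct(raw, order, is_qc).round(1))
```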

457 citations


Journal ArticleDOI
TL;DR: A new r package, dartr, enables the analysis of single nucleotide polymorphism data for population genomic and phylogenomic applications, and provides user‐friendly functions for data quality control and marker selection, and permits rigorous evaluations of conformation to Hardy–Weinberg equilibrium, gametic‐phase disequilibrium and neutrality.
Abstract: Although vast technological advances have been made and genetic software packages are growing in number, it is not a trivial task to analyse SNP data. We announce a new r package, dartr, enabling the analysis of single nucleotide polymorphism data for population genomic and phylogenomic applications. dartr provides user-friendly functions for data quality control and marker selection, and permits rigorous evaluations of conformation to Hardy-Weinberg equilibrium, gametic-phase disequilibrium and neutrality. The package reports standard descriptive statistics, permits exploration of patterns in the data through principal components analysis and conducts standard F-statistics, as well as basic phylogenetic analyses, population assignment, isolation by distance and exports data to a variety of commonly used downstream applications (e.g., newhybrids, faststructure and phylogeny applications) outside of the r environment. The package serves two main purposes: first, a user-friendly approach to lower the hurdle to analyse such data-therefore, the package comes with a detailed tutorial targeted to the r beginner to allow data analysis without requiring deep knowledge of r. Second, we use a single, well-established format-genlight from the adegenet package-as input for all our functions to avoid data reformatting. By strictly using the genlight format, we hope to facilitate this format as the de facto standard of future software developments and hence reduce the format jungle of genetic data sets. The dartr package is available via the r CRAN network and GitHub.
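dartr itself is an R package operating on genlight objects, so its functions are not reproduced here. Purely as a language-neutral sketch of the kind of quality-control filtering it automates (call rate, minor allele frequency, Hardy–Weinberg conformance), here is a hedged Python example on a toy 0/1/2 genotype matrix; the function name and thresholds are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def filter_snps(geno, min_call_rate=0.95, min_maf=0.02, hwe_alpha=1e-6):
    """Return a boolean mask over SNPs (columns) passing basic QC checks.

    geno: individuals x SNPs matrix coded 0/1/2 (alt-allele count),
          with NaN marking missing genotype calls.
    """
    geno = np.asarray(geno, dtype=float)
    called = ~np.isnan(geno)
    n = called.sum(axis=0)                      # called individuals per SNP
    call_rate = called.mean(axis=0)

    p = np.nansum(geno, axis=0) / (2 * n)       # alt-allele frequency
    maf = np.minimum(p, 1 - p)

    # Hardy-Weinberg: 1-df chi-square on observed vs expected genotype counts
    obs = np.stack([(geno == g).sum(axis=0) for g in (0, 1, 2)])
    exp = np.stack([n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2])
    chi_stat = ((obs - exp) ** 2 / np.maximum(exp, 1e-12)).sum(axis=0)
    hwe_p = chi2.sf(chi_stat, df=1)

    return (call_rate >= min_call_rate) & (maf >= min_maf) & (hwe_p >= hwe_alpha)

# Toy data: 50 individuals x 5 SNPs with a sprinkle of missing calls
rng = np.random.default_rng(0)
geno = rng.choice([0, 1, 2], size=(50, 5), p=[0.49, 0.42, 0.09]).astype(float)
geno[rng.random(geno.shape) < 0.02] = np.nan
print(filter_snps(geno))
```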

387 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a taxonomy of machine learning algorithms that can be applied to the data in order to extract higher level information, and a use case of applying Support Vector Machine (SVM) on Aarhus Smart City traffic data is presented for more detailed exploration.
Abstract: Rapid developments in hardware, software, and communication technologies have allowed the emergence of Internet-connected sensory devices that provide observation and data measurement from the physical world. By 2020, it is estimated that the total number of Internet-connected devices being used will be between 25 and 50 billion. As the numbers grow and technologies become more mature, the volume of data published will increase. Internet-connected devices technology, referred to as Internet of Things (IoT), continues to extend the current Internet by providing connectivity and interaction between the physical and cyber worlds. In addition to increased volume, the IoT generates Big Data characterized by velocity in terms of time and location dependency, with a variety of multiple modalities and varying data quality. Intelligent processing and analysis of this Big Data is the key to developing smart IoT applications. This article assesses the different machine learning methods that deal with the challenges in IoT data by considering smart cities as the main use case. The key contribution of this study is presentation of a taxonomy of machine learning algorithms explaining how different techniques are applied to the data in order to extract higher level information. The potential and challenges of machine learning for IoT data analytics will also be discussed. A use case of applying Support Vector Machine (SVM) on Aarhus Smart City traffic data is presented for a more detailed exploration.
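The Aarhus traffic data are not reproduced here, but the SVM use case follows the standard scikit-learn pattern. Below is a minimal, hedged sketch on synthetic stand-in features (vehicle count and average speed) with an invented congested/free-flow label.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for traffic sensor records: [vehicle_count, avg_speed_kmh]
rng = np.random.default_rng(42)
free_flow = np.column_stack([rng.poisson(15, 500), rng.normal(55, 8, 500)])
congested = np.column_stack([rng.poisson(45, 500), rng.normal(20, 6, 500)])
X = np.vstack([free_flow, congested])
y = np.array([0] * 500 + [1] * 500)            # 0 = free flow, 1 = congested

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# RBF-kernel SVM; scaling matters because the two features have different units
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("held-out accuracy:", round(model.score(X_test, y_test), 3))
```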

375 citations


Journal ArticleDOI
TL;DR: The National Sleep Research Resource (NSRR) provides a single point of access to analysis-ready physiological signals from polysomnography obtained from multiple sources, and a wide variety of clinical data to facilitate sleep research.

332 citations


Journal ArticleDOI
TL;DR: In this paper, the Real-time Affordable Multi-Pollutant (RAMP) sensor package is used to measure CO, NO2, O3, and CO2.
Abstract: . Low-cost sensing strategies hold the promise of denser air quality monitoring networks, which could significantly improve our understanding of personal air pollution exposure. Additionally, low-cost air quality sensors could be deployed to areas where limited monitoring exists. However, low-cost sensors are frequently sensitive to environmental conditions and pollutant cross-sensitivities, which have historically been poorly addressed by laboratory calibrations, limiting their utility for monitoring. In this study, we investigated different calibration models for the Real-time Affordable Multi-Pollutant (RAMP) sensor package, which measures CO, NO2, O3, and CO2. We explored three methods: (1) laboratory univariate linear regression, (2) empirical multiple linear regression, and (3) machine-learning-based calibration models using random forests (RF). Calibration models were developed for 16–19 RAMP monitors (varied by pollutant) using training and testing windows spanning August 2016 through February 2017 in Pittsburgh, PA, US. The random forest models matched (CO) or significantly outperformed (NO2, CO2, O3) the other calibration models, and their accuracy and precision were robust over time for testing windows of up to 16 weeks. Following calibration, average mean absolute error on the testing data set from the random forest models was 38 ppb for CO (14 % relative error), 10 ppm for CO2 (2 % relative error), 3.5 ppb for NO2 (29 % relative error), and 3.4 ppb for O3 (15 % relative error), and Pearson r versus the reference monitors exceeded 0.8 for most units. Model performance is explored in detail, including a quantification of model variable importance, accuracy across different concentration ranges, and performance in a range of monitoring contexts including the National Ambient Air Quality Standards (NAAQS) and the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. A key strength of the RF approach is that it accounts for pollutant cross-sensitivities. This highlights the importance of developing multipollutant sensor packages (as opposed to single-pollutant monitors); we determined this is especially critical for NO2 and CO2. The evaluation reveals that only the RF-calibrated sensors meet the US EPA Air Sensors Guidebook recommendations of minimum data quality for personal exposure measurement. We also demonstrate that the RF-model-calibrated sensors could detect differences in NO2 concentrations between a near-road site and a suburban site less than 1.5 km away. From this study, we conclude that combining RF models with carefully controlled state-of-the-art multipollutant sensor packages as in the RAMP monitors appears to be a very promising approach to address the poor performance that has plagued low-cost air quality sensors.
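A hedged sketch of the random-forest calibration approach, assuming synthetic colocation data rather than the actual RAMP signals: the raw target-gas signal is regressed on co-measured signals (here an O3 signal, temperature, and relative humidity) so the model can learn cross-sensitivities. All variable names and coefficients below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic colocation data: raw NO2 signal plus covariates driving
# cross-sensitivities (O3 signal, temperature, relative humidity).
rng = np.random.default_rng(1)
n = 2000
o3_signal = rng.normal(30, 10, n)
temp_c = rng.normal(20, 8, n)
rh = rng.uniform(20, 90, n)
true_no2 = np.clip(rng.normal(12, 6, n), 0, None)
raw_no2 = 0.8 * true_no2 - 0.3 * o3_signal + 0.2 * temp_c + rng.normal(0, 1.5, n)

X = np.column_stack([raw_no2, o3_signal, temp_c, rh])
y = true_no2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("MAE (ppb):", round(mean_absolute_error(y_te, pred), 2))
print("feature importances:", rf.feature_importances_.round(2))
```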

308 citations


Journal ArticleDOI
TL;DR: Data quality both inflated and obscured associations with age during adolescence, indicating that reliable measures of data quality can be automatically derived from T1‐weighted volumes and that failing to control for data quality can systematically bias the results of studies of brain maturation.

250 citations


Journal ArticleDOI
TL;DR: The GEOTRACES Intermediate Data Product 2017 (IDP2017) is the second publicly available data product of the international GEOTRACES programme and contains data measured and quality controlled before the end of 2016.

249 citations


Posted Content
TL;DR: The Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset "ingredients" before AI model development.
Abstract: Artificial intelligence (AI) systems built on incomplete or biased data will often exhibit problematic outcomes. Current methods of data analysis, particularly before model development, are costly and not standardized. The Dataset Nutrition Label (the Label) is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset "ingredients" before AI model development. Building a Label that can be applied across domains and data types requires that the framework itself be flexible and adaptable; as such, the Label comprises diverse qualitative and quantitative modules generated through multiple statistical and probabilistic modelling backends, but displayed in a standardized format. To demonstrate and advance this concept, we generated and published an open source prototype with seven sample modules on the ProPublica Dollars for Docs dataset. The benefits of the Label are manifold. For data specialists, the Label will drive more robust data analysis practices, provide an efficient way to select the best dataset for their purposes, and increase the overall quality of AI models as a result of more robust training datasets and the ability to check for issues at the time of model development. For those building and publishing datasets, the Label creates an expectation of explanation, which will drive better data collection practices. We also explore the limitations of the Label, including the challenges of generalizing across diverse datasets, and the risk of using "ground truth" data as a comparison dataset. We discuss ways to move forward given the limitations identified. Lastly, we lay out future directions for the Dataset Nutrition Label project, including research and public policy agendas to further advance consideration of the concept.
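The published prototype and its seven modules are not reproduced here; the pandas sketch below only illustrates the flavour of a Label-style "ingredients" overview (missingness, types, cardinality) on made-up toy data, not the actual Dollars for Docs records.

```python
import pandas as pd

def nutrition_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: a distilled overview of dataset 'ingredients'."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "missing_pct": round(100 * s.isna().mean(), 1),
            "n_unique": s.nunique(dropna=True),
            "example": s.dropna().iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

# Illustrative toy data (not the ProPublica Dollars for Docs dataset)
df = pd.DataFrame({
    "physician_id": [101, 102, 103, None],
    "payment_usd": [250.0, None, 75.5, 1200.0],
    "company": ["A Corp", "B Inc", "B Inc", "A Corp"],
})
print(nutrition_summary(df))
```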

213 citations


Journal ArticleDOI
TL;DR: The challenges of a ‘Big Data’ approach to building global EBV data products across taxa and spatiotemporal scales, focusing on species distribution and abundance are assessed.
Abstract: Much biodiversity data is collected worldwide, but it remains challenging to assemble the scattered knowledge for assessing biodiversity status and trends. The concept of Essential Biodiversity Variables (EBVs) was introduced to structure biodiversity monitoring globally, and to harmonize and standardize biodiversity data from disparate sources to capture a minimum set of critical variables required to study, report and manage biodiversity change. Here, we assess the challenges of a 'Big Data' approach to building global EBV data products across taxa and spatiotemporal scales, focusing on species distribution and abundance. The majority of currently available data on species distributions derives from incidentally reported observations or from surveys where presence-only or presence-absence data are sampled repeatedly with standardized protocols. Most abundance data come from opportunistic population counts or from population time series using standardized protocols (e.g. repeated surveys of the same population from single or multiple sites). Enormous complexity exists in integrating these heterogeneous, multi-source data sets across space, time, taxa and different sampling methods. Integration of such data into global EBV data products requires correcting biases introduced by imperfect detection and varying sampling effort, dealing with different spatial resolution and extents, harmonizing measurement units from different data sources or sampling methods, applying statistical tools and models for spatial inter- or extrapolation, and quantifying sources of uncertainty and errors in data and models. To support the development of EBVs by the Group on Earth Observations Biodiversity Observation Network (GEO BON), we identify 11 key workflow steps that will operationalize the process of building EBV data products within and across research infrastructures worldwide. These workflow steps take multiple sequential activities into account, including identification and aggregation of various raw data sources, data quality control, taxonomic name matching and statistical modelling of integrated data. We illustrate these steps with concrete examples from existing citizen science and professional monitoring projects, including eBird, the Tropical Ecology Assessment and Monitoring network, the Living Planet Index and the Baltic Sea zooplankton monitoring. The identified workflow steps are applicable to both terrestrial and aquatic systems and a broad range of spatial, temporal and taxonomic scales. They depend on clear, findable and accessible metadata, and we provide an overview of current data and metadata standards. Several challenges remain to be solved for building global EBV data products: (i) developing tools and models for combining heterogeneous, multi-source data sets and filling data gaps in geographic, temporal and taxonomic coverage, (ii) integrating emerging methods and technologies for data collection such as citizen science, sensor networks, DNA-based techniques and satellite remote sensing, (iii) solving major technical issues related to data product structure, data storage, execution of workflows and the production process/cycle as well as approaching technical interoperability among research infrastructures, (iv) allowing semantic interoperability by developing and adopting standards and tools for capturing consistent data and metadata, and (v) ensuring legal interoperability by endorsing open data or data that are free from restrictions on use, modification and sharing. 
Addressing these challenges is critical for biodiversity research and for assessing progress towards conservation policy targets and sustainable development goals.
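One of the workflow steps named above, taxonomic name matching, can be illustrated with a minimal fuzzy-matching sketch. Real EBV pipelines resolve names against curated taxonomic backbones and route ambiguous cases to manual review; the reference list, names, and cutoff below are invented for illustration.

```python
import difflib

# Illustrative reference backbone of accepted names (not a real checklist)
ACCEPTED = [
    "Parus major",
    "Passer domesticus",
    "Turdus merula",
    "Erithacus rubecula",
]

def match_name(raw_name, accepted=ACCEPTED, cutoff=0.85):
    """Map a raw, possibly misspelt taxon name to an accepted name.

    Returns (accepted_name, similarity) or (None, 0.0) if nothing is
    close enough; low-confidence matches should go to manual review.
    """
    best = difflib.get_close_matches(raw_name, accepted, n=1, cutoff=cutoff)
    if not best:
        return None, 0.0
    score = difflib.SequenceMatcher(None, raw_name, best[0]).ratio()
    return best[0], round(score, 3)

for raw in ["Parus majr", "Paser domesticus", "Sturnus vulgaris"]:
    print(raw, "->", match_name(raw))
```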

212 citations


Journal ArticleDOI
TL;DR: RCT research, characterized as having the highest reliability, and RWE research, which reflects the actual clinical aspects, can have a mutually supplementary relationship and once this is proven, the two could comprise the most powerful evidence-based research method in medicine.
Abstract: Real-world evidence (RWE) and randomized control trial (RCT) data are considered mutually complementary. However, compared with RCT, the outcomes of RWE continue to be assigned lower credibility. It must be emphasized that RWE research is a real-world practice that does not need to be executed as RCT research for it to be reliable. The advantages and disadvantages of RWE must be discerned clearly, and then the proper protocol can be planned from the beginning of the research to secure as many samples as possible. Attention must be paid to privacy protection. Moreover, bias can be reduced meaningfully by reducing the number of dropouts through detailed and meticulous data quality management. RCT research, characterized as having the highest reliability, and RWE research, which reflects the actual clinical aspects, can have a mutually supplementary relationship. Indeed, once this is proven, the two could comprise the most powerful evidence-based research method in medicine.

195 citations


Journal ArticleDOI
TL;DR: This paper presents a highly versatile and precisely annotated large-scale data set of smartphone sensor data for multimodal locomotion and transportation analytics of mobile users, and presents how a machine-learning system can use this data set to automatically recognize modes of transportation.
Abstract: Scientific advances build on reproducible research, which needs publicly available benchmark data sets. The computer vision and speech recognition communities have led the way in establishing benchmark data sets. Far fewer data sets are available in mobile computing, especially for rich locomotion and transportation analytics. This paper presents a highly versatile and precisely annotated large-scale data set of smartphone sensor data for multimodal locomotion and transportation analytics of mobile users. The data set comprises seven months of measurements, collected from all sensors of four smartphones carried at typical body locations, including the images of a body-worn camera, while three participants used eight different modes of transportation in the south-east of the U.K., including in London. In total, 28 context labels were annotated, including transportation mode, participant’s posture, inside/outside location, road conditions, traffic conditions, presence in tunnels, social interactions, and having meals. The total amount of collected data exceeds 950 GB of sensor data, which corresponds to 2812 h of labeled data and 17 562 km of traveled distance. We present how we set up the data collection, including the equipment used and the experimental protocol. We discuss the data set, including the data curation process, the analysis of the annotations, and of the sensor data. We discuss the challenges encountered and present the lessons learned and some of the best practices we developed to ensure high quality data collection and annotation. We discuss the potential applications which can be developed using this large-scale data set. In particular, we present how a machine-learning system can use this data set to automatically recognize modes of transportation. Many other research questions related to transportation analytics, activity recognition, radio signal propagation and mobility modeling can be addressed through this data set. The full data set is being made available to the community, and a thorough preview is already published.
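The data set itself is far too large to reproduce, but the transportation-mode recognition the authors describe follows the generic window-feature-plus-classifier pattern. Below is a hedged sketch on synthetic accelerometer windows with two toy classes ("still" vs. "walking"); the features and classes are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def window_features(acc_window):
    """Simple per-window features from an accelerometer magnitude trace."""
    return [acc_window.mean(), acc_window.std(), acc_window.max(),
            np.percentile(acc_window, 75) - np.percentile(acc_window, 25)]

# Synthetic stand-in: 'still' windows are quiet, 'walking' windows oscillate
rng = np.random.default_rng(7)
windows, labels = [], []
t = np.arange(500)
for _ in range(300):
    windows.append(rng.normal(9.8, 0.05, 500))                                  # still
    labels.append(0)
    windows.append(9.8 + 2.0 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.3, 500))  # walking
    labels.append(1)

X = np.array([window_features(w) for w in windows])
y = np.array(labels)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("5-fold accuracy:", round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```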

Journal ArticleDOI
TL;DR: This study develops and validates the concept of Data Analytics Competency as a five-dimensional formative index, empirically examines its impact on firm decision making performance, and reveals that all dimensions of data analytics competency significantly improve decision quality.
Abstract: The concept of Data Analytics (DA) competency has been conceptualized and validated. The impact of DA competency on decision making performance is empirically examined. All dimensions of DA competency significantly improve decision quality. All dimensions, except bigness of data, significantly increase decision efficiency. This study develops and validates the concept of Data Analytics Competency as a five-dimensional formative index (i.e., data quality, bigness of data, analytical skills, domain knowledge, and tools sophistication) and empirically examines its impact on firm decision making performance (i.e., decision quality and decision efficiency). The findings, based on an empirical analysis of survey data from 151 Information Technology managers and data analysts, demonstrate a large, significant, positive relationship between data analytics competency and firm decision making performance. The results reveal that all dimensions of data analytics competency significantly improve decision quality. Furthermore, interestingly, all dimensions, except bigness of data, significantly increase decision efficiency. This is the first known empirical study to conceptualize, operationalize and validate the concept of data analytics competency and to study its impact on decision making performance. The validity of the data analytics competency construct as conceived and operationalized suggests the potential for future research evaluating its relationships with possible antecedents and consequences. For practitioners, the results provide important guidelines for increasing firm decision making performance through the use of data analytics.

Journal ArticleDOI
TL;DR: This paper focuses on the process of EMR processing and emphatically analyzes the key techniques and makes an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.
Abstract: Currently, medical institutes generally use EMR to record a patient’s condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results. Different types of data require different processing technologies. Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. Semistructured or unstructured data, such as medical text, contain more health information and require more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction). This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.
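Production-grade medical NER and RE rely on trained models; purely to illustrate the flavour of information extraction from unstructured notes, here is a toy rule-based sketch that pulls drug-dose-unit triples with a regular expression. The pattern and example note are invented and far too crude for clinical use.

```python
import re

# Toy rule: a capitalised drug-like token followed by a numeric dose and unit.
DOSE_PATTERN = re.compile(
    r"(?P<drug>[A-Z][a-z]{2,})\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g|mL)"
)

def extract_medications(note: str):
    """Return (drug, dose, unit) triples found in a free-text note."""
    return [(m["drug"], float(m["dose"]), m["unit"])
            for m in DOSE_PATTERN.finditer(note)]

note = ("Patient started on Metformin 500 mg twice daily; "
        "Lisinopril 10 mg continued. Follow-up in 3 months.")
print(extract_medications(note))
```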

Journal ArticleDOI
TL;DR: This article contains the basic information and considerations needed to plan, set up, and interpret a pupillometry experiment, as well as commentary about how to interpret the response and some methodological considerations that might not be necessary in other auditory experiments.
Abstract: Within the field of hearing science, pupillometry is a widely used method for quantifying listening effort. Its use in research is growing exponentially, and many labs are (considering) applying pupillometry for the first time. Hence, there is a growing need for a methods paper on pupillometry covering topics spanning from experiment logistics and timing to data cleaning and what parameters to analyze. This article contains the basic information and considerations needed to plan, set up, and interpret a pupillometry experiment, as well as commentary about how to interpret the response. Included are practicalities like minimal system requirements for recording a pupil response, specifications for peripheral equipment, experiment logistics and constraints, and different kinds of data processing. Additional details include participant inclusion and exclusion criteria and some methodological considerations that might not be necessary in other auditory experiments. We discuss what data should be recorded and how to monitor the data quality during recording in order to minimize artifacts. Data processing and analysis are considered as well. Finally, we share insights from the collective experience of the authors and discuss some of the challenges that still lie ahead.
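One routine cleaning step covered by such methods papers is blink handling. Below is a minimal sketch, assuming blinks appear as dropouts (NaN or non-positive samples) in the pupil trace: the affected region is padded by a few samples and linearly interpolated. The padding width and toy trace are illustrative.

```python
import numpy as np

def deblink(pupil, pad=5):
    """Replace blink samples (NaN or non-positive) by linear interpolation.

    pad extends the rejected region by a few samples on each side, since
    pupil size is distorted just before and after the eyelid closes.
    """
    pupil = np.asarray(pupil, dtype=float)
    bad = ~np.isfinite(pupil) | (pupil <= 0)

    # Dilate the bad-sample mask by `pad` samples on each side
    for i in np.flatnonzero(bad):
        bad[max(0, i - pad):i + pad + 1] = True

    good = ~bad
    cleaned = pupil.copy()
    cleaned[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(good), pupil[good])
    return cleaned

# Toy trace with a blink (signal drops to 0 for a few samples)
trace = np.concatenate([np.full(20, 4.2), np.zeros(4), np.full(20, 4.3)])
print(deblink(trace).round(2)[18:28])
```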

Journal ArticleDOI
TL;DR: Whether data that are recorded routinely as part of the health care process in primary care are actually fit to use for other purposes such as research and quality of health care indicators, how the original purpose may affect the extent to which the data are fit for another purpose, and the mechanisms behind these effects are discussed.
Abstract: Background: Enormous amounts of data are recorded routinely in health care as part of the care process, primarily for managing individual patient care. There are significant opportunities to use these data for other purposes, many of which would contribute to establishing a learning health system. This is particularly true for data recorded in primary care settings, as in many countries, these are the first place patients turn to for most health problems. Objective: In this paper, we discuss whether data that are recorded routinely as part of the health care process in primary care are actually fit to use for other purposes such as research and quality of health care indicators, how the original purpose may affect the extent to which the data are fit for another purpose, and the mechanisms behind these effects. In doing so, we want to identify possible sources of bias that are relevant for the use and reuse of these type of data. Methods: This paper is based on the authors’ experience as users of electronic health records data, as general practitioners, health informatics experts, and health services researchers. It is a product of the discussions they had during the Translational Research and Patient Safety in Europe (TRANSFoRm) project, which was funded by the European Commission and sought to develop, pilot, and evaluate a core information architecture for the learning health system in Europe, based on primary care electronic health records. Results: We first describe the different stages in the processing of electronic health record data, as well as the different purposes for which these data are used. Given the different data processing steps and purposes, we then discuss the possible mechanisms for each individual data processing step that can generate biased outcomes. We identified 13 possible sources of bias. Four of them are related to the organization of a health care system, whereas some are of a more technical nature. Conclusions: There are a substantial number of possible sources of bias; very little is known about the size and direction of their impact. However, anyone that uses or reuses data that were recorded as part of the health care process (such as researchers and clinicians) should be aware of the associated data collection process and environmental influences that can affect the quality of the data. Our stepwise, actor- and purpose-oriented approach may help to identify these possible sources of bias. Unless data quality issues are better understood and unless adequate controls are embedded throughout the data lifecycle, data-driven health care will not live up to its expectations. We need a data quality research agenda to devise the appropriate instruments needed to assess the magnitude of each of the possible sources of bias, and then start measuring their impact. The possible sources of bias described in this paper serve as a starting point for this research agenda.

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This research work proposes a conceptual design for sharing personal continuous-dynamic health data using blockchain technology, supplemented by cloud storage, to share health-related information in a secure and transparent manner, and introduces a data quality inspection module based on machine learning techniques to maintain control over data quality.
Abstract: With the rapid development of wearable technology and mobile computing, a huge amount of personal health-related data is being generated and accumulated on a continuous basis at every moment. These personal datasets contain valuable information and they belong to, and are an asset of, the individual users, hence they should be owned and controlled by the users themselves. Currently, most such datasets are stored and controlled by different service providers, and this centralised data storage brings challenges of data security and hinders data sharing. These personal health data are valuable resources for healthcare research and commercial projects. In this research work, we propose a conceptual design for sharing personal continuous-dynamic health data using blockchain technology supplemented by cloud storage to share the health-related information in a secure and transparent manner. Besides, we also introduce a data quality inspection module based on machine learning techniques to have control over data quality. The primary goal of the proposed system is to enable users to own, control and share their personal health data securely, in a General Data Protection Regulation (GDPR) compliant way, to benefit from their personal datasets. It also provides an efficient way for researchers and commercial data consumers to collect high quality personal health data for research and commercial purposes.

Journal ArticleDOI
TL;DR: It is shown that improving mortality data completeness minimized overestimation of survival relative to NDI‐based estimates, and the importance of data quality assessment and benchmarking to the NDI is highlighted.
Abstract: Objective To create a high-quality electronic health record (EHR)-derived mortality dataset for retrospective and prospective real-world evidence generation. Data sources/study setting Oncology EHR data, supplemented with external commercial and US Social Security Death Index data, benchmarked to the National Death Index (NDI). Study design We developed a recent, linkable, high-quality mortality variable amalgamated from multiple data sources to supplement EHR data, benchmarked against the highest-completeness US mortality data, the NDI. Data quality of the mortality variable version 2.0 is reported here. Principal findings For advanced non-small-cell lung cancer, sensitivity of mortality information improved from 66 percent in EHR structured data to 91 percent in the composite dataset, with high date agreement compared to the NDI. For advanced melanoma, metastatic colorectal cancer, and metastatic breast cancer, sensitivity of the final variable was 85 to 88 percent. Kaplan-Meier survival analyses showed that improving mortality data completeness minimized overestimation of survival relative to NDI-based estimates. Conclusions For EHR-derived data to yield reliable real-world evidence, it needs to be of known and sufficiently high quality. Considering the impact of mortality data completeness on survival endpoints, we highlight the importance of data quality assessment and advocate benchmarking to the NDI.
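Sensitivity here is the share of benchmark-confirmed deaths that the composite variable captures, often paired with a date-agreement check. A toy sketch of both computations on invented linked records (not the study's data):

```python
import pandas as pd

# Toy linked records: NDI-confirmed deaths vs. the composite EHR-derived variable
ndi = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "ndi_death_date": pd.to_datetime(
        ["2015-03-01", "2015-07-15", "2016-01-20", "2016-05-05", "2016-09-30"]),
})
composite = pd.DataFrame({
    "patient_id": [1, 2, 4, 5],
    "death_date": pd.to_datetime(
        ["2015-03-01", "2015-07-20", "2016-05-05", "2016-10-02"]),
})

merged = ndi.merge(composite, on="patient_id", how="left")
captured = merged["death_date"].notna()
sensitivity = captured.mean()                     # share of NDI deaths captured

# Date agreement within 15 days, among captured deaths
delta = (merged["death_date"] - merged["ndi_death_date"]).abs()
date_agreement = (delta[captured] <= pd.Timedelta(days=15)).mean()

print(f"sensitivity = {sensitivity:.0%}, date agreement (within 15 d) = {date_agreement:.0%}")
```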

Journal ArticleDOI
TL;DR: Social media and crowdsourcing data are employed for hyper-resolution monitoring of urban flooding, and it is found that these big-data-based flood monitoring approaches can complement the existing means of flood data collection.

Journal ArticleDOI
TL;DR: This study finds several trends about data scientists in the software engineering context at Microsoft, and should inform managers on how to leverage data science capability effectively within their teams.
Abstract: The demand for analyzing large-scale telemetry, machine, and quality data is rapidly increasing in the software industry. Data scientists are becoming popular within software teams; e.g., Facebook, LinkedIn and Microsoft are creating a new career path for data scientists. In this paper, we present a large-scale survey with 793 professional data scientists at Microsoft to understand their educational background, problem topics that they work on, tool usage, and activities. We cluster these data scientists based on the time spent on various activities and identify 9 distinct clusters of data scientists and their corresponding characteristics. We also discuss the challenges that they face and the best practices they share with other data scientists. Our study finds several trends about data scientists in the software engineering context at Microsoft, and should inform managers on how to leverage data science capability effectively within their teams.

Journal ArticleDOI
01 Aug 2018
TL;DR: This work presents a system for automating the verification of data quality at scale, which meets the requirements of production use cases and provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data.
Abstract: Modern companies and institutions rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for everyone involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables 'unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets, and leverages machine learning, e.g., for enhancing constraint suggestions, for estimating the 'predictability' of a column, and for detecting anomalies in historic data quality time series. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.
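The system described runs its checks as aggregation queries on Apache Spark and offers a rich declarative API; that code is not reproduced here. As a much-simplified pandas analogue, the sketch below shows what "unit tests for data" look like when constraints are declared separately from the data they validate. Constraint names and the toy table are invented.

```python
import pandas as pd

def check(df, constraints):
    """Evaluate named constraints (name -> predicate over the DataFrame).

    Returns a dict of constraint name -> bool, in the spirit of
    declarative 'unit tests' for data.
    """
    return {name: bool(predicate(df)) for name, predicate in constraints.items()}

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, None, 12],
    "amount": [20.0, 35.5, 12.0, -3.0],
})

constraints = {
    "order_id is unique": lambda d: d["order_id"].is_unique,
    "customer_id is complete": lambda d: d["customer_id"].notna().all(),
    "amount is non-negative": lambda d: (d["amount"] >= 0).all(),
    "row count >= 3": lambda d: len(d) >= 3,
}

for name, passed in check(orders, constraints).items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```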

Journal ArticleDOI
TL;DR: This paper provides an introduction to and an assessment of the Transparency Platform, helping researchers to use it more efficiently and to judge data quality more rigorously.

Journal ArticleDOI
TL;DR: This study addresses questions about the impact of screening techniques on data and statistical analyses, and serves as an initial attempt to estimate descriptive statistics and graphically display the distributions of popular screening techniques.
Abstract: The purpose of this study is to empirically address questions pertaining to the effects of data screening practices in survey research. This study addresses questions about the impact of screening techniques on data and statistical analyses. It also serves as an initial attempt to estimate descriptive statistics and graphically display the distributions of popular screening techniques. Data were obtained from an online sample who completed demographic items and measures of character strengths (N = 307). Screening indices demonstrate minimal overlap and differ in the number of participants flagged. Existing cutoff scores for most screening techniques seem appropriate, but cutoff values for consistency-based indices may be too liberal. Screens differ in the extent to which they impact survey results. The use of screening techniques can impact inter-item correlations, inter-scale correlations, reliability estimates, and statistical results. While data screening can improve the quality and trustworthiness of data, screening techniques are not interchangeable. Researchers and practitioners should be aware of the differences between data screening techniques and apply appropriate screens for their survey characteristics and study design. Low-impact direct and unobtrusive screens such as self-report indicators, bogus items, instructed items, longstring, individual response variability, and response time are relatively simple to administer and analyze. The fact that data screening can influence the statistical results of a study demonstrates that low-quality data can distort hypothesis testing in organizational research and practice. We recommend analyzing results both before and after screens have been applied.
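Two of the low-impact screens listed above, longstring and individual response variability (IRV), are easy to compute. A minimal sketch with illustrative cutoffs follows; real cutoffs should be tuned to the instrument and scale length.

```python
import numpy as np

def longstring(responses):
    """Length of the longest run of identical consecutive answers."""
    best = run = 1
    for prev, cur in zip(responses, responses[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def irv(responses):
    """Individual response variability: within-person SD of item responses."""
    return float(np.std(responses, ddof=0))

# Toy respondents on a 10-item, 5-point scale
careful  = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
straight = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]   # straight-lining

for label, resp in [("careful", careful), ("straight-liner", straight)]:
    flag = longstring(resp) >= 8 or irv(resp) < 0.5   # illustrative cutoffs
    print(label, {"longstring": longstring(resp),
                  "irv": round(irv(resp), 2), "flag": flag})
```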

Journal ArticleDOI
TL;DR: A methodological framework detailing the steps and decisions required to quantitatively analyze a set of data that was originally qualitative is presented and new perspectives on data integration in the study of biopsychosocial aspects in everyday contexts are provided.
Abstract: Indirect observation is a recent concept in systematic observation. It largely involves analyzing textual material generated either indirectly from transcriptions of audio recordings of verbal behavior in natural settings (e.g., conversation, group discussions) or directly from narratives (e.g., letters of complaint, tweets, forum posts). It may also feature seemingly unobtrusive objects that can provide relevant insights into daily routines. All these materials constitute an extremely rich source of information for studying everyday life, and they are continuously growing with the burgeoning of new technologies for data recording, dissemination, and storage. Narratives are an excellent vehicle for studying everyday life, and quantitization is proposed as a means of integrating qualitative and quantitative elements. However, this analysis requires a structured system that enables researchers to analyze varying forms and sources of information objectively. In this paper, we present a methodological framework detailing the steps and decisions required to quantitatively analyze a set of data that was originally qualitative. We provide guidelines on study dimensions, text segmentation criteria, ad hoc observation instruments, data quality controls, and coding and preparation of text for quantitative analysis. The quality control stage is essential to ensure that the code matrices generated from the qualitative data are reliable. We provide examples of how an indirect observation study can produce data for quantitative analysis and also describe the different software tools available for the various stages of the process. The proposed method is framed within a specific mixed methods approach that involves collecting qualitative data and subsequently transforming these into matrices of codes (not frequencies) for quantitative analysis to detect underlying structures and behavioral patterns. The data collection and quality control procedures fully meet the requirement of flexibility and provide new perspectives on data integration in the study of biopsychosocial aspects in everyday contexts.

Journal ArticleDOI
TL;DR: These findings are consistent with the literature on some dimensions, such as the negative relationship between completion rate and both survey length and question difficulty, and the finding that surveys without progress bars have higher completion rates than surveys with progress bars.
Abstract: A survey’s completion rate is one of its most important data quality measures. There are quite a few published studies examining web survey completion rate through experimental approaches. In this ...

Journal ArticleDOI
TL;DR: A list of recommendations, developed by a group of experts, including members of patient organizations, to be used as a framework for improving the quality of RD registries, includes aspects of governance, Findable, Accessible, Interoperable and Reusable (FAIR) data and information, infrastructure, documentation, training, and quality audit.
Abstract: Rare diseases (RD) patient registries are powerful instruments that help develop clinical research, facilitate the planning of appropriate clinical trials, improve patient care, and support healthcare management. They constitute a key information system that supports the activities of European Reference Networks (ERNs) on rare diseases. A rapid proliferation of RD registries has occurred during the last years and there is a need to develop guidance for the minimum requirements, recommendations and standards necessary to maintain a high-quality registry. In response to these heterogeneities, in the framework of RD-Connect, a European platform connecting databases, registries, biobanks and clinical bioinformatics for rare disease research, we report on a list of recommendations, developed by a group of experts, including members of patient organizations, to be used as a framework for improving the quality of RD registries. This list includes aspects of governance, Findable, Accessible, Interoperable and Reusable (FAIR) data and information, infrastructure, documentation, training, and quality audit. The list is intended to be used by established as well as new RD registries. Further work includes the development of a toolkit to enable continuous assessment and improvement of their organizational and data quality.

Journal ArticleDOI
TL;DR: This paper provides an updated and detailed classification of the design decisions that matter in questionnaire development, and a summary of what is said in the literature about their impact on data quality.
Abstract: Quite a lot of research is available on the relationships between survey response scales’ characteristics and the quality of responses. However, it is often difficult to extract practical rules for questionnaire design from the wide and often mixed amount of empirical evidence. The aim of this study is to provide first a classification of the characteristics of response scales, mentioned in the literature, that should be considered when developing a scale, and second a summary of the main conclusions extracted from the literature regarding the impact these characteristics have on data quality. Thus, this paper provides an updated and detailed classification of the design decisions that matter in questionnaire development, and a summary of what is said in the literature about their impact on data quality. It distinguishes between characteristics that have been demonstrated to have an impact, characteristics for which the impact has not been found, and characteristics for which research is still needed to make a conclusion.

Journal ArticleDOI
TL;DR: The findings show considerable variation in UAS practices, suggesting a need to establish standardized image collection and processing procedures; basic research and methodological developments are also reviewed to assess how data quality and uncertainty issues are being addressed.
Abstract: Over the past decade, the remote-sensing community has eagerly adopted unmanned aircraft systems (UAS) as a cost-effective means to capture imagery at spatial and temporal resolutions not typically feasible with manned aircraft and satellites. The rapid adoption has outpaced our understanding of the relationships between data collection methods and data quality, causing uncertainties in data and products derived from UAS and necessitating exploration into how researchers are using UAS for terrestrial applications. We synthesize these procedures through a meta-analysis of UAS applications alongside a review of recent, basic science research surrounding theory and method development. We performed a search of the Web of Science (WoS) database on 17 May 2017 using UAS-related keywords to identify all peer-reviewed studies indexed by WoS. We manually filtered the results to retain only terrestrial studies () and further categorized results into basic theoretical studies (), method development (), and a...

Journal ArticleDOI
TL;DR: This paper proposes to pay participants according to how well they do, to motivate rational participants to efficiently perform crowdsensing tasks, and proposes a mechanism that estimates the quality of sensing data and offers each participant a reward based on her effective contribution.
Abstract: In crowdsensing, appropriate rewards are always expected to compensate the participants for their consumption of physical resources and involvement of manual effort. While continuous low quality sensing data could do harm to the availability and preciseness of crowdsensing based services, few existing incentive mechanisms have ever addressed the issue of data quality. The design of a quality based incentive mechanism is motivated by its potential to avoid inefficient sensing and unnecessary rewards. In this paper, we incorporate the consideration of data quality into the design of an incentive mechanism for crowdsensing, and propose to pay the participants according to how well they do, to motivate rational participants to efficiently perform crowdsensing tasks. This mechanism estimates the quality of sensing data, and offers each participant a reward based on her effective contribution. We also implement the mechanism and evaluate its improvement in terms of quality of service and profit of the service provider. The evaluation results show that our mechanism achieves superior performance when compared to a general data collection model and a uniform pricing scheme.
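As a hedged illustration of the general idea (not the paper's mechanism), the sketch below scores each participant's data quality by closeness to a robust per-task consensus and splits a fixed reward budget in proportion to that score. The exponential scoring and the budget are arbitrary choices.

```python
import numpy as np

def quality_based_rewards(readings, budget=10.0, scale=1.0):
    """Split a reward budget among participants by estimated data quality.

    readings: participant -> list of sensed values for the same tasks.
    Quality is scored by closeness to the per-task median consensus.
    """
    matrix = np.array(list(readings.values()), dtype=float)   # participants x tasks
    consensus = np.median(matrix, axis=0)
    mae = np.abs(matrix - consensus).mean(axis=1)              # per-participant error
    quality = np.exp(-mae / scale)                             # higher = better
    rewards = budget * quality / quality.sum()
    return {name: round(float(r), 2) for name, r in zip(readings, rewards)}

readings = {
    "alice": [20.1, 21.0, 19.8],
    "bob":   [20.3, 20.8, 20.1],
    "carol": [35.0, 5.0, 28.0],    # noisy, low-quality contributions
}
print(quality_based_rewards(readings))
```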

Journal ArticleDOI
TL;DR: statTarget is a streamlined tool with an easy-to-use graphical user interface and an integrated suite of algorithms specifically developed for evaluating data quality and removing unwanted variation in quantitative mass spectrometry-based omics data, allowing user-friendly improvement of data precision.

Journal ArticleDOI
TL;DR: This algorithm can be a promising tool to identify low quality or automated data collected via AMT or other online data collection platforms, and flagged cases can be used as part of sensitivity analyses to warrant exclusion from further analyses.
Abstract: Web-based data collection methods such as Amazon's Mechanical Turk (AMT) are an appealing option to recruit participants quickly and cheaply for psychological research. While concerns regarding data quality have emerged with AMT, several studies have exhibited that data collected via AMT are as reliable as traditional college samples and are often more diverse and representative of noncollege populations. The development of methods to screen for low quality data, however, has been less explored. Omitting participants based on simple screening methods in isolation, such as response time or attention checks may not be adequate identification methods, with an inability to delineate between high or low effort participants. Additionally, problematic survey responses may arise from survey automation techniques such as survey bots or automated form fillers. The current project developed low quality data detection methods while overcoming previous screening limitations. Multiple checks were employed, such as page response times, distribution of survey responses, the number of utilized choices from a given range of scale options, click counts, and manipulation checks. This method was tested on a survey taken with an easily available plug-in survey bot, as well as compared to data collected by human participants providing both high effort and randomized, or low effort, answers. Identified cases can then be used as part of sensitivity analyses to warrant exclusion from further analyses. This algorithm can be a promising tool to identify low quality or automated data via AMT or other online data collection platforms.
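The published algorithm combines several checks (page response times, response distributions, click counts, manipulation checks); the sketch below implements just two of those signals as a hedged heuristic: an implausibly fast median page time and a response distribution that looks like uniform random clicking. Thresholds and toy respondents are illustrative, not the authors' parameters.

```python
import numpy as np
from scipy.stats import chisquare

def flag_low_quality(responses, page_seconds, n_options=5,
                     min_seconds=2.0, uniform_p=0.05):
    """Flag a respondent using two simple signals.

    1. Median per-page response time below a plausibility floor.
    2. Response counts indistinguishable from uniform random clicking:
       a chi-square goodness-of-fit p-value above `uniform_p` is treated
       here, heuristically, as a bot-like signature.
    """
    too_fast = float(np.median(page_seconds)) < min_seconds

    counts = np.bincount(np.asarray(responses) - 1, minlength=n_options)
    _, p_uniform = chisquare(counts)          # H0: uniform over scale options
    suspiciously_uniform = p_uniform > uniform_p

    return too_fast or suspiciously_uniform

human = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]        # opinionated, slower pages
bot   = [1, 3, 5, 2, 4, 1, 3, 5, 2, 4]        # near-uniform random clicking
print("human flagged:", flag_low_quality(human, page_seconds=[8, 11, 9, 7]))
print("bot flagged:  ", flag_low_quality(bot, page_seconds=[0.6, 0.5, 0.7, 0.4]))
```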