
Showing papers on "Data quality" published in 2014


MonographDOI
01 Jan 2014
TL;DR: The case for open data and the economics of open data are discussed in this paper, with a focus on the benefits of data integration and the challenges of building data infrastructures.
Abstract:
Chapter 1: Conceptualising Data. What are data? Kinds of data. Data, information, knowledge, wisdom. Framing data. Thinking critically about databases and data infrastructures. Data assemblages and the data revolution.
Chapter 2: Small Data, Data Infrastructures and Data Brokers. Data holdings, data archives and data infrastructures. Rationale for research data infrastructures. The challenges of building data infrastructures. Data brokers and markets.
Chapter 3: Open and Linked Data. Open data. Linked data. The case for open data. The economics of open data. Concerns with respect to opening data.
Chapter 4: Big Data. Volume. Exhaustive. Resolution and indexicality. Relationality. Velocity. Variety. Flexibility.
Chapter 5: Enablers and Sources of Big Data. The enablers of big data. Sources of big data. Directed data. Automated data. Volunteered data.
Chapter 6: Data Analytics. Pre-analytics. Machine learning. Data mining and pattern recognition. Data visualisation and visual analytics. Statistical analysis. Prediction, simulation and optimization.
Chapter 7: The Governmental and Business Rationale for Big Data. Governing people. Managing organisations. Leveraging value and producing capital. Creating better places.
Chapter 8: The Reframing of Science, Social Science and Humanities Research. The fourth paradigm in science? The re-emergence of empiricism. The fallacies of empiricism. Data-driven science. Computational social sciences and digital humanities.
Chapter 9: Technical and Organisational Issues. Deserts and deluges. Access. Data quality, veracity and lineage. Data integration and interoperability. Poor analysis and ecological fallacies. Skills and human resourcing.
Chapter 10: Ethical, Political, Social and Legal Concerns. Data shadows and dataveillance. Privacy. Data security. Profiling, social sorting and redlining. Secondary uses, control creep and anticipatory governance. Modes of governance and technological lock-ins.
Chapter 11: Making Sense of the Data Revolution. Understanding data and the data revolution. Researching data assemblages. Final thoughts.

751 citations



Book
30 Aug 2014
TL;DR: This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process, and offers a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature.
Abstract: Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely contain inconsistencies and errors and, most importantly, will not be ready for the data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more sophisticated tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim to reduce the complexity of the data by detecting or removing irrelevant and noisy elements. This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process. It offers a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, from basic concepts and detailed descriptions of classical algorithms to an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.
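As a rough illustration of the preprocessing tasks the book groups together (cleaning, noise handling, data reduction), the sketch below chains a few common steps with pandas and scikit-learn. The column handling, thresholds and use of PCA are assumptions of this sketch, not techniques prescribed by the book.

```python
# Minimal preprocessing sketch: imputation, outlier clipping, dimensionality reduction.
# Thresholds (1st/99th percentiles, 5 components) are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def preprocess(df: pd.DataFrame, n_components: int = 5) -> pd.DataFrame:
    # 1. Cleaning: drop duplicate rows and impute missing numeric values with the median.
    df = df.drop_duplicates()
    numeric = df.select_dtypes(include=[np.number])
    df[numeric.columns] = numeric.fillna(numeric.median())

    # 2. Noise handling: clip extreme outliers to the 1st/99th percentiles per column.
    lower, upper = numeric.quantile(0.01), numeric.quantile(0.99)
    df[numeric.columns] = df[numeric.columns].clip(lower=lower, upper=upper, axis=1)

    # 3. Data reduction: project numeric features onto a few principal components.
    pca = PCA(n_components=min(n_components, len(numeric.columns)))
    reduced = pca.fit_transform(df[numeric.columns])
    return pd.DataFrame(reduced, index=df.index,
                        columns=[f"pc{i + 1}" for i in range(reduced.shape[1])])
```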

678 citations


Journal ArticleDOI
TL;DR: The data quality problem in the context of supply chain management (SCM) is introduced and methods for monitoring and controlling data quality are proposed and highlighted.

652 citations


Journal ArticleDOI
TL;DR: A research model is proposed to explain the acquisition intention of big data analytics, mainly from the theoretical perspectives of data quality management and data usage experience. Empirical investigation reveals that a firm's intention to adopt big data analytics can be positively affected by its competence in maintaining the quality of corporate data.

550 citations


Journal ArticleDOI
Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, Yong Ren
TL;DR: This paper identifies four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker, and examines various approaches that can help to protect sensitive information.
Abstract: The growing popularity and development of data mining technologies bring a serious threat to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., the data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss their privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game theoretical approaches, which are proposed for analyzing the interactions among different users in a data mining scenario, each of whom has their own valuation of the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.
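To make the basic PPDM idea concrete, the toy sketch below uses classic randomized response: each data provider perturbs a sensitive yes/no answer locally, yet the collector can still estimate the aggregate. This is a generic textbook technique chosen for illustration, not one of the approaches surveyed in the paper; the 0.75 truth probability is an arbitrary assumption.

```python
# Randomized response: perturb individual answers, recover the population rate.
import random

def randomize(answer: bool, p_truth: float = 0.75) -> bool:
    """With probability p_truth report the true answer, otherwise flip a fair coin."""
    if random.random() < p_truth:
        return answer
    return random.random() < 0.5

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Invert the randomization to get an unbiased estimate of the true 'yes' rate."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Example: 10,000 providers, 30% of whom truly answer 'yes'.
truth = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomize(t) for t in truth]
print(round(estimate_true_rate(reports), 3))  # close to 0.30
```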

528 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide guidelines and suggestions regarding the application of spatial interpolation methods to environmental data by comparing the features of the commonly applied methods, which fall into three categories: non-geostatistical interpolation methods, geostatistical interpolation methods and combined methods.
Abstract: Spatially continuous data of environmental variables are often required for environmental sciences and management. However, information for environmental variables is usually collected by point sampling, particularly in mountainous regions and deep ocean areas. Thus, methods generating such spatially continuous data from point samples become essential tools. Spatial interpolation methods (SIMs) are, however, often data-specific or even variable-specific. Many factors affect the predictive performance of the methods, and previous studies have shown that their effects are not consistent. Hence it is difficult to select an appropriate method for a given dataset. This review aims to provide guidelines and suggestions regarding the application of SIMs to environmental data by comparing the features of the commonly applied methods, which fall into three categories, namely: non-geostatistical interpolation methods, geostatistical interpolation methods and combined methods. Factors affecting the performance, including sampling design, sample spatial distribution, data quality, correlation between primary and secondary variables, and interaction among factors, are discussed. A total of 25 commonly applied methods are then classified based on their features to provide an overview of the relationships among them. These features are quantified and then clustered to show similarities among these 25 methods. An easy-to-use decision tree for selecting an appropriate method from these 25 methods is developed based on data availability, data nature, expected estimation, and features of the method. Finally, a list of software packages for spatial interpolation is provided. Highlights: comparison of commonly used spatial interpolation methods in environmental science; analysis of factors affecting the performance of spatial interpolation methods; classification of 25 methods to illustrate their relationships; guidelines for selecting an appropriate method for a given dataset; a list of software packages for commonly used spatial interpolation methods.
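As a small example from the review's first category (non-geostatistical methods), the sketch below implements inverse distance weighting (IDW) over point samples. The sample coordinates, values and the power parameter are illustrative assumptions, not data from the review.

```python
# Inverse distance weighting: estimate a value at query points from nearby samples.
import numpy as np

def idw(sample_xy: np.ndarray, sample_z: np.ndarray,
        query_xy: np.ndarray, power: float = 2.0) -> np.ndarray:
    """Estimate values at query_xy as distance-weighted averages of sampled values."""
    # Pairwise distances between query points and sample points.
    d = np.linalg.norm(query_xy[:, None, :] - sample_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)          # avoid division by zero at sample locations
    w = 1.0 / d ** power              # closer samples get more weight
    return (w @ sample_z) / w.sum(axis=1)

# Toy example: interpolate a value at (0.5, 0.5) from four corner samples.
xy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
z = np.array([1.0, 2.0, 3.0, 4.0])
print(idw(xy, z, np.array([[0.5, 0.5]])))  # ~2.5, the average of the corners
```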

466 citations


Journal ArticleDOI
TL;DR: A new pipeline, SNPhylo, is presented to construct phylogenetic trees based on large SNP datasets; it can help a researcher focus more on interpretation of the results of analysis of voluminous data sets than on the manipulations necessary to accomplish the analysis.
Abstract: Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data. We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline. Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
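The sketch below shows the kind of quality filtering and linkage disequilibrium (LD) pruning described above, applied to a 0/1/2 genotype matrix before tree building. The encoding, thresholds and the adjacent-SNP pruning heuristic are assumptions of this sketch; they are not SNPhylo's exact defaults or algorithm.

```python
# Drop low-quality SNPs (missingness, minor-allele frequency) and prune SNPs in strong LD.
import numpy as np

def filter_and_prune(genotypes: np.ndarray, max_missing=0.1, min_maf=0.05, max_r2=0.8):
    """genotypes: SNPs x samples matrix of allele counts (0/1/2), NaN for missing."""
    kept = []
    for i, row in enumerate(genotypes):
        obs = row[~np.isnan(row)]
        if len(obs) == 0 or np.isnan(row).mean() > max_missing:
            continue                                  # too much missing data
        freq = obs.mean() / 2.0                       # alternate-allele frequency
        if min(freq, 1 - freq) < min_maf:
            continue                                  # nearly monomorphic SNP
        # LD pruning heuristic: skip SNPs strongly correlated with the previously kept SNP.
        if kept:
            prev = genotypes[kept[-1]]
            mask = ~np.isnan(row) & ~np.isnan(prev)
            if mask.sum() > 1 and abs(np.corrcoef(row[mask], prev[mask])[0, 1]) ** 2 > max_r2:
                continue
        kept.append(i)
    return kept  # indices of SNPs to feed into maximum-likelihood tree construction
```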

393 citations


Journal ArticleDOI
TL;DR: A framework containing more than 25 methods and indicators is presented, allowing arbitrarily repeatable intrinsic OSM quality analyses for any part of the world, based solely on the data's history.
Abstract: OpenStreetMap (OSM) is one of the most popular examples of a Volunteered Geographic Information (VGI) project. In recent years it has become a serious alternative source of geodata. Since the quality of OSM data can vary strongly, different aspects have been investigated in several scientific studies. In most cases the data is compared with commercial or administrative datasets which, however, are not always accessible due to lack of availability, contradictory licensing restrictions or high procurement costs. In this investigation a framework containing more than 25 methods and indicators is presented, allowing OSM quality assessments based solely on the data's history. Without the use of a reference data set, approximate statements on OSM data quality are possible. For this purpose existing methods are taken up, developed further, and integrated into an extensible open source framework. This enables arbitrarily repeatable intrinsic OSM quality analyses for any part of the world.
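To show what an intrinsic, history-only indicator can look like, the sketch below scores feature maturity from version counts and contributor counts alone, with no reference data set. The edit-record format and the scoring heuristic are assumptions of this sketch, not indicators from the authors' framework.

```python
# History-based quality heuristic: more revisions by more mappers -> likely more settled.
from collections import defaultdict

# Each edit: (feature_id, version, user). In practice this comes from the OSM full-history dump.
edits = [
    ("way/1", 1, "alice"), ("way/1", 2, "bob"), ("way/1", 3, "alice"),
    ("way/2", 1, "carol"),
]

versions = defaultdict(int)
contributors = defaultdict(set)
for feature, version, user in edits:
    versions[feature] = max(versions[feature], version)
    contributors[feature].add(user)

for feature in versions:
    # Cap both signals and combine them into a crude 0..1 maturity score.
    score = min(versions[feature], 5) / 5 * 0.5 + min(len(contributors[feature]), 3) / 3 * 0.5
    print(feature, round(score, 2))
```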

327 citations


Journal ArticleDOI
TL;DR: It is concluded that epidemiological studies that include all persons in a population, follow them for decades, and become available relatively quickly are important data sources for modern epidemiology, but it is important to acknowledge the data limitations.
Abstract: Studies based on databases, medical records and registers are used extensively today in epidemiological research. Despite this increasing use, no developed methodological literature on the use and evaluation of population-based registers is available, even though data collection in register-based studies differs from researcher-collected data, all persons in a population are available, and traditional statistical analyses focusing on sampling error as the main source of uncertainty may not be relevant. We present the main strengths and limitations of register-based studies, the biases especially important in register-based studies, and methods for evaluating the completeness and validity of registers. The main strengths are that the data already exist and valuable time has already passed, that complete study populations minimize selection bias, and that the data are collected independently of the research. The main limitations are that necessary information may be unavailable, that data collection is not done by the researcher, that confounder information and information on data quality may be lacking, that truncation at the start of follow-up makes it difficult to differentiate between prevalent and incident cases, and the risk of data dredging. We conclude that epidemiological studies that include all persons in a population, follow them for decades, and become available relatively quickly are important data sources for modern epidemiology, but it is important to acknowledge the data limitations.

326 citations


Proceedings ArticleDOI
07 Apr 2014
TL;DR: This work presents a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development, and argues that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality.
Abstract: Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.
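The sketch below illustrates the template-instantiation idea in a hedged way: a generic SPARQL test-case template is filled with a property and expected datatype, then run against an endpoint to count violations. The template text, the datatype check and the DBpedia example are assumptions of this sketch, not patterns from the authors' test-case library.

```python
# Instantiate a SPARQL quality-test template and count violating triples.
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATE = """
SELECT (COUNT(?s) AS ?violations) WHERE {{
  ?s <{property}> ?value .
  FILTER (!isLiteral(?value) || datatype(?value) != <{datatype}>)
}}
"""

def run_test_case(endpoint: str, prop: str, datatype: str) -> int:
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(TEMPLATE.format(property=prop, datatype=datatype))
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["violations"]["value"])

# Example instantiation: birth dates should be typed as xsd:date.
count = run_test_case("https://dbpedia.org/sparql",
                      "http://dbpedia.org/ontology/birthDate",
                      "http://www.w3.org/2001/XMLSchema#date")
print(count, "violating triples")
```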

Journal ArticleDOI
TL;DR: This paper explores and discusses the advantages of 4D BIM for a quality application based on construction codes, by constructing the model in a product, organization and process (POP) data definition structure.

Journal ArticleDOI
TL;DR: In this article, the authors present a set of guidelines that can be adopted to ensure that reliable Hf isotopic data are obtained by this technique and discuss a number of potential pitfalls vis-a-vis the assignment of the incorrect age to the measured isotope composition.

BookDOI
01 May 2014
TL;DR: This book gives an overview of GIS functionality, covering the acquisition of geo-referenced data, data storage and retrieval, spatial data modelling and analysis, and graphics, images and visualisation, including computer graphics technology for display and interaction.
Abstract: Part 1: Introduction. 1. Origins and Applications. 2. Geographical Information Concepts and Spatial Models. 3. GIS Functionality: An Overview. Part 2: Acquisition of Geo-referenced Data. 4. Coordinate Systems, Transformations and Map Projections. 5. Digitising, Editing and Structuring. 6. Primary Data Acquisition from Ground and Remote Surveys. 7. Data Quality and Data Standards. Part 3: Data Storage and Retrieval. 8. Computer Data Storage. 9. Database Management Systems. 10. Spatial Data Access Methods for Points, Lines and Polygons. Part 4: Spatial Data Modelling and Analysis. 11. Surface Modelling and Spatial Interpolation. 12. Optimal Solutions and Spatial Search. 13. Knowledge-Based Systems and Automated Reasoning. Part 5: Graphics, Images and Visualisation. 14. Computer Graphics Technology for Display and Interaction. 15. Three Dimensional Visualisation. 16. Raster and Vector Interconversions. 17. Map Generalisation. 18. Automated Design of Annotated Maps.

Journal ArticleDOI
TL;DR: This work considers issues like missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and unrepresentative user behavior in the Parallel Workloads Archive, a repository of job-level usage data from large-scale parallel supercomputing systems.

Proceedings ArticleDOI
Barna Saha, Divesh Srivastava
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the "data speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
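A toy illustration of "letting the data speak for itself": instead of specifying a quality rule a priori, measure how well a candidate rule (here, the functional dependency zip -> city) actually holds in the data. The column names and sample rows are assumptions for illustration, not material from the tutorial.

```python
# Approximate functional-dependency check: what fraction of rows respects zip -> city?
import pandas as pd

df = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94105", "94105"],
    "city": ["New York", "New York", "Brooklyn", "San Francisco", "San Francisco"],
})

def fd_support(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Fraction of rows consistent with the majority rhs value within each lhs group."""
    consistent = df.groupby(lhs)[rhs].apply(lambda s: s.value_counts().iloc[0]).sum()
    return consistent / len(df)

print(fd_support(df, "zip", "city"))  # 0.8 -> the rule holds approximately; one row violates it
```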

Journal ArticleDOI
TL;DR: Why the ACS tract and block group estimates have large margins of error is explained, and a number of geographic strategies for improving the usability and quality of ACS estimates are suggested.

Journal ArticleDOI
TL;DR: It was found that the dimension of data was most frequently assessed, and that completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality.
Abstract: High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and the data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found that the dimension of data was most frequently assessed. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and the data collection process, inconsistency in the definition of attributes of data quality, failure to address data users' concerns and a lack of systematic procedures in data quality assessment. This review study is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes. More research effort should be given to assessing the quality of data use and the quality of the data collection process.
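As a small, concrete example of quantifying two of the attributes the review found most used, the sketch below computes completeness and timeliness over a tabular set of case reports. The field names and the 30-day timeliness threshold are assumptions for illustration, not definitions from the reviewed studies.

```python
# Completeness = share of non-missing values; timeliness = share reported within 30 days.
import pandas as pd

records = pd.DataFrame({
    "case_id":     [1, 2, 3, 4],
    "diagnosis":   ["flu", None, "flu", "measles"],
    "onset_date":  pd.to_datetime(["2014-01-01", "2014-01-05", "2014-02-01", "2014-02-10"]),
    "report_date": pd.to_datetime(["2014-01-10", "2014-03-01", "2014-02-05", "2014-02-15"]),
})

completeness = records["diagnosis"].notna().mean()
delay_days = (records["report_date"] - records["onset_date"]).dt.days
timeliness = (delay_days <= 30).mean()

print(f"completeness={completeness:.2f}, timeliness={timeliness:.2f}")
```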

Journal ArticleDOI
TL;DR: In this article, the authors review some applications where field reliability data are used and explore some of the opportunities to use modern reliability data to provide stronger statistical methods to operate and predict the performance of systems in the field.
Abstract: This article reviews some applications where field reliability data are used and explores some of the opportunities to use modern reliability data to provide stronger statistical methods to operate and predict the performance of systems in the field.

Journal ArticleDOI
TL;DR: This work presents a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs.
Abstract: Next generation sequencing (NGS) technologies that parallelize the sequencing process and produce thousands to millions, or even hundreds of millions of sequences in a single sequencing run, have revolutionized genomic and genetic research. Because of the vagaries of any platform’s sequencing chemistry, the experimental processing, machine failure, and so on, the quality of sequencing reads is never perfect, and often declines as the read is extended. These errors invariably affect downstream analysis/application and should therefore be identified early on to mitigate any unforeseen effects. Here we present a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs. Both the speed of processing and the memory footprint of storing all required information have been optimized via algorithmic and parallel processing solutions. The trimmed output compared side-by-side with the original data is part of the automated PDF output. We show how this tool can help data analysis by providing a few examples, including an increased percentage of reads recruited to references, improved single nucleotide polymorphism identification as well as de novo sequence assembly metrics. FaQCs combines several features of currently available applications into a single, user-friendly process, and includes additional unique capabilities such as filtering the PhiX control sequences, conversion of FASTQ formats, and multi-threading. The original data and trimmed summaries are reported within a variety of graphics and reports, providing a simple way to do data quality control and assurance.
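The sketch below shows the kind of per-read 3'-end trimming and length filtering a FastQ QC tool performs; it is a simplified stand-in, not FaQCs itself. The Phred+33 encoding, Q20 cutoff and 50 bp minimum length are common conventions used here as assumptions.

```python
# Trim low-quality read tails and drop reads that become too short.
def trim_read(seq: str, qual: str, min_q: int = 20, min_len: int = 50):
    """Trim trailing bases whose Phred quality is below min_q; drop short reads."""
    scores = [ord(c) - 33 for c in qual]          # Phred+33 decoding
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return (seq[:end], qual[:end]) if end >= min_len else None

def process_fastq(path: str, out_path: str):
    with open(path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break                              # end of file
            seq = fin.readline().rstrip()
            plus = fin.readline().rstrip()
            qual = fin.readline().rstrip()
            trimmed = trim_read(seq, qual)
            if trimmed:
                fout.write(f"{header}\n{trimmed[0]}\n{plus}\n{trimmed[1]}\n")
```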

Journal ArticleDOI
TL;DR: Two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets are presented: SDType adds missing type statements, and SDValidate identifies faulty statements.
Abstract: Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate as well as scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
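To give a feel for the SDType idea, the simplified sketch below lets the properties observed on a resource "vote" for candidate types according to how those properties are distributed over typed resources elsewhere in the data. The toy statistics, the equal per-property weighting and the 0.4 threshold are assumptions of this sketch, not the authors' exact weighting scheme.

```python
# Property-based type inference: each property contributes its conditional type distribution.
from collections import Counter, defaultdict

# Training statistics: for each property, how often its subjects carry each type.
type_dist = {
    "dbo:birthPlace": Counter({"dbo:Person": 950, "dbo:Place": 50}),
    "dbo:capital":    Counter({"dbo:Country": 800, "dbo:Place": 200}),
}

def predict_types(properties: list[str], threshold: float = 0.4) -> list[str]:
    scores = defaultdict(float)
    for prop in properties:
        dist = type_dist.get(prop)
        if not dist:
            continue
        total = sum(dist.values())
        for rdf_type, count in dist.items():
            scores[rdf_type] += count / total     # add P(type | property)
    n = max(len(properties), 1)
    return [t for t, s in scores.items() if s / n >= threshold]

print(predict_types(["dbo:birthPlace"]))  # ['dbo:Person'] -> add the missing type statement
```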

Journal ArticleDOI
TL;DR: The successful launch of Landsat 8 provides a new data source for monitoring land cover, which has the potential to significantly improve the characterization of the earth's surface as discussed by the authors. The results indicated that the OLI data quality was slightly better than the ETM+ data quality in the visible bands; the near-infrared band of the OLI data showed a clear improvement, whereas no clear improvement was found in the shortwave-infrared bands.
Abstract: The successful launch of Landsat 8 provides a new data source for monitoring land cover, which has the potential to significantly improve the characterization of the earth's surface. To assess data performance, Landsat 8 Operational Land Imager (OLI) data were first compared with Landsat 7 ETM+ data using texture features as the indicators. Furthermore, the OLI data were investigated for land cover classification using the maximum likelihood and support vector machine classifiers in Beijing. The results indicated that (1) the OLI data quality was slightly better than the ETM+ data quality in the visible bands; the near-infrared band of the OLI data showed a clear improvement, whereas no clear improvement was found in the shortwave-infrared bands. Moreover, (2) OLI data had a satisfactory performance in terms of land cover classification. In summary, OLI data were a reliable data source for monitoring land cover and provided continuity in Landsat earth observation.
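The sketch below shows the shape of the classification step described above: train a support vector machine on labelled pixel spectra and predict land-cover classes. The band values, class labels and kernel settings are synthetic placeholders, not the Beijing OLI data or parameters used in the study.

```python
# SVM land-cover classification on pixel band values (synthetic illustrative data).
import numpy as np
from sklearn.svm import SVC

# Rows: pixels; columns: reflectance in a few OLI bands (illustrative values).
X_train = np.array([[0.05, 0.04, 0.30], [0.06, 0.05, 0.35],   # vegetation
                    [0.20, 0.25, 0.28], [0.22, 0.26, 0.30]])  # built-up
y_train = np.array(["vegetation", "vegetation", "built-up", "built-up"])

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

X_new = np.array([[0.055, 0.045, 0.33]])
print(clf.predict(X_new))  # expected: ['vegetation']
```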

Proceedings ArticleDOI
02 Apr 2014
TL;DR: JetStream is presented, a system that allows real-time analysis of large, widely-distributed changing data sets, and its adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two.
Abstract: We present JetStream, a system that allows real-time analysis of large, widely-distributed, changing data sets. Traditional approaches to distributed analytics require users to specify in advance which data is to be backhauled to a central location for analysis. This is a poor match for domains where available bandwidth is scarce and it is infeasible to collect all potentially useful data. JetStream addresses bandwidth limits in two ways, both of which are explicit in the programming model. The system incorporates structured storage in the form of OLAP data cubes, so data can be stored for analysis near where it is generated. Using cubes, queries can aggregate data in ways and locations of their choosing. The system also includes adaptive filtering and other transformations that adjust data quality to match available bandwidth. Many bandwidth-saving transformations are possible; we discuss which are appropriate for which data and how they can best be combined. We implemented a range of analytic queries on web request logs and image data. Queries could be expressed in a few lines of code. Using structured storage on source nodes conserved network bandwidth by allowing data to be collected only when needed to fulfill queries. Our adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two, and are flexible enough to express practical policies.
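The toy sketch below conveys the adaptive-degradation idea in miniature: when measured bandwidth drops, coarsen the time granularity of locally aggregated cube cells before sending them upstream. The thresholds, record layout and bucketing policy are assumptions of this sketch, not JetStream's actual mechanisms.

```python
# Coarsen local aggregation granularity as available bandwidth shrinks.
from collections import defaultdict

def choose_bucket_seconds(bandwidth_kbps: float) -> int:
    if bandwidth_kbps > 1000:
        return 1        # plenty of bandwidth: send per-second counts
    if bandwidth_kbps > 200:
        return 10
    return 60           # constrained link: send per-minute counts only

def aggregate(requests, bandwidth_kbps):
    """requests: iterable of (timestamp_seconds, url). Returns coarsened cube cells."""
    bucket = choose_bucket_seconds(bandwidth_kbps)
    cube = defaultdict(int)
    for ts, url in requests:
        cube[(ts // bucket * bucket, url)] += 1
    return dict(cube)

log = [(0, "/a"), (1, "/a"), (30, "/b"), (61, "/a")]
print(aggregate(log, bandwidth_kbps=150))  # {(0, '/a'): 2, (0, '/b'): 1, (60, '/a'): 1}
```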

Journal ArticleDOI
TL;DR: As mentioned in this paper, the IAB Establishment Panel was launched in 1993 to obtain information on the demand side of the labor market; it meets two requirements: providing high-quality data for scientific aims and serving as an information system for policy makers and practitioners.
Abstract: The IAB Establishment Panel was launched to obtain information on the demand side of the labor market. The data meet two requirements: providing high-quality data for scientific aims and serving as an information system for policy makers and practitioners. Since the panel started in 1993, a rich data set covering 20 years of establishment surveys is now available. This article provides information about methodological issues of sample design and data sampling and about changes that have taken place in recent years. We focus on quality issues, efforts to improve the survey, and some ongoing discussions about methodological adjustments of the survey mode.

Journal ArticleDOI
01 Mar 2014
TL;DR: In order for manufacturers to take advantage of the use of data and analytics for better operational performance, complementary resources such as fact-based SCM initiatives must be combined with BA initiatives focusing on data quality and advanced analytics.
Abstract: This study is interested in the impact of two specific business analytics (BA) resources, accurate manufacturing data and advanced analytics, on a firm's operational performance. The use of advanced analytics, such as mathematical optimization techniques, and the importance of manufacturing data accuracy have long been recognized as potential organizational resources or assets for improving the quality of manufacturing planning and control and of a firm's overall operational performance. This research adopted a contingent resource-based theory (RBT), suggesting the moderating and mediating role of fact-based SCM initiatives as complementary resources. This research proposition was tested using Global Manufacturing Research Group (GMRG) survey data and was analyzed using partial least squares structural equation modeling. The research findings shed light on the critical role of fact-based SCM initiatives as complementary resources, which moderate the impact of data accuracy on manufacturing planning quality and mediate the impact of advanced analytics on operational performance. The implication is that the impact of business analytics for manufacturing is contingent on context, specifically the use of fact-based SCM initiatives such as TQM, JIT, and statistical process control. Moreover, in order for manufacturers to take advantage of the use of data and analytics for better operational performance, complementary resources such as fact-based SCM initiatives must be combined with BA initiatives focusing on data quality and advanced analytics.

Journal ArticleDOI
TL;DR: This review discusses the proper quality control procedures and parameters for Illumina technology-based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling.
Abstract: Advances in next-generation sequencing (NGS) technologies have greatly improved our ability to detect genomic variants for biomedical research. In particular, NGS technologies have been recently applied with great success to the discovery of mutations associated with the growth of various tumours and in rare Mendelian diseases. The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is quality control of the sequencing data. In this review, we discuss the proper quality control procedures and parameters for Illumina technology–based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling. Monitoring quality control metrics at each of the three stages of NGS data provides unique and independent evaluations of data quality from differing perspectives. Properly conducting quality control protocols at all three stages and correctly interpreting the quality control results are crucial to ensure a successful and meaningful study.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper evaluated the quality of China's gross domestic product (GDP) statistics and concluded that the supposed evidence for GDP data falsification is not compelling, that the National Bureau of Statistics has much institutional scope for falsifying GDP data, and that certain manipulations of nominal and real data would be virtually undetectable.

Journal ArticleDOI
TL;DR: A survey of smartphone-based insurance telematics is presented, including definitions; figures of merit (FoMs) describing the behavior of the driver and the characteristics of the trip; and risk profiling of the driver based on different sets of FoMs, with the smartphone data quality characterized in terms of accuracy, integrity, availability, and continuity of service.
Abstract: Smartphone-based insurance telematics, or usage-based insurance, is a disruptive technology which relies on insurance premiums that reflect the risk profile of the driver, measured via smartphones with appropriate installed software. A survey of smartphone-based insurance telematics is presented, including definitions; figures of merit (FoMs) describing the behavior of the driver and the characteristics of the trip; and risk profiling of the driver based on different sets of FoMs. The data quality provided by the smartphone is characterized in terms of accuracy, integrity, availability, and continuity of service. The quality of the smartphone data is further compared with the quality of data from traditional in-car mounted devices for insurance telematics, revealing the obstacles that have to be overcome for a successful smartphone-based installation, namely poor integrity and low availability. Simply speaking, the smartphone measurements lack reliability. Integrity enhancement of smartphone data is illustrated both by second-by-second low-level signal processing to combat outliers and perform integrity monitoring, and by trip-based map-matching for robustification of the recorded trip data. A plurality of FoMs are described, analyzed and categorized, including events and properties like harsh braking, speeding, and location. The categorization of the FoMs in terms of observability, stationarity, driver influence, and actuarial relevance provides tools for robust risk profiling of the driver and the trip. Proper driver feedback is briefly discussed, and rules of thumb for feedback design are included. The work is supported by experimental validation, statistical analysis, and experiences from a recent insurance telematics pilot run in Sweden.
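As an illustration of deriving one FoM mentioned above, harsh braking, the sketch below scans second-by-second speed samples for large decelerations. The 3 m/s^2 threshold, the 1 Hz sampling assumption and the input format are illustrative choices, not the paper's calibrated values.

```python
# Flag samples where speed drops sharply from one second to the next.
def harsh_braking_events(speeds_mps, threshold_mps2=3.0):
    """speeds_mps: list of speed samples at 1 Hz. Returns indices where braking is harsh."""
    events = []
    for i in range(1, len(speeds_mps)):
        decel = speeds_mps[i - 1] - speeds_mps[i]   # m/s lost over one second = m/s^2
        if decel >= threshold_mps2:
            events.append(i)
    return events

trip = [15.0, 15.2, 14.8, 10.5, 7.0, 6.9, 6.8]      # a hard stop around samples 3-4
print(harsh_braking_events(trip))                   # [3, 4]
```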

Journal ArticleDOI
TL;DR: Among the various methods, probabilistic principal component analysis (PPCA) yields the best performance in all aspects; it can be used to impute data online before further analysis and is robust to weather changes.
Abstract: Many traffic management and control applications require highly complete and accurate traffic flow data. However, for various reasons such as sensor failure or transmission error, it is common for some traffic flow data to be lost. As a result, various methods using a wide spectrum of techniques have been proposed over the last two decades to estimate missing traffic data. Generally, these missing data imputation methods can be categorised into three kinds: prediction methods, interpolation methods and statistical learning methods. To assess their performance, these methods are compared from different aspects in this paper, including reconstruction errors, statistical behaviours and running speeds. Results show that statistical learning methods are more effective than the other two kinds of imputation methods when data from a single detector are utilised. Among the various methods, probabilistic principal component analysis (PPCA) yields the best performance in all aspects. Numerical tests demonstrate that PPCA can be used to impute data online before further analysis (e.g. making traffic predictions) and is robust to weather changes.
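In the spirit of the PPCA result above, the sketch below imputes missing traffic-flow values by iteratively reconstructing the day-by-interval matrix from its leading principal components. Plain iterative PCA stands in here for full probabilistic PCA; the component count and iteration limit are assumptions of this sketch, not the paper's estimator.

```python
# Iterative low-rank imputation: fill with column means, then refine via PCA reconstruction.
import numpy as np
from sklearn.decomposition import PCA

def pca_impute(X: np.ndarray, n_components: int = 2, n_iter: int = 50) -> np.ndarray:
    """X: days x time-intervals flow matrix with NaNs marking missing observations."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0, keepdims=True), X)  # start from column means
    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        reconstructed = pca.inverse_transform(pca.fit_transform(filled))
        filled[missing] = reconstructed[missing]       # only overwrite the missing cells
    return filled
```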

Journal ArticleDOI
TL;DR: This paper reviews recent data mining research in the educational field and outlines future research directions in educational data mining.
Abstract: The management of higher education must be evaluated on an ongoing basis in order to improve the quality of institutions. This requires evaluation of various data, information and knowledge from both inside and outside the institutions. Institutions plan to use the collected data more efficiently and to develop tools for collecting and directing management information, in order to support managerial decision making. The collected data can be utilized to evaluate quality, perform analyses and diagnoses, evaluate adherence to the standards and practices of curricula and syllabi, and suggest alternatives in decision processes. Data mining methods to support decision making are well suited to providing decision support in educational environments, by generating and presenting relevant information and knowledge towards quality improvement of education processes. In the educational domain this information is very useful, since it can be used as a basis for investigating and enhancing current educational standards and management. In this paper, a review of data mining for academic decision support in the education field is presented. The paper reviews recent data mining work in the educational field and outlines future research in educational data mining.