
Showing papers on "Data quality" published in 2014


MonographDOI
01 Jan 2014
TL;DR: The case for open data and the economics of open data are discussed in this paper, with a focus on the benefits of data integration and the challenges of building data infrastructures.
Abstract:
Chapter 1: Conceptualising Data. What are data? Kinds of data. Data, information, knowledge, wisdom. Framing data. Thinking critically about databases and data infrastructures. Data assemblages and the data revolution.
Chapter 2: Small Data, Data Infrastructures and Data Brokers. Data holdings, data archives and data infrastructures. Rationale for research data infrastructures. The challenges of building data infrastructures. Data brokers and markets.
Chapter 3: Open and Linked Data. Open data. Linked data. The case for open data. The economics of open data. Concerns with respect to opening data.
Chapter 4: Big Data. Volume. Exhaustive. Resolution and indexicality. Relationality. Velocity. Variety. Flexibility.
Chapter 5: Enablers and Sources of Big Data. The enablers of big data. Sources of big data. Directed data. Automated data. Volunteered data.
Chapter 6: Data Analytics. Pre-analytics. Machine learning. Data mining and pattern recognition. Data visualisation and visual analytics. Statistical analysis. Prediction, simulation and optimization.
Chapter 7: The Governmental and Business Rationale for Big Data. Governing people. Managing organisations. Leveraging value and producing capital. Creating better places.
Chapter 8: The Reframing of Science, Social Science and Humanities Research. The fourth paradigm in science? The re-emergence of empiricism. The fallacies of empiricism. Data-driven science. Computational social sciences and digital humanities.
Chapter 9: Technical and Organisational Issues. Deserts and deluges. Access. Data quality, veracity and lineage. Data integration and interoperability. Poor analysis and ecological fallacies. Skills and human resourcing.
Chapter 10: Ethical, Political, Social and Legal Concerns. Data shadows and dataveillance. Privacy. Data security. Profiling, social sorting and redlining. Secondary uses, control creep and anticipatory governance. Modes of governance and technological lock-ins.
Chapter 11: Making Sense of the Data Revolution. Understanding data and the data revolution. Researching data assemblages. Final thoughts.

751 citations



Book
30 Aug 2014
TL;DR: This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process, and offers a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature.
Abstract: Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely contain inconsistencies and errors and, most importantly, will not be ready for the data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more sophisticated tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim to reduce the complexity of the data by detecting or removing irrelevant and noisy elements. This book is intended to review the tasks that fill the gap between data acquisition from the source and the data mining process. It offers a comprehensive look from a practical point of view, covering basic concepts and surveying the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, from basic concepts and detailed descriptions of classical algorithms to an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.
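As a rough illustration of the preprocessing tasks the book groups together (cleaning, noise handling, data reduction), the sketch below chains a few common steps with pandas and scikit-learn. The column handling, thresholds and use of PCA are assumptions of this sketch, not techniques prescribed by the book.

```python
# Minimal preprocessing sketch: imputation, outlier clipping, dimensionality reduction.
# Thresholds (1st/99th percentiles, 5 components) are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def preprocess(df: pd.DataFrame, n_components: int = 5) -> pd.DataFrame:
    # 1. Cleaning: drop duplicate rows and impute missing numeric values with the median.
    df = df.drop_duplicates()
    numeric = df.select_dtypes(include=[np.number])
    df[numeric.columns] = numeric.fillna(numeric.median())

    # 2. Noise handling: clip extreme outliers to the 1st/99th percentiles per column.
    lower, upper = numeric.quantile(0.01), numeric.quantile(0.99)
    df[numeric.columns] = df[numeric.columns].clip(lower=lower, upper=upper, axis=1)

    # 3. Data reduction: project numeric features onto a few principal components.
    pca = PCA(n_components=min(n_components, len(numeric.columns)))
    reduced = pca.fit_transform(df[numeric.columns])
    return pd.DataFrame(reduced, index=df.index,
                        columns=[f"pc{i + 1}" for i in range(reduced.shape[1])])
```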

678 citations


Journal ArticleDOI
TL;DR: The data quality problem in the context of supply chain management (SCM) is introduced and methods for monitoring and controlling data quality are proposed and highlighted.

652 citations


Journal ArticleDOI
TL;DR: A research model is proposed to explain the acquisition intention of big data analytics, mainly from the theoretical perspectives of data quality management and data usage experience. Empirical investigation reveals that a firm's intention to adopt big data analytics can be positively affected by its competence in maintaining the quality of corporate data.

550 citations


Journal ArticleDOI
Lei Xu, Chunxiao Jiang, Jian Wang, Jian Yuan, Yong Ren
TL;DR: This paper identifies four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker, and examines various approaches that can help to protect sensitive information.
Abstract: The growing popularity and development of data mining technologies bring a serious threat to the security of individuals' sensitive information. An emerging research topic in data mining, known as privacy-preserving data mining (PPDM), has been extensively studied in recent years. The basic idea of PPDM is to modify the data in such a way that data mining algorithms can be performed effectively without compromising the security of sensitive information contained in the data. Current studies of PPDM mainly focus on how to reduce the privacy risk brought by data mining operations, while in fact unwanted disclosure of sensitive information may also happen in the process of data collecting, data publishing, and information (i.e., the data mining results) delivering. In this paper, we view the privacy issues related to data mining from a wider perspective and investigate various approaches that can help to protect sensitive information. In particular, we identify four different types of users involved in data mining applications, namely, data provider, data collector, data miner, and decision maker. For each type of user, we discuss their privacy concerns and the methods that can be adopted to protect sensitive information. We briefly introduce the basics of related research topics, review state-of-the-art approaches, and present some preliminary thoughts on future research directions. Besides exploring the privacy-preserving approaches for each type of user, we also review the game theoretical approaches, which are proposed for analyzing the interactions among different users in a data mining scenario, each of whom has their own valuation of the sensitive information. By differentiating the responsibilities of different users with respect to the security of sensitive information, we would like to provide some useful insights into the study of PPDM.
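To make the basic PPDM idea concrete, the toy sketch below uses classic randomized response: each data provider perturbs a sensitive yes/no answer locally, yet the collector can still estimate the aggregate. This is a generic textbook technique chosen for illustration, not one of the approaches surveyed in the paper; the 0.75 truth probability is an arbitrary assumption.

```python
# Randomized response: perturb individual answers, recover the population rate.
import random

def randomize(answer: bool, p_truth: float = 0.75) -> bool:
    """With probability p_truth report the true answer, otherwise flip a fair coin."""
    if random.random() < p_truth:
        return answer
    return random.random() < 0.5

def estimate_true_rate(reports: list[bool], p_truth: float = 0.75) -> float:
    """Invert the randomization to get an unbiased estimate of the true 'yes' rate."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Example: 10,000 providers, 30% of whom truly answer 'yes'.
truth = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomize(t) for t in truth]
print(round(estimate_true_rate(reports), 3))  # close to 0.30
```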

528 citations


Journal ArticleDOI
TL;DR: In this paper, the authors provide guidelines and suggestions regarding the application of spatial interpolation methods to environmental data by comparing the features of the commonly applied methods, which fall into three categories: non-geostatistical interpolation methods, geostatistical interpolation methods and combined methods.
Abstract: Spatially continuous data of environmental variables are often required for environmental sciences and management. However, information for environmental variables is usually collected by point sampling, particularly in mountainous regions and deep ocean areas. Thus, methods generating such spatially continuous data from point samples become essential tools. Spatial interpolation methods (SIMs) are, however, often data-specific or even variable-specific. Many factors affect the predictive performance of the methods, and previous studies have shown that their effects are not consistent. Hence it is difficult to select an appropriate method for a given dataset. This review aims to provide guidelines and suggestions regarding the application of SIMs to environmental data by comparing the features of the commonly applied methods, which fall into three categories, namely: non-geostatistical interpolation methods, geostatistical interpolation methods and combined methods. Factors affecting the performance, including sampling design, sample spatial distribution, data quality, correlation between primary and secondary variables, and interaction among factors, are discussed. A total of 25 commonly applied methods are then classified based on their features to provide an overview of the relationships among them. These features are quantified and then clustered to show similarities among these 25 methods. An easy-to-use decision tree for selecting an appropriate method from these 25 methods is developed based on data availability, data nature, expected estimation, and features of the method. Finally, a list of software packages for spatial interpolation is provided. Highlights: comparison of commonly used spatial interpolation methods in environmental science; analysis of factors affecting the performance of spatial interpolation methods; classification of 25 methods to illustrate their relationships; guidelines for selecting an appropriate method for a given dataset; a list of software packages for commonly used spatial interpolation methods.
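As a small example from the review's first category (non-geostatistical methods), the sketch below implements inverse distance weighting (IDW) over point samples. The sample coordinates, values and the power parameter are illustrative assumptions, not data from the review.

```python
# Inverse distance weighting: estimate a value at query points from nearby samples.
import numpy as np

def idw(sample_xy: np.ndarray, sample_z: np.ndarray,
        query_xy: np.ndarray, power: float = 2.0) -> np.ndarray:
    """Estimate values at query_xy as distance-weighted averages of sampled values."""
    # Pairwise distances between query points and sample points.
    d = np.linalg.norm(query_xy[:, None, :] - sample_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)          # avoid division by zero at sample locations
    w = 1.0 / d ** power              # closer samples get more weight
    return (w @ sample_z) / w.sum(axis=1)

# Toy example: interpolate a value at (0.5, 0.5) from four corner samples.
xy = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
z = np.array([1.0, 2.0, 3.0, 4.0])
print(idw(xy, z, np.array([[0.5, 0.5]])))  # ~2.5, the average of the corners
```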

466 citations


Journal ArticleDOI
TL;DR: A new pipeline, SNPhylo, is presented to construct phylogenetic trees based on large SNP datasets; it can help a researcher focus more on interpretation of the results of analysis of voluminous data sets than on the manipulations necessary to accomplish the analysis.
Abstract: Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data. We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline. Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
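The sketch below shows the kind of quality filtering and linkage disequilibrium (LD) pruning described above, applied to a 0/1/2 genotype matrix before tree building. The encoding, thresholds and the adjacent-SNP pruning heuristic are assumptions of this sketch; they are not SNPhylo's exact defaults or algorithm.

```python
# Drop low-quality SNPs (missingness, minor-allele frequency) and prune SNPs in strong LD.
import numpy as np

def filter_and_prune(genotypes: np.ndarray, max_missing=0.1, min_maf=0.05, max_r2=0.8):
    """genotypes: SNPs x samples matrix of allele counts (0/1/2), NaN for missing."""
    kept = []
    for i, row in enumerate(genotypes):
        obs = row[~np.isnan(row)]
        if len(obs) == 0 or np.isnan(row).mean() > max_missing:
            continue                                  # too much missing data
        freq = obs.mean() / 2.0                       # alternate-allele frequency
        if min(freq, 1 - freq) < min_maf:
            continue                                  # nearly monomorphic SNP
        # LD pruning heuristic: skip SNPs strongly correlated with the previously kept SNP.
        if kept:
            prev = genotypes[kept[-1]]
            mask = ~np.isnan(row) & ~np.isnan(prev)
            if mask.sum() > 1 and abs(np.corrcoef(row[mask], prev[mask])[0, 1]) ** 2 > max_r2:
                continue
        kept.append(i)
    return kept  # indices of SNPs to feed into maximum-likelihood tree construction
```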

393 citations


Journal ArticleDOI
TL;DR: A framework containing more than 25 methods and indicators is presented, allowing arbitrarily repeatable intrinsic OSM quality analyses for any part of the world, based solely on the data's history.
Abstract: OpenStreetMap (OSM) is one of the most popular examples of a Volunteered Geographic Information (VGI) project. In recent years it has become a serious alternative source of geodata. Since the quality of OSM data can vary strongly, different aspects have been investigated in several scientific studies. In most cases the data is compared with commercial or administrative datasets which, however, are not always accessible due to lack of availability, contradictory licensing restrictions or high procurement costs. In this investigation a framework containing more than 25 methods and indicators is presented, allowing OSM quality assessments based solely on the data's history. Without the use of a reference data set, approximate statements on OSM data quality are possible. For this purpose existing methods are taken up, developed further, and integrated into an extensible open source framework. This enables arbitrarily repeatable intrinsic OSM quality analyses for any part of the world.
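To show what an intrinsic, history-only indicator can look like, the sketch below scores feature maturity from version counts and contributor counts alone, with no reference data set. The edit-record format and the scoring heuristic are assumptions of this sketch, not indicators from the authors' framework.

```python
# History-based quality heuristic: more revisions by more mappers -> likely more settled.
from collections import defaultdict

# Each edit: (feature_id, version, user). In practice this comes from the OSM full-history dump.
edits = [
    ("way/1", 1, "alice"), ("way/1", 2, "bob"), ("way/1", 3, "alice"),
    ("way/2", 1, "carol"),
]

versions = defaultdict(int)
contributors = defaultdict(set)
for feature, version, user in edits:
    versions[feature] = max(versions[feature], version)
    contributors[feature].add(user)

for feature in versions:
    # Cap both signals and combine them into a crude 0..1 maturity score.
    score = min(versions[feature], 5) / 5 * 0.5 + min(len(contributors[feature]), 3) / 3 * 0.5
    print(feature, round(score, 2))
```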

327 citations


Journal ArticleDOI
TL;DR: It is concluded that epidemiological studies that include all persons in a population, follow them for decades, and become available relatively quickly are important data sources for modern epidemiology, but it is important to acknowledge the data limitations.
Abstract: Studies based on databases, medical records and registers are used extensively today in epidemiological research. Despite this increasing use, no developed methodological literature on the use and evaluation of population-based registers is available, even though data collection in register-based studies differs from researcher-collected data, all persons in a population are available, and traditional statistical analyses focusing on sampling error as the main source of uncertainty may not be relevant. We present the main strengths and limitations of register-based studies, the biases especially important in register-based studies, and methods for evaluating the completeness and validity of registers. The main strengths are that the data already exist and valuable time has already passed, that complete study populations minimize selection bias, and that the data are collected independently of the research. The main limitations are that necessary information may be unavailable, that data collection is not done by the researcher, that confounder information and information on data quality may be lacking, that truncation at the start of follow-up makes it difficult to differentiate between prevalent and incident cases, and the risk of data dredging. We conclude that epidemiological studies that include all persons in a population, follow them for decades, and become available relatively quickly are important data sources for modern epidemiology, but it is important to acknowledge the data limitations.

326 citations


Proceedings ArticleDOI
07 Apr 2014
TL;DR: This work presents a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development, and argues that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality.
Abstract: Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain specific semantics can be encoded in the data quality test cases, thus being able to discover data quality problems beyond conventional quality heuristics.
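The sketch below illustrates the template-instantiation idea in a hedged way: a generic SPARQL test-case template is filled with a property and expected datatype, then run against an endpoint to count violations. The template text, the datatype check and the DBpedia example are assumptions of this sketch, not patterns from the authors' test-case library.

```python
# Instantiate a SPARQL quality-test template and count violating triples.
from SPARQLWrapper import SPARQLWrapper, JSON

TEMPLATE = """
SELECT (COUNT(?s) AS ?violations) WHERE {{
  ?s <{property}> ?value .
  FILTER (!isLiteral(?value) || datatype(?value) != <{datatype}>)
}}
"""

def run_test_case(endpoint: str, prop: str, datatype: str) -> int:
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(TEMPLATE.format(property=prop, datatype=datatype))
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return int(result["results"]["bindings"][0]["violations"]["value"])

# Example instantiation: birth dates should be typed as xsd:date.
count = run_test_case("https://dbpedia.org/sparql",
                      "http://dbpedia.org/ontology/birthDate",
                      "http://www.w3.org/2001/XMLSchema#date")
print(count, "violating triples")
```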

Journal ArticleDOI
TL;DR: This paper explores and discusses the advantages of 4D BIM for a quality application based on construction codes, by constructing the model in a product, organization and process (POP) data definition structure.

Journal ArticleDOI
TL;DR: In this article, the authors present a set of guidelines that can be adopted to ensure that reliable Hf isotopic data are obtained by this technique and discuss a number of potential pitfalls vis-a-vis the assignment of the incorrect age to the measured isotope composition.

BookDOI
01 May 2014
TL;DR: This book gives an overview of GIS functionality, covering the acquisition of geo-referenced data, data storage and retrieval, spatial data modelling and analysis, and graphics, images and visualisation, including computer graphics technology for display and interaction.
Abstract: Part 1: Introduction. 1. Origins and Applications. 2. Geographical Information Concepts and Spatial Models. 3. GIS Functionality: An Overview. Part 2: Acquisition of Geo-referenced Data. 4. Coordinate Systems, Transformations and Map Projections. 5. Digitising, Editing and Structuring. 6. Primary Data Acquisition from Ground and Remote Surveys. 7. Data Quality and Data Standards. Part 3: Data Storage and Retrieval. 8. Computer Data Storage. 9. Database Management Systems. 10. Spatial Data Access Methods for Points, Lines and Polygons. Part 4: Spatial Data Modelling and Analysis. 11. Surface Modelling and Spatial Interpolation. 12. Optimal Solutions and Spatial Search. 13. Knowledge-Based Systems and Automated Reasoning. Part 5: Graphics, Images and Visualisation. 14. Computer Graphics Technology for Display and Interaction. 15. Three Dimensional Visualisation. 16. Raster and Vector Interconversions. 17. Map Generalisation. 18. Automated Design of Annotated Maps.

Journal ArticleDOI
TL;DR: This work considers issues like missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and unrepresentative user behavior in the Parallel Workloads Archive, a repository of job-level usage data from large-scale parallel supercomputing systems.

Proceedings ArticleDOI
Barna Saha, Divesh Srivastava
19 May 2014
TL;DR: This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
Abstract: In our Big Data era, data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Recent studies have shown that poor quality data is prevalent in large databases and on the Web. Since poor quality data can have serious consequences on the results of data analyses, the importance of veracity, the fourth 'V' of big data, is increasingly being recognized. In this tutorial, we highlight the substantial challenges that the first three 'V's, volume, velocity and variety, bring to dealing with veracity in big data. Due to the sheer volume and velocity of data, one needs to understand and (possibly) repair erroneous data in a scalable and timely manner. With the variety of data, often from a diversity of sources, data quality rules cannot be specified a priori; one needs to let the "data speak for itself" in order to discover the semantics of the data. This tutorial presents recent results that are relevant to big data quality management, focusing on the two major dimensions of (i) discovering quality issues from the data itself, and (ii) trading off accuracy vs. efficiency, and identifies a range of open problems for the community.
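A toy illustration of "letting the data speak for itself": instead of specifying a quality rule a priori, measure how well a candidate rule (here, the functional dependency zip -> city) actually holds in the data. The column names and sample rows are assumptions for illustration, not material from the tutorial.

```python
# Approximate functional-dependency check: what fraction of rows respects zip -> city?
import pandas as pd

df = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94105", "94105"],
    "city": ["New York", "New York", "Brooklyn", "San Francisco", "San Francisco"],
})

def fd_support(df: pd.DataFrame, lhs: str, rhs: str) -> float:
    """Fraction of rows consistent with the majority rhs value within each lhs group."""
    consistent = df.groupby(lhs)[rhs].apply(lambda s: s.value_counts().iloc[0]).sum()
    return consistent / len(df)

print(fd_support(df, "zip", "city"))  # 0.8 -> the rule holds approximately; one row violates it
```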

Journal ArticleDOI
TL;DR: Why the ACS tract and block group estimates have large margins of error is explained, and a number of geographic strategies for improving the usability and quality of ACS estimates are suggested.

Journal ArticleDOI
TL;DR: It was found that the dimension of data was most frequently assessed, and that completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality.
Abstract: High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and the data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found that the dimension of data was most frequently assessed. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and the data collection process, inconsistency in the definition of attributes of data quality, failure to address data users' concerns and a lack of systematic procedures in data quality assessment. This review study is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes. More research effort should be given to assessing the quality of data use and the quality of the data collection process.
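As a small, concrete example of quantifying two of the attributes the review found most used, the sketch below computes completeness and timeliness over a tabular set of case reports. The field names and the 30-day timeliness threshold are assumptions for illustration, not definitions from the reviewed studies.

```python
# Completeness = share of non-missing values; timeliness = share reported within 30 days.
import pandas as pd

records = pd.DataFrame({
    "case_id":     [1, 2, 3, 4],
    "diagnosis":   ["flu", None, "flu", "measles"],
    "onset_date":  pd.to_datetime(["2014-01-01", "2014-01-05", "2014-02-01", "2014-02-10"]),
    "report_date": pd.to_datetime(["2014-01-10", "2014-03-01", "2014-02-05", "2014-02-15"]),
})

completeness = records["diagnosis"].notna().mean()
delay_days = (records["report_date"] - records["onset_date"]).dt.days
timeliness = (delay_days <= 30).mean()

print(f"completeness={completeness:.2f}, timeliness={timeliness:.2f}")
```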

Journal ArticleDOI
TL;DR: In this article, the authors review some applications where field reliability data are used and explore some of the opportunities to use modern reliability data to provide stronger statistical methods to operate and predict the performance of systems in the field.
Abstract: This article reviews some applications where field reliability data are used and explores some of the opportunities to use modern reliability data to provide stronger statistical methods to operate and predict the performance of systems in the field.

Journal ArticleDOI
TL;DR: This work presents a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs.
Abstract: Next generation sequencing (NGS) technologies that parallelize the sequencing process and produce thousands to millions, or even hundreds of millions of sequences in a single sequencing run, have revolutionized genomic and genetic research. Because of the vagaries of any platform’s sequencing chemistry, the experimental processing, machine failure, and so on, the quality of sequencing reads is never perfect, and often declines as the read is extended. These errors invariably affect downstream analysis/application and should therefore be identified early on to mitigate any unforeseen effects. Here we present a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs. Both the speed of processing and the memory footprint of storing all required information have been optimized via algorithmic and parallel processing solutions. The trimmed output compared side-by-side with the original data is part of the automated PDF output. We show how this tool can help data analysis by providing a few examples, including an increased percentage of reads recruited to references, improved single nucleotide polymorphism identification as well as de novo sequence assembly metrics. FaQCs combines several features of currently available applications into a single, user-friendly process, and includes additional unique capabilities such as filtering the PhiX control sequences, conversion of FASTQ formats, and multi-threading. The original data and trimmed summaries are reported within a variety of graphics and reports, providing a simple way to do data quality control and assurance.
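The sketch below shows the kind of per-read 3'-end trimming and length filtering a FastQ QC tool performs; it is a simplified stand-in, not FaQCs itself. The Phred+33 encoding, Q20 cutoff and 50 bp minimum length are common conventions used here as assumptions.

```python
# Trim low-quality read tails and drop reads that become too short.
def trim_read(seq: str, qual: str, min_q: int = 20, min_len: int = 50):
    """Trim trailing bases whose Phred quality is below min_q; drop short reads."""
    scores = [ord(c) - 33 for c in qual]          # Phred+33 decoding
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return (seq[:end], qual[:end]) if end >= min_len else None

def process_fastq(path: str, out_path: str):
    with open(path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline().rstrip()
            if not header:
                break                              # end of file
            seq = fin.readline().rstrip()
            plus = fin.readline().rstrip()
            qual = fin.readline().rstrip()
            trimmed = trim_read(seq, qual)
            if trimmed:
                fout.write(f"{header}\n{trimmed[0]}\n{plus}\n{trimmed[1]}\n")
```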

Journal ArticleDOI
TL;DR: Two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets are presented: SDType adds missing type statements, and SDValidate identifies faulty statements.
Abstract: Linked Data on the Web is either created from structured data sources (such as relational databases), from semi-structured sources (such as Wikipedia), or from unstructured sources (such as text). In the latter two cases, the generated Linked Data will likely be noisy and incomplete. In this paper, we present two algorithms that exploit statistical distributions of properties and types for enhancing the quality of incomplete and noisy Linked Data sets: SDType adds missing type statements, and SDValidate identifies faulty statements. Neither of the algorithms uses external knowledge, i.e., they operate only on the data itself. We evaluate the algorithms on the DBpedia and NELL knowledge bases, showing that they are both accurate as well as scalable. Both algorithms have been used for building the DBpedia 3.9 release: With SDType, 3.4 million missing type statements have been added, while using SDValidate, 13,000 erroneous RDF statements have been removed from the knowledge base.
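To give a feel for the SDType idea, the simplified sketch below lets the properties observed on a resource "vote" for candidate types according to how those properties are distributed over typed resources elsewhere in the data. The toy statistics, the equal per-property weighting and the 0.4 threshold are assumptions of this sketch, not the authors' exact weighting scheme.

```python
# Property-based type inference: each property contributes its conditional type distribution.
from collections import Counter, defaultdict

# Training statistics: for each property, how often its subjects carry each type.
type_dist = {
    "dbo:birthPlace": Counter({"dbo:Person": 950, "dbo:Place": 50}),
    "dbo:capital":    Counter({"dbo:Country": 800, "dbo:Place": 200}),
}

def predict_types(properties: list[str], threshold: float = 0.4) -> list[str]:
    scores = defaultdict(float)
    for prop in properties:
        dist = type_dist.get(prop)
        if not dist:
            continue
        total = sum(dist.values())
        for rdf_type, count in dist.items():
            scores[rdf_type] += count / total     # add P(type | property)
    n = max(len(properties), 1)
    return [t for t, s in scores.items() if s / n >= threshold]

print(predict_types(["dbo:birthPlace"]))  # ['dbo:Person'] -> add the missing type statement
```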

Journal ArticleDOI
TL;DR: The successful launch of Landsat 8 provides a new data source for monitoring land cover, which has the potential to significantly improve the characterization of the earth's surface as discussed by the authors. The results indicated that the OLI data quality was slightly better than the ETM+ data quality in the visible bands; the near-infrared band of the OLI data showed a clear improvement, whereas no clear improvement was found in the shortwave-infrared bands.
Abstract: The successful launch of Landsat 8 provides a new data source for monitoring land cover, which has the potential to significantly improve the characterization of the earth's surface. To assess data performance, Landsat 8 Operational Land Imager (OLI) data were first compared with Landsat 7 ETM+ data using texture features as the indicators. Furthermore, the OLI data were investigated for land cover classification using the maximum likelihood and support vector machine classifiers in Beijing. The results indicated that (1) the OLI data quality was slightly better than the ETM+ data quality in the visible bands; the near-infrared band of the OLI data showed a clear improvement, whereas no clear improvement was found in the shortwave-infrared bands. Moreover, (2) OLI data had a satisfactory performance in terms of land cover classification. In summary, OLI data were a reliable data source for monitoring land cover and provided continuity in Landsat earth observation.
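The sketch below shows the shape of the classification step described above: train a support vector machine on labelled pixel spectra and predict land-cover classes. The band values, class labels and kernel settings are synthetic placeholders, not the Beijing OLI data or parameters used in the study.

```python
# SVM land-cover classification on pixel band values (synthetic illustrative data).
import numpy as np
from sklearn.svm import SVC

# Rows: pixels; columns: reflectance in a few OLI bands (illustrative values).
X_train = np.array([[0.05, 0.04, 0.30], [0.06, 0.05, 0.35],   # vegetation
                    [0.20, 0.25, 0.28], [0.22, 0.26, 0.30]])  # built-up
y_train = np.array(["vegetation", "vegetation", "built-up", "built-up"])

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

X_new = np.array([[0.055, 0.045, 0.33]])
print(clf.predict(X_new))  # expected: ['vegetation']
```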

Proceedings ArticleDOI
02 Apr 2014
TL;DR: JetStream is presented, a system that allows real-time analysis of large, widely-distributed changing data sets, and its adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two.
Abstract: We present JetStream, a system that allows real-time analysis of large, widely-distributed, changing data sets. Traditional approaches to distributed analytics require users to specify in advance which data is to be backhauled to a central location for analysis. This is a poor match for domains where available bandwidth is scarce and it is infeasible to collect all potentially useful data. JetStream addresses bandwidth limits in two ways, both of which are explicit in the programming model. The system incorporates structured storage in the form of OLAP data cubes, so data can be stored for analysis near where it is generated. Using cubes, queries can aggregate data in ways and locations of their choosing. The system also includes adaptive filtering and other transformations that adjust data quality to match available bandwidth. Many bandwidth-saving transformations are possible; we discuss which are appropriate for which data and how they can best be combined. We implemented a range of analytic queries on web request logs and image data. Queries could be expressed in a few lines of code. Using structured storage on source nodes conserved network bandwidth by allowing data to be collected only when needed to fulfill queries. Our adaptive control mechanisms are responsive enough to keep end-to-end latency within a few seconds, even when available bandwidth drops by a factor of two, and are flexible enough to express practical policies.
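The toy sketch below conveys the adaptive-degradation idea in miniature: when measured bandwidth drops, coarsen the time granularity of locally aggregated cube cells before sending them upstream. The thresholds, record layout and bucketing policy are assumptions of this sketch, not JetStream's actual mechanisms.

```python
# Coarsen local aggregation granularity as available bandwidth shrinks.
from collections import defaultdict

def choose_bucket_seconds(bandwidth_kbps: float) -> int:
    if bandwidth_kbps > 1000:
        return 1        # plenty of bandwidth: send per-second counts
    if bandwidth_kbps > 200:
        return 10
    return 60           # constrained link: send per-minute counts only

def aggregate(requests, bandwidth_kbps):
    """requests: iterable of (timestamp_seconds, url). Returns coarsened cube cells."""
    bucket = choose_bucket_seconds(bandwidth_kbps)
    cube = defaultdict(int)
    for ts, url in requests:
        cube[(ts // bucket * bucket, url)] += 1
    return dict(cube)

log = [(0, "/a"), (1, "/a"), (30, "/b"), (61, "/a")]
print(aggregate(log, bandwidth_kbps=150))  # {(0, '/a'): 2, (0, '/b'): 1, (60, '/a'): 1}
```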

Journal ArticleDOI
TL;DR: As mentioned in this paper, the IAB Establishment Panel was launched in 1993 to obtain information on the demand side of the labor market; it meets two requirements: providing high-quality data for scientific aims and serving as an information system for policy makers and practitioners.
Abstract: The IAB Establishment Panel was launched to obtain information on the demand side of the labor market. The data meet two requirements: providing high-quality data for scientific aims and serving as an information system for policy makers and practitioners. Since the panel started in 1993, a rich data set covering 20 years of establishment surveys is now available. This article provides information about methodological issues of sample design and data sampling and about changes that have taken place in recent years. We focus on quality issues, efforts to improve the survey, and some ongoing discussions about methodological adjustments of the survey mode.

Journal ArticleDOI
01 Mar 2014
TL;DR: In order for manufacturers to take advantage of the use of data and analytics for better operational performance, complementary resources such as fact-based SCM initiatives must be combined with BA initiatives focusing on data quality and advanced analytics.
Abstract: This study is interested in the impact of two specific business analytics (BA) resources, accurate manufacturing data and advanced analytics, on a firm's operational performance. The use of advanced analytics, such as mathematical optimization techniques, and the importance of manufacturing data accuracy have long been recognized as potential organizational resources or assets for improving the quality of manufacturing planning and control and of a firm's overall operational performance. This research adopted a contingent resource-based theory (RBT), suggesting the moderating and mediating role of fact-based SCM initiatives as complementary resources. This research proposition was tested using Global Manufacturing Research Group (GMRG) survey data and was analyzed using partial least squares structural equation modeling. The research findings shed light on the critical role of fact-based SCM initiatives as complementary resources, which moderate the impact of data accuracy on manufacturing planning quality and mediate the impact of advanced analytics on operational performance. The implication is that the impact of business analytics for manufacturing is contingent on context, specifically the use of fact-based SCM initiatives such as TQM, JIT, and statistical process control. Moreover, in order for manufacturers to take advantage of the use of data and analytics for better operational performance, complementary resources such as fact-based SCM initiatives must be combined with BA initiatives focusing on data quality and advanced analytics.

Journal ArticleDOI
TL;DR: This review discusses the proper quality control procedures and parameters for Illumina technology-based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling.
Abstract: Advances in next-generation sequencing (NGS) technologies have greatly improved our ability to detect genomic variants for biomedical research. In particular, NGS technologies have been recently applied with great success to the discovery of mutations associated with the growth of various tumours and in rare Mendelian diseases. The advance in NGS technologies has also created significant challenges in bioinformatics. One of the major challenges is quality control of the sequencing data. In this review, we discuss the proper quality control procedures and parameters for Illumina technology–based human DNA re-sequencing at three different stages of sequencing: raw data, alignment and variant calling. Monitoring quality control metrics at each of the three stages of NGS data provides unique and independent evaluations of data quality from differing perspectives. Properly conducting quality control protocols at all three stages and correctly interpreting the quality control results are crucial to ensure a successful and meaningful study.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper evaluated the quality of China's gross domestic product (GDP) statistics and concluded that the supposed evidence for GDP data falsification is not compelling, that the National Bureau of Statistics has much institutional scope for falsifying GDP data, and that certain manipulations of nominal and real data would be virtually undetectable.

Journal ArticleDOI
TL;DR: A survey of smartphone-based insurance telematics is presented, including definitions; figures of merit (FoMs) describing the behavior of the driver and the characteristics of the trip; and risk profiling of the driver based on different sets of FoMs, with the smartphone data quality characterized in terms of accuracy, integrity, availability, and continuity of service.
Abstract: Smartphone-based insurance telematics, or usage-based insurance, is a disruptive technology which relies on insurance premiums that reflect the risk profile of the driver, measured via smartphones with appropriate installed software. A survey of smartphone-based insurance telematics is presented, including definitions; figures of merit (FoMs) describing the behavior of the driver and the characteristics of the trip; and risk profiling of the driver based on different sets of FoMs. The data quality provided by the smartphone is characterized in terms of accuracy, integrity, availability, and continuity of service. The quality of the smartphone data is further compared with the quality of data from traditional in-car mounted devices for insurance telematics, revealing the obstacles that have to be overcome for a successful smartphone-based installation, namely poor integrity and low availability. Simply speaking, the smartphone measurements lack reliability. Integrity enhancement of smartphone data is illustrated both by second-by-second low-level signal processing to combat outliers and perform integrity monitoring, and by trip-based map-matching for robustification of the recorded trip data. A plurality of FoMs are described, analyzed and categorized, including events and properties like harsh braking, speeding, and location. The categorization of the FoMs in terms of observability, stationarity, driver influence, and actuarial relevance provides tools for robust risk profiling of the driver and the trip. Proper driver feedback is briefly discussed, and rules of thumb for feedback design are included. The work is supported by experimental validation, statistical analysis, and experiences from a recent insurance telematics pilot run in Sweden.
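As an illustration of deriving one FoM mentioned above, harsh braking, the sketch below scans second-by-second speed samples for large decelerations. The 3 m/s^2 threshold, the 1 Hz sampling assumption and the input format are illustrative choices, not the paper's calibrated values.

```python
# Flag samples where speed drops sharply from one second to the next.
def harsh_braking_events(speeds_mps, threshold_mps2=3.0):
    """speeds_mps: list of speed samples at 1 Hz. Returns indices where braking is harsh."""
    events = []
    for i in range(1, len(speeds_mps)):
        decel = speeds_mps[i - 1] - speeds_mps[i]   # m/s lost over one second = m/s^2
        if decel >= threshold_mps2:
            events.append(i)
    return events

trip = [15.0, 15.2, 14.8, 10.5, 7.0, 6.9, 6.8]      # a hard stop around samples 3-4
print(harsh_braking_events(trip))                   # [3, 4]
```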

Journal ArticleDOI
TL;DR: Among the various methods, probabilistic principal component analysis (PPCA) yields the best performance in all aspects; it can be used to impute data online before further analysis and is robust to weather changes.
Abstract: Many traffic management and control applications require highly complete and accurate traffic flow data. However, for various reasons such as sensor failure or transmission error, it is common for some traffic flow data to be lost. As a result, various methods using a wide spectrum of techniques have been proposed over the last two decades to estimate missing traffic data. Generally, these missing data imputation methods can be categorised into three kinds: prediction methods, interpolation methods and statistical learning methods. To assess their performance, these methods are compared from different aspects in this paper, including reconstruction errors, statistical behaviours and running speeds. Results show that statistical learning methods are more effective than the other two kinds of imputation methods when data from a single detector are utilised. Among the various methods, probabilistic principal component analysis (PPCA) yields the best performance in all aspects. Numerical tests demonstrate that PPCA can be used to impute data online before further analysis (e.g. making traffic predictions) and is robust to weather changes.
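In the spirit of the PPCA result above, the sketch below imputes missing traffic-flow values by iteratively reconstructing the day-by-interval matrix from its leading principal components. Plain iterative PCA stands in here for full probabilistic PCA; the component count and iteration limit are assumptions of this sketch, not the paper's estimator.

```python
# Iterative low-rank imputation: fill with column means, then refine via PCA reconstruction.
import numpy as np
from sklearn.decomposition import PCA

def pca_impute(X: np.ndarray, n_components: int = 2, n_iter: int = 50) -> np.ndarray:
    """X: days x time-intervals flow matrix with NaNs marking missing observations."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0, keepdims=True), X)  # start from column means
    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        reconstructed = pca.inverse_transform(pca.fit_transform(filled))
        filled[missing] = reconstructed[missing]       # only overwrite the missing cells
    return filled
```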

Journal ArticleDOI
TL;DR: This paper reviews recent data mining research in the educational field and outlines future research directions in educational data mining.
Abstract: The management of higher education must be evaluated on an ongoing basis in order to improve the quality of institutions. This requires evaluation of various data, information and knowledge from both inside and outside the institutions. Institutions plan to use the collected data more efficiently and to develop tools for collecting and directing management information, in order to support managerial decision making. The collected data can be utilized to evaluate quality, perform analyses and diagnoses, evaluate adherence to the standards and practices of curricula and syllabi, and suggest alternatives in decision processes. Data mining methods to support decision making are well suited to providing decision support in educational environments, by generating and presenting relevant information and knowledge towards quality improvement of education processes. In the educational domain this information is very useful, since it can be used as a basis for investigating and enhancing current educational standards and management. In this paper, a review of data mining for academic decision support in the education field is presented. The paper reviews recent data mining work in the educational field and outlines future research in educational data mining.