
Showing papers on "Data quality" published in 2008


Proceedings ArticleDOI
24 Aug 2008
TL;DR: The results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
Abstract: This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated-labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
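To make the repeated-labeling strategies concrete, here is a minimal Python sketch (standard library only) of two of the ideas above: integrating multiple noisy labels by majority vote, and selectively spending a labeling budget on the currently most uncertain examples. The simulated annotators and the vote-margin selection rule are simplifying assumptions for illustration, not the authors' exact uncertainty measure.

```python
import random
from collections import Counter

def majority_label(votes):
    """Integrate the repeated labels for one example by majority vote."""
    return Counter(votes).most_common(1)[0][0]

def vote_margin(votes):
    """Margin between the two most frequent labels; a small margin means the example is uncertain."""
    counts = Counter(votes).most_common(2)
    if len(counts) < 2:
        return len(votes)              # unanimous so far
    return counts[0][1] - counts[1][1]

def noisy_labeler(true_label, accuracy=0.7):
    """Simulated imperfect annotator: correct with probability `accuracy`."""
    return true_label if random.random() < accuracy else 1 - true_label

random.seed(0)
true_labels = [random.randint(0, 1) for _ in range(20)]   # hidden ground truth
votes = [[noisy_labeler(y)] for y in true_labels]          # one initial label per example

# Selective repeated labeling: give each extra label to the most uncertain example.
for _ in range(40):                                        # labeling budget
    i = min(range(len(votes)), key=lambda k: vote_margin(votes[k]))
    votes[i].append(noisy_labeler(true_labels[i]))

integrated = [majority_label(v) for v in votes]
accuracy = sum(a == b for a, b in zip(integrated, true_labels)) / len(true_labels)
print(f"label accuracy after selective repeated labeling: {accuracy:.2f}")
```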

1,199 citations


Journal ArticleDOI
TL;DR: An evaluation of a recently developed hybrid inventory analysis method, which aims to address the limitations of previous methods, found that the truncation associated with process analysis can be up to 87%, reflecting considerable shortcomings in the quantity of process data currently available.

310 citations


Journal ArticleDOI
TL;DR: This paper investigated the differences in data quality between a face-to-face and a web survey and found that web respondents were more likely to satisfice for a multitude of reasons, thereby producing data of lower quality.
Abstract: The current study experimentally investigates the differences in data quality between a face-to-face and a web survey. Based on satisficing theory, it was hypothesized that web survey respondents would be more likely to satisfice for a multitude of reasons, thereby producing data of lower quality. The data show support for the hypothesis. Web survey respondents were shown to produce a higher "don't know" response rate, to differentiate less on rating scales, and to produce more item nonresponse than face-to-face survey respondents.
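The three indicators reported above reduce to simple per-mode computations. The sketch below uses invented toy responses and treats the standard deviation of a respondent's numeric ratings as a stand-in measure of differentiation; both are illustrative assumptions rather than the study's exact coding.

```python
from statistics import pstdev

DONT_KNOW, MISSING = "DK", None

def quality_indicators(respondents):
    """respondents: list of answer lists (rating-scale ints, 'DK', or None for item nonresponse)."""
    answers = [a for r in respondents for a in r]
    dk_rate = sum(a == DONT_KNOW for a in answers) / len(answers)
    nonresponse_rate = sum(a is MISSING for a in answers) / len(answers)
    # Differentiation: spread of each respondent's numeric ratings (low spread = straight-lining).
    spreads = [pstdev([a for a in r if isinstance(a, int)])
               for r in respondents
               if len([a for a in r if isinstance(a, int)]) >= 2]
    differentiation = sum(spreads) / len(spreads) if spreads else 0.0
    return dk_rate, nonresponse_rate, differentiation

face_to_face = [[4, 2, 5, 1, 3], [5, 4, 2, 3, 1]]
web          = [[3, 3, 3, "DK", None], [4, 4, "DK", 4, None]]

for mode, data in [("face-to-face", face_to_face), ("web", web)]:
    dk, nr, diff = quality_indicators(data)
    print(f"{mode:13s} DK rate={dk:.2f}  item nonresponse={nr:.2f}  differentiation={diff:.2f}")
```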

298 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This work proposes a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records.
Abstract: Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule that states that for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. Since no universal objective measures of data quality or data quality rules are known, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
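To illustrate the kind of rule the tool outputs, the sketch below checks a single conditional functional dependency against a small table and returns the non-conformant records. The example rule and records are invented, and neither the discovery of minimal CFDs nor the interest metrics evaluated in the paper are reproduced here.

```python
from collections import defaultdict

def check_cfd(rows, pattern, lhs, rhs):
    """CFD: for rows matching `pattern`, the `lhs` attributes determine `rhs`.
    Returns (support, non-conformant rows)."""
    matching = [r for r in rows if all(r.get(k) == v for k, v in pattern.items())]
    groups = defaultdict(list)
    for r in matching:
        groups[tuple(r[a] for a in lhs)].append(r)
    violations = []
    for grp in groups.values():
        rhs_values = {r[rhs] for r in grp}
        if len(rhs_values) > 1:   # dependency violated within this group
            majority = max(rhs_values, key=lambda v: sum(r[rhs] == v for r in grp))
            violations.extend(r for r in grp if r[rhs] != majority)
    return len(matching), violations

courses = [
    {"level": "grad", "number": "CS500", "term": "F08", "room": "A1", "instructor": "Lee"},
    {"level": "grad", "number": "CS500", "term": "F08", "room": "A1", "instructor": "Lee"},
    {"level": "grad", "number": "CS500", "term": "F08", "room": "B2", "instructor": "Lee"},  # possibly dirty
    {"level": "ugrad", "number": "CS101", "term": "F08", "room": "C3", "instructor": "Kim"},
]

# Rule in the spirit of the abstract: for graduate courses, (number, term) -> room.
support, dirty = check_cfd(courses, {"level": "grad"}, ["number", "term"], "room")
print(f"support={support}, possibly dirty records: {dirty}")
```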

257 citations


Proceedings ArticleDOI
09 Jun 2008
TL;DR: This paper provides an overview of recent advances in revising classical dependencies for improving data quality, motivated by the increasing demand for data quality technology.
Abstract: Dependency theory is almost as old as relational databases themselves, and has traditionally been used to improve the quality of schema, among other things. Recently there has been renewed interest in dependencies for improving the quality of data. The increasing demand for data quality technology has also motivated revisions of classical dependencies, to capture more inconsistencies in real-life data, and to match, repair and query the inconsistent data. This paper aims to provide an overview of recent advances in revising classical dependencies for improving data quality.

253 citations


01 Jan 2008
TL;DR: A statistical view of data quality is taken, with an emphasis on intuitive outlier detection and exploratory data analysis methods based in robust statistics; the report stresses algorithms that can be implemented easily and efficiently in very large databases and that are easy to understand and visualize graphically.
Abstract: Data collection has become a ubiquitous function of large organizations – not only for record keeping, but to support a variety of data analysis tasks that are critical to the organizational mission. Data analysis typically drives decision-making processes and efficiency optimizations, and in an increasing number of settings is the raison d’etre of entire agencies or firms. Despite the importance of data collection and analysis, data quality remains a pervasive and thorny problem in almost every large organization. The presence of incorrect or inconsistent data can significantly distort the results of analyses, often negating the potential benefits of information-driven approaches. As a result, there has been a variety of research over the last decades on various aspects of data cleaning: computational procedures to automatically or semi-automatically identify – and, when possible, correct – errors in large data sets. In this report, we survey data cleaning methods that focus on errors in quantitative attributes of large databases, though we also provide references to data cleaning methods for other types of attributes. The discussion is targeted at computer practitioners who manage large databases of quantitative information, and designers developing data entry and auditing tools for end users. Because of our focus on quantitative data, we take a statistical view of data quality, with an emphasis on intuitive outlier detection and exploratory data analysis methods based in robust statistics [Rousseeuw and Leroy, 1987, Hampel et al., 1986, Huber, 1981]. In addition, we stress algorithms and implementations that can be easily and efficiently implemented in very large databases, and which are easy to understand and visualize graphically. The discussion mixes statistical intuitions and methods, algorithmic building blocks, efficient relational database implementation strategies, and user interface considerations. Throughout the discussion, references are provided for deeper reading on all of these issues.
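As one concrete instance of the robust, easy-to-implement methods the report emphasizes, the sketch below applies a Hampel-style check that flags values far from the median in units of the median absolute deviation (MAD). The threshold of 3 and the constant 1.4826 (consistency with a normal distribution) are conventional choices, not prescriptions from the report.

```python
from statistics import median

def hampel_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` robust standard deviations from the median."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    scale = 1.4826 * mad if mad > 0 else 1e-12   # guard against a zero MAD
    return [x for x in values if abs(x - med) / scale > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 58.2, 10.2, 10.1, -3.0, 10.0]
print("flagged as outliers:", hampel_outliers(readings))   # -> [58.2, -3.0]
```

Because the check needs only a median, a MAD, and a comparison, the same logic can be pushed down into a relational database with percentile aggregates, which is the implementation setting the report targets.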

230 citations


Journal ArticleDOI
TL;DR: In this paper, a site evaluation approach combining Lagrangian Stochastic footprint modeling with a quality assessment approach for eddy-covariance data was applied to 25 forested sites of the CarboEurope-IP network.
Abstract: We applied a site evaluation approach combining Lagrangian Stochastic footprint modeling with a quality assessment approach for eddy-covariance data to 25 forested sites of the CarboEurope-IP network. The analysis addresses the spatial representativeness of the flux measurements, instrumental effects on data quality, spatial patterns in the data quality, and the performance of the coordinate rotation method. Our findings demonstrate that application of a footprint filter could strengthen the CarboEurope-IP flux database, since only one third of the sites is situated in truly homogeneous terrain. Almost half of the sites experience a significant reduction in eddy-covariance data quality under certain conditions, though these effects are mostly constricted to a small portion of the dataset. Reductions in data quality of the sensible heat flux are mostly induced by characteristics of the surrounding terrain, while the latent heat flux is subject to instrumentation-related problems. The Planar-Fit coordinate rotation proved to be a reliable tool for the majority of the sites using only a single set of rotation angles. Overall, we found a high average data quality for the CarboEurope-IP network, with good representativeness of the measurement data for the specified target land cover types.

225 citations


Journal ArticleDOI
TL;DR: The culture of information use essential to an information system having an impact at the local level is weak in these clinics or at the sub-district level, and further training and support is required for the DHIS to function as intended.
Abstract: Background. Since reliable health information is essential for the planning and management of health services, we investigated the functioning of the District Health Information System (DHIS) in 10 rural clinics. Design and subjects. Semi-structured key informant interviews were conducted with clinic managers, supervisors and district information staff. Data collected over a 12-month period for each clinic were assessed for missing data, data out of minimum and maximum ranges, and validation rule violations. Setting. Our investigation was part of a larger study on improving information systems for primary care in rural KwaZulu-Natal. Outcomes. We assessed data quality, the utilisation for facility management, perceptions of work burden, and usefulness of the system to clinic staff. Results. A high perceived work burden associated with data collection and collation was found. Some data collation tools were not used as intended. There was good understanding of the data collection and collation process but little analysis, interpretation or utilisation of data. Feedback to clinics occurred rarely. In the 10 clinics, 2.5% of data values were missing, and 25% of data were outside expected ranges without an explanation provided. Conclusions. The culture of information use essential to an information system having an impact at the local level is weak in these clinics or at the sub-district level. Further training and support is required for the DHIS to function as intended.
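The three routine checks applied to the monthly clinic data (missing values, values outside expected minimum/maximum ranges, and validation rule violations) can be expressed in a few lines. In the sketch below the data element names, expected ranges, and the single validation rule are hypothetical.

```python
def assess_clinic_data(records, ranges, rules):
    """Return (% missing values, % out-of-range values, number of validation rule violations)."""
    total = missing = out_of_range = violations = 0
    for rec in records:
        for element, (lo, hi) in ranges.items():
            total += 1
            value = rec.get(element)
            if value is None:
                missing += 1
            elif not lo <= value <= hi:
                out_of_range += 1
        violations += sum(1 for rule in rules if not rule(rec))
    return 100 * missing / total, 100 * out_of_range / total, violations

monthly_reports = [
    {"clinic": "A", "month": "2006-01", "headcount": 950,  "immunisations": 120},
    {"clinic": "A", "month": "2006-02", "headcount": None, "immunisations": 4000},  # missing + out of range
    {"clinic": "B", "month": "2006-01", "headcount": 610,  "immunisations": 80},
    {"clinic": "B", "month": "2006-02", "headcount": 100,  "immunisations": 250},   # violates the rule below
]
expected_ranges = {"headcount": (0, 3000), "immunisations": (0, 500)}
# Example validation rule: immunisations given in a month cannot exceed the clinic headcount.
validation_rules = [lambda r: (r["headcount"] is None or r["immunisations"] is None
                               or r["immunisations"] <= r["headcount"])]

pct_missing, pct_out_of_range, n_violations = assess_clinic_data(
    monthly_reports, expected_ranges, validation_rules)
print(f"missing: {pct_missing:.1f}%  out of range: {pct_out_of_range:.1f}%  rule violations: {n_violations}")
```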

224 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a Radio Frequency Identification (RFID)-based quality management system, which functions as a platform for gathering, filtering, managing, monitoring and sharing quality data.

224 citations


Journal ArticleDOI
TL;DR: This article reviews some important limitations of population-based data sets and the methods used to analyze them, and suggests that an increased awareness among surgical oncologists of these limitations will ensure that population-based studies in the surgical oncology literature achieve high standards of methodological quality and clinical utility.
Abstract: Studies based on large population-based data sets, such as administrative claims data and tumor registry data, have become increasingly common in surgical oncology research. These data sets can be acquired relatively easily, and they offer larger sample sizes and improved generalizability compared with institutional data. There are, however, significant limitations that must be considered in the analysis and interpretation of such data. Invalid conclusions can result when insufficient attention is paid to issues such as data quality and depth, potential sources of bias, missing data, type I error, and the assessment of statistical significance. This article reviews some important limitations of population-based data sets and the methods used to analyze them. The candid reporting of these issues in the literature and an increased awareness among surgical oncologists of these limitations will ensure that population-based studies in the surgical oncology literature achieve high standards of methodological quality and clinical utility.

219 citations


Journal ArticleDOI
TL;DR: This letter presents the algorithm used within the MODIS production facility to produce temporally smoothed and spatially continuous biophysical data for such modeling applications and shows that the smoothed LAI agrees with high-quality MODIS LAI very well.
Abstract: Ecological and climate models require high-quality consistent biophysical parameters as inputs and validation sources. NASA's moderate resolution imaging spectroradiometer (MODIS) biophysical products provide such data and have been used to improve our understanding of climate and ecosystem changes. However, the MODIS time series contains occasional lower quality data, gaps from persistent clouds, cloud contamination, and other gaps. Many modeling efforts, such as those used in the North American Carbon Program, that use MODIS data as inputs require gap-free data. This letter presents the algorithm used within the MODIS production facility to produce temporally smoothed and spatially continuous biophysical data for such modeling applications. We demonstrate the algorithm with an example from the MODIS-leaf-area-index (LAI) product. Results show that the smoothed LAI agrees with high-quality MODIS LAI very well. Higher R-squares and better linear relationships have been observed when high-quality retrieval in each individual tile reaches 40% or more. These smoothed products show similar data quality to MODIS high-quality data and, therefore, can be substituted for low-quality retrievals or data gaps.
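The gap-filling idea can be illustrated with a much simpler stand-in than the production code: keep high-quality retrievals in the time series and fill low-quality values and gaps by linear interpolation between the nearest high-quality neighbours. The sketch below is only that simplification (with an invented QC convention of 0 = high quality), not the MODIS algorithm itself.

```python
def fill_low_quality(series, quality, good_flag=0):
    """series: list of floats (None marks a gap); quality: per-value QC flags."""
    good_idx = [i for i, (v, q) in enumerate(zip(series, quality))
                if v is not None and q == good_flag]
    filled = list(series)
    for i in range(len(series)):
        if i in good_idx:
            continue                                # keep high-quality retrievals as they are
        left = max((j for j in good_idx if j < i), default=None)
        right = min((j for j in good_idx if j > i), default=None)
        if left is not None and right is not None:  # interpolate between good neighbours
            w = (i - left) / (right - left)
            filled[i] = series[left] * (1 - w) + series[right] * w
        elif left is not None:
            filled[i] = series[left]                # extend the last good value forward
        elif right is not None:
            filled[i] = series[right]               # or the first good value backward
    return filled

lai      = [0.8, 1.2, None, 3.9, 0.4, 4.3, 4.1]   # e.g. 8-day LAI composites
qc_flags = [0,   0,   1,    1,   1,   0,   0]     # 0 = high-quality retrieval
print(fill_low_quality(lai, qc_flags))
```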

Journal ArticleDOI
01 Dec 2008-Ecology
TL;DR: An analysis quantifying the contribution of uncertainty in each step during the model-building sequence to variation in model validity and climate change projection uncertainty found that model type and data quality dominated this analysis.
Abstract: Sophisticated statistical analyses are common in ecological research, particularly in species distribution modeling. The effects of sometimes arbitrary decisions during the modeling procedure on the final outcome are difficult to assess, and to date are largely unexplored. We conducted an analysis quantifying the contribution of uncertainty in each step during the model-building sequence to variation in model validity and climate change projection uncertainty. Our study system was the distribution of the Great Grey Shrike in the German federal state of Saxony. For each of four steps (data quality, collinearity method, model type, and variable selection), we ran three different options in a factorial experiment, leading to 81 different model approaches. Each was subjected to a fivefold cross-validation, measuring area under curve (AUC) to assess model quality. Next, we used three climate change scenarios times three precipitation realizations to project future distributions from each model, yielding 729 projections. Again, we analyzed which step introduced most variability (the four model-building steps plus the two scenario steps) into predicted species prevalences by the year 2050. Predicted prevalences ranged from a factor of 0.2 to a factor of 10 of present prevalence, with the majority of predictions between 1.1 and 4.2 (inter-quartile range). We found that model type and data quality dominated this analysis. In particular, artificial neural networks yielded low cross-validation robustness and gave very conservative climate change predictions. Generalized linear and additive models were very similar in quality and predictions, and superior to neural networks. Variations in scenarios and realizations had very little effect, due to the small spatial extent of the study region and its relatively small range of climatic conditions. We conclude that, for climate projections, model type and data quality were the most influential factors. Since comparison of model types has received good coverage in the ecological literature, effects of data quality should now come under more scrutiny.
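The factorial design is straightforward to emulate: enumerate every combination of options for each model-building step, run the pipeline once per combination, and ask which step's options separate the resulting projections most. In the sketch below, fit_and_project is a hypothetical stand-in for the real modelling pipeline, and attributing variance to the spread of per-option group means is a simplification of the paper's analysis.

```python
from itertools import product
from statistics import mean, pvariance
import random

steps = {
    "data_quality":  ["raw", "cleaned", "expert-checked"],
    "collinearity":  ["none", "PCA", "stepwise"],
    "model_type":    ["GLM", "GAM", "ANN"],
    "var_selection": ["all", "AIC", "expert"],
}

random.seed(1)

def fit_and_project(combo):
    """Hypothetical pipeline: returns a projected prevalence for one combination of choices."""
    base = 2.0 + 1.5 * (combo["model_type"] == "ANN") - 0.8 * (combo["data_quality"] == "raw")
    return base + random.gauss(0, 0.2)

# 3 options per step and 4 steps -> 81 model approaches, as in the study design.
runs = []
for options in product(*steps.values()):
    combo = dict(zip(steps, options))
    runs.append((combo, fit_and_project(combo)))

print("variance in projections attributable to each step:")
for step, options in steps.items():
    group_means = [mean(p for c, p in runs if c[step] == opt) for opt in options]
    print(f"  {step:14s} {pvariance(group_means):.3f}")
```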

Book
28 Sep 2008
TL;DR: Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support.
Abstract: The key to a successful MDM initiative isn't technology or methods, it's people: the stakeholders in the organization and their complex ownership of the data that the initiative will affect. Master Data Management equips you with a deeply practical, business-focused way of thinking about MDM, an understanding that will greatly enhance your ability to communicate with stakeholders and win their support. Moreover, it will help you deserve their support: you'll master all the details involved in planning and executing an MDM project that leads to measurable improvements in business productivity and effectiveness. * Presents a comprehensive roadmap that you can adapt to any MDM project. * Emphasizes the critical goal of maintaining and improving data quality. * Provides guidelines for determining which data to master. * Examines special issues relating to master data metadata. * Considers a range of MDM architectural styles. * Covers the synchronization of master data across the application infrastructure.

Book
09 Jul 2008
TL;DR: This book describes the future of data warehousing that is technologically possible now, at both an architectural level and technology level, and gives the experienced data warehouse professional everything and exactly what is needed in order to implement the new generation DW 2.0.
Abstract: Data Warehousing has been around for 20 years and has become part of the information technology infrastructure. Data warehousing originally grew in response to the corporate need for information--not data--and it supplies integrated, granular, and historical data to the corporation. There are many kinds of data warehouses, in large part due to evolution and different paths of software and hardware vendors. But DW 2.0, defined by this author in many talks, articles, and his b-eye-network newsletter that reaches 65,000 professionals monthly, is the well-identified and defined next generation data warehouse. The book carries that theme and describes the future of data warehousing that is technologically possible now, at both an architectural level and technology level. The perspective of the book is from the top down: looking at the overall architecture and then delving into the issues underlying the components. The benefit of this is that people who are building or using a data warehouse can see what lies ahead and can determine what new technology to buy, how to plan extensions to the data warehouse, what can be salvaged from the current system, and how to justify the expense--at the most practical level. All of this gives the experienced data warehouse professional everything and exactly what is needed in order to implement the new generation DW 2.0. * First book on the new generation of data warehouse architecture, DW 2.0. * Written by the "father of the data warehouse", Bill Inmon, a columnist and newsletter editor of The Bill Inmon Channel on the Business Intelligence Network. * Long overdue comprehensive coverage of the implementation of technology and tools that enable the new generation of the DW: metadata, temporal data, ETL, unstructured data, and data quality control.

Journal ArticleDOI
TL;DR: A review of existing approaches to quantify fishing effort in small-scale, recreational, industrial, and illegal, unreported and unregulated (IUU) fisheries is presented in this article, outlining the strengths and limitations of existing methods and identifying the most robust methods and the critical knowledge gaps that must be addressed to improve our ability to quantify and map fishing effort.
Abstract: The need to accurately quantify fishing effort has increased in recent years as fisheries have expanded around the world and many fish stocks and non-target species are threatened with collapse. Quantification methods vary greatly among fisheries, and to date there has not been a comprehensive review of these methods. Here we review existing approaches to quantify fishing effort in small-scale, recreational, industrial, and illegal, unreported and unregulated (IUU) fisheries. We present the strengths and limitations of existing methods, identifying the most robust methods and the critical knowledge gaps that must be addressed to improve our ability to quantify and map fishing effort. Although identifying the ‘best’ method ultimately depends on the intended application of the data, in general, quantification methods that are based on information on gear use and spatial distribution offer the best approaches to representing fishing effort on a broad scale. Integrating fishers’ knowledge and involving fishers in data collection and management decisions may be the most effective way to improve data quality and accessibility.

Book ChapterDOI
24 Aug 2008
TL;DR: A data provenance trust model is proposed which takes into account various factors that may affect the trustworthiness and, based on these factors, assigns trust scores to both data and data providers.
Abstract: Today, with the advances of information technology, individual people and organizations can obtain and process data from different sources. It is critical to ensure data integrity so that effective decisions can be made based on these data. An important component of any solution for assessing data integrity is represented by techniques and tools to evaluate the trustworthiness of data provenance. However, few efforts have been devoted to investigate approaches for assessing how trusted the data are, based in turn on an assessment of the data sources and intermediaries. To bridge this gap, we propose a data provenance trust model which takes into account various factors that may affect the trustworthiness and, based on these factors, assigns trust scores to both data and data providers. Such trust scores represent key information based on which data users may decide whether to use the data and for what purposes.
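The mutually reinforcing character of such trust scores can be sketched with a simple fixed-point iteration: a data item's trusted value is a trust-weighted consensus of what its providers claim, and a provider's trust grows as its claims agree with that consensus. The update rules and the toy claims below are illustrative assumptions, not the model proposed in the paper.

```python
# (provider, data item, claimed value)
claims = [
    ("sensor_A", "temp_room1", 21.0),
    ("sensor_B", "temp_room1", 21.5),
    ("sensor_C", "temp_room1", 35.0),   # an outlying provider
    ("sensor_A", "temp_room2", 19.0),
    ("sensor_B", "temp_room2", 19.2),
]

provider_trust = {p: 0.5 for p, _, _ in claims}   # start every provider at neutral trust

for _ in range(10):                               # simple fixed-point iteration
    # Trust-weighted consensus value for each data item.
    consensus = {}
    for item in {i for _, i, _ in claims}:
        votes = [(provider_trust[p], v) for p, i, v in claims if i == item]
        total_weight = sum(w for w, _ in votes)
        consensus[item] = sum(w * v for w, v in votes) / total_weight
    # A provider's trust reflects how close its claims are to the consensus (range 0..1).
    new_trust = {}
    for p in provider_trust:
        errors = [abs(v - consensus[i]) for q, i, v in claims if q == p]
        new_trust[p] = 1.0 / (1.0 + sum(errors) / len(errors))
    provider_trust = new_trust

print("provider trust scores:", {p: round(t, 2) for p, t in provider_trust.items()})
print("trusted item values:  ", {i: round(v, 2) for i, v in consensus.items()})
```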

Journal ArticleDOI
TL;DR: The reliability and validity of the Minimum Data Set assessment, which is being used increasingly in Canadian nursing homes and continuing care facilities, are reviewed, together with implications for health care managers on how to approach data quality concerns.

Patent
10 Nov 2008
TL;DR: In this paper, a system, device, and method of using a power line communication device that is communicatively connected to a low voltage power line to establish communications with one or more electronic utility meters is provided.
Abstract: A system, device, and method of using a power line communication device that is communicatively connected to a low voltage power line to establish communications with one or more electronic utility meters is provided. In one embodiment the method includes setting an encryption key parameter to a first encryption key used by the one or more electronic utility meters, establishing communications via one or more low voltage power lines with at least some of the utility meters using the first encryption key, assessing the quality of communications with at least some of the utility meters, transmitting communications quality data to a remote computer system; receiving information of one or more assigned meters from the remote computer system; and storing information of the assigned meters in memory.

Journal ArticleDOI
TL;DR: Trial management committees should consider central statistical monitoring a key aspect of such monitoring; the systematic application of this approach would be likely to lead to tangible benefits, and resources that are currently wasted on inefficient on-site monitoring could be diverted to increasing trial sample sizes or conducting more trials.
Abstract: Errors in the design, the conduct, the data collection process, and the analysis of a randomized trial have the potential to affect not only the safety of the patients in the trial, but also, throu...

Book
01 Sep 2008
TL;DR: This book presents Danette McGilvray's "Ten Steps" approach to information quality, a proven method for both understanding and creating information quality in the enterprise.
Abstract: Information is currency. Recent studies show that data quality problems are costing businesses billions of dollars each year, with poor data linked to waste and inefficiency, damaged credibility among customers and suppliers, and an organizational inability to make sound decisions. In this important and timely new book, Danette McGilvray presents her "Ten Steps" approach to information quality, a proven method for both understanding and creating information quality in the enterprise. Her trademarked approach, in which she has trained Fortune 500 clients and hundreds of workshop attendees, applies to all types of data and to all types of organizations. * Includes numerous templates, detailed examples, and practical advice for executing every step of the "Ten Steps" approach. * Allows for quick reference with an easy-to-use format highlighting key concepts and definitions, important checkpoints, communication activities, and best practices. * A companion Web site includes links to numerous data quality resources, including many of the planning and information-gathering templates featured in the text, quick summaries of key ideas from the Ten Steps methodology, and other tools and information available online. Table of Contents: Introduction (The Reason for This Book; Intended Audiences; Structure of This Book; How to Use This Book; Acknowledgements). Chapter 1 Overview: Impact of Information and Data Quality; About the Methodology; Approaches to Data Quality in Projects; Engaging Management. Chapter 2 Key Concepts: Introduction; Framework for Information Quality (FIQ); Information Life Cycle; Data Quality Dimensions; Business Impact Techniques; Data Categories; Data Specifications; Data Governance and Stewardship; The Information and Data Quality Improvement Cycle; The Ten Steps Process; Best Practices and Guidelines. Chapter 3 The Ten Steps: 1. Define Business Need and Approach; 2. Analyze Information Environment; 3. Assess Data Quality; 4. Assess Business Impact; 5. Identify Root Causes; 6. Develop Improvement Plans; 7. Prevent Future Data Errors; 8. Correct Current Data Errors; 9. Implement Controls; 10. Communicate Actions and Results. Chapter 4 Structuring Your Project: Projects and The Ten Steps; Data Quality Project Roles; Project Timing. Chapter 5 Other Techniques and Tools: Introduction; Information Life Cycle Approaches; Capture Data; Analyze and Document Results; Metrics; Data Quality Tools; The Ten Steps and Six Sigma. Chapter 6 A Few Final Words. Appendix Quick References: Framework for Information Quality; POSMAD Interaction Matrix Detail; POSMAD Phases and Activities; Data Quality Dimensions; Business Impact Techniques; The Ten Steps Overview; Definitions of Data Categories.

Journal ArticleDOI
TL;DR: In this paper, a conceptual framework of the effects of on-line questionnaire design on the quality of collected responses is proposed, and the results of an experiment where different protocols have been tested and compared in a randomised design using the basis of several quality indexes.
Abstract: The first objective of this article is to propose a conceptual framework of the effects of on-line questionnaire design on the quality of collected responses. Secondly, we present the results of an experiment where different protocols have been tested and compared in a randomised design on the basis of several quality indexes. Starting from some previous categorizations, and from the main factors identified in the literature, we first propose an initial global framework of the questionnaire and question characteristics in a web survey, divided into five groups of factors. Our framework was built to follow the successive stages of the response process, that is, the contact between the respondent and the questionnaire itself. Then, because it has been studied in the survey methodology literature in a very restricted way, the concept of "response quality" is discussed and extended with some more "qualitative" criteria that could be helpful for researchers and practitioners, in order to obtain a deeper assessment of the survey output. As an experiment, on the basis of the factors chosen as major characteristics of the questionnaire design, eight versions of a questionnaire related to young people's consumption patterns were created. The links to these on-line questionnaires were sent in November 2005 to a target of 10,000 young people. The article finally presents the results of our study and discusses the conclusions. Very interesting results come to light, especially regarding the influence of length, interaction and question wording dimensions on response quality. We discuss the effects of Web-questionnaire design characteristics on the quality of data.

Book
22 Sep 2008
TL;DR: This book discusses the often hidden costs of poor data and information, how to assess and improve data quality, and the management system for data and information in organizations.
Abstract: Part One: Introduction. Chapter 1: The Wondrous and Perilous Properties of Data and Information in Organizations. Part Two: Chapter 2: The (Often Hidden) Costs of Poor Data and Information; Chapter 3: Assessing and Improving Data Quality. Part Three: Chapter 4: Making Better Decisions; Chapter 5: Bringing Data and Information to the Marketplace: Content Providers; Chapter 6: Bringing Data and Information to the Marketplace: Facilitators. Part Four: Chapter 7: Social Issues in the Management of Data and Information; Chapter 8: Evolving the Management System for Data and Information; Chapter 9: The Next One-Hundred Days.

Journal ArticleDOI
TL;DR: The International Satellite Cloud Climatology Project (ISCCP) B1 data was recently rescued by NOAA's National Climatic Data Center (NCDC); the data have been used for research in studying tropical cyclones and are available for other topics, such as rainfall and cloud cover.
Abstract: The International Satellite Cloud Climatology Project (ISCCP) B1 data was recently rescued by NOAA's National Climatic Data Center (NCDC). ISCCP B1 data are geostationary imagery from satellites worldwide which are subsampled to 10 km and 3 hourly resolution. These data were unusable given the disarray of format documentation and lack of software for reading the data files. After developing access software, assessing data quality, and removing infrared window calibration biases, the data have been used for research in studying tropical cyclones and are available for other topics, such as rainfall and cloud cover. This resulted not only in valuable scientific data for weather and climate research but also in important lessons learned for future archiving of scientific data records. The effort also exemplifies principles of scientific data stewardship.

Book
07 Jan 2008
TL;DR: This book describes how to build a data warehouse completely from scratch and shows practical examples on how to do it, as well as some practical issues he has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice.
Abstract: Building a Data Warehouse: With Examples in SQL Server describes how to build a data warehouse completely from scratch and shows practical examples on how to do it. Author Vincent Rainardi also describes some practical issues he has experienced that developers are likely to encounter in their first data warehousing project, along with solutions and advice. The RDBMS used in the examples is SQL Server; the version will not be an issue as long as the user has SQL Server 2005 or later. The book is organized as follows. In the beginning of this book (Chapters 1 through 6), you learn how to build a data warehouse, for example, defining the architecture, understanding the methodology, gathering the requirements, designing the data models, and creating the databases. Then in Chapters 7 through 10, you learn how to populate the data warehouse, for example, extracting from source systems, loading the data stores, maintaining data quality, and utilizing the metadata. After you populate the data warehouse, in Chapters 11 through 15, you explore how to present data to users using reports and multidimensional databases and how to use the data in the data warehouse for business intelligence, customer relationship management, and other purposes. Chapters 16 and 17 wrap up the book: After you have built your data warehouse, before it can be released to production, you need to test it thoroughly. After your application is in production, you need to understand how to administer data warehouse operation. What you'll learn: A detailed understanding of what it takes to build a data warehouse; the implementation code in SQL Server to build the data warehouse; dimensional modeling, data extraction methods, data warehouse loading, populating dimension and fact tables, data quality, data warehouse architecture, and database design; practical data warehousing applications such as business intelligence reports, analytics applications, and customer relationship management. Who is this book for? There are three audiences for the book. The first are the people who implement the data warehouse; this could be considered a field guide for them. The second is database users/admins who want to get a good understanding of what it would take to build a data warehouse. Finally, the third audience is managers who must make decisions about aspects of the data warehousing task before them and use the book to learn about these issues. Related Titles: Beginning Relational Data Modeling, Second Edition; Data Mining and Statistical Analysis Using SQL.

Journal ArticleDOI
TL;DR: There is considerable heterogeneity in the quality of cause-of-death statistics across Brazilian regions, especially for criteria such as completeness and ill-defined causes, which must be considered in the interpretation and use of data for secondary descriptive analyses.
Abstract: Background: Mortality statistics systems with reliable cause-of-death data constitute a major resource for effective health planning; however, many developing countries lack such information systems. Brazil has a long history of registering deaths, and a critical assessment of the quality of current cause-of-death statistics in its five different regions is crucial to identify strengths and weaknesses in the data, and present options for improvement. Methods: Quality of cause-of-death data from 2002 to 2004 was evaluated using an assessment framework based on four main attributes: generalizability, reliability, validity and policy relevance. A set of nine criteria: coverage, completeness, consistency of cause patterns with general mortality levels, consistency of cause specific mortality proportions over time, content validity, proportion of ill-defined causes and non-specific codes, incorrect or improbable age or sex patterns, timeliness, and geographical disaggregation were used to assess the four attributes of data quality. Results: Completeness of death registration varies from 72 to 80% in the northeast regions, compared with 85-90% in the Southeast and Centre-West regions, and 94-97% in the wealthier South region. The proportion of ill-defined deaths is an important problem in reported causes of death from almost all regions. Lack of adequate evidence limits the assessment of content validity of registered causes of death. Coverage, consistency of causes with general level of mortality, consistency over time, age and sex patterns, timeliness and usability of statistics for subnational purposes were judged to be reasonable and increase confidence in using the statistics. Conclusions: There is considerable heterogeneity in the quality of cause-of-death statistics across Brazilian regions, especially for criteria such as completeness and ill-defined causes. These factors can influence generalizability and validity of reported causes of death, and must be considered in the interpretation and use of data for secondary descriptive analyses such as burden of disease estimation at regional level, with suitable adjustments to account for bias. The differences identified in this study could be a useful guide for defining measures and investments needed to improve data quality in Brazil.
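Two of the nine criteria, completeness of death registration and the proportion of ill-defined causes, reduce to simple ratios, as in the sketch below. The regional figures and the ICD-10 codes are invented for illustration; only membership in the ill-defined chapter (codes R00-R99) is used to classify a cause as ill-defined.

```python
def completeness(registered_deaths, expected_deaths):
    """Registered deaths as a percentage of the expected (estimated) deaths."""
    return 100.0 * registered_deaths / expected_deaths

def pct_ill_defined(cause_codes):
    """Share of underlying-cause ICD-10 codes falling in the ill-defined chapter (R00-R99)."""
    ill = sum(code.startswith("R") for code in cause_codes)
    return 100.0 * ill / len(cause_codes)

regions = {
    "Northeast": {"registered": 152_000, "expected": 200_000,
                  "codes": ["I21", "R99", "C34", "R54", "J18"]},
    "South":     {"registered": 191_000, "expected": 200_000,
                  "codes": ["I21", "C34", "J18", "I64", "E14"]},
}

for name, d in regions.items():
    print(f"{name:9s} completeness={completeness(d['registered'], d['expected']):.0f}%  "
          f"ill-defined causes={pct_ill_defined(d['codes']):.0f}%")
```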

Journal ArticleDOI
TL;DR: In this paper, the authors discuss the ongoing development of quality management, and whether the concepts being discussed can be agreed, and what influence this might have on the quality movement and quality practice.
Abstract: Purpose – The purpose of this paper is to discuss the ongoing development of quality management, and whether the concepts being discussed can be agreed – and what influence this might have on the quality movement and quality practice.Design/methodology/approach – Literature review and meta‐analysis of current trends has been used to create a conceptual basis for current quality management questions.Findings – A large part of the development of the quality concept and quality management has taken place without much consideration of what quality management really is or should be. Over time their definitions have been widened to incorporate wellbeing of society, the environment and future generations. Whereas top managers need to address all parts of business, there is a need to separate quality issues from other issues. It is believed that there is a need for quality experts and a discipline of quality management. Quality excellence with a strong customer focus should be one prerequisite to attain true busi...

Book ChapterDOI
Danan Gu
01 Jan 2008
TL;DR: This chapter provides a comprehensive review of data quality of the third wave of the Chinese Longitudinal Healthy Longevity Survey in 2002 in terms of proxy use, nonresponse rate, sample attrition, and reliability and validity of major health measures.
Abstract: This chapter provides a comprehensive review of data quality of the third wave of the Chinese Longitudinal Healthy Longevity Survey (CLHLS) in 2002 in terms of proxy use, nonresponse rate, sample attrition, and reliability and validity of major health measures. The results show that the data quality of the 2002 wave of the CLHLS is generally good. Some recommendations in use of the dataset are provided.

Journal ArticleDOI
TL;DR: This article reviews the current strategies used to verify data in the congenital databases of The Society of Thoracic Surgeons, The European Association for Cardio-Thoracic Surgery, and The United Kingdom Central Cardiac Audit Database, and provides a more detailed look at the previously unpublished verification efforts in North America.
Abstract: Accurate, complete data is now the expectation of patients, families, payers, government, and even media. It has become an obligation of those practising congenital cardiac surgery. Appropriately, major professional organizations worldwide are assuming responsibility for the data quality in their respective registry databases. The purpose of this article is to review the current strategies used for verification of the data in the congenital databases of The Society of Thoracic Surgeons, The European Association for Cardio-Thoracic Surgery, and The United Kingdom Central Cardiac Audit Database. Because the results of the initial efforts to verify data in the congenital databases of the United Kingdom and Europe have been previously published, this article provides a more detailed look at the current efforts in North America, which prior to this article have not been published. The discussion and presentation of the strategy for the verification of data in the congenital heart surgery database of The Society of Thoracic Surgeons is then followed by a review of the strategies utilized in the United Kingdom and Europe. The ultimate goal of sharing the information in this article is to provide information to the participants in the databases that track the outcomes of patients with congenitally malformed hearts. This information should help to improve the quality of the data in all of our databases, and therefore increase the utility of these databases to function as a tool to optimise the management strategies provided to our patients. The need for accurate, complete and high quality Congenital Heart Surgery outcome data has never been more pressing. The public interest in medical outcomes is at an all time high and "pay for performance" is looming on the horizon. Information found in administrative databases is not risk or complexity adjusted, notoriously inaccurate, and far too imprecise to evaluate performance adequately in congenital cardiac surgery. The Society of Thoracic Surgeons and European Association for Cardio-Thoracic Surgery databases contain the elements needed for assessment of quality of care provided that a mechanism exists within these organizations to guarantee the completeness and accuracy of the data. The Central Cardiac Audit Database in the United Kingdom has an advantage in this endeavour with the ability to track and verify mortality independently, through their National Health Service. A combination of site visits with "Source Data Verification", in other words, verification of the data at the primary source of the data, and external verification of the data from independent databases or registries, such as governmental death registries, may ultimately be required to allow for optimal verification of data. Further research in the area of verification of data is also necessary. Data must be verified for both completeness and accuracy.

Journal ArticleDOI
TL;DR: A new funding area in the NSF/NIH Collaborative Research in Computational Neuroscience (CRCNS) program has been established to support data sharing, guided in part by a workshop held in 2007.
Abstract: Computational neuroscience is a subfield of neuroscience that develops models to integrate complex experimental data in order to understand brain function. To constrain and test computational models, researchers need access to a wide variety of experimental data. Much of those data are not readily accessible because neuroscientists fall into separate communities that study the brain at different levels and have not been motivated to provide data to researchers outside their community. To foster sharing of neuroscience data, a workshop was held in 2007, bringing together experimental and theoretical neuroscientists, computer scientists, legal experts and governmental observers. Computational neuroscience was recommended as an ideal field for focusing data sharing, and specific methods, strategies and policies were suggested for achieving it. A new funding area in the NSF/NIH Collaborative Research in Computational Neuroscience (CRCNS) program has been established to support data sharing, guided in part by the workshop recommendations. The new funding area is dedicated to the dissemination of high quality data sets with maximum scientific value for computational neuroscience. The first round of the CRCNS data sharing program supports the preparation of data sets which will be publicly available in 2008. These include electrophysiology and behavioral (eye movement) data described towards the end of this article.
