
Showing papers on "Data quality published in 2001"


Journal ArticleDOI
TL;DR: It was found that management support and resources help to address organizational issues that arise during warehouse implementations; resources, user participation, and highly-skilled project team members increase the likelihood that warehousing projects will finish on-time, on-budget, with the right functionality; and diverse, unstandardized source systems and poor development technology will increase the technical issues that project teams must overcome.
Abstract: The IT implementation literature suggests that various implementation factors play critical roles in the success of an information system; however, there is little empirical research about the implementation of data warehousing projects. Data warehousing has unique characteristics that may impact the importance of factors that apply to it. In this study, a cross-sectional survey investigated a model of data warehousing success. Data warehousing managers and data suppliers from 111 organizations completed paired mail questionnaires on implementation factors and the success of the warehouse. The results from a Partial Least Squares analysis of the data identified significant relationships between the system quality and data quality factors and perceived net benefits. It was found that management support and resources help to address organizational issues that arise during warehouse implementations; resources, user participation, and highly-skilled project team members increase the likelihood that warehousing projects will finish on-time, on-budget, with the right functionality; and diverse, unstandardized source systems and poor development technology will increase the technical issues that project teams must overcome. The implementation's success with organizational and project issues, in turn, influences the system quality of the data warehouse; however, data quality is best explained by factors not included in the research model.

1,579 citations


01 Jan 2001
TL;DR: The DIVA-GIS software allows analysis of genebank and herbarium databases to elucidate genetic, ecological and geographic patterns in the distribution of crops and wild species to improve data quality.
Abstract: Computer tools for spatial analysis of plant genetic resources data: 1. DIVA-GIS. The DIVA-GIS software allows analysis of genebank and herbarium databases to elucidate genetic, ecological and geographic patterns in the distribution of crops and wild species. It is useful for scientists who cannot afford generic commercial GIS software, or do not have the time to learn how to use it, and for others who require a GIS that is specifically designed for genetic resources work. Coordinate data are often absent from genebank databases, or if present are sometimes inaccurate. DIVA-GIS helps improve data quality by assigning coordinates, using a large digital gazetteer. DIVA-GIS can also be used to check existing coordinates using overlays of the collection-site and administrative boundary databases. Maps can then be made of the collection sites. Analytical functions implemented in DIVA include mapping of richness and diversity, distribution of useful traits and location of areas with complementary diversity. DIVA can also extract climate data for all terrestrial locations, which can be used to describe the environment of collection sites.
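The overlay check described above can be pictured with a small, self-contained sketch (not DIVA-GIS code): flag records whose coordinates fall outside the administrative unit named in the record, using a basic point-in-polygon test. The district polygon and the record fields below are invented placeholders.

```python
# Illustrative sketch (not DIVA-GIS code): flag genebank records whose
# coordinates fall outside the administrative unit named in the record,
# using a ray-casting point-in-polygon test. Polygon and record fields
# are hypothetical placeholders.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

# Hypothetical admin-boundary polygon keyed by district name.
district_polygons = {
    "District A": [(100.0, 5.0), (101.0, 5.0), (101.0, 6.0), (100.0, 6.0)],
}

records = [
    {"accession": "PI-001", "district": "District A", "lon": 100.5, "lat": 5.5},
    {"accession": "PI-002", "district": "District A", "lon": 103.2, "lat": 5.5},
]

for rec in records:
    poly = district_polygons.get(rec["district"])
    if poly and not point_in_polygon(rec["lon"], rec["lat"], poly):
        print(f"{rec['accession']}: coordinates fall outside {rec['district']} - check record")
```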

571 citations


Journal ArticleDOI
TL;DR: In this article, the authors used data from nine experiments carried out in three household surveys to investigate the effect of no-opinion options on attitude measures and found that the quality of attitude reports was not compromised by the omission of no opinion options.
Abstract: According to many seasoned survey researchers, offering a no-opinion option should reduce the pressure to give substantive responses felt by respondents who have no true opinions. By contrast, the survey satisficing perspective suggests that no-opinion options may discourage some respondents from doing the cognitive work necessary to report the true opinions they do have. We address these arguments using data from nine experiments carried out in three household surveys. Attraction to no-opinion options was found to be greatest among respondents lowest in cognitive skills (as measured by educational attainment), among respondents answering secretly instead of orally, for questions asked later in a survey, and among respondents who devoted little effort to the reporting process. The quality of attitude reports obtained (as measured by over-time consistency and responsiveness to a question manipulation) was not compromised by the omission of no-opinion options. These results suggest that inclusion of no-opinion options in attitude measures may not enhance data quality and instead may preclude measurement of some meaningful opinions.

484 citations


Proceedings Article
11 Sep 2001
TL;DR: This paper presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently, and experimental results report on the assessment of the proposed framework for data cleaning.
Abstract: The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for non-conventional applications, such as the migration of largely unstructured data into structured data, or the integration of heterogeneous scientific data sets in interdisciplinary fields (e.g., in environmental science), existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge in writing such programs is the design of a data flow graph that effectively generates clean data and can perform efficiently on large sets of input data. The difficulty comes from (i) a lack of clear separation between the logical specification of data transformations and their physical implementation, and (ii) the lack of explanation of cleaning results and of user interaction facilities to tune a data cleaning program. This paper addresses these two problems and presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
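To make the idea of separating a declarative cleaning specification from its execution concrete, here is a minimal sketch. The rule names, sample author strings, and the pipeline runner are invented for illustration; the paper's actual language, execution model, and optimizations are far richer than this.

```python
# Minimal sketch of separating a declarative cleaning specification from
# its execution, applied to bibliographic strings. Rules and data are
# invented; this is not the paper's language or system.

import re

# "Logical" specification: an ordered list of (name, transformation) pairs.
CLEANING_SPEC = [
    ("normalize_whitespace", lambda s: re.sub(r"\s+", " ", s).strip()),
    ("strip_trailing_punct", lambda s: s.rstrip(" .,;")),
    ("title_case_authors",   lambda s: s.title()),
]

def run_pipeline(record, spec, explain=False):
    """'Physical' execution of the spec, optionally explaining each step."""
    for name, fn in spec:
        new = fn(record)
        if explain and new != record:
            print(f"  [{name}] {record!r} -> {new!r}")
        record = new
    return record

raw_authors = ["  smith, j.;  todd, p. ", "GALHARDAS,   H.,"]
clean = [run_pipeline(a, CLEANING_SPEC, explain=True) for a in raw_authors]
print(clean)
```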

380 citations


BookDOI
01 Jan 2001
TL;DR: Executive Summary Introduction Origins of the National Health Care Quality Report The IOM Committee on the National Quality Report on Health Care Delivery Defining Health Care Quality Recent Initiatives on Health Care Quality and Quality Reporting Other National and International Initiatives on Health Care Quality Measurement National Initiatives State Initiatives International Initiatives Quality Measurement and Reporting in Other Sectors Objectives of the National Health Care Quality Report Organization of the IOM Report
Abstract: Executive Summary Introduction Origins of the National Health Care Quality Report The IOM Committee on the National Quality Report on Health Care Delivery Defining Health Care Quality Recent Initiatives on Health Care Quality and Quality Reporting Other National and International Initiatives on Health Care Quality Measurement National Initiatives State Initiatives International Initiatives Quality Measurement and Reporting in Other Sectors Objectives of the National Health Care Quality Report Organization of the IOM Report Defining the Contents of the Data Set: The National Health Care Quality Framework Recommendation Importance of the Framework National Health Care Quality Framework Overview Components of Health Care Quality Consumer Perspectives on Health Care Needs Consumer Perspectives on Health Care Needs as Reflected in Care for Specific Health Conditions Using a Matrix to Portray the Framework Equity in Quality of Care as a Cross-Cutting Issue in the Framework What About Efficiency? Summary Selecting Measures for the National Health Care Quality Data Set Recommendations Examining Potential Measure Selection Criteria Criteria for Selecting Individual Measures for the National Health Care Quality Data Set Major Aspects to Consider Specific Aspects to Consider When Selecting Measures Evaluating Individual Measures According to the Criteria Evaluation Criteria for the National Health Care Quality Measure Set Balance Comprehensiveness Robustness Measure Selection Process Steps in the Process of Measure Selection Role of an Advisory Body Reviewing and Updating the Measure Set Measuring Health Care Quality Comprehensively Types of Measures Role of Summary Measures Measures of the Structure, Processes, and Outcomes of Health Care Summary Data Sources for the National Health Care Quality Report Recommendations Data Source Selection Criteria Credibility and Validity of the Data National Scope and Potential to Provide State-Level Detail Availability and Consistency of the Data Over Time and Across Sources Timeliness of the Data Ability to Support Subgroup- and Condition-Specific Analyses Public Accessibility of the Data Potential Data Sources Public Data Sources Private Data Sources Evaluating Data Sources for the National Health Care Quality Data Set in the Short Term Coverage of Health Care Quality Components Data Sources for the National Health Care Quality Report Data Sources in the Short Term Encouraging the Long-Term Development of Electronic Clinical Data Systems Increasing the Access to the National Health Care Quality Data Set Summary Designing the National Health Care Quality Report Recommendation Audiences for the National Health Care Quality Report Report Guidelines Defining the Content of the Quality Report Presenting Information in the Quality Report Audience Testing the National Health Care Quality Report Audience Testing Before Report Releases Evaluative Testing of Report Releases Promoting the Quality Report Communication Channels Partnerships Evaluating the Promotion Plan Summary Appendices A Workshop: Envisioning a National Quality Report on Health Care B Designing a Comprehensive National Report on Effectiveness of Care: Measurement, Data Collection, and Reporting Strategies C Submissions in Response to the Committee's Call for Measures from the Private Sector D Selected Approaches to Thinking About the National Health Care Quality Report E Quality Measure Selection Criteria Glossary Acronyms and Abbreviations Biographical Sketches of Committee Members

358 citations


Book ChapterDOI
TL;DR: How difficulties resulting from missing data, publication bias, data quality and data exclusion, non-independence among observations, and the combination of dissimilar data sets may affect the perceived utility of meta-analysis in these fields and the soundness of conclusions drawn from its application is examined.
Abstract: Meta-analysis is the statistical synthesis of the results of separate studies. It was adapted from other disciplines for use in ecology and evolutionary biology beginning in the early 1990s, and, at the turn of the century, has begun to have a substantial impact on the way data are summarized in these fields. We identify 119 studies concerned with meta-analysis in ecology and evolution, the earliest published in 1991 and the most recent in 2000. We introduce the statistical methods used in modern meta-analysis with references to the well-developed literature in the field. These formal, statistically defensible methods have been established to determine average treatment effects across studies when a common research question is being investigated, to establish confidence limits around the average effect size, and to test for consistency or lack of agreement in effect size as well as explanations for differences in the magnitude of the effect among studies. Problems with popular but statistically flawed methods for the quantitative summary of research results have been pointed out, and their use is diminishing. We discuss a number of challenges and threats to the validity of meta-analysis in ecology and evolution. In particular, we examine how difficulties resulting from missing data, publication bias, data quality and data exclusion, non-independence among observations, and the combination of dissimilar data sets may affect the perceived utility of meta-analysis in these fields and the soundness of conclusions drawn from its application. We highlight particular applications of meta-analysis in ecology and evolution, discuss several controversies surrounding individual meta-analyses, and outline some of the practical issues involved in carrying out a meta-analysis. Finally, we suggest changes that would improve the quality of data synthesis in ecology and evolutionary biology, and predict future directions for this emerging enterprise.
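The core fixed-effect calculations mentioned above (a variance-weighted mean effect size, its confidence interval, and a heterogeneity test) can be sketched in a few lines; the effect sizes and variances below are made-up numbers, not data from any of the 119 studies.

```python
# Sketch of fixed-effect meta-analysis summaries: weighted mean effect
# size, its 95% confidence interval, and the Q heterogeneity statistic.
# All numbers are illustrative.

import math

effects   = [0.40, 0.10, 0.55, 0.25]   # per-study effect sizes
variances = [0.04, 0.02, 0.09, 0.03]   # per-study sampling variances

weights = [1.0 / v for v in variances]
mean_effect = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se_mean = math.sqrt(1.0 / sum(weights))
ci_low, ci_high = mean_effect - 1.96 * se_mean, mean_effect + 1.96 * se_mean

# Q tests whether studies share a common effect (chi-square, k-1 df).
Q = sum(w * (e - mean_effect) ** 2 for w, e in zip(weights, effects))

print(f"mean effect = {mean_effect:.3f}  95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"Q = {Q:.2f} on {len(effects) - 1} df")
```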

342 citations


Journal ArticleDOI
TL;DR: The authors use experimental data from the National Supported Work Demonstration combined with nonexperimental data to evaluate the performance of propensity-score matching estimators, including pairwise and caliper matching, reexamining the finding of Dehejia and Wahba that these simple matching estimators closely replicate the experimental NSW results even though the comparison group data satisfy none of the criteria found to be important in Heckman et al. (1997, 1998a, b).
Abstract: There is a long-standing debate in the literature over whether social programs can be reliably evaluated without a randomized experiment. This paper summarizes results from a larger paper (Smith and Todd, 2001) that uses experimental data combined with nonexperimental data to evaluate the performance of alternative nonexperimental estimators. The impact estimates based on experimental data provide a benchmark against which to judge the performance of nonexperimental estimators. Our experimental data come from the National Supported Work (NSW) Demonstration and the nonexperimental data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID). These same data were used in influential papers by Robert LaLonde (1986), James Heckman and Joseph Hotz (1989), and Rajeev Dehejia and Sadek Wahba (1998, 1999). We focus on a class of estimators called propensity-score matching estimators, which were introduced in the statistics literature by Paul Rosenbaum and Donald Rubin (1983). Traditional propensity-score matching methods pair each program participant with a single nonparticipant, where pairs are chosen based on the degree of similarity in the estimated probabilities of participating in the program (the propensity scores). More recently developed nonparametric matching estimators described in Heckman et al. (1997, 1998a, b) use weighted averages over multiple observations to construct matches. We apply both kinds of estimators in this paper. Heckman et al. (1997, 1998a, b) evaluate the performance of matching estimators using experimental data from the U.S. National Job Training Partnership Act (JTPA) Study combined with comparison group samples drawn from three sources. They show that data quality is a crucial ingredient to any reliable estimation strategy. Specifically, the estimators they examine are only found to perform well in replicating the results of the experiment when they are applied to comparison group data satisfying the following criteria: (i) the same data sources (i.e., the same surveys or the same type of administrative data or both) are used for participants and nonparticipants, so that earnings and other characteristics are measured in an analogous way, (ii) participants and nonparticipants reside in the same local labor markets, and (iii) the data contain a rich set of variables relevant to modeling the program-participation decision. If the comparison group data fails to satisfy these criteria, the performance of the estimators diminishes greatly. More recently, Dehejia and Wahba (1998, 1999) have used the NSW data (also used by LaLonde) to evaluate the performance of propensity-score matching methods, including pairwise matching and caliper matching. They find that these simple matching estimators succeed in closely replicating the experimental NSW results, even though the comparison group data do not satisfy any of the criteria found to be important in Heckman et al. (1997, 1998a). From this evidence, they conclude that matching approaches are generally more reliable than traditional econometric estimators. In this paper, we reanalyze the NSW data in an attempt to reconcile the conflicting findings.
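The pairwise matching idea at the center of this debate reduces to a short sketch: each treated unit is paired with the comparison unit whose estimated propensity score is closest, and the average outcome difference estimates the treatment effect on the treated. The scores and outcomes below are invented; a real analysis would first estimate the scores (e.g., by logit) from covariates.

```python
# Bare-bones sketch of nearest-neighbor propensity-score matching.
# Propensity scores are taken as given and all numbers are invented.

treated = [  # (propensity score, observed earnings outcome)
    (0.62, 5500.0), (0.48, 3900.0), (0.71, 6100.0),
]
comparison = [
    (0.20, 3000.0), (0.45, 4200.0), (0.60, 4800.0), (0.75, 5300.0),
]

def match_att(treated, comparison):
    """ATT = mean over treated of (own outcome - outcome of nearest comparison unit)."""
    diffs = []
    for p_t, y_t in treated:
        p_c, y_c = min(comparison, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

print(f"matched ATT estimate: {match_att(treated, comparison):.1f}")
```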

330 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a series of steps moving towards better assessment and validation of spatial information and ask the reader to evaluate where they are in this series and to move forward.
Abstract: Today, validation or accuracy assessment is an integral component of most mapping projects incorporating remotely sensed data. Other spatial information may not be so stringently evaluated, but at least requires meta-data that documents how the information was generated. This emphasis on data quality was not always the case. In the 1970s only a few brave scientists and researchers dared ask the question, 'How good is this map derived from Landsat MSS imagery?' In the 1980s, the use of the error matrix became a common tool for representing the accuracy of individual map categories. By the 1990s, most maps derived from remotely sensed imagery were required to meet some minimum accuracy standard. A similar progression can be outlined for other spatial information. However, this progression is about 5 years behind the validation of remotely sensed data. This paper presents a series of steps moving towards better assessment and validation of spatial information and asks the reader to evaluate where they are in this series and to move forward.
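The error matrix mentioned above supports simple summary statistics such as overall accuracy and Cohen's kappa; a rough sketch with an illustrative 3x3 matrix (map classes by rows, reference classes by columns) follows.

```python
# Sketch of error (confusion) matrix summaries for a thematic map versus
# reference data: overall accuracy and Cohen's kappa. Counts are invented.

matrix = [
    [45,  4,  1],
    [ 6, 38,  3],
    [ 2,  5, 40],
]

n = sum(sum(row) for row in matrix)
diag = sum(matrix[i][i] for i in range(len(matrix)))
overall_accuracy = diag / n

row_totals = [sum(row) for row in matrix]
col_totals = [sum(matrix[i][j] for i in range(len(matrix))) for j in range(len(matrix))]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
kappa = (overall_accuracy - expected) / (1 - expected)

print(f"overall accuracy = {overall_accuracy:.3f}, kappa = {kappa:.3f}")
```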

324 citations


Book
11 Jan 2001

292 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a framework for data uncertainty assessment in life cycle inventories (LCI), where data uncertainty is divided into two categories: lack of data (data gaps) and data inaccuracy.
Abstract: Modelling data uncertainty is not common practice in life cycle inventories (LCI), although different techniques are available for estimating and expressing uncertainties, and for propagating the uncertainties to the final model results. To clarify and stimulate the use of data uncertainty assessments in common LCI practice, the SETAC working group ‘Data Availability and Quality’ presents a framework for data uncertainty assessment in LCI. Data uncertainty is divided into two categories: (1) lack of data, further specified as complete lack of data (data gaps) and a lack of representative data, and (2) data inaccuracy. Filling data gaps can be done by input-output modelling, using information for similar products or the main ingredients of a product, and applying the law of mass conservation. Lack of temporal, geographical and further technological correlation between the data used and needed may be accounted for by applying uncertainty factors to the non-representative data. Stochastic modelling, which can be performed by Monte Carlo simulation, is a promising technique to deal with data inaccuracy in LCIs.
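The stochastic modelling step can be illustrated with a minimal Monte Carlo sketch that propagates lognormal input uncertainty to an inventory total; the two process emissions and their uncertainty factors are invented numbers, not values from the framework.

```python
# Sketch of Monte Carlo propagation of input uncertainty to an LCI result.
# Medians and uncertainty factors are illustrative only.

import math
import random
import statistics

random.seed(1)

def sample_lognormal(median, uncertainty_factor):
    """Draw from a lognormal given its median and a ~95% uncertainty factor."""
    sigma = math.log(uncertainty_factor) / 1.96
    return random.lognormvariate(math.log(median), sigma)

N = 10_000
totals = []
for _ in range(N):
    emission_a = sample_lognormal(2.0, 1.5)   # kg CO2-eq per unit, process A
    emission_b = sample_lognormal(0.8, 2.0)   # kg CO2-eq per unit, process B
    totals.append(emission_a + emission_b)

totals.sort()
print(f"median total = {statistics.median(totals):.2f} kg CO2-eq")
print(f"2.5%-97.5% range = {totals[int(0.025 * N)]:.2f} - {totals[int(0.975 * N)]:.2f}")
```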

271 citations


01 Jan 2001
TL;DR: DIVA-GIS is GIS software that supports the analysis of genebank and herbarium databases to elucidate genetic, ecological and geographic patterns in the distribution of crops and wild species.
Abstract: The DIVA-GIS version 1.4 software supports the analysis of genebank and herbarium databases to elucidate genetic, ecological and geographic patterns in the distribution of crops and wild species. It is aimed at scientists who cannot afford generic commercial geographic information system (GIS) software, or do not have the time to learn how to use these, and for anyone else who wants a GIS tailor-made for genetic resources. Coordinate data in genebank databases are often absent, and the records that do have coordinate data are sometimes inaccurate. This makes spatial analysis of genebank data more complicated. DIVA-GIS helps to improve data quality by the automated assigning of coordinates, using a large digital gazetteer. DIVA-GIS can also be used to check existing coordinates using overlays (simultaneous spatial queries) of the collection sites and administrative boundaries databases. Maps can then be made of the locations where accessions were collected. Analytical functions implemented in DIVA-GIS include mapping of richness and diversity; distribution of useful traits; and identification of areas with complementary diversity. DIVA-GIS can also extract climate data for all locations on land. These can be used for "retro-classification" of the environment of collection sites.
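One of the analytical functions, grid-based richness mapping, amounts to counting distinct taxa per grid cell; a bare-bones sketch with invented occurrence records follows (DIVA-GIS itself offers far more, e.g. configurable cell sizes and complementarity analysis).

```python
# Rough sketch of grid-based richness mapping: count distinct taxa per
# one-degree cell from collection-site coordinates. Records are invented.

from collections import defaultdict

occurrences = [  # (taxon, lon, lat)
    ("Phaseolus lunatus",   -89.2, 13.7),
    ("Phaseolus vulgaris",  -89.6, 13.9),
    ("Phaseolus vulgaris",  -99.1, 19.4),
    ("Phaseolus coccineus", -99.4, 19.2),
]

cell_taxa = defaultdict(set)
for taxon, lon, lat in occurrences:
    cell = (int(lon // 1), int(lat // 1))   # 1-degree grid cell
    cell_taxa[cell].add(taxon)

for cell, taxa in sorted(cell_taxa.items()):
    print(f"cell {cell}: richness = {len(taxa)}")
```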

Journal ArticleDOI
TL;DR: An extensive simulation evaluates different techniques for dealing with missing data in the context of software cost modeling, suggesting that the simplest technique, listwise deletion, is a reasonable choice; however, it will not necessarily provide the best performance.
Abstract: The construction of software cost estimation models remains an active topic of research. The basic premise of cost modeling is that a historical database of software project cost data can be used to develop a quantitative model to predict the cost of future projects. One of the difficulties faced by workers in this area is that many of these historical databases contain substantial amounts of missing data. Thus far, the common practice has been to ignore observations with missing data. In principle, such a practice can lead to gross biases and may be detrimental to the accuracy of cost estimation models. We describe an extensive simulation where we evaluate different techniques for dealing with missing data in the context of software cost modeling. Three techniques are evaluated: listwise deletion, mean imputation, and eight different types of hot-deck imputation. Our results indicate that all the missing data techniques perform well with small biases and high precision. This suggests that the simplest technique, listwise deletion, is a reasonable choice. However, this will not necessarily provide the best performance. Consistent best performance (minimal bias and highest precision) can be obtained by using hot-deck imputation with Euclidean distance and a z-score standardization.
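The best-performing technique reported here, hot-deck imputation with Euclidean distance and z-score standardization, can be sketched compactly; the tiny project "database" below is fabricated for illustration.

```python
# Compact sketch of hot-deck imputation using Euclidean distance on
# z-score standardized predictors. The project data are fabricated.

import math

# (size_kloc, team_experience_years, effort_person_months or None if missing)
projects = [
    (10.0, 3.0, 24.0),
    (25.0, 5.0, 60.0),
    (40.0, 2.0, 110.0),
    (22.0, 4.0, None),     # effort missing -> to be imputed
]

def zscores(values):
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

size_z = zscores([p[0] for p in projects])
exp_z  = zscores([p[1] for p in projects])
std = [(size_z[i], exp_z[i], projects[i][2]) for i in range(len(projects))]

donors = [p for p in std if p[2] is not None]
imputed = []
for s, e, y in std:
    if y is None:
        # nearest complete case (donor) by Euclidean distance in z-space
        donor = min(donors, key=lambda d: math.hypot(d[0] - s, d[1] - e))
        y = donor[2]
    imputed.append(y)

print(imputed)
```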

Journal ArticleDOI
TL;DR: It is concluded that many of the data quality problems found previously are present in the MFLS, fielded in Peninsular Malaysia in 1976 and 1988.
Abstract: The literature on reporting error provides insights into the quality of retrospective reports, particularly as it pertains to short-term recall. Less is understood about the generalizability of these findings to longer-term retrospective reports. We review studies analyzing the quality of retrospective reports in the Malaysian Family Life Surveys (MFLS), fielded in Peninsular Malaysia in 1976 and 1988, and conclude that many of the data quality problems found previously are present in the MFLS. We summarize this literature, place studies based on the MFLS within the context of the reporting error literature, and discuss implications for the design of future surveys.

Book
01 Jan 2001
TL;DR: This book discusses the design and management of data warehousing, and the role of metadata in this process.
Abstract: Foreword. Preface. PART 1: OVERVIEW AND CONCEPTS. The Compelling Need for Data Warehousing. Data Warehouse: The Building Blocks. Trends in Data Warehousing. PART 2: PLANNING AND REQUIREMENTS. Planning and Project Management. Defining the Business Requirements. Requirements as the Driving Force for Data Warehousing. PART 3: ARCHITECTURE AND INFRASTRUCTURE. The Architectural Components. Infrastructure as the Foundation for Data Warehousing. The Significant Role of Metadata. PART 4: DATA DESIGN AND DATA PREPARATION. Principles of Dimensional Modeling. Dimensional Modeling: Advanced Topics. Data Extraction, Transformation, and Loading. Data Quality: A Key to Success. PART 5: INFORMATION ACCESS AND DELIVERY. Matching Information to the Classes of Users. OLAP in the Data Warehouse. Data Warehousing and the Web. Data Mining Basics. PART 6: IMPLEMENTATION AND MAINTENANCE. The Physical Design Process. Data Warehouse Deployment. Growth and Maintenance. Appendix A: Project Life Cycle Steps and Checklists. Appendix B: Critical Factors for Success. Appendix C: Guidelines for Evaluating Vendor Solutions. References. Glossary. Index.

Journal ArticleDOI
TL;DR: In this paper, the authors present a management tool called EcoSCAn (ecological supply chain analysis) that is used for analyzing, mapping, and managing environmental impacts along supply chains.
Abstract: This article reports on research toward a pragmatic and credible means for analyzing, mapping, and managing environmental impacts along supply chains. The results of this research include a management tool called "ecological supply chain analysis" (EcoSCAn) that is presented here for the first time. Its structure bears a passing resemblance to that used in some streamlined life-cycle assessments, but its operation and purpose are quite different. The EcoSCAn tool frames a comparative environmental analysis of products capable of performing broadly equivalent functions. The analysis occurs over complete extended supply chains and within defined supply chain stages at a product level and, to some extent, at a site level. The results are mapped with data confidence indicators. A range of tactical and, where data quality is sufficient, strategic supply chain actions are prompted. Actions to mitigate environmental stress are possible in the absence of good quality data across entire product life cycles, although the extent to which management actions are limited is made plain.

Journal ArticleDOI
TL;DR: More work must be done to develop domain-independent tools that solve the data cleaning problems associated with data warehouse development, and to achieve better synergy between database systems and data mining technology.
Abstract: Decision support systems form the core of business IT infrastructures because they let companies translate business information into tangible and lucrative results. Collecting, maintaining, and analyzing large amounts of data, however, involves expensive technical challenges that require organizational commitment. Many commercial tools are available for each of the three major data warehousing tasks: populating the data warehouse from independent operational databases, storing and managing the data, and analyzing the data to make intelligent business decisions. Data cleaning relates to heterogeneous data integration, a problem studied for many years. More work must be done to develop domain-independent tools that solve the data cleaning problems associated with data warehouse development. Most data mining research has focused on developing algorithms for building more accurate models or building models faster. However, data preparation and mining model deployment present several engaging problems that relate specifically to achieving better synergy between database systems and data mining technology.

Patent
David Perez Corral
29 Aug 2001
TL;DR: A quality management framework system and a method for operating a quality plan in a product development organization having quality objectives are presented, in which data relative to the quality processes are collected and aggregated to generate quality reports.
Abstract: A quality management framework system and method for operating a quality plan in a product development organization having quality objectives. The system includes a plurality of computer implemented tools accessible by users for operating a plurality of quality processes. Data relative to the quality processes is collected and aggregated to generate quality reports. Reports are analyzed and problems are detected through a defect prevention process. Quality actions are initiated in a feedback quality management action tracking process.

Journal ArticleDOI
TL;DR: A technique for declaratively specifying suitable reconciliation correspondences to be used in order to solve conflicts among data in different sources and the main goal of the method is to support the design of mediators that materialize the data in the Data Warehouse relations.
Abstract: Information integration is one of the most important aspects of a Data Warehouse. When data passes from the sources of the application-oriented operational environment to the Data Warehouse, possible inconsistencies and redundancies should be resolved, so that the warehouse is able to provide an integrated and reconciled view of data of the organization. We describe a novel approach to data integration in Data Warehousing. Our approach is based on a conceptual representation of the Data Warehouse application domain, and follows the so-called local-as-view paradigm: both source and Data Warehouse relations are defined as views over the conceptual model. We propose a technique for declaratively specifying suitable reconciliation correspondences to be used in order to solve conflicts among data in different sources. The main goal of the method is to support the design of mediators that materialize the data in the Data Warehouse relations. Starting from the specification of one such relation as a query over the conceptual model, a rewriting algorithm reformulates the query in terms of both the source relations and the reconciliation correspondences, thus obtaining a correct specification of how to load the data in the materialized view.

Journal ArticleDOI
TL;DR: The essential elements of data acquisition, data processing and data analysis are reviewed, and issues related to the quality, validation and storage of data are discussed.

Patent
27 Sep 2001
TL;DR: In this paper, a data migration, data integration, data warehousing, and business intelligence system including a data storage model is provided that allows a business to effectively utilize its data to make business decisions.
Abstract: A data migration, data integration, data warehousing, and business intelligence system including a data storage model is provided that allows a business to effectively utilize its data to make business decisions. The system can be designed to include a number of data storage units including a data dock, a staging area, a data vault, a data mart, a data collection area, a metrics repository, and a metadata repository. Data received from a number of source systems moves through the data storage units and is processed along the way by a number of process areas including a profiling process area, a cleansing process area, a data loading process area, a business rules and integration process area, a propagation, aggregation and subject area breakout process area, and a business intelligence and decision support systems process area. Movement of the data is performed by metagates. The processed data is then received by corporate portals for use in making business decisions. The system may be implemented by an implementation team made up of members including a project manager, a business analyst, a system architect, a data modeler/data architect, a data migration expert, a DSS/OLAP expert, a data profiler/cleanser, and a trainer.

Journal ArticleDOI
TL;DR: The authors identify underlying causes for the 12 most problematic variables in three multiethnic surveys and describe them in terms of ethnic differences in reliability, validity, and cognitive processes, and differences with regard to cultural appropriateness and translation problems.
Abstract: Objective. There has been insufficient research on the influence of ethno-cultural and language differences in public health surveys. Using data from three independent studies, the authors examine methods to assess data quality and to identify causes of problematic survey questions. Methods. Qualitative and quantitative methods were used in this exploratory study, including secondary analyses of data from three baseline surveys (conducted in English, Spanish, Cantonese, Mandarin, and Vietnamese). Collection of additional data included interviews with investigators and interviewers; observations of item development; focus groups; think-aloud interviews; a test-retest assessment survey; and a pilot test of alternatively worded questions. Results. The authors identify underlying causes for the 12 most problematic variables in three multiethnic surveys and describe them in terms of ethnic differences in reliability, validity, and cognitive processes (interpretation, memory retrieval, judgment formation, and respon...

Journal ArticleDOI
TL;DR: With broader understanding of casemix adjustment and methods for analyzing small samples, quality data can be analysed and reported more accurately and precision may be improved using indirect estimation techniques that incorporate auxiliary information.
Abstract: PURPOSE To present two key statistical issues that arise in analysis and reporting of quality data. SUMMARY Casemix variation is relevant to quality reporting when the units being measured have differing distributions of patient characteristics that also affect the quality outcome. When this is the case, adjustment using stratification or regression may be appropriate. Such adjustments may be controversial when the patient characteristic does not have an obvious relationship to the outcome. Stratified reporting poses problems for sample size and reporting format, but may be useful when casemix effects vary across units. Although there are no absolute standards of reliability, high reliabilities (interunit F ≥ 10 or reliability ≥ 0.9) are desirable for distinguishing above- and below-average units. When small or unequal sample sizes complicate reporting, precision may be improved using indirect estimation techniques that incorporate auxiliary information, and 'shrinkage' estimation can help to summarize the strength of evidence about units with small samples. CONCLUSIONS With broader understanding of casemix adjustment and methods for analyzing small samples, quality data can be analysed and reported more accurately.
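The "shrinkage" idea can be sketched as an empirical-Bayes weighting that pulls each unit's observed rate toward the overall mean, with more shrinkage for smaller units; the hospital rates and the assumed between-unit variance below are illustrative only.

```python
# Sketch of shrinkage estimation for small-sample units: weight
# w = between-unit variance / (between-unit variance + within-unit variance/n).
# All numbers are illustrative.

units = [  # (unit name, n patients, observed event rate)
    ("Hospital A",  30, 0.10),
    ("Hospital B", 400, 0.22),
    ("Hospital C",  15, 0.40),
]

overall_rate = sum(n * r for _, n, r in units) / sum(n for _, n, _ in units)
between_var = 0.003   # assumed variance of true unit rates (illustrative)

for name, n, r in units:
    within_var = r * (1 - r) / n            # binomial sampling variance
    w = between_var / (between_var + within_var)
    shrunk = w * r + (1 - w) * overall_rate
    print(f"{name}: observed {r:.2f} -> shrunken {shrunk:.2f} (weight {w:.2f})")
```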

Journal ArticleDOI
TL;DR: The authors describe a fully functional healthcare data warehouse used to produce several reports for communities throughout Florida and are actively pursuing a research agenda to enhance technical data warehousing capabilities while investigating innovative community and clinical healthcare applications.
Abstract: Healthcare data warehousing presents unique challenges. The industry is rife with often incompatible medical standards and coding schemes that require careful translation. Healthcare data comes from many sources and is delivered in many forms, including published books, individual spreadsheets, and several tape or data formats. Results derived from a healthcare data warehouse must be delivered in accessible form to diverse stakeholders, including healthcare regulators, physicians, hospital administrators, consumers, community activists, and members of the popular press. The industry's widely decentralized and largely autonomous data collection efforts make data quality a significant challenge. Finally, the sensitivity of healthcare data makes privacy and security issues paramount. Healthcare data warehousing will make rigorous, quantitative information available to healthcare decision makers. The authors describe a fully functional healthcare data warehouse used to produce several reports for communities throughout Florida. Building on this work, they're actively pursuing a research agenda to enhance technical data warehousing capabilities while investigating innovative community and clinical healthcare applications.

Journal ArticleDOI
TL;DR: Issues related to the origin of dirty data, associated problems and costs of using dirty data in an organization, the process of dealing with dirty data in a migration to a new enterprise resource planning (ERP) system, and the benefits of an ERP in managing dirty data are discussed.
Abstract: The integrity of the data used to operate and make decisions about a business affects the relative efficiency of operations and quality of decisions made. Protecting that integrity can be difficult and becomes more difficult as the size and complexity of the business and its systems increase. Recovering data integrity may be impossible once it is compromised. Stewards of transactional and planning systems must therefore employ a combination of procedures including systematic safeguards and user‐training programs to counteract and prevent dirty data in those systems. Users of transactional and planning systems must understand the origins and effects of dirty data and the importance of and means of guarding against it. This requires a shared understanding within the context of the business of the meaning, uses, and value of data across functional entities. In this paper, we discuss issues related to the origin of dirty data, associated problems and costs of using dirty data in an organization, the process of dealing with dirty data in a migration to a new system: enterprise resource planning (ERP), and the benefits of an ERP in managing dirty data. These issues are explored in the paper using a case study.

Patent
07 Nov 2001
TL;DR: A music data distribution system distributes music data to an external device connected to a network; it comprises a storage device that stores first music data, a receiver that receives a music data distribution request from the external device, a quality converter that converts the first music data into second music data of a different quality, and a transmitter that transmits the first or the second music data in accordance with the contents of the request.
Abstract: A music data distribution system for distributing music data to an external device connected to a network, comprises: a storage device that stores first music data; a receiver that receives a music data distribution request from the external device connected to the network, the music data distribution request comprising at least music data identification information and music data quality information; a reading device that reads the first music data from said storage device in accordance with the music data identification information; a quality converter that converts the first music data into second music data having a quality different from the first music data in accordance with the music data quality information; and a transmitter that transmits the first or the second music data to the external device in accordance with contents of the music data distribution request.

Journal ArticleDOI
TL;DR: To facilitate data integration, data value conversion rules are proposed to describe the quantitative relationships among data values involved in context-dependent conflicts, together with a general approach for discovering such rules from the data.

Journal ArticleDOI
TL;DR: In this paper, the authors report the results of a qualitative study into the implementation of data-driven customer relationship management (CRM) strategies and find that clean customer data are essential to successful CRM performance and that technological support for data acquisition, analysis and deployment are not widespread.
Abstract: This paper reports the results of a qualitative study into the implementation of data-driven customer relationship management (CRM) strategies. Seventeen companies are investigated and three short case studies are presented. It is found that clean customer data are essential to successful CRM performance and that technological support for data acquisition, analysis and deployment are not widespread. Clean customer data enable CRM strategies to be both more effective and more efficient, yet not all companies are investing in improving data quality.

01 Jan 2001
TL;DR: This paper introduces data quality mining (DQM) as a new and promising data mining approach from the academic and the business point of view and describes how to employ association rules for the purpose of DQM.
Abstract: In this paper we introduce data quality mining (DQM) as a new and promising data mining approach from the academic and the business point of view. The goal of DQM is to employ data mining methods in order to detect, quantify, explain and correct data quality deficiencies in very large databases. Data quality is crucial for many applications of knowledge discovery in databases (KDD). So a typical application scenario for DQM is to support KDD projects, especially during the initial phases. Moreover, improving data quality is also a burning issue in many areas outside KDD. That is, DQM opens new and promising application fields for data mining methods outside the field of pure data analysis. To give a first impression of a concrete DQM approach, we describe how to employ association rules for the purpose of DQM.
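A toy version of the DQM idea: treat a high-confidence association rule mined from the data as a soft integrity constraint and flag the few records that violate it as potential data quality problems. The rule and records below are invented.

```python
# Toy illustration of data quality mining with an association rule:
# mine the rule country=Germany => currency=EUR, then flag violators.
# Records and threshold are invented.

records = [
    {"country": "Germany", "currency": "EUR"},
    {"country": "Germany", "currency": "EUR"},
    {"country": "Germany", "currency": "DEM"},   # likely outdated/dirty
    {"country": "France",  "currency": "EUR"},
]

germany = [r for r in records if r["country"] == "Germany"]
support_rule = sum(1 for r in germany if r["currency"] == "EUR")
confidence = support_rule / len(germany)

if confidence >= 0.6:   # treat high-confidence rules as soft integrity constraints
    violations = [r for r in germany if r["currency"] != "EUR"]
    print(f"rule holds with confidence {confidence:.2f}; "
          f"{len(violations)} suspicious record(s): {violations}")
```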

Journal ArticleDOI
TL;DR: In this paper, the authors address the issues for multidisciplinary studies at a project level, from where data are aggregated to supply national databases, and provide guidelines for the handling of indicators of the different types.

01 Jan 2001
TL;DR: More objective indicators are needed to help designers and managers develop quality data warehouses; in this paper, a first proposal of metrics for multidimensional model quality is presented together with their formal validation.
Abstract: Organizations are adopting data warehouses to manage information efficiently as “the” main organizational asset. It is essential to assure the information quality of the data warehouse, as it has become the main tool for strategic decisions. Information quality depends on presentation quality and data warehouse quality; the latter includes the quality of the multidimensional model. In recent years, different authors have proposed useful guidelines for designing multidimensional models; however, more objective indicators are needed to help designers and managers develop quality data warehouses. In this paper, a first proposal of metrics for multidimensional model quality is presented together with their formal validation.
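For illustration only, schema-level metrics of the kind proposed can be computed as simple counts over a star schema's fact and dimension tables; the metric names and the toy schema below are generic examples, not the specific validated set from the paper.

```python
# Purely illustrative schema-level metrics for a star (multidimensional)
# model: simple counts over fact and dimension tables. The schema and
# metric names are invented examples of this kind of metric.

star_schema = {
    "fact_sales": {
        "measures": ["units_sold", "revenue"],
        "foreign_keys": ["date_key", "product_key", "store_key"],
    },
    "dimensions": {
        "dim_date":    ["date_key", "day", "month", "year"],
        "dim_product": ["product_key", "name", "category"],
        "dim_store":   ["store_key", "city", "region"],
    },
}

n_dim_tables = len(star_schema["dimensions"])
n_fact_attrs = (len(star_schema["fact_sales"]["measures"])
                + len(star_schema["fact_sales"]["foreign_keys"]))
n_dim_attrs = sum(len(cols) for cols in star_schema["dimensions"].values())

print(f"dimension tables: {n_dim_tables}")
print(f"fact table attributes: {n_fact_attrs}")
print(f"attributes per dimension (avg): {n_dim_attrs / n_dim_tables:.1f}")
```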