SciSpace - formally typeset

Data quality

About: Data quality is a research topic. Over its lifetime, 17,235 publications have been published within this topic, receiving 331,716 citations.


Open access · Journal Article
Abstract:
- Understand the need for analyses of large, complex, information-rich data sets.
- Identify the goals and primary tasks of the data-mining process.
- Describe the roots of data-mining technology.
- Recognize the iterative character of a data-mining process and specify its basic steps.
- Explain the influence of data quality on a data-mining process.
- Establish the relation between data warehousing and data mining.

Data mining is an iterative process within which progress is defined by discovery, through either automatic or manual methods. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an "interesting" outcome. Data mining is the search for new, valuable, and nontrivial information in large volumes of data. It is a cooperative effort of humans and computers. Best results are achieved by balancing the knowledge of human experts in describing problems and goals with the search capabilities of computers. In practice, the two primary goals of data mining tend to be prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans. Therefore, it is possible to put data-mining activities into one of two categories: predictive data mining, which produces the model of the system described by the given data set, or descriptive data mining, which produces new, nontrivial information based on the available data set.

Topics: Data stream mining (68%), Concept mining (66%), Data pre-processing (65%)

4,646 Citations
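The predictive/descriptive split described in the abstract above can be illustrated with a small sketch (not from the paper itself; the toy data set and function names are hypothetical): a nearest-neighbour vote stands in for predictive mining, and a simple pattern summary stands in for descriptive mining.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data set: (hours_studied, passed) records.
records = [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (6, 1)]

def predict_pass(hours, data, k=3):
    """Predictive mining sketch: use known fields to predict an unknown value
    via a k-nearest-neighbour vote."""
    nearest = sorted(data, key=lambda r: abs(r[0] - hours))[:k]
    return round(mean(label for _, label in nearest))

def describe(data):
    """Descriptive mining sketch: summarise human-interpretable patterns
    found in the available data."""
    class_counts = Counter(label for _, label in data)
    avg_hours_when_passed = mean(h for h, label in data if label == 1)
    return {"class_counts": dict(class_counts),
            "avg_hours_when_passed": avg_hours_when_passed}
```

Calling `predict_pass(5.5, records)` predicts a label for an unseen input, while `describe(records)` produces a pattern summary a human can read, mirroring the two categories in the abstract.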

Journal Article · DOI: 10.1080/07421222.1996.11518099
Abstract: Poor data quality (DQ) can have substantial social and economic impacts. Although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy. We believe that data consumers have a much broader data quality conceptualization than IS professionals realize. The purpose of this paper is to develop a framework that captures the aspects of data quality that are important to data consumers.

A two-stage survey and a two-phase sorting study were conducted to develop a hierarchical framework for organizing data quality dimensions. This framework captures dimensions of data quality that are important to data consumers. Intrinsic DQ denotes that data have quality in their own right. Contextual DQ highlights the requirement that data quality must be considered within the context of the task at hand. Representational DQ and accessibility DQ emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer.

Our framework has been used effectively in industry and government. Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs. The salient feature of this research study is that quality attributes of data are collected from data consumers instead of being defined theoretically or based on researchers' experience. Although exploratory, this research provides a basis for future studies that measure data quality along the dimensions of this framework.

Topics: Data quality (68%), Data governance (66%), Quality (business) (58%)

3,716 Citations
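The four-category hierarchy from the abstract above can be sketched as a simple lookup table. The dimension names are those reported in Wang and Strong's framework; the `category_of` helper is a hypothetical illustration, not code from the paper.

```python
# The four DQ categories and their dimensions, per the Wang-Strong framework.
DQ_FRAMEWORK = {
    "intrinsic": ["accuracy", "objectivity", "believability", "reputation"],
    "contextual": ["relevancy", "timeliness", "completeness", "value-added",
                   "appropriate amount of data"],
    "representational": ["interpretability", "ease of understanding",
                         "consistent representation", "concise representation"],
    "accessibility": ["accessibility", "access security"],
}

def category_of(dimension):
    """Return the DQ category a given dimension belongs to, or None."""
    for category, dims in DQ_FRAMEWORK.items():
        if dimension in dims:
            return category
    return None
```

A lookup such as `category_of("timeliness")` returns `"contextual"`, reflecting the paper's point that some quality attributes only make sense relative to the task at hand.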

Open access · Journal Article · DOI: 10.1107/S0907444913000061
Philip R. Evans, Garib N. Murshudov
Abstract: Following integration of the observed diffraction spots, the process of 'data reduction' initially aims to determine the point-group symmetry of the data and the likely space group. This can be performed with the program POINTLESS. The scaling program then puts all the measurements on a common scale, averages measurements of symmetry-related reflections (using the symmetry determined previously) and produces many statistics that provide the first important measures of data quality. A new scaling program, AIMLESS, implements scaling models similar to those in SCALA but adds some additional analyses. From the analyses, a number of decisions can be made about the quality of the data and whether some measurements should be discarded. The effective 'resolution' of a data set is a difficult and possibly contentious question (particularly with referees of papers) and this is discussed in the light of tests comparing the data-processing statistics with trials of refinement against observed and simulated data, and automated model-building and comparison of maps calculated with different resolution limits. These trials show that adding weak high-resolution data beyond the commonly used limits may make some improvement and does no harm.

Topics: Data quality (54%), Data set (51%)

2,810 Citations
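Among the merging statistics a scaling program like AIMLESS reports, two standard ones are Rmerge and the redundancy-corrected Rmeas. The minimal sketch below computes both from groups of symmetry-related intensity measurements; it is an illustration of the textbook formulas, not the program's own implementation.

```python
from math import sqrt

def merging_r_factors(groups):
    """Compute (Rmerge, Rmeas) from symmetry-related intensity groups.

    groups: list of lists; each inner list holds the repeated intensity
    measurements of one unique reflection.
    Rmerge = sum |I_i - <I>| / sum I_i
    Rmeas  = sum sqrt(n/(n-1)) |I_i - <I>| / sum I_i  (multiplicity-corrected)
    """
    num_merge = num_meas = denom = 0.0
    for intensities in groups:
        n = len(intensities)
        if n < 2:
            continue  # a single measurement contributes no spread
        mean_i = sum(intensities) / n
        spread = sum(abs(i - mean_i) for i in intensities)
        num_merge += spread
        num_meas += sqrt(n / (n - 1)) * spread
        denom += sum(intensities)
    return num_merge / denom, num_meas / denom
```

Because of the sqrt(n/(n-1)) factor, Rmeas is always at least as large as Rmerge, which is why it is preferred as a multiplicity-independent quality measure.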

Journal Article · DOI: 10.1145/248603.248616
Surajit Chaudhuri, Umeshwar Dayal
01 Mar 1997
Abstract: Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back-end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front-end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some related to problems that the database research community has worked on for years, while others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996.

Topics: Online analytical processing (66%), Database design (61%), Data warehouse (61%)

2,770 Citations
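The multidimensional aggregation at the heart of OLAP can be sketched as a group-by rollup over a small fact table, in the spirit of the CUBE/ROLLUP operators the survey discusses. The fact table, dimension names, and `rollup` helper below are hypothetical illustrations, not any vendor's implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, product, quarter, sales).
facts = [
    ("EU", "widget", "Q1", 100),
    ("EU", "gadget", "Q1", 150),
    ("US", "widget", "Q1", 200),
    ("US", "widget", "Q2", 120),
]

DIM_NAMES = ("region", "product", "quarter")

def rollup(facts, dims):
    """Aggregate the sales measure over every subset of the given dimensions,
    mimicking an OLAP cube: the empty key () holds the grand total."""
    cube = defaultdict(float)
    for r in range(len(dims) + 1):
        for keep in combinations(dims, r):
            for *coords, sales in facts:
                key = tuple((d, coords[DIM_NAMES.index(d)]) for d in keep)
                cube[key] += sales
    return dict(cube)

cube = rollup(facts, ["region", "quarter"])
```

Here `cube[()]` is the grand total, `cube[(("region", "EU"),)]` the EU subtotal, and `cube[(("region", "US"), ("quarter", "Q1"))]` a cell of the full cross-tabulation, so one pass materialises every aggregation level a slice-and-dice query might ask for.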

Open access · Book
21 Aug 1986
Abstract (contents):
- Geographical information systems
- Data structures for thematic maps
- Digital elevation models
- Data input, verification, storage, and output
- Methods of data analysis and spatial modelling
- Data quality, errors, and natural variation: sources of error
- Errors arising through processing
- The nature of boundaries
- Classification methods
- Methods of spatial interpolation
- Choosing a geographical information system
- Appendices
- Index

Topics: Geographic information system (60%), Geospatial analysis (58%), Data quality (56%)

2,505 Citations
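Among the book's topics is "Methods of spatial interpolation"; one classic such method is inverse distance weighting (IDW), sketched minimally below. The sample coordinates and the `idw` helper are hypothetical illustrations, not code from the book.

```python
from math import dist

def idw(points, query, power=2):
    """Inverse-distance-weighted estimate at `query` from (x, y, value) samples:
    each sample's weight is 1 / distance**power, so nearer points dominate."""
    num = den = 0.0
    for x, y, value in points:
        d = dist((x, y), query)
        if d == 0:
            return value  # query coincides with a sample point
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

samples = [(0, 0, 10.0), (1, 0, 20.0), (0, 1, 30.0)]
```

At a location equidistant from all samples, the weights are equal and IDW reduces to the plain mean, which makes the `power` parameter the knob controlling how local the interpolation is.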

[Chart: number of papers in the topic in previous years]

Top Attributes


Topic's top 5 most impactful authors:
- Mario Piattini: 47 papers, 775 citations
- Ismael Caballero: 45 papers, 502 citations
- Markus Helfert: 33 papers, 343 citations
- Monica Scannapieco: 29 papers, 1.4K citations
- Richard Y. Wang: 24 papers, 8.3K citations

Network Information

Related Topics (5):
- Data management: 31.5K papers, 424.3K citations (90% related)
- Information system: 107.5K papers, 1.8M citations (85% related)
- Decision support system: 54.8K papers, 921.5K citations (85% related)
- Missing data: 21.3K papers, 784.9K citations (85% related)
- Random forest: 13.3K papers, 345.3K citations (85% related)