Book ChapterDOI

Data Quality in Web Information Systems

01 Sep 2008-Vol. 5175, pp 1-1
TL;DR: This talk will focus on current research activities and results on computational solutions from the database community in data profiling, record linking, conditional functional constraints, data provenance and data uncertainty, and will close with a list of open research problems.
Abstract: The World Wide Web has brought a wave of revolutionary changes in how people and organizations generate, disseminate and use data. With unprecedented access to massive amounts of data and powerful information-gathering capabilities enabled by Web-based technologies, the traditional closed-world assumption for database systems has been challenged. More and more data from the Web are used today as essential information sources, directly or indirectly, for all kinds of decision making, not only in personal but also in many business and scientific applications. A user of such Web data, however, must constantly rely on their own judgement of data quality, such as correctness, currency, consistency and completeness. This is an unreliable and often very difficult process, as the quality of this judgement itself often depends on the quality of other information obtained from the Web, and the relationships among the data used can be very complex and sometimes hidden from the user. While the issue of data quality is as old as data itself, it is now exposed at a much higher, broader and more critical level due to the scale, diversity and ubiquity of Web Information Systems. The intrinsic mismatch between the intended use and the actual use of data on the Web is a fundamental cause of poor data quality in Web-based applications. In this talk, we will introduce the notion of data quality, from its roots in management information systems research to new issues and challenges in the context of large-scale Web Information Systems. After a brief introduction to organizational and architectural solutions to the data quality problem, the talk will focus on current research activities and results on computational solutions from the database community in data profiling, record linking, conditional functional constraints, data provenance and data uncertainty.
These technical solutions will be examined for their promises and limitations with respect to data quality in Web Information Systems. Finally, we will discuss a list of open research problems.
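Among the computational solutions the abstract names, conditional functional constraints (conditional functional dependencies, CFDs) lend themselves to a compact illustration. The sketch below is not from the talk; the relation, the condition and the dependency (for UK records, zip code determines city) are hypothetical examples, chosen only to show how a CFD violation check differs from an ordinary functional dependency: the dependency is enforced only on the tuples that satisfy the condition.

```python
# Illustrative sketch (hypothetical data and rule, not from the talk):
# checking the CFD "for rows where country = UK, zip -> city".

records = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "Glasgow"},  # violates the CFD
    {"country": "US", "zip": "EH4", "city": "Boston"},   # condition not met, ignored
]

def cfd_violations(rows, condition, lhs, rhs):
    """Return pairs of rows that satisfy `condition` and agree on `lhs`
    but disagree on `rhs` -- i.e. witnesses that the CFD is violated."""
    seen = {}   # first row observed for each lhs value
    bad = []
    for row in rows:
        if not condition(row):
            continue  # the dependency only applies where the condition holds
        key = row[lhs]
        if key in seen and seen[key][rhs] != row[rhs]:
            bad.append((seen[key], row))
        else:
            seen.setdefault(key, row)
    return bad

violations = cfd_violations(records, lambda r: r["country"] == "UK", "zip", "city")
print(len(violations))  # 1: the two UK rows with zip EH4 name different cities
```

The US row shares the zip code but does not trigger a violation, which is exactly what distinguishes a conditional constraint from a plain functional dependency over the whole relation.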
Citations
Journal ArticleDOI
TL;DR: Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology.
Abstract: The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.

1,048 citations

Journal ArticleDOI
TL;DR: The WIQA-Information Quality Assessment Framework enables information consumers to apply a wide range of policies to filter information, and generates explanations of why information satisfies a specific policy.

228 citations

BookDOI
01 Jan 2016
C. Batini, M. Scannapieco, Data and Information Quality

186 citations


Cites background from "Data Quality in Web Information Sys..."

  • ...The interested reader can find further details in [498]....


Journal ArticleDOI
TL;DR: A set of 33 attributes relevant to portal data quality is proposed, obtained from a review of the literature and validated by means of a survey; although the attributes do not yet constitute a usable model, they may serve as a good starting point for constructing one.
Abstract: Data Quality is a critical issue in today's interconnected society. Advances in technology are making use of the Internet an ever-growing phenomenon, and we are witnessing the creation of a great variety of applications such as Web portals. These applications are important data sources and/or means of accessing information, which many people use to make decisions or to carry out tasks. Quality is a very important factor in any software product, and also in data. As quality is a wide concept, quality models are usually used to assess the quality of a software product. From the software point of view there is a widely accepted standard proposed by ISO/IEC (ISO/IEC 9126), which defines a quality model for software products. Until now, however, no similar proposal has existed for data quality. Although we have found some proposals of data quality models, some of them serving as "de facto" standards, none of them focus specifically on Web portal data quality and the user's perspective. In this paper, we propose a set of 33 attributes that are relevant to portal data quality. These have been obtained from a review of the literature and a validation process carried out by means of a survey. Although these attributes do not yet constitute a usable model, we think they can be considered a good starting point for constructing one.

82 citations

Journal ArticleDOI
TL;DR: The overall aim of the paper is to identify further research directions in the area of big data quality, by providing at the same time an up-to-date state of the art on data quality.
Abstract: In this paper, we discuss the application of the concept of data quality to big data, highlighting how complex it is to define in a general way. Data quality is already a multidimensional concept, difficult to characterize in precise definitions even in the case of well-structured data. Big data adds two further dimensions of complexity: (i) being "very" source-specific, for which we adopt the UNECE classification, and (ii) being highly unstructured and schema-less, often without gold standards to refer to, or very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data by providing insights into the UNECE classification, and then, for each type of data source, we choose a specific instance of that type (notably deep Web data, sensor-generated data, and tweets/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality, while providing an up-to-date state of the art on data quality.

70 citations
