Proceedings Article

Big data and quality: A literature review

TL;DR: This paper sheds light on the present state of quality issues related to Big Data and provides valuable insights that can be used to leverage Big Data science activities.
Abstract: Big Data refers to data volumes in the range of exabytes (10^18 bytes) and beyond. Such volumes exceed the capacity of current on-line storage and processing systems. With characteristics like volume, velocity and variety, Big Data poses challenges to traditional IT establishments. Computer-assisted innovation, real-time data analytics, customer-centric business intelligence, industry-wide decision making and transparency are possible advantages of Big Data, to mention a few. There are many issues with Big Data that warrant quality assessment methods. The issues pertain to storage and transport, management, and processing. This paper sheds light on the present state of quality issues related to Big Data. It provides valuable insights that can be used to leverage Big Data science activities.
Citations
Proceedings Article
01 Oct 2017
TL;DR: In this paper, a public big network dataset is analyzed with a new unsupervised anomaly detection approach on an Apache Spark cluster in Azure HDInsight; the results obtained from a case study were evaluated, and 96% accuracy was achieved.
Abstract: In the past, cyber-attacks were organized in a simple and random way. Nowadays, however, attacks are carried out systematically and over the long term. In addition, the high calculation volume and continuous changes in network data distribution have made it more difficult to analyze data and detect abnormal behaviors within it. For this reason, big data solutions have become essential. In this paper, network anomaly and attack detection studies on big data are first reviewed. Then, a public big network dataset is analyzed with a new unsupervised anomaly detection approach on an Apache Spark cluster in Azure HDInsight. Finally, the results obtained from a case study are evaluated; 96% accuracy was achieved. The results were visualized after dimension reduction using Principal Component Analysis (PCA). The identified anomalies may provide usable outputs for understanding the behavior of the network, distinguishing attacks, providing better cyber security, and protecting critical infrastructures.

89 citations


Cites methods from "Big data and quality: A literature ..."

  • ...Big data characteristics defined by V’s, generally 6V’s [6, 7] as seen in Table I....

    [...]

Proceedings Article
02 Jul 2018
TL;DR: An across-the-board quality management framework is proposed that describes the key quality evaluation practices to be conducted through the different Big Data stages; it can be used to support quality management and to provide a roadmap for data scientists to better understand quality practices and the importance of managing quality.
Abstract: With the advances in communication technologies and the high amount of data generated, collected, and stored, it becomes crucial to manage the quality of this data deluge in an efficient and cost-effective way. Storage, processing, privacy and analytics are the main challenging aspects of Big Data that require quality evaluation and monitoring. Quality has been recognized by the Big Data community as an essential facet of its maturity. Yet, it is a crucial practice that should be implemented at the earlier stages of its lifecycle and progressively applied across the other key processes. The earlier we incorporate quality, the fuller the benefit we can get from insights. In this paper, we first identify the key challenges that necessitate quality evaluation. We then survey, classify and discuss the most recent work on Big Data management. Consequently, we propose an across-the-board quality management framework describing the key quality evaluation practices to be conducted through the different Big Data stages. The framework can be used to leverage quality management and to provide a roadmap for data scientists to better understand quality practices and highlight the importance of managing quality. Finally, we conclude the paper and point to some future research directions on the quality of Big Data.

60 citations


Cites background from "Big data and quality: A literature ..."

  • ...Others, proposed solutions to enhance quality of the data while applying cleansing tasks and activities that are parts of preprocessing (e.g. BigDansing, and Nadeef) [51]– [53], [56], [68]....

    [...]


  • ...authors of selected literature [25], [31], [41], [43], [45], [46], [48], [49], [51]–[56], [58]–[60], have stressed that it is very important to discover quality issues and map them with Big data...

    [...]

Proceedings Article
01 Nov 2017
TL;DR: Wang et al. proposed a big data framework for electric power data quality assessment that can accumulate both real-time and historical data, provide an integrated computation environment for electric power big data assessment, and support the storage of different types of data.
Abstract: Since low-quality data may influence the effectiveness and reliability of applications, data quality is required to be guaranteed. Data quality assessment is considered the foundation of the promotion of data quality, so it is essential to assess the data quality before any other data-related activities. In the electric power industry, more and more electric power data is continuously accumulated, and many electric power applications have been developed based on these data. In China, the power grid has many special characteristics, so traditional big data assessment frameworks cannot be directly applied. Therefore, a big data framework for electric power data quality assessment is proposed. Based on big data techniques, the framework can accumulate both real-time and historical data, provide an integrated computation environment for electric power big data assessment, and support the storage of different types of data.

21 citations

Proceedings Article
01 Jan 2017
TL;DR: A big data framework for electric power data quality assessment is proposed that can accumulate both real-time and historical data, provide an integrated computation environment for electric power big data assessment, and support the storage of different types of data.
Abstract: Since low-quality data may influence the effectiveness and reliability of applications, data quality is required to be guaranteed. Data quality assessment is considered the foundation of the promotion of data quality, so it is essential to assess the data quality before any other data-related activities. In the electric power industry, more and more electric power data is continuously accumulated, and many electric power applications have been developed based on these data. In China, the power grid has many special characteristics, so traditional big data assessment frameworks cannot be directly applied. Therefore, a big data framework for electric power data quality assessment is proposed. Based on big data techniques, the framework can accumulate both real-time and historical data, provide an integrated computation environment for electric power big data assessment, and support the storage of different types of data.

14 citations


Cites background from "Big data and quality: A literature ..."

  • ...Recently, although many distributed techniques are proposed for massive data collection and storage [11], they are not able to be directly applied to electric power big data....

    [...]

Journal Article
TL;DR: In this paper, the effect of big data traits and data quality dimensions on BDA application is explored; the authors formulated 10 hypotheses comprising the relationships among big data traits, accuracy, believability, completeness, timeliness, ease of operation, and BDA application constructs.
Abstract: The popularity of big data analytics (BDA) has boosted the interest of organisations in exploiting their large-scale data. This technology can become a strategic stimulus for organisations to achieve competitive advantage and sustainable growth. Previous BDA research, however, has focused more on introducing more traits, known as Vs for big data traits, while ignoring the quality of data when examining the application of BDA. Therefore, this study aims to explore the effect of big data traits and data quality dimensions on BDA application. This study formulated 10 hypotheses comprising the relationships among big data traits, accuracy, believability, completeness, timeliness, ease of operation, and BDA application constructs. This study conducted a survey using a questionnaire as a data collection instrument. Then, the partial least squares structural equation modelling technique was used to analyse the hypothesised relationships between the constructs. The findings revealed that big data traits can significantly affect all constructs for data quality dimensions and that the ease of operation construct has a significant effect on BDA application. This study contributes to the literature by bringing new insights to the field of BDA and may serve as a guideline for future researchers and practitioners when studying BDA application.

10 citations

References
Journal Article
TL;DR: Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs and this research provides a basis for future studies that measure data quality along the dimensions of this framework.
Abstract: Poor data quality (DQ) can have substantial social and economic impacts. Although firms are improving data quality with practical approaches and tools, their improvement efforts tend to focus narrowly on accuracy. We believe that data consumers have a much broader data quality conceptualization than IS professionals realize. The purpose of this paper is to develop a framework that captures the aspects of data quality that are important to data consumers. A two-stage survey and a two-phase sorting study were conducted to develop a hierarchical framework for organizing data quality dimensions. This framework captures dimensions of data quality that are important to data consumers. Intrinsic DQ denotes that data have quality in their own right. Contextual DQ highlights the requirement that data quality must be considered within the context of the task at hand. Representational DQ and accessibility DQ emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer. Our framework has been used effectively in industry and government. Using this framework, IS managers were able to better understand and meet their data consumers' data quality needs. The salient feature of this research study is that quality attributes of data are collected from data consumers instead of being defined theoretically or based on researchers' experience. Although exploratory, this research provides a basis for future studies that measure data quality along the dimensions of this framework.

4,069 citations


"Big data and quality: A literature ..." refers background in this paper

  • ...routines [17], data valid, but not correct [18], mismatched...

    [...]

Journal Article
danah boyd, Kate Crawford
TL;DR: The era of Big Data has begun, and diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people.
Abstract: The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the massive quantities of information produced by and about people, things, and their interactions. Diverse groups argue about the potential benefits and costs of analyzing genetic sequences, social media interactions, health records, phone logs, government records, and other digital traces left by people. Significant questions emerge. Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Given the rise of Big Data as a socio-tech...

3,955 citations

Journal Article
TL;DR: The background and state-of-the-art of big data are reviewed, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid, as well as related technologies.
Abstract: In this paper, we review the background and state-of-the-art of big data. We first introduce the general background of big data and review related technologies, such as cloud computing, Internet of Things, data centers, and Hadoop. We then focus on the four phases of the value chain of big data, i.e., data generation, data acquisition, data storage, and data analysis. For each phase, we introduce the general background, discuss the technical challenges, and review the latest advances. We finally examine several representative applications of big data, including enterprise management, Internet of Things, online social networks, medical applications, collective intelligence, and smart grid. These discussions aim to provide a comprehensive overview and big-picture to readers of this exciting area. This survey is concluded with a discussion of open problems and future directions.

2,303 citations


"Big data and quality: A literature ..." refers background in this paper

  • ...In literature, there is no unified definition of “Big Data”, it has been defined differently in technological, industrial, research or academic perspectives [5]....

    [...]

Proceedings Article
31 Dec 2012
TL;DR: This analysis illustrates that Big Data analytics is a fast-growing, influential practice and a key enabler for the social business, and that it is critical for success in the age of social media.
Abstract: In this paper, we explain the concept, characteristics & need of Big Data & different offerings available in the market to explore unstructured large data. This paper covers Big Data adoption trends, entry & exit criteria for the vendor and product selection, best practices, customer success story, benefits of Big Data analytics, summary and conclusion. Our analysis illustrates that Big Data analytics is a fast-growing, influential practice and a key enabler for the social business. The insights gained from user-generated online content and collaboration with customers are critical for success in the age of social media.

811 citations

Journal Article
TL;DR: The data characteristics of the big data environment are analyzed, quality challenges faced by big data are presented, and a hierarchical data quality framework is formulated from the perspective of data users.
Abstract: High-quality data are the precondition for analyzing and using big data and for guaranteeing the value of the data. Currently, comprehensive analysis and research of quality standards and quality assessment methods for big data are lacking. First, this paper summarizes reviews of data quality research. Second, this paper analyzes the data characteristics of the big data environment, presents quality challenges faced by big data, and formulates a hierarchical data quality framework from the perspective of data users. This framework consists of big data quality dimensions, quality characteristics, and quality indexes. Finally, on the basis of this framework, this paper constructs a dynamic assessment process for data quality. This process has good expansibility and adaptability and can meet the needs of big data quality assessment. The research results enrich the theoretical scope of big data and lay a solid foundation for the future by establishing an assessment model and studying evaluation algorithms.

631 citations