Proceedings ArticleDOI

An enhanced pre-processing model for big data processing: A quality framework

01 Feb 2017, pp. 1-7
TL;DR: An effective pre-processing model for big data is proposed in this paper, using the Relief algorithm and fast-mRMR together as a hybrid approach that can greatly enhance the quality of the data.
Abstract: With ever-growing trends and technologies, a huge volume of data is generated every second. Big data has become the dominant approach to the inception, acquisition, processing, and analysis of heterogeneous data at this scale, so that useful insights can be derived from it. Data without quality is of little value; quality data is needed to leverage the data in a more appropriate manner. With the evolution of big data, many technologies are being developed, and their input must be processed in such a way that quality data yields effective results. This paper proposes an effective pre-processing model for big data: the Relief algorithm and fast-mRMR are used together as a hybrid approach to pre-process the data. Analysis shows that this hybrid approach is more effective and can greatly enhance the quality of the data, and it can yield better performance on a big data platform using the Spark framework.
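The abstract only names the two techniques, so the following is a minimal sketch, in plain Python rather than the Spark implementation the paper targets, of how a Relief-style relevance filter can be combined with a greedy mRMR pass. The function names, sample counts, and thresholds are illustrative assumptions, not the authors' code.

    # Hypothetical sketch: Relief prefilter followed by greedy mRMR selection.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

    def relief_weights(X, y, n_samples=100, seed=0):
        """Simplified Relief: reward features that differ on the nearest miss
        and agree on the nearest hit, over a random sample of instances."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for i in rng.choice(n, size=min(n_samples, n), replace=False):
            dist = np.abs(X - X[i]).sum(axis=1)
            dist[i] = np.inf
            hit = np.argmin(np.where(y == y[i], dist, np.inf))
            miss = np.argmin(np.where(y != y[i], dist, np.inf))
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / n_samples

    def hybrid_select(X, y, n_candidates=50, n_final=10):
        # Stage 1: Relief keeps only the most relevant candidate features.
        candidates = np.argsort(relief_weights(X, y))[::-1][:n_candidates]
        # Stage 2: greedy mRMR over the survivors (relevance minus redundancy).
        relevance = mutual_info_classif(X[:, candidates], y)
        selected = [int(np.argmax(relevance))]
        while len(selected) < min(n_final, len(candidates)):
            best, best_score = None, -np.inf
            for j in range(len(candidates)):
                if j in selected:
                    continue
                redundancy = np.mean([
                    mutual_info_regression(X[:, [candidates[j]]], X[:, candidates[s]])[0]
                    for s in selected])
                if relevance[j] - redundancy > best_score:
                    best, best_score = j, relevance[j] - redundancy
            selected.append(best)
        return candidates[selected]

In the paper's setting both stages would run distributed on Spark; the sketch above only illustrates the selection logic on a single machine.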
Citations
Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper identifies data quality challenges in the Internet of Things domain and proposes a model that ensures the data quality standards provided by ISO 8000; compared with the baseline model, the proposed system improves accuracy by 38.88%.
Abstract: The Internet of Things (IoT) is one of the most promising fields in computer science. It consists of physical devices, automobiles, home appliances, embedded hardware, sensors, and actuators, which are empowered to interface and share information with other devices over the network. The data gathered from these devices is used to make intelligent decisions; if the data quality is poor, the decisions are likely to be flawed. Little work has been carried out on data quality in the Internet of Things, and no scheme has been experimentally validated. In this paper we identify data quality challenges in the Internet of Things domain and propose a model that ensures the data quality standards provided by ISO 8000. We evaluated our model on a weather dataset and used random forest prediction to calculate the accuracy of our data. Results show that, compared with the baseline model, the proposed system improves accuracy by 38.88%.
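As a point of reference for the evaluation step mentioned above, here is a minimal sketch of training a random forest on a cleaned weather dataset and reporting accuracy; the file name and target column are assumptions, since the paper's dataset is not reproduced here.

    # Hypothetical evaluation sketch: accuracy of a random forest on cleaned weather data.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("weather_cleaned.csv")      # assumed output of the quality model
    X, y = df.drop(columns=["rain_tomorrow"]), df["rain_tomorrow"]   # assumed target column
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, model.predict(X_te)))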

11 citations


Cites background from "An enhanced pre-processing model fo..."

  • ...The researchers in [20] give an enhanced pre-processing model for big data quality....


Proceedings ArticleDOI
16 Apr 2020
TL;DR: This article analyzes the operation and transaction characteristics of distributed renewable energy plants and builds a data quality analysis framework for distributed renewable energy operations and transactions on the new energy cloud platform.
Abstract: The global climate crisis of the 21st century has pushed countries toward energy transformation in both generation and consumption. To achieve green, low-carbon energy transformation goals, a large amount of renewable energy such as wind and solar must be consumed. Renewable energy, with intermittent fluctuations in the time dimension and agglomeration in the spatial dimension, increases the complexity of consuming green energy in a grid-friendly way. Therefore, comprehensive data and advanced predictive analysis methods are required to guarantee the safety of operation and transactions for renewable energy plants and stations; one can even say that the quality of renewable energy data determines the accuracy of prediction and analysis. This article first analyzes the operation and transaction characteristics of distributed renewable energy plants and builds a data quality analysis framework for distributed renewable energy operations and transactions on the new energy cloud platform. Data is classified into model parameters and status instances, which relate to dispatching and energy power transaction businesses such as equipment model management, operation monitoring, security analysis, and measurement statistics. The relative importance between them is determined by pairwise comparison. Finally, analytic hierarchy process (AHP) theory is applied to calculate weights for data integrity, accuracy, consistency, and timeliness; a data quality assessment process and calculation methods are designed; and load series data is used to verify their correctness.
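The AHP weighting step described above can be illustrated with a short sketch: a pairwise comparison matrix over the four quality dimensions is reduced to a weight vector via its principal eigenvector, with a consistency check. The comparison values below are made up for illustration and are not the paper's.

    # Hypothetical AHP sketch: weights for integrity, accuracy, consistency, timeliness.
    import numpy as np

    criteria = ["integrity", "accuracy", "consistency", "timeliness"]
    # A[i, j] = relative importance of criterion i over criterion j (Saaty 1-9 scale).
    A = np.array([[1.0, 1/2, 2.0, 3.0],
                  [2.0, 1.0, 3.0, 4.0],
                  [1/2, 1/3, 1.0, 2.0],
                  [1/3, 1/4, 1/2, 1.0]])

    # Principal eigenvector of A gives the weight vector.
    eigvals, eigvecs = np.linalg.eig(A)
    k = np.argmax(eigvals.real)
    weights = eigvecs[:, k].real
    weights /= weights.sum()

    # Consistency ratio: CI / RI, with RI = 0.90 for a 4x4 matrix.
    CI = (eigvals.real[k] - len(A)) / (len(A) - 1)
    print(dict(zip(criteria, weights.round(3))), "CR =", round(CI / 0.90, 3))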

4 citations


Cites methods from "An enhanced pre-processing model fo..."

  • ...Then an effective data pre-processing model is proposed for processing of the big data in [10], which uses the relief algorithm and fast mRMR together as a hybrid approach....


Proceedings ArticleDOI
Li Yi, Wang Tongxun, Meng Tan, Yaqiong Li, Zhixian Pi 
01 Aug 2020
TL;DR: A data quality analysis, abnormal data detection, and repair method for renewable energy operation data is proposed, and a predictive decision-making attribute forecasting tree is constructed to repair the abnormal data.
Abstract: Renewable energy sources are becoming the main form of the supply side in the energy internet. To improve the absorption capacity and operation analysis level of large-scale distributed renewable energy, it is important to guarantee the accuracy of renewable energy operation data. Based on multi-scenario application analysis, this paper proposes a data quality analysis, abnormal data detection, and repair method for renewable energy operation data. First, the renewable energy data types are analyzed, the K-means clustering analysis method is applied step by step to form data characteristic curves for data evaluation, and a diagnosis method for abnormal data is obtained. Then rough set theory is used to reduce the associated attributes of the operation data values and to establish the importance between data attribute types and data values. Finally, a predictive decision-making attribute forecasting tree is constructed to repair the abnormal data. A numerical load case verifies the effectiveness of the method.
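A minimal sketch of the K-means screening idea described above: cluster daily output curves and flag days that sit unusually far from their nearest cluster centre as candidate abnormal data. The input file, cluster count, and 3-sigma threshold are assumptions for illustration; the paper's subsequent rough-set reduction and decision-tree repair are not reproduced.

    # Hypothetical abnormal-data screening sketch with K-means.
    import numpy as np
    from sklearn.cluster import KMeans

    curves = np.load("daily_output_curves.npy")     # assumed shape (n_days, 96): 15-min values

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(curves)
    dist = np.linalg.norm(curves - km.cluster_centers_[km.labels_], axis=1)

    threshold = dist.mean() + 3 * dist.std()        # simple 3-sigma cut-off, an assumption
    print("candidate abnormal days:", np.where(dist > threshold)[0])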

1 citation


Cites methods from "An enhanced pre-processing model fo..."

  • ...Then an effective data preprocessing model is proposed for processing of the big data in [9], which uses the relief algorithm and fast mRMR together as a hybrid approach....


Proceedings ArticleDOI
TL;DR: In this paper, the authors use machine learning techniques to identify inconsistent data in a data set from a government entity's program focused on needs in early childhood, and show that supervised learning performs better than unsupervised learning at identifying inconsistent data and is an efficient way to clean data.
Abstract: Nowadays, information is one of the main assets of companies and government entities; it facilitates decision-making and the determination of policies. However, the results are not always satisfactory because of low-quality information. Data sets with duplicate, inconsistent, incomplete, outdated, and imprecise data are the most common causes affecting the quality of information and therefore the results of its analysis. Data cleaning thus becomes fundamental: it is a process that must be carried out before doing any analysis on a given set of data, but at the same time it is a cumbersome process. This article aims to give an overview of how machine-learning techniques may be used to simplify the task of cleaning data; here, machine learning is used to identify the inconsistent data. The case study is a data set from a government entity's program focused on needs in early childhood. The experimental results show that supervised learning performs better than unsupervised learning at identifying inconsistent data and is an efficient way to clean data in this dimension.
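The comparison reported above can be sketched as follows: a supervised classifier trained on records labelled consistent/inconsistent versus an unsupervised outlier detector on the same features. The file and column names are assumptions, the government dataset itself is not available here, and the models chosen are illustrative rather than the paper's exact configuration.

    # Hypothetical comparison sketch: supervised vs. unsupervised detection of inconsistent records.
    import pandas as pd
    from sklearn.ensemble import IsolationForest, RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("records.csv")                          # assumed labelled data set
    X, y = df.drop(columns=["inconsistent"]), df["inconsistent"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

    supervised = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
    unsupervised = IsolationForest(random_state=1).fit(X_tr)
    unsup_pred = (unsupervised.predict(X_te) == -1).astype(int)    # -1 marks an outlier

    print("supervised F1  :", f1_score(y_te, supervised.predict(X_te)))
    print("unsupervised F1:", f1_score(y_te, unsup_pred))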
References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets that runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
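To make the map/reduce contract concrete, here is a toy, single-machine word count in Python; the real system described in the abstract distributes exactly these two user-supplied functions across a cluster and handles partitioning, scheduling, and failures.

    # Toy illustration of the MapReduce programming model (word count).
    from collections import defaultdict

    def map_fn(document):
        # emit (word, 1) for every word in an input record
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # merge all intermediate values that share the same key
        yield word, sum(counts)

    def run(documents):
        groups = defaultdict(list)
        for doc in documents:                    # map phase
            for key, value in map_fn(doc):
                groups[key].append(value)        # shuffle: group by intermediate key
        return dict(pair for key, values in groups.items()
                    for pair in reduce_fn(key, values))   # reduce phase

    print(run(["big data needs quality data", "quality data yields quality results"]))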

20,309 citations


Additional excerpts

  • ...IEEE Transactions on Reliability, 65(1), 38-53....


  • ...Communications of the ACM, 51(1), 107-113....


Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
TL;DR: In this article, feature selection based on the maximal statistical dependency criterion is studied; because the maximal dependency condition is difficult to implement directly, an equivalent minimal-redundancy-maximal-relevance (mRMR) criterion is derived and combined with more sophisticated feature selectors in a two-stage algorithm to select good features.
Abstract: Feature selection is an important problem for pattern classification systems. We study how to select good features according to the maximal statistical dependency criterion based on mutual information. Because of the difficulty in directly implementing the maximal dependency condition, we first derive an equivalent form, called minimal-redundancy-maximal-relevance criterion (mRMR), for first-order incremental feature selection. Then, we present a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers). This allows us to select a compact set of superior features at very low cost. We perform extensive experimental comparison of our algorithm and other methods using three different classifiers (naive Bayes, support vector machine, and linear discriminant analysis) and four different data sets (handwritten digits, arrhythmia, NCI cancer cell lines, and lymphoma tissues). The results confirm that mRMR leads to promising improvement on feature selection and classification accuracy.
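For reference, the first-order incremental criterion described above selects, at step m, the feature that maximizes relevance to the class c minus average redundancy with the already selected set S_{m-1} (standard mRMR notation, written here in LaTeX):

    \max_{x_j \in X \setminus S_{m-1}} \left[ I(x_j; c) - \frac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j; x_i) \right]

where I(\cdot;\cdot) denotes mutual information.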

8,078 citations

Proceedings Article
01 Jan 2006
TL;DR: Bigtable, as presented in this paper, is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many Google projects, including web indexing, Google Earth, and Google Finance, store data in Bigtable.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
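The data model summarized above can be illustrated with a toy, in-memory sketch: a sparse map keyed by (row key, "family:qualifier", timestamp) whose values are uninterpreted byte strings. The webtable-style keys below follow the paper's running example; the helper functions are illustrative, not Bigtable's API.

    # Toy sketch of the Bigtable data model: (row, column, timestamp) -> bytes.
    table = {}

    def put(row, column, timestamp, value):
        table[(row, column, timestamp)] = value

    def read_latest(row, column):
        """Return the most recent version of a cell, mimicking a timestamped read."""
        versions = {t: v for (r, c, t), v in table.items() if (r, c) == (row, column)}
        return versions[max(versions)] if versions else None

    put("com.cnn.www", "contents:", 3, b"<html>...</html>")
    put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
    print(read_latest("com.cnn.www", "contents:"))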

4,843 citations

Journal ArticleDOI
TL;DR: The simple data model provided by Bigtable is described, which gives clients dynamic control over data layout and format, and the design and implementation of Bigtable are described.
Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

3,259 citations


"An enhanced pre-processing model fo..." refers background in this paper

  • ...ACM Transactions on Computer Systems (TOCS), 26(2), 4....


  • ...(2014) IEEE Transactions on Knowledge and Data Engineering, 26(2), 309-321....


  • ...IEEE Transactions on Knowledge and Data Engineering, 26(2), 309-321....


  • ...International Journal of Intelligent Systems, 32(2), 134-152....
