Journal ArticleDOI

Mining on Big Data Using Hadoop MapReduce Model

01 Nov 2017 - Vol. 263, Iss. 4, pp. 042007
TL;DR: Experiments reveal that the proposed approach reduces network and computing loads by eliminating redundant transactions on Hadoop nodes, and that it considerably outperforms the other models.
Abstract: Conventional parallel algorithms for mining frequent itemsets try to balance the load by distributing similar data evenly among nodes. This paper examines that process by analyzing a critical performance drawback of common parallel frequent-itemset mining algorithms: given a large dataset, the data partitioning strategies in existing solutions suffer high communication and mining overhead caused by redundant transactions transmitted among computing nodes. We address this drawback by developing a data partitioning approach on Hadoop using the MapReduce programming model, with the overall goal of speeding up parallel frequent-itemset mining on Hadoop clusters. By combining a similarity metric with the locality-sensitive hashing technique, the approach places highly similar transactions into the same data partition to improve locality without creating an excessive number of redundant transactions. We implement the approach on a 34-node Hadoop cluster, driven by a range of datasets created by the IBM Quest market-basket synthetic data generator. Experiments reveal that the approach reduces network and computing loads by eliminating redundant transactions on Hadoop nodes, and that it considerably outperforms the other models.
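The paper publishes no code, but the core idea of the abstract (routing highly similar transactions to the same partition via a similarity metric and locality-sensitive hashing) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation; the MinHash signature length, band width, bucket-merging rule, and toy transactions are all assumptions.

```python
import random
from collections import defaultdict

# Sketch of LSH-based transaction partitioning (not the authors' code).
# Each transaction is a set of item ids; transactions whose MinHash signatures
# collide in at least one band land in the same partition, so highly similar
# transactions tend to be mined on the same Hadoop node.

NUM_HASHES = 16      # assumed signature length
BAND_SIZE = 4        # assumed rows per LSH band
PRIME = 2_147_483_647

random.seed(42)
HASH_PARAMS = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
               for _ in range(NUM_HASHES)]

def minhash_signature(transaction):
    """Compute a MinHash signature for a set of integer item ids."""
    return tuple(min((a * item + b) % PRIME for item in transaction)
                 for a, b in HASH_PARAMS)

def partition_transactions(transactions):
    """Group transactions whose signatures agree on at least one band."""
    buckets = defaultdict(list)
    for tid, items in enumerate(transactions):
        sig = minhash_signature(items)
        for band_start in range(0, NUM_HASHES, BAND_SIZE):
            band_key = (band_start, sig[band_start:band_start + BAND_SIZE])
            buckets[band_key].append(tid)
    # Greedy assignment: the first bucket a transaction lands in wins,
    # so every transaction ends up in exactly one partition.
    assigned = {}
    partitions = defaultdict(list)
    for band_key, tids in buckets.items():
        for tid in tids:
            if tid not in assigned:
                assigned[tid] = band_key
                partitions[band_key].append(tid)
    return list(partitions.values())

if __name__ == "__main__":
    toy = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]  # assumed toy data
    print(partition_transactions(toy))
```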
Citations
Journal ArticleDOI
TL;DR: The present study uses a principal component analysis-based deep neural network model with the Grey Wolf Optimization (GWO) algorithm to classify the extracted features of a diabetic retinopathy dataset and shows that the proposed model offers better performance than traditional machine learning algorithms.
Abstract: Diabetic retinopathy is a prominent cause of blindness among elderly people and has become a global medical problem over the last few decades. There are several scientific and medical approaches to screen for and detect this disease, but most detection is done using retinal fundus imaging. The present study uses a principal component analysis-based deep neural network model with the Grey Wolf Optimization (GWO) algorithm to classify the extracted features of a diabetic retinopathy dataset. The use of GWO enables the selection of optimal parameters for training the DNN model. The steps involved in this paper include standardization of the diabetic retinopathy dataset using a StandardScaler normalization method, followed by dimensionality reduction using PCA, then selection of optimal hyperparameters by GWO, and finally training of the dataset using a DNN model. The proposed model is evaluated on the performance measures of accuracy, recall, sensitivity, and specificity. The model is further compared with traditional machine learning algorithms: support vector machine (SVM), Naive Bayes classifier, decision tree, and XGBoost. The results show that the proposed model offers better performance than the aforementioned algorithms.
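As a rough illustration of the pipeline described above (standardization, PCA-based dimensionality reduction, then a neural-network classifier), the following scikit-learn sketch wires the stages together. The Grey Wolf Optimization step is not reproduced here: the hidden-layer sizes and learning rate are placeholders standing in for GWO-selected values, and the synthetic dataset is an assumption, not the study's data.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetic retinopathy feature set (assumption).
X, y = make_classification(n_samples=1000, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize -> PCA -> DNN; the hyperparameters below are placeholders for
# the values GWO would select in the cited study.
model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32),
                          learning_rate_init=1e-3,
                          max_iter=500,
                          random_state=0)),
])
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```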

151 citations

Journal ArticleDOI
TL;DR: This work establishes FPM using an extended version of the MapReduce framework in a Hadoop environment, performs preprocessing to remove data redundancy, and proposes AP clustering, which generates effective clusters from the given dataset.
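The TL;DR describes clustering the dataset with Affinity Propagation (AP) before frequent-pattern mining. Below is a minimal sketch of that preprocessing step, assuming transactions are encoded as binary item-presence vectors and using scikit-learn's AffinityPropagation; the encoding, parameters, and toy data are assumptions, not the cited work's code.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy transactions encoded as binary item-presence vectors (assumption).
transactions = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
])

# Affinity Propagation picks exemplars itself, so the number of clusters does
# not have to be fixed in advance; each resulting cluster could then be handed
# to an independent frequent-pattern mining job.
ap = AffinityPropagation(random_state=0)
labels = ap.fit_predict(transactions)
for cluster_id in np.unique(labels):
    print("cluster", cluster_id, "-> transactions", np.where(labels == cluster_id)[0])
```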

18 citations

Proceedings ArticleDOI
01 Nov 2019
TL;DR: The results show that Apache Pig is more efficient and systematic than Apache Hive, providing quick results in less time.
Abstract: Big Data has been seen as a revolution driven by technological advancement over the last few years. The process of examining massive, gigantic, heterogeneous, and multiplex datasets that change very often is called Big Data analytics. Decision making by extracting information from complex and multi-structured data is not possible using traditional means. Two key elements of the Hadoop ecosystem, Apache Hive and Apache Pig, are among the most systematic and cost-effective options for processing ECG Big Data. Apache Pig and Apache Hive are open-source initiatives for examining huge datasets in a high-level language. In this research, a performance analysis of various ECG Big Data datasets is carried out with Apache Hive and Apache Pig. Different parameters of the ECG Big Data are observed, and the results show that Apache Pig is more efficient and systematic, providing quick results in less time compared to Apache Hive.

4 citations

01 Jan 2018
TL;DR: The major rise in data collection and storage has raised the need for much more powerful data analysis tools, and models must be constantly updated to handle data velocity and newly incoming data.
Abstract: Data discrimination refers to the mapping or classification of a class to some predefined group or class. The major rise in data collection and storage has raised the need for much more powerful data analysis tools. The data collected in huge databases needs to be handled effectively and efficiently. Important and highly critical decisions are often made not on the basis of the information-rich data stored in databases but on a decision maker's intuition, merely because of the absence of tools capable of extracting valuable knowledge from vast amounts of data. Current expert systems depend on users to manually input knowledge into knowledge bases, a process that is often time-consuming, expensive, and biased. A further problem with data mining algorithms is their inability to deal with non-static and unbalanced data, so models need to be constantly updated to handle data velocity and newly incoming data.

1 citation


Cites background from "Mining on Big Data Using Hadoop Map..."

  • ...Pros and cons of FP-Growth The pros and cons related to FP-Growth algorithm are mentioned as under [10]....

    [...]

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and an associated implementation for processing and generating large data sets; the implementation runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
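The map/reduce contract described in this abstract is easy to see in a toy word-count example: a map function emits intermediate key/value pairs, a shuffle groups them by key, and a reduce function merges the values for each key. The sketch below is a single-process simulation of the programming model, not Hadoop's or Google's runtime.

```python
from collections import defaultdict
from itertools import chain

def map_fn(_, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values associated with one key."""
    return word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group intermediate pairs by key.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(k, v) for k, v in records):
        grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(grouped.items())]

if __name__ == "__main__":
    lines = enumerate(["the quick brown fox", "the lazy dog", "the fox"])
    print(mapreduce(lines, map_fn, reduce_fn))
```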

20,309 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
01 Sep 2010
TL;DR: Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.
Abstract: We present Schism, a novel workload-aware approach for database partitioning and replication designed to improve scalability of shared-nothing distributed databases. Because distributed transactions are expensive in OLTP settings (a fact we demonstrate through a series of experiments), our partitioner attempts to minimize the number of distributed transactions, while producing balanced partitions. Schism consists of two phases: i) a workload-driven, graph-based replication/partitioning phase and ii) an explanation and validation phase. The first phase creates a graph with a node per tuple (or group of tuples) and edges between nodes accessed by the same transaction, and then uses a graph partitioner to split the graph into k balanced partitions that minimize the number of cross-partition transactions. The second phase exploits machine learning techniques to find a predicate-based explanation of the partitioning strategy (i.e., a set of range predicates that represent the same replication/partitioning scheme produced by the partitioner). The strengths of Schism are: i) independence from the schema layout, ii) effectiveness on n-to-n relations, typical in social network databases, iii) a unified and fine-grained approach to replication and partitioning. We implemented and tested a prototype of Schism on a wide spectrum of test cases, ranging from classical OLTP workloads (e.g., TPC-C and TPC-E), to more complex scenarios derived from social network websites (e.g., Epinions.com), whose schema contains multiple n-to-n relationships, which are known to be hard to partition. Schism consistently outperforms simple partitioning schemes, and in some cases proves superior to the best known manual partitioning, reducing the cost of distributed transactions up to 30%.
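The graph-based first phase can be illustrated with a small sketch: build a graph with one node per tuple, weight each edge by how often the two tuples are accessed by the same transaction, and hand the graph to a partitioner. Here networkx's Kernighan-Lin bisection stands in for the k-way partitioner Schism would use (k = 2 only), and the toy workload is an assumption; this is not the Schism codebase.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Toy workload: each transaction lists the tuple ids it accesses (assumption).
workload = [
    ["a", "b"], ["a", "b"], ["a", "c"],
    ["d", "e"], ["d", "e", "f"], ["e", "f"],
]

# One node per tuple; edge weight = number of transactions touching both tuples.
graph = nx.Graph()
for txn in workload:
    for u, v in combinations(sorted(set(txn)), 2):
        if graph.has_edge(u, v):
            graph[u][v]["weight"] += 1
        else:
            graph.add_edge(u, v, weight=1)

# Balanced 2-way split that tries to minimize the weight of cut edges,
# i.e. the number of distributed transactions in this toy model.
part_a, part_b = kernighan_lin_bisection(graph, weight="weight", seed=0)
print(sorted(part_a), sorted(part_b))
```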

602 citations

Journal ArticleDOI
TL;DR: The author surveys the state of the art in parallel and distributed association-rule-mining algorithms and uncovers the field's challenges and open research problems.
Abstract: The author surveys the state of the art in parallel and distributed association-rule-mining algorithms and uncovers the field's challenges and open research problems. This survey can serve as a reference for both researchers and practitioners.

510 citations

Proceedings ArticleDOI
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang
23 Oct 2008
TL;DR: Through an empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, it is demonstrated that PFP achieves virtually linear speedup and is promising for supporting query recommendation for search engines.
Abstract: Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurring items. Since its inception, a number of significant FIM algorithms have been developed to speed up mining performance. Unfortunately, when the dataset size is huge, both the memory use and computational cost can still be prohibitively expensive. In this work, we propose to parallelize the FP-Growth algorithm (we call our parallel algorithm PFP) on distributed machines. PFP partitions computation in such a way that each machine executes an independent group of mining tasks. Such partitioning eliminates computational dependencies between machines, and thereby communication between them. Through an empirical study on a large dataset of 802,939 Web pages and 1,021,107 tags, we demonstrate that PFP can achieve virtually linear speedup. Besides scalability, the empirical study demonstrates that PFP is promising for supporting query recommendation for search engines.
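The step that removes inter-machine dependencies in PFP is the group-dependent projection of transactions: frequent items are divided into groups, and each transaction is re-emitted at most once per group as a prefix ending at that group's last item, so each group's shard can then be mined by an independent FP-Growth instance. Below is a minimal sketch of that mapper-side projection under stated assumptions (toy item ranking, round-robin group assignment, and toy transactions); the local FP-Growth mining of each shard is omitted.

```python
from collections import defaultdict

# Toy frequency-ranked item list and group assignment (assumptions).
RANKED_ITEMS = ["f", "c", "a", "b", "m", "p"]          # most frequent first
RANK = {item: i for i, item in enumerate(RANKED_ITEMS)}
NUM_GROUPS = 2
GROUP = {item: RANK[item] % NUM_GROUPS for item in RANKED_ITEMS}

def project(transaction):
    """Mapper-side projection: emit (group_id, prefix) pairs for one transaction."""
    # Keep only ranked (frequent) items, sorted by descending frequency.
    items = sorted((i for i in set(transaction) if i in RANK), key=RANK.get)
    emitted = set()
    # Scan from the least frequent item toward the front; emit each group once.
    for pos in range(len(items) - 1, -1, -1):
        gid = GROUP[items[pos]]
        if gid not in emitted:
            emitted.add(gid)
            yield gid, items[: pos + 1]

def build_shards(transactions):
    """Reducer-side grouping: collect each group's projected transactions."""
    shards = defaultdict(list)
    for txn in transactions:
        for gid, prefix in project(txn):
            shards[gid].append(prefix)
    return shards   # each shard would be mined by a local FP-Growth (omitted)

if __name__ == "__main__":
    txns = [["f", "a", "c", "m", "p"], ["f", "c", "a", "b", "m"], ["f", "b"]]
    for gid, shard in sorted(build_shards(txns).items()):
        print(gid, shard)
```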

472 citations