Proceedings ArticleDOI

High performance analytics of bigdata with dynamic and optimized hadoop cluster

TL;DR: This project intends to overcome these obstacles by building a user-friendly SaaS platform that stores data in Amazon S3 and uses Amazon EMR with the MapReduce paradigm and the open-source R scripting language to perform big data analysis within the desired time.
Abstract: With enterprises collecting feedback down to every possible detail, data repositories are being flooded with information. In order to extract valuable information, these data must be processed with sophisticated statistical analysis. Traditional analytical tools, existing statistical software and data management systems find it challenging to perform deep analysis over large data libraries. Users demand a service platform that can store and handle large quantities of data and that is easily accessible, fast, durable and secure, without requiring heavy spending on hardware, upgrades and configuration to perform big data analysis. This project intends to overcome these obstacles by building a user-friendly SaaS platform: a cloud-based web application that stores data in Amazon S3. Because the system sizes the cluster dynamically and optimally for the desired completion time, the user does not need to calculate or estimate the number of nodes. The system uses Amazon EMR and the MapReduce paradigm with the open-source R scripting language to perform big data analysis within the desired time.
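To make the cluster-sizing idea concrete, here is a minimal, hypothetical Python (boto3) sketch that derives an EMR node count from a desired completion time and submits a Hadoop Streaming step running an R script over data in S3. The bucket names, instance types and sizing heuristic are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch: size an EMR cluster from a desired completion time and
# submit a Hadoop Streaming step that runs an R script over data in S3.
# Bucket names, instance types, and the sizing heuristic are illustrative only.
import math
import boto3

def estimate_node_count(input_gb, desired_minutes, gb_per_node_per_minute=0.5):
    """Rough heuristic: more nodes for bigger inputs or tighter deadlines."""
    return max(2, math.ceil(input_gb / (desired_minutes * gb_per_node_per_minute)))

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="r-analytics-on-demand",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": estimate_node_count(input_gb=200, desired_minutes=30),
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the step finishes
    },
    Steps=[{
        "Name": "r-streaming-analysis",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://example-bucket/scripts/analyze.R",
                "-mapper", "analyze.R",
                "-input", "s3://example-bucket/input/",
                "-output", "s3://example-bucket/output/",
            ],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])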
Citations
Journal ArticleDOI
TL;DR: This work examines ‘ConTra’, an open-source application that applies a mutual exclusion rule to identify contradictory data in comma-separated values (CSV) datasets; the results show that ConTra can process large datasets when hosted on servers with fast processors.
Abstract: Big datasets are often stored in flat files and can contain contradictory data. Contradictory data undermine the soundness of the information drawn from a noisy dataset. Traditional tools such as pie charts and bar charts are overwhelmed when used to visually identify contradictory data in the multidimensional attribute values of a big dataset. This work explains the importance of identifying contradictions in a noisy dataset. It also examines how contradictory data in a large and noisy dataset can be mined and visually analysed. The authors developed ‘ConTra’, an open-source application which applies a mutual exclusion rule to identify contradictory data in comma-separated values (CSV) datasets. ConTra’s capability to identify contradictory data in datasets of different sizes is examined. The results show that ConTra can process large datasets when hosted on servers with fast processors. It is also shown in this work that ConTra is 100% accurate in identifying contradictory data for objects whose attribute values do not conform to the mutual exclusion rule of a dataset in CSV format. Different approaches through which ConTra can mine and identify contradictory data are also presented.
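As an illustration of the mutual exclusion rule described above, the following hypothetical Python/pandas sketch flags CSV rows in which two attributes that cannot both hold are both set; the column names and example rule are assumptions, not ConTra's actual interface.

# Hypothetical illustration of the mutual-exclusion idea described for ConTra:
# flag rows of a CSV file in which two mutually exclusive attributes are both set.
# Column names and the example rule are assumptions, not ConTra's API.
import pandas as pd

def find_contradictions(csv_path, col_a, col_b, true_value="yes"):
    """Return rows where col_a and col_b are both 'true', violating mutual exclusion."""
    df = pd.read_csv(csv_path)
    mask = (df[col_a] == true_value) & (df[col_b] == true_value)
    return df[mask]

# Example: an order cannot be both 'shipped' and 'cancelled'.
contradictory = find_contradictions("orders.csv", "shipped", "cancelled")
print(f"{len(contradictory)} contradictory records found")
print(contradictory.head())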

3 citations

Journal ArticleDOI
TL;DR: The goal is to make a comparative analysis of the main technological building blocks that often form the backbone of any DFS system, including Google GFS, IBM GPFS, HDFS, Blobseer and AFS.
Abstract: In recent years, the amount of data generated by information systems has exploded. It is not only the quantity of information, now measured in exabytes, but also the variety of these data, which are increasingly heterogeneous in structure, and the velocity at which they are generated, which in many cases resembles an endless flow. Big Data science now offers many opportunities to analyse and explore these quantities of data: we can collect and parse data, run many distributed operations, aggregate results, and produce reports and syntheses. To allow all these operations, Big Data science relies on the use of Distributed File System (DFS) technologies to store data more efficiently. Distributed file systems were designed to address a set of technological challenges such as consistency and availability of data, scalability of environments, concurrent access to data, and the cost of their maintenance and extension. In this paper, we attempt to highlight some of these systems. Some are proprietary, such as Google GFS and IBM GPFS, and others are open source, such as HDFS, Blobseer and AFS. Our goal is to make a comparative analysis of the main technological building blocks that often form the backbone of any DFS system.
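For readers unfamiliar with the file-system abstraction these systems expose, the following minimal Python sketch uses the open-source hdfs package to talk to HDFS through its WebHDFS gateway; the host name, user and paths are placeholders, and a real cluster transparently replicates each block across several DataNodes.

# Minimal sketch of interacting with HDFS through its WebHDFS gateway, using
# the open-source `hdfs` Python package. Hostname, port, user, and paths are
# placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# Write a small file; HDFS splits and replicates it behind the scenes.
client.write("/data/sample.csv", data=b"id,value\n1,42\n", overwrite=True)

# Read it back as a stream.
with client.read("/data/sample.csv") as reader:
    print(reader.read().decode())

# List the directory to confirm the file landed where expected.
print(client.list("/data"))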

2 citations


Cites background from "High performance analytics of bigda..."

  • ...Traditional data processing technologies have rapidly reached their limits and are being replaced by new systems which allow big data storage and analysis, taking on consideration what is currently known as the four V: Volume (to handle the huge amount of generated data), Velocity (to store, analyze and retrieve huge dataset as quickly as possible), Variety (to process mostly unstructured data, from multiple sources), and Value (to ask the right questions to generate maximum value) [4]....

    [...]

Journal ArticleDOI
01 Jul 2019
TL;DR: This article surveys the performance of cloud-based big data frameworks from various providers, helping enterprises pick a suitable framework for their work and obtain the desired outcome.
Abstract: Cloud computing offers various IT services to many users on a pay-as-you-use basis. As data grows day by day, there is a strong need for cloud applications that can manage such huge amounts of data; a cloud-based big data framework is essentially the best solution for analysing data at this scale and handling large datasets. Various companies provide such frameworks for particular applications. A cloud framework is an assembly of different components, such as development tools, middleware for particular applications and the database management services needed for deploying, developing and managing cloud applications. This results in an effective model for scaling huge amounts of data on dynamically allocated resources while solving their complex problems. This article surveys the performance of cloud-based big data frameworks from various providers, which assists enterprises in picking a suitable framework for their work and obtaining the desired outcome.

1 citation

Journal ArticleDOI
TL;DR: Tiarrah Computing focuses on using existing open-source technologies to overcome the challenges that evolve along with IoT, to decouple application deployment and to achieve High Performance, Flexible Application Development, High Availability, Ease of Development, Ease of Maintenance, etc.
Abstract: The evolution of the Internet of Things (IoT) has brought several challenges for existing hardware, network and application development. Some of these are handling real-time streaming and batch big data, real-time event handling, dynamic cluster resource allocation for computation, and wired and wireless networks of things. In order to combat these technicalities, many new technologies and strategies are being developed. Tiarrah Computing integrates the concepts of Cloud Computing, Fog Computing and Edge Computing. The main objectives of Tiarrah Computing are to decouple application deployment and achieve High Performance, Flexible Application Development, High Availability, Ease of Development, Ease of Maintenance, etc. Tiarrah Computing focuses on using existing open-source technologies to overcome the challenges that evolve along with IoT. This paper gives an overview of these technologies and of how to design your application, and elaborates on how to overcome most of the existing challenges.

1 citation


Cites background from "High performance analytics of bigda..."

  • ...required high computation resource in cloud give flexibility to perform bigdata analytics in desire time by allocating resources dynamically [9]....

    [...]

10 Jan 2020
TL;DR: In this paper, a computational system, at no financial cost to the institution and easy for the user to understand, uses statistical methods to describe, present and analyse the collected data.
Abstract: One of the duties of the pedagogical department in secondary education is monitoring students' academic performance every two months. This requires agility and clarity in analysing the results reported by the teaching staff for each subject, so that, when necessary, it is possible to intervene with effective instruments for recovering students' learning in the subjects where the desired success was not achieved. Therefore, this project aims to build a computational system, at no financial cost to the institution and easy for the user to understand, that uses statistical methods to describe, present and analyse the collected data.

1 citation

References
Book
01 Dec 1981
TL;DR: This book presents methods for building, checking and validating regression models, covering simple and multiple linear regression, variable selection, multicollinearity, robust regression, nonlinear regression and generalized linear models.
Abstract: Preface. Introduction. Simple Linear Regression. Multiple Linear Regression. Model Adequacy Checking. Transformations and Weighting to Correct Model Inadequacies. Diagnostics for Leverage and Influence. Polynomial Regression Models. Indicator Variables. Variable Selection and Model Building. Multicollinearity. Robust Regression. Introduction to Nonlinear Regression. Generalized Linear Models. Other Topics in the Use of Regression Analysis. Validation of Regression Models. Appendix A. Statistical Tables. Appendix B. Data Sets for Exercises. Appendix C. Supplemental Technical Material. References. Index.
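As a small illustration of the least-squares fitting this reference covers, the following Python sketch fits a simple linear regression to synthetic data with statsmodels; the data and coefficients are invented for demonstration.

# Illustrative simple linear regression fit on synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=1.5, size=100)  # true slope 2.5, intercept 1.0

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()   # ordinary least squares

print(model.params)          # estimated intercept and slope
print(model.rsquared)        # coefficient of determination R^2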

5,664 citations

Book
29 May 2009
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Abstract: Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you: Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced features for writing real-world MapReduce programs Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud Use Pig, a high-level query language for large-scale data processing Take advantage of HBase, Hadoop's database for structured and semi-structured data Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject. "Now you have the opportunity to learn about Hadoop from a master-not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!
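The MapReduce pattern the book describes can be illustrated with a minimal Hadoop Streaming word count written in Python; the file name and invocation below are illustrative assumptions, not an excerpt from the book.

# Minimal sketch of the MapReduce pattern as a Hadoop Streaming mapper/reducer
# pair in one file (wordcount.py). Test locally with a pipe:
#   cat input.txt | python wordcount.py map | sort | python wordcount.py reduce
# or submit both phases through hadoop-streaming.
import sys

def mapper():
    # Emit (word, 1) for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()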

3,797 citations


"High performance analytics of bigda..." refers background in this paper

  • ...The Hadoop Distributed File System is an extremely efficient file system that enables its users to use scalable and reliable data storage....

    [...]

Book
01 Jan 1980
TL;DR: This book introduces applied linear regression, covering scatterplots and mean functions, ordinary least squares estimation for simple and multiple regression, transformations, diagnostics, outliers and influence, variable selection, and logistic regression.
Abstract: Preface.1 Scatterplots and Regression.1.1 Scatterplots.1.2 Mean Functions.1.3 Variance Functions.1.4 Summary Graph.1.5 Tools for Looking at Scatterplots.1.5.1 Size.1.5.2 Transformations.1.5.3 Smoothers for the Mean Function.1.6 Scatterplot Matrices.Problems.2 Simple Linear Regression.2.1 Ordinary Least Squares Estimation.2.2 Least Squares Criterion.2.3 Estimating sigma 2.2.4 Properties of Least Squares Estimates.2.5 Estimated Variances.2.6 Comparing Models: The Analysis of Variance.2.6.1 The F-Test for Regression.2.6.2 Interpreting p-values.2.6.3 Power of Tests.2.7 The Coefficient of Determination, R2.2.8 Confidence Intervals and Tests.2.8.1 The Intercept.2.8.2 Slope.2.8.3 Prediction.2.8.4 Fitted Values.2.9 The Residuals.Problems.3 Multiple Regression.3.1 Adding a Term to a Simple Linear Regression Model.3.1.1 Explaining Variability.3.1.2 Added-Variable Plots.3.2 The Multiple Linear Regression Model.3.3 Terms and Predictors.3.4 Ordinary Least Squares.3.4.1 Data and Matrix Notation.3.4.2 Variance-Covariance Matrix of e.3.4.3 Ordinary Least Squares Estimators.3.4.4 Properties of the Estimates.3.4.5 Simple Regression in Matrix Terms.3.5 The Analysis of Variance.3.5.1 The Coefficient of Determination.3.5.2 Hypotheses Concerning One of the Terms.3.5.3 Relationship to the t -Statistic.3.5.4 t-Tests and Added-Variable Plots.3.5.5 Other Tests of Hypotheses.3.5.6 Sequential Analysis of Variance Tables.3.6 Predictions and Fitted Values.Problems.4 Drawing Conclusions.4.1 Understanding Parameter Estimates.4.1.1 Rate of Change.4.1.2 Signs of Estimates.4.1.3 Interpretation Depends on Other Terms in the Mean Function.4.1.4 Rank Deficient and Over-Parameterized Mean Functions.4.1.5 Tests.4.1.6 Dropping Terms.4.1.7 Logarithms.4.2 Experimentation Versus Observation.4.3 Sampling from a Normal Population.4.4 More on R2.4.4.1 Simple Linear Regression and R2.4.4.2 Multiple Linear Regression.4.4.3 Regression through the Origin.4.5 Missing Data.4.5.1 Missing at Random.4.5.2 Alternatives.4.6 Computationally Intensive Methods.4.6.1 Regression Inference without Normality.4.6.2 Nonlinear Functions of Parameters.4.6.3 Predictors Measured with Error.Problems.5 Weights, Lack of Fit, and More.5.1 Weighted Least Squares.5.1.1 Applications of Weighted Least Squares.5.1.2 Additional Comments.5.2 Testing for Lack of Fit, Variance Known.5.3 Testing for Lack of Fit, Variance Unknown.5.4 General F Testing.5.4.1 Non-null Distributions.5.4.2 Additional Comments.5.5 Joint Confidence Regions.Problems.6 Polynomials and Factors.6.1 Polynomial Regression.6.1.1 Polynomials with Several Predictors.6.1.2 Using the Delta Method to Estimate a Minimum or a Maximum.6.1.3 Fractional Polynomials.6.2 Factors.6.2.1 No Other Predictors.6.2.2 Adding a Predictor: Comparing Regression Lines.6.2.3 Additional Comments.6.3 Many Factors.6.4 Partial One-Dimensional Mean Functions.6.5 Random Coefficient Models.Problems.7 Transformations.7.1 Transformations and Scatterplots.7.1.1 Power Transformations.7.1.2 Transforming Only the Predictor Variable.7.1.3 Transforming the Response Only.7.1.4 The Box and Cox Method.7.2 Transformations and Scatterplot Matrices.7.2.1 The 1D Estimation Result and Linearly Related Predictors.7.2.2 Automatic Choice of Transformation of Predictors.7.3 Transforming the Response.7.4 Transformations of Nonpositive Variables.Problems.8 Regression Diagnostics: Residuals.8.1 The Residuals.8.1.1 Difference Between e and e.8.1.2 The Hat Matrix.8.1.3 Residuals and the Hat Matrix with Weights.8.1.4 The Residuals When the Model Is 
Correct.8.1.5 The Residuals When the Model Is Not Correct.8.1.6 Fuel Consumption Data.8.2 Testing for Curvature.8.3 Nonconstant Variance.8.3.1 Variance Stabilizing Transformations.8.3.2 A Diagnostic for Nonconstant Variance.8.3.3 Additional Comments.8.4 Graphs for Model Assessment.8.4.1 Checking Mean Functions.8.4.2 Checking Variance Functions.Problems.9 Outliers and Influence.9.1 Outliers.9.1.1 An Outlier Test.9.1.2 Weighted Least Squares.9.1.3 Significance Levels for the Outlier Test.9.1.4 Additional Comments.9.2 Influence of Cases.9.2.1 Cook's Distance.9.2.2 Magnitude of Di .9.2.3 Computing Di .9.2.4 Other Measures of Influence.9.3 Normality Assumption.Problems.10 Variable Selection.10.1 The Active Terms.10.1.1 Collinearity.10.1.2 Collinearity and Variances.10.2 Variable Selection.10.2.1 Information Criteria.10.2.2 Computationally Intensive Criteria.10.2.3 Using Subject-Matter Knowledge.10.3 Computational Methods.10.3.1 Subset Selection Overstates Significance.10.4 Windmills.10.4.1 Six Mean Functions.10.4.2 A Computationally Intensive Approach.Problems.11 Nonlinear Regression.11.1 Estimation for Nonlinear Mean Functions.11.2 Inference Assuming Large Samples.11.3 Bootstrap Inference.11.4 References.Problems.12 Logistic Regression.12.1 Binomial Regression.12.1.1 Mean Functions for Binomial Regression.12.2 Fitting Logistic Regression.12.2.1 One-Predictor Example.12.2.2 Many Terms.12.2.3 Deviance.12.2.4 Goodness-of-Fit Tests.12.3 Binomial Random Variables.12.3.1 Maximum Likelihood Estimation.12.3.2 The Log-Likelihood for Logistic Regression.12.4 Generalized Linear Models.Problems.Appendix.A.1 Web Site.A.2 Means and Variances of Random Variables.A.2.1 E Notation.A.2.2 Var Notation.A.2.3 Cov Notation.A.2.4 Conditional Moments.A.3 Least Squares for Simple Regression.A.4 Means and Variances of Least Squares Estimates.A.5 Estimating E(Y |X) Using a Smoother.A.6 A Brief Introduction to Matrices and Vectors.A.6.1 Addition and Subtraction.A.6.2 Multiplication by a Scalar.A.6.3 Matrix Multiplication.A.6.4 Transpose of a Matrix.A.6.5 Inverse of a Matrix.A.6.6 Orthogonality.A.6.7 Linear Dependence and Rank of a Matrix.A.7 Random Vectors.A.8 Least Squares Using Matrices.A.8.1 Properties of Estimates.A.8.2 The Residual Sum of Squares.A.8.3 Estimate of Variance.A.9 The QR Factorization.A.10 Maximum Likelihood Estimates.A.11 The Box-Cox Method for Transformations.A.11.1 Univariate Case.A.11.2 Multivariate Case.A.12 Case Deletion in Linear Regression.References.Author Index.Subject Index.

3,215 citations

Journal ArticleDOI
TL;DR: Elements of Sampling Theory and Methods is unique in its presentation of materials, and the book’s price is reasonable in comparison to the other four books mentioned in this area.
Abstract: (2002). Introduction to Linear Regression Analysis. Technometrics: Vol. 44, No. 2, pp. 191-192.

2,818 citations

Book
19 Oct 2011
TL;DR: This book reveals how IBM is leveraging open source Big Data technology, infused with IBM technologies, to deliver a robust, secure, highly available, enterprise-class Big Data platform.
Abstract: Big Data represents a new era in data exploration and utilization, and IBM is uniquely positioned to help clients navigate this transformation. This book reveals how IBM is leveraging open source Big Data technology, infused with IBM technologies, to deliver a robust, secure, highly available, enterprise-class Big Data platform. The three defining characteristics of Big Data--volume, variety, and velocity--are discussed. You'll get a primer on Hadoop and how IBM is hardening it for the enterprise, and learn when to leverage IBM InfoSphere BigInsights (Big Data at rest) and IBM InfoSphere Streams (Big Data in motion) technologies. Industry use cases are also included in this practical guide. Learn how IBM hardens Hadoop for enterprise-class scalability and reliability Gain insight into IBM's unique in-motion and at-rest Big Data analytics platform Learn tips and tricks for Big Data use cases and solutions Get a quick Hadoop primer

1,290 citations


"High performance analytics of bigda..." refers background in this paper

  • ...The Hadoop Distributed File System is an extremely efficient file system that enables its users to use scalable and reliable data storage....

    [...]