Open Access Book Chapter

Mutation Operators for Large Scale Data Processing Programs in Spark

TL;DR
This paper proposes a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations, and shows that these operators can contribute to the testing process and to the construction of reliable Spark programs.
Abstract
This paper proposes a mutation testing approach for big data processing programs that follow a data flow model, such as those implemented on top of Apache Spark. Mutation testing is a fault-based technique that relies on fault simulation: programs are modified to create faulty versions called mutants. Mutants are created by operators that simulate specific, well-identified faults. A testing process must be able to detect the faults carried by mutants and thereby avoid ill behaviours in a program. We propose a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations. These operators model changes in the data flow and in the operations, simulating faults that take Spark program characteristics into account. We performed manual experiments to evaluate the proposed mutation operators in terms of cost and effectiveness, showing that mutation operators can contribute to the testing process and to the construction of reliable Spark programs.
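To make the idea concrete, here is a minimal sketch in Scala (our own illustrative example with made-up data; not necessarily one of the paper's actual operators) of a transformation mutation applied to a Spark word count, where the mutant replaces the aggregation function of reduceByKey:

    import org.apache.spark.sql.SparkSession

    object MutationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("MutationSketch").master("local[*]").getOrCreate()
        val pairs = spark.sparkContext
          .parallelize(Seq("a", "b", "a", "c", "b", "a"))
          .map(word => (word, 1))

        // Original program: counts occurrences of each word.
        val original = pairs.reduceByKey(_ + _)

        // Mutant: the aggregation function is replaced, simulating a
        // fault in a data processing operation. A test set "kills"
        // this mutant only if it contains a word that occurs more
        // than once, so the two outputs differ.
        val mutant = pairs.reduceByKey((a, b) => a)

        println(original.collect().toMap) // Map(a -> 3, b -> 2, c -> 1)
        println(mutant.collect().toMap)   // Map(a -> 1, b -> 1, c -> 1)
        spark.stop()
      }
    }

A test that asserts the expected counts on this input distinguishes the mutant from the original, which is exactly the signal mutation testing relies on.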



Citations
Book Chapter

Modeling Big Data Processing Programs

TL;DR: The proposed model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink, using Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow.
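As a rough intuition for why Monoid Algebra fits partitioned datasets, the sketch below (our own illustration, not the paper's formalism) shows that an associative operation with an identity can be folded per partition and then across partial results without changing the outcome:

    // Plain Scala; no Spark needed for the intuition.
    trait Monoid[A] {
      def empty: A
      def combine(x: A, y: A): A
    }

    object IntSum extends Monoid[Int] {
      val empty: Int = 0
      def combine(x: Int, y: Int): Int = x + y
    }

    object MonoidSketch extends App {
      val partitions = Seq(Seq(1, 2), Seq(3), Seq(4, 5))
      // Fold each partition locally, then fold the partial results.
      val partials = partitions.map(_.foldLeft(IntSum.empty)(IntSum.combine))
      val total = partials.foldLeft(IntSum.empty)(IntSum.combine)
      // Associativity guarantees the same result as folding the flat data.
      println(total == partitions.flatten.sum) // true
    }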
Journal Article

TRANSMUT‐Spark: Transformation mutation for Apache Spark

TL;DR: TRANSMUT-Spark is a tool for mutation testing of big data processing code within Spark programs; mutation testing is a fault-based technique that relies on fault simulation to evaluate and design test sets.
Posted Content

TRANSMUT-SPARK: Transformation Mutation for Apache Spark

TL;DR: TRANSMUT-Spark is a tool that automates the mutation testing process for Big Data processing code within Spark programs, relying on fault simulation to evaluate and design test sets.
Posted Content

An Abstract View of Big Data Processing Programs

TL;DR: The authors propose a model for specifying data-flow-based parallel data processing programs that is agnostic of the target Big Data processing framework, focusing on the formal abstract specification of non-iterative and iterative programs and generalizing the strategies adopted by data flow Big Data frameworks.

TRANSMUT-SPARK: Transformation Mutation for Apache Spark (Long Version)

References
Proceedings Article

Item-based collaborative filtering recommendation algorithms

TL;DR: This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.
Proceedings Article

Spark: cluster computing with working sets

TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
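The reported gains on iterative jobs come from keeping the working set in memory across passes. A minimal sketch (our own example with synthetic data; a real job would load its input from HDFS):

    import org.apache.spark.sql.SparkSession

    object CacheSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("CacheSketch").master("local[*]").getOrCreate()

        // Hypothetical dataset, cached in memory after the first pass.
        val data = spark.sparkContext.parallelize(1 to 1000000).map(_.toDouble).cache()

        // Each iteration rescans the cached data from memory instead of
        // re-reading and re-parsing it from disk, as a chain of
        // MapReduce jobs would.
        var result = 0.0
        for (_ <- 1 to 10) {
          result = data.map(math.sqrt).sum()
        }
        println(result)
        spark.stop()
      }
    }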
Proceedings Article

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

TL;DR: This paper presents Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner; RDDs are implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
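The fault tolerance rests on lineage rather than replication: an RDD records the coarse-grained transformations that produced it, so lost partitions can be recomputed by replaying that chain. A small sketch (our own example) that prints the recorded lineage:

    import org.apache.spark.sql.SparkSession

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("LineageSketch").master("local[*]").getOrCreate()

        // Each transformation below is recorded in the RDD's lineage;
        // a lost partition is rebuilt by replaying this chain.
        val rdd = spark.sparkContext
          .parallelize(1 to 100)
          .map(_ * 2)
          .filter(_ % 3 == 0)

        println(rdd.toDebugString) // prints the lineage graph
        spark.stop()
      }
    }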
Proceedings Article

Pig latin: a not-so-foreign language for data processing

TL;DR: This paper describes a new language called Pig Latin, designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce; its implementation is an open-source Apache-incubator project available for general use.
Journal Article

An Analysis and Survey of the Development of Mutation Testing

TL;DR: These analyses provide evidence that Mutation Testing techniques and tools are reaching a state of maturity and applicability, while the topic of Mutation Testing itself is the subject of increasing interest.