Mutation Operators for Large Scale Data Processing Programs in Spark
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
pp. 482-497
TL;DR
This paper proposes a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations, and shows that mutation operators can contribute to the testing process and to the construction of reliable Spark programs.

Abstract
This paper proposes a mutation testing approach for big data processing programs that follow a data flow model, such as those implemented on top of Apache Spark. Mutation testing is a fault-based technique that relies on fault simulation: programs are modified to create faulty versions called mutants. Mutants are created by operators able to simulate specific, well-identified faults. A testing process must be able to signal the faults within mutants and thereby avoid ill behaviours in a program. We propose a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations. These operators model changes in the data flow and in the operations, simulating faults that take Spark program characteristics into account. We performed manual experiments to evaluate the proposed mutation operators in terms of cost and effectiveness, showing that mutation operators can contribute to the testing process and to the construction of reliable Spark programs.
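The core idea of mutation operators for data-flow programs can be illustrated with a minimal sketch. Plain Python callables stand in for Spark transformations here, and the operator shown (negating a filter predicate) is only an illustrative stand-in, not one of the paper's actual operators; all function names are hypothetical.

```python
# Minimal mutation-testing sketch for a data-flow pipeline.
# Plain Python lists stand in for Spark datasets (illustrative assumption).

def original_pipeline(data):
    # Keep even numbers, then square them (filter -> map style data flow).
    return [x * x for x in data if x % 2 == 0]

def mutant_pipeline(data):
    # Mutant: the filter predicate is negated, simulating a predicate fault.
    return [x * x for x in data if not (x % 2 == 0)]

def kills_mutant(test_input, expected):
    """A test 'kills' the mutant if the original program passes the test
    while the mutant's output differs from the expected result."""
    return (original_pipeline(test_input) == expected
            and mutant_pipeline(test_input) != expected)

data = [1, 2, 3, 4]
print(original_pipeline(data))      # [4, 16]
print(mutant_pipeline(data))        # [1, 9]
print(kills_mutant(data, [4, 16]))  # True: this test detects the fault
```

A test set that kills all (non-equivalent) mutants is considered adequate with respect to the simulated faults; mutants that survive point at untested behaviour.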
Citations
Book Chapter
Modeling Big Data Processing Programs
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
TL;DR: This model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink; it uses Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow.
Journal Article
TRANSMUT‐Spark: Transformation mutation for Apache Spark
TL;DR: TRANSMUT-Spark is a tool for mutation testing, a fault-based testing technique that relies on fault simulation to evaluate and design test sets, of big data processing code within Spark programs.
Posted Content
TRANSMUT-SPARK: Transformation Mutation for Apache Spark
Graham Curry, João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +4 more
TL;DR: TRANSMUT-Spark as mentioned in this paper is a tool that automates the mutation testing process of Big Data processing code within Spark programs, relying on fault simulation to evaluate and design test sets.
Posted Content
An Abstract View of Big Data Processing Programs.
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
TL;DR: In this paper, the authors propose a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks, focusing on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow big data processing frameworks.
TRANSMUT-SPARK: Transformation Mutation for Apache Spark-Long Version
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +4 more
References
Proceedings Article
Item-based collaborative filtering recommendation algorithms
TL;DR: This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica +8 more
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Proceedings Article
Pig latin: a not-so-foreign language for data processing
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Journal Article
An Analysis and Survey of the Development of Mutation Testing
Yue Jia, Mark Harman +1 more
TL;DR: These analyses provide evidence that mutation testing techniques and tools are reaching a state of maturity and applicability, while mutation testing itself is the subject of increasing interest.