Mutation Operators for Large Scale Data Processing Programs in Spark
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
pp. 482-497
TL;DR
This paper proposes a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations, and shows that mutation operators can contribute to the testing process and to the construction of reliable Spark programs.

Abstract
This paper proposes a mutation testing approach for big data processing programs that follow a data flow model, such as those implemented on top of Apache Spark. Mutation testing is a fault-based technique that relies on fault simulation: programs are modified to create faulty versions called mutants. Mutants are created by operators able to simulate specific, well-identified faults. A testing process must be able to signal the faults within mutants and thereby avoid ill behaviours in a program. We propose a set of mutation operators designed for Spark programs, characterized by a data flow and data processing operations. These operators model changes in the data flow and in the operations, simulating faults that take Spark program characteristics into account. We performed manual experiments to evaluate the proposed mutation operators in terms of cost and effectiveness, showing that mutation operators can contribute to the testing process and to the construction of reliable Spark programs.
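The core idea of mutation operators for data-flow programs can be illustrated with a minimal sketch. Plain Python callables stand in for Spark transformations here, and the operator shown (negating a filter predicate) is only an illustrative stand-in, not one of the paper's actual operators; all function names are hypothetical.

```python
# Minimal mutation-testing sketch for a data-flow pipeline.
# Plain Python lists stand in for Spark datasets (illustrative assumption).

def original_pipeline(data):
    # Keep even numbers, then square them (filter -> map style data flow).
    return [x * x for x in data if x % 2 == 0]

def mutant_pipeline(data):
    # Mutant: the filter predicate is negated, simulating a predicate fault.
    return [x * x for x in data if not (x % 2 == 0)]

def kills_mutant(test_input, expected):
    """A test 'kills' the mutant if the original program passes the test
    while the mutant's output differs from the expected result."""
    return (original_pipeline(test_input) == expected
            and mutant_pipeline(test_input) != expected)

data = [1, 2, 3, 4]
print(original_pipeline(data))      # [4, 16]
print(mutant_pipeline(data))        # [1, 9]
print(kills_mutant(data, [4, 16]))  # True: this test detects the fault
```

A test set that kills all (non-equivalent) mutants is considered adequate with respect to the simulated faults; mutants that survive point at untested behaviour.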
Citations
Book Chapter
Modeling Big Data Processing Programs
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
TL;DR: This model generalizes the data flow programming style implemented by systems such as Apache Spark, DryadLINQ, Apache Beam and Apache Flink; it uses Monoid Algebra to model operations over distributed, partitioned datasets and Petri Nets to represent the data/control flow.
Journal Article
TRANSMUT‐Spark: Transformation mutation for Apache Spark
TL;DR: TRANSMUT-Spark is a tool for mutation testing, a fault-based testing technique that relies on fault simulation to evaluate and design test sets, of big data processing code within Spark programs.
Posted Content
TRANSMUT-SPARK: Transformation Mutation for Apache Spark
Graham Curry, João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +4 more
TL;DR: TRANSMUT-Spark as mentioned in this paper is a tool that automates the mutation testing process of Big Data processing code within Spark programs, relying on fault simulation to evaluate and design test sets.
Posted Content
An Abstract View of Big Data Processing Programs.
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +3 more
TL;DR: In this paper, the authors propose a model for specifying data flow based parallel data processing programs agnostic of target Big Data processing frameworks, focusing on the formal abstract specification of non-iterative and iterative programs, generalizing the strategies adopted by data flow big data processing frameworks.
TRANSMUT-SPARK: Transformation Mutation for Apache Spark-Long Version
João Batista de Souza Neto, Anamaria Martins Moreira, Genoveva Vargas-Solar, Martin A. Musicante +4 more
References
Proceedings Article
Item-based collaborative filtering recommendation algorithms
TL;DR: This paper analyzes item-based collaborative filtering techniques and suggests that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms.
Proceedings Article
Spark: cluster computing with working sets
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Proceedings Article
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica +8 more
TL;DR: Resilient Distributed Datasets is presented, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner and is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.
Proceedings Article
Pig latin: a not-so-foreign language for data processing
TL;DR: A new language called Pig Latin is described, designed to fit in a sweet spot between the declarative style of SQL, and the low-level, procedural style of map-reduce, which is an open-source, Apache-incubator project, and available for general use.
Journal Article
An Analysis and Survey of the Development of Mutation Testing
Yue Jia, Mark Harman +1 more
TL;DR: These analyses provide evidence that mutation testing techniques and tools are reaching a state of maturity and applicability, while mutation testing itself is the subject of increasing interest.