Infrastructure-Aware Functional Testing of MapReduce Programs
Summary (3 min read)
INTRODUCTION
- The new trends in massive data processing have given rise to several technologies and processing models in the field known as Big Data Engineering [1].
- The framework that manages the infrastructure is also responsible for automatically deploying and running the program over several computers and for conducting the data processing between input and output.
- These faults are often masked during test execution because the tests usually run over a single infrastructure configuration, without considering the different situations that could occur in production, such as different parallelism levels or infrastructure failures [6].
- The main contribution of this paper is a technique to automatically generate the different infrastructure configurations for a MapReduce application.
- Each of these configurations is then executed in the test environment in order to detect functional faults of the program that may occur in production.
2. Automatic support by means of a test engine based on an MRUnit extension that automatically generates and executes the different configurations.
- In Section II the principles of the MapReduce paradigm are introduced.
- The generation of the different configurations, together with the execution and automation of the tests, is described in Section III.
- In Section V the related work about software testing in MapReduce paradigm is presented.
II. MAPREDUCE PARADIGM
- A MapReduce program processes large quantities of data over a distributed infrastructure.
- The final output is obtained by deploying and executing several instances of the Mapper and the Reducer, also called tasks, over a distributed infrastructure.
- The Mapper task receives a subset of temperature data and emits <year, temperature of this year> pairs.
- In MapReduce there are also other components, such as the Partitioner, which decides which Reducer analyses each <key, value> pair; Sort, which sorts the <key, value> pairs; and Group, which aggregates the values of each key before the Reducer.
- These faults are difficult to detect during testing because the test cases usually contain few input data.
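As a concrete illustration of the paradigm, the temperature example above can be simulated sequentially in plain Python (a minimal sketch: the function names and the single-process simulation are assumptions for illustration, while a real deployment runs the tasks in parallel over a cluster):

```python
from collections import defaultdict

def mapper(record):
    """Emit a <year, temperature> pair for each input record."""
    year, temperature = record
    yield (year, temperature)

def reducer(year, temperatures):
    """Aggregate all the temperatures of a year into their average."""
    yield (year, sum(temperatures) / len(temperatures))

def run_mapreduce(records):
    """Sequential simulation: map phase, group by key, reduce phase."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    output = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            output[out_key] = out_value
    return output

# run_mapreduce([(1999, 4), (1999, 2), (1999, 3)]) → {1999: 3.0}
```

The grouping step stands in for the Sort/Group phases between the Mapper and the Reducer.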
A. Generation of the test scenarios
- To illustrate how the infrastructure configuration affects the program output, suppose that the example of Section II is extended with a Combiner in order to decrease the data and improve the performance.
- The Combiner receives several temperatures and replaces them with their average in its output.
- The program does not admit a Combiner because all the temperatures are needed to obtain the total average temperature.
- Adding the Combiner as an optimization therefore injects a functional fault into the program.
- Fig. 2 represents three possible executions of this program that could occur in production considering the different infrastructure configurations and the same input (year 1999 with temperatures 4º, 2º and 3º).
- The first configuration consists of one Mapper, one Combiner and one Reducer, and produces the expected output.
- The second configuration also generates the expected output, executing one Mapper that processes the temperatures 4º and 2º, another Mapper for 3º, two Combiners, and finally one Reducer.
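The effect of these configurations can be reproduced with a minimal sketch, assuming the Combiner replaces each Mapper's temperatures by their local average and the Reducer averages the Combiner outputs (plain Python standing in for the MapReduce tasks):

```python
def average(values):
    return sum(values) / len(values)

def run_with_combiner(mapper_splits):
    """Each split is handled by one Mapper whose output goes through a
    Combiner that replaces the temperatures with their local average;
    the Reducer then averages the Combiner outputs."""
    combined = [average(split) for split in mapper_splits]
    return average(combined)

# Configuration 1: one Mapper → expected output
print(run_with_combiner([[4, 2, 3]]))    # 3.0
# Configuration 2: splits [4, 2] and [3] → coincidentally correct
print(run_with_combiner([[4, 2], [3]]))  # 3.0
# Configuration 3: splits [4] and [2, 3] → faulty output
print(run_with_combiner([[4], [2, 3]]))  # 3.25
```

The third split shows why an average of averages is not the overall average: the fault only surfaces under some infrastructure configurations.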
- In order to generate each one of the scenarios, a combinatorial technique [11] is proposed to combine the values of the different parameters that can modify the execution of the MapReduce program.
- The constraints considered in this paper are the following: 1. The values/combinations of the Mapper parameters depend on the input data, because there cannot be more tasks than input data.
- To illustrate how the parameters are combined and how the constraints are applied, suppose the program of Fig.
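One way of enumerating the Mapper-related part of these combinations, under the constraint that there cannot be more tasks than input records, is to generate every order-preserving split of the input (a sketch under that single constraint; the paper's actual generator also combines the Combiner and Reducer parameters):

```python
from itertools import combinations

def mapper_splits(data):
    """Enumerate the ways of distributing the input among 1..len(data)
    Mapper tasks, preserving order: a split is defined by the cut
    points between consecutive records, so by construction there can
    never be more tasks than input data."""
    n = len(data)
    for k in range(1, n + 1):              # k = number of Mapper tasks
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0,) + cuts + (n,)
            yield [data[bounds[i]:bounds[i + 1]] for i in range(k)]

# For [4, 2, 3]: [[4, 2, 3]], [[4], [2, 3]], [[4, 2], [3]], [[4], [2], [3]]
```

The three configurations of Fig. 2 are among the four splits generated for the temperature example.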
B. Execution of the test scenarios
- The previous section proposes a technique to generate scenarios that represent different infrastructure configurations according to the characteristics of the MapReduce processing.
- Among them is the ideal scenario, formed by one Mapper, one Combiner and one Reducer, which is the configuration usually executed in testing.
- Given a test case, the scenarios are generated according to the previous section and then iteratively executed and evaluated following the pseudocode of Fig. 3.
- Finally, if the test case contains the expected output, the output of the ideal scenario is also checked against the expected output (8), and a fault is detected when the two are not equivalent (9, 10).
- Finally, a third scenario is executed and produces 3.25º as output; this temperature is not equivalent to the 3º of the ideal scenario output, so a fault is detected.
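The evaluation loop can be sketched as follows, assuming (as in the pseudocode of Fig. 3) that the ideal scenario is executed first and every other scenario output is checked for equivalence against it; the helper names and the averaging `execute` function are illustrative, not the paper's implementation:

```python
def run_test_engine(scenarios, execute, expected=None):
    """Execute the ideal scenario (scenarios[0]), then compare every
    other scenario output against it; optionally also check the ideal
    output against an expected output when the test case provides one."""
    ideal_output = execute(scenarios[0])
    faults = []
    for scenario in scenarios[1:]:
        output = execute(scenario)
        if output != ideal_output:          # non-equivalent output => fault
            faults.append((scenario, output))
    if expected is not None and ideal_output != expected:
        faults.append((scenarios[0], ideal_output))
    return faults

def average(values):
    return sum(values) / len(values)

def execute(splits):
    """Illustrative execution: each split goes through a Combiner that
    averages it, and the Reducer averages the Combiner outputs."""
    return average([average(split) for split in splits])

scenarios = [[[4, 2, 3]],        # ideal: one Mapper
             [[4, 2], [3]],      # two Mappers, coincidentally correct
             [[4], [2, 3]]]      # two Mappers, reveals the fault
print(run_test_engine(scenarios, execute))  # [([[4], [2, 3]], 3.25)]
```

Note that no expected output is needed to reveal the fault: the disagreement between scenarios is enough.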
IV. CASE STUDY
- In order to evaluate the proposed approach, the authors use as a case study the MapReduce program described in the I8K|DQ-BigData framework [13].
- The output of the program is the data quality of each row, and the average of all rows.
- For the previous program, a test case is obtained using a specific MapReduce testing technique based on data flow [5].
- The second Mapper processes only row 2 but receives no other information about the mandatory columns or the data quality threshold, so this Mapper cannot emit any output.
- The usual test environments do not detect the fault because they execute only one scenario, which masks the fault.
VI. CONCLUSIONS
- A testing technique for MapReduce programs is introduced in this paper and automated as a test engine that reproduces the different infrastructure configurations for a given test case.
- Automatically, and without requiring an expected output, the test engine can detect functional faults specific to the MapReduce paradigm that are in general difficult to detect in the test/production environments.
- This approach is applied to a real program using a test case with few input data.
- The current approach is off-line because the tests are not carried out when the program is in production.
Frequently Asked Questions (12)
Q2. What is the way to test a mapreduce program?
In unit testing, JUnit [32] can be used together with mock tools, or directly the MRUnit library [10], which is adapted to the MapReduce paradigm.
Q3. What is the purpose of this paper?
In this paper a test engine is implemented by an MRUnit extension that automatically generates and executes the different infrastructure configurations that could occur in production.
Q4. What other approaches are used to obtain the test input data of MapReduce programs?
There are other approaches oriented to obtaining the test input data of MapReduce programs, such as [12], which employs data flow testing, and [29], which is based on a bacteriological algorithm.
Q5. What is the goal of the example?
Given a test input data, the goal is to generate the different infrastructure configurations, also called in this context scenarios.
Q6. How can the test input data be obtained?
The test input data can be obtained with a generic testing technique or one specifically designed for MapReduce, such as MRFlow [12].
Q7. What is the common type of fault produced when the data should reach the Reducer in a specific order?
One common type of fault is produced when the data should reach the Reducer in a specific order, but the parallel execution causes these data to arrive disordered.
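A hypothetical illustration of such an order-dependent fault (the Reducer below is an invented example, not taken from the paper): it assumes its values arrive in input order and takes the first one as the earliest reading of the year.

```python
def reducer_first(year, temperatures):
    """Faulty Reducer: assumes the values arrive in input order and
    takes the first one as the earliest reading of the year."""
    return (year, temperatures[0])

in_order = [4, 2, 3]      # 4º was measured first
disordered = [3, 4, 2]    # parallel execution may reorder the values

print(reducer_first(1999, in_order))    # (1999, 4) — expected
print(reducer_first(1999, disordered))  # (1999, 3) — functional fault
```

The fault only appears under configurations where the parallel execution delivers the values in a different order, which is exactly what the generated scenarios vary.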
Q8. What are the parameters of the Mapper-Combiner-Reducer?
In this work the following parameters are considered, based on previous work [5] that classifies different types of faults of MapReduce applications. Mapper parameters: (1) number of Mapper tasks, (2) inputs processed per each Mapper, and (3) data processing order of the inputs, that is, which data are processed before and which after in the Mapper.
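Combining such parameter values under constraints can be sketched with a Cartesian product followed by a constraint filter (the parameter domains and the single constraint below are assumptions for a 3-record input, not the paper's full parameter model):

```python
from itertools import product

# Hypothetical parameter domains for a 3-record input
num_mappers = [1, 2, 3, 4]
orders = ["input order", "inverse order"]
num_reducers = [1, 2]

def valid(mappers, order, reducers, n_records=3):
    """Constraint: there cannot be more tasks than input records."""
    return mappers <= n_records and reducers <= n_records

scenarios = [c for c in product(num_mappers, orders, num_reducers)
             if valid(*c)]
print(len(scenarios))  # 12: the 4 combinations with 4 Mappers are filtered out
```

Each surviving combination corresponds to one scenario to execute against the test input data.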
Q9. What is the main idea of the paper?
In this paper, given a test input data, several configurations are generated and then executed in order to reveal functional faults.
Q10. How did Csallner and Chen analyze the fault?
This fault was analysed by Csallner et al. [24] and Chen et al. [25] using testing techniques based on symbolic execution and model checking.
Q11. What is the definition of failure in the example of Section II?
The failure is produced whenever this infrastructure configuration is executed, regardless of the computer failures, slow net or others.
Q12. What is the problem with the second Mapper?
The second Mapper processes only row 2, but no other information about the mandatory columns or data quality threshold, so this Mapper cannot emit any output.