Proceedings ArticleDOI

Infrastructure-Aware Functional Testing of MapReduce Programs

TL;DR: A testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults.
Abstract: Programs that process a large volume of data generally run in a distributed and parallel architecture, such as the programs implemented in the processing model MapReduce. In these programs, developers can abstract the infrastructure where the program will run and focus on the functional issues. However, the infrastructure configuration and its state cause different parallel executions of the program and some could result in functional faults which are hard to reveal. In general, the infrastructure that executes the program is not considered during the testing, because the tests usually contain few input data and then the parallelization is not necessary. In this paper a testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults. This testing technique is automatized by using a test engine and is applied to a case study. As a result, several infrastructure configurations are automatically generated and executed for a test case revealing a functional fault that is then fixed by the developer.

Summary (3 min read)

INTRODUCTION

  • The new trends in massive data processing have brought to light several technologies and processing models in the field called Big Data Engineering [1].
  • Then the framework that manages the infrastructure is also responsible for automatically deploying and running the program over several computers and for leading the data processing from input to output.
  • These faults are often masked during the test execution because the tests usually run over an infrastructure configuration without considering the different situations that could occur in production, such as different parallelism levels or infrastructure failures [6].
  • The main contribution of this paper is a technique that can be used to automatically generate the different infrastructure configurations for a MapReduce application.
  • Then each one of the configurations is executed in the test environment in order to detect functional faults of the program that may occur in production.

2. Automatic support by means of a test engine based on MRUnit

  • In Section II the principles of the MapReduce paradigm are introduced.
  • The generation of the different configurations, the execution and the automatization of the tests are defined in Section III.
  • In Section V the related work about software testing in MapReduce paradigm is presented.

II. MAPREDUCE PARADIGM

  • A MapReduce program processes large quantities of data in a distributed infrastructure.
  • The final output is obtained from the deployment and the execution over a distributed infrastructure of several instances of Mapper and Reducer, also called tasks.
  • The Mapper task receives a subset of temperature data and emits <year, temperature of this year> pairs.
  • In MapReduce there are also other implementations such as for example Partitioner that decides for each <key, value> pair which Reducer analyses it, Sort that sorts the <key, value> pairs, and Group that aggregates the values of each key before the Reducer.
  • These faults are difficult to detect during testing because the test cases usually contain few input data.

A. Generation of the test scenarios

  • To illustrate how the infrastructure configuration affects the program output, suppose that the example of Section II is extended with a Combiner in order to decrease the data and improve the performance.
  • The Combiner receives several temperatures and replaces them with their average in its output.
  • The program does not admit a Combiner because all the temperatures are needed to obtain the total average temperature.
  • The error of adding the Combiner in order to optimize the program injects a functional fault in the program.
  • Fig. 2 represents three possible executions of this program that could occur in production considering the different infrastructure configurations and the same input (year 1999 with temperatures 4º, 2º and 3º).

  • The first configuration consists of one Mapper, one Combiner and one Reducer that produces the expected output.
  • The second configuration also generates the expected output executing one Mapper that processes the temperatures 4º and 2º, another Mapper for 3º, two Combiner, and finally one Reducer.
  • In order to generate each one of the scenarios, a combinatorial technique [11] is proposed to combine the values of the different parameters that can modify the execution of the MapReduce program.
  • The constraints considered in this paper are the following: 1. The values/combinations of the Mapper parameters depend on the input data because it is not possible to have more tasks than data items.
  • To illustrate how the parameters are combined and how the constraints are applied, suppose the program of Fig. 2.

B. Execution of the test scenarios

  • The previous section proposes a technique to generate scenarios that represent different infrastructure configurations according to the characteristics of the MapReduce processing.
  • This is the scenario formed by one Mapper, one Combiner and one Reducer which is the usual configuration executed in testing.
  • Finally, if the test case contains the expected output, the output of the ideal scenario is also checked against the expected output (8), detecting a fault when both are not equivalent (9, 10).
  • Given a test case, the scenarios are generated according to the previous section, then they are iteratively executed and evaluated following the pseudocode of Fig. 3.
  • Finally, a third scenario is executed and produces 3.25º as output; this temperature is not equivalent to the 3º of the ideal scenario output.

IV. CASE STUDY

  • In order to evaluate the proposed approach, the authors use as case study the MapReduce program described in the I8K|DQ-BigData framework [13].
  • The output of the program is the data quality of each row, and the average of all rows.
  • Over the previous program, a test case is obtained using a specific MapReduce testing technique based on data flow [5].
  • The second Mapper processes only row 2, but no other information about the mandatory columns or data quality threshold, so this Mapper cannot emit any output.
  • These environments do not detect the fault because they only execute one scenario that masks the fault.

VI. CONCLUSIONS

  • A testing technique for the MapReduce programs is introduced and automatized in this paper as a test engine that reproduces the different infrastructure configurations for a given test case.
  • Automatically and without an expected output, the test engine can detect functional faults specific to the MapReduce paradigm that are in general difficult to detect in the test/production environments.
  • This approach is applied in a real program using a test case with few data.
  • The current approach is off-line because the tests are not carried out when the program is in production.


This paper is a post-print of a paper accepted at the “International Conference on Future Internet of Things and Cloud (FiCloud), 2016”.
The final version of this paper is available through IEEE Xplore at the following link:
http://ieeexplore.ieee.org/document/7592719/
J. Morán, B. Rivas, C. De La Riva, J. Tuya, I. Caballero and M. Serrano, "Infrastructure-Aware
Functional Testing of MapReduce Programs," 2016 IEEE 4th International Conference on Future
Internet of Things and Cloud Workshops (FiCloudW), Vienna, 2016, pp. 171-176. doi:
10.1109/W-FiCloud.2016.45
IEEE copyright notice. © 2016 IEEE. Personal use of this material is permitted. Permission from
IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works.

Infrastructure-Aware Functional Testing of
MapReduce programs
Jesús Morán
Department of Computing
University of Oviedo
Gijón, Spain
moranjesus@lsi.uniovi.es
Bibiano Rivas
Institute of Technology
and Information Systems
University of Castilla-La
Mancha
Ciudad Real, Spain
Bibiano.Rivas@uclm.es
Claudio de la Riva, Javier Tuya
Department of Computing
University of Oviedo
Gijón, Spain
{claudio, tuya}@uniovi.es
Ismael Caballero, Manuel Serrano
Institute of Technology and
Information Systems
University of Castilla-La Mancha
Ciudad Real, Spain
{Ismael.Caballero,
Manuel.Serrano}@uclm.es
Abstract: Programs that process a large volume of data
generally run in a distributed and parallel architecture, such as
the programs implemented in the processing model MapReduce.
In these programs, developers can abstract the infrastructure
where the program will run and focus on the functional issues.
However, the infrastructure configuration and its state cause
different parallel executions of the program and some could result in functional faults which are hard to reveal. In general,
the infrastructure that executes the program is not considered
during the testing, because the tests usually contain few input
data and then the parallelization is not necessary. In this paper a
testing technique is proposed to generate different infrastructure
configurations for a given test input data, and then the program
is executed in these configurations in order to reveal functional
faults. This testing technique is automatized by using a test
engine and applied in a case study. As a result, several
infrastructure configurations are automatically generated and
executed for a test case revealing a functional fault that is then
fixed by the developer.
Keywords: Software testing, MapReduce programs, Big Data
Engineering, Hadoop
I. INTRODUCTION
The new trends in massive data processing have brought to
light several technologies and processing models in the field
called Big Data Engineering [1]. Among them, MapReduce [2]
can be highlighted as it permits the analysis of large data based
on the “divide and conquer” principle. These programs run two
phases in a distributed infrastructure: the Mapper phase divides
the problem into several subproblems, and then the Reducer
phase solves each subproblem. Usually, MapReduce programs
run on several computers with heterogeneous resources and
features. This complex infrastructure is managed by a
framework, such as Hadoop [3] which stands out due to its
wide use in the industry [4].
From the developer point of view, a MapReduce program
can be implemented only with Mapper and Reducer, without
any consideration about the infrastructure. Then the framework
that manages the infrastructure is also responsible for automatically deploying and running the program over several computers and for leading the data processing from input to output. Among other tasks, the framework divides the input into
several subsets of data, then processes each one in parallel and
re-runs some parts of the program if necessary.
Despite the fact that the program can be implemented abstracting away the infrastructure, the developer needs to consider
how the infrastructure configuration could affect the program
functionality. A previous work [5] detects and classifies several
faults that depend on how the infrastructure configuration
affects the program execution and produces different output.
These faults are often masked during the test execution because
the tests usually run over an infrastructure configuration
without considering the different situations that could occur in
production, such as different parallelism levels or infrastructure failures [6]. On the other hand, if the tests are executed in an environment similar to production, some faults may not be detected because the test inputs commonly contain few data, which means that Hadoop does not parallelize the program execution. There are some tools to
enable the simulation for some of these situations (for example
computer and net failures) [7, 8, 9], but it is difficult to design,
generate and execute the tests in a deterministic way because
there are many elements that need fine-grained simulation,
including the infrastructure and framework.
The main contribution of this paper is a technique that can
be used to automatically generate the different infrastructure configurations for a MapReduce application. The goal is to
execute test cases with these configurations in order to reveal
functional faults. Given a test input data, the configurations are
obtained based on the different executions that can happen in
production. Then each one of the configurations is executed in
the test environment in order to detect functional faults of the
program that may occur in production. The contributions of
this work are:
1. A combinatorial technique to generate the different
infrastructure configurations, taking into account
characteristics related to the MapReduce processing and
the test input data.
2. Automatic support by means of a test engine based on
MRUnit [10] that allows the execution of the
infrastructure configurations, together with the
evaluation to detect failures.

The rest of the paper is organized as follows. In Section II the
principles of the MapReduce paradigm are introduced. The
generation of the different configurations, the execution and the
automatization of the tests are defined in Section III. In Section
IV it is applied to a case study. In Section V the related work
about software testing in MapReduce paradigm is presented.
The paper ends with conclusions and future work in Section
VI.
II. MAPREDUCE PARADIGM
A MapReduce program processes large quantities of data
in a distributed infrastructure. The developer implements two
functionalities: the Mapper task, which splits the problem into several subproblems, and the Reducer task, which solves these subproblems.
The final output is obtained from the deployment and the
execution over a distributed infrastructure of several instances
of Mapper and Reducer, also called tasks. The deployment and
execution are automatically carried out by Hadoop or another
framework. First, several Mapper tasks analyse in parallel a subset of the input data and determine which subproblem each record belongs to. When all the Mappers have finished,
several Reducers are also executed in parallel in order to solve
the subproblems. Internally MapReduce handles <key, value>
pairs, where the key is the subproblem identifier and the value
contains the information to solve it.
To illustrate MapReduce let us suppose a program that
computes the average temperature per year from historical data
about temperatures. This program solves one subproblem for
each year, so the identifier or key is the year. The Mapper task
receives a subset of temperature data and emits <year,
temperature of this year> pairs. Then Hadoop aggregates all
values per key. Therefore, the Reducer tasks receive
subproblems like <year, [all temperatures of this year]>, that is
all temperatures grouped per year. Finally, the Reducer
calculates the average temperature. For example, Fig. 1 details an execution of the program with the following input: year 2000 with 3º, 2002 with 4º, 2000 with 1º, and 2001 with 5º.
The first two inputs are analysed in one Mapper task and the
remainder in another task. Then the temperatures are grouped
per year and sent to the Reducer tasks. The first Reducer
receives all the temperatures for the years 2000 and 2002, and
the other task those for the year 2001. Finally, each Reducer emits the average temperature of the analysed subproblems: 2º in the year 2000, 4º in 2002 and 5º in 2001. This program with the
same input could be executed in another way by the
framework, for example with three Mappers and three
Reducers. Regardless of how the framework runs the program,
it should generate the expected output.
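To make the running example concrete, a minimal sketch of such a program with the Hadoop Java API is shown below. It assumes input lines of the form "year,temperature"; the class names and the parsing are illustrative and are not the code used by the authors.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits one <year, temperature> pair per input line ("year,temperature").
class AverageTemperatureMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        context.write(new Text(fields[0].trim()),
                      new DoubleWritable(Double.parseDouble(fields[1].trim())));
    }
}

// Reducer: receives <year, [all temperatures of this year]> and emits their average.
class AverageTemperatureReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable t : temperatures) {
            sum += t.get();
            count++;
        }
        context.write(year, new DoubleWritable(sum / count));
    }
}

With the input of Fig. 1, a job driver (not shown) would configure these two classes, and the framework would decide how many Mapper and Reducer tasks actually run.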
Additionally, to optimize the program, a Combiner
functionality can be implemented. This task is run after the
Mapper and the goal is to remove the irrelevant <key, value>
pairs to solve the subproblem. In MapReduce there are also
other implementations such as for example Partitioner that
decides for each <key, value> pair which Reducer analyses it,
Sort that sorts the <key, value> pairs, and Group that
aggregates the values of each key before the Reducer.
The wrong implementation of these functionalities could
cause a failure in one of the different ways in which Hadoop
can run the program. These faults are difficult to detect during
testing because the test cases usually contain few input data. In that case it is not necessary to split the input, and therefore the execution uses one Mapper, one Combiner and one Reducer [2].
III. GENERATION AND EXECUTION OF TESTS
The generation of the infrastructure configurations for the
tests is defined in Section A, and a framework to execute the
tests in Section B.
A. Generation of the test scenarios
To illustrate how the infrastructure configuration affects the
program output, suppose that the example of Section II is
extended with a Combiner in order to decrease the data and
improve the performance. The Combiner receives several temperatures and replaces them with their average in its output. In this case, the program does not admit a
Combiner because all the temperatures are needed to obtain the
total average temperature. The error of adding the Combiner in
order to optimize the program injects a functional fault in the
program. Fig. 2 represents three possible executions of this
program that could occur in production considering the
different infrastructure configurations and the same input (year
1999 with temperatures 4º, 2º and 3º).
The first configuration consists of one Mapper, one Combiner and one Reducer, and produces the expected output. The second configuration also generates the expected output, executing one Mapper that processes the temperatures 4º and 2º, another Mapper for 3º, two Combiners, and finally one Reducer. The third configuration also executes two Mappers, two Combiners and one Reducer, but produces an unexpected output because the first Mapper processes 4º and the second Mapper the temperatures 2º and 3º. Then one of the Combiner tasks calculates the average of 4º, and the other Combiner the average of 2º and 3º. The Reducer receives the previous averages (4º and 2.5º) and calculates the total average of the year.
Fig. 1. Program that calculates the average temperature per year
Fig. 2. Different infrastructure configurations for a program that calculates the average temperature per year with Combiner task

This configuration produces 3.25º as output instead of the 3º of the expected output. The program has a functional fault only
detected in the third configuration. The failure is produced
whenever this infrastructure configuration is executed,
regardless of computer failures, slow networks or other conditions. This fault is difficult to reveal because the test case needs to be executed in the infrastructure configuration that detects it, and in a completely controlled way.
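As an illustration of the fault discussed above (sketch code written for this text, not the authors' implementation), the faulty Combiner can be written as a Reducer-style class that replaces the temperatures seen by each Combiner task with their partial average; the weight of each partial average is lost, so the final result depends on how the framework splits the data among Combiner tasks.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Faulty Combiner: emits the average of the values it happens to receive, discarding
// how many values were averaged. Averages of partial averages are only correct when
// every Combiner task sees the same number of values.
class AverageTemperatureCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable t : temperatures) {
            sum += t.get();
            count++;
        }
        context.write(year, new DoubleWritable(sum / count));
    }
}

With a single Combiner over 4º, 2º and 3º the Reducer still receives 3º, but with the split of the third configuration it receives 4º and 2.5º and outputs 3.25º.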
Given a test input data, the goal is to generate the different
infrastructure configurations, also called in this context
scenarios. For this purpose, the technique proposed considers
how the MapReduce program can execute these input data in
production. First, the program runs the Mappers, then over
their outputs the Combiners and finally the Reducers. The
execution can be carried out over a different number of
computers and therefore the Mapper-Combiner-Reducer can
analyse a different subset of data in each execution. In order to
generate each one of the scenarios, a combinatorial technique
[11] is proposed to combine the values of the different
parameters that can modify the execution of the MapReduce
program. In this work the following parameters are considered
based on previous work [5] that classifies different types of
faults of the MapReduce applications:
Mapper parameters: (1) Number of Mapper tasks, (2)
Inputs processed per each Mapper, and (3) Data
processing order of the inputs, that is, which data are
processed before other data in the Mapper and which
data are processed after.
Combiner parameters for each Mapper output: (1)
Number of Combiner tasks, and (2) Inputs processed
per each Combiner.
Reducer parameters: (1) Number of Reducer tasks, and
(2) Inputs processed per each Reducer.
The different scenarios are obtained through the combination
of all values that can take the above parameters and applying
the constraints imposed by the sequential execution of
MapReduce. The constraints considered in this paper are the
following:
1. The values/combinations of the Mapper parameters depend on the input data because it is not possible to have more tasks than data items. For example, if there are three data items in the input, the maximum number of Mappers is three.
2. The values/combinations of the Combiner parameters
depend on the output of the Mapper tasks.
3. The values/combinations of the Reducer parameters
depend on the output of the Mapper-Combiner tasks
and on another functionality executed by Hadoop before the Reducer tasks. This functionality is called Shuffle
and for each <key, value> pair determines the Reducer
task that requires these data, then sorts all the data and
aggregates by key.
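A simplified sketch of this Shuffle step is shown below (illustrative Java, not Hadoop's internal implementation): each <key, value> pair is assigned to a Reducer task with a hash-based partitioning similar in spirit to Hadoop's default Partitioner, and the values of each key are grouped in sorted key order before being handed to that Reducer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified Shuffle: partitions <key, value> pairs among Reducer tasks and groups
// the values of each key in sorted key order.
class SimplifiedShuffle {
    static Map<Integer, SortedMap<String, List<Double>>> shuffle(
            List<Map.Entry<String, Double>> mapperOutput, int numReducers) {
        Map<Integer, SortedMap<String, List<Double>>> perReducer = new HashMap<>();
        for (Map.Entry<String, Double> pair : mapperOutput) {
            // Hash-based assignment of the key to one of the Reducer tasks.
            int reducer = Math.floorMod(pair.getKey().hashCode(), numReducers);
            perReducer.computeIfAbsent(reducer, r -> new TreeMap<>())
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return perReducer;
    }
}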
To illustrate how the parameters are combined and how the
constraints are applied, suppose the program of Fig. 2. The
input of this program contains three data items, and these data
constrain the values that the Mapper parameters can take
because the maximum number of Mapper tasks is three (one
Mapper per each <key, value> pair). The first scenario is
generated with one Mapper, one Combiner and one Reducer.
For the second scenario the parameter “Number of Mapper
tasks” is modified to 2, where the first Mapper analyses two
<key, value> pairs, and the second processes one pair. The
third scenario maintains the parameter “Number of Mapper
tasks” at 2, but modifies the parameter “Inputs processed per each Mapper”, so the first Mapper analyses one <key, value>
pair and the other Mapper processes two pairs. The scenarios
are generated by the modification of the values in the
parameters in this way and considering the constraints.
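A minimal sketch of this combinatorial generation is shown below for the Mapper parameters only, assuming for simplicity that each Mapper receives a contiguous, non-empty group of the test input records; the class is hypothetical and deliberately ignores the data processing order and the Combiner and Reducer parameters described above.

import java.util.ArrayList;
import java.util.List;

// Enumerates every way of splitting the test input records into contiguous,
// non-empty groups, one group per Mapper task.
public class MapperScenarioGenerator {

    public static <T> List<List<List<T>>> mapperSplits(List<T> records) {
        List<List<List<T>>> scenarios = new ArrayList<>();
        int n = records.size();
        // Each bitmask over the n-1 "gaps" between records decides where a new Mapper starts.
        for (int mask = 0; mask < (1 << Math.max(0, n - 1)); mask++) {
            List<List<T>> groups = new ArrayList<>();
            List<T> current = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                current.add(records.get(i));
                boolean splitHere = i < n - 1 && ((mask >> i) & 1) == 1;
                if (splitHere || i == n - 1) {
                    groups.add(current);
                    current = new ArrayList<>();
                }
            }
            scenarios.add(groups);
        }
        return scenarios;
    }

    public static void main(String[] args) {
        // The three <1999, temperature> pairs of the running example.
        List<String> input = List.of("<1999, 4º>", "<1999, 2º>", "<1999, 3º>");
        mapperSplits(input).forEach(s -> System.out.println(s.size() + " Mapper(s): " + s));
    }
}

For the three pairs of the running example this produces four splits, ranging from one Mapper with all the data to three Mappers with one pair each, which respects constraint 1.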
B. Execution of the test scenarios
The previous section proposes a technique to generate
scenarios that represent different infrastructure configurations
according to the characteristics of the MapReduce processing.
Fig. 3 describes a framework to systematically execute the tests
with the scenarios generated by the technique of the previous
section.
The framework takes as input a test case that contains the
input data and optionally the expected output. The test input data can be obtained with a generic testing technique or one specifically designed for MapReduce, such as MRFlow [12].
Input: Test case with:
    - input data
    - expected output (optional)
Output: scenario that reveals a fault
(0)  /* Generation of scenarios (Section A) */
(1)  Scenarios ← Generate scenarios from input data
(2)  /* Execution of scenarios */
(3)  ideal scenario output ← Execution of ideal scenario
(4)  FOR EACH scenario IN Scenarios:
(5)      scenario output ← Execution of scenario
(6)      IF scenario output <> ideal scenario output:
(7)          RETURN scenario with fault
(8)  IF ideal scenario output <> expected output:
(9)      RETURN ideal scenario
(10) ELSE:
(11)     RETURN Zero faults detected
Fig. 3. a) General framework of test execution; b) Algorithm for test generation and execution of test scenarios

Then, the ideal scenario is generated (1) and executed (2, 3).
This is the scenario formed by one Mapper, one Combiner and
one Reducer which is the usual configuration executed in
testing. Next, new scenarios are iteratively generated (4) and
executed (5) through the technique of the previous section. The
output of each scenario is checked against the output of the
ideal scenario (6), revealing a fault if the outputs are not
equivalent (7). Finally, if the test case contains the expected
output, the output of ideal scenario is also checked against the
expected output (8), detecting a fault when both are not
equivalent (9, 10).
Given a test case, the scenarios are generated according to
the previous section, then they are iteratively executed and
evaluated following the pseudocode of Fig. 3. For example,
Fig. 2 contains the generation and execution of a program that
calculates the average temperature per year in three scenarios
considering the same test input: year 1999 with temperatures
4º, 2º and 3º. The first execution is the ideal scenario with one Mapper, one Combiner and one Reducer, which produces 3º as output. Then the second scenario is executed and also produces 3º. Finally, a third scenario is executed and produces 3.25º as output; this temperature is not equivalent to the 3º of the ideal scenario output. Consequently, a functional fault is revealed
without any knowledge of the expected output of the test case.
This approach is automatized by means of a test engine
based on the MRUnit library [10]. This library is used to execute
each scenario. In MRUnit the test cases are executed in the
ideal scenario, but this library is extended to generate other
scenarios and enable parallelism supporting the execution of
several Mapper, Combiner and Reducer tasks.
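That extension is not part of the standard MRUnit distribution, so the sketch below only shows the baseline it could build on: executing the ideal scenario with the MRUnit MapReduceDriver and keeping its output for comparison with the alternative scenarios. The AverageTemperature* classes are the hypothetical ones sketched earlier, not code published with the paper.

import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;

// Baseline execution of the ideal scenario (one Mapper, one Combiner, one Reducer).
// A test engine in the spirit of the paper would re-run the same input under the
// other generated scenarios and compare each output against idealOutput.
public class IdealScenarioExecution {
    public static void main(String[] args) throws Exception {
        MapReduceDriver<LongWritable, Text, Text, DoubleWritable, Text, DoubleWritable> driver =
                new MapReduceDriver<>();
        driver.withMapper(new AverageTemperatureMapper())
              .withCombiner(new AverageTemperatureCombiner())
              .withReducer(new AverageTemperatureReducer());

        // Same test input as Fig. 2: year 1999 with temperatures 4º, 2º and 3º.
        driver.withInput(new LongWritable(1), new Text("1999,4"));
        driver.withInput(new LongWritable(2), new Text("1999,2"));
        driver.withInput(new LongWritable(3), new Text("1999,3"));

        List<Pair<Text, DoubleWritable>> idealOutput = driver.run();
        System.out.println("Ideal scenario output: " + idealOutput);
        // Each alternative scenario would then be executed and its output compared
        // for equivalence with idealOutput, reporting the scenario on any mismatch.
    }
}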
IV. CASE STUDY
In order to evaluate the proposed approach, we use as case
study the MapReduce program described in the I8K|DQ-BigData
framework [13]. This program measures the quality of the data
exchanged between organizations according to part 140 of the
ISO/TS 8000 [14]. The program receives (1) the data
exchanged in a row-column fashion, together with (2) a set of
mandatory columns that should contain data and (3) a
percentage threshold that divides the data quality of each row
in two parts: the first part is maximum if all mandatory
columns contain data and zero otherwise, and the second part
of the data quality is calculated as the percentage of the non-
mandatory columns that contain data. The output of the
program is the data quality of each row, and the average of all
rows.
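One possible reading of this quality rule, written as plain Java for illustration (the I8K|DQ-BigData program itself is a MapReduce job and its code is not reproduced here, so names and signatures are hypothetical), is the following:

import java.util.List;
import java.util.Set;

// Per-row quality: the threshold percentage is granted only if all mandatory columns
// contain data; the remaining percentage is the fraction of non-mandatory columns
// that contain data.
public class RowQuality {

    public static double score(List<String> columnNames, List<String> values,
                               Set<String> mandatoryColumns, double thresholdPercent) {
        int nonMandatoryTotal = 0;
        int nonMandatoryFilled = 0;
        boolean allMandatoryFilled = true;
        for (int i = 0; i < columnNames.size(); i++) {
            boolean hasData = values.get(i) != null && !values.get(i).isEmpty();
            if (mandatoryColumns.contains(columnNames.get(i))) {
                allMandatoryFilled &= hasData;
            } else {
                nonMandatoryTotal++;
                if (hasData) nonMandatoryFilled++;
            }
        }
        double mandatoryPart = allMandatoryFilled ? thresholdPercent : 0.0;
        double optionalPart = nonMandatoryTotal == 0 ? (100.0 - thresholdPercent)
                : (100.0 - thresholdPercent) * nonMandatoryFilled / nonMandatoryTotal;
        return mandatoryPart + optionalPart;
    }

    public static void main(String[] args) {
        Set<String> mandatory = Set.of("Name");
        // A row where only "Name" has data -> 50%.
        System.out.println(score(List.of("Name", "City"), List.of("Alice", ""), mandatory, 50.0));
        // A row where both columns have data -> 100%.
        System.out.println(score(List.of("Name", "City"), List.of("Bob", "Vienna"), mandatory, 50.0));
    }
}

With the threshold of 50% and "Name" as the only mandatory column, this reading yields 50% for a row with an empty City and 100% for a fully filled row, consistent with the test case described next.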
Over the previous program, a test case is obtained using a
specific MapReduce testing technique based on data flow [5].
The test input data and the expected output of the test case
contain two rows represented in Table I. Row 1 contains two
columns (Name and City), and only one column has data, so
the data quality is 50%. Row 2 contains data in all columns, so
the data quality is 100%. The total quality is 75%, which is the
average of both rows.
The procedure described in Section III is applied on the
previous program using the previous test case as input. As a
result, a fault is detected and reported to the developer. This
failure occurs when the rows are processed in different
Mappers and only the first Mapper receives the information
related to the mandatory columns and the data quality
threshold, because Hadoop splits the input data into several
subsets. Without this information, the Mapper cannot calculate
the data quality and does not emit any output. The bottom of
Fig. 4 represents the scenario that produces the failure. There
are two Mappers that process different rows. The first Mapper
receives the data quality threshold (value of 50%), the
mandatory column (“Name”) and the two columns of row 1
with only data in one column, so the Mapper emits 50% as data
quality of row 1. The second Mapper processes only row 2, but receives no information about the mandatory columns or the data quality threshold, so this Mapper cannot emit any output. Then
the Reducer receives only the data quality of row 1 and emits
an incorrect output of the average data quality.
This fault is difficult to detect because it implies the parallel
and controlled execution of the program. Moreover, this fault is
not revealed by the execution of the test case in the following
environments: (a) a Hadoop cluster in production with 4 computers, (b) Hadoop in local mode (a simple version of Hadoop with one computer), and (c) the MRUnit unit testing library. These
environments do not detect the fault because they only execute
one scenario that masks the fault. Normally these
environments run the program in the ideal scenario that is
formed by one Mapper, one Combiner and one Reducer, and
then the fault is masked due to a lack of parallelism.
The test engine proposed in this paper executes the test case
in the different scenarios that can occur in production with
large data and infrastructure failures. In contrast with the other
Fig. 4. Execution of the test case in different scenarios
TABLE I. TEST CASE OF THE I8K|DQ-BIGDATA PROGRAM

Input                                            Expected output
Data quality threshold: 50%
Mandatory columns: "Name"
Row 1:  Name: Alice,  City: (no data)            50%
Row 2:  Name: Bob,    City: Vienna               100%
                                                 75% (average)

Citations
Journal ArticleDOI
TL;DR: New testing techniques that aimed to detect design faults by simulating different infrastructure configurations that as whole are more likely to reveal failures using random testing, and partition testing together with combinatorial testing are proposed.
Abstract: New processing models are being adopted in Big Data engineering to overcome the limitations of traditional technology. Among them, MapReduce stands out by allowing for the processing of large volumes of data over a distributed infrastructure that can change during runtime. The developer only designs the functionality of the program and its execution is managed by a distributed system. As a consequence, a program can behave differently at each execution because it is automatically adapted to the resources available at each moment. Therefore, when the program has a design fault, this could be revealed in some executions and masked in others. However, during testing, these faults are usually masked because the test infrastructure is stable, and they are only revealed in production because the environment is more aggressive with infrastructure failures, among other reasons. This paper proposes new testing techniques that aimed to detect these design faults by simulating different infrastructure configurations. The testing techniques generate a representative set of infrastructure configurations that as whole are more likely to reveal failures using random testing, and partition testing together with combinatorial testing. The techniques are automated by using a test execution engine called MRTest that is able to detect these faults using only the test input data, regardless of the expected output. Our empirical evaluation shows that MRTest can automatically detect these design faults within a reasonable time.

13 citations


Cites background from "Infrastructure-Aware Functional Tes..."

  • ...ture configurations that could occur in production (all potential configurations) [25], [26]....


Proceedings ArticleDOI
25 Jul 2017
TL;DR: This work proposes an automatic test framework implementing a novel testing approach called Ex Vivo that can identify a fault in a few seconds, then the program can be stopped, not only avoiding an incorrect output, but also saving money, time and energy of production resources.
Abstract: Big Data programs are those that process large data exceeding the capabilities of traditional technologies. Among newly proposed processing models, MapReduce stands out as it allows the analysis of schema-less data in large distributed environments with frequent infrastructure failures. Functional faults in MapReduce are hard to detect in a testing/preproduction environment due to its distributed characteristics. We propose an automatic test framework implementing a novel testing approach called Ex Vivo. The framework employs data from production but executes the tests in a laboratory to avoid side-effects on the application. Faults are detected automatically without human intervention by checking if the same data would generate different outputs with different infrastructure configurations. The framework (MrExist) is validated with a real-world program. MrExist can identify a fault in a few seconds, then the program can be stopped, not only avoiding an incorrect output, but also saving money, time and energy of production resources.

9 citations


Cites methods from "Infrastructure-Aware Functional Tes..."

  • ...Finally, the testing is performed using a specific MapReduce testing technique [22], [34] that only needs the test input data and the program to detect functional faults (6)....


  • ...[34] designed an automatic testing technique based on combinatorics and simulation....


Posted Content
TL;DR: In this paper, the authors conducted a systematic review of the Big Data testing techniques period (2010 - 2021) and discussed the processing of testing data by highlighting the techniques used in every processing phase.
Abstract: Big Data is reforming many industrial domains by providing decision support through analyzing large volumes of data. Big Data testing aims to ensure that Big Data systems run smoothly and error-free while maintaining the performance and quality of data. However, because of the diversity and complexity of data, testing Big Data is challenging. Though numerous researches deal with Big Data testing, a comprehensive review to address testing techniques and challenges is not conflate yet. Therefore, we have conducted a systematic review of the Big Data testing techniques period (2010 - 2021). This paper discusses the processing of testing data by highlighting the techniques used in every processing phase. Furthermore, we discuss the challenges and future directions. Our finding shows that diverse functional, non-functional and combined (functional and non-functional) testing techniques have been used to solve specific problems related to Big Data. At the same time, most of the testing challenges have been faced during the MapReduce validation phase. In addition, the combinatorial testing technique is one of the most applied techniques in combination with other techniques (i.e., random testing, mutation testing, input space partitioning and equivalence testing) to solve various functional faults challenges faced during Big Data testing.
Proceedings ArticleDOI
10 Dec 2018
TL;DR: A framework for effectively generating method-level tests to facilitate debugging of big data applications by running a big data application with the original dataset and recording the inputs to a small number of method executions that preserve certain code coverage is introduced.
Abstract: When a failure occurs in a big data application, debugging with the original dataset can be difficult due to the large amount of data being processed. This paper introduces a framework for effectively generating method-level tests to facilitate debugging of big data applications. This is achieved by running a big data application with the original dataset and by recording the inputs to a small number of method executions, which we refer to as method-level tests, that preserve certain code coverage, e.g., edge coverage. The size of each method-level test is further reduced if needed, while maintaining code coverage. When debugging, a developer could inspect the execution of these method-level tests, instead of the entire program execution with the original dataset. We applied the framework to seven algorithms in the WEKA tool. The initial results show that in many cases a small number of method-level tests are sufficient to preserve code coverage. Furthermore, these tests could kill between 57.58% to 91.43% of the mutants generated using a mutation testing tool. This suggests that the framework could significantly reduce the efforts required for debugging big data applications.

Cites background or methods from "Infrastructure-Aware Functional Tes..."

  • ...For example, data mining and machine learning methods are used to reduce the size of the original dataset or generate synthetic datasets [3, 4] for the testing purpose....


  • ...Previous work reported in [1, 2, 3, 4, 5] focuses on generating tests that help to identify functional faults, i....


  • ...also proposed a technique to generate different infrastructure configurations for a given MapReduce program that can be used to reveal functional faults [4]....


  • ...Some approaches have been proposed to reduce the effort required for testing and debugging big data applications at the system level [1, 2, 3, 4, 5]....


References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Proceedings ArticleDOI
23 May 2007
TL;DR: A consistent roadmap of the most relevant challenges to be addressed in software testing research is proposed, constituted by some important past achievements, while the destination consists of four identified goals to which research ultimately tends, but which remain as unreachable as dreams.
Abstract: Software engineering comprehends several disciplines devoted to prevent and remedy malfunctions and to warrant adequate behaviour. Testing, the subject of this paper, is a widespread validation approach in industry, but it is still largely ad hoc, expensive, and unpredictably effective. Indeed, software testing is a broad term encompassing a variety of activities along the development cycle and beyond, aimed at different goals. Hence, software testing research faces a collection of challenges. A consistent roadmap of the most relevant challenges to be addressed is here proposed. In it, the starting point is constituted by some important past achievements, while the destination consists of four identified goals to which research ultimately tends, but which remain as unreachable as dreams. The routes from the achievements to the dreams are paved by the outstanding research challenges, which are discussed in the paper along with interesting ongoing work.

834 citations


"Infrastructure-Aware Functional Tes..." refers methods in this paper

  • ...Despite the testing challenges of the Big Data applications [15, 16] and the progresses in the testing techniques [17], little effort is focused on testing the MapReduce programs [18], one of the principal paradigms of Big Data [19]....


Proceedings ArticleDOI
10 Jun 2010
TL;DR: This paper is the first attempt to study server failures and hardware repairs for large datacenters and presents a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors.
Abstract: Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors etc., each of which while carefully engineered are capable of failing. While the probability of seeing any such failure in the lifetime (typically 3-5 years in industry) of a server can be somewhat small, these numbers get magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than an exception.Hardware failure can lead to a degradation in performance to end-users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience by not only allowing us to be better equipped to tolerate failures but also to bring down the hardware cost through engineering, directly leading to a saving for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.

518 citations


"Infrastructure-Aware Functional Tes..." refers background in this paper

  • ...These faults are often masked during the test execution because the tests usually run over an infrastructure configuration without considering the different situations that could occur in production, as for example different parallelism levels or the infrastructure failures [6]....


Journal ArticleDOI
TL;DR: This survey describes the basic algorithms used by the combination strategies and includes a subsumption hierarchy that attempts to relate the various coverage criteria associated with the identified combination strategies.
Abstract: Combination strategies are test-case selection methods where test cases are identifled by combining values of the difierent test object input parameters based on some combinatorial strategy. This survey presents 15 difierent combination strategies, and covers more than 30 papers that focus on one or several combination strategies. We believe this collection represents most of the existing work performed on combination strategies. This survey describes the basic algorithms used by the combination strategies. Some properties of combination strategies, including coverage criteria and theoretical bounds on the size of test suites, are also included in this description. This survey paper also includes a subsumption hierarchy that attempts to relate the various coverage criteria associated with the identifled combination strategies. Finally, this survey contains short summaries of all the papers that cover combination strategies.

442 citations

Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "Infrastructure-aware functional testing of mapreduce programs"?

In this paper a testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults. This testing technique is automatized by using a test engine and applied in a case study. 

In unit testing, JUnit [32] could be used together with mock tools, or directly the MRUnit library [10], which is adapted to the MapReduce paradigm.

In this paper a test engine is implemented by an MRUnit extension that automatically generates and executes the different infrastructure configurations that could occur in production. 

There are other approaches oriented to obtain the test input data of MapReduce programs, such as [12] that employs data flow testing and [29] based on a bacteriological algorithm.

Given a test input data, the goal is to generate the different infrastructure configurations, also called in this context scenarios. 

The test input data can be obtained with a generic testing technique or one specifically designed for MapReduce, such as MRFlow [12].

One common type of fault is produced when the data should reach the Reducer in a specific order, but the parallel execution causes these data to arrive disordered. 

In this work the following parameters are considered based on previous work [5] that classifies different types of faults of the MapReduce applications: Mapper parameters: (1) Number of Mapper tasks, (2) Inputs processed per each Mapper, and (3) Data processing order of the inputs, that is, which data are processed before other data in the Mapper and which data are processed after. 

In this paper, given a test input data, several configurations are generated and then executed in order to reveal functional faults. 

This fault was analysed by Csallner et al. [24] and Chen et al. [25] using some testing techniques based on symbolic execution and model checking. 

The failure is produced whenever this infrastructure configuration is executed, regardless of the computer failures, slow net or others. 

The second Mapper processes only row 2, but no other information about the mandatory columns or data quality threshold, so this Mapper cannot emit any output.