Proceedings ArticleDOI

Infrastructure-Aware Functional Testing of MapReduce Programs

TL;DR: A testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults.
Abstract: Programs that process a large volume of data generally run in a distributed and parallel architecture, such as the programs implemented in the processing model MapReduce. In these programs, developers can abstract the infrastructure where the program will run and focus on the functional issues. However, the infrastructure configuration and its state cause different parallel executions of the program and some could result in functional faults which are hard to reveal. In general, the infrastructure that executes the program is not considered during the testing, because the tests usually contain few input data and then the parallelization is not necessary. In this paper a testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults. This testing technique is automatized by using a test engine and is applied to a case study. As a result, several infrastructure configurations are automatically generated and executed for a test case revealing a functional fault that is then fixed by the developer.

Summary (3 min read)

INTRODUCTION

  • The new trends in massive data processing have brought to light several technologies and processing models in the field called Big Data Engineering [1].
  • Then the framework that manages the infrastructure is also responsible for automatically deploying and running the program over several computers and for leading the data processing from input to output.
  • These faults are often masked during the test execution because the tests usually run over an infrastructure configuration without considering the different situations that could occur in production, such as different parallelism levels or infrastructure failures [6].
  • The main contribution of this paper is a technique that can be used to automatically generate the different infrastructure configurations for a MapReduce application.
  • Then each one of the configurations is executed in the test environment in order to detect functional faults of the program that may occur in production.

2. Automatic support by means of a test engine based on MRUnit

  • In Section II the principles of the MapReduce paradigm are introduced.
  • The generation of the different configurations, the execution and the automatization of the tests are defined in Section III.
  • In Section V the related work about software testing in MapReduce paradigm is presented.

II. MAPREDUCE PARADIGM

  • A MapReduce program processes large quantities of data in a distributed infrastructure.
  • The final output is obtained from the deployment and the execution over a distributed infrastructure of several instances of Mapper and Reducer, also called tasks.
  • The Mapper task receives a subset of temperature data and emits <year, temperature of this year> pairs.
  • In MapReduce there are also other implementations such as for example Partitioner that decides for each <key, value> pair which Reducer analyses it, Sort that sorts the <key, value> pairs, and Group that aggregates the values of each key before the Reducer.
  • These faults are difficult to detect during testing because the test cases usually contain few input data.

A. Generation of the test scenarios

  • To illustrate how the infrastructure configuration affects the program output, suppose that the example of Section II is extended with a Combiner in order to decrease the data and improve the performance.
  • The Combiner receives several temperatures and replaces them with their average in its output.
  • The program does not admit a Combiner because all the temperatures are needed to obtain the total average temperature.
  • The error of adding the Combiner in order to optimize the program injects a functional fault in the program.
  • Fig. 2 represents three possible executions of this program that could occur in production considering the different infrastructure configurations and the same input (year 1999 with temperatures 4º, 2º and 3º).

  • The first configuration consists of one Mapper, one Combiner and one Reducer that produces the expected output.
  • The second configuration also generates the expected output executing one Mapper that processes the temperatures 4º and 2º, another Mapper for 3º, two Combiner, and finally one Reducer.
  • In order to generate each one of the scenarios, a combinatorial technique [11] is proposed to combine the values of the different parameters that can modify the execution of the MapReduce program.
  • The constraints considered in this paper are the following: 1. The values/combinations of the Mapper parameters depend on the input data because it is not possible to have more tasks than data items.
  • To illustrate how the parameters are combined and how the constraints are applied, suppose the program of Fig. 2.

B. Execution of the test scenarios

  • The previous section proposes a technique to generate scenarios that represent different infrastructure configurations according to the characteristics of the MapReduce processing.
  • This is the scenario formed by one Mapper, one Combiner and one Reducer which is the usual configuration executed in testing.
  • Finally, if the test case contains the expected output, the output of the ideal scenario is also checked against the expected output (8), detecting a fault when both are not equivalent (9, 10).
  • Given a test case, the scenarios are generated according to the previous section, then they are iteratively executed and evaluated following the pseudocode of Fig. 3.
  • Finally, a third scenario is executed and produces 3.25º as output; this temperature is not equivalent to the 3º of the ideal scenario output.

IV. CASE STUDY

  • In order to evaluate the proposed approach, the authors use as case study the MapReduce program described in the I8K|DQ-BigData framework [13].
  • The output of the program is the data quality of each row, and the average of all rows.
  • Over the previous program, a test case is obtained using a specific MapReduce testing technique based on data flow [5].
  • The second Mapper processes only row 2, but no other information about the mandatory columns or data quality threshold, so this Mapper cannot emit any output.
  • These environments do not detect the fault because they only execute one scenario that masks the fault.

VI. CONCLUSIONS

  • A testing technique for the MapReduce programs is introduced and automatized in this paper as a test engine that reproduces the different infrastructure configurations for a given test case.
  • Automatically and without an expected output, the test engine can detect functional faults specific to the MapReduce paradigm that are in general difficult to detect in the test/production environments.
  • This approach is applied in a real program using a test case with few data.
  • The current approach is off-line because the tests are not carried out when the program is in production.


This paper is a post-print of a paper accepted at the “International Conference on Future Internet of Things and Cloud (FiCloud), 2016”.
The final version of this paper is available through IEEE Xplore at the following link:
http://ieeexplore.ieee.org/document/7592719/
J. Morán, B. Rivas, C. De La Riva, J. Tuya, I. Caballero and M. Serrano, "Infrastructure-Aware
Functional Testing of MapReduce Programs," 2016 IEEE 4th International Conference on Future
Internet of Things and Cloud Workshops (FiCloudW), Vienna, 2016, pp. 171-176. doi:
10.1109/W-FiCloud.2016.45
IEEE copyright notice. © 2016 IEEE. Personal use of this material is permitted. Permission from
IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new
collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted
component of this work in other works.

Infrastructure-Aware Functional Testing of
MapReduce programs
Jesús Morán
Department of Computing
University of Oviedo
Gijón, Spain
moranjesus@lsi.uniovi.es
Bibiano Rivas
Institute of Technology
and Information Systems
University of Castilla-La
Mancha
Ciudad Real, Spain
Bibiano.Rivas@uclm.es
Claudio de la Riva, Javier Tuya
Department of Computing
University of Oviedo
Gijón, Spain
{claudio, tuya}@uniovi.es
Ismael Caballero, Manuel Serrano
Institute of Technology and
Information Systems
University of Castilla-La Mancha
Ciudad Real, Spain
{Ismael.Caballero,
Manuel.Serrano}@uclm.es
Abstract: Programs that process a large volume of data
generally run in a distributed and parallel architecture, such as
the programs implemented in the processing model MapReduce.
In these programs, developers can abstract the infrastructure
where the program will run and focus on the functional issues.
However, the infrastructure configuration and its state cause
different parallel executions of the program and some could result in functional faults which are hard to reveal. In general,
the infrastructure that executes the program is not considered
during the testing, because the tests usually contain few input
data and then the parallelization is not necessary. In this paper a
testing technique is proposed to generate different infrastructure
configurations for a given test input data, and then the program
is executed in these configurations in order to reveal functional
faults. This testing technique is automatized by using a test
engine and applied in a case study. As a result, several
infrastructure configurations are automatically generated and
executed for a test case revealing a functional fault that is then
fixed by the developer.
Keywords: Software testing, MapReduce programs, Big Data
Engineering, Hadoop
I. INTRODUCTION
The new trends in massive data processing have brought to
light several technologies and processing models in the field
called Big Data Engineering [1]. Among them, MapReduce [2]
can be highlighted as it permits the analysis of large data based
on the “divide and conquer” principle. These programs run two
phases in a distributed infrastructure: the Mapper phase divides
the problem into several subproblems, and then the Reducer
phase solves each subproblem. Usually, MapReduce programs
run on several computers with heterogeneous resources and
features. This complex infrastructure is managed by a
framework, such as Hadoop [3] which stands out due to its
wide use in the industry [4].
From the developer point of view, a MapReduce program
can be implemented only with Mapper and Reducer, without
any consideration about the infrastructure. Then the framework
that manages the infrastructure is also responsible for automatically deploying and running the program over several computers and for leading the data processing from input to output. Among other tasks, the framework divides the input into
several subsets of data, then processes each one in parallel and
re-runs some parts of the program if necessary.
Despite the fact that the program can be implemented abstracting away the infrastructure, the developer needs to consider
how the infrastructure configuration could affect the program
functionality. A previous work [5] detects and classifies several
faults that depend on how the infrastructure configuration
affects the program execution and produces different output.
These faults are often masked during the test execution because
the tests usually run over an infrastructure configuration
without considering the different situations that could occur in
production, such as different parallelism levels or infrastructure failures [6]. On the other hand, if the tests are executed in an environment similar to production, some faults may not be detected because the test inputs commonly contain few data, which means that Hadoop does not parallelize the program execution. There are some tools to
enable the simulation for some of these situations (for example
computer and net failures) [7, 8, 9], but it is difficult to design,
generate and execute the tests in a deterministic way because
there are many elements that need fine-grained simulation,
including the infrastructure and framework.
The main contribution of this paper is a technique that can
be used to automatically generate the different infrastructure configurations for a MapReduce application. The goal is to
execute test cases with these configurations in order to reveal
functional faults. Given a test input data, the configurations are
obtained based on the different executions that can happen in
production. Then each one of the configurations is executed in
the test environment in order to detect functional faults of the
program that may occur in production. The contributions of
this work are:
1. A combinatorial technique to generate the different
infrastructure configurations, taking into account
characteristics related to the MapReduce processing and
the test input data.
2. Automatic support by means of a test engine based on
MRUnit [10] that allows the execution of the
infrastructure configurations, together with the
evaluation to detect failures.

The rest of the paper is organized as follows. In Section II the
principles of the MapReduce paradigm are introduced. The
generation of the different configurations, the execution and the
automatization of the tests are defined in Section III. In Section
IV it is applied to a case study. In Section V the related work
about software testing in MapReduce paradigm is presented.
The paper ends with conclusions and future work in Section
VI.
II. MAPREDUCE PARADIGM
A MapReduce program processes large quantities of data
in a distributed infrastructure. The developer implements two
functionalities: the Mapper task, which splits the problem into several subproblems, and the Reducer task, which solves these subproblems.
The final output is obtained from the deployment and the
execution over a distributed infrastructure of several instances
of Mapper and Reducer, also called tasks. The deployment and
execution are automatically carried out by Hadoop or another
framework. First, several Mapper tasks analyse in parallel a subset of the input data and determine which subproblem each record belongs to. When all the Mappers have finished,
several Reducers are also executed in parallel in order to solve
the subproblems. Internally MapReduce handles <key, value>
pairs, where the key is the subproblem identifier and the value
contains the information to solve it.
To illustrate MapReduce let us suppose a program that
computes the average temperature per year from historical data
about temperatures. This program solves one subproblem for
each year, so the identifier or key is the year. The Mapper task
receives a subset of temperature data and emits <year,
temperature of this year> pairs. Then Hadoop aggregates all
values per key. Therefore, the Reducer tasks receive
subproblems like <year, [all temperatures of this year]>, that is
all temperatures grouped per year. Finally, the Reducer
calculates the average temperature. For example, Fig. 1 details an execution of the program with the following input: year 2000 with 3º, 2002 with 4º, 2000 with 1º, and 2001 with 5º.
The first two inputs are analysed in one Mapper task and the
remainder in another task. Then the temperatures are grouped
per year and sent to the Reducer tasks. The first Reducer
receives all the temperatures for the years 2000 and 2002, and
the other task those for the year 2001. Finally, each Reducer emits the average temperature of the analysed subproblems: 2º in the year 2000, 4º in 2002 and 5º in 2001. This program with the
same input could be executed in another way by the
framework, for example with three Mappers and three
Reducers. Regardless of how the framework runs the program,
it should generate the expected output.
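To make the running example concrete, a minimal sketch of such a program with the Hadoop Java API is shown below. It assumes input lines of the form "year,temperature"; the class names and the parsing are illustrative and are not the code used by the authors.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits one <year, temperature> pair per input line ("year,temperature").
class AverageTemperatureMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        context.write(new Text(fields[0].trim()),
                      new DoubleWritable(Double.parseDouble(fields[1].trim())));
    }
}

// Reducer: receives <year, [all temperatures of this year]> and emits their average.
class AverageTemperatureReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable t : temperatures) {
            sum += t.get();
            count++;
        }
        context.write(year, new DoubleWritable(sum / count));
    }
}

With the input of Fig. 1, a job driver (not shown) would configure these two classes, and the framework would decide how many Mapper and Reducer tasks actually run.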
Additionally, to optimize the program, a Combiner
functionality can be implemented. This task is run after the
Mapper and the goal is to remove the irrelevant <key, value>
pairs to solve the subproblem. In MapReduce there are also
other implementations such as for example Partitioner that
decides for each <key, value> pair which Reducer analyses it,
Sort that sorts the <key, value> pairs, and Group that
aggregates the values of each key before the Reducer.
The wrong implementation of these functionalities could
cause a failure in one of the different ways in which Hadoop
can run the program. These faults are difficult to detect during
testing because the test cases usually contain few input data. In that case it is not necessary to split the input, and therefore the execution uses one Mapper, one Combiner and one Reducer [2].
III. GENERATION AND EXECUTION OF TESTS
The generation of the infrastructure configurations for the
tests is defined in Section A, and a framework to execute the
tests in Section B.
A. Generation of the test scenarios
To illustrate how the infrastructure configuration affects the
program output, suppose that the example of Section II is
extended with a Combiner in order to decrease the data and
improve the performance. The Combiner receives several temperatures and replaces them with their average in its output. In this case, the program does not admit a
Combiner because all the temperatures are needed to obtain the
total average temperature. The error of adding the Combiner in
order to optimize the program injects a functional fault in the
program. Fig. 2 represents three possible executions of this
program that could occur in production considering the
different infrastructure configurations and the same input (year
1999 with temperatures 4º, 2º and 3º).
The first configuration consists of one Mapper, one Combiner and one Reducer, and produces the expected output. The second configuration also generates the expected output, executing one Mapper that processes the temperatures 4º and 2º, another Mapper for 3º, two Combiners, and finally one Reducer. The third configuration also executes two Mappers, two Combiners and one Reducer, but produces an unexpected output because the first Mapper processes 4º and the second Mapper the temperatures 2º and 3º. Then one of the Combiner tasks calculates the average of 4º, and the other Combiner the average of 2º and 3º. The Reducer receives the previous averages (4º and 2.5º) and calculates the total average of the year.
Fig. 1. Program that calculates the average temperature per year
Fig. 2. Different infrastructure configurations for a program that calculates the average temperature per year with Combiner task

This configuration produces 3.25º as output instead of the 3º of the expected output. The program has a functional fault only
detected in the third configuration. The failure is produced
whenever this infrastructure configuration is executed,
regardless of computer failures, slow networks or other conditions. This fault is difficult to reveal because the test case needs to be executed in the infrastructure configuration that detects it, and in a completely controlled way.
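As an illustration of the fault discussed above (sketch code written for this text, not the authors' implementation), the faulty Combiner can be written as a Reducer-style class that replaces the temperatures seen by each Combiner task with their partial average; the weight of each partial average is lost, so the final result depends on how the framework splits the data among Combiner tasks.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Faulty Combiner: emits the average of the values it happens to receive, discarding
// how many values were averaged. Averages of partial averages are only correct when
// every Combiner task sees the same number of values.
class AverageTemperatureCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text year, Iterable<DoubleWritable> temperatures, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable t : temperatures) {
            sum += t.get();
            count++;
        }
        context.write(year, new DoubleWritable(sum / count));
    }
}

With a single Combiner over 4º, 2º and 3º the Reducer still receives 3º, but with the split of the third configuration it receives 4º and 2.5º and outputs 3.25º.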
Given a test input data, the goal is to generate the different
infrastructure configurations, also called in this context
scenarios. For this purpose, the technique proposed considers
how the MapReduce program can execute these input data in
production. First, the program runs the Mappers, then over
their outputs the Combiners and finally the Reducers. The
execution can be carried out over a different number of
computers and therefore the Mapper-Combiner-Reducer can
analyse a different subset of data in each execution. In order to
generate each one of the scenarios, a combinatorial technique
[11] is proposed to combine the values of the different
parameters that can modify the execution of the MapReduce
program. In this work the following parameters are considered
based on previous work [5] that classifies different types of
faults of the MapReduce applications:
Mapper parameters: (1) Number of Mapper tasks, (2)
Inputs processed per each Mapper, and (3) Data
processing order of the inputs, that is, which data are
processed before other data in the Mapper and which
data are processed after.
Combiner parameters for each Mapper output: (1)
Number of Combiner tasks, and (2) Inputs processed
per each Combiner.
Reducer parameters: (1) Number of Reducer tasks, and
(2) Inputs processed per each Reducer.
The different scenarios are obtained through the combination
of all values that can take the above parameters and applying
the constraints imposed by the sequential execution of
MapReduce. The constraints considered in this paper are the
following:
1. The values/combinations of the Mapper parameters depend on the input data because it is not possible to have more tasks than data items. For example, if there are three data items in the input, the maximum number of Mappers is three.
2. The values/combinations of the Combiner parameters
depend on the output of the Mapper tasks.
3. The values/combinations of the Reducer parameters
depend on the output of the Mapper-Combiner tasks
and on another functionality executed by Hadoop before the Reducer tasks. This functionality is called Shuffle
and for each <key, value> pair determines the Reducer
task that requires these data, then sorts all the data and
aggregates by key.
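A simplified sketch of this Shuffle step is shown below (illustrative Java, not Hadoop's internal implementation): each <key, value> pair is assigned to a Reducer task with a hash-based partitioning similar in spirit to Hadoop's default Partitioner, and the values of each key are grouped in sorted key order before being handed to that Reducer.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified Shuffle: partitions <key, value> pairs among Reducer tasks and groups
// the values of each key in sorted key order.
class SimplifiedShuffle {
    static Map<Integer, SortedMap<String, List<Double>>> shuffle(
            List<Map.Entry<String, Double>> mapperOutput, int numReducers) {
        Map<Integer, SortedMap<String, List<Double>>> perReducer = new HashMap<>();
        for (Map.Entry<String, Double> pair : mapperOutput) {
            // Hash-based assignment of the key to one of the Reducer tasks.
            int reducer = Math.floorMod(pair.getKey().hashCode(), numReducers);
            perReducer.computeIfAbsent(reducer, r -> new TreeMap<>())
                      .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                      .add(pair.getValue());
        }
        return perReducer;
    }
}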
To illustrate how the parameters are combined and how the
constraints are applied, suppose the program of Fig. 2. The
input of this program contains three data items, and these data
constrain the values that the Mapper parameters can take
because the maximum number of Mapper tasks is three (one
Mapper per each <key, value> pair). The first scenario is
generated with one Mapper, one Combiner and one Reducer.
For the second scenario the parameter “Number of Mapper
tasks” is modified to 2, where the first Mapper analyses two
<key, value> pairs, and the second processes one pair. The
third scenario maintains the parameter “Number of Mapper
tasks” at 2, but modifies the parameter “Inputs processed per each Mapper”, so the first Mapper analyses one <key, value>
pair and the other Mapper processes two pairs. The scenarios
are generated by the modification of the values in the
parameters in this way and considering the constraints.
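A minimal sketch of this combinatorial generation is shown below for the Mapper parameters only, assuming for simplicity that each Mapper receives a contiguous, non-empty group of the test input records; the class is hypothetical and deliberately ignores the data processing order and the Combiner and Reducer parameters described above.

import java.util.ArrayList;
import java.util.List;

// Enumerates every way of splitting the test input records into contiguous,
// non-empty groups, one group per Mapper task.
public class MapperScenarioGenerator {

    public static <T> List<List<List<T>>> mapperSplits(List<T> records) {
        List<List<List<T>>> scenarios = new ArrayList<>();
        int n = records.size();
        // Each bitmask over the n-1 "gaps" between records decides where a new Mapper starts.
        for (int mask = 0; mask < (1 << Math.max(0, n - 1)); mask++) {
            List<List<T>> groups = new ArrayList<>();
            List<T> current = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                current.add(records.get(i));
                boolean splitHere = i < n - 1 && ((mask >> i) & 1) == 1;
                if (splitHere || i == n - 1) {
                    groups.add(current);
                    current = new ArrayList<>();
                }
            }
            scenarios.add(groups);
        }
        return scenarios;
    }

    public static void main(String[] args) {
        // The three <1999, temperature> pairs of the running example.
        List<String> input = List.of("<1999, 4º>", "<1999, 2º>", "<1999, 3º>");
        mapperSplits(input).forEach(s -> System.out.println(s.size() + " Mapper(s): " + s));
    }
}

For the three pairs of the running example this produces four splits, ranging from one Mapper with all the data to three Mappers with one pair each, which respects constraint 1.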
B. Execution of the test scenarios
The previous section proposes a technique to generate
scenarios that represent different infrastructure configurations
according to the characteristics of the MapReduce processing.
Fig. 3 describes a framework to systematically execute the tests
with the scenarios generated by the technique of the previous
section.
The framework takes as input a test case that contains the
input data and optionally the expected output. The test input data can be obtained with a generic testing technique or one specifically designed for MapReduce, such as MRFlow [12].
Input: Test case with:
    - input data
    - expected output (optional)
Output: scenario that reveals a fault
(0)  /* Generation of scenarios (Section A) */
(1)  Scenarios ← Generate scenarios from input data
(2)  /* Execution of scenarios */
(3)  ideal scenario output ← Execution of ideal scenario
(4)  FOR EACH scenario IN Scenarios:
(5)      scenario output ← Execution of scenario
(6)      IF scenario output <> ideal scenario output:
(7)          RETURN scenario with fault
(8)  IF ideal scenario output <> expected output:
(9)      RETURN ideal scenario
(10) ELSE:
(11)     RETURN Zero faults detected
Fig. 3. a) General framework of test execution; b) Algorithm for test generation and execution of test scenarios

Then, the ideal scenario is generated (1) and executed (2, 3).
This is the scenario formed by one Mapper, one Combiner and
one Reducer which is the usual configuration executed in
testing. Next, new scenarios are iteratively generated (4) and
executed (5) through the technique of the previous section. The
output of each scenario is checked against the output of the
ideal scenario (6), revealing a fault if the outputs are not
equivalent (7). Finally, if the test case contains the expected
output, the output of ideal scenario is also checked against the
expected output (8), detecting a fault when both are not
equivalent (9, 10).
Given a test case, the scenarios are generated according to
the previous section, then they are iteratively executed and
evaluated following the pseudocode of Fig. 3. For example,
Fig. 2 contains the generation and execution of a program that
calculates the average temperature per year in three scenarios
considering the same test input: year 1999 with temperatures
4º, 2º and 3º. The first execution is the ideal scenario with one Mapper, one Combiner and one Reducer, which produces 3º as output. Then the second scenario is executed and also produces 3º. Finally, a third scenario is executed and produces 3.25º as output; this temperature is not equivalent to the 3º of the ideal scenario output. Consequently, a functional fault is revealed
without any knowledge of the expected output of the test case.
This approach is automatized by means of a test engine
based on the MRUnit library [10]. This library is used to execute
each scenario. In MRUnit the test cases are executed in the
ideal scenario, but this library is extended to generate other
scenarios and enable parallelism supporting the execution of
several Mapper, Combiner and Reducer tasks.
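That extension is not part of the standard MRUnit distribution, so the sketch below only shows the baseline it could build on: executing the ideal scenario with the MRUnit MapReduceDriver and keeping its output for comparison with the alternative scenarios. The AverageTemperature* classes are the hypothetical ones sketched earlier, not code published with the paper.

import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;

// Baseline execution of the ideal scenario (one Mapper, one Combiner, one Reducer).
// A test engine in the spirit of the paper would re-run the same input under the
// other generated scenarios and compare each output against idealOutput.
public class IdealScenarioExecution {
    public static void main(String[] args) throws Exception {
        MapReduceDriver<LongWritable, Text, Text, DoubleWritable, Text, DoubleWritable> driver =
                new MapReduceDriver<>();
        driver.withMapper(new AverageTemperatureMapper())
              .withCombiner(new AverageTemperatureCombiner())
              .withReducer(new AverageTemperatureReducer());

        // Same test input as Fig. 2: year 1999 with temperatures 4º, 2º and 3º.
        driver.withInput(new LongWritable(1), new Text("1999,4"));
        driver.withInput(new LongWritable(2), new Text("1999,2"));
        driver.withInput(new LongWritable(3), new Text("1999,3"));

        List<Pair<Text, DoubleWritable>> idealOutput = driver.run();
        System.out.println("Ideal scenario output: " + idealOutput);
        // Each alternative scenario would then be executed and its output compared
        // for equivalence with idealOutput, reporting the scenario on any mismatch.
    }
}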
IV. CASE STUDY
In order to evaluate the proposed approach, we use as case
study the MapReduce program described in the I8K|DQ-BigData
framework [13]. This program measures the quality of the data
exchanged between organizations according to part 140 of the
ISO/TS 8000 [14]. The program receives (1) the data
exchanged in a row-column fashion, together with (2) a set of
mandatory columns that should contain data and (3) a
percentage threshold that divides the data quality of each row
in two parts: the first part is maximum if all mandatory
columns contain data and zero otherwise, and the second part
of the data quality is calculated as the percentage of the non-
mandatory columns that contain data. The output of the
program is the data quality of each row, and the average of all
rows.
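One possible reading of this quality rule, written as plain Java for illustration (the I8K|DQ-BigData program itself is a MapReduce job and its code is not reproduced here, so names and signatures are hypothetical), is the following:

import java.util.List;
import java.util.Set;

// Per-row quality: the threshold percentage is granted only if all mandatory columns
// contain data; the remaining percentage is the fraction of non-mandatory columns
// that contain data.
public class RowQuality {

    public static double score(List<String> columnNames, List<String> values,
                               Set<String> mandatoryColumns, double thresholdPercent) {
        int nonMandatoryTotal = 0;
        int nonMandatoryFilled = 0;
        boolean allMandatoryFilled = true;
        for (int i = 0; i < columnNames.size(); i++) {
            boolean hasData = values.get(i) != null && !values.get(i).isEmpty();
            if (mandatoryColumns.contains(columnNames.get(i))) {
                allMandatoryFilled &= hasData;
            } else {
                nonMandatoryTotal++;
                if (hasData) nonMandatoryFilled++;
            }
        }
        double mandatoryPart = allMandatoryFilled ? thresholdPercent : 0.0;
        double optionalPart = nonMandatoryTotal == 0 ? (100.0 - thresholdPercent)
                : (100.0 - thresholdPercent) * nonMandatoryFilled / nonMandatoryTotal;
        return mandatoryPart + optionalPart;
    }

    public static void main(String[] args) {
        Set<String> mandatory = Set.of("Name");
        // A row where only "Name" has data -> 50%.
        System.out.println(score(List.of("Name", "City"), List.of("Alice", ""), mandatory, 50.0));
        // A row where both columns have data -> 100%.
        System.out.println(score(List.of("Name", "City"), List.of("Bob", "Vienna"), mandatory, 50.0));
    }
}

With the threshold of 50% and "Name" as the only mandatory column, this reading yields 50% for a row with an empty City and 100% for a fully filled row, consistent with the test case described next.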
Over the previous program, a test case is obtained using a
specific MapReduce testing technique based on data flow [5].
The test input data and the expected output of the test case
contain two rows represented in Table I. Row 1 contains two
columns (Name and City), and only one column has data, so
the data quality is 50%. Row 2 contains data in all columns, so
the data quality is 100%. The total quality is 75%, which is the
average of both rows.
The procedure described in Section III is applied on the
previous program using the previous test case as input. As a
result, a fault is detected and reported to the developer. This
failure occurs when the rows are processed in different
Mappers and only the first Mapper receives the information
related to the mandatory columns and the data quality
threshold, because Hadoop splits the input data into several
subsets. Without this information, the Mapper cannot calculate
the data quality and does not emit any output. The bottom of
Fig. 4 represents the scenario that produces the failure. There
are two Mappers that process different rows. The first Mapper
receives the data quality threshold (value of 50%), the
mandatory column (“Name”) and the two columns of row 1
with only data in one column, so the Mapper emits 50% as data
quality of row 1. The second Mapper processes only row 2, but receives no information about the mandatory columns or the data quality threshold, so this Mapper cannot emit any output. Then
the Reducer receives only the data quality of row 1 and emits
an incorrect output of the average data quality.
This fault is difficult to detect because it implies the parallel
and controlled execution of the program. Moreover, this fault is
not revealed by the execution of the test case in the following
environments: (a) a Hadoop cluster in production with 4 computers, (b) Hadoop in local mode (a simple version of Hadoop with one computer), and (c) the MRUnit unit testing library. These
environments do not detect the fault because they only execute
one scenario that masks the fault. Normally these
environments run the program in the ideal scenario that is
formed by one Mapper, one Combiner and one Reducer, and
then the fault is masked due to a lack of parallelism.
The test engine proposed in this paper executes the test case
in the different scenarios that can occur in production with
large data and infrastructure failures. In contrast with the other
Fig. 4. Execution of the test case in different scenarios
TABLE I. TEST CASE OF THE I8K|DQ-BIGDATA PROGRAM

Input                                            Expected output
Data quality threshold: 50%
Mandatory columns: "Name"
Row 1:  Name: Alice,  City: (no data)            50%
Row 2:  Name: Bob,    City: Vienna               100%
                                                 75% (average)

Citations
Journal ArticleDOI
TL;DR: New testing techniques that aimed to detect design faults by simulating different infrastructure configurations that as whole are more likely to reveal failures using random testing, and partition testing together with combinatorial testing are proposed.
Abstract: New processing models are being adopted in Big Data engineering to overcome the limitations of traditional technology. Among them, MapReduce stands out by allowing for the processing of large volumes of data over a distributed infrastructure that can change during runtime. The developer only designs the functionality of the program and its execution is managed by a distributed system. As a consequence, a program can behave differently at each execution because it is automatically adapted to the resources available at each moment. Therefore, when the program has a design fault, this could be revealed in some executions and masked in others. However, during testing, these faults are usually masked because the test infrastructure is stable, and they are only revealed in production because the environment is more aggressive with infrastructure failures, among other reasons. This paper proposes new testing techniques that aimed to detect these design faults by simulating different infrastructure configurations. The testing techniques generate a representative set of infrastructure configurations that as whole are more likely to reveal failures using random testing, and partition testing together with combinatorial testing. The techniques are automated by using a test execution engine called MRTest that is able to detect these faults using only the test input data, regardless of the expected output. Our empirical evaluation shows that MRTest can automatically detect these design faults within a reasonable time.

13 citations


Cites background from "Infrastructure-Aware Functional Tes..."

  • ...ture configurations that could occur in production (all potential configurations) [25], [26]....


Proceedings ArticleDOI
25 Jul 2017
TL;DR: This work proposes an automatic test framework implementing a novel testing approach called Ex Vivo that can identify a fault in a few seconds, then the program can be stopped, not only avoiding an incorrect output, but also saving money, time and energy of production resources.
Abstract: Big Data programs are those that process large data exceeding the capabilities of traditional technologies. Among newly proposed processing models, MapReduce stands out as it allows the analysis of schema-less data in large distributed environments with frequent infrastructure failures. Functional faults in MapReduce are hard to detect in a testing/preproduction environment due to its distributed characteristics. We propose an automatic test framework implementing a novel testing approach called Ex Vivo. The framework employs data from production but executes the tests in a laboratory to avoid side-effects on the application. Faults are detected automatically without human intervention by checking if the same data would generate different outputs with different infrastructure configurations. The framework (MrExist) is validated with a real-world program. MrExist can identify a fault in a few seconds, then the program can be stopped, not only avoiding an incorrect output, but also saving money, time and energy of production resources.

9 citations


Cites methods from "Infrastructure-Aware Functional Tes..."

  • ...Finally, the testing is performed using a specific MapReduce testing technique [22], [34] that only needs the test input data and the program to detect functional faults (6)....


  • ...[34] designed an automatic testing technique based on combinatorics and simulation....


Posted Content
TL;DR: In this paper, the authors conducted a systematic review of the Big Data testing techniques period (2010 - 2021) and discussed the processing of testing data by highlighting the techniques used in every processing phase.
Abstract: Big Data is reforming many industrial domains by providing decision support through analyzing large volumes of data. Big Data testing aims to ensure that Big Data systems run smoothly and error-free while maintaining the performance and quality of data. However, because of the diversity and complexity of data, testing Big Data is challenging. Though numerous researches deal with Big Data testing, a comprehensive review to address testing techniques and challenges is not conflate yet. Therefore, we have conducted a systematic review of the Big Data testing techniques period (2010 - 2021). This paper discusses the processing of testing data by highlighting the techniques used in every processing phase. Furthermore, we discuss the challenges and future directions. Our finding shows that diverse functional, non-functional and combined (functional and non-functional) testing techniques have been used to solve specific problems related to Big Data. At the same time, most of the testing challenges have been faced during the MapReduce validation phase. In addition, the combinatorial testing technique is one of the most applied techniques in combination with other techniques (i.e., random testing, mutation testing, input space partitioning and equivalence testing) to solve various functional faults challenges faced during Big Data testing.
Proceedings ArticleDOI
10 Dec 2018
TL;DR: A framework for effectively generating method-level tests to facilitate debugging of big data applications by running a big data application with the original dataset and recording the inputs to a small number of method executions that preserve certain code coverage is introduced.
Abstract: When a failure occurs in a big data application, debugging with the original dataset can be difficult due to the large amount of data being processed. This paper introduces a framework for effectively generating method-level tests to facilitate debugging of big data applications. This is achieved by running a big data application with the original dataset and by recording the inputs to a small number of method executions, which we refer to as method-level tests, that preserve certain code coverage, e.g., edge coverage. The size of each method-level test is further reduced if needed, while maintaining code coverage. When debugging, a developer could inspect the execution of these method-level tests, instead of the entire program execution with the original dataset. We applied the framework to seven algorithms in the WEKA tool. The initial results show that in many cases a small number of method-level tests are sufficient to preserve code coverage. Furthermore, these tests could kill between 57.58% to 91.43% of the mutants generated using a mutation testing tool. This suggests that the framework could significantly reduce the efforts required for debugging big data applications.

Cites background or methods from "Infrastructure-Aware Functional Tes..."

  • ...For example, data mining and machine learning methods are used to reduce the size of the original dataset or generate synthetic datasets [3, 4] for the testing purpose....


  • ...Previous work reported in [1, 2, 3, 4, 5] focuses on generating tests that help to identify functional faults, i....


  • ...also proposed a technique to generate different infrastructure configurations for a given MapReduce program that can be used to reveal functional faults [4]....


  • ...Some approaches have been proposed to reduce the effort required for testing and debugging big data applications at the system level [1, 2, 3, 4, 5]....


References
Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations

Journal ArticleDOI
Jeffrey Dean1, Sanjay Ghemawat1
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Proceedings ArticleDOI
23 May 2007
TL;DR: A consistent roadmap of the most relevant challenges to be addressed in software testing research is proposed, constituted by some important past achievements, while the destination consists of four identified goals to which research ultimately tends, but which remain as unreachable as dreams.
Abstract: Software engineering comprehends several disciplines devoted to prevent and remedy malfunctions and to warrant adequate behaviour. Testing, the subject of this paper, is a widespread validation approach in industry, but it is still largely ad hoc, expensive, and unpredictably effective. Indeed, software testing is a broad term encompassing a variety of activities along the development cycle and beyond, aimed at different goals. Hence, software testing research faces a collection of challenges. A consistent roadmap of the most relevant challenges to be addressed is here proposed. In it, the starting point is constituted by some important past achievements, while the destination consists of four identified goals to which research ultimately tends, but which remain as unreachable as dreams. The routes from the achievements to the dreams are paved by the outstanding research challenges, which are discussed in the paper along with interesting ongoing work.

834 citations


"Infrastructure-Aware Functional Tes..." refers methods in this paper

  • ...Despite the testing challenges of the Big Data applications [15, 16] and the progresses in the testing techniques [17], little effort is focused on testing the MapReduce programs [18], one of the principal paradigms of Big Data [19]....


Proceedings ArticleDOI
10 Jun 2010
TL;DR: This paper is the first attempt to study server failures and hardware repairs for large datacenters and presents a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors.
Abstract: Modern day datacenters host hundreds of thousands of servers that coordinate tasks in order to deliver highly available cloud computing services. These servers consist of multiple hard disks, memory modules, network cards, processors etc., each of which while carefully engineered are capable of failing. While the probability of seeing any such failure in the lifetime (typically 3-5 years in industry) of a server can be somewhat small, these numbers get magnified across all devices hosted in a datacenter. At such a large scale, hardware component failure is the norm rather than an exception.Hardware failure can lead to a degradation in performance to end-users and can result in losses to the business. A sound understanding of the numbers as well as the causes behind these failures helps improve operational experience by not only allowing us to be better equipped to tolerate failures but also to bring down the hardware cost through engineering, directly leading to a saving for the company. To the best of our knowledge, this paper is the first attempt to study server failures and hardware repairs for large datacenters. We present a detailed analysis of failure characteristics as well as a preliminary analysis on failure predictors. We hope that the results presented in this paper will serve as motivation to foster further research in this area.

518 citations


"Infrastructure-Aware Functional Tes..." refers background in this paper

  • ...These faults are often masked during the test execution because the tests usually run over an infrastructure configuration without considering the different situations that could occur in production, as for example different parallelism levels or the infrastructure failures [6]....


Journal ArticleDOI
TL;DR: This survey describes the basic algorithms used by the combination strategies and includes a subsumption hierarchy that attempts to relate the various coverage criteria associated with the identified combination strategies.
Abstract: Combination strategies are test-case selection methods where test cases are identifled by combining values of the difierent test object input parameters based on some combinatorial strategy. This survey presents 15 difierent combination strategies, and covers more than 30 papers that focus on one or several combination strategies. We believe this collection represents most of the existing work performed on combination strategies. This survey describes the basic algorithms used by the combination strategies. Some properties of combination strategies, including coverage criteria and theoretical bounds on the size of test suites, are also included in this description. This survey paper also includes a subsumption hierarchy that attempts to relate the various coverage criteria associated with the identifled combination strategies. Finally, this survey contains short summaries of all the papers that cover combination strategies.

442 citations

Frequently Asked Questions (12)
Q1. What contributions have the authors mentioned in the paper "Infrastructure-aware functional testing of mapreduce programs"?

In this paper a testing technique is proposed to generate different infrastructure configurations for a given test input data, and then the program is executed in these configurations in order to reveal functional faults. This testing technique is automatized by using a test engine and applied in a case study. 

In unit testing, JUnit [32] could be used together with mock tools, or directly the MRUnit library [10], which is adapted to the MapReduce paradigm.

In this paper a test engine is implemented by an MRUnit extension that automatically generates and executes the different infrastructure configurations that could occur in production. 

There are other approaches oriented to obtain the test input data of MapReduce programs, such as [12] that employs data flow testing and [29] based on a bacteriological algorithm.

Given a test input data, the goal is to generate the different infrastructure configurations, also called in this context scenarios. 

The test input data can be obtained with a generic testing technique or one specifically designed for MapReduce, such as MRFlow [12].

One common type of fault is produced when the data should reach the Reducer in a specific order, but the parallel execution causes these data to arrive disordered. 

In this work the following parameters are considered based on previous work [5] that classifies different types of faults of the MapReduce applications: Mapper parameters: (1) Number of Mapper tasks, (2) Inputs processed per each Mapper, and (3) Data processing order of the inputs, that is, which data are processed before other data in the Mapper and which data are processed after. 

In this paper, given a test input data, several configurations are generated and then executed in order to reveal functional faults. 

This fault was analysed by Csallner et al. [24] and Chen et al. [25] using some testing techniques based on symbolic execution and model checking. 

The failure is produced whenever this infrastructure configuration is executed, regardless of the computer failures, slow net or others. 

The second Mapper processes only row 2, but no other information about the mandatory columns or data quality threshold, so this Mapper cannot emit any output.