Comparison and Evaluation
of Clone Detection Tools
Stefan Bellon, Rainer Koschke, Member, IEEE Computer Society, Giuliano Antoniol, Member, IEEE,
Jens Krinke, Member, IEEE Computer Society, and Ettore Merlo, Member, IEEE
Abstract—Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not
yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents
an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone
candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of
the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program
dependency graphs.
Index Terms—Redundant code, duplicated code, software clones.
1 INTRODUCTION
Reuse through copying and pasting source code is
common practice. So-called software clones are the
results. Sometimes these clones are modified slightly to
adapt them to their new environment or purpose. Several
authors report 7 percent to 23 percent code duplication [1],
[2], [3]; in one extreme case, authors reported 59 percent [4].
The problem with code cloning is that errors in the
original must be fixed in every copy. Other kinds of
maintenance changes, for instance, extensions or adapta-
tions, must be applied multiple times, too. Yet, it is usually
not documented where code was copied. In such cases, one
needs to detect them. For large systems, detection is feasible
only by automatic techniques. Consequently, several tech-
niques have been proposed to detect clones automatically
[1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. The abundance
of techniques calls for quantitative evaluations.
This paper presents an experiment conducted in 2002 that
evaluates six clone detectors based on eight large C and Java
programs (altogether almost 850 KLOC). The experiment
involved several researchers who applied their tools on
these systems. Their clone candidates were evaluated by one
of the authors, namely, Stefan Bellon, as an independent
third party. The selected techniques cover the whole
spectrum of the state of the art in clone detection. The
techniques work on text, lexical and syntactic information,
software metrics, and program dependency graphs. Fig. 1
lists the participants, their tools, and the type of information
they leverage.
The remainder of this paper is organized as follows: The
next section describes the techniques we evaluated and
related techniques for clone detection. Section 3 gives an
operational structural definition of clone types used in the
evaluation. The setup for the experiment is described in
Section 4 and its results are presented in Section 5. Section 6
describes related research in clone detection evaluation.
2 CLONE DETECTION
Software clone detection is an active field of research. This
section summarizes research in clone detection.
Textual comparison. The approach of Ducasse et al.
compares whole lines to each other textually [4]. To
increase performance, lines are partitioned using a hash
function for strings. Only lines in the same partition are
compared. The result is visualized as a dot plot, where
each dot indicates a pair of cloned lines. Clones may be
found as certain patterns in those dot plots visually.
Consecutive lines can be summarized to larger cloned
sequences automatically as uninterrupted diagonals or
displaced diagonals in the dot plot.
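As an illustration of this line-based scheme, the following minimal sketch (ours, not the tool's actual code; function and variable names are assumptions) hashes normalized lines into partitions and reports pairs of identical lines, i.e., the dots of such a dot plot:

    from collections import defaultdict

    def line_clone_pairs(lines):
        """Report pairs of positions that hold textually identical lines.

        Each reported pair corresponds to one dot in a dot plot.
        """
        buckets = defaultdict(list)              # hash partition: text -> positions
        for i, line in enumerate(lines):
            normalized = " ".join(line.split())  # collapse whitespace
            if normalized:                       # skip empty lines
                buckets[normalized].append(i)
        for positions in buckets.values():
            for a_idx, a in enumerate(positions):
                for b in positions[a_idx + 1:]:
                    yield (a, b)                 # dot at coordinates (a, b)

Uninterrupted diagonals of such dots, i.e., runs of pairs (a, b), (a+1, b+1), and so on, then correspond to cloned line sequences.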
Johnson [13] uses the efficient string matching by Karp
and Rabin [14] based on fingerprints.
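Karp-Rabin matching computes a rolling fingerprint of every text window so that each shift is checked in constant expected time, with a character comparison only on a hash hit. A minimal sketch of the idea (ours, not Johnson's implementation):

    def karp_rabin_find(text: str, pattern: str, base: int = 256,
                        mod: int = (1 << 61) - 1):
        """Yield all start indices where pattern occurs in text."""
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return
        high = pow(base, m - 1, mod)        # weight of the leading character
        p_hash = w_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            w_hash = (w_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            if w_hash == p_hash and text[i:i + m] == pattern:  # verify hit
                yield i
            if i + m < n:                   # roll the window one character
                w_hash = ((w_hash - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod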
Token comparison. Baker’s technique is also a line-
based comparison. Instead of a string comparison, the token
sequences of lines are compared efficiently through a suffix
tree. First, each token sequence for a whole line is
summarized by a so-called functor that abstracts from
concrete values of identifiers and literals [1]. The functor
characterizes this token sequence uniquely. Assigning
functors can be viewed as a perfect hash function. Concrete
values of identifiers and literals are captured as parameters
. S. Bellon is with Axivion GmbH, Nobelstr. 15, 70569 Stuttgart, Germany. E-mail: bellon@axivion.com.
. R. Koschke is with the Universität Bremen, Fachbereich 03, Postfach 33 04 40, 28334 Bremen, Germany. E-mail: koschke@tzi.de.
. G. Antoniol is with the Département de Génie Informatique, École Polytechnique de Montréal, Pavillons Lassonde, MacKay-Lassonde, 2500, chemin de Polytechnique, Montréal (Quebec), Canada, H3T 1J4. E-mail: antoniol@ieee.org.
. J. Krinke is with the Fern-Universität in Hagen, Universitätsstr. 27, 58097 Hagen, Germany. E-mail: krinke@ieee.org.
. E. Merlo is with the Department of Computer Engineering, Ecole Polytechnique of Montreal, PO Box 6079, Station Downtown, Montreal (Quebec), Canada, H3C 3A7. E-mail: ettore.merlo@polymtl.ca.
Manuscript received 11 Apr. 2006; revised 21 Oct. 2006; accepted 14 May 2007; published online 10 July 2007.
Recommended for acceptance by M. Harman.
For information on obtaining reprints of this article, please send e-mail to: tse@computer.org, and reference IEEECS Log Number TSE-0089-0406.
Digital Object Identifier no. 10.1109/TSE.2007.70725.

to this functor. An encoding of these parameters abstracts
from their concrete values but not from their order so that
code fragments may be detected that differ only in
systematic renaming of parameters. Two lines are clones
if they match in their functors and parameter encoding.
The functors and their parameters are summarized in a
suffix tree, a trie that represents all suffixes of the program
in a compact fashion. A suffix tree can be built in time and
space linear to the input length [7], [15]. Every branch in the
suffix tree represents program suffixes with common
beginnings, hence, cloned sequences.
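The functor abstraction and parameter encoding can be sketched as follows (a simplification of ours; token classification is assumed to be given):

    def abstract_line(tokens, is_parameter):
        """Split a token sequence into a functor and its parameter list.

        Tokens classified as parameters (identifiers, literals) are replaced
        by the placeholder 'P' in the functor and collected separately.
        """
        functor, params = [], []
        for tok in tokens:
            if is_parameter(tok):
                functor.append("P")
                params.append(tok)
            else:
                functor.append(tok)
        return tuple(functor), params

    def encode_parameters(params):
        """Position-based encoding: 0 on first occurrence, otherwise the
        distance to the previous occurrence of the same parameter."""
        last_seen, encoding = {}, []
        for i, p in enumerate(params):
            encoding.append(i - last_seen[p] if p in last_seen else 0)
            last_seen[p] = i
        return tuple(encoding)

    # 'x = x + y' and 'a = a + b' both yield the functor ('P','=','P','+','P')
    # and the encoding (0, 1, 0), so they match despite systematic renaming.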
Kamiya et al. increase recall for superficially different yet
equivalent sequences by normalizing the token sequences [9].
Because syntax is not taken into account, the found
clones may overlap different syntactic units, which cannot
be replaced through functional abstraction. In either a
preprocessing [16], [17] or a postprocessing [18] step, clones
that completely fall in syntactic blocks can be found if block
delimiters are known.
Metric comparison. Merlo et al. gather different metrics
for code fragments and compare these metric vectors
instead of comparing code directly [2], [3], [12], [19]. An
allowable distance (for instance, Euclidean distance) for
these metric vectors can be used as a hint for similar code.
Specific metric-based techniques were also proposed for
clones in Web sites [20], [21].
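A sketch of the metric-based idea follows; the concrete metrics and the threshold are ours for illustration, not Merlo et al.'s exact set:

    import math

    def metric_vector(fragment_lines):
        """A toy metric vector: (LOC, token count, branch keywords, calls)."""
        text = "\n".join(fragment_lines)
        return (
            len(fragment_lines),
            len(text.split()),
            sum(text.count(k) for k in ("if", "while", "for", "switch")),
            text.count("("),
        )

    def metrics_similar(frag_a, frag_b, max_distance=2.0):
        """Flag two fragments as potential clones if their metric vectors
        lie within a Euclidean distance threshold."""
        va, vb = metric_vector(frag_a), metric_vector(frag_b)
        distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        return distance <= max_distance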
Comparison of abstract syntax trees (AST). Baxter et al.
partition subtrees of the abstract syntax tree of a program
based on a hash function and then compare subtrees in the
same partition through tree matching (allowing for some
divergences) [8]. A similar approach was proposed earlier
by Yang [22] using dynamic programming to find differ-
ences between two versions of the same file.
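The hash-based partitioning of subtrees can be sketched as follows (our simplification; Baxter et al. additionally tolerate small divergences during the tree matching within a partition):

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str                     # node kind, identifiers abstracted away
        children: list = field(default_factory=list)

    def subtree_hash(node):
        """Structural hash over node labels, ignoring identifier spellings."""
        return hash((node.label, tuple(subtree_hash(c) for c in node.children)))

    def clone_subtree_pairs(root, min_size=5):
        buckets = defaultdict(list)
        def visit(node):
            size = 1 + sum(visit(c) for c in node.children)
            if size >= min_size:                    # skip trivial subtrees
                buckets[subtree_hash(node)].append(node)
            return size
        visit(root)
        for nodes in buckets.values():
            for i, a in enumerate(nodes):
                for b in nodes[i + 1:]:
                    yield (a, b)    # candidates; tree matching confirms them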
Comparison of program dependency graphs (PDG).
Control and data flow dependencies of a function may be
represented by a program dependency graph; clones may
be identified as isomorphic subgraphs [10], [11]; because
this problem is NP-hard, Krinke uses approximative
solutions.
Other techniques. Marcus and Maletic use latent semantic
indexing (an information retrieval technique) to identify
fragments in which similar names occur [23]. Leitao [24]
combines syntactic and semantic techniques through a
combination of specialized comparison functions that com-
pare various aspects (similar call subgraphs, commutative
operators, user-defined equivalences, and transformations
into canonical syntactic forms). Each comparison function
yields evidence that is summarized in an evidence-factor
model yielding a clone likelihood. Wahler et al. [25] and
Li et al. [26] cast the search for similar fragments as a data
mining problem. Statement sequences are summarized to
item sets. An adapted data mining algorithm searches for
frequent item sets.
3 BASIC DEFINITIONS
This section presents definitions that form the foundation
for the evaluation. These definitions represent the con-
sensus among all participants of the experiment accounting
for the different backgrounds of the participants.
The foremost question to answer is, “What is a clone?”
Roughly speaking, two code fragments form a clone pair if
they are similar enough according to a given definition of
similarity. Different definitions of similarity and associated
levels of tolerance allow for different kinds and degrees of
clones.
A piece of code, A, is similar to another piece of code, B,
if B subsumes the functionality of A; in other words, they
have “similar” preconditions and postconditions. We call
such a pair ðA; BÞ a semantic clone. Unfortunately, detecting
semantic clones is undecidable in general.
Another definition of similarity considers the program
text: Two code fragments form a clone pair if their program
text is similar. The two code fragments may or may not be
equivalent semantically. These kinds of clones are often the
result of copy&paste; that is, the programmer selects a code
fragment and copies it to another location.
Copy&paste is a frequent programming practice and an
example of ad hoc reuse. The automatic clone detectors
evaluated in this experiment find clones that are similar in
program text and, hence, the latter definition of a clone pair
is adopted in this paper.
Clones of this nature may be compared on the basis of
the program text that was copied. We can distinguish the
following types of clones:
. Type 1 is an exact copy without modifications
(except for white space and comments).
. Type 2 is a syntactically identical copy; only
variable, type, or function identifiers were changed.
. Type 3 is a copy with further modifications; state-
ments were changed, added, or removed.
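For illustration (a toy example of ours, shown in Python), the second fragment below is a type-2 clone of the first (identical syntax, renamed identifiers), while the third is a type-3 clone (a statement was added); an unmodified copy of the first would be a type-1 clone:

    # Original fragment.
    def total_price(items):
        total = 0
        for item in items:
            total += item.price
        return total

    # Type-2 clone: same syntax, identifiers systematically renamed.
    def sum_weights(entries):
        acc = 0
        for entry in entries:
            acc += entry.weight
        return acc

    # Type-3 clone: a statement was added.
    def total_discounted(items):
        total = 0
        for item in items:
            total += item.price
        total *= 0.9          # added statement
        return total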
Some of the tools report so-called parameterized clones
[6], which are a subset of type-2 clones. Two code fragments
A and B are a parameterized clone pair if there is a bijective
mapping from A's identifiers onto B's identifiers that
allows an identifier substitution in A resulting in A′, where
A′ is a type-1 clone of B (and vice versa).
Differentiating parameterized clones would have re-
quired us to check for consistent renaming when we
evaluated the clone pairs proposed by the tools. Because
the validation was done completely manually and because
not all tools make this distinction, we did not distinguish
parameterized clones from other type-2 clones.
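Such a consistent-renaming check could be sketched as follows (ours; the token classification is assumed to be given):

    def is_parameterized_clone(tokens_a, tokens_b, is_identifier):
        """Check for a bijective identifier mapping that turns A into B."""
        if len(tokens_a) != len(tokens_b):
            return False
        a_to_b, b_to_a = {}, {}
        for ta, tb in zip(tokens_a, tokens_b):
            if is_identifier(ta) and is_identifier(tb):
                if a_to_b.setdefault(ta, tb) != tb:   # ta mapped elsewhere
                    return False
                if b_to_a.setdefault(tb, ta) != ta:   # tb already an image
                    return False
            elif ta != tb:            # non-identifiers must match exactly
                return False
        return True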
While type-1 and type-2 clones are precisely defined and
form an equivalence relation, the definition of type-3 clones
is vague. Some tools consider two consecutive type-1 or
type-2 clones together forming a type-3 clone if the gap in
between is below a certain threshold of lines. Another
precise definition could be based on a threshold for the
578 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 33, NO. 9, SEPTEMBER 2007
Fig. 1. Participating scientists. CloneDR is a trademark of Semantic
Designs Inc.

Levenshtein Distance, that is, the number of deletions,
insertions, or substitutions required to transform one string
into another.
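For reference, the Levenshtein distance is computed by the standard dynamic program below (a sketch of ours; a type-3 similarity threshold would then relate this distance to the fragment lengths):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of deletions, insertions, and substitutions
        needed to transform a into b."""
        prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # substitute
            prev = curr
        return prev[-1]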
Because there is no consensus on a suitable similarity
measure for type-3 clones, all clones reported by the
evaluated tools that are not type-1 or type-2 clones fall into
the category type-3 in our study. It is then the decision of
the human analyst whether type-3 clone candidates are real
clones.
We are now in a position to define clone pairs more
precisely:
Definition 1. A clone (pair) is a triple $(f_1, f_2, t)$ where $f_1$ and $f_2$ are two similar code fragments and $t$ is the associated type of similarity (type 1, 2, or 3).
As a matter of fact, in the evaluation, we further
constrained the above definition by the additional require-
ment that clones may be replaced through function calls,
that is, that they are syntactically complete. Some of the
tools report code fragments that are at different syntactic
nesting levels (e.g., a fragment consisting of parts of two
different consecutive function bodies), which could indeed
be replaced through macros; but a maintenance programmer
would never want to replace them because the replacement
would make it hard to understand the program.
So, the next question is, “What is a code fragment,
exactly?” We could treat a sequence of tokens as a code
fragment. Yet, the notion of a token differs from tool to tool
(e.g., are preprocessor tokens considered?) and not all tools
report token sequences. Rather than tokens, our definition of
code fragments is based on text. Tokens may be mapped onto
text and the source text is a less debatable point of reference
(it is only less debatable rather than not at all debatable
because of macros and preprocessor directives in whose
presence one could use the preprocessed or original text).
Program text may be referenced by filename and row
and column information. Unfortunately, not all tools report
column information. Thus, the least common denominator
for the definition of a code fragment for our evaluation is
filename and row information.
Definition 2. A code fragment is a tuple $(f, s, e)$ which consists of the name of the source file $f$, the start line $s$, and the end line $e$ of the fragment. Both line numbers are inclusive.
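These two definitions translate directly into small record types; the following sketch (ours) is reused by the measure implementations in Section 4.3:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CodeFragment:
        filename: str
        start_line: int   # inclusive
        end_line: int     # inclusive

        def lines(self) -> set:
            """The set of source lines covered by this fragment."""
            return {(self.filename, n)
                    for n in range(self.start_line, self.end_line + 1)}

    @dataclass(frozen=True)
    class ClonePair:
        cf1: CodeFragment
        cf2: CodeFragment
        clone_type: int   # 1, 2, or 3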
4 EXPERIMENTAL SETUP
This section explains how the experiment was set up.
Explanations of our general idea as well as in-depth
descriptions of the metrics used for the comparison will
be given.
4.1 Preparations
We analyzed C and Java systems. Using two different
languages and systems of different sizes decreases the
degree of bias.
We conducted the experiment in two phases: a test run
and the main experiment.
4.1.1 Test Run
The goal of the test run was to identify potential problems
for the main run. The test phase analyzed two small
C programs (bison and wget) and two small Java programs
(EIRC and spule).
In the test run, we noticed that some tools report the start
and end lines of the code fragments a line earlier or later if
the lines consist of only a brace. In practice, this difference is
irrelevant, but it complicates the comparison of clones from
different tools.
For this reason, the source code for the main run was
“normalized.” Empty lines were removed. Lines containing
only opening or closing braces were removed and the
braces were added to the line above, paying attention to
single-line comments, etc. (see Fig. 2).
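A sketch of such a normalization pass is given below (ours; unlike the experiment's actual tool, it ignores the corner cases around single-line comments mentioned above):

    def normalize(lines):
        """Remove empty lines and fold brace-only lines into the line above."""
        out = []
        for line in lines:
            stripped = line.strip()
            if not stripped:
                continue                    # drop empty lines
            if stripped in ("{", "}", "};") and out:
                out[-1] = out[-1].rstrip() + " " + stripped  # fold brace up
            else:
                out.append(line.rstrip())
        return out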
Tools using layout information [12] in order to detect
clones may be affected by this normalization, but to make
the comparison easier, all participants agreed to the
normalization.
4.1.2 Main Run
The main run consisted of the analysis of four programs
written in C and four Java programs. The size of the source
code of the programs varied from 11K SLOC to 235K SLOC.
Fig. 3 gives an overview of the programs used in the
experiment.
As some tools can be configured, we split the main run
into a mandatory and a voluntary part. The mandatory part
has to be done with the “default” settings of the particular
tool, whereas in the voluntary run, each scientist could tune
the settings of his or her tool based on her or his own
experimentation with the subject system in order to gain the
best results.
Fig. 2. Original code and the same code normalized.
Fig. 3. Overview of the programs used in the main run.

The tools were operated by the participants in a fixed
period of time (five weeks) and the results were collected
and evaluated by Stefan Bellon.
By consensus among all participants, only clones that are
at least six lines long were reported. Smaller clones tend to
be more spurious. Some of the tools applied a preprocessor
before they did the analysis; others worked directly on the
original program text.
4.2 Benchmark
We compared the individual results from the participants
against a reference corpus of “real clones” similarly to the
evaluation scheme in information retrieval. Each clone pair
suggested by a tool will be called candidate and each clone
pair of the reference corpus will be called reference in the
following.
The obvious, naive ways to create such a reference
corpus are:
1. union of candidates reported by different tools,
2. intersection of candidates reported by different tools,
and
3. candidates that were found jointly by N tools.
All three ways have deficiencies. The first alternative will
result in a precision of 1 for each tool as all the candidates a
tool reports are present in the reference corpus. Addition-
ally, we get many spurious false positives among the
references. The second alternative has the reverse effect: The
recall for all tools is 1 and we obtain many spurious true
negatives (it suffices that a single tool cannot detect a certain
clone). The third alternative is a compromise between the
first two and does not really help either. Apart from the fact
that we have to justify the chosen value of N, there can
always be N tools that report the same false positive, or only
N − 1 tools find a true positive.
Instead, we built the reference corpus manually. Stefan
Bellon, as an independent party (referred to as the oracle in the
following), looked at 2 percent of all 325,935 submitted
candidates and built a reference corpus by inserting
proposed candidates (sometimes after having modified
them slightly). In the following, we will use the term oracled
for all candidates viewed by Stefan Bellon to decide
whether or not to accept them as clones. Please note that
oracled includes rejected as well as accepted as is or in varied form.
An automatic selection process made sure that he did not
know which tools proposed the candidate and that the
2 percent was distributed equally, so that no tool is
preferred or discriminated against. As much as we wished
to classify more than just 2 percent of the candidates, it was
impossible considering our time constraints: It took 44 hours
to classify the first 1 percent and another 33 hours for the
second 1 percent.
We anticipated this problem in the design of the
experiment and took two countermeasures. First, one
evaluation was done after 1 percent of the candidates had
been oracled. Then, another 1 percent was oracled.
interesting observation (as can be seen in Section 5.3) was
that the relative quantitative results are almost the same.
Second, we injected clones that we did not disclose to the
participants in the given programs. The injected clones
helped us to get a better idea of the potential recall. Fig. 4
shows how many clone pairs of which clone type were
injected into the programs and how many were found by
the union of the tools.
The distribution of the injected clones among the
programs is not even, as Stefan Bellon started introducing
many clones in two programs and then noticed that he
would exceed his time constraints. After injecting the clone
pairs into the programs, they were added to the reference
corpus as well.
4.3 Methods of Evaluation—Metrics
This section defines the measurements taken to compare the
automatic clone detection tools.
The evaluation is based on clone pairs rather than
equivalence classes of clones because, only for type-1 and
type-2 clones, the underlying similarity function is reflexive,
symmetric, and transitive. The similarity of type-3 clones is
not transitive: If A is a type-3 clone of B and B one of C, the
similarity between A and C might be too low to qualify them
as a type-3 clone pair. Moreover, some tools report their clones not as
classes but as clone pairs.
In order to determine whether a candidate matches a
reference, we need a precise measurement. Pragmatically,
we did not insist on completely overlapping code fragments
but allowed a “sufficiently large” overlap between candi-
dates and reference clone pairs.
Definition 3. Overlap is the ratio of code common to two code fragments, $CF_1$ and $CF_2$, i.e., their intersection in relation to their union. Let $\mathit{lines}(CF)$ denote the set of lines of a code fragment $CF$; then, $\mathit{overlap}(CF_1, CF_2)$ is defined as:

$$\mathit{overlap}(CF_1, CF_2) = \frac{|\mathit{lines}(CF_1) \cap \mathit{lines}(CF_2)|}{|\mathit{lines}(CF_1) \cup \mathit{lines}(CF_2)|}.$$
Definition 4. Contained is the ratio of the code of one code fragment contained in another one. Let $\mathit{lines}(CF_1)$ denote the set of lines of the first code fragment and $\mathit{lines}(CF_2)$ the set of lines of the second code fragment; then, $\mathit{contained}(CF_1, CF_2)$ is defined as:

$$\mathit{contained}(CF_1, CF_2) = \frac{|\mathit{lines}(CF_1) \cap \mathit{lines}(CF_2)|}{|\mathit{lines}(CF_1)|}.$$
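Using the CodeFragment sketch from Section 3, both measures follow directly from the definitions (a sketch):

    def overlap(cf1: CodeFragment, cf2: CodeFragment) -> float:
        """Intersection of the fragments' line sets over their union."""
        return len(cf1.lines() & cf2.lines()) / len(cf1.lines() | cf2.lines())

    def contained(cf1: CodeFragment, cf2: CodeFragment) -> float:
        """Share of cf1's lines that also belong to cf2."""
        return len(cf1.lines() & cf2.lines()) / len(cf1.lines())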
Now, we use the above two definitions to create two
metrics that tell us how well a candidate hits a reference.
Fig. 4. Injected secret clones.

For the following two definitions to work, we have to make
sure that the two code fragments $CF_1$ and $CF_2$ that make up
a clone pair are ordered as follows:

$$\begin{aligned}
CF_1 < CF_2 \Leftrightarrow{} & (CF_1.\mathit{Filename} < CF_2.\mathit{Filename})\ \lor \\
& (CF_1.\mathit{Filename} = CF_2.\mathit{Filename} \land CF_1.\mathit{StartLine} < CF_2.\mathit{StartLine})\ \lor \\
& (CF_1.\mathit{Filename} = CF_2.\mathit{Filename} \land CF_1.\mathit{StartLine} = CF_2.\mathit{StartLine} \land CF_1.\mathit{EndLine} < CF_2.\mathit{EndLine}).
\end{aligned}$$

Thus, for a valid clone pair $CP = (CF_1, CF_2, t)$, $CF_1 < CF_2$ must always hold (code fragments of candidates in the wrong order are simply swapped in order to meet this criterion).
Definition 5. The good-value between two clone pairs $CP_1$ and $CP_2$ is defined as follows:

$$\mathit{good}(CP_1, CP_2) = \min(\mathit{overlap}(CP_1.CF_1, CP_2.CF_1),\ \mathit{overlap}(CP_1.CF_2, CP_2.CF_2)).$$

Two clone pairs $CP_1$ and $CP_2$ are thus called a good-match($p$) iff, for $p \in [0, 1]$, it holds that

$$\mathit{good}(CP_1, CP_2) \ge p.$$
We are using the minimum degree of overlap because it
is stricter than the maximum or average.
Definition 6. The ok-value between two clone pairs $CP_1$ and $CP_2$ is defined as follows:

$$\begin{aligned}
\mathit{ok}(CP_1, CP_2) = \min(&\max(\mathit{contained}(CP_1.CF_1, CP_2.CF_1),\ \mathit{contained}(CP_2.CF_1, CP_1.CF_1)),\\
&\max(\mathit{contained}(CP_1.CF_2, CP_2.CF_2),\ \mathit{contained}(CP_2.CF_2, CP_1.CF_2))).
\end{aligned}$$

Two clone pairs $CP_1$ and $CP_2$ are thus called an ok-match($p$) iff, for $p \in [0, 1]$, it holds that

$$\mathit{ok}(CP_1, CP_2) \ge p.$$
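Continuing the sketch from above, the two match criteria read as follows:

    def good(cp1: ClonePair, cp2: ClonePair) -> float:
        """Strict criterion: both fragments must overlap well."""
        return min(overlap(cp1.cf1, cp2.cf1),
                   overlap(cp1.cf2, cp2.cf2))

    def ok(cp1: ClonePair, cp2: ClonePair) -> float:
        """Weaker criterion: containment in at least one direction suffices."""
        return min(max(contained(cp1.cf1, cp2.cf1), contained(cp2.cf1, cp1.cf1)),
                   max(contained(cp1.cf2, cp2.cf2), contained(cp2.cf2, cp1.cf2)))

    def good_match(cp1, cp2, p=0.7):
        return good(cp1, cp2) >= p

    def ok_match(cp1, cp2, p=0.7):
        return ok(cp1, cp2) >= p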
The meanings of the good-value and ok-value can be
seen easily by way of an example. An ok-match($p$) applies
if, in at least one direction, a clone pair is contained in
another one for a portion of at least $p \cdot 100\%$;
that is, one fragment subsumes another one
sufficiently. However, this leads to the anomaly that one
clone pair can be a lot larger than the other one. With the
good-match($p$) criterion, this cannot happen, as the inter-
section of both clone pairs is used. The example of Fig. 5
illustrates this.
The vertical line in the middle symbolizes the linear
source code. The first source line is at the top; the last one is
at the bottom. The code fragments of the participating clone
pairs are represented by the filled rectangles. The left side
stands for the first clone pair; the right side stands for the
second. The dotted arrows symbolize how the code
fragments were copied. Let us assume that the left side is
the clone candidate and the right side is a clone pair from
the reference corpus. The first code fragment of the
candidate is one line shorter and starts and ends earlier
than the corresponding code fragment of the reference. The
second code fragment of the candidate, however, is
completely contained within the corresponding code frag-
ment of the reference but two lines shorter.
This yields a good-value as follows:
$$\mathit{good}(CP_1, CP_2) = \min\left(\tfrac{5}{8}, \tfrac{6}{8}\right) = \tfrac{5}{8} < 0.7 = p.$$
Thus, the example does not satisfy the criterion for a
good-match(0.7).
The ok-value is calculated as:
$$\mathit{ok}(CP_1, CP_2) = \min\left(\max\left(\tfrac{5}{6}, \tfrac{5}{7}\right),\ \max\left(\tfrac{6}{6}, \tfrac{6}{8}\right)\right) = \tfrac{5}{6} > 0.7 = p.$$
Thus, the example is an ok-match(0.7).
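The figure's numbers can be reproduced with the sketches above, using hypothetical line ranges of ours chosen to match the described geometry (candidate fragment 1 one line shorter, starting and ending earlier; candidate fragment 2 fully contained and two lines shorter):

    candidate = ClonePair(CodeFragment("f.c", 1, 6),    # 6 lines
                          CodeFragment("f.c", 20, 25),  # 6 lines
                          3)
    reference = ClonePair(CodeFragment("f.c", 2, 8),    # 7 lines
                          CodeFragment("f.c", 19, 26),  # 8 lines
                          3)
    print(good(candidate, reference))  # min(5/8, 6/8) = 0.625 -> no good-match(0.7)
    print(ok(candidate, reference))    # min(max(5/6, 5/7), max(1, 6/8)) = 5/6 -> ok-match(0.7)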
The following inequality always holds:
$$\mathit{ok}(CP_1, CP_2) \ge \mathit{good}(CP_1, CP_2).$$
The inequality means that a good-match($p$) is a stronger
criterion than an ok-match($p$) for the same value of $p$. In our
experiment, we decided to use a value of $p = 0.7$. Because
the threshold for the acceptable length of a clone was six lines in
the experiment, the choice of $p = 0.7$ allows two six-line
code fragments to be shifted by one line. For instance, if one
clone pair's fragment starts at line 1 and ends at 6, and the
other's fragment starts at line 2 and ends at 7, the degree of
overlap is $5/7 > 0.7 = p$. This choice accommodates the off-
by-one disagreement in the line reporting of the evaluated
tools. Because both measures are essentially measures of
overlap (good from the perspective of both fragments and
ok from the perspective of the smaller fragment), we chose
to use the same threshold for both measures for reasons of
uniformity.
Finally, a mapping from candidates to references has to
be established. Each candidate is mapped to the reference
that it best matches. The idea of the algorithm for establish-
ing this mapping is shown in Fig. 6 (in reality, a more
efficient implementation is used).
There are two dimensions to optimize for the mapping
from candidates onto references: the good and ok values.
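A naive version of this mapping could look as follows (a sketch of ours, not the algorithm of Fig. 6; preferring the good-value over the ok-value lexicographically is our assumption about how the two dimensions are combined):

    def map_candidates(candidates, references, p=0.7):
        """Map each candidate to the reference it matches best."""
        mapping = {}
        for cand in candidates:
            best, best_key = None, (0.0, 0.0)
            for ref in references:
                key = (good(cand, ref), ok(cand, ref))   # good first, then ok
                if key > best_key:
                    best, best_key = ref, key
            if best is not None and best_key[1] >= p:    # at least an ok-match(p)
                mapping[cand] = best
        return mapping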
Fig. 5. Example of overlapping of two clone pairs.
