Comparison and Evaluation
of Clone Detection Tools
Stefan Bellon, Rainer Koschke, Member, IEEE Computer Society, Giuliano Antoniol, Member, IEEE,
Jens Krinke, Member, IEEE Computer Society, and Ettore Merlo, Member, IEEE
Abstract—Many techniques for detecting duplicated source code (software clones) have been proposed in the past. However, it is not
yet clear how these techniques compare in terms of recall and precision as well as space and time requirements. This paper presents
an experiment that evaluates six clone detectors based on eight large C and Java programs (altogether almost 850 KLOC). Their clone
candidates were evaluated by one of the authors as an independent third party. The selected techniques cover the whole spectrum of
the state-of-the-art in clone detection. The techniques work on text, lexical and syntactic information, software metrics, and program
dependency graphs.
Index Terms—Redundant code, duplicated code, software clones.
1 INTRODUCTION
Reuse through copying and pasting source code is
common practice. So-called software clones are the
results. Sometimes these clones are modified slightly to
adapt them to their new environment or purpose. Several
authors report 7 percent to 23 percent code duplication [1],
[2], [3]; in one extreme case, authors reported 59 percent [4].
The problem with code cloning is that errors in the
original must be fixed in every copy. Other kinds of
maintenance changes, for instance, extensions or adapta-
tions, must be applied multiple times, too. Yet, it is usually
not documented where code was copied. In such cases, one
needs to detect them. For large systems, detection is feasible
only by automatic techniques. Consequently, several tech-
niques have been proposed to detect clones automatically
[1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]. The abundance
of techniques calls for quantitative evaluations.
This paper presents an experiment conducted in 2002 that
evaluates six clone detectors based on eight large C and Java
programs (altogether almost 850 KLOC). The experiment
involved several researchers who applied their tools on
these systems. Their clone candidates were evaluated by one
of the authors, namely, Stefan Bellon, as an independent
third party. The selected techniques cover the whole
spectrum of the state of the art in clone detection. The
techniques work on text, lexical and syntactic information,
software metrics, and program dependency graphs. Fig. 1
lists the participants, their tools, and the type of information
they leverage.
The remainder of this paper is organized as follows: The
next section describes the techniques we evaluated and
related techniques for clone detection. Section 3 gives an
operational structural definition of clone types used in the
evaluation. The setup for the experiment is described in
Section 4 and its results are presented in Section 5. Section 6
describes related research in clone detection evaluation.
2 CLONE DETECTION
Software clone detection is an active field of research. This
section summarizes research in clone detection.
Textual comparison. The approach of Ducasse et al.
compares whole lines to each other textually [4]. To
increase performance, lines are partitioned using a hash
function for strings. Only lines in the same partition are
compared. The result is visualized as a dot plot, where
each dot indicates a pair of cloned lines. Clones may be
found as certain patterns in those dot plots visually.
Consecutive lines can be summarized to larger cloned
sequences automatically as uninterrupted diagonals or
displaced diagonals in the dot plot.
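As an illustration of this line-based scheme, the following minimal sketch (ours, not the tool's actual code; function and variable names are assumptions) hashes normalized lines into partitions and reports pairs of identical lines, i.e., the dots of such a dot plot:

    from collections import defaultdict

    def line_clone_pairs(lines):
        """Report pairs of positions that hold textually identical lines.

        Each reported pair corresponds to one dot in a dot plot.
        """
        buckets = defaultdict(list)              # hash partition: text -> positions
        for i, line in enumerate(lines):
            normalized = " ".join(line.split())  # collapse whitespace
            if normalized:                       # skip empty lines
                buckets[normalized].append(i)
        for positions in buckets.values():
            for a_idx, a in enumerate(positions):
                for b in positions[a_idx + 1:]:
                    yield (a, b)                 # dot at coordinates (a, b)

Uninterrupted diagonals of such dots, i.e., runs of pairs (a, b), (a+1, b+1), and so on, then correspond to cloned line sequences.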
Johnson [13] uses the efficient string matching by Karp
and Rabin [14] based on fingerprints.
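Karp-Rabin matching computes a rolling fingerprint of every text window so that each shift is checked in constant expected time, with a character comparison only on a hash hit. A minimal sketch of the idea (ours, not Johnson's implementation):

    def karp_rabin_find(text: str, pattern: str, base: int = 256,
                        mod: int = (1 << 61) - 1):
        """Yield all start indices where pattern occurs in text."""
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return
        high = pow(base, m - 1, mod)        # weight of the leading character
        p_hash = w_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            w_hash = (w_hash * base + ord(text[i])) % mod
        for i in range(n - m + 1):
            if w_hash == p_hash and text[i:i + m] == pattern:  # verify hit
                yield i
            if i + m < n:                   # roll the window one character
                w_hash = ((w_hash - ord(text[i]) * high) * base
                          + ord(text[i + m])) % mod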
Token comparison. Baker’s technique is also a line-
based comparison. Instead of a string comparison, the token
sequences of lines are compared efficiently through a suffix
tree. First, each token sequence for a whole line is
summarized by a so-called functor that abstracts from
concrete values of identifiers and literals [1]. The functor
characterizes this token sequence uniquely. Assigning
functors can be viewed as a perfect hash function. Concrete
values of identifiers and literals are captured as parameters
. S. Bellon is with Axivion GmbH, Nobelstr. 15, 70569 Stuttgart, Germany. E-mail: bellon@axivion.com.
. R. Koschke is with the Universität Bremen, Fachbereich 03, Postfach 33 04 40, 28334 Bremen, Germany. E-mail: koschke@tzi.de.
. G. Antoniol is with the Département de Génie Informatique, École Polytechnique de Montréal, Pavillons Lassonde, MacKay-Lassonde, 2500, chemin de Polytechnique, Montréal (Quebec), Canada, H3T 1J4. E-mail: antoniol@ieee.org.
. J. Krinke is with the Fern-Universität in Hagen, Universitätsstr. 27, 58097 Hagen, Germany. E-mail: krinke@ieee.org.
. E. Merlo is with the Department of Computer Engineering, Ecole Polytechnique of Montreal, PO Box 6079, Station Downtown, Montreal (Quebec), Canada, H3C 3A7. E-mail: ettore.merlo@polymtl.ca.
Manuscript received 11 Apr. 2006; revised 21 Oct. 2006; accepted 14 May 2007; published online 10 July 2007.
Recommended for acceptance by M. Harman.
For information on obtaining reprints of this article, please send e-mail to: tse@computer.org, and reference IEEECS Log Number TSE-0089-0406.
Digital Object Identifier no. 10.1109/TSE.2007.70725.

to this functor. An encoding of these parameters abstracts
from their concrete values but not from their order so that
code fragments may be detected that differ only in
systematic renaming of parameters. Two lines are clones
if they match in their functors and parameter encoding.
The functors and their parameters are summarized in a
suffix tree, a trie that represents all suffixes of the program
in a compact fashion. A suffix tree can be built in time and
space linear to the input length [7], [15]. Every branch in the
suffix tree represents program suffixes with common
beginnings, hence, cloned sequences.
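The functor abstraction and parameter encoding can be sketched as follows (a simplification of ours; token classification is assumed to be given):

    def abstract_line(tokens, is_parameter):
        """Split a token sequence into a functor and its parameter list.

        Tokens classified as parameters (identifiers, literals) are replaced
        by the placeholder 'P' in the functor and collected separately.
        """
        functor, params = [], []
        for tok in tokens:
            if is_parameter(tok):
                functor.append("P")
                params.append(tok)
            else:
                functor.append(tok)
        return tuple(functor), params

    def encode_parameters(params):
        """Position-based encoding: 0 on first occurrence, otherwise the
        distance to the previous occurrence of the same parameter."""
        last_seen, encoding = {}, []
        for i, p in enumerate(params):
            encoding.append(i - last_seen[p] if p in last_seen else 0)
            last_seen[p] = i
        return tuple(encoding)

    # 'x = x + y' and 'a = a + b' both yield the functor ('P','=','P','+','P')
    # and the encoding (0, 1, 0), so they match despite systematic renaming.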
Kamiya et al. increase recall for superficially different yet
equivalent sequences by normalizing the token sequences [9].
Because syntax is not taken into account, the found
clones may overlap different syntactic units, which cannot
be replaced through functional abstraction. In either a
preprocessing [16], [17] or a postprocessing [18] step, clones
that completely fall in syntactic blocks can be found if block
delimiters are known.
Metric comparison. Merlo et al. gather different metrics
for code fragments and compare these metric vectors
instead of comparing code directly [2], [3], [12], [19]. An
allowable distance (for instance, Euclidean distance) for
these metric vectors can be used as a hint for similar code.
Specific metric-based techniques were also proposed for
clones in Web sites [20], [21].
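A sketch of the metric-based idea follows; the concrete metrics and the threshold are ours for illustration, not Merlo et al.'s exact set:

    import math

    def metric_vector(fragment_lines):
        """A toy metric vector: (LOC, token count, branch keywords, calls)."""
        text = "\n".join(fragment_lines)
        return (
            len(fragment_lines),
            len(text.split()),
            sum(text.count(k) for k in ("if", "while", "for", "switch")),
            text.count("("),
        )

    def metrics_similar(frag_a, frag_b, max_distance=2.0):
        """Flag two fragments as potential clones if their metric vectors
        lie within a Euclidean distance threshold."""
        va, vb = metric_vector(frag_a), metric_vector(frag_b)
        distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
        return distance <= max_distance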
Comparison of abstract syntax trees (AST). Baxter et al.
partition subtrees of the abstract syntax tree of a program
based on a hash function and then compare subtrees in the
same partition through tree matching (allowing for some
divergences) [8]. A similar approach was proposed earlier
by Yang [22] using dynamic programming to find differ-
ences between two versions of the same file.
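The hash-based partitioning of subtrees can be sketched as follows (our simplification; Baxter et al. additionally tolerate small divergences during the tree matching within a partition):

    from collections import defaultdict
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        label: str                     # node kind, identifiers abstracted away
        children: list = field(default_factory=list)

    def subtree_hash(node):
        """Structural hash over node labels, ignoring identifier spellings."""
        return hash((node.label, tuple(subtree_hash(c) for c in node.children)))

    def clone_subtree_pairs(root, min_size=5):
        buckets = defaultdict(list)
        def visit(node):
            size = 1 + sum(visit(c) for c in node.children)
            if size >= min_size:                    # skip trivial subtrees
                buckets[subtree_hash(node)].append(node)
            return size
        visit(root)
        for nodes in buckets.values():
            for i, a in enumerate(nodes):
                for b in nodes[i + 1:]:
                    yield (a, b)    # candidates; tree matching confirms them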
Comparison of program dependency graphs (PDG).
Control and data flow dependencies of a function may be
represented by a program dependency graph; clones may
be identified as isomorphic subgraphs [10], [11]; because
this problem is NP-hard, Krinke uses approximative
solutions.
Other techniques. Marcus and Maletic use latent semantic
indexing (an information retrieval technique) to identify
fragments in which similar names occur [23]. Leitao [24]
combines syntactic and semantic techniques through a
combination of specialized comparison functions that com-
pare various aspects (similar call subgraphs, commutative
operators, user-defined equivalences, and transformations
into canonical syntactic forms). Each comparison function
yields evidence that is summarized in an evidence-factor
model yielding a clone likelihood. Wahler et al. [25] and
Li et al. [26] cast the search for similar fragments as a data
mining problem. Statement sequences are summarized to
item sets. An adapted data mining algorithm searches for
frequent item sets.
3 BASIC DEFINITIONS
This section presents definitions that form the foundation
for the evaluation. These definitions represent the con-
sensus among all participants of the experiment accounting
for the different backgrounds of the participants.
The foremost question to answer is, “What is a clone?”
Roughly speaking, two code fragments form a clone pair if
they are similar enough according to a given definition of
similarity. Different definitions of similarity and associated
levels of tolerance allow for different kinds and degrees of
clones.
A piece of code, A, is similar to another piece of code, B,
if B subsumes the functionality of A; in other words, they
have “similar” preconditions and postconditions. We call
such a pair ðA; BÞ a semantic clone. Unfortunately, detecting
semantic clones is undecidable in general.
Another definition of similarity considers the program
text: Two code fragments form a clone pair if their program
text is similar. The two code fragments may or may not be
equivalent semantically. These kinds of clones are often the
result of copy&paste; that is, the programmer selects a code
fragment and copies it to another location.
Copy&paste is a frequent programming practice and an
example of ad hoc reuse. The automatic clone detectors
evaluated in this experiment find clones that are similar in
program text and, hence, the latter definition of a clone pair
is adopted in this paper.
Clones of this nature may be compared on the basis of
the program text that was copied. We can distinguish the
following types of clones:
. Type 1 is an exact copy without modifications
(except for white space and comments).
. Type 2 is a syntactically identical copy; only
variable, type, or function identifiers were changed.
. Type 3 is a copy with further modifications; state-
ments were changed, added, or removed.
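For illustration (a toy example of ours, shown in Python), the second fragment below is a type-2 clone of the first (identical syntax, renamed identifiers), while the third is a type-3 clone (a statement was added); an unmodified copy of the first would be a type-1 clone:

    # Original fragment.
    def total_price(items):
        total = 0
        for item in items:
            total += item.price
        return total

    # Type-2 clone: same syntax, identifiers systematically renamed.
    def sum_weights(entries):
        acc = 0
        for entry in entries:
            acc += entry.weight
        return acc

    # Type-3 clone: a statement was added.
    def total_discounted(items):
        total = 0
        for item in items:
            total += item.price
        total *= 0.9          # added statement
        return total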
Some of the tools report so-called parameterized clones
[6], which are a subset of type-2 clones. Two code fragments
A and B are a parameterized clone pair if there is a bijective
mapping from A's identifiers onto B's identifiers that
allows an identifier substitution in A resulting in A′, where
A′ is a type-1 clone of B (and vice versa).
Differentiating parameterized clones would have re-
quired us to check for consistent renaming when we
evaluated the clone pairs proposed by the tools. Because
the validation was done completely manually and because
not all tools make this distinction, we did not distinguish
parameterized clones from other type-2 clones.
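Such a consistent-renaming check could be sketched as follows (ours; the token classification is assumed to be given):

    def is_parameterized_clone(tokens_a, tokens_b, is_identifier):
        """Check for a bijective identifier mapping that turns A into B."""
        if len(tokens_a) != len(tokens_b):
            return False
        a_to_b, b_to_a = {}, {}
        for ta, tb in zip(tokens_a, tokens_b):
            if is_identifier(ta) and is_identifier(tb):
                if a_to_b.setdefault(ta, tb) != tb:   # ta mapped elsewhere
                    return False
                if b_to_a.setdefault(tb, ta) != ta:   # tb already an image
                    return False
            elif ta != tb:            # non-identifiers must match exactly
                return False
        return True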
While type-1 and type-2 clones are precisely defined and
form an equivalence relation, the definition of type-3 clones
is vague. Some tools consider two consecutive type-1 or
type-2 clones together forming a type-3 clone if the gap in
between is below a certain threshold of lines. Another
precise definition could be based on a threshold for the
578 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 33, NO. 9, SEPTEMBER 2007
Fig. 1. Participating scientists. CloneDR is a trademark of Semantic
Designs Inc.

Levenshtein Distance, that is, the number of deletions,
insertions, or substitutions required to transform one string
into another.
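For reference, the Levenshtein distance is computed by the standard dynamic program below (a sketch of ours; a type-3 similarity threshold would then relate this distance to the fragment lengths):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of deletions, insertions, and substitutions
        needed to transform a into b."""
        prev = list(range(len(b) + 1))   # distances from "" to prefixes of b
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # substitute
            prev = curr
        return prev[-1]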
Because there is no consensus on a suitable similarity
measure for type-3 clones, all clones reported by the
evaluated tools that are not type-1 or type-2 clones fall into
the category type-3 in our study. It is then the decision of
the human analyst whether type-3 clone candidates are real
clones.
We are now in a position to define clone pairs more
precisely:
Definition 1. A clone (pair) is a triple $(f_1, f_2, t)$ where $f_1$ and $f_2$ are two similar code fragments and $t$ is the associated type of similarity (type 1, 2, or 3).
As a matter of fact, in the evaluation, we further
constrained the above definition by the additional require-
ment that clones may be replaced through function calls,
that is, that they are syntactically complete. Some of the
tools report code fragments that are at different syntactic
nesting levels (e.g., a fragment consisting of parts of two
different consecutive function bodies), which could indeed
be replaced through macros; but a maintenance programmer
would never want to replace them because the replacement
would make it hard to understand the program.
So, the next question is, “What is a code fragment,
exactly?” We could treat a sequence of tokens as a code
fragment. Yet, the notion of a token differs from tool to tool
(e.g., are preprocessor tokens considered?) and not all tools
report token sequences. Rather than tokens, our definition of
code fragments is based on text. Tokens may be mapped onto
text and the source text is a less debatable point of reference
(it is only less debatable rather than not at all debatable
because of macros and preprocessor directives in whose
presence one could use the preprocessed or original text).
Program text may be referenced by filename and row
and column information. Unfortunately, not all tools report
column information. Thus, the least common denominator
for the definition of a code fragment for our evaluation is
filename and row information.
Definition 2. A code fragment is a tuple $(f, s, e)$ which consists of the name of the source file $f$, the start line $s$, and the end line $e$ of the fragment. Both line numbers are inclusive.
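These two definitions translate directly into small record types; the following sketch (ours) is reused by the measure implementations in Section 4.3:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CodeFragment:
        filename: str
        start_line: int   # inclusive
        end_line: int     # inclusive

        def lines(self) -> set:
            """The set of source lines covered by this fragment."""
            return {(self.filename, n)
                    for n in range(self.start_line, self.end_line + 1)}

    @dataclass(frozen=True)
    class ClonePair:
        cf1: CodeFragment
        cf2: CodeFragment
        clone_type: int   # 1, 2, or 3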
4 EXPERIMENTAL SETUP
This section explains how the experiment was set up.
Explanations of our general idea as well as in-depth
descriptions of the metrics used for the comparison will
be given.
4.1 Preparations
We analyzed C and Java systems. Using two different
languages and systems of different sizes decreases the
degree of bias.
We conducted the experiment in two phases: a test run
and the main experiment.
4.1.1 Test Run
The goal of the test run was to identify potential problems
for the main run. The test phase analyzed two small
C programs (bison and wget) and two small Java programs
(EIRC and spule).
In the test run, we noticed that some tools report the start
and end lines of the code fragments a line earlier or later if
the lines consist of only a brace. In practice, this difference is
irrelevant, but it complicates the comparison of clones from
different tools.
For this reason, the source code for the main run was
“normalized.” Empty lines were removed. Lines containing
only opening or closing braces were removed and the
braces were added to the line above, paying attention to
single-line comments, etc. (see Fig. 2).
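A sketch of such a normalization pass is given below (ours; unlike the experiment's actual tool, it ignores the corner cases around single-line comments mentioned above):

    def normalize(lines):
        """Remove empty lines and fold brace-only lines into the line above."""
        out = []
        for line in lines:
            stripped = line.strip()
            if not stripped:
                continue                    # drop empty lines
            if stripped in ("{", "}", "};") and out:
                out[-1] = out[-1].rstrip() + " " + stripped  # fold brace up
            else:
                out.append(line.rstrip())
        return out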
Tools using layout information [12] in order to detect
clones may be affected by this normalization, but to make
the comparison easier, all participants agreed to the
normalization.
4.1.2 Main Run
The main run consisted of the analysis of four programs
written in C and four Java programs. The size of the source
code of the programs varied from 11K SLOC to 235K SLOC.
Fig. 3 gives an overview of the programs used in the
experiment.
As some tools can be configured, we split the main run
into a mandatory and a voluntary part. The mandatory part
has to be done with the “default” settings of the particular
tool, whereas in the voluntary run, each scientist could tune
the settings of his or her tool based on her or his own
experimentation with the subject system in order to gain the
best results.
Fig. 2. Original code and the same code normalized.
Fig. 3. Overview of the programs used in the main run.

The tools were operated by the participants in a fixed
period of time (five weeks) and the results were collected
and evaluated by Stefan Bellon.
By consensus among all participants, only clones that are
at least six lines long were reported. Smaller clones tend to
be more spurious. Some of the tools applied a preprocessor
before they did the analysis; others worked directly on the
original program text.
4.2 Benchmark
We compared the individual results from the participants
against a reference corpus of “real clones” similarly to the
evaluation scheme in information retrieval. Each clone pair
suggested by a tool will be called candidate and each clone
pair of the reference corpus will be called reference in the
following.
The obvious, naive ways to create such a reference
corpus are:
1. union of candidates reported by different tools,
2. intersection of candidates reported by different tools,
and
3. candidates that were found jointly by N tools.
All three ways have deficiencies. The first alternative will
result in a precision of 1 for each tool as all the candidates a
tool reports are present in the reference corpus. Addition-
ally, we get many spurious false positives among the
references. The second alternative has the reverse effect: The
recall for all tools is 1 and we obtain many spurious true
negatives (it suffices that a single tool cannot detect a certain
clone). The third alternative is a compromise between the
first two and does not really help either. Apart from the fact
that we have to justify the chosen value of N, there can
always be N tools that report the same false positive, or only
N − 1 tools find a true positive.
Instead, we built the reference corpus manually. Stefan
Bellon, as an independent party (referred to as the oracle in the
following), looked at 2 percent of all 325,935 submitted
candidates and built a reference corpus by inserting
proposed candidates (sometimes after having modified
them slightly). In the following, we will use the term oracled
for all candidates viewed by Stefan Bellon to decide
whether or not to accept them as clones. Please note that
oracled includes rejected as well as accepted as is or in varied form.
An automatic selection process made sure that he did not
know which tools proposed the candidate and that the
2 percent was distributed equally, so that no tool is
preferred or discriminated against. As much as we wished
to classify more than just 2 percent of the candidates, it was
impossible considering our time constraints: It took 44 hours
to classify the first 1 percent and another 33 hours for the
second 1 percent.
We anticipated this problem in the design of the
experiment and took two countermeasures. First, one
evaluation was done after 1 percent of the candidates had
been oracled. Then, another 1 percent was oracled.
interesting observation (as can be seen in Section 5.3) was
that the relative quantitative results are almost the same.
Second, we injected clones that we did not disclose to the
participants in the given programs. The injected clones
helped us to get a better idea of the potential recall. Fig. 4
shows how many clone pairs of which clone type were
injected into the programs and how many were found by
the union of the tools.
The distribution of the injected clones among the
programs is not even, as Stefan Bellon started introducing
many clones in two programs and then noticed that he
would exceed his time constraints. After injecting the clone
pairs into the programs, they were added to the reference
corpus as well.
4.3 Methods of Evaluation—Metrics
This section defines the measurements taken to compare the
automatic clone detection tools.
The evaluation is based on clone pairs rather than
equivalence classes of clones because, only for type-1 and
type-2 clones, the underlying similarity function is reflexive,
symmetric, and transitive. The similarity of type-3 clones is
not transitive: If A is a type-3 clone of B and B one of C, the
similarity between A and C might be too low to qualify them
as a type-3 clone pair. Moreover, some tools report their clones not as
classes but as clone pairs.
In order to determine whether a candidate matches a
reference, we need a precise measurement. Pragmatically,
we did not insist on completely overlapping code fragments
but allowed a “sufficiently large” overlap between candi-
dates and reference clone pairs.
Definition 3. Overlap is the ratio of code common to two code fragments, $CF_1$ and $CF_2$, i.e., their intersection in relation to their union. Let $\mathit{lines}(CF)$ denote the set of lines of a code fragment $CF$; then, $\mathit{overlap}(CF_1, CF_2)$ is defined as:

$$\mathit{overlap}(CF_1, CF_2) = \frac{|\mathit{lines}(CF_1) \cap \mathit{lines}(CF_2)|}{|\mathit{lines}(CF_1) \cup \mathit{lines}(CF_2)|}.$$
Definition 4. Contained is the ratio of the code of one code fragment contained in another one. Let $\mathit{lines}(CF_1)$ denote the set of lines of the first code fragment and $\mathit{lines}(CF_2)$ the set of lines of the second code fragment; then, $\mathit{contained}(CF_1, CF_2)$ is defined as:

$$\mathit{contained}(CF_1, CF_2) = \frac{|\mathit{lines}(CF_1) \cap \mathit{lines}(CF_2)|}{|\mathit{lines}(CF_1)|}.$$
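Using the CodeFragment sketch from Section 3, both measures follow directly from the definitions (a sketch):

    def overlap(cf1: CodeFragment, cf2: CodeFragment) -> float:
        """Intersection of the fragments' line sets over their union."""
        return len(cf1.lines() & cf2.lines()) / len(cf1.lines() | cf2.lines())

    def contained(cf1: CodeFragment, cf2: CodeFragment) -> float:
        """Share of cf1's lines that also belong to cf2."""
        return len(cf1.lines() & cf2.lines()) / len(cf1.lines())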
Now, we use the above two definitions to create two
metrics that tell us how well a candidate hits a reference.
Fig. 4. Injected secret clones.

For the following two definitions to work, we have to make
sure that the two code fragments $CF_1$ and $CF_2$ that make up
a clone pair are ordered as follows:

$$\begin{aligned}
CF_1 < CF_2 \Leftrightarrow{} & (CF_1.\mathit{Filename} < CF_2.\mathit{Filename})\ \lor \\
& (CF_1.\mathit{Filename} = CF_2.\mathit{Filename} \land CF_1.\mathit{StartLine} < CF_2.\mathit{StartLine})\ \lor \\
& (CF_1.\mathit{Filename} = CF_2.\mathit{Filename} \land CF_1.\mathit{StartLine} = CF_2.\mathit{StartLine} \land CF_1.\mathit{EndLine} < CF_2.\mathit{EndLine}).
\end{aligned}$$

Thus, for a valid clone pair $CP = (CF_1, CF_2, t)$, $CF_1 < CF_2$ must always hold (code fragments of candidates in the wrong order are simply swapped in order to meet this criterion).
Definition 5. The good-value between two clone pairs $CP_1$ and $CP_2$ is defined as follows:

$$\mathit{good}(CP_1, CP_2) = \min(\mathit{overlap}(CP_1.CF_1, CP_2.CF_1),\ \mathit{overlap}(CP_1.CF_2, CP_2.CF_2)).$$

Two clone pairs $CP_1$ and $CP_2$ are thus called a good-match($p$) iff, for $p \in [0, 1]$, it holds that

$$\mathit{good}(CP_1, CP_2) \ge p.$$
We are using the minimum degree of overlap because it
is stricter than the maximum or average.
Definition 6. The ok-value between two clone pairs $CP_1$ and $CP_2$ is defined as follows:

$$\begin{aligned}
\mathit{ok}(CP_1, CP_2) = \min(&\max(\mathit{contained}(CP_1.CF_1, CP_2.CF_1),\ \mathit{contained}(CP_2.CF_1, CP_1.CF_1)),\\
&\max(\mathit{contained}(CP_1.CF_2, CP_2.CF_2),\ \mathit{contained}(CP_2.CF_2, CP_1.CF_2))).
\end{aligned}$$

Two clone pairs $CP_1$ and $CP_2$ are thus called an ok-match($p$) iff, for $p \in [0, 1]$, it holds that

$$\mathit{ok}(CP_1, CP_2) \ge p.$$
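Continuing the sketch from above, the two match criteria read as follows:

    def good(cp1: ClonePair, cp2: ClonePair) -> float:
        """Strict criterion: both fragments must overlap well."""
        return min(overlap(cp1.cf1, cp2.cf1),
                   overlap(cp1.cf2, cp2.cf2))

    def ok(cp1: ClonePair, cp2: ClonePair) -> float:
        """Weaker criterion: containment in at least one direction suffices."""
        return min(max(contained(cp1.cf1, cp2.cf1), contained(cp2.cf1, cp1.cf1)),
                   max(contained(cp1.cf2, cp2.cf2), contained(cp2.cf2, cp1.cf2)))

    def good_match(cp1, cp2, p=0.7):
        return good(cp1, cp2) >= p

    def ok_match(cp1, cp2, p=0.7):
        return ok(cp1, cp2) >= p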
The meanings of the good-value and ok-value can be
seen easily by way of an example. An ok-match($p$) applies
if, in at least one direction, a clone pair is contained in
another one for a portion of at least $p \cdot 100\%$;
that is, one fragment subsumes another one
sufficiently. However, this leads to the anomaly that one
clone pair can be a lot larger than the other one. With the
good-match($p$) criterion, this cannot happen, as the inter-
section of both clone pairs is used. The example of Fig. 5
illustrates this.
The vertical line in the middle symbolizes the linear
source code. The first source line is at the top; the last one is
at the bottom. The code fragments of the participating clone
pairs are represented by the filled rectangles. The left side
stands for the first clone pair; the right side stands for the
second. The dotted arrows symbolize how the code
fragments were copied. Let us assume that the left side is
the clone candidate and the right side is a clone pair from
the reference corpus. The first code fragment of the
candidate is one line shorter and starts and ends earlier
than the corresponding code fragment of the reference. The
second code fragment of the candidate, however, is
completely contained within the corresponding code frag-
ment of the reference but two lines shorter.
This yields a good-value as follows:
$$\mathit{good}(CP_1, CP_2) = \min\left(\tfrac{5}{8}, \tfrac{6}{8}\right) = \tfrac{5}{8} < 0.7 = p.$$
Thus, the example does not satisfy the criterion for a
good-match(0.7).
The ok-value is calculated as:
$$\mathit{ok}(CP_1, CP_2) = \min\left(\max\left(\tfrac{5}{6}, \tfrac{5}{7}\right),\ \max\left(\tfrac{6}{6}, \tfrac{6}{8}\right)\right) = \tfrac{5}{6} > 0.7 = p.$$
Thus, the example is an ok-match(0.7).
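The figure's numbers can be reproduced with the sketches above, using hypothetical line ranges of ours chosen to match the described geometry (candidate fragment 1 one line shorter, starting and ending earlier; candidate fragment 2 fully contained and two lines shorter):

    candidate = ClonePair(CodeFragment("f.c", 1, 6),    # 6 lines
                          CodeFragment("f.c", 20, 25),  # 6 lines
                          3)
    reference = ClonePair(CodeFragment("f.c", 2, 8),    # 7 lines
                          CodeFragment("f.c", 19, 26),  # 8 lines
                          3)
    print(good(candidate, reference))  # min(5/8, 6/8) = 0.625 -> no good-match(0.7)
    print(ok(candidate, reference))    # min(max(5/6, 5/7), max(1, 6/8)) = 5/6 -> ok-match(0.7)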
The following inequality always holds:
$$\mathit{ok}(CP_1, CP_2) \ge \mathit{good}(CP_1, CP_2).$$
The inequality means that a good-match($p$) is a stronger
criterion than an ok-match($p$) for the same value of $p$. In our
experiment, we decided to use a value of $p = 0.7$. Because
the threshold for the acceptable length of a clone was six lines in
the experiment, the choice of $p = 0.7$ allows two six-line
code fragments to be shifted by one line. For instance, if one
clone pair's fragment starts at line 1 and ends at 6, and the
other's fragment starts at line 2 and ends at 7, the degree of
overlap is $5/7 > 0.7 = p$. This choice accommodates the off-
by-one disagreement in the line reporting of the evaluated
tools. Because both measures are essentially measures of
overlap (good from the perspective of both fragments and
ok from the perspective of the smaller fragment), we chose
to use the same threshold for both measures for reasons of
uniformity.
Finally, a mapping from candidates to references has to
be established. Each candidate is mapped to the reference
that it best matches. The idea of the algorithm for establish-
ing this mapping is shown in Fig. 6 (in reality, a more
efficient implementation is used).
There are two dimensions to optimize for the mapping
from candidates onto references: the good and ok values.
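A naive version of this mapping could look as follows (a sketch of ours, not the algorithm of Fig. 6; preferring the good-value over the ok-value lexicographically is our assumption about how the two dimensions are combined):

    def map_candidates(candidates, references, p=0.7):
        """Map each candidate to the reference it matches best."""
        mapping = {}
        for cand in candidates:
            best, best_key = None, (0.0, 0.0)
            for ref in references:
                key = (good(cand, ref), ok(cand, ref))   # good first, then ok
                if key > best_key:
                    best, best_key = ref, key
            if best is not None and best_key[1] >= p:    # at least an ok-match(p)
                mapping[cand] = best
        return mapping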
Fig. 5. Example of overlapping of two clone pairs.
