On Finding Duplication and Near-Duplication in Large Software Systems

Brenda S. Baker
AT&T Bell Laboratories
600 Mountain Ave.
Murray Hill, NJ 07974
bsb@research.att.com
Abstract

This paper describes how a program called dup can be used to locate instances of duplication or near-duplication in a software system. Dup reports both textually identical sections of code and sections that are the same textually except for systematic substitution of one set of variable names and constants for another. Further processing locates longer sections of code that are the same except for other small modifications. Experimental results from running dup on millions of lines from two large software systems show dup to be both effective at locating duplication and fast. Applications could include identifying sections of code that should be replaced by procedures, elimination of duplication during reengineering of the system, redocumentation to include references to copies, and debugging.
1 Introduction
This paper focuses on locating duplication or near-duplication in a large software system as an aid in maintenance and reengineering. Duplication can become a problem within large software systems if programmers make modifications by copying and modifying sections of code. It has long been known that copying can make the code larger, more complex, and more difficult to maintain. In particular, when a bug has been found in one copy, a bug fix may be made in the copy where the bug was found, but not in the other copies. Nevertheless, copying and modifying code may occur for several reasons. First, making a copy and modifying it may be simpler than more major revisions and therefore less likely to introduce new bugs immediately, especially when the programmer making the bug fixes is not the one who wrote the original code. Second, if multiple versions are created, the interactions between the versions may become intractable as the versions grow apart over time, and eventually it may seem simpler to maintain some of the code separately. Third, process management may encourage duplication, e.g. if evaluation of programmers' performance is based in part on how much new code they write, so that programmers have little incentive to rewrite old code. Fourth, copies may be required because of the need to avoid the overhead of a procedure call for efficiency considerations.
This paper addresses the problem of locating exact or near-duplication of code that was created by copying and modifying code with an editor. When code is copied and modified via an editor, the types of changes made may include insertions and deletions of lines, modifications within lines, and global substitutions. The goal is to find copies that are substantially the same line by line except for global substitutions, so that one copy is a variant of the other, rather than sections of code that have evolved to be mostly different. In software reuse terminology, the problem is to locate instances of ad-hoc black-box or white-box software reuse [16] within a software system. Thus, this is a problem in reverse engineering. Moreover, the systems to be examined may be legacy systems running to millions of lines of code.
The approach of this paper is to find maximal sections of code over a threshold length that are either exactly the same, or the same except for a global substitution of names of parameters such as variables and constants, e.g. all occurrences of x changed to y and all occurrences of pchar changed to pc. In the former case, we call the two sections of code an exact match, and in the latter case, a parameterized match (p-match). Thus, the approach is text-based and line-based. Comments and white space are ignored. The tool to find maximal exact or parameterized matches is a program called dup. To find longer sections of code that were copied and then changed locally in the middle, the exact or parameterized matches can be further analyzed to locate pairs or sequences of matches that match sections of code separated by small gaps; alternatively, such regions can be found by examining scatter plots.
An example of a p-match is given in Figure 1, which contains two code fragments taken from the X Window System [18] source code. The fragments are identical except for the differing indentation (which is ignored by dup) and the correspondence between the variable names pfi/pfh and the pairs of structure member names lbearing/left and rbearing/right. These fragments are excerpted from two 34-line sections of code that are a p-match with these parameter correspondences.
Fragment 1:

    copy_number(&pmin, &pmax,
                pfi->min_bounds.lbearing,
                pfi->max_bounds.lbearing);
    *pmin++ = *pmax++ = ',';
    copy_number(&pmin, &pmax,
                pfi->min_bounds.rbearing,
                pfi->max_bounds.rbearing);
    *pmin++ = *pmax++ = ',';

Fragment 2:

    copy_number(&pmin, &pmax,
                pfh->min_bounds.left,
                pfh->max_bounds.left);
    *pmin++ = *pmax++ = ',';
    copy_number(&pmin, &pmax,
                pfh->min_bounds.right,
                pfh->max_bounds.right);
    *pmin++ = *pmax++ = ',';
Figure 1: Two fragments of code from source for the X Window System.
In addition to finding possibly distant sections of code that match, dup finds locally repetitive sections of code where the same short section is repeated immediately with different parameters, typically with names ending in a number; if an array were used instead of the numbered parameters, the repetitive code could be replaced by a loop. Such sections could have been generated automatically by a program generator, but instances have been found that were created by hand from a specification for which the specification language lacked arrays within structures.
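As an illustration of this pattern (a hypothetical example with invented struct and field names, not code from any system dup was run on), the repeated statements differ only in a numbered name and collapse to a loop once an array is used:

    #include <stdio.h>

    struct rec_numbered { int field1, field2, field3; };
    struct rec_array    { int field[3]; };

    int main(void)
    {
        struct rec_numbered rn = { 1, 2, 3 };
        struct rec_array    ra = { { 1, 2, 3 } };

        /* repetitive form: dup would report the repetitions as matches
           with field1/field2/field3 as the changing parameters */
        int total = 0;
        total += rn.field1;
        total += rn.field2;
        total += rn.field3;

        /* equivalent loop form once an array replaces the numbered names */
        int total2 = 0;
        for (int i = 0; i < 3; i++)
            total2 += ra.field[i];

        printf("%d %d\n", total, total2);
        return 0;
    }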
For programmers, dup describes the matching sections of code and the correspondence between the parameter names in the two sections. If the programmer wants to turn the multiple copies of the code into calls to a new procedure, the correspondences between the parameter names in the two sections suggest what the formal parameters should be for the procedure. On the other hand, if it seems better to leave the duplication (e.g. to avoid the overhead of a procedure call or the time for rewriting), a profile can be generated that shows for each line of code where other copies occur in the system, based on the maximal exact or parameterized matches, so that when a bug occurs in one copy of some code, the programmer can fix it in the other copies as well. Comments about the location of other copies of code could also be added to redocument the code.
For managers, the postprocessor computes how much duplication is present in the system, estimates how much code could be saved if the duplication were eliminated, and computes which files or pairs of files contain the most duplication. This information provides a new measure of software quality, and if the system is reengineered, the information could guide in eliminating the duplication. In the case of repetitive code, the information from dup identifies code that could be rewritten using arrays and loops. For visualization, a scatter plot of the output makes apparent which sections of code contain large amounts of duplication, which sections of code are similar except for small gaps, and whether duplication is local or distant.
Dup and the postprocessor have been applied to millions of lines of code from two large software systems. In the complete source of the X Window System (minus some tables), comprising 714,479 lines of code, dup located 2487 matches of at least 30 lines, and these matches involved 19% of the code; dup estimated that 12% of the input was duplication that could be eliminated by rewriting. These matches can be divided into 976 groups, each of which apparently represents an instance of copying and editing of code. Dup has also been run on subsystems of a 10-million line production system. For a production subsystem with 1.1M lines, the 5550 parameterized matches of length at least 30 lines included 20% of the code; dup estimated that 13% of the subsystem was duplication that could be eliminated by rewriting. These matches can be divided into 2180 groups, each apparently representing an instance of copying and editing of code. Some interesting anomalies have been found in this production system via dup. These have included unusually complex files, an obsolete file, and a place where a bug fix was apparently applied to one copy of some code but not to another copy. Two whole directories of 800 lines were found to be the same except for a systematic change of parameter names and a line break. One subsystem contained two 40-line procedures for date calculations that were identical except that one used shorter identifiers than the other did.
In dealing with large systems of millions of lines of code, it is essential for a tool to use efficient techniques to attain a reasonable processing speed. Dup runs very fast; using one 40MHz R3000 processor, it can process a million lines of code in seven minutes. The speed comes partly from the choice to make it a text-based, line-based tool and partly from efficient algorithms based on a new data structure, called a parameterized suffix tree [2, 3]. Dup and the postprocessor are implemented in about 2300 non-commentary lines of C and Lex [11] and run under UNIX.
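The parameterized suffix tree itself is described in [2, 3]; the following is only a minimal sketch of the prev-distance encoding on which such structures are based, written for this summary rather than taken from dup. Each parameter occurrence is replaced by the distance back to its previous occurrence (0 for a first occurrence), so two parameter sequences p-match exactly when their encodings are equal. For brevity the sketch treats every token as a parameter; in a real p-string, non-parameter symbols are kept literal.

    #include <stdio.h>
    #include <string.h>

    /* out[i] = 0 for a first occurrence of toks[i], otherwise the
       distance back to the previous occurrence (naive quadratic scan) */
    void prev_encode(const char *toks[], int n, int out[])
    {
        for (int i = 0; i < n; i++) {
            out[i] = 0;
            for (int j = i - 1; j >= 0; j--)
                if (strcmp(toks[j], toks[i]) == 0) {
                    out[i] = i - j;
                    break;
                }
        }
    }

    int main(void)
    {
        const char *a[] = { "pfi", "lbearing", "pfi", "rbearing" };
        const char *b[] = { "pfh", "left",     "pfh", "right"    };
        int ea[4], eb[4];

        prev_encode(a, 4, ea);
        prev_encode(b, 4, eb);
        /* both sequences encode to 0 0 2 0, so they p-match */
        printf("%s\n", memcmp(ea, eb, sizeof ea) == 0 ? "p-match" : "no match");
        return 0;
    }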
Experiments on several million lines of production code suggest that in practice, for thresholds of more than about fifteen lines, the running time of dup on C code (excluding tables) is linear in input size, although it could be quadratic in the worst case. (On tables, depending on the values of the data, the number of matches to be reported might be quadratic in table size. Locally repetitive code can also lead locally to a quadratic amount of output, but this has not been found to be a dominant effect over a whole system.)
Overall, the data show that production systems can contain a large amount of duplication that was apparently created by copying and editing code. The concept of maximal p-matches appears to be more useful than just exact matches in locating such duplication. Dup runs fast enough to be useful for systems with millions of lines of code. Finally, it appears that the duplication information should be useful in practice for finding previously unknown features of the code and for maintenance and reengineering of large systems.
Other researchers have taken different approaches to finding commonality in code. These approaches have included finding common style or complexity measures [5, 8, 14, 12], common parse trees [10], common data flow [1, 7], fingerprints for files [9, 13], the UNIX diff command [11], data compression [17, 19], and graphical user interfaces (GUIs) [6]. These methods have been deficient for various reasons. Approaches based on common style or complexity characteristics have no guarantees about exactly how the code is related. The parse tree method used exhaustive search and was slow [10]. The data flow methods have only been applied to toy programming languages. The fingerprint approaches were aimed at finding similar files rather than copies of parts of the files. Diff and other approaches based on edit distance can take quadratic time, are only designed for comparing whole files, and are too slow for millions of lines of code. Data compression methods find some cases of exact duplication but not all maximal matches, and certainly not parameterized matches or local editing changes. Church and Helfman's GUI, Dotplot, requires that the user pick out patterns of similarity by eye, and the patterns are often dominated by repetitive code structure.
Section 2 describes how the definition of maximal parameterized matches in code leads to the design of a useful tool for finding duplication, and the design of dup. Section 3 describes the data structure used in dup. Section 4 discusses the results of applying dup to two software systems. The last section contains further discussion and directions for further work.
2 Exact and parameterized matches
The basic tool in identifying duplication in software is the program dup for finding maximal exact or parameterized matches over a threshold length specified by the user. A postprocessor analyzes the matches further. Currently, dup processes code written in C, but front ends could easily be written for other input languages. This section defines maximal exact and parameterized matches and how these definitions are adapted in dup to the task of finding interesting duplication or near-duplication in code.
Two sections of code are said to be a maximal exact match if their lines match exactly character by character but the preceding lines do not match and the following lines do not match. (White space and comments are ignored.)
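The definition translates directly into a naive check; the sketch below (illustrative only, not dup's suffix-tree algorithm, and with invented names) compares two files represented as arrays of hashes of their comment-stripped lines and reports every maximal run of equal lines of at least a threshold length:

    #include <stdio.h>

    /* a and b hold one hash per comment-stripped line of each file */
    void exact_matches(const unsigned long *a, int na,
                       const unsigned long *b, int nb, int minlen)
    {
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++) {
                if (a[i] != b[j])
                    continue;
                /* skip starts that extend left: not maximal */
                if (i > 0 && j > 0 && a[i-1] == b[j-1])
                    continue;
                int len = 0;   /* extend right as far as the lines agree */
                while (i + len < na && j + len < nb && a[i+len] == b[j+len])
                    len++;
                if (len >= minlen)
                    printf("lines %d-%d match lines %d-%d\n",
                           i + 1, i + len, j + 1, j + len);
            }
    }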
A scatter plot helps to visualize maximal exact matches. Figure 2 shows a scatter plot of exact matches in a production system file of 2846 lines, or 1761 lines after pruning white space and comments, with a minimum match length of 15 lines. Each (approximately) diagonal line from (n1, n2) to (n3, n4) represents a match between lines n1 to n3 and lines n2 to n4; the lines are not strictly diagonal because the white space and comments have been ignored, while the line numbers are the original line numbers in the file. Only the part of the plot below the main diagonal is shown, so that each match corresponds to exactly one line segment. The full plot would be symmetric around the main diagonal and contain two line segments for each match. In this case, there are 18 exact matches involving 419 lines, or 24% of the file.
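Producing such a plot from the match list is mechanical; the hypothetical helper below (names invented, not part of dup's postprocessor) writes each match as a two-point segment in a form a plotting tool such as gnuplot can draw:

    #include <stdio.h>

    struct match { int a1, a2, b1, b2; };  /* lines a1-a2 match lines b1-b2 */

    /* one segment per match: two endpoints followed by a blank line;
       plotting (b1,a1)-(b2,a2) keeps each segment on one side of the
       diagonal when the b interval is the later one */
    void emit_segments(const struct match *m, int n, FILE *out)
    {
        for (int i = 0; i < n; i++)
            fprintf(out, "%d %d\n%d %d\n\n",
                    m[i].b1, m[i].a1, m[i].b2, m[i].a2);
    }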
Figure 2: Exact matches for a C file. (Scatter plot; both axes are original line numbers, 0 to 2500.)

Two sections of code are a parameterized match (p-match) if there is a one-to-one function that maps the set of parameters in one section onto the set of parameters in the second section, such that the text of the first section is transformed into the text of the second by textually substituting f(p) for p everywhere that p occurs in the first section. (Comments and white space are ignored.) For example, in the code of Figure 1, the one-to-one function maps lbearing into left, rbearing into right, and pfi into pfh, but is the identity on other parameter candidates such as copy_number and pmin. Parameters in dup are currently defined to include identifiers, constants, field names of structures, and macro names. Keywords such as "while" or "if" are not candidates for parameters.
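A minimal sketch of this definition as a direct check (written for this summary; dup's actual algorithm uses the parameterized suffix tree of Section 3): two token-aligned sequences p-match when their fixed tokens agree exactly and their parameter tokens can be paired by a single one-to-one correspondence.

    #include <string.h>

    #define MAXPAIRS 256

    /* is_param is 1 for identifiers, constants, field and macro names */
    struct token { const char *text; int is_param; };

    int pmatch(const struct token *s, const struct token *t, int n)
    {
        const char *from[MAXPAIRS], *to[MAXPAIRS];
        int npairs = 0;

        for (int i = 0; i < n; i++) {
            if (s[i].is_param != t[i].is_param)
                return 0;
            if (!s[i].is_param) {          /* keywords etc. must be identical */
                if (strcmp(s[i].text, t[i].text) != 0)
                    return 0;
                continue;
            }
            int seen = 0;
            for (int j = 0; j < npairs; j++) {
                int l = strcmp(from[j], s[i].text) == 0;
                int r = strcmp(to[j],   t[i].text) == 0;
                if (l != r)
                    return 0;              /* conflicting correspondence */
                if (l) { seen = 1; break; }
            }
            if (!seen) {                   /* record a new parameter pair */
                if (npairs == MAXPAIRS)
                    return 0;              /* sketch only: fixed table size */
                from[npairs] = s[i].text;
                to[npairs]   = t[i].text;
                npairs++;
            }
        }
        return 1;
    }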
Two sections of code are a maximal p-match if they are a p-match and the p-match cannot be extended to the preceding lines or the following lines.
Figure 3 shows a scatter plot of the maximal p-matches for the same file whose exact matches are plotted in Figure 2. With a threshold of 15 lines, there are 87 maximal parameterized matches involving 85% of the file, compared to 18 exact matches involving 24% of the file. The longest maximal parameterized match found is 182 lines, compared to 37 lines for the exact matches.
Figure 3: P-matches for the same file as Figure 2. (Scatter plot; both axes are original line numbers, 0 to 2500.)

Sections of code that are a p-match generally look related. In certain circumstances, such as sequences of lines consisting of C "case variable:" statements, matches are found between sections of code that don't appear to be related, in that arbitrary variable names are paired line after line. Experiments have shown that an effective way of avoiding such output is to report only p-matches where the number of non-identical parameter pairs is at most half the number of non-commentary lines in the match; more generally, this could be turned into a percentage to be set by the user.
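Stated as a predicate (a sketch with invented names; the default ratio of one half is the one reported above):

    /* keep a p-match only if its non-identical parameter pairs number at
       most max_ratio times its non-commentary source lines (default 0.5) */
    int report_pmatch(int nonidentical_pairs, int ncsl, double max_ratio)
    {
        return nonidentical_pairs <= max_ratio * ncsl;
    }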
The quality of the output is also improved by pruning off closing braces at the start of a match. Because of the definition of maximality and the frequency of lines containing just a closing brace, maximal matches often begin with one or more closing braces, but the closing braces usually belong to code preceding the interesting part of the match.
Input code can be provided to dup either via the standard input or via a list of file names. In the latter case, dup does not allow matches to cross file boundaries. It does, however, allow matches to cross procedure boundaries, so that whole files can be found to match. An option to restrict matches from crossing procedure boundaries may be added in the future.
A postprocessor analyzes the p-matches and generates statistics and plots. A number of kinds of output are available from the postprocessor.
For each p-match, the program outputs the number of matching non-commentary lines, the pairs of matching intervals, and a list of the non-identical parameter correspondences. Figure 4 gives an example from the X Window System [18]; the match is the one from which the fragments of Figure 1 were extracted. The intervals are described as a file number, path name, and range of line numbers. (The file number is useful visually when path names are long and differ by only a character or two.) The match length is specified by "34 ncsl", which means "34 non-commentary source lines", i.e. the number of lines in the match excluding comments and blank lines.
    34 ncsl
    1552,mit/clients/xlsfonts/xlsfonts.c:274,309
    327,mit/fonts/clients/fslsfonts/fslsfonts.c:384,419
    3 parameters
    1: pfi, pfh
    2: lbearing, left
    3: rbearing, right

Figure 4: Output for the parameterized match for which Figure 1 is an excerpt.

The postprocessor calculates summary information including the number of matches, the number of non-commentary lines in the whole system involved in the matches, the percentage of non-commentary lines in the system involved in the matches, and the distribution of match lengths. These calculations are straightforward.
The postprocessor computes an estimate of the percentage of lines that could be eliminated if the code were rewritten using alternative methods such as procedures instead of copying. The estimate is derived using the simple assumption that if the same line appears in k sufficiently long matching sections of code, then k - 1 of these occurrences could have been avoided. For example, for the file whose p-matches are plotted in Figure 3, the postprocessor estimates a potential shrinkage of 61% if the code were rewritten to avoid parameterized duplication. The computation of the estimate is complicated by matches that pair up the same lines of code because they overlap in both intervals. For example, it would be possible for lines 30-60 and 130-160 to be a maximal p-match and for lines 40-70 and 140-170 to be another maximal p-match, where a longer p-match is not possible because a correspondence of x and y in lines 39 and 139 conflicts with a correspondence of x and z in lines 61 and 161. In this example, both p-matches match lines 40-60 with lines 140-160. The calculations of redundancy handle this situation by counting this as one extra copy of each of the lines in these ranges, rather than two. Such situations are caused by conflicting pairings of values, often pairings of small integer constants (especially zero) that may be used as values for more than one variable in one section of code but not the other.
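Under that assumption the estimate reduces to a pass over per-line copy counts; the sketch below is a hypothetical rendering (the overlap de-duplication described above is assumed to have been applied to the counts already):

    /* copies[i] = number of matched sections in which line i's content
       appears (1 means the line is unique).  A group of k duplicate
       lines saves k-1 lines in total, i.e. (k-1)/k per line. */
    double estimated_shrinkage(const int *copies, int nlines)
    {
        double saved = 0.0;
        for (int i = 0; i < nlines; i++)
            if (copies[i] > 1)
                saved += (double)(copies[i] - 1) / copies[i];
        return 100.0 * saved / nlines;   /* percent of the input */
    }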
As an option, the postprocessor prints out a profile of the code showing how much duplication occurs where. In particular, it identifies intervals (sequences of lines) in the input that are involved in exactly the same set of matches. For each such sequence of lines, it prints out the range of line numbers, the number of distinct matches, and a list of the match numbers. In our above example, lines 30-60 and 130-160 were a p-match and lines 40-70 and 140-170 were a p-match, and both p-matches match lines 40-60 with lines 140-160. In this situation, the postprocessor will identify intervals 30-39, 40-60, 61-70, 130-139, 140-160, and 161-170 as sequences of line numbers within which the lines are involved in the same matches. However, it does not count the two p-matches as distinct matches for the intervals 40-60 and 140-160 in which they overlap, since they pair up the same lines.
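The partition itself is easy to compute, since the set of matches covering a line can only change where some match interval begins or ends; a sketch (illustrative, not dup's code, with a fixed-size table for brevity):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *x, const void *y)
    {
        return *(const int *)x - *(const int *)y;
    }

    /* lo[i]..hi[i] are the nm match intervals within a file of nlines
       lines; prints the maximal runs of lines that lie under the same
       set of matches (assumes 2*nm + 2 <= 512) */
    void profile_intervals(const int *lo, const int *hi, int nm, int nlines)
    {
        int cuts[512], nc = 0;
        cuts[nc++] = 1;
        cuts[nc++] = nlines + 1;
        for (int i = 0; i < nm; i++) {
            cuts[nc++] = lo[i];      /* coverage can change at a start */
            cuts[nc++] = hi[i] + 1;  /* ... and just past an end */
        }
        qsort(cuts, nc, sizeof cuts[0], cmp_int);
        for (int i = 0; i + 1 < nc; i++)
            if (cuts[i] != cuts[i + 1])
                printf("interval %d-%d\n", cuts[i], cuts[i + 1] - 1);
    }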
Since a system can contain thousands of files, and the duplication may be unevenly distributed among them, another postprocessor option is to calculate the percentage redundancy and number of redundant lines within each file and between each pair of files in the input. For efficiency, these calculations are done by intervals participating in the same matches, as defined in the preceding paragraph, rather than by individual lines. Sorting can be used to identify the files or file pairs with the most duplication.
Further processing of matches can be done to group matches that appear to be related, in the sense that together they represent a region of code that was copied and then edited. Two classes of these matches arise as follows.
First, there is the case described above of two matches that would be one match if not for a parameter conflict in the middle of the code. This is detected by overlaps in both intervals and identical distances between the first and second intervals in the two matches. Pairs or sequences of successive p-matches with this relationship can be detected and labeled as part of a longer match with a conflict in parameters.
Second, if some code was copied and then modified in the middle, what would be detected by dup would be a pair of matches pairing up sections of code that are close together but not overlapping, e.g. one match pairing up lines 30-50 and 500-520, and another match pairing up lines 55-75 and lines 530-550. Such pairs (or more) of matches can be identified by sorting the matches by endpoint and looking for pairs of matches
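A sketch of that pairing step (illustrative, with invented names; real matches would also need their parameter correspondences checked for consistency): after sorting by the start of the first interval, consecutive matches are grouped when both intervals resume within a small gap, as the 30-50/500-520 and 55-75/530-550 example does for any gap bound of ten or more lines.

    #include <stdio.h>
    #include <stdlib.h>

    struct match { int a1, a2, b1, b2; };  /* lines a1-a2 match lines b1-b2 */

    static int by_start(const void *x, const void *y)
    {
        return ((const struct match *)x)->a1 - ((const struct match *)y)->a1;
    }

    /* report consecutive matches whose intervals are separated by small
       gaps in both copies: a likely copy that was edited in the middle */
    void group_gapped(struct match *m, int n, int maxgap)
    {
        qsort(m, n, sizeof m[0], by_start);
        for (int i = 0; i + 1 < n; i++) {
            int ga = m[i+1].a1 - m[i].a2;  /* gap between first intervals */
            int gb = m[i+1].b1 - m[i].b2;  /* gap between second intervals */
            if (ga > 0 && gb > 0 && ga <= maxgap && gb <= maxgap)
                printf("matches %d and %d form one edited region\n", i, i + 1);
        }
    }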

Citations

CCFinder: a multilinguistic token-based code clone detection system for large scale source code
TL;DR: Proposes a new clone detection technique consisting of a transformation of the input source text and a token-by-token comparison; it effectively found clones, and its metrics identify the characteristics of the systems.

Winnowing: local algorithms for document fingerprinting
TL;DR: Introduces the class of local document fingerprinting algorithms, which captures an essential property of any fingerprinting technique guaranteed to detect copies, and proves a novel lower bound on the performance of any local algorithm.

Clone detection using abstract syntax trees
TL;DR: Presents simple and practical methods for detecting exact and near-miss clones over arbitrary program fragments in source code by using abstract syntax trees, and suggests that clone detection could be useful in producing more structured code and in reverse engineering to discover domain concepts and their implementations.

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
TL;DR: Presents an efficient algorithm for identifying similar subtrees, applied to tree representations of source code; implemented as the clone detection tool DECKARD and evaluated on large code bases written in C and Java, including the Linux kernel and the JDK.

Comparison and evaluation of code clone detection techniques and tools: A qualitative approach
TL;DR: Provides a qualitative comparison and evaluation of the state of the art in clone detection techniques and tools, including a taxonomy of editing scenarios that produce different clone types.
References
More filters
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable- to-block codes designed to match a completely specified source.
Journal ArticleDOI

A Space-Economical Suffix Tree Construction Algorithm

TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.
Journal ArticleDOI

The X window system

TL;DR: An overview of the X Window System is presented, focusing on the system substrate and the low-level facilities provided to build applications and to manage the desktop.
Proceedings Article

Finding similar files in a large file system

TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

The UNIX programming environment

TL;DR: In this article, the authors describe the UNIX programming environment and philosophy in detail, including how to use the system, its components, and the programs, but also how these fit into the total environment.