On Finding Duplication and Near-Duplication in Large Software Systems

Brenda S. Baker
AT&T Bell Laboratories
600 Mountain Ave.
Murray Hill, NJ 07974
bsb@research.att.com
Abstract

This paper describes how a program called dup can be used to locate instances of duplication or near-duplication in a software system. Dup reports both textually identical sections of code and sections that are the same textually except for systematic substitution of one set of variable names and constants for another. Further processing locates longer sections of code that are the same except for other small modifications. Experimental results from running dup on millions of lines from two large software systems show dup to be both effective at locating duplication and fast. Applications could include identifying sections of code that should be replaced by procedures, elimination of duplication during reengineering of the system, redocumentation to include references to copies, and debugging.
1 Introduction
This paper focuses on locating duplication or near-duplication in a large software system as an aid in maintenance and reengineering. Duplication can become a problem within large software systems if programmers make modifications by copying and modifying sections of code. It has long been known that copying can make the code larger, more complex, and more difficult to maintain. In particular, when a bug has been found in one copy, a bug fix may be made in the copy where the bug was found, but not in the other copies. Nevertheless, copying and modifying code may occur for several reasons. First, making a copy and modifying it may be simpler than more major revisions and therefore less likely to introduce new bugs immediately, especially when the programmer making the bug fixes is not the one who wrote the original code. Second, if multiple versions are created, the interactions between the versions may become intractable as the versions grow apart over time, and eventually it may seem simpler to maintain some of the code separately. Third, process management may encourage duplication, e.g. if evaluation of programmers' performance is based in part on how much new code they write, so that programmers have little incentive to rewrite old code. Fourth, copies may be required because of the need to avoid the overhead of a procedure call for efficiency considerations.
This paper addresses the problem of locating exact or near-duplication of code that was created by copying and modifying code with an editor. When code is copied and modified via an editor, the types of changes made may include insertions and deletions of lines, modifications within lines, and global substitutions. The goal is to find copies that are substantially the same line by line except for global substitutions, so that one copy is a variant of the other, rather than sections of code that have evolved to be mostly different. In software reuse terminology, the problem is to locate instances of ad-hoc black-box or white-box software reuse [16] within a software system. Thus, this is a problem in reverse engineering. Moreover, the systems to be examined may be legacy systems running to millions of lines of code.
The approach of this paper is to find maximal sections of code over a threshold length that are either exactly the same, or the same except for a global substitution of names of parameters such as variables and constants, e.g. all occurrences of x changed to y and all occurrences of pchar changed to pc. In the former case, we call the two sections of code an exact match, and in the latter case, a parameterized match (p-match). Thus, the approach is text-based and line-based. Comments and white space are ignored. The tool to find maximal exact or parameterized matches is a program called dup. To find longer sections of code that were copied and then changed locally in the middle, the exact or parameterized matches can be further analyzed to locate pairs or sequences of matches that match sections of code separated by small gaps; alternatively, such regions can be found by examining scatter plots.
An example of a p-match is given in Figure 1, which contains two code fragments taken from the X Window System [18] source code. The fragments are identical except for the differing indentation (which is ignored by dup) and the correspondence between the variable names pfi/pfh and the pairs of structure member names lbearing/left and rbearing/right. These fragments are excerpted from two 34-line sections of code that are a p-match with these parameter correspondences.
Fragment 1:

    copy_number(&pmin, &pmax,
                pfi->min_bounds.lbearing,
                pfi->max_bounds.lbearing);
    *pmin++ = *pmax++ = ',';
    copy_number(&pmin, &pmax,
                pfi->min_bounds.rbearing,
                pfi->max_bounds.rbearing);
    *pmin++ = *pmax++ = ',';

Fragment 2:

    copy_number(&pmin, &pmax,
                pfh->min_bounds.left,
                pfh->max_bounds.left);
    *pmin++ = *pmax++ = ',';
    copy_number(&pmin, &pmax,
                pfh->min_bounds.right,
                pfh->max_bounds.right);
    *pmin++ = *pmax++ = ',';
Figure 1: Two fragments of code from source for the X Window System.
In addition to finding possibly distant sections of code that match, dup finds locally repetitive sections of code where the same short section is repeated immediately with different parameters, typically with names ending in a number; if an array were used instead of the numbered parameters, the repetitive code could be replaced by a loop. Such sections could have been generated automatically by a program generator, but instances have been found that were created by hand from a specification for which the specification language lacked arrays within structures.
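As an illustration of this pattern (a hypothetical example with invented struct and field names, not code from any system dup was run on), the repeated statements differ only in a numbered name and collapse to a loop once an array is used:

    #include <stdio.h>

    struct rec_numbered { int field1, field2, field3; };
    struct rec_array    { int field[3]; };

    int main(void)
    {
        struct rec_numbered rn = { 1, 2, 3 };
        struct rec_array    ra = { { 1, 2, 3 } };

        /* repetitive form: dup would report the repetitions as matches
           with field1/field2/field3 as the changing parameters */
        int total = 0;
        total += rn.field1;
        total += rn.field2;
        total += rn.field3;

        /* equivalent loop form once an array replaces the numbered names */
        int total2 = 0;
        for (int i = 0; i < 3; i++)
            total2 += ra.field[i];

        printf("%d %d\n", total, total2);
        return 0;
    }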
For programmers, dup describes the matching sections of code and the correspondence between the parameter names in the two sections. If the programmer wants to turn the multiple copies of the code into calls to a new procedure, the correspondences between the parameter names in the two sections suggest what the formal parameters should be for the procedure. On the other hand, if it seems better to leave the duplication (e.g. to avoid the overhead of a procedure call or the time for rewriting), a profile can be generated that shows for each line of code where other copies occur in the system, based on the maximal exact or parameterized matches, so that when a bug occurs in one copy of some code, the programmer can fix it in the other copies as well. Comments about the location of other copies of code could also be added to redocument the code.
For managers, the postprocessor computes how much duplication is present in the system, estimates how much code could be saved if the duplication were eliminated, and computes which files or pairs of files contain the most duplication. This information provides a new measure of software quality, and if the system is reengineered, the information could guide in eliminating the duplication. In the case of repetitive code, the information from dup identifies code that could be rewritten using arrays and loops. For visualization, a scatter plot of the output makes apparent which sections of code contain large amounts of duplication, which sections of code are similar except for small gaps, and whether duplication is local or distant.
Dup and the postprocessor have been applied to millions of lines of code from two large software systems. In the complete source of the X Window System (minus some tables), comprising 714,479 lines of code, dup located 2487 matches of at least 30 lines, and these matches involved 19% of the code; dup estimated that 12% of the input was duplication that could be eliminated by rewriting. These matches can be divided into 976 groups, each of which apparently represents an instance of copying and editing of code. Dup has also been run on subsystems of a 10-million line production system. For a production subsystem with 1.1M lines, the 5550 parameterized matches of length at least 30 lines included 20% of the code; dup estimated that 13% of the subsystem was duplication that could be eliminated by rewriting. These matches can be divided into 2180 groups, each apparently representing an instance of copying and editing of code. Some interesting anomalies have been found in this production system via dup. These have included unusually complex files, an obsolete file, and a place where a bug fix was apparently applied to one copy of some code but not to another copy. Two whole directories of 800 lines were found to be the same except for a systematic change of parameter names and a line break. One subsystem contained two 40-line procedures for date calculations that were identical except that one used shorter identifiers than the other did.
In dealing with large systems of millions of lines of code, it is essential for a tool to use efficient techniques to attain a reasonable processing speed. Dup runs very fast; using one 40MHz R3000 processor, it can process a million lines of code in seven minutes. The speed comes partly from the choice to make it a text-based, line-based tool and partly from efficient algorithms based on a new data structure, called a parameterized suffix tree [2, 3]. Dup and the postprocessor are implemented in about 2300 non-commentary lines of C and Lex [11] and run under UNIX.
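The parameterized suffix tree itself is described in [2, 3]; the following is only a minimal sketch of the prev-distance encoding on which such structures are based, written for this summary rather than taken from dup. Each parameter occurrence is replaced by the distance back to its previous occurrence (0 for a first occurrence), so two parameter sequences p-match exactly when their encodings are equal. For brevity the sketch treats every token as a parameter; in a real p-string, non-parameter symbols are kept literal.

    #include <stdio.h>
    #include <string.h>

    /* out[i] = 0 for a first occurrence of toks[i], otherwise the
       distance back to the previous occurrence (naive quadratic scan) */
    void prev_encode(const char *toks[], int n, int out[])
    {
        for (int i = 0; i < n; i++) {
            out[i] = 0;
            for (int j = i - 1; j >= 0; j--)
                if (strcmp(toks[j], toks[i]) == 0) {
                    out[i] = i - j;
                    break;
                }
        }
    }

    int main(void)
    {
        const char *a[] = { "pfi", "lbearing", "pfi", "rbearing" };
        const char *b[] = { "pfh", "left",     "pfh", "right"    };
        int ea[4], eb[4];

        prev_encode(a, 4, ea);
        prev_encode(b, 4, eb);
        /* both sequences encode to 0 0 2 0, so they p-match */
        printf("%s\n", memcmp(ea, eb, sizeof ea) == 0 ? "p-match" : "no match");
        return 0;
    }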
Experiments on several million lines of production code suggest that in practice, for thresholds of more than about fifteen lines, the running time of dup on C code (excluding tables) is linear in input size, although it could be quadratic in the worst case. (On tables, depending on the values of the data, the number of matches to be reported might be quadratic in table size. Locally repetitive code can also lead locally to a quadratic amount of output, but this has not been found to be a dominant effect over a whole system.)
Overall, the data show that production systems can contain a large amount of duplication that was apparently created by copying and editing code. The concept of maximal p-matches appears to be more useful than just exact matches in locating such duplication. Dup runs fast enough to be useful for systems with millions of lines of code. Finally, it appears that the duplication information should be useful in practice for finding previously unknown features of the code and for maintenance and reengineering of large systems.
Other researchers have taken different approaches to finding commonality in code. These approaches have included finding common style or complexity measures [5, 8, 14, 12], common parse trees [10], common data flow [1, 7], fingerprints for files [9, 13], the UNIX diff command [11], data compression [17, 19], and graphical user interfaces (GUIs) [6]. These methods have been deficient for various reasons. Approaches based on common style or complexity characteristics have no guarantees about exactly how the code is related. The parse tree method used exhaustive search and was slow [10]. The data flow methods have only been applied to toy programming languages. The fingerprint approaches were aimed at finding similar files rather than copies of parts of the files. Diff and other approaches based on edit distance can take quadratic time, are only designed for comparing whole files, and are too slow for millions of lines of code. Data compression methods find some cases of exact duplication but not all maximal matches, and certainly not parameterized matches or local editing changes. Church and Helfman's GUI, Dotplot, requires that the user pick out patterns of similarity by eye, and the patterns are often dominated by repetitive code structure.
Section 2 describes how the definition of maximal parameterized matches in code leads to the design of a useful tool for finding duplication, and the design of dup. Section 3 describes the data structure used in dup. Section 4 discusses the results of applying dup to two software systems. The last section contains further discussion and directions for further work.
2 Exact and parameterized matches
The basic tool in identifying duplication in software is the program dup for finding maximal exact or parameterized matches over a threshold length specified by the user. A postprocessor analyzes the matches further. Currently, dup processes code written in C, but front ends could easily be written for other input languages. This section defines maximal exact and parameterized matches and how these definitions are adapted in dup to the task of finding interesting duplication or near-duplication in code.
Two sections of code are said to be a maximal exact match if their lines match exactly character by character but the preceding lines do not match and the following lines do not match. (White space and comments are ignored.)
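The definition translates directly into a naive check; the sketch below (illustrative only, not dup's suffix-tree algorithm, and with invented names) compares two files represented as arrays of hashes of their comment-stripped lines and reports every maximal run of equal lines of at least a threshold length:

    #include <stdio.h>

    /* a and b hold one hash per comment-stripped line of each file */
    void exact_matches(const unsigned long *a, int na,
                       const unsigned long *b, int nb, int minlen)
    {
        for (int i = 0; i < na; i++)
            for (int j = 0; j < nb; j++) {
                if (a[i] != b[j])
                    continue;
                /* skip starts that extend left: not maximal */
                if (i > 0 && j > 0 && a[i-1] == b[j-1])
                    continue;
                int len = 0;   /* extend right as far as the lines agree */
                while (i + len < na && j + len < nb && a[i+len] == b[j+len])
                    len++;
                if (len >= minlen)
                    printf("lines %d-%d match lines %d-%d\n",
                           i + 1, i + len, j + 1, j + len);
            }
    }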
A scatter plot helps to visualize maximal exact matches. Figure 2 shows a scatter plot of exact matches in a production system file of 2846 lines, or 1761 lines after pruning white space and comments, with a minimum match length of 15 lines. Each (approximately) diagonal line from (n1, n2) to (n3, n4) represents a match between lines n1 to n3 and lines n2 to n4; the lines are not strictly diagonal because the white space and comments have been ignored, while the line numbers are the original line numbers in the file. Only the part of the plot below the main diagonal is shown, so that each match corresponds to exactly one line segment. The full plot would be symmetric around the main diagonal and contain two line segments for each match. In this case, there are 18 exact matches involving 419 lines, or 24% of the file.
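Producing such a plot from the match list is mechanical; the hypothetical helper below (names invented, not part of dup's postprocessor) writes each match as a two-point segment in a form a plotting tool such as gnuplot can draw:

    #include <stdio.h>

    struct match { int a1, a2, b1, b2; };  /* lines a1-a2 match lines b1-b2 */

    /* one segment per match: two endpoints followed by a blank line;
       plotting (b1,a1)-(b2,a2) keeps each segment on one side of the
       diagonal when the b interval is the later one */
    void emit_segments(const struct match *m, int n, FILE *out)
    {
        for (int i = 0; i < n; i++)
            fprintf(out, "%d %d\n%d %d\n\n",
                    m[i].b1, m[i].a1, m[i].b2, m[i].a2);
    }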
Figure 2: Exact matches for a C file. (Scatter plot; both axes are original line numbers, 0 to 2500.)

Two sections of code are a parameterized match (p-match) if there is a one-to-one function that maps the set of parameters in one section onto the set of parameters in the second section, such that the text of the first section is transformed into the text of the second by textually substituting f(p) for p everywhere that p occurs in the first section. (Comments and white space are ignored.) For example, in the code of Figure 1, the one-to-one function maps lbearing into left, rbearing into right, and pfi into pfh, but is the identity on other parameter candidates such as copy_number and pmin. Parameters in dup are currently defined to include identifiers, constants, field names of structures, and macro names. Keywords such as "while" or "if" are not candidates for parameters.
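A minimal sketch of this definition as a direct check (written for this summary; dup's actual algorithm uses the parameterized suffix tree of Section 3): two token-aligned sequences p-match when their fixed tokens agree exactly and their parameter tokens can be paired by a single one-to-one correspondence.

    #include <string.h>

    #define MAXPAIRS 256

    /* is_param is 1 for identifiers, constants, field and macro names */
    struct token { const char *text; int is_param; };

    int pmatch(const struct token *s, const struct token *t, int n)
    {
        const char *from[MAXPAIRS], *to[MAXPAIRS];
        int npairs = 0;

        for (int i = 0; i < n; i++) {
            if (s[i].is_param != t[i].is_param)
                return 0;
            if (!s[i].is_param) {          /* keywords etc. must be identical */
                if (strcmp(s[i].text, t[i].text) != 0)
                    return 0;
                continue;
            }
            int seen = 0;
            for (int j = 0; j < npairs; j++) {
                int l = strcmp(from[j], s[i].text) == 0;
                int r = strcmp(to[j],   t[i].text) == 0;
                if (l != r)
                    return 0;              /* conflicting correspondence */
                if (l) { seen = 1; break; }
            }
            if (!seen) {                   /* record a new parameter pair */
                if (npairs == MAXPAIRS)
                    return 0;              /* sketch only: fixed table size */
                from[npairs] = s[i].text;
                to[npairs]   = t[i].text;
                npairs++;
            }
        }
        return 1;
    }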
Two sections of code are a maximal p-match if they are a p-match and the p-match cannot be extended to the preceding lines or the following lines.
Figure 3 shows a scatter plot of the maximal p-matches for the same file whose exact matches are plotted in Figure 2. With a threshold of 15 lines, there are 87 maximal parameterized matches involving 85% of the file, compared to 18 exact matches involving 24% of the file. The longest maximal parameterized match found is 182 lines, compared to 37 lines for the exact matches.
Figure 3: P-matches for the same file as Figure 2. (Scatter plot; both axes are original line numbers, 0 to 2500.)

Sections of code that are a p-match generally look related. In certain circumstances, such as sequences of lines consisting of C "case variable:" statements, matches are found between sections of code that don't appear to be related, in that arbitrary variable names are paired line after line. Experiments have shown that an effective way of avoiding such output is to report only p-matches where the number of non-identical parameter pairs is at most half the number of non-commentary lines in the match; more generally, this could be turned into a percentage to be set by the user.
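Stated as a predicate (a sketch with invented names; the default ratio of one half is the one reported above):

    /* keep a p-match only if its non-identical parameter pairs number at
       most max_ratio times its non-commentary source lines (default 0.5) */
    int report_pmatch(int nonidentical_pairs, int ncsl, double max_ratio)
    {
        return nonidentical_pairs <= max_ratio * ncsl;
    }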
The quality of the output is also improved by pruning off closing braces at the start of a match. Because of the definition of maximality and the frequency of lines containing just a closing brace, maximal matches often begin with one or more closing braces, but the closing braces usually belong to code preceding the interesting part of the match.
Input code can be provided to dup either via the standard input or via a list of file names. In the latter case, dup does not allow matches to cross file boundaries. It does, however, allow matches to cross procedure boundaries, so that whole files can be found to match. An option to restrict matches from crossing procedure boundaries may be added in the future.
A postprocessor analyzes the p-matches and generates statistics and plots. A number of kinds of output are available from the postprocessor.
For each p-match, the program outputs the number of matching non-commentary lines, the pairs of matching intervals, and a list of the non-identical parameter correspondences. Figure 4 gives an example from the X Window System [18]; the match is the one from which the fragments of Figure 1 were extracted. The intervals are described as a file number, path name, and range of line numbers. (The file number is useful visually when path names are long and differ by only a character or two.) The match length is specified by "34 ncsl", which means "34 non-commentary source lines", i.e. the number of lines in the match excluding comments and blank lines.
    34 ncsl
    1552,mit/clients/xlsfonts/xlsfonts.c:274,309
    327,mit/fonts/clients/fslsfonts/fslsfonts.c:384,419
    3 parameters
    1: pfi, pfh
    2: lbearing, left
    3: rbearing, right

Figure 4: Output for the parameterized match for which Figure 1 is an excerpt.

The postprocessor calculates summary information including the number of matches, the number of non-commentary lines in the whole system involved in the matches, the percentage of non-commentary lines in the system involved in the matches, and the distribution of match lengths. These calculations are straightforward.
The postprocessor computes an estimate of the percentage of lines that could be eliminated if the code were rewritten using alternative methods such as procedures instead of copying. The estimate is derived using the simple assumption that if the same line appears in k sufficiently long matching sections of code, then k - 1 of these occurrences could have been avoided. For example, for the file whose p-matches are plotted in Figure 3, the postprocessor estimates a potential shrinkage of 61% if the code were rewritten to avoid parameterized duplication. The computation of the estimate is complicated by matches that pair up the same lines of code because they overlap in both intervals. For example, it would be possible for lines 30-60 and 130-160 to be a maximal p-match and for lines 40-70 and 140-170 to be another maximal p-match, where a longer p-match is not possible because a correspondence of x and y in lines 39 and 139 conflicts with a correspondence of x and z in lines 61 and 161. In this example, both p-matches match lines 40-60 with lines 140-160. The calculations of redundancy handle this situation by counting this as one extra copy of each of the lines in these ranges, rather than two. Such situations are caused by conflicting pairings of values, often pairings of small integer constants (especially zero) that may be used as values for more than one variable in one section of code but not the other.
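Under that assumption the estimate reduces to a pass over per-line copy counts; the sketch below is a hypothetical rendering (the overlap de-duplication described above is assumed to have been applied to the counts already):

    /* copies[i] = number of matched sections in which line i's content
       appears (1 means the line is unique).  A group of k duplicate
       lines saves k-1 lines in total, i.e. (k-1)/k per line. */
    double estimated_shrinkage(const int *copies, int nlines)
    {
        double saved = 0.0;
        for (int i = 0; i < nlines; i++)
            if (copies[i] > 1)
                saved += (double)(copies[i] - 1) / copies[i];
        return 100.0 * saved / nlines;   /* percent of the input */
    }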
As an option, the postprocessor prints out a profile of the code showing how much duplication occurs where. In particular, it identifies intervals (sequences of lines) in the input that are involved in exactly the same set of matches. For each such sequence of lines, it prints out the range of line numbers, the number of distinct matches, and a list of the match numbers. In our above example, lines 30-60 and 130-160 were a p-match and lines 40-70 and 140-170 were a p-match, and both p-matches match lines 40-60 with lines 140-160. In this situation, the postprocessor will identify intervals 30-39, 40-60, 61-70, 130-139, 140-160, and 161-170 as sequences of line numbers within which the lines are involved in the same matches. However, it does not count the two p-matches as distinct matches for the intervals 40-60 and 140-160 in which they overlap, since they pair up the same lines.
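The partition itself is easy to compute, since the set of matches covering a line can only change where some match interval begins or ends; a sketch (illustrative, not dup's code, with a fixed-size table for brevity):

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *x, const void *y)
    {
        return *(const int *)x - *(const int *)y;
    }

    /* lo[i]..hi[i] are the nm match intervals within a file of nlines
       lines; prints the maximal runs of lines that lie under the same
       set of matches (assumes 2*nm + 2 <= 512) */
    void profile_intervals(const int *lo, const int *hi, int nm, int nlines)
    {
        int cuts[512], nc = 0;
        cuts[nc++] = 1;
        cuts[nc++] = nlines + 1;
        for (int i = 0; i < nm; i++) {
            cuts[nc++] = lo[i];      /* coverage can change at a start */
            cuts[nc++] = hi[i] + 1;  /* ... and just past an end */
        }
        qsort(cuts, nc, sizeof cuts[0], cmp_int);
        for (int i = 0; i + 1 < nc; i++)
            if (cuts[i] != cuts[i + 1])
                printf("interval %d-%d\n", cuts[i], cuts[i + 1] - 1);
    }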
Since a system can contain thousands of files, and the duplication may be unevenly distributed among them, another postprocessor option is to calculate the percentage redundancy and number of redundant lines within each file and between each pair of files in the input. For efficiency, these calculations are done by intervals participating in the same matches, as defined in the preceding paragraph, rather than by individual lines. Sorting can be used to identify the files or file pairs with the most duplication.
Further processing of matches can be done to group matches that appear to be related, in the sense that together they represent a region of code that was copied and then edited. Two classes of these matches arise as follows.
First, there is the case described above of two matches that would be one match if not for a parameter conflict in the middle of the code. This is detected by overlaps in both intervals and identical distances between the first and second intervals in the two matches. Pairs or sequences of successive p-matches with this relationship can be detected and labeled as part of a longer match with a conflict in parameters.
Second, if some code was copied and then modified in the middle, what would be detected by dup would be a pair of matches pairing up sections of code that are close together but not overlapping, e.g. one match pairing up lines 30-50 and 500-520, and another match pairing up lines 55-75 and lines 530-550. Such pairs (or more) of matches can be identified by sorting the matches by endpoint and looking for pairs of matches
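A sketch of that pairing step (illustrative, with invented names; real matches would also need their parameter correspondences checked for consistency): after sorting by the start of the first interval, consecutive matches are grouped when both intervals resume within a small gap, as the 30-50/500-520 and 55-75/530-550 example does for any gap bound of ten or more lines.

    #include <stdio.h>
    #include <stdlib.h>

    struct match { int a1, a2, b1, b2; };  /* lines a1-a2 match lines b1-b2 */

    static int by_start(const void *x, const void *y)
    {
        return ((const struct match *)x)->a1 - ((const struct match *)y)->a1;
    }

    /* report consecutive matches whose intervals are separated by small
       gaps in both copies: a likely copy that was edited in the middle */
    void group_gapped(struct match *m, int n, int maxgap)
    {
        qsort(m, n, sizeof m[0], by_start);
        for (int i = 0; i + 1 < n; i++) {
            int ga = m[i+1].a1 - m[i].a2;  /* gap between first intervals */
            int gb = m[i+1].b1 - m[i].b2;  /* gap between second intervals */
            if (ga > 0 && gb > 0 && ga <= maxgap && gb <= maxgap)
                printf("matches %d and %d form one edited region\n", i, i + 1);
        }
    }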

Citations

CCFinder: a multilinguistic token-based code clone detection system for large scale source code
TL;DR: Proposes a new clone detection technique consisting of a transformation of the input source text and a token-by-token comparison; it effectively found clones, and its metrics identify the characteristics of the systems.

Winnowing: local algorithms for document fingerprinting
TL;DR: Introduces the class of local document fingerprinting algorithms, which captures an essential property of any fingerprinting technique guaranteed to detect copies, and proves a novel lower bound on the performance of any local algorithm.

Clone detection using abstract syntax trees
TL;DR: Presents simple and practical methods for detecting exact and near-miss clones over arbitrary program fragments in source code by using abstract syntax trees, and suggests that clone detection could be useful in producing more structured code and in reverse engineering to discover domain concepts and their implementations.

DECKARD: Scalable and Accurate Tree-Based Detection of Code Clones
TL;DR: Presents an efficient algorithm for identifying similar subtrees, applied to tree representations of source code; implemented as the clone detection tool DECKARD and evaluated on large code bases written in C and Java, including the Linux kernel and the JDK.

Comparison and evaluation of code clone detection techniques and tools: A qualitative approach
TL;DR: Provides a qualitative comparison and evaluation of the state of the art in clone detection techniques and tools, including a taxonomy of editing scenarios that produce different clone types.
References
More filters
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable- to-block codes designed to match a completely specified source.
Journal ArticleDOI

A Space-Economical Suffix Tree Construction Algorithm

TL;DR: A new algorithm is presented for constructing auxiliary digital search trees to aid in exact-match substring searching that has the same asymptotic running time bound as previously published algorithms, but is more economical in space.
Journal ArticleDOI

The X window system

TL;DR: An overview of the X Window System is presented, focusing on the system substrate and the low-level facilities provided to build applications and to manage the desktop.
Proceedings Article

Finding similar files in a large file system

TL;DR: Application of sif can be found in file management, information collecting, program reuse, file synchronization, data compression, and maybe even plagiarism detection.

The UNIX programming environment

TL;DR: In this article, the authors describe the UNIX programming environment and philosophy in detail, including how to use the system, its components, and the programs, but also how these fit into the total environment.