
Value-Based Program Characterization and
Its Application to Software Plagiarism Detection
Yoon-Chan Jhi^1, Xinran Wang^1, Xiaoqi Jia^2, Sencun Zhu^1, Peng Liu^1, Dinghao Wu^1
^1 Penn State University, University Park, PA 16802
{jhi, xinrwang, szhu}@cse.psu.edu, {pliu, dwu}@ist.psu.edu
^2 State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences
xjia@is.iscas.ac.cn
ABSTRACT
Identifying similar or identical code fragments becomes much
more challenging in code theft cases where plagiarizers can
use various automated code transformation techniques to
hide stolen code from being detected. Previous works in this
field are largely limited in that (1) most of them cannot han-
dle advanced obfuscation techniques; (2) the methods based
on source code analysis are less practical since the source
code of suspicious programs is typically not available until
strong evidence is collected; and (3) those depending on
the features of specific operating systems or programming
languages have limited applicability.
Based on an observation that some critical runtime values are hard to replace or eliminate by semantics-preserving transformation techniques, we introduce a novel
approach to dynamic characterization of executable programs.
Leveraging such invariant values, our technique is resilient to
various control and data obfuscation techniques. We show
how the values can be extracted and refined to expose the
critical values and how we can apply this runtime property
to help solve problems in software plagiarism detection. We
have implemented a prototype with a dynamic taint analyzer
atop a generic processor emulator. Our experimental re-
sults show that the value-based method successfully discrim-
inates 34 plagiarisms obfuscated by SandMark, plagiarisms
heavily obfuscated by KlassMaster, programs obfuscated by
Thicket, and executables obfuscated by Loco/Diablo.
Categories and Subject Descriptors
D.m [Software]: Miscellaneous
General Terms
Security
Keywords
Dynamic code identification, software plagiarism detection
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICSE ’11, May 21–28, 2011, Waikiki, Honolulu, HI, USA
Copyright 2011 ACM 978-1-4503-0445-0/11/05 ...$10.00.
1. INTRODUCTION
Identifying the same or similar code fragments among different programs or within the same program is very important in
some applications. For example, duplicated code found in the same program may degrade efficiency in both the development phase (e.g., it can confuse programmers and lead to potential errors) and the execution phase (e.g., duplicated code can degrade cache performance). In this case, code identification techniques such as clone detection [1, 3, 18, 19, 16,
12, 15, 14] can be used to discover and refactor the identical
code fragments to improve the program. For another example, the same or similar code found in different programs may point to even more serious issues. If those programs have
been individually developed by different programmers, and
if they do not embed any public domain code in common,
duplicated code can be an indication of software plagiarism
or code theft. In code theft cases, determining the sameness
of two code fragments becomes much more difficult since pla-
giarizers can use various code transformation techniques in-
cluding code obfuscation techniques [8, 9, 37] to hide stolen
code from detection. In order to handle such cases, code
characterization and identification techniques must be able
to detect the identical code (i.e., two code fragments belong-
ing to the same lineage) without being easily circumvented
by code transformation techniques.
Previous works are largely insufficient in meeting all of the
following three highly desired requirements: (R1) Resiliency
to automated semantics-preserving obfuscation tools [7, 21,
32, 40]; (R2) Ability to directly work on binary executables
of suspected programs since, in some applications such as
code theft cases, the source code of suspect software products often cannot be obtained until strong evidence is collected; (R3) Platform independence, e.g., independence
from operating systems and programming languages. As we
can see in the related work section, the existing schemes
can be broken down into four classes to see their limitations
with respect to the aforementioned three requirements: (C1)
static source code comparison methods [20, 33, 39, 17, 36,
28, 29, 13]; (C2) static executable code comparison methods
[23]; (C3) dynamic control flow based methods [24]; (C4) dy-
namic API based methods [30, 34, 35]. First, Classes C1, C2, and C3 do not satisfy requirement R1 because they are vul-
nerable to semantics-preserving obfuscation techniques such
as outlining and ordering transformation. Second, C1 does
not meet R2 because it has to access source code. Third,
the existing C3 and C4 schemes do not satisfy R3 because
they rely on features of Windows or Java.

To address the above issues, we introduce a novel ap-
proach to dynamic characterization of executable programs.
After we examined various runtime properties of executable programs, we made an interesting observation: some runtime values of a program are hard to replace or eliminate by semantics-preserving transformation techniques such as optimization, obfuscation, different compilers, etc. We call such values core-values.
To investigate the resilience of core-values to semantics-preserving code transformation, we generated e_1, ..., e_5, five different versions of executable files of a test program p written in C, by compiling p with each of the five optimization switches of GCC (-O0, -O1, -O2, -O3, and -Os). From each of e_1, ..., e_5, given the same test input, we extracted a value sequence: a sequence of values (4-bit, 8-bit, 16-bit, or 32-bit) written as computation results of arithmetic and bit-wise instructions along the execution path. To retain in the value sequence only the values derived from the input, we implemented a dynamic taint analyzer.^1 When we analyzed the value sequences of e_1, ..., e_5, we found that some values survived all five optimization switches. Moreover, the sequence of the values surviving all five switches was enclosed almost perfectly by the value sequences of executables generated by compiling p with different compilers (we tested Tiny C Compiler [4] and Open Watcom C Compiler [26]). This indicates that core-values do exist and that we can use them to check whether two code fragments belong to the same lineage.
In this paper, we show (1) how we extract the values
revealing core-values; and (2) how we can apply this run-
time property to solve problems in software plagiarism de-
tection. We implemented a value extractor with a specific
dynamic taint analyzer and value refinement techniques atop
a generic processor emulator, as part of our value-based pro-
gram characterization method. As a machine code analyzer
which directly works on binary executables, our technique
satisfies R2. Because our technique analyzes generic charac-
teristics of machine instructions, it satisfies R3. Regarding
R1, we implemented a value-based software plagiarism de-
tection method (VaPD) that uses similarity measuring algo-
rithms based on sequences constructed from the extracted
values. We evaluated it through a set of real world obfus-
cators including two commercial products, Zelix Pty Ltd.’s
KlassMaster [40] and Semantic Designs Inc.’s Thicket [32].
Our experimental results indicate that VaPD successfully discriminated 34 plagiarisms obfuscated by SandMark [7] (39 obfuscators in total, but 5 of them failed to obfuscate our test programs); plagiarisms heavily obfuscated by KlassMaster,^2 programs obfuscated by the Thicket C obfuscator, and executables obfuscated by Control Flow Flattening implemented in the Loco/Diablo link-time optimizer [21].
Contributions: (1) We present a novel code characterization method based on runtime values. To the best of our knowledge, our work is the first to explore the existence of core-values. (2) By exploiting runtime values that can hardly be changed or replaced, our code characterization technique is resilient to various control and data obfuscation
techniques. (3) Our plagiarism detection method (VaPD) does not require access to the source code of suspicious programs; thus it could greatly reduce the plaintiff's risks by providing strong evidence before filing a lawsuit related to intellectual property.

^1 We have also noticed that there are studies on identifying and overcoming limitations of dynamic taint analysis. Dealing with those limitations is out of our scope.
^2 Since SandMark and KlassMaster work on Java bytecode, we use GCJ, the GNU ahead-of-time compiler for Java, to convert obfuscated programs to x86 native executables.
2. STATE OF THE ART
We roughly group the literature into the following three
categories.
Code Obfuscation Techniques: Code obfuscation is a semantics-preserving transformation intended to hinder recovery of the original form of the resulting code. A generic code obfuscation technique is not as simple as adding x before a computation and subtracting x after it. Collberg et al. [8] provided an extensive discussion on automated
code obfuscation techniques. They classify code obfuscation
techniques in the following categories depending on the fea-
ture that each technique targets: data obfuscation, control
obfuscation, layout obfuscation, and preventive transforma-
tions. Collberg et al. also introduced Opaque Predicates [9]
to thwart static disassembly. Other techniques such as indi-
rect branches, control-flow flattening, and function-pointer
aliasing were introduced by Wang [37].
Several code obfuscation tools are available. SandMark [7] is one such tool, implementing 39 obfuscators applicable to Java bytecode. Array representation and orientation,
functions, in-memory representation of variables, order of in-
structions, and control and data dependence are just a small
set of the features that SandMark can alter. Another Java
obfuscator is Zelix KlassMaster [40]. It implements comprehensive flow obfuscation techniques, making it a heavy-duty obfuscator. Semantics is the only characteristic guaranteed
to be preserved across the obfuscation.
Static Analysis Based Plagiarism Detection: The existing static analysis techniques, except for the birthmark-based techniques, are closely related to clone detection [1, 3, 18, 19, 16, 12, 15, 14, 31]. While sharing common interests with clone detection, plagiarism detection differs in that (1) we must deal with code obfuscation techniques that are often employed with malicious intent; and (2) source code analysis of the suspicious program is not possible in most cases. Static analysis techniques for
software plagiarism detection can be classified into five cate-
gories: string-based [1], AST-based [39, 17, 36], token-based
[28, 29, 13], PDG-based [20], and birthmark-based [23, 33].
String-based: Each line of source code is considered as a string. A code fragment is labeled as plagiarism if the corresponding sequence of strings matches a code fragment from the original program. AST-based: The abstract syntax
trees (AST) are constructed from two programs. If the two
ASTs have common subtrees, plagiarism may exist. Token-
based: A program is first parsed to a sequence of tokens.
The sequences of tokens are then compared to find plagia-
rism. PDG-based: A program dependency graph (PDG)
represents the control flow and data flow relations between
the statements in a program procedure. To find plagiarism,
two PDGs are constructed and compared to find a relaxed
subgraph isomorphism. Birthmark-based: A software birth-
mark is a unique characteristic of a program that can be
used to determine the program’s identity. Two birthmarks
are extracted from two programs and compared.
Table 1: Proportion of refined value sequences of GCC-compiled executables that overlap value sequences of TCC- and WCC-compiled executables.

  Compiler | Optimization switches tested | bzip2 | gzip | oggenc
  TCC      | N/A                          | 100%  | 100% | 92%
  WCC      | 20 switches                  | 100%  | 100% | >91% (avg. 95%)

None of the above techniques is resilient to code obfuscation. String-based schemes are vulnerable even to simple identifier renaming. AST-based schemes are resilient to
identifier renaming, but weak against statement reordering
and control replacement. Token-based schemes are weak
against junk code insertion and statement reordering. Be-
cause PDGs contain semantic information of programs, PDG-
based schemes are more robust than the other three types
of the existing schemes. However, the PDG-based meth-
ods are still vulnerable to many semantics-preserving trans-
formations such as inlining/outlining functions and opaque
predicates. The existing birthmark-based schemes are vul-
nerable to either obfuscation techniques mentioned in [23]
or some well-known obfuscation such as statement reorder-
ing and junk instruction insertion. Moreover, all existing
techniques except for [23, 31] need to access source code.
Dynamic Analysis Based Plagiarism Detection: Myles
and Collberg [24] proposed a whole program path (WPP)
based dynamic birthmark. WPP was originally used to
represent the dynamic control flow of a program. WPP
birthmarks are robust to some control flow obfuscation such
as opaque predicates insertion, but are still vulnerable to
many semantics-preserving transformations such as flatten-
ing and loop unwinding. Tamada et al. [34, 35] also in-
troduced two types of dynamic birthmarks for Windows
applications: Sequence of API Function Calls Birthmark
(EXESEQ) and Frequency of API Function Calls Birth-
mark (EXEFREQ). In EXESEQ, the sequence of Windows API calls is recorded during the execution of a program. These sequences are directly compared to measure similarity.
In EXEFREQ, the frequency of each Windows API call is
recorded during the execution of a program. The frequency
distribution is used as a birthmark. Schuler et al. [30] pro-
posed a dynamic birthmark for Java. The call sequences
to Java standard API are recorded and the short sequences
at object level are used as a birthmark. Their experiments
showed that their API birthmarks are more robust to obfus-
cation than WPP birthmarks. These birthmarks, however,
can only identify the same source code compiled by differ-
ent compilers with different options, and the performance
against real obfuscation techniques is questionable. For example, attackers may simply embed some of the API implementations into their program so that fewer API calls will be observed. Wang et al. [38] proposed a system call based birthmark, addressing the problems with API-based techniques.
However, the proposed technique cannot be applied to computation-oriented software containing few system calls, and is still vulnerable to injecting transparent system calls in the middle of an edge on the system call dependence graph.
3. CORE VALUES
The runtime values of a program are defined as values
from the output operands of the machine instructions ex-
ecuted. While examining the runtime values of executable
programs, we observed that some runtime values of a pro-
gram could not be changed through automated semantics-
preserving transformation techniques such as optimization,
obfuscation, different compilers, etc. We call such invariant
values core-values.
Core-values of a program are constructed from runtime
values that are pivotal for the program to transform its in-
put to desired output. We can practically eliminate non-
core values from the runtime values to retain core-values.
To identify non-core values, we leverage taint analysis and
easily accessible semantics-preserving transformation tech-
niques such as optimization techniques implemented in com-
pilers. Let v_P be a runtime value of program P taking I as input, and let f be a semantics-preserving transformation. Then, non-core values have the following properties: (1) if v_P is not derived from I, then v_P is not a core-value of P; (2) if v_P is not in the set of runtime values of f(P), then v_P is not a core-value of P.
To examine the existence of core-values, we perform a
dynamic analysis on three test programs gzip, bzip2, and
oggenc: Gzip and bzip2 are well-known compression utilities, and oggenc is an Ogg Vorbis audio encoder. For
the dataset to be used as the input to the programs, we gen-
erate ten wav audio files (seven 16KB files, two 24KB files,
and one 8KB file), cropped from a 43.5MB wav file contain-
ing an 8’37”-long speech. In each set of experiments, we use
these ten inputs, and take the average outcome as the final
result. With each of the three programs, we generate five dif-
ferent versions of executable files by compiling it with each
of the following optimization switches of GCC: -O0, -O1, -
O2, -O3, and -Os. From each of the executables given the
same input, we extract a value sequence, a sequence of values
(4-bit, 8-bit, 16-bit, or 32-bit) that are the computation re-
sults of arithmetic and bit-wise instructions in the execution
path. We also implement refinement techniques (Section 4.1
and 4.2) including a dynamic taint analyzer to retain only
the values derived from input in the sequence. Then, we re-
fine the value sequences by computing their longest common
subsequence, which contains the runtime values that survive
all of the five optimization switches.
To verify that the refined value sequences are not from
compiler-specific common routines, we compare the refined
value sequences against the value sequences extracted from
the same programs compiled by different compilers, Tiny
C Compiler (TCC) and Open Watcom C Compiler (WCC).
Compared to GCC, TCC uses different compiler components such as its parser, optimizer, and support library (libtcc.a); however, the code it produces borrows GCC's runtime libraries (libc.so). WCC is a self-contained development suite implementing its own C libraries; therefore, the code it produces does not need to use GCC's runtime libraries. Also,
WCC provides plenty of optimization options, and we test
all the 20 optimization switches to examine the refined value
sequences. As shown in Table 1, the longest common subsequence of the five sequences is enclosed almost completely
by the value sequences of executables generated by compil-
ing the same test program with TCC and WCC. Although
92% and 95% matches shown in the cases of oggenc indicate
that the refined value sequences still contain some non-core
values, these are much higher scores than those between ir-
relevant programs: as we will show shortly, the scores be-
tween irrelevant programs range from 0% to 11% in our ex-
periments.

Table 2: Proportion of refined value sequences that overlap value sequences of executables obfuscated by Thicket and control flow flattening.

  Obfuscator              | bzip2 | gzip | oggenc
  Thicket C Obfuscator    | 100%  | 100% | 95%
  Control Flow Flattening | 100%  | 100% | 100%
We further investigate the core-values through real obfus-
cation tools. For a source code obfuscation tool, we use Se-
mantic Designs, Inc.’s Thicket C obfuscator that implements
abstract syntax tree (AST) based code transformation. Its
features include, but are not limited to, identifier scrambling,
format scrambling, loop rewriting, and if-then-else rewrit-
ing. As a more advanced obfuscation technique, we use con-
trol flow flattening [37] implemented in Loco, based on the Diablo link-time optimizer [21]. Control flow flattening can transform statements s1; s2; into i=1; while(i) {switch(i) {case 1: s1; i++; break; case 2: s2; i=0; break;}}, whose control flow graph is vastly different from the original. As shown in Table 2, again our refined value sequences
are almost completely enclosed by the value sequences of
obfuscated executables.
To see how much the value sequences of different programs overlap, we compare the refined value sequences of bzip2, gzip, and oggenc against irrelevant pairs (e.g., the refined value sequence of bzip2 against the value sequence of oggenc optimized with -O1).
grams, each of which has two irrelevant peers, five optimiza-
tion switches), the value sequences of each program contain
only 0% to 11% of the refined value sequences of different
programs. This indicates that core-values do exist and that we can use them to identify the sameness of code.
4. DESIGN
Software theft has become a very serious concern to soft-
ware companies and open source communities. In the pres-
ence of automated semantics-preserving code transformation
tools [40, 21, 7, 32], the existing code characterization techniques may struggle to establish the sameness of plagiarized code and the original. In this section, we discuss
how we apply our technique to software plagiarism detection.
Later, we evaluate our method against such code obfuscation
tools in the context of software plagiarism detection.
Scope of Our Work: We consider the following types of
software plagiarisms in the presence of automated obfusca-
tors: whole-program plagiarism, where the plagiarizer copies
the whole or majority of the plaintiff program and wraps it
in a modified interface, and core-part plagiarism, where the
plagiarizer copies only a part such as a module or an engine
of the plaintiff program. The main purpose of VaPD is to provide a practical solution to the real-world problem of whole-program software plagiarism detection, in which no source code of the suspect program is available.
also be a useful tool to solve many partial plagiarism cases
where the plaintiff can provide the information about which
part of his program is likely to be plagiarized. We present
applicability of our technique to core-part plagiarism detec-
tion in the discussion section. We note that if the plagiarized
code is very small or functionally trivial, VaPD would not
be an appropriate tool.
4.1 Value Sequence Extraction
Since not all values associated with the execution of a
program are core-values, we establish the following require-
ments for a value to be added into a value sequence: The
value should be output of a value-updating instruction and
be closely related to the program’s semantics.
Informally, a computer is a state machine that makes state
transition based on input and a sequence of machine instruc-
tions. After every single execution of a machine instruc-
tion, the state is updated with the outcome of the instruc-
tion. Because the sequence of state updates reflects how the
program computes, the sequence of state-updating values is
closely related to the program’s semantics. As such, in value-
based characterization, we are interested only in the state
transitions made by value-updating instructions. More for-
mally, we can conceptualize the state-update as the change
of data stored in devices such as RAM and registers after
each instruction is performed, and we call the changed data
a state-updating value. We further define a value-updating
instruction as a machine instruction that does not always
preserve input in its output. Being an output of a value-
updating instruction is a sufficient condition to be a state-
updating value. Therefore, we exclude output values of non-
value-updating instructions from a value sequence. In our
x86 implementation, the value-updating instructions are the
standard mathematical operations (add, sub, etc.), the logi-
cal operators (and, or, etc.), bitshift arithmetic and logical
(shl, shr, etc.), and rotate operations (ror, rcl, etc.).
The above technique helps dramatically reduce the size
of a value sequence; however, in practice it is still challeng-
ing to analyze all values produced by all the value-updating
instructions. Therefore, we must apply further restrictions
to refine value sequences. There are two classes of values
computed by value-updating instructions: Class-1 includes
those derived from input of the program, and Class-2 con-
sists of those that are not. For example, when program P
is processing input I in environment E, some instructions
take values derived from input I as their input, but some
others take input from environment E such as program load
location, stack pointer, size of stack frame, etc. Since the se-
mantics is a formal representation of the way that a program
processes the input, it is obvious that the values in Class-1 are more closely related to the semantics of a program. So,
we include only the values of Class-1 in a value sequence. To
identify the values included in Class-1, we run a program in a
virtual machine environment and perform a dynamic taint
analysis [25]. We start with tainting the input, and then
our analyzer in the virtual machine propagates the taint
to every byte in registers, memory cells, and files derived
from the input. Registers and memory cells appearing in
destination operands of all the instructions that take input
from tainted registers or tainted memory locations are also
tainted, and the output values of value-updating instruc-
tions are appended into the value sequence. In the example of JLex used as a case study in this paper, the value sequences contain fewer than 7,000 values after applying taint analysis, which is significantly shorter, approximately 1/250 of the length of the original sequences.
4.2 Value Sequence Refinement
In this section, we discuss heuristics to refine value sequences.

Table 3: Applicability of value sequence refinement techniques.

  Refinement technique          | Plaintiff program | Suspect program
  Sequential refinement         | yes               | no
  Optimization-based refinement | yes               | no
  Address removal               | yes               | yes

Figure 1: Sequential refinement example (EAX is initially tainted).

  Line | Assembly code | IN      | OUT | Output value
  001  | shl $0x2,%eax | eax     | eax | 4
  002  | add %edx,%eax | edx,eax | eax | 5
  003  | add %eax,%eax | eax     | eax | 10
  004  | add %edx,%eax | edx,eax | eax | 11
  005  | add $0xb,%eax | eax     | eax | 22
  006  | ...
  (the values 4, 5, 10, and 11 are invisible at line 006)

An initial value sequence constructed through the dynamic taint analysis may still contain a number of non-core values produced by intermediate or insubstantial computational steps. We need to eliminate those values to make the value sequence (1) as close to core-values as possible; and (2) capable of characterizing larger programs. We believe a number of heuristics such as control/data flow dependence analysis and abnormal code pattern detection can be adopted to achieve these goals, and below we introduce some of them. One principle that we consider here is that we have to be conservative in processing value sequences of suspect programs. Since some heuristics may be abused by sophisticated plagiarizers, we summarize the applicability of each heuristic that we introduce in Table 3.
4.2.1 Sequential Refinement
Inside the value sequence extractor, we implement a refinement technique named sequential refinement. Figure 1 shows how GCC compiles "a=1; a=(a+1)*11;". When variable a is initially tainted, our taint analysis extracts value sequence s = {4, 5, 10, 11, 22}. Note that s_{1:4} = {4, 5, 10, 11}, a subsequence of s, is generated by the intermediate steps computing (a+1)*11. All the values in s_{1:4} are overwritten in register eax without affecting any other memory locations until line 005. Since instructions after line 005 would never read (or be affected by) the values in s_{1:4}, we can remove s_{1:4} from s and retain only {22}. We formalize this heuristic in the following rule:

Sequential Reduction Rule: Let i_{m,n} denote the m-th instruction updating variable (register or memory) n. Then, we can skip logging the output of i_{m,n} if n is never read within the range (i_{m,n}, i_{m+1,n}). Repeat the same process until the first instruction that reads n and updates a variable other than n is executed.
Throughout the experiments presented in this paper, the average reduction rate achieved by the sequential refinement is 16%, and the maximum is 34%. Note that the sequential refinement only applies to plaintiff programs because, in obfuscated programs, original values could appear as the results of intermediate computational steps.
4.2.2 Optimization-Based Refinement
Only for plaintiff programs, we perform optimization-based
refinement, as shown in Figure 2. One of the easiest ways to obtain different executable files that are semantically identical is to compile the same source code with the same compiler with different optimization switches enabled. Motivated by this idea, we use several optimized executables of
the same program to sift non-core values out. With GCC
and its five selected optimization flags (-O0, -O1, -O2, -O3,
and -Os), we can extract five optimized value sequences from
the plaintiff program. Each optimized value sequence has
been processed with the sequential refinement while it is ex-
tracted. Then, we compute a longest common subsequence
of all the optimized value sequences to retain only the com-
mon values in the resulting value sequence. As we do not
assume we have access to the source code of suspect pro-
grams, this refinement heuristic is only applicable to plaintiff
programs.
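As a sketch of this refinement step, the following folds a pairwise LCS across the optimized value sequences. Note that iterated pairwise LCS is a simple approximation of a true multi-sequence LCS, and the sequences here are illustrative, not extracted from real binaries.

```python
def lcs(a, b):
    """Classic dynamic-programming LCS, returning one longest common
    subsequence (not necessarily a contiguous segment)."""
    m, n = len(a), len(b)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + [a[i]]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def refine_by_optimization(sequences):
    """Retain only values common to every optimized value sequence by
    folding a pairwise LCS over the list."""
    common = sequences[0]
    for seq in sequences[1:]:
        common = lcs(common, seq)
    return common

# Illustrative value sequences, as if extracted at -O0, -O1, and -O2:
seqs = [[1, 2, 6, 24, 120], [2, 6, 24, 120], [1, 6, 24, 120]]
```

Here only the values surviving every optimization level, {6, 24, 120}, remain in the refined sequence.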
4.2.3 Address Removal
Memory addresses or pointer values stored in registers or
memory locations are transient. For example, some binary
transformation techniques such as word alignment and local variable reordering can change pointers to local variables or offsets in the stack, and heap pointers may not be the same the next time the program is executed, even with the same input.
Therefore, we do not include pointer values in a refined value
sequence.
In our VaPD prototype, we implement a range-checking heuristic to detect addresses. Our testbed dynamically monitors the changes of memory pages allocated to the
program being analyzed, and it maintains a list of ranges of
all the allocated pages with write permission enabled. If a
runtime value is found to be within the ranges in the list,
VaPD discards the value, regarding the value as an address.
Although this heuristic may also delete some non-pointer values, it removes pointers to the stack and to the heap without exception. The address removal heuristic is applicable to both plaintiff and suspect programs.
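A minimal sketch of the range-checking heuristic follows; the class shape and the page addresses are hypothetical (the real testbed tracks page mappings inside the emulator).

```python
class AddressFilter:
    """Range-checking heuristic: track writable pages mapped to the
    monitored program and discard any runtime value falling inside
    one of them, treating such values as pointers."""

    def __init__(self):
        self.ranges = []                  # (start, end) of writable pages

    def on_page_mapped(self, start, size):
        """Called by the testbed when a writable page is allocated."""
        self.ranges.append((start, start + size))

    def is_address(self, value):
        return any(lo <= value < hi for lo, hi in self.ranges)

    def refine(self, values):
        return [v for v in values if not self.is_address(v)]

flt = AddressFilter()
flt.on_page_mapped(0xBFFD0000, 0x20000)   # hypothetical stack pages
```

With this mapping, a value such as 0xBFFD1234 is discarded as an address while ordinary computation results pass through.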
4.3 Similarity Metric
In the literature, there are many metrics for measuring the
degree of similarity of two sequences. In our prototype, we
define it based on the longest common subsequence (LCS).
It should be noted that the definition of the LCS does not
require every subsequence to be a continuous segment of the
mother sequence. For example, both {1, 6, 120} and {2,
24} are valid subsequences of value sequence {1, 2, 6, 24,
120}. Let |LCS(s_1, s_2)| denote the length of the LCS of sequences s_1 and s_2. Given v_P, a fully refined value sequence of a plaintiff program, and v_S, a value sequence of a suspect program, the similarity score of the suspect program over the plaintiff program is intuitively defined as:

Sim(v_P, v_S) = |LCS(v_P, v_S)| / |v_P|
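The metric can be computed directly from its definition. This is a standard dynamic-programming LCS length, not VaPD's actual code:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence; subsequences need
    not be contiguous segments of the mother sequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def similarity(v_p, v_s):
    """Sim(v_P, v_S) = |LCS(v_P, v_S)| / |v_P|."""
    return lcs_len(v_p, v_s) / len(v_p)
```

For the sequences from the example above, {1, 6, 120} is a (non-contiguous) subsequence of {1, 2, 6, 24, 120}, giving a score of 3/5 = 0.6.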
4.4 Design Overview
Figure 3 shows the overall design of VaPD. Here, provided with executable files of plaintiff program P and suspect program S, and a common test input I, the Value Sequence Extractor (VSE) extracts v_P and v_S, the value sequences of P and S. After refining v_P and v_S, the Similarity Detector computes Sim(v_P, v_S), the similarity score of v_P and v_S. VaPD repeats this process with different inputs (say, 10 or 20 inputs), and claims plagiarism if the average of the scores shows a significant similarity.
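The loop described above might be sketched as follows. All callables and the 0.8 decision threshold are illustrative stand-ins, since the paper only requires that the average score show significant similarity:

```python
from statistics import mean

def vapd_pipeline(plaintiff, suspect, inputs, extract, refine, score,
                  threshold=0.8):
    """End-to-end sketch of Figure 3. For each common test input, the
    Value Sequence Extractor (modeled by `extract`) produces a value
    sequence for each binary; the plaintiff's sequence is fully
    refined; `score` computes Sim(v_P, v_S); and the per-input scores
    are averaged into the final verdict."""
    scores = []
    for inp in inputs:
        v_p = refine(extract(plaintiff, inp))
        v_s = extract(suspect, inp)   # suspect side: no source-level refinement
        scores.append(score(v_p, v_s))
    avg = mean(scores)
    return avg, avg >= threshold
```

A caller would plug in the real extractor, refinement chain, and similarity metric; with stub callables the control flow can be exercised directly.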
Citations

- Attack of the Clones: Detecting Cloned Applications on Android Markets. Presents DNADroid, a tool that detects Android application copying ("cloning") by robustly computing the similarity between two applications; it achieves this by comparing program dependency graphs between methods in candidate applications.
- Achieving accuracy and scalability simultaneously in detecting application clones on Android markets. The implemented app clone detection system uses a geometry characteristic of dependency graphs to measure the similarity between methods in two apps, then synthesizes the method-level similarities and draws a yes/no conclusion on app (core functionality) cloning.
- Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. Develops an assembly code representation learning model that finds and incorporates rich semantic relationships among tokens appearing in assembly code, significantly outperforming existing methods against changes introduced by obfuscation and optimization.
- Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. Proposes a binary-oriented, obfuscation-resilient method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with LCS-based fuzzy matching.
- ViewDroid: towards obfuscation-resilient mobile application repackaging detection. Proposes ViewDroid, a user-interface-based approach to mobile app repackaging detection that can detect repackaged apps at large scale, both effectively and efficiently.
References

- CCFinder: a multilinguistic token-based code clone detection system for large scale source code. Proposes a clone detection technique consisting of a transformation of the input source text and a token-by-token comparison; the technique has effectively found clones, and its metrics effectively identify the characteristics of the analyzed systems.
- Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. TaintCheck performs dynamic taint analysis by rewriting binaries at run time; it reliably detects most types of exploits and produced no false positives for any of the many programs tested.
- Winnowing: local algorithms for document fingerprinting. Introduces the class of local document fingerprinting algorithms, which captures an essential property of any fingerprinting technique guaranteed to detect copies, and proves a novel lower bound on the performance of any local algorithm.
- Clone detection using abstract syntax trees. Presents simple, practical methods for detecting exact and near-miss clones over arbitrary program fragments in source code using abstract syntax trees, and suggests that clone detection could be useful for producing more structured code and for discovering domain concepts during reverse engineering.
- A Taxonomy of Obfuscating Transformations. Argues that automatic code obfuscation is currently the most viable method for preventing reverse engineering, and describes the design of a code obfuscator, a tool that converts a program into an equivalent one that is more difficult to understand and reverse engineer.
Frequently Asked Questions (19)
Q1. What are the contributions mentioned in the paper "Value-based program characterization and its application to software plagiarism detection" ?

Based on an observation that some critical runtime values are hard to be replaced or eliminated by semantics-preserving transformation techniques, the authors introduce a novel approach to dynamic characterization of executable programs. The authors show how the values can be extracted and refined to expose the critical values and how they can apply this runtime property to help solve problems in software plagiarism detection. The authors have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator.

As their future work, the authors will examine the relationship between values. A better understanding of the logical connection among the values will enable us to further remove system noise or less significant values. In addition, the authors will study the impact of emulation-based obfuscators such as Themida and Code Virtualizer [27] on VaPD's performance. The authors believe their detection method can handle such obfuscators.

WPP birthmarks are robust to some control flow obfuscation such as opaque predicates insertion, but are still vulnerable to many semantics-preserving transformations such as flattening and loop unwinding. 

Array representation and orientation, functions, in-memory representation of variables, order of instructions, and control and data dependence are just a small set of the features that SandMark can alter. 

If injected successfully, noise could dramatically increase the size of an extracted value sequence, slowing down the similarity score computation and consuming more memory.

Because VaPD analyzes x86 machine code, the authors convert Java bytecode (used in the SandMark and KlassMaster experiments) to x86 executables using GCJ 4.1.2, the GNU ahead-of-time compiler for Java.

With GCC and its five selected optimization flags (-O0, -O1, -O2, -O3, and -Os), the authors can extract five optimized value sequences from the plaintiff program. 

Motivated by an observation that some outcome values computed by machine instructions survive various semantics-preserving code transformations, the authors have proposed a technique that directly examines executable files and does not need to access the source code of suspicious programs. 

In 30 comparison cases (three test programs, each of which has two irrelevant peers, five optimization switches), the value sequences of each program contain only 0% to 11% of the refined value sequences of different programs. 

They classify code obfuscation techniques in the following categories depending on the feature that each technique targets: data obfuscation, control obfuscation, layout obfuscation, and preventive transformations. 

For the dataset to be used as the input to the programs, the authors generate ten wav audio files (seven 16KB files, two 24KB files, and one 8KB file), cropped from a 43.5MB wav file containing an 8’37”-long speech. 

Their experimental results indicate that VaPD successfully discriminated 34 plagiarisms obfuscated by SandMark [7] (39 obfuscators in total, but 5 of them failed to obfuscate the test programs); plagiarisms heavily obfuscated by KlassMaster; programs obfuscated by the Thicket C obfuscator; and executables obfuscated by Control Flow Flattening implemented in the Loco/Diablo link-time optimizer [21].

If necessary, the authors can enable VSE to include specific shared libraries in the value sequence extraction because the virtual machine knows which libraries are loaded and where they are.

There are two classes of values computed by value-updating instructions: Class-1 includes those derived from input of the program, and Class-2 consists of those that are not. 

Although it is theoretically possible for a series of multiple obfuscators to transform a program, applying many obfuscators to a single program could raise practical issues regarding the correctness of the target program and efficiency.

Since not all values associated with the execution of a program are core-values, the authors establish the following requirements for a value to be added into a value sequence: 

The proposed technique cannot be applied to computation-oriented software containing few system calls, and is still vulnerable to the injection of transparent system calls in the middle of an edge on the system call dependence graph.

A better understanding of the logical connection among the values will enable us to further remove system noise or less significant values. 

The authors believe a number of heuristics such as control/data flow dependence analysis and abnormal code pattern detection can be adopted to achieve these goals, and below the authors introduce some of them.