
Value-Based Program Characterization and
Its Application to Software Plagiarism Detection
Yoon-Chan Jhi^1, Xinran Wang^1, Xiaoqi Jia^2, Sencun Zhu^1, Peng Liu^1, Dinghao Wu^1
^1 Penn State University, University Park, PA 16802
{jhi, xinrwang, szhu}@cse.psu.edu, {pliu, dwu}@ist.psu.edu
^2 State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences
xjia@is.iscas.ac.cn
ABSTRACT
Identifying similar or identical code fragments becomes much
more challenging in code theft cases where plagiarizers can
use various automated code transformation techniques to
hide stolen code from being detected. Previous works in this
field are largely limited in that (1) most of them cannot han-
dle advanced obfuscation techniques; (2) the methods based
on source code analysis are less practical since the source
code of suspicious programs is typically not available until
strong evidence is collected; and (3) those depending on
the features of specific operating systems or programming
languages have limited applicability.
Based on an observation that some critical runtime values are hard to replace or eliminate by semantics-preserving transformation techniques, we introduce a novel
approach to dynamic characterization of executable programs.
Leveraging such invariant values, our technique is resilient to
various control and data obfuscation techniques. We show
how the values can be extracted and refined to expose the
critical values and how we can apply this runtime property
to help solve problems in software plagiarism detection. We
have implemented a prototype with a dynamic taint analyzer
atop a generic processor emulator. Our experimental re-
sults show that the value-based method successfully discrim-
inates 34 plagiarisms obfuscated by SandMark, plagiarisms
heavily obfuscated by KlassMaster, programs obfuscated by
Thicket, and executables obfuscated by Loco/Diablo.
Categories and Subject Descriptors
D.m [Software]: Miscellaneous
General Terms
Security
Keywords
Dynamic code identification, software plagiarism detection
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICSE ’11, May 21–28, 2011, Waikiki, Honolulu, HI, USA
Copyright 2011 ACM 978-1-4503-0445-0/11/05 ...$10.00.
1. INTRODUCTION
Identifying the same or similar code fragments among different programs or within the same program is very important in
some applications. For example, duplicated code found in the same program may degrade efficiency in both the development phase (e.g., it can confuse programmers and lead to potential errors) and the execution phase (e.g., duplicated code can degrade cache performance). In this case, code identification techniques such as clone detection [1, 3, 18, 19, 16,
12, 15, 14] can be used to discover and refactor the identical
code fragments to improve the program. For another example, the same or similar code found in different programs may point to even more serious issues. If those programs have
been individually developed by different programmers, and
if they do not embed any public domain code in common,
duplicated code can be an indication of software plagiarism
or code theft. In code theft cases, determining the sameness
of two code fragments becomes much more difficult since pla-
giarizers can use various code transformation techniques in-
cluding code obfuscation techniques [8, 9, 37] to hide stolen
code from detection. In order to handle such cases, code
characterization and identification techniques must be able
to detect the identical code (i.e., two code fragments belong-
ing to the same lineage) without being easily circumvented
by code transformation techniques.
Previous works are largely insufficient in meeting all of the
following three highly desired requirements: (R1) Resiliency
to automated semantics-preserving obfuscation tools [7, 21,
32, 40]; (R2) Ability to directly work on binary executables
of suspected programs since, in some applications such as
code theft cases, the source code of suspect software products often cannot be obtained until strong evidence is collected; (R3) Platform independence, e.g., independence
from operating systems and programming languages. As we
can see in the related work section, the existing schemes
can be broken down into four classes to see their limitations
with respect to the aforementioned three requirements: (C1)
static source code comparison methods [20, 33, 39, 17, 36,
28, 29, 13]; (C2) static executable code comparison methods
[23]; (C3) dynamic control flow based methods [24]; (C4) dy-
namic API based methods [30, 34, 35]. First, Classes C1, C2, and C3 do not satisfy requirement R1 because they are vul-
nerable to semantics-preserving obfuscation techniques such
as outlining and ordering transformation. Second, C1 does
not meet R2 because it has to access source code. Third,
the existing C3 and C4 schemes do not satisfy R3 because
they rely on features of Windows or Java.

To address the above issues, we introduce a novel ap-
proach to dynamic characterization of executable programs.
After we examined various runtime properties of executable programs, we made an interesting observation: some runtime values of a program are hard to replace or eliminate by semantics-preserving transformation techniques such as optimization, obfuscation, different compilers, etc. We call such values core-values.
To investigate the resilience of core-values to semantics-preserving code transformation, we generated e_1, ..., e_5, five different versions of executable files of a test program p written in C, by compiling p with each of the five optimization switches of GCC (-O0, -O1, -O2, -O3, and -Os). From each of e_1, ..., e_5, given the same test input, we extracted a value sequence: a sequence of values (4-bit, 8-bit, 16-bit, or 32-bit) written as computation results of arithmetic and bit-wise instructions along the execution path. To retain in the value sequence only the values derived from the input, we implemented a dynamic taint analyzer.^1 When we analyzed the value sequences of e_1, ..., e_5, we found that some values survived all five optimization switches. Moreover, the sequence of the values surviving all five switches was enclosed almost perfectly by the value sequences of executables generated by compiling p with different compilers (we tested Tiny C Compiler [4] and Open Watcom C Compiler [26]). This indicates that core-values do exist and that we can use them to check whether two code fragments belong to the same lineage.
In this paper, we show (1) how we extract the values
revealing core-values; and (2) how we can apply this run-
time property to solve problems in software plagiarism de-
tection. We implemented a value extractor with a specific
dynamic taint analyzer and value refinement techniques atop
a generic processor emulator, as part of our value-based pro-
gram characterization method. As a machine code analyzer
which directly works on binary executables, our technique
satisfies R2. Because our technique analyzes generic charac-
teristics of machine instructions, it satisfies R3. Regarding
R1, we implemented a value-based software plagiarism de-
tection method (VaPD) that uses similarity measuring algo-
rithms based on sequences constructed from the extracted
values. We evaluated it through a set of real world obfus-
cators including two commercial products, Zelix Pty Ltd.’s
KlassMaster [40] and Semantic Designs Inc.’s Thicket [32].
Our experimental results indicate that VaPD successfully discriminated 34 plagiarisms obfuscated by SandMark [7] (39 obfuscators in total, but 5 of them failed to obfuscate our test programs); plagiarisms heavily obfuscated by KlassMaster,^2 programs obfuscated by the Thicket C obfuscator, and executables obfuscated by Control Flow Flattening implemented in the Loco/Diablo link-time optimizer [21].
Contributions: (1) We present a novel code characterization method based on runtime values. To the best of our knowledge, our work is the first to explore the existence of core-values. (2) By exploiting runtime values that can hardly be changed or replaced, our code characterization technique is resilient to various control and data obfuscation
techniques. (3) Our plagiarism detection method (VaPD) does not require access to the source code of suspicious programs; thus it could greatly reduce the plaintiff's risks by providing strong evidence before filing a lawsuit related to intellectual property.

^1 We have also noticed that there are studies on identifying and overcoming limitations of dynamic taint analysis. Dealing with those limitations is out of our scope.
^2 Since SandMark and KlassMaster work on Java bytecode, we use GCJ, the GNU ahead-of-time compiler for Java, to convert obfuscated programs to x86 native executables.
2. STATE OF THE ART
We roughly group the literature into the following three
categories.
Code Obfuscation Techniques: Code obfuscation is a semantics-preserving transformation intended to hinder recovery of the original form of the resulting code. A generic code obfuscation technique is not as simple as adding x before a computation and subtracting x after it. Collberg et al. [8] provided an extensive discussion on automated
code obfuscation techniques. They classify code obfuscation
techniques in the following categories depending on the fea-
ture that each technique targets: data obfuscation, control
obfuscation, layout obfuscation, and preventive transforma-
tions. Collberg et al. also introduced Opaque Predicates [9]
to thwart static disassembly. Other techniques such as indi-
rect branches, control-flow flattening, and function-pointer
aliasing were introduced by Wang [37].
Several code obfuscation tools are available. SandMark [7] is one such tool, implementing 39 obfuscators applicable to Java bytecode. Array representation and orientation,
functions, in-memory representation of variables, order of in-
structions, and control and data dependence are just a small
set of the features that SandMark can alter. Another Java
obfuscator is Zelix KlassMaster [40]. It implements comprehensive flow obfuscation techniques, making it a heavy-duty obfuscator. Semantics is the only characteristic guaranteed
to be preserved across the obfuscation.
Static Analysis Based Plagiarism Detection: The existing static analysis techniques, except for the birthmark-based techniques, are closely related to clone detection [1, 3, 18, 19, 16, 12, 15, 14, 31]. While sharing common interests with clone detection, plagiarism detection differs in that (1) we must deal with code obfuscation techniques that are often employed with malicious intent; and (2) source code analysis of the suspicious program is not possible in most cases. Static analysis techniques for
software plagiarism detection can be classified into five cate-
gories: string-based [1], AST-based [39, 17, 36], token-based
[28, 29, 13], PDG-based [20], and birthmark-based [23, 33].
String-based: Each line of source code is considered as a string. A code fragment is labeled as plagiarism if the corresponding sequence of strings matches a code fragment from the original program. AST-based: The abstract syntax
trees (AST) are constructed from two programs. If the two
ASTs have common subtrees, plagiarism may exist. Token-
based: A program is first parsed to a sequence of tokens.
The sequences of tokens are then compared to find plagia-
rism. PDG-based: A program dependency graph (PDG)
represents the control flow and data flow relations between
the statements in a program procedure. To find plagiarism,
two PDGs are constructed and compared to find a relaxed
subgraph isomorphism. Birthmark-based: A software birth-
mark is a unique characteristic of a program that can be
used to determine the program’s identity. Two birthmarks
are extracted from two programs and compared.
Table 1: Proportion of refined value sequences of GCC-compiled executables that overlap value sequences of TCC- and WCC-compiled executables.

  Compiler | Optimization switches tested | bzip2 | gzip | oggenc
  TCC      | N/A                          | 100%  | 100% | 92%
  WCC      | 20 switches                  | 100%  | 100% | >91% (avg. 95%)

None of the above techniques is resilient to code obfuscation. String-based schemes are vulnerable even to simple identifier renaming. AST-based schemes are resilient to
identifier renaming, but weak against statement reordering
and control replacement. Token-based schemes are weak
against junk code insertion and statement reordering. Be-
cause PDGs contain semantic information of programs, PDG-
based schemes are more robust than the other three types
of the existing schemes. However, the PDG-based meth-
ods are still vulnerable to many semantics-preserving trans-
formations such as inlining/outlining functions and opaque
predicates. The existing birthmark-based schemes are vul-
nerable to either obfuscation techniques mentioned in [23]
or some well-known obfuscation such as statement reorder-
ing and junk instruction insertion. Moreover, all existing
techniques except for [23, 31] need to access source code.
Dynamic Analysis Based Plagiarism Detection: Myles
and Collberg [24] proposed a whole program path (WPP)
based dynamic birthmark. WPP was originally used to
represent the dynamic control flow of a program. WPP
birthmarks are robust to some control flow obfuscation such
as opaque predicates insertion, but are still vulnerable to
many semantics-preserving transformations such as flatten-
ing and loop unwinding. Tamada et al. [34, 35] also in-
troduced two types of dynamic birthmarks for Windows
applications: Sequence of API Function Calls Birthmark
(EXESEQ) and Frequency of API Function Calls Birth-
mark (EXEFREQ). In EXESEQ, the sequence of Windows API calls is recorded during the execution of a program. These sequences are directly compared to measure similarity.
In EXEFREQ, the frequency of each Windows API call is
recorded during the execution of a program. The frequency
distribution is used as a birthmark. Schuler et al. [30] pro-
posed a dynamic birthmark for Java. The call sequences
to Java standard API are recorded and the short sequences
at object level are used as a birthmark. Their experiments
showed that their API birthmarks are more robust to obfus-
cation than WPP birthmarks. These birthmarks, however,
can only identify the same source code compiled by differ-
ent compilers with different options, and the performance
against real obfuscation techniques is questionable. For example, attackers may simply embed some of the API implementations into their program so that fewer API calls will be observed. Wang et al. [38] proposed a system call based birthmark, addressing the problems with API-based techniques.
However, the proposed technique cannot be applied to computation-oriented software containing few system calls, and is still vulnerable to injecting transparent system calls in the middle of an edge on the system call dependence graph.
3. CORE VALUES
The runtime values of a program are defined as values
from the output operands of the machine instructions ex-
ecuted. While examining the runtime values of executable
programs, we observed that some runtime values of a pro-
gram could not be changed through automated semantics-
preserving transformation techniques such as optimization,
obfuscation, different compilers, etc. We call such invariant
values core-values.
Core-values of a program are constructed from runtime
values that are pivotal for the program to transform its in-
put to desired output. We can practically eliminate non-
core values from the runtime values to retain core-values.
To identify non-core values, we leverage taint analysis and
easily accessible semantics-preserving transformation tech-
niques such as optimization techniques implemented in com-
pilers. Let v_P be a runtime value of program P taking I as input, and let f be a semantics-preserving transformation. Then, non-core values have the following properties: (1) if v_P is not derived from I, then v_P is not a core-value of P; (2) if v_P is not in the set of runtime values of f(P), then v_P is not a core-value of P.
To examine the existence of core-values, we perform a
dynamic analysis on three test programs gzip, bzip2, and
oggenc: Gzip and bzip2 are well-known compression utilities, and oggenc is an Ogg Vorbis audio encoder. For
the dataset to be used as the input to the programs, we gen-
erate ten wav audio files (seven 16KB files, two 24KB files,
and one 8KB file), cropped from a 43.5MB wav file contain-
ing an 8’37”-long speech. In each set of experiments, we use
these ten inputs, and take the average outcome as the final
result. With each of the three programs, we generate five dif-
ferent versions of executable files by compiling it with each
of the following optimization switches of GCC: -O0, -O1, -
O2, -O3, and -Os. From each of the executables given the
same input, we extract a value sequence, a sequence of values
(4-bit, 8-bit, 16-bit, or 32-bit) that are the computation re-
sults of arithmetic and bit-wise instructions in the execution
path. We also implement refinement techniques (Section 4.1
and 4.2) including a dynamic taint analyzer to retain only
the values derived from input in the sequence. Then, we re-
fine the value sequences by computing their longest common
subsequence, which contains the runtime values that survive
all of the five optimization switches.
To verify that the refined value sequences are not from
compiler-specific common routines, we compare the refined
value sequences against the value sequences extracted from
the same programs compiled by different compilers, Tiny
C Compiler (TCC) and Open Watcom C Compiler (WCC).
Compared to GCC, TCC uses different compiler components such as its parser, optimizer, and support library (libtcc.a); however, the code it produces borrows GCC's runtime libraries (libc.so). WCC is a self-contained development suite implementing its own C libraries; therefore, the code it produces does not need to use GCC's runtime libraries. Also,
WCC provides plenty of optimization options, and we test
all the 20 optimization switches to examine the refined value
sequences. As shown in Table 1, the longest common subsequence of the five sequences is enclosed almost completely
by the value sequences of executables generated by compil-
ing the same test program with TCC and WCC. Although
92% and 95% matches shown in the cases of oggenc indicate
that the refined value sequences still contain some non-core
values, these are much higher scores than those between ir-
relevant programs: as we will show shortly, the scores be-
tween irrelevant programs range from 0% to 11% in our ex-
periments.

Table 2: Proportion of refined value sequences that overlap value sequences of executables obfuscated by Thicket and control flow flattening.

  Obfuscator              | bzip2 | gzip | oggenc
  Thicket C Obfuscator    | 100%  | 100% | 95%
  Control Flow Flattening | 100%  | 100% | 100%
We further investigate the core-values through real obfus-
cation tools. For a source code obfuscation tool, we use Se-
mantic Designs, Inc.’s Thicket C obfuscator that implements
abstract syntax tree (AST) based code transformation. Its
features include, but are not limited to, identifier scrambling,
format scrambling, loop rewriting, and if-then-else rewrit-
ing. As a more advanced obfuscation technique, we use con-
trol flow flattening [37] implemented in Loco, based on the Diablo link-time optimizer [21]. Control flow flattening can transform statements s1; s2; into i=1; while(i) {switch(i) {case 1: s1; i++; break; case 2: s2; i=0; break;}}, whose control flow graph is vastly different from the original. As shown in Table 2, again our refined value sequences
are almost completely enclosed by the value sequences of
obfuscated executables.
To see how much the value sequences of different programs overlap, we compare the refined value sequences of bzip2, gzip, and oggenc against irrelevant pairs (e.g., the refined value sequence of bzip2 against the value sequence of oggenc optimized with -O1).
grams, each of which has two irrelevant peers, five optimiza-
tion switches), the value sequences of each program contain
only 0% to 11% of the refined value sequences of different
programs. This indicates that core-values do exist and that we can use them to identify the sameness of code.
4. DESIGN
Software theft has become a very serious concern to soft-
ware companies and open source communities. In the pres-
ence of automated semantics-preserving code transformation
tools [40, 21, 7, 32], the existing code characterization techniques may struggle to establish the sameness of plagiarized code and the original. In this section, we discuss
how we apply our technique to software plagiarism detection.
Later, we evaluate our method against such code obfuscation
tools in the context of software plagiarism detection.
Scope of Our Work: We consider the following types of
software plagiarisms in the presence of automated obfusca-
tors: whole-program plagiarism, where the plagiarizer copies
the whole or majority of the plaintiff program and wraps it
in a modified interface, and core-part plagiarism, where the
plagiarizer copies only a part such as a module or an engine
of the plaintiff program. The main purpose of VaPD is to provide a practical solution to the real-world problem of whole-program software plagiarism detection, in which no source code of the suspect program is available.
also be a useful tool to solve many partial plagiarism cases
where the plaintiff can provide the information about which
part of his program is likely to be plagiarized. We present
applicability of our technique to core-part plagiarism detec-
tion in the discussion section. We note that if the plagiarized
code is very small or functionally trivial, VaPD would not
be an appropriate tool.
4.1 Value Sequence Extraction
Since not all values associated with the execution of a
program are core-values, we establish the following require-
ments for a value to be added into a value sequence: The
value should be output of a value-updating instruction and
be closely related to the program’s semantics.
Informally, a computer is a state machine that makes state
transition based on input and a sequence of machine instruc-
tions. After every single execution of a machine instruc-
tion, the state is updated with the outcome of the instruc-
tion. Because the sequence of state updates reflects how the
program computes, the sequence of state-updating values is
closely related to the program’s semantics. As such, in value-
based characterization, we are interested only in the state
transitions made by value-updating instructions. More for-
mally, we can conceptualize the state-update as the change
of data stored in devices such as RAM and registers after
each instruction is performed, and we call the changed data
a state-updating value. We further define a value-updating
instruction as a machine instruction that does not always
preserve input in its output. Being an output of a value-
updating instruction is a sufficient condition to be a state-
updating value. Therefore, we exclude output values of non-
value-updating instructions from a value sequence. In our
x86 implementation, the value-updating instructions are the
standard mathematical operations (add, sub, etc.), the logi-
cal operators (and, or, etc.), bitshift arithmetic and logical
(shl, shr, etc.), and rotate operations (ror, rcl, etc.).
The above technique helps dramatically reduce the size
of a value sequence; however, in practice it is still challeng-
ing to analyze all values produced by all the value-updating
instructions. Therefore, we must apply further restrictions
to refine value sequences. There are two classes of values
computed by value-updating instructions: Class-1 includes
those derived from input of the program, and Class-2 con-
sists of those that are not. For example, when program P
is processing input I in environment E, some instructions
take values derived from input I as their input, but some
others take input from environment E such as program load
location, stack pointer, size of stack frame, etc. Since the se-
mantics is a formal representation of the way that a program
processes the input, it is obvious that the values in Class-1 are more closely related to the semantics of a program. So,
we include only the values of Class-1 in a value sequence. To
identify the values included in Class-1, we run a program in a
virtual machine environment and perform a dynamic taint
analysis [25]. We start with tainting the input, and then
our analyzer in the virtual machine propagates the taint
to every byte in registers, memory cells, and files derived
from the input. Registers and memory cells appearing in
destination operands of all the instructions that take input
from tainted registers or tainted memory locations are also
tainted, and the output values of value-updating instruc-
tions are appended into the value sequence. In the example of JLex used as a case study in this paper, the value sequences contain fewer than 7,000 values after applying taint analysis, which is significantly shorter, approximately 1/250 of the length of the original sequences.
4.2 Value Sequence Refinement
In this section, we discuss heuristics to refine value sequences.

Table 3: Applicability of value sequence refinement techniques.

  Refinement technique          | Plaintiff program | Suspect program
  Sequential refinement         | yes               | no
  Optimization-based refinement | yes               | no
  Address removal               | yes               | yes

Figure 1: Sequential refinement example (EAX is initially tainted).

  Line | Assembly code | IN      | OUT | Output value
  001  | shl $0x2,%eax | eax     | eax | 4
  002  | add %edx,%eax | edx,eax | eax | 5
  003  | add %eax,%eax | eax     | eax | 10
  004  | add %edx,%eax | edx,eax | eax | 11
  005  | add $0xb,%eax | eax     | eax | 22
  006  | ...
  (the values 4, 5, 10, and 11 are invisible at line 006)

An initial value sequence constructed through the dynamic taint analysis may still contain a number of non-core values produced by intermediate or insubstantial computational steps. We need to eliminate those values to make the value sequence (1) as close to core-values as possible; and (2) capable of characterizing larger programs. We believe a number of heuristics such as control/data flow dependence analysis and abnormal code pattern detection can be adopted to achieve these goals, and below we introduce some of them. One principle that we consider here is that we have to be conservative in processing value sequences of suspect programs. Since some heuristics may be abused by sophisticated plagiarizers, we summarize the applicability of each heuristic that we introduce in Table 3.
4.2.1 Sequential Refinement
Inside the value sequence extractor, we implement a refinement technique named sequential refinement. Figure 1 shows how GCC compiles "a=1; a=(a+1)*11;". When variable a is initially tainted, our taint analysis extracts value sequence s = {4, 5, 10, 11, 22}. Note that s_{1:4} = {4, 5, 10, 11}, a subsequence of s, is generated by the intermediate steps computing (a+1)*11. All the values in s_{1:4} are overwritten in register eax without affecting any other memory locations until line 005. Since instructions after line 005 would never read (or be affected by) the values in s_{1:4}, we can remove s_{1:4} from s and retain only {22}. We formalize this heuristic in the following rule:

Sequential Reduction Rule: Let i_{m,n} denote the m-th instruction updating variable (register or memory) n. Then, we can skip logging the output of i_{m,n} if n is never read within the range (i_{m,n}, i_{m+1,n}). Repeat the same process until the first instruction that reads n and updates a variable other than n is executed.
Throughout the experiments presented in this paper, the average reduction rate achieved by the sequential refinement is 16%, and the maximum is 34%. Note that the sequential refinement only applies to plaintiff programs because, in obfuscated programs, original values could appear as the results of intermediate computational steps.
4.2.2 Optimization-Based Refinement
Only for plaintiff programs, we perform optimization-based
refinement, as shown in Figure 2. One of the easiest ways to obtain different executable files that are semantically identical is to compile the same source code with the same compiler with different optimization switches enabled. Motivated by this idea, we use several optimized executables of
the same program to sift non-core values out. With GCC
and its five selected optimization flags (-O0, -O1, -O2, -O3,
and -Os), we can extract five optimized value sequences from
the plaintiff program. Each optimized value sequence has
been processed with the sequential refinement while it is ex-
tracted. Then, we compute a longest common subsequence
of all the optimized value sequences to retain only the com-
mon values in the resulting value sequence. As we do not
assume we have access to the source code of suspect pro-
grams, this refinement heuristic is only applicable to plaintiff
programs.
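As a sketch of this refinement step, the following folds a pairwise LCS across the optimized value sequences. Note that iterated pairwise LCS is a simple approximation of a true multi-sequence LCS, and the sequences here are illustrative, not extracted from real binaries.

```python
def lcs(a, b):
    """Classic dynamic-programming LCS, returning one longest common
    subsequence (not necessarily a contiguous segment)."""
    m, n = len(a), len(b)
    dp = [[[] for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + [a[i]]
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
    return dp[m][n]

def refine_by_optimization(sequences):
    """Retain only values common to every optimized value sequence by
    folding a pairwise LCS over the list."""
    common = sequences[0]
    for seq in sequences[1:]:
        common = lcs(common, seq)
    return common

# Illustrative value sequences, as if extracted at -O0, -O1, and -O2:
seqs = [[1, 2, 6, 24, 120], [2, 6, 24, 120], [1, 6, 24, 120]]
```

Here only the values surviving every optimization level, {6, 24, 120}, remain in the refined sequence.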
4.2.3 Address Removal
Memory addresses or pointer values stored in registers or
memory locations are transient. For example, some binary
transformation techniques such as word alignment and local variable reordering can change pointers to local variables or offsets in the stack, and heap pointers may not be the same the next time the program is executed, even with the same input.
Therefore, we do not include pointer values in a refined value
sequence.
In our VaPD prototype, we implement a range-checking heuristic to detect addresses. Our testbed dynamically monitors the changes of memory pages allocated to the
program being analyzed, and it maintains a list of ranges of
all the allocated pages with write permission enabled. If a
runtime value is found to be within the ranges in the list,
VaPD discards the value, regarding the value as an address.
Although this heuristic may also delete some non-pointer values, it removes pointers to the stack and to the heap without exception. The address removal heuristic is applicable to both plaintiff and suspect programs.
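A minimal sketch of the range-checking heuristic follows; the class shape and the page addresses are hypothetical (the real testbed tracks page mappings inside the emulator).

```python
class AddressFilter:
    """Range-checking heuristic: track writable pages mapped to the
    monitored program and discard any runtime value falling inside
    one of them, treating such values as pointers."""

    def __init__(self):
        self.ranges = []                  # (start, end) of writable pages

    def on_page_mapped(self, start, size):
        """Called by the testbed when a writable page is allocated."""
        self.ranges.append((start, start + size))

    def is_address(self, value):
        return any(lo <= value < hi for lo, hi in self.ranges)

    def refine(self, values):
        return [v for v in values if not self.is_address(v)]

flt = AddressFilter()
flt.on_page_mapped(0xBFFD0000, 0x20000)   # hypothetical stack pages
```

With this mapping, a value such as 0xBFFD1234 is discarded as an address while ordinary computation results pass through.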
4.3 Similarity Metric
In the literature, there are many metrics for measuring the
degree of similarity of two sequences. In our prototype, we
define it based on the longest common subsequence (LCS).
It should be noted that the definition of the LCS does not
require every subsequence to be a continuous segment of the
mother sequence. For example, both {1, 6, 120} and {2,
24} are valid subsequences of value sequence {1, 2, 6, 24,
120}. Let |LCS(s_1, s_2)| denote the length of the LCS of sequences s_1 and s_2. Given v_P, a fully refined value sequence of a plaintiff program, and v_S, a value sequence of a suspect program, the similarity score of the suspect program over the plaintiff program is intuitively defined as:

Sim(v_P, v_S) = |LCS(v_P, v_S)| / |v_P|
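The metric can be computed directly from its definition. This is a standard dynamic-programming LCS length, not VaPD's actual code:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence; subsequences need
    not be contiguous segments of the mother sequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def similarity(v_p, v_s):
    """Sim(v_P, v_S) = |LCS(v_P, v_S)| / |v_P|."""
    return lcs_len(v_p, v_s) / len(v_p)
```

For the sequences from the example above, {1, 6, 120} is a (non-contiguous) subsequence of {1, 2, 6, 24, 120}, giving a score of 3/5 = 0.6.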
4.4 Design Overview
Figure 3 shows the overall design of VaPD. Here, provided with executable files of plaintiff program P and suspect program S, and a common test input I, the Value Sequence Extractor (VSE) extracts v_P and v_S, the value sequences of P and S. After refining v_P and v_S, the Similarity Detector computes Sim(v_P, v_S), the similarity score of v_P and v_S. VaPD repeats this process with different inputs (say, 10 or 20 inputs), and claims plagiarism if the average of the scores shows a significant similarity.
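The loop described above might be sketched as follows. All callables and the 0.8 decision threshold are illustrative stand-ins, since the paper only requires that the average score show significant similarity:

```python
from statistics import mean

def vapd_pipeline(plaintiff, suspect, inputs, extract, refine, score,
                  threshold=0.8):
    """End-to-end sketch of Figure 3. For each common test input, the
    Value Sequence Extractor (modeled by `extract`) produces a value
    sequence for each binary; the plaintiff's sequence is fully
    refined; `score` computes Sim(v_P, v_S); and the per-input scores
    are averaged into the final verdict."""
    scores = []
    for inp in inputs:
        v_p = refine(extract(plaintiff, inp))
        v_s = extract(suspect, inp)   # suspect side: no source-level refinement
        scores.append(score(v_p, v_s))
    avg = mean(scores)
    return avg, avg >= threshold
```

A caller would plug in the real extractor, refinement chain, and similarity metric; with stub callables the control flow can be exercised directly.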
Citations

- Attack of the Clones: Detecting Cloned Applications on Android Markets. Presents DNADroid, a tool that detects Android application copying ("cloning") by robustly computing the similarity between two applications; it achieves this by comparing program dependency graphs between methods in candidate applications.
- Achieving accuracy and scalability simultaneously in detecting application clones on Android markets. The implemented app clone detection system uses a geometry characteristic of dependency graphs to measure the similarity between methods in two apps, then synthesizes the method-level similarities and draws a yes/no conclusion on app (core functionality) cloning.
- Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. Develops an assembly code representation learning model that finds and incorporates rich semantic relationships among tokens appearing in assembly code, significantly outperforming existing methods against changes introduced by obfuscation and optimization.
- Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. Proposes a binary-oriented, obfuscation-resilient method based on a new concept, the longest common subsequence of semantically equivalent basic blocks, which combines rigorous program semantics with LCS-based fuzzy matching.
- ViewDroid: towards obfuscation-resilient mobile application repackaging detection. Proposes ViewDroid, a user-interface-based approach to mobile app repackaging detection that can detect repackaged apps at large scale, both effectively and efficiently.
References

- CCFinder: a multilinguistic token-based code clone detection system for large scale source code. Proposes a clone detection technique consisting of a transformation of the input source text and a token-by-token comparison; the technique has effectively found clones, and its metrics effectively identify the characteristics of the analyzed systems.
- Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. TaintCheck performs dynamic taint analysis by rewriting binaries at run time; it reliably detects most types of exploits and produced no false positives for any of the many programs tested.
- Winnowing: local algorithms for document fingerprinting. Introduces the class of local document fingerprinting algorithms, which captures an essential property of any fingerprinting technique guaranteed to detect copies, and proves a novel lower bound on the performance of any local algorithm.
- Clone detection using abstract syntax trees. Presents simple, practical methods for detecting exact and near-miss clones over arbitrary program fragments in source code using abstract syntax trees, and suggests that clone detection could be useful for producing more structured code and for discovering domain concepts during reverse engineering.
- A Taxonomy of Obfuscating Transformations. Argues that automatic code obfuscation is currently the most viable method for preventing reverse engineering, and describes the design of a code obfuscator, a tool that converts a program into an equivalent one that is more difficult to understand and reverse engineer.
Frequently Asked Questions (19)
Q1. What are the contributions mentioned in the paper "Value-based program characterization and its application to software plagiarism detection" ?

Based on an observation that some critical runtime values are hard to be replaced or eliminated by semantics-preserving transformation techniques, the authors introduce a novel approach to dynamic characterization of executable programs. The authors show how the values can be extracted and refined to expose the critical values and how they can apply this runtime property to help solve problems in software plagiarism detection. The authors have implemented a prototype with a dynamic taint analyzer atop a generic processor emulator.

As their future work, the authors will examine the relationship between values. A better understanding of the logical connection among the values will enable us to further remove system noise or less significant values. In addition, the authors will study the impact of emulation-based obfuscators such as Themida and Code Virtualizer [27] on VaPD's performance. The authors believe their detection method can handle such obfuscators.

WPP birthmarks are robust to some control flow obfuscation such as opaque predicates insertion, but are still vulnerable to many semantics-preserving transformations such as flattening and loop unwinding. 

Array representation and orientation, functions, in-memory representation of variables, order of instructions, and control and data dependence are just a small set of the features that SandMark can alter. 

If injected successfully, noise could dramatically increase the size of an extracted value sequence, slowing down the similarity score computation and consuming more memory.

Because VaPD analyzes x86 machine code, the authors convert Java bytecode (used in the SandMark and KlassMaster experiments) to x86 executables using GCJ 4.1.2, the GNU ahead-of-time compiler for Java.

With GCC and its five selected optimization flags (-O0, -O1, -O2, -O3, and -Os), the authors can extract five optimized value sequences from the plaintiff program. 

Motivated by an observation that some outcome values computed by machine instructions survive various semantics-preserving code transformations, the authors have proposed a technique that directly examines executable files and does not need to access the source code of suspicious programs. 

In 30 comparison cases (three test programs, each of which has two irrelevant peers, five optimization switches), the value sequences of each program contain only 0% to 11% of the refined value sequences of different programs. 

They classify code obfuscation techniques in the following categories depending on the feature that each technique targets: data obfuscation, control obfuscation, layout obfuscation, and preventive transformations. 

For the dataset to be used as the input to the programs, the authors generate ten wav audio files (seven 16KB files, two 24KB files, and one 8KB file), cropped from a 43.5MB wav file containing an 8’37”-long speech. 

Their experimental results indicate that VaPD successfully discriminated 34 plagiarisms obfuscated by SandMark [7] (39 obfuscators in total, but 5 of them failed to obfuscate the test programs); plagiarisms heavily obfuscated by KlassMaster; programs obfuscated by the Thicket C obfuscator; and executables obfuscated by Control Flow Flattening implemented in the Loco/Diablo link-time optimizer [21].

If necessary, the authors can enable VSE to include specific shared libraries in the value sequence extraction because the virtual machine knows which libraries are loaded and where they are.

There are two classes of values computed by value-updating instructions: Class-1 includes those derived from input of the program, and Class-2 consists of those that are not. 

Although it is theoretically possible for a series of multiple obfuscators to transform a program, applying many obfuscators to a single program could raise practical issues regarding the correctness of the target program and efficiency.

Since not all values associated with the execution of a program are core-values, the authors establish the following requirements for a value to be added into a value sequence: 

The proposed technique cannot be applied to computation-oriented software containing few system calls, and is still vulnerable to the injection of transparent system calls in the middle of an edge on the system call dependence graph.

A better understanding of the logical connection among the values will enable us to further remove system noise or less significant values. 

The authors believe a number of heuristics such as control/data flow dependence analysis and abnormal code pattern detection can be adopted to achieve these goals, and below the authors introduce some of them.