Value-based program characterization and its application to software plagiarism detection
Citations
Attack of the Clones: Detecting Cloned Applications on Android Markets
Achieving accuracy and scalability simultaneously in detecting application clones on Android markets
Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization
Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection
ViewDroid: towards obfuscation-resilient mobile application repackaging detection
References
CCFinder: a multilinguistic token-based code clone detection system for large scale source code
Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software
Winnowing: local algorithms for document fingerprinting
Clone detection using abstract syntax trees
A Taxonomy of Obfuscating Transformations
Frequently Asked Questions (19)
Q2. What have the authors stated for future works in "Value-based program characterization and its application to software plagiarism detection" ?
As future work, the authors will examine the relationship between values. A better understanding of the logical connection among the values will enable them to further remove system noise or less significant values. In addition, the authors will study the impact of emulation-based obfuscators such as Themida and Code Virtualizer [27] on VaPD's performance. The authors believe their detection method can handle such obfuscators.
Q3. What are the main problems of WPP birthmarks?
WPP birthmarks are robust to some control flow obfuscations, such as opaque predicate insertion, but remain vulnerable to many semantics-preserving transformations, such as flattening and loop unwinding.
Q4. What are some of the features that SandMark can alter?
Array representation and orientation, functions, in-memory representation of variables, order of instructions, and control and data dependence are just a small set of the features that SandMark can alter.
Q5. What is the effect of noise injection on the similarity score?
If injected successfully, noise could dramatically increase the size of an extracted value sequence, which slows down the similarity score computation and consumes more memory.
Q6. Why does VaPD analyze x86 machine code?
Because VaPD analyzes x86 machine code, the authors convert Java bytecode (used in the SandMark and KlassMaster experiments) to x86 executables using GCJ 4.1.2, the GNU ahead-of-time compiler for Java.
Q7. How many optimization flags can be used to extract the value sequences?
With GCC and its five selected optimization flags (-O0, -O1, -O2, -O3, and -Os), the authors can extract five optimized value sequences from the plaintiff program.
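As a rough illustration of producing the five optimized builds, the sketch below only assembles one GCC command line per flag. The file names and the `build_commands` helper are hypothetical and not taken from the paper; running the commands and extracting value sequences from the resulting executables are left out.

```python
# The five GCC optimization flags named in the answer above.
OPT_FLAGS = ["-O0", "-O1", "-O2", "-O3", "-Os"]


def build_commands(source, stem):
    """Return one gcc argument list per optimization flag.

    `source` and `stem` are hypothetical names chosen for this sketch.
    """
    return [["gcc", flag, source, "-o", f"{stem}{flag}"] for flag in OPT_FLAGS]


commands = build_commands("plaintiff.c", "plaintiff")
for cmd in commands:
    print(" ".join(cmd))
```

Each resulting executable would then be run on the same input to obtain one value sequence per optimization level.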
Q8. Why do the authors propose a technique that directly examines executable files?
Motivated by an observation that some outcome values computed by machine instructions survive various semantics-preserving code transformations, the authors have proposed a technique that directly examines executable files and does not need to access the source code of suspicious programs.
Q9. How many comparison cases of software plagiarism are there?
In 30 comparison cases (three test programs, each of which has two irrelevant peers, five optimization switches), the value sequences of each program contain only 0% to 11% of the refined value sequences of different programs.
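To make the notion of one value sequence "containing" part of another concrete, here is a minimal sketch that assumes containment is measured with a longest common subsequence (LCS). That choice, the function names, and the toy sequences are assumptions made for illustration, not the authors' exact procedure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def containment(refined, other):
    """Fraction of `refined` that appears, in order, inside `other`."""
    if not refined:
        return 0.0
    return lcs_length(refined, other) / len(refined)


# Toy value sequences (hypothetical data).
plaintiff_refined = [0x10, 0x2A, 0x7F, 0x2A, 0x99]
obfuscated_copy = [0x01, 0x10, 0x2A, 0x55, 0x7F, 0x2A, 0x03, 0x99]
unrelated = [0x33, 0x2A, 0x44, 0x45, 0x46]

print(containment(plaintiff_refined, obfuscated_copy))  # 1.0
print(containment(plaintiff_refined, unrelated))        # 0.2
```

The dynamic program takes time proportional to the product of the two sequence lengths, which is one reason longer (for example, noise-inflated) value sequences make the comparison slower.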
Q10. What are the categories of code obfuscation techniques?
They classify code obfuscation techniques in the following categories depending on the feature that each technique targets: data obfuscation, control obfuscation, layout obfuscation, and preventive transformations.
Q11. How many wav files are used as input to the programs?
For the dataset to be used as the input to the programs, the authors generate ten wav audio files (seven 16KB files, two 24KB files, and one 8KB file), cropped from a 43.5MB wav file containing an 8’37”-long speech.
Q12. How many plagiarisms were successfully discriminated by the VaPD?
Their experimental results indicate that VaPD successfully discriminated 34 plagiarisms obfuscated by SandMark [7] (39 obfuscators in total, but 5 of them failed to obfuscate their test programs); plagiarisms heavily obfuscated by KlassMaster; programs obfuscated by the Thicket C obfuscator; and executables obfuscated by Control Flow Flattening implemented in the Loco/Diablo link-time optimizer [21].
Q13. Why does VSE not extract values from dynamic linked libraries?
If necessary, the authors can enable VSE to include specific shared libraries in the value sequence extraction, because the virtual machine knows which libraries are loaded and where they are.
Q14. What are the two classes of values computed by value-updating instructions?
There are two classes of values computed by value-updating instructions: Class-1 includes those derived from input of the program, and Class-2 consists of those that are not.
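The split between the two classes can be pictured with simple taint propagation over an execution trace. The sketch below is an illustration under assumptions: the trace format `(destination, value, sources)`, the location names, and the `classify_values` helper are all hypothetical, and real instruction-level tracking is considerably more involved.

```python
def classify_values(trace, input_locations):
    """Split computed values into Class-1 (input-derived) and Class-2.

    `trace` is a list of (destination, value, sources) tuples; a value is
    treated as input-derived if any of its sources is currently tainted.
    """
    tainted = set(input_locations)
    class1, class2 = [], []
    for dest, value, sources in trace:
        if any(src in tainted for src in sources):
            tainted.add(dest)      # destination now depends on input
            class1.append(value)
        else:
            tainted.discard(dest)  # overwritten with input-independent data
            class2.append(value)
    return class1, class2


# Hypothetical trace: "buf" holds program input.
trace = [
    ("eax", 7, ["buf"]),           # loaded from input
    ("ebx", 100, []),              # constant, no sources
    ("eax", 107, ["eax", "ebx"]),  # mixes input with a constant
    ("ecx", 200, ["ebx"]),         # derived from the constant only
    ("ebx", 307, ["eax", "ecx"]),  # depends on input through eax
]

class1, class2 = classify_values(trace, {"buf"})
print(class1)  # [7, 107, 307]
print(class2)  # [100, 200]
```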
Q15. How many obfuscators can be used to transform a program?
Although it is theoretically possible for a series of multiple obfuscators to transform a program, applying many obfuscators to a single program could raise practical issues of correctness of the target program and efficiency.
Q16. What are the requirements for a value to be added into a value sequence?
Since not all values associated with the execution of a program are core-values, the authors establish the following requirements for a value to be added into a value sequence:
Q17. What is the problem with the proposed technique?
The proposed technique cannot be applied to computation-oriented software containing few system calls, and is still vulnerable to the injection of transparent system calls in the middle of an edge on the system call dependence graph.
Q18. What is the effect of the logical connection between the values?
A better understanding of the logical connection among the values will enable us to further remove system noise or less significant values.
Q19. What are some of the heuristics that can be used to refine value sequences?
The authors believe a number of heuristics such as control/data flow dependence analysis and abnormal code pattern detection can be adopted to achieve these goals, and below the authors introduce some of them.