Open Access Proceedings Article

Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

TLDR
This research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis, which has many applications, such as cross-architecture vulnerability discovery and code plagiarism detection.
Abstract
Binary code analysis makes it possible to analyze a program without access to its source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text in various natural languages. We observe that binary code analysis and NLP share many analogous topics, such as semantics extraction, summarization, and classification. This work uses these ideas to address two important code similarity comparison problems: (I) given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics are similar; and (II) given a piece of code of interest, determining whether it is contained in another piece of assembly code for a different ISA. Solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency, and scalability. Case studies using the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.
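The two problems stated in the abstract can be framed in a few lines of code. The following is a minimal sketch under stated assumptions, not the INNEREYE implementation: it assumes each basic block has already been mapped to a fixed-length vector by some embedding model (not shown), compares two blocks by cosine similarity for Problem I, and scores containment for Problem II with a longest-common-subsequence pass over two block sequences. The function names and the threshold value are hypothetical.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two block embeddings (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def blocks_similar(u: Vector, v: Vector, threshold: float = 0.9) -> bool:
    """Problem I (sketch): treat two blocks as similar above a hypothetical threshold."""
    return cosine_similarity(u, v) >= threshold


def lcs_length(
    query: Sequence[Vector],
    target: Sequence[Vector],
    similar: Callable[[Vector, Vector], bool],
) -> int:
    """Length of the longest common subsequence, using `similar` in place of equality."""
    rows, cols = len(query), len(target)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if similar(query[i - 1], target[j - 1]):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


def containment_score(
    query: Sequence[Vector],
    target: Sequence[Vector],
    similar: Callable[[Vector, Vector], bool] = blocks_similar,
) -> float:
    """Problem II (sketch): fraction of query blocks matched, in order, in the target."""
    if not query:
        return 0.0
    return lcs_length(query, target, similar) / len(query)
```

A score near 1.0 means that most blocks of the code of interest appear, in order, in the target sequence. How the embeddings are produced across ISAs is the hard part and is deliberately left out of this sketch.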

