Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs

doi:10.14722/NDSS.2019.23492

Open Access, Proceedings Article

Fei Zuo, +5 more

- 08 Aug 2018 -

arXiv: Software Engineering

TLDR

This research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis, which has many applications, such as cross-architecture vulnerability discovery and code plagiarism detection.

Abstract:

Binary code analysis allows analyzing binary code without access to the corresponding source code. A binary, after disassembly, is expressed in an assembly language. This inspires us to approach binary analysis by leveraging ideas and techniques from Natural Language Processing (NLP), a rich area focused on processing text of various natural languages. We notice that binary code analysis and NLP share many analogous topics, such as semantics extraction, summarization, and classification. This work utilizes these ideas to address two important code similarity comparison problems: (I) given a pair of basic blocks for different instruction set architectures (ISAs), determining whether their semantics are similar; and (II) given a piece of code of interest, determining whether it is contained in another piece of assembly code for a different ISA. The solutions to these two problems have many applications, such as cross-architecture vulnerability discovery and code plagiarism detection. We implement a prototype system, INNEREYE, and perform a comprehensive evaluation. A comparison between our approach and existing approaches to Problem I shows that our system outperforms them in terms of accuracy, efficiency, and scalability. Case studies utilizing the system demonstrate that our solution to Problem II is effective. Moreover, this research showcases how to apply ideas and techniques from NLP to large-scale binary code analysis.
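To make the two problem statements concrete, the following minimal Python sketch shows one way they could fit together. It is an illustration under stated assumptions, not the INNEREYE implementation: `token_jaccard` is a hypothetical stand-in for a learned block-similarity model (token overlap would not work across real ISAs), and `containment_score` assumes that containment can be approximated by an order-preserving match of basic blocks using a longest-common-subsequence recurrence. All names and the threshold are invented for the example.

```python
from typing import Callable, List, Sequence

# A basic block is modeled as a list of normalized instruction strings.
Block = Sequence[str]


def token_jaccard(a: Block, b: Block) -> float:
    """Hypothetical placeholder for Problem I: similarity score in [0, 1].

    A real cross-ISA system would use a learned model here; token overlap
    is only a stand-in so that the sketch is runnable.
    """
    tokens_a = {tok for ins in a for tok in ins.split()}
    tokens_b = {tok for ins in b for tok in ins.split()}
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def containment_score(
    query: List[Block],
    target: List[Block],
    sim: Callable[[Block, Block], float] = token_jaccard,
    threshold: float = 0.5,
) -> float:
    """Sketch of Problem II: fraction of query blocks that can be matched,
    in order, to similar blocks of the target (LCS with a fuzzy match)."""
    n, m = len(query), len(target)
    if n == 0:
        return 1.0
    # dp[i][j] = length of the best in-order match between
    # query[:i] and target[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if sim(query[i - 1], target[j - 1]) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m] / n
```

A score close to 1 suggests that the query's blocks appear, in order, inside the target; swapping `sim` for a trained cross-architecture model is the part this sketch leaves open.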

Citations

Journal Article

Edge Computing Security: State of the Art and Challenges

Yinhao Xiao, +5 more

TL;DR: This paper provides a comprehensive survey on the most influential and basic attacks as well as the corresponding defense mechanisms that have edge computing specific characteristics and can be practically applied to real-world edge computing systems.

Journal Article

A Survey of Android Malware Detection with Deep Neural Models

Junyang Qiu, +5 more

- 06 Dec 2020 -

ACM Computing Surveys

TL;DR: This survey aims to address the challenges in DL-based Android malware detection and classification by systematically reviewing the latest progress, including FCN, CNN, RNN, DBN, AE, and hybrid models, and by organizing the literature according to the DL architecture.

Journal Article

Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection

Zeping Yu, +5 more

TL;DR: This paper proposes semantic-aware neural networks to extract the semantic information of binary code, and finds that the order of the CFG's nodes is important for graph similarity detection, so it applies a convolutional neural network (CNN) to adjacency matrices to extract the order information.

Posted Content

A Literature Study of Embeddings on Source Code.

Zimin Chen, +1 more

- 05 Apr 2019 -

arXiv: Learning

TL;DR: Word embedding has been successfully applied to source code at different granularities, and, with access to countless open-source repositories, the authors see potential for applying other data-driven natural language processing techniques to source code in the future.

Posted Content

SAFE: Self-Attentive Function Embeddings for Binary Similarity

Luca Massarelli, +4 more

- 13 Nov 2018 -

arXiv: Cryptography and Security

TL;DR: SAFE is a self-attentive neural network approach to the binary similarity problem that works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions, and is more general because it works on stripped binaries and on multiple architectures.

References

Journal Article

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997 -

Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

Journal Article

Visualizing Data using t-SNE

Laurens van der Maaten, +1 more

- 01 Jan 2008 -

Journal of Machine Learning Research

TL;DR: t-SNE is a new technique that visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map; it is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.

Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

Tomas Mikolov, +4 more

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.

Posted Content

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, +3 more

- 16 Jan 2013 -

arXiv: Computation and Language

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.

Journal Article

Indexing by Latent Semantic Analysis

Scott Deerwester, +4 more

- 01 Sep 1990 -

Journal of the Association for Informati...

TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Related Papers (5)

A deep learning approach to program similarity

DeepSim: deep learning code functional similarity

Deep learning similarities from different representations of source code

A Transformer-based Approach for Source Code Summarization