Author

Xiaozhu Meng

Bio: Xiaozhu Meng is an academic researcher from Rice University. The author has contributed to research in topics: Instrumentation (computer programming) & Speedup. The author has an h-index of 7 and has co-authored 26 publications receiving 215 citations. Previous affiliations of Xiaozhu Meng include University of Wisconsin-Madison & Huazhong University of Science and Technology.

Papers
Proceedings ArticleDOI
18 Jul 2016
TL;DR: New code parsing algorithms in the open source Dyninst tool kit are presented, including a new model for describing jump tables that improves the ability to precisely determine the control flow targets, a new interprocedural analysis to determine when a function is non-returning, and techniques for handling tail calls.
Abstract: Binary code analysis is an enabling technique for many applications. Modern compilers and run-time libraries have introduced significant complexities to binary code, which negatively affect the capabilities of binary analysis tool kits to analyze binary code and may cause tools to report inaccurate information about it. Analysts may hence be confused, and applications built on these tool kits may suffer degraded quality. We examine the problem of constructing control flow graphs from binary code and labeling the graphs with accurate function boundary annotations. We identified several challenging code constructs that represent hard-to-analyze aspects of binary code and show code examples for each. As part of this discussion, we present new code parsing algorithms in our open source Dyninst tool kit that support these constructs, including a new model for describing jump tables that improves our ability to precisely determine control flow targets, a new interprocedural analysis to determine when a function is non-returning, and techniques for handling tail calls. We evaluated how various tool kits fare when handling these code constructs, using real software as well as test binaries patterned after each challenging construct we found in real software.
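
Not from the paper itself, but as a rough illustration of one construct discussed above: the sketch below uses a simple, hypothetical heuristic for tail calls, treating a block-terminating jump whose target is the entry point of another known function as a tail call rather than an intraprocedural edge. Dyninst's actual parsing algorithms are considerably more involved.

```python
# Illustrative sketch only -- not Dyninst's parsing algorithm.
# Heuristic: a terminal jump whose target is another function's entry
# point is classified as a tail call instead of an intraprocedural edge.

def classify_terminal_jump(jump_target, current_function_entry, known_function_entries):
    """Return 'tail_call' or 'intraprocedural' for a block-ending jump."""
    if jump_target in known_function_entries and jump_target != current_function_entry:
        return "tail_call"          # control transfers to another function
    return "intraprocedural"        # ordinary branch inside the current function

# Hypothetical example: a function at 0x1000 ends with a jump to 0x2000,
# which is the entry of another known function -> treated as a tail call.
entries = {0x1000, 0x2000, 0x3000}
print(classify_terminal_jump(0x2000, 0x1000, entries))  # tail_call
print(classify_terminal_jump(0x1040, 0x1000, entries))  # intraprocedural
```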

112 citations

Book ChapterDOI
11 Sep 2017
TL;DR: New code features that capture programming style at the basic block level, an approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary are presented.
Abstract: Knowing the authors of a binary program has significant application to forensics of malicious software (malware), software supply chain risk management, and software plagiarism detection. Existing techniques assume that a binary is written by a single author, which does not hold true in the real world because most modern software, including malware, often contains code from multiple authors. In this paper, we take the first step toward identifying multiple authors in a binary. We present new fine-grained techniques to address the tougher problem of determining the author of each basic block. The decision to attribute authors at the basic block level is based on an empirical study of three large open source software projects, in which we find that a large fraction of basic blocks can be well attributed to a single author. We present new code features that capture programming style at the basic block level, our approach for identifying external template library code, and a new approach to capture correlations between the authors of basic blocks in a binary. Our experiments show strong evidence that programming styles can be recovered at the basic block level and that it is practical to identify multiple authors in a binary.
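
As a hedged sketch of the general workflow (not the paper's actual features or model), basic blocks could be represented by instruction-mnemonic n-grams and fed to an off-the-shelf classifier; the block contents and author labels below are hypothetical.

```python
# Minimal sketch of basic-block-level authorship attribution (not the
# paper's feature set or model): represent each block by its instruction
# mnemonics and train a standard classifier on labeled blocks.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical basic blocks (mnemonic sequences) with author labels.
blocks = ["push mov sub call", "mov cmp jne ret", "xor lea call add", "push mov call ret"]
authors = ["alice", "bob", "alice", "bob"]

# Mnemonic unigrams and bigrams as a crude stand-in for stylistic features.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(blocks)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, authors)
print(clf.predict(vec.transform(["mov cmp jne ret"])))  # should predict 'bob' (a training block)
```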

45 citations

Journal ArticleDOI
TL;DR: A novel Task Duplication based Clustering Algorithm (TDCA) is proposed to improve schedule performance by exploiting task duplication more thoroughly; it improves parameter calculation, task duplication, and task merging.
Abstract: As a crucial task in heterogeneous distributed systems, DAG scheduling models a scheduling application with a set of distributed tasks as a Directed Acyclic Graph (DAG). The goal is to assign tasks to different processors so that the whole application finishes as soon as possible. The Task Duplication-Based (TDB) scheme is an important technique for this problem. The main idea is to duplicate tasks on multiple machines so that the results of the duplicated tasks are available on multiple machines, trading computation time for communication time. Existing TDB algorithms enumerate and test all possible duplication candidates and only keep the candidates that can improve the overall schedule. We observe that a duplication candidate that is ineffective at the moment can become effective after other duplications have been applied, which in turn can cause further ineffective duplications to become effective. We call this phenomenon the chain reaction of task duplication. We propose a novel Task Duplication based Clustering Algorithm (TDCA) to improve schedule performance by utilizing task duplication more thoroughly. TDCA improves parameter calculation, task duplication, and task merging. The analysis and experiments are based on randomly generated graphs with various characteristics, including DAG depth and width, communication-to-computation cost ratio, and varying computational power of processors. Our results demonstrate that the TDCA algorithm is very competitive: it improves the schedule makespan of task duplication-based algorithms for heterogeneous systems across various communication-to-computation cost ratios.
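
To make the computation-for-communication trade concrete, here is a toy sketch (not the TDCA algorithm itself) with made-up costs, showing how duplicating a predecessor task on a second processor can remove a communication delay from the makespan.

```python
# Toy illustration of task duplication (not the TDCA algorithm).
# Task A feeds tasks B and C. If B runs on P1 and C on P2, C must wait
# for A's result to be communicated unless A is duplicated on P2.
comp_A, comp_B, comp_C = 2, 3, 3   # hypothetical computation costs
comm_A_to_C = 4                    # cross-processor communication cost

# Without duplication: A runs on P1 and its result is shipped to P2 for C.
finish_B = comp_A + comp_B                      # 5 on P1
finish_C = comp_A + comm_A_to_C + comp_C        # 9 on P2
makespan_no_dup = max(finish_B, finish_C)       # 9

# With duplication: A is also executed on P2, so C starts right after it.
finish_C_dup = comp_A + comp_C                  # 5 on P2
makespan_dup = max(finish_B, finish_C_dup)      # 5

print(makespan_no_dup, makespan_dup)  # 9 5
```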

36 citations

Proceedings ArticleDOI
22 Sep 2013
TL;DR: Two new line-level authorship models are presented to overcome the limitation of current tools that assume that the last developer to change a line of code is its author regardless of all earlier changes.
Abstract: Code authorship information is important for analyzing software quality, performing software forensics, and improving software maintenance. However, current tools assume that the last developer to change a line of code is its author, regardless of all earlier changes. This approximation loses important information. We present two new line-level authorship models to overcome this limitation. We first define the repository graph as a graph abstraction for a code repository, in which nodes are the commits and edges represent the development dependencies. Then, for each line of code, structural authorship is defined as a subgraph of the repository graph recording all commits that changed the line and the development dependencies between the commits; weighted authorship is defined as a vector of author contribution weights derived from the structural authorship of the line and based on a code change measure between commits, for example, edit distance. We have implemented our two authorship models as a new git built-in tool, git-author. We evaluated git-author in an empirical study and a comparison study. In the empirical study, we ran git-author on five open source projects and found that git-author can recover more information than a current tool (git-blame) for about 10% of lines. In the comparison study, we used git-author to build a line-level model for bug prediction. We compared our line-level model with an existing file-level model. The results show that our line-level model performs consistently better than the file-level model when evaluated on our data sets produced from the Apache HTTP server project.
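
As a hedged sketch of the weighted-authorship idea (not the git-author implementation), each commit that changed a line can be credited with a weight based on how much of the line its edit changed, using a similarity ratio as a stand-in for an edit-distance measure; the commit history below is hypothetical.

```python
# Sketch of line-level weighted authorship (not the git-author tool itself):
# weight each author by how much of the line their edit changed, using a
# similarity ratio as a stand-in for an edit-distance measure.
from difflib import SequenceMatcher

# Hypothetical history of one line: (author, line content after the commit).
history = [
    ("alice", "int count = 0;"),
    ("bob",   "int count = 1;"),
    ("carol", "int counter = 1;  // renamed"),
]

weights = {}
prev = ""
for author, content in history:
    # Credit the author with the portion of the line their commit changed.
    changed = 1.0 - SequenceMatcher(None, prev, content).ratio()
    weights[author] = weights.get(author, 0.0) + changed
    prev = content

total = sum(weights.values())
print({a: round(w / total, 2) for a, w in weights.items()})
```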

29 citations

Proceedings ArticleDOI
01 Nov 2016
TL;DR: A new finer-grained technique is presented that can discriminate the author of a basic block with 52% accuracy among 282 authors, as opposed to 0.4% accuracy by random guess, and provides a practical solution for identifying multiple authors in software.
Abstract: Binary code authorship identification is the task of determining the authors of a piece of binary code from a set of known authors. Modern software often contains code from multiple authors, yet existing techniques assume that each program binary is written by a single author. We present a new finer-grained technique for the tougher problem of determining the author of each basic block. Our evaluation shows that our new technique can discriminate the author of a basic block with 52% accuracy among 282 authors, as opposed to 0.4% accuracy by random guessing, and it provides a practical solution for identifying multiple authors in software.
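
For context, the quoted 0.4% baseline is simply the accuracy of guessing uniformly among the 282 candidate authors:

```python
# Random-guess baseline for 282 candidate authors.
authors = 282
baseline = 1 / authors
print(f"{baseline:.2%}")   # 0.35%, i.e. roughly the 0.4% quoted above
reported = 0.52
print(f"improvement over chance: {reported / baseline:.0f}x")  # ~147x
```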

26 citations


Cited by
Proceedings ArticleDOI
06 Nov 2019
TL;DR: This work proposes MemGuard, the first defense with formal utility-loss guarantees against black-box membership inference attacks, and is the first to show that adversarial examples can be used as a defensive mechanism against membership inference attacks.
Abstract: In a membership inference attack, an attacker aims to infer whether a data sample is in a target classifier's training dataset or not. Specifically, given black-box access to the target classifier, the attacker trains a binary classifier, which takes a data sample's confidence score vector predicted by the target classifier as input and predicts the data sample to be a member or non-member of the target classifier's training dataset. Membership inference attacks pose severe privacy and security threats to the training dataset. Most existing defenses leverage differential privacy when training the target classifier or regularize the training process of the target classifier. These defenses suffer from two key limitations: 1) they do not have formal utility-loss guarantees on the confidence score vectors, and 2) they achieve suboptimal privacy-utility tradeoffs. In this work, we propose MemGuard, the first defense with formal utility-loss guarantees against black-box membership inference attacks. Instead of tampering with the training process of the target classifier, MemGuard adds noise to each confidence score vector predicted by the target classifier. Our key observation is that the attacker uses a classifier to predict member or non-member, and that classifier is vulnerable to adversarial examples. Based on this observation, we propose to add a carefully crafted noise vector to a confidence score vector to turn it into an adversarial example that misleads the attacker's classifier. Specifically, MemGuard works in two phases. In Phase I, MemGuard finds a carefully crafted noise vector that can turn a confidence score vector into an adversarial example, which is likely to mislead the attacker's classifier into making a random guess at member or non-member. We find such a noise vector via a new method that we design to incorporate the unique utility-loss constraints on the noise vector. In Phase II, MemGuard adds the noise vector to the confidence score vector with a certain probability, which is selected to satisfy a given utility-loss budget on the confidence score vector. Our experimental results on three datasets show that MemGuard can effectively defend against membership inference attacks and achieve better privacy-utility tradeoffs than existing defenses. Our work is the first to show that adversarial examples can be used as defensive mechanisms against membership inference attacks.
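
As a heavily simplified, hedged sketch of the core idea (not MemGuard's actual optimization), the snippet below mixes a confidence vector toward the uniform distribution just enough to flip a toy entropy-threshold attacker, while keeping the predicted label and a valid probability distribution; the attacker model and scores are hypothetical.

```python
# Simplified sketch of the MemGuard idea (not the paper's optimization):
# perturb a confidence vector toward the uniform distribution just enough
# to flip a toy attacker's decision, while keeping the argmax unchanged.
import numpy as np

def toy_attacker(scores, entropy_threshold=0.8):
    """Hypothetical attacker: low-entropy vectors are flagged as members."""
    p = np.clip(scores, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))
    return entropy < entropy_threshold  # True -> predicted 'member'

def defend(scores, steps=100):
    uniform = np.full_like(scores, 1.0 / len(scores))
    for i in range(steps + 1):
        alpha = i / steps                      # mixing weight toward uniform
        noisy = (1 - alpha) * scores + alpha * uniform
        if np.argmax(noisy) != np.argmax(scores):
            break                              # never change the predicted label
        if not toy_attacker(noisy):
            return noisy                       # attacker now says 'non-member'
    return scores                              # give up within the budget

conf = np.array([0.90, 0.06, 0.04])
print(toy_attacker(conf), toy_attacker(defend(conf)))  # True False
```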

245 citations

Proceedings Article
01 Aug 2016
TL;DR: This work studies the accuracy of nine state-of-the-art disassemblers on 981 real-world compiler-generated binaries with a wide variety of properties, and reveals a mismatch between expectations in the literature and the actual capabilities of modern disassemblers.
Abstract: It is well-known that static disassembly is an unsolved problem, but how much of a problem is it in real software—for instance, for binary protection schemes? This work studies the accuracy of nine state-of-the-art disassemblers on 981 real-world compiler-generated binaries with a wide variety of properties. In contrast, prior work focuses on isolated corner cases; we show that this has led to a widespread and overly pessimistic view on the prevalence of complex constructs like inline data and overlapping code, leading reviewers and researchers to underestimate the potential of binary-based research. On the other hand, some constructs, such as function boundaries, are much harder to recover accurately than is reflected in the literature, which rarely discusses much needed error handling for these primitives. We study 30 papers recently published in six major security venues, and reveal a mismatch between expectations in the literature, and the actual capabilities of modern disassemblers. Our findings help improve future research by eliminating this mismatch.
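
Ground-truth comparisons of this kind typically reduce to set operations over recovered versus true addresses; the sketch below (with hypothetical addresses, not the paper's evaluation harness) computes precision, recall, and F1 for recovered function starts.

```python
# Sketch of measuring a disassembler's function-start accuracy against
# ground truth (hypothetical addresses; not the paper's evaluation harness).
ground_truth = {0x1000, 0x1200, 0x1450, 0x1800, 0x2000}
recovered    = {0x1000, 0x1200, 0x1460, 0x1800, 0x2000, 0x2100}

true_positives = recovered & ground_truth
precision = len(true_positives) / len(recovered)       # 4/6
recall    = len(true_positives) / len(ground_truth)    # 4/5
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```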

118 citations

Proceedings ArticleDOI
26 Apr 2017
TL;DR: Nucleus is a novel function detection algorithm for binaries that is compiler-agnostic, and does not require any learning phase or signature information, making it inherently suitable for difficult cases such as non-contiguous or multi-entry functions.
Abstract: We propose Nucleus, a novel function detection algorithm for binaries. In contrast to prior work, Nucleus is compiler-agnostic and does not require any learning phase or signature information. Instead of scanning for signatures, Nucleus detects functions at the Control Flow Graph level, making it inherently suitable for difficult cases such as non-contiguous or multi-entry functions. We evaluate Nucleus on a diverse set of 476 C and C++ binaries, compiled with gcc, clang, and Visual Studio for x86 and x64, at optimization levels O0–O3. We achieve consistently good performance, with a mean F-score of 0.95.
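
As a hedged, heavily simplified sketch of CFG-level function detection (far simpler than Nucleus itself), direct call targets can be taken as function entries and the weakly connected components of the intraprocedural CFG as function bodies, which naturally accommodates non-contiguous code; the toy CFG below is hypothetical.

```python
# Greatly simplified sketch of CFG-level function detection (not Nucleus):
# call targets become function entries; weakly connected components of the
# intraprocedural CFG group the remaining blocks into function bodies.
from collections import defaultdict

# Hypothetical CFG: block -> successors via intraprocedural edges only.
cfg = {
    "b1": ["b2"], "b2": ["b3"], "b3": [],        # one function's blocks
    "b4": ["b5"], "b5": [],                      # another function's blocks
}
call_targets = {"b1", "b4"}                      # entries observed as call targets

# Union-find over undirected CFG edges to group blocks into components.
parent = {b: b for b in cfg}
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

for block, succs in cfg.items():
    for s in succs:
        union(block, s)

functions = defaultdict(set)
for block in cfg:
    functions[find(block)].add(block)

for body in functions.values():
    entry = body & call_targets
    print(sorted(entry), sorted(body))
```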

88 citations

Proceedings ArticleDOI
15 Oct 2018
TL;DR: A Deep Learning-based Code Authorship Identification System (DL-CAIS) is proposed for code authorship attribution that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification.
Abstract: Efficient extraction of code authorship attributes is key for successful identification. However, the extraction of such attributes is very challenging, due to various programming language specifics, the limited number of available code samples per author, and the average number of code lines per file, among others. To this end, this work proposes a Deep Learning-based Code Authorship Identification System (DL-CAIS) that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification. The deep learning architecture adopted in this work includes TF-IDF-based deep representation using multiple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep representation then feeds into a random forest classifier, for scalability, to de-anonymize the author. Comprehensive experiments are conducted to evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1987 public repositories on GitHub. The results of our work show high accuracy despite requiring a small number of files per author: we achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% for the real-world dataset of 745 C programmers. Our system also allows us to identify 8,903 authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Moreover, our technique is resilient to language specifics, and thus it can identify authors of four programming languages (e.g., C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress), with an accuracy of 93.42% for a set of 120 authors.
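
A hedged sketch of the two ends of such a pipeline (TF-IDF features feeding a random forest), with the RNN-based deep representation stage omitted; the code samples and author labels are hypothetical, not from GCJ or GitHub.

```python
# Heavily simplified sketch of a code-authorship pipeline (not DL-CAIS):
# TF-IDF over source text feeding a random forest. The RNN-based deep
# representation stage described in the paper is omitted here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical code samples with author labels.
samples = [
    "for (int i = 0; i < n; ++i) total += a[i];",
    "int i=0; while(i<n){ total+=a[i]; i++; }",
    "for (size_t i = 0; i < n; ++i) { sum += v[i]; }",
    "i = 0\nwhile i < n:\n    total += a[i]\n    i += 1",
]
labels = ["alice", "bob", "alice", "bob"]

# Character n-grams as a rough proxy for coding-style features.
features = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = features.fit_transform(samples)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

print(clf.predict(features.transform(["for (int j = 0; j < m; ++j) s += b[j];"])))
```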

80 citations

Journal ArticleDOI
TL;DR: An analysis of 11 aggregation schemes using data collected from 255 open source projects finds that aggregation schemes can significantly alter correlations among metrics, as well as the correlations between metrics and the defect count.
Abstract: Defect prediction models help software organizations to anticipate where defects will appear in the future. When training a defect prediction model, historical defect data is often mined from a Version Control System (VCS, e.g., Subversion), which records software changes at the file level. Software metrics, on the other hand, are often calculated at the class or method level (e.g., McCabe's Cyclomatic Complexity). To address the disagreement in granularity, the class- and method-level software metrics are aggregated to the file level, often using summation (i.e., the McCabe value of a file is the sum of the McCabe values of all methods within the file). A recent study shows that summation significantly inflates the correlation between lines of code (Sloc) and cyclomatic complexity (Cc) in Java projects. While there are many other aggregation schemes (e.g., central tendency, dispersion), they have remained unexplored in the scope of defect prediction. In this study, we set out to investigate how different aggregation schemes impact defect prediction models. Through an analysis of 11 aggregation schemes using data collected from 255 open source projects, we find that: (1) aggregation schemes can significantly alter correlations among metrics, as well as the correlations between metrics and the defect count; (2) when constructing models to predict defect proneness, applying only the summation scheme (i.e., the most commonly used aggregation scheme in the literature) achieves the best performance (the best among the 12 studied configurations) in only 11 percent of the studied projects, while applying all of the studied aggregation schemes achieves the best performance in 40 percent of the studied projects; (3) when constructing models to predict defect rank or count, either applying only the summation or applying all of the studied aggregation schemes achieves similar performance, with both achieving the closest to the best performance more often than the other studied aggregation schemes; and (4) when constructing models for effort-aware defect prediction, the mean or median aggregation schemes yield performance values that are significantly closer to the best performance than any of the other studied aggregation schemes. Broadly speaking, the performance of defect prediction models is often underestimated due to our community's tendency to use only the summation aggregation scheme. Given the potential benefit of applying additional aggregation schemes, we advise that future defect prediction models explore a variety of aggregation schemes.
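
A hedged sketch of what aggregation means here: the same hypothetical method-level metric values rolled up to one file-level value under several schemes (summation, central tendency, dispersion).

```python
# Sketch of aggregating method-level metrics to the file level under
# different schemes (values are hypothetical, not from the study).
from statistics import mean, median, pstdev

mccabe_per_method = [1, 1, 2, 3, 15]   # cyclomatic complexity of each method

aggregations = {
    "summation": sum(mccabe_per_method),      # the scheme most studies use
    "mean":      mean(mccabe_per_method),
    "median":    median(mccabe_per_method),
    "stdev":     round(pstdev(mccabe_per_method), 2),  # a dispersion scheme
    "max":       max(mccabe_per_method),
}
print(aggregations)  # e.g. {'summation': 22, 'mean': 4.4, 'median': 2, ...}
```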

79 citations