Open Access Journal Article (DOI)

A Survey of Machine Learning for Big Code and Naturalness

TLDR
This article surveys research at the intersection of machine learning, programming languages, and software engineering, which has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code.
Abstract
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.
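To make the idea of a learnable probabilistic model of source code concrete, the sketch below (not taken from the survey) trains a bigram language model over code tokens and scores a new snippet; the regex tokenizer, the toy corpus, and the smoothing constant alpha are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the survey's method): a bigram language model over code tokens.
import math
import re
from collections import Counter, defaultdict

def tokenize(code: str):
    # Naive lexer: identifiers/keywords, numbers, and single punctuation characters become tokens.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

class BigramCodeModel:
    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha                   # additive smoothing constant (assumed value)
        self.bigrams = defaultdict(Counter)  # previous token -> Counter of next tokens
        self.vocab = set()

    def train(self, corpus):
        for snippet in corpus:
            tokens = ["<s>"] + tokenize(snippet) + ["</s>"]
            self.vocab.update(tokens)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.bigrams[prev][nxt] += 1

    def log_prob(self, snippet: str) -> float:
        tokens = ["<s>"] + tokenize(snippet) + ["</s>"]
        v = len(self.vocab) or 1
        total = 0.0
        for prev, nxt in zip(tokens, tokens[1:]):
            counts = self.bigrams[prev]
            total += math.log((counts[nxt] + self.alpha) / (sum(counts.values()) + self.alpha * v))
        return total

model = BigramCodeModel()
model.train(["for i in range(n): total += x[i]", "for j in range(m): s += y[j]"])
print(model.log_prob("for k in range(p): acc += z[k]"))  # idiomatic code scores relatively high
```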


Citations
Posted Content

Open Graph Benchmark: Datasets for Machine Learning on Graphs

TL;DR: The OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover domains ranging from social and information networks to biological networks, molecular graphs, source code ASTs, and knowledge graphs, indicating fruitful opportunities for future research.
Posted Content

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.

TL;DR: This paper describes the methodology used to obtain the corpus and the expert labels, as well as a number of simple baseline solutions for the task.
Proceedings Article (DOI)

A novel neural source code representation based on abstract syntax tree

TL;DR: This paper proposes a novel AST-based Neural Network (ASTNN) for source code representation that splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements.
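A rough sketch of the splitting step, assuming Python's built-in ast module as a stand-in parser; ASTNN itself targets other languages and encodes each statement tree with a recursive neural network rather than the bag-of-node-types placeholder used here.

```python
# Minimal sketch: split a function into per-statement subtrees, roughly in ASTNN's spirit.
import ast

def statement_subtrees(code: str):
    """Yield every statement node of the parsed code as its own subtree."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.stmt):
            yield node

def bag_of_node_types(stmt: ast.stmt):
    # Placeholder "encoder": represent a statement tree by its multiset of node types.
    # ASTNN instead learns a vector for each statement tree with a recursive encoder.
    return sorted(type(n).__name__ for n in ast.walk(stmt))

code = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total"
for stmt in statement_subtrees(code):
    print(type(stmt).__name__, bag_of_node_types(stmt))
```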
Journal Article (DOI)

SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair

TL;DR: This paper devises, implements, and evaluates a technique, called SEQUENCER, for fixing bugs based on sequence-to-sequence learning on source code, which captures a wide range of repair operators without any domain-specific top-down design.
Proceedings Article (DOI)

A neural model for generating natural language summaries of program subroutines

TL;DR: This paper presents a neural model that combines words from code with code structure from an AST, which allows the model to learn code structure independently of the text in code.
References
Journal Article (DOI)

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
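For reference, here is one LSTM cell step written out in NumPy with the standard gating equations; the weight matrices and the toy input sequence are random placeholders, not a trained model.

```python
# Minimal sketch of one LSTM cell step (standard gating; weights are random placeholders).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step: gates control what the constant-error carousel c keeps or overwrites."""
    z = W @ x + U @ h_prev + b              # stacked pre-activations for the four gates
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g                  # cell state: additive update preserves error flow
    h = o * np.tanh(c)                      # hidden state exposed to the rest of the network
    return h, c

n_in, n_hidden = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.normal(size=(5, n_in)):        # unroll over a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```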
Proceedings Article (DOI)

Bleu: a Method for Automatic Evaluation of Machine Translation

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, correlates highly with human evaluation, and has little marginal cost per run.
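A toy, single-reference rendering of the metric (clipped n-gram precision, geometric mean, brevity penalty); production evaluations use corpus-level BLEU, usually with smoothing, so this is only an illustration of the arithmetic.

```python
# Minimal sketch of sentence-level BLEU with a single reference.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())  # clipped matches
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty stops very short candidates from winning on precision alone.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2), 3))  # 0.707
print(round(bleu("the cat sat on the mat", "the cat is on the mat"), 3))  # ~0: no 4-gram match, hence smoothing in practice
```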
Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and the paper proposes to extend it by allowing the model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
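A minimal NumPy sketch of the soft-search idea: every encoder state is scored against the current decoder state, the scores are normalized with a softmax, and the weights produce a context vector in place of a single fixed-length summary. The additive scoring form follows the paper, but the shapes and random weights here are illustrative.

```python
# Minimal sketch of additive (soft) attention over encoder states.
import numpy as np

def soft_attention(decoder_state, encoder_states, Wa, Ua, va):
    """Score each source position, normalize with softmax, return the weighted context vector."""
    scores = np.array([va @ np.tanh(Wa @ decoder_state + Ua @ h) for h in encoder_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # soft alignment over source positions
    context = weights @ encoder_states       # convex combination replaces the fixed-length vector
    return context, weights

rng = np.random.default_rng(0)
d_enc, d_dec, d_att, src_len = 6, 6, 8, 5
encoder_states = rng.normal(size=(src_len, d_enc))
decoder_state = rng.normal(size=d_dec)
Wa = rng.normal(size=(d_att, d_dec))
Ua = rng.normal(size=(d_att, d_enc))
va = rng.normal(size=d_att)
context, weights = soft_attention(decoder_state, encoder_states, Wa, Ua, va)
print(weights.round(2), context.shape)
```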
Posted Content

Sequence to Sequence Learning with Neural Networks

TL;DR: This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence which made the optimization problem easier.
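A small illustration of why reversing the source helps, under the simplifying assumption of a word-for-word positional alignment: it shortens the lag between reading an early source word and emitting the matching early target word.

```python
# Toy calculation (illustrative positional alignment, not a trained model).
def lag(src_len, i, reverse):
    """Steps between the encoder reading source word i and the decoder emitting target word i."""
    read_step = (src_len - 1 - i) if reverse else i  # when word i is fed to the encoder
    emit_step = src_len + i                          # target word i comes after the full source
    return emit_step - read_step

print([lag(5, i, reverse=False) for i in range(5)])  # [5, 5, 5, 5, 5] -> every dependency is long
print([lag(5, i, reverse=True) for i in range(5)])   # [1, 3, 5, 7, 9] -> early dependencies become short
```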
Journal Article (DOI)

Anomaly detection: A survey

TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.