A Syntactic Neural Model for General-Purpose Code Generation

doi:10.18653/V1/P17-1041

Home
/
Papers
/
A Syntactic Neural Model for General-Purpose Code Generation

Proceedings Article•DOI•

A Syntactic Neural Model for General-Purpose Code Generation

Pengcheng Yin¹, Graham Neubig¹•Institutions (1)

Carnegie Mellon University¹

01 Jul 2017-pp 440-450

TL;DR: This paper propose a neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge, which achieves state-of-the-art results in code generation and semantic parsing.

read less

Abstract: We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.

...read moreread less

Content maybe subject to copyright Report

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Deep code comment generation

[...]

Xing Hu¹, Ge Li¹, Xin Xia², David Lo³, Zhi Jin¹ - Show less +1 more•Institutions (3)

Peking University¹, Monash University², Singapore Management University³

28 May 2018

TL;DR: DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features for better comments generation of Java methods.

...read moreread less

Abstract: During software maintenance, code comments help developers comprehend programs and reduce additional time spent on reading and navigating source code. Unfortunately, these comments are often mismatched, missing or outdated in the software projects. Developers have to infer the functionality from the source code. This paper proposes a new approach named DeepCom to automatically generate code comments for Java methods. The generated comments aim to help developers understand the functionality of Java methods. DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments from learned features. We use a deep neural network that analyzes structural information of Java methods for better comments generation. We conduct experiments on a large-scale Java corpus built from 9,714 open source projects from GitHub. We evaluate the experimental results on a machine translation metric. Experimental results demonstrate that our method DeepCom outperforms the state-of-the-art by a substantial margin.

...read moreread less

541 citations

Journal Article•DOI•

A Survey of Machine Learning for Big Code and Naturalness

[...]

Miltiadis Allamanis¹, Earl T. Barr², Premkumar Devanbu³, Charles Sutton⁴•Institutions (4)

Microsoft¹, University College London², University of California, Davis³, University of Edinburgh⁴

31 Jul 2018-ACM Computing Surveys

TL;DR: A survey of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code.

...read moreread less

Abstract: Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. In this article, we survey this work. We contrast programming languages against natural languages and discuss how these similarities and differences drive the design of probabilistic models. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges and opportunities.

...read moreread less

502 citations

Posted Content•

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.

[...]

Hamel Husain, Ho-Hsiang Wu, Tiferet Ahavah Gazit, Miltiadis Allamanis¹, Marc Brockschmidt¹ - Show less +1 more•Institutions (1)

Microsoft¹

20 Sep 2019-arXiv: Learning

TL;DR: The methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task are described.

...read moreread less

Abstract: Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further and will host a competition and leaderboard to track the progress on the challenge. We are also keen on extending CodeSearchNet Challenge to more queries and programming languages in the future.

...read moreread less

492 citations

Posted Content•

Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

[...]

Tao Yu¹, Rui Zhang, Kai Yang¹, Michihiro Yasunaga¹, Dongxu Wang¹, Zifan Li¹, James Ma¹, Irene Li¹, Qingning Yao¹, Shanelle Roman¹, Zilin Zhang¹, Dragomir R. Radev¹ - Show less +8 more•Institutions (1)

Yale University¹

24 Sep 2018-arXiv: Computation and Language

TL;DR: This work defines a new complex and cross-domain semantic parsing and text-to-SQL task so that different complicated SQL queries and databases appear in train and test sets and experiments with various state-of-the-art models show that Spider presents a strong challenge for future research.

...read moreread less

Abstract: We present Spider, a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task where different complex SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and the exact same programs in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 12.4% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task are publicly available at this https URL

...read moreread less

406 citations

Posted Content•

GraphCodeBERT: Pre-training Code Representations with Data Flow

[...]

Daya Guo¹, Shuo Ren², Shuai Lu³, Zhangyin Feng⁴, Duyu Tang⁴, Shujie Liu⁵, Long Zhou⁵, Nan Duan⁵, Alexey Svyatkovskiy⁵, Fu Shengyu⁵, Michele Tufano⁵, Shao Kun Deng⁵, Colin B. Clement⁵, Dawn Drain⁵, Neel Sundaresan⁵, Jian Yin¹, Daxin Jiang⁵, Ming Zhou⁶ - Show less +14 more•Institutions (6)

Sun Yat-sen University¹, Beihang University², Peking University³, Harbin Institute of Technology⁴, Microsoft⁵, Shanghai Jiao Tong University⁶

17 Sep 2020-arXiv: Software Engineering

TL;DR: Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks and it is shown that the model prefers structure-level attentions over token- level attentions in the task of code search.

...read moreread less

Abstract: Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code snippet as a sequence of tokens, while ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. Instead of taking syntactic-level structure of code like abstract syntax tree (AST), we use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. Such a semantic-level structure is neat and does not bring an unnecessarily deep hierarchy of AST, the property of which makes the model more efficient. We develop GraphCodeBERT based on Transformer. In addition to using the task of masked language modeling, we introduce two structure-aware pre-training tasks. One is to predict code structure edges, and the other is to align representations between source code and code structure. We implement the model in an efficient way with a graph-guided masked attention function to incorporate the code structure. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement. Results show that code structure and newly introduced pre-training tasks can improve GraphCodeBERT and achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attentions over token-level attentions in the task of code search.

...read moreread less

377 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70

Collapse

References

PDF

Open Access

More filters

Journal Article•DOI•

Long short-term memory

[...]

Sepp Hochreiter¹, Jürgen Schmidhuber²•Institutions (2)

Technische Universität München¹, Dalle Molle Institute for Artificial Intelligence Research²

01 Nov 1997-Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

...read moreread less

Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O. 1. Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

...read moreread less

72,897 citations

Proceedings Article•

Neural Machine Translation by Jointly Learning to Align and Translate

[...]

Dzmitry Bahdanau¹, Kyunghyun Cho², Yoshua Bengio²•Institutions (2)

Jacobs University Bremen¹, Université de Montréal²

01 Jan 2015

TL;DR: It is conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

...read moreread less

Abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder-decoders and consists of an encoder that encodes a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

...read moreread less

20,027 citations

Proceedings Article•

Pointer networks

[...]

Oriol Vinyals¹, Meire Fortunato², Navdeep Jaitly¹•Institutions (2)

Google¹, University of California, Berkeley²

07 Dec 2015

TL;DR: A new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence using a recently proposed mechanism of neural attention, called Ptr-Nets, which improves over sequence-to-sequence with input attention, but also allows it to generalize to variable size output dictionaries.

...read moreread less

Abstract: We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems - finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem - using training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.

...read moreread less

1,949 citations

Proceedings Article•

Semantic Parsing on Freebase from Question-Answer Pairs

[...]

Jonathan Berant¹, Andrew Chia Chen Chou¹, Roy Frostig¹, Percy Liang¹•Institutions (1)

Stanford University¹

01 Oct 2013

TL;DR: This paper trains a semantic parser that scales up to Freebase and outperforms their state-of-the-art parser on the dataset of Cai and Yates (2013), despite not having annotated logical forms.

...read moreread less

Abstract: In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which is especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset of Cai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs and improves over a natural baseline.

...read moreread less

1,738 citations

Proceedings Article•

A theoretically grounded application of dropout in recurrent neural networks

[...]

Yarin Gal¹, Zoubin Ghahramani¹•Institutions (1)

University of Cambridge¹

05 Dec 2016

TL;DR: The authors apply this variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks, and to the best of their knowledge improve on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity).

...read moreread less

Abstract: Recurrent neural networks (RNNs) stand at the forefront of many recent developments in deep learning. Yet a major difficulty with these models is their tendency to overfit, with dropout shown to fail when applied to recurrent layers. Recent results at the intersection of Bayesian modelling and deep learning offer a Bayesian interpretation of common deep learning techniques such as dropout. This grounding of dropout in approximate Bayesian inference suggests an extension of the theoretical results, offering insights into the use of dropout with RNN models. We apply this new variational inference based dropout technique in LSTM and GRU models, assessing it on language modelling and sentiment analysis tasks. The new approach outperforms existing techniques, and to the best of our knowledge improves on the single model state-of-the-art in language modelling with the Penn Treebank (73.4 test perplexity). This extends our arsenal of variational tools in deep learning.

...read moreread less

1,557 citations