Mitchell Stern

Researcher at University of California, Berkeley

Publications - 36
Citations - 2936

Mitchell Stern is an academic researcher from the University of California, Berkeley. The author has contributed to research in the topics of Parsing and Machine translation, has an h-index of 18, and has co-authored 36 publications receiving 2113 citations. Previous affiliations of Mitchell Stern include Google.

Papers
Proceedings Article

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

TL;DR: In this paper, the authors propose to estimate the per-parameter second moments of neural network weight matrices from only the per-row and per-column sums of exponential moving averages of squared past gradients, yielding sublinear auxiliary memory cost, and show that adaptive methods can produce larger-than-desired updates when the decay rate of the second-moment accumulator is too slow.
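The factored second-moment estimate can be sketched in a few lines of NumPy. The sketch below is only an illustration of that idea under assumed hyperparameter names (lr, beta2, eps), not the authors' implementation, and it omits Adafactor's update clipping and relative step sizes.

import numpy as np

# Illustrative sketch of Adafactor-style factored second moments for a 2-D
# weight matrix: keep only per-row and per-column accumulators and rebuild a
# full per-parameter estimate from their outer product.
def adafactor_step(W, grad, row_acc, col_acc, lr=1e-2, beta2=0.999, eps=1e-30):
    sq = grad ** 2 + eps
    row_acc = beta2 * row_acc + (1 - beta2) * sq.sum(axis=1)   # O(rows) statistics
    col_acc = beta2 * col_acc + (1 - beta2) * sq.sum(axis=0)   # O(cols) statistics
    v_hat = np.outer(row_acc, col_acc) / row_acc.sum()         # rank-1 reconstruction
    W = W - lr * grad / np.sqrt(v_hat)
    return W, row_acc, col_acc

# Toy usage on random gradients.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
row_acc, col_acc = np.zeros(4), np.zeros(3)
for _ in range(5):
    W, row_acc, col_acc = adafactor_step(W, rng.normal(size=W.shape), row_acc, col_acc)

The memory saving is the point: for an n-by-m weight matrix, the accumulators cost n + m numbers rather than n * m.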
Proceedings Article

The Marginal Value of Adaptive Gradient Methods in Machine Learning

TL;DR: It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.
Proceedings ArticleDOI

Abstract Syntax Networks for Code Generation and Semantic Parsing

TL;DR: In this paper, abstract syntax trees (ASTs) are constructed by a decoder with a dynamically determined modular structure paralleling the structure of the output tree, an approach that achieves state-of-the-art results on the Atis, Jobs, and Geo semantic parsing datasets.
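The structure-paralleling decoder can be pictured with a toy recursion in which each node type has its own expansion module and the call tree mirrors the emitted AST. The sketch below is a hypothetical, non-neural stand-in for that control flow; the tiny grammar and the choose function are invented for illustration and are not taken from the paper.

import random

# Toy sketch: the decoder's recursion mirrors the abstract syntax tree it
# emits, with a separate (here, random) "module" choosing the production for
# each node type.
GRAMMAR = {
    "expr": [["num"], ["expr", "+", "expr"], ["expr", "*", "expr"]],
    "num": [["1"], ["2"], ["3"]],
}

def choose(node_type, productions):
    # Hypothetical stand-in for a learned per-construct scoring module.
    return random.choice(productions)

def decode(node_type, depth=0, max_depth=3):
    if node_type not in GRAMMAR:
        return node_type                    # terminal symbol
    options = GRAMMAR[node_type]
    if depth >= max_depth:
        options = options[:1]               # force termination at the depth limit
    return [decode(sym, depth + 1, max_depth) for sym in choose(node_type, options)]

random.seed(0)
print(decode("expr"))                        # nested lists mirroring the generated tree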
Posted Content

The Marginal Value of Adaptive Gradient Methods in Machine Learning

TL;DR: This article showed that adaptive methods often find drastically different solutions than gradient descent or stochastic gradient descent (SGD) for simple overparameterized problems, and that the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD, even when they have better training performance.
Proceedings Article

Insertion Transformer: Flexible Sequence Generation via Insertion Operations

TL;DR: The Insertion Transformer outperforms many prior non-autoregressive approaches to translation at comparable or better levels of parallelism, and successfully recovers the performance of the original Transformer while requiring only logarithmically many iterations during decoding.
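The logarithmic decoding claim comes from a balanced binary tree insertion order: at every step, the middle token of each remaining gap can be inserted in parallel. The simulation below is only a sketch of that ordering on a fixed target sentence, which stands in for what the model would actually predict; no model is involved.

def insertion_order_steps(target_tokens):
    # Simulate parallel middle-of-gap insertions over index spans of the target.
    produced = {}                               # position -> token inserted so far
    gaps = [(0, len(target_tokens))]            # half-open spans not yet generated
    steps = 0
    while gaps:
        steps += 1
        next_gaps = []
        for lo, hi in gaps:
            mid = (lo + hi) // 2
            produced[mid] = target_tokens[mid]  # insert the middle token of this gap
            if lo < mid:
                next_gaps.append((lo, mid))
            if mid + 1 < hi:
                next_gaps.append((mid + 1, hi))
        gaps = next_gaps
        print(f"step {steps}: {' '.join(produced[i] for i in sorted(produced))}")
    return steps

# An 8-token sentence completes in 4 parallel decoding steps.
insertion_order_steps("the quick brown fox jumps over the dog".split())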