Home
/
Authors
/
Chi Jin

Author

Chi Jin

Other affiliations: University of California, Berkeley, Peking University

Bio: Chi Jin is an academic researcher from Princeton University. The author has contributed to research in topics: Reinforcement learning & Computer science. The author has an hindex of 35, co-authored 90 publications receiving 4746 citations. Previous affiliations of Chi Jin include University of California, Berkeley & Peking University.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2012

Papers

PDF

Open Access

More filters

Proceedings Article•

Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

[...]

Rong Ge¹, Furong Huang², Chi Jin³, Yang Yuan⁴•Institutions (4)

Microsoft¹, University of California, Irvine², University of California, Berkeley³, Cornell University⁴

26 Jun 2015

TL;DR: In this article, the authors show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations for orthogonal tensor decomposition.

...read moreread less

Abstract: We analyze stochastic gradient descent for optimizing non-convex functions. In many cases for non-convex functions the goal is to find a reasonable local minimum, and the main concern is that gradient updates are trapped in saddle points. In this paper we identify strict saddle property for non-convex problem that allows for efficient optimization. Using this property we show that from an arbitrary starting point, stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has strict saddle property. As a result we get the first online algorithm for orthogonal tensor decomposition with global convergence guarantee.

...read moreread less

1,016 citations

Posted Content•

Provably Efficient Reinforcement Learning with Linear Function Approximation

[...]

Chi Jin¹, Zhuoran Yang², Zhaoran Wang, Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Princeton University²

11 Jul 2019-arXiv: Learning

TL;DR: This paper proves that an optimistic modification of Least-Squares Value Iteration (LSVI) achieves regret, where d is the ambient dimension of feature space, H is the length of each episode, and T is the total number of steps, and is independent of the number of states and actions.

...read moreread less

Abstract: Modern Reinforcement Learning (RL) is commonly applied to practical problems with an enormous number of states, where function approximation must be deployed to approximate either the value function or the policy. The introduction of function approximation raises a fundamental set of challenges involving computational and statistical efficiency, especially given the need to manage the exploration/exploitation tradeoff. As a result, a core RL question remains open: how can we design provably efficient RL algorithms that incorporate function approximation? This question persists even in a basic setting with linear dynamics and linear rewards, for which only linear function approximation is needed. This paper presents the first provable RL algorithm with both polynomial runtime and polynomial sample complexity in this linear setting, without requiring a "simulator" or additional assumptions. Concretely, we prove that an optimistic modification of Least-Squares Value Iteration (LSVI)---a classical algorithm frequently studied in the linear setting---achieves $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret, where $d$ is the ambient dimension of feature space, $H$ is the length of each episode, and $T$ is the total number of steps. Importantly, such regret is independent of the number of states and actions.

...read moreread less

337 citations

Posted Content•

No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis

[...]

Rong Ge¹, Chi Jin², Yi Zheng¹•Institutions (2)

Duke University¹, University of California, Berkeley²

03 Apr 2017-arXiv: Learning

TL;DR: In this paper, the authors developed a new framework that captures the common landscape underlying the common non-convex low-rank matrix problems including matrix sensing, matrix completion and robust PCA.

...read moreread less

Abstract: In this paper we develop a new framework that captures the common landscape underlying the common non-convex low-rank matrix problems including matrix sensing, matrix completion and robust PCA. In particular, we show for all above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exists. These results explain why simple algorithms such as stochastic gradient descent have global converge, and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses on optimization landscapes for matrix sensing and symmetric matrix completion. The framework naturally leads to new results for asymmetric matrix completion and robust PCA.

...read moreread less

295 citations

Proceedings Article•

How to escape saddle points efficiently

[...]

Chi Jin¹, Rong Ge², Praneeth Netrapalli³, Sham M. Kakade⁴, Michael I. Jordan¹ - Show less +1 more•Institutions (4)

University of California, Berkeley¹, Duke University², Microsoft³, University of Washington⁴

06 Aug 2017

TL;DR: In this article, the authors show that perturbed gradient descent can escape saddle points almost for free, in a number of iterations which depends only poly-logarithmically on dimension.

...read moreread less

Abstract: This paper shows that a perturbed form of gradient descent converges to a second-order stationary point in a number iterations which depends only poly-logarithmically on dimension (i.e., it is almost "dimension-free"). The convergence rate of this procedure matches the well-known convergence rate of gradient descent to first-order stationary points, up to log factors. When all saddle points are non-degenerate, all second-order stationary points are local minima, and our result thus shows that perturbed gradient descent can escape saddle points almost for free. Our results can be directly applied to many machine learning applications, including deep learning. As a particular concrete example of such an application, we show that our results can be used directly to establish sharp global convergence rates for matrix factorization. Our results rely on a novel characterization of the geometry around saddle points, which may be of independent interest to the non-convex optimization community.

...read moreread less

280 citations

Posted Content•

On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems

[...]

Tianyi Lin¹, Chi Jin², Michael I. Jordan¹•Institutions (2)

University of California, Berkeley¹, Princeton University²

02 Jun 2019-arXiv: Learning

TL;DR: This is the first nonasymptotic analysis for two-time-scale GDA in this setting, shedding light on its superior practical performance in training generative adversarial networks (GANs) and other real applications.

...read moreread less

Abstract: We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, where $f$ is nonconvex in $\mathbf{x}$ but concave in $\mathbf{y}$ and $\mathcal{Y}$ is a convex and bounded set. One of the most popular algorithms for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory and economics. Despite the extensive convergence results for the convex-concave setting, GDA with equal stepsize can converge to limit cycles or even diverge in a general setting. In this paper, we present the complexity results on two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$ efficiently. To the best our knowledge, this is the first nonasymptotic analysis for two-time-scale GDA in this setting, shedding light on its superior practical performance in training generative adversarial networks (GANs) and other real applications.

...read moreread less

271 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

Collapse

Cited by

PDF

Open Access

More filters

Convex Analysisの二,三の進展について

[...]

徹丸山

01 Feb 1977

5,933 citations

Posted Content•

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

[...]

Nitish Shirish Keskar¹, Dheevatsa Mudigere², Jorge Nocedal, Mikhail Smelyanskiy², Ping Tak Peter Tang² - Show less +1 more•Institutions (2)

Northwestern University¹, Intel²

15 Sep 2016-arXiv: Learning

TL;DR: In this paper, the authors investigate the cause of the generalization drop in the large batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minima of the training and testing functions.

...read moreread less

Abstract: The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize. We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We discuss several strategies to attempt to help large-batch methods eliminate this generalization gap.

...read moreread less

925 citations

Book Chapter•DOI•

Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms

[...]

Kaiqing Zhang¹, Zhuoran Yang², Tamer Basar¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, Princeton University²

29 Apr 2021-arXiv: Learning

TL;DR: This chapter reviews the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two.

...read moreread less

Abstract: Recent years have witnessed significant advances in reinforcement learning (RL), which has registered tremendous success in solving various sequential decision-making problems in machine learning. Most of the successful RL applications, e.g., the games of Go and Poker, robotics, and autonomous driving, involve the participation of more than one single agent, which naturally fall into the realm of multi-agent RL (MARL), a domain with a relatively long history, and has recently re-emerged due to advances in single-agent RL techniques. Though empirically successful, theoretical foundations for MARL are relatively lacking in the literature. In this chapter, we provide a selective overview of MARL, with focus on algorithms backed by theoretical analysis. More specifically, we review the theoretical results of MARL algorithms mainly within two representative frameworks, Markov/stochastic games and extensive-form games, in accordance with the types of tasks they address, i.e., fully cooperative, fully competitive, and a mix of the two. We also introduce several significant but challenging applications of these algorithms. Orthogonal to the existing reviews on MARL, we highlight several new angles and taxonomies of MARL theory, including learning in extensive-form games, decentralized MARL with networked agents, MARL in the mean-field regime, (non-)convergence of policy-based methods for learning in games, etc. Some of the new angles extrapolate from our own research endeavors and interests. Our overall goal with this chapter is, beyond providing an assessment of the current state of the field on the mark, to identify fruitful future research directions on theoretical studies of MARL. We expect this chapter to serve as continuing stimulus for researchers interested in working on this exciting while challenging topic.

...read moreread less

692 citations

Journal Article•DOI•

A high-bias, low-variance introduction to Machine Learning for physicists

[...]

Pankaj Mehta¹, Marin Bukov², Ching-Hao Wang¹, Alexandre G. R. Day¹, Charles C. Richardson¹, Charles K. Fisher, David J. Schwab³ - Show less +3 more•Institutions (3)

Boston University¹, University of California, Berkeley², City University of New York³

30 May 2019-Physics Reports

TL;DR: The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, generalization, and gradient descent before moving on to more advanced topics in both supervised and unsupervised learning.

...read moreread less

664 citations

Posted Content•

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

[...]

Simon S. Du¹, Xiyu Zhai², Barnabás Póczos¹, Aarti Singh¹•Institutions (2)

Carnegie Mellon University¹, Massachusetts Institute of Technology²

04 Oct 2018-arXiv: Learning

TL;DR: This article showed that gradient descent converges at a global linear rate to the global optimum for two-layer fully connected ReLU activated neural networks, where over-parameterization and random initialization jointly restrict weight vector to be close to its initialization for all iterations.

...read moreread less

Abstract: One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

...read moreread less

662 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse