Author

Tiancheng Yu

Bio: Tiancheng Yu is an academic researcher from the Massachusetts Institute of Technology. The author has contributed to research on reinforcement learning and Markov decision processes, has an h-index of 9, and has co-authored 20 publications receiving 238 citations.

Papers
Posted Content
TL;DR: The authors propose a "reward-free RL" framework in which the agent first collects trajectories from an MDP without a pre-specified reward function and, after exploration, is tasked with computing near-optimal policies under a collection of given reward functions.
Abstract: Exploration is widely regarded as one of the most challenging aspects of reinforcement learning (RL), with many naive approaches succumbing to exponential sample complexity. To isolate the challenges of exploration, we propose a new "reward-free RL" framework. In the exploration phase, the agent first collects trajectories from an MDP $\mathcal{M}$ without a pre-specified reward function. After exploration, it is tasked with computing near-optimal policies for $\mathcal{M}$ under a collection of given reward functions. This framework is particularly suitable when there are many reward functions of interest, or when the reward function is shaped by an external agent to elicit desired behavior. We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration and returns $\epsilon$-suboptimal policies for an arbitrary number of reward functions. We achieve this by finding exploratory policies that visit each "significant" state with probability proportional to its maximum visitation probability under any possible policy. Moreover, our planning procedure can be instantiated by any black-box approximate planner, such as value iteration or natural policy gradient. We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
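To make the two-phase structure concrete, here is a minimal Python sketch of the reward-free interface: a transition model is estimated from reward-free exploration trajectories, and then any given reward function is handed to a black-box planner (value iteration here). This is an illustrative sketch, not the paper's exploration algorithm; the function names, array shapes, and the uniform fallback for unvisited pairs are assumptions of the sketch.

```python
import numpy as np

def estimate_model(trajectories, S, A):
    """Empirical transition estimate from reward-free exploration data.

    trajectories: list of episodes, each a list of (state, action, next_state) triples.
    """
    counts = np.zeros((S, A, S))
    for episode in trajectories:
        for s, a, s_next in episode:
            counts[s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform guess.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)

def plan(P_hat, reward, H):
    """Black-box planner (finite-horizon value iteration) on the estimated model.

    reward: array of shape (H, S, A); returns a greedy policy of shape (H, S).
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + P_hat @ V   # shape (S, A)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy
```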

112 citations

Proceedings Article
01 Jan 2020
TL;DR: The authors propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$ and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$, which matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode.
Abstract: This paper considers the problem of designing optimal algorithms for reinforcement learning in two-player zero-sum games. We focus on self-play algorithms which learn the optimal policy by playing against itself without any direct supervision. In a tabular episodic Markov game with $S$ states, $A$ max-player actions and $B$ min-player actions, the best existing algorithm for finding an approximate Nash equilibrium requires $\tilde{\mathcal{O}}(S^2AB)$ steps of game playing, when only highlighting the dependency on $(S,A,B)$. In contrast, the best existing lower bound scales as $\Omega(S(A+B))$ and has a significant gap from the upper bound. This paper closes this gap for the first time: we propose an optimistic variant of the \emph{Nash Q-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$. The latter result matches the information-theoretic lower bound in all problem-dependent parameters except for a polynomial factor of the length of each episode. In addition, we present a computational hardness result for learning the best responses against a fixed opponent in Markov games---a learning objective different from finding the Nash equilibrium.
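As a point of reference, value-based self-play algorithms of this kind repeatedly solve a one-step zero-sum matrix game at each state; the sketch below shows that subroutine via the standard linear-programming formulation. It is only an illustration (the use of scipy and all names are assumptions): the paper's Nash Q-learning additionally maintains optimistic upper and lower value estimates, and Nash V-learning avoids this per-state solve altogether in favor of adversarial-bandit updates.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_game(Q):
    """Maximin strategy and value for the max-player of a payoff matrix Q (A x B).

    Solves max_x min_y x^T Q y via the standard LP formulation.
    """
    A, B = Q.shape
    # Variables: x (A mixed-strategy weights) and v (game value); minimize -v.
    c = np.concatenate([np.zeros(A), [-1.0]])
    # For every min-player action b: v - sum_a x_a Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # x must be a probability distribution.
    A_eq = np.concatenate([np.ones(A), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:A], res.x[-1]   # (maximin strategy, game value)
```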

70 citations

Proceedings Article
12 Jul 2020
TL;DR: This work considers the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses and proposes an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability.
Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\tilde{\mathcal{O}}(L|X|^2\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first one to ensure $\tilde{\mathcal{O}}(\sqrt{T})$ regret in this challenging setting. Our key technical contribution is to introduce an optimistic loss estimator that is inversely weighted by an upper occupancy bound.
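The estimator highlighted above can be sketched in a few lines: observed losses are importance-weighted by an upper bound on the visitation probability of each state-action pair, so the estimate is biased downward, i.e. optimistically. This is a hedged illustration with invented names, not the paper's exact estimator.

```python
import numpy as np

def optimistic_loss_estimate(loss, visited, upper_occupancy):
    """Loss estimate inversely weighted by an upper occupancy bound.

    loss:            observed losses for this episode, shape (X, A)
    visited:         boolean mask of (state, action) pairs visited this episode
    upper_occupancy: u(x, a), an upper bound on the visitation probability of (x, a)

    Dividing by an *upper* bound on the visitation probability (rather than the
    true, unknown probability) biases the estimate downward, i.e. optimistically.
    """
    return np.where(visited, loss / upper_occupancy, 0.0)
```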

65 citations

Posted Content
TL;DR: This paper designs an algorithm for two-player zero-sum Markov games that outputs a single Markov policy with an optimality guarantee, whereas existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute.
Abstract: Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4S^2AB/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min \{A,B\}=o(H^3)$, and outputs a single Markov policy while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
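Schematically (the exact bonuses and equilibrium subroutine are the paper's; the notation below is mine), the model-based update maintains upper and lower estimates of the stage-game values via a bonus-augmented Bellman backup of the form

$$\overline{Q}_h(s,a,b) = \min\!\big\{ r_h(s,a,b) + [\widehat{P}_h \overline{V}_{h+1}](s,a,b) + \beta_h(s,a,b),\ H \big\}, \qquad \underline{Q}_h(s,a,b) = \max\!\big\{ r_h(s,a,b) + [\widehat{P}_h \underline{V}_{h+1}](s,a,b) - \beta_h(s,a,b),\ 0 \big\},$$

where $\widehat{P}_h$ is the empirical transition model and $\beta_h$ is an exploration bonus; the step-$h$ policy is then an (approximate) equilibrium of the matrix game defined by these estimates at each state.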

56 citations

Posted Content
TL;DR: The algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting and achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback.
Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact, it achieves the same regret bound as (Rosenberg & Mansour, 2019a), which considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.
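In symbols (schematic; notation mine and constants omitted), the estimator divides the observed loss by the largest occupancy measure consistent with a confidence set of plausible transition functions,

$$\hat{\ell}_t(x,a) = \frac{\ell_t(x,a)\,\mathbb{1}\{(x,a)\ \text{visited in episode}\ t\}}{u_t(x,a)}, \qquad u_t(x,a) = \max_{\widehat{P} \in \mathcal{P}_t} q^{\widehat{P},\,\pi_t}(x,a),$$

where $q^{\widehat{P},\pi_t}$ is the occupancy measure of policy $\pi_t$ under transition function $\widehat{P}$ and $\mathcal{P}_t$ is a confidence set built from Bernstein-type deviation bounds around the empirical transition frequencies (the tighter confidence set mentioned above).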

48 citations


Cited by
Journal Article
TL;DR: In this paper, the authors explore the limits of predictability in human dynamics by studying the mobility patterns of anonymized mobile phone users and find a 93% potential predictability in user mobility across the whole user base.
Abstract: A range of applications, from predicting the spread of human and electronic viruses to city planning and resource management in mobile communications, depend on our ability to foresee the whereabouts and mobility of individuals, raising a fundamental question: To what degree is human behavior predictable? Here we explore the limits of predictability in human dynamics by studying the mobility patterns of anonymized mobile phone users. By measuring the entropy of each individual's trajectory, we find a 93% potential predictability in user mobility across the whole user base. Despite the significant differences in the travel patterns, we find a remarkable lack of variability in predictability, which is largely independent of the distance users cover on a regular basis.
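As a rough illustration of how an entropy estimate caps predictability, the standard Fano-type calculation solves $S = H(\Pi) + (1-\Pi)\log_2(N-1)$ for the maximal predictability $\Pi^{\max}$, where $H$ is the binary entropy and $N$ the number of distinct locations. The Python sketch below (names and interface are my own assumptions) does this by bisection.

```python
import numpy as np

def max_predictability(entropy_rate, num_locations, tol=1e-9):
    """Solve S = H(Pi) + (1 - Pi) * log2(N - 1) for the predictability cap Pi_max.

    entropy_rate:  estimated entropy S (bits per step) of a user's trajectory
    num_locations: number N of distinct locations the user visits (N >= 2)
    """
    def binary_entropy(p):
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def fano(pi):
        return binary_entropy(pi) + (1 - pi) * np.log2(num_locations - 1)

    # fano(pi) decreases from log2(N) to 0 as pi goes from 1/N to 1, so bisect.
    lo, hi = 1.0 / num_locations, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano(mid) > entropy_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```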

118 citations

Posted Content
TL;DR: In this paper, a "pseudo spectral gap" is introduced for non-reversible Markov chains, which plays a similar role for non-reversible chains as the spectral gap plays for reversible chains.
Abstract: We prove a version of McDiarmid's bounded differences inequality for Markov chains, with constants proportional to the mixing time of the chain. We also show variance bounds and Bernstein-type inequalities for empirical averages of Markov chains. In the case of non-reversible chains, we introduce a new quantity called the "pseudo spectral gap", and show that it plays a similar role for non-reversible chains as the spectral gap plays for reversible chains. Our techniques for proving these results are based on a coupling construction of Katalin Marton, and on spectral techniques due to Pascal Lezaud. The pseudo spectral gap generalises the multiplicative reversiblization approach of Jim Fill.
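For reference, and writing it schematically in my own notation, the pseudo spectral gap of a chain with transition kernel $P$ and time reversal $P^{*}$ is

$$\gamma_{\mathrm{ps}} = \max_{k \ge 1}\ \frac{\gamma\big((P^{*})^{k} P^{k}\big)}{k},$$

where $\gamma(\cdot)$ denotes the usual spectral gap; it is this quantity that replaces the spectral gap in the concentration bounds for non-reversible chains.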

105 citations

Posted Content
TL;DR: This work analyzes two approaches for learning in Constrained Markov Decision Processes and highlights a crucial difference between them: the linear programming approach results in stronger guarantees than the dual-formulation-based approach.
Abstract: In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of CMDP to perform optimistic planning at each episode. The second approach leverages the dual formulation (or saddle-point formulation) of CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieve sublinear regret w.r.t. the main utility while having a sublinear regret on the constraint violations. That being said, we highlight a crucial difference between the two approaches; the linear programming approach results in stronger guarantees than the dual-formulation-based approach.
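To make the contrast concrete, the linear-programming view optimizes over occupancy measures directly. In a finite-horizon CMDP with initial distribution $\mu$, one standard formulation (schematic, notation mine) is

$$\begin{aligned} \max_{q \ge 0}\ & \sum_{h,s,a} q_h(s,a)\, r_h(s,a) \\ \text{s.t.}\ & \sum_a q_1(s,a) = \mu(s) \quad \forall s, \qquad \sum_a q_{h+1}(s,a) = \sum_{s',a'} P_h(s \mid s',a')\, q_h(s',a') \quad \forall s, h, \\ & \sum_{h,s,a} q_h(s,a)\, u_i(s,a) \ge \tau_i \quad \forall i, \end{aligned}$$

whereas the dual (saddle-point) approach runs incremental primal-dual updates on a Lagrangian of the form $\mathcal{L}(\pi,\lambda) = V_r^{\pi} + \sum_i \lambda_i\big(V_{u_i}^{\pi} - \tau_i\big)$ with $\lambda \ge 0$.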

94 citations

Posted Content
TL;DR: This work introduces a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and shows that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, and also introduces an explore-then-exploit style algorithm that achieves a slightly worse regret but is guaranteed to run in polynomial time even in the worst case.
Abstract: Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment; it remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff. We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm---Value Iteration with Upper/Lower Confidence Bound (VI-ULCB)---and show that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game, where the regret is measured by the agent's performance against a \emph{fully adversarial} opponent who can exploit the agent's strategy at \emph{any} step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
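For clarity on what is being bounded (notation below is mine; the paper's $T$ counts game steps), the regret is measured against the best response to the learner's own policies: if the max-player plays policies $\mu_1,\dots,\mu_K$ over $K$ episodes, then

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{*}(s_1) - V^{\mu_k,\dagger}(s_1) \Big),$$

where $V^{\mu_k,\dagger}$ is the value of $\mu_k$ against a best-responding opponent and $V^{*}$ is the Nash value of the game, so small regret certifies policies that cannot be exploited much even by a fully adversarial opponent.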

87 citations

Proceedings Article
01 Jan 2020
TL;DR: It is shown that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
Abstract: We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zero-sum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule (which is necessary). To the best of our knowledge, this constitutes the first finite-sample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.
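As a toy illustration of the two-timescale rule only (this is not the paper's algorithm or setting: the sketch plays a single-state zero-sum matrix game with exact gradients, whereas the paper treats stochastic games where each player observes only its own actions and rewards), one player deliberately learns much more slowly than the other; all names and the choice of which player is slow are assumptions of the sketch.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1 - css) / idx > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def two_timescale_pg(R, T=50_000, eta_slow=1e-2, eta_fast=1e-1):
    """Independent projected policy gradient on a zero-sum matrix game R (A x B).

    The max-player x ascends with a much smaller step size than the min-player y
    descends with, mimicking the two-timescale learning-rate rule; the averaged
    slow iterate is returned as the max-player's policy.
    """
    A, B = R.shape
    x, y = np.ones(A) / A, np.ones(B) / B
    x_avg = np.zeros(A)
    for _ in range(T):
        grad_x = R @ y          # gradient of x^T R y with respect to x
        grad_y = R.T @ x        # gradient of x^T R y with respect to y
        x = project_simplex(x + eta_slow * grad_x)   # slow ascent
        y = project_simplex(y - eta_fast * grad_y)   # fast descent
        x_avg += x
    return x_avg / T, y
```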

85 citations