Author

Xuechen Li

Other affiliations: University of Toronto
Bio: Xuechen Li is an academic researcher from Stanford University. The author has contributed to research in topics including computer science and stochastic differential equations, has an h-index of 11, and has co-authored 25 publications receiving 1,479 citations. Previous affiliations of Xuechen Li include the University of Toronto.

Papers
Proceedings Article
14 Feb 2018
TL;DR: In this paper, the authors decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables, and use this to motivate the beta-TCVAE (Total Correlation Variational Autoencoder) algorithm.
Abstract: We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate the beta-TCVAE (Total Correlation Variational Autoencoder) algorithm, a refinement and plug-in replacement of the beta-VAE for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement, when the model is trained using our framework.
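For reference, the decomposition the abstract refers to is usually written as follows (a sketch of the total-correlation decomposition of the averaged KL term, assuming the aggregate posterior $q(z) = \mathbb{E}_{p(x)}[q(z \mid x)]$; $\beta$-TCVAE then up-weights only the middle term):

$$
\mathbb{E}_{p(x)}\big[\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)\big]
= I_q(x; z)
+ \mathrm{KL}\Big(q(z)\,\Big\|\,\prod\nolimits_j q(z_j)\Big)
+ \sum\nolimits_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big),
$$

where the first term is the index-code mutual information, the second is the total correlation highlighted in the abstract, and the third is the dimension-wise KL to the prior.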

541 citations

Posted Content
TL;DR: In this article, the authors decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables and use this to motivate the Total Correlation Variational Autoencoder (TCVAE), a refinement of the state-of-the-art VAE objective for learning disentangled representations.
Abstract: We decompose the evidence lower bound to show the existence of a term measuring the total correlation between latent variables. We use this to motivate our $\beta$-TCVAE (Total Correlation Variational Autoencoder), a refinement of the state-of-the-art $\beta$-VAE objective for learning disentangled representations, requiring no additional hyperparameters during training. We further propose a principled classifier-free measure of disentanglement called the mutual information gap (MIG). We perform extensive quantitative and qualitative experiments, in both restricted and non-restricted settings, and show a strong relation between total correlation and disentanglement when the latent-variable model is trained using our framework.

409 citations

Journal ArticleDOI
TL;DR: The Holistic Evaluation of Language Models (HELM), presented in this paper, is a benchmark for language models in which 30 models are evaluated on 16 core scenarios and 7 metrics, exposing important trade-offs.
Abstract: Language models (LMs) like GPT-3, PaLM, and ChatGPT are the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of LMs. LMs can serve many purposes and their behavior should satisfy many desiderata. To navigate the vast space of potential scenarios and metrics, we taxonomize the space and select representative subsets. We evaluate models on 16 core scenarios and 7 metrics, exposing important trade-offs. We supplement our core evaluation with seven targeted evaluations to deeply analyze specific aspects (including world knowledge, reasoning, regurgitation of copyrighted content, and generation of disinformation). We benchmark 30 LMs, from OpenAI, Microsoft, Google, Meta, Cohere, AI21 Labs, and others. Prior to HELM, models were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: all 30 models are now benchmarked under the same standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly. HELM is a living benchmark for the community, continuously updated with new scenarios, metrics, and models: https://crfm.stanford.edu/helm/latest/.

168 citations

Posted Content
TL;DR: In this article, the authors examine approximate inference in variational autoencoders and find that divergence from the true posterior is often due to imperfect recognition networks, rather than the limited complexity of the approximating distribution.
Abstract: Amortized inference allows latent-variable models trained via variational learning to scale to large datasets. The quality of approximate inference is determined by two factors: a) the capacity of the variational distribution to match the true posterior and b) the ability of the recognition network to produce good variational parameters for each datapoint. We examine approximate inference in variational autoencoders in terms of these factors. We find that divergence from the true posterior is often due to imperfect recognition networks, rather than the limited complexity of the approximating distribution. We show that this is due partly to the generator learning to accommodate the choice of approximation. Furthermore, we show that the parameters used to increase the expressiveness of the approximation play a role in generalizing inference rather than simply improving the complexity of the approximation.
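One standard way to make the two factors above precise is to split the total inference gap at the best distribution $q^{*}$ within the variational family (a sketch using notation common in this line of work; $\mathcal{L}[q]$ denotes the evidence lower bound under $q$):

$$
\underbrace{\log p(x) - \mathcal{L}[q]}_{\text{inference gap}}
= \underbrace{\log p(x) - \mathcal{L}[q^{*}]}_{\text{approximation gap}}
+ \underbrace{\mathcal{L}[q^{*}] - \mathcal{L}[q]}_{\text{amortization gap}},
$$

so a large second term indicates that the recognition network, rather than the variational family itself, is the bottleneck.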

149 citations

Posted Content
TL;DR: The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations and is generalized to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers.
Abstract: The adjoint sensitivity method scalably computes gradients of solutions to ordinary differential equations. We generalize this method to stochastic differential equations, allowing time-efficient and constant-memory computation of gradients with high-order adaptive solvers. Specifically, we derive a stochastic differential equation whose solution is the gradient, a memory-efficient algorithm for caching noise, and conditions under which numerical solutions converge. In addition, we combine our method with gradient-based stochastic variational inference for latent stochastic differential equations. We use our method to fit stochastic dynamics defined by neural networks, achieving competitive performance on a 50-dimensional motion capture dataset.
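A minimal sketch of what using this method can look like in practice, assuming the torchsde package (the companion library released with this line of work); the model sizes, solver choice, and objective below are illustrative only:

```python
import torch
import torchsde


class NeuralSDE(torch.nn.Module):
    noise_type = "diagonal"  # one independent Brownian motion per state dimension
    sde_type = "ito"

    def __init__(self, dim=3, hidden=64):
        super().__init__()
        self.drift_net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, dim))
        self.diffusion_net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(), torch.nn.Linear(hidden, dim))

    def f(self, t, y):  # drift term of the SDE
        return self.drift_net(y)

    def g(self, t, y):  # diffusion term of the SDE
        return self.diffusion_net(y)


sde = NeuralSDE()
y0 = torch.zeros(16, 3)               # batch of 16 initial states
ts = torch.linspace(0.0, 1.0, 20)     # observation times

# sdeint_adjoint solves the SDE forward and backpropagates through the solve via
# the stochastic adjoint, keeping memory constant in the number of solver steps.
ys = torchsde.sdeint_adjoint(sde, y0, ts, method="milstein")
loss = ys.pow(2).mean()               # placeholder training objective
loss.backward()
```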

142 citations


Cited by
Proceedings ArticleDOI
22 Jan 2006
TL;DR: Some of the major results in random graphs and some of the more challenging open problems are reviewed, covering algorithmic and structural questions and touching on newer models, including those related to the WWW.
Abstract: We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

7,116 citations

Book ChapterDOI
01 Jan 1998
TL;DR: In this paper, the authors explore questions of existence and uniqueness for solutions to stochastic differential equations and offer a study of their properties, using diffusion processes as a model of a Markov process with continuous sample paths.
Abstract: We explore in this chapter questions of existence and uniqueness for solutions to stochastic differential equations and offer a study of their properties. This endeavor is really a study of diffusion processes. Loosely speaking, the term diffusion is attributed to a Markov process which has continuous sample paths and can be characterized in terms of its infinitesimal generator.
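In the notation such a chapter typically fixes, the object of study is the Itô equation

$$
dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \qquad X_0 = \xi,
$$

and the classical existence-and-uniqueness result (a standard statement, not a quotation of this chapter) asks that the coefficients be Lipschitz and of linear growth,

$$
\|b(t,x) - b(t,y)\| + \|\sigma(t,x) - \sigma(t,y)\| \le K \|x - y\|, \qquad
\|b(t,x)\|^2 + \|\sigma(t,x)\|^2 \le K^2\,(1 + \|x\|^2),
$$

which guarantees a unique strong solution with continuous sample paths, i.e. a diffusion in the sense described above.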

2,446 citations

Posted Content
Tero Karras1, Samuli Laine1, Timo Aila1
TL;DR: This article proposes an alternative generator architecture for GANs, borrowing from the style transfer literature, which leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images.
Abstract: We propose an alternative generator architecture for generative adversarial networks, borrowing from style transfer literature. The new architecture leads to an automatically learned, unsupervised separation of high-level attributes (e.g., pose and identity when trained on human faces) and stochastic variation in the generated images (e.g., freckles, hair), and it enables intuitive, scale-specific control of the synthesis. The new generator improves the state-of-the-art in terms of traditional distribution quality metrics, leads to demonstrably better interpolation properties, and also better disentangles the latent factors of variation. To quantify interpolation quality and disentanglement, we propose two new, automated methods that are applicable to any generator architecture. Finally, we introduce a new, highly varied and high-quality dataset of human faces.
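A minimal sketch of the style-injection mechanism the published style-based generator uses, adaptive instance normalization (AdaIN) driven by a learned mapping network; all layer sizes and tensor shapes below are illustrative assumptions, not the paper's configuration:

```python
import torch

def adain(x, style_scale, style_bias, eps=1e-8):
    """x: feature maps (N, C, H, W); style_scale, style_bias: (N, C)."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    x_norm = (x - mean) / std                        # per-sample, per-channel normalization
    return style_scale[:, :, None, None] * x_norm + style_bias[:, :, None, None]

# A mapping network turns a latent z into an intermediate code w; per-layer
# affine transforms of w then produce the scale/bias pairs that AdaIN consumes.
mapping = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 512), torch.nn.LeakyReLU(0.2))
to_style = torch.nn.Linear(512, 2 * 64)              # 64 feature channels -> scale and bias

z = torch.randn(4, 512)
w = mapping(z)
scale, bias = to_style(w).chunk(2, dim=1)
features = torch.randn(4, 64, 32, 32)                # output of some convolution block
styled = adain(features, 1.0 + scale, bias)          # center the learned scales around 1
```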

1,612 citations

Journal Article
TL;DR: An independence criterion based on the eigen-spectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator, or HSIC, is proposed.
Abstract: We propose an independence criterion based on the eigen-spectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates. Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.
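A minimal sketch of the empirical estimate described above, i.e. the trace form of HSIC with a centering matrix; the Gaussian kernel and the (n − 1)⁻² normalization are common choices assumed here for illustration:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian RBF kernel over the rows of x."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: near zero (for characteristic kernels) iff x and y are independent."""
    n = x.shape[0]
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
print(hsic(x, x ** 2))                             # dependent pair: noticeably above zero
print(hsic(x, rng.normal(size=(200, 2))))          # independent pair: close to zero
```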

1,134 citations

01 Jan 2015
TL;DR: This compact, informal introduction for graduate students and advanced undergraduates presents state-of-the-art filtering and smoothing methods in a unified Bayesian framework; readers learn what non-linear Kalman filters and particle filters are, how they are related, and their relative advantages and disadvantages.
Abstract: Filtering and smoothing methods are used to produce an accurate estimate of the state of a time-varying system based on multiple observational inputs (data). Interest in these methods has exploded in recent years, with numerous applications emerging in fields such as navigation, aerospace engineering, telecommunications, and medicine. This compact, informal introduction for graduate students and advanced undergraduates presents the current state-of-the-art filtering and smoothing methods in a unified Bayesian framework. Readers learn what non-linear Kalman filters and particle filters are, how they are related, and their relative advantages and disadvantages. They also discover how state-of-the-art Bayesian parameter estimation methods can be combined with state-of-the-art filtering and smoothing algorithms. The book’s practical and algorithmic approach assumes only modest mathematical prerequisites. Examples include MATLAB computations, and the numerous end-of-chapter exercises include computational assignments. MATLAB/GNU Octave source code is available for download at www.cambridge.org/sarkka, promoting hands-on work with the methods.
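As a concrete anchor for the linear-Gaussian case such books start from, here is a minimal Kalman filter predict/update step; the constant-velocity model and noise levels below are illustrative assumptions rather than one of the book's examples:

```python
import numpy as np

def kalman_step(m, P, y, A, H, Q, R):
    # Predict: push the state mean and covariance through the linear dynamics.
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: fold in the new measurement y via the Kalman gain.
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new

dt = 0.1                                       # constant-velocity model in 1D
A = np.array([[1.0, dt], [0.0, 1.0]])          # state = (position, velocity)
H = np.array([[1.0, 0.0]])                     # only position is observed
Q = 1e-3 * np.eye(2)
R = np.array([[0.05]])

m, P = np.zeros(2), np.eye(2)
for y in [0.10, 0.22, 0.29, 0.41]:
    m, P = kalman_step(m, P, np.array([y]), A, H, Q, R)
print(m)                                       # filtered estimate of position and velocity
```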

1,102 citations