
Showing papers by "Richard Zhang" published in 2021


Proceedings ArticleDOI
13 Apr 2021
TL;DR: In this paper, a novel cross-domain distance consistency loss is proposed to preserve the relative similarities and differences between instances in the source domain, along with an anchor-based strategy to encourage different levels of realism over different regions in the latent space.
Abstract: Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting. In this work, we seek to utilize a large source domain for pretraining and transfer the diversity information from source to target. We propose to preserve the relative similarities and differences between instances in the source via a novel cross-domain distance consistency loss. To further reduce overfitting, we present an anchor-based strategy to encourage different levels of realism over different regions in the latent space. With extensive results in both photorealistic and non-photorealistic domains, we demonstrate qualitatively and quantitatively that our few-shot model automatically discovers correspondences between source and target domains and generates more diverse and realistic images than previous methods.
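The cross-domain distance consistency idea can be summarized in a short sketch. Below is a minimal, illustrative PyTorch version, assuming G_src and G_tgt return comparable features for the same batch of latent codes; the paper's actual formulation may differ in detail:

    import torch
    import torch.nn.functional as F

    def distance_consistency_loss(feat_src, feat_tgt):
        # feat_src, feat_tgt: (N, D) features from the source and target
        # generators for the same batch of latent codes.
        def pairwise(feat):
            feat = F.normalize(feat.flatten(1), dim=1)
            sim = feat @ feat.t()                            # cosine similarities
            mask = ~torch.eye(len(feat), dtype=torch.bool, device=feat.device)
            return sim[mask].view(len(feat), -1)             # drop self-similarities
        p_src = F.softmax(pairwise(feat_src).detach(), dim=1)
        log_p_tgt = F.log_softmax(pairwise(feat_tgt), dim=1)
        # Match the target batch's similarity structure to the frozen source.
        return F.kl_div(log_p_tgt, p_src, reduction="batchmean")

    # Hypothetical usage:
    # z = torch.randn(8, 512)
    # loss = distance_consistency_loss(G_src(z), G_tgt(z))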

136 citations


Proceedings ArticleDOI
04 Mar 2021
TL;DR: In this paper, Anycost GAN is proposed for interactive natural image editing, using sampling-based multi-resolution training, adaptive-channel training, and a generator-conditioned discriminator.
Abstract: Generative adversarial networks (GANs) have enabled photorealistic image synthesis and editing. However, due to the high computational cost of large-scale generators (e.g., StyleGAN2), it usually takes seconds to see the results of a single edit on edge devices, prohibiting interactive user experience. In this paper, inspired by quick preview features in modern rendering software, we propose Anycost GAN for interactive natural image editing. We train the Anycost GAN to support elastic resolutions and channels for faster image generation at versatile speeds. Running subsets of the full generator produces outputs that are perceptually similar to the full generator, making them a good proxy for quick preview. By using sampling-based multi-resolution training, adaptive-channel training, and a generator-conditioned discriminator, the anycost generator can be evaluated at various configurations while achieving better image quality compared to separately trained models. Furthermore, we develop new encoder training and latent code optimization techniques to encourage consistency between the different sub-generators during image projection. Anycost GAN can be executed at various cost budgets (up to 10× computation reduction) and adapt to a wide range of hardware and latency requirements. When deployed on desktop CPUs and edge devices, our model can provide perceptually similar previews at 6-12× speedup, enabling interactive image editing. The code and demo are publicly available.
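The elastic-channel part of this idea can be illustrated with a small sketch: a convolution whose weights can be sliced so that a narrower sub-generator shares parameters with the full one. This is only a toy illustration under that assumption, not the actual Anycost GAN architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SlimmableConv2d(nn.Conv2d):
        # A conv layer that can run with only the first fraction of its channels.
        def forward(self, x, width_ratio=1.0):
            out_ch = max(1, int(self.out_channels * width_ratio))
            in_ch = x.shape[1]                          # accept an already-slimmed input
            weight = self.weight[:out_ch, :in_ch]
            bias = self.bias[:out_ch] if self.bias is not None else None
            return F.conv2d(x, weight, bias, self.stride, self.padding)

    conv = SlimmableConv2d(64, 128, 3, padding=1)
    x = torch.randn(1, 64, 32, 32)
    full_pass = conv(x, width_ratio=1.0)                # (1, 128, 32, 32): full quality
    preview = conv(x, width_ratio=0.5)                  # (1, 64, 32, 32): cheaper preview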

56 citations


Proceedings ArticleDOI
29 Apr 2021
TL;DR: In this paper, the latent code corresponding to a given real input image is found using a pre-trained generator, and perturbations of this code are used to generate natural variations of the image for test-time ensembling.
Abstract: Recent generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose, simply by learning from unlabeled image collections. Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification. Using a pre-trained generator, we first find the latent code corresponding to a given real input image. Applying perturbations to the code creates natural variations of the image, which can then be ensembled together at test-time. We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars. Critically, we find that several design decisions are required towards making this process work; the perturbation procedure, weighting between the augmentations and original image, and training the classifier on synthesized images can all impact the result. Currently, we find that while test-time ensembling with GAN-based augmentations can offer some small improvements, the remaining bottlenecks are the efficiency and accuracy of the GAN reconstructions, coupled with classifier sensitivities to artifacts in GAN-generated images.
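As a rough sketch of the test-time procedure described above (the names encode and G, and the weighting scheme, are illustrative assumptions rather than the paper's exact design):

    import torch

    def ensemble_predict(image, encode, G, classifier, n_views=8, sigma=0.1, alpha=0.5):
        # encode: GAN inversion of the real image into a latent code for generator G.
        w = encode(image)
        logits = alpha * classifier(image.unsqueeze(0))          # weight on the original image
        for _ in range(n_views):
            view = G(w + sigma * torch.randn_like(w))            # perturbed "view" of the image
            logits = logits + (1 - alpha) / n_views * classifier(view)
        return logits.softmax(dim=-1)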

48 citations


Posted Content
TL;DR: CDPAM is introduced, a metric that builds on and advances DPAM, and it is shown that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
Abstract: Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
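A minimal sketch of the kind of contrastive/triplet training signal such a learned audio metric uses (illustrative only; CDPAM's multi-dimensional representation and training details are not reproduced here):

    import torch
    import torch.nn.functional as F

    def perceptual_distance(encoder, ref, deg):
        # Distance between embeddings of a reference and a degraded recording.
        return (encoder(ref) - encoder(deg)).pow(2).mean(dim=-1)

    def triplet_metric_loss(encoder, ref, closer, farther, margin=0.1):
        # Human judgment: `closer` sounds more like `ref` than `farther` does.
        d_close = perceptual_distance(encoder, ref, closer)
        d_far = perceptual_distance(encoder, ref, farther)
        return F.relu(d_close - d_far + margin).mean()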

33 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this paper, a pixel-wise network is proposed for image-to-image translation, where each pixel is processed independently of others, through a composition of simple affine transformations and nonlinearities.
Abstract: We introduce a new generator architecture, aimed at fast and efficient high-resolution image-to-image translation. We design the generator to be an extremely lightweight function of the full-resolution image. In fact, we use pixel-wise networks; that is, each pixel is processed independently of others, through a composition of simple affine transformations and nonlinearities. We take three important steps to equip such a seemingly simple function with adequate expressivity. First, the parameters of the pixel-wise networks are spatially varying, so they can represent a broader function class than simple 1 × 1 convolutions. Second, these parameters are predicted by a fast convolutional network that processes an aggressively low-resolution representation of the input. Third, we augment the input image by concatenating a sinusoidal encoding of spatial coordinates, which provides an effective inductive bias for generating realistic novel high-frequency image content. As a result, our model is up to 18× faster than state-of-the-art baselines. We achieve this speedup while generating comparable visual quality across different image resolutions and translation domains.
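Two of the ingredients above, the sinusoidal coordinate encoding and the pixel-wise (1x1) network, can be sketched as follows; the spatially varying parameters predicted from a low-resolution stream are omitted, so this is only an illustrative simplification:

    import math
    import torch
    import torch.nn as nn

    def coord_encoding(h, w, n_freqs=6):
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        coords = torch.stack([xs, ys])                                    # (2, H, W)
        feats = [torch.sin(coords * math.pi * 2 ** k) for k in range(n_freqs)]
        feats += [torch.cos(coords * math.pi * 2 ** k) for k in range(n_freqs)]
        return torch.cat(feats)                                           # (4 * n_freqs, H, W)

    h, w = 256, 256
    image = torch.rand(1, 3, h, w)
    enc = coord_encoding(h, w).unsqueeze(0)
    pixelwise_net = nn.Sequential(                                        # 1x1 convs = per-pixel MLP
        nn.Conv2d(3 + enc.shape[1], 64, kernel_size=1), nn.ReLU(),
        nn.Conv2d(64, 3, kernel_size=1),
    )
    out = pixelwise_net(torch.cat([image, enc], dim=1))                   # (1, 3, 256, 256)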

27 citations


Journal ArticleDOI
TL;DR: The DREAM challenge described in this paper used in vitro experimental intMEMOIR recordings and in silico data for a C. elegans lineage tree of about 1,000 cells and a Mus musculus tree of 10,000 cells.
Abstract: The recent advent of CRISPR and other molecular tools enabled the reconstruction of cell lineages based on induced DNA mutations and promises to do the same for more complex organisms. To date, no lineage reconstruction algorithms have been rigorously examined for their performance and robustness across dataset types and numbers of cells. To benchmark such methods, we decided to organize a DREAM challenge using in vitro experimental intMEMOIR recordings and in silico data for a C. elegans lineage tree of about 1,000 cells and a Mus musculus tree of 10,000 cells. Some of the 22 approaches submitted had excellent performance, but structural features of the trees prevented optimal reconstructions. Using smaller sub-trees as training sets proved to be a good approach for tuning algorithms to reconstruct larger trees. The simulation and reconstruction methods generated here delineate a potential way forward for solving larger cell lineage trees such as that of the mouse.

26 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that the coupling due to overlap constraints is guaranteed to be sparse over dense blocks, with a block sparsity pattern that coincides with the adjacency matrix of a tree.
Abstract: Clique tree conversion solves large-scale semidefinite programs by splitting an $n\times n$ matrix variable into up to n smaller matrix variables, each representing a principal submatrix of up to $\omega \times \omega$. Its fundamental weakness is the need to introduce overlap constraints that enforce agreement between different matrix variables, because these can result in dense coupling. In this paper, we show that by dualizing the clique tree conversion, the coupling due to the overlap constraints is guaranteed to be sparse over dense blocks, with a block sparsity pattern that coincides with the adjacency matrix of a tree. We consider two classes of semidefinite programs with favorable sparsity patterns that encompass the MAXCUT and MAX k-CUT relaxations, the Lovász theta problem, and the AC optimal power flow relaxation. Assuming that $\omega \ll n$, we prove that the per-iteration cost of an interior-point method is linear $O(n)$ time and memory, so an $\epsilon$-accurate and $\epsilon$-feasible iterate is obtained after $O(\sqrt{n}\log(1/\epsilon))$ iterations in near-linear $O(n^{1.5}\log(1/\epsilon))$ time. We confirm our theoretical insights with numerical results on semidefinite programs as large as $n=13{,}659$.
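The overall running time quoted above follows directly from combining the two stated bounds; as a quick check:

    O(\sqrt{n}\log(1/\epsilon)) \text{ iterations} \times O(n) \text{ per iteration} = O(n^{1.5}\log(1/\epsilon)) \text{ time, assuming } \omega \ll n.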

21 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, a metric called CDPAM is proposed that combines contrastive learning and multi-dimensional representations to build robust models from limited data and to improve generalization to a broader range of audio perturbations.
Abstract: Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. [1] learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM, a metric that builds on and advances DPAM. The primary improvement is to combine contrastive learning and multi-dimensional representations to build robust models from limited data. In addition, we collect human judgments on triplet comparisons to improve generalization to a broader range of audio perturbations. CDPAM correlates well with human responses across nine varied datasets. We also show that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.

14 citations


Posted Content
TL;DR: In this paper, the authors investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries and provide recommendations for computing the FID score accurately.
Abstract: We investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries. The FID score is widely used to evaluate generative models, but each FID implementation uses a different low-level image processing pipeline. Image resizing functions in commonly used deep learning libraries often introduce aliasing artifacts. We observe that numerous subtle choices need to be made for FID calculation, and a lack of consistency in these choices can lead to vastly different FID scores. In particular, we show that the following choices are significant: (1) which image resizing library to use, (2) which interpolation kernel to use, and (3) which encoding to use when representing images. We additionally outline numerous common pitfalls that should be avoided and provide recommendations for computing the FID score accurately. We provide an easy-to-use optimized implementation of our proposed recommendations in the accompanying code.
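The resizing pitfall can be demonstrated in a few lines: different libraries return different downsampled pixels for the same image, which shifts downstream FID. This is an illustrative sketch only; the paper's accompanying code contains the recommended pipeline:

    import numpy as np
    from PIL import Image
    import torch
    import torch.nn.functional as F

    img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)

    # PIL bicubic resize (antialiased)
    pil_small = np.asarray(Image.fromarray(img).resize((64, 64), Image.BICUBIC))

    # PyTorch bicubic resize (no antialiasing by default)
    t = torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0)
    torch_small = F.interpolate(t, size=(64, 64), mode="bicubic", align_corners=False)
    torch_small = torch_small.squeeze(0).permute(1, 2, 0).clamp(0, 255).byte().numpy()

    # Nonzero mean absolute difference between the two "equivalent" resizes.
    print(np.abs(pil_small.astype(int) - torch_small.astype(int)).mean())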

13 citations


Posted Content
TL;DR: In this article, the authors make a series of empirical observations that investigate the hypothesis that deeper networks are inductively biased to find solutions with lower rank embeddings and conjecture that this bias exists because the volume of functions that maps to low-rank embedding increases with depth.
Abstract: Modern deep neural networks are highly over-parameterized compared to the data on which they are trained, yet they often generalize remarkably well. A flurry of recent work has asked: why do deep networks not overfit to their training data? In this work, we make a series of empirical observations that investigate the hypothesis that deeper networks are inductively biased to find solutions with lower rank embeddings. We conjecture that this bias exists because the volume of functions that maps to low-rank embedding increases with depth. We show empirically that our claim holds true on finite width linear and non-linear models and show that these are the solutions that generalize well. We then show that the low-rank simplicity bias exists even after training, using a wide variety of commonly used optimizers. We found this phenomenon to be resilient to initialization, hyper-parameters, and learning methods. We further demonstrate how linear over-parameterization of deep non-linear models can be used to induce low-rank bias, improving generalization performance without changing the effective model capacity. Practically, we demonstrate that simply linearly over-parameterizing standard models at training time can improve performance on image classification tasks, including ImageNet.
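The linear over-parameterization trick mentioned at the end is simple to state in code: a single linear layer is replaced by a product of linear layers with no nonlinearity in between, which leaves the representable function class unchanged. A minimal sketch, with illustrative layer sizes:

    import torch.nn as nn

    def overparameterize(in_dim, out_dim, hidden=1024, depth=2):
        layers = [nn.Linear(in_dim, hidden)]
        layers += [nn.Linear(hidden, hidden) for _ in range(depth - 2)]
        layers += [nn.Linear(hidden, out_dim)]
        return nn.Sequential(*layers)          # still a single linear map, just factored

    standard = nn.Linear(512, 10)              # baseline classifier head
    expanded = overparameterize(512, 10)       # trained in place of `standard`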

10 citations


Journal ArticleDOI
TL;DR: It is proved that the $P-\Theta$ power flow problem has at most one solution for any acyclic or GSP graph, and it is shown that multiple distinct solutions cannot exist under the assumption that angle differences across the lines are bounded by some limit related to the maximal girth of the network.
Abstract: This article establishes sufficient conditions for the uniqueness of AC power flow solutions via the monotonic relationship between real power flow and the phase angle difference. More specifically, we prove that the $P-\Theta$ power flow problem has at most one solution for any acyclic or GSP graph. In addition, for arbitrary cyclic power networks, we show that multiple distinct solutions cannot exist under the assumption that angle differences across the lines are bounded by some limit related to the maximal girth of the network. In these cases, a vector of voltage phase angles can be uniquely determined (up to an absolute phase shift) given a vector of real power injections within the realizable range. The implication of this result for the classical power flow analysis is that under the conditions specified above, the problem has a unique physically realizable solution if the phasor voltage magnitudes are fixed. We also introduce a series–parallel operator and show that this operator obtains a reduced and easier-to-analyze model for the power system without changing the uniqueness of power flow solutions.
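The monotonic relationship underlying these results can be sketched for the standard lossless line model (an illustrative simplification of the conditions actually used in the article): the real power flow across a line from bus i to bus j is

    P_{ij} = \frac{V_i V_j}{x_{ij}} \sin(\theta_i - \theta_j),

which is strictly increasing in the angle difference $\theta_i - \theta_j$ whenever $|\theta_i - \theta_j| < \pi/2$, so with fixed voltage magnitudes the line flow determines the angle difference within that range.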

Posted Content
TL;DR: In this article, it was shown that for nonconvex low-rank matrix recovery, the restricted isometry property (RIP) is sufficient for the absence of spurious local minima.
Abstract: We prove that it is possible for nonconvex low-rank matrix recovery to contain no spurious local minima when the rank of the unknown ground truth $r^{\star}

Posted ContentDOI
30 May 2021-bioRxiv
TL;DR: TreeVAE as discussed by the authors uses a variational autoencoder (VAE) to model the observed transcriptomic data while accounting for the phylogenetic relationships between cells, and it outperforms benchmarks in reconstructing ancestral states on several metrics.
Abstract: Novel experimental assays now simultaneously measure lineage relationships and transcriptomic states from single cells, thanks to CRISPR/Cas9-based genome engineering. These multimodal measurements allow researchers not only to build comprehensive phylogenetic models relating all cells but also to infer transcriptomic determinants of consequential subclonal behavior. The gene expression data, however, is limited to cells that are currently present ("leaves" of the phylogeny). As a consequence, researchers cannot form hypotheses about unobserved, or "ancestral", states that gave rise to the observed population. To address this, we introduce TreeVAE: a probabilistic framework for estimating ancestral transcriptional states. TreeVAE uses a variational autoencoder (VAE) to model the observed transcriptomic data while accounting for the phylogenetic relationships between cells. Using simulations, we demonstrate that TreeVAE outperforms benchmarks in reconstructing ancestral states on several metrics. TreeVAE also provides a measure of uncertainty, which we demonstrate to correlate well with its prediction accuracy. This estimate therefore potentially provides a data-driven way to estimate how far back in the ancestor chain predictions could be made. Finally, using real data from lung cancer metastasis, we show that accounting for the phylogenetic relationships between cells improves goodness of fit. Together, TreeVAE provides a principled framework for reconstructing unobserved cellular states from single-cell lineage tracing data.

Posted Content
TL;DR: In this paper, the authors propose a method for propagating coarse 2D user scribbles to the 3D space to modify the color or shape of a local region, which can be used for editing the appearance and shape of real photographs.
Abstract: A neural radiance field (NeRF) is a scene model supporting high-quality view synthesis, optimized per scene. In this paper, we explore enabling user editing of a category-level NeRF - also known as a conditional radiance field - trained on a shape category. Specifically, we introduce a method for propagating coarse 2D user scribbles to the 3D space, to modify the color or shape of a local region. First, we propose a conditional radiance field that incorporates new modular network components, including a shape branch that is shared across object instances. Observing multiple instances of the same category, our model learns underlying part semantics without any supervision, thereby allowing the propagation of coarse 2D user scribbles to the entire 3D region (e.g., chair seat). Next, we propose a hybrid network update strategy that targets specific network components, which balances efficiency and accuracy. During user interaction, we formulate an optimization problem that both satisfies the user's constraints and preserves the original object structure. We demonstrate our approach on various editing tasks over three shape datasets and show that it outperforms prior neural editing approaches. Finally, we edit the appearance and shape of a real photograph and show that the edit propagates to extrapolated novel views.

Posted Content
TL;DR: In this article, Anycost GAN, a generative adversarial network (GAN) supporting elastic resolutions and channels, is proposed for interactive natural image editing; deployed on desktop CPUs and edge devices, it provides perceptually similar previews at 6-12x speedup.
Abstract: Generative adversarial networks (GANs) have enabled photorealistic image synthesis and editing. However, due to the high computational cost of large-scale generators (e.g., StyleGAN2), it usually takes seconds to see the results of a single edit on edge devices, prohibiting interactive user experience. In this paper, we take inspiration from modern rendering software and propose Anycost GAN for interactive natural image editing. We train the Anycost GAN to support elastic resolutions and channels for faster image generation at versatile speeds. Running subsets of the full generator produces outputs that are perceptually similar to the full generator, making them a good proxy for preview. By using sampling-based multi-resolution training, adaptive-channel training, and a generator-conditioned discriminator, the anycost generator can be evaluated at various configurations while achieving better image quality compared to separately trained models. Furthermore, we develop new encoder training and latent code optimization techniques to encourage consistency between the different sub-generators during image projection. Anycost GAN can be executed at various cost budgets (up to 10x computation reduction) and adapt to a wide range of hardware and latency requirements. When deployed on desktop CPUs and edge devices, our model can provide perceptually similar previews at 6-12x speedup, enabling interactive image editing. The code and demo are publicly available: this https URL.

Posted ContentDOI
22 Nov 2021-bioRxiv
TL;DR: In this paper, two theoretically grounded algorithms for reconstructing the underlying phylogenetic tree are presented, along with asymptotic bounds on the number of recording sites necessary for exact recapitulation of the ground-truth phylogeny with high probability.
Abstract: CRISPR-Cas9 lineage tracing technologies have emerged as a powerful tool for investigating development in single-cell contexts, but exact reconstruction of the underlying clonal relationships in experiments is plagued by data-related complications. These complications are functions of the experimental parameters in these systems, such as the Cas9 cutting rate, the diversity of indel outcomes, and the rate of missing data. In this paper, we develop two theoretically grounded algorithms for reconstruction of the underlying phylogenetic tree, as well as asymptotic bounds for the number of recording sites necessary for exact recapitulation of the ground truth phylogeny with high probability. In doing so, we explore the relationship between the problem difficulty and the experimental parameters, with implications for experimental design. Lastly, we provide simulations validating these bounds and showing the empirical performance of these algorithms. Overall, this work provides a first theoretical analysis of phylogenetic reconstruction in the CRISPR-Cas9 lineage tracing technology.

Posted Content
TL;DR: In this article, the authors use StyleGAN2 as the source of GAN-based augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars, and find that while test-time ensembling with GAN-based augmentations can offer some small improvements, the remaining bottlenecks are the efficiency and accuracy of the GAN reconstructions, coupled with classifier sensitivities to artifacts in GAN-generated images.
Abstract: Recent generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose, simply by learning from unlabeled image collections. Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification. Using a pretrained generator, we first find the latent code corresponding to a given real input image. Applying perturbations to the code creates natural variations of the image, which can then be ensembled together at test-time. We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars. Critically, we find that several design decisions are required towards making this process work; the perturbation procedure, weighting between the augmentations and original image, and training the classifier on synthesized images can all impact the result. Currently, we find that while test-time ensembling with GAN-based augmentations can offer some small improvements, the remaining bottlenecks are the efficiency and accuracy of the GAN reconstructions, coupled with classifier sensitivities to artifacts in GAN-generated images.

Proceedings Article
13 May 2021
TL;DR: In this article, the authors propose a method for propagating coarse 2D user scribbles to the 3D space to modify the color or shape of a local region, which can be used for editing the appearance and shape of real photographs.
Abstract: A neural radiance field (NeRF) is a scene model supporting high-quality view synthesis, optimized per scene. In this paper, we explore enabling user editing of a category-level NeRF - also known as a conditional radiance field - trained on a shape category. Specifically, we introduce a method for propagating coarse 2D user scribbles to the 3D space, to modify the color or shape of a local region. First, we propose a conditional radiance field that incorporates new modular network components, including a shape branch that is shared across object instances. Observing multiple instances of the same category, our model learns underlying part semantics without any supervision, thereby allowing the propagation of coarse 2D user scribbles to the entire 3D region (e.g., chair seat). Next, we propose a hybrid network update strategy that targets specific network components, which balances efficiency and accuracy. During user interaction, we formulate an optimization problem that both satisfies the user's constraints and preserves the original object structure. We demonstrate our approach on various editing tasks over three shape datasets and show that it outperforms prior neural editing approaches. Finally, we edit the appearance and shape of a real photograph and show that the edit propagates to extrapolated novel views.

Posted Content
TL;DR: In this article, a novel cross-domain distance consistency loss is proposed to preserve the relative similarities and differences between instances in the source domain, along with an anchor-based strategy to encourage different levels of realism over different regions in the latent space.
Abstract: Training generative models, such as GANs, on a target domain containing limited examples (e.g., 10) can easily result in overfitting. In this work, we seek to utilize a large source domain for pretraining and transfer the diversity information from source to target. We propose to preserve the relative similarities and differences between instances in the source via a novel cross-domain distance consistency loss. To further reduce overfitting, we present an anchor-based strategy to encourage different levels of realism over different regions in the latent space. With extensive results in both photorealistic and non-photorealistic domains, we demonstrate qualitatively and quantitatively that our few-shot model automatically discovers correspondences between source and target domains and generates more diverse and realistic images than previous methods.

Patent
04 May 2021
TL;DR: In this paper, an edge prediction neural network and an edge-guided colorization neural network are used to transform grayscale digital images into colorized digital images; the predicted color edge map can be presented to a user via a colorization graphical user interface, which receives color points and color edge modifications.
Abstract: Methods, systems, and non-transitory computer readable storage media are disclosed for utilizing an edge prediction neural network and edge-guided colorization neural network to transform grayscale digital images into colorized digital images. In one or more embodiments, the disclosed systems apply a color edge prediction neural network to a grayscale image to generate a color edge map indicating predicted chrominance edges. The disclosed systems can present the color edge map to a user via a colorization graphical user interface and receive user color points and color edge modifications. The disclosed systems can apply a second neural network, an edge-guided colorization neural network, to the color edge map or a modified edge map, user color points, and the grayscale image to generate an edge-constrained colorized digital image.

Posted Content
TL;DR: In this article, the authors introduce an information-theoretic approach to measuring similarity between two images, which enables learning a lightweight critic to calibrate a feature space in a contrastive manner, such that reconstructions of corresponding spatial patches are brought together, while other patches are repulsed.
Abstract: Training supervised image synthesis models requires a critic to compare two images: the ground truth to the result. Yet, this basic functionality remains an open problem. A popular line of approaches uses the L1 (mean absolute error) loss, either in the pixel or the feature space of pretrained deep networks. However, we observe that these losses tend to produce overly blurry and grey images, and other techniques such as GANs need to be employed to fight these artifacts. In this work, we introduce an information theory based approach to measuring similarity between two images. We argue that a good reconstruction should have high mutual information with the ground truth. This view enables learning a lightweight critic to "calibrate" a feature space in a contrastive manner, such that reconstructions of corresponding spatial patches are brought together, while other patches are repulsed. We show that our formulation immediately boosts the perceptual realism of output images when used as a drop-in replacement for the L1 loss, with or without an additional GAN loss.
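A minimal sketch of the contrastive patch objective described above (the learned lightweight critic and its feature space are omitted; this is an InfoNCE-style illustration, not the paper's exact loss):

    import torch
    import torch.nn.functional as F

    def contrastive_patch_loss(feat_rec, feat_gt, temperature=0.07):
        # feat_rec, feat_gt: (N, D) features of N corresponding spatial patches
        # from the reconstruction and the ground truth, respectively.
        feat_rec = F.normalize(feat_rec, dim=1)
        feat_gt = F.normalize(feat_gt, dim=1)
        logits = feat_rec @ feat_gt.t() / temperature        # (N, N) similarities
        targets = torch.arange(len(feat_rec))                # patch i matches patch i
        return F.cross_entropy(logits, targets)              # pull matches together, push others apart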