Author

Ron Mokady

Bio: Ron Mokady is an academic researcher from Tel Aviv University. The author has contributed to research in the topics of computer science and artificial intelligence, has an h-index of 3, and has co-authored 8 publications receiving 35 citations.

Papers
Proceedings ArticleDOI
02 Aug 2022
TL;DR: This paper examines a text-conditioned model in depth and observes that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt, and presents several applications that control the image synthesis by editing the textual prompt only.
Abstract: Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describing their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which control the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
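The cross-attention control described in the abstract can be illustrated with a minimal sketch: attention maps recorded while generating from the source prompt are injected when generating from the edited prompt, so the spatial layout is preserved. The function and tensor names below are illustrative placeholders, not the paper's code.

```python
# Minimal sketch of cross-attention map injection (illustrative, not the paper's API).
import torch
import torch.nn.functional as F

def cross_attention(q, k, v, injected_probs=None):
    """Standard cross-attention; optionally reuse attention maps recorded
    from the source-prompt pass instead of the freshly computed ones."""
    scale = q.shape[-1] ** -0.5
    probs = F.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
    if injected_probs is not None:
        # Prompt-to-prompt style control: keep the source layout by reusing the
        # source maps (in practice only for tokens shared by both prompts).
        probs = injected_probs
    return probs @ v, probs

# Toy shapes: 64 spatial tokens (8x8 latent), 77 text tokens, feature dim 40.
q = torch.randn(1, 64, 40)                                      # image queries
k_src, v_src = torch.randn(1, 77, 40), torch.randn(1, 77, 40)   # source prompt
k_edit, v_edit = torch.randn(1, 77, 40), torch.randn(1, 77, 40) # edited prompt

_, probs_src = cross_attention(q, k_src, v_src)                 # source pass: record maps
out_edit, _ = cross_attention(q, k_edit, v_edit, injected_probs=probs_src)
```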

262 citations

Journal ArticleDOI
TL;DR: This paper proposes NULL-text inversion, based on the publicly available Stable Diffusion model, for accurately inverting real images into a text-guided diffusion model; it enables prompt-based editing while avoiding the cumbersome tuning of the model's weights.
Abstract: Recent text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is being devoted to enabling the modification of these images using text only, as a means to offer intuitive and versatile editing. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We demonstrate that a direct inversion is inadequate on its own, but does provide a good anchor for our optimization. (ii) NULL-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model's weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt edits, showing high-fidelity editing of real images.
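A schematic sketch of the two components described above: a pivotal DDIM-inversion trajectory serves as the anchor, and only the unconditional ("null") embedding used for classifier-free guidance is optimized to follow it. The helpers eps_model and ddim_step are toy stand-ins for a real diffusion model, not the authors' implementation.

```python
# Schematic null-text optimization around a pivotal inversion trajectory (toy stand-ins).
import torch

def eps_model(z, t, text_emb):     # toy noise predictor (placeholder for a real UNet)
    return z * 0.0 + text_emb.mean()

def ddim_step(z, eps, t):          # toy deterministic DDIM update (placeholder)
    return z - 0.1 * eps

def null_text_optimize(z_pivots, cond_emb, guidance=7.5, steps=10, lr=1e-2):
    """For each timestep, tune only the unconditional ("null") embedding so the
    classifier-free-guided step lands on the pivotal inversion trajectory."""
    null_emb = torch.zeros_like(cond_emb, requires_grad=True)
    nulls = []
    for t in range(len(z_pivots) - 1, 0, -1):
        opt = torch.optim.Adam([null_emb], lr=lr)
        for _ in range(steps):
            eps_uncond = eps_model(z_pivots[t], t, null_emb)
            eps_cond = eps_model(z_pivots[t], t, cond_emb)
            eps = eps_uncond + guidance * (eps_cond - eps_uncond)
            loss = ((ddim_step(z_pivots[t], eps, t) - z_pivots[t - 1]) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        nulls.append(null_emb.detach().clone())
    return nulls

# Toy usage: 11 pivot latents from an inversion pass, one conditional embedding.
z_pivots = [torch.randn(1, 4, 8, 8) for _ in range(11)]
cond_emb = torch.randn(1, 77, 768)
null_embs = null_text_optimize(z_pivots, cond_emb)
```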

104 citations

Proceedings ArticleDOI
20 Jan 2022
TL;DR: This work leverages the natural alignment of StyleGAN and the tendency of neural networks to learn low-frequency functions, and shows that they provide a strongly consistent prior for semantic editing of faces in videos, yielding significant improvements over the current state-of-the-art.
Abstract: The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Applying StyleGAN editing to real videos introduces two main challenges: (i) StyleGAN operates over aligned crops. When editing videos, these crops need to be pasted back into the frame, resulting in a spatial inconsistency. (ii) Videos introduce a fundamental barrier to overcome — temporal coherency. To address the first challenge, we propose a novel stitching-tuning procedure. The generator is carefully tuned to overcome the spatial artifacts at crop borders, resulting in smooth transitions even when difficult backgrounds are involved. Turning to temporal coherence, we propose that this challenge is largely artificial. The source video is already temporally coherent, and deviations arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low-frequency functions, and demonstrate that they provide a strongly consistent prior. These components are combined in an end-to-end framework for semantic editing of facial videos. We compare our pipeline to the current state-of-the-art and demonstrate significant improvements. Our method produces meaningful manipulations and maintains greater spatial and temporal consistency, even on challenging talking head videos which current methods struggle with. Our code and videos are available at https://stitch-time.github.io/.
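One simple way to obtain the kind of temporal consistency discussed above is to low-pass filter per-frame alignment parameters before inversion and editing. The sketch below is illustrative only: it assumes Gaussian smoothing of crop or landmark trajectories and does not reproduce the paper's exact pipeline.

```python
# Illustrative low-pass smoothing of per-frame crop/alignment parameters
# so StyleGAN crops vary smoothly across a video (not the authors' code).
import numpy as np

def gaussian_smooth(params, sigma=3.0):
    """params: (num_frames, d) array of per-frame alignment values
    (e.g., landmark or crop-box coordinates). Returns a smoothed copy."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = np.empty_like(params, dtype=np.float64)
    for j in range(params.shape[1]):
        # Pad with edge values so the ends of the clip are not pulled toward zero.
        padded = np.pad(params[:, j], radius, mode="edge")
        out[:, j] = np.convolve(padded, kernel, mode="valid")
    return out

# Example: smooth noisy 2D crop centers over 120 frames.
centers = np.cumsum(np.random.randn(120, 2), axis=0)
smooth_centers = gaussian_smooth(centers, sigma=5.0)
```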

47 citations

Journal ArticleDOI
TL;DR: This state‐of‐the‐art report covers the StyleGAN architecture, and the ways it has been employed since its conception, while also analyzing its severe limitations.
Abstract: Generative Adversarial Networks (GANs) have established themselves as a prevalent approach to image synthesis. Of these, StyleGAN offers a fascinating case study, owing to its remarkable visual quality and an ability to support a large array of downstream tasks. This state‐of‐the‐art report covers the StyleGAN architecture, and the ways it has been employed since its conception, while also analyzing its severe limitations. It aims to be of use for both newcomers, who wish to get a grasp of the field, and for more experienced readers that might benefit from seeing current research trends and existing tools laid out. Among StyleGAN's most interesting aspects is its learned latent space. Despite being learned with no supervision, it is surprisingly well‐behaved and remarkably disentangled. Combined with StyleGAN's visual quality, these properties gave rise to unparalleled editing capabilities. However, the control offered by StyleGAN is inherently limited to the generator's learned distribution, and can only be applied to images generated by StyleGAN itself. Seeking to bring StyleGAN's latent control to real‐world scenarios, the study of GAN inversion and latent space embedding has quickly gained in popularity. Meanwhile, this same study has helped shed light on the inner workings and limitations of StyleGAN. We map out StyleGAN's impressive story through these investigations, and discuss the details that have made StyleGAN the go‐to generator. We further elaborate on the visual priors StyleGAN constructs, and discuss their use in downstream discriminative tasks. Looking forward, we point out StyleGAN's limitations and speculate on current trends and promising directions for future research, such as task and target specific fine‐tuning.
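As a concrete illustration of the latent-space editing the report surveys, the sketch below shifts a W-space code along a semantic direction. The generator and the direction vector are placeholders, not any specific released model or published direction.

```python
# Minimal sketch of StyleGAN-style latent editing (placeholders, not a specific release).
import torch

def edit_latent(w, direction, alpha):
    """Shift a (1, 512) W-space code along a unit-norm semantic direction."""
    return w + alpha * direction / direction.norm()

w = torch.randn(1, 512)               # inverted or sampled latent code
age_direction = torch.randn(1, 512)   # stand-in for a discovered semantic direction
w_older = edit_latent(w, age_direction, alpha=3.0)
# With a real StyleGAN2 generator: image = G.synthesis(w_older.repeat(1, 18, 1))
```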

43 citations

Posted Content
TL;DR: Pivotal Tuning Inversion (PTI) is an approach to bridging the tradeoff between distortion and editability in StyleGAN's latent space: a brief training process preserves the editing quality of an in-domain latent region while changing its portrayed identity and appearance.
Abstract: Recently, a surge of advanced facial editing techniques has been proposed that leverage the generative power of a pre-trained StyleGAN. To successfully edit an image this way, one must first project (or invert) the image into the pre-trained generator's domain. As it turns out, however, StyleGAN's latent space induces an inherent tradeoff between distortion and editability, i.e. between maintaining the original appearance and convincingly altering some of its attributes. Practically, this means it is still challenging to apply ID-preserving facial latent-space editing to faces which are out of the generator's domain. In this paper, we present an approach to bridge this gap. Our technique slightly alters the generator, so that an out-of-domain image is faithfully mapped into an in-domain latent code. The key idea is pivotal tuning - a brief training process that preserves the editing quality of an in-domain latent region, while changing its portrayed identity and appearance. In Pivotal Tuning Inversion (PTI), an initial inverted latent code serves as a pivot, around which the generator is fine-tuned. At the same time, a regularization term keeps nearby identities intact, to locally contain the effect. This surgical training process ends up altering appearance features that represent mostly identity, without affecting editing capabilities. We validate our technique through inversion and editing metrics, and show preferable scores to state-of-the-art methods. We further qualitatively demonstrate our technique by applying advanced edits (such as pose, age, or expression) to numerous images of well-known and recognizable identities. Finally, we demonstrate resilience to harder cases, including heavy make-up, elaborate hairstyles and/or headwear, which otherwise could not have been successfully inverted and edited by state-of-the-art methods.
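The pivotal tuning procedure described above can be sketched as follows: the pivot latent stays fixed while the generator is briefly fine-tuned to reconstruct the target image, with a locality term keeping nearby latents close to the original generator's output. A toy linear generator stands in for StyleGAN; this is a sketch under those assumptions, not the official PTI code.

```python
# Schematic pivotal tuning loop with a toy stand-in generator (not the PTI release).
import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(512, 3 * 32 * 32))   # toy stand-in for StyleGAN
G_orig = copy.deepcopy(G).eval()                 # frozen copy of the original weights
w_pivot = torch.randn(1, 512)                    # inverted latent code (the pivot)
target = torch.randn(1, 3 * 32 * 32)             # flattened real image to reproduce

opt = torch.optim.Adam(G.parameters(), lr=3e-4)
for step in range(200):
    # Reconstruction around the frozen pivot (an LPIPS term is added in practice).
    loss = ((G(w_pivot) - target) ** 2).mean()
    # Locality regularization: latents near the pivot should still map close to
    # the original generator's output, containing the tuning's side effects.
    w_near = w_pivot + 0.1 * torch.randn_like(w_pivot)
    loss = loss + 0.1 * ((G(w_near) - G_orig(w_near).detach()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```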

33 citations


Cited by
Journal ArticleDOI
TL;DR: A new approach for “personalization” of text-to-image diffusion models (specializing them to users’ needs); leveraging the semantic prior embedded in the model with a new autogenous class-specific prior-preservation loss enables synthesizing the subject in diverse scenes, poses, views, and lighting conditions that do not appear in the reference images.
Abstract: Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/
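The prior-preservation idea described above can be sketched as a two-term objective: a denoising loss on the subject images (prompted with the unique identifier) plus a weighted denoising loss on class-prior images. The unet lambda, the toy noising step, and the tensors below are illustrative placeholders, not the DreamBooth reference implementation.

```python
# Schematic prior-preservation objective (toy stand-ins, not the DreamBooth code).
import torch

def diffusion_loss(unet, latents, text_emb, t):
    """Toy denoising objective: predict the noise added to the latents."""
    noise = torch.randn_like(latents)
    noisy = latents + 0.5 * noise            # stand-in forward-noising step
    return ((unet(noisy, t, text_emb) - noise) ** 2).mean()

def dreambooth_step(unet, subj_latents, subj_emb, prior_latents, prior_emb,
                    t, prior_weight=1.0):
    """Subject reconstruction ("a [V] dog") plus the class-prior
    preservation term ("a dog") that counters overfitting and drift."""
    loss_subject = diffusion_loss(unet, subj_latents, subj_emb, t)
    loss_prior = diffusion_loss(unet, prior_latents, prior_emb, t)
    return loss_subject + prior_weight * loss_prior

# Toy usage with a stand-in noise predictor.
unet = lambda z, t, emb: z * 0.0 + emb.mean()
z_subj, z_prior = torch.randn(4, 16), torch.randn(4, 16)
e_subj, e_prior = torch.randn(4, 8), torch.randn(4, 8)
loss = dreambooth_step(unet, z_subj, e_subj, z_prior, e_prior, t=10)
```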

381 citations

Proceedings ArticleDOI
02 Aug 2022
TL;DR: This paper examines a text-conditioned model in depth and observes that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt, and presents several applications that control the image synthesis by editing the textual prompt only.
Abstract: Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capabilities of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans who are used to verbally describing their intent. Therefore, it is only natural to extend the text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in the text-based models, even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the users to provide a spatial mask to localize the edit, hence ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications which control the image synthesis by editing the textual prompt only. This includes localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.

262 citations

Journal ArticleDOI
TL;DR: ControlNet learns task-specific conditions in an end-to-end way; the learning is robust even when the training dataset is small (<50k), and the model can be trained on a personal device.
Abstract: We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (<50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal device. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.
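For context, ControlNet conditioning is exposed in the Hugging Face diffusers library; a minimal usage sketch follows, assuming a precomputed Canny edge map saved to a placeholder path and commonly published model IDs (adjust paths, IDs, and the CUDA device to your setup).

```python
# Minimal diffusers-based ControlNet usage sketch (model IDs and paths are assumptions).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Precomputed Canny edge map of the guidance image (placeholder path).
canny_edge_map = Image.open("edges.png")

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

image = pipe("a futuristic living room", image=canny_edge_map,
             num_inference_steps=30).images[0]
image.save("controlled.png")
```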

256 citations

Journal ArticleDOI
TL;DR: This paper demonstrates, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image using a pretrained text-to-image diffusion model.
Abstract: Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc. -- each within its single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.
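The procedure described above (optimize a text embedding to fit the image, fine-tune the model, then interpolate toward the target-text embedding) can be summarized in a schematic sketch. ToyDiffusion, denoising_loss, and generate are assumed placeholder hooks, not the paper's API.

```python
# Schematic sketch of the three-stage editing procedure (placeholder hooks only).
import torch

class ToyDiffusion:
    """Stand-in exposing the two hooks the sketch needs (not a real API)."""
    def denoising_loss(self, latents, emb):
        return ((latents - emb.mean()) ** 2).mean()
    def generate(self, emb):
        return emb  # placeholder for sampling an image from the embedding

def imagic_style_edit(model, image_latents, target_emb, eta=0.7, opt_steps=100):
    # (A) Optimize a text embedding so the (frozen) model reconstructs the image.
    e_opt = target_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([e_opt], lr=1e-3)
    for _ in range(opt_steps):
        loss = model.denoising_loss(image_latents, e_opt)
        opt.zero_grad(); loss.backward(); opt.step()
    # (B) Fine-tune the model itself on the same objective (omitted in this toy).
    # (C) Interpolate toward the target-text embedding and generate the edit.
    e_mix = eta * target_emb + (1.0 - eta) * e_opt.detach()
    return model.generate(e_mix)

edited = imagic_style_edit(ToyDiffusion(), torch.randn(1, 16), torch.randn(1, 16))
```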

201 citations