Author

Shunsuke Saito

Other affiliations: Facebook, Max Planck Society, Institute for Creative Technologies ...

Bio: Shunsuke Saito is an academic researcher from the University of Southern California. The author has contributed to research on topics including Rendering (computer graphics) and Computer science. The author has an h-index of 23 and has co-authored 52 publications receiving 2422 citations. Previous affiliations of Shunsuke Saito include Facebook and Max Planck Society.

Papers published on a yearly basis (chart omitted; years shown: 2014 to 2023)

Papers

Proceedings Article

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

Shunsuke Saito¹, Zeng Huang¹, Ryota Natsume², Shigeo Morishima², Hao Li¹, Angjoo Kanazawa³

University of Southern California¹, Waseda University², University of California, Berkeley³

13 May 2019

TL;DR: Pixel-aligned Implicit Function (PIFu) aligns pixels of 2D images with the global context of their corresponding 3D object to produce high-resolution surfaces, including largely unseen regions such as the back of a person.

Abstract: We introduce Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu produces high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to an arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real-world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

907 citations

Proceedings Article

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization

Shunsuke Saito¹, Tomas Simon², Jason Saragih², Hanbyul Joo²

University of Southern California¹, Facebook²

14 Jun 2020

TL;DR: This paper proposes a multi-level architecture for estimating high-resolution human shape from images, in which a coarse level observes the whole image at lower resolution and focuses on holistic reasoning, and a fine level estimates highly detailed geometry by observing higher-resolution images.

Abstract: Recent advances in image-based 3D human shape estimation have been driven by the significant improvement in representation power afforded by deep neural networks. Although current approaches have demonstrated the potential in real-world settings, they still fail to produce reconstructions with the level of detail often present in the input images. We argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large context, but precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low-resolution images as input to cover large spatial context, and produce less precise (or low-resolution) 3D estimates as a result. We address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at lower resolution and focuses on holistic reasoning. This provides context to a fine level, which estimates highly detailed geometry by observing higher-resolution images. We demonstrate that our approach significantly outperforms existing state-of-the-art techniques on single-image human shape reconstruction by fully leveraging 1k-resolution input images.

483 citations

Posted Content

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, Hao Li

13 May 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: The proposed Pixel-aligned Implicit Function (PIFu), an implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object, achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

Abstract: We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles, clothing, as well as their variations and deformations can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu can produce high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to an arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real-world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.

227 citations

Proceedings Article

SiCloPe: Silhouette-Based Clothed People

Ryota Natsume¹, Shunsuke Saito², Zeng Huang², Weikai Chen³, Chongyang Ma, Hao Li¹, Shigeo Morishima²

Waseda University¹, University of Southern California², Institute for Creative Technologies³

15 Jun 2019

TL;DR: This work introduces a new silhouette-based representation for modeling clothed human bodies using deep generative models that can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture.

Abstract: We introduce a new silhouette-based representation for modeling clothed human bodies using deep generative models. Our method can reconstruct a complete and textured 3D model of a person wearing clothes from a single input picture. Inspired by the visual hull algorithm, our implicit representation uses 2D silhouettes and 3D joints of a body pose to describe the immense shape complexity and variations of clothed people. Given a segmented 2D silhouette of a person and its inferred 3D joints from the input picture, we first synthesize consistent silhouettes from novel view points around the subject. The synthesized silhouettes which are the most consistent with the input segmentation are fed into a deep visual hull algorithm for robust 3D shape prediction. We then infer the texture of the subject's back view using the frontal image and segmentation mask as input to a conditional generative adversarial network. Our experiments demonstrate that our silhouette-based model is an effective representation and the appearance of the back view can be predicted reliably using an image-to-image translation network. While classic methods based on parametric models often fail for single-view images of subjects with challenging clothing, our approach can still produce successful results, which are comparable to those obtained from multi-view input.

190 citations

Journal Article

paGAN: real-time avatars using dynamic textures

Koki Nagano¹, Jaewoo Seo, Jun Xing¹, Lingyu Wei, Zimo Li², Shunsuke Saito², Aviral Agarwal, Jens Fursund, Hao Li¹

Institute for Creative Technologies¹, University of Southern California²

04 Dec 2018-ACM Transactions on Graphics

TL;DR: This work produces state-of-the-art quality image and video synthesis and is, to the authors' knowledge, the first able to generate a dynamically textured avatar with a mouth interior, all from a single image.

Abstract: With the rising interest in personalized VR and gaming experiences comes the need to create high quality 3D avatars that are both low-cost and variegated. Due to this, building dynamic avatars from a single unconstrained input image is becoming a popular application. While previous techniques that attempt this require multiple input images or rely on transferring dynamic facial appearance from a source actor, we are able to do so using only one 2D input image without any form of transfer from a source image. We achieve this using a new conditional Generative Adversarial Network design that allows fine-scale manipulation of any facial input image into a new expression while preserving its identity. Our photoreal avatar GAN (paGAN) can also synthesize the unseen mouth interior and control the eye-gaze direction of the output, as well as produce the final image from a novel viewpoint. The method is even capable of generating fully-controllable temporally stable video sequences, despite not using temporal information during training. After training, we can use our network to produce dynamic image-based avatars that are controllable on mobile devices in real time. To do this, we compute a fixed set of output images that correspond to key blendshapes, from which we extract textures in UV space. Using a subject's expression blendshapes at run-time, we can linearly blend these key textures together to achieve the desired appearance. Furthermore, we can use the mouth interior and eye textures produced by our network to synthesize on-the-fly avatar animations for those regions. Our work produces state-of-the-art quality image and video synthesis, and is the first to our knowledge that is able to generate a dynamically textured avatar with a mouth interior, all from a single image.

184 citations

Cited by

Journal Article

Study report on “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”

杉山拓海

12 Sep 2017-Computers & Graphics

3,940 citations

Proceedings Article

A morphable model for the synthesis of 3D faces

Matthew Turk

01 Jan 1999

2,010 citations

Proceedings Article

Implicit Neural Representations with Periodic Activation Functions

Vincent Sitzmann¹, Julien N. P. Martel¹, Alexander W. Bergman¹, David B. Lindell¹, Gordon Wetzstein¹

Stanford University¹

17 Jun 2020

TL;DR: In this paper, the authors propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives.

Abstract: Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal's spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.

1,058 citations

Proceedings Article

Face2Face: Real-Time Face Capture and Reenactment of RGB Videos

Justus Thies¹, Michael Zollhöfer², Marc Stamminger¹, Christian Theobalt², Matthias Nießner³

University of Erlangen-Nuremberg¹, Max Planck Society², Stanford University³

27 Jun 2016

TL;DR: A novel approach for real-time facial reenactment of a monocular target video sequence (e.g., a YouTube video) that addresses the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling and re-renders the manipulated output video in a photo-realistic fashion.

Abstract: We present a novel approach for real-time facial reenactment of a monocular target video sequence (e.g., Youtube video). The source sequence is also a monocular video stream, captured live with a commodity webcam. Our goal is to animate the facial expressions of the target video by a source actor and re-render the manipulated output video in a photo-realistic fashion. To this end, we first address the under-constrained problem of facial identity recovery from monocular video by non-rigid model-based bundling. At run time, we track facial expressions of both source and target video using a dense photometric consistency measure. Reenactment is then achieved by fast and efficient deformation transfer between source and target. The mouth interior that best matches the re-targeted expression is retrieved from the target sequence and warped to produce an accurate fit. Finally, we convincingly re-render the synthesized target face on top of the corresponding video stream such that it seamlessly blends with the real-world illumination. We demonstrate our method in a live setup, where Youtube videos are reenacted in real time.

1,011 citations

Proceedings Article

PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization

Shunsuke Saito¹, Zeng Huang¹, Ryota Natsume², Shigeo Morishima², Hao Li¹, Angjoo Kanazawa³

University of Southern California¹, Waseda University², University of California, Berkeley³

13 May 2019

907 citations
