Author

Alexandre Sablayrolles

Bio: Alexandre Sablayrolles is an academic researcher from Facebook. The author has contributed to research on the topics of nearest neighbor search and computer science. The author has an h-index of 12 and has co-authored 25 publications receiving 1014 citations.

Papers

Posted Content•

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

23 Dec 2020-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper trains a competitive convolution-free transformer on ImageNet only and introduces a transformer-specific teacher-student strategy based on a distillation token.

Abstract: Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
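One way to picture the token-based distillation described in the abstract is the sketch below. It assumes hard teacher labels (the teacher's argmax) and an equal 0.5/0.5 weighting between the two terms; neither choice is stated in the abstract, and all function names are hypothetical. The class-token output is trained against the true label, while the distillation-token output is trained against the teacher's prediction.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy of one sample against an integer class index."""
    return -log_softmax(logits)[target]


def distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Average of two cross-entropies (assumed weighting, see lead-in).

    cls_logits:     student output read from the class token
    dist_logits:    student output read from the distillation token
    teacher_logits: output of the (frozen) teacher, e.g. a convnet
    label:          ground-truth class index
    """
    # Hard distillation: the teacher's most likely class is the target.
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    loss_label = cross_entropy(cls_logits, label)
    loss_teacher = cross_entropy(dist_logits, teacher_label)
    return 0.5 * loss_label + 0.5 * loss_teacher
```

The loss is small only when the class-token head agrees with the label and the distillation-token head agrees with the teacher, which is how the student can learn from both sources at once.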

349 citations

Proceedings Article•

Training data-efficient image transformers & distillation through attention

Hugo Touvron¹, Matthieu Cord², Matthijs Douze¹, Francisco Massa¹, Alexandre Sablayrolles¹, Hervé Jégou¹

Facebook¹, University of Paris²

18 Jul 2021

143 citations

Posted Content•

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference.

Alexandre Sablayrolles¹, Matthijs Douze¹, Yann Ollivier¹, Cordelia Schmid, Hervé Jégou¹

Facebook¹

29 Aug 2019-arXiv: Machine Learning

TL;DR: This paper derives the optimal strategy for membership inference with a few assumptions on the distribution of the parameters, and shows that optimal attacks only depend on the loss function, and thus black-box attacks are as good as white-box attacks.

Abstract: Membership inference determines, given a sample and trained parameters of a machine learning model, whether the sample was part of the training set. In this paper, we derive the optimal strategy for membership inference with a few assumptions on the distribution of the parameters. We show that optimal attacks only depend on the loss function, and thus black-box attacks are as good as white-box attacks. As the optimal strategy is not tractable, we provide approximations of it leading to several inference methods, and show that existing membership inference methods are coarser approximations of this optimal strategy. Our membership attacks outperform the state of the art in various settings, ranging from a simple logistic regression to more complex architectures and datasets, such as ResNet-101 and Imagenet.
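As an illustration of an attack that depends only on the loss, the sketch below applies a simple loss-threshold rule: a sample is guessed to be a training member when the model's loss on it falls below a threshold. This is a simplified stand-in, not the optimal strategy derived in the paper; the threshold value and the function names are assumptions made for the example.

```python
import math


def cross_entropy(probs, label):
    """Loss of the model on one sample, given its predicted class probabilities."""
    return -math.log(max(probs[label], 1e-12))


def predict_member(probs, label, threshold):
    """Guess 'member' when the loss is below the threshold.

    Only the model's output probabilities are needed, so the rule works
    in a black-box setting. The threshold is a free parameter that would
    have to be calibrated, for example on samples known to be held out.
    """
    return cross_entropy(probs, label) < threshold
```

For example, `predict_member([0.05, 0.9, 0.05], 1, 0.5)` flags a confidently classified sample as a likely member, while a sample with probability 0.3 on its true class is not flagged at the same threshold.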

141 citations

Proceedings Article•

Large Memory Layers with Product Keys

Guillaume Lample¹, Alexandre Sablayrolles¹, Marc'Aurelio Ranzato¹, Ludovic Denoyer¹, Hervé Jégou²

Facebook¹, French Institute for Research in Computer Science and Automation²

10 Jul 2019

TL;DR: This paper introduces a structured memory that can be easily integrated into a neural network and significantly increases the capacity of the architecture, by up to a billion parameters, with negligible computational overhead.

Abstract: This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible computational overhead. Its design and access pattern is based on product keys, which enable fast and exact nearest neighbor search. The ability to increase the number of parameters while keeping the same computational budget lets the overall system strike a better trade-off between prediction accuracy and computation efficiency both at training and test time. This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer in a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice faster at inference time. We release our code for reproducibility purposes.
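The claim that product keys enable fast and exact nearest neighbor search can be illustrated with a small sketch. It assumes dot-product scoring, a query split into two halves, and two sets of sub-keys whose Cartesian product forms the full key set; the function names are hypothetical and this is not the released implementation. Because the score of a full key (i, j) is the sum of two independent sub-scores, the overall top-k must lie within the k best sub-keys of each half, so only k × k candidates need to be scored instead of all pairs.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def product_key_topk(query, subkeys1, subkeys2, k):
    """Top-k (score, (i, j)) over the product key set, via the sub-key shortcut."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    # Score each query half against its own, much smaller, sub-key set.
    top1 = heapq.nlargest(k, ((dot(q1, c), i) for i, c in enumerate(subkeys1)))
    top2 = heapq.nlargest(k, ((dot(q2, c), j) for j, c in enumerate(subkeys2)))
    # Only the k x k grid of surviving sub-keys can contain the true top-k.
    candidates = [(a + b, (i, j)) for a, i in top1 for b, j in top2]
    return heapq.nlargest(k, candidates)


def brute_force_topk(query, subkeys1, subkeys2, k):
    """Reference search that scores every (i, j) pair explicitly."""
    half = len(query) // 2
    q1, q2 = query[:half], query[half:]
    scores = [
        (dot(q1, c1) + dot(q2, c2), (i, j))
        for i, c1 in enumerate(subkeys1)
        for j, c2 in enumerate(subkeys2)
    ]
    return heapq.nlargest(k, scores)
```

With n sub-keys per half, the shortcut scores 2n sub-keys plus k² candidates, while the reference search scores all n² full keys; both return the same result when scores are distinct.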

95 citations

Posted Content•

Going deeper with Image Transformers

Hugo Touvron¹, Matthieu Cord¹, Alexandre Sablayrolles², Gabriel Synnaeve², Hervé Jégou²

University of Paris¹, Facebook²

31 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper investigates the interplay of architecture and optimization in deep transformers for image classification, and its best model establishes a new state of the art on Imagenet with Reassessed labels and Imagenet-V2 / match frequency, in the setting with no additional training data.

Abstract: Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and optimization of such dedicated transformers. We make two transformers architecture changes that significantly improve the accuracy of deep transformers. This leads us to produce models whose performance does not saturate early with more depth, for instance we obtain 86.5% top-1 accuracy on Imagenet when training with no external data, we thus attain the current SOTA with less FLOPs and parameters. Moreover, our best model establishes the new state of the art on Imagenet with Reassessed labels and Imagenet-V2 / match frequency, in the setting with no additional training data. We share our code and models.

80 citations

Cited by

Posted Content•

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows, to address differences between language and vision such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.

3,518 citations

Proceedings Article•DOI•

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Sixiao Zheng¹, Jiachen Lu¹, Hengshuang Zhao², Xiatian Zhu³, Zekun Luo⁴, Yabiao Wang⁴, Yanwei Fu¹, Jianfeng Feng¹, Tao Xiang³, Philip H. S. Torr², Li Zhang¹

Fudan University¹, University of Oxford², University of Surrey³, Tencent⁴

20 Jun 2021

TL;DR: This paper treats semantic segmentation as a sequence-to-sequence prediction task: a pure transformer encodes an image as a sequence of patches and is combined with a simple decoder to provide a powerful segmentation model.

Abstract: Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.

1,761 citations

Understand it in 5 minutes!? Skimming famous papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

柴田知秀

15 Feb 2020

1,595 citations

Posted Content•

An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen¹, Saining Xie², Kaiming He²

Carnegie Mellon University¹, Facebook²

05 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This work investigates the effects of several fundamental components for training self-supervised ViT, and finds that instability degrades accuracy and can be hidden by apparently good results, which are in fact partial failures that improve when training is made more stable.

Abstract: This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that instability is a major issue that degrades accuracy, and it can be hidden by apparently good results. We reveal that these results are indeed partial failure, and they can be improved when training is made more stable. We benchmark ViT results in MoCo v3 and several other self-supervised frameworks, with ablations in various aspects. We discuss the currently positive evidence as well as challenges and open questions. We hope that this work will provide useful data points and experience for future research.

949 citations

Posted Content•

Reformer: The Efficient Transformer

Nikita Kitaev¹, Łukasz Kaiser², Anselm Levskaya²

University of California, Berkeley¹, Google²

13 Jan 2020-arXiv: Learning

TL;DR: The Reformer improves the efficiency of Transformers by replacing dot-product attention with attention based on locality-sensitive hashing and by using reversible residual layers, performing on par with Transformer models while being much more memory-efficient and much faster on long sequences.

Abstract: Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
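The memory saving from reversible residual layers comes from the fact that a layer's inputs can be recomputed from its outputs, so intermediate activations need not be stored for every layer. The sketch below shows the idea on plain Python lists, assuming the usual two-stream formulation in which `f` and `g` stand in for the attention and feed-forward sub-layers; the toy functions and all names are illustrative assumptions, not the Reformer code.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def rev_forward(x1, x2, f, g):
    """Reversible block: each stream is updated using the other one."""
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover the block inputs from its outputs by undoing the two updates."""
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2
```

During backpropagation, a layer stack built from such blocks can call `rev_inverse` layer by layer instead of keeping all $N$ sets of activations in memory, at the cost of recomputing `f` and `g`.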

866 citations
