Author

Elad Hoffer

Technion – Israel Institute of Technology

Other affiliations: Intel

Bio: Elad Hoffer is an academic researcher from Technion – Israel Institute of Technology. The author has contributed to research in the topics of artificial neural networks and deep learning. The author has an h-index of 20 and has co-authored 37 publications receiving 3956 citations. Previous affiliations of Elad Hoffer include Intel.

Papers

Book Chapter•DOI•

Deep metric learning using Triplet network

Elad Hoffer¹, Nir Ailon¹•Institutions (1)

Technion – Israel Institute of Technology¹

12 Oct 2015

TL;DR: This paper proposes the triplet network model, which aims to learn useful representations by distance comparisons, and demonstrates using various datasets that this model learns a better representation than that of its immediate competitor, the Siamese network.

Abstract: Deep learning has proven itself as a successful set of models for learning useful semantic representations of data. These, however, are mostly implicitly learned as part of a classification task. In this paper we propose the triplet network model, which aims to learn useful representations by distance comparisons. A similar model was defined by Wang et al. (2014), tailor made for learning a ranking for image information retrieval. Here we demonstrate using various datasets that our model learns a better representation than that of its immediate competitor, the Siamese network. We also discuss future possible usage as a framework for unsupervised learning.

1,635 citations
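
The abstract above describes learning "by distance comparisons" without stating a loss. The sketch below is one way such a comparison could be scored: it takes the Euclidean distances from an anchor embedding to a same-class and a different-class embedding, applies a softmax over the two distances, and penalizes the squared error against the target (0, 1). The function names, the softmax-ratio form, and the plain-list embeddings are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_comparison_loss(anchor, positive, negative):
    """Score one (anchor, positive, negative) triplet of embeddings.

    Assumption: a softmax over the two distances followed by squared error
    against the target (0, 1); the abstract does not specify the loss.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    # Softmax over the two distances, shifted by the max for stability.
    shift = max(d_pos, d_neg)
    e_pos = math.exp(d_pos - shift)
    e_neg = math.exp(d_neg - shift)
    p_pos = e_pos / (e_pos + e_neg)
    p_neg = e_neg / (e_pos + e_neg)
    # The loss is small when the positive is much closer than the negative.
    return p_pos ** 2 + (p_neg - 1.0) ** 2
```

In a full model the three embeddings would come from the same network applied to three inputs; here they are passed in directly to keep the sketch self-contained.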

Posted Content•

Deep metric learning using Triplet network

Elad Hoffer¹, Nir Ailon¹•Institutions (1)

Technion – Israel Institute of Technology¹

20 Dec 2014-arXiv: Learning

TL;DR: This paper proposes the triplet network model, which aims to learn useful representations by distance comparisons, and demonstrates using various datasets that the model learns a better representation than that of its immediate competitor, the Siamese network.

824 citations

Journal Article•DOI•

The implicit bias of gradient descent on separable data

Daniel Soudry¹, Elad Hoffer¹, Mor Shpigel Nacson¹, Suriya Gunasekar², Nathan Srebro²•Institutions (2)

Technion – Israel Institute of Technology¹, Toyota Technological Institute at Chicago²

01 Jan 2018-Journal of Machine Learning Research

TL;DR: In this paper, the authors examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets, and show that the predictor converges to the direction of the max-margin (hard margin SVM) solution.

Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.

488 citations
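
The convergence claim in the abstract above can be illustrated numerically. The sketch below runs plain gradient descent on the unregularized logistic loss with a homogeneous linear predictor (no bias term) over a tiny separable dataset. The dataset, step count, and learning rate are hypothetical choices; for this particular dataset the hard-margin SVM direction through the origin is (1, 0), so the normalized weight vector can be compared against it.

```python
import math

# Hypothetical toy data: (features, label in {+1, -1}), linearly separable.
# The first two points are the support vectors; the max-margin direction
# through the origin for this set is (1, 0).
DATA = [((1.0, 3.0), 1), ((-1.0, 3.0), -1), ((5.0, 4.0), 1)]


def gradient_descent_logistic(data, steps=20000, lr=0.05):
    """Gradient descent on sum_i log(1 + exp(-y_i * w.x_i)), starting at w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1])
            # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
            coeff = -y / (1.0 + math.exp(margin))
            grad[0] += coeff * x[0]
            grad[1] += coeff * x[1]
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


def direction(w):
    """Unit vector pointing along w."""
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]
```

Calling `direction(gradient_descent_logistic(DATA))` gives a unit vector that can be compared with (1, 0), while the norm of the returned weights keeps growing with the number of steps, which is consistent with the loss having its infimum at infinity.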

Posted Content•

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer¹, Itay Hubara¹, Daniel Soudry¹•Institutions (1)

Technion – Israel Institute of Technology¹

24 May 2017-arXiv: Machine Learning

TL;DR: This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates.

Abstract: Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance, known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.

411 citations

Proceedings Article•

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Elad Hoffer¹, Itay Hubara¹, Daniel Soudry¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 May 2017

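TL;DR: This paper proposes a "random walk on a random landscape" statistical model, which is known to exhibit "ultra-slow" diffusion behavior, to explain the generalization gap in large-batch training of deep learning models, and presents "Ghost Batch Normalization", which significantly decreases the gap without increasing the number of updates.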

Abstract: Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on a random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.

373 citations

Cited by

Proceedings Article•

Matching networks for one shot learning

Oriol Vinyals¹, Charles Blundell¹, Timothy P. Lillicrap¹, Koray Kavukcuoglu¹, Daan Wierstra¹•Institutions (1)

Google¹

05 Dec 2016

TL;DR: This paper learns a network that maps a small labeled support set and an unlabeled example to its label, obviating the need for fine-tuning to adapt to new class types.

Abstract: Learning from a few examples remains a key challenge in machine learning. Despite recent advances in important domains such as vision and language, the standard supervised deep learning paradigm does not offer a satisfactory solution for learning new concepts rapidly from little data. In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types. We then define one-shot learning problems on vision (using Omniglot, ImageNet) and language tasks. Our algorithm improves one-shot accuracy on ImageNet from 87.6% to 93.2% and from 88.0% to 93.8% on Omniglot compared to competing approaches. We also demonstrate the usefulness of the same model on language modeling by introducing a one-shot task on the Penn Treebank.

4,075 citations

Posted Content•

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹•Institutions (1)

Microsoft¹

25 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a new vision Transformer, called Swin Transformer, a hierarchical Transformer whose representation is computed with shifted windows to address the differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.

3,518 citations

Proceedings Article•DOI•

SphereFace: Deep Hypersphere Embedding for Face Recognition

Weiyang Liu¹, Yandong Wen², Zhiding Yu², Ming Li³, Bhiksha Raj², Le Song¹•Institutions (3)

Georgia Institute of Technology¹, Carnegie Mellon University², Sun Yat-sen University³

26 Apr 2017

TL;DR: This paper proposes the angular softmax (A-Softmax) loss, which enables CNNs to learn angularly discriminative features for deep face recognition under the open-set protocol, where ideal face features are expected to have a smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space.

Abstract: This paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space. However, few existing algorithms can effectively achieve this criterion. To this end, we propose the angular softmax (A-Softmax) loss that enables convolutional neural networks (CNNs) to learn angularly discriminative features. Geometrically, A-Softmax loss can be viewed as imposing discriminative constraints on a hypersphere manifold, which intrinsically matches the prior that faces also lie on a manifold. Moreover, the size of angular margin can be quantitatively adjusted by a parameter m. We further derive specific m to approximate the ideal feature criterion. Extensive analysis and experiments on Labeled Face in the Wild (LFW), Youtube Faces (YTF) and MegaFace Challenge 1 show the superiority of A-Softmax loss in FR tasks.

2,272 citations

Proceedings Article•DOI•

CosFace: Large Margin Cosine Loss for Deep Face Recognition

Hao Wang¹, Yitong Wang¹, Zhou Zheng¹, Ji Xing¹, Dihong Gong¹, Jingchao Zhou¹, Zhifeng Li¹, Wei Liu¹•Institutions (1)

Tencent¹

18 Jun 2018

TL;DR: In this article, the authors proposed a large margin cosine loss (LMCL), which normalizes both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space.

Abstract: Face recognition has made extraordinary progress owing to the advancement of deep convolutional neural networks (CNNs). The central task of face recognition, including face verification and identification, involves face feature discrimination. However, the traditional softmax loss of deep CNNs usually lacks the power of discrimination. To address this problem, recently several loss functions such as center loss, large margin softmax loss, and angular softmax loss have been proposed. All these improved losses share the same idea: maximizing inter-class variance and minimizing intra-class variance. In this paper, we propose a novel loss function, namely large margin cosine loss (LMCL), to realize this idea from a different perspective. More specifically, we reformulate the softmax loss as a cosine loss by L2 normalizing both features and weight vectors to remove radial variations, based on which a cosine margin term is introduced to further maximize the decision margin in the angular space. As a result, minimum intra-class variance and maximum inter-class variance are achieved by virtue of normalization and cosine decision margin maximization. We refer to our model trained with LMCL as CosFace. Extensive experimental evaluations are conducted on the most popular public-domain face recognition datasets such as MegaFace Challenge, Youtube Faces (YTF) and Labeled Face in the Wild (LFW). We achieve the state-of-the-art performance on these benchmarks, which confirms the effectiveness of our proposed approach.

1,879 citations
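
The reformulation described in the abstract above (L2-normalize both features and weight vectors, then subtract a cosine margin from the target class) can be sketched for a single example as follows. The function names, the `scale` factor applied to the cosines, and the default values for `scale` and `margin` are assumptions made for illustration; the abstract does not give them. Inputs are assumed to be non-zero vectors.

```python
import math


def l2_normalize(v):
    """Rescale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def large_margin_cosine_loss(feature, class_weights, label,
                             scale=30.0, margin=0.35):
    """Cosine-margin softmax loss for one example.

    `scale` and the default values are hypothetical; only the
    normalization and the cosine margin come from the abstract.
    """
    f = l2_normalize(feature)
    # After normalization each logit depends only on the angle to a class.
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w)))
               for w in class_weights]
    # Subtract the margin from the target-class cosine only.
    logits = [scale * (c - margin) if j == label else scale * c
              for j, c in enumerate(cosines)]
    # Numerically stable negative log-softmax of the target logit.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]
```

Because both vectors are normalized, rescaling `feature` leaves the loss unchanged, which is the "remove radial variations" property mentioned in the abstract; a positive `margin` makes the loss stricter than the plain cosine softmax on the same input.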

Journal Article•DOI•

Understanding deep learning (still) requires rethinking generalization

Chiyuan Zhang¹, Samy Bengio¹, Moritz Hardt², Benjamin Recht², Oriol Vinyals•Institutions (2)

Google¹, University of California, Berkeley²

22 Feb 2021-Communications of The ACM

TL;DR: These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Abstract: Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progresses in the field since the original version of this paper.

1,664 citations
