Author

Yin Cui

Other affiliations: Columbia University, Cornell University

Bio: Yin Cui is an academic researcher at Google. The author has contributed to research on topics including object detection and contextual image classification. The author has an h-index of 12 and has co-authored 24 publications receiving 564 citations. Previous affiliations of Yin Cui include Columbia University and Cornell University.

Papers

Posted Content

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

Golnaz Ghiasi¹, Yin Cui¹, Aravind Srinivas¹, Rui Qian¹, Tsung-Yi Lin¹, Ekin D. Cubuk¹, Quoc V. Le¹, Barret Zoph¹

Google¹

13 Dec 2020 - arXiv: Computer Vision and Pattern Recognition

TL;DR: A systematic study of the Copy-Paste augmentation for instance segmentation where the authors randomly paste objects onto an image finds that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines.

Abstract: Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
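
The random-paste step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `copy_paste`, the `paste_prob` selection rule, and the assumption that both images share the same size are all hypothetical, and any resizing or flipping a real pipeline might apply beforehand is omitted.

```python
import numpy as np


def copy_paste(src_img, src_masks, dst_img, dst_masks, rng, paste_prob=0.5):
    """Paste a random subset of source objects onto the destination image.

    src_img, dst_img: (H, W, 3) arrays of the same size (assumption).
    src_masks, dst_masks: (N, H, W) boolean instance masks.
    Returns the composited image and the updated instance masks.
    """
    # Choose which source instances to paste (hypothetical selection rule).
    keep = rng.random(len(src_masks)) < paste_prob
    pasted = src_masks[keep]

    # Union of all pasted pixels; all False when nothing was selected.
    alpha = pasted.any(axis=0)

    # Pasted pixels come from the source image, the rest from the destination.
    out_img = np.where(alpha[..., None], src_img, dst_img)

    # Remove occluded pixels from the destination masks.
    dst_visible = dst_masks & ~alpha
    # Drop destination instances that are now fully occluded.
    dst_visible = dst_visible[dst_visible.any(axis=(1, 2))]

    out_masks = np.concatenate([dst_visible, pasted], axis=0)
    return out_img, out_masks
```

Occluded pixels are removed from the destination masks so that the ground truth stays consistent with the composited image.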

400 citations

Proceedings Article

Rethinking Pre-training and Self-training

Barret Zoph¹, Golnaz Ghiasi¹, Tsung-Yi Lin¹, Yin Cui¹, Hanxiao Liu¹, Ekin D. Cubuk¹, Quoc V. Le¹

Google¹

11 Jun 2020

TL;DR: Self-training works well on exactly the same setup where pre-training does not (using ImageNet to help COCO); on the PASCAL segmentation dataset, where pre-training does help significantly, self-training further improves upon the pre-trained model.

Abstract: Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training benefits when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4AP across all dataset sizes. In other words, self-training works well exactly on the same setup that pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, though pre-training does help significantly, self-training improves upon the pre-trained model. On COCO object detection, we achieve 54.3AP, an improvement of +1.5AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+.

287 citations

Posted Content

Spatiotemporal Contrastive Video Representation Learning

Rui Qian¹, Tianjian Meng¹, Boqing Gong¹, Ming-Hsuan Yang¹, Huisheng Wang¹, Serge Belongie¹, Yin Cui¹

Google¹

09 Aug 2020 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes a temporally consistent spatial augmentation method that imposes strong spatial augmentations on each frame of the video while maintaining temporal consistency across frames, and a sampling-based temporal augmentation method that avoids overly enforcing invariance on clips that are distant in time.

Abstract: We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial. We carefully design data augmentations involving spatial and temporal cues. Concretely, we propose a temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of the video while maintaining the temporal consistency across frames. We also propose a sampling-based temporal augmentation method to avoid overly enforcing invariance on clips that are distant in time. On Kinetics-600, a linear classifier trained on the representations learned by CVRL achieves 70.4% top-1 accuracy with a 3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated R3D-50. The performance of CVRL can be further improved to 72.9% with a larger R3D-152 (2x filters) backbone, significantly closing the gap between unsupervised and supervised video representation learning. Our code and models will be available at this https URL.
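
The contrastive objective described above, where clips from the same video are pulled together and clips from different videos are pushed apart, can be illustrated with a generic InfoNCE-style loss. This is a simplified sketch under stated assumptions rather than the CVRL implementation: the function name `info_nce_loss`, the one-directional formulation, and the temperature value are illustrative choices.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """Contrastive loss over a batch of clip embeddings.

    z_a[i] and z_b[i] are embeddings of two augmented clips from video i,
    each of shape (N, D). Row i of z_b is the positive for row i of z_a;
    all other rows act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # (N, N) similarity matrix scaled by the temperature.
    logits = z_a @ z_b.T / temperature
    # Subtract the row-wise max for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

The loss is small when each clip is most similar to the other clip from its own video and grows when the positives are mismatched.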

252 citations

Proceedings Article

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

Golnaz Ghiasi¹, Yin Cui¹, Aravind Srinivas¹, Rui Qian¹, Tsung-Yi Lin¹, Ekin D. Cubuk¹, Quoc V. Le¹, Barret Zoph¹

Google¹

01 Jun 2021

TL;DR: This paper studies the Copy-Paste augmentation for instance segmentation, in which objects are pasted randomly onto an image, and shows that this simple mechanism is good enough to provide solid gains on top of strong baselines.

Abstract: Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13], [12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.

201 citations

Proceedings Article

SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization

Xianzhi Du¹, Tsung-Yi Lin¹, Pengchong Jin¹, Golnaz Ghiasi¹, Mingxing Tan¹, Yin Cui¹, Quoc V. Le¹, Xiaodan Song¹

Google¹

14 Jun 2020

TL;DR: SpineNet is proposed, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search, and can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset.

Abstract: Convolutional neural networks typically encode an input image into a series of intermediate features with decreasing resolutions. While this structure is suited to classification tasks, it does not perform well for tasks requiring simultaneous recognition and localization (e.g., object detection). The encoder-decoder architectures are proposed to resolve this by applying a decoder network onto a backbone model designed for classification tasks. In this paper, we argue encoder-decoder architecture is ineffective in generating strong multi-scale features because of the scale-decreased backbone. We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search. Using similar building blocks, SpineNet models outperform ResNet-FPN models by 3%+ AP at various scales while using 10-20% fewer FLOPs. In particular, SpineNet-190 achieves 52.1% AP on COCO, attaining the new state-of-the-art performance for single model object detection without test-time augmentation. SpineNet can transfer to classification tasks, achieving 5% top-1 accuracy improvement on a challenging iNaturalist fine-grained dataset. Code is at: https://github.com/tensorflow/tpu/tree/master/models/official/detection.

149 citations

Cited by

Posted Content

YOLOv4: Optimal Speed and Accuracy of Object Detection

Alexey Bochkovskiy, Chien-Yao Wang¹, Hong-Yuan Mark Liao¹

Academia Sinica¹

23 Apr 2020 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This work uses new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combines some of them to achieve state-of-the-art results: 43.5% AP for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100.

Abstract: There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy. Practical testing of combinations of such features on large datasets, and theoretical justification of the result, is required. Some features operate on certain models exclusively and for certain problems exclusively, or only for small-scale datasets; while some features, such as batch-normalization and residual-connections, are applicable to the majority of models, tasks, and datasets. We assume that such universal features include Weighted-Residual-Connections (WRC), Cross-Stage-Partial-connections (CSP), Cross mini-Batch Normalization (CmBN), Self-adversarial-training (SAT) and Mish-activation. We use new features: WRC, CSP, CmBN, SAT, Mish activation, Mosaic data augmentation, CmBN, DropBlock regularization, and CIoU loss, and combine some of them to achieve state-of-the-art results: 43.5% AP (65.7% AP50) for the MS COCO dataset at a realtime speed of ~65 FPS on Tesla V100. Source code is at this https URL

5,709 citations

Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu¹, Yutong Lin¹, Yue Cao¹, Han Hu¹, Yixuan Wei¹, Zheng Zhang¹, Stephen Lin¹, Baining Guo¹

Microsoft¹

25 Mar 2021 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows, to address the differences between language and vision, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.

Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at this https URL.
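
The windowing scheme in the abstract can be sketched as a reshape over non-overlapping windows, with the shifted variant approximated by a cyclic roll of the feature map. This is an illustrative NumPy sketch, not the released Swin Transformer code: the function name `window_partition` is hypothetical, the height and width are assumed to be divisible by the window size, and the attention masking needed for a cyclic shift is omitted.

```python
import numpy as np


def window_partition(x, window_size, shift=0):
    """Split an (H, W, C) feature map into non-overlapping local windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    A non-zero `shift` cyclically rolls the map first, so that window
    boundaries fall in different places than in the unshifted layout.
    """
    H, W, C = x.shape
    if shift:
        # Cyclic shift towards the top-left by `shift` pixels on both axes.
        x = np.roll(x, shift=(-shift, -shift), axis=(0, 1))

    # Split each spatial axis into (number of windows, window_size).
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-index axes together, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, C)
```

Because self-attention is then computed within each fixed-size window, the cost grows linearly with the number of windows, and hence with image size.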

3,518 citations

Understand in 5 minutes!? Skimming famous papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

柴田知秀

15 Feb 2020

1,595 citations

Posted Content

Learning without Forgetting

Zhizhong Li¹, Derek Hoiem¹

University of Illinois at Urbana–Champaign¹

29 Jun 2016 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This work proposes the Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities, and performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques.

Abstract: When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.

1,037 citations

Posted Content

Image Segmentation Using Deep Learning: A Survey

Shervin Minaee, Yuri Boykov¹, Fatih Porikli², Antonio Plaza³, Nasser Kehtarnavaz⁴, Demetri Terzopoulos⁵

University of Waterloo¹, Australian National University², University of Extremadura³, University of Texas at Dallas⁴, University of California, Los Angeles⁵

15 Jan 2020 - arXiv: Computer Vision and Pattern Recognition

TL;DR: A comprehensive review of recent pioneering efforts in semantic and instance segmentation is provided, covering convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings.

Abstract: Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.

950 citations
