scispace - formally typeset
Author

Andrew Zisserman

Other affiliations: University of Edinburgh, Microsoft, University of Leeds

Bio: Andrew Zisserman is an academic researcher at the University of Oxford who has co-authored 808 publications receiving 261,717 citations, with an h-index of 167. Previous affiliations of Andrew Zisserman include the University of Edinburgh and Microsoft. His research concentrates on topics including real images and convolutional neural networks.

Papers

Open access · Proceedings Article
Karen Simonyan, Andrew Zisserman · Institutions (1)
01 Jan 2015
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
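The architecture described here stacks many small 3x3 convolutions and pushes the depth to 16-19 weight layers. A minimal PyTorch sketch of a 16-layer configuration of this kind (an illustration, not the authors' released models; the layer widths follow the paper's configuration tables) could look like this:

```python
# Minimal VGG-16-style network sketch in PyTorch (illustrative, not the released model).
import torch
import torch.nn as nn

# Numbers are output channels of 3x3 convolutions; "M" is a 2x2 max-pooling layer.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            # All convolutions use small 3x3 kernels, padded to preserve resolution.
            layers.append(nn.Conv2d(in_channels, v, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
            in_channels = v
    return nn.Sequential(*layers)

class VGGLike(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = make_features(VGG16_CFG)
        # Three fully connected layers on top, for 224x224 inputs (13 conv + 3 FC = 16 weight layers).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = VGGLike()(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```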


49,857 Citations


Open access · Proceedings Article
Karen Simonyan, Andrew Zisserman · Institutions (1)
04 Sep 2014
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.


38,283 Citations


Open access · Book
Richard Hartley, Andrew Zisserman · Institutions (2)
01 Jan 2000
Abstract: From the Publisher: A basic problem in computer vision is to understand the structure of a real world scene given several images of it. Recent major developments in the theory and practice of scene reconstruction are described in detail in a unified framework. The book covers the geometric principles and how to represent objects algebraically so they can be computed and applied. The authors provide comprehensive background material and explain how to apply the methods and implement the algorithms directly.
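As one small illustration of the two-view geometry the book covers, the fundamental matrix F relating corresponding image points x and x' (with x'^T F x = 0) can be estimated from point matches. A hedged OpenCV sketch, assuming correspondences are already available (the points below are synthetic placeholders, not real image data):

```python
# Sketch: estimating two-view epipolar geometry (the fundamental matrix) from
# point correspondences with OpenCV. The correspondences here are placeholders;
# in practice they would come from feature matching between two images.
import numpy as np
import cv2

# Eight or more corresponding points (x, y) in each image.
pts1 = np.random.rand(20, 2).astype(np.float32) * 640
pts2 = pts1 + np.random.rand(20, 2).astype(np.float32) * 5  # roughly matching points

# RANSAC-based estimate; F satisfies x2^T F x1 ~ 0 for inlier correspondences.
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
print(F)
```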


Topics: Structure from motion (55%), Epipolar geometry (53%), Computer graphics (51%)

15,158 Citations


Open access · Journal Article · DOI: 10.1007/S11263-009-0275-4
Mark Everingham, Luc Van Gool, Christopher Williams, John Winn, +1 more · Institutions (5)
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
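Detection in the VOC evaluation is scored by bounding-box overlap (intersection over union), with 0.5 as the usual acceptance threshold. A minimal sketch of that overlap measure, with boxes given as (xmin, ymin, xmax, ymax) tuples:

```python
# Sketch of the VOC-style bounding-box overlap (intersection over union).
# A detection is typically accepted when IoU >= 0.5 against a ground-truth box.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (may be empty).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```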


  • Fig. 1 Example images from the VOC2007 dataset. For each of the 20 classes annotated, two examples are shown. Bounding boxes indicate all instances of the corresponding class in the image which are marked as “non-difficult” (see Sect. 3.3); bounding boxes for the other classes are available in the annotation but not shown. Note the wide range of pose, scale, clutter, occlusion and imaging conditions.
  • Fig. 3 Example of the “difficult” annotation. Objects shown in red have been marked difficult, and are excluded from the evaluation. Note that the judgement of difficulty is not solely by object size; the distant car on the right of the image is included in the evaluation.
  • Table 2 Statistics of the VOC2007 dataset. The data is divided into two main subsets: training/validation data (trainval) and test data (test), with the trainval data further divided into suggested training (train) and validation (val) sets. For each subset and class, the number of images (containing at least one object of the corresponding class) and number of object instances are shown. Note that because images may contain objects of several classes, the totals shown in the image columns are not simply the sum of the corresponding column.
  • Table 3 Statistics of the VOC2007 segmentation dataset. The data is divided into two main subsets: training/validation data (trainval) and test data (test), with the trainval data further divided into suggested training (train) and validation (val) sets. For each subset and class, the number of images (containing at least one object of the corresponding class) and number of object instances are shown. Note that because images may contain objects of several classes, the totals shown in the image columns are not simply the sum of the corresponding column. All objects in each image are segmented, with every pixel of the image being labelled as one of the object classes, “background” (not one of the annotated classes) or “void” (uncertain, i.e. near object boundary).
  • Fig. 5 Example images and annotation for the taster competitions. (a) Segmentation taster annotation showing object and class segmentation. Border regions are marked with the “void” label indicating that they may be object or background. Difficult objects are excluded by masking with the “void” label. (b) Person Layout taster annotation showing bounding boxes for head, hands and feet.

11,545 Citations


Open access · Proceedings Article
Karen Simonyan, Andrew Zisserman · Institutions (1)
08 Dec 2014
Abstract: We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
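As a rough sketch of the two-stream idea described in the abstract, one ConvNet takes a single RGB frame (the spatial stream) and a second ConvNet takes a stack of horizontal and vertical optical-flow fields (the temporal stream), with the class scores fused by averaging. The layer sizes below are placeholders, not the configuration reported in the paper:

```python
# Illustrative two-stream sketch: spatial stream on an RGB frame, temporal stream on
# stacked optical flow, fused by averaging softmax scores.
import torch
import torch.nn as nn

def small_convnet(in_channels, num_classes):
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(256, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = small_convnet(3, num_classes)                # one RGB frame
        self.temporal = small_convnet(2 * flow_stack, num_classes)  # stacked (x, y) flow fields

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream class probabilities.
        p_spatial = self.spatial(rgb).softmax(dim=1)
        p_temporal = self.temporal(flow).softmax(dim=1)
        return (p_spatial + p_temporal) / 2

scores = TwoStream()(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
print(scores.shape)  # torch.Size([1, 101])
```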


5,381 Citations


Cited by

Open access · Proceedings Article · DOI: 10.1109/CVPR.2016.90
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · Institutions (1)
27 Jun 2016
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
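The reformulation described in the abstract has each block learn a residual function F(x) that is added back to the block input, y = F(x) + x. A minimal sketch of such a block (assuming matching input and output shapes, so the shortcut is a plain identity rather than a projection):

```python
# Minimal residual block sketch: the block computes F(x) and returns F(x) + x.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # Identity shortcut: add the input back, then apply the nonlinearity.
        return self.relu(residual + x)

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```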


Topics: Deep learning (53%), Residual (53%), Convolutional neural network (53%)

93,356 Citations


Open access · Proceedings Article
Karen Simonyan, Andrew Zisserman · Institutions (1)
01 Jan 2015
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.


49,857 Citations


Journal Article · DOI: 10.1023/B:VISI.0000029664.99615.94
David G. Lowe · Institutions (1)
Abstract: This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
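The pipeline described here, distinctive local descriptors matched by nearest-neighbour search with a ratio test, is available in common libraries. A short OpenCV sketch (not Lowe's original implementation; the image paths are placeholders) might look like:

```python
# Sketch of feature matching with SIFT-style descriptors using OpenCV
# (requires opencv-python >= 4.4; the image paths are placeholders).
import cv2

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbour matching with the ratio test: keep a match only if its
# best distance is clearly smaller than the second-best.
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]
print(f"{len(good)} putative matches")
```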


Topics: 3D single-object recognition (64%), Haar-like features (63%), Feature (computer vision) (58%)

42,225 Citations


Open access · Proceedings Article
Karen Simonyan, Andrew Zisserman · Institutions (1)
04 Sep 2014
Abstract: In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.


38,283 Citations


Proceedings Article · DOI: 10.1109/CVPR.2009.5206848
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, +2 more · Institutions (1)
20 Jun 2009
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
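The WordNet hierarchy that ImageNet's synsets are built on can be inspected directly. A small sketch using NLTK's WordNet interface (used here only as an illustration; it is not part of the ImageNet tooling) shows how a synset and its hyponym subtree are organised:

```python
# Sketch: exploring the WordNet hierarchy that ImageNet's synsets are organised by.
# Uses NLTK's WordNet corpus (run nltk.download("wordnet") once beforehand).
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
print(dog.definition())

# Count the hyponym subtree under "dog"; each node corresponds to a potential
# synset that a dataset like ImageNet would populate with images.
subtree = list(dog.closure(lambda s: s.hyponyms()))
print(f"'dog' has {len(subtree)} hyponym synsets beneath it")
```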


Topics: WordNet (57%), Image retrieval (54%)

31,274 Citations


Performance Metrics

Author's H-index: 167

No. of papers from the author in previous years:

Year    Papers
2021    54
2020    69
2019    58
2018    64
2017    37
2016    24

Top Attributes


Author's top 5 most impactful journals

International Journal of Computer Vision

28 papers, 25.2K citations

Image and Vision Computing

17 papers, 1.4K citations

Lecture Notes in Computer Science

11 papers, 1.2K citations

Network Information
Related Authors (5)
Joon Son Chung

73 papers, 4.9K citations

88% related
David Forsyth

334 papers, 33K citations

88% related
Victor Lempitsky

173 papers, 30.8K citations

87% related
Arsha Nagrani

54 papers, 3.1K citations

86% related
James Charles

26 papers, 1.2K citations

86% related