Showing papers on "Feature (computer vision)" published in 2017


Proceedings Article•DOI•

Feature Pyramid Networks for Object Detection


Tsung-Yi Lin¹, Piotr Dollár², Ross Girshick², Kaiming He², Bharath Hariharan², Serge Belongie¹•Institutions (2)

Cornell University¹, Facebook²

21 Jul 2017

TL;DR: This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.


Abstract: Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But pyramid representations have been avoided in recent object detectors that are based on deep convolutional networks, partially because they are slow to compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
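As a rough illustration of the top-down pathway with lateral connections, the sketch below upsamples a coarse feature map by a factor of two and adds it element-wise to the finer map from the bottom-up pathway. It is a minimal single-channel sketch on plain Python lists: nearest-neighbour upsampling is an assumption of this example, the learned convolutions of a real FPN are omitted, and the function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge_top_down(coarse, lateral):
    """Add the upsampled coarse map to the finer lateral map, element-wise.

    `lateral` is assumed to be exactly twice the height and width of `coarse`.
    """
    up = upsample_nearest_2x(coarse)
    return [[u + l for u, l in zip(urow, lrow)] for urow, lrow in zip(up, lateral)]


if __name__ == "__main__":
    coarse = [[1.0]]                      # semantically strong, low resolution
    lateral = [[1.0, 2.0], [3.0, 4.0]]    # higher resolution, from the bottom-up pass
    print(merge_top_down(coarse, lateral))  # [[2.0, 3.0], [4.0, 5.0]]
```

Repeating this merge from the coarsest level downward would give one feature map per scale, which is the pyramid the abstract refers to.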


16,727 citations

Journal Article•DOI•

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation


Vijay Badrinarayanan¹, Alex Kendall¹, Roberto Cipolla¹•Institutions (1)

University of Cambridge¹

01 Dec 2017-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures, including FCN and DeconvNet.


Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and most efficient inference memory-wise as compared to other architectures.
We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/ .
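The index-based upsampling described in the abstract can be sketched as follows: the encoder records where each maximum came from, and the decoder writes the pooled value back to that position, leaving zeros elsewhere. This is a minimal single-channel sketch on plain Python lists, assuming 2x2 windows with stride 2 and even input dimensions. The function names are hypothetical and are not taken from the SegNet Caffe code.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2.

    Returns the pooled map and, for every pooled cell, the (row, col)
    position of the maximum in the input. Assumes even height and width.
    """
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        prow, irow = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[a][b], (a, b)) for a in (i, i + 1) for b in (j, j + 1)]
            value, pos = max(window, key=lambda t: t[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool_2x2(pooled, indices, height, width):
    """Place each pooled value at its remembered position; everything else stays 0."""
    out = [[0.0] * width for _ in range(height)]
    for prow, irow in zip(pooled, indices):
        for value, (a, b) in zip(prow, irow):
            out[a][b] = value
    return out


if __name__ == "__main__":
    x = [[1, 2, 0, 1],
         [3, 4, 1, 5],
         [0, 0, 2, 2],
         [7, 1, 0, 3]]
    pooled, idx = max_pool_2x2(x)
    print(pooled)                          # [[4, 5], [7, 3]]
    print(max_unpool_2x2(pooled, idx, 4, 4))
```

The unpooled map is sparse, which is why the abstract says it is then convolved with trainable filters to produce dense feature maps.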


13,468 citations

Proceedings Article•

LightGBM: a highly efficient gradient boosting decision tree


Guolin Ke¹, Qi Meng², Thomas Finley¹, Taifeng Wang¹, Wei Chen¹, Weidong Ma¹, Qiwei Ye¹, Tie-Yan Liu¹•Institutions (2)

Microsoft¹, Peking University²

04 Dec 2017

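TL;DR: Two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), are proposed. It is proved that, since data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain a quite accurate estimate of the information gain with a much smaller data size. The resulting GBDT implementation is called LightGBM.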


Abstract: Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large. A major reason is that for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming. To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features. We prove that finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio (and thus can effectively reduce the number of features without hurting the accuracy of split point determination by much). We call our new GBDT implementation with GOSS and EFB LightGBM. Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.
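The sampling step of GOSS can be sketched as below. This is a minimal sketch, not the LightGBM implementation: it keeps the instances with the largest absolute gradients, randomly samples from the remainder, and up-weights the sampled small-gradient instances. The reweighting factor `(1 - top_rate) / other_rate` follows the description in the paper body rather than the abstract, and the function name, parameter names, and default rates are assumptions made for this example.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance_index: weight} for the instances kept by GOSS.

    top_rate   -- fraction of instances kept because of large |gradient|
    other_rate -- fraction of instances randomly sampled from the rest
    """
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]

    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_other, len(rest)))

    # Large-gradient instances keep weight 1; sampled small-gradient
    # instances are amplified to compensate for the ones that were dropped.
    weights = {i: 1.0 for i in top}
    amplify = (1.0 - top_rate) / other_rate
    weights.update({i: amplify for i in sampled})
    return weights


if __name__ == "__main__":
    grads = [0.1 * i for i in range(10)]
    kept = goss_sample(grads)
    print(sorted(kept))  # indices 8 and 9 plus one randomly sampled index
```

Information gain for each candidate split would then be estimated on the kept instances only, using these weights.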


4,977 citations

Journal Article•DOI•

Deep Learning in Medical Image Analysis


Dinggang Shen¹, Guorong Wu¹, Heung-Il Suk²•Institutions (2)

University of North Carolina at Chapel Hill¹, Korea University²

20 Jun 2017-Annual Review of Biomedical Engineering

TL;DR: This review covers computer-assisted analysis of images in the field of medical imaging and introduces the fundamentals of deep learning methods and their successes in image registration, detection of anatomical and cellular structures, tissue segmentation, computer-aided disease diagnosis and prognosis, and so on.


Abstract: This review covers computer-assisted analysis of images in the field of medical imaging. Recent advances in machine learning, especially with regard to deep learning, are helping to identify, classify, and quantify patterns in medical images. At the core of these advances is the ability to exploit hierarchical feature representations learned solely from data, instead of features designed by hand according to domain-specific knowledge. Deep learning is rapidly becoming the state of the art, leading to enhanced performance in various medical applications. We introduce the fundamentals of deep learning methods and review their successes in image registration, detection of anatomical and cellular structures, tissue segmentation, computer-aided disease diagnosis and prognosis, and so on. We conclude by discussing research issues and suggesting future directions for further improvement.


2,653 citations

Proceedings Article•DOI•

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors


Jonathan Huang¹, Vivek Rathod¹, Chen Sun², Menglong Zhu³, Anoop Korattikara⁴, Alireza Fathi², Ian Fischer², Zbigniew Wojna⁵, Yang Song⁶, Sergio Guadarrama⁷, Kevin Murphy⁸•Institutions (8)

Russian Academy of Sciences¹, Google², University of Pennsylvania³, University of California, Irvine⁴, University College London⁵, Chinese Center for Disease Control and Prevention⁶, University of California, Berkeley⁷, Cardiff University⁸

21 Jul 2017

TL;DR: A unified implementation of the Faster R-CNN, R-FCN and SSD systems is presented and the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures is traced out.


Abstract: The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [30], R-FCN [6] and SSD [25] systems, which we view as meta-architectures, and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.


2,484 citations

Posted Content•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.


Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang²•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

25 Jul 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.


Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.


2,248 citations

Proceedings Article•DOI•

DeepFM: a factorization-machine based neural network for CTR prediction


Huifeng Guo¹, Ruiming Tang², Yunming Ye¹, Zhenguo Li², Xiuqiang He²•Institutions (2)

Harbin Institute of Technology¹, Huawei²

19 Aug 2017

TL;DR: This paper shows that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions, and combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture.


Abstract: Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering. In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture. Compared to the latest Wide & Deep model from Google, DeepFM has a shared input to its "wide" and "deep" parts, with no need of feature engineering besides raw features. Comprehensive experiments are conducted to demonstrate the effectiveness and efficiency of DeepFM over the existing models for CTR prediction, on both benchmark data and commercial data.


1,695 citations

Proceedings Article•DOI•

SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning


Long Chen¹, Hanwang Zhang², Jun Xiao¹, Liqiang Nie³, Jian Shao¹, Wei Liu⁴, Tat-Seng Chua⁵•Institutions (5)

Zhejiang University¹, Columbia University², Shandong University³, Tencent⁴, National University of Singapore⁵

21 Jul 2017

TL;DR: This paper introduces a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN that significantly outperforms state-of-the-art visual attention-based image captioning methods.


Abstract: Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism — a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channel-wise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.


1,527 citations

Journal Article•DOI•

Trainable Weka Segmentation: a machine learning tool for microscopy pixel classification.


Ignacio Arganda-Carreras¹, Ignacio Arganda-Carreras², Verena Kaynig³, Curtis Rueden⁴, Kevin W. Eliceiri⁴, Johannes Schindelin⁴, Albert Cardona⁵, H. Sebastian Seung⁶•Institutions (6)

Ikerbasque¹, Donostia International Physics Center², Harvard University³, University of Wisconsin-Madison⁴, Howard Hughes Medical Institute⁵, Princeton University⁶

01 Aug 2017-Bioinformatics

TL;DR: The Trainable Weka Segmentation (TWS), a machine learning tool that leverages a limited number of manual annotations in order to train a classifier and segment the remaining data automatically, is introduced.


Abstract: Summary: State-of-the-art light and electron microscopes are capable of acquiring large image datasets, but quantitatively evaluating the data often involves manually annotating structures of interest. This process is time-consuming and often a major bottleneck in the evaluation pipeline. To overcome this problem, we have introduced the Trainable Weka Segmentation (TWS), a machine learning tool that leverages a limited number of manual annotations in order to train a classifier and segment the remaining data automatically. In addition, TWS can provide unsupervised segmentation learning schemes (clustering) and can be customized to employ user-designed image features or classifiers. Availability and implementation: TWS is distributed as open-source software as part of the Fiji image processing distribution of ImageJ at http://imagej.net/Trainable_Weka_Segmentation . Contact: ignacio.arganda@ehu.eus. Supplementary information: Supplementary data are available at Bioinformatics online.


1,416 citations

Posted Content•

VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection


Yin Zhou¹, Oncel Tuzel¹•Institutions (1)

Apple Inc.¹

17 Nov 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: VoxelNet is proposed, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single-stage, end-to-end trainable deep network and learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists.


Abstract: Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. Specifically, VoxelNet divides a point cloud into equally spaced 3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly introduced voxel feature encoding (VFE) layer. In this way, the point cloud is encoded as a descriptive volumetric representation, which is then connected to a RPN to generate detections. Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.
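The first step described above, dividing a point cloud into equally spaced 3D voxels and grouping the points that fall in each one, can be sketched as follows. This is only the grouping step. The learned voxel feature encoding (VFE) layer that turns each group into a feature vector is not shown, and the function name and the `origin` parameter are assumptions made for this example.

```python
import math
from collections import defaultdict


def group_points_by_voxel(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Map each (x, y, z) point to the integer index of the voxel containing it.

    voxel_size -- (dx, dy, dz) edge lengths of a voxel
    origin     -- corner of the grid from which voxel indices are counted
    Returns {voxel_index: [points in that voxel]}; empty voxels are not stored.
    """
    voxels = defaultdict(list)
    for point in points:
        key = tuple(
            math.floor((coord - start) / size)
            for coord, start, size in zip(point, origin, voxel_size)
        )
        voxels[key].append(point)
    return dict(voxels)


if __name__ == "__main__":
    cloud = [(0.1, 0.1, 0.1), (0.3, 0.2, 0.1), (1.2, 0.1, 0.1)]
    grid = group_points_by_voxel(cloud, voxel_size=(0.5, 0.5, 0.5))
    for index, members in sorted(grid.items()):
        print(index, len(members))
    # (0, 0, 0) 2
    # (2, 0, 0) 1
```

Storing only non-empty voxels suits the highly sparse LiDAR point clouds the abstract mentions, since most voxels contain no points.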


991 citations

Journal Article•DOI•

Machine Learning for Medical Imaging.


Bradley J. Erickson¹, Panagiotis Korfiatis¹, Zeynettin Akkus¹, Timothy L. Kline¹•Institutions (1)

Mayo Clinic¹

17 Feb 2017-Radiographics

TL;DR: Deep learning has started to be used; this method has the benefit that it does not require image feature identification and calculation as a first step; rather, features are identified as part of the learning process.


Abstract: Machine learning is a technique for recognizing patterns that can be applied to medical images. Although it is a powerful tool that can help in rendering medical diagnoses, it can be misapplied. Machine learning typically begins with the machine learning algorithm system computing the image features that are believed to be of importance in making the prediction or diagnosis of interest. The machine learning algorithm system then identifies the best combination of these image features for classifying the image or computing some metric for the given image region. There are several methods that can be used, each with different strengths and weaknesses. There are open-source versions of most of these machine learning methods that make them easy to try and apply to images. Several metrics for measuring the performance of an algorithm exist; however, one must be aware of the possible associated pitfalls that can result in misleading metrics. More recently, deep learning has started to be used; this method has the benefit that it does not require image feature identification and calculation as a first step; rather, features are identified as part of the learning process. Machine learning has been used in medical imaging and will have a greater influence in the future. Those working in medical imaging must be aware of how machine learning works. ©RSNA, 2017.


Proceedings Article•DOI•

Deeply Supervised Salient Object Detection with Short Connections


Qibin Hou¹, Ming-Ming Cheng¹, Xiaowei Hu¹, Ali Borji², Zhuowen Tu³, Philip H. S. Torr⁴•Institutions (4)

Nankai University¹, University of Central Florida², University of California, San Diego³, University of Oxford⁴

01 Jul 2017

TL;DR: This paper proposes a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture, which takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection.


Abstract: Recent progress on saliency detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs). Semantic segmentation and saliency detection algorithms developed lately have been mostly based on Fully Convolutional Neural Networks (FCNs). There is still a large room for improvement over the generic FCN models that do not explicitly deal with the scale-space problem. Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new saliency method by introducing short connections to the skip-layer structures within the HED architecture. Our framework provides rich multi-scale feature maps at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over the existing algorithms.


Proceedings Article•DOI•

High-Resolution Image Inpainting Using Multi-scale Neural Patch Synthesis


Chao Yang¹, Xin Lu², Zhe Lin², Eli Shechtman², Oliver Wang², Hao Li³•Institutions (3)

University of Southern California¹, Adobe Systems², Institute for Creative Technologies³

01 Jul 2017

TL;DR: This work proposes a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network.


Abstract: Recent advances in deep learning have shown exciting promise in filling large holes in natural images with semantically plausible and context aware details, impacting fundamental image manipulation tasks such as object removal. While these learning-based methods are significantly more effective in capturing high-level features than prior techniques, they can only handle very low-resolution inputs due to memory limitations and difficulty in training. Even for slightly larger images, the inpainted regions would appear blurry and unpleasant boundaries become visible. We propose a multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. We evaluate our method on the ImageNet and Paris Streetview datasets and achieved state-of-the-art inpainting accuracy. We show our approach produces sharper and more coherent results than prior methods, especially for high-resolution images.


Proceedings Article•DOI•

Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection


Pingping Zhang¹, Dong Wang², Huchuan Lu³, Hongyu Wang³, Xiang Ruan⁴•Institutions (4)

Jinan Military Region¹, Dartmouth College², Dalian University of Technology³, Omron⁴

01 Oct 2017

TL;DR: Amulet is presented, a generic aggregating multi-level convolutional feature framework for salient object detection that provides accurate salient object labeling and performs favorably against state-of-the-art approaches in terms of near all compared evaluation metrics.


Abstract: Fully convolutional neural networks (FCNs) have shown outstanding performance in many dense labeling problems. One key pillar of these successes is mining relevant information from features in convolutional layers. However, how to better aggregate multi-level convolutional feature maps for salient object detection is underexplored. In this work, we present Amulet, a generic aggregating multi-level convolutional feature framework for salient object detection. Our framework first integrates multi-level feature maps into multiple resolutions, which simultaneously incorporate coarse semantics and fine details. Then it adaptively learns to combine these feature maps at each resolution and predict saliency maps with the combined features. Finally, the predicted results are efficiently fused to generate the final saliency map. In addition, to achieve accurate boundary inference and semantic enhancement, edge-aware feature maps in low-level layers and the predicted results of low resolution features are recursively embedded into the learning framework. By aggregating multi-level convolutional features in this efficient and flexible manner, the proposed saliency model provides accurate salient object labeling. Comprehensive experiments demonstrate that our method performs favorably against state-of-the-art approaches in terms of near all compared evaluation metrics.


Journal Article•DOI•

Richer Convolutional Features for Edge Detection


Yun Liu¹, Ming-Ming Cheng¹, Xiaowei Hu¹, Jia-Wang Bian¹, Le Zhang², Xiang Bai³, Jinhui Tang⁴•Institutions (4)

Nankai University¹, Agency for Science, Technology and Research², Huazhong University of Science and Technology³, Nanjing University of Science and Technology⁴

21 Jul 2017

TL;DR: RCF, an edge detector using richer convolutional features, encapsulates all convolutional features into a more discriminative representation, which makes good use of rich feature hierarchies and is amenable to training via backpropagation.


Abstract: Edge detection is a fundamental problem in computer vision. Recently, convolutional neural networks (CNNs) have pushed forward this field significantly. Existing methods which adopt specific layers of deep CNNs may fail to capture complex data structures caused by variations of scales and aspect ratios. In this paper, we propose an accurate edge detector using richer convolutional features (RCF). RCF encapsulates all convolutional features into more discriminative representation, which makes good usage of rich feature hierarchies, and is amenable to training via backpropagation. RCF fully exploits multiscale and multilevel information of objects to perform the image-to-image prediction holistically. Using VGG16 network, we achieve state-of-the-art performance on several available datasets. When evaluating on the well-known BSDS500 benchmark, we achieve ODS F-measure of 0.811 while retaining a fast speed (8 FPS). Besides, our fast version of RCF achieves ODS F-measure of 0.806 with 30 FPS. We also demonstrate the versatility of the proposed method by applying RCF edges for classical image segmentation.


Journal Article•DOI•

Classification of breast cancer histology images using Convolutional Neural Networks


Teresa Araújo¹, Guilherme Aresta¹, Eduardo Castro¹, José Rouco, Paulo Aguiar², Catarina Eloy, António Polónia², Aurélio Campilho¹•Institutions (2)

Faculdade de Engenharia da Universidade do Porto¹, University of Porto²

01 Jun 2017-PLOS ONE

TL;DR: A method for the classification of hematoxylin and eosin stained breast biopsy images using Convolutional Neural Networks (CNNs) is proposed and the sensitivity of the method for cancer cases is 95.6%.


Abstract: Breast cancer is one of the main causes of cancer death worldwide. The diagnosis of biopsy tissue with hematoxylin and eosin stained images is non-trivial and specialists often disagree on the final diagnosis. Computer-aided Diagnosis systems contribute to reduce the cost and increase the efficiency of this process. Conventional classification approaches rely on feature extraction methods designed for a specific problem based on field-knowledge. To overcome the many difficulties of the feature-based approaches, deep learning methods are becoming important alternatives. A method for the classification of hematoxylin and eosin stained breast biopsy images using Convolutional Neural Networks (CNNs) is proposed. Images are classified in four classes, normal tissue, benign lesion, in situ carcinoma and invasive carcinoma, and in two classes, carcinoma and non-carcinoma. The architecture of the network is designed to retrieve information at different scales, including both nuclei and overall tissue organization. This design allows the extension of the proposed system to whole-slide histology images. The features extracted by the CNN are also used for training a Support Vector Machine classifier. Accuracies of 77.8% for four class and 83.3% for carcinoma/non-carcinoma are achieved. The sensitivity of our method for cancer cases is 95.6%.


Proceedings Article•DOI•

Feature Squeezing: Detecting Adversarial Examples in Deep Neural Networks


Weilin Xu¹, David Evans¹, Yanjun Qi¹•Institutions (1)

University of Virginia¹

04 Apr 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: Two feature squeezing methods are explored: reducing the color bit depth of each pixel and spatial smoothing, which are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.


Abstract: Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by adversarial examples that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, feature squeezing, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
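The detection idea can be sketched with the bit-depth squeezer: quantize the input, run the model on both versions, and flag the input if the two predictions differ too much. This is a minimal sketch under stated assumptions. Pixels are floats in [0, 1], the prediction gap is measured with an L1 distance, and the threshold value, function names, and toy model are all hypothetical choices for this example rather than details given in the abstract.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]


def looks_adversarial(predict, pixels, bits=1, threshold=0.5):
    """Flag an input whose prediction changes a lot after squeezing.

    predict -- any callable mapping a pixel list to a probability vector
    """
    original = predict(pixels)
    squeezed = predict(reduce_bit_depth(pixels, bits))
    l1_gap = sum(abs(a - b) for a, b in zip(original, squeezed))
    return l1_gap > threshold


def toy_model(pixels):
    """Stand-in for a DNN: two-class 'probabilities' from mean brightness."""
    mean = sum(pixels) / len(pixels)
    return [mean, 1.0 - mean]


if __name__ == "__main__":
    print(reduce_bit_depth([0.0, 0.4, 0.6, 1.0], bits=1))  # [0.0, 0.0, 1.0, 1.0]
    print(looks_adversarial(toy_model, [0.4, 0.4]))        # True: squeezing shifts the output
    print(looks_adversarial(toy_model, [0.0, 1.0]))        # False: squeezing changes nothing
```

In practice the threshold would be chosen on held-out data to balance detection rate against false positives.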

Proceedings Article•DOI•

Large-Scale Image Retrieval with Attentive Deep Local Features

Hyeonwoo Noh¹, Andre Araujo², Jack Sim³, Tobias Weyand³, Bohyung Han¹•Institutions (3)

Pohang University of Science and Technology¹, Stanford University², Google³

01 Oct 2017

TL;DR: An attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature), based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset.

Abstract: We propose an attentive local feature descriptor suitable for large-scale image retrieval, referred to as DELF (DEep Local Feature). The new feature is based on convolutional neural networks, which are trained only with image-level annotations on a landmark image dataset. To identify semantically useful local features for image retrieval, we also propose an attention mechanism for keypoint selection, which shares most network layers with the descriptor. This framework can be used for image retrieval as a drop-in replacement for other keypoint detectors and descriptors, enabling more accurate feature matching and geometric verification. Our system produces reliable confidence scores to reject false positives; in particular, it is robust against queries that have no correct match in the database. To evaluate the proposed descriptor, we introduce a new large-scale dataset, referred to as Google-Landmarks dataset, which involves challenges in both database and query such as background clutter, partial occlusion, multiple landmarks, objects in variable scales, etc. We show that DELF outperforms the state-of-the-art global and local descriptors in the large-scale setting by significant margins.

Posted Content•

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Saining Xie¹, Chen Sun¹, Jonathan Huang¹, Zhuowen Tu¹, Kevin Murphy¹•Institutions (1)

Google¹

13 Dec 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: It is shown that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions, suggesting that temporal representation learning on high-level “semantic” features is more useful.

Abstract: Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist including spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, as for model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfit. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions by low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs including separable spatial/temporal convolution and feature gating, our system results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
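
To make the cost argument concrete, the weight counts of a full 3D convolution and a separable spatial/temporal pair can be compared with simple arithmetic. The sketch below assumes one common factorization (a 1 x k x k spatial convolution followed by a kt x 1 x 1 temporal convolution), ignores biases, and defaults the intermediate channel count to the output channel count; the function names are hypothetical and the abstract does not specify these details.

```python
def conv3d_params(kt, k, c_in, c_out):
    """Weights in a full kt x k x k 3D convolution (no bias)."""
    return kt * k * k * c_in * c_out

def separable_params(kt, k, c_in, c_out, c_mid=None):
    """Weights in a 1 x k x k spatial conv followed by a kt x 1 x 1 temporal conv."""
    c_mid = c_out if c_mid is None else c_mid
    spatial = k * k * c_in * c_mid    # 1 x k x k kernel
    temporal = kt * c_mid * c_out     # kt x 1 x 1 kernel
    return spatial + temporal
```

For a 3 x 3 x 3 layer with 64 input and 64 output channels, the full convolution has 110,592 weights and the separable pair has 49,152 under these assumptions.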

Proceedings Article•DOI•

Deep Pyramidal Residual Networks

Dongyoon Han¹, Jiwhan Kim¹, Junmo Kim¹•Institutions (1)

KAIST¹

21 Jul 2017

TL;DR: This research gradually increases the feature map dimension at all units to involve as many locations as possible in the network architecture and proposes a novel residual unit capable of further improving the classification accuracy with the new network architecture.

Abstract: Deep convolutional neural networks (DCNNs) have shown remarkable performance in image classification tasks in recent years. Generally, deep neural network architectures are stacks consisting of a large number of convolutional layers, and they perform downsampling along the spatial dimension via pooling to reduce memory usage. Concurrently, the feature map dimension (i.e., the number of channels) is sharply increased at downsampling locations, which is essential to ensure effective performance because it increases the diversity of high-level attributes. This also applies to residual networks and is very closely related to their performance. In this research, instead of sharply increasing the feature map dimension at units that perform downsampling, we gradually increase the feature map dimension at all units to involve as many locations as possible. This design, which is discussed in depth together with our new insights, has proven to be an effective means of improving generalization ability. Furthermore, we propose a novel residual unit capable of further improving the classification accuracy with our new network architecture. Experiments on benchmark CIFAR-10, CIFAR-100, and ImageNet datasets have shown that our network architecture has superior generalization ability compared to the original residual networks. Code is available at https://github.com/jhkim89/PyramidNet.
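
The contrast between sharp and gradual widening can be illustrated with two channel schedules. This is a sketch under assumptions: the linear rule (each unit adds `alpha / num_units` channels) is one possible way to widen gradually, and `base`, `alpha`, and both function names are illustrative rather than taken from the abstract.

```python
def stepwise_widths(base, num_units, stages=3):
    """ResNet-style schedule: the channel count doubles at each downsampling stage."""
    per_stage = max(num_units // stages, 1)
    return [base * 2 ** min(i // per_stage, stages - 1) for i in range(num_units)]

def gradual_widths(base, num_units, alpha):
    """Pyramidal schedule: every unit widens by alpha / num_units channels."""
    return [int(base + alpha * (k + 1) / num_units) for k in range(num_units)]
```

With `base=16`, six units, and `alpha=48`, the stepwise schedule is 16, 16, 32, 32, 64, 64, while the gradual one is 24, 32, 40, 48, 56, 64: both end at the same width, but the second spreads the increase over every unit.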

Proceedings Article•DOI•

Deep & Cross Network for Ad Click Predictions

Ruoxi Wang¹, Bin Fu², Gang Fu², Mingliang Wang²•Institutions (2)

Stanford University¹, Google²

14 Aug 2017

TL;DR: This paper proposes the Deep & Cross Network (DCN), which keeps the benefits of a DNN model, and beyond that, it introduces a novel cross network that is more efficient in learning certain bounded-degree feature interactions.

Abstract: Feature engineering has been the key to the success of many prediction models. However, the process is nontrivial and often requires manual feature engineering or exhaustive searching. DNNs are able to automatically learn feature interactions; however, they generate all the interactions implicitly, and are not necessarily efficient in learning all types of cross features. In this paper, we propose the Deep & Cross Network (DCN) which keeps the benefits of a DNN model, and beyond that, it introduces a novel cross network that is more efficient in learning certain bounded-degree feature interactions. In particular, DCN explicitly applies feature crossing at each layer, requires no manual feature engineering, and adds negligible extra complexity to the DNN model. Our experimental results have demonstrated its superiority over the state-of-the-art algorithms on the CTR prediction dataset and dense classification dataset, in terms of both model accuracy and memory usage.
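
A minimal sketch of explicit feature crossing at each layer follows. The abstract does not give the layer formula; the rule used here, x_{l+1} = x_0 (x_l · w_l) + b_l + x_l, is the form usually quoted for the cross network and should be read as an assumption, as should the names `cross_layer` and `cross_network`. Each layer adds one weight vector and one bias vector of the input dimension, and each layer raises the highest polynomial degree of the interactions by one.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One cross layer: scale x0 by the scalar xl . w, then add bias and a residual."""
    return x0 * np.dot(xl, w) + b + xl

def cross_network(x0, weights, biases):
    """Stack cross layers; every layer crosses its input with the original x0."""
    x = x0
    for w, b in zip(weights, biases):
        x = cross_layer(x0, x, w, b)
    return x
```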

Journal Article•DOI•

Overview of deep learning in medical imaging

Kenji Suzuki¹, Kenji Suzuki²•Institutions (2)

Illinois Institute of Technology¹, Tokyo Institute of Technology²

08 Jul 2017-Radiological Physics and Technology

TL;DR: It is shown that ML with feature input (or feature-based ML) was dominant before the introduction of deep learning, and that the major and essential difference between ML before and after deep learning is the learning of image data directly without object segmentation or feature extraction; thus, it is the source of the power of deep learning.

Abstract: The use of machine learning (ML) has been increasing rapidly in the medical imaging field, including computer-aided diagnosis (CAD), radiomics, and medical image analysis. Recently, an ML area called deep learning emerged in the computer vision field and became very popular in many fields. It started from an event in late 2012, when a deep-learning approach based on a convolutional neural network (CNN) won an overwhelming victory in the best-known worldwide computer vision competition, ImageNet Classification. Since then, researchers in virtually all fields, including medical imaging, have started actively participating in the explosively growing field of deep learning. In this paper, the area of deep learning in medical imaging is overviewed, including (1) what was changed in machine learning before and after the introduction of deep learning, (2) what is the source of the power of deep learning, (3) two major deep-learning models: a massive-training artificial neural network (MTANN) and a convolutional neural network (CNN), (4) similarities and differences between the two models, and (5) their applications to medical imaging. This review shows that ML with feature input (or feature-based ML) was dominant before the introduction of deep learning, and that the major and essential difference between ML before and after deep learning is the learning of image data directly without object segmentation or feature extraction; thus, it is the source of the power of deep learning, although the depth of the model is an important attribute. The class of ML with image input (or image-based ML) including deep learning has a long history, but recently gained popularity due to the use of the new terminology, deep learning. There are two major models in this class of ML in medical imaging, MTANN and CNN, which have similarities as well as several differences. 
In our experience, MTANNs were substantially more efficient in their development, had a higher performance, and required a lesser number of training cases than did CNNs. “Deep learning”, or ML with image input, in medical imaging is an explosively growing, promising field. It is expected that ML with image input will be the mainstream area in the field of medical imaging in the next few decades.

Proceedings Article•DOI•

Harmonic Networks: Deep Translation and Rotation Equivariance

Daniel E. Worrall¹, Stephan J. Garbin¹, Daniyar Turmukhambetov¹, Gabriel J. Brostow¹•Institutions (1)

University College London¹

01 Jul 2017

TL;DR: H-Nets are presented, a CNN exhibiting equivariance to patch-wise translation and 360-rotation, and it is demonstrated that their layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization.

Abstract: Translating or rotating an input image should not affect the results of many computer vision tasks. Convolutional neural networks (CNNs) are already translation equivariant: input image translations produce proportionate feature map translations. This is not the case for rotations. Global rotation equivariance is typically sought through data augmentation, but patch-wise equivariance is more difficult. We present Harmonic Networks or H-Nets, a CNN exhibiting equivariance to patch-wise translation and 360-rotation. We achieve this by replacing regular CNN filters with circular harmonics, returning a maximal response and orientation for every receptive field patch. H-Nets use a rich, parameter-efficient and fixed computational complexity representation, and we show that deep feature maps within the network encode complicated rotational invariants. We demonstrate that our layers are general enough to be used in conjunction with the latest architectures and techniques, such as deep supervision and batch normalization. We also achieve state-of-the-art classification on rotated-MNIST, and competitive results on other benchmark challenges.

Proceedings Article•DOI•

Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose

Georgios Pavlakos¹, Xiaowei Zhou¹, Konstantinos G. Derpanis², Kostas Daniilidis¹•Institutions (2)

University of Pennsylvania¹, Ryerson University²

01 Jul 2017

TL;DR: This paper proposes a fine discretization of the 3D space around the subject and trains a ConvNet to predict per-voxel likelihoods for each joint.

Abstract: This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach outperforms all state-of-the-art methods on standard benchmarks achieving a relative error reduction greater than 30% on average. Additionally, we investigate using our volumetric representation in a related architecture which is suboptimal compared to our end-to-end approach, but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available, and allows us to present compelling results for in-the-wild images.
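
Reading a pose out of a volumetric representation can be sketched as follows. The example assumes the network output is an array of shape (joints, depth, height, width) holding per-voxel likelihoods, and that each joint is placed at its most likely voxel; the function name `joints_from_volumes` and the argmax decoding are illustrative choices, not details stated in the abstract.

```python
import numpy as np

def joints_from_volumes(volumes):
    """Return the (z, y, x) index of the most likely voxel for each joint.

    volumes: array of shape (J, D, H, W) with one likelihood volume per joint.
    """
    num_joints = volumes.shape[0]
    flat_best = volumes.reshape(num_joints, -1).argmax(axis=1)
    zyx = np.unravel_index(flat_best, volumes.shape[1:])
    return np.stack(zyx, axis=1)  # shape (J, 3)
```

Converting voxel indices to metric coordinates would additionally need the extent of the discretized space around the subject, which is left out here.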

Journal Article•DOI•

Going Deeper With Contextual CNN for Hyperspectral Image Classification

Hyungtae Lee¹, Heesung Kwon²•Institutions (2)

Booz Allen Hamilton¹, United States Army Research Laboratory²

11 Jul 2017-IEEE Transactions on Image Processing

TL;DR: A novel deep convolutional neural network that is deeper and wider than other existing deep networks for hyperspectral image classification, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors.

Abstract: In this paper, we describe a novel deep convolutional neural network (CNN) that is deeper and wider than other existing deep networks for hyperspectral image classification. Unlike current state-of-the-art approaches in CNN-based hyperspectral image classification, the proposed network, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors. The joint exploitation of the spatio-spectral information is achieved by a multi-scale convolutional filter bank used as an initial component of the proposed CNN pipeline. The initial spatial and spectral feature maps obtained from the multi-scale filter bank are then combined together to form a joint spatio-spectral feature map. The joint feature map representing rich spectral and spatial properties of the hyperspectral image is then fed through a fully convolutional network that eventually predicts the corresponding label of each pixel vector. The proposed approach is tested on three benchmark data sets: the Indian Pines data set, the Salinas data set, and the University of Pavia data set. Performance comparison shows enhanced classification performance of the proposed approach over the current state-of-the-art on the three data sets.

Proceedings Article•DOI•

Flow-Guided Feature Aggregation for Video Object Detection

Xizhou Zhu¹, Yujie Wang, Jifeng Dai², Lu Yuan², Yichen Wei²•Institutions (2)

University of Science and Technology of China¹, Microsoft²

01 Oct 2017

TL;DR: This work presents flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection that improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy.

Abstract: Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level instead. It improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy. Our method significantly improves upon strong single-frame baselines in ImageNet VID [33], especially for more challenging fast moving objects. Our framework is principled, and on par with the best engineered systems winning the ImageNet VID challenges 2016, without additional bells-and-whistles. The code would be released.

Journal Article•DOI•

Accurate Object Localization in Remote Sensing Images Based on Convolutional Neural Networks

Yang Long¹, Yiping Gong¹, Zhifeng Xiao¹, Qing Liu•Institutions (1)

Wuhan University¹

19 Jan 2017-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: This paper proposes a new object localization framework, which can be divided into three processes: region proposal, classification, and accurate object localization process, and a dimension-reduction model performs better than the retrained and fine-tuned models and the detection precision of the combined CNN model is much higher than that of any single model.

Abstract: In this paper, we focus on tackling the problem of automatic accurate localization of detected objects in high-resolution remote sensing images. The two major problems for object localization in remote sensing images caused by the complex context information such images contain are achieving generalizability of the features used to describe objects and achieving accurate object locations. To address these challenges, we propose a new object localization framework, which can be divided into three processes: region proposal, classification, and accurate object localization process. First, a region proposal method is used to generate candidate regions with the aim of detecting all objects of interest within these images. Then, generic image features from a local image corresponding to each region proposal are extracted by a combination model of 2-D reduction convolutional neural networks (CNNs). Finally, to improve the location accuracy, we propose an unsupervised score-based bounding box regression (USB-BBR) algorithm, combined with a nonmaximum suppression algorithm to optimize the bounding boxes of regions that are detected as objects. Experiments show that the dimension-reduction model performs better than the retrained and fine-tuned models and the detection precision of the combined CNN model is much higher than that of any single model. Also, our proposed USB-BBR algorithm can more accurately locate objects within an image. Compared with traditional feature extraction methods, such as elliptic Fourier transform-based histogram of oriented gradients and local binary pattern histogram Fourier, our proposed localization framework shows robustness when dealing with different complex backgrounds.

Posted Content•

Pose-driven Deep Convolutional Model for Person Re-identification

Chi Su¹, Jianing Li¹, Shiliang Zhang¹, Junliang Xing², Wen Gao¹, Qi Tian³•Institutions (3)

Peking University¹, Chinese Academy of Sciences², University of Texas at San Antonio³

25 Sep 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A Pose-driven Deep Convolutional (PDC) model is proposed to learn improved feature extraction and matching models from end to end and explicitly leverages the human part cues to alleviate the pose variations and learn robust feature representations from both the global image and different local parts.

Abstract: Feature extraction and matching are two crucial components in person Re-Identification (ReID). The large pose deformations and the complex view variations exhibited by the captured person images significantly increase the difficulty of learning and matching of the features from person images. To overcome these difficulties, in this work we propose a Pose-driven Deep Convolutional (PDC) model to learn improved feature extraction and matching models from end to end. Our deep architecture explicitly leverages the human part cues to alleviate the pose variations and learn robust feature representations from both the global image and different local parts. To match the features from global human body and local body parts, a pose driven feature weighting sub-network is further designed to learn adaptive feature fusions. Extensive experimental analyses and results on three popular datasets demonstrate significant performance improvements of our model over all published state-of-the-art methods.

Journal Article•DOI•

A convolutional neural network based feature learning and fault diagnosis method for the condition monitoring of gearbox

Luyang Jing¹, Ming Zhao², Pin Li³, Xiaoqiang Xu²•Institutions (3)

Tianjin University¹, Xi'an Jiaotong University², University of Cincinnati³

01 Dec 2017-Measurement

TL;DR: Developing a convolutional neural network to learn features directly from frequency data of vibration signals and testing the different performance of feature learning from raw data, frequency spectrum and combined time-frequency data demonstrate that the proposed method is able to learn features adaptively from frequency data and achieve higher diagnosis accuracy than other comparative methods.

Proceedings Article•

Universal Style Transfer via Feature Transforms

Yijun Li¹, Chen Fang², Jimei Yang², Zhaowen Wang², Xin Lu¹, Ming-Hsuan Yang¹•Institutions (2)

University of California, Merced¹, Adobe Systems²

23 May 2017

TL;DR: In this paper, a pair of feature transforms, whitening and coloring, are embedded to an image reconstruction network to reflect direct matching of feature covariance of the content image to a given style image, which shares similar spirit with the optimization of Gram matrix based cost in neural style transfer.

Abstract: Universal style transfer aims to transfer arbitrary visual styles to content images. Existing feed-forward based methods, while enjoying the inference efficiency, are mainly limited by inability of generalizing to unseen styles or compromised visual quality. In this paper, we present a simple yet effective method that tackles these limitations without training on any pre-defined styles. The key ingredient of our method is a pair of feature transforms, whitening and coloring, that are embedded to an image reconstruction network. The whitening and coloring transforms reflect direct matching of feature covariance of the content image to a given style image, which shares similar spirits with the optimization of Gram matrix based cost in neural style transfer. We demonstrate the effectiveness of our algorithm by generating high-quality stylized images with comparisons to a number of recent methods. We also analyze our method by visualizing the whitened features and synthesizing textures by simple feature coloring.
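
The whitening and coloring pair described here can be sketched directly on feature matrices. The example below is an illustration under assumptions, not the authors' code: features are taken as arrays of shape (channels, positions), covariances are estimated with an N - 1 denominator, square roots come from an eigendecomposition with a small eigenvalue floor, and the names `wct` and `_sqrt_and_inv_sqrt` are hypothetical. Extracting the features with an encoder and decoding the result back to an image are outside the sketch.

```python
import numpy as np

def _sqrt_and_inv_sqrt(cov, floor=1e-12):
    """Matrix square root and inverse square root of a symmetric PSD matrix."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, floor, None)  # guard against zero eigenvalues
    sqrt = (vecs * np.sqrt(vals)) @ vecs.T
    inv_sqrt = (vecs / np.sqrt(vals)) @ vecs.T
    return sqrt, inv_sqrt

def wct(content, style):
    """Whiten content features, then color them with the style covariance.

    content: (C, Nc) array, style: (C, Ns) array; returns a (C, Nc) array.
    """
    mean_c = content.mean(axis=1, keepdims=True)
    mean_s = style.mean(axis=1, keepdims=True)
    fc = content - mean_c
    fs = style - mean_s
    cov_c = fc @ fc.T / (fc.shape[1] - 1)
    cov_s = fs @ fs.T / (fs.shape[1] - 1)
    _, inv_sqrt_c = _sqrt_and_inv_sqrt(cov_c)
    sqrt_s, _ = _sqrt_and_inv_sqrt(cov_s)
    whitened = inv_sqrt_c @ fc           # covariance becomes the identity
    return sqrt_s @ whitened + mean_s    # covariance and mean now match the style
```

When the content covariance is full rank, the output has the style features' mean and covariance while keeping the spatial arrangement of the content features.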
