In the real world, a realistic setting for computer vision or multimedia recognition problems is that we have some classes containing lots of training data and many classes contain a small amount of training data. Therefore, how to use frequent classes to help learning rare classes for which it is harder to collect the training data is an open question. Learning with Shared Information is an emerging topic in machine learning, computer vision and multimedia analysis. There are different level of components that can be shared during concept modeling and machine learning stages, such as sharing generic object parts, sharing attributes, sharing transformations, sharing regularization parameters and sharing training examples, etc. Regarding the specific methods, multi-task learning, transfer learning and deep learning can be seen as using different strategies to share information. These learning with shared information methods are very effective in solving real-world large-scale problems. This special issue aims at gathering the recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis. Both state-of-the-art works, as well as literature reviews, are welcome for submission. Papers addressing interesting real-world computer vision and multimedia applications are especially encouraged. Topics of interest include, but are not limited to:  • Multi-task learning or transfer learning for large-scale computer vision and multimedia analysis • Deep learning for large-scale computer vision and multimedia analysis • Multi-modal approach for large-scale computer vision and multimedia analysis • Different sharing strategies, e.g., sharing generic object parts, sharing attributes, sharing transformations, sharing regularization parameters and sharing training examples, • Real-world computer vision and multimedia applications based on learning with shared information, e.g., event detection, object recognition, object detection, action recognition, human head pose estimation, object tracking, location-based services, semantic indexing. • New datasets and metrics to evaluate the benefit of the proposed sharing ability for the specific computer vision or multimedia problem. • Survey papers regarding the topic of learning with shared information.  Authors who are unsure whether their planned submission is in scope may contact the guest editors prior to the submission deadline with an abstract, in order to receive feedback.

IEEE transactions on pattern analysis and machine intelligence

Image segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. Various algorithms for image segmentation have been developed in the literature. Recently, due to the success of deep learning models in a wide range of vision applications, there has been a substantial amount of works aimed at developing image segmentation approaches using deep learning models. In this survey, we provide a comprehensive review of the literature at the time of this writing, covering a broad spectrum of pioneering works for semantic and instance-level segmentation, including fully convolutional pixel-labeling networks, encoder-decoder architectures, multi-scale and pyramid based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the similarity, strengths and challenges of these deep learning models, examine the most widely used datasets, report performances, and discuss promising future research directions in this area.

Image Segmentation Using Deep Learning: A Survey

Domain-specific hardware is becoming a promising topic in the backdrop of improvement slow down for general-purpose processors due to the foreseeable end of Moore’s Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved by paying the cost of hungry memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the amount of related works is incredibly huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey on the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint-way use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate various methods, and confidently get started in the right way.

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolving the conflicts between semantic and instance segmentation. Besides, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves state-of-the-art performance with much faster inference. Code has been made available at: https://github.com/uber-research/UPSNet

/pdf/upsnet-a-unified-panoptic-segmentation-network-14lr6mjvi9.pdf

UPSNet: A Unified Panoptic Segmentation Network

Many modern object detectors demonstrate outstanding performances by using the mechanism of looking and thinking twice. In this paper, we explore this mechanism in the backbone design for object detection. At the macro level, we propose Recursive Feature Pyramid, which incorporates extra feedback connections from Feature Pyramid Networks into the bottom-up backbone layers. At the micro level, we propose Switchable Atrous Convolution, which convolves the features with different atrous rates and gathers the results using switch functions. Combining them results in DetectoRS, which significantly improves the performances of object detection. On COCO test-dev, DetectoRS achieves state-of-the-art 55.7% box AP for object detection, 48.5% mask AP for instance segmentation, and 50.0% PQ for panoptic segmentation. The code is made publicly available.

DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution.

This paper studies panoptic segmentation, a recently proposed task which segments foreground (FG) objects at the instance level as well as background (BG) contents at the semantic level. Existing methods mostly dealt with these two problems separately, but in this paper, we reveal the underlying relationship between them, in particular, FG objects provide complementary cues to assist BG understanding. Our approach, named the Attention-guided Unified Network (AUNet), is a unified framework with two branches for FG and BG segmentation simultaneously. Two sources of attentions are added to the BG branch, namely, RPN and FG segmentation mask to provide object-level and pixel-level attentions, respectively. Our approach is generalized to different backbones with consistent accuracy gain in both FG and BG segmentation, and also sets new state-of-the-arts both in the MS-COCO (46.5% PQ) and Cityscapes (59.0% PQ) benchmarks.

/pdf/attention-guided-unified-network-for-panoptic-segmentation-ksj4gefuyq.pdf

Attention-Guided Unified Network for Panoptic Segmentation

Recently, numerous handcrafted and searched networks have been applied for semantic segmentation. However, previous works intend to handle inputs with various scales in pre-defined static architectures, such as FCN, U-Net, and DeepLab series. This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing. The proposed framework generates data-dependent routes, adapting to the scale distribution of each image. To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly. In addition, the computational cost can be further reduced in an end-to-end manner by giving budget constraints to the gating function. We further relax the network level routing space to support multi-path propagations and skip-connections in each forward, bringing substantial network capacity. To demonstrate the superiority of the dynamic property, we compare with several static architectures, which can be modeled as special cases in the routing space. Extensive experiments are conducted on Cityscapes and PASCAL VOC 2012 to illustrate the effectiveness of the dynamic framework. Code is available at https://github.com/yanwei-li/DynamicRouting.

/pdf/learning-dynamic-routing-for-semantic-segmentation-28fw0y8tyz.pdf

Learning Dynamic Routing for Semantic Segmentation

Attention-guided Unified Network for Panoptic Segmentation

In recent years, the application of deep learning based on deep convolutional neural networks has gained great success in face detection. However, one of the remaining open challenges is the detection of small-scaled faces. The depth of the convolutional network can cause the projected feature map for small faces to be quickly shrunk, and most detection approaches with scale invariant can hardly handle less than $15\times 15$ pixel faces. To solve this problem, we propose a different scales face detector (DSFD) based on Faster R-CNN. The new network can improve the precision of face detection while performing as real-time a Faster R-CNN. First, an efficient multitask region proposal network (RPN), combined with boosting face detection, is developed to obtain the human face ROI. Setting the ROI as a constraint, an anchor is inhomogeneously produced on the top feature map by the multitask RPN. A human face proposal is extracted through the anchor combined with facial landmarks. Then, a parallel-type Fast R-CNN network is proposed based on the proposal scale. According to the different percentages they cover on the images, the proposals are assigned to three corresponding Fast R-CNN networks. The three networks are separated through the proposal scales and differ from each other in the weight of feature map concatenation. A variety of strategies is introduced in our face detection network, including multitask learning, feature pyramid, and feature concatenation. Compared to state-of-the-art face detection methods such as UnitBox, HyperFace, FastCNN, the proposed DSFD method achieves promising performance on popular benchmarks including FDDB, AFW, PASCAL faces, and WIDER FACE.

Face Detection With Different Scales Based on Faster R-CNN

In this paper, we propose a robust visual detection–learning–tracking framework for autonomous aerial refueling of unmanned aerial vehicles. Two classifiers (D-classifier and T-classifier) are defined in the proposed framework. The D-classifier is a robust linear support vector machine (SVM) classifier trained offline for detecting the drogue object of aerial refueling and a low-dimensional normalized robust local binary pattern feature is proposed to describe the drogue object in the D-classifier. The T-classifier is a state-based structured SVM classifier trained online for tracking the drogue object. A combination strategy between the D-classifier and the T-classifier is proposed in the framework. The D-classifier is used to assess if some positive support vectors in the T-classifier are required to be replaced by positive examples with density peaks. The experimental results on several challenging video sequences validate the effectiveness and robustness of our proposed framework.

Xingang Wang

Papers

Attention-Guided Unified Network for Panoptic Segmentation

Learning Dynamic Routing for Semantic Segmentation

Attention-guided Unified Network for Panoptic Segmentation

Face Detection With Different Scales Based on Faster R-CNN

Robust Visual Detection–Learning–Tracking Framework for Autonomous Aerial Refueling of UAVs