Proceedings ArticleDOI

Salient Object Detection by Contextual Refinement

14 Jun 2020 - pp. 1464-1472
TL;DR: A novel saliency detection framework with a Contextual Refinement Module (CRM) consisting of two sub-networks, an Object Relation Unit (ORU) and a Scene Context Unit (SCU), which capture complementary contextual information to give a holistic estimation of salient regions.
Abstract: Context plays an important role in the saliency prediction task. In this work, we propose a saliency detection framework that not only extracts visual features but also models two kinds of context: object-object relationships within a single image and scene contextual information. Specifically, we develop a novel saliency detection framework with a Contextual Refinement Module (CRM) consisting of two sub-networks, an Object Relation Unit (ORU) and a Scene Context Unit (SCU). ORU encodes the object-object relationship via a graph-based approach, using object relative positions and object co-occurrence patterns in an image, while SCU incorporates the scene contextual information of an image. ORU and SCU capture complementary contextual information to give a holistic estimation of salient regions. Extensive experiments show the effectiveness of modelling object relations and scene context in boosting the performance of saliency prediction. In particular, our framework outperforms state-of-the-art models on challenging benchmark datasets.
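As a rough illustration of the framework described above, the sketch below fuses the two contextual cues into a saliency map. The sub-network internals (graph reasoning over ROIs in ORU, scene-level encoding in SCU) are stand-ins rather than the paper's implementation, and all layer shapes are assumptions:

import torch
import torch.nn as nn

class ContextualRefinementModule(nn.Module):
    # Illustrative sketch only: the real ORU and SCU are full sub-networks.
    def __init__(self, channels: int):
        super().__init__()
        self.oru = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the Object Relation Unit
        self.scu = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for the Scene Context Unit
        self.fuse = nn.Conv2d(2 * channels, 1, 1)               # merge the complementary cues

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        relation_cues = self.oru(feats)  # object-object relationship context
        scene_cues = self.scu(feats)     # scene-level context
        combined = torch.cat([relation_cues, scene_cues], dim=1)
        return torch.sigmoid(self.fuse(combined))  # per-pixel saliency in [0, 1]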


Citations
Proceedings ArticleDOI
01 Jun 2022
TL;DR: This work designs a new discriminative mask that makes the model attend to fixation and edge regions, and proposes an iterative refinement framework, coined SegMaR, which integrates Segment, Magnify and Reiterate in a multi-stage detection fashion.
Abstract: It is challenging to accurately detect camouflaged objects from their highly similar surroundings. Existing methods mainly adopt a single-stage detection fashion, neglecting that small objects with low-resolution, fine edges require more operations than larger ones. To tackle camouflaged object detection (COD), we take inspiration from human attention coupled with a coarse-to-fine detection strategy, and thereby propose an iterative refinement framework, coined SegMaR, which integrates Segment, Magnify and Reiterate in a multi-stage detection fashion. Specifically, we design a new discriminative mask that makes the model attend to fixation and edge regions. In addition, we leverage an attention-based sampler to magnify the object region progressively without enlarging the image size. Extensive experiments show that our SegMaR achieves remarkable and consistent improvements over other state-of-the-art methods. In particular, we surpass two competitive methods by 7.4% and 20.0%, respectively, averaged over standard evaluation metrics on small camouflaged objects. Additional studies provide more promising insights into SegMaR, including its effectiveness on the discriminative mask and its generalization to other network architectures. Code is available at https://github.com/dlut-dimt/SegMaR.
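A toy version of the segment-magnify-reiterate loop is sketched below. The paper magnifies with an attention-based sampler rather than the hard crop used here, and segment_fn, the padding, and the 0.5 threshold are all assumptions:

import numpy as np

def segment_magnify_reiterate(image, segment_fn, num_stages=3, pad=16):
    # segment_fn: any callable mapping an HxWx3 image to an HxW mask in [0, 1].
    h, w = image.shape[:2]
    mask = segment_fn(image)
    for _ in range(num_stages - 1):
        ys, xs = np.nonzero(mask > 0.5)
        if ys.size == 0:
            break
        # Magnify: crop a padded box around the current estimate and re-segment.
        y0, y1 = max(int(ys.min()) - pad, 0), min(int(ys.max()) + pad, h)
        x0, x1 = max(int(xs.min()) - pad, 0), min(int(xs.max()) + pad, w)
        crop_mask = segment_fn(image[y0:y1, x0:x1])
        # Reiterate: paste the refined prediction back at full resolution.
        mask = np.zeros((h, w), dtype=float)
        mask[y0:y1, x0:x1] = crop_mask
    return mask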

21 citations

Journal ArticleDOI
Zhengyi Liu, Yuan Wang, Yacheng Tan, Wei Li, Yun Xiao 
TL;DR: An Attention Gated Recurrent Unit (AGRU) is proposed for RGB-D saliency detection, which reduces the influence of low-quality depth images and retains more semantic features in the progressive fusion process.
Abstract: RGB-D saliency detection aims to identify the most attractive objects in a pair of color and depth images. However, most existing models adopt the classic U-Net framework, which progressively decodes two-stream features. In this paper, we decode the cross-modal and multi-level features in a unified unit, named Attention Gated Recurrent Unit (AGRU). It can reduce the influence of low-quality depth images and retain more semantic features in the progressive fusion process. Specifically, the features of different modalities and different levels are organized as a sequential input and recurrently fed into AGRU, which consists of a reset gate, an update gate and a memory unit, to be selectively fused and adaptively memorized based on an attention mechanism. Further, a two-stage AGRU serves as the decoder of the RGB-D salient object detection network, named AGRFNet. Owing to its recurrent nature, it achieves strong performance with few parameters. To further improve performance, three auxiliary modules are designed to better fuse semantic information, refine the features of the shallow layers and enhance local detail. Extensive experiments on seven widely used benchmark datasets demonstrate that AGRFNet performs favorably against 18 state-of-the-art RGB-D SOD approaches.
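The reset/update-gate idea behind AGRU can be sketched as a convolutional GRU that consumes cross-modal, multi-level features as a sequence. The attention mechanism and the three auxiliary modules are omitted, and all shapes are assumptions:

import torch
import torch.nn as nn

class GatedRecurrentFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.reset = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.update = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.candidate = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, features: list) -> torch.Tensor:
        # features: e.g. alternating RGB / depth feature maps, level by level.
        state = torch.zeros_like(features[0])
        for x in features:
            xh = torch.cat([x, state], dim=1)
            r = torch.sigmoid(self.reset(xh))    # reset gate: suppress unreliable memory
            z = torch.sigmoid(self.update(xh))   # update gate: how much new evidence to keep
            h_new = torch.tanh(self.candidate(torch.cat([x, r * state], dim=1)))
            state = (1 - z) * state + z * h_new  # selective fusion with adaptive memory
        return state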
Journal ArticleDOI
TL;DR: The DualRefine model uses a deep equilibrium framework to iteratively refine depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry.
Abstract: Self-supervised multi-frame depth estimation achieves high accuracy by computing matching costs of pixel correspondences between adjacent frames, injecting geometric information into the network. These pixel-correspondence candidates are computed based on the relative pose estimates between the frames. Accurate pose predictions are essential for precise matching cost computation, as they influence the epipolar geometry. Furthermore, improved depth estimates can, in turn, be used to align pose estimates. Inspired by traditional structure-from-motion (SfM) principles, we propose the DualRefine model, which tightly couples depth and pose estimation through a feedback loop. Our novel update pipeline uses a deep equilibrium model framework to iteratively refine depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry. Importantly, we use the refined depth estimates and feature maps to compute pose updates at each step. These pose updates gradually alter the epipolar geometry during the refinement process. Experimental results on the KITTI dataset demonstrate competitive depth and odometry prediction performance, surpassing published self-supervised baselines.
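The deep equilibrium idea reduces to iterating an update function to a fixed point; in DualRefine that function would jointly update depth, pose, and a hidden feature state from epipolar matching costs. A generic sketch, where f and z0 are placeholders:

import numpy as np

def fixed_point_refine(f, z0, max_iters=40, tol=1e-4):
    # Iterate z <- f(z) until an equilibrium z* = f(z*) is (approximately) reached.
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iters):
        z_next = np.asarray(f(z))
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

# Example: the Babylonian update converges to sqrt(2).
# fixed_point_refine(lambda z: 0.5 * (z + 2.0 / z), 1.0)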
Journal ArticleDOI
TL;DR: A context feature extraction module refines the rough feature map in the intermediate stage to reduce misclassification of the target object, showing better results than most state-of-the-art methods.
Abstract: Scene segmentation is a very challenging task in which convolutional neural networks have achieved very good results. Current scene segmentation methods often ignore the internal consistency of the target object and fail to make full use of global and local context information, which leads to object misclassification. In addition, most previous work focused on segmenting the main body of the object, while there has been little research on the quality of object edge segmentation. In this article, building on the use of flow information to maintain body consistency, a context feature extraction module is designed to fully exploit the global and local body context of the target object, refining the rough feature map in the intermediate stage and thereby reducing misclassification. Besides, in the proposed edge attention module, the low-level feature map guided by the global feature and the edge feature map with semantic information obtained in the intermediate stage are combined to recover more accurate edge details. As a result, segmentation quality improves for both the noisy body part and the edge details. This paper conducts experiments not only on the classic FCN, PSPNet, and DeepLabv3+ mainstream architectures, but also on the recently proposed real-time SFNet architecture, where the mIoU of both object and boundary improves, verifying the effectiveness of the proposed method. Moreover, to demonstrate robustness, we experiment on three complex scene segmentation datasets, Cityscapes, CamVid, and KITTI, obtaining an mIoU of 80.52% on the Cityscapes validation set, and 71.4% and 56.53% on the CamVid and KITTI test sets, which compares favorably with most state-of-the-art methods.
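The edge attention module described above can be caricatured as a global feature gating a low-level feature map. Channel counts, the 1x1 gate, and the assumption that both inputs share a spatial size are illustrative, not the paper's design:

import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, low_level: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # Assumes global_feat was upsampled to low_level's spatial size.
        attn = torch.sigmoid(self.gate(global_feat))  # where edge detail is likely relevant
        return low_level * attn                       # pass only edge-relevant detail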
References
Posted Content
TL;DR: Faster R-CNN introduces a Region Proposal Network (RPN) to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
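The RPN amounts to a small head on shared convolutional features: a 3x3 conv followed by sibling 1x1 convs predicting, for each of k anchors per position, an objectness score and four box deltas. A minimal sketch; the paper predicts two softmax scores per anchor, whereas a single logit is used here for brevity:

import torch
import torch.nn as nn

class RegionProposalHead(nn.Module):
    def __init__(self, in_channels: int, num_anchors: int = 9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, 512, 3, padding=1)
        self.objectness = nn.Conv2d(512, num_anchors, 1)      # object-vs-background logit per anchor
        self.box_deltas = nn.Conv2d(512, num_anchors * 4, 1)  # (dx, dy, dw, dh) per anchor

    def forward(self, feats: torch.Tensor):
        x = torch.relu(self.shared(feats))
        return self.objectness(x), self.box_deltas(x)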

23,183 citations

Journal ArticleDOI
TL;DR: An object detection system based on mixtures of multiscale deformable part models that is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges is described.
Abstract: We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
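For reference, the latent SVM objective sketched in the abstract has the familiar hinge-loss form, except that the score is a maximum over latent configurations Z(x) (e.g. part placements), with Phi(x, z) the corresponding feature vector:

    \min_{\beta}\; \tfrac{1}{2}\lVert\beta\rVert^{2} + C\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i f_\beta(x_i)\bigr),
    \qquad f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)

Since f_beta is a max of linear functions, the objective becomes convex once z is fixed for the positive examples, which is exactly the alternation the abstract describes.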

10,501 citations


"Salient Object Detection by Context..." refers methods in this paper

  • ...Further, we use Non-Maximum Suppression [6] to choose a fixed number of Regions of Interest (ROIs)....

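For concreteness, a plain NMS routine of the kind referenced above; the box format and thresholds are assumptions:

import numpy as np

def nms(boxes, scores, iou_thresh=0.5, top_k=100):
    # boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    # Greedily keep the highest-scoring box and drop overlapping ones,
    # until top_k boxes (the "fixed number of ROIs") are selected.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-12)
        order = order[1:][iou <= iou_thresh]  # keep only boxes that overlap little
    return keep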

Proceedings Article
12 Dec 2011
TL;DR: This paper considers fully connected CRF models defined on the complete set of pixels in an image and proposes a highly efficient approximate inference algorithm in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels.
Abstract: Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random fields defined over pixels or image regions. While region-level models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models defined on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efficient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are defined by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy.

3,233 citations


"Salient Object Detection by Context..." refers methods in this paper

  • ...To further preserve boundary information and improve spatial coherence, we utilize fully connected Conditional Random Field (CRF) [11] to obtain the final saliency map output....

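A toy mean-field update in the spirit of the CRF refinement above. The fully connected CRF of [11] uses both a Gaussian smoothness kernel and a bilateral appearance kernel with efficient high-dimensional filtering; this sketch keeps a single Gaussian kernel and illustrative parameters:

import numpy as np
from scipy.ndimage import gaussian_filter

def crf_refine(prob, sigma=3.0, compat=3.0, iters=5):
    # prob: (H, W) foreground probability, e.g. a raw saliency map.
    eps = 1e-6
    unary_bg = -np.log(np.clip(1.0 - prob, eps, 1.0))
    unary_fg = -np.log(np.clip(prob, eps, 1.0))
    q = prob.copy()
    for _ in range(iters):
        msg_fg = gaussian_filter(q, sigma)        # smoothed foreground beliefs
        msg_bg = gaussian_filter(1.0 - q, sigma)  # smoothed background beliefs
        # Potts compatibility: each label is penalized by the other label's mass.
        logit_bg = -unary_bg - compat * msg_fg
        logit_fg = -unary_fg - compat * msg_bg
        q = 1.0 / (1.0 + np.exp(logit_bg - logit_fg))  # renormalized foreground belief
    return q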

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work considers both foreground and background cues in a different way, ranking the similarity of image elements to foreground or background cues via graph-based manifold ranking, with saliency defined by the elements' relevance to the given seeds or queries.
Abstract: Most existing bottom-up methods measure the foreground saliency of a pixel or region based on its contrast within a local context or the entire image, whereas a few methods focus on segmenting out background regions and thereby salient objects. Instead of considering the contrast between salient objects and their surrounding regions, we consider both foreground and background cues in a different way. We rank the similarity of the image elements (pixels or regions) to foreground cues or background cues via graph-based manifold ranking. The saliency of the image elements is defined based on their relevance to the given seeds or queries. We represent the image as a closed-loop graph with superpixels as nodes. These nodes are ranked based on their similarity to background and foreground queries, using affinity matrices. Saliency detection is carried out in a two-stage scheme to extract background regions and foreground salient objects efficiently. Experimental results on two large benchmark databases demonstrate that the proposed method performs well against state-of-the-art methods in terms of accuracy and speed. We also create a more difficult benchmark database containing 5,172 images to test the proposed saliency model and make this database publicly available with this paper for further studies in the saliency field.
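The ranking step has a closed form: with a normalized affinity S = D^{-1/2} W D^{-1/2} over superpixel nodes and an indicator vector y of the queries, the relevance scores are f* = (I - alpha S)^{-1} y. A sketch of that core step; the affinity construction and the paper's two-stage scheme are omitted:

import numpy as np

def manifold_rank(W, seeds, alpha=0.99):
    # W: (n, n) symmetric affinity matrix over superpixel nodes.
    # seeds: indices of query nodes, e.g. boundary superpixels as background queries.
    n = W.shape[0]
    d = np.maximum(W.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    y = np.zeros(n)
    y[list(seeds)] = 1.0
    return np.linalg.solve(np.eye(n) - alpha * S, y)  # f* = (I - alpha S)^{-1} y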

2,278 citations


"Salient Object Detection by Context..." refers methods in this paper

  • ...We evaluate our framework on PASCAL-S [14], ECSSD [24], HKU-IS [12], DUTS-TE (DUTS test set) [26] and DUT-OMRON (partitioned for testing) [35] saliency datasets....


  • ...Our proposed architecture is trained on DUT-OMRON dataset [35]....


Journal ArticleDOI
TL;DR: On the basis of a Bayesian framework, an original approach to attentional guidance by global scene context is presented that combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
Abstract: Many experiments have shown that the human visual system makes extensive use of contextual information for facilitating object search in natural scenes. However, the question of how to formally model contextual influences is still open. On the basis of a Bayesian framework, the authors present an original approach of attentional guidance by global scene context. The model comprises 2 parallel pathways; one pathway computes local features (saliency) and the other computes global (scene-centered) features. The contextual guidance model of attention combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
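At its simplest, the model's two parallel pathways combine multiplicatively: bottom-up saliency is modulated by a scene-context prior over image locations. The sketch below keeps only that interaction; the actual model is Bayesian, dividing by the likelihood of local features and learning the prior from global scene features:

import numpy as np

def contextual_guidance(local_saliency: np.ndarray, context_prior: np.ndarray) -> np.ndarray:
    # Both inputs are (H, W) maps; the prior encodes where the target class
    # tends to appear given the scene gist (e.g. pedestrians near the horizon).
    guided = local_saliency * context_prior
    return guided / max(float(guided.max()), 1e-12)  # normalize to [0, 1]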

1,613 citations


"Salient Object Detection by Context..." refers background in this paper

  • ...In an image, contextual information determines the relative importance of objects in the image, which in turn determines the saliency of an object [25]....
