Journal ArticleDOI

Improving Video Saliency Detection via Localized Estimation and Spatiotemporal Refinement

TL;DR: The experimental results demonstrate that the proposed framework is able to consistently and significantly improve the saliency detection performance of various video saliency models, thereby achieving state-of-the-art performance.
Abstract: Video saliency detection aims to pop out the most salient regions in every frame of a video. To date, many efforts have been made on video saliency detection from various aspects. Unfortunately, existing video saliency models are very likely to fail in challenging videos with complicated motions and complex scenes. Therefore, in this paper, we propose a novel framework to improve the saliency detection results generated by existing video saliency models. The proposed framework consists of three key steps: localized estimation, spatiotemporal refinement, and saliency update. Specifically, the initial saliency map of each frame in a video is first generated by an existing saliency model. Then, by considering the temporal consistency and strong correlation among adjacent frames, localized estimation models, obtained by training a random forest regressor within a local temporal window, are employed to generate a temporary saliency map. Finally, by taking the appearance and motion information of salient objects into consideration, the spatiotemporal refinement step is deployed to further improve the temporary saliency map and generate the final saliency map. Furthermore, this improved saliency map is then utilized to update the initial saliency map and provide reliable cues for saliency detection in the next frame. Experimental results on four challenging datasets demonstrate that the proposed framework consistently and significantly improves the saliency detection performance of various video saliency models, thereby achieving state-of-the-art performance.
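As a rough illustration of the three steps above, the sketch below trains a random forest regressor on superpixel-level features from a short temporal window (localized estimation), blends its prediction with the frame's initial map as a stand-in for the spatiotemporal refinement, and feeds the result back as the saliency update for the next frame. It is not the authors' implementation; the feature extraction, the refinement rule, and all parameter values are simplifying assumptions.

```python
# Illustrative sketch of the localized-estimation / refinement / update loop.
# Superpixel segmentation, feature extraction, and the paper's actual
# refinement rules are replaced by simple placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def localized_estimation(features, saliency, t, window=3):
    """Train a regressor on the previous `window` frames and predict frame t."""
    lo = max(0, t - window)
    X = np.vstack(features[lo:t])           # superpixel features of past frames
    y = np.concatenate(saliency[lo:t])      # their (updated) saliency values
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model.predict(features[t])       # temporary saliency for frame t

def spatiotemporal_refinement(temp_saliency, init_map, alpha=0.5):
    """Placeholder refinement: blend the regressor output with the frame's
    initial map; the paper's refinement also uses appearance and motion cues."""
    return alpha * temp_saliency + (1.0 - alpha) * init_map

def run_framework(features, init_saliency, window=3):
    """features[t], init_saliency[t]: per-superpixel arrays for frame t."""
    saliency = [init_saliency[0]]
    for t in range(1, len(features)):
        temp = localized_estimation(features, init_saliency, t, window)
        final = spatiotemporal_refinement(temp, init_saliency[t])
        init_saliency[t] = final            # saliency update for the next frame
        saliency.append(final)
    return saliency
```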
Citations
Proceedings ArticleDOI
01 Jun 2019
TL;DR: A visual-attention-consistent Densely Annotated VSOD (DAVSOD) dataset, which contains 226 videos with 23,938 frames covering diverse realistic scenes, objects, instances, and motions, and a baseline model equipped with a saliency-shift-aware convLSTM, which can efficiently capture video saliency dynamics by learning human attention-shift behavior, are proposed.
Abstract: The last decade has witnessed a growing interest in video salient object detection (VSOD). However, the research community has long lacked a well-established VSOD dataset representative of real dynamic scenes with high-quality annotations. To address this issue, we elaborately collected a visual-attention-consistent Densely Annotated VSOD (DAVSOD) dataset, which contains 226 videos with 23,938 frames that cover diverse realistic scenes, objects, instances, and motions. With corresponding real human eye-fixation data, we obtain precise ground truths. This is the first work that explicitly emphasizes the challenge of saliency shift, i.e., that the salient object(s) in a video may dynamically change. To further contribute a complete benchmark to the community, we systematically assess 17 representative VSOD algorithms over seven existing VSOD datasets and our DAVSOD, with ~84K frames in total (the largest scale). Using three well-known metrics, we then present a comprehensive and insightful performance analysis. Furthermore, we propose a baseline model equipped with a saliency-shift-aware convLSTM, which can efficiently capture video saliency dynamics by learning human attention-shift behavior. Extensive experiments open up promising future directions for model development and comparison.
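The recurrent backbone mentioned above is a convolutional LSTM. The sketch below shows only a plain ConvLSTM cell in PyTorch, without the saliency-shift-aware attention described in the paper; channel sizes and the usage loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Plain convolutional LSTM cell: the recurrent building block that a
    saliency-shift-aware module would extend."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the input, forget, output, and candidate gates.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state                               # hidden and cell feature maps
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)              # update cell state
        h = o * torch.tanh(c)                      # new hidden state
        return h, (h, c)

# Usage: carry the state across video frames of saliency features.
cell = ConvLSTMCell(in_ch=32, hidden_ch=32)
h = torch.zeros(1, 32, 56, 56)
state = (h, torch.zeros_like(h))
for frame_feat in torch.randn(5, 1, 32, 56, 56):   # 5 frames of toy features
    out, state = cell(frame_feat, state)
```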

431 citations


Cites background from "Improving Video Saliency Detection ..."

  • ...Since [3, 7, 11, 33, 44, 47, 68, 93] did not release implementations, corresponding PCTs are borrowed from their papers or provided by authors....

  • Benchmark table entry: LESR [93], 2018, TMM, localized estimation, spatiotemporal....

Journal ArticleDOI
TL;DR: A novel Information Conversion Network (ICNet) is proposed for RGB-D based SOD, employing a siamese encoder-decoder structure together with an Information Conversion Module that contains concatenation operations and correlation layers, and a Cross-modal Depth-weighted Combination block to discriminate the cross-modal features from different sources and to enhance RGB features with depth features at each level.
Abstract: RGB-D based salient object detection (SOD) methods leverage the depth map as valuable complementary information for better SOD performance. Previous methods mainly exploit the correlation between the RGB image and the depth map in three fusion domains: input images, extracted features, and output results. However, these fusion strategies cannot fully capture the complex correlation between the RGB image and the depth map. Besides, these methods do not fully explore the cross-modal complementarity and the cross-level continuity of information, and they treat information from different sources without discrimination. In this paper, to address these problems, we propose a novel Information Conversion Network (ICNet) for RGB-D based SOD that employs a siamese structure with an encoder-decoder architecture. To fuse high-level RGB and depth features in an interactive and adaptive way, we propose a novel Information Conversion Module (ICM), which contains concatenation operations and correlation layers. Furthermore, we design a Cross-modal Depth-weighted Combination (CDC) block to discriminate the cross-modal features from different sources and to enhance RGB features with depth features at each level. Extensive experiments on five commonly tested datasets demonstrate the superiority of our ICNet over 15 state-of-the-art RGB-D based SOD methods and validate the effectiveness of the proposed ICM and CDC block.
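To make the depth-weighting idea concrete, here is a hedged PyTorch sketch in the spirit of the CDC block: depth features produce a gate that re-weights the RGB features at the same level before fusion. It is not the authors' exact design; all layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class DepthWeightedCombination(nn.Module):
    """Rough sketch of a cross-modal depth-weighted combination: depth features
    produce a spatial gate that re-weights the RGB features at the same level."""
    def __init__(self, channels):
        super().__init__()
        self.depth_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),                       # per-pixel, per-channel weights
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):
        weighted_rgb = rgb_feat * self.depth_gate(depth_feat)  # enhance RGB with depth
        return self.fuse(torch.cat([weighted_rgb, depth_feat], dim=1))

# Usage with same-shaped RGB and depth feature maps from one encoder level.
block = DepthWeightedCombination(channels=64)
fused = block(torch.randn(1, 64, 44, 44), torch.randn(1, 64, 44, 44))
```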

180 citations

Journal ArticleDOI
TL;DR: A composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end is used to enhance the fused spatiotemporal features by emphasizing important features at multiple scales.
Abstract: This paper proposes a novel residual attentive learning network architecture for predicting dynamic eye-fixation maps. The proposed model emphasizes two essential issues, i.e., effective spatiotemporal feature integration and multi-scale saliency learning. For the first problem, appearance and motion streams are tightly coupled via dense residual cross connections, which integrate appearance information with multi-layer, comprehensive motion features in a residual and dense way. Beyond traditional two-stream models that learn appearance and motion features separately, such a design allows early, multi-path information exchange between different domains, leading to a unified and powerful spatiotemporal learning architecture. For the second problem, we propose a composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end. It is used to enhance the fused spatiotemporal features by emphasizing important features at multiple scales. A lightweight convolutional Gated Recurrent Unit (convGRU), which is flexible for small-training-data situations, is used for long-term temporal characteristics modeling. Extensive experiments over four benchmark datasets clearly demonstrate the advantage of the proposed video saliency model over other competitors and the effectiveness of each component of our network. Our code and all the results will be available at https://github.com/ashleylqx/STRA-Net .
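A minimal PyTorch sketch of a composite attention of this flavor is given below: several dilated convolutions provide multi-scale local attention maps and a pooled branch provides a global (channel-wise) prior. This is an interpretation, not the paper's architecture; dilation rates and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CompositeAttention(nn.Module):
    """Sketch of a composite attention: multi-scale local attention maps plus a
    global attention prior, used to re-weight fused spatiotemporal features."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # One single-channel attention map per dilation rate (local attention).
        self.local = nn.ModuleList(
            nn.Conv2d(channels, 1, 3, padding=d, dilation=d) for d in dilations
        )
        self.global_prior = nn.Sequential(          # channel-wise global attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat):
        # Average the local attention maps computed at several dilation rates.
        local_att = torch.sigmoid(
            torch.stack([conv(feat) for conv in self.local], dim=0).mean(dim=0)
        )
        return feat * local_att * self.global_prior(feat)

att = CompositeAttention(channels=64)
enhanced = att(torch.randn(1, 64, 28, 28))
```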

113 citations


Cites background from "Improving Video Saliency Detection ..."

  • ...In striking contrast to the significant advance of static visual attention prediction in recent years (back to [5]), there are only very few deep learning based visual attention models [17]–[21] that are specially designed for modeling eye fixation during dynamic free-viewing....

Journal ArticleDOI
TL;DR: This paper proposes to utilize supervised deep convolutional neural networks to take full advantage of the long-term spatial-temporal information in order to improve the video saliency detection performance.
Abstract: This paper proposes to utilize supervised deep convolutional neural networks to take full advantage of long-term spatial-temporal information in order to improve video saliency detection performance. Conventional methods, which rely solely on temporally neighboring frames, can easily encounter transient failure cases when the spatial-temporal saliency clues are untrustworthy for a long period. To tackle this limitation, we first identify those beyond-scope frames with trustworthy long-term saliency clues and then align them with the current problem domain for improved video saliency detection.
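One way to picture the selection of trustworthy beyond-scope frames is a per-frame confidence score over the saliency map. The heuristic below (foreground/background contrast after thresholding) is purely illustrative and is not the criterion used in the paper.

```python
import numpy as np

def saliency_confidence(sal_map, thresh=0.5):
    """Illustrative trust score: contrast between the mean saliency of the
    thresholded foreground and that of the background."""
    fg = sal_map >= thresh * sal_map.max()
    if fg.all() or not fg.any():
        return 0.0
    return float(sal_map[fg].mean() - sal_map[~fg].mean())

def select_trustworthy_frames(saliency_maps, top_k=5):
    """Pick the frames whose saliency maps look most trustworthy, to serve as
    long-term references for the current frame."""
    scores = [saliency_confidence(m) for m in saliency_maps]
    return sorted(np.argsort(scores)[-top_k:].tolist())
```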

80 citations


Cites background from "Improving Video Saliency Detection ..."

  • ...wise alignments [15], the MRF-guided metric learning [33], and the non-local random forest regressor [34] were proposed....

Journal ArticleDOI
TL;DR: This paper attempts to integrate a novel depth-quality-aware subnet into the classic bi-stream structure in order to assess the depth quality prior to conducting the selective RGB-D fusion, achieving a much-improved complementary status between RGB and D.
Abstract: The existing fusion-based RGB-D salient object detection (SOD) methods usually adopt a bi-stream structure to strike a fusion trade-off between RGB and depth (D). The depth quality usually varies from scene to scene, while the state-of-the-art (SOTA) bi-stream approaches are depth-quality unaware, which easily results in substantial difficulties in achieving a complementary fusion status between RGB and D and leads to poor fusion results when facing low-quality D. Thus, this paper attempts to integrate a novel depth-quality-aware subnet into the classic bi-stream structure, aiming to assess the depth quality before conducting the selective RGB-D fusion. Compared with the SOTA bi-stream methods, the major highlight of our method is its ability to lessen the importance of those low-quality, no-contribution, or even negative-contribution D regions during the RGB-D fusion, achieving a much-improved complementary status between RGB and D.
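A hedged sketch of such quality-aware fusion is shown below: a small subnet predicts a scalar quality score from the paired features and scales the depth stream before the bi-stream fusion. The gating form and layer sizes are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class DepthQualityGatedFusion(nn.Module):
    """Sketch of depth-quality-aware fusion: a small subnet predicts a quality
    score for the depth features, which scales their contribution before fusion."""
    def __init__(self, channels):
        super().__init__()
        self.quality = nn.Sequential(        # depth-quality subnet -> scalar in (0, 1)
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * channels, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):
        q = self.quality(torch.cat([rgb_feat, depth_feat], dim=1))  # (B, 1)
        q = q.view(-1, 1, 1, 1)                                     # broadcastable
        return self.fuse(torch.cat([rgb_feat, q * depth_feat], dim=1))

fusion = DepthQualityGatedFusion(channels=64)
out = fusion(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```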

56 citations


Cites methods from "Improving Video Saliency Detection ..."

  • ...As a lightweight preprocessing tool, the downstream applications of image salient object detection usually include video saliency [6], [7], [8], [9], [10], [11], [12], quality assessment [13], [14], [15], video tracking [16], [17], and video background extraction [18], [19]....

References
Journal ArticleDOI
TL;DR: In this article, a visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system is presented, where multiscale image features are combined into a single topographical saliency map.
Abstract: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.
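For intuition, a compact NumPy/OpenCV sketch of the center-surround mechanism follows: a Gaussian pyramid provides multiscale maps, and across-scale differences are normalized and summed into a single saliency map. It is restricted to the intensity channel and omits the color, orientation, and winner-take-all stages of the full model.

```python
import cv2
import numpy as np

def center_surround_saliency(image, center_levels=(2, 3), deltas=(2, 3)):
    """Simplified Itti-style saliency on the intensity channel only."""
    intensity = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32)
    pyramid = [intensity]
    for _ in range(6):
        pyramid.append(cv2.pyrDown(pyramid[-1]))   # Gaussian pyramid, 7 levels

    h, w = pyramid[2].shape                        # common size for combination
    saliency = np.zeros((h, w), np.float32)
    for c in center_levels:
        for d in deltas:
            s = c + d                              # surround = coarser level
            center = cv2.resize(pyramid[c], (w, h))
            surround = cv2.resize(pyramid[s], (w, h))
            fmap = np.abs(center - surround)       # center-surround difference
            if fmap.max() > 0:
                fmap /= fmap.max()                 # crude normalization step
            saliency += fmap
    saliency /= saliency.max() + 1e-8
    return cv2.resize(saliency, (image.shape[1], image.shape[0]))

# sal = center_surround_saliency(cv2.imread("frame.png"))
```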

10,525 citations

01 Jan 1998
TL;DR: A visual attention system, inspired by the behavior and the neuronal architecture of the early primate visual system, is presented, which breaks down the complex problem of scene understanding by rapidly selecting conspicuous locations to be analyzed in detail.

8,566 citations


"Improving Video Saliency Detection ..." refers background or methods or result in this paper

  • ..., who proposed the well-known center-surround saliency model [51]....

  • ...Akin to [51], in [21], the Kullback-Leibler divergence on dynamic texture feature is used to compute the video saliency based on the discriminant center-surround hypothesis [20]....

  • ...Similar to [51], a global contrast saliency model is...

  • ...The well-known center-surround scheme in [51] has been exploited by numerous video saliency models and interpreted as the feature difference by defining various mathematical principles....

Journal ArticleDOI
TL;DR: A new superpixel algorithm, simple linear iterative clustering (SLIC), is introduced, which adapts a k-means clustering approach to efficiently generate superpixels; it is faster and more memory efficient than previous methods, improves segmentation performance, and is straightforward to extend to supervoxel generation.
Abstract: Computer vision applications have come to rely increasingly on superpixels in recent years, but it is not always clear what constitutes a good superpixel algorithm. In an effort to understand the benefits and drawbacks of existing methods, we empirically compare five state-of-the-art superpixel algorithms for their ability to adhere to image boundaries, speed, memory efficiency, and their impact on segmentation performance. We then introduce a new superpixel algorithm, simple linear iterative clustering (SLIC), which adapts a k-means clustering approach to efficiently generate superpixels. Despite its simplicity, SLIC adheres to boundaries as well as or better than previous methods. At the same time, it is faster and more memory efficient, improves segmentation performance, and is straightforward to extend to supervoxel generation.
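SLIC is available off the shelf; the snippet below uses scikit-image's implementation (the file name and parameter values are illustrative).

```python
from skimage import io, segmentation, color

image = io.imread("frame.png")                    # any RGB frame
# SLIC: k-means in combined color/position space; ~200 superpixels here.
labels = segmentation.slic(image, n_segments=200, compactness=10, start_label=1)
# Visualize by replacing each superpixel with its mean color.
mean_color = color.label2rgb(labels, image, kind="avg", bg_label=0)
```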

7,849 citations


"Improving Video Saliency Detection ..." refers methods in this paper

  • ...In our method, we follow the recent works [44]–[47] and segment every frame F_t (t = 1, 2, ...) into some perceptually homogeneous superpixels {sp_t^i}_{i=1}^{n_t} (n_t is the number of generated superpixels) via the simple linear iterative clustering (SLIC) algorithm [62]....

Journal ArticleDOI
01 Aug 2004
TL;DR: A more powerful, iterative version of the optimisation of the graph-cut approach is developed, and the power of the iterative algorithm is used to substantially simplify the user interaction needed for a given quality of result.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
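OpenCV ships an implementation of this iterative graph-cut segmentation (GrabCut); a minimal usage sketch follows, with a placeholder file name and rectangle coordinates.

```python
import cv2
import numpy as np

img = cv2.imread("photo.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)   # internal GMM state for background
fgd_model = np.zeros((1, 65), np.float64)   # internal GMM state for foreground
rect = (50, 50, 300, 400)                   # user rectangle around the object

# Run 5 iterations of the iterative graph-cut optimisation.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as definite or probable foreground form the final mask.
fg_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
segmented = img * fg_mask[:, :, None]
```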

5,670 citations