Showing papers on "Human visual system model" published in 2017


Proceedings ArticleDOI
01 Oct 2017
TL;DR: In this paper, the structural similarity measure (Structure-measure) is proposed to evaluate non-binary foreground maps, which simultaneously evaluates region-aware and object-aware structural similarity between a saliency map and a ground-truth map.
Abstract: Foreground map evaluation is crucial for gauging the progress of object segmentation algorithms, in particular in the field of salient object detection where the purpose is to accurately detect and segment the most salient object in a scene. Several widely-used measures such as Area Under the Curve (AUC), Average Precision (AP) and the recently proposed weighted F-measure (Fbw) have been used to evaluate the similarity between a non-binary saliency map (SM) and a ground-truth (GT) map. These measures are based on pixel-wise errors and often ignore the structural similarities. Behavioral vision studies, however, have shown that the human visual system is highly sensitive to structures in scenes. Here, we propose a novel, efficient, and easy-to-calculate measure known as the structural similarity measure (Structure-measure) to evaluate non-binary foreground maps. Our new measure simultaneously evaluates region-aware and object-aware structural similarity between an SM and a GT map. We demonstrate the superiority of our measure over existing ones using 5 meta-measures on 5 benchmark datasets.

693 citations
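
To make the combination concrete, here is a minimal sketch of a structure-style measure that blends an object-aware term with a region-aware term. The alpha weight and both similarity terms below are deliberate simplifications of the published Structure-measure, not the authors' exact formulation; inputs are assumed to be float maps in [0, 1], with the ground truth roughly binary.

```python
import numpy as np

def object_similarity(sm, gt):
    """Simplified object-aware term: reward high saliency on the ground-truth
    foreground and low saliency on the background (not the paper's exact form)."""
    fg, bg = sm[gt > 0.5], sm[gt <= 0.5]
    x_fg = fg.mean() if fg.size else 0.0
    x_bg = bg.mean() if bg.size else 0.0
    return 0.5 * (x_fg + (1.0 - x_bg))

def region_similarity(sm, gt, c=0.01):
    """Simplified region-aware term: a single global SSIM-style comparison
    (the published measure splits the maps into regions around the GT centroid)."""
    mu_s, mu_g = sm.mean(), gt.mean()
    cov = ((sm - mu_s) * (gt - mu_g)).mean()
    return ((2 * mu_s * mu_g + c) * (2 * cov + c) /
            ((mu_s**2 + mu_g**2 + c) * (sm.var() + gt.var() + c)))

def structure_measure(sm, gt, alpha=0.5):
    """Weighted blend of the object-aware and region-aware terms."""
    return alpha * object_similarity(sm, gt) + (1 - alpha) * region_similarity(sm, gt)
```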


Proceedings ArticleDOI
21 Jul 2017
TL;DR: In this article, the authors use unsupervised motion-based segmentation on videos to obtain segments, which they use as pseudo ground truth to train a convolutional network to segment objects from a single frame.
Abstract: This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as pseudo ground truth to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed pretext tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.

499 citations
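
A rough sketch of the training loop this kind of pseudo-supervision implies: segments from an off-the-shelf motion segmenter serve as per-pixel targets for a single-frame segmentation network. `motion_segment` and `SegNet` are hypothetical placeholders, and the loss choice is an assumption rather than the paper's exact objective.

```python
import torch
import torch.nn as nn

def train_step(frame, pseudo_mask, model, optimizer):
    """One update: predict a foreground mask for a single frame, supervised by a
    pseudo ground-truth mask from unsupervised motion segmentation of the video."""
    model.train()
    optimizer.zero_grad()
    logits = model(frame)  # (N, 1, H, W) foreground logits
    loss = nn.functional.binary_cross_entropy_with_logits(logits, pseudo_mask)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage, with `motion_segment` standing in for any off-the-shelf
# motion-based segmenter and `SegNet` for any fully convolutional backbone:
# frames, masks = motion_segment(video)
# model, opt = SegNet(), torch.optim.SGD(model.parameters(), lr=1e-3)
# for f, m in zip(frames, masks):
#     train_step(f, m, model, opt)
```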


Proceedings ArticleDOI
21 Jul 2017
TL;DR: Zhang et al. as discussed by the authors proposed a Visual Translation Embedding Network (VTransE) for visual relation detection, which places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object.
Abstract: Visual relations, such as person ride bike and bike next to car, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion that supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is still competitive with Lu's multi-modal model with language priors [27].

484 citations
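
The core idea, subject + predicate ≈ object in a learned relation space, can be sketched as a translation-embedding scorer. The feature dimensions, projection layers, and distance-based scoring below are illustrative assumptions; the published VTransE couples this with a fully convolutional detection pipeline that the sketch omits.

```python
import torch
import torch.nn as nn

class TranslationEmbedding(nn.Module):
    """Toy scorer in the spirit of 'subject + predicate ≈ object'. The feature
    dimensions, projections, and distance-based scoring are assumptions."""
    def __init__(self, feat_dim=1024, rel_dim=128, num_predicates=70):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, rel_dim)                # subject -> relation space
        self.proj_o = nn.Linear(feat_dim, rel_dim)                # object  -> relation space
        self.predicates = nn.Embedding(num_predicates, rel_dim)   # translation vectors

    def forward(self, subj_feat, obj_feat):
        s = self.proj_s(subj_feat)                 # (N, rel_dim)
        o = self.proj_o(obj_feat)                  # (N, rel_dim)
        diff = (o - s).unsqueeze(1)                # (N, 1, rel_dim)
        dist = torch.norm(diff - self.predicates.weight, dim=-1)  # (N, num_predicates)
        return -dist                               # higher score = better predicate fit
```

During training, these scores can be fed to a standard cross-entropy loss over the annotated predicate labels.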


Posted Content
TL;DR: A novel, efficient, and easy to calculate measure known as structural similarity measure (Structure-measure) to evaluate non-binary foreground maps that simultaneously evaluates region-aware and object-aware structural similarity between a SM and a GT map.
Abstract: Foreground map evaluation is crucial for gauging the progress of object segmentation algorithms, in particular in the field of salient object detection where the purpose is to accurately detect and segment the most salient object in a scene. Several widely-used measures such as Area Under the Curve (AUC), Average Precision (AP) and the recently proposed Fbw have been utilized to evaluate the similarity between a non-binary saliency map (SM) and a ground-truth (GT) map. These measures are based on pixel-wise errors and often ignore the structural similarities. Behavioral vision studies, however, have shown that the human visual system is highly sensitive to structures in scenes. Here, we propose a novel, efficient, and easy-to-calculate measure known as the structural similarity measure (Structure-measure) to evaluate non-binary foreground maps. Our new measure simultaneously evaluates region-aware and object-aware structural similarity between an SM and a GT map. We demonstrate the superiority of our measure over existing ones using 5 meta-measures on 5 benchmark datasets.

409 citations


Proceedings ArticleDOI
06 May 2017
TL;DR: Although DNNs perform better than or on par with humans on good quality images, DNN performance is still much lower than human performance on distorted images, and there is little correlation in errors between DNNs and human subjects.
Abstract: Deep neural networks (DNNs) achieve excellent performance on standard classification tasks. However, under image quality distortions such as blur and noise, classification accuracy becomes poor. In this work, we compare the performance of DNNs with human subjects on distorted images. We show that, although DNNs perform better than or on par with humans on good quality images, DNN performance is still much lower than human performance on distorted images. We additionally find that there is little correlation in errors between DNNs and human subjects. This could be an indication that the internal representations of images differ between DNNs and the human visual system. These comparisons with human performance could be used to guide future development of more robust DNNs.

350 citations
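
The evaluation protocol suggested by this comparison can be sketched in a few lines: degrade images at increasing severities and track top-1 accuracy. `classify` is a placeholder for either a DNN or a recorded human response, and the noise model and levels are assumptions, not the paper's exact setup.

```python
import numpy as np

def add_gaussian_noise(image, sigma):
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    return np.clip(image + np.random.normal(0.0, sigma, image.shape), 0.0, 1.0)

def accuracy_under_noise(images, labels, classify, sigmas=(0.0, 0.05, 0.1, 0.2)):
    """Top-1 accuracy at increasing noise levels. `classify(image) -> label` is a
    placeholder for a DNN (or a recorded human response)."""
    results = {}
    for sigma in sigmas:
        correct = sum(classify(add_gaussian_noise(img, sigma)) == lab
                      for img, lab in zip(images, labels))
        results[sigma] = correct / len(labels)
    return results
```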


Journal ArticleDOI
TL;DR: A full brain predictive model synthesizes brain maps for other visual experiments and recovers the activations observed in the corresponding fMRI studies, showing that this deep encoding model captures representations of brain function that are universal across experimental paradigms.

303 citations


Journal ArticleDOI
TL;DR: This paper presents a novel method for underwater image enhancement inspired by the Retinex framework, which simulates the human visual system and utilizes the combination of the bilateral filter and trilateral filter on the three channels of the image in CIELAB color space according to the characteristics of each channel.

244 citations
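
A rough single-channel sketch of the Retinex-style pipeline described above, using OpenCV: estimate illumination on the CIELAB lightness channel with a bilateral filter and take the log-ratio as reflectance. The paper additionally applies a trilateral filter to the chroma channels, which is omitted here, and all filter parameters are assumptions.

```python
import cv2
import numpy as np

def retinex_lab_enhance(bgr, d=9, sigma_color=75, sigma_space=75):
    """Single-scale Retinex sketch on an 8-bit BGR image: estimate illumination
    on the L channel with a bilateral filter and recover reflectance as a
    log-ratio. (Chroma processing from the paper is omitted.)"""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    L = lab[:, :, 0] + 1.0                                    # avoid log(0)
    illumination = cv2.bilateralFilter(L, d, sigma_color, sigma_space) + 1.0
    reflectance = np.log(L) - np.log(illumination)
    reflectance = cv2.normalize(reflectance, None, 0, 255, cv2.NORM_MINMAX)
    lab[:, :, 0] = reflectance
    return cv2.cvtColor(lab.astype(np.uint8), cv2.COLOR_LAB2BGR)
```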


Journal ArticleDOI
TL;DR: A new perceptual image quality assessment (IQA) metric based on the human visual system (HVS) is proposed that performs efficiently with convolution operations at multiscales, gradient magnitude, and color information similarity, and a perceptual-based pooling.
Abstract: A fast reliable computational quality predictor is eagerly desired in practical image/video applications, such as serving for the quality monitoring of real-time coding and transcoding. In this paper, we propose a new perceptual image quality assessment (IQA) metric based on the human visual system (HVS). The proposed IQA model performs efficiently with convolution operations at multiscales, gradient magnitude, and color information similarity, and a perceptual-based pooling. Extensive experiments are conducted using four popular large-size image databases and two multiply distorted image databases, and results validate the superiority of our approach over modern IQA measures in efficiency and efficacy. Our metric is built on the theoretical support of the HVS with lately designed IQA methods as special cases.

218 citations
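
A toy, single-scale reduction of the ingredients listed above (gradient-magnitude similarity plus deviation pooling); the constant, the luminance-only treatment, and the pooling rule are generic choices rather than the authors' metric.

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_magnitude(gray):
    gray = np.asarray(gray, dtype=float)
    return np.hypot(sobel(gray, axis=0), sobel(gray, axis=1))

def gradient_similarity_index(ref_gray, dist_gray, c=170.0):
    """Pointwise gradient-magnitude similarity, pooled by standard deviation:
    larger values indicate more visible distortion (a distortion index)."""
    g_ref, g_dist = gradient_magnitude(ref_gray), gradient_magnitude(dist_gray)
    sim = (2 * g_ref * g_dist + c) / (g_ref**2 + g_dist**2 + c)
    return sim.std()
```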


Proceedings ArticleDOI
21 Jul 2017
TL;DR: A novel convolutional neural network (CNN)-based FR-IQA model, named Deep Image Quality Assessment (DeepQA), is proposed, in which the behavior of the HVS is learned from the underlying data distribution of IQA databases; it achieves state-of-the-art prediction accuracy among FR-IQA models.
Abstract: Since human observers are the ultimate receivers of digital images, image quality metrics should be designed from a human-oriented perspective. Conventionally, a number of full-reference image quality assessment (FR-IQA) methods adopted various computational models of the human visual system (HVS) from psychological vision science research. In this paper, we propose a novel convolutional neural network (CNN)-based FR-IQA model, named Deep Image Quality Assessment (DeepQA), where the behavior of the HVS is learned from the underlying data distribution of IQA databases. Different from previous studies, our model seeks the optimal visual weight based on understanding of the database information itself, without any prior knowledge of the HVS. Through the experiments, we show that the predicted visual sensitivity maps agree with human subjective opinions. In addition, DeepQA achieves state-of-the-art prediction accuracy among FR-IQA models.

216 citations


Posted Content
TL;DR: The human visual system is found to be more robust to image manipulations like contrast reduction, additive noise or novel eidolon-distortions than deep neural networks, indicating that there may still be marked differences in the way humans and current DNNs perform visual object recognition.
Abstract: Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations. We find the human visual system to be more robust to image manipulations like contrast reduction, additive noise or novel eidolon-distortions. In addition, we find progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker, indicating that there may still be marked differences in the way humans and current DNNs perform visual object recognition. We envision that our findings as well as our carefully measured and freely available behavioural datasets provide a new useful benchmark for the computer vision community to improve the robustness of DNNs and a motivation for neuroscientists to search for mechanisms in the brain that could facilitate this robustness.

206 citations


Posted Content
TL;DR: A dual-exposure fusion algorithm is proposed to provide an accurate contrast and lightness enhancement to solve the problem of low-light image enhancement through image fusion and weight matrix fusion.
Abstract: Low-light images are not conducive to human observation and computer vision algorithms due to their low visibility. Although many image enhancement techniques have been proposed to solve this problem, existing methods inevitably introduce contrast under- and over-enhancement. Inspired by the human visual system, we design a multi-exposure fusion framework for low-light image enhancement. Based on the framework, we propose a dual-exposure fusion algorithm to provide an accurate contrast and lightness enhancement. Specifically, we first design the weight matrix for image fusion using illumination estimation techniques. Then we introduce our camera response model to synthesize multi-exposure images. Next, we find the best exposure ratio so that the synthetic image is well-exposed in the regions where the original image is under-exposed. Finally, the enhanced result is obtained by fusing the input image and the synthetic image according to the weight matrix. Experiments show that our method can obtain results with less contrast and lightness distortion compared to those of several state-of-the-art methods.
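
A compact sketch of the dual-exposure idea: synthesize a brighter exposure with a parametric camera response model and blend it with the input using an illumination-derived weight map. The (a, b) response parameters follow a commonly used fitted beta-gamma model, the weight map is a crude stand-in for the paper's illumination estimation, and the exposure ratio is fixed here instead of being searched for; all of these are assumptions.

```python
import numpy as np

def synthesize_exposure(img, ratio, a=-0.3293, b=1.1258):
    """Brighten an image with a parametric beta-gamma camera response model;
    (a, b) are commonly used fitted values and should be treated as assumptions."""
    beta, gamma = np.exp(b * (1 - ratio**a)), ratio**a
    return np.clip(beta * np.power(img, gamma), 0.0, 1.0)

def fuse_low_light(img, ratio=5.0):
    """Dual-exposure fusion sketch for an HxWx3 image in [0, 1]: keep well-exposed
    pixels from the input and take under-exposed regions from a synthetic brighter
    exposure. The weight map and the fixed exposure ratio are simplifications."""
    illumination = img.max(axis=2)                 # crude illumination estimate
    weight = illumination[..., None] ** 0.5        # bright pixels keep the original
    return weight * img + (1.0 - weight) * synthesize_exposure(img, ratio)
```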

Posted Content
TL;DR: With the advent of large labelled datasets and high-capacity models, the performance of machine vision systems has been improving rapidly, but the technology still has major limitations, starting from the fact that different vision problems are still solved by different models, trained from scratch or fine-tuned on the target data.
Abstract: With the advent of large labelled datasets and high-capacity models, the performance of machine vision systems has been improving rapidly. However, the technology still has major limitations, starting from the fact that different vision problems are still solved by different models, trained from scratch or fine-tuned on the target data. The human visual system, in stark contrast, learns a universal representation for vision in the early life of an individual. This representation works well for an enormous variety of vision problems, with little or no change, with the major advantage of requiring little training data to solve any of them.

Journal ArticleDOI
TL;DR: It is shown that a specific region in the human visual system, known as the occipital place area, automatically encodes the structure of navigable space in visual scenes, thus providing evidence for a bottom-up visual mechanism for perceiving potential paths for movement in one’s immediate surroundings.
Abstract: A central component of spatial navigation is determining where one can and cannot go in the immediate environment. We used fMRI to test the hypothesis that the human visual system solves this problem by automatically identifying the navigational affordances of the local scene. Multivoxel pattern analyses showed that a scene-selective region of dorsal occipitoparietal cortex, known as the occipital place area, represents pathways for movement in scenes in a manner that is tolerant to variability in other visual features. These effects were found in two experiments: One using tightly controlled artificial environments as stimuli, the other using a diverse set of complex, natural scenes. A reconstruction analysis demonstrated that the population codes of the occipital place area could be used to predict the affordances of novel scenes. Taken together, these results reveal a previously unknown mechanism for perceiving the affordance structure of navigable space.

Journal ArticleDOI
TL;DR: Considering both pattern complexity and luminance contrast, a novel spatial masking estimation function is deduced, and an improved JND estimation model is built, which performs highly consistent with the human perception.
Abstract: The just noticeable difference (JND) in an image, which reveals the visibility limitation of the human visual system (HVS), is widely used for visual redundancy estimation in signal processing. To determine the JND threshold with the current schemes, the spatial masking effect is estimated as the contrast masking, and this cannot accurately account for the complicated interaction among visual contents. Research on cognitive science indicates that the HVS is highly adapted to extract the repeated patterns for visual content representation. Inspired by this, we formulate the pattern complexity as another factor to determine the total masking effect: the interaction is relatively straightforward with a limited masking effect in a regular pattern, and is complicated with a strong masking effect in an irregular pattern. From the orientation selectivity mechanism in the primary visual cortex, the response of each local receptive field can be considered as a pattern; therefore, in this paper, the orientation that each pixel presents is regarded as the fundamental element of a pattern, and the pattern complexity is calculated as the diversity of the orientation in a local region. Finally, considering both pattern complexity and luminance contrast, a novel spatial masking estimation function is deduced, and an improved JND estimation model is built. Experimental results on comparing with the latest JND models demonstrate the effectiveness of the proposed model, which performs highly consistent with the human perception. The source code of the proposed model is publicly available at http://web.xidian.edu.cn/wjj/en/index.html.
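
To illustrate the pattern-complexity idea, the sketch below quantizes local gradient orientations, counts how many distinct orientations appear in a small window, and scales a luminance-contrast term by that diversity. The bin count, window size, and combination rule are illustrative assumptions, not the paper's fitted masking function.

```python
import numpy as np
from scipy.ndimage import sobel, generic_filter

def orientation_map(gray, n_bins=12):
    """Quantized gradient orientation at every pixel."""
    gray = np.asarray(gray, dtype=float)
    gx, gy = sobel(gray, axis=1), sobel(gray, axis=0)
    theta = np.mod(np.arctan2(gy, gx), np.pi)            # orientations in [0, pi)
    return np.floor(theta / np.pi * n_bins).astype(int)

def pattern_complexity(gray, size=5, n_bins=12):
    """Orientation diversity in a local window as a proxy for pattern complexity."""
    orients = orientation_map(gray, n_bins).astype(float)
    return generic_filter(orients, lambda w: float(len(np.unique(w))), size=size)

def spatial_masking(gray, alpha=0.1):
    """Toy masking estimate: luminance contrast scaled by pattern complexity."""
    gray = np.asarray(gray, dtype=float)
    contrast = np.hypot(sobel(gray, axis=0), sobel(gray, axis=1))
    return alpha * contrast * pattern_complexity(gray)
```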

Journal ArticleDOI
TL;DR: The experimental results demonstrate that the proposed RISE metric is superior to the relevant state-of-the-art methods for evaluating both synthetic and real blurring and the proposed metric is robust, which means that it has very good generalization ability.
Abstract: The human visual system exhibits multiscale characteristic when perceiving visual scenes. The hierarchical structures of an image are contained in its scale space representation, in which the image can be portrayed by a series of increasingly smoothed images. Inspired by this, this paper presents a no-reference and robust image sharpness evaluation (RISE) method by learning multiscale features extracted in both the spatial and spectral domains. For an image, the scale space is first built. Then sharpness-aware features are extracted in gradient domain and singular value decomposition domain, respectively. In order to take into account the impact of viewing distance on image quality, the input image is also down-sampled by several times, and the DCT-domain entropies are calculated as quality features. Finally, all features are utilized to learn a support vector regression model for sharpness prediction. Extensive experiments are conducted on four synthetically and two real blurred image databases. The experimental results demonstrate that the proposed RISE metric is superior to the relevant state-of-the-art methods for evaluating both synthetic and real blurring. Furthermore, the proposed metric is robust, which means that it has very good generalization ability.
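
A condensed sketch of the feature-then-regression recipe: gradient and singular-value statistics over a Gaussian scale space, mapped to sharpness scores with support vector regression. The specific features and scales are simplifications of the paper's gradient/SVD/DCT-entropy design, and the training data in the usage comment is assumed to be available.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel
from sklearn.svm import SVR

def sharpness_features(gray, sigmas=(0, 1, 2, 4)):
    """Multiscale sharpness-aware features: mean gradient energy and the energy
    share of the leading singular values at each scale-space level."""
    gray = np.asarray(gray, dtype=float)
    feats = []
    for s in sigmas:
        img = gaussian_filter(gray, s) if s > 0 else gray
        grad = np.hypot(sobel(img, axis=0), sobel(img, axis=1))
        sv = np.linalg.svd(img, compute_uv=False)
        feats += [grad.mean(), sv[:10].sum() / (sv.sum() + 1e-8)]
    return np.array(feats)

# Learn the mapping from features to subjective sharpness scores with support
# vector regression (training images and scores are assumed to be available):
# X = np.stack([sharpness_features(im) for im in train_images])
# model = SVR(kernel='rbf').fit(X, train_scores)
```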

Proceedings Article
10 Feb 2017
TL;DR: The proposed model utilizes the recently studied attention mechanism to jointly discover the relevant local regions and build a sentiment classifier on top of these local regions; it is capable of automatically discovering sentimental local regions of given images and outperforms existing state-of-the-art algorithms for visual sentiment analysis.
Abstract: Visual sentiment analysis, which studies the emotional response of humans to visual stimuli such as images and videos, has been an interesting and challenging problem. It tries to understand the high-level content of visual data. The success of current models can be attributed to the development of robust algorithms from computer vision. Most of the existing models try to solve the problem by proposing either robust features or more complex models. In particular, visual features from the whole image or video are the main proposed inputs. Little attention has been paid to local areas, which we believe are highly relevant to humans' emotional response to the whole image. In this work, we study the impact of local image regions on visual sentiment analysis. Our proposed model utilizes the recently studied attention mechanism to jointly discover the relevant local regions and build a sentiment classifier on top of these local regions. The experimental results suggest that 1) our model is capable of automatically discovering sentimental local regions of given images and 2) it outperforms existing state-of-the-art algorithms for visual sentiment analysis.

Journal ArticleDOI
TL;DR: This work combines CNN-based encoding models with magnetoencephalography to validate the accuracy of the encoding model by decoding stimulus identity in a left-out validation set of viewed objects, achieving state-of-the-art decoding accuracy.

Journal ArticleDOI
TL;DR: The derivation of a model for predicting saccade landing positions is described and it is demonstrated how it can be used in the context of gaze-contingent rendering to reduce the influence of system latency on the perceived quality.
Abstract: Gaze-contingent rendering shows promise in improving perceived quality by providing a better match between image quality and the human visual system requirements. For example, information about fixation allows rendering quality to be reduced in peripheral vision, and the additional resources can be used to improve the quality in the foveal region. Gaze-contingent rendering can also be used to compensate for certain limitations of display devices, such as reduced dynamic range or lack of accommodation cues. Despite this potential and the recent drop in the prices of eye trackers, the adoption of such solutions is hampered by system latency which leads to a mismatch between image quality and the actual gaze location. This is especially apparent during fast saccadic movements when the information about gaze location is significantly delayed, and the quality mismatch can be noticed. To address this problem, we suggest a new way of updating images in gaze-contingent rendering during saccades. Instead of rendering according to the current gaze position, our technique predicts where the saccade is likely to end and provides an image for the new fixation location as soon as the prediction is available. While the quality mismatch during the saccade remains unnoticed due to saccadic suppression, a correct image for the new fixation is provided before the fixation is established. This paper describes the derivation of a model for predicting saccade landing positions and demonstrates how it can be used in the context of gaze-contingent rendering to reduce the influence of system latency on the perceived quality. The technique is validated in a series of experiments for various combinations of display frame rate and eye-tracker sampling rate.

Journal ArticleDOI
TL;DR: This work introduces a new display technology, dubbed accommodation-invariant (AI) near-eye displays, to improve the consistency of depth cues in near- eye displays and validate the principle of operation of AI displays using a prototype display that allows for the accommodation state of users to be measured while they view visual stimuli using multiple different display modes.
Abstract: Although emerging virtual and augmented reality (VR/AR) systems can produce highly immersive experiences, they can also cause visual discomfort, eyestrain, and nausea. One of the sources of these symptoms is a mismatch between vergence and focus cues. In current VR/AR near-eye displays, a stereoscopic image pair drives the vergence state of the human visual system to arbitrary distances, but the accommodation, or focus, state of the eyes is optically driven towards a fixed distance. In this work, we introduce a new display technology, dubbed accommodation-invariant (AI) near-eye displays, to improve the consistency of depth cues in near-eye displays. Rather than producing correct focus cues, AI displays are optically engineered to produce visual stimuli that are invariant to the accommodation state of the eye. The accommodation system can then be driven by stereoscopic cues, and the mismatch between vergence and accommodation state of the eyes is significantly reduced. We validate the principle of operation of AI displays using a prototype display that allows for the accommodation state of users to be measured while they view visual stimuli using multiple different display modes.

Journal ArticleDOI
01 Jan 2017-Displays
TL;DR: A novel steganography approach based on the combination of an LSB substitution mechanism and edge detection is proposed; it achieves a much higher payload and better visual quality than those of state-of-the-art schemes.
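
A toy illustration of edge-adaptive LSB embedding: hide more bits per pixel where an edge detector fires. Real schemes of this kind also ensure the edge map can be recomputed identically at extraction time, which this sketch ignores; the Canny thresholds and bit depths are assumptions.

```python
import cv2
import numpy as np

def embed_bits(gray, bits, low=1, high=3, canny_lo=100, canny_hi=200):
    """Toy edge-adaptive LSB embedding into an 8-bit grayscale cover image:
    hide `high` bits per pixel on edge pixels and `low` bits elsewhere.
    `bits` is a sequence of 0/1 integers."""
    edges = cv2.Canny(gray, canny_lo, canny_hi) > 0
    stego, idx = gray.copy(), 0
    for (r, c), is_edge in np.ndenumerate(edges):
        k = high if is_edge else low
        if idx + k > len(bits):
            break
        chunk = int(''.join(str(b) for b in bits[idx:idx + k]), 2)
        stego[r, c] = (int(stego[r, c]) & ~((1 << k) - 1)) | chunk
        idx += k
    return stego
```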

Journal ArticleDOI
01 Jun 2017
TL;DR: It is shown how the cost function can be used in an optimizer to search for the optimal visual design for a user’s dataset and task objectives, and case studies demonstrate that the approach can adapt a design to the data, to reveal patterns without user intervention.
Abstract: Designing a good scatterplot can be difficult for non-experts in visualization, because they need to decide on many parameters, such as marker size and opacity, aspect ratio, color, and rendering order. This paper contributes to research exploring the use of perceptual models and quality metrics to set such parameters automatically for enhanced visual quality of a scatterplot. A key consideration in this paper is the construction of a cost function to capture several relevant aspects of the human visual system, examining a scatterplot design for some data analysis task. We show how the cost function can be used in an optimizer to search for the optimal visual design for a user’s dataset and task objectives (e.g., “reliable linear correlation estimation is more important than class separation”). The approach is extensible to different analysis tasks. To test its performance in a realistic setting, we pre-calibrated it for correlation estimation, class separation, and outlier detection. The optimizer was able to produce designs that achieved a level of speed and success comparable to that of those using human-designed presets (e.g., in R or MATLAB). Case studies demonstrate that the approach can adapt a design to the data, to reveal patterns without user intervention.

Journal ArticleDOI
TL;DR: A generalized framework to model the image formation process of the existing light-field display methods is described and a systematic method to simulate and characterize the retinal image and the accommodation response rendered by a light field display is presented.
Abstract: One of the key issues in conventional stereoscopic displays is the well-known vergence-accommodation conflict problem due to the lack of the ability to render correct focus cues for 3D scenes. Recently several light field display methods have been explored to reconstruct a true 3D scene by sampling either the projections of the 3D scene at different depths or the directions of the light rays apparently emitted by the 3D scene and viewed from different eye positions. These methods are potentially capable of rendering correct or nearly correct focus cues and addressing the vergence-accommodation conflict problem. In this paper, we describe a generalized framework to model the image formation process of the existing light-field display methods and present a systematic method to simulate and characterize the retinal image and the accommodation response rendered by a light field display. We further employ this framework to investigate the trade-offs and guidelines for an optimal 3D light field display design. Our method is based on quantitatively evaluating the modulation transfer functions of the perceived retinal image of a light field display by accounting for the ocular factors of the human visual system.

Book ChapterDOI
19 Dec 2017
TL;DR: The principal hypothesis of structural similarity-based image quality assessment is that the HVS is highly adapted to extract structural information from the visual field, and therefore a measurement of structural similarity (or distortion) should provide a good approximation to perceived image quality.
Abstract: This chapter presents structural similarity as an alternative design philosophy for objective image quality assessment methods. It discusses the motivation, the general idea, and a specific structural similarity (SSIM) index algorithm of the structural similarity-based image quality assessment method. Many image quality assessment algorithms have been shown to behave consistently when applied to distorted images created from the same original image, using the same type of distortions. The SSIM indexing algorithm is quite encouraging not only because it achieves good quality prediction accuracy in the current tests, but also because of its simple formulation and low complexity implementation. The principal hypothesis of structural similarity based image quality assessment is that the human visual system is highly adapted to extract structural information from the visual field, and therefore a measurement of structural similarity should provide a good approximation to perceived image quality.
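
For reference, the SSIM index itself is easy to state and compute: local means, variances, and covariance are compared with the standard stabilized ratio. The sketch below uses a uniform window instead of the usual 11x11 Gaussian for brevity.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ssim(x, y, data_range=255.0, K1=0.01, K2=0.03, win=8):
    """Mean SSIM over local windows, following the standard formulation
    SSIM = (2*mu_x*mu_y + C1)(2*sigma_xy + C2) /
           ((mu_x^2 + mu_y^2 + C1)(sigma_x^2 + sigma_y^2 + C2))."""
    C1, C2 = (K1 * data_range) ** 2, (K2 * data_range) ** 2
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    sig_x = uniform_filter(x * x, win) - mu_x ** 2
    sig_y = uniform_filter(y * y, win) - mu_y ** 2
    sig_xy = uniform_filter(x * y, win) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (sig_x + sig_y + C2))
    return ssim_map.mean()
```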

Journal ArticleDOI
TL;DR: This work finds that all visual areas exhibit subadditive summation, whereby responses to longer stimuli are less than the linear prediction from briefer stimuli, and builds predictive models that operate on arbitrary temporal patterns of stimulation using two simple computations: temporal summation followed by a compressive nonlinearity.
Abstract: Combining sensory inputs over space and time is fundamental to vision. Population receptive field models have been successful in characterizing spatial encoding throughout the human visual pathways. A parallel question, how visual areas in the human brain process information distributed over time, has received less attention. One challenge is that the most widely used neuroimaging method, fMRI, has coarse temporal resolution compared with the time-scale of neural dynamics. Here, via carefully controlled temporally modulated stimuli, we show that information about temporal processing can be readily derived from fMRI signal amplitudes in male and female subjects. We find that all visual areas exhibit subadditive summation, whereby responses to longer stimuli are less than the linear prediction from briefer stimuli. We also find fMRI evidence that the neural response to two stimuli is reduced for brief interstimulus intervals (indicating adaptation). These effects are more pronounced in visual areas anterior to V1-V3. Finally, we develop a general model that shows how these effects can be captured with two simple operations: temporal summation followed by a compressive nonlinearity. This model operates for arbitrary temporal stimulation patterns and provides a simple and interpretable set of computations that can be used to characterize neural response properties across the visual hierarchy. Importantly, compressive temporal summation directly parallels earlier findings of compressive spatial summation in visual cortex describing responses to stimuli distributed across space. This indicates that, for space and time, cortex uses a similar processing strategy to achieve higher-level and increasingly invariant representations of the visual world.SIGNIFICANCE STATEMENT Combining sensory inputs over time is fundamental to seeing. Two important temporal phenomena are summation, the accumulation of sensory inputs over time, and adaptation, a response reduction for repeated or sustained stimuli. We investigated these phenomena in the human visual system using fMRI. We built predictive models that operate on arbitrary temporal patterns of stimulation using two simple computations: temporal summation followed by a compressive nonlinearity. Our new temporal compressive summation model captures (1) subadditive temporal summation, and (2) adaptation. We show that the model accounts for systematic differences in these phenomena across visual areas. Finally, we show that for space and time, the visual system uses a similar strategy to achieve increasingly invariant representations of the visual world.
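
The two-stage model described above (temporal summation followed by a compressive nonlinearity) can be sketched directly; the exponential impulse response, time constant, and exponent below are illustrative values, not the parameters fitted in the paper.

```python
import numpy as np

def compressive_temporal_summation(stimulus, tau=0.05, n=0.5, dt=0.001):
    """Linear temporal summation (convolution with an exponential impulse
    response of time constant tau) followed by a compressive power
    nonlinearity with exponent n < 1. `stimulus` is sampled every dt seconds."""
    t = np.arange(0.0, 5 * tau, dt)
    irf = np.exp(-t / tau)
    irf /= irf.sum()                               # unit-area impulse response
    linear = np.convolve(stimulus, irf)[:len(stimulus)]
    return np.maximum(linear, 0.0) ** n            # compressive nonlinearity

# Subadditivity check: the summed response to a 200 ms pulse is less than twice
# the summed response to a 100 ms pulse.
# r100 = compressive_temporal_summation(np.ones(100)).sum()
# r200 = compressive_temporal_summation(np.ones(200)).sum()
```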

Journal ArticleDOI
TL;DR: The foveated object detector can approximate the performance of the object detector with homogeneous high spatial resolution processing while bringing significant computational cost savings; the impact of foveation on the computation of bottom-up saliency is also assessed.
Abstract: Humans and many other species sense visual information with varying spatial resolution across the visual field (foveated vision) and deploy eye movements to actively sample regions of interests in scenes. The advantage of such varying resolution architecture is a reduced computational, hence metabolic cost. But what are the performance costs of such processing strategy relative to a scheme that processes the visual field at high spatial resolution? Here we first focus on visual search and combine object detectors from computer vision with a recent model of peripheral pooling regions found at the V1 layer of the human visual system. We develop a foveated object detector that processes the entire scene with varying resolution, uses retino-specific object detection classifiers to guide eye movements, aligns its fovea with regions of interest in the input image and integrates observations across multiple fixations. We compared the foveated object detector against a non-foveated version of the same object detector which processes the entire image at homogeneous high spatial resolution. We evaluated the accuracy of the foveated and non-foveated object detectors identifying 20 different objects classes in scenes from a standard computer vision data set (the PASCAL VOC 2007 dataset). We show that the foveated object detector can approximate the performance of the object detector with homogeneous high spatial resolution processing while bringing significant computational cost savings. Additionally, we assessed the impact of foveation on the computation of bottom-up saliency. An implementation of a simple foveated bottom-up saliency model with eye movements showed agreement in the selection of top salient regions of scenes with those selected by a non-foveated high resolution saliency model. Together, our results might help explain the evolution of foveated visual systems with eye movements as a solution that preserves perceptual performance in visual search while resulting in computational and metabolic savings to the brain.

Journal ArticleDOI
TL;DR: This work introduces a novel deep learning scheme for NR S3D IQA in terms of local-to-global feature aggregation, which is competitive with full-reference S3D IQA metrics and does not estimate the depth from a pair of S3D images.
Abstract: Previously, no-reference (NR) stereoscopic 3D (S3D) image quality assessment (IQA) algorithms have been limited to the extraction of reliable hand-crafted features based on an understanding of the insufficiently revealed human visual system or natural scene statistics. Furthermore, compared with full-reference (FR) S3D IQA metrics, it is difficult to achieve competitive quality score predictions using the extracted features, which are not optimized with respect to human opinion. To cope with this limitation of the conventional approach, we introduce a novel deep learning scheme for NR S3D IQA in terms of local to global feature aggregation. A deep convolutional neural network (CNN) model is trained in a supervised manner through two-step regression. First, to overcome the lack of training data, local patch-based CNNs are modeled, and the FR S3D IQA metric is used to approximate a reference ground-truth for training the CNNs. The automatically extracted local abstractions are aggregated into global features by inserting an aggregation layer in the deep structure. The locally trained model parameters are then updated iteratively using supervised global labeling, i.e., subjective mean opinion score (MOS). In particular, the proposed deep NR S3D image quality evaluator does not estimate the depth from a pair of S3D images. The S3D image quality scores predicted by the proposed method represent a significant improvement over those of previous NR S3D IQA algorithms. Indeed, the accuracy of the proposed method is competitive with FR S3D IQA metrics, having ~ 91% correlation in terms of MOS.

Journal ArticleDOI
TL;DR: A framework for rendering photographic images by directly optimizing their perceptual similarity to the original visual scene is developed, yielding results of comparable visual quality to current state-of-the-art methods, but without manual intervention or parameter adjustment.
Abstract: We develop a framework for rendering photographic images by directly optimizing their perceptual similarity to the original visual scene. Specifically, over the set of all images that can be rendered on a given display, we minimize the normalized Laplacian pyramid distance (NLPD), a measure of perceptual dissimilarity that is derived from a simple model of the early stages of the human visual system. When rendering images acquired with a higher dynamic range than that of the display, we find that the optimization boosts the contrast of low-contrast features without introducing significant artifacts, yielding results of comparable visual quality to current state-of-the-art methods, but without manual intervention or parameter adjustment. We also demonstrate the effectiveness of the framework for a variety of other display constraints, including limitations on minimum luminance (black point), mean luminance (as a proxy for energy consumption), and quantized luminance levels (halftoning). We show that the method may generally be used to enhance details and contrast, and, in particular, can be used on images degraded by optical scattering (e.g., fog). Finally, we demonstrate the necessity of each of the NLPD components-an initial power function, a multiscale transform, and local contrast gain control-in achieving these results and we show that NLPD is competitive with the current state-of-the-art image quality metrics.
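
A rough, differentiable sketch of the optimization loop implied above: build a normalized Laplacian-pyramid representation of the scene and of a candidate display image, and minimize the distance between them by gradient descent while keeping the display image in [0, 1]. The initial power function is omitted, the normalization constant and blur are generic, and inputs are assumed to be (1, 1, H, W) luminance tensors in [0, 1]; this is not the authors' exact NLPD.

```python
import torch
import torch.nn.functional as F

def gauss_blur(x, k=5, sigma=1.0):
    """Separable Gaussian blur (used for both pyramid levels and normalization)."""
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-ax**2 / (2 * sigma**2))
    g = (g / g.sum()).view(1, 1, 1, k)
    x = F.conv2d(x, g, padding=(0, k // 2))
    return F.conv2d(x, g.transpose(2, 3), padding=(k // 2, 0))

def normalized_laplacian_pyramid(x, levels=4, eps=0.17):
    """Band-pass responses divided by a local amplitude estimate, a rough
    stand-in for the local contrast gain control step."""
    bands = []
    for _ in range(levels):
        low = gauss_blur(x)
        band = x - low
        bands.append(band / (gauss_blur(band.abs()) + eps))
        x = F.avg_pool2d(low, 2)
    return bands

def render(scene, steps=200, lr=0.05):
    """Find a display image (kept in [0, 1] by a sigmoid reparameterization)
    whose normalized pyramid best matches that of the scene luminance,
    given as a (1, 1, H, W) tensor with values in [0, 1]."""
    logit = torch.zeros_like(scene, requires_grad=True)
    target = normalized_laplacian_pyramid(scene)
    opt = torch.optim.Adam([logit], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pyr = normalized_laplacian_pyramid(torch.sigmoid(logit))
        loss = sum(F.mse_loss(a, b) for a, b in zip(pyr, target))
        loss.backward()
        opt.step()
    return torch.sigmoid(logit).detach()
```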

Proceedings ArticleDOI
11 Jul 2017
TL;DR: A novel method for retinal image quality classification (IQC) that performs computational algorithms imitating the working of the human visual system is proposed that could achieve higher accuracy than other methods.
Abstract: The quality of input images significantly affects the outcome of automated diabetic retinopathy (DR) screening systems. Unlike the previous methods that only consider simple low-level features such as hand-crafted geometric and structural features, in this paper we propose a novel method for retinal image quality classification (IQC) that performs computational algorithms imitating the working of the human visual system. The proposed algorithm combines unsupervised features from saliency map and supervised features coming from convolutional neural networks (CNN), which are fed to an SVM to automatically detect high quality vs poor quality retinal fundus images. We demonstrate the superior performance of our proposed algorithm on a large retinal fundus image dataset and the method could achieve higher accuracy than other methods. Although retinal images are used in this study, the methodology is applicable to the image quality assessment and enhancement of other types of medical images.
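
The fusion of unsupervised and supervised cues can be sketched as simple feature concatenation followed by an SVM. `compute_saliency` and `extract_cnn_features` are hypothetical placeholders for any saliency model and any pretrained CNN feature extractor, and the saliency statistics chosen are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def combined_features(saliency_map, cnn_features):
    """Concatenate simple saliency statistics with CNN features; the statistics
    chosen here are illustrative."""
    sal = np.array([saliency_map.mean(), saliency_map.std(),
                    (saliency_map > 0.5).mean()])
    return np.concatenate([sal, np.asarray(cnn_features).ravel()])

# Hypothetical usage:
# X = np.stack([combined_features(compute_saliency(im), extract_cnn_features(im))
#               for im in train_images])
# clf = SVC(kernel='rbf').fit(X, quality_labels)   # good vs. poor quality
```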

Posted Content
TL;DR: Li et al. as mentioned in this paper make one of the earliest efforts to bridge saliency detection to WOD via the self-paced curriculum learning, which can guide the learning procedure to gradually achieve faithful knowledge of multi-class objects from easy to hard.
Abstract: Weakly-supervised object detection (WOD) is a challenging problem in computer vision. The key problem is to simultaneously infer the exact object locations in the training images and train the object detectors, given only the training images with weak image-level labels. Intuitively, by simulating the selective attention mechanism of the human visual system, saliency detection techniques can select attractive objects in scenes and thus are a potential way to provide useful priors for WOD. However, the way to adopt saliency detection in WOD is not trivial, since the detected saliency region might be highly ambiguous in complex cases. To this end, this paper first comprehensively analyzes the challenges in applying saliency detection to WOD. Then, we make one of the earliest efforts to bridge saliency detection to WOD via self-paced curriculum learning, which can guide the learning procedure to gradually achieve faithful knowledge of multi-class objects from easy to hard. The experimental results demonstrate that the proposed approach can successfully bridge saliency detection and WOD tasks and achieve state-of-the-art object detection results under the weak supervision.

Journal ArticleDOI
TL;DR: Comparison with state-of-the-art methods shows that the proposed method produces the highest average entropy, measure of enhancement (EME), and EME by entropy with the values of 7.618, 28.193, and 6.829, respectively.