
Showing papers on "Human visual system model published in 2006"


Journal ArticleDOI
TL;DR: An image information measure is proposed that quantifies the information present in the reference image and how much of that reference information can be extracted from the distorted image; combined, these two quantities form a visual information fidelity measure for image QA.
Abstract: Measurement of visual quality is of fundamental importance to numerous image and video processing applications. The goal of quality assessment (QA) research is to design algorithms that can automatically assess the quality of images or videos in a perceptually consistent manner. Image QA algorithms generally interpret image quality as fidelity or similarity with a "reference" or "perfect" image in some perceptual space. Such "full-reference" QA methods attempt to achieve consistency in quality prediction by modeling salient physiological and psychovisual features of the human visual system (HVS), or by signal fidelity measures. In this paper, we approach the image QA problem as an information fidelity problem. Specifically, we propose to quantify the loss of image information to the distortion process and explore the relationship between image information and visual quality. QA systems are invariably involved with judging the visual quality of "natural" images and videos that are meant for "human consumption." Researchers have developed sophisticated models to capture the statistics of such natural signals. Using these models, we previously presented an information fidelity criterion for image QA that related image quality with the amount of information shared between a reference and a distorted image. In this paper, we propose an image information measure that quantifies the information that is present in the reference image and how much of this reference information can be extracted from the distorted image. Combining these two quantities, we propose a visual information fidelity measure for image QA. We validate the performance of our algorithm with an extensive subjective study involving 779 images and show that our method outperforms recent state-of-the-art image QA algorithms by a sizeable margin in our simulations. The code and the data from the subjective study are available at the LIVE website.
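To make the information-fidelity idea concrete, here is a minimal sketch of the ratio at the heart of the measure: the information an observer can extract from the distorted image divided by the information in the reference. The published VIF operates on a wavelet-domain Gaussian scale-mixture model; this single-scale, pixel-domain simplification with Gaussian local windows is only an illustration, and the window width and noise variance `sigma_n2` are assumed values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def vif_pixel(ref, dist, sigma_n2=2.0):
    """Single-scale, pixel-domain sketch of a VIF-style score.

    The distorted image is modeled locally as a gain-scaled, noisy
    version of the reference; the score is the ratio of information
    surviving distortion to information in the reference.
    """
    ref = ref.astype(np.float64)
    dist = dist.astype(np.float64)

    mu_r, mu_d = gaussian_filter(ref, 1.5), gaussian_filter(dist, 1.5)
    var_r = np.maximum(gaussian_filter(ref * ref, 1.5) - mu_r**2, 0)
    var_d = np.maximum(gaussian_filter(dist * dist, 1.5) - mu_d**2, 0)
    cov = gaussian_filter(ref * dist, 1.5) - mu_r * mu_d

    g = cov / (var_r + 1e-10)                  # local gain of the distortion channel
    sv2 = np.maximum(var_d - g * cov, 1e-10)   # variance the gain cannot explain

    num = np.log2(1 + g**2 * var_r / (sv2 + sigma_n2)).sum()  # distorted-image info
    den = np.log2(1 + var_r / sigma_n2).sum()                 # reference info
    return num / (den + 1e-10)
```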

3,146 citations


Journal ArticleDOI
TL;DR: An original approach to attentional guidance by global scene context is presented that combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
Abstract: Many experiments have shown that the human visual system makes extensive use of contextual information for facilitating object search in natural scenes. However, the question of how to formally model contextual influences is still open. On the basis of a Bayesian framework, the authors present an original approach to attentional guidance by global scene context. The model comprises two parallel pathways: one computes local features (saliency) and the other computes global (scene-centered) features. The contextual guidance model of attention combines bottom-up saliency, scene context, and top-down mechanisms at an early stage of visual processing and predicts the image regions likely to be fixated by human observers performing natural search tasks in real-world scenes.
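The two-pathway combination can be sketched in a few lines: a bottom-up saliency map is weakened by a small exponent and multiplied by a global, scene-centered prior over likely target locations. The exponent value and the toy prior below are illustrative assumptions, not the parameters fitted in the paper.

```python
import numpy as np

def contextual_guidance(saliency, context_prior, gamma=0.3):
    """Combine local saliency with a global scene-context prior.

    Raising saliency to a small exponent (gamma < 1) weakens its
    influence relative to the context pathway, mirroring the model's
    multiplicative combination of the two parallel pathways.
    """
    s = saliency / saliency.sum()
    p = context_prior / context_prior.sum()
    combined = (s ** gamma) * p
    return combined / combined.sum()

# Toy usage: the prior says "targets tend to appear near image row 40"
# (e.g., a horizon line), with random bottom-up saliency.
h, w = 64, 96
rows = np.arange(h)[:, None]
prior = np.exp(-0.5 * ((rows - 40) / 6.0) ** 2) * np.ones((h, w))
fixation_map = contextual_guidance(np.random.rand(h, w), prior)
```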

1,613 citations


Journal ArticleDOI
TL;DR: This paper presents a coherent computational approach to modeling bottom-up visual attention, based mainly on the current understanding of HVS behavior, which includes contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions.
Abstract: Visual attention is a mechanism which filters out redundant visual information and detects the most relevant parts of our visual field. Automatic determination of the most visually relevant areas would be useful in many applications such as image and video coding, watermarking, video browsing, and quality assessment. Many research groups are currently investigating computational modeling of the visual attention system. The first published computational models were based on some basic and well-understood human visual system (HVS) properties. These models feature a single perceptual layer that simulates only one aspect of the visual system. More recent models integrate complex features of the HVS and simulate hierarchical perceptual representation of the visual input. The bottom-up mechanism is the feature most commonly found in modern models. This mechanism refers to involuntary attention (i.e., salient spatial visual features that effortlessly or involuntarily attract our attention). This paper presents a coherent computational approach to modeling bottom-up visual attention, based mainly on the current understanding of HVS behavior. Contrast sensitivity functions, perceptual decomposition, visual masking, and center-surround interactions are some of the features implemented in this model. The performance of this algorithm is assessed using natural images and experimental measurements from an eye-tracking system. Two adequate, well-known metrics (correlation coefficient and Kullback-Leibler divergence) are used to validate this model; a further metric is also defined. The results from this model are finally compared to those from a reference bottom-up model.
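The two validation metrics named in the abstract are easy to state precisely; the small sketch below computes both between a model saliency map and a human fixation-density map (the epsilon regularizers are implementation choices, not part of the metric definitions).

```python
import numpy as np

def correlation_coefficient(model_map, fixation_map):
    """Pearson correlation between the model map and the fixation map."""
    m = (model_map - model_map.mean()) / (model_map.std() + 1e-12)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-12)
    return float((m * f).mean())

def kl_divergence(model_map, fixation_map, eps=1e-12):
    """KL divergence, treating both maps as probability distributions."""
    p = fixation_map / (fixation_map.sum() + eps) + eps
    q = model_map / (model_map.sum() + eps) + eps
    return float(np.sum(p * np.log(p / q)))
```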

675 citations


Journal ArticleDOI
TL;DR: The novelties of the method are, first, an adaptive filter whose shape follows the image's high-contrast edges, thus reducing the halo artifacts common to other methods, and second, that only the luminance channel is processed.
Abstract: We propose a new method to render high dynamic range images that models global and local adaptation of the human visual system. Our method is based on the center-surround Retinex model. The first novelty of our method is the use of an adaptive filter whose shape follows the image's high-contrast edges, thus reducing the halo artifacts common to other methods. Second, only the luminance channel is processed; it is defined by the first component of a principal component analysis. Principal component analysis provides orthogonality between channels and thus reduces the chromatic changes caused by the modification of luminance. We show that our method efficiently renders high dynamic range images, and we compare our results with the current state of the art.
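A minimal sketch of the pipeline follows, with two loud simplifications: the luminance channel is taken as the first principal component of the RGB data, as the authors propose, but the paper's adaptive, edge-following surround filter is replaced here by a plain Gaussian (exactly the choice that reintroduces halos); the surround weight and sigma are illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_hdr(rgb):
    """Center-surround Retinex-style tone mapping (simplified)."""
    h, w, _ = rgb.shape
    X = rgb.reshape(-1, 3).astype(np.float64)
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    v = Vt[0] * np.sign(Vt[0].sum())        # resolve the PCA sign ambiguity
    lum = np.maximum((X @ v).reshape(h, w), 1e-6)

    # center minus surround in the log domain; a Gaussian surround
    # stands in for the paper's adaptive, edge-following filter
    log_lum = np.log(lum)
    new_lum = np.exp(log_lum - 0.8 * gaussian_filter(log_lum, sigma=10))

    out = rgb * (new_lum / lum)[..., None]  # preserve chromatic ratios
    return out / out.max()
```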

414 citations


Journal ArticleDOI
TL;DR: FMRI shows that the retinal size of an object and the depth information in a scene are combined early in the human visual system, and a distant object that appears to occupy a larger portion of the visual field activates a larger area in V1.
Abstract: Two objects that project the same visual angle on the retina can appear to occupy very different proportions of the visual field if they are perceived to be at different distances. What happens to the retinotopic map in primary visual cortex (V1) during the perception of these size illusions? Here we show, using functional magnetic resonance imaging (fMRI), that the retinotopic representation of an object changes in accordance with its perceived angular size. A distant object that appears to occupy a larger portion of the visual field activates a larger area in V1 than an object of equal angular size that is perceived to be closer and smaller. These results demonstrate that the retinal size of an object and the depth information in a scene are combined early in the human visual system.

387 citations


Proceedings ArticleDOI
01 Oct 2006
TL;DR: An improved method called gradient-based structural similarity (GSSIM) is developed, which is more consistent with the HVS than SSIM and PSNR, especially for blurred images.
Abstract: Objective quality assessment has been widely used in image processing for decades, and many researchers have been studying objective quality assessment methods based on the human visual system (HVS). Recently the structural similarity (SSIM) index was proposed, under the assumption that the HVS is highly adapted for extracting structural information from a scene, and simulation results have proved that it is better than PSNR (or MSE). By studying SSIM closely, we find that it fails when measuring badly blurred images. Based on this, we develop an improved method called gradient-based structural similarity (GSSIM). Experimental results show that GSSIM is more consistent with the HVS than SSIM and PSNR, especially for blurred images.
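The core idea is compact: keep SSIM's luminance term on the images themselves but compute the contrast and structure terms on gradient-magnitude maps, where blur is far more visible. The sketch below uses Sobel gradients, an 8x8 box window, and the usual SSIM constants for 8-bit images; these are common defaults and not necessarily the paper's exact choices.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def gssim(x, y, C1=6.5025, C2=58.5225, win=8):
    """Gradient-based structural similarity (sketch)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    gx = np.hypot(sobel(x, 0), sobel(x, 1))   # gradient magnitudes
    gy = np.hypot(sobel(y, 0), sobel(y, 1))

    mu_x, mu_y = uniform_filter(x, win), uniform_filter(y, win)
    luminance = (2 * mu_x * mu_y + C1) / (mu_x**2 + mu_y**2 + C1)

    mgx, mgy = uniform_filter(gx, win), uniform_filter(gy, win)
    vgx = uniform_filter(gx * gx, win) - mgx**2
    vgy = uniform_filter(gy * gy, win) - mgy**2
    cov = uniform_filter(gx * gy, win) - mgx * mgy
    contrast_structure = (2 * cov + C2) / (vgx + vgy + C2)

    return float((luminance * contrast_structure).mean())
```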

278 citations


Journal ArticleDOI
19 Jun 2006
TL;DR: The approach leverages the highly robust and invariant object recognition capabilities of the human visual system, using single-trial EEG analysis to efficiently detect neural signatures correlated with the recognition event.
Abstract: We describe a real-time electroencephalography (EEG)-based brain-computer interface system for triaging imagery presented using rapid serial visual presentation. A target image in a sequence of nontarget distractor images elicits in the EEG a stereotypical spatiotemporal response, which can be detected. A pattern classifier uses this response to reprioritize the image sequence, placing detected targets in the front of an image stack. We use single-trial analysis based on linear discrimination to recover spatial components that reflect differences in EEG activity evoked by target versus nontarget images. We find an optimal set of spatial weights for 59 EEG sensors within a sliding 50-ms time window. Using this simple classifier allows us to process EEG in real time. The detection accuracy across five subjects is on average 92%, i.e., in a sequence of 2500 images, resorting images based on detector output results in 92% of target images being moved from a random position in the sequence to one of the first 250 images (first 10% of the sequence). The approach leverages the highly robust and invariant object recognition capabilities of the human visual system, using single-trial EEG analysis to efficiently detect neural signatures correlated with the recognition event.
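The triage step reduces to learning spatial weights over the sensors in a short post-stimulus window and sorting images by the resulting score. In the sketch below a single 50 ms window (0.30 to 0.35 s after onset) and scikit-learn's LDA stand in for the paper's sliding windows and linear discriminator; the epoch layout, sampling rate, and the shortcut of scoring the training data itself are all simplifying assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def rank_images_by_eeg(epochs, labels, window=(0.30, 0.35), fs=250):
    """Reorder an RSVP image sequence by single-trial EEG evidence.

    epochs: (n_images, n_sensors, n_samples) array, time-locked to
    image onset; labels: 1 for target, 0 for distractor; window:
    (start, stop) in seconds after onset.
    """
    a, b = (int(t * fs) for t in window)
    feats = epochs[:, :, a:b].mean(axis=2)   # one value per sensor

    lda = LinearDiscriminantAnalysis()
    lda.fit(feats, labels)                   # learn spatial weights
    scores = lda.decision_function(feats)

    return np.argsort(-scores)               # likely targets first
```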

272 citations


Proceedings Article
04 Dec 2006
TL;DR: The model is rather simplistic and essentially parameter-free, contrasting with recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity; it in fact learns image features that resemble findings from several previous studies.
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts with recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that, despite the lack of any biological prior knowledge, our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system.
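In the same spirit, a saliency model can be learned with almost no design choices: normalized raw patches at fixated locations are positives, patches at random locations are negatives, and the classifier itself becomes the saliency model. The paper used a kernel method; logistic regression, the patch size, and the assumption that fixations lie away from the image border are simplifications made here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_saliency(images, fixations, patch=13, n_neg=200, seed=0):
    """Learn a saliency model directly from eye-movement data.

    images: list of 2-D grayscale arrays; fixations: per-image lists
    of (row, col) fixation points, assumed at least patch//2 pixels
    from the border.
    """
    rng = np.random.default_rng(seed)
    r = patch // 2
    X, y = [], []
    for img, fixs in zip(images, fixations):
        h, w = img.shape
        negs = rng.integers([r, r], [h - r, w - r], size=(n_neg, 2))
        for (i, j), label in [(f, 1) for f in fixs] + [(p, 0) for p in negs]:
            p_ = img[i - r:i + r + 1, j - r:j + r + 1].astype(np.float64).ravel()
            X.append((p_ - p_.mean()) / (p_.std() + 1e-8))  # normalized patch
            y.append(label)
    return LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
```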

209 citations


Journal ArticleDOI
TL;DR: A novel color image enhancement method is proposed, named the HVS Controlled Color Image Enhancement and Evaluation algorithm (HCCIEE algorithm), which is based on multiscale representations of pattern, luminance, and color processing in the HVS.

164 citations


Journal ArticleDOI
TL;DR: The proposed model incorporates the spatio-temporal contrast sensitivity function, the influence of eye movements, luminance adaptation, and contrast masking to be more consistent with human perception and is capable of yielding JNDs for both still images and video with significant motion.
Abstract: Just-noticeable distortion (JND), which refers to the maximum distortion that the human visual system (HVS) cannot perceive, plays an important role in perceptual image and video processing. In comparison with JND estimation for images, estimation of the JND profile for video needs to take into account the temporal HVS properties in addition to the spatial properties. In this paper, we develop a spatio-temporal model for estimating JND in the discrete cosine transform domain. The proposed model incorporates the spatio-temporal contrast sensitivity function, the influence of eye movements, luminance adaptation, and contrast masking to be more consistent with human perception. It is capable of yielding JNDs for both still images and video with significant motion. The experiments conducted in this study have demonstrated that the JND values estimated for video sequences with moving objects by the model are in line with the HVS perception. The accurate JND estimation of the video towards the actual visibility bounds can be translated into resource savings (e.g., for bandwidth/storage or computation) and performance improvement in video coding and other visual processing tasks (such as perceptual quality evaluation, visual signal restoration/enhancement, watermarking, authentication, and error protection).
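The multiplicative structure of such JND models is easy to show in miniature. The sketch below estimates a spatial-only JND for one 8x8 DCT block as a base threshold scaled by luminance-adaptation and contrast-masking factors; the factor shapes and the `base_thresh` input (an 8x8 array of CSF-derived thresholds) are generic textbook choices rather than the paper's fitted spatio-temporal functions, and the eye-movement and temporal terms are omitted.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(block):
    """2-D orthonormal DCT of an image block."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def jnd_dct_block(block, base_thresh, mean_lum_ref=128.0):
    """Toy spatial JND for one 8x8 block: base x luminance x masking."""
    coeffs = dct2(block.astype(np.float64))
    mean_lum = block.mean()

    # brighter backgrounds tolerate more distortion (illustrative shape)
    lum_adapt = np.sqrt(mean_lum / mean_lum_ref) if mean_lum > 0 else 1.0

    # strong coefficients mask more distortion in their own band
    masking = np.maximum(1.0, (np.abs(coeffs) / (base_thresh + 1e-9)) ** 0.6)

    return base_thresh * lum_adapt * masking
```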

161 citations


BookDOI
01 Oct 2006
TL;DR: This chapter discusses adaptation in the Visual System to Color, Spatial, and Temporal Contrast, and the role of light distribution in this transformation.
Abstract: Contents:
1. Processing of Information in the Human Visual System (Prof. Dr. F. Schaeffel, University of Tubingen): 1.1 Preface. 1.2 Design and Structure of the Eye. 1.3 Optical Aberrations and Consequences for Visual Performance. 1.4 Chromatic Aberration. 1.5 Neural Adaptation to Monochromatic Aberrations. 1.6 Optimizing Retinal Processing with Limited Cell Numbers, Space and Energy. 1.7 Adaptation to Different Light Levels. 1.8 Rod and Cone Responses. 1.9 Spiking and Coding. 1.10 Temporal and Spatial Performance. 1.11 ON/OFF Structure, Division of the Whole Illuminance Amplitude in Two Segments. 1.12 Consequences of the Rod and Cone Diversity on Retinal Wiring. 1.13 Motion Sensitivity in the Retina. 1.14 Visual Information Processing in Higher Centers. 1.15 Effects of Attention. 1.16 Color Vision, Color Constancy, and Color Contrast. 1.17 Depth Perception. 1.18 Adaptation in the Visual System to Color, Spatial, and Temporal Contrast. 1.19 Conclusions. References.
2. Introduction to Building a Machine Vision Inspection (Axel Telljohann, Consulting Team Machine Vision (CTMV)): 2.1 Preface. 2.2 Specifying a Machine Vision System. 2.3 Designing a Machine Vision System. 2.4 Costs. 2.5 Words on Project Realization. 2.6 Examples.
3. Lighting in Machine Vision (I. Jahr, Vision & Control GmbH): 3.1 Introduction. 3.2 Demands on Machine Vision Lighting. 3.3 Light used in Machine Vision. 3.4 Interaction of Test Object and Light. 3.5 Basic Rules and Laws of Light Distribution. 3.6 Light Filters. 3.7 Lighting Techniques and Their Use. 3.8 Lighting Control. 3.9 Lighting Perspectives for the Future. References.
4. Optical Systems in Machine Vision (Dr. Karl Lenhardt, Jos. Schneider Optische Werke GmbH): 4.1 A Look on the Foundations of Geometrical Optics. 4.2 Gaussian Optics. 4.3 The Wave Nature of Light. 4.4 Information Theoretical Treatment of Image Transfer and Storage. 4.5 Criteria for Image Quality. 4.6 Practical Aspects. References.
5. Camera Calibration (R. Godding, AICON 3D Systems GmbH): 5.1 Introduction. 5.2 Terminology. 5.3 Physical Effects. 5.4 Mathematical Calibration Model. 5.5 Calibration and Orientation Techniques. 5.6 Verification of Calibration Results. 5.7 Applications. References.
6. Camera Systems in Machine Vision (Horst Mattfeldt, Allied Vision Technologies GmbH): 6.1 Camera Technology. 6.2 Sensor Technologies. 6.3 CCD Image Artifacts. 6.4 CMOS Image Sensor. 6.5 Block Diagrams and their Description. 6.6 Digital Cameras. 6.7 Controlling Image Capture. 6.8 Configuration of the Camera. 6.9 Camera Noise. 6.10 Digital Interfaces. References.
7. Camera Computer Interfaces (Tony Iglesias, Anita Salmon, Johann Scholtz, Robert Hedegore, Julianna Borgendale, Brent Runnels, Nathan McKimpson, National Instruments): 7.1 Overview. 7.2 Analog Camera Buses. 7.3 Parallel Digital Camera Buses. 7.4 Standard PC Buses. 7.5 Choosing a Camera Bus. 7.6 Computer Buses. 7.7 Choosing a Computer Bus. 7.8 Driver Software. 7.9 Features of a Machine Vision System.
8. Machine Vision Algorithms (Dr. Carsten Steger, MVTec Software GmbH): 8.1 Fundamental Data Structures. 8.2 Image Enhancement. 8.3 Geometric Transformations. 8.4 Image Segmentation. 8.5 Feature Extraction. 8.6 Morphology. 8.7 Edge Extraction. 8.8 Segmentation and Fitting of Geometric Primitives. 8.9 Template Matching. 8.10 Stereo Reconstruction. 8.11 Optical Character Recognition. References.
9. Machine Vision in Manufacturing (Dr.-Ing. Peter Waszkewitz, Robert Bosch GmbH): 9.1 Introduction. 9.2 Application Categories. 9.3 System Categories. 9.4 Integration and Interfaces. 9.5 Mechanical Interfaces. 9.6 Electrical Interfaces. 9.7 Information Interfaces. 9.8 Temporal Interfaces. 9.9 Human-Machine Interfaces. 9.10 Industrial Case Studies. 9.11 Constraints and Conditions. References.
Index.

Proceedings ArticleDOI
14 May 2006
TL;DR: An improved method which is called edge-based structural similarity (ESSIM) is developed and experiment results show that ESSIM is more consistent with HVS than SSIM and PSNR especially for the blurred images.
Abstract: Objective quality assessment has been widely used in image processing for decades, and many researchers have been studying objective quality assessment methods based on the Human Visual System (HVS). Recently the Structural Similarity (SSIM) index was proposed, under the assumption that the HVS is highly adapted for extracting structural information from a scene, and simulation results have proved that it is better than PSNR (or MSE). By studying SSIM closely, we find that it fails when measuring badly blurred images. Based on this, we develop an improved method called Edge-based Structural Similarity (ESSIM). Experimental results show that ESSIM is more consistent with the HVS than SSIM and PSNR, especially for blurred images.

Journal ArticleDOI
TL;DR: Two lines of theoretical work that understand processes in the retina and primary visual cortex within this framework are reviewed, including the hypothesis that neural activities in V1 represent the bottom-up saliencies of visual inputs, such that information can be selected for, or discarded from, detailed or attentive processing.
Abstract: Early vision is best understood in terms of two key information bottlenecks along the visual pathway: the optic nerve and, more severely, attention. Two effective strategies for sampling and representing visual inputs in the light of the bottlenecks are (1) data compression with minimum information loss and (2) data deletion. This paper reviews two lines of theoretical work which understand processes in the retina and primary visual cortex (V1) in this framework. The first is an efficient coding principle which argues that early visual processes compress input into a more efficient form to transmit as much information as possible through channels of limited capacity. It can explain the properties of visual sampling and the nature of the receptive fields of the retina and V1. It has also been argued to reveal the independent causes of the inputs. The second theoretical tack is the hypothesis that neural activities in V1 represent the bottom-up saliencies of visual inputs, such that information can be selected for, or discarded from, detailed or attentive processing.

Journal ArticleDOI
01 Jul 2006
TL;DR: It is shown that, by taking perceptual grouping mechanisms into account, it is possible to build hybrid images with stable percepts at each distance, yielding compelling displays in which the image appears to change as the viewing distance changes.
Abstract: We present hybrid images, a technique that produces static images with two interpretations, which change as a function of viewing distance. Hybrid images are based on the multiscale processing of images by the human visual system and are motivated by masking studies in visual perception. These images can be used to create compelling displays in which the image appears to change as the viewing distance changes. We show that by taking into account perceptual grouping mechanisms it is possible to build compelling hybrid images with stable percepts at each distance. We show examples in which hybrid images are used to create textures that become visible only when seen up-close, to generate facial expressions whose interpretation changes with viewing distance, and to visualize changes over time within a single picture.
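The construction itself is two filters and a sum: the low spatial frequencies of one image plus the high spatial frequencies of another. The crossover scale `sigma` below is illustrative and would be tuned per image pair, as the perceptual-grouping analysis in the paper suggests.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid_image(far_img, near_img, sigma=8.0):
    """Low frequencies of far_img + high frequencies of near_img.

    Viewed from a distance the low-pass content dominates; up close
    the high-pass detail of near_img takes over.
    """
    far = far_img.astype(np.float64)
    near = near_img.astype(np.float64)
    low = gaussian_filter(far, sigma)
    high = near - gaussian_filter(near, sigma)
    return np.clip(low + high, 0, 255)
```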

Journal ArticleDOI
TL;DR: A parametric study of visual search for sine-wave targets added to spatial noise backgrounds that have spectral characteristics similar to natural images finds that many aspects of search performance and eye movement pattern are similar to those of an ideal searcher that has the same falloff in resolution with retinal eccentricity as the human visual system.
Abstract: Two of the factors limiting progress in understanding the mechanisms of visual search are the difficulty of controlling and manipulating the retinal stimulus when the eyes are free to move and the lack of an ideal observer theory for fixation selection during search. Recently, we developed a method to precisely control retinal stimulation with gaze-contingent displays (J. S. Perry & W. S. Geisler, 2002), and we derived a theory of optimal eye movements in visual search (J. Najemnik & W. S. Geisler, 2005). Here, we report a parametric study of visual search for sine-wave targets added to spatial noise backgrounds that have spectral characteristics similar to natural images (the amplitude spectrum of the noise falls inversely with spatial frequency). Search time, search accuracy, and eye fixations were measured as a function of target spatial frequency, 1/f noise contrast, and the resolution falloff of the display from the point of fixation. The results are systematic and similar for the two observers. We find that many aspects of search performance and eye movement pattern are similar to those of an ideal searcher that has the same falloff in resolution with retinal eccentricity as the human visual system.
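The backgrounds used in the study are simple to synthesize: noise whose amplitude spectrum falls as 1/f, like natural images. The sketch below builds one in the Fourier domain; taking the real part of the inverse transform is a shortcut that slightly perturbs the spectrum but is fine for illustration.

```python
import numpy as np

def one_over_f_noise(size, seed=None):
    """Noise field whose amplitude spectrum falls inversely with
    spatial frequency, normalized to zero mean and unit variance."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.hypot(fy, fx)
    f[0, 0] = 1.0                              # avoid dividing by zero at DC
    phase = np.exp(2j * np.pi * rng.random((size, size)))
    noise = np.real(np.fft.ifft2(phase / f))   # amplitude ~ 1/f
    noise -= noise.mean()
    return noise / noise.std()
```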

Proceedings ArticleDOI
25 Jan 2006
TL;DR: A novel saliency map is presented which exploits the computational performance of modern GPUs, making it possible to calculate the map in milliseconds and allowing it to be part of a real-time rendering system.
Abstract: The computation of high-fidelity images in real time remains one of the key challenges for computer graphics. Recent work has shown that by understanding the human visual system, selective rendering may be used to render at high quality only those parts to which the human viewer is attending, and the rest of the scene at a much lower quality. This can result in a significant reduction in computational time, without the viewer being aware of the quality difference. Selective rendering is guided by models of the human visual system, typically in the form of a 2D saliency map, which predict where the user will be looking in any scene. Computation of these maps themselves often takes many seconds, thus precluding such an approach in any interactive system, where many frames need to be rendered per second. In this paper we present a novel saliency map which exploits the computational performance of modern GPUs. With our approach it is thus possible to calculate this map in milliseconds, allowing it to be part of a real-time rendering system. In addition, we also show how depth, habituation and motion can be added to the saliency map to further guide the selective rendering. This ensures that only the most perceptually important parts of any animated sequence need be rendered in high quality. The rest of the animation can be rendered at a significantly lower quality, and thus much lower computational cost, without the user being aware of this difference.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: This paper provides an overview of the fundamental principle underlying 2D-to-3D conversion techniques, a cursory look at a number of approaches for depth extraction from a single image, and a highlight of the potential use of surrogate depth maps in depth-image-based rendering for 2D-to-3D conversion.
Abstract: The next major advancement in television is expected to be stereoscopic three-dimensional television (3D-TV). A successful roll-out of 3D-TV will require a backward-compatible transmission and distribution system, inexpensive 3D displays that are equal or superior to high-definition television (HDTV), and an adequate supply of high-quality 3D program material. With respect to the last factor, we reckon that the conversion of 2D material to stereoscopic 3D could play an important role. In this paper we provide (a) an overview of the fundamental principle underlying 2D-to-3D conversion techniques, (b) a cursory look at a number of approaches for depth extraction from a single image, and (c) a highlight of the potential use of surrogate depth maps in depth-image-based rendering for 2D-to-3D conversion. This latter approach exploits the ability of the human visual system to combine reduced disparity information, located mainly at edges and object boundaries, with pictorial depth cues to produce an enhanced sensation of depth over 2D images.
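Depth-image-based rendering reduces, in its most naive form, to shifting pixels horizontally by a disparity derived from the (possibly surrogate) depth map and filling the holes that disocclusion leaves behind. Everything below is a deliberately crude sketch: the disparity scaling, the nearest-neighbor hole filling, and `max_disp` are illustrative choices, not the paper's method.

```python
import numpy as np

def dibr_stereo_pair(image, depth, max_disp=8):
    """Synthesize a left/right view pair from one image + depth map."""
    h, w = depth.shape
    disp = (depth / depth.max() * max_disp).astype(int)
    left, right = np.zeros_like(image), np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            xl, xr = x + disp[y, x] // 2, x - disp[y, x] // 2
            if 0 <= xl < w:
                left[y, xl] = image[y, x]
            if 0 <= xr < w:
                right[y, xr] = image[y, x]
    # crude hole filling: propagate the nearest valid pixel from the left
    for view in (left, right):
        for y in range(h):
            for x in range(1, w):
                if not view[y, x].any():
                    view[y, x] = view[y, x - 1]
    return left, right
```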

Dissertation
01 Jan 2006
TL;DR: This may be the first time that a neurobiological model, faithful to the physiology and the anatomy of visual cortex, not only competes with some of the best computer vision systems thus providing a realistic alternative to engineered artificial vision systems, but also achieves performance close to that of humans in a categorization task involving complex natural images.
Abstract: In this thesis, I describe a quantitative model that accounts for the circuits and computations of the feedforward path of the ventral stream of visual cortex. This model is consistent with a general theory of visual processing that extends the hierarchical model of [Hubel and Wiesel, 1959] from primary to extrastriate visual areas. It attempts to explain the first few hundred milliseconds of visual processing and "immediate recognition". One of the key elements in the approach is the learning of a generic dictionary of shape-components from V2 to IT, which provides an invariant representation to task-specific categorization circuits in higher brain areas. This vocabulary of shape-tuned units is learned in an unsupervised manner from natural images, and constitutes a large and redundant set of image features with different complexities and invariances. This theory significantly extends an earlier approach by [Riesenhuber and Poggio, 1999a] and builds upon several existing neurobiological models and conceptual proposals. First, I present evidence to show that the model can duplicate the tuning properties of neurons in various brain areas (e.g., V1, V4 and IT). In particular, the model agrees with data from V4 about the response of neurons to combinations of simple two-bar stimuli [Reynolds et al., 1999] (within the receptive field of the S2 units) and some of the C2 units in the model show a tuning for boundary conformations which is consistent with recordings from V4 [Pasupathy and Connor, 2001]. Second, I show that not only can the model duplicate the tuning properties of neurons in various brain areas when probed with artificial stimuli, but it can also handle the recognition of objects in the real world, to the extent of competing with the best computer vision systems. Third, I describe a comparison between the performance of the model and the performance of human observers in a rapid animal vs. non-animal recognition task for which recognition is fast and cortical back-projections are likely to be inactive. Results indicate that the model predicts human performance extremely well when the delay between the stimulus and the mask is about 50 ms. This suggests that cortical back-projections may not play a significant role when the time interval is in this range, and the model may therefore provide a satisfactory description of the feedforward path. Taken together, the evidence suggests that we may have the skeleton of a successful theory of visual cortex. In addition, this may be the first time that a neurobiological model, faithful to the physiology and the anatomy of visual cortex, not only competes with some of the best computer vision systems thus providing a realistic alternative to engineered artificial vision systems, but also achieves performance close to that of humans in a categorization task involving complex natural images. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
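The first two stages of such a feedforward hierarchy fit in a few lines: S1 units are Gabor filters (simple cells) and C1 units take a local maximum over position (complex cells), buying tolerance to small shifts. The filter parameters and pooling size below are illustrative, not the values fitted in the thesis, and the S2/C2 shape-dictionary stages are omitted.

```python
import numpy as np
from scipy.ndimage import correlate, maximum_filter

def gabor(theta, size=11, lam=5.6, sigma=4.5, gamma=0.3):
    """An S1-style Gabor receptive field at orientation theta."""
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    g = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * xr / lam)
    return g - g.mean()

def s1_c1(image, n_orientations=4, pool=8):
    """S1 (Gabor filtering) followed by C1 (local max pooling)."""
    c1 = []
    for k in range(n_orientations):
        s1 = np.abs(correlate(image.astype(np.float64),
                              gabor(k * np.pi / n_orientations)))
        c1.append(maximum_filter(s1, size=pool)[::pool, ::pool])
    return np.stack(c1)       # (n_orientations, h//pool, w//pool)
```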

Journal ArticleDOI
TL;DR: A computationally efficient video distortion metric that can operate in full- or reduced-reference mode as required, based on a model of the human visual system implemented using the wavelet transform and separable filters is presented.
Abstract: Video distortion metrics based on models of the human visual system have traditionally used comparisons between the distorted signal and a reference signal to calculate distortions objectively. In video coding applications, this is not prohibitive. In quality monitoring applications, however, access to the reference signal is often limited. This paper presents a computationally efficient video distortion metric that can operate in full- or reduced-reference mode as required. The metric is based on a model of the human visual system implemented using the wavelet transform and separable filters. The visual model is parameterized using a set of video frames and the associated quality scores. The visual model's hierarchical structure, as well as the limited impact of fine scale distortions on quality judgments of severely impaired video, are exploited to build a framework for scaling the bitrate required to represent the reference signal. Two applications of the metric are also presented. In the first, the metric is used as the distortion measure in a rate-distortion optimized rate control algorithm for MPEG-2 video compression. The resulting compressed video sequences demonstrate significant improvements in visual quality over compressed sequences with allocations determined by the TM5 rate control algorithm operating with MPEG-2 at the same rate. In the second, the metric is used to estimate time series of objective quality scores for distorted video sequences using reference bitrates as low as 10 kb/s. The resulting quality scores more accurately model subjective quality recordings than do those estimated using the mean squared error as a distortion metric, while requiring a fraction of the bitrate used to represent the reference signal. The reduced-reference metric's performance is comparable to that of the full-reference metrics tested in the first Video Quality Experts Group evaluation.
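The reduced-reference idea can be illustrated with a handful of coarse-scale wavelet statistics: only those few numbers travel with the video, and distortion is scored against them at the monitoring point. The sketch below (using PyWavelets, a db4 wavelet, and plain subband energies) omits the paper's HVS weighting and its parameterization from subjective scores.

```python
import numpy as np
import pywt

def rr_distortion(ref, dist, level=3):
    """Reduced-reference distortion from coarse wavelet statistics."""
    def coarse_features(img):
        coeffs = pywt.wavedec2(img.astype(np.float64), 'db4', level=level)
        cA, (cH, cV, cD) = coeffs[0], coeffs[1]   # coarsest scale only
        return np.array([(c ** 2).mean() for c in (cA, cH, cV, cD)])

    fr, fd = coarse_features(ref), coarse_features(dist)
    return float(np.abs(fr - fd).sum() / (fr.sum() + 1e-12))
```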

Journal ArticleDOI
TL;DR: It is argued that attentional selection of pertinent information is heavily influenced by the most recently viewed stimuli that were important for behaviour.
Abstract: Many lines of evidence show that the human visual system does not simply passively register whatever appears in the visual field. The visual system seems to preferentially “choose” stimuli according to what is most relevant for the task at hand, a process called attentional selection. Given the large amount of information in any given visual scene, and well-documented capacity limitations for the representation of visual stimuli, such a strategy seems only reasonable. Consistent with this, human observers are surprisingly insensitive to large changes in their visual environment when they attend to something else in the visual scene. Here I argue that attentional selection of pertinent information is heavily influenced by the stimuli most recently viewed that were important for behaviour. I will describe recent evidence for the existence of a powerful memory system, not under any form of voluntary control, which aids observers in orienting quickly and effectively to behaviourally relevant stimuli in the vi...

Journal ArticleDOI
TL;DR: An intriguing interaction between texture type (periodic, structured, or 3-D textures) and image statistics (autocorrelation function and filter magnitude correlations) is found, suggesting different representations may be employed for these texture families under pre-attentive viewing.

Dissertation
01 Jan 2006
TL;DR: This thesis describes an effort to construct a scene understanding system that is able to analyze the content of real images using a model of the human visual system constructed in the lab, and demonstrates that this biologically motivated image representation constitutes an effective representation for object detection, facilitating unprecedented levels of detection accuracy.
Abstract: This thesis describes an effort to construct a scene understanding system that is able to analyze the content of real images. While constructing the system we had to provide solutions to many of the fundamental questions that every student of object recognition deals with daily. These include the choice of data set, the choice of success measurement, the representation of the image content, the selection of inference engine, and the representation of the relations between objects. The main test-bed for our system is the CBCL StreetScenes data base. It is a carefully labeled set of images, much larger than any similar data set available at the time it was collected. Each image in this data set was labeled for 9 common classes such as cars, pedestrians, roads and trees. Our system represents each image using a set of features that are based on a model of the human visual system constructed in our lab. We demonstrate that this biologically motivated image representation, along with its extensions, constitutes an effective representation for object detection, facilitating unprecedented levels of detection accuracy. Similarly to biological vision systems, our system uses hierarchical representations. We therefore explore the possible ways of combining information across the hierarchy into the final perception. Our system is trained using standard machine learning machinery, which was first applied to computer vision in earlier work of Prof. Poggio and others. We demonstrate how the same standard methods can be used to model relations between objects in images as well, capturing context information. The resulting system detects and localizes, using a unified set of tools and image representations, compact objects such as cars, amorphous objects such as trees and roads, and the relations between objects within the scene. The same representation also excels in identifying objects in clutter without scanning the image. Much of the work presented in the thesis was devoted to a rigorous comparison of our system to alternative object recognition systems. The results of these experiments support the effectiveness of simple feed-forward systems for the basic tasks involved in scene understanding. We make our results fully available to the public by publishing our code and data sets in hope that others may improve and extend our results. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

Journal ArticleDOI
TL;DR: The design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy that allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments is presented.
Abstract: This paper presents the design and evaluation of a speaker-independent audio-visual speech recognition (AVSR) system that utilizes a segment-based modeling strategy. The audio and visual feature streams are integrated using a segment-constrained hidden Markov model, which allows the visual classifier to process visual frames with a constrained amount of asynchrony relative to proposed acoustic segments. The core experiments in this paper investigate several different visual model structures, each of which provides a different means for defining the units of the visual classifier and the synchrony constraints between the audio and visual streams. Word recognition experiments are conducted on the AV-TIMIT corpus under variable additive noise conditions. Over varying acoustic signal-to-noise ratios, word error rate reductions between 14% and 60% are observed when integrating the visual information into the automatic speech recognition process.

Journal ArticleDOI
TL;DR: The detector is shown to perform better in most cases when the proposed mask is employed, and the improved performance of the proposed detection scheme is justified theoretically for the case of a linear filtering plus noise attack and through extensive simulations.
Abstract: The aim of this paper is to improve the performance of spatial domain watermarking. To this end, a new perceptual mask and a new detection scheme are proposed. The proposed spatial perceptual mask is based on the cover image prediction error sequence and matches very well with the properties of the human visual system. It exhibits superior performance compared to existing spatial masking schemes. Moreover, it allows for a significantly increased strength of the watermark while, at the same time, the watermark visibility is decreased. The new blind detection scheme comprises an efficient prewhitening process and a correlation-based detector. The prewhitening process is based on the least-squares prediction error filter and substantially improves the detector's performance. The correlation-based detector that was selected is shown to be the most suitable for the problem at hand. The improved performance of the proposed detection scheme has been justified theoretically for the case of linear filtering plus noise attack and through extensive simulations. The theoretical analysis is independent of the proposed mask and the derived expressions can be used for any watermarking technique based on spatial masking. It is shown though that in most cases the detector performs better if the proposed mask is employed.
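The detector side of such a scheme is a prewhitened correlation. In the sketch below, the paper's least-squares prediction-error filter is approximated by subtracting a local mean, so most of the correlated image content is removed from the detection statistic while the high-frequency watermark energy survives; the filter choice and normalization are simplifying assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def prewhitened_correlation_detector(received, watermark):
    """Correlation detection statistic after crude prewhitening.

    A threshold on the returned statistic would be set from the
    statistics of unwatermarked images (e.g., for a target false
    alarm rate).
    """
    r = received.astype(np.float64)
    residual = r - uniform_filter(r, size=3)   # crude prediction error
    w = watermark - watermark.mean()
    return float((residual * w).sum() / np.sqrt((w * w).sum()))
```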

Proceedings ArticleDOI
TL;DR: This work proposes models for typical distortions encountered in video compression/transmission applications, and derives a multi-scale weighted variant of the complex wavelet SSIM (WCWSSIM), with weights based on the human contrast sensitivity function to handle local mean shift distortions.
Abstract: Perceptual image quality metrics have explicitly accounted for human visual system (HVS) sensitivity to subband noise by estimating thresholds above which distortion is just-noticeable. A recently proposed class of quality metrics, known as structural similarity (SSIM), models perception implicitly by taking into account the fact that the HVS is adapted for extracting structural information (relative spatial covariance) from images. We compare specific SSIM implementations both in the image space and the wavelet domain. We also evaluate the effectiveness of the complex wavelet SSIM (CWSSIM), a translation-insensitive SSIM implementation, in the context of realistic distortions that arise from compression and error concealment in video transmission applications. In order to better explore the space of distortions, we propose models for typical distortions encountered in video compression/transmission applications. We also derive a multi-scale weighted variant of the complex wavelet SSIM (WCWSSIM), with weights based on the human contrast sensitivity function to handle local mean shift distortions.

Journal ArticleDOI
TL;DR: Using a Ternus-Pikler display, it is shown that human observers can perceive features of moving objects at locations these features are not present and that these non-retinotopic feature attributions are not errors caused by the limitations of the perceptual system but follow rules of perceptual grouping.

Journal ArticleDOI
TL;DR: The proposed backlight scaling technique efficiently computes the flickering effect online and uses a measure of the temporal distortion to adjust the slack on the intra-frame spatial distortion, achieving a good balance between the two sources of distortion while maximizing the backlight-dimming-driven energy saving in the display system and meeting an overall video quality figure of merit.
Abstract: Liquid crystal displays (LCDs) have appeared in applications ranging from medical equipment to automobiles, gas pumps, laptops, and handheld portable computers. These display components present a cascaded energy attenuator to the battery of the handheld device which is responsible for about half of the energy drain at maximum display intensity. As such, the display components become the main focus of every effort for maximization of the embedded system's battery lifetime. This paper proposes an approach for pixel transformation of the displayed image to increase the potential energy saving of the backlight scaling method. The proposed approach takes advantage of human visual system (HVS) characteristics and tries to minimize distortion between the perceived brightness values of the individual pixels in the original image and those of the backlight-scaled image. This is in contrast to previous backlight scaling approaches which simply match the luminance values of the individual pixels in the original and backlight-scaled images. Furthermore, this paper proposes a temporally-aware backlight scaling technique for video streams. The goal is to maximize energy saving in the display system by means of dynamic backlight dimming subject to a video distortion tolerance. The video distortion comprises: 1) an intra-frame (spatial) distortion component due to frame-sensitive backlight scaling and transmittance function tuning and 2) an inter-frame (temporal) distortion component due to large-step backlight dimming across frames, modulated by the psychophysical characteristics of the human visual system. The proposed backlight scaling technique is capable of efficiently computing the flickering effect online and subsequently using a measure of the temporal distortion to appropriately adjust the slack on the intra-frame spatial distortion, thereby achieving a good balance between the two sources of distortion while maximizing the backlight-dimming-driven energy saving in the display system and meeting an overall video quality figure of merit. The proposed dynamic backlight scaling approach is amenable to highly efficient hardware realization and has been implemented on the Apollo Testbed II. Actual current measurements demonstrate the effectiveness of the proposed technique compared to previous backlight dimming techniques, which have ignored the temporal distortion effect.
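The spatial/temporal trade-off at the heart of the technique can be sketched directly: dim the backlight, boost pixel values to compensate (clipping the brightest pixels is the spatial distortion), and rate-limit the dimming level across frames to curb flicker (the temporal distortion). The dimming level, step bound, and 8-bit range below are illustrative, not the optimized values from the paper.

```python
import numpy as np

def backlight_scale(frame, dim=0.7, prev_dim=None, max_step=0.1):
    """HVS-motivated backlight scaling with temporal smoothing.

    Returns the boosted frame to display, the brightness the viewer
    actually perceives, and the (possibly rate-limited) dimming level.
    """
    if prev_dim is not None:                      # limit inter-frame change
        dim = float(np.clip(dim, prev_dim - max_step, prev_dim + max_step))
    boosted = np.clip(frame.astype(np.float64) / dim, 0, 255)
    perceived = boosted * dim                     # after physical dimming
    return boosted, perceived, dim
```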

Journal ArticleDOI
TL;DR: A multiscale model to represent natural images is derived based on the scale-space representation: a model inspired by the human visual system that fulfills a number of properties allowing the local orientation of several image structures to be estimated.
Abstract: The efficient representation of local differential structure at various resolutions has been a matter of great interest for adaptive image processing and computer vision tasks. In this paper, we derive a multiscale model to represent natural images based on the scale-space representation: a model that has an inspiration in the human visual system. We first derive the one-dimensional case and then extend the results to two and three dimensions. The operators obtained for analysis and synthesis stages are derivatives of the Gaussian smoothing kernel, so that, for the two-dimensional case, we can represent them either in a rotated coordinate system or in terms of directional derivatives. The method to perform the rotation is efficient because it is implemented by means of the application of the so-called generalized binomial filters. Such a family of discrete sequences fulfills a number of properties that allows estimating the local orientation for several image structures. We also define the discrete counterpart in which the coordinate normalization of the continuous case is approximated as a subsampling of the discrete domain.
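As one concrete use of these operators, local orientation can be estimated from first-order Gaussian derivatives via a structure tensor. The sketch below applies continuous Gaussian derivative filters directly, whereas the paper implements the operators discretely with generalized binomial filters; the scales chosen are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_orientation(image, sigma=2.0):
    """Per-pixel orientation estimate from Gaussian derivatives."""
    img = image.astype(np.float64)
    ix = gaussian_filter(img, sigma, order=(0, 1))   # d/dx
    iy = gaussian_filter(img, sigma, order=(1, 0))   # d/dy

    # structure tensor, smoothed at a coarser integration scale
    jxx = gaussian_filter(ix * ix, 2 * sigma)
    jyy = gaussian_filter(iy * iy, 2 * sigma)
    jxy = gaussian_filter(ix * iy, 2 * sigma)
    return 0.5 * np.arctan2(2 * jxy, jxx - jyy)      # radians
```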

01 Jan 2006
TL;DR: The applications of visual tracking in three areas, including visual surveillance, image compression, and 3-D reconstruction, are discussed, and the state of the art of visual tracking is introduced, with particular attention to the main approaches to visual tracking.
Abstract: This paper introduces in detail the research on visual tracking, currently a hot topic in the domain of computer vision. Firstly, the applications of visual tracking in three areas, including visual surveillance, image compression, and 3-D reconstruction, are discussed. Secondly, the state of the art of visual tracking is introduced, especially the main approaches to visual tracking. In order to explain these methods clearly, the problems of visual tracking are classified. Then two ways to research the visual tracking problem are presented, and the visual tracking algorithms are classified into four classes: area-based methods, feature-based methods, deformable-template-based methods, and model-based methods. Finally, from the point of view of control theory, the difficulties of visual tracking are discussed: algorithms should be robust, accurate, and fast. Meanwhile, some future directions of visual tracking are also addressed briefly.

Journal ArticleDOI
TL;DR: The proposed visual quality metric is based on an effective Human Visual System model and relies on the computation of three distortion factors: blockiness, edge errors and visual impairments, which take into account the typical artifacts introduced by several classes of coders.
Abstract: In this paper, a multi-factor full-reference image quality index is presented. The proposed visual quality metric is based on an effective Human Visual System model. Images are pre-processed in order to take into account luminance masking and contrast sensitivity effects. The proposed metric relies on the computation of three distortion factors: blockiness, edge errors and visual impairments, which take into account the typical artifacts introduced by several classes of coders. A pooling algorithm is used in order to obtain a single distortion index. Results show the effectiveness of the proposed approach and its consistency with subjective evaluations.
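Of the three distortion factors, blockiness is the easiest to illustrate: compare gradient energy across 8x8 block boundaries with the energy inside blocks. The ratio below is a generic stand-in; the paper's factor definitions, pre-processing (luminance masking, contrast sensitivity), and pooling differ.

```python
import numpy as np

def blockiness(image, block=8):
    """Ratio of gradient energy at block boundaries vs. inside blocks.

    Values well above 1 indicate visible blocking artifacts from
    block-based coding.
    """
    img = image.astype(np.float64)
    dh = np.abs(np.diff(img, axis=1))          # horizontal differences
    boundary = dh[:, block - 1::block].mean()  # across block boundaries
    interior = np.delete(dh, np.s_[block - 1::block], axis=1).mean()
    return float(boundary / (interior + 1e-12))
```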