
Showing papers by "Eero P. Simoncelli published in 2016"


Posted Content
TL;DR: In this article, an image compression model consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation is jointly optimized for rate-distortion performance over a database of training images.
Abstract: We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate-distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods. More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.

497 citations
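
A minimal sketch of the training objective described in this abstract, assuming hypothetical callables analysis, synthesis, and log2_density standing in for the paper's learned transforms and entropy model: quantization is relaxed by adding uniform noise, and the loss trades off an estimated bit rate against distortion via a parameter lam. This is an illustration of the idea, not the authors' implementation.

import torch

def relaxed_rd_loss(x, analysis, synthesis, log2_density, lam):
    # x: batch of images, shape (N, C, H, W).
    # analysis / synthesis: placeholder nonlinear transforms (learned in the paper).
    # log2_density: placeholder for the learned latent density model (base-2 log).
    y = analysis(x)
    # Continuous proxy for the uniform scalar quantizer: additive noise in [-0.5, 0.5).
    y_tilde = y + (torch.rand_like(y) - 0.5)
    # Rate term: expected code length in bits per image, averaged over the batch.
    rate = -log2_density(y_tilde).flatten(1).sum(dim=1).mean()
    # Distortion term: squared error here; the paper also evaluates MS-SSIM.
    distortion = torch.mean((x - synthesis(y_tilde)) ** 2)
    return rate + lam * distortion

In the paper's setup, the trade-off parameter specifies the operating point along the rate-distortion curve; the placeholder lam plays that role here.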


Proceedings Article
05 Nov 2016
TL;DR: In this paper, an image compression model consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation is jointly optimized for rate-distortion performance over a database of training images.
Abstract: We describe an image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. The transforms are constructed in three successive stages of convolutional linear filters and nonlinear activation functions. Unlike most convolutional neural networks, the joint nonlinearity is chosen to implement a form of local gain control, inspired by those used to model biological neurons. Using a variant of stochastic gradient descent, we jointly optimize the entire model for rate-distortion performance over a database of training images, introducing a continuous proxy for the discontinuous loss function arising from the quantizer. Under certain conditions, the relaxed loss function may be interpreted as the log likelihood of a generative model, as implemented by a variational autoencoder. Unlike these models, however, the compression model must operate at any given point along the rate-distortion curve, as specified by a trade-off parameter. Across an independent set of test images, we find that the optimized method generally exhibits better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods. More importantly, we observe a dramatic improvement in visual quality for all images at all bit rates, which is supported by objective quality estimates using MS-SSIM.

410 citations
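
As a complement to the relaxed training loss sketched under the Posted Content version of this paper above, the fragment below illustrates the discrete side implied by the uniform quantizer: latents are rounded to integers and an ideal code length is computed from a probability table over the quantized symbols. The histogram-based probability table is an illustrative stand-in, not the entropy model used in the paper.

import numpy as np

def quantize_and_estimate_rate(y, pmf, offset):
    # y: array of latent coefficients; pmf: probabilities over integer symbols,
    # where pmf[0] corresponds to the integer value `offset` (a stand-in for the
    # learned entropy model). Returns the quantized symbols and the total bits.
    q = np.round(y).astype(int)                   # uniform scalar quantizer
    idx = np.clip(q - offset, 0, len(pmf) - 1)    # map each symbol to its pmf bin
    bits = -np.log2(np.maximum(pmf[idx], 1e-12))  # ideal per-symbol code length
    return q, bits.sum()

# Illustrative usage with a histogram fitted to latents from held-out images
# (train_latents / test_latents are hypothetical arrays):
# counts, _ = np.histogram(np.round(train_latents), bins=np.arange(-32, 34) - 0.5)
# q, total_bits = quantize_and_estimate_rate(test_latents, counts / counts.sum(), offset=-32)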


Proceedings Article
01 Jan 2016
TL;DR: In this paper, a parametric nonlinear transformation is proposed for Gaussianizing data from natural images, where each component is normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and an additive constant.
Abstract: We introduce a parametric nonlinear transformation that is well-suited for Gaussianizing data from natural images. After a linear transformation of the data, each component is normalized by a pooled activity measure, computed by exponentiating a weighted sum of rectified and exponentiated components and an additive constant. We optimize the parameters of this transformation (linear transform, exponents, weights, constant) over a database of natural images, directly minimizing the negentropy of the responses. We find that the optimized transformation successfully Gaussianizes the data, achieving a significantly smaller mutual information between transformed components than previous methods including ICA and radial Gaussianization. The transformation is differentiable and can be efficiently inverted, and thus induces a density model on images. We show that samples of this model are visually similar to samples of natural image patches. We also demonstrate the use of the model as a prior density in removing additive noise. Finally, we show that the transformation can be cascaded, with each layer optimized (unsupervised) using the same Gaussianization objective, to capture additional probabilistic structure.

176 citations
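
A small numpy sketch of the normalization described above, under the assumption of a generalized-divisive-normalization-style form: a linear transform, followed by division of each component by an exponentiated weighted sum of rectified, exponentiated components plus a constant. The parameter names and shapes are illustrative placeholders; in the paper the linear transform, exponents, weights, and constants are all optimized to minimize negentropy.

import numpy as np

def gaussianizing_transform(x, W, beta, gamma, alpha, eps):
    # x: (N, D) flattened image patches; W: (D, D) linear transform.
    # beta: (D,) additive constants; gamma: (D, D) nonnegative pooling weights.
    # alpha, eps: exponents (scalars here for simplicity).
    # All parameter values are placeholders for the ones learned in the paper.
    v = x @ W.T                                      # linear stage
    pooled = beta + (np.abs(v) ** alpha) @ gamma.T   # rectify, exponentiate, weight, add constant
    return v / (pooled ** eps)                       # divisive normalization

Because the transformation is differentiable and efficiently invertible (per the abstract), it induces a density model on images and can be cascaded layer by layer.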


Proceedings ArticleDOI
18 Jul 2016
TL;DR: In this article, the authors introduce a general framework for end-to-end optimization of the rate-distortion performance of nonlinear transform codes assuming scalar quantization, which can be used to optimize any differentiable pair of analysis and synthesis transforms in combination with any differentiable perceptual metric.
Abstract: We introduce a general framework for end-to-end optimization of the rate-distortion performance of nonlinear transform codes assuming scalar quantization. The framework can be used to optimize any differentiable pair of analysis and synthesis transforms in combination with any differentiable perceptual metric. As an example, we consider a code built from a linear transform followed by a form of multi-dimensional local gain control. Distortion is measured with a state-of-the-art perceptual metric. When optimized over a large database of images, this representation offers substantial improvements in bitrate and perceptual appearance over fixed (DCT) codes, and over linear transform codes optimized for mean squared error.

147 citations
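
A skeleton, in the same hedged spirit as the sketch above, of the end-to-end framework this abstract describes: any differentiable analysis/synthesis pair and any differentiable distortion measure can be plugged into a single rate-distortion objective. Scalar quantization is handled here with the uniform-noise relaxation used in the companion compression papers listed above; the paper's own treatment of the quantizer may differ, and all callables and optimizer settings are placeholders.

import torch

def train_step(x, analysis, synthesis, distortion_fn, rate_fn, optimizer, lam):
    # analysis / synthesis: any differentiable transform pair (placeholders).
    # distortion_fn: any differentiable distortion, e.g. a perceptual metric (placeholder).
    # rate_fn: differentiable estimate of the code length of the relaxed latents (placeholder).
    optimizer.zero_grad()
    y = analysis(x)
    y_tilde = y + (torch.rand_like(y) - 0.5)   # relaxation of scalar quantization
    loss = rate_fn(y_tilde) + lam * distortion_fn(x, synthesis(y_tilde))
    loss.backward()
    optimizer.step()
    return loss.item()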


Posted Content
TL;DR: This work introduces a general framework for end-to-end optimization of the rate-distortion performance of nonlinear transform codes assuming scalar quantization and considers a code built from a linear transform followed by a form of multi-dimensional local gain control.
Abstract: We introduce a general framework for end-to-end optimization of the rate-distortion performance of nonlinear transform codes assuming scalar quantization. The framework can be used to optimize any differentiable pair of analysis and synthesis transforms in combination with any differentiable perceptual metric. As an example, we consider a code built from a linear transform followed by a form of multi-dimensional local gain control. Distortion is measured with a state-of-the-art perceptual metric. When optimized over a large database of images, this representation offers substantial improvements in bitrate and perceptual appearance over fixed (DCT) codes, and over linear transform codes optimized for mean squared error.

134 citations


Journal ArticleDOI
TL;DR: An image quality metric based on two transformations associated with the early visual system, local luminance subtraction and local gain control, which are shown to lead to significant reductions in redundancy relative to the original image pixels.
Abstract: We present an image quality metric based on the transformations associated with the early visual system: local luminance subtraction and local gain control. Images are decomposed using a Laplacian pyramid, which subtracts a local estimate of the mean luminance at multiple scales. Each pyramid coefficient is then divided by a local estimate of amplitude (weighted sum of absolute values of neighbors), where the weights are optimized for prediction of amplitude using (undistorted) images from a separate database. We define the quality of a distorted image, relative to its undistorted original, as the root mean squared error in this “normalized Laplacian” domain. We show that both luminance subtraction and amplitude division stages lead to significant reductions in redundancy relative to the original image pixels. We also show that the resulting quality metric provides a better account of human perceptual judgements than either MS-SSIM or a recently-published gain-control metric based on oriented filters.

Introduction

Many problems in image processing rely, at least implicitly, on a measure of image quality. Although mean squared error (MSE) is the standard choice, it is well known that it is not very well matched to the distortion perceived by human observers [1, 2, 3]. Objective measures of perceptual image quality attempt to correct this by incorporating known characteristics of human perception (see reviews [4, 5]). These measures typically operate by transforming the reference and distorted images and quantifying the error within that “perceptual” space. For instance, the seminal models described in [6, 7, 8] are based on psychophysical measurements of the dependence of contrast sensitivity on spatial frequency and contextual masking. Other models are designed to mimic physiological responses of neurons in the primary visual cortex. They typically include multi-scale oriented filtering followed by local gain control to normalize response amplitudes (e.g. [2, 9, 10]). Although the perceptual and physiological rationales for these models are compelling, they have complex parameterizations and are difficult to fit to data.

Some models have been shown to be well-matched to the statistical properties of natural images, consistent with theories of biological coding efficiency and redundancy reduction [11, 12]. In particular, application of Independent Component Analysis (ICA) [13] (which seeks a linear transformation that best eliminates statistical dependencies in responses), or sparse coding [14] (which seeks to encode images with a small subset of basis elements), yields oriented filters resembling V1 receptive fields. Local gain control, in a form known as “divisive normalization” that has been widely used to describe sensory neurons [15], has been shown to decrease the dependencies between filter responses at adjacent spatial positions, orientations, and scales [16, 17, 18, 19, 20].

A widely-used measure of perceptual distortion is the structural similarity metric (SSIM) [21], which is designed to be invariant to “nuisance” variables such as the local mean or local standard deviation, while retaining sensitivity to the remaining “structure” of the image. It is generally used within a multi-scale representation (MS-SSIM), allowing it to handle features of all sizes [22].
While SSIM is informed by the invariances of human perception, the form of its computation (a product of the correlations between mean-subtracted, variance-normalized, and structure terms) has no obvious mapping onto physiological or perceptual representation. Nevertheless, the computations that underlie the embedding of those invariances, subtraction of the local mean and division by the local standard deviation, are reminiscent of the response properties of neurons in the retina and thalamus. In particular, responses of these cells are often modeled as bandpass filters (“center-surround”) whose responses are rectified and subject to gain control according to local luminance and contrast (e.g., [23]). Here, we define a new quality metric, computed as the root mean squared error of an early visual representation based on center-surround filtering followed by local gain control. The filtering is performed at multiple scales, using the Laplacian pyramid [24]. While the model architecture and choice of operations are motivated by the physiology of the early visual system, we use a statistical criterion to select the local gain control parameters. Specifically, the weights used in computing the gain signal are chosen so as to minimize the conditional dependency of neighboring transformed coefficients. Despite the simplicity of this representation, we find that it provides an excellent account of human perceptual quality judgments, outperforming MS-SSIM, as well as V1-inspired models, in predicting the human data in the TID 2008 database [25].

Normalized Laplacian pyramid model

Our model comprises two stages (figure 1): first, the local mean is removed by subtracting a blurred version of the image, and then these values are normalized by an estimate of the local amplitude. The perceptual metric is defined as the root mean squared error of a distorted image compared to the original, measured in this transformed domain. We view the local luminance subtraction and contrast normalization as a means of reducing redundancy in natural images.

Figure 1. Normalized Laplacian pyramid model diagram, shown for a single scale (k). The input image at scale k, x(k) (k = 1 corresponds to the original image), is modified by subtracting the local mean (eq. 2). This is accomplished using the standard Laplacian pyramid construction: convolve with lowpass filter L(ω), downsample by a factor of two in each dimension, upsample, convolve again with L(ω), and subtract from the input image x(k). This intermediate image z(k) is then normalized by an estimate of local amplitude, obtained by computing the absolute value, convolving with scale-specific filter P(k)(ω), and adding the scale-specific constant σ(k) (eq. 3). As in the standard Laplacian pyramid, the blurred and downsampled image x(k+1) is the input image for scale (k+1).

Most of the redundant information in natural images is local, and can be captured with a Markov model. That is, the distribution of an image pixel (xi) conditioned on all others is well approximated by the conditional

126 citations
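
A simplified, single-scale numpy sketch of the metric described above: subtract a local mean estimate, divide by a local amplitude estimate (a local average of absolute values plus a constant), and take the root mean squared error between reference and distorted images in that normalized domain. The Gaussian and uniform filters and the constant below are placeholders; the paper uses the Laplacian pyramid at multiple scales and optimizes the normalization weights and constants on a database of undistorted images.

import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def normalized_representation(img, blur_sigma=2.0, amp_size=5, const=0.1):
    # Single-scale stand-in for the normalized Laplacian representation:
    # local luminance subtraction followed by local gain control.
    img = img.astype(float)
    z = img - gaussian_filter(img, blur_sigma)               # subtract local mean luminance
    amp = uniform_filter(np.abs(z), size=amp_size) + const   # local amplitude estimate
    return z / amp

def nlp_like_distance(reference, distorted):
    # Quality metric: RMSE in the normalized domain (lower means more similar).
    diff = normalized_representation(reference) - normalized_representation(distorted)
    return np.sqrt(np.mean(diff ** 2))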


Journal ArticleDOI
TL;DR: Evidence is presented that neurons in area V2 are selective for local statistics that occur in natural visual textures, and tolerant of manipulations that preserve these statistics.
Abstract: As information propagates along the ventral visual hierarchy, neuronal responses become both more specific for particular image features and more tolerant of image transformations that preserve those features. Here, we present evidence that neurons in area V2 are selective for local statistics that occur in natural visual textures, and tolerant of manipulations that preserve these statistics. Texture stimuli were generated by sampling from a statistical model, with parameters chosen to match the parameters of a set of visually distinct natural texture images. Stimuli generated with the same statistics are perceptually similar to each other despite differences, arising from the sampling process, in the precise spatial location of features. We assessed the accuracy with which these textures could be classified based on the responses of V1 and V2 neurons recorded individually in anesthetized macaque monkeys. We also assessed the accuracy with which particular samples could be identified, relative to other statistically matched samples. For populations of up to 100 cells, V1 neurons supported better performance in the sample identification task, whereas V2 neurons exhibited better performance in texture classification. Relative to V1, the responses of V2 show greater selectivity and tolerance for the representation of texture statistics.

99 citations
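
The two population-decoding analyses summarized above (classifying texture family versus identifying the particular statistically matched sample) can be illustrated with a generic cross-validated linear decoder applied to trial-by-trial spike counts. This is an illustrative stand-in using scikit-learn, not the authors' decoder, recording format, or data.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decoding_accuracy(spike_counts, labels, folds=5):
    # spike_counts: (n_trials, n_neurons) array of responses from a recorded population.
    # labels: (n_trials,) texture-family labels or sample-identity labels.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, spike_counts, labels, cv=folds).mean()

# Hypothetical comparison across the two tasks, for a V1 or V2 population:
# family_acc = decoding_accuracy(counts, family_labels)   # texture classification
# sample_acc = decoding_accuracy(counts, sample_labels)   # sample identification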


Journal ArticleDOI
TL;DR: It is shown that nQDA provides a better account than many comparable alternatives for the transformation between neural representations in two high-level brain areas recorded as monkeys performed a visual delayed-match-to-sample task.
Abstract: Linear-nonlinear (LN) models and their extensions have proven successful in describing transformations from stimuli to spiking responses of neurons in early stages of sensory hierarchies. Neural responses at later stages are highly nonlinear and have generally been better characterized in terms of their decoding performance on prespecified tasks. Here we develop a biologically plausible decoding model for classification tasks, which we refer to as neural quadratic discriminant analysis (nQDA). Specifically, we reformulate an optimal quadratic classifier as an LN-LN computation, analogous to "subunit" encoding models that have been used to describe responses in retina and primary visual cortex. We propose a physiological mechanism by which the parameters of the nQDA classifier could be optimized, using a supervised variant of a Hebbian learning rule. As an example of its applicability, we show that nQDA provides a better account than many comparable alternatives for the transformation between neural representations in two high-level brain areas recorded as monkeys performed a visual delayed-match-to-sample task.

28 citations
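
The central reformulation in this abstract, expressing an optimal quadratic classifier as an LN-LN computation, can be sketched directly: the quadratic term of the two-class discriminant is eigendecomposed into a weighted sum of squared linear filter ("subunit") responses, followed by a linear combination. The code below illustrates that identity for Gaussian class models with equal priors; it is not the authors' fitted model, and their Hebbian-style learning rule is not shown.

import numpy as np

def qda_parameters(mu0, mu1, C0, C1):
    # Two-class quadratic discriminant f(x) = x'Ax + b'x + c (f > 0 favors class 1),
    # derived from class means mu0, mu1 and covariances C0, C1 (equal priors assumed).
    P0, P1 = np.linalg.inv(C0), np.linalg.inv(C1)
    A = 0.5 * (P0 - P1)
    b = P1 @ mu1 - P0 @ mu0
    _, ld0 = np.linalg.slogdet(C0)
    _, ld1 = np.linalg.slogdet(C1)
    c = 0.5 * (mu0 @ P0 @ mu0 - mu1 @ P1 @ mu1) + 0.5 * (ld0 - ld1)
    return A, b, c

def qda_as_ln_ln(x, A, b, c):
    # The same discriminant written as an LN-LN computation:
    # squared linear "subunit" responses, then a weighted sum plus a linear term.
    w, V = np.linalg.eigh(A)          # columns of V act as the subunit filters
    subunits = (x @ V) ** 2           # first linear-nonlinear stage (squaring)
    return subunits @ w + x @ b + c   # second stage: weighted combination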


Posted Content
TL;DR: This work predicts an explicit relationship between the statistical properties of the environment, the allocation and selectivity of neurons within populations, and perceptual discriminability, and finds that it is remarkably consistent with existing data.
Abstract: The mammalian brain is a metabolically expensive device, and evolutionary pressures have presumably driven it to make productive use of its resources. For sensory areas, this concept has been expressed more formally as an optimality principle: the brain maximizes the information that is encoded about relevant sensory variables, given available resources. Here, we develop this efficiency principle for encoding a sensory variable with a heterogeneous population of noisy neurons, each responding to a particular range of values. The accuracy with which the population represents any particular value depends on the number of cells that respond to that value, their selectivity, and their response levels. We derive the optimal solution for these parameters in closed form, as a function of the probability of stimulus values encountered in the environment. This optimal neural population also imposes limitations on the ability of the organism to discriminate different values of the encoded variable. As a result, we predict an explicit relationship between the statistical properties of the environment, the allocation and selectivity of neurons within populations, and perceptual discriminability. We test this relationship for three visual and two auditory attributes, and find that it is remarkably consistent with existing data.

25 citations
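
A common summary of this framework's information-maximizing case is that the optimal population allocates cells in proportion to the stimulus prior, with predicted discrimination thresholds varying inversely with it. The sketch below constructs such an allocation by placing tuning-curve centers at equal quantiles of an assumed prior; that proportionality, and the prior, grid, and population size, are assumptions made for illustration rather than the authors' derivation.

import numpy as np

def allocate_tuning_centers(prior_pdf, s_grid, n_neurons):
    # Place n_neurons tuning-curve centers at equal quantiles of the prior, so that
    # cell density tracks p(s); also return thresholds assumed proportional to 1/p(s).
    p = prior_pdf(s_grid)
    cdf = np.cumsum(p)
    cdf = cdf / cdf[-1]
    quantiles = (np.arange(n_neurons) + 0.5) / n_neurons
    centers = np.interp(quantiles, cdf, s_grid)    # inverse-CDF spacing
    thresholds = 1.0 / prior_pdf(centers)          # discriminability prediction (up to a constant)
    return centers, thresholds

# Hypothetical example: a Gaussian-shaped prior over an orientation-like variable.
# s = np.linspace(-90.0, 90.0, 2001)
# centers, thr = allocate_tuning_centers(lambda v: np.exp(-0.5 * (v / 20.0) ** 2), s, 30)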


Proceedings Article
01 Jan 2016
TL;DR: In this article, the authors propose to synthesize a sequence of images lying on a path between them that is of minimal length in the space of a representation (a "representational geodesic").
Abstract: We develop a new method for visualizing and refining the invariances of learned representations. Given two reference images (typically, differing by some transformation), we synthesize a sequence of images lying on a path between them that is of minimal length in the space of a representation (a "representational geodesic"). If the transformation relating the two reference images is an invariant of the representation, this sequence should follow the gradual evolution of this transformation. We use this method to assess the invariances of state-of-the-art image classification networks and find that, surprisingly, they do not exhibit invariance to basic parametric transformations of translation, rotation, and dilation. Our method also suggests a remedy for these failures, and, following this prescription, we show that the modified representation exhibits a high degree of invariance for a range of geometric image transformations.

15 citations
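
A minimal sketch of the synthesis procedure described above, assuming a differentiable representation function rep (e.g. the output of some network layer): interior frames between the two reference images are optimized so that the summed squared representation differences along the path, a proxy for path length, become small. The frame count, optimizer, step count, and the absence of pixel-range constraints are all simplifications for illustration, not the authors' exact procedure.

import torch

def representational_geodesic(x0, x1, rep, n_frames=9, steps=500, lr=0.01):
    # x0, x1: endpoint images (tensors of the same shape); rep: placeholder
    # differentiable representation. Returns the optimized interior frames.
    x0, x1 = x0.detach(), x1.detach()
    alphas = torch.linspace(0, 1, n_frames + 2)[1:-1]
    # Initialize interior frames on the straight pixel-domain line between endpoints.
    frames = torch.stack([(1 - a) * x0 + a * x1 for a in alphas]).requires_grad_(True)
    opt = torch.optim.Adam([frames], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        path = [x0] + list(frames) + [x1]
        # Proxy for path length in representation space: sum of squared steps.
        length = sum(((rep(b) - rep(a)) ** 2).sum() for a, b in zip(path[:-1], path[1:]))
        length.backward()
        opt.step()
    return frames.detach()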


Posted Content
TL;DR: In this article, the authors propose estimating the underlying firing rate directly from calcium imaging data, integrating over the unobserved spikes instead of committing to a single estimate of the spike train; the approach can be used to compute average firing rates or tuning curves without an intermediate spike-deconvolution step.
Abstract: Two-photon imaging of calcium indicators allows simultaneous recording of responses of hundreds of neurons over hours and even days, but provides a relatively indirect measure of their spiking activity. Existing deconvolution algorithms attempt to recover spikes from observed imaging data, which are then commonly subjected to the same analyses that are applied to electrophysiologically recorded spikes (e.g., estimation of average firing rates, or tuning curves). Here we show, however, that in the presence of noise this approach is often heavily biased. We propose an alternative analysis that aims to estimate the underlying rate directly, by integrating over the unobserved spikes instead of committing to a single estimate of the spike train. This approach can be used to estimate average firing rates or tuning curves directly from the imaging data, and is sufficiently flexible to incorporate prior knowledge about tuning structure. We show that directly estimated rates are more accurate than those obtained from averaging of spikes estimated through deconvolution, both on simulated data and on imaging data acquired in mouse visual cortex.

04 Jan 2016
TL;DR: This work proposes an alternative analysis that aims to estimate the underlying rate directly, by integrating over the unobserved spikes instead of committing to a single estimate of the spike train, and shows that directly estimated rates are more accurate than those obtained from averaging of spikes estimated through deconvolution.
Abstract: Two-photon imaging of calcium indicators allows simultaneous recording of responses of hundreds of neurons over hours and even days, but provides a relatively indirect measure of their spiking activity. Existing deconvolution algorithms attempt to recover spikes from observed imaging data, which are then commonly subjected to the same analyses that are applied to electrophysiologically recorded spikes (e.g., estimation of average firing rates, or tuning curves). Here we show, however, that in the presence of noise this approach is often heavily biased. We propose an alternative analysis that aims to estimate the underlying rate directly, by integrating over the unobserved spikes instead of committing to a single estimate of the spike train. This approach can be used to estimate average firing rates or tuning curves directly from the imaging data, and is sufficiently flexible to incorporate prior knowledge about tuning structure. We show that directly estimated rates are more accurate than those obtained from averaging of spikes estimated through deconvolution, both on simulated data and on imaging data acquired in mouse visual cortex.
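
A toy numpy illustration of the idea described above, for statistically independent time bins: rather than committing to one deconvolved spike train, the likelihood of each fluorescence sample is computed by summing over possible spike counts under a candidate Poisson rate, and the rate maximizing that marginal likelihood is selected. The linear-Gaussian fluorescence model with no indicator decay, the bin-independence assumption, and all parameter values are simplifications for illustration, not the authors' model.

import numpy as np
from scipy.stats import norm, poisson

def rate_marginal_loglik(rate, fluor, gain=1.0, sigma=0.5, max_spikes=10):
    # Log-likelihood of the fluorescence samples with the spike counts integrated out:
    # p(F_t | rate) = sum_k Poisson(k; rate) * Normal(F_t; gain * k, sigma).
    ks = np.arange(max_spikes + 1)
    mix = poisson.pmf(ks, rate)[None, :] * norm.pdf(fluor[:, None], gain * ks[None, :], sigma)
    return np.sum(np.log(mix.sum(axis=1) + 1e-300))

def estimate_rate(fluor, candidate_rates):
    # Grid search for the firing rate that maximizes the spike-marginalized likelihood.
    fluor = np.asarray(fluor, dtype=float)
    lls = [rate_marginal_loglik(r, fluor) for r in candidate_rates]
    return candidate_rates[int(np.argmax(lls))]

# e.g. estimate_rate(fluorescence_trace, np.linspace(0.01, 5.0, 200))
# (fluorescence_trace is a hypothetical 1-D array of imaging samples for one cell)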