
Showing papers by "Stefano Soatto" published in 2016


Posted Content
TL;DR: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Abstract: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD, where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, show improved generalization over SGD using uniform stability. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
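
A minimal sketch of the nested-loop structure described above, in NumPy with a user-supplied stochastic-gradient oracle; the hyperparameter names and defaults (`gamma`, `eta`, `eps`, `L`, `alpha`, `lr`) are illustrative, not the paper's tuned settings:

```python
import numpy as np

def entropy_sgd_step(x, grad_f, gamma=1e-4, eta=0.1, eps=1e-4,
                     L=20, alpha=0.75, lr=1.0, rng=np.random.default_rng(0)):
    """One outer update of Entropy-SGD (sketch).

    x      : current weights, as a flat array
    grad_f : function returning a stochastic gradient of the loss at a point
    """
    xp = x.copy()   # Langevin iterate x'
    mu = x.copy()   # running average of the iterates
    for _ in range(L):
        # Inner loop: SGLD on the modified loss f(x') + (gamma/2)||x - x'||^2,
        # i.e. sampling from the Gibbs measure that defines the local entropy.
        dx = grad_f(xp) - gamma * (x - xp)
        xp = xp - eta * dx + np.sqrt(eta) * eps * rng.standard_normal(x.shape)
        mu = alpha * mu + (1 - alpha) * xp
    # Outer loop: step along the local-entropy gradient, -gamma * (x - <x'>).
    return x - lr * gamma * (x - mu)
```

The running mean of the Langevin iterates estimates the gradient of the local entropy, so each outer update is pulled toward wide, flat regions rather than sharp valleys.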

487 citations


Posted Content
TL;DR: It is shown that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
Abstract: The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term, which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout, a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Finally, we prove that we can promote the creation of disentangled representations simply by enforcing a factorized prior, a fact that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
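
A sketch in PyTorch of the adaptive multiplicative noise described above; the 1x1 convolution used to predict the noise scale and the `max_alpha` cap are illustrative choices, and the KL regularizer that penalizes transmitted information is omitted for brevity:

```python
import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    """Multiplicative log-normal noise whose scale is predicted from the
    activations themselves, so the noise adapts to the structure of the
    network and to each individual sample (sketch)."""

    def __init__(self, channels, max_alpha=0.7):
        super().__init__()
        # Predicts a per-unit noise scale from the activations.
        self.alpha_layer = nn.Conv2d(channels, channels, kernel_size=1)
        self.max_alpha = max_alpha

    def forward(self, x):
        if not self.training:
            return x          # no noise at test time
        # Noise scale alpha(x) in (0, max_alpha), adapted to the input.
        alpha = self.max_alpha * torch.sigmoid(self.alpha_layer(x))
        # Multiplicative log-normal noise: eps = exp(alpha * N(0, 1)).
        eps = torch.exp(alpha * torch.randn_like(x))
        return x * eps
```

Setting a constant alpha would recover a Gaussian-dropout-like layer; letting the network predict alpha is what allows the noise to adapt per unit and per sample.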

251 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A method to predict the long-term motion of pedestrians, modeling their behavior as a jump-Markov process in which the goal is a hidden variable and intent is a policy in a Markov decision process framework.
Abstract: We present a method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with their goal as a hidden variable. Assuming approximately rational behavior, and incorporating environmental constraints and biases, including time-varying ones imposed by traffic lights, we model intent as a policy in a Markov decision process framework. We infer pedestrian state using a Rao-Blackwellized filter, and intent by planning according to a stochastic policy, reflecting individual preferences in aiming at the same goal.
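
A sketch of the intent-inference step implied above: under approximately rational (softmax) behavior, each observed action updates a belief over candidate goals by Bayes' rule. The names and shapes (`q_values`, `beta`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def update_goal_belief(belief, observed_action, state, q_values, beta=2.0):
    """One Bayesian update of the belief over a pedestrian's hidden goal.

    belief          : prior probability of each candidate goal, shape (G,)
    observed_action : index of the action (e.g. heading) just observed
    q_values        : q_values[g][state, action] = value of each action at
                      `state` when pursuing goal g, e.g. from value iteration
                      on the underlying Markov decision process
    beta            : rationality; larger = closer to a deterministic planner
    """
    likelihood = np.empty_like(belief)
    for g in range(len(belief)):
        q = q_values[g][state]                    # values of all actions
        policy = np.exp(beta * (q - q.max()))     # stochastic (softmax) policy
        policy /= policy.sum()
        likelihood[g] = policy[observed_action]
    posterior = belief * likelihood               # Bayes' rule
    return posterior / posterior.sum()
```

In the full system this update would run inside the Rao-Blackwellized filter, alongside the estimate of the pedestrian's continuous state.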

155 citations


Book ChapterDOI
08 Oct 2016
TL;DR: An efficient convex program with exact recovery guarantees, even in the presence of adversarial outliers, is proposed for location recovery from pairwise directions, such as the scaled relative positions between pairs of views estimated with epipolar geometry; location recovery here means determining relative pose up to a single unknown scale.
Abstract: We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated for instance with epipolar geometry) our method can be used for location recovery, that is the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.
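
A minimal sketch of a convex program of this kind in CVXPY, written as a ShapeFit-style formulation; the exact residual, normalization constraints, and function name are assumptions for illustration:

```python
import numpy as np
import cvxpy as cp

def recover_locations(edges, dirs, n, dim=3):
    """Recover n locations (up to global translation and scale) from
    pairwise unit directions (sketch).

    edges : list of (i, j) index pairs with a measured direction
    dirs  : unit vectors d_ij, dirs[k] pointing from view i to view j
    """
    t = cp.Variable((n, dim))
    residuals, scale = [], 0
    for (i, j), d in zip(edges, dirs):
        diff = t[j] - t[i]
        P = np.eye(dim) - np.outer(d, d)   # projector orthogonal to d_ij
        # Penalize the component of t_j - t_i orthogonal to the measured
        # direction; the un-squared norm keeps the cost robust to outliers.
        residuals.append(cp.norm(P @ diff))
        scale = scale + diff @ d
    constraints = [cp.sum(t, axis=0) == 0,   # remove translation ambiguity
                   scale == 1]               # fix the global scale
    cp.Problem(cp.Minimize(sum(residuals)), constraints).solve()
    return t.value
```

The program is a second-order cone problem, which is what makes a large speed-up over prior location-recovery methods plausible at scale.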

50 citations


Proceedings Article
01 Jan 2016
TL;DR: Analytical expressions for minimal sufficient statistics of visual data are derived, and it is shown that they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks.
Abstract: Visual representations are defined in terms of minimal sufficient statistics of visual data, for a class of tasks, that are also invariant to nuisance variability. Minimal sufficiency guarantees that we can store a representation in lieu of raw data with smallest complexity and no performance loss on the task at hand. Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data. We derive analytical expressions for such representations and show they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks. This link highlights the assumptions and approximations tacitly made by these methods and explains empirical practices such as clamping, pooling and joint normalization.
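
A sketch of the empirical practices mentioned above, applied to a generic bank of orientation histograms (one per spatial cell of a SIFT-like descriptor); the 0.2 clamp follows common SIFT practice, everything else is an illustrative choice:

```python
import numpy as np

def pool_cells(hists, radius=1):
    """Spatial pooling: aggregate orientation histograms over neighborhoods."""
    return [sum(hists[max(0, i - radius): i + radius + 1])
            for i in range(len(hists))]

def clamp_and_normalize(cells, clamp=0.2, eps=1e-8):
    """Joint normalization and clamping of the pooled histograms."""
    v = np.concatenate(cells)                 # one descriptor over all cells
    v = v / (np.linalg.norm(v) + eps)         # joint normalization
    v = np.minimum(v, clamp)                  # clamp dominant bins
    return v / (np.linalg.norm(v) + eps)      # re-normalize
```

In the paper's terms, each step approximates an operation required of a minimal sufficient invariant statistic, which is the sense in which the theory "explains" these practices.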

49 citations


Book ChapterDOI
08 Oct 2016
TL;DR: A data structure obtained by hierarchically pooling Bag-of-Words descriptors over a sequence of views is proposed, achieving average speedups of 2 to 20 times in large-scale loop-closure applications on benchmark datasets.
Abstract: We propose a data structure obtained by hierarchically pooling Bag-of-Words (BoW) descriptors during a sequence of views that achieves average speedups in large-scale loop closure applications ranging from 2 to 20 times on benchmark datasets. Although simple, the method works as well as sophisticated agglomerative schemes at a fraction of the cost with minimal loss of performance.
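
Sketched below under assumptions of our own (a binary pooling tree and a histogram-intersection similarity with an illustrative threshold): because a pooled histogram dominates every frame histogram in its subtree elementwise, its similarity to a query upper-bounds theirs, so one comparison can reject a whole subsequence:

```python
import numpy as np

def build_pooled_tree(bows):
    """Level 0 holds per-frame BoW histograms; each higher level sums
    adjacent pairs, so every node pools the frames below it."""
    levels = [list(bows)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def query(levels, q, threshold=0.3):
    """Return candidate frame indices for loop closure against query q."""
    def sim(node, q):  # histogram intersection, normalized by the query
        return np.minimum(node, q).sum() / max(q.sum(), 1e-8)
    stack, matches = [(len(levels) - 1, 0)], []
    while stack:
        lvl, idx = stack.pop()
        if sim(levels[lvl][idx], q) < threshold:
            continue                      # prune the whole subtree
        if lvl == 0:
            matches.append(idx)           # a frame survived all prunings
        else:
            for c in (2 * idx, 2 * idx + 1):
                if c < len(levels[lvl - 1]):
                    stack.append((lvl - 1, c))
    return matches
```

The speedup comes from pruning: most subtrees are rejected near the root, so far fewer than one comparison per frame is needed on average.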

14 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this paper, the authors conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio.
Abstract: We conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio. We isolate factors by adopting a common convolutional architecture either deployed globally on the image to compute class posterior distributions, or restricted locally to compute class conditional distributions given location, scale and aspect ratios of bounding boxes determined by proposal heuristics. In theory, averaging the latter should yield inferior performance compared to proper marginalization. Yet empirical evidence suggests the converse, leading us to conclude that – at the current level of complexity of convolutional architectures and scale of the data sets used to train them – CNNs are not very effective at marginalizing nuisance variability. We also quantify the effects of context on the overall classification task and its impact on the performance of CNNs, and propose improved sampling techniques for heuristic proposal schemes that improve end-to-end performance to state-of-the-art levels. We test our hypothesis on a classification task using the ImageNet Challenge benchmark and on a wide-baseline matching task using the Oxford and Fischer's datasets.
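
A sketch of the "restricted locally" alternative being tested: approximate marginalization over location, scale and aspect ratio by averaging class posteriors over proposal crops. The helpers and the uniform proposal prior are illustrative assumptions:

```python
import numpy as np

def crop(image, box):
    """Illustrative helper: extract a box (x0, y0, x1, y1); a real system
    would also resize the crop to the CNN's input size."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def marginalized_posterior(cnn, image, proposals, prior=None):
    """Average class posteriors over proposals:
    p(c | image) ~= sum_b p(c | box_b) p(box_b).
    `cnn` maps an image crop to a class-probability vector."""
    if prior is None:
        prior = np.full(len(proposals), 1.0 / len(proposals))
    posteriors = np.stack([cnn(crop(image, box)) for box in proposals])
    return prior @ posteriors
```

In theory such averaging should underperform proper marginalization; the empirical finding above is the converse, which is the evidence that current CNNs do not marginalize nuisance variability effectively on their own.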

11 citations


Journal ArticleDOI
TL;DR: A variational model for occlusion detection is formulated as an inverse problem whose forward model adapts the brightness constraint of optical flow to emphasize occlusions by exploiting their temporal behavior, while spatio-temporal regularizers on the occlusion set make the model robust to noise and modeling errors.
Abstract: Occlusions generally become apparent when integrated over time because violations of the brightness-constancy constraint of optical flow accumulate in occluded areas. Based on this observation, we propose a variational model for occlusion detection that is formulated as an inverse problem. Our forward model adapts the brightness constraint of optical flow to emphasize occlusions by exploiting their temporal behavior, while spatio-temporal regularizers on the occlusion set make our model robust to noise and modeling errors. In terms of minimization, we approximate the resulting variational problem by a sequence of convex optimizations and develop efficient algorithms to solve them. Our experiments show the benefits of the proposed formulation, both forward model and regularizers, in comparison to the state-of-the-art techniques that detect occlusion as the residual of optical-flow estimation.
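
In symbols, a variational model of this kind can be written schematically as below; this is a hedged reconstruction consistent with the abstract, not the paper's exact functional:

$$
\min_{e}\; \int \big|\, I(x + v(x,t),\, t+1) - I(x,t) - e(x,t) \,\big|\; dx\, dt \;+\; \lambda_1 \lVert e \rVert_{1} \;+\; \lambda_2\, \mathrm{TV}(e)
$$

Here $v$ is the optical flow and the residual $e$ absorbs violations of brightness constancy, so it is supported on the occluded set; the $\ell_1$ term promotes a sparse occlusion set and the total-variation term acts as a spatio-temporal regularizer. The resulting problem is then approximated by a sequence of convex optimizations, as stated above.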

11 citations


Posted Content
TL;DR: A new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers is introduced.
Abstract: We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated for instance with epipolar geometry) our method can be used for location recovery, that is the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.

9 citations


Patent
07 Nov 2016
TL;DR: A variation of the scale-invariant feature transform (SIFT) is proposed, based on pooling gradient orientations across different domain sizes in addition to spatial locations.
Abstract: A variation of scale-invariant feature transform (SIFT) based on pooling gradient orientations across different domain sizes, in addition to spatial locations. The resulting descriptor is called DSP-SIFT, and it outperforms other methods in wide-baseline matching benchmarks, including those based on convolutional neural networks, despite having the same dimension as SIFT and requiring no training. Problems of local representation of imaging data are also addressed as the computation of minimal sufficient statistics that are invariant to nuisance variability induced by viewpoint and illumination. A sampling-based and a point-estimate-based approximation of such representations are described.
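
A sketch of domain-size pooling under stated assumptions: `descriptor_fn` is a hypothetical helper computing a SIFT-style histogram for a keypoint at a given patch size, and the scale set and uniform weights are illustrative:

```python
import numpy as np

def dsp_descriptor(descriptor_fn, image, keypoint,
                   sizes=(0.5, 0.75, 1.0, 1.5, 2.0), weights=None):
    """Pool gradient-orientation histograms across domain sizes (sketch).

    descriptor_fn(image, keypoint, size) -> SIFT-style histogram computed
    on a patch whose side is scaled by `size` (assumed helper).
    """
    if weights is None:
        weights = np.full(len(sizes), 1.0 / len(sizes))
    pooled = sum(w * descriptor_fn(image, keypoint, s)
                 for w, s in zip(weights, sizes))
    # Same dimension as a single SIFT descriptor, no training required.
    return pooled / (np.linalg.norm(pooled) + 1e-8)
```

Because the pooling happens histogram-wise, the result keeps SIFT's dimensionality, which is the property highlighted above.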

8 citations


Posted Content
TL;DR: A system is described that detects objects in 3D space using video and inertial sensors (accelerometer and gyrometer), which are ubiquitous in modern mobile platforms from phones to drones.
Abstract: We describe a system to detect objects in three-dimensional space using video and inertial sensors (accelerometer and gyrometer), ubiquitous in modern mobile platforms from phones to drones. Inertials afford the ability to impose class-specific scale priors for objects, and provide a global orientation reference. A minimal sufficient representation, the posterior of semantic (identity) and syntactic (pose) attributes of objects in space, can be decomposed into a geometric term, which can be maintained by a localization-and-mapping filter, and a likelihood function, which can be approximated by a discriminatively-trained convolutional neural network. The resulting system can process the video stream causally in real time, and provides a representation of objects in the scene that is persistent: Confidence in the presence of objects grows with evidence, and objects previously seen are kept in memory even when temporarily occluded, with their return into view automatically predicted to prime re-detection.

Journal ArticleDOI
TL;DR: A method to reconstruct surfaces from oriented point clouds corrupted by errors arising from range imaging sensors, which couples an implicit parametrization that reconstructs surfaces of unknown topology with adaptive discretizations that avoid the high memory and computational cost of volumetric representations.
Abstract: We propose a method to reconstruct surfaces from oriented point clouds corrupted by errors arising from range imaging sensors. The core of this technique is the formulation of the problem as a convex minimization that reconstructs the indicator function of the surface's interior and substitutes the usual least-squares fidelity terms by Huber penalties to be robust to outliers, recover sharp corners, and avoid the shrinking bias of least-squares models. To achieve both flexibility and accuracy, we couple an implicit parametrization that reconstructs surfaces of unknown topology with adaptive discretizations that avoid the high memory and computational cost of volumetric representations. The hierarchical structure of the discretizations speeds minimization through multiresolution, while the proposed splitting algorithm minimizes nondifferentiable functionals and is easy to parallelize. In experiments, our model improves reconstruction from synthetic and real data, while the choice of discretization affects ...
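
For reference, the Huber penalty that replaces the least-squares fidelity terms is, for residual $r$ and threshold $\delta$:

$$
\rho_\delta(r) \;=\;
\begin{cases}
\tfrac{1}{2}\, r^{2}, & |r| \le \delta,\\[3pt]
\delta \left( |r| - \tfrac{1}{2}\,\delta \right), & |r| > \delta.
\end{cases}
$$

It is quadratic near zero, so small sensor noise is averaged out as in least squares, but only linear in the tails, so outliers are not squared; this is what underlies the robustness, sharp-corner recovery, and reduced shrinking bias claimed above.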

Proceedings Article
09 Jul 2016
TL;DR: Rather than attempting to prove that the set of indistinguishable trajectories is a singleton, the analysis derives bounds on its volume as a function of sensor characteristics and sufficient-excitation conditions, providing an explicit characterization of the indistinguishable set that can be used for analysis and validation purposes.
Abstract: We analyze the observability of 3-D position and orientation from the fusion of visual and inertial sensors. The model contains unknown parameters, such as sensor biases, and so the problem is usually cast as a mixed filtering/identification problem, with the resulting observability analysis providing necessary conditions for convergence to a unique point estimate. Most models treat sensor bias rates as "noise," independent of other states, including biases themselves, an assumption that is violated in practice. We show that, when this assumption is lifted, the resulting model is not observable, and therefore existing analyses cannot be used to conclude that the set of states that are indistinguishable from the measurements is a singleton. We recast the analysis as one of sensitivity: Rather than attempting to prove that the set of indistinguishable trajectories is a singleton, we derive bounds on its volume, as a function of characteristics of the sensor and other sufficient excitation conditions. This provides an explicit characterization of the indistinguishable set that can be used for analysis and validation purposes.

Posted Content
TL;DR: A representation of a scene is described that captures geometric and semantic attributes of the objects within it, along with their uncertainty, yielding a posterior estimate of geometry and semantics and a point estimate of topology for a variable number of objects within the scene, implemented causally and in real time on commodity hardware.
Abstract: We describe a representation of a scene that captures geometric and semantic attributes of objects within, along with their uncertainty. Objects are assumed persistent in the scene, and their likelihood computed from intermittent visual data using a convolutional architecture, integrated within a Bayesian filtering framework with inertials and a context model. Our method yields a posterior estimate of geometry (attributed point cloud and associated uncertainty), semantics (identities and co-occurrence), and a point-estimate of topology for a variable number of objects within the scene, implemented causally and in real-time on commodity hardware.

Proceedings ArticleDOI
07 Mar 2016
TL;DR: A video coding system is presented that partitions the scene into "visual structures" and a residual "background" layer, together with a low-level representation ("track-template") of visual structures that exploits their temporal redundancy.
Abstract: A video coding system is presented that partitions the scene into "visual structures" and a residual "background" layer. A low-level representation ("track-template") of visual structures is proposed that exploits their temporal redundancy. A dictionary of track-templates is constructed that is used to encode video frames. We make optimal use of the dictionary in terms of rate-distortion by choosing a subset of the dictionary's elements for encoding using a Markov Random Field (MRF) formulation that places the track-templates in "depth" layers. The selected "track-templates" form the mid-level representation of the "visual structure" regions of the video. Our video coding system offers improvements over H.265/H.264 and other methods in a rate-distortion comparison.

Posted Content
TL;DR: This work constructs a general theory of local descriptors for visual matching, shows that SIFT and DSP-SIFT approximate the solution that the theory suggests, and derives new descriptors that have fewer parameters and are potentially better at handling affine deformations.
Abstract: Why has SIFT been so successful? Why can its extension, DSP-SIFT, further improve SIFT? Is there a theory that can explain both? How can such a theory benefit real applications? Can it suggest new algorithms with reduced computational complexity or new descriptors with better accuracy for matching? We construct a general theory of local descriptors for visual matching. Our theory relies on concepts in energy minimization and heat diffusion. We show that SIFT and DSP-SIFT approximate the solution that the theory suggests. In particular, DSP-SIFT gives a better approximation to the theoretical solution, justifying why DSP-SIFT outperforms SIFT. Using the developed theory, we derive new descriptors that have fewer parameters and are potentially better at handling affine deformations.