
Showing papers by "Stefano Soatto" published in 2016


Posted Content
TL;DR: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Abstract: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD, where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and, under certain assumptions, show improved generalization over SGD using uniform stability. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
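
A minimal sketch of the nested-loop structure described above, in NumPy with a user-supplied stochastic-gradient oracle; the hyperparameter names and defaults (`gamma`, `eta`, `eps`, `L`, `alpha`, `lr`) are illustrative, not the paper's tuned settings:

```python
import numpy as np

def entropy_sgd_step(x, grad_f, gamma=1e-4, eta=0.1, eps=1e-4,
                     L=20, alpha=0.75, lr=1.0, rng=np.random.default_rng(0)):
    """One outer update of Entropy-SGD (sketch).

    x      : current weights, as a flat array
    grad_f : function returning a stochastic gradient of the loss at a point
    """
    xp = x.copy()   # Langevin iterate x'
    mu = x.copy()   # running average of the iterates
    for _ in range(L):
        # Inner loop: SGLD on the modified loss f(x') + (gamma/2)||x - x'||^2,
        # i.e. sampling from the Gibbs measure that defines the local entropy.
        dx = grad_f(xp) - gamma * (x - xp)
        xp = xp - eta * dx + np.sqrt(eta) * eps * rng.standard_normal(x.shape)
        mu = alpha * mu + (1 - alpha) * xp
    # Outer loop: step along the local-entropy gradient, -gamma * (x - <x'>).
    return x - lr * gamma * (x - mu)
```

The running mean of the Langevin iterates estimates the gradient of the local entropy, so each outer update is pulled toward wide, flat regions rather than sharp valleys.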

487 citations


Posted Content
TL;DR: It is shown that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
Abstract: The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term, which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout, a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Finally, we prove that we can promote the creation of disentangled representations simply by enforcing a factorized prior, a fact that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find that Information Dropout achieves comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
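
A sketch in PyTorch of the adaptive multiplicative noise described above; the 1x1 convolution used to predict the noise scale and the `max_alpha` cap are illustrative choices, and the KL regularizer that penalizes transmitted information is omitted for brevity:

```python
import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    """Multiplicative log-normal noise whose scale is predicted from the
    activations themselves, so the noise adapts to the structure of the
    network and to each individual sample (sketch)."""

    def __init__(self, channels, max_alpha=0.7):
        super().__init__()
        # Predicts a per-unit noise scale from the activations.
        self.alpha_layer = nn.Conv2d(channels, channels, kernel_size=1)
        self.max_alpha = max_alpha

    def forward(self, x):
        if not self.training:
            return x          # no noise at test time
        # Noise scale alpha(x) in (0, max_alpha), adapted to the input.
        alpha = self.max_alpha * torch.sigmoid(self.alpha_layer(x))
        # Multiplicative log-normal noise: eps = exp(alpha * N(0, 1)).
        eps = torch.exp(alpha * torch.randn_like(x))
        return x * eps
```

Setting a constant alpha would recover a Gaussian-dropout-like layer; letting the network predict alpha is what allows the noise to adapt per unit and per sample.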

251 citations


Proceedings ArticleDOI
16 May 2016
TL;DR: A method to predict the long-term motion of pedestrians, modeling their behavior as a jump-Markov process in which the goal is a hidden variable and intent is a policy in a Markov decision process framework.
Abstract: We present a method to predict long-term motion of pedestrians, modeling their behavior as jump-Markov processes with their goal as a hidden variable. Assuming approximately rational behavior, and incorporating environmental constraints and biases, including time-varying ones imposed by traffic lights, we model intent as a policy in a Markov decision process framework. We infer pedestrian state using a Rao-Blackwellized filter, and intent by planning according to a stochastic policy, reflecting individual preferences in aiming at the same goal.
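
A sketch of the intent-inference step implied above: under approximately rational (softmax) behavior, each observed action updates a belief over candidate goals by Bayes' rule. The names and shapes (`q_values`, `beta`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def update_goal_belief(belief, observed_action, state, q_values, beta=2.0):
    """One Bayesian update of the belief over a pedestrian's hidden goal.

    belief          : prior probability of each candidate goal, shape (G,)
    observed_action : index of the action (e.g. heading) just observed
    q_values        : q_values[g][state, action] = value of each action at
                      `state` when pursuing goal g, e.g. from value iteration
                      on the underlying Markov decision process
    beta            : rationality; larger = closer to a deterministic planner
    """
    likelihood = np.empty_like(belief)
    for g in range(len(belief)):
        q = q_values[g][state]                    # values of all actions
        policy = np.exp(beta * (q - q.max()))     # stochastic (softmax) policy
        policy /= policy.sum()
        likelihood[g] = policy[observed_action]
    posterior = belief * likelihood               # Bayes' rule
    return posterior / posterior.sum()
```

In the full system this update would run inside the Rao-Blackwellized filter, alongside the estimate of the pedestrian's continuous state.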

155 citations


Book ChapterDOI
08 Oct 2016
TL;DR: An efficient convex program with exact recovery guarantees, even in the presence of adversarial outliers, is proposed for location recovery from pairwise directions, such as the scaled relative positions between pairs of views estimated with epipolar geometry; location recovery here means determining relative pose up to a single unknown scale.
Abstract: We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated for instance with epipolar geometry) our method can be used for location recovery, that is the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.
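
A minimal sketch of a convex program of this kind in CVXPY, written as a ShapeFit-style formulation; the exact residual, normalization constraints, and function name are assumptions for illustration:

```python
import numpy as np
import cvxpy as cp

def recover_locations(edges, dirs, n, dim=3):
    """Recover n locations (up to global translation and scale) from
    pairwise unit directions (sketch).

    edges : list of (i, j) index pairs with a measured direction
    dirs  : unit vectors d_ij, dirs[k] pointing from view i to view j
    """
    t = cp.Variable((n, dim))
    residuals, scale = [], 0
    for (i, j), d in zip(edges, dirs):
        diff = t[j] - t[i]
        P = np.eye(dim) - np.outer(d, d)   # projector orthogonal to d_ij
        # Penalize the component of t_j - t_i orthogonal to the measured
        # direction; the un-squared norm keeps the cost robust to outliers.
        residuals.append(cp.norm(P @ diff))
        scale = scale + diff @ d
    constraints = [cp.sum(t, axis=0) == 0,   # remove translation ambiguity
                   scale == 1]               # fix the global scale
    cp.Problem(cp.Minimize(sum(residuals)), constraints).solve()
    return t.value
```

The program is a second-order cone problem, which is what makes a large speed-up over prior location-recovery methods plausible at scale.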

50 citations


Proceedings Article
01 Jan 2016
TL;DR: Analytical expressions for minimal sufficient statistics of visual data are derived, and it is shown that they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks.
Abstract: Visual representations are defined in terms of minimal sufficient statistics of visual data, for a class of tasks, that are also invariant to nuisance variability. Minimal sufficiency guarantees that we can store a representation in lieu of raw data with smallest complexity and no performance loss on the task at hand. Invariance guarantees that the statistic is constant with respect to uninformative transformations of the data. We derive analytical expressions for such representations and show they are related to feature descriptors commonly used in computer vision, as well as to convolutional neural networks. This link highlights the assumptions and approximations tacitly made by these methods and explains empirical practices such as clamping, pooling and joint normalization.
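
A sketch of the empirical practices mentioned above, applied to a generic bank of orientation histograms (one per spatial cell of a SIFT-like descriptor); the 0.2 clamp follows common SIFT practice, everything else is an illustrative choice:

```python
import numpy as np

def pool_cells(hists, radius=1):
    """Spatial pooling: aggregate orientation histograms over neighborhoods."""
    return [sum(hists[max(0, i - radius): i + radius + 1])
            for i in range(len(hists))]

def clamp_and_normalize(cells, clamp=0.2, eps=1e-8):
    """Joint normalization and clamping of the pooled histograms."""
    v = np.concatenate(cells)                 # one descriptor over all cells
    v = v / (np.linalg.norm(v) + eps)         # joint normalization
    v = np.minimum(v, clamp)                  # clamp dominant bins
    return v / (np.linalg.norm(v) + eps)      # re-normalize
```

In the paper's terms, each step approximates an operation required of a minimal sufficient invariant statistic, which is the sense in which the theory "explains" these practices.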

49 citations


Book ChapterDOI
08 Oct 2016
TL;DR: A data structure obtained by hierarchically pooling Bag-of-Words descriptors over a sequence of views is proposed, achieving average speedups of 2 to 20 times in large-scale loop-closure applications on benchmark datasets.
Abstract: We propose a data structure obtained by hierarchically pooling Bag-of-Words (BoW) descriptors during a sequence of views that achieves average speedups in large-scale loop closure applications ranging from 2 to 20 times on benchmark datasets. Although simple, the method works as well as sophisticated agglomerative schemes at a fraction of the cost with minimal loss of performance.
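
Sketched below under assumptions of our own (a binary pooling tree and a histogram-intersection similarity with an illustrative threshold): because a pooled histogram dominates every frame histogram in its subtree elementwise, its similarity to a query upper-bounds theirs, so one comparison can reject a whole subsequence:

```python
import numpy as np

def build_pooled_tree(bows):
    """Level 0 holds per-frame BoW histograms; each higher level sums
    adjacent pairs, so every node pools the frames below it."""
    levels = [list(bows)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels

def query(levels, q, threshold=0.3):
    """Return candidate frame indices for loop closure against query q."""
    def sim(node, q):  # histogram intersection, normalized by the query
        return np.minimum(node, q).sum() / max(q.sum(), 1e-8)
    stack, matches = [(len(levels) - 1, 0)], []
    while stack:
        lvl, idx = stack.pop()
        if sim(levels[lvl][idx], q) < threshold:
            continue                      # prune the whole subtree
        if lvl == 0:
            matches.append(idx)           # a frame survived all prunings
        else:
            for c in (2 * idx, 2 * idx + 1):
                if c < len(levels[lvl - 1]):
                    stack.append((lvl - 1, c))
    return matches
```

The speedup comes from pruning: most subtrees are rejected near the root, so far fewer than one comparison per frame is needed on average.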

14 citations


Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this paper, the authors conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio.
Abstract: We conduct an empirical study to test the ability of convolutional neural networks (CNNs) to reduce the effects of nuisance transformations of the input data, such as location, scale and aspect ratio. We isolate factors by adopting a common convolutional architecture either deployed globally on the image to compute class posterior distributions, or restricted locally to compute class conditional distributions given location, scale and aspect ratios of bounding boxes determined by proposal heuristics. In theory, averaging the latter should yield inferior performance compared to proper marginalization. Yet empirical evidence suggests the converse, leading us to conclude that – at the current level of complexity of convolutional architectures and scale of the data sets used to train them – CNNs are not very effective at marginalizing nuisance variability. We also quantify the effects of context on the overall classification task and its impact on the performance of CNNs, and propose improved sampling techniques for heuristic proposal schemes that improve end-to-end performance to state-of-the-art levels. We test our hypothesis on a classification task using the ImageNet Challenge benchmark and on a wide-baseline matching task using the Oxford and Fischer's datasets.
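
A sketch of the "restricted locally" alternative being tested: approximate marginalization over location, scale and aspect ratio by averaging class posteriors over proposal crops. The helpers and the uniform proposal prior are illustrative assumptions:

```python
import numpy as np

def crop(image, box):
    """Illustrative helper: extract a box (x0, y0, x1, y1); a real system
    would also resize the crop to the CNN's input size."""
    x0, y0, x1, y1 = box
    return image[y0:y1, x0:x1]

def marginalized_posterior(cnn, image, proposals, prior=None):
    """Average class posteriors over proposals:
    p(c | image) ~= sum_b p(c | box_b) p(box_b).
    `cnn` maps an image crop to a class-probability vector."""
    if prior is None:
        prior = np.full(len(proposals), 1.0 / len(proposals))
    posteriors = np.stack([cnn(crop(image, box)) for box in proposals])
    return prior @ posteriors
```

In theory such averaging should underperform proper marginalization; the empirical finding above is the converse, which is the evidence that current CNNs do not marginalize nuisance variability effectively on their own.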

11 citations


Journal ArticleDOI
TL;DR: A variational model for occlusion detection is formulated as an inverse problem whose forward model adapts the brightness constraint of optical flow to emphasize occlusions by exploiting their temporal behavior, while spatio-temporal regularizers on the occlusion set make the model robust to noise and modeling errors.
Abstract: Occlusions generally become apparent when integrated over time because violations of the brightness-constancy constraint of optical flow accumulate in occluded areas. Based on this observation, we propose a variational model for occlusion detection that is formulated as an inverse problem. Our forward model adapts the brightness constraint of optical flow to emphasize occlusions by exploiting their temporal behavior, while spatio-temporal regularizers on the occlusion set make our model robust to noise and modeling errors. In terms of minimization, we approximate the resulting variational problem by a sequence of convex optimizations and develop efficient algorithms to solve them. Our experiments show the benefits of the proposed formulation, both forward model and regularizers, in comparison to the state-of-the-art techniques that detect occlusion as the residual of optical-flow estimation.
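
In symbols, a variational model of this kind can be written schematically as below; this is a hedged reconstruction consistent with the abstract, not the paper's exact functional:

$$
\min_{e}\; \int \big|\, I(x + v(x,t),\, t+1) - I(x,t) - e(x,t) \,\big|\; dx\, dt \;+\; \lambda_1 \lVert e \rVert_{1} \;+\; \lambda_2\, \mathrm{TV}(e)
$$

Here $v$ is the optical flow and the residual $e$ absorbs violations of brightness constancy, so it is supported on the occluded set; the $\ell_1$ term promotes a sparse occlusion set and the total-variation term acts as a spatio-temporal regularizer. The resulting problem is then approximated by a sequence of convex optimizations, as stated above.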

11 citations


Posted Content
TL;DR: A new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers is introduced.
Abstract: We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated for instance with epipolar geometry) our method can be used for location recovery, that is the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.

9 citations


Patent
07 Nov 2016
TL;DR: A variation of the scale-invariant feature transform (SIFT) is proposed, based on pooling gradient orientations across different domain sizes in addition to spatial locations.
Abstract: A variation of scale-invariant feature transform (SIFT) based on pooling gradient orientations across different domain sizes, in addition to spatial locations. The resulting descriptor is called DSP-SIFT, and it outperforms other methods in wide-baseline matching benchmarks, including those based on convolutional neural networks, despite having the same dimension as SIFT and requiring no training. Problems of local representation of imaging data are also addressed as the computation of minimal sufficient statistics that are invariant to nuisance variability induced by viewpoint and illumination. A sampling-based and a point-estimate-based approximation of such representations are described.
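
A sketch of domain-size pooling under stated assumptions: `descriptor_fn` is a hypothetical helper computing a SIFT-style histogram for a keypoint at a given patch size, and the scale set and uniform weights are illustrative:

```python
import numpy as np

def dsp_descriptor(descriptor_fn, image, keypoint,
                   sizes=(0.5, 0.75, 1.0, 1.5, 2.0), weights=None):
    """Pool gradient-orientation histograms across domain sizes (sketch).

    descriptor_fn(image, keypoint, size) -> SIFT-style histogram computed
    on a patch whose side is scaled by `size` (assumed helper).
    """
    if weights is None:
        weights = np.full(len(sizes), 1.0 / len(sizes))
    pooled = sum(w * descriptor_fn(image, keypoint, s)
                 for w, s in zip(weights, sizes))
    # Same dimension as a single SIFT descriptor, no training required.
    return pooled / (np.linalg.norm(pooled) + 1e-8)
```

Because the pooling happens histogram-wise, the result keeps SIFT's dimensionality, which is the property highlighted above.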

8 citations


Posted Content
TL;DR: A system is described that detects objects in 3D space using video and inertial sensors (accelerometer and gyrometer), which are ubiquitous in modern mobile platforms from phones to drones.
Abstract: We describe a system to detect objects in three-dimensional space using video and inertial sensors (accelerometer and gyrometer), ubiquitous in modern mobile platforms from phones to drones. Inertials afford the ability to impose class-specific scale priors for objects, and provide a global orientation reference. A minimal sufficient representation, the posterior of semantic (identity) and syntactic (pose) attributes of objects in space, can be decomposed into a geometric term, which can be maintained by a localization-and-mapping filter, and a likelihood function, which can be approximated by a discriminatively-trained convolutional neural network. The resulting system can process the video stream causally in real time, and provides a representation of objects in the scene that is persistent: Confidence in the presence of objects grows with evidence, and objects previously seen are kept in memory even when temporarily occluded, with their return into view automatically predicted to prime re-detection.

Journal ArticleDOI
TL;DR: A method to reconstruct surfaces from oriented point clouds corrupted by errors arising from range imaging sensors, which couples an implicit parametrization that reconstructs surfaces of unknown topology with adaptive discretizations that avoid the high memory and computational cost of volumetric representations.
Abstract: We propose a method to reconstruct surfaces from oriented point clouds corrupted by errors arising from range imaging sensors. The core of this technique is the formulation of the problem as a convex minimization that reconstructs the indicator function of the surface's interior and substitutes the usual least-squares fidelity terms by Huber penalties to be robust to outliers, recover sharp corners, and avoid the shrinking bias of least-squares models. To achieve both flexibility and accuracy, we couple an implicit parametrization that reconstructs surfaces of unknown topology with adaptive discretizations that avoid the high memory and computational cost of volumetric representations. The hierarchical structure of the discretizations speeds minimization through multiresolution, while the proposed splitting algorithm minimizes nondifferentiable functionals and is easy to parallelize. In experiments, our model improves reconstruction from synthetic and real data, while the choice of discretization affects ...
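
For reference, the Huber penalty that replaces the least-squares fidelity terms is, for residual $r$ and threshold $\delta$:

$$
\rho_\delta(r) \;=\;
\begin{cases}
\tfrac{1}{2}\, r^{2}, & |r| \le \delta,\\[3pt]
\delta \left( |r| - \tfrac{1}{2}\,\delta \right), & |r| > \delta.
\end{cases}
$$

It is quadratic near zero, so small sensor noise is averaged out as in least squares, but only linear in the tails, so outliers are not squared; this is what underlies the robustness, sharp-corner recovery, and reduced shrinking bias claimed above.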

Proceedings Article
09 Jul 2016
TL;DR: Rather than attempting to prove that the set of indistinguishable trajectories is a singleton, the analysis derives bounds on its volume as a function of sensor characteristics and sufficient-excitation conditions, providing an explicit characterization of the indistinguishable set that can be used for analysis and validation purposes.
Abstract: We analyze the observability of 3-D position and orientation from the fusion of visual and inertial sensors. The model contains unknown parameters, such as sensor biases, and so the problem is usually cast as a mixed filtering/identification problem, with the resulting observability analysis providing necessary conditions for convergence to a unique point estimate. Most models treat sensor bias rates as "noise," independent of other states, including biases themselves, an assumption that is violated in practice. We show that, when this assumption is lifted, the resulting model is not observable, and therefore existing analyses cannot be used to conclude that the set of states that are indistinguishable from the measurements is a singleton. We recast the analysis as one of sensitivity: Rather than attempting to prove that the set of indistinguishable trajectories is a singleton, we derive bounds on its volume, as a function of characteristics of the sensor and other sufficient excitation conditions. This provides an explicit characterization of the indistinguishable set that can be used for analysis and validation purposes.

Posted Content
TL;DR: A representation of a scene is described that captures geometric and semantic attributes of the objects within it, along with their uncertainty, yielding a posterior estimate of geometry and semantics and a point estimate of topology for a variable number of objects within the scene, implemented causally and in real time on commodity hardware.
Abstract: We describe a representation of a scene that captures geometric and semantic attributes of objects within, along with their uncertainty. Objects are assumed persistent in the scene, and their likelihood computed from intermittent visual data using a convolutional architecture, integrated within a Bayesian filtering framework with inertials and a context model. Our method yields a posterior estimate of geometry (attributed point cloud and associated uncertainty), semantics (identities and co-occurrence), and a point-estimate of topology for a variable number of objects within the scene, implemented causally and in real-time on commodity hardware.

Proceedings ArticleDOI
07 Mar 2016
TL;DR: A video coding system is presented that partitions the scene into "visual structures" and a residual "background" layer, together with a low-level representation ("track-template") of visual structures that exploits their temporal redundancy.
Abstract: A video coding system is presented that partitions the scene into "visual structures" and a residual "background" layer. A low-level representation ("track-template") of visual structures is proposed that exploits their temporal redundancy. A dictionary of track-templates is constructed that is used to encode video frames. We make optimal use of the dictionary in terms of rate-distortion by choosing a subset of the dictionary's elements for encoding using a Markov Random Field (MRF) formulation that places the track-templates in "depth" layers. The selected "track-templates" form the mid-level representation of the "visual structure" regions of the video. Our video coding system offers improvements over H.265/H.264 and other methods in a rate-distortion comparison.

Posted Content
TL;DR: This work constructs a general theory of local descriptors for visual matching, shows that SIFT and DSP-SIFT approximate the solution that the theory suggests, and derives new descriptors that have fewer parameters and are potentially better at handling affine deformations.
Abstract: Why has SIFT been so successful? Why can its extension, DSP-SIFT, further improve SIFT? Is there a theory that can explain both? How can such a theory benefit real applications? Can it suggest new algorithms with reduced computational complexity or new descriptors with better accuracy for matching? We construct a general theory of local descriptors for visual matching. Our theory relies on concepts in energy minimization and heat diffusion. We show that SIFT and DSP-SIFT approximate the solution that the theory suggests. In particular, DSP-SIFT gives a better approximation to the theoretical solution, justifying why DSP-SIFT outperforms SIFT. Using the developed theory, we derive new descriptors that have fewer parameters and are potentially better at handling affine deformations.