Eye Tracking for Everyone

From scientific research to commercial applications, eye tracking is an important tool across many domains. Despite its range of applications, eye tracking has yet to become a pervasive technology. We believe that we can put the power of eye tracking in everyone's palm by building eye tracking software that works on commodity hardware such as mobile phones and tablets, without the need for additional sensors or devices. We tackle this problem by introducing GazeCapture, the first large-scale dataset for eye tracking, containing data from over 1450 people consisting of almost 2.5M frames. Using GazeCapture, we train iTracker, a convolutional neural network for eye tracking, which achieves a significant reduction in error over previous approaches while running in real time (10–15fps) on a modern mobile device. Our model achieves a prediction error of 1.71cm and 2.53cm without calibration on mobile phones and tablets respectively. With calibration, this is reduced to 1.34cm and 2.12cm. Further, we demonstrate that the features learned by iTracker generalize well to other datasets, achieving state-of-the-art results. The code, data, and models are available at http://gazecapture.csail.mit.edu.
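As a rough illustration of what a prediction error in centimetres can mean, the sketch below averages the Euclidean distance between predicted and true on-screen gaze points. It assumes the error is a mean Euclidean distance and that both point sets are already expressed in centimetres; the function name `mean_euclidean_error_cm` and the sample coordinates are hypothetical and not taken from the GazeCapture code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (x, y) gaze location in centimetres


def mean_euclidean_error_cm(predicted: Sequence[Point],
                            actual: Sequence[Point]) -> float:
    """Average straight-line distance between paired gaze points."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, actual):
        # math.hypot gives sqrt(dx**2 + dy**2) for one prediction.
        total += math.hypot(px - ax, py - ay)
    return total / len(predicted)
```

Under this assumed definition, a lower value means the predicted gaze points sit closer to where the person was actually looking.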

/pdf/eye-tracking-for-everyone-3aeea60toh.pdf

The increasing availability of large digitized fine art collections opens new research perspectives in the intersection of artificial intelligence and art history. Motivated by the successful performance of Convolutional Neural Networks (CNN) for a wide variety of computer vision tasks, in this paper we explore their applicability for art-related image classification tasks. We perform extensive CNN fine-tuning experiments and consolidate in one place the results for five different art-related classification tasks on three large fine art datasets. Along with addressing the previously explored tasks of artist, genre, style and time period classification, we introduce a novel task of classifying artworks based on their association with a specific national artistic context. We present state-of-the-art classification results of the addressed tasks, signifying the impact of our method on computational analysis of art, as well as other image classification related research areas. Furthermore, in order to question transferability of deep representations across various source and target domains, we systematically compare the effects of domain-specific weight initialization by evaluating networks pre-trained for different tasks, varying from object and scene recognition to sentiment and memorability labelling. We show that fine-tuning networks pre-trained for scene recognition and sentiment prediction yields better results than fine-tuning networks pre-trained for object recognition. This novel outcome of our work suggests that the semantic correlation between different domains could be inherent in the CNN weights. Additionally, we address the practical applicability of our results by analysing different aspects of image similarity. We show that features derived from fine-tuned networks can be employed to retrieve images similar in either style or content, which can be used to enhance capabilities of search systems in different online art collections.

Fine-tuning Convolutional Neural Networks for fine art classification

We introduce a framework that uses Generative Adversarial Networks (GANs) to study cognitive properties like memorability. These attributes are of interest because we do not have a concrete visual definition of what they entail. What does it look like for a dog to be more memorable? GANs allow us to generate a manifold of natural-looking images with fine-grained differences in their visual attributes. By navigating this manifold in directions that increase memorability, we can visualize what it looks like for a particular generated image to become more memorable. The resulting "visual definitions" surface image properties (like "object size") that may underlie memorability. Through behavioral experiments, we verify that our method indeed discovers image manipulations that causally affect human memory performance. We further demonstrate that the same framework can be used to analyze image aesthetics and emotional valence. See ganalyze.csail.mit.edu.

Lore Goetschalckx, Alex Andonian, Aude Oliva, Phillip Isola: GANalyze: Toward Visual Definitions of Cognitive Image Properties.

/pdf/ganalyze-toward-visual-definitions-of-cognitive-image-lrwcrjfpu3.pdf

GANalyze: Toward Visual Definitions of Cognitive Image Properties

Progress in estimating visual memorability has been limited by the small scale and lack of variety of benchmark data. Here, we introduce a novel experimental procedure to objectively measure human memory, allowing us to build LaMem, the largest annotated image memorability dataset to date (containing 60,000 images from diverse sources). Using Convolutional Neural Networks (CNNs), we show that fine-tuned deep features outperform all other features by a large margin, reaching a rank correlation of 0.64, near human consistency (0.68). Analysis of the responses of the high-level CNN layers shows which objects and regions are positively, and negatively, correlated with memorability, allowing us to create memorability maps for each image and provide a concrete method to perform image memorability manipulation. This work demonstrates that one can now robustly estimate the memorability of images from many different classes, positioning memorability and deep memorability features as prime candidates to estimate the utility of information for cognitive systems. Our model and data are available at: http://memorability.csail.mit.edu.
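The rank correlation figures above compare how a model orders images with how people do. A minimal sketch of one common measure, Spearman's rank correlation, follows; it assumes this is the kind of measure meant, and the function names and any scores passed in are hypothetical rather than drawn from the LaMem code or data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value of 1 means the two score lists order the images identically, and -1 means the orderings are exactly reversed.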

/pdf/understanding-and-predicting-image-memorability-at-a-large-vk0vymn84h.pdf

Understanding and Predicting Image Memorability at a Large Scale

Detecting successful behaviour is crucial for training intelligent agents. As such, generalisable reward models are a prerequisite for agents that can learn to generalise their behaviour. In this work we focus on developing robust success detectors that leverage large, pretrained vision-language models (Flamingo, Alayrac et al. (2022)) and human reward annotations. Concretely, we treat success detection as a visual question answering (VQA) problem, denoted SuccessVQA. We study success detection across three vastly different domains: (i) interactive language-conditioned agents in a simulated household, (ii) real world robotic manipulation, and (iii) "in-the-wild" human egocentric videos. We investigate the generalisation properties of a Flamingo-based success detection model across unseen language and visual changes in the first two domains, and find that the proposed method is able to outperform bespoke reward models in out-of-distribution test scenarios with either variation. In the last domain of "in-the-wild" human videos, we show that success detection on unseen real videos presents an even more challenging generalisation task warranting future work. We hope our initial results encourage further work in real world success detection and reward modelling.

Vision-Language Models as Success Detectors

The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100–1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.

RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

Despite their high accuracy, neural networks are not widely used in fields such as medicine, finance, and education, where predictive explainability is essential. The objective of this work is to create and train a model using PyTorch Pipeline that classifies images as “Good” or “Anomaly” and, if an image is classified as “Anomaly,” returns a bounding box for the defect. While this task appears straightforward and similar to other object detection tasks, the dataset lacks bounding box labels. Fortunately, this problem can be solved at inference time: a model trained without labels for defective regions can predict a bounding box for a defective region in the image by processing feature maps from the deep convolutional layers. This work describes the strategy and how to apply it to real-world defect detection. A 400-image dataset containing pictures of both defect-free objects (labelled “good”) and defective objects (labelled “anomalies”) is used. The dataset is unbalanced, with more good images than defective ones. The images may show many kinds of object, such as a bottle, cable, pill, tile, piece of leather, or zipper.
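One simple way to turn a coarse activation map into a bounding box, as described above, is to threshold the map and take the extent of the cells that survive. The sketch below assumes this thresholding approach, which may differ from the authors' exact procedure; the function name `heatmap_to_bbox`, the 0.5 relative threshold, and the plain nested-list heatmap are all illustrative choices.

```python
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def heatmap_to_bbox(heatmap: List[List[float]],
                    image_width: int,
                    image_height: int,
                    rel_threshold: float = 0.5) -> Optional[BBox]:
    """Box around heatmap cells at or above rel_threshold * peak value."""
    rows, cols = len(heatmap), len(heatmap[0])
    peak = max(max(row) for row in heatmap)
    if peak <= 0:
        return None  # no positive activation, so nothing to localize
    cutoff = rel_threshold * peak
    hot = [(r, c) for r in range(rows) for c in range(cols)
           if heatmap[r][c] >= cutoff]
    r_min, r_max = min(r for r, _ in hot), max(r for r, _ in hot)
    c_min, c_max = min(c for _, c in hot), max(c for _, c in hot)
    # Each heatmap cell covers a (scale_x by scale_y) patch of the image.
    scale_x, scale_y = image_width / cols, image_height / rows
    return (int(c_min * scale_x), int(r_min * scale_y),
            int((c_max + 1) * scale_x), int((r_max + 1) * scale_y))
```

In a PyTorch setting, the heatmap would typically come from the last convolutional feature maps weighted toward the “Anomaly” class; that step is omitted here because the abstract does not specify it.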

Akhil S. Raju

Papers

Understanding and Predicting Image Memorability at a Large Scale

Vision-Language Models as Success Detectors

RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

Automatic Visual Inspection - Defects Detection using CNN