
Showing papers on "Eye tracking published in 2018"


Book ChapterDOI
Zheng Zhu1, Qiang Wang1, Bo Li2, Wei Wu2, Junjie Yan2, Weiming Hu1 
08 Sep 2018
TL;DR: Zhu et al. propose a distractor-aware Siamese network for accurate and long-term tracking, which uses an effective sampling strategy to control the distribution of training data and make the model focus on the semantic distractors.
Abstract: Recently, Siamese networks have drawn great attention in visual tracking community because of their balanced accuracy and speed. However, features used in most Siamese tracking approaches can only discriminate foreground from the non-semantic backgrounds. The semantic backgrounds are always considered as distractors, which hinders the robustness of Siamese trackers. In this paper, we focus on learning distractor-aware Siamese networks for accurate and long-term tracking. To this end, features used in traditional Siamese trackers are analyzed at first. We observe that the imbalanced distribution of training data makes the learned features less discriminative. During the off-line training phase, an effective sampling strategy is introduced to control this distribution and make the model focus on the semantic distractors. During inference, a novel distractor-aware module is designed to perform incremental learning, which can effectively transfer the general embedding to the current video domain. In addition, we extend the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy. Extensive experiments on benchmarks show that our approach significantly outperforms the state-of-the-arts, yielding 9.6% relative gain in VOT2016 dataset and 35.9% relative gain in UAV20L dataset. The proposed tracker can perform at 160 FPS on short-term benchmarks and 110 FPS on long-term benchmarks.

711 citations
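
A minimal Python sketch of the distractor-aware re-ranking idea from the abstract above, under stated assumptions: cosine similarity stands in for the learned Siamese score, and the weighting alpha and the set of collected distractor embeddings are illustrative placeholders rather than the authors' implementation.

```python
# Sketch of distractor-aware proposal re-ranking (illustrative, not DaSiamRPN's code).
import numpy as np

def rerank(exemplar, proposals, distractors, alpha=0.5):
    """exemplar: (d,), proposals: (n, d), distractors: (m, d) embedding vectors."""
    def sim(a, b):
        # cosine similarity as a stand-in for the learned Siamese score
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return a @ b.T

    scores = sim(proposals, exemplar[None, :]).ravel()       # similarity to the target
    if len(distractors):
        penalty = sim(proposals, distractors).mean(axis=1)   # similarity to distractors
        scores = scores - alpha * penalty                    # suppress distractor-like proposals
    return int(np.argmax(scores))                            # index of the selected proposal
```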


Posted Content
Zheng Zhu1, Qiang Wang1, Bo Li2, Wei Wu2, Junjie Yan2, Weiming Hu1 
TL;DR: This paper focuses on learning distractor-aware Siamese networks for accurate and long-term tracking, and extends the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy.
Abstract: Recently, Siamese networks have drawn great attention in visual tracking community because of their balanced accuracy and speed. However, features used in most Siamese tracking approaches can only discriminate foreground from the non-semantic backgrounds. The semantic backgrounds are always considered as distractors, which hinders the robustness of Siamese trackers. In this paper, we focus on learning distractor-aware Siamese networks for accurate and long-term tracking. To this end, features used in traditional Siamese trackers are analyzed at first. We observe that the imbalanced distribution of training data makes the learned features less discriminative. During the off-line training phase, an effective sampling strategy is introduced to control this distribution and make the model focus on the semantic distractors. During inference, a novel distractor-aware module is designed to perform incremental learning, which can effectively transfer the general embedding to the current video domain. In addition, we extend the proposed approach for long-term tracking by introducing a simple yet effective local-to-global search region strategy. Extensive experiments on benchmarks show that our approach significantly outperforms the state-of-the-arts, yielding 9.6% relative gain in VOT2016 dataset and 35.9% relative gain in UAV20L dataset. The proposed tracker can perform at 160 FPS on short-term benchmarks and 110 FPS on long-term benchmarks.

644 citations


Journal ArticleDOI
30 Jul 2018
TL;DR: In this paper, a generative neural network with a novel space-time architecture is proposed to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor.
Abstract: We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulations of facial expressions only, we are the first to transfer the full 3D head position, head rotation, face expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism in this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically-created input. In order to enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video, and feed it into the trained network - thus taking full control of the target. With the ability to freely recombine source and target parameters, we are able to demonstrate a large variety of video rewrite applications without explicitly modeling hair, body or background. For instance, we can reenact the full head using interactive user-controlled editing, and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations, where for instance a user study shows that our video edits are hard to detect.

611 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: A Residual Attentional Siamese Network (RASNet) for high performance object tracking that not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning.
Abstract: Offline training for object tracking has recently shown great potentials in balancing tracking accuracy and speed. However, it is still difficult to adapt an offline trained model to a target tracked online. This work presents a Residual Attentional Siamese Network (RASNet) for high performance object tracking. The RASNet model reformulates the correlation filter within a Siamese tracking framework, and introduces different kinds of the attention mechanisms to adapt the model without updating the model online. In particular, by exploiting the offline trained general attention, the target adapted residual attention, and the channel favored feature attention, the RASNet not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning. The proposed deep architecture is trained from end to end and takes full advantage of the rich spatial temporal information to achieve robust visual tracking. Experimental results on two latest benchmarks, OTB-2015 and VOT2017, show that the RASNet tracker has the state-of-the-art tracking accuracy while runs at more than 80 frames per second.

499 citations
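
The following is a rough numpy/scipy illustration of the attention-weighted Siamese correlation described above; it is a sketch rather than RASNet itself, and the spatial and channel attention arrays are assumed placeholder inputs rather than the trained general, residual, and channel attentions.

```python
# Illustrative attention-weighted template correlation (placeholder attentions).
import numpy as np
from scipy.signal import correlate2d

def attentional_response(template, search, spatial_attn, channel_attn):
    """template: (c, h, w), search: (c, H, W), spatial_attn: (h, w), channel_attn: (c,)."""
    out_h = search.shape[1] - template.shape[1] + 1
    out_w = search.shape[2] - template.shape[2] + 1
    response = np.zeros((out_h, out_w))
    for c in range(template.shape[0]):
        # reweight the template per channel and per location before correlating
        kernel = channel_attn[c] * spatial_attn * template[c]
        response += correlate2d(search[c], kernel, mode="valid")
    return response   # peak location indicates the target position
```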


Journal ArticleDOI
TL;DR: The background of deep visual tracking is introduced, including the fundamental concepts of visual tracking and related deep learning algorithms, and the existing deep-learning-based trackers are categorized into three classes according to network structure, network function and network training.

473 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper proposes an efficient multi-cue analysis framework for robust visual tracking that combines different types of features and constructs multiple experts through Discriminative Correlation Filters, each of which tracks the target independently.
Abstract: In recent years, many tracking algorithms achieve impressive performance via fusing multiple types of features, however, most of them fail to fully explore the context among the adopted multiple features and the strength of them. In this paper, we propose an efficient multi-cue analysis framework for robust visual tracking. By combining different types of features, our approach constructs multiple experts through Discriminative Correlation Filter (DCF) and each of them tracks the target independently. With the proposed robustness evaluation strategy, the suitable expert is selected for tracking in each frame. Furthermore, the divergence of multiple experts reveals the reliability of the current tracking, which is quantified to update the experts adaptively to keep them from corruption. Through the proposed multi-cue analysis, our tracker with standard DCF and deep features achieves outstanding results on several challenging benchmarks: OTB-2013, OTB-2015, Temple-Color and VOT 2016. On the other hand, when evaluated with only simple hand-crafted features, our method demonstrates comparable performance amongst complex non-realtime trackers, but exhibits much better efficiency, with a speed of 45 FPS on a CPU.

347 citations
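
To make the expert-selection step above more concrete, here is a hedged sketch in which each DCF expert contributes a response map and the most reliable one is chosen per frame; the peak-sharpness score used here is a simple proxy, not the paper's pairwise robustness evaluation strategy.

```python
# Illustrative per-frame expert selection from DCF response maps.
import numpy as np

def peak_sharpness(response):
    # simple reliability proxy: how much the peak stands out from the rest of the map
    return (response.max() - response.mean()) / (response.std() + 1e-8)

def select_expert(response_maps):
    """response_maps: list of 2-D response maps, one per feature/expert."""
    scores = [peak_sharpness(r) for r in response_maps]
    best = int(np.argmax(scores))
    peak = np.unravel_index(np.argmax(response_maps[best]), response_maps[best].shape)
    return best, peak   # chosen expert and its estimated target location
```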


Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, a dynamic memory network is proposed to adapt the template to the target's appearance variations during tracking; an LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing of the memory block.
Abstract: Template-matching methods for visual tracking have gained popularity recently due to their comparable performance and fast speed. However, they lack effective ways to adapt to changes in the target object’s appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target’s appearance variations during tracking. An LSTM is used as a memory controller, where the input is the search feature map and the outputs are the control signals for the reading and writing process of the memory block. As the location of the target is at first unknown in the search feature map, an attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. Unlike tracking-by-detection methods where the object’s information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target’s appearance changes by updating the external memory. Moreover, unlike other tracking methods where the model capacity is fixed after offline training, the capacity of our tracker can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on OTB and VOT demonstrate that our tracker MemTrack performs favorably against state-of-the-art tracking methods while retaining a real-time speed of 50 fps.

264 citations
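
A small sketch of the gated residual template learning mentioned above, assuming the retrieved memory and the initial template share one feature-map shape; the gate computation is a placeholder for whatever the network would produce, so this illustrates only the blending rule, not the MemTrack implementation.

```python
# Illustrative gated residual template update (blending rule only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def updated_template(initial_template, retrieved_memory, gate_logits):
    """All arguments share the template feature-map shape, e.g. (c, h, w)."""
    gate = sigmoid(gate_logits)                    # elementwise gate in (0, 1)
    # the fixed first-frame template is kept, and only a gated residual is added,
    # which limits how far the model can drift from the initial appearance
    return initial_template + gate * retrieved_memory
```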


Proceedings ArticleDOI
21 Apr 2018
TL;DR: Interfaces communicating vehicle awareness and intent are found to help pedestrians attempting to cross, are not limited to the vehicle and can exist in the environment, and should use a combination of modalities such as visual, auditory, and physical.
Abstract: Drivers use nonverbal cues such as vehicle speed, eye gaze, and hand gestures to communicate awareness and intent to pedestrians. Conversely, in autonomous vehicles, drivers can be distracted or absent, leaving pedestrians to infer awareness and intent from the vehicle alone. In this paper, we investigate the usefulness of interfaces (beyond vehicle movement) that explicitly communicate awareness and intent of autonomous vehicles to pedestrians, focusing on crosswalk scenarios. We conducted a preliminary study to gain insight on designing interfaces that communicate autonomous vehicle awareness and intent to pedestrians. Based on study outcomes, we developed four prototype interfaces and deployed them in studies involving a Segway and a car. We found interfaces communicating vehicle awareness and intent: (1) can help pedestrians attempting to cross; (2) are not limited to the vehicle and can exist in the environment; and (3) should use a combination of modalities such as visual, auditory, and physical.

240 citations


Book ChapterDOI
08 Sep 2018
TL;DR: This work addresses the issue of ground truth annotation by measuring head pose using a motion capture system and eye gaze using mobile eye-tracking glasses, and applies semantic image inpainting to the area covered by the glasses to bridge the gap between training and testing images by removing the obtrusiveness of the glasses.
Abstract: In this work, we consider the problem of robust gaze estimation in natural environments. Large camera-to-subject distances and high variations in head pose and eye gaze angles are common in such environments. This leads to two main shortfalls in state-of-the-art methods for gaze estimation: hindered ground truth gaze annotation and diminished gaze estimation accuracy as image resolution decreases with distance. We first record a novel dataset of varied gaze and head pose images in a natural environment, addressing the issue of ground truth annotation by measuring head pose using a motion capture system and eye gaze using mobile eyetracking glasses. We apply semantic image inpainting to the area covered by the glasses to bridge the gap between training and testing images by removing the obtrusiveness of the glasses. We also present a new real-time algorithm involving appearance-based deep convolutional neural networks with increased capacity to cope with the diverse images in the new dataset. Experiments with this network architecture are conducted on a number of diverse eye-gaze datasets including our own, and in cross dataset evaluations. We demonstrate state-of-the-art performance in terms of estimation accuracy in all experiments, and the architecture performs well even on lower resolution images.

233 citations


Journal ArticleDOI
TL;DR: A systematic review of eye tracking research in the domain of multimedia learning explores how cognitive processes in multimedia learning are studied with relevant variables through eye tracking technology to offer suggestions for future research and practices.
Abstract: This study provides a current systematic review of eye tracking research in the domain of multimedia learning. The particular aim of the review is to explore how cognitive processes in multimedia learning are studied with relevant variables through eye tracking technology. To this end, 52 articles, including 58 studies, were analyzed. Remarkable results are that (1) there is a burgeoning interest in the use of eye tracking technology in multimedia learning research; (2) studies were mostly conducted with college students, science materials, and the temporal and count scales of eye tracking measurements; (3) eye movement measurements provided inferences about the cognitive processes of selecting, organizing, and integrating; (4) multimedia learning principles, multimedia content, individual differences, metacognition, and emotions were the potential factors that can affect eye movement measurements; and (5) findings were available for supporting the association between cognitive processes inferred by eye tracking measurements and learning performance. Specific gaps in the literature and implications of existing findings on multimedia learning design were also determined to offer suggestions for future research and practices.

201 citations


Proceedings ArticleDOI
18 Jun 2018
TL;DR: This paper presents the large-scale eye-tracking in dynamic VR scene dataset, and proposes to compute saliency maps at different spatial scales: the sub-image patch centered at the current gaze point, the sub-image corresponding to the Field of View (FoV), and the panorama image.
Abstract: This paper explores gaze prediction in dynamic 360° immersive videos, i.e., based on the history scan path and VR contents, we predict where a viewer will look at an upcoming time. To tackle this problem, we first present the large-scale eye-tracking in dynamic VR scene dataset. Our dataset contains 208 360° videos captured in dynamic scenes, and each video is viewed by at least 31 subjects. Our analysis shows that gaze prediction depends on its history scan path and image contents. In terms of the image contents, those salient objects easily attract viewers' attention. On the one hand, the saliency is related to both appearance and motion of the objects. Considering that the saliency measured at different scales is different, we propose to compute saliency maps at different spatial scales: the sub-image patch centered at current gaze point, the sub-image corresponding to the Field of View (FoV), and the panorama image. Then we feed both the saliency maps and the corresponding images into a Convolutional Neural Network (CNN) for feature extraction. Meanwhile, we also use a Long-Short-Term-Memory (LSTM) to encode the history scan path. Then we combine the CNN features and LSTM features for gaze displacement prediction between gaze point at a current time and gaze point at an upcoming time. Extensive experiments validate the effectiveness of our method for gaze prediction in dynamic VR scenes.
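
The following hedged PyTorch sketch mirrors the pipeline described in the abstract above: multi-scale image/saliency inputs pass through a small CNN, the history scan path through an LSTM, and the fused features regress the gaze displacement. All layer sizes, the stacked-channel input format, and the module names are assumptions, not the authors' exact architecture.

```python
# Illustrative multi-scale CNN + scan-path LSTM for gaze displacement regression.
import torch
import torch.nn as nn

class GazeDisplacementNet(nn.Module):
    def __init__(self, in_channels=9, hidden=64):
        super().__init__()
        # in_channels: stacked image + saliency planes for the patch, FoV and panorama scales
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(32 + hidden, 2)   # predicted (dx, dy) gaze displacement

    def forward(self, frames, scanpath):
        # frames: (B, in_channels, H, W); scanpath: (B, T, 2) past gaze points
        img_feat = self.cnn(frames)
        _, (h, _) = self.lstm(scanpath)
        return self.head(torch.cat([img_feat, h[-1]], dim=1))
```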

Proceedings ArticleDOI
19 Apr 2018
TL;DR: This work investigates precise, multimodal selection techniques using head motion and eye gaze for augmented reality applications, including compact menus with deep structure, and a proof-of-concept method for on-line correction of calibration drift.
Abstract: Head and eye movement can be leveraged to improve the user's interaction repertoire for wearable displays. Head movements are deliberate and accurate, and provide the current state-of-the-art pointing technique. Eye gaze can potentially be faster and more ergonomic, but suffers from low accuracy due to calibration errors and drift of wearable eye-tracking sensors. This work investigates precise, multimodal selection techniques using head motion and eye gaze. A comparison of speed and pointing accuracy reveals the relative merits of each method, including the achievable target size for robust selection. We demonstrate and discuss example applications for augmented reality, including compact menus with deep structure, and a proof-of-concept method for on-line correction of calibration drift.

Journal ArticleDOI
14 Sep 2018-PLOS ONE
TL;DR: Inter-trial change in pupil diameter and microsaccade magnitude appear to adequately discriminate task difficulty, and hence cognitive load, if the implied causality can be assumed; the reliability and sensitivity of task-evoked pupillary and microsaccadic measures of cognitive load are compared.
Abstract: Pupil diameter and microsaccades are captured by an eye tracker and compared for their suitability as indicators of cognitive load (as beset by task difficulty). Specifically, two metrics are tested in response to task difficulty: (1) the change in pupil diameter with respect to inter- or intra-trial baseline, and (2) the rate and magnitude of microsaccades. Participants performed easy and difficult mental arithmetic tasks while fixating a central target. Inter-trial change in pupil diameter and microsaccade magnitude appear to adequately discriminate task difficulty, and hence cognitive load, if the implied causality can be assumed. This paper’s contribution corroborates previous work concerning microsaccade magnitude and extends this work by directly comparing microsaccade metrics to pupillometric measures. To our knowledge this is the first study to compare the reliability and sensitivity of task-evoked pupillary and microsaccadic measures of cognitive load.
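
As a concrete reading of the first metric above, here is an illustrative helper (an assumption, not the paper's analysis code) that computes the task-evoked change in pupil diameter relative to an inter-trial baseline window.

```python
# Illustrative task-evoked pupil response relative to an inter-trial baseline.
import numpy as np

def pupil_change(trial_diameter, baseline_diameter):
    """Both arguments are 1-D arrays of pupil diameter samples (e.g. in mm)."""
    baseline = np.nanmean(baseline_diameter)       # pre-trial baseline level
    return np.nanmean(trial_diameter) - baseline   # positive values suggest higher load
```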

Journal ArticleDOI
TL;DR: In this paper, the problem of learning deep fully convolutional features for correlation filter based (CFB) visual tracking is formulated, and a novel and efficient backpropagation algorithm based on the loss function of the network is presented.
Abstract: In recent years, correlation filters have shown dominant and spectacular results for visual object tracking. The types of the features that are employed in this family of trackers significantly affect the performance of visual tracking. The ultimate goal is to utilize the robust features invariant to any kind of appearance change of the object, while predicting the object location as properly as in the case of no appearance change. As the deep learning based methods have emerged, the study of learning features for specific tasks has accelerated. For instance, discriminative visual tracking methods based on deep architectures have been studied with promising performance. Nevertheless, correlation filter based (CFB) trackers confine themselves to use the pre-trained networks, which are trained for the object classification problem. To this end, in this manuscript the problem of learning deep fully convolutional features for CFB visual tracking is formulated. In order to learn the proposed model, a novel and efficient backpropagation algorithm is presented based on the loss function of the network. The proposed learning framework enables the network model to be flexible for a custom design. Moreover, it alleviates the dependency on the network trained for classification. Extensive performance analysis shows the efficacy of the proposed custom design in the CFB tracking framework. By fine-tuning the convolutional parts of a state-of-the-art network and integrating this model into a CFB tracker, which is the top performing one of VOT2016, an 18% increase is achieved in terms of expected average overlap, and tracking failures are decreased by 25%, while maintaining the superiority over the state-of-the-art methods on the OTB-2013 and OTB-2015 tracking datasets.
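
For context, the correlation-filter building block that CFB trackers rely on can be written compactly; the sketch below is a textbook single-channel ridge-regression filter solved in the Fourier domain, not the paper's feature-learning framework or its backpropagation algorithm.

```python
# Textbook single-channel correlation filter in the Fourier domain (illustrative).
import numpy as np

def train_filter(feature_map, gaussian_label, lam=1e-2):
    X = np.fft.fft2(feature_map)
    Y = np.fft.fft2(gaussian_label)
    # closed-form ridge-regression solution; lam is the regularization weight
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(filter_fft, feature_map):
    response = np.real(np.fft.ifft2(filter_fft * np.fft.fft2(feature_map)))
    return np.unravel_index(np.argmax(response), response.shape)   # target location
```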

Proceedings ArticleDOI
12 Jun 2018
TL;DR: This paper presents a novel dataset of 360° videos with associated eye and head movement data, a follow-up to the previous dataset for still images; the dataset and its associated code are made publicly available to support research on visual attention for 360° content.
Abstract: Research on visual attention in 360° content is crucial to understand how people perceive and interact with this immersive type of content and to develop efficient techniques for processing, encoding, delivering and rendering, and also to offer a high quality of experience to end users. The availability of public datasets is essential to support and facilitate research activities of the community. Recently, some studies have been presented analyzing exploration behaviors of people watching 360° videos, and a few datasets have been published. However, the majority of these works only consider head movements as a proxy for gaze data, despite the importance of eye movements in the exploration of omnidirectional content. Thus, this paper presents a novel dataset of 360° videos with associated eye and head movement data, which is a follow-up to our previous dataset for still images [14]. Head and eye tracking data was obtained from 57 participants during a free-viewing experiment with 19 videos. In addition, guidelines on how to obtain saliency maps and scanpaths from raw data are provided. Also, some statistics related to exploration behaviors are presented, such as the impact of the longitudinal starting position when watching omnidirectional videos, which was investigated in this test. This dataset and its associated code are made publicly available to support research on visual attention for 360° content.
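
The dataset paper provides its own guidelines for deriving saliency maps and scanpaths; the snippet below is only a generic sketch of the usual recipe, assuming an equirectangular frame and fixations in normalized coordinates: accumulate fixation locations and smooth with a Gaussian kernel.

```python
# Generic fixation-based saliency map (assumed conventions, not the paper's tooling).
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map(fixations, height=1024, width=2048, sigma_px=30):
    """fixations: iterable of (x, y) pairs normalized to [0, 1)."""
    sal = np.zeros((height, width), dtype=float)
    for x, y in fixations:
        sal[int(y * height) % height, int(x * width) % width] += 1.0   # accumulate hits
    sal = gaussian_filter(sal, sigma=sigma_px)                         # spatial smoothing
    return sal / sal.max() if sal.max() > 0 else sal                   # normalize to [0, 1]
```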

Proceedings ArticleDOI
01 Jun 2018
TL;DR: The positive samples generation network (PSGN) is introduced to sample massive, diverse training data by traversing the constructed target object manifold; the generated diverse target object images can enrich the training dataset and enhance the robustness of visual trackers.
Abstract: Existing visual trackers are easily disturbed by occlusion, blur and large deformation. We think the performance of existing visual trackers may be limited due to the following issues: i) Adopting the dense sampling strategy to generate positive examples will make them less diverse; ii) The training data with different challenging factors are limited, even when collecting a large training dataset. Collecting an even larger training dataset is the most intuitive paradigm, but it may still not cover all situations and the positive samples are still monotonous. In this paper, we propose to generate hard positive samples via adversarial learning for visual tracking. Specifically, we assume the target objects all lie on a manifold; hence, we introduce the positive samples generation network (PSGN) to sample massive, diverse training data by traversing the constructed target object manifold. The generated diverse target object images can enrich the training dataset and enhance the robustness of visual trackers. To make the tracker more robust to occlusion, we adopt the hard positive transformation network (HPTN), which can generate hard samples for the tracking algorithm to recognize. We train this network with deep reinforcement learning to automatically occlude the target object with a negative patch. Based on the generated hard positive samples, we train a Siamese network for visual tracking, and our experiments validate the effectiveness of the introduced algorithm. The project page of this paper can be found at the website1.

Journal ArticleDOI
TL;DR: It is shown that eye movements during an everyday task predict aspects of the authors' personality, and new relations between previously neglected eye movement characteristics and personality are revealed.
Abstract: Besides allowing us to perceive our surroundings, eye movements are also a window into our mind and a rich source of information on who we are, how we feel, and what we do. Here we show that eye movements during an everyday task predict aspects of our personality. We tracked eye movements of 42 participants while they ran an errand on a university campus and subsequently assessed their personality traits using well-established questionnaires. Using a state-of-the-art machine learning method and a rich set of features encoding different eye movement characteristics, we were able to reliably predict four of the Big Five personality traits (neuroticism, extraversion, agreeableness, conscientiousness) as well as perceptual curiosity only from eye movements. Further analysis revealed new relations between previously neglected eye movement characteristics and personality. Our findings demonstrate a considerable influence of personality on everyday eye movement control, thereby complementing earlier studies in laboratory settings. Improving automatic recognition and interpretation of human social signals is an important endeavor, enabling innovative design of human–computer systems capable of sensing spontaneous natural user behavior to facilitate efficient interaction and personalization.

Book ChapterDOI
08 Sep 2018
TL;DR: In this paper, instead of directly regressing two angles for the pitch and yaw of the eyeball, the authors regress to an intermediate pictorial representation, which in turn simplifies the task of 3D gaze direction estimation.
Abstract: Estimating human gaze from natural eye images only is a challenging task. Gaze direction can be defined by the pupil- and the eyeball center where the latter is unobservable in 2D images. Hence, achieving highly accurate gaze estimates is an ill-posed problem. In this paper, we introduce a novel deep neural network architecture specifically designed for the task of gaze estimation from single eye input. Instead of directly regressing two angles for the pitch and yaw of the eyeball, we regress to an intermediate pictorial representation which in turn simplifies the task of 3D gaze direction estimation. Our quantitative and qualitative results show that our approach achieves higher accuracies than the state-of-the-art and is robust to variation in gaze, head pose and image quality.

Journal ArticleDOI
TL;DR: Using statistical features extracted from speaking behaviour, eye activity, and head pose, the behaviour associated with major depression is characterised, and the classification performance of individual modalities and of their fusion is examined.
Abstract: An estimated 350 million people worldwide are affected by depression. Using affective sensing technology, our long-term goal is to develop an objective multimodal system that augments clinical opinion during the diagnosis and monitoring of clinical depression. This paper steps towards developing a classification system-oriented approach, where feature selection, classification and fusion-based experiments are conducted to infer which types of behaviour (verbal and nonverbal) and behaviour combinations can best discriminate between depression and non-depression. Using statistical features extracted from speaking behaviour, eye activity, and head pose, we characterise the behaviour associated with major depression and examine the performance of the classification of individual modalities and when fused. Using a real-world, clinically validated dataset of 30 severely depressed patients and 30 healthy control subjects, a Support Vector Machine is used for classification with several feature selection techniques. Given the statistical nature of the extracted features, feature selection based on T-tests performed better than other methods. Individual modality classification results were considerably higher than chance level (83 percent for speech, 73 percent for eye, and 63 percent for head). Fusing all modalities shows a remarkable improvement compared to unimodal systems, which demonstrates the complementary nature of the modalities. Among the different fusion approaches used here, feature fusion performed best with up to 88 percent average accuracy. We believe that is due to the compatible nature of the extracted statistical features.
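
A hedged scikit-learn sketch of the kind of pipeline the abstract above describes: univariate feature selection followed by a linear Support Vector Machine over statistical features. The number of selected features is an arbitrary placeholder, and f_classif (an ANOVA F-test, equivalent to a t-test for two classes) stands in for the paper's T-test-based selection.

```python
# Illustrative depression-vs-control classifier: feature selection + linear SVM.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

clf = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=20)),  # univariate (t/F-test style) selection
    ("svm", SVC(kernel="linear")),
])
# Typical use (X: per-subject statistical features, y: depressed vs. control labels):
# clf.fit(X_train, y_train); accuracy = clf.score(X_test, y_test)
```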

Posted Content
TL;DR: In this article, a reciprocative learning algorithm was proposed to exploit visual attention for training deep classifiers, which consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training.
Abstract: Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporal robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.

Journal ArticleDOI
TL;DR: It is shown that a fully automated classification of raw gaze samples as belonging to fixations, saccades, or other oculomotor events can be achieved using a machine-learning approach, which leads to superior detection compared to current state-of-the-art event detection algorithms.
Abstract: Event detection is a challenging stage in eye movement data analysis. A major drawback of current event detection methods is that parameters have to be adjusted based on eye movement data quality. Here we show that a fully automated classification of raw gaze samples as belonging to fixations, saccades, or other oculomotor events can be achieved using a machine-learning approach. Any already manually or algorithmically detected events can be used to train a classifier to produce similar classification of other data without the need for a user to set parameters. In this study, we explore the application of random forest machine-learning technique for the detection of fixations, saccades, and post-saccadic oscillations (PSOs). In an effort to show practical utility of the proposed method to the applications that employ eye movement classification algorithms, we provide an example where the method is employed in an eye movement-driven biometric application. We conclude that machine-learning techniques lead to superior detection compared to current state-of-the-art event detection algorithms and can reach the performance of manual coding.
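
The following is an illustrative sketch of the machine-learning approach described above: per-sample features such as speed and acceleration feed a random forest that labels each raw gaze sample as fixation, saccade, or post-saccadic oscillation. The feature set, sampling-rate handling, and training data are assumptions rather than the study's exact setup.

```python
# Illustrative sample-level event classification with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sample_features(x, y, fs):
    """x, y: gaze position traces in degrees; fs: sampling rate in Hz."""
    vx, vy = np.gradient(x) * fs, np.gradient(y) * fs
    speed = np.hypot(vx, vy)                 # deg/s
    accel = np.gradient(speed) * fs          # deg/s^2
    return np.column_stack([speed, accel])   # one feature row per gaze sample

# labels per sample: 0 = fixation, 1 = saccade, 2 = PSO (from manual or algorithmic coding)
forest = RandomForestClassifier(n_estimators=200)
# forest.fit(sample_features(x_train, y_train_pos, fs=1000), labels_train)
# predictions = forest.predict(sample_features(x_test, y_test_pos, fs=1000))
```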

Journal ArticleDOI
11 Jan 2018
TL;DR: Re3, a real-time deep object tracker capable of incorporating temporal information into its model, is presented; robust tracking requires knowledge and understanding of the object being tracked: its appearance, its motion, and how it changes over time.
Abstract: Robust object tracking requires knowledge and understanding of the object being tracked: its appearance, its motion, and how it changes over time. A tracker must be able to modify its underlying model and adapt to new observations. We present Re3, a real-time deep object tracker capable of incorporating temporal information into its model. Rather than focusing on a limited set of objects or training a model at test-time to track a specific instance, we pretrain our generic tracker on a large variety of objects and efficiently update on the fly; Re3 simultaneously tracks and updates the appearance model with a single forward pass. This lightweight model is capable of tracking objects at 150 FPS while attaining competitive results on challenging benchmarks. We also show that our method handles temporary occlusion better than other comparable trackers using experiments that directly measure performance on sequences with occlusion.

Journal ArticleDOI
TL;DR: The key component of FaceVR is a robust algorithm to perform real-time facial motion capture of an actor who is wearing a head-mounted display (HMD), as well as a new data-driven approach for eye tracking from monocular videos.
Abstract: We propose FaceVR, a novel image-based method that enables video teleconferencing in VR based on self-reenactment. State-of-the-art face tracking methods in the VR context are focused on the animation of rigged 3D avatars (Li et al. 2015; Olszewski et al. 2016). Although they achieve good tracking performance, the results look cartoonish and not real. In contrast to these model-based approaches, FaceVR enables VR teleconferencing using an image-based technique that results in nearly photo-realistic outputs. The key component of FaceVR is a robust algorithm to perform real-time facial motion capture of an actor who is wearing a head-mounted display (HMD), as well as a new data-driven approach for eye tracking from monocular videos. Based on reenactment of a prerecorded stereo video of the person without the HMD, FaceVR incorporates photo-realistic re-rendering in real time, thus allowing artificial modifications of face and eye appearances. For instance, we can alter facial expressions or change gaze directions in the prerecorded target video. In a live setup, we apply these newly introduced algorithmic components.

Journal ArticleDOI
TL;DR: This study provides practical insight into how popular remote eye-trackers perform when recording from unrestrained participants, and provides a testing method for evaluating whether a tracker is suitable for studying a certain target population, which manufacturers can also use during the development of new eye-trackers.
Abstract: The marketing materials of remote eye-trackers suggest that data quality is invariant to the position and orientation of the participant as long as the eyes of the participant are within the eye-tracker's headbox, the area where tracking is possible. As such, remote eye-trackers are marketed as allowing the reliable recording of gaze from participant groups that cannot be restrained, such as infants, schoolchildren and patients with muscular or brain disorders. Practical experience and previous research, however, tells us that eye-tracking data quality, e.g. the accuracy of the recorded gaze position and the amount of data loss, deteriorates (compared to well-trained participants in chinrests) when the participant is unrestrained and assumes a non-optimal pose in front of the eye-tracker. How then can researchers working with unrestrained participants choose an eye-tracker? Here we investigated the performance of five popular remote eye-trackers from EyeTribe, SMI, SR Research, and Tobii in a series of tasks where participants took on non-optimal poses. We report that the tested systems varied in the amount of data loss and systematic offsets observed during our tasks. The EyeLink and EyeTribe in particular had large problems. Furthermore, the Tobii eye-trackers reported data for two eyes when only one eye was visible to the eye-tracker. This study provides practical insight into how popular remote eye-trackers perform when recording from unrestrained participants. It furthermore provides a testing method for evaluating whether a tracker is suitable for studying a certain target population, and that manufacturers can use during the development of new eye-trackers.

Proceedings Article
09 Oct 2018
TL;DR: In this paper, a reciprocative learning algorithm was proposed to exploit visual attention for training deep classifiers, which consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training.
Abstract: Visual attention, derived from cognitive neuroscience, facilitates human perception on the most pertinent subset of the sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. For visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively paying attention to temporal robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights as the classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm to exploit visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations to generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier learns to attend to the regions of target objects robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against the state-of-the-art approaches.
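
To make the training objective described above more tangible, here is a heavily simplified PyTorch sketch: the gradient of the target-class score with respect to the input serves as an attention map, and a term derived from it is added to the classification loss. The particular regularizer and the weighting lam are illustrative choices, not the authors' formulation.

```python
# Simplified attention-as-regularization training step (illustrative only).
import torch
import torch.nn.functional as F

def attention_regularized_loss(model, patch, label, lam=0.1):
    """patch: (B, C, H, W) input samples; label: (B,) class indices."""
    patch = patch.clone().requires_grad_(True)
    logits = model(patch)
    cls_loss = F.cross_entropy(logits, label)
    # backward pass through the target-class score yields an input attention map
    score = logits.gather(1, label.unsqueeze(1)).sum()
    grad = torch.autograd.grad(score, patch, create_graph=True)[0]
    attn = grad.abs().mean(dim=1)                              # (B, H, W) attention maps
    # one simple regularizer choice: favor spatially concentrated attention
    reg = (1.0 / (attn.flatten(1).var(dim=1) + 1e-6)).mean()
    return cls_loss + lam * reg
```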

Book ChapterDOI
08 Sep 2018
TL;DR: This paper proposes the Asymmetric Regression-Evaluation Network (ARE-Net) and tries to improve the gaze estimation performance to its full extent, achieving promising results and surpassing the state-of-the-art methods on multiple public datasets.
Abstract: Eye gaze estimation has been increasingly demanded by recent intelligent systems to accomplish a range of interaction-related tasks, by using simple eye images as input. However, learning the highly complex regression between eye images and gaze directions is nontrivial, and thus the problem is yet to be solved efficiently. In this paper, we propose the Asymmetric Regression-Evaluation Network (ARE-Net), and try to improve the gaze estimation performance to its full extent. At the core of our method is the notion of “two eye asymmetry” observed during gaze estimation for the left and right eyes. Inspired by this, we design the multi-stream ARE-Net; one asymmetric regression network (AR-Net) predicts 3D gaze directions for both eyes with a novel asymmetric strategy, and the evaluation network (E-Net) adaptively adjusts the strategy by evaluating the two eyes in terms of their performance during optimization. By training the whole network, our method achieves promising results and surpasses the state-of-the-art methods on multiple public datasets.

Journal ArticleDOI
TL;DR: Researchers are urged to make their definitions of fixations and saccades more explicit by specifying all the relevant components of the eye movement under investigation, to enable eye-movement researchers from different fields to have a discussion without misunderstandings.
Abstract: Eye movements have been extensively studied in a wide range of research fields. While new methods such as mobile eye tracking and eye tracking in virtual/augmented realities are emerging quickly, the eye-movement terminology has scarcely been revised. We assert that this may cause confusion about two of the main concepts: fixations and saccades. In this study, we assessed the definitions of fixations and saccades held in the eye-movement field, by surveying 124 eye-movement researchers. These eye-movement researchers held a variety of definitions of fixations and saccades, of which the breadth seems even wider than what is reported in the literature. Moreover, these definitions did not seem to be related to researcher background or experience. We urge researchers to make their definitions more explicit by specifying all the relevant components of the eye movement under investigation: (i) the oculomotor component: e.g. whether the eye moves slow or fast; (ii) the functional component: what purposes does the eye movement (or lack thereof) serve; (iii) the coordinate system used: relative to what does the eye move; (iv) the computational definition: how is the event represented in the eye-tracker signal. This should enable eye-movement researchers from different fields to have a discussion without misunderstandings.

Journal ArticleDOI
TL;DR: In this paper, a real-time source-to-target reenactment approach for complete human portrait videos that enables transfer of torso and head motion, face expression, and eye gaze is proposed.
Abstract: We propose HeadOn, the first real-time source-to-target reenactment approach for complete human portrait videos that enables transfer of torso and head motion, face expression, and eye gaze. Given a short RGB-D video of the target actor, we automatically construct a personalized geometry proxy that embeds a parametric head, eye, and kinematic torso model. A novel realtime reenactment algorithm employs this proxy to photo-realistically map the captured motion from the source actor to the target actor. On top of the coarse geometric proxy, we propose a video-based rendering technique that composites the modified target portrait video via view- and pose-dependent texturing, and creates photo-realistic imagery of the target actor under novel torso and head poses, facial expressions, and gaze directions. To this end, we propose a robust tracking of the face and torso of the source actor. We extensively evaluate our approach and show significant improvements in enabling much greater flexibility in creating realistic reenacted output videos.

Proceedings ArticleDOI
15 Jun 2018
TL;DR: It is shown that eye-gaze outperforms head-gaze in terms of speed, task load, required head movement and user preference, and that the advantages of eye-gaze further increase with larger FOV sizes.
Abstract: The current best practice for hands-free selection using Virtual and Augmented Reality (VR/AR) head-mounted displays is to use head-gaze for aiming and dwell-time or clicking for triggering the selection. There is an observable trend for new VR and AR devices to come with integrated eye-tracking units to improve rendering, to provide means for attention analysis or for social interactions. Eye-gaze has been successfully used for human-computer interaction in other domains, primarily on desktop computers. In VR/AR systems, aiming via eye-gaze could be significantly faster and less exhausting than via head-gaze. To evaluate benefits of eye-gaze-based interaction methods in VR and AR, we compared aiming via head-gaze and aiming via eye-gaze. We show that eye-gaze outperforms head-gaze in terms of speed, task load, required head movement and user preference. We furthermore show that the advantages of eye-gaze further increase with larger FOV sizes.

Journal ArticleDOI
TL;DR: It is argued that the subjective perception of eye contact is a product of mutual face gaze instead of actual mutual eye contact, and the existence of an eye-mouth gaze continuum is suggested.
Abstract: We report the personal eye gaze patterns of people engaged in face-to-face getting acquainted conversation. Considerable differences between individuals are underscored by a stability of eye gaze patterns within individuals. Results suggest the existence of an eye-mouth gaze continuum. This continuum includes some people showing a strong preference for eye gaze, some with a strong preference for mouth gaze, and others distributing their gaze between the eyes and mouth to varying extents. Additionally, we found evidence of within-participant consistency not just for location preference but also for the duration of fixations upon the eye and mouth regions. We also estimate that during a 4-minute getting acquainted conversation mutual face gaze constitutes about 60% of conversation that occurs via typically brief instances of 2.2 seconds. Mutual eye contact ranged from 0–45% of conversation, via very brief instances. This was despite participants subjectively perceiving eye contact occurring for about 70% of conversation. We argue that the subjective perception of eye contact is a product of mutual face gaze instead of actual mutual eye contact. We also outline the fast activity of gaze movements upon various locations both on and off face during a typical face-to-face conversation.