Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior

doi:10.1109/ICCVW.2017.33

Proceedings Article•DOI•

Amir Rasouli¹, Iuliia Kotseruba¹, John K. Tsotsos¹•Institutions (1)

York University¹

01 Oct 2017-pp 206-213

TL;DR: A novel dataset is introduced which in addition to providing the bounding box information for pedestrian detection, also includes the behavioral and contextual annotations for the scenes, which allows combining visual and semantic information for better understanding of pedestrians' intentions in various traffic scenarios.

Abstract: Designing autonomous vehicles suitable for urban environments remains an unresolved problem. One of the major dilemmas faced by autonomous cars is how to understand the intention of other road users and communicate with them. The existing datasets do not provide the necessary means for such higher level analysis of traffic scenes. With this in mind, we introduce a novel dataset which in addition to providing the bounding box information for pedestrian detection, also includes the behavioral and contextual annotations for the scenes. This allows combining visual and semantic information for better understanding of pedestrians' intentions in various traffic scenarios. We establish baseline approaches for analyzing the data and show that combining visual and contextual information can improve prediction of pedestrian intention at the point of crossing by at least 20%.

Citations

Journal Article•DOI•

Autonomous Vehicles That Interact With Pedestrians: A Survey of Theory and Practice

Amir Rasouli¹, John K. Tsotsos¹•Institutions (1)

York University¹

01 Mar 2020-IEEE Transactions on Intelligent Transportation Systems

TL;DR: This survey identifies the factors that influence pedestrian behavior, reviews practical approaches to the interaction problem, including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention, and proposes future research directions.

Abstract: One of the major challenges that autonomous cars are facing today is driving in urban environments. To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions. Such interactions are essential between vehicles and pedestrians, the most vulnerable road users. Understanding pedestrian behavior, however, is not intuitive and depends on various factors, such as demographics of the pedestrians, traffic dynamics, environmental conditions, and so on. In this paper, we identify these factors by surveying pedestrian behavior studies, both the classical works on pedestrian–driver interaction and the modern ones that involve autonomous vehicles. To this end, we will discuss various methods of studying pedestrian behavior and analyze how the factors identified in the literature are interrelated. We will also review the practical applications aimed at solving the interaction problem, including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention. Based on our findings, we will discuss the open problems and propose future research directions.

391 citations

Posted Content•

Autonomous Vehicles that Interact with Pedestrians: A Survey of Theory and Practice

Amir Rasouli¹, John K. Tsotsos¹•Institutions (1)

York University¹

30 May 2018-arXiv: Robotics

TL;DR: This paper surveys pedestrian behavior studies, both the classical works on pedestrian–driver interaction and the modern ones that involve autonomous vehicles, to discuss various methods of studying pedestrian behavior and analyze how the factors identified in the literature are interrelated.

Abstract: One of the major challenges that autonomous cars are facing today is driving in urban environments. To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions. Such interactions are essential between the vehicles and pedestrians as the most vulnerable road users. Understanding pedestrian behavior, however, is not intuitive and depends on various factors such as demographics of the pedestrians, traffic dynamics, environmental conditions, etc. In this paper, we identify these factors by surveying pedestrian behavior studies, both the classical works on pedestrian-driver interaction and the modern ones that involve autonomous vehicles. To this end, we will discuss various methods of studying pedestrian behavior, and analyze how the factors identified in the literature are interrelated. We will also review the practical applications aimed at solving the interaction problem including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention. Based on our findings, we will discuss the open problems and propose future research directions.

295 citations

Cites background or methods from "Are They Going to Cross? A Benchmar..."

...of 80% [142], 62% [14] for the probability of crossing, and...
[...]
...[14] use various contextual information such as characteristics of the road, the presence of traffic sig-...
[...]

Proceedings Article•DOI•

Toward Driving Scene Understanding: A Dataset for Learning Driver Behavior and Causal Reasoning

Vasili Ramanishka¹, Yi-Ting Chen², Teruhisa Misu², Kate Saenko¹•Institutions (2)

Boston University¹, Honda²

14 Dec 2018

TL;DR: The Honda Research Institute Driving Dataset (HDD) comprises 104 hours of real human driving in the San Francisco Bay Area, collected using an instrumented vehicle equipped with different sensors.

Abstract: Driving Scene understanding is a key ingredient for intelligent transportation systems. To achieve systems that can operate in a complex physical and social environment, they need to understand and learn how humans drive and interact with traffic scenes. We present the Honda Research Institute Driving Dataset (HDD), a challenging dataset to enable research on learning driver behavior in real-life environments. The dataset includes 104 hours of real human driving in the San Francisco Bay Area collected using an instrumented vehicle equipped with different sensors. We provide a detailed analysis of HDD with a comparison to other driving datasets. A novel annotation methodology is introduced to enable research on driver behavior understanding from untrimmed data sequences. As the first step, baseline algorithms for driver behavior detection are trained and tested to demonstrate the feasibility of the proposed task.

236 citations

Proceedings Article•DOI•

PIE: A Large-Scale Dataset and Models for Pedestrian Intention Estimation and Trajectory Prediction

Amir Rasouli¹, Iuliia Kotseruba¹, Toni Kunic¹, John K. Tsotsos¹•Institutions (1)

York University¹

01 Oct 2019

TL;DR: This work introduces a novel large-scale dataset for pedestrian intention estimation, proposes models for estimating pedestrian crossing intention and predicting future trajectories, and shows that combining pedestrian intention with observed motion improves trajectory prediction.

Abstract: Pedestrian behavior anticipation is a key challenge in the design of assistive and autonomous driving systems suitable for urban environments. An intelligent system should be able to understand the intentions or underlying motives of pedestrians and to predict their forthcoming actions. To date, only a few public datasets were proposed for the purpose of studying pedestrian behavior prediction in the context of intelligent driving. To this end, we propose a novel large-scale dataset designed for pedestrian intention estimation (PIE). We conducted a large-scale human experiment to establish human reference data for pedestrian intention in traffic scenes. We propose models for estimating pedestrian crossing intention and predicting their future trajectory. Our intention estimation model achieves 79% accuracy and our trajectory prediction algorithm outperforms state-of-the-art by 26% on the proposed dataset. We further show that combining pedestrian intention with observed motion improves trajectory prediction. The dataset and models are available at http://data.nvision2.eecs.yorku.ca/PIE_dataset/.

185 citations

Cites background from "Are They Going to Cross? A Benchmar..."

...A recently proposed dataset, JAAD [27], contains a large number of pedestrian samples with temporal correspondence, a subset of which are annotated with behavior information....
[...]
...The performance of all models is generally poorer on the JAAD dataset which can be partially attributed to the smaller number of samples, scales and shorter tracks all of which reduce the diversity of the dataset....
[...]
...Action (or behavior) prediction algorithms may take different forms such as generating future frames [20, 19, 24, 6], predicting the type of action [15, 21, 7], measuring confidence in the occurrence of an event [27, 37, 10], and forecasting the motion of objects [25, 40, 43, 1, 17, 5, 8]....
[...]
...Table 1 summarizes the properties of PIE and JAAD datasets....
[...]
...JAAD has bounding box annotations for all pedestrians, which makes it suitable for detection and tracking applications....
[...]

Proceedings Article•DOI•

A Literature Review on the Prediction of Pedestrian Behavior in Urban Scenarios

Daniela A. Ridel, Eike Rehder¹, Martin Lauer², Christoph Stiller², Denis F. Wolf•Institutions (2)

Daimler AG¹, Karlsruhe Institute of Technology²

01 Nov 2018

TL;DR: This paper explores the ways pedestrians' intention estimation has been studied, evaluated, and evolved, and addresses available solutions, state-of-the-art developments, and hurdles to be overcome towards reaching a solution that is closer to the human ability to predict and interpret such scenarios.

Abstract: The ability to anticipate pedestrian actions on streets is a safety issue for intelligent cars and has increasingly drawn the attention of the automotive industry. Estimating when pedestrians will cross streets has proved a challenging task, since they can move in many different directions, suddenly change motion, be occluded by a variety of obstacles and distracted while talking to other pedestrians or typing on a mobile phone. Moreover, their decisions can also be affected by several factors. This paper explores the ways pedestrians' intention estimation has been studied, evaluated, and evolved. It provides a literature review on pedestrian behavior prediction, addresses available solutions, state-of-the-art developments, and hurdles to be overcome towards reaching a solution that is closer to the human ability to predict and interpret such scenarios. Although many studies can precisely estimate pedestrians' positioning one second before they cross a street, most of them cannot precisely predict when they will stop at a curb.

133 citations

Additional excerpts

...Another available dataset for crosswalk behavior classification [76] pro-...
[...]

References

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: A large, deep convolutional neural network with five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art results on ImageNet classification.

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

73,978 citations

Proceedings Article•DOI•

ImageNet: A large-scale hierarchical image database

Jia Deng¹, Wei Dong¹, Richard Socher¹, Li-Jia Li¹, Kai Li¹, Li Fei-Fei¹•Institutions (1)

Princeton University¹

20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations

"Are They Going to Cross? A Benchmar..." refers methods in this paper

...For this purpose we use pre-trained AlexNet on two large image datasets, ImageNet [5] and places, and both datasets combined [44]....
[...]
...In each case we train a randomly initialized AlexNet end-to-end on cropped images of pedestrians from our dataset (with minor occlusions up to 25% allowed) and then try transfer learning by fine-tuning an AlexNet pre-trained on ImageNet [27]....
[...]

Proceedings Article•DOI•

Histograms of oriented gradients for human detection

Navneet Dalal¹, Bill Triggs¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

20 Jun 2005

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.

Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
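
As a rough illustration of the descriptor this abstract describes, the sketch below computes a magnitude-weighted histogram of unsigned gradient orientations for a single cell and L2-normalizes it. The function names `cell_histogram` and `l2_normalize`, the 8-pixel cell, and the 9 orientation bins are assumptions chosen for illustration and do not come from this listing. The sketch also simplifies the method: it omits bin interpolation and the normalization over overlapping descriptor blocks that the abstract identifies as important.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image, H rows x W columns


def cell_histogram(img: Image, top: int, left: int,
                   cell: int = 8, bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    height, width = len(img), len(img[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(top, min(top + cell, height)):
        for c in range(left, min(left + cell, width)):
            # Centered differences, clamped at the image border.
            gx = img[r][min(c + 1, width - 1)] - img[r][max(c - 1, 0)]
            gy = img[min(r + 1, height - 1)][c] - img[max(r - 1, 0)][c]
            magnitude = math.hypot(gx, gy)
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


def l2_normalize(vec: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a vector to (approximately) unit L2 norm; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]
```

A full descriptor would tile the detection window with such cells, group neighbouring cells into overlapping blocks, normalize each block, and concatenate the results before passing them to a linear SVM.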

31,952 citations

Journal Article•DOI•

Vision meets robotics: The KITTI dataset

Andreas Geiger¹, Philip Lenz², Christoph Stiller², Raquel Urtasun³•Institutions (3)

Max Planck Society¹, Karlsruhe Institute of Technology², Toyota Technological Institute at Chicago³

01 Sep 2013-The International Journal of Robotics Research

TL;DR: A novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research, using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras and a high-precision GPS/IMU inertial navigation system.

Abstract: We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10-100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.

7,153 citations

"Are They Going to Cross? A Benchmar..." refers background or methods in this paper

...5k x Caltech[8] 347k 250k x x x KITTI [13] 12k 80k x x MPD [16] 86....
[...]
...Compared to existing large-scale datasets such as KITTI [13] and Caltech pedestrian dataset [8], in addition to ground truth for all pedestrians in the scene and occlusion information, our dataset contains behavioral tags describing actions of pedestrians intending to cross....
[...]
...There are a number of large-scale datasets publicly available that can be potentially used for pedestrian behavior understanding [13, 8, 10]....
[...]
...Few exceptions, such as KITTI [13], also provide optical flow and stereo information for mapping and localization....
[...]

Book Chapter•DOI•

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He¹, Xiangyu Zhang², Shaoqing Ren³, Jian Sun¹•Institutions (3)

Microsoft¹, Xi'an Jiaotong University², University of Science and Technology³

06 Sep 2014

TL;DR: This work equips CNNs with a “spatial pyramid pooling” strategy that removes the fixed-size input requirement; the resulting network structure, called SPP-net, generates a fixed-length representation regardless of image size/scale.

Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.
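
To make the fixed-length property concrete, here is a minimal pure-Python sketch of spatial pyramid max-pooling over a stack of feature maps. The names `spp_channel` and `spp`, the pyramid levels `(1, 2, 4)`, and the bin-edge rounding are assumptions chosen for illustration; they are not taken from this listing and may differ from the configuration used in the paper.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]  # one channel, H rows x W columns


def spp_channel(fmap: FeatureMap, levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool one channel into an n x n grid of bins for each pyramid level n."""
    height, width = len(fmap), len(fmap[0])
    pooled: List[float] = []
    for n in levels:
        for i in range(n):
            # Floor the start and ceil the end so bins cover the map and are never empty.
            r0, r1 = (i * height) // n, ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0, c1 = (j * width) // n, ((j + 1) * width + n - 1) // n
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def spp(channels: List[FeatureMap], levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Concatenate the pooled bins of every channel into one vector."""
    out: List[float] = []
    for fmap in channels:
        out.extend(spp_channel(fmap, levels))
    return out
```

The output length is the number of channels times the sum of n² over the levels (21 bins per channel for the levels assumed here), independent of the spatial size of the input maps. This is what allows the fully-connected layers that follow to accept inputs of arbitrary size.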

3,945 citations

"Are They Going to Cross? A Benchmar..." refers methods in this paper

...Fine-tuning the FCN and SPP models is similar....
[...]
...The first is the Spatial Pyramid Pooling (SPP) [15] technique which allows the maxpooling of the features from the last convolutional layer (conv5) at different scales....
[...]
...Overall, the performance of the SPP models was even inferior comparing to those of single scale models (with exception of stop sign detection)....
[...]
...Such a multi-scale detection performance, however, was not achieved using the SPP models....
[...]
...It should also be noted that in the SPP models the fc6 layers were learned from scratch due to the change in the dimensionality of their inputs....
[...]