Proceedings ArticleDOI

Are They Going to Cross? A Benchmark Dataset and Baseline for Pedestrian Crosswalk Behavior

01 Oct 2017-pp 206-213
TL;DR: A novel dataset is introduced which, in addition to providing bounding box information for pedestrian detection, also includes behavioral and contextual annotations for the scenes, allowing visual and semantic information to be combined for a better understanding of pedestrians' intentions in various traffic scenarios.
Abstract: Designing autonomous vehicles suitable for urban environments remains an unresolved problem. One of the major dilemmas faced by autonomous cars is how to understand the intention of other road users and communicate with them. The existing datasets do not provide the necessary means for such higher level analysis of traffic scenes. With this in mind, we introduce a novel dataset which in addition to providing the bounding box information for pedestrian detection, also includes the behavioral and contextual annotations for the scenes. This allows combining visual and semantic information for better understanding of pedestrians' intentions in various traffic scenarios. We establish baseline approaches for analyzing the data and show that combining visual and contextual information can improve prediction of pedestrian intention at the point of crossing by at least 20%.


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors identify the factors that influence pedestrian behavior, one of the major challenges autonomous cars face when driving in urban environments, and propose future research directions, including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention.
Abstract: One of the major challenges that autonomous cars are facing today is driving in urban environments. To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions. Such interactions are essential between vehicles and pedestrians, the most vulnerable road users. Understanding pedestrian behavior, however, is not intuitive and depends on various factors, such as demographics of the pedestrians, traffic dynamics, environmental conditions, and so on. In this paper, we identify these factors by surveying pedestrian behavior studies, both the classical works on pedestrian–driver interaction and the modern ones that involve autonomous vehicles. To this end, we will discuss various methods of studying pedestrian behavior and analyze how the factors identified in the literature are interrelated. We will also review the practical applications aimed at solving the interaction problem, including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention. Based on our findings, we will discuss the open problems and propose future research directions.

391 citations

Posted Content
TL;DR: This paper surveys pedestrian behavior studies, both the classical works on pedestrian–driver interaction and the modern ones that involve autonomous vehicles, to discuss various methods of studying pedestrian behavior and analyze how the factors identified in the literature are interrelated.
Abstract: One of the major challenges that autonomous cars are facing today is driving in urban environments. To make it a reality, autonomous vehicles require the ability to communicate with other road users and understand their intentions. Such interactions are essential between the vehicles and pedestrians as the most vulnerable road users. Understanding pedestrian behavior, however, is not intuitive and depends on various factors such as demographics of the pedestrians, traffic dynamics, environmental conditions, etc. In this paper, we identify these factors by surveying pedestrian behavior studies, both the classical works on pedestrian-driver interaction and the modern ones that involve autonomous vehicles. To this end, we will discuss various methods of studying pedestrian behavior, and analyze how the factors identified in the literature are interrelated. We will also review the practical applications aimed at solving the interaction problem including design approaches for autonomous vehicles that communicate with pedestrians and visual perception and reasoning algorithms tailored to understanding pedestrian intention. Based on our findings, we will discuss the open problems and propose future research directions.

295 citations


Cites background or methods from "Are They Going to Cross? A Benchmar..."

  • ...of 80% [142], 62% [14] for the probability of crossing, and...

  • ...[14] use various contextual information such as characteristics of the road, the presence of traffic sig-...

Proceedings ArticleDOI
14 Dec 2018
TL;DR: The Honda Research Institute Driving Dataset (HDD) as discussed by the authors is a dataset of 104 hours of real human driving in the San Francisco Bay Area collected using an instrumented vehicle equipped with different sensors.
Abstract: Driving Scene understanding is a key ingredient for intelligent transportation systems. To achieve systems that can operate in a complex physical and social environment, they need to understand and learn how humans drive and interact with traffic scenes. We present the Honda Research Institute Driving Dataset (HDD), a challenging dataset to enable research on learning driver behavior in real-life environments. The dataset includes 104 hours of real human driving in the San Francisco Bay Area collected using an instrumented vehicle equipped with different sensors. We provide a detailed analysis of HDD with a comparison to other driving datasets. A novel annotation methodology is introduced to enable research on driver behavior understanding from untrimmed data sequences. As the first step, baseline algorithms for driver behavior detection are trained and tested to demonstrate the feasibility of the proposed task.

236 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: This work proposes a novel large-scale dataset designed for pedestrian intention estimation, together with models for estimating pedestrian crossing intention and predicting future trajectories, and shows that combining pedestrian intention with observed motion improves trajectory prediction.
Abstract: Pedestrian behavior anticipation is a key challenge in the design of assistive and autonomous driving systems suitable for urban environments. An intelligent system should be able to understand the intentions or underlying motives of pedestrians and to predict their forthcoming actions. To date, only a few public datasets were proposed for the purpose of studying pedestrian behavior prediction in the context of intelligent driving. To this end, we propose a novel large-scale dataset designed for pedestrian intention estimation (PIE). We conducted a large-scale human experiment to establish human reference data for pedestrian intention in traffic scenes. We propose models for estimating pedestrian crossing intention and predicting their future trajectory. Our intention estimation model achieves 79% accuracy and our trajectory prediction algorithm outperforms state-of-the-art by 26% on the proposed dataset. We further show that combining pedestrian intention with observed motion improves trajectory prediction. The dataset and models are available at http://data.nvision2.eecs.yorku.ca/PIE_dataset/.

185 citations


Cites background from "Are They Going to Cross? A Benchmar..."

  • ...A recently proposed dataset, JAAD [27], contains a large number of pedestrian samples with temporal correspondence, a subset of which are annotated with behavior information....

  • ...The performance of all models is generally poorer on the JAAD dataset which can be partially attributed to the smaller number of samples, scales and shorter tracks all of which reduce the diversity of the dataset....

  • ...Action (or behavior) prediction algorithms may take different forms such as generating future frames [20, 19, 24, 6], predicting the type of action [15, 21, 7], measuring confidence in the occurrence of an event [27, 37, 10], and forecasting the motion of objects [25, 40, 43, 1, 17, 5, 8]....

  • ...Table 1 summarizes the properties of PIE and JAAD datasets....

  • ...JAAD has bounding box annotations for all pedestrians, which makes it suitable for detection and tracking applications....

Proceedings ArticleDOI
01 Nov 2018
TL;DR: This paper explores how pedestrians' intention estimation has been studied and evaluated and how it has evolved, addressing available solutions, state-of-the-art developments, and hurdles to be overcome on the way to a solution that approaches the human ability to predict and interpret such scenarios.
Abstract: The ability to anticipate pedestrian actions on streets is a safety issue for intelligent cars and has increasingly drawn the attention of the automotive industry. Estimating when pedestrians will cross streets has proved a challenging task, since they can move in many different directions, suddenly change motion, be occluded by a variety of obstacles and distracted while talking to other pedestrians or typing on a mobile phone. Moreover, their decisions can also be affected by several factors. This paper explores the ways pedestrians' intention estimation has been studied, evaluated, and evolved. It provides a literature review on pedestrian behavior prediction, addresses available solutions, state-of-the-art developments, and hurdles to be overcome towards reaching a solution that is closer to the human ability to predict and interpret such scenarios. Although many studies can precisely estimate pedestrians' positioning one second before they cross a street, most of them cannot precisely predict when they will stop at a curb.

133 citations


Additional excerpts

  • ...Another available dataset for crosswalk behavior classification [76] pro-...

References
Proceedings Article
03 Dec 2012
TL;DR: A deep convolutional neural network with five convolutional layers, some followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax achieved state-of-the-art performance on ImageNet, as discussed by the authors.
Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
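As a sanity check on the architecture summarized above, the abstract's "60 million parameters" figure can be reproduced with a layer-by-layer count. This is an illustrative back-of-the-envelope sketch, not code from the paper; filter shapes follow the original AlexNet description, with the two-GPU grouping halving the input channels seen by conv2, conv4, and conv5:

```python
# Parameter count for the AlexNet architecture described above:
# 5 convolutional layers followed by 3 fully-connected layers.

def conv_params(n_filters, k, in_channels):
    """Weights (k*k*in_channels per filter) plus one bias per filter."""
    return n_filters * (k * k * in_channels + 1)

def fc_params(n_out, n_in):
    """Dense layer: weight matrix plus biases."""
    return n_out * n_in + n_out

layers = [
    conv_params(96, 11, 3),        # conv1
    conv_params(256, 5, 48),       # conv2 (grouped: sees 96/2 channels)
    conv_params(384, 3, 256),      # conv3
    conv_params(384, 3, 192),      # conv4 (grouped)
    conv_params(256, 3, 192),      # conv5 (grouped)
    fc_params(4096, 6 * 6 * 256),  # fc6 (on 6x6x256 pooled features)
    fc_params(4096, 4096),         # fc7
    fc_params(1000, 4096),         # fc8, feeding the 1000-way softmax
]

total = sum(layers)
print(f"{total:,} parameters")  # roughly 61 million
```

Note that nearly two thirds of the parameters sit in fc6 alone, which is why the abstract's dropout regularization targets the fully-connected layers.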

73,978 citations

Proceedings ArticleDOI
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Abstract: The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

49,639 citations


"Are They Going to Cross? A Benchmar..." refers methods in this paper

  • ...For this purpose we use pre-trained AlexNet on two large image datasets, ImageNet [5] and places, and both datasets combined [44]....

  • ...In each case we train a randomly initialized AlexNet end-to-end on cropped images of pedestrians from our dataset (with minor occlusions up to 25% allowed) and then try transfer learning by fine-tuning an AlexNet pre-trained on ImageNet [27]....

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
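The core computation behind a HOG descriptor, image gradients binned by orientation and weighted by magnitude within a cell, can be sketched in a few lines of plain Python. This toy version is illustrative only: it omits the overlapping block grouping and local contrast normalization that the abstract identifies as important for good results:

```python
import math

def cell_histogram(cell, n_bins=9):
    """cell: 2D list of grayscale values. Returns a gradient-magnitude-
    weighted orientation histogram over [0, 180) degrees (unsigned)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):           # interior pixels only
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # centered differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / (180.0 / n_bins)) % n_bins] += mag
    return hist

# A vertical step edge: all gradient energy is horizontal,
# so it lands in the first orientation bin.
edge = [[0, 0, 10, 10]] * 4
hist = cell_histogram(edge)
```

A full HOG descriptor concatenates such cell histograms over a dense grid and normalizes them within overlapping blocks before feeding a linear SVM.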

31,952 citations

Journal ArticleDOI
TL;DR: A novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research, using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras and a high-precision GPS/IMU inertial navigation system.
Abstract: We present a novel dataset captured from a VW station wagon for use in mobile robotics and autonomous driving research. In total, we recorded 6 hours of traffic scenarios at 10-100 Hz using a variety of sensor modalities such as high-resolution color and grayscale stereo cameras, a Velodyne 3D laser scanner and a high-precision GPS/IMU inertial navigation system. The scenarios are diverse, capturing real-world traffic situations, and range from freeways over rural areas to inner-city scenes with many static and dynamic objects. Our data is calibrated, synchronized and timestamped, and we provide the rectified and raw image sequences. Our dataset also contains object labels in the form of 3D tracklets, and we provide online benchmarks for stereo, optical flow, object detection and other tasks. This paper describes our recording platform, the data format and the utilities that we provide.

7,153 citations


"Are They Going to Cross? A Benchmar..." refers background or methods in this paper

  • ...(flattened excerpt of a dataset-comparison table listing annotation counts for Caltech [8] (347k, 250k), KITTI [13] (12k, 80k), and MPD [16])...

  • ...Compared to existing large-scale datasets such as KITTI [13] and Caltech pedestrian dataset [8], in addition to ground truth for all pedestrians in the scene and occlusion information, our dataset contains behavioral tags describing actions of pedestrians intending to cross....

  • ...There are a number of large-scale datasets publicly available that can be potentially used for pedestrian behavior understanding [13, 8, 10]....

  • ...Few exceptions, such as KITTI [13], also provide optical flow and stereo information for mapping and localization....

Book ChapterDOI
06 Sep 2014
TL;DR: This work equips the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the fixed-size input requirement, and develops a new network structure, called SPP-net, which can generate a fixed-length representation regardless of image size/scale.
Abstract: Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.
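The pooling idea itself is compact enough to sketch: max-pool one feature-map channel over pyramid grids of 1x1, 2x2, and 4x4 and concatenate the results, giving a fixed 21-value vector per channel whatever the input size. This is an illustrative single-channel sketch of the technique, not the SPP-net implementation:

```python
def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling over one 2D channel of arbitrary size.
    Returns a fixed-length vector: sum(n*n for n in levels) values."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for n in levels:                      # an n x n grid at this level
        for i in range(n):
            for j in range(n):
                # Bin boundaries; the max() guard keeps bins non-empty.
                y0, y1 = i * h // n, max((i + 1) * h // n, i * h // n + 1)
                x0, x1 = j * w // n, max((j + 1) * w // n, j * w // n + 1)
                out.append(max(feature_map[y][x]
                               for y in range(y0, y1)
                               for x in range(x0, x1)))
    return out

# Two differently sized "feature maps" yield equal-length vectors,
# which is what lets SPP-net accept arbitrarily sized inputs.
small = [[r * 10 + c for c in range(5)] for r in range(5)]
large = [[r * 13 + c for c in range(13)] for r in range(9)]
v1, v2 = spp(small), spp(large)
```

In a real network this pooling sits between the last convolutional layer and the first fully-connected layer, whose input dimensionality it fixes.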

3,945 citations


"Are They Going to Cross? A Benchmar..." refers methods in this paper

  • ...Fine-tuning the FCN and SPP models is similar....

  • ...The first is the Spatial Pyramid Pooling (SPP) [15] technique which allows the maxpooling of the features from the last convolutional layer (conv5) at different scales....

  • ...Overall, the performance of the SPP models was inferior even compared to that of the single-scale models (with the exception of stop sign detection)....

  • ...Such a multi-scale detection performance, however, was not achieved using the SPP models....

  • ...It should also be noted that in the SPP models the fc6 layers were learned from scratch due to the change in the dimensionality of their inputs....
