Dissertation

Multiple Sensor Fusion for Detection, Classification and Tracking of Moving Objects in Driving Environments

TL;DR: This dissertation addresses sensor data association and sensor fusion for object detection, classification, and tracking at different levels within the DATMO stage, arguing that a richer list of tracked objects can improve later stages of an ADAS and enhance its final results.

THESIS
Submitted to obtain the degree of
DOCTOR OF THE UNIVERSITÉ DE GRENOBLE
Speciality: Mathematics, Computer Science (Robotics)
Ministerial order:
Presented by
Ricardo Omar CHAVEZ GARCIA
Thesis supervised by Olivier AYCARD
prepared at the Laboratoire d'Informatique de Grenoble
and the Mathématiques, Sciences et Technologies de l'Information, Informatique doctoral school
Multiple Sensor Fusion for Detection, Classification and Tracking of Moving Objects in Driving Environments
Thesis publicly defended on 25 September 2014,
before the jury composed of:
M. Michel DEVY
LAAS-CNRS, Reviewer
M. François CHARPILLET
INRIA Nancy, Reviewer
Mme Michelle ROMBAUT
Gipsa-Lab, President, Examiner
M. Yassine RUICHEK
Université de Technologie de Belfort-Montbéliard, Examiner
M. Olivier AYCARD
Université de Grenoble 1, Thesis supervisor

Multiple Sensor Fusion for Detection, Classification and Tracking of Moving Objects in Driving Environments

To:
my family
and my friends.

Abstract
Advanced driver assistance systems (ADAS) help drivers to perform complex driving tasks and to avoid or mitigate dangerous situations. The vehicle senses the external world using sensors and then builds and updates an internal model of the environment configuration. Vehicle perception consists of establishing the spatial and temporal relationships between the vehicle and the static and moving obstacles in the environment. Vehicle perception is composed of two main tasks: simultaneous localization and mapping (SLAM), which deals with modelling the static parts of the environment, and detection and tracking of moving objects (DATMO), which is responsible for modelling the moving parts. The perception output is used to reason about and decide which driving actions are best for specific driving situations. To reason and act correctly, the system has to model the surrounding environment accurately. The accurate detection and classification of moving objects is a critical aspect of a moving object tracking system; therefore, multiple sensors are typically part of an intelligent vehicle system.
Multiple sensor fusion has long been a topic of research, motivated by the need to combine information from different views of the environment to obtain a more accurate model. This is achieved by combining redundant and complementary measurements of the environment. Fusion can be performed at different levels inside the perception task.
Classification of moving objects is needed to determine the possible behaviour of the objects surrounding the vehicle, and it is usually performed at the tracking level. Knowledge about the class of moving objects at the detection level can help to improve their tracking, to reason about their behaviour, and to decide what to do according to their nature. Most current perception solutions consider classification information only as aggregate information for the final perception output. Moreover, the management of incomplete information is an important issue in these perception systems. Incomplete information can originate from sensor-related causes, such as calibration issues and hardware malfunctions, or from scene perturbations, like occlusions, adverse weather and object shifting. It is important to manage these situations by taking the degree of imprecision and uncertainty into account in the perception process.
The main contributions of this dissertation focus on the DATMO stage of the perception problem. Specifically, we believe that, by including the object's class as a key element of the object's representation and by managing the uncertainty from multiple sensor detections, we can improve the results of the perception task, i.e., produce a more reliable list of moving objects of interest represented by their dynamic state and appearance information. Therefore, we address the problems of sensor data association and sensor fusion for object detection, classification, and tracking at different levels within the DATMO stage. We believe that a richer list of tracked objects can improve future stages of an ADAS and enhance its final results.
Although we focus on a set of three main sensors (radar, lidar, and camera), we propose a modifiable architecture that can accommodate other types or numbers of sensors. First, we define a composite object representation that includes class information as a part of the object state, from the early stages of processing to the final output of the perception task. Second, we propose, implement, and compare two different perception architectures to solve the DATMO problem, which differ in the level at which object association, fusion, and classification information is included and performed. Our data fusion approaches are based on the evidential framework, which is used to manage and include the uncertainty from sensor detections and object classifications. Third, we propose an evidential data association approach to establish a relationship between two sources of evidence from object detections. We apply this approach at the tracking level to fuse information from two track representations, and at the detection level to find the relations between observations and to fuse their representations. We observe how the class information improves the final result of the DATMO component. Fourth, we integrate the proposed fusion approaches as part of a real-time vehicle application; this integration has been performed in a real vehicle demonstrator from the interactIVe European project.
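As a concrete illustration of the evidential framework mentioned above, the following minimal Python sketch applies Dempster's rule of combination to two class mass functions, one notionally derived from lidar shape analysis and one from camera appearance. The frame of discernment mirrors the object classes used in this work, but the mass values and function names are hypothetical, chosen only for illustration.

# Frame of discernment: the moving-object classes considered in this dissertation.
FRAME = frozenset({"pedestrian", "bike", "car", "truck"})

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Dempster's normalization: rescale by the non-conflicting mass.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Hypothetical evidence: lidar shape suggests a large vehicle, camera suggests a car.
m_lidar = {frozenset({"car", "truck"}): 0.6, FRAME: 0.4}  # 0.4 expresses ignorance
m_camera = {frozenset({"car"}): 0.7, frozenset({"bike"}): 0.1, FRAME: 0.2}

fused = dempster_combine(m_lidar, m_camera)
for subset, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(subset), round(mass, 3))

The appeal of the framework for this problem is visible even in this toy example: the lidar source can commit mass to the ambiguous set {car, truck} instead of being forced to choose a single class, and the conflict term quantifies the disagreement between the sensors before normalization.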
Finally, we analysed and experimentally evaluated the performance of the proposed methods. We compared our evidential fusion approaches against each other and against a state-of-the-art method, using real data from different driving scenarios and focusing on the detection, classification and tracking of different moving objects: pedestrian, bike, car and truck. We obtained promising results from our proposed approaches and empirically showed how our composite representation can improve the final result when included at different stages of the perception task.
Key Words: Multiple sensor fusion, intelligent vehicles, perception, DATMO, classification, multiple object detection & tracking, Dempster-Shafer theory

Citations
Journal ArticleDOI
TL;DR: The main considerations for the onboard multi-sensor configuration of intelligent ground vehicles in off-road environments are summarized, providing users with a guideline for selecting sensors based on their performance requirements and application environments.
Abstract: With the development of sensor fusion technologies, there has been a lot of research on intelligent ground vehicles, for which obstacle detection is one of the key aspects of driving. Obstacle detection is a complicated task that involves the diversity of obstacles, sensor characteristics, and environmental conditions. While on-road driver assistance systems and autonomous driving systems have been well researched, methods developed for the structured roads of city scenes may fail in an off-road environment because of its uncertainty and diversity. A single type of sensor can hardly satisfy the needs of obstacle detection because of its sensing limitations in range, signal features, and working conditions, and this motivates researchers and engineers to develop multi-sensor fusion and system integration methodology. This survey aims at summarizing the main considerations for the onboard multi-sensor configuration of intelligent ground vehicles in off-road environments and providing users with a guideline for selecting sensors based on their performance requirements and application environments. State-of-the-art multi-sensor fusion methods and system prototypes are reviewed and associated with the corresponding heterogeneous sensor configurations. Finally, emerging technologies and challenges are discussed for future study.

124 citations

Journal ArticleDOI
12 Apr 2020 - Sensors
TL;DR: This survey provides sensor information to researchers who intend to accomplish the task of motion control of a robot, and details the use of LiDAR and cameras to accomplish robot navigation.
Abstract: This paper focuses on data fusion, which is fundamental to one of the most important modules in any autonomous system: perception. Over the past decade, there has been a surge in the usage of smart/autonomous mobility systems. Such systems can be used in various areas of life, like safe mobility for the disabled, senior citizens, and so on, and are dependent on accurate sensor information in order to function optimally. This information may be from a single sensor or a suite of sensors with the same or different modalities. We review various types of sensors, their data, and the need for fusing the data to output the best data for the task at hand, which in this case is autonomous navigation. In order to obtain such accurate data, we need optimal technology to read the sensor data, process the data, eliminate or at least reduce the noise, and then use the data for the required tasks. We present a survey of current data processing techniques that implement data fusion using different sensors: LiDAR, which uses light-scan technology; stereo/depth cameras; and Red Green Blue monocular (RGB) and Time-of-Flight (TOF) cameras, which use optical technology. We review the efficiency of using fused data from multiple sensors, rather than a single sensor, in autonomous navigation tasks like mapping, obstacle detection and avoidance, and localization. This survey will provide sensor information to researchers who intend to accomplish the task of motion control of a robot and detail the use of LiDAR and cameras to accomplish robot navigation.

38 citations


Cites background or methods from "Multiple Sensor Fusion for Detectio..."

  • ...The data is then filtered and an appropriate fusion technology implemented this is fed into localization and mapping techniques like SLAM; the same data can be used to identify static or moving objects in the environment and this data can be used to classify the objects, wherein classification information is used to finalize information in creating a model of the environment which in turn can be fed into the control algorithm [27]....


  • ...Another reason could be the system failure risk due to the failure of that single sensor [21,27,40] and hence one should introduce a level of redundancy....


01 Jan 2002
TL;DR: This paper presents a new method to integrate SLAM and DTMO, solving both problems simultaneously for both indoor and outdoor applications, and shows that the two can be complementary to one another.
Abstract: Both simultaneous localization and mapping (SLAM) and detection and tracking of moving objects (DTMO) play key roles in robotics and automation. For certain constrained environments, SLAM and DTMO are becoming solved problems, but for robots working outdoors and at high speeds they are still incomplete. This paper presents a method to integrate SLAM and DTMO to solve both problems simultaneously for both indoor and outdoor applications. The results of experiments carried out with vehicles at speeds of up to 45 mph in crowded urban and suburban areas verify the described work.

26 citations

Journal ArticleDOI
TL;DR: A model that utilizes prior knowledge of robot motion is proposed, which not only meets high-precision requirements but is also robust to the robot's operating environment.

14 citations

Proceedings ArticleDOI
28 Mar 2017
TL;DR: Preliminary results are presented for detecting various objects from camera images alone using a deep learning network, as a first step toward multi-sensor fusion algorithm development.
Abstract: Accuracy in detecting a moving object is critical to autonomous driving and advanced driver assistance systems (ADAS). By including object classification from multiple sensor detections, the model of the object or environment can be identified more accurately. The critical parameters involved in improving the accuracy are the size and the speed of the moving object. All sensor data are to be used in defining a composite object representation so that it can be used for the class information in the core object's description. This composite data can then be used by a deep learning network for complete perception fusion in order to solve the detection and tracking of moving objects problem. Camera image data from subsequent frames along the time axis, in conjunction with the speed and size of the object, will further contribute to developing better recognition algorithms. In this paper, we present preliminary results using only camera images for detecting various objects with a deep learning network, as a first step toward multi-sensor fusion algorithm development. The simulation experiments based on camera images show encouraging results, where the proposed deep learning network based detection algorithm was able to detect various objects with a certain degree of confidence. A laboratory experimental setup is being commissioned where three different types of sensors, a digital camera with 8-megapixel resolution, a LIDAR with a 40 m range, and ultrasonic distance transducer sensors, will be used for multi-sensor fusion to identify the object in real time.

11 citations


Cites background or methods from "Multiple Sensor Fusion for Detectio..."

  • ...Kalman Filter (KF): KF features make it suited to deal with multi-sensor estimation and data fusion problems [11]....


  • ...Multi-sensor data fusion is the process of combining several observations from different sensor inputs to provide a more complete, robust and precise representation of the environment of interest [11]....


  • ...Evidence Theory (ET): The advantage of ET is its ability to represent incomplete evidence, total ignorance and the lack of a need for a priori probabilities [11]....


  • ...Monte Carlo (MC) Methods: MC methods are well suited for problems where state transition models and observation models are highly non-linear [11]....


  • ...information fusion are [11] complexity (need large number of probabilities), inconsistency (difficult to specify consistent set of beliefs in terms of probability) and model precision (precise probabilities about almost unknown events)....

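The Kalman filter mentioned in the excerpts above is compact enough to state directly. The following is a minimal sketch of the linear predict/update step in Python with NumPy, for a constant-velocity track observed by a position-only sensor; the matrices and numeric values are illustrative assumptions, not parameters taken from the dissertation.

import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate state mean x and covariance P through motion model F with noise Q."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, H, R):
    """Correct the prediction with measurement z (measurement model H, noise R)."""
    y = z - H @ x                   # innovation
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    return x + K @ y, (np.eye(len(x)) - K @ H) @ P

# Illustrative constant-velocity model: state [px, py, vx, vy], position-only sensor.
dt = 0.1
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]], float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
Q, R = 0.01 * np.eye(4), 0.25 * np.eye(2)

x, P = np.zeros(4), np.eye(4)
x, P = kf_predict(x, P, F, Q)
x, P = kf_update(x, P, z=np.array([1.0, 0.5]), H=H, R=R)
print(np.round(x, 3))

Fusing a further sensor at this level simply means calling kf_update again with that sensor's own H and R, which is one reason the excerpt calls the KF well suited to multi-sensor estimation and data fusion.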

References
Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
Abstract: We study the question of feature sets for robust visual object recognition; adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

31,952 citations


"Multiple Sensor Fusion for Detectio..." refers background or methods in this paper


  • ...Each element of a vector is a histogram of gradient orientations (Dalal and Triggs, 2005)....


  • ...Dalal and Triggs (2005) present a human classification scheme that uses SIFT-inspired features, called histograms of oriented gradients (HOG), and a linear SVM as a learning method. An HOG feature divides the region into k orientation bins, defines four cells that divide the rectangular feature, applies a Gaussian mask to the magnitude values in order to weight the center pixels, and interpolates the pixels with respect to pixel location within a block. The resulting feature is a vector of dimension 36 containing the summed magnitudes of the pixels in each cell, divided into 9 (the value of k) bins. These features have been extensively exploited in the literature (Dollár et al., 2012). Recently, Qiang et al. (2006) and Chavez-Garcia et al. (2013) use HOG as a weak rule for an AdaBoost classifier, achieving the same detection performance but with less computation time. Maji et al. (2008) and Wu and Nevatia (2007) proposed a feature based on segments of lines or curves, and compared it with HOG using AdaBoost and SVM learning algorithms....


  • ...We based our visual representation approach on the work of Dalal and Triggs (2005) on histograms of oriented gradients (HOG) which has recently become a stateof-the-art feature in the computer vision domain for object detection tasks....

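The 36-dimensional block descriptor described in the excerpt above (four 8x8-pixel cells, each contributing a 9-bin orientation histogram) can be reproduced with scikit-image's hog function. A minimal sketch follows; the synthetic random window merely stands in for a real detection candidate.

import numpy as np
from skimage.feature import hog

# Synthetic 128x64 grayscale "pedestrian window" (random content, illustration only).
rng = np.random.default_rng(0)
window = rng.random((128, 64))

# 9 orientation bins, 8x8-pixel cells, 2x2 cells per block -> 36 values per block,
# matching the four-cell, 9-bin descriptor quoted above.
features = hog(
    window,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(features.shape)  # (3780,) = 105 overlapping blocks x 36 values

The resulting vector of concatenated block descriptors is what the cited detector feeds to a linear SVM.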

Proceedings ArticleDOI
01 Dec 2001
TL;DR: A machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates and the introduction of a new image representation called the "integral image" which allows the features used by the detector to be computed very quickly.
Abstract: This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. The third contribution is a method for combining increasingly more complex classifiers in a "cascade" which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

18,620 citations
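The "integral image" introduced above is simple to state in code: one cumulative-sum pass over the image lets the sum of any rectangle be read off in four array lookups, which is what makes Haar-like feature evaluation so fast. A minimal NumPy sketch, with illustrative array contents:

import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, zero-padded so ii[y, x] = img[:y, :x].sum()."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] in O(1), using four lookups."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
assert box_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
print(box_sum(ii, 0, 0, 4, 4), img.sum())  # both print 120.0

Each Haar-like feature in the cascade is then just a signed combination of a few such box sums.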

Book ChapterDOI
01 Jan 2001
TL;DR: In this paper, the classical filtering and prediction problem is re-examined using the Bode-Shannon representation of random processes and the "state-transition" method of analysis of dynamic systems.
Abstract: The classical filtering and prediction problem is re-examined using the Bode-Shannon representation of random processes and the "state-transition" method of analysis of dynamic systems. New results are: (1) The formulation and methods of solution of the problem apply without modification to stationary and nonstationary statistics and to growing-memory and infinite-memory filters. (2) A nonlinear difference (or differential) equation is derived for the covariance matrix of the optimal estimation error. From the solution of this equation the coefficients of the difference (or differential) equation of the optimal linear filter are obtained without further calculations. (3) The filtering problem is shown to be the dual of the noise-free regulator problem. The new method developed here is applied to two well-known problems, confirming and extending earlier results. The discussion is largely self-contained and proceeds from first principles; basic concepts of the theory of random processes are reviewed in the Appendix.

15,391 citations


"Multiple Sensor Fusion for Detectio..." refers background or methods in this paper

  • ...Moreover, we do not need a data association process to relate the moving objects from lidar and camera because the camera uses ROI from lidar processing. Afterwards, the fused mass distribution is considered as the reference distribution and therefore combined with the radar mass assignment mr. The association between lidar and radar objects is done using a gating approach between tracks based on the covariance matrices of the tracks from both sensors, this approach is based on the association techniques proposed by Bar-Shalom and Tse (1975) and Baig (2012). Also we include the idea of associating tracks that have parallel trajectories to perform track confirmation....


  • ...Usually, the fixed size of a pedestrian is modified during ROI generation using size factors. This idea has two main drawbacks: the number of candidates can be very large, which makes it difficult to fulfil real-time requirements; and many irrelevant regions are passed to the next module, which increases the potential number of false positives. As a result, other approaches are used to perform explicit segmentation based on camera image, road restriction, or complementary sensor measurements. The most robust techniques to generate ROIs using camera data are biologically inspired. Milch and Behrens (2001) select ROIs according to color, intensity and gradient orientation of pixels....


  • ...Regarding stereo-based pose estimation, Labayrade et al. (2007) introduced v-disparity space, which consists of accumulating stereo disparity along the image y-axis in order to compute the slope of the road and to point out the existence of vertical objects when the accumulated disparity of an image row is very different from its neighbors....


  • ...Usually, the fixed size of a pedestrian is modified during ROI generation using size factors. This idea has two main drawbacks: the number of candidates can be very large, which makes it difficult to fulfil real-time requirements; and many irrelevant regions are passed to the next module, which increases the potential number of false positives. As a result, other approaches are used to perform explicit segmentation based on camera image, road restriction, or complementary sensor measurements. The most robust techniques to generate ROIs using camera data are biologically inspired. Milch and Behrens (2001) select ROIs according to color, intensity and gradient orientation of pixels. Dollár et al. (2012) review in detail current state-of-the-art intensity-based hypothesis generation for pedestrian detection. Some of these methods involve the use of learning techniques to discover threshold values for intensity segmentation. Optical flow has been used for foreground segmentation, specially in the general context of moving obstacle detection. Franke, U. and Heinrich (2002) propose to merge stereo processing, which extracts depth information without time correlation, and motion analysis, which is able to detect small gray value changes in order to permit early detection of moving objects, e....


  • ...Regarding stereo-based pose estimation, Labayrade et al. (2007) introduced v-disparity space, which consists of accumulating stereo disparity along the image y-axis in order to compute the slope of the road and to point out the existence of vertical objects when the accumulated disparity of an image row is very different from its neighbors. Andriluka et al. (2008) proposed fitting 3D road data points to a plane, whereas Singh et al....

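The first excerpt above mentions gating lidar and radar tracks using their covariance matrices, in the spirit of the association techniques of Bar-Shalom and Tse (1975). A standard way to realize such a gate is a Mahalanobis-distance test against a chi-square threshold; the sketch below is a generic illustration with made-up values, not the dissertation's actual implementation.

import numpy as np
from scipy.stats import chi2

def gate(x_a, P_a, x_b, P_b, prob=0.99):
    """Accept a track pair if their Mahalanobis distance falls inside the gate."""
    d = x_a - x_b
    S = P_a + P_b                   # combined uncertainty of the two estimates
    d2 = d @ np.linalg.solve(S, d)  # squared Mahalanobis distance
    return d2 <= chi2.ppf(prob, df=len(d)), d2

# Illustrative 2D position estimates of the same object from lidar and radar.
ok, d2 = gate(np.array([10.0, 4.0]), 0.5 * np.eye(2),
              np.array([10.6, 3.7]), 0.8 * np.eye(2))
print(ok, round(float(d2), 2))  # pairs passing the gate become fusion candidates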

Book
01 Jan 1976
TL;DR: This book develops an alternative to the additive set functions and the rule of conditioning of the Bayesian theory: set functions that need only be what Choquet called "monotone of order of infinity." and Dempster's rule for combining such set functions.
Abstract: Both in science and in practical affairs we reason by combining facts only inconclusively supported by evidence. Building on an abstract understanding of this process of combination, this book constructs a new theory of epistemic probability. The theory draws on the work of A. P. Dempster but diverges from Dempster's viewpoint by identifying his "lower probabilities" as epistemic probabilities and taking his rule for combining "upper and lower probabilities" as fundamental. The book opens with a critique of the well-known Bayesian theory of epistemic probability. It then proceeds to develop an alternative to the additive set functions and the rule of conditioning of the Bayesian theory: set functions that need only be what Choquet called "monotone of order of infinity," and Dempster's rule for combining such set functions. This rule, together with the idea of "weights of evidence," leads to both an extensive new theory and a better understanding of the Bayesian theory. The book concludes with a brief treatment of statistical inference and a discussion of the limitations of epistemic probability. Appendices contain mathematical proofs, which are relatively elementary and seldom depend on mathematics more advanced than the binomial theorem.

14,565 citations
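To connect this reference to the fusion machinery used in the dissertation, the sketch below derives the belief and plausibility of a hypothesis from a single mass function, complementing the Dempster's-rule example given earlier in the abstract section. The frame and mass values are hypothetical.

# Belief and plausibility for a mass function over a small frame of discernment.
FRAME = frozenset({"car", "truck"})
m = {frozenset({"car"}): 0.5, FRAME: 0.5}  # half committed to "car", half ignorance

def belief(A, m):
    """Bel(A): total mass of subsets fully contained in A (guaranteed support)."""
    return sum(v for s, v in m.items() if s <= A)

def plausibility(A, m):
    """Pl(A): total mass of subsets intersecting A (support not ruled out)."""
    return sum(v for s, v in m.items() if s & A)

A = frozenset({"car"})
print(belief(A, m), plausibility(A, m))  # 0.5 1.0

The interval [Bel(A), Pl(A)] is what lets this theory represent partial ignorance and incomplete evidence, the ability highlighted in the Evidence Theory excerpt earlier on this page.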

Journal ArticleDOI
TL;DR: In this paper, a face detection framework is described that is capable of processing images extremely rapidly while achieving high detection rates, running at 15 frames per second on a conventional desktop.
Abstract: This paper describes a face detection framework that is capable of processing images extremely rapidly while achieving high detection rates. There are three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a simple and efficient classifier which is built using the AdaBoost learning algorithm (Freund and Schapire, 1995) to select a small number of critical visual features from a very large set of potential features. The third contribution is a method for combining classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising face-like regions. A set of experiments in the domain of face detection is presented. The system yields face detection performance comparable to the best previous systems (Sung and Poggio, 1998; Rowley et al., 1998; Schneiderman and Kanade, 2000; Roth et al., 2000). Implemented on a conventional desktop, face detection proceeds at 15 frames per second.

13,037 citations