What is the purpose of the TJP?

The development of the TJP is based on three simple use cases: i) vehicle positioning on the road, ii) vehicle control in traffic jam, and iii) emergency braking.

What happens when a Raspberry PI crashes?

When one Raspberry PI crashes, the watchdog triggers the switch to the backup that takes over the processing of sensor data and the computing of the commands.

What is the impact of a crash on the safety of the TJP?

The impact of such problems on the safety of the TJP is classified ASIL D or ASIL C according to RENAULT experts, combining Frequency and Gravity.

What is the ideal executive support for a ddas?

As shown in previous work [6], this ideal executive support should exhibit the following features at runtime: i) control over component’s life cycle (add, remove, start, stop), ii) control over interactions for creating or removing bindings.

What is the serious problem in the TJP?

The result can be summarized as follows:• the crash of a computer running the TJP (a Raspberry PI in their mockup) leads to a loss of the service; the solution was based on a PBR replication strategy;• erroneous data delivered by the virtual sensor IMU (Inertial Measurement Unit) used to measure the speed of the vehicle was solved using TR and by computing an average value on a sliding window of values;• erroneous information delivered by virtual laser sensors was solved by triplication and voting.

What is the main feature of ROS?

Although it is not a core feature of ROS at present, dynamic binding was possible but ROS does not provide a specific API to manage such connection between components.

(Open Access) Towards Adaptive Fault Tolerance on ROS for Advanced Driver Assistance Systems (2017) | M. Amy

Q: What contributions have the authors mentioned in the paper "Towards adaptive fault tolerance on ros for advanced driver assistance systems" ?

This is the question the authors tackle in their work with a major car manufacturer. This paper is a progress report. The authors summarize their approach involving AFT ( Adaptive Fault Tolerance ) implemented on ROS ( Robot Operating System ), describe the simulation platform they have developed to experiment and validate over-the-air updates of ADAS and AFT, and finally draw some lessons learnt and perspectives.

Q: What is the effect of the remapping?

Since Client and Server must be relaunched for the remapping to take effect, the insertion is done off-line, i.e. the binding between nodes is static.

HAL Id: hal-01707514

https://hal.archives-ouvertes.fr/hal-01707514

Submitted on 12 Feb 2018

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of sci-

entic research documents, whether they are pub-

lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diusion de documents

scientiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

Towards Adaptive Fault Tolerance on ROS for Advanced

Driver Assistance Systems

Matthieu Amy, Jean-Charles Fabre, Michaël Lauer

To cite this version:

Matthieu Amy, Jean-Charles Fabre, Michaël Lauer. Towards Adaptive Fault Tolerance on ROS

for Advanced Driver Assistance Systems. 2017 47th Annual IEEE/IFIP International Conference

on Dependable Systems and Networks Workshop (DSN-W), Jun 2017, Denver, United States. 7p.,

�10.1109/DSN-W.2017.42�. �hal-01707514�

Towards Adaptive Fault Tolerance on ROS for

Advanced Driver Assistance Systems

M. Amy

, J.-C.Fabre

, M. Lauer

CNRS-LAAS, Ave du Colonel Roche, F-31400 Toulouse, France

Univ de Toulouse,

INP,

UPS, LAAS, F-31400 Toulouse, France

Technocentre RENAULT, F-78280 Guyancourt, France

Abstract— The use of over-the-air updates has attracted very

much interest these last few years with the software-intensive

development of embedded systems in the car industry. The

development of autonomous driving and ADAS (Advanced Driver

Assistance Systems) renders over-the-air updates mandatory, for

both user satisfaction and economic reasons. How to make sure

that remote updates of critical ADAS do not have an impact on

safety? This is the question we tackle in our work with a major

car manufacturer. This paper is a progress report. We

summarize our approach involving AFT (Adaptive Fault

Tolerance) implemented on ROS (Robot Operating System),

describe the simulation platform we have developed to

experiment and validate over-the-air updates of ADAS and AFT,

and finally draw some lessons learnt and perspectives.

I. INTRODUCTION

Automotive embedded systems are expected to evolve

during their service life, in order to cope with changes of

various nature due to maintenance activities or additional

features requested by users. For many reasons including

economic reasons, over-the-air updates, e.g. additional features

installed remotely into cars, are of prime interest for the car

manufacturers. This capability has been demonstrated by TELSA

and is currently one of the motivations for Adaptive AUTOSAR.

To this aim, the first challenge is to have a runtime support

enabling dynamic software updates to be carried out. ROS is a

possible candidate. ROS is a middleware for implementing

distributed applications that is used in many applications, from

robots (e.g. Robonaut developed by NASA – National

Aeronautics and Space Administration) to autonomous

vehicles (e.g. the Crusher military off-road ground autonomous

vehicle developed by NREC – National Robotics Engineering

Center). A second challenge is related to the side effect of a

functional update on dependability. It is of course mandatory to

adjust the fault tolerance mechanisms of the updated

application to maintain its dependability properties. This

requires separation of concerns, isolation of both application

code and fault tolerance code into error confinement areas, and

dynamic binding facilities between runtime components. We

have proposed a framework and developed several

conventional fault tolerance mechanisms on ROS to analyze to

what extend they can be easily updated.

Our objective is to validate the approach with critical

Advanced Driver Assistance Systems (ADAS). We are

developing a simple Traffic Jam Pilot system (TJP) able to

drive a car autonomously in a traffic jam. Our experimental

platform is composed of a redundant hardware system running

the TJP control system on ROS and controlling a virtual car

using a simulator, the GAZEBO Sim. Based on an FMECA

(Failure Modes Effects Critical Analysis), the TJP was

equipped with several fault tolerance mechanisms.

Our on-going work consists in applying the approach of

Adaptive Fault-Tolerance (AFT) we have investigated on ROS

to the update of ADAS. We aim at analyzing the effect of a

fault in the control system on the behavior of the vehicle. We

plan to analyze the impact of functional updates of the

dependability of the system, and implement adaptive fault

tolerance to make the system resilient.

In this paper, we summarize recent results and draw some

perspectives of our on-going work. In section II we describe

our approach for implementing adaptive fault tolerance on

ROS. In Section III we describe the experimental simulation

platform to experiment AFT on Advanced Driver Assistance

Systems. In Section IV we draw the initial lessons learnt from

this on-going work, and mention our future plans.

II. ADAPTIVE FAULT TOLERANCE WITH ROS

A. Basic concepts of AFT

Adaptive fault tolerance means that fault tolerance

mechanisms attached to applications need to be updated when

conditions change during the service life in the system. The

conditions are related to application characteristics; fault

tolerance requirements consecutive to a risk analysis and

FMECA leading to determine the criticality level of the

application and the required fault tolerance mechanisms

(FTM); fault tolerance mechanisms assumptions related to the

application structure and behavior; and related fault models,

namely the type of faults it is able to tolerate.

In this paper, we do not analyze AFT in detail and we refer

the interested reader to several papers on the subjects [1,2,3,4].

The main interest of AFT is its ability to update FTMs to

maintain compliance with some dependability requirements

and assumptions. An FTM should remain consistent with the

safety analysis when a change occurs, in particular after an

over-the-air update of an embedded application. Such

flexibility is essential, we would say mandatory, to keep the

system resilient, i.e. dependable in the presence of changes [5].

Two basic concepts are essential to implement Adaptive

Fault Tolerant computing, as demonstrated in [6]:

- Separation of Concerns at runtime: this concept is now

well-known at design time, but it is also very important at

runtime; it implies a clear separation between the

application code and the fault tolerance mechanisms. The

connection between the application code and the FTM must

be clearly defined. The FTMs should be disconnected and

replaced by a new one through standardized connectors.

- Componentization and dynamic binding: the first idea is

that fault tolerance software are decomposed into smaller

components. Each component exhibits interfaces (services

provided) and receptacles (services required). This means

that any FTMs can be decomposed into smaller pieces, and

conversely that an FTM is the aggregation of smaller ones.

The ability to manipulate the binding between components

(off-line but also on-line) is of high interest for AFT.

The main benefits of component-based AFT with respect to

pre-programmed adaptation is clear: separation of concerns at

runtime, componentization and dynamic binding enable FTMs

to be more easily updated a posteriori during the system

lifetime. Pre-program adaptation implies that all possible

undesirable situations are known at design time, which is

difficult to anticipate regarding new threats (attacks), new

failure modes (obsolescence of components), or simply adverse

situations ignored or forgotten during the safety analysis.

In short, fine grain adaptation of FTMs improves

maintainability of the system from a non-functional viewpoint.

Over-the-air updates of ADAS may have an impact on fault

tolerance requirements, a strong argument in favor of AFT.

B. Component model and reconfiguration with ROS

The main goal of ROS is to allow the design of modular

applications: a ROS application is a collection of programs,

called nodes, interacting only through message passing.

Developing an application involves the assembly of nodes,

which is akin to component-based approaches. Such an

assembly is referred to as the computational graph of the

application.  Two communication models are available in

ROS: a publisher/subscriber model and a client/server one.

The publisher/subscriber model defines one-way, many-to-

many, asynchronous communications through the concept of

topic. The client/server model relies on bidirectional

synchronous communications through the concept of service.

These high-level communication models introduce modularity

and flexibility in software systems.

 To provide this level of abstraction, each ROS application

includes a special node called the ROS Master. It provides

registration and lookup services to the other nodes. All nodes

node that has a comprehensive view of the computational

graph. When a node issues a service call, it queries the master

for the address of the node providing the service and then it

sends its request to this address. 

In order to be able to add fault-tolerance mechanisms to an

existing ROS application in the most transparent manner, we

need to implement interceptors. An interceptor provides a

means to insert functionality, such as safety or monitoring

nodes, into the invocation path between two ROS nodes. To

this end, a relevant ROS feature is its remapping capability. At

launch time, it is possible to reconfigure the name of any

services or topics used by a node. Thus, requests and replies

between nodes can be rerouted to interceptor nodes.

ROS provides two computational models: client-server (by

mean of services) and publish-subscribe (by means of topics).

The proposed approach is illustrated with the client-server

model in the paper. The application of the proposed framework

to the publish-subscribe computational model is on-going

work. In short, it requires the capture of the termination of the

computation within a ROS node to synchronize replicas.

C. Implementing Componentized FTMs

In this section, we first present the generic computational

graph we use for implementing FTMs on ROS. An

implementation of a duplex FTM, a Primary Backup

Replication (PBR) combined with a Time-Redundancy (TR)

mechanism has been done to validate our proposal.

We assume that the reader is familiar with conventional

replication techniques for fault tolerance (see. [7] or [8] for

more details about well-known replication techniques). The

objective is not to present and compare such techniques. The

objective is to show the capabilities of our framework to

combine, compose, decompose, adjust FT mechanisms.

Depending on a large number of performance criteria (e.g.

coverage, timing, communication overhead, HW resources,

etc.), the system manager may prefer one FTM instead of

another. This analysis is out of the scope of this paper.

1) Generic Computational graph

We have identified a generic pattern for the computational

graph of a FTM. Fig. 1 shows its application in the context of

ROS. All components are ROS nodes. A node, the Client, uses

a service provided by a Server node. The FTM computational

graph is inserted between the two nodes thanks to the ROS

remapping feature. Since Client and Server must be re-

launched for the remapping to take effect, the insertion is done

off-line, i.e. the binding between nodes is static. The FTM

nodes, topics, and services are generic for every FTM.

Implementing an FTM consists in specializing the Before,

Proceed, and After nodes with the adequate behavior of the

required FTM.

Fig. 1. Generic computational framework for FTM

2) Application to Primary-Backup Replication

We briefly illustrate here the approach and the Before-

Proceed-After framework, through the use of a Primary-

Backup Replication (PBR) mechanism. Three computers are

needed: the CLIENT site hosting the Client node and the ROS

Master, the MASTER site hosting the primary replica, and the

SLAVE site hosting the backup replica.

• We present the behavior of each node, the topics and

services used through a request/reply interaction

between a node Client and node Server (cf. Fig. 2).

• Client sends a request to Proxy (service clt2pxy);

• Proxy adds an identifier to the request and transfers it

to Protocol (topics pxy2pro)

• Protocol checks whether it is a duplicate request: if so,

it sends directly the stored reply to Proxy (topics

pro2pxy). Otherwise, it sends the request to Before

(service pro2bfr);

• Before transfers the request for processing to Proceed

(topics bfr2prd); no other action for PBR.

• Proceed calls the actual service provided by Server

(service prd2srv) and forwards the result to After

(topics prd2aft);

• After gets the last result from Proceed, captures Server

state by calling the state management service provided

by the Server (service aft2srv), and builds a checkpoint

based on this information which it sends to node After

S of the SLAVE replica (topics aft2aft S);

• Protocol gets the result (topics aft2pro) and sends it to

Proxy (topics pro2pxy);

The Before-Proceed-After (BPA) framework synchronizes

replicas in normal operation, i.e. in the absence of faults. It also

runs the recovery procedure when the failure detector (an

external/independent node) signals the crash of a replica.

Fig. 2. Before-Proceed-After framework applied to PBR

The main advantage of this approach is that a slight change

in the protocol can be performed easily just by

replacing/updating one of the Before, Proceed, After nodes. A

second advantage of the approach is that the inter-replica

protocol is clearly independent of the application service. The

main drawback is that ROS does not provide command to

change bindings between nodes after their initialization.

3) Composition of several FT Mechanisms

The generic computational graph for FTM given in Fig. 1 is

designed for composability. The key feature is that a Protocol

node can substitute for a Proceed node.

Fig. 3. Principle of composition for FT mechanisms

With respect to request processing, a Protocol node and a

Proceed node exhibit the same interfaces: in short, a request as

input, a reply as output. Hence, the composition of several FT

mechanisms relies on replacing the Proceed node of a

mechanism by a Protocol and its associated Before-Proceed-

After nodes of a second mechanism, as shown in Fig. 3. Our

approach enables developing a new mechanism on the

foundation of several existing ones. This improves the

development time and the assurance in the overall system,

since all mechanisms have been validated off-line.

Two composition scenarios are shortly described below.

PBR+TR. PBR is of interest to tolerate crash faults whereas

TR tolerates transient value faults. TR tolerate transient fault

by repeating the computation and voting on the results. As a

second FTM (FTM2), the After node of TR is responsible for

triggering the repetition of the computation (involving Before

and Proceed) and the vote on the various results produced

before forwarding the reply to the After node of FTM1, which

implements PBR.

PBR+Assertion. Assertions are often derived from safety

analysis. For instance, "the electronic lock of the steering

column must not activate when the speed of the vehicle is over

10 km/h". This safety rule can easily be translated into a logical

expression, i.e. a Boolean function. The second FTM (FTM2)

is responsible for the verification of such assertion

implemented in its After node. When the assertion is false it

may raise an alarm and return an error signal to FTM1 that will

send it back to the Client for emergency action.

D. Lessons learnt

The main advantage of ROS is to provide concepts for

componentization and separation of concerns. This is important

for the design of adaptive fault tolerance mechanisms, but also

for their implementation. The proposed framework Before-

Proceed-After inspired from Aspect Oriented Programming [9]

enables various fault tolerance mechanisms first to be

decomposed into isolated components that can be customized

according to the needs, but also to facilitate the composition of

several mechanisms in a row.

Separation of concerns enables the FTM to be externalized

with respect to the functional code, namely the application

code. The generic FTM mechanisms we propose are

independent of the nature of the application. This independence

between FTMs and application simplifies their i)

externalization and ii) their composition. The benefits of

separation of concerns have been demonstrated in many ways

for non-functional properties (replication, security, tracing,

etc.) using Meta-Object Protocols [10] in the past as in [11] and

was the main motivation for Aspect Oriented Programming.

The main interest is to avoid gluing non-functional

mechanisms with application code, an approach making

maintenance and evolution very difficult to achieve. Separation

of concerns has a lot of merits at design, implementation and

validation time, but also at runtime since the application and

the attached FTM can be located into isolated components.

Isolation is a key feature for dependable computing.

From an implementation viewpoint, ROS nodes provide

isolation in a protected address space for error confinement.

The services and the mechanisms can be isolated from each

other, and thus an error within the application (e.g. memory

violation) does not impact the FT mechanism. Although we

assume that the implementation of any FTM is zero-default

(huge validation effort following ISO 26262), this isolation

property also applies to nodes implementing the FTMs.

The static binding between nodes is a drawback because it

can only be manipulated a priori and off-line. This is a

weakness of ROS regarding fine grain over-the-air updates of

componentized FTM: an update can only be finalized after

restarting the application.

It is worth noting however that the validation of a new

mechanism or even an updated version of it, must be carried

out off-line following an intensive validation process, in

particular fault injection as far as fault tolerance is concerned.

Ideally, dynamic binding would improve the efficiency of

over-the-air updates of ADAS for instance. As we have shown

previously, only few or even just one node belonging to our

Before-Proceed-After framework may need to be updated. So,

why restarting the whole application? Just uploading a new

node and binding it to its companion nodes would suffice. This

is not possible at present with ROS, version 1. There is no API

to manipulate nodes and bindings at runtime. However, these

APIs can be emulated with dedicated logic added to some

nodes, using underlying Unix features and commands.

Last but not least, the ROS master is a single point of

failure in the current version of ROS. This problem could be

tackled using DMTCP [12], a library for checkpointing Unix

multi-threaded processes as a whole. This might be of interest

in the short term since a POSIX compliant kernel is part of the

upcoming Adaptive Autosar platform whose aim is to facilitate

dynamic reconfiguration and updates of embedded software.

However, the next major revision of ROS (ROS2) is based

on a DDS (Data Distribution Service) communication system

that should help solving this problem by distributing the ROS

master functionalities among the nodes of the system. This

approach would however require reliable multicast protocols

properly implemented and validated.

III. EXPERIMENTAL PLATFORM FOR AFT & ADAS

The objective of the platform is to provide the support for

several activities: i) the simulation of critical advanced driver

assistance systems, ii) a target for implementing over-the-air

update of ADAS, ii) a set of use cases for safety analysis, iv)

the implementation of adaptive fault tolerance techniques and

v) their validation by fault injection.

The use of ROS for the implementation of any ADAS is

essential to validate our AFT approach and our Before-

Proceed-After framework.

Instead of performing functional updates and related FTM

adaptation on a real car, we have used a simulator to

implement the car behavior. The GAZEBO-Sim tool enables a

vehicle and its environment to be simulated with a quite

interesting level of detail. Sensors and actuators can be

developed and integrated into a model of vehicles on roads.

Towards Adaptive Fault Tolerance on ROS for Advanced Driver Assistance Systems

Figures

Citations

[서평]「Component Software」 - Beyond Object-Oriented Programming -

A Seamless Integration of Fault-Tolerant and Real-Time Capabilities for Robot Operating System (ROS)

Flexible Fault Tolerance for the Robot Operating System

Effective Crash Recovery of Robot Software Programs in ROS

References

Aspect-oriented programming

Component Software: Beyond Object-Oriented Programming

Agile software development: the business of innovation

[서평]「Component Software」 - Beyond Object-Oriented Programming -

Composing adaptive software

Related Papers (5)

Design of a data-driven communication framework as personalized support for users of ADAS.

ADAS for the Car of the Future

A rapid prototyping environment for cooperative Advanced Driver Assistance Systems

A Survey of Personalization for Advanced Driver Assistance Systems

Advanced drivers assistant systems in automation

Frequently Asked Questions (19)

Q1. What contributions have the authors mentioned in the paper "Towards adaptive fault tolerance on ros for advanced driver assistance systems" ?

Q2. What is the purpose of an interceptor?

Q3. What is the effect of the remapping?

Q4. What is the purpose of interceptors in ROS?

Q5. What is the purpose of the TJP?

Q6. What is the purpose of the second FTM?

Q7. What is the main interest of separation of concerns?

Q8. What happens when a Raspberry PI crashes?

Q9. What are the main benefits of component-based AFT?

Q10. What does Adaptive Fault Tolerant Computing mean?

Q11. What are the three computers needed for the FTM?

Q12. What is the next major revision of ROS?

Q13. What is the purpose of the Adaptive Autosar platform?

Q14. What is the impact of a crash on the safety of the TJP?

Q15. What is the main motivation for Aspect Oriented Programming?

Q16. What is the ideal executive support for a ddas?

Q17. What is the serious problem in the TJP?

Q18. What is the main purpose of separation of concerns?

Q19. What is the main feature of ROS?

Trending Questions (1)