^{1}

^{*}

This paper reviews the research work done on the Reliability Analysis of Controller Area Network (CAN) based systems. During the last couple of decades, real-time researchers have extended schedulability analysis to a mature technique which for nontrivial systems can be used to determine whether a set of tasks executing on a single CPU or in a distributed system will meet their deadlines or not [1-3]. The main focus of the real-time research community is on hard real-time systems, and the essence of analyzing such systems is to investigate if deadlines are met in a worst case scenario. Whether this worst case actually will occur during execution, or if it is likely to occur, is not normally considered. Reliability modeling, on the other hand, involves study of fault models, characterization of distribution functions of faults and development of methods and tools for composing these distributions and models in estimating an overall reliability figure for the system [4]. This paper presents the research work done on reliability analysis developed with a focus on Controller-Area-Network-based automotive systems.

Scheduling work in real-time systems is traditionally dominated by the notion of absolute guarantee. The load on a system is assumed to be bounded, and known, worstcase conditions are presumed to be encountered, and static analysis is used to determine that all timing constraints (deadlines) are met in all circumstances. This deterministic framework has been very successful in providing a solid engineering foundation for the development of real-time systems in a wide range of applications from avionics to consumer electronics. However, the limitations of this approach are now beginning to pose serious research challenges for those working in scheduling analysis. A move from a deterministic to a probabilistic framework is advocated in [

When performing schedulability analysis (or any other type of formal analysis) it is important to keep in mind that the analysis is only valid under some specific model assumptions, typically under some assumed “normal condition,” e.g., no hardware failures and a “friendly” environment. The “abnormal” situations are typically catered for in reliability analysis, where probabilities for failing hardware and environmental interference are combined into a system reliability measure. This separation of deterministic (0/1) schedulability analysis and stochastic reliability analysis is a natural simplification of the total analysis, which unfortunately is quite pessimistic, since it assumes that the “abnormal” is equivalent to failure; in particular, for transient errors/failures, this may not at all be the case. Thus, even in cases where the worst case analysis deems the system to be unschedulable, it may be proven to satisfy its timing requirements with a sufficiently high probability [

Systems with requirements of reliability are traditionally built with some level of replication or redundancy so as to maintain the properties of correctness and timeliness even in the case of faults. Faults can be tolerated by using hardware, software, information and time redundancy. In certain classes of applications, it may not be feasible to provide hardware redundancy due to space, weight and cost considerations. Such systems need to exploit software, information and time redundancy techniques and especially the combination of these techniques. However, the implementation of such systems increases the complexity of keeping the system computation in compliance with the requirements specification. Effective schedulability analysis, which takes the fault model, fault tolerance mechanisms and the scheduling algorithm into consideration, is needed to guarantee the timeliness property [

Formal analysis approaches of fault-tolerant real-time systems have been addressed by some previous research activities. In [

• Multiple sources of errors: Handling of each source separately is not sufficient; instead, they have to be composed into a worst case interference with respect to the latency on the bus.

• Signaling pattern of individual sources: Each source can typically be characterized by a pattern of shorter or longer bursts, during which the bus is unavailable, i.e., no signaling will be possible on the bus. This model is, just as Tindell and Burns’ model, deterministic in that it models specific fixed patterns of interference. Section 3 presents general reliability modeling for such distributed real-time systems, as presented in [

An alternative is to use a stochastic model with interference distributions. Such a model is proposed by Navet et al. [

A study done by Burns, Davis and Punnekkat [

In [

Reliability is defined as the probability that a system can perform its intended function, under given conditions, for a given time interval. In the context of an automobile, its intended function is to provide reliable and cost-effective transport of men and material. At a subsystem level, such as for an Antilock Braking System (ABS), this boils down to performing the tasks (mainly input_sensors, compute_control, and output_actuators, etc.,) as per the specifications. Being part of a real-time system, the specifications for ABS imply the necessity for the results to be both functionally correct and within timing specifications [

A major issue here is how to compose hardware reliability, software reliability, environment model, and timing correctness to arrive at reasonable estimates of overall system reliability. A central feature of these systems is that the behavior is periodic, with a hyperperiod that is equal to the LCM of the periods of the constituent tasks.

The following probabilities are defined

The reliability of the system is the probability that the system performs all its intended functions correctly for a period. This is given by the product of cumulative probabilities that there are no failures in hardware, software, and communication subsystem during the period. That is,

In [

A. Reliability Estimation By definition, reliability is specified over a mission time. Normally, a repetitive pattern of messages (over the least common multiple (LCM) of the individual message periods) is assumed. Each LCM (hyperperiod) is typically a very small fraction of the mission time.

Let t represents an arbitrary time point when the external interference hits the bus and causes an error. If one can assume zero error latency and instantaneous error detection, then t becomes the time point of detection of an error in the bus due to external interference.

The following probabilities are defined:

By relying on the extensive error detection and handling features available in the CAN, one can safely assume that an error in a message is either detected and corrected by retransmission, or will ultimately result in a timing error. This allows to ignore P_{C}(t), and define in this context the probability of communication failure due to an interference starting at t as

In the environment model in [_{1}, having a certain pattern, hitting the message transmission. Let be the probability of such an event occurring at time t. It is also assume that another interference I_{2}, having a different pattern, can hit the system at time t with a specified probability, for example,. In [

In [

B. Failure Semantics In the above presentation, it is assumed that a single deadline miss causes a failure. This may be true for many systems, but in general, the failure semantics do not have to be restricted to this simple case. For instance, most control engineers would require the system to be more robust, i.e., the system should not loose stability if single deadlines, or even multiple deadlines, are missed. Tolerable failure semantics for such a system [

In particular, for critical systems, the designer has an obligation to make the system more robust. For instance, for a simple system a more reasonable failure semantics would be: “a failure occurs if more than two out of ten deadlines are missed.”

It should be noted that changing the failure semantics may have implications for the design, since it is to be made sure that the system appropriately can handle the new situation, for example, in this case, that a few deadline misses can be tolerated.

C. Calculating Failure Probabilities To calculate the subsystem reliability, first one needs to calculate the failure probability (in this case of the communication subsystem subject to possible external interference), i.e., the probability of at least one failure (defined as a missed deadline) during the mission time. In doing this, it is assumed that the interference-freesystem is schedulable, i.e., that it meets all deadlines with probability 1. Due to the bit-wise behavior of the CAN bus, and with respect to the external interference, a discretization of the time scale is made, with the time unit corresponding to a single bit time τ_{bit} (1 s for a 1-Mb/s bus), i.e., no distinction is made between an interference hitting the entire bit or only a fraction of it. In either case, the corruption will both occur and be detected [

Two types of interference sources are considered in [

• Intermittent sources, which are bursty sources that interfere during the entire mission time T, and are for an interference source I characterized by a period T_{I} and a burst length l_{I} (where l_{I} < T_{I} )• Transient sources, which are bursty sources of limited duration. These occur at most once during a mission time T, and for an interference source J are characterized by a period T_{J},_{ }a burst length l_{J}, and a number of bursts n_{J} (where l_{J} < T_{J} and), For a single intermittent source I, the probability of a communication failure during the mission time is defined as follows:

where denotes the conditional probability of a communication failure, given that the system was hit by interference from source I at time t, denoted by. Since interference and bus communication are independent, it follows that._{}

is calculated by simulation, i.e., the bus traffic during a mission is simulated, including the effects of interference, to determine if any communication failure (deadline violation) occurs. This will for each result in either 0, if no deadline is missed, or 1, if a deadline miss is detected.

For a single transient source J, the probability of a communication failure during the mission time is defined as follows:

where P_{J}(t) denotes the probability that the first transient interference hits the system at time t, and ^{ }denotes the conditional probability of a communication failure during T due to interference from J, given that the interference J starts at time t. In the above summation, all possible full or partial interference from the transient source during mission time are considered. The interval _{ }specifically captures initial partial interference, starting before mission time, but ending during mission. Since interference and bus communication are independent, it follows that.

is calculated by simulation, just as above.

The number of scenarios to simulate here is potentially much larger than for an intermittent source, since typically. However, the number of simulations can be reduced, since the probability of failure is independent of the hyperperiod (LCM of individual message periods), in which the interference hits the system. It can be proved that

It should be noted that a slight pessimism is introduced (which is the reason for the ≤) since the probability for failure in T is lower toward the end, when the interference bursts are extending past the end of T. However, since it is assumed that, the introduced pessimism can be considered negligible.

Also, it is to be noted that in (5) the effects of the interference starting before the LCM is covered by the interference on the subsequent LCM, which is the reason for starting the summation with t = 0 [rather than as in (4)]. Finally, it is noted that by not counting the interference starting outside the very first LCM (at the beginning of the mission time), some optimism is introduced, but since the assumption of

this optimism is insignificant.

To be more precise, a fraction of (4),

could be added to (5) in order to cover for the pre-mission time-triggered transient faults.

The above basic analysis is now extended to the analysis of multiple sources of interference. First, the presence of two independent sources of interference is considered. There are totally three cases to consider, namely 1) Two intermittent sources I_{1} and I_{2}_{}

_{ }

2) Two transient sources J_{1}and J_{2}

3) One intermittent and one transient source

The number of scenarios to simulate for the above three cases are in the order of_{, } T^{2}, and, respectively. This may be rather large, especially for the last two cases. By observing symmetries in the formulas, however, the number of scenarios can be reduced. For case 3), consider the following two mutually exclusive situations: 1) LCM ≥ T_{I}_{1}, which leads to the following reduced formula:

and 2), which leads to the following reduced formula:

The above two equations can be combined into

and, thus, the number of scenarios is reduced from the order of T_{I}_{1} × T to the order of T_{I}_{1 }× max (LCM, T_{I}_{1}).

Finally, the general formula for an arbitrary number of interference sources of either type (n intermittent and m transient sources of interference) is presented as [

D. Approach to Analysis The above equations define the probability of communication failure given a set of sources interfering with the communication according to specified patterns [

• One has a set K of external sources of interference, where, I is a set of intermittent sources of interference and J is a set of transient sources, as defined above.

• Each interference source is either active or inactive during a mission. The probability for each source to be active is, and the probability for inactivity is consequently.

For a scenario with a single intermittent interference source I_{1} and a single transient interference source J_{2}, we can now (with a slight generalization of the notation) define the probability of a communication failure in a mission as follows:

which for an arbitrary set I of intermittent sources and an arbitrary set J of transient interference sources can be generalized to

where. Intuitively, (14) defines the failure probability for a system that is potentially subjected to interference from a set of interference sources as the sum of the weighted probabilities that a failure occurs in each of the possible combination of sources.

This section presents an introduction to the general real-time system model, the fault-tolerant communication model and the applied reliability metrics, as presented in [

A. General System Model The calculation of reliability is illustrated based on a system bus that connects the processing nodes in an embedded system. On each processing node a certain number of processes is executed. Whenever two processes on different nodes need to communicate, a message is sent over the bus. For that purpose, a logical communication channel will be assigned to each pair of communicating processes. An example system with three logical communication channels τ_{l}, τ_{2} and τ_{3}. (as shown in

Tasks are coupled by event streams to indicate dependencies between them. An event stream is defined as a numerical representation of the timing of event occurrences [

several characteristic functions. The upper-bound arrival function describes the maximum number of events at τ_{i}’s input during an interval of length t. In [_{l}, τ_{2} and τ_{3}, with periods T_{1}, T_{2} and T_{3} respectively and without any release jitter. This is a reasonable assumption for a wide variety of applications, e.g. signal processing or control loops. Further on perfectly synchronized clocks on each node are assumed.

Each message has a deadline that is equal to the period of the corresponding channel. Due to the fact that multiple channels may concurrently request access to the bus, a scheduler is adopted to decide which channel bus will be granted access in case of conflicts.

The scheduling policy is static-priority non-preemptive, i.e. only the channel with the highest priority may transmit data. Preemptions of running transmissions are not permitted. This communication model corresponds to the CAN bus.

B. Fault-Tolerant Communication Model In this work, a single bus is assumed as communication medium. That is a very frequently used medium, both on chip and in distributed applications, such as in automotive electronics. In contrast to processes running on a CPU, the workload of a communication channel is often measured by the amount of data X to be transmitted within one job. Given the transmission speed V_{com}, it is possible to compute the effective time t_{com} (X) a message of size X (in bits) utilizes the bus (without any fault tolerance measures):

For the sake of clarity, all the messages of a communication channel τ_{i} are assumed to have the same length X_{i}. However the proposed method is also well-suited for channels with messages of different lengths. In this case it would be necessary to know the sequence of transmitted data (and thus the sequence of message lengths) a priori (e.g. MPEG). For scheduling analysis it is essential to know t_{com }instead of the amount of data. However the probabilistic considerations are based on the measure of bit error rate (BER), i.e. the relative amount of erroneous bits during transmission. As a consequence, the value of X is relevant to determine the probability that a message suffers from a certain number of bit errors. It should be noted that only independent errors are considered. Burst errors which might corrupt consecutive bits beyond message boundaries are excluded by this restriction. This is a reasonable assumption for on-chip buses [

There exists a wide variety of fault tolerance techniques to protect data transmission, commonly classified into error detecting codes (EDC) and error correcting codes (ECC) [

where t_{FTS} is the computation time for redundancy bits t_{com}_{,FT} is the time of a single transmission trial using a fault tolerant bus t_{RR} is the transmission time for the retransmission request n_{e} is the number of erroneous transmission trials t_{com}_{,FT} is composed of the transmission times t_{M} for the message payload and t_{code} for the redundant bits and the time t_{FTR }wasted for evaluation of the redundant bits:

Corresponding to the discussion in [

Normal Priority Retransmission: Retransmissions have the same priority as the original transmission.

Highest Priority Retransmission: Retransmissions

always have the highest priority.

C. Reliability Model The lifetime of a system is modelled with a random variable L that represents the time the system is running without failure. F_{L} depicts the distribution function of L. In [_{L}:

Practically R(t) denotes the probability that within an interval [0, t] no failure occurs [

This analytical method is aimed to overcome the limitations discussed in Section 2. For this purpose, a job-wise analysis is introduced, i.e. the number of tolerable errors is considered w.r.t. the job’s actual interference situation. Furthermore the probability that this number is not exceeded will be calculated for each transmission job separately. This procedure has to be performed for all jobs within the hyperperiod of the bus. After one hyperperiod has been completed, channel activations and load situation are repeated in exactly the same way as before. Consequently all probabilistic information which are necessary to derive a reliability function are already included into one hyperperiod [

The fact that the job τ_{i}_{,j} meets its deadline will be referred to with S_{i}_{,j} (success of τ_{i}_{,j}).

The probability that the job τ_{i}_{,j} meets its deadline is denoted with P[S_{i}_{,j}].

The probability of S_{i}_{,j} provided that some jobs have already met their deadlines is denoted as conditional probability with

The set of all jobs τ_{i}_{,j} that have been activated in the interval [0, t] is denoted with J(t),and the cardinal number| J(t)| with M:

Consider a time interval [0, t] and a list J(t) as defined by Equation (19). Furthermore J(t) is sorted by the monotonically increasing activation times of the jobs, with ties broken such that the jobs of higher priority tasks are considered before those of lower priority tasks. Based on this notation, reliability R(t) can be defined as the probability that all jobs in J(t) meet their deadlines, i.e.:

Consider that single success probabilities represent the probabilistic complements of the individual probabilities of failure on demand for each job, i.e.

Equation (19) can be converted using the laws of probabilities:

Hence computation of these conditional probabilities remains as the central challenge. This problem can be divided into determining the maximum number of tolerable errors per job and computing the probability that this number will not be exceeded. Before going into details, some reflections concerning the analysis interval are presented.

A. Analysis Interval Equation (20) represents the basic analysis methodology of jobwise success probability calculation. As explained at the begining of this section the job-wise analysis has to be performed for exactly one hyperperiod, which is equal to. If the reliability for one hyperperiod in known, reliability for A hyperperiods can be extrapolated by raising to the power of A:

Hence even if R(t) for large values of t is considered the analysis complexity is the same.

B. Number of Tolerable Errors An algorithm to compute the maximum number of tolerable errors per job is now presented. For the sake of clarity, the overhead in expressions t_{RR} and t_{FT}s is ignored. Thus a message that has to be retransmitted n_{e} times has an overall transmission time of. Further on the more compact notation j_{x} is introduced to refer to a certain job ∊ J(t).

_{c‒}_{1} and j_{c}, with j_{c‒}_{1} having higher priority than j_{c}. Job j_{c} is the considered job that should be evaluated for the maximum number of tolerable errors.

In the example, there exist several combinations of retransmissions of j_{c‒}_{1} and j_{c} without j_{c} missing its deadline, as shown in

approach of exclusively considering the worst-case situation, i.e. that scenario when j_{c‒}_{1} and j_{c} are activated simultaneously, would indicate no tolerable error at all.

_{c} algorithmically. It again refers to the example from _{c} is able to tolerate, including transmission errors in jobs other than j_{c} which cause a certain delay to j_{c} due to scheduling effects [

In ^{c}^{,k}. 1t is labeled with the corresponding error vector e^{-c}^{,k}. An edge represents an error event to denote that the last transmission trial of job j_{c-x} is disturbed by an error.

The fact that an arbitrary error scenario S^{i}^{,k} of a job j_{i} actually occurs is denoted with ω^{i}^{,k}.

A response time analysis is performed to determine whether the considered job j_{c} will meet its deadline under the actual interference situation and the error scenario represented by the node. If j_{c} meets its deadline, all successors of the node that represents the currently considered error scenario within the tree will be explored. Otherwise the node is tagged as not working, what implies that all its successors are not working too. From the graph-theoretic point of view a non-cyclic directed graph (i.e. a tree) is traversed using a depth first search, where nodes tagged as not working are used as an abort criterion. The search terminates when there are no remaining working nodes with unexplored successors, i.e. when all leaves are tagged as not working. At this point, the working set of the job j_{c }is found. It is separated from the non-working set by the working front

as depicted in

Theoretically, j_{c}’s completion may be delayed by all other jobs that have been activated beforehand. This would lead to a growing dimension of the error vector from job to job and thus to a depth explosion of the search tree. To avoid this problem, only the last D jobs which have been activated before j_{c} are included into the tree-based analysis. The parameter D is called the search depth. A bounded search depth assures a constant dimension of the error vector. This strategy may cause a disregard of certain error scenarios which generate a failure. Hence an overestimation of reliability may occur, leading to a non-strictly conservative analysis approach. However the experiments in [

C. Success Probability Computation Reconsidering the two steps mentioned above, the remaining problem is the calculation of the conditional success probabilities. That is considering a job j_{c}, expression (23) has to be evaluated.

Because the occurrences of all scenarios in w_{c} are mutually exclusive, the overall probability that exactly one of these scenarios actually appears can be computed by summing up the individual probabilities:

It is shown in [_{c}.

In this context the following abbreviations are introduced to proceed with a more compact notation:

Consequently, Equation (24) is simplified

Equation (26) is rearranged corresponding to the law of total probability:

Thus the following case differentiation is used:

1)

2)

Then the probability can be computed according to Equation (29).

where

is P[Transmission erroneous]

is P[Transmission correct]

is P[erroneous transmissions]

Thus it is shown in that Equation (24) can be evaluated for a given bit error rate BER and known message lengths X_{c}.

Research on schedulability analysis for fixed priority scheduling can be split to two classes. One class of techniques is based on utilization bound U, which allows a simple test of the form ΣUi ≤ U to determine a task set’s schedulability [^{1/n} − 1), where n is the total number of tasks. These methods have low overheads and adapt to online schedulability test. Their shortcoming is their pessimism, since some schedulable task sets may have higher utilizations than the bound U. The other class of techniques is based on the response time analysis techniques, which can provide exact schedulability analysis [1,16,17]. Such techniques have been increasingly developed for less restrictive computational models, thus allowing tasks to interact via shared resources, to have deadlines less than their periods [

The developed notion of a probabilistic assessment of schedulability has been supported [

A method that allows controlled relaxation of the timing requirements of safety-critical hard real-time systems is presented [

Using traditional schedulability analysis techniques, the designer will in many cases have no other choice than to redesign the system (in hardware, software or both). However, by resorting to the approach in [

It is well known [

An algorithm to determine the maximum number of fault corrections and the corresponding probability of a missed deadline in case of periodic, priority driven communication systems during a period of time is presented [

An approach for the efficient fault-tolerant scheduling of messages of mixed criticality in CAN which provides guarantees for variable levels of redundancy for critical messages is presented [

A comprehensive method to diagnose CAN nodes and communication links faults is presented [

The method in [