The complexity of automotive integrated circuits (ICs) has grown exponentially with the introduction of advanced driver-assistance systems and autonomous drive technologies. Directly correlated to this hike in complexity is the increased burden of ensuring an IC is protected from random hardware faults—functional failures that occur unpredictably. Random-fault mitigation continues to be one of the primary challenges and pain points across the industry.
To ensure random faults don’t affect silicon functionality and place humans at risk of injury, designs must be enhanced by implementing safety mechanisms to identify and control these faults. ISO 26262 requires that development teams instrument and prove the effectiveness of each safety mechanism.
Historically, the automotive industry has addressed random-fault analysis using a combination of tools and expert judgment. With this approach, initial safety analysis has commonly been performed top-down using Failure Modes, Effects, and Diagnostic Analysis (FMEDA). While top-down analysis is still a necessity, the growth in semiconductor complexity makes relying on expert judgment alone unmanageable and error-prone.
Instead, an automated workflow must be deployed to assist experts in addressing random faults. Automating the random-fault workflow diminishes the potential for human error and shortens time to signoff by eliminating time wasted iterating through the workflow.
A functional-safety workflow must not be viewed as a series of point solutions. Project managers who understand that functional safety requires a chain of tools working seamlessly have demonstrated success.
At Mentor, a Siemens Business, we have developed and field-proven a three-step workflow to address the random fault aspect of ISO 26262 by automating the safety-analysis, safety-insertion, and safety-verification tasks (Fig. 1). Because this automated flow eliminates the typical iterations of the legacy safety-analysis workflow, we call this new approach the “first time success workflow,” which is described in this article.
1. Comparing two functional-safety random-fault workflows: a traditional workflow versus a streamlined, three-step process.
The goal of safety analysis is to fully understand a design’s susceptibility to random hardware failures as well as the steps that must be taken to achieve the desired safety metrics, defined by the higher-level Automotive Safety Integrity Level (ASIL) target.
Several analytic techniques are deployed to determine design safety relative to the target safety metrics. Structural examination is a proven method for calculating and validating the failures in time (FIT) estimation performed during the creation of an FMEDA. Cone-of-influence analysis combined with structural analysis provides visibility into the design structures that are already protected by existing safety mechanisms.
Through structural and cone-of-influence analyses, the effectiveness of safety mechanisms in catching random hardware faults is quantified, yielding the estimated diagnostic coverage (DC). The FIT and DC estimates serve both as FMEDA gap analysis and as validation of the initial expert-driven analysis.
For example, a design that contains module duplication will be structurally analyzed. Then the estimated diagnostic coverage will be calculated based on the design structures covered by the duplicated module and associated checker. The calculated diagnostic coverage validates the top-down FMEDA diagnostic coverage estimation (Fig. 2).
2. Structural analysis can identify coverage on safety-critical design structures.
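As a rough illustration of how such an estimate could be computed, the sketch below weights each design structure by its FIT contribution and by the coverage of whatever mechanism protects it. The structure names, FIT values, and coverage figures are hypothetical, and the FIT-weighted average is a simplifying assumption rather than the calculation any particular tool performs.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str            # hypothetical design-structure name
    fit: float           # raw failure rate in FIT (failures per 1e9 device-hours)
    mechanism_dc: float  # assumed coverage of the protecting mechanism; 0.0 if unprotected


def estimated_dc(structures):
    """FIT-weighted diagnostic coverage across all structures."""
    total_fit = sum(s.fit for s in structures)
    if total_fit == 0:
        return 0.0
    covered_fit = sum(s.fit * s.mechanism_dc for s in structures)
    return covered_fit / total_fit


# A duplicated module with a checker next to an unprotected bridge (made-up numbers).
design = [
    Structure("core_a (duplicated + checker)", fit=40.0, mechanism_dc=0.99),
    Structure("bus_bridge (unprotected)", fit=10.0, mechanism_dc=0.0),
]

dc = estimated_dc(design)
print(f"Estimated diagnostic coverage: {dc:.1%}")
```

Comparing a number like this against the top-down FMEDA estimate is the gap check described above: a large mismatch suggests either the expert estimate or the assumed protection is off.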
FMEDA gap analysis is an important checkpoint because it provides feedback early in the design cycle, so that insufficient fault mitigation isn’t discovered only after safety verification is complete, when it’s costly to fix. In addition to gap analysis, structural analysis details the FIT contribution of each elementary design structure.
In the event that structural analysis demonstrates safety holes, elementary FIT data highlights and prioritizes the design structures that require additional safety enhancement. Using this information, safety architects are empowered to explore enhancement options to achieve the desired safety target while taking into account power and area requirements.
For example, a safety architect can estimate the impact of adding error correcting code (ECC) to a memory, review the overall improvement in diagnostic coverage, and determine whether the proposed set of safety mechanisms meets the safety target.
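This kind of what-if exploration can be sketched in a few lines. The example below ranks structures by their undetected FIT, applies a proposed mechanism to the worst offender, and checks the result against a target. All names and numbers, including the assumed 99% coverage for ECC and the 90% target, are illustrative assumptions only.

```python
# Hypothetical structures: name -> (raw FIT, current diagnostic coverage)
baseline = {
    "cpu_core":   (40.0, 0.99),
    "sram_bank0": (50.0, 0.00),
    "dma_ctrl":   (10.0, 0.60),
}

TARGET_DC = 0.90       # assumed project target, not a value from the standard
ASSUMED_ECC_DC = 0.99  # assumed coverage credited to ECC for this sketch


def diagnostic_coverage(structs):
    """FIT-weighted diagnostic coverage of the whole design."""
    total = sum(fit for fit, _ in structs.values())
    return sum(fit * dc for fit, dc in structs.values()) / total


def residual_ranking(structs):
    """Structure names sorted by undetected FIT, largest gap first."""
    def undetected(name):
        fit, dc = structs[name]
        return fit * (1.0 - dc)
    return sorted(structs, key=undetected, reverse=True)


def what_if(structs, name, new_dc):
    """Copy of the design with one structure's coverage replaced."""
    trial = dict(structs)
    trial[name] = (structs[name][0], new_dc)
    return trial


worst = residual_ranking(baseline)[0]
trial = what_if(baseline, worst, ASSUMED_ECC_DC)
before = diagnostic_coverage(baseline)
after = diagnostic_coverage(trial)
print(f"Protect {worst}: DC {before:.1%} -> {after:.1%}, meets target: {after >= TARGET_DC}")
```

A real exploration would also weigh the power and area cost of each candidate mechanism, which this sketch leaves out.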
This exploration of options ensures that the proposed safety mechanisms will achieve the ASIL target once the random-fault workflow has completed, eliminating iterations through the remaining two phases (safety insertion and safety verification). The outcome of safety exploration is a clear understanding of the design enhancements required to meet the safety goals. The proposed safety mechanisms required to meet the safety target are fed directly into the safety-insertion phase.
Safety mechanisms come in a variety of flavors, each with its own level of effectiveness in detecting random hardware faults. Typically, safety mechanisms are bucketed as either fail-safe or fail-operational. Fail-safe mechanisms are capable of random-fault detection. Fail-operational safety mechanisms are capable of correcting random hardware faults; they typically incur a higher resource utilization (power, performance, area) and are required to attain the most stringent safety targets (designated as ASIL D).
Traditionally, the manual design enhancement process is disjointed and inconsistent across teams and projects. Automated insertion of safety mechanisms offers a handful of benefits. It:
- Instills consistent safety-mechanism implementation, eliminating human error.
- Frees up design engineer resources, allowing them to focus on differentiating features.
- Allows safety enhancement of third-party IP in which design architectures are unknown.
- Enables safety enhancement of machine-generated code, such as high-level synthesis designs.
With guidance from safety analysis, users insert the safety mechanisms that meet power, performance, area, and safety targets (Fig. 3). Engineers have a suite of hardware safety-mechanism choices, such as:
- Flip-flop parity, duplication, and triplication
- Finite state machine protection
- ECC and triple modular redundancy
- Module-level lockstep and triplication
- End-to-end parity and cyclic redundancy check
3. Engineers have a suite of hardware safety-mechanism choices.
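The difference between detecting a fault and correcting it can be shown with two of the mechanisms listed above. The following is a behavioral Python sketch, not RTL and not the output of any insertion tool: parity flags a single bit flip without being able to repair it, while a triplicated value with a majority voter masks the fault.

```python
def parity(word: int) -> int:
    """Even-parity bit: XOR of all bits in the word."""
    return bin(word).count("1") & 1


def parity_error(word: int, stored_parity: int) -> bool:
    """True when the recomputed parity disagrees with the stored bit."""
    return parity(word) != stored_parity


def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


value = 0b1011_0010
stored = parity(value)

flipped = value ^ (1 << 3)        # single bit flip in one copy
double_flip = value ^ 0b11        # two bits flipped

detected = parity_error(flipped, stored)           # detection only (fail-safe style)
missed = not parity_error(double_flip, stored)     # parity cannot see an even number of flips
corrected = tmr_vote(value, flipped, value)        # correction (fail-operational style)

print(detected, missed, corrected == value)
```

The triplicated version carries roughly three times the storage plus a voter, which is the resource cost the text attributes to fail-operational mechanisms.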
After safety-mechanism insertion, logical equivalence checking between the original and enhanced designs must be performed to ensure that no functional deviation has been introduced.
Once the design is enhanced with fault-mitigation logic, it must be proven that it’s safe from hardware faults through a fault-injection campaign. The objective of a fault-injection campaign is to fully classify each fault within the fault list and validate the diagnostic coverage metric estimated during safety analysis.
Fault classification is performed by injecting faults into the design and verifying that functional deviations are caught by the automatically inserted safety mechanisms. The size of the fault list has a direct impact on the length of the fault campaign, so every effort should be made to reduce it to the minimal set. As a result, the fault campaign is subdivided into two parts.
In the first part, the fault list is automatically generated using the same structural-analysis techniques used in safety analysis. Once generated, a series of fault-optimization tasks reduce the fault list to a minimal problem set. The first optimization identifies the logic contained within the safety-critical cone of influence, eliminating logic outside the cone that can’t affect the safety goals (Fig. 4).
4. The fault-list optimization flow consists of these three steps.
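The cone-of-influence step amounts to a backward traversal from the safety-critical outputs through the fan-in of each node. A minimal sketch, assuming a netlist represented as a dictionary of hypothetical node names mapped to their drivers, might look like this:

```python
from collections import deque

# Hypothetical netlist: node -> list of nodes that drive it (fan-in)
fanin = {
    "safe_out": ["and1"],
    "and1": ["ff_a", "ff_b"],
    "ff_a": ["in0"],
    "ff_b": ["in1"],
    "dbg_out": ["ff_dbg"],
    "ff_dbg": ["in2"],
}


def cone_of_influence(fanin, roots):
    """All nodes that can reach any root, found by breadth-first backward traversal."""
    seen = set(roots)
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for driver in fanin.get(node, []):
            if driver not in seen:
                seen.add(driver)
                queue.append(driver)
    return seen


all_nodes = set(fanin) | {d for drivers in fanin.values() for d in drivers}
coi = cone_of_influence(fanin, ["safe_out"])

fault_sites = sorted(coi)          # nodes kept in the fault list
pruned = sorted(all_nodes - coi)   # nodes that cannot affect the safety-critical output
print(fault_sites)
print(pruned)
```

Here the debug path is dropped from the fault list because nothing on it can reach the safety-critical output.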
Using the same structural-analysis algorithms deployed during safety analysis, the fault list is further optimized using safety-mechanism-aware analysis, trimming the list to contain only faults that contribute directly to diagnostic coverage. Lastly, fault collapsing is performed to remove any logically equivalent faults. For example, a stuck-at-0 fault on the output of an AND gate is equivalent to any of its inputs being stuck-at-0, resulting in a reduction in the number of fault nodes for the gate. Fault optimization is a critical first step in reducing the scope of the fault-injection campaign.
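The AND-gate example can be checked by brute force. The sketch below evaluates a two-input AND gate under each of its six stuck-at faults, groups faults that produce the same faulty truth table, and keeps one representative per group. It illustrates equivalence collapsing on a single gate only; the fault naming is an assumption made for the example.

```python
from collections import defaultdict
from itertools import product


def and2_with_fault(a, b, fault):
    """Evaluate a 2-input AND gate with one stuck-at fault given as (site, value)."""
    site, value = fault
    if site == "a":
        a = value
    if site == "b":
        b = value
    out = a & b
    if site == "out":
        out = value
    return out


faults = [(site, v) for site in ("a", "b", "out") for v in (0, 1)]
inputs = list(product((0, 1), repeat=2))

# Faults with identical faulty truth tables are logically equivalent.
classes = defaultdict(list)
for fault in faults:
    signature = tuple(and2_with_fault(a, b, fault) for a, b in inputs)
    classes[signature].append(fault)

collapsed = [group[0] for group in classes.values()]
print(f"{len(faults)} faults collapse to {len(collapsed)}")
for group in classes.values():
    print(group)
```

The three stuck-at-0 faults land in one class, so only one of them needs to be simulated.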
Once optimized, the fault list is used to inject random hardware faults into the design. This second part of a fault campaign is often the most time-consuming phase, as today’s design complexity can result in hundreds of thousands of design nodes in the optimized fault list.
The primary challenge of fault injection lies in establishing an approach capable of closing on a fault campaign within an acceptable timeframe. The goal is to inject faults, simulate the effect on design behavior, and ultimately validate the estimated diagnostic coverage calculated during safety analysis.
Fault simulation leverages functional stimulus to perform fault injection. The simulator identifies fault-injection points and injects faults when functional deviations are probable. Once injected, the fault is propagated until the fault is classified as safe/detected, single point, residual, multi-point latent, or multi-point detected/perceived (Fig. 5).
5. These cones of influence are used to classify systems for single- and dual-point faults.
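To make the classification step concrete, the sketch below injects stuck-at faults into a toy design and compares each faulty run against the fault-free one. It is deliberately simplified: the design, the net names, and the three buckets used here (safe, detected, residual) are assumptions for illustration and do not reproduce the full set of ISO 26262 fault classes or the behavior of a production fault simulator.

```python
from itertools import product


def design(a, b, c, fault=None):
    """Toy design: a duplicated AND with a comparator, then an unprotected OR stage.

    fault is (net, stuck_value) or None. Returns (out, alarm).
    """
    def net(name, value):
        # Override the net's value when it is the injected fault site.
        return fault[1] if fault and fault[0] == name else value

    main = net("main", a & b)
    shadow = net("shadow", a & b)   # redundant copy
    alarm = int(main != shadow)     # checker output
    net("dbg", a ^ c)               # debug net, not observable at the output
    out = net("out", main | c)
    return out, alarm


FAULTS = [(name, v) for name in ("main", "shadow", "out", "dbg") for v in (0, 1)]
STIMULUS = list(product((0, 1), repeat=3))


def classify(fault):
    alarm_seen = False
    for a, b, c in STIMULUS:
        golden_out, _ = design(a, b, c)
        out, alarm = design(a, b, c, fault)
        if out != golden_out and not alarm:
            return "residual"       # output corrupted and nothing flagged it
        alarm_seen = alarm_seen or bool(alarm)
    return "detected" if alarm_seen else "safe"


results = {fault: classify(fault) for fault in FAULTS}
detected = sum(r == "detected" for r in results.values())
residual = sum(r == "residual" for r in results.values())
measured_dc = detected / (detected + residual)

for fault, label in results.items():
    print(fault, label)
print(f"Measured coverage over non-safe faults: {measured_dc:.1%}")
```

The measured figure is what would be compared against the estimate from safety analysis; in this toy case the unprotected output stage is what drags it down.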
To achieve maximum performance, three levels of concurrency are deployed. First, faults are injected using a concurrent fault-injection algorithm providing parallelism across a single thread. In concert with single-thread concurrency, faults are injected across a CPU core cluster and then further distributed across the larger machine grid. Fault management oversees the job distribution and the coalescing of the resulting data. With today’s automotive semiconductor designs resulting in hundreds of thousands of fault nodes, parallelizing the fault-injection campaign is essential to meet project schedules.
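The job-distribution and coalescing role of fault management can be sketched with Python's standard library. This example splits a fault list into batches, runs them on a small worker pool, and merges the per-batch classification counts. The `simulate_batch` function is a placeholder with a made-up classification rule, and the thread pool merely stands in for the core cluster and machine grid described above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def simulate_batch(batch):
    """Placeholder for a fault-simulation job; returns classification counts.

    The rule below is arbitrary and exists only so the sketch produces data.
    """
    return Counter("detected" if fault_id % 4 else "residual" for fault_id in batch)


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


fault_list = list(range(1000))      # stand-in for an optimized fault list
batches = chunk(fault_list, 128)

# Distribute batches across workers, then coalesce the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = sum(pool.map(simulate_batch, batches), Counter())

print(len(batches), dict(totals))
```

Because each fault is simulated independently, the batches need no coordination beyond the final merge, which is what makes the campaign straightforward to spread across many machines.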
Bottom-up safety analysis is important in reducing the number of iterations throughout the workflow. In addition to validating expert-driven judgment, it provides critical guidance during design enhancement and fault verification. Automating the three pillars of the random-fault workflow (safety analysis, safety insertion, and safety verification) delivers a seamless and efficient approach to random-fault mitigation and verification.
Jacob Wiltgen is the Functional Safety Solutions Manager for Mentor, a Siemens Business.