Users of ICs at high altitude are aware of the potential for errors caused by the presence of high-energy neutrons. Now there is concern such phenomena could compromise the reliability of systems at ground level.
Research, focusing on the phenomenon as it affects the SRAM cells used to program some FPGAs, suggests there may be a problem at ground level. Historically, interest in neutron-induced errors has focused on data corruption in memory devices. Such corruption occurs when an energetic neutron from the atmosphere interacts with the crystal lattice of an IC in a so-called single-event upset (SEU). This may change the state of a flip-flop or memory element and is often referred to as a 'soft error', because there is no lasting damage to the device itself.
The problem for the designer is more serious when neutrons interact with the memory elements used to configure SRAM-based FPGAs. A change in configuration memory can mean a change in functionality, otherwise known as a single-event functional interrupt (SEFI). Because the mitigation techniques used to correct configuration upsets are not necessarily instantaneous, thousands or even millions of clock cycles may pass before the problem can be detected, let alone corrected. During this time the FPGA may behave in an unpredictable and uncontrollable manner. For this reason, neutron errors that affect FPGA configuration memory are referred to as 'firm errors'. The good news is that redundancy in FPGA architectures can stop many SEUs from causing a SEFI. The bad news is that a configuration change may cause the device to latch up. These scenarios are well known, and have led to the deployment of a variety of error-checking and system-level redundancy schemes to prevent malfunction and damage. Designers of high-reliability systems (and of those for use at altitude, where neutron flux is higher) have gone further, eliminating SRAM-based FPGAs from such applications in favour of flash- or antifuse-based devices, which are not susceptible to neutron-induced configuration upsets.
Concerns have been voiced that FPGAs may be susceptible to neutron collisions, even at ground level. iRoC Technologies recently conducted a series of tests to determine the failure rates of five different FPGA architectures from three vendors, using three different programming techniques. As a part of their research, they also reviewed how often an SEU might be expected to cause a SEFI for each architecture.
Testing in the natural environment on the Earth's surface is a slow process, due to the relatively low neutron flux: large numbers of parts would need to be operated for long periods of time in order to accumulate statistically significant numbers of configuration upsets, and testing multiple architectures and multiple programming technologies becomes impractical. The iRoC team therefore developed an accelerated test strategy using facilities at the Los Alamos Neutron Science Center (LANSCE) at Los Alamos National Laboratory in New Mexico.
The tests followed the JESD-89 methodology, and looked at 1-million-gate Axcelerator AX1000 (150nm antifuse) and ProASIC PLUS APA1000 (220nm flash) devices from Actel; 3-million-gate (150nm SRAM) and 1-million-gate (90nm SRAM) devices from Xilinx; and a 1-million-gate (130nm SRAM) device from Altera.
Test boards at LANSCE were loaded with FPGAs and exposed to a neutron beam which closely matches the energy spectrum of naturally-occurring background neutron radiation. The quantity of neutrons each device was exposed to exceeds the quantity to be expected after 7,600 years of exposure to background radiation at sea level. For those FPGAs which did not exhibit functional failures, the tests were conducted for approximately ten times the duration of tests on parts exhibiting functional failures. In addition, configuration files were periodically read-back from two of the SRAM FPGA types, to detect configuration upsets.
The testing demonstrated that antifuse- and flash-based FPGAs are not subject to functional failure due to neutron effects. All three SRAM-based FPGA architectures proved vulnerable to neutron-induced configuration loss, with vulnerability clearly increasing as process geometry shrinks. Where measurement was possible, the tests suggested that between one in ten and one in five configuration SEUs actually produce SEFIs. The full results are documented in Table 1 (visit http://www.actel.com/products/rescenter/ser/index.html for a copy of the full report).
At this point it is worth noting the wide variation with altitude of the figures given for 'equivalent atmospheric exposure'. This has very practical implications if our aim is to understand whether neutron upsets are relevant in real-world, ground-based applications. Natural neutron flux increases with both altitude and latitude, yet high-altitude population centres still require high-reliability electronics for networking, telecoms, medical and industrial-control applications.
Table 1 measures failure rates in FITs (failures in time), rather than as mean time between failures (MTBF). A failure rate of 1 FIT is defined as one failure in 10⁹ device-hours. Typically, integrated circuits have FIT rates lower than 100; in high-reliability applications, component engineers will look for FIT rates in the region of 10 to 20. The reason for using FIT rates is twofold: first, today's target reliability rates are more manageable when expressed in FIT; secondly, the FIT rates of independent components simply add, making them easy to combine into system-level FIT or MTBF figures.
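The FIT-to-MTBF conversion described above can be sketched as a pair of small helpers, assuming a constant failure rate (the function names are illustrative, not from the report):

```python
def fit_to_mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FIT (failures per 10^9 device-hours)
    to mean time between failures in hours."""
    return 1e9 / fit


def combine_fit(*fits: float) -> float:
    """FIT rates of independent components add, which is what makes
    them convenient for building up system-level figures."""
    return sum(fits)


# A device with a FIT rate of 100 has an MTBF of 10^7 hours.
print(fit_to_mtbf_hours(100))
# Two such devices together: combined FIT of 200, so the MTBF halves.
print(fit_to_mtbf_hours(combine_fit(100, 100)))
```

Working in FIT until the final step, and converting to MTBF only at the end, avoids repeatedly inverting and re-inverting large numbers.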
To understand the implications of these FIT rates, it is informative to consider a system-level scenario, such as a typical Sonet/SDH network. This might consist of 128 optical cross-connect chassis. Each chassis might have 32 line cards, and each line card would typically incorporate four 1M-gate SRAM FPGAs. So, the network includes a total of 16,384 FPGAs.
Some nodes will be located at high altitude, in locations such as Denver. Referring to the neutron test results in Table 1, we see that the FIT rate for a single 1M-gate SRAM FPGA at 5,000 feet is 1,100. To obtain the FIT rate for the entire population, we multiply the individual FIT rate by the number of devices: 1,100 × 16,384 = 18,022,400 FIT. To obtain the MTBF in hours, we divide 10⁹ hours by the population FIT rate, which works out to 55.5 hours. Carrier-class telecommunication infrastructure is generally expected to meet or exceed 'five nines' reliability, meaning that the system is functional 99.999% of the time. The MTBF associated with 99.999% reliability is 100,000 hours, assuming a downtime of one hour for each failure. The iRoC research thus reveals that SRAM-based programmable devices are susceptible to a significant level of neutron-induced failures at ground level. Just as designers have eliminated SRAM-based devices from high-altitude and space applications, so they now need to be aware of the impact of neutron effects on system reliability, particularly in multiple-chip applications, high-reliability implementations and network environments.
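The arithmetic in this worked example can be reproduced in a few lines of Python (the figures come from the scenario above; the variable names are ours):

```python
# Network population from the Sonet/SDH scenario.
chassis = 128
line_cards_per_chassis = 32
fpgas_per_line_card = 4
fpga_count = chassis * line_cards_per_chassis * fpgas_per_line_card  # 16,384

# Per-device FIT rate for a 1M-gate SRAM FPGA at 5,000 feet (Table 1).
fit_per_fpga = 1_100
population_fit = fpga_count * fit_per_fpga  # 18,022,400 FIT

# MTBF in hours: 10^9 device-hours divided by the population FIT rate.
mtbf_hours = 1e9 / population_fit

print(f"{fpga_count} FPGAs, population FIT {population_fit:,}, "
      f"MTBF {mtbf_hours:.1f} h")
# Roughly 55.5 hours, far short of the 100,000-hour 'five nines' target.
```

A network-level failure every couple of days is what makes the per-device FIT rate, modest in isolation, a system-level concern.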