ITC keynoter comments on 75-GFLOPS/W compute target

Performance and reliability are increasingly interdependent, according to Pradip Bose, manager, Department of Power- and Reliability-Aware Microarchitectures, IBM Thomas J. Watson Research Center. In a Thursday morning International Test Conference keynote address titled “Efficient Resilience in Future Systems: Design and Modeling Challenges,” he described an IBM-led DARPA PERFECT (Power Efficiency Revolution for Embedded Computing Technologies) initiative, which aims to achieve reliable compute power of 75 GFLOPS/W. IBM is working with Stanford, Harvard, and UVA on the project.

Bose defined fault tolerance as the ability to provide service despite hard or soft faults generated inadvertently or maliciously. Classical fault tolerance, he said, refers to tolerance to faults that conform to particular fault models, but there is a need to move beyond this definition to contend with faults that weren't anticipated in the development of initial specifications. He described energy-secure systems that are resilient to corner-case scenarios or attacks.

He noted that in traditional designs, power can increase without a corresponding increase in instructions per cycle (IPC), leading to the power/performance wall. He added that special-purpose, workload-optimized, throughput-oriented high-performance-computing (HPC) chips quadruple in performance every two years, while general-purpose processors only double in performance over the same period. He cited a target of 50 GFLOPS/W for exascale systems by 2020, which is not far from the PERFECT program's target.
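To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not something presented in the keynote: the function names are hypothetical, and the calculation assumes an exascale machine sustains exactly 10^18 floating-point operations per second and that the quoted growth rates hold steadily.

```python
EXAFLOPS = 1e18  # floating-point operations per second (assumed exascale rate)


def power_megawatts(flops, gflops_per_watt):
    """Power in MW needed to sustain `flops` at a given efficiency."""
    watts = flops / (gflops_per_watt * 1e9)
    return watts / 1e6


def relative_gain(years, factor_per_two_years):
    """Cumulative performance gain if it multiplies by a fixed factor every two years."""
    return factor_per_two_years ** (years / 2)


exascale_mw = power_megawatts(EXAFLOPS, 50)  # at the cited 50 GFLOPS/W target
perfect_mw = power_megawatts(EXAFLOPS, 75)   # at the PERFECT 75 GFLOPS/W target

# Gap between chips that quadruple and chips that double every two years,
# over an illustrative six-year span.
gap_after_six_years = relative_gain(6, 4) / relative_gain(6, 2)

print(f"{exascale_mw:.1f} MW at 50 GFLOPS/W, {perfect_mw:.1f} MW at 75 GFLOPS/W")
print(f"Special-purpose lead after six years: {gap_after_six_years:.0f}x")
```

Under these assumptions, an exascale system at 50 GFLOPS/W would draw about 20 MW, and the same workload at 75 GFLOPS/W would need roughly a third less.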

He then described a reliability wall. Resilient systems, he said, can roll back to a known-good ("golden") state on encountering an error. Unfortunately, as the number of processors increases, the number of rollbacks can also increase, and an application may be unable to run to completion. The system's mean time between failures (MTBF) can shrink to minutes.
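A rough sketch shows why MTBF collapses as systems grow. It uses a standard textbook simplification that was not part of the keynote: nodes fail independently with exponentially distributed lifetimes, so the system failure rate is the sum of the node failure rates. The function names and the node MTBF and node count below are hypothetical values chosen for illustration.

```python
import math


def system_mtbf_hours(node_mtbf_hours, num_nodes):
    """System MTBF when any single node failure counts as a system failure.

    Assumes independent, exponentially distributed node failures.
    """
    return node_mtbf_hours / num_nodes


def prob_segment_completes(segment_hours, mtbf_hours):
    """Probability that a run of `segment_hours` sees no failure (same model)."""
    return math.exp(-segment_hours / mtbf_hours)


# Illustrative assumptions: each node fails once every 5 years on average,
# and the machine has 100,000 nodes.
node_mtbf = 5 * 365 * 24  # hours
mtbf = system_mtbf_hours(node_mtbf, 100_000)

print(f"System MTBF: {mtbf * 60:.1f} minutes")
print(f"Chance a 1-hour segment finishes without a rollback: "
      f"{prob_segment_completes(1.0, mtbf):.1%}")
```

With these made-up inputs the system MTBF comes out to under half an hour, so any stretch of work longer than that is more likely than not to be interrupted by a rollback.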

Approaches to solving such problems, he said, include implementing cores augmented with parity checks, but a research challenge lies in determining where to insert error-checking functions, and how many.
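As a generic illustration of what a parity check buys, and what it misses, the following minimal sketch models a single even-parity bit protecting a data word. It is not a description of the IBM design; the function names and the sample word are hypothetical.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def protect(word: int) -> tuple[int, int]:
    """Pair a data word with its parity bit, as a parity-protected latch would."""
    return word, parity_bit(word)


def passes_check(word: int, stored_parity: int) -> bool:
    """True if the word still agrees with the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


word, p = protect(0b1011_0010)

single_flip = word ^ 0b0000_0100  # one bit upset, e.g. by a soft error
double_flip = word ^ 0b0001_0100  # two bits upset

print(passes_check(word, p))         # unmodified word
print(passes_check(single_flip, p))  # single-bit error
print(passes_check(double_flip, p))  # double-bit error
```

A single parity bit flags any odd number of flipped bits but is blind to an even number, which is one reason the placement and quantity of checkers becomes a design trade-off against area and power.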

He next commented on the power wall. Engineers are pursuing extreme measures to reduce power consumption, he said, but such measures (lower voltages, for example) can increase soft error rates, although on the plus side they may help prevent hard errors.

The PERFECT program's goal of 75 GFLOPS/W will probably be achievable at the 7-nm process node, he said. To overcome the challenges of power and reliability, he concluded, cross-layer modeling and optimization is the key.

A guiding principle in the IBM-led effort, he said, is informed by a quote from Albert Einstein: “Everything should be made as simple as possible, but not simpler.”
