Power-Efficient Processor Leverages Novel Dataflow Architecture
My master’s degree work many decades ago was on dataflow architectures. At the time, though, implementing them with conventional programming languages was beyond the hardware’s capabilities. How times have changed, and not solely for artificial intelligence (AI).
In the video (above), I talk with Efficient Computer’s CEO Brandon Lucia about the company’s Electron E1 (Fig. 1). It uses a custom fabric to implement a programmable dataflow architecture designed to run applications written in conventional programming languages like C, with the effcc Compiler hiding the underlying architecture.
The approach cuts power requirements by two orders of magnitude, even compared with ultra-low-power microcontrollers like the Texas Instruments MSP430 and Ambiq Micro’s Apollo, which use a conventional, register-based von Neumann approach. Such power efficiency matters for embedded-compute and Internet of Things (IoT) applications, where reducing power consumption can extend a system’s lifetime or free up headroom for new application features.
What is a Dataflow Architecture?
Dataflow architectures aren’t new, but they’re typically implemented in a static fashion to compute a value based on inputs (Fig. 2). Data “flows” through computational units.
The code for an example like this boils down to a few arithmetic assignments. A minimal C sketch of that kind of computation might look like this:
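// Hypothetical stand-in for the expression graph in Fig. 2: each operator
// becomes a node that data flows through on its way to the result.
int dataflow_example(int a, int b, int c, int d)
{
    int sum  = a + b;   // adder node
    int diff = c - d;   // subtractor node
    return sum * diff;  // multiplier node joining the two paths
}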
A dataflow system is either synchronous or asynchronous. In the latter case, computations occur as data arrives, generating results once all of the necessary inputs are valid. A synchronous implementation latches the data as it flows through the computational units. Things get a bit more complex when control is added to the mix: any dataflow application that’s more than just numeric manipulation needs control flow.
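To see where control flow enters the picture, consider a simple conditional. In a dataflow graph, it becomes a compare node steering a select node rather than a branch in an instruction stream (a generic illustration, not anything specific to the Electron E1):

// A saturating add: the comparison decides which value flows to the output.
// In a dataflow implementation this is a compare node feeding a select node,
// not a conditional branch that redirects an instruction stream.
int saturating_add(int x, int y, int limit)
{
    int sum = x + y;
    return (sum > limit) ? limit : sum;
}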
Dataflow implementations are often found in ASICs and FPGA systems in a more static form, where an algorithm is fixed and applied to a data stream or a fixed set of inputs. This is generally much more efficient in terms of power and performance because operations can be done in parallel and data movement is point-to-point.
A conventional CPU, on the other hand, must move data to and from a register file. On top of that, multiple instructions are needed to specify movement and calculations in a sequential fashion. ASICs and FPGAs can often run at a much lower clock speed while delivering higher throughput than a processor running a sequential program.
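To make that cost concrete, here is roughly what a single multiply-accumulate on in-memory operands involves on a register-based machine. The instruction sequence in the comments is a typical load/store pattern, not the output of any particular compiler:

// One multiply-accumulate on operands that live in memory.
// On a conventional CPU this typically expands to a sequence such as:
//   load  r1, a       ; move data from memory into the register file
//   load  r2, b
//   mul   r3, r1, r2  ; compute
//   load  r4, acc
//   add   r4, r4, r3
//   store acc, r4     ; move the result back out
// A static dataflow implementation instead wires the multiplier's output
// directly to the adder's input, with no register-file round trips.
void mac(int *acc, const int *a, const int *b)
{
    *acc += (*a) * (*b);
}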
Why Efficient Computer’s Electron E1 Processor Design is So Radical
What makes Efficient Computer’s Electron E1 stand out is the programmable nature of the system’s dataflow. The chip has a RISC-V processor (RV32iac+zmmul), but it’s there to manage the computational fabric that makes up the bulk of the system. The RISC-V processor powers down when it’s not needed.
The radical aspect comes into play with Efficient Computer’s compiler. It takes conventional code, such as a C program, and maps it onto a computational fabric made up of processing elements (PEs) and a network-on-chip (NoC). The NoC feeds the PE inputs and routes the PE outputs as needed. Essentially, a PE’s outputs are connected to the inputs of the next PE in a calculation.
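As a simple illustration of the kind of code that maps naturally onto such a fabric, consider a dot product. How effcc actually places it is up to the compiler, but conceptually each operation in the loop body becomes a PE, and the NoC carries the intermediate values between them:

// Conceptually, the loads, the multiply, and the accumulate each land on a PE,
// with the NoC routing one PE's output straight to the next PE's input.
// (Illustrative only; the real mapping is chosen by the effcc compiler.)
int dot_product(const int *x, const int *y, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}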
One might say that sounds like an FPGA. The major difference is the type of function provided by the underlying fabric: an FPGA works at the logic-gate level, while the Electron E1 works at the level of program data types such as integers and floating-point numbers.
Some FPGAs do include higher-level hard blocks that provide DSP-style functionality, but these tend to be the exception, with the bulk of an FPGA comprising very basic lookup tables (LUTs). FPGA fabrics also don’t use a NoC, although some advanced FPGAs include a NoC to connect the fabric with other on-chip components.
So, now we have an application mapped to a PE/NoC fabric. To make things more efficient, there are different types of PEs rather than a single complex PE that handles every function and data type; for example, load/store (LS) PEs deal with memory while others handle arithmetic and control flow. Specialization applies to the interconnect as well: there are actually three different NoCs, for configuration, control, and data.
It would be great if an entire application could fit into the fabric, but two issues crop up. One is that an application rarely runs all of its code at any given time, even when its parallelism is easy to expose. The other is that large applications may simply not fit onto the fabric all at once.
The answer is to apply an idea from GPUs and implement computation kernels or nuggets of code that will run for a period of time and then be replaced by other kernels. This is especially effective with parallel-processing applications like graphics or AI, but it can work well in other situations like handling interrupts. It all depends on the overhead needed to set up the system and how long it operates.
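At the source level, that amounts to breaking an application into kernel-sized functions that take their turn on the fabric. The decomposition below is purely hypothetical (the function names and the split are mine; how kernels are actually delimited is up to Efficient Computer’s toolchain):

#include <stdint.h>

// Hypothetical kernels: each holds the loop-heavy code that would be mapped
// onto the fabric and swapped in when its turn comes.
static void filter_kernel(const int16_t *raw, int16_t *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = raw[i] / 2;                  // placeholder filtering stage
}

static void feature_kernel(int16_t *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = (int16_t)(buf[i] * buf[i]);  // placeholder feature stage
}

// Only the active kernel's dataflow graph needs to be resident on the fabric;
// the next kernel's configuration replaces it as the pipeline advances.
void process_sample_block(const int16_t *raw, int16_t *out, int n)
{
    filter_kernel(raw, out, n);
    feature_kernel(out, n);
}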
Efficient Computer’s compiler handles the details of converting the source code into a dataflow-and-control-flow implementation that runs on the fabric. It also manages how the pieces are loaded into the fabric. This is a lot harder than it sounds, and some interesting issues emerge when it comes to debugging, but we’ll have to leave that discussion for another time.
Listed below are some technical papers that provide more details about the approach taken by Efficient Computer with its Electron E1. The architectures and chips described in the papers are precursors to the platform.
MANIC: An Energy-Efficient Architecture for Ultra-Low-Power Embedded Systems
RipTide: A programmable, energy-minimal dataflow compiler and architecture
Monza: An Energy-Minimal, General-Purpose Dataflow System-on-Chip for the Internet of Things
How the Electron E1 is Able to be So Power-Efficient
Part of the trick when designing the dataflow system was determining where conventional processors spend a lot of time and energy. That turns out to be moving data to and from the register file and handling the instruction flow: the entire register file must be active to access a single entry, and instruction pipelines have a similar issue. A spatial dataflow processor is designed so that these operations aren’t required.
As might be expected, the chip has low-power and deep-sleep modes. It can also run the fabric at 40 or 200 MHz, providing 5.4 and 21.6 GOPS, respectively. The Electron E1 has 4 MB of MRAM with DMA support, 3 MB of ultra-low-power SRAM, and a 128-kB ultra-low-power cache.
The effcc Compiler handles C, but other languages like C++, Python, and Rust are on the to-do list, as is AI support in the form of LiteRT and ONNX.
The peripheral complement, which is similar to that of typical microcontrollers, includes six QSPI, six UART, and six I2C ports plus 72 GPIO pins. A real-time clock (RTC) is on-board as well, along with clock generation and power conversion via an integrated LDO and buck converter. The power supply is 1.8 V.
Looking under the hood, the Electron E1 is radically different from any other processor on the market. Like a good compiler, effcc hides the underlying architecture from programmers, allowing everyone to benefit from the hardware’s power efficiency.
It will be interesting to see how well the chip is adopted and how the platform grows from here. You can see how the chip operates using the simulator at the Electron E1 Playground website, which includes a presentation of the Fabric as well as a debugger that also shows connections (Fig. 3).