Low-Power Play: GAP8 Weds Multicore RISC-V with Machine Learning

GreenWaves’ GAP8 brings low-power machine learning to embedded systems using an eight-core array of RISC-V processors.

Machine learning (ML) on the edge often involves convolutional neural networks (CNNs). These can run on standard processors, but delivering the required performance carries a matching cost in power. Though specialized ML hardware can significantly reduce power consumption, a programmable solution would provide a more flexible alternative.

GreenWaves Technologies brings a RISC-V-based solution to the table, building on the Parallel Ultra Low Power Platform (PULP). PULP is designed to support four different RISC-V cores: the 32-bit RISCY, Zero-riscy, and Micro-riscy, plus the 64-bit Ariane. RISCY is an RV32-IMC core with a four-stage pipeline plus DSP, SIMD, hardware-loop, bit-manipulation, and post-increment extensions. Zero-riscy uses a two-stage pipeline with an RV32-IMC base, while Micro-riscy handles RV32-EC with only 16 registers and a two-stage pipeline. Ariane supports a 64-bit architecture with memory management, allowing it to run operating systems like Linux, and it targets high-end applications.

The GAP8 (Fig. 1) features nine identical RISC-V PULP RISCY cores with memory-protection units (MPUs). Eight are organized as a compute array, and the ninth acts as a fabric controller that manages the system. The fabric controller is in its own clock and voltage domain, allowing it to shut down the array to conserve power. In addition, most of the peripherals have micro DMA support. Off-chip memory can be accessed via HyperBus, a high-speed, low-pin-count memory system. The internal clock runs up to 250 MHz.

1. The GAP8 includes an array of eight RISC-V cores that have access to a CNN accelerator along with another RISC-V core to manage the system.

The fabric controller incorporates a 16-kB data cache and a 4-kB instruction cache. The array of cores has its own shared L1 cache with a logarithmic interconnect. This includes a 64-kB data cache and 16-kB instruction cache.

Furthermore, the array integrates a hardware convolution engine (HWCE) that can generate a 5-by-5 convolution or a pair of 3-by-3 convolutions with 16-bit operands in a single cycle. The HWCE is connected to the L1 cache using multiple load/store units. It operates in parallel with the RISC-V cores and provides a significant performance boost to CNN ML applications while reducing power requirements. The system enables always-on use of ML even in low-power, battery-operated applications.
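
To make the arithmetic behind that claim concrete, here is a minimal Python sketch of a 5-by-5 convolution over 16-bit integer operands. It only illustrates the operation; it isn't the HWCE's programming interface, and the function names (`convolve_valid`, `check_int16`, `macs_per_output`) are hypothetical. The article doesn't describe the engine's fixed-point format, accumulator width, or rounding, so the sketch assumes plain signed 16-bit inputs and an unbounded accumulator.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def check_int16(matrix):
    """Raise ValueError if any entry falls outside the signed 16-bit range."""
    for row in matrix:
        for value in row:
            if not INT16_MIN <= value <= INT16_MAX:
                raise ValueError(f"{value} does not fit in 16 bits")


def convolve_valid(image, kernel):
    """Slide a square kernel over the image without padding (CNN-style).

    Each output value is the sum of k*k products of 16-bit operands.
    """
    check_int16(image)
    check_int16(kernel)
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        out_row = []
        for c in range(out_cols):
            # Assumption: Python's unbounded int stands in for whatever
            # accumulator width the real hardware uses.
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


def macs_per_output(kernel_size):
    """Multiply-accumulate operations needed for one output value."""
    return kernel_size * kernel_size


# A 6-by-6 image of ones and a 5-by-5 kernel of ones give a 2-by-2 output.
image = [[1] * 6 for _ in range(6)]
kernel5 = [[1] * 5 for _ in range(5)]
result = convolve_valid(image, kernel5)
print(result)
print(macs_per_output(5), 2 * macs_per_output(3))
```

A 5-by-5 kernel needs 25 multiply-accumulates per output value, and a pair of 3-by-3 kernels needs 18. That is the amount of work the HWCE is said to complete in a single cycle.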

Developers can trade off performance versus power. For example, one application was run at 15.4 MHz, taking 99.1 ms to complete and using only 3.7 mW of power. The same application running at 175 MHz completed in 8.7 ms using 70 mW of power. This was without using the HWCE, which would have further reduced power requirements. For comparison, the same application running on a Cortex-M7 at 216 MHz took 99.1 ms and used 60 mW of power, matching the run time of the 15.4-MHz GAP8 at roughly 16 times the power.
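
The tradeoff is easier to compare as energy per run, which is simply power multiplied by time. The short Python sketch below does that arithmetic for the three figures quoted above. It assumes the reported power is the average over the whole run, which the article doesn't state, and the names `energy_mj` and `runs` are illustrative only.

```python
def energy_mj(power_mw, time_ms):
    """Energy in millijoules: mW * ms gives microjoules, so divide by 1000."""
    return power_mw * time_ms / 1000.0


# (average power in mW, run time in ms), taken from the figures in the text.
runs = {
    "GAP8 @ 15.4 MHz": (3.7, 99.1),
    "GAP8 @ 175 MHz": (70.0, 8.7),
    "Cortex-M7 @ 216 MHz": (60.0, 99.1),
}

energy = {name: energy_mj(p, t) for name, (p, t) in runs.items()}

for name, value in energy.items():
    print(f"{name}: {value:.3f} mJ per run")

# How much more energy the Cortex-M7 run uses than the slow GAP8 run.
ratio = energy["Cortex-M7 @ 216 MHz"] / energy["GAP8 @ 15.4 MHz"]
print(f"Cortex-M7 vs. GAP8 @ 15.4 MHz: {ratio:.1f}x")
```

Under that assumption, the slow GAP8 run comes out at roughly 0.37 mJ, the fast run at about 0.61 mJ, and the Cortex-M7 run at about 5.9 mJ. Running the GAP8 faster finishes sooner but costs more energy per run, which is the tradeoff the article describes.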

The GAP8 is designed for very-low-power operation. It has its own dc-dc converter and can switch to a low-dropout (LDO) regulator when running in very-low-power modes. Deep-sleep mode with a real-time clock (RTC) uses a mere 70 nA, while a data-acquisition mode consumes only 40 µA plus whatever the active peripherals draw. The system can perform many pre-analysis chores using only 1 mW, and full inference needs just 10 mW.

The system has its own boot ROM and there are e-fuses to implement additional security protocols. The chip comes in an 84-pin aQFN package.


The GAPUINO (Fig. 2) is one way to get started with GAP8. It’s supported by open-source development tools and an SDK that handles the RISC-V extensions. The board is also compatible with ARM’s Mbed OS. As a result, a GAP8 solution could be incorporated into an Mbed cloud-based solution.

2. The GAPUINO features a GAP8 in an Arduino form factor.

GAPUINO can operate in master or slave mode. It’s a standalone host in master mode. In slave mode, the GAP8 can be a peripheral processor to an Arduino-compatible host. The board has 256 Mb of SPI flash, an I2C EEPROM, and a HyperBus-based flash/DRAM chip with 512 Mb of flash and 64 Mb of DRAM. It can be battery powered or run off USB or external power sources. GreenWaves also has an Arduino-compatible board with four MP34DT01 microphones, a VL53 time-of-flight sensor, IR sensor, pressure sensor, light sensor, temperature and humidity sensor, and 6-axis accelerometer/gyroscope.

GreenWaves also provides CNN graph translators as well as code generators for common algorithms like FFTs, FIR filters, and CNN layers. Another tool is the GAP8 AutoTiler, which allows developers to create kernels that can be combined using parallelization, vectorization, and data-flow support. It supports automatic code generation for data flow using frameworks like OpenMP.

GAP8 targets a wide range of ML and non-ML embedded applications, from smart cameras to wearable consumer devices. It’s able to handle chores ranging from keyword recognition in audio streams to object recognition in video streams. The ability to scale the amount of performance in use allows for more power-efficient operation, which will be required for battery-based solutions.

