There are a host of hardware accelerators for machine-learning (ML) models. One example: Renesas has come up with a ternary SRAM-based system to accelerate convolutional-neural-network (CNN) computations. CNNs are a class of deep-neural-network (DNN) models.
One of the challenges with ML is moving around input and output data as well as the weights involved in the calculations. Various approaches have been used to optimize data movement. For instance, Flex Logix’s NMAX keeps weights in local memory.
The ternary approach uses two single-bit memory cells to encode roughly 1.5 bits of information, representing a value of -1, 0, or 1 (Fig. 1). Renesas' processing-in-memory (PIM) method takes advantage of these ternary values.
1. Renesas’ hardware can take advantage of a ternary memory cell that stores a value of -1, 0, or 1.
The basic ternary storage can be combined into multibit solutions. Blocks can be combined for different accuracies, allowing users to optimize the balance between accuracy and power consumption (Fig. 2).
2. The hardware can combine ternary calculations into multibit operations.
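One way such a combination could work (a sketch; the article doesn't detail the actual composition scheme) is balanced-ternary place notation, where n ternary digits cover the symmetric range ±(3^n − 1)/2:

```python
def from_balanced_ternary(digits):
    """Combine ternary digits (most significant first) into an integer."""
    value = 0
    for d in digits:
        assert d in (-1, 0, 1)
        value = value * 3 + d
    return value

# Two digits span -4..+4; three digits span -13..+13.
assert from_balanced_ternary([1, -1]) == 2       # 1*3 + (-1)
assert from_balanced_ternary([-1, 1, 1]) == -5   # -9 + 3 + 1
```

Adding digits widens the representable range, which mirrors the article's point: more combined blocks buy accuracy at the cost of more cells and more power.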
Conventional memories read their contents using analog-to-digital converters (ADCs). This is a robust approach, but the ADCs consume both die area and power. Renesas instead combined a 1-bit sense-amplifier comparator with replica cells, in which the current can be controlled flexibly, to develop a high-precision memory data-readout circuit (Fig. 3). A “zero-detector” stops the comparators when it detects that the multiply-accumulate (MAC) result is zero.
3. A “zero-detector” stops the comparators when it detects that the MAC result is zero.
This strategy takes advantage of the fact that very few of a neural network’s nodes (neurons) are activated during operation, roughly 1%. Stopping the readout circuits for nodes that aren’t activated lowers power even further. As a result, power is significantly reduced while maintaining accuracy.
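The zero-skipping idea can be sketched in a few lines (names and structure are illustrative, not Renesas' circuit): readout only runs for nodes whose MAC result is nonzero, so with ~1% activation almost all readout work is skipped:

```python
def sparse_readout(mac_results, read_fn):
    """Run the (power-hungry) readout only where the MAC result is nonzero.

    mac_results: accumulated value per node (neuron).
    read_fn: stand-in for the sense-amplifier/comparator readout.
    """
    outputs = []
    skipped = 0
    for r in mac_results:
        if r == 0:            # the "zero-detector" gates the comparator
            outputs.append(0)
            skipped += 1
        else:
            outputs.append(read_fn(r))
    return outputs, skipped

outs, skipped = sparse_readout([0, 5, 0, 0, -3], lambda r: r)
assert outs == [0, 5, 0, 0, -3] and skipped == 3
```

With activation rates around 1%, the `skipped` count would dominate, which is where the power savings come from.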
One downside of not using ADCs is that the storage isn’t as robust; part of the issue stems from process variations during chip manufacturing. To address the resulting calculation errors, Renesas implemented multiple SRAM calculation blocks and characterized their variation (Fig. 4). Because only a small fraction of all nodes is normally activated, nodes can be allocated selectively to the calculation blocks with minimal manufacturing process variation. This reduces calculation errors to a level where they can essentially be ignored.
4. Multiple SRAM calculation blocks with minimal manufacturing variations are implemented to address calculation errors due to such variations.
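The allocation step can be pictured as a greedy assignment: since only a few nodes are active, each can be mapped to one of the lowest-variation calculation blocks first (block names and the variation metric here are hypothetical):

```python
def allocate_nodes(active_nodes, block_variation):
    """Assign each active node to the lowest-variation block still in line.

    active_nodes: ids of activated nodes (a small fraction of all nodes).
    block_variation: dict of block id -> measured process-variation metric.
    """
    # Rank blocks so those with minimal manufacturing variation come first.
    ranked = sorted(block_variation, key=block_variation.get)
    return {node: ranked[i % len(ranked)]
            for i, node in enumerate(active_nodes)}

# Example: two active nodes land on the two best of four blocks.
variation = {"blk0": 0.9, "blk1": 0.2, "blk2": 0.5, "blk3": 0.7}
mapping = allocate_nodes(["n7", "n42"], variation)
assert mapping == {"n7": "blk1", "n42": "blk2"}
```

As long as the number of active nodes stays well below the number of low-variation blocks, every calculation can run on well-behaved hardware.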
Renesas engineers created a chip to demonstrate the ternary PIM approach (Fig. 5). The 12-nm technology chip contains four clusters, each containing the PIM and logic along with conventional SRAM storage. Each cluster can operate independently, so the system can manage up to four CNN models at one time. The chip can handle up to 128 CNN layers. PIM storage is 4.74 Mb and the SRAM stores 12.58 Mb. The chip delivers 8.8 TOPS at 1 W, or 8.8 TOPS/W.
5. Renesas engineers created a chip to demonstrate the ternary PIM approach with four clusters. Each cluster can operate on a different ML model.
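The four-cluster arrangement can be pictured as independent workers, each owning its own PIM and SRAM and running its own model. This is a conceptual sketch, not the chip's actual firmware interface:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_cluster(cluster_id, model_name, frames):
    """Stand-in for one cluster executing one CNN model independently."""
    return (cluster_id, model_name, [f"{model_name}:{f}" for f in frames])

# Up to four models in flight at once, one per cluster.
models = ["detect", "classify", "segment", "ocr"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_on_cluster, range(4), models, [["img0"]] * 4))
assert len(results) == 4 and results[0][1] == "detect"
```

Because the clusters share nothing at runtime, adding a model doesn't slow the others, which is the practical benefit of the independent-cluster design.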
The chip has been used to execute a number of models, including one that recognizes handwritten characters. It maintained a recognition accuracy of over 99%. The chip is only a prototype, but it highlights how different approaches to ML acceleration can deliver higher performance while lowering power requirements.