NovuMind is another startup with its eyes on inference at the edge, leveraging dedicated silicon designed to deliver high inference throughput with minimal power. It's not alone in this space: many GPU and FPGA machine-learning (ML) platforms are available, but fewer dedicated inference platforms are shipping at this point. Many of these target specific applications, such as Intel's Movidius series. The advantage of the newer chips is that their power requirements are a few watts instead of the hundreds needed for high-end GPU boards.
Dr. Ren Wu, founder and CEO of NovuMind, says, “Until now, GPUs have powered the advances in AI, particularly around the training of deep-neural-network models from large sets of data. Once models are trained, however, the challenge is to deploy them at scale. GPUs and other processors are expensive and consume large amounts of power. Their architectures are optimized for two-dimensional matrix computation. While they perform well when processing large batches of data, these chips are not suited for real-time applications that require low latency. They also lack power efficiency and they tend to be very expensive. With the arrival of our NovuTensor chip, we are breaking these barriers and ushering in a new era where AI can be deployed at scale.”
1. The 400-MHz NovuTensor delivers 15 TOPS while the quad-chip PCI Express card pushes 60 TOPS.
NovuMind’s 400-MHz NovuTensor is designed to deliver 15 TOPS, with the neural engine itself consuming under 5 W; total chip power is 15 W. It’s available as a bare chip or on a short PCI Express card (Fig. 1). The card carries four chips delivering a combined 60 TOPS.
Details about the chip are a bit sparse at this point. In general, one of its advantages is its ability to perform 3D tensor calculations directly, without unfolding the data into 2D matrices. According to its patent, “The contraction engine calculates the tensor contraction by executing calculations from equivalent matrix multiplications, as the tensors were unfolded into matrices, but avoiding the overhead of expressly unfolding the tensors. The contraction engine includes a plurality of outer product units that calculate matrix multiplications by a sum of outer products. By using outer products, the equivalent matrix multiplications can be partitioned into smaller matrix multiplications, each of which is localized with respect to which tensor elements are required.”
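The outer-product decomposition the patent describes can be sketched in a few lines of plain Python. This is purely illustrative of the math, not NovuMind's implementation: a matrix product C = A × B is accumulated as one outer product per shared index, so each step touches only a single column of A and a single row of B, which is the locality property the patent is after.

```python
# Sketch: computing C = A x B as a sum of outer products, one per
# shared index k, instead of row-by-column dot products. Each outer
# product reads only one column of A and one row of B, so the data
# each step needs is highly localized. Illustrative only; not
# NovuMind's actual hardware algorithm.

def matmul_outer_products(A, B):
    """Multiply A (m x k) by B (k x n) as a sum of k outer products."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):                        # one outer product per shared index
        col = [A[i][p] for i in range(m)]     # p-th column of A
        row = B[p]                            # p-th row of B
        for i in range(m):
            for j in range(n):
                C[i][j] += col[i] * row[j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer_products(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Because each outer product is independent and local, the same idea extends to partitioning a large multiplication into small blocks that fit in on-chip memory, which is the kind of decomposition the patent claims for tensor contraction.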
Part of this approach is to minimize the amount of data movement. This is also something that Flex Logix does with its NMAX approach to neural-net processing. Moving data around takes time and power, but it’s necessary to keep the matrix multipliers fed. Most systems can’t keep these calculators running all of the time and often end up waiting for data to arrive.
2. A NovuTensor chip takes on tasks such as scaling streaming media to 4K video or 8K video using four chips.
NovuTensor can be used for most inferencing chores. It’s able to handle challenging applications like scaling streaming media to 4K video using a single chip or 8K video using four chips (Fig. 2).
The challenge for developers will be benchmarking this and other chips with their applications and neural-network models. Most benchmarks these days don’t address real-world applications all that well. Likewise, the scale and implementation of a model can have a significant impact on how it’s partitioned and implemented on a particular system. This will be especially critical for embedded systems where using the smallest, lowest-power chip can make the difference between a good, economical product and an expensive one that doesn’t work.
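The batch-versus-latency distinction behind such benchmarks can be made concrete with a short timing sketch. The "model" here is a stand-in Python function (a real benchmark would call the chip's inference runtime); the point is only that per-call latency and batched throughput are measured differently and can tell very different stories.

```python
# Sketch: measuring single-call latency vs. batched throughput.
# fake_model is a hypothetical stand-in for an inference call;
# real benchmarking would invoke the target hardware's runtime.
import time

def fake_model(batch):
    """Stand-in inference: one output per input sample."""
    return [sum(x) for x in batch]

samples = [[float(i), float(i + 1)] for i in range(1000)]

# Latency: time each single-sample call, as a real-time app sees it.
start = time.perf_counter()
for s in samples:
    fake_model([s])
latency_per_call = (time.perf_counter() - start) / len(samples)

# Throughput: one large batch, as a data-center benchmark runs it.
start = time.perf_counter()
fake_model(samples)
batch_time = time.perf_counter() - start

print(f"mean single-call latency: {latency_per_call * 1e6:.1f} us")
print(f"batch throughput: {len(samples) / batch_time:.0f} samples/s")
```

A chip that posts impressive batched-throughput numbers may still miss a real-time latency budget, which is why developers should benchmark with their own models and batch sizes rather than rely on headline TOPS figures.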