The Ever-Improving Inference at the Edge

Using machine-learning inference on the edge has never been easier, thanks to platforms like NVIDIA’s Jetson Nano.

Not every application can take advantage of machine-learning (ML) inference, but most can. Running inference at the information source instead of in the cloud is becoming easier thanks to improved artificial-intelligence (AI) software support plus hardware acceleration that targets deep neural networks (DNNs). Platforms like Renesas’ e-AI, STMicroelectronics’ STM32Cube.AI, and NXP’s eIQ all support ML and target hardware ranging from conventional microcontrollers to systems with hardware acceleration.

ML hardware acceleration can significantly improve the performance of inference applications on the edge, opening up new application opportunities that would not be possible on stock hardware. GPGPUs and multicore CPUs led the charge, but ML-specific hardware has the edge. Even the latest versions of these general-purpose platforms have been enhanced to address inference chores. For example, Intel’s latest Xeons include instructions targeting ML, and its Movidius vision processing unit (VPU) zeros in on specific ML application spaces.

NVIDIA’s Jetson Nano (see figure) brings a full SoC to the ML table. The 128-CUDA-core Maxwell GPGPU handles most of the DNN model processing, assisted by the 64-bit, quad-core Cortex-A57 CPU cluster. The compact DIMM module also includes 4 GB of DRAM and runs Linux. Its hardware encode and decode support can process one 4K or eight 1080p video streams while running ML models on each stream. Convection cooling easily handles the 5 to 10 W of power, allowing the Jetson Nano to work in compact, low-power AI applications on the edge. The Jetson Nano provides the same functionality as its older and more powerful siblings, including the ability to support major platforms like TensorFlow, PyTorch, Caffe/Caffe2, MXNet, and Keras.

The Jetson Nano from NVIDIA is an SoC that supports machine-learning inference chores in embedded systems.

Coprocessors are also answering the call for more efficient inference and identification chores in embedded systems, where a batch size of one is important. Servers typically handle large batch sizes more efficiently, but they’re also working with larger datasets, whereas an embedded system might have a single camera delivering data for analysis.
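One way to see why a batch size of one matters for a single-camera system is to look at how long the first frame has to wait while a batch fills. The sketch below is only an illustration: the function name and the 30-frame/s camera are hypothetical, and it ignores the inference time itself.

```python
def batching_wait_ms(batch_size, frames_per_second):
    """Return how long (in ms) the first frame waits for its batch to fill.

    Assumes a single source delivering frames at a fixed rate; the
    processing time of the model is deliberately left out.
    """
    if batch_size < 1 or frames_per_second <= 0:
        raise ValueError("batch_size must be >= 1 and frames_per_second > 0")
    frame_period_ms = 1000.0 / frames_per_second
    # The first frame arrives immediately; the remaining (batch_size - 1)
    # frames each take one frame period to show up.
    return (batch_size - 1) * frame_period_ms


if __name__ == "__main__":
    # Hypothetical 30-frame/s camera, compared across a few batch sizes.
    for batch in (1, 4, 16):
        print(batch, round(batching_wait_ms(batch, 30), 1), "ms")
```

With a batch size of one, the wait is zero, so latency is set by the inference hardware alone. Larger batches add delay before processing can even begin, which is acceptable in a server aggregating many requests but not in a system reacting to one live stream.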

Chips like Flex Logix’s InferX X1 target this space. The chip incorporates multiple nnMAX processing tiles specifically designed to handle each layer within a DNN model that’s been trained on a server using boards like NVIDIA’s latest Tesla T4 or FPGA boards such as Xilinx’s Alveo or Intel’s Programmable Acceleration Cards (PACs). The InferX X1 is optimized to implement Winograd acceleration, which can improve the performance of INT8 layers by a factor of 2.25 while preserving accuracy. The system transforms the 3-by-3 convolution to a 4-by-4 one, with dynamic translation of weights to 12 bits. The support also handles input and output translation on the fly, minimizing the loading of weights within the system.
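The 2.25 figure is consistent with the textbook Winograd F(2x2, 3x3) algorithm, which produces a 2-by-2 output tile from a 4-by-4 input tile using 16 multiplications instead of the 36 that direct 3-by-3 convolution needs (36/16 = 2.25). The following is a minimal pure-Python sketch of that general algorithm, not Flex Logix’s implementation. The function names are hypothetical, and exact fractions stand in for the fixed-point formats real hardware would use.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Standard F(2x2, 3x3) Winograd transform matrices.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]   # input
G = [[1, 0, 0], [HALF, HALF, HALF], [HALF, -HALF, HALF], [0, 0, 1]]  # weights
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                                  # output

DIRECT_MULTIPLIES = 2 * 2 * 3 * 3   # 4 outputs x 9 taps each
WINOGRAD_MULTIPLIES = 4 * 4         # one multiply per transformed element


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def winograd_2x2_3x3(tile, kernel):
    """Compute a 2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    u = matmul(matmul(G, kernel), transpose(G))    # 3x3 weights -> 4x4
    v = matmul(matmul(BT, tile), transpose(BT))    # 4x4 input  -> 4x4
    # The only multiplications between data and weights: 16 of them.
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, m), transpose(AT))    # 4x4 -> 2x2 output


def direct_2x2_3x3(tile, kernel):
    """Reference result: slide the 3x3 kernel over the 4x4 tile directly."""
    return [[sum(tile[i + r][j + c] * kernel[r][c]
                 for r in range(3) for c in range(3))
             for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    print(winograd_2x2_3x3(tile, kernel) == direct_2x2_3x3(tile, kernel))
    print(DIRECT_MULTIPLIES / WINOGRAD_MULTIPLIES)
```

Note that the transformed weight tile is 4-by-4 and mixes halves and sums of the original weights, so its entries generally need more bits than the INT8 values they came from. That is presumably why a wider weight format appears in the description above.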

Figuring out whether AI will benefit an application is a chore in and of itself. However, once that determination is made, lots of options are available to implement these systems. Of course, one may have to apply AI just to wade through the options.
