Maximize Flexibility for AI at the Edge
What you'll learn:
- Understanding why embedded AI is becoming central to application functionality.
- The areas in which AI most readily integrates and coexists.
- The role that CNN architectures such as ResNet and MobileNet play in delivering high image-recognition accuracy.
There will be few areas left untouched by artificial intelligence (AI). Alongside the many enterprise-level use cases, numerous applications for machine learning and AI are springing up for edge computers and devices for the Internet of Things (IoT), often in combination with signal and image processing.
Security and safety drive many of the applications. The technology provides the means to perform intruder detection and, on a larger scale, find anomalies in crowd movements that can provide alerts of situations that need human intervention.
For example, AI can help improve quality control in production and in utilities such as water supplies. A model trained on expected flows and anomalies can show when production is moving out of tolerance well before the quality degrades to the point where parts and subsystems need to be rejected and reworked.
One Size Does Not Fit All AI
The potential applications are vast, but no one-size-fits-all AI solution exists for all of them. Each use case will need a model that has the best set of capabilities to support it. Speech and gesture control will benefit from the same language model technology that now underpins generative AI.
Sensor-oriented applications will more commonly rely on convolutional neural network (CNN) architectures. However, some may benefit from the additional functionality offered by implementations based on vision transformers, albeit at the cost of needing higher performance.
AI has demonstrated a thirst for increased performance. Before the arrival of generative AI, model capacity was growing roughly threefold each year. Transformer-based models pushed that growth to more than 10X per year.
Server-based AI provides access to the highest-performing models. But in many embedded and industrial applications, access to these systems isn’t ideal. Operators and users want security for their data, and in many cases, network connections in the field aren’t reliable enough to support cloud-based AI.
Users need the ability to run AI models on-device. That capability comes partly from higher-performance embedded processors optimized for the target environment, which provide increased data privacy and lower latency. And because edge AI doesn’t need a reliable internet connection, it excels in environments where connectivity is poor. Another contributor to on-device AI is the work experts in the field have done to adapt server-based models to run more efficiently in embedded systems.
>>Check out this TechXchange for similarly themed articles and videos
Researchers developed CNN architectures, such as ResNet and MobileNet, to offer high image-recognition accuracy using fewer matrix multiplications than earlier models designed for server implementation. These architectures split large, computationally expensive filters into smaller two-dimensional convolutions.
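As a rough, back-of-the-envelope illustration of why such splitting helps, the sketch below compares multiply-accumulate (MAC) counts for a standard convolution against a MobileNet-style depthwise-separable one. The layer dimensions are illustrative assumptions, not taken from any specific model.

```python
# MAC counts for one layer, assuming a 3x3 kernel, 64 input channels,
# 128 output channels, and a 56x56 output feature map (illustrative only).
H = W = 56            # spatial size of the output feature map
C_in, C_out = 64, 128
K = 3                 # kernel size

# Standard convolution: every output channel convolves all input channels.
macs_standard = H * W * C_out * C_in * K * K

# Depthwise-separable convolution (MobileNet-style): one KxK filter per
# input channel, followed by a 1x1 "pointwise" convolution to mix channels.
macs_depthwise = H * W * C_in * K * K      # depthwise stage
macs_pointwise = H * W * C_in * C_out      # pointwise (1x1) stage
macs_separable = macs_depthwise + macs_pointwise

print(f"standard:  {macs_standard:,} MACs")
print(f"separable: {macs_separable:,} MACs")
print(f"reduction: {macs_standard / macs_separable:.1f}x")
```

For these (assumed) dimensions, the separable form needs roughly an eighth of the multiplications, which is the kind of saving that makes such layers attractive at the edge.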
They also took advantage of techniques such as layer fusing, in which successive operations funnel data through the weight calculations and activation operations of more than one layer. Such techniques take advantage of data locality to avoid external memory accesses that are costly in terms of energy and latency.
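A minimal sketch of that idea, using NumPy and illustrative shapes: the "fused" path applies the bias and activation while each slice of data is still local, rather than materializing an intermediate tensor between separate passes. Real NPU fusion happens in hardware dataflow; this only models the data-movement pattern.

```python
import numpy as np

# Illustrative tile: pretend this is the raw output of a convolution layer.
rng = np.random.default_rng(0)
conv_out = rng.standard_normal((8, 8, 16))  # height x width x channels
bias = rng.standard_normal(16)              # per-channel bias

# Unfused: two separate passes, each producing a full intermediate array
# that would normally round-trip through external memory.
t = conv_out + bias
unfused = np.maximum(t, 0.0)                # ReLU activation

# "Fused": one pass per channel slice, applying bias and activation
# together so no intermediate tensor ever leaves local memory.
fused = np.empty_like(conv_out)
for c in range(conv_out.shape[2]):
    fused[:, :, c] = np.maximum(conv_out[:, :, c] + bias[c], 0.0)

assert np.allclose(fused, unfused)          # same result, fewer memory trips
```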
Designers coupled these and other edge-optimized model architectures with techniques such as network pruning and quantization. Pruning reduces the overall number of operations needed to process each layer. But it’s often a poor fit for the highly optimized matrix-multiply engines developed for neural processing. In practice, the use of quantization delivers better results with lower overheads, taking advantage of single-instruction multiple-data (SIMD) arithmetic engines designed for matrix and vector operations.
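To make the pruning half of that trade-off concrete, here is a simple magnitude-pruning sketch. The 50% sparsity target is an arbitrary illustration; as noted above, the resulting zeros only pay off if the hardware can actually skip them, which dense matrix-multiply engines often can't.

```python
import numpy as np

# Illustrative weight matrix; real models would prune trained weights.
rng = np.random.default_rng(2)
w = rng.standard_normal((128, 128)).astype(np.float32)

# Magnitude pruning: zero out the smallest-magnitude half of the weights.
threshold = np.quantile(np.abs(w), 0.5)
pruned = np.where(np.abs(w) >= threshold, w, np.float32(0.0))

print(f"sparsity: {np.mean(pruned == 0.0):.0%}")
```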
Using 8-bit integer arithmetic, and possibly even smaller word widths, instead of the much wider floating-point formats used during model training massively reduces the demand for compute and energy. Because many 8-bit arithmetic engines can run in parallel in place of a single high-precision floating-point unit, an embedded processor can deliver major improvements in throughput for the same energy and die cost.
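The sketch below shows the arithmetic involved, assuming simple per-tensor scales: weights and activations are mapped to int8, the dot products accumulate in int32 (as an 8-bit SIMD engine would), and a single rescale recovers an approximation of the float result. All values here are random and illustrative.

```python
import numpy as np

rng = rng = np.random.default_rng(1)
w = rng.standard_normal((64, 64)).astype(np.float32)  # "trained" weights
x = rng.standard_normal(64).astype(np.float32)        # input activations

def quantize(a):
    """Symmetric per-tensor quantization to int8."""
    scale = np.abs(a).max() / 127.0
    q = np.clip(np.round(a / scale), -127, 127).astype(np.int8)
    return q, scale

wq, ws = quantize(w)
xq, xs = quantize(x)

# Integer matrix-vector product with int32 accumulation -- the core of
# what an 8-bit SIMD engine computes -- then one float rescale at the end.
y_int = wq.astype(np.int32) @ xq.astype(np.int32)
y_approx = y_int * (ws * xs)

y_ref = w @ x
rel_err = np.abs(y_approx - y_ref).max() / np.abs(y_ref).max()
print(f"max relative error: {rel_err:.3%}")
```

For a layer like this, the int8 result typically lands within a few percent of the float32 reference, which is why post-training quantization is often acceptable without retraining.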
Exploring Qualcomm NPU Architecture
Qualcomm’s Dragonwing Hexagon neural processing unit (NPU) leveraged these techniques for its Snapdragon series of systems-on-chip (SoCs) for mobile phones. This lets the SoCs support functions such as face and speech recognition. The same processor is now available to industrial users through Tria Technologies’ Dragonwing series of SMARC modules (Fig. 1), sitting alongside Arm Cortex-A series application cores and Adreno graphics processing units (GPUs).
Developers can incorporate SMARC modules into their own motherboard designs or take advantage of off-the-shelf carrier boards (Fig. 2). These boards provide access to a wide range of processor options.
Current generations of Hexagon reflect a long-term commitment to signal-processing, machine-learning and AI workloads. The first iteration of the Hexagon appeared in 2007, initially supporting digital-signal-processing (DSP) workloads with a scalar engine based on a very long instruction word (VLIW) architecture to deliver high data throughput.
A key innovation that dates back to this implementation is the use of simultaneous multithreading (SMT). By leveraging thread-level parallelism, the architecture hides many of the stalls caused by external memory latency. This design philosophy has carried through successive generations of Hexagon, along with a focus on creating a unified architecture that lets developers take full advantage of the Hexagon’s hardware resources.
Later generations of the Hexagon NPU added support for parallel vector arithmetic and then multidimensional tensors. These were coupled with a full scalar processor that, if the application needs it, can run Linux without falling back on the Arm CPUs in the SoC. Fusing the scalar, vector, and tensor engines, all sharing access to a central memory, allows for high flexibility.
The NPU also supports micro-tile inferencing, a technique that makes it possible to run smaller AI models efficiently where the use case calls for ultra-low power consumption. It can let a simple model run for long periods in a low-energy state to detect certain types of sound, such as a human voice. Multiple micro-tiles can run simultaneously, so this model can continue running while other models take on the job of speech recognition.
The common-memory architecture lets developers take full advantage of techniques like layer fusing. This technique can combine 10 or more layers to eliminate the need to write intermediate results to external memory.
Recognizing the need for access to a variety of models, Qualcomm’s AI Hub provides access to hundreds of different model implementations, each optimized for the Snapdragon and Dragonwing platforms. Users simply need to select and download models to get up and running with AI, letting them try out different approaches to see which best fits the target application.
Processors currently available in the SMARC format include the QCS5490 and QCS6490, alongside the larger Vision AI-KIT. Optimized for high-performance video processing, the IQ9075 processor in the Vision AI-KIT can deliver 100 TOPS (tera-operations per second) of performance.
As AI continues to proliferate across edge and embedded applications, developers need easy access to the widest variety of models and performance points to match cost and service expectations.
About the Author

Christian Bauer
Product Marketing Manager, Tria Technologies
Christian Bauer is Product Marketing Manager at Tria Technologies.