Tuning Embedded Machine Learning to Real-World Apps

Sept. 15, 2021
AI models have demonstrated impressive results in experiments, but deploying them in real-world applications requires combining neural networks with pre- and post-processing steps. Thus the need for flexible hardware and firmware platforms.

What you'll learn:

  • How to choose flexible hardware for DNN architectures in apps like ATM camera systems.
  • How "embeddings" can be effective in the image-recognition reidentification process.
  • How Arcturus Networks developed a customizable vision pipeline using NXP's i.MX 8M architecture.

Machine learning based on deep-neural-network (DNN) architectures has demonstrated impressive results in numerous experiments, particularly in tasks such as the recognition of objects and people in images. Many of these experiments, as well as the first real-world deployments, were performed on high-performance cloud-server hardware that can deliver the required calculation throughput.

There are numerous applications where processing on cloud-server hardware is impractical. Issues such as communications latency and available bandwidth demand the use of local intelligence for processing.

Take the example of a camera system that’s used to monitor activity at an automated teller machine (ATM) in situations such as social distancing during the COVID-19 pandemic. Banks have found it’s important to introduce automated monitoring that checks conditions around each ATM to make sure people in the queue or nearby don’t stand too close together. In addition, they may want to ensure only patrons wearing masks are given access to a lobby-based ATM or the machines themselves.

Sending live video from security cameras to the cloud is one possibility for processing the data. But this is potentially very costly and difficult to implement in all but dense urban centers. Furthermore, the communications latency makes it harder for the system to react to arrivals and crowd movements in a timely way. Processing the video data locally potentially provides much better responsiveness if hardware can satisfy the computational demands of the appropriate image-recognition pipeline.

Model Choices

Many choices are available for integrators looking for edge-computing processors designed to handle the performance challenges associated with DNNs. Some employ modified graphics processing unit (GPU) architectures for throughput. However, for DNN processing, dedicated neural processing unit (NPU) designs can offer superior performance-energy ratios.

The most important consideration is to choose hardware that offers high flexibility, not just good results on standard benchmarks such as ImageNet or off-the-shelf models such as MobileNet. Often, preprocessing steps will need to manipulate the image data into a form that suits the application, and the DNNs involved must be fine-tuned to handle the application’s specific requirements.

The ATM-monitoring example has several elements that demonstrate this need for fine-tuning and preprocessing. In its work on this application, machine-learning specialist Arcturus Networks analyzed sample image data from a banking client. The data revealed that acute camera angles common in small, enclosed ATM spaces lead to a loss of detection confidence as people move underneath the camera (Fig. 1). Confidence in the results could change from more than 98% for an image where the next customer’s face is clear to less than 40% if the camera sees the top of their head with little of the face.

The need to handle masks adds complexity. This isn’t as trivial as simply training the network to recognize a class of people wearing masks. Helmets and other face or head coverings also can count as personal protective equipment (PPE), so the network must be able to handle these other types of coverings with their own detection classes.

Other requirements may be placed on the system, such as detecting suspicious behavior (e.g., loitering) even when the subject isn’t always in view; people may wander in and out of the camera’s field of view at different points in time. This calls for the ability to track subjects over time, and to confirm that the people in view are spaced adequately, rather than simply detecting mask wearers.

Each of these requirements calls for adjustments to the model’s operation as well as preprocessing steps. With live video, the assessment of whether a subject is wearing PPE is more difficult than in controlled experiments because partial occlusions and body pose cause variability in detection results. To improve accuracy, the determination must be made using the results from multiple frames (Fig. 2). That, in turn, requires motion tracking to be performed on each unique person within the field of view.
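As a rough illustration of this multiframe approach, the sketch below (not Arcturus’s code) keeps a sliding window of per-frame mask confidences for one tracked subject and only declares a PPE state once most of the recent frames agree; the window size and thresholds are hypothetical.

```python
from collections import deque

class PPEVote:
    """Aggregate per-frame mask detections for one tracked subject.

    A minimal sketch: a sliding window of recent detection confidences
    smooths out frames where pose or partial occlusion degrades the result.
    """
    def __init__(self, window=15, threshold=0.6):
        self.scores = deque(maxlen=window)   # recent mask confidences, 0.0-1.0
        self.threshold = threshold           # fraction of window that must agree

    def update(self, mask_confidence):
        self.scores.append(mask_confidence)

    def wearing_mask(self):
        # Declare "masked" only when most recent frames agree,
        # rather than trusting any single detection.
        if len(self.scores) < self.scores.maxlen:
            return None                      # not enough evidence yet
        votes = sum(1 for s in self.scores if s > 0.5)
        return votes / len(self.scores) >= self.threshold
```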

In principle, motion tracking alone is an effective choice because it’s relatively lightweight in terms of processing power. However, it relies on continuous detections. In a straightforward image-recognition system, occlusions, obstructions, or a person leaving and re-entering the field of view would result in that person being treated as a new subject rather than being reidentified.

An effective approach to reidentification is to make use of embeddings: representations of the objects in the field of view that have other data encoded into them. Embeddings are commonly used in language processing. For instance, they represent words and phrases in the form of vectors, making it possible to cluster those with similar meanings in vector space.

In the case of the ATM-monitoring application, an embedding is used that not only represents visual appearance, but also information on where the object was last seen in a frame as well as what class that object was assigned. Within the bounding box used for localization, the visual appearance of the pixels inside is sampled to generate a feature vector that can be used for later comparisons.
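A minimal sketch of how such a match might work, assuming each detection carries an appearance feature vector, a last-seen position, and a class; the thresholds and field names here are illustrative, not Arcturus’s implementation:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two appearance feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class Track:
    """State kept for one previously seen subject."""
    def __init__(self, track_id, appearance, position, cls):
        self.id = track_id
        self.appearance = appearance   # feature vector sampled inside the bounding box
        self.position = position       # (x, y) where the subject was last seen
        self.cls = cls                 # detection class assigned to the subject

def reidentify(detection, tracks, sim_floor=0.7, max_dist=150.0):
    """Match a new detection to an existing track, or report it as new.

    'detection' is assumed to carry the same appearance/position/cls fields.
    """
    best, best_sim = None, sim_floor
    for t in tracks:
        if t.cls != detection.cls:
            continue                                   # class must agree
        dx, dy = np.subtract(t.position, detection.position)
        if np.hypot(dx, dy) > max_dist:
            continue                                   # too far from last-seen location
        sim = cosine_sim(t.appearance, detection.appearance)
        if sim > best_sim:
            best, best_sim = t, sim
    return best                                        # None -> treat as a new subject
```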

A key advantage of embeddings is that they can be shared across multi-camera systems, which may be employed to increase accuracy and scalability to larger spaces. The embeddings also can be used for archival searches, possibly to create active watch lists for offline analysis.

The processing overhead of tracking and reidentification does increase the throughput required (Fig. 3). In some situations, the number of people who must be handled within a given time frame is inherently bounded by the physical space available. However, with a larger field of view, computational demand may exceed the capacity of a single SoC.

Flexible Architecture

What’s required is a flexible architecture that can handle the different elements of a real-world machine-learning architecture. In its work, Arcturus has taken advantage of the flexible combination of processing elements in the NXP i.MX 8M Plus architecture to create an easily customizable vision pipeline. In the Arcturus approach, different stages of processing are represented by nodes. A node could be an inference model, a preprocessing or post-processing algorithm, data retrieval, or access to an external or remote service. The model is similar to the containerized approach utilized in cloud computing but adapted to the resource constraints of edge computing.

Each node is implemented as a microservice and interconnected through tightly synchronized, serialized data streams. Together, these nodes create a complete vision pipeline from image acquisition through to local actions. For basic applications, pipeline nodes can run on the same physical resource. More complex pipelines can have nodes distributed across hardware, such as the CPUs, GPUs, and NPUs in one or more i.MX 8M Plus processors, or even the cloud.
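The sketch below mimics that node model in a single process, with thread-backed queues standing in for the serialized streams between microservices; real nodes would run as separate containers and communicate over the network, and the stage functions are placeholders.

```python
import json
import queue
import threading

def node(stage_fn, inbox, outbox):
    """Run one pipeline node: read a serialized record, process it, forward it."""
    while True:
        msg = inbox.get()
        if msg is None:                # shutdown sentinel
            outbox.put(None)
            break
        record = json.loads(msg)       # serialized data stream between nodes
        outbox.put(json.dumps(stage_fn(record)))

# Hypothetical stages: acquisition -> inference -> post-processing.
def infer(rec):
    rec["detections"] = ["person"]              # placeholder for a real DNN node
    return rec

def postproc(rec):
    rec["alert"] = len(rec["detections"]) > 0   # placeholder post-processing rule
    return rec

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=node, args=(infer, q0, q1), daemon=True).start()
threading.Thread(target=node, args=(postproc, q1, q2), daemon=True).start()

q0.put(json.dumps({"frame": 1}))      # an acquired frame enters the pipeline
q0.put(None)
print(json.loads(q2.get()))           # {'frame': 1, 'detections': ['person'], 'alert': True}
```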

In this architecture, pipelines are orchestrated at runtime so that they can be reorganized easily as application needs change, helping to futureproof edge investment. As each node is containerized, it’s simple to replace one part of the system. For example, an inference model can be updated without disrupting the rest of the system, even if model attributes change.

This pipeline architecture in the Arcturus Brinq Edge Creator SDK makes it possible to scale AI performance beyond one physical processor. An i.MX 8M Plus can generate embeddings for DNNs on one or more i.MX 8M devices that may perform detection across one or more cameras. These devices can be interconnected easily using a network fabric on one of the two dedicated Ethernet MACs on each processor.

As is common in machine learning, development, training, and fine-tuning can take place on a workstation, perhaps leveraging cloud-based acceleration. Once trained, fine-tuned, and validated, the model is converted for more efficient processing on NPU hardware, often by reducing the 32-bit floating-point operations used in training to 8-bit integer calculations. Efficiency benefits further from the use of prebuilt layers and models that are optimized for the edge environment.
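As a concrete example of that float-to-integer step, the following is a generic post-training quantization flow using TensorFlow Lite’s converter; the model path and input shape are placeholders, and this is the standard TFLite mechanism rather than an NXP-specific tool.

```python
import numpy as np
import tensorflow as tf

def representative_data():
    # Calibration samples let the converter pick int8 scaling factors;
    # real code would yield preprocessed frames from the target cameras.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
# Force full-integer kernels so the NPU can execute the whole graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```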

Arcturus provides a catalog of prebuilt models at different precisions. These models are pre-validated to support all major edge runtimes (Arm NN, TensorFlow Lite, and TensorRT) with CPU, GPU, and NPU support. Tooling is available to train or fine-tune models, along with dataset curation, image scraping, and augmentation.

That combination of optimized runtime, quantized model, and NPU hardware can offer a 40X performance improvement compared with other publicly available systems running the same model. Comprehensiveness in a library like this is vital: edge runtime versions often don’t support all of the layers required by every type of network, and newer models that may demonstrate better performance tend to be less broadly supported than the older types used in frequently cited benchmarks.

The final component is a runtime inference engine that can load the DNN model into the i.MX 8M Plus. NXP’s eIQ machine-learning software development environment provides ported and validated versions of Arm NN and TensorFlow Lite inference engines.
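A minimal sketch of what loading and running such a model looks like with the TensorFlow Lite Python API follows. The delegate library path is platform-specific; libvx_delegate.so is typical of eIQ builds targeting the i.MX 8M Plus NPU, but check the board support package for your release.

```python
import numpy as np
import tflite_runtime.interpreter as tflite

# Try to hand the graph to the NPU via a hardware delegate; fall back to
# CPU kernels if the delegate library isn't present on this platform.
try:
    delegates = [tflite.load_delegate("/usr/lib/libvx_delegate.so")]
except (OSError, ValueError):
    delegates = []

interpreter = tflite.Interpreter(model_path="model_int8.tflite",
                                 experimental_delegates=delegates)
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

frame = np.zeros(inp["shape"], dtype=inp["dtype"])  # stand-in camera frame
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
detections = interpreter.get_tensor(out["index"])
```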

Summary

Flexibility and performance scalability are two vital elements in real-world applications for machine learning. Each application is different and will influence not just the choice of DNN, but also the processing around it. A framework that supports this need for flexibility is vital. And it’s a key reason why the combination of a microservices environment such as that developed by Arcturus and processing hardware like the NXP i.MX 8M Plus can be a powerful tool in the migration of machine learning to the edge.

