Revving Up Machine-Learning Inference

July 28, 2021
Leveraging its latest GPGPU hardware, NVIDIA’s TensorRT 8 delivers significant performance enhancements.

What you’ll learn

  • What’s the difference between TensorFlow and TensorRT?
  • What are the performance improvements in TensorRT 8?

NVIDIA has just released TensorRT 8, which supports the open-source TensorFlow platform originally developed at Google (see figure). The two play different roles: TensorFlow is a framework for building and training models, while TensorRT 8 compiles and optimizes trained models for inference on NVIDIA hardware, taking advantage of features specific to those GPUs. For example, the company’s Ampere GPU supports a feature called fine-grain sparsity.

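Sparsity takes advantage of the fact that many of the weights in a machine-learning model are zero, or close enough to zero that they can be set to zero with little effect on the result. Only the nonzero values need to be stored, along with a small amount of metadata recording where each one sits in the original matrix. The weights then require less space, and computations can be more efficient because the zeros are skipped. In Ampere’s fine-grain scheme, two of every four consecutive weights are zero. The catch is that the math performed on the compressed values needs to take the encoding into account, which is where the hardware support comes in.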
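As a rough illustration of the idea, the sketch below applies the two-out-of-four pattern that NVIDIA documents for Ampere to a flat list of weights: in each group of four it keeps the two largest-magnitude values and records each one’s position within the group. It is a plain-Python sketch, not TensorRT code. The function names compress_2_of_4 and sparse_dot are hypothetical, and choosing survivors by magnitude is an assumption; real workflows typically prune and then fine-tune the model.

```python
def compress_2_of_4(weights):
    """Keep the two largest-magnitude weights in each group of four.

    Returns (values, positions): the kept weights and each one's
    index (0-3) inside its group, which would fit in two bits.
    """
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    values, positions = [], []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Rank the four slots by magnitude, then keep the top two
        # in their original order.
        ranked = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)
        keep = sorted(ranked[:2])
        values.extend(group[i] for i in keep)
        positions.extend(keep)
    return values, positions


def sparse_dot(values, positions, activations):
    """Dot product that touches only the kept weights."""
    total = 0.0
    for k, (weight, pos) in enumerate(zip(values, positions)):
        group = k // 2  # two kept weights per group of four
        total += weight * activations[4 * group + pos]
    return total


# Illustrative data only (not from any real model).
weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, positions = compress_2_of_4(weights)
print(values, positions)
print(sparse_dot(values, positions, activations))
```

The compressed form holds half as many weights, and the dot product performs half as many multiplications. The position list is the extra bookkeeping that the math has to consult on every step.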

TensorRT 8 provides substantial performance gains, along with improved accuracy versus other techniques for reduced-precision inference. For example, NVIDIA says its quantization-aware training (QAT) support can double the accuracy of models running at INT8 precision. This and other transformation optimizations allow TensorRT 8 to double the performance of many models compared with its older sibling, TensorRT 7.
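The core trick behind quantization-aware training is often called fake quantization: during training, values are rounded to the integer grid and immediately converted back to floating point, so the model learns weights that tolerate the rounding it will face at inference time. The sketch below shows that round trip in plain Python. It assumes a simple symmetric, per-tensor 8-bit scheme, and the helper names symmetric_scale and fake_quantize are hypothetical; it is not TensorRT’s API.

```python
def symmetric_scale(values, num_bits=8):
    """Pick a scale so the largest magnitude maps to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(x, scale, num_bits=8):
    """Quantize x to a signed integer level, then dequantize back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = round(x / scale)            # snap to the nearest integer level
    q = max(-qmax, min(qmax, q))    # clamp to the representable range
    return q * scale                # back to floating point


# Illustrative weights only (not from any real model).
weights = [0.8, -0.31, 0.05, -1.27]
scale = symmetric_scale(weights)
simulated = [fake_quantize(w, scale) for w in weights]
print(scale)
print(simulated)
```

For values inside the clamping range, the round trip changes each value by at most half a quantization step. Training with that error in the loop is what lets a QAT model hold its accuracy when deployed at low precision.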

TensorRT 8 offers fine-grain sparsity support only for newer hardware like the Ampere GPU. Nonetheless, it still enhances performance on other NVIDIA hardware that lacks hardware sparsity support. The improvements just aren’t as dramatic.

On the other hand, certain models can gain even better performance using TensorRT 8. These include Bidirectional Encoder Representations from Transformers (BERT), a transformer-based machine-learning technique used for natural-language-processing pre-training. Some systems see a performance increase of two orders of magnitude. As a result, inference with a BERT-Large model takes only 1.2 ms, allowing for real-time responses to natural-language queries.

“AI models are growing exponentially more complex, and worldwide demand is surging for real-time applications that use AI. That makes it imperative for enterprises to deploy state-of-the-art inferencing solutions,” said Greg Estes, vice president of developer programs at NVIDIA. “The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness that was never before possible.”
