The H100 Transformer Engine supercharges AI training, delivering up to 6x more performance without losing accuracy


The largest AI models can take months to train on today’s computing platforms. It’s too slow for business.

Artificial intelligence, high performance computing and data analytics are becoming increasingly complex, with some models, such as large language models, reaching trillions of parameters.

The NVIDIA Hopper architecture is built from the ground up to accelerate these next-generation AI workloads with massive computing power and fast memory to handle growing networks and datasets.

Transformer Engine, part of the new Hopper architecture, will dramatically accelerate performance and AI capabilities, and help train large models in days or hours.

Training AI models with Transformer Engine

Transformer models are the backbone of widely used language models today, such as BERT and GPT-3. Originally developed for natural language processing use cases, their versatility is increasingly being applied to computer vision, drug discovery, and more.

However, model sizes continue to increase exponentially, now reaching trillions of parameters. The sheer volume of computation stretches training times to several months, which is impractical for business needs.

Transformer Engine uses 16-bit floating point precision and a new 8-bit floating point data format combined with advanced software algorithms that will further accelerate performance and AI capabilities.

AI training relies on floating-point numbers, which have fractional components, like 3.14. Introduced with the NVIDIA Ampere architecture, the TensorFloat32 (TF32) floating-point format is now the default 32-bit format in the TensorFlow and PyTorch frameworks.

Most AI floating-point calculations are performed using 16-bit “half-precision” (FP16), 32-bit “single” precision (FP32), and, for specialized operations, “double” 64-bit precision (FP64). By reducing computations to just eight bits, Transformer Engine makes it possible to train larger networks faster.

Combined with other new features of the Hopper architecture, such as the NVLink Switch system, which provides a high-speed direct interconnect between nodes, H100-accelerated server clusters will be able to train huge networks that were nearly impossible to train at the speed needed for business.

Diving deeper into Transformer Engine

Transformer Engine uses software and custom NVIDIA Hopper Tensor Core technology designed to accelerate the training of models built from the popular AI model building block, the transformer. These Tensor Cores can apply mixed FP8 and FP16 formats to dramatically speed up AI calculations for transformers. Tensor Core operations in FP8 have twice the throughput of 16-bit operations.

The challenge for models is to intelligently manage precision to maintain accuracy while achieving the performance of smaller, faster numerical formats. Transformer Engine enables this with custom, NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations and automatically handle re-casting and scaling between those precisions in each layer.

Transformer Engine uses per-layer statistical analysis to determine the optimal precision (FP16 or FP8) for each layer in a model, achieving the best performance while maintaining model accuracy.
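To make the idea of scaling between precisions more concrete, here is a minimal sketch of per-tensor scaling into a simulated 8-bit floating-point range. It is not Transformer Engine's actual heuristic, which this article does not describe. The function names are hypothetical, and the sketch assumes the E4M3 variant of FP8 (3 mantissa bits, largest finite value 448), a detail that does not come from the article. It also ignores subnormals and exponent limits.

```python
import math

# Assumption: the E4M3 flavor of FP8, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def round_to_mantissa_bits(x, bits=3):
    """Round x to `bits` explicit mantissa bits (subnormals and exponent limits ignored)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    steps = 2 ** (bits + 1)             # implicit leading bit plus `bits` explicit bits
    return math.ldexp(round(mantissa * steps) / steps, exponent)


def fake_fp8_quantize(values, fmt_max=FP8_E4M3_MAX):
    """Scale a tensor so its largest magnitude maps to fmt_max, then round coarsely."""
    amax = max(abs(v) for v in values)
    scale = fmt_max / amax if amax > 0 else 1.0
    quantized = [round_to_mantissa_bits(v * scale) for v in values]
    # Clamp in case rounding pushed a value just past the representable range.
    quantized = [max(-fmt_max, min(fmt_max, q)) for q in quantized]
    return quantized, scale


def dequantize(quantized, scale):
    """Undo the scaling to recover approximate original values."""
    return [q / scale for q in quantized]


layer_output = [0.001, -0.02, 0.5, 3.0]  # made-up activations for illustration
quantized, scale = fake_fp8_quantize(layer_output)
restored = dequantize(quantized, scale)
for original, approx in zip(layer_output, restored):
    print(f"{original:>8} -> {approx:.6f}")
```

The scale factor is derived from the largest magnitude in the tensor, so small and large layers alike use the full range of the narrow format. With 3 mantissa bits, each restored value stays within about 6.25% of the original in this simplified model.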

The NVIDIA Hopper architecture also advances fourth-generation Tensor Cores by tripling floating-point operations per second compared with the previous generation's TF32, FP64, FP16, and INT8 precisions. Combined with Transformer Engine and fourth-generation NVLink, Hopper Tensor Cores enable an order-of-magnitude speedup for HPC and AI workloads.

Revving up Transformer Engine

Much of the cutting-edge work in AI revolves around large language models like Megatron 530B. The chart below shows the growth in model sizes over the past few years, a trend that is expected to continue. Many researchers are already working on trillion-plus parameter models for natural language understanding and other applications, showing a relentless appetite for AI compute power.

The growth of natural language understanding models continues apace. Source: Microsoft.

Meeting the demands of these growing models requires a combination of computing power and a ton of high-speed memory. The NVIDIA H100 Tensor Core GPU delivers on both fronts, with the accelerations made possible by Transformer Engine to take AI training to the next level.

When combined, these innovations deliver higher throughput and a 9x reduction in training time, from seven days to just 20 hours:

The NVIDIA H100 Tensor Core GPU delivers up to 9x more training throughput compared to the previous generation, allowing large models to be trained in a reasonable amount of time.

Transformer Engine can also be used for inference without any data format conversion. Previously, INT8 was the benchmark precision for optimal inference performance. However, this requires the trained networks to be converted to INT8 as part of the optimization process, which the NVIDIA TensorRT inference optimizer facilitates.

Using models trained with FP8 will allow developers to skip this conversion step altogether and perform inference operations using the same precision. And like INT8-formatted networks, deployments using Transformer Engine can run in a much smaller memory footprint.
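As a rough illustration of the footprint argument, the sketch below estimates the memory needed just to store the weights of a 530-billion-parameter model, such as the Megatron 530B mentioned above, at different precisions. The function name is hypothetical, 1 GB is taken as 10^9 bytes, and the estimate covers weights only: activations, optimizer state, and any per-tensor scaling factors are left out.

```python
BITS_PER_BYTE = 8
MEGATRON_PARAMETERS = 530_000_000_000  # 530 billion weights


def weight_gigabytes(num_parameters, bits_per_value):
    """Storage for the weights alone, in GB (10**9 bytes)."""
    total_bytes = num_parameters * bits_per_value // BITS_PER_BYTE
    return total_bytes / 1e9


for name, bits in [("FP32", 32), ("FP16", 16), ("FP8 / INT8", 8)]:
    gigabytes = weight_gigabytes(MEGATRON_PARAMETERS, bits)
    print(f"{name:<10} {gigabytes:>7,.0f} GB")
```

Under these assumptions the weights take about 2,120 GB in FP32, 1,060 GB in FP16, and 530 GB in either 8-bit format, which is why an FP8 deployment can match the footprint of an INT8 one without a separate conversion step.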

On Megatron 530B, NVIDIA H100 GPU inference throughput is up to 30x that of NVIDIA A100, with one-second response latency, making it the optimal platform for AI deployments:

Transformer Engine will also increase inference throughput up to 30x for low latency applications.

To learn more about NVIDIA H100 GPUs and the Hopper architecture, read the NVIDIA Technical Blog post as well as the Hopper architecture whitepaper.
