Inside NVIDIA Blackwell Ultra: The Chip Powering the AI Factory Era

As the latest member of the NVIDIA Blackwell architecture family, the NVIDIA Blackwell Ultra GPU builds on the architecture's core innovations to accelerate AI training and reasoning. It pairs new silicon capabilities with deeper system-level integration, delivering greater performance, scalability, and efficiency for AI factories and the large-scale, real-time AI services they power.

With its energy-efficient dual-reticle design, high bandwidth and large-capacity HBM3E memory subsystem, fifth-generation Tensor Cores, and breakthrough NVFP4 precision format, Blackwell Ultra is raising the bar for accelerated computing. This in-depth look explains the architectural advances, why they matter, and how they translate into measurable gains for AI workloads.

Dual-reticle design: one GPU

Blackwell Ultra is composed of two reticle-sized dies connected by NVIDIA High-Bandwidth Interface (NV-HBI), a custom, power-efficient die-to-die interconnect that provides 10 TB/s of bandwidth. Manufactured on the TSMC 4NP process, Blackwell Ultra packs 208 billion transistors, 2.6x more than the NVIDIA Hopper GPU, while functioning as a single, NVIDIA CUDA-programmed accelerator. This delivers a large increase in performance while preserving the familiar CUDA programming model that developers have relied on for nearly two decades.

Benefits

  • Unified compute domain: 160 Streaming Multiprocessors (SMs) across two dies, providing 640 fifth-generation Tensor Cores with 15 PetaFLOPS dense NVFP4 compute.
  • Full coherence: Shared L2 cache with fully coherent memory accesses.
  • Maximum silicon utilization: Peak performance per square millimeter.
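The unified compute domain is visible directly from the CUDA runtime. The short host program below is a minimal sketch using standard cudaGetDeviceProperties queries (not specific to any Blackwell Ultra SDK feature): on a dual-reticle Blackwell-class part it reports a single logical device whose SM count, global memory, and L2 cache already reflect both dies.

```cpp
// Minimal sketch: query what the CUDA programming model exposes for device 0.
// A dual-reticle GPU still appears as one logical device, with the combined
// SM count, HBM capacity, and shared L2 cache of both dies.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    std::printf("Device        : %s\n", prop.name);
    std::printf("SMs           : %d\n", prop.multiProcessorCount);
    std::printf("Global memory : %.1f GB\n", prop.totalGlobalMem / 1e9);
    std::printf("L2 cache      : %.1f MB\n", prop.l2CacheSize / 1e6);
    return 0;
}
```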

Streaming multiprocessors: compute engines for the AI Factory

As shown in Figure 1, the heart of Blackwell Ultra is its 160 Streaming Multiprocessors (SMs) organized into eight Graphics Processing Clusters (GPCs) in the full GPU implementation. Every SM, shown in Figure 2, is a self-contained compute engine housing:

  • 128 CUDA Cores for FP32 and INT32 operations, with support for FP16/BF16 and other precisions.
  • 4 fifth-generation Tensor Cores with NVIDIA second-generation Transformer Engine, optimized for FP8, FP6, and NVFP4.
  • 256 KB of Tensor Memory (TMEM) for warp-synchronous storage of intermediate results, enabling higher reuse and reduced off-chip memory traffic.
  • Special Function Units (SFUs) for transcendental math and special operations used in AI kernels.

NVIDIA Tensor Cores, AI compute powerhouses

When NVIDIA first introduced Tensor Cores in the Volta architecture, they fundamentally changed what GPUs could do for deep learning. Instead of executing scalar or vector operations one element at a time, Tensor Cores operate directly on small matrices—performing matrix multiply-accumulate (MMA) in a single instruction. This was a perfect match for neural networks, where the vast majority of computation comes down to multiplying and summing large grids of numbers.
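To make the MMA idea concrete, here is a hedged sketch using the public CUDA WMMA API (nvcuda::wmma, available since Volta): a single warp computes a 16x16x16 matrix multiply-accumulate with FP16 inputs and FP32 accumulation in a handful of instructions rather than element by element. It illustrates the programming model only; the newest Blackwell Ultra precisions are typically reached through libraries such as cuBLAS and CUTLASS.

```cpp
// Sketch: one warp performs a 16x16x16 matrix multiply-accumulate (D = A*B + C,
// with C initialized to zero here) on Tensor Cores via the CUDA WMMA API,
// using FP16 inputs with FP32 accumulation.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void mma16x16x16(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);        // accumulator starts at zero
    wmma::load_matrix_sync(a, A, 16);      // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);        // the single matrix instruction
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}
// Launch with one warp: mma16x16x16<<<1, 32>>>(A, B, D);
```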

Over successive generations, Tensor Cores have expanded in capability, precision formats, and parallelism:

  • NVIDIA Volta: 8-thread MMA units, FP16 with FP32 accumulation for training.
  • NVIDIA Ampere: Full warp-wide MMA, BF16, and TensorFloat-32 formats.
  • NVIDIA Hopper: Warp-group MMA across 128 threads, Transformer Engine with FP8 support.

Blackwell and Blackwell Ultra take this to the next level with their fifth-generation Tensor Cores and second-generation Transformer Engine, delivering higher throughput and lower latency for both dense and sparse AI workloads. Each Streaming Multiprocessor (SM) contains four Tensor Cores; across Blackwell Ultra's 160 SMs, that adds up to 640 Tensor Cores, all upgraded to handle the newest precision format, NVFP4.

These enhancements aren’t just about raw FLOPS. The new Tensor Cores are tightly integrated with 256 KB of Tensor Memory (TMEM) per SM, optimized to keep data close to the compute units. They also support dual-thread-block MMA, where paired SMs cooperate on a single MMA operation, sharing operands and reducing redundant memory traffic.

The result is higher sustained throughput, better memory efficiency, and faster large-batch pre-training, reinforcement learning for post-training, and low-batch, high-interactivity inference.

Ultra-charged NVFP4 performance

NVIDIA NVFP4 is the new 4-bit floating-point format introduced with the Blackwell GPU architecture. It combines two-level scaling, an FP8 (E4M3) micro-block scale applied to each 16-value block plus a tensor-level FP32 scale, enabling hardware-accelerated quantization with markedly lower error rates than standard FP4. This Tensor Core capability delivers nearly FP8-equivalent accuracy (often less than ~1% difference) while reducing memory footprint by about 1.8x compared to FP8 and up to 3.5x compared to FP16. NVFP4 strikes an optimal balance of accuracy, efficiency, and performance for low-precision AI inference.
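To illustrate the two-level scaling idea, the host-side sketch below quantizes a small tensor to the FP4 (E2M1) value grid using a per-16-value block scale plus one tensor-level FP32 scale. It is a simplified model of the arithmetic, not NVIDIA's hardware implementation: real NVFP4 stores the block scales in FP8 E4M3, which the sketch approximates by keeping them in FP32.

```cpp
// Sketch of two-level scaled 4-bit quantization in the spirit of NVFP4.
// Simplifications: block scales are kept in FP32 for clarity (hardware uses FP8 E4M3),
// and values are snapped to the E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static float snap_to_fp4(float x) {  // nearest representable E2M1 value, sign preserved
    static const float grid[] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};
    float best = grid[0];
    for (float g : grid)
        if (std::fabs(std::fabs(x) - g) < std::fabs(std::fabs(x) - best)) best = g;
    return std::copysign(best, x);
}

int main() {
    std::vector<float> t = {0.02f, -0.7f, 1.3f, 5.0f, -0.11f, 2.4f, 0.9f, -3.3f};
    const size_t block = 16;

    // Level 1: a tensor-wide FP32 scale maps the global maximum onto the FP4 range.
    float tmax = 0.f;
    for (float v : t) tmax = std::fmax(tmax, std::fabs(v));
    float tensor_scale = (tmax > 0.f) ? tmax / 6.f : 1.f;  // 6 = largest E2M1 magnitude

    // Level 2: each block of 16 values gets its own micro-block scale.
    for (size_t b0 = 0; b0 < t.size(); b0 += block) {
        size_t b1 = std::min(b0 + block, t.size());
        float bmax = 0.f;
        for (size_t i = b0; i < b1; ++i) bmax = std::fmax(bmax, std::fabs(t[i]));
        float block_scale = (bmax > 0.f) ? bmax / (6.f * tensor_scale) : 1.f;

        for (size_t i = b0; i < b1; ++i) {
            float q   = snap_to_fp4(t[i] / (tensor_scale * block_scale));  // stored 4-bit code
            float deq = q * tensor_scale * block_scale;                    // reconstructed value
            std::printf("%8.3f -> %8.3f\n", t[i], deq);
        }
    }
    return 0;
}
```

Because the block scale adapts to each group of 16 values, an outlier in one block does not crush the resolution of the rest of the tensor, which is where most of the accuracy recovery relative to plain FP4 comes from.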

The Blackwell Ultra dense NVFP4 compute capability provides a substantial performance uplift over the original Blackwell GPU. While the base architecture delivers 10 petaFLOPS of dense NVFP4 performance, Blackwell Ultra pushes that to 15 petaFLOPS, a 1.5x increase over the Blackwell GPU and a 7.5x increase over the NVIDIA Hopper H100 and H200 GPUs, as shown in Figure 3. This boost directly benefits large-scale inference, enabling more concurrent model instances, faster response times, and lower cost per generated token.

Accelerated softmax in the attention layer

Modern AI workloads rely heavily on attention processing with long input contexts and long output sequences for “thinking”. Transformer attention layers, in turn, stress exponentials, divisions, and other transcendental operations executed by the SM’s SFUs.

In Blackwell Ultra, SFU throughput has been doubled for key instructions used in attention, delivering up to 2x faster attention-layer compute compared to Blackwell GPUs. This improvement accelerates both short and long-sequence attention, but is especially impactful for reasoning models with large context windows—where the softmax stage can become a latency bottleneck.
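To see where those SFU cycles go, consider a naive softmax kernel. The sketch below (one thread per row, written for readability rather than speed) is not how production attention kernels are structured, but it shows that every attention score passes through an exponential, which the compiler lowers to the SM's fast SFU path via __expf; doubling that throughput shortens exactly this stage.

```cpp
// Sketch: numerically stable row-wise softmax, one thread per row (readability over speed).
// The __expf calls map to the SM's Special Function Units, the path Blackwell Ultra doubles.
#include <cuda_runtime.h>
#include <math_constants.h>

__global__ void row_softmax(const float* scores, float* probs, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float* s = scores + (size_t)row * cols;
    float*       p = probs  + (size_t)row * cols;

    float m = -CUDART_INF_F;                        // row maximum for numerical stability
    for (int i = 0; i < cols; ++i) m = fmaxf(m, s[i]);

    float sum = 0.f;
    for (int i = 0; i < cols; ++i) {                // one exponential per score: SFU work
        p[i] = __expf(s[i] - m);
        sum += p[i];
    }
    float inv = 1.0f / sum;
    for (int i = 0; i < cols; ++i) p[i] *= inv;     // normalize to probabilities
}
```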

By accelerating the attention mechanism within transformer models, Blackwell Ultra enables:

  • Faster AI reasoning with lower time-to-first-token in interactive applications.
  • Lower compute costs by reducing total processing cycles per query.
  • Higher system efficiency—more attention sequences processed per watt.

As depicted in Figure 4, the performance gains from the accelerated attention-layer instructions in Blackwell Ultra compound with NVFP4 precision, resulting in a step-function improvement for LLM and multimodal inference.

Memory: high capacity and bandwidth for multi-trillion-parameter models

Blackwell Ultra doesn't just scale compute; it scales memory capacity to meet the demands of the largest AI models. With 288 GB of HBM3E per GPU, it offers 3.6x more on-package memory than H100 and 50% more than Blackwell, as shown in Figure 5. This capacity is critical for hosting trillion-parameter models, extending context length without KV-cache offloading, and enabling high-concurrency inference in AI factories.

High bandwidth memory features

  • Max capacity: 288 GB, 3.6x increase over H100
  • HBM configuration: 8 stacks, 16 × 512-bit controllers (8,192-bit total width)
  • Bandwidth: 8 TB/s per GPU, 2.4x improvement over H100 (3.35 TB/s)

This massive memory footprint enables:

  • Complete model residence: 300B+ parameter models without memory offloading.
  • Extended context lengths: Larger KV cache capacity for transformer models.
  • Improved compute efficiency: Higher compute-to-memory ratios for diverse workloads.
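A back-of-the-envelope calculation shows why the extra capacity translates into context length. The sketch below uses purely illustrative model dimensions (the layer count, KV-head count, head size, and weight footprint are placeholders, not any particular model) to estimate KV-cache bytes per token and the token budget left after the weights are resident.

```cpp
// Back-of-the-envelope KV-cache sizing. All model dimensions are illustrative
// placeholders; substitute the numbers for a real model.
#include <cstdio>

int main() {
    const double hbm_bytes     = 288e9;  // Blackwell Ultra HBM3E capacity
    const double weight_bytes  = 200e9;  // e.g., ~400B parameters at 4 bits per weight
    const int    layers        = 96;     // placeholder transformer depth
    const int    kv_heads      = 8;      // placeholder (grouped-query attention)
    const int    head_dim      = 128;    // placeholder head size
    const double bytes_per_val = 1.0;    // FP8 KV cache

    // Each token stores a key vector and a value vector in every layer.
    double kv_per_token = 2.0 * layers * kv_heads * head_dim * bytes_per_val;
    double token_budget = (hbm_bytes - weight_bytes) / kv_per_token;

    std::printf("KV cache per token : %.1f KB\n", kv_per_token / 1e3);
    std::printf("Token budget       : %.0f tokens across all concurrent sequences\n",
                token_budget);
    return 0;
}
```

Under these placeholder assumptions, a single GPU keeps a few hundred thousand tokens of KV cache on-package, which is the headroom the extended-context bullet above refers to.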

Interconnect: built for scale

Blackwell and Blackwell Ultra support fifth-generation NVIDIA NVLink for GPU-to-GPU communication over NVLink Switch, NVLink-C2C for coherent interconnect to an NVIDIA Grace CPU, and x16 PCI-Express Gen 6 interface for connection to host CPUs.

  • Per-GPU Bandwidth: 1.8 TB/s bidirectional (18 links x 100 GB/s)
  • Performance Scaling: 2x improvement over NVLink 4 (Hopper GPU)
  • Maximum Topology: 576 GPUs in non-blocking compute fabric
  • Rack-Scale Integration: 72-GPU NVL72 configurations with 130 TB/s aggregate bandwidth
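As a quick sanity check (plain arithmetic over the stated specifications, nothing measured), the headline figures compose directly from the per-link numbers:

```cpp
// How the per-GPU and rack-level NVLink figures compose from per-link bandwidth.
#include <cstdio>

int main() {
    const double links_per_gpu  = 18;     // NVLink 5 links per Blackwell Ultra GPU
    const double gb_per_link    = 100;    // bidirectional GB/s per link
    const int    gpus_per_nvl72 = 72;

    double per_gpu_gbs  = links_per_gpu * gb_per_link;           // 1,800 GB/s = 1.8 TB/s
    double per_rack_tbs = per_gpu_gbs * gpus_per_nvl72 / 1000.0; // ~130 TB/s aggregate

    std::printf("Per-GPU NVLink bandwidth : %.0f GB/s\n", per_gpu_gbs);
    std::printf("NVL72 aggregate          : %.1f TB/s\n", per_rack_tbs);
    return 0;
}
```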

Host connectivity:

  • PCIe Interface: Gen6 × 16 lanes (256 GB/s bidirectional)
  • NVLink-C2C: Grace CPU-GPU communication with memory coherency (900 GB/s)

Table 1 provides a comparison of the interconnects across generations.

Interconnect             Hopper GPU     Blackwell GPU    Blackwell Ultra GPU
NVLink (GPU-GPU)         900            1,800            1,800
NVLink-C2C (CPU-GPU)     900            900              900
PCIe Interface           128 (Gen 5)    256 (Gen 6)      256 (Gen 6)

Table 1. Interconnect comparison across Hopper, Blackwell, and Blackwell Ultra (bidirectional GB/s)

Advancing performance-efficiency

Blackwell Ultra delivers a decisive leap over Blackwell by adding 50% more NVFP4 compute and 50% more HBM capacity per chip, enabling larger models and faster throughput without compromising efficiency. Accelerated softmax execution further boosts real-world inference speeds, driving up tokens per second per user (TPS/user) while improving data center tokens per second per megawatt (TPS/MW). Every architectural enhancement was purpose-built to push both user experience and operational efficiency to the next level.

As shown in Figure 6, plotting these two metrics for the NVIDIA Hopper HGX H100 NVL8 system, NVIDIA Blackwell HGX B200 NVL8 system, NVIDIA Blackwell GB200 NVL72 system, and NVIDIA Blackwell Ultra GB300 NVL72 system reveals a generational leap. The curve starts with Hopper NVL8 at FP8 precision and ends with Blackwell Ultra NVL72 at NVFP4 precision—showing how each architectural advance pushes the Pareto frontier up and to the right.

These architectural innovations improve the economics of AI inference and redefine what’s possible in AI factory design—delivering more model instances, faster responses, and higher output per megawatt than any previous NVIDIA platform.

To see firsthand how innovations in hardware and deployment configurations impact data center efficiency and user experience, check out our interactive Pareto Frontier explainer.

Enterprise-grade features

Blackwell Ultra isn’t just about raw performance—it’s designed with enterprise-grade features that simplify operations, strengthen security, and deliver reliable performance at scale.

Advanced scheduling and management

Security and reliability

  • Confidential computing and secure AI: Secure, performant protection for sensitive AI models and data. The Blackwell architecture extends hardware-based Trusted Execution Environments (TEE) to GPUs with industry-first TEE-I/O capabilities, and inline NVLink protection delivers near-identical throughput compared to unencrypted modes.
  • Advanced NVIDIA Reliability, Availability, and Serviceability (RAS) engine: An AI-powered reliability system that monitors thousands of parameters to predict failures, optimize maintenance schedules, and maximize system uptime in large-scale deployments.

AI video and data processing enhancements

Blackwell Ultra also integrates specialized engines for modern AI workloads requiring multimodal data processing:

NVIDIA GPU chip summary comparison

To put Blackwell Ultra’s advances in perspective, Table 2 compares key chip specifications across Hopper, Blackwell, and Blackwell Ultra. It highlights the generational leap in transistor count, memory capacity, interconnect bandwidth, and precision compute throughput—as well as the architectural enhancements like attention acceleration and NVFP4. This side-by-side view shows how Blackwell Ultra scales up performance and extends capabilities critical for AI factory deployments at both node and rack scale.

Feature                             Hopper                                    Blackwell               Blackwell Ultra
Manufacturing process               TSMC 4N                                   TSMC 4NP                TSMC 4NP
Transistors                         80B                                       208B                    208B
Dies per GPU                        1                                         2                       2
NVFP4 dense | sparse performance    N/A                                       10 | 20 PetaFLOPS       15 | 20 PetaFLOPS
FP8 dense | sparse performance      2 | 4 PetaFLOPS                           5 | 10 PetaFLOPS        5 | 10 PetaFLOPS
Attention acceleration (SFU EX2)    4.5 TeraExponentials/s                    5 TeraExponentials/s    10.7 TeraExponentials/s
Max HBM capacity                    80 GB HBM (H100) / 141 GB HBM3E (H200)    192 GB HBM3E            288 GB HBM3E
Max HBM bandwidth                   3.35 TB/s (H100) / 4.8 TB/s (H200)        8 TB/s                  8 TB/s
NVLink bandwidth                    900 GB/s                                  1,800 GB/s              1,800 GB/s
Max power (TGP)                     Up to 700W                                Up to 1,200W            Up to 1,400W

Table 2. NVIDIA GPU chip comparison

From chip to AI factory

Blackwell Ultra GPUs form the backbone of NVIDIA’s next-generation AI infrastructure—delivering transformative performance from desktop superchips to full AI factory racks.  

NVIDIA Grace Blackwell Ultra Superchip

This superchip couples one Grace CPU with two Blackwell Ultra GPUs through NVLink-C2C, offering up to 30 petaFLOPS of dense (40 petaFLOPS sparse) NVFP4 AI compute, along with 1 TB of unified memory that combines HBM3E and LPDDR5X for unprecedented on-node capacity. ConnectX-8 SuperNICs provide 800 Gb/s of high-speed network connectivity (see Figure 7). The NVIDIA Grace Blackwell Ultra Superchip is the foundational computing component of the GB300 NVL72 rack-scale system.
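As a sketch of how the superchip headline figures follow from the per-GPU specifications (the LPDDR5X portion is inferred from the stated ~1 TB total rather than quoted here):

```cpp
// Reconstructing the Grace Blackwell Ultra Superchip figures from per-GPU specs.
#include <cstdio>

int main() {
    const int    gpus          = 2;      // Blackwell Ultra GPUs per superchip
    const double dense_pflops  = 15.0;   // dense NVFP4 per GPU
    const double sparse_pflops = 20.0;   // sparse NVFP4 per GPU
    const double hbm_gb        = 288.0;  // HBM3E per GPU

    std::printf("Dense NVFP4  : %.0f PFLOPS\n", gpus * dense_pflops);   // 30 PFLOPS
    std::printf("Sparse NVFP4 : %.0f PFLOPS\n", gpus * sparse_pflops);  // 40 PFLOPS
    std::printf("GPU HBM3E    : %.0f GB (Grace LPDDR5X supplies the remainder of ~1 TB)\n",
                gpus * hbm_gb);
    return 0;
}
```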

Complete CUDA compatibility

Blackwell Ultra maintains full backward compatibility with the entire CUDA ecosystem while introducing optimizations for next-generation AI frameworks:

The bottom line

NVIDIA Blackwell Ultra establishes the foundation for AI factories to train and deploy intelligence at unprecedented scale and efficiency. With breakthrough innovations in dual-die integration, NVFP4 acceleration, massive memory capacity, and advanced interconnect technology, Blackwell Ultra enables AI applications that were previously computationally impossible.

As the industry transitions from proof-of-concept AI to production AI factories, Blackwell Ultra provides the computational foundation to turn AI ambitions into reality with unmatched performance, efficiency, and scale.

Learn more

Dive deeper into the innovations powering the trillion-token era. Download the Blackwell Architecture Technical Brief to explore the full silicon-to-system story.

Acknowledgments

We’d like to thank Manas Mandal, Ronny Krashinsky, Vishal Mehta, Greg Palmer, Michael Andersch, Eduardo Alvarez, Ashraf Eassa, Joe DeLaere, and many other NVIDIA GPU architects, engineers, and product leaders who contributed to this post.
