12.04
Instructor: Yixin Zhu
Topics Covered
NVIDIA GPU: The Ideal Resource for AI
- Introduction to NVIDIA GPUs
- Overview of NVIDIA’s GPU architecture and its evolution
- Key features that make NVIDIA GPUs suitable for AI and machine learning tasks
- NVIDIA Accelerated Artificial Intelligence
- Accelerate library
- RAPIDS
- CV-CUDA
- NCCL
- TensorRT
- cuDNN
- Triton
- How to scale GPU computing power
- Intra-node connection: PCIe, NVLink
- Multi-node connection: InfiniBand, RoCE
- Scaling in AI: Strong Scaling, Weak Scaling
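The strong vs. weak scaling distinction can be made concrete with the classic scaling laws. Below is a minimal Python sketch using Amdahl's law (strong scaling: fixed problem size) and Gustafson's law (weak scaling: problem size grows with GPU count); the parallel fraction p = 0.95 is an illustrative assumption, not a measured value.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Strong scaling: fixed problem size; speedup is capped by the serial part."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p: float, n: int) -> float:
    """Weak scaling: per-GPU work is fixed, so the serial share shrinks as n grows."""
    return (1.0 - p) + p * n

p = 0.95  # assumed parallel fraction, for illustration only
for n in (8, 64, 512):
    print(f"{n:4d} GPUs  strong: {amdahl_speedup(p, n):7.2f}x"
          f"  weak: {gustafson_speedup(p, n):8.2f}x")
```

Note how strong scaling saturates near 1/(1-p) = 20x no matter how many GPUs are added, while weak scaling keeps growing, which is why large LLM training relies on scaling the workload with the cluster.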
NVIDIA SuperPOD Reference Architecture: Optimal solution for LLM training
- GPU Cluster Components
- Whole picture of NVIDIA SuperPOD
- SuperPOD system design
- DGX/HGX
- Network design for thousands of GPUs
- Compute Fabric
- Storage Fabric
- In-Band Management Network
- Out-Of-Band (OOB) Management Network
- Storage & management
- Storage Performance requirements
- Storage performance guidelines
- Management: NVIDIA Base Command Manager Essentials
NVIDIA NeMo Framework: Solutions to Accelerate Neural Network Training
- Parallelisms
- Data Parallelism
- Distributed Data Parallel (DDP)
- Distributed Optimizer (DO)
- Fully Sharded Data Parallel (FSDP)
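The core idea behind DDP can be sketched in a few lines: each replica computes gradients on its own shard of the batch, the gradients are averaged across replicas (an all-reduce), and every replica applies the identical update. This is a pure-Python toy with a 1-D least-squares model, simulating two "GPUs" as list shards; real DDP does the all-reduce over NCCL.

```python
def local_gradient(weights, shard):
    # toy gradient for the model y = w * x with squared-error loss
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

def all_reduce_mean(grads_per_replica):
    # averages gradients element-wise across replicas
    n = len(grads_per_replica)
    return [sum(g[i] for g in grads_per_replica) / n
            for i in range(len(grads_per_replica[0]))]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]          # one shard per simulated "GPU"
weights = [0.0]
grads = [local_gradient(weights, s) for s in shards]
avg = all_reduce_mean(grads)           # identical on every replica
weights = [w - 0.05 * g for w, g in zip(weights, avg)]
```

Because each shard here is the same size, the averaged gradient equals the full-batch gradient, which is exactly the invariant DDP maintains.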
- Model Parallelism
- Tensor Parallelism (TP)
- Pipeline Parallelism (PP)
- Expert Parallelism (EP)
- Activation Partitioning
- Sequence Parallelism (SP)
- Context Parallelism (CP)
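Tensor parallelism is easiest to see on a single linear layer: the weight matrix is split column-wise across devices, each device computes its slice of the output, and the slices are concatenated (an all-gather). The sketch below is a pure-Python toy with hand-picked shapes, simulating two "devices" with list slices; it is not Megatron/NeMo's implementation.

```python
def matmul(x, w):
    # x: [m][k], w: [k][n] -> [m][n]
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def split_columns(w, parts):
    # shard the weight column-wise, one shard per "device"
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [[1.0, 2.0]]                      # activations, shape [1, 2]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]            # weight, shape [2, 4]

shards = split_columns(w, 2)          # each "device" holds 2 of the 4 columns
partials = [matmul(x, ws) for ws in shards]
gathered = [sum((p[0] for p in partials), [])]   # all-gather along columns
assert gathered == matmul(x, w)       # matches the unsharded computation
```

A row-wise split works analogously but requires an all-reduce of partial sums instead of an all-gather; pairing the two is what lets consecutive layers avoid one communication step.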
- Differences and Comparisons Between Parallel Methods
- NVIDIA Acceleration Solution
- Low-precision Training (FP16, BF16, FP8)
- Flash Attention
- Activation Checkpointing
- CPU Offloading
- Computation and Communication Overlap
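Activation checkpointing trades compute for memory: instead of caching every layer's activation for the backward pass, only every k-th activation is stored, and the ones in between are recomputed on demand from the nearest checkpoint. The toy below uses a chain of simple arithmetic "layers" and hypothetical values; it sketches the bookkeeping only, not NeMo's actual implementation.

```python
def forward_with_checkpoints(x, layers, every=2):
    # run the chain, storing only every `every`-th activation
    ckpts = {0: x}
    for i, f in enumerate(layers):
        x = f(x)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x          # keep roughly 1/every of the activations
    return x, ckpts

def recompute_activation(i, layers, ckpts, every=2):
    # rebuild the activation after the first i layers from the nearest checkpoint
    start = (i // every) * every
    x = ckpts[start]
    for f in layers[start:i]:
        x = f(x)                      # recompute the activations that were dropped
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v + 3, lambda v: v * 4]
out, ckpts = forward_with_checkpoints(1, layers)
```

With `every=2`, memory for activations is roughly halved at the cost of one extra forward pass over each dropped segment during backward, which is the usual trade when activations, not weights, dominate GPU memory.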