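Assignment 4: Introduction to NVIDIA DPU Programming: Unlocking the Power of AI Networking
This assignment is a comprehensive exploration of NVIDIA DPU programming, a cutting-edge technology at the intersection of AI and networking. Students will:
- Learn about NVIDIA accelerated computing and AI networking
- Use the NVIDIA DOCA software framework to develop and deploy data center infrastructure applications on the BlueField networking platform
- Explore how NVIDIA AI networking technologies accelerate AI workloads
As a result, students will:
- Understand NVIDIA's leadership in AI through its end-to-end accelerated computing and AI networking technologies
- Become proficient in developing for the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
- Build infrastructure applications and services for real-world scenarios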
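Goals
- Understand the importance of NVIDIA accelerated computing and networking technologies
- Master the fundamentals of the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
- Develop applications and services that create secure, accelerated infrastructure for a variety of workloads
- Use the NVIDIA DOCA SDK and APIs to build an application or service on NVIDIA BlueField-3
Prerequisites
- Basic knowledge of networking and the OSI model
- Familiarity with Linux programming and the command-line interface
- Familiarity with the C programming language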
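Grading and Requirements
- DOCA development environment availability: December 25, 00:00 to January 7, 23:59
- Project: choose one from the project list below
- Assessment: a 2-page short report written in Word or LaTeX, in Chinese or English
- Deadline: January 14, 23:59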
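Submission
Send the submission materials to the designated email address. Specifically, they include:
- A lab report in PDF format (Chinese or English; write it however you prefer, for example with Word or LaTeX). Send the PDF to [email protected].
- The analysis, case results, and other output of the assignment (no more than 10 MB), placed in a results folder.
- Note: the final submission consists of the lab report (PDF) and the supporting materials (folder), packed into one compressed archive and sent to the email address above.
To avoid misleading translations, such as rendering "Robustness" as "鲁棒性" or "Socket" as "套接字", the guide for this assignment is provided in English. Please read and translate it yourself. The TAs recommend the "Immersive Translate" (沉浸式翻译) extension for the Chrome browser for reading web pages, or exporting the HTML file and translating it with ChatGPT.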
Project List
1. NVIDIA DOCA Secure Channel
Difficulty ★★★★☆
Objectives
- Replicate the functionality of the NVIDIA DOCA Secure Channel Application
- Understand how to use DOCA Comm Channel APIs for:
- Creating a secure communication channel
- Exchanging messages between Host and BlueField-3 DPU
- Extend the Secure Channel functions to provide control services on BlueField-3 DPU
Introduction
The DOCA Secure Channel reference application leverages the DOCA Comm Channel API to create a secure, network-independent communication channel between the host and the NVIDIA BlueField DPU. Key features include:
- Enabling host control of DPU services and offloads
- Facilitating message exchange using a client-server framework
- Supporting one-to-many communication (server to multiple clients)
- Allowing communication between any PF/VF/SF on the host and the DPU server
- Configurable message size and quantity for simulating heavy load
Note: DOCA SDK 2.5.0 introduced a new API for DOCA Comm Channel, offering high-performance data path and compatibility with DOCA progress engine. The old API will be deprecated in future releases.
References
- Application source: /opt/mellanox/doca/applications/secure_channel/
- Configuration file: /opt/mellanox/doca/applications/secure_channel/sc_params.json
System Design
The secure channel application operates in client mode (host) and server mode (DPU), allowing bidirectional message flow once a channel is established.
Application Architecture
The application is built on the DOCA Comm Channel API. The connection flow between client and server is as follows:
- Both sides initiate create()
- Server listens for new connections
- Server calls recvfrom() to prepare for message exchange
- Client executes connect() to initiate connection
- Client sends the first message
- Server responds
This architecture enables secure, efficient communication between the host and DPU, facilitating advanced network operations and offloads.
Compilation
To build the secure channel application:
Direct build method:
cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
ninja -C /tmp/build
Using meson_options.txt:
a. Edit /opt/mellanox/doca/applications/meson_options.txt:
- Set enable_all_applications to false
- Set enable_secure_channel to true
b. Run compilation commands:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
The compiled doca_secure_channel executable will be created in /tmp/build/secure_channel/.
Running Application
The secure channel application requires compilation before execution. Use the following command to view usage instructions:
./doca_secure_channel -h
or
./doca_secure_channel --help
Application usage:
Usage: doca_secure_channel [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program
<10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program
<10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path> Parse all command flags from an input json file
Program Flags:
-s, --msg-size Message size to be sent
-n, --num-msgs Number of messages to be sent
-p, --pci-addr DOCA Comm Channel device PCI address
-r, --rep-pci DOCA Comm Channel device representor PCI address
(needed only on DPU)
These flags allow you to configure the application’s behavior, including log levels, message size, number of messages, and PCI addresses for communication.
Running on BlueField
Log in to BlueField
Enter the code folder
dpu# cd /opt/mellanox/doca/applications
dpu/opt/mellanox/doca/applications#
Build DOCA Secure Channel Application on BlueField
dpu/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
dpu/opt/mellanox/doca/applications# ninja -C /tmp/build
Check device PCIe address
dpu# mst start
dpu# mst status -v
……
PCI devices:
------------
DEVICE_TYPE        MST                          PCI      RDMA    NET                                NUMA
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0.1  03:00.1  mlx5_1  net-en3f1pf1sf0,net-pf1hpf,net-p1  -1
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0    03:00.0  mlx5_0  net-en3f0pf0sf0,net-p0,net-pf0hpf  -1
CLI example for running the application on BlueField:
dpu# ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 0b:00.0
Note: Both the DOCA Comm Channel device PCIe address (03:00.0) and the DOCA Comm Channel device representor PCIe address (0b:00.0) should match the addresses of the desired PCIe devices.
Running on Host
Log in to Host
Enter the code folder
host# cd /opt/mellanox/doca/applications
host/opt/mellanox/doca/applications#
Build DOCA Secure Channel Application on Host
host/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
host/opt/mellanox/doca/applications# ninja -C /tmp/build
Check the device PCIe address on the host (the same address is passed as the representor address with -r on BlueField)
host# mst start
host# mst status -v
……
PCI devices:
------------
DEVICE_TYPE        MST                          PCI      RDMA    NET              NUMA
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0    0b:00.0  mlx5_0  net-ens192f0np0  -1
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0.1  0b:00.1  mlx5_1  net-ens192f1np1  -1
CLI example for running the application on Host:
host# ./doca_secure_channel -s 256 -n 10 -p 0b:00.0
Note: The DOCA Comm Channel device PCIe address (0b:00.0) should match the address of the desired PCIe device. No representor address is passed on the host, since -r is needed only on the DPU.
Code Description
BlueField Side
Set the Secure Channel operation mode so that the endpoint runs on the DPU:
app_cfg.mode = SC_MODE_DPU;
Parse cmdline/json arguments:
register_secure_channel_params()
Initialize Communication Channel context: init_cc()
Create Comm Channel endpoint:
doca_comm_channel_ep_create()
Open Comm Channel DOCA device based on PCI address: open_doca_device_with_pci()
Open Comm Channel DOCA device representor based on PCI address: open_doca_device_rep_with_pci()
Set Comm Channel context properties, including DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, set DOCA device representor: set_cc_properties()
The Secure Channel server (secure_channel_server) starts listening: doca_comm_channel_ep_listen()
Initiate all relevant signal and epoll file descriptors: init_signaling_polling()
Create Comm Channel send/receive epoll instance: fd = epoll_create1(0)
Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:
fd = signalfd(-1, &signal_mask, 0);
epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)
Extract the event channel handles for the user. When sending or receiving in non-blocking mode, these handles can be used with epoll() to get an interrupt when a new event occurs:
doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)
Start threads and wait for them to finish: start_threads()
start sendto thread
pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)
start recvfrom thread
pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)
Add Comm Channel receive file descriptor to receive epoll instance:
epoll_ctl(ctx->cc_recv_epoll_fd, EPOLL_CTL_ADD, ctx->cc_recv_fd, &recv_event)
while (1) {
	doca_comm_channel_ep_recvfrom(ctx->ep, recv_buffer, &msg_len, DOCA_CC_MSG_FLAG_NONE, &curr_peer);
	/* Check whether an interrupt was received (events[ev_idx].data.fd == ctx->recv_intr_fd).
	 * If so, the receive thread exits and reports the total number of messages received successfully. */
	/* Signal the send thread to start sending messages. */
}
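The epoll steps above can be sketched in isolation. This is a minimal Linux-only sketch, not the application's code: the helper names make_poller and wait_for_event are hypothetical, and plain file descriptors (for example from eventfd) stand in for the Comm Channel receive handle and the signalfd termination descriptor.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Outcome of one wait on the epoll instance. */
enum wait_result { WAIT_ERROR = -1, WAIT_TIMEOUT = 0, WAIT_DATA = 1, WAIT_STOP = 2 };

/*
 * Create an epoll instance watching two descriptors (hypothetical helper):
 *   data_fd - stand-in for the Comm Channel receive handle
 *   stop_fd - stand-in for the termination (signalfd) descriptor
 * Returns the epoll fd, or -1 on failure.
 */
int make_poller(int data_fd, int stop_fd)
{
	int fds[2] = { data_fd, stop_fd };
	int epfd;

	assert(data_fd >= 0 && stop_fd >= 0);
	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;
	for (int i = 0; i < 2; i++) {
		struct epoll_event ev = { .events = EPOLLIN, .data.fd = fds[i] };

		if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
			close(epfd);
			return -1;
		}
	}
	return epfd;
}

/* Wait up to timeout_ms and report which descriptor became readable. */
enum wait_result wait_for_event(int epfd, int data_fd, int stop_fd, int timeout_ms)
{
	struct epoll_event events[2];
	enum wait_result res = WAIT_TIMEOUT;
	int n = epoll_wait(epfd, events, 2, timeout_ms);

	if (n < 0)
		return WAIT_ERROR;
	for (int i = 0; i < n; i++) {
		if (events[i].data.fd == stop_fd)
			return WAIT_STOP; /* termination takes priority over pending data */
		if (events[i].data.fd == data_fd)
			res = WAIT_DATA;
	}
	return res;
}
```

A receive loop built on this would call wait_for_event(), attempt a receive on WAIT_DATA, and leave the loop on WAIT_STOP, which mirrors the interrupt check described above.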
Host Side
Parse cmdline/json arguments: register_secure_channel_params()
Initialize Communication Channel context: init_cc()
Create Comm Channel endpoint:
doca_comm_channel_ep_create()
Open Comm Channel DOCA device based on PCI address:
open_doca_device_with_pci()
Set Comm Channel context properties, including DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, set DOCA device representor:
set_cc_properties()
Establish a connection with DPU node:
doca_comm_channel_ep_connect()
Initiate all relevant signal and epoll file descriptors: init_signaling_polling()
Create Comm Channel send/receive epoll instance:
fd = epoll_create1(0)
Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:
fd = signalfd(-1, &signal_mask, 0);
epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)
Extract the event channel handles for the user. When sending or receiving in non-blocking mode, these handles can be used with epoll() to get an interrupt when a new event occurs:
doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)
Start threads and wait for them to finish: start_threads()
start recvfrom thread:
pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)
start sendto thread:
pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)
Add Comm Channel send file descriptor to send epoll instance
epoll_ctl(ctx->cc_send_epoll_fd, EPOLL_CTL_ADD, ctx->cc_send_fd, &send_event)
while (msg_nb) {
	result = doca_comm_channel_ep_sendto(ctx->ep, send_buffer, ctx->cfg->send_msg_size, DOCA_CC_MSG_FLAG_NONE, ctx->peer);
	/* Check whether an interrupt was received (events[ev_idx].data.fd == ctx->send_intr_fd).
	 * If so, the send thread exits and reports the total number of messages sent successfully. */
}
Project Direction
Modify message parameters:
- Experiment with different message sizes using the -s or --msg-size flag.
- Vary the number of messages sent using the -n or --num-msgs flag.
- Example: ./doca_secure_channel -s 512 -n 100 -p <PCI_ADDRESS> [-r <REP_PCI_ADDRESS>]
Enhance logging and debugging:
- Increase the log level using the -l or --log-level flag for more detailed output.
- Add print statements in the source code to show detailed information about:
  - Channel connection establishment
  - Message transmission progress
  - Timing information for performance analysis
Implement JSON-based configuration:
- Create a JSON file with various configurations (e.g., sc_params.json)
- Run the application using the JSON file: ./doca_secure_channel --json ./sc_params.json
Explore different deployment scenarios:
- Test communication between different PF/VF/SF combinations
- Verify behavior with multiple clients connecting to the server (DPU) side
Error handling and resilience:
- Implement more robust error checking and handling in the application code
- Test application behavior under various error conditions (e.g., connection loss, invalid parameters)
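As one example of stricter parameter checking, the sketch below validates the text of a PCIe address before it is handed to the application. The helper pci_addr_is_valid is hypothetical and not part of DOCA. It assumes addresses are written either as bus:device.function (as in 03:00.0) or with a four-digit domain prefix, and it checks only the format, not whether the device exists.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `s` looks like "xx:xx.x" or
 * "xxxx:xx:xx.x", where every x is a hexadecimal digit.
 * Passing NULL is treated as a programming error.
 */
bool pci_addr_is_valid(const char *s)
{
	static const char *const patterns[] = { "xx:xx.x", "xxxx:xx:xx.x" };
	size_t slen;

	assert(s != NULL);
	slen = strlen(s);
	for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
		const char *pat = patterns[p];
		size_t len = strlen(pat);
		size_t i;

		if (slen != len)
			continue;
		for (i = 0; i < len; i++) {
			if (pat[i] == 'x') {
				if (!isxdigit((unsigned char)s[i]))
					break; /* expected a hex digit here */
			} else if (s[i] != pat[i]) {
				break; /* expected the ':' or '.' separator */
			}
		}
		if (i == len)
			return true;
	}
	return false;
}
```

Rejecting a malformed address early gives a clearer error message than a later device-open failure would.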
Performance optimization:
- Profile the application to identify potential bottlenecks
- Experiment with different buffer sizes and threading models for improved performance
Extended functionality:
- Implement a simple control protocol over the secure channel
- Add support for bidirectional simultaneous communication
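One possible shape for a simple control protocol is a small fixed header followed by a payload, carried inside each channel message. Everything in this sketch is an assumption made for illustration: the 4-byte header layout, the opcode values, and the names ctrl_msg, ctrl_encode, and ctrl_decode are hypothetical and not part of DOCA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CTRL_HDR_LEN 4u /* op (1) + status (1) + payload length (2, big-endian) */

/* Hypothetical opcodes for illustration only. */
enum ctrl_op { CTRL_OP_PING = 1, CTRL_OP_START_SERVICE = 2, CTRL_OP_STOP_SERVICE = 3 };

struct ctrl_msg {
	uint8_t op;
	uint8_t status;
	uint16_t payload_len;
	const uint8_t *payload; /* after decode, points into the input buffer */
};

/* Serialize msg into buf. Returns bytes written, or 0 if buf is too small. */
size_t ctrl_encode(const struct ctrl_msg *msg, uint8_t *buf, size_t buf_len)
{
	size_t total;

	assert(msg != NULL && buf != NULL);
	total = (size_t)CTRL_HDR_LEN + msg->payload_len;
	if (buf_len < total)
		return 0;
	buf[0] = msg->op;
	buf[1] = msg->status;
	buf[2] = (uint8_t)(msg->payload_len >> 8);
	buf[3] = (uint8_t)(msg->payload_len & 0xff);
	if (msg->payload_len > 0)
		memcpy(buf + CTRL_HDR_LEN, msg->payload, msg->payload_len);
	return total;
}

/* Parse a received buffer. Returns 0 on success, -1 if truncated or inconsistent. */
int ctrl_decode(const uint8_t *buf, size_t len, struct ctrl_msg *out)
{
	uint16_t payload_len;

	assert(buf != NULL && out != NULL);
	if (len < CTRL_HDR_LEN)
		return -1;
	payload_len = (uint16_t)(((unsigned)buf[2] << 8) | buf[3]);
	if ((size_t)payload_len != len - CTRL_HDR_LEN)
		return -1; /* header length does not match bytes received */
	out->op = buf[0];
	out->status = buf[1];
	out->payload_len = payload_len;
	out->payload = payload_len > 0 ? buf + CTRL_HDR_LEN : NULL;
	return 0;
}
```

The encoded buffer would be passed to the channel's send call, and the receiver would run ctrl_decode on each message before dispatching on op. An explicit byte order keeps the format independent of host and DPU endianness.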
Integration with other DOCA applications:
- Explore how the Secure Channel can be used in conjunction with other DOCA applications or services
Documentation
For detailed information about the NVIDIA DOCA Secure Channel Application, refer to the official guide: NVIDIA DOCA Secure Channel Application Guide
Key sections to review in the documentation:
- System Design and Application Architecture
- DOCA Libraries used (DOCA Comch)
- Compilation instructions
- Running the Application (including command-line flags and JSON-based deployment)
- Application Code Flow
2. NVIDIA DOCA DPA All-to-All
Difficulty ★★★★☆
Objective
- Replicate the functionality of NVIDIA DOCA DPA All-to-all Application
- Understand how to use DOCA DPA APIs for accelerating MPI all-to-all collective on BlueField-3 DPU
- Extend the DPA All-to-all functions to improve Collective Operation performance on BlueField-3 DPU
Introduction
The NVIDIA DPA All-to-All application demonstrates how the Message Passing Interface (MPI) all-to-all collective can be accelerated using the Data Path Accelerator (DPA). In an MPI collective, all processes within the same job call the collective routine.
Given a communicator of n ranks, the application performs a collective operation where all processes send and receive the same amount of data from all other processes (hence “all-to-all”).
System Design
All-to-all is an MPI method. MPI is a standardized and portable message passing standard designed for parallel computing architectures. An MPI program consists of several processes running in parallel.
Each process in the diagram divides its local sendbuf into n blocks (4 in this example), each containing sendcount elements (4 in this example). Process i sends the k-th block of its local sendbuf to process k, which places the data in the i-th block of its local recvbuf.
Implementing the all-to-all method using DOCA DPA offloads the copying of elements from the sendbuf to the recvbufs to the DPA, freeing the CPU to perform other computations.
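The block exchange described above can be written as a plain host-side reference, which is useful for checking the results of a DPA run. This is a sketch under stated assumptions, not the application's code: the function name alltoall_reference is hypothetical, and all ranks' buffers are stored back to back in one array so that the example runs in a single process without MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Reference all-to-all for n ranks (hypothetical helper).
 * send and recv each hold n buffers back to back; every buffer has
 * n blocks of sendcount ints. Block k of rank i's sendbuf is copied
 * to block i of rank k's recvbuf.
 */
void alltoall_reference(int n, size_t sendcount, const int *send, int *recv)
{
	assert(n > 0 && sendcount > 0 && send != NULL && recv != NULL);
	for (int i = 0; i < n; i++) {         /* sending rank */
		for (int k = 0; k < n; k++) { /* receiving rank */
			const int *src = send + ((size_t)i * n + k) * sendcount;
			int *dst = recv + ((size_t)k * n + i) * sendcount;

			memcpy(dst, src, sendcount * sizeof(int));
		}
	}
}
```

With 2 ranks and sendcount 2, send buffers {0,1,2,3} and {10,11,12,13} become receive buffers {0,1,10,11} and {2,3,12,13}: the operation is a transpose of blocks across ranks.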
Application Architecture
The following diagram illustrates the differences between host-based all-to-all and DPA all-to-all operations:
- In DPA all-to-all, DPA threads perform the all-to-all operation, freeing the CPU for other computations.
- In host-based all-to-all, the CPU must still perform the all-to-all operation at some point and is not completely available for other computations.
Compilation
To build only the DPA all-to-all application:
cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
ninja -C /tmp/build
Alternatively, users can set the desired flags in the meson_options.txt file:
Edit the following flags in /opt/mellanox/doca/applications/meson_options.txt:
- Set enable_all_applications to false
- Set enable_dpa_all_to_all to true
Run the following compilation commands:
cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build
The doca_dpa_all_to_all executable is created under /tmp/build/dpa_all_to_all/.
Running Application
The DPA all-to-all application is provided in source form. Therefore, compilation is required before execution.
Application usage instructions (run ./doca_dpa_all_to_all -h or ./doca_dpa_all_to_all --help):
Usage: doca_dpa_all_to_all [DOCA Flags] [Program Flags]
DOCA Flags:
-h, --help Print a help synopsis
-v, --version Print program version information
-l, --log-level Set the (numeric) log level for the program
<10=DISABLE, 20=CRITICAL, 30=ERROR,
40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level Set the SDK (numeric) log level for the program
<10=DISABLE, 20=CRITICAL, 30=ERROR,
40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path> Parse all command flags from an input json file
Program Flags:
-m, --msgsize <Message size> The message size - the size of the
sendbuf and recvbuf (in bytes).
Must be in multiples of integer size.
Default is size of one integer times
the number of processes.
-d, --devices <IB device names> IB devices names that support DPA, separated
by comma without spaces (max of two
devices). If not provided, a random
IB device will be chosen.
Running on BlueField
Log in to BlueField
Enter the code folder:
dpu# cd /opt/mellanox/doca/applications
dpu/opt/mellanox/doca/applications#
MPI is used for compilation and running of this application. Ensure that MPI is installed on your setup. By default, DOCA All will provide openmpi but not mpicc. Run the following commands:
Check if mpicc is installed:
dpu# dpkg -l | grep mpich
If not installed, install mpicc:
dpu# apt-get install mpich
Build DOCA DPA All-to-all Application on BlueField:
# meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
# ninja -C /tmp/build
Check the mlx device name on BlueField:
# mst status -v
……
PCI devices:
------------
DEVICE_TYPE        MST                          PCI      RDMA    NET                                NUMA
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0.1  03:00.1  mlx5_1  net-en3f1pf1sf0,net-pf1hpf,net-p1  -1
BlueField3(rev:1)  /dev/mst/mt41692_pciconf0    03:00.0  mlx5_0  net-en3f0pf0sf0,net-p0,net-pf0hpf  -1
Run the DPA All-to-all application with 4 processes, 32 bytes as message size, and mlx5_0 as the InfiniBand device:
# mpirun -np 4 /tmp/build/dpa_all_to_all/doca_dpa_all_to_all -m 32 -d "mlx5_0"
Notes:
- -d specifies the RDMA device shown in the previous step
- -m is the message size, representing the size of the sendbuf and recvbuf. It’s divided into nProcs * Buffer size, and BufSize is further divided into message count * nProcs
Code Description
Initialize MPI:
MPI_Init(&argc, &argv);
Parse application arguments:
- Initialize arg parser resources and register DOCA general parameters:
doca_argp_init();
- Register the application’s parameters:
register_all_to_all_params();
- Parse the arguments:
doca_argp_start();
- Only let the first process (of rank 0) parse the parameters to then broadcast them to the rest of the processes.
Check and prepare the needed resources for the all_to_all call:
- Check the number of processes (maximum is 16).
- Check the msgsize. It must be in multiples of integer size and at least the number of processes times integer size.
- Allocate the sendbuf and recvbuf according to msgsize.
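The checks listed above can be expressed as a small validation function. This is a sketch based only on the rules stated in this guide (at most 16 processes, msgsize a multiple of the integer size and at least the number of processes times the integer size); the names a2a_params_ok, a2a_default_msgsize, and MAX_NUM_PROCS are hypothetical and do not come from the application source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NUM_PROCS 16 /* process limit quoted in the text above */

/* Return true if the process count and message size (in bytes) are acceptable. */
bool a2a_params_ok(int num_procs, size_t msgsize)
{
	if (num_procs < 1 || num_procs > MAX_NUM_PROCS)
		return false;
	if (msgsize % sizeof(int) != 0)
		return false; /* must be a multiple of the integer size */
	return msgsize >= (size_t)num_procs * sizeof(int);
}

/* Default message size from the usage text: one integer per process. */
size_t a2a_default_msgsize(int num_procs)
{
	assert(num_procs >= 1);
	return (size_t)num_procs * sizeof(int);
}
```

For the command line used earlier (4 processes, -m 32), the check passes on platforms with 4-byte integers because 32 is a multiple of 4 and is at least 16.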
Prepare the resources required to perform the all-to-all method using DOCA DPA:
- Initialize DOCA DPA context:
  - Open DOCA DPA device (DOCA device that supports DPA):
  open_dpa_device();
  - Create DOCA DPA context using the opened device:
  doca_dpa_create();
- Create the required events for the all-to-all:
create_dpa_a2a_events()
{
	doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_ACCESS_CPU, DOCA_DPA_EVENT_WAIT_DEFAULT, &comp_event, 0);
	for (i = 0; i < resources->num_ranks; i++)
		doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_REMOTE, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_WAIT_DEFAULT, &(kernel_events[i]), 0);
}
- Create DOCA DPA worker (for the endpoints):
doca_dpa_worker_create();
- Prepare DOCA DPA endpoints:
  - Create DOCA DPA endpoints as the number of processes/ranks:
  for (i = 0; i < resources->num_ranks; i++)
  	doca_dpa_ep_create();
  - Connect the local process’ endpoints to the other processes’ endpoints:
  connect_dpa_a2a_endpoints();
  - Export the endpoints to DOCA DPA device endpoints and copy them to DPA heap memory:
  for (int i = 0; i < resources->num_ranks; i++) {
  	result = doca_dpa_ep_dev_export();
  	doca_dpa_mem_alloc();
  	doca_dpa_h2d_memcpy();
  }
- Prepare the memory required for the all-to-all method:
prepare_dpa_a2a_memory();
Launch the alltoall_kernel using DOCA DPA kernel launch:
- Every MPI rank launches a kernel of up to MAX_NUM_THREADS (16 in this example).
- Launch alltoall_kernel using kernel_launch:
doca_dpa_kernel_launch();
- Copy the relevant sendbuf to the correct recvbuf for every process:
for (i = thread_rank; i < num_ranks; i += num_threads) doca_dpa_dev_put_signal_nb();
- Wait until the alltoall_kernel has finished:
doca_dpa_event_wait_until();
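The strided loop in the kernel step decides which peer ranks each DPA thread serves. The sketch below reproduces only that index arithmetic on the host so the distribution can be inspected. The function name ranks_for_thread is hypothetical, and the actual data transfer (doca_dpa_dev_put_signal_nb in the text above) is replaced by recording the rank index.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Record the peer ranks that thread `thread_rank` (out of `num_threads`)
 * would handle, using the same stride as the kernel loop:
 *   for (i = thread_rank; i < num_ranks; i += num_threads)
 * Writes at most out_cap entries to out and returns how many were written.
 */
size_t ranks_for_thread(unsigned thread_rank, unsigned num_threads,
			unsigned num_ranks, unsigned *out, size_t out_cap)
{
	size_t count = 0;

	assert(num_threads > 0 && thread_rank < num_threads && out != NULL);
	for (unsigned i = thread_rank; i < num_ranks && count < out_cap; i += num_threads)
		out[count++] = i;
	return count;
}
```

With 4 threads and 10 ranks, thread 1 handles ranks 1, 5, and 9. When there are more threads than ranks, threads whose index is at or above the rank count do no work, which matters when tuning the thread count.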
Destroy the a2a_resources:
- Free all the DOCA DPA memories:
doca_dpa_mem_free();
- Unregister all the DOCA DPA host memories:
doca_dpa_mem_unregister();
- Destroy all the DOCA DPA endpoints:
doca_dpa_ep_destroy();
- Destroy the DOCA DPA worker:
doca_dpa_worker_destroy();
- Destroy all the DOCA DPA events:
doca_dpa_event_destroy();
- Destroy the DOCA DPA context:
doca_dpa_destroy();
- Close the DOCA device:
doca_dev_close();
Project Direction
Enhance the code with additional parameters:
- Add input for running multiple iterations
- Calculate and report execution time
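A minimal way to add iterations and timing is to wrap the operation in a loop and read a monotonic clock before and after. The sketch below assumes a POSIX system with clock_gettime. The names time_iterations, op_fn, and count_call are hypothetical, and count_call is only a stand-in for one all-to-all invocation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*op_fn)(void *arg);

/* Current monotonic time in seconds. */
static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Run op(arg) `iterations` times and return the mean seconds per iteration. */
double time_iterations(op_fn op, void *arg, int iterations)
{
	double start;

	assert(op != NULL && iterations > 0);
	start = now_sec();
	for (int i = 0; i < iterations; i++)
		op(arg);
	return (now_sec() - start) / iterations;
}

/* Stand-in operation: counts how many times it was invoked. */
void count_call(void *arg)
{
	(*(int *)arg)++;
}
```

In an MPI program, each rank would typically synchronize with a barrier before starting the clock, and the per-rank times would be reduced to a maximum or average for reporting. That step is left out here to keep the sketch free of MPI dependencies.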
Increase the number of DPA Execution Units (EUs) to test alltoall performance
Implement additional customizations and extensions:
- Add multi-server support
- Integrate secure_channel logic
- Explore other MPI collective operations that could benefit from DPA acceleration
Documents
OpenMPI
- Communications between multiple processes
- Transfer BufferAddr/Event/Handlers
- OpenMPI v4.1 Documentation
DOCA DPA
- DPA create/start/execution
- DOCA DPA Documentation
DOCA MMAP
- MMAP and buffer handling on DPA
- DOCA Core Memory Subsystem
DOCA RDMA
- Communication between Ranks and DPA
- DOCA RDMA Documentation
DOCA AlltoAll
Resources
Book
- Data Processing Unit – Introduction to DPU Programming (ISBN 978-7-111-73115-3)
Homepages & Documents
- NVIDIA BlueField Networking Platform Homepage (Chinese)
- NVIDIA DOCA Software Framework Homepage (Chinese)
- NVIDIA DOCA Developer Forum Homepage (Chinese)
- Get Started With NVIDIA DOCA (Chinese)
- DOCA Developer Quick-Start Guide (English)
- BlueField Admin Quick-Start Guide (English)
- DPU Hardware Installation Instructions (English)
- NVIDIA BlueField HW Manuals (English)
- NVIDIA BlueField Platform SW Manuals (English)
- NVIDIA DOCA Documentation (English)
- NVIDIA DOCA Documentation v2.7.0 (English)
- NVIDIA DOCA Installation Guide for Linux (English)
- NVIDIA Firmware Tools (MFT) Documents v4.29.0 (English)
- NVIDIA DPU & DOCA Developer Blog (Chinese, Search DPU or DOCA)
- Introduction to NVIDIA DPU Programming ebook (Chinese)
Self-paced Free Online Courses
- Introduction to DOCA for DPUs (Chinese Captions)
- Getting Started with DOCA Flow (Chinese Captions)
- Introduction to Congest Control (Chinese)
- Accelerating AI Workload by building Congestion Control Algorithm with DOCA (Chinese)
Free DOCA Development Environment
Scan the QR code to apply for a free DOCA development environment provided by an NVIDIA Authorized Partner DPU & DOCA Excellence Center. An-Link is a new DPU & DOCA Excellence Center and will provide access to a free DOCA development environment later.