Homework 4

Homework 4: Introduction to NVIDIA DPU Programming - Unlocking the Power of AI Networking

This assignment explores NVIDIA DPU programming, a cutting-edge technology at the intersection of AI and networking. Students will:

  1. Learn about NVIDIA accelerated computing and AI networking
  2. Use the NVIDIA DOCA software framework to develop and deploy data center infrastructure applications on the BlueField networking platform
  3. Explore how to accelerate AI workloads with NVIDIA AI networking technologies

and thereby:

  • Understand NVIDIA's leadership in AI through its end-to-end accelerated computing and AI networking technologies
  • Gain proficiency in developing for the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
  • Build infrastructure applications and services for real-world scenarios

Objectives

  • Understand the importance of NVIDIA accelerated computing and networking technologies
  • Master the fundamentals of the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
  • Develop applications and services that provide secure, accelerated infrastructure for a variety of workloads
  • Build applications or services on NVIDIA BlueField-3 using the NVIDIA DOCA SDK and APIs

Prerequisites

  • Basic knowledge of networking and the OSI model
  • Familiarity with Linux programming and the command-line interface
  • Familiarity with the C programming language

Grading and Requirements

  • DOCA development environment availability: Dec 25, 00:00 to Jan 7, 23:59
  • Project: choose one from the project list below
  • Assessment: a 2-page report written in Word or LaTeX, in either Chinese or English
  • Deadline: Jan 14, 23:59

Submission

The materials to be submitted are sent as a PDF to the designated email address. Specifically, they include:

  1. A lab report in PDF format (Chinese or English, free-form; use Word or LaTeX according to your preference). The PDF file must be sent to [email protected]
  2. The analysis, cases, and other results of the assignment (no more than 10 MB), placed in a results folder.
  3. Note that the final submission consists of the lab report (PDF format) and the supporting materials (folder), packaged into a single archive and sent to the email address.

To avoid misleading translations, such as rendering "Robustness" as "鲁棒性" or "Socket" as "套接字", the assignment guide below is written in English. Please read and translate it yourself; the TA recommends the "Immersive Translate" extension for Chrome when browsing the page, or exporting the HTML file and translating it with ChatGPT.

Project List

1. NVIDIA DOCA Secure Channel

Difficulty: ★★★★☆

Objectives

  1. Replicate the functionality of the NVIDIA DOCA Secure Channel Application
  2. Understand how to use DOCA Comm Channel APIs for:
    • Creating a secure communication channel
    • Exchanging messages between Host and BlueField-3 DPU
  3. Extend the Secure Channel functions to provide control services on BlueField-3 DPU

Introduction

The DOCA Secure Channel reference application leverages the DOCA Comm Channel API to create a secure, network-independent communication channel between the host and the NVIDIA BlueField DPU. Key features include:

  • Enabling host control of DPU services and offloads
  • Facilitating message exchange using a client-server framework
  • Supporting one-to-many communication (server to multiple clients)
  • Allowing communication between any PF/VF/SF on the host and the DPU server
  • Configurable message size and quantity for simulating heavy load

Note: DOCA SDK 2.5.0 introduced a new API for DOCA Comm Channel, offering high-performance data path and compatibility with DOCA progress engine. The old API will be deprecated in future releases.

References

  • Application source: /opt/mellanox/doca/applications/secure_channel/
  • Configuration file: /opt/mellanox/doca/applications/secure_channel/sc_params.json

System Design

The secure channel application operates in client mode (host) and server mode (DPU), allowing bidirectional message flow once a channel is established.

Application Architecture

The application is built on the DOCA Comm Channel API. The connection flow between client and server is as follows:

  1. Both sides initiate create()
  2. Server listens for new connections
  3. Server calls recvfrom() to prepare for message exchange
  4. Client executes connect() to initiate connection
  5. Client sends the first message
  6. Server responds

This architecture enables secure, efficient communication between the host and DPU, facilitating advanced network operations and offloads.

Compilation

To build the secure channel application:

  1. Direct build method:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    ninja -C /tmp/build

  2. Using meson_options.txt: a. Edit /opt/mellanox/doca/applications/meson_options.txt:

    • Set enable_all_applications to false
    • Set enable_secure_channel to true

    b. Run compilation commands:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build
    ninja -C /tmp/build

The compiled doca_secure_channel will be created in /tmp/build/secure_channel/.

Running Application

The secure channel application requires compilation before execution. Use the following command to view usage instructions:

./doca_secure_channel -h

or

./doca_secure_channel --help

Application usage:

Usage: doca_secure_channel [DOCA Flags] [Program Flags]

DOCA Flags:
 -h, --help            Print a help synopsis
 -v, --version         Print program version information
 -l, --log-level       Set the (numeric) log level for the program
                       <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
                       50=INFO, 60=DEBUG, 70=TRACE>
 --sdk-log-level       Set the SDK (numeric) log level for the program
                       <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
                       50=INFO, 60=DEBUG, 70=TRACE>
 -j, --json <path>     Parse all command flags from an input json file

Program Flags:
 -s, --msg-size        Message size to be sent
 -n, --num-msgs        Number of messages to be sent
 -p, --pci-addr        DOCA Comm Channel device PCI address
 -r, --rep-pci         DOCA Comm Channel device representor PCI address
                       (needed only on DPU)

These flags allow you to configure the application’s behavior, including log levels, message size, number of messages, and PCI addresses for communication.

Running on BlueField
  1. Login to BlueField

  2. Enter the code folder

    dpu# cd /opt/mellanox/doca/applications
    dpu/opt/mellanox/doca/applications#
  3. Build DOCA Secure Channel Application on BlueField

    dpu/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    dpu/opt/mellanox/doca/applications# ninja -C /tmp/build
  4. Check device PCIe address

    dpu# mst start
    dpu# mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                           PCI       RDMA         NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   03:00.1   mlx5_1       net-en3f1pf1sf0,net-pf1hpf,net-p1       -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     03:00.0   mlx5_0       net-en3f0pf0sf0,net-p0,net-pf0hpf       -1
  5. CLI example for running the application on BlueField:

    dpu# ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 0b:00.0 

    Note: Both the DOCA Comm Channel device PCIe address (03:00.0) and the DOCA Comm Channel device representor PCIe address (0b:00.0) should match the addresses of the desired PCIe devices.

Running on Host
  1. Login to Host

  2. Enter the code folder

    host# cd /opt/mellanox/doca/applications
    host/opt/mellanox/doca/applications#
  3. Build DOCA Secure Channel Application on Host

    host/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    host/opt/mellanox/doca/applications# ninja -C /tmp/build
  4. Check device representor PCIe address

    host# mst start
    host# mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                                PCI       RDMA            NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0         0b:00.0   mlx5_0          net-ens192f0np0                         -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1       0b:00.1   mlx5_1          net-ens192f1np1                         -1
  5. CLI example for running the application on Host:

    host# ./doca_secure_channel -s 256 -n 10 -p 0b:00.0

    Note: DOCA Comm Channel device representor PCIe address (0b:00.0) should match the address of the desired PCIe device.

Code Description

BlueField Side

  1. Set the Secure Channel configuration operation mode to run the endpoint on the DPU:

    app_cfg.mode = SC_MODE_DPU;

  2. Parse cmdline/json arguments:

    register_secure_channel_params()

  3. Initialize Communication Channel context: init_cc()

    • Create Comm Channel endpoint:

      doca_comm_channel_ep_create()

    • Open Comm Channel DOCA device based on PCI address: open_doca_device_with_pci()

    • Open Comm Channel DOCA device representor based on PCI address: open_doca_device_rep_with_pci()

    • Set Comm Channel context properties, including DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, set DOCA device representor: set_cc_properties()

    • Start the Secure Channel server (secure_channel_server) listening: doca_comm_channel_ep_listen()

  4. Initialize all relevant signal and epoll file descriptors: init_signaling_polling() (a self-contained sketch of this signalfd/epoll pattern follows this BlueField-side walkthrough)

    • Create Comm Channel send/receive epoll instance: fd = epoll_create1(0)

    • Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:

      fd = signalfd(-1, &signal_mask, 0);

      epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)

  5. Extract the event_channel handles for the application's use. When packets are sent/received in non-blocking mode, these handles can be monitored with epoll() so the thread is notified when a new event occurs:

    doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)

  6. Start threads and wait for them to finish: start_threads()

    • start sendto thread

      pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)

    • start recvfrom thread

      pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)

      • Add Comm Channel receive file descriptor to receive epoll instance:

        epoll_ctl(ctx->cc_recv_epoll_fd, EPOLL_CTL_ADD, ctx->cc_recv_fd, &recv_event)

      • while (1) {

        doca_comm_channel_ep_recvfrom(ctx->ep, recv_buffer, &msg_len, DOCA_CC_MSG_FLAG_NONE, &curr_peer);

        Check whether a termination interrupt was received (events[ev_idx].data.fd == ctx->recv_intr_fd); if so, the receive thread exits and logs the total number of messages received successfully.

        Signal send thread to start sending messages

        }
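
The self-contained sketch below (referenced in step 4) illustrates the signalfd/epoll termination pattern that init_signaling_polling() sets up. A pipe read end stands in for the Comm Channel receive file descriptor obtained from doca_comm_channel_ep_get_event_channel(); everything here is plain Linux API usage for illustration, not DOCA code.

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/signalfd.h>
    #include <sys/epoll.h>

    int main(void)
    {
        int pipe_fds[2];
        pipe(pipe_fds);                          /* pipe_fds[0] stands in for cc_recv_fd */

        /* Block SIGINT/SIGTERM so they are delivered through the signalfd instead */
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGTERM);
        sigprocmask(SIG_BLOCK, &mask, NULL);
        int intr_fd = signalfd(-1, &mask, 0);    /* termination file descriptor */

        /* One epoll instance watches both the data fd and the termination fd */
        int epoll_fd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipe_fds[0] };
        epoll_ctl(epoll_fd, EPOLL_CTL_ADD, pipe_fds[0], &ev);
        ev.data.fd = intr_fd;
        epoll_ctl(epoll_fd, EPOLL_CTL_ADD, intr_fd, &ev);

        printf("waiting for data or Ctrl+C...\n");
        struct epoll_event events[2];
        int n = epoll_wait(epoll_fd, events, 2, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == intr_fd)
                printf("termination signal received, the receive thread would exit\n");
            else
                printf("data ready, the receive thread would call doca_comm_channel_ep_recvfrom()\n");
        }
        return 0;
    }

In the application, the receive thread runs this wait in a loop and keeps calling recvfrom until the termination descriptor fires.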

Host Side

  1. Parse cmdline/json arguments: register_secure_channel_params()

  2. Initialize Communication Channel context: init_cc()

    • Create Comm Channel endpoint:

      doca_comm_channel_ep_create()

    • Open Comm Channel DOCA device based on PCI address:

      open_doca_device_with_pci()

    • Set the Comm Channel context properties (DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, DOCA device representor):

      set_cc_properties()

    • Establish a connection with the DPU node:

      doca_comm_channel_ep_connect()

  3. Initialize all relevant signal and epoll file descriptors: init_signaling_polling()

    • Create Comm Channel send/receive epoll instance:

      fd = epoll_create1(0)

    • Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:

      fd = signalfd(-1, &signal_mask, 0);

      epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)

  4. Extract the event_channel handles for the application's use. When packets are sent/received in non-blocking mode, these handles can be monitored with epoll() so the thread is notified when a new event occurs:

    doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)

  5. Start threads and wait for them to finish: start_threads()

    • start recvfrom thread:

      pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)

    • start sendto thread:

      pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)

      • Add Comm Channel send file descriptor to send epoll instance

        epoll_ctl(ctx->cc_send_epoll_fd, EPOLL_CTL_ADD, ctx->cc_send_fd, &send_event)

      • while (msg_nb) {

        result=doca_comm_channel_ep_sendto(ctx->ep, send_buffer, ctx->cfg->send_msg_size, DOCA_CC_MSG_FLAG_NONE, ctx->peer);

        // Check whether a termination interrupt was received: events[ev_idx].data.fd == ctx->send_intr_fd

        If so, the send thread exits and logs the total number of messages sent successfully (a hedged sketch of this send loop follows this host-side walkthrough)

        }
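
As noted above, the fragment below sketches this host-side send loop. The sendto call is taken from the reference description; the handling of DOCA_ERROR_AGAIN (waiting on the send event channel fd when the send queue is full, then retrying) is an assumption about the reference application's behavior, so verify it against the sources under /opt/mellanox/doca/applications/secure_channel/. Variable declarations are omitted.

    /* Sketch of sendto_channel(): msg_nb counts down the messages requested with -n */
    while (msg_nb) {
        result = doca_comm_channel_ep_sendto(ctx->ep, send_buffer, ctx->cfg->send_msg_size,
                                             DOCA_CC_MSG_FLAG_NONE, ctx->peer);
        if (result == DOCA_ERROR_AGAIN) {
            /* Assumed behavior: the send queue is full, so wait for the send event channel
             * fd or the termination fd to become readable, then retry */
            nfds = epoll_wait(ctx->cc_send_epoll_fd, events, MAX_EVENTS, -1);
            for (ev_idx = 0; ev_idx < nfds; ev_idx++)
                if (events[ev_idx].data.fd == ctx->send_intr_fd)
                    return NULL;    /* interrupted: send thread exits */
            continue;               /* retry the send */
        }
        if (result != DOCA_SUCCESS)
            break;                  /* unrecoverable error */
        msg_nb--;                   /* one more message sent successfully */
    }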

Project Direction

  1. Modify message parameters:

    • Experiment with different message sizes using the -s or --msg-size flag.
    • Vary the number of messages sent using the -n or --num-msgs flag.
    • Example: ./doca_secure_channel -s 512 -n 100 -p <PCI_ADDRESS> [-r <REP_PCI_ADDRESS>]
  2. Enhance logging and debugging:

    • Increase the log level using the -l or --log-level flag for more detailed output.
    • Add print statements in the source code to show detailed information about:
      • Channel connection establishment
      • Message transmission progress
      • Timing information for performance analysis (see the timing sketch after this list)
  3. Implement JSON-based configuration:

    • Create a JSON file with various configurations (e.g., sc_params.json)
    • Run the application using the JSON file: ./doca_secure_channel --json ./sc_params.json
  4. Explore different deployment scenarios:

    • Test communication between different PF/VF/SF combinations
    • Verify behavior with multiple clients connecting to the server (DPU) side
  5. Error handling and resilience:

    • Implement more robust error checking and handling in the application code
    • Test application behavior under various error conditions (e.g., connection loss, invalid parameters)
  6. Performance optimization:

    • Profile the application to identify potential bottlenecks
    • Experiment with different buffer sizes and threading models for improved performance
  7. Extended functionality:

    • Implement a simple control protocol over the secure channel
    • Add support for bidirectional simultaneous communication
  8. Integration with other DOCA applications:

    • Explore how the Secure Channel can be used in conjunction with other DOCA applications or services
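
For the timing suggestion in item 2, a minimal way to measure wall-clock time around a code region (for example around the send loop in sendto_channel()) is sketched below with clock_gettime; the helper name is illustrative.

    #include <stdio.h>
    #include <time.h>

    /* Elapsed time between two CLOCK_MONOTONIC timestamps, in seconds */
    static double elapsed_sec(struct timespec start, struct timespec end)
    {
        return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... region to measure, e.g. sending -n messages of -s bytes ... */
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("elapsed: %.6f s\n", elapsed_sec(start, end));
        return 0;
    }

Dividing the number of messages by the elapsed time then gives a rough messages-per-second figure to compare across different -s and -n settings.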

Documentation

For detailed information about the NVIDIA DOCA Secure Channel Application, refer to the official guide: NVIDIA DOCA Secure Channel Application Guide

Key sections to review in the documentation:

  • System Design and Application Architecture
  • DOCA Libraries used (DOCA Comch)
  • Compilation instructions
  • Running the Application (including command-line flags and JSON-based deployment)
  • Application Code Flow

2. NVIDIA DOCA DPA All-to-All

Difficulty: ★★★★☆

Objectives

  • Replicate the functionality of the NVIDIA DOCA DPA All-to-All Application
  • Understand how to use the DOCA DPA APIs to accelerate the MPI all-to-all collective on the BlueField-3 DPU
  • Extend the DPA all-to-all functions to improve collective operation performance on the BlueField-3 DPU

Introduction

The NVIDIA DPA All-to-All application demonstrates how the Message Passing Interface (MPI) all-to-all collective can be accelerated using the Data Path Accelerator (DPA). In an MPI collective, all processes within the same job call the collective routine.

Given a communicator of n ranks, the application performs a collective operation where all processes send and receive the same amount of data from all other processes (hence “all-to-all”).

System Design

All-to-all is an MPI collective operation. MPI is a standardized and portable message-passing standard designed for parallel computing architectures. An MPI program consists of several processes running in parallel.

All-to-All Operation
  • Each process in the diagram divides its local sendbuf into n blocks (4 in this example), each containing sendcount elements (4 in this example). Process i sends the k-th block of its local sendbuf to process k, which places the data in the i-th block of its local recvbuf.

  • Implementing the all-to-all method using DOCA DPA offloads the copying of elements from the srcbuf to the recvbufs to the DPA, freeing the CPU to perform other computations.
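
To make these semantics concrete, the minimal host-based MPI program below performs the same exchange with MPI_Alltoall, using one integer per block. It is a plain CPU baseline for comparison, not DOCA DPA code; compile it with mpicc and run it with, e.g., mpirun -np 4.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* e.g. 4 processes */

        int sendbuf[16], recvbuf[16];             /* enough room for up to 16 ranks */
        for (int i = 0; i < nprocs; i++)
            sendbuf[i] = rank * 100 + i;          /* block i is destined for rank i */

        /* Every rank sends one int to each rank and receives one int from each rank */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        for (int i = 0; i < nprocs; i++)
            printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

        MPI_Finalize();
        return 0;
    }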

Application Architecture

The following diagram illustrates the differences between host-based all-to-all and DPA all-to-all operations:

Host-based vs DPA All-to-All
  • In DPA all-to-all, DPA threads perform the all-to-all operation, freeing the CPU for other computations.
  • In host-based all-to-all, the CPU must still perform the all-to-all operation at some point and is not completely available for other computations.

Compilation

To build only the DPA all-to-all application:

cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
ninja -C /tmp/build

Alternatively, users can set the desired flags in the meson_options.txt file:

  1. Edit the following flags in /opt/mellanox/doca/applications/meson_options.txt:

    • Set enable_all_applications to false
    • Set enable_dpa_all_to_all to true
  2. Run the following compilation commands:

cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build

The doca_dpa_all_to_all executable is created under /tmp/build/dpa_all_to_all/.

Running Application

The DPA all-to-all application is provided in source form. Therefore, compilation is required before execution.

Application usage instructions (run ./doca_dpa_all_to_all -h or ./doca_dpa_all_to_all --help):

Usage: doca_dpa_all_to_all [DOCA Flags] [Program Flags]

DOCA Flags:
-h, --help                Print a help synopsis
-v, --version             Print program version information
-l, --log-level           Set the (numeric) log level for the program 
                          <10=DISABLE, 20=CRITICAL, 30=ERROR, 
                          40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level           Set the SDK (numeric) log level for the program 
                          <10=DISABLE, 20=CRITICAL, 30=ERROR, 
                          40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path>         Parse all command flags from an input json file

Program Flags:
-m, --msgsize <Message size>   The message size - the size of the 
                               sendbuf and recvbuf (in bytes). 
                               Must be in multiples of integer size.
                               Default is size of one integer times
                               the number of processes.
-d, --devices <IB device names> IB devices names that support DPA, separated
                                by comma without spaces (max of two 
                                devices). If not provided, a random
                                IB device will be chosen.

Running on BlueField
  1. Login to BlueField

  2. Enter the code folder:

    dpu# cd /opt/mellanox/doca/applications
    dpu/opt/mellanox/doca/applications#
  3. MPI is used to compile and run this application, so ensure that MPI is installed on your setup. By default, the DOCA All installation provides openmpi but not mpicc. Run the following commands:

    • Check if mpicc is installed:

      dpu# dpkg -l | grep mpich

    • If not installed, install mpicc:

      dpu# apt-get install mpich

  4. Build DOCA DPA All-to-all Application on BlueField:

    # meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
    # ninja -C /tmp/build
  5. Check the mlx device name on BlueField:

    # mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                           PCI       RDMA         NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   03:00.1   mlx5_1       net-en3f1pf1sf0,net-pf1hpf,net-p1       -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     03:00.0   mlx5_0       net-en3f0pf0sf0,net-p0,net-pf0hpf       -1
  6. Run the DPA all-to-all application with 4 processes, a 32-byte message size, and mlx5_0 as the InfiniBand device:

    # mpirun -np 4 /tmp/build/dpa_all_to_all/doca_dpa_all_to_all -m 32 -d "mlx5_0"
DPA All-to-All Execution Output 1
DPA All-to-All Execution Output 2

Notes:

  • -d specifies the RDMA device shown in the previous step
  • -m is the message size, i.e., the size in bytes of the sendbuf and recvbuf. Each buffer is divided into nProcs blocks, so every process exchanges msgsize / nProcs bytes with each peer. For example, with -m 32 and 4 processes, each process's 32-byte sendbuf is split into four 8-byte blocks (two integers per block).

Code Description

  1. Initialize MPI:

    MPI_Init(&argc, &argv);

  2. Parse application arguments:

    • Initialize arg parser resources and register DOCA general parameters:
      doca_argp_init();
    • Register the application’s parameters:
      register_all_to_all_params();
    • Parse the arguments:
      doca_argp_start();
    • Only the first process (rank 0) parses the parameters; it then broadcasts them to the rest of the processes (see the MPI sketch after this code description).
  3. Check and prepare the needed resources for the all_to_all call:

    • Check the number of processes (maximum is 16).
    • Check the msgsize. It must be in multiples of integer size and at least the number of processes times integer size.
    • Allocate the sendbuf and recvbuf according to msgsize.
  4. Prepare the resources required to perform the all-to-all method using DOCA DPA:

    • Initialize DOCA DPA context:
      • Open DOCA DPA device (DOCA device that supports DPA):
        open_dpa_device();
      • Create DOCA DPA context using the opened device:
        doca_dpa_create();
    • Create the required events for the all-to-all:
      create_dpa_a2a_events() {
          doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_ACCESS_CPU, DOCA_DPA_EVENT_WAIT_DEFAULT, &comp_event, 0); 
          for (i = 0; i < resources->num_ranks; i++)
              doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_REMOTE, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_WAIT_DEFAULT, &(kernel_events[i]), 0);
      }
    • Create DOCA DPA worker (for the endpoints):
      doca_dpa_worker_create();
    • Prepare DOCA DPA endpoints:
      • Create DOCA DPA endpoints as the number of processes/ranks:
        for (i = 0; i < resources->num_ranks; i++)
            doca_dpa_ep_create();
      • Connect the local process’ endpoints to the other processes’ endpoints:
        connect_dpa_a2a_endpoints();
      • Export the endpoints to DOCA DPA device endpoints and copy them to DPA heap memory:
        for (int i = 0; i < resources->num_ranks; i++) {
            result = doca_dpa_ep_dev_export();
            doca_dpa_mem_alloc();
            doca_dpa_h2d_memcpy();
        }
    • Prepare the memory required for the all-to-all method:
      prepare_dpa_a2a_memory();
  5. Launch the alltoall_kernel using DOCA DPA kernel launch:

    • Every MPI rank launches a kernel of up to MAX_NUM_THREADS (16 in this example).
    • Launch alltoall_kernel using kernel_launch:
      doca_dpa_kernel_launch();
    • Copy the relevant sendbuf to the correct recvbuf for every process:
      for (i = thread_rank; i < num_ranks; i += num_threads)
          doca_dpa_dev_put_signal_nb();
    • Wait until the alltoall_kernel has finished:
      doca_dpa_event_wait_until();
  6. Destroy the a2a_resources:

    • Free all the DOCA DPA memories:
      doca_dpa_mem_free();
    • Unregister all the DOCA DPA host memories:
      doca_dpa_mem_unregister();
    • Destroy all the DOCA DPA endpoints:
      doca_dpa_ep_destroy();
    • Destroy the DOCA DPA worker:
      doca_dpa_worker_destroy();
    • Destroy all the DOCA DPA events:
      doca_dpa_event_destroy();
    • Destroy the DOCA DPA context:
      doca_dpa_destroy();
    • Close the DOCA device:
      doca_dev_close();
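
The self-contained sketch below illustrates steps 2 and 3 above: only rank 0 parses the message size, the value is broadcast with MPI_Bcast so every rank uses the same configuration, and the size is validated before the buffers are allocated. The command-line parsing is a placeholder; the real application uses doca_argp with register_all_to_all_params().

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Step 2: rank 0 parses the message size (placeholder for the doca_argp parsing),
         * then broadcasts it to all other ranks */
        int msgsize = 0;
        if (rank == 0)
            msgsize = (argc > 1) ? atoi(argv[1]) : nprocs * (int)sizeof(int);
        MPI_Bcast(&msgsize, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Step 3: msgsize must be a multiple of sizeof(int) and at least nprocs integers */
        if (msgsize % (int)sizeof(int) != 0 || msgsize < nprocs * (int)sizeof(int)) {
            if (rank == 0)
                fprintf(stderr, "invalid message size %d\n", msgsize);
            MPI_Finalize();
            return 1;
        }

        int *sendbuf = malloc(msgsize);           /* allocated according to msgsize */
        int *recvbuf = malloc(msgsize);
        /* ... the all-to-all itself would go here ... */
        free(sendbuf);
        free(recvbuf);

        MPI_Finalize();
        return 0;
    }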

Project Direction

  1. Enhance the code with additional parameters:

    • Add an input parameter for running multiple iterations
    • Calculate and report the execution time (see the MPI_Wtime sketch after this list)
  2. Increase the number of DPA Execution Units (EUs) to test alltoall performance

  3. Implement additional customizations and extensions:

    • Add multi-server support
    • Integrate secure_channel logic
    • Explore other MPI collective operations that could benefit from DPA acceleration
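
For direction 1, a common pattern is to repeat the collective for a configurable number of iterations and time it with MPI_Wtime, synchronizing with MPI_Barrier so all ranks start together. The sketch below uses MPI_Alltoall as a stand-in for the DPA-accelerated call; the iteration count is illustrative.

    #include <stdio.h>
    #include <mpi.h>

    #define ITERATIONS 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int sendbuf[16] = {0}, recvbuf[16] = {0};

        MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
        double start = MPI_Wtime();
        for (int it = 0; it < ITERATIONS; it++)
            MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("%d iterations took %.6f s (%.3f us per all-to-all)\n",
                   ITERATIONS, elapsed, 1e6 * elapsed / ITERATIONS);

        MPI_Finalize();
        return 0;
    }

The same start/stop timestamps can wrap the doca_dpa_kernel_launch()/doca_dpa_event_wait_until() pair to compare the DPA path against the host-based baseline.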

Documents

  • OpenMPI
  • DOCA DPA
  • DOCA MMAP
  • DOCA RDMA
  • DOCA AlltoAll

Resources

Book

Homepages & Documents

Self-paced Free Online Courses

Free DOCA Development Environment

QR Code for Free DOCA Development Environment

Scan the QR code to apply for the free DOCA development environment provided by the NVIDIA Authorized Partner DPU & DOCA Excellence Center. An-Link is a new DPU & DOCA Excellence Center and will provide access to the free DOCA development environment at a later date.
