Homework 4

Homework 4: Introduction to NVIDIA DPU Programming - Unlocking the Power of AI Networking

This assignment explores NVIDIA DPU programming, a cutting-edge technology at the intersection of AI and networking. Students will:

  1. Learn about NVIDIA accelerated computing and AI networking
  2. Use the NVIDIA DOCA software framework to develop and deploy data center infrastructure applications on the BlueField networking platform
  3. Explore how to accelerate AI workloads with NVIDIA AI networking technologies

and thereby:

  • Understand NVIDIA's leadership in AI through its end-to-end accelerated computing and AI networking technologies
  • Gain proficiency in developing for the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
  • Build infrastructure applications and services for real-world scenarios

Objectives

  • Understand the importance of NVIDIA accelerated computing and networking technologies
  • Master the fundamentals of the NVIDIA BlueField-3 networking platform and the NVIDIA DOCA software framework
  • Develop applications and services that provide secure, accelerated infrastructure for a variety of workloads
  • Build applications or services on NVIDIA BlueField-3 using the NVIDIA DOCA SDK and APIs

Prerequisites

  • Basic knowledge of networking and the OSI model
  • Familiarity with Linux programming and the command-line interface
  • Familiarity with the C programming language

Grading and Requirements

  • DOCA development environment availability: Dec 25, 00:00 to Jan 7, 23:59
  • Project: choose one from the project list below
  • Assessment: a 2-page report written in Word or LaTeX, in either Chinese or English
  • Deadline: Jan 14, 23:59

Submission

The materials to be submitted are sent as a PDF to the designated email address. Specifically, they include:

  1. A lab report in PDF format (Chinese or English, free-form; use Word or LaTeX according to your preference). The PDF file must be sent to [email protected]
  2. The analysis, cases, and other results of the assignment (no more than 10 MB), placed in a results folder.
  3. Note that the final submission consists of the lab report (PDF format) and the supporting materials (folder), packaged into a single archive and sent to the email address.

To avoid misleading translations, such as rendering "Robustness" as "鲁棒性" or "Socket" as "套接字", the assignment guide below is written in English. Please read and translate it yourself; the TA recommends the "Immersive Translate" extension for Chrome when browsing the page, or exporting the HTML file and translating it with ChatGPT.

Project List

1. NVIDIA DOCA Secure Channel

Difficulty: ★★★★☆

Objectives

  1. Replicate the functionality of the NVIDIA DOCA Secure Channel Application
  2. Understand how to use DOCA Comm Channel APIs for:
    • Creating a secure communication channel
    • Exchanging messages between Host and BlueField-3 DPU
  3. Extend the Secure Channel functions to provide control services on BlueField-3 DPU

Introduction

The DOCA Secure Channel reference application leverages the DOCA Comm Channel API to create a secure, network-independent communication channel between the host and the NVIDIA BlueField DPU. Key features include:

  • Enabling host control of DPU services and offloads
  • Facilitating message exchange using a client-server framework
  • Supporting one-to-many communication (server to multiple clients)
  • Allowing communication between any PF/VF/SF on the host and the DPU server
  • Configurable message size and quantity for simulating heavy load

Note: DOCA SDK 2.5.0 introduced a new API for DOCA Comm Channel, offering high-performance data path and compatibility with DOCA progress engine. The old API will be deprecated in future releases.

References

  • Application source: /opt/mellanox/doca/applications/secure_channel/
  • Configuration file: /opt/mellanox/doca/applications/secure_channel/sc_params.json

System Design

The secure channel application operates in client mode (host) and server mode (DPU), allowing bidirectional message flow once a channel is established.

Application Architecture

The application is built on the DOCA Comm Channel API. The connection flow between client and server is as follows:

  1. Both sides initiate create()
  2. Server listens for new connections
  3. Server calls recvfrom() to prepare for message exchange
  4. Client executes connect() to initiate connection
  5. Client sends the first message
  6. Server responds

This architecture enables secure, efficient communication between the host and DPU, facilitating advanced network operations and offloads.

Compilation

To build the secure channel application:

  1. Direct build method:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    ninja -C /tmp/build

  2. Using meson_options.txt: a. Edit /opt/mellanox/doca/applications/meson_options.txt:

    • Set enable_all_applications to false
    • Set enable_secure_channel to true

    b. Run compilation commands:

    cd /opt/mellanox/doca/applications/
    meson /tmp/build
    ninja -C /tmp/build

The compiled doca_secure_channel will be created in /tmp/build/secure_channel/.

Running Application

The secure channel application requires compilation before execution. Use the following command to view usage instructions:

./doca_secure_channel -h

or

./doca_secure_channel --help

Application usage:

Usage: doca_secure_channel [DOCA Flags] [Program Flags]

DOCA Flags:
 -h, --help            Print a help synopsis
 -v, --version         Print program version information
 -l, --log-level       Set the (numeric) log level for the program
                       <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
                       50=INFO, 60=DEBUG, 70=TRACE>
 --sdk-log-level       Set the SDK (numeric) log level for the program
                       <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING,
                       50=INFO, 60=DEBUG, 70=TRACE>
 -j, --json <path>     Parse all command flags from an input json file

Program Flags:
 -s, --msg-size        Message size to be sent
 -n, --num-msgs        Number of messages to be sent
 -p, --pci-addr        DOCA Comm Channel device PCI address
 -r, --rep-pci         DOCA Comm Channel device representor PCI address
                       (needed only on DPU)

These flags allow you to configure the application’s behavior, including log levels, message size, number of messages, and PCI addresses for communication.

Running on BlueField
  1. Login to BlueField

  2. Enter the code folder

    dpu# cd /opt/mellanox/doca/applications
    dpu/opt/mellanox/doca/applications#
  3. Build DOCA Secure Channel Application on BlueField

    dpu/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    dpu/opt/mellanox/doca/applications# ninja -C /tmp/build
  4. Check device PCIe address

    dpu# mst start
    dpu# mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                           PCI       RDMA         NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   03:00.1   mlx5_1       net-en3f1pf1sf0,net-pf1hpf,net-p1       -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     03:00.0   mlx5_0       net-en3f0pf0sf0,net-p0,net-pf0hpf       -1
  5. CLI example for running the application on BlueField:

    dpu# ./doca_secure_channel -s 256 -n 10 -p 03:00.0 -r 0b:00.0 

    Note: Both the DOCA Comm Channel device PCIe address (03:00.0) and the DOCA Comm Channel device representor PCIe address (0b:00.0) should match the addresses of the desired PCIe devices.

Running on Host
  1. Login to Host

  2. Enter the code folder

    host# cd /opt/mellanox/doca/applications
    host/opt/mellanox/doca/applications#
  3. Build DOCA Secure Channel Application on Host

    host/opt/mellanox/doca/applications# meson /tmp/build -Denable_all_applications=false -Denable_secure_channel=true
    host/opt/mellanox/doca/applications# ninja -C /tmp/build
  4. Check device representor PCIe address

    host# mst start
    host# mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                                PCI       RDMA            NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0         0b:00.0   mlx5_0          net-ens192f0np0                         -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1       0b:00.1   mlx5_1          net-ens192f1np1                         -1
  5. CLI example for running the application on Host:

    host# ./doca_secure_channel -s 256 -n 10 -p 0b:00.0

    Note: DOCA Comm Channel device representor PCIe address (0b:00.0) should match the address of the desired PCIe device.

Code Description

BlueField Side

  1. Set the Secure Channel configuration operation mode to run the endpoint on the DPU:

    app_cfg.mode = SC_MODE_DPU;

  2. Parse cmdline/json arguments:

    register_secure_channel_params()

  3. Initialize Communication Channel context: init_cc()

    • Create Comm Channel endpoint:

      doca_comm_channel_ep_create()

    • Open Comm Channel DOCA device based on PCI address: open_doca_device_with_pci()

    • Open Comm Channel DOCA device representor based on PCI address: open_doca_device_rep_with_pci()

    • Set Comm Channel context properties, including DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, set DOCA device representor: set_cc_properties()

    • Start the Secure Channel server (secure_channel_server) listening: doca_comm_channel_ep_listen()

  4. Initialize all relevant signal and epoll file descriptors: init_signaling_polling() (a self-contained sketch of this signalfd/epoll pattern follows this BlueField-side walkthrough)

    • Create Comm Channel send/receive epoll instance: fd = epoll_create1(0)

    • Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:

      fd = signalfd(-1, &signal_mask, 0);

      epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)

  5. Extract the event_channel handles for the application's use. When packets are sent/received in non-blocking mode, these handles can be monitored with epoll() so the thread is notified when a new event occurs:

    doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)

  6. Start threads and wait for them to finish: start_threads()

    • start sendto thread

      pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)

    • start recvfrom thread

      pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)

      • Add Comm Channel receive file descriptor to receive epoll instance:

        epoll_ctl(ctx->cc_recv_epoll_fd, EPOLL_CTL_ADD, ctx->cc_recv_fd, &recv_event)

      • while (1) {

        doca_comm_channel_ep_recvfrom(ctx->ep, recv_buffer, &msg_len, DOCA_CC_MSG_FLAG_NONE, &curr_peer);

        Check whether a termination interrupt was received (events[ev_idx].data.fd == ctx->recv_intr_fd); if so, the receive thread exits and logs the total number of messages received successfully.

        Signal send thread to start sending messages

        }
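
The self-contained sketch below (referenced in step 4) illustrates the signalfd/epoll termination pattern that init_signaling_polling() sets up. A pipe read end stands in for the Comm Channel receive file descriptor obtained from doca_comm_channel_ep_get_event_channel(); everything here is plain Linux API usage for illustration, not DOCA code.

    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/signalfd.h>
    #include <sys/epoll.h>

    int main(void)
    {
        int pipe_fds[2];
        pipe(pipe_fds);                          /* pipe_fds[0] stands in for cc_recv_fd */

        /* Block SIGINT/SIGTERM so they are delivered through the signalfd instead */
        sigset_t mask;
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGTERM);
        sigprocmask(SIG_BLOCK, &mask, NULL);
        int intr_fd = signalfd(-1, &mask, 0);    /* termination file descriptor */

        /* One epoll instance watches both the data fd and the termination fd */
        int epoll_fd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipe_fds[0] };
        epoll_ctl(epoll_fd, EPOLL_CTL_ADD, pipe_fds[0], &ev);
        ev.data.fd = intr_fd;
        epoll_ctl(epoll_fd, EPOLL_CTL_ADD, intr_fd, &ev);

        printf("waiting for data or Ctrl+C...\n");
        struct epoll_event events[2];
        int n = epoll_wait(epoll_fd, events, 2, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == intr_fd)
                printf("termination signal received, the receive thread would exit\n");
            else
                printf("data ready, the receive thread would call doca_comm_channel_ep_recvfrom()\n");
        }
        return 0;
    }

In the application, the receive thread runs this wait in a loop and keeps calling recvfrom until the termination descriptor fires.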

Host Side

  1. Parse cmdline/json arguments: register_secure_channel_params()

  2. Initialize Communication Channel context: init_cc()

    • Create Comm Channel endpoint:

      doca_comm_channel_ep_create()

    • Open Comm Channel DOCA device based on PCI address:

      open_doca_device_with_pci()

    • Set the Comm Channel context properties (DOCA device, max_msg_size, snd_queue_size, rcv_queue_size, DOCA device representor):

      set_cc_properties()

    • Establish a connection with the DPU node:

      doca_comm_channel_ep_connect()

  3. Initialize all relevant signal and epoll file descriptors: init_signaling_polling()

    • Create Comm Channel send/receive epoll instance:

      fd = epoll_create1(0)

    • Create send/receive termination file descriptor, and add termination file descriptor to epoll instance:

      fd = signalfd(-1, &signal_mask, 0);

      epoll_ctl(*cc_send_epoll_fd, EPOLL_CTL_ADD, *send_interrupt_fd, &intr_fd)

  4. Extract the event_channel handles for the application's use. When packets are sent/received in non-blocking mode, these handles can be monitored with epoll() so the thread is notified when a new event occurs:

    doca_comm_channel_ep_get_event_channel(ctx->ep, &ctx->cc_send_fd, &ctx->cc_recv_fd)

  5. Start threads and wait for them to finish: start_threads()

    • start recvfrom thread:

      pthread_create(ctx->recvfrom_t, NULL, recvfrom_channel, (void *)ctx)

    • start sendto thread:

      pthread_create(ctx->sendto_t, NULL, sendto_channel, (void *)ctx)

      • Add Comm Channel send file descriptor to send epoll instance

        epoll_ctl(ctx->cc_send_epoll_fd, EPOLL_CTL_ADD, ctx->cc_send_fd, &send_event)

      • while (msg_nb) {

        result=doca_comm_channel_ep_sendto(ctx->ep, send_buffer, ctx->cfg->send_msg_size, DOCA_CC_MSG_FLAG_NONE, ctx->peer);

        // Check whether a termination interrupt was received: events[ev_idx].data.fd == ctx->send_intr_fd

        If so, the send thread exits and logs the total number of messages sent successfully (a hedged sketch of this send loop follows this host-side walkthrough)

        }
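
As noted above, the fragment below sketches this host-side send loop. The sendto call is taken from the reference description; the handling of DOCA_ERROR_AGAIN (waiting on the send event channel fd when the send queue is full, then retrying) is an assumption about the reference application's behavior, so verify it against the sources under /opt/mellanox/doca/applications/secure_channel/. Variable declarations are omitted.

    /* Sketch of sendto_channel(): msg_nb counts down the messages requested with -n */
    while (msg_nb) {
        result = doca_comm_channel_ep_sendto(ctx->ep, send_buffer, ctx->cfg->send_msg_size,
                                             DOCA_CC_MSG_FLAG_NONE, ctx->peer);
        if (result == DOCA_ERROR_AGAIN) {
            /* Assumed behavior: the send queue is full, so wait for the send event channel
             * fd or the termination fd to become readable, then retry */
            nfds = epoll_wait(ctx->cc_send_epoll_fd, events, MAX_EVENTS, -1);
            for (ev_idx = 0; ev_idx < nfds; ev_idx++)
                if (events[ev_idx].data.fd == ctx->send_intr_fd)
                    return NULL;    /* interrupted: send thread exits */
            continue;               /* retry the send */
        }
        if (result != DOCA_SUCCESS)
            break;                  /* unrecoverable error */
        msg_nb--;                   /* one more message sent successfully */
    }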

Project Direction

  1. Modify message parameters:

    • Experiment with different message sizes using the -s or --msg-size flag.
    • Vary the number of messages sent using the -n or --num-msgs flag.
    • Example: ./doca_secure_channel -s 512 -n 100 -p <PCI_ADDRESS> [-r <REP_PCI_ADDRESS>]
  2. Enhance logging and debugging:

    • Increase the log level using the -l or --log-level flag for more detailed output.
    • Add print statements in the source code to show detailed information about:
      • Channel connection establishment
      • Message transmission progress
      • Timing information for performance analysis (see the timing sketch after this list)
  3. Implement JSON-based configuration:

    • Create a JSON file with various configurations (e.g., sc_params.json)
    • Run the application using the JSON file: ./doca_secure_channel --json ./sc_params.json
  4. Explore different deployment scenarios:

    • Test communication between different PF/VF/SF combinations
    • Verify behavior with multiple clients connecting to the server (DPU) side
  5. Error handling and resilience:

    • Implement more robust error checking and handling in the application code
    • Test application behavior under various error conditions (e.g., connection loss, invalid parameters)
  6. Performance optimization:

    • Profile the application to identify potential bottlenecks
    • Experiment with different buffer sizes and threading models for improved performance
  7. Extended functionality:

    • Implement a simple control protocol over the secure channel
    • Add support for bidirectional simultaneous communication
  8. Integration with other DOCA applications:

    • Explore how the Secure Channel can be used in conjunction with other DOCA applications or services
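
For the timing suggestion in item 2, a minimal way to measure wall-clock time around a code region (for example around the send loop in sendto_channel()) is sketched below with clock_gettime; the helper name is illustrative.

    #include <stdio.h>
    #include <time.h>

    /* Elapsed time between two CLOCK_MONOTONIC timestamps, in seconds */
    static double elapsed_sec(struct timespec start, struct timespec end)
    {
        return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... region to measure, e.g. sending -n messages of -s bytes ... */
        clock_gettime(CLOCK_MONOTONIC, &end);

        printf("elapsed: %.6f s\n", elapsed_sec(start, end));
        return 0;
    }

Dividing the number of messages by the elapsed time then gives a rough messages-per-second figure to compare across different -s and -n settings.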

Documentation

For detailed information about the NVIDIA DOCA Secure Channel Application, refer to the official guide: NVIDIA DOCA Secure Channel Application Guide

Key sections to review in the documentation:

  • System Design and Application Architecture
  • DOCA Libraries used (DOCA Comch)
  • Compilation instructions
  • Running the Application (including command-line flags and JSON-based deployment)
  • Application Code Flow

2. NVIDIA DOCA DPA All-to-All

Difficulty: ★★★★☆

Objectives

  • Replicate the functionality of the NVIDIA DOCA DPA All-to-All Application
  • Understand how to use the DOCA DPA APIs to accelerate the MPI all-to-all collective on the BlueField-3 DPU
  • Extend the DPA all-to-all functions to improve collective operation performance on the BlueField-3 DPU

Introduction

The NVIDIA DPA All-to-All application demonstrates how the Message Passing Interface (MPI) all-to-all collective can be accelerated using the Data Path Accelerator (DPA). In an MPI collective, all processes within the same job call the collective routine.

Given a communicator of n ranks, the application performs a collective operation where all processes send and receive the same amount of data from all other processes (hence “all-to-all”).

System Design

All-to-all is an MPI collective operation. MPI is a standardized and portable message-passing standard designed for parallel computing architectures. An MPI program consists of several processes running in parallel.

All-to-All Operation
  • Each process in the diagram divides its local sendbuf into n blocks (4 in this example), each containing sendcount elements (4 in this example). Process i sends the k-th block of its local sendbuf to process k, which places the data in the i-th block of its local recvbuf.

  • Implementing the all-to-all method using DOCA DPA offloads the copying of elements from the srcbuf to the recvbufs to the DPA, freeing the CPU to perform other computations.
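
To make these semantics concrete, the minimal host-based MPI program below performs the same exchange with MPI_Alltoall, using one integer per block. It is a plain CPU baseline for comparison, not DOCA DPA code; compile it with mpicc and run it with, e.g., mpirun -np 4.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* e.g. 4 processes */

        int sendbuf[16], recvbuf[16];             /* enough room for up to 16 ranks */
        for (int i = 0; i < nprocs; i++)
            sendbuf[i] = rank * 100 + i;          /* block i is destined for rank i */

        /* Every rank sends one int to each rank and receives one int from each rank */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

        for (int i = 0; i < nprocs; i++)
            printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

        MPI_Finalize();
        return 0;
    }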

Application Architecture

The following diagram illustrates the differences between host-based all-to-all and DPA all-to-all operations:

Host-based vs DPA All-to-All
  • In DPA all-to-all, DPA threads perform the all-to-all operation, freeing the CPU for other computations.
  • In host-based all-to-all, the CPU must still perform the all-to-all operation at some point and is not completely available for other computations.

Compilation

To build only the DPA all-to-all application:

cd /opt/mellanox/doca/applications/
meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
ninja -C /tmp/build

Alternatively, users can set the desired flags in the meson_options.txt file:

  1. Edit the following flags in /opt/mellanox/doca/applications/meson_options.txt:

    • Set enable_all_applications to false
    • Set enable_dpa_all_to_all to true
  2. Run the following compilation commands:

cd /opt/mellanox/doca/applications/
meson /tmp/build
ninja -C /tmp/build

The doca_dpa_all_to_all executable is created under /tmp/build/dpa_all_to_all/.

Running Application

The DPA all-to-all application is provided in source form. Therefore, compilation is required before execution.

Application usage instructions (run ./doca_dpa_all_to_all -h or ./doca_dpa_all_to_all --help):

Usage: doca_dpa_all_to_all [DOCA Flags] [Program Flags]

DOCA Flags:
-h, --help                Print a help synopsis
-v, --version             Print program version information
-l, --log-level           Set the (numeric) log level for the program 
                          <10=DISABLE, 20=CRITICAL, 30=ERROR, 
                          40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
--sdk-log-level           Set the SDK (numeric) log level for the program 
                          <10=DISABLE, 20=CRITICAL, 30=ERROR, 
                          40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
-j, --json <path>         Parse all command flags from an input json file

Program Flags:
-m, --msgsize <Message size>   The message size - the size of the 
                               sendbuf and recvbuf (in bytes). 
                               Must be in multiples of integer size.
                               Default is size of one integer times
                               the number of processes.
-d, --devices <IB device names> IB devices names that support DPA, separated
                                by comma without spaces (max of two 
                                devices). If not provided, a random
                                IB device will be chosen.

Running on BlueField
  1. Login to BlueField

  2. Enter the code folder:

    dpu# cd /opt/mellanox/doca/applications
    dpu/opt/mellanox/doca/applications#
  3. MPI is used to compile and run this application, so ensure that MPI is installed on your setup. By default, the DOCA All installation provides openmpi but not mpicc. Run the following commands:

    • Check if mpicc is installed:

      dpu# dpkg -l | grep mpich

    • If not installed, install mpicc:

      dpu# apt-get install mpich

  4. Build DOCA DPA All-to-all Application on BlueField:

    # meson /tmp/build -Denable_all_applications=false -Denable_dpa_all_to_all=true
    # ninja -C /tmp/build
  5. Check the mlx device name on BlueField:

    # mst status -v
    ……
    PCI devices:
    ------------
    DEVICE_TYPE             MST                           PCI       RDMA         NET                                     NUMA
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0.1   03:00.1   mlx5_1       net-en3f1pf1sf0,net-pf1hpf,net-p1       -1
    BlueField3(rev:1)       /dev/mst/mt41692_pciconf0     03:00.0   mlx5_0       net-en3f0pf0sf0,net-p0,net-pf0hpf       -1
  6. Run the DPA all-to-all application with 4 processes, a 32-byte message size, and mlx5_0 as the InfiniBand device:

    # mpirun -np 4 /tmp/build/dpa_all_to_all/doca_dpa_all_to_all -m 32 -d "mlx5_0"
DPA All-to-All Execution Output 1
DPA All-to-All Execution Output 2

Notes:

  • -d specifies the RDMA device shown in the previous step
  • -m is the message size, i.e., the size in bytes of the sendbuf and recvbuf. Each buffer is divided into nProcs blocks, so every process exchanges msgsize / nProcs bytes with each peer. For example, with -m 32 and 4 processes, each process's 32-byte sendbuf is split into four 8-byte blocks (two integers per block).

Code Description

  1. Initialize MPI:

    MPI_Init(&argc, &argv);

  2. Parse application arguments:

    • Initialize arg parser resources and register DOCA general parameters:
      doca_argp_init();
    • Register the application’s parameters:
      register_all_to_all_params();
    • Parse the arguments:
      doca_argp_start();
    • Only the first process (rank 0) parses the parameters; it then broadcasts them to the rest of the processes (see the MPI sketch after this code description).
  3. Check and prepare the needed resources for the all_to_all call:

    • Check the number of processes (maximum is 16).
    • Check the msgsize. It must be in multiples of integer size and at least the number of processes times integer size.
    • Allocate the sendbuf and recvbuf according to msgsize.
  4. Prepare the resources required to perform the all-to-all method using DOCA DPA:

    • Initialize DOCA DPA context:
      • Open DOCA DPA device (DOCA device that supports DPA):
        open_dpa_device();
      • Create DOCA DPA context using the opened device:
        doca_dpa_create();
    • Create the required events for the all-to-all:
      create_dpa_a2a_events() {
          doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_ACCESS_CPU, DOCA_DPA_EVENT_WAIT_DEFAULT, &comp_event, 0); 
          for (i = 0; i < resources->num_ranks; i++)
              doca_dpa_event_create(doca_dpa, DOCA_DPA_EVENT_ACCESS_REMOTE, DOCA_DPA_EVENT_ACCESS_DPA, DOCA_DPA_EVENT_WAIT_DEFAULT, &(kernel_events[i]), 0);
      }
    • Create DOCA DPA worker (for the endpoints):
      doca_dpa_worker_create();
    • Prepare DOCA DPA endpoints:
      • Create DOCA DPA endpoints as the number of processes/ranks:
        for (i = 0; i < resources->num_ranks; i++)
            doca_dpa_ep_create();
      • Connect the local process’ endpoints to the other processes’ endpoints:
        connect_dpa_a2a_endpoints();
      • Export the endpoints to DOCA DPA device endpoints and copy them to DPA heap memory:
        for (int i = 0; i < resources->num_ranks; i++) {
            result = doca_dpa_ep_dev_export();
            doca_dpa_mem_alloc();
            doca_dpa_h2d_memcpy();
        }
    • Prepare the memory required for the all-to-all method:
      prepare_dpa_a2a_memory();
  5. Launch the alltoall_kernel using DOCA DPA kernel launch:

    • Every MPI rank launches a kernel of up to MAX_NUM_THREADS (16 in this example).
    • Launch alltoall_kernel using kernel_launch:
      doca_dpa_kernel_launch();
    • Copy the relevant sendbuf to the correct recvbuf for every process:
      for (i = thread_rank; i < num_ranks; i += num_threads)
          doca_dpa_dev_put_signal_nb();
    • Wait until the alltoall_kernel has finished:
      doca_dpa_event_wait_until();
  6. Destroy the a2a_resources:

    • Free all the DOCA DPA memories:
      doca_dpa_mem_free();
    • Unregister all the DOCA DPA host memories:
      doca_dpa_mem_unregister();
    • Destroy all the DOCA DPA endpoints:
      doca_dpa_ep_destroy();
    • Destroy the DOCA DPA worker:
      doca_dpa_worker_destroy();
    • Destroy all the DOCA DPA events:
      doca_dpa_event_destroy();
    • Destroy the DOCA DPA context:
      doca_dpa_destroy();
    • Close the DOCA device:
      doca_dev_close();
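
The self-contained sketch below illustrates steps 2 and 3 above: only rank 0 parses the message size, the value is broadcast with MPI_Bcast so every rank uses the same configuration, and the size is validated before the buffers are allocated. The command-line parsing is a placeholder; the real application uses doca_argp with register_all_to_all_params().

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Step 2: rank 0 parses the message size (placeholder for the doca_argp parsing),
         * then broadcasts it to all other ranks */
        int msgsize = 0;
        if (rank == 0)
            msgsize = (argc > 1) ? atoi(argv[1]) : nprocs * (int)sizeof(int);
        MPI_Bcast(&msgsize, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Step 3: msgsize must be a multiple of sizeof(int) and at least nprocs integers */
        if (msgsize % (int)sizeof(int) != 0 || msgsize < nprocs * (int)sizeof(int)) {
            if (rank == 0)
                fprintf(stderr, "invalid message size %d\n", msgsize);
            MPI_Finalize();
            return 1;
        }

        int *sendbuf = malloc(msgsize);           /* allocated according to msgsize */
        int *recvbuf = malloc(msgsize);
        /* ... the all-to-all itself would go here ... */
        free(sendbuf);
        free(recvbuf);

        MPI_Finalize();
        return 0;
    }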

Project Direction

  1. Enhance the code with additional parameters:

    • Add an input parameter for running multiple iterations
    • Calculate and report the execution time (see the MPI_Wtime sketch after this list)
  2. Increase the number of DPA Execution Units (EUs) to test alltoall performance

  3. Implement additional customizations and extensions:

    • Add multi-server support
    • Integrate secure_channel logic
    • Explore other MPI collective operations that could benefit from DPA acceleration
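
For direction 1, a common pattern is to repeat the collective for a configurable number of iterations and time it with MPI_Wtime, synchronizing with MPI_Barrier so all ranks start together. The sketch below uses MPI_Alltoall as a stand-in for the DPA-accelerated call; the iteration count is illustrative.

    #include <stdio.h>
    #include <mpi.h>

    #define ITERATIONS 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int sendbuf[16] = {0}, recvbuf[16] = {0};

        MPI_Barrier(MPI_COMM_WORLD);              /* start all ranks together */
        double start = MPI_Wtime();
        for (int it = 0; it < ITERATIONS; it++)
            MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
        double elapsed = MPI_Wtime() - start;

        if (rank == 0)
            printf("%d iterations took %.6f s (%.3f us per all-to-all)\n",
                   ITERATIONS, elapsed, 1e6 * elapsed / ITERATIONS);

        MPI_Finalize();
        return 0;
    }

The same start/stop timestamps can wrap the doca_dpa_kernel_launch()/doca_dpa_event_wait_until() pair to compare the DPA path against the host-based baseline.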

Documents

  • OpenMPI
  • DOCA DPA
  • DOCA MMAP
  • DOCA RDMA
  • DOCA AlltoAll

Resources

Book

Homepages & Documents

Self-paced Free Online Courses

Free DOCA Development Environment

QR Code for Free DOCA Development Environment

Scan the QR code to apply for the free DOCA development environment provided by the NVIDIA Authorized Partner DPU & DOCA Excellence Center. An-Link is a new DPU & DOCA Excellence Center and will provide access to the free DOCA development environment at a later date.
