Deep-learning Processor Unit

Goal

In this tutorial you will
  • Build bitstream with Deep-learning Processor Unit

  • Include Vitis AI libraries in Yocto project

A bit of background

Deep-learning Processor Unit is an IP core provided by AMD that accelerates deep-learning inference on Xilinx FPGAs. It is part of the Vitis AI library and facilitates running models created with TensorFlow or PyTorch on the FPGA. Integrating the Deep-learning Processor Unit into a Linux distribution follows steps similar to the integration of other IP blocks (like the double UART from Enable programmable logic support).

Prerequisites

Provided outputs

The following files (Tutorial files) are associated with this tutorial:

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/arch.json - DPU fingerprint

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/leopard-dpu-bd.xsa - DPU IP bitstream

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/boot-common.bin - Boot firmware for Leopard

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot - Root filesystem for Leopard

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/Image - Linux kernel

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/system.dtb - Device tree

Use these files if you want to skip building the bitstream or the Yocto distribution yourself.

Download Deep-learning Processor Unit repository Vivado

  1. On the machine with Vivado, create a dpu-ip-repo directory.

  2. Download DPU IP block from https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-system-integration.html#ip-and-reference-designs.

    • Use ‘IP-only download’ link for ‘MPSoC & Kria K26’ platform.

    • Note that the DPU IP for Zynq UltraScale+ is listed with version 3.0 (the IP package itself is DPUCZDX8G v4.1.0); it works fine with Vitis AI 3.5, which is used in this tutorial.

  3. Unpack the downloaded archive into the dpu-ip-repo directory.

    • Make sure that after extraction the DPUCZDX8G_v4_1_0 directory sits directly in dpu-ip-repo.
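
    If you prefer working from the shell, the extraction looks roughly like this (a sketch; substitute the actual archive name offered by the ‘IP-only download’ link):

    machine:~/dpu-ip-repo$ tar -xf <downloaded-archive>   # or unzip, depending on the archive format
    machine:~/dpu-ip-repo$ ls
    DPUCZDX8G_v4_1_0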

Create bitstream with Deep-learning Processor Unit Vivado

  1. Start Vivado and create a new project. In the new project wizard select the following options:

    • Project type: RTL Project

      • Select Don’t specify sources at this time

      • Don’t select Project is an extensible Vitis platform

    • Part: xczu9eg-ffvc900-1L-i

  2. Add DPU IP repository to project

    1. Open settings by clicking on Settings in Flow Navigator.

    2. Go to Project Settings ‣ IP ‣ Repository.

    3. Add dpu-ip-repo directory to list of repositories.

      Vivado will show a confirmation message and list Deep learning Processing Unit as a newly added IP.

  3. Create the top-level block design using Create Block Design in the Flow Navigator. Use dpu_bd as the name.

  4. In block design diagram editor add Zynq UltraScale+ MPSoC IP block.

  5. Start customization of Zynq UltraScale+ MPSoC IP block by double-clicking on it.

    1. Apply the previously exported preset by selecting Presets ‣ Apply configuration and choosing the leopard-minimalistic-with-pl.tcl file.

    2. PS-PL Configuration ‣ PS-PL Interfaces ‣ Master Interface ‣ AXI HPM0 FPD: Set Data Width to 32.

    3. PS-PL Configuration ‣ PS-PL Interfaces ‣ Slave Interface ‣ AXI LPD: Set Data Width to 32.

  6. Add “Processor System Reset” IP block to block design. In Block properties name it rst_gen_pl_clk0.

  7. Connect rst_gen_pl_clk0 IP block inputs:

    1. Connect slowest_sync_clk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect ext_reset_in to pl_resetn0 output port of Zynq UltraScale+ MPSoC IP block.

  8. Add “Clocking Wizard” IP block to block design.

  9. Customize Clocking Wizard block by double-clicking on it.

    1. In Clocking Options, set Primitive to “Auto”.

    2. On Output Clocks:

      • Set Port Name of ‘clk_out1’ to ‘clk_2x_dpu’

      • Set clk_out1 to ‘200.000 MHz’

      • Enable clk_out2

      • Set Port Name of ‘clk_out2’ to ‘clk_dpu’

      • Set clk_out2 to ‘100.000 MHz’

      • Enable Matched Routing for both clocks

      • Enable reset input

      • Set Reset Type to ‘Active Low’

  10. Connect Clocking Wizard IP block inputs:

    1. Connect clk_in1 to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect resetn to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  11. Add another “Processor System Reset” IP block to block design. In Block properties name it rst_gen_2x_dpu_clk.

  12. Connect rst_gen_2x_dpu_clk IP block inputs:

    1. Connect slowest_sync_clk to clk_2x_dpu output port of Clocking Wizard IP block.

    2. Connect ext_reset_in to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  13. Add another “Processor System Reset” IP block to block design. In Block properties name it rst_gen_dpu_clk.

  14. Connect rst_gen_dpu_clk IP block inputs:

    1. Connect slowest_sync_clk to clk_dpu output port of Clocking Wizard IP block.

    2. Connect ext_reset_in to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  15. Add Deep learning Processing Unit IP block to block design.

  16. Customize the Deep learning Processing Unit block by double-clicking on it.

    1. On Arch tab set Arch of DPU to ‘B1024’

  17. Connect Deep learning Processing Unit IP block inputs:

    1. Connect S_AXI to M_AXI_HPM0_FPD output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect s_axi_aclk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    3. Connect s_axi_aresetn to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

    4. Connect dpu_2x_clk to clk_2x_dpu output port of Clocking Wizard IP block.

    5. Connect dpu_2x_resetn to peripheral_aresetn[0:0] output port of rst_gen_2x_dpu_clk IP block.

    6. Connect m_axi_dpu_aclk to clk_dpu output port of Clocking Wizard IP block.

    7. Connect m_axi_dpu_aresetn to peripheral_aresetn[0:0] output port of rst_gen_dpu_clk IP block.

  18. Connect Zynq UltraScale+ MPSoC IP block inputs:

    1. Connect S_AXI_HPC0_FPD to DPU0_M_AXI_DATA0 output port of Deep learning Processing Unit IP block.

    2. Connect S_AXI_HPC1_FPD to DPU0_M_AXI_DATA1 output port of Deep learning Processing Unit IP block.

    3. Connect S_AXI_LPD to DPU0_M_AXI_INSTR output port of Deep learning Processing Unit IP block.

    4. Connect maxihpm0_fpd_aclk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    5. Connect saxihpc0_fpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    6. Connect saxihpc1_fpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    7. Connect saxi_lpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    8. Connect pl_ps_irq0 to dpu0_interrupt output port of Deep learning Processing Unit IP block.

  19. Run Tools ‣ Validate Design. When asked about auto assigning address segments, answer “Yes.”

  20. The final block design should look like this:

    ../../../_images/dpu_bd1.png

    Fig. 11 Block design with Deep-learning Processor Unit

  21. In the Sources view select Design Sources ‣ dpu_bd and click Create HDL Wrapper in the context menu. Use the Let Vivado manage wrapper and auto-update option.

  22. Generate bitstream

    Warning

    Compared to the previous tutorials, generating the bitstream may take significantly longer.

  23. Export the hardware including the bitstream to the file leopard-dpu-bd.xsa.
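
Before moving on to Yocto, note that the block design assembled in the steps above can also be scripted from the Vivado Tcl console. The sketch below shows the pattern for only a few of the cells and nets; cell names are illustrative, and the IP versions in the -vlnv strings depend on your Vivado release, so adjust them to what the IP catalog reports:

    # Point Vivado at the DPU IP repository and create the block design
    set_property ip_repo_paths ./dpu-ip-repo [current_project]
    update_ip_catalog
    create_bd_design "dpu_bd"

    # Instantiate the main blocks (cell names and IP versions are examples)
    create_bd_cell -type ip -vlnv xilinx.com:ip:zynq_ultra_ps_e:3.4 zynq_ps
    create_bd_cell -type ip -vlnv xilinx.com:ip:clk_wiz:6.0 clk_wiz_0
    create_bd_cell -type ip -vlnv xilinx.com:ip:proc_sys_reset:5.0 rst_gen_pl_clk0
    create_bd_cell -type ip -vlnv xilinx.com:ip:DPUCZDX8G:4.1 dpu_0

    # Every net and interface connection follows one of these two patterns:
    connect_bd_net [get_bd_pins zynq_ps/pl_clk0] [get_bd_pins rst_gen_pl_clk0/slowest_sync_clk]
    connect_bd_intf_net [get_bd_intf_pins zynq_ps/M_AXI_HPM0_FPD] [get_bd_intf_pins dpu_0/S_AXI]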

Add Vitis layers to Yocto Project Yocto

  1. Clone Xilinx meta-vitis layer:

    machine:~/leopard-linux-1/build$ git clone -b rel-v2024.1 https://github.com/Xilinx/meta-vitis.git ../sources/meta-vitis
    
  2. Clone the KP Labs meta-kp-vitis-ai layer:

    machine:~/leopard-linux-1/build$ git clone -b nanbield https://github.com/kplabs-pl/meta-kp-vitis-ai.git ../sources/meta-kp-vitis-ai
    
  3. Apply patches to meta-vitis that fix support for the nanbield Yocto release:

    machine:~/leopard-linux-1/sources/meta-vitis$ git am ../meta-kp-vitis-ai/patches/*.patch
    Applying: Switch to nanbield
    Applying: bbappend to any glog version
    
  4. Add layers to Yocto project:

    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-openembedded/meta-python
    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-vitis
    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-kp-vitis-ai
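
    To confirm that all three layers were registered, you can list the project layers (an optional check):

    machine:~/leopard-linux-1/build$ bitbake-layers show-layers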
    
  5. Change the recipe providing opencl-icd by adding a configuration option to ~/leopard-linux-1/build/conf/local.conf:

    PREFERRED_PROVIDER_virtual/opencl-icd = "ocl-icd"
    

    Note

    The meta-vitis layer requires this particular project configuration.
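
    You can verify that the preference is picked up by dumping the resolved configuration. bitbake -e prints the whole global datastore, so the grep below just extracts the final assignment:

    machine:~/leopard-linux-1/build$ bitbake -e | grep '^PREFERRED_PROVIDER_virtual/opencl-icd'
    PREFERRED_PROVIDER_virtual/opencl-icd="ocl-icd"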

Add Deep-learning Processor Unit bitstream to Linux image Yocto

  1. Create directory ~/leopard-linux-1/sources/meta-local/recipes-example/bitstreams/dpu/ and copy leopard-dpu-bd.xsa to it.

  2. Create a new recipe ~/leopard-linux-1/sources/meta-local/recipes-example/bitstreams/dpu.bb that installs the bitstream with the DPU:

    LICENSE = "CLOSED"
    
    inherit bitstream
    
    SRC_URI += "file://leopard-dpu-bd.xsa"
    BITSTREAM_HDF_FILE = "${WORKDIR}/leopard-dpu-bd.xsa"
    
  3. Create a recipe append for the kernel:

    machine:~/leopard-linux-1/build$ recipetool newappend --wildcard-version ../sources/meta-local/ linux-xlnx
    
  4. Create directory ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx.

  5. Enable the Xilinx DPU kernel driver module by creating the file ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx/xlnx-dpu.cfg with the following content:

    CONFIG_XILINX_DPU=m
    
  6. Enable the kernel configuration fragment by adding it to ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx_%.bbappend:

    FILESEXTRAPATHS:prepend := "${THISDIR}/${PN}:"
    
    SRC_URI += "file://xlnx-dpu.cfg"
    
  7. Add the new packages to the Linux image by editing ~/leopard-linux-1/sources/meta-local/recipes-leopard/images/dpu-leopard.bbappend:

    IMAGE_INSTALL += "\
       fpga-manager-script \
       double-uart \
       dpu \
       vitis-ai-library \
       kernel-module-xlnx-dpu \
    "
    
  8. Build firmware and image

    machine:~/leopard-linux-1/build$ bitbake leopard-all
    
  9. Prepare build artifacts for transfer to EGSE Host

    machine:~/leopard-linux-1/build$ mkdir -p ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/bootbins/boot-common.bin ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/system.dtb ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/Image ../egse-host-transfer
    
  10. Transfer the content of the egse-host-transfer directory to the EGSE Host and place it in the /var/tftp/tutorial directory.

Run model on Deep-learning Processor Unit EGSE Host

  1. Verify that all necessary artifacts are present on EGSE Host:

    customer@egse-host:~$ ls -lh /var/tftp/tutorial
    total 106M
    -rw-rw-r-- 1 customer customer  21M Jan 23 09:37 Image
    -rw-rw-r-- 1 customer customer 1.6M Jan 23 09:37 boot-common.bin
    -rw-rw-r-- 1 customer customer  93M Jan 23 09:37 dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot
    -rw-rw-r-- 1 customer customer  39K Jan 23 09:37 system.dtb
    

    Note

    Exact file sizes might differ a bit, but they should be in the same range (for example, dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot should be about 100 MB).

  2. Open second SSH connection to EGSE Host and start minicom to observe boot process

    customer@egse-host:~$ minicom -D /dev/sml/leopard-pn1-uart
    

    Leave this terminal open and go back to the SSH connection used in the previous steps.

  3. Power on Leopard

    customer@egse-host:~$ sml power on
    Powering on...Success
    
  4. Power on DPU Processing Node 1

    customer@egse-host:~$ sml pn1 power on --nor-memory nor1
    Powering on processing node Node1...Success
    

    Note

    Boot firmware is the same as in Enable programmable logic support.

  5. The DPU boot process should be visible in the minicom terminal.

  6. Log in to the DPU as the root user

    leopard login: root
    root@leopard:~#
    
  7. Load DPU bitstream

    root@leopard:~# fpgautil -o /lib/firmware/dpu/overlay.dtbo
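
    If the overlay loads correctly, the FPGA manager and the DPU driver report it in the kernel log. Exact messages vary between kernel versions, so treat this only as a sanity check:

    root@leopard:~# dmesg | grep -iE 'fpga|dpu'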
    
  8. Verify that the DPU instance is visible in the system

    root@leopard:~# xdputil query
    {
       "DPU IP Spec":{
          "DPU Core Count":1,
          "IP version":"v4.1.0",
          "enable softmax":"False"
       },
       "VAI Version":{
          "libvart-runner.so":"Xilinx vart-runner Version: 3.5.0-b7953a2a9f60e23efdfced5c186328dd144966,
          "libvitis_ai_library-dpu_task.so":"Advanced Micro Devices vitis_ai_library dpu_task Version: ,
          "libxir.so":"Xilinx xir Version: xir-b7953a2a9f60e23efdfced5c186328dd1449665c 2024-07-15-16:5,
          "target_factory":"target-factory.3.5.0 b7953a2a9f60e23efdfced5c186328dd1449665c"
       },
       "kernels":[
          {
                "DPU Arch":"DPUCZDX8G_ISA1_B1024",
                "DPU Frequency (MHz)":100,
                "XRT Frequency (MHz)":100,
                "cu_idx":0,
                "fingerprint":"0x101000056010402",
                "is_vivado_flow":true,
                "name":"DPU Core 0"
          }
       ]
    }
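
    The fingerprint reported above must match the one stored in the provided arch.json (see Provided outputs); the Vitis AI compiler uses it to target this exact DPU configuration. The file is a single JSON object, which for this design should read:

    {"fingerprint":"0x101000056010402"}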
    
  9. Follow the Machine learning model deployment tutorials to train a model and compile it for the Deep-learning Processor Unit. Go to Onboard inference to see how to run inference on the DPU.

Summary

In this tutorial you walked through the steps required to include the Deep-learning Processor Unit in an FPGA design and integrate it with a Yocto project.