Deep-learning Processor Unit

Goal

In this tutorial you will
  • Build bitstream with Deep-learning Processor Unit

  • Include Vitis AI libraries in Yocto project

A bit of background

Deep-learning Processor Unit is an IP core provided by AMD that accelerates deep-learning inference on Xilinx FPGAs. It is part of the Vitis AI library and facilitates running models created with TensorFlow or PyTorch on the FPGA. Integrating the Deep-learning Processor Unit into a Linux distribution follows steps similar to the integration of other IP blocks (like the double UART from Enable programmable logic support).

Prerequisites

Provided outputs

The following files (Tutorial files) are associated with this tutorial:

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/arch.json - DPU fingerprint

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/leopard-dpu-bd.xsa - DPU IP bitstream

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/boot-common.bin - Boot firmware for Leopard

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot - Root filesystem for Leopard

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/Image - Linux kernel

  • Leopard/Zero-to-hero/04 Deep learning Processor Unit/system.dtb - Device tree

Use these files if you want to skip building the bitstream or the Yocto distribution yourself.

Download Deep-learning Processor Unit repository Vivado

  1. On the machine with Vivado, create a dpu-ip-repo directory.

  2. Download DPU IP block from https://xilinx.github.io/Vitis-AI/3.5/html/docs/workflow-system-integration.html#ip-and-reference-designs.

    • Use ‘IP-only download’ link for ‘MPSoC & Kria K26’ platform.

    • Note that the DPU IP for Zynq UltraScale+ is listed with version 3.0 (the IP package itself is DPUCZDX8G v4.1.0); it works fine with Vitis AI 3.5, which is used in this tutorial.

  3. Unpack the downloaded archive into the dpu-ip-repo directory.

    • Make sure that after extraction the DPUCZDX8G_v4_1_0 directory sits directly in dpu-ip-repo.
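
    If you prefer working from the shell, the extraction looks roughly like this (a sketch; substitute the actual archive name offered by the ‘IP-only download’ link):

    machine:~/dpu-ip-repo$ tar -xf <downloaded-archive>   # or unzip, depending on the archive format
    machine:~/dpu-ip-repo$ ls
    DPUCZDX8G_v4_1_0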

Create bitstream with Deep-learning Processor Unit Vivado

  1. Start Vivado and create a new project. In the new project wizard select the following options:

    • Project type: RTL Project

      • Select Don’t specify sources at this time

      • Don’t select Project is an extensible Vitis platform

    • Part: xczu9eg-ffvc900-1L-i

  2. Add DPU IP repository to project

    1. Open settings by clicking on Settings in Flow Navigator.

    2. Go to Project Settings ‣ IP ‣ Repository.

    3. Add dpu-ip-repo directory to list of repositories.

      Vivado will show a confirmation message and list Deep learning Processing Unit as a newly added IP.

  3. Create the top-level block design using Create Block Design in the Flow Navigator. Use dpu_bd as the name.

  4. In block design diagram editor add Zynq UltraScale+ MPSoC IP block.

  5. Start customization of Zynq UltraScale+ MPSoC IP block by double-clicking on it.

    1. Apply the previously exported preset by selecting Presets ‣ Apply configuration and choosing the leopard-minimalistic-with-pl.tcl file.

    2. PS-PL Configuration ‣ PS-PL Interfaces ‣ Master Interface ‣ AXI HPM0 FPD: Set Data Width to 32.

    3. PS-PL Configuration ‣ PS-PL Interfaces ‣ Slave Interface ‣ AXI LPD: Set Data Width to 32.

  6. Add “Processor System Reset” IP block to block design. In Block properties name it rst_gen_pl_clk0.

  7. Connect rst_gen_pl_clk0 IP block inputs:

    1. Connect slowest_sync_clk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect ext_reset_in to pl_resetn0 output port of Zynq UltraScale+ MPSoC IP block.

  8. Add “Clocking Wizard” IP block to block design.

  9. Customize Clocking Wizard block by double-clicking on it.

    1. In Clocking Options, set Primitive to “Auto”.

    2. On Output Clocks:

      • Set Port Name of ‘clk_out1’ to ‘clk_2x_dpu’

      • Set clk_out1 to ‘200.000 MHz’

      • Enable clk_out2

      • Set Port Name of ‘clk_out2’ to ‘clk_dpu’

      • Set clk_out2 to ‘100.000 MHz’

      • Enable Matched Routing for both clocks

      • Enable reset input

      • Set Reset Type to ‘Active Low’

  10. Connect Clocking Wizard IP block inputs:

    1. Connect clk_in1 to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect resetn to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  11. Add another “Processor System Reset” IP block to block design. In Block properties name it rst_gen_2x_dpu_clk.

  12. Connect rst_gen_2x_dpu_clk IP block inputs:

    1. Connect slowest_sync_clk to clk_2x_dpu output port of Clocking Wizard IP block.

    2. Connect ext_reset_in to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  13. Add another “Processor System Reset” IP block to block design. In Block properties name it rst_gen_dpu_clk.

  14. Connect rst_gen_dpu_clk IP block inputs:

    1. Connect slowest_sync_clk to clk_dpu output port of Clocking Wizard IP block.

    2. Connect ext_reset_in to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

  15. Add Deep learning Processing Unit IP block to block design.

  16. Customize the Deep learning Processing Unit block by double-clicking on it.

    1. On Arch tab set Arch of DPU to ‘B1024’

  17. Connect Deep learning Processing Unit IP block inputs:

    1. Connect S_AXI to M_AXI_HPM0_FPD output port of Zynq UltraScale+ MPSoC IP block.

    2. Connect s_axi_aclk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    3. Connect s_axi_aresetn to peripheral_aresetn[0:0] output port of rst_gen_pl_clk0 IP block.

    4. Connect dpu_2x_clk to clk_2x_dpu output port of Clocking Wizard IP block.

    5. Connect dpu_2x_resetn to peripheral_aresetn[0:0] output port of rst_gen_2x_dpu_clk IP block.

    6. Connect m_axi_dpu_aclk to clk_dpu output port of Clocking Wizard IP block.

    7. Connect m_axi_dpu_aresetn to peripheral_aresetn[0:0] output port of rst_gen_dpu_clk IP block.

  18. Connect Zynq UltraScale+ MPSoC IP block inputs:

    1. Connect S_AXI_HPC0_FPD to DPU0_M_AXI_DATA0 output port of Deep learning Processing Unit IP block.

    2. Connect S_AXI_HPC1_FPD to DPU0_M_AXI_DATA1 output port of Deep learning Processing Unit IP block.

    3. Connect S_AXI_LPD to DPU0_M_AXI_INSTR output port of Deep learning Processing Unit IP block.

    4. Connect maxihpm0_fpd_aclk to pl_clk0 output port of Zynq UltraScale+ MPSoC IP block.

    5. Connect saxihpc0_fpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    6. Connect saxihpc1_fpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    7. Connect saxi_lpd_aclk to clk_dpu output port of Clocking Wizard IP block.

    8. Connect pl_ps_irq0 to dpu0_interrupt output port of Deep learning Processing Unit IP block.

  19. Run Tools ‣ Validate Design. When asked about auto assigning address segments, answer “Yes.”

  20. The final block design should look like this:

    ../../../_images/dpu_bd1.png

    Fig. 11 Block design with Deep-learning Processor Unit

  21. In the Sources view select Design Sources ‣ dpu_bd and click Create HDL Wrapper in the context menu. Use the Let Vivado manage wrapper and auto-update option.

  22. Generate bitstream

    Warning

    Compared to the previous tutorials, generating the bitstream may take significantly longer.

  23. Export the hardware including the bitstream to the file leopard-dpu-bd.xsa.
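
Before moving on to Yocto, note that the block design assembled in the steps above can also be scripted from the Vivado Tcl console. The sketch below shows the pattern for only a few of the cells and nets; cell names are illustrative, and the IP versions in the -vlnv strings depend on your Vivado release, so adjust them to what the IP catalog reports:

    # Point Vivado at the DPU IP repository and create the block design
    set_property ip_repo_paths ./dpu-ip-repo [current_project]
    update_ip_catalog
    create_bd_design "dpu_bd"

    # Instantiate the main blocks (cell names and IP versions are examples)
    create_bd_cell -type ip -vlnv xilinx.com:ip:zynq_ultra_ps_e:3.4 zynq_ps
    create_bd_cell -type ip -vlnv xilinx.com:ip:clk_wiz:6.0 clk_wiz_0
    create_bd_cell -type ip -vlnv xilinx.com:ip:proc_sys_reset:5.0 rst_gen_pl_clk0
    create_bd_cell -type ip -vlnv xilinx.com:ip:DPUCZDX8G:4.1 dpu_0

    # Every net and interface connection follows one of these two patterns:
    connect_bd_net [get_bd_pins zynq_ps/pl_clk0] [get_bd_pins rst_gen_pl_clk0/slowest_sync_clk]
    connect_bd_intf_net [get_bd_intf_pins zynq_ps/M_AXI_HPM0_FPD] [get_bd_intf_pins dpu_0/S_AXI]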

Add Vitis layers to Yocto Project Yocto

  1. Clone Xilinx meta-vitis layer:

    machine:~/leopard-linux-1/build$ git clone -b rel-v2024.1 https://github.com/Xilinx/meta-vitis.git ../sources/meta-vitis
    
  2. Clone the KP Labs meta-kp-vitis-ai layer:

    machine:~/leopard-linux-1/build$ git clone -b nanbield https://github.com/kplabs-pl/meta-kp-vitis-ai.git ../sources/meta-kp-vitis-ai
    
  3. Apply patches to meta-vitis that fix support for the nanbield Yocto release:

    machine:~/leopard-linux-1/sources/meta-vitis$ git am ../meta-kp-vitis-ai/patches/*.patch
    Applying: Switch to nanbield
    Applying: bbappend to any glog version
    
  4. Add layers to Yocto project:

    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-openembedded/meta-python
    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-vitis
    machine:~/leopard-linux-1/build$ bitbake-layers add-layer ../sources/meta-kp-vitis-ai
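
    To confirm that all three layers were registered, you can list the project layers (an optional check):

    machine:~/leopard-linux-1/build$ bitbake-layers show-layers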
    
  5. Change the recipe providing opencl-icd by adding a configuration option to ~/leopard-linux-1/build/conf/local.conf:

    PREFERRED_PROVIDER_virtual/opencl-icd = "ocl-icd"
    

    Note

    The meta-vitis layer requires this particular project configuration.
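
    You can verify that the preference is picked up by dumping the resolved configuration. bitbake -e prints the whole global datastore, so the grep below just extracts the final assignment:

    machine:~/leopard-linux-1/build$ bitbake -e | grep '^PREFERRED_PROVIDER_virtual/opencl-icd'
    PREFERRED_PROVIDER_virtual/opencl-icd="ocl-icd"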

Add Deep-learning Processor Unit bitstream to Linux image Yocto

  1. Create directory ~/leopard-linux-1/sources/meta-local/recipes-example/bitstreams/dpu/ and copy leopard-dpu-bd.xsa to it.

  2. Create a new recipe ~/leopard-linux-1/sources/meta-local/recipes-example/bitstreams/dpu.bb that installs the bitstream with the DPU:

    LICENSE = "CLOSED"
    
    inherit bitstream
    
    SRC_URI += "file://leopard-dpu-bd.xsa"
    BITSTREAM_HDF_FILE = "${WORKDIR}/leopard-dpu-bd.xsa"
    
  3. Create a recipe append for the kernel:

    machine:~/leopard-linux-1/build$ recipetool newappend --wildcard-version ../sources/meta-local/ linux-xlnx
    
  4. Create directory ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx.

  5. Enable the Xilinx DPU kernel driver module by creating the file ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx/xlnx-dpu.cfg with the following content:

    CONFIG_XILINX_DPU=m
    
  6. Enable the kernel configuration fragment by adding it to ~/leopard-linux-1/sources/meta-local/recipes-kernel/linux/linux-xlnx_%.bbappend:

    FILESEXTRAPATHS:prepend := "${THISDIR}/${PN}:"
    
    SRC_URI += "file://xlnx-dpu.cfg"
    
  7. Add the new packages to the Linux image by editing ~/leopard-linux-1/sources/meta-local/recipes-leopard/images/dpu-leopard.bbappend:

    IMAGE_INSTALL += "\
       fpga-manager-script \
       double-uart \
       dpu \
       vitis-ai-library \
       kernel-module-xlnx-dpu \
    "
    
  8. Build firmware and image

    machine:~/leopard-linux-1/build$ bitbake leopard-all
    
  9. Prepare build artifacts for transfer to EGSE Host

    machine:~/leopard-linux-1/build$ mkdir -p ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/bootbins/boot-common.bin ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/system.dtb ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot ../egse-host-transfer
    machine:~/leopard-linux-1/build$ cp tmp/deploy/images/leopard-dpu/Image ../egse-host-transfer
    
  10. Transfer the content of the egse-host-transfer directory to the EGSE Host and place it in the /var/tftp/tutorial directory.

Run model on Deep-learning Processor Unit EGSE Host

  1. Verify that all necessary artifacts are present on EGSE Host:

    customer@egse-host:~$ ls -lh /var/tftp/tutorial
    total 106M
    -rw-rw-r-- 1 customer customer  21M Jan 23 09:37 Image
    -rw-rw-r-- 1 customer customer 1.6M Jan 23 09:37 boot-common.bin
    -rw-rw-r-- 1 customer customer  93M Jan 23 09:37 dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot
    -rw-rw-r-- 1 customer customer  39K Jan 23 09:37 system.dtb
    

    Note

    Exact file sizes might differ a bit, but they should be in the same range (for example, dpu-leopard-leopard-dpu.rootfs.cpio.gz.u-boot should be about 100 MB).

  2. Open second SSH connection to EGSE Host and start minicom to observe boot process

    customer@egse-host:~$ minicom -D /dev/sml/leopard-pn1-uart
    

    Leave this terminal open and go back to the SSH connection used in the previous steps.

  3. Power on Leopard

    customer@egse-host:~$ sml power on
    Powering on...Success
    
  4. Power on DPU Processing Node 1

    customer@egse-host:~$ sml pn1 power on --nor-memory nor1
    Powering on processing node Node1...Success
    

    Note

    Boot firmware is the same as in Enable programmable logic support.

  5. The DPU boot process should be visible in the minicom terminal.

  6. Log in to the DPU as the root user

    leopard login: root
    root@leopard:~#
    
  7. Load DPU bitstream

    root@leopard:~# fpgautil -o /lib/firmware/dpu/overlay.dtbo
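
    If the overlay loads correctly, the FPGA manager and the DPU driver report it in the kernel log. Exact messages vary between kernel versions, so treat this only as a sanity check:

    root@leopard:~# dmesg | grep -iE 'fpga|dpu'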
    
  8. Verify that the DPU instance is visible in the system

    root@leopard:~# xdputil query
    {
       "DPU IP Spec":{
          "DPU Core Count":1,
          "IP version":"v4.1.0",
          "enable softmax":"False"
       },
       "VAI Version":{
          "libvart-runner.so":"Xilinx vart-runner Version: 3.5.0-b7953a2a9f60e23efdfced5c186328dd144966,
          "libvitis_ai_library-dpu_task.so":"Advanced Micro Devices vitis_ai_library dpu_task Version: ,
          "libxir.so":"Xilinx xir Version: xir-b7953a2a9f60e23efdfced5c186328dd1449665c 2024-07-15-16:5,
          "target_factory":"target-factory.3.5.0 b7953a2a9f60e23efdfced5c186328dd1449665c"
       },
       "kernels":[
          {
                "DPU Arch":"DPUCZDX8G_ISA1_B1024",
                "DPU Frequency (MHz)":100,
                "XRT Frequency (MHz)":100,
                "cu_idx":0,
                "fingerprint":"0x101000056010402",
                "is_vivado_flow":true,
                "name":"DPU Core 0"
          }
       ]
    }
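
    The fingerprint reported above must match the one stored in the provided arch.json (see Provided outputs); the Vitis AI compiler uses it to target this exact DPU configuration. The file is a single JSON object, which for this design should read:

    {"fingerprint":"0x101000056010402"}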
    
  9. Follow the Machine learning model deployment tutorials to train a model and compile it for the Deep-learning Processor Unit. Go to Onboard inference to see how to run inference on the DPU.

Summary

In this tutorial you walked through the steps required to include the Deep-learning Processor Unit in an FPGA design and integrate it with a Yocto project.