Building an In-House Dev Environment on Kubernetes Part 3: Kubernetes Device Plugin for LPU

Hello! I’m Younghoon Jun, a DevOps Engineer on the ML team at HyperAccel.

This post is the third installment of the Building an In-House Dev Environment on Kubernetes series!

In Part 1, we covered the background, overall design, and direction of building a Kubernetes-based development environment. Part 2 introduced the strategy and process for building an ARC-based CI/CD infrastructure to overcome the structural limitations of self-hosted runners. In this third article, we will discuss the Device Plugin required for utilizing custom resources on Kubernetes.

HyperAccel is a company that builds LPU (LLM Processing Unit) devices specialized for LLM inference. To make Kubernetes recognize the LPU as a custom resource and schedule it smoothly, a Device Plugin developed specifically for the LPU is required.

In this article, we will explore the following topics under the theme of LPU Device Plugin:

  • First, we explain the operating principles and detailed mechanics of the Kubernetes Device Plugin.

  • Next, we introduce the development process of the device plugin for HyperAccel’s first-generation FPGA-based LPU and how it is provided.

  • We then briefly cover the device plugin for Bertha, our upcoming ASIC-based LPU.

  • Finally, we introduce DRA (Dynamic Resource Allocation), a technology gaining significant attention, and explain how it relates to the scheduler and Device Plugin in Kubernetes.


Kubernetes Device Plugin

The Kubernetes Device Plugin framework is a standard interface that allows custom devices such as GPUs, FPGAs, and NICs to be exposed and managed within a Kubernetes cluster.

By default, Kubernetes can only recognize CPU and memory resources. Without a Device Plugin, Kubernetes has no information about custom devices, making it difficult to distinguish between hardware, check device status, or use them as criteria for Pod scheduling. Device Plugins enable vendors to support their hardware without directly modifying Kubernetes source code.


Why Do We Need a Device Plugin?

In the early days, supporting new hardware meant adding vendor-specific code directly into the Kubernetes core ("in-tree"). This approach was difficult to maintain because it required understanding the vast Kubernetes codebase. It also lacked scalability, since the core code had to be modified every time new hardware was introduced.

In summary:

  • Improved maintainability and scalability: Kubernetes focuses solely on container orchestration, while hardware control logic is handled by Device Plugins built by manufacturers (NVIDIA, AMD, HyperAccel, etc.).

  • Support for diverse hardware: Not just GPUs, but also high-performance networks (SR-IOV), cryptographic accelerators (QAT), and other resources can all be managed in a unified manner.


How Device Plugins Work & Workflow

Device Plugins typically run as DaemonSets and communicate with Kubelet through gRPC services on each node. The process of making devices recognizable through communication between Kubelet and the Device Plugin is as follows:

Device Registration

  1. Registration

    • When the Plugin starts, it registers itself with Kubelet through a specific path on the node (/var/lib/kubelet/device-plugins/kubelet.sock).

    • This is the process of telling Kubelet: “I am a plugin that manages a resource called hyperaccel.ai/lpu!”

  2. ListAndWatch

    • Kubelet calls the plugin’s ListAndWatch method to request the list of available devices.

    • The Plugin keeps this stream open, monitoring device status (Healthy/Unhealthy) and reporting changes to Kubelet in real time.

  3. Allocate

    • When a user specifies resources.limits.hyperaccel.ai/lpu: 1 in the Pod spec, the scheduler places the Pod on a node that has available resources. The scheduler is responsible only for selecting the node where the Pod will be placed.

    • Once the Pod is placed on a node through the scheduler, Kubelet calls the plugin’s Allocate method. The Plugin responds with the necessary configurations (environment variables, device node paths, volume mounts, etc.) so that the container can access the device.

    • Within the node, Kubelet’s device manager selects which specific devices to allocate; the Device Plugin can influence this choice through GetPreferredAllocation.

So far, we have examined the process by which the Device Plugin communicates with Kubelet to make Kubernetes recognize devices. So what interface is needed when these two components communicate? Let’s look at the gRPC interface required for the Device Plugin to communicate with Kubelet.


Key gRPC Interfaces

Device Plugin API Overview

| Method | Role |
| --- | --- |
| GetDevicePluginOptions | Checks plugin options (periodic checkpoints, etc.) |
| ListAndWatch | Returns the device list and streams status changes |
| GetPreferredAllocation | Guides the selection of specific hardware through preference conditions |
| Allocate | Provides the configuration for using specific hardware during container creation |
| PreStartContainer | Initializes the device before the container starts (optional) |
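To make the flow concrete, here is a minimal sketch of what a plugin implementing these methods might look like. Real plugins expose them as a gRPC service (typically written in Go against the kubelet device plugin API); this Python sketch replaces the gRPC layer with plain methods, and the class name, device names, and response shape are illustrative only.

```python
# Illustrative sketch of the Device Plugin call flow, with plain Python
# methods standing in for the gRPC service. The resource name follows the
# article; the class and its method shapes are NOT the real API.

class LpuDevicePlugin:
    RESOURCE_NAME = "hyperaccel.ai/lpu"

    def __init__(self, device_ids):
        # e.g. ["lpu0", "lpu1", ...]; health would come from driver queries
        self.devices = {dev: "Healthy" for dev in device_ids}

    def list_and_watch(self):
        # Kubelet keeps this stream open; the plugin re-sends the full
        # device list whenever health changes. Here: a single snapshot.
        return [{"ID": dev, "Health": h} for dev, h in self.devices.items()]

    def allocate(self, device_ids):
        # Kubelet tells the plugin WHICH devices were chosen; the plugin
        # answers with everything the container needs to use them.
        return {
            "envs": {"LPU_VISIBLE_DEVICES": ",".join(device_ids)},
            "devices": [f"/dev/{dev}" for dev in device_ids],
            "mounts": [("/opt/hyperaccel/lib", "/usr/lib/hyperaccel")],
        }

plugin = LpuDevicePlugin(["lpu0", "lpu1"])
print(plugin.list_and_watch())
print(plugin.allocate(["lpu0"]))
```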

Next, we will explain how Kubernetes internal components, including the Device Plugin, allocate devices to pods during the actual device request process.


How Devices Are Allocated to Pods

We will use HyperAccel’s LPU as an example.

Detecting Available LPUs on a Node

First, let’s explain how the API Server and Kubelet detect and monitor new devices by interacting with the Device Plugin at the Kubernetes level.

Device Detection

  1. RegisterRequest (Device Registration Request)

    • When the Device Plugin starts, it registers the plugin with Kubelet’s Device Plugin Manager through a Unix Domain Socket.

    • This is the process of saying: “This node has a resource called hyperaccel.ai/lpu, and I will manage it.”

  2. Start gRPC Server

    • The Device Plugin starts a gRPC server to communicate with Kubelet. Through this, Kubelet can query device status or send allocation (Allocate) requests.
  3. ListAndWatch

    • Kubelet calls the Device Plugin’s ListAndWatch() function.

    • During this process, the Device Plugin detects the LPUs attached to the node and reports their status (Healthy/Unhealthy) to Kubelet. This connection is maintained continuously, so any device issue is reported to Kubelet immediately.

  4. Update Node Status

    • Kubelet sends the confirmed LPU resource information (e.g., hyperaccel.ai/lpu: 4) to the API Server.

    • The API Server records this information in etcd.

    • Now the entire cluster knows that “4 LPUs are available on that node.”

Now, when a user creates a pod and requests the following in the YAML file, the Kubernetes Scheduler can place the pod on the appropriate node:

# User Pod YAML file

resources:
  limits:
    hyperaccel.ai/lpu: 1  # Request 1 LPU

Selecting and Allocating the Appropriate LPU for a Pod’s Request

Once the Kubernetes level is ready to use the new device called LPU through the above process, we need to look at the actual allocation process. Let’s explain step by step what happens internally in Kubernetes when a user declares “I need 1 LPU” in the YAML file.

Device Allocation

  1. Bind(pod, node) - Pod Scheduling

    • The user requests hyperaccel.ai/lpu: 1 in the PodSpec.

    • The Kubernetes Scheduler searches for nodes with IDLE LPUs across all nodes.

    • It selects a worker node that meets the conditions and commands (Bind): “Place this pod on the designated node!”

  2. AllocateRequest

    • Kubelet confirms that the pod assigned to it requires an LPU to run.

    • The Device Plugin Manager inside Kubelet requests the Device Plugin: “I need 1 LPU, please prepare it for use.” It also passes the specific LPU ID (e.g., LPU 0) to be used.

  3. AllocateResponse

    • The Device Plugin responds to Kubelet with the necessary configuration information to make the LPU available in the container:

      • Envs: Environment variables to reference inside the container
      • Devices: Device paths the container needs to access (e.g., /dev/lpu0)
      • Mount: Mount paths for required libraries or driver files
  4. CreateContainer

    • Kubelet requests container creation from the container runtime (containerd) based on the received information.

    • At this point, containerd connects the host’s /dev/lpu0 device to the inside of the container as configured, ultimately allowing the running pod to directly access the LPU hardware.

As a result, users can use the new LPU device in a safe and isolated environment simply by adding a single line — limits: hyperaccel.ai/lpu: 1 — to the YAML file, without needing to know the complex hardware setup process!
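The Bind step above reduces, for Device Plugin resources, to a simple count comparison per node. A toy sketch (the node names and counts are made up):

```python
# Toy version of the scheduler's resource filter for an extended resource:
# a node is a candidate iff (capacity - already allocated) >= requested.

def filter_nodes(nodes, resource, requested):
    """nodes: {name: {"capacity": {...}, "allocated": {...}}}"""
    candidates = []
    for name, info in nodes.items():
        free = info["capacity"].get(resource, 0) - info["allocated"].get(resource, 0)
        if free >= requested:
            candidates.append(name)
    return candidates

nodes = {
    "orion-1": {"capacity": {"hyperaccel.ai/lpu": 8}, "allocated": {"hyperaccel.ai/lpu": 6}},
    "orion-2": {"capacity": {"hyperaccel.ai/lpu": 8}, "allocated": {"hyperaccel.ai/lpu": 2}},
}
print(filter_nodes(nodes, "hyperaccel.ai/lpu", 4))  # only orion-2 has 4 free
```

Note that this check knows nothing about which devices are free or how they are connected, which is exactly the limitation that becomes important for topology-sensitive hardware later in this article.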

However, a question may arise from step 4 above. How does the container runtime (containerd) interpret and inject the device paths, environment variables, and mount information returned by the Device Plugin in the AllocateResponse? The standard that addresses this is CDI (Container Device Interface).


CDI (Container Device Interface): Standardizing Device Injection

The Problem Before CDI

When the Device Plugin sends the necessary configuration information to Kubelet via AllocateResponse, the container runtime ultimately interprets this to create the container. However, in this process, each vendor had its own way of injecting devices into containers.

For example, using an NVIDIA GPU in a container requires more than simply mounting the /dev/nvidia0 device file. GPU driver libraries (libcuda.so, libnvidia-ml.so, etc.), related utility binaries, and the correct paths matching their exact versions all need to be properly injected into the container. To achieve this, NVIDIA developed its own container runtime hook called nvidia-container-runtime-hook.

Other vendors’ devices — FPGAs, network accelerators (SR-IOV), cryptographic accelerators (QAT), and more — each required their own unique methods to inject device nodes, driver libraries, and configuration files into containers. This led to the following problems:

  • Runtime dependencies: Each vendor had to develop its own container runtime or runtime hook. NVIDIA’s nvidia-container-runtime, AMD’s xilinx-container-runtime, and others were all needed, and they were incompatible with each other.

  • Complex configuration management: To use both a GPU and an FPGA on a single node, multiple runtime hooks had to be combined, increasing operational complexity.

  • Lack of portability: Device configurations that worked with Docker sometimes didn’t work with containerd or CRI-O, because each container runtime had different device injection mechanisms.

What Is CDI?

CDI (Container Device Interface) is a container runtime-level standard specification that emerged to solve these problems. CDI declaratively defines how devices are injected into containers through a single JSON spec file. Container runtimes (containerd, CRI-O, Podman, etc.) read this spec and inject the devices into the container.

Simply put, before CDI, each vendor would say “To put my device in a container, you need to install this special runtime.” After CDI, it became “With just this JSON file, any runtime can inject my device.”

Structure of a CDI Spec File

CDI spec files are located at /etc/cdi/ or /var/run/cdi/ and have the following structure:

{
  "cdiVersion": "0.6.0",
  "kind": "hyperaccel.ai/lpu",
  "devices": [
    {
      "name": "lpu0",
      "containerEdits": {
        "deviceNodes": [
          {
            "path": "/dev/lpu0",
            "hostPath": "/dev/lpu0",
            "permissions": "rw"
          }
        ],
        "mounts": [
          {
            "hostPath": "/opt/hyperaccel/lib",
            "containerPath": "/usr/lib/hyperaccel",
            "options": ["ro", "nosuid", "nodev", "bind"]
          }
        ],
        "env": [
          "LPU_DEVICE_INDEX=0",
          "LPU_VISIBLE_DEVICES=0"
        ]
      }
    }
  ],
  "containerEdits": {
    "env": [
      "LPU_DRIVER_VERSION=1.0.0"
    ]
  }
}

The key components of the spec file are as follows:

| Field | Description |
| --- | --- |
| kind | A vendor domain-based name that identifies the type of device (e.g., hyperaccel.ai/lpu) |
| devices | A list of individual device definitions; each device has a unique name |
| containerEdits | Declaratively defines changes to apply to the container |
| deviceNodes | Device file paths and permissions to expose inside the container |
| mounts | Mount configurations for driver libraries, firmware, and other required files |
| env | Environment variables to inject inside the container |

Device-level containerEdits are applied only when that specific device is requested, while top-level containerEdits are applied as common settings whenever any device of that kind is requested.

How CDI Works

The device allocation process with CDI applied is as follows:

  1. CDI Spec Generation: The Device Plugin (or DRA Driver) detects devices on the node and generates CDI spec files for each device, storing them in the /etc/cdi/ path.

  2. CDI Device Name Passing: When allocating a device to a Pod, the Device Plugin includes the CDI device name (e.g., hyperaccel.ai/lpu=lpu0) in the AllocateResponse and passes it to Kubelet.

  3. Container Runtime CDI Resolution: When Kubelet requests container creation from the container runtime (containerd), it passes the CDI device name along with the request. The container runtime finds and reads the corresponding spec file from the /etc/cdi/ path, and automatically injects the device nodes, mounts, and environment variables defined in containerEdits into the container.

  4. Container Execution: The container runs with all device configurations applied, and the Pod can directly access the device from within.

Compared to the AllocateResponse-based flow we examined earlier, the key difference is that the complex logic for injecting devices into containers has been separated from the Device Plugin into CDI spec files. The Device Plugin no longer needs to directly include device paths, mounts, and environment variables in the AllocateResponse — it only needs to pass the CDI device name. The rest is handled by the container runtime through the CDI spec.
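To make the contrast concrete, the plugin-side work might be sketched as follows. The spec fields mirror the CDI example shown earlier; the helper function and its name are hypothetical.

```python
# Sketch: generate a CDI spec for the LPUs found on a node, then show the
# much smaller AllocateResponse that only names CDI devices. Field names
# follow the CDI spec format; the generator itself is illustrative.

def build_cdi_spec(kind, device_names):
    return {
        "cdiVersion": "0.6.0",
        "kind": kind,
        "devices": [
            {
                "name": name,
                "containerEdits": {
                    "deviceNodes": [{"path": f"/dev/{name}"}],
                    "env": [f"LPU_DEVICE_INDEX={i}"],
                },
            }
            for i, name in enumerate(device_names)
        ],
    }

spec = build_cdi_spec("hyperaccel.ai/lpu", ["lpu0", "lpu1"])
# A real plugin would serialize this to a file under /etc/cdi/.

# With CDI, the AllocateResponse shrinks to fully-qualified device names;
# the container runtime resolves everything else from the spec file.
allocate_response = {
    "cdi_devices": [f"{spec['kind']}={d['name']}" for d in spec["devices"][:1]]
}
print(allocate_response)  # {'cdi_devices': ['hyperaccel.ai/lpu=lpu0']}
```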

CDI Configuration by containerd Version

To use CDI, the container runtime must have CDI support enabled. For containerd, the level of CDI support varies by version.

| containerd Version | CDI Support Status | Configuration Required |
| --- | --- | --- |
| Before 1.7 | CDI not supported | Not available |
| 1.7 ~ 1.x | CDI supported (default: disabled) | Manual activation required |
| 2.0 and above | CDI supported (default: enabled) | No additional configuration needed |

For containerd versions 1.7 and above but below 2.0, CDI must be explicitly enabled in /etc/containerd/config.toml.

[plugins."io.containerd.grpc.v1.cri"]
  enable_cdi = true
  cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

After changing the configuration, restart containerd to activate CDI.

sudo systemctl restart containerd

Starting from containerd 2.0, enable_cdi defaults to true, so CDI can be used immediately without any additional configuration.

The Relationship Between CDI and DRA

CDI is also closely related to DRA (Dynamic Resource Allocation), which we will introduce later. DRA Drivers use CDI as the standard interface for injecting devices into containers. In other words, CDI started by standardizing device injection methods in the Device Plugin era and continues to play a pivotal role as a foundational technology in the DRA era as well. DRA will be covered in detail in a later section.


Device Plugin Examples

So far, we have examined what the Kubernetes Device Plugin is, why it’s needed, and how it detects and allocates devices. Before diving into the Device Plugin for LPU, let’s look at examples provided by leading vendors NVIDIA and AMD.


NVIDIA GPU Device Plugin

NVIDIA Device Plugin Overview

As shown in the NVIDIA Cloud-Native Toolkit Stack Layer, NVIDIA presents a layered architecture for GPU utilization:

  • Linux Distribution: Host OS level

  • Container Engine (Docker/containerd): Container runtime environment

  • Kubernetes: Pod scheduling and management

  • GPU Operator (Top Layer): An operational tool that automates all these processes, with one of its core components being the NVIDIA Device Plugin.

The NVIDIA GPU Operator supports all basic Device Plugin functions — Expose Resources, Health Check, Device Allocation & Isolation — while also providing the following extensibility:

| Feature | Description |
| --- | --- |
| MIG (Multi-Instance GPU) Support | Enables splitting a single physical GPU into multiple virtual GPUs via MIG |
| Ecosystem Integration | Works closely with dcgm-exporter, integrating with Prometheus (monitoring) and Grafana (visualization) |
| Operational Automation | Delivered as a modern pattern that manages everything from driver installation to plugin deployment through the GPU Operator |

AMD Xilinx Device Plugin

Xilinx combines Node Feature Discovery (NFD) with the FPGA Operator to support seamless use of their FPGA devices in Kubernetes environments.

Xilinx FPGA Operator Overview

NFD scans each node to determine “What Xilinx card is installed on this node?” and creates labels for the node. Since not only the model name but also the installed Shell version, PCIe information, and others are crucial for FPGAs, it provides foundational data so the scheduler can place pods based on FPGA models or specific feature support.

Once NFD creates labels on the node, the FPGA Operator uses them to bring the node to a READY state.

| Component | Description |
| --- | --- |
| Host Setup | Automatically installs XRT (Xilinx Runtime) drivers and firmware on each node |
| Container Runtime | Configures a dedicated runtime engine to enable tools like xbutil and FPGA resource access within containers |
| Device Plugin | Ultimately reports virtual resources such as xilinx.com/fpga to the API Server |

HyperAccel’s first-generation LPU is based on Xilinx FPGAs. In other words, the aforementioned Xilinx Device Plugin must be applied to utilize LPUs in a Kubernetes environment.

However, directly applying the Xilinx Device Plugin as-is has functional limitations that make it difficult to use in a real server environment. What limitations exist, and how did we overcome them? Let’s now explain how we implemented the Device Plugin for LPU.


Kubernetes Device Plugin for HyperAccel LPU

HyperAccel’s first-generation LPU was designed on an FPGA (Field-Programmable Gate Array), based on the Xilinx Alveo U55C device.

Xilinx provides the Xilinx FPGA Device Plugin as open source. By deploying this plugin as a DaemonSet in the cluster, Kubernetes can recognize FPGAs as resource hardware and select them as scheduling targets. FPGAs are recognized with names like amd.com/xilinx_u55c_gen3x16_xdma_base_3-0.


Challenges with the Existing Xilinx Device Plugin

Currently, HyperAccel provides Orion servers equipped with FPGA-based LPUs, with 8 LPU devices installed in each Orion server.

Orion Server

The 8 LPUs are connected in a Ring-Topology configuration.

Ring Topology

The existing Xilinx FPGA Device Plugin can only report how many IDLE devices each node has, answering nothing more than "Is the number of IDLE devices on this node at least the number the pod requests?" The scheduler therefore selects the node for pod placement based solely on this count. If allocation is done without considering device topology, connectivity issues arise.

Let’s look at an example. Assume a pod requests 4 LPUs within an Orion server. When a device allocation request comes in, the Device Plugin sends the list of IDLE devices to Kubelet, and Kubelet then randomly selects devices based on this list.

LPU Allocation Anti-Pattern

In such a situation, if devices are selected without considering connectivity:

  • Since the 4 LPUs are not connected to each other, operations like distributed inference become difficult. (Parallel communication and LPU-to-LPU activation synchronization are impossible)

  • If a future pod requesting 4 LPUs is scheduled, the same problem occurs. (The IDLE device count is satisfied for scheduling, but there is no connectivity between devices)

This not only makes distributed workloads using multiple LPUs difficult, but also causes resource fragmentation within the Orion server, reducing server-level utilization.

Due to these limitations, in internal development and testing environments with Orion servers and external PoC environments, only the approach of allocating either 1 or all LPUs could be applied. With this 1 or ALL allocation method, when multi-device development or testing is needed on Kubernetes, it becomes difficult for multiple developers to work simultaneously on a single Orion server. To overcome this problem, we developed the Orion Device Plugin in-house for more flexible device utilization in a Kubernetes-based development environment.


Orion Device Plugin

The Orion Device Plugin is a component that adds Topology-Aware allocation functionality on top of the existing Xilinx Device Plugin.

Using the in-house developed HyperDex-Toolchain, a network table showing the connectivity of LPUs within the Orion server can be extracted. Based on the number of LPUs requested by the Pod and the connectivity status of IDLE LPUs extracted from the network table, the optimal device allocation combination is determined. The Device Plugin conveys this list to Kubelet through the GetPreferredAllocation function described earlier, and Kubelet allocates LPUs to the Pod accordingly.
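The selection logic can be sketched as follows, assuming for simplicity that ring order matches device index order (in reality the connectivity comes from the HyperDex-Toolchain network table); the function is illustrative, not the actual plugin code.

```python
# Sketch of topology-aware selection on a ring of N LPUs. We assume the
# ring order is simply device-index order. Returns a contiguous ring
# segment of `count` IDLE devices, or None if no such segment exists.

def preferred_allocation(idle, count, ring_size=8):
    idle = set(idle)
    for start in range(ring_size):
        # Segments may wrap around the ring (e.g. 6 -> 7 -> 0 -> 1).
        segment = [(start + i) % ring_size for i in range(count)]
        if all(dev in idle for dev in segment):
            return segment
    return None  # fall back to Kubelet's default (arbitrary) choice

# 4 IDLE LPUs that are NOT contiguous: the anti-pattern case
print(preferred_allocation([0, 2, 4, 6], 4))  # None -> no connected set
# 4 IDLE LPUs forming a segment that wraps around the ring
print(preferred_allocation([6, 7, 0, 1], 4))  # [6, 7, 0, 1]
```

In GetPreferredAllocation, the plugin would return such a segment as the preferred device set, and Kubelet would then pass exactly those IDs to Allocate, keeping the remaining IDLE devices contiguous for future pods.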

Using the developed component, the Orion server is currently being used in various ways:

  • In the Kubernetes-based in-house development environment, fully isolated LPUs within the Orion server are provided to FPGA developers. In other words, multiple developers can work simultaneously on a single Orion server without device interference.

  • During HyperAccel Chat Demos running on the Orion server, multiple models can be served simultaneously on the same Orion server.

  • When conducting customer PoCs, if there is a need to test on a Kubernetes environment, the Device Plugin is provided alongside.


Bertha Device Plugin

HyperAccel’s next-generation ASIC chip, Bertha, is about to be released!

Since Bertha is an entirely new proprietary ASIC chip from HyperAccel, utilizing it in a Kubernetes environment requires developing a Bertha Operator that includes components such as the Device Plugin, NFD, and a Metric Exporter. We are currently developing the Cloud-Native Software Stack for Bertha together with Namyoon Kim from the ML team.


DRA in Kubernetes

So far, we have introduced the Device Plugin — an essential component for utilizing new devices in Kubernetes — and explored how HyperAccel leverages it in our development environment. However, Device Plugins also have some limitations.

Limitations of the Kubernetes Device Plugin

Hardware management through Device Plugins has the following limitations:

  • Static allocation: Resource allocation is based simply on the number of devices requested.

  • Lack of complex configuration: It is difficult to dynamically reflect detailed hardware settings (e.g., GPU partitioning, memory bandwidth configuration) at Pod request time.

  • Difficulty with resource sharing: Implementing scenarios where multiple pods flexibly share a single resource is complex.

DRA (Dynamic Resource Allocation)

Kubernetes DRA is a newly introduced framework actively being adopted to overcome the limitations of existing Device Plugins and manage custom hardware resources like GPUs and FPGAs with much greater flexibility and granularity.

DRA is designed to request and allocate hardware resources in a manner similar to using storage (PVC/PV). Simply put, whereas before you could only say “Give me 1 LPU,” with DRA you can make much more complex requests like “Please dynamically allocate an LPU with settings optimized for a specific model’s inference.”

Core Components of DRA

DRA clearly separates how resources are defined and requested into Allocation and Usage.

| Component | Description |
| --- | --- |
| ResourceClaim | An object where users define and request the resources they need (e.g., "1 GPU with 16GB or more"); similar to a PVC |
| DeviceClass | Defines the type of resource and the driver that handles it; similar to a StorageClass (called ResourceClass in early DRA versions) |
| ResourceSlice | A data unit containing detailed information (model, capacity, status, etc.) about the devices actually present on each node |
| DRA Driver | A plugin provided by the hardware vendor, responsible for actual hardware preparation and allocation |

Device Plugin vs DRA

| Feature | Device Plugin | DRA |
| --- | --- | --- |
| Resource Model | Integer-based (e.g., hyperaccel.ai/lpu: 1) | Object-based (ResourceClaim) |
| Parameter Passing | Very limited (using Annotations, etc.) | Highly flexible (custom parameter support) |
| Scheduling | Simple count check | Can reflect complex constraints |
| Lifecycle | Fixed at Pod execution time | Can be allocated/released independently of the Pod |
With the release of Kubernetes v1.34 in the second half of 2025, major DRA features have transitioned to General Availability (GA). As a result, DRA is being standardized for managing accelerators like NVIDIA GPUs in the latest cloud environments (GKE, EKS, etc.) and on-premises AI clusters.

HyperAccel is also keeping pace and preparing to support the use of Bertha with DRA in Kubernetes environments!


In Summary…

In this article — the third installment of the Building an In-House Dev Environment on Kubernetes series — we introduced what the Kubernetes Device Plugin is, the development process of the Device Plugin for utilizing HyperAccel’s LPU in Kubernetes, and DRA, a technology for more flexible resource allocation.

For a market-competitive LPU, it is important to design well at the HW level, but an optimized SW stack must also accompany it. Supporting the smooth use of LPU on top of the most widely used Kubernetes environment is a critical challenge for chip success!

Our ML team’s DevOps division is developing Cloud-Native Toolkit software for HyperAccel LPU’s potential customers. The Device Plugin and DRA driver components that form the foundation of this software enable seamless use of the LPU within compute nodes.

Thank you for reading to the end!


P.S.: HyperAccel is Hiring!

Everyone at HyperAccel is working hard toward the release of an LLM-acceleration ASIC chip! As a company that covers not only HW and SW but also end-to-end AI inference technology, we have outstanding talent working together across every area. We are growing rapidly, gaining broad and deep knowledge beyond a single field while working with amazing colleagues!

Our ML team’s DevOps division provides and manages development environments to boost in-house developer productivity, and develops Cloud-Native Toolkit to effectively support cloud-level utilization of LPU chips.

If you’re interested in the technologies HyperAccel works with, please apply through HyperAccel Career!
