Add GPU flag support for Docker provider #1

Open

jacobtomlinson wants to merge 5 commits into base: main

Conversation

jacobtomlinson
Owner

Adds a gpus node config option which in turn passes the --gpus=all flag to Docker. This approach was already rejected in kubernetes-sigs#1886, but it looks like that PR was later force-pushed with a more generic implementation that was also rejected. I wanted to have a quick go at implementing this feature myself for my own use.

I don't intend to try to get this merged upstream right now, as I don't have time to go off on this tangent, but I wanted to put this here in case it inspires someone else to do so.
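
For illustration only, the effect of gpus: true is roughly as if --gpus=all were appended to the docker run invocation that creates the node container. This is a simplified sketch, not the actual command kind builds (kind also passes --privileged, volume mounts, networking flags, and so on, and the image tag here is just an example):

# Hypothetical, simplified invocation showing only the GPU flag
$ docker run -d --gpus=all kindest/node:v1.23.1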

Demo

Enable GPUs in the cluster spec.

# kind-gpu.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: gpu-test
nodes:
  - role: control-plane
    gpus: True

Create cluster.

$ ./bin/kind create cluster --config kind-gpu.yaml

Install the NVIDIA GPU operator (but skip the driver install, as the driver should already be installed on the host).

$ helm repo add nvidia https://nvidia.github.io/gpu-operator \
   && helm repo update
$ helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator \
     --set driver.enabled=false
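
(Not part of the original steps: a quick sanity check, assuming the kind-gpu-test kubectl context from above, is to confirm the operator pods come up before scheduling the test pod.)

$ kubectl get pods -n gpu-operator --context kind-gpu-test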

Test scheduling a pod.

# gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
$ kubectl apply --context kind-gpu-test -f gpu-pod.yaml

Check the pod logs.

$ kubectl logs --context kind-gpu-test vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

@jacobtomlinson
Owner Author

jacobtomlinson commented Jan 25, 2022

Here are Linux amd64 binaries to save building one yourself.

kind-0.11.1-with-gpu-patch.zip
kind-0.17.0-with-gpu-patch.zip

@moracabanas

moracabanas commented Feb 16, 2022

Thanks for pushing this.
I am looking to test Kubeflow on a future workstation as a machine learning experimentation platform. It's probably overkill, since I'm trying to run something designed for a Kubernetes cluster on a bare-bones workstation with a single GPU, but I can't find an easier way to test a Kubernetes cluster with my laptop setup.
At the moment I have a laptop with Windows 11, WSL2, and working CUDA containers. All I have to do is run this set of flags to get GPU passthrough:
docker run -it --rm --gpus all nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

But trying to adopt kind as a Kubernetes testing solution is not working for me; I can't get GPU passthrough following your steps:

kubectl describe pod vectoradd

Name:         vectoradd
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  vectoradd:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5b9pw (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-5b9pw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  39s (x3 over 2m47s)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

Do you have any idea how to make progress with this setup?

Thanks!

@moracabanas

UPDATE:

I found out that a WSL2 setup is not supported for this case, because the NVIDIA device plugin and container runtime rely on the NVIDIA Management Library (NVML), which is not yet supported there, as seen in these threads:

NVIDIA/k8s-device-plugin#207 (comment)

https://k3d.io/v4.4.8/usage/guides/cuda/#known-issues

@jacobtomlinson
Owner Author

Yeah, I wrote this blog post on Running Kubeflow in Kind with GPUs mainly to record the steps I took to get things running on my Linux workstation, but I can imagine Windows being more challenging.

@raphaelauv

Thanks @jacobtomlinson 👍 👍

Do you plan to open a PR directly on the kind repo?

@jacobtomlinson
Owner Author

@raphaelauv It's on my wishlist, but I'm not sure I'll ever get to it. It seems folks have previously opened PRs and tried to get the kind devs to support this but failed to convince them. I'm not sure I have the time to navigate this.

@raphaelauv

raphaelauv commented Nov 7, 2022

Okay, I see. Thanks for your time 👍

I will give your solution a try soon.

oshanQQ pushed a commit to oshanQQ/graduation-thesis that referenced this pull request Dec 26, 2022
reference:
- jacobtomlinson/kind#1
- https://jacobtomlinson.dev/posts/2022/quick-hack-adding-gpu-support-to-kind/

Creation log:

$ ../kind/bin/kind create cluster --config kind-gpu.yml
Creating cluster "gpu-test" ...
 ✓ Ensuring node image (kindest/node:v1.23.1) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-gpu-test"
You can now use your cluster with:

kubectl cluster-info --context kind-gpu-test

Have a nice day! 👋
@jacobtomlinson
Owner Author

Just updated this PR with the latest code from upstream. I still have no intention of making this a real PR, but at least it is up to date for now.

@psigen

psigen commented Jun 25, 2023

FYI: the discussion about adding GPU support has been re-opened here and seems to be progressing; your PR was referenced in the conversation:
kubernetes-sigs#3164

Here is a WIP PR coming from that discussion for anyone interested:
kubernetes-sigs#3257

@klueska

klueska commented Jun 26, 2023

Please see kubernetes-sigs#3257 (comment) for a discussion on how to enable GPU support today without any patches needed to kind.

@jiangxiaobin96

I followed the blog and ran into some problems.

NAMESPACE            NAME                                                                  READY   STATUS     RESTARTS      AGE
gpu-operator         pod/gpu-feature-discovery-qcvjj                                       0/1     Init:0/1   0             18m
gpu-operator         pod/gpu-operator-1693379171-node-feature-discovery-master-7b4fhdvf2   1/1     Running    4 (74m ago)   44h
gpu-operator         pod/gpu-operator-1693379171-node-feature-discovery-worker-vrww2       1/1     Running    5 (74m ago)   44h
gpu-operator         pod/gpu-operator-5ffbcb489c-wvrlf                                     1/1     Running    3 (74m ago)   44h
gpu-operator         pod/nvidia-dcgm-exporter-zfl49                                        0/1     Init:0/1   0             44h
gpu-operator         pod/nvidia-device-plugin-daemonset-wtn4p                              0/1     Init:0/1   0             44h
gpu-operator         pod/nvidia-operator-validator-2pgx2                                   0/1     Init:0/4   0             44h

Running kubectl describe pod/gpu-feature-discovery-qcvjj -n gpu-operator shows:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

It seems the kind node is missing the nvidia runtime. After entering the kind node container, I found that it uses containerd instead of docker and does not have nvidia-container-runtime.
Did I run any of the steps wrong, or do I need to install nvidia-container-runtime or docker (nvidia-docker2) manually?
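
One way to check which runtimes the node's containerd is configured with is to grep its config from the host. This assumes the gpu-test cluster name from the demo, for which kind names the control-plane container gpu-test-control-plane:

$ docker exec gpu-test-control-plane \
    grep -A 3 runtimes /etc/containerd/config.toml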

@jiangxiaobin96

BTW, it's also necessary to add NVIDIA_DRIVER_CAPABILITIES=all.
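
For reference, a hedged sketch of that variable applied to the standalone docker run from earlier in the thread (where exactly it needs to be set for the kind node container itself is not spelled out in this comment):

# Same CUDA sample as above, with the driver capabilities exposed explicitly
$ docker run -it --rm --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2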

@klueska

klueska commented Sep 1, 2023

Did you try following the instructions linked in the comment just above yours? They supersede the need for this patch:
#1 (comment)

@jiangxiaobin96

  • kind version
# kind version
kind (@jacobtomlinson's patched GPU edition) v0.18.0-alpha.702+ec8f4c936a5171 go1.19.3 linux/amd64
  • helm repo list
# helm repo list
NAME  	URL                                  
nvidia	https://nvidia.github.io/gpu-operator
  • get pod
# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-9xdtd                                       0/1     CrashLoopBackOff        5 (113s ago)     4m52s
gpu-operator-1693882100-node-feature-discovery-master-68d8ngrhg   1/1     Running                 0                34m
gpu-operator-1693882100-node-feature-discovery-worker-mhkjz       1/1     Running                 0                34m
gpu-operator-7cd5b68d57-9khd2                                     1/1     Running                 0                34m
nvidia-container-toolkit-daemonset-nlfvt                          1/1     Running                 0                34m
nvidia-cuda-validator-zzxpb                                       0/1     Completed               0                34m
nvidia-dcgm-exporter-8c76s                                        1/1     Running                 0                34m
nvidia-device-plugin-daemonset-f679x                              0/1     CrashLoopBackOff        11 (2m44s ago)   34m
nvidia-operator-validator-hpnx9                                   0/1     Init:CrashLoopBackOff   7 (3m12s ago)    34m
  • pod error of gpu-feature-discovery-9xdtd
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: mount error: unable to get cap device attributes: /proc/driver/nvidia/capabilities/mig/monitor: no such file or directory: unknown
  • mig daemonset
gpu-operator   daemonset.apps/nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             36m
  • mig cap
Tue Sep  5 03:25:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Enabled* |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@Dentrax

Dentrax commented Sep 13, 2023

Awesome job! Looking forward to this feature.
