Kubernetes has robust support for provisioning GPU resources (currently limited to NVIDIA GPUs), which is an invaluable asset for compute-intensive tasks such as deep learning.
How to Access GPU Resources
For Kubernetes v1.8 and Above
Starting with Kubernetes v1.8, GPU support is provided through the DevicePlugins feature. The following must be configured beforehand:
Enabling the flag on kubelet/kube-apiserver/kube-controller-manager: --feature-gates="DevicePlugins=true"
Installing NVIDIA drivers on all Nodes, including the NVIDIA CUDA Toolkit and cuDNN
Configuring kubelet to use Docker as the container engine (the default setting); other engines are not yet compatible with this feature.
# Install docker-ce
curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

# Add the package repositories
distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
     sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
     sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Install nvidia-docker2 and reload the Docker daemon configuration
# (refresh the package index first so the new repository is picked up)
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Test nvidia-smi with the latest official CUDA image
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
The alpha.kubernetes.io/nvidia-gpu resource name was removed in v1.10; use nvidia.com/gpu in newer versions.
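Before scheduling GPU workloads, it can help to confirm that each Node actually advertises GPUs under the nvidia.com/gpu resource name. The sketch below assumes kubectl is already configured for the cluster and that a device plugin is running on the GPU Nodes; list_gpu_nodes is a hypothetical helper name, not part of Kubernetes or kubectl.

```shell
#!/bin/sh
# list_gpu_nodes is a hypothetical helper (not a kubectl subcommand).
# It prints every Node together with the number of nvidia.com/gpu
# devices that the Node reports as allocatable.
list_gpu_nodes() {
  # The dot inside the resource name must be escaped in the column path,
  # otherwise it is read as a field separator.
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}
```

Calling list_gpu_nodes after the setup above should show a GPU count next to each GPU Node; a Node without a count is most likely missing the driver or the device plugin.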
For Kubernetes v1.6 and v1.7, install the NVIDIA drivers on all Nodes, enable --feature-gates="Accelerators=true" on kube-apiserver and kubelet, and make sure kubelet is configured to use Docker as the container engine.
In these versions, the number of GPUs is requested through the resource name alpha.kubernetes.io/nvidia-gpu.
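A minimal sketch of such a request for v1.6 and v1.7 follows. The Pod name gpu-pod and the file name gpu-pod.yaml are hypothetical, and the image is the k8s.gcr.io/cuda-vector-add:v0.1 image used elsewhere in this document. The manifest is written with a heredoc so that it can be reviewed before it is submitted.

```shell
#!/bin/sh
# Write a Pod manifest that asks for one GPU through the pre-v1.10
# resource name. Pod name, file name, and image are illustrative.
cat > gpu-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          # Whole GPUs only, and only under limits.
          alpha.kubernetes.io/nvidia-gpu: 1
EOF

# Submit it once kubectl points at the target cluster:
#   kubectl create -f gpu-pod.yaml
echo "wrote gpu-pod.yaml"
```

On v1.10 and later the same manifest would need nvidia.com/gpu in place of alpha.kubernetes.io/nvidia-gpu.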
GPU resources must be requested in resources.limits; values set in resources.requests have no effect.
Containers can request either one or multiple GPUs, but not a fraction.
GPUs cannot be shared between containers.
By default, all Nodes are assumed to have the same GPU model installed.
Handling Multiple GPU Models
If the cluster has Nodes with different GPU models, Node Affinity can be used to schedule Pods to Nodes with specific GPU models:
First, at cluster setup, label the Nodes with the appropriate GPU model:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
Then, select the GPU model when creating the Pod, here with a nodeSelector, the simplest way to pin a Pod to labeled Nodes:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
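A nodeSelector matches exactly one label value. When a workload can run on more than one GPU model, the same accelerator label can be matched with a nodeAffinity rule instead. A minimal sketch, assuming the Nodes carry the accelerator labels applied earlier; the Pod name cuda-vector-add-affinity and the file name gpu-affinity-pod.yaml are hypothetical.

```shell
#!/bin/sh
# Write a Pod manifest that accepts either GPU model by matching the
# "accelerator" node label with a required nodeAffinity rule.
cat > gpu-affinity-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add-affinity
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values:
                  - nvidia-tesla-p100
                  - nvidia-tesla-k80
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Submit it once kubectl points at the target cluster:
#   kubectl create -f gpu-affinity-pod.yaml
echo "wrote gpu-affinity-pod.yaml"
```

The In operator lets the scheduler place the Pod on any Node whose accelerator label equals one of the listed values, whereas the nodeSelector form would need a separate manifest per model.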
Utilizing CUDA Libraries
The NVIDIA CUDA Toolkit and cuDNN must be pre-installed on all Nodes. To give a container access to the driver libraries under /usr/lib/nvidia-375, pass them to the container as hostPath volumes. On Ubuntu 16.04, CUDA can be installed on a Node as follows:
# Check for CUDA and try to install.
if ! dpkg-query -W cuda; then
  # The 16.04 installer works with 16.10.
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  apt-get update
  apt-get install cuda -y
fi
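The hostPath approach mentioned above can be sketched as a Pod that mounts the Node's /usr/lib/nvidia-375 directory read-only. This is only an illustration: the Pod name cuda-hostpath, the volume name nvidia-libs, the file name cuda-hostpath-pod.yaml, and the mount point /usr/local/nvidia/lib64 are all assumptions, and the mount point in particular should be whatever library path the container image actually searches.

```shell
#!/bin/sh
# Write a Pod manifest that exposes the host's NVIDIA driver libraries
# to the container through a read-only hostPath volume.
cat > cuda-hostpath-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cuda-hostpath
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          # Use alpha.kubernetes.io/nvidia-gpu on v1.6 and v1.7.
          nvidia.com/gpu: 1
      volumeMounts:
        # Assumed mount point; adjust to the path the image expects.
        - name: nvidia-libs
          mountPath: /usr/local/nvidia/lib64
          readOnly: true
  volumes:
    - name: nvidia-libs
      hostPath:
        path: /usr/lib/nvidia-375
EOF

# Submit it once kubectl points at the target cluster:
#   kubectl create -f cuda-hostpath-pod.yaml
echo "wrote cuda-hostpath-pod.yaml"
```

Because the host path embeds the driver version (375 here), the manifest has to be updated whenever the driver on the Nodes is upgraded.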