GPU
Kubernetes now supports the allocation of GPU resources for containers (currently only NVIDIA GPUs), which is widely used in scenarios like deep learning.
How to Use
Kubernetes v1.8 and Later
Starting with Kubernetes v1.8, GPUs are supported through the DevicePlugin feature. Prior to use, several configurations are needed:
Enable the following feature gate on kubelet/kube-apiserver/kube-controller-manager (see the kubelet sketch after this list):
--feature-gates="DevicePlugins=true"
Install NVIDIA drivers on all Nodes, including the NVIDIA CUDA Toolkit and cuDNN
Configure Kubelet to use Docker as the container engine (which is the default setting), as other container engines do not yet support this feature
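For example, on a node where the kubelet runs under systemd, the gate can be passed through a drop-in unit. This is a minimal sketch: the drop-in path and the KUBELET_EXTRA_ARGS variable follow the kubeadm convention and may differ on your distribution.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/kubelet.service.d/20-device-plugins.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
EOF
# Reload systemd and restart the kubelet so the gate takes effect
sudo systemctl daemon-reload
sudo systemctl restart kubelet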
NVIDIA Plugin
The NVIDIA device plugin requires nvidia-docker.
To install nvidia-docker:
# Install docker-ce
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"
sudo apt-get update
sudo apt-get install docker-ce
# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Set Docker's default runtime to nvidia:
# cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
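Changing default-runtime only takes effect after the Docker daemon restarts; assuming Docker is managed by systemd, restart it and confirm the active default:
sudo systemctl restart docker
docker info | grep -i 'default runtime'  # should report nvidia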
Deploy the NVIDIA device plugin:
# For Kubernetes v1.8
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml
# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
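Once the plugin Pods are running, every GPU Node should advertise nvidia.com/gpu in its capacity. One way to check (the backslash escapes the dot inside the resource name):
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu"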
GCE/GKE GPU Plugin
This plugin does not require nvidia-docker and also supports CRI container runtimes.
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/daemonset.yaml
# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml
# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
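Before scheduling GPU workloads, it is worth checking that the installer and plugin Pods are healthy; assuming both DaemonSets were created in the kube-system namespace:
kubectl get pods -n kube-system | grep -E 'nvidia|device-plugin'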
Example of Requesting nvidia.com/gpu Resources
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
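To try it out, create the Pod and read its logs once it completes; the file name below is illustrative, and the upstream sample image is expected to print "Test PASSED" on success:
$ kubectl create -f cuda-vector-add.yaml
$ kubectl logs cuda-vector-add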
Kubernetes v1.6 and v1.7
Note: alpha.kubernetes.io/nvidia-gpu has been deprecated since v1.10; please use nvidia.com/gpu in newer versions.
To use GPUs in Kubernetes v1.6 and v1.7, prerequisite configurations are required:
Install NVIDIA drivers on all Nodes, including the NVIDIA CUDA Toolkit and cuDNN
Enable the feature gate --feature-gates="Accelerators=true" on the apiserver and kubelet
Configure kubelet to use Docker as the container engine (the default setting), as other container engines are not yet supported
Use the resource name alpha.kubernetes.io/nvidia-gpu to specify the number of GPUs required, for example:
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
spec:
  restartPolicy: Never
  containers:
    - image: gcr.io/tensorflow/tensorflow:latest-gpu
      name: gpu-container-1
      command: ["python"]
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/lib/nvidia
      args:
        - -u
        - -c
        - from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1 # requests one GPU
      volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/lib/nvidia
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
          name: libcuda-so-375-66
  volumes:
    - name: bin
      hostPath:
        path: /usr/lib/nvidia-375/bin
    - name: lib
      hostPath:
        path: /usr/lib/nvidia-375
    - name: libcuda-so
      hostPath:
        path: /usr/lib/x86_64-linux-gnu/libcuda.so
    - name: libcuda-so-1
      hostPath:
        path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
    - name: libcuda-so-375-66
      hostPath:
        path: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
$ kubectl create -f pod.yaml
pod "tensorflow" created
$ kubectl logs tensorflow
...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 9675741273569321173
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11332668621
locality {
bus_id: 1
}
incarnation: 7807115828340118187
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0"
]
Note:
GPU resources must be requested in resources.limits; resources.requests is not valid
Containers may request one or more whole GPUs (see the fragment after these notes), but not a fraction of a GPU
GPUs cannot be shared among multiple containers
It is assumed by default that all Nodes are equipped with GPUs of the same model
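For instance, a container that needs two GPUs lists them in limits; only whole numbers are accepted (a fragment only, with the enclosing Pod spec elided):
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 2 # two whole GPUs; 0.5 would be rejected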
Multiple GPU Models
If the Nodes in your cluster are installed with GPUs of different models, you can use Node Affinity to schedule Pods to Nodes with a specific GPU model.
First, label your Nodes with the GPU model during cluster initialization:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
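The -L flag prints the label as an extra column, so you can confirm the labeling at a glance:
kubectl get nodes -L accelerator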
Then, set Node Affinity when creating a Pod:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
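The nodeSelector above is the simplest form. The same constraint can be written as node affinity proper, which also allows matching several models at once; a sketch using requiredDuringSchedulingIgnoredDuringExecution:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: accelerator
                operator: In
                values:
                  - nvidia-tesla-k80
                  - nvidia-tesla-p100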
Using CUDA Libraries
The NVIDIA CUDA Toolkit and cuDNN must be pre-installed on all Nodes. To give containers access to /usr/lib/nvidia-375, the CUDA libraries should be passed in as hostPath volumes:
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
  labels:
    name: nvidia-smi
spec:
  template:
    metadata:
      labels:
        name: nvidia-smi
    spec:
      containers:
        - name: nvidia-smi
          image: nvidia/cuda
          command: ["nvidia-smi"]
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          volumeMounts:
            - mountPath: /usr/local/nvidia/bin
              name: bin
            - mountPath: /usr/lib/nvidia
              name: lib
      volumes:
        - name: bin
          hostPath:
            path: /usr/lib/nvidia-375/bin
        - name: lib
          hostPath:
            path: /usr/lib/nvidia-375
      restartPolicy: Never
$ kubectl create -f job.yaml
job "nvidia-smi" created
$ kubectl get job
NAME DESIRED SUCCESSFUL AGE
nvidia-smi 1 1 14m
$ kubectl get pod -a
NAME READY STATUS RESTARTS AGE
nvidia-smi-kwd2m 0/1 Completed 0 14m
$ kubectl logs nvidia-smi-kwd2m
Fri Jun 16 19:49:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:04.0 Off | 0 |
| N/A 74C P0 80W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Appendix: Installing CUDA
To install CUDA:
# Check for CUDA and try to install.
if ! dpkg-query -W cuda; then
  # The 16.04 installer works with 16.10.
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  apt-get update
  apt-get install cuda -y
fi
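As a quick sanity check that the toolkit landed, query the compiler version; nvcc lives under /usr/local/cuda/bin, which may not be on your PATH yet:
/usr/local/cuda/bin/nvcc --version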
To install cuDNN:
First, visit the website https://developer.nvidia.com/cudnn, register and download cuDNN v5.1, then use the following commands to install it:
tar zxvf cudnn-8.0-linux-x64-v5.1.tgz
ln -s /usr/local/cuda-8.0 /usr/local/cuda
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
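To confirm the headers were copied correctly, you can read the version macros out of cudnn.h (cuDNN v5.1 should report major version 5):
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h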
After installation, you can run nvidia-smi to check the status of the GPU devices:
$ nvidia-smi
Fri Jun 16 19:33:35 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:04.0 Off | 0 |
| N/A 74C P0 80W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+