GPU
Kubernetes lets containers request GPU resources (currently only NVIDIA GPUs are supported), which is widely used in scenarios such as deep learning.

Usage

Kubernetes v1.8 and later

Starting with Kubernetes v1.8, GPU support is implemented through the DevicePlugin mechanism, which requires the following preparation:
  • Enable the feature gate on kubelet/kube-apiserver/kube-controller-manager: --feature-gates="DevicePlugins=true" (see the sketch after this list)
  • Install the NVIDIA drivers on every Node, including the NVIDIA CUDA Toolkit, cuDNN, etc.
  • Configure kubelet to use the docker container runtime (the default); other container runtimes do not support this feature yet
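Where exactly the feature gate is set depends on how the cluster was deployed; the snippet below is only a hedged sketch for a kubeadm/systemd-style setup (the /etc/default/kubelet path is an assumption and may differ in your environment):
# Assumed kubeadm-style installation: kubelet reads extra flags from /etc/default/kubelet.
# For kube-apiserver and kube-controller-manager, add the same flag to their static Pod
# manifests or service units instead.
echo 'KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true' | sudo tee /etc/default/kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet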

NVIDIA device plugin

The NVIDIA device plugin requires nvidia-docker.
Install nvidia-docker:
# Install docker-ce
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install docker-ce

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu16.04/amd64/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Set nvidia as the default Docker runtime:
# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
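After changing daemon.json, Docker has to be restarted for the new default runtime to take effect; a minimal sketch of applying and checking the change:
# Restart Docker and confirm that the default runtime is now nvidia
sudo systemctl restart docker
docker info | grep -i runtime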
Deploy the NVIDIA device plugin:
# For Kubernetes v1.8
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.8/nvidia-device-plugin.yml

# For Kubernetes v1.9
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.9/nvidia-device-plugin.yml
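Once the plugin DaemonSet is running, GPU Nodes should advertise the nvidia.com/gpu resource; a quick way to check (the node name below is a placeholder):
# The plugin runs as a DaemonSet in kube-system
kubectl -n kube-system get daemonset | grep nvidia
kubectl describe node <gpu-node> | grep nvidia.com/gpu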

GCE/GKE GPU plugin

This plugin does not require nvidia-docker and also works with CRI container runtimes.
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/daemonset.yaml

# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/k8s-1.9/nvidia-driver-installer/ubuntu/daemonset.yaml

# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.9/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

Example: requesting nvidia.com/gpu resources

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
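Assuming the manifest above is saved as cuda-vector-add.yaml (the file name is illustrative), it can be created and checked like this:
kubectl create -f cuda-vector-add.yaml
kubectl get pod cuda-vector-add   # should reach Completed once the CUDA test finishes
kubectl logs cuda-vector-add      # prints the result of the vector-add test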

Kubernetes v1.6 and v1.7

alpha.kubernetes.io/nvidia-gpu has been removed in v1.10; on newer versions use nvidia.com/gpu instead.
Using GPUs in Kubernetes v1.6 and v1.7 requires the following preparation:
  • Install the NVIDIA drivers on every Node, including the NVIDIA CUDA Toolkit, cuDNN, etc.
  • Enable --feature-gates="Accelerators=true" on the apiserver and kubelet
  • Configure kubelet to use the docker container runtime (the default); other container runtimes do not support this feature yet
Specify the number of GPUs to request with the resource name alpha.kubernetes.io/nvidia-gpu, for example:
apiVersion: v1
kind: Pod
metadata:
  name: tensorflow
spec:
  restartPolicy: Never
  containers:
  - image: gcr.io/tensorflow/tensorflow:latest-gpu
    name: gpu-container-1
    command: ["python"]
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/lib/nvidia
    args:
    - -u
    - -c
    - from tensorflow.python.client import device_lib; print device_lib.list_local_devices()
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1 # requests one GPU
    volumeMounts:
    - mountPath: /usr/local/nvidia/bin
      name: bin
    - mountPath: /usr/lib/nvidia
      name: lib
    - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
      name: libcuda-so
    - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      name: libcuda-so-1
    - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
      name: libcuda-so-375-66
  volumes:
  - name: bin
    hostPath:
      path: /usr/lib/nvidia-375/bin
  - name: lib
    hostPath:
      path: /usr/lib/nvidia-375
  - name: libcuda-so
    hostPath:
      path: /usr/lib/x86_64-linux-gnu/libcuda.so
  - name: libcuda-so-1
    hostPath:
      path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
  - name: libcuda-so-375-66
    hostPath:
      path: /usr/lib/x86_64-linux-gnu/libcuda.so.375.66
$ kubectl create -f pod.yaml
pod "tensorflow" created

$ kubectl logs tensorflow
...
[name: "/cpu:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 9675741273569321173
, name: "/gpu:0"
device_type: "GPU"
memory_limit: 11332668621
locality {
  bus_id: 1
}
incarnation: 7807115828340118187
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:04.0"
]
Notes
  • GPU resources must be requested in resources.limits; specifying them only in resources.requests has no effect
  • A container can request one or more whole GPUs; fractions of a GPU cannot be requested
  • GPUs cannot be shared between containers
  • By default, all Nodes are assumed to have GPUs of the same model

Multiple GPU models

If the Nodes in the cluster have GPUs of different models installed, Node Affinity can be used to schedule Pods onto Nodes with a specific GPU model.
First, when initializing the cluster, label the Nodes with their GPU model:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
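To confirm that the labels were applied, list the accelerator label as an extra column:
# Show the accelerator label for every node
kubectl get nodes -L accelerator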
Then, when creating the Pod, constrain it to the desired GPU model (the example below uses nodeSelector):
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
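The example above expresses the constraint with nodeSelector; an equivalent nodeAffinity form looks roughly like the following sketch (not taken from the original document, field names follow the standard Pod API):
# Equivalent constraint expressed with nodeAffinity instead of nodeSelector
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator
            operator: In
            values:
            - nvidia-tesla-p100
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF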

Using the CUDA libraries

The NVIDIA CUDA Toolkit, cuDNN, and related libraries must be pre-installed on all Nodes. To access /usr/lib/nvidia-375, the CUDA libraries have to be passed to the container as hostPath volumes:
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
  labels:
    name: nvidia-smi
spec:
  template:
    metadata:
      labels:
        name: nvidia-smi
    spec:
      containers:
      - name: nvidia-smi
        image: nvidia/cuda
        command: ["nvidia-smi"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/lib/nvidia
          name: lib
      volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-375/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-375
      restartPolicy: Never
$ kubectl create -f job.yaml
job "nvidia-smi" created

$ kubectl get job
NAME         DESIRED   SUCCESSFUL   AGE
nvidia-smi   1         1            14m

$ kubectl get pod -a
NAME               READY     STATUS      RESTARTS   AGE
nvidia-smi-kwd2m   0/1       Completed   0          14m

$ kubectl logs nvidia-smi-kwd2m
Fri Jun 16 19:49:53 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
| N/A   74C    P0    80W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Appendix: installing CUDA

Install CUDA:
# Check for CUDA and try to install.
if ! dpkg-query -W cuda; then
  # The 16.04 installer works with 16.10.
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
  apt-get update
  apt-get install cuda -y
fi
Install cuDNN:
First register at https://developer.nvidia.com/cudnn, download cuDNN v5.1, and then install it with the following commands:
tar zxvf cudnn-8.0-linux-x64-v5.1.tgz
ln -s /usr/local/cuda-8.0 /usr/local/cuda
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
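Optionally, make the CUDA binaries and libraries visible to shells and the dynamic linker (a sketch; paths assume the default /usr/local/cuda symlink created above):
# Make nvcc and the CUDA libraries discoverable
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
sudo ldconfig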
After the installation completes, run nvidia-smi to check the status of the GPU devices:
$ nvidia-smi
Fri Jun 16 19:33:35 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
| N/A   74C    P0    80W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
