Spark
Kubernetes has supported running Apache Spark applications natively since v1.8 (this requires a Spark build with Kubernetes support, e.g. v2.3), and jobs can be submitted to Kubernetes directly with the spark-submit command. For example, to compute Pi:
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --kubernetes-namespace default \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar
Or, using the Python version:
bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --kubernetes-namespace <k8s-namespace> \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
  --py-files local:///opt/spark/examples/src/main/python/sort.py \
  local:///opt/spark/examples/src/main/python/pi.py 10
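After submission, the driver (and then its executors) run as pods in the target namespace, so progress can be followed with kubectl. A minimal sketch, assuming the default namespace from the first example; the driver pod name is generated from the app name, so substitute your own:

$ kubectl get pods -n default
$ kubectl logs -f <spark-pi-driver-pod> -n default   # SparkPi logs a line like "Pi is roughly 3.14..."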

Deploying Spark on Kubernetes

The Kubernetes examples repository on GitHub provides a detailed Spark deployment guide. Because the steps are fairly involved, a few parts are simplified here so that less has to be configured during installation.

Prerequisites

  • A Kubernetes cluster; see the cluster deployment guide
  • kube-dns up and running
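A quick way to verify kube-dns, assuming the standard k8s-app=kube-dns label used by the add-on:

$ kubectl get pods -n kube-system -l k8s-app=kube-dns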

Create a namespace

namespace-spark-cluster.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "spark-cluster"
  labels:
    name: "spark-cluster"
$ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
The original guide switches the kubectl context over to the spark-cluster namespace at this point. For convenience we do not do that here; instead, every manifest that follows sets its namespace to spark-cluster explicitly.
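For reference, a sketch of the context switch the original guide performs; the cluster and user names are placeholders for the entries in your own kubeconfig:

$ kubectl config set-context spark --namespace=spark-cluster \
    --cluster=<your-cluster> --user=<your-user>
$ kubectl config use-context spark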

Deploy the Master Service

Create a ReplicationController to run the Spark Master service (spark-master-controller.yaml):
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: gcr.io/google_containers/spark:1.5.2_v1
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
$ kubectl create -f spark-master-controller.yaml
Then create the master Service:
spark-master-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-master
  namespace: spark-cluster
spec:
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: spark-master
$ kubectl create -f spark-master-service.yaml
Check that the Master is running properly:
$ kubectl get pod -n spark-cluster
spark-master-controller-qtwm8   1/1   Running   0   6d
$ kubectl logs spark-master-controller-qtwm8 -n spark-cluster
17/08/07 02:34:54 INFO Master: Registered signal handlers for [TERM, HUP, INT]
17/08/07 02:34:54 INFO SecurityManager: Changing view acls to: root
17/08/07 02:34:54 INFO SecurityManager: Changing modify acls to: root
17/08/07 02:34:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/08/07 02:34:55 INFO Slf4jLogger: Slf4jLogger started
17/08/07 02:34:55 INFO Remoting: Starting remoting
17/08/07 02:34:55 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
17/08/07 02:34:55 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/08/07 02:34:55 INFO Master: Starting Spark master at spark://spark-master:7077
17/08/07 02:34:55 INFO Master: Running Spark version 1.5.2
17/08/07 02:34:56 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/08/07 02:34:56 INFO MasterWebUI: Started MasterWebUI at http://10.2.6.12:8080
17/08/07 02:34:56 INFO Utils: Successfully started service on port 6066.
17/08/07 02:34:56 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/08/07 02:34:56 INFO Master: I have been elected leader! New state: ALIVE
Once the master is created and running, we can inspect the state of the Spark cluster through Spark's built-in web UI. To reach it, we deploy a specialized proxy:
spark-ui-proxy-controller.yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-ui-proxy-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: spark-ui-proxy
  template:
    metadata:
      labels:
        component: spark-ui-proxy
    spec:
      containers:
        - name: spark-ui-proxy
          image: elsonrodriguez/spark-ui-proxy:1.0
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
          args:
            - spark-master:8080
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 120
            timeoutSeconds: 5
$ kubectl create -f spark-ui-proxy-controller.yaml
Expose it with a Service. The original guide uses the LoadBalancer type; here we use NodePort instead. If your Kubernetes cluster runs on a cloud provider, you can also follow the original approach.
spark-ui-proxy-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: spark-ui-proxy
  namespace: spark-cluster
spec:
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30080
  selector:
    component: spark-ui-proxy
  type: NodePort
$ kubectl create -f spark-ui-proxy-service.yaml
Once everything is deployed, you can use kubectl proxy to view the status of your Spark cluster:
$ kubectl proxy --port=8001
You can then browse to http://localhost:8001/api/v1/proxy/namespaces/spark-cluster/services/spark-master:8080/. This only works while kubectl proxy is running, but since we configured a NodePort above, the UI is also reachable on port 30080 of any node (for example http://10.201.2.34:30080).
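A quick sanity check from the shell; the node IP below is only an example, use one of your own nodes:

$ kubectl get svc spark-ui-proxy -n spark-cluster                        # should show 80:30080/TCP
$ curl -s -o /dev/null -w '%{http_code}\n' http://10.201.2.34:30080/     # 200 once the proxy can reach the master UI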

Deploy the Spark workers

Make sure the Master is running before continuing.
spark-worker-controller.yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
  namespace: spark-cluster
spec:
  replicas: 2
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: gcr.io/google_containers/spark:1.5.2_v1
          command: ["/start-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
$ kubectl create -f spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created
Check the status from the command line:
$ kubectl get pod -n spark-cluster
spark-master-controller-qtwm8     1/1   Running   0   6d
spark-worker-controller-4rxrs     1/1   Running   0   6d
spark-worker-controller-z6f21     1/1   Running   0   6d
spark-ui-proxy-controller-d4br2   1/1   Running   4   6d
You can also check this through the WebUI service created above. At this point the Spark cluster is essentially up and running.
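If you later need more capacity, the worker ReplicationController can simply be scaled; a minimal sketch (the replica count here is arbitrary), and new workers started from the same image register with spark-master automatically:

$ kubectl scale rc spark-worker-controller --replicas=4 -n spark-cluster
$ kubectl get pod -n spark-cluster -l component=spark-worker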

Create the Zeppelin UI

The Zeppelin UI lets us run jobs directly from a web notebook; see the Zeppelin UI and Spark architecture documentation for details.
zeppelin-controller.yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: zeppelin-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: zeppelin
  template:
    metadata:
      labels:
        component: zeppelin
    spec:
      containers:
        - name: zeppelin
          image: gcr.io/google_containers/zeppelin:v0.5.6_v1
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
$ kubectl create -f zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created
Then deploy a Service for it in the same way:
zeppelin-service.yaml
kind: Service
apiVersion: v1
metadata:
  name: zeppelin
  namespace: spark-cluster
spec:
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 30081
  selector:
    component: zeppelin
  type: NodePort
$ kubectl create -f zeppelin-service.yaml
Note that the NodePort here is 30081, so the Zeppelin UI can likewise be reached on port 30081 of any node.
You can also access pyspark from the command line (remember to replace the pod name with your own):
$ kubectl exec -it zeppelin-controller-8f14f -n spark-cluster pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
17/08/14 01:59:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
From here you can start using the Spark services. Corrections are welcome if anything here is wrong.

Common Zeppelin issues

  • The Zeppelin image is very large, so pulling it takes a while; the image-size problem is being worked on, see issue #17231 for details.
  • On GKE, kubectl port-forward can be somewhat unstable. If you see Zeppelin report its status as Disconnected, the port-forward has probably failed and needs to be restarted (a sketch of the command follows below); see #12179 for details.
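A sketch of that port-forward, reusing the zeppelin pod name from the example above (yours will differ):

$ kubectl port-forward zeppelin-controller-8f14f 8080:8080 -n spark-cluster
# then open http://localhost:8080; rerun the command if the forward drops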
