Spark

Since Kubernetes v1.8, native support for Apache Spark applications has been available (this requires a Spark release with Kubernetes support, e.g. v2.3). You can submit Spark jobs to a Kubernetes cluster directly with the spark-submit command. Here's an example that computes Pi:

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --kubernetes-namespace default \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar
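
In cluster mode the driver itself runs as a pod in the target namespace (default in this example), so you can watch the driver and executor pods come up with:

$ kubectl get pods -n default -w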

Or, the Python version:

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --kubernetes-namespace <k8s-namespace> \
  --conf spark.executor.instances=5 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
  --py-files local:///opt/spark/examples/src/main/python/sort.py \
  local:///opt/spark/examples/src/main/python/pi.py 10

Deploying Spark on Kubernetes

A detailed walkthrough for deploying Spark is provided in the Kubernetes examples on GitHub. The instructions below simplify some of those steps for an easier installation.

Deployment Prerequisites

Creating a Namespace

namespace-spark-cluster.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: "spark-cluster"
  labels:
    name: "spark-cluster"

$ kubectl create -f examples/staging/spark/namespace-spark-cluster.yaml
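
You can confirm that the namespace was created:

$ kubectl get namespace spark-cluster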

For simplicity, we will not switch the kubectl context to spark-cluster. Instead, we will specify the spark-cluster namespace explicitly in each of the manifests and commands that follow.
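
If you would rather switch contexts instead, a minimal sketch looks like this (the context name is arbitrary; read your cluster and user names from kubectl config view):

$ kubectl config set-context spark --namespace=spark-cluster \
    --cluster=<your-cluster> --user=<your-user>
$ kubectl config use-context spark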

Deploying the Master Service

Create a replication controller to run the Spark Master service.

spark-master-controller.yaml

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: gcr.io/google_containers/spark:1.5.2_v1
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m

$ kubectl create -f spark-master-controller.yaml

Create the master service.

spark-master-service.yaml

kind: Service
apiVersion: v1
metadata:
  name: spark-master
  namespace: spark-cluster
spec:
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      name: http
  selector:
    component: spark-master

$ kubectl create -f spark-master-service.yaml
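
You can verify that the Service exists and has picked up the Master pod as an endpoint:

$ kubectl get svc spark-master -n spark-cluster
$ kubectl get endpoints spark-master -n spark-cluster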

Check if the Master is running properly.

$ kubectl get pod -n spark-cluster
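
Once the pod is Running, the Master logs should confirm that it has started (the selector matches the component label set on the controller above):

$ kubectl logs -l component=spark-master -n spark-cluster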

To observe the Spark cluster through Spark's built-in web UI, deploy a dedicated proxy.
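
If you only need a quick look at the Master UI, a simple alternative is to port-forward to the Master pod (substitute the pod name reported by kubectl get pod):

$ kubectl port-forward -n spark-cluster <spark-master-pod-name> 8080:8080
# then open http://localhost:8080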

Once the Master is confirmed to be up and running, deploy the Spark workers; a sketch of a worker controller follows below. Also create the Zeppelin UI, which lets you run tasks directly on the cluster through a web notebook (see Zeppelin UI and Spark architecture).
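
The following worker controller is a minimal sketch modeled on the master controller above: it uses the same image, and the /start-worker entrypoint mirrors the master's /start-master convention; adjust replicas and resources as needed.

spark-worker-controller.yaml

kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
  namespace: spark-cluster
spec:
  replicas: 2
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: gcr.io/google_containers/spark:1.5.2_v1
          command: ["/start-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m

$ kubectl create -f spark-worker-controller.yaml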

Once these components are deployed, the Spark cluster is up and running.

Common Issues with Zeppelin

  • The Zeppelin image is quite large and takes some time to pull. Details at issue #17231.

  • kubectl port-forward may be unstable on the GKE platform and may need to be restarted, as shown below. See issue #12179 for reference.
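
A restart simply means re-running the forward; assuming Zeppelin exposes its notebook on container port 8080 as in the upstream example, that looks roughly like this (pod name illustrative):

$ kubectl port-forward -n spark-cluster <zeppelin-pod-name> 8080:8080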
