Adding Spark Support

Integration with the Spark Operator has been deprecated since 2.2.0, and will be removed in version 3.x. Spark integration has moved to the Cloudflow-contrib project. Please see the Cloudflow-contrib getting started guide for instructions on how to use Spark Native Kubernetes integration. The documentation that follows describes the deprecated feature.
See Storage requirements (for use with Spark or Flink) to setup Storage Requirements for Spark applications.

If you plan to write applications that utilize Spark, you will need to install the Spark operator before deploying your Cloudflow application.

The only Cloudflow compatible Spark operator is a fork maintained by Lightbend. See Backported Fix for Spark 2.4.5 for more details.

Add the Spark Helm chart repository and update the local index.

helm repo add incubator https://charts.helm.sh/incubator
helm repo update

Before installing the Helm chart, we will need to prepare a value.yaml file, it overrides the default values in the Spark Helm chart.

cat > spark-values.yaml << EOF
rbac:
  create: true
serviceAccounts:
  sparkoperator:
    create: true
    name: cloudflow-spark-operator
  spark:
    create: true
    name: cloudflow-spark

enableWebhook: true
enableMetrics: true

controllerThreads: 10
installCrds: true
metricsPort: 10254
metricsEndpoint: "/metrics"
metricsPrefix: ""
resyncInterval: 30
webhookPort: 8080
sparkJobNamespace: ""
operatorVersion: latest
---
kind: Service
apiVersion: v1
metadata:
  name: cloudflow-webhook
  labels:
    app.kubernetes.io/name: sparkoperator
spec:
  selector:
    app.kubernetes.io/name: sparkoperator
EOF

Now we can install the Spark operator using Helm. Note that we use a specific version of the Spark operator Helm chart.

  helm upgrade -i spark-operator incubator/sparkoperator \
    --values="spark-values.yaml" \
    --namespace cloudflow
OpenShift requires a role-binding

If you are installing the Spark operator on OpenShift, you need to apply the following rolebinding to the cluster:

cat > spark-rolebinding.yaml << EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: system:openshift:scc:anyuid
  namespace: cloudflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:anyuid
subjects:
- kind: ServiceAccount
  name: cloudflow-spark
  namespace: cloudflow
EOF

To apply the spark-rolebinding.yaml file execute the following command. Note that this has only to be performed if you are installing Cloudflow on Openshift:

oc apply -f spark-rolebinding.yaml

After the Spark operator installation completes, we need to adjust its webhook configuration:

$ kubectl get pods -n cloudflow
NAME                                                READY   STATUS    RESTARTS   AGE
cloudflow-operator-6b7d7cbdfc-xb6w5                 1/1     Running   0          109s
spark-operator-sparkoperator-56d6ffc8cd-cln67       1/1     Running   0          109s

For this, we will create yet another YAML file that we will apply to the cluster with kubectl:

cat > spark-mutating-webhook.yaml << 'EOF2'
apiVersion: batch/v1
kind: Job
metadata:
  name: cloudflow-patch-spark-mutatingwebhookconfig
spec:
  template:
    spec:
      serviceAccountName: cloudflow-operator
      restartPolicy: OnFailure
      containers:
        - name: main
          image: alpine:3.12
          command:
            - /bin/ash
            - "-c"
            - |
              apk update && apk add wget
              wget -q -O /bin/kubectl https://storage.googleapis.com/kubernetes-release/release/v1.19.6/bin/linux/amd64/kubectl && chmod 755 /bin/kubectl
              NAME="spark-operator-sparkoperator"
              API_VERSION=$(kubectl get deployment -n cloudflow $NAME -o jsonpath='{.apiVersion}')
              UUID=$(kubectl get deployment -n cloudflow $NAME -o jsonpath='{.metadata.uid}')
              KIND=$(kubectl get deployment -n cloudflow $NAME -o jsonpath='{.kind}')
              HOOK_NAME="spark-operator-sparkoperator-webhook-config"
              JSON=$(cat <<EOF
              {
                "metadata": {
                  "ownerReferences": [
                    {
                      "apiVersion": "$API_VERSION",
                      "blockOwnerDeletion": true,
                      "controller": true,
                      "kind": "$KIND",
                      "name": "$NAME",
                      "uid": "$UUID"
                    }
                  ]
                }
              }
              EOF
              )
              echo $JSON
              kubectl patch MutatingWebhookConfiguration $HOOK_NAME -n cloudflow -p "$JSON"
EOF2

Now we can apply the changes to the webhook:

kubectl apply -f spark-mutating-webhook.yaml --namespace cloudflow

That completes the installation of the Spark operator. You can check the result with the following command:

$ kubectl get pods -n cloudflow
NAME                                                READY   STATUS      RESTARTS   AGE
cloudflow-operator-6b7d7cbdfc-xb6w5                 1/1     Running     0          3m29s
cloudflow-patch-spark-mutatingwebhookconfig-66xxb   0/1     Completed   0          21s
spark-operator-sparkoperator-56d6ffc8cd-cln67       1/1     Running     0          3m29s

If you plan to write applications that utilize Flink you will need to install the Flink operator before deploying your Cloudflow application.

Continue with Adding Flink Support.