What is this PR for?
The goal of this PR is to execute Spark notebooks on Kubernetes in cluster mode, so that the Spark driver runs inside the Kubernetes cluster. It is based on https://github.com/apache-spark-on-k8s/spark. Zeppelin uses spark-submit to start RemoteInterpreterServer, which executes notebooks on Spark. Kubernetes-specific spark-submit parameters (driver, executor, init container, and shuffle images) should be set in the SPARK_SUBMIT_OPTIONS environment variable. When the Spark interpreter is configured with a Kubernetes-specific Spark master URL (k8s://https....), RemoteInterpreterServer is launched inside a Spark driver pod on Kubernetes, so the Zeppelin server has to be able to connect to that remote server. Inside a Kubernetes cluster, the best solution for this is to create a Kubernetes service for RemoteInterpreterServer. This is the reason for SparkK8RemoteInterpreterManagerProcess, which extends RemoteInterpreterManagerProcess: it creates the Kubernetes service, maps the port of RemoteInterpreterServer in the driver pod, and connects to this service once the Spark driver pod is in the Running state.
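The connection flow described above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the code in this PR: the helper names (`is_k8s_master`, `build_service_manifest`, `wait_until_running`), the service name suffix, and the label used as the selector are all hypothetical. Only the `k8s://` prefix, the standard Kubernetes Service manifest layout, and the `Running` pod phase come from the description or from Kubernetes itself.

```python
import time
from typing import Callable, Dict

K8S_MASTER_PREFIX = "k8s://"  # master URL scheme mentioned in the description


def is_k8s_master(master_url: str) -> bool:
    """Return True when the interpreter's master URL targets Spark on Kubernetes."""
    return master_url.strip().startswith(K8S_MASTER_PREFIX)


def build_service_manifest(driver_pod_name: str, port: int) -> Dict:
    """Build a Service manifest that maps the interpreter port of the driver pod.

    Assumption: the "-ri-svc" suffix and the "spark-app-name" label are
    placeholders chosen for this sketch, not names used by the PR.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{driver_pod_name}-ri-svc"},
        "spec": {
            "selector": {"spark-app-name": driver_pod_name},
            "ports": [{"protocol": "TCP", "port": port, "targetPort": port}],
        },
    }


def wait_until_running(
    get_phase: Callable[[], str],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the driver pod phase until it is "Running" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_phase() == "Running":
            return True
        sleep(poll_s)
    return False
```

In such a design the Zeppelin server would only open its connection to the service after `wait_until_running` returns True; how the pod phase is actually obtained (for example through a Kubernetes client library) is left out here on purpose.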
Design considerations: As described in spark-interpreter-k8s.md, the Zeppelin server runs inside the Kubernetes cluster. We can choose where to run the Zeppelin server; the benefit of running it inside Kubernetes is that we don't have to deal with authentication. However, it is not enough to start only the Zeppelin server inside the Kubernetes cluster, because by default Zeppelin will start spark-submit in the same pod and run every Spark job locally. The scope of this PR is to run spark-submit (the apache-spark-on-k8s version), properly configured with Docker images etc., so that the Spark driver is started in a separate pod in the cluster and the Spark executors are started in separate pods as well. This way we can benefit from dynamic scaling of executors inside the Kubernetes cluster, while all scheduling, pod allocation, and resource management is done by the Kubernetes scheduler.
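As an illustration of what "properly configured with Docker images" might look like, here is a small Python sketch that assembles a value for SPARK_SUBMIT_OPTIONS. The function name is hypothetical, the image names are placeholders, and the configuration keys are written as recalled from the apache-spark-on-k8s documentation; they should be checked against the version of the fork in use. Passing `--deploy-mode cluster` through this variable is likewise an assumption of the sketch.

```python
import shlex
from typing import Dict, List


def build_spark_submit_options(conf: Dict[str, str]) -> str:
    """Join spark-submit flags into one shell-safe string.

    Assumption: cluster deploy mode is requested here, matching the
    "cluster mode" goal of the description.
    """
    args: List[str] = ["--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # shlex.join quotes any argument that contains shell-unsafe characters.
    return shlex.join(args)


# Placeholder image names; the keys are assumptions to be verified against
# the apache-spark-on-k8s documentation.
example_conf = {
    "spark.kubernetes.driver.docker.image": "example/spark-driver:tag",
    "spark.kubernetes.executor.docker.image": "example/spark-executor:tag",
    "spark.kubernetes.initcontainer.docker.image": "example/spark-init:tag",
}

if __name__ == "__main__":
    options = build_spark_submit_options(example_conf)
    print(f"export SPARK_SUBMIT_OPTIONS={shlex.quote(options)}")
```

The resulting string would be exported in the environment of the Zeppelin server before the Spark interpreter starts, so that spark-submit picks up the images for the driver, executor, and init container pods.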
Please see below how this is run and used:
The cluster:
The flow:
What type of PR is it?
Feature
What is the Jira issue?
- https://issues.apache.org/jira/browse/ZEPPELIN-3020
How should this be tested?
Unit and functional tests - running notebooks on Spark on K8S.
Questions:
- Do the license files need an update?
- Are there breaking changes for older versions?
- Does this need documentation?