Apache Spark is an open-source distributed computing framework. In a few lines of code (in Scala, Python, SQL, or R), data scientists or engineers define applications that can process large amounts of data, Spark taking care of parallelizing the work across a cluster of machines.
Spark itself doesn't manage these machines. It needs a cluster manager (also sometimes called scheduler). The main cluster-managers are:
- Standalone: Simple cluster-manager, limited in features, incorporated with Spark.
- Apache Mesos: An open source cluster-manager once popular for big data workloads (not just Spark) but in decline over the last few years.
- Hadoop YARN: The JVM-based cluster-manager of hadoop released in 2012 and most commonly used to date, both for on-premise (e.g. Cloudera, MapR) and cloud (e.g. EMR, Dataproc, HDInsight) deployments.
- Kubernetes: Spark runs natively on Kubernetes since version Spark 2.3 (2018). This deployment mode is gaining traction quickly as well as enterprise backing (Google, Palantir, Red Hat, Bloomberg, Lyft). As of June 2020 its support is still marked as experimental though.
As the new kid on the block, there's a lot of hype around Kubernetes. In this article, we'll explain the core concepts of Spark-on-k8s and evaluate the benefits and drawbacks of this new deployment mode.
You can submit Spark apps using spark-submit or the using the spark-operator — the latter is our preference, but we'll talk about it in a future tutorial post. This request contains your full application configuration including the code and dependencies to run (packaged as a docker image or specified via URIs), the infrastructure parameters, (e.g. the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration.
Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a k8s abstraction, just a docker container in this case). The Spark driver can then directly talk back to the Kubernetes master to request executor pods, scaling them up and down at runtime according to the load if dynamic allocation is enabled. Kubernetes takes care of the bin-packing of the pods onto Kubernetes nodes (the physical VMs), and will dynamically scale the various node pools to meet the requirements.
To go a little deeper, the Kubernetes support of Spark relies mostly on the KubernetesClusterSchedulerBackend which lives in the Spark driver.
This class keeps track of the current number of registered executors, and the desired total number of executors (from a fixed-size configuration or from dynamic allocation). At periodic intervals (configured by spark.kubernetes.allocation.batch.delay), it will request the creation or deletion of executor pods, and wait for that request to complete before making other requests. Hence this class implements the "desired state principle" which is dear to Kubernetes fans, favoring declarative over imperative statements.
The Pros - Benefits of Spark on Kubernetes
This is the main motivation for using Kubernetes itself. The benefits of containerization in traditional software engineering apply to big data and Spark too. Containers make your applications more portable, they simplify the packaging of dependencies, they enable repeatable and reliable build workflows. They reduce the overall devops load and allow you to iterate on your code faster.
The top 3 benefits of using Docker containers for Spark:
1) Build your dependencies once, run everywhere (locally or at scale)
2) Make Spark more reliable and cost-efficient.
3) Speed up your iteration cycle by 10X (at Data Mechanics, our users regularly report bringing down their Spark dev workflow from 5 minutes or more to less than 30 seconds)
Our favorite benefit is definitely dependency management, since it's notoriously painful with Spark. You can choose to build a new docker image for each app, or to use a smaller set of docker images that package most of your needed libraries, and dynamically add your application-specific code on top. Say goodbye to long and flaky init scripts compiling C-libraries on each application launch.
2. Integration in a rich ecosystem
Deploying Spark on Kubernetes gives you powerful features for free such as the use of namespaces and quotas for multitenancy control, and role-based access control (optionally integrated with your cloud provider IAM) for fine-grained security and data access.
If you have a need outside the k8s scope, the community is very active and it's likely you'll find a tool to answer this need. This point is particularly strong if you already use Kubernetes for the rest of your stack as you may re-use your existing tooling, such as the k8s dashboard for basic logging and administration, and prometheus + grafana for monitoring.
3. Efficient resource sharing
On other cluster-managers (YARN, Standalone, Mesos) if you want to reuse the same cluster for concurrent Spark apps (for cost reasons), you'll have to compromise on isolation:
- Dependency isolation. These apps must use the same global Spark and python version.
- Performance isolation. If someone else kicks off a big job, my job is likely to run slower.
On the other hand, with dynamic allocation and cluster autoscaling correctly configured, Kubernetes will give you the cost benefits of a shared infrastructure and the full isolation of disjoint container sets. It takes about 10s for Kubernetes to remove an idle Spark executor from one app and allocate this capacity to another app.
Say goodbye to the complex load balancing, queues, and multitenancy tradeoffs of YARN deployments !
The Neutral - No impact on performance
We ran benchmarks that prove that there is no performance difference between running Spark on Kubernetes and running Spark on YARN. So you should focus on other criteria to make your decision between the two! Read our blog post for more details: Apache Spark Performance Benchmarks show Kubernetes has caught up with YARN
The blog post explains the setup of the benchmark, the results, as well as critical tips to maximize shuffle performance when running Spark on Kubernetes.
The Cons - Drawbacks of Spark on Kubernetes
1. Making Spark-on-k8s reliable at scale requires build time and expertise
If you're new to Kubernetes, the new language, abstractions and tools it introduces can be frightening and take you away from your core mission.
And even if you already have expertise on Kubernetes, there's a lot to build:
- Create and configure the Kubernetes cluster and its node pools
- Setup the spark-operator and k8s autoscaler (optional, but recommended)
- Setup a docker registry and create a process to package your dependencies
- Setup a Spark History Server (to see the Spark UI after an app has completed)
- Setup your logging, monitoring, and security tools
- Optimize application configurations and I/O for Kubernetes
This is why we built Data Mechanics - to take care of all the setup and make Spark on Kubernetes easy-to-use and cost-effective. See How Data Mechanics Improves on Spark on Kubernetes open-source to learn more about what our platform offers on top of open-source.
2. Dynamic allocation limitation caused by the shuffle architecture
Shuffles are the expensive all-to-all communication steps that take place in Spark. Executors (on the map side) produce shuffle files on local disk that will later be fetched by other executors (on the reduce side). If a mapper executor is lost, the associated shuffle files are lost and the map task will be need to be recomputed.
Using YARN, shuffle files can be stored in an external shuffle service, such that when dynamic allocation is enabled, the mapper executor can be safely removed on a downscaling event without losing the precious shuffle files. On Kubernetes, an external shuffle service does not exist yet. As a result, dynamic allocation must operate with one additional constraint: executors holding active shuffle files are exempt from downscaling. This mechanism is called "Dynamic Allocation With Shuffle Tracking", and it has been working really well since Spark 3.0. Learn more in our guide Setting up, Managing & Monitoring Spark on Kubernetes.
Update (November 2020): With the upcoming version of Spark (Spark 3.1, to be released in December 2020), a new mechanism will let Spark move shuffle files when an executor is going away due to dynamic allocation. This means the constraint described here will be removed. In fact this feature is much more powerful, as it will Spark resilient to Spot kills as well (Spark will be able to use the 30 seconds termination notice sent by the cloud provider before a spot kill happens, to move the shuffle file around). So stay tuned!
Conclusion - Should You Get Started ?
Traditional software engineering has shifted towards cloud-native containerization over the past few years, and it's undeniable a similar shift is happening for big data workloads. In fact, with the upcoming version of Spark (3.1, to be released in December 2020), Spark on Kubernetes will be marked as Generally Available and production ready in the official docs.
Does it mean that data teams should become Kubernetes experts? Not at all. We've built Data Mechanics precisely for that reason. To make Spark on Kubernetes absolute setup- and maintenance-free. We've helped many customers run Spark on Kubernetes, both for new Spark projects or as part of a migration from a YARN-based infrastructure. Our platform improves on top of the open-source version by adding intuitive user interfaces, notebook and scheduler integrations, and dynamic optimizations to control your costs.
We'd love to help! Book a demo with us to get started.