The Pros and Cons of Running Apache Spark on Kubernetes

May 26, 2020

Apache Spark is an open-source distributed computing framework. In a few lines of code (in Scala, Python, SQL, or R), data scientists and engineers define applications that process large amounts of data, while Spark takes care of parallelizing the work across a cluster of machines.

Spark itself doesn't manage these machines. It needs a cluster manager (sometimes also called a scheduler). The main cluster managers are:

  • Standalone: A simple cluster manager, limited in features, shipped with Spark.
  • Apache Mesos: An open-source cluster manager once popular for big data workloads (not just Spark), but in decline over the last few years.
  • Hadoop YARN: The JVM-based cluster manager of Hadoop, released in 2012 and the most commonly used to date, both for on-premise (e.g. Cloudera, MapR) and cloud (e.g. EMR, Dataproc, HDInsight) deployments.
  • Kubernetes: Spark has run natively on Kubernetes since version 2.3 (2018). This deployment mode is quickly gaining traction as well as enterprise backing (Google, Palantir, Red Hat, Bloomberg, Lyft). As of June 2020, though, its support is still marked as experimental.

As the new kid on the block, Kubernetes is getting a lot of hype. In this article, we'll explain the core concepts of Spark-on-k8s and evaluate the benefits and drawbacks of this new deployment mode.

Core Concepts

Apache Spark on Kubernetes Architecture

You can submit Spark apps using spark-submit, or using the spark-operator (the latter is our preference, and we'll cover it in a future tutorial post). This request contains your full application configuration: the code and dependencies to run (packaged as a docker image or specified via URIs), the infrastructure parameters (e.g. the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration.
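For illustration, here is what a bare-bones spark-submit against a Kubernetes cluster looks like; this is a minimal sketch, where the API server address and the registry are placeholders to adapt:

    ./bin/spark-submit \
        --master k8s://https://<kubernetes-api-server>:443 \
        --deploy-mode cluster \
        --name spark-pi \
        --class org.apache.spark.examples.SparkPi \
        --conf spark.executor.instances=5 \
        --conf spark.executor.memory=4g \
        --conf spark.executor.cores=2 \
        --conf spark.kubernetes.container.image=<registry>/spark:2.4.5 \
        local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar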

Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a k8s abstraction, just a docker container in this case). The Spark driver can then talk directly back to the Kubernetes master to request executor pods, scaling them up and down at runtime according to the load if dynamic allocation is enabled. Kubernetes takes care of bin-packing the pods onto Kubernetes nodes (the underlying virtual or physical machines), and will dynamically scale the various node pools to meet the demand.
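Since the driver and executors are plain pods, you can inspect a running app with standard kubectl commands. A quick sketch, where the namespace and pod name are placeholders (Spark automatically attaches spark-role labels to the pods it creates):

    # List the driver and executor pods of Spark apps in a namespace
    kubectl get pods -n <namespace> -l spark-role=driver
    kubectl get pods -n <namespace> -l spark-role=executor
    # Tail the logs of a specific driver pod
    kubectl logs -f <driver-pod-name> -n <namespace>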

To go a little deeper, Spark's Kubernetes support relies mostly on KubernetesClusterSchedulerBackend, a component that lives in the Spark driver.

This class keeps track of the current number of registered executors, and the desired total number of executors (from a fixed-size configuration or from dynamic allocation). At periodic intervals (configured by spark.kubernetes.allocation.batch.delay), it will request the creation or deletion of executor pods, and wait for that request to complete before making other requests. Hence this class implements the "desired state principle" which is dear to Kubernetes fans, favoring declarative over imperative statements.
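Both knobs are ordinary Spark configurations that you can override at submit time; the values shown below are the defaults:

    # Executor pods are requested in batches of up to batch.size pods,
    # waiting batch.delay between consecutive batches.
    --conf spark.kubernetes.allocation.batch.size=5
    --conf spark.kubernetes.allocation.batch.delay=1s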

The Pros - Benefits of Spark on Kubernetes

1. Containerization

This is the main motivation for using Kubernetes itself. The benefits of containerization in traditional software engineering apply to big data and Spark too. Containers make your applications more portable, simplify the packaging of dependencies, and enable repeatable, reliable build workflows. They reduce the overall DevOps load and let you iterate on your code faster.

Our favorite benefit is definitely dependency management, since it's notoriously painful with Spark. You can choose to build a new docker image for each app, or to use a smaller set of docker images that package most of your needed libraries, and dynamically add your application-specific code on top. Say goodbye to long and flaky init scripts compiling C libraries on each application launch.
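The Spark distribution ships with a helper script to build and publish these images. A minimal sketch, assuming you run it from the root of a Spark 2.4.x distribution and replace <registry> with your own:

    # Build the JVM image, plus a PySpark image from the bundled Dockerfile
    ./bin/docker-image-tool.sh -r <registry> -t 2.4.5 build
    ./bin/docker-image-tool.sh -r <registry> -t 2.4.5 \
        -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
    # Push the resulting images to your registry
    ./bin/docker-image-tool.sh -r <registry> -t 2.4.5 push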

2. Integration with a rich ecosystem

Source: The State Of The Kubernetes Ecosystem by Amiram Shachar at Spot.io


Deploying Spark on Kubernetes gives you powerful features for free, such as namespaces and quotas for multitenancy control, and role-based access control (optionally integrated with your cloud provider's IAM) for fine-grained security and data access.
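For example, giving each team its own namespace with a hard resource cap takes two commands; the names and limits below are hypothetical:

    # Create an isolated namespace for a team and cap its total resource usage
    kubectl create namespace team-data-science
    kubectl create quota team-ds-quota -n team-data-science --hard=cpu=200,memory=800Gi
    # Spark apps can then target it at submit time with:
    # --conf spark.kubernetes.namespace=team-data-science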

If you have a need outside the scope of Kubernetes itself, the community is very active and it's likely you'll find a tool that answers it. This point is particularly strong if you already use Kubernetes for the rest of your stack, as you can re-use your existing tooling, such as the k8s dashboard for basic logging and administration, and Prometheus + Grafana for monitoring.

3. Efficient resource sharing

On other cluster managers (YARN, Standalone, Mesos), if you want to reuse the same cluster for concurrent Spark apps (for cost reasons), you'll have to compromise on isolation:

  • Dependency isolation. These apps must use the same global Spark and Python versions.
  • Performance isolation. If someone else kicks off a big job, your job is likely to run slower.

On the other hand, with dynamic allocation and cluster autoscaling correctly configured, Kubernetes gives you the cost benefits of a shared infrastructure and the full isolation of disjoint sets of containers. It takes about 10 seconds for Kubernetes to remove an idle Spark executor from one app and allocate this capacity to another app.

Say goodbye to the complex load balancing, queues, and multitenancy tradeoffs of YARN deployments!
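Concretely, two apps with incompatible runtimes can share the same cluster simply by pointing each submission at its own image; the image names below are hypothetical:

    # App A: legacy job pinned to Spark 2.4 and Python 2.7
    spark-submit ... --conf spark.kubernetes.container.image=<registry>/spark-py:2.4.5-py2.7
    # App B: new job on Spark 3.0 and Python 3.7, running on the same cluster
    spark-submit ... --conf spark.kubernetes.container.image=<registry>/spark-py:3.0.0-py3.7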

The Neutral - No impact on performance

We ran benchmarks showing that there is no performance difference between running Spark on Kubernetes and running Spark on YARN. So you should focus on other criteria to make your decision between the two! Read our blog post for more details: Apache Spark Performance Benchmarks show Kubernetes has caught up with YARN.

That blog post explains the setup of the benchmark, the results, and critical tips to maximize shuffle performance when running Spark on Kubernetes.

The Cons - Drawbacks of Spark on Kubernetes

1. Making Spark-on-k8s reliable at scale requires build time and expertise

If you're new to Kubernetes, the new language, abstractions, and tools it introduces can be daunting and take you away from your core mission. And even if you already have expertise in Kubernetes, there's just a lot to build:

  • Create and configure the Kubernetes cluster and its node pools
  • Set up the spark-operator and the k8s cluster autoscaler (optional, but recommended; see the sketch below)
  • Set up a docker registry and create a process to package your dependencies
  • Set up a Spark History Server (to see the Spark UI after an app has completed)
  • Set up your logging, monitoring, and security tools
  • Optimize application configurations and I/O for Kubernetes

These problems are mostly unaddressed by the managed Kubernetes offerings from the cloud providers.
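To give a sense of what one of these steps looks like: installing the open-source spark-operator is typically a couple of Helm commands. The chart location below matches the GoogleCloudPlatform/spark-on-k8s-operator project at the time of writing; check its README for current instructions:

    # Install the spark-operator in its own namespace (Helm 3)
    helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
    helm install spark-operator spark-operator/spark-operator \
        --namespace spark-operator --create-namespace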

2. Dynamic allocation limitation caused by the shuffle architecture

How Shuffles Work in Spark

Shuffles are the expensive all-to-all communication steps that take place in Spark. Executors (on the map side) produce shuffle files on local disk that will later be fetched by other executors (on the reduce side). If a mapper executor is lost, the associated shuffle files are lost too, and the map tasks will be re-scheduled on another executor, which hurts performance. With YARN, shuffle files can be stored in an external shuffle service, such that when dynamic allocation is enabled, a mapper executor can be safely removed on a downscaling event without losing the precious shuffle files.

For complex reasons (they will be the subject of a future post), the same architecture is not possible on Kubernetes. As a result, dynamic allocation must operate with one additional constraint: executors holding active shuffle files (as tracked by the Spark driver) are exempt from downscaling. This "soft dynamic allocation" is available since the Spark 3.0 preview releases, and it has successfully mitigated this problem for our customers.
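Enabling this "soft dynamic allocation" boils down to two settings: turning on dynamic allocation itself, and turning on shuffle tracking so the driver knows which executors hold live shuffle files (available in Spark 3.0):

    --conf spark.dynamicAllocation.enabled=true
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true
    # Optional: how long an executor holding shuffle data may sit idle
    # before becoming eligible for removal (default: never)
    --conf spark.dynamicAllocation.shuffleTracking.timeout=<duration>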

There is exciting ongoing work to go further and build a shuffle architecture where compute resources (k8s nodes and containers) are entirely separate from temporary shuffle data storage. Once completed, this work will go much further than the external shuffle service, as it will enable "hard dynamic allocation" and make Spark resilient to the sudden loss of executors (a frequent problem when using spot/preemptible virtual machines).

Conclusion - Should You Get Started?

As we operate a Spark platform deployed on Kubernetes, we are biased in this answer. Traditional software engineering has radically shifted towards containerization in recent years, a shift that brought in a lot of good practices. We think a similar shift will happen for big data, and hence we believe Spark-on-Kubernetes is the future of Spark.

Does this mean that data teams should become Kubernetes experts? Not at all. We built Data Mechanics precisely for that reason. Our platform addresses the main drawback we've outlined earlier: we've done all the setup work, stitching together the best open-source software to build the Spark platform you'd build for yourself, and adding powerful optimizations on top, so that data scientists and engineers can focus on their data while we handle the mechanics.

Whether you're thinking of kicking off a new Spark project, or of revisiting your existing Spark architecture to simplify your operations, we'd love to help! Use our "Book a Demo" link to schedule a call with our team, or write us at contact@datamechanics.co

