Apache Spark on Kubernetes
Cloud-Native Spark: The Easy Way

A Cloud-Native architecture

Apache Spark on Kubernetes

A Kubernetes cluster consists of a set of nodes on which you can run containerized Apache Spark applications (as well any other containerized workloads). Each Spark app is fully isolated from the others and packages its own version of Spark and dependencies within a Docker image.
‍
When you submit a Spark app, it starts a Spark driver pod (a Docker container, to put it simply) on the Kubernetes cluster. The driver pod and Kubernetes directly talk to each other to start Spark executor pods. The start and removal of executors is automated based on load when dynamic allocation is enabled.

Learn More

A thriving DEVELOPER community

Spark-on-Kubernetes now Generally Available

Since its initial release in Spark 2.3, the Spark-on-Kubernetes community has been thriving behind both starts and enterprises adoptions. The community led the development of key features such as volume mounts, dynamic allocation, and graceful handling of node shutdown.

As a result of these features, the Spark-on-Kubernetes project will officially be marked as Generally Available and production ready as of Spark 3.1.

Learn More

Timeline of improvements to Spark on Kubernetes

Why Choose Spark on Kubernetes?

The advantages of working with Apache Spark on Kubernetes

Spark on k8s offers many advantages for companies big and small who want to simplify and speed up their development workflows in addition to dramatically reducing their costs.

1. Containerized

Use Docker to package your dependencies.

Build it once, and run it everywhere: locally or in the cloud, in dev or in prod environments. Your application is portable and fully isolated from other workloads running at the same time.

2. Developer-friendly

Kubernetes offers a rich ecosystem of developer tools and solutions that simplify your data operations on a day-to-day basis.

You won't need to learn about the obscure intricacies of Hadoop YARN, or of an opaque proprietary vendor solutions.

3. Cost Effective

Keep your infrastructure costs down by running all your apps on a shared cluster.

The cluster can automatically scale up and down based on load, and use a mix of spot/on-demand nodes and heterogeneous instance types to reduce your cloud bill.

4. Fast

Speed up your iteration cycle by 10X.

Kubernetes can start and scale applications in a matter of seconds. You won't need to wait for slow virtual machine setup, for YARN overhead, or for slow bootstrap scripts to run.

Person sitting on the floor with a laptop on their knees.

5. Reliable & Secure

Bring in the DevOps best practices in your data teams.
‍
Kubernetes enable repeatable and reliable workflows, and enables gold-standard security best practices around networking, data access permissions, and user ACLs.

Person in front of world globe with multiple locations.

6. Cloud and Vendor Agnostic

Build on top of a standard open-source technology and avoid lock in.
‍
Kubernetes lets you easily switch between cloud and on-premise platforms, and avoid expensive alternatives by vendors who will lock you in with proprietary technologies.

Take A Deeper Dive

Apache Spark 3.2 Release: Main Features and What's New for Spark-on-Kubernetes

Monday, October 25, 2021

Apache Spark 3.2 is now released and available on our platform. Spark 3.2 bundles Hadoop 3.3.1, Koalas (for Pandas users) and RocksDB (for Streaming users). For Spark-on-Kubernetes users, Persistent Volume Claims (k8s volumes) can now "survive the death" of their Spark executor and be recovered by Spark, preventing the loss of precious shuffle files!

Tutorial: Run your R (SparklyR) workloads at scale with Spark-on-Kubernetes

Tuesday, October 19, 2021

A step-by-step tutorial to help you run R applications with Spark on a Kubernetes cluster using the SparklyR library. We'll go through building a compatible Docker image, building the code of the SparlyR application itself, and deploying it on Data Mechanics.

Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available

Monday, March 8, 2021

With the Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is now officially declared as production-ready and Generally Available. This is the achievement of 3 years of booming community contribution and adoption of the project - since initial support for Spark-on-Kubernetes was added in Spark 2.3 (February 2018). In this article, we will go over the main features of Spark 3.1, with a special focus on the improvements to Kubernetes.

A Cloud-Native architecture

Apache Spark on Kubernetes

A thriving DEVELOPER community

Spark-on-Kubernetes now Generally Available

Why Choose Spark on Kubernetes?

The advantages of working with Apache Spark on Kubernetes

1. Containerized

2. Developer-friendly

3. Cost Effective

4. Fast

5. Reliable & Secure

6. Cloud and Vendor Agnostic

Take A Deeper Dive

Apache Spark 3.2 Release: Main Features and What's New for Spark-on-Kubernetes

Tutorial: Run your R (SparklyR) workloads at scale with Spark-on-Kubernetes

Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available

About Us

NAvigation

Resources

A Cloud-Native architecture

Apache Spark on Kubernetes

A thriving DEVELOPER community

Spark-on-Kubernetes now Generally Available

Why Choose Spark on Kubernetes?

The advantages of working with Apache Spark on Kubernetes

1. Containerized

2. Developer-friendly

3. Cost Effective

4. Fast

5. Reliable & Secure

6. Cloud and Vendor Agnostic

Take A Deeper Dive

Apache Spark 3.2 Release: Main Features and What's New for Spark-on-Kubernetes

Tutorial: Run your R (SparklyR) workloads at scale with Spark-on-Kubernetes

Apache Spark 3.1 Release: Spark on Kubernetes is now Generally Available

About Us

NAvigation

Resources

Legal

Stay connected