How We Built A Serverless Spark Platform On Kubernetes - Video Tour Of Data Mechanics

September 8, 2020

Our mission at Data Mechanics is to let data engineers and data scientists build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.

So we built a serverless Apache Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera and Hortonworks.

In this video, we give you a product tour of our platform and some of its core features:

  1. How to connect a Jupyter notebook to the platform and work with Apache Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration
  3. How to monitor logs and metrics for your Apache Spark app from our dashboard
  4. How to track the costs, stability, and performance of your jobs (recurring apps) over time

What makes Data Mechanics a Serverless Apache Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to make them stable and performant. Here are some of the parameters we tune:

  • The container sizes (memory, CPU): to keep your app stable (avoid OutOfMemory errors), to optimize the bin packing of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (memory-bound, CPU-bound, or I/O-bound).
  • The default number of partitions used by Apache Spark to increase its degree of parallelism.
  • The disk sizes, shuffle and I/O configurations to make sure data transfer phases run at their optimal speed.

Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in code or input data, so that your apps stay stable and performant over time without manual action from you.
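As a rough illustration of what these tuned parameters look like in practice, the sketch below maps them to standard Spark properties and renders them as `spark-submit` arguments. The property names are regular Spark settings, but the values are hypothetical, and the post does not say which properties the platform actually sets or how it chooses them, so treat the mapping as an assumption.

```python
# Hypothetical values for illustration only. The keys are standard Spark
# properties; which ones the platform tunes is an assumption.
tuned_conf = {
    "spark.executor.memory": "6g",          # container memory
    "spark.executor.cores": "4",            # container CPU
    "spark.sql.shuffle.partitions": "400",  # partitions for DataFrame shuffles
    "spark.default.parallelism": "400",     # default partitions for RDD operations
}


def to_spark_submit_args(conf):
    """Render a config dict as a flat list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


print(" ".join(to_spark_submit_args(tuned_conf)))
```

Setting these by hand for every application is the manual work that an automated tuner aims to remove.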

Data Mechanics and Tradelab Costs Cut In Half
Source: Our 2019 Spark Summit Presentation on How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:

  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider 

This model lets each app work in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.
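To give an intuition for application-level autoscaling, here is a simplified model of how a target executor count can follow the load. It is a sketch, not Spark's or Data Mechanics' actual implementation: the function name and default bounds are hypothetical, and in a real Spark app this behavior is controlled by properties such as `spark.dynamicAllocation.enabled`, `spark.dynamicAllocation.minExecutors`, and `spark.dynamicAllocation.maxExecutors`.

```python
import math


def target_executors(pending_tasks, running_tasks, cores_per_executor,
                     min_executors=0, max_executors=20, cpus_per_task=1):
    """Simplified model: request enough executors to run every pending or
    running task at once, clamped to the configured bounds."""
    tasks_per_executor = max(1, cores_per_executor // cpus_per_task)
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


# 40 outstanding tasks on 4-core executors: 10 executors are enough.
print(target_executors(pending_tasks=30, running_tasks=10, cores_per_executor=4))
# No work left: the app can scale down to its minimum.
print(target_executors(pending_tasks=0, running_tasks=0, cores_per_executor=4))
```

When many apps release executors this way, nodes become empty and the cluster-level autoscaler can return them to the cloud provider.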

Our cloud-native containerization

Data Mechanics is deployed on a Kubernetes cluster in our customers' cloud account, whereas most other platforms still run Spark on YARN, Hadoop's scheduler.

This deployment model has key benefits:

  • An airtight security model: our customers' sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use one of our pre-built, optimized Docker images for Apache Spark (Update: as of April 2021, these images are publicly available on Docker Hub!). Or they can build their own Docker images to package their dependencies in a reliable way. Learn more about the benefits of the resulting Docker-based developer workflow.
  • Integration with the rich tools from the Kubernetes ecosystem.
  • Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.

Kubernetes Ecosystem by Spotinst
Source: The State Of The Kubernetes Ecosystem

Our serverless pricing model

Competing data platforms base their pricing on server uptime. For each instance type, they charge you an hourly fee, whether or not the instance is actually used to run Spark apps. This puts the burden on Spark developers to manage their clusters efficiently and make sure they're not wasting resources due to over-provisioning or parallelism issues.

Data Mechanics Pricing Model Serverless (Spark on Kubernetes)

Instead, the Data Mechanics fee is based on the sum of the durations of all the Spark tasks (the units of work distributed by Spark, reported with millisecond accuracy). This means our platform only makes money when our users do real work. We don't make money:

  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)

As a result, Data Mechanics will aggressively scale down your apps when they're idle, so that we reduce your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs will cover, and typically exceed, the fee we charge for our services.
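To make the difference between the two pricing bases concrete, the sketch below compares the sum of task durations with the core-time that an uptime-based model would count. All numbers are made up, the function names are hypothetical, and no actual rates are used since the post does not state them; only the billing units are compared.

```python
def billable_task_seconds(task_durations_ms):
    """Task-based unit: the sum of all Spark task durations, in seconds."""
    return sum(task_durations_ms) / 1000.0


def uptime_core_seconds(num_cores, wall_clock_seconds):
    """Uptime-based unit: every provisioned core counts for the whole run."""
    return num_cores * wall_clock_seconds


# Hypothetical app: 8 cores kept up for 10 minutes, running three
# 2-minute tasks and one 8-minute straggler.
task_durations_ms = [120_000, 120_000, 120_000, 480_000]

task_seconds = billable_task_seconds(task_durations_ms)
core_seconds = uptime_core_seconds(num_cores=8, wall_clock_seconds=600)

print(f"task-based unit:   {task_seconds:.0f} s")
print(f"uptime-based unit: {core_seconds:.0f} core-seconds")
print(f"share of core-time spent in tasks: {task_seconds / core_seconds:.1%}")
```

In this made-up run, most of the provisioned core-time is spent waiting on the straggler, which is exactly the time a task-based fee does not count.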

Update (April 2021): To estimate what your Data Mechanics costs would be, install our open-source monitoring tool Delight (it works on top of any Spark platform: Databricks, EMR, Dataproc, CDH/HDP, or an open-source setup). More details are available in our blog post or on our GitHub page.

I'd like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we'll invite you to a shared Slack channel (we use Slack for our support and we're very responsive there). We'll send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice. Once we have these permissions, we'll deploy Data Mechanics and you'll be ready to get started using our docs.

There are other features which we didn't get to cover in this post, such as our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and with tools for machine learning model tracking and serving. So stay tuned and reach out if you're curious to learn more.

Jean-Yves Stephan




