How We Built A Serverless Spark Platform On Kubernetes - Video Tour Of Data Mechanics

September 8, 2020

Our mission at Data Mechanics is to let data engineers and data scientists build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.

So we built a serverless Apache Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera, and Hortonworks.

In this video, we give you a product tour of our platform and some of its core features:

  1. How to connect a Jupyter notebook to the platform and work with Apache Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration
  3. How to monitor logs and metrics for your Apache Spark app from our dashboard
  4. How to track the costs, stability, and performance of your recurring jobs over time
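To illustrate point 2, programmatic submission typically amounts to POSTing a JSON job description to the platform's API. The sketch below is illustrative only: the field names and values are hypothetical, not the real API schema (see our docs for that).

```python
# Hypothetical sketch of a programmatic Spark job submission payload.
# Field names and values are illustrative, not the actual API schema.
import json

payload = {
    "appName": "daily-aggregation",
    # Custom Docker image holding the app's dependencies (placeholder name)
    "image": "<registry>/my-spark-image:latest",
    "mainApplicationFile": "local:///opt/app/main.py",
    "sparkConf": {
        "spark.executor.instances": "4",
    },
}

# Serialize the payload as it would be sent in the POST body.
body = json.dumps(payload)
```

The same JSON description can be templated from an Airflow DAG, which is how the Airflow integration mentioned above is typically used.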

What makes Data Mechanics a Serverless Apache Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to keep them stable and performant. Here are some of the parameters we tune:

  • The container sizes (memory, CPU) - to keep your app stable (avoid OutOfMemory errors), to optimize the bin-packing of containers on your nodes, and to boost performance by acting on your app's bottleneck (memory-bound, CPU-bound, or I/O-bound)
  • The default number of partitions used by Apache Spark to increase its degree of parallelism.
  • The disk sizes and the shuffle and I/O configurations, to make sure data transfer phases run at their optimal speed.
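For reference, the parameters listed above map to standard Spark configuration keys. A manually tuned configuration might look like the sketch below; the values are purely illustrative (on our platform they are chosen automatically per application):

```python
# Illustrative Spark settings covering the parameters listed above.
# Values are examples, not recommendations.
tuned_conf = {
    # Container sizes: executor memory and CPU
    "spark.executor.memory": "7g",
    "spark.executor.memoryOverhead": "1g",
    "spark.executor.cores": "4",
    # Default number of partitions (degree of parallelism)
    "spark.default.parallelism": "200",
    "spark.sql.shuffle.partitions": "200",
    # Shuffle and I/O tuning
    "spark.shuffle.file.buffer": "1m",
    "spark.local.dir": "/tmp/spark-scratch",
}
```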

Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in your code or input data, so that your apps stay stable and performant over time, without manual action from you.

Data Mechanics and Tradelab Costs Cut In Half
Source: Our 2019 Spark Summit Presentation on How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:

  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider 
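At the application level, dynamic allocation is a standard Spark feature. A minimal configuration for Spark on Kubernetes might look like the following (the executor bounds are illustrative):

```python
# Minimal dynamic allocation settings for Spark on Kubernetes.
# Without an external shuffle service, shuffle tracking (Spark 3.0+)
# lets Spark decommission executors safely. Bounds are illustrative.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Release executors that have been idle for this long
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}
```

When executors are released, their nodes become empty, which is what lets the cluster-level autoscaler hand them back to the cloud provider.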

This model lets each app run in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.

Our cloud-native containerization

Data Mechanics is deployed on a Kubernetes cluster in our customers' cloud accounts (while most other platforms still run Spark on YARN, Hadoop's scheduler).

This deployment model has key benefits:

  • An airtight security model: our customers' sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use our set of pre-built, optimized Spark Docker images, or build their own images to package their dependencies in a reliable way. Learn more about using custom Docker images on Data Mechanics.
  • Integration with the rich tools from the Kubernetes ecosystem.
  • Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.
Kubernetes Ecosystem by Spotinst
Source: The State Of The Kubernetes Ecosystem
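To make the deployment model concrete, running Spark natively on Kubernetes uses the `k8s://` master URL and a Docker image, as in the standard `spark-submit` invocation sketched below (the API server host, namespace, and image name are placeholders):

```python
# Sketch of a spark-submit command targeting a Kubernetes cluster.
# The API server host, namespace, and image name are placeholders.
spark_submit_args = [
    "spark-submit",
    "--master", "k8s://https://<api-server-host>:443",
    "--deploy-mode", "cluster",
    "--conf", "spark.kubernetes.namespace=spark-apps",
    "--conf", "spark.kubernetes.container.image=<registry>/spark-py:3.0.1",
    "--conf", "spark.executor.instances=2",
    "local:///opt/spark/examples/src/main/python/pi.py",
]

command = " ".join(spark_submit_args)
```

On our platform you don't run `spark-submit` yourself; the platform drives the Kubernetes scheduling for you, but this is what happens under the hood.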

Our serverless pricing model

Competing data platforms base their pricing on server uptime. For each instance type, they'll charge you an hourly fee, whether or not the instance is actually used to run Spark apps. This puts the burden on Spark developers to efficiently manage their clusters and make sure they're not wasting resources due to over-provisioning or parallelism issues.

Data Mechanics Pricing Model Serverless (Spark on Kubernetes)

Instead, the Data Mechanics fee is based on the sum of the durations of all Spark tasks (the units of work distributed by Spark, reported with millisecond accuracy). This means our platform only makes money when our users do real work. We don't make money:

  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application's resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)

As a result, Data Mechanics aggressively scales down your apps when they're idle, reducing your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs typically cover or exceed the fee we charge for our services.
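The pricing model above can be made concrete with a toy computation. Assuming a hypothetical per-task-hour rate (the real pricing is not stated here), the fee is simply a function of the sum of all task durations, so idle executor time contributes nothing:

```python
# Toy illustration of task-duration-based pricing.
# The rate below is hypothetical, not Data Mechanics' actual pricing.
RATE_PER_TASK_HOUR = 0.05  # hypothetical dollars per hour of task time

# Durations (in milliseconds) of the Spark tasks in one app,
# as reported in the Spark event log.
task_durations_ms = [12_000, 45_500, 3_250, 90_000]

total_task_hours = sum(task_durations_ms) / 1000 / 3600
fee = total_task_hours * RATE_PER_TASK_HOUR
# Executor idle time (e.g. a notebook left open overnight)
# never appears in this sum, so it is never billed.
```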

I'd like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we'll invite you to a shared Slack channel - we use Slack for our support and we're very responsive there. We'll send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice, and once we have these permissions we'll deploy Data Mechanics and you'll be ready to get started using our docs.

There are other features we didn't get to cover in this post -- like our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and with machine learning model tracking and serving tools. So stay tuned, and reach out if you're curious to learn more.

Jean-Yves Stephan
