September 8, 2020

Why we built a serverless Spark Platform

Our mission at Data Mechanics is to let data engineers and data scientists build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.

So we built a serverless Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera and Hortonworks.

In this video, we give you a product tour of our platform and some of its core features:

  1. How to connect a Jupyter notebook to the platform and play with Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration
  3. How to monitor logs and metrics for your Spark app from our dashboard
  4. How to track the costs, stability, and performance of your jobs (recurring apps) over time

What makes Data Mechanics a Serverless Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to make them stable and performant. Here are some parameters we tune:

  • The container sizes (memory, CPU) - to keep your app stable (avoid OutOfMemory errors), to optimize the bin packing of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (memory-bound, CPU-bound, or I/O-bound)
  • The default number of partitions used by Spark to increase its degree of parallelism.
  • The disk sizes, shuffle and I/O configurations to make sure data transfer phases run at their optimal speed.

Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in code or input data, so that your apps stay stable and performant over time, without manual action from you.
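To make the parameters above concrete, here is a minimal sketch of the kind of settings involved. The configuration keys are standard Apache Spark properties, but the function `sketch_spark_conf`, its inputs, and the heuristics (25% memory headroom, three partitions per core) are assumptions made up for illustration. This is not the Data Mechanics tuning algorithm, which learns from past runs.

```python
import math


def sketch_spark_conf(peak_executor_memory_gib, executor_cores, num_executors,
                      headroom=1.25, partitions_per_core=3):
    """Illustrative rule-of-thumb sizing (hypothetical helper, not the real autotuner).

    peak_executor_memory_gib: peak memory observed per executor on a past run.
    headroom: safety factor applied to that peak to avoid OutOfMemory errors.
    partitions_per_core: a few tasks per core is a common rule of thumb.
    """
    memory_gib = math.ceil(peak_executor_memory_gib * headroom)
    partitions = executor_cores * num_executors * partitions_per_core
    return {
        # Container size (memory, CPU)
        "spark.executor.memory": f"{memory_gib}g",
        "spark.executor.cores": str(executor_cores),
        # Default number of partitions (RDD and SQL shuffles)
        "spark.default.parallelism": str(partitions),
        "spark.sql.shuffle.partitions": str(partitions),
    }


# Hypothetical past run: executors peaked at 6.2 GiB, 10 executors of 4 cores each.
conf = sketch_spark_conf(peak_executor_memory_gib=6.2, executor_cores=4, num_executors=10)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each printed `key=value` pair can be passed to Spark as a `--conf` option. The point of automated tuning is that these values get recomputed as the code and input data change, instead of being set once by hand.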

Data Mechanics and Tradelab Costs Cut In Half
Source: Our 2019 Spark Summit Presentation on How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:

  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider 

This model lets each app work in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.
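Application-level autoscaling relies on Spark's dynamic allocation. As a rough sketch, the settings involved look like the following. The property names are standard Apache Spark properties (shuffle tracking requires Spark 3.0 or later), while the helper functions and the chosen bounds are assumptions for illustration, not the values our platform applies.

```python
def dynamic_allocation_conf(min_executors, max_executors, idle_timeout_s=60):
    """Build Spark properties for dynamic allocation (hypothetical helper)."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("expected 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # On Kubernetes there is usually no external shuffle service, so Spark
        # tracks which executors still hold shuffle data before removing them.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are released.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


def to_submit_flags(conf):
    """Turn a dict of properties into spark-submit style --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Assumed bounds: scale between 0 and 20 executors depending on load.
flags = to_submit_flags(dynamic_allocation_conf(min_executors=0, max_executors=20))
print(" ".join(flags))
```

When executors are released, the cluster-level autoscaler can then remove the nodes they were running on, which is what keeps idle capacity from accumulating.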

Our cloud-native containerization

Data Mechanics is deployed on a Kubernetes cluster in our customers' cloud account (while most other platforms still run Spark on YARN, Hadoop's scheduler).

This deployment model has key benefits:

  • An airtight security model: our customers' sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use our set of pre-built optimized Spark Docker images or build their own Docker images to package their dependencies in a reliable way. Learn more about using custom Docker images on Data Mechanics.
  • Integration with the rich tools from the Kubernetes ecosystem.
  • Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.

Kubernetes Ecosystem by Spotinst
Source: The State Of The Kubernetes Ecosystem

Our serverless pricing model

Competing data platforms base their pricing on server uptime. For each instance type, they'll charge you an hourly fee, whether this instance is actually used to run Spark apps or not. This puts the burden on Spark developers to efficiently manage their clusters and make sure they're not wasting resources due to over-provisioning or parallelism issues.

Data Mechanics Pricing Model Serverless (Spark on Kubernetes)

Instead, the Data Mechanics fee is based on the sum of the durations of all the Spark tasks (the units of work distributed by Spark, reported with millisecond accuracy). This means our platform only makes money when our users do real work. We don't make money:

  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)

As a result, Data Mechanics will aggressively scale down your apps when they're idle, so that we reduce your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs will cover, and typically exceed, the fee we charge for our services.
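The difference between the two pricing models can be illustrated with a small calculation. All numbers below are hypothetical, the function names are made up for this example, and no actual rates are shown. It only compares the quantity being billed: summed task time versus provisioned core time, assuming each task occupies one core (Spark's default).

```python
MS_PER_HOUR = 3_600_000


def billable_task_hours(task_durations_ms):
    """Task-time model: sum the duration of every Spark task, in hours."""
    return sum(task_durations_ms) / MS_PER_HOUR


def provisioned_core_hours(num_cores, uptime_hours):
    """Uptime model: every provisioned core counts for as long as it is up."""
    return num_cores * uptime_hours


# Hypothetical app: 16 cores kept up for 2 hours, running 4,000 tasks of 9 s each.
task_durations_ms = [9_000] * 4_000

task_hours = billable_task_hours(task_durations_ms)
core_hours = provisioned_core_hours(num_cores=16, uptime_hours=2)
utilization = task_hours / core_hours

print(f"task hours billed under the task-time model: {task_hours}")
print(f"core hours billed under the uptime model:    {core_hours}")
print(f"share of provisioned capacity doing work:    {utilization:.1%}")
```

In this made-up scenario, less than a third of the provisioned capacity is running tasks. Under uptime-based pricing the idle remainder is still billed, whereas under task-time pricing it generates no fee.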

I'd like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we'll invite you to a shared Slack channel (we use Slack for our support and we're very responsive there). We'll send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice. Once we have these permissions, we'll deploy Data Mechanics and you'll be ready to get started using our docs.

There are other features which we didn't get to cover in this post, like our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and with tools for machine learning model tracking and serving. So stay tuned and reach out if you're curious to learn more.
