September 8, 2020

Jean-Yves Stephan


Our mission at Data Mechanics is to let data engineers and data scientists build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.

So we built a serverless Apache Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera and Hortonworks.

In this video, we give you a product tour of our platform and some of its core features:

  1. How to connect a Jupyter notebook to the platform and play with Apache Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration
  3. How to monitor logs and metrics for your Apache Spark app from our dashboard
  4. How to track the cost, stability, and performance of your jobs (recurring apps) over time

What makes Data Mechanics a Serverless Apache Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to make them stable and performant. Here are some parameters we tune:

  • The container sizes (memory, CPU) - to keep your app stable (avoid OutOfMemory errors), to optimize the bin packing of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (memory-bound, CPU-bound, I/O-bound)
  • The default number of partitions used by Apache Spark to increase its degree of parallelism.
  • The disk sizes, shuffle and I/O configurations to make sure data transfer phases run at their optimal speed.
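To make these knobs concrete, here is a minimal sketch of how they map onto configuration keys from open-source Apache Spark. The helper `spark_conf_for` and the example values are hypothetical, chosen only for illustration; they are not values or code from the Data Mechanics platform, which selects these settings automatically.

```python
def spark_conf_for(executor_memory_gb: int, executor_cores: int, partitions: int) -> dict:
    """Map container size and parallelism settings to open-source Spark config keys.

    Hypothetical helper for illustration; the keys are standard Spark settings.
    """
    if executor_memory_gb <= 0 or executor_cores <= 0 or partitions <= 0:
        raise ValueError("all settings must be positive")
    return {
        # Container size: memory and CPU requested for each executor.
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.executor.cores": str(executor_cores),
        # Default number of partitions for RDD and DataFrame shuffles.
        "spark.default.parallelism": str(partitions),
        "spark.sql.shuffle.partitions": str(partitions),
    }


# Example values are assumptions, not recommendations.
conf = spark_conf_for(executor_memory_gb=8, executor_cores=4, partitions=200)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each entry can be passed to `spark-submit` as `--conf key=value` when tuning by hand.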

Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in code or input data, so that your apps stay stable and performant over time, without manual action from you.
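The idea of learning from past runs can be illustrated with a deliberately simple toy heuristic: size executor memory from the worst peak observed so far, plus some headroom. This is not the Data Mechanics tuning algorithm; the function name, the 20% headroom, and the sample history are all assumptions made for the sketch.

```python
import math


def suggest_executor_memory_gb(past_peak_memory_gb, headroom=0.2, minimum_gb=1):
    """Toy heuristic: worst observed peak plus headroom, rounded up to whole GB.

    Hypothetical illustration only, not the platform's actual tuning logic.
    """
    if not past_peak_memory_gb:
        # No history yet: fall back to a conservative minimum.
        return minimum_gb
    worst_peak = max(past_peak_memory_gb)
    return max(minimum_gb, math.ceil(worst_peak * (1 + headroom)))


# Peak executor memory (GB) measured on three previous runs (made-up numbers).
history = [5.1, 5.4, 6.0]
print(suggest_executor_memory_gb(history))

# If the input data grows and a later run peaks higher, the suggestion follows.
print(suggest_executor_memory_gb(history + [9.0]))
```

A real tuner would look at many more signals (CPU usage, shuffle and I/O metrics), but the feedback loop has the same shape: observe past runs, then adjust the configuration for the next one.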

Data Mechanics and Tradelab Costs Cut In Half
Source: Our 2019 Spark Summit Presentation on How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:

  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider 

This model lets each app work in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.
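For reference, application-level autoscaling in open-source Spark is controlled by the dynamic allocation settings. The sketch below builds such a configuration; the helper name and the executor bounds are hypothetical, and the platform's own defaults may differ.

```python
def dynamic_allocation_conf(min_executors: int, max_executors: int,
                            idle_timeout_s: int = 60) -> dict:
    """Build open-source Spark settings for dynamic allocation.

    Hypothetical helper; the keys are standard Spark configuration keys.
    """
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Spark 3.0+: tracks shuffle files so executors can be removed
        # without an external shuffle service (useful on Kubernetes).
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are released.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


# Example bounds are assumptions chosen for illustration.
for key, value in dynamic_allocation_conf(min_executors=0, max_executors=20).items():
    print(f"{key}={value}")
```

Cluster-level autoscaling is a separate concern, handled by the Kubernetes cluster adding or removing nodes as pending executor pods appear or disappear.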

Our cloud-native containerization

Data Mechanics is deployed on a Kubernetes cluster in our customers' cloud account (while most other platforms still run Spark on YARN, Hadoop's scheduler).

This deployment model has key benefits:

  • An airtight security model: our customers' sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use our set of pre-built optimized Spark Docker images or build their own Docker images to package their dependencies in a reliable way. Learn more about using custom Docker images on Data Mechanics.
  • Integration with the rich tools from the Kubernetes ecosystem.
  • Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.
Kubernetes Ecosystem by Spotinst
Source: The State Of The Kubernetes Ecosystem
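To give a feel for what running Spark on Kubernetes with a Docker image involves, here is a sketch that assembles a submission command for open-source Spark. It does not show the Data Mechanics API. The API server address, image name, and app name are placeholders, and `spark_submit_args` is a hypothetical helper.

```python
import shlex


def spark_submit_args(api_server: str, image: str, app_path: str,
                      name: str, namespace: str = "default") -> list:
    """Assemble an open-source spark-submit command targeting Kubernetes.

    Hypothetical helper; flags and config keys are from open-source Spark.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # The Docker image that packages Spark and the app's dependencies.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


# All values below are placeholders for illustration.
args = spark_submit_args(
    api_server="https://kubernetes.example.com:6443",
    image="registry.example.com/my-team/spark-app:latest",
    app_path="local:///opt/spark/examples/src/main/python/pi.py",
    name="pi-example",
)
print(shlex.join(args))
```

The `local://` scheme tells Spark that the application file is already present inside the Docker image, which is the usual pattern when dependencies are packaged in the image.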

Our serverless pricing model

Competing data platforms base their pricing on server uptime. For each instance type, they'll charge you an hourly fee, whether this instance is actually used to run Spark apps or not. This puts the burden on Spark developers to efficiently manage their clusters and make sure they're not wasting resources due to over-provisioning or parallelism issues.

Data Mechanics Pricing Model Serverless (Spark on Kubernetes)

Instead, the Data Mechanics fee is based on the sum of the duration of all the Spark tasks (the units of work distributed by Spark, reported with millisecond accuracy). This means our platform only makes money when our users do real work. We don't make money:

  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)

As a result, Data Mechanics will aggressively scale down your apps when they're idle, so that we reduce your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs will cover, and typically exceed, the fee we charge for our services.
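The difference between the two metering models can be shown with a small worked example. The session below is made up, and no real prices are used: the sketch only compares what each model counts, in hours.

```python
MS_PER_HOUR = 3_600_000


def uptime_billed_hours(instance_count: int, hours_up: float) -> float:
    """Uptime-based metering: every instance counts for as long as it is up."""
    return instance_count * hours_up


def task_billed_hours(task_durations_ms) -> float:
    """Task-based metering: only the summed duration of Spark tasks counts."""
    return sum(task_durations_ms) / MS_PER_HOUR


# Hypothetical notebook session: 4 instances stay up for 2 hours,
# but Spark only runs 100 tasks of 90 seconds (90,000 ms) each.
# Idle time and driver-only code produce no tasks, so they add nothing.
uptime_hours = uptime_billed_hours(instance_count=4, hours_up=2)
task_hours = task_billed_hours([90_000] * 100)

print(f"metered under uptime model: {uptime_hours} instance-hours")
print(f"metered under task model:   {task_hours} task-hours")
```

The two figures are in different units (instance-hours versus task-hours), so they are not directly comparable as prices. The point is that the uptime figure grows with every idle minute, while the task figure only grows when Spark is doing work.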

I'd like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we'll invite you to a shared Slack channel: we use Slack for support and we're very responsive there. We'll then send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice. Once we have these permissions, we'll deploy Data Mechanics and you'll be ready to get started using our docs.

There are other features we didn't get to cover in this post, such as our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and with machine learning model tracking and serving tools. So stay tuned and reach out if you're curious to learn more.

