NetApp announces acquisition of Data Mechanics, delivering the best way to run Apache Spark in your cloud

They trust us

Various logos in a circle: Data Mechanics, Airflow, IDEs, Docker, Kubernetes


Bring Your Own Tools

Enjoy a faster and more reliable development workflow with Docker. Use our pre-built optimized images, or build your own custom images to package your dependencies. Your apps will start in seconds!

Use Spark interactively by connecting Jupyter notebooks, or submit applications programmatically through our REST API or our Airflow connector.
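
As a rough illustration of programmatic submission, the sketch below builds (but does not send) an HTTP request for a Spark application. The host `spark.example.com`, the `/api/apps/` path, the `X-API-Key` header, and the payload field names are assumptions made for this example and are not taken from this page. Check the platform's API documentation for the actual contract.

```python
import json
import urllib.request


def build_submit_request(base_url, api_key, job_name, config):
    """Build (but do not send) an HTTP request that submits a Spark app.

    The "/api/apps/" path, the "X-API-Key" header, and the payload keys
    are assumptions made for this sketch, not a documented contract.
    """
    payload = {"jobName": job_name, "configOverrides": config}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/apps/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


# Hypothetical values throughout: replace them with your own.
request = build_submit_request(
    base_url="https://spark.example.com",
    api_key="REPLACE_ME",
    job_name="daily-aggregation",
    config={
        "type": "Python",
        "mainApplicationFile": "local:///opt/app/main.py",
    },
)
# Sending is left out on purpose; urllib.request.urlopen(request) would do it.
```

Keeping the request construction separate from the network call makes it easy to reuse from a scheduler task or a CI job.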

Transparent & Flexible

The Power of Kubernetes, Without the Complexity

We're deployed on a managed Kubernetes cluster in your cloud account and in your virtual private cloud. Your sensitive data does not leave this environment. You're in control.

We handle the complexity of Kubernetes and provide you with an easy-to-use monitoring dashboard where you can track your application's logs, metrics, and costs over time.

Data Mechanics Architecture: Dockerized Spark applications running on a Kubernetes cluster.
An illustration with a laptop and a dashboard showing a Spark application's metrics over time.


50-75% Cost Reductions
From Smart Automations

We dynamically scale your applications and Kubernetes nodes based on load. We automatically tune your configurations (type of instance and disks, container memory/CPU allocations, and Spark configurations) based on the historical runs of your Spark pipelines.
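
To make the idea of tuning from historical runs concrete, here is a toy heuristic: take the highest executor memory peak seen in past runs, add a safety margin, and round up. The function name, the 20% headroom, and the sample history are all invented for illustration; this is not the platform's actual tuning algorithm.

```python
import math


def recommend_executor_memory_gib(peak_usage_gib, headroom=0.2, minimum_gib=1):
    """Suggest an executor memory size from past peak usage.

    Takes the highest peak seen in earlier runs, adds a safety margin,
    and rounds up to a whole GiB. With no history, returns the minimum.
    """
    if not peak_usage_gib:
        return minimum_gib
    target = max(peak_usage_gib) * (1 + headroom)
    return max(minimum_gib, math.ceil(target))


# Made-up peak memory usage (GiB) from three earlier runs of one pipeline.
history = [5.1, 6.4, 5.8]
memory = recommend_executor_memory_gib(history)

# spark.executor.memory is a standard Spark setting.
spark_conf = {"spark.executor.memory": f"{memory}g"}
```

A real tuner would look at many more signals (shuffle volume, CPU utilization, instance pricing), but the shape is the same: observe past runs, then derive the next run's settings.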

Our pricing gives us an incentive to make your data infrastructure as efficient as possible and to reduce your cloud provider bill. We've achieved 50 to 75% cost reductions for customers migrating from competing platforms like Databricks or EMR.

Deployment process

How it works

Data Mechanics is deployed on a Kubernetes cluster in your cloud account that we create and manage for you. Your sensitive data never leaves your account.

Logos of 3 cloud providers: AWS, GCP, Azure.

1. Connect your cloud account

Give us scoped permissions on your AWS, GCP or Azure account. We will deploy the platform on a Kubernetes cluster that we create and manage for you.

Apache Spark logo in a cube.

2. Submit Spark applications

Attach a Jupyter notebook and start exploring interactively or submit jobs programmatically through our REST API or our Airflow operator.

A green check logo.

3. You’re all set

Sit back and relax. Monitor your application logs and metrics from our web user interface.

Need help?

Frequently Asked Questions

If you have an infrequently asked question, use the chat in the bottom-right corner and our team will get back to you shortly.

How are you different from Spark-on-k8s open-source?

This question is so common we wrote a blog post to answer it.

One part of the answer is that Data Mechanics is a managed service. We take care of the setup and maintenance of the Kubernetes cluster, and provide intuitive APIs and web interfaces to hide away the complexity of Kubernetes. 

The other part of the answer is that we add specific features on top of Spark-on-Kubernetes open-source:
- Integrations with tools like Jupyter, Airflow, and more.
- Log & metrics collection and persistence to give you visibility.
- Node pool management (node pool definition, autoscaling, spot support).
- Performance optimizations, like our automated tuning of parameters.

All in all, we built Data Mechanics to make Spark-on-Kubernetes stable, easy to use, and cost-effective, the way you would build it yourself. Because the platform is deployed in your cloud account and in your VPC, you get the flexibility of a home-grown project. The optimizations we built generate cloud cost savings that more than make up for our management fee.

Does Data Mechanics need access to my data?

No. The Data Mechanics platform is deployed in your cloud account and your sensitive data never leaves it. You can restrict incoming traffic and make the platform accessible only from your office IP and/or your VPN. You can control data access using security best practices like Role-Based Access Control, cloud Identity and Access Management, and Kubernetes secrets.

Which cloud providers do you support?

Our platform is available on GCP (on GKE), AWS (on EKS), and Azure (on AKS). To use our platform, all we need is a certain set of permissions (an IAM role) on the cloud account of your choice. We will use these permissions to create and manage the Kubernetes cluster for you.

Are you available on-premises?

Not yet. We focus on cloud deployments as they let us deploy and push releases quickly and in a standardized way. Support for on-premises deployments is on our roadmap. Contact us if you're interested.

Can I still see and control the infrastructure?

Yes. We automate infrastructure management to make your life simpler, but we do this in a transparent way. You can view and control the Kubernetes cluster that we manage for you using the cloud provider console, its CLI and API. Similarly, you can view and control the infrastructure parameters and Spark configurations used by each application. Your preferences take precedence over our autoscaling and automated tuning features.
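
The precedence rule in the last sentence can be pictured as a simple merge in which user-provided settings override automatically tuned ones. This is a minimal sketch with hypothetical names and values, not the platform's implementation.

```python
def resolve_config(tuned, user):
    """Merge two config dicts; user-provided values override tuned ones."""
    merged = dict(tuned)   # copy so the tuned defaults stay untouched
    merged.update(user)    # explicit user preferences win on conflicts
    return merged


# Hypothetical output of automated tuning.
tuned = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "400",
}
# The user pins only one setting explicitly.
user = {"spark.executor.memory": "12g"}

effective = resolve_config(tuned, user)
```

Settings the user does not mention keep their tuned values, so pinning one parameter does not disable tuning for the rest.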

How are you different from Databricks?

Databricks is an end-to-end data science platform with proprietary hosted notebooks, an MLflow integration, and a simple job scheduler. If you need all these features, the Databricks platform is a good fit for you.

Data Mechanics focuses on making Spark more developer-friendly and cost-effective for data engineering workloads. We don't provide hosted notebooks; instead, we let our users point Jupyter notebooks at the platform. We don't provide a hosted job scheduler, but we integrate with Airflow and other schedulers.

How are we better than Databricks?
* Native Docker support to simplify dependency management.
* Very fast startup time and autoscaling. The entire Kubernetes cluster is one large pool shared by all your applications.
* Automated tuning of instance type, disk type, Spark configurations.
* Pricing. Our management fee is much lower, and it's based on real Spark compute time (instead of wasted server uptime).
* A transparent and flexible architecture, built on top of open-source technologies and available on the three main cloud providers.

Databricks customers who migrated to our platform have reduced their total cost of ownership by 50 to 80%.
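
To illustrate the pricing point in the list above (a fee based on Spark compute time rather than server uptime), here is a small worked comparison. Every number is invented for the example, including the rate, and the `management_fee_cents` helper is hypothetical; it does not reflect any vendor's actual price list.

```python
def management_fee_cents(billable_hours, cores, rate_cents_per_core_hour):
    """Fee for a given number of billable hours on a fixed number of cores."""
    return billable_hours * cores * rate_cents_per_core_hour


# Invented scenario: a 16-core cluster stays up for 10 hours,
# but Spark applications actually run for only 6 of them.
cores = 16
rate_cents = 5          # hypothetical fee per core-hour, in cents
uptime_hours = 10
compute_hours = 6

fee_on_uptime = management_fee_cents(uptime_hours, cores, rate_cents)
fee_on_compute = management_fee_cents(compute_hours, cores, rate_cents)

# Fraction of the fee avoided by billing on compute time only.
saving_fraction = 1 - fee_on_compute / fee_on_uptime
```

The saving grows with idle time: the more hours a cluster sits up without running Spark work, the larger the gap between the two billing models.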

Which languages do you support?

Your applications can be written in Python, Scala, Java, and SQL. Even though many of our features are primarily built for Spark, it's also possible to run pure Python, Scala, and Java applications on the platform with the same ease.

Which versions of Apache Spark do you support?

We support Apache Spark 2.4 and all later versions. We keep up with new Spark releases and update our platform to support the latest version within a few days of its release. See our public fleet of optimized Docker images for Spark on DockerHub to learn more.

Do you support spot (preemptible) instances?

Yes. We have configuration templates ready to help you adopt them. We typically recommend running only the Spark executors on spot nodes and placing the Spark driver on an on-demand node, so that your workloads are resilient to spot kills.
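
One way to express this placement in plain Spark-on-Kubernetes settings is with per-role node selectors, as sketched below. This assumes Apache Spark 3.3 or later, where the driver and executor node selector properties are available, and it assumes your nodes carry a capacity-type label. The label shown is the one applied to EKS managed node groups; other clouds use different labels. The helper name is hypothetical, and this is not one of the platform's own templates.

```python
def spot_placement_conf(label_key, on_demand_value, spot_value):
    """Return Spark-on-Kubernetes settings that pin the driver to
    on-demand nodes and the executors to spot nodes via a node label."""
    return {
        f"spark.kubernetes.driver.node.selector.{label_key}": on_demand_value,
        f"spark.kubernetes.executor.node.selector.{label_key}": spot_value,
    }


# Assumption: nodes are labeled the way EKS managed node groups label them.
conf = spot_placement_conf(
    label_key="eks.amazonaws.com/capacityType",
    on_demand_value="ON_DEMAND",
    spot_value="SPOT",
)

# Turn the settings into spark-submit style arguments.
submit_args = []
for key, value in sorted(conf.items()):
    submit_args += ["--conf", f"{key}={value}"]
```

If a spot node is reclaimed, only executors are lost and Spark can reschedule their tasks, while the driver on the on-demand node keeps the application alive.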

How can I get started with a trial?

Get in touch with us by booking a demo so that we can learn more about your use case and answer your questions about the platform. We'll then invite you to a shared Slack or Teams channel that we will use for most of our interactions and for live support. The first step to get started is to grant us scoped permissions on the AWS, GCP, or Azure account of your choice.

Ready to get started?
