Our mission at Data Mechanics is to let data engineers and data scientists build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.
So we built a serverless Apache Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera, and Hortonworks.
In this video, we give you a product tour of our platform and some of its core features:
- How to connect a Jupyter notebook to the platform and play with Apache Spark interactively
- How to submit applications programmatically using our API or our Airflow integration
- How to monitor logs and metrics for your Apache Spark app from our dashboard
- How to track the costs, stability, and performance of your jobs (recurring apps) over time
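To give a feel for the programmatic submission workflow, here is a minimal sketch of what an API call could look like. The endpoint path and payload fields below are illustrative assumptions, not the actual Data Mechanics API schema — refer to the docs for the real contract.

```python
import json

# Hypothetical payload for submitting a Spark app through a REST API.
# Field names and values are illustrative, not the real API schema.
payload = {
    "jobName": "daily-etl",
    "configOverrides": {
        "type": "Python",
        "mainApplicationFile": "s3://my-bucket/jobs/daily_etl.py",
        "arguments": ["--date", "2021-04-01"],
    },
}

def submit(api_url: str, token: str) -> None:
    """Would POST the payload to the platform (sketch only, not called here)."""
    import urllib.request
    req = urllib.request.Request(
        f"{api_url}/api/apps/",  # hypothetical endpoint
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)

print(json.dumps(payload, indent=2))
```

In an Airflow DAG, an HTTP operator (or a thin wrapper around a function like `submit` above) would fire this same payload on a schedule.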
What makes Data Mechanics a Serverless Apache Spark platform?
Our autopilot features
Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to keep them stable and performant. Here are some of the parameters we tune:
- The container sizes (memory, CPU) - to keep your app stable (avoiding OutOfMemory errors), to optimize the bin-packing of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (memory-bound, CPU-bound, or I/O-bound)
- The default number of partitions used by Apache Spark to increase its degree of parallelism.
- The disk sizes, shuffle and I/O configurations to make sure data transfer phases run at their optimal speed.
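Concretely, these knobs map onto standard Spark configuration properties. The dictionary below is an illustrative snapshot of the kinds of settings involved — the keys are real Spark properties, but the values are made-up examples, not tuned recommendations:

```python
# Example of the kinds of Spark properties the autopilot adjusts.
# Keys are standard Spark configs; values are illustrative only.
tuned_conf = {
    # Container sizes (memory, CPU)
    "spark.executor.memory": "7g",
    "spark.executor.cores": "4",
    # Default number of partitions / degree of parallelism
    "spark.sql.shuffle.partitions": "200",
    "spark.default.parallelism": "200",
    # Shuffle and I/O settings affecting data transfer phases
    "spark.shuffle.file.buffer": "1m",
    "spark.io.compression.codec": "lz4",
}

for key, value in sorted(tuned_conf.items()):
    print(f"--conf {key}={value}")
```

The point of autotuning is that these values are not set once by hand: they are re-derived from the observed behavior of past runs.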
Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in your code or input data, so that your apps stay stable and performant over time, without any manual action from you.
In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:
- At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
- At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider
This model lets each app run in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.
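The application-level autoscaling mentioned above corresponds to Spark's built-in dynamic allocation. The sketch below shows the standard Spark properties involved (the executor bounds are example values, not defaults we recommend):

```python
# Application-level autoscaling via Spark dynamic allocation.
# These are standard Spark properties; min/max bounds are example values.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # On Kubernetes (Spark 3+), shuffle tracking lets dynamic allocation
    # work without the external shuffle service it normally requires.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}

args = " ".join(f"--conf {k}={v}" for k, v in dynamic_allocation_conf.items())
print(args)
```

Cluster-level autoscaling (adding and removing nodes) happens one layer below, in Kubernetes itself, so no Spark configuration is needed for it on the application side.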
Our cloud-native containerization
Data Mechanics is deployed on a Kubernetes cluster in our customers' cloud accounts (while most other platforms still run Spark on YARN, Hadoop's scheduler).
This deployment model has key benefits:
- An airtight security model: our customers' sensitive data stays in their cloud account and VPC.
- Native Docker support: our customers can use one of our pre-built, optimized Docker images for Apache Spark (Update: as of April 2021, these images are publicly available on DockerHub!), or they can build their own Docker images to package their dependencies in a reliable way. Learn more about the benefits of the resulting Docker-based developer workflow.
- Integration with the rich tools from the Kubernetes ecosystem.
- Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.
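Pointing a Spark-on-Kubernetes app at a custom Docker image comes down to a few standard Spark properties. The image name below is a placeholder; the config keys are real Spark-on-Kubernetes settings:

```python
# Running a Spark app from a custom Docker image on Kubernetes.
# The image name is a placeholder; keys are standard Spark-on-K8s configs.
container_conf = {
    "spark.kubernetes.container.image": "myregistry/my-spark-app:1.0",
    "spark.kubernetes.container.image.pullPolicy": "IfNotPresent",
    "spark.kubernetes.namespace": "spark-apps",
}

submit_args = " ".join(f"--conf {k}={v}" for k, v in container_conf.items())
print(submit_args)
```

Because the image fully packages the app's Spark version and dependencies, two apps on the same cluster can run entirely different environments side by side.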
Our serverless pricing model
Competing data platforms' pricing models are based on server uptime. For each instance type, they'll charge you an hourly fee, whether or not the instance is actually being used to run Spark apps. This puts the burden on Spark developers to manage their clusters efficiently and make sure they're not wasting resources through over-provisioning or parallelism issues.
Instead, the Data Mechanics fee is based on the sum of the duration of all the Spark tasks (the units of work distributed by Spark, reported with a millisecond accuracy). This means our platform only makes money when our users do real work. We don't make money:
- When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
- When most of your application's resources are waiting on a straggler task to finish
- When you run a Spark driver-only operation (pure Scala or Python code)
As a result, Data Mechanics aggressively scales down your apps when they're idle, reducing your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs typically cover or exceed the fee we charge for our services.
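To make the pricing model concrete, here is a toy sketch of a fee computed from task durations rather than uptime. Task durations would come from Spark's event log or listener metrics; the per-task-hour rate here is a made-up number for illustration, not an actual price:

```python
def billable_task_hours(task_durations_ms):
    """Sum Spark task durations (reported in milliseconds) into hours."""
    return sum(task_durations_ms) / (1000 * 60 * 60)

# Three tasks: two short ones plus a 36-minute straggler.
tasks_ms = [90_000, 120_000, 2_160_000]
hours = billable_task_hours(tasks_ms)

RATE_PER_TASK_HOUR = 0.10  # hypothetical rate, for illustration only
fee = hours * RATE_PER_TASK_HOUR

# Idle executors waiting on the straggler, and idle notebook time,
# contribute zero task-milliseconds, so they add nothing to the fee.
print(f"{hours:.4f} task-hours -> fee {fee:.4f}")
```

Under uptime-based pricing, the idle executors in this scenario would still be billed for the full 36 minutes; under task-duration pricing, only the work itself counts.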
Update (April 2021): To estimate what your Data Mechanics costs would be, install our open-source monitoring tool Delight (it works on top of any Spark platform: Databricks, EMR, Dataproc, CDH/HDP, or an open-source setup). More details on our blog post or on our GitHub page.
I'd like to try this, how do I get started?
Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we'll invite you to a shared Slack channel - we use Slack for support and we're very responsive there. We'll send you instructions on how to grant us permissions on the AWS, GCP, or Azure account of your choice, and once we have them, we'll deploy Data Mechanics and you'll be ready to get started using our docs.
There are other features we didn't get to cover in this post -- like our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and machine-learning model tracking and serving. So stay tuned, and reach out if you're curious to learn more.