Use cases for distributed data processing (“big data”) are everywhere: from drug development and AI-assisted healthcare, to fraud detection and information security, to production lines and energy grid optimization. Consumer companies must process their customers’ data to build better products, distribute them more effectively, and make data-driven decisions at a global scale.
Our mission at Data Mechanics is to give superpowers to the data scientists and data engineers of the world so they can make sense of their data and build applications at scale on top of it.
Our founding team is made up of data infrastructure, data engineering and data science experts who experienced firsthand the frustration of spending too much time on manual, low-level tasks such as infrastructure management and application debugging, and too little time on their core expertise, such as building better models and better pipelines. We’ll talk more about the company’s formation and our experience going through Y Combinator in a future blog post. What we want to share now is our vision for the company and the product we want to build.
The world of big data analytics has been lagging behind in terms of developer tools. The advent of containerization and cloud-native technologies has brought a new paradigm to traditional software engineering, enabling developers to build, deploy and maintain applications in a much simpler and more efficient way.
So why is it that in 2020 the average developer who wants to use Apache Spark - the #1 distributed data processing engine - still needs to become a Hadoop or YARN expert, choose dozens of parameters through slow iterations, and spend painful hours debugging the issues that arise, simply to get their job done? The answer is: there is no reason. The paradigm shift is moving to the notoriously difficult world of data engineering, and we’re going to help with that.
We’re building the data platform of tomorrow, and we’re starting with Apache Spark. We want to let data engineers and scientists build, run and maintain Spark applications processing gigabytes, terabytes or petabytes of data, as easily as if they were developing a Python + scikit-learn app on their laptop.
Now that we’ve shared our ambition, how do we get there? Our platform follows 3 core tenets:
- It’s not just managed, it’s serverless
- It’s not end-to-end, it’s integrated in your workflow
- It’s not inherently disruptive, it’s built on top of the best software Open Source can offer
It’s not just managed, it’s serverless
A managed data platform will provision the machines for you when you submit an application. But it still requires users to specify the full cluster and Spark configurations: how many nodes, how many CPUs and how much memory for each node, how to distribute the data (shuffle and partition configurations), how to parallelize the work (parallelism configuration), how to allocate memory between execution and storage, and which algorithm to use when writing results to the data sink. That’s a lot of work to leave to the user of a managed service, particularly as you run dozens, hundreds, or thousands of applications.
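To make that list concrete, here is a rough sketch of what setting these parameters by hand can look like. The property names are standard Spark configuration keys, but the values, the `to_spark_submit_args` helper and the `app.py` file name are hypothetical and only there for illustration; the right numbers depend entirely on the workload.

```python
# Illustrative values only: none of these numbers come from a real workload.
MANUAL_SPARK_CONF = {
    # Cluster sizing: how many executors, and CPU / memory for each
    "spark.executor.instances": "10",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    # Data distribution: partition count used for DataFrame shuffles
    "spark.sql.shuffle.partitions": "200",
    # Parallelism: default partition count for RDD operations
    "spark.default.parallelism": "80",
    # Memory split between execution and storage
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",
    # Commit algorithm used when writing results to the data sink
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def to_spark_submit_args(conf):
    """Render a config dict as a flat list of spark-submit --conf flags."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # "app.py" is a placeholder application name.
    command = ["spark-submit", *to_spark_submit_args(MANUAL_SPARK_CONF), "app.py"]
    print(" ".join(command))
```

Every one of these values typically has to be revisited for each application, and again when data volumes change.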
Our serverless data platform automates the tuning of infrastructure parameters and Spark configurations dynamically and continuously for each workload to make them fast and stable without your intervention. It goes further than mere infrastructure management: our users shouldn’t have to think about the infrastructure. Of course, servers do still exist. In our case, the servers are deployed within our customers’ cloud accounts to meet strict security and data privacy requirements, so our users can see and even control the infrastructure if necessary.
It’s not end-to-end, it’s integrated in your workflow
Some data platforms claim to serve all the needs of data scientists and engineers through a comprehensive list of proprietary features: their hosted notebooks, their data visualization, their scheduler, their version control and dependency management mechanisms, and so on.
We believe this approach suffers from 3 problems.
- Migration costs. Before a company starts a Spark project, they probably already have a lot of data science and engineering tools. End-to-end platforms require users to change their tools and transpose their entire workflow to fit the platform’s opinionated way of doing things.
- Lock-in. That’s the reverse problem: migrating away from proprietary tools is hard.
- Holes in the net. The space of data science and data engineering is so dynamic that platforms serving such a large scope of needs always fall behind in some areas.
That’s why we decided to focus on doing one thing really well: the serverless infrastructure that automates application deployment and maintenance. For all the other needs, we integrate with the tools that you already use and love today: your preferred notebook or IDE, your scheduler, your CI/CD tool.
It’s not inherently disruptive, it’s built on top of the best software Open Source can offer
We encourage customers to outsource their data infrastructure management to us. We apply the same philosophy to ourselves: Data Mechanics is built on top of the best software open source can offer, including Apache Spark, Kubernetes, Docker, Jupyter, and more. We’re proud to contribute to them too.
Stitching together the right mix of open source technologies is a hard problem. The devil lies in the details: how to connect the technologies together, how to make them work across all clouds, how to make them scale, how to keep them up-to-date. It might not feel like disruptive innovation, but we optimize for impact, and that’s how we deliver the biggest value to our customers.
We’re proud to tell you publicly about what we’ve been building: a data platform that automates the deployment, maintenance and monitoring of Spark applications, so that data scientists and engineers can focus on their data while we handle the mechanics.
This is just the beginning and there’s a lot of work ahead of us. In future posts, we’ll write Spark tutorials, give product updates and details on how the platform works, and tell more company stories like our experience going through Y Combinator. Follow us on Twitter or LinkedIn to stay tuned!