November 2, 2020

by Jean-Yves Stephan

This article is jointly written by:

  • Data Mechanics, a YCombinator-backed startup founded by ex-Databricks engineers, commercializing a next-generation Spark platform deployed on Kubernetes. Our mission is to make Spark easier to use and more cost-efficient, with a focus on data engineering workloads.
  • Quantmetry, an AI & Data consulting firm specializing in helping European enterprises adopt technologies like Apache Spark to achieve mission-critical business goals.

What makes Apache Spark popular?

In the data science and data engineering world, Apache Spark is the leading technology for working with large datasets. The Apache Spark developer community is thriving: most companies have already adopted or are in the process of adopting Apache Spark. Apache Spark’s popularity comes down to three main reasons:

  1. It’s fast. It can process large datasets (at the GB, TB, or PB scale) thanks to its native parallelization.
  2. It has APIs in Python (PySpark), Scala/Java, SQL, and R. These APIs make it simple to migrate “single-machine” (non-distributed) Python workloads to running at scale with Spark. For example, the recently released Koalas library lets Python developers run their Pandas code on Spark with minimal changes (see the sketch after this list). The fact that arbitrary Python/Scala code can be executed also gives Spark developers much more flexibility than SQL-only frameworks like Redshift and BigQuery.
  3. It’s very versatile. Apache Spark has connectors for virtually all data stores, and Spark clusters can be deployed on any cloud or on-premise platform.
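As an illustration of how close the Pandas and Spark APIs have become, here is a minimal sketch (the data and column names are made up) of the same aggregation written first with Pandas, then distributed by Spark through Koalas:

```python
import pandas as pd
import databricks.koalas as ks  # pip install koalas

# A small single-machine Pandas DataFrame (illustrative data)
pdf = pd.DataFrame({"city": ["Paris", "Paris", "Lyon"], "sales": [10, 20, 5]})

# Single-machine Pandas aggregation
print(pdf.groupby("city")["sales"].sum())

# The same code, distributed by Spark via Koalas
kdf = ks.from_pandas(pdf)
print(kdf.groupby("city")["sales"].sum())
```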
What is Apache Spark? Here's a high-level overview.

What are the top pain points with Apache Spark?

Debugging application code

The first challenge is that it’s hard for Spark beginners to understand how their code is interpreted and distributed by Spark. You need to learn the intricacies of Spark and reach a certain level of expertise to debug your application when it does not behave as expected (for example, to track down a memory error), and to understand and then optimize its speed.

Spark exposes many configurations that are complex for beginners to set. As a result, most Spark developers tend to stick to default settings, without realizing how this can hurt their application stability and performance.
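To make this concrete, here is a minimal PySpark sketch of overriding a few of these settings when building a Spark session instead of leaving them at their defaults; the values are purely illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative overrides of a few settings that often matter for stability
# and performance (values are examples, not recommendations).
spark = (
    SparkSession.builder
    .appName("configured-app")
    .config("spark.executor.memory", "8g")          # executor heap size
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.sql.shuffle.partitions", "400")  # shuffle parallelism
    .getOrCreate()
)
```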

Managing the underlying infrastructure

The second challenge is infrastructure management. What should the size of my Spark cluster be? Which types of virtual machines and disks should I choose? How should I collect and visualize logs and metrics?

These challenges come in two different flavors, depending on whether you deploy on-premise (with Hortonworks, Cloudera, or MapR) or in the cloud (EMR on AWS, Dataproc on GCP, HDInsight on Azure).

  • For on-premise deployments, the main challenge is the cost/speed tradeoff when sizing the cluster. If you oversize your cluster, costs snowball and the cluster sits underutilized and overprovisioned most of the time. If you undersize it, it cannot sustain peak workloads, and you will need to implement priority queues to make sure your mission-critical workloads are not delayed.
  • For cloud deployments, the elasticity of the cloud provider solves this problem, as resources can be added or removed on the fly. But this also means that costs are unbounded, and it is up to each individual Spark user to appropriately size and configure their applications and make sure they are stable and cost-efficient. Despite these services being called “managed”, the real burden of management and configuration still falls on the data teams using the Spark cluster.

Let’s now go over Data Mechanics best practices and recommendations to overcome these challenges.

Simplifying Spark Infrastructure Management with a Serverless Approach

Data Mechanics is a managed Spark platform deployed on a Kubernetes cluster inside our customers’ cloud account. It is available on the 3 big cloud providers (AWS, GCP, and Azure) and is an alternative to platforms like Databricks, Amazon EMR, Google Dataproc, and Azure HDInsight. Jean-Yves, an ex-Databricks engineer and now co-founder of Data Mechanics, explains the 3 key features that implement our serverless approach to Apache Spark.

It’s Dockerized.

Kubernetes has native support for Docker containers. These containers let you build your dependencies once (on your laptop) and then run your application anywhere in a consistent way: on your laptop for development and testing, or in the cloud on production data.

Using Docker rather than slow init scripts and runtime dependency downloads makes your Spark applications more stable and cost-efficient. With the proper optimizations, Docker can shorten your Spark development cycle so that it takes less than 30 seconds from making a change to your code to having it deployed on our platform.
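As a minimal sketch (the image name and Kubernetes endpoint are hypothetical), this is what pointing Spark-on-Kubernetes at a pre-built Docker image can look like, so that dependencies are baked into the image rather than installed at runtime:

```python
from pyspark.sql import SparkSession

# Hypothetical cluster endpoint and image name, shown for illustration only.
spark = (
    SparkSession.builder
    .master("k8s://https://my-cluster-endpoint:6443")
    .config("spark.kubernetes.container.image", "my-registry/my-spark-app:1.0")
    .getOrCreate()
)
```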

Spark & Docker Development Iteration Cycle

It’s in autopilot mode. 

We believe a “managed” Spark platform should do more than simply start virtual machines when you request them. It should actually take the infrastructure management burden off its users. Hence our platform dynamically and automatically adjusts the most important infrastructure parameters and Spark configurations: cluster size, instance types, disk types, level of parallelism, memory management, shuffle configurations, and so on. This makes Spark 2x more stable and cost-efficient, as we illustrated at the 2019 Spark Summit with a customer success story.
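If your platform does not tune these settings for you, here is an illustrative sketch of enabling just one of them, autoscaling (dynamic allocation) for Spark on Kubernetes; the executor bounds are example values, and shuffle tracking is required because Kubernetes has no external shuffle service:

```python
from pyspark.sql import SparkSession

# Illustrative autoscaling configuration (bounds are example values).
spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```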

It’s priced on actual Spark compute time.

Competing Spark platforms charge a fee based on server uptime (“using this type of instance for an hour will cost you $0.40”). This fee is due whether the machine is actually running Spark computations or sitting idle because of a configuration mistake.

At Data Mechanics, we only charge our customers when their machines are used to run Spark computations, not based on overall server uptime. This gives us an incentive to manage your infrastructure efficiently and remove wasted compute resources. Book a meeting with us if you’d like us to assess how this could reduce your costs.

The third point is specific to our platform, but the first two recommendations, Dockerization and some autopilot features (autoscaling at least), may be available on your Spark platform.

Make Spark more accessible with a new monitoring UI

The serverless approach can speed up your iteration cycle by 10x and cut your costs by 3x. But if your Spark code has a bug or your data isn’t partitioned the right way, it’s still up to the developer to fix it. Today the only monitoring tool available for these challenges is the Spark UI (screenshot below), but it’s cumbersome and unintuitive:

  • Too much information is displayed. It’s hard to spot where the application spends most of its time and where the bottleneck is.
  • It lacks critical metrics around memory usage, CPU usage, and I/O.
  • The Spark History Server (necessary to access the Spark UI after an application has finished) is hard to set up (see the configuration sketch after this list).
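For reference, here is a minimal sketch of the Spark-side configuration a History Server needs in order to replay a finished application; the event-log location (an S3 path here) is a hypothetical example, and the History Server itself still has to be deployed and pointed at the same directory:

```python
from pyspark.sql import SparkSession

# Illustrative event-log configuration; the bucket path is hypothetical.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://my-bucket/spark-events/")
    .getOrCreate()
)
```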
Native Spark UI

To solve this problem and make Spark more accessible to beginners and experts alike, the Data Mechanics team is developing a new monitoring tool to replace the Spark UI: Data Mechanics Delight. Delight will be made available to the wider Spark community (not just Data Mechanics customers) for free. You will be able to install it on any Spark platform (in the cloud or on-premise) by downloading an open-source agent which runs inside the Spark driver and streams metrics to the Data Mechanics backend.
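As a sketch of what installing such an agent could look like, it plugs into Spark’s standard listener mechanism; the package coordinates, listener class, and token configuration key below are assumptions, so refer to the Delight documentation for the exact values:

```python
from pyspark.sql import SparkSession

# Assumed package coordinates, listener class, and token key; shown only to
# illustrate how an agent attaches to the driver via spark.extraListeners.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "co.datamechanics:delight_2.12:latest-SNAPSHOT")
    .config("spark.extraListeners", "co.datamechanics.delight.DelightListener")
    .config("spark.delight.accessToken.secret", "<your-access-token>")
    .getOrCreate()
)
```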

Here’s a screenshot of what Delight looks like. It consists of an overview screen that gives developers a bird’s-eye view of their application, with key metrics and high-level recommendations. New metrics around CPU and memory usage will help Spark developers understand their application’s resource profile and bottlenecks.

Data Mechanics Delight UI

Update (November 2020): The first milestone of Data Mechanics Delight has been released. It consists of a free, hosted, and partially open-source Spark History Server that works on top of any Spark platform. For now it only serves the Spark UI, but starting with the next release (January 2021) we will begin replacing the Spark UI with the new screens described above.

Conclusion: The Future of Apache Spark

We hope this article has given you concrete recommendations to help you be more successful with Spark, or advice to help you get started! Quantmetry and Data Mechanics are available to help you on this journey.

To conclude, let’s go over some of the exciting work currently happening in Apache Spark.

  • Spark 3.0 (June 2020) has brought 2x performance gains on average through optimizations such as Adaptive Query Execution and Dynamic Partition Pruning. Dynamic allocation (autoscaling) is now available for Spark on Kubernetes. New Pandas UDFs and Python type hints also make PySpark development easier (see the sketch after this list).
  • Spark 3.1 (December 2020) will declare Spark-on-Kubernetes officially generally available and production-ready thanks to further stability and performance fixes, accelerating the transition from YARN-based platforms to Kubernetes-based platforms (like Data Mechanics). More performance optimizations are also coming, such as filter pushdown for the JSON and Avro file formats.
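As an example of the new Pandas UDF style in Spark 3.0, here is a minimal sketch using Python type hints; the function and column name are made up for illustration:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Spark 3.0-style Pandas UDF declared with Python type hints.
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

# Usage on a Spark DataFrame with a (hypothetical) "temp_f" column:
# df.select(fahrenheit_to_celsius("temp_f"))
```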

