June 23, 2020

Here's why we're building Data Mechanics Delight

"The Spark UI is my favorite monitoring tool" — said no one ever.

The Apache Spark UI causes a lot of frustration. We keep hearing it over and over from Spark beginners and experts alike:

  • "It's hard to understand what's going on"
  • "Even if there's critical information, it's buried behind a lot of noisy information that only experts know how to navigate"
  • "There's a lot of tribal knowledge involved"
  • "The Spark history server is a pain to set up"

At Data Mechanics, we have the crazy ambition of replacing the Spark UI and Spark History Server with something delightful. We'll call it Data Mechanics Delight*.

*Note: The project was initially named "Spark Delight", but we changed the name to comply with Apache Software Foundation rules on the usage of the word "Spark".

We plan to make it work on top of any Spark platform, entirely free of charge. We have started prototyping it. Before we move to production, we'd like to get feedback and gauge interest from the community. If you find this project interesting, share this article and fill out the form at the bottom of this page, and we'll email you when Data Mechanics Delight is out.

What's wrong with the current Apache Spark UI?

Spark UI Jobs Page
The familiar Spark UI (jobs page)
It's hard to get the bird's eye view of what is going on.
  • Which jobs/stages took most of the time? How do they match with my code?
  • Is there a stability or performance issue that matters?
  • What is the bottleneck of my app (I/O bound, CPU bound, Memory bound)?
The Spark UI lacks essential node metrics (CPU, Memory and I/O usage).
  • You can go without them, but you'll be walking in the dark. Changing an instance type will be a leap of faith.
  • Or you'll need to set up a separate metrics monitoring system: Ganglia, Prometheus + Grafana, StackDriver (GCP), or CloudWatch (AWS).
    You'll need to jump back and forth between this monitoring system and the Spark UI, trying to match the timestamps between the two (usually jumping between UTC and your local timezone, to increase the fun).
The Spark History Server (rendering the Spark UI after an application is finished) is hard to set up.
  • You need to persist Spark event logs to long-term storage and often run the server yourself, incurring costs and a maintenance burden.
  • It takes a long time to load, and it sometimes crashes.

What does the Data Mechanics Delight UI look like?

A picture is worth a thousand words, so here's a GIF of our prototype:

Data Mechanics Delight - Better Spark UI
Our prototype for a Spark UI replacement. Let us know your feedback!

What is new about it?

The main screen (overview) has a lot of new information and visuals.

Summary statistics
Data Mechanics Delight - Summary Statistics - Better Spark UI

This section shows the duration of the app, the amount of resources used (CPU uptime), and the total duration of all the Spark tasks (which should be close to your CPU uptime unless you suffer from bad parallelism or long phases of driver-only work or idleness).
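To make the comparison between task time and CPU uptime concrete, here is a minimal sketch. The `Executor` record and the sample numbers are hypothetical and are not taken from the product; the sketch only shows how the ratio could be computed from per-executor uptime and the summed task durations.

```python
from dataclasses import dataclass


@dataclass
class Executor:
    cores: int
    uptime_s: float  # seconds the executor was alive (hypothetical field)


def cpu_uptime_s(executors):
    """Total core-seconds available: sum of cores * uptime per executor."""
    return sum(e.cores * e.uptime_s for e in executors)


def task_time_ratio(total_task_time_s, executors):
    """Fraction of the available core-seconds spent running Spark tasks."""
    uptime = cpu_uptime_s(executors)
    return total_task_time_s / uptime if uptime else 0.0


# Made-up app: 10 executors with 8 cores each, alive for 10 minutes.
executors = [Executor(cores=8, uptime_s=600.0) for _ in range(10)]
total_task_time_s = 12_000.0  # sum of all task durations (made up)

ratio = task_time_ratio(total_task_time_s, executors)
print(f"CPU uptime: {cpu_uptime_s(executors):.0f} core-seconds")
print(f"Task time / CPU uptime: {ratio:.0%}")
```

A ratio far below 100% points to the situations mentioned above: poor parallelism, or long stretches where only the driver is working.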

Recommendations
Data Mechanics Delight - Recommendations - Better Spark UI

This section pinpoints stability and performance issues at a high level to help developers address them. Examples:

  • "The default number of tasks (200) is too small compared to the number of CPU cores (400) available. Increase spark.sql.shuffle.partitions to 1200."
  • "Job 4 suffers from an input data skew. Consider repartitioning your data or salting the partition key."
  • "An executor crashed due to an OutOfMemory error in stage 7 with 85% of the memory used by Python. Consider increasing the memory available per executor or setting spark.executor.pyspark.memory."

This section builds upon the capability of the Data Mechanics platform to automatically tune infrastructure parameters and Spark configurations (e.g. instance type, memory/CPU allocations, configs for parallelism, shuffle, I/O) for each pipeline running on it based on its history. This high-level feedback will complement the serverless features of the platform by helping developers understand and optimize their application code.
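As an illustration of how a rule like the first example recommendation could work, here is a small sketch. The function name and the factor of 3 tasks per core are assumptions, inferred only from the numbers in that example (400 cores, 1200 partitions); this is not the actual rule used by the platform.

```python
def recommend_shuffle_partitions(current_partitions, total_cores, tasks_per_core=3):
    """Suggest a value for spark.sql.shuffle.partitions, or None if no change is needed.

    tasks_per_core is an assumed heuristic, not a documented Spark or
    Data Mechanics constant.
    """
    target = total_cores * tasks_per_core
    if current_partitions >= target:
        return None
    return target


suggestion = recommend_shuffle_partitions(current_partitions=200, total_cores=400)
if suggestion is not None:
    print(f"Increase spark.sql.shuffle.partitions to {suggestion}.")
```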

Executors CPU Usage
Data Mechanics Delight - Executors CPU Storage - Better Spark UI

What were your executor CPU cores doing on average? If there is a lot of unused time, maybe your app is overprovisioned. If they spend a lot of time doing shuffles (all-to-all data communications), it's worth checking whether some shuffles can be avoided, or tuning the performance of the shuffle stages. This screen should let you see quickly if your app is I/O bound or CPU bound, and make smarter infrastructure changes accordingly (e.g. use an instance type with more CPUs or faster disks).

What's great about this screen is that you can visually align this information with the different Spark phases of your app. If most of your app's time is spent in a single shuffle-heavy stage, you can spot this in one second and dive into that specific stage in a single click.
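The kind of reading described above can be sketched as a simple classification over a breakdown of executor core time. The category names (`cpu`, `io`, `shuffle`, `idle`), the 50% threshold, and the sample numbers are all hypothetical; the prototype may slice the time differently.

```python
def classify_cpu_breakdown(core_seconds, dominant_share=0.5):
    """Label an app by the category that dominates executor core time."""
    total = sum(core_seconds.values())
    if total <= 0:
        return "no data"
    category, value = max(core_seconds.items(), key=lambda kv: kv[1])
    if value / total < dominant_share:
        return "mixed"  # no single category dominates
    labels = {
        "cpu": "CPU bound",
        "io": "I/O bound",
        "shuffle": "shuffle heavy",
        "idle": "overprovisioned",
    }
    return labels.get(category, "mixed")


# Made-up breakdown of executor core-seconds for one app.
breakdown = {"cpu": 1200.0, "io": 900.0, "shuffle": 5400.0, "idle": 1500.0}
print(classify_cpu_breakdown(breakdown))
```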

Executors Peak Memory Usage
Data Mechanics Delight - Executors Peak Memory Usage - Better Spark UI

This screen shows you the memory usage breakdown for each executor when the total memory consumption was at its peak. You'll immediately be able to see if you came close to the memory limit (narrowly avoiding an OutOfMemory error), or if you have plenty of headroom. Data Mechanics Delight gives you the split between the different types of memory used by the JVM and the Python memory usage. This data is crucial, but as far as we know Spark developers have no easy way to get it today.
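Here is a rough sketch of the check this screen lets you do by eye. The executor names, the memory categories, the 8192 MiB limit, and the 90% warning threshold are made up for illustration.

```python
def peak_memory_report(breakdown_mib, limit_mib, warn_ratio=0.9):
    """Summarize one executor's memory breakdown at its peak."""
    used = sum(breakdown_mib.values())
    return {
        "used_mib": used,
        "headroom_mib": limit_mib - used,
        "near_limit": used / limit_mib >= warn_ratio,
    }


# Hypothetical peak breakdowns (MiB) for two executors.
peaks = {
    "executor-1": {"jvm_heap": 3000, "jvm_non_heap": 500, "python": 4200},
    "executor-2": {"jvm_heap": 2500, "jvm_non_heap": 450, "python": 1800},
}
LIMIT_MIB = 8192  # assumed container memory limit

reports = {name: peak_memory_report(b, LIMIT_MIB) for name, b in peaks.items()}
for name, report in reports.items():
    print(name, report)
```

In this made-up data, executor-1 is within about 500 MiB of its limit, mostly because of Python, while executor-2 has plenty of headroom.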

Stage and Executor Pages
Data Mechanics Delight - Stage & Executor Pages - Better Spark UI

You can then dive in and find similar information at a finer granularity on a Spark stage page or an executor page. For example, on a specific stage page, you'll see graphs showing the distribution of a metric (e.g. duration or input size) over all the tasks in this stage, so you can immediately visually notice if you suffer from skew. Check the animated GIF for a full tour.
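What the distribution graphs let you spot visually can also be approximated with a single number. The sketch below compares the slowest task to the median task; the threshold of 5 and the sample durations are arbitrary assumptions, and the input list must be non-empty.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median (requires a non-empty list)."""
    med = median(values)
    return max(values) / med if med > 0 else float("inf")


def is_skewed(values, threshold=5.0):
    """Flag a stage whose slowest task far exceeds the typical task."""
    return skew_ratio(values) >= threshold


# Made-up task durations (seconds) for two stages.
balanced = [10, 11, 9, 10, 12]
skewed = [10, 11, 9, 10, 12, 95]

print(is_skewed(balanced), is_skewed(skewed))
```

The same function works for any per-task metric, such as input size instead of duration.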

What is NOT new about it?

The Spark UI also has many visuals that work really well. Our goal is not to throw it all away. On the contrary, we would like Data Mechanics Delight to contain as much information as the current Spark UI. We plan to reuse many elements, like the tables for the list of jobs, stages, and tasks, the Gantt chart illustrating tasks scheduled across executors within a stage, the DAG views, and more.

How does debugging with Data Mechanics Delight work in practice?

We'll use two concrete scenarios we recently encountered with some of our customers.

A parallelism issue

One of our customers was running an app with 10 executors with 8 CPU cores each, so the app could run 80 Spark tasks in parallel. But a developer had set the `spark.sql.shuffle.partitions` configuration to 8, so only 8 tasks were generated during shuffles, meaning 90% of the app's resources were unused. This configuration was a mistake (probably set during local development), but this critical issue is completely silent in the current Spark UI unless you know exactly where to look. In Data Mechanics Delight, the issue would be obvious:

Data Mechanics Delight - Executor CPU Usage and Parallelism Issues - Better Spark UI
This graph of average executors CPU usage over time shows that there is a lot of unused capacity due to bad parallelism, particularly during the last few jobs and stages of the app.

This parallelism example might seem contrived, but it's more common than you think. It also happens when the default value (200) is too small compared to the total app capacity, or when the input data is incorrectly partitioned (which takes more than a configuration change to fix).
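The arithmetic behind this scenario is simple enough to write down. The helper below is only an illustration (the function name is ours); it assumes each task occupies one core, which is Spark's default.

```python
def unused_capacity(num_executors, cores_per_executor, num_tasks):
    """Fraction of task slots left idle while a stage with num_tasks tasks runs."""
    slots = num_executors * cores_per_executor
    busy = min(num_tasks, slots)
    return 1 - busy / slots


# The customer scenario: 10 executors x 8 cores, but only 8 shuffle partitions.
print(f"{unused_capacity(10, 8, 8):.0%} of the cores are idle during shuffles")

# The default of 200 partitions on a hypothetical 400-core app.
print(f"{unused_capacity(50, 8, 200):.0%} of the cores are idle during shuffles")
```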

A memory issue

Memory errors are the most common source of crashes in Spark, but they come in different forms. The JVM can get an OutOfMemory error (meaning the heap reached its maximum size and needed to allocate more space, but could not free any even after garbage collection). This can happen for many reasons, such as an imbalanced shuffle, high concurrency, or improper use of caching. Another common memory issue is when a Spark executor is killed (by Kubernetes or by YARN) because it exceeded its memory limit. This happens a lot when using PySpark, because the Spark executor will spawn one Python process per running task.

Data Mechanics Delight - Out Of Memory (OOM) - Better Spark UI
This graph of executors memory breakdown at peak usage shows that Python processes were using most of the allocated container memory.

Few monitoring tools let you see the breakdown between memory usage by the JVM (heap and non-heap) and Python, and yet this information is crucial for stability. In this screenshot, we can see that the executor with the highest memory usage got very close to the limit.

PySpark users are often in the dark when it comes to monitoring memory usage, and we hope this new interface will be useful to them and help them avoid the dreaded OOM-kills.
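To see where the container limit comes from, here is a simplified sketch of how an executor's memory request is typically assembled from `spark.executor.memory`, `spark.executor.memoryOverhead`, and `spark.executor.pyspark.memory`. The default overhead used here (the larger of 384 MiB and 10% of executor memory) is an assumption that matches the commonly documented default; the exact value depends on your Spark version and cluster manager, so check the configuration docs for your setup.

```python
def container_memory_mib(executor_memory_mib, pyspark_memory_mib=0, overhead_mib=None):
    """Approximate memory requested for one executor container, in MiB.

    Assumes overhead defaults to max(384 MiB, 10% of executor memory);
    real defaults vary by Spark version and cluster manager.
    """
    if overhead_mib is None:
        overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + overhead_mib + pyspark_memory_mib


# 4 GiB heap only: Python processes must fit inside the overhead.
print(container_memory_mib(4096))

# Same heap, plus 2 GiB reserved for Python via spark.executor.pyspark.memory.
print(container_memory_mib(4096, pyspark_memory_mib=2048))
```

Without a dedicated Python allowance, the Python workers share the small overhead region, which is one reason PySpark executors get killed for exceeding their limit.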

How does Data Mechanics Delight work? How can I use it?

For technical reasons, we will not implement Data Mechanics Delight directly in Spark open source. But we do plan to make it work on top of any Spark platform, entirely free of charge. The first version (MVP) will only work for completed apps, a few minutes after they have finished running. So the MVP will be more of a Spark History Server replacement than a Spark UI replacement. We hope it will be useful to you nonetheless!

To use it, you'll need to attach an agent (a single JAR) to Spark; we'll provide init scripts to do this automatically. The code for the agent will be open-sourced. The agent will send the Spark event logs to our backend. Once this is set up, no further action is required from you. In the MVP, the agent will probably just print a unique URL at the top of the driver log, and this URL will give you access to Data Mechanics Delight for your app.

Note: Spark event logs are the source of truth for the Spark UI: this is the information the Spark History Server reads to render the Spark UI. They're metadata logs about the tasks run by Spark (e.g. task #123 ran on executor #3 from timestamp t1 until timestamp t2) in a structured format. They do not contain any of the actual data processed by Spark (in particular, they do not contain Personally Identifiable Information). These logs will be automatically deleted by Data Mechanics after a retention period.
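To give a feel for what this metadata looks like, here is a small sketch that reads event log lines and sums task time per executor. Spark writes event logs as one JSON object per line; the sample lines below are hand-written and heavily trimmed (real `SparkListenerTaskEnd` events carry many more fields), and the helper function is ours, not part of the agent.

```python
import json

# Simplified, hand-written sample of event log lines (not real output).
SAMPLE_EVENT_LOG = """\
{"Event": "SparkListenerTaskEnd", "Stage ID": 7, "Task Info": {"Task ID": 123, "Executor ID": "3", "Launch Time": 1592900000000, "Finish Time": 1592900004500}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 7, "Task Info": {"Task ID": 124, "Executor ID": "3", "Launch Time": 1592900001000, "Finish Time": 1592900002500}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 7, "Task Info": {"Task ID": 125, "Executor ID": "5", "Launch Time": 1592900000000, "Finish Time": 1592900003000}}
{"Event": "SparkListenerApplicationEnd", "Timestamp": 1592900010000}
"""


def task_time_by_executor(lines):
    """Sum task durations (ms) per executor ID from event log lines."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # only task-end events carry both timestamps
        info = event["Task Info"]
        duration_ms = info["Finish Time"] - info["Launch Time"]
        executor_id = info["Executor ID"]
        totals[executor_id] = totals.get(executor_id, 0) + duration_ms
    return totals


totals = task_time_by_executor(SAMPLE_EVENT_LOG.splitlines())
print(totals)
```

Note that the log only records which task ran where and when; none of the rows processed by the tasks appear in it.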

Conclusion: we need YOU to make this happen

Data Mechanics is a serverless Spark platform which tunes the infrastructure parameters and Spark configurations automatically for each pipeline running on it, to optimize performance and stability.

Data Mechanics Delight would be a great addition to our platform, as it will give Apache Spark developers the high-level feedback they need to develop, productionize and maintain stable and performant apps at the application code level: for example, understanding when to use caching, or when to repartition the input data because it suffers from skew.

We think this project will greatly simplify Spark monitoring not just for our customers but for the greater Apache Spark community. Please use the form below to let us know of your interest and give us feedback about this project. The more people sign up, the harder we'll work to release it, and you'll be the first to know when it happens. Thanks!
