Migrating from EMR to Spark on Kubernetes with Data Mechanics

April 6, 2021

Industry: Data Integration Platform

Customer: Lingk.io is a data loading, data pipeline, and integration platform built on top of Apache Spark, serving commercial customers, with particular expertise in the education sector. Its visual interface makes it easy to load, deduplicate, and enrich data from dozens of sources, and to promote projects from development to production in a few clicks.


  • Reduce AWS costs
  • Improve customer experience
  • Streamline data team operational work

Technical Use Case: Migration from EMR to Data Mechanics; running low-latency Spark apps submitted through a REST API; autoscaling and performance optimizations

Summary: By migrating from EMR to Data Mechanics, Lingk’s customers now enjoy ~2x faster Spark applications, Lingk’s AWS bill has decreased by 65%, and Lingk’s team can spend less time managing their infrastructure to focus on expanding their Spark-based data integration platform!

The Challenge: EMR is hard to manage & expensive

As a data integration platform, Lingk makes it easy for its customers to run Spark jobs, whether for ad-hoc projects or for automatically scheduled production data pipelines. Apache Spark is therefore core to Lingk’s business.

The data engineering team at Lingk had several challenges working with EMR:

  • Spark apps took 40 seconds to start on average, causing a poor user experience when building data pipelines and integrations.
  • EMR required too much infrastructure management from a devops team with limited Spark experience.
  • The core Spark application was stuck at an earlier version because upgrading Spark to 3.0+ caused unexplained performance regressions.
  • High AWS costs for running customers’ pipelines required a new look at autoscaling.

The Solution: Data Mechanics’ serverless Spark platform

The Data Mechanics platform is deployed on a managed Kubernetes cluster (EKS) inside Lingk’s AWS account. Data Mechanics automatically scales the cluster up and down based on load and tunes the Spark configurations based on historical data. Instead of HDFS, S3 is used for intermediate storage, with optimized S3 committers keeping writes fast.
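As an illustration of what an optimized S3 committer setup looks like (these are standard settings from Spark's hadoop-cloud module, not necessarily Lingk's exact configuration), the S3A "magic" committer can be enabled with a few Spark properties:

```properties
# Enable the S3A "magic" committer, which avoids slow and
# non-atomic rename-based commits on S3
spark.hadoop.fs.s3a.committer.name             magic
spark.hadoop.fs.s3a.committer.magic.enabled    true

# Route Spark SQL output through the cloud-aware commit protocol
# (requires the spark-hadoop-cloud module on the classpath)
spark.sql.sources.commitProtocolClass          org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class       org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter
```

The key benefit is that task output is uploaded as incomplete multipart uploads and only materialized at job commit, rather than written to a temporary location and renamed, since "rename" on S3 is a copy-and-delete.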

Lingk + Data Mechanics Integration Architecture

Lingk’s team no longer has to manage clusters; they simply submit dockerized Spark apps through the Data Mechanics REST API and enjoy a serverless experience. The team retains control over the Docker images used by Spark, which brings 3 additional benefits:

  1. Applications start more quickly - as all dependencies are baked into the Docker image.
  2. The CI/CD flow is simpler - A Docker image is built automatically when a PR is merged.
  3. The Docker image includes the Spark distribution itself (there is no global Spark version), which means all applications can efficiently run on the same cluster, and it was easy to gradually upgrade to Spark 3.0.
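To make the submission flow concrete, here is a minimal sketch in Python. The endpoint path, payload field names, and image/file names are illustrative assumptions for this example, not the documented Data Mechanics API schema; consult the platform's API reference for the real shape.

```python
import json
from urllib import request

# Hypothetical endpoint and token -- placeholders for illustration only.
API_URL = "https://datamechanics.example.com/api/apps"
API_TOKEN = "REDACTED"

def build_app_payload(job_name, image, main_file, args):
    """Build a JSON body describing one dockerized Spark app.

    The field names below are assumptions for this sketch.
    """
    return {
        "jobName": job_name,
        "configOverrides": {
            "type": "Python",
            "image": image,                  # Docker image with Spark + deps baked in
            "mainApplicationFile": main_file,
            "arguments": args,
        },
    }

payload = build_app_payload(
    job_name="dedupe-contacts",
    image="registry.example.com/lingk/pipeline:1.4.2",
    main_file="local:///opt/app/main.py",
    args=["--source", "s3a://my-bucket/raw/"],
)

req = request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_TOKEN}",
    },
    method="POST",
)
# request.urlopen(req)  # fire the actual submission (disabled in this sketch)
```

Because the image pins both the application dependencies and the Spark distribution itself, the same POST works unchanged whether the app runs Spark 2.4 or 3.0, which is what made the gradual upgrade possible.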

The automated configuration tuning enabled several performance optimizations:

  1. Optimizing container sizes to maximize bin-packing on (newer-generation) instances
  2. Tuning the number of partitions for optimal parallelism (many jobs suffered from too many small files)
  3. Enabling dynamic allocation to add Spark executors on demand and speed up long-running pipelines significantly (a 5x speedup for their 99th-percentile longest apps!)
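The three levers above map onto standard Spark settings along these lines. The values shown are illustrative only, not Lingk's actual tuned configuration:

```properties
# 1. Container sizing: fit executors tightly onto the chosen instance type
spark.executor.cores                               4
spark.executor.memory                              7g

# 2. Parallelism: match shuffle partitions to the data volume,
#    avoiding a flood of tiny tasks and small output files
spark.sql.shuffle.partitions                       64

# 3. Dynamic allocation: grow long pipelines, release idle executors.
#    On Kubernetes there is no external shuffle service, so shuffle
#    tracking (Spark 3.0+) is used instead.
spark.dynamicAllocation.enabled                    true
spark.dynamicAllocation.minExecutors               1
spark.dynamicAllocation.maxExecutors               20
spark.dynamicAllocation.shuffleTracking.enabled    true
```

The point of automated tuning is that these values are not static: they are re-derived per application from the history of previous runs, rather than hand-set once per cluster.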

The Results: A big win for end-users, for the team, and for the wallet!

The migration from EMR to Data Mechanics was a big win:

  • In terms of end-user experience, the Spark application startup time was halved, and the average app duration decreased by 40%.
  • In terms of costs, the AWS costs were reduced by over 65%. The total cost of ownership for Lingk (including Data Mechanics management fee) was reduced by 33%.

Lingk was also able to gradually upgrade Spark to 3.0, an upgrade made easy by the Spark-on-Kubernetes architecture and its native dockerization. The team at Lingk can now confidently expand their data integration platform toward new ambitious use cases.

“Leveraging Data Mechanics Spark expertise and platform decreases cost while letting us sleep well at night and achieve the plans we dream about.”
Dale McCrory, Co-Founder & Chief Product Officer at Lingk.
Thanks to the migration, Lingk's AWS costs decreased by 65%, the application startup time was halved, and the average app duration decreased by 40%.

Jean-Yves Stephan

