Migrating from EMR to Spark on Kubernetes with Data Mechanics

April 6, 2021

Industry: Data Integration Platform

Customer: Lingk.io is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. Their visual interface makes it easy to load, deduplicate and enrich data from dozens of sources, and promote projects from development to production in a few clicks.

Goals:

  • Reduce AWS costs
  • Improve customer experience
  • Streamline the data team's operational work

Technical Use Case: Migration from EMR to Data Mechanics, low-latency Spark apps submitted through a REST API, autoscaling and performance optimizations

Summary: By migrating from EMR to Data Mechanics, Lingk’s customers now enjoy ~2x faster Spark applications, Lingk’s AWS bill has decreased by 65%, and Lingk’s team can spend less time managing their infrastructure to focus on expanding their Spark-based data integration platform!

The Challenge: EMR is hard to manage & expensive

As a data integration platform, Lingk makes it easy for its customers to run Spark jobs, whether it’s for ad-hoc projects or for automatically scheduled production data pipelines, making Apache Spark core to Lingk’s business.

The data engineering team at Lingk had several challenges working with EMR:

  • Spark apps took 40 seconds to start on average, causing a poor user experience when building data pipelines and integrations.
  • EMR required too much infrastructure management for a devops team with limited Spark experience.
  • The core Spark application was stuck at an earlier version because upgrading Spark to 3.0+ caused unexplained performance regressions.
  • High AWS costs for running customers’ pipelines required a new look at autoscaling.

The Solution: Data Mechanics’ serverless Spark platform

The Data Mechanics platform is deployed on a managed Kubernetes cluster (EKS) inside Lingk’s AWS account. Data Mechanics automatically scales the cluster up and down based on load, and tunes the Spark configurations based on historical data. Instead of HDFS, S3 is used for intermediate storage, with fast access provided by optimized S3 committers.

Lingk + Data Mechanics Integration Architecture


Lingk’s team no longer has to manage clusters: they simply submit dockerized Spark apps through the Data Mechanics REST API and enjoy a serverless experience. The team has control over the Docker images used by Spark, which brings 3 additional benefits:

  1. Applications start more quickly, as all dependencies are baked into the Docker image.
  2. The CI/CD flow is simpler: a Docker image is built automatically when a PR is merged.
  3. The Docker image includes the Spark distribution itself (there is no global Spark version), which means applications using different Spark versions can run efficiently on the same cluster, and it was easy to gradually upgrade to Spark 3.0.
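As a rough illustration of the submission flow described above, here is a minimal Python sketch that builds (but does not send) an HTTP request submitting one dockerized Spark app. The endpoint path, the API-key header name, and every JSON field name are assumptions made for illustration, not the documented Data Mechanics API; the URLs and image name are placeholders.

```python
import json
import urllib.request


def build_submit_request(base_url, api_key, app_name, image, main_file, args=()):
    """Build (but do not send) a POST request that submits one Spark app."""
    # Hypothetical payload shape: these field names are illustrative only.
    payload = {
        "appName": app_name,
        "image": image,  # Docker image with Spark and all dependencies baked in
        "mainApplicationFile": main_file,
        "arguments": list(args),
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/apps/",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-API-Key": api_key,  # hypothetical header name
        },
        method="POST",
    )


request = build_submit_request(
    base_url="https://spark.example.com",  # placeholder host
    api_key="REPLACE_ME",
    app_name="nightly-dedup",
    image="registry.example.com/pipelines:abc123",  # e.g. tagged per merged PR
    main_file="local:///opt/app/main.py",
    args=["--env", "prod"],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Because the image reference travels with each request, two apps submitted back to back could point at images containing different Spark versions, which is what makes a gradual upgrade possible.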

The automated configuration tuning enabled several performance optimizations:

  1. Container size optimization to maximize bin-packing on newer-generation instances.
  2. Tuning the number of partitions for optimal parallelism (many jobs suffered from too many small files).
  3. Dynamic allocation to give long-running pipelines additional Spark executors and speed them up significantly (5x speedup for their 99th-percentile longest apps!).
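To make the second and third optimizations more concrete, the sketch below assembles a set of standard Apache Spark properties for dynamic allocation and shuffle partitioning. The property names are regular Spark settings, but the sizing heuristic (roughly 128 MiB per partition, at least one partition per core) is a common rule of thumb assumed here for illustration. It is not Data Mechanics' tuning algorithm, which the article describes only as being based on historical data.

```python
import math


def suggest_shuffle_partitions(input_bytes, total_cores,
                               target_partition_bytes=128 * 1024**2):
    """Rule of thumb (an assumption): ~128 MiB per partition, at least one per core."""
    by_size = math.ceil(input_bytes / target_partition_bytes)
    return max(by_size, total_cores)


def tuned_spark_conf(input_bytes, executor_cores, min_executors, max_executors):
    """Return Spark properties enabling dynamic allocation with a sized shuffle."""
    partitions = suggest_shuffle_partitions(
        input_bytes, total_cores=executor_cores * max_executors
    )
    return {
        "spark.dynamicAllocation.enabled": "true",
        # On Kubernetes there is no external shuffle service, so Spark 3.0+
        # relies on shuffle tracking to decide which executors can be released.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.executor.cores": str(executor_cores),
        "spark.sql.shuffle.partitions": str(partitions),
    }


# Hypothetical job: 50 GiB of input, executors with 4 cores, up to 20 executors.
conf = tuned_spark_conf(
    input_bytes=50 * 1024**3, executor_cores=4, min_executors=1, max_executors=20
)
submit_flags = [f"--conf {key}={value}" for key, value in conf.items()]
```

With these hypothetical inputs, the size-based estimate (400 partitions) exceeds the core count (80), so the shuffle is sized by data volume rather than by parallelism.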

The Results: A big win for end-users, for the team, and for the wallet!

The migration from EMR to Data Mechanics was a big win:

  • In terms of end-user experience, the Spark application startup time was halved, and the average app duration decreased by 40%.
  • In terms of costs, the AWS costs were reduced by over 65%. The total cost of ownership for Lingk (including Data Mechanics management fee) was reduced by 33%.
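The two cost percentages can be related with a small worked example. The sketch below assumes, purely for illustration, that the previous total cost of ownership consisted only of the AWS bill, and it uses a made-up baseline of 10,000 per month. Neither the baseline nor the resulting fee figure comes from Lingk, and since the article says "over 65%", the numbers are approximate.

```python
def project_costs(baseline_aws_bill, aws_reduction_pct=65, tco_reduction_pct=33):
    """Apply the reported percentage reductions to a hypothetical baseline bill."""
    new_aws_bill = baseline_aws_bill * (100 - aws_reduction_pct) / 100
    new_total_cost = baseline_aws_bill * (100 - tco_reduction_pct) / 100
    return {
        "new_aws_bill": new_aws_bill,
        "new_total_cost": new_total_cost,
        # Whatever is left between the two is attributed to the management fee,
        # under the assumption that the old total cost was the AWS bill alone.
        "implied_management_fee": new_total_cost - new_aws_bill,
    }


costs = project_costs(baseline_aws_bill=10_000)  # hypothetical monthly baseline
```

Under these assumptions, a 10,000 baseline becomes a 3,500 AWS bill and a 6,700 total cost, leaving 3,200 attributable to the management fee.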

Lingk was also able to gradually upgrade Spark to 3.0, a step made easy by the new Spark-on-Kubernetes architecture and its native dockerization. The team at Lingk can now confidently expand their data integration platform toward ambitious new use cases.

“Leveraging Data Mechanics Spark expertise and platform decreases cost while letting us sleep well at night and achieve the plans we dream about.”
Dale McCrory, Co-Founder & Chief Product Officer at Lingk.
Thanks to the migration, Lingk's AWS costs decreased by 65%, the application startup time was halved, and the average app duration decreased by 40%.


by Jean-Yves Stephan
