Industry: Data Integration Platform
Customer: Lingk.io is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. Their visual interface makes it easy to load, deduplicate and enrich data from dozens of sources, and promote projects from development to production in a few clicks.
- Reduce AWS costs
- Improve Customer Experience
- Streamline data team operational work.
Technical Use Case: Migration from EMR to Data Mechanics, Run Low-Latency Spark apps submitted through a REST API, Autoscaling & Performance optimizations
Summary: By migrating from EMR to Data Mechanics, Lingk’s customers now enjoy ~2x faster Spark applications, Lingk’s AWS bill has decreased by 65%, and Lingk’s team can spend less time managing their infrastructure to focus on expanding their Spark-based data integration platform!
The Challenge: EMR is hard to manage & expensive
As a data integration platform, Lingk makes it easy for its customers to run Spark jobs, whether it’s for ad-hoc projects or for automatically scheduled production data pipelines, making Apache Spark core to Lingk’s business.
The data engineering team at Lingk had several challenges working with EMR:
- Spark apps would take 40 seconds to start on average, and cause a poor user experience when building data pipelines and integrations.
- EMR required too much infrastructure management for their devops team with limited Spark experience.
- The core Spark application was stuck at an earlier version because upgrading Spark to 3.0+ caused unexplained performance regressions.
- High AWS costs for running customers’ pipelines required a new look at autoscaling.
The Solution: Data Mechanics’ serverless Spark platform
The Data Mechanics platform is deployed on a managed Kubernetes cluster (EKS) inside Lingk’s AWS account. Data Mechanics automatically scales the cluster up-and-down based on load and tunes the Spark configurations based on historical data. Instead of HDFS, S3 is used for intermediate storage, with fast access guaranteed using optimized S3 committers.
Lingk’s team does not have to manage clusters anymore, they just submit dockerized Spark apps through the Data Mechanics REST API and enjoy a serverless experience. The team has control over the docker images used by Spark, which brings 3 additional benefits:
- Applications start more quickly - as all dependencies are baked in the Docker image.
- The CI/CD flow is simpler - A Docker image is built automatically when a PR is merged.
- The Docker image includes the Spark distribution itself (there is no global Spark version), which means all applications can efficiently run on the same cluster, and it was easy to gradually upgrade to Spark 3.0.
The automated configuration tuning enabled several performance optimizations:
- Container sizes optimization to maximize bin-packing on (newer generation) instances
- Tuning the number of partitions for optimal parallelism (many jobs suffered from too many small files)
- Dynamic allocation to give additional Spark executors and speed up long-running pipelines significantly (5x speedup for their 99th-percentile longest apps!).
The Results: A big win for end-users, for the team, and for the wallet!
The migration from EMR to Data Mechanics was a big win:
- In terms of end-user experience, the Spark application startup time was halved, and the average app duration decreased by 40%.
- In terms of costs, the AWS costs were reduced by over 65%. The total cost of ownership for Lingk (including Data Mechanics management fee) was reduced by 33%.
Lingk was also able to gradually upgrade Spark to 3.0, which was made easy by the new Spark-on-Kubernetes architecture which enabled native dockerization. The team at Lingk can now confidently expand their data integration platform toward new ambitious use cases.
“Leveraging Data Mechanics Spark expertise and platform decreases cost while letting us sleep well at night and achieve the plans we dream about."
Dale McCrory, Co-Founder & Chief Product Officer at Lingk.