Cost-Effective Weather Analytics At Scale with Cloud-Native Apache Spark

January 13, 2021

Industry: Weather Analytics

Customer: Weather2020 is a predictive weather analytics company providing long-range forecasts and weather analytics to decision makers across industries like agriculture, retail, energy, insurance, and pyrotechnics. 

Business Use Case: Building a new data product.

Technical Use Case: PySpark ETL, Modeling and Predictions on Time-Series and Geospatial data, Large backfills and automated data ingestion with Airflow

Summary: In 3 weeks, Weather2020's data engineering team built Apache Spark pipelines that ingest terabytes of weather data to power their core product. Data Mechanics' performance optimizations and pricing model lowered their costs by 60% compared to Databricks, the main alternative they considered.

The Challenge

Weather2020 extracts weather data from public agencies around the globe in a variety of formats, including industry-specific formats that are ill-suited to big data tooling. They work with meteorological and geospatial time-series data spanning more than 40 years.

They needed pipelines to pull this data, clean it, enrich it, aggregate it, and store it in a cloud-based data lake in a cost-effective way. The data is then ready to be consumed by multiple downstream data products:

  1. Long-range weather forecasts powered by Spark-based predictive analytics
  2. Real-time dashboard (powered by SparkSQL)
  3. Data delivery pipelines in a custom format required by their customers

Weather2020’s team had solid data engineering skills and custom knowledge around extracting and modeling weather data, but they did not have any prior experience with Apache Spark.

The Solution

EMR required too much setup and maintenance work. We didn’t want to spend our time writing bash scripts to manage and configure it. Databricks felt like a casino. It didn’t seem like the right product for our technical team, and their steep pricing ruled them out for us.
Max, Lead Data Engineer @ Weather2020.

Data Mechanics' Cloud-Native Spark Platform lowered the barrier to entry for Apache Spark by making it more developer-friendly, while minimizing infrastructure costs thanks to its autopilot features.

  • Automated infrastructure management: The cluster dynamically scales based on load and adjusts its infrastructure parameters and Spark configs to optimize performance*.
  • Native Containerization: Weather2020 built their own Docker images to simplify the packaging of PySpark code and its complex libraries (with Cython and C dependencies).
  • Airflow Integration: Airflow is deployed on the same Kubernetes cluster (EKS) as Data Mechanics, and Weather2020 uses it to schedule their daily pipelines.

*Enabling dynamic allocation and using i3 instances with large SSDs brought the most significant performance improvements given the scale of the pipelines (shuffle-heavy jobs processing TBs of data). 
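For reference, the two optimizations called out above map to a handful of standard Spark properties — roughly the following (values are illustrative; the platform tunes these automatically):

```
# Dynamic allocation on Kubernetes (shuffle tracking stands in
# for the external shuffle service)
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.shuffleTracking.enabled  true
spark.dynamicAllocation.maxExecutors             40

# Spill shuffle data to the i3 instances' local NVMe SSDs
# (mount point is a placeholder)
spark.local.dir                                  /mnt/nvme/spark-scratch
```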

The Results

Deliver Projects Faster: It took only 3 weeks to build and productionize terabyte-scale data ingestion pipelines, without prior experience with Apache Spark.

Keep Costs In Check: Performance optimizations incentivized by a fair pricing structure achieved a 60% reduction in total cost of ownership compared to Databricks.

A Flexible and Scalable Architecture: The Data Mechanics Spark Platform is deployed on a managed, autoscaled Kubernetes (EKS) cluster inside Weather2020's AWS account.

My expectations were high but Data Mechanics exceeded them. The developer experience when working with their platform has been really good. Their world-class team and the level of support we get from them are impressive.
Max, Lead Data Engineer @ Weather2020.
Jean-Yves Stephan


