How the United Nations Modernized their Maritime Traffic Data Exploration while cutting costs by 70%

September 21, 2021

Industry: Data Collaboration Platform

Customer: The United Nations Global Platform

Technical Use Case: Migration from EMR, JupyterHub integration, Migration from HBase to S3

Summary: By migrating from HBase and EMR to the Data Mechanics platform, the United Nations reduced their costs by 70% while improving their team's productivity and development experience.

The Background: The mission of the United Nations Global Platform

Under the governance of the UN Committee of Experts on Big Data and Data Science for Official Statistics (UN-CEBD), the Global Platform has built a cloud-service ecosystem to support international collaboration in developing official statistics from new data sources, including big data, using innovative methods, and to help countries measure the Sustainable Development Goals (SDGs) under the 2030 Sustainable Development Agenda.

The Task Team on AIS is a group of participating organisations from across the globe, made up of dozens of statisticians interested in using AIS data (global time-series datasets of vessels' positions and speeds) for official and experimental statistical indicators. The Task Team uses the UN Global Platform to store, manage, and analyse the AIS data, which grows by 300 billion records per year. See an example of their work: the Faster indicators of UK economic activity project.

Each AIS record includes the vessel identifier, vessel type, location, speed, cargo type, and other fields.
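To make those fields concrete, here is a minimal sketch of a single AIS record as a Python dataclass. The field names (`mmsi`, `vessel_type`, `speed_knots`, etc.) are illustrative assumptions based on the fields listed above, not the platform's actual schema.

```python
from dataclasses import dataclass

# Hypothetical shape of one AIS position report; field names are
# illustrative, not the UN Global Platform's actual schema.
@dataclass
class AISRecord:
    mmsi: int            # Maritime Mobile Service Identity (vessel identifier)
    vessel_type: str     # e.g. "cargo", "tanker", "fishing"
    lat: float           # latitude of the reported position
    lon: float           # longitude of the reported position
    speed_knots: float   # speed over ground
    cargo_type: str      # declared cargo category
    timestamp: str       # ISO-8601 time of the position report

record = AISRecord(
    mmsi=235009802,
    vessel_type="cargo",
    lat=51.95,
    lon=1.32,
    speed_knots=12.4,
    cargo_type="containers",
    timestamp="2021-09-21T08:15:00Z",
)
```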


The Challenge: HBase + EMR was hard to manage & expensive

The platform used to rely on an HBase instance for hosting AIS data, and Apache Spark running on the AWS EMR platform for the analysis of this data. The data team at the UN Global Platform faced several challenges with this setup:

  • The EMR cluster was oversized (except during peak loads) and its autoscaling capabilities were not satisfactory, leading to high costs. 
  • The cluster would become unstable when tens of users ran competing queries concurrently. 
  • The Python libraries available to end users were limited, as the process to install additional ones was complex. 
  • Keeping years of historical data in an HBase instance was costly. 
  • Managing HBase added significant operational complexity.

The high cost and lack of flexibility of this system prompted the search for a better solution. After a successful POC, Data Mechanics was selected as a partner in this journey. 

The Solution: 70% lower costs and a better user experience with Data Mechanics

The new platform architecture: Apache Spark running on EKS, S3 as the data source, Jupyter notebooks hosted on JupyterHub as the main interface.

Apache Spark now runs on a Kubernetes (EKS) cluster managed by Data Mechanics. End users submit programmatic jobs through the API for batch processing, and connect Jupyter notebooks (hosted on JupyterHub) for interactive data exploration.

  • Each user gets their own set of resources (Spark driver, Spark executors), which are isolated from others and automatically scale up and down based on the load.
  • Users can install new libraries in a self-service way, without impacting others, by adding them to their Docker images. The set of libraries that can be supported in the Docker images is virtually unlimited, enabling better analysis and reporting for all needs.
  • The Data Mechanics platform enabled significant cost reductions through fast cluster autoscaling capabilities and additional code performance optimizations. 
  • Using S3 as the main data source reduced costs further and added flexibility in storage management, while preserving deep historical coverage.
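In production this kind of analysis runs as Spark on EKS reading AIS data from S3, but the core aggregation pattern behind a maritime-traffic indicator can be sketched in plain Python. The example below (counting distinct vessels seen per port area per month) is illustrative only; the field names and sample data are hypothetical, not the Task Team's actual schema or indicators.

```python
from collections import defaultdict

# Illustrative AIS position reports; field names and values are hypothetical.
reports = [
    {"mmsi": 1, "month": "2021-08", "port_area": "Felixstowe"},
    {"mmsi": 1, "month": "2021-08", "port_area": "Felixstowe"},
    {"mmsi": 2, "month": "2021-08", "port_area": "Felixstowe"},
    {"mmsi": 3, "month": "2021-09", "port_area": "Rotterdam"},
]

def distinct_vessels(reports):
    """Count distinct vessels observed per (month, port_area)."""
    seen = defaultdict(set)
    for r in reports:
        seen[(r["month"], r["port_area"])].add(r["mmsi"])
    return {key: len(mmsis) for key, mmsis in seen.items()}

print(distinct_vessels(reports))
# {('2021-08', 'Felixstowe'): 2, ('2021-09', 'Rotterdam'): 1}
```

At the 300-billion-records-per-year scale described above, the same group-by / distinct-count logic would be expressed as a Spark query over Parquet files in S3 rather than an in-memory loop.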

by Jean-Yves Stephan
