Our Optimized Spark Docker Images Are Now Available

April 13, 2021


Today we’re excited to publicly release our optimized Docker images for Apache Spark. They can be freely downloaded from our DockerHub repository, whether you’re a Data Mechanics customer or not.

This is the result of a lot of work from the Data Mechanics team to ensure that we can: 

  • Build a combination of Docker images to serve our customers' needs - with various versions of Spark, Python, Scala, Java, Hadoop, and all the popular data connectors
  • Automatically test them across various workloads, to ensure the included dependencies work together (in other words, to save you from “dependency hell”).

Our philosophy is to provide high-quality Docker images that come “with batteries included”: you can get started and do your work with all the common data sources supported by Spark. We hope these images will just work for you, out of the box. 

We will maintain this fleet of images over time, keeping them up to date with the latest versions and bug fixes of Spark and the various built-in dependencies.

Have you ever had containers break in production due to dependency issues? We hope to save you from this.

What’s a Docker Image for Spark?

When you run Spark on Kubernetes, the Spark driver and executors are Docker containers. These containers use an image specifically built for Spark, which contains the Spark distribution itself (e.g. Spark 2.4, 3.0, or 3.1). This means the Spark version is not a global cluster property, as it is on YARN clusters.
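Concretely, the image is chosen per application at submission time through a Spark configuration key. A sketch of a submission (the cluster endpoint and image tag below are placeholders):

```shell
# Submit a Spark application to a Kubernetes cluster.
# The driver and executor pods will both run the specified image.
spark-submit \
  --master k8s://https://<cluster-endpoint>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=datamechanics/spark:<tag> \
  local:///opt/application/main.py
```

Because the image is set per application, two applications with different Spark versions can run side by side on the same cluster.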

You can also use Docker images to run Spark locally. For example, you can run Spark in a driver-only mode (in a single container), or run Spark on Kubernetes on a local minikube cluster. Many of our users choose to do this during development and testing. 
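For example, a local driver-only session can be started in a single container like this (the image tag is a placeholder, and the path assumes the usual /opt/spark install location):

```shell
# Start an interactive Spark shell inside a single local container,
# with Spark running in local mode (no cluster needed).
docker run -it datamechanics/spark:<tag> \
  /opt/spark/bin/spark-shell --master "local[*]"
```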


Docker Development Workflow Diagram
Using Docker will speed up your development workflow and give you fast, reliable, and reproducible production deployments.

To learn more about the benefits of using Docker for Spark, and to see the concrete steps to use Docker in your development workflow, check out our article: “Spark and Docker: Your development cycle just got 10x faster!”.

What’s in these optimized Docker Images?

They contain the Spark distribution itself - from open-source code, without any proprietary modifications.

They come built-in with connectors to common data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Datalake generation 1 (adl:// scheme)
  • Azure Datalake generation 2 (abfss:// scheme)
  • Snowflake
  • Delta Lake

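Once you have a SparkSession running on one of these images, the bundled connectors let you address each store directly by its URI scheme. A quick illustration (bucket, container, and account names are placeholders):

```python
# Assumes an existing SparkSession `spark` running on one of these images.
# All bucket/container/account names below are placeholders.
df = spark.read.parquet("s3a://my-bucket/events/")      # AWS S3
df = spark.read.json("gs://my-bucket/logs/")            # Google Cloud Storage
df = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/data/"  # ADLS gen2
)

# Delta Lake tables work through the standard DataFrame API
df.write.format("delta").save("s3a://my-bucket/delta-table/")
```

Credentials still need to be configured for each cloud provider (e.g. through instance roles or Hadoop configuration); the images ship the connector libraries, not the secrets.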
They also come built-in with Python & PySpark support, as well as pip and conda so that it's easy to install additional Python packages. (If you don't need PySpark, you can use the lighter images with the ‘jvm-only’ tag prefix.)

Finally, each image uses a combination of the versions from the following components:

  • Apache Spark: 2.4.5 to 3.1.1
  • Apache Hadoop: 3.1 or 3.2
  • Java: 8 or 11
  • Scala: 2.11 or 2.12
  • Python: 3.7 or 3.8

Note that not all possible combinations exist; check out our DockerHub page to find the available ones.

Optimized Docker Images for Spark
Our images include connectors to GCS, S3, Azure Data Lake, Delta Lake, and Snowflake, as well as support for Python, Java, Scala, Hadoop, and Spark!

How To Use Our Spark Docker Images

You should use our Spark Docker images as a base, and then build your own images by adding your code dependencies on top. Here’s a Dockerfile example to help get you started:


Dockerfile to build a custom Spark image

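A minimal sketch of such a Dockerfile (the base tag, requirements file, and main.py below are illustrative - pick the tag combination you need from our DockerHub page):

```dockerfile
# Start from one of the published Spark images (tag is illustrative).
FROM datamechanics/spark:3.1.1-hadoop-3.2.0-java-8-scala-2.12-python-3.8-latest

# Install your extra Python dependencies on top of the base image.
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Add your application code where the driver expects to find it.
WORKDIR /opt/application
COPY main.py .
```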

Once you’ve built your Docker image, you can run it locally by running: docker run {{image_name}} driver local:///opt/application/main.py {args} 

Or you can push your newly built image to a Docker registry that you own, then use it on your production k8s cluster!

Do not pull our DockerHub images directly from your production cluster in an unauthenticated way, as you risk hitting rate limits. It's best to push your image to your own registry, or to purchase a paid plan from DockerHub. 

Data Mechanics users can directly use the images from our documentation. These have higher availability and a few additional capabilities exclusive to Data Mechanics, such as Jupyter support.

Conclusion - We hope these images will be useful to you

Are these images working well for you? Do you need new connectors or versions to be added? Let us know, we'd love your feedback. 

Are you interested in getting a trial of the Data Mechanics platform to test the benefits of a containerized Spark platform powered by Kubernetes, deployed in your cloud account? Schedule a demo with us and we'll show you how to get started.

by Jean-Yves Stephan
