Our Optimized Spark Docker Images Are Now Available

April 20, 2021


Today we’re excited to publicly release our optimized Docker images for Apache Spark. They can be freely downloaded from our DockerHub repository, whether you’re a Data Mechanics customer or not.

This is the result of a lot of work from the Data Mechanics team to ensure that we can: 

  • Build the right combinations of Docker images to serve our customers' needs - with various versions of Spark, Python, Scala, Java, Hadoop, and all the popular data connectors
  • Automatically test them across various workloads, to ensure the included dependencies work together (in other words, to save you from “dependency hell”).

Our philosophy is to provide high-quality Docker images that come “with batteries included”: you can get started and do your work with all the common data sources supported by Spark. We hope these images will just work for you, out of the box.

We will maintain this fleet of images over time, keeping them up to date with the latest versions and bug fixes of Spark and the various built-in dependencies.

Have you ever had your containers fail in production due to dependency issues? We hope to save you from this.

What’s a Docker Image for Spark?

When you run Spark on Kubernetes, the Spark driver and executors are Docker containers. These containers use an image specifically built for Spark, which contains the Spark distribution itself (Spark 2.4, 3.0, or 3.1). This means that the Spark version is not a global cluster property, as it is for YARN clusters.
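
The Spark version is therefore chosen per application, simply by pointing at a different image. As a sketch (the master URL, namespace, and image name below are placeholders, not real endpoints; `spark.kubernetes.container.image` is the standard Spark-on-Kubernetes setting):

```shell
# Command sketch -- master URL, namespace, and image name are placeholders.
spark-submit \
  --master k8s://https://<your-apiserver-host>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.namespace=spark-apps \
  --conf spark.kubernetes.container.image=my-registry/my-spark-app:latest \
  --conf spark.executor.instances=2 \
  local:///opt/application/main.py
```

Switching Spark versions is then just a matter of changing the image tag, with no cluster-wide upgrade required.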

You can also use Docker images to run Spark locally. For example, you can run Spark in a driver-only mode (in a single container), or run Spark on Kubernetes on a local minikube cluster. Many of our users choose to do this during development and testing.


[Diagram: Docker development workflow]
Using Docker will speed up your development workflow and give you fast, reliable, and reproducible production deployments.

To learn more about the benefits of using Docker for Spark, and see the concrete steps to use Docker in your development workflow, check out our article: “Spark and Docker: Your development cycle just got 10x faster!”.

What’s in these optimized Docker Images?

They contain the Spark distribution itself - from open-source code, without any proprietary modifications.

They come built-in with connectors to common data sources:

  • AWS S3 (s3a:// scheme)
  • Google Cloud Storage (gs:// scheme)
  • Azure Blob Storage (wasbs:// scheme)
  • Azure Data Lake Storage Gen1 (adl:// scheme)
  • Azure Data Lake Storage Gen2 (abfss:// scheme)
  • Snowflake
  • Delta Lake
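
Credentials for these connectors are typically passed through Spark's Hadoop configuration. As a sketch, here are the standard property keys for the s3a, gs, and wasbs connectors - the account names, key values, and file paths are placeholders to adapt to your environment:

```
# spark-defaults.conf sketch -- account names, keys, and paths are placeholders.

# AWS S3 (s3a:// scheme)
spark.hadoop.fs.s3a.access.key                               <your-access-key>
spark.hadoop.fs.s3a.secret.key                               <your-secret-key>

# Google Cloud Storage (gs:// scheme)
spark.hadoop.google.cloud.auth.service.account.enable        true
spark.hadoop.google.cloud.auth.service.account.json.keyfile  /path/to/key.json

# Azure Blob Storage (wasbs:// scheme)
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net  <account-key>
```

On cloud-hosted clusters, instance-level identities (e.g. IAM roles) can often replace the static keys above.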

They also come built-in with Python & PySpark support, as well as pip and conda, so that it's easy to install additional Python packages. (If you don't need PySpark, you can use the lighter images with the tag prefix ‘jvm-only’.)

Finally, each image uses a combination of the versions from the following components:

  • Apache Spark: 2.4.5 to 3.1.1
  • Apache Hadoop: 3.1 or 3.2
  • Java: 8 or 11
  • Scala: 2.11 or 2.12
  • Python: 3.7 or 3.8

Note that not all possible combinations exist; check out our DockerHub page to find the available tags.

[Image: Optimized Docker Images for Spark]
Our images include connectors to GCS, S3, Azure Data Lake, Delta, and Snowflake, as well as support for Python, Java, Scala, Hadoop and Spark!

How To Use Our Spark Docker Images

Update (October 2021): See our step-by-step tutorial on how to build an image and get started with it using our boilerplate template!

You should use our Spark Docker images as a base, and then build your own images by adding your code dependencies on top. Here’s a Dockerfile example to help get you started:


[Screenshot: Dockerfile to build a custom Spark image]
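
The screenshot above isn't reproduced here, so the sketch below is a minimal, hypothetical equivalent. The base image tag and file names are assumptions - pick a real tag from our DockerHub page and adjust the paths to your project:

```dockerfile
# Hypothetical Dockerfile sketch -- base image tag and file names are placeholders.
FROM datamechanics/spark:3.1.1-latest

# Install your Python dependencies on top of the base image (pip is built in).
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy your application code to the path referenced at submit time.
COPY main.py /opt/application/main.py
```

The path /opt/application/main.py matches the local:/// URI used in the docker run command below.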


Once you’ve built your Docker image, you can run it locally by running: docker run {{image_name}} driver local:///opt/application/main.py {args} 

Or you can push your newly built image to a Docker registry that you own, then use it on your production k8s cluster!

Do not directly pull our DockerHub images from your production cluster in an unauthenticated way, as you risk hitting rate limits. It's best to push your image to your own registry, or to purchase a paid plan from DockerHub.

Data Mechanics users can directly use the images from our documentation. They have higher availability and a few additional capabilities exclusive to Data Mechanics, like Jupyter support.

Conclusion - We hope these images will be useful to you

Are these images working well for you? Do you need new connectors or versions to be added? Let us know, we'd love your feedback. 

Are you interested in getting a trial of the Data Mechanics platform to test the benefits of a containerized Spark platform powered by Kubernetes, deployed in your cloud account? Schedule a demo with us and we'll show you how to get started.

by Jean-Yves Stephan

