Tutorial: Running PySpark inside Docker containers

October 12, 2021

In this article, we’ll show you how to run PySpark applications inside Docker containers, through a step-by-step tutorial with code examples (see GitHub).

There are multiple motivations for running Spark applications inside Docker containers; we covered them in our article “Spark & Docker - Your Dev Workflow Just Got 10x Faster”:

  • Docker containers simplify the packaging and management of dependencies like external java libraries (jars) or python libraries that help with data processing or connect to external data storage. Adding or upgrading a library can break your pipeline (e.g. because of conflicts). Using Docker means that you can catch this failure locally at development time, fix it, and then publish your image with the confidence that the jars and the environment will be the same, wherever your code runs.
  • Docker containers are also a great way to develop and test Spark code locally, before running it at scale in production on your cluster (for example a Kubernetes cluster). 

At Data Mechanics we maintain a fleet of Docker images that come with a series of useful libraries built in, like data connectors to data lakes, data warehouses, streaming data sources, and more. You can read more about these images here, and download them for free on Dockerhub.

In this tutorial, we’ll walk you through the process of building a new Docker image from one of our base images, adding new dependencies, and testing the functionality you have installed by using the Koalas library and writing some data to a Postgres database.


For the purpose of this demonstration, we will create a simple PySpark application that reads population density data per country from the public dataset https://registry.opendata.aws/dataforgood-fb-hrsl, applies transformations to find the median population, and writes the results to a Postgres instance. To find the median, we will use Koalas, a Spark implementation of the pandas API.

Starting with Spark 3.2, the pandas API (Koalas) is bundled with open-source Spark. In this tutorial we use Spark 3.1, so we install Koalas ourselves, but from Spark 3.2 onward it works out of the box.

Finding median values using plain Spark can be quite tedious, so we will use Koalas for a concise solution. (Note: to read from the public data source, you will need an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The bucket is public, but AWS requires user creation in order to read from public buckets.)

If you do not have a postgres database handy, you can create one locally using the following steps:

docker pull postgres
docker run -e POSTGRES_PASSWORD=postgres -e POSTGRES_USER=postgres -d -p 5432:5432 postgres

Building the Docker Image


To start, we will build our Docker image using the latest Data Mechanics base image - gcr.io/datamechanics/spark:platform-3.1-latest. There are a few methods to include external libraries in your Spark application. For language specific libraries, you can use a package manager like pip for python or sbt for scala to directly install the library. Alternatively, you can download the library jar into your Docker image and move the jar to /opt/spark/jars. Both methods achieve the same result, but certain libraries or packages may only be available to install with one method.

To install koalas, we will use pip to add the package directly to our python environment. The standard convention is to create a requirements file that lists all of your python dependencies and use pip to install each library into your environment. In your application repo, create a file called requirements.txt, and add the following line - koalas==1.8.1. Copy this file into your Docker image and add the following command RUN pip3 install -r requirements.txt. Now you should be able to import koalas directly into your python code.
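Putting those steps together, the relevant part of the Dockerfile looks roughly like this (a sketch — the assumption is that requirements.txt sits at the root of your application repo):

```dockerfile
FROM gcr.io/datamechanics/spark:platform-3.1-latest

# Copy the dependency list (containing koalas==1.8.1) into the image
# and install the python packages it declares.
COPY requirements.txt .
RUN pip3 install -r requirements.txt
```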

Next, we will use the jar method to install the JDBC driver so that Spark can write directly to Postgres. First, let’s pull the Postgres driver jar into your image. Add the following line to your Dockerfile - RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar - and move the downloaded jar into /opt/spark/jars. When you start your Spark application, the Postgres driver will be on the classpath, and you will be able to write to the database.
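In the Dockerfile, that jar step can be sketched as:

```dockerfile
# Download the Postgres JDBC driver and place it on Spark's classpath.
RUN wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar && \
    mv postgresql-42.2.5.jar /opt/spark/jars/
```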

Developing Your Application Code

Example PySpark Workflow

Finally, we’ll add our application code: a simple Spark job that reads a DataFrame from the public bucket source. Next, we transform the Spark DataFrame by grouping on the country column, casting the population column to a numeric type, and aggregating. We then convert the Spark DataFrame to a Koalas DataFrame and apply the median function (an aggregation not built into Spark). Lastly, we convert the Koalas DataFrame back to a Spark DataFrame and add a date column so that we can use the write.jdbc function and output our data into a SQL table.
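A minimal sketch of that workflow follows. The S3 path, column names, and output table name are assumptions to adapt to the dataset’s actual layout; the connection details match the local Postgres container started earlier.

```python
from datetime import date

import databricks.koalas as ks  # also patches DataFrame.to_koalas()
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("PopulationMedian").getOrCreate()

# Hypothetical path and schema -- check the dataset registry for the
# actual bucket layout and column names.
df = (
    spark.read.csv("s3a://dataforgood-fb-data/csv/", sep="\t", header=True)
    .withColumn("population", F.col("population").cast("double"))
)

# Koalas exposes the pandas API on top of Spark DataFrames, including
# median(), which plain Spark lacks as a built-in aggregate.
kdf = df.to_koalas()
medians = kdf.groupby("country")["population"].median()

# Back to a Spark DataFrame, plus a date column for the output table.
result = medians.to_frame().reset_index().to_spark()
result = result.withColumn("date", F.lit(str(date.today())))

# Write to the local Postgres instance via the JDBC driver we installed.
result.write.jdbc(
    url="jdbc:postgresql://localhost:5432/postgres",
    table="country_median_population",
    mode="overwrite",
    properties={
        "user": "postgres",
        "password": "postgres",
        "driver": "org.postgresql.Driver",
    },
)
```

Running this requires the image built above (for the Koalas package and the JDBC jar), AWS credentials in the environment, and the Postgres container up and listening on port 5432.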

Monitoring and Debugging

Script to build and run the image locally (note: this is a justfile)

To help with running the application locally, we’ve included a justfile with a few helpful commands to get started. 

  • To build your Docker image locally, run just build
  • To run the PySpark application, run just run
  • To access a PySpark shell in the Docker image, run just shell
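A minimal justfile along these lines might look like the following — the image name and the entrypoint arguments of the base image are assumptions to adjust for your setup:

```just
image := "pyspark-docker-example"

# Build the Docker image from the local Dockerfile.
build:
    docker build -t {{image}} .

# Run the PySpark application, forwarding AWS credentials from the host.
run:
    docker run --rm -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
        {{image}} driver local:///opt/application/main.py

# Open an interactive PySpark shell inside the image.
shell:
    docker run -it --rm {{image}} /opt/spark/bin/pyspark
```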

You can also open an interactive shell in the Docker image directly by running docker run -it <image name> /bin/bash. This shell can be used to explore the Docker/Spark environment, as well as to monitor performance and resource utilization.


We hope you found this tutorial useful! If you would like to dive into this code further, or explore other application demos, head to our GitHub examples repo.

Jean-Yves Stephan

