Tutorial: Run your R (SparklyR) workloads at scale with Spark-on-Kubernetes

November 24, 2021

R is a programming language for statistical computing, widely used among statisticians and data scientists. Running R applications on a single machine was sufficient for a long time, but it becomes a limiting factor as data volumes grow and analyses get more advanced.

That’s why the R community developed sparklyr, which scales data engineering, data science, and machine learning with Apache Spark. It supports the main Apache Spark use cases (batch, streaming, ML and graph, SQL) and integrates with well-known R packages such as dplyr, DBI, and broom. More information can be found on sparklyr.ai.


The problem is that the integration between sparklyr and Apache Spark is brittle: it’s hard to get the right mix of libraries and environment setup. One of our customers tried to get this to work on EMR and described it as “a nightmare”. In contrast, by building their own Docker images and running them on our Spark-on-Kubernetes platform, they were able to make their SparklyR setup work reliably.

So let’s see how to get your SparklyR applications running at scale using Spark-on-Kubernetes! All of the code for this tutorial is available in this GitHub repository.

Requirements

You must first build a Docker image. This is the most difficult part, but we did it for you!

The following Dockerfile uses one of our published images as a base; see this blog post and our Docker Hub repository for more details on these images.

Dockerfile
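
A rough sketch of what this Dockerfile can look like. The base image tag, system packages, and file path below are illustrative placeholders rather than the exact content of the repository's Dockerfile, and the sketch assumes a Debian/Ubuntu-based base image:

    FROM datamechanics/spark:<pick-a-tag-from-our-Docker-Hub-repository>

    USER root

    # Install R plus littler, which provides the install2.r helper, and the
    # system libraries commonly needed to compile tidyverse packages
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            r-base r-base-dev littler \
            libcurl4-openssl-dev libssl-dev libxml2-dev && \
        ln -s /usr/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r && \
        rm -rf /var/lib/apt/lists/*

    # Tune your R packages here
    RUN install2.r --error sparklyr arrow tidyverse

    # Copy the application code into the image
    COPY RExamples.R /opt/application/RExamples.R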


You can tune your packages in the RUN install2.r section. The tidyverse contains many well-known packages such as dplyr and ggplot2.

Once your image is built and pushed to your registry, it contains all your dependencies and takes only a few seconds to load when you run your applications.

Develop your SparklyR application

We will show you a few code samples to get started. You can find more examples in the sparklyr GitHub repository.

There are two critical topics:

  • Creating the Spark Session
  • Understanding when your R object is an interface to a Spark DataFrame or to a local R data frame.

Experienced sparklyr developers can look at the Spark session creation and then skip directly to the section Run your Spark application at scale.

Create the Spark Session 

See source file
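
The source file boils down to a standard sparklyr connection. A minimal sketch, assuming most of the Spark configuration is supplied by the platform at submission time (the master URL and application name below are placeholders to adapt):

    library(sparklyr)

    # Build a Spark configuration; extra settings (memory, shuffle partitions, ...)
    # can be added to this object before connecting
    config <- spark_config()

    # Connect to Spark. The master URL is a placeholder: when running at scale it is
    # typically provided by the platform / spark-submit rather than hard-coded.
    sc <- spark_connect(
      master   = "local[*]",
      app_name = "sparklyr-example",
      config   = config
    )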


Create a Spark DataFrame

The sparklyr copy_to function returns a reference to the generated Spark DataFrame as a tbl_spark. The returned object will act as a dplyr-compatible interface to the underlying Spark table.

See source file.
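
As an illustration (the dataset used in the source file may differ), copying the built-in iris data frame to Spark:

    library(dplyr)

    # Copy a local R data frame into Spark; the result is a tbl_spark
    # referencing the Spark table named "iris"
    iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

    # Printing the object runs a small Spark query for the first rows;
    # it does not download the whole table into R
    iris_tbl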


List available Spark tables
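
A sketch of the call; src_tbls() comes from dplyr and works on a sparklyr connection (DBI::dbListTables(sc) is an equivalent alternative):

    # List the tables currently registered in the Spark session
    src_tbls(sc)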


Use dplyr (see documentation)

See source file.
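
A sketch of typical dplyr verbs running against the Spark table (note that sparklyr replaces dots in column names with underscores, so Petal.Length becomes Petal_Length):

    library(dplyr)

    # The dplyr pipeline is translated to Spark SQL and executed in the cluster;
    # collect() brings the small aggregated result back as a local tibble
    iris_tbl %>%
      group_by(Species) %>%
      summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
      arrange(Species) %>%
      collect()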


Apply an R function to a Spark DataFrame

See source file.
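
spark_apply() distributes an R function over the partitions of a Spark DataFrame: each partition is handed to the function as a plain R data frame, which is why R and your packages must be installed in the Docker image used by the executors. A sketch:

    # Add a computed column by running an R function on every partition
    iris_tbl %>%
      spark_apply(function(partition_df) {
        # partition_df is an ordinary R data.frame holding one partition
        partition_df$Petal_Area <- partition_df$Petal_Length * partition_df$Petal_Width
        partition_df
      })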


Cache

Spark DataFrames can be explicitly cached and uncached:
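
For example, using the table name registered earlier (assumed to be "iris"):

    # Cache the table in cluster memory, then release it when no longer needed
    tbl_cache(sc, "iris")
    tbl_uncache(sc, "iris")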


Query Spark tables with SQL
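
A sparklyr connection implements the DBI interface, so plain Spark SQL can be run against the registered tables. A sketch:

    library(DBI)

    # Execute a SQL query in Spark and return the result as a local R data frame
    species_counts <- dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")
    species_counts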


Writing and Reading Parquet
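
A sketch of the round trip; the path below is a placeholder and would normally point to object storage (s3a://, gs://, wasbs://, ...):

    # Write the Spark DataFrame to Parquet, then read it back as a new tbl_spark
    spark_write_parquet(iris_tbl, path = "/tmp/iris-parquet", mode = "overwrite")

    iris_parquet <- spark_read_parquet(sc, name = "iris_parquet", path = "/tmp/iris-parquet")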

Creating Plots

See source file.
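
Plotting happens in R on the driver, so the data must first be aggregated in Spark and collected. A sketch with ggplot2 that saves the figure to a local file:

    library(dplyr)
    library(ggplot2)

    # Aggregate in Spark, collect the small result, then plot locally on the driver
    plot_data <- iris_tbl %>%
      group_by(Species) %>%
      summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
      collect()

    p <- ggplot(plot_data, aes(x = Species, y = mean_petal_length)) +
      geom_col()

    # Save the plot to a file inside the driver container
    ggsave("/tmp/petal_length.png", plot = p)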


Various R packages, such as cloudyr or AzureStor, can then be used to copy the resulting file to cloud storage.

Don't forget to end the Spark Session
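
In sparklyr this is a single call:

    # Release the cluster resources held by the application
    spark_disconnect(sc)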

Run your Spark application at scale

You must first define a Data Mechanics configuration through a template or a configOverride.

Example Data Mechanics configuration (JSON)
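
The exact schema is described in the Data Mechanics documentation; the sketch below only illustrates the kind of fields involved, and the image name, file path, and sizing values are placeholders:

    {
      "configOverrides": {
        "type": "R",
        "image": "<your-registry>/<your-sparklyr-image>:latest",
        "mainApplicationFile": "local:///opt/application/RExamples.R",
        "driver": {
          "cores": 1,
          "memory": "2g"
        },
        "executor": {
          "instances": 4,
          "cores": 2,
          "memory": "4g"
        }
      }
    }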


The code to be executed is in the file RExamples.R, which was copied into the Docker image. Other ways to package your applications are documented here.

The Data Mechanics platform allows you to monitor and optimize your Spark applications with Delight, our new and improved Spark UI and History server.

Screenshot from the Delight UI (this project works for free on top of any Spark platform).


Conclusion

Special thanks go to our customer running SparklyR workloads on the Data Mechanics platform for sharing their tricks and setup. We hope this tutorial will help you be successful with Spark and R!

by Jean-Yves Stephan
