Apache Spark logo, surrounded by logos of Kafka, object stores, SQL databases, and more

Over 1,000 contributors from 250 orgs

Spark is the most popular open-source distributed computing engine for big data analysis.

Used by data engineers and data scientists alike in thousands of organizations worldwide, Spark is the industry standard analytics engine for big data processing and machine learning. Spark enables you to process data at lightning speed for both batch and streaming workloads.

Spark can run on Kubernetes, YARN, or standalone - and it works with a wide range of data inputs and outputs.

A vast range of support

The Spark Ecosystem

Spark makes it easy to start working with distributed computing systems.

With core API support for multiple languages, plus native libraries for streaming, machine learning, graph processing, and SQL, the Spark ecosystem offers some of the most extensive capabilities and features of any big data technology.

Third-party contributions make Spark even easier to use and more versatile.

Apache Spark Ecosystem: Streaming, MLlib, GraphX, SparkSQL & Dataframes, Spark Core API.

Why choose Spark?

The advantages of working with Apache Spark

Whether you’re working on ETL & Data Engineering jobs, machine learning & AI applications, doing exploratory data analysis (EDA), or any combination of the three - Spark is right for you.


1. Lightning Fast

Spark processes data in memory and optimizes query execution for efficient parallelism, giving it an edge over other big data tools. For some workloads, Spark can run up to 100x faster than Hadoop MapReduce.


2. Flexible

Spark developers enjoy the flexibility of a full programming language (such as Python or Scala), unlike pure SQL frameworks. This lets them express complex business logic and insert custom code, even as the code base grows.


3. Versatile

Spark is packaged with higher-level libraries that enable data engineering, machine learning, streaming, and graph processing use cases. Spark also comes with connectors to efficiently read from and write to most data storage systems.


4. Cost Effective

Spark is one of the most cost-effective solutions for big data processing. By separating the compute infrastructure from cloud storage (following the data lake architecture), Spark can scale its resources automatically based on the load.


5. Multilingual

Spark has APIs in Python, Scala, R, SQL, and Java. The open-source Koalas library also makes it easy to convert pure Python workloads built on the pandas library into Spark. This makes Apache Spark easy to adopt for developers from most backgrounds.


6. Easy to use

In a few lines of code, data scientists and engineers can build complex applications and let Spark handle the scale. Platforms like Data Mechanics automate the management and maintenance of the infrastructure so that developers can focus on their application code.
