Third-Party Projects

This page tracks external software projects that supplement Apache Spark and add to its ecosystem.

Popular libraries with PySpark integrations

great-expectations - Always know what to expect from your data
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
xgboost - Scalable, portable and distributed gradient boosting
shap - A game theoretic approach to explain the output of any machine learning model
python-deequ - Measures data quality in large datasets
datahub - Metadata platform for the modern data stack
dbt-spark - Enables dbt to work with Apache Spark
Hamilton - Lets you describe PySpark transformations declaratively, which helps keep code testable, modular, and logically visualizable.
ScaleDP - An Open-Source Library for Processing Documents using AI/ML in Apache Spark.

Connectors

spark-redshift - Performant Redshift data source for Apache Spark
spark-sql-connector - Apache Spark Connector for SQL Server and Azure SQL
azure-cosmos-spark - Apache Spark Connector for Azure Cosmos DB
azure-event-hubs-spark - Enables continuous data processing with Apache Spark and Azure Event Hubs
azure-kusto-spark - Apache Spark connector for Azure Kusto
mongo-spark - The MongoDB Spark connector
couchbase-spark-connector - The Official Couchbase Spark connector
spark-cassandra-connector - DataStax connector for Apache Spark to Apache Cassandra
elasticsearch-hadoop - Elasticsearch real-time search and analytics natively integrated with Spark
neo4j-spark-connector - Neo4j Connector for Apache Spark
starrocks-connector-for-apache-spark - StarRocks Apache Spark connector
tispark - TiSpark is built for running Apache Spark on top of TiDB/TiKV
spark-pdf - PDF Datasource for Apache Spark
spark-connector-oceanbase - Apache Spark Connectors for OceanBase
lance-spark - Apache Spark connector for Lance datasets
spark-clickhouse-connector - Apache Spark connector for ClickHouse

Open table formats

Delta Lake - Storage layer that provides ACID transactions and scalable metadata handling for Apache Spark workloads
Hudi - Upserts, deletes, and incremental processing on big data
Iceberg - Open table format for analytic datasets
Lance - Modern columnar data format for ML and LLMs

Infrastructure projects

Kyuubi - Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses
REST Job Server for Apache Spark - REST interface for managing and submitting Spark jobs on the same cluster.
Apache Mesos - Cluster management system that supports running Spark
Alluxio (née Tachyon) - Memory-speed virtual distributed storage system that supports running Spark
FiloDB - A Spark-integrated analytical/columnar database with an in-memory option capable of sub-second concurrent queries
Zeppelin - Multi-purpose notebook which supports 20+ language backends, including Apache Spark
Kubeflow Spark Operator - Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
IBM Spectrum Conductor - Cluster management software that integrates with Spark and modern computing frameworks.
MLflow - Open source platform to manage the machine learning lifecycle, including deploying models from diverse machine learning libraries on Apache Spark.
Apache DataFu - A collection of utilities and user-defined functions for working with large-scale data in Apache Spark and for making Scala-Python interoperability easier.

Applications using Spark

Apache Mahout - Previously on Hadoop MapReduce, Mahout has switched to using Spark as the backend
ADAM - A framework and CLI for loading, transforming, and analyzing genomic data using Apache Spark
TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning
Natural Language Processing for Apache Spark - A library to provide simple, performant, and accurate NLP annotations for machine learning pipelines
Rumble for Apache Spark - A JSONiq engine for querying large, nested, heterogeneous JSON datasets that do not fit in dataframes, using a functional language.
Lightning Catalog - A data catalog for running ad-hoc queries, wrangling data by federating enterprise data assets, and building a unified semantic layer with data quality checks.

Performance, monitoring, and debugging tools for Spark

Data Mechanics Delight - Delight is a free, hosted, cross-platform Spark UI alternative backed by an open-source Spark agent. It features new metrics and visualizations to simplify Spark monitoring and performance tuning.
DataFlint - DataFlint is a Spark UI replacement, installed via an open-source library, that updates in real time and alerts on performance issues.

Additional language bindings

C# / .NET

Mobius - C# and F# language binding and extensions to Apache Spark

Clojure

Geni - A Clojure dataframe library that runs on Apache Spark with a focus on optimizing the REPL experience.

Julia

Spark.jl

Kotlin

Kotlin for Apache Spark

Adding new projects

To add a project, open a pull request against the spark-website repository. Add an entry to this Markdown file, then run jekyll build to regenerate the HTML. Include both the Markdown change and the generated HTML in your pull request. See the README in this repo for more information.

Note that all project and product names should follow trademark guidelines.
