SparkFE is an LLVM-based, high-performance native execution engine for Spark, designed for feature engineering.

Last update: Jun 10, 2021

Overview

Introduction

Spark has rapidly emerged as the de facto standard for big data processing. However, it was not designed for machine learning, and its limitations in AI scenarios are increasingly apparent. SparkFE rewrites the execution engine in C++ and achieves more than 6x performance improvement for feature extraction. It also guarantees online-offline consistency, which makes putting AI applications into production much easier. For further details, please refer to the SparkFE documentation.

Features

High Performance

Thanks to LLVM-based optimization, SparkFE achieves more than 6x performance improvement in some AI scenarios. The same applications finish in less computing time, which lowers the total cost of ownership (TCO).

No Migration Cost

Using SparkFE does not require modifying or recompiling your SparkSQL applications. Just set SPARK_HOME to the SparkFE distribution and you get the performance benefit of the native execution engine.

Optimized for Machine Learning

SparkFE provides a customized join type and UDFs/UDAFs for machine learning scenarios, which meet the requirements of feature engineering in production environments.

Online-Offline Consistency

With FEDB and SparkFE, machine learning applications that use SQL for feature engineering can be deployed without additional development. The native execution engine guarantees online-offline consistency and greatly reduces the cost of putting AI into production.

Upstream First

SparkFE will be compatible with Spark 3.0 and later versions. All functions will be kept in sync with upstream, and SparkFE can fall back to vanilla Spark in some special scenarios.
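
Because switching engines comes down to changing SPARK_HOME, a small guard can catch a mistyped path before a job is submitted. The following is a minimal sketch, not part of SparkFE: use_spark_home is a hypothetical helper name, and the only assumption is that a usable distribution contains an executable bin/spark-submit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SparkFE): export SPARK_HOME only if
# the directory looks like a Spark distribution, i.e. it contains an
# executable bin/spark-submit.
use_spark_home() {
  local dir="${1%/}"   # drop one trailing slash, if present
  if [ ! -x "$dir/bin/spark-submit" ]; then
    echo "no executable spark-submit under: $dir" >&2
    return 1
  fi
  export SPARK_HOME="$dir"
}
```

For example, `use_spark_home /spark-3.0.0-bin-sparkfe/ && "$SPARK_HOME/bin/spark-submit" ...` submits only when the path is valid; on failure SPARK_HOME is left unchanged.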

Performance

SparkFE delivers significant performance improvements in most AI scenarios. Here are some of the benchmark results.

You can verify the results in your environment with the following steps.

# Start a container from the official image
docker run -it 4pdosc/sparkfe bash

# Fetch the benchmark code
git clone https://github.com/4paradigm/SparkFE.git
cd ./SparkFE/benchmark/taxi_tour_multiple_window/

# Download and unpack the test data
wget http://103.3.60.66:8001/sparkfe_resources/taxi_tour_parquet.tar.gz
tar xzvf ./taxi_tour_parquet.tar.gz

# Baseline: run the job with vanilla Spark
export SPARK_HOME=/spark-3.0.0-bin-hadoop2.7/
./submit_spark_job.sh

# Run the same job with SparkFE
export SPARK_HOME=/spark-3.0.0-bin-sparkfe/
./submit_spark_job.sh
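
To compare the two runs numerically, the job can be wrapped in a simple wall-clock timer. This is a rough sketch under stated assumptions: time_job is a hypothetical helper, not part of the benchmark scripts, and it measures whole seconds only, so it includes JVM startup and is no substitute for a proper benchmark harness.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with SPARK_HOME set to the given
# directory and print the elapsed wall-clock time in whole seconds.
# The command's own output is discarded; its exit status is returned.
time_job() {
  local home="$1"; shift
  local start end status=0
  start=$(date +%s)
  SPARK_HOME="$home" "$@" >/dev/null 2>&1 || status=$?
  end=$(date +%s)
  echo $((end - start))
  return "$status"
}

# Usage, with the paths from the steps above:
#   vanilla=$(time_job /spark-3.0.0-bin-hadoop2.7/ ./submit_spark_job.sh)
#   native=$(time_job /spark-3.0.0-bin-sparkfe/ ./submit_spark_job.sh)
#   echo "vanilla Spark: ${vanilla}s, SparkFE: ${native}s"
```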

QuickStart

Use Docker Image

Run with the official SparkFE docker image.

docker run -it 4pdosc/sparkfe bash

Run standard Spark commands; inside the image they use SparkFE for acceleration by default.

$SPARK_HOME/bin/spark-submit \
  --master local \
  --class org.apache.spark.examples.sql.SparkSQLExample \
  $SPARK_HOME/examples/jars/spark-examples*.jar

Use SparkFE Distribution

Download the pre-built package from the Releases page, then run the Spark commands.

tar xzvf ./spark-3.0.0-bin-sparkfe.tgz

export SPARK_HOME="$(pwd)/spark-3.0.0-bin-sparkfe/"

$SPARK_HOME/bin/spark-submit \
  --master local \
  --class org.apache.spark.examples.sql.SparkSQLExample \
  $SPARK_HOME/examples/jars/spark-examples*.jar

Contribution

You can use the official docker image for development.

docker run -it 4pdosc/sparkfe bash

git clone --recurse-submodules git@github.com:4paradigm/SparkFE.git
cd ./SparkFE/sparkfe/

Build the sparkfe module from scratch.

Operating System	Compile Command	Notes
Linux	mvn clean package	Supports CentOS 6, Ubuntu and other Linux distros
macOS	mvn clean package -Pmacos	Supports macOS Big Sur and later versions
All in one	mvn clean package -Pallinone	Supports Linux and macOS at the same time
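
The table above can be turned into a small selector that picks the Maven command from the host operating system. This is only an illustrative sketch: build_command is a hypothetical helper, and it covers just the Linux and macOS rows, leaving the all-in-one profile as an explicit choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Maven command from the table above for a
# given OS name as reported by `uname -s` (defaults to the current host).
build_command() {
  local os="${1:-$(uname -s)}"
  case "$os" in
    Linux)  echo "mvn clean package" ;;
    Darwin) echo "mvn clean package -Pmacos" ;;
    *)      echo "unsupported OS: $os" >&2; return 1 ;;
  esac
}

# Usage: print the command for this machine.
#   build_command
```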

Build the SparkFE distribution from scratch.

cd ../spark/

./dev/make-distribution.sh --name sparkfe --pip --tgz -Phadoop-2.7 -Pyarn

Roadmap

SQL Compatibility

SparkFE is already compatible with most SparkSQL applications. In the future, we may improve compatibility with ANSI SQL and lower the migration cost for developers.

[2021 H1&H2] Support more window types, as well as WHERE and GROUP BY with complex expressions.
[2021 H1&H2] Support more SQL syntax and UDF/UDAF functions for AI scenarios.

Performance Improvement

SparkFE achieves significant performance improvements with C++ and LLVM. We will reduce the cost of cross-language invocation and support heterogeneous hardware in the future.

[2021 H1] Support multiple encoding formats and be compatible with the Spark UnsafeRow memory layout.
[2021 H1] Automatically optimize window computing and table joins on skewed data.
[2021 H1] Integrate the optimization passes for Native LastJoin, which is used in AI scenarios.
[2021 H2] Support a column-based memory layout throughout the whole process, which may reduce the overhead of reading and writing files and enable CPU vectorization.
[2021 H2] Support heterogeneous computing hardware.

Ecosystem Integration

SparkFE is currently compatible with the Spark ecosystem. We may integrate with more open-source systems to meet the requirements of production environments.

[2021 H2] Integrate with multiple versions of Spark and provide pre-built packages.

License

Apache License 2.0

Releases(v0.1.1)

v0.1.1(Apr 16, 2021)

v0.1.0-snapshot(Mar 26, 2021)

Owner

4Paradigm