Overview

Introduction

SparkFE is an LLVM-based, high-performance native execution engine for Spark, designed for feature engineering.

Spark has rapidly emerged as the de facto standard for big data processing. However, it was not designed for machine learning, and its limitations are increasingly apparent in AI scenarios. SparkFE rewrites the execution engine in C++ and achieves more than 6x performance improvement for feature extraction. It also guarantees online-offline consistency, which makes deploying AI applications to production much easier. For further details, please refer to the SparkFE documentation.

Architecture

Features

  • High Performance

    With LLVM-based optimization, SparkFE achieves more than 6x performance improvement in some AI scenarios. The same applications finish in less computing time, which lowers the total cost of ownership (TCO).

  • No Migration Cost

    Using SparkFE does not require modifying or recompiling your SparkSQL applications. Just point SPARK_HOME at the SparkFE distribution and you get the performance benefit of the native execution engine.

  • Optimized For Machine Learning

    SparkFE provides customized join types and UDFs/UDAFs for machine learning scenarios, which meet the requirements of feature engineering in production environments.

  • Online-Offline Consistency

    With FEDB and SparkFE, machine learning applications that use SQL for feature engineering can be deployed online without additional development. The native execution engine guarantees online-offline consistency and greatly reduces the cost of bringing AI applications into production.

  • Upstream First

    SparkFE will be compatible with Spark 3.0 and later versions. All functions will be kept in sync with upstream, and SparkFE can fall back to vanilla Spark in some special scenarios.

Performance

SparkFE delivers significant performance improvements in most AI scenarios. Here is part of the benchmark results.

Benchmark

You can verify the results in your environment with the following steps.

docker run -it 4pdosc/sparkfe bash

git clone https://github.com/4paradigm/SparkFE.git 
cd ./SparkFE/benchmark/taxi_tour_multiple_window/

wget http://103.3.60.66:8001/sparkfe_resources/taxi_tour_parquet.tar.gz
tar xzvf ./taxi_tour_parquet.tar.gz

export SPARK_HOME=/spark-3.0.0-bin-hadoop2.7/
./submit_spark_job.sh

export SPARK_HOME=/spark-3.0.0-bin-sparkfe/
./submit_spark_job.sh
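
To turn the two runs above into a single number, the comparison can be scripted. The following is a minimal sketch and not part of the SparkFE repository: `elapsed_seconds` and `speedup` are hypothetical helper names, and timing uses whole seconds from `date +%s`, so it is only meaningful for jobs that run longer than a few seconds.

```shell
# Hypothetical helpers for comparing two benchmark runs (not shipped with SparkFE).

# Run a command, discard its output, and print the elapsed wall-clock seconds.
# Returns non-zero without printing anything if the command itself fails.
elapsed_seconds() {
  local start end
  start=$(date +%s)
  "$@" >/dev/null 2>&1 || return 1
  end=$(date +%s)
  echo $((end - start))
}

# Print baseline/accelerated as a ratio with two decimals, or "n/a" if the
# accelerated time rounds down to zero seconds.
speedup() {
  awk -v base="$1" -v fast="$2" \
    'BEGIN { if (fast > 0) printf "%.2f\n", base / fast; else print "n/a" }'
}

# Usage inside benchmark/taxi_tour_multiple_window/ (paths taken from the steps above):
#   export SPARK_HOME=/spark-3.0.0-bin-hadoop2.7/
#   base=$(elapsed_seconds ./submit_spark_job.sh)
#   export SPARK_HOME=/spark-3.0.0-bin-sparkfe/
#   fast=$(elapsed_seconds ./submit_spark_job.sh)
#   speedup "$base" "$fast"
```

A single run includes JVM start-up and file-system cache effects, so repeating each job a few times and comparing the best or median time is likely to give a fairer picture.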

QuickStart

Use Docker Image

Run with the official SparkFE docker image.

docker run -it 4pdosc/sparkfe bash

Run standard Spark commands as usual. Inside the image they use SparkFE for acceleration by default.

$SPARK_HOME/bin/spark-submit \
  --master local \
  --class org.apache.spark.examples.sql.SparkSQLExample \
  $SPARK_HOME/examples/jars/spark-examples*.jar

Use SparkFE Distribution

Download the pre-built package from the Releases page, then run the Spark commands.

tar xzvf ./spark-3.0.0-bin-sparkfe.tgz

export SPARK_HOME=`pwd`/spark-3.0.0-bin-sparkfe/

$SPARK_HOME/bin/spark-submit \
  --master local \
  --class org.apache.spark.examples.sql.SparkSQLExample \
  $SPARK_HOME/examples/jars/spark-examples*.jar
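
Because the only switch between vanilla Spark and SparkFE is the value of SPARK_HOME, a quick sanity check before submitting can catch a mistyped path. This is a minimal sketch under the assumption that a valid distribution contains an executable `bin/spark-submit`; `check_spark_home` is a hypothetical helper, not something the SparkFE package provides.

```shell
# Hypothetical helper (not shipped with SparkFE): verify that a directory looks
# like a Spark distribution before submitting jobs with it.
# Checks the first argument if given, otherwise the SPARK_HOME variable.
check_spark_home() {
  local home="${1:-${SPARK_HOME:-}}"
  if [ -z "$home" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$home/bin/spark-submit" ]; then
    echo "no executable spark-submit under $home/bin" >&2
    return 1
  fi
  echo "using Spark distribution at $home"
}

# Usage:
#   check_spark_home && "$SPARK_HOME/bin/spark-submit" --master local ...
```

The check only confirms that the directory layout is plausible. It does not tell a SparkFE build apart from a vanilla Spark build.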

Contribution

You can use the official docker image for development.

docker run -it 4pdosc/sparkfe bash

git clone --recurse-submodules git@github.com:4paradigm/SparkFE.git
cd ./SparkFE/sparkfe/

Build the sparkfe module from scratch.

| Operating System | Compile Command | Notice |
| --- | --- | --- |
| Linux | mvn clean package | Supports CentOS 6, Ubuntu and other Linux distros |
| macOS | mvn clean package -Pmacos | Supports macOS Big Sur and later versions |
| All in one | mvn clean package -Pallinone | Supports Linux and macOS at the same time |
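
The platform-specific rows above can be selected automatically from the kernel name that `uname -s` reports (`Linux` or `Darwin`). The following is a minimal sketch: `build_command` is a hypothetical helper, it only prints the Maven command from the table rather than running it, and the all-in-one profile is left as a manual choice.

```shell
# Hypothetical helper (not part of the SparkFE build): map a kernel name, as
# printed by `uname -s`, to the matching Maven command from the table above.
build_command() {
  case "$1" in
    Linux)  echo "mvn clean package" ;;
    Darwin) echo "mvn clean package -Pmacos" ;;
    *)      echo "unsupported platform: $1" >&2; return 1 ;;
  esac
}

# Usage, from the sparkfe/ module directory:
#   cmd=$(build_command "$(uname -s)") && $cmd
```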

Build the SparkFE distribution from scratch.

cd ../spark/

./dev/make-distribution.sh --name sparkfe --pip --tgz -Phadoop-2.7 -Pyarn

Roadmap

SQL Compatibility

SparkFE is already compatible with most SparkSQL applications. In the future, we plan to improve ANSI SQL compatibility and lower the migration cost for developers.

  • [2021 H1&H2] Support more window types, and Where and GroupBy clauses with complex expressions.
  • [2021 H1&H2] Support more SQL syntax and more UDFs/UDAFs for AI scenarios.

Performance Improvement

SparkFE achieves significant performance improvements with C++ and LLVM. In the future, we will reduce the cost of cross-language invocation and support heterogeneous hardware.

  • [2021 H1] Support multiple encoding formats and be compatible with the Spark UnsafeRow memory layout.
  • [2021 H1] Automatically optimize window computation and table joins on skewed data.
  • [2021 H1] Integrate the optimization passes for Native LastJoin, which is used in AI scenarios.
  • [2021 H2] Support a column-based memory layout throughout the whole process, which may reduce the overhead of reading and writing files and enable CPU vectorization.
  • [2021 H2] Support heterogeneous computing hardware.

Ecosystem Integration

SparkFE is currently compatible with the Spark ecosystem. We may integrate with more open-source systems to meet the requirements of production environments.

  • [2021 H2] Integrate with multiple versions of Spark and provide pre-built packages.

License

Apache License 2.0
