Firestorm

Overview

What is Firestorm

Firestorm is a Remote Shuffle Service that enables Apache Spark applications to store shuffle data on remote servers.

Architecture

[Figure: Rss Architecture]

Firestorm consists of a coordinator cluster, a shuffle server cluster, and remote storage (e.g., HDFS) if necessary.

The coordinator collects the status of shuffle servers and assigns them to jobs.

Shuffle servers receive shuffle data, merge it, and write it to storage.

Depending on the situation, Firestorm supports Memory & Local, Memory & Remote Storage (e.g., HDFS), Local only, and Remote Storage only.

Shuffle Process with Firestorm

  • The Spark driver asks the coordinator to assign shuffle servers for the shuffle process.

  • Spark tasks write shuffle data to shuffle servers in the following steps (sketched in code after this list):

    [Figure: Rss Shuffle_Write]

    1. Send KV data to a buffer
    2. Flush the buffer to a queue when the buffer or the buffer manager is full
    3. A thread pool takes data from the queue
    4. Request memory from the shuffle server first, then send the shuffle data
    5. The shuffle server caches data in memory first and flushes it to a queue when its buffer manager is full
    6. A thread pool takes data from the queue
    7. Write data to storage as an index file and a data file
    8. After writing the data, the task reports all blockIds to the shuffle server; this is used for data validation later
    9. Store taskAttemptId in MapStatus to support Spark speculation
  • Depending on the storage type, Spark tasks read shuffle data from the shuffle server, remote storage, or both.
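The client-side write path above is essentially a two-stage producer/consumer pipeline. Below is a minimal sketch of steps 1-4 in Java; all class and method names are illustrative assumptions, not Firestorm's actual API:

    import java.io.ByteArrayOutputStream;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch of the client-side write pipeline; names are hypothetical.
    class ShuffleWriterSketch {
        private static final int BUFFER_LIMIT = 3 * 1024 * 1024; // cf. spark.rss.writer.buffer.size
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final BlockingQueue<byte[]> flushQueue = new LinkedBlockingQueue<>();
        private final ExecutorService senders =
            Executors.newFixedThreadPool(10); // cf. spark.rss.client.send.threadPool.size

        interface RemoteShuffleClient {
            void requireMemory(int bytes);      // step 4: pre-allocate memory on the server
            void sendShuffleData(byte[] block); // step 4: ship the block
        }

        // Steps 1-2: accumulate records; hand a full buffer to the flush queue.
        void write(byte[] record) {
            buffer.write(record, 0, record.length);
            if (buffer.size() >= BUFFER_LIMIT) {
                flushQueue.offer(buffer.toByteArray());
                buffer.reset();
            }
        }

        // Steps 3-4: sender threads drain the queue and send blocks to the server.
        void startSender(RemoteShuffleClient client) {
            senders.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        byte[] block = flushQueue.take();
                        client.requireMemory(block.length);
                        client.sendShuffleData(block);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }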

Shuffle file format

Shuffle data is stored as an index file and a data file. The data file holds all blocks for a specific partition, and the index file holds metadata for every block.

[Figure: Rss Shuffle_Write]
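For illustration, an index entry might look like the sketch below; the exact field set and order are an assumption, not taken from the Firestorm source:

    // Hypothetical index entry layout; real fields and their order may differ.
    class IndexEntrySketch {
        long offset;        // where the block starts in the data file
        int length;         // length of the block in the data file
        long crc;           // checksum used to validate the block on read
        long blockId;       // identifies the block for validation and filtering
        long taskAttemptId; // producing task attempt, used for Spark speculation
        // To read a block: seek to `offset` in the data file and read `length` bytes.
    }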

Supported Spark Version

Currently supports Spark 2.3.x, Spark 2.4.x, Spark 3.0.x, and Spark 3.1.x.

Note: To support dynamic allocation, the patch (included in the client-spark/patch folder) should be applied to Spark.

Building Firestorm

Firestorm is built using Apache Maven. To build it, run:

mvn -DskipTests clean package

To package Firestorm for deployment, run:

./build_distribution.sh

A rss-xxx.tgz archive will be generated for deployment.

Deploy

Deploy Coordinator

  1. Unzip the package to RSS_HOME
  2. Update RSS_HOME/bin/rss-env.sh, e.g.,
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="16g"
    
  3. Update RSS_HOME/conf/coordinator.conf, e.g.,
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.coordinator.server.heartbeat.timeout 30000
      rss.coordinator.app.expired 60000
      rss.coordinator.shuffle.nodes.max 5
      rss.coordinator.exclude.nodes.file.path RSS_HOME/conf/exclude_nodes
    
  4. Start the coordinator
     bash RSS_HOME/bin/start-coordinator.sh
    

Deploy Shuffle Server

  1. Unzip the package to RSS_HOME
  2. Update RSS_HOME/bin/rss-env.sh, e.g.,
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="80g"
    
  3. Update RSS_HOME/conf/server.conf; the following example is for local storage only, e.g.,
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.rpc.executor.size 2000
      rss.storage.type LOCALFILE
      rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
      rss.storage.basePath /data1/rssdata,/data2/rssdata....
      rss.server.flush.thread.alive 5
      rss.server.flush.threadPool.size 10
      rss.server.buffer.capacity 40g
      rss.server.read.buffer.capacity 20g
      rss.server.heartbeat.timeout 60000
      rss.server.heartbeat.interval 10000
      rss.rpc.message.max.size 1073741824
      rss.server.preAllocation.expired 120000
      rss.server.commit.timeout 600000
      rss.server.app.expired.withoutHeartbeat 120000
    
  4. Start the shuffle server
     bash RSS_HOME/bin/start-shuffle-server.sh
    

Deploy Spark Client

  1. Add the client jar to the Spark classpath, e.g., SPARK_HOME/jars/

    The jar for Spark2 is located in <RSS_HOME>/jars/client/spark2/rss-client-XXXXX-shaded.jar

    The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar

  2. Update the Spark conf to enable Firestorm; the following example is for local storage only, e.g.,

    spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
    spark.rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
    spark.rss.storage.type MEMORY_LOCALFILE
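
    The same settings can also be applied programmatically; below is a minimal sketch using Spark's standard SparkConf API with the demo values above:

      import org.apache.spark.SparkConf;
      import org.apache.spark.sql.SparkSession;

      public class RssEnabledApp {
          public static void main(String[] args) {
              // Same local-storage-only demo configuration, set via SparkConf.
              SparkConf conf = new SparkConf()
                  .set("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
                  .set("spark.rss.coordinator.quorum", "<coordinatorIp1>:19999,<coordinatorIp2>:19999")
                  .set("spark.rss.storage.type", "MEMORY_LOCALFILE");
              SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
              spark.stop();
          }
      }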
    

Support Spark dynamic allocation

To support Spark dynamic allocation with Firestorm, the Spark code must be patched. Two reference patches, for spark-2.4.6 and spark-3.1.2, are provided in the spark-patches folder.

After applying the patch and rebuilding Spark, add the following configuration to the Spark conf to enable dynamic allocation:

spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled true

Configuration

The important configuration options are listed below.

Coordinator

| Property Name | Default | Meaning |
|---|---|---|
| rss.coordinator.server.heartbeat.timeout | 30000 | Timeout (ms) if no heartbeat is received from a shuffle server |
| rss.coordinator.assignment.strategy | BASIC | Strategy for assigning shuffle servers; only BASIC is supported |
| rss.coordinator.app.expired | 60000 | Application expiry time (ms); the heartbeat interval should be less than this |
| rss.coordinator.shuffle.nodes.max | 9 | Max number of shuffle servers used per assignment |
| rss.coordinator.exclude.nodes.file.path | - | Path of the configuration file listing excluded nodes |
| rss.coordinator.exclude.nodes.check.interval.ms | 60000 | Update interval (ms) for excluded nodes |
| rss.rpc.server.port | - | RPC port for the coordinator |
| rss.jetty.http.port | - | HTTP port for the coordinator |

Shuffle Server

| Property Name | Default | Meaning |
|---|---|---|
| rss.coordinator.quorum | - | Coordinator quorum |
| rss.rpc.server.port | - | RPC port for the shuffle server |
| rss.jetty.http.port | - | HTTP port for the shuffle server |
| rss.server.buffer.capacity | - | Max memory of the buffer manager for the shuffle server |
| rss.server.memory.shuffle.highWaterMark.percentage | 75.0 | Threshold for spilling data to storage, as a percentage of rss.server.buffer.capacity |
| rss.server.memory.shuffle.lowWaterMark.percentage | 25.0 | Threshold for keeping data in memory, as a percentage of rss.server.buffer.capacity |
| rss.server.read.buffer.capacity | - | Max buffer size for reading data |
| rss.server.heartbeat.interval | 10000 | Heartbeat interval (ms) to the coordinator |
| rss.server.flush.threadPool.size | 10 | Thread pool size for flushing data to files |
| rss.server.commit.timeout | 600000 | Timeout (ms) when committing shuffle data |
| rss.storage.type | - | Supports LOCALFILE, HDFS, LOCALFILE_AND_HDFS |

Spark Client

| Property Name | Default | Meaning |
|---|---|---|
| spark.rss.writer.buffer.size | 3m | Buffer size for a single partition's data |
| spark.rss.writer.buffer.spill.size | 128m | Buffer size for all partitions' data |
| spark.rss.coordinator.quorum | - | Coordinator quorum |
| spark.rss.storage.type | - | Supports MEMORY_LOCAL, MEMORY_HDFS, LOCALFILE, HDFS, LOCALFILE_AND_HDFS |
| spark.rss.client.send.size.limit | 16m | Max data size sent to the shuffle server |
| spark.rss.client.read.buffer.size | 32m | Max data size read from storage |
| spark.rss.client.send.threadPool.size | 10 | Thread pool size for sending shuffle data to shuffle servers |

LICENSE

Firestorm is under the Apache License Version 2.0. See the LICENSE file for details.

Contributing

For more information about contributing issues or pull requests, see the Firestorm Contributing Guide.

Comments
  • Merge values to reduce the shuffle data when operator can be combined

    What changes were proposed in this pull request?

    In the Spark shuffle service, values can be merged to reduce the data size when the operator can be combined.
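
    A minimal sketch of the idea in Java (names are illustrative; in Spark, the real merge function comes from the shuffle dependency's Aggregator):

      import java.util.HashMap;
      import java.util.Map;
      import java.util.function.BiFunction;

      // Illustrative map-side combine: collapse records per key before shuffling.
      class CombineSketch {
          static <K, V> Map<K, V> combine(Iterable<Map.Entry<K, V>> records,
                                          BiFunction<V, V, V> mergeValue) {
              Map<K, V> combined = new HashMap<>();
              for (Map.Entry<K, V> r : records) {
                  combined.merge(r.getKey(), r.getValue(), mergeValue);
              }
              return combined; // one entry per key instead of one per record
          }
      }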

    Why are the changes needed?

    To reduce the shuffle data size and thereby speed up jobs.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    In draft and waiting to be tested.

    opened by zuston 17
  • [Improvement] Split basic checks in CI pipeline

    What changes were proposed in this pull request?

    1. Enable CI on non-master branches.
    2. Split checkstyle and rat checks from build matrix.
    3. Add spotbugs; its result is ignored for now since there are many violations.
    4. Add Summary of failures step in each job.

    Why are the changes needed?

    1. Contributors can ensure CI has passed on forked branches before submitting PR.
    2. It's easier to see why CI has failed, and it avoids repeating checkstyle and rat.
    3. Spotbugs serves as suggestions for now.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Example of rat failure: https://github.com/kaijchen/Firestorm/actions/runs/2383526195

    opened by kaijchen 13
  • Support using remote fs path to specify the excludeNodesFilePath

    What changes were proposed in this pull request?

    Support using remote fs path to specify the excludeNodesFilePath

    Why are the changes needed?

    When two coordinators are serving online, we want them to read a consistent exclude nodes file instead of manually syncing a local file.

    Does this PR introduce any user-facing change?

    Yes. It's an incompatible change.

    When the default fs in core-site.xml is HDFS and the excludeFilePath is specified as "/user/xxxxx" in the coordinator server, after applying this patch the filesystem will be initialized to remote HDFS because the path lacks a scheme.
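
    The behavior change comes from how Hadoop resolves scheme-less paths against fs.defaultFS; a small sketch of that standard Hadoop behavior:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SchemeResolution {
          public static void main(String[] args) throws Exception {
              // A scheme-less path like "/user/xxxxx" resolves against fs.defaultFS
              // from core-site.xml, so it becomes an HDFS path when the default fs is HDFS.
              Configuration hadoopConf = new Configuration();
              FileSystem fs = new Path("/user/xxxxx").getFileSystem(hadoopConf);
              System.out.println(fs.getUri());
          }
      }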

    How was this patch tested?

    Unit tests.

    opened by zuston 10
  • Running Spark 3.1.1's official JavaWordCount with firestorm-0.4.0 reports the following error; additionally, in yarn-client mode the driver process never exits

    java.io.StreamCorruptedException: invalid stream header: 74001673
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:806)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
        at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
        at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.createKVIterator(RssShuffleDataIterator.java:71)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:118)
        at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:213)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
        at org.apache.spark.shuffle.reader.RssShuffleReader.read(RssShuffleReader.java:125)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.scheduler.Task.run(Task.scala:134)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:535)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:545)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

    opened by sfwang218 10
  • Avoid using the default forkjoin pool by parallelStream directly

    What changes were proposed in this pull request?

    As we know, parallelStream uses the JVM-wide default ForkJoinPool. To avoid this, use a custom pool and allow its size to be specified.

    Why are the changes needed?

    Use a separate ForkJoinPool to send shuffle data.

    Does this PR introduce any user-facing change?

    Yes, this introduces configuration to control the ForkJoinPool size: mapreduce.rss.client.data.transfer.pool.size for MapReduce and spark.rss.client.data.transfer.pool.size for Spark.
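
    A minimal sketch of the workaround (names are illustrative; the pool size would come from the new configs): a parallel stream submitted from inside a dedicated ForkJoinPool keeps its work in that pool instead of ForkJoinPool.commonPool().

      import java.util.List;
      import java.util.concurrent.ForkJoinPool;

      // Run a parallel stream on a dedicated pool instead of the JVM-wide common pool.
      class TransferPoolSketch {
          static void sendAll(List<byte[]> blocks, int poolSize) throws Exception {
              ForkJoinPool pool = new ForkJoinPool(poolSize); // cf. *.rss.client.data.transfer.pool.size
              try {
                  // Stream work submitted from inside the pool stays in that pool.
                  pool.submit(() -> blocks.parallelStream().forEach(TransferPoolSketch::send)).get();
              } finally {
                  pool.shutdown();
              }
          }

          static void send(byte[] block) { /* ship the block to the shuffle server */ }
      }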

    How was this patch tested?

    GA passed.

    opened by zuston 9
  • In yarn-client mode, the driver process never exits

    [Image] As shown in the screenshot above, the driver keeps reporting heartbeats after the job has finished.

    In my testing, changing scheduledExecutorService to use daemon threads resolves the problem:

      scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
          new ThreadFactoryBuilder().setDaemon(true).setNameFormat("rss-heartbeat-%d").build());

    opened by sfwang218 9
  • [Bugfix] Add timestamp to avoid the conflicts of gc.log after server …

    What changes were proposed in this pull request?

    Add a timestamp as the suffix of gc log.

    Why are the changes needed?

    The JVM overwrites gc.log after a restart.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    By hand.

    opened by frankliee 9
  • [Follow Up] Support to skip unexpected and processed segments in HDFS

    What changes were proposed in this pull request?

    Refactor HdfsClientReadHandler/UploadedHdfsClientReadHandler based on DataSkippableReadHandler

    Why are the changes needed?

    This is "[Feature] Support to skip unexpected and processed segments in LocalFileReadClient (#55)" 's follow-up pull request, we implement the segment filter based on DataSkippableReadHandler. In this PR, we extends to HdfsClientReadHandler/UploadedHdfsClientReadHandler.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Run all existing UT & IT.

    opened by frankliee 9
  • [Improvement] Optimize quorum write by skipping unnecessary blocks

    What changes were proposed in this pull request?

    The write client sends blocks in two rounds. In the primary round, we only send the [0, replicaWrite) data. In the secondary round, we send the [replicaWrite, replica) data. When the primary round is fully successful, the secondary round can be skipped.
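
    A minimal sketch of the two-round strategy (names are illustrative, not the real client API):

      // Illustrative two-round quorum write.
      class QuorumWriteSketch {
          interface Replica { boolean send(byte[] block); }

          static boolean write(byte[] block, Replica[] replicas, int replicaWrite) {
              int acks = 0;
              // Primary round: only replicas [0, replicaWrite).
              for (int i = 0; i < replicaWrite; i++) {
                  if (replicas[i].send(block)) acks++;
              }
              // Secondary round runs only if the primary round did not fully succeed.
              for (int i = replicaWrite; i < replicas.length && acks < replicaWrite; i++) {
                  if (replicas[i].send(block)) acks++;
              }
              return acks >= replicaWrite;
          }
      }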

    Why are the changes needed?

    Without this patch, the quorum writer sends every block replica times, even when all RPCs are successful.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    All previous UTs and new UTs.

    opened by frankliee 8
  • [Feature] Support to skip unexpected and processed segments in LocalFileReadClient

    What changes were proposed in this pull request?

    1. Add DataSkippableReadHandler to enable segment filter.
    2. Refactor LocalFileQuorumClientReadHandler
      a. Extract the processing of each replica in LocalFileReadHandler
      b. Implement LocalFileReplicaReadHandler based on DataSkippableReadHandler
    3. Refactor MemoryQuorumClientReadHandler
      a. Extract the processing of each replica in LocalFileReadHandler

    Why are the changes needed?

    Some segments are read by different handlers.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Run all UT & IT.

    opened by frankliee 7
  • [Improvement] Replace runtime exception with rss related exception in…

    What changes were proposed in this pull request? This PR adds RSS-related exceptions that make it possible to identify problems in the shuffle server.

    Why are the changes needed? To improve exception handling.

    Does this PR introduce any user-facing change? No

    How was this patch tested? Based on current UTs.

    enhancement 
    opened by colinmjj 7
  • To support more tasks with Firestorm

    The current blockId is designed as follows:

     // BlockId is long and composed by partitionId, executorId and AtomicInteger
     // AtomicInteger is first 19 bit, max value is 2^19 - 1
     // partitionId is next 24 bit, max value is 2^24 - 1
     // taskAttemptId is rest of 20 bit, max value is 2^20 - 1
    

    Why do we need blockId? It's used for data validation, filtering, reading data from memory, etc.

    Why is blockId designed this way? BlockIds are stored in the shuffle server; to reduce memory cost, RoaringBitmap is used to cache them. Given RoaringBitmap's implementation, the blockId layout aims to use BitmapContainer rather than ArrayContainer to save memory.

    What's the problem with blockId? It can't support a taskId greater than 2^20 - 1.

    Proposal: I think 19 bits is more than the atomic int needs, and we can reassign some of them to taskId.
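
    For reference, composing and decomposing a blockId under the current layout might look like this (the bit order, with the sequence number in the high bits, is an assumption based on the comment above):

      // Assumed layout: 19-bit sequence number | 24-bit partitionId | 20-bit taskAttemptId.
      class BlockIdSketch {
          static long compose(long seq, long partitionId, long taskAttemptId) {
              return (seq << (24 + 20)) | (partitionId << 20) | taskAttemptId;
          }

          static long sequenceNo(long blockId)    { return blockId >>> (24 + 20); }
          static long partitionId(long blockId)   { return (blockId >>> 20) & ((1L << 24) - 1); }
          static long taskAttemptId(long blockId) { return blockId & ((1L << 20) - 1); }
      }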

    opened by colinmjj 0
  • [Feature Request]Add a web UI in Coordinated Server to show the detailed server/job/metrics information

    It would be good to have a web UI in the coordinator server to show detailed information about shuffle servers, job shuffle metrics, and more, much like the Spark web UI shows detailed information about Spark jobs.

    opened by jerryshao 1
Releases
  • release-0.4.1(Apr 20, 2022)

  • release-0.4.0(Apr 7, 2022)

    What's Changed

    • [Bugfix] Fix uncorrect index file by @jerqi in https://github.com/Tencent/Firestorm/pull/92
    • [Doc] Add the support information by @jerqi in https://github.com/Tencent/Firestorm/pull/95
    • [Bugfix] Exist memory leak when we use aqe by @jerqi in https://github.com/Tencent/Firestorm/pull/98
    • [Bugfix] Add timestamp to avoid the conflicts of gc.log after server … by @frankliee in https://github.com/Tencent/Firestorm/pull/99
    • [Improvement] Remove some integration tests under StorageType.HDFS mode to … by @frankliee in https://github.com/Tencent/Firestorm/pull/97
    • [Feature] Firestorm supports quorum write/read by @frankliee in https://github.com/Tencent/Firestorm/pull/96
    • upgrade to 0.4.0 by @jerqi in https://github.com/Tencent/Firestorm/pull/102

    Full Changelog: https://github.com/Tencent/Firestorm/compare/release-0.3.0...release-0.4.0

  • release-0.3.0(Mar 4, 2022)

    What's Changed

    • [Feature] Add AccessManager by @duanmeng in https://github.com/Tencent/Firestorm/pull/61
    • [Bug] Exchange the position between startHeartbeat and registerShuffleServers by @jerqi in https://github.com/Tencent/Firestorm/pull/75
    • [Improvement] Add detailed logging for read client by @frankliee in https://github.com/Tencent/Firestorm/pull/72
    • [Feature]Access check in client by @duanmeng in https://github.com/Tencent/Firestorm/pull/66
    • [Feature] Add the metrics of requireBuffer failures by @frankliee in https://github.com/Tencent/Firestorm/pull/78
    • [Feature] Enhance the DelegationRssShuffleManager to support hdfs conf file and reset the shuffle manager name by @duanmeng in https://github.com/Tencent/Firestorm/pull/79
    • [Improvement] Update spark.rss.client.read.buffer.size to avoid JVM humongous alloc… by @frankliee in https://github.com/Tencent/Firestorm/pull/80
    • [Improvement] Update some default values of parameters by @frankliee in https://github.com/Tencent/Firestorm/pull/82
    • [Feature] add the dynamic client conf fetching and updating by @duanmeng in https://github.com/Tencent/Firestorm/pull/81
    • [Feature] add access related metrics by @duanmeng in https://github.com/Tencent/Firestorm/pull/84
    • [Feature] Support Spark 3.2 by @jerqi in https://github.com/Tencent/Firestorm/pull/86
    • [Feature] corrupted LocalStorage can be excluded by @jerqi in https://github.com/Tencent/Firestorm/pull/85
    • upgrade to 0.3.0 by @duanmeng in https://github.com/Tencent/Firestorm/pull/89

    Full Changelog: https://github.com/Tencent/Firestorm/compare/release-0.2.0...release-0.3.0

  • release-0.2.0(Jan 17, 2022)

    What's Changed

    • Add license header to maven ci yaml file #2 by @duanmeng in https://github.com/Tencent/Firestorm/pull/2
    • Add pull request template by @duanmeng in https://github.com/Tencent/Firestorm/pull/10
    • Upgrade version to 0.2.0-SNAPSHOT by @duanmeng in https://github.com/Tencent/Firestorm/pull/11
    • [Improvement] Replace runtime exception with rss related exception in… by @colinmjj in https://github.com/Tencent/Firestorm/pull/8
    • Fix Dependabot alerts by @duanmeng in https://github.com/Tencent/Firestorm/pull/12
    • [Feature] Read index file first in local mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/6
    • [Feature] Add modifier, comment and import checking into checkstyle by @duanmeng in https://github.com/Tencent/Firestorm/pull/16
    • [Feature] Enable NewLineAtEofChecker in checkstyle by @duanmeng in https://github.com/Tencent/Firestorm/pull/18
    • [Feature] Firestorm supports node's health check by @jerqi in https://github.com/Tencent/Firestorm/pull/14
    • [Feature] Add the metrics for the number of total partitions by @jerqi in https://github.com/Tencent/Firestorm/pull/24
    • [Feature] Limit upload segment num and upload time in force mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/23
    • Add metrics to collect connection info of grpc by @colinmjj in https://github.com/Tencent/Firestorm/pull/25
    • [Improvement] Fix the flaky test by @jerqi in https://github.com/Tencent/Firestorm/pull/29
    • [Improvement] Support to config byte sizes by readable string (e.g., kb/mb/gb) by @frankliee in https://github.com/Tencent/Firestorm/pull/26
    • [Feature] Add doc system based on Jekyll by @frankliee in https://github.com/Tencent/Firestorm/pull/22
    • [Feature] Read index file first in hdfs mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/21
    • [Improvement] Modify some configuration's names and default values by @jerqi in https://github.com/Tencent/Firestorm/pull/27
    • Fix log4j-core dependabot by @duanmeng in https://github.com/Tencent/Firestorm/pull/32
    • [Improvement]Remove useless code in test base by @duanmeng in https://github.com/Tencent/Firestorm/pull/33
    • [Improvement] Avoid speculative tasks causing writing errors after shuffle is uploaded and deleted by @jerqi in https://github.com/Tencent/Firestorm/pull/31
    • [Improvement] fix log4j Dependabot alerts by @duanmeng in https://github.com/Tencent/Firestorm/pull/37
    • [Feature]Read handler refactor by @duanmeng in https://github.com/Tencent/Firestorm/pull/34
    • [MINOR] Fix flaky testes in HealthCheckCoordinatorGrpcTest by @frankliee in https://github.com/Tencent/Firestorm/pull/38
    • Check blockIds in read client to allow early return by @frankliee in https://github.com/Tencent/Firestorm/pull/39
    • Bump log4j-core from 2.16.0 to 2.17.0 by @dependabot in https://github.com/Tencent/Firestorm/pull/44
    • [Improvement] Support Firestorm with Spark 2.3 & 3.0 by @frankliee in https://github.com/Tencent/Firestorm/pull/41
    • Support store shuffle data in memory by @colinmjj in https://github.com/Tencent/Firestorm/pull/36
    • Update Readme with memory shuffle support by @colinmjj in https://github.com/Tencent/Firestorm/pull/46
    • [Improvement] Update import style in the Spark-like way by @frankliee in https://github.com/Tencent/Firestorm/pull/45
    • Shutdown non-daemon executor service to avoid hang when jvm destroy by @toujours33 in https://github.com/Tencent/Firestorm/pull/47
    • Fix time unit from micro to mill by @colinmjj in https://github.com/Tencent/Firestorm/pull/52
    • upload unit test log in github CI by @frankliee in https://github.com/Tencent/Firestorm/pull/54
    • [Improvement] Remove unnecssary logic to calculate bitmap split number by @colinmjj in https://github.com/Tencent/Firestorm/pull/53
    • [Feature] MultiStorageManager can write both localfile and hdfs directly by @jerqi in https://github.com/Tencent/Firestorm/pull/43
    • [Build] Only upload logs when test cases are failed by @frankliee in https://github.com/Tencent/Firestorm/pull/57
    • [Bug]Fix memory leak in spark executor by @colinmjj in https://github.com/Tencent/Firestorm/pull/58
    • [Feature] Support to skip unexpected and processed segments in LocalFileReadClient by @frankliee in https://github.com/Tencent/Firestorm/pull/55
    • [Feature] Support read from localfile, hdfs and uploaded hdfs by @jerqi in https://github.com/Tencent/Firestorm/pull/56
    • Fix log4j-core dependabot, upgrade to 2.17.1 by @frankliee in https://github.com/Tencent/Firestorm/pull/62
    • [Improvement] Reduce memory cost in spark executor by @colinmjj in https://github.com/Tencent/Firestorm/pull/59
    • [Feature] Hdfs write fail, the data will write to localfile by @jerqi in https://github.com/Tencent/Firestorm/pull/60
    • [Feature] Support StorageType MEMORY_LOCALFILE_HDFS by @jerqi in https://github.com/Tencent/Firestorm/pull/63
    • [Bug] Support to read incomplete index data by @jerqi in https://github.com/Tencent/Firestorm/pull/67
    • [Bug] LocalStorageChecker should support MEMORY_LOCALFILE and MEMORY_LOCALFILE_HDFS by @jerqi in https://github.com/Tencent/Firestorm/pull/65
    • [Feature] Add the data size of metrics HDFS and LOCALFILE by @jerqi in https://github.com/Tencent/Firestorm/pull/68
    • [Follow Up] Support to skip unexpected and processed segments in HDFS by @frankliee in https://github.com/Tencent/Firestorm/pull/64
    • [Bug] Replace parallelStream with stream except for the method sendDataAsync by @jerqi in https://github.com/Tencent/Firestorm/pull/69
    • Update version to 0.2.0 by @colinmjj in https://github.com/Tencent/Firestorm/pull/74
    • [Doc] Update readme for new configuration by @colinmjj in https://github.com/Tencent/Firestorm/pull/73

    New Contributors

    • @duanmeng made their first contribution in https://github.com/Tencent/Firestorm/pull/2
    • @colinmjj made their first contribution in https://github.com/Tencent/Firestorm/pull/8
    • @frankliee made their first contribution in https://github.com/Tencent/Firestorm/pull/26
    • @dependabot made their first contribution in https://github.com/Tencent/Firestorm/pull/44
    • @toujours33 made their first contribution in https://github.com/Tencent/Firestorm/pull/47

    Full Changelog: https://github.com/Tencent/Firestorm/commits/release-0.2.0

  • release-0.1.0(Nov 9, 2021)
