Firestorm

Overview

What is Firestorm

Firestorm is a Remote Shuffle Service that enables Apache Spark applications to store shuffle data on remote servers.

Architecture

[Figure: Rss Architecture]

Firestorm consists of a coordinator cluster, a shuffle server cluster, and remote storage (e.g., HDFS) if necessary.

The coordinator collects the status of shuffle servers and assigns them to jobs.

Shuffle servers receive shuffle data, merge it, and write it to storage.

Depending on the situation, Firestorm supports Memory & Local, Memory & Remote Storage (e.g., HDFS), Local only, and Remote Storage only.

Shuffle Process with Firestorm

  • The Spark driver asks the coordinator to assign shuffle servers for the shuffle process.

  • Spark tasks write shuffle data to shuffle servers in the following steps (sketched in code after this list):

    [Figure: Rss Shuffle_Write]

    1. Send KV data to a buffer
    2. Flush the buffer to a queue when the buffer or the buffer manager is full
    3. A thread pool takes data from the queue
    4. Request memory from the shuffle server first, then send the shuffle data
    5. The shuffle server caches data in memory first and flushes it to a queue when its buffer manager is full
    6. A thread pool takes data from the queue
    7. Write data to storage as an index file and a data file
    8. After writing the data, the task reports all blockIds to the shuffle server; this is used for data validation later
    9. Store taskAttemptId in MapStatus to support Spark speculation
  • Depending on the storage type, Spark tasks read shuffle data from the shuffle server, remote storage, or both.
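The client-side write path above is essentially a two-stage producer/consumer pipeline. Below is a minimal sketch of steps 1-4 in Java; all class and method names are illustrative assumptions, not Firestorm's actual API:

    import java.io.ByteArrayOutputStream;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch of the client-side write pipeline; names are hypothetical.
    class ShuffleWriterSketch {
        private static final int BUFFER_LIMIT = 3 * 1024 * 1024; // cf. spark.rss.writer.buffer.size
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final BlockingQueue<byte[]> flushQueue = new LinkedBlockingQueue<>();
        private final ExecutorService senders =
            Executors.newFixedThreadPool(10); // cf. spark.rss.client.send.threadPool.size

        interface RemoteShuffleClient {
            void requireMemory(int bytes);      // step 4: pre-allocate memory on the server
            void sendShuffleData(byte[] block); // step 4: ship the block
        }

        // Steps 1-2: accumulate records; hand a full buffer to the flush queue.
        void write(byte[] record) {
            buffer.write(record, 0, record.length);
            if (buffer.size() >= BUFFER_LIMIT) {
                flushQueue.offer(buffer.toByteArray());
                buffer.reset();
            }
        }

        // Steps 3-4: sender threads drain the queue and send blocks to the server.
        void startSender(RemoteShuffleClient client) {
            senders.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        byte[] block = flushQueue.take();
                        client.requireMemory(block.length);
                        client.sendShuffleData(block);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }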

Shuffle file format

Shuffle data is stored as an index file and a data file. The data file holds all blocks for a specific partition, and the index file holds metadata for every block.

[Figure: Rss Shuffle_Write]
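For illustration, an index entry might look like the sketch below; the exact field set and order are an assumption, not taken from the Firestorm source:

    // Hypothetical index entry layout; real fields and their order may differ.
    class IndexEntrySketch {
        long offset;        // where the block starts in the data file
        int length;         // length of the block in the data file
        long crc;           // checksum used to validate the block on read
        long blockId;       // identifies the block for validation and filtering
        long taskAttemptId; // producing task attempt, used for Spark speculation
        // To read a block: seek to `offset` in the data file and read `length` bytes.
    }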

Supported Spark Version

Currently supports Spark 2.3.x, Spark 2.4.x, Spark 3.0.x, and Spark 3.1.x.

Note: To support dynamic allocation, the patch (included in the client-spark/patch folder) should be applied to Spark.

Building Firestorm

Firestorm is built using Apache Maven. To build it, run:

mvn -DskipTests clean package

To package Firestorm for deployment, run:

./build_distribution.sh

A rss-xxx.tgz archive will be generated for deployment.

Deploy

Deploy Coordinator

  1. Unzip the package to RSS_HOME
  2. Update RSS_HOME/bin/rss-env.sh, e.g.,
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="16g"
    
  3. Update RSS_HOME/conf/coordinator.conf, e.g.,
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.coordinator.server.heartbeat.timeout 30000
      rss.coordinator.app.expired 60000
      rss.coordinator.shuffle.nodes.max 5
      rss.coordinator.exclude.nodes.file.path RSS_HOME/conf/exclude_nodes
    
  4. Start the coordinator
     bash RSS_HOME/bin/start-coordinator.sh
    

Deploy Shuffle Server

  1. Unzip the package to RSS_HOME
  2. Update RSS_HOME/bin/rss-env.sh, e.g.,
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="80g"
    
  3. Update RSS_HOME/conf/server.conf; the following example is for local storage only, e.g.,
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.rpc.executor.size 2000
      rss.storage.type LOCALFILE
      rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
      rss.storage.basePath /data1/rssdata,/data2/rssdata....
      rss.server.flush.thread.alive 5
      rss.server.flush.threadPool.size 10
      rss.server.buffer.capacity 40g
      rss.server.read.buffer.capacity 20g
      rss.server.heartbeat.timeout 60000
      rss.server.heartbeat.interval 10000
      rss.rpc.message.max.size 1073741824
      rss.server.preAllocation.expired 120000
      rss.server.commit.timeout 600000
      rss.server.app.expired.withoutHeartbeat 120000
    
  4. Start the shuffle server
     bash RSS_HOME/bin/start-shuffle-server.sh
    

Deploy Spark Client

  1. Add the client jar to the Spark classpath, e.g., SPARK_HOME/jars/

    The jar for Spark2 is located in <RSS_HOME>/jars/client/spark2/rss-client-XXXXX-shaded.jar

    The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar

  2. Update the Spark conf to enable Firestorm; the following example is for local storage only, e.g.,

    spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
    spark.rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
    spark.rss.storage.type MEMORY_LOCALFILE
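
    The same settings can also be applied programmatically; below is a minimal sketch using Spark's standard SparkConf API with the demo values above:

      import org.apache.spark.SparkConf;
      import org.apache.spark.sql.SparkSession;

      public class RssEnabledApp {
          public static void main(String[] args) {
              // Same local-storage-only demo configuration, set via SparkConf.
              SparkConf conf = new SparkConf()
                  .set("spark.shuffle.manager", "org.apache.spark.shuffle.RssShuffleManager")
                  .set("spark.rss.coordinator.quorum", "<coordinatorIp1>:19999,<coordinatorIp2>:19999")
                  .set("spark.rss.storage.type", "MEMORY_LOCALFILE");
              SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
              spark.stop();
          }
      }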
    

Support Spark dynamic allocation

To support Spark dynamic allocation with Firestorm, the Spark code must be patched. Two reference patches, for spark-2.4.6 and spark-3.1.2, are provided in the spark-patches folder.

After applying the patch and rebuilding Spark, add the following configuration to the Spark conf to enable dynamic allocation:

spark.shuffle.service.enabled false
spark.dynamicAllocation.enabled true

Configuration

The important configuration options are listed below.

Coordinator

| Property Name | Default | Meaning |
|---|---|---|
| rss.coordinator.server.heartbeat.timeout | 30000 | Timeout (ms) if no heartbeat is received from a shuffle server |
| rss.coordinator.assignment.strategy | BASIC | Strategy for assigning shuffle servers; only BASIC is supported |
| rss.coordinator.app.expired | 60000 | Application expiry time (ms); the heartbeat interval should be less than this |
| rss.coordinator.shuffle.nodes.max | 9 | Max number of shuffle servers used per assignment |
| rss.coordinator.exclude.nodes.file.path | - | Path of the configuration file listing excluded nodes |
| rss.coordinator.exclude.nodes.check.interval.ms | 60000 | Update interval (ms) for excluded nodes |
| rss.rpc.server.port | - | RPC port for the coordinator |
| rss.jetty.http.port | - | HTTP port for the coordinator |

Shuffle Server

| Property Name | Default | Meaning |
|---|---|---|
| rss.coordinator.quorum | - | Coordinator quorum |
| rss.rpc.server.port | - | RPC port for the shuffle server |
| rss.jetty.http.port | - | HTTP port for the shuffle server |
| rss.server.buffer.capacity | - | Max memory of the buffer manager for the shuffle server |
| rss.server.memory.shuffle.highWaterMark.percentage | 75.0 | Threshold for spilling data to storage, as a percentage of rss.server.buffer.capacity |
| rss.server.memory.shuffle.lowWaterMark.percentage | 25.0 | Threshold for keeping data in memory, as a percentage of rss.server.buffer.capacity |
| rss.server.read.buffer.capacity | - | Max buffer size for reading data |
| rss.server.heartbeat.interval | 10000 | Heartbeat interval (ms) to the coordinator |
| rss.server.flush.threadPool.size | 10 | Thread pool size for flushing data to files |
| rss.server.commit.timeout | 600000 | Timeout (ms) when committing shuffle data |
| rss.storage.type | - | Supports LOCALFILE, HDFS, LOCALFILE_AND_HDFS |

Spark Client

| Property Name | Default | Meaning |
|---|---|---|
| spark.rss.writer.buffer.size | 3m | Buffer size for a single partition's data |
| spark.rss.writer.buffer.spill.size | 128m | Buffer size for all partitions' data |
| spark.rss.coordinator.quorum | - | Coordinator quorum |
| spark.rss.storage.type | - | Supports MEMORY_LOCAL, MEMORY_HDFS, LOCALFILE, HDFS, LOCALFILE_AND_HDFS |
| spark.rss.client.send.size.limit | 16m | Max data size sent to the shuffle server |
| spark.rss.client.read.buffer.size | 32m | Max data size read from storage |
| spark.rss.client.send.threadPool.size | 10 | Thread pool size for sending shuffle data to shuffle servers |

LICENSE

Firestorm is under the Apache License Version 2.0. See the LICENSE file for details.

Contributing

For more information about contributing issues or pull requests, see the Firestorm Contributing Guide.

Comments
  • Merge values to reduce the shuffle data when operator can be combined

    What changes were proposed in this pull request?

    In the Spark shuffle service, values can be merged to reduce the data size when the operator can be combined.
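
    A minimal sketch of the idea in Java (names are illustrative; in Spark, the real merge function comes from the shuffle dependency's Aggregator):

      import java.util.HashMap;
      import java.util.Map;
      import java.util.function.BiFunction;

      // Illustrative map-side combine: collapse records per key before shuffling.
      class CombineSketch {
          static <K, V> Map<K, V> combine(Iterable<Map.Entry<K, V>> records,
                                          BiFunction<V, V, V> mergeValue) {
              Map<K, V> combined = new HashMap<>();
              for (Map.Entry<K, V> r : records) {
                  combined.merge(r.getKey(), r.getValue(), mergeValue);
              }
              return combined; // one entry per key instead of one per record
          }
      }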

    Why are the changes needed?

    To reduce the shuffle data size and thereby speed up jobs.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    In draft and waiting to be tested.

    opened by zuston 17
  • [Improvement] Split basic checks in CI pipeline

    What changes were proposed in this pull request?

    1. Enable CI on non-master branches.
    2. Split checkstyle and rat checks from build matrix.
    3. Add spotbugs; its result is ignored for now since there are many violations.
    4. Add Summary of failures step in each job.

    Why are the changes needed?

    1. Contributors can ensure CI has passed on forked branches before submitting PR.
    2. It's easier to see why CI has failed, and it avoids repeating checkstyle and rat.
    3. Spotbugs serves as suggestions for now.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Example of rat failure: https://github.com/kaijchen/Firestorm/actions/runs/2383526195

    opened by kaijchen 13
  • Support using remote fs path to specify the excludeNodesFilePath

    What changes were proposed in this pull request?

    Support using remote fs path to specify the excludeNodesFilePath

    Why are the changes needed?

    When two coordinators are serving online, we want them to read a consistent exclude nodes file instead of manually syncing a local file.

    Does this PR introduce any user-facing change?

    Yes. It's an incompatible change.

    When the default fs in core-site.xml is HDFS and the excludeFilePath is specified as "/user/xxxxx" in the coordinator server, after applying this patch the filesystem will be initialized to remote HDFS because the path lacks a scheme.
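
    The behavior change comes from how Hadoop resolves scheme-less paths against fs.defaultFS; a small sketch of that standard Hadoop behavior:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class SchemeResolution {
          public static void main(String[] args) throws Exception {
              // A scheme-less path like "/user/xxxxx" resolves against fs.defaultFS
              // from core-site.xml, so it becomes an HDFS path when the default fs is HDFS.
              Configuration hadoopConf = new Configuration();
              FileSystem fs = new Path("/user/xxxxx").getFileSystem(hadoopConf);
              System.out.println(fs.getUri());
          }
      }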

    How was this patch tested?

    Unit tests.

    opened by zuston 10
  • Running Spark 3.1.1's official JavaWordCount with firestorm-0.4.0 reports the following error; additionally, in yarn-client mode the driver process never exits

    java.io.StreamCorruptedException: invalid stream header: 74001673
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:806)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
        at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
        at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.createKVIterator(RssShuffleDataIterator.java:71)
        at org.apache.spark.shuffle.reader.RssShuffleDataIterator.hasNext(RssShuffleDataIterator.java:118)
        at org.apache.spark.shuffle.reader.RssShuffleReader$MultiPartitionIterator.hasNext(RssShuffleReader.java:213)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:50)
        at org.apache.spark.shuffle.reader.RssShuffleReader.read(RssShuffleReader.java:125)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.scheduler.Task.run(Task.scala:134)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:535)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:545)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

    opened by sfwang218 10
  • Avoid using the default forkjoin pool by parallelStream directly

    What changes were proposed in this pull request?

    As we know, parallelStream uses the JVM-wide default ForkJoinPool. To avoid this, use a custom pool and allow its size to be specified.

    Why are the changes needed?

    Use a separate ForkJoinPool to send shuffle data.

    Does this PR introduce any user-facing change?

    Yes, this introduces configuration to control the ForkJoinPool size: mapreduce.rss.client.data.transfer.pool.size for MapReduce and spark.rss.client.data.transfer.pool.size for Spark.
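
    A minimal sketch of the workaround (names are illustrative; the pool size would come from the new configs): a parallel stream submitted from inside a dedicated ForkJoinPool keeps its work in that pool instead of ForkJoinPool.commonPool().

      import java.util.List;
      import java.util.concurrent.ForkJoinPool;

      // Run a parallel stream on a dedicated pool instead of the JVM-wide common pool.
      class TransferPoolSketch {
          static void sendAll(List<byte[]> blocks, int poolSize) throws Exception {
              ForkJoinPool pool = new ForkJoinPool(poolSize); // cf. *.rss.client.data.transfer.pool.size
              try {
                  // Stream work submitted from inside the pool stays in that pool.
                  pool.submit(() -> blocks.parallelStream().forEach(TransferPoolSketch::send)).get();
              } finally {
                  pool.shutdown();
              }
          }

          static void send(byte[] block) { /* ship the block to the shuffle server */ }
      }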

    How was this patch tested?

    GA passed.

    opened by zuston 9
  • In yarn-client mode, the driver process never exits

    [Image] As shown in the screenshot above, the driver keeps reporting heartbeats after the job has finished.

    In my testing, changing scheduledExecutorService to use daemon threads resolves the problem:

      scheduledExecutorService = Executors.newSingleThreadScheduledExecutor(
          new ThreadFactoryBuilder().setDaemon(true).setNameFormat("rss-heartbeat-%d").build());

    opened by sfwang218 9
  • [Bugfix] Add timestamp to avoid the conflicts of gc.log after server …

    What changes were proposed in this pull request?

    Add a timestamp as the suffix of gc log.

    Why are the changes needed?

    The JVM overwrites gc.log after a restart.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    By hand.

    opened by frankliee 9
  • [Follow Up] Support to skip unexpected and processed segments in HDFS

    What changes were proposed in this pull request?

    Refactor HdfsClientReadHandler/UploadedHdfsClientReadHandler based on DataSkippableReadHandler

    Why are the changes needed?

    This is "[Feature] Support to skip unexpected and processed segments in LocalFileReadClient (#55)" 's follow-up pull request, we implement the segment filter based on DataSkippableReadHandler. In this PR, we extends to HdfsClientReadHandler/UploadedHdfsClientReadHandler.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Run all existing UT & IT.

    opened by frankliee 9
  • [Improvement] Optimize quorum write by skipping unnecessary blocks

    What changes were proposed in this pull request?

    The write client sends blocks in two rounds. In the primary round, we only send the [0, replicaWrite) data. In the secondary round, we send the [replicaWrite, replica) data. When the primary round is fully successful, the secondary round can be skipped.
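
    A minimal sketch of the two-round strategy (names are illustrative, not the real client API):

      // Illustrative two-round quorum write.
      class QuorumWriteSketch {
          interface Replica { boolean send(byte[] block); }

          static boolean write(byte[] block, Replica[] replicas, int replicaWrite) {
              int acks = 0;
              // Primary round: only replicas [0, replicaWrite).
              for (int i = 0; i < replicaWrite; i++) {
                  if (replicas[i].send(block)) acks++;
              }
              // Secondary round runs only if the primary round did not fully succeed.
              for (int i = replicaWrite; i < replicas.length && acks < replicaWrite; i++) {
                  if (replicas[i].send(block)) acks++;
              }
              return acks >= replicaWrite;
          }
      }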

    Why are the changes needed?

    Without this patch, the quorum writer sends every block replica times, even when all RPCs are successful.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    All previous UTs and new UTs.

    opened by frankliee 8
  • [Feature] Support to skip unexpected and processed segments in LocalFileReadClient

    What changes were proposed in this pull request?

    1. Add DataSkippableReadHandler to enable segment filter.
    2. Refactor LocalFileQuorumClientReadHandler
      a. Extract the processing of each replica in LocalFileReadHandler
      b. Implement LocalFileReplicaReadHandler based on DataSkippableReadHandler
    3. Refactor MemoryQuorumClientReadHandler
      a. Extract the processing of each replica in LocalFileReadHandler

    Why are the changes needed?

    Some segments are read by different handlers.

    Does this PR introduce any user-facing change?

    No.

    How was this patch tested?

    Run all UT & IT.

    opened by frankliee 7
  • [Improvement] Replace runtime exception with rss related exception in…

    What changes were proposed in this pull request? This PR adds RSS-related exceptions that make it possible to identify problems in the shuffle server.

    Why are the changes needed? To improve exception handling.

    Does this PR introduce any user-facing change? No

    How was this patch tested? Based on current UTs.

    enhancement 
    opened by colinmjj 7
  • To support more tasks with Firestorm

    The current blockId is designed as follows:

     // BlockId is long and composed by partitionId, executorId and AtomicInteger
     // AtomicInteger is first 19 bit, max value is 2^19 - 1
     // partitionId is next 24 bit, max value is 2^24 - 1
     // taskAttemptId is rest of 20 bit, max value is 2^20 - 1
    

    Why do we need blockId? It's used for data validation, filtering, reading data from memory, etc.

    Why is blockId designed this way? BlockIds are stored in the shuffle server; to reduce memory cost, RoaringBitmap is used to cache them. Given RoaringBitmap's implementation, the blockId layout aims to use BitmapContainer rather than ArrayContainer to save memory.

    What's the problem with blockId? It can't support a taskId greater than 2^20 - 1.

    Proposal: I think 19 bits is more than the atomic int needs, and we can reassign some of them to taskId.
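
    For reference, composing and decomposing a blockId under the current layout might look like this (the bit order, with the sequence number in the high bits, is an assumption based on the comment above):

      // Assumed layout: 19-bit sequence number | 24-bit partitionId | 20-bit taskAttemptId.
      class BlockIdSketch {
          static long compose(long seq, long partitionId, long taskAttemptId) {
              return (seq << (24 + 20)) | (partitionId << 20) | taskAttemptId;
          }

          static long sequenceNo(long blockId)    { return blockId >>> (24 + 20); }
          static long partitionId(long blockId)   { return (blockId >>> 20) & ((1L << 24) - 1); }
          static long taskAttemptId(long blockId) { return blockId & ((1L << 20) - 1); }
      }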

    opened by colinmjj 0
  • [Feature Request]Add a web UI in Coordinated Server to show the detailed server/job/metrics information

    It would be good to have a web UI in the coordinator server to show detailed information about shuffle servers, job shuffle metrics, and more, much like the Spark web UI shows detailed information about Spark jobs.

    opened by jerryshao 1
Releases
  • release-0.4.1(Apr 20, 2022)

  • release-0.4.0(Apr 7, 2022)

    What's Changed

    • [Bugfix] Fix uncorrect index file by @jerqi in https://github.com/Tencent/Firestorm/pull/92
    • [Doc] Add the support information by @jerqi in https://github.com/Tencent/Firestorm/pull/95
    • [Bugfix] Exist memory leak when we use aqe by @jerqi in https://github.com/Tencent/Firestorm/pull/98
    • [Bugfix] Add timestamp to avoid the conflicts of gc.log after server … by @frankliee in https://github.com/Tencent/Firestorm/pull/99
    • [Improvement] Remove some integration tests under StorageType.HDFS mode to … by @frankliee in https://github.com/Tencent/Firestorm/pull/97
    • [Feature] Firestorm supports quorum write/read by @frankliee in https://github.com/Tencent/Firestorm/pull/96
    • upgrade to 0.4.0 by @jerqi in https://github.com/Tencent/Firestorm/pull/102

    Full Changelog: https://github.com/Tencent/Firestorm/compare/release-0.3.0...release-0.4.0

  • release-0.3.0(Mar 4, 2022)

    What's Changed

    • [Feature] Add AccessManager by @duanmeng in https://github.com/Tencent/Firestorm/pull/61
    • [Bug] Exchange the position between startHeartbeat and registerShuffleServers by @jerqi in https://github.com/Tencent/Firestorm/pull/75
    • [Improvement] Add detailed logging for read client by @frankliee in https://github.com/Tencent/Firestorm/pull/72
    • [Feature]Access check in client by @duanmeng in https://github.com/Tencent/Firestorm/pull/66
    • [Feature] Add the metrics of requireBuffer failures by @frankliee in https://github.com/Tencent/Firestorm/pull/78
    • [Feature] Enhance the DelegationRssShuffleManager to support hdfs conf file and reset the shuffle manager name by @duanmeng in https://github.com/Tencent/Firestorm/pull/79
    • [Improvement] Update spark.rss.client.read.buffer.size to avoid JVM humongous alloc… by @frankliee in https://github.com/Tencent/Firestorm/pull/80
    • [Improvement] Update some default values of parameters by @frankliee in https://github.com/Tencent/Firestorm/pull/82
    • [Feature] add the dynamic client conf fetching and updating by @duanmeng in https://github.com/Tencent/Firestorm/pull/81
    • [Feature] add access related metrics by @duanmeng in https://github.com/Tencent/Firestorm/pull/84
    • [Feature] Support Spark 3.2 by @jerqi in https://github.com/Tencent/Firestorm/pull/86
    • [Feature] corrupted LocalStorage can be excluded by @jerqi in https://github.com/Tencent/Firestorm/pull/85
    • upgrade to 0.3.0 by @duanmeng in https://github.com/Tencent/Firestorm/pull/89

    Full Changelog: https://github.com/Tencent/Firestorm/compare/release-0.2.0...release-0.3.0

  • release-0.2.0(Jan 17, 2022)

    What's Changed

    • Add license header to maven ci yaml file #2 by @duanmeng in https://github.com/Tencent/Firestorm/pull/2
    • Add pull request template by @duanmeng in https://github.com/Tencent/Firestorm/pull/10
    • Upgrade version to 0.2.0-SNAPSHOT by @duanmeng in https://github.com/Tencent/Firestorm/pull/11
    • [Improvement] Replace runtime exception with rss related exception in… by @colinmjj in https://github.com/Tencent/Firestorm/pull/8
    • Fix Dependabot alerts by @duanmeng in https://github.com/Tencent/Firestorm/pull/12
    • [Feature] Read index file first in local mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/6
    • [Feature] Add modifier, comment and import checking into checkstyle by @duanmeng in https://github.com/Tencent/Firestorm/pull/16
    • [Feature] Enable NewLineAtEofChecker in checkstyle by @duanmeng in https://github.com/Tencent/Firestorm/pull/18
    • [Feature] Firestorm supports node's health check by @jerqi in https://github.com/Tencent/Firestorm/pull/14
    • [Feature] Add the metrics for the number of total partitions by @jerqi in https://github.com/Tencent/Firestorm/pull/24
    • [Feature] Limit upload segment num and upload time in force mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/23
    • Add metrics to collect connection info of grpc by @colinmjj in https://github.com/Tencent/Firestorm/pull/25
    • [Improvement] Fix the flaky test by @jerqi in https://github.com/Tencent/Firestorm/pull/29
    • [Improvement] Support to config byte sizes by readable string (e.g., kb/mb/gb) by @frankliee in https://github.com/Tencent/Firestorm/pull/26
    • [Feature] Add doc system based on Jekyll by @frankliee in https://github.com/Tencent/Firestorm/pull/22
    • [Feature] Read index file first in hdfs mode by @duanmeng in https://github.com/Tencent/Firestorm/pull/21
    • [Improvement] Modify some configuration's names and default values by @jerqi in https://github.com/Tencent/Firestorm/pull/27
    • Fix log4j-core dependabot by @duanmeng in https://github.com/Tencent/Firestorm/pull/32
    • [Improvement]Remove useless code in test base by @duanmeng in https://github.com/Tencent/Firestorm/pull/33
    • [Improvement] Avoid speculative tasks causing writing errors after shuffle is uploaded and deleted by @jerqi in https://github.com/Tencent/Firestorm/pull/31
    • [Improvement] fix log4j Dependabot alerts by @duanmeng in https://github.com/Tencent/Firestorm/pull/37
    • [Feature]Read handler refactor by @duanmeng in https://github.com/Tencent/Firestorm/pull/34
    • [MINOR] Fix flaky testes in HealthCheckCoordinatorGrpcTest by @frankliee in https://github.com/Tencent/Firestorm/pull/38
    • Check blockIds in read client to allow early return by @frankliee in https://github.com/Tencent/Firestorm/pull/39
    • Bump log4j-core from 2.16.0 to 2.17.0 by @dependabot in https://github.com/Tencent/Firestorm/pull/44
    • [Improvement] Support Firestorm with Spark 2.3 & 3.0 by @frankliee in https://github.com/Tencent/Firestorm/pull/41
    • Support store shuffle data in memory by @colinmjj in https://github.com/Tencent/Firestorm/pull/36
    • Update Readme with memory shuffle support by @colinmjj in https://github.com/Tencent/Firestorm/pull/46
    • [Improvement] Update import style in the Spark-like way by @frankliee in https://github.com/Tencent/Firestorm/pull/45
    • Shutdown non-daemon executor service to avoid hang when jvm destroy by @toujours33 in https://github.com/Tencent/Firestorm/pull/47
    • Fix time unit from micro to mill by @colinmjj in https://github.com/Tencent/Firestorm/pull/52
    • upload unit test log in github CI by @frankliee in https://github.com/Tencent/Firestorm/pull/54
    • [Improvement] Remove unnecssary logic to calculate bitmap split number by @colinmjj in https://github.com/Tencent/Firestorm/pull/53
    • [Feature] MultiStorageManager can write both localfile and hdfs directly by @jerqi in https://github.com/Tencent/Firestorm/pull/43
    • [Build] Only upload logs when test cases are failed by @frankliee in https://github.com/Tencent/Firestorm/pull/57
    • [Bug]Fix memory leak in spark executor by @colinmjj in https://github.com/Tencent/Firestorm/pull/58
    • [Feature] Support to skip unexpected and processed segments in LocalFileReadClient by @frankliee in https://github.com/Tencent/Firestorm/pull/55
    • [Feature] Support read from localfile, hdfs and uploaded hdfs by @jerqi in https://github.com/Tencent/Firestorm/pull/56
    • Fix log4j-core dependabot, upgrade to 2.17.1 by @frankliee in https://github.com/Tencent/Firestorm/pull/62
    • [Improvement] Reduce memory cost in spark executor by @colinmjj in https://github.com/Tencent/Firestorm/pull/59
    • [Feature] Hdfs write fail, the data will write to localfile by @jerqi in https://github.com/Tencent/Firestorm/pull/60
    • [Feature] Support StorageType MEMORY_LOCALFILE_HDFS by @jerqi in https://github.com/Tencent/Firestorm/pull/63
    • [Bug] Support to read incomplete index data by @jerqi in https://github.com/Tencent/Firestorm/pull/67
    • [Bug] LocalStorageChecker should support MEMORY_LOCALFILE and MEMORY_LOCALFILE_HDFS by @jerqi in https://github.com/Tencent/Firestorm/pull/65
    • [Feature] Add the data size of metrics HDFS and LOCALFILE by @jerqi in https://github.com/Tencent/Firestorm/pull/68
    • [Follow Up] Support to skip unexpected and processed segments in HDFS by @frankliee in https://github.com/Tencent/Firestorm/pull/64
    • [Bug] Replace parallelStream with stream except for the method sendDataAsync by @jerqi in https://github.com/Tencent/Firestorm/pull/69
    • Update version to 0.2.0 by @colinmjj in https://github.com/Tencent/Firestorm/pull/74
    • [Doc] Update readme for new configuration by @colinmjj in https://github.com/Tencent/Firestorm/pull/73

    New Contributors

    • @duanmeng made their first contribution in https://github.com/Tencent/Firestorm/pull/2
    • @colinmjj made their first contribution in https://github.com/Tencent/Firestorm/pull/8
    • @frankliee made their first contribution in https://github.com/Tencent/Firestorm/pull/26
    • @dependabot made their first contribution in https://github.com/Tencent/Firestorm/pull/44
    • @toujours33 made their first contribution in https://github.com/Tencent/Firestorm/pull/47

    Full Changelog: https://github.com/Tencent/Firestorm/commits/release-0.2.0

  • release-0.1.0(Nov 9, 2021)
