
Overview

Apache Hudi

Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage).

https://hudi.apache.org/


Features

  • Upsert support with fast, pluggable indexing
  • Atomically publish data with rollback support
  • Snapshot isolation between writer & queries
  • Savepoints for data recovery
  • Manages file sizes and layout using statistics
  • Async compaction of row & columnar data
  • Timeline metadata to track lineage
  • Optimize data lake layout with clustering

Hudi supports three types of queries (illustrated in the example after this list):

  • Snapshot Query - Provides snapshot queries on real-time data, using a combination of columnar & row-based storage (e.g., Parquet + Avro).
  • Incremental Query - Provides a change stream with records inserted or updated after a point in time.
  • Read Optimized Query - Provides excellent snapshot query performance via purely columnar storage (e.g., Parquet).
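
A minimal sketch of issuing each query type from spark-shell, assuming a Hudi table already exists at the hypothetical path /tmp/hudi_trips, the Hudi Spark bundle is on the classpath, and 20210806000000 is a placeholder commit time; the option keys are Hudi's standard DataSource read options:

// Snapshot query (the default)
val snapshotDF = spark.read.format("hudi").load("/tmp/hudi_trips")

// Incremental query: only records written after the given commit time
val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20210806000000").
  load("/tmp/hudi_trips")

// Read optimized query: reads only the columnar base files
val readOptimizedDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load("/tmp/hudi_trips")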

Learn more about Hudi at https://hudi.apache.org

Building Apache Hudi from source

Prerequisites for building Apache Hudi:

  • Unix-like system (like Linux, Mac OS X)
  • Java 8 (Java 9 or 10 may work)
  • Git
  • Maven (>=3.3.1)
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

To build the Javadoc for all Java and Scala classes:

# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs

Build with Scala 2.12

The default Scala version supported is 2.11. To build for Scala 2.12, use the scala-2.12 profile:

mvn clean package -DskipTests -Dscala-2.12

Build with Spark 3

The default Spark version supported is 2.4.4. To build for a different Spark 3 version, use the corresponding profile:

# Build against Spark 3.2.1 (the default build shipped with the public Spark 3 bundle)
mvn clean package -DskipTests -Dspark3

# Build against Spark 3.1.2
mvn clean package -DskipTests -Dspark3.1.x

Build without spark-avro module

By default, the Hudi bundle includes the spark-avro module. To build without the spark-avro module, use the spark-shade-unbundle-avro profile:

# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

Running Tests

Unit tests can be run with the Maven profile unit-tests.

mvn -Punit-tests test

Functional tests, which are tagged with @Tag("functional"), can be run with the Maven profile functional-tests.

mvn -Pfunctional-tests test

To run tests with Spark event logging enabled, define the Spark event log directory. This allows visualizing the test DAG and stages using the Spark History Server UI.

mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log

Quickstart

Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.
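
For reference, a minimal write-and-read round trip in spark-shell, in the spirit of the quick start guide; the table name, field names, and the path /tmp/hudi_quickstart are placeholders, and the Hudi Spark bundle is assumed to be on the classpath:

import org.apache.spark.sql.SaveMode._
import spark.implicits._

// Write a tiny DataFrame as a copy-on-write Hudi table
val df = Seq((1, "a", "2021-08-06"), (2, "b", "2021-08-06")).toDF("id", "name", "ds")
df.write.format("hudi").
  option("hoodie.table.name", "hudi_quickstart").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ds").
  option("hoodie.datasource.write.partitionpath.field", "ds").
  mode(Overwrite).
  save("/tmp/hudi_quickstart")

// Read it back as a snapshot query (recent Hudi versions can load the base path directly)
spark.read.format("hudi").load("/tmp/hudi_quickstart").show(false)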

Comments
  • [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency


    What is the purpose of the pull request

    This PR adds a new marker file strategy that optimizes the marker-related latency for file systems with non-trivial file I/O latency, such as Amazon S3. The existing marker file mechanism creates one marker file per data file written. When listing and deleting the marker files, S3 throttles the I/O operations if the number of marker files is huge and it can take 10 minutes for listing alone. When write concurrency is high, the marker creation can also be throttled by S3. Such behavior degrades the performance of the write. The new timeline-server-based marker file strategy delegates the marker creation and other marker-related operations from individual executors to the timeline server. The timeline server maintains the markers and consistency by periodically flushing the in-memory markers to a limited number of underlying files in the file system. In such a way, the number of actual file operations and latency related to markers can be reduced, thus improving the performance of the writes.

    To improve the efficiency of processing marker creation requests, we design the batch processing in the handler of marker requests at the timeline server. Each marker creation request is handled asynchronously in the Javalin timeline server and queued before processing. For every batch interval, e.g., 20ms, a dispatching thread pulls the pending requests from the queue and sends them to the worker thread for processing. Each worker thread processes the marker creation requests, sets the responses, and flushes the new markers by overwriting the underlying file storing the markers in the file system. There can be multiple worker threads running concurrently, given that the file overwriting takes longer than the batch interval, and each worker thread writes to an exclusive file not touched by other threads, to guarantee consistency and correctness. Both the batch interval and the number of worker threads can be configured through the write options.
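
    The following is not the actual Hudi code, but a rough Scala sketch (with hypothetical names) of the batching pattern described above: executors enqueue marker-creation requests, a single dispatching thread drains the queue every batch interval, and a worker pool flushes each batch and then completes the pending responses.

    import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

    // Hypothetical request type: a marker name plus a callback to complete once the marker is durable.
    case class MarkerRequest(markerName: String, onDone: Boolean => Unit)

    class BatchingMarkerDispatcher(batchIntervalMs: Long, numWorkers: Int) {
      private val pending = new ConcurrentLinkedQueue[MarkerRequest]()
      private val scheduler = Executors.newSingleThreadScheduledExecutor()
      private val workers = Executors.newFixedThreadPool(numWorkers)

      // Executors enqueue requests instead of touching the file system directly.
      def submit(request: MarkerRequest): Unit = pending.add(request)

      // Every batch interval, drain the queue and hand the whole batch to one worker thread.
      def start(): Unit = scheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = {
          val batch = Iterator.continually(pending.poll()).takeWhile(_ != null).toList
          if (batch.nonEmpty) workers.submit(new Runnable {
            override def run(): Unit = {
              // A real implementation would append the batch's marker names to an exclusive
              // MARKERS file here and complete the callbacks only after the flush succeeds.
              batch.foreach(_.onDone(true))
            }
          })
        }
      }, batchIntervalMs, batchIntervalMs, TimeUnit.MILLISECONDS)

      def stop(): Unit = { scheduler.shutdown(); workers.shutdown() }
    }

    In the actual change, these roles correspond to the MarkerHandler, MarkerCreationDispatchingRunnable, and BatchedMarkerCreationRunnable classes listed in the change log below.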

    Note that the worker thread always checks whether the marker has already been created by comparing the marker name from the request with the memory copy of all markers maintained in the marker directory state class. The underlying files storing the markers are only read upon the first marker request (lazy loading). The responses of requests are only sent back once the new markers are flushed to the files, so that in the case of timeline server failure, the timeline server can recover the already created markers. These ensure consistency between the file system and the in-memory copy, and improve the performance of processing marker requests.

    Since the new timeline-server-based marker mechanism is intended for cloud storage, it is not supported for HDFS at this point (an exception is thrown if timeline-server-based marker mechanism is configured for writes on HDFS).

    Brief change log

    • New marker-related write options in HoodieWriteConfig
      • hoodie.write.markers.type: marker type to use for the writes. Two modes are supported: direct and timeline_server_based.
      • hoodie.markers.timeline_server_based.batch.num_threads: number of threads to use for batch processing marker creation requests at the timeline server for timeline-server-based marker strategy
      • hoodie.markers.timeline_server_based.batch.interval_ms: the batch interval in milliseconds for marker creation batch processing for timeline-server-based marker strategy
    • Abstraction of marker mechanism (WriteMarkers class)
      • Provides an interface for marker file strategy, containing different operations (create marker, delete marker directory, get all markers, etc.)
      • Creates an enum, MarkerType, to indicate the marker type used for the current write
      • Creates a new class, DirectWriteMarkers, for the existing mechanism which directly creates a marker file per data file in the file system from each executor
      • Creates a factory class, WriteMarkersFactory, to generate WriteMarkers instance based on the marker type. Throws HoodieException if timeline_server_based is used for HDFS since the support for HDFS is not ready.
      • Uses WriteMarkersFactory at all places where markers are needed (write handles, Java, Spark, Flink engines)
    • New timeline-server-based marker file implementation
      • Creates a new class, TimelineServerBasedWriteMarkers, for the timeline-server-based marker file strategy, which sends requests to the timeline server for all marker-related operations
    • Handling marker requests at the timeline server
      • Adds a class to store request URLs and parameters in MarkerOperation
      • Creates MarkerDirState class to store the state of a marker directory and operate on the markers inside the directory
      • Adds a future class, MarkerCreationFuture, to be able to process marker creation requests asynchronously
      • Creates a request handler, MarkerHandler, for marker-related requests, which schedules the dispatching thread at a fixed rate based on the batching interval upon the first marker creation request
      • Creates a MarkerCreationDispatchingRunnable for scheduling periodic, batched processing of marker creation requests
      • Creates a BatchedMarkerCreationRunnable for batch processing marker creation requests
      • Adds a context class, BatchedMarkerCreationContext, for passing the pending requests from the dispatching thread to the worker threads
    • New unit tests around timeline-server-based marker file strategy
    • Code cleanup
      • Refactoring the configuration of EmbeddedTimelineService, TimelineService and RequestHandler

    Follow-up TODOs are summarized in this ticket: https://issues.apache.org/jira/projects/HUDI/issues/HUDI-2271

    Verify this pull request

    The direct marker file strategy is covered by existing tests in TestDirectWriteMarkers. This change adds unit tests in TestTimelineServerBasedWriteMarkers to verify the new timeline-server-based marker file strategy.

    We have also manually verified the change by running multiple Spark jobs on large datasets in the cluster with different file system settings:

    • Hadoop/Yarn cluster with HDFS testing succeeds with both marker file strategies for 30GB input data (1000 parquet files, 400M records) producing 43k data files (Note that this is done by explicitly commenting out the exception throwing in the WriteMarkersFactory so the new timeline-server-based markers can be used on HDFS. By default, the job should throw an exception in this setup, which is also tested below. We plan to enhance the support of timeline-server-based markers for HDFS in follow-ups.)
      • direct: 19.6 mins (1175943 ms)
      • timeline-server-based: 19.7 mins (1179346 ms)
    • Amazon EMR with S3 testing succeeds with both marker file strategies for 100GB input data producing 165k data files, thanks to @nsivabalan 's help
      • direct: 55 mins
      • timeline-server-based: 38 mins
    • Testing of retried stages in Spark succeeded with expected cleanup of invalid data files and correct number of records in the table
      • Enables Spark speculative execution in the spark shell with the following config, so that Spark automatically kills tasks beyond a certain duration and retries them with new attempts, intentionally emulating failed stages
        • spark.speculation=true
        • spark.speculation.multiplier=1.0
        • spark.speculation.quantile=0.5
      • Runs the same job with timeline-server-based marker file strategy
      • Driver logs show that there are quite some tasks killed and retried: WARN scheduler.TaskSetManager: Lost task 1537.1 in stage 22.0 (TID 17375, executor 290): TaskKilled (another attempt succeeded)
      • Executor logs show that invalid data files are indeed deleted at the end
        • Stage: Delete invalid files generated during the write operation, collect at HoodieSparkEngineContext.java:73
        • Log: 21/08/06 19:39:01 INFO table.HoodieTable: Deleting invalid data files=[(/tmp/temp_marker_test_table/asia/india/chennai,/tmp/temp_marker_test_table/asia/india/chennai/5dcae584-1563-4f13-9ce4-955864a2b616-38_162-12-2694_20210806192244.parquet),...
      • Verifies that the number of records written to the table is the same as the input
        • val df2 = spark.read.parquet("/tmp/temp_marker_test_table/*/*/*/*.parquet")
        • df2.select("_hoodie_record_key").distinct.count outputs the same count (400M) as the input df
    • Testing of configuring timeline-server-based markers on HDFS throws HoodieException, which is expected:
      • Caused by: org.apache.hudi.exception.HoodieException: Timeline-server-based markers are not supported for HDFS: base path /tmp/temp_marker_test_table

    More details regarding the Hadoop/Yarn cluster with HDFS testing:

    • Preparing data
    val df = spark.read.parquet("/tmp/testdata/*/*.parquet")
    
    • Spark shell command to run direct marker file strategy
    spark.time(df.write.format("hudi").
      option("hoodie.bulkinsert.shuffle.parallelism", "200").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.partitionpath.field", "partitionpath").
      option("hoodie.datasource.write.operation", "bulk_insert").
      option("hoodie.write.markers.type", "direct").
      option("hoodie.parquet.max.file.size", "1048576").
      option("hoodie.parquet.block.size", "1048576").
      option("hoodie.table.name", "temp_marker_test_table").
      mode(Overwrite).
      save("/tmp/temp_marker_test_table/")
    )
    
    • Spark shell command to run timeline-server-based marker file strategy
    spark.time(df.write.format("hudi").
      option("hoodie.bulkinsert.shuffle.parallelism", "200").
      option("hoodie.datasource.write.recordkey.field", "uuid").
      option("hoodie.datasource.write.partitionpath.field", "partitionpath").
      option("hoodie.datasource.write.operation", "bulk_insert").
      option("hoodie.write.markers.type", "timeline_server_based").
      option("hoodie.markers.timeline_server_based.batch.num_threads", "20").
      option("hoodie.markers.timeline_server_based.batch.interval_ms", "20").
      option("hoodie.parquet.max.file.size", "1048576").
      option("hoodie.parquet.block.size", "1048576").
      option("hoodie.table.name", "temp_marker_test_table").
      mode(Overwrite).
      save("/tmp/temp_marker_test_table/")
    )
    
    • Snapshot of files storing markers till the end of the job in timeline-server-based marker file strategy
    hadoop fs -ls /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940
    Found 20 items
    251846 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS0
    253029 2021-08-06 16:38 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS1
    259197 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS10
    252838 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS11
    254775 2021-08-06 16:38 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS12
    249139 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS13
    278320 2021-08-06 16:38 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS14
    266073 2021-08-06 16:38 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS15
    252428 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS16
    243916 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS17
    260791 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS18
    249626 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS19
    252976 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS2
    247380 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS3
    253442 2021-08-06 16:38 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS4
    248841 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS5
    249906 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS6
    246458 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS7
    241654 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS8
    243208 2021-08-06 16:39 /tmp/temp_marker_test_table/.hoodie/.temp/20210806161940/MARKERS9
    

    Committer checklist

    • [ ] Has a corresponding JIRA in PR title & commit

    • [ ] Commit message is descriptive of the change

    • [ ] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    priority:blocker 
    opened by yihua 78
  • [HUDI-1089] Refactor hudi-client to support multi-engine



    What is the purpose of the pull request

    Refactor hudi-client to support multi-engine

    Brief change log

    • Refactor hudi-client to support multi-engine

    Verify this pull request

    This pull request is already covered by existing tests.

    Committer checklist

    • [ ] Has a corresponding JIRA in PR title & commit

    • [ ] Commit message is descriptive of the change

    • [ ] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    opened by wangxianghu 65
  • [HUDI-3981] Flink engine support for comprehensive schema evolution


    Change Logs

    This PR adds support for reading with Flink when comprehensive schema evolution (RFC-33) is enabled and the table has undergone add column, rename column, change column type, or drop column operations.

    Impact

    user-facing feature change: comprehensive schema evolution in flink

    Risk level medium

    This change added tests and can be verified as follows:

    • Added unit test TestCastMap to verify that type conversion is correct
    • Added integration test ITTestSchemaEvolution to verify that a table with added, renamed, cast, and dropped columns is read as expected.

    Documentation Update

    There is a schema evolution doc: https://hudi.apache.org/docs/schema_evolution

    Contributor's checklist

    • [ ] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [ ] Adequate tests were added if applicable
    • [ ] CI passed
    schema-and-data-types priority:major flink big-needle-movers 
    opened by trushev 60
  • [HUDI-2560][RFC-33] Support full Schema evolution for Spark



    What is the purpose of the pull request

    Support full schema evolution for hoodie:

    1. Support Spark 3 DDL, including:

       ALTER COLUMN type changes: ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint, supporting the following type changes:

       int => long/float/double/string
       long => float/double/string
       float => double/String
       double => String/Decimal
       Decimal => Decimal/String
       String => date/decimal
       date => String

       Other ALTER statements:

       ALTER TABLE table1 ALTER COLUMN a.b.c SET NOT NULL
       ALTER TABLE table1 ALTER COLUMN a.b.c DROP NOT NULL
       ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
       ALTER TABLE table1 ALTER COLUMN a.b.c FIRST
       ALTER TABLE table1 ALTER COLUMN a.b.c AFTER x

       Add columns: ALTER TABLE table1 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)
       Rename column: ALTER TABLE table1 RENAME COLUMN a.b.c TO x
       Drop columns: ALTER TABLE table1 DROP COLUMN a.b.c and ALTER TABLE table1 DROP COLUMNS a.b.c, x, y
       Set/unset properties: ALTER TABLE table SET TBLPROPERTIES ('table_property' = 'property_value') and ALTER TABLE table UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')

    2. Support MOR (incremental/real-time/read-optimized) read/write
    3. Support COW (incremental/real-time) read/write
    4. Support MOR compaction

    Brief change log

    (for example:)

    • Modify AnnotationLocation checkstyle rule in checkstyle.xml

    Verify this pull request

    (Please pick either of the following options)

    This pull request is a trivial rework / code cleanup without any test coverage.

    (or)

    This pull request is already covered by existing tests, such as (please describe tests).

    (or)

    This change added tests and can be verified as follows:

    (example:)

    • Added integration tests for end-to-end.
    • Added HoodieClientWriteTest to verify the change.
    • Manually verified the change by running a job locally.

    Committer checklist

    • [ ] Has a corresponding JIRA in PR title & commit

    • [ ] Commit message is descriptive of the change

    • [ ] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    priority:blocker big-needle-movers 
    opened by xiarixiaoyao 56
  • [HUDI-1615] [SUPPORT] ERROR HoodieTimelineArchiveLog: Failed to archive commits


    Hello,

    • Hudi version: 0.7
    • EMR version: 6.2
    • Spark version: 3.0.1

    Hudi Options:

    Map(hoodie.datasource.hive_sync.database -> raw_courier_api_hudi, 
    hoodie.parquet.small.file.limit -> 67108864, 
    hoodie.copyonwrite.record.size.estimate -> 1024, 
    hoodie.datasource.write.precombine.field -> LineCreatedTimestamp, 
    hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.NonPartitionedExtractor, hoodie.parquet.max.file.size -> 134217728, 
    hoodie.parquet.block.size -> 67108864, 
    hoodie.datasource.hive_sync.table -> customer_address, 
    hoodie.datasource.write.operation -> upsert, 
    hoodie.datasource.hive_sync.enable -> true, 
    hoodie.datasource.write.recordkey.field -> id, 
    hoodie.table.name -> customer_address, 
    hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://emr:10000, 
    hoodie.datasource.write.hive_style_partitioning -> false, 
    hoodie.datasource.write.table.name -> customer_address, 
    hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.NonpartitionedKeyGenerator, hoodie.upsert.shuffle.parallelism -> 50)
    
    21/02/01 08:12:22 ERROR HoodieTimelineArchiveLog: Failed to archive commits, .commit file: 20210201021259.commit.requested
    java.lang.NullPointerException: null of string of map of union in field extraMetadata of org.apache.hudi.avro.model.HoodieCommitMetadata of union in field hoodieCommitMetadata of org.apache.hudi.avro.model.HoodieArchivedMetaEntry
    	at org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:145)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:139)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
    	at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.serializeRecords(HoodieAvroDataBlock.java:106)
    	at org.apache.hudi.common.table.log.block.HoodieDataBlock.getContentBytes(HoodieDataBlock.java:97)
    	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:164)
    	at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:142)
    	at org.apache.hudi.table.HoodieTimelineArchiveLog.writeToFile(HoodieTimelineArchiveLog.java:361)
    	at org.apache.hudi.table.HoodieTimelineArchiveLog.archive(HoodieTimelineArchiveLog.java:311)
    	at org.apache.hudi.table.HoodieTimelineArchiveLog.archiveIfRequired(HoodieTimelineArchiveLog.java:138)
    	at org.apache.hudi.client.AbstractHoodieWriteClient.postCommit(AbstractHoodieWriteClient.java:426)
    	at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:188)
    	at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:110)
    	at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:442)
    	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:218)
    	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
    	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:124)
    	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:123)
    	at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
    	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
    	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:107)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:132)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:104)
    	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:227)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:132)
    	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
    	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:131)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
    	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
    	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
    	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
    	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
    	at hudiwriter.HudiWriter.merge(HudiWriter.scala:72)
    	at hudiwriter.HudiContext.writeToHudi(HudiContext.scala:35)
    	at jobs.TableProcessor.start(TableProcessor.scala:84)
    	at TableProcessorWrapper$.$anonfun$main$2(TableProcessorWrapper.scala:23)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    	at scala.util.Success.$anonfun$map$1(Try.scala:255)
    	at scala.util.Success.map(Try.scala:213)
    	at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
    	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
    Caused by: java.lang.NullPointerException
    	at org.apache.avro.io.Encoder.writeString(Encoder.java:121)
    	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:267)
    	at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:262)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:128)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:234)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:121)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
    	at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
    	at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
    	at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
    	at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
    	... 59 more
    
    priority:critical writer-core 
    opened by rubenssoto 54
  • Help with Reading Kafka topic written using Debezium Connector - Deltastreamer


    Hi Team,

    I'm facing a use case where I need to ingest data from a Kafka topic using DeltaStreamer; the topic is loaded using the Debezium connector. So the topic's schema contains fields like before, after, ts_ms, op, source, etc. I'm providing the record key as after.id and the precombine key as after.timestamp, but the entire Debezium output is still being ingested.

    Please find my properties

    hoodie.upsert.shuffle.parallelism=2
     hoodie.insert.shuffle.parallelism=2
     hoodie.delete.shuffle.parallelism=2
     hoodie.bulkinsert.shuffle.parallelism=2
     hoodie.embed.timeline.server=true
     hoodie.filesystem.view.type=EMBEDDED_KV_STORE
     hoodie.compact.inline=false
    # Key fields, for kafka example
    hoodie.datasource.write.recordkey.field=after.inc_id
    hoodie.datasource.write.partitionpath.field=date
    hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
    # Schema provider props (change to absolute path based on your installation)
    #hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
    #hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
    # Kafka Source
    hoodie.deltastreamer.source.kafka.topic=airflow.public.motor_crash_violation_incidents
    #Kafka props
    bootstrap.servers=http://xxxxx:29092
    auto.offset.reset=earliest
    hoodie.deltastreamer.schemaprovider.registry.url=http://xxxxx:8081/subjects/airflow.public.motor_crash_violation_incidents-value/versions/latest
    #hoodie.deltastreamer.schemaprovider.registry.targetUrl=http://xxxxx:8081/subjects/airflow.public.motor_crash_violation_incidents-value/versions/latest
    schema.registry.url=http://xxxxx:8081
    validate.non.null = false
    
    opened by ashishmgofficial 50
  • [HUDI-2086] Refactor hive mor_incremental_view



    What is the purpose of the pull request

    Redo the logic of mor_incremental_view for Hive to fix some bugs in mor_incremental_view for Hive/Spark SQL.

    purpose of the pull request:

    1. Support reading the latest incremental data stored in log files
    2. Support reading incremental data written before a replacecommit
    3. Support reading file groups that contain only log files
    4. Keep the logic of mor_incremental_view consistent with the Spark datasource logic

    Brief change log

    (for example:)

    • Modify AnnotationLocation checkstyle rule in checkstyle.xml

    Verify this pull request

    New unit tests added.

    Committer checklist

    • [ ] Has a corresponding JIRA in PR title & commit

    • [ ] Commit message is descriptive of the change

    • [ ] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    priority:blocker priority:critical hive 
    opened by xiarixiaoyao 45
  • [HUDI-1129] Deltastreamer Add support for schema evolution


    What is the purpose of the pull request

    When the schema is evolved but a producer is still producing events using an older version of the schema, Hudi DeltaStreamer fails. This fix makes sure DeltaStreamer works correctly with schema evolution.

    Related issues #1845 #1971 #1972

    Brief change log

    • Update the Avro-to-Spark conversion method AvroConversionHelper.createConverterToRow to handle the scenario where the provided schema has more fields than the data (i.e., the producer is still sending events with the old schema)
    • Introduce a new payload class called BaseAvroPayloadWithSchema, which stores the writer schema as part of the payload. Currently, HoodieAvroUtils.avroToBytes uses the schema of the data to convert it to bytes, but HoodieAvroUtils.bytesToAvro uses a provided schema. Since the two may not always match, this results in an error. By keeping the data's schema as part of the payload, we ensure the same schema is used for converting Avro to bytes and bytes back to Avro (see the sketch after this list).
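
    A minimal, hypothetical sketch (plain Avro API, not Hudi code) of why decoding must use the writer schema: serialize a record with the old schema, then deserialize it by passing that same writer schema together with the evolved reader schema so the new field resolves to its default instead of failing.

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    // Old (writer) schema has only "id"; the evolved (reader) schema adds "note" with a default.
    val writerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}""")
    val readerSchema = new Schema.Parser().parse(
      """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"},{"name":"note","type":["null","string"],"default":null}]}""")

    // Serialize with the schema the record was actually written with.
    val record = new GenericData.Record(writerSchema)
    record.put("id", "key-1")
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()

    // Deserialize with both schemas: Avro resolves "note" to its default value.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val decoded = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
    println(decoded) // {"id": "key-1", "note": null}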

    Verify this pull request

    This change added tests and can be verified as follows:

    • Added a unit test to verify schema evolution. Thanks @sbernauer for the unit test.

    Committer checklist

    • [x] Has a corresponding JIRA in PR title & commit

    • [x] Commit message is descriptive of the change

    • [x] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    schema-and-data-types priority:critical 
    opened by sathyaprakashg 45
  • [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end


    • Flexible schema payload generation
    • Different types of workload generation such as inserts, upserts etc
    • Post process actions to perform validations
    • Interoperability of test suite to use HoodieWriteClient and HoodieDeltaStreamer so both code paths can be tested
    • Custom workload sequence generator
    • Ability to perform parallel operations, such as upsert and compaction
    opened by yanghua 45
  • [HUDI-1951] Add bucket hash index, compatible with the hive bucket


    What is the purpose of the pull request

    Index pattern 1 in RFC-29: Hash Index https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

    Brief change log

    • add a new index method
    • fix the compatibility issue when adding member variables to HoodieKey

    Verify this pull request

    This change added tests and can be verified as follows:

    • Added unit tests for bucket index and serialization
    • Modify existing integration tests to verify the new index method

    Committer checklist

    • [ ] Has a corresponding JIRA in PR title & commit

    • [ ] Commit message is descriptive of the change

    • [ ] CI is green

    • [ ] Necessary doc changes done or have another open PR

    • [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

    index big-needle-movers 
    opened by minihippo 44
  • [SUPPORT] Hudi MERGE_ON_READ load to dataframe fails for the versions [0.6.0],[0.7.0] and runs for [0.5.3]


    • Hudi is not able to read a MERGE_ON_READ table when using versions [0.6.0] and [0.7.0]. When I run the same code with version [0.5.3], I am able to read the table generated with the merge-on-read option.

    Steps to reproduce the behavior:

    1. Start a pyspark shell:

    pyspark --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.0 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

    or

    pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

    2. Read the source data and bulk insert it as a MERGE_ON_READ table:

    >>> S3_SNAPSHOT =
    >>> S3_MERGE_ON_READ =
    >>> from pyspark.sql.functions import *
    >>> df = spark.read.parquet(S3_SNAPSHOT)
    >>> df.count()
    21/01/27 14:49:13 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
    950897550
    >>> hudi_options_insert = {
    ...   "hoodie.table.name": "sample_schema.table_name",
    ...   "hoodie.datasource.write.storage.type": "MERGE_ON_READ",
    ...   "hoodie.datasource.write.recordkey.field": "id",
    ...   "hoodie.datasource.write.operation": "bulk_insert",
    ...   "hoodie.datasource.write.partitionpath.field": "ds",
    ...   "hoodie.datasource.write.precombine.field": "id",
    ...   "hoodie.insert.shuffle.parallelism": 135
    ... }
    >>> df.write.format("hudi").options(**hudi_options_insert).mode("overwrite").save(S3_MERGE_ON_READ)
    
    Expected behavior

    Data is loaded to the dataframe perfectly when the spark shell is created with the parameters:

    pyspark --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
    
    Environment Description

    EMR

    • Hudi version : [0.7.0], [0.6.0] is giving error. [0.5.3] is running fluently

    • Spark version : [2.4.4], [3.0.1]

    • Hive version :

    • Hadoop version :

    • Storage (HDFS/S3/GCS..) : S3

    • Running on Docker? (yes/no) : no
    
    
    
    
    Stacktrace

    >>> df_mor = spark.read.format("hudi").load(S3_MERGE_ON_READ + "/*")

    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 178, in load
        return self._df(self._jreader.load(path))
      File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call
      File "/usr/lib/spark/python/pyspark/sql/utils.py", line 128, in deco
        return f(*a, **kw)
      File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling o86.load.
    : java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(Lorg/apache/spark/sql/SparkSession;Lscala/collection/Seq;Lscala/collection/immutable/Map;Lscala/Option;Lorg/apache/spark/sql/execution/datasources/FileStatusCache;)V
      at org.apache.hudi.HoodieSparkUtils$.createInMemoryFileIndex(HoodieSparkUtils.scala:89)
      at org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:127)
      at org.apache.hudi.MergeOnReadSnapshotRelation.(MergeOnReadSnapshotRelation.scala:72)
      at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:89)
      at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:53)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
      at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
      at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
      at scala.Option.getOrElse(Option.scala:189)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      at py4j.Gateway.invoke(Gateway.java:282)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:238)
      at java.lang.Thread.run(Thread.java:748)

    
    
    aws-support spark query-engine 
    opened by zafer-sahin 43
  • [MINOR] Extending support to other special characters


    Change Logs

    This fix covers the issue discussed here.

    Impact

    Describe any public API or user-facing feature change or any performance impact.

    Risk level (write none, low medium or high below)

    If medium or high, explain what verification was done to mitigate the risks. none

    Documentation Update

    Describe any necessary documentation update if there is any new feature, config, or user-facing change

    • The config description must be updated if new configs are added or the default value of the configs are changed
    • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

    Contributor's checklist

    • [x] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [x] Adequate tests were added if applicable
    • [ ] CI passed
    opened by srikanthjaggari 1
  • [HUDI-5205] support flink 1.16.0


    Change Logs

    • support flink 1.16.0
    • Based on PR
      • copy the existing adapters from hudi-flink1.15.x to hudi-flink1.16.x
      • Add new adapters StreamWriteOperatorCoordinatorAdapter & SortOperatorGenAdapter in each flink module
      • add flink1.16 profile in ci matrix

    Impact

    Low

    Risk level (write none, low medium or high below)

    Low

    Documentation Update

    The official documentation needs to be updated.

    Contributor's checklist

    • [ ] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [ ] Adequate tests were added if applicable
    • [ ] CI passed
    opened by stayrascal 1
  • [SUPPORT] Unable to query Partitioned COW Hudi tables with metadata enabled using Trino-Hudi Connector


    Describe the problem you faced

    Original issue: https://github.com/trinodb/trino/issues/15368

    Our team is testing the same on COPY_ON_WRITE Hudi (0.10.1) tables with metadata enabled, using Trino 400, and we are facing an error while reading from partitioned tables: Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.

    The issue was resolved by placing some dependencies in the classpath. Interestingly, those dependencies are already included in the trino-hudi-bundle. This particular issue tracks the gap in packaging.

    To Reproduce

    Steps to reproduce the behavior:

    1. Write a Hudi COW table with the below properties and metadata enabled.
    2. Query the same table using the trino-hudi connector (properties mentioned below) with hudi.metadata-enabled=true.

    Trino Hudi Connector Properties:

    connector.name=hudi
    hive.metastore.uri={METASTORE_URI}
    hive.s3.iam-role={S3_IAM_ROLE}
    hive.metastore-refresh-interval=2m
    hive.metastore-timeout=3m
    hudi.max-outstanding-splits=1800
    hive.s3.max-error-retries=50
    hive.s3.connect-timeout=1m
    hive.s3.socket-timeout=2m
    hudi.parquet.use-column-names=true
    hudi.metadata-enabled=true
    

    Hudi Properties set while writing:

    hoodie.datasource.write.partitionpath.field = "insert_ds_ist",
    hoodie.datasource.write.recordkey.field = "id",
    hoodie.datasource.write.precombine.field = "_hoodie_incremental_key", (self generated column),
    hoodie.datasource.write.hive_style_partitioning = "true",
    hoodie.datasource.hive_sync.auto_create_database = "true",
    hoodie.parquet.compression.codec = "gzip",
    hoodie.table.name = "<table_name>",
    hoodie.datasource.write.keygenerator.class = "org.apache.hudi.keygen.SimpleKeyGenerator",
    hoodie.datasource.write.table.type = "COPY_ON_WRITE",
    hoodie.metadata.enable = "true",
    hoodie.datasource.hive_sync.enable = "true",
    hoodie.datasource.hive_sync.partition_fields = "insert_ds_ist",
    hoodie.datasource.hive_sync.partition_extractor_class = "org.apache.hudi.hive.MultiPartKeysValueExtractor"
    

    General information of the table:

    • Total rows = 1,213,959,199
    • Total partitions = 2400+
    • Total file objects = 120,000
    • Total size on S3 = 12~13 GB
    • The table was upgraded from 0.9.0 to 0.10.1

    Coordinator Relevant Logs:

    Expected behavior

    The query should work out-of-the-box without having to place jars in the classpath.

    Environment Description

    • Hudi version : 0.10.1

    • Spark version : 2.4

    • Trino version : 400

    • Hadoop version :

    • Storage (HDFS/S3/GCS..) :

    • Running on Docker? (yes/no) : no

    Additional context

    Add any other context about the problem here.

    Stacktrace

    Full stacktrace in Partitioned_COW_Hudi_Coordinator_logs.log

    priority:major query-engine 
    opened by codope 2
  • [HUDI-5488]Make sure Discrupt queue start first, then insert records


    Change Logs

    We must make sure that the Disruptor's queue is set up first; only then can the producer insert records into the queue. Currently we have no guarantee which thread starts first, so this PR fixes that.

    CompletableFuture<Void> consuming = startConsumingAsync();
    CompletableFuture<Void> producing = startProducingAsync();
    

    Also, I think the test TestDisruptorExecutionInSpark#testExecutor and TestDisruptorMessageQueue#testRecordReading failures relate to this bug.


    Impact

    Describe any public API or user-facing feature change or any performance impact. none

    Risk level (write none, low medium or high below)

    If medium or high, explain what verification was done to mitigate the risks.

    none

    Documentation Update

    Describe any necessary documentation update if there is any new feature, config, or user-facing change

    • The config description must be updated if new configs are added or the default value of the configs are changed
    • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

    Contributor's checklist

    • [ ] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [ ] Adequate tests were added if applicable
    • [ ] CI passed
    opened by boneanxs 5
  • [HUDI-5434] Fix archival in metadata table to not rely on completed rollback or clean in data table


    Change Logs

    Before this PR, the archival for the metadata table uses the earliest instant of all actions from the active timeline of the data table. In the archival process, CLEAN and ROLLBACK instants are archived separately apart from commits (check HoodieTimelineArchiver#getCleanInstantsToArchive). Because of this, a very old completed CLEAN or ROLLBACK instant in the data table can block the archive of the metadata table timeline and causes the active timeline of the metadata table to be extremely long, leading to performance issues for loading the timeline.

    This PR changes the archival in metadata table to not rely on completed rollback or clean in data table, by archiving the metadata table's instants after the earliest commit (COMMIT, DELTA_COMMIT, and REPLACE_COMMIT only) and the earliest inflight instant (all actions) in the data table's active timeline.

    The savepoints are seamlessly handled here, i.e., the completed savepoints do not affect the archive process in the metadata table.

    New tests are added and an existing test around the archival in the metadata table is adjusted to verify that the archival in the metadata table does not depend on the completed rollback in the data table.

    This PR is tested locally to make sure the archival in the metadata table works as expected.

    Impact

    Makes the active timeline of the metadata table shorter and improves the performance of loading the active timeline of the metadata table.

    Risk level

    low

    Documentation Update

    N/A

    Contributor's checklist

    • [ ] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [ ] Adequate tests were added if applicable
    • [ ] CI passed
    priority:critical metadata 
    opened by yihua 1
  • [HUDI-5487] Reduce duplicate Logs in ExternalSpillableMap


    Change Logs

    Reduce duplicate Logs in ExternalSpillableMap

    We see hundreds of thousands of duplicate logs in the executor log.

    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    22/12/26 21:13:40,864 [Executor task launch worker for task 0.0 in stage 480.0 (TID 211376)] INFO ExternalSpillableMap: Update Estimated Payload size to => 4567
    

    Impact

    Reduce noisy logs and reduce log size

    Risk level (write none, low medium or high below)

    none

    Documentation Update

    Contributor's checklist

    • [ ] Read through contributor's guide
    • [ ] Change Logs and Impact were stated clearly
    • [ ] Adequate tests were added if applicable
    • [ ] CI passed
    opened by cxzl25 2
Releases (latest: release-0.12.2)
  • release-0.12.2(Dec 28, 2022)

  • release-0.12.1(Oct 19, 2022)

  • release-0.12.0(Aug 17, 2022)

  • release-0.11.1(Jun 18, 2022)

  • release-0.11.0(May 2, 2022)

  • release-0.10.1(Jan 26, 2022)

  • release-0.10.0(Dec 12, 2021)

  • release-0.9.0(Aug 31, 2021)

  • release-0.8.0(Apr 6, 2021)

  • release-0.7.0(Jan 26, 2021)

  • release-0.6.0(Aug 25, 2020)

  • 0.5.3(Jul 21, 2020)

  • hoodie-0.4.7(May 29, 2019)

    Highlights

    • Major release with fundamental changes to filesystem listing & write failure handling
    • Introduced the first version of HoodieTimelineServer that runs embedded on the driver
    • With all executors fetching filesystem listings via RPC to the timeline server, filesystem listing load is drastically reduced!
    • Failing concurrent write tasks are now handled differently to be robust against spark stage retries
    • Bug fixes/clean up around indexing, compaction

    Full PR List

    • @bvaradar - HUDI-135 - Skip Meta folder when looking for partitions #698
    • @bvaradar - HUDI-136 - Only inflight commit timeline (.commit/.deltacommit) must be used when checking for sanity during compaction scheduling #699
    • @bvaradar - HUDI-134 - Disable inline compaction for Hoodie Demo #696
    • @v3nkatesh - default implementation for HBase index qps allocator #685
    • @bvaradar - SparkUtil#initLauncher shoudn't raise when spark-defaults.conf doesn't exist #670
    • HUDI-131 Zero File Listing in Compactor run #693
    • @vinothchandar - Fixed HUDI-116 : Handle duplicate record keys across partitions #687
    • @leilinen - HUDI-105 : Fix up offsets not available on leader exception #650
    • @bvaradar - Allow users to set hoodie configs for Compactor, Cleaner and HDFSParquetImporter utility scripts #691
    • @bvaradar - Spark Stage retry handling #651
    • @pseudomoto - HUDI-113: Use Pair over # delimited string #672
    • @bvaradar - Support nested types for recordKey, partitionPath and combineKey #684
    • @vinothchandar - Downgrading fasterxml jackson to 2.6.7 to be spark compatible #686
    • @bvaradar - Timeline Service with Incremental View Syncing support #600
  • hoodie-0.4.6(May 29, 2019)

    Highlights

    • Index performance! Interval trees + bucketized checking speed up index lookups by up to 10x!
    • Faster writing due to cached Avro encoders/decoders, lighter memory usage, and less data shuffled.
    • Support for spark jobs using > 1 cores per executor
    • DeltaStreamer bug fixes (inline compaction, hive sync, error record handling)
    • Empty Record payload to support deletes out-of-box easily
    • Fixes to hive/spark bundles around dependencies, versioning, shading

    Full PR List

    • @bvaradar - Minor CLI documentation change in delta-streamer #679
    • @n3nash - converting map task memory from mb to bytes #678
    • @bvaradar - Fix various errors found by long running delta-streamer tests #675
    • @vinothchandar - Bucketized Bloom Filter checking #671
    • @pseudomuto - SparkUtil#initLauncher shoudn't raise when spark-defaults.conf doesn't exist #670
    • @abhioncbr - HUDI-101: added exclusion filters for signature files. #669
    • @ovj - migrating kryo's dependency from twitter chill to plain kryo library #649
    • @bvaradar - Revert "HUDI-101: added mevn-shade plugin with filters." #665
    • @abhioncbr - HUDI-101: added mevn-shade plugin with filters. #659
    • @bvaradar - Rollback inflights when using Spark [Streaming] write #660
    • @vinothchandar - Making DataSource/DeltaStreamer use defaults for combining #634
    • @vinothchandar - Fixes HUDI-85 : Interval tree based pruning for Bloom Index #653
    • @takezoe - Fix to enable hoodie.datasource.read.incr.filters #655
    • @n3nash - Removing OLD MAGIC header #648
    • @bvaradar - Revert "Read and apply schema for each log block from the metadata header instead of the latest schema" #647
    • @lyogev - Add empty payload class to support deletes via apache spark #635
    • @bvaradar - Move to apachehudi dockerhub repository & use openjdk docker containers #644
    • @bvaradar - Fix Hive RT query failure in hoodie demo #645
    • @ovj - Revert - Replacing Apache commons-lang3 object serializer with Kryo #642
    • @n3nash - Read and apply schema for each log block from the metadata header instead of the latest schema #640
    • @bhasudha - FIXES HUDI-98: Fix multiple issues when using build_local_docker_images for demo setup #636
    • @n3nash - Performing commit archiving in batches to avoid keeping a huge chunk in memory #631
    • @bvaradar - Essential Hive packages missing in hoodie spark bundle #633
    • @n3nash - 1. Minor changes to fix compaction 2. Adding 2 compaction policies 3. Adding a Hbase index property #629
    • @milantracy - [HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated f… #627
    • @vinothchandar - Fixing small file handling, inline compaction defaults #599
    • @vinothchandar - Follow up HUDI-27 : Call super.close() in HoodieWrapperFileSystem::close() #621
    • @vinothchandar - Fix HUDI-27 : Support num_cores > 1 for writing through spark #620
    • @vinothchandar - Fixes HUDI-38: Reduce memory overhead of WriteStatus #616
    • @vinothchandar - Fixed HUDI-87 : Remove schemastr from BaseAvroPayload #619
    • @vinothchandar - Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsR… #617
    • @n3nash - Fixing source schema and writer schema distinction in payloads #612
    • @ambition119 - [HUDI-63] Removed unused BucketedIndex code #608
    • @bvaradar - run_hive_sync tool must be able to handle case where there are multiple standalone jdbc jars in hive installation dir #609
    • @milantracy - add a script that shuts down demo cluster gracefully #606
    • @n3nash - Enable multi rollbacks for MOR table type #546
    • @ovj - Replacing Apache commons-lang3 object serializer with Kryo serializer #583
    • @kaka11chen - Add compression codec configurations for HoodieParquetWriter. #604
    • @smarthi - HUDI-75: Add KEYS #601
    • @vinothchandar - Removing docs folder from master branch #602
    • @bvaradar - Fix hive sync and deltastreamer issue in demo #593
    • @bhasudha - Fix quickstart documentation for querying via Presto #598
    • @ovj - Handling duplicate record update for single partition (duplicates in single or different parquet files) #584
    • @kaka11chen - Fix avro doesn't have short and byte type. #595
    • @bvaradar - FileSystem View to handle same fileIds across partitions correctly #572
    • @vinothchandar - Upgrade various jar, gem versions for maintenance #575
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.4.5(May 29, 2019)

    Highlights

    • Dockerized demo with support for different Hive versions
    • Smoother handling of append log on cloud stores
    • Introduced a global bloom index that enforces a uniqueness constraint across partitions
    • CLI commands to analyze workloads, manage compactions
    • Migration guide for folks wanting to move datasets to Hudi
    • Added Spark Structured Streaming support, with a Hudi sink (see the sketch after this list)
    • In-built support for filtering duplicates in DeltaStreamer
    • Support for plugging in custom transformations in DeltaStreamer
    • Better support for non-partitioned Hive tables
    • Support for hard deletes on Merge on Read storage
    • New Slack & site URLs
    • Added Presto bundle for easier integration
    • Tons of bug fixes, reliability improvements
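
    For the Structured Streaming sink referenced above, a streaming DataFrame can be written continuously into a Hudi table roughly as sketched below. The source, checkpoint location, table path and field names are made up for illustration, and the option keys follow current Hudi documentation rather than this exact release.

    ```scala
    // Minimal sketch of a Structured Streaming pipeline ending in a Hudi sink,
    // assuming a spark-shell with the Hudi bundle on the classpath.
    // Source directory, schema, paths and field names are illustrative only.
    import org.apache.spark.sql.streaming.Trigger

    val schema = spark.read.json("/tmp/seed_events").schema  // infer schema from a seed batch

    val events = spark.readStream
      .schema(schema)
      .json("/tmp/incoming_events")          // any streaming source works here

    val query = events.writeStream
      .format("hudi")                        // Hudi acting as the streaming sink
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "event_id")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("checkpointLocation", "/tmp/hudi_events_checkpoint")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start("/tmp/hudi_events")

    query.awaitTermination()
    ```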

    Full PR List

    • @bhasudha - Create hoodie-presto bundle jar. fixes #567 #571
    • @bhasudha - Close FSDataInputStream for meta file open in HoodiePartitionMetadata . Fixes issue #573 #574
    • @yaooqinn - handle no such element exception in HoodieSparkSqlWriter #576
    • @vinothchandar - Update site url in README
    • @yaooqinn - typo: bundle jar with unrecognized variables #570
    • @bvaradar - Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions #565
    • @bvaradar - Fix Hoodie Record Reader to work with non-partitioned dataset ( ISSUE-561) #569
    • @bvaradar - Hoodie Delta Streamer Features : Transformation and Hoodie Incremental Source with Hive integration #485
    • @vinothchandar - Updating new slack signup link #566
    • @yaooqinn - Using immutable map instead of mutables to generate parameters #559
    • @n3nash - Fixing behavior of buffering in Create/Merge handles for invalid/wrong schema records #558
    • @n3nash - cleaner should now use commit timeline and not include deltacommits #539
    • @n3nash - Adding compaction to HoodieClient example #551
    • @n3nash - Filtering partition paths before performing a list status on all partitions #541
    • @n3nash - Passing a path filter to avoid including folders under .hoodie directory as partition paths #548
    • @n3nash - Enabling hard deletes for MergeOnRead table type #538
    • @msridhar - Add .m2 directory to Travis cache #534
    • @artem0 - General enhancements #520
    • @bvaradar - Ensure Hoodie works for non-partitioned Hive table #515
    • @xubo245 - Fix some spelling errors in Hudi #530
    • @leletan - feat(SparkDataSource): add structured streaming sink #486
    • @n3nash - Serializing the complete payload object instead of serializing just the GenericRecord in HoodieRecordConverter #495
    • @n3nash - Returning empty Statues for an empty spark partition caused due to incorrect bin packing #510
    • @bvaradar - Avoid WriteStatus collect() call when committing batch to prevent Driver side OOM errors #512
    • @vinothchandar - Explicitly handle lack of append() support during LogWriting #511
    • @n3nash - Fixing number of insert buckets to be generated by rounding off to the closest greater integer #500
    • @vinothchandar - Enabling auto tuning of insert splits by default #496
    • @bvaradar - Useful Hudi CLI commands to debug/analyze production workloads #477
    • @bvaradar - Compaction validate, unschedule and repair #481
    • @shangxinli - Fix addMetadataFields() to carry over 'props' #484
    • @n3nash - Adding documentation for migration guide and COW vs MOR tradeoffs #470
    • @leletan - Add additional feature to drop later arriving dups #468
    • @bvaradar - Fix regression bug which broke HoodieInputFormat handling of non-hoodie datasets #482
    • @vinothchandar - Add --filter-dupes to DeltaStreamer #478
    • @bvaradar - A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests #455
    • @bvaradar - Ensure Hoodie metadata folder and files are filtered out when constructing Parquet Data Source #473
    • @leletan - Adds HoodieGlobalBloomIndex #438
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.4.4(Sep 28, 2018)

    Release 0.4.4

    Highlights

    • Dependencies are now decoupled from CDH and based on apache versions!
    • Support for Hive 2 is here!! Use -Dhive11 to build for older hive versions
    • DeltaStreamer tool reworked to simplify configs, harden tests, and add Confluent Kafka support
    • Provide strong consistency for S3 datasets
    • Removed dependency on commons lang3, to ease use with different hadoop/spark versions
    • Better CLI support and docs for managing async compactions
    • New CLI commands to manage datasets

    Full PR List

    • @vinothchandar - Perform consistency checks during write finalize #464
    • @bvaradar - Travis CI tests needs to be run in quieter mode (WARN log level) to avoid max log-size errors #465
    • @lys0716 - Fix the name of avro schema file in Test #467
    • @bvaradar - Hive Sync handling must work for datasets with multi-partition keys #460
    • @bvaradar - Explicitly release resources in LogFileReader and TestHoodieClientBase. Fixes Memory allocation errors #463
    • @bvaradar - [Release Blocking] Ensure packaging modules create sources/javadoc jars #461
    • @vinothchandar - Fix bug with incrementally pulling older data #458
    • @saravsars - Updated jcommander version to fix NPE in HoodieDeltaStreamer tool #443
    • @n3nash - Removing dependency on apache-commons lang 3, adding necessary classes as needed #444
    • @n3nash - Small file size handling for inserts into log files. #413
    • @vinothchandar - Update Gemfile.lock with higher ffi version
    • @bvaradar - Simplify and fix CLI to schedule and run compactions #447
    • @n3nash - Fix an intermittently failing test case in TestMergeOnRead due to incorrect prev commit time #448
    • @bvaradar - CLI to create and desc hoodie table #446
    • @vinothchandar - Reworking the deltastreamer tool #449
    • @bvaradar - Docs for describing async compaction and how to operate it #445
    • @n3nash - Adding check for rolling stats not present in existing timeline to handle backwards compatibility #451
    • @bvaradar @vinothchandar - Moving all dependencies off cdh and to apache #420
    • @bvaradar - Reduce minimum delta-commits required for compaction #452
    • @bvaradar - Use spark Master from environment if set #454
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.4.3(Aug 23, 2018)

    Release 0.4.3

    Highlights

    • Ability to run compactions asynchronously & in parallel with ingestion/writes added!
    • Day-based compaction strategy is now IO-agnostic, i.e. it no longer respects IO budgets
    • Adds ability to throttle writes to HBase via the HBaseIndex (see the sketch after this list)
    • (Merge on Read) Inserts are sent to log files, if they are indexable.
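
    The HBase write throttling referenced above is driven by index configs passed alongside the usual write options. The sketch below is an assumption-heavy illustration: the config keys (hoodie.index.type, hoodie.index.hbase.*) are taken from current Hudi documentation and the ZooKeeper/table values are placeholders.

    ```scala
    // Minimal sketch: enable the HBase index and cap the share of HBase region
    // server QPS this job may consume. All values below are placeholders and the
    // config keys follow current Hudi docs, not necessarily this exact release.
    import org.apache.spark.sql.SaveMode

    val batch = spark.read.json("/tmp/events_batch")

    batch.write.format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "event_id")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.index.type", "HBASE")
      .option("hoodie.index.hbase.zkquorum", "zk1,zk2,zk3")
      .option("hoodie.index.hbase.zkport", "2181")
      .option("hoodie.index.hbase.table", "hudi_index")
      // Fraction of each region server's QPS this writer is allowed to use.
      .option("hoodie.index.hbase.qps.fraction", "0.5")
      .mode(SaveMode.Append)
      .save("/tmp/hudi_events")
    ```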

    Full PR List

    • @n3nash - Adding ability for inserts to be written to log files #400
    • @n3nash - Fixing bug introduced in rollback for MOR table type with inserts into log files #417
    • @n3nash - Changing Day based compaction strategy to be IO agnostic #398
    • @ovj - Changing access level to protected so that subclasses can access it #421
    • @n3nash - Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled #419
    • @bvaradar - Async compaction - Single Consolidated PR #404
    • @bvaradar - BUGFIX - Use Guava Optional (which is Serializable) in CompactionOperation to avoid NoSerializableException #435
    • @n3nash - Adding another metric to HoodieWriteStat #434
    • @n3nash - Fixing Null pointer exception in finally block #440
    • @kaushikd49 - Throttling to limit QPS from HbaseIndex #427
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.4.2(Jun 11, 2018)

    Release 0.4.2

    Highlights

    • Parallelized Parquet writing & input record reads, resulting in up to 2x performance improvement
    • Better out-of-box configs to support up to 500GB upserts, improved ROPathFilter performance
    • Added a union mode for the RT view, which supports near-real-time event ingestion without update semantics
    • Added a tuning guide with suggestions for oft-encountered problems
    • New configs for compression ratio and index storage levels (see the sketch after this list)
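
    The new tuning configs mentioned above are passed the same way as any other hoodie.* option on the write path. A hedged sketch follows; the keys shown (hoodie.parquet.compression.ratio, hoodie.parquet.compression.codec) are as documented in current Hudi, and the paths and field names are invented for illustration.

    ```scala
    // Minimal sketch: pass storage tuning configs as datasource write options.
    // Config keys follow current Hudi docs and may differ slightly in this
    // release; the table path and field names are illustrative only.
    import org.apache.spark.sql.SaveMode

    val batch = spark.read.json("/tmp/events_batch")

    batch.write.format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "event_id")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // Expected compressed/uncompressed ratio, used when sizing parquet files.
      .option("hoodie.parquet.compression.ratio", "0.1")
      .option("hoodie.parquet.compression.codec", "gzip")
      .mode(SaveMode.Append)
      .save("/tmp/hudi_events")
    ```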

    Full PR List

    • @jianxu - Use hadoopConf in HoodieTableMetaClient and related tests #343
    • @jianxu - Add more options in HoodieWriteConfig #341
    • @n3nash - Adding a tool to read/inspect a HoodieLogFile #328
    • @ovj - Parallelizing parquet write and spark's external read operation. #294
    • @n3nash - Fixing memory leak due to HoodieLogFileReader holding on to a logblock #346
    • @kaushikd49 - DeduplicateRecords based on recordKey if global index is used #345
    • @jianxu - Checking storage level before persisting preppedRecords #358
    • @n3nash - Adding config for parquet compression ratio #366
    • @xjodoin - Replace deprecated jackson version #367
    • @n3nash - Making ExternalSpillableMap generic for any datatype #350
    • @bvaradar - CodeStyle formatting to conform to basic Checkstyle rules. #360
    • @vinothchandar - Update release notes for 0.4.1 (post) #371
    • @bvaradar - Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases #372
    • @n3nash - Parallelized read-write operations in Hoodie Merge phase #370
    • @n3nash - Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream #373
    • @suniluber - Fix for updating duplicate records in same/different files in same pa… #380
    • @bvaradar - Fixit : Add Support for ordering and limiting results in CLI show commands #383
    • @n3nash - Adding metrics for MOR and COW #365
    • @n3nash - Adding a fix/workaround when fs.append() unable to return a valid outputstream #388
    • @n3nash - Minor fixes for MergeOnRead MVP release readiness #387
    • @bvaradar - Issue-257: Support union mode in HoodieRealtimeRecordReader for pure insert workloads #379
    • @n3nash - Enabling global index for MOR #389
    • @suniluber - Added a new filter function to filter by record keys when reading parquet file #395
    • @vinothchandar - Improving out of box experience for data source #295
    • @xjodoin - Fix wrong use of TemporaryFolder junit rule #411
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.4.0(Oct 3, 2017)

    Release 0.4.0

    Highlights

    • Spark datasource API now supported for Copy-On-Write datasets, across all views (see the sketch after this list)
    • BloomIndex can now prune based on key ranges & cut down index tagging time dramatically, for time-prefixed/ordered record keys
    • Hive sync tool registers RO and RT tables now.
    • Client applications can now specify the partitioner to be used by bulkInsert(), useful for low-level control over initial record placement
    • Framework for metadata tracking inside IO handles, to implement Spark accumulator-style counters that are consistent with the timeline
    • Bug fixes around cleaning, savepoints & upsert's partitioner.
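
    Since the datasource support above spans all views, incremental pulls can also be expressed directly through the Spark datasource, roughly as below. The option keys (hoodie.datasource.query.type, hoodie.datasource.read.begin.instanttime) are as named in current Hudi documentation, and the path and commit time are placeholders.

    ```scala
    // Minimal sketch: snapshot and incremental reads through the Spark datasource.
    // The base path and begin instant time are placeholders; option keys follow
    // current Hudi docs and may be named differently in older releases.
    val basePath = "/tmp/hudi_trips"

    // Snapshot query: the latest view of the Copy-On-Write table.
    val snapshotDF = spark.read.format("hudi").load(basePath)

    // Incremental query: only records written after the given commit time.
    val incrementalDF = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20190101000000")
      .load(basePath)

    incrementalDF.createOrReplaceTempView("hudi_trips_incremental")
    spark.sql("select count(*) from hudi_trips_incremental").show()
    ```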

    Full PR List

    • @gekath - Writes relative paths to .commit files #184
    • @kaushikd49 - Correct clean bug that causes exception when partitionPaths are empty #202
    • @vinothchandar - Refactor HoodieTableFileSystemView using FileGroups & FileSlices #201
    • @prazanna - Savepoint should not create a hole in the commit timeline #207
    • @jianxu - Fix TimestampBasedKeyGenerator in HoodieDeltaStreamer when DATE_STRING is used #211
    • @n3nash - Sync Tool registers 2 tables, RO and RT Tables #210
    • @n3nash - Using FsUtils instead of Files API to extract file extension #213
    • @vinothchandar - Edits to documentation #219
    • @n3nash - Enabled deletes in merge_on_read #218
    • @n3nash - Use HoodieLogFormat for the commit archived log #205
    • @n3nash - fix for cleaning log files in master branch (mor) #228
    • @vinothchandar - Adding range based pruning to bloom index #232
    • @n3nash - Use CompletedFileSystemView instead of CompactedView considering deltacommits too #229
    • @n3nash - suppressing logs (under 4MB) for jenkins #240
    • @jianxu - Add nested fields support for MOR tables #234
    • @n3nash - adding new config to separate shuffle and write parallelism #230
    • @n3nash - adding ability to read archived files written in log format #252
    • @ovj - Removing randomization from UpsertPartitioner #253
    • @ovj - Replacing SortBy with custom partitioner #245
    • @esmioley - Update deprecated hash function #259
    • @vinothchandar - Adding canIndexLogFiles(), isImplicitWithStorage(), isGlobal() to HoodieIndex #268
    • @kaushikd49 - Hoodie Event callbacks #251
    • @vinothchandar - Spark Data Source (finally) #266
    Source code(tar.gz)
    Source code(zip)
  • hoodie-0.3.8(Jun 16, 2017)

    Highlights

    • Merge on Read tested end to end. Ingestion - Hive Registration - Querying non-nested fields
    • Contributions from @kaushikd49 @n3nash @dannyhchen @zqureshi @vinothchandar and @prazanna

    New Features

    • #149 Introduce custom log format (HoodieLogFormat) for the log files
    • #141 Introduce Compaction Strategies for Merge on Read table and implement UnboundedCompactionStrategy and IOBoundedCompactionStrategy
    • #42 Implement HoodieRealtimeInputFormat and HoodieRealtimeRecordReader
    • #150 Rewrite hoodie-hive to incrementally sync partitions based on the last commit that was successfully synced

    Changes

    Commits: 21e334...4b26be

    Source code(tar.gz)
    Source code(zip)