Apache Druid: a high performance real-time analytics database.



Website | Documentation | Developer Mailing List | User Mailing List | Slack | Twitter | Download


Apache Druid

Druid is a high performance real-time analytics database. Druid's main value add is to reduce time to insight and action.

Druid is designed for workflows where fast queries and ingest really matter. Druid excels at powering UIs, running operational (ad-hoc) queries, or handling high concurrency. Consider Druid as an open source alternative to data warehouses for a variety of use cases.

Getting started

You can get started with Druid with our local or Docker quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console (shown below).
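
As a rough illustration of the HTTP API, the sketch below builds (but does not send) a request for Druid's SQL endpoint, `/druid/v2/sql`, which accepts a JSON body with a `query` field. The host and port (`localhost:8888`) and the `wikipedia` datasource are assumptions borrowed from a typical quickstart setup, and `build_sql_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build (but do not send) a POST request for Druid's SQL endpoint."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local quickstart address and example datasource name.
request = build_sql_request(
    "http://localhost:8888",
    "SELECT COUNT(*) AS cnt FROM wikipedia",
)

# Against a running cluster, the request could be sent with:
#   urllib.request.urlopen(request).read()
```

The request is only constructed here, so the snippet runs without a cluster; sending it is left as a comment.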

Load data

Load streaming and batch data using a point-and-click wizard that guides you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All views are powered by SQL system tables, so you can see the underlying query for each view.

Issue queries

Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most out of Druid.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs in this repository and submit a pull request.

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real-time on the #druid channel in the Apache Slack team. Please use this invitation link to join the ASF Slack, and once joined, go into the #druid channel.

Building from source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/development/build.md

Contributing

Please follow the community guidelines for contributing.

For instructions on setting up IntelliJ, see dev/intellij-setup.md

License

Apache License, Version 2.0

Comments
  • Add Rackspace Cloud Files Deep Storage Extension

    This is the first version to support Rackspace's Cloud Files as Druid deep storage. It still lacks tests. As already discussed here https://groups.google.com/forum/#!topic/druid-development/9O5NMo2-4WU, I would also like some suggestions about the pom.xml, as I'm having some difficulty building Druid along with the cloud-files extension. Here's my current working solution (from the root of the project):

    mvn install -DskipTests=true -Dmaven.javadoc.skip=true
    cp -f services/target/druid-services-0.8.2-SNAPSHOT-selfcontained.jar /usr/local/druid/lib
    java "-Ddruid.extensions.coordinates=[\"io.druid.extensions:druid-kafka-eight:0.8.2-SNAPSHOT\",\"io.druid.extensions:mysql-metadata-storage:0.8.2-SNAPSHOT\",\"io.druid.extensions:druid-cloudfiles-extensions:0.8.2-SNAPSHOT\"]" -Ddruid.extensions.localRepository=/usr/local/druid/repository "-Ddruid.extensions.remoteRepositories=[\"file:///root/.m2/repository/\",\"http://repo1.maven.org/maven2/\",\"https://metamx.artifactoryonline.com/metamx/pub-libs-releases-local\"]" -cp /usr/local/druid/lib/* io.druid.cli.Main tools pull-deps
    cd extensions/cloudfiles-extensions/
    mvn dependency:copy-dependencies "-DoutputDirectory=/usr/local/druid/jclouds"
    cd ../..
    

    and then I run the various nodes:

    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp config/_common:config/coordinator:/usr/local/druid/lib/*:/usr/local/druid/jclouds/* io.druid.cli.Main server coordinator
    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp config/_common:config/historical:/usr/local/druid/lib/*:/usr/local/druid/jclouds/* io.druid.cli.Main server historical
    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -cp config/_common:config/broker:/usr/local/druid/lib/*:/usr/local/druid/jclouds/* io.druid.cli.Main server broker
    java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/kafka/kafka_realtime.spec -cp config/_common:config/realtime:/usr/local/druid/lib/*:/usr/local/druid/jclouds/* io.druid.cli.Main server realtime
    
    Feature 
    opened by se7entyse7en 59
  • Delegate creation of segmentPath/LoadSpec to DataSegmentPushers and add S3a support

    • Delegate creation/naming of segments to DataSegmentPusher.
    • Add support for the S3a scheme.
    • Add dependency jars to the HDFS module.
    • Update the Hadoop compile version.
    • Propagate some default runtime properties to the Hadoop indexer JVM. Currently the defaults are druid.storage.* and druid.javascript.*
    Feature Release Notes Improvement 
    opened by b-slim 53
  • Priority based task locking

    This PR corresponds to the issue https://github.com/druid-io/druid/issues/1513

    Design details -

    • The task priority is used for acquiring a lock on an interval for a datasource. Tasks with higher priority can preempt lower-priority tasks for the same datasource and interval if they run concurrently. The flow for acquiring and upgrading locks by a Task during TaskLifeCycle is as follows -
      • Check for TaskLocks on the same datasource with an overlapping interval
        • If one or more of the TaskLocks have the same or higher priority, then stop and retry after some time
        • If no TaskLock is present, or all the TaskLocks are of lower priority, then check whether any lock is an exclusive lock
          • If no, then revoke all the lower-priority TaskLocks and create a new TaskLock
          • If yes, then stop and retry after some time
      • Before publishing the segments, upgrade the TaskLock to an exclusive lock
        • If the upgrade succeeds, publish the segments and return success status for the Task
        • If the upgrade fails, return failure status for the Task

    Tasks with no priority specified will have the respective default priorities as per the task type

    • Default priorities for task
      • Realtime Index Task - 75
      • Hadoop/Index Task - 50
      • Merge/Append Task - 25
      • Other Tasks - 0
    • The higher the number, the higher the priority.

    For example, if a Hadoop Index task is running and a Realtime Index task starts that wants to publish a segment for the same (or overlapping) interval for the same datasource, then it will override the task locks of the Hadoop Index task. Consequently, the Hadoop Index task will fail before publishing the segment.

    Note - There is no need to set this property; a task automatically gets a default priority according to its type. However, the default priority can be overridden by setting lockPriority inside the context property like this -

      "context": {
        "lockPriority" : "80"
      }
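
The locking flow and default priorities described above can be sketched roughly as follows. This is a simplified model for illustration, not the PR's Java implementation: the task-type keys, `lock_priority`, and `try_lock` are hypothetical names, and the `TaskLock` class here only mirrors the priority and exclusive-lock ideas from the description.

```python
from dataclasses import dataclass

# Default lock priorities per task type, as listed above.
# The task-type keys are hypothetical labels chosen for this sketch.
DEFAULT_PRIORITIES = {
    "realtime_index": 75,
    "hadoop_index": 50,
    "index": 50,
    "merge": 25,
    "append": 25,
}


def lock_priority(task_type, context=None):
    """Return the lockPriority override from the context, else the type default."""
    if context and "lockPriority" in context:
        return int(context["lockPriority"])
    return DEFAULT_PRIORITIES.get(task_type, 0)


@dataclass
class TaskLock:
    task_id: str
    priority: int
    exclusive: bool = False  # set once the lock is upgraded before publishing


def try_lock(overlapping_locks, task_id, priority):
    """Try to acquire a lock, given the existing locks on the same datasource
    with an overlapping interval.

    Returns (new_lock, revoked_locks). new_lock is None when the caller
    should stop and retry after some time.
    """
    # Same or higher priority already holds a lock: back off.
    if any(lock.priority >= priority for lock in overlapping_locks):
        return None, []
    # A lower-priority lock that is already exclusive cannot be revoked.
    if any(lock.exclusive for lock in overlapping_locks):
        return None, []
    # Otherwise revoke all lower-priority locks and create a new one.
    return TaskLock(task_id, priority), list(overlapping_locks)


hadoop = TaskLock("hadoop_task", lock_priority("hadoop_index"))
new_lock, revoked = try_lock(
    [hadoop], "realtime_task", lock_priority("realtime_index")
)
```

In this run the realtime task (priority 75) obtains the lock and the Hadoop task's lock (priority 50) is returned as revoked, matching the example given earlier.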
    

    Major Implementation details -

    • Two new fields have been added to TaskLock namely priority and preemptive.
    • Modified and refactored tryLock method in TaskLockbox
    • Added upgradeLock method in TaskLockbox, MetaStorageActionHandler, SQLMetaStorageActionHandler, TaskStorage, MetadataTaskStorage and HeapMemoryTaskStorage
    • New Task Action named LockUpgradeAction has been added
    • Check for preemption of locks has been added in TaskActionToolbox which is performed before publishing the segments in SegmentInsertAction
    • Changed Task implementations to handle priority, preemption and check for lock validity
    • Modified ThreadPoolTaskRunner a bit to be helpful in testing lock override
    • Modified TaskSerdeTest to check for priority
    • Modified Task related docs
    • Added unit tests for TaskLock overriding in TaskLifeCylceTest
    • Added a task name TestIndexTask useful for unit testing

    Possible Future Enhancements - Proactively shutdown the tasks instead of waiting for them to fail eventually since their TaskLock has been revoked.

    Edit 08/05/2016 -

    The priority locking feature is now configurable. It is off by default and can be enabled by setting the runtime property druid.indexer.taskLockboxVersion to v2.

    • TaskLockbox is an interface with two implementations - TaskLockboxV1, which is the same as the previous TaskLockbox, and TaskLockboxV2, which does priority-based locking. One of them is injected at runtime in CliOverlord and CliPeon depending on druid.indexer.taskLockboxVersion
    • TaskLockbox has a new method boolean setTaskLockCriticalState(Task task, Interval interval, TaskLockCriticalState taskLockCriticalState) meant for upgrading locks when priority-based locking is used. TaskLockboxV1 always returns true for this method.
    • TaskLock has two new fields, priority and upgraded. For TaskLockboxV1 the corresponding values are always 0 and true. For TaskLockboxV2 they depend on the task.
    • Each task calls setTaskLockCriticalState before publishing segments. However, for TaskLockboxV1 the method always returns true and the only extra overhead is an HTTP call to the overlord. For TaskLockboxV2 it does the actual work of setting the TaskLock state.
    • The Task interface has an int getLockPriority() method, which I guess is OK
    Feature Discuss 
    opened by pjain1 52
  • Dynamic auto scale Kafka-Stream ingest tasks

    Description

    In Druid, users need to set 'taskCount' when submitting a Kafka ingestion supervisor. This has a few limitations:

    1. While a supervisor is running, the task count cannot be modified. Data lag may build up during a sudden traffic peak. Users have to re-submit the supervisor with a larger task count to catch up on the Kafka delay, and if there are many supervisors, this re-submit operation is very cumbersome. In addition, users have to scale in manually after the traffic peak.
    2. To avoid Kafka lag during regular traffic peaks, users have to set a large task count in supervisors, which wastes resources during regular off-peak periods. For example, given our traffic pattern [Figure: traffic pattern], I have to set taskCount to 8 to avoid Kafka lag during the traffic peak. At other times, 4 tasks are enough.

    This PR provides the ability to auto scale the number of Kafka ingest tasks based on lag metrics while supervisors are running. With this feature enabled, ingest tasks scale out automatically during traffic peaks and scale in during off-peak periods.

    Design

    Here is the design of this PR. The workflow of the supervisor controller, based on the Druid source code:

    [Screenshot, 2020-10-21 1:44:54 PM]

    As the picture shows, SupervisorManager controls all the supervisors in the Overlord service. Each Kafka supervisor serially consumes notices from a LinkedBlockingQueue. Notice is an interface; RunNotice, ShutdownNotice and ResetNotice are implementations of it. I designed a new implementation named DynamicAllocationTasksNotice. I create a new Timer (lagComputationExec) to collect Kafka lag at a fixed rate, and a new Timer (allocationExec) to check and perform scale actions at a fixed rate, as shown below:

    [Screenshot, 2020-10-21 1:45:11 PM]

    Details of allocationExec:

    [Screenshot, 2020-10-21 1:45:36 PM]

    Furthermore, I expanded the ioConfig spec and added new parameters to control the scaling behavior, for example:

    "ioConfig": {
          "topic": "dummy_topic",
          "inputFormat": null,
          "replicas": 1,
          "taskCount": 1,
          "taskDuration": "PT3600S",
          "consumerProperties": {
            "bootstrap.servers": "xxx,xxx,xxx"
          },
          "autoScalerConfig": {
            "enableTaskAutoScaler": true,
            "lagCollectionIntervalMillis": 30000,
            "lagCollectionRangeMillis": 600000,
            "scaleOutThreshold": 6000000,
            "triggerScaleOutThresholdFrequency": 0.3,
            "scaleInThreshold": 1000000,
            "triggerScaleInThresholdFrequency": 0.9,
            "scaleActionStartDelayMillis": 300000,
            "scaleActionPeriodMillis": 60000,
            "taskCountMax": 6,
            "taskCountMin": 2,
            "scaleInStep": 1,
            "scaleOutStep": 2,
            "minTriggerScaleActionFrequencyMillis": 600000
          },
          "pollTimeout": 100,
          "startDelay": "PT5S",
          "period": "PT30S",
          "useEarliestOffset": false,
          "completionTimeout": "PT1800S",
          "lateMessageRejectionPeriod": null,
          "earlyMessageRejectionPeriod": null,
          "lateMessageRejectionStartDateTime": null,
          "stream": "dummy_topic",
          "useEarliestSequenceNumber": false
        }
    

    | Property | Description | Required |
    | ------------- | ------------- | ------------- |
    | enableTaskAutoScaler | Whether to enable this feature. If set to false or omitted, the autoscaler is disabled even if autoScalerConfig is not null. | no (default == false) |
    | lagCollectionIntervalMillis | The frequency of lag point collection. | no (default == 30000) |
    | lagCollectionRangeMillis | The total time window of lag collection. Used together with lagCollectionIntervalMillis: within the most recent lagCollectionRangeMillis, lag metric points are collected every lagCollectionIntervalMillis. | no (default == 600000) |
    | scaleOutThreshold | The threshold for the scale out action. | no (default == 6000000) |
    | triggerScaleOutThresholdFrequency | If this fraction of lag points is higher than scaleOutThreshold, a scale out action is performed. | no (default == 0.3) |
    | scaleInThreshold | The threshold for the scale in action. | no (default == 1000000) |
    | triggerScaleInThresholdFrequency | If this fraction of lag points is lower than scaleInThreshold, a scale in action is performed. | no (default == 0.9) |
    | scaleActionStartDelayMillis | Number of milliseconds after the supervisor starts before the scale logic is first checked. | no (default == 300000) |
    | scaleActionPeriodMillis | How often, in milliseconds, to check whether a scale action is needed. | no (default == 60000) |
    | taskCountMax | Maximum value of task count. Make sure taskCountMax >= taskCountMin. | yes |
    | taskCountMin | Minimum value of task count. When the autoscaler is enabled, the value of taskCount in IOConfig is ignored, and ingestion starts with taskCountMin tasks, going up to taskCountMax. | yes |
    | scaleInStep | How many tasks to remove at a time. | no (default == 1) |
    | scaleOutStep | How many tasks to add at a time. | no (default == 2) |
    | minTriggerScaleActionFrequencyMillis | Minimum time interval between two scale actions. | no (default == 600000) |
    | autoScalerStrategy | The autoscaler algorithm. ONLY lagBased is supported for now. | no (default == lagBased) |
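
To make the interaction of these parameters concrete, here is a minimal sketch of a lag-based scaling decision. It is an illustration under assumptions, not the PR's Java code: `decide_task_count` is a hypothetical function, the scale-in check is assumed to compare lag points against scaleInThreshold, and timing parameters such as minTriggerScaleActionFrequencyMillis are left out.

```python
def decide_task_count(lag_points, current_task_count, cfg):
    """Return the task count to run next, given recent lag samples.

    lag_points: lag values collected over the recent collection window.
    cfg: dict using the autoScalerConfig property names described above.
    """
    if not lag_points:
        return current_task_count

    total = len(lag_points)
    # Fraction of samples above the scale-out threshold.
    beyond = sum(1 for lag in lag_points if lag > cfg["scaleOutThreshold"]) / total
    # Fraction of samples below the scale-in threshold.
    within = sum(1 for lag in lag_points if lag < cfg["scaleInThreshold"]) / total

    if beyond >= cfg["triggerScaleOutThresholdFrequency"]:
        return min(current_task_count + cfg["scaleOutStep"], cfg["taskCountMax"])
    if within >= cfg["triggerScaleInThresholdFrequency"]:
        return max(current_task_count - cfg["scaleInStep"], cfg["taskCountMin"])
    return current_task_count


# Values taken from the example ioConfig above.
config = {
    "scaleOutThreshold": 6000000,
    "triggerScaleOutThresholdFrequency": 0.3,
    "scaleInThreshold": 1000000,
    "triggerScaleInThresholdFrequency": 0.9,
    "taskCountMax": 6,
    "taskCountMin": 2,
    "scaleInStep": 1,
    "scaleOutStep": 2,
}

# 4 of 10 samples exceed the scale-out threshold, so two tasks are added.
peak_lags = [7000000] * 4 + [3000000] * 6
scaled_out = decide_task_count(peak_lags, 2, config)
```

The result is clamped to the range from taskCountMin to taskCountMax, so repeated scale-out decisions cannot exceed the configured maximum.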

    Effect evaluation :

    I have deployed this feature in our production environment.

    Figure 1: Kafka ingestion lag

    Figure 2: Task count (running task count per datasource)

    Figure 3: Ingest speed (total ingest speed per datasource)

    A Druid ingestion task is in one of two states: reading or writing. When Druid scales out at 10:38, it launches 3 new tasks in the reading state and changes the old task's state from reading to writing; the old task finishes writing within a few minutes. This is why Figure 2 shows a peak of 4 from 10:38 to 10:42 (3 new reading tasks and one writing task) and a peak of 5 from 11:06 to 11:08. What we really care about is the tasks in the reading state. In other words, the real peak task count is 3 throughout, reached by the scale out at 10:39 due to Kafka lag, and there is no gap between the traffic peak and the task peak.

    Conclusion: here are the benefits of Druid auto scaling:

    1. Improved data SLA: whenever there is heavy traffic, Druid can scale out automatically and promptly to provide more consuming power, so that there is no delay for downstream consumers.
    2. Resource savings:
      • Cost savings: because the number of tasks for each datasource is reduced, fewer task slots are needed than before.
      • Support savings: the entire process from scale out to scale in requires no human intervention.

    This PR has:

    • [x] been self-reviewed.
      • [x] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
    • [x] added documentation for new or modified features or behaviors.
    • [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
    • [ ] added or updated version, license, or notice information in licenses.yaml
    • [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
    • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
    • [x] added integration tests.
    • [x] been tested in a test Druid cluster.

    Key changed/added classes in this PR
    • SeekableStreamSupervisor.java
    • KafkaSupervisor.java
    • KafkaSupervisorIOConfig.java
    Performance Release Notes Design Review Area - Streaming Ingestion 
    opened by zhangyue19921010 48
  • Adding Kafka-emitter

    This PR is about adding new extensions-contrib.

    Currently, Druid has various emitter modules and also has HttpPostEmitter for general-purpose use. However, many users currently use the Apache Kafka platform, and I assume that if Druid had another extension that emits its metrics to Kafka directly, many users would find it convenient to use with various monitoring dashboard UIs or other ecosystem tools.

    Please feel free to comment on this PR.

    Area - Documentation Release Notes 
    opened by dkhwangbo 47
  • build v9 directly

    This PR tracks the feature of building v9 directly which had been discussed in https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo

    We can divide this PR into 3 main parts:

    1. Changes to ColumnPartSerde's interface. I have changed ColumnPartSerde's interface to separate the Serializer and Deserializer, so we can have multiple serializers: one has the same behavior as the old one, and another is the new streaming version. The Deserializer does the same thing as the old one.
    2. Added the IndexedIntsWriter interface and some *Writers, which are responsible for writing dimension ids. In the hierarchy, I divided the writers into 2 main abstract sub-classes: SingleValueIndexedIntsWriter and MultiValueIndexedIntsWriter. Here are the classes that do the real work:
      • VSizeIndexedIntsWriter (single value, vsize encoded, not compressed)
      • CompressedIntsIndexedWriter (single value, not vsize encoded, compressed)
      • CompressedVSizeIntsIndexedWriter (single value, vsize encoded, compressed)
      • VSizeIndexedWriter (multi value, both offset and values are vsized, not compressed)
      • CompressedVSizeIndexedV3Writer (multi value, only values are vsized, compressed)
      More details can be found here: https://groups.google.com/forum/#!topic/druid-development/0CxhljSGeeo
    3. GenericColumnSerializer and its sub-classes are for writing metrics. They are almost identical to the MetricColumnSerializer you are familiar with, except that GenericColumnSerializer can write to a Channel (which is allocated from the smoosher when building). The sub-classes are:
      • LongColumnSerializer (writes long metrics)
      • FloatColumnSerializer (writes float metrics)
      • ComplexColumnSerializer (writes complex metrics)
    opened by KurtYoung 45
  • Kafka Ingestion Peon Tasks Success But Overlord Shows Failure

    Apologies if this breaks any rules, but I tried on the druid forums without much success so trying here to see if I can reach a different audience. Relevant information below and more details in the druid forum post.

    • Druid Version: 0.22.1, 0.23.0
    • Kafka Ingestion (idempotent producer) - HOURLY
    • Overlord type: remote

    https://www.druidforum.org/t/kafka-ingestion-peon-tasks-success-but-overlord-shows-failure/7374

    In general when we run all our tasks, we start seeing issues between Overlord and MM/Peons. Often times, the Peon will show that the task was successful but the overlord believes it failed and tries to shut it down. And things start to get sluggish with the Overlord and it starts taking a while to recognize completed tasks and tasks that are trying to start which seems to be pointing at a communication/coordination failure between Overlord and MM/Peons. We even see TaskAssignment between Overlord and MM timeouts (PT10M - default is PT5M) occur.

    The only thing that seems to be able to help is reducing the number of tasks we have running concurrently by suspending certain supervisors. Which also indicates an issue with the 3 Druid services handling the load of our current ingestion. But according to system metrics, resource usage is not hitting any limits and it still has more compute it can use. It's odd since we know there are probably a lot of users ingesting more data per hour than us and we don't see this type of issue in their discussions/white papers.

    Any help will definitely be appreciated.

    Uncategorized problem report Area - Ingestion 
    opened by pchang388 44
  • Implement force push down for nested group by query

    @gianm @jihoonson - creating a PR as we discussed over email. The pull request isn't complete by any means; I just wanted to get your feedback on the approach I have taken. For now, I have implemented the force push down feature. Once we have finalized the overall approach, I will provide automatic push down in a separate patch. Thanks!

    Performance Area - Querying 
    opened by samarthjain 44
  • Implement Redis cache extension

    This is a first working draft of the extension. It provides basic caching functionality and cache statistics, while doMonitor() still does nothing. Data is compressed using LZ4.

    This is to resolve #1927.

    Feature 
    opened by estliberitas 44
  • Kafka Index Task that supports Incremental handoffs

    Hopefully fixes #4016, #4177, #4812, #4781, #4743 and #5046. Partially fixes #4498 and #4693. Alternative to https://github.com/druid-io/druid/pull/4178 and https://github.com/druid-io/druid/pull/4175.

    • Incrementally handoff segments when they hit maxRowsPerSegment limit
    • Decouple segment partitioning from Kafka partitioning, all records from consumed partitions go to a single druid segment
    • Support for restoring task on middle manager restarts by check pointing end offsets for segments
    • Incremental handoff is NOT supported with pauseAfterRead feature

    Design -

    • Currently a KafkaIndexTask only creates and publishes segments corresponding to a single sequenceName.
    • With this PR, segments can be handed off incrementally by starting a new sequence when the maxRowsInSegment limit is reached and publishing all the segments for the previous sequence.
    • New Sequences are created by concatenating base sequence name with monotonically increasing id called sequenceId. It starts with 0.
    • A sequence corresponds to a range of Kafka offsets. SequenceMetadata class is added to KafkaIndexTask class to maintain metadata about a sequence. Metadata for all the sequences in the task is persisted to disk whenever a new sequence is created or deleted or end offsets for a sequence are set. This also enables restore on restart.
    • End offsets for the latest sequence are set whenever AppenderatorDriverAddResult.getNumRowsInSegment(), returned by the add call of AppenderatorDriver, is greater than maxRowsInMemory. At this point the task pauses and sends CheckPointDataSourceMetadataAction to the Supervisor, which uses the same logic as is used for finishing all the tasks in the TaskGroup and calls setEndOffsets on the replica tasks with the finish flag set to false, to indicate that the current sequence should be finished and published and a new sequence should be created for messages with offsets greater than the set end offsets. The new checkpoint information is stored in the TaskGroup.
    • The Supervisor does not maintain a list of SequenceMetadata for tasks in a TaskGroup; instead it maintains a sorted map of <SequenceId, Checkpoint>. A Checkpoint always corresponds to the start offsets of a sequence.
    • Whenever a new task is started by the Supervisor, it always sends the map of checkpoints for the TaskGroup in the context field. Using this checkpoint information, the task creates a list of SequenceMetadata and sets its start offsets to the start offsets of the first sequence.
    • Since a task always starts consuming from the start offsets of the first sequence, the Supervisor must remove checkpoints from the TaskGroup for which segments have been published, so that new tasks are not sent stale checkpoints, thus preventing duplicate work.
    • The Supervisor does this cleanup lazily, only when it starts new tasks to replace a failed task, creates a new TaskGroup, or discovers new tasks for a TaskGroup (including the case when it restarts). This is done by getting checkpoint information from the running tasks and verifying it against the offsets information in the metadata store. It also kills inconsistent tasks; refer to the verifyAndMergeCheckpoints method.
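
The sequence and checkpoint bookkeeping described in this design can be modeled with a small sketch. This is a simplified single-partition illustration, not the PR's Java code: `SequenceTracker` is a hypothetical class, the underscore separator in sequence names is an assumption, and a plain row count stands in for the real handoff condition.

```python
class SequenceTracker:
    """Tracks sequences for one task, consuming a single partition."""

    def __init__(self, base_sequence_name, start_offset, max_rows_per_segment):
        self.base_sequence_name = base_sequence_name
        self.max_rows = max_rows_per_segment
        self.sequence_id = 0            # monotonically increasing, starts at 0
        self.rows_in_sequence = 0
        # sequenceId -> start offset of that sequence (the "checkpoint")
        self.checkpoints = {0: start_offset}
        self.published = []             # names of sequences handed off so far

    def sequence_name(self):
        # Base name concatenated with the sequence id (separator is assumed).
        return f"{self.base_sequence_name}_{self.sequence_id}"

    def add(self, offset):
        """Record one consumed row; roll over to a new sequence at the limit."""
        self.rows_in_sequence += 1
        if self.rows_in_sequence >= self.max_rows:
            # Finish and publish the current sequence, then start a new one
            # for messages with offsets greater than the end offset.
            self.published.append(self.sequence_name())
            self.sequence_id += 1
            self.rows_in_sequence = 0
            self.checkpoints[self.sequence_id] = offset + 1


tracker = SequenceTracker("index_kafka_demo", start_offset=100, max_rows_per_segment=3)
for offset in range(100, 107):
    tracker.add(offset)
```

After consuming offsets 100 through 106 with a limit of 3 rows, two sequences have been published and the checkpoint map holds the start offset of each sequence, which is the information a replacement task would need to resume.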

    Some more information about specific issues -

    • Task status is set to FINISHING immediately after the task is asked to finish by call to setEndOffsets with finish set to true to fix https://github.com/druid-io/druid/issues/4812
    • To persist commit metadata only for the subset of segments which are being published, option 1 is used as described here - https://github.com/druid-io/druid/issues/4781
    • Un-deprecated the handoffConditionTimeout property, because to fix https://github.com/druid-io/druid/pull/4175 the tasks are now killed (not stopped) after the pending completion timeout elapses.
    Release Notes Area - Streaming Ingestion 
    opened by pjain1 41
  • groupBy v2 failing intermittently with complex columns

    We are on Druid 0.10 and are seeing groupBy v2 fail intermittently with complex columns. We don't see those failures with groupByStrategy=v1.
    Stacktrace:

    com.yahoo.sketches.SketchesArgumentException: Possible Corruption: Illegal Family ID: 0
        at com.yahoo.sketches.Family.idToFamily(Family.java:184) ~[sketches-core-0.8.4.jar:?]
        at com.yahoo.sketches.theta.SetOperation.wrap(SetOperation.java:105) ~[sketches-core-0.8.4.jar:?]
        at com.yahoo.sketches.theta.SetOperation.wrap(SetOperation.java:91) ~[sketches-core-0.8.4.jar:?]
        at io.druid.query.aggregation.datasketches.theta.SketchBufferAggregator.getUnion(SketchBufferAggregator.java:92) ~[druid-datasketches-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at io.druid.query.aggregation.datasketches.theta.SketchBufferAggregator.aggregate(SketchBufferAggregator.java:71) ~[druid-datasketches-0.10.0-SNAPSHOT.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.BufferGrouper.aggregate(BufferGrouper.java:203) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.SpillingGrouper.aggregate(SpillingGrouper.java:111) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.ConcurrentGrouper.aggregate(ConcurrentGrouper.java:163) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.ConcurrentGrouper.aggregate(ConcurrentGrouper.java:184) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.RowBasedGrouperHelper$1.accumulate(RowBasedGrouperHelper.java:173) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.RowBasedGrouperHelper$1.accumulate(RowBasedGrouperHelper.java:148) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.BaseSequence.accumulate(BaseSequence.java:46) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.LazySequence.accumulate(LazySequence.java:40) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.SequenceWrapper.wrap(SequenceWrapper.java:55) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.CPUTimeMetricQueryRunner$1.wrap(CPUTimeMetricQueryRunner.java:78) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.spec.SpecificSegmentQueryRunner$2.accumulate(SpecificSegmentQueryRunner.java:83) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence$1.get(WrappingSequence.java:50) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.spec.SpecificSegmentQueryRunner.doNamed(SpecificSegmentQueryRunner.java:169) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.spec.SpecificSegmentQueryRunner.access$200(SpecificSegmentQueryRunner.java:43) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.spec.SpecificSegmentQueryRunner$3.wrap(SpecificSegmentQueryRunner.java:149) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.java.util.common.guava.WrappingSequence.accumulate(WrappingSequence.java:45) ~[java-util-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:228) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at io.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:219) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_112]
        at io.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:271) ~[druid-processing-0.10.0-0a471fb.jar:0.10.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]

    @himanshug @gianm

    opened by akashdw 41
  • Update forbidden apis with fixed executor

    Update forbidden apis with fixed executor

    This adds java.util.concurrent.Executors#newFixedThreadPool(int) to the forbidden APIs list. The thread pools it creates use non-daemon threads, which prevent the JVM from shutting down. Execs#multiThreaded() marks all threads as daemon threads to prevent this.

    opened by adarshsanjeev 0
  • Quote and escape literals in JDBC lookup to allow reserved identifiers.

    Quote and escape literals in JDBC lookup to allow reserved identifiers.

    Before this change, JDBC lookups with reserved keywords in table, key, or column names did not parse correctly. Table, key, and column names need to be double-quoted and escaped to allow any reserved words.

    For example,

    SELECT select, value FROM intervals
    

    will not parse since select is specified as a column name. Whereas the following quoted query will parse correctly:

    SELECT "select", "value" FROM "intervals"
    
    • Add a new parameter with reserved identifiers for table, key and column names.
    • Update the tests that use the Derby connector, which now uses quoted and escaped identifiers consistently for creation, insertion, and other operations.
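    The quoting rule described above can be sketched as follows. This is a minimal illustration of standard SQL identifier quoting (wrap in double quotes, double any embedded quote), not the code from this PR; the function names are hypothetical.

```python
def quote_identifier(name: str) -> str:
    """Wrap a SQL identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_lookup_query(table: str, key_column: str, value_column: str) -> str:
    """Build a SELECT statement whose identifiers are all quoted and escaped."""
    return "SELECT {}, {} FROM {}".format(
        quote_identifier(key_column),
        quote_identifier(value_column),
        quote_identifier(table),
    )


# Reserved words such as "select" become legal once quoted.
print(build_lookup_query("intervals", "select", "value"))
```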

    This PR has:

    • [x] been self-reviewed.
      • [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
    • [ ] added documentation for new or modified features or behaviors.
    • [ ] a release note entry in the PR description.
    • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
    • [ ] added or updated version, license, or notice information in licenses.yaml
    • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
    • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
    • [ ] added integration tests.
    • [x] been tested in a test Druid cluster.
    opened by abhishekrb19 1
  • Web console: better show totals when grouping

    Web console: better show totals when grouping

    In the Service view show totals when grouping by Tier also.

    image

    (The screenshot is not very impressive with only 1 historical, but with more it looks better.)

    Bug Area - Web Console 
    opened by vogievetsky 0
  • data ingestion from logstash

    data ingestion from logstash

    Hi, I'm new to Druid and I want to send data from Logstash to Druid. Can anyone please provide a pipeline.conf, or the output part of a Logstash pipeline?

    opened by rajnishnpst 0
  • Bump json5 from 1.0.1 to 1.0.2 in /web-console

    Bump json5 from 1.0.1 to 1.0.2 in /web-console

    Bumps json5 from 1.0.1 to 1.0.2.

    Release notes

    Sourced from json5's releases.

    v1.0.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295). This has been backported to v1. (#298)
    Changelog

    Sourced from json5's changelog.

    Unreleased [code, diff]

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).

    v2.2.1 [code, diff]

    • Fix: Removed dependence on minimist to patch CVE-2021-44906. (#266)

    v2.2.0 [code, diff]

    • New: Accurate and documented TypeScript declarations are now included. There is no need to install @types/json5. (#236, #244)

    v2.1.3 [code, diff]

    • Fix: An out of memory bug when parsing numbers has been fixed. (#228, #229)

    v2.1.2 [code, diff]

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) You can disable automated security fix PRs for this repo from the Security Alerts page.
    Area - Dependencies javascript 
    opened by dependabot[bot] 0
  • Bump json5 from 2.2.1 to 2.2.3 in /website

    Bump json5 from 2.2.1 to 2.2.3 in /website

    Bumps json5 from 2.2.1 to 2.2.3.

    Release notes

    Sourced from json5's releases.

    v2.2.3

    v2.2.2

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).
    Changelog

    Sourced from json5's changelog.

    v2.2.3 [code, diff]

    v2.2.2 [code, diff]

    • Fix: Properties with the name __proto__ are added to objects and arrays. (#199) This also fixes a prototype pollution vulnerability reported by Jonathan Gregson! (#295).
    Commits
    • c3a7524 2.2.3
    • 94fd06d docs: update CHANGELOG for v2.2.3
    • 3b8cebf docs(security): use GitHub security advisories
    • f0fd9e1 docs: publish a security policy
    • 6a91a05 docs(template): bug -> bug report
    • 14f8cb1 2.2.2
    • 10cc7ca docs: update CHANGELOG for v2.2.2
    • 7774c10 fix: add proto to objects and arrays
    • edde30a Readme: slight tweak to intro
    • 97286f8 Improve example in readme
    • Additional commits viewable in compare view

    Area - Dependencies javascript 
    opened by dependabot[bot] 0
Releases(druid-25.0.0)
  • druid-25.0.0(Jan 4, 2023)

    Apache Druid 25.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors.

    See the complete set of changes for additional details.

    # Highlights

    # MSQ task engine now production ready

    The multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready. Use it for any supported workloads. For more information, see the following pages:

    # Simplified Druid deployments

    The new start-druid script greatly simplifies deploying any combination of Druid services on a single server. It comes pre-packaged with the required configs and can launch a fully functional Druid cluster simply by invoking ./start-druid. For experienced Druids, it also gives complete control over the runtime properties and JVM arguments so that the cluster exactly fits your needs.

    The start-druid script deprecates the existing profiles such as start-micro-quickstart and start-nano-quickstart. These profiles may be removed in future releases. For more information, see Single server deployment.

    # String dictionary compression (experimental)

    Added support for front coded string dictionaries for smaller string columns, leading to reduced segment sizes with only minor performance penalties for most Druid queries.

    This can be enabled by setting IndexSpec.stringDictionaryEncoding to {"type":"frontCoded", "bucketSize": 4} , where bucketSize is any power of 2 less than or equal to 128. Setting this property instructs indexing tasks to write segments using compressed dictionaries of the specified bucket size.
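    As a rough sketch, the setting above can be generated and validated like this. The helper name is hypothetical; the JSON shape and the bucketSize constraint are taken from the paragraph above.

```python
import json


def front_coded_index_spec(bucket_size: int = 4) -> dict:
    """Return an indexSpec fragment that enables front-coded string dictionaries."""
    is_power_of_two = bucket_size > 0 and (bucket_size & (bucket_size - 1)) == 0
    if not is_power_of_two or bucket_size > 128:
        raise ValueError("bucketSize must be a power of 2 no greater than 128")
    return {
        "stringDictionaryEncoding": {"type": "frontCoded", "bucketSize": bucket_size}
    }


print(json.dumps(front_coded_index_spec(4)))
```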

    Any segment written using string dictionary compression is not readable by older versions of Druid.

    For more information, see Front coding.

    https://github.com/apache/druid/pull/12277

    # Kubernetes-native tasks

    Druid can now use Kubernetes to launch and manage tasks, eliminating the need for middle managers.

    To use this feature, enable the druid-kubernetes-overlord-extensions in the extensions load list for your Overlord process.

    https://github.com/apache/druid/pull/13156

    # Hadoop-3 compatible binary

    Druid now comes packaged as a dedicated binary for Hadoop-3 users, which contains Hadoop-3 compatible jars. If you do not use Hadoop-3 with your Druid cluster, you may continue using the classic binary.

    # Multi-stage query (MSQ) task engine

    # MSQ enabled for Docker

    The MSQ task engine is now enabled for Docker by default.

    https://github.com/apache/druid/pull/13069

    # Query history

    Multi-stage queries no longer show up in the Query history dialog. They are still available in the Recent query tasks panel.

    # Limit on CLUSTERED BY columns

    When using the MSQ task engine to ingest data, the number of columns that can be passed in the CLUSTERED BY clause is now limited to 1500.

    https://github.com/apache/druid/pull/13352

    # Support for string dictionary compression

    The MSQ task engine supports front coding of string dictionaries for better compression. This can be enabled for INSERT or REPLACE statements by setting indexSpec to a valid JSON string in the query context.
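    A minimal sketch of such a request body, assuming the usual SQL API shape of a "query" plus a "context" object. The statement, table names, and helper name are hypothetical; the point is that indexSpec is passed as a JSON string rather than a nested object.

```python
import json


def msq_payload(sql: str, bucket_size: int = 4) -> dict:
    """Build a SQL request body whose context carries indexSpec as a JSON string."""
    index_spec = {
        "stringDictionaryEncoding": {"type": "frontCoded", "bucketSize": bucket_size}
    }
    # json.dumps turns the spec into the string form the context expects.
    return {"query": sql, "context": {"indexSpec": json.dumps(index_spec)}}


# Hypothetical statement; substitute a real INSERT or REPLACE query.
payload = msq_payload(
    'REPLACE INTO "example" OVERWRITE ALL SELECT * FROM "staging" PARTITIONED BY DAY'
)
print(json.dumps(payload, indent=2))
```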

    https://github.com/apache/druid/pull/13275

    # Sketch merging mode

    Workers can now gather key statistics, used to generate partition boundaries, either sequentially or in parallel. Set clusterStatisticsMergeMode to PARALLEL, SEQUENTIAL or AUTO in the query context to use the corresponding sketch merging mode. For more information, see Sketch merging mode.

    https://github.com/apache/druid/pull/13205

    # Performance and operational improvements

    • Error messages: For disallowed MSQ warnings of certain types, the warning is now surfaced as the error. https://github.com/apache/druid/pull/13198
    • Secrets: For tasks containing SQL with sensitive keys, Druid now masks the keys while logging with the help of regular expressions. https://github.com/apache/druid/pull/13231
    • Downsampling accuracy: The MSQ task engine now uses the number of bytes instead of the number of keys when downsampling data. https://github.com/apache/druid/pull/12998
    • Memory usage: When determining partition boundaries, the heap footprint of internal sketches used by MSQ is now capped at 10% of available memory or 300 MB, whichever is lower. Previously, the cap was strictly 300 MB. https://github.com/apache/druid/pull/13274
    • Task reports: Added fields pendingTasks and runningTasks to the worker report. See Query task status information for related web console changes. https://github.com/apache/druid/pull/13263
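    The memory cap from the list above is simple arithmetic. The sketch below assumes "300 MB" means 300 MiB and uses a hypothetical function name; it is not Druid's code.

```python
MAX_SKETCH_BYTES = 300 * 1024 * 1024  # assumption: "300 MB" read as 300 MiB


def sketch_heap_cap(available_bytes: int) -> int:
    """Return the lower of 10% of available memory and the 300 MB ceiling."""
    return min(available_bytes // 10, MAX_SKETCH_BYTES)


# With 1 GiB available the 10% rule wins; with 8 GiB the fixed ceiling wins.
print(sketch_heap_cap(1024 ** 3))
print(sketch_heap_cap(8 * 1024 ** 3))
```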

    # Querying

    # Async reads for JDBC

    Prevented JDBC timeouts on long queries by returning empty batches when a batch fetch takes too long. Uses an async model to run the result fetch concurrently with JDBC requests.

    https://github.com/apache/druid/pull/13196

    # Improved algorithm to check values of an IN filter

    To accommodate large value sets arising from large IN filters or from joins pushed down as IN filters, Druid now uses a sorted merge algorithm to merge the value set with the dictionary when the value set is large.
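    The general idea of a sorted merge can be illustrated with a two-pointer walk over two sorted lists. This is a generic sketch, not Druid's implementation; the function name is hypothetical. It visits each input once instead of running a separate dictionary lookup per value.

```python
from typing import List


def match_sorted(values: List[str], dictionary: List[str]) -> List[int]:
    """Return dictionary ids of entries that also appear in values; both inputs sorted."""
    ids = []
    i = j = 0
    while i < len(values) and j < len(dictionary):
        if values[i] == dictionary[j]:
            ids.append(j)
            i += 1
            j += 1
        elif values[i] < dictionary[j]:
            i += 1  # value is absent from the dictionary
        else:
            j += 1  # dictionary entry is not in the IN list
    return ids


print(match_sorted(["b", "d", "x"], ["a", "b", "c", "d"]))
```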

    https://github.com/apache/druid/pull/13133

    # Enhanced query context security

    Added the following configuration properties that refine the query context security model controlled by druid.auth.authorizeQueryContextParams:

    • druid.auth.unsecuredContextKeys: A JSON list of query context keys that do not require a security check.
    • druid.auth.securedContextKeys: A JSON list of query context keys that do require a security check.

    If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.
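    One way to read the interaction of the two lists is sketched below. It assumes druid.auth.authorizeQueryContextParams is enabled and that every key is checked when neither list is set; both points are assumptions, and the function name is hypothetical.

```python
def keys_to_authorize(user_keys, secured=None, unsecured=None):
    """Return the context keys that need a security check."""
    # Keys listed as unsecured are always exempt.
    keys = set(user_keys) - set(unsecured or ())
    # When a secured list is given, only keys on it are checked.
    if secured:
        keys &= set(secured)
    return keys


print(keys_to_authorize({"a", "b", "c"}, secured={"a", "b"}, unsecured={"b"}))
```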

    https://github.com/apache/druid/pull/13071

    # HTTP response headers

    The HTTP response for a SQL query now correctly sets response headers, same as a native query.

    https://github.com/apache/druid/pull/13052

    # Metrics

    # New metrics

    The following metrics have been newly added. For more details, see the complete list of Druid metrics.

    # Batched segment allocation

    These metrics pertain to batched segment allocation.

    | Metric | Description | Dimensions |
    |-|-|-|
    | task/action/batch/runTime | Milliseconds taken to execute a batch of task actions. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
    | task/action/batch/queueTime | Milliseconds spent by a batch of task actions in queue. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
    | task/action/batch/size | Number of task actions in a batch that was executed during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
    | task/action/batch/attempts | Number of execution attempts for a single batch of task actions. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
    | task/action/success/count | Number of task actions that were executed successfully during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskId, taskType, taskActionType=segmentAllocate |
    | task/action/failed/count | Number of task actions that failed during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskId, taskType, taskActionType=segmentAllocate |

    # Streaming ingestion

    | Metric | Description | Dimensions |
    |-|-|-|
    | ingest/kafka/partitionLag | Partition-wise lag between the offsets consumed by the Kafka indexing tasks and latest offsets in Kafka brokers. Minimum emission period for this metric is a minute. | dataSource, stream, partition |
    | ingest/kinesis/partitionLag/time | Partition-wise lag time in milliseconds between the current message sequence number consumed by the Kinesis indexing tasks and latest sequence number in Kinesis. Minimum emission period for this metric is a minute. | dataSource, stream, partition |
    | ingest/pause/time | Milliseconds spent by a task in a paused state without ingesting. | dataSource, taskId, taskType |
    | ingest/handoff/time | Total time taken in milliseconds for handing off a given set of published segments. | dataSource, taskId, taskType |

    https://github.com/apache/druid/pull/13238 https://github.com/apache/druid/pull/13331 https://github.com/apache/druid/pull/13313

    # Other improvements

    • New dimension taskActionType which may take values such as segmentAllocate, segmentTransactionalInsert, etc. This dimension is reported for task/action/run/time and the new batched segment allocation metrics. https://github.com/apache/druid/pull/13333
    • Metric namespace/cache/heapSizeInBytes for global cached lookups now accounts for the String object overhead of 40 bytes. https://github.com/apache/druid/pull/13219
    • jvm/gc/cpu has been fixed to report nanoseconds instead of milliseconds. https://github.com/apache/druid/pull/13383

    # Nested columns

    # Nested columns performance improvement

    Improved NestedDataColumnSerializer to no longer explicitly write null values to the field writers for the missing values of every row. Instead, passing the row counter is moved to the field writers so that they can backfill null values in bulk.

    https://github.com/apache/druid/pull/13101

    # Support for more formats

    Druid nested columns and the associated JSON transform functions now support Avro, ORC, and Parquet.

    https://github.com/apache/druid/pull/13325 https://github.com/apache/druid/pull/13375

    # Refactored a datasource before unnest

    When data requires "flattening" during processing, the operator now takes in an array and then flattens the array into N (N=number of elements in the array) rows where each row has one of the values from the array.
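    The flattening behavior can be pictured with a small generator. This is only an illustration of the row expansion described above, with hypothetical names; it says nothing about how the operator is implemented.

```python
def unnest(rows, column):
    """Yield one output row per element of the array found in row[column]."""
    for row in rows:
        for element in row[column]:
            flattened = dict(row)  # copy so the input row is not mutated
            flattened[column] = element
            yield flattened


# One input row with a 3-element array becomes 3 output rows.
for out in unnest([{"id": 1, "tags": ["a", "b", "c"]}], "tags"):
    print(out)
```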

    https://github.com/apache/druid/pull/13085

    # Ingestion

    # Improved filtering for cloud objects

    You can now stop at arbitrary subfolders using glob syntax in the ioConfig.inputSource.filter field for native batch ingestion from cloud storage, such as S3.

    https://github.com/apache/druid/pull/13027

    # Async task client for streaming ingestion

    You can now enable asynchronous communication between the stream supervisor and indexing tasks by setting chatAsync to true in the tuningConfig. The async task client uses its own internal thread pool and thus ignores the chatThreads property.

    https://github.com/apache/druid/pull/13354

    # Improved handling of JSON data with streaming ingestion

    You can now better control how Druid reads JSON data for streaming ingestion by setting the following fields in the input format specification:

    • assumeNewlineDelimited to parse lines of JSON independently.
    • useJsonNodeReader to retain the valid JSON events in multi-line JSON input when a parsing exception occurs.

    The web console has been updated to include these options.

    https://github.com/apache/druid/pull/13089

    # Ingesting from an idle Kafka stream

    When a Kafka stream becomes inactive, the supervisor ingesting from it can be configured to stop creating new indexing tasks. The supervisor automatically resumes creation of new indexing tasks once the stream becomes active again. Set the property dataSchema.ioConfig.idleConfig.enabled to true in the respective supervisor spec or set druid.supervisor.idleConfig.enabled on the overlord to enable this behaviour. Please see the following for details:

    https://github.com/apache/druid/pull/13144

    # Kafka Consumer improvement

    You can now configure the Kafka Consumer's custom deserializer after its instantiation.

    https://github.com/apache/druid/pull/13097

    # Kafka supervisor logging

    Kafka supervisor logs are now less noisy. The supervisors now log events at the DEBUG level instead of INFO.

    https://github.com/apache/druid/pull/13392

    # Fixed Overlord leader election

    Fixed a problem where Overlord leader election failed due to lock reacquisition issues. Druid now fails these tasks and clears all locks so that the Overlord leader election isn't blocked.

    https://github.com/apache/druid/pull/13172

    # Support for inline protobuf descriptor

    Added a new inline type protoBytesDecoder that allows a user to pass inline the contents of a Protobuf descriptor file, encoded as a Base64 string.
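    A sketch of preparing such a decoder is shown below. The Base64 step follows the description above; the field names descriptorString and protoMessageType are assumptions to verify against the Protobuf extension documentation, and the helper name is hypothetical.

```python
import base64


def inline_proto_decoder(descriptor_bytes: bytes, message_type: str) -> dict:
    """Build an inline protoBytesDecoder from raw descriptor file contents."""
    encoded = base64.b64encode(descriptor_bytes).decode("ascii")
    return {
        "type": "inline",
        "descriptorString": encoded,  # assumed field name
        "protoMessageType": message_type,  # assumed field name
    }


# Placeholder bytes stand in for the contents of a compiled descriptor file.
decoder = inline_proto_decoder(b"\x0a\x03abc", "Metrics")
print(decoder)
```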

    https://github.com/apache/druid/pull/13192

    # Duplicate notices

    For streaming ingestion, notices that are the same as one already in queue won't be enqueued. This will help reduce notice queue size.
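    The deduplication idea can be sketched as a FIFO queue backed by a set of pending entries. This is a generic illustration with a hypothetical class name, not the supervisor's actual notice queue.

```python
from collections import deque


class NoticeQueue:
    """FIFO queue that ignores a notice equal to one already waiting."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def add(self, notice) -> bool:
        """Enqueue notice unless an equal one is pending; return True if added."""
        if notice in self._pending:
            return False
        self._pending.add(notice)
        self._queue.append(notice)
        return True

    def poll(self):
        """Remove and return the oldest notice, allowing it to be enqueued again."""
        notice = self._queue.popleft()
        self._pending.discard(notice)
        return notice

    def __len__(self):
        return len(self._queue)


queue = NoticeQueue()
queue.add("run")
queue.add("run")  # duplicate of a pending notice, so it is dropped
print(len(queue))
```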

    https://github.com/apache/druid/pull/13334

    # Sampling from stream input now respects the configured timeout

    Fixed a problem where sampling from a stream input, such as Kafka or Kinesis, failed to respect the configured timeout when the stream had no records available. You can now set the maximum amount of time in which the entry iterator will return results.

    https://github.com/apache/druid/pull/13296

    # Streaming tasks resume on Overlord switch

    Fixed a problem where streaming ingestion tasks continued to run until their duration elapsed after the Overlord leader had issued a pause to the tasks. Now, when the Overlord switch occurs right after it has issued a pause to the task, the task remains in a paused state even after the Overlord re-election.

    https://github.com/apache/druid/pull/13223

    # Fixed Parquet list conversion

    Fixed an issue with Parquet list conversion, where lists of complex objects could unexpectedly be wrapped in an extra object, appearing as [{"element":<actual_list_element>},{"element":<another_one>}...] instead of the direct list. This changes the behavior of the parquet reader for lists of structured objects to be consistent with other parquet logical list conversions. The data is now fetched directly, more closely matching its expected structure.
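    The difference between the two shapes can be shown with a small helper that strips the wrapper. It only illustrates the before and after structure; the function name is hypothetical and this is not the reader's conversion code.

```python
def unwrap_elements(items):
    """Replace each {"element": x} wrapper with x; leave other items untouched."""
    return [
        item["element"] if isinstance(item, dict) and set(item) == {"element"} else item
        for item in items
    ]


wrapped = [{"element": {"a": 1}}, {"element": {"a": 2}}]
print(unwrap_elements(wrapped))
```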

    https://github.com/apache/druid/pull/13294

    # Introduced a tree type to flattenSpec

    Introduced a tree type to flattenSpec. In the event that a simple hierarchical lookup is required, the tree type allows for faster JSON parsing than jq and path parsing types.
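    A simple hierarchical lookup amounts to following a fixed list of keys through nested objects, which is why it can skip general path parsing. The sketch below assumes a field spec with name and nodes properties; treat those names, and the helper, as hypothetical.

```python
def tree_lookup(record: dict, nodes: list):
    """Follow nodes through nested dicts; return None if any level is missing."""
    current = record
    for node in nodes:
        if not isinstance(current, dict) or node not in current:
            return None
        current = current[node]
    return current


# Assumed shape of a tree field in a flattenSpec.
field = {"type": "tree", "name": "city", "nodes": ["address", "city"]}
event = {"address": {"city": "Oslo"}}
print(tree_lookup(event, field["nodes"]))
```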

    https://github.com/apache/druid/pull/12177

    # Operations

    # Compaction

    Compaction behavior has changed to reduce the time and disk space it takes:

    • When segments need to be fetched, download them one at a time and delete them when Druid is done with them. This still takes time but minimizes the required disk space.
    • Don't fetch segments on the main compact task when they aren't needed. If the user provides a full granularitySpec, dimensionsSpec, and metricsSpec, Druid skips fetching segments.

    For more information, see the documentation on Compaction and Automatic compaction.

    https://github.com/apache/druid/pull/13280

    # Idle configs for the Supervisor

    You can now set the Supervisor to idle, which is useful for freeing up task slots so that autoscaling can be more effective.

    To configure the idle behavior, use the following properties:

    | Property | Description | Default |
    | - | - | - |
    | druid.supervisor.idleConfig.enabled | (Cluster wide) If true, supervisor can become idle if there is no data on input stream/topic for some time. | false |
    | druid.supervisor.idleConfig.inactiveAfterMillis | (Cluster wide) Supervisor is marked as idle if all existing data has been read from input topic and no new data has been published for inactiveAfterMillis milliseconds. | 600_000 |
    | inactiveAfterMillis | (Individual Supervisor) Supervisor is marked as idle if all existing data has been read from input topic and no new data has been published for inactiveAfterMillis milliseconds. | no (default == 600_000) |

    https://github.com/apache/druid/pull/13311

    # Improved supervisor termination

    Fixed issues with delayed supervisor termination during certain transient states.

    https://github.com/apache/druid/pull/13072

    # Backoff for HttpPostEmitter

    The HttpPostEmitter option now has a backoff. This means that there should be less noise in the logs and lower CPU usage if you use this option for logging.

    https://github.com/apache/druid/pull/12102

    # DumpSegment tool for nested columns

    The DumpSegment tool can now be used on nested columns with the --dump nested option.

    For more information, see dump-segment tool.

    https://github.com/apache/druid/pull/13356

    # Segment loading and balancing

    # Batched segment allocation

    Segment allocation on the Overlord can take some time to finish, which can cause ingestion lag while a task waits for segments to be allocated. Performing segment allocation in batches can help improve performance.

    There are two new properties that affect how Druid performs segment allocation:

    | Property | Description | Default |
    | - | - | - |
    | druid.indexer.tasklock.batchSegmentAllocation | If set to true, Druid performs segment allocate actions in batches to improve throughput and reduce the average task/action/run/time. See batching segmentAllocate actions for details. | false |
    | druid.indexer.tasklock.batchAllocationWaitTime | Number of milliseconds after Druid adds the first segment allocate action to a batch, until it executes the batch. Allows the batch to add more requests and improve the average segment allocation run time. This configuration takes effect only if batchSegmentAllocation is enabled. | 500 |

    In addition to these properties, there are new metrics to track batch segment allocation. For more information, see New metrics for segment allocation.

    For more information, see the following:

    https://github.com/apache/druid/pull/13369 https://github.com/apache/druid/pull/13503

    # Improved cachingCost balancer strategy

    The cachingCost balancer strategy now behaves more like the cost strategy. When computing the cost of moving a segment to a server, the following calculations are performed:

    • Subtract the self cost of a segment if it is being served by the target server
    • Subtract the cost of segments that are marked to be dropped

    https://github.com/apache/druid/pull/13321

    # Faster segment assignment

    You can now use a round-robin segment strategy to speed up initial segment assignments. Set useRoundRobinSegmentAssignment to true in the Coordinator dynamic config to enable this feature.
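    Round-robin assignment itself is easy to picture: each segment goes to the next server in turn, with no cost computation per placement. The sketch below is a generic illustration with hypothetical names, not the Coordinator's code.

```python
from itertools import cycle


def assign_round_robin(segments, servers):
    """Assign each segment to the next server in turn."""
    assignment = {server: [] for server in servers}
    for segment, server in zip(segments, cycle(servers)):
        assignment[server].append(segment)
    return assignment


print(assign_round_robin(["s1", "s2", "s3", "s4", "s5"], ["h1", "h2"]))
```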

    https://github.com/apache/druid/pull/13367

    # Default to batch sampling for balancing segments

    Batch sampling is now the default method for sampling segments during balancing as it performs significantly better than the alternative when there is a large number of used segments in the cluster.

    As part of this change, the following have been deprecated and will be removed in future releases:

    • coordinator dynamic config useBatchedSegmentSampler
    • coordinator dynamic config percentOfSegmentsToConsiderPerMove
    • old non-batch method of sampling segments

    # Remove unused property

    The unused coordinator property druid.coordinator.loadqueuepeon.repeatDelay has been removed. Use only druid.coordinator.loadqueuepeon.http.repeatDelay to configure repeat delay for the HTTP-based segment loading queue.

    https://github.com/apache/druid/pull/13391

    # Avoid segment over-replication

    Improved the process of checking server inventory to prevent over-replication of segments during segment balancing.

    https://github.com/apache/druid/pull/13114

    # Provided service specific log4j overrides in containerized deployments

    Provided an option to override log4j configs set up in service-level directories so that it works with Druid-operator based deployments.

    https://github.com/apache/druid/pull/13020

    # Various Docker improvements

    • Updated Docker to run with JRE 11 by default.
    • Updated Docker to use gcr.io/distroless/java11-debian11 image as base by default.
    • Enabled Docker buildkit cache to speed up building.
    • Downloaded bash-static to the Docker image so that scripts that require bash can be executed.
    • Bumped builder image from 3.8.4-jdk-11-slim to 3.8.6-jdk-11-slim.
    • Switched busybox from amd64/busybox:1.30.0-glibc to busybox:1.35.0-glibc.
    • Added support to build arm64-based image.

    https://github.com/apache/druid/pull/13059

    # Enabled cleaner JSON for various input sources and formats

    Added JsonInclude to various properties, to avoid population of default values in serialized JSON.

    https://github.com/apache/druid/pull/13064

    # Improved direct memory check on startup

    Improved direct memory check on startup by providing better support for Java 9+ in RuntimeInfo, and clearer log messages where validation fails.

    https://github.com/apache/druid/pull/13207

    # Improved the run time of the MarkAsUnusedOvershadowedSegments duty

    Improved the run time of the MarkAsUnusedOvershadowedSegments duty by iterating over all overshadowed segments and marking segments as unused in batches.

    https://github.com/apache/druid/pull/13287

    # Web console

    # Delete an interval

    You can now pick an interval to delete from a dropdown in the kill task dialog.

    https://github.com/apache/druid/pull/13431

    # Removed the old query view

    The old query view is removed. Use the new query view with tabs. For more information, see Web console.

    https://github.com/apache/druid/pull/13169

    # Filter column values in query results

    The web console now allows you to add to existing filters for a selected column.

    https://github.com/apache/druid/pull/13169

    # Support for Kafka lookups in the web-console

    Added support for Kafka-based lookups rendering and input in the web console.

    https://github.com/apache/druid/pull/13098

    # Query task status information

    The web console now exposes a textual indication about running and pending tasks when a query is stuck due to lack of task slots.

    https://github.com/apache/druid/pull/13291

    # Extensions

    # Extension optimization

    Optimized the compareTo function in CompressedBigDecimal.

    https://github.com/apache/druid/pull/13086

    # CompressedBigDecimal cleanup and extension

    Removed unnecessary generic type from CompressedBigDecimal, added support for number input types, added support for reading aggregator input types directly (uningested data), and fixed scaling bug in buffer aggregator.

    https://github.com/apache/druid/pull/13048

    # Support for Kubernetes discovery

    Added POD_NAME and POD_NAMESPACE env variables to all Kubernetes Deployments and StatefulSets. Helm deployment is now compatible with druid-kubernetes-extension.

    https://github.com/apache/druid/pull/13262

    # Docs

    # Jupyter Notebook tutorials

    We released our first Jupyter Notebook-based tutorial to learn the basics of the Druid API. Download the notebook and follow along with the tutorial to learn how to get basic cluster information, ingest data, and query data. For more information, see Jupyter Notebook tutorials.

    https://github.com/apache/druid/pull/13342
    https://github.com/apache/druid/pull/13345

    # Dependency updates

    # Updated Kafka version

    Updated the Apache Kafka core dependency to version 3.3.1.

    https://github.com/apache/druid/pull/13176

    # Docker improvements

    Updated dependencies for the Druid image for Docker, including JRE 11. Docker BuildKit cache is enabled to speed up building.

    https://github.com/apache/druid/pull/13059

    # Upgrading to 25.0.0

    Consider the following changes and updates when upgrading from Druid 24.0.x to 25.0.0. If you're updating from an earlier version, see the release notes of the relevant intermediate versions.

    # Default HTTP-based segment discovery and task management

    The default segment discovery method now uses HTTP instead of ZooKeeper.

    This update changes the defaults for the following properties:

    | Property | New default | Previous default |
    | - | - | - |
    | druid.serverview.type for segment management | http | batch |
    | druid.coordinator.loadqueuepeon.type for segment management | http | curator |
    | druid.indexer.runner.type for the Overlord | httpRemote | local |

    To use ZooKeeper instead of HTTP, change the values for the properties back to the previous defaults. ZooKeeper-based implementations for these properties are deprecated and will be removed in a subsequent release.

    https://github.com/apache/druid/pull/13092

    # Finalizing HLL and quantiles sketch aggregates

    Previously, the aggregation functions for HLL and quantiles sketches returned either sketches or numbers when finalized, depending on where they appeared in the native query plan.

    Druid no longer finalizes aggregators in the following two cases:

    • aggregators appear in the outer level of a query
    • aggregators are used as input to an expression or finalizing-field-access post-aggregator

    This change aligns the behavior of HLL and quantiles sketches with theta sketches.

    To restore the old behavior, set sqlFinalizeOuterSketches=true in the query context.

    https://github.com/apache/druid/pull/13247
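
The query context flag mentioned above travels in the "context" object of a Druid SQL API request body. The following is a minimal sketch of building such a body; the datasource "wikipedia" and the column "user" are hypothetical names chosen for illustration, and the function only constructs the payload without sending it.

```python
import json

def build_sql_request(sql, finalize_outer_sketches):
    """Build a Druid SQL API request body with the sketch-finalization flag in the query context."""
    return {
        "query": sql,
        "context": {"sqlFinalizeOuterSketches": finalize_outer_sketches},
    }

# Hypothetical datasource and column names, for illustration only.
body = build_sql_request('SELECT DS_HLL("user") AS user_sketch FROM "wikipedia"', True)
print(json.dumps(body, indent=2))
```

You would POST the resulting JSON to the SQL endpoint of your cluster; setting the flag per query lets you restore the old behavior only where existing clients depend on it.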

    # Kill tasks mark segments as unused only if specified

    When you issue a kill task, Druid marks the underlying segments as unused only if explicitly specified. For more information, see the API reference.

    https://github.com/apache/druid/pull/13104

    # Incompatible changes

    # Upgrade curator to 5.3.0

    Apache Curator has been upgraded to the latest version, 5.3.0. This version drops support for ZooKeeper 3.4, but Druid already officially dropped that support in 0.22. Curator 5.3.0 also removes support for Exhibitor, so all related configurations and tests have been removed.

    https://github.com/apache/druid/pull/12939

    # Fixed Parquet list conversion

    The behavior of the parquet reader for lists of structured objects has been changed to be consistent with other parquet logical list conversions. The data is now fetched directly, more closely matching its expected structure. See parquet list conversion for more details.

    https://github.com/apache/druid/pull/13294

    # Credits

    Thanks to everyone who contributed to this release!

    @317brian @599166320 @a2l007 @abhagraw @abhishekagarwal87 @adarshsanjeev @adelcast @AlexanderSaydakov @amaechler @AmatyaAvadhanula @ApoorvGuptaAi @arvindanugula @asdf2014 @churromorales @clintropolis @cloventt @cristian-popa @cryptoe @dampcake @dependabot[bot] @didip @ektravel @eshengit @findingrish @FrankChen021 @gianm @hnakamor @hosswald @imply-cheddar @jasonk000 @jon-wei @Junge-401 @kfaraz @LakshSingla @mcbrewster @paul-rogers @petermarshallio @rash67 @rohangarg @sachidananda007 @santosh-d3vpl3x @senthilkv @somu-imply @techdocsmith @tejaswini-imply @vogievetsky @vtlim @wcc526 @writer-jill @xvrl @zachjsh

  • druid-24.0.2(Dec 22, 2022)

  • druid-24.0.1(Nov 22, 2022)

    Apache Druid 24.0.1 is a bug fix release that fixes some issues in the 24.0 release. See the complete set of changes for additional details.

    # Notable Bug fixes

    • https://github.com/apache/druid/pull/13214 to fix SQL planning when using the JSON_VALUE function.
    • https://github.com/apache/druid/pull/13297 to fix values that match a range filter on nested columns.
    • https://github.com/apache/druid/pull/13077 to fix detection of nested objects while generating an MSQ SQL in the web-console.
    • https://github.com/apache/druid/pull/13172 to correctly handle overlord leader election even when tasks cannot be reacquired.
    • https://github.com/apache/druid/pull/13259 to fix memory leaks from SQL statement objects.
    • https://github.com/apache/druid/pull/13273 to fix overlord API failures by de-duplicating task entries in memory.
    • https://github.com/apache/druid/pull/13049 to fix a race condition while processing query context.
    • https://github.com/apache/druid/pull/13151 to fix assertion error in SQL planning.

    # Credits

    Thanks to everyone who contributed to this release!

    @abhishekagarwal87 @AmatyaAvadhanula @clintropolis @gianm @kfaraz @LakshSingla @paul-rogers @vogievetsky

    # Known issues

    • Hadoop ingestion does not work with custom extension config due to injection errors (fixed in https://github.com/apache/druid/pull/13138)
  • druid-24.0.0(Sep 16, 2022)

    Apache Druid 24.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 67 contributors. See the complete set of changes for additional details.

    # Major version bump

    Starting with this release, we have dropped the leading 0 from the release version and promoted all other digits one place to the left. Druid is now at major version 24, a jump up from the prior 0.23.0 release. In terms of backward-compatibility or breaking changes, this release is not significantly different than other previous major releases such as 0.23.0 or 0.22.0. We are continuing with the same policy as we have used in prior releases: minimizing the number of changes that require special attention when upgrading, and calling out any that do exist in the release notes. For this release, please refer to the Upgrading to 24.0.0 section for a list of backward-incompatible changes in this release.

    # New Features

    # Multi-stage query task engine

    SQL-based ingestion for Apache Druid uses a distributed multi-stage query architecture, which includes a query engine called the multi-stage query task engine (MSQ task engine). The MSQ task engine extends Druid's query capabilities, so you can write queries that reference external data as well as perform ingestion with SQL INSERT and REPLACE. Essentially, you can perform SQL-based ingestion instead of using JSON ingestion specs that Druid's native ingestion uses. In addition to the easy-to-use syntax, the SQL interface lets you perform transformations that involve multiple shuffles of data.

    SQL-based ingestion using the multi-stage query task engine is recommended for batch ingestion starting in Druid 24.0.0. Native batch and Hadoop-based ingestion continue to be supported as well. We recommend you review the known issues and test the feature in a staging environment before rolling out in production. Using the multi-stage query task engine with plain SELECT statements (not INSERT ... SELECT or REPLACE ... SELECT) is experimental.

    If you're upgrading from an earlier version of Druid or you're using Docker, you'll need to add the druid-multi-stage-query extension to druid.extensions.loadList in your common.runtime.properties file.
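
The extension load list is a JSON array stored as a property value, so adding an extension amounts to appending one string to that array. The following is a minimal sketch, assuming a hypothetical existing value; the helper name add_extension is illustrative and not part of Druid.

```python
import json

def add_extension(load_list_value, extension):
    """Return a load list property value that includes `extension` exactly once."""
    extensions = json.loads(load_list_value)
    if extension not in extensions:
        extensions.append(extension)
    return json.dumps(extensions)

# Hypothetical existing value; the list in your cluster will differ.
current = '["druid-hdfs-storage", "druid-kafka-indexing-service"]'
updated = add_extension(current, "druid-multi-stage-query")
print("druid.extensions.loadList=" + updated)
```

The helper is idempotent, so running it against a file that already lists the extension leaves the value unchanged.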

    For more information, refer to the Overview documentation for SQL-based ingestion.

    #12524 #12386 #12523 #12589

    # Nested columns

    Druid now supports directly storing nested data structures in a newly added COMPLEX<json> column type. COMPLEX<json> columns store a copy of the structured data in JSON format as well as specialized internal columns and indexes for nested literal values—STRING, LONG, and DOUBLE types. An optimized virtual column allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.

    Newly added Druid SQL functions, native JSON functions, and a virtual column allow you to extract, transform, and create COMPLEX<json> values at query time. You can also use the JSON functions in INSERT and REPLACE statements in SQL-based ingestion, or in a transformSpec in native ingestion as an alternative to using a flattenSpec object to "flatten" nested data for ingestion.

    See SQL JSON functions, native JSON functions, Nested columns, virtual columns, and the feature summary for more detail.

    #12753 #12714 #12753 #12920

    # Updated Java support

    Java 11 is fully supported and is no longer experimental. Java 17 support is improved.

    #12839

    # Query engine updates

    # Updated column indexes and query processing of filters

    Reworked column indexes to be far more flexible, which will eventually allow us to model a wide range of index types. Added machinery to build the filters that use the updated indexes, while also allowing other column implementations to implement the built-in index types and provide adapters that make use of indexing in the current set of filters that Druid provides.

    #12388

    # Time filter operator

    You can now use the Druid SQL operator TIME_IN_INTERVAL to filter query results based on time. Prefer TIME_IN_INTERVAL over the SQL BETWEEN operator to filter on time. For more information, see Date and time functions.

    #12662
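
As a rough illustration of the operator described above, the sketch below builds a TIME_IN_INTERVAL predicate from two dates using an ISO 8601 interval string. The table name "my_datasource" and the helper name are hypothetical.

```python
from datetime import date

def time_in_interval_filter(start, end, column="__time"):
    """Return a TIME_IN_INTERVAL predicate for the ISO 8601 interval start/end."""
    interval = f"{start.isoformat()}/{end.isoformat()}"
    return f"TIME_IN_INTERVAL({column}, '{interval}')"

predicate = time_in_interval_filter(date(2022, 9, 1), date(2022, 10, 1))
# "my_datasource" is a hypothetical table name.
sql = f'SELECT COUNT(*) FROM "my_datasource" WHERE {predicate}'
print(sql)
```

Expressing the range as a single interval keeps the start and end bounds together in one argument instead of splitting them across two comparisons.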

    # Null values and the "in" filter

    If a values array contains null, the "in" filter matches null values. This differs from the SQL IN filter, which does not match null values.

    For more information, see Query filters and SQL data types. #12863
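
The difference between the two behaviors can be modeled in a few lines. This is a simplified model of the semantics described above, not Druid code; None stands in for null, and both function names are illustrative.

```python
def native_in_matches(value, values):
    """Model of the native "in" filter: a null row value matches when null is listed."""
    return value in values

def sql_in_matches(value, values):
    """Model of SQL IN: a null row value never evaluates to true."""
    if value is None:
        return False
    return value in [v for v in values if v is not None]

values = ["a", None]
print(native_in_matches(None, values))
print(sql_in_matches(None, values))
```

Non-null values behave the same way in both models; only rows whose value is null are treated differently.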

    # Virtual columns in search queries

    Previously, a search query could only search on dimensions that existed in the data source. Search queries now support virtual columns as a parameter in the query.

    #12720

    # Optimize simple MIN / MAX SQL queries on __time

    Simple queries like select max(__time) from ds now run as timeBoundary queries to take advantage of the time dimension sorting in a segment. You can set a feature flag to enable this feature.

    #12472 #12491
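
For orientation, a native timeBoundary query is a small JSON object. The sketch below builds one for a datasource named ds, as in the example above; the exact plan Druid generates for a given SQL statement is not shown in these notes, so treat this as an approximation of the target query shape.

```python
import json

def time_boundary_query(datasource, bound):
    """Build a native timeBoundary query; bound is "minTime" or "maxTime"."""
    if bound not in ("minTime", "maxTime"):
        raise ValueError("bound must be 'minTime' or 'maxTime'")
    return {"queryType": "timeBoundary", "dataSource": datasource, "bound": bound}

query = time_boundary_query("ds", "maxTime")
print(json.dumps(query))
```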

    # String aggregation results

    The first/last string aggregator now only compares based on values. Previously, the first/last string aggregator's values were compared based on the __time column first and then on values.

    If you have existing queries and want to continue using both the __time column and values, update your queries to use ORDER BY MAX(timeCol).

    #12773

    # Reduced allocations due to Jackson serialization

    Introduced and implemented new helper functions in JacksonUtils to enable reuse of SerializerProvider objects.

    Additionally, disabled backwards compatibility for map-based rows in the GroupByQueryToolChest by default, which eliminates the need to copy the heavyweight ObjectMapper. Introduced a configuration option to allow administrators to explicitly enable backwards compatibility.

    #12468

    # Updated IPAddress Java library

    Added a new IPAddress Java library dependency to handle IP addresses. The library includes IPv6 support. Additionally, migrated IPv4 functions to use the new library.

    #11634

    # Query performance improvements

    Optimized SQL operations and functions as follows:

    • Vectorized numeric latest aggregators (#12439)
    • Optimized isEmpty() and equals() on RangeSets (#12477)
    • Optimized reuse of Yielder objects (#12475)
    • Operations on numeric columns with indexes are now faster (#12830)
    • Optimized GroupBy by reducing allocations. Reduced allocations by reusing entry and key holders (#12474)
    • Added a vectorized version of string last aggregator (#12493)
    • Added Direct UTF-8 access for IN filters (#12517)
    • Enabled virtual columns to cache their outputs in case Druid calls them multiple times on the same underlying row (#12577)
    • Druid now rewrites a join as a filter when possible in IN joins (#12225)
    • Added automatic sizing for GroupBy dictionaries (#12763)
    • Druid now distributes JDBC connections more evenly amongst brokers (#12817)

    # Streaming ingestion

    # Kafka consumers

    Previously, consumers that were registered and used for ingestion persisted until Kafka deleted them. They were only used to make sure that an entire topic was consumed. There are no longer consumer groups that linger.

    #12842

    # Kinesis ingestion

    You can now perform Kinesis ingestion even if there are empty shards. Previously, all shards had to have at least one record.

    #12792

    # Batch ingestion

    # Batch ingestion from S3

    You can now ingest data from endpoints that are different from your default S3 endpoint and signing region. For more information, see S3 config. #11798

    # Improvements to ingestion in general

    This release includes the following improvements for ingestion in general.

    # Increased robustness for task management

    Added setNumProcessorsPerTask to prevent various automatically-sized thread pools from becoming unreasonably large. It isn't ideal for each task to size its pools as if it is the only process on the entire machine. On large machines, this solves a common cause of OutOfMemoryError due to "unable to create native thread".

    #12592

    # Avatica JDBC driver

    The JDBC driver now follows the JDBC standard and uses two kinds of statements, Statement and PreparedStatement.

    #12709

    # Eight hour granularity

    Druid now accepts the EIGHT_HOUR granularity. You can segment incoming data to EIGHT_HOUR buckets as well as group query results by eight hour granularity. #12717
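
To make the bucketing concrete, the sketch below floors a timestamp to the start of its eight-hour bucket. It assumes buckets are aligned to the Unix epoch in UTC (so they start at 00:00, 08:00, and 16:00 UTC); that alignment is an assumption of this illustration, and the function name is hypothetical.

```python
from datetime import datetime, timezone

EIGHT_HOURS_SECONDS = 8 * 60 * 60

def eight_hour_bucket_start(ts):
    """Floor a timezone-aware timestamp to the start of its eight-hour UTC bucket."""
    epoch_seconds = int(ts.timestamp())
    floored = epoch_seconds - (epoch_seconds % EIGHT_HOURS_SECONDS)
    return datetime.fromtimestamp(floored, tz=timezone.utc)

ts = datetime(2022, 9, 16, 13, 45, tzinfo=timezone.utc)
print(eight_hour_bucket_start(ts).isoformat())
```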

    # Ingestion general

    # Updated Avro extension

    The previous Avro extension leaked objects from the parser. If these objects leaked into your ingestion, you had objects being stored as a string column with the value as the .toString(). This string column will remain after you upgrade but will return Map.toString() instead of GenericRecord.toString. If you relied on the previous behavior, you can use the Avro extension from an earlier release.

    #12828

    # Sampler API

    The sampler API has additional limits: maxBytesInMemory and maxClientResponseBytes. These options augment the existing options numRows and timeoutMs. maxBytesInMemory can be used to control the memory usage on the Overlord while sampling. maxClientResponseBytes can be used by clients to specify the maximum size of response they would prefer to handle.

    #12947

    # SQL

    # Column order

    The DruidSchema and SegmentMetadataQuery properties now preserve column order instead of ordering columns alphabetically. This means that query order better matches ingestion order.

    #12754

    # Converting JOINs to filter

    You can improve performance by pushing JOINs partially or fully to the base table as a filter at runtime by setting the enableRewriteJoinToFilter context parameter to true for a query.

    Druid now pushes down join filters in case the query computing join references any columns from the right side.

    #12749 #12868

    # Add is_active to sys.segments

    Added is_active as shorthand for ((is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1). This represents "all the segments that should be queryable, whether or not they actually are right now".

    #11550
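
The shorthand can be read as a simple predicate over the three 0/1 flags. The following sketch mirrors that expression in Python purely to show which flag combinations count as active; it is not how Druid computes the column.

```python
def is_active(is_published, is_overshadowed, is_realtime):
    """Mirror the is_active shorthand using the 0/1 flags from sys.segments."""
    return int((is_published == 1 and is_overshadowed == 0) or is_realtime == 1)

print(is_active(1, 0, 0))  # published and not overshadowed
print(is_active(1, 1, 0))  # published but overshadowed
print(is_active(0, 0, 1))  # realtime
```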

    # useNativeQueryExplain now defaults to true

    The useNativeQueryExplain property now defaults to true. This means that EXPLAIN PLAN FOR returns the explain plan as a JSON representation of equivalent native query(s) by default. For more information, see Broker Generated Query Configuration Supplementation.

    #12936

    # Running queries with inline data using druid query engine

    Some queries that do not refer to any table, such as select 1, are now always translated to a native Druid query with InlineDataSource before execution. If translation is not possible, for queries such as SELECT (1, 2), then an error occurs. In earlier versions, this query would still run.

    #12897

    # Coordinator/Overlord

    # You can configure the Coordinator to kill segments in the future

    You can now set druid.coordinator.kill.durationToRetain to a negative period to configure the Druid cluster to kill segments whose interval_end is a date in the future. For example, PT-24H would allow segments to be killed if their interval_end date was 24 hours or less into the future at the time that the kill task is generated by the system. A cluster operator can also disregard the druid.coordinator.kill.durationToRetain entirely by setting a new configuration, druid.coordinator.kill.ignoreDurationToRetain=true. This ignores interval_end date when looking for segments to kill, and can instead kill any segment marked unused. This new configuration is turned off by default, and a cluster operator should fully understand and accept the risks before enabling it.
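
The effect of a negative retention period can be modeled as a cutoff that moves into the future. This is a simplified model of the rule described above, not the Coordinator's implementation; the function name and the inclusive boundary are assumptions.

```python
from datetime import datetime, timedelta, timezone

def eligible_for_kill(interval_end, now, duration_to_retain, ignore_duration_to_retain=False):
    """Model whether an unused segment's interval_end falls inside the kill window."""
    if ignore_duration_to_retain:
        return True
    # A negative duration pushes the cutoff past `now`, into the future.
    return interval_end <= now - duration_to_retain

now = datetime(2022, 9, 16, tzinfo=timezone.utc)
retain = timedelta(hours=-24)  # corresponds to PT-24H
print(eligible_for_kill(now + timedelta(hours=12), now, retain))
print(eligible_for_kill(now + timedelta(hours=36), now, retain))
```

With PT-24H, a segment ending 12 hours from now is inside the window while one ending 36 hours from now is not; setting the ignore flag bypasses the comparison entirely.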

    # Improved Overlord stability

    Reduced contention between the management thread and the reception of status updates from the cluster. This improves the stability of Overlord and all tasks in a cluster when there are large (1000+) task counts.

    #12099

    # Improved Coordinator segment logging

    Updated Coordinator load rule logging to include current replication levels. Added missing segment ID and tier information from some of the log messages.

    #12511

    # Optimized overlord GET tasks memory usage

    Addressed the significant memory overhead caused by the web-console indirectly calling the Overlord’s GET tasks API. This could cause unresponsiveness or Overlord failure when the ingestion tab was opened multiple times.

    #12404

    # Reduced time to create intervals

    To reduce segment cost computation time, Druid now stores each segment's interval instead of creating it from primitives each time. Because the set of intervals for segments is low in cardinality, Druid also interns the intervals to reduce the memory overhead of storing them.

    #12670

    # Brokers/Overlord

    Brokers now have a default of 25MB maximum queued per query. Previously, there was no default limit. Depending on your use case, you may need to increase the value, especially if you have large result sets or large amounts of intermediate data. To adjust the maximum memory available, use the druid.broker.http.maxQueuedBytes property. For more information, see Configuration reference.

    # Web console

    Prepare to have your Web Console experience elevated! - @vogievetsky

    # New query view (WorkbenchView) with tabs and long running query support

    You can use the new query view to execute multi-stage, task-based queries with the /druid/v2/sql/task and /druid/indexer/v1/task/* APIs, as well as native and sql-native queries, just like the old Query view. A key point of the sql-msq-task based queries is that they may run for a long time. This necessitated many UX changes, including but not limited to the following:

    # Tabs

    You can now have many queries stored and running at the same time, significantly improving the query view UX.

    You can open several tabs, duplicate them, and copy them as text to paste into any console and reopen there.

    # Progress reports (counter reports)

    Queries run with the multi-stage query task engine have detailed progress reports shown in the summary progress bar and in the detailed execution table, which provides summaries of the counters for every step.

    # Error and warning reports

    Queries run with the multi-stage query task engine present user-friendly warnings and errors should anything go wrong. The new query view has components to visualize these in full detail, including a stack trace.

    # Recent query tasks panel

    Queries run with the multi-stage query task engine are tasks. This makes it possible to show queries that are executing currently and that have executed in the recent past.

    For any query in the Recent query tasks panel you can view the execution details for it and you can also attach it as a new tab and continue iterating on the query. It is also possible to download the "query detail archive", a JSON file containing all the important details for a given query to use for troubleshooting.

    # Connect external data flow

    Connect external data flow lets you use the sampler to sample your source data, determine its schema, and generate a fully formed SQL query that you can edit to fit your use case before you launch your ingestion job. This point-and-click flow will save you much typing.

    # Preview button

    The Preview button appears when you type in an INSERT or REPLACE SQL query. Click the button to remove the INSERT or REPLACE clause and execute your query as an "inline" query with a limit. This gives you a sense of the shape of your data after Druid applies all your transformations from your SQL query.

    # Results table

    The query results table has been improved in style and function. It now shows you type icons for the column types and supports the ability to manipulate nested columns with ease.

    # Helper queries

    The Web Console now has some UI affordances for notebook and CTE users. You can reference helper queries, collapsible elements that hold a query, from the main query just as if they were defined with a WITH statement. When you are composing a complicated query, it is helpful to break it down into multiple queries to preview the parts individually.

    # Additional Web Console tools

    More tools are available from the ... menu:

    • Explain query - show the query plan for sql-native and multi-stage query task engine queries.
    • Convert ingestion spec to SQL - Helps you migrate your native batch and Hadoop based specs to the SQL-based format.
    • Open query detail archive - lets you open a query detail archive downloaded earlier.
    • Load demo queries - lets you load a set of pre-made queries to play around with multi-stage query task engine functionality.

    # New SQL-based data loader

    The data loader exists as a GUI wizard to help users craft a JSON ingestion spec using point and click and quick previews. The SQL data loader is the SQL-based ingestion analog of that.

    Like the native based data loader, the SQL-based data loader stores all the state in the SQL query itself. You can opt to manipulate the query directly at any stage. See (#12919) for more information about how the data loader differs from the Connect external data workflow.

    # Other changes and improvements

    • The query view has so much new functionality that it has moved to the far left as the first view available in the header.
    • You can now click on a datasource or segment to see a preview of the data within.
    • The task table now explicitly shows if a task has been canceled in a different color than a failed task.
    • The user experience when you view a JSON payload in the Druid console has been improved. There’s now syntax highlighting and a search.
    • The Druid console can now use the column order returned by a scan query to determine the column order for reindexing data.
    • The way errors are displayed in the Druid console has been improved. Errors no longer appear as a single long line.

    See (#12919) for more details and other improvements

    # Metrics

    # Sysmonitor stats for Peons

    Sysmonitor stats, like memory or swap, are no longer reported for Peons since Peons always run on the same host as MiddleManagers. This means that duplicate stats will no longer be reported.

    #12802

    # Prometheus

    You can now include the host and service as labels for Prometheus by setting the following properties to true:

    • druid.emitter.prometheus.addHostAsLabel
    • druid.emitter.prometheus.addServiceAsLabel

    #12769

    # Rows per segment

    (Experimental) You can now see the average number of rows in a segment and the distribution of segments in predefined buckets with the following metrics: segment/rowCount/avg and segment/rowCount/range/count. Enable the metrics with the following property: org.apache.druid.server.metrics.SegmentStatsMonitor #12730

    # New sqlQuery/planningTimeMs metric

    There’s a new sqlQuery/planningTimeMs metric for SQL queries that computes the time it takes to build a native query from a SQL query.

    #12923

    # StatsD metrics reporter

    The StatsD metrics reporter extension now includes the following metrics:

    • coordinator/time
    • coordinator/global/time
    • tier/required/capacity
    • tier/total/capacity
    • tier/replication/factor
    • tier/historical/count
    • compact/task/count
    • compactTask/maxSlot/count
    • compactTask/availableSlot/count
    • segment/waitCompact/bytes
    • segment/waitCompact/count
    • interval/waitCompact/count
    • segment/skipCompact/bytes
    • segment/skipCompact/count
    • interval/skipCompact/count
    • segment/compacted/bytes
    • segment/compacted/count
    • interval/compacted/count

    #12762

    # New worker level task metrics

    Added a new monitor, WorkerTaskCountStatsMonitor, that allows each MiddleManager worker to report metrics for successful and failed tasks, and for task slot usage.

    #12446

    # Improvements to the JvmMonitor

    The JvmMonitor can now handle more generation and collector scenarios. The monitor is more robust and works properly for ZGC on both Java 11 and 15.

    #12469

    # Garbage collection

    Garbage collection metrics now use MXBeans.

    #12481

    # Metric for task duration in the pending queue

    Introduced the metric task/pending/time to measure how long a task stays in the pending queue.

    #12492

    # Emit metrics object for Scan, Timeseries, and GroupBy queries during cursor creation

    Adds vectorized metric for scan, timeseries and groupby queries.

    #12484

    # Emit state of replace and append for native batch tasks

    Druid now emits metrics so you can monitor and assess the use of different types of batch ingestion, in particular replace and tombstone creation.

    #12488 #12840

    # KafkaEmitter emits queryType

    The KafkaEmitter now properly emits the queryType property for native queries.

    #12915

    # Security

    You can now hide properties that are sensitive in the API response from /status/properties, such as S3 access keys. Use the druid.server.hiddenProperties property in common.runtime.properties to specify the properties (case insensitive) you want to hide.

    #12950

    # Other changes

    • You can now configure the retention period for request logs stored on disk with the druid.request.logging.durationToRetain property. Set the retention period to be longer than P1D (#12559)
    • You can now specify liveness and readiness probe delays for the historical StatefulSet in your values.yaml file. The default is 60 seconds (#12805)
    • Improved exception message for native binary operators (#12335)
    • Improved error messages when URI points to a file that doesn't exist (#12490)
    • Improved build performance of modules (#12486)
    • Improved lookups made using the druid-kafka-extraction-namespace extension to handle records that have been deleted from a kafka topic (#12819)
    • Updated core Apache Kafka dependencies to 3.2.0 (#12538)
    • Updated ORC to 1.7.5 (#12667)
    • Updated Jetty to 9.4.41.v20210516 (#12629)
    • Added Zstandard compression library to CompressionStrategy (#12408)
    • Updated the default gzip buffer size to 8 KB for improved performance (#12579)
    • Updated the default inputSegmentSizeBytes in Compaction configuration to 100,000,000,000,000 (~100TB)

    # Bug fixes

    Druid 24.0 contains over 68 bug fixes. You can find the complete list here.

    # Upgrading to 24.0

    # Permissions for multi-stage query engine

    To read external data using the multi-stage query task engine, you must have READ permissions for the EXTERNAL resource type. Users without the correct permission encounter a 403 error when trying to run SQL queries that include EXTERN.

    The way you assign the permission depends on your authorizer. For example, with [basic security](/docs/development/extensions-core/druid-basic-security.md) in Druid, add the EXTERNAL READ permission by sending a POST request to the roles API.

    The example adds permissions for users with the admin role using a basic authorizer named MyBasicMetadataAuthorizer. The following permissions are granted:

    • DATASOURCE READ
    • DATASOURCE WRITE
    • CONFIG READ
    • CONFIG WRITE
    • STATE READ
    • STATE WRITE
    • EXTERNAL READ
    curl --location --request POST 'http://localhost:8081/druid-ext/basic-security/authorization/db/MyBasicMetadataAuthorizer/roles/admin/permissions' \
    --header 'Content-Type: application/json' \
    --data-raw '[
    {
      "resource": {
        "name": ".*",
        "type": "DATASOURCE"
      },
      "action": "READ"
    },
    {
      "resource": {
        "name": ".*",
        "type": "DATASOURCE"
      },
      "action": "WRITE"
    },
    {
      "resource": {
        "name": ".*",
        "type": "CONFIG"
      },
      "action": "READ"
    },
    {
      "resource": {
        "name": ".*",
        "type": "CONFIG"
      },
      "action": "WRITE"
    },
    {
      "resource": {
        "name": ".*",
        "type": "STATE"
      },
      "action": "READ"
    },
    {
      "resource": {
        "name": ".*",
        "type": "STATE"
      },
      "action": "WRITE"
    },
    {
      "resource": {
        "name": "EXTERNAL",
        "type": "EXTERNAL"
      },
      "action": "READ"
    }
    ]'
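
The request body above follows a regular pattern, so it can also be generated rather than written by hand. The following is a minimal sketch that rebuilds the same seven permissions; the helper name permission is illustrative, and the code only constructs the JSON without calling the roles API.

```python
import json

def permission(resource_type, action, name=".*"):
    """Build one permission entry in the shape used by the request body above."""
    return {"resource": {"name": name, "type": resource_type}, "action": action}

permissions = [
    permission(resource_type, action)
    for resource_type in ("DATASOURCE", "CONFIG", "STATE")
    for action in ("READ", "WRITE")
]
# The EXTERNAL resource uses the literal name "EXTERNAL" rather than a regex.
permissions.append(permission("EXTERNAL", "READ", name="EXTERNAL"))

print(json.dumps(permissions, indent=2))
```

Generating the list this way makes it harder to drop the EXTERNAL READ entry by accident when the role already has a long permission set.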
    

    # Behavior for unused segments

    Druid automatically retains any segments marked as unused. Previously, Druid permanently deleted unused segments from metadata store and deep storage after their duration to retain passed. This behavior was reverted from 0.23.0. #12693

    # Default for druid.processing.fifo

    The default for druid.processing.fifo is now true. This means that tasks of equal priority are treated in a FIFO manner. For most use cases, this change can improve performance on heavily loaded clusters.

    #12571
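
FIFO handling of equal priorities can be illustrated with a priority queue that breaks ties by arrival order. This is a generic model of the idea, not Druid's processing pool; it assumes that larger priority values run first, and the class name is hypothetical.

```python
import heapq
import itertools

class FifoPriorityQueue:
    """Priority queue model: higher priority first, equal priorities in arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def put(self, priority, item):
        # Negate priority so larger values pop first; the counter breaks ties FIFO.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def get(self):
        return heapq.heappop(self._heap)[2]

queue = FifoPriorityQueue()
for priority, name in [(0, "a"), (5, "b"), (0, "c"), (5, "d")]:
    queue.put(priority, name)
order = [queue.get() for _ in range(4)]
print(order)
```

Without the arrival counter, items of equal priority could come out in an arbitrary order, which is the behavior the new default avoids.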

    # Update to JDBC statement closure

    In previous releases, Druid automatically closed the JDBC Statement when the ResultSet was closed. Druid closed the ResultSet on EOF. Druid closed the statement on any exception. This behavior is, however, non-standard. In this release, Druid's JDBC driver follows the JDBC standards more closely:

    • The ResultSet closes automatically on EOF, but does not close the Statement or PreparedStatement. Your code must close these statements, perhaps by using a try-with-resources block.
    • The PreparedStatement can now be used multiple times with different parameters. (Previously this was not true since closing the ResultSet closed the PreparedStatement.)
    • If any call to a Statement or PreparedStatement raises an error, the client code must still explicitly close the statement. According to the JDBC standards, statements are not closed automatically on errors. This allows you to obtain information about a failed statement before closing it.

    If you have code that depended on the old behavior, you may have to change your code to add the required close statement.

    #12709

    # Known issues

    # Credits

    @2bethere @317brian @a2l007 @abhagraw @abhishekagarwal87 @abhishekrb19 @adarshsanjeev @aggarwalakshay @AmatyaAvadhanula @BartMiki @capistrant @chenrui333 @churromorales @clintropolis @cloventt @CodingParsley @cryptoe @dampcake @dependabot[bot] @dherg @didip @dongjoon-hyun @ektravel @EsoragotoSpirit @exherb @FrankChen021 @gianm @hellmarbecker @hwball @iandr413 @imply-cheddar @jarnoux @jasonk000 @jihoonson @jon-wei @kfaraz @LakshSingla @liujianhuanzz @liuxiaohui1221 @lmsurpre @loquisgon @machine424 @maytasm @MC-JY @Mihaylov93 @nishantmonu51 @paul-rogers @petermarshallio @pjfanning @rockc2020 @rohangarg @somu-imply @suneet-s @superivaj @techdocsmith @tejaswini-imply @TSFenwick @vimil-saju @vogievetsky @vtlim @williamhyun @wiquan @writer-jill @xvrl @yuanlihan @zachjsh @zemin-piao

  • druid-0.23.0(Jun 23, 2022)

    Apache Druid 0.23.0 contains over 450 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 81 contributors. See the complete set of changes for additional details.

    # New Features

    # Query engine

    # Grouping on arrays without exploding the arrays

    You can now group on a multi-value dimension as an array. For a datasource named "test":

    {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
    {"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
    {"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
    {"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4
    

    The following query:

    {
      "queryType": "groupBy",
      "dataSource": "test",
      "intervals": [
        "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
      ],
      "granularity": {
        "type": "all"
      },
      "virtualColumns" : [ {
        "type" : "expression",
        "name" : "v0",
        "expression" : "mv_to_array(\"tags\")",
        "outputType" : "ARRAY<STRING>"
      } ],
      "dimensions": [
        {
          "type": "default",
          "dimension": "v0",
          "outputName": "tags",
          "outputType": "ARRAY<STRING>"
        }
      ],
      "aggregations": [
        {
          "type": "count",
          "name": "count"
        }
      ]
    }
    

    Returns the following:

    [
      {
        "timestamp": "1970-01-01T00:00:00.000Z",
        "event": {
          "count": 1,
          "tags": []
        }
      },
      {
        "timestamp": "1970-01-01T00:00:00.000Z",
        "event": {
          "count": 1,
          "tags": ["t1","t2","t3"]
        }
      },
      {
        "timestamp": "1970-01-01T00:00:00.000Z",
        "event": {
          "count": 1,
          "tags": ["t3","t4","t5"]
        }
      },
      {
        "timestamp": "1970-01-01T00:00:00.000Z",
        "event": {
          "count": 1,
          "tags": ["t5","t6","t7"]
        }
      }
    ]
    

    (#12078) (#12253)

    # Specify a column other than __time column for row comparison in first/last aggregators

    You can pass a time column to the *first/*last aggregators by using the LATEST_BY / EARLIEST_BY SQL functions. This supports cases where the time is stored in a column other than "__time", so you can specify another logical time column for row comparison. (#11949) (#12145)

    # Improvements to querying user experience

    This release includes several improvements for querying:

    • Added the SQL query ID to the response header for failed SQL queries to aid in locating the error messages (#11756)
    • Added input type validation for DataSketches HLL (#12131)
    • Improved JDBC logging (#11676)
    • Added SQL functions MV_FILTER_ONLY and MV_FILTER_NONE to filter rows of multi-value string dimensions to include only the supplied list of values or none of them respectively (#11650)
    • Added ARRAY_CONCAT_AGG to aggregate array inputs together into a single array (#12226)
    • Added the ability to authorize the usage of query context parameters (#12396)
    • Improved query IDs to make it easier to link queries and sub-queries for end-to-end query visibility (#11809)
    • Added a safe divide function to protect against division by 0 (#11904)
    • You can now add a query context to the internally generated SegmentMetadata query (#11429)
    • Added support for Druid complex types to the native expression processing system to make all Druid data usable within expressions (#11853, #12016)
    • You can control the size of the on-heap segment-level dictionary via druid.query.groupBy.maxSelectorDictionarySize when grouping on string or array-valued expressions that do not have pre-existing dictionaries.
    • You have better protection against filter explosion during CNF conversion (#12314) (#12324)
    • You can get the complete native query when explaining a SQL query by setting useNativeQueryExplain to true in the query context (#11908)
    • You can have the Broker ignore real-time nodes or specific Historical tiers (#11766) (#11732)

    # Streaming Ingestion

    # Kafka input format for parsing headers and key

    We've introduced a Kafka input format so you can ingest header data in addition to the message contents. For example:

    • the event key field
    • event headers
    • the Kafka event timestamp
    • the Kafka event value that stores the payload.

    (#11630)

    # Kinesis ingestion - Improvements

    We have made the following improvements to Kinesis ingestion:

    • Re-sharding can slow down ingestion because it creates many intermediate empty shards. These shards get assigned to tasks, causing an imbalance in load assignment. You can set skipIgnorableShards to true in the Kinesis ingestion tuning config to ignore such shards. (#12235)
    • Kinesis ingestion currently uses DescribeStream to fetch the list of shards. This call is deprecated and slower. In this release, you can switch to the newer listShards API by setting useListShards to true in the Kinesis ingestion tuning config. (#12161)

    # Native Batch Ingestion

    # Multi-dimension range partitioning

    Multi-dimension range partitioning allows users to partition their data on the ranges of any number of dimensions. It develops further on the concepts behind "single-dim" partitioning and is now arguably the most preferable secondary partitioning, both for query performance and storage efficiency. (#11848) (#11973)

    # Improved replace data behavior

    In previous versions of Druid, if you ingested data with the dropExisting flag to replace data, Druid would retain the existing data for a time chunk if there was no new data to replace it. Now, if you set dropExisting to true in your ioConfig and ingest data for a time range that includes a time chunk with no data, Druid uses a tombstone to overshadow the existing data in the empty time chunk. (#12137)
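
    The replace behavior described above can be modeled with a few lines of plain Java. This is a simplified sketch, not Druid's implementation: time chunks are opaque labels, and the class, method, and result strings are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DropExistingModel {
    /**
     * Returns, for each time chunk in the ingested interval, what a query would see
     * after the ingestion task finishes.
     */
    static Map<String, String> visibleAfterIngest(
            List<String> chunksInInterval,
            Set<String> chunksWithExistingData,
            Set<String> chunksWithNewData,
            boolean dropExisting) {
        Map<String, String> visible = new LinkedHashMap<>();
        for (String chunk : chunksInInterval) {
            boolean hadData = chunksWithExistingData.contains(chunk);
            if (chunksWithNewData.contains(chunk)) {
                visible.put(chunk, "new data");      // new segments overshadow the old ones
            } else if (hadData && dropExisting) {
                visible.put(chunk, "tombstone");     // empty chunk hides the old data
            } else if (hadData) {
                visible.put(chunk, "existing data"); // old data stays visible
            } else {
                visible.put(chunk, "empty");
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> interval = List.of("2022-01-01", "2022-01-02");
        Set<String> existing = Set.of("2022-01-01", "2022-01-02");
        Set<String> incoming = Set.of("2022-01-01");
        System.out.println(visibleAfterIngest(interval, existing, incoming, true));
        System.out.println(visibleAfterIngest(interval, existing, incoming, false));
    }
}
```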

    This release includes several improvements for native batch ingestion:

    • Druid now emits a new metric when a batch task finishes waiting for segment availability. (#11090)
    • Added segmentAvailabilityWaitTimeMs, the duration in milliseconds that a task waited for its segments to be handed off to Historical nodes, to IngestionStatsAndErrorsTaskReportData (#11090)
    • Added functionality to preserve existing metrics during ingestion (#12185)
    • Parallel native batch task can now provide task reports for the sequential and single phase mode (e.g., used with dynamic partitioning) as well as single phase mode subtasks (#11688)
    • Added support for RowStats in druid/indexer/v1/task/{task_id}/reports API for multi-phase parallel indexing task (#12280)
    • Fixed the OOM failures in the dimension distribution phase of parallel indexing (#12331)
    • Added support to handle null dimension values while creating partition boundaries (#11973)

    # Improvements to ingestion in general

    This release includes several improvements for ingestion in general:

    • Removed the template modifier from IncrementalIndex<AggregatorType> because it is no longer required
    • You can now use JsonPath functions in JsonPath expressions during ingestion (#11722)
    • Druid no longer creates a materialized list of segment files and no longer loops over the files, which reduces OOM issues (#11903)
    • Added an intermediate-persist IndexSpec to the main "merge" method in IndexMerger (#11940)
    • Granularity.granularitiesFinerThan now returns ALL if you pass in ALL (#12003)
    • Added a configuration parameter for appending tasks to allow them to use a SHARED lock (#12041)
    • SchemaRegistryBasedAvroBytesDecoder now throws a ParseException instead of RE when it fails to retrieve a schema (#12080)
    • Added includeAllDimensions to dimensionsSpec to put all explicit dimensions first in InputRow and subsequently any other dimensions found in input data (#12276)
    • Added the ability to store null columns in segments (#12279)

    # Compaction

    This release includes several improvements for compaction:

    • Automatic compaction now supports complex dimensions (#11924)
    • Automatic compaction now supports overlapping segment intervals (#12062)
    • You can now configure automatic compaction to calculate the ratio of slots available for compaction tasks from maximum slots, including autoscaler maximum worker nodes (#12263)
    • You can now configure the Coordinator auto compaction duty period separately from other indexing duties (#12263)
    • The default inputSegmentSizeBytes is now ~100 TB (#12534)
    • You can change query granularity, change dimension schema, filter data, and add metrics through auto-compaction (#11856) (#11874) (#11922) (#12125)
    • You can also control roll-up for auto and manual compaction (#11850)

    # SQL

    # Human-readable and actionable SQL error messages

    Until version 0.22.1, if you issued an unsupported SQL query, Druid returned cryptic and unhelpful error messages. With this change, error messages identify exactly the part of the SQL query that Druid does not support. An example is a scan query that is ordered on a dimension other than the time column.

    (#11911)

    # Cancel API for SQL queries

    We've added a new API to cancel SQL queries, so you can now cancel SQL queries just like you can cancel native queries. You can use the API from the web console. In previous versions, cancellation from the console only closed the client connection while the SQL query kept running on Druid.

    (#11643) (#11738) (#11710)

    # Improved SQL compatibility

    We have made changes to expressions that make expression evaluation more SQL compliant. This new behaviour is disabled by default. It can be enabled by setting druid.expressions.useStrictBooleans to true. We recommend enabling this behaviour since it is also more performant in some cases.

    (#11184)

    # Improvements to SQL user experience

    This release includes several additional improvements for SQL:

    • You no longer need to include a trailing slash / for JDBC connections to Druid (#11737)
    • You can now use scans as outer queries (#11831)
    • Added a class to sanitize JDBC exceptions and to log them (#11843)
    • Added type headers to response format to make it easier for clients to interpret the results of SQL queries (#11914)
    • Improved the way the DruidRexExecutor handles numeric arrays (#11968)
    • Druid now returns an empty result after optimizing a GROUP BY query to a time series query (#12065)
    • As an administrator, you can now configure the implementation for APPROX_COUNT_DISTINCT and COUNT(DISTINCT expr) in approximate mode (#11181)

    # Coordinator/Overlord

    • The Coordinator can be overwhelmed by connections from other Druid services, especially when TLS is enabled. You can mitigate this by setting druid.global.http.eagerInitialization to false in the common runtime properties.

    # Web console

    • Query view can now cancel all queries issued from it (#11738)
    • The auto refresh functions now run in the foreground only. This prevents forgotten background console tabs from putting any load on the cluster (#11750)
    • Add a Segment size (in bytes) column to the Datasources view (#11797)
    • Format numbers with commas in the query view (#12031)
    • Add a JSON Diff view for supervisor specs (#12085)

    • Improve the formatting and info contents of code auto suggestion docs (#12085)

    • Add shard detail column to segments view (#12212)

    • Avoid refreshing tables if a menu is open (#12435)
    • Misc other bug fixes and usability improvements

    # Metrics

    # Query metrics now also set the vectorized dimension by default

    This can be helpful in understanding the performance profile of queries. (#12464)

    # Auto-compaction duty now also reports duty metrics

    A dimension to indicate the duty group has also been added. (#12352)

    This release includes several additional improvements for metrics:

    • Druid includes the Prometheus emitter by default (#11812)
    • Fixed the missing conversionFactor in Prometheus emitter (#12338)
    • Fixed an issue with the ingest/events/messageGap metric (#12337)
    • Added metrics for Shenandoah GC (#12369)
    • Added metrics as follows: Cpu and CpuSet to java.util.metrics.cgroups, ProcFsUtil for procfs info, and CgroupCpuMonitor and CgroupCpuSetMonitor (#11763)
    • Added support to route data through an HTTP proxy (#11891)
    • Added more metrics for Jetty server thread pool usage (#11113)
    • Added worker category as a dimension to the TaskSlot metrics of the indexing service (#11554)
    • Added partitioningType dimension to segment/added/bytes metric to track usage of different partitioning schemes (#11902)
    • Added query laning metrics to visualize lane assignment (#12111)

    # Cloud integrations

    # Allow authenticating via Shared access resource for Azure storage

    (#12266)

    # Other changes

    • Druid now processes lookup load failures more quickly (#12397)
    • BalanceSegments#balanceServers now exits early when there is no balancing work to do (#11768)
    • DimensionHandler now allows you to define a DimensionSpec appropriate for the type of dimension to handle (#11873)
    • Added an interface for external schema providers to Druid SQL (#12043)

    # Security fixes

    # Support for access control on setting query contexts

    Previously, users could set any query context parameter. This could cause 1) a bad user experience if the context parameter is not mature yet, or 2) query failure or even a system fault in the worst case if a sensitive parameter, such as maxSubqueryRows, is abused. Druid can now limit context parameters per user role. A query fails if it sets a context parameter that you are not allowed to use.

    You can enable context parameter authorization with druid.auth.authorizeQueryContextParams. This is disabled by default to enable a smoother upgrade experience.

    (#12396)
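
    The check described above amounts to comparing the context keys in a query against the keys a role may set. The following is a minimal plain-Java sketch of that idea only. The class and method names are hypothetical, and Druid's actual authorizer API is not shown here.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ContextParamAuthorizationSketch {
    /** Returns the context keys in the query that the caller's role may not set. */
    static Set<String> deniedKeys(Map<String, Object> queryContext, Set<String> allowedKeys) {
        Set<String> denied = new TreeSet<>(queryContext.keySet());
        denied.removeAll(allowedKeys);
        return denied;
    }

    /** Fails the query if it sets any context parameter that is not allowed. */
    static void authorize(Map<String, Object> queryContext, Set<String> allowedKeys) {
        Set<String> denied = deniedKeys(queryContext, allowedKeys);
        if (!denied.isEmpty()) {
            throw new SecurityException("Unauthorized context parameters: " + denied);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> context = Map.of("timeout", 5000, "maxSubqueryRows", 1_000_000);
        // Hypothetical role that may only set "timeout".
        System.out.println(deniedKeys(context, Set.of("timeout")));
    }
}
```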

    # Other security improvements

    This release includes several additional improvements for security:

    • You can now optionally enable authorization on Druid system tables (#11720)
    • Log4j2 has been upgraded to 2.17.1 (#12106)

    # Performance improvements

    # Ingestion

    • More accurate memory estimations while building an on-heap incremental index. Rather than using the maximum possible aggregated row size, Druid can now use (based on a task context flag) a closer estimate of the actual heap footprint of an aggregated row. This enables the indexer to fit more rows in memory before performing an intermediate persist. (#12073)

    # SQL

    • Vectorized virtual column processing is enabled by default. It improves performance for the majority of queries. (#12520)
    • Improved performance for SQL queries with large IN filters. You can achieve better performance by reducing inSubQueryThreshold in SQL query context. (#12357)
    • time_shift is now vectorized (#12254)

    # Bug fixes

    Druid 0.23.0 contains over 68 bug fixes. You can find the complete list here

    # Upgrading to 0.23.0

    Consider the following changes and updates when upgrading from Druid 0.22.x to 0.23.0. If you're updating from an earlier version than 0.22.1, see the release notes of the relevant intermediate versions.

    # Auto-killing of segments

    In 0.23.0, auto-killing of segments is enabled by default (#12187). The new defaults should kill all unused segments older than 90 days. If you do not want this behavior on upgrade, explicitly disable it. This is a risky change because, depending on the interval, segments are killed immediately after being marked unused. This behavior will be reverted or changed in the next Druid release. See (#12693) for more details.

    # Other changes

    • Kinesis ingestion requires listShards API access on the stream.
    • Kafka client libraries have been upgraded to 3.0.0 (#11735)
    • The dynamic coordinator config percentOfSegmentsToConsiderPerMove has been deprecated and will be removed in a future release of Druid. It is being replaced by a new segment picking strategy introduced in (#11257). This new strategy is currently toggled off by default, but you can toggle it on by setting the dynamic coordinator config useBatchedSegmentSampler to true. Doing so disables the use of the deprecated percentOfSegmentsToConsiderPerMove. In a future release, useBatchedSegmentSampler will become permanently true. (#11960)

    # Developer notices

    # Updated airline dependency to 2.x

    https://github.com/airlift/airline is no longer maintained, so Druid has upgraded to https://github.com/rvesse/airline (Airline 2) to use an actively maintained version while minimizing breaking changes.

    This is a backwards incompatible change, and custom extensions relying on the CliCommandCreator extension point will also need to be updated.

    (#12270)

    # Return 404 instead of 400 for unknown supervisors or tasks

    Previously, the supervisor and task endpoints returned 400 when a supervisor or a task was not found. This status code is unfriendly and confusing for third-party systems. According to the definition of HTTP status codes, 404 is the right code for this case, so we have changed the status code from 400 to 404 to eliminate the ambiguity. Any clients of these endpoints should change their response code handling accordingly.

    (#11724)
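
    A client that must work against clusters on both sides of this change might interpret the status code along these lines. This is a hedged sketch: the class, enum, and the treat400AsNotFound flag are hypothetical names for illustration, not part of any Druid client library.

```java
final class TaskLookupStatusExample {
    enum Outcome { FOUND, NOT_FOUND, BAD_REQUEST, OTHER }

    /**
     * Interprets the HTTP status of a supervisor or task lookup.
     * Set treat400AsNotFound to true only while talking to clusters older than 0.23.0,
     * where 400 was also returned for unknown supervisors or tasks.
     */
    static Outcome classify(int httpStatus, boolean treat400AsNotFound) {
        if (httpStatus >= 200 && httpStatus < 300) {
            return Outcome.FOUND;
        }
        if (httpStatus == 404) {
            return Outcome.NOT_FOUND;
        }
        if (httpStatus == 400) {
            return treat400AsNotFound ? Outcome.NOT_FOUND : Outcome.BAD_REQUEST;
        }
        return Outcome.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(404, false));
        System.out.println(classify(400, true));
    }
}
```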

    # Return 400 instead of 500 when SQL query cannot be planned

    Any SQL query that cannot be planned by Druid is now considered a bad request. For such queries, we now return 400. Developers using the SQL API should change their response code handling if needed.

    (#12033)

    # ResponseContext refactoring

    0.23.0 changes the ResponseContext and its keys in a breaking way. The prior version of the response context suggested that keys be defined in an enum, then registered. This version suggests that keys be defined as objects, then registered. See the ResponseContext class itself for the details.

    (#11828)

    # Other changes

    • SingleServerInventoryView has been removed. (#11770)
    • LocalInputSource no longer allows ingesting the same file multiple times. (#11965)
    • getType() in PostAggregator is deprecated in favour of getType(ColumnInspector) (#11818)

    # Known issues

    For a full list of open issues, please see Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @2bethere @317brian @a2l007 @abhishekagarwal87 @adarshsanjeev @aggarwalakshay @AlexanderSaydakov @AmatyaAvadhanula @andreacyc @ApoorvGuptaAi @arunramani @asdf2014 @AshishKapoor @benkrug @capistrant @Caroline1000 @cheddar @chenhuiyeh @churromorales @clintropolis @cryptoe @davidferlay @dbardbar @dependabot[bot] @didip @dkoepke @dungdm93 @ektravel @emirot @FrankChen021 @gianm @hqx871 @iMichka @imply-cheddar @isandeep41 @IvanVan @jacobtolar @jasonk000 @jgoz @jihoonson @jon-wei @josephglanville @joyking7 @kfaraz @klarose @LakshSingla @liran-funaro @lokesh-lingarajan @loquisgon @mark-imply @maytasm @mchades @nikhil-ddu @paul-rogers @petermarshallio @pjain1 @pjfanning @rohangarg @samarthjain @sergioferragut @shallada @somu-imply @sthetland @suneet-s @syacobovitz @Tassatux @techdocsmith @tejaswini-imply @themarcelor @TSFenwick @uschindler @v-vishwa @Vespira @vogievetsky @vtlim @wangxiaobaidu11 @williamhyun @wjhypo @xvrl @yuanlihan @zachjsh

  • druid-0.22.1(Dec 11, 2021)

    Apache Druid 0.22.1 is a bug fix release that fixes some security issues. See the complete set of changes for additional details.

    # Bug fixes

    • https://github.com/apache/druid/pull/12051 Update log4j to 2.15.0 to address CVE-2021-44228
    • https://github.com/apache/druid/pull/11787 JsonConfigurator no longer logs sensitive properties
    • https://github.com/apache/druid/pull/11786 Update axios to 0.21.4 to address CVE-2021-3749
    • https://github.com/apache/druid/pull/11844 Update netty4 to 4.1.68 to address CVE-2021-37136 and CVE-2021-37137

    # Credits

    Thanks to everyone who contributed to this release!

    @abhishekagarwal87 @andreacyc @clintropolis @gianm @jihoonson @kfaraz @xvrl

  • druid-0.22.0(Sep 22, 2021)

    Apache Druid 0.22.0 contains over 400 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 73 contributors. See the complete set of changes for additional details.

    # New features

    # Query engine

    # Support for multiple distinct aggregators in same query

    Druid can now support multiple DISTINCT 'exact' counts using the grouping aggregator typically used with grouping sets. Note that this only applies to exact counts, that is, when druid.sql.planner.useApproximateCountDistinct is false. You can enable it by setting druid.sql.planner.useGroupingSetForExactDistinct to true.

    https://github.com/apache/druid/pull/11014

    # SQL ARRAY_AGG and STRING_AGG aggregator functions

    The ARRAY_AGG aggregation function has been added, to allow accumulating values or distinct values of a column into a single array result. This release also adds STRING_AGG, which is similar to ARRAY_AGG, except it joins the array values into a single string with a supplied 'delimiter' and it ignores null values. Both of these functions accept a maximum size parameter to control maximum result size, and will fail if this value is exceeded. See SQL documentation for additional details.

    https://github.com/apache/druid/pull/11157 https://github.com/apache/druid/pull/11241

    # Bitwise math function expressions and aggregators

    Several new SQL functions for performing 'bitwise' math (along with corresponding native expressions) have been added, including BITWISE_AND, BITWISE_OR, BITWISE_XOR and so on. Additionally, aggregation functions BIT_AND, BIT_OR, and BIT_XOR have been added to accumulate values in a column with the corresponding bitwise function. For complete details see SQL documentation.

    https://github.com/apache/druid/pull/10605 https://github.com/apache/druid/pull/10823 https://github.com/apache/druid/pull/11280
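
    To make the accumulation concrete, the bitwise aggregators can be thought of as a reduction over the values in a column. The sketch below models that in plain Java. It is not Druid's implementation, the class and method names are hypothetical, and how Druid treats nulls or empty inputs is not modeled.

```java
import java.util.stream.LongStream;

final class BitwiseAggModel {
    // Each method folds the inputs with one bitwise operator, starting from that
    // operator's identity value (all bits set for AND, zero for OR and XOR).
    static long bitAnd(long... values) {
        return LongStream.of(values).reduce(-1L, (a, b) -> a & b);
    }

    static long bitOr(long... values) {
        return LongStream.of(values).reduce(0L, (a, b) -> a | b);
    }

    static long bitXor(long... values) {
        return LongStream.of(values).reduce(0L, (a, b) -> a ^ b);
    }

    public static void main(String[] args) {
        // 12 is binary 1100 and 10 is binary 1010.
        System.out.println(bitAnd(12, 10)); // 1000 in binary
        System.out.println(bitOr(12, 10));  // 1110 in binary
        System.out.println(bitXor(12, 10)); // 0110 in binary
    }
}
```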

    # Human readable number format functions

    Three new SQL and native expression number format functions have been added in Druid 0.22.0, HUMAN_READABLE_BINARY_BYTE_FORMAT, HUMAN_READABLE_DECIMAL_BYTE_FORMAT, and HUMAN_READABLE_DECIMAL_FORMAT, which allow transforming results into a more friendly consumption format for query results. For more information see SQL documentation.

    https://github.com/apache/druid/issues/10584 https://github.com/apache/druid/pull/10635

    # Expression aggregator

    Druid 0.22.0 adds a new 'native' JSON query expression aggregator function, that lets you use Druid native expressions to perform "fold" (alternatively known as "reduce") operations to accumulate some value on any number of input columns. This adds significant flexibility to what can be done in a Druid aggregator, similar in a lot of ways to what was possible with the Javascript aggregator, but in a much safer, sandboxed manner.

    Expressions now being able to perform a "fold" on input columns also really rounds out the abilities of native expressions in addition to the previously possible "map" (expression virtual columns), "filter" (expression filters) and post-transform (expression post-aggregators) functions.

    Since this uses expressions, performance is not yet optimal, and it is not directly documented yet, but it is the underlying technology behind the SQL ARRAY_AGG, STRING_AGG, and bitwise aggregator functions also added in this release.

    https://github.com/apache/druid/pull/11104

    # SQL query routing improvements

    Druid 0.22 adds some new facilities to provide extension writers with enhanced control over how queries are routed between Druid routers and brokers. The first adds a new manual broker selection strategy to the Druid router. It lets a query specify, through the query context parameter brokerService, which broker pool defined in druid.router.tierToBrokerMap it should be sent to (the value corresponds to the 'service name' of the broker set, druid.service).

    The second new feature allows the Druid router to parse and examine SQL queries so that broker selection strategies can also function for SQL queries. This can be enabled by setting druid.router.sql.enable to true. This does not affect JDBC queries, which use a different mechanism to facilitate "sticky" connections to a single broker.

    https://github.com/apache/druid/pull/11566 https://github.com/apache/druid/pull/11495
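
    As an illustration of the brokerService context parameter, the sketch below builds a SQL API request body that carries it and posts it to a Router. The broker service name "druid:broker-hot", the datasource "wikipedia", and the local Router address are assumptions, and the Router must be configured with the manual selection strategy for the parameter to have any effect.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class BrokerServiceContextExample {
    /** Minimal JSON string escaping for backslashes and quotes (no control characters). */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds a SQL request body whose query context names the target broker service. */
    static String sqlPayload(String sql, String brokerService) {
        return "{\"query\":\"" + escape(sql) + "\","
            + "\"context\":{\"brokerService\":\"" + escape(brokerService) + "\"}}";
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical datasource and broker service name.
        String body = sqlPayload("SELECT COUNT(*) FROM \"wikipedia\"", "druid:broker-hot");
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8888/druid/v2/sql"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```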

    # Avatica protobuf JDBC Support

    Druid now supports using Avatica Protobuf JDBC connections, such as for use with the Avatica Golang Driver, and has a separate endpoint from the JSON JDBC uri.

    String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica-protobuf/;serialization=protobuf";
    

    https://github.com/apache/druid/pull/10543

    # Improved query error logging

    Query exceptions have been changed from WARN level to ERROR level to include additional information in the logs to help troubleshoot query failures. Additionally, a new query context flag, enableQueryDebugging, has been added that will include stack traces in these query error logs, to provide even more information without the need to enable logs at the DEBUG level.

    https://github.com/apache/druid/pull/11519

    # Streaming Ingestion

    # Task autoscaling for Kafka and Kinesis streaming ingestion

    Druid 0.22.0 now offers experimental support for dynamic Kafka and Kinesis task scaling. The included strategies are driven by periodic measurement of stream lag (which is based on message count for Kafka, and difference of age between the message iterator and the oldest message for Kinesis), and will adjust the number of tasks based on the amount of 'lag' and several configuration parameters. See Kafka and Kinesis documentation for complete information.

    https://github.com/apache/druid/pull/10524 https://github.com/apache/druid/pull/10985
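
    The general shape of a lag-driven scaling decision can be sketched as below. This is a deliberately simplified model, not Druid's autoscaler: the thresholds, the one-task step, and all parameter names are illustrative assumptions, so consult the Kafka and Kinesis documentation for the real configuration.

```java
final class LagBasedScalingSketch {
    /**
     * Returns the task count to use after one lag measurement.
     * Scales out by one task when lag is above scaleOutLag, scales in by one task
     * when lag is below scaleInLag, and otherwise keeps the current count.
     */
    static int nextTaskCount(long lag, int currentTasks, long scaleOutLag, long scaleInLag,
                             int minTasks, int maxTasks) {
        if (lag > scaleOutLag) {
            return Math.min(maxTasks, currentTasks + 1);
        }
        if (lag < scaleInLag) {
            return Math.max(minTasks, currentTasks - 1);
        }
        return currentTasks;
    }

    public static void main(String[] args) {
        // Hypothetical thresholds: scale out above 5000 lagging messages, in below 1000.
        System.out.println(nextTaskCount(10_000, 2, 5_000, 1_000, 1, 4));
        System.out.println(nextTaskCount(500, 2, 5_000, 1_000, 1, 4));
    }
}
```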

    # Avro and Protobuf streaming InputFormat and Confluent Schema Registry Support

    Druid streaming ingestion now has support for Avro and Protobuf in the updated InputFormat specification format, which replaces the deprecated firehose/parser specification used by legacy Druid streaming formats. Alongside this, comes support for obtaining schemas for these formats from Confluent Schema Registry. See data formats documentation for further information.

    https://github.com/apache/druid/pull/11040 https://github.com/apache/druid/pull/11018 https://github.com/apache/druid/pull/10314 https://github.com/apache/druid/pull/10839

    # Kafka ingestion support for specifying group.id

    Druid Kafka streaming ingestion now optionally supports specifying group.id on the connections Druid tasks make to the Kafka brokers. This is useful for accessing clusters which require this be set as part of authorization, and can be specified in the consumerProperties section of the Kafka supervisor spec. See Kafka ingestion documentation for more details.

    https://github.com/apache/druid/pull/11147

    # Native Batch Ingestion

    # Support for using deep storage for intermediary shuffle data

    Druid native 'perfect rollup' 2-phase ingestion tasks now support using deep storage as a shuffle location, as an alternative to local disks on middle-managers or indexers. To use this feature, set druid.processing.intermediaryData.storage.type to deepstore, which uses the configured deep storage type.

    Note: with the "deepstore" type, data is stored in the shuffle-data directory under the configured deep storage path. Automatic cleanup for this directory is not supported yet. You can set up cloud storage lifecycle rules to automatically clean up data at the shuffle-data prefix location.

    https://github.com/apache/druid/pull/11507

    # Improved native batch ingestion task memory usage

    Druid native batch ingestion has received a new configuration option, druid.indexer.task.batchProcessingMode, which introduces two new operating modes that should allow batch ingestion to operate with a smaller and more predictable heap memory usage footprint. The CLOSED_SEGMENTS_SINKS mode is the most aggressive and should have the smallest memory footprint. It works by eliminating in-memory tracking and mmap of intermediary segments produced during segment creation, but it is not well tested at this point and is therefore considered experimental. CLOSED_SEGMENTS, which is the new default option, eliminates mmap of intermediary segments but still tracks the entire set of segments in heap. It is relatively well tested at this point and considered stable. OPEN_SEGMENTS uses the previous ingestion path, which is shared with streaming ingestion, performs a mmap on intermediary segments, and builds a timeline so that these segments can be queried by realtime queries. This is not needed at all for batch, but OPEN_SEGMENTS mode can be selected if any problems occur with the 2 newer modes.

    https://github.com/apache/druid/pull/11123 https://github.com/apache/druid/pull/11294 https://github.com/apache/druid/pull/11536

    # Allow batch tasks to wait until segment handoff before completion

    Druid native batch ingestion tasks can now be optionally configured to not terminate until after the ingested segments are completely loaded by Historical servers. This can be useful for scenarios when the trade-off of keeping an extra task slot occupied is worth using the task state as a measure of whether ingestion is complete and segments are available to query.

    This can be enabled by adding awaitSegmentAvailabilityTimeoutMillis to the tuningConfig in the ingestion spec, which specifies the maximum amount of time that a task will wait for segments to be loaded before terminating. If not all segments become available by the time this timeout expires, the job will still succeed. However, in the ingestion report, segmentAvailabilityConfirmed will be false. This indicates that handoff was not successful and these newly indexed segments may not all be available for query. On the other hand, if all segments become available for query on the Historical services before the timeout expires, the value for that key in the report will be true.

    This tuningConfig value is not supported for compaction tasks at this time. If a user tries to specify a value for awaitSegmentAvailabilityTimeoutMillis for Compaction, the task will fail telling the user it is not supported.

    https://github.com/apache/druid/pull/10676
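
    The timeout semantics described above (succeed either way, but report whether availability was confirmed) can be sketched as a bounded polling loop. This is an illustrative model only. The class and method names and the polling approach are assumptions rather than Druid's implementation.

```java
import java.util.function.BooleanSupplier;

final class SegmentAvailabilityWaitSketch {
    /**
     * Polls until all segments are loaded or the timeout expires. The return value
     * plays the role of segmentAvailabilityConfirmed in the ingestion report: false
     * means the wait timed out, not that the task failed.
     */
    static boolean awaitAvailability(BooleanSupplier allSegmentsLoaded,
                                     long timeoutMillis, long pollMillis) {
        long startNanos = System.nanoTime();
        long timeoutNanos = timeoutMillis * 1_000_000L;
        while (!allSegmentsLoaded.getAsBoolean()) {
            if (System.nanoTime() - startNanos >= timeoutNanos) {
                return false; // timed out before handoff was observed
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A stand-in check that never reports the segments as loaded.
        boolean confirmed = awaitAvailability(() -> false, 100, 10);
        System.out.println("segmentAvailabilityConfirmed=" + confirmed);
    }
}
```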

    # Data lifecycle management

    # Support managing segment and query granularity for auto-compaction

    Druid manual and automatic compaction can now be configured to change segment granularity, and manual compaction can also change query granularity. Additionally, compaction will preserve segment granularity by default. This allows operators to more easily perform operations like changing older data to larger segment and query granularities in exchange for decreased data size. See compaction docs for details.

    https://github.com/apache/druid/pull/10843 https://github.com/apache/druid/pull/10856 https://github.com/apache/druid/pull/10900 https://github.com/apache/druid/pull/10912 https://github.com/apache/druid/pull/11009

    # Allow compaction to temporarily skip locked intervals

    Druid auto-compaction will now by default temporarily skip locked intervals instead of waiting for the lock to become free, which should improve the rate at which datasources can be compacted. This is controlled by druid.coordinator.compaction.skipLockedIntervals, and can be set to false if this behavior is not desired for some reason.

    https://github.com/apache/druid/pull/11190

    # Support for additional automatic metadata cleanup

    You can configure automated cleanup to remove records from the metadata store after you delete some entities from Druid:

    • segments records
    • audit records
    • supervisor records
    • rule records
    • compaction configuration records
    • datasource records created by supervisors

    This feature helps maintain performance when you have a high datasource churn rate, meaning you frequently create and delete many short-lived datasources or other related entities. You can limit the length of time to retain unused metadata records to prevent your metadata store from filling up. See automatic cleanup documentation for more information.

    https://github.com/apache/druid/pull/11078 https://github.com/apache/druid/pull/11084 https://github.com/apache/druid/pull/11164 https://github.com/apache/druid/pull/11200 https://github.com/apache/druid/pull/11227 https://github.com/apache/druid/pull/11232 https://github.com/apache/druid/pull/11245

    # Dropping data

    A new setting, dropExisting, has been added to the ioConfig of Druid native batch ingestion tasks and compaction. When it is set to true (and appendToExisting is false), the ingestion task will transactionally mark all existing segments in the interval as unused, replacing them with the new set of segments. This can be useful in compaction use cases where normal overshadowing does not completely replace a set of segments in an interval, such as when changing segment granularity to a smaller size and some of the smaller granularity buckets would have no data, leaving the original segments only partially overshadowed.

    Note that this functionality is still experimental, and can result in temporary data unavailability for data within the compacted interval. Changing this config does not cause intervals to be compacted again.
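
    The interaction between the two flags can be sketched as follows. This is a minimal illustration, assuming an otherwise elided `index_parallel` ioConfig and the full flag name `appendToExisting`; the `replaces_existing_segments` helper is hypothetical and is not part of Druid.

```python
def replaces_existing_segments(io_config):
    """Model of the rule above: dropExisting only applies when the task
    is not appending to existing segments."""
    drop_existing = bool(io_config.get("dropExisting", False))
    append = bool(io_config.get("appendToExisting", False))
    return drop_existing and not append


# Hypothetical ioConfig fragment; inputSource and inputFormat are omitted.
io_config = {
    "type": "index_parallel",
    "appendToExisting": False,
    "dropExisting": True,
}

print(replaces_existing_segments(io_config))
```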

    Similarly, markAsUnused has been added as an option to the Druid kill task, which will mark any segments in the supplied interval as 'unused' prior to deleting all of the unused segments. This is useful for allowing the mark unused -> delete sequence to happen with a single API call for the caller, as well as allowing the mark-as-unused action to occur under a task interval lock.

    https://github.com/apache/druid/pull/11070 https://github.com/apache/druid/pull/11025 https://github.com/apache/druid/pull/11501

    # Coordinator

    # Control over coordinator segment load timeout behavior with Apache ZooKeeper based segment management

    A new Druid coordinator dynamic configuration option, replicateAfterLoadTimeout, controls what happens when a segment load action times out while using ZooKeeper based segment management. When it is set to true, the coordinator will attempt to replicate the segment that failed to load to a different historical server. This helps improve segment availability if there are a few slow historical servers in the cluster. However, the slow historical may still load the segment later, and the coordinator may then need to issue drop requests if the segment is over-replicated.

    https://github.com/apache/druid/pull/10213

    # Faster coordinator segment balancing

    Another new coordinator dynamic configuration option, useBatchedSegmentSampler, when set to true can potentially provide a large increase in the speed at which the coordinator processes the segment balancing phase. This should be particularly notable at very large cluster sizes with many segments, but is disabled by default to err on the side of caution.

    https://github.com/apache/druid/pull/11257

    # Improved loadstatus API to optionally compute under-replication based on cluster size

    The Druid coordinator load status API now supports a new optional URL query parameter, computeUsingClusterView, which when specified will cause the coordinator to compute under-replication for segments based on the number of servers available within the cluster that the segment can be replicated to, instead of the replication count configured in the load rule. For example, if the load rules specify 2 replicas, but there is only 1 server which can hold segments, this API would not report the segments as under-replicated because they are as replicated as is possible for the given cluster size.
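
    The difference between the two modes can be sketched with a small calculation. This is an illustrative model only, not the coordinator's code; the function and parameter names are made up for the example.

```python
def under_replicated_count(required, loaded, eligible_servers, use_cluster_view):
    """Number of replicas still missing for one segment.

    With the cluster view, the target is capped at the number of servers
    that could actually hold a replica of the segment.
    """
    target = min(required, eligible_servers) if use_cluster_view else required
    return max(target - loaded, 0)


# Load rule asks for 2 replicas, but only 1 server can hold the segment.
print(under_replicated_count(2, 1, 1, use_cluster_view=False))
print(under_replicated_count(2, 1, 1, use_cluster_view=True))
```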

    https://github.com/apache/druid/pull/11056

    # Optional limits on the number of non-primary replicants loaded per coordination cycle

    A new coordinator dynamic configuration, maxNonPrimaryReplicantsToLoad, with a default value of Integer.MAX_VALUE, lets operators define a hard upper limit on the number of non-primary replicants that will be loaded in a single coordinator execution cycle. The default value will mimic the behavior that exists today.

    Example usage: If you set this configuration to 1000, the coordinator will load a maximum of 1000 non-primary replicants in each run cycle execution. For example, if you ingested 2000 segments with a replication factor of 2, the coordinator would load 2000 primary replicants and 1000 non-primary replicants on the first execution. On the next execution, the last 1000 non-primary replicants would be loaded.
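
    The arithmetic in this example can be reproduced with a short simulation. It is a simplified sketch that assumes every cycle loads as many non-primary replicants as the limit allows; `non_primary_loads_per_cycle` is a hypothetical helper, not a Druid API.

```python
from typing import List


def non_primary_loads_per_cycle(pending: int, max_per_cycle: int) -> List[int]:
    """Return how many non-primary replicants are loaded in each cycle."""
    if max_per_cycle <= 0:
        raise ValueError("max_per_cycle must be positive")
    loads = []
    while pending > 0:
        batch = min(pending, max_per_cycle)
        loads.append(batch)
        pending -= batch
    return loads


# 2000 segments with a replication factor of 2 means 2000 non-primary replicants.
segments, replication_factor = 2000, 2
pending_non_primary = segments * (replication_factor - 1)
print(non_primary_loads_per_cycle(pending_non_primary, 1000))
```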

    https://github.com/apache/druid/pull/11135

    # Web Console

    # General improvements

    The Druid web-console 'services' tab now shows which coordinator and overlord servers are serving as the leader, in the 'Detail' column of the table. This should help operators more quickly determine which node is the leader, and thus which one likely has the interesting logs to examine.

    The web-console now also supports using ASCII control characters, by entering them in the form of \uNNNN where NNNN is the unicode code point for the character.
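
    To show what the \uNNNN form looks like for a given character, the sketch below formats a single character as a four digit hexadecimal escape. The helper name is hypothetical, and how the console parses the input is not shown here.

```python
def to_console_escape(ch):
    """Format one character as \\uNNNN, where NNNN is its code point in hex."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return "\\u{:04X}".format(ord(ch))


print(to_console_escape("\t"))    # horizontal tab, code point 9
print(to_console_escape("\x01"))  # start of heading, code point 1
```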

    https://github.com/apache/druid/pull/10951 https://github.com/apache/druid/pull/10795

    # Query view

    The query view of the web-console has received a number of 'quality of life' improvements in Druid 0.22.0. First, the query view now provides an indicator of how long a query took to execute.

    Also, queries will no longer auto-run when opening a fresh page, which prevents stale queries from being executed when opening a browser. The page will be reset to 0 if the query result changes, and the query limit will automatically increase when the last page is loaded, re-running the query.

    Inline documentation now also includes Druid type information, and the query view should provide better suggestions whenever a query error occurs.

    Finally, the web console query view now supports the hot-key combination command + enter (on mac) and ctrl + enter on Windows and Linux.

    https://github.com/apache/druid/pull/11158 https://github.com/apache/druid/pull/11128 https://github.com/apache/druid/pull/11203 https://github.com/apache/druid/pull/11365

    # Data management

    The web-console segments view timeline now has the ability to pick any time interval, instead of just the previous year.

    The web-console segments view has also been improved to be more performant when interacting with the sys.segments table, including providing the ability to 'force' the web-console to only use the native JSON API methods to display segment information.

    The lookup view has also been improved, so that now 'poll period' and 'summary' are available as columns in the list view.

    We have also added validation for poll period to prevent user error, and improved error reporting.

    https://github.com/apache/druid/pull/11359 https://github.com/apache/druid/pull/10909 https://github.com/apache/druid/pull/11620

    # Metrics

    # Prometheus metric emitter

    A new "contrib" extension has been added, prometheus-emitter, which allows Druid metrics to be sent directly to a Prometheus server. See the extension documentation page for complete details: https://druid.apache.org/docs/0.22.0/development/extensions-contrib/prometheus.html

    https://github.com/apache/druid/pull/10412 https://github.com/apache/druid/pull/11618

    # ingest/notices/queueSize

    ingest/notices/queueSize is a new metric added to provide monitoring for supervisor ingestion task control message processing queue sizes, to help in determining if a supervisor might be overloaded by a large volume of these notices. This metric is emitted by default for every running supervisor.

    https://github.com/apache/druid/pull/11417

    # query/segments/count

    query/segments/count is a new metric which has been added to track the number of segments which participate in a query. This metric is not enabled by default, so must be enabled via a custom extension to override which QueryMetrics are emitted similar to other query metrics that are not emitted by default. (We know this is definitely not friendly, and hope someday in the future to make this easier, sorry).

    https://github.com/apache/druid/pull/11394

    # Cloud integrations

    # AWS Web Identity / IRSA Support

    Druid 0.22.0 adds AWS Web Identity Token Support, which allows for the use of IAM roles for service accounts on Kubernetes, if configured as the AWS credentials provider.

    https://github.com/apache/druid/pull/10541

    # S3 ingestion support for assuming a role

    Druid native batch ingestion from S3 input sources can now use the AssumeRole capability in AWS for cross-account file access. This can be utilized by setting assumeRoleArn and assumeRoleExternalId on the S3 input source specification in a batch ingestion task. See AWS documentation and native batch documentation for more details.
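
    A possible shape for such an input source is sketched below. The bucket, role ARN, and external ID are placeholders, and nesting the two settings under a `properties` object is an assumption based on the S3 input source documentation; check the native batch documentation for the authoritative layout.

```python
import json

# Hypothetical S3 input source fragment for a batch ingestion task.
s3_input_source = {
    "type": "s3",
    "uris": ["s3://example-bucket/path/to/file.json"],  # placeholder location
    "properties": {
        # Role in the other AWS account that grants access to the files.
        "assumeRoleArn": "arn:aws:iam::123456789012:role/example-druid-reader",
        # Optional external ID agreed with the owner of that role.
        "assumeRoleExternalId": "example-external-id",
    },
}

print(json.dumps(s3_input_source, indent=2))
```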

    https://github.com/apache/druid/pull/10995

    # Google Cloud Storage support for URI lookups

    Druid lookups now support loading via Google Cloud Storage, similar to existing functionality available with S3. This requires that druid-google-extensions be loaded in addition to the lookup extensions, but beyond that it is as simple as using a Google Cloud Storage URI.

    https://github.com/apache/druid/pull/11026

    # Other changes

    # Extracting Avro union fields by type

    Avro ingestion using Druid batch or streaming ingestion now supports an alternative mechanism of extracting data for Avro Union types. This new option, extractUnionsByType only works when utilizing a flattenSpec to extract nested data from union types, and will cause the extracted data to be available with the type as part of the flatten path. For example, given a multi-typed union column someMultiMemberUnion, with this option enabled a long value would be extracted by $.someMultiMemberUnion.long instead of $.someMultiMemberUnion, and would only extract long values from the union. See Avro documentation for complete information.
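
    The effect of extracting by type can be modelled in a few lines. This is a toy model of the semantics described above, not Druid's Avro reader: records are plain dicts, and the mapping from Python types to Avro type names is an assumption made for the example.

```python
# Assumed mapping from Python value types to Avro type names.
AVRO_TYPE_NAMES = {int: "long", float: "double", str: "string"}


def extract_union_member(record, column, type_name):
    """Model of a path like $.<column>.<type_name>: return the value only
    when the union currently holds a member of the requested type."""
    value = record.get(column)
    if value is None:
        return None
    if AVRO_TYPE_NAMES.get(type(value)) == type_name:
        return value
    return None


rows = [{"someMultiMemberUnion": 42}, {"someMultiMemberUnion": "abc"}]
print([extract_union_member(r, "someMultiMemberUnion", "long") for r in rows])
```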

    https://github.com/apache/druid/pull/10505

    # Support using MariaDb connector with MySQL extensions

    Druid MySQL extensions now support using the MariaDB connector library as an alternative to the MySQL connector. This can be done by setting druid.metadata.mysql.driver.driverClassName to org.mariadb.jdbc.Driver, and includes full support for JDBC URI parameter whitelists used by JDBC lookups and SQL based ingestion.

    https://github.com/apache/druid/pull/11402

    # Add Environment Variable DynamicConfigProvider

    Druid now provides a DynamicConfigProvider implementation that is backed by environment variables. For example:

    druid.some.config.dynamicConfigProvider={"type": "environment","variables":{"secret1": "SECRET1_VAR","secret2": "SECRET2_VAR"}}
    

    See dynamic config provider documentation for further information.

    https://github.com/apache/druid/pull/11377

    # Add DynamicConfigProvider for Schema Registry

    Ingestion formats which support Confluent Schema Registry now support supplying these parameters via a DynamicConfigProvider which is the newer alternative to PasswordProvider. This will allow ingestion tasks to use the config provider to supply this information instead of directly in the JSON specifications, allowing the potential for more secure manners of supplying credentials and other sensitive configuration information. See data format and dynamic config provider documentation for more details.

    https://github.com/apache/druid/pull/11362

    # Security fixes

    # Control of allowed protocols for HTTP and HDFS input sources

    Druid 0.22.0 adds new facilities to control the set of allowed protocols used by HTTP and HDFS input sources in batch ingestion. druid.ingestion.hdfs.allowedProtocols is configured by default to accept hdfs as the protocol, and druid.ingestion.http.allowedProtocols by default will allow http and https. This might cause issues with existing deployments since it is more restrictive than the default behavior in older versions of Druid, but overall allows operators more flexibility in securing these input sources.

    https://github.com/apache/druid/pull/10830

    # Fix expiration logic for LDAP internal credential cache

    This version of Druid also fixes a flaw in the druid-basic-security extension when using LDAP, where the credentials cache would not correctly expire, potentially holding stale credential information until another trigger was hit or the service was restarted. Druid clusters using LDAP for authorization should update to 0.22.0 whenever possible to fix this issue.

    https://github.com/apache/druid/pull/11395

    # Performance improvements

    # General performance

    • improved granularity processing speed: https://github.com/apache/druid/pull/10904
    • improved string comparison performance: https://github.com/apache/druid/pull/11171

    # JOIN query enhancements

    • improved performance for certain JOIN queries by allowing some INNER JOIN queries to be translated into native Druid filters: https://github.com/apache/druid/pull/11068
    • support filter pushdown into the left base table for certain JOIN queries, controlled by new query context parameter enableJoinLeftTableScanDirect (defaults to false): https://github.com/apache/druid/pull/10697

    # SQL

    • improved SQL group by query performance by using native query granularity when possible: https://github.com/apache/druid/pull/11379
    • added druid.sql.avatica.minRowsPerFrame broker configuration which can be used to significantly improve JDBC performance by increasing the result batch size: https://github.com/apache/druid/pull/10880
    • faster SQL parsing through improved expression parsing and exception handling: https://github.com/apache/druid/pull/11041
    • improved query performance for sys.segments: https://github.com/apache/druid/pull/11008
    • reduced SQL schema lock contention on brokers: https://github.com/apache/druid/pull/11457
    • improved performance of segmentMetadata queries which are used to build SQL schema: https://github.com/apache/druid/pull/10892

    # Vectorized query engine

    • vectorized query engine support for DataSketches quantiles aggregator: https://github.com/apache/druid/pull/11183
    • vectorized query engine support for DataSketches theta sketch aggregator: https://github.com/apache/druid/pull/10767
    • vectorized query engine support for Druid cardinality aggregator: https://github.com/apache/druid/pull/11182
    • vectorization support has been added for expression filter: https://github.com/apache/druid/pull/10613
    • vectorized group by support for string expressions: https://github.com/apache/druid/pull/11010
    • deferred string expression evaluation support for vectorized group by engine: https://github.com/apache/druid/pull/11213
    • improved column scan speed for LONG columns with 'auto' encoding (not the default): https://github.com/apache/druid/pull/11004
    • improved column scan/filtering speeds when contiguous blocks of values are read: https://github.com/apache/druid/pull/11039

    # Bug fixes

    Druid 0.22.0 contains over 80 bug fixes; you can see the complete list here.

    # Upgrading to 0.22.0

    Consider the following changes and updates when upgrading from Druid 0.21.x to 0.22.0. If you're updating from an earlier version than 0.21.0, see the release notes of the relevant intermediate versions.

    # Dropped support for Apache ZooKeeper 3.4

    Druid 0.21 officially deprecated support for ZooKeeper 3.4, which has been end-of-life for a while. Support for ZooKeeper 3.4 is now removed in 0.22.0. Be sure to upgrade your ZooKeeper cluster prior to upgrading your Druid cluster to 0.22.0.

    https://github.com/apache/druid/issues/10780 https://github.com/apache/druid/pull/11073

    # Native batch ingestion segment allocation fix

    Druid 0.22.0 includes an important bug-fix in native batch indexing where transient failures of indexing sub-tasks can result in non-contiguous partitions in the result segments, which will never become queryable due to logic which checks for the 'complete' set. This issue has been resolved in the latest version of Druid, but required a change in the protocol which batch tasks use to allocate segments, and this change can cause issues during rolling downgrades if you decide to roll back from Druid 0.22.0 to an earlier version.

    To avoid task failure during a rolling-downgrade, set

    druid.indexer.task.default.context={ "useLineageBasedSegmentAllocation" : false }
    

    in the overlord runtime properties, and wait for all tasks which have useLineageBasedSegmentAllocation set to true to complete before initiating the downgrade. After these tasks have all completed the downgrade shouldn't have any further issue and the setting can be removed from the overlord configuration (recommended, as you will want this setting enabled if you are running Druid 0.22.0 or newer).

    https://github.com/apache/druid/pull/11189

    # SQL timeseries no longer skip empty buckets with all granularity

    Prior to Druid 0.22, an SQL group by query which is using a single universal grouping key (e.g. only aggregators) such as SELECT COUNT(*), SUM(x) FROM y WHERE z = 'someval' would produce an empty result set instead of [0, null] that might be expected from this query matching no results. This was because underneath this would plan into a timeseries query with 'ALL' granularity, and skipEmptyBuckets set to true in the query context. This latter option caused the results of such a query to return no results, as there are no buckets with values to aggregate and so they are skipped, making an empty result set instead of a 'nil' result set. This behavior has been changed to behave in line with other SQL implementations, but the previous behavior can be obtained by explicitly setting skipEmptyBuckets on the query context.
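
    The two result shapes, and a request that opts back into the old one, can be sketched as follows. The `aggregate_all` function is a toy model of the behavior described above rather than Druid code, and the request payload assumes the SQL API accepts a `context` object alongside the query.

```python
def aggregate_all(values, skip_empty_buckets):
    """Toy model of SELECT COUNT(*), SUM(x) over the rows that matched."""
    if not values and skip_empty_buckets:
        return []  # pre-0.22 behavior: the single empty bucket is skipped
    total = sum(values) if values else None
    return [[len(values), total]]


print(aggregate_all([], skip_empty_buckets=True))
print(aggregate_all([], skip_empty_buckets=False))

# Hypothetical SQL API request body that restores the previous behavior.
sql_request = {
    "query": "SELECT COUNT(*), SUM(x) FROM y WHERE z = 'someval'",
    "context": {"skipEmptyBuckets": True},
}
```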

    https://github.com/apache/druid/pull/11188

    # Druid reingestion incompatible changes

    Batch tasks using a 'Druid' input source to reingest segment data will no longer accept the 'dimensions' and 'metrics' sections of their task spec, and now will internally use a new columns filter to specify which columns from the original segment should be retained. Additionally, timestampSpec is no longer ignored, allowing the __time column to be modified or replaced with a different column. These changes additionally fix a bug where transformed columns would be ignored and unavailable on the new segments.

    https://github.com/apache/druid/pull/10267

    # Druid web-console no longer supports IE11 and other older browsers

    Some things might still work, but these browsers are no longer officially supported so that newer JavaScript features can be used to develop the web-console. https://github.com/apache/druid/pull/11357

    # Changed default maximum segment loading queue size

    The default value of the Druid coordinator maxSegmentsInNodeLoadingQueue dynamic configuration has been changed from unlimited (0) to 100. This should make the coordinator behave in a much more relaxed manner during periods of cluster volatility, such as a rolling upgrade, but caps the total number of segments that will be loaded in any given coordinator cycle to 100 per server, which can slow down the speed at which a completely stopped cluster is started and loaded from deep storage.

    https://github.com/apache/druid/pull/11540

    # Developer notices

    # CacheKeyBuilder moved from druid-processing to druid-core

    The CacheKeyBuilder class, which is annotated with @PublicAPI, has been moved from druid-processing to druid-core so that expressions can extend the Cacheable interface to generate cache keys which depend on some external state, such as lookup version.

    https://github.com/apache/druid/pull/11358

    # Query engine now uses new QueryProcessingPool instead of ExecutorService directly

    This impacts a handful of method signatures in the query processing engine, such as QueryRunnerFactory and QuerySegmentWalker to allow extensions to hook into various parts of the query processing pool and alternative processing pool scheduling strategies in the future.

    https://github.com/apache/druid/pull/11382

    # SegmentLoader is now extensible and customizable

    This allows extensions to provide alternative segment loading implementations to customize how Druid segments are loaded from deep storage and made available to the query engine. This should be considered an unstable API, and is annotated as such in the code.

    https://github.com/apache/druid/pull/11398

    # Known issues

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @2bethere @a2l007 @abhishekagarwal87 @AKarbas @AlexanderSaydakov @ArvinZheng @asdf2014 @astrohsy @bananaaggle @benkrug @bergmt2000 @camteasdale143 @capistrant @Caroline1000 @chenyuzhi459 @clintropolis @cryptoe @DaegiKim @dependabot[bot] @dkoepke @dongjoon-hyun @egor-ryashin @fhan688 @FrankChen021 @gianm @harinirajendran @himadrisingh @himanshug @hqx871 @imply-jbalik @imply-jhan @isandeep41 @jasonk000 @jbampton @jerryleooo @jgoz @jihoonson @jon-wei @josephglanville @jp707049 @junegunn @kaijianding @kazuhirokomoda @kfaraz @lkm @loquisgon @MakDon @maytasm @misqos @mprashanthsagar @mSitkovets @paul-rogers @petermarshallio @pjain1 @rohangarg @samarthjain @shankeerthan-kasilingam @spinatelli @sthetland @suneet-s @techdocsmith @Tiaaa @tushar-1728 @viatcheslavmogilevsky @viongpanzi @vogievetsky @vtlim @wjhypo @wx930910 @xvrl @yuanlihan @zachjsh @zhangyue19921010

  • druid-0.21.1(Jun 10, 2021)

    Apache Druid 0.21.1 is a bug fix release that fixes a few regressions with the 0.21 release. The first is an issue with the published Docker image, which causes containers to fail to start due to volume permission issues, described in #11166 and fixed in #11167. This release also fixes an issue caused by a bug in the upgraded Jetty version which was released in 0.21, described in #11206 and fixed in #11207. Finally, a web console regression related to field validation has been fixed in #11228.

    # Bug fixes

    • https://github.com/apache/druid/pull/11167 fix docker volume permissions
    • https://github.com/apache/druid/pull/11207 Upgrade jetty version
    • https://github.com/apache/druid/pull/11228 Web console: Fix required field treatment
    • https://github.com/apache/druid/pull/11299 Fix permission problems in docker

    # Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @clintropolis @FrankChen021 @maytasm @vogievetsky

  • druid-0.21.0(Apr 28, 2021)

    Apache Druid 0.21.0 contains around 120 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

    # New features

    # Operation

    # Service discovery and leader election based on Kubernetes

    The new Kubernetes extension supports service discovery and leader election based on Kubernetes. This extension works in conjunction with the HTTP-based server view (druid.serverview.type=http) and task management (druid.indexer.runner.type=httpRemote) to allow you to run a Druid cluster with zero ZooKeeper dependencies. This extension is still experimental. See Kubernetes extension for more details.

    https://github.com/apache/druid/pull/10544 https://github.com/apache/druid/pull/9507 https://github.com/apache/druid/pull/10537

    # New dynamic coordinator configuration to limit the number of segments when finding a candidate segment for segment balancing

    You can set the percentOfSegmentsToConsiderPerMove to limit the number of segments considered when picking a candidate segment to move. The candidates are searched up to maxSegmentsToMove * 2 times. This new configuration prevents Druid from iterating through all available segments to speed up the segment balancing process, especially if you have lots of available segments in your cluster. See Coordinator dynamic configuration for more details.
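
    As a rough sense of scale, the calculation below shows how a percentage shrinks the candidate pool. It is a back-of-the-envelope sketch; the helper is hypothetical and the rounding used here (ceiling) is an assumption, not necessarily what the coordinator does.

```python
import math


def segments_considered(total_segments, percent):
    """Approximate number of segments examined per move for a given percentage."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return math.ceil(total_segments * percent / 100)


# With one million segments, considering 25% caps the search at 250,000.
print(segments_considered(1_000_000, 25))
```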

    https://github.com/apache/druid/pull/10284

    # status and selfDiscovered endpoints for Indexers

    The Indexer now supports status and selfDiscovered endpoints. See Processor information APIs for details.

    https://github.com/apache/druid/pull/10679

    # Querying

    # New grouping aggregator function

    You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set. See Aggregation functions for more details.
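
    The value returned for each grouping set can be read as a bit mask. The sketch below assumes the usual SQL convention, where each argument contributes one bit (0 if the dimension is part of the current grouping set, 1 if it is not) and the first argument is the most significant bit; see Aggregation functions for the authoritative definition. The `grouping` function here is an illustration in Python, not Druid's implementation.

```python
def grouping(dimensions, grouping_set):
    """Compute the GROUPING(...) bit mask for one grouping set."""
    value = 0
    for dim in dimensions:
        bit = 0 if dim in grouping_set else 1
        value = (value << 1) | bit
    return value


dims = ["dim1", "dim2"]
for subset in (["dim1", "dim2"], ["dim1"], ["dim2"], []):
    print(subset, grouping(dims, set(subset)))
```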

    https://github.com/apache/druid/pull/10518

    # Improved missing argument handling in expressions and functions

    Expression processing can now be vectorized when inputs are missing, for example when a column does not exist. When an argument is missing in an expression, Druid can now infer the proper type of result based on non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions can produce output properly by treating missing arguments as zeros.
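
    The type inference rule can be illustrated with a toy addition function, where Python `int` stands in for long and `float` for double. This is only a model of the behavior described above in default null handling mode; `add_with_missing` is not a Druid function.

```python
def add_with_missing(left, right):
    """Add two values, treating a missing (None) argument as a zero of the
    other argument's type so that the result keeps that type."""
    if left is None and right is None:
        return None
    if left is None:
        left = type(right)(0)
    if right is None:
        right = type(left)(0)
    return left + right


result = add_with_missing(5, None)  # longColumn + nonExistentColumn
print(result, type(result).__name__)
```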

    https://github.com/apache/druid/pull/10499

    # Allow zero period for TIMESTAMPADD

    The TIMESTAMPADD function now allows a zero period. This functionality is required for some BI tools such as Tableau.

    https://github.com/apache/druid/pull/10550

    # Ingestion

    # Native parallel ingestion no longer requires explicit intervals

    Parallel tasks no longer require you to set explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.

    https://github.com/apache/druid/pull/10592 https://github.com/apache/druid/pull/10647

    # Old Kafka version support

    Druid now supports Apache Kafka older than 0.11. To read from an old version of Kafka, set the isolation.level to read_uncommitted in consumerProperties. Only 0.10.2.1 has been tested up until this release. See Kafka supervisor configurations for details.
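
    A supervisor ioConfig fragment with this setting might look like the sketch below. The topic name and broker address are placeholders, and the rest of the supervisor spec is omitted.

```python
import json

# Hypothetical Kafka supervisor ioConfig fragment for a pre-0.11 Kafka cluster.
kafka_io_config = {
    "type": "kafka",
    "topic": "example-topic",  # placeholder topic name
    "consumerProperties": {
        "bootstrap.servers": "kafka-host:9092",  # placeholder broker address
        "isolation.level": "read_uncommitted",
    },
}

print(json.dumps(kafka_io_config, indent=2))
```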

    https://github.com/apache/druid/pull/10551

    # Multi-phase segment merge for native batch ingestion

    A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge. See tuningConfig for native batch ingestion for more details.

    https://github.com/apache/druid/pull/10689

    # Native re-ingestion is less memory intensive

    Parallel tasks now sort segments by ID before assigning them to subtasks. This sorting minimizes the number of time chunks for each subtask to handle. As a result, each subtask is expected to use less memory, especially when a single Parallel task is issued to re-ingest segments covering a long time period.

    https://github.com/apache/druid/pull/10646

    # Web console

    # Updated and improved web console styles

    The new web console styles make better use of the Druid brand colors and standardize paddings and margins throughout. The icon and background colors are now derived from the Druid logo.


    https://github.com/apache/druid/pull/10515

    # Partitioning information is available in the web console

    The web console now shows datasource partitioning information on the new Segment granularity and Partitioning columns.

    Segment granularity column in the Datasources tab


    Partitioning column in the Segments tab


    https://github.com/apache/druid/pull/10533

    # The column order in the Schema table matches the dimensionsSpec

    The Schema table now reflects the dimension ordering in the dimensionsSpec.


    https://github.com/apache/druid/pull/10588

    # Metrics

    # Coordinator duty runtime metrics

    The coordinator performs several 'duty' tasks, for example segment balancing and loading new segments. There are now two new metrics to help you analyze how fast the Coordinator is executing these duties.

    • coordinator/time: the time for an individual duty to execute
    • coordinator/global/time: the time for the whole duties runnable to execute

    https://github.com/apache/druid/pull/10603

    # Query timeout metric

    A new metric provides the number of timed out queries. Previously timed out queries were treated as interrupted and included in the query/interrupted/count (see Changed HTTP status codes for query errors for more details).

    query/timeout/count: the number of timed out queries during the emission period

    https://github.com/apache/druid/pull/10567

    # Shuffle metrics for batch ingestion

    Two new metrics provide shuffle statistics for MiddleManagers and Indexers. These metrics have the supervisorTaskId as their dimension.

    • ingest/shuffle/bytes: number of bytes shuffled per emission period
    • ingest/shuffle/requests: number of shuffle requests per emission period

    To enable the shuffle metrics, add org.apache.druid.indexing.worker.shuffle.ShuffleMonitor in druid.monitoring.monitors. See Shuffle metrics for more details.

    https://github.com/apache/druid/pull/10359

    # New clock-drift safe metrics monitor scheduler

    The default metrics monitor scheduler is implemented based on ScheduledThreadPoolExecutor which is prone to unbounded clock drift. A new monitor scheduler, ClockDriftSafeMonitorScheduler, overcomes this limitation. To use the new scheduler, set druid.monitoring.schedulerClassName to org.apache.druid.java.util.metrics.ClockDriftSafeMonitorScheduler in the runtime.properties file.

    https://github.com/apache/druid/pull/10448 https://github.com/apache/druid/pull/10732

    # Others

    # New extension for a password provider based on AWS RDS token

    A new PasswordProvider type allows access to AWS RDS DB instances using temporary AWS tokens. This extension can be useful when an RDS is used as Druid's metadata store. See AWS RDS extension for more details.

    https://github.com/apache/druid/pull/9518

    # The sys.servers table shows leaders

    A new long-typed column is_leader in the sys.servers table indicates whether or not the server is the leader.

    https://github.com/apache/druid/pull/10680

    # druid-influxdb-emitter extension supports the HTTPS protocol

    See Influxdb emitter extension for new configurations.

    https://github.com/apache/druid/pull/9938

    # Docker

    # Small docker image

    The docker image size is reduced by half by eliminating unnecessary duplication.

    https://github.com/apache/druid/pull/10506

    # Development

    # Extensible Kafka consumer properties via a new DynamicConfigProvider

    A new class DynamicConfigProvider enables fetching consumer properties at runtime. For instance, you can use DynamicConfigProvider to fetch bootstrap.servers from a location such as a local environment variable if it is not static. Currently, only a map-based config provider is supported by default. See DynamicConfigProvider for how to implement a custom config provider.

    https://github.com/apache/druid/pull/10309

    # Bug fixes

    Druid 0.21.0 contains 30 bug fixes. You can see the complete list here.

    # Post-aggregator computation with subtotals

    Before 0.21.0, queries that used post-aggregators with subtotals failed with an error. This bug is now fixed, and you can use post-aggregators with subtotals.

    https://github.com/apache/druid/pull/10653

    # Indexers announce themselves as segment servers

    In 0.19.0 and 0.20.0, Indexers could not process queries against streaming data because they did not announce themselves as segment servers. In 0.21.0, they announce themselves properly.

    https://github.com/apache/druid/pull/10631

    # Validity check for segment files in historicals

    Historicals now perform a validity check after they download segment files and automatically re-download any files that are corrupted.

    https://github.com/apache/druid/pull/10650

    # StorageLocationSelectorStrategy injection failure is fixed

    The injection failure while reading the configurations of StorageLocationSelectorStrategy is fixed.

    https://github.com/apache/druid/pull/10363

    # Upgrading to 0.21.0

    Consider the following changes and updates when upgrading from Druid 0.20.0 to 0.21.0. If you're updating from an earlier version than 0.20.0, see the release notes of the relevant intermediate versions.

    # Improved HTTP status codes for query errors

    Before this release, Druid returned "internal error (500)" for most query errors. Now Druid returns different error codes based on the cause. The following table lists the errors whose codes have changed:

    Exception | Description | Old code | New code
    ------------ | ------------- | ------------- | -------------
    SqlParseException and ValidationException from Calcite | Query planning failed | 500 | 400
    QueryTimeoutException | Query execution didn't finish in timeout | 500 | 504
    ResourceLimitExceededException | Query asked more resources than configured threshold | 500 | 400
    InsufficientResourceException | Query failed to schedule because of lack of merge buffers available at the time when it was submitted | 500 | 429, merged to QueryCapacityExceededException
    QueryUnsupportedException | Unsupported functionality | 400 | 501

    There is also a new query metric for query timeout errors. See New query timeout metric for more details.
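
    A client that previously treated every 500 as a generic failure may want to branch on the new codes. The following is a minimal sketch of such a mapping, not part of Druid itself; the function name and the hint strings are hypothetical, and only the status codes come from the table above.

```python
def classify_druid_status(status: int) -> str:
    """Map an HTTP status from a Druid query to a coarse handling hint."""
    if 200 <= status < 300:
        return "ok"
    hints = {
        400: "fix_query",    # planning failure or resource limit exceeded
        429: "retry_later",  # query capacity exceeded
        501: "unsupported",  # unsupported functionality
        504: "timeout",      # query did not finish within its timeout
    }
    # Anything else, including a remaining 500, is treated as a server-side error.
    return hints.get(status, "server_error")
```

    With a mapping like this, a caller could retry with backoff on "retry_later" and surface "fix_query" to the user without retrying.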

    https://github.com/apache/druid/pull/10464 https://github.com/apache/druid/pull/10746

    # Query interrupted metric

    query/interrupted/count no longer counts the queries that timed out. These queries are counted by query/timeout/count.

    # context dimension in query metrics

    context is now a default dimension emitted for all query metrics. context is a JSON-formatted string containing the query context for the query that the emitted metric refers to. The addition of a dimension that was not previously emitted alters some metrics emitted by Druid. You should plan to handle this new context dimension in your metrics pipeline. Since the dimension is a JSON-formatted string, a common solution is to parse the dimension and either flatten it or extract the bits you want and discard the full JSON-formatted string blob.
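
    As an illustration of the "extract the bits you want" approach, here is a minimal sketch for a metrics pipeline. The event shape (a dict with metric and value fields), the function name, and the choice of keys to keep are assumptions made for this example; only the context dimension being a JSON-formatted string comes from the note above.

```python
import json


def flatten_context(event, keep=("queryId", "timeout")):
    """Replace the JSON-string 'context' dimension with a few flat fields."""
    # Copy every dimension except the raw JSON blob.
    flat = {k: v for k, v in event.items() if k != "context"}
    raw = event.get("context")
    if not raw:
        return flat
    try:
        context = json.loads(raw)
    except (TypeError, ValueError):
        # Unparseable context: drop it rather than fail the pipeline.
        return flat
    if isinstance(context, dict):
        for key in keep:
            if key in context:
                flat["context_" + key] = context[key]
    return flat
```

    Keeping only a short allow list of keys avoids turning arbitrary context values into new dimensions downstream.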

    https://github.com/apache/druid/pull/10578

    # Deprecated support for Apache ZooKeeper 3.4

    As ZooKeeper 3.4 has been end-of-life for a while, support for ZooKeeper 3.4 is deprecated in 0.21.0 and will be removed in the near future.

    https://github.com/apache/druid/issues/10780

    # Consistent serialization format and column naming convention for the sys.segments table

    All columns in the sys.segments table are now serialized in the JSON format to make them consistent with other system tables. Column names now use the same "snake case" convention.

    https://github.com/apache/druid/pull/10481

    # Known issues

    # Known security vulnerability in the Thrift library

    The Thrift extension can be useful for ingesting files of the Thrift format into Druid. However, there is a known security vulnerability in the version of the Thrift library that Druid uses. The vulnerability can be exploited by ingesting maliciously crafted Thrift files when you use Indexers. We recommend granting the DATASOURCE WRITE permission to only trusted users.

    # Permission issues in running the docker-based Druid cluster

    If you run the Druid docker cluster for the first time on your machine, using the 0.21.0 image can create internal directories with the root account. As a result, Druid services can fail due to a lack of permissions. This issue is filed in https://github.com/apache/druid/issues/11166.

    If you are using docker compose, you can use the commands below to work around this issue. These commands first create the internal directories using an old image and then start services using the 0.21.0 image.

    $ cd ${PREV_SRC_DIR}
    $ docker-compose -f distribution/docker/docker-compose.yml create
    $ cd ${DRUID_0_21_0_SRC_DIR}
    $ docker-compose -f distribution/docker/docker-compose.yml up
    

    If you are not using docker compose, you can directly pass the volume parameter for /opt/druid/var when you start services using the 0.21.0 image. For example, you can run the command below to start the coordinator service.

    $ docker run -v /path/to/host/dir:/opt/druid/var apache/druid:0.21.0 coordinator
    

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @abhishekagarwal87 @asdf2014 @AshishKapoor @awelsh93 @ayushkul2910 @bananaaggle @capistrant @ccaominh @clintropolis @cloventt @FrankChen021 @gianm @harinirajendran @himanshug @jihoonson @jon-wei @kroeders @liran-funaro @martin-g @maytasm @mghosh4 @michaelschiff @nishantmonu51 @pcarrier @QingdongZeng3 @sthetland @suneet-s @tdt17 @techdocsmith @valdemar-giosg @viatcheslavmogilevsky @viongpanzi @vogievetsky @xvrl @zhangyue19921010

  • druid-0.20.2(Mar 29, 2021)

    Apache Druid 0.20.2 introduces new configurations to address CVE-2021-26919: Authenticated users can execute arbitrary code from malicious MySQL database systems. We recommend enabling the new configurations below to mitigate vulnerable JDBC connection properties. These configurations apply to all JDBC connections for ingestion and lookups, but not to the metadata store. See security configurations for more details.

    • druid.access.jdbc.enforceAllowedProperties: When true, Druid applies druid.access.jdbc.allowedProperties to JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When false, Druid allows any kind of JDBC connection without JDBC property validation. This config defaults to false so that rolling upgrades do not break. It is already deprecated and may be removed in a future release, at which point the allow list will always be enforced.
    • druid.access.jdbc.allowedProperties: Defines a list of allowed JDBC properties. Druid always enforces the list for all JDBC connections starting with jdbc:postgresql: or jdbc:mysql: if druid.access.jdbc.enforceAllowedProperties is set to true. This option is tested against MySQL connector 5.1.48 and PostgreSQL connector 42.2.14. Other connector versions might not work.
    • druid.access.jdbc.allowUnknownJdbcUrlFormat: When false, Druid only accepts JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When true, Druid allows JDBC connections to any kind of database, but only enforces druid.access.jdbc.allowedProperties for PostgreSQL and MySQL.
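
    The interaction of the three settings can be sketched as follows. This is one reading of the descriptions above, not Druid's actual implementation: the function and parameter names are hypothetical, and the sketch only looks at properties passed in the URL query string, whereas real connectors accept properties in other forms too.

```python
from urllib.parse import parse_qsl, urlsplit

KNOWN_PREFIXES = ("jdbc:postgresql:", "jdbc:mysql:")


def is_jdbc_url_allowed(url, allowed_properties,
                        enforce_allowed_properties=True,
                        allow_unknown_format=False):
    """Return True if the JDBC URL passes the allow-list check."""
    if not enforce_allowed_properties:
        # No validation at all when enforcement is switched off.
        return True
    if not url.startswith(KNOWN_PREFIXES):
        # Unknown URL formats are accepted only if explicitly allowed,
        # and their properties are not checked.
        return allow_unknown_format
    # Strip the "jdbc:" prefix so the rest parses like an ordinary URL.
    query = urlsplit(url[len("jdbc:"):]).query
    used = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
    return used <= set(allowed_properties)
```

    For example, with an allow list of ["useSSL"], a MySQL URL carrying autoDeserialize=true would be rejected under this reading.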
  • druid-0.20.1(Jan 29, 2021)

  • druid-0.20.0(Oct 17, 2020)

    Apache Druid 0.20.0 contains around 160 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

    # New Features

    # Ingestion

    # Combining InputSource

    A new combining InputSource has been added, allowing the user to combine multiple input sources during ingestion. Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#combining-input-source for more details.

    https://github.com/apache/druid/pull/10387

    # Automatically determine numShards for parallel ingestion hash partitioning

    When hash partitioning is used in parallel batch ingestion, it is no longer necessary to specify numShards in the partition spec. Druid can now automatically determine a number of shards by scanning the data in a new ingestion phase that determines the cardinalities of the partitioning key.

    https://github.com/apache/druid/pull/10419

    # Subtask file count limits for parallel batch ingestion

    The size-based splitHintSpec now supports a new maxNumFiles parameter, which limits how many files can be assigned to individual subtasks in parallel batch ingestion.

    The segment-based splitHintSpec used for reingesting data from existing Druid segments also has a new maxNumSegments parameter which functions similarly.

    Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#split-hint-spec for more details.

    https://github.com/apache/druid/pull/10243

    # Task slot usage metrics

    New task slot usage metrics have been added. Please see the entries for the taskSlot metrics at https://druid.apache.org/docs/0.20.0/operations/metrics.html#indexing-service for more details.

    https://github.com/apache/druid/pull/10379

    # Compaction

    # Support for all partitioning schemes for auto-compaction

    A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new partitionsSpec property in the compaction tuningConfig for more details:

    https://druid.apache.org/docs/0.20.0/configuration/index.html#compaction-tuningconfig

    https://github.com/apache/druid/pull/10307

    # Auto-compaction status API

    A new coordinator API which shows the status of auto-compaction for a datasource has been added. The new API shows whether auto-compaction is enabled for a datasource, and a summary of how far compaction has progressed.

    The web console has also been updated to show this information:

    https://user-images.githubusercontent.com/177816/94326243-9d07e780-ff57-11ea-9f80-256fa08580f0.png

    Please see https://druid.apache.org/docs/latest/operations/api-reference.html#compaction-status for details on the new API, and https://druid.apache.org/docs/latest/operations/metrics.html#coordination for information on new related compaction metrics.

    https://github.com/apache/druid/pull/10371 https://github.com/apache/druid/pull/10438

    # Querying

    # Query segment pruning with hash partitioning

    Druid now supports query-time segment pruning (excluding certain segments as read candidates for a query) for hash partitioned segments. This optimization applies when all of the partitionDimensions specified in the hash partition spec during ingestion time are present in the filter set of a query, and the filters in the query filter on discrete values of the partitionDimensions (e.g., selector filters). Segment pruning with hash partitioning is not supported with non-discrete filters such as bound filters.

    For existing users with existing segments, you will need to reingest those segments to take advantage of this new feature, as the segment pruning requires a partitionFunction to be stored together with the segments, which does not exist in segments created by older versions of Druid. It is not necessary to specify the partitionFunction explicitly, as the default is the same partition function that was used in prior versions of Druid.

    Note that segments created with a default partitionDimensions value (partition by all dimensions + the time column) cannot be pruned in this manner; the segments need to be created with an explicit partitionDimensions.
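
    The idea behind this pruning can be sketched in a few lines. This is a simplified model, not Druid's code: the segment dictionaries, function names, and the use of zlib.crc32 as the hash are all stand-ins chosen for illustration (Druid stores its own partitionFunction with each segment).

```python
import zlib


def bucket_for(values, num_buckets):
    """Hash the partition-key values to a bucket (crc32 is a stand-in hash)."""
    key = "\x00".join(values).encode("utf-8")
    return zlib.crc32(key) % num_buckets


def prune_segments(segments, partition_dimensions, selector_filters):
    """Keep only segments whose bucket can contain the filtered values."""
    # Pruning needs explicit partition dimensions, all pinned by discrete filters.
    if not partition_dimensions or any(
        dim not in selector_filters for dim in partition_dimensions
    ):
        return list(segments)
    values = [selector_filters[dim] for dim in partition_dimensions]
    return [
        seg for seg in segments
        if seg["bucket"] == bucket_for(values, seg["num_buckets"])
    ]
```

    In this model, a filter that pins every partition dimension to one value narrows the candidates to a single bucket, while a filter that leaves any partition dimension unconstrained prunes nothing.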

    https://github.com/apache/druid/pull/9810 https://github.com/apache/druid/pull/10288

    # Vectorization

    To enable vectorization features, please set the druid.query.default.context.vectorizeVirtualColumns property to true or set the vectorize property in the query context. Please see https://druid.apache.org/docs/0.20.0/querying/query-context.html#vectorization-parameters for more information.

    # Vectorization support for expression virtual columns

    Expression virtual columns now have vectorization support (depending on the expressions being used), which can result in a 3-5x performance improvement in some cases.

    Please see https://druid.apache.org/docs/0.20.0/misc/math-expr.html#vectorization-support for details on the specific expressions that support vectorization.

    https://github.com/apache/druid/pull/10388 https://github.com/apache/druid/pull/10401 https://github.com/apache/druid/pull/10432

    # More vectorization support for aggregators

    Vectorization support has been added for several aggregation types: numeric min/max aggregators, variance aggregators, ANY aggregators, and aggregators from the druid-histogram extension.

    • https://github.com/apache/druid/pull/10260 - numeric min/max
    • https://github.com/apache/druid/pull/10304 - histogram
    • https://github.com/apache/druid/pull/10338 - ANY
    • https://github.com/apache/druid/pull/10390 - variance

    We've observed about a 1.3x to 1.8x performance improvement in some cases with vectorization enabled for the min, max, and ANY aggregators, and about 1.04x to 1.07x with the histogram aggregator.

    # offset parameter for GroupBy and Scan queries

    It is now possible to set an offset parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results. Please see https://druid.apache.org/docs/0.20.0/querying/limitspec.html and https://druid.apache.org/docs/0.20.0/querying/scan-query.html for details.
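
    A minimal sketch of how a client might build paged native queries with the new parameter. The helper names and the datasource and interval used in the test are hypothetical; the sketch assumes offset sits next to limit at the top level of a Scan query and inside the default limitSpec of a GroupBy query, as described in the linked pages.

```python
def scan_page(datasource, interval, page, page_size):
    """Build a native Scan query for one zero-based page of results."""
    return {
        "queryType": "scan",
        "dataSource": datasource,
        "intervals": [interval],
        "offset": page * page_size,  # rows to skip
        "limit": page_size,          # rows to return
    }


def group_by_limit_spec(page, page_size):
    """Build the limitSpec portion of a GroupBy query for one page."""
    return {"type": "default", "offset": page * page_size, "limit": page_size}
```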

    https://github.com/apache/druid/pull/10235 https://github.com/apache/druid/pull/10233

    # OFFSET clause for SQL queries

    Druid SQL queries now support an OFFSET clause. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#offset for details.

    https://github.com/apache/druid/pull/10279

    # Substring search operators

    Druid has added new substring search operators in its expression language and for SQL queries.

    Please see documentation for CONTAINS_STRING and ICONTAINS_STRING string functions for Druid SQL (https://druid.apache.org/docs/0.20.0/querying/sql.html#string-functions) and documentation for contains_string and icontains_string for the Druid expression language (https://druid.apache.org/docs/0.20.0/misc/math-expr.html#string-functions).

    We've observed about a 2.5x performance improvement in some cases by using these functions instead of STRPOS.
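
    For readers unfamiliar with these functions, the following Python model shows the intended semantics next to a STRPOS-based check. It is only an illustration of behavior, assuming STRPOS returns a 1-based position and 0 when the substring is absent; it says nothing about how Druid evaluates the functions.

```python
def contains_string(value, needle):
    """Case-sensitive substring test, modelling contains_string."""
    return needle in value


def icontains_string(value, needle):
    """Case-insensitive substring test, modelling icontains_string."""
    return needle.lower() in value.lower()


def strpos(value, needle):
    """Model of STRPOS: 1-based position of needle, or 0 if absent."""
    return value.find(needle) + 1
```

    Under this model, a filter written as STRPOS(col, 'x') > 0 expresses the same condition as CONTAINS_STRING(col, 'x').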

    https://github.com/apache/druid/pull/10350

    # UNION ALL operator for SQL queries

    Druid SQL queries now support the UNION ALL operator, which fuses the results of multiple queries together. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#union-all for details on what query shapes are supported by this operator.

    https://github.com/apache/druid/pull/10324

    # Cluster-wide default query context settings

    It is now possible to set cluster-wide default query context properties by adding a configuration of the form druid.query.default.context.*, with * replaced by the property name.
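
    Conceptually, the cluster-wide defaults act as a base layer underneath each query's own context. The sketch below models that layering under the assumption that a value set in the query context takes precedence over the cluster default; the function name is hypothetical.

```python
def effective_context(cluster_defaults, query_context=None):
    """Merge cluster-wide defaults with a query's context; the query wins."""
    merged = dict(cluster_defaults)
    merged.update(query_context or {})
    return merged
```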

    https://github.com/apache/druid/pull/10208

    # Other features

    # Improved retention rules UI

    The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.

    https://github.com/apache/druid/pull/10226

    # Redis cache extension enhancements

    The Redis cache extension now supports Redis Cluster, selecting which database is used, connecting to password-protected servers, and period-style configurations for the expiration and timeout properties.

    https://github.com/apache/druid/pull/10240

    # Disable sending server version in response headers

    It is now possible to disable sending of server version information in Druid's response headers.

    This is controlled by a new property druid.server.http.sendServerVersion, which defaults to true.

    https://github.com/apache/druid/pull/9832

    # Specify byte-based configuration properties with units

    Druid now supports units for specifying byte-based configuration properties, e.g.:

    druid.server.maxSize=300g
    

    equivalent to

    druid.server.maxSize=300000000000
    
    

    Please see https://druid.apache.org/docs/0.20.0/configuration/human-readable-byte.html for more details.
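
    To make the equivalence above concrete, here is a simplified parser for such values. It is a sketch rather than Druid's parser: it assumes case-insensitive units, decimal multipliers for k/m/g/t/p, and binary multipliers when the unit is followed by "i" (for example KiB); check the linked page for the exact rules.

```python
import re

_DECIMAL = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12, "p": 10**15}
_PATTERN = re.compile(r"^(\d+)([kmgtp]?)(i?)b?$", re.IGNORECASE)


def parse_bytes(text):
    """Parse values such as '300g' or '1KiB' into a number of bytes."""
    match = _PATTERN.match(text.strip())
    if not match:
        raise ValueError("unrecognized size: %r" % text)
    number, unit, binary = match.groups()
    unit = unit.lower()
    if binary:
        if not unit:
            raise ValueError("unrecognized size: %r" % text)
        # Binary units: Ki = 1024, Mi = 1024**2, and so on.
        return int(number) * 1024 ** ("kmgtp".index(unit) + 1)
    return int(number) * _DECIMAL[unit]
```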

    https://github.com/apache/druid/pull/10203

    # Bug fixes

    # Fix query correctness issue when historical has no segment timeline

    Druid 0.20.0 fixes a query correctness issue when a broker issues a query expecting a historical to have certain segments for a datasource, but the historical when queried does not actually have any segments for that datasource (e.g., they were all unloaded before the historical processed the query). Prior to 0.20.0, the query would return successfully but without the results from the missing segments. In 0.20.0, queries now fail in such situations.

    https://github.com/apache/druid/pull/10199

    # Fix issue preventing result-level cache from being populated

    Druid 0.20.0 fixes an issue introduced in 0.19.0 (https://github.com/apache/druid/issues/10337) which can prevent query caches from being populated when result-level caching is enabled.

    https://github.com/apache/druid/pull/10341

    # Fix for variance aggregator ordering

    The variance aggregator previously used an incorrect comparator that compared using an aggregator's internal count variable instead of the variance.

    https://github.com/apache/druid/pull/10340

    # Fix incorrect caching for groupBy queries with limit specs

    Druid 0.20.0 fixes an issue with groupBy queries and caching, where the limitSpec of the query was not considered in the cache key, leading to potentially incorrect results if queries that are identical except for the limitSpec are issued.
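
    The class of bug is easy to reproduce in miniature. The sketch below is not Druid's cache key code; it uses a hypothetical key function over a query dictionary to show why leaving the limitSpec out of the key makes two different queries collide.

```python
import hashlib
import json


def cache_key(query, include_limit_spec=True):
    """Derive a cache key from a query; optionally ignore its limitSpec."""
    parts = {
        name: value for name, value in query.items()
        if include_limit_spec or name != "limitSpec"
    }
    encoded = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

    When the limitSpec is ignored, a query limited to 10 rows and one limited to 100 rows share a key, so the second can be served the first one's cached result.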

    https://github.com/apache/druid/pull/10093

    # Fix for stringFirst and stringLast with rollup enabled

    https://github.com/apache/druid/issues/7243 has been resolved; the stringFirst and stringLast aggregators no longer cause an exception when used during ingestion with rollup enabled.

    https://github.com/apache/druid/pull/10332

    # Upgrading to Druid 0.20.0

    Please be aware of the following considerations when upgrading from 0.19.0 to 0.20.0. If you're updating from an earlier version than 0.19.0, please see the release notes of the relevant intermediate versions.

    # Default maxSize

    druid.server.maxSize will now default to the sum of maxSize values defined within the druid.segmentCache.locations. The user can still provide a custom value for druid.server.maxSize which will take precedence over the default value.
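
    The new default can be modelled as follows. This is a sketch of the rule stated above, with a hypothetical function name; it assumes each entry of druid.segmentCache.locations carries a numeric maxSize.

```python
def effective_max_size(locations, configured_max_size=None):
    """Return the configured maxSize, or the sum over segment cache locations."""
    if configured_max_size is not None:
        # An explicit druid.server.maxSize takes precedence.
        return configured_max_size
    return sum(location["maxSize"] for location in locations)
```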

    https://github.com/apache/druid/pull/10255

    # Compaction and kill task ID changes

    Compaction and kill tasks issued by the coordinator will now have their task IDs prefixed by coordinator-issued, while user-issued kill tasks will be prefixed by api-issued.
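
    Tooling that inspects task IDs can use the new prefixes to tell the two origins apart. A minimal sketch, with a hypothetical function name and made-up task IDs in the test:

```python
def task_origin(task_id):
    """Classify a task ID by the prefixes introduced in 0.20.0."""
    if task_id.startswith("coordinator-issued"):
        return "coordinator"
    if task_id.startswith("api-issued"):
        return "api"
    return "other"
```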

    https://github.com/apache/druid/pull/10278

    # New size limits for parallel ingestion split hint specs

    The size-based and segment-based splitHintSpec for parallel batch ingestion now apply a default file/segment limit of 1000 per subtask, controlled by the maxNumFiles and maxNumSegments parameters, respectively.
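
    The effect of a per-subtask file limit can be sketched with a simple greedy grouping. This is an illustration of the idea only, not Druid's splitting algorithm; the function name and the max_split_size parameter are assumptions, and only the default of 1000 files comes from the note above.

```python
def split_files(file_sizes, max_split_size, max_num_files=1000):
    """Group files into splits bounded by total bytes and by file count."""
    splits, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Close the current split if adding this file would break a limit.
        full = current and (
            len(current) >= max_num_files
            or current_bytes + size > max_split_size
        )
        if full:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits
```

    With many small files, the count limit rather than the byte limit ends up deciding how many subtasks are created.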

    https://github.com/apache/druid/pull/10243

    # New PostAggregator and AggregatorFactory methods

    Users who have developed an extension with custom PostAggregator or AggregatorFactory implementations will need to update their extensions, as these two interfaces have new methods defined in 0.20.0.

    PostAggregator now has a new method:

      ValueType getType();
    

    To support type information on PostAggregator, AggregatorFactory also has 2 new methods:

      public abstract ValueType getType();
    
      public abstract ValueType getFinalizedType();
    

    Please see https://github.com/apache/druid/pull/9638 for more details on the interface changes.

    # New Expr-related methods

    Users who have developed an extension with custom Expr implementations will need to update their extensions, as Expr and related interfaces have changed in 0.20.0. Please see the PR below for details:

    https://github.com/apache/druid/pull/10401

    # More accurate query/cpu/time metric

    In 0.20.0, the accuracy of the query/cpu/time metric has been improved. Previously, it did not account for certain portions of work during query processing, described in more detail in the following PR:

    https://github.com/apache/druid/pull/10377

    # New audit log service metric columns

    If you are using audit logging, please be aware that new columns have been added to the audit log service metric (comment, remote_address, and created_date). An optional payload column has also been added, which can be enabled by setting druid.audit.manager.includePayloadAsDimensionInMetric to true.

    https://github.com/apache/druid/pull/10373

    # sqlQueryContext in request logs

    If you are using query request logging, the request log events will now include the sqlQueryContext for SQL queries.

    https://github.com/apache/druid/pull/10368

    # Additional per-segment state in metadata store

    Hash-partitioned segments created by Druid 0.20.0 will now have additional partitionFunction data in the metadata store.

    Additionally, compaction tasks will now store additional per-segment information in the metadata store, used for tracking compaction history.

    https://github.com/apache/druid/pull/10288 https://github.com/apache/druid/pull/10413

    # Known issues

    # druid.segmentCache.locationSelectorStrategy injection failure

    Specifying a value for druid.segmentCache.locationSelectorStrategy prevents services from starting due to an injection error. Please see https://github.com/apache/druid/issues/10348 for more details.

    # Resource leak in web console data sampler

    When a timeout occurs while sampling data in the web console, internal resources created to read from the input source are not properly closed. Please see https://github.com/apache/druid/pull/10467 for more information.

    # Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @abhishekagarwal87 @abhishekrb19 @ArvinZheng @belugabehr @capistrant @ccaominh @clintropolis @code-crusher @dylwylie @fermelone @FrankChen021 @gianm @himanshug @jihoonson @jon-wei @josephglanville @joykent99 @kroeders @lightghli @lkm @mans2singh @maytasm @medb @mghosh4 @nishantmonu51 @pan3793 @richardstartin @sthetland @suneet-s @tarunparackal @tdt17 @tourvi @vogievetsky @wjhypo @xiangqiao123 @xvrl

  • druid-0.19.0(Jul 21, 2020)

    Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

    # New Features

    # GroupBy and Timeseries vectorized query engines enabled by default

    Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but with this change any query that is eligible for vectorization will be vectorized. If you encounter any problems, you can still disable this feature by setting druid.query.vectorize to false.

    https://github.com/apache/druid/pull/10065

    # Druid native batch support for Apache Avro Object Container Files

    New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster. Check out the docs for more details.

    https://github.com/apache/druid/pull/9671

    # Updated Druid native batch support for SQL databases

    An 'SqlInputSource' has been added in Druid 0.19.0 to work with the new native batch ingestion specifications first introduced in Druid 0.17, deprecating the SqlFirehose. Like the 'SqlFirehose' it currently supports MySQL and PostgreSQL, using the driver from those extensions. This is a relatively low level ingestion task, and the operator must take care to manually ensure that the correct data is ingested, either by specially crafting queries to ensure no duplicate data is ingested for appends, or ensuring that the entire set of data is queried to be replaced when overwriting. See the docs for more operational details.

    https://github.com/apache/druid/pull/9449

    # Apache Ranger based authorization

    A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by Apache Ranger. Please see the extension documentation (https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and Authentication and Authorization for more information on the basic facilities this extension provides.

    https://github.com/apache/druid/pull/9579

    # Alibaba Object Storage Service support

    A new 'contrib' extension has been added for Alibaba Cloud Object Storage Service (OSS) to provide both deep storage and usage as a batch ingestion input source. Since this is a 'contrib' extension, it will not be packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.

    https://github.com/apache/druid/pull/9898

    # Ingestion worker autoscaling for Google Compute Engine

    Another 'contrib' extension new in 0.19.0 has been added to support ingestion worker autoscaling, which allows a Druid Overlord to provision or terminate worker instances (MiddleManagers or Indexers) whenever there are pending tasks or idle workers, for Google Compute Engine. Unlike the Amazon Web Services ingestion autoscaling extension, which provisions and terminates instances directly without using an Auto Scaling Group, the GCE autoscaler uses Managed Instance Groups to more closely align with how operators are likely to provision their clusters in GCE. Like other 'contrib' extensions, it will not be packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.

    https://github.com/apache/druid/pull/8987

    # REGEXP_LIKE

    A new REGEXP_LIKE function has been added to Druid SQL and native expressions. It behaves similarly to LIKE, except that it uses regular expressions for the pattern.
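
    As a rough model of the semantics, the sketch below uses Python's re module. It assumes the pattern may match anywhere inside the value unless anchored with ^ and $, and it glosses over differences between Java and Python regular expression syntax; the function name simply mirrors the SQL function.

```python
import re


def regexp_like(value, pattern):
    """Return True if the pattern matches anywhere inside the value."""
    return re.search(pattern, value) is not None
```

    To require a match on the whole string under this model, anchor the pattern, for example "^dr.*d$".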

    https://github.com/apache/druid/pull/9893

    # Web console lookup management improvements

    The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form to accept user input, rather than a raw text editor to enter the JSON spec.

    Additionally, clicking the magnifying glass icon next to a lookup will now allow displaying the first 5000 values of that lookup.

    https://github.com/apache/druid/pull/9549 https://github.com/apache/druid/pull/9587

    # New Coordinator per datasource 'loadstatus' API

    A new Coordinator API makes it easier to determine if the latest published segments are available for querying. This is similar to the existing Coordinator 'loadstatus' API, but it is datasource specific, can be limited to an interval, and can optionally refresh the metadata store snapshot live to get the most up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially if forcing a metadata refresh, as it can potentially be a 'heavy' call on large clusters.

    https://github.com/apache/druid/pull/9965

    # Native batch append support for range and hash partitioning

    Part bug fix, part new feature: Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with the 'hash' or 'range' partitioning algorithms. Note that the appended segments currently only support 'dynamic' partitioning, and that after a rollback to an older version these appended segments will not be recognized by Druid. Before rolling back to a previous version, compact these appended segments with the rest of the time chunk so that the time chunk has a homogeneous partitioning scheme.

    https://github.com/apache/druid/pull/10033

    # Bug fixes

    Druid 0.19.0 contains 65 bug fixes. You can see the complete list here.

    # Fix for batch ingested 'dynamic' partitioned segments not becoming queryable atomically

    Druid 0.19.0 fixes an important query correctness issue, where 'dynamic' partitioned segments produced by a batch ingestion task were not tracking the overall number of partitions. This had the implication that when these segments came online, they did not do so as a complete set, but rather as individual segments, meaning that there would be periods of swapping where results could be queried from an incomplete partition set within a time chunk.

    https://github.com/apache/druid/pull/10025

    # Fix to allow 'hash' and 'range' partitioned segments with empty buckets to now be queryable

    Prior to 0.19.0, Druid had a bug when using hash or range partitioning: if data skew left any of the buckets 'empty' after ingestion, the partitions would never be recognized as 'complete' and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments no longer queryable.

    https://github.com/apache/druid/pull/10012

    # Incorrect balancer behavior

    A bug in Druid versions prior to 0.19.0 allowed (incorrect) coordinator operation when druid.server.maxSize was not set. If none of the historicals had this value set, segments would still load but would effectively be balanced randomly across the cluster, regardless of which balancer strategy was actually configured. This bug has been fixed, but as a result druid.server.maxSize must be set to the sum of the segment cache location sizes for historicals, or else they will not load segments.

    https://github.com/apache/druid/pull/10070

    # Upgrading to Druid 0.19.0

    Please be aware of the following issues when upgrading from 0.18.1 to 0.19.0. If you're updating from an earlier version than 0.18.1, please see the release notes of the relevant intermediate versions.

    # 'druid.server.maxSize' must now be set for Historical servers

    As a side effect of a Coordinator bug fix, druid.server.maxSize must now be set for segments to be loaded. While this value should have been set correctly in previous versions, please make sure it is configured correctly before upgrading your clusters, or else segments will not be loaded.

    https://github.com/apache/druid/pull/10070

    # System tables 'sys.segments' column 'payload' has been removed and replaced with 'dimensions', 'metrics', and 'shardSpec'

    The removal of the 'payload' column from the sys.segments table should make queries on this table much more efficient. The most useful fields from the payload, namely the list of 'dimensions', the list of 'metrics', and the 'shardSpec', have been split out into their own columns and so are still available.

    https://github.com/apache/druid/pull/9883

    # Changed default number of segment loading threads

    The default value of the druid.segmentCache.numLoadingThreads configuration has changed from 'number of cores' to 'number of cores' divided by 6. This should make Historicals better behaved out of the box when loading a large number of segments, limiting the impact on query performance.
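
    The new default can be approximated as below. This is only a sketch: the use of integer division and the floor of one thread are assumptions, not something stated in these notes, and default_num_loading_threads is a hypothetical helper name.

```python
import os


def default_num_loading_threads(cores):
    """Approximate the new default: number of cores divided by 6.

    Integer division and the minimum of 1 are assumptions, so that small
    machines still end up with one loading thread.
    """
    return max(1, cores // 6)


cores = os.cpu_count() or 1
print(f"{cores} cores -> {default_num_loading_threads(cores)} loading thread(s)")
```

    If the previous behavior is preferred, the property can still be set explicitly instead of relying on the default.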

    https://github.com/apache/druid/pull/9856

    # Broadcast load rules no longer have 'colocated datasources'

    A number of incomplete changes to facilitate more efficient join queries, based on the idea of utilizing broadcast load rules to propagate smaller datasources among the cluster so that join operations can be pushed down to individual segment processing, have been added to 0.19.0. While not a finished feature yet, as part of the changes to make this happen, 'broadcast' load rules no longer have the concept of 'colocated datasources', which would attempt to only broadcast segments to servers that had segments of the configured datasource. This didn't work so well in practice, as it was non-atomic, meaning that the broadcast segments would lag behind loads and drops of the colocated datasource, so we decided to remove it.

    https://github.com/apache/druid/pull/9971

    # Brokers and realtime tasks may now be configured to load segments from 'broadcast' datasources

    As another effect of the aforementioned preliminary work to introduce efficient 'broadcast joins', Brokers and realtime indexing tasks will now load segments loaded by 'broadcast' rules, if a segment cache is configured. Since the feature is not complete, there is little reason to do this in 0.19.0, and it will not happen unless explicitly configured.

    https://github.com/apache/druid/pull/9971

    # lpad and rpad function behavior change

    The lpad and rpad functions have gone through a slight behavior change in Druid's default non-SQL-compatible mode, in order to make them behave consistently with PostgreSQL. In the new behavior, if the pad expression is an empty string, then the result will be the (possibly trimmed) original characters, rather than the empty string being treated as a null and coercing the results to null.
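
    The behavior described above can be modeled with the short Python sketch below. It mimics PostgreSQL-style padding semantics for illustration only; it is not Druid's implementation, and details such as the handling of non-positive lengths are assumptions.

```python
def _padding(pad, fill):
    """Repeat `pad` until it is exactly `fill` characters long."""
    return (pad * (fill // len(pad) + 1))[:fill]


def lpad(s, length, pad):
    """Left-pad `s` to `length` characters, PostgreSQL style."""
    if length <= 0:
        return ""
    if len(s) >= length:
        return s[:length]  # too long: trim on the right
    if pad == "":
        return s  # empty pad: original characters, no null coercion
    return _padding(pad, length - len(s)) + s


def rpad(s, length, pad):
    """Right-pad `s` to `length` characters, PostgreSQL style."""
    if length <= 0:
        return ""
    if len(s) >= length:
        return s[:length]
    if pad == "":
        return s
    return s + _padding(pad, length - len(s))


print(lpad("druid", 8, ""))  # empty pad string keeps the original characters
print(lpad("druid", 3, ""))  # empty pad string with trimming
```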

    https://github.com/apache/druid/pull/10006

    # Extensions providing custom Druid expressions are now expected to implement equals and hashCode methods

    A change to the Expr interface in Druid 0.19.0 requires that any extension which provides custom expressions via ExprMacroTable must also implement equals and hashCode methods to function correctly, especially with JOIN queries, which rely on filter and expression analysis for determining how to optimally process a query.

    https://github.com/apache/druid/pull/9830

    # Known Issues

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @2bethere @a-chumagin @a2l007 @abhishekrb19 @agricenko @ahuret @alex-plekhanov @AlexanderSaydakov @awelsh93 @bolkedebruin @calvinhkf @capistrant @ccaominh @chenyuzhi459 @clintropolis @damnMeddlingKid @danc @dylwylie @egor-ryashin @FrankChen021 @frnidito @Fullstop000 @gianm @harshpreet93 @jihoonson @jon-wei @josephglanville @kamaci @kanibs @leerho @liujianhuanzz @maytasm @mcbrewster @mghosh4 @morrifeldman @pjain1 @samarthjain @stefanbirkner @sthetland @suneet-s @surekhasaharan @tarpdalton @viongpanzi @vogievetsky @willsalz @wjhypo @xhl0726 @xiangqiao123 @xvrl @yuanlihan @zachjsh

  • druid-0.18.1(May 14, 2020)

    Apache Druid 0.18.1 is a bug fix release that fixes, among other things, a streaming ingestion failure with Avro, an ingestion performance issue, and an upgrade issue with HLLSketch. The complete list of bug fixes can be found at https://github.com/apache/druid/pulls?q=is%3Apr+milestone%3A0.18.1+label%3ABug+is%3Aclosed.

    # Bug fixes

    • https://github.com/apache/druid/pull/9823 rolls back the new Kinesis lag metrics, as they can stall the Kinesis supervisor indefinitely with a large number of shards.
    • https://github.com/apache/druid/pull/9734 fixes the streaming ingestion failure that occurs when you use a data format other than CSV or JSON.
    • https://github.com/apache/druid/pull/9812 fixes filtering on boolean values during transformation.
    • https://github.com/apache/druid/pull/9723 fixes slow ingestion performance due to frequent flushes on the local file system.
    • https://github.com/apache/druid/pull/9751 reverts the version of datasketches-java from 1.2.0 to 1.1.0 to work around an upgrade failure with HLLSketch.
    • https://github.com/apache/druid/pull/9698 fixes a bug in inline subquery with multi-valued dimension.
    • https://github.com/apache/druid/pull/9761 fixes a bug in CloseableIterator which potentially leads to resource leaks in Data loader.

    # Known issues

    Incorrect result of nested groupBy query on Join of subqueries

    A nested groupBy query can result in an incorrect result when it is on top of a Join of subqueries and the inner and the outer groupBys have different filters. See https://github.com/apache/druid/issues/9866 for more details.

    # Credits

    Thanks to everyone who contributed to this release!

    @clintropolis @gianm @jihoonson @maytasm @suneet-s @viongpanzi @whutjs

  • druid-0.18.0(Apr 20, 2020)

    Apache Druid 0.18.0 contains over 200 new features, performance enhancements, bug fixes, and major documentation improvements from 42 contributors. Check out the complete list of changes and everything tagged to the milestone.

    # New Features

    # Join support

    Join is a key operation in data analytics. Prior to 0.18.0, Druid supported some join-related features, such as Lookups or semi-joins in SQL. However, the use cases for those features were pretty limited and, for other join use cases, users had to denormalize their datasources when they ingest data instead of joining them at query time, which could result in exploding data volume and long ingestion time.

    Druid 0.18.0 supports real joins for the first time ever in its history. Druid supports INNER, LEFT, and CROSS joins for now. For native queries, the join datasource has been newly introduced to represent a join of two datasources. Currently, only the left-deep join is allowed. That means, only a table or another join datasource is allowed for the left datasource. For the right datasource, lookup, inline, or query datasources are allowed. Note that join of Druid datasources is not supported yet. There should be only one table datasource in the same join query.
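
    To make the left-deep restriction concrete, here is a small Python sketch that builds a native join datasource as a dictionary and checks it against the rules listed above. The field names (left, right, rightPrefix, condition, joinType) follow the native join datasource documentation as best understood here and should be verified against it; the table and lookup names are made up, and validate_join is a hypothetical helper, not part of Druid.

```python
import json

LEFT_TYPES = {"table", "join"}              # allowed on the left side
RIGHT_TYPES = {"lookup", "inline", "query"}  # allowed on the right side


def validate_join(ds):
    """Raise ValueError if a join datasource breaks the left-deep rules."""
    if ds["type"] != "join":
        return
    if ds["left"]["type"] not in LEFT_TYPES:
        raise ValueError("left side must be a table or another join")
    if ds["right"]["type"] not in RIGHT_TYPES:
        raise ValueError("right side must be a lookup, inline, or query datasource")
    validate_join(ds["left"])  # the left side may itself be a join


# Hypothetical join of a table with a lookup.
join_datasource = {
    "type": "join",
    "left": {"type": "table", "name": "wikipedia"},
    "right": {"type": "lookup", "lookup": "countries"},
    "rightPrefix": "r.",
    "condition": 'countryIsoCode == "r.k"',
    "joinType": "INNER",
}

validate_join(join_datasource)
print(json.dumps(join_datasource, indent=2))
```

    Because the right side can never be a table and the left side bottoms out in one, these two checks already imply that a valid join tree contains exactly one table datasource.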

    Druid SQL also supports joins. Under the covers, SQL join queries are translated into one or several native queries that include join datasources. See Query translation for more details of SQL translation and best practices to write efficient queries.

    When a join query is issued, the Broker first evaluates all datasources except for the base datasource which is the only table datasource in the query. The evaluation can include executing subqueries for query datasources. Once the Broker evaluates all non-base datasources, it replaces them with inline datasources and sends the rewritten query to data nodes (see the below "Query inlining in Brokers" section for more details). Data nodes use the hash join to process join queries. They build a hash table for each non-primary leaf datasource unless it already exists. Note that only lookup datasource currently has a pre-built hash table. See Query execution for more details about join query execution.

    Joins can affect performance of your queries. In general, any queries including joins can be slower than equivalent queries against a denormalized datasource. The LOOKUP function could perform better than joins with lookup datasources. See Join performance for more details about join query performance and future plans for performance improvement.

    https://github.com/apache/druid/pull/8728 https://github.com/apache/druid/pull/9545 https://github.com/apache/druid/pull/9111

    # Query inlining in Brokers

    Druid is now able to execute a nested query by inlining subqueries. Any type of subquery can be on top of any type of another, such as in the following example:

                 topN
                   |
           (join datasource)
             /          \
    (table datasource)  groupBy
    

    To execute this query, the Broker first evaluates the leaf groupBy subquery; it sends the subquery to data nodes and collects the result. The collected result is materialized in the Broker memory. Once the Broker collects all results for the groupBy query, it rewrites the topN query by replacing the leaf groupBy with an inline datasource which has the result of the groupBy query. Finally, the rewritten query is sent to data nodes to execute the topN query.
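
    The rewrite step can be sketched in Python as follows. This is a simplified model, not the Broker's code: run_subquery stands in for sending a query to data nodes, the default row limit is a placeholder that plays the role of druid.server.http.maxSubqueryRows, and the inline datasource shape (columnNames, rows) should be checked against the datasource documentation.

```python
def inline_subqueries(query, run_subquery, max_subquery_rows=100_000):
    """Return a copy of `query` whose 'query' datasources are replaced by 'inline' ones."""
    rewritten = _inline(query["dataSource"], run_subquery, max_subquery_rows)
    return dict(query, dataSource=rewritten)


def _inline(ds, run_subquery, max_rows):
    if ds["type"] == "join":
        return dict(
            ds,
            left=_inline(ds["left"], run_subquery, max_rows),
            right=_inline(ds["right"], run_subquery, max_rows),
        )
    if ds["type"] == "query":
        # Evaluate the innermost subqueries first, then this one.
        subquery = inline_subqueries(ds["query"], run_subquery, max_rows)
        columns, rows = run_subquery(subquery)
        if len(rows) > max_rows:
            raise RuntimeError("subquery result exceeds the row limit")
        # Materialize the collected result as an inline datasource.
        return {"type": "inline", "columnNames": list(columns), "rows": [list(r) for r in rows]}
    return ds  # table, lookup, and inline datasources are left untouched


# Hypothetical topN over a join of a table and a groupBy subquery.
top_n = {
    "queryType": "topN",
    "dataSource": {
        "type": "join",
        "left": {"type": "table", "name": "events"},
        "right": {
            "type": "query",
            "query": {"queryType": "groupBy", "dataSource": {"type": "table", "name": "users"}},
        },
    },
}


def fake_run(subquery):
    """Stand-in for data nodes: returns (column names, rows)."""
    return ["user", "cnt"], [["alice", 3], ["bob", 1]]


print(inline_subqueries(top_n, fake_run)["dataSource"]["right"])
```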

    # Query laning and prioritization

    When you run multiple queries of heterogeneous workloads at a time, you may sometimes want to control the resource commitment for a query based on its priority. For example, you may want to limit the resources assigned to less important queries, so that important queries can be executed in time without being disrupted by less important ones.

    Query laning allows you to control capacity utilization for heterogeneous query workloads. With laning, the broker examines and classifies a query for the purpose of assigning it to a 'lane'. Lanes have capacity limits, enforced by the Broker, that can be used to ensure sufficient resources are available for other lanes or for interactive queries (with no lane), or to limit overall throughput for queries within the lane.

    Automatic query prioritization determines the query priority based on the configured strategy. The threshold-based prioritization strategy has been added; it automatically lowers the priority of queries that cross any of a configurable set of thresholds, such as how far in the past the data is, how large of an interval a query covers, or the number of segments taking part in a query.
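
    A threshold-based strategy of this kind could look roughly like the Python sketch below. The parameter names (period_threshold, duration_threshold, segment_count_threshold, adjustment) are illustrative assumptions rather than the exact configuration keys, and the logic is a simplified model of the behavior described above.

```python
from datetime import datetime, timedelta, timezone


def adjusted_priority(priority, interval_start, interval_end, segment_count, now,
                      period_threshold=None, duration_threshold=None,
                      segment_count_threshold=None, adjustment=1):
    """Lower `priority` by `adjustment` if the query crosses any configured threshold."""
    crossed = (
        # The query reaches further into the past than allowed.
        (period_threshold is not None and now - interval_start > period_threshold)
        # The query covers a longer interval than allowed.
        or (duration_threshold is not None and interval_end - interval_start > duration_threshold)
        # The query touches more segments than allowed.
        or (segment_count_threshold is not None and segment_count > segment_count_threshold)
    )
    return priority - adjustment if crossed else priority


now = datetime(2020, 4, 20, tzinfo=timezone.utc)
start = now - timedelta(days=90)

# A query over the last 90 days crosses a 30-day period threshold.
print(adjusted_priority(0, start, now, segment_count=10, now=now,
                        period_threshold=timedelta(days=30), adjustment=5))
```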

    See Query prioritization and laning for more details.

    https://github.com/apache/druid/issues/6993 https://github.com/apache/druid/pull/9407 https://github.com/apache/druid/pull/9493

    New dimension in query metrics

    Since a native query containing subqueries can be executed part-by-part, a new subQueryId has been introduced. Each subquery has a different subQueryId but the same queryId. The subQueryId is available as a new dimension in query metrics.

    New configuration

    A new druid.server.http.maxSubqueryRows configuration controls the maximum number of rows materialized in the Broker memory.

    Please see Query execution for more details.

    https://github.com/apache/druid/pull/9533

    # SQL grouping sets

    GROUPING SETS is now supported, allowing you to combine multiple GROUP BY clauses into one GROUP BY clause. This GROUPING SETS clause is internally translated into the groupBy query with subtotalsSpec. The LIMIT clause is now applied after subtotalsSpec, rather than applied to each grouping set.
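
    The semantics can be illustrated with a small in-memory model in Python. It is not how Druid executes the query; it only shows that each grouping set produces its own subtotal rows and that the limit is applied once to the combined result. The column names are made up, and the row order is simply the order of evaluation in this sketch.

```python
from collections import defaultdict

# Roughly equivalent SQL (hypothetical table and columns):
#   SELECT country, city, SUM(added) FROM wikipedia
#   GROUP BY GROUPING SETS ((country, city), (country), ())
#   LIMIT 3


def grouping_sets(rows, sets, metric, limit=None):
    """Sum `metric` for every grouping set, then apply `limit` to the combined rows."""
    all_dims = []
    for dims in sets:
        for dim in dims:
            if dim not in all_dims:
                all_dims.append(dim)

    results = []
    for dims in sets:
        totals = defaultdict(int)
        for row in rows:
            totals[tuple(row[d] for d in dims)] += row[metric]
        for key, total in totals.items():
            out = {d: None for d in all_dims}  # dimensions outside the set are null
            out.update(zip(dims, key))
            out[metric] = total
            results.append(out)

    # The limit applies after all subtotals, not per grouping set.
    return results if limit is None else results[:limit]


rows = [
    {"country": "US", "city": "SF", "added": 1},
    {"country": "US", "city": "NY", "added": 2},
]
sets = [("country", "city"), ("country",), ()]

for line in grouping_sets(rows, sets, "added", limit=3):
    print(line)
```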

    https://github.com/apache/druid/pull/9122

    # SQL Dynamic parameters

    Druid now supports dynamic parameters for SQL. To use dynamic parameters, replace any literal in the query with a question mark (?) character. These question marks represent the places where the parameters will be bound at execution time. See SQL dynamic parameters for more details.
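
    As an illustration, the Python sketch below builds a JSON request body for a parameterized SQL query without sending it. The "parameters" list of type/value pairs follows the SQL API documentation as best understood here and should be verified against it; the table and column names are made up, and sql_request is a hypothetical helper.

```python
import json


def sql_request(query, params):
    """Build a JSON body for a SQL query with positional (?) parameters.

    `params` is a list of (sql_type, value) pairs, bound in order.
    """
    # Naive placeholder count: it would also count '?' inside string literals.
    if query.count("?") != len(params):
        raise ValueError("number of '?' placeholders and parameters differ")
    return json.dumps({
        "query": query,
        "parameters": [{"type": sql_type, "value": value} for sql_type, value in params],
    })


body = sql_request(
    "SELECT COUNT(*) FROM wikipedia WHERE channel = ? AND added > ?",
    [("VARCHAR", "#en.wikipedia"), ("INTEGER", 100)],
)
print(body)
```

    The resulting body would typically be sent in a POST request to the SQL endpoint of a Broker or Router; the endpoint and transport are left out of the sketch on purpose.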

    https://github.com/apache/druid/pull/6974

    # Important Changes

    # applyLimitPushDownToSegments is disabled by default

    applyLimitPushDownToSegments was added in 0.17.0 to push down limit evaluation to queryable nodes, limiting results during segment scan for groupBy v2. This can lead to performance degradation, as reported in https://github.com/apache/druid/issues/9689, if many segments are involved in query processing. This is because “limit push down to segment scan” initializes an aggregation buffer per segment, the overhead for which is not negligible. Enable this configuration only if your query involves a relatively small number of segments per historical or realtime task.

    https://github.com/apache/druid/pull/9711

    # Roaring bitmaps as default

    Druid supports two bitmap types, i.e., Roaring and CONCISE. Since Roaring bitmaps provide a better out-of-box experience (faster query speed in general), the default bitmap type is now switched to Roaring bitmaps. See Segment compression for more details about bitmaps.

    https://github.com/apache/druid/pull/9548

    # Complex metrics behavior change at ingestion time when SQL-compatible null handling is disabled (default mode)

    When SQL-compatible null handling is disabled, the behavior of complex metric aggregation at ingestion time has now changed to be consistent with that at query time. The complex metrics are aggregated to the default 0 values for nulls instead of skipping them during ingestion.

    https://github.com/apache/druid/pull/9484

    # Array expression syntax change

    Druid expression now supports typed constructors for creating arrays. Arrays can be defined with an explicit type. For example, <LONG>[1, 2, null] creates an array of LONG type containing 1, 2, and null. Note that you can still create an array without an explicit type. For example, [1, 2, null] is still a valid syntax to create an equivalent array. In this case, Druid will infer the type of array from its elements. This new syntax applies to empty arrays as well. <STRING>[], <DOUBLE>[], and <LONG>[] will create an empty array of STRING, DOUBLE, and LONG type, respectively.
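
    For readers who generate expressions programmatically, the literal syntax can be produced with a helper such as the one below. It is a convenience sketch only: array_literal is a hypothetical function, it covers numeric and null elements, and it deliberately refuses string elements because string escaping rules are not covered in these notes.

```python
def array_literal(values, element_type=None):
    """Format a Druid expression array literal, optionally with an explicit type."""

    def fmt(value):
        if value is None:
            return "null"
        if isinstance(value, (int, float)):
            return repr(value)
        raise TypeError("only numeric and null elements are handled in this sketch")

    body = "[" + ", ".join(fmt(v) for v in values) + "]"
    # Without an explicit type, Druid infers the type from the elements.
    return body if element_type is None else f"<{element_type}>{body}"


print(array_literal([1, 2, None], "LONG"))
print(array_literal([1, 2, None]))
print(array_literal([], "STRING"))
```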

    https://github.com/apache/druid/pull/9367

    # Enabling pending segments cleanup by default

    The pendingSegments table in the metadata store is used to create unique new segment IDs for appending tasks such as Kafka/Kinesis indexing tasks or batch tasks of appending mode. Automatic pending segments cleanup was introduced in 0.12.0, but has been disabled by default prior to 0.18.0. This configuration is now enabled by default.

    https://github.com/apache/druid/pull/9385

    # Creating better input splits for native parallel indexing

    The Parallel task now can create better splits. Each split can contain multiple input files based on their size. Empty files will be ignored. The split size is controllable with the new split hint spec. See Split hint spec for more details.
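
    One simple way to picture size-based splits is a greedy packing, sketched below in Python. This is an assumption made for illustration: the actual algorithm used by the Parallel task may differ, and max_split_size is a hypothetical parameter name rather than the real split hint spec property.

```python
def make_splits(files, max_split_size):
    """Greedily pack (path, size_in_bytes) pairs into splits of at most `max_split_size`.

    Empty files are ignored. A single file larger than the limit gets its own split.
    """
    splits, current, current_size = [], [], 0
    for path, size in files:
        if size == 0:
            continue  # empty files are skipped
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits


files = [("a.json", 40), ("b.json", 0), ("c.json", 50), ("d.json", 30), ("e.json", 200)]
print(make_splits(files, max_split_size=100))
```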

    https://github.com/apache/druid/pull/9360 https://github.com/apache/druid/pull/9450

    # Transform is now an extension point

    Transform is an Interface that represents a transformation to be applied to each row at ingestion time. This interface is now an Extension point. Please see Writing your own extensions for how to add your custom Transform.

    https://github.com/apache/druid/pull/9319

    # chunkPeriod query context is removed

    chunkPeriod has been deprecated since 0.14.0 because of its limited usage (it was sometimes useful for only groupBy v1). This query context is now removed in 0.18.0.

    https://github.com/apache/druid/pull/9216

    # Experimental support for Java 11

    Druid now experimentally supports Java 11. You can run the same Druid binary distribution, which is compiled with Java 8, on Java 11. Our tests on Travis include:

    • Compiling and running unit tests with Java 11
    • Compiling with Java 8 and running integration tests with Java 11

    Performance testing results are not available yet.

    Warnings for illegal reflective accesses when running Druid with Java 11

    Since Java 9, the JVM issues a warning when a library uses reflection to illegally access internal APIs of the JDK. These warnings will be fixed by modifying Druid code or upgrading library versions in future releases. For now, these warnings can be suppressed by adding JVM options such as --add-opens or --add-exports. See JDK 11 Migration Guide for more details.

    Some of the warnings are:

    2020-01-22T21:30:08,893 WARN [main] org.apache.druid.java.util.metrics.AllocationMetricCollectors - Cannot initialize org.apache.druid.java.util.metrics.AllocationMetricCollector
    java.lang.reflect.InaccessibleObjectException: Unable to make public long[] com.sun.management.internal.HotSpotThreadImpl.getThreadAllocatedBytes(long[]) accessible: module jdk.management does not "exports com.sun.management.internal" to unnamed module @6955cb39
    

    This warning can be suppressed by adding --add-exports jdk.management/com.sun.management.internal=ALL-UNNAMED.

    2020-01-22T21:30:08,902 WARN [main] org.apache.druid.java.util.metrics.JvmMonitor - Cannot initialize GC counters. If running JDK11 and above, add --add-exports java.base/jdk.internal.perf=ALL-UNNAMED to the JVM arguments to enable GC counters.
    

    This warning can be suppressed by adding --add-exports java.base/jdk.internal.perf=ALL-UNNAMED.

    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by com.google.inject.internal.cglib.core.$ReflectUtils$1 to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
    WARNING: Please consider reporting this to the maintainers of com.google.inject.internal.cglib.core.$ReflectUtils$1
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    

    This warning can be suppressed by adding --add-opens java.base/java.lang=ALL-UNNAMED.

    https://github.com/apache/druid/pull/7306 https://github.com/apache/druid/pull/9491

    # New Extension

    # New Pac4j extension

    A new extension has been added in 0.18.0 to enable OpenID Connect based authentication for Druid processes. It can be used with any authentication server that supports OpenID Connect, e.g. Okta. This extension should only be used at the Router node to enable a group of users in an existing authentication server to interact with the Druid cluster using the web console.

    https://github.com/apache/druid/pull/8992

    # Security Issues

    # [CVE-2020-1958] Apache Druid LDAP injection vulnerability

    CVE-2020-1958 has been reported recently and fixed in 0.18.0 and 0.17.1. When LDAP authentication is enabled, callers of Druid APIs can bypass the credentialsValidator.userSearch filter barrier or retrieve any LDAP attribute values of users that exist on the LDAP server, so long as that information is visible to the Druid server. Please see the description in the link for more details. It is strongly recommended to upgrade to 0.18.0 or 0.17.1 if you are using LDAP authentication with Druid.

    https://github.com/apache/druid/pull/9600

    # Updating Kafka client to 2.2.2

    Kafka client library has been updated to 2.2.2, in which CVE-2019-12399 is fixed.

    https://github.com/apache/druid/pull/9259

    # Bug fixes

    Druid 0.18.0 includes 40 bug fixes. Please see https://github.com/apache/druid/pulls?page=1&q=is%3Apr+milestone%3A0.18.0+is%3Aclosed+label%3ABug for the full list of bug fixes.

    • Fix superbatch merge last partition boundaries (https://github.com/apache/druid/pull/9448)
    • Reuse transformer in stream indexing (https://github.com/apache/druid/pull/9625)
    • Preserve the null values for numeric type dimensions post-compaction (https://github.com/apache/druid/pull/9622)
    • DruidInputSource can add new dimensions during re-ingestion (https://github.com/apache/druid/pull/9590)
    • Error on value counter overflow instead of writing bad segments (https://github.com/apache/druid/pull/9559)
    • Fix some issues with filters on numeric columns with nulls (https://github.com/apache/druid/pull/9251)
    • Fix timestamp_format expr outside UTC time zone (https://github.com/apache/druid/pull/9282)
    • KIS task fail when setting segmentGranularity with time zone (https://github.com/apache/druid/issues/8690)
    • Fix issue with group by limit pushdown for extractionFn, expressions, joins, etc (https://github.com/apache/druid/pull/9662)

    # Upgrading to Druid 0.18.0

    Please be aware of the following changes between 0.17.1 and 0.18.0 before upgrading. If you're updating from an earlier version than 0.17.1, please see the release notes of the relevant intermediate versions.

    # S3 extension

    The S3 storage extension now supports cleanup of stale task logs and segments. When deploying 0.18.0, please ensure that your extensions directory does not have any older versions of druid-s3-extensions extension.

    https://github.com/apache/druid/pull/9459

    # Core extension for Azure

    The Azure storage extension has been promoted to a core extension. It also supports cleanup of stale task logs and segments now. When deploying 0.18.0, please ensure that your extensions-contrib directory does not have any older versions of druid-azure-extensions extension.

    https://github.com/apache/druid/pull/9394 https://github.com/apache/druid/pull/9523

    # Google Storage extension

    The Google storage extension now supports cleanup of stale task logs and segments. When deploying 0.18.0, please ensure that your extensions directory does not have any older versions of druid-google-extensions extension.

    https://github.com/apache/druid/pull/9519

    # Hadoop AWS library included in binary distribution

    The Hadoop AWS library is now included in the binary distribution for a better out-of-box experience. When deploying 0.18.0, please ensure that your hadoop-dependencies directory or any other directories in the classpath do not have duplicate libraries.

    # PostgreSQL JDBC driver for Lookups included in binary distribution

    The PostgreSQL JDBC driver for Lookups is now included in the binary distribution for a better out-of-box experience. When deploying 0.18.0, please ensure that your extensions/druid-lookups-cached-single directory or any other directories in the classpath do not have duplicate JDBC drivers.

    https://github.com/apache/druid/pull/9399

    # Known Issues

    # Query failure with topN or groupBy on scan with multi-valued columns

    Query inlining in Brokers is newly introduced in 0.18.0 but has a bug that queries with topN or groupBy on top of scan fail if the scan query selects multi-valued dimensions. See https://github.com/apache/druid/issues/9697 for more details.

    # NullPointerException when using Avro parser with Kafka indexing service.

    Avro parser doesn't work with Kafka indexing service because of a wrong null check. See https://github.com/apache/druid/issues/9728 for more details.

    # Misleading segment/unavailable/count metric during handoff

    This metric is supposed to take the number of segments served by realtime tasks into consideration as well, but it currently does not. As a result, unavailability appears to spike before the new segments are loaded by Historicals, even if all segments are actually continuously available on some combination of realtime tasks and Historicals.

    https://github.com/apache/druid/issues/9677

    # Slight difference between the result of explain plan for query and the actual execution plan

    The result of explain plan for can be slightly different from what Druid actually executes when the query includes joins or subqueries. The difference can be found in that each part of the query plan would be represented as if it was its own native query in the result of explain plan for. For example, for a join of a datasource d1 and a groupBy subquery on datasource d2, the explain plan for could return a plan like below

         join
        /    \
    scan    groupBy
     |        |
    d1       d2
    

    whereas the actual query plan Druid would execute is

         join
        /    \
      d1    groupBy
              |
             d2
    

    # Other known issues

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug.

    # Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @abhishekrb19 @aditya-r-m @AlexanderSaydakov @als-sdin @aP0StAl @asdf2014 @benhopp @bjozet @capistrant @Caroline1000 @ccaominh @clintropolis @dampcake @fjy @Fokko @frnidito @gianm @himanshug @JaeGeunBang @jihoonson @jon-wei @JulianJaffePinterest @kou64yama @lamber-ken @leventov @liutang123 @maytasm @mcbrewster @mgill25 @mitchlloyd @mrsrinivas @nvolungis @prabcs @samarthjain @sthetland @suneet-s @themaric @vogievetsky @xvrl @zachjsh @zhenxiao

  • druid-0.17.1(Apr 1, 2020)

    Apache Druid 0.17.1 is a security bug fix release that addresses the following CVE for LDAP authentication:

    • [CVE-2020-1958]: Apache Druid LDAP injection vulnerability (https://lists.apache.org/thread.html/r9d437371793b410f8a8e18f556d52d4bb68e18c537962f6a97f4945e%40%3Cdev.druid.apache.org%3E)
  • druid-0.17.0(Jan 27, 2020)

    Apache Druid 0.17.0 contains over 250 new features, performance enhancements, bug fixes, and major documentation improvements from 52 contributors. Check out the complete list of changes and everything tagged to the milestone.

    Highlights

    Batch ingestion improvements

    Druid 0.17.0 includes a significant update to the native batch ingestion system. This update adds the internal framework to support non-text binary formats, with initial support for ORC and Parquet. Additionally, native batch tasks can now read data from HDFS.

    This rework changes how the ingestion source and data format are specified in the ingestion task. To use the new features, please refer to the documentation on InputSources and InputFormats.

    Please see the following documentation for details: https://druid.apache.org/docs/0.17.0/ingestion/data-formats.html#input-format https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#input-sources https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html#partitionsspec

    https://github.com/apache/druid/issues/8812

    Single dimension range partitioning for parallel native batch ingestion

    The parallel index task now supports the single_dim type partitions spec, which allows for range-based partitioning on a single dimension.

    Please see https://druid.apache.org/docs/0.17.0/ingestion/native-batch.html for details.

    Compaction changes

    Parallel index task split hints

    The parallel indexing task now has a new configuration, splitHintSpec, in the tuningConfig to allow for operators to provide hints to control the amount of data that each first phase subtask reads. There is currently one split hint spec type, SegmentsSplitHintSpec, used for re-ingesting Druid segments.

    Parallel auto-compaction

    Auto-compaction can now use the parallel indexing task, allowing for greater compaction throughput.

    To control the level of parallelism, the auto-compaction tuningConfig has new parameters, maxNumConcurrentSubTasks and splitHintSpec.

    Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#compaction-dynamic-configuration for details.

    https://github.com/apache/incubator-druid/pull/8570

    Stateful auto-compaction

    Auto-compaction now uses the partitionSpec to track changes made by previous compaction tasks, allowing the coordinator to reduce redundant compaction operations.

    Please see https://github.com/apache/druid/issues/8489 for details.

    If you have auto-compaction enabled, please see the information under "Stateful auto-compaction changes" in the "Upgrading to Druid 0.17.0" section before upgrading.

    Parallel query merging on brokers

    The Druid broker can now opportunistically merge query results in parallel using multiple threads.

    Please see druid.processing.merge.useParallelMergePool in the Broker section of the configuration reference for details on how to configure this new feature.

    Parallel merging is enabled by default (controlled by the druid.processing.merge.useParallelMergePool property), and most users should not have to change any of the advanced configuration properties described in the configuration reference.

    Additionally, merge parallelism can be controlled on a per-query basis using the query context. Information about the new query context parameters can be found at https://druid.apache.org/docs/0.17.0/querying/query-context.html.

    https://github.com/apache/incubator-druid/pull/8578

    SQL-compatible null handling

    In 0.17.0, we have added official documentation for Druid's SQL-compatible null handling mode.

    Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#sql-compatible-null-handling and https://druid.apache.org/docs/0.17.0/design/segments.html#sql-compatible-null-handling for details.

    Several bugs that existed in this previously undocumented mode have been fixed, particularly around null handling in numeric columns. We recommend that users begin to consider transitioning their clusters to this new mode after upgrading to 0.17.0.

    The full list of null handling bugs fixed in 0.17.0 can be found at https://github.com/apache/druid/issues?utf8=%E2%9C%93&q=label%3A%22Area+-+Null+Handling%22+milestone%3A0.17.0+

    LDAP extension

    Druid now supports LDAP authentication. Authorization using LDAP groups is also supported by mapping LDAP groups to Druid roles.

    • LDAP authentication is handled by specifying an LDAP-type credentials validator.
    • Authorization using LDAP is handled by specifying an LDAP-type role provider, and defining LDAP group->Druid role mappings within Druid.

    LDAP integration requires the druid-basic-security core extension. Please see https://druid.apache.org/docs/0.17.0/development/extensions-core/druid-basic-security.html for details.

    As this is the first release with LDAP support, and there are a large variety of LDAP ecosystems, some LDAP use cases and features may not be supported yet. Please file an issue if you need enhancements to this new functionality.

    https://github.com/apache/incubator-druid/pull/6972

    Dropwizard emitter

    A new Dropwizard metrics emitter has been added as a contrib extension.

    The currently supported Dropwizard metrics types are counter, gauge, meter, timer and histogram. These metrics can be emitted using either a Console or JMX reporter.

    Please see https://druid.apache.org/docs/0.17.0/design/extensions-contrib/dropwizard.html for details.

    https://github.com/apache/incubator-druid/pull/7363

    Self-discovery resource

    A new pair of endpoints has been added to all Druid services. They report whether the service has received confirmation from the central service discovery mechanism (currently ZooKeeper) that it has been added to the cluster. These endpoints can be useful as health/ready checks.

    The new endpoints are:

    • /status/selfDiscovered/status
    • /status/selfDiscovered

    Please see the Druid API reference for details.

    https://github.com/apache/incubator-druid/pull/6702 https://github.com/apache/incubator-druid/pull/9005

    Supervisors system table

    Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table.

    Please see https://druid.apache.org/docs/0.17.0/querying/sql.html#supervisors-table for details.

    https://github.com/apache/incubator-druid/pull/8547

    Fast historical start with lazy loading

    A new boolean configuration property for historicals, druid.segmentCache.lazyLoadOnStart, has been added.

    This new property allows historicals to defer loading of a segment until the first time that segment is queried, which can significantly decrease historical startup times for clusters with a large number of segments.

    Please see the configuration reference for details.

    https://github.com/apache/incubator-druid/pull/6988

    Historical segment cache distribution change

    A new historical property, druid.segmentCache.locationSelectorStrategy, has been added.

    If there are multiple segment storage locations specified in druid.segmentCache.locations, the new locationSelectorStrategy property allows the user to specify what strategy is used to fill the locations. Currently supported options are roundRobin and leastBytesUsed.
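    As a rough illustration of how the two strategies differ, the sketch below simulates both over a list of segment sizes. It is not Druid's implementation: the function names and paths are hypothetical, and location capacity limits are ignored.

```python
from itertools import cycle


def assign_round_robin(segment_sizes, locations):
    """Assign each segment to the next location in turn, ignoring current usage."""
    used = {loc: 0 for loc in locations}
    turn = cycle(locations)
    for size in segment_sizes:
        used[next(turn)] += size
    return used


def assign_least_bytes_used(segment_sizes, locations):
    """Assign each segment to whichever location currently holds the fewest bytes."""
    used = {loc: 0 for loc in locations}
    for size in segment_sizes:
        target = min(locations, key=lambda loc: used[loc])
        used[target] += size
    return used


# Hypothetical segment sizes (bytes) and storage locations.
sizes = [500, 100, 100, 100]
locations = ["/mnt/disk1", "/mnt/disk2"]
round_robin = assign_round_robin(sizes, locations)
least_used = assign_least_bytes_used(sizes, locations)
```

    With one large segment followed by several small ones, the round-robin simulation leaves the first location noticeably fuller, while the least-bytes-used simulation evens out the totals.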

    Please see the configuration reference for details.

    https://github.com/apache/incubator-druid/pull/8038

    New readiness endpoints

    A new Broker endpoint has been added: /druid/broker/v1/readiness.

    A new Historical endpoint has been added: /druid/historical/v1/readiness.

    These endpoints are similar to the existing /druid/broker/v1/loadstatus and /druid/historical/v1/loadstatus endpoints.

    They differ in that they do not require authentication/authorization checks, and instead of a JSON body they only return a 200 success or 503 HTTP response code.
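    A minimal sketch of a readiness probe built on these endpoints, using only the Python standard library. The helper names, the base URL, and the timeout are assumptions for illustration; only the endpoint paths and the 200/503 behavior come from the notes above.

```python
import urllib.error
import urllib.request


def readiness_url(base_url, service):
    """Build the readiness URL for a service name such as 'broker' or 'historical'."""
    return f"{base_url.rstrip('/')}/druid/{service}/v1/readiness"


def is_ready(status_code):
    """Interpret a readiness response: 200 means ready, anything else (e.g. 503) does not."""
    return status_code == 200


def check_readiness(base_url, service="broker", timeout=5.0):
    """Probe the readiness endpoint and return True only on a 200 response."""
    try:
        with urllib.request.urlopen(readiness_url(base_url, service), timeout=timeout) as response:
            return is_ready(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx codes such as 503; the code is on the exception.
        return is_ready(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as not ready.
        return False
```

    Because the endpoints return no JSON body, a probe like this only needs the status code, which suits load balancer or orchestrator health checks.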

    https://github.com/apache/incubator-druid/pull/8841

    Support task assignment based on MiddleManager categories

    It is now possible to define a "category" name property for each MiddleManager. New worker select strategies that are category-aware have been added, allowing the user to control how tasks are assigned to MiddleManagers based on the configured categories.

    Please see the documentation for druid.worker.category in the configuration reference, and the following links, for more details: https://druid.apache.org/docs/0.17.0/configuration/index.html#Equal-Distribution-With-Category-Spec https://druid.apache.org/docs/0.17.0/configuration/index.html#Fill-Capacity-With-Category-Spec https://druid.apache.org/docs/0.17.0/configuration/index.html#WorkerCategorySpec

    https://github.com/apache/druid/pull/7066

    Security vulnerability updates

    A large number of dependencies have been updated to newer versions to address security vulnerabilities.

    Please see the PRs below for details:

    • https://github.com/apache/incubator-druid/pull/8878
    • https://github.com/apache/incubator-druid/pull/8980

    Upgrading to Druid 0.17.0

    Select native query has been replaced

    The deprecated Select native query type has been removed in 0.17.0.

    If you have native queries that use Select, you need to modify them to use Scan instead. See the Scan query documentation (https://druid.apache.org/docs/0.17.0/querying/scan-query.html) for syntax and output format details.
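    As a starting point for migration, the sketch below rewrites a simple native Select query (as a Python dict) into a Scan query. The field mapping is an assumption for illustration: it handles only plain string dimension names, maps pagingSpec.threshold to limit, and drops Select-only fields such as granularity. The function name is hypothetical, and the output format of Scan differs from Select, so consult the Scan documentation before relying on it.

```python
def select_to_scan(select_query):
    """Build a Scan query dict from a simple Select query dict (assumed field mapping)."""
    if select_query.get("queryType") != "select":
        raise ValueError("expected a native Select query")
    scan = {
        "queryType": "scan",
        "dataSource": select_query["dataSource"],
        "intervals": select_query["intervals"],
    }
    # Select lists dimensions and metrics separately; Scan takes one "columns" list.
    columns = list(select_query.get("dimensions", [])) + list(select_query.get("metrics", []))
    if columns:
        scan["columns"] = columns
    # Select pages via pagingSpec.threshold; the closest Scan analogue is "limit".
    threshold = select_query.get("pagingSpec", {}).get("threshold")
    if threshold is not None:
        scan["limit"] = threshold
    # Filters and query context are carried over unchanged.
    for key in ("filter", "context"):
        if key in select_query:
            scan[key] = select_query[key]
    return scan
```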

    For Druid SQL queries that use Select, no changes are needed; the SQL planner already uses the Scan query type under the covers for Select queries.

    https://github.com/apache/incubator-druid/pull/8739

    Old consoles have been removed

    The legacy coordinator and overlord consoles have been removed and replaced with the new web console on the coordinator and overlord.

    https://github.com/apache/incubator-druid/pull/8838

    Calcite 1.21 upgrade, Druid SQL null handling

    Druid 0.17.0 updates Calcite to version 1.21. This newer version of Calcite can make additional optimizations that assume SQL-compliant null handling behavior when planning queries.

    If you use Druid SQL and rely on null handling behavior, please read the information at https://druid.apache.org/docs/0.17.0/configuration/index.html#sql-compatible-null-handling and ensure that your Druid cluster is running in the SQL-compliant null handling mode before upgrading.

    https://github.com/apache/incubator-druid/pull/8566

    Logging adjustments

    Druid 0.17.0 has tidied up its lifecycle, querying, and ingestion logging.

    Please see https://github.com/apache/incubator-druid/pull/8889 for a detailed list of changes. If you relied on specific log messages for external integrations, please review the new logging changes before upgrading.

    The full set of log messages can still be seen when logging is set to DEBUG level. Template log4j2 configuration files that show how to enable per-package DEBUG logging are provided in the _common configuration folder in the example clusters under conf/druid.

    Stateful auto-compaction changes

    The auto-compaction scheduling logic in 0.17.0 tracks additional segment partitioning information in Druid's metadata store that is not present in older versions. This information is used to determine whether a set of segments has already been compacted under the cluster's current auto-compaction configurations.

    When this new metadata is not present, a set of segments will always be scheduled for an initial compaction and this new metadata will be created after they are compacted, allowing the scheduler to skip them later if auto-compaction config is unchanged.

    Since this additional segment partitioning metadata is not present before 0.17.0, the auto-compaction scheduling logic will re-compact all segments within a datasource once after the upgrade to 0.17.0.

    This re-compaction on the entire set of segments for each datasource that has auto-compaction enabled means that:

    • There will be a transition period after the upgrade where more total compaction tasks will be queued than under normal conditions
    • The deep storage usage will increase as the entire segment set is re-compacted (the old set of segments is still kept in deep storage unless explicitly removed).

    Users are advised to be aware of the temporary increase in scheduled compaction tasks and the impact on deep storage usage. Documentation on removing old segments is located at https://druid.apache.org/docs/0.17.0/ingestion/data-management.html#deleting-data

    targetCompactionSizeBytes property removed

    The targetCompactionSizeBytes property has been removed from the compaction task and auto-compaction configuration. For auto-compaction, maxRowsPerSegment is now a mandatory configuration. For non-auto compaction tasks, any partitionsSpec can be used.

    https://github.com/apache/incubator-druid/pull/8573

    Compaction task tuningConfig

    Due to the parallel auto-compaction changes introduced by #8570, any manually submitted compaction task specs need to be updated to use an index_parallel type for the tuningConfig section instead of index. These spec changes should be applied after the cluster is upgraded to 0.17.0.
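    A small sketch of that spec change, assuming the compaction task spec is held as a Python dict. The helper name is hypothetical; only the index to index_parallel switch comes from the note above.

```python
import copy


def use_parallel_tuning_config(compaction_spec):
    """Return a copy of a compaction task spec whose tuningConfig type is index_parallel."""
    updated = copy.deepcopy(compaction_spec)
    tuning = updated.get("tuningConfig")
    # Only rewrite specs that still use the old "index" type; leave others untouched.
    if tuning is not None and tuning.get("type") == "index":
        tuning["type"] = "index_parallel"
    return updated
```

    The function copies the spec rather than mutating it, so the original can be kept until the cluster upgrade is complete.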

    Existing auto-compaction configs can remain unchanged after the update; the auto-compaction will create non-parallel compaction tasks until the auto-compaction configs are updated to use parallelism post-upgrade.

    To control the level of parallelism, the auto-compaction tuningConfig has new parameters, maxNumConcurrentSubTasks and splitHintSpec.

    Please see https://druid.apache.org/docs/0.17.0/configuration/index.html#compaction-dynamic-configuration for details.

    Compaction task ioConfig

    The compaction task now requires an ioConfig in the task spec.

    Please see https://druid.apache.org/docs/0.17.0/ingestion/data-management.html#compaction-ioconfig for details.

    ioConfig does not have to be added to existing auto-compaction configurations; after the upgrade, the coordinator will automatically create task specs with ioConfig sections.

    https://github.com/apache/incubator-druid/pull/8571

    Renamed partition spec fields

    The targetPartitionSize and maxSegmentSize fields in the partition specs have been deprecated. They have been renamed to targetNumRowsPerSegment and maxRowsPerSegment respectively.

    https://github.com/apache/incubator-druid/pull/8507

    Cache metrics are off by default

    Cache metrics are now disabled by default. To enable cache metrics, add "org.apache.druid.client.cache.CacheMonitor" to the druid.monitoring.monitors property.

    https://github.com/apache/incubator-druid/pull/8561

    Supervisor API has changed to be consistent with task API

    Supervisor task specs should now put the dataSchema, tuningConfig, and ioConfig sections as subfields of a spec field. Please see https://github.com/apache/incubator-druid/pull/8810 for examples.

    The old format is still accepted in 0.17.0.
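    The restructuring can be sketched as follows, assuming the supervisor spec is held as a Python dict. The helper name and the sample field values are hypothetical; see the linked PR for authoritative examples.

```python
def nest_under_spec(supervisor_spec):
    """Move dataSchema, tuningConfig, and ioConfig under a top-level 'spec' field."""
    sections = ("dataSchema", "tuningConfig", "ioConfig")
    if "spec" in supervisor_spec:
        # Already in the new format; return a shallow copy unchanged.
        return dict(supervisor_spec)
    converted = {key: value for key, value in supervisor_spec.items() if key not in sections}
    converted["spec"] = {key: supervisor_spec[key] for key in sections if key in supervisor_spec}
    return converted
```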

    Segments API semantics change

    The /datasources/{dataSourceName}/segments endpoint on the Coordinator now returns all used segments (including overshadowed) on the specified intervals, rather than only visible ones.

    https://github.com/apache/incubator-druid/pull/8564

    Password provider for basic authentication of HttpEmitterConfig

    The druid.emitter.http.basicAuthentication property now accepts a password provider. We recommend updating your configurations to use a password provider if using the HTTP emitter.

    https://github.com/apache/incubator-druid/pull/8618

    Multivalue expression transformation change

    Reusing multi-valued columns in expressions will no longer result in unnecessary cartesian explosions. Please see the following links for details.

    https://github.com/apache/druid/issues/8947 https://github.com/apache/incubator-druid/pull/8957

    Kafka/Kinesis ingestion during rolling upgrades

    During a rolling upgrade, if there are tasks running 0.17.0 and overlords running older versions, and a task made progress reading data from its stream but rejected all the records it saw (e.g., all were unparseable), you will see NullPointerExceptions on overlords running older versions when the task updates the overlord with its current stream offsets.

    Previously, there was a bug in this area (https://github.com/apache/incubator-druid/issues/8765) where such tasks would fail to communicate their current offsets to the overlord. The task/overlord publishing protocol has been updated to fix this, but older overlords do not recognize this protocol change.

    This condition should be fairly rare.

    Default Hadoop version change

    Druid 0.17.0 now bundles Hadoop 2.8.5 libraries instead of 2.8.3.

    If you were referencing the bundled 2.8.3 libraries in your configuration via the druid.indexer.task.defaultHadoopCoordinates property, or the hadoopDependencyCoordinates property in your Hadoop ingestion specs, you will need to update these references to point to the bundled 2.8.5 libraries instead.

    ParseSpec.verify method removed

    If you were maintaining a custom extension that provides an implementation for the ParseSpec interface, the verify method has been removed, and the @Override annotation on the method will need to be removed in any custom implementations.

    https://github.com/apache/druid/pull/8744

    Known issues

    Filtering on long columns in SQL-compatible null handling mode

    We are currently aware of a bug with applying certain filter types on null values from long columns when SQL-compatible null handling is enabled (https://github.com/apache/druid/issues/9255).

    Please file an issue if you encounter any other null handling problems.

    Ingestion spec preview in web console

    The preview specs shown for native batch ingestion tasks created in the Data Loader of the web console are not correctly formatted and will fail if you copy them and submit them manually. Submitting these specs through the Data Loader submits a correctly formatted spec, however.

    https://github.com/apache/druid/issues/9144

    Other known issues

    For a full list of open issues, please see https://github.com/apache/druid/labels/Bug

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @abhishekrb19 @aditya-r-m @AlexanderSaydakov @asdf2014 @capistrant @ccaominh @clintropolis @denever @elloooooo @FaxianZhao @fjy @Fokko @fstolba @gianm @gkc2104 @glasser @GraceKoo @himanshug @jihoonson @jnaous @jon-wei @karthikbhat13 @legendtkl @leventov @maytasm3 @mcbrewster @mitchlloyd @mohammadjkhan @nishantmonu51 @pdeva @pjain1 @pzhdfy @QiuMM @qutang1 @renevan10 @richardstartin @samarthjain @SandishKumarHN @sashidhar @sekingme @SEKIRO-J @suneet-s @surekhasaharan @tijoparacka @vogievetsky @xingbowu @xvrl @yuanlihan @yunwan @yurmix @zhenxiao

  • druid-0.16.1-incubating(Dec 11, 2019)

    Apache Druid 0.16.1-incubating is a bug fix and user experience improvement release that fixes a rolling upgrade issue, improves the startup scripts, and updates licensing information.

    Bug Fixes

    • #8682 implement FiniteFirehoseFactory in InlineFirehose
    • #8905 Retrying with a backward compatible task type on unknown task type error in parallel indexing

    User Experience Improvements

    • #8792 Use bundled ZooKeeper in tutorials.
    • #8794 Startup scripts: verify Java 8 (exactly), improve port/java verification messages.
    • #8942 Improve verify-default-ports to check both INADDR_ANY and 127.0.0.1.
    • #8798 Fix verify script.

    Licensing Update

    • #8944 Add license for tutorial wiki data
    • #8968 Add licenses.yaml entry for Wikipedia sample data

    Other

    • #8419 Bump Apache Thrift to 0.10.0

    Updating from 0.16.0-incubating and earlier

    PR #8905 fixes an issue with rolling upgrades when updating from earlier versions.

    Credits

    Thanks to everyone who contributed to this release!

    @aditya-r-m @clintropolis @Fokko @gianm @jihoonson @jon-wei

    Apache Druid (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

  • druid-0.16.0-incubating(Sep 25, 2019)

    Apache Druid 0.16.0-incubating contains over 350 new features, performance enhancements, bug fixes, and major documentation improvements from 50 contributors. Check out the complete list of changes and everything tagged to the milestone.

    Highlights

    # Performance

    # 'Vectorized' query processing

    An experimental 'vectorized' query execution engine is new in 0.16.0, which can provide a speed increase in the range of 1.3-3x for timeseries and group by v2 queries. It operates on the principle of batching operations on rows instead of processing a single row at a time, e.g. iterating bitmaps in batches instead of per row, reading column values in batches, filtering in batches, aggregating values in batches, and so on. This results in significantly fewer method calls, better memory locality, and increased cache efficiency.

    This is an experimental feature, but we view it as the path forward for Druid query processing and are excited for feedback as we continue to improve and fill out missing features in upcoming releases.

    • Only timeseries and groupBy have vectorized engines.
    • GroupBy doesn't handle multi-value dimensions or granularity other than "all" yet.
    • Vector cursors cannot handle virtual columns or descending order.
    • Expressions are not supported anywhere: not as inputs to aggregators, in virtual functions, or in filters.
    • Only some aggregators have vectorized implementations: "count", "doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
    • Only some filters have vectorized matchers: "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
    • Dimension specs other than "default" don't work yet (no extraction functions or filtered dimension specs).

    The feature can be enabled by setting "vectorize": true in your query context (the default is false). This works both for Druid SQL and for native queries. When set to true, vectorization will be used if possible; otherwise, Druid will fall back to its non-vectorized query engine. You can also set it to "force", which will return an error if the query cannot be fully vectorized. This is helpful for confirming that vectorization is indeed being used.

    You can control the block size during execution by setting the vectorSize query context parameter (default is 1000).
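    As a sketch of how these context parameters might be attached to a native query held as a Python dict: the helper name and the sample query are hypothetical, while the vectorize and vectorSize keys are the ones described above.

```python
def with_vectorization(query, mode=True, vector_size=None):
    """Return a copy of a query dict with vectorization settings in its context."""
    if mode not in (True, False, "force"):
        raise ValueError("vectorize must be true, false, or 'force'")
    updated = dict(query)
    context = dict(query.get("context", {}))
    context["vectorize"] = mode
    if vector_size is not None:
        context["vectorSize"] = vector_size
    updated["context"] = context
    return updated


# Hypothetical timeseries query; datasource and interval are placeholders.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": ["2019-09-01/2019-09-02"],
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "rows"}],
}
forced = with_vectorization(query, mode="force", vector_size=1000)
```

    Using "force" during testing makes an unsupported query fail loudly instead of silently falling back to the non-vectorized engine.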

    https://github.com/apache/incubator-druid/issues/7093 https://github.com/apache/incubator-druid/pull/6794

    # GroupBy array-based result rows

    groupBy v2 queries now use an array-based representation of result rows, rather than the map-based representation used by prior versions of Druid. This provides faster generation and processing of result sets. Out of the box this change is invisible and backwards-compatible; you will not have to change any configuration to reap the benefits of this more efficient format, and it will have no impact on cached results. Internally this format will always be utilized automatically by the broker in the queries that it issues to historicals. By default the results will be translated back to the existing 'map' based format at the broker before sending them back to the client.

    However, if you would like to avoid the overhead of this translation, and get even faster results, resultAsArray may be set on the query context to directly pass through the new array-based result row format. The schema is as follows, in order:

    • Timestamp (optional; only if granularity != ALL)
    • Dimensions (in order)
    • Aggregators (in order)
    • Post-aggregators (optional; in order, if present)
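    To make the ordering concrete, the sketch below turns one array-based row back into a name-to-value mapping using the query that produced it. This is a client-side illustration only: the helper name and the "timestamp" key are hypothetical, and object-style dimension specs are assumed to carry an outputName field.

```python
def row_to_map(query, row):
    """Label an array-based groupBy result row using the schema order described above."""
    names = []
    granularity = query.get("granularity", "all")
    if isinstance(granularity, dict):
        granularity = granularity.get("type", "")
    # The timestamp is only present when granularity is not "all".
    if str(granularity).lower() != "all":
        names.append("timestamp")
    for dim in query.get("dimensions", []):
        names.append(dim if isinstance(dim, str) else dim["outputName"])
    names += [agg["name"] for agg in query.get("aggregations", [])]
    names += [post["name"] for post in query.get("postAggregations", [])]
    if len(names) != len(row):
        raise ValueError("row length does not match the query's output schema")
    return dict(zip(names, row))
```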

    https://github.com/apache/incubator-druid/issues/8118 https://github.com/apache/incubator-druid/pull/8196

    # Additional performance enhancements

    The complete set of pull requests tagged as performance enhancements for 0.16 can be found here.

    # "Minor" compaction

    Users who run the Kafka indexing service together with compaction, and who get a trickle of late data, can find a huge improvement in the form of a new concept called 'minor' compaction. Enabled by internal changes to how data segments are versioned, minor compaction is based on the idea of 'segment' based locking at indexing time instead of the current Druid locking behavior (which is now referred to as 'time chunk' locking). Segment locking, as you might expect, allows only the segments which are being compacted to be locked, while still allowing new 'appending' indexing tasks (like Kafka indexing tasks) to continue to run and create new segments simultaneously. This is a big deal if you get a lot of late data, because the current behavior results in compaction tasks starving as higher priority realtime tasks hog the locks. When compaction tasks are prevented from optimizing a datasource's segment sizes, overall performance suffers.

    To enable segment locking, you will need to set forceTimeChunkLock to false in the task context, or set druid.indexer.tasklock.forceTimeChunkLock=false in the Overlord configuration. However, beware, after enabling this feature, due to the changes in segment versioning, there is no rollback path built in, so once you upgrade to 0.16, you cannot downgrade to an older version of Druid. Because of this, we highly recommend confirming that Druid 0.16 is stable in your cluster before enabling this feature.

    It has a humble name, but the changes of minor compaction run deep, and it is not possible to adequately describe the mechanisms that drive this in these release notes, so check out the proposal and PR for more details.

    https://github.com/apache/incubator-druid/issues/7491 https://github.com/apache/incubator-druid/pull/7547

    # Druid "indexer" process

    The new Indexer process is an alternative to the MiddleManager + Peon task execution system. Instead of forking a separate JVM process per-task, the Indexer runs tasks as separate threads within a single JVM process. The Indexer is designed to be easier to configure and deploy compared to the MiddleManager + Peon system and to better enable resource sharing across tasks.

    The advantage of the Indexer is that it allows query processing resources, lookups, cached authentication/authorization information, and much more to be shared between all running indexing task threads, giving each individual task access to a larger pool of resources and far fewer redundant actions done than is possible with the Peon model of execution where each task is isolated in its own process.

    Using Indexer does come with one downside: the loss of process isolation provided by Peon processes means that a single task can potentially affect all running indexing tasks on that Indexer. The druid.worker.globalIngestionHeapLimitBytes and druid.worker.numConcurrentMerges configurations are meant to help minimize this. Additionally, task logs for indexer processes will be inline with the Indexer process log, and not persisted to deep storage.

    You can start using the Indexer by supplying server indexer as the command-line argument to org.apache.druid.cli.Main when starting the service. To use Indexer in place of a MiddleManager and Peon, you should be able to adapt values from the configuration into the Indexer configuration, lifting druid.indexer.fork.property. configurations directly to the Indexer, and sizing heap and direct memory based on the Peon sizes multiplied by the number of task slots (unlike a MiddleManager, it does not accept the configurations druid.indexer.runner.javaOpts or druid.indexer.runner.javaOptsArray). See the indexer documentation for details.
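    The sizing rule of thumb above (Peon sizes multiplied by the number of task slots) can be worked through as simple arithmetic. The function name and the sample numbers are hypothetical; treat the result as a starting estimate, not a tuned configuration.

```python
GIB = 1024 ** 3


def indexer_memory(peon_heap_bytes, peon_direct_bytes, task_slots):
    """Estimate Indexer heap and direct memory as Peon sizes times the task slot count."""
    return {
        "heap_bytes": peon_heap_bytes * task_slots,
        "direct_bytes": peon_direct_bytes * task_slots,
    }


# Hypothetical example: Peons sized at 1 GiB heap and 2 GiB direct memory, 4 task slots.
sizing = indexer_memory(1 * GIB, 2 * GIB, 4)
```

    In this example the estimate comes to 4 GiB of heap and 8 GiB of direct memory for the single Indexer JVM.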

    https://github.com/apache/incubator-druid/pull/8107

    # Native parallel batch indexing with shuffle

    In 0.16.0, Druid's index_parallel native parallel batch indexing task now supports 'perfect' rollup with the implementation of a 2 stage shuffle process.

    Tasks in stage 1 perform a secondary partitioning of rows on top of the standard time based partitioning of segment granularity, creating an intermediary data segment for each partition. Stage 2 tasks are each assigned a set of the partitionings created during stage 1, and will collect and combine the set of intermediary data segments which belong to that partitioning, allowing it to achieve complete rollup when building the final segments. At this time, only hash-based partitioning is supported.

    This can be enabled by setting forceGuaranteedRollup to true in the tuningConfig; numShards in partitionsSpec and intervals in granularitySpec must also be set.
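    A small pre-submission check for these three requirements might look like the sketch below. It assumes the ingestion spec is a Python dict with tuningConfig and dataSchema.granularitySpec sections, and that partitionsSpec sits inside tuningConfig; the function name and the sample values are hypothetical.

```python
def perfect_rollup_problems(spec):
    """List the settings still missing for perfect rollup in an index_parallel spec."""
    problems = []
    tuning = spec.get("tuningConfig", {})
    if tuning.get("forceGuaranteedRollup") is not True:
        problems.append("tuningConfig.forceGuaranteedRollup must be true")
    if "numShards" not in tuning.get("partitionsSpec", {}):
        problems.append("partitionsSpec.numShards must be set")
    granularity_spec = spec.get("dataSchema", {}).get("granularitySpec", {})
    if not granularity_spec.get("intervals"):
        problems.append("granularitySpec.intervals must be set")
    return problems


# Hypothetical spec fragment that satisfies all three requirements.
spec = {
    "dataSchema": {"granularitySpec": {"intervals": ["2019-09-01/2019-09-02"]}},
    "tuningConfig": {
        "type": "index_parallel",
        "forceGuaranteedRollup": True,
        "partitionsSpec": {"type": "hashed", "numShards": 4},
    },
}
```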

    The Druid MiddleManager (or the new Indexer) processes have a new responsibility for these indexing tasks: serving the intermediary partition segments output by stage 1 to the stage 2 tasks. Depending on configuration and cluster size, the MiddleManager JVM configuration might need to be adjusted to increase heap allocation and HTTP threads. These numbers are expected to scale with cluster size, as all MiddleManager or Indexer processes involved in a shuffle will need the ability to communicate with each other, but we do not expect the footprint to be significantly larger than it is currently. We suggest first trying your existing configurations, and bumping up heap and HTTP thread count only if issues are encountered.

    https://github.com/apache/incubator-druid/issues/8061

    # Web console

    # Data loader

    The console data loader, introduced in 0.15, has been expanded in 0.16 to support Kafka, Kinesis, segment reindexing, and even "inline" data which can be pasted in directly.

    https://github.com/apache/incubator-druid/pull/7643 https://github.com/apache/incubator-druid/pull/8181 https://github.com/apache/incubator-druid/pull/8056 https://github.com/apache/incubator-druid/pull/7947

    # SQL query view

    The query view on the web console has received a major upgrade for 0.16, transitioning into an interactive point-and-click SQL editor. A new column picker sidebar and the result output table let you directly manipulate the SQL query without needing to actually type SQL.

    There is also a query history view and the ability to fully edit the query context.

    https://github.com/apache/incubator-druid/pull/7905 https://github.com/apache/incubator-druid/pull/7934 https://github.com/apache/incubator-druid/pull/8251 https://github.com/apache/incubator-druid/pull/7816

    # Servers view

    The "Data servers" view has been renamed to "Servers", and now displays a list of all Druid processes which are discovered as part of the cluster, using the sys.servers system table.

    This should make it much more convenient to ensure at a glance that all your Druid servers are up and reporting to the rest of the cluster.

    https://github.com/apache/incubator-druid/pull/7770 https://github.com/apache/incubator-druid/pull/7654

    # Datasources view

    The datasource view adds a new segment timeline visualization, allowing segment size and distribution to be gauged visually.

    If you have previously used what is now the legacy coordinator console, do not be alarmed: the timeline is reversed, and the newest segments are now on the right side of the visualization!

    https://github.com/apache/incubator-druid/pull/8202

    # Tasks view

    The tasks view has been significantly improved to help cluster operators manage Druid supervisors and indexing tasks. The supervisors display now includes status information that shows each supervisor's overall status and displays error messages if a supervisor is in error.

    https://github.com/apache/incubator-druid/pull/7428 https://github.com/apache/incubator-druid/pull/7799

    # SQL and native query enhancements

    # SQL and native expression support for multi-value string columns

    Druid SQL and native query expression 'virtual columns' can now correctly utilize multi-value string columns, either operating on them as individual VARCHAR values for parity with native Druid query behavior, or as an array-like type, with a new set of multi-value string functions. Note that so far this is only supported in query-time expressions; use of multi-value expressions at ingestion time via a transformSpec will be available in a future release. The complete list of added functions and behavior changes is far too large to list here, so check out the SQL and expression documentation for more details.

    https://github.com/apache/incubator-druid/issues/7525 https://github.com/apache/incubator-druid/pull/7588 https://github.com/apache/incubator-druid/pull/7950 https://github.com/apache/incubator-druid/pull/7973 https://github.com/apache/incubator-druid/pull/7974 https://github.com/apache/incubator-druid/pull/8011

    # TIMESTAMPDIFF

    To complement TIMESTAMPADD, which can modify timestamps by time units, TIMESTAMPDIFF has been added; it computes the signed number of time units between two timestamps. For syntax information, check out the SQL documentation.

    https://github.com/apache/incubator-druid/pull/7695

    # TIME_CEIL

    TIME_CEIL, a time specific, more flexible version of the CEIL function, has also been added to Druid 0.16.0. This function can round a timestamp up by an ISO8601 period, like P3M (quarters) or PT12H (half-days), optionally for a specific timezone. See SQL documentation for additional information.

    https://github.com/apache/incubator-druid/pull/8027

    # IPv4

    Druid 0.16.0 also adds specialized SQL operators and native expressions for dealing with IPv4 internet addresses in dotted-decimal string or integer format. The new operators are IPV4_MATCH(address, subnet), IPV4_PARSE(address), and IPV4_STRINGIFY(address), which can match IP addresses to subnets in CIDR notation, translate dotted-decimal string format to integer format, and translate integer format into dotted-decimal string format, respectively. See SQL documentation for details.
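    The semantics of the three operators can be illustrated with Python's standard ipaddress module. This is an analogy, not Druid's implementation: the lowercase function names are hypothetical stand-ins, and Druid's handling of invalid or null input is not mirrored here (these helpers simply raise on malformed addresses).

```python
import ipaddress


def ipv4_match(address, subnet):
    """Return True if a dotted-decimal address falls inside a CIDR subnet."""
    return ipaddress.IPv4Address(address) in ipaddress.IPv4Network(subnet)


def ipv4_parse(address):
    """Translate a dotted-decimal address string to its integer form."""
    return int(ipaddress.IPv4Address(address))


def ipv4_stringify(address):
    """Translate an integer address to its dotted-decimal string form."""
    return str(ipaddress.IPv4Address(address))
```

    For example, ipv4_parse("192.168.1.1") yields 3232235777, and ipv4_stringify(3232235777) returns "192.168.1.1".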

    https://github.com/apache/incubator-druid/pull/8223

    # NVL

    To increase permissiveness and SQL compatibility for users coming to Druid with experience in other databases, an alias for the COALESCE function has been added in the form of NVL, which some SQL dialects use instead. See SQL documentation for details.

    https://github.com/apache/incubator-druid/pull/7965

    # long/double/float sum/min/max aggregator support for string columns

    Long, double, and float sum, min, and max aggregators now will permissively work when used with string columns, performing a best-effort parsing to try and translate them to the correct numerical type.

    https://github.com/apache/incubator-druid/pull/8243 https://github.com/apache/incubator-druid/pull/8319

    # Official Docker image

    An official "convenience binary" Docker image will now be offered for every release starting with 0.16, available at https://hub.docker.com/r/apache/incubator-druid.

    # Refreshed website documentation

    The documentation section of the Druid website has received a major upgrade, transitioning to using Docusaurus, which creates much more beautiful and functional pages than we have currently. Each documentation page now has a left-hand collapsible table of contents showing the outline of the overall docs, and a right-hand table of contents showing the outline of that particular doc page, vastly improving navigability.

    Alongside this, the ingestion documentation has been totally refreshed, beginning with a new ingestion/index.md doc that introduces all the key ingestion spec concepts and describes the most popular ingestion methods. This was a much needed rework of many existing documentation pages, which had grown organically over time and become difficult to follow, into a simpler set of fewer, larger, more cross-referenced pages. They are also a bit more 'opinionated', pushing new people towards Kafka, Kinesis, Hadoop, and native batch ingestion. They discuss Tranquility but don't present it as something highly recommended since it is effectively in minimal maintenance mode.

    Check out these improvements (and many more) here.

    https://github.com/apache/incubator-druid/pull/8311

    Extensions

    # druid-datasketches

    The druid-datasketches extension, built on top of Apache Datasketches (incubating), has been expanded with 3 new post aggregators. quantilesDoublesSketchToRank computes an approximation to the rank of a given value, that is, the fraction of the distribution less than that value. quantilesDoublesSketchToCDF computes an approximation to the Cumulative Distribution Function given an array of split points that define the edges of the bins.

    The third, thetaSketchToString, prints a summary of a sketch and has been added to assist in debugging. See Datasketches extension documentation to learn more.
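    To clarify what the rank and CDF post aggregators approximate, the sketch below computes the exact quantities on a small in-memory list. It does not use sketches at all, and the function names are hypothetical. The trailing 1.0 in the CDF output follows the convention of the DataSketches library as the author of this example understands it, and is an assumption here.

```python
from bisect import bisect_left


def exact_rank(values, value):
    """Fraction of a non-empty list of values that is strictly less than the given value."""
    ordered = sorted(values)
    return bisect_left(ordered, value) / len(ordered)


def exact_cdf(values, split_points):
    """Cumulative fraction below each split point, plus a final 1.0 for the last bin."""
    ordered = sorted(values)
    return [bisect_left(ordered, point) / len(ordered) for point in split_points] + [1.0]


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank_of_four = exact_rank(values, 4)
cdf = exact_cdf(values, [3, 6])
```

    A sketch-based post aggregator returns approximations of these numbers without holding the full set of values in memory.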

    https://github.com/apache/incubator-druid/pull/7550 https://github.com/apache/incubator-druid/pull/7937

    The HLLSketch aggregator has been improved with a query-time only round option to support rounding values into whole numbers, to give it feature parity with the built-in cardinality and hyperUnique aggregators.

    https://github.com/apache/incubator-druid/pull/8023

    Finally, users of HllSketch should also see a performance improvement due to some changes made which allow Druid to precompute an empty sketch and copy that into the aggregation buffers, greatly decreasing time to initialize the aggregator during query processing.

    https://github.com/apache/incubator-druid/pull/8194

    # druid-stats

    The druid-stats core extension has been enhanced with SQL support, exposing VAR_POP and VAR_SAMP to compute variance population and sample with the variance aggregator, as well as STDDEV_POP and STDDEV_SAMP to compute standard deviation population and sample using the standard deviation post aggregator. Additionally, VARIANCE and STDDEV functions are added as aliases for VAR_SAMP and STDDEV_SAMP respectively. See SQL documentation and stats extension documentation for more details.

    https://github.com/apache/incubator-druid/pull/7801

    # druid-histogram

    The 'fixed bucket' histogram aggregator of the druid-histogram extension has a new property, finalizeAsBase64Binary, that enables serializing the resulting histogram as a base64 string instead of a human-readable JSON summary, making it consistent with the approximate histogram aggregator of the same extension. See the documentation for further information.

    https://github.com/apache/incubator-druid/pull/7784

    # statsd-emitter

    The statsd-emitter extension has a new option, druid.emitter.statsd.dogstatsdServiceAsTag, which enables emitting the Druid service name as a 'tag', such as druid_service:druid/broker, instead of as part of the metric name, as in druid.broker.query.time. When the option is enabled, druid is used as the prefix, so druid.broker.query.time would instead be druid.query.time. This allows consolidating metrics across Druid service types and discriminating by tag instead. druid.emitter.statsd.dogstatsd must be set to true for this setting to take effect. See the documentation for more details.
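
    The naming change can be sketched as a small function. This is a hypothetical model of the behavior described above, not the emitter's actual code; the function name, its parameters, and the simple "/" to "." conversion are assumptions made for illustration.

```python
def statsd_metric(service, metric, service_as_tag, prefix="druid"):
    """Return (metric_name, tags) for one Druid metric.

    `service` is the Druid service name, e.g. "druid/broker";
    `metric` is the Druid metric name, e.g. "query/time".
    """
    metric_part = metric.replace("/", ".")
    if service_as_tag:
        # The service name moves into a tag; the name keeps only the prefix.
        return f"{prefix}.{metric_part}", [f"druid_service:{service}"]
    # Default: the service name is folded into the metric name.
    return f"{service.replace('/', '.')}.{metric_part}", []


as_name = statsd_metric("druid/broker", "query/time", service_as_tag=False)
as_tag = statsd_metric("druid/broker", "query/time", service_as_tag=True)
```

    With the tag form, one druid.query.time series covers every service type and can be split by the druid_service tag.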

    https://github.com/apache/incubator-druid/pull/8238 https://github.com/apache/incubator-druid/pull/8472

    # New extension: druid-tdigestsketch

    A new set of approximate sketch aggregators based on t-digest, for computing quantiles and similar statistics, has been added in Druid 0.16. T-digest was designed for parallel programming use cases like distributed aggregations or map reduce jobs, as it makes combining two intermediate t-digests easy and efficient. It complements the existing algorithms provided by the Apache Datasketches extension and the moments sketch extension. See the extension documentation for more details.

    https://github.com/apache/incubator-druid/issues/7303 https://github.com/apache/incubator-druid/pull/7331

    # New extension: druid-influxdb-emitter

    A new Druid emitter extension that sends Druid metrics to InfluxDB over HTTP has also been added in 0.16. Currently this emitter only emits service metric events to InfluxDB (see Druid metrics for a list of metrics). When a metric event is fired, it is added to a queue of events. After a configurable amount of time, the events on the queue are transformed to InfluxDB's line protocol and POSTed to the InfluxDB HTTP API. The entire queue is flushed at this point. The queue is also flushed when the emitter is shut down. See the extension docs for details.
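
    The queue-then-flush behavior can be modeled in a few lines. This is a toy sketch, not the extension's implementation: the class, the measurement and tag names, and the single "value" field are hypothetical, escaping of special characters is omitted, and the HTTP POST is replaced by returning the payload.

```python
class QueueingEmitter:
    """Toy model of an emitter that queues events and flushes them in bulk."""

    def __init__(self):
        self.queue = []

    def emit(self, measurement, tags, value, timestamp_ns):
        # Events are only queued here; nothing is sent yet.
        self.queue.append((measurement, tags, value, timestamp_ns))

    def flush(self):
        """Drain the whole queue into one line-protocol payload."""
        lines = []
        for measurement, tags, value, ts in self.queue:
            tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
            lines.append(f"{measurement}{tag_str} value={value} {ts}")
        self.queue.clear()
        return "\n".join(lines)


emitter = QueueingEmitter()
emitter.emit("druid_query_time", {"service": "druid/broker"}, 12.5, 1566000000000000000)
payload = emitter.flush()   # would be the body of the POST to InfluxDB
```

    A real emitter would call flush on a timer and once more at shutdown, matching the two flush points described above.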

    https://github.com/apache/incubator-druid/pull/7717

    # Fine tuning your workloads

    In addition to the experimental vectorized query engine and new indexer process type, 0.16 also has some additional features that make it possible to fine tune indexing and query performance through experimentation.

    # Control of indexing intermediary segment compression

    First up is the ability to independently control what compression is used (or to disable it) when persisting intermediary segments during indexing. This configuration takes the same options as the indexSpec property, and can be added to tuningConfig as:

    "indexSpecForIntermediatePersists": {
      "dimensionCompression": "uncompressed",
      "metricCompression": "none"
    }
    

    for example to disable compression entirely for intermediary segments. One potential reason to consider 'uncompressed' intermediary segments is to ease up on the amount of Java 'direct' memory required to perform the final merge of intermediary segments before they are published and pushed to deep storage, as reading data from uncompressed columns does not require the 64kb direct buffers which are used to decode lz4 and other encoded columns. Of course this is a trade-off of storage space and page cache footprint, so we recommend experimenting with this before settling on a configuration to use for your production workloads.

    https://github.com/apache/incubator-druid/pull/7919

    # Control Filter Bitmap Index Utilization

    Bitmap indexes are usually a huge performance boost for Druid, but in some scenarios they can result in slower queries, particularly with computationally expensive filters on very high cardinality dimensions. Druid 0.16 adds a mechanism that provides some manual control over when bitmap indexes are used and when a filter is evaluated as a row scan instead. It is available on a per-filter, per-query basis. Most filters will accept a new property, filterTuning, which might look something like this:

    "filterTuning": {
      "useBitmapIndex": true
    }
    

    If useBitmapIndex is set to false, the filter is not allowed to use bitmap indexes. This property is optional, and the default behavior when filterTuning is not supplied remains unchanged. Note that this feature is not documented in the user-facing documentation, is considered experimental, and is subject to change in any future release.
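
    As a hedged sketch of where the property sits, the snippet below builds a native query filter with filterTuning attached and serializes it to JSON. The dimension name and pattern are hypothetical, and it assumes the regex filter is among the filters that accept filterTuning.

```python
import json

# Hypothetical expensive filter on a high cardinality dimension.
query_filter = {
    "type": "regex",
    "dimension": "user_agent",               # hypothetical dimension name
    "pattern": ".*Mozilla.*",                # hypothetical pattern
    "filterTuning": {"useBitmapIndex": False},  # evaluate as a row scan instead
}

filter_json = json.dumps(query_filter, indent=2)
```

    The resulting JSON object would be used as the "filter" of a native query. Since the feature is experimental, it is worth comparing query times with and without the property.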

    https://github.com/apache/incubator-druid/pull/8209

    # Request logging

    If you would like to enable Druid request logging, but use Druid SQL and find the logs too chatty because of all the metadata queries, you are in luck with 0.16: a new configuration option allows selectively muting specific types of queries from request logging. The option, druid.request.logging.mutedQueryTypes, accepts a list of "queryType" strings as defined by Druid's native JSON query API, and defaults to an empty list (so nothing is ignored). For example,

    druid.request.logging.mutedQueryTypes=["segmentMetadata", "timeBoundary"]
    

    would mute all request logs for segmentMetadata and timeBoundary queries.
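
    The effect of the option can be sketched as a simple membership check. This is an illustrative model only; the should_log helper is hypothetical and is not part of Druid.

```python
import json

# Value of the property, copied from the example above.
muted_property = '["segmentMetadata", "timeBoundary"]'
muted_types = set(json.loads(muted_property))


def should_log(query):
    """True unless the query's "queryType" is in the muted list."""
    return query.get("queryType") not in muted_types


logs_metadata_query = should_log({"queryType": "segmentMetadata"})
logs_user_query = should_log({"queryType": "groupBy"})
```

    With the default empty list, should_log would return True for every query type.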

    https://github.com/apache/incubator-druid/pull/7562

    # Upgrading to Druid 0.16.0

    # S3 task log storage

    After upgrading to 0.16.0, the task logs will require the same S3 permissions as pushing segments to deep storage, which has some additional logic to possibly get the bucket owner (by fetching the bucket ACL) and set the ACL of the segment object to give the bucket owner full access. If you wish to avoid providing these additional permissions, the existing behavior can be retained by setting druid.indexer.logs.disableAcl=true.

    https://github.com/apache/incubator-druid/pull/7907

    # Druid SQL

    Druid SQL is now enabled on the router by default. Note that this may put some additional pressure on broker processes, due to additional metadata queries required to maintain datasource schemas for query planning, and can be disabled to retain the previous behavior by setting druid.sql.enable=false.

    https://github.com/apache/incubator-druid/pull/7808

    # Druid SQL lookup function

    Druid SQL queries that use the LOOKUP function will now take advantage of injective lookups that are a one-to-one transformation, which allows an optimization to be performed that avoids pushing down lookup evaluation to historical processes, instead allowing the transformation to be done later. The drawback of this change is that lookups which are incorrectly defined as being one-to-one, but are not in fact a one-to-one transformation at evaluation time will produce incorrect query results.
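
    The drawback can be seen in a simplified model of a grouped sum. This is a sketch of the general idea, not of Druid's query planner; the data and the lookup table are made up, and the lookup is deliberately not one-to-one.

```python
from collections import defaultdict

rows = [("US", 10), ("USA", 5), ("FR", 7)]   # (country_code, clicks)
lookup = {"US": "United States", "USA": "United States", "FR": "France"}


def group_sum(pairs):
    """Sum the values for each key, keeping first-seen key order."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Lookup applied before grouping: correct for any lookup.
early = group_sum((lookup[code], clicks) for code, clicks in rows)

# Lookup applied after grouping on the raw key: correct only if one-to-one.
late = [(lookup[code], total) for code, total in group_sum(rows).items()]
```

    Here early has a single "United States" group with 15 clicks, while late contains two separate "United States" rows, which is the kind of incorrect result a wrongly declared injective lookup can produce.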

    https://github.com/apache/incubator-druid/pull/7655

    # Coordinator metadata API changes

    The metadata API provided by the coordinator has been modified to provide more consistent HTTP responses. The changes are minor, but backwards incompatible if you depend on a specific API.

    • POST /druid/coordinator/v1/datasources/{dataSourceName}
    • DELETE /druid/coordinator/v1/datasources/{dataSourceName}
    • POST /druid/coordinator/v1/datasources/{dataSourceName}/markUnused
    • POST /druid/coordinator/v1/datasources/{dataSourceName}/markUsed

    now return a JSON object of the form {"numChangedSegments": N} instead of 204 (empty response) when no segments were changed. On the other hand, 500 (server error) is not returned instead of 204 (empty response).

    • POST /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}
    • DELETE /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}

    now return a JSON object of the form {"segmentStateChanged": true/false}.

    /druid/coordinator/v1/metadata/datasources?includeDisabled is now /druid/coordinator/v1/metadata/datasources?includeUnused. includeDisabled is still accepted, but a warning is emitted in a log.
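
    Client code that previously treated 204 as "nothing changed" needs to read the JSON body instead. The helpers below are a minimal sketch of that parsing; their names are hypothetical, and treating 200 as the only success status is an assumption not stated in these notes.

```python
import json


def num_changed_segments(status_code, body):
    """Parse the reply of a datasource-level call such as markUnused."""
    if status_code != 200:   # assumption: success is reported as 200
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    return json.loads(body)["numChangedSegments"]


def segment_state_changed(status_code, body):
    """Parse the reply of a single-segment call."""
    if status_code != 200:   # assumption: success is reported as 200
        raise RuntimeError(f"unexpected HTTP status {status_code}")
    return bool(json.loads(body)["segmentStateChanged"])


unchanged = num_changed_segments(200, '{"numChangedSegments": 0}')
changed = segment_state_changed(200, '{"segmentStateChanged": true}')
```

    A result of 0 now carries the meaning that an empty 204 response used to carry.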

    https://github.com/apache/incubator-druid/pull/7653

    # Compaction tasks

    The keepSegmentGranularity option would create compaction tasks that ignored the interval boundaries of segments before compacting them, but it was deprecated because it was not very useful in most cases. If you did find this behavior useful, it can still be achieved by setting segmentGranularity to ALL.

    https://github.com/apache/incubator-druid/pull/7747

    There is a known issue with all Druid versions that support 'auto' compaction via the Coordinator, where auto compaction can get stuck repeatedly trying the same interval. This issue can now be mitigated in 0.16 by setting targetCompactionSizeBytes, but is still present when using maxRowsPerSegment or maxTotalRows. A future version of Druid will fully resolve this issue.

    https://github.com/apache/incubator-druid/issues/8481 https://github.com/apache/incubator-druid/pull/8495

    # Indexing spec

    All native indexing task types now use PartitionsSpec to define secondary partitioning, to be consistent with other indexing task types. The changes are backwards compatible, but users should consider migrating maxRowsPerSegment, maxTotalRows, numShards, and partitionDimensions to the partitionsSpec of native, Kafka, and Kinesis indexing tasks.

    https://github.com/apache/incubator-druid/pull/8141

    # Kafka indexing

    Incremental publishing of segments for the Apache Kafka indexing service was introduced in Druid 0.12, at which time the existing Kafka indexing service was retained as a 'legacy' format to allow rolling update from older versions of Druid. This 'legacy' codebase has now been removed in Druid 0.16.0, which means that a rolling-update from a Druid version older than 0.12.0 is not supported.

    https://github.com/apache/incubator-druid/pull/7735

    # Avro extension

    The fromPigAvroStorage option has been removed from the Apache Avro extension in Druid 0.16, in order to clean up dependencies. This option provided a special transformation of GenericData.Array, but this column type will now be handled by the default List implementation.

    https://github.com/apache/incubator-druid/pull/7810

    # Kerberos extension

    druid.auth.authenticator.kerberos.excludedPaths has been superseded by druid.auth.unsecuredPaths, which applies across all authenticators/authorizers, so use this configuration property instead.

    https://github.com/apache/incubator-druid/pull/7745

    # Zookeeper

    The setting druid.zk.service.terminateDruidProcessOnConnectFail has been removed and this is now the default behavior - Druid processes will exit when an unhandled Curator client exception occurs instead of continuing to run in a 'zombie' state.

    Additionally, a new optional setting, druid.zk.service.connectionTimeoutMs, which wires into the Curator client connectionTimeoutMs setting, is now available. By default it will continue to use the Curator default of 15000 milliseconds.

    https://github.com/apache/incubator-druid/pull/8458

    # Realtime process

    The deprecated standalone 'realtime' process has been removed from Druid. Use of this has been discouraged for some time; middleManager and the new Indexer processes should now handle all of your realtime indexing tasks.

    https://github.com/apache/incubator-druid/pull/7915 https://github.com/apache/incubator-druid/pull/8020

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @abossert @AlexanderSaydakov @AlexandreYang @andrewluotechnologies @asdf2014 @awelsh93 @blugowski @capistrant @ccaominh @clintropolis @dclim @dene14 @dinsaw @Dylan1312 @esevastyanov @fjy @Fokko @gianm @gocho1 @himanshug @ilhanadiyaman @jennyzzz @jihoonson @Jinstarry @jon-wei @justinborromeo @kamaci @khwj @legoscia @leventov @litao91 @lml2468 @mcbrewster @nieroda @nishantmonu51 @pjain1 @pphust @samarthjain @SandishKumarHN @sashidhar @satybald @sekingme @shuqi7 @surekhasaharan @viongpanzi @vogievetsky @xueyumusic @xvrl @yurmix

  • druid-0.15.1-incubating(Aug 16, 2019)

    Apache Druid 0.15.1-incubating is a bug fix release that includes important fixes for Apache Zookeeper based segment loading, the 'druid-datasketches' extension, and much more.

    Bug Fixes

    Coordinator

    #8137 coordinator throwing exception trying to load segments (fixed by #8140)

    Middlemanager

    • #7886 Middlemanager fails startup due to corrupt task files (fixed by #7917)
    • #8085 fix forking task runner task shutdown to be more graceful

    Queries

    • #7777 timestamp_ceil function is either wrong or misleading (fixed by #7823)
    • #7820 subtotalsSpec and filtering returns no results (fixed by #7827)
    • #8013 Fix ExpressionVirtualColumn capabilities; fix groupBy's improper uses of StorageAdapter#getColumnCapabilities.

    API

    • #6786 apache-druid-0.13.0-incubating router /druid/router/v1/brokers (fixed by #8026)
    • #8044 SupervisorManager: Add authorization checks to bulk endpoints.

    Metrics Emitters

    #8204 HttpPostEmitter throw Class cast exception when using emitAndReturnBatch (fixed by #8205)

    Extensions
    Datasketches

    • #7666 sketches-core-0.13.4
    • #8055 force native order when wrapping ByteBuffer

    Kinesis Indexing Service

    #7830 Kinesis: Fix getPartitionIds, should be checking isHasMoreShards.

    Moving Average Query

    #7999 Druid moving average query results in circular reference error (fixed by #8192)

    Documentation Fixes

    • #8002 Improve pull-deps reference in extensions page.
    • #8003 Add missing reference to Materialized-View extension.
    • #8079 Fix documentation formatting
    • #8087 fix references to bin/supervise in tutorial docs

    Updating from 0.15.0-incubating and earlier

    Due to issue #8137, when updating specifically from 0.15.0-incubating to 0.15.1-incubating, it is recommended to update the Coordinator before the Historical servers to prevent segment unavailability during the upgrade (this is the reverse of the usual order). Upgrading from any version older than 0.15.0-incubating does not have this concern and can be done normally.

    Known Issues

    Building Docker images is currently broken and will be fixed in the next release, see #8054 which is fixed by #8237 for more details.

    Credits

    Thanks to everyone who contributed to this release!

    @AlexanderSaydakov @ArtyomyuS @ccl0326 @clintropolis @gianm @himanshug @jihoonson @legoscia @leventov @pjain1 @yurmix @xueyumusic

    Apache Druid (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

  • druid-0.15.0-incubating(Jun 27, 2019)

    Apache Druid 0.15.0-incubating contains over 250 new features, performance/stability/documentation improvements, and bug fixes from 39 contributors. Major new features and improvements include:

    • New Data Loader UI
    • Support transactional Kafka topic
    • New Moving Average query
    • Time ordering for Scan query
    • New Moments Sketch aggregator
    • SQL enhancements
    • Light lookup module for routers
    • Core ORC extension
    • Core GCP extension
    • Document improvements

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Aclosed+milestone%3A0.15.0

    Documentation for this release is at: http://druid.apache.org/docs/0.15.0-incubating/

    Highlights

    New Data Loader UI (Batch indexing part)

    Druid has a new Data Loader UI which is integrated with the Druid Console. The new Data Loader UI shows some sampled data so that you can easily verify the ingestion spec, and generates the final ingestion spec automatically. Users can now issue batch index tasks easily instead of writing a JSON spec by hand.

    Added by @vogievetsky and @dclim in https://github.com/apache/incubator-druid/pull/7572 and https://github.com/apache/incubator-druid/pull/7531, respectively.

    Support Kafka Transactional Topics

    The Kafka indexing service now supports Kafka Transactional Topics.

    Please note that only Kafka 0.11.0 or later is supported after this change.

    Added by @surekhasaharan in https://github.com/apache/incubator-druid/pull/6496.

    New Moving Average Query

    A new query type was introduced to compute moving averages.

    Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-contrib/moving-average-query.html for more details.

    Added by @yurmix in https://github.com/apache/incubator-druid/pull/6430.

    Time Ordering for Scan Query

    The Scan query type now supports time ordering. Please see http://druid.apache.org/docs/0.15.0-incubating/querying/scan-query.html#time-ordering for more details.

    Added by @justinborromeo in https://github.com/apache/incubator-druid/pull/7133.

    New Moments Sketch Aggregator

    The Moments Sketch is a new sketch type for approximate quantile computation. Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-contrib/momentsketch-quantiles.html for more details.

    Added by @edgan8 in https://github.com/apache/incubator-druid/pull/6581.

    SQL enhancements

    The Druid community has been striving to enhance SQL support, and it is no longer experimental.

    New SQL functions

    • LPAD and RPAD functions were added by @xueyumusic in https://github.com/apache/incubator-druid/pull/7388.
    • DEGREES and RADIANS functions were added by @xueyumusic in https://github.com/apache/incubator-druid/pull/7336.
    • STRING_FORMAT function was added by @gianm in https://github.com/apache/incubator-druid/pull/7327.
    • PARSE_LONG function was added by @gianm in https://github.com/apache/incubator-druid/pull/7326.
    • ROUND function was added by @gianm in https://github.com/apache/incubator-druid/pull/7224.
    • Trigonometric functions were added by @xueyumusic in https://github.com/apache/incubator-druid/pull/7182.

    Autocomplete in Druid Console

    Druid Console now supports autocomplete for SQL.

    Added by @shuqi7 in https://github.com/apache/incubator-druid/pull/7244.

    Time-ordered scan support for SQL

    Druid SQL now supports time-ordered scan queries.

    Added by @justinborromeo in https://github.com/apache/incubator-druid/pull/7373.

    Lookups view added to the web console

    You can now configure your lookups from the web console directly.

    Added by @shuqi7 in https://github.com/apache/incubator-druid/pull/7259.

    Misc web console improvements

    "NoSQL" mode : https://github.com/apache/incubator-druid/pull/7493 [@shuqi7]

    The web console now has a backup mode that allows it to function as best as it can if DruidSQL is disabled or unavailable.

    Added compaction configuration dialog : https://github.com/apache/incubator-druid/pull/7242 [@shuqi7]

    You can now configure the auto compaction settings for a data source from the Datasource view.

    Auto wrap query with limit : https://github.com/apache/incubator-druid/pull/7449 [@vogievetsky]

    The console query view will now (by default) wrap DruidSQL queries with a SELECT * FROM (...) LIMIT 1000 allowing you to enter queries like SELECT * FROM your_table without worrying about the impact to the cluster. You can still send 'raw' queries by selecting the option from the ... menu.
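
    The wrapping itself amounts to a string transformation. The function below is a hypothetical sketch of the idea and is not taken from the console's code; stripping a trailing semicolon is an added assumption so that the wrapped query stays valid.

```python
def wrap_with_limit(sql, limit=1000):
    """Wrap a query in an outer SELECT that caps the number of rows returned."""
    inner = sql.strip().rstrip(";").strip()   # drop a trailing semicolon, if any
    return f"SELECT * FROM ({inner}) LIMIT {limit}"


wrapped = wrap_with_limit("SELECT * FROM your_table;")
```

    The inner query is passed through unchanged, so only the number of rows returned to the console is capped.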

    SQL explain query : https://github.com/apache/incubator-druid/pull/7402 [@shuqi7]

    You can now click on the ... menu in the query view to get an explanation of the DruidSQL query.

    Surface is_overshadowed as a column in the segments table https://github.com/apache/incubator-druid/pull/7555 , https://github.com/apache/incubator-druid/pull/7425 [@shuqi7][@surekhasaharan]

    The is_overshadowed column indicates whether a segment is overshadowed by any published segments. It can be useful for seeing which segments should be loaded by historicals. Please see http://druid.apache.org/docs/0.15.0-incubating/querying/sql.html for more details.

    Improved status UI for actions on tasks, supervisors, and datasources : https://github.com/apache/incubator-druid/pull/7528 [@shuqi7]

    This PR condenses the actions list into a tidy menu and lets you see the detailed status for supervisors and tasks. New actions for datasources around loading and dropping data by interval have also been added.

    Light Lookup Module for Routers

    A light lookup module was introduced for Routers, so they now need only a minimal amount of memory. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/basic-cluster-tuning.html#router for basic memory tuning.

    Added by @clintropolis in https://github.com/apache/incubator-druid/pull/7222.

    Core ORC extension

    The ORC extension has been promoted to a core extension. Please read the 'Updating from 0.14.0-incubating and earlier' section below if you are using the ORC extension in an earlier version of Druid.

    Added by @clintropolis in https://github.com/apache/incubator-druid/pull/7138.

    Core GCP extension

    The GCP extension has been promoted to a core extension. Please read the 'Updating from 0.14.0-incubating and earlier' section below if you are using the GCP extension in an earlier version of Druid.

    Added by @drcrallen in https://github.com/apache/incubator-druid/pull/6953.

    Document Improvements

    Single-machine deployment example configurations and scripts

    Several configurations and scripts were added for easy single machine setup. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/single-server.html for details.

    Added by @jon-wei in https://github.com/apache/incubator-druid/pull/7590.

    Tool for migrating from local deep storage/Derby metadata

    A new tool was added for easy migration from single machine to a cluster environment. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/deep-storage-migration.html for details.

    Added by @jon-wei in https://github.com/apache/incubator-druid/pull/7598.

    Document for basic tuning guide

    A basic tuning guide was added to the documentation. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/basic-cluster-tuning.html for details.

    Added by @jon-wei in https://github.com/apache/incubator-druid/pull/7629.

    Security Improvement

    The Druid system tables now require only the minimum necessary permissions instead of read permission for the whole sys database. Please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-core/druid-basic-security.html for details.

    Added by @jon-wei in https://github.com/apache/incubator-druid/pull/7579.

    Deprecated/removed

    Drop support for automatic segment merge

    The automatic segment merge by the coordinator is not supported anymore. Please use auto compaction instead.

    Added by @jihoonson in #6883.

    Drop support for insert-segment-to-db tool

    In Druid 0.14.x and earlier, Druid stored segment metadata (the descriptor.json file) in deep storage in addition to the metadata store. This behavior has changed in 0.15.0, which no longer stores the segment metadata file in deep storage. As a result, the insert-segment-to-db tool is no longer supported, since it relies on the descriptor.json files in deep storage. Please see http://druid.apache.org/docs/0.15.0-incubating/operations/insert-segment-db.html for details.

    Please note that in 0.14.x or earlier versions, the kill task will fail if you're using HDFS as deep storage and the descriptor.json file is missing.

    Added by @jihoonson in https://github.com/apache/incubator-druid/pull/6911.

    Removed "useFallback" configuration for SQL

    This option was removed since it generates unscalable query plans and doesn't work with some SQL functions.

    Added by @gianm in https://github.com/apache/incubator-druid/pull/7567.

    Removed a public API in CompressionUtils for extension developers

    public static void gunzip(File pulledFile, File outDir) was removed in https://github.com/apache/incubator-druid/pull/6908 by @clintropolis.

    Other behavior changes

    Coordinator await initialization before finishing startup

    A new configuration (druid.coordinator.segment.awaitInitializationOnStart) was added to make Coordinator wait for segment view initialization. This option is enabled by default.

    Added by @QiuMM in https://github.com/apache/incubator-druid/pull/6847.

    Coordinator API behavior change

    The coordinator periodically polls segment metadata from the metadata store and caches it in memory. In Druid 0.14.x and earlier, removing segments via coordinator APIs (/druid/coordinator/v1/datasources/{dataSourceName} and /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}) immediately updated the in-memory segment cache as well as the metadata store. This behavior has changed in 0.15.0: the cache is now updated on each poll rather than immediately on removal. Until the cache is updated by the next poll, the APIs below can still return segments removed via the above API calls.

    • /druid/coordinator/v1/metadata/datasources/{dataSourceName}
    • /druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments
    • /druid/coordinator/v1/metadata/datasources/{dataSourceName}/segments/{segmentId}
    • /druid/coordinator/v1/metadata/datasources
    • /druid/coordinator/v1/loadstatus

    Until the cache is updated by the next poll, the metrics below can also include segments removed via the above API calls.

    • segment/unavailable/count
    • segment/underReplicated/count

    This behavior was changed in https://github.com/apache/incubator-druid/pull/7595 by @surekhasaharan.
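
    The window in which removed segments remain visible can be modeled with a toy cache. This sketch only mirrors the behavior described above; the class and segment ids are hypothetical and do not correspond to Druid's internal classes.

```python
class PolledSegmentCache:
    """Toy model of a cache that is refreshed only by periodic polls."""

    def __init__(self, metadata_store):
        self.metadata_store = metadata_store   # set of segment ids (source of truth)
        self.cache = set(metadata_store)       # in-memory snapshot from the last poll

    def remove_segment(self, segment_id):
        # 0.15.0 behavior: only the metadata store changes immediately.
        self.metadata_store.discard(segment_id)

    def poll(self):
        # The periodic poll replaces the snapshot with the current store contents.
        self.cache = set(self.metadata_store)

    def list_segments(self):
        return sorted(self.cache)


coordinator = PolledSegmentCache({"seg_a", "seg_b"})
coordinator.remove_segment("seg_a")
before_poll = coordinator.list_segments()   # still lists the removed segment
coordinator.poll()
after_poll = coordinator.list_segments()    # the removal is now visible
```

    Clients that remove a segment and immediately list segments should therefore expect stale results until the next poll completes.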

    Listing Lookup API change

    The /druid/coordinator/v1/lookups/config API now returns a list of tiers currently active in the cluster in addition to ones known in the dynamic configuration.

    Added by @clintropolis in https://github.com/apache/incubator-druid/pull/7647.

    Zookeeper loss

    With a new configuration (druid.zk.service.terminateDruidProcessOnConnectFail), Druid processes can terminate themselves on disconnection from ZooKeeper.

    Added by @michael-trelinski in https://github.com/apache/incubator-druid/pull/6740.

    Updating from 0.14.0-incubating and earlier

    Minimum compatible Kafka version change for Kafka Indexing Service

    Only Kafka 0.11.x or later is supported after https://github.com/apache/incubator-druid/pull/6496. Please consider updating your Kafka version if you're using an older one.

    ORC extension changes

    The ORC extension has been promoted to a core extension. When deploying 0.15.0-incubating, please ensure that your extensions directory does not have any older versions of the druid-orc-extensions extension.

    Additionally, even though the new core extension can index any data the old contrib extension could, the JSON spec for the ingestion task is incompatible, and will need to be modified to work with the newer core extension.

    To migrate to 0.15.0-incubating:

    • In inputSpec of ioConfig, inputFormat must be changed from "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat" to "org.apache.orc.mapreduce.OrcInputFormat"
    • The contrib extension supported a typeString property, which provided the schema of the ORC file. It was essentially required to have the types correct, but notably not the column names, which facilitated column renaming. In the core extension, column renaming can be achieved with flattenSpec expressions.
    • The contrib extension supported a mapFieldNameFormat property, which provided a way to specify a dimension to flatten OrcMap columns with primitive types. This functionality has also been replaced with flattenSpec expressions.

    For more details and examples, please see http://druid.apache.org/docs/0.15.0-incubating/development/extensions-core/orc.html.
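
    The first migration step can be automated for existing task specs. The function below is a hedged sketch: it assumes the task JSON nests the value under spec, ioConfig, inputSpec, and it handles only the inputFormat change. The typeString and mapFieldNameFormat properties still have to be translated to flattenSpec expressions by hand.

```python
import copy

OLD_FORMAT = "org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"
NEW_FORMAT = "org.apache.orc.mapreduce.OrcInputFormat"


def migrate_orc_input_format(task):
    """Return a copy of the task with the contrib inputFormat replaced."""
    migrated = copy.deepcopy(task)
    # Assumed layout: task["spec"]["ioConfig"]["inputSpec"]["inputFormat"].
    input_spec = migrated["spec"]["ioConfig"]["inputSpec"]
    if input_spec.get("inputFormat") == OLD_FORMAT:
        input_spec["inputFormat"] = NEW_FORMAT
    return migrated


# Hypothetical, heavily trimmed task spec.
old_task = {
    "spec": {
        "ioConfig": {
            "inputSpec": {"inputFormat": OLD_FORMAT, "paths": "/data/example.orc"}
        }
    }
}
new_task = migrate_orc_input_format(old_task)
```

    The input task is left untouched, so the old and new specs can be compared before resubmitting.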

    GCP extension changes

    The GCP extension has been promoted to a core extension. When deploying 0.15.0-incubating, please ensure that your extensions directory does not have any older versions of the druid-google-extensions extension.

    Dropped auto segment merge

    The coordinator configuration for auto segment merge (druid.coordinator.merge.on) is not supported anymore. Please use auto compaction instead.

    Removed descriptor.json metadata file in deep storage

    The segment metadata file (descriptor.json) is not stored in deep storage any more. If you are using HDFS as your deep storage and need to roll back to 0.14.x or earlier, then please consider that the kill task could fail because of the missing descriptor.json files.

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @asdf2014 @capistrant @clintropolis @dampcake @dclim @donbowman @drcrallen @Dylan1312 @edgan8 @es1220 @esevastyanov @FaxianZhao @fjy @gianm @glasser @hpandeycodeit @jihoonson @jon-wei @jorbay-au @justinborromeo @kamaci @KazuhitoT @leventov @lxqfy @michael-trelinski @peferron @puneetjaiswal @QiuMM @richardstartin @samarthjain @scrawfor @shuqi7 @surekhasaharan @venkatramanp @vogievetsky @xueyumusic @xvrl @yurmix

  • druid-0.14.2-incubating(May 27, 2019)

    Apache Druid 0.14.2-incubating is a bug fix release that includes important fixes for the 'druid-datasketches' extension and the broker 'result' level caching.

    Bug Fixes

    • #7607 thetaSketch(with sketches-core-0.13.1) in groupBy always return value no more than 16384
    • #6483 Exception during sketch aggregations while using Result level cache
    • #7621 NPE when both populateResultLevelCache and grandTotal are set

    Credits

    Thanks to everyone who contributed to this release!

    @AlexanderSaydakov @clintropolis @jihoonson @jon-wei

  • druid-0.14.1-incubating(May 9, 2019)

    Apache Druid 0.14.1-incubating is a small patch release that includes a handful of bug and documentation fixes from 16 contributors.

    Important Notice

    This release fixes an issue with quantile sketches in the druid-datasketches extension, but introduces another one with theta sketches that was confirmed after the release was finalized, caused by #7320 and described in #7607. If you utilize theta sketches, we recommend not upgrading to this release. This will be fixed in the next release of Druid by #7619.

    Bug Fixes

    • use latest sketches-core-0.13.1 #7320
    • Adjust BufferAggregator.get() impls to return copies #7464
    • DoublesSketchComplexMetricSerde: Handle empty strings. #7429
    • handle empty sketches #7526
    • Adds backwards-compatible serde for SeekableStreamStartSequenceNumbers. #7512
    • Support Kafka supervisor adopting running tasks between versions #7212
    • Fix time-extraction topN with non-STRING outputType. #7257
    • Fix two issues with Coordinator -> Overlord communication. #7412
    • refactor druid-bloom-filter aggregators #7496
    • Fix encoded taskId check in chatHandlerResource #7520
    • Fix too many dentry cache slab objs (#7508). #7509
    • Fix result-level cache for queries #7325
    • Fix flattening Avro Maps with Utf8 keys #7258
    • Write null byte when indexing numeric dimensions with Hadoop #7020
    • Batch hadoop ingestion job doesn't work correctly with custom segments table #7492
    • Fix aggregatorFactory meta merge exception #7504

    Documentation Changes

    • Fix broken link due to Typo. #7513
    • Some docs optimization #6890
    • Updated Javascript Affinity config docs #7441
    • fix expressions docs operator table #7420
    • Fix conflicting information in configuration doc #7299
    • Add missing doc link for operations/http-compression.html #7110

    Updating from 0.14.0-incubating and earlier

    Kafka Ingestion

    Unlike the migration path to 0.14.0-incubating, updating from version 0.13.0-incubating or earlier directly to 0.14.1-incubating does not require downtime, because the issue described in #6958 has been fixed for this release in #7212. Likewise, rolling updates from version 0.13.0-incubating and earlier should also work properly due to #7512.

    Native Parallel Ingestion

    Updating from 0.13.0-incubating directly to 0.14.1-incubating should not encounter the issues that could occur during a rolling update to 0.14.0-incubating with mixed versions of MiddleManagers; these are addressed by the fixes in #7520.

    Credits

    Thanks to everyone who contributed to this release!

    @AlexanderSaydakov @b-slim @benhopp @chrishardis @clintropolis @ferristseng @es1220 @gianm @jihoonson @jon-wei @justinborromeo @kaka11chen @samarthjain @surekhasaharan @zhaojiandong @zhztheplayer

  • druid-0.14.0-incubating-rc3(Apr 9, 2019)

  • druid-0.14.0-incubating(Apr 9, 2019)

    Apache Druid (incubating) 0.14.0-incubating contains over 200 new features, performance/stability/documentation improvements, and bug fixes from 54 contributors. Major new features and improvements include:

    • New web console
    • Amazon Kinesis indexing service
    • Decommissioning mode for Historicals
    • Published segment cache in Broker
    • Bloom filter aggregator and expression
    • Updated Apache Parquet extension
    • Force push down option for nested GroupBy queries
    • Better segment handoff and drop rule handling
    • Automatically kill MapReduce jobs when Apache Hadoop ingestion tasks are killed
    • DogStatsD tag support for statsd emitter
    • New API for retrieving all lookup specs
    • New compaction options
    • More efficient cachingCost segment balancing strategy

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Amerged+milestone%3A0.14.0

    Documentation for this release is at: http://druid.io/docs/0.14.0-incubating/

    Highlights

    New web console

    Druid has a new web console that provides functionality that was previously split between the coordinator and overlord consoles.

    The new console allows the user to manage datasources, segments, tasks, data processes (Historicals and MiddleManagers), and coordinator dynamic configuration. The user can also run SQL and native Druid queries within the console.

    For more details, please see http://druid.io/docs/0.14.0-incubating/operations/management-uis.html

    Added by @vogievetsky in #6923.

    Kinesis indexing service

    Druid now supports ingestion from Kinesis streams, provided by the new druid-kinesis-indexing-service core extension.

    Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/kinesis-ingestion.html for details.

    Added by @jsun98 in #6431.

    Decommissioning mode for Historicals

    Historical processes can now be put into a "decommissioning" mode, where the coordinator will no longer consider the Historical process as a target for segment replication. The coordinator will also move segments off the decommissioning Historical.

    This is controlled via Coordinator dynamic configuration. For more details, please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#dynamic-configuration.

    Added by @egor-ryashin in #6349.

    Published segment cache on Broker

    The Druid Broker now has the ability to maintain a cache of published segments via polling the Coordinator, which can significantly improve response time for metadata queries on the sys.segments system table.

    Please see http://druid.io/docs/0.14.0-incubating/querying/sql.html#retrieving-metadata for details.

    Added by @surekhasaharan in #6901

    Bloom filter aggregator and expression

    A new aggregator for constructing Bloom filters at query time and support for performing Bloom filter checks within Druid expressions have been added to the druid-bloom-filter extension.

    Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/bloom-filter.html

    Added by @clintropolis in #6904 and #6397

    Updated Parquet extension

    The druid-parquet-extensions extension has been moved into the core extension set from the contrib extensions and now supports flattening and int96 values.

    Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.

    Added by @clintropolis in #6360

    Force push down option for nested GroupBy queries

    Outer query execution for nested GroupBy queries can now be pushed down to Historical processes; previously, the outer queries would always be executed on the Broker.

    Please see https://github.com/apache/incubator-druid/pull/5471 for details.

    Added by @samarthjain in #5471.

    Better segment handoff and retention rule handling

    Segment handoff will now ignore segments that would be dropped by a datasource's retention rules, avoiding ingestion failures caused by issue #5868.

    Period load rules will now include the future by default.

    A new "Period Drop Before" rule has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/rule-configuration.html#period-drop-before-rule for details.

    Added by @QiuMM in #6676, #6414, and #6415.

    Automatically kill MapReduce jobs when Hadoop ingestion tasks are killed

    Druid will now automatically terminate MapReduce jobs created by Hadoop batch ingestion tasks when the ingestion task is killed.

    Added by @ankit0811 in #6828.

    DogStatsD tag support for statsd-emitter

    The statsd-emitter extension now supports DogStatsD-style tags. Please see http://druid.io/docs/0.14.0-incubating/development/extensions-contrib/statsd.html

    Added by @deiwin in #6605, with support for constant tags added by @glasser in #6791.

    New API for retrieving all lookup specs

    A new API for retrieving all lookup specs for all tiers has been added. Please see http://druid.io/docs/0.14.0-incubating/querying/lookups.html#get-all-lookups for details.

    Added by @jihoonson in #7025.

    New compaction options

    Auto-compaction now supports the maxRowsPerSegment option. Please see http://druid.io/docs/0.14.0-incubating/design/coordinator.html#compacting-segments for details.

    The compaction task now supports a new segmentGranularity option, deprecating the older keepSegmentGranularity option for controlling the segment granularity of compacted segments. Please see the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.

    Added by @jihoonson in #6758 and #6780.

    More efficient cachingCost segment balancing strategy

    The cachingCost Coordinator segment balancing strategy will now only consider Historical processes for balancing decisions. Previously the strategy would unnecessarily consider active worker tasks as well, which are not targets for segment replication.

    Added by @QiuMM in #6879.

    New metrics:

    • New allocation rate metric jvm/heapAlloc/bytes, added by @egor-ryashin in #6710.
    • New query count metric query/count, added by @QiuMM in #6473.
    • SQL query metrics sqlQuery/bytes and sqlQuery/time, added by @gaodayue in #6302.
    • Apache Kafka ingestion lag metrics ingest/kafka/maxLag and ingest/kafka/avgLag, added by @QiuMM in #6587
    • Task count metrics task/success/count, task/failed/count, task/running/count, task/pending/count, task/waiting/count, added by @QiuMM in #6657

    New interfaces for extension developers

    RequestLogEvent

    It is now possible to control the fields in RequestLogEvent, emitted by EmittingRequestLogger. Please see #6477 for details. Added by @leventov.

    Custom TLS certificate checks

    An extension point for custom TLS certificate checks has been added. Please see http://druid.io/docs/0.14.0-incubating/operations/tls-support.html#custom-tls-certificate-checks for details. Added by @jon-wei in #6432.

    Kafka Indexing Service no longer experimental

    The Kafka Indexing Service extension has been moved out of experimental status.

    SQL Enhancements

    Enhancements to dsql

    The dsql command line client now supports CLI history, basic autocomplete, and specifying query timeouts in the query context.

    Added in #6929 by @gianm.

    Add SQL id, request logs, and metrics

    SQL queries now have an ID, and native queries executed as part of a SQL query will have the associated SQL query ID in the native query's request logs. SQL queries will now be logged in the request logs.

    Two new metrics, sqlQuery/time and sqlQuery/bytes, are now emitted for SQL queries.

    Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#request-logging and http://druid.io/docs/0.14.0-incubating/querying/sql.html#sql-metrics for details.

    Added by @gaodayue in #6302

    More SQL aggregator support

    The following aggregators are now supported in SQL:

    • DataSketches HLL sketch
    • DataSketches Theta sketch
    • DataSketches quantiles sketch
    • Fixed bins histogram
    • Bloom filter aggregator

    Added by @jon-wei in #6951 and @clintropolis in #6502

    Other SQL enhancements

    • SQL: Add support for queries with project-after-semijoin. #6756
    • SQL: Support for selecting multi-value dimensions. #6462
    • SQL: Support AVG on system tables. #601
    • SQL: Add "POSITION" function. #6596
    • SQL: Set INFORMATION_SCHEMA catalog name to "druid". #6595
    • SQL: Fix ordering of sort, sortProject in DruidSemiJoin. #6769

    Added by @gianm.

    Updating from 0.13.0-incubating and earlier

    Kafka ingestion downtime when upgrading

    Due to the issue described in #6958, existing Kafka indexing tasks can be terminated unnecessarily during a rolling upgrade of the Overlord. The terminated tasks will be restarted by the Overlord and will function correctly after the initial restart.

    Parquet extension changes

    The druid-parquet-extensions extension has been moved from contrib to core. When deploying 0.14.0-incubating, please ensure that your extensions-contrib directory does not have any older versions of the Parquet extension.

    Additionally, there are now two styles of Parquet parsers in the extension:

    • parquet-avro: Converts Parquet to Avro, and then parses the Avro representation. This was the existing parser prior to 0.14.0-incubating.
    • parquet: A new parser that parses the Parquet format directly. Only this new parser supports int96 values.

    Prior to 0.14.0-incubating, specifying a parquet type parser would have a task use the Avro-converting parser. In 0.14.0-incubating, to continue using the Avro-converting parser, you will need to update your ingestion specs to use parquet-avro instead.

    The inputFormat field in the inputSpec for tasks using Parquet input must also match the choice of parser:

    • parquet: org.apache.druid.data.input.parquet.DruidParquetInputFormat
    • parquet-avro: org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat

    Please see http://druid.io/docs/0.14.0-incubating/development/extensions-core/parquet.html for details.
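
    The pairing rule above can be checked mechanically before submitting a task. The following is a minimal sketch, not part of Druid: the helper name check_parquet_spec is hypothetical, and the parquet-avro class name (DruidParquetAvroInputFormat) is an assumption taken from the Parquet extension documentation.

```python
# Parser type -> inputFormat class expected in the task's inputSpec.
# Assumption: the parquet-avro class name follows the Parquet extension docs.
INPUT_FORMAT_FOR_PARSER = {
    "parquet": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "parquet-avro": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
}


def check_parquet_spec(parser_type: str, input_format: str) -> bool:
    """Return True when the inputFormat class matches the chosen parser type."""
    expected = INPUT_FORMAT_FOR_PARSER.get(parser_type)
    if expected is None:
        raise ValueError(f"not a Parquet parser type: {parser_type!r}")
    return input_format == expected


if __name__ == "__main__":
    # A spec that switched to the new parser but kept the Avro input format.
    print(check_parquet_spec(
        "parquet",
        "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
    ))
```

    A check like this is mostly useful when migrating many existing specs from the pre-0.14.0 parquet parser to parquet-avro.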

    Running Druid with non-2.8.3 Hadoop

    If you plan to use Druid 0.14.0-incubating with Hadoop versions other than 2.8.3, you may need to do the following:

    • Set the Hadoop dependency coordinates to your target version as described in http://druid.io/docs/0.14.0-incubating/operations/other-hadoop.html under Tip #3: Use specific versions of Hadoop libraries.
    • Rebuild Druid with your target version of Hadoop by changing hadoop.compile.version in the main Druid pom.xml and then following the standard build instructions.

    Other Behavior changes

    Old task cleanup

    Old task entries in the metadata storage will now be cleaned up automatically together with their task logs. Please see http://druid.io/docs/0.14.0-incubating/configuration/index.html#task-logging and #6592 for details.

    Automatic processing buffer sizing

    The druid.processing.buffer.sizeBytes property has new default behavior if it is not set. Druid will now automatically choose a value for the processing buffer size using the following formula:

    processingBufferSize = totalDirectMemory / (numMergeBuffers + numProcessingThreads + 1)
    processingBufferSize = min(processingBufferSize, 1GB)
    

    Where:

    • totalDirectMemory: The direct memory limit for the JVM specified by -XX:MaxDirectMemorySize
    • numMergeBuffers: The value of druid.processing.numMergeBuffers.
    • numProcessingThreads: The value of druid.processing.numThreads.

    At most, Druid will use 1GB for the automatically chosen processing buffer size. The processing buffer size can still be specified manually.

    Please see #6588 for details.
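
    As a worked example of the formula above, here is a small sketch in Python. The function name is hypothetical, and two details are assumptions: "1GB" is taken to mean 2**30 bytes, and the division is taken to round down to whole bytes.

```python
ONE_GIB = 1024 ** 3  # assumption: "1GB" in the formula means 2**30 bytes


def auto_processing_buffer_size(total_direct_memory: int,
                                num_merge_buffers: int,
                                num_processing_threads: int) -> int:
    """Apply the two-step formula from the release notes (all sizes in bytes)."""
    if num_merge_buffers < 0 or num_processing_threads < 0:
        raise ValueError("buffer and thread counts must be non-negative")
    # Step 1: divide direct memory by (merge buffers + processing threads + 1).
    size = total_direct_memory // (num_merge_buffers + num_processing_threads + 1)
    # Step 2: cap the result at 1 GB.
    return min(size, ONE_GIB)


if __name__ == "__main__":
    # Hypothetical node: -XX:MaxDirectMemorySize=8g, 2 merge buffers, 7 threads.
    print(auto_processing_buffer_size(8 * ONE_GIB, 2, 7))
```

    With 8 GiB of direct memory, 2 merge buffers, and 7 processing threads, the divisor is 10, so each buffer gets roughly 819 MiB. With 64 GiB and the same counts, the 1 GB cap applies.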

    Retention rules now include the future by default

    Please be aware that new retention rules will now include the future by default. Please see #6414 for details.

    Property changes

    Segment announcing

    The druid.announcer.type property used for choosing between Zookeeper or HTTP-based segment management/discovery has been moved to druid.serverview.type. If you were using http prior to 0.14.0-incubating, you will need to update your configs to use the new druid.serverview.type.

    Please see the following for details:

    • http://druid.io/docs/0.14.0-incubating/configuration/index.html#segment-management
    • http://druid.io/docs/0.14.0-incubating/configuration/index.html#segment-discovery

    Fix for missing property in JsonTypeInfo of SegmentWriteOutMediumFactory

    The druid.peon.defaultSegmentWriteOutMediumFactory.@type property has been fixed. The property is now druid.peon.defaultSegmentWriteOutMediumFactory.type without the "@".

    Please see #6656 for details.

    Deprecations

    Approximate Histogram aggregator

    The ApproximateHistogram aggregator has been deprecated; it is a distribution-dependent algorithm without formal error bounds and has significant accuracy issues.

    The DataSketches quantiles aggregator should be used instead for quantile and histogram use cases.

    Please see Histogram and Quantiles Aggregators for details.

    Cardinality/HyperUnique aggregator

    The Cardinality and HyperUnique aggregators have been deprecated in favor of the DataSketches HLL aggregator and Theta Sketch aggregator. These aggregators have better accuracy and performance characteristics.

    Please see Count Distinct Aggregators for details.

    Query Chunk Period

    The chunkPeriod query context configuration is now deprecated, along with the associated query/intervalChunk/time metric. Please see #6591 for details.

    keepSegmentGranularity for Compaction

    The keepSegmentGranularity option for compaction tasks has been deprecated. Please see #6758 and the segmentGranularity table in http://druid.io/docs/0.14.0-incubating/ingestion/compaction.html for more information on these properties.

    Interface changes for extension developers

    SegmentId class

    Druid now uses a SegmentId class instead of plain Strings to represent segment IDs. Please see #6370 for details.

    Added by @leventov.

    druid-api, druid-common, java-util moved to druid-core

    The druid-api, druid-common, and java-util modules have been moved into druid-core. Please update your dependencies accordingly if your project depended on these libraries.

    Please see #6443 for details.

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @AlexanderSaydakov @anantmf @ankit0811 @asdf2014 @awelsh93 @benhopp @Caroline1000 @clintropolis @dclim @deiwin @DiegoEliasCosta @drcrallen @dyf6372 @Dylan1312 @egor-ryashin @elloooooo @evans @FaxianZhao @gaodayue @gianm @glasser @Guadrado @hate13 @hoesler @hpandeycodeit @janeklb @jihoonson @jon-wei @jorbay-au @jsun98 @justinborromeo @kamaci @leventov @lxqfy @mirkojotic @navkumar @niketh @patelh @pzhdfy @QiuMM @rcgarcia74 @richardstartin @robertervin @samarthjain @seoeun25 @Shimi @surekhasaharan @taiii @thomask @VincentNewkirk @vogievetsky @yunwan @zhaojiandong

  • druid-0.13.0-incubating(Dec 12, 2018)

    Druid 0.13.0-incubating contains over 400 new features, performance/stability/documentation improvements, and bug fixes from 81 contributors. It is the first release of Druid in the Apache Incubator program. Major new features and improvements include:

    • native parallel batch indexing
    • automatic segment compaction
    • system schema tables
    • improved indexing task status, statistics, and error reporting
    • SQL-compatible null handling
    • result-level broker caching
    • ingestion from RDBMS
    • Bloom filter support
    • additional SQL result formats
    • additional aggregators (stringFirst/stringLast, ArrayOfDoublesSketch, HllSketch)
    • support for multiple grouping specs in groupBy query
    • mutual TLS support
    • HTTP-based worker management
    • broker backpressure
    • maxBytesInMemory ingestion tuning configuration
    • materialized views (community extension)
    • parser for Influx Line Protocol (community extension)
    • OpenTSDB emitter (community extension)

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+is%3Aclosed+milestone%3A0.13.0

    Documentation for this release is at: http://druid.io/docs/0.13.0-incubating/

    Highlights

    Native parallel batch indexing

    Introduces the index_parallel supervisor which manages the parallel batch ingestion of splittable sources without requiring a dependency on Hadoop. See http://druid.io/docs/latest/ingestion/native_tasks.html for more information.

    Note: This is the initial single-phase implementation and has limitations on how it expects the input data to be partitioned. Notably, it does not have a shuffle implementation, which will be added in the next iteration of this feature. For more details, see the proposal at #5543.

    Added by @jihoonson in #5492.

    Automatic segment compaction

    Previously, compacting small segments into optimally-sized ones to improve query performance required submitting and running compaction or re-indexing tasks. This was often a manual process or required an external scheduler to handle the periodic submission of tasks. This patch implements automatic segment compaction managed by the coordinator service.

    Note: This is the initial implementation and has limitations on interoperability with realtime ingestion tasks. Indexing tasks currently require acquisition of a lock on the portion of the timeline they will be modifying to prevent inconsistencies from concurrent operations. This implementation uses low-priority locks to ensure that it never interrupts realtime ingestion, but this also means that compaction may fail to make any progress if the realtime tasks are continually acquiring locks on the time interval being compacted. This will be improved in the next iteration of this feature with finer-grained locking. For more details, see the proposal at #4479.

    Documentation for this feature: http://druid.io/docs/0.13.0-incubating/design/coordinator.html#compacting-segments

    Added by @jihoonson in #5102.

    System schema tables

    Adds a system schema to the SQL interface which contains tables exposing information on served and published segments, nodes of the cluster, and information on running and completed indexing tasks.

    Note: This implementation contains some known overhead inefficiencies that will be addressed in a future patch.

    Documentation for this feature: http://druid.io/docs/0.13.0-incubating/querying/sql.html#system-schema

    Added by @surekhasaharan in #6094.

    Improved indexing task status, statistics, and error reporting

    Improves the performance and detail of the ingestion-related APIs, which were previously quite opaque, making it difficult to determine the cause of parse exceptions, task failures, and the actual output from a completed task. Also adds improved ingestion metric reporting, including moving average throughput statistics.

    Added by @surekhasaharan and @jon-wei in #5801, #5418, and #5748.

    SQL-compatible null handling

    Improves Druid's handling of null values by treating them as missing values instead of being equivalent to empty strings or a zero-value. This makes Druid more SQL compatible and improves integration with external BI tools supporting ODBC/JDBC. See #4349 for proposal.

    To enable this feature, you will need to set the system-wide property druid.generic.useDefaultValueForNull=false.

    Added by @nishantmonu51 in #5278 and #5958.

    Result-level broker caching

    Implements result-level caching on brokers which can operate concurrently with the traditional segment-level cache. See #4843 for proposal.

    Documentation for this feature: http://druid.io/docs/0.13.0-incubating/configuration/index.html#broker-caching

    Added by @a2l007 in #5028.

    Ingestion from RDBMS

    Introduces a sql firehose which supports data ingestion directly from an RDBMS.

    Added by @a2l007 in #5441.

    Bloom filter support

    Adds support for optimizing Druid queries by applying a Bloom filter generated by an external system such as Apache Hive. In the future, #6397 will support generation of Bloom filters as the result of Druid queries which can then be used to optimize future queries.

    Added by @nishantmonu51 in #6222.

    Additional SQL result formats

    Adds line-based JSON and CSV result formats, as well as X-Druid-Column-Names and X-Druid-Column-Types response headers that list the columns contained in the result.

    Added by @gianm in #6191.
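
    To illustrate how a client might combine the column-name header with a line-based JSON body, here is a minimal sketch. It makes several assumptions that should be checked against the SQL documentation: the header value is a JSON array of column names, each body line is one JSON array of values, and blank lines can be skipped. The function name and the sample data are made up for illustration.

```python
import json


def rows_from_array_lines(column_names_header: str, body: str) -> list:
    """Pair each line of a line-delimited JSON-array body with column names."""
    # Assumption: the header carries a JSON array such as '["page","edits"]'.
    columns = json.loads(column_names_header)
    rows = []
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing empty line
        values = json.loads(line)
        rows.append(dict(zip(columns, values)))
    return rows


if __name__ == "__main__":
    # Hypothetical header value and response body.
    header = '["page", "edits"]'
    body = '["Main_Page", 42]\n["Druid", 7]\n\n'
    for row in rows_from_array_lines(header, body):
        print(row)
```

    Line-based formats let a client process one row at a time instead of buffering the whole response, which is the main reason to prefer them for large results.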

    'stringLast' and 'stringFirst' aggregators

    Introduces two complementary aggregators, stringLast and stringFirst, which operate on string columns and return the value with the maximum and minimum timestamp, respectively.

    Added by @andresgomezfrr in #5789.

    ArrayOfDoublesSketch

    Adds support for numeric Tuple sketches, which extend the functionality of the count distinct Theta sketches by adding arrays of double values associated with unique keys.

    Added by @AlexanderSaydakov in #5148.

    HllSketch

    Adds a configurable implementation of a count distinct aggregator based on HllSketch from https://github.com/DataSketches. Comparison to Druid's native HyperLogLogCollector shows improved accuracy, efficiency, and speed: https://datasketches.github.io/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html

    Added by @AlexanderSaydakov in #5712.

    Support for multiple grouping specs in groupBy query

    Adds support for the subtotalsSpec groupBy parameter which allows Druid to be efficient by reusing intermediate results at the broker level when running multiple queries that group by subsets of the same set of columns. See proposal in #5179 for more information.

    Added by @himanshug in #5280.
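
    A groupBy query using subtotalsSpec might be assembled as in the sketch below. The helper name, datasource, interval, and dimension names are hypothetical; the subset check reflects the description above (each grouping is a subset of the query's dimensions) and is not taken from Druid's own validation code.

```python
import json


def group_by_with_subtotals(datasource: str, interval: str,
                            dimensions: list, subtotals: list) -> dict:
    """Build a native groupBy query whose subtotalsSpec lists dimension subsets."""
    for subset in subtotals:
        unknown = [d for d in subset if d not in dimensions]
        if unknown:
            raise ValueError(f"subtotal uses dimensions not in the query: {unknown}")
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "dimensions": list(dimensions),
        "aggregations": [{"type": "count", "name": "rows"}],
        # One result set per entry: by both columns, by channel only, grand total.
        "subtotalsSpec": [list(s) for s in subtotals],
    }


if __name__ == "__main__":
    query = group_by_with_subtotals(
        "wikipedia", "2018-01-01/2018-01-02",
        ["channel", "page"],
        [["channel", "page"], ["channel"], []],
    )
    print(json.dumps(query, indent=2))
```

    The single query above stands in for three separate groupBy queries, which is where the reuse of intermediate results comes from.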

    Mutual TLS support

    Adds support for mutual TLS (server certificate validation + client certificate validation). See: https://en.wikipedia.org/wiki/Mutual_authentication

    Added by @jon-wei in #6076.

    HTTP based worker management

    Adds an HTTP-based indexing task management implementation to replace the previous one based on ZooKeeper. Part of a set of improvements to reduce and eventually eliminate Druid's dependency on ZooKeeper. See #4996 for proposal.

    Added by @himanshug in #5104.

    Broker backpressure

    Allows the broker to exert backpressure on data-serving nodes to prevent the broker from crashing under memory pressure when results are coming in faster than they are being read by clients.

    Added by @gianm in #6313.

    'maxBytesInMemory' ingestion tuning configuration

    Previously, a major tuning parameter for indexing task memory management was the maxRowsInMemory configuration, which determined the threshold for spilling the contents of memory to disk. This was difficult to properly configure since the 'size' of a row varied based on multiple factors. maxBytesInMemory makes this configuration byte-based instead of row-based.

    Added by @surekhasaharan in #5583.
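
    The difference between the two thresholds can be illustrated with a toy model. This is not Druid's implementation: the class below is hypothetical, and it assumes a spill is triggered as soon as either the row limit or the byte limit is reached.

```python
class SpillTracker:
    """Toy model of an in-memory buffer with row-based and byte-based limits."""

    def __init__(self, max_rows: int, max_bytes: int):
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self.rows = 0
        self.bytes = 0
        self.spills = 0

    def add(self, row_size_bytes: int) -> bool:
        """Record one row; return True if this row triggered a spill to disk."""
        self.rows += 1
        self.bytes += row_size_bytes
        if self.rows >= self.max_rows or self.bytes >= self.max_bytes:
            # Simulate persisting to disk: reset the in-memory counters.
            self.rows = 0
            self.bytes = 0
            self.spills += 1
            return True
        return False


if __name__ == "__main__":
    # Wide rows hit the byte limit long before the row limit.
    tracker = SpillTracker(max_rows=1_000_000, max_bytes=1000)
    print([tracker.add(400) for _ in range(3)])
```

    With 400-byte rows and a 1000-byte limit, the third row triggers the spill even though the row limit is nowhere near reached, which is the case a row-only threshold handles poorly.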

    Materialized views

    Supports the creation of materialized views which can improve query performance in certain situations at the cost of additional storage. See http://druid.io/docs/latest/development/extensions-contrib/materialized-view.html for more information.

    Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.

    Added by @zhangxinyu1 in #5556.

    Parser for Influx Line Protocol

    Adds support for ingesting the Influx Line Protocol data format. For more information, see: https://docs.influxdata.com/influxdb/v1.6/write_protocols/line_protocol_tutorial/

    Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.

    Added by @njhartwell in #5440.

    OpenTSDB emitter

    Adds support for emitting Druid metrics to OpenTSDB.

    Note: This is a community-contributed extension and is not automatically included in the Druid distribution. We welcome feedback for deciding when to promote this to a core extension. For more information, see Community Extensions.

    Added by @QiuMM in #5380.

    Updating from 0.12.3 and earlier

    Please see below for changes between 0.12.3 and 0.13.0 that you should be aware of before upgrading. If you're updating from an earlier version than 0.12.3, please see release notes of the relevant intermediate versions for additional notes.

    MySQL metadata storage extension no longer includes JDBC driver

    The MySQL metadata storage extension is now packaged together with the Druid distribution but without the required MySQL JDBC driver (due to licensing restrictions). To use this extension, the driver will need to be downloaded separately and added to the extension directory.

    See http://druid.io/docs/latest/development/extensions-core/mysql.html for more details.

    AWS region configuration required for S3 extension

    As a result of switching from jets3t to the AWS SDK (#5382), users of the S3 extension are now required to explicitly set the target region. This can be done by setting the JVM system property aws.region or the environment variable AWS_REGION.

    As an example, to set the region to 'us-east-1' through system properties:

    • add -Daws.region=us-east-1 to the jvm.config file for all Druid services
    • add -Daws.region=us-east-1 to druid.indexer.runner.javaOpts in middleManager/runtime.properties so that the property will be passed to peon (worker) processes

    Ingestion spec changes

    As a result of renaming packaging from io.druid to org.apache.druid, ingestion specs that reference classes by their fully-qualified class name will need to be modified accordingly.

    As an example, if you are using the Parquet extension with Hadoop indexing, the inputFormat field of the inputSpec will need to change from io.druid.data.input.parquet.DruidParquetInputFormat to org.apache.druid.data.input.parquet.DruidParquetInputFormat.
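
    For users with many stored ingestion specs, the rename can be scripted. The sketch below is one possible approach, not an official migration tool: the function name is hypothetical, and it only rewrites string values that begin with the old package prefix, leaving keys and other strings untouched.

```python
OLD_PREFIX = "io.druid."
NEW_PREFIX = "org.apache.druid."


def rename_packages(node):
    """Recursively rewrite string values that start with the old package prefix."""
    if isinstance(node, dict):
        return {key: rename_packages(value) for key, value in node.items()}
    if isinstance(node, list):
        return [rename_packages(value) for value in node]
    if isinstance(node, str) and node.startswith(OLD_PREFIX):
        return NEW_PREFIX + node[len(OLD_PREFIX):]
    return node


if __name__ == "__main__":
    # Hypothetical fragment of a Hadoop ingestion spec.
    spec = {
        "ioConfig": {
            "inputSpec": {
                "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
                "paths": "/data/events.parquet",
            }
        }
    }
    print(rename_packages(spec))
```

    Matching only on the prefix keeps unrelated strings (paths, column names) intact; review the rewritten spec before submitting it.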

    Metrics changes

    New metrics

    • task/action/log/time - Milliseconds taken to log a task action to the audit log (#5714)
    • task/action/run/time - Milliseconds taken to execute a task action (#5714)
    • query/node/backpressure - Nanoseconds the channel is unreadable due to backpressure being applied (#6335) (Note that this is not enabled by default and requires a custom implementation of QueryMetrics to emit)

    New dimensions

    • taskId and taskType added to task-related metrics (#5664)

    Other

    • HttpPostEmitterMonitor no longer emits maxTime and minTime if no times were recorded (#6418)

    Rollback restrictions

    64-bit doubles aggregators

    64-bit doubles aggregators are now used by default (see #5478). Support for 64-bit floating point columns was released in Druid 0.11.0, so if this is enabled, versions older than 0.11.0 will not be able to read the data segments.

    To disable and keep the old format, you will need to set the system-wide property druid.indexing.doubleStorage=float.

    Disabling bitmap indexes

    0.13.0 adds support for disabling bitmap indexes on a per-column basis, which can save space in cases where bitmap indexes add no value. This is done by setting the 'createBitmapIndex' field in the dimension schema. Segments written with this option will not be backwards compatible with older versions of Druid (#5402).

    Behavior changes

    Java package name changes

    Druid's package names have all moved from io.druid to org.apache.druid. This affects the name of the Java main class that you should run when starting up services, which is now org.apache.druid.cli.Main. It may also affect installation and configuration of extensions and monitors.

    ParseSpec is now a required field in ingestion specs

    There is no longer a default ParseSpec (previously the DelimitedParseSpec). Ingestion specs now require parseSpec to be specified. If you previously did not provide a parseSpec, you should use one with "format": "tsv" to maintain the existing behavior (#6310).

    Change to default maximum rows to return in one JDBC frame

    The default value for druid.sql.avatica.maxRowsPerFrame was reduced from 100k to 5k to minimize out of memory errors (#5409).

    Router behavior change when routing to brokers dedicated to different time ranges

    As a result of #5595, routers may now select an undesired broker in configurations where there are different tiers of brokers that are intended to be dedicated to queries on different time ranges. See #1362 for discussion.

    Ruby TimestampSpec no longer ignores milliseconds

    Timestamps parsed using a TimestampSpec with format 'ruby' no longer have their millisecond component truncated. If you were using this parser and wanted a query granularity of SECOND, ensure that it is configured appropriately in your indexing specs (#6217).

    Small increase in size of ZooKeeper task announcements

    The datasource name was added to TaskAnnouncement, which will result in a small per-task increase in the amount of data stored in ZooKeeper (#5511).

    Addition of 'name' field to filtered aggregators

    Aggregators of type 'filtered' now support a 'name' field. Previously, the filtered aggregator inherited the name of the aggregator it wrapped. If you have provided the 'name' field for both the filtered aggregator and the wrapped aggregator, it will prefer the name of the filtered aggregator. It will use the name of the wrapped aggregator if the name of the filtered aggregator is missing or empty (#6219).
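
    The name resolution rule described above can be sketched as follows. The function name is hypothetical and the code only mirrors the rule as stated here, not Druid's actual implementation.

```python
from typing import Optional


def resolve_filtered_name(filtered_name: Optional[str], wrapped_name: str) -> str:
    """Prefer the filtered aggregator's name; fall back to the wrapped aggregator's."""
    if filtered_name:  # None and "" are both treated as "missing or empty"
        return filtered_name
    return wrapped_name
```

    For example, `resolve_filtered_name("us_clicks", "clicks")` picks the filtered aggregator's name, while passing `None` or `""` falls back to `"clicks"`.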

    utf8mb4 is now the recommended metadata storage charset

    For upgrade instructions, use the ALTER DATABASE and ALTER TABLE instructions as described here: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-conversion.html.

    For motivation and reference, see #5377 and #5411.

    Removed configuration properties

    • druid.indexer.runner.tlsStartPort has been removed (#6194).
    • druid.indexer.runner.separateIngestionEndpoint has been removed (#6263).

    Interface changes for extension developers

    • Packaging has been renamed from io.druid to org.apache.druid. All third-party extensions will need to rename their META-INF/io.druid.initialization.DruidModule to org.apache.druid.initialization.DruidModule and update their extension's packaging appropriately (#6266).

    • The DataSegmentPuller interface has been removed (#5461).

    • A number of functions under java-util have been removed (#5461).

    • The constructor of the Metadata class has changed (#5613).

    • The 'spark2' Maven profile has been removed (#5581).

    API deprecations

    Overlord

    • The /druid/indexer/v1/supervisor/{id}/shutdown endpoint has been deprecated in favor of /druid/indexer/v1/supervisor/{id}/terminate (#6272 and #6234).
    • The /druid/indexer/v1/task/{taskId}/segments endpoint has been deprecated (#6368).
    • The status field returned by /druid/indexer/v1/task/ID/status has been deprecated in favor of statusCode (#6334).
    • The reportParseExceptions and ignoreInvalidRows parameters for ingestion tuning configurations have been deprecated in favor of logParseExceptions and maxParseExceptions (#5418).
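
    Client scripts that still call the deprecated supervisor shutdown endpoint could be updated with a small path rewrite like the sketch below. The helper name and the supervisor ID `my-supervisor` are hypothetical; only the two endpoint paths come from the list above.

```python
import re

# Matches the deprecated path and captures everything before "/shutdown".
_SHUTDOWN = re.compile(r"^(/druid/indexer/v1/supervisor/[^/]+)/shutdown$")


def migrate_supervisor_path(path: str) -> str:
    """Rewrite a deprecated .../shutdown path to .../terminate; leave other paths alone."""
    return _SHUTDOWN.sub(r"\1/terminate", path)


new_path = migrate_supervisor_path("/druid/indexer/v1/supervisor/my-supervisor/shutdown")
```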

    Broker

    • The /druid/v2/datasources/{dataSourceName}/dimensions endpoint has been deprecated. A segment metadata query or the INFORMATION_SCHEMA SQL table should be used instead (#6361).
    • The /druid/v2/datasources/{dataSourceName}/metrics endpoint has been deprecated. A segment metadata query or the INFORMATION_SCHEMA SQL table should be used instead (#6361).

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @adursun @ak08 @akashdw @aleksi75 @AlexanderSaydakov @alperkokmen @amalakar @andresgomezfrr @apollotonkosmo @asdf2014 @awelsh93 @b-slim @bolkedebruin @Caroline1000 @chengchengpei @clintropolis @dansuzuki @dclim @DiegoEliasCosta @dragonls @drcrallen @dyanarose @dyf6372 @Dylan1312 @erikdubbelboer @es1220 @evasomething @fjy @Fokko @gaodayue @gianm @hate13 @himanshug @hoesler @jaihind213 @jcollado @jihoonson @jim-slattery-rs @jkukul @jon-wei @josephglanville @jsun98 @kaijianding @KenjiTakahashi @kevinconaway @korvit0 @leventov @lssenthilkumar @mhshimul @niketh @NirajaM @nishantmonu51 @njhartwell @palanieppan-m @pdeva @pjain1 @QiuMM @redlion99 @rpless @samarthjain @Scorpil @scrawfor @shiroari @shivtools @siddharths @SpyderRivera @spyk @stuartmclean @surekhasaharan @susielu @varaga @vineshcpaul @vvc11 @wysstartgo @xvrl @yunwan @yuppie-flu @yurmix @zhangxinyu1 @zhztheplayer

  • druid-0.12.3(Sep 18, 2018)

    Druid 0.12.3 contains stability improvements and bug fixes from 6 contributors. Major improvements include:

    • More stable Kafka indexing service
    • Several query bug fixes

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+milestone%3A0.12.3+is%3Aclosed

    Documentation for this release is at: http://druid.io/docs/0.12.3

    Highlights

    More stable Kafka indexing service

    0.12.3 fixes a serious issue where the Kafka indexing service would incorrectly delete published segments in certain situations. Please see #6155, contributed by @gianm, for details.

    Other stability improvements include:

    • Fix NPE in KafkaSupervisor.checkpointTaskGroup: #6206, added by @jihoonson
    • Fix NPE for taskGroupId when rolling update: #6168, added by @jihoonson

    Query bug fixes

    0.12.3 includes a memory allocation adjustment for the GroupBy query that should reduce heap usage. Added by @gaodayue in #6256.

    This release also contains several fixes for the TopN and GroupBy queries when numeric dimension output types are used. Added by @gianm in #6220.

    SQL fixes

    Additionally, 0.12.3 includes the following Druid SQL bug fixes:

    • Fix post-aggregator naming logic for sort-project: https://github.com/apache/incubator-druid/pull/6250
    • Fix precision of TIMESTAMP types: https://github.com/apache/incubator-druid/pull/5464
    • Fix assumption that AND, OR have two arguments: https://github.com/apache/incubator-druid/pull/5470
    • Remove useless boolean CASTs in filters: https://github.com/apache/incubator-druid/pull/5619
    • Fix selecting BOOLEAN type in JDBC: https://github.com/apache/incubator-druid/pull/5401
    • Support projection after sorting in SQL: #5788
    • Fix missing postAggregations for Timeseries and TopN: #5912
    • Finalize aggregations for inner queries when necessary #6221

    Other

    0.12.3 fixes an issue with the druid-basic-security extension where non-coordinator nodes would sometimes fail when updating their cached views of the authentication and/or authorization tables from the coordinator. Fixed by @gaodayue in #6270.

    Full change list

    The full change list can be found here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+milestone%3A0.12.3+is%3Aclosed

    Updating from 0.12.2 and earlier

    0.12.3 is a minor release and compatible with 0.12.2. If you're updating from an earlier version than 0.12.2, please see release notes of the relevant intermediate versions for additional notes.

    Credits

    Thanks to everyone who contributed to this release!

    @gaodayue @gianm @jihoonson @jon-wei @QiuMM @vogievetsky @wpcnjupt

  • druid-0.12.2(Aug 10, 2018)

    Druid 0.12.2 contains stability improvements and bug fixes from 13 contributors. Major improvements include:

    • More stable Kafka indexing service
    • More stable data ingestion
    • More stable segment balancing
    • Bug fixes in querying and result caching

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+milestone%3A0.12.2+is%3Aclosed

    Documentation for this release is at: http://druid.io/docs/0.12.2-rc1

    Highlights

    More stable Kafka indexing service

    This release fixes a number of bugs in the Kafka indexing service, most of them race conditions that occurred when incrementally publishing segments.

    Added by @jihoonson in #5805. Added by @surekhasaharan in #5899. Added by @surekhasaharan in #5900. Added by @jihoonson in #5905. Added by @jihoonson in #5907. Added by @jihoonson in #5996.

    More stable data ingestion

    This release also fixes several bugs in the general data ingestion logic. In particular, it fixes a bug that produced incorrect segment data when auto-encoded long columns were used with any compression.

    Added by @jihoonson in #5932. Added by @clintropolis in #6045.

    More stable segment balancing

    The Coordinator now manages segments more stably, especially when balancing them. This release fixes an unexpected segment imbalance caused by conflicting decisions between the Coordinator's rule runner and its balancer.

    Added by @clintropolis in #5528. Added by @clintropolis in #5529. Added by @clintropolis in #5532. Added by @clintropolis in #5555. Added by @clintropolis in #5591. Added by @clintropolis in #5888. Added by @clintropolis in #5928.

    Bug fixes in querying and result caching

    This release fixes incorrect lexicographic sorting in topN queries and incorrect filter application in nested queries. A ClassCastException that occurred when caching topN queries with Float dimensions has also been fixed.

    Added by @drcrallen in #5650. Added by @gianm in #5653.

    And much more!

    The full list of changes is here: https://github.com/apache/incubator-druid/pulls?q=is%3Apr+milestone%3A0.12.2+is%3Aclosed

    Updating from 0.12.1 and earlier

    0.12.2 is a minor release and compatible with 0.12.1. If you're updating from an earlier version than 0.12.1, please see release notes of the relevant intermediate versions for additional notes.

    Credits

    Thanks to everyone who contributed to this release!

    @acdn-ekeddy @awelsh93 @clintropolis @drcrallen @gianm @jihoonson @jon-wei @kaijianding @leventov @michas2 @Oooocean @samarthjain @surekhasaharan

  • druid-0.12.1(Jun 8, 2018)

    Druid 0.12.1 contains stability improvements and bug fixes from 10 contributors. Major improvements include:

    • Large performance improvements for coordinator's loadstatus API
    • More memory limiting for HttpPostEmitter
    • Fix several issues of Kerberos Authentication
    • Fix SQLMetadataSegmentManager to allow successive start and stop
    • Fix default interval handling in SegmentMetadataQuery
    • Support HTTP OPTIONS request
    • Fix a bug of different segments of the same segment id in Kafka indexing

    The full list of changes is here: https://github.com/druid-io/druid/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aclosed+milestone%3A0.12.1

    Documentation for this release is at: http://druid.io/docs/0.12.1

    Highlights

    Large performance improvements for coordinator's loadstatus API

    The loadstatus API of Coordinators returns the percentage of segments actually loaded in the cluster versus segments that should be loaded in the cluster. The performance of this API has been greatly improved.

    Added by @jon-wei in #5632.
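
    The percentage that the loadstatus API reports can be pictured with the arithmetic below. This is only a sketch of the ratio described above; the function name is hypothetical, and returning 100 when nothing should be loaded is an assumption made for the example.

```python
def load_percentage(loaded_segments: int, expected_segments: int) -> float:
    """Percentage of segments loaded versus segments that should be loaded."""
    if expected_segments == 0:
        return 100.0  # assumption: nothing to load counts as fully loaded
    return 100.0 * loaded_segments / expected_segments
```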

    More memory limiting for HttpPostEmitter

    Druid can now limit the amount of memory used by HttpPostEmitter to 10% of the available JVM heap, thereby avoiding OutOfMemory errors from buffered events.

    Added by @jon-wei in #5300.

    Fix several issues of Kerberos Authentication

    This release fixes several bugs in Kerberos authentication, such as authentication failures without cookies and broken authentication when a router is used. See https://github.com/druid-io/druid/pull/5596, https://github.com/druid-io/druid/pull/5706, and https://github.com/druid-io/druid/pull/5766 for more details.

    Added by @nishantmonu51 in #5596. Added by @b-slim in #5706. Added by @jon-wei in #5766.

    Fix SQLMetadataSegmentManager to allow successive start and stop

    A Coordinator could get stuck if it lost leadership while starting. This bug has now been fixed.

    Added by @jihoonson in #5554.

    Fix default interval handling in SegmentMetadataQuery

    SegmentMetadataQuery is supposed to use the interval of druid.query.segmentMetadata.defaultHistory if the interval is not specified, but it queried all segments instead, which incurred an unexpected performance hit. SegmentMetadataQuery now respects the defaultHistory option again.

    Added by @gianm in #5489.

    Support HTTP OPTIONS request

    Druid now supports the HTTP OPTIONS request by fixing its auth handling.

    Added by @jon-wei in #5615.

    Fix a bug of different segments of the same segment id in Kafka indexing

    The Kafka indexing service allowed retrying tasks to overwrite the segments in deep storage written by previously failed tasks. However, this caused another bug in which the same segment ID could have different data on historicals and in deep storage. This bug has now been fixed by using unique segment paths for each Kafka index task.

    Added by @dclim in #5692.

    And much more!

    The full list of changes is here: https://github.com/druid-io/druid/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aclosed+milestone%3A0.12.1

    Updating from 0.12.0 and earlier

    0.12.1 is a minor release and compatible with 0.12.0. If you're updating from an earlier version than 0.12.0, please see release notes of the relevant intermediate versions for additional notes.

    Credits

    Thanks to everyone who contributed to this release!

    @dclim @gianm @JeKuOrdina @jihoonson @jon-wei @leventov @niketh @nishantmonu51 @pdeva

  • druid-0.12.0(Mar 9, 2018)

    Druid 0.12.0 contains over a hundred performance improvements, stability improvements, and bug fixes from almost 40 contributors. This release adds major improvements to the Kafka indexing service.

    Other major new features include:

    • Prioritized task locking
    • Improved automatic segment management
    • Test stats post-aggregators
    • Numeric quantiles sketch aggregator
    • Basic auth extension
    • Query request queuing improvements
    • Parse batch support
    • Various performance improvements
    • Various improvements to Druid SQL

    The full list of changes is here: https://github.com/druid-io/druid/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aclosed%20milestone%3A0.12.0

    Documentation for this release is at: http://druid.io/docs/0.12.0/

    Highlights

    Kafka indexing incremental handoffs and decoupled partitioning

    The Kafka indexing service now supports incremental handoffs, as well as decoupling the number of segments created by a Kafka indexing task from the number of Kafka partitions. Please see https://github.com/druid-io/druid/pull/4815#issuecomment-346155552 for more information.

    Added by @pjain1 in #4815.

    Prioritized task locking

    Druid now supports priorities for indexing task locks. When an indexing task needs to acquire a lock on a datasource + interval, higher-priority tasks can now preempt lower-priority tasks. Please see http://druid.io/docs/0.12.0-rc1/ingestion/tasks.html#task-priority for more information.

    Added by @jihoonson in https://github.com/druid-io/druid/pull/4550.
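
    The preemption idea can be pictured with the toy model below. It is a sketch only: the class and method names are hypothetical, locks are keyed on an exact (datasource, interval) pair rather than on overlapping intervals, and the priority numbers are arbitrary examples.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Lock:
    task_id: str
    priority: int
    revoked: bool = False


class ToyLockbox:
    """Toy lock table: a higher-priority task preempts a lower-priority holder."""

    def __init__(self) -> None:
        self._locks: Dict[Tuple[str, str], Lock] = {}

    def try_lock(self, task_id: str, priority: int,
                 datasource: str, interval: str) -> Optional[Lock]:
        key = (datasource, interval)
        current = self._locks.get(key)
        if current is not None and not current.revoked:
            if priority <= current.priority:
                return None  # holder has equal or higher priority: request denied
            current.revoked = True  # preempt the lower-priority holder
        lock = Lock(task_id, priority)
        self._locks[key] = lock
        return lock


box = ToyLockbox()
low = box.try_lock("batch_task", 25, "wikipedia", "2018-01-01/2018-01-02")
high = box.try_lock("kafka_task", 75, "wikipedia", "2018-01-01/2018-01-02")
denied = box.try_lock("other_task", 50, "wikipedia", "2018-01-01/2018-01-02")
```

    In this model the first lock ends up revoked, the second request succeeds, and the third is denied because the current holder has a higher priority.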

    Improved automatic segment management

    Automatic pending segments cleanup

    Indexing tasks create entries in the "pendingSegments" table in the metadata store; prior to 0.12.0, these temporary entries were not automatically cleaned up, leading to possible cluster performance degradation over time. Druid 0.12.0 allows the coordinator to automatically clean up unused entries in the pending segments table. This feature is enabled by setting druid.coordinator.kill.pendingSegments.on=true in coordinator properties.

    Added by @jihoonson in #5149.

    Compaction task

    Compacting segments (merging a set of segments within a given interval to create a set with larger but fewer segments) is a common Druid batch ingestion use case. Druid 0.12.0 now supports a Compaction Task that merges all segments within a given interval into a single segment. Please see http://druid.io/docs/0.12.0-rc1/ingestion/tasks.html#compaction-task for more details.

    Added by @jihoonson in #4985.
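
    A minimal compaction task payload might look like the sketch below. The datasource name and interval are placeholders; see the linked documentation for the full set of supported fields.

```python
import json

# Hypothetical compaction task: merge all segments of "wikipedia" in January 2018.
compaction_task = {
    "type": "compact",
    "dataSource": "wikipedia",
    "interval": "2018-01-01/2018-02-01",
}

print(json.dumps(compaction_task, indent=2))
```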

    Test stats post-aggregators

    New z-score and p-value test statistics post-aggregators have been added to the druid-stats extension. Please see http://druid.io/docs/0.12.0-rc1/development/extensions-core/test-stats.html for more details.

    Added by @chunghochen in #4532.
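
    For readers unfamiliar with these statistics, the sketch below computes a textbook two-sample z-score for proportions and the matching two-tailed p-value. It illustrates the statistics in general; the function names are hypothetical, and the extension's exact formulas and parameter names should be taken from the linked documentation.

```python
import math


def two_sample_z_score(successes1: int, n1: int, successes2: int, n2: int) -> float:
    """z-score for the difference between two sample proportions."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    std_err = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    if std_err == 0:
        raise ValueError("standard error is zero; z-score is undefined")
    return (p1 - p2) / std_err


def two_tailed_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z, computed as erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2))
```

    A z-score near 0 gives a p-value near 1, while |z| around 1.96 corresponds to the familiar 0.05 threshold.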

    Numeric quantiles sketch aggregator

    A numeric quantiles sketch aggregator has been added to the druid-datasketches extension.

    Added by @AlexanderSaydakov in #5002.

    Basic auth extension

    Druid 0.12.0 includes a new authentication/authorization extension that provides Basic HTTP authentication and simple role-based access control. Please see http://druid.io/docs/0.12.0-rc1/development/extensions-core/druid-basic-security.html for more information.

    Added by @jon-wei in #5099.

    Query request queuing improvements

    Previously, clients could inadvertently overwhelm a broker by sending too many requests, which were queued in an unbounded Jetty worker pool queue. Clients typically close the connection after a client-side timeout, but the broker would continue to process these requests, giving the appearance of being unresponsive. Meanwhile, clients would continue to retry, adding more requests to an already overloaded broker.

    The newly introduced properties druid.server.http.queueSize and druid.server.http.enableRequestLimit in the broker configuration and historical configuration allow users to configure request rejection to prevent clients from overwhelming brokers and historicals with queries.

    Added by @himanshug in #4540.
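
    The effect of a bounded queue with request rejection can be pictured with the toy model below. It is a sketch of the general idea only; the class name is hypothetical and this is not Jetty's or Druid's implementation.

```python
import queue


class BoundedRequestQueue:
    """Toy request queue that rejects new work once queue_size requests are waiting."""

    def __init__(self, queue_size: int) -> None:
        if queue_size <= 0:
            raise ValueError("queue_size must be positive")
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=queue_size)

    def submit(self, request: str) -> bool:
        """Return True if the request was queued, False if it was rejected."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False


q = BoundedRequestQueue(queue_size=2)
results = [q.submit("query-1"), q.submit("query-2"), q.submit("query-3")]
```

    Rejecting the third request immediately lets the client back off instead of piling more work onto a server that cannot keep up.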

    Parse batch support

    For developers of custom ingestion parsing extensions, it is now possible for InputRowParsers to return multiple InputRows from a single input row. This can simplify ingestion pipelines by reducing the need for input transformations outside of Druid. Added by @pjain1 in #5081.
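
    The idea of returning several rows from one input record can be sketched as follows. The real extension point is Druid's Java InputRowParser; the Python function and the envelope format below are hypothetical and only illustrate the one-to-many shape.

```python
import json
from typing import Dict, List


def parse_batch(raw: str) -> List[Dict]:
    """Expand one JSON envelope holding many events into one row per event."""
    envelope = json.loads(raw)
    return [
        {"timestamp": envelope["timestamp"], **event}
        for event in envelope["events"]
    ]


rows = parse_batch(
    '{"timestamp": "2018-03-09T00:00:00Z", "events": [{"page": "a"}, {"page": "b"}]}'
)
```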

    Performance improvements

    SegmentWriteOutMedium

    When creating new segments, Druid stores some pre-processed data in temporary buffers. Prior to 0.12.0, these buffers were always kept in temporary files on disk. In 0.12.0, PR #4762 by @leventov allows these temporary buffers to be stored in off-heap memory, thus reducing the number of disk I/O operations during ingestion. To enable using off-heap memory for these buffers, the druid.peon.defaultSegmentWriteOutMediumFactory property needs to be configured accordingly. If using off-heap memory for the temporary buffers, please ensure that -XX:MaxDirectMemorySize is increased to accommodate the higher direct memory usage.

    Please see http://druid.io/docs/0.12.0-rc1/configuration/indexing-service.html#SegmentWriteOutMediumFactory for configuration details.

    Parallel merging of intermediate GroupBy results

    PR #4704 by @jihoonson allows the user to configure a number of processing threads to be used for parallel merging of intermediate GroupBy results that have been spilled to disk. Prior to 0.12.0, this merging step would always take place within a single thread.

    Please see http://druid.io/docs/0.12.0-rc1/configuration/querying/groupbyquery.html#parallel-combine for configuration details.

    Other performance improvements

    • DataSegment memory optimizations: #5094 by @leventov
    • Remove IndexedInts.iterator(): #4811 by @leventov
    • ExpressionSelectors: Add optimized selectors: #5048 by @gianm

    SQL improvements

    Various improvements and features have been added to Druid SQL, by @gianm in the following PRs:

    • Improve translation of time floor expressions: #5107
    • Add TIMESTAMPADD: #5079
    • Add rule to prune unused aggregations: #5049
    • Support CASE-style filtered count distinct: #5047
    • Add "array" result format, and document result formats: #5032
    • Fix havingSpec on complex aggregators: #5024
    • Improved behavior when implicitly casting strings to date/time literals: #5023
    • Add Router connection balancers for Avatica queries: #4983
    • Fix incorrect filter simplification: #4945
    • Fix Router handling of SQL queries: #4851

    And much more!

    The full list of changes is here: https://github.com/druid-io/druid/pulls?utf8=%E2%9C%93&q=is%3Apr%20is%3Aclosed%20milestone%3A0.12.0

    Updating from 0.11.0 and earlier

    Please see below for changes between 0.11.0 and 0.12.0 that you should be aware of before upgrading. If you're updating from an earlier version than 0.11.0, please see release notes of the relevant intermediate versions for additional notes.

    Rollback restrictions

    Please note that after upgrading to 0.12.0, it is no longer possible to downgrade to a version older than 0.11.0, due to changes made in https://github.com/druid-io/druid/pull/4762. It is still possible to roll back to version 0.11.0.

    com.metamx.java-util library migration

    The Metamarkets java-util library has been brought into Druid. As a result, the following package references have changed:

    • com.metamx.common -> io.druid.java.util.common
    • com.metamx.emitter -> io.druid.java.util.emitter
    • com.metamx.http -> io.druid.java.util.http
    • com.metamx.metrics -> io.druid.java.util.metrics

    This will affect the druid.monitoring.monitors configuration. References to monitor classes under the old com.metamx.metrics.* package will need to be updated to reference io.druid.java.util.metrics.* instead, e.g. io.druid.java.util.metrics.JvmMonitor.

    If classes under the com.metamx packages shown above are referenced in other configurations such as log4j2.xml, those references will need to be updated as well.

    Extension developers will need to update their code to use the new Druid packages as well.
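
    When many configuration values need updating, the package mapping listed above can be applied mechanically. The helper below is a hypothetical sketch for rewriting fully qualified class names; it is not part of Druid.

```python
# Old package prefix -> new package prefix, as listed in this section.
RENAMES = {
    "com.metamx.common": "io.druid.java.util.common",
    "com.metamx.emitter": "io.druid.java.util.emitter",
    "com.metamx.http": "io.druid.java.util.http",
    "com.metamx.metrics": "io.druid.java.util.metrics",
}


def migrate_class_name(name: str) -> str:
    """Rewrite a class or package name from the old prefix to the new one."""
    for old, new in RENAMES.items():
        # Match whole package segments only, so "com.metamx.commonx" is untouched.
        if name == old or name.startswith(old + "."):
            return new + name[len(old):]
    return name


monitors = [migrate_class_name(m) for m in ["com.metamx.metrics.JvmMonitor"]]
```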

    Caffeine cache extension

    The Caffeine cache extension has been moved out of an extension, into core Druid. In addition, the Caffeine cache is now the default cache implementation. Please remove druid-caffeine-cache if present from the extension list when upgrading to 0.12.0. More information can be found at https://github.com/druid-io/druid/pull/4810.

    Kafka indexing service changes

    earlyMessageRejectPeriod

    The semantics of the earlyMessageRejectPeriod configuration have changed. The earlyMessageRejectPeriod will now be added to (task start time + task duration) instead of just (task start time) when determining the bounds of the message window. Please see https://github.com/druid-io/druid/pull/4990 for more information.
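
    The change in arithmetic can be sketched as below. The function names are hypothetical, and the sketch assumes the period sets the latest message timestamp a task accepts; see the linked pull request for the exact behavior.

```python
from datetime import datetime, timedelta


def latest_accepted_old(task_start: datetime, early_reject_period: timedelta) -> datetime:
    """Pre-0.12.0 bound: period added to the task start time only."""
    return task_start + early_reject_period


def latest_accepted_new(task_start: datetime, task_duration: timedelta,
                        early_reject_period: timedelta) -> datetime:
    """0.12.0 bound: period added to (task start time + task duration)."""
    return task_start + task_duration + early_reject_period


start = datetime(2018, 3, 9, 12, 0)
old_bound = latest_accepted_old(start, timedelta(minutes=10))
new_bound = latest_accepted_new(start, timedelta(hours=1), timedelta(minutes=10))
```

    With a one-hour task and a ten-minute period, the bound moves from 12:10 to 13:10 in this example.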

    Rolling upgrade

    In 0.12.0, there are protocol changes between the Kafka supervisor and the Kafka indexing task, as well as some changes to the metadata formats persisted on disk. Therefore, to support a rolling upgrade, all Middle Managers must be upgraded before the Overlord. Note that this ordering differs from the standard upgrade order and is only necessary when using the Kafka Indexing Service. If you are not using the Kafka Indexing Service, or can tolerate downtime for the Kafka supervisor, you can upgrade in any order.

    Until the Overlord is upgraded, all Kafka indexing tasks (even upgraded ones) will behave as before, which means no decoupling and no incremental handoffs. Once the Overlord is upgraded, new tasks started by it will support the new features.

    Please see https://github.com/druid-io/druid/pull/4815 for more info.

    Roll back

    Once both the overlord and middle managers are rolled back, a new set of tasks should be started, which will work properly. However, the current set of tasks may fail during a roll back. Please see #4815 for more info.

    Interface Changes for Extension Developers

    The ColumnSelectorFactory API has changed. Aggregator extension authors and any others who use ColumnSelectorFactory will need to update their code accordingly. Please see https://github.com/druid-io/druid/pull/4886 for more details.

    The Aggregator.reset() method has been removed because it was deprecated and unused. Please see https://github.com/druid-io/druid/pull/5177 for more info.

    The DataSegmentPusher interface has changed, and the push() method now has an additional replaceExisting parameter. Please see https://github.com/druid-io/druid/pull/5187 for details.

    The Escalator interface has changed: the createEscalatedJettyClient method has been removed. Please see https://github.com/druid-io/druid/pull/5322 for more details.

    Credits

    Thanks to everyone who contributed to this release!

    @a2l007 @akashdw @AlexanderSaydakov @b-slim @ben-manes @benvogan @chuanlei @chunghochen @clintropolis @daniel-tcell @dclim @dpenas @drcrallen @egor-ryashin @elloooooo @Fokko @fuji-151a @gianm @gvsmirnov @himanshug @hzy001 @Igosuki @jihoonson @jon-wei @KenjiTakahashi @kevinconaway @leventov @mh2753 @niketh @nishantmonu51 @pjain1 @QiuMM @QubitPi @quenlang @Shimi @skyler-tao @xanec @yuppie-flu @zhangxinyu1

Owner
The Apache Software Foundation