Apache Pinot - A real-time distributed OLAP datastore

Apache Pinot


What is Apache Pinot?

Apache Pinot is a real-time distributed OLAP datastore, built to deliver scalable real-time analytics with low latency. It can ingest from batch data sources (such as Hadoop HDFS, Amazon S3, Azure ADLS, Google Cloud Storage) as well as stream data sources (such as Apache Kafka).

Pinot was built by engineers at LinkedIn and Uber and is designed to scale up and out with no upper bound. Performance remains predictable based on the size of your cluster and an expected queries-per-second (QPS) threshold.

For getting started guides, deployment recipes, tutorials, and more, please visit our project documentation at https://docs.pinot.apache.org.


Features

Pinot was originally built at LinkedIn to power rich interactive real-time analytics applications such as Who Viewed Profile, Company Analytics, Talent Insights, and many more. UberEats Restaurant Manager is another example of a customer-facing analytics app. At LinkedIn, Pinot powers 50+ user-facing products, ingesting millions of events per second and serving 100k+ queries per second at millisecond latency.

  • Column-oriented: a column-oriented database with various compression schemes such as Run Length and Fixed Bit Length.

  • Pluggable indexing: pluggable indexing technologies such as Sorted Index, Bitmap Index, and Inverted Index.

  • Query optimization: ability to optimize query/execution plan based on query and segment metadata.

  • Stream and batch ingest: near-real-time ingestion from streams and batch ingestion from Hadoop.

  • Query with SQL: a SQL-like language that supports selection, aggregation, filtering, group by, order by, and distinct queries on data.

  • Upsert during real-time ingestion: update the data at scale with consistency.

  • Multi-valued fields: support for multi-valued fields, allowing you to query fields as comma-separated values (see the example query after this list).

  • Cloud-native on Kubernetes: Helm chart provides a horizontally scalable and fault-tolerant clustered deployment that is easy to manage using Kubernetes.
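
As a quick illustration of the SQL support and multi-valued fields above, here is a query against a hypothetical table (the table and column names are made up for illustration):

SELECT city, COUNT(*)
  FROM employees
  WHERE skills = 'java'
  GROUP BY city
  ORDER BY COUNT(*) DESC
  LIMIT 10

On a multi-valued column such as skills, the predicate matches a row if any of its stored values matches.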

Apache Pinot query console

When should I use Pinot?

Pinot is designed to execute real-time OLAP queries with low latency on massive amounts of data and events. In addition to real-time stream ingestion, Pinot also supports batch use cases with the same low-latency guarantees. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. Pinot works very well for querying time series data with lots of dimensions and metrics.

Example query:

SELECT sum(clicks), sum(impressions) FROM AdAnalyticsTable
  WHERE
       (daysSinceEpoch >= 17849 AND daysSinceEpoch <= 17856) AND
       accountId IN (123456789)
  GROUP BY
       daysSinceEpoch
  LIMIT 100

Pinot is not a replacement for a database, i.e., it cannot be used as a source-of-truth store and cannot mutate data. While Pinot supports text search, it's not a replacement for a search engine. Also, Pinot queries cannot span multiple tables by default. You can use the Trino-Pinot Connector or Presto-Pinot Connector to achieve table joins and other features.
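
For instance, a cross-table join through the Trino-Pinot Connector might look like the following (the catalog, schema, and table names here are assumptions for illustration):

SELECT o.customerId, c.region, SUM(o.amount) AS total
  FROM pinot.default.orders o
  JOIN pinot.default.customers c ON o.customerId = c.customerId
  GROUP BY o.customerId, c.region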

Building Pinot

More detailed instructions can be found in the Quick Demo section of the documentation.

# Clone a repo
$ git clone https://github.com/apache/pinot.git
$ cd pinot

# Build Pinot
$ mvn clean install -DskipTests -Pbin-dist

# Run the Quick Demo
$ cd pinot-distribution/target/apache-pinot-<version>-SNAPSHOT-bin/apache-pinot-<version>-SNAPSHOT-bin
$ bin/quick-start-batch.sh

Deploying Pinot to Kubernetes

Please refer to Running Pinot on Kubernetes in our project documentation. Pinot also provides Kubernetes integrations with the interactive query engines Trino and Presto, as well as the data visualization tool Apache Superset.

Join the Community

Documentation

Check out Pinot documentation for a complete description of Pinot's features.

License

Apache Pinot is under Apache License, Version 2.0

Comments
  • DataTable V3 implementation and measure data table serialization cost on server

    DataTable V3 implementation and measure data table serialization cost on server

    Description

    This PR:

    • Add a positional data section to the tail of the data table, and bump the data table version to V3.
    • Data in the positional data section is stored as key/value pairs, and the data is positional (the value of a given key is locatable even after serialization), so an enum is used to define the keys and a String[] to store the values.
    • Currently we only have one KV pair (response_serialization_cost) in the positional data section, but if we add more KV pairs we can add utility functions such as getOffsetForValueOfGivenKey() to locate the value of a given key.
    • Measure data table serialization cost on the server and put the cost in the positional data section.

    Upgrade Notes

    Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

    • [x] Yes (Please label as backward-incompat, and complete the section below on Release Notes)

    Does this PR fix a zero-downtime upgrade introduced earlier?

    • [ ] Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

    Does this PR otherwise need attention when creating release notes? Things to consider:

    • New configuration options
    • Deprecation of configurations
    • Signature changes to public methods/interfaces
    • New plugins added or old plugins removed
    • [x] Yes (Please label this PR as release-notes and complete the section on Release Notes)

    Release Notes

    If you have tagged this as either backward-incompat or release-notes, you MUST add text here that you would like to see appear in release notes of the next release.

    If you have a series of commits adding or enabling a feature, then add this section only in the final commit that marks the feature completed. Refer to earlier release notes to see examples of such text.

    Documentation

    If you have introduced a new feature or configuration, please add it to the documentation as well. See https://docs.pinot.apache.org/developers/developers-and-contributors/update-document

    release-notes backward-incompat 
    opened by mqliang 38
  • [pinot]Support for pausing the realtime consumption without disabling the table.

    [pinot]Support for pausing the realtime consumption without disabling the table.

    As of now, to pause the real-time consumption from Kafka we have to disable the table, which also makes the table unavailable for querying.

    It would be helpful if there were support for stopping only the real-time consumption while keeping the table available for querying.

    opened by Aka-shi 38
  • NULL value support for all data types

    NULL value support for all data types

    Currently in Pinot we don't have real NULL value support; instead we use special default values for NULL. For dimensions, the default value is the minimum value for numeric types, "null" for STRING, and an empty byte array for BYTES; for metrics, the default value is 0 for numeric types and an empty byte array for BYTES. #4214 adds support to treat an empty byte array as NULL for BYTES, but that is not a general way to support NULL for all data types. We should add another type of index to mark whether a value is null, and apply it while filtering or fetching values so that all NULL values are skipped.
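
    With such a null-value index in place, filter semantics like the following would become possible (the table and column names are hypothetical):

    SELECT count(*) FROM myTable WHERE myColumn IS NOT NULL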

    feature 
    opened by Jackie-Jiang 34
  • regexp_like fusing

    regexp_like fusing

    When executing queries like:

    select col1, col2 from Table where regexp_like(col1, '/r1/') or regexp_like(col1, '/r2/')
    

    Pinot has to scan the referred column twice. This PR introduces an optimization that tries to fuse boolean algebra and regexp_like predicates. Specifically, as indicated in the Javadoc:

    • Queries where regexp_like(col1, 'r1') and regexp_like(col1, '/r2/') will be translated to where regexp_like(col1, '(?=r1)(?=r2)')
    • Queries where regexp_like(col1, 'r1') or regexp_like(col1, '/r2/') will be translated to where regexp_like(col1, '(?:r1)|(?:r2)')
    • Queries where not regexp_like(col1, 'r1') will be translated to where regexp_like(col1, '(?!r1)')

    There are some tests that apply the optimization to more advanced cases.

    Regexes can be quite complex. From the analysis I have done, I'm sure that this optimization will break some regexes. For example, predicates that use backreferences may change their semantics when this optimization is applied.

    To know whether the optimization can be applied, it would be necessary to analyze the regex, which would require parsing the regex into an AST. As far as I know Pinot doesn't have that, so this optimization is disabled by default and can be enabled by activating a new query option. There is a heuristic that applies a basic analysis to try to find incompatible expressions like backreferences. Even when the query option is enabled, regexes that the heuristic detects as not optimizable will not be optimized.
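
    As a sketch, enabling such a query option could look like this (the option name below is a hypothetical placeholder, not necessarily the one this PR defines):

    SET fuseRegexpLike = 'true';
    select col1, col2 from Table where regexp_like(col1, 'r1') or regexp_like(col1, 'r2')
    -- with the option enabled, the two predicates would be fused into regexp_like(col1, '(?:r1)|(?:r2)')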

    Future improvements may include translating some text_match predicates, where this optimization may also be useful.

    P.S.: Originally this PR was focused on optimizing text_match instead of regexp_like. Some comments may still refer to the original version.

    opened by gortiz 29
  • Fix the issue with "pinot-pulsar" module (potentially library conflicts)

    Fix the issue with "pinot-pulsar" module (potentially library conflicts)

    Apache Pulsar connector has been added from https://github.com/apache/pinot/pull/7026

    However, it is currently facing some issues at runtime (potentially dependency conflicts). We need to fix the conflicts to make the connector work correctly.

    2021/08/06 10:53:35.451 ERROR [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] Caught exception in state transition from OFFLINE -> ONLINE for resource: myTable_REALTIME, partition: myTable__0__0__20210806T1753Z
    java.lang.RuntimeException: org.apache.pulsar.shaded.com.google.protobuf.v241.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
            at org.apache.pulsar.client.internal.ReflectionUtils.catchExceptions(ReflectionUtils.java:43) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pulsar.client.internal.DefaultImplementation.newMessageIdFromByteArray(DefaultImplementation.java:103) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pulsar.client.api.MessageId.fromByteArray(MessageId.java:58) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.plugin.stream.pulsar.MessageIdStreamOffset.<init>(MessageIdStreamOffset.java:47) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.plugin.stream.pulsar.MessageIdStreamOffsetFactory.create(MessageIdStreamOffsetFactory.java:39) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.<init>(LLRealtimeSegmentDataManager.java:1183) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.addSegment(RealtimeTableDataManager.java:349) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.server.starter.helix.HelixInstanceDataManager.addRealtimeSegment(HelixInstanceDataManager.java:162) ~[pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeOnlineFromOffline(SegmentOnlineOfflineStateModelFactory.java:168) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pinot.server.starter.helix.SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel.onBecomeConsumingFromOffline(SegmentOnlineOfflineStateModelFactory.java:89) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
            at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
            at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
            at org.apache.helix.messaging.handling.HelixStateTransitionHandler.invoke(HelixStateTransitionHandler.java:404) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.helix.messaging.handling.HelixStateTransitionHandler.handleMessage(HelixStateTransitionHandler.java:331) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:97) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.helix.messaging.handling.HelixTask.call(HelixTask.java:49) [pinot-all-0.8.0-jar-with-dependencies.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
            at java.lang.Thread.run(Thread.java:834) [?:?]
    Caused by: org.apache.pulsar.shaded.com.google.protobuf.v241.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
            at org.apache.pulsar.common.util.protobuf.ByteBufCodedInputStream.skipField(ByteBufCodedInputStream.java:192) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pulsar.common.api.proto.PulsarApi$MessageIdData$Builder.mergeFrom(PulsarApi.java:1602) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pulsar.client.impl.MessageIdImpl.fromByteArray(MessageIdImpl.java:106) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
            at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
            at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
            at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
            at org.apache.pulsar.client.internal.DefaultImplementation.lambda$newMessageIdFromByteArray$3(DefaultImplementation.java:103) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            at org.apache.pulsar.client.internal.ReflectionUtils.catchExceptions(ReflectionUtils.java:35) ~[pinot-pulsar-0.8.0-shaded.jar:0.8.0-a206db39710e2495f1d72adb90387617b03566d4]
            ... 21 more
    
    pinot-server_1      | 2021/08/05 14:25:59.009 ERROR [SegmentOnlineOfflineStateModelFactory$SegmentOnlineOfflineStateModel] [HelixTaskExecutor-message_handle_thread] Caught exception in state transition from OFFLINE -> ONLINE for resource: datasource_610bf4bf19200003007e55b3_REALTIME, partition: datasource_610bf4bf19200003007e55b3__0__0__20210805T1425Z
    pinot-server_1      | java.lang.IndexOutOfBoundsException: readerIndex(1) + length(8) exceeds writerIndex(6): UnpooledHeapByteBuf(ridx: 1, widx: 6, cap: 6/6)
    pinot-server_1      | 	at shaded.io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1478) ~[pinot-azure-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at shaded.io.netty.buffer.AbstractByteBuf.readLongLE(AbstractByteBuf.java:845) ~[pinot-azure-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.common.util.protobuf.ByteBufCodedInputStream.readRawLittleEndian64(ByteBufCodedInputStream.java:309) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.common.util.protobuf.ByteBufCodedInputStream.skipField(ByteBufCodedInputStream.java:177) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.common.api.proto.PulsarApi$MessageIdData$Builder.mergeFrom(PulsarApi.java:1602) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.client.impl.MessageIdImpl.fromByteArray(MessageIdImpl.java:106) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    pinot-server_1      | 	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
    pinot-server_1      | 	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
    pinot-server_1      | 	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    pinot-server_1      | 	at org.apache.pulsar.client.internal.DefaultImplementation.lambda$newMessageIdFromByteArray$3(DefaultImplementation.java:103) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.client.internal.ReflectionUtils.catchExceptions(ReflectionUtils.java:35) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    pinot-server_1      | 	at org.apache.pulsar.client.internal.DefaultImplementation.newMessageIdFromByteArray(DefaultImplementation.java:103) ~[pinot-pulsar-0.8.0-SNAPSHOT-shaded.jar:0.8.0-SNAPSHOT-4da1dae06aef50f0a7c96b5a22019e541310fdd9]
    
    opened by snleee 25
  • Admin command to recover from deleted CONSUMING segment case

    Admin command to recover from deleted CONSUMING segment case

    If a user deletes the CONSUMING segment (by mistake, or intentionally because they don't know the impact), the consumption stops. Pinot is unable to recover such a table automatically. The only way out is to manually add the next sequence ID's segment ZK metadata (and optionally add entries in the ideal state for CONSUMING to kick off consumption immediately).

    We can add admin commands to do these ops.

    opened by npawar 23
  • Add exception to broker response when not all segments are available (partial response)

    Add exception to broker response when not all segments are available (partial response)

    There are 3 scenarios when the broker response might be partial:

    1. There are some segments without an online server in the external view - numUnavailableSegments in BaseBrokerRequestHandler line 718
    2. Some segments are not available on the server - numSegmentsQueried > numSegmentsAcquired in ServerQueryExecutorV1Impl line 170
    3. Some servers did not respond - numServersQueried > numServersResponded in SingleConnectionBrokerRequestHandler line 128

    Currently these partial responses are only tracked by metrics/query stats, but they are not modeled as an exception. We should add an exception to the broker response to inform users that the response might be partial.

    beginner-task 
    opened by Jackie-Jiang 22
  • Make Pinot JDK 11 Compilable

    Make Pinot JDK 11 Compilable

    Description

    As part of the effort to upgrade Pinot to Java 11 (#6689), this PR will:

    • Upgrade Pinot to be compatible with JDK 11+. This involves updating dependencies and tests, plus minor modifications to some classes: equality methods for Schema and ColumnMetadata.
    • The JDK 8 build is still available with the option -Djdk.version=8:
    mvn clean install -DskipTests -Pbin-dist -T 4 -Djdk.version=8
    

    Upgrade Notes

    Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

    • [x] Yes (Please label as backward-incompat, and complete the section below on Release Notes)

    Does this PR fix a zero-downtime upgrade introduced earlier?

    • [ ] Yes (Please label this as backward-incompat, and complete the section below on Release Notes)

    Does this PR otherwise need attention when creating release notes? Things to consider:

    • New configuration options
    • Deprecation of configurations
    • Signature changes to public methods/interfaces
    • New plugins added or old plugins removed
    • [x] Yes (Please label this PR as release-notes and complete the section on Release Notes)

    Release Notes

    Upgrade to Java 11 and drop support for Java 8

    If you have tagged this as either backward-incompat or release-notes, you MUST add text here that you would like to see appear in release notes of the next release.

    If you have a series of commits adding or enabling a feature, then add this section only in the final commit that marks the feature completed. Refer to earlier release notes to see examples of such text.

    Documentation

    If you have introduced a new feature or configuration, please add it to the documentation as well. See https://docs.pinot.apache.org/developers/developers-and-contributors/update-document

    release-notes backward-incompat 
    opened by elonazoulay 21
  • pinot_kafka_0.9.0.1 module

    pinot_kafka_0.9.0.1 module

    Step 3 of https://github.com/apache/incubator-pinot/issues/3998. As discussed in the issue, we are planning to move the Kafka stream implementation classes out into their own module, pinot-kafka-0.9.0.1. This should enable folks to plug in implementations for other Kafka versions. TODO: pinot-integration-tests, pinot-tools, and pinot-perf still depend on pinot-kafka-0.9.0.1 for the tests and dummy setup. We need to figure out a way to avoid specifying that dependency while still being able to inject the classes for the version we wish to use.

    opened by npawar 20
  • Schema evolution: Default values for newly added columns

    Schema evolution: Default values for newly added columns

    One of the most common requests from users is: "when we add new columns, can the old segments have default values for these columns?" Currently Pinot forces users to re-bootstrap old data when new columns are added. Most users are OK with using default values in old segments for newly added columns.

    For example, let's say a user adds a new column c_new to a table that currently has columns d1 and m1, and that we have two segments: S1_OLD (contains d1, m1) and S2_NEW (contains d1, m1, c_new).

    Let's go over some queries and their current behavior in Pinot. If the query does not mention the new column, results will be correct and as expected.

    If the query specifies the new column, then old data (without the new column) will not be considered while processing the query. In this case, whether the result is right or wrong depends on the other filters in the query.

    For example, let's look at the current behavior in Pinot when a new column c_new is added at time T1:

    select sum(m1) from Table //this will work in all cases
    select sum(m1) from Table where c_new=x //this will be incorrect as Pinot considers only new data
    select sum(m1) from Table where c_new=x and time > T1 //this will be correct since all data after T1 will have c_new
    select sum(c_new) from table //this works and returns the correct result
    select sum(c1) from table //this also works as expected
    but
    select sum(c_new), sum(m1) from table //THIS WON'T WORK AS EXPECTED
    
    opened by kishoreg 20
  • Allow disabling dict generation for High cardinality columns

    Allow disabling dict generation for High cardinality columns

    For high-cardinality metric columns (especially doubles/floats), a dictionary may be an overhead, not only in storage space but also because it adds one additional hop for reads. During the segment generation phase, we can check whether a column would require more storage for the dictionary vs. raw values, and choose not to create a dictionary in that phase.

    The feature is off by default and can be enabled by setting optimizeDictionaryEnabled to true.

    Release Notes

    • Add a new configuration to disable dictionaries for single-valued metric columns.

    Documentation

    https://docs.pinot.apache.org/configuration-reference/table#table-index-config

    release-notes Configuration performance 
    opened by KKcorps 19
  • Deleting a table does not reset metric gauges

    Deleting a table does not reset metric gauges

    When a table with error segments is deleted, the "segments_in_error_state" gauge remains at the same value. It should either go to 0 or stop reporting altogether.

    opened by jadami10 0
  • RT segment size does NOT follow "realtime.segment.flush.desired.size" parameter

    RT segment size does NOT follow "realtime.segment.flush.desired.size" parameter

    Segments of realtime tables which have "flush.desired.size" configured do not honor the specified value (reported here). Here are actual sizes for some segments for which the desired size is 500M.

    bug 
    opened by sajjad-moradi 0
  • Fix issue with S3 FS not working with segments with spaces in the name

    Fix issue with S3 FS not working with segments with spaces in the name

    S3PinotFS fails to process segments with spaces in their names. This is due to URI.getPath() returning the decoded path while the .prefix function expects the encoded path.

    bugfix 
    opened by saurabhd336 2
  • Add DistinctSum aggregation function

    Add DistinctSum aggregation function

    Label = feature Issue = #9574

    This PR adds support for the DISTINCTSUM aggregation function. Similar support is provided by MySQL - https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_sum

    https://www.db-fiddle.com/f/xyAgqWtR324YeXjdK6Rup8/0 explains a sample use case where this functionality will help.

    Added tests.
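
    As a usage sketch (the table and column are hypothetical), DISTINCTSUM sums each distinct value exactly once:

    SELECT DISTINCTSUM(points) FROM scores
    -- for the values [3, 3, 5], the result is 8 rather than 11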

    feature release-notes 
    opened by vvivekiyer 3
Releases
  • release-0.11.0(Sep 1, 2022)

    Summary

    Apache Pinot 0.11.0 introduces many new features that extend its query abilities: the Multi-Stage query engine enables Pinot to perform distributed joins, and there is new SQL syntax (DML support) as well as new query functions and indexes (text index, timestamp index) for new use cases. And, as always, there are more integrations with other systems (e.g., Spark 3, Flink).

    Note: there is a major upgrade of Apache Helix to 1.0.4, so please make sure you upgrade the system in the order: Helix Controller -> Pinot Controller -> Pinot Broker -> Pinot Server.

    Multi-Stage Query Engine

    The new multi-stage query engine (a.k.a. the V2 query engine) is designed to support more complex SQL semantics such as JOIN, OVER window, and MATCH_RECOGNIZE and, eventually, bring Pinot closer to full ANSI SQL semantics. More to read: https://docs.pinot.apache.org/developers/advanced/v2-multi-stage-query-engine
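
    For instance, the multi-stage engine makes distributed joins like the following possible (the table and column names are hypothetical):

    SELECT c.campaignName, SUM(e.clicks) AS totalClicks
      FROM adEvents e
      JOIN campaigns c ON e.campaignId = c.campaignId
      GROUP BY c.campaignName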

    Pause Stream Consumption on Apache Pinot

    Pinot operators can pause real-time consumption of events while queries are being executed, and then resume consumption when ready to do so again.

    More to read: https://medium.com/apache-pinot-developer-blog/pause-stream-consumption-on-apache-pinot-772a971ef403

    Gap-filling function

    The gapfilling functions allow users to interpolate data and perform powerful aggregations and data processing over time series data. More to read: https://www.startree.ai/blog/gapfill-function-for-time-series-datasets-in-pinot

    Add support for Spark 3.x (#8560)

    A long-awaited feature for segment generation on Spark 3.x.

    Add Flink Pinot connector (#8233)

    Similar to the Spark-Pinot connector, this allows Flink users to write data from a Flink application to Pinot.

    Show running queries and cancel query by id (#9175)

    This feature allows finer-grained control over Pinot queries.

    Timestamp Index (#8343)

    This allows users to get better query performance on the timestamp column at lower granularities. See: https://docs.pinot.apache.org/basics/indexing/timestamp-index
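
    As an illustration (the table and column names are hypothetical), a timestamp index speeds up granularity-truncating queries such as:

    SELECT datetrunc('DAY', eventTs) AS day, count(*)
      FROM myTable
      GROUP BY datetrunc('DAY', eventTs)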

    Native Text Indices (#8384)

    Want to search text in real time? The new text indexing engine in Pinot supports the following capabilities:

    1. New operator: LIKE
    select * FROM foo where text_col LIKE 'a%'
    
    2. New operator: CONTAINS
    select * from foo where text_col CONTAINS 'bar'
    
    3. Native text index, built from the ground up, focusing on Pinot’s time series use cases and utilizing existing Pinot indices and structures (inverted index, bitmap storage).
    4. Real-time text index

    Read more: https://medium.com/@atri.jiit/text-search-time-series-style-681af37ba42e

    Adding DML definition and parse SQL InsertFile (#8557)

    Now you can use INSERT INTO [database.]table FROM FILE dataDirURI OPTION ( k=v ) [, OPTION (k=v)]* to load data into Pinot from a file using Minion. See: https://docs.pinot.apache.org/basics/data-import/from-query-console
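
    A concrete invocation might look like this (the bucket path and option value are placeholders):

    INSERT INTO myTable FROM FILE 's3://my-bucket/path/to/data/' OPTION(taskName=myTask)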

    Deduplication (#8708)

    This feature supports enabling deduplication for realtime tables via a top-level table config. At a high level, primaryKey (as defined in the table schema) hashes are stored in in-memory data structures, and each incoming row is validated against them. Duplicate rows are dropped.

    The expectation while using this feature is for the stream to be partitioned by the primary key, strictReplicaGroup routing to be enabled, and the configured stream consumer type to be low-level. These requirements are therefore mandated via the table config API's input validations.

    Function support and changes:

    • Add support for functions arrayConcatLong, arrayConcatFloat, arrayConcatDouble (#9131)
    • Add support for regexpReplace scalar function (#9123)
    • Add support for Base64 Encode/Decode Scalar Functions (#9114)
    • Optimize LIKE-to-regexp conversion to not include unnecessary ^.* and .*$ (#8893)
    • Support DISTINCT on multiple MV columns (#8873)
    • Support DISTINCT on single MV column (#8857)
    • Add histogram aggregation function (#8724)
    • Optimize dateTimeConvert scalar function to only parse the format once (#8939)
    • Support conjugates for scalar functions, add more scalar functions (#8582)
    • add FIRSTWITHTIME aggregate function support #7647 (#8181)
    • Add PercentileSmartTDigestAggregationFunction (#8565)
    • Simplify the parameters for DistinctCountSmartHLLAggregationFunction (#8566)
    • add scalar function for cast so it can be calculated at compile time (#8535)
    • Scalable Gapfill Implementation for Avg/Count/Sum (#8647)
    • Add commonly used math, string and date scalar functions in Pinot (#8304)
    • Datetime transform functions (#8397)
    • Scalar function for url encoding and decoding (#8378)
    • Add support for IS NULL and NOT IS NULL in transform functions (#8264)
    • Support st_contains using H3 index (#8498)

    The full list of features introduced in this release

    • add query cancel APIs on controller backed by those on brokers (#9276)
    • Add an option to search input files recursively in ingestion job. The default is set to true to be backward compatible. (#9265)
    • Adding endpoint to download local log files for each component (#9259)
    • Add metrics to track controller segment download and upload requests in progress (#9258)
    • add a freshness based consumption status checker (#9244)
    • Force commit consuming segments (#9197)
    • Adding kafka offset support for period and timestamp (#9193)
    • Make upsert metadata manager pluggable (#9186)
    • Adding logger utils and allow change logger level at runtime (#9180)
    • Proper null handling in equality, inequality and membership operators for all SV column data types (#9173)
    • support to show running queries and cancel query by id (#9171)
    • Enhance upsert metadata handling (#9095)
    • Proper null handling in Aggregation functions for SV data types (#9086)
    • Add support for IAM role based credentials in Kinesis Plugin (#9071)
    • Task generator debug API (#9058)
    • Add Segment Lineage List API #9005 (#9006)
    • [colocated-join] Adds Support for instancePartitionsMap in Table Config (#8989)
    • Support pause/resume consumption of realtime tables (#8986)
    • #8970 Minion tab in Pinot UI (#8978)
    • Add Protocol Buffer Stream Decoder (#8972)
    • Update minion task metadata ZNode path (#8959)
    • add /tasks/{taskType}/{tableNameWithType}/debug API (#8949)
    • Defined a new broker metric for total query processing time (#8941)
    • Proper null handling in SELECT, ORDER BY, DISTINCT, and GROUP BY (#8927)
    • fixing REGEX OPTION parser (#8905)
    • Enable key value byte stitching in PulsarMessageBatch (#8897)
    • Add property to skip adding hadoop jars to package (#8888)
    • Support DISTINCT on multiple MV columns (#8873)
    • Implement Mutable FST Index (#8861)
    • Support DISTINCT on single MV column (#8857)
    • Add controller API for reload segment task status (#8828)
    • Spark Connector, support for TIMESTAMP and BOOLEAN fields (#8825)
    • Allow moveToFinalLocation in METADATA push based on config (#8823) (#8815)
    • allow up to 4GB per bitmap index (#8796)
    • Deprecate debug options and always use query options (#8768)
    • Streamed segment download & untar with rate limiter to control disk usage (#8753)
    • Improve the Explain Plan accuracy (#8738)
    • allow to set https as the default scheme (#8729)
    • Add histogram aggregation function (#8724)
    • Allow table name with dots by a PinotConfiguration switch (#8713)
    • Disable Groovy function by default (#8711)
    • Deduplication (#8708)
    • Add pluggable client auth provider (#8670)
    • Adding pinot file system command (#8659)
    • Allow broker to automatically rewrite expensive function to its approximate counterpart (#8655)
    • allow to take data outside the time window by negating the window filter (#8640)
    • Support BigDecimal raw value forward index; Support BigDecimal in many transforms and operators (#8622)
    • Ingestion Aggregation Feature (#8611)
    • Enable uploading segments to realtime tables (#8584)
    • Package kafka 0.9 shaded jar to pinot-distribution (#8569)
    • Simplify the parameters for DistinctCountSmartHLLAggregationFunction (#8566)
    • Add PercentileSmartTDigestAggregationFunction (#8565)
    • Add support for Spark 3.x (#8560)
    • Adding DML definition and parse SQL InsertFile (#8557)
    • endpoints to get and delete minion task metadata (#8551)
    • Add query option to use more replica groups (#8550)
    • Only discover public methods annotated with @ScalarFunction (#8544)
    • Support single-valued BigDecimal in schema, type conversion, SQL statements and minimum set of transforms. (#8503)
    • Add connection based FailureDetector (#8491)
    • Add endpoints for some finer control on minion tasks (#8486)
    • Add adhoc minion task creation endpoint (#8465)
    • Rewrite PinotQuery based on expression hints at instance/segment level (#8451)
    • Allow disabling dict generation for High cardinality columns (#8398)
    • add segment size metric on segment push (#8387)
    • Implement Native Text Operator (#8384)
    • Change default memory allocation for consuming segments from on-heap to off-heap (#8380)
    • New Pinot storage metrics for compressed tar.gz and table size w/o replicas (#8358)
    • add a experiment API for upsert heap memory estimation (#8355)
    • Timestamp type index (#8343)
    • Upgrade Helix to 1.0.4 in Pinot (#8325)
    • Allow overriding expression in query through query config (#8319)
    • Always handle null time values (#8310)
    • Add prefixesToRename config for renaming fields upon ingestion (#8273)
    • Added multi column partitioning for offline table (#8255)
    • Automatically update broker resource on broker changes (#8249)

    Vulnerability fixes

    Pinot has resolved all of the high-severity vulnerability issues:

    • Add a new workflow to check vulnerabilities using trivy (#9044)
    • Disable Groovy function by default (#8711)
    • Upgrade netty due to security vulnerability (#8328)
    • Upgrade protobuf as the current version has security vulnerability (#8287)
    • Upgrade to hadoop 2.10.1 due to cves (#8478)
    • Upgrade Helix to 1.0.4 (#8325)
    • Upgrade thrift to 0.15.0 (#8427)
    • Upgrade jetty due to security issue (#8348)
    • Upgrade netty (#8346)
    • Upgrade snappy version (#8494)

    Bug fixes

    • Nested arrays and map not handled correctly for complex types (#9235)
    • Fix empty data block not returning schema (#9222)
    • Allow mvn build with development webpack; fix instances default value (#9179)
    • Fix the race condition of reflection scanning classes (#9167)
    • Fix ingress manifest for controller and broker (#9135)
    • Fix jvm processors count (#9138)
    • Fix grpc query server not setting max inbound msg size (#9126)
    • Fix upsert replace (#9132)
    • Fix the race condition for partial upsert record read (#9130)
    • Fix log msg, as it missed one param value (#9124)
    • Fix authentication issue when auth annotation is not required (#9110)
    • Fix segment pruning that can break server subquery (#9090)
    • Fix the NPE for ADLSGen2PinotFS (#9088)
    • Fix cross merge (#9087)
    • Fix LaunchDataIngestionJobCommand auth header (#9070)
    • Fix catalog skipping (#9069)
    • Fix adding util for getting URL from InstanceConfig (#8856)
    • Fix string length in MutableColumnStatistics (#9059)
    • Fix instance details page loading table for tenant (#9035)
    • Fix thread safety issue with java client (#8971)
    • Fix allSegmentLoaded check (#9010)
    • Fix bug in segmentDetails table name parsing; style the new indexes table (#8958)
    • Fix pulsar close bug (#8913)
    • Fix REGEX OPTION parser (#8905)
    • Avoid reporting negative values for server latency. (#8892)
    • Fix getConfigOverrides in MinionQuickstart (#8858)
    • Fix segment generation error handling (#8812)
    • Fix multi stage engine serde (#8689)
    • Fix server discovery (#8664)
    • Fix Upsert config validation to check for metrics aggregation (#8781)
    • Fix multi value column index creation (#8848)
    • Fix grpc port assignment in multiple server quickstart (#8834)
    • Spark Connector GRPC reader fix for reading realtime tables (#8824)
    • Fix auth provider for minion (#8831)
    • Fix metadata push mode in IngestionUtils (#8802)
    • Misc fixes on segment validation for uploaded real-time segments (#8786)
    • Fix a typo in ServerInstance.startQueryServer() (#8794)
    • Fix the issue of server opening up query server prematurely (#8785)
    • Fix regression where case order was reversed, add regression test (#8748)
    • Fix dimension table load when server restart or reload table (#8721)
    • Fix when there're two index filter operator h3 inclusion index throw exception (#8707)
    • Fix the race condition of reading time boundary info (#8685)
    • Fix pruning in expressions by max/min/bloom (#8672)
    • Fix GcsPinotFs listFiles by using bucket directly (#8656)
    • Fix column data type store for data table (#8648)
    • Fix the potential NPE for timestamp index rewrite (#8633)
    • Fix on timeout string format in KinesisDataProducer (#8631)
    • Fix bug in segment rebalance with replica group segment assignment (#8598)
    • Fix the upsert metadata bug when adding segment with same comparison value (#8590)
    • Fix the deadlock in ClusterChangeMediator (#8572)
    • Fix BigDecimal ser/de on negative scale (#8553)
    • Fix table creation bug for invalid realtime consumer props (#8509)
    • Fix the bug of missing dot to extract sub props from ingestion job filesytem spec and minion segmentNameGeneratorSpec (#8511)
    • Fix to query inconsistencies under heavy upsert load (resolves #7958) (#7971)
    • Fix ChildTraceId when using multiple child threads, make them unique (#8443)
    • Fix the group-by reduce handling when query times out (#8450)
    • Fix a typo in BaseBrokerRequestHandler (#8448)
    • Fix TIMESTAMP data type usage during segment creation (#8407)
    • Fix async-profiler install (#8404)
    • Fix ingestion transform config bugs. (#8394)
    • Fix upsert inconsistency by snapshotting the validDocIds before reading the numDocs (#8392)
    • Fix bug when importing files with the same name in different directories (#8337)
    • Fix the missing NOT handling (#8366)
    • Fix setting of metrics compression type in RealtimeSegmentConverter (#8350)
    • Fix segment status checker to skip push in-progress segments (#8323)
    • Fix datetime truncate for multi-day (#8327)
    • Fix redirections for routes with access-token (#8285)
    • Fix CSV files surrounding space issue (#9028)
    • Fix suppressed exceptions in GrpcBrokerRequestHandler(#8272)
  • release-0.10.0(Mar 18, 2022)

    Summary

    This release introduces some great new features, performance enhancements, UI improvements, and bug fixes, which are described in detail in the following sections. The release was cut from this commit: fd9c58a.

    Dependency Graph

    The dependency graph for the plug-and-play architecture that was introduced in release 0.3.0 has been extended, and it now contains new nodes for the Pinot Segment SPI.

    SQL Improvements

    • Implement NOT Operator (#8148)
    • Add DistinctCountSmartHLLAggregationFunction which automatically store distinct values in Set or HyperLogLog based on cardinality (#8189)
    • Add LEAST and GREATEST functions (#8100)
    • Handle SELECT * with extra columns (#7959)
    • Add FILTER clauses for aggregates (#7916)
    • Add ST_Within function (#7990)
    • Handle semicolon in query (#7861)
    • Add EXPLAIN PLAN (#7568)

    UI Enhancements

    • Show Reported Size and Estimated Size in human readable format in UI (#8199)
    • Make query console state URL based (#8194)
    • Improve query console to not show query result when multiple columns have the same name (#8131)
    • Improve Pinot dashboard tenant view to show correct amount of servers and brokers (#8115)
    • Fix issue with opening new tabs from Pinot Dashboard (#8021)
    • Fix issue with Query console going blank on syntax error (#8006)
    • Make query stats always show even there's error (#7981)
    • Implement OIDC auth workflow in UI (#7121)
    • Add tooltip and modal for table status (#7899)
    • Add option to wrap lines in custom code mirror (#7857)
    • Add ability to comment out queries with cmd + / (#7841)
    • Return exception when unavailable segments on empty broker response (#7823)
    • Properly handle the case where segments are missing in externalview (#7803)
    • Add TIMESTAMP to datetime column Type (#7746)

    Performance Improvements

    • Reuse regex matcher in dictionary based LIKE queries (#8261)
    • Early terminate orderby when columns already sorted (#8228)
    • Do not do another pass of Query Automaton Minimization (#8237)
    • Improve RangeBitmap by upgrading RoaringBitmap (#8206)
    • Optimize geometry serializer usage when literal is available (#8167)
    • Improve performance of no-dictionary group by (#8195)
    • Allocation free DataBlockCache lookups (#8140)
    • Prune unselected THEN statements in CaseTransformFunction (#8138)
    • Aggregation delay conversion to double (#8139)
    • Reduce object allocation rate in ExpressionContext or FunctionContext (#8124)
    • Lock free DimensionDataTableManager (#8102)
    • Improve json path performance during ingestion by upgrading JsonPath (#7819)
    • Reduce allocations and speed up StringUtil.sanitizeString (#8013)
    • Faster metric scans - ForwardIndexReader (#7920)
    • Unpeel group by 3 ways to enable vectorization (#7949)
    • Power of 2 fixed size chunks (#7934)
    • Don't use mmap for compression except for huge chunks (#7931)
    • Exit group-by marking loop early (#7935)
    • Improve performance of base chunk forward index write (#7930)
    • Cache JsonPaths to prevent compilation per segment (#7826)
    • Use LZ4 as default compression mode (#7797)
    • Peel off special case for 1 dimensional groupby (#7777)
    • Bump roaringbitmap version to improve range queries performance (#7734)

    Other Notable Features

    • Adding NoopPinotMetricFactory and corresponding changes (#8270)
    • Allow to specify fixed segment name for SegmentProcessorFramework (#8269)
    • Move all prestodb dependencies into a separated module (#8266)
    • Include docIds in Projection and Transform block (#8262)
    • Automatically update broker resource on broker changes (#8249)
    • Update ScalarFunction annotation from name to names to support function alias. (#8252)
    • Implemented BoundedColumnValue partition function (#8224)
    • Add copy recursive API to pinotFS (#8200)
    • Add Support for Getting Live Brokers for a Table (without type suffix) (#8188)
    • Pinot docker image - cache prometheus rules (#8241)
    • In BrokerRequestToQueryContextConverter, remove unused filterExpressionContext (#8238)
    • Adding retention period to segment delete REST API (#8122)
    • Pinot docker image - upgrade prometheus and scope rulesets to components (#8227)
    • Allow segment name postfix for SegmentProcessorFramework (#8230)
    • Superset docker image - update pinotdb version in superset image (#8231)
    • Add retention period to deleted segment files and allow table level overrides (#8176)
    • Remove incubator from pinot and superset (#8223)
    • Adding table config overrides for disabling groovy (#8196)
    • Optimise sorted docId iteration order in mutable segments (#8213)
    • Adding secure grpc query server support (#8207)
    • Move Tls configs and utils from pinot-core to pinot-common (#8210)
    • Reduce allocation rate in LookupTransformFunction (#8204)
    • Allow subclass to customize what happens pre/post segment uploading (#8203)
    • Enable controller service auto-discovery in Jersey framework (#8193)
    • Add support for pushFileNamePattern in pushJobSpec (#8191)
    • Add additionalMatchLabels to helm chart (#7177)
    • Simulate rsvps after meetup.com retired the feed (#8180)
    • Adding more checkstyle rules (#8197)
    • Add persistence.extraVolumeMounts and persistence.extraVolumes to Kubernetes statefulsets (#7486)
    • Adding scala profile for kafka 2.x build and remove root pom scala dependencies (#8174)
    • Allow realtime data providers to accept non-kafka producers (#8190)
    • Enhance revertReplaceSegments api (#8166)
    • Adding broker level config for disabling Pinot queries with Groovy (#8159)
    • Make presto driver query pinot server with SQL (#8186)
    • Adding controller config for disabling Groovy in ingestionConfig (#8169)
    • Adding main method for LaunchDataIngestionJobCommand for spark-submit command (#8168)
    • Add auth token for segment replace rest APIs (#8146)
    • Add allowRefresh option to UploadSegment (#8125)
    • Add Ingress to Broker and Controller helm charts (#7997)
    • Improve progress reporter in SegmentCreationMapper (#8129)
    • St_* function error messages + support literal transform functions (#8001)
    • Add schema and segment crc to SegmentDirectoryContext (#8127)
    • Extend enableParallePushProtection support in UploadSegment API (#8110)
    • Support BOOLEAN type in Config Recommendation Engine (#8055)
    • Add a broker metric to distinguish exception happens when acquire channel lock or when send request to server (#8105)
    • Add pinot.minion prefix on minion configs for consistency (#8109)
    • Enable broker service auto-discovery in Jersey framework (#8107)
    • Timeout if waiting server channel lock takes a long time (#8083)
    • Wire EmptySegmentPruner to routing config (#8067)
    • Support for TIMESTAMP data type in Config Recommendation Engine (#8087)
    • Listener TLS customization (#8082)
    • Add consumption rate limiter for LLConsumer (#6291)
    • Implement Real Time Mutable FST (#8016)
    • Allow quickstart to get table files from filesystem (#8093)
    • Add support for instant segment deletion (#8077)
    • Add a config file to override quickstart configs (#8059)
    • Add pinot server grpc metadata acl (#8030)
    • Move compatibility verifier to a separate module (#8049)
    • Move hadoop and spark ingestion libs from plugins directory to external-plugins (#8048)
    • Add global strategy for partial upsert (#7906)
    • Upgrade kafka to 2.8.1 (#7883)
    • Created EmptyQuickstart command (#8024)
    • Allow SegmentPushUtil to push realtime segment (#8032)
    • Add ignoreMerger for partial upsert (#7907)
    • Make task timeout and concurrency configurable (#8028)
    • Return 503 response from health check on shut down (#7892)
    • Pinot-druid-benchmark: set the multiValueDelimiterEnabled to false when importing TPC-H data (#8012)
    • Cleanup: Remove remaining occurrences of incubator. (#8023)
    • Refactor segment loading logic in BaseTableDataManager to decouple it with local segment directory (#7969)
    • Improving segment replacement/revert protocol (#7995)
    • PinotConfigProvider interface (#7984)
    • Enhance listSegments API to exclude the provided segments from the output (#7878)
    • Remove outdated broker metric definitions (#7962)
    • Add skip key for realtimeToOffline job validation (#7921)
    • Upgrade async-http-client (#7968)
    • Allow Reloading Segments with Multiple Threads (#7893)
    • Ignore query options in commented out queries (#7894)
    • Remove TableConfigCache which does not listen on ZK changes (#7943)
    • Switch to zookeeper of helm 3.0x (#7955)
    • Use a single react hook for table status modal (#7952)
    • Add debug logging for realtime ingestion (#7946)
    • Separate the exception for transform and indexing for consuming records (#7926)
    • Disable JsonStatementOptimizer (#7919)
    • Make index readers/loaders pluggable (#7897)
    • Make index creator provision pluggable (#7885)
    • Support loading plugins from multiple directories (#7871)
    • Update helm charts to honour readinessEnabled probes flags on the Controller, Broker, Server and Minion StatefulSets (#7891)
    • Support non-selection-only GRPC server request handler (#7839)
    • GRPC broker request handler (#7838)
    • Add validator for SDF (#7804)
    • Support large payload in zk put API (#7364)
    • Push JSON Path evaluation down to storage layer (#7820)
    • When upserting new record, index the record before updating the upsert metadata (#7860)
    • Add Post-Aggregation Gapfilling functionality. (#7781)
    • Clean up deprecated fields from segment metadata (#7853)
    • Remove deprecated method from StreamMetadataProvider (#7852)
    • Obtain replication factor from tenant configuration in case of dimension table (#7848)
    • Use valid bucket end time instead of segment end time for merge/rollup delay metrics (#7827)
    • Make pinot start components command extensible (#7847)
    • Make upsert inner segment update atomic (#7844)
    • Clean up deprecated ZK metadata keys and methods (#7846)
    • Add extraEnv, envFrom to statefulset help template (#7833)
    • Make openjdk image name configurable (#7832)
    • Add getPredicate() to PredicateEvaluator interface (#7840)
    • Make split commit the default commit protocol (#7780)
    • Pass Pinot connection properties from JDBC driver (#7822)
    • Add Pinot client connection config to allow skip fail on broker response exception (#7816)
    • Change default range index version to v2 (#7815)
    • Put thread timer measuring inside of wall clock timer measuring (#7809)
    • Add getRevertReplaceSegmentRequest method in FileUploadDownloadClient (#7796)
    • Add JAVA_OPTS env var in docker image (#7799)
    • Split thread cpu time into three metrics (#7724)
    • Add config for enabling realtime offset based consumption status checker (#7753)
    • Add timeColumn, timeUnit and totalDocs to the json segment metadata (#7765)
    • Set default Dockerfile CMD to -help (#7767)
    • Add getName() to PartitionFunction interface (#7760)
    • Support Native FST As An Index Subtype for FST Indices (#7729)
    • Add forceCleanup option for 'startReplaceSegments' API (#7744)
    • Add config for keystore types, switch tls to native implementation, and add authorization for server-broker tls channel (#7653)
    • Extend FileUploadDownloadClient to send post request with json body (#7751)

    Major Bug Fixes

    • Fix string comparisons (#8253)
    • Bugfix for order-by all sorted optimization (#8263)
    • Fix dockerfile (#8239)
    • Ensure partition function never return negative partition (#8221)
    • Handle indexing failures without corrupting inverted indexes (#8211)
    • Fixed broken HashCode partitioning (#8216)
    • Fix segment replace test (#8209)
    • Fix filtered aggregation when it is mixed with regular aggregation (#8172)
    • Fix FST Like query benchmark to remove SQL parsing from the measurement (#8097)
    • Do not identify function types by throwing exceptions (#8137)
    • Fix regression bug caused by sharing TSerializer across multiple threads (#8160)
    • Fix validation before creating a table (#8103)
    • Check cron schedules from table configs after subscribing child changes (#8113)
    • Disallow duplicate segment name in tar file (#8119)
    • Fix storage quota checker NPE for Dimension Tables (#8132)
    • Fix TraceContext NPE issue (#8126)
    • Update gcloud libraries to fix underlying issue with api's with CMEK (#8121)
    • Fix error handling in jsonPathArray (#8120)
    • Fix error handling in json functions with default values (#8111)
    • Fix controller config validation failure for customized TLS listeners (#8106)
    • Validate the numbers of input and output files in HadoopSegmentCreationJob (#8098)
    • Broker Side validation for the query with aggregation and col but without group by (#7972)
    • Improve the proactive segment clean-up for REVERTED (#8071)
    • Allow JSON forward indexes (#8073)
    • Fix the PinotLLCRealtimeSegmentManager on segment name check (#8058)
    • Always use smallest offset for new partitionGroups (#8053)
    • Fix RealtimeToOfflineSegmentsTaskExecutor to handle time gap (#8054)
    • Refine segment consistency checks during segment load (#8035)
    • Fixes for various JDBC issues (#7784)
    • Delete tmp- segment directories on server startup (#7961)
    • Fix ByteArray datatype column metadata getMaxValue NPE bug and expose maxNumMultiValues (#7918)
    • Fix the issues that Pinot upsert table's uploaded segments get deleted when a server restarts. (#7979)
    • Fixed segment upload error return (#7957)
    • Fix QuerySchedulerFactory to plug in custom scheduler (#7945)
    • Fix the issue with grpc broker request handler not started correctly (#7950)
    • Fix realtime ingestion when an entire batch of messages is filtered out (#7927)
    • Move decode method before calling acquireSegment to avoid reference count leak (#7938)
    • Fix semaphore issue in consuming segments (#7886)
    • Add bootstrap mode for PinotServiceManager to avoid glitch for health check (#7880)
    • Fix the broker routing when segment is deleted (#7817)
    • Fix obfuscator not capturing secretkey and keytab (#7794)
    • Fix segment merge delay metric when there is empty bucket (#7761)
    • Fix QuickStart by adding types for invalid/missing type (#7768)
    • Use oldest offset on newly detected partitions (#7756)
    • Fix javadoc to compatible with jdk8 source (#7754)
    • Handle null segment lineage ZNRecord for getSelectedSegments API (#7752)
    • Handle fields missing in the source in ParquetNativeRecordReader (#7742)

    Backward Incompatible Changes

    • Fix the issue with HashCode partitioning function (#8216)
    • Fix the issue with validation on table creation (#8103)
    • Change PinotFS API's (#8603)
  • release-0.9.3(Dec 24, 2021)

    This is a bug-fixing release that contains:

    Update Log4j to 2.17.0 to address CVE-2021-45105 (#7933)

    The release is based on the release 0.9.2 with the following cherry-picks:

    93c0404da6bcbf9bf4e165f1e2cbba069abcc872

  • release-0.9.2(Dec 15, 2021)

    Summary

    This is a bug-fixing release that contains:

    • Upgrade log4j to 2.16.0 to fix CVE-2021-45046 (#7903)
    • Upgrade swagger-ui to 3.23.11 to fix CVE-2019-17495 (#7902)
    • Fix the bug that RealtimeToOfflineTask failed to progress with large time bucket gaps (#7814).

    The release is based on the release 0.9.1 with the following cherry-picks:

    • 9ed6498cdf9d32a65ebcbcce9158acab64a8c0d7
    • 50e1613503cd74b26cf78873efcbdd6e8516bd8f
    • 767aa8abfb5bf085ba0a7ae5ff4024679f27816e

  • release-0.9.1(Dec 12, 2021)

    Summary

    This release fixes the major issue of CVE-2021-44228 and a major bug with the pinot admin exit code (https://github.com/apache/pinot/pull/7798).

    The release is based on the release 0.9.0 with the following cherry-picks:

    • e44d2e46f2eaba5f75d789d92ce767fbee96feba
    • af2858aff26e169f348605e61d0c5e21ddd73dd9

  • release-0.9.0(Nov 12, 2021)

    Summary

    This release introduces a new feature, Segment Merge and Rollup, to simplify users' day-to-day operational work. A new metrics plugin is added to support Dropwizard. As usual, the release also includes new functionality and many UI and performance improvements.

    The release was cut from the following commit: 13c9ee9 and the following cherry-picks: 668b5e0, ee887b9

    Support Segment Merge and Roll-up

    LinkedIn operates a large multi-tenant cluster that serves a business metrics dashboard, and noticed that their tables consisted of millions of small segments. This was leading to slow operations in Helix/Zookeeper, long running queries due to having too many tasks to process, as well as using more space because of a lack of compression.

    To solve this problem they added the Segment Merge task, which compresses segments based on timestamps and rolls up/aggregates older data. The task can be run on a schedule or triggered manually via the Pinot REST API.

    At the moment this feature is only available for offline tables, but will be added for real-time tables in a future release.

    Major Changes:

    • Integrate enhanced SegmentProcessorFramework into MergeRollupTaskExecutor (#7180)
    • Merge/Rollup task scheduler for offline tables. (#7178)
    • Fix MergeRollupTask uploading segments not updating their metadata (#7289)
    • MergeRollupTask integration tests (#7283)
    • Add mergeRollupTask delay metrics (#7368)
    • MergeRollupTaskGenerator enhancement: enable parallel buckets scheduling (#7481)
    • Use maxEndTimeMs for merge/roll-up delay metrics. (#7617)

    UI Improvement

    This release also sees improvements to Pinot’s query console UI.

    • Cmd+Enter shortcut to run query in query console (#7359)
    • Showing tooltip in SQL Editor (#7387)
    • Make the SQL Editor box expandable (#7381)
    • Fix tables ordering by number of segments (#7564)

    SQL Improvements

    There have also been improvements and additions to Pinot’s SQL implementation.

    New functions:

    • IN (#7542)
    • LASTWITHTIME (#7584)
    • ID_SET on MV columns (#7355)
    • Raw results for Percentile TDigest and Est (#7226)
    • Add timezone as argument in function toDateTime (#7552)
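
    For illustration, here is how a couple of the new functions might be used (a sketch only; the trades/events tables and their columns are hypothetical, not from the release notes):

        -- last observed price per symbol, ordered by the time column
        SELECT symbol, LASTWITHTIME(price, ts, 'DOUBLE') AS latestPrice
        FROM trades
        GROUP BY symbol

        -- format an epoch-millis column in a specific time zone
        SELECT toDateTime(tsMillis, 'yyyy-MM-dd HH:mm:ss', 'America/New_York') AS eventTime
        FROM events
        LIMIT 10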

    New predicates are supported:

    • LIKE(#7214)
    • REGEXP_EXTRACT(#7114)
    • FILTER(#7566)
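
    For example, the new predicates can be combined as follows (a sketch; the accessLogs table and its columns are hypothetical):

        -- LIKE predicate plus a filtered aggregation
        SELECT COUNT(*) FILTER (WHERE httpStatus >= 500) AS errorCount,
               COUNT(*) AS totalCount
        FROM accessLogs
        WHERE userAgent LIKE 'Mozilla%'

        -- extract the first capture group with REGEXP_EXTRACT
        SELECT REGEXP_EXTRACT(userAgent, 'Chrome/([0-9]+)', 1) AS chromeMajorVersion
        FROM accessLogs
        LIMIT 10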

    Query compatibility improvements:

    • Infer data type for Literal (#7332)
    • Support logical identifier in predicate (#7347)
    • Support JSON queries with top-level array path expression. (#7511)
    • Support configurable group by trim size to improve results accuracy (#7241)

    Performance Improvements

    This release contains many performance improvements, and you may notice them in your day-to-day queries. Thanks to all the great contributions listed below:

    • Reduce the disk usage for segment conversion task (#7193)
    • Simplify association between Java Class and PinotDataType for faster mapping (#7402)
    • Avoid creating stateless ParseContextImpl once per jsonpath evaluation, avoid varargs allocation (#7412)
    • Replace MINUS with STRCMP (#7394)
    • Bit-sliced range index for int, long, float, double, dictionarized SV columns (#7454)
    • Use MethodHandle to access vectorized unsigned comparison on JDK9+ (#7487)
    • Add option to limit thread usage per query (#7492)
    • Improved range queries (#7513)
    • Faster bitmap scans (#7530)
    • Optimize EmptySegmentPruner to skip pruning when there is no empty segments (#7531)
    • Map bitmaps through a bounded window to avoid excessive disk pressure (#7535)
    • Allow RLE compression of bitmaps for smaller file sizes (#7582)
    • Support raw index properties for columns with JSON and RANGE indexes (#7615)
    • Enhance BloomFilter rule to include IN predicate(#7444) (#7624)
    • Introduce LZ4_WITH_LENGTH chunk compression type (#7655)
    • Enhance ColumnValueSegmentPruner and support bloom filter prefetch (#7654)
    • Apply the optimization on dictIds within the segment to DistinctCountHLL aggregation func (#7630)
    • During segment pruning, release the bloom filter after each segment is processed (#7668)
    • Fix JSONPath cache inefficient issue (#7409)
    • Optimize getUnpaddedString with SWAR padding search (#7708)
    • Lighter weight LiteralTransformFunction, avoid excessive array fills (#7707)
    • Inline binary comparison ops to prevent function call overhead (#7709)
    • Memoize literals in query context in order to deduplicate them (#7720)

    Other Notable New Features and Changes

    • Human Readable Controller Configs (#7173)
    • Add the support of geoToH3 function (#7182)
    • Add Apache Pulsar as Pinot Plugin (#7223) (#7247)
    • Add dropwizard metrics plugin (#7263)
    • Introduce OR Predicate Execution On Star Tree Index (#7184)
    • Allow to extract values from array of objects with jsonPathArray (#7208) (see the sketch after this list)
    • Add Realtime table metadata and indexes API. (#7169)
    • Support array with mixing data types (#7234)
    • Support force download segment in reload API (#7249)
    • Show uncompressed znRecord from zk api (#7304)
    • Add debug endpoint to get minion task status. (#7300)
    • Validate CSV Header For Configured Delimiter (#7237)
    • Add auth tokens and user/password support to ingestion job command (#7233)
    • Add option to store the hash of the upsert primary key (#7246)
    • Add null support for time column (#7269)
    • Add mode aggregation function (#7318)
    • Support disable swagger in Pinot servers (#7341)
    • Delete metadata properly on table deletion (#7329)
    • Add basic Obfuscator Support (#7407)
    • Add AWS sts dependency to enable auth using web identity token. (#7017)(#7445)
    • Mask credentials in debug endpoint /appconfigs (#7452)
    • Fix /sql query endpoint now compatible with auth (#7230)
    • Fix case sensitive issue in BasicAuthPrincipal permission check (#7354)
    • Fix auth token injection in SegmentGenerationAndPushTaskExecutor (#7464)
    • Add segmentNameGeneratorType config to IndexingConfig (#7346)
    • Support trigger PeriodicTask manually (#7174)
    • Add endpoint to check minion task status for a single task. (#7353)
    • Showing partial status of segment and counting CONSUMING state as good segment status (#7327)
    • Add "num rows in segments" and "num segments queried per host" to the output of Realtime Provisioning Rule (#7282)
    • Check schema backward-compatibility when updating schema through addSchema with override (#7374)
    • Optimize IndexedTable (#7373)
    • Support indices remove in V3 segment format (#7301)
    • Optimize TableResizer (#7392)
    • Introduce resultSize in IndexedTable (#7420)
    • Offset based realtime consumption status checker (#7267)
    • Add causes to stack trace return (#7460)
    • Create controller resource packages config key (#7488)
    • Enhance TableCache to support schema name different from table name (#7525)
    • Add validation for realtimeToOffline task (#7523)
    • Unify CombineOperator multi-threading logic (#7450)
    • Support no downtime rebalance for table with 1 replica in TableRebalancer (#7532)
    • Introduce MinionConf, move END_REPLACE_SEGMENTS_TIMEOUT_MS to minion config instead of task config. (#7516)
    • Adjust tuner api (#7553)
    • Adding config for metrics library (#7551)
    • Add geo type conversion scalar functions (#7573)
    • Add BOOLEAN_ARRAY and TIMESTAMP_ARRAY types (#7581)
    • Add MV raw forward index and MV BYTES data type (#7595)
    • Enhance TableRebalancer to offload the segments from most loaded instances first (#7574)
    • Improve get tenant API to differentiate offline and realtime tenants (#7548)
    • Refactor query rewriter to interfaces and implementations to allow customization (#7576)
    • In ServiceStartable, apply global cluster config in ZK to instance config (#7593)
    • Make dimension tables creation bypass tenant validation (#7559)
    • Allow Metadata and Dictionary Based Plans for No Op Filters (#7563)
    • Reject query with identifiers not in schema (#7590)
    • Round Robin IP addresses when retry uploading/downloading segments (#7585)
    • Support multi-value derived column in offline table reload (#7632)
    • Support segmentNamePostfix in segment name (#7646)
    • Add select segments API (#7651)
    • Controller getTableInstance() call now returns the list of live brokers of a table. (#7556)
    • Allow MV Field Support For Raw Columns in Text Indices (#7638)
    • Allow override distinctCount to segmentPartitionedDistinctCount (#7664)
    • Add a quick start with both UPSERT and JSON index (#7669)
    • Add revertSegmentReplacement API (#7662)
    • Smooth segment reloading with non blocking semantic (#7675)
    • Clear the reused record in PartitionUpsertMetadataManager (#7676)
    • Replace args4j with picocli (#7665)
    • Handle datetime column consistently (#7645)(#7705)
    • Allow to carry headers with query requests (#7696) (#7712)
    • Allow adding JSON data type for dimension column types (#7718)
    • Separate SegmentDirectoryLoader and tierBackend concepts (#7737)
    • Implement size balanced V4 raw chunk format (#7661)
    • Add presto-pinot-driver lib (#7384)
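
    As a quick illustration of the jsonPathArray function listed above (a sketch; the orders table and orderJson column are hypothetical):

        SELECT jsonPathArray(orderJson, '$.items[*].sku') AS itemSkus
        FROM orders
        LIMIT 10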

    Major Bug fixes

    • Fix null pointer exception for non-existed metric columns in schema for JDBC driver (#7175)
    • Fix the config key for TASK_MANAGER_FREQUENCY_PERIOD (#7198)
    • Fixed pinot java client to add zkClient close (#7196)
    • Ignore query json parse errors (#7165)
    • Fix shutdown hook for PinotServiceManager (#7251) (#7253)
    • Make STRING to BOOLEAN data type change as backward compatible schema change (#7259)
    • Replace gcp hardcoded values with generic annotations (#6985)
    • Fix segment conversion executor for in-place conversion (#7265)
    • Fix reporting consuming rate when the Kafka partition level consumer isn't stopped (#7322)
    • Fix the issue with concurrent modification for segment lineage (#7343)
    • Fix TableNotFound error message in PinotHelixResourceManager (#7340)
    • Fix upload LLC segment endpoint truncated download URL (#7361)
    • Fix task scheduling on table update (#7362)
    • Fix metric method for ONLINE_MINION_INSTANCES metric (#7363)
    • Fix JsonToPinotSchema behavior to be consistent with AvroSchemaToPinotSchema (#7366)
    • Fix currentOffset volatility in consuming segment(#7365)
    • Fix misleading error msg for missing URI (#7367)
    • Fix the correctness of getColumnIndices method (#7370)
    • Fix SegmentZKMetadta time handling (#7375)
    • Fix retention for cleaning up segment lineage (#7424)
    • Fix segment generator to not return illegal filenames (#7085)
    • Fix missing LLC segments in segment store by adding controller periodic task to upload them (#6778)
    • Fix parsing error messages returned to FileUploadDownloadClient (#7428)
    • Fix manifest scan which drives /version endpoint (#7456)
    • Fix missing rate limiter if brokerResourceEV becomes null due to ZK connection (#7470)
    • Fix race conditions between segment merge/roll-up and purge (or convertToRawIndex) tasks: (#7427)
    • Fix pql double quote checker exception (#7485)
    • Fix minion metrics exporter config (#7496)
    • Fix segment unable to retry issue by catching timeout exception during segment replace (#7509)
    • Add Exception to Broker Response When Not All Segments Are Available (Partial Response) (#7397)
    • Fix segment generation commands (#7527)
    • Return non zero from main with exception (#7482)
    • Fix parquet plugin shading error (#7570)
    • Fix the lowest partition id is not 0 for LLC (#7066)
    • Fix star-tree index map when column name contains '.' (#7623)
    • Fix cluster manager URLs encoding issue(#7639)
    • Fix fieldConfig nullable validation (#7648)
    • Fix verifyHostname issue in FileUploadDownloadClient (#7703)
    • Fix TableCache schema to include the built-in virtual columns (#7706)
    • Fix DISTINCT with AS function (#7678)
    • Fix SDF pattern in DataPreprocessingHelper (#7721)
    • Fix fields missing issue in the source in ParquetNativeRecordReader (#7742)
  • release-0.8.0(Aug 11, 2021)

    Summary

    This release introduced several awesome new features, including compatibility tests, enhanced complex type and Json support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains a lot of query enhancements such as new timestamp and boolean type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.

    The release was cut from the following commit: fe83e95aa9124ee59787c580846793ff7456eaa5

    and the following cherry-picks:

    • 668b5e01d7c263e3cdb3081e0a947a43e7d7f782
    • ee887b97e77ef7a132d3d6d60f83a800e52d4555
    • c2f7fccefcbd9930c995808cf9947c61b4223786
    • c1ac8a18b65841fc666722496cc5f4f9347b3dd4
    • 4da1dae06aef50f0a7c96b5a22019e541310fdd9
    • 573651b28a6f89bd4895c992a5e8fa8e23df4615
    • c6c407d24c3e02e6b3f628e691118f379575d6da
    • 0d96c7f5f58191a823956cc1b1a8c93914fd73b3
    • c2637d139dd5ba79682b4244cde316dacb0852ee

    Notable New Features

    • Extract time handling for SegmentProcessorFramework (#7158)
    • Add Apache Pulsar low level and high level connector (#7026)
    • Enable parallel builds for compat checker (#7149)
    • Add controller/server API to fetch aggregated segment metadata (#7102)
    • Support Dictionary Based Plan For DISTINCT (#7141)
    • Provide HTTP client to kinesis builder (#7148)
    • Add datetime function with 2 arguments (#7116)
    • Adding ability to check ingestion status for Offline Pinot table (#7070)
    • Add timestamp datatype support in JDBC (#7117)
    • Allow updating controller and broker helix hostname (#7064)
    • Cancel running Kinesis consumer tasks when timeout occurs (#7109)
    • Implement Append merger for partial upsert (#7087)
    • SegmentProcessorFramework Enhancement (#7092)
    • Added TaskMetricsEmitted periodic controler job (#7091)
    • Support json path expressions in query. (#6998) (see the example after this list)
    • Support data preprocessing for AVRO and ORC formats (#7062)
    • Add partial upsert config and mergers (#6899)
    • Add support for range index rule recommendation(#7034) (#7063)
    • Allow reloading consuming segment by default (#7078)
    • Add LZ4 Compression Codec (#6804) (#7035)
    • Make Pinot JDK 11 Compilable (#6424)
    • Introduce in-Segment Trim for GroupBy OrderBy Query (#6991)
    • Produce GenericRow file in segment processing mapper (#7013)
    • Add ago() scalar transform function (#6820)
    • Add Bloom Filter support for IN predicate(#7005) (#7007)
    • Add genericRow file reader and writer (#6997)
    • Normalize LHS and RHS numerical types for >, >=, <, and <= operators. (#6927)
    • Add Kinesis Stream Ingestion Plugin (#6661)
    • feature/#6766 JSON and Startree index information in API (#6873)
    • Support null value fields in generic row ser/de (#6968)
    • Implement PassThroughTransformOperator to optimize select queries(#6972) (#6973)
    • Optimize TIME_CONVERT/DATE_TIME_CONVERT predicates (#6957)
    • Prefetch call to fetch buffers of columns seen in the query (#6967)
    • Enabling compatibility tests in the script (#6959)
    • Add collectionToJsonMode to schema inference (#6946)
    • Add the complex-type support to decoder/reader (#6945)
    • Adding a new Controller API to retrieve ingestion status for realtime… (#6890)
    • Add support for Long in Modulo partition function. (#6929)
    • Enhance PinotSegmentRecordReader to preserve null values (#6922)
    • add complex-type support to avro-to-pinot schema inference (#6928)
    • Add correct yaml files for real time data(#6787) (#6916)
    • Add complex-type transformation to offline segment creation (#6914)
    • Add config File support(#6787) (#6901)
    • Enhance JSON index to support nested array (#6877)
    • Add debug endpoint for tables. (#6897)
    • JSON column datatype support. (#6878)
    • Allow empty string in MV column (#6879)
    • Add Zstandard compression support with JMH benchmarking(#6804) (#6876)
    • Normalize LHS and RHS numerical types for = and != operator. (#6811)
    • Change ConcatCollector implementation to use off-heap (#6847)
    • [PQL Deprecation] Clean up the old BrokerRequestOptimizer (#6859)
    • [PQL Deprecation] Do not compile PQL broker request for SQL query (#6855)
    • Add TIMESTAMP and BOOLEAN data type support (#6719)
    • Add admin endpoint for Pinot Minon. (#6822)
    • Remove the usage of PQL compiler (#6808)
    • Add endpoints in Pinot Controller, Broker and Server to get system and application configs. (#6817)
    • Support IN predicate in ColumnValue SegmentPruner(#6756) (#6776)
    • Enable adding new segments to an upsert-enabled realtime table (#6567)
    • Interface changes for Kinesis connector (#6667)
    • Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec is always temporary and has higher priority than default PinotFS created by the minion server configs (#6744)
    • DataTable V3 implementation and measure data table serialization cost on server (#6710)
    • add uploadLLCSegment endpoint in TableResource (#6653)
    • File-based SegmentWriter implementation (#6718)
    • Basic Auth for pinot-controller (#6613)
    • UI integration with Authentication API and added login page (#6686)
    • Support data ingestion for offline segment in one pass (#6479)
    • SumPrecision: support all data types and star-tree (#6668)
    • complete compatibility regression testing (#6650)
    • Kinesis implementation Part 1: Rename partitionId to partitionGroupId (#6655)
    • Make Pinot metrics pluggable (#6640)
    • Recover the segment from controller when LLC table cannot load it (#6647)
    • Adding a new API for validating specified TableConfig and Schema (#6620)
    • Introduce a metric for query/response size on broker. (#6590)
    • Adding a controller periodic task to clean up dead minion instances (#6543)
    • Adding new validation for Json, TEXT indexing (#6541)
    • Always return a response from query execution. (#6596)
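
    To illustrate the JSON path expressions in queries mentioned above (a sketch assuming a JSON-typed column; myTable and jsonColumn are hypothetical):

        SELECT jsonColumn.name.last AS lastName, COUNT(*) AS cnt
        FROM myTable
        WHERE jsonColumn.id > 0
        GROUP BY jsonColumn.name.last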

    Special notes

    • Starting with the 0.8.0 release, JDK 11 is officially supported, and JDK 11 features can now safely be used. Code is still compilable with JDK 8 (#6424)
    • RealtimeToOfflineSegmentsTask config has some backward incompatible changes (#7158)
      • timeColumnTransformFunction is removed (backward-incompatible, but rollup is not supported anyway)
      • Deprecate collectorType and replace it with mergeType
      • Add roundBucketTimePeriod and partitionBucketTimePeriod to configure the time buckets for round and partition
    • Regex path for pluggable MinionEventObserverFactory is changed from org.apache.pinot.*.event.* to org.apache.pinot.*.plugin.minion.tasks.* (#6980)
    • Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks module and package them into a shaded jar (#6618)
    • Reloading consuming segment flag pinot.server.instance.reload.consumingSegment will be true by default (#7078)
    • Move JSON decoder from pinot-kafka to pinot-json package. (#7021)
    • Backward incompatible schema change through controller rest API PUT /schemas/{schemaName} will be blocked. (#6737)
    • Deprecated /tables/validateTableAndSchema in favor of the new configs/validate API and introduced new APIs for /tableConfigs to operate on the realtime table config, offline table config and schema in one shot. (#6840)

    Major Bug fixes

    • Fix race condition in MinionInstancesCleanupTask (#7122)
    • Fix custom instance id for controller/broker/minion (#7127)
    • Fix UpsertConfig JSON deserialization. (#7125)
    • Fix the memory issue for selection query with large limit (#7112)
    • Fix the deleted segments directory not exist warning (#7097)
    • Fixing docker build scripts by providing JDK_VERSION as parameter (#7095)
    • Misc fixes for json data type (#7057)
    • Fix handling of date time columns in query recommender(#7018) (#7031)
    • fixing pinot-hadoop and pinot-spark test (#7030)
    • Fixing HadoopPinotFS listFiles method to always contain scheme (#7027)
    • fixed GenericRow compare for different _fieldToValueMap size (#6964)
    • Fix NPE in NumericalFilterOptimizer due to IS NULL and IS NOT NULL operator. (#7001)
    • Fix the race condition in realtime text index refresh thread (#6858) (#6990)
    • Fix deep store directory structure (#6976)
    • Fix NPE issue when consumed kafka message is null or the record value is null. (#6950)
    • Mitigate calcite NPE bug. (#6908)
    • Fix the exception thrown in the case that a specified table name does not exist (#6328) (#6765)
    • Fix CAST transform function for chained transforms (#6941)
    • Fixed failing pinot-controller npm build (#6795)
  • release-0.7.1(Apr 7, 2021)

    Summary

    This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for Pinot connections, and various performance optimizations and improvements. It also adds several new APIs to better manage segments and upload data to offline tables, and contains many key bug fixes. See details below.

    The release was cut from the following commit: 78152cdb2892cf8c2df5b8a4d04e2aa897333487

    and the following cherry-picks:

    • b527af353e78f26d0c4388cab89e4fe18d5290f6
    • 84d59e3ba27a3cdf0eaecbfe0eeec9b47060a2e3
    • a18dc60dca09bd2a1d33a8bc6b787d7ceb8e1749
    • 4ec38f79315d4017e5e2ac45e8989fa7fc4584fa
    • b48dac07dfce0769ad4acf1a643f3e9aba53e18b
    • 5d2bc0c6d83825e152d30fb4774464fabf3b3e8b
    • 913492e443d71d758f0b88b17cb144b5a5a5fb57
    • 50a4531b33475327bc9fe3c0199e7003f0a4c882
    • 1f21403a29a40c751fceb2211437a1e27c58b5e1
    • 8dbb70ba08daf90f5e9067fcec545203ffefe215

    Notable New Features

    • Add a server metric: queriesDisabled to check if queries disabled or not. (#6586)
    • Optimization on GroupKey to save the overhead of ser/de the group keys (#6593) (#6559)
    • Support validation for jsonExtractKey and jsonExtractScalar functions (#6246) (#6594)
    • Real-Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment (#6546)
    • Add the isolation level config isolation.level to Kafka consumer (2.0) to ingest transactionally committed messages only (#6580)
    • Enhance StarTreeIndexViewer to support multiple trees (#6569)
    • Improves ADLSGen2PinotFS with service principal-based auth, auto-create container on the initial run. It's backward compatible with key-based auth. (#6531)
    • Add metrics for minion tasks status (#6549)
    • Use minion data directory as tmp directory for SegmentGenerationAndPushTask to ensure directory is always cleaned up (#6560)
    • Add optional HTTP basic auth to pinot broker, which enables user- and table-level authentication of incoming queries. (#6552)
    • Add Access Control for REST endpoints of Controller (#6507)
    • Add date_trunc to scalar functions to support date_trunc during ingestion (#6538)
    • Allow tar gz with > 8gb size (#6533)
    • Add Lookup UDF Join support (#6530), (#6465), (#6383) (#6286) (see the example after this list)
    • Add cron scheduler metrics reporting (#6502)
    • Support generating derived column during segment load, so that derived columns can be added on-the-fly (#6494)
    • Support chained transform functions (#6495)
    • Add scalar function JsonPathArray to extract arrays from json (#6490)
    • Add a guard against multiple consuming segments for same partition (#6483)
    • Remove the usage of deprecated range delimiter (#6475)
    • Handle scheduler calls with a proper response when it's disabled. (#6474)
    • Simplify SegmentGenerationAndPushTask handling getting schema and table config (#6469)
    • Add a cluster config to config number of concurrent tasks per instance for minion task: SegmentGenerationAndPushTaskGenerator (#6468)
    • Replace BrokerRequestOptimizer with QueryOptimizer to also optimize the PinotQuery (#6423)
    • Add additional string scalar functions (#6458)
    • Add additional scalar functions for array type (#6446)
    • Add CRON scheduler for Pinot tasks (#6451)
    • Set default Data Type while setting type in Add Schema UI dialog (#6452)
    • Add ImportData sub command in pinot admin (#6396)
    • H3-based geospatial index (#6409) (#6306)
    • Add JSON index support (#6408) (#6216) (#6346)
    • Make minion tasks pluggable via reflection (#6395)
    • Add compatibility test for segment operations upload and delete (#6382)
    • Add segment reset API that disables and then enables the segment (#6336)
    • Add Pinot minion segment generation and push task. (#6340)
    • Add a version option to pinot admin to show all the component versions (#6380)
    • Add FST index using Lucene lib to speedup REGEXP_LIKE operator on text (#6120)
    • Add APIs for uploading data to an offline table. (#6354)
    • Allow the use of environment variables in stream configs (#6373)
    • Enhance task schedule api for single type/table support (#6352)
    • Add broker time range based pruner for routing. Query operators supported: RANGE, =, <, <=, >, >=, AND, OR(#6259)
    • Add json path functions to extract values from json object (#6347)
    • Create a pluggable interface for Table config tuner (#6255)
    • Add a Controller endpoint to return table creation time (#6331)
    • Add tooltips, ability to enable-disable table state to the UI (#6327)
    • Add Pinot Minion client (#6339)
    • Add more efficient use of RoaringBitmap in OnHeapBitmapInvertedIndexCreator and OffHeapBitmapInvertedIndexCreator (#6320)
    • Add decimal percentile support. (#6323)
    • Add API to get status of consumption of a table (#6322)
    • Add support to add offline and realtime tables, individually able to add schema and schema listing in UI (#6296)
    • Improve performance for distinct queries (#6285)
    • Allow adding custom configs during the segment creation phase (#6299)
    • Use sorted index based filtering only for dictionary encoded column (#6288)
    • Enhance forward index reader for better performance (#6262)
    • Support for text index without raw (#6284)
    • Add api for cluster manager to get table state (#6211)
    • Perf optimization for SQL GROUP BY ORDER BY (#6225)
    • Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE} in Pinot table configs. (#6271)
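
    As an example of the Lookup UDF join noted above (a sketch; the orders fact table and dimCustomers dimension table are hypothetical):

        SELECT orderId,
               LOOKUP('dimCustomers', 'customerName', 'customerId', customerId) AS customerName
        FROM orders
        LIMIT 10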

    Special notes

    • Pinot controller metrics prefix is fixed to add a missing dot (#6499). This is a backward-incompatible change: JMX queries on controller metrics must be updated accordingly

    • Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 (#6589)

    • Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. (#6558)

    • Add TLS-support for client-pinot and pinot-internode connections (#6418). Upgrades to a TLS-enabled cluster can be performed safely and without downtime. To achieve a live upgrade, go through the following steps:

      • First, configure alternate ingress ports for https/netty-tls on brokers, controllers, and servers. Restart the components with a rolling strategy to avoid cluster downtime.
      • Second, verify manually that https access to controllers and brokers is live. Then, configure all components to prefer TLS-enabled connections (while still allowing unsecured access). Restart the individual components.
      • Third, disable insecure connections via configuration. You may also have to set controller.vip.protocol and controller.vip.port and update the configuration files of any ingestion jobs. Restart components a final time and verify that insecure ingress via http is not available anymore.
    • PQL endpoint on Broker is deprecated (#6607)

      • Apache Pinot has adopted SQL syntax and semantics. Legacy PQL (Pinot Query Language) is deprecated and no longer supported. Please use SQL syntax to query Pinot on broker endpoint /query/sql and controller endpoint /sql

    Major Bug fixes

    • Fix the SIGSEGV for large index (#6577)
    • Handle creation of segments with 0 rows so segment creation does not fail if the data source has 0 rows. (#6466)
    • Fix QueryRunner tool for multiple runs (#6582)
    • Use URL encoding for the generated segment tar name to handle characters that cannot be parsed to URI. (#6571)
    • Fix a bug of miscounting the top nodes in StarTreeIndexViewer (#6569)
    • Fix the raw bytes column in real-time segment (#6574)
    • Fixes a bug to allow using JSON_MATCH predicate in SQL queries (#6535)
    • Fix the overflow issue when loading the large dictionary into the buffer (#6476)
    • Fix empty data table for distinct query (#6363)
    • Fix the default map return value in DictionaryBasedGroupKeyGenerator (#6712)
    • Fix log message in ControllerPeriodicTask (#6709)
    • Fix bug #6671: RealtimeTableDataManager shuts down SegmentBuildTimeLeaseExtender for all tables in the host (#6682)
    • Fix license headers and plugin checks
  • release-0.6.0(Nov 5, 2020)

    Summary

    This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.

    The release was cut from the following commit: e5c9bec and the following cherry-picks:

    Notable New Features

    • Tiered storage (#5793)
    • Upsert feature (#6096, #6113, #6141, #6149, #6167)
    • Pre-generate aggregation functions in QueryContext (#5805)
    • Adding controller healthcheck endpoint: /health (#5846)
    • Add pinot-spark-connector (#5787)
    • Support multi-value non-dictionary group by (#5851)
    • Support type conversion for all scalar functions (#5849)
    • Add additional datetime functionality (#5438)
    • Support post-aggregation in ORDER-BY (#5856)
    • Support post-aggregation in SELECT (#5867)
    • Add RANGE FilterKind to support merging ranges for SQL (#5898)
    • Add HAVING support (#5889)
    • Support for exact distinct count for non int data types (#5872)
    • Add max qps bucket count (#5922)
    • Add Range Indexing support for raw values (#5853)
    • Add IdSet and IdSetAggregationFunction (#5926)
    • [Deepstore by-pass] Add a Deepstore bypass integration test with minor bug fixes. (#5857)
    • Add Hadoop counters for detecting schema mismatch (#5873)
    • Add RawThetaSketchAggregationFunction (#5970)
    • Instance API to directly updateTags (#5902)
    • Add streaming query handler (#5717)
    • Add InIdSetTransformFunction (#5973)
    • Add ingestion descriptor in the header (#5995)
    • Zookeeper put api (#5949)
    • Feature/#5390 segment indexing reload status api (#5718)
    • Segment processing framework (#5934)
    • Support streaming query in QueryExecutor (#6027)
    • Add list of allowed tables for emitting table level metrics (#6037)
    • Add FilterOptimizer which supports optimizing both PQL and SQL query filter (#6056)
    • Adding push job type of segment metadata only mode (#5967)
    • Minion taskExecutor for RealtimeToOfflineSegments task (#6050, #6124)
    • Adding array transform functions: array_average, array_max, array_min, array_sum (#6084)
    • Allow modifying/removing existing star-trees during segment reload (#6100)
    • Implement off-heap bloom filter reader (#6118)
    • Support for multi-threaded Group By reducer for SQL. (#6044)
    • Add OnHeapGuavaBloomFilterReader (#6147)
    • Support using ordinals in GROUP BY and ORDER BY clause (#6152) (see the example after this list)
    • Merge common APIs for Dictionary (#6176)
    • Add table level lock for segment upload (#6165)
    • Added recursive functions validation check for group by (#6186)
    • Add StrictReplicaGroupInstanceSelector (#6208)
    • Add IN_SUBQUERY support (#6022)
    • Add IN_PARTITIONED_SUBQUERY support (#6043)
    • Some UI features (#5810, #5981, #6117, #6215)
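
    For example, the new ordinal support allows referring to SELECT positions directly (a sketch; the events table is hypothetical):

        SELECT country, COUNT(*)
        FROM events
        GROUP BY 1
        ORDER BY 2 DESC
        LIMIT 10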

    Special notes

    • Brokers should be upgraded before servers in order to remain backward-compatible:
      • Change group key delimiter from '\t' to '\0' (#5858)
      • Support for exact distinct count for non int data types (#5872)
    • Pinot components have to be deployed in the following order: PinotServiceManager -> bootstrap services in role ServiceRole.CONTROLLER -> all remaining bootstrap services in parallel
      • Starts Broker and Server in parallel when using ServiceManager (#5917)
    • New settings introduced and old ones deprecated:
      • Make realtime threshold property names less ambiguous (#5953)
      • Change Signature of Broker API in Controller (#6119)
    • The DistinctCountThetaSketch aggregation function is still in beta. The PR below changes the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:
      • Enhance DistinctCountThetaSketchAggregationFunction (#6004)

    Major Bug fixes

    • Improve performance of DistinctCountThetaSketch by eliminating empty sketches and unions. (#5798)
    • Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data (#5816)
    • Fixing backward-compatible issue of schema fetch call (#5885)
    • Fix race condition in MetricsHelper (#5887)
    • Fixing the race condition that segment finished before ControllerLeaderLocator created. (#5864)
    • Fix CSV and JSON converter on BYTES column (#5931)
    • Fixing the issue that transform UDFs are parsed as function name 'OTHER', not the real function names (#5940)
    • Incorporating embedded exception while trying to fetch stream offset (#5956)
    • Use query timeout for planning phase (#5990)
    • Add null check while fetching the schema (#5994)
    • Validate timeColumnName when adding/updating schema/tableConfig (#5966)
    • Handle the partitioning mismatch between table config and stream (#6031)
    • Fix built-in virtual columns for immutable segment (#6042)
    • Refresh the routing when realtime segment is committed (#6078)
    • Add support for Decimal with Precision Sum aggregation (#6053)
    • Fixing the calls to Helix to throw exception if zk connection is broken (#6069)
    • Allow modifying/removing existing star-trees during segment reload (#6100)
    • Add max length support in schema builder (#6112)
    • Enhance star-tree to skip matching-all predicate on non-star-tree dimension (#6109)

    Backward Incompatible Changes

    • Make realtime threshold property names less ambiguous (#5953)
    • Enhance DistinctCountThetaSketchAggregationFunction (#6004)
    • Deep Extraction Support for ORC, Thrift, and ProtoBuf Records (#6046)
  • release-0.5.0(Sep 3, 2020)

    Summary

    This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.

    The release was cut from the following commit: d1b4586 and the following cherry-picks:

    Notable New Features

    • Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load (#PR4952)
    • Add PinotServiceManager to start Pinot components (#PR5266)
    • Support for protocol buffers input format. (#PR5293)
    • Add GenericTransformFunction wrapper for simple ScalarFunctions (PR#5440)
      • Adding support to invoke any scalar function via GenericTransformFunction
    • Add Support for SQL CASE Statement (PR#5461)
    • Support distinctCountRawThetaSketch aggregation that returns serialized sketch. (PR#5465)
    • Add multi-value support to SegmentDumpTool (PR#5487)
      • Add segment dump tool as part of the pinot-tool.sh script
    • Add json_format function to convert json object to string during ingestion. (PR#5492)
      • Can be used to store complex objects as a json string, which can later be queried using jsonExtractScalar (see the sketch after this list)
    • Support escaping single quote for SQL literal (PR#5501)
      • This is especially useful for DistinctCountThetaSketch because it stores expressions as literals, e.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)
    • Support expression as the left-hand side for BETWEEN and IN clause (PR#5502)
    • Add a new field IngestionConfig in TableConfig
      • FilterConfig: ingestion level filtering of records, based on filter function. (PR#5597)
      • TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
    • Allow star-tree creation during segment load (#PR5641)
      • Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.
    • Support for Pinot clients using JDBC connection (#PR5602)
    • Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. (#PR5564)
      • Added cluster config default.hyperloglog.log2m to allow users to set the default log2m value.
    • Add segment encryption on Controller based on table config (PR#5617)
    • Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. (PR#5631)
    • Support order-by aggregations not present in SELECT (PR#5637)
      • Example: "select subject from transcript group by subject order by count() desc". This is equivalent to the following query, but the returned response should not contain count(): "select subject, count() from transcript group by subject order by count() desc"
    • Add geo support for Pinot queries (PR#5654)
      • Added geo-spatial data model and geospatial functions
    • Cluster Manager UI & Query Console UI revamp (PR#5684 and PR#5732)
      • Updated Cluster Manager UI and added table details page and segment details page
    • Add Controller API to explore Zookeeper (PR#5687)
    • Support BYTES type for distinctCount and group-by (PR#5701 and PR#5708)
      • Add BYTES type support to DistinctCountAggregationFunction
      • Correctly handle BYTES type in DictionaryBasedAggregationOperator for DistinctCount
    • Support for ingestion job spec in JSON format (#PR5729)
    • Improvements to RealtimeProvisioningHelper command (#PR5737)
      • Improved docs related to ingestion and plugins
    • Added GROOVY transform function UDF (#PR5748)
      • Ability to run a groovy script in the query as a UDF, e.g. string concatenation: SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable
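
    A sketch of the json_format / jsonExtractScalar flow described above (the users table and profileJson column are hypothetical): json_format stores a complex object as a JSON string at ingestion time, and jsonExtractScalar queries it back:

        SELECT jsonExtractScalar(profileJson, '$.address.city', 'STRING') AS city,
               COUNT(*) AS cnt
        FROM users
        GROUP BY jsonExtractScalar(profileJson, '$.address.city', 'STRING')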

    Special notes

    • Changed the stream and metadata interface (PR#5542)
      • This PR concludes the work for issue #5359 to extend offset support for other streams
    • TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
    • Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. (#PR5546)
    • Change default segment load mode to MMAP. (PR#5539)
      • The load mode for segments previously defaulted to HEAP; it now defaults to MMAP.

    Major Bug fixes

    • Fix bug in distinctCountRawHLL on SQL path (#5494)
    • Fix backward incompatibility for existing stream implementations (#5549)
    • Fix backward incompatibility in StreamFactoryConsumerProvider (#5557)
    • Fix logic in isLiteralOnlyExpression. (#5611)
    • Fix double memory allocation during operator setup (#5619)
    • Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri (#5639)
    • Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits (#5676)
    • Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. (#5789)

    Backward Incompatible Changes

    • PQL queries with HAVING clause will no longer be accepted for the following reasons (#PR5570):
      • HAVING clause does not apply to PQL GROUP-BY semantics, where each aggregation column is ordered individually
      • The current behavior can produce inaccurate results without any notice
      • HAVING support will be added for SQL queries in the next release
    • Because of the standardization of the DistinctCountThetaSketch predicate strings, please upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. (#PR5613)
  • release-0.4.0(Jun 2, 2020)

    Summary

    This release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.

    The release was cut from this commit: https://github.com/apache/incubator-pinot/commit/008be2db874dd1c0d7877ce712842abd818d89d1 with cherry-picking the following patches:

    • https://github.com/apache/incubator-pinot/commit/7f10c5c9a571ba1ca8d8ebee5d2b74fc3b05cd7b
    • https://github.com/apache/incubator-pinot/commit/ee21e793c0365f538bcb3b801f47122c59fc0e04
    • https://github.com/apache/incubator-pinot/commit/a314d42e8744549e5c182383445f11c60ac4ae4a

    Notable New Features

    • Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
      • Used time column from table config instead of schema (#5320)
      • Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
      • Used DATE_TIME as the primary time column for Pinot tables (#5399)
    • Supported range queries using indexes (#5240)
    • Supported complex aggregation functions
      • Supported Aggregation functions with multiple arguments (#5261)
      • Added api in AggregationFunction to get compiled input expressions (#5339)
    • Added a simple PinotFS benchmark driver (#5160)
    • Supported default star-tree (#5147)
    • Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)
      • One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregationFunction (#5382)
    • Added access control for Pinot server segment download api (#5260)
    • Added Pinot S3 Filesystem Plugin (#5249)
    • Text search improvement
      • Pruned stop words for text index (#5297)
      • Used 8byte offsets in chunk based raw index creator (#5285)
      • Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)
      • Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)
      • Made text index query cache a configurable option (#5176)
      • Added Lucene DocId to PinotDocId cache to improve performance (#5177)
      • Removed the construction of second bitmap in text index reader to improve performance (#5199)
    • Tooling/usability improvement
      • Added template support for Pinot Ingestion Job Spec (#5341)
      • Allowed user to specify zk data dir and don't do clean up during zk shutdown (#5295)
      • Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)
      • Update JVM settings for scripts (#5127)
      • Added Stream github events demo (#5189)
      • Moved docs link from gitbook to docs.pinot.apache.org (#5193)
    • Re-implemented ORCRecordReader (#5267)
    • Evaluated schema transform expressions during ingestion (#5238)
    • Handled count distinct query in selection list (#5223)
    • Enabled async processing in pinot broker query api (#5229)
    • Supported bootstrap mode for table rebalance (#5224)
    • Supported order-by on BYTES column (#5213)
    • Added Nightly publish to binary (#5190)
    • Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)
    • Supported inbuilt transform functions (#5312)
      • Added date time transform functions (#5326)
    • Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)
    • APIs Additions/Changes
      • Added a new server api for download of segments
        • GET /segments/{tableNameWithType}/{segmentName}
    • Upgraded helix to 0.9.7 (#5411)
    • Added support to execute functions during query compilation (#5406)
    • Other notable refactoring
      • Moved table config into pinot-spi (#5194)
      • Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)
      • Added jsonExtractScalar function to extract field from json object (#4597)
      • Added template support for Pinot Ingestion Job Spec #5372
      • Cleaned up AggregationFunctionContext (#5364)
      • Optimized real-time range predicate when cardinality is high (#5331)
      • Made PinotOutputFormat use table config and schema to create segments (#5350)
      • Tracked unavailable segments in InstanceSelector (#5337)
      • Added a new best effort segment uploader with bounded upload time (#5314)
      • In SegmentPurger, used table config to generate the segment (#5325)
      • Decoupled schema from RecordReader and StreamMessageDecoder (#5309)
      • Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)
      • Improved GroupBy query performance (#5291)
      • Optimized ExpressionFilterOperator (#5132)

    Major Bug Fixes

    • Do not release the PinotDataBuffer when closing the index (#5400)
    • Handled a no-arg function in query parsing and expression tree (#5375)
    • Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
    • Fixed missing error message from pinot-admin command (#5305)
    • Fixed HDFS copy logic (#5218)
    • Fixed spark ingestion issue (#5216)
    • Fixed the capacity of the DistinctTable (#5204)
    • Fixed various links in the Pinot website

    Work in Progress

    • Upsert: support overriding data in the real-time table (#4261).
      • Add pinot upsert features to pinot common (#5175)
    • Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc

    Backward Incompatible Changes

    • TableConfig no longer supports de-serialization from a json string containing nested json strings (i.e. no \" inside the json) (#5194)
    • The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
      void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
      void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
      void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
    
    • A different segment writing logic was introduced in #5256. Although this is backward compatible in a sense that the old segments can be read by the new code, rollback would be tricky since new segments after the upgrade would have been written in the new format, and the old code cannot read those new segments.
  • release-0.3.0(Mar 25, 2020)

    What's the big change?

    The reason behind the architectural change between the previous release (0.2.0) and this release (0.3.0) is extensibility. The 0.2.0 release was not flexible enough to support new storage types or new stream types; basically, adding new functionality required changing too much code. Thus, the Pinot team went through an extensive refactoring and improvement of the source code.

    For instance, the picture below shows the module dependencies of the 0.2.X or previous releases. If we wanted to support a new storage type, we would have had to change several modules. Pretty bad, huh?

    [Image: module dependency diagram of the 0.2.X releases]

    In order to conquer this challenge, the following major changes were made:

    • Refactored common interfaces to pinot-spi module
    • Concluded four types of modules:
      • Pinot input format: How to read records from various data/file formats: e.g. Avro/CSV/JSON/ORC/Parquet/Thrift
      • Pinot filesystem: How to operate files on various filesystems: e.g. Azure Data Lake/Google Cloud Storage/S3/HDFS
      • Pinot stream ingestion: How to ingest data stream from various upstream systems, e.g. Kafka/Kinesis/Eventhub
      • Pinot batch ingestion: How to run Pinot batch ingestion jobs in various frameworks, like Standalone, Hadoop, Spark.
    • Built shaded jars for each individual plugin
    • Added support to dynamically load pinot plugins at server startup time

    Now the architecture supports a plug-and-play fashion, where new tools can be supported with small and simple extensions, without affecting big chunks of code. Integrations with new streaming services and data formats can be developed in a much simpler and more convenient way.

    Below is the currently supported Pinot plugins module structure:

    • pinot-input-format
      • pinot-avro
      • pinot-csv
      • pinot-json
      • pinot-orc
      • pinot-parquet
      • pinot-thrift
    • pinot-file-system
      • pinot-adls
      • pinot-gcs
      • pinot-hdfs
    • pinot-stream-ingestion
      • pinot-kafka-0.9
      • pinot-kafka-2.0
    • pinot-batch-ingestion
      • pinot-batch-ingestion-hadoop
      • pinot-batch-ingestion-spark
      • pinot-batch-ingestion-standalone

    Notable New Features

    • Added support for DISTINCT (#4535) (see the example after this list)
    • Added support for default value for BYTES column (#4583)
    • JDK 11 Support
    • Added support to tune size vs accuracy for approximation aggregation functions: DistinctCountHLL, PercentileEst, PercentileTDigest (#4666)
    • Added Data Anonymizer Tool (#4747)
    • Deprecated pinot-hadoop and pinot-spark modules, replaced with pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark
    • Support STRING and BYTES for no dictionary columns in realtime consuming segments (#4791)
    • Make pinot-distribution to build a pinot-all jar and assemble it (#4977)
    • Added support for PQL case insensitive (#4983)
    • Enhanced TableRebalancer logics
      • Moved to new rebalance strategy (#4695)
      • Supported rebalancing tables under any condition(#4990)
      • Supported reassigning completed segments along with Consuming segments for LLC realtime table (#5015)
    • Added experimental support for Text Search‌ (#4993)
    • Upgraded Helix to version 0.9.4, task management now works as expected (#5020)
    • Added date_trunc transformation function. (#4740)
    • Support schema evolution for consuming segment. (#4954)
    • SQL Support
      • Added Calcite SQL compiler
      • Added SQL response format (#4694, #4877)
      • Added support for GROUP BY with ORDER BY (#4602)
      • Query console defaults to use SQL syntax (#4994)
      • Support column alias (#5016, #5033)
      • Added SQL query endpoint: /query/sql (#4964)
      • Support arithmetic operators (#5018)
      • Support non-literal expressions for right-side operand in predicate comparison(#5070)
    • APIs Additions/Changes
      • Pinot Admin Command
        • Added -queryType option in PinotAdmin PostQuery subcommand (#4726)
        • Added -schemaFile as option in AddTable command (#4959)
        • Added OperateClusterConfig sub command in PinotAdmin (#5073)
      • Pinot Controller Rest APIs
        • Get Table leader controller resource (#4545)
        • Support HTTP POST/PUT to upload JSON encoded schema (#4639)
        • Table rebalance API now requires both table name and type as parameters. (#4824)
        • Refactored Segments APIs (#4806)
        • Added segment batch deletion REST API (#4828)
        • Update schema API to reload table on schema change when applicable (#4838)
        • Enhance the task related REST APIs (#5054)
        • Added PinotClusterConfig REST APIs (#5073)
          • GET /cluster/configs
          • POST /cluster/configs
          • DELETE /cluster/configs/{configName}
    • Configurations Additions/Changes
      • Config: controller.host is now optional in Pinot Controller
      • Added instance config: queriesDisabled to disable query sending to a running server (#4767)
      • Added broker config: pinot.broker.enable.query.limit.override configurable max query response size (#5040)
      • Removed deprecated server configs (#4903)
        • pinot.server.starter.enableSegmentsLoadingCheck
        • pinot.server.starter.timeoutInSeconds
        • pinot.server.instance.enable.shutdown.delay
        • pinot.server.instance.starter.maxShutdownWaitTime
        • pinot.server.instance.starter.checkIntervalTime
      • Decouple server instance id with hostname/port config. (#4995)
      • Add FieldConfig to encapsulate encoding, indexing info for a field.(#5006)
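
    As an example of the new DISTINCT support mentioned above (a sketch; the events table is hypothetical):

        SELECT DISTINCT country
        FROM events
        LIMIT 100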

    Major Bug Fixes

    • Fixed the bug of releasing the segment when there are still threads working on it. (#4764)
    • Fixed the bug of uneven task distribution for threads (#4793)
    • Fixed encryption for .tar.gz segment file upload (#4855)
    • Fixed controller rest API to download segment from non local FS. (#4808)
    • Fixed the bug of not releasing segment lock if segment recovery throws exception (#4882)
    • Fixed the issue of server not registering state model factory before connecting the Helix manager (#4929)
    • Fixed the exception in server instance when Helix starts a new ZK session (#4976)
    • Fixed ThreadLocal DocIdSet issue in ExpressionFilterOperator (#5114)
    • Fixed the bug in default value provider classes (#5137)
    • Fixed the bug when no segment exists in RealtimeSegmentSelector (#5138)

    Work in Progress

    • We are in the process of supporting text search query functionalities.
    • We are in the process of supporting null value (#4230), currently limited query feature is supported
      • Added Presence Vector to represent null value (#4585)
      • Added null predicate support for leaf predicates (#4943)

    Backward Incompatible Changes

    • It's a disruptive upgrade from version 0.1.0 to this release because of the protocol changes between Pinot Broker and Pinot Server. Please ensure that you upgrade to release 0.2.0 first, then upgrade to this version.
    • If you build your own startable or war without using the scripts generated in the pinot-distribution module: for Java 8, an environment variable plugins.dir is required for Pinot to find out where to load all the Pinot plugin jars; for Java 11, the plugins directory is required to be explicitly set into the classpath. Please see pinot-admin.sh as an example.
    • As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.
    • Kafka 0.9 is no longer included in the release distribution.
    • Pull request #4806 introduces a backward incompatible API change for segments management.
      • Removed segment toggle APIs
      • Removed list all segments in cluster APIs
      • Deprecated the following APIs:
        • GET /tables/{tableName}/segments
        • GET /tables/{tableName}/segments/metadata
        • GET /tables/{tableName}/segments/crc
        • GET /tables/{tableName}/segments/{segmentName}
        • GET /tables/{tableName}/segments/{segmentName}/metadata
        • GET /tables/{tableName}/segments/{segmentName}/reload
        • POST /tables/{tableName}/segments/{segmentName}/reload
        • GET /tables/{tableName}/segments/reload
        • POST /tables/{tableName}/segments/reload
    • Pull request #5054 deprecated the task-related APIs below (old -> new):
      • GET:
        • /tasks/taskqueues: List all task queues
        • /tasks/taskqueuestate/{taskType} -> /tasks/{taskType}/state
        • /tasks/tasks/{taskType} -> /tasks/{taskType}/tasks
        • /tasks/taskstates/{taskType} -> /tasks/{taskType}/taskstates
        • /tasks/taskstate/{taskName} -> /tasks/task/{taskName}/taskstate
        • /tasks/taskconfig/{taskName} -> /tasks/task/{taskName}/taskconfig
      • PUT:
        • /tasks/scheduletasks -> POST /tasks/schedule
        • /tasks/cleanuptasks/{taskType} -> /tasks/{taskType}/cleanup
        • /tasks/taskqueue/{taskType}: Toggle a task queue
      • DELETE:
        • /tasks/taskqueue/{taskType} -> /tasks/{taskType}
    • Deprecated the pinot-hadoop and pinot-spark modules and replaced them with pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark.
    • Introduced new Pinot batch ingestion jobs and YAML-based job specs to define segment generation jobs and segment push jobs.
    • You may see exceptions like the one below in pinot-broker logs during the cluster upgrade; they are safe to ignore.
    2020/03/09 23:37:19.879 ERROR [HelixTaskExecutor] [CallbackProcessor@b808af5-pinot] [pinot-broker] [] Message cannot be processed: 78816abe-5288-4f08-88c0-f8aa596114fe, {CREATE_TIMESTAMP=1583797034542, MSG_ID=78816abe-5288-4f08-88c0-f8aa596114fe, MSG_STATE=unprocessable, MSG_SUBTYPE=REFRESH_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=fooBar_OFFLINE, RESOURCE_NAME=brokerResource, RETRY_COUNT=0, SRC_CLUSTER=pinot, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_hostname.domain,com_9000, TGT_NAME=Broker_hostname,domain.com_6998, TGT_SESSION_ID=f6e19a457b80db5, TIMEOUT=-1, segmentName=fooBar_559, tableName=fooBar_OFFLINE}{}{}
    java.lang.UnsupportedOperationException: Unsupported user defined message sub type: REFRESH_SEGMENT
            at org.apache.pinot.broker.broker.helix.TimeboundaryRefreshMessageHandlerFactory.createHandler(TimeboundaryRefreshMessageHandlerFactory.java:68) ~[pinot-broker-0.2.1172.jar:0.3.0-SNAPSHOT-c9d88e47e02d799dc334d7dd1446a38d9ce161a3]
            at org.apache.helix.messaging.handling.HelixTaskExecutor.createMessageHandler(HelixTaskExecutor.java:1096) ~[helix-core-0.9.1.509.jar:0.9.1.509]
            at org.apache.helix.messaging.handling.HelixTaskExecutor.onMessage(HelixTaskExecutor.java:866) [helix-core-0.9.1.509.jar:0.9.1.509]
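
    A minimal sketch of the startup difference described above, assuming a custom start command and an installation under /opt/pinot; plugins.dir is passed here as a JVM system property, as pinot-admin.sh does:

    # Java 8: tell Pinot where to load the plugin jars from
    java -Dplugins.dir=/opt/pinot/plugins \
         -cp "/opt/pinot/lib/*" \
         org.apache.pinot.tools.admin.PinotAdministrator StartBroker -zkAddress localhost:2181

    # Java 11: the plugins directory must also be on the classpath explicitly
    java -Dplugins.dir=/opt/pinot/plugins \
         -cp "/opt/pinot/lib/*:/opt/pinot/plugins/*" \
         org.apache.pinot.tools.admin.PinotAdministrator StartBroker -zkAddress localhost:2181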
    
    Source code(tar.gz)
    Source code(zip)
  • release-0.2.0(Nov 11, 2019)

    Major changes/feature additions since 0.1.0:

    • Added support for Kafka 2.0.

    • Table rebalancer now supports a minimum number of serving replicas during rebalance.

    • Added support for UDF in filter predicates and selection.

    • Added support to use a hex string as the representation of a byte array in queries (#4041) (see the query sketch after this list)

    • Added support for a Parquet reader (#3852)

    • Introduced interface stability and audience annotations (#4063)

    • Refactored HelixBrokerStarter to separate the constructor and start() (#4100) - backwards incompatible

    • Added an admin tool for listing segments with invalid intervals for offline tables

    • Migrated to log4j2 (#4139)

    • Added a simple Avro message decoder

    • Added support for passing headers in the Pinot client

    • Support transform functions inside the AVG aggregation function (#4557) (see the query sketch after this list)

    • Configurations additions/changes

      • Allowed a customized metrics prefix (#4392)
      • controller.enable.batch.message.mode now defaults to false (#3928)
      • Made RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (#3946)
      • Added config to control the Kafka fetcher size and increased the default (#3869)
      • Added a percent threshold to consider startup of services (#4011)
      • Made SingleConnectionBrokerRequestHandler the default (#4048)
      • Always enable the default column feature; removed the configuration (#4074)
      • Removed redundant default broker configurations (#4106)
      • Removed some config keys in server (#4222)
      • Added config to disable HLC realtime segments (#4235)
      • The following config variables are deprecated and will be removed in the next release:
        • pinot.broker.requestHandlerType will be removed in favor of the "singleConnection" broker request handler. If you have set this configuration, please remove it and use the default type ("singleConnection").
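
    To illustrate two of the query-side additions above (hex-string byte[] literals and transform functions inside AVG), a hedged sketch follows; the table and column names are hypothetical, and the exact transform-function names should be checked against the documentation:

    # Match a byte[] column against a hex-string literal (#4041); names are hypothetical
    bin/pinot-admin.sh PostQuery \
        -query "SELECT COUNT(*) FROM myTable WHERE byteCol = 'deadbeef'"

    # Apply a transform function inside the AVG aggregation (#4557)
    bin/pinot-admin.sh PostQuery \
        -query "SELECT AVG(DIV(clicks, impressions)) FROM myTable"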

    Work in progress

    • We are in the process of separating Helix and Pinot controllers, so that administrators can have the option of running independent Helix controllers and Pinot controllers.

    • We are in the process of moving towards supporting SQL query format and results.

    • We are in the process of separating instance and segment assignment using instance pools to optimize the number of Helix state transitions in Pinot clusters with thousands of tables.

    Other important notes

    • Task management does not work correctly in this release, due to bugs in Helix. We will upgrade to Helix 0.9.2 (or later) to get this fixed.

    • You must upgrade to this release before moving on to newer Pinot releases. The protocol between pinot-broker and pinot-server has been changed, and this release has the code to retain compatibility moving forward. Skipping this release may (depending on your environment) cause query errors if brokers are upgraded while servers are still being upgraded.

    • As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.

    • PR #4100 introduces a backwards incompatible change to the pinot broker. If you use the Java constructor on the HelixBrokerStarter class, you will face a compilation error with this version. You will need to construct the object and call the start() method in order to start the broker.

    • PR #4139 introduces a backwards incompatible change for log4j configuration. If you used a custom log4j configuration (log4j.xml), you need to write a new log4j2 configuration (log4j2.xml). In addition, you may need to change the arguments on the command line to start pinot components.

      If you used the pinot-admin command to start pinot components, you don't need any change. If you used your own commands to start pinot components, you will need to pass the new log4j2 config as a JVM parameter (i.e. substitute the -Dlog4j.configuration or -Dlog4j.configurationFile argument with -Dlog4j2.configurationFile=log4j2.xml), as sketched below.
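
      A before/after sketch of the JVM argument substitution, assuming a custom start command; the paths and ZooKeeper address are illustrative:

      # Before (log4j 1.x)
      java -Dlog4j.configuration=file:/opt/pinot/conf/log4j.xml \
           -cp "/opt/pinot/lib/*" \
           org.apache.pinot.tools.admin.PinotAdministrator StartController -zkAddress localhost:2181

      # After (log4j2): new flag, new config-file format
      java -Dlog4j2.configurationFile=/opt/pinot/conf/log4j2.xml \
           -cp "/opt/pinot/lib/*" \
           org.apache.pinot.tools.admin.PinotAdministrator StartController -zkAddress localhost:2181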

    Source code(tar.gz)
    Source code(zip)
  • release-0.1.0(Mar 7, 2019)

    Pinot 0.1.0

    This is the first official release of Apache Pinot.

    Requirements

    • Java 8 (It has been reported that Pinot cannot be compiled with Java 11 due to a missing package for sun.misc, #3625)

    Highlights

    Pluggable Storage

    We have added a Pinot filesystem abstraction that gives users the option to plug in their own storage backend. Currently, we support HDFS, NFS, and Azure Data Lake. A config sketch follows the documentation link below.

    • Documentation (https://pinot.readthedocs.io/en/latest/pluggable_storage.html)
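
    As an illustration of plugging in a backend, a controller-config sketch follows; the property keys and the class name are assumptions here and should be checked against the documentation above:

    # Register an HDFS-backed PinotFS for the hdfs:// scheme (keys are illustrative)
    cat >> controller.conf <<'EOF'
    pinot.controller.storage.factory.class.hdfs=org.apache.pinot.filesystem.HadoopPinotFS
    pinot.controller.segment.fetcher.protocols=file,http,hdfs
    EOF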

    Pluggable Realtime Streams

    We have decoupled Pinot from Kafka for realtime ingestion and added an abstraction for streams. Users can now add their own plugins to read from any pub-sub system; a config sketch follows the documentation link below.

    • Documentation (https://pinot.readthedocs.io/en/latest/pluggable_streams.html)
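
    A hedged sketch of the plug-in point, as a streamConfigs fragment for a realtime table config; the consumer factory class and the Kafka values are illustrative assumptions, not taken from these notes:

    # Write an illustrative streamConfigs fragment for a realtime table config
    cat > /tmp/streamConfigs-fragment.json <<'EOF'
    {
      "streamType": "kafka",
      "stream.kafka.topic.name": "myTopic",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.core.realtime.impl.kafka.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9092"
    }
    EOF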

    Native Byte Array and TDigest Support

    Pinot can now support the byte[] type natively. Pinot can also accept byte-serialized TDigest objects (com.tdunning.math.stats.TDigest), which can be queried to compute approximate percentiles as follows.

    select percentileTDigest95(tDigestColumn) from myTable where ... group by ... top N
    

    Compatibility

    Since this is the first official Apache release, these notes apply to people who have used Pinot in production by building from the source code.

    Backward Incompatible Changes

    • All the packages have been renamed from com.linkedin to org.apache.
    • Method signatures have been changed in the ControllerRestApi and SegmentNameGenerator interfaces.

    Rollback Restrictions

    • If you have used the PartitionAwareRouting feature, note that we have changed the format of partition values in the segment metadata and ZK segment metadata (e.g. partitionRanges: [0 0], [1 1] -> partitions: 0, 1). The new code is backward compatible with the old format; however, old code will not be able to understand the new format in case of a rollback.

    Deprecation

    • N/A

    Credits

    Thanks to everyone who has contributed to Pinot.

    We would also like to express our gratitude to our Apache mentors @kishoreg, @felixcheung, @jimjag, @olamy and @rvs.

    Source code(tar.gz)
    Source code(zip)