The Heroic Time Series Database

Overview

DEPRECATION NOTICE

This repo is no longer actively maintained. While it should continue to work and there are no major known bugs, we will not be improving Heroic or releasing new versions.

A scalable time series database based on Bigtable, Cassandra, and Elasticsearch. Go to https://spotify.github.io/heroic/ for documentation.

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.

Install

Docker

Docker images are available on Docker Hub.

$ docker run -p 8080:8080 -p 9091:9091 spotify/heroic

Heroic will now be reachable at http://localhost:8080/status.
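
You can verify this with a plain HTTP request (a minimal check, assuming the default port mapping above):

$ curl http://localhost:8080/status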

In production it's advised to use a tagged version.
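
For example, a release tag from the Releases section below can be used instead of the implicit latest tag (the exact tags available on Docker Hub may differ; this is illustrative):

$ docker run -p 8080:8080 -p 9091:9091 spotify/heroic:2.3.18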

Configuration

For help on how to write a configuration file, see the Configuration Section of the official documentation.

Heroic has been tested with the services it builds on: Bigtable, Cassandra, and Elasticsearch.

Developing

Building from source

In order to compile Heroic, you'll need:

  • A Java 11 JDK
  • Maven 3
  • Gradle

The project is built using Gradle:

# full build, runs all tests and builds the shaded jar
./gradlew build

# only compile
./gradlew assemble

# build a single module
./gradlew heroic-metric-bigtable:build

The heroic-dist module can be used to produce a shaded jar that contains all required dependencies:

./gradlew heroic-dist:shadowJar

After building, the entry point of the service is com.spotify.heroic.HeroicService. The following is an example of how this can be run:

./gradlew heroic-dist:runShadow <config>

which is the equivalent of doing:

java -jar $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar <config>

Building with Docker

$ docker build -t heroic:latest .

This is a multi-stage build and will first build Heroic via a ./gradlew clean build and then copy the resulting shaded jar into the runtime container.

Heroic can then be run via Docker:

$ docker run -d -p 8080:8080 -p 9091:9091 -v /path/to/config.yml:/heroic.yml spotify/heroic:latest

Logging

Logging is captured using SLF4J, and forwarded to Log4j.

To configure logging, define the -Dlog4j.configurationFile=<path> parameter. You can use docs/log4j2-file.xml as a base.
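
For example, when running the shaded jar produced by the build (jar path taken from the build section above; the Log4j file path is illustrative):

java -Dlog4j.configurationFile=docs/log4j2-file.xml -jar $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar <config>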

Testing

We run tests with Gradle:

# run unit tests
./gradlew test

# run integration tests
./gradlew integrationTest

or to run a more comprehensive set of checks:

./gradlew check

This runs the unit and integration tests along with additional verification tasks.

It is strongly recommended that you run the full test suite before opening a pull request, otherwise it will be rejected by Travis.

Full Cluster Tests

Full cluster tests are defined in heroic-dist/src/test/java.

This way, they have access to all the modules and parts of Heroic.

The JVM RPC module is specifically designed to allow for rapid execution of integration tests. It allows multiple Heroic cores to be defined in the same JVM instance and to communicate with each other.

Code Coverage

There's an ongoing project to improve test coverage. The coverage reports on codecov.io show areas to focus on.
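
To inspect coverage locally, a JaCoCo report can be generated, assuming the JaCoCo plugin is applied in the Gradle build (an assumption; check the build scripts):

./gradlew test jacocoTestReport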

Bypassing Validation

To bypass automatic formatting and checkstyle validation you can use the following stanza:

// @formatter:off
final List<String> list = ImmutableList.of(
   "Welcome to...",
   "... The Wild West"
);
// @formatter:on

To bypass a FindBugs error, you should use the @SuppressFBWarnings annotation.

@SuppressFBWarnings(value="FINDBUGS_ERROR_CODE", justification="I Know Better Than FindBugs")
public class IKnowBetterThanFindbugs {
    // ...
}

Module Orientation

The Heroic project is split into a couple of modules.

The most critical one is heroic-component. It contains interfaces, value objects, and the basic set of dependencies necessary to glue different components together.

Submodules include metric, suggest, metadata, and aggregation. The first three contain various implementations of the given backend type, while the last provides aggregation methods.

heroic-core contains the com.spotify.heroic.HeroicCore class which is the central building block for setting up a Heroic instance.

heroic-elasticsearch-utils is a collection of utilities for interacting with Elasticsearch. This is kept separate since more than one backend needs to talk to Elasticsearch.

Finally, there is heroic-dist, a small project that depends on all the other modules. Here everything is bound together into a distribution: a shaded jar. It also provides the entry points for the service, namely com.spotify.heroic.HeroicService, and the interactive shell, com.spotify.heroic.HeroicShell. The shell can either be run standalone or connected to an existing Heroic instance for administration.
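
Since both entry points live in the shaded jar, either can be started by naming the main class explicitly; a minimal sketch, reusing the jar path from the build section above (and assuming HeroicShell exposes a standard main method like the service class does):

# run the service
java -cp $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar com.spotify.heroic.HeroicService <config>

# start the interactive shell
java -cp $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar com.spotify.heroic.HeroicShell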

Contributing

Guidelines for contributing can be found here.

Comments
  • Make suggest result size configurable

    Right now, ES suggest sets its query size to 0. This should make Elasticsearch return the maximum number of results (10_000). That could be incomplete, but it's also not very usable.

    Having this configurable and set to something fairly small in combination with typeahead searching from the grafana datasource would likely give both faster results and a better user experience.

    type:enhancement component:elasticsearch component:suggest 
    opened by hexedpackets 27
  • just compiled in master none query metrics works

    Hello,

    I just wanted to update the source code from master. Everything in the compilation goes right, no errors; I can write data, Elasticsearch is OK, and I can count the number of series.

    BUT I still get this error on all my different queries:

    Some fetches failed (628) or were cancelled (0), caused by Some fetches failed (628) or were cancelled (0)
    

    Which source code do I compile to get things to work?

    type:bug 
    opened by lucilecoutouly 15
  • No more data returned by heroic

    Hi guys,

    I've done a git pull and package build and it seems that no more data is returned by heroic...

    This is my git reflog

    root@heroic:~/heroic# git reflog --date=iso
    a0551f6 HEAD@{2017-02-06 09:45:53 +0100}: pull: Fast-forward
    be6c06f HEAD@{2016-12-22 12:46:06 +0100}: pull: Fast-forward
    b329697 HEAD@{2016-12-16 12:02:38 +0100}: pull: Fast-forward
    114d815 HEAD@{2016-12-15 13:01:43 +0100}: pull: Fast-forward
    12b92c5 HEAD@{2016-12-05 10:00:04 +0100}: pull: Fast-forward
    c6a43b0 HEAD@{2016-12-01 15:32:27 +0100}: pull: Fast-forward
    51494af HEAD@{2016-11-28 16:27:42 +0100}: pull: Fast-forward
    660e16e HEAD@{2016-11-14 14:21:20 +0100}: pull: Fast-forward
    cf81df2 HEAD@{2016-11-03 13:53:35 +0100}: pull: Fast-forward
    c05e7dc HEAD@{2016-10-14 10:12:40 +0200}: pull: Fast-forward
    2a16276 HEAD@{2016-10-13 11:37:22 +0200}: pull: Fast-forward
    240076d HEAD@{2016-10-11 09:20:06 +0200}: pull: Fast-forward
    6986d65 HEAD@{2016-10-08 21:37:19 +0200}: pull: Fast-forward
    f576768 HEAD@{2016-09-27 21:21:29 +0200}: pull: Fast-forward
    1a3578c HEAD@{2016-09-01 21:29:47 +0200}: clone: from https://github.com/spotify/heroic.git
    

    The errors seem to be coming from elasticsearch ...

    [2017-02-06 14:05:05,602][DEBUG][action.search.type       ] [Achilles] [59813] Failed to execute query phase
    org.elasticsearch.transport.RemoteTransportException: [Gauntlet][inet[/1.1.1.2:9300]][indices:data/read/search[phase/scan/scroll]]
    Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [59813]
    	at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
    	at org.elasticsearch.search.SearchService.executeScan(SearchService.java:265)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:939)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:930)
    	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    

    I see the index for this respective day present in ES, and the Heroic API nodes seem to be able to consume and insert data into ES and Cassandra normally.

    The retrieval of data returns the following message...

    {"queryId":"edd112b3-8a66-4f62-8d14-dcb814e8ef87","range":{"start":1486386897275,"end":1486386902275},"trace":{"what":{"name":"com.spotify.heroic.CoreQueryManager#query"},"elapsed":738436,"children":[{"what":{"name":"com.spotify.heroic.CoreQueryManager#query_shard[{site=HG}]"},"elapsed":738552,"children":[]}]},"limits":[],"commonTags":{},"result":[],"errors":[{"type":"shard","nodes":["grpc://heroic1.cc:1394[local]#query","grpc://heroic.cc:1394#query","grpc://heroic3.cc:1394#query","grpc://heroic2.cc:1394#query"],"shard":{"site":"HG"},"error":"Request finished with status code (Status{code=UNKNOWN, description=null, cause=null}), caused by Request finished with status code (Status{code=UNKNOWN, description=null, cause=null})"}]}
    
    type:question 
    opened by servergeeks 15
  •  Error querying cassandra:9042 : com.datastax.driver.core.exceptions.BusyPoolException: [cassandra] Pool is busy (no available connection and the queue has reached its max size 256)

    Hello,

    It's related to #243. When I do a big query:

     Error querying cassandraseed/10.42.28.96:9042 : com.datastax.driver.core.exceptions.BusyPoolException: [cassandraseed/10.42.28.96] Pool is busy (no available connection and the queue has reached its max size 256)
    

    Did you change the version of the Cassandra driver? How can this error be prevented?

    Someone working on the Hawkular time series database says: BusyPoolException under heavy load - no available connection and the queue has reached its max size 256 https://issues.jboss.org/browse/HWKMETRICS-542 https://issues.jboss.org/browse/HWKMETRICS-597

    Is it related?

    Is this a resolution: https://datastax-oss.atlassian.net/browse/JAVA-893

    thanks

    type:bug type:question 
    opened by lucilecoutouly 14
  • Delta aggregation

    Adds delta aggregation (i.e. "diff" in KairosDB or "rate" in OpenTSDB; open to alternative names). There are no arguments, leaving it up to the user to decide whether to chain it as an input to a SamplingAggregation, or to diff the output for any other Aggregation. Example:

    {
      "range": {"type": "relative", "unit": "HOURS", "value": 2},
      "filter": ["and", ["key", "foo"], ["=", "foo", "bar"], ["+", "role"]],
      "aggregation": {"type": "delta"},
      "groupBy": ["site"]
    }
    

    The above would return the difference of each point from the last within the sample. ~~The first point in a difference is always 0.~~ (EDIT: the first point sampled is truncated and the first returned point is the difference from the last point.)

    This is my first time writing Java, so please go easy! I understand it might not be the most idiomatic code, but I did my best to use patterns found in the other aggregators (stream, map, etc.). Thanks for looking.

    opened by mykolasmith 13
  • [WIP] Add distribution support to spotify100 and the ability to fetch distribution from bigTable.

    This PR adds distribution support to spotify100 (jsonMetric) and the ability to fetch distribution metrics from Bigtable. Spotify100 consumers and ingestion modules should now be able to handle both new and old JsonMetric. For more information, please refer to this issue.

    opened by ao2017 12
  • Heroic Bigtable Consumer does not handle failures as expected

    DoD

    Heroic should properly address exception handling.

    • Heroic consumer should not ack messages that it failed to process.
    • Any write or processing failure should be logged.

    Background

    The Heroic Bigtable consumer failed to write to Bigtable when the new column family had not been created yet. No exception was logged, and the consumer ack-ed the message as if the write had been successful. I got the exception below via hacky debugger evaluations.

    io.grpc.StatusRuntimeException: NOT_FOUND: Error while mutating the row '\023\023distribution-test-1\004\003\003env\007\007staging\004\004host\017\017samanthanoellef\013\013metric_type\014\014distribution\004\004what\005\005stuff\000\000\001u\000\000\000\000' (projects/xpn-heroic-1/instances/metrics-staging-guc/tables/metrics) : Requested column family not found.
    	at io.grpc.Status.asRuntimeException(Status.java:517)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation.toException(BulkMutation.java:77)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation.access$400(BulkMutation.java:59)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch.handleEntries(BulkMutation.java:227)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch.handleResult(BulkMutation.java:200)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch$1.onSuccess(BulkMutation.java:170)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch$1.onSuccess(BulkMutation.java:167)
    	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1021)
    	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1137)
    	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:957)
    	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:726)
    	at com.google.cloud.bigtable.grpc.async.AbstractRetryingOperation$GrpcFuture.set(AbstractRetryingOperation.java:90)
    	at com.google.cloud.bigtable.grpc.async.RetryingMutateRowsOperation.onOK(RetryingMutateRowsOperation.java:91)
    	at com.google.cloud.bigtable.grpc.async.AbstractRetryingOperation.onClose(AbstractRetryingOperation.java:167)
    	at com.google.cloud.bigtable.grpc.async.ThrottlingClientInterceptor$1$1.onClose(ThrottlingClientInterceptor.java:125)
    	at com.google.cloud.bigtable.grpc.io.ChannelPool$InstrumentedChannel$2.onClose(ChannelPool.java:209)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor$UnAuthResponseListener.onClose(RefreshingOAuth2CredentialsInterceptor.java:81)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
    	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
    	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:834)
    
    opened by samfadrigalan 10
  • Implement a configurable ES result size

    Please see Make suggest result size configurable #646 for a detailed discussion of the implementation, especially the conversation about ES probably not being hit for 10K suggestions the vast majority of the time.

    Use Case Resolved

    • I am a Grafana user
    • who wants the UI to be more responsive
    • so that I can work more efficiently and not get irritated

    Use Case 2 Resolved

    • I am a Heroic developer
    • who wants to tighten-up/secure/protect Heroic against unreasonably large request suggestion limits, irrespective of their origin
    • so that I can rest easy at night

    Design & Implementation Notes

    • a trivial new class NumSuggestionsLimit, objects of which are used by SuggestBackend implementations to determine the maximum number (i.e. limit) of suggestion entities to return.

    • said class will not allow a limit of more than 500. It will default to a limit of 50 if no limit is supplied by heroic.yaml for the respective Backend implementation.

    • when, for example, a call to tagSuggest() is made, if the request contains a limit, that limit is respected. If it does not, the Backend's limit is used.

    • Note that I also refactored a bunch of related code to help me understand it.

    opened by sming 10
  • Enable Cassandra Driver Pooling option configuration

    This change set enables the configuration of the java driver pooling options during startup.

    A couple of notes:

    • In our case we are very sensitive to data loss, so we wanted to set up a separate log file that we can parse for alerting/resubmission, hence the additional logging in DatastaxBackend.java. We are totally open to (and appreciate recommendations for) different approaches to this.

    • I will also submit a separate PR to the site branch with documentation updates.

    opened by jcabmora 10
  • client=org.elasticsearch.client.node.NodeClient@429967c4) leaked @ unknown

    Hi guys,

    I've set up a Heroic cluster (4 nodes) that is using 5 Cassandra nodes and 3 Elasticsearch nodes. For some reason, after a day of ingesting data (around 2GB/index in ES), retrieving data becomes really slow and sometimes impossible...

    If I try to do an HQL query in Grafana, I usually end up crashing a Heroic API node and everything stops working.

    Any tips/ideas to improve performance and keep my cluster running/usable ?

    An example of a crash/error would be...

    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic3.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic1.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000007 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:03:45.050 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [update] [{site=HG}] 3 result(s)
    14:04:17.843 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:17.845 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:45.050 [heroic-scheduler#0] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (00000008)
    14:04:47.853 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:47.855 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:50.053 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic3.cc:1394
    14:04:50.053 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic.cc:1394
    14:04:50.054 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic1.cc:1394
    14:04:50.054 [nioEventLoopGroup-2-6] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000008 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:04:50.054 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [update] [{site=HG}] 3 result(s)
    14:05:50.055 [heroic-scheduler#4] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (00000009)
    
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic3.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic1.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000009 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:05:55.060 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [update] [{site=HG}] 3 result(s)
    14:06:55.060 [heroic-scheduler#2] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (0000000a)
    14:07:00.063 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic3.cc:1394
    14:07:00.063 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic.cc:1394
    14:07:00.064 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic1.cc:1394
    14:07:00.064 [nioEventLoopGroup-2-15] ERROR com.spotify.heroic.cluster.CoreClusterManager - 0000000a [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:07:00.064 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [update] [{site=HG}] 3 result(s)
    14:07:30.085 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client timed out
    14:07:30.085 [jersey-background-task-scheduler-0] ERROR com.spotify.heroic.common.CoreJavaxRestFramework - Request cancelled
    14:07:30.089 [jersey-background-task-scheduler-0] INFO  org.eclipse.jetty.server.RequestLog - 131.130.249.44 - - [16/Dec/2016:13:02:30 +0000] "POST //heroic3.cc:8080/query/batch HTTP/1.1" 500 101 
    14:07:30.090 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client completed
    14:07:43.525 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client timed out
    14:07:43.525 [jersey-background-task-scheduler-0] ERROR com.spotify.heroic.common.CoreJavaxRestFramework - Request cancelled
    14:07:43.528 [jersey-background-task-scheduler-0] INFO  org.eclipse.jetty.server.RequestLog - 131.130.249.44 - - [16/Dec/2016:13:02:43 +0000] "POST //heroic3.cc:8080/query/batch HTTP/1.1" 500 101 
    14:07:43.528 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client completed
    14:08:00.065 [heroic-scheduler#2] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (0000000b)
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic3.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic1.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [update] [{site=HG}] 4 result(s)
    14:08:01.876 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], begin rebalancing consumer heroic_heroic3-1481892993532-faf9240b try #0
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Stopping leader finder thread
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Shutting down
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Stopped 
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Shutdown completed
    14:08:01.934 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Stopping all fetchers
    14:08:01.934 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Shutting down
    14:08:01.934 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] WARN  kafka.consumer.SimpleConsumer - Reconnect due to socket error: null
    14:08:01.935 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Stopped 
    14:08:01.935 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Shutdown completed
    14:08:01.935 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Shutting down
    14:08:01.936 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] WARN  kafka.consumer.SimpleConsumer - Reconnect due to socket error: null
    14:08:01.936 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Stopped 
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Shutdown completed
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] All connections stopped
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Cleared all relevant queues for this fetcher
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Cleared the data chunks in all the consumer message iterators
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Committing all offsets after clearing the fetcher queues
    14:08:01.940 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Releasing partition ownership
    14:08:01.944 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Consumer heroic_heroic3-1481892993532-faf9240b rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) for topic metrics with consumers: List(heroic_heroic-1481893181613-f2c9267b-0, heroic_heroic-1481893181613-f2c9267b-1, heroic_heroic1-1481893140289-6988cffe-0, heroic_heroic1-1481893140289-6988cffe-1, heroic_heroic2-1481893050704-558be4ae-0, heroic_heroic2-1481893050704-558be4ae-1, heroic_heroic3-1481892993532-faf9240b-0, heroic_heroic3-1481892993532-faf9240b-1)
    14:08:01.944 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-0 attempting to claim partition 8
    14:08:01.945 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-1 attempting to claim partition 9
    14:08:01.947 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-0 successfully owned partition 8 for topic metrics
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-1 successfully owned partition 9 for topic metrics
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Updating the cache
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Consumer heroic_heroic3-1481892993532-faf9240b selected partitions : metrics:8: fetched offset = 788121: consumed offset = 788121,metrics:9: fetched offset = 788111: consumed offset = 788111
    14:08:01.949 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], end rebalancing consumer heroic_heroic3-1481892993532-faf9240b try #0
    14:08:01.951 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Starting 
    14:08:01.958 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Verifying properties
    14:08:01.958 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property client.id is overridden to heroic
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property metadata.broker.list is overridden to kafkabaer1.cc:9092,kafkabaer2.cc:9092,kafkabaer3.cc:9092
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property request.timeout.ms is overridden to 30000
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.client.ClientUtils$ - Fetching metadata from broker id:1,host:kafkabaer1.cc,port:9092 with correlation id 9 for 1 topic(s) Set(metrics)
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.producer.SyncProducer - Connected to kafkabaer1.cc:9092 for producing
    14:08:01.960 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.producer.SyncProducer - Disconnecting from kafkabaer1.cc:9092
    14:08:01.961 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Starting 
    14:08:01.961 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Added fetcher for partitions ArrayBuffer([[metrics,8], initOffset 788121 to broker id:3,host:kafkabaer3.cc,port:9092] , [[metrics,9], initOffset 788111 to broker id:1,host:kafkabaer1.cc,port:9092] )
    14:08:01.962 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Starting 
    14:08:04.700 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.701 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.778 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.780 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    
    type:bug 
    opened by servergeeks 10
  • Support detailed query logging

    This is another troubleshooting story.

    We want to provide a configurable feature that gives a detailed log of how the queries are treated by heroic.

    Suggestion

    The user can provide the system with additional clientContext, which will be included in each log step. The system can either be provided with an id, or it will generate one when the query is received; this id can be used to group or correlate individual queries.

    Each log line has the following structure:

    {
      "component": "com.spotify.heroic.query.CoreQueryManager",
      "queryId": "ed6fe51c-afba-4320-a859-a88795c15175",
      "clientContext": {
        "dashboardId": "my-system-metrics",
        "user": "udoprog"
      },
      "type": "Received",
      "data": {}
    }
    

    data would be specific to the type being logged.

    In all relevant stages of query processing, a call is made to the query logging framework, which results in logging JSON that describes the exact structure of the query at that stage.

    Suggested relevant stages and data to log:

    • api Query as received from the user, also includes relevant HTTP-based information.
    • api Query as received by ClusterManager.
    • api FullQuery.Request which fans out to all data nodes.
    • data FullQuery.Request when received by data node.
    • data FullQuery as it goes back to the API node (excluding data).
    • api FullQuery as received from data node (excluding data).
    • api QueryTrace with timing information for the whole query
    • api QueryMetricsResponse being sent to the user (excluding data).

    Note: All response paths exclude samples (time series data) to reduce the size of the log. Instead, the size of the returned data is logged.

    Requirements

    • The query logging framework is isolated from the critical path of query processing, so that a failure in the query logger doesn't affect query processing.

    Extension to Query API

    The following field would be added to the query API.

    {
      "features": ["com.spotify.heroic.query_logging"],
      "queryLogging": {
        "queryId": "my-query-id",
        "clientContext": {
          "dashboardId": "my-system-metrics",
          "user": "udoprog"
        }
      }
    }
    

    This is completely optional, and query logging can be enabled globally by setting the feature flag com.spotify.heroic.query_logging.

    Configuration

    Query logging is configured in the configuration file under the queryLogging section, like the following:

    queryLogging:
      type: file
      path: /path/to/query.log
      rotationPeriod: 1d
    

    Or with a logger facility:

    queryLogging:
      type: logger
      name: com.spotify.heroic.query_logging
    
    note:rfc 
    opened by udoprog 10
  • integrationTests requires "hidden" quay.io/testcontainers/ryuk docker image

    Hi,

    Testing a build of Heroic for the first time, the integration tests failed due to a missing Docker image.

    com.spotify.heroic.PubSubConsumerIT > consumeOneMessage FAILED
        org.testcontainers.containers.ContainerLaunchException: Container startup failed
            Caused by:
            org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageNameFuture=java.util.concurrent.CompletableFuture@5f9adb66[Completed normally], imagePullPolicy=DefaultPullPolicy(), dockerClient=LazyDockerClient.INSTANCE)
    
                Caused by:
                com.github.dockerjava.api.exception.NotFoundException: {"message":"No such image: quay.io/testcontainers/ryuk:0.2.3"}
    

    This image is not specified in the code, so I guess it is a requirement of a dependency. Other Docker images that are explicitly mentioned in the code, such as bigtruedata/gcloud-pubsub-emulator, were automatically pulled, as seen in the test just before the failing one:

    com.spotify.heroic.PubSubConsumerIT > consumeOneMessage STANDARD_OUT
        14:29:02.761 [Test worker] INFO  org.testcontainers.DockerClientFactory - Docker host IP address is localhost
        14:29:02.777 [Test worker] INFO  org.testcontainers.DockerClientFactory - Connected to docker:
          Server Version: 20.10.5
          API Version: 1.41
          Operating System: Arch Linux
          Total Memory: 15641 MB
        14:29:02.783 [Test worker] INFO  🐳 [bigtruedata/gcloud-pubsub-emulator:latest] - Pulling docker image: bigtruedata/gcloud-pubsub-emulator:latest. Please be patient; this may take some time but only needs to be done once.
        14:29:02.783 [Test worker] INFO  org.testcontainers.DockerClientFactory - Docker host IP address is localhost
        14:29:02.799 [Test worker] INFO  org.testcontainers.DockerClientFactory - Connected to docker:
          Server Version: 20.10.5
          API Version: 1.41
          Operating System: Arch Linux
          Total Memory: 15641 MB
    

    The build works fine after manually pulling the image quay.io/testcontainers/ryuk:0.2.3.
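
    For reference, the manual pull that works around the failure looks like this (image name and tag taken from the error above):

    $ docker pull quay.io/testcontainers/ryuk:0.2.3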

    opened by Pierrotws 1
  • Investigate potentially serious performance implications of seemingly unnecessary thread-per-log message logging implementation

    So the serializeAndLog method is called by all "log this sentence to disk" methods in the Slf4jQueryLogger class.

    This is incredibly wasteful, as for every single line of the log file a Thread is created and destroyed, which includes:

    • a Thread object being allocated and GC'd
    • all the trappings of a Thread object e.g. file handles, OS resources
    • the actual OS thread that the Thread represents
    • other stuff I've forgotten

    We should - at the least - run some tests to determine how slow/wasteful this is, assuming realistic logging rates.

    My money is that it's really bad and could have a significant effect on performance, assuming that these logging methods are called frequently enough.

    opened by sming 0
  • Fix "...Span is GC'ed without being ended." issue (caused by a BT timeout)

    100's of Tracing Spans are left un-ended from every query timeout

    • I am a prism goalie
    • Who wants to have a stable heroic
    • So that I can focus on features and not get woken up at night and have angry users

    These un-ended spans represent a real runtime risk to Heroic. If ~700-1000 of these are left hanging around after each timed-out query, it's conceivable that the JVM will:

    • potentially run out of memory altogether
    • experience much longer GC pauses / sweep times (because of all the hanging spans needing reaping)
    • hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems

    Proposed Solution

    • find the correct location to catch the BT timeout exception (not trivial)
    • catch it, end the span and throw it out again

    Repro Steps

    • run heroic locally with GUC config and on branch feature/add-bigtable-timeout-settings-refactored
    • capture a lengthy query from grafana using the chrome dev tools network tab
    • alter the query to hit localhost and watch the logs; you'll see this message

    List of methods concerned from logs

    1. ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
    2. ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.
    opened by sming 1
  • add x-client-id to markdown documentation examples for /query/[metrics|batch]

    We basically need to find a neat way of adding "x-client-id: my-app" to the example cURLs given in https://spotify.github.io/heroic/docs/api/post-query-metrics and https://spotify.github.io/heroic/docs/api/post-query-batch.

    It's not that easy, since we want this to be added only to these two endpoints. We probably need to create a "subclass" of whatever emits the curl -XPOST ... text.

    type:documentation 
    opened by sming 0
  • Investigate & resolve nondeterministic build errors

    • I am a heroic developer
    • Who wants to have deterministic builds
    • So that I can retain my sanity and be productive instead of chasing random build errors

    Design & Implementation Notes

    feature/add-bigtable-timeout-settings-refactored com.spotify.heroic.GrpcClusterQueryIT > distributedFilterQueryTest FAILED
        java.lang.IllegalStateException: failed to create a child event loop
            Caused by:
            io.netty.channel.ChannelException: failed to open a new selector
                Caused by:
                java.io.IOException: Too many open files
        java.lang.NullPointerException
    
    codebase quality build issues 
    opened by sming 1
Releases (latest: 2.3.18)
  • 2.3.18(Mar 4, 2021)

    PRs included in this release

    • Implement Bigtable timeouts&retry settings #733
    • per x-client-id timeout logging & metrics #763

    NOTE that the Bigtable config is unchanged - there should be absolutely no observable behavioural changes w.r.t. Bigtable timeouts and retries.

  • 2.3.17(Feb 10, 2021)

  • 2.3.16(Feb 1, 2021)

    • Remove legacy source input, move distribution type out of request input (#749)
    • Remove static import statement that keep failing sporadically (#747)
  • 2.3.15(Jan 26, 2021)

  • 2.3.14(Jan 25, 2021)

    • Introduce support for RotatingIndexMapping to limit the number of elasticsearch metadata indices read based upon the range of a heroic query. If the newly introduced config flag is set to true, the number of read indices is dynamically determined based upon the rotation interval and the range of the query. (#746)

    Closes #745.

  • 2.3.13(Jan 20, 2021)

  • 2.3.12(Jan 14, 2021)

    • The interval variable in RotatingIndexMapping is now respected if present in configuration. (PR #743 & Closes Issue #742)
    • This adds error flags and exception messages to the spans for metadata, suggest, and bigtable writes when the chain of futures fails. The same exceptions are also logged. Previously some exceptions, such as grpc errors, could be masked by slf4j settings. The trace for a metric write was cleaned up to remove several intermediary spans. These spans did not have useful information and had no branching paths, so they were just clutter in the overall trace. (PR #740 & Closes Issue #724)
  • 2.3.11(Jan 11, 2021)

    • Moves Tdigest stat computation to SharedGroups combine phase. (Part 1 of 2) (#734)
    • Add distribution query integration test. (Part 2 of 2) (#737)
  • 2.3.10(Dec 18, 2020)

    • Add distribution data points aggregation core components. The implementation follows the same paradigm as existing aggregation instances such as sum and average. (#728)
    • Update Checkstyle version from 6.x to 8.x. (#726)
    • Correct metadata Series getter defect for scenarios where indexResourceIdentifiers is true. (#731)
  • 2.3.9(Nov 20, 2020)

    • Introduce a config variable, indexResourceIdentifiers to index resource identifiers in elasticsearch metadata. If configured as true, Resource Identifiers will be written to and indexed in elasticsearch metadata. indexResourceIdentifiers defaults to false if not specified in heroic config. (#722)
  • 2.3.8(Nov 16, 2020)

    • Fix logging (#717)
    • GitHub action workflow to build PRs in Docker images. (#718)
    • Ensure metrics that are explicitly set to zero are not dropped (#721)
    • Fixes a typo in distribution column family name (#723)
  • 2.3.7(Oct 28, 2020)

  • 2.3.6(Oct 23, 2020)

    • Add distribution support to spotify100_proto and BigtableBackend writer (#699)
    • Add distribution support to spotify100 and the ability to fetch distribution from bigTable. (#700)
    • Fix QuotaWatchers creation and cleaning (#702)
  • 2.3.5(Sep 24, 2020)

  • 2.3.4(Sep 22, 2020)

    • Add global data points stats (#696)
    • Implement a configurable ES result size (#685)
    • Update protobuf version and force folsom to use latest version of spotify:dns (#697)
  • 2.3.3(Sep 9, 2020)

  • 2.3.2(Sep 4, 2020)

  • 2.3.1(Aug 27, 2020)

  • 2.3.0(Aug 25, 2020)

  • 2.2.0(Jul 21, 2020)

    • Run system tests in a full Docker environment.
    • Add sentry as an optional error aggregator.
    • [elasticsearch-metadata] Add hash to use for sorting.
    • [elasticsearch consumers] Fix reporting of dropped by duplicate metrics.
  • 2.1.0(Apr 28, 2020)

    • [statistics] use SlidingTimeWindowArrayReservoir for histograms
    • [metadata] do not allow partial results from elasticsearch
    • [metadata] add metric for failed shards
    • [elasticsearch-metadata] Add option for findSeries to use pagination
    • [heroic-component] remove warnings around deprecated methods
    • [core] remove duplicate query trace tags
    • [elasticsearch] Timeout fix for multi-page scrolling requests
  • 2.0.0(Apr 9, 2020)

  • 1.2.0(Mar 19, 2020)

    • Scope request spans so that they link to coreQueryManager. (#621)
    • [elasticsearch] Support any ES index settings. (#623)
    • [elasticsearch] Set order for templates to 100. (#626)
    • [suggest-elasticsearch] write series and tags in bulk (#624)
  • 1.1.0(Mar 19, 2020)

    • [metadata-elasticsearch] clear search scroll when complete
    • [cassandra] Disable datastax JMX reporting.
    • Log when some deprecated functions are called
    • [tracing] Improve tags and refactor query spans
    • [tracing] Add squashing trace exporter.
  • 1.0.3(Feb 12, 2020)

  • 1.0.2(Feb 11, 2020)

  • 1.0.1(Jan 24, 2020)

  • 1.0.0(Jan 22, 2020)

    This is the first major release of Heroic! \o/

    Changes going forward will continue to follow semantic versioning.

    The upgrade path for previous users of Heroic is to deploy new ElasticSearch 7.5 clusters. 7.5 has an EOL date of 2021-06-02; moving to 8.x at that point will require Heroic to migrate away from the transport client to the high-level REST client.

    The major change from 5.x -> 7.x is that indexes no longer support multiple types.

    • suggest has two types, series and tags. This means there will be 2x the number of indexes.
    • metadata has only one type, metadata. There will be the same number of indexes.

    Remove deprecated configuration options:

    • Completely remove the StandaloneClient as it was removed in 6.x+. TestContainers are now used for the integration tests.
    • Completely remove the NodeClient since it's not recommended to use anyway.

    Example configuration for a metadata/suggest backend. Note that seeds and clusterName are now nested under client and not connection.

    metadata:
      backends:
      - type: elasticsearch
        backendType: kv
        connection:
          index:
            type: rotating
            pattern: docker-cluster-%s
          client:
            type: transport
            clusterName: docker-cluster
            seeds:
              - localhost
    
    
  • 0.10.5(Jan 6, 2020)

  • 0.10.4(Nov 27, 2019)

    • [usagetracking] Create a module for tracking basic events.
    • [usagetracking] Send an uptime event every 24 hours.
    • Fallback to DOCKER_TAG when setting version.
    • [aggregations] Chain aggregation fix missing type id.
    • [tracing] Fix NPE.
    • [usage-tracking] Set timeout and close body.