The Heroic Time Series Database

Overview

DEPRECATION NOTICE

This repo is no longer actively maintained. While it should continue to work and there are no major known bugs, we will not be improving Heroic or releasing new versions.

A scalable time series database based on Bigtable, Cassandra, and Elasticsearch. Go to https://spotify.github.io/heroic/ for documentation.

This project adheres to the Open Code of Conduct. By participating, you are expected to honor this code.

Install

Docker

Docker images are available on Docker Hub.

$ docker run -p 8080:8080 -p 9091:9091 spotify/heroic

Heroic will now be reachable at http://localhost:8080/status.
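
You can verify this with a plain HTTP request (a minimal check, assuming the default port mapping above):

$ curl http://localhost:8080/status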

In production it's advised to use a tagged version.
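
For example, a release tag from the Releases section below can be used instead of the implicit latest tag (the exact tags available on Docker Hub may differ; this is illustrative):

$ docker run -p 8080:8080 -p 9091:9091 spotify/heroic:2.3.18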

Configuration

For help on how to write a configuration file, see the Configuration Section of the official documentation.

Heroic has been tested with the services it builds on: Bigtable, Cassandra, and Elasticsearch.

Developing

Building from source

In order to compile Heroic, you'll need:

  • A Java 11 JDK
  • Maven 3
  • Gradle

The project is built using Gradle:

# full build, runs all tests and builds the shaded jar
./gradlew build

# only compile
./gradlew assemble

# build a single module
./gradlew heroic-metric-bigtable:build

The heroic-dist module can be used to produce a shaded jar that contains all required dependencies:

./gradlew heroic-dist:shadowJar

After building, the entry point of the service is com.spotify.heroic.HeroicService. The following is an example of how this can be run:

./gradlew heroic-dist:runShadow <config>

which is the equivalent of doing:

java -jar $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar <config>

Building with Docker

$ docker build -t heroic:latest .

This is a multi-stage build and will first build Heroic via a ./gradlew clean build and then copy the resulting shaded jar into the runtime container.

Heroic can then be run via Docker:

$ docker run -d -p 8080:8080 -p 9091:9091 -v /path/to/config.yml:/heroic.yml spotify/heroic:latest

Logging

Logging is captured using SLF4J, and forwarded to Log4j.

To configure logging, define the -Dlog4j.configurationFile=<path> parameter. You can use docs/log4j2-file.xml as a base.
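
For example, when running the shaded jar produced by the build (jar path taken from the build section above; the Log4j file path is illustrative):

java -Dlog4j.configurationFile=docs/log4j2-file.xml -jar $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar <config>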

Testing

We run tests with Gradle:

# run unit tests
./gradlew test

# run integration tests
./gradlew integrationTest

or to run a more comprehensive set of checks:

./gradlew check

This runs the unit and integration tests along with additional verification tasks.

It is strongly recommended that you run the full test suite before opening a pull request, otherwise it will be rejected by Travis.

Full Cluster Tests

Full cluster tests are defined in heroic-dist/src/test/java.

This way, they have access to all the modules and parts of Heroic.

The JVM RPC module is specifically designed to allow for rapid execution of integration tests. It allows multiple Heroic cores to be defined in the same JVM instance and to communicate with each other.

Code Coverage

There's an ongoing project to improve test coverage. The coverage reports on codecov.io show areas to focus on.
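
To inspect coverage locally, a JaCoCo report can be generated, assuming the JaCoCo plugin is applied in the Gradle build (an assumption; check the build scripts):

./gradlew test jacocoTestReport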

Bypassing Validation

To bypass automatic formatting and checkstyle validation you can use the following stanza:

// @formatter:off
final List<String> list = ImmutableList.of(
   "Welcome to...",
   "... The Wild West"
);
// @formatter:on

To bypass a FindBugs error, you should use the @SuppressFBWarnings annotation.

@SuppressFBWarnings(value="FINDBUGS_ERROR_CODE", justification="I Know Better Than FindBugs")
public class IKnowBetterThanFindbugs {
    // ...
}

Module Orientation

The Heroic project is split into a couple of modules.

The most critical one is heroic-component. It contains interfaces, value objects, and the basic set of dependencies necessary to glue different components together.

Submodules include metric, suggest, metadata, and aggregation. The first three contain various implementations of the given backend type, while the last provides aggregation methods.

heroic-core contains the com.spotify.heroic.HeroicCore class which is the central building block for setting up a Heroic instance.

heroic-elasticsearch-utils is a collection of utilities for interacting with Elasticsearch. This is kept separate since more than one backend needs to talk to Elasticsearch.

Finally, there is heroic-dist, a small project that depends on all the other modules. Here everything is bound together into a distribution: a shaded jar. It also provides the entry points for the service, namely com.spotify.heroic.HeroicService, and the interactive shell, com.spotify.heroic.HeroicShell. The shell can either be run standalone or connected to an existing Heroic instance for administration.
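
Since both entry points live in the shaded jar, either can be started by naming the main class explicitly; a minimal sketch, reusing the jar path from the build section above (and assuming HeroicShell exposes a standard main method like the service class does):

# run the service
java -cp $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar com.spotify.heroic.HeroicService <config>

# start the interactive shell
java -cp $PWD/heroic-dist/build/libs/heroic-dist-0.0.1-SNAPSHOT-shaded.jar com.spotify.heroic.HeroicShell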

Contributing

Guidelines for contributing can be found here.

Comments
  • Make suggest result size configurable

    Right now, ES suggest sets its query size to 0. This should make Elasticsearch return the maximum number of results (10_000). That could be incomplete, but it's also not very usable.

    Having this configurable and set to something fairly small in combination with typeahead searching from the grafana datasource would likely give both faster results and a better user experience.

    type:enhancement component:elasticsearch component:suggest 
    opened by hexedpackets 27
  • just compiled in master none query metrics works

    Hello,

    I just wanted to update the source code from master. Everything in the compilation goes right, no errors; I can write data, Elasticsearch is OK, and I can count the number of series.

    BUT I still get this error on all my different queries:

    Some fetches failed (628) or were cancelled (0), caused by Some fetches failed (628) or were cancelled (0)
    

    Which source code do I compile to get things to work?

    type:bug 
    opened by lucilecoutouly 15
  • No more data returned by heroic

    Hi guys,

    I've done a git pull and package build and it seems that no more data is returned by heroic...

    This is my git reflog

    root@heroic:~/heroic# git reflog --date=iso
    a0551f6 HEAD@{2017-02-06 09:45:53 +0100}: pull: Fast-forward
    be6c06f HEAD@{2016-12-22 12:46:06 +0100}: pull: Fast-forward
    b329697 HEAD@{2016-12-16 12:02:38 +0100}: pull: Fast-forward
    114d815 HEAD@{2016-12-15 13:01:43 +0100}: pull: Fast-forward
    12b92c5 HEAD@{2016-12-05 10:00:04 +0100}: pull: Fast-forward
    c6a43b0 HEAD@{2016-12-01 15:32:27 +0100}: pull: Fast-forward
    51494af HEAD@{2016-11-28 16:27:42 +0100}: pull: Fast-forward
    660e16e HEAD@{2016-11-14 14:21:20 +0100}: pull: Fast-forward
    cf81df2 HEAD@{2016-11-03 13:53:35 +0100}: pull: Fast-forward
    c05e7dc HEAD@{2016-10-14 10:12:40 +0200}: pull: Fast-forward
    2a16276 HEAD@{2016-10-13 11:37:22 +0200}: pull: Fast-forward
    240076d HEAD@{2016-10-11 09:20:06 +0200}: pull: Fast-forward
    6986d65 HEAD@{2016-10-08 21:37:19 +0200}: pull: Fast-forward
    f576768 HEAD@{2016-09-27 21:21:29 +0200}: pull: Fast-forward
    1a3578c HEAD@{2016-09-01 21:29:47 +0200}: clone: from https://github.com/spotify/heroic.git
    

    The errors seem to be coming from elasticsearch ...

    [2017-02-06 14:05:05,602][DEBUG][action.search.type       ] [Achilles] [59813] Failed to execute query phase
    org.elasticsearch.transport.RemoteTransportException: [Gauntlet][inet[/1.1.1.2:9300]][indices:data/read/search[phase/scan/scroll]]
    Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [59813]
    	at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
    	at org.elasticsearch.search.SearchService.executeScan(SearchService.java:265)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:939)
    	at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:930)
    	at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    

    I see the index for this respective day present in ES, and the Heroic API nodes seem to be able to consume and insert data into ES and Cassandra normally.

    The retrieval of data returns the following message...

    {"queryId":"edd112b3-8a66-4f62-8d14-dcb814e8ef87","range":{"start":1486386897275,"end":1486386902275},"trace":{"what":{"name":"com.spotify.heroic.CoreQueryManager#query"},"elapsed":738436,"children":[{"what":{"name":"com.spotify.heroic.CoreQueryManager#query_shard[{site=HG}]"},"elapsed":738552,"children":[]}]},"limits":[],"commonTags":{},"result":[],"errors":[{"type":"shard","nodes":["grpc://heroic1.cc:1394[local]#query","grpc://heroic.cc:1394#query","grpc://heroic3.cc:1394#query","grpc://heroic2.cc:1394#query"],"shard":{"site":"HG"},"error":"Request finished with status code (Status{code=UNKNOWN, description=null, cause=null}), caused by Request finished with status code (Status{code=UNKNOWN, description=null, cause=null})"}]}
    
    type:question 
    opened by servergeeks 15
  •  Error querying cassandra:9042 : com.datastax.driver.core.exceptions.BusyPoolException: [cassandra] Pool is busy (no available connection and the queue has reached its max size 256)

    Hello,

    It's related to #243. When I do a big query:

     Error querying cassandraseed/10.42.28.96:9042 : com.datastax.driver.core.exceptions.BusyPoolException: [cassandraseed/10.42.28.96] Pool is busy (no available connection and the queue has reached its max size 256)
    

    Did you change the version of the Cassandra driver? How can this error be prevented?

    Someone working on the Hawkular time series database says: BusyPoolException under heavy load - no available connection and the queue has reached its max size 256 https://issues.jboss.org/browse/HWKMETRICS-542 https://issues.jboss.org/browse/HWKMETRICS-597

    Is it related?

    Is this a resolution: https://datastax-oss.atlassian.net/browse/JAVA-893

    thanks

    type:bug type:question 
    opened by lucilecoutouly 14
  • Delta aggregation

    Adds delta aggregation (i.e. "diff" in KairosDB or "rate" in OpenTSDB; open to alternative names). There are no arguments, leaving it up to the user to decide whether to chain it as an input to a SamplingAggregation, or to diff the output for any other Aggregation. Example:

    {
      "range": {"type": "relative", "unit": "HOURS", "value": 2},
      "filter": ["and", ["key", "foo"], ["=", "foo", "bar"], ["+", "role"]],
      "aggregation": {"type": "delta"},
      "groupBy": ["site"]
    }
    

    The above would return the difference of each point from the last within the sample. ~~The first point in a difference is always 0.~~ (EDIT: the first point sampled is truncated and the first returned point is the difference from the last point.)

    This is my first time writing Java, so please go easy! I understand it might not be the most idiomatic code, but I did my best to use patterns found in the other aggregators (stream, map, etc.). Thanks for looking.

    opened by mykolasmith 13
  • [WIP] Add distribution support to spotify100 and the ability to fetch distribution from bigTable.

    This PR adds distribution support to spotify100 (jsonMetric) and the ability to fetch distribution metrics from Bigtable. Spotify100 consumers and ingestion modules should now be able to handle both new and old JsonMetric. For more information, please refer to this issue.

    opened by ao2017 12
  • Heroic Bigtable Consumer does not handle failures as expected

    DoD

    Heroic should properly address exception handling.

    • Heroic consumer should not ack messages that it failed to process.
    • Any write or processing failure should be logged.

    Background

    The Heroic Bigtable consumer failed to write to Bigtable when the new column family had not been created yet. No exception was logged, and the consumer ack-ed the message as if the write had been successful. I got the exception below via hacky debugger evaluations.

    io.grpc.StatusRuntimeException: NOT_FOUND: Error while mutating the row '\023\023distribution-test-1\004\003\003env\007\007staging\004\004host\017\017samanthanoellef\013\013metric_type\014\014distribution\004\004what\005\005stuff\000\000\001u\000\000\000\000' (projects/xpn-heroic-1/instances/metrics-staging-guc/tables/metrics) : Requested column family not found.
    	at io.grpc.Status.asRuntimeException(Status.java:517)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation.toException(BulkMutation.java:77)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation.access$400(BulkMutation.java:59)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch.handleEntries(BulkMutation.java:227)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch.handleResult(BulkMutation.java:200)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch$1.onSuccess(BulkMutation.java:170)
    	at com.google.cloud.bigtable.grpc.async.BulkMutation$Batch$1.onSuccess(BulkMutation.java:167)
    	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1021)
    	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
    	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1137)
    	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:957)
    	at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:726)
    	at com.google.cloud.bigtable.grpc.async.AbstractRetryingOperation$GrpcFuture.set(AbstractRetryingOperation.java:90)
    	at com.google.cloud.bigtable.grpc.async.RetryingMutateRowsOperation.onOK(RetryingMutateRowsOperation.java:91)
    	at com.google.cloud.bigtable.grpc.async.AbstractRetryingOperation.onClose(AbstractRetryingOperation.java:167)
    	at com.google.cloud.bigtable.grpc.async.ThrottlingClientInterceptor$1$1.onClose(ThrottlingClientInterceptor.java:125)
    	at com.google.cloud.bigtable.grpc.io.ChannelPool$InstrumentedChannel$2.onClose(ChannelPool.java:209)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at com.google.cloud.bigtable.grpc.io.RefreshingOAuth2CredentialsInterceptor$UnAuthResponseListener.onClose(RefreshingOAuth2CredentialsInterceptor.java:81)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678)
    	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
    	at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
    	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
    	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:834)
    
    opened by samfadrigalan 10
  • Implement a configurable ES result size

    Please see Make suggest result size configurable #646 for a detailed discussion of the implementation, especially the conversation about ES probably not being hit for 10K suggestions the vast majority of the time.

    Use Case Resolved

    • I am a Grafana user
    • who wants the UI to be more responsive
    • so that I can work more efficiently and not get irritated

    Use Case 2 Resolved

    • I am a Heroic developer
    • who wants to tighten-up/secure/protect Heroic against unreasonably large request suggestion limits, irrespective of their origin
    • so that I can rest easy at night

    Design & Implementation Notes

    • a trivial new class NumSuggestionsLimit, objects of which are used by SuggestBackend implementations to determine the maximum number (i.e. limit) of suggestion entities to return.

    • said class will not allow a limit of more than 500. It will default to a limit of 50 if no limit is supplied by heroic.yaml for the respective Backend implementation.

    • when, for example, a call to tagSuggest() is made, if the request contains a limit, that limit is respected. If it does not, the Backend's limit is used.

    • Note that I also refactored a bunch of related code to help me understand it.

    opened by sming 10
  • Enable Cassandra Driver Pooling option configuration

    This change set enables the configuration of the java driver pooling options during startup.

    A couple of notes:

    • In our case we are very sensitive to data loss, so we wanted to set up a separate log file that we can parse for alerting/resubmission, hence the additional logging in DatastaxBackend.java. We are totally open to (and appreciate recommendations for) different approaches to this.

    • I will also submit a separate PR to the site branch with documentation updates.

    opened by jcabmora 10
  • client=org.elasticsearch.client.node.NodeClient@429967c4) leaked @ unknown

    Hi guys,

    I've set up a Heroic cluster (4 nodes) that is using 5 Cassandra nodes and 3 Elasticsearch nodes. For some reason, after a day of ingesting data (around 2GB/index in ES), retrieving data becomes really slow and sometimes impossible...

    If I try to do an HQL query in Grafana, I usually end up crashing a Heroic API node and everything stops working.

    Any tips/ideas to improve performance and keep my cluster running/usable ?

    An example of a crash/error would be...

    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic3.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic1.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [new] grpc://heroic.cc:1394
    14:03:45.049 [nioEventLoopGroup-2-16] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000007 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:03:45.050 [nioEventLoopGroup-2-16] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000007 [update] [{site=HG}] 3 result(s)
    14:04:17.843 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:17.845 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:45.050 [heroic-scheduler#0] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (00000008)
    14:04:47.853 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:47.855 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] removed {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:04:50.053 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic3.cc:1394
    14:04:50.053 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic.cc:1394
    14:04:50.054 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [new] grpc://heroic1.cc:1394
    14:04:50.054 [nioEventLoopGroup-2-6] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000008 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:04:50.054 [nioEventLoopGroup-2-6] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000008 [update] [{site=HG}] 3 result(s)
    14:05:50.055 [heroic-scheduler#4] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (00000009)
    
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic3.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [new] grpc://heroic1.cc:1394
    14:05:55.059 [nioEventLoopGroup-2-10] ERROR com.spotify.heroic.cluster.CoreClusterManager - 00000009 [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:05:55.060 [nioEventLoopGroup-2-10] INFO  com.spotify.heroic.cluster.CoreClusterManager - 00000009 [update] [{site=HG}] 3 result(s)
    14:06:55.060 [heroic-scheduler#2] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (0000000a)
    14:07:00.063 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic3.cc:1394
    14:07:00.063 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic.cc:1394
    14:07:00.064 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [new] grpc://heroic1.cc:1394
    14:07:00.064 [nioEventLoopGroup-2-15] ERROR com.spotify.heroic.cluster.CoreClusterManager - 0000000a [failed] grpc://heroic2.cc:1394
    java.lang.RuntimeException: Request finished with status code (Status{code=DEADLINE_EXCEEDED, description=null, cause=null})
    	at com.spotify.heroic.rpc.grpc.GrpcRpcClient$1.onClose(GrpcRpcClient.java:103)
    	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$3.runInContext(ClientCallImpl.java:462)
    	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:54)
    	at io.grpc.internal.SerializingExecutor$TaskRunner.run(SerializingExecutor.java:154)
    	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
    	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:356)
    	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
    	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    	at java.lang.Thread.run(Thread.java:745)
    14:07:00.064 [nioEventLoopGroup-2-15] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000a [update] [{site=HG}] 3 result(s)
    14:07:30.085 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client timed out
    14:07:30.085 [jersey-background-task-scheduler-0] ERROR com.spotify.heroic.common.CoreJavaxRestFramework - Request cancelled
    14:07:30.089 [jersey-background-task-scheduler-0] INFO  org.eclipse.jetty.server.RequestLog - 131.130.249.44 - - [16/Dec/2016:13:02:30 +0000] "POST //heroic3.cc:8080/query/batch HTTP/1.1" 500 101 
    14:07:30.090 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client completed
    14:07:43.525 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client timed out
    14:07:43.525 [jersey-background-task-scheduler-0] ERROR com.spotify.heroic.common.CoreJavaxRestFramework - Request cancelled
    14:07:43.528 [jersey-background-task-scheduler-0] INFO  org.eclipse.jetty.server.RequestLog - 131.130.249.44 - - [16/Dec/2016:13:02:43 +0000] "POST //heroic3.cc:8080/query/batch HTTP/1.1" 500 101 
    14:07:43.528 [jersey-background-task-scheduler-0] WARN  com.spotify.heroic.common.CoreJavaxRestFramework - Client completed
    14:08:00.065 [heroic-scheduler#2] INFO  com.spotify.heroic.cluster.CoreClusterManager - new refresh with id (0000000b)
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic3.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [new] grpc://heroic1.cc:1394
    14:08:01.637 [nioEventLoopGroup-2-5] INFO  com.spotify.heroic.cluster.CoreClusterManager - 0000000b [update] [{site=HG}] 4 result(s)
    14:08:01.876 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], begin rebalancing consumer heroic_heroic3-1481892993532-faf9240b try #0
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Stopping leader finder thread
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Shutting down
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Stopped 
    14:08:01.933 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Shutdown completed
    14:08:01.934 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Stopping all fetchers
    14:08:01.934 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Shutting down
    14:08:01.934 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] WARN  kafka.consumer.SimpleConsumer - Reconnect due to socket error: null
    14:08:01.935 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Stopped 
    14:08:01.935 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Shutdown completed
    14:08:01.935 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Shutting down
    14:08:01.936 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] WARN  kafka.consumer.SimpleConsumer - Reconnect due to socket error: null
    14:08:01.936 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Stopped 
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Shutdown completed
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] All connections stopped
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Cleared all relevant queues for this fetcher
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Cleared the data chunks in all the consumer message iterators
    14:08:01.937 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Committing all offsets after clearing the fetcher queues
    14:08:01.940 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Releasing partition ownership
    14:08:01.944 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Consumer heroic_heroic3-1481892993532-faf9240b rebalancing the following partitions: ArrayBuffer(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) for topic metrics with consumers: List(heroic_heroic-1481893181613-f2c9267b-0, heroic_heroic-1481893181613-f2c9267b-1, heroic_heroic1-1481893140289-6988cffe-0, heroic_heroic1-1481893140289-6988cffe-1, heroic_heroic2-1481893050704-558be4ae-0, heroic_heroic2-1481893050704-558be4ae-1, heroic_heroic3-1481892993532-faf9240b-0, heroic_heroic3-1481892993532-faf9240b-1)
    14:08:01.944 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-0 attempting to claim partition 8
    14:08:01.945 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-1 attempting to claim partition 9
    14:08:01.947 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-0 successfully owned partition 8 for topic metrics
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], heroic_heroic3-1481892993532-faf9240b-1 successfully owned partition 9 for topic metrics
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Updating the cache
    14:08:01.948 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], Consumer heroic_heroic3-1481892993532-faf9240b selected partitions : metrics:8: fetched offset = 788121: consumed offset = 788121,metrics:9: fetched offset = 788111: consumed offset = 788111
    14:08:01.949 [heroic_heroic3-1481892993532-faf9240b_watcher_executor] INFO  kafka.consumer.ZookeeperConsumerConnector - [heroic_heroic3-1481892993532-faf9240b], end rebalancing consumer heroic_heroic3-1481892993532-faf9240b try #0
    14:08:01.951 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager$LeaderFinderThread - [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread], Starting 
    14:08:01.958 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Verifying properties
    14:08:01.958 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property client.id is overridden to heroic
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property metadata.broker.list is overridden to kafkabaer1.cc:9092,kafkabaer2.cc:9092,kafkabaer3.cc:9092
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.utils.VerifiableProperties - Property request.timeout.ms is overridden to 30000
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.client.ClientUtils$ - Fetching metadata from broker id:1,host:kafkabaer1.cc,port:9092 with correlation id 9 for 1 topic(s) Set(metrics)
    14:08:01.959 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.producer.SyncProducer - Connected to kafkabaer1.cc:9092 for producing
    14:08:01.960 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.producer.SyncProducer - Disconnecting from kafkabaer1.cc:9092
    14:08:01.961 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-1], Starting 
    14:08:01.961 [heroic_heroic3-1481892993532-faf9240b-leader-finder-thread] INFO  kafka.consumer.ConsumerFetcherManager - [ConsumerFetcherManager-1481892993640] Added fetcher for partitions ArrayBuffer([[metrics,8], initOffset 788121 to broker id:3,host:kafkabaer3.cc,port:9092] , [[metrics,9], initOffset 788111 to broker id:1,host:kafkabaer1.cc,port:9092] )
    14:08:01.962 [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3] INFO  kafka.consumer.ConsumerFetcherThread - [ConsumerFetcherThread-heroic_heroic3-1481892993532-faf9240b-0-3], Starting 
    14:08:04.700 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.701 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][_QO3QHYpQs-2X4MXFbSX2Q][heroic2][inet[/131.1.1.7:9300]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.778 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    14:08:04.780 [elasticsearch[heroic3][clusterService#updateTask][T#1]] INFO  org.elasticsearch.cluster.service - [heroic3] added {[heroic2][kzCTkZdBRlqrGcBFKNahLQ][heroic2][inet[/131.1.1.7:9301]]{data=false, client=true},}, reason: zen-disco-receive(from master [[Occulus][H59ohQteSpGOQk1O69WCGA][heroic-e][inet[/131.1.1.2:9300]]])
    
    type:bug 
    opened by servergeeks 10
  • Support detailed query logging

    This is another troubleshooting story.

    We want to provide a configurable feature that gives a detailed log of how the queries are treated by heroic.

    Suggestion

    The user can provide the system with additional clientContext, which will be included in each log step. The system can either be provided with an id, or it will generate one when the query is received; this id can be used to group or correlate individual queries.

    Each log line has the following structure:

    {
      "component": "com.spotify.heroic.query.CoreQueryManager",
      "queryId": "ed6fe51c-afba-4320-a859-a88795c15175",
      "clientContext": {
        "dashboardId": "my-system-metrics",
        "user": "udoprog"
      },
      "type": "Received",
      "data": {}
    }
    

    data would be specific to the type being logged.

    In all relevant stages of query processing, a call is made to the query logging framework, which results in logging JSON that describes the exact structure of the query at that stage.

    Suggested relevant stages and data to log:

    • api Query as received from the user, also includes relevant HTTP-based information.
    • api Query as received by ClusterManager.
    • api FullQuery.Request which fans out to all data nodes.
    • data FullQuery.Request when received by data node.
    • data FullQuery as it goes back to the API node (excluding data).
    • api FullQuery as received from data node (excluding data).
    • api QueryTrace with timing information for the whole query
    • api QueryMetricsResponse being sent to the user (excluding data).

    Note: All response paths exclude samples (time series data) to reduce the size of the log. Instead, the size of the returned data is logged.

    Requirements

    • The query logging framework is isolated from the critical path of query processing, so that a failure in the query logger doesn't affect query processing.

    Extension to Query API

    The following field would be added to the query API.

    {
      "features": ["com.spotify.heroic.query_logging"],
      "queryLogging": {
        "queryId": "my-query-id",
        "clientContext": {
          "dashboardId": "my-system-metrics",
          "user": "udoprog"
        }
      }
    }
    

    This is completely optional, and query logging can be enabled globally by setting the feature flag com.spotify.heroic.query_logging.

    Configuration

    Query logging is configured in the configuration file under the queryLogging section, like the following:

    queryLogging:
      type: file
      path: /path/to/query.log
      rotationPeriod: 1d
    

    Or with a logger facility:

    queryLogging:
      type: logger
      name: com.spotify.heroic.query_logging
    
    note:rfc 
    opened by udoprog 10
  • integrationTests requires "hidden" quay.io/testcontainers/ryuk docker image

    Hi,

    Testing a build of Heroic for the first time, the integration tests failed due to a missing Docker image.

    com.spotify.heroic.PubSubConsumerIT > consumeOneMessage FAILED
        org.testcontainers.containers.ContainerLaunchException: Container startup failed
            Caused by:
            org.testcontainers.containers.ContainerFetchException: Can't get Docker image: RemoteDockerImage(imageNameFuture=java.util.concurrent.CompletableFuture@5f9adb66[Completed normally], imagePullPolicy=DefaultPullPolicy(), dockerClient=LazyDockerClient.INSTANCE)
    
                Caused by:
                com.github.dockerjava.api.exception.NotFoundException: {"message":"No such image: quay.io/testcontainers/ryuk:0.2.3"}
    

    This image is not specified in the code, so I guess it is a requirement of a dependency. Other Docker images that are explicitly mentioned in the code, such as bigtruedata/gcloud-pubsub-emulator, were automatically pulled, as seen in the test just before the failing one:

    com.spotify.heroic.PubSubConsumerIT > consumeOneMessage STANDARD_OUT
        14:29:02.761 [Test worker] INFO  org.testcontainers.DockerClientFactory - Docker host IP address is localhost
        14:29:02.777 [Test worker] INFO  org.testcontainers.DockerClientFactory - Connected to docker:
          Server Version: 20.10.5
          API Version: 1.41
          Operating System: Arch Linux
          Total Memory: 15641 MB
        14:29:02.783 [Test worker] INFO  🐳 [bigtruedata/gcloud-pubsub-emulator:latest] - Pulling docker image: bigtruedata/gcloud-pubsub-emulator:latest. Please be patient; this may take some time but only needs to be done once.
        14:29:02.783 [Test worker] INFO  org.testcontainers.DockerClientFactory - Docker host IP address is localhost
        14:29:02.799 [Test worker] INFO  org.testcontainers.DockerClientFactory - Connected to docker:
          Server Version: 20.10.5
          API Version: 1.41
          Operating System: Arch Linux
          Total Memory: 15641 MB
    

    The build works fine after manually pulling the image quay.io/testcontainers/ryuk:0.2.3.
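
    For reference, the manual pull that works around the failure looks like this (image name and tag taken from the error above):

    $ docker pull quay.io/testcontainers/ryuk:0.2.3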

    opened by Pierrotws 1
  • Investigate potentially serious performance implications of seemingly unnecessary thread-per-log message logging implementation

    So the serializeAndLog method is called by all "log this sentence to disk" methods in the Slf4jQueryLogger class.

    This is incredibly wasteful, as for every single line of the log file a Thread is created and destroyed, which includes:

    • a Thread object being allocated and GC'd
    • all the trappings of a Thread object e.g. file handles, OS resources
    • the actual OS thread that the Thread represents
    • other stuff I've forgotten

    We should - at the least - run some tests to determine how slow/wasteful this is, assuming realistic logging rates.

    My money is that it's really bad and could have a significant effect on performance, assuming that these logging methods are called frequently enough.

    opened by sming 0
  • Fix "...Span is GC'ed without being ended." issue (caused by a BT timeout)

    100's of Tracing Spans are left un-ended from every query timeout

    • I am a prism goalie
    • Who wants to have a stable heroic
    • So that I can focus on features and not get woken up at night and have angry users

    These un-ended spans represent a real runtime risk to Heroic. If ~700-1000 of these are left hanging around after each timed-out query, it's conceivable that the JVM will:

    • potentially run out of memory altogether
    • experience much longer GC pauses / sweep times (because of all the hanging spans needing reaping)
    • hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems

    Proposed Solution

    • find the correct location to catch the BT timeout exception (not trivial)
    • catch it, end the span and throw it out again

    Repro Steps

    • run heroic locally with GUC config and on branch feature/add-bigtable-timeout-settings-refactored
    • capture a lengthy query from grafana using the chrome dev tools network tab
    • alter the query to hit localhost and watch the logs; you'll see this message

    List of methods concerned from logs

    1. ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
    2. ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.
    opened by sming 1
  • add x-client-id to markdown documentation examples for /query/[metrics|batch]

    We basically need to find a neat way of adding "x-client-id: my-app" to the example cURLs given in https://spotify.github.io/heroic/docs/api/post-query-metrics and https://spotify.github.io/heroic/docs/api/post-query-batch.

    It's not that easy, since we want this to be added only to these two endpoints. We probably need to create a "subclass" of whatever emits the curl -XPOST ... text.

    type:documentation 
    opened by sming 0
  • Investigate & resolve nondeterministic build errors

    • I am a heroic developer
    • Who wants to have deterministic builds
    • So that I can retain my sanity and be productive instead of chasing random build errors

    Design & Implementation Notes

    feature/add-bigtable-timeout-settings-refactored com.spotify.heroic.GrpcClusterQueryIT > distributedFilterQueryTest FAILED
        java.lang.IllegalStateException: failed to create a child event loop
            Caused by:
            io.netty.channel.ChannelException: failed to open a new selector
                Caused by:
                java.io.IOException: Too many open files
        java.lang.NullPointerException
    
    codebase quality build issues 
    opened by sming 1
Releases (latest: 2.3.18)
  • 2.3.18(Mar 4, 2021)

    PRs included in this release

    • Implement Bigtable timeouts&retry settings #733
    • per x-client-id timeout logging & metrics #763

    NOTE that the Bigtable config is unchanged - there should be absolutely no observable behavioural changes w.r.t. Bigtable timeouts and retries.

  • 2.3.17(Feb 10, 2021)

  • 2.3.16(Feb 1, 2021)

    • Remove legacy source input, move distribution type out of request input (#749)
    • Remove static import statement that keep failing sporadically (#747)
  • 2.3.15(Jan 26, 2021)

  • 2.3.14(Jan 25, 2021)

    • Introduce support for RotatingIndexMapping to limit the number of elasticsearch metadata indices read based upon the range of a heroic query. If the newly introduced config flag is set to true, the number of read indices is dynamically determined based upon the rotation interval and the range of the query. (#746)

    Closes #745.

  • 2.3.13(Jan 20, 2021)

  • 2.3.12(Jan 14, 2021)

    • The interval variable in RotatingIndexMapping is now respected if present in configuration. (PR #743 & Closes Issue #742)
    • This adds error flags and exception messages to the spans for metadata, suggest, and bigtable writes when the chain of futures fails. The same exceptions are also logged. Previously some exceptions, such as grpc errors, could be masked by slf4j settings. The trace for a metric write was cleaned up to remove several intermediary spans. These spans did not have useful information and had no branching paths, so they were just clutter in the overall trace. (PR #740 & Closes Issue #724)
  • 2.3.11(Jan 11, 2021)

    • Moves Tdigest stat computation to SharedGroups combine phase. (Part 1 of 2) (#734)
    • Add distribution query integration test. (Part 2 of 2) (#737)
  • 2.3.10(Dec 18, 2020)

    • Add distribution data points aggregation core components. The implementation follows the same paradigm as existing aggregation instances such as sum and average. (#728)
    • Update Checkstyle version from 6.x to 8.x. (#726)
    • Correct metadata Series getter defect for scenarios where indexResourceIdentifiers is true. (#731)
  • 2.3.9(Nov 20, 2020)

    • Introduce a config variable, indexResourceIdentifiers to index resource identifiers in elasticsearch metadata. If configured as true, Resource Identifiers will be written to and indexed in elasticsearch metadata. indexResourceIdentifiers defaults to false if not specified in heroic config. (#722)
  • 2.3.8(Nov 16, 2020)

    • Fix logging (#717)
    • GitHub action workflow to build PRs in Docker images. (#718)
    • Ensure metrics that are explicitly set to zero are not dropped (#721)
    • Fixes a typo in distribution column family name (#723)
  • 2.3.7(Oct 28, 2020)

  • 2.3.6(Oct 23, 2020)

    • Add distribution support to spotify100_proto and BigtableBackend writer (#699)
    • Add distribution support to spotify100 and the ability to fetch distribution from bigTable. (#700)
    • Fix QuotaWatchers creation and cleaning (#702)
  • 2.3.5(Sep 24, 2020)

  • 2.3.4(Sep 22, 2020)

    • Add global data points stats (#696)
    • Implement a configurable ES result size (#685)
    • Update protobuf version and force folsom to use latest version of spotify:dns (#697)
  • 2.3.3(Sep 9, 2020)

  • 2.3.2(Sep 4, 2020)

  • 2.3.1(Aug 27, 2020)

  • 2.3.0(Aug 25, 2020)

  • 2.2.0(Jul 21, 2020)

    • Run system tests in a full Docker environment.
    • Add sentry as an optional error aggregator.
    • [elasticsearch-metadata] Add hash to use for sorting.
    • [elasticsearch consumers] Fix reporting of dropped by duplicate metrics.
  • 2.1.0(Apr 28, 2020)

    • [statistics] use SlidingTimeWindowArrayReservoir for histograms
    • [metadata] do not allow partial results from elasticsearch
    • [metadata] add metric for failed shards
    • [elasticsearch-metadata] Add option for findSeries to use pagination
    • [heroic-component] remove warnings around deprecated methods
    • [core] remove duplicate query trace tags
    • [elasticsearch] Timeout fix for multi-page scrolling requests
  • 2.0.0(Apr 9, 2020)

  • 1.2.0(Mar 19, 2020)

    • Scope request spans so that they link to coreQueryManager. (#621)
    • [elasticsearch] Support any ES index settings. (#623)
    • [elasticsearch] Set order for templates to 100. (#626)
    • [suggest-elasticsearch] write series and tags in bulk (#624)
  • 1.1.0(Mar 19, 2020)

    • [metadata-elasticsearch] clear search scroll when complete
    • [cassandra] Disable datastax JMX reporting.
    • Log when some deprecated functions are called
    • [tracing] Improve tags and refactor query spans
    • [tracing] Add squashing trace exporter.
  • 1.0.3(Feb 12, 2020)

  • 1.0.2(Feb 11, 2020)

  • 1.0.1(Jan 24, 2020)

  • 1.0.0(Jan 22, 2020)

    This is the first major release of Heroic! \o/

    Changes going forward will continue to follow semantic versioning.

    The upgrade path for previous users of Heroic is to deploy new ElasticSearch 7.5 clusters. 7.5 has an EOL date of 2021-06-02; moving to 8.x at that point will require Heroic to migrate away from the transport client to the high-level REST client.

    The major change from 5.x -> 7.x is that indexes no longer support multiple types.

    • suggest has two types, series and tags. This means there will be 2x the number of indexes.
    • metadata has only one type, metadata. There will be the same number of indexes.

    Remove deprecated configuration options:

    • Completely remove the StandaloneClient as it was removed in 6.x+. TestContainers are now used for the integration tests.
    • Completely remove the NodeClient since it's not recommended to use anyway.

    Example configuration for a metadata/suggest backend. Note that seeds and clusterName are now nested under client and not connection.

    metadata:
      backends:
      - type: elasticsearch
        backendType: kv
        connection:
          index:
            type: rotating
            pattern: docker-cluster-%s
          client:
            type: transport
            clusterName: docker-cluster
            seeds:
              - localhost
    
    
  • 0.10.5(Jan 6, 2020)

  • 0.10.4(Nov 27, 2019)

    • [usagetracking] Create a module for tracking basic events.
    • [usagetracking] Send an uptime event every 24 hours.
    • Fallback to DOCKER_TAG when setting version.
    • [aggregations] Chain aggregation fix missing type id.
    • [tracing] Fix NPE.
    • [usage-tracking] Set timeout and close body.