I think in any situation the user must be willing to do the science and confirm their ideas about performance, that is, write little scripts to test them. Someone has already asked whether something like that exists, or whether anyone else has done it.
While the user has some responsibility, the tool must also have some performance intentions of its own, otherwise it would be a bit pointless. I'd imagine there must be tests somewhere.
That aside, the first performance document that comes up in a search isn't helpful even when it comes to covering basic usage. It's quite cryptic and potentially counterproductive. It doesn't really explain things well, and in some areas it makes no sense at all.
There are cases where it jumps into low-level aspects where it might be better to just say what kind of schema it establishes in, for example, Cassandra, allowing users to consult the Cassandra documentation for the specifics.
It helps to narrow down some things:
Data:
- number of unique metric names
- number of unique tag names
- number of unique tag values
- number of unique times
- overlap and other complexities
Access patterns:
Are tags ANDed or ORed? I assume OR, but the documentation is a little tight-lipped about it:
It is possible to filter the data returned by specifying a tag. The data returned will only contain data points associated with the specified tag. Filtering is done using the “tags” property.
It talks a little too much in the singular about a plural; however, we find out they're ORs here:
Tags narrow down the search. Only metrics that include the tag and matches one of the values are returned. Tags is optional.
This should probably say that only the metrics matching at least one of the tag key/value combinations will be returned, but even then I'm left unsure. An array of values is obviously an OR, but what about two key names? It would be a bit dysfunctional for those to be ANDed, since most people would want (a = 1 AND b = 2) OR (a = 2 AND b = 1),
not (a = 1 OR a = 2) AND (b = 2 OR b = 1).
Failing that, just a IN (1, 2) OR b IN (1, 2)
would make more sense. The example with customer as well as host keys implies they're ANDed between multiple key names. Will people flip key and value to get AND to work? Are people not asking the questions I am getting incorrect metrics without knowing? (Often that's worse, as excess tends to appear more valid than deficit, or vice versa, in most cases.)
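My reading of the docs, sketched as code (the matcher function is mine, not anything KairosDB exposes): values within one tag key are ORed, separate keys are ANDed, which lets through combinations people probably didn't want:

```javascript
// Assumed semantics: within one tag key the listed values are ORed,
// and separate tag keys are ANDed together.
function matchesTags(point, filter) {
  return Object.keys(filter).every(key =>
    filter[key].includes(point.tags[key]));
}

const points = [
  { tags: { a: "1", b: "2" } },
  { tags: { a: "2", b: "1" } },
  // Unwanted under the (a=1 AND b=2) OR (a=2 AND b=1) reading:
  { tags: { a: "1", b: "1" } },
];

const filter = { a: ["1", "2"], b: ["1", "2"] };
const matched = points.filter(p => matchesTags(p, filter));
// All three points match, including the cross combination.
```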
For when people want AND, the only solution is to DIY, that is, build a composite tag:
const tags = { hotel: 123, room: 321, person: 666 };
const keys = Object.keys(tags).sort();
const values = keys.map(key => tags[key]);
// Add one extra tag whose key is the sorted key list and whose value is
// the matching value list, so an exact AND match becomes one tag lookup.
tags[JSON.stringify(keys)] = JSON.stringify(values);
const query = JSON.stringify({ metrics: [{ tags }] });
This would allow searches like "for how long during this period did person ? stay in room ? of hotel ?".
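The lookup side of the DIY scheme above would rebuild the same composite tag; a minimal sketch (the helper name is mine), showing that sorting makes key order at write and read time irrelevant:

```javascript
// Hypothetical query-side counterpart to the composite tag: rebuild the
// same sorted-key tag at lookup time so an exact AND match becomes a
// single-tag filter. Not part of any KairosDB API.
function compositeTag(wanted) {
  const keys = Object.keys(wanted).sort();
  const values = keys.map(k => wanted[k]);
  return { key: JSON.stringify(keys), value: JSON.stringify(values) };
}

const stored = compositeTag({ hotel: 123, room: 321, person: 666 });
const lookup = compositeTag({ person: 666, hotel: 123, room: 321 });
// Same key and value either way, despite the different insertion order.
```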
Concerns like this depend on actual usage and access patterns. In this case it's quite common to want:
- How many stays (a day each) were there for hotel ? during the period ?.
- During the period ?, how many stays were there in room ? of hotel ?.
Many people might do something simpler than the above and just have room: [hotel, room].join(delim). It's quite common to have a usage pattern where your lookups consist of a list of possible ANDs (like a AND b AND c) but, out of those, you only ever want a, or a AND b, or a AND b AND c, never just b, or a AND c.
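A minimal sketch of that delimiter approach; the delimiter choice here is mine, and it must never occur inside the values themselves or lookups become ambiguous:

```javascript
// Join the AND-ed parts into one tag value with a delimiter that cannot
// appear in the data (an assumption that must hold for this to be safe).
const delim = "\u0001";
const hotel = "123", room = "321";
const roomTag = [hotel, room].join(delim);

// A query for room 321 of hotel 123 is then a single-tag exact match:
const wanted = ["123", "321"].join(delim);
// roomTag === wanted
```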
Metric versus tags: Fight
Starting out you must define both a metric and a tag for a datapoint. A problem here is that all basic use cases involving AND/OR can be managed with either metrics or tags. When starting out with a single first use case it's very much neither here nor there which one to lean on; it's not until you start piling on use cases that things become apparent.
You might say it's surely obvious when your queries are ten times bigger from using metrics (unless it turns out the metric field secretly takes an array for multiple items, like tag values do), or when it spams rows, but it's not immediately obvious whether that will be the case, and the line between when to use metrics and when to use tags is blurred, especially where performance is concerned.
Yes, I have seen people using metrics in place of tags. It happens, probably quite often. Then the moment you need to add another access pattern it quickly becomes insufficient. Tags can easily be added or left unpopulated with little impact, but changing metrics has a lot of impact. Any design should consider what will happen with tags versus metrics when different access patterns are needed.
My view on the matter is to keep metrics quite shallow and use tags by default. By shallow I mean whichever first set of ANDs you'll always want. In (application = hotels AND type = stays) AND (a = 1 OR a = 2 AND b = 1),
the initial static part in brackets might make a good metric, but the part that's dynamic per use case should be tags.
Another (obvious) rule of thumb: data that's always isolated, as in never included together in the same query, should use a separate metric. If it's separate data, use a separate metric.
As I see it, by default more metrics is bad, but tags can build up until that's bad as well, so there's probably a kind of to and fro, splitting things up a bit with metrics rather than tags, but generally the preference should be tags, not metrics. I think you'll always have cases where it turns out part of a metric should have been a tag, or a tag should have been a metric.
It's technically possible to make an abstraction that can switch between the two approaches for performance testing. It's also technically possible to make a profiling mode which, given the appropriate usage patterns, indicates whether a tag should be a metric or vice versa (usually by identifying that, when reduced to metrics, there's no overlap, or that two metrics were needed for what is otherwise the same query).
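A toy sketch of that profiling idea (the function and data shapes are mine, and this is not an existing feature): a tag key that's always filtered to a single value across recorded queries behaves like a static metric suffix and is a candidate for folding into the metric name:

```javascript
// Scan recorded queries and flag tag keys that only ever see one value.
function metricCandidates(queries) {
  const seen = {}; // tag key -> set of values requested across all queries
  for (const q of queries) {
    for (const [key, values] of Object.entries(q.tags)) {
      seen[key] = seen[key] || new Set();
      values.forEach(v => seen[key].add(v));
    }
  }
  // A key with exactly one observed value never discriminates anything:
  // it could live in the metric name instead.
  return Object.keys(seen).filter(key => seen[key].size === 1);
}

const recorded = [
  { metric: "stays", tags: { application: ["hotels"], room: ["321"] } },
  { metric: "stays", tags: { application: ["hotels"], room: ["7"] } },
];
// "application" never varies; "room" does.
```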
I have never seen a database system where you can insert your use cases, as in, rather than just insert a = 1, b = 2, you insert a = 1 AND b = 2; a = 1; a IN (1),
though that's probably out of the scope of this.
The battle becomes even more epic when you pit group by against the other alternatives.
Schema
What is a row key? What do the indexes actually look like? Are there composite indexes? Are they ordered? Are they hashes? Is there a plan to expose left to right indexes?
Rather than trying to explain it, it might be easier to just give the definitions used at the most concise level, i.e. in Cassandra's own syntax, for example.
Buckets you say?
This immediately stands out because it gives the impression that's all there is, as in, you always end up fetching at least three weeks of data for a given metric and time range. I assume that's not actually the case, but if I were looking at this database at a glance and saw that, I'd quickly walk away if, for example, my use case consisted of a lot of small range lookups across a busy (heavily populated) metric within a retention period of a fortnight.
I would assume that in reality Cassandra provides a sparse array implementation and lets you ask for everything from this column to that column. That raises a question, because sparse columns are just an abstraction: usually they're backed by either a hash map or a tree structure (though in some cases simpler or more complex solutions). If it's a map, that tends to preclude the possibility of a range lookup.
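To illustrate why the backing structure matters, a toy comparison (this says nothing about Cassandra's actual internals): an unordered store can only answer a column range by scanning everything, while an ordered one can find the bounds and slice:

```javascript
const columns = [5, 1, 9, 3, 7]; // column timestamps, arrival order

// Hash-map style: no ordering, so a range query is a full scan.
const viaScan = columns.filter(c => c >= 3 && c <= 7);

// Tree style: keep columns ordered, then locate the bounds and slice
// (a real tree would find the bounds in O(log n) instead of findIndex).
const sorted = [...columns].sort((a, b) => a - b);
const lo = sorted.findIndex(c => c >= 3); // first column in range
const hi = sorted.findIndex(c => c > 7);  // first column past the range
const viaSlice = sorted.slice(lo, hi === -1 ? sorted.length : hi);
// Both yield columns 3, 5, 7; only the ordered one avoids touching every column.
```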
WHAT?
Similar to a query but only returns the tags (no data points returned). This can potentially return more tags than a query because it is optimized for speed and does not query all rows to narrow down the time range. This queries only the Row Key Index and thus the time range is the starting time range. Since the Cassandra row is set to 3 weeks, this can return tags for up to a 3 week period. See Cassandra Schema.
{"start_absolute": 1357023600000, "end_relative": {"value": "5", "unit": "days"}}
Kick the bucket, bad bucket.