Java library for the HyperLogLog algorithm


java-hll

A Java implementation of HyperLogLog whose goal is to be storage-compatible with other similar offerings from Aggregate Knowledge.

NOTE: This implementation fully implements reading and writing all formats in the v1.0.0 storage specification, but internal memory representation (and hence space-tradeoffs) may cause automatic "promotion" between representations to occur at different implementation-dependent points. To ensure interoperability between, for example, the PostgreSQL implementation and this library, all promotion cutoffs should be explicitly defined.

Similarly, certain parameters have different bounds in order to deal with VM limitations like maximum array length. Specifically, log2m has a maximum value of 30 in this implementation whereas the storage specification states a maximum value of 31 (which can be realized in the PostgreSQL implementation).

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed. See below for more details.

Algorithms

An HLL is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between these algorithms, we have given them names:

EMPTY

A constant value that denotes the empty set.

EXPLICIT

An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.

SPARSE

A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.

FULL

A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.

Motivation

Our motivation for augmenting the original HLL algorithm went something like this:

  • Naively, an HLL takes regwidth * 2^log2m bits to store.
  • In typical usage, with log2m = 11 and regwidth = 5, that works out to 5 * 2^11 = 10,240 bits, or 1,280 bytes (see the sketch below).
  • That's a lot of bytes!
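
As a minimal sketch of that arithmetic (the variable names here are illustrative only, not part of the library's API):

// illustrative storage arithmetic, not library code
final int log2m = 11;                       // 2^11 = 2,048 registers
final int regwidth = 5;                     // 5 bits per register
final long bits = (long) regwidth << log2m; // 5 * 2,048 = 10,240 bits
final long bytes = bits / 8;                // 1,280 bytes, i.e. the size of 160 64-bit integers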

The first addition to the original HLL algorithm came from realizing that 1,280 bytes is the size of 160 64-bit integers. So, if we wanted more accuracy at low cardinalities, we could just keep an explicit set of the inputs as a sorted list of 64-bit integers until we hit the 161st distinct value. This would give us the true representation of the distinct values in the stream while requiring the same amount of memory. (This is the EXPLICIT algorithm.)

The second came from the realization that we didn't need to store registers whose value was zero. We could simply represent the set of registers that had non-zero values as a map from index to value. This map is stored as a list of index-value pairs that are bit-packed "short words" of length log2m + regwidth. (This is the SPARSE algorithm.)
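
As a rough sketch of what one such "short word" looks like (the authoritative bit layout is the v1.0.0 storage specification; this example simply assumes the register index occupies the high bits and the register value the low bits):

// hypothetical packing of a single SPARSE "short word" of log2m + regwidth bits
final int log2m = 11;
final int regwidth = 5;
final int registerIndex = 1234;  // 0 <= index < 2^log2m
final int registerValue = 3;     // 0 <= value < 2^regwidth
final int shortWord = (registerIndex << regwidth) | registerValue; // 16 bits total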

Combining these two augmentations, we get a "promotion hierarchy" that allows the algorithm to be tuned for better accuracy, memory, or performance.

Initializing and storing a new HLL object will simply allocate a small sentinel value symbolizing the empty set (EMPTY). When you add the first few values, a sorted list of unique integers is stored in an EXPLICIT set. When you wish to cease trading off accuracy for memory, the values in the sorted list are "promoted" to a SPARSE map-based HyperLogLog structure. Finally, when there are enough registers, the map-based HLL will be converted to a bit-packed FULL HLL structure.
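
The five-argument constructor (seen in the issues below) exposes these promotion-related parameters directly; the parameter meanings in this sketch are our reading of them, so treat it as an assumption rather than a definitive reference:

// sketch: pinning down promotion behavior explicitly rather than relying on defaults
final HLL hll = new HLL(
        13,            // log2m
        5,             // regwidth
        -1,            // expthresh: EXPLICIT-promotion cutoff; -1 appears to select the implementation default
        true,          // sparseon: whether the SPARSE intermediate representation is enabled (assumed)
        HLLType.EMPTY  // initial representation; the HLL promotes upward from here as values are added (assumed)
);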

Empirically, the insertion rate of the EMPTY, EXPLICIT, and SPARSE representations is in the 200,000 to 300,000 inserts-per-second range, while the throughput of the FULL representation is in the millions of inserts per second on relatively new hardware ('10 Xeon).

Naturally, the cardinality estimates of the EMPTY and EXPLICIT representations are exact, while the accuracy of the SPARSE and FULL representations is governed by the guarantees provided by the original HLL algorithm.


The Importance of Hashing

In brief, it is absolutely crucial to hash inputs to an HLL. A close approximation of uniform randomness in the inputs ensures that the error guarantees laid out in the original paper hold. We've empirically determined that MurmurHash 3, from Google's Guava, is an excellent and fast hash function to use in conjunction with this library.

The seed to the hash call must remain constant for all inputs to a given HLL. Similarly, if one plans to compute the union of two HLLs, the input values must have been hashed using the same seed.
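
A minimal sketch of that constraint, using Guava's MurmurHash 3 (com.google.common.hash.Hashing) with an arbitrary seed; the variable names are illustrative:

// both HLLs that will later be unioned must see values hashed with the same seed
final HashFunction hash = Hashing.murmur3_128(123456/*same seed everywhere*/);

final HLL left = new HLL(13/*log2m*/, 5/*registerWidth*/);
final HLL right = new HLL(13/*log2m*/, 5/*registerWidth*/);

left.addRaw(hash.newHasher().putLong(1L).hash().asLong());
right.addRaw(hash.newHasher().putLong(2L).hash().asLong());
// left.union(right) is now meaningful because both sides used the same seed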

For a good overview of the importance of hashing and hash functions when using probabilistic algorithms as well as an analysis of MurmurHash 3, refer to these blog posts:

On Unions and Intersections

HLLs have the useful property that the union of any number of HLLs is equal to the HLL that would have been populated by playing back all inputs to those 'n' HLLs into a single HLL. Colloquially, one can say that HLLs have "lossless" unions because the same cardinality error guarantees that apply to a single HLL apply to a union of HLLs. See the union() function.

Using the inclusion-exclusion principle and the union() function, one can also estimate the intersection of sets represented by HLLs. Note, however, that the error is proportional to the cardinality of the union of the two HLLs, while the result can be significantly smaller than the union, leading to disproportionately large error relative to the actual intersection cardinality. For instance, if one HLL has a cardinality of 1 billion while the other has a cardinality of 10 million, with an overlap of 5 million, the intersection estimate can easily be dwarfed by even a 1% error in the estimate of the larger HLL's cardinality.
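
A minimal sketch of that inclusion-exclusion estimate; since union() modifies its receiver, this copies one HLL via its serialized form first (hllA and hllB are assumed to have been populated with values hashed using the same seed):

// |A ∩ B| ≈ |A| + |B| - |A ∪ B|, subject to the error caveats described above
final long cardA = hllA.cardinality();
final long cardB = hllB.cardinality();

final HLL unionAB = HLL.fromBytes(hllA.toBytes()); // copy A so the originals are untouched
unionAB.union(hllB);                               // unionAB now represents A ∪ B

final long intersectionEstimate = cardA + cardB - unionAB.cardinality();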

For more information on HLL intersections, see this blog post.

Usage

HLL is available in Maven Central. Include it in your project with:

<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version>
</dependency>

Hashing and adding a value to a new HLL:

// assumes Guava's com.google.common.hash package (HashFunction, Hasher, Hashing) is on the classpath
final int seed = 123456;
final HashFunction hash = Hashing.murmur3_128(seed);
final Hasher hasher = hash.newHasher();
hasher.putLong(1L/*value to hash*/);

final long hashedValue = hasher.hash().asLong();

final HLL hll = new HLL(13/*log2m*/, 5/*registerWidth*/);
hll.addRaw(hashedValue);

Retrieving the cardinality of an HLL:

final long cardinality = hll.cardinality();

Unioning two HLLs together (and retrieving the resulting cardinality):

final HLL hll1 = new HLL(13/*log2m*/, 5/*registerWidth*/);
final HLL hll2 = new HLL(13/*log2m*/, 5/*registerWidth*/);

// ... (add values to both sets) ...

hll1.union(hll2)/*modifies hll1 to contain the union*/;
final long cardinalityUnion = hll1.cardinality();

Reading an HLL from a hex representation of storage specification, v1.0.0 (for example, retrieved from a PostgreSQL database):

final HLL hll = HLL.fromBytes(NumberUtil.fromHex(hexString));

Writing an HLL to its hex representation of storage specification, v1.0.0 (for example, to be inserted into a PostgreSQL database):

final byte[] bytes = hll.toBytes();
final String output = "\\x" + NumberUtil.toHex(bytes, 0, bytes.length);

Building

  • Requires Maven 2.0

  • mvn clean package in the base directory

    A target directory will be created and a jar containing the library placed therein.

Testing

  • mvn test in the base directory.

Comments
  • Problem with toBytes and fromBytes (error is 76%, too big)

    test1 (Scala code) gives error of less than 2%:

    import java.nio.charset.Charset
    import com.google.common.hash.Hashing
    import net.agkn.hll.HLL
    
    def test1() {
      val charset = Charset.forName("UTF8")
      val m3 = Hashing.murmur3_128
      val hll = new HLL(13, 5)
    
      for (i <- 1 to 74570) {
        val dau1 = hll.cardinality.toInt
    
        val hashCode = m3.hashString("" + i, charset)
        hll.addRaw(hashCode.asLong)
    
        val dau2 = hll.cardinality.toInt
        if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
      }
      println(hll.cardinality.toInt)
    }
    

    test2 gives error of 76%!!!

    def test2() {
      val charset = Charset.forName("UTF8")
      val m3 = Hashing.murmur3_128  
    
      var b: Array[Byte] = null
      for (i <- 1 to 74570) {
        val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
        val dau1 = hll.cardinality.toInt
    
        val hashCode = m3.hashString("" + i, charset)
        hll.addRaw(hashCode.asLong)
    
        val dau2 = hll.cardinality.toInt
        if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
    
        if (dau1 != dau2) b = hll.toBytes
      }
    
      val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
      println(hll.cardinality.toInt)
    }
    

    The output looks like this (note that the value drops dramatically from 73818 to 17757):

    ...
    dau1 = dau2 = 73818, i = 74560
    dau1 = dau2 = 73818, i = 74561
    dau1 = dau2 = 73818, i = 74562
    dau1 = dau2 = 17757, i = 74565
    dau1 = dau2 = 17760, i = 74567
    ...
    
    opened by ngocdaothanh 7
  • Unusual behavior for HLL with register width 6

    There seems to be some kind of bug in HLL execution: when one instantiates an HLL with a regwidth of 6, the cardinality returned is consistently 0 at large sample sizes. The following test was constructed and run in FullHLLTest. Could this be something to do with cutoff measurements? Thanks!

    /**
     * Test case for 6-bit cutoff problems
     */
    @Test
    public void sixBitSixteenKRegisterCutoffBugTest() {
        Long temp = System.currentTimeMillis();
    
        // this does NOT work
        final HLL brokenHll4 = new HLL(13, 6);
        {
            Random rand = new Random(temp);
            for(int i=0; i<16384*16384; i++) {
                long random = rand.nextLong();
                brokenHll4.addRaw(random);
            }
    
            final long cardinality = brokenHll4.cardinality();
    
            assertTrue(cardinality > 0); // this does NOT work
        }
    
        // this DOES work
        final HLL brokenHll5 = new HLL(13, 5);
        {
            Random rand = new Random(temp);
            for(int i=0; i<16384*16384; i++) {
                long random = rand.nextLong();
                brokenHll5.addRaw(random);
            }
    
            final long cardinality = brokenHll5.cardinality();
    
            assertTrue(cardinality > 0); // this DOES work
        }
    }
    
    opened by Akninirith 5
  • Make some calculation use arrays instead of everytime calculation

    This pull request includes the previous one (about caching largeEstimatorCutoff), so feel free to accept anything from either one, or reimplement my proposed changes :)

    opened by yerenkow 4
  • pom cleanup: gpg and javadoc

    The pom had two different versions of the maven-gpg plugin specified, so I fixed that.

    Javadoc also wasn't building in JDK8 because apparently self-closing HTML tags aren't allowed in Javadoc as of JDK8. Rather than changing all the javadocs, let's pass -Xdoclint:none to the javadoc plugin and disable that check for now.

    opened by blinsay 1
  • Creating test for some bug in serialization/unserialization

    Currently, it fails with the following message:

    java.lang.IllegalArgumentException: Word length must be larger than 4 and less than or equal to 64. (was: 1)
        at net.agkn.hll.serialization.BigEndianAscendingWordSerializer.<init>(BigEndianAscendingWordSerializer.java:77)
        at net.agkn.hll.serialization.SchemaVersionOne.getSerializer(SchemaVersionOne.java:111)
        at net.agkn.hll.HLL.toBytes(HLL.java:894)
        at net.agkn.hll.HLL.toBytes(HLL.java:847)
        at net.agkn.hll.serialization.HllSerializationTest.runHllSerializeTest(HllSerializationTest.java:31)

    opened by yerenkow 1
  • Wrong max value for log2m?

    The docs say that the log2m parameter must "be at least 4 and at most 31", but it seems that 30 is the highest possible value.

    The following line raises a NegativeArraySizeException:

    new HLL(31, 8, -1, true, HLLType.FULL);
    

    Here's the full stack trace:

    Exception in thread "main" java.lang.NegativeArraySizeException
        at net.agkn.hll.util.BitVector.<init>(BitVector.java:68)
        at net.agkn.hll.HLL.initializeStorage(HLL.java:434)
        at net.agkn.hll.HLL.<init>(HLL.java:202)
        at Spike.main(Spike.java:7)
    

    If I change 31 to 30 it works fine.

    opened by iconara 1
  • Error rate is more than theoretical error rate

    The research paper claims the error rate to be 1.04/sqrt(m)

    For log2m = 11 and w = 5, the theoretical error rate should be ~2.3%, but in our tests we have seen a variance of up to 3.5% when inserting 100K unique alphanumeric UUIDs. Below is our pseudocode:

    HLL hll = new HLL( 11, 5 );
    HashFunction hashFunction = Hashing.murmur3_128();
    
    for (String str : UUIDs) {
      Hasher hasher = hashFunction.newHasher();
      hasher.putString(str, StandardCharsets.UTF_8);
      hll.addRaw(hasher.hash().asLong());
    }
    

    How can we reduce this additional 1.2% variance and fall within the theoretically claimed error rate?

    opened by shreshthagarg21 0
  • Large range HLL resulting in 0 cardinality

    There are two unit tests SparseHLLTest#largeRangeSmokeTest() and FullHLLTest#largeRangeSmokeTest() that are supposed to test cardinalities of very large HLLs. However, if you insert

    System.out.println(cardinality + ", " + expected);
    

    at the bottom of them, you see that both the expected and actual cardinalities are 0. They should instead be a very large number.

    opened by JonathanAquino 0
  • IntegrationTestGenerator: StackOverflowError when running it

    When I run the IntegrationTestGenerator class, I get a StackOverflowError:

    Exception in thread "main" java.lang.StackOverflowError
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    
    opened by JonathanAquino 0
  • Maximum or Average Error?

    In the docs, the relative error is given by the expression ±1.04/√(2^log2m).

    Is this an average error or a maximum error? If average, is it even possible to determine a maximum error?

    opened by dgeswein-dlx 0
  • Unable to Serialize HLL ?

    I am playing around with your cardinality libraries, which I'd like to wrap in a map/reduce paradigm and plug into various compute platforms out there... hence I chose Adaptive HLL, which has a merge method on it.

    I'd also like to save the object with its state on disk so that the compute/query platforms can interpret these computed objects and answer questions (instead of flowing raw event data again). But I am unable to serialize the Adaptive HyperLogLog object. Was this intentionally left out?

    There is not much documentation, but the library seems quite clean and like a potential fit for what I have in mind... can you please help?

    Thanks, Rohan

    opened by rohancs 3