Java library for the HyperLogLog algorithm


java-hll

A Java implementation of HyperLogLog whose goal is to be storage-compatible with other similar offerings from Aggregate Knowledge.

NOTE: This implementation fully implements reading and writing all formats in the v1.0.0 storage specification, but internal memory representation (and hence space-tradeoffs) may cause automatic "promotion" between representations to occur at different implementation-dependent points. To ensure interoperability between, for example, the PostgreSQL implementation and this library, all promotion cutoffs should be explicitly defined.

Similarly, certain parameters have different bounds in order to deal with VM limitations like maximum array length. Specifically, log2m has a maximum value of 30 in this implementation whereas the storage specification states a maximum value of 31 (which can be realized in the PostgreSQL implementation).

Overview

HyperLogLog (HLL) is a fixed-size, set-like structure used for distinct value counting with tunable precision. For example, in 1280 bytes HLL can estimate the count of tens of billions of distinct values with only a few percent error.

In addition to the algorithm proposed in the original paper, this implementation is augmented to improve its accuracy and memory use without sacrificing much speed. See below for more details.

Algorithms

An HLL is a combination of different set/distinct-value-counting algorithms that can be thought of as a hierarchy, along with rules for moving up that hierarchy. In order to distinguish between these algorithms, we have given them names:

EMPTY

A constant value that denotes the empty set.

EXPLICIT

An explicit, unique, sorted list of integers in the set, which is maintained up to a fixed cardinality.

SPARSE

A 'lazy', map-based implementation of HyperLogLog, a probabilistic set data structure. Only stores the indices and values of non-zero registers in a map, until the number of non-zero registers exceeds a fixed cardinality.

FULL

A fully-materialized, list-based implementation of HyperLogLog. Explicitly stores the value of every register in a list ordered by register index.

Motivation

Our motivation for augmenting the original HLL algorithm went something like this:

  • Naively, an HLL takes regwidth * 2^log2m bits to store.
  • In typical usage, with log2m = 11 and regwidth = 5, that works out to 5 * 2^11 = 10,240 bits, or 1,280 bytes (see the sketch below).
  • That's a lot of bytes!
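
As a minimal sketch of that arithmetic (the variable names here are illustrative only, not part of the library's API):

// illustrative storage arithmetic, not library code
final int log2m = 11;                       // 2^11 = 2,048 registers
final int regwidth = 5;                     // 5 bits per register
final long bits = (long) regwidth << log2m; // 5 * 2,048 = 10,240 bits
final long bytes = bits / 8;                // 1,280 bytes, i.e. the size of 160 64-bit integers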

The first addition to the original HLL algorithm came from realizing that 1,280 bytes is the size of 160 64-bit integers. So, if we wanted more accuracy at low cardinalities, we could just keep an explicit set of the inputs as a sorted list of 64-bit integers until we hit the 161st distinct value. This would give us the true representation of the distinct values in the stream while requiring the same amount of memory. (This is the EXPLICIT algorithm.)

The second came from the realization that we didn't need to store registers whose value was zero. We could simply represent the set of registers that had non-zero values as a map from index to value. This map is stored as a list of index-value pairs that are bit-packed "short words" of length log2m + regwidth. (This is the SPARSE algorithm.)
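
As a rough sketch of what one such "short word" looks like (the authoritative bit layout is the v1.0.0 storage specification; this example simply assumes the register index occupies the high bits and the register value the low bits):

// hypothetical packing of a single SPARSE "short word" of log2m + regwidth bits
final int log2m = 11;
final int regwidth = 5;
final int registerIndex = 1234;  // 0 <= index < 2^log2m
final int registerValue = 3;     // 0 <= value < 2^regwidth
final int shortWord = (registerIndex << regwidth) | registerValue; // 16 bits total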

Combining these two augmentations, we get a "promotion hierarchy" that allows the algorithm to be tuned for better accuracy, memory, or performance.

Initializing and storing a new HLL object will simply allocate a small sentinel value symbolizing the empty set (EMPTY). When you add the first few values, a sorted list of unique integers is stored in an EXPLICIT set. When you wish to cease trading off accuracy for memory, the values in the sorted list are "promoted" to a SPARSE map-based HyperLogLog structure. Finally, when there are enough registers, the map-based HLL will be converted to a bit-packed FULL HLL structure.
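
The five-argument constructor (seen in the issues below) exposes these promotion-related parameters directly; the parameter meanings in this sketch are our reading of them, so treat it as an assumption rather than a definitive reference:

// sketch: pinning down promotion behavior explicitly rather than relying on defaults
final HLL hll = new HLL(
        13,            // log2m
        5,             // regwidth
        -1,            // expthresh: EXPLICIT-promotion cutoff; -1 appears to select the implementation default
        true,          // sparseon: whether the SPARSE intermediate representation is enabled (assumed)
        HLLType.EMPTY  // initial representation; the HLL promotes upward from here as values are added (assumed)
);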

Empirically, the insertion rate of the EMPTY, EXPLICIT, and SPARSE representations is in the 200,000 to 300,000 inserts-per-second range, while the throughput of the FULL representation is in the millions of inserts per second on relatively new hardware ('10 Xeon).

Naturally, the cardinality estimates of the EMPTY and EXPLICIT representations are exact, while the accuracy of the SPARSE and FULL representations is governed by the guarantees provided by the original HLL algorithm.


The Importance of Hashing

In brief, it is absolutely crucial to hash inputs to an HLL. A close approximation of uniform randomness in the inputs ensures that the error guarantees laid out in the original paper hold. We've empirically determined that MurmurHash 3, from Google's Guava, is an excellent and fast hash function to use in conjunction with this library.

The seed to the hash call must remain constant for all inputs to a given HLL. Similarly, if one plans to compute the union of two HLLs, the input values must have been hashed using the same seed.
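
A minimal sketch of that constraint, using Guava's MurmurHash 3 (com.google.common.hash.Hashing) with an arbitrary seed; the variable names are illustrative:

// both HLLs that will later be unioned must see values hashed with the same seed
final HashFunction hash = Hashing.murmur3_128(123456/*same seed everywhere*/);

final HLL left = new HLL(13/*log2m*/, 5/*registerWidth*/);
final HLL right = new HLL(13/*log2m*/, 5/*registerWidth*/);

left.addRaw(hash.newHasher().putLong(1L).hash().asLong());
right.addRaw(hash.newHasher().putLong(2L).hash().asLong());
// left.union(right) is now meaningful because both sides used the same seed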

For a good overview of the importance of hashing and hash functions when using probabilistic algorithms as well as an analysis of MurmurHash 3, refer to these blog posts:

On Unions and Intersections

HLLs have the useful property that the union of any number of HLLs is equal to the HLL that would have been populated by playing back all inputs to those 'n' HLLs into a single HLL. Colloquially, one can say that HLLs have "lossless" unions because the same cardinality error guarantees that apply to a single HLL apply to a union of HLLs. See the union() function.

Using the inclusion-exclusion principle and the union() function, one can also estimate the intersection of sets represented by HLLs. Note, however, that the error is proportional to the cardinality of the union of the two HLLs, while the result can be significantly smaller than the union, leading to disproportionately large error relative to the actual intersection cardinality. For instance, if one HLL has a cardinality of 1 billion while the other has a cardinality of 10 million, with an overlap of 5 million, the intersection estimate can easily be dwarfed by even a 1% error in the estimate of the larger HLL's cardinality.
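
A minimal sketch of that inclusion-exclusion estimate; since union() modifies its receiver, this copies one HLL via its serialized form first (hllA and hllB are assumed to have been populated with values hashed using the same seed):

// |A ∩ B| ≈ |A| + |B| - |A ∪ B|, subject to the error caveats described above
final long cardA = hllA.cardinality();
final long cardB = hllB.cardinality();

final HLL unionAB = HLL.fromBytes(hllA.toBytes()); // copy A so the originals are untouched
unionAB.union(hllB);                               // unionAB now represents A ∪ B

final long intersectionEstimate = cardA + cardB - unionAB.cardinality();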

For more information on HLL intersections, see this blog post.

Usage

HLL is available in Maven Central. Include it in your project with:

<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version>
</dependency>

Hashing and adding a value to a new HLL:

// assumes Guava's com.google.common.hash package (HashFunction, Hasher, Hashing) is on the classpath
final int seed = 123456;
final HashFunction hash = Hashing.murmur3_128(seed);
final Hasher hasher = hash.newHasher();
hasher.putLong(1L/*value to hash*/);

final long hashedValue = hasher.hash().asLong();

final HLL hll = new HLL(13/*log2m*/, 5/*registerWidth*/);
hll.addRaw(hashedValue);

Retrieving the cardinality of an HLL:

final long cardinality = hll.cardinality();

Unioning two HLLs together (and retrieving the resulting cardinality):

final HLL hll1 = new HLL(13/*log2m*/, 5/*registerWidth*/);
final HLL hll2 = new HLL(13/*log2m*/, 5/*registerWidth*/);

// ... (add values to both sets) ...

hll1.union(hll2)/*modifies hll1 to contain the union*/;
final long cardinalityUnion = hll1.cardinality();

Reading an HLL from a hex representation of storage specification, v1.0.0 (for example, retrieved from a PostgreSQL database):

final HLL hll = HLL.fromBytes(NumberUtil.fromHex(hexString));

Writing an HLL to its hex representation of storage specification, v1.0.0 (for example, to be inserted into a PostgreSQL database):

final byte[] bytes = hll.toBytes();
final String output = "\\x" + NumberUtil.toHex(bytes, 0, bytes.length);

Building

  • Requires Maven 2.0

  • mvn clean package in the base directory

    A target directory will be created and a jar containing the library placed therein.

Testing

  • mvn test in the base directory.

Comments
  • Problem with toBytes and fromBytes (error is 76%, too big)

    test1 (Scala code) gives error of less than 2%:

    import java.nio.charset.Charset
    import com.google.common.hash.Hashing
    import net.agkn.hll.HLL
    
    def test1() {
      val charset = Charset.forName("UTF8")
      val m3 = Hashing.murmur3_128
      val hll = new HLL(13, 5)
    
      for (i <- 1 to 74570) {
        val dau1 = hll.cardinality.toInt
    
        val hashCode = m3.hashString("" + i, charset)
        hll.addRaw(hashCode.asLong)
    
        val dau2 = hll.cardinality.toInt
        if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
      }
      println(hll.cardinality.toInt)
    }
    

    test2 gives error of 76%!!!

    def test2() {
      val charset = Charset.forName("UTF8")
      val m3 = Hashing.murmur3_128  
    
      var b: Array[Byte] = null
      for (i <- 1 to 74570) {
        val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
        val dau1 = hll.cardinality.toInt
    
        val hashCode = m3.hashString("" + i, charset)
        hll.addRaw(hashCode.asLong)
    
        val dau2 = hll.cardinality.toInt
        if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
    
        if (dau1 != dau2) b = hll.toBytes
      }
    
      val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
      println(hll.cardinality.toInt)
    }
    

    The output looks like this (note that the value drops dramatically from 73818 to 17757):

    ...
    dau1 = dau2 = 73818, i = 74560
    dau1 = dau2 = 73818, i = 74561
    dau1 = dau2 = 73818, i = 74562
    dau1 = dau2 = 17757, i = 74565
    dau1 = dau2 = 17760, i = 74567
    ...
    
    opened by ngocdaothanh 7
  • Unusual behavior for HLL with register width 6

    There seems to be some kind of bug in HLL execution: when one instantiates an HLL with a regwidth of 6, the cardinality returned is consistently 0 at large sample sizes. The following test was constructed and run in FullHLLTest. Could this be something to do with cutoff measurements? Thanks!

    /**
     * Test case for 6-bit cutoff problems
     */
    @Test
    public void sixBitSixteenKRegisterCutoffBugTest() {
        Long temp = System.currentTimeMillis();
    
        // this does NOT work
        final HLL brokenHll4 = new HLL(13, 6);
        {
            Random rand = new Random(temp);
            for(int i=0; i<16384*16384; i++) {
                long random = rand.nextLong();
                brokenHll4.addRaw(random);
            }
    
            final long cardinality = brokenHll4.cardinality();
    
            assertTrue(cardinality > 0); // this does NOT work
        }
    
        // this DOES work
        final HLL brokenHll5 = new HLL(13, 5);
        {
            Random rand = new Random(temp);
            for(int i=0; i<16384*16384; i++) {
                long random = rand.nextLong();
                brokenHll5.addRaw(random);
            }
    
            final long cardinality = brokenHll5.cardinality();
    
            assertTrue(cardinality > 0); // this DOES work
        }
    }
    
    opened by Akninirith 5
  • Make some calculation use arrays instead of everytime calculation

    This pull request includes the previous one (about caching largeEstimatorCutoff), so feel free to accept anything from either one, or reimplement my proposed changes :)

    opened by yerenkow 4
  • pom cleanup: gpg and javadoc

    The pom had two different versions of the maven-gpg plugin specified, so I fixed that.

    Javadoc also wasn't building in JDK8 because apparently self-closing HTML tags aren't allowed in Javadoc as of JDK8. Rather than changing all the javadocs, let's pass -Xdoclint:none to the javadoc plugin and disable that check for now.

    opened by blinsay 1
  • Creating test for some bug in serialization/unserialization

    Currently, it fails with the following message:

    java.lang.IllegalArgumentException: Word length must be larger than 4 and less than or equal to 64. (was: 1)
        at net.agkn.hll.serialization.BigEndianAscendingWordSerializer.<init>(BigEndianAscendingWordSerializer.java:77)
        at net.agkn.hll.serialization.SchemaVersionOne.getSerializer(SchemaVersionOne.java:111)
        at net.agkn.hll.HLL.toBytes(HLL.java:894)
        at net.agkn.hll.HLL.toBytes(HLL.java:847)
        at net.agkn.hll.serialization.HllSerializationTest.runHllSerializeTest(HllSerializationTest.java:31)

    opened by yerenkow 1
  • Wrong max value for log2m?

    The docs say that the log2m parameter must "be at least 4 and at most 31", but it seems that 30 is the highest possible value.

    The following line raises a NegativeArraySizeException:

    new HLL(31, 8, -1, true, HLLType.FULL);
    

    Here's the full stack trace:

    Exception in thread "main" java.lang.NegativeArraySizeException
        at net.agkn.hll.util.BitVector.<init>(BitVector.java:68)
        at net.agkn.hll.HLL.initializeStorage(HLL.java:434)
        at net.agkn.hll.HLL.<init>(HLL.java:202)
        at Spike.main(Spike.java:7)
    

    If I change 31 to 30 it works fine.

    opened by iconara 1
  • Error rate is more than theoretical error rate

    The research paper claims the error rate to be 1.04/sqrt(m)

    For log2m = 11 and w = 5, the theoretical error rate should be ~2.3%, but in our tests we have seen a variance of up to 3.5% when inserting 100K unique alphanumeric UUIDs. Below is our pseudocode:

    HLL hll = new HLL( 11, 5 );
    HashFunction hashFunction = Hashing.murmur3_128();
    
    for (String str : UUIDs) {
      Hasher hasher = hashFunction.newHasher();
      hasher.putString(str, StandardCharsets.UTF_8);
      hll.addRaw(hasher.hash().asLong());
    }
    

    How can we reduce this additional 1.2% variance and fall within the theoretically claimed error rate?

    opened by shreshthagarg21 0
  • Large range HLL resulting in 0 cardinality

    There are two unit tests SparseHLLTest#largeRangeSmokeTest() and FullHLLTest#largeRangeSmokeTest() that are supposed to test cardinalities of very large HLLs. However, if you insert

    System.out.println(cardinality + ", " + expected);
    

    at the bottom of them, you see that both the expected and actual cardinalities are 0. They should instead be a very large number.

    opened by JonathanAquino 0
  • IntegrationTestGenerator: StackOverflowError when running it

    When I run the IntegrationTestGenerator class, I get a StackOverflowError:

    Exception in thread "main" java.lang.StackOverflowError
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
        at net.agkn.hll.IntegrationTestGenerator.newHLL(IntegrationTestGenerator.java:516)
    
    opened by JonathanAquino 0
  • Maximum or Average Error?

    In the docs, the relative error is given by the expression ±1.04/√(2^log2m).

    Is this an average error or a maximum error? If average, is it even possible to determine a maximum error?

    opened by dgeswein-dlx 0
  • Unable to Serialize HLL ?

    I am playing around with your cardinality libraries, which I'd like to wrap in a map/reduce paradigm and plug into various compute platforms out there... hence I chose Adaptive HLL, which has a merge method on it.

    I'd also like to save the object with its state on disk so that the compute/query platforms can interpret these computed objects and answer questions (instead of flowing raw event data again). But I am unable to serialize the Adaptive HyperLogLog object. Was this intentionally left out?

    There is not much documentation, but the library seems quite clean and like a potential fit for what I have in mind... can you please help?

    Thanks, Rohan

    opened by rohancs 3