Stream summarizer and cardinality estimator.

Overview

Build Status

Description

A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for estimating: cardinality (i.e. counting things); set membership; top-k elements and frequency. One particularly useful feature is that cardinality estimators with compatible configurations may be safely merged.

These classes may be used directly in a JVM project or with the provided shell scripts and good old Unix IO redirection.

The ideas here are not original to us. We have endeavored to create useful implementations from iterating over the existing academic literature. As such this library relies heavily on the work of others. Please read the Sources and Reference sections.

Examples

$ echo -e "foo\nfoo\nbar" | ./bin/topk
item count error
---- ----- -----
 foo     2     0
 bar     1     0

Item count: 3


$ echo -e "foo\nfoo\nbar" | ./bin/cardinality
Item Count Cardinality Estimate
---------- --------------------
         3                    2

Maven Artifact Maven Central

<dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.9.5</version>
</dependency>

Building

Assuming you have Apache Maven installed and configured:

mvn package

And you should be all set.

Where People Hang Out

Mailing list: http://groups.google.com/group/stream-lib-user

Sources

The set membership code is the Bloom Filter implementation from Apache Cassandra circa December 2009. The changes here are minimal and were for the purpose of testing and independent use. Apache Software Foundation headers have been retained on these files. By extension we also include murmurhash.

We were inspired to use this code by Jonathan Ellis' post All you ever wanted to know about writing bloom filters.

References

There are javadoc references to specific papers. These were the ones we found most relevant during out research.

Cardinality

  • Min Cai, Jianping Pan, Yu K. Kwok, and Kai Hwang. Fast and accurate traffic matrix measurement using adaptive cardinality counting. In MineNet ’05: Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 205–206, New York, NY, USA, 2005. ACM.

  • Ahmed Metwally, Divyakant Agrawal, and Amr E. Abbadi. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In EDBT ’08: Proceedings of the 11th international conference on Extending database technology, pages 618–629, New York, NY, USA, 2008. ACM.

  • Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum. Counting at large: Efficient cardinality estimation in Internet-Scale data networks. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, pages 40+, Washington, DC, USA, 2006. IEEE Computer Society.

  • Marianne Durand and Philippe Flajolet. LogLog counting of large cardinalities. In ESA03, volume 2832 of LNCS, pages 605–617, 2003.

  • Kyu Y. Whang, Brad T. Vander Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.

  • Moses Charikar, Kevin Chen, and Martin F. Colton. Finding frequent items in data streams. In ICALP ’02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 693–703, London, UK, 2002. Springer-Verlag.

  • Stefan Heule, Marc Nunkesser, Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. Proceedings of the EDBT 2013 Conference, ACM, Genoa, Italy

Top-K

  • Graham Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. pages 29–38.
  1. 10.1016/j.jalgor.2003.12.001 http://dl.acm.org/citation.cfm?id=1073718
  • Cheqing Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, and Aoying Zhou. Dynamically maintaining frequent items over a data stream. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 287–294, New York, NY, USA, 2003. ACM. 10.1145/956863.956918 http://dl.acm.org/citation.cfm?id=956918

  • Ahmed Metwally, Divyakant Agrawal, and Amr Abbadi. Efficient computation of frequent and top-k elements in data streams. pages 398–412. 2005. 10.1007/978-3-540-30570-5_27 http://link.springer.com/chapter/10.1007/978-3-540-30570-5_27

Frequency

Comments
  • HLL++ in sparse mode can be large than in normal mode

    HLL++ in sparse mode can be large than in normal mode

    Today I played a little bit with HLL++ in sparse mode. I have tens of millions HLL estimators and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be picky I don't care that much about memory consumption, I am trying to minimize the serialized size of the estimators .

    This whole idea behind the sparse mode is to not waste memory with the normal representation when we can do better for small cardinality. It sounds reasonable to switch back to the normal representation as soon as the sparse mode consume more memory.

    However I don't observe a such behavior with HyperLogLogPlus:

    HyperLogLog:

            HyperLogLog hll = new HyperLogLog(14);
            System.out.println(hll.getBytes().length);
    
    => 10932
    

    HyperLogLogPlus in normal mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14);
            System.out.println(hllp.getBytes().length);
    
    => 10940 
    

    Empty HyperLogLogPlus in sparse mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
            System.out.println(hllp.getBytes().length);
    
    => 16
    

    5K elements with HyperLogLogPlus in sparse mode

            Random r = new Random();
    
            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
    
            for (int i = 0; i < 5000; i++) {
                hllp.offer(Integer.toString(r.nextInt()));
            }
    
            System.out.println(hllp.getBytes().length);
    
    => 25495
    

    According to the source code the sparseSetThreshold only depends of p and is set to 12288 for p = 14. It means that if the set contains 12000 elements, I'm wasting almost 40KBytes compared to the normal representation.

    Am I wrong ? Is this behavior expected ?

    My second question would be: Do we really want to create the RegisterSet even when we are in Sparse mode ? It ruins the goal to be memory efficient. It currently does not matters for me since my bottleneck is the serialization size but I find this quite surprising.

    opened by cykl 25
  • OffHeap RegisterSet support

    OffHeap RegisterSet support

    This makes it possible to keep the register set off heap - we need it in Cassandra for long lived HLLP instances (we use it for overlap estimation between sstables) https://issues.apache.org/jira/browse/CASSANDRA-11035

    opened by krummas 14
  • remove fastutil, add guava, add test-slf4j-simple

    remove fastutil, add guava, add test-slf4j-simple

    the fastutil addition should never have been merged without addressing the enormous and almost entirely untouched mountain of code that it pulls in. there are several possible solutions, but I assert that the burden is on someone who cares more about QDigest to implement one.

    conversely, guava is tiny by comparison, has a variety of useful classes, and is widely enough used that most downstream users will see no increase in dependency count anyway.

    opened by tea-dragon 13
  • Attempt to decrease qdigest memory footprint

    Attempt to decrease qdigest memory footprint

    Hi, Full discussion here https://groups.google.com/forum/#!topic/stream-lib-user/DuB-n4rU15Y.

    I created benchmark for inserts: https://github.com/whiter4bbit/qdigest-bench. This benchmark does sequential inserts of 50000000 elements (random.nextInt(120000)).

    whiter4bbit@katrin stream-lib-fork % java -version
    java version "1.7.0_21"
    Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
    

    Java options: -server -Xmx1280m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:qdigest.log

    I measured young collections count and young/old gen heap size at the end of each pass. Implementation, which uses Long2LongOpenHashMap don't have young gen collections at all (as well as old gen).

    Results:

    2.5.0-SNAPSHOT:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    
    Insert
      Mean: 45238
      Stddev: 7632.287599402947
      Median: 42502.0
    
    Young gen GC count
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    
    Young gen heap used (at the end)
      Mean: 1885
      Stddev: 0.0
      Median: 1885.0
    
    Old gen heap used (at the end)
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    

    2.4.0:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    Insert
      Mean: 49892
      Stddev: 2827.6702424434147
      Median: 50452.0
    
    Young gen GC count
      Mean: 999
      Stddev: 61.66846844214635
      Median: 1021.0
    
    Young gen heap used (at the end)
      Mean: 39305
      Stddev: 24565.86045714662
      Median: 34994.5
    
    Old gen heap used (at the end)
      Mean: 486
      Stddev: 20.784609690826528
      Median: 488.0
    
    opened by whiter4bbit 11
  • Hyperbitbit

    Hyperbitbit

    This is the first implementation of HyperBitBit. It is totally functional and has the algorithm implemented following the Robert Sedgewick presentation at AofA'16.

    There are some tests for .cardinality, and the algorithm is documented in its own file.

    opened by positiveblue 9
  • CountMinSketch is not serializable

    CountMinSketch is not serializable

    Could we make com.clearspring.analytics.stream.frequency.CountMinSketch implementing java.io.Serializable please?

    I am getting this error: org.apache.commons.lang3.SerializationException: java.io.NotSerializableException: com.clearspring.analytics.stream.frequency.CountMinSketch at org.apache.commons.lang3.SerializationUtils.serialize(SerializationUtils.java:156)

    opened by gokyo 9
  • add merge functionality to BloomFilter

    add merge functionality to BloomFilter

    Essentially uses BitSet's or method. I used the HyperLogLog's merge mechanism as a model. Includes a new MembershipMergeException class and a couple simple tests.

    opened by strongh 8
  • HyperLogLogPlus merge introduces error when counters overlap

    HyperLogLogPlus merge introduces error when counters overlap

    When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets it is possible to introduce a large amount of error in cardinality estimation. This is caused by checking if the size of the two sparse lists sums to a number greater than the sparseThreshold. If the two lists share many elements, then their actual merge should be a list much smaller than the sparse threshold - this can cause a sparse counter to be promoted to a normal counter long before it should. If this happens it both wastes space and, if it happens too early can cause large errors in cardinality estimation ( I have seen up to 30% in tests ) due to the bias estimation curves not having samples for cardinalities that low.

    Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

    The easiest way to reproduce this error is to produce two counters with identical elements at just over 1/2 the sparseThreshold, then merge them.

    opened by deGravity 8
  • how to serialize HLL using kryo

    how to serialize HLL using kryo

    Hi, we are using HLL in Storm project. What we are expecting is to serialize HLL object using Kryo and sent it to the other Bolt component for computing. But it was failed when trying to serialize HLL object by following code for test:

            Kryo kryo = new Kryo();
            kryo.setInstantiatorStrategy(new SerializingInstantiatorStrategy());
            kryo.register(HyperLogLogPlus.class);
            try {
                Output output = new Output(new FileOutputStream("/Users/Felix/Desktop/file.bin"));
                kryo.writeObject(output, card);
                output.close();
    
                Input input = new Input(new FileInputStream("/Users/Felix/Desktop/file.bin"));
                HyperLogLogPlus someObject = kryo.readObject(input, HyperLogLogPlus.class);
                System.out.println(someObject.cardinality());
                input.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
    

    We got following errors:

    Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.clearspring.analytics.stream.cardinality.HyperLogLog
    	at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
    	at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
    	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
    	at hyperloglog.TT.main(TT.java:45)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
    

    Does it mean that HLL can not be serialized by Kryo? Or, has any way to do that correctly?

    Thank you very much.

    opened by mycFelix 7
  • CountMinSketch#size should be long everywhere in the code

    CountMinSketch#size should be long everywhere in the code

    This code shows that the CountMinSketch size() method does not work as expected.

            double confidence = 0.999;
            double epsilon = 0.0001;
            int seed = 1;
    
            CountMinSketch sketch = new CountMinSketch(epsilon, confidence, seed);
            sketch.add(0, Integer.MAX_VALUE);
            long expectedSize = Integer.MAX_VALUE;
            Assert.assertEquals(expectedSize, sketch.size());
    
            CountMinSketch newSketch = CountMinSketch.merge(sketch, sketch);
            // bug: the next line is failing
            Assert.assertEquals(2 * expectedSize, newSketch.size());
    

    There are two parts in the code where size is used wrongly as int: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L68 https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L186

    It should be long everywhere.

    Happy to fix it and add some unit tests creating a pull request. Please let me know

    opened by gokyo 7
  • RegisterSet incorrectly allocates the space needed to track counts

    RegisterSet incorrectly allocates the space needed to track counts

    The int array that's allocated in RegisterSet does not contain enough entries to accomodate correct counts for a given value of p.

    Each count takes 6 bits, meaning that to store 2^p counts you need 6*(2^p)/8 bytes, not 2^p/6.

    The counts are therefore incorrect, which can be detected by feeding for example hashes from 0 to (2^p-1) << (64 -p) in steps of 1 << (64 - p), the result should be 2**64-1 but it is 6194288074

    opened by hbs 7
  • BloomFilter doesn't work when amount of bits > 2^32

    BloomFilter doesn't work when amount of bits > 2^32

    In my live case, we need BloomFilter for a bigger amount (about 4-5Gb ram, >32B bits)

    The code is dependent on java.util.BitSet with .ctor

    public BitSet(int nbits) with a lot of getter/setter by int index.

    opened by ajtkulov 4
  • Change type of TDigest.Group.id from int to long

    Change type of TDigest.Group.id from int to long

    The assumption is that the id is unique. If the id value overflows, it can lead to NullPointerExceptions. Changing the type to long doesn't make overflows impossible, but highly unlikely.

    This fixes issues like #100

    opened by sgrj 1
  • CountMinSketch with Int type in the internal table

    CountMinSketch with Int type in the internal table

    The current implementation has Long type for the internal table. For a lot of real use cases, we need only Int (in general it would be great to have a generic solution, like long/int/short(?)).

    For instance, I have 50M objects. It's impossible to have a value > 50M in the internal table. So, Int value is enough, in this case, Long type is a waste (in 2 times more memory consumption in comparison with Int type).

    opened by ajtkulov 0
  • HyperLogLogPlusPlus sparse precision 32 accuracy problem

    HyperLogLogPlusPlus sparse precision 32 accuracy problem

    I have had very strange results (very high inaccuracy) for low-cardinality HLL++ when using usual values of p (p = 11, 12 ,13, 14) and sp = 32. I suspect (but I am not certain) that treating sp = 31 and sp = 32 exactly the same at the following line causes the problem:

    sm = sp > 30 ? Integer.MAX_VALUE : 1 << sp;

    since for low cardinalities, the cardinality is computed this way: return Math.round(HyperLogLog.linearCounting(sm, sm - sparseSet.length));

    Using sp=31 works as expected, sp=32 does not.

    opened by Enzo90910 0
  • make OSGI compatible

    make OSGI compatible

    We add maven-bundle-plugin to package it as a bundle, as effect to add in MANIFEST.MF osgi metadata. It enables it to be used inside OSGI based applications.

    opened by simondaudin 0
Owner
AddThis
AddThis
Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low latency stream processing and data analysis framework. Milliseconds latency and 10+ times faster than Flink for complicated use cases.

Table-Computing Welcome to the Table-Computing GitHub. Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low la

Alibaba 34 Oct 14, 2022
Eclipse Collections is a collections framework for Java with optimized data structures and a rich, functional and fluent API.

English | 中文 | Deutsch | Español | Ελληνικά | Français | 日本語 | Norsk (bokmål) | Português-Brasil | Русский | हिंदी Eclipse Collections is a comprehens

Eclipse Foundation 2.1k Dec 29, 2022
A Java library for quickly and efficiently parsing and writing UUIDs

fast-uuid fast-uuid is a Java library for quickly and efficiently parsing and writing UUIDs. It yields the most dramatic performance gains when compar

Jon Chambers 142 Jan 1, 2023
Immutable key/value store with efficient space utilization and fast reads. They are ideal for the use-case of tables built by batch processes and shipped to multiple servers.

Minimal Perfect Hash Tables About Minimal Perfect Hash Tables are an immutable key/value store with efficient space utilization and fast reads. They a

Indeed Engineering 92 Nov 22, 2022
gRPC and protocol buffers for Android, Kotlin, and Java.

Wire “A man got to have a code!” - Omar Little See the project website for documentation and APIs. As our teams and programs grow, the variety and vol

Square 3.9k Jan 5, 2023
The Java collections framework provides a set of interfaces and classes to implement various data structures and algorithms.

Homework #14 Table of Contents General Info Technologies Used Project Status Contact General Information Homework contains topics: Sorting an ArrayLis

Mykhailo 1 Feb 12, 2022
High Performance data structures and utility methods for Java

Agrona Agrona provides a library of data structures and utility methods that are a common need when building high-performance applications in Java. Ma

Real Logic 2.5k Jan 5, 2023
Replicate your Key Value Store across your network, with consistency, persistance and performance.

Chronicle Map Version Overview Chronicle Map is a super-fast, in-memory, non-blocking, key-value store, designed for low-latency, and/or multi-process

Chronicle Software : Open Source 2.5k Dec 29, 2022
Fault tolerance and resilience patterns for the JVM

Failsafe Failsafe is a lightweight, zero-dependency library for handling failures in Java 8+, with a concise API for handling everyday use cases and t

Jonathan Halterman 3.9k Jan 2, 2023
fasttuple - Collections that are laid out adjacently in both on- and off-heap memory.

FastTuple Introduction There are lots of good things about working on the JVM, like a world class JIT, operating system threads, and a world class gar

BMC TrueSight Pulse (formerly Boundary) 137 Sep 30, 2022
Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.

Hollow Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-on

Netflix, Inc. 1.1k Dec 25, 2022
A fork of Cliff Click's High Scale Library. Improved with bug fixes and a real build system.

High Scale Lib This is Boundary's fork of Cliff Click's high scale lib. We will be maintaining this fork with bug fixes, improvements and versioned bu

BMC TrueSight Pulse (formerly Boundary) 402 Jan 2, 2023
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

java-string-similarity A library implementing different string similarity and distance measures. A dozen of algorithms (including Levenshtein edit dis

Thibault Debatty 2.5k Dec 29, 2022
Java Collections till the last breadcrumb of memory and performance

Koloboke A family of projects around collections in Java (so far). The Koloboke Collections API A carefully designed extension of the Java Collections

Roman Leventov 967 Nov 14, 2022
LWJGL is a Java library that enables cross-platform access to popular native APIs useful in the development of graphics (OpenGL, Vulkan), audio (OpenAL), parallel computing (OpenCL, CUDA) and XR (OpenVR, LibOVR) applications.

LWJGL - Lightweight Java Game Library 3 LWJGL (https://www.lwjgl.org) is a Java library that enables cross-platform access to popular native APIs usef

Lightweight Java Game Library 4k Dec 29, 2022
A modern I/O library for Android, Kotlin, and Java.

Okio See the project website for documentation and APIs. Okio is a library that complements java.io and java.nio to make it much easier to access, sto

Square 8.2k Dec 31, 2022
Immutable in-memory R-tree and R*-tree implementations in Java with reactive api

rtree In-memory immutable 2D R-tree implementation in java using RxJava Observables for reactive processing of search results. Status: released to Mav

Dave Moten 999 Dec 20, 2022
RTree2D is a 2D immutable R-tree with STR (Sort-Tile-Recursive) packing for ultra-fast nearest and intersection queries

RTree2D RTree2D is a 2D immutable R-tree with STR (Sort-Tile-Recursive) packing for ultra-fast nearest and intersection queries. Goals Main our requir

Andriy Plokhotnyuk 121 Dec 14, 2022