Stream summarizer and cardinality estimator.


Description

A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for estimating cardinality (i.e., counting things), set membership, top-k elements, and frequency. One particularly useful feature is that cardinality estimators with compatible configurations may be safely merged.

These classes may be used directly in a JVM project or with the provided shell scripts and good old Unix IO redirection.

The ideas here are not original to us. We have endeavored to create useful implementations from iterating over the existing academic literature. As such this library relies heavily on the work of others. Please read the Sources and Reference sections.
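To give a flavor of how such estimators work, here is a toy sketch of linear counting in plain Java (an illustration only, not the library's API): hash each item into a fixed-size bitmap and estimate the cardinality from the fraction of bits left empty.

```java
import java.util.BitSet;

public class LinearCountingSketch {
    public static void main(String[] args) {
        int m = 1 << 16;                          // bitmap size
        BitSet bits = new BitSet(m);
        for (int i = 0; i < 5000; i++) {
            // hash the item and set one bit; duplicates land on the same bit
            String item = "item-" + (i % 2500);   // 2500 distinct items
            bits.set(Math.floorMod(item.hashCode(), m));
        }
        int empty = m - bits.cardinality();
        // linear counting estimate: n ~ -m * ln(empty / m)
        long estimate = Math.round(-m * Math.log((double) empty / m));
        System.out.println(estimate);             // close to 2500
    }
}
```

Merging two such bitmaps of the same size is just a bitwise OR, which is the intuition behind the library's mergeable estimators.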

Examples

$ echo -e "foo\nfoo\nbar" | ./bin/topk
item count error
---- ----- -----
 foo     2     0
 bar     1     0

Item count: 3


$ echo -e "foo\nfoo\nbar" | ./bin/cardinality
Item Count Cardinality Estimate
---------- --------------------
         3                    2

Maven Artifact

<dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.9.5</version>
</dependency>

Building

Assuming you have Apache Maven installed and configured:

mvn package

And you should be all set.

Where People Hang Out

Mailing list: http://groups.google.com/group/stream-lib-user

Sources

The set membership code is the Bloom filter implementation from Apache Cassandra, circa December 2009. The changes here are minimal and were made for the purpose of testing and independent use. Apache Software Foundation headers have been retained on these files. By extension, we also include MurmurHash.

We were inspired to use this code by Jonathan Ellis' post All you ever wanted to know about writing bloom filters.

References

There are javadoc references to specific papers. These were the ones we found most relevant during our research.

Cardinality

  • Min Cai, Jianping Pan, Yu K. Kwok, and Kai Hwang. Fast and accurate traffic matrix measurement using adaptive cardinality counting. In MineNet ’05: Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 205–206, New York, NY, USA, 2005. ACM.

  • Ahmed Metwally, Divyakant Agrawal, and Amr E. Abbadi. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In EDBT ’08: Proceedings of the 11th international conference on Extending database technology, pages 618–629, New York, NY, USA, 2008. ACM.

  • Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum. Counting at large: Efficient cardinality estimation in Internet-Scale data networks. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, pages 40+, Washington, DC, USA, 2006. IEEE Computer Society.

  • Marianne Durand and Philippe Flajolet. LogLog counting of large cardinalities. In ESA03, volume 2832 of LNCS, pages 605–617, 2003.

  • Kyu Y. Whang, Brad T. Vander Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.

  • Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP ’02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 693–703, London, UK, 2002. Springer-Verlag.

  • Stefan Heule, Marc Nunkesser, Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. Proceedings of the EDBT 2013 Conference, ACM, Genoa, Italy

Top-K

  • Graham Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. pages 29–38. 10.1016/j.jalgor.2003.12.001 http://dl.acm.org/citation.cfm?id=1073718

  • Cheqing Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, and Aoying Zhou. Dynamically maintaining frequent items over a data stream. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 287–294, New York, NY, USA, 2003. ACM. 10.1145/956863.956918 http://dl.acm.org/citation.cfm?id=956918

  • Ahmed Metwally, Divyakant Agrawal, and Amr Abbadi. Efficient computation of frequent and top-k elements in data streams. pages 398–412. 2005. 10.1007/978-3-540-30570-5_27 http://link.springer.com/chapter/10.1007/978-3-540-30570-5_27

Frequency

Comments
  • HLL++ in sparse mode can be larger than in normal mode

    Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be picky, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.

    The whole idea behind sparse mode is to not waste memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as the sparse mode consumes more memory.

    However, I don't observe such behavior with HyperLogLogPlus:

    HyperLogLog:

            HyperLogLog hll = new HyperLogLog(14);
            System.out.println(hll.getBytes().length);
    
    => 10932
    

    HyperLogLogPlus in normal mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14);
            System.out.println(hllp.getBytes().length);
    
    => 10940 
    

    Empty HyperLogLogPlus in sparse mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
            System.out.println(hllp.getBytes().length);
    
    => 16
    

    5K elements with HyperLogLogPlus in sparse mode

            Random r = new Random();
    
            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
    
            for (int i = 0; i < 5000; i++) {
                hllp.offer(Integer.toString(r.nextInt()));
            }
    
            System.out.println(hllp.getBytes().length);
    
    => 25495
    

    According to the source code, sparseSetThreshold depends only on p and is set to 12288 for p = 14. It means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.

    Am I wrong? Is this behavior expected?

    My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently does not matter for me, since my bottleneck is the serialization size, but I find this quite surprising.

    opened by cykl 25
  • OffHeap RegisterSet support

    This makes it possible to keep the register set off-heap. We need it in Cassandra for long-lived HLLP instances (we use it for overlap estimation between sstables): https://issues.apache.org/jira/browse/CASSANDRA-11035

    opened by krummas 14
  • remove fastutil, add guava, add test-slf4j-simple

    the fastutil addition should never have been merged without addressing the enormous and almost entirely untouched mountain of code that it pulls in. there are several possible solutions, but I assert that the burden is on someone who cares more about QDigest to implement one.

    conversely, guava is tiny by comparison, has a variety of useful classes, and is widely enough used that most downstream users will see no increase in dependency count anyway.

    opened by tea-dragon 13
  • Attempt to decrease qdigest memory footprint

    Hi, Full discussion here https://groups.google.com/forum/#!topic/stream-lib-user/DuB-n4rU15Y.

    I created a benchmark for inserts: https://github.com/whiter4bbit/qdigest-bench. The benchmark does sequential inserts of 50000000 elements (random.nextInt(120000)).

    whiter4bbit@katrin stream-lib-fork % java -version
    java version "1.7.0_21"
    Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
    

    Java options: -server -Xmx1280m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:qdigest.log

    I measured the young-collection count and the young/old gen heap size at the end of each pass. The implementation that uses Long2LongOpenHashMap doesn't have young gen collections at all (nor old gen).

    Results:

    2.5.0-SNAPSHOT:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    
    Insert
      Mean: 45238
      Stddev: 7632.287599402947
      Median: 42502.0
    
    Young gen GC count
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    
    Young gen heap used (at the end)
      Mean: 1885
      Stddev: 0.0
      Median: 1885.0
    
    Old gen heap used (at the end)
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    

    2.4.0:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    Insert
      Mean: 49892
      Stddev: 2827.6702424434147
      Median: 50452.0
    
    Young gen GC count
      Mean: 999
      Stddev: 61.66846844214635
      Median: 1021.0
    
    Young gen heap used (at the end)
      Mean: 39305
      Stddev: 24565.86045714662
      Median: 34994.5
    
    Old gen heap used (at the end)
      Mean: 486
      Stddev: 20.784609690826528
      Median: 488.0
    
    opened by whiter4bbit 11
  • Hyperbitbit

    This is the first implementation of HyperBitBit. It is fully functional and implements the algorithm following Robert Sedgewick's presentation at AofA'16.

    There are some tests for .cardinality, and the algorithm is documented in its own file.

    opened by positiveblue 9
  • CountMinSketch is not serializable

    Could we make com.clearspring.analytics.stream.frequency.CountMinSketch implement java.io.Serializable, please?

    I am getting this error: org.apache.commons.lang3.SerializationException: java.io.NotSerializableException: com.clearspring.analytics.stream.frequency.CountMinSketch at org.apache.commons.lang3.SerializationUtils.serialize(SerializationUtils.java:156)

    opened by gokyo 9
  • add merge functionality to BloomFilter

    Essentially uses BitSet's or method. I used HyperLogLog's merge mechanism as a model. Includes a new MembershipMergeException class and a couple of simple tests.

    opened by strongh 8
  • HyperLogLogPlus merge introduces error when counters overlap

    When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets, it is possible to introduce a large amount of error into the cardinality estimate. This is caused by checking whether the sizes of the two sparse lists sum to a number greater than the sparseThreshold. If the two lists share many elements, their actual merge would be a list much smaller than the sparse threshold, so a sparse counter can be promoted to a normal counter long before it should be. When this happens it wastes space and, if it happens too early, can cause large errors in the cardinality estimate (I have seen up to 30% in tests) because the bias-estimation curves have no samples for cardinalities that low.

    Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

    The easiest way to reproduce this error is to produce two counters with identical elements at just over half the sparseThreshold, then merge them.
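The effect of the size check is easy to see with plain sets (a stdlib sketch of the reporter's scenario, not the library's merge code): summing the two sparse-list sizes predicts promotion, while the actual union stays well under the threshold.

```java
import java.util.TreeSet;

public class SparseMergeOverlap {
    public static void main(String[] args) {
        int sparseThreshold = 12288;              // value reported above for p = 14
        TreeSet<Integer> a = new TreeSet<>();
        TreeSet<Integer> b = new TreeSet<>();
        for (int i = 0; i < sparseThreshold / 2 + 100; i++) {
            a.add(i);                             // identical elements in both counters
            b.add(i);
        }
        int naive = a.size() + b.size();          // what the size check sees
        TreeSet<Integer> union = new TreeSet<>(a);
        union.addAll(b);                          // what the merged list actually holds
        System.out.println(naive > sparseThreshold);        // true: triggers promotion
        System.out.println(union.size() > sparseThreshold); // false: promotion was premature
    }
}
```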

    opened by deGravity 8
  • how to serialize HLL using kryo

    Hi, we are using HLL in a Storm project. We expect to serialize the HLL object using Kryo and send it to another Bolt component for computing. But it failed when we tried to serialize the HLL object with the following test code:

            Kryo kryo = new Kryo();
            kryo.setInstantiatorStrategy(new SerializingInstantiatorStrategy());
            kryo.register(HyperLogLogPlus.class);
            try {
                Output output = new Output(new FileOutputStream("/Users/Felix/Desktop/file.bin"));
                kryo.writeObject(output, card);
                output.close();
    
                Input input = new Input(new FileInputStream("/Users/Felix/Desktop/file.bin"));
                HyperLogLogPlus someObject = kryo.readObject(input, HyperLogLogPlus.class);
                System.out.println(someObject.cardinality());
                input.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
    

    We got following errors:

    Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.clearspring.analytics.stream.cardinality.HyperLogLog
    	at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
    	at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
    	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
    	at hyperloglog.TT.main(TT.java:45)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
    

    Does this mean that HLL cannot be serialized by Kryo? Or is there a way to do it correctly?

    Thank you very much.

    opened by mycFelix 7
  • CountMinSketch#size should be long everywhere in the code

    This code shows that the CountMinSketch size() method does not work as expected.

            double confidence = 0.999;
            double epsilon = 0.0001;
            int seed = 1;
    
            CountMinSketch sketch = new CountMinSketch(epsilon, confidence, seed);
            sketch.add(0, Integer.MAX_VALUE);
            long expectedSize = Integer.MAX_VALUE;
            Assert.assertEquals(expectedSize, sketch.size());
    
            CountMinSketch newSketch = CountMinSketch.merge(sketch, sketch);
            // bug: the next line is failing
            Assert.assertEquals(2 * expectedSize, newSketch.size());
    

    There are two places in the code where size is wrongly used as an int: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L68 https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L186

    It should be long everywhere.

    Happy to fix it and add some unit tests in a pull request. Please let me know.
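The failing assertion comes down to plain int overflow (a minimal stdlib illustration, not the CountMinSketch code itself): adding two Integer.MAX_VALUE counts in int arithmetic wraps around, while a long holds the expected total.

```java
public class SizeOverflow {
    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;
        int asInt = count + count;            // wraps around to -2
        long asLong = (long) count + count;   // 4294967294, the expected merged size
        System.out.println(asInt);
        System.out.println(asLong);
    }
}
```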

    opened by gokyo 7
  • RegisterSet incorrectly allocates the space needed to track counts

    The int array that's allocated in RegisterSet does not contain enough entries to accommodate correct counts for a given value of p.

    Each count takes 6 bits, meaning that to store 2^p counts you need 6*(2^p)/8 bytes, not 2^p/6.

    The counts are therefore incorrect, which can be detected by feeding, for example, hashes from 0 to (2^p - 1) << (64 - p) in steps of 1 << (64 - p); the result should be 2^64 - 1, but it is 6194288074.
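The arithmetic can be checked directly (a sketch of the size math; the 2^p/6 figure is taken from the report above, and the variable names are mine):

```java
public class RegisterSetMath {
    public static void main(String[] args) {
        int p = 14;
        int registers = 1 << p;                   // 16384 six-bit counts
        int bitsNeeded = registers * 6;           // 98304 bits
        int correctInts = (bitsNeeded + 31) / 32; // 3072 ints = 12288 bytes
        int buggyInts = registers / 6;            // 2730 ints = 10920 bytes, as reported
        System.out.println(correctInts * 4);      // 12288
        System.out.println(buggyInts * 4);        // 10920
    }
}
```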

    opened by hbs 7
  • BloomFilter doesn't work when amount of bits > 2^32

    In my live use case, we need a BloomFilter of a bigger size (about 4–5 GB of RAM, i.e. more than 32 billion bits).

    The code depends on java.util.BitSet, whose constructor public BitSet(int nbits) and its getters/setters all take int indices.
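A workaround, pending library support, is a bit set backed by a long[] of words and addressed by a long index (a minimal sketch; LongBitSet is a hypothetical class, not part of stream-lib):

```java
public class LongBitSet {
    private final long[] words;

    public LongBitSet(long nbits) {
        // one 64-bit word per 64 bits; an int-length long[] can hold up to ~2^37 bits
        words = new long[(int) ((nbits + 63) / 64)];
    }

    public void set(long index) {
        words[(int) (index >>> 6)] |= 1L << (index & 63);
    }

    public boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }

    public static void main(String[] args) {
        LongBitSet bits = new LongBitSet(1L << 26); // small demo size; scales to ~2^37 bits
        bits.set(12_345_678L);                      // long-typed index throughout
        System.out.println(bits.get(12_345_678L));  // true
        System.out.println(bits.get(12_345_677L));  // false
    }
}
```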

    opened by ajtkulov 4
  • Change type of TDigest.Group.id from int to long

    The assumption is that the id is unique. If the id value overflows, it can lead to NullPointerExceptions. Changing the type to long doesn't make overflows impossible, but it makes them highly unlikely.

    This fixes issues like #100

    opened by sgrj 1
  • CountMinSketch with Int type in the internal table

    The current implementation uses a long type for the internal table. For many real use cases we need only int (in general it would be great to have a generic solution, like long/int/short?).

    For instance, I have 50M objects, so it's impossible to have a value > 50M in the internal table. An int value is enough; in this case the long type is a waste (twice the memory consumption compared to int).

    opened by ajtkulov 0
  • HyperLogLogPlusPlus sparse precision 32 accuracy problem

    I have had very strange results (very high inaccuracy) for low-cardinality HLL++ when using usual values of p (p = 11, 12, 13, 14) and sp = 32. I suspect (but am not certain) that treating sp = 31 and sp = 32 exactly the same at the following line causes the problem:

    sm = sp > 30 ? Integer.MAX_VALUE : 1 << sp;

    since for low cardinalities, the cardinality is computed this way: return Math.round(HyperLogLog.linearCounting(sm, sm - sparseSet.length));

    Using sp = 31 works as expected; sp = 32 does not.
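The underlying Java behavior is easy to check (a stdlib sketch): an int shift uses only the low five bits of its shift count, and Integer.MAX_VALUE is 2^31 - 1, so neither branch of that expression can ever produce 2^32.

```java
public class ShiftClamp {
    public static void main(String[] args) {
        System.out.println(1 << 32);            // 1, not 2^32: int shift count is taken mod 32
        System.out.println(1 << 31);            // -2147483648: sign bit set
        System.out.println(Integer.MAX_VALUE);  // 2147483647 = 2^31 - 1
        System.out.println(1L << 32);           // 4294967296: 2^32 only fits in a long
    }
}
```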

    opened by Enzo90910 0
  • make OSGI compatible

    We add the maven-bundle-plugin to package the library as a bundle, which adds OSGi metadata to MANIFEST.MF. This enables the library to be used inside OSGi-based applications.

    opened by simondaudin 0