Stream summarizer and cardinality estimator.


Description

A Java library for summarizing data in streams for which it is infeasible to store all events. More specifically, there are classes for estimating cardinality (i.e., counting things), set membership, top-k elements, and frequency. One particularly useful feature is that cardinality estimators with compatible configurations may be safely merged.

These classes may be used directly in a JVM project or with the provided shell scripts and good old Unix IO redirection.

The ideas here are not original to us. We have endeavored to create useful implementations from iterating over the existing academic literature. As such this library relies heavily on the work of others. Please read the Sources and Reference sections.
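To give a flavor of how such estimators work, here is a toy sketch of linear counting in plain Java (an illustration only, not the library's API): hash each item into a fixed-size bitmap and estimate the cardinality from the fraction of bits left empty.

```java
import java.util.BitSet;

public class LinearCountingSketch {
    public static void main(String[] args) {
        int m = 1 << 16;                          // bitmap size
        BitSet bits = new BitSet(m);
        for (int i = 0; i < 5000; i++) {
            // hash the item and set one bit; duplicates land on the same bit
            String item = "item-" + (i % 2500);   // 2500 distinct items
            bits.set(Math.floorMod(item.hashCode(), m));
        }
        int empty = m - bits.cardinality();
        // linear counting estimate: n ~ -m * ln(empty / m)
        long estimate = Math.round(-m * Math.log((double) empty / m));
        System.out.println(estimate);             // close to 2500
    }
}
```

Merging two such bitmaps of the same size is just a bitwise OR, which is the intuition behind the library's mergeable estimators.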

Examples

$ echo -e "foo\nfoo\nbar" | ./bin/topk
item count error
---- ----- -----
 foo     2     0
 bar     1     0

Item count: 3


$ echo -e "foo\nfoo\nbar" | ./bin/cardinality
Item Count Cardinality Estimate
---------- --------------------
         3                    2

Maven Artifact

<dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.9.5</version>
</dependency>

Building

Assuming you have Apache Maven installed and configured:

mvn package

And you should be all set.

Where People Hang Out

Mailing list: http://groups.google.com/group/stream-lib-user

Sources

The set membership code is the Bloom filter implementation from Apache Cassandra, circa December 2009. The changes here are minimal and were made for the purpose of testing and independent use. Apache Software Foundation headers have been retained on these files. By extension, we also include MurmurHash.

We were inspired to use this code by Jonathan Ellis' post All you ever wanted to know about writing bloom filters.

References

There are javadoc references to specific papers. These were the ones we found most relevant during our research.

Cardinality

  • Min Cai, Jianping Pan, Yu K. Kwok, and Kai Hwang. Fast and accurate traffic matrix measurement using adaptive cardinality counting. In MineNet ’05: Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data, pages 205–206, New York, NY, USA, 2005. ACM.

  • Ahmed Metwally, Divyakant Agrawal, and Amr E. Abbadi. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In EDBT ’08: Proceedings of the 11th international conference on Extending database technology, pages 618–629, New York, NY, USA, 2008. ACM.

  • Nikos Ntarmos, Peter Triantafillou, and Gerhard Weikum. Counting at large: Efficient cardinality estimation in Internet-Scale data networks. In ICDE ’06: Proceedings of the 22nd International Conference on Data Engineering, pages 40+, Washington, DC, USA, 2006. IEEE Computer Society.

  • Marianne Durand and Philippe Flajolet. LogLog counting of large cardinalities. In ESA03, volume 2832 of LNCS, pages 605–617, 2003.

  • Kyu Y. Whang, Brad T. Vander Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.

  • Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP ’02: Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 693–703, London, UK, 2002. Springer-Verlag.

  • Stefan Heule, Marc Nunkesser, Alex Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm. Proceedings of the EDBT 2013 Conference, ACM, Genoa, Italy

Top-K

  • Graham Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. pages 29–38. 10.1016/j.jalgor.2003.12.001 http://dl.acm.org/citation.cfm?id=1073718

  • Cheqing Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, and Aoying Zhou. Dynamically maintaining frequent items over a data stream. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 287–294, New York, NY, USA, 2003. ACM. 10.1145/956863.956918 http://dl.acm.org/citation.cfm?id=956918

  • Ahmed Metwally, Divyakant Agrawal, and Amr Abbadi. Efficient computation of frequent and top-k elements in data streams. pages 398–412. 2005. 10.1007/978-3-540-30570-5_27 http://link.springer.com/chapter/10.1007/978-3-540-30570-5_27

Frequency

Comments
  • HLL++ in sparse mode can be larger than in normal mode

    Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be picky, I don't care that much about memory consumption; I am trying to minimize the serialized size of the estimators.

    The whole idea behind sparse mode is to not waste memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as the sparse mode consumes more memory.

    However, I don't observe such behavior with HyperLogLogPlus:

    HyperLogLog:

            HyperLogLog hll = new HyperLogLog(14);
            System.out.println(hll.getBytes().length);
    
    => 10932
    

    HyperLogLogPlus in normal mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14);
            System.out.println(hllp.getBytes().length);
    
    => 10940 
    

    Empty HyperLogLogPlus in sparse mode

            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
            System.out.println(hllp.getBytes().length);
    
    => 16
    

    5K elements with HyperLogLogPlus in sparse mode

            Random r = new Random();
    
            HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
    
            for (int i = 0; i < 5000; i++) {
                hllp.offer(Integer.toString(r.nextInt()));
            }
    
            System.out.println(hllp.getBytes().length);
    
    => 25495
    

    According to the source code, sparseSetThreshold depends only on p and is set to 12288 for p = 14. It means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.

    Am I wrong? Is this behavior expected?

    My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently does not matter for me, since my bottleneck is the serialization size, but I find this quite surprising.

    opened by cykl 25
  • OffHeap RegisterSet support

    This makes it possible to keep the register set off-heap. We need it in Cassandra for long-lived HLLP instances (we use it for overlap estimation between sstables): https://issues.apache.org/jira/browse/CASSANDRA-11035

    opened by krummas 14
  • remove fastutil, add guava, add test-slf4j-simple

    the fastutil addition should never have been merged without addressing the enormous and almost entirely untouched mountain of code that it pulls in. there are several possible solutions, but I assert that the burden is on someone who cares more about QDigest to implement one.

    conversely, guava is tiny by comparison, has a variety of useful classes, and is widely enough used that most downstream users will see no increase in dependency count anyway.

    opened by tea-dragon 13
  • Attempt to decrease qdigest memory footprint

    Hi, Full discussion here https://groups.google.com/forum/#!topic/stream-lib-user/DuB-n4rU15Y.

    I created a benchmark for inserts: https://github.com/whiter4bbit/qdigest-bench. The benchmark does sequential inserts of 50000000 elements (random.nextInt(120000)).

    whiter4bbit@katrin stream-lib-fork % java -version
    java version "1.7.0_21"
    Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
    Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
    

    Java options: -server -Xmx1280m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:qdigest.log

    I measured the young-collection count and the young/old gen heap size at the end of each pass. The implementation that uses Long2LongOpenHashMap doesn't have young gen collections at all (nor old gen).

    Results:

    2.5.0-SNAPSHOT:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    
    Insert
      Mean: 45238
      Stddev: 7632.287599402947
      Median: 42502.0
    
    Young gen GC count
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    
    Young gen heap used (at the end)
      Mean: 1885
      Stddev: 0.0
      Median: 1885.0
    
    Old gen heap used (at the end)
      Mean: 0
      Stddev: 0.0
      Median: 0.0
    

    2.4.0:

    whiter4bbit@katrin qdigest-benchmark % bin/bench.sh 10 | ruby bin/summarize.rb
    Insert
      Mean: 49892
      Stddev: 2827.6702424434147
      Median: 50452.0
    
    Young gen GC count
      Mean: 999
      Stddev: 61.66846844214635
      Median: 1021.0
    
    Young gen heap used (at the end)
      Mean: 39305
      Stddev: 24565.86045714662
      Median: 34994.5
    
    Old gen heap used (at the end)
      Mean: 486
      Stddev: 20.784609690826528
      Median: 488.0
    
    opened by whiter4bbit 11
  • Hyperbitbit

    This is the first implementation of HyperBitBit. It is fully functional and implements the algorithm following Robert Sedgewick's presentation at AofA'16.

    There are some tests for .cardinality, and the algorithm is documented in its own file.

    opened by positiveblue 9
  • CountMinSketch is not serializable

    Could we make com.clearspring.analytics.stream.frequency.CountMinSketch implement java.io.Serializable, please?

    I am getting this error: org.apache.commons.lang3.SerializationException: java.io.NotSerializableException: com.clearspring.analytics.stream.frequency.CountMinSketch at org.apache.commons.lang3.SerializationUtils.serialize(SerializationUtils.java:156)

    opened by gokyo 9
  • add merge functionality to BloomFilter

    Essentially uses BitSet's or method. I used HyperLogLog's merge mechanism as a model. Includes a new MembershipMergeException class and a couple of simple tests.

    opened by strongh 8
  • HyperLogLogPlus merge introduces error when counters overlap

    When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets, it is possible to introduce a large amount of error into the cardinality estimate. This is caused by checking whether the sizes of the two sparse lists sum to a number greater than the sparseThreshold. If the two lists share many elements, their actual merge would be a list much smaller than the sparse threshold, so a sparse counter can be promoted to a normal counter long before it should be. When this happens it wastes space and, if it happens too early, can cause large errors in the cardinality estimate (I have seen up to 30% in tests) because the bias-estimation curves have no samples for cardinalities that low.

    Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

    The easiest way to reproduce this error is to produce two counters with identical elements at just over half the sparseThreshold, then merge them.
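The effect of the size check is easy to see with plain sets (a stdlib sketch of the reporter's scenario, not the library's merge code): summing the two sparse-list sizes predicts promotion, while the actual union stays well under the threshold.

```java
import java.util.TreeSet;

public class SparseMergeOverlap {
    public static void main(String[] args) {
        int sparseThreshold = 12288;              // value reported above for p = 14
        TreeSet<Integer> a = new TreeSet<>();
        TreeSet<Integer> b = new TreeSet<>();
        for (int i = 0; i < sparseThreshold / 2 + 100; i++) {
            a.add(i);                             // identical elements in both counters
            b.add(i);
        }
        int naive = a.size() + b.size();          // what the size check sees
        TreeSet<Integer> union = new TreeSet<>(a);
        union.addAll(b);                          // what the merged list actually holds
        System.out.println(naive > sparseThreshold);        // true: triggers promotion
        System.out.println(union.size() > sparseThreshold); // false: promotion was premature
    }
}
```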

    opened by deGravity 8
  • how to serialize HLL using kryo

    Hi, we are using HLL in a Storm project. We expect to serialize the HLL object using Kryo and send it to another Bolt component for computing. But it failed when we tried to serialize the HLL object with the following test code:

            Kryo kryo = new Kryo();
            kryo.setInstantiatorStrategy(new SerializingInstantiatorStrategy());
            kryo.register(HyperLogLogPlus.class);
            try {
                Output output = new Output(new FileOutputStream("/Users/Felix/Desktop/file.bin"));
                kryo.writeObject(output, card);
                output.close();
    
                Input input = new Input(new FileInputStream("/Users/Felix/Desktop/file.bin"));
                HyperLogLogPlus someObject = kryo.readObject(input, HyperLogLogPlus.class);
                System.out.println(someObject.cardinality());
                input.close();
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }
    

    We got following errors:

    Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.clearspring.analytics.stream.cardinality.HyperLogLog
    	at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
    	at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
    	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
    	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
    	at hyperloglog.TT.main(TT.java:45)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:606)
    	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
    

    Does this mean that HLL cannot be serialized by Kryo? Or is there a way to do it correctly?

    Thank you very much.

    opened by mycFelix 7
  • CountMinSketch#size should be long everywhere in the code

    This code shows that the CountMinSketch size() method does not work as expected.

            double confidence = 0.999;
            double epsilon = 0.0001;
            int seed = 1;
    
            CountMinSketch sketch = new CountMinSketch(epsilon, confidence, seed);
            sketch.add(0, Integer.MAX_VALUE);
            long expectedSize = Integer.MAX_VALUE;
            Assert.assertEquals(expectedSize, sketch.size());
    
            CountMinSketch newSketch = CountMinSketch.merge(sketch, sketch);
            // bug: the next line is failing
            Assert.assertEquals(2 * expectedSize, newSketch.size());
    

    There are two places in the code where size is wrongly used as an int: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L68 https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L186

    It should be long everywhere.

    Happy to fix it and add some unit tests in a pull request. Please let me know.
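The failing assertion comes down to plain int overflow (a minimal stdlib illustration, not the CountMinSketch code itself): adding two Integer.MAX_VALUE counts in int arithmetic wraps around, while a long holds the expected total.

```java
public class SizeOverflow {
    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;
        int asInt = count + count;            // wraps around to -2
        long asLong = (long) count + count;   // 4294967294, the expected merged size
        System.out.println(asInt);
        System.out.println(asLong);
    }
}
```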

    opened by gokyo 7
  • RegisterSet incorrectly allocates the space needed to track counts

    The int array that's allocated in RegisterSet does not contain enough entries to accommodate correct counts for a given value of p.

    Each count takes 6 bits, meaning that to store 2^p counts you need 6*(2^p)/8 bytes, not 2^p/6.

    The counts are therefore incorrect, which can be detected by feeding, for example, hashes from 0 to (2^p - 1) << (64 - p) in steps of 1 << (64 - p); the result should be 2^64 - 1, but it is 6194288074.
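The arithmetic can be checked directly (a sketch of the size math; the 2^p/6 figure is taken from the report above, and the variable names are mine):

```java
public class RegisterSetMath {
    public static void main(String[] args) {
        int p = 14;
        int registers = 1 << p;                   // 16384 six-bit counts
        int bitsNeeded = registers * 6;           // 98304 bits
        int correctInts = (bitsNeeded + 31) / 32; // 3072 ints = 12288 bytes
        int buggyInts = registers / 6;            // 2730 ints = 10920 bytes, as reported
        System.out.println(correctInts * 4);      // 12288
        System.out.println(buggyInts * 4);        // 10920
    }
}
```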

    opened by hbs 7
  • BloomFilter doesn't work when amount of bits > 2^32

    In my live use case, we need a BloomFilter of a bigger size (about 4–5 GB of RAM, i.e. more than 32 billion bits).

    The code depends on java.util.BitSet, whose constructor public BitSet(int nbits) and its getters/setters all take int indices.
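A workaround, pending library support, is a bit set backed by a long[] of words and addressed by a long index (a minimal sketch; LongBitSet is a hypothetical class, not part of stream-lib):

```java
public class LongBitSet {
    private final long[] words;

    public LongBitSet(long nbits) {
        // one 64-bit word per 64 bits; an int-length long[] can hold up to ~2^37 bits
        words = new long[(int) ((nbits + 63) / 64)];
    }

    public void set(long index) {
        words[(int) (index >>> 6)] |= 1L << (index & 63);
    }

    public boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << (index & 63))) != 0;
    }

    public static void main(String[] args) {
        LongBitSet bits = new LongBitSet(1L << 26); // small demo size; scales to ~2^37 bits
        bits.set(12_345_678L);                      // long-typed index throughout
        System.out.println(bits.get(12_345_678L));  // true
        System.out.println(bits.get(12_345_677L));  // false
    }
}
```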

    opened by ajtkulov 4
  • Change type of TDigest.Group.id from int to long

    The assumption is that the id is unique. If the id value overflows, it can lead to NullPointerExceptions. Changing the type to long doesn't make overflows impossible, but it makes them highly unlikely.

    This fixes issues like #100

    opened by sgrj 1
  • CountMinSketch with Int type in the internal table

    The current implementation uses a long type for the internal table. For many real use cases we need only int (in general it would be great to have a generic solution, like long/int/short?).

    For instance, I have 50M objects, so it's impossible to have a value > 50M in the internal table. An int value is enough; in this case the long type is a waste (twice the memory consumption compared to int).

    opened by ajtkulov 0
  • HyperLogLogPlusPlus sparse precision 32 accuracy problem

    I have had very strange results (very high inaccuracy) for low-cardinality HLL++ when using usual values of p (p = 11, 12, 13, 14) and sp = 32. I suspect (but am not certain) that treating sp = 31 and sp = 32 exactly the same at the following line causes the problem:

    sm = sp > 30 ? Integer.MAX_VALUE : 1 << sp;

    since for low cardinalities, the cardinality is computed this way: return Math.round(HyperLogLog.linearCounting(sm, sm - sparseSet.length));

    Using sp = 31 works as expected; sp = 32 does not.
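The underlying Java behavior is easy to check (a stdlib sketch): an int shift uses only the low five bits of its shift count, and Integer.MAX_VALUE is 2^31 - 1, so neither branch of that expression can ever produce 2^32.

```java
public class ShiftClamp {
    public static void main(String[] args) {
        System.out.println(1 << 32);            // 1, not 2^32: int shift count is taken mod 32
        System.out.println(1 << 31);            // -2147483648: sign bit set
        System.out.println(Integer.MAX_VALUE);  // 2147483647 = 2^31 - 1
        System.out.println(1L << 32);           // 4294967296: 2^32 only fits in a long
    }
}
```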

    opened by Enzo90910 0
  • make OSGI compatible

    We add the maven-bundle-plugin to package the library as a bundle, which adds OSGi metadata to MANIFEST.MF. This enables the library to be used inside OSGi-based applications.

    opened by simondaudin 0