High performance Java implementation of a Cuckoo filter - Apache Licensed

Overview

Maven Central Javadocs Build Status Coverage Status License

Cuckoo Filter For Java

This library offers a similar interface to Guava's Bloom filters. In most cases it can be used interchangeably and has additional advantages including thread-safety, concurrent operations, deletions/counting and a configurable hashing algorithm.

This Forked version from MGunlogson/CuckooFilter4J has replaced the MurMur hash functions with xxHash which is much quicker. If you are only using the filter with primitive numbers (long,int,short,byte) you should use the filter from the primitive branch (https://github.com/MPdaedalus/CuckooFilter4J/tree/Primitive-Filter) as it is much quicker due to mightContain() calls creating no Garbage compared to this branch that supports Objects.

About Cuckoo Filters

Cuckoo filter is a Bloom filter replacement for approximated set-membership queries. While Bloom filters are well-known space-efficient data structures to serve queries like "if item x is in a set?", they do not support deletion. Their variances to enable deletion (like counting Bloom filters) usually require much more space.

Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (and therefore named as cuckoo filter). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, thus a cuckoo filter could use less space than conventional Bloom filters, for applications that require low false positive rates (< 3%).

For details about the algorithm and citations please use this article for now

"Cuckoo Filter: Better Than Bloom" by Bin Fan, Dave Andersen and Michael Kaminsky

If you're looking for an implementation in a different language check out these GitHub links.

My personal thanks goes out to the authors of those libaries, their code helped immensely while writing my own implementation.

Installation

Maven artifact:

<dependency>
    <groupId>com.github.mgunlogson</groupId>
    <artifactId>cuckoofilter4j</artifactId>
    <version>1.0.1</version>
</dependency> 

Download At Maven Central

Is performance important to your application?

The regular filter is fairly quick, but MPdaedalus has made significant performance improvement by adding a brand new hash algorithm and reducing GC pressure. These changes speed up most operations by ~2x! His changes can be found here. These improvements will be merged to version 2.0 as soon as the public API is finalized.

Usage

Below is a full example of creating and using the filter. Many more examples can be found in the test folders within the project.

import com.github.mgunlogson.cuckoofilter4j;
import com.google.common.hash.Funnels;

public class Example {

	public static void main(String[] args) {
		// create
		CuckooFilter<Integer> filter = new CuckooFilter.Builder<>(Funnels.integerFunnel(), 2000000).build();
		// insert
		if (filter.put(42)) {
			System.out.println("Insert Success!");
		}
		// contains
		if (filter.mightContain(42)) {
			System.out.println("Found 42!");
		}
		// count
		System.out.println("Filter has " + filter.getCount() + " items");
		
				// count
		System.out.println("42 has been inserted approximately " + filter.approximateCount(42) + " times");

		// % loaded
		System.out.println("Filter is " + String.format("%.0f%%", filter.getLoadFactor() * 100) + " loaded");

		// delete
		if (filter.delete(42)) {
			System.out.println("Delete Success!");
		}
	}

}

Documentation

Visit The JavaDoc Here

###False Positives The false positive rate of the filter is the probability that mightContain() will erroneously return true for an object that was not added to the filter. Unlike Bloom filters, a Cuckoo filter will fail to insert when it reaches capacity. If an insert fails put() will return false

Duplicates

Cuckoo filters allow deletion like counting Bloom filters. While counting Bloom filters invariably use more space to allow deletions, Cuckoo filters achieve this with no space or time cost. Like counting variations of Bloom filters, Cuckoo filters have a limit to the number of times you can insert duplicate items. This limit is 8-9 in the current design, depending on internal state. Reaching this limit can cause further inserts to fail and degrades the performance of the filter. Occasional duplicates will not degrade the performance of the filter but will slightly reduce capacity. Existing items can be deleted without affecting the false positive rate or causing false negatives. However, deleting items that were not previously added to the filter can cause false negatives.

Counting

Cuckoo filters support counting items, like counting Bloom filters. The maximum count is still limited by max-duplicates to 7 so this should only be used to count small numbers. The measured count may be higher than actual count due to false positives, but will never be lower since Cuckoo filters have no false negatives.

Capacity

Once the filter reaches capacity (put() returns false). It's best to either rebuild the existing filter or create a larger one. Deleting items in the current filter is also an option, but you should delete at least ~2% of the items in the filter before inserting again.

Speed

CuckooFilter4J is roughly the same speed as Guava's Bloom filters when running single-threaded. Guava's Bloom is usually faster with small tables, but the trend is reversed with tables too large to fit in the CPU cache. Overall the single-threaded speed of the two libraries is comparable. This library supports concurrent access through multithreading (Guava's Bloom does not). In my tests this scales fairly well, making CuckooFilter4J faster than Bloom filters for multi-threaded applications. On my 4 core machine, running inserts on all cores is roughly 3x faster than single-threaded operation. Cpu architecture will affect this, so your mileage may vary. See the benchmark folder for some tests to run on your own system.

Hashing Algorithms

Hash collision attacks are theoretically possible against Cuckoo filters (as with any hash table based structure). If this is an issue for your application, use one of the cryptographically secure (but slower) hash functions. The default hash function, Murmer3 is not secure. Secure functions include SHA and SipHash. All hashes,including non-secure, are internally seeded and salted. Practical attacks against any of them are unlikely. Also note that the maximum supported size of the filter depends on the hash funciton. Especially in the case of 32 bit Murmur3, the hash will limit table size. Even with a 32 bit hash, the maximum table size is around 270 megabytes. With 64 bit hashes the maximum table size is extremely large, and practically unlimited using 128+bit hash functions. In any case, the library will refuse to create the table using an invalid configuration.

Multi-Threading

All operations are thread-safe. Most also run concurrently for increased performance. Notable exceptions include copy, serialization, and hashcode which nessecarily lock the entire table - running on a single thread until complete. Thread safety should be considered BETA at the moment. Multithreading is notoriously hard to test, and despite my best effort to avoid bugs and deadlocks it is likely that some remain. If you are using multithreading in production I will do my best to provide prompt support and give you my thanks :).

Serializing

Cuckoo filters are serializable.

Comments
  • add xxHash and optimise FilterTable.checkTag

    add xxHash and optimise FilterTable.checkTag

    The xxHash implementation can only go in the google hash folder as it needs to implement AbstractStreamingHashFunction which is package private.

    With FilterTable.checkTag I removed the boolean variable and created a local reference for "bitsPerTag" but did not change loop from increment to deincrement as I don't know what effect it will have on memBlock.get().

    In my rather low tech visualVM benchmark the 1/3rd of mightContain() time up used by the murmur hash has completely disappeared, the hash methods don't even show up as measurable time, maybe you could confirm this with your more advanced JMH setup.

    The garbage created by mightContain is even worse than i feared as not only is there a overhead from Long objects but also every call creates a new hasher and hashcode object as well ,as well as any arrays and primitives required to do the hashing. I recon each call creates between 50 and 100 bytes of garbage which is a lot for a method that is called millions of times per second.

    I will create another branch and strip out the generics and rework the hashing, this will have the added benefit that the xxHash implementation can go in your utils class and the hashing itself will be much simpler as it will only have a single "long" to hash rather than a byte array, it will also create no garbage what so ever and be much faster as well. I will also add methods for smaller primitives which will end up being converted to longs so you only need one class. Should be done by sunday.

    Let me know if you have any issues with the pull or your tests reveal errors.

    opened by MPdaedalus 3
  • curIndex and altIndex might be equal, which results in an error for the countTag

    curIndex and altIndex might be equal, which results in an error for the countTag

    Hey, Thanks for authors' work ! I find curIndex and altIndex might be equal in put method (cuckoofilter.java), which results in an error for the countTag method (FilterTable.java), An item will be counted twice, I don't know what other effects this error has.

    Fix countTag bug: In the second Bucket, you can determine if i1(curIndex ) does not equal i2(altIndex ) if (i1!=i2 && checkTag(i2, posInBucket, tag))

    https://github.com/MGunlogson/CuckooFilter4J/blob/e8472aa150b201f05046d1c81cac5a5ca4348db8/src/main/java/com/github/mgunlogson/cuckoofilter4j/FilterTable.java#L226

    https://github.com/MGunlogson/CuckooFilter4J/blob/e8472aa150b201f05046d1c81cac5a5ca4348db8/src/main/java/com/github/mgunlogson/cuckoofilter4j/CuckooFilter.java#L413

    https://github.com/MGunlogson/CuckooFilter4J/blob/e8472aa150b201f05046d1c81cac5a5ca4348db8/src/main/java/com/github/mgunlogson/cuckoofilter4j/FilterTable.java#L221

    opened by d0scoo1 0
  • Issue in code

    Issue in code

    https://github.com/MGunlogson/CuckooFilter4J/blob/e8472aa150b201f05046d1c81cac5a5ca4348db8/src/main/java/com/github/mgunlogson/cuckoofilter4j/Utils.java#L153

    From the original paper, It should be like this return DoubleMath.roundToInt((DoubleMath.log2(1 / fpProb) + 3) / loadFactor, RoundingMode.UP);

    opened by dushyantk1509 0
  • Remove duplicate rows from file

    Remove duplicate rows from file

    @MGunlogson this library remove duplicate line from a file. I'm currently work with next-generation sequencing (ngs) and need to remove duplicate readings from large volumes of data (big data)

    opened by sergiogaia 0
  • putIfAbsent support

    putIfAbsent support

    the sequence of mightContain + put is racy.

    would it make sense to create a putIfAbsent variant that is atomic and race-free from the callers point of view?

    or are there other options available to achieve something similar without synchronizing on the CuckooFilter instance?

    opened by naude-r 0
  • Misleading point about guava bloomfilter

    Misleading point about guava bloomfilter

    https://github.com/google/guava/blob/6f22af40e1526b8c194e9e36d457bcd37680c6a3/android/guava/src/com/google/common/hash/BloomFilter.java

    As of version 23, guava bloomfilter is thread safe.

    opened by lijinglue 0
Releases(1.0.1)
Owner
Mark Gunlogson
Software engineer. Grew up in the midwest, living the life in Texas.
Mark Gunlogson
A high performance caching library for Java

Caffeine is a high performance, near optimal caching library. For more details, see our user's guide and browse the API docs for the latest release. C

Ben Manes 13k Jan 5, 2023
Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.

Hollow Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-on

Netflix, Inc. 1.1k Dec 25, 2022
High Performance Primitive Collections for Java

HPPC: High Performance Primitive Collections Collections of primitive types (maps, sets, stacks, lists) with open internals and an API twist (no java.

Carrot Search 890 Dec 28, 2022
Simple Binary Encoding (SBE) - High Performance Message Codec

Simple Binary Encoding (SBE) SBE is an OSI layer 6 presentation for encoding and decoding binary application messages for low-latency financial applic

Real Logic 2.8k Dec 28, 2022
Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low latency stream processing and data analysis framework. Milliseconds latency and 10+ times faster than Flink for complicated use cases.

Table-Computing Welcome to the Table-Computing GitHub. Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low la

Alibaba 34 Oct 14, 2022
Java Collections till the last breadcrumb of memory and performance

Koloboke A family of projects around collections in Java (so far). The Koloboke Collections API A carefully designed extension of the Java Collections

Roman Leventov 967 Nov 14, 2022
Bloofi: A java implementation of multidimensional Bloom filters

Bloofi: A java implementation of multidimensional Bloom filters Bloom filters are probabilistic data structures commonly used for approximate membersh

Daniel Lemire 71 Nov 2, 2022
Java port of a concurrent trie hash map implementation from the Scala collections library

About This is a Java port of a concurrent trie hash map implementation from the Scala collections library. It is almost a line-by-line conversion from

null 147 Oct 31, 2022
A Java implementation of Transducers

transducers-java Transducers are composable algorithmic transformations. They are independent from the context of their input and output sources and s

null 117 Sep 29, 2022
Parquet-MR contains the java implementation of the Parquet format

Parquet MR Parquet-MR contains the java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient s

The Apache Software Foundation 1.8k Jan 5, 2023
A fork of Cliff Click's High Scale Library. Improved with bug fixes and a real build system.

High Scale Lib This is Boundary's fork of Cliff Click's high scale lib. We will be maintaining this fork with bug fixes, improvements and versioned bu

BMC TrueSight Pulse (formerly Boundary) 402 Jan 2, 2023
Replicate your Key Value Store across your network, with consistency, persistance and performance.

Chronicle Map Version Overview Chronicle Map is a super-fast, in-memory, non-blocking, key-value store, designed for low-latency, and/or multi-process

Chronicle Software : Open Source 2.5k Dec 29, 2022
A Primitive Collection library that reduces memory usage and improves performance

Primitive-Collections This is a Simple Primitive Collections Library i started as a hobby Project. It is based on Java's Collection Library and FastUt

Speiger 26 Dec 25, 2022
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

java-string-similarity A library implementing different string similarity and distance measures. A dozen of algorithms (including Levenshtein edit dis

Thibault Debatty 2.5k Dec 29, 2022
Golang implementation of the Alaya protocol

Go PlatON Welcome to the PlatON-Go source code repository! This is an Ethereum-based、high-performance and high-security implementation of the PlatON p

null 23 Oct 14, 2022
Kameleon - project scaffolding for Apache Camel

Kameleon - project scaffolding for Apache Camel This is a project generator for Apache Camel. It generates maven-based Java project with preconfigured

The Apache Software Foundation 31 Dec 14, 2022
Chronicle Bytes has a similar purpose to Java NIO's ByteBuffer with many extensions

Chronicle-Bytes Chronicle-Bytes Chronicle Bytes contains all the low level memory access wrappers. It is built on Chronicle Core’s direct memory and O

Chronicle Software : Open Source 334 Jan 1, 2023
An advanced, but easy to use, platform for writing functional applications in Java 8.

Getting Cyclops X (10) The latest version is cyclops:10.4.0 Stackoverflow tag cyclops-react Documentation (work in progress for Cyclops X) Integration

AOL 1.3k Dec 29, 2022
Eclipse Collections is a collections framework for Java with optimized data structures and a rich, functional and fluent API.

English | 中文 | Deutsch | Español | Ελληνικά | Français | 日本語 | Norsk (bokmål) | Português-Brasil | Русский | हिंदी Eclipse Collections is a comprehens

Eclipse Foundation 2.1k Dec 29, 2022