Bloofi: A java implementation of multidimensional Bloom filters

Overview

Bloofi: A java implementation of multidimensional Bloom filters

Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all them need to be searched for potential matches. As an example, in a federated cloud environment, each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given element is in any of the sets represented by the Bloom filters, but which of the existing sets contain the given element. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. Instead, we effectively have a multidimensional Bloom filter problem: given an element, we wish to receive a list of candidate sets where the element might be. To solve this problem, we consider 3 alternatives. Firstly, we can naively check many Bloom filters. Secondly, we propose to organize the Bloom filters in a hierarchical index structure akin to a B+ tree, that we call Bloofi. Finally, we propose another data structure that packs the Bloom filters in such a way as to exploit bit-level parallelism, which we call Flat-Bloofi. Our theoretical and experimental results show that Bloofi and Flat-Bloofi provide scalable and efficient solutions alternatives to search through a large number of Bloom filters.

Prerequisites

We build on an existing Bloom filter library (https://github.com/magnuss/java-bloomfilter) by Magnus Skjegstad which we embedded and modified. We also use junit and Hamcrest which we include as jar files for your convenience.

Usage

We provide the necessary software to reproduce our experiments. The software includes unit testing. The documentation is minimal and the software is not meant for production use. It is provided mostly for research purposes and as a way to promote the ideas.

Building and running unit tests :

ant

Main class to run the experiments: mvm.provenance.TestAC.

Sample run:

java -Xms7168m -Xmx7168m mvm.provenance.TestAC -bloofi -falsePositiveProb 0.01 -expectedNbElemInBloomFilter 10000 -initialNbElemInBloomFilter 100 -nbBloomFilters 1000 -bloofiOrder 2 -constructionMethod i -nbYesSearches 50000 -nbNoSearches 50000 -splitAllOneNodesIfOverflow false -metric Hamming -nbBFInsertsDeletes 0 -nbUpdates 0 -nonOverlappingRanges true -nbRuns 10

Input parameters, with default values:

-bloofi | -bloofi2 | -naive #this param is required and specifies which type of index tructure to construct - original bloofi (bloofi), flat bloofi (bloofi 2), or just store all the Bloom filters without indexing, in a map (naive)
-falsePositiveProb falsePosProb #Default: 0.01
-expectedNbElemInBloomFilter expectedNbElemInFilter #Default 10000
-initialNbElemInBloomFilter initialNbElemInFilter #Default 100
-nbBloomFilters  nbBFs #Default 1000
-bloofiOrder order #Default 2
-constructionMethod b | i (bulk or incremental) #Default i
-nbYesSearches nbyesSearches #searches for elements known to be in the Bloom filters. Default 1000
-nbNoSearches nbNoSearches #searches for elements not in the Bloom filters. Default 1000
-splitAllOneNodesIfOverflow true | false #if there is an overflow in the Bloofi index, and the value of the node is already all bits to one, should that node still split, or not? Default false
-metric Hamming | Jaccard | Cosine #metric used to compare similarity between two Bloom filters. Default Hamming
-nbBFInsertsDeletes nbBloomFiltersInsertsOrDeletes #Default 0
-nbUpdates nbOfElementsToBeInsertedDuringUpdateInEachFilter #Default 0
-nonOverlappingRanges true | false #if true, each Bloom filter i gets the integers in [(i-1)* initialNbElemInFilter,i*actualNbElemInFilter); if false, each bloom filter gets initialNbElemInFilter random integers from a random rangeDefault true
-nbRuns numberOfRunsForExperiments #Default 10

References

Adina Crainiceanu and Daniel Lemire. Bloofi: Multidimensional Bloom Filters. Information Systems,Volume 54, December 2015, pp.311-324 http://arxiv.org/abs/1501.01941 http://www.sciencedirect.com/science/article/pii/S0306437915000125

Adina Crainiceanu. Bloofi: A Hierarchical Bloom Filter Index with Applications to Distributed Data Provenance. 2nd International Workshop on Cloud Intelligence (Cloud-I 2013) collocated with the 39th International Conference in Very Large Data Bases VLDB. Riva del Garda, Italy, 2013

License

Because we built on Magnus Skjegstad's Bloom filter library, we use the lesser GPL software license.

You might also like...

External-Memory Sorting in Java

Externalsortinginjava External-Memory Sorting in Java: useful to sort very large files using multiple cores and an external-memory algorithm. The vers

Dec 29, 2022

A Java library for quickly and efficiently parsing and writing UUIDs

fast-uuid fast-uuid is a Java library for quickly and efficiently parsing and writing UUIDs. It yields the most dramatic performance gains when compar

Jan 1, 2023

Geohash utitlies in java

geo Java utility methods for geohashing. Status: production, available on Maven Central Maven site reports are here including javadoc. Add this to you

Jan 1, 2023

Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.

Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.

Hollow Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-on

Dec 25, 2022

High Performance Primitive Collections for Java

HPPC: High Performance Primitive Collections Collections of primitive types (maps, sets, stacks, lists) with open internals and an API twist (no java.

Dec 28, 2022

Java library for the HyperLogLog algorithm

java-hll A Java implementation of HyperLogLog whose goal is to be storage-compatible with other similar offerings from Aggregate Knowledge. NOTE: This

Dec 30, 2022

A simple integer compression library in Java

JavaFastPFOR: A simple integer compression library in Java License This code is released under the Apache License Version 2.0 http://www.apache.org/li

Dec 30, 2022

Java Collections till the last breadcrumb of memory and performance

Koloboke A family of projects around collections in Java (so far). The Koloboke Collections API A carefully designed extension of the Java Collections

Nov 14, 2022
Owner
Daniel Lemire
Daniel Lemire is a computer science professor. His research is focused on software performance and indexing.
Daniel Lemire
Java port of a concurrent trie hash map implementation from the Scala collections library

About This is a Java port of a concurrent trie hash map implementation from the Scala collections library. It is almost a line-by-line conversion from

null 147 Oct 31, 2022
A Java implementation of Transducers

transducers-java Transducers are composable algorithmic transformations. They are independent from the context of their input and output sources and s

null 117 Sep 29, 2022
Parquet-MR contains the java implementation of the Parquet format

Parquet MR Parquet-MR contains the java implementation of the Parquet format. Parquet is a columnar storage format for Hadoop; it provides efficient s

The Apache Software Foundation 1.8k Jan 5, 2023
Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

java-string-similarity A library implementing different string similarity and distance measures. A dozen of algorithms (including Levenshtein edit dis

Thibault Debatty 2.5k Dec 29, 2022
Golang implementation of the Alaya protocol

Go PlatON Welcome to the PlatON-Go source code repository! This is an Ethereum-based、high-performance and high-security implementation of the PlatON p

null 23 Oct 14, 2022
High Performance data structures and utility methods for Java

Agrona Agrona provides a library of data structures and utility methods that are a common need when building high-performance applications in Java. Ma

Real Logic 2.5k Jan 5, 2023
A high performance caching library for Java

Caffeine is a high performance, near optimal caching library. For more details, see our user's guide and browse the API docs for the latest release. C

Ben Manes 13k Jan 5, 2023
Chronicle Bytes has a similar purpose to Java NIO's ByteBuffer with many extensions

Chronicle-Bytes Chronicle-Bytes Chronicle Bytes contains all the low level memory access wrappers. It is built on Chronicle Core’s direct memory and O

Chronicle Software : Open Source 334 Jan 1, 2023
An advanced, but easy to use, platform for writing functional applications in Java 8.

Getting Cyclops X (10) The latest version is cyclops:10.4.0 Stackoverflow tag cyclops-react Documentation (work in progress for Cyclops X) Integration

AOL 1.3k Dec 29, 2022
Eclipse Collections is a collections framework for Java with optimized data structures and a rich, functional and fluent API.

English | 中文 | Deutsch | Español | Ελληνικά | Français | 日本語 | Norsk (bokmål) | Português-Brasil | Русский | हिंदी Eclipse Collections is a comprehens

Eclipse Foundation 2.1k Dec 29, 2022