HyperMinHash-java

Union, intersection, and set cardinality in loglog space

Overview

A Java implementation of the HyperMinHash algorithm presented by Yu and Weber. HyperMinHash approximates set unions, intersections, Jaccard indices, and cardinalities of very large sets with high accuracy, using only loglog space. Like HyperLogLog, it also supports streaming updates and merging sketches.

This repo implements two flavors of HyperMinHash:

  1. HyperMinHash: An implementation based on HyperLogLog, with the bias correction introduced in HyperLogLog++.
  2. BetaMinHash: An implementation that uses LogLog-Beta as the underlying LogLog estimator. LogLog-Beta is nearly identical in accuracy to HyperLogLog++, but performs better on cardinality estimation for small datasets (n <= 80k), holding memory fixed. Since we use LogLog-Beta, we refer to this implementation as BetaMinHash. Note, however, that it currently supports only a fixed precision p = 14.

If you expect to be dealing with low-cardinality datasets (<= 80,000 unique elements), we recommend BetaMinHash: it has a smaller memory footprint and is more accurate than HyperLogLog in the 20,000-80,000 range, holding memory fixed. However, note that different sketch types are not interchangeable, i.e., obtaining the intersection of a HyperMinHash and a BetaMinHash is not currently supported.
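
As a construction-time sketch of the two flavors (the HyperMinHash constructor arguments below are an assumption for illustration, not something this README specifies; check the class Javadoc for the real signature):

// BetaMinHash currently supports only the fixed precision p = 14.
BetaMinHash bmh = new BetaMinHash();

// HyperMinHash is the tunable flavor; the (p, r) constructor shown here is
// assumed for illustration only; verify the actual signature before use.
HyperMinHash hmh = new HyperMinHash(14, 10);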

Both implementations are equipped with serialization/deserialization capabilities out of the box for sending sketches over the wire or persisting them to disk.

Usage

Importing via Maven

<dependency>
  <groupId>com.liveramp</groupId>
  <artifactId>hyperminhash</artifactId>
  <version>0.2</version>
</dependency>

Cardinality estimation

Set<byte[]> mySet = getMySet();
BetaMinHash sketch = new BetaMinHash();
for (byte[] element : mySet) {
    sketch.add(element);
}

long estimatedCardinality = sketch.cardinality();
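
The add method takes raw byte[] values, so other element types need a deterministic byte encoding first. A minimal sketch for strings (the UTF-8 encoding here is our choice, not something the library mandates):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

List<String> ids = Arrays.asList("user-1", "user-2", "user-3");
BetaMinHash stringSketch = new BetaMinHash();
for (String id : ids) {
  // Encode each element the same way every time it is added.
  stringSketch.add(id.getBytes(StandardCharsets.UTF_8));
}
long estimate = stringSketch.cardinality();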

Merging (unioning) sketches

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
BetaMinHash combined = combiner.union(sketches);

// to get cardinality of the union
long unionCardinality = combined.cardinality();

// using HyperMinHash instead of BetaMinHash
Collection<HyperMinHash> sketches = getSketches();
SketchCombiner<HyperMinHash> combiner = HyperMinHashCombiner.getInstance();
HyperMinHash combined = combiner.union(sketches);

Cardinality of unions

BetaMinHash combined = combiner.union(sketches);
long estimatedCardinality = combined.cardinality();

Cardinality of intersection

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
long intersectionCardinality = combiner.intersectionCardinality(sketches);
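
Jaccard index

The combiner can also be used to estimate the Jaccard index directly. The snippet below assumes SketchCombiner exposes a similarity method returning a double; check the interface in your version before relying on it.

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();

// Assumed API: an estimated Jaccard index in [0, 1] across all input sketches.
double jaccardIndex = combiner.similarity(sketches);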

Serializing a sketch

To get a byte[] representation of a sketch, use the IntersectionSketch.SerDe interface:

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();

byte[] serialized = serde.toBytes(sketch);
HyperMinHash deserialized = serde.fromBytes(serialized);

int sizeInBytes = serde.sizeInBytes(sketch);
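
For example, the resulting byte[] can be written to and read from disk with standard Java I/O (one possible approach; the library does not prescribe a storage mechanism):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();
Path path = Paths.get("sketch.bin");

// Persist the serialized sketch...
Files.write(path, serde.toBytes(sketch));

// ...and restore it later.
HyperMinHash restored = serde.fromBytes(Files.readAllBytes(path));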

Maintainers

Commit authorship was lost when merging code. The maintainers of the library, in alphabetical order, are:

  1. Christian Hansen (github.com/ChristianHansen)
  2. Harry Rackmil (github.com/harryrackmil)
  3. Sherif Nada (github.com/sherifnada)

Acknowledgements

Thanks to Seif Lotfy for implementing a Golang version of HyperMinHash. We use some of his tests in our library, and our BetaMinHash implementation references his implementation.

Comments
  • intersectionCardinality giving the wrong result

    I'm trying to experiment with the code in scala and wrote the following pieces as a test case

    import com.liveramp.hyperminhash._
    import java.util.ArrayList
    val sketch1 = new BetaMinHash()
    val sketch2 = new BetaMinHash()
    val set1 = (1 to 10000).map(_.toByte).toArray
    val set2 = (5000 to 15000).map(_.toByte).toArray
    sketch1.offer(set1)
    sketch2.offer(set2)
    val sketches = new ArrayList[BetaMinHash]()
    sketches.add(sketch1)
    sketches.add(sketch2)
    val combiner = BetaMinHashCombiner.getInstance();
    val combined = combiner.intersectionCardinality(sketches);
    

    combined gives me 0 while the actual result should be something around 5000. Am I doing something wrong here?

    opened by rasoolianbehnam 1
  • IMP optimize serde and union operations

    Serialization of sparse objects creates a lot of zeros, which becomes costly in some applications when serialization is used as an intermediate step while building up unions. The proposed scheme is backwards-compatible with the current implementation, but also makes use of negative values (which shouldn't be in the registers) to encode run-lengths of zero values.

    Also introduced here is an optimization for union operations, which previously applied logic to the first sketch twice. To that end, a fix to deepCopy() was also needed for BetaMinHash, whose deepCopy never actually copied values.

    Motivation:

    We want to use this library in both Spark and Elasticsearch (as a custom plugin): Spark to create binary serialized forms and Elasticsearch to search/aggregate over them in real time. Our Spark UDAF necessarily has to create many sparse objects before merging them, and a better encoding scheme for such sparse objects makes this much more efficient at runtime. There is no penalty when objects are "dense" (in terms of serialized size). The impact on deserialization time is minimal: each value is checked for negativity. Serialization now requires two passes: one to determine the output size and one to do the serialization.
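
    As a rough illustration of the described scheme (not the PR's actual code), runs of zero-valued registers could be collapsed into a single negative run-length marker that a decoder expands back into zeros:

    // Illustrative only: collapse runs of zero registers into negative
    // run-length markers; non-zero registers pass through unchanged.
    static int[] encodeZeroRuns(int[] registers) {
      java.util.List<Integer> out = new java.util.ArrayList<>();
      int i = 0;
      while (i < registers.length) {
        if (registers[i] == 0) {
          int runStart = i;
          while (i < registers.length && registers[i] == 0) {
            i++;
          }
          out.add(-(i - runStart)); // e.g. 100 empty registers encode as -100
        } else {
          out.add(registers[i++]);
        }
      }
      return out.stream().mapToInt(Integer::intValue).toArray();
    }

    A decoder would expand each negative value back into that many zero registers, which matches the minimal deserialization cost described above (a sign check per value).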

    opened by srstrickland 1
  • Bump junit from 4.12 to 4.13.1

    Bumps junit from 4.12 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    JUnit 4.13

    Please refer to the release notes for details.

    JUnit 4.13 RC 2

    Please refer to the release notes for details.

    JUnit 4.13 RC 1

    Please refer to the release notes for details.

    JUnit 4.13 Beta 3

    Please refer to the release notes for details.

    JUnit 4.13 Beta 2

    Please refer to the release notes for details.

    JUnit 4.13 Beta 1

    Please refer to the release notes for details.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    change/standard dependencies 
    opened by dependabot[bot] 0
  • Whitespace cleanup.

    Removed double empty lines and added a missing newline at the end of the file. Not strictly necessary to stay in line with the Google style guide (if that is what we are trying to follow), but I think consistency in whitespace usage is good.

    change/standard 
    opened by ChristianHansen 0
  • Update Integer.SIZE to Long.SIZE in BitHelper

    I believe the getLeftmostBits method of the BitHelper class should throw IllegalArgumentException(String.format("numBits must be < %d", Long.SIZE)) instead of IllegalArgumentException(String.format("numBits must be < %d", Integer.SIZE)) because the conditional checks numBits against Long.SIZE and not Integer.SIZE.
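
    A sketch of the suggested correction (illustrative; the actual surrounding BitHelper code may differ from this paraphrase):

    // Illustrative: the exception message should cite the same bound that the
    // conditional actually checks (Long.SIZE, not Integer.SIZE).
    if (numBits >= Long.SIZE) {
      throw new IllegalArgumentException(
          String.format("numBits must be < %d", Long.SIZE));
    }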

    opened by brucesw 0
  • Dynamic registers

    This PR adds an optimization to use dynamic register sizes. For example, if R = 17, we can use an int to represent a register instead of a long (since we'd need 6 bits for HLL and 17 for MinHash). This can have a meaningful impact when maintaining many sketches with a large number of registers in total. Since we are pre-1.0, this PR breaks the current serialization format.
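
    For illustration only (not the PR's code), register storage could be sized along these lines, based on how many bits each register needs:

    // Illustrative: with ~6 bits for the HLL part and R bits for the MinHash
    // part, R = 17 needs 23 bits and fits in an int array; wider registers
    // fall back to a long array.
    static Object allocateRegisters(int numRegisters, int hllBits, int r) {
      int bitsPerRegister = hllBits + r;
      if (bitsPerRegister <= Integer.SIZE) {
        return new int[numRegisters];  // 4 bytes per register
      }
      return new long[numRegisters];   // 8 bytes per register
    }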

    opened by sherifnada 0
  • Add tunable HyperMinHash impl

    This PR cleans up a lot of the repo and adds HyperMinHash, an implementation that is based on HyperLogLog with the addition of bias correction from HyperLogLog++.

    opened by sherifnada 0
  • Add CD build pipeline

    Context

    CD pipeline should be ready to go now. To release a new version of this lib, we'll probably want to use the Maven Release Plugin, which moves the current version off of -SNAPSHOT: it temporarily changes e.g. 1.0-SNAPSHOT to 1.0, performs a deploy, then bumps to 1.1-SNAPSHOT to resume development. It also has a bunch of other handy features that make life easier when deploying.

    Structure of changes

    .travis.yml: for the build to succeed, two things need to happen:

    1. build and test the module
    2. sign artifacts and deploy them to Sonatype (which will mirror to Maven Central)

    pom.xml: A few small details that the build needs. Perhaps most interesting is the sign profile.

    cd/mvnsettings.xml: Specifies the environment variables that will be injected at build time for use in signing and publishing artifacts.

    *.java: just a bunch of changes I had to make to get the build working. Ignore the java files since they'll be revamped in future PRs this week.

    Misc

    I'd love any comments on readability/accessibility of the non java code.

    Potential Improvements

    • I think it would also be good to add some notes in the README or elsewhere specifying the dev workflow for editing this repo (i.e., how do you make sure everything short of deploying to Maven Central works? How do you release? etc.).

    opened by sherifnada 0
  • [Snyk] Upgrade org.apache.commons:commons-math3 from 3.4.1 to 3.6.1

    Snyk has created this PR to upgrade org.apache.commons:commons-math3 from 3.4.1 to 3.6.1.

    ℹ️ Keep your dependencies up-to-date. This makes it easier to fix existing vulnerabilities and to more quickly identify and fix newly disclosed vulnerabilities when they affect your project.


    • The recommended version is 3 versions ahead of your current version.
    • The recommended version was released 6 years ago, on 2016-03-17.

    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open upgrade PRs.

    For more information:

    🧐 View latest project report

    🛠 Adjust upgrade PR settings

    🔕 Ignore this dependency or unsubscribe from future upgrade PRs

    opened by snyk-bot 0
  • HyperMinHash Jaccard similarity calculation

    The HyperMinHash Jaccard similarity calculation appears to compare only the "mantissa" portion of the register when incrementing c, rather than the whole register. Algorithm 2.1.4 in the paper seems to compare the full tuple when incrementing c. Is there a reason why only the mantissa is compared in your implementation?

    opened by asteele-quantifind 3
  • Question about similarity of multiple sketches

    First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

    My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

    From my reading of arXiv:1710.08436, mergeability is trivial, but it does not seem trivial that Jaccard index estimation works properly for multiple (> 2) sets.

    1. Does the estimation still have the same accuracy as the 2-set Jaccard index?
    2. If so, is there a proof?

    Sorry for the obscure question. Thanks again.

    opened by ocadaruma 5