HyperMinHash-java

Union, intersection, and set cardinality in loglog space

Overview

A Java implementation of the HyperMinHash algorithm presented by Yu and Weber. HyperMinHash approximates set unions, intersections, Jaccard indices, and cardinalities of very large sets with high accuracy, using only loglog space. Like HyperLogLog, it also supports streaming updates and merging sketches.

This repo implements two flavors of HyperMinHash:

  1. HyperMinHash: An implementation based on HyperLogLog, with the bias correction introduced in HyperLogLog++.
  2. BetaMinHash: An implementation that uses LogLog-Beta as the underlying LogLog estimator. LogLog-Beta is nearly identical in accuracy to HyperLogLog++, but performs better on cardinality estimation for small datasets (n <= 80k), holding memory fixed. Since we use LogLog-Beta, we refer to this implementation as BetaMinHash. Note, however, that it currently supports only a fixed precision p = 14.

If you expect to be dealing with low-cardinality datasets (<= 80,000 unique elements), we recommend BetaMinHash: it has a smaller memory footprint and is more accurate than HyperLogLog in the 20,000-80,000 range, holding memory fixed. However, note that different sketch types are not interchangeable, i.e., obtaining the intersection of a HyperMinHash and a BetaMinHash is not currently supported.
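
As a construction-time sketch of the two flavors (the HyperMinHash constructor arguments below are an assumption for illustration, not something this README specifies; check the class Javadoc for the real signature):

// BetaMinHash currently supports only the fixed precision p = 14.
BetaMinHash bmh = new BetaMinHash();

// HyperMinHash is the tunable flavor; the (p, r) constructor shown here is
// assumed for illustration only; verify the actual signature before use.
HyperMinHash hmh = new HyperMinHash(14, 10);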

Both implementations are equipped with serialization/deserialization capabilities out of the box for sending sketches over the wire or persisting them to disk.

Usage

Importing via Maven

<dependency>
  <groupId>com.liveramp</groupId>
  <artifactId>hyperminhash</artifactId>
  <version>0.2</version>
</dependency>

Cardinality estimation

Set<byte[]> mySet = getMySet();
BetaMinHash sketch = new BetaMinHash();
for (byte[] element : mySet) {
    sketch.add(element);
}

long estimatedCardinality = sketch.cardinality();
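
The add method takes raw byte[] values, so other element types need a deterministic byte encoding first. A minimal sketch for strings (the UTF-8 encoding here is our choice, not something the library mandates):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

List<String> ids = Arrays.asList("user-1", "user-2", "user-3");
BetaMinHash stringSketch = new BetaMinHash();
for (String id : ids) {
  // Encode each element the same way every time it is added.
  stringSketch.add(id.getBytes(StandardCharsets.UTF_8));
}
long estimate = stringSketch.cardinality();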

Merging (unioning) sketches

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
BetaMinHash combined = combiner.union(sketches);

// to get cardinality of the union
long unionCardinality = combined.cardinality();

// using HyperMinHash instead of BetaMinHash
Collection<HyperMinHash> sketches = getSketches();
SketchCombiner<HyperMinHash> combiner = HyperMinHashCombiner.getInstance();
HyperMinHash combined = combiner.union(sketches);

Cardinality of unions

BetaMinHash combined = combiner.union(sketches);
long estimatedCardinality = combined.cardinality();

Cardinality of intersection

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();
long intersectionCardinality = combiner.intersectionCardinality(sketches);
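
Jaccard index

The combiner can also be used to estimate the Jaccard index directly. The snippet below assumes SketchCombiner exposes a similarity method returning a double; check the interface in your version before relying on it.

Collection<BetaMinHash> sketches = getSketches();
SketchCombiner<BetaMinHash> combiner = BetaMinHashCombiner.getInstance();

// Assumed API: an estimated Jaccard index in [0, 1] across all input sketches.
double jaccardIndex = combiner.similarity(sketches);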

Serializing a sketch

To get a byte[] representation of a sketch, use the IntersectionSketch.SerDe interface:

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();

byte[] serialized = serde.toBytes(sketch);
HyperMinHash deserialized = serde.fromBytes(serialized);

int sizeInBytes = serde.sizeInBytes(sketch);
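
For example, the resulting byte[] can be written to and read from disk with standard Java I/O (one possible approach; the library does not prescribe a storage mechanism):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

HyperMinHash sketch = getSketch();
HyperMinHashSerde serde = new HyperMinHashSerde();
Path path = Paths.get("sketch.bin");

// Persist the serialized sketch...
Files.write(path, serde.toBytes(sketch));

// ...and restore it later.
HyperMinHash restored = serde.fromBytes(Files.readAllBytes(path));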

Maintainers

Commit authorship was lost when merging code. The maintainers of the library, in alphabetical order, are:

  1. Christian Hansen (github.com/ChristianHansen)
  2. Harry Rackmil (github.com/harryrackmil)
  3. Sherif Nada (github.com/sherifnada)

Acknowledgements

Thanks to Seif Lotfy for implementing a Golang version of HyperMinHash. We use some of his tests in our library, and our BetaMinHash implementation references his implementation.

Comments
  • intersectionCardinality giving the wrong result

    I'm trying to experiment with the code in scala and wrote the following pieces as a test case

    import com.liveramp.hyperminhash._
    import java.util.ArrayList
    val sketch1 = new BetaMinHash()
    val sketch2 = new BetaMinHash()
    val set1 = (1 to 10000).map(_.toByte).toArray
    val set2 = (5000 to 15000).map(_.toByte).toArray
    sketch1.offer(set1)
    sketch2.offer(set2)
    val sketches = new ArrayList[BetaMinHash]()
    sketches.add(sketch1)
    sketches.add(sketch2)
    val combiner = BetaMinHashCombiner.getInstance();
    val combined = combiner.intersectionCardinality(sketches);
    

    combined gives me 0 while the actual result should be something around 5000. Am I doing something wrong here?

    opened by rasoolianbehnam 1
  • IMP optimize serde and union operations

    Serialization of sparse objects creates a lot of zeros, which becomes costly in some applications when serialization is used as an intermediate step while building up unions. The proposed scheme is backwards-compatible with the current implementation, but also makes use of negative values (which shouldn't be in the registers) to encode run-lengths of zero values.

    Also introduced here is an optimization for union operations, which previously applied logic to the first sketch twice. To that end, a fix to deepCopy() was also needed for BetaMinHash, whose deepCopy never actually copied values.

    Motivation:

    We want to use this library in both Spark and Elasticsearch (as a custom plugin): Spark to create binary serialized forms and Elasticsearch to search/aggregate over them in real time. Our Spark UDAF necessarily has to create many sparse objects before merging them, and a better encoding scheme for such sparse objects makes this much more efficient at runtime. There is no penalty when objects are "dense" (in terms of serialized size). The impact on deserialization time is minimal: each value is checked for negativity. Serialization now requires two passes: one to determine the output size and one to do the serialization.
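
    As a rough illustration of the described scheme (not the PR's actual code), runs of zero-valued registers could be collapsed into a single negative run-length marker that a decoder expands back into zeros:

    // Illustrative only: collapse runs of zero registers into negative
    // run-length markers; non-zero registers pass through unchanged.
    static int[] encodeZeroRuns(int[] registers) {
      java.util.List<Integer> out = new java.util.ArrayList<>();
      int i = 0;
      while (i < registers.length) {
        if (registers[i] == 0) {
          int runStart = i;
          while (i < registers.length && registers[i] == 0) {
            i++;
          }
          out.add(-(i - runStart)); // e.g. 100 empty registers encode as -100
        } else {
          out.add(registers[i++]);
        }
      }
      return out.stream().mapToInt(Integer::intValue).toArray();
    }

    A decoder would expand each negative value back into that many zero registers, which matches the minimal deserialization cost described above (a sign check per value).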

    opened by srstrickland 1
  • Bump junit from 4.12 to 4.13.1

    Bumps junit from 4.12 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    JUnit 4.13

    Please refer to the release notes for details.

    JUnit 4.13 RC 2

    Please refer to the release notes for details.

    JUnit 4.13 RC 1

    Please refer to the release notes for details.

    JUnit 4.13 Beta 3

    Please refer to the release notes for details.

    JUnit 4.13 Beta 2

    Please refer to the release notes for details.

    JUnit 4.13 Beta 1

    Please refer to the release notes for details.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    change/standard dependencies 
    opened by dependabot[bot] 0
  • Whitespace cleanup.

    Removed double empty lines and added a missing newline at the end of the file. Not strictly necessary to stay in line with the Google style guide (if that is what we are trying to follow), but I think consistency in whitespace usage is good.

    change/standard 
    opened by ChristianHansen 0
  • Update Integer.SIZE to Long.SIZE in BitHelper

    I believe the getLeftmostBits method of the BitHelper class should throw IllegalArgumentException(String.format("numBits must be < %d", Long.SIZE)) instead of IllegalArgumentException(String.format("numBits must be < %d", Integer.SIZE)) because the conditional checks numBits against Long.SIZE and not Integer.SIZE.
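
    A sketch of the suggested correction (illustrative; the actual surrounding BitHelper code may differ from this paraphrase):

    // Illustrative: the exception message should cite the same bound that the
    // conditional actually checks (Long.SIZE, not Integer.SIZE).
    if (numBits >= Long.SIZE) {
      throw new IllegalArgumentException(
          String.format("numBits must be < %d", Long.SIZE));
    }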

    opened by brucesw 0
  • Dynamic registers

    This PR adds an optimization to use dynamic register sizes. For example, if R = 17, we can use an int to represent a register instead of a long (since we'd need 6 bits for HLL and 17 for MinHash). This can have a meaningful impact when maintaining many sketches with a large number of registers in total. Since we are pre-1.0, this PR breaks the current serialization format.
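
    For illustration only (not the PR's code), register storage could be sized along these lines, based on how many bits each register needs:

    // Illustrative: with ~6 bits for the HLL part and R bits for the MinHash
    // part, R = 17 needs 23 bits and fits in an int array; wider registers
    // fall back to a long array.
    static Object allocateRegisters(int numRegisters, int hllBits, int r) {
      int bitsPerRegister = hllBits + r;
      if (bitsPerRegister <= Integer.SIZE) {
        return new int[numRegisters];  // 4 bytes per register
      }
      return new long[numRegisters];   // 8 bytes per register
    }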

    opened by sherifnada 0
  • Add tunable HyperMinHash impl

    This PR cleans up a lot of the repo and adds HyperMinHash, an implementation that is based on HyperLogLog with the addition of bias correction from HyperLogLog++.

    opened by sherifnada 0
  • Add CD build pipeline

    Context

    CD pipeline should be ready to go now. To release a new version of this lib, we'll probably want to use the Maven Release Plugin, which moves the current version off of -SNAPSHOT: it temporarily changes e.g. 1.0-SNAPSHOT to 1.0, performs a deploy, then bumps to 1.1-SNAPSHOT to resume development. It also has a bunch of other handy features that make life easier when deploying.

    Structure of changes

    .travis.yml: for the build to succeed, two things need to happen:

    1. build and test the module
    2. sign artifacts and deploy them to Sonatype (which will mirror to Maven Central)

    pom.xml: A few small details that the build needs. Perhaps most interesting is the sign profile.

    cd/mvnsettings.xml: Specifies the environment variables that will be injected at build time for use in signing and publishing artifacts.

    *.java: just a bunch of changes I had to make to get the build working. Ignore the java files since they'll be revamped in future PRs this week.

    Misc

    I'd love any comments on readability/accessibility of the non java code.

    Potential Improvements

    • I think it would also be good to add some notes in the README or elsewhere specifying the dev workflow for editing this repo (i.e., how do you make sure everything short of deploying to Maven Central works? How do you release? etc.).

    opened by sherifnada 0
  • [Snyk] Upgrade org.apache.commons:commons-math3 from 3.4.1 to 3.6.1

    Snyk has created this PR to upgrade org.apache.commons:commons-math3 from 3.4.1 to 3.6.1.

    ℹ️ Keep your dependencies up-to-date. This makes it easier to fix existing vulnerabilities and to more quickly identify and fix newly disclosed vulnerabilities when they affect your project.


    • The recommended version is 3 versions ahead of your current version.
    • The recommended version was released 6 years ago, on 2016-03-17.

    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open upgrade PRs.

    For more information:

    🧐 View latest project report

    🛠 Adjust upgrade PR settings

    🔕 Ignore this dependency or unsubscribe from future upgrade PRs

    opened by snyk-bot 0
  • HyperMinHash Jaccard similarity calculation

    The HyperMinHash Jaccard similarity calculation appears to compare only the "mantissa" portion of the register when incrementing c, rather than the whole register. Algorithm 2.1.4 in the paper seems to compare the full tuple when incrementing c. Is there a reason why only the mantissa is compared in your implementation?

    opened by asteele-quantifind 3
  • Question about similarity of multiple sketches

    First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

    My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

    From my reading of arXiv:1710.08436, mergeability is trivial, but it does not seem trivial that Jaccard index estimation works properly for multiple (> 2) sets.

    1. Does the estimation still have the same accuracy as the 2-set Jaccard index?
    2. If so, is there a proof?

    Sorry for the obscure question. Thanks again.

    opened by ocadaruma 5