Immutable key/value store with efficient space utilization and fast reads. They are ideal for the use-case of tables built by batch processes and shipped to multiple servers.

Overview

Minimal Perfect Hash Tables

OSS Lifecycle

About

Minimal Perfect Hash Tables are an immutable key/value store with efficient space utilization and fast reads. They are ideal for the use-case of tables built by batch processes and shipped to multiple servers.

Usage

Indeed MPH is available on Maven Central, just add the following dependency:

<dependency>
    <groupId>com.indeed</groupId>
    <artifactId>mph-table</artifactId>
    <version>1.0.4</version>
</dependency>

The primary interfaces are TableReader, to construct a reader to an existing table, TableWriter, to build a table, and TableConfig, to specify the configuration for the writer.

How to write a table:

final TableConfig<Long, Long> config = new TableConfig()
    .withKeySerializer(new SmartLongSerializer())
    .withValueSerializer(new SmartVLongSerializer());
final Set<Pair<Long, Long>> entries = new HashSet<>();
for (long i = 0; i < 20; ++i) {
    entries.add(new Pair(i, i * i));
}
TableWriter.write(new File("squares"), config, entries);

How to read a table:

try (final TableReader<Long, Long> reader = TableReader.open("squares")) {
  final Long value = reader.get(3L);          // get one
  for (final Pair<Long, Long> p : reader) {   // iterate over all
     ...
  }
}

Command Line

In addition to the Java API, TableReader and TableWriter provide convenience command-line interfaces to read and write tables, allowing you to quickly get started without writing any code:

# print all key-values in a table as TSV
$ java com.indeed.mph.TableReader --dump <table>

# print the value for a single key
$ java com.indeed.mph.TableReader --get <key> <table>

# create a table from a TSV file of words with counts
$ java com.indeed.mph.TableWriter --valueSerializer .SmartVLongSerializer <table to create> <counts.tsv>

# create a table from a TSV file mapping movie ids to lists of actor names (compressed by reference)
$ java com.indeed.mph.TableWriter --keySerializer .SmartVLongSerializer --valueSerializer '.SmartListSerializer(.SmartDictionarySerializer)' <table to create> <movies.tsv>

# same as above, not actually storing the movie ids but still allowing retrieval by them
$ java com.indeed.mph.TableWriter --keyStorage IMPLICIT --keySerializer .SmartVLongSerializer --valueSerializer '.SmartListSerializer(.SmartDictionarySerializer)' <table to create> <movies.tsv>

Code of Conduct

This project is governed by the Contributor Covenant v 1.4.1

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

Comments
  • Try other minimal perfect hash functions?

    Try other minimal perfect hash functions?

    Hi there, I found your blog post https://engineering.indeedblog.com/blog/2018/02/indeed-mph/, where you describe a key-value store based on MPHF. And thanks for open-source it!

    I wonder whether, apart from GOV, you are interested in other MPHF as to improve the efficiency of your store. For example, you may want to consider this new method: PTHash, https://github.com/jermp/pthash (although it is written in C++, not Java). The method is fully described in the following two papers:

    • https://arxiv.org/abs/2104.10402
    • https://arxiv.org/abs/2106.02350

    Cheers, -Giulio

    opened by jermp 3
  • Improve coverage by varying the list size.

    Improve coverage by varying the list size.

    This request includes a random selection of list size that favors smaller list to optimize test run time, but also includes a 1 in 5 change of picking a larger list size. The list size effects coverage of the underlying code.

    I am a part of a research group at the University of Illinois Chicago that works towards novel methods to detect bugs in software projects. This project is identified as being a good candidate for application of our research method to improve coverage using property tests.

    opened by jcoultas 2
  • Add section explaining how to add a dependency using popular dependency managers

    Add section explaining how to add a dependency using popular dependency managers

    While the command line usage is great, a more common use for this will likely be to call from a Java process. Adding quick one line snippets for installing using gradle or maven (even links to the maven repository).

    https://mvnrepository.com/artifact/com.indeed/mph-table

    opened by mjpitz 1
  • Updates the repo to the current Indeed repo standards.

    Updates the repo to the current Indeed repo standards.

    This pull request updates the mph-table repo to the current set repo standards for Indeed.

    • Adds a Code of Conduct
    • Adds the OSSMETADATA File
    • Updates the README to add the OSS Lifecycle shield and the Code of Conduct

    Please confirm the OSSMETADATa file accurately reflects the project's current status (Active).

    opened by DuaneOBrien 0
  • Add documentation on how start using the Java API directly

    Add documentation on how start using the Java API directly

    Let's make it easy for someone writing Java code to start using this API directly. While someone can dig through and read the main function of the TableWriter class, having a snippet in the README.md will help adoption / use.

    opened by mjpitz 0
  • Bump junit from 4.12 to 4.13.1

    Bump junit from 4.12 to 4.13.1

    Bumps junit from 4.12 to 4.13.1.

    Release notes

    Sourced from junit's releases.

    JUnit 4.13.1

    Please refer to the release notes for details.

    JUnit 4.13

    Please refer to the release notes for details.

    JUnit 4.13 RC 2

    Please refer to the release notes for details.

    JUnit 4.13 RC 1

    Please refer to the release notes for details.

    JUnit 4.13 Beta 3

    Please refer to the release notes for details.

    JUnit 4.13 Beta 2

    Please refer to the release notes for details.

    JUnit 4.13 Beta 1

    Please refer to the release notes for details.

    Commits
    • 1b683f4 [maven-release-plugin] prepare release r4.13.1
    • ce6ce3a Draft 4.13.1 release notes
    • c29dd82 Change version to 4.13.1-SNAPSHOT
    • 1d17486 Add a link to assertThrows in exception testing
    • 543905d Use separate line for annotation in Javadoc
    • 510e906 Add sub headlines to class Javadoc
    • 610155b Merge pull request from GHSA-269g-pwp5-87pp
    • b6cfd1e Explicitly wrap float parameter for consistency (#1671)
    • a5d205c Fix GitHub link in FAQ (#1672)
    • 3a5c6b4 Deprecated since jdk9 replacing constructor instance of Double and Float (#1660)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
Lightning Memory Database (LMDB) for Java: a low latency, transactional, sorted, embedded, key-value store

LMDB for Java LMDB offers: Transactions (full ACID semantics) Ordered keys (enabling very fast cursor-based iteration) Memory-mapped files (enabling o

null 680 Dec 23, 2022
Simple, fast Key-Value storage. Inspired by HaloDB

Phantom Introduction Phantom is an embedded key-value store, provides extreme high write throughput while maintains low latency data access. Phantom w

null 11 Apr 14, 2022
RTree2D is a 2D immutable R-tree with STR (Sort-Tile-Recursive) packing for ultra-fast nearest and intersection queries

RTree2D RTree2D is a 2D immutable R-tree with STR (Sort-Tile-Recursive) packing for ultra-fast nearest and intersection queries. Goals Main our requir

Andriy Plokhotnyuk 121 Dec 14, 2022
Immutable in-memory R-tree and R*-tree implementations in Java with reactive api

rtree In-memory immutable 2D R-tree implementation in java using RxJava Observables for reactive processing of search results. Status: released to Mav

Dave Moten 999 Dec 20, 2022
Persistent (immutable) collections for Java and Kotlin

What are Dexx Collections? Dexx Collections are a port of Scala's immutable, persistent collection classes to pure Java. Persistent in the context of

Andrew O'Malley 208 Sep 30, 2022
Carbyne Stack tuple store for secure multiparty computation

Carbyne Stack Castor Tuple Store Castor is an open source storage service for cryptographic material used in Secure Multiparty Computation, so called

Carbyne Stack 5 Oct 15, 2022
A 1.16.5 utility mod oriented towards anarchy servers

FrostBurn A 1.16.5 utility mod oriented towards anarchy servers Progress Command System Module System Friends Settings ClickGUI Render Modules Chams F

Evan 43 Dec 24, 2022
A fast object pool for the JVM

Stormpot Stormpot is an object pooling library for Java. Use it to recycle objects that are expensive to create. The library will take care of creatin

Chris Vest 302 Nov 14, 2022
Fast integer compression in C using the StreamVByte codec

streamvbyte StreamVByte is a new integer compression technique that applies SIMD instructions (vectorization) to Google's Group Varint approach. The n

Daniel Lemire 281 Dec 27, 2022
Fast campus 강의 '현실 세상의 TDD' 실습에 사용된 예제 코드를 제공합니다.

현실 세상의 TDD 실습 코드 Fast campus 강의 '현실 세상의 TDD' 실습에 사용된 예제 코드를 제공합니다. 예제 코드는 강의 촬영 전에 미리 준비되었고 강의 촬영 시 라이브 코딩이 진행되었기 때문에 세부 코드는 강의 영상에서 보는 것과 다를 수 있습니다.

Gyuwon Yi 170 Jan 2, 2023
Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low latency stream processing and data analysis framework. Milliseconds latency and 10+ times faster than Flink for complicated use cases.

Table-Computing Welcome to the Table-Computing GitHub. Table-Computing (Simplified as TC) is a distributed light weighted, high performance and low la

Alibaba 34 Oct 14, 2022
Clojure's data structures modified for use outside of Clojure

This library has been extracted from the master branch of Clojure (http://clojure.org) version 1.5.1 (as of October 2013) http://github.com/richhick

Karl Krukow 221 Oct 6, 2022
An advanced, but easy to use, platform for writing functional applications in Java 8.

Getting Cyclops X (10) The latest version is cyclops:10.4.0 Stackoverflow tag cyclops-react Documentation (work in progress for Cyclops X) Integration

AOL 1.3k Dec 29, 2022
Library for creating In-memory circular buffers that use direct ByteBuffers to minimize GC overhead

Overview This project aims at creating a simple efficient building block for "Big Data" libraries, applications and frameworks; thing that can be used

Tatu Saloranta 132 Jul 28, 2022
Eclipse Collections is a collections framework for Java with optimized data structures and a rich, functional and fluent API.

English | 中文 | Deutsch | Español | Ελληνικά | Français | 日本語 | Norsk (bokmål) | Português-Brasil | Русский | हिंदी Eclipse Collections is a comprehens

Eclipse Foundation 2.1k Dec 29, 2022
A Java library for quickly and efficiently parsing and writing UUIDs

fast-uuid fast-uuid is a Java library for quickly and efficiently parsing and writing UUIDs. It yields the most dramatic performance gains when compar

Jon Chambers 142 Jan 1, 2023
gRPC and protocol buffers for Android, Kotlin, and Java.

Wire “A man got to have a code!” - Omar Little See the project website for documentation and APIs. As our teams and programs grow, the variety and vol

Square 3.9k Jan 5, 2023