External-Memory Sorting in Java

Overview

Externalsortinginjava

Build Status docs-badge Java CI

External-Memory Sorting in Java: useful to sort very large files using multiple cores and an external-memory algorithm.

The versions 0.1 of the library are compatible with Java 6 and above. Versions 0.2 and above require at least Java 8.

This code is used in Apache Jackrabbit Oak as well as in Apache Beam and in Spotify scio.

Code sample

import com.google.code.externalsorting.ExternalSort;

//... inputfile: input file name
//... outputfile: output file name
// next command sorts the lines from inputfile to outputfile
ExternalSort.mergeSortedFiles(ExternalSort.sortInBatch(new File(inputfile)), new File(outputfile));
// you can also provide a custom string comparator, see API

Code sample (CSV)

For sorting CSV files, it might be more convenient to use CsvExternalSort.

import com.google.code.externalsorting.CsvExternalSort;
import com.google.code.externalsorting.CsvSortOptions;

// provide a comparator
Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
//... inputfile: input file name
//... outputfile: output file name
//...provide sort options
CsvSortOptions sortOptions = new CsvSortOptions
				.Builder(comparator, CsvExternalSort.DEFAULTMAXTEMPFILES, CsvExternalSort.estimateAvailableMemory())
				.charset(Charset.defaultCharset())
				.distinct(false)
				.numHeader(1)
				.skipHeader(false)
				.format(CSVFormat.DEFAULT)
				.build();
// container to store the header lines
ArrayList<CSVRecord> header = new ArrayList<CSVRecord>();

// next two lines sort the lines from inputfile to outputfile
List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions, header);
// at this point you can access header if you'd like.
CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true, header);

The numHeader parameter is the number of lines of headers in the CSV files (typically 1 or 0) and the skipHeader parameter indicates whether you would like to exclude these lines from the parsing.

API Documentation

http://www.javadoc.io/doc/com.google.code.externalsortinginjava/externalsortinginjava/

Maven dependency

You can download the jar files from the Maven central repository: https://repo1.maven.org/maven2/com/google/code/externalsortinginjava/externalsortinginjava/

You can also specify the dependency in the Maven "pom.xml" file:

    <dependencies>
         <dependency>
	     <groupId>com.google.code.externalsortinginjava</groupId>
	     <artifactId>externalsortinginjava</artifactId>
	     <version>[0.6.0,)</version>
         </dependency>
     </dependencies>

How to build

  • get the java jdk
  • Install Maven 2
  • mvn install - builds jar (requires signing)
  • or mvn package - builds jar (does not require signing)
  • mvn test - runs tests
Comments
  • Fixed test assert 'expected' order

    Fixed test assert 'expected' order

    Minor fix of 'expected' test assert order

    Further, it seems that the CsvSortOptions and up-to-date CsvExternalSort as described in the readme are not in the latest 0.4.0 release version.

    opened by dritter-hd 13
  • mergeSortedFiles with BufferedWriter cannot be used

    mergeSortedFiles with BufferedWriter cannot be used

    I'm trying to use the merge method with a GZIPOutput Buffered writer. The method mergeSortedFiles(BufferedWriter fbw, final Comparator cmp, boolean distinct, List buffers) looks perfect, but the dependency to BinaryFileBuffer makes it impossible to use. As BinaryFileBuffer is a non-public child class, you cannot use it externally and you cannot extend it either. Could you just make the BinaryFileBuffer a standard public class in its own file?

    Thanks

    opened by wavyx 7
  • Including csv External sorting using apache commons

    Including csv External sorting using apache commons

    Including new class to deal with CsvExternal sorting using apache commons csv

    I made small changes into your logic to be able to work with csv files better.

    opened by guilhermefaleixo 6
  • Update commons-csv to 1.9.0

    Update commons-csv to 1.9.0

    I'm using org.apache.commons:commons-csv:1.9.0 Could we update it in externalsortinginjava ? So when I have commons-csv:1.9.0 running in runtime, I know that externalsortinginjava was compiled against 1.9 as well.

    opened by evgenydatanerds 5
  • Changed estimation of available free memory.

    Changed estimation of available free memory.

    Hey, I found that the estimation of available memory uses the current heap space, instead of the maximum heapspace (ie. I got lots of little batch files even though I had gigabytes of heap space). This worked for me.

    See http://stackoverflow.com/questions/12807797/java-get-available-memory

    opened by pbloem 5
  • Cast strings to integers so as to sort as expected

    Cast strings to integers so as to sort as expected

    Hi,

    Thanks for this code. I could sort billions of lines. However, I identified a tiny issue while sorting integers of variable length like below..

    1000000000 is smaller than 100000001 100000010 is smaller than 10000001 100000020 is smaller than 10000002 and so on.,

    This is because of r1.compareTo(r2) method call in defaultcomparator. When they are compared as strings, 1000000000 is indeed smaller than 100000001. But it's incorrect when it is an integer.

    This PR checks whether input contain all numerical digits & sort them as integers.

    Looking forward to hear from you.

    opened by harshavmb 4
  • What is numHeader and skipHeader?

    What is numHeader and skipHeader?

    I was a bit confused with the skipheader and numHeader in the the CSVSortOptions. Couldn't find the description in your API documentation. Can you please elaborate?

    opened by abhinandanmadaan 4
  • Sort CSV based on multiple columns

    Sort CSV based on multiple columns

    1. Is it possible to sort a CSV file based on multiple columns, say firstName, lastName, and dob? For example in case multiple records have the same firstName, those records will be sorted based on the lastName value. If lastName is also same, they will be sorted based on the dob value, And so on.

    2. Can this method also be applied to sort fixed length files based on multiple columns?

    If yes, can you point me to where can I find some samples of those?

    opened by abhinandanmadaan 4
  • CsvExternalSort.mergeSortedFiles ignores CsvSortOptions.numHeader

    CsvExternalSort.mergeSortedFiles ignores CsvSortOptions.numHeader

    CsvExternalSort.mergeSortedFiles ignores CsvSortOptions.numHeader which could trigger issues in the provided Comparator as it will try to compare header line with other lines. The issue only occurs if the batch has size > 1.

    README example:

    import com.google.code.externalsorting.CsvExternalSort;
    import com.google.code.externalsorting.CsvSortOptions;
    
    // provide a comparator
    Comparator<CSVRecord> comparator = (op1, op2) -> op1.get(0).compareTo(op2.get(0));
    //... inputfile: input file name
    //... outputfile: output file name
    //...provide sort options
    CsvSortOptions sortOptions = new CsvSortOptions
    				.Builder(CsvExternalSort.DEFAULTMAXTEMPFILES, comparator, 1, CsvExternalSort.estimateAvailableMemory())
    				.charset(Charset.defaultCharset())
    				.distinct(false)
    				.numHeader(1)
    				.skipHeader(false)
    				.format(CSVFormat.DEFAULT)
    				.build();
    
    // next two lines sort the lines from inputfile to outputfile
    List<File> sortInBatch = CsvExternalSort.sortInBatch(file, null, sortOptions);;
    CsvExternalSort.mergeSortedFiles(sortInBatch, outputfile, sortOptions, true);
    
    Lack test case/undocumented 
    opened by dennishendriksen 4
  • CsvExternalSort.sortInBatch drops a row for each batch

    CsvExternalSort.sortInBatch drops a row for each batch

    In the case where if (currentBlock.get() < blocksize) is false, the CSVRecord e is lost. You'll notice this as a missing row for each block.

    My solution was to remove the conditional logic for processing e, and add the conditional logic only in the part where the block is processed.

    Here is my updated code: ` try (CSVParser parser = new CSVParser(fbr, sortOptions.getFormat())) { parser.spliterator().forEachRemaining(e -> { if (e.getRecordNumber() <= sortOptions.getNumHeader()) { header[0] = e; } else { tmplist.add(e); currentBlock.addAndGet(SizeEstimator.estimatedSizeOf(e)); } if (currentBlock.get() >= blocksize) { try { files.add(sortAndSave(tmplist, tmpdirectory, sortOptions, header[0])); } catch (IOException e1) { logger.warn(String.format("Error during the sort in batch"),e1); //$NON-NLS-1$ } tmplist.clear(); currentBlock.getAndSet(0); }

    		});
    	}
    

    `

    opened by z9security 4
  • ExternalSort.sortInBatch returns empty file, ExternalSort.mergeSortedFiles doesn't delete it

    ExternalSort.sortInBatch returns empty file, ExternalSort.mergeSortedFiles doesn't delete it

    I have a test that call ExternalSort.sortInBatch with a very small maxMemory - 100 bytes. When I call that method on a small file, ExternalSort.sortInBatch creates 5 temporary files, one of which is empty. I then call ExternalSort.mergeSortedFiles on the result of ExternalSort.sortInBatch. The resulting merged file is correct, however ExternalSort.mergeSortedFiles doesn't delete the empty file.

    That seems to be caused by the empty file still being in use (I'm running on Windows). The code only adds non-empty files to the priority queue:

    `

           public static long mergeSortedFiles(BufferedWriter fbw,
                final Comparator<String> cmp, boolean distinct,
                List<IOStringStack> buffers)  ...
                for (IOStringStack bfb : buffers) {
                        if (!bfb.empty()) {
                                pq.add(bfb);
                        }
                }
    

    ` Since the empty file is not on the queue, it's not being closed so it cannot be deleted.

    There is probably more than one way to fix this issue - e.g., if a file is empty, don't create an IOStringStack for it. I'm not sure what's the best way.

    opened by agherardi 4
  • Ability to set the line separator for the final file

    Ability to set the line separator for the final file

    Currently fbw.newLine(); is used exclusively, but I require the ability to use single line feed on windows so the files can interchanged with other systems, without changing the default system line separator.

    opened by johnnuy 1
Owner
Daniel Lemire
Daniel Lemire is a computer science professor. His research is focused on software performance and indexing.
Daniel Lemire
Java Collections till the last breadcrumb of memory and performance

Koloboke A family of projects around collections in Java (so far). The Koloboke Collections API A carefully designed extension of the Java Collections

Roman Leventov 967 Nov 14, 2022
Lightning Memory Database (LMDB) for Java: a low latency, transactional, sorted, embedded, key-value store

LMDB for Java LMDB offers: Transactions (full ACID semantics) Ordered keys (enabling very fast cursor-based iteration) Memory-mapped files (enabling o

null 680 Dec 23, 2022
Immutable in-memory R-tree and R*-tree implementations in Java with reactive api

rtree In-memory immutable 2D R-tree implementation in java using RxJava Observables for reactive processing of search results. Status: released to Mav

Dave Moten 999 Dec 20, 2022
jproblemgenerator creates scenarios in which Java programs leak memory or crash the JVM

jproblemgenerator creates scenarios in which Java programs leak memory or crash the JVM. It is intended to train the use of debugging tools

null 1 Jan 6, 2022
fasttuple - Collections that are laid out adjacently in both on- and off-heap memory.

FastTuple Introduction There are lots of good things about working on the JVM, like a world class JIT, operating system threads, and a world class gar

BMC TrueSight Pulse (formerly Boundary) 137 Sep 30, 2022
Library for creating In-memory circular buffers that use direct ByteBuffers to minimize GC overhead

Overview This project aims at creating a simple efficient building block for "Big Data" libraries, applications and frameworks; thing that can be used

Tatu Saloranta 132 Jul 28, 2022
A Primitive Collection library that reduces memory usage and improves performance

Primitive-Collections This is a Simple Primitive Collections Library i started as a hobby Project. It is based on Java's Collection Library and FastUt

Speiger 26 Dec 25, 2022
High Performance data structures and utility methods for Java

Agrona Agrona provides a library of data structures and utility methods that are a common need when building high-performance applications in Java. Ma

Real Logic 2.5k Jan 5, 2023
Bloofi: A java implementation of multidimensional Bloom filters

Bloofi: A java implementation of multidimensional Bloom filters Bloom filters are probabilistic data structures commonly used for approximate membersh

Daniel Lemire 71 Nov 2, 2022
A high performance caching library for Java

Caffeine is a high performance, near optimal caching library. For more details, see our user's guide and browse the API docs for the latest release. C

Ben Manes 13k Jan 5, 2023
Chronicle Bytes has a similar purpose to Java NIO's ByteBuffer with many extensions

Chronicle-Bytes Chronicle-Bytes Chronicle Bytes contains all the low level memory access wrappers. It is built on Chronicle Core’s direct memory and O

Chronicle Software : Open Source 334 Jan 1, 2023
High performance Java implementation of a Cuckoo filter - Apache Licensed

Cuckoo Filter For Java This library offers a similar interface to Guava's Bloom filters. In most cases it can be used interchangeably and has addition

Mark Gunlogson 161 Dec 30, 2022
An advanced, but easy to use, platform for writing functional applications in Java 8.

Getting Cyclops X (10) The latest version is cyclops:10.4.0 Stackoverflow tag cyclops-react Documentation (work in progress for Cyclops X) Integration

AOL 1.3k Dec 29, 2022
Eclipse Collections is a collections framework for Java with optimized data structures and a rich, functional and fluent API.

English | 中文 | Deutsch | Español | Ελληνικά | Français | 日本語 | Norsk (bokmål) | Português-Brasil | Русский | हिंदी Eclipse Collections is a comprehens

Eclipse Foundation 2.1k Dec 29, 2022
A Java library for quickly and efficiently parsing and writing UUIDs

fast-uuid fast-uuid is a Java library for quickly and efficiently parsing and writing UUIDs. It yields the most dramatic performance gains when compar

Jon Chambers 142 Jan 1, 2023
Geohash utitlies in java

geo Java utility methods for geohashing. Status: production, available on Maven Central Maven site reports are here including javadoc. Add this to you

Dave Moten 386 Jan 1, 2023
High Performance Primitive Collections for Java

HPPC: High Performance Primitive Collections Collections of primitive types (maps, sets, stacks, lists) with open internals and an API twist (no java.

Carrot Search 890 Dec 28, 2022
Java port of a concurrent trie hash map implementation from the Scala collections library

About This is a Java port of a concurrent trie hash map implementation from the Scala collections library. It is almost a line-by-line conversion from

null 147 Oct 31, 2022
Java library for the HyperLogLog algorithm

java-hll A Java implementation of HyperLogLog whose goal is to be storage-compatible with other similar offerings from Aggregate Knowledge. NOTE: This

Aggregate Knowledge (a Neustar service) 296 Dec 30, 2022