Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Overview

Apache ORC

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query, and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
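
As an illustration of how min/max indexes let a reader skip data, here is a minimal Python sketch. It is not the ORC reader's code: `RowGroupStats`, `may_contain`, and `select_row_ranges` are hypothetical names, and the statistics are reduced to an integer min/max per group of 10,000 rows.

```python
from dataclasses import dataclass
from typing import List, Optional

ROW_INDEX_STRIDE = 10_000  # rows covered by one row-index entry, per the text above


@dataclass
class RowGroupStats:
    """Hypothetical min/max summary of one column over one group of rows."""
    minimum: Optional[int]
    maximum: Optional[int]


def may_contain(stats: RowGroupStats, lo: int, hi: int) -> bool:
    """Return True unless the statistics prove that no value lies in [lo, hi]."""
    if stats.minimum is None or stats.maximum is None:
        return True  # no statistics: the group cannot be ruled out
    return stats.maximum >= lo and stats.minimum <= hi


def select_row_ranges(index: List[RowGroupStats], lo: int, hi: int) -> List[range]:
    """Row ranges that still have to be scanned for lo <= value <= hi."""
    return [
        range(i * ROW_INDEX_STRIDE, (i + 1) * ROW_INDEX_STRIDE)
        for i, stats in enumerate(index)
        if may_contain(stats, lo, hi)
    ]


index = [
    RowGroupStats(0, 99),
    RowGroupStats(100, 199),
    RowGroupStats(None, None),  # statistics missing for this group
    RowGroupStats(300, 399),
]
selected = select_row_ranges(index, 150, 320)
print([(r.start, r.stop) for r in selected])
# prints [(10000, 20000), (20000, 30000), (30000, 40000)]
```

The first group is skipped because its maximum is below the predicate range; the group without statistics is kept, since skipping it could drop matching rows.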

ORC File Library

This project includes both a Java library and a C++ library for reading and writing the Optimized Row Columnar (ORC) file format. The C++ and Java libraries are completely independent of each other and will each read all versions of ORC files. However, the C++ library currently writes only the original (Hive 0.11) version of ORC files; it will be extended in the future.

Bug tracking: Apache Jira

The subdirectories are:

  • c++ - the C++ reader and writer
  • cmake_modules - the CMake modules
  • docker - Docker scripts to build and test on various Linux distributions
  • examples - various ORC example files that are used to test compatibility
  • java - the Java reader and writer
  • proto - the protocol buffer definition for the ORC metadata
  • site - the website and documentation
  • snap - the script to build snaps of the ORC tools
  • tools - the C++ tools for reading and inspecting ORC files

Building

  • Install Java 1.8 or higher
  • Install Maven 3 or higher
  • Install CMake

To build a release version with debug information:

% mkdir build
% cd build
% cmake ..
% make package
% make test-out

To build a debug version:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=DEBUG
% make package
% make test-out

To build a release version without debug information:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=RELEASE
% make package
% make test-out

To build only the Java library:

% cd java
% mvn package

To build only the C++ library:

% mkdir build
% cd build
% cmake .. -DBUILD_JAVA=OFF
% make package
% make test-out

Comments
  • ORC-322: [C++] Fix writing & reading timestamp

    Currently, the C++ and Java versions behave differently when reading and writing values of the timestamp type. This patch ensures that the C++ reader/printer behave the same as their counterparts on the Java side, and that the C++ reader and writer obtain the same timestamp values in the TimestampVectorBatch.

    opened by wgtmac 22
  • Orc 17

    In this pull request I added the Libhdfs++ library for reading files from HDFS to the ORC project.

    Libhdfs++ is located in orc/c++/lib/libhdfspp and by default builds as a lightweight library without examples, tests, and tools (which avoids dependencies on JDK, valgrind, and gmock). However, if the flag -DHDFSPP_LIBRARY_ONLY=FALSE is passed to cmake, it will build the examples, tests, and tools as well.

    Libhdfs++ depends on the protobuf libraries in orc/c++/libs/protobuf-2.6.0 and dynamically searches the system for the packages Doxygen, OpenSSL, CyrusSASL, GSasl, and Threads (only OpenSSL and Threads are required).

    The libhdfspp folder also includes a script, pull_hdfs.sh, which pulls the latest changes from the Libhdfs++ Hadoop branch into ORC and generates the file 'imported_timestamp' with the timestamp and information about the latest commit.

    I also updated all the ORC tools to automatically use Libhdfs++ to read ORC files on HDFS if their path begins with 'hdfs://'.

    Please review.

    opened by AnatoliShein 21
  • [JAVA] mvn package fails if test compiling was skipped

    I would like to run mvn -Dmaven.test.skip=true clean package, but maven-dependency-plugin reports "Unused declared dependencies" for some libraries that are used only by the test code, which breaks the build. Please check the attached logs.

    I'm not a Java expert, but I'm guessing the cause of the problem is the misuse of analyze-only in the package phase. According to the documentation, the analyze-only goal is meant to be used during the test-compile phase. In our case, the test classes were not compiled, so the dependency analyzer treated some libraries as unused. I don't know the right way to fix it, though.

    The setting of maven-dependency-plugin: https://github.com/apache/orc/blob/8cf1047f9ace3799df12f24d2a5096b17a9a6ed0/java/pom.xml#L373-L388

    Logs:

    $ cd orc/java
    
    $ mvn -Dmaven.test.skip=true clean package
    [INFO] Scanning for projects...
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Build Order:
    [INFO]
    [INFO] Apache ORC                                                         [pom]
    [INFO] ORC Shims                                                          [jar]
    [INFO] ORC Core                                                           [jar]
    [INFO] ORC MapReduce                                                      [jar]
    [INFO] ORC Tools                                                          [jar]
    [INFO] ORC Examples                                                       [jar]
    [INFO]
    [INFO] -------------------------< org.apache.orc:orc >-------------------------
    [INFO] Building Apache ORC 1.9.0-SNAPSHOT                                 [1/6]
    [INFO] --------------------------------[ pom ]---------------------------------
    [INFO]
    [INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc ---
    [INFO] Deleting /Users/x/Documents/playground/orc/java/target
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc ---
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc ---
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc ---
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc ---
    [INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
    [INFO] Copying 3 resources from 1 bundle.
    [INFO]
    [INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc ---
    [INFO] Executing tasks
    [INFO]     [mkdir] Created dir: /Users/x/Documents/playground/orc/java/target/testing-tmp
    [INFO] Executed tasks
    [INFO]
    [INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc ---
    [INFO] No site descriptor found: nothing to attach.
    [INFO]
    [INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc ---
    [INFO]
    [INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc ---
    [INFO]
    [INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc ---
    [INFO]
    [INFO] ----------------------< org.apache.orc:orc-shims >----------------------
    [INFO] Building ORC Shims 1.9.0-SNAPSHOT                                  [2/6]
    [INFO] --------------------------------[ jar ]---------------------------------
    [INFO]
    [INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc-shims ---
    [INFO] Deleting /Users/x/Documents/playground/orc/java/shims/target
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc-shims ---
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc-shims ---
    [INFO]
    [INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc-shims ---
    [INFO]
    [INFO] --- build-helper-maven-plugin:3.3.0:add-source (add-source) @ orc-shims ---
    [INFO] Source directory: /Users/x/Documents/playground/orc/java/shims/target/generated-sources added.
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc-shims ---
    [INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
    [INFO] Copying 3 resources from 1 bundle.
    [INFO]
    [INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @ orc-shims ---
    [INFO] Using 'UTF-8' encoding to copy filtered resources.
    [INFO] Using 'UTF-8' encoding to copy filtered properties files.
    [INFO] skip non existing resourceDirectory /Users/x/Documents/playground/orc/java/shims/src/main/resources
    [INFO] Copying 3 resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.10.1:compile (default-compile) @ orc-shims ---
    [INFO] Compiling 13 source files to /Users/x/Documents/playground/orc/java/shims/target/classes
    [INFO]
    [INFO] --- maven-resources-plugin:3.2.0:testResources (default-testResources) @ orc-shims ---
    [INFO] Not copying test resources
    [INFO]
    [INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc-shims ---
    [INFO] Executing tasks
    [INFO]     [mkdir] Created dir: /Users/x/Documents/playground/orc/java/shims/target/testing-tmp
    [INFO] Executed tasks
    [INFO]
    [INFO] --- maven-compiler-plugin:3.10.1:testCompile (default-testCompile) @ orc-shims ---
    [INFO] Not compiling test sources
    [INFO]
    [INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ orc-shims ---
    [INFO] Tests are skipped.
    [INFO]
    [INFO] --- maven-jar-plugin:3.3.0:jar (default-jar) @ orc-shims ---
    [INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
    [INFO]
    [INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc-shims ---
    [INFO] Skipping because packaging 'jar' is not pom.
    [INFO]
    [INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc-shims ---
    [INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
    [INFO]
    [INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc-shims ---
    [INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
    [INFO]
    [INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc-shims ---
    [INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
    [INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
    [INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
    [INFO]
    [INFO] --- maven-dependency-plugin:3.1.2:analyze-only (default) @ orc-shims ---
    [WARNING] Unused declared dependencies found:
    [WARNING]    org.junit.jupiter:junit-jupiter-api:jar:5.9.0:test
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary for Apache ORC 1.9.0-SNAPSHOT:
    [INFO]
    [INFO] Apache ORC ......................................... SUCCESS [  4.108 s]
    [INFO] ORC Shims .......................................... FAILURE [  4.817 s]
    [INFO] ORC Core ........................................... SKIPPED
    [INFO] ORC MapReduce ...................................... SKIPPED
    [INFO] ORC Tools .......................................... SKIPPED
    [INFO] ORC Examples ....................................... SKIPPED
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time:  9.189 s
    [INFO] Finished at: 2022-10-13T15:07:06+08:00
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:3.1.2:analyze-only (default) on project orc-shims: Dependency problems found -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
    [ERROR]
    [ERROR] After correcting the problems, you can resume the build with the command
    [ERROR]   mvn <args> -rf :orc-shims
    
    opened by zjx20 20
  • ORC-703 : Fix RLE encoding bug on large negative integer.

    What changes were proposed in this pull request?

    ORC uses run-length encoding (RLE) to encode and decode integers. The algorithm comprises four sub-encodings:

    • Short Repeat: used for short repeating integer sequences.
    • Direct: used for integer sequences whose values have a relatively constant bit width.
    • Patched Base: used for integer sequences whose bit widths vary a lot.
    • Delta: used for monotonically increasing or decreasing sequences.

    Why are the changes needed?

    This bug occurs in the Patched Base type for large negative numbers. In Patched Base, 3 bits store the width of the base value, which is encoded using 1 to 8 bytes. If the base value is actually 8 bytes long, the value of the base width field should be 7. Currently, this value can go up to 8, which can result in inconsistent data during encoding. In extreme cases, the encoding/decoding process can even crash with a core dump by referencing an illegal address.

    How was this patch tested?

    Passes the newly added unit test.

    Stale 
    opened by chaoyli 19
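
The width arithmetic described in the ORC-703 summary above can be sketched in Python. This is only an illustration of why a 3-bit field must hold "bytes minus one"; `encode_base_width` and `decode_base_width` are hypothetical helpers, and the rest of the Patched Base header is not modeled.

```python
BASE_WIDTH_BITS = 3
BASE_WIDTH_MASK = (1 << BASE_WIDTH_BITS) - 1  # 0b111, the largest storable value is 7


def encode_base_width(num_bytes: int) -> int:
    """Map a base-value length of 1..8 bytes onto the 3-bit field values 0..7."""
    if not 1 <= num_bytes <= 8:
        raise ValueError(f"base value must occupy 1 to 8 bytes, got {num_bytes}")
    return num_bytes - 1


def decode_base_width(field: int) -> int:
    """Recover the byte count from the 3-bit field."""
    return (field & BASE_WIDTH_MASK) + 1


# Correct round trip for an 8-byte base value: the field holds 7.
assert encode_base_width(8) == 7
assert decode_base_width(encode_base_width(8)) == 8

# If a writer put 8 into the field instead, only the low 3 bits would survive.
truncated_field = 8 & BASE_WIDTH_MASK
print(truncated_field, decode_base_width(truncated_field))  # prints 0 1
```

In this sketch a field value of 8 silently becomes 0, so a reader would expect a 1-byte base value where 8 bytes were written, which is one way the data can become inconsistent.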
  • ORC-1075: Support reading ORC files with no column statistics

    What changes were proposed in this pull request?

    This PR aims to fix predicate evaluation. When the row index statistics only implement ColumnStatistics and do not provide further information such as min or max, the predicate cannot be evaluated correctly.

    Why are the changes needed?

    To make all files that are compatible with the ORC specification readable by the official library.

    How was this patch tested?

    Added unit tests: testWithoutStatistics covers files that do not provide statistics, and testMissMinOrMaxInStatistics covers the case where min/max statistics are not provided.

    JAVA 
    opened by guiyanakuang 18
  • ORC-961: [C++] expose related metrics of the reader

    What changes were proposed in this pull request?

    This patch keeps track of the time spent and the number of calls to each module. The current metrics mainly include the decompression time and number of calls, decoding time and number of calls and total elapsed time.

    Why are the changes needed?

    It exposes the relevant metrics of the reader so that the user can see the metrics of each module, including decompression and decoding.

    How was this patch tested?

    The orc-scan tool with the -m parameter can output relevant metrics.

    CPP 
    opened by coderex2522 17
  • ORC-1172: Add row count limit config in one stripe

    What changes were proposed in this pull request?

    Add a row count limit config, "orc.stripe.row.count", to limit the row count in one stripe.

    Why are the changes needed?

    For query engines like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split. In the current implementation of the ORC writer, the only config that can control the row count in a stripe is "orc.stripe.size". But across different kinds of tables, the row count is difficult to control with it.

    For a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows. For a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows. For Presto, a normal OLAP query only reads a subset of the table columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may become too low.

    So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe. A similar config has been introduced to cudf (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261

    How was this patch tested?

    testStripeRowCountLimit was added. It can be run with the command below:

    cd java
    ./mvnw -Dtest=TestWriterImpl test
    

    closed #1117

    JAVA 
    opened by dengweisysu 17
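
To make the effect of the two limits in the ORC-1172 summary above concrete, here is a simplified Python simulation. It is a sketch under stated assumptions, not the writer's logic: `rows_per_stripe` is a hypothetical function, every row is assumed to have the same size, and the limits are checked after every row.

```python
from typing import List

MIB = 1024 * 1024


def rows_per_stripe(row_bytes: int, total_rows: int,
                    stripe_size: int, stripe_row_count: int) -> List[int]:
    """Return the row count of each stripe when a stripe is closed as soon as
    either the byte limit or the row limit is reached."""
    stripes = []
    rows = size = 0
    for _ in range(total_rows):
        rows += 1
        size += row_bytes
        if size >= stripe_size or rows >= stripe_row_count:
            stripes.append(rows)
            rows = size = 0
    if rows:
        stripes.append(rows)  # final, partially filled stripe
    return stripes


# Narrow table: about 671 bytes per row, so 64 MiB holds roughly 100000 rows.
size_only = rows_per_stripe(671, 200_000, 64 * MIB, stripe_row_count=2**31)
with_row_limit = rows_per_stripe(671, 200_000, 64 * MIB, stripe_row_count=20_000)
print(size_only)            # prints [100014, 99986]
print(len(with_row_limit))  # prints 10
```

With only the size limit, the narrow table ends up with about 100000 rows per stripe; adding a row limit of 20000 yields ten smaller stripes that can be processed by separate splits.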
  • ORC-713: Add Java 15 to GitHub action

    What changes were proposed in this pull request?

    This PR aims to add Java 15 to GitHub action as a preparation for Java 17.

    Why are the changes needed?

    Java 17 is the next LTS.

    How was this patch tested?

    Pass the GitHub action.

    opened by williamhyun 17
  • ORC-1310: Allowlist Support for plugin filter

    What changes were proposed in this pull request?

    This PR aims to add allowlist support for plugin filters.

    Why are the changes needed?

    ServiceLoader will load all the implementations on the current classpath. When we have two implementations of PluginFilterService, both of them will be loaded as plugin filters. This PR provides a configuration, "orc.filter.plugin.allowlist". When the configuration is not null, the ORC reader will only load the classes that are in the allowlist.

    How was this patch tested?

    UT

    JAVA 
    opened by deshanxiao 16
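
The allowlist behavior described in the ORC-1310 summary above can be sketched as a simple filter. This Python sketch is illustrative only: `select_plugin_filters` and the class names are hypothetical, and treating the configuration value as a comma-separated list of class names is an assumption, not something the summary states.

```python
from typing import Iterable, List, Optional


def select_plugin_filters(discovered: Iterable[str],
                          allowlist: Optional[str]) -> List[str]:
    """Keep only allowlisted class names; keep everything when no allowlist is set."""
    names = list(discovered)
    if allowlist is None:
        return names  # configuration not set: load every discovered plugin
    # Assumed format: comma-separated class names, surrounding spaces ignored.
    allowed = {entry.strip() for entry in allowlist.split(",") if entry.strip()}
    return [name for name in names if name in allowed]


discovered = ["com.example.FilterA", "com.example.FilterB"]  # hypothetical classes
print(select_plugin_filters(discovered, None))
# prints ['com.example.FilterA', 'com.example.FilterB']
print(select_plugin_filters(discovered, "com.example.FilterB"))
# prints ['com.example.FilterB']
```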
  • ORC-757: HashTable dictionary

    What changes were proposed in this pull request?

    • Added a straightforward implementation of Dictionary using a hash table.
    • Refactored the RB-Tree code to make parts of it reusable, like VisitorContextImpl, moving it into DictionaryUtils.java to make it sharable between different implementations of the Dictionary interface.
    • Enabled the hash-based dictionary for existing tests that enable dictionary encoding.
    • Added ORCWriterBenchmark for benchmarking writer performance with different options.

    Why are the changes needed?

    We found the RB-Tree based dictionary implementation to be slow in our production workload. The performance comparison for the new hash-table based implementation will be done as part of ORC-50.

    How was this patch tested?

    Mostly tested with added unit tests for hash-table and enabled hash-table based dictionary in some of the existing tests.

    Also added the benchmark result comparing to RB-Tree one:

    ORCWriterBenchMark.dictBench                     RBTREE  avgt    5   28216.937 ±  368.582  us/op
    ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE  avgt    5      49.832                 #
    ORCWriterBenchMark.dictBench:iOs                 RBTREE  avgt    5         ≈ 0                 #
    ORCWriterBenchMark.dictBench:perRecord           RBTREE  avgt    5       0.861 ±    0.011  us/op
    ORCWriterBenchMark.dictBench:records             RBTREE  avgt    5  163840.000                 #
    ORCWriterBenchMark.dictBench                       HASH  avgt    5    5751.196 ± 1049.305  us/op
    ORCWriterBenchMark.dictBench:bytesPerRecord        HASH  avgt    5      50.146                 #
    ORCWriterBenchMark.dictBench:iOs                   HASH  avgt    5         ≈ 0                 #
    ORCWriterBenchMark.dictBench:perRecord             HASH  avgt    5       0.176 ±    0.032  us/op
    ORCWriterBenchMark.dictBench:records               HASH  avgt    5  163840.000                 #
    ORCWriterBenchMark.dictBench                       NONE  avgt    5    3988.156 ±  621.354  us/op
    ORCWriterBenchMark.dictBench:bytesPerRecord        NONE  avgt    5      50.146                 #
    ORCWriterBenchMark.dictBench:iOs                   NONE  avgt    5         ≈ 0                 #
    ORCWriterBenchMark.dictBench:perRecord             NONE  avgt    5       0.122 ±    0.019  us/op
    ORCWriterBenchMark.dictBench:records               NONE  avgt    5  163840.000                 #
    opened by autumnust 16
  • ORC-1008: Fix overflow detection code for C++ int64_t / java long

    What changes were proposed in this pull request?

    https://github.com/apache/orc/blob/6da96bb8ceb64528d082974efed411c4c29f3408/c%2B%2B/src/Statistics.cc#L180-L190

    A counter-example to the existing check is easy to give. Assume sum=1 and call update(std::numeric_limits<int64_t>::max(), 3): value * repetitions + _stats.getSum() overflows, but the result is still a positive number: 9223372036854775806.

    This PR aims to fix the overflow detection code for C++ int64_t / Java long.

    ORC-338 worked around a C++ compiler bug in Xcode 9.3 by removing an inline function.

    Now that the implementation is fixed, the current update function can be inline.

    Why are the changes needed?

    Fix bug.

    How was this patch tested?

    Pass the CIs.

    JAVA CPP 
    opened by guiyanakuang 15
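
The counter-example in the ORC-1008 summary above can be reproduced with a small Python simulation of 64-bit wraparound. This is a sketch, not the library's code: `wrap_int64`, `sign_check_update`, and `exact_update` are hypothetical names, and the sign-comparison check is an assumption about the general shape of a check that this counter-example defeats.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def wrap_int64(x: int) -> int:
    """Reduce a Python int to the value two's-complement int64 arithmetic yields."""
    return (x + 2**63) % 2**64 - 2**63


def sign_check_update(total: int, value: int, repetitions: int):
    """Return (new_total, overflow_detected) using only a sign comparison."""
    was_positive = total >= 0
    new_total = wrap_int64(value * repetitions + total)
    # Flag overflow only when operands share a sign and the result's sign flips.
    detected = (value >= 0) == was_positive and (new_total >= 0) != was_positive
    return new_total, detected


def exact_update(total: int, value: int, repetitions: int):
    """Return (new_total, overflow_detected) by checking the true result's range."""
    true_total = value * repetitions + total  # Python ints do not overflow
    overflow = not INT64_MIN <= true_total <= INT64_MAX
    return wrap_int64(true_total), overflow


print(sign_check_update(1, INT64_MAX, 3))  # prints (9223372036854775806, False)
print(exact_update(1, INT64_MAX, 3))       # prints (9223372036854775806, True)
```

The wrapped sum stays positive, so a check that only looks at the sign misses the overflow, while a range check on the exact result catches it.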
  • Bump checkstyle from 10.5.0 to 10.6.0 in /java

    Bumps checkstyle from 10.5.0 to 10.6.0.

    Release notes

    Sourced from checkstyle's releases.

    checkstyle-10.6.0

    Checkstyle 10.6.0 - https://checkstyle.org/releasenotes.html#Release_10.6.0

    Breaking backward compatibility:

    #12520 - Simplify JavadocStyleCheck: remove functionality for missing package-info Javadoc

    Bug fixes:

    • #12409 - Inconsistent allowedAbbreviations when a method contains an underscore
    • #12486 - NoWhitespaceAfter false positive on synchronized method
    • #11807 - Null pointer exception with records in RequireThisCheck

    Commits
    • 233c91b [maven-release-plugin] prepare release checkstyle-10.6.0
    • c982461 config: maven has problems to push, moving push to action level
    • 2826b1b config: git push commands need write permission in actions
    • 311a1b7 config: skip pgp sign plugin during release:prepare as we do not sign commits
    • 04347b1 doc: release notes for 10.6.0
    • d12ffc7 Issue #12409: Inconsistentency In Allowed Abbreviations
    • a5be3cf minor: Bump version to 10.6.0-SNAPSHOT
    • ebb46cb Issue #12520: removes missing package-info Javadoc check in JavadocStyle
    • 475063f supplemental: Forbid usage of @BeforeAll in tests
    • 069905a config: upgrade sevntu to 1.44.1
    • Additional commits viewable in compare view

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies BUILD JAVA 
    opened by dependabot[bot] 0
  • Bump mockito.version from 4.10.0 to 4.11.0 in /java

    Bumps mockito.version from 4.10.0 to 4.11.0. Updates mockito-core from 4.10.0 to 4.11.0

    Release notes

    Sourced from mockito-core's releases.

    v4.11.0

    Changelog generated by Shipkit Changelog Gradle Plugin

    4.11.0

    Commits

    Updates mockito-junit-jupiter from 4.10.0 to 4.11.0

    Release notes

    Sourced from mockito-junit-jupiter's releases.

    v4.11.0

    Changelog generated by Shipkit Changelog Gradle Plugin

    4.11.0

    Commits

    dependencies BUILD JAVA 
    opened by dependabot[bot] 0
  • Huge memory taken for each field when exporting

    Hello. Using the Arrow adapter, I noticed that the memory (RAM) footprint of an export (writing an ORC file) is very large for each field. For instance, exporting a table with 10000 fields can take up to 30 GB, even if there are only 10 records. Even for 100 fields, it can take 100 MB or more. The "issue" seems to come from here: https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/ColumnWriter.cc#L59

    When we create a writer with "createWriter" (https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/Writer.cc#L681-L684 ), a stream (compressor) is created for each field. As we allocate a buffer of 1 * 1024 * 1024 bytes, we get a minimum of 1 MB of additional memory for each field.

    Is there a reason the BufferedOutputStream initial capacity is that high? I worked around my problem by lowering it to 1 KB (it didn't change performance much in my testing, but that may depend on the use case). Would it be possible to add a global (or static) variable to make this hard-coded parameter configurable? Thanks.

    opened by LouisClt 7
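
The lower bound implied by the report above is simple arithmetic, sketched here in Python. `min_buffer_bytes` is a hypothetical helper; it counts one buffer per field unless `streams_per_field` (an assumption, since the report does not say how many streams each field creates) is raised.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def min_buffer_bytes(num_fields: int, initial_capacity: int,
                     streams_per_field: int = 1) -> int:
    """Memory reserved up front if every stream allocates `initial_capacity` bytes."""
    return num_fields * streams_per_field * initial_capacity


# 1 MiB per stream, as in the hard-coded 1 * 1024 * 1024 allocation.
print(min_buffer_bytes(100, MIB) / MIB)     # prints 100.0
print(min_buffer_bytes(10_000, MIB) / GIB)  # prints 9.765625

# 1 KiB per stream, the workaround mentioned in the report.
print(min_buffer_bytes(10_000, KIB) / MIB)  # prints 9.765625
```

With one buffer per field, 100 fields already reserve 100 MiB and 10000 fields nearly 10 GiB before any rows are written; more than one stream per field would scale these numbers further.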
  • ORC-1200: Extracting encryption setup logic from `WriterImpl`

    What changes were proposed in this pull request?

    Extract the encryption setup logic into a utility class.

    Why are the changes needed?

    Because Flink's ORC writer is stream-based, we must create a stream-based PhysicalFsWriter before WriterImpl is initialized. You can see this in Flink's ORC writer implementation, starting from OrcBulkWriterFactory. Given the above, I want to pass encryption settings to PhysicalFsWriter before WriterImpl is initialized, but the encryption setup in WriterImpl is so tightly coupled that the encryption variant cannot be obtained externally.

    How was this patch tested?

    It doesn't introduce new features and passed all test cases.

    JAVA 
    opened by liujiawinds 7
  • RecordReaderImpl.getValueRange() may cause incorrect results

    ORC version: 1.6.11. SQL: select xxx from xxx where str is not null

    Recently I found that some ORC files written by Trino don't have complete statistics in the file metadata (maybe a Presto bug). As a result, OrcProto.ColumnStatistics can't be deserialized to any specific ColumnStatisticsImpl such as StringStatisticsImpl, so RecordReaderImpl.getValueRange() returns a ValueRange with a null lower bound and RecordReaderImpl.pickRowGroups() skips this row group, which should not be skipped. In all other cases, everything is OK. I also found that orc-1.5.x can handle this case via RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley

    enhancement 
    opened by PengleiShi 7

Releases (v1.8.1)
  • v1.8.1 (Dec 2, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/13

    Changelog

    • https://github.com/apache/orc/commits/v1.8.1

    Bug

    • ORC-1283 ENABLE_INDEXES does not take effect
    • ORC-1288 Invalid memory freeing with ZLIB compression
    • ORC-1291 NullPointerException in TypeDescription

    Improvement

    • ORC-1268 Set CMP0135 policy for CMake 3.24+
    • ORC-1282 Add slf4j impl to avoid warning message
    • ORC-1294 Build error when skip tests build
    • ORC-1295 Improve ORC Spec example (Decoding RLE v2 direct)
    • ORC-1299 benchmark can't work for data resource 403
    • ORC-1305 Add more orc java examples
    • ORC-1308 Avoid star import

    Test

    • ORC-1290 Bump spotbugs to 4.7.3
    • ORC-1300 Update Spark to 3.3.1 and its dependencies

    Tasks

    • ORC-1269 Remove FindBugs
    • ORC-1270 Move opencsv dependency to the tools module.
    • ORC-1292 Add paragraph in java documentation
  • v1.7.7 (Nov 19, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/12

    Changelog

    • https://github.com/apache/orc/commits/v1.7.7

    Bug

    • ORC-1283 ENABLE_INDEXES does not take effect

    Test

    • ORC-1254 Add spotbugs check
    • ORC-1299 Fix fetch data error in bench module

    Task

    • ORC-1256 Publish tests jar to maven central
    • ORC-1268 Set CMP0135 policy for CMake 3.24+
  • v1.8.0 (Sep 3, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/2

    Changelog

    • https://github.com/apache/orc/commits/v1.8.0

    New Feature and Notable Changes

    • ORC-450 Support selecting list indices without materializing list items
    • ORC-824 Add column statistics for List and Map
    • ORC-1004 Java ORC writer supports the selection vector
    • ORC-1075 Support reading ORC files with no column statistics
    • ORC-1125 Support decoding decimals in RLE
    • ORC-1136 Optimize reads by combining multiple reads without significant separation into a single read
    • ORC-1138 Seek vs Read Optimization
    • ORC-1172 Add row count limit config for one stripe
    • ORC-1212 Upgrade protobuf-java to 3.17.3
    • ORC-1220 Set min.hadoop.version to 2.7.3
    • ORC-1248 Redefine Hadoop dependency for Apache ORC 1.8.0
    • ORC-1256 Publish test-jar to maven central
    • ORC-1260 Publish shaded-protobuf classifier artifacts

    Improvement

    • ORC-825 Use Empty Array For Collections toArray
    • ORC-826 Do Not Use Collection Contains/Get
    • ORC-828 Improve Fetch Data Set Process
    • ORC-829 Optimize Serialization percentileBits
    • ORC-831 Do Not Copy String When Flushing Dictionary
    • ORC-833 RunLengthIntegerReaderV2 Calculate Batch Size Once
    • ORC-834 Do Not Convert to String in DecimalFromTimestampTreeReader
    • ORC-835 Cache TRUE/FALSE Bytes in StringGroupFromBooleanTreeReader
    • ORC-836 StringGroupFromDoubleTreeReader Use Double toString
    • ORC-837 Reuse HiveDecimalWritable in ConvertTreeReaderFactory
    • ORC-838 Simplify compareTo/equals/putBuffer of ByteBufferAllocatorPool
    • ORC-840 Remove Superfluous Array Fill in RecordReaderImpl
    • ORC-841 Remove Superfluous Array Fill in StringHashTableDictionary
    • ORC-842 Remove newKey from StringHashTableDictionary
    • ORC-844 Improve hashCode Methods
    • ORC-847 Do Not Create Empty Array in StringGroupFromBinaryTreeReader
    • ORC-852 Allow DynamicByteArray to Return a ByteBuffer
    • ORC-853 Optimize writeDouble Implementation
    • ORC-855 Remove Unused isRepeating from RunLengthIntegerReaderV2
    • ORC-865 Bump opencsv from 3.9 to 5.5.1
    • ORC-883 Dependency Audit and QA
    • ORC-897 Optimize loop termination condition in readerIsCompatible method
    • ORC-935 Bump commons-csv from 1.8 to 1.9.0
    • ORC-937 Replace deprecated method
    • ORC-958 Convert command support overwrite option
    • ORC-969 Evaluate SearchArguments using file and stripe level stats
    • ORC-975 Avoid double counting closestFixedBits in percentileBits method
    • ORC-982 Extract checkstyle to a single file, help newcomers check code style
    • ORC-988 Bump opencsv from 5.5.1 to 5.5.2
    • ORC-992 Reached max repeat length, we can directly decide to use DELTA encoding
    • ORC-1005 Make that the java and C++ implementations of determineEncoding in RunLengthIntegerWriterV2 are consistent.
    • ORC-1007 Fix a warning from the shade plugin
    • ORC-1013 Renaming a parameter in constructors of TreeWriter's derived classes
    • ORC-1014 Add details when we get IOExceptions from file system
    • ORC-1020 Improve orc::RleDecoderV2::nextDirect
    • ORC-1027 Filter processing to allow filter injections that cannot be represented via SArgs
    • ORC-1047 Handle quoted field names during string schema parsing
    • ORC-1077 Remove commons-codec dependency and use java.util.Base64
    • ORC-1099 Extend ReadIntent to support MAP and UNION type
    • ORC-1101 Improve malformed STRUCT handling
    • ORC-1122 Add buffer to decode the whole run in RleDecoderV2
    • ORC-1137 Improve float/double conversion in DoubleColumnReader::next()
    • ORC-1149 Bump slf4j.version to 1.7.36
    • ORC-1150 Improve RowReaderImpl::computeBatchSize()
    • ORC-1152 Support encoding short decimals in RLEv2
    • ORC-1156 Update opencsv to 5.6
    • ORC-1163 Bump zookeeper from 3.7.0 to 3.8.0
    • ORC-1169 Use Hadoop 3.3.2 on Java 17+
    • ORC-1178 Use hadoop 3.3.3 on Java 17+

    Bug

    • ORC-845 Fix NPE in DynamicIntArray toString
    • ORC-929 Fix NaN at orc-tools 'meta' command
    • ORC-1129 The build of tool-test should depend on cpp tools
    • ORC-1159 Crash when the last stripe is skipped
    • ORC-1242 Bump threeten-extra to 1.7.1

    Test

    • ORC-860 Add dependabot
    • ORC-864 Bump jackson.version from 2.12.2 to 2.12.4
    • ORC-877 Bump junit-vintage-engine from 5.7.0 to 5.7.2
    • ORC-888 Bump objenesis from 3.1 to 3.2
    • ORC-905 Add an integration test for example
    • ORC-917 Bump mockito-core from 3.7.0 to 3.11.2
    • ORC-919 Spark bench objenesis should be the same as Spark.
    • ORC-920 Use junit.version and mockito.version property and bump junit to 5.7.2
    • ORC-925 Simplify assertions
    • ORC-928 Bump checkstyle from 8.44 to 8.45.1
    • ORC-932 Bump byte-buddy from 1.10.19 to 1.11.12
    • ORC-934 Add integration tests for Java bench
    • ORC-940 Use Hadoop 3.3.1 in bench module
    • ORC-955 Add Javadoc generation GitHub Action job
    • ORC-963 Build benchmark module always for integration testing
    • ORC-966 Bump byte-buddy from 1.11.12 to 1.11.13
    • ORC-967 Bump mockito.version from 3.11.2 to 3.12.1
    • ORC-986 Bump mockito.version from 3.12.1 to 3.12.4
    • ORC-987 Bump jackson.version from 2.12.4 to 2.12.5
    • ORC-1001 Bump maven-enforcer-plugin to 3.0.0
    • ORC-1019 Remove redundant jackson dependencies
    • ORC-1022 Bump byte-buddy from 1.11.13 to 1.11.19
    • ORC-1038 Bump mockito.version from 3.12.4 to 4.0.0
    • ORC-1074 Bump byte-buddy from 1.11.19 to 1.12.6
    • ORC-1079 Add Linux clang GitHub Action job
    • ORC-1085 Bump auto-service from 1.0 to 1.0.1
    • ORC-1089 Add test cases verifying writers with selected vector
    • ORC-1104 Use Spark 3.2.1 in benchmark
    • ORC-1107 Fix NPE at benchmark data schema loading
    • ORC-1110 Bump mockito.version from 4.0.0 to 4.3.1
    • ORC-1126 Bump byte-buddy from 1.12.6 to 1.12.8
    • ORC-1139 Benchmark for Seek vs Read
    • ORC-1141 Bump mockito.version from 4.3.1 to 4.4.0
    • ORC-1145 Add Java 18 to GitHub Action CI.
    • ORC-1153 Bump byte-buddy from 1.12.8 to 1.12.9
    • ORC-1157 Update guava to 31.1-jre
    • ORC-1168 Update byte-buddy to 1.12.10
    • ORC-1177 Upgrade mockito.version to 4.5.1
    • ORC-1179 Upgrade checkstyle to 10.2 on Java 11+
    • ORC-1187 Use main instead of master in merge_orc_pr.py
    • ORC-1194 Bump mockito.version to 4.6.0
    • ORC-1195 Bump checkstyle to 10.3
    • ORC-1196 Add spark benchmark integration tests to GHA
    • ORC-1197 Bump mockito.version from 4.6.0 to 4.6.1
    • ORC-1201 Remove Debian 9 from Docker Tests
    • ORC-1203 Bump maven-enforcer-plugin to 3.1.0
    • ORC-1206 Bump netty-all to 4.1.78.Final
    • ORC-1207 Upgrade Spark to 3.3.0
    • ORC-1208 Bump byte-buddy to 1.12.12
    • ORC-1209 Bump checkstyle to 10.3.1
    • ORC-1234 Upgrade objenesis to 3.2 in Spark benchmark
    • ORC-1236 Bump checkstyle to 10.3.2
    • ORC-1243 Bump byte-buddy to 1.12.13
    • ORC-1253 Add Fedora 37 docker test
    • ORC-1254 Add spotbugs check

    Task

    • ORC-868 Pin gson to 2.2.4
    • ORC-869 Pin jmh 1.20
    • ORC-872 Bump kryo-shaded from 3.0.3 to 4.0.2
    • ORC-874 Bump zookeeper from 3.6.2 to 3.7.0
    • ORC-884 Bump jettison from 1.1 to 1.4.1
    • ORC-887 Remove ORC Twitter link from news page
    • ORC-890 Pin minimum support Hadoop version to 2.2.0
    • ORC-892 Pin scala-library to 2.12.10
    • ORC-898 Bump threeten-extra from 1.5.0 to 1.7.0
    • ORC-899 Archive Apache ORC 1.4.x in releases page
    • ORC-900 Update doap_orc.rdf for Apache Projects page
    • ORC-908 Use https instead of http for website links in pom.xml
    • ORC-914 Pin maven-dependency-plugin to 3.1.2
    • ORC-916 Bump annotations from 17.0.0 to 21.0.1
    • ORC-918 Pin protobuf-java to 2.5.0
    • ORC-923 Bump apache from 23 to 24
    • ORC-946 Unified json library
    • ORC-949 Add CustomImportOrder rule
    • ORC-956 Bump annotations from 21.0.1 to 22.0.0
    • ORC-977 Update webpages and TestVectorOrcFile.java to be more neutral
    • ORC-1045 Bump commons-cli to 1.5
    • ORC-1056 Bump annotations from 22.0.0 to 23.0.0
    • ORC-1103 Use Maven 3.8.4
    • ORC-1140 Documentation for Seek vs Read
    • ORC-1158 Add notification settings to .asf.yaml
    • ORC-1162 Fix Apache Project Website Checks Warning
    • ORC-1165 Enable GitHub Action in branch-1.8
    • ORC-1166 Enable snapshot publishing in branch-1.8
    • ORC-1171 Skip build and test on docker and site updates
    • ORC-1173 Pin jodd-core to 3.5.2
    • ORC-1176 Upgrade maven-jar-plugin to 3.2.2
    • ORC-1185 Add merge_orc_pr.py
    • ORC-1210 Upgrade maven to 3.8.6
    • ORC-1216 Pin org.jetbrains.annotations dependency to 17.0.0
    • ORC-1211 Upgrade maven-assembly-plugin to 3.4.0
    • ORC-1214 Bump maven-assembly-plugin to 3.4.1
    • ORC-1217 Downgrade org.jetbrains.annotations to 17.0.0
    • ORC-1223 Move DirectDecompressWrapper to org.apache.orc.impl
    • ORC-1224 Move getDecompressor to HadoopShimsCurrent
    • ORC-1226 Add a deprecation warning for Hadoop 2.7.2 and below
    • ORC-1229 Move KeyProviderImpl to org.apache.orc.impl
    • ORC-1230 Move encryption utility functions to HadoopShimsCurrent
    • ORC-1246 Revamp ORC Website
    • ORC-1247 Improve Apache ORC website and docs
    • ORC-1249 Move site/_docs/releases.md to site/releases/index.md
    • ORC-1255 Fix ORC website navbar highlight
    • ORC-1257 Publish multi-architecture ORC-dev docker images
    • ORC-1261 Rename shaded pattern com.google.protobuf25 to org.apache.orc.protobuf
    • ORC-1263 Add decimal type to ORC Website
    • ORC-1221 Move NullKeyProvider to org.apache.orc.impl
  • v1.7.6(Aug 18, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/11

    Changelog

    • https://github.com/apache/orc/compare/v1.7.5...v1.7.6

    Bug Fixes

    • ORC-1204: ORC MapReduce writer to flush when long arrays
    • ORC-1205: nextVector should invoke ensureSize when reusing vectors
    • ORC-1215: Remove a wrong NotNull annotation on value of setAttribute
    • ORC-1222: Upgrade tools.hadoop.version to 2.10.2
    • ORC-1227: Use Constructor.newInstance instead of Class.newInstance
    • ORC-1228: Fix setAttribute to handle null value

    Tests

    • ORC-932: Bump byte-buddy from 1.10.19 to 1.11.12 (#842)
    • ORC-1169: Use Hadoop 3.3.2 on Java 17+ (#1113)
    • ORC-1178: Use Hadoop 3.3.3 on Java 17+ (#1129)
    • ORC-1193: Bump parquet.version to 1.12.3
    • ORC-1207: Upgrade Spark to 3.3.0
    • ORC-1210: Upgrade maven to 3.8.6
    • ORC-1234: Upgrade objenesis to 3.2 in Spark benchmark
    • ORC-1235: Bump avro.version to 1.11.1
    • ORC-1240: Update site README to use apache/orc-dev DockerHub image
    • ORC-1241: Use apache/orc-dev DockerHub repository in Docker tests
    • ORC-1244: Upgrade byte-buddy to 1.12.13 in branch-1.7
    • ORC-1245: Use Hadoop 3.3.4 on Java 17+ and benchmark

    Documentation

    • MINOR: Update DOAP with new releases (#1127)
    • ORC-900: Update doap_orc.rdf for Apache Projects page (#806)
    • ORC-1231: Update supported OS list in building.md
    • ORC-1237: Remove a wrong image link to article-footer.png
    • ORC-1238: Update DOAP with 1.7.5

    Task

    • ORC-1185: Add merge_orc_pr.py
    • ORC-1187: Use main instead of master in merge_orc_pr.py
    • ORC-1213: Use https in ThirdpartyToolchain.cmake
    • ORC-1226: Add a deprecation warning for Hadoop 2.7.2 and below
  • v1.7.5(Jun 16, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/9

    Changelog

    • https://github.com/apache/orc/compare/v1.7.4...v1.7.5

    Bug Fixes

    • ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns (#1088)
    • ORC-1160: [C++] Fix seekToRow can't seek within selected row group (#1102)
    • ORC-1133: [C++] Fix csv-import tool options
    • ORC-1183: Upgrade gson to 2.9.0
    • ORC-1186: Limit family in aarch64 profile
    • ORC-1188: Fix ORC_PREFER_STATIC_ZLIB

    Improvements

    • ORC-1198: Add a new PhysicalFsWriter constructor with FSDataOutputStream parameter
    • ORC-1199: Use Google mirror of Maven Central as the primary

    Tests

    • ORC-1155: Add Ubuntu 22.04 to docker tests (#1093)
    • ORC-1154: Bump hive.version from 3.1.2 to 3.1.3 (#1090)
    • ORC-1161: Add MacOS 12 and remove MacOS 10
    • ORC-1174: Add Ubuntu 22.04 to GitHub Action (#1128)
    • ORC-1182: Use slf4j-simple instead of deprecated slf4j-log4j12
    • ORC-1184: Use Hadoop 3.3.3 in benchmark module
    • ORC-1189: Update README.md and help command message in benchmark module and .gitignore
    • ORC-1190: Fix ORCWriterBenchMark dumpDir initialization
    • ORC-1191: Updated TLC Taxi Benchmark Dataset
    • ORC-1192: Use orc.zstd instead of orc.none (#1144)
    • ORC-1196: Add Spark benchmark integration tests to GHA
    • ORC-1201: Remove Debian 9 from Docker Tests

    Documentation

    • MINOR: Add ASF verification instruction link (#1134)
  • v1.6.14(Apr 14, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/6?closed=1

    Changelog

    • https://github.com/apache/orc/compare/v1.6.13...v1.6.14

    Bug Fixes

    • ORC-1121: Fix column conversion check bug which causes column filters don't work (#1055)
    • ORC-1146: Float category missing check if the statistic sum is a finite value (#1078)
    • ORC-1147: Use isNaN instead of isFinite to determine the contain NaN values (#1082)

    Tests

    • ORC-1016: Use [email protected] in GitHub Action MacOS CIs
    • ORC-1113: Remove CentOS 8 from docker-based tests (#1040)
  • v1.7.4(Apr 16, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/7

    Changelog

    • https://github.com/apache/orc/compare/v1.7.3...v1.7.4

    Bug Fixes

    • ORC-1120: Remove C++ library limitation about write version (#1054)
    • ORC-1121: Fix column conversion check bug which causes column filters don't work (#1055)
    • ORC-1127: [C++] add missing version of UNSTABLE-PRE-2.0 (#1064)
    • ORC-1146: Float category missing check if the statistic sum is a finite value (#1078)
    • ORC-1147: Use isNaN instead of isFinite to determine the contain NaN values (#1082)

    Improvements

    • ORC-236: Support UNION type in Java Convert tool (#1025)
    • ORC-1116: [C++] Fix csv-import tool when exporting long bytes (#1044)
    • ORC-1123: Add estimationMemory method for writer (#1057)

    Tests

    • ORC-1145: Add Java 18 to GitHub Action CI (#1074)
    • ORC-1118: Support Java 17 and ARM64 docker tests (#1047)

    Documentation

    • ORC-1117: Add Dask page at Using in Python section (#1045)
    • ORC-1119: Remove timestamp from ORC API docs (#1049)
  • v1.7.3(Feb 10, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/4?closed=1

    Changelog

    • https://github.com/apache/orc/compare/v1.7.2...v1.7.3

    Bug Fixes

    • ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns (#971)
    • ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail (#979)
    • ORC-1067: [C++] Upgrade ZSTD to 1.5.1 (#981)
    • ORC-1078: Row group end offset doesn't accommodate all the blocks (#996)
    • ORC-1081: Fix heap-use-after-free in SearchArgumentBuilderImpl::end() (#998)
    • ORC-1087: [C++] Handle unloaded seek positions when seeking in an uncompressed chunk (#1008)
    • ORC-1092: [C++] Upgrade LZ4 to version 1.9.3 (#1012)
    • ORC-1102: [C++] Upgrade ZSTD to 1.5.2 (#1026)

    Improvements (orc-tools)

    • ORC-1055: [C++] Add the timezone option for the csv-import tool (#975)
    • ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)
    • ORC-1098: [C++] Support specifying type ids or column names in cpp tools (#1020)

    Documentation

    • ORC-1050: Update ORC site README.md and release process page (#963)
    • ORC-1069: Update building.md (#982)
    • ORC-1071: Update adopters page (#985)
    • ORC-1091: Add Tests section at ORC develop page (#1011)
    • ORC-1112: Add Using with Python web page (#1039)
    • ORC-1114: Update Using with Python page with PyArrow 7.0.0 (#1042)

    Task

    • ORC-1070: Upgrade site docker image to use Ubuntu 20.04 (#983)
    • ORC-1072: Add 'Stale' GitHub Action job (#986)
    • ORC-1094: Enable GitHub issues tab (#1015)
    • ORC-1095: Deprecate UnknownFormatException (#1016)

    Tests

    • ORC-875: Add GitHub Action job for Windows Server 2019 (#872)
    • ORC-878: Bump auto-service from 1.0-rc7 to 1.0
    • ORC-881: Bump slf4j.version from 1.7.30 to 1.7.32 (#786)
    • ORC-989: Bump checkstyle from 8.45.1 to 9.0 (#899)
    • ORC-993: Bump junit.version from 5.7.2 to 5.8.0 (#906)
    • ORC-1018: Bump checkstyle from 9.0 to 9.0.1 (#927)
    • ORC-1033: Bump junit.version from 5.8.0 to 5.8.1 (#938)
    • ORC-1044: Bump reproducible-build-maven-plugin to 0.14 (#955)
    • ORC-1048: Bump checkstyle from 9.0.1 to 9.1 (#960)
    • ORC-1052: Bump avro.version from 1.10.2 to 1.11.0 (#965)
    • ORC-1057: Bump junit.version from 5.8.1 to 5.8.2 (#969)
    • ORC-1061: Bump checkstyle from 9.1 to 9.2 (#970)
    • ORC-1066: Bump guava from 30.1.1-jre to 31.0.1-jre (#978)
    • ORC-1068: [C++] Stabilize HAS_POST_2038 test (#980)
    • ORC-1073: Remove appveyor.yml (#987)
    • ORC-1076: Remove Travis PR Builder Link from README.md (#991)
    • ORC-1079: Add Linux Clang 11 GitHub Action test coverage (#995)
    • ORC-1080: Remove .travis.yml (#997)
    • ORC-1084: Bump checkstyle from 9.2 to 9.2.1 (#1007)
    • ORC-1086: Bump reproducible-build-maven-plugin from 0.14 to 0.15 (#1005)
    • ORC-1090: Disable Clang 13.0-specific compilation warnings (#1017)
    • ORC-1093: Remove debian8 specific code in run-one.sh (#1013)
    • ORC-1096: Bump slf4j.version to 1.7.33 (#1019)
    • ORC-1103: Use Maven 3.8.4 (#1029)
    • ORC-1104: Use Spark 3.2.1 in benchmark (#1030)
    • ORC-1105: fetch-data.sh should use zsh instead of bash (#1031)
    • ORC-1106: Use transitive commons-lang3 dependency in bench module (#1032)
    • ORC-1107: Fix NPE at benchmark data schema loading (#1033)
    • ORC-1108: Use RawLocalFileSystem to skip checksum files during benchmark data generation (#1034)
    • ORC-1109: Use zstd instead of none in the default compress option (#1035)
    • ORC-1111: Bump build-helper-maven-plugin from 3.2.0 to 3.3.0 (#1038)
    • ORC-1113: Remove CentOS 8 from docker-based tests (#1040)
    • ORC-1115: Suppress Illegal reflective access warnings on Java9+ Tests (#1043)
  • v1.6.13(Feb 10, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/5?closed=1

    Changelog

    • https://github.com/apache/orc/compare/rel/release-1.6.12...v1.6.13

    Bug Fixes

    • ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail (#979)
    • ORC-1078: Row group end offset doesn’t accommodate all the blocks (#996)

    Tests

    • ORC-875: Add GitHub Action job for Windows Server 2019 (#999)
    • ORC-941: Move MacOS 10.15/11.5 test from Travis to GitHub Action (#1001)
    • ORC-1079: Add Linux Clang 11 GitHub Action test coverage (#1002)
    • ORC-1080: Remove .travis.yml (#1000)
  • v1.7.2(Feb 10, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/3?closed=1

    Changelog

    • https://github.com/apache/orc/compare/rel/release-1.7.1...v1.7.2

    Bug Fixes

    • ORC-492: Avoid potential ArrayIndexOutOfBoundsException when getting WriterVersion (#961)
    • ORC-1041: Use memcpy during LZO decompression (#958)
    • ORC-1053: Fix time zone offset precision being inconsistent with ORC's internal default precision when the convert tool converts LocalDateTime to Timestamp (#967)
    • ORC-1059: Align findColumns behaviour between 1.6 and 1.7 release (#972)

    Improvements (orc-tools)

    • ORC-1012: Support specifying columns in orc-scan (#921)
    • ORC-1017: Add sizes tool to determine and display the sizes of each column in a set of files. (#925)
    • ORC-1023: Support writing bloom filters in ConvertTool (#933)

    Tests

    • ORC-915: Remove io.netty.netty from Spark benchmark (#822)
    • ORC-938: Bump netty-all from 4.1.42.Final to 4.1.66.Final (#819)
    • ORC-948: Add hive benchmark integration tests (#860)
    • ORC-957: Bump netty-all from 4.1.66.Final to 4.1.67.Final (#870)
    • ORC-1021: Add -fno-omit-frame-pointer in DEBUG and RELWITHDEBINFO builds (#932)
    • ORC-1051: Update benchmark dependencies (#964)
  • v1.7.1(Feb 10, 2022)

    Milestone

    • https://github.com/apache/orc/milestone/1?closed=1

    Changelog

    • https://github.com/apache/orc/compare/rel/release-1.7.0...rel/release-1.7.1

    Bug Fixes

    • ORC-879 - Flaky Test for TestJsonReader
    • ORC-1008 - Overflow detection code is incorrect in IntegerColumnStatisticsImpl
    • ORC-1009 - [C++] Missing string include causes build failure with MSVC++
    • ORC-1015 - Update OrcFile.WriterOptions::memory javadoc
    • ORC-1016 - Use [email protected] in GitHub Action MacOS CIs
    • ORC-1024 - BloomFilter hash computation is inconsistent between Java and C++ clients
    • ORC-1029 - Could not load 'org.apache.orc.DataMask.Provider' when using orc encryption and spark executor with multi cores!
    • ORC-1030 - Java Tools Recover File command does not accurately find OrcFile.MAGIC
    • ORC-1034 - The search byte array algorithm is incorrectly implemented in FileDump.java
    • ORC-1035 - backupDataPath may be incorrect in recoverFile
    • ORC-1039 - Make FileDump.recoverFile handle side files only if they exist

    Test

    • ORC-1000 - Use Java 17 in GitHub Action
    • ORC-1002 - Add java17 profile for Java17 unit testing
    • ORC-1010 - Bump tzdata from tzdata-2020e-1.tar.xz to tzdata-2021b-1.tar.xz
    • ORC-1011 - Activate java17 profile automatically
    • ORC-1032 - Bump parquet.version from 1.12.0 to 1.12.2
    • ORC-1036 - Due to tzdata upgrade, the fixed download links in CI are often not working
    • ORC-1037 - Bump spark.version from 3.1.2 to 3.2.0
    • ORC-1040 - Add Debian 11 docker test
    • ORC-1042 - Ignore unused-function C++ compile warning on CentOS 7
    • ORC-1043 - Fix C++ conversion compilation error in CentOS 7
  • v1.7.0(Feb 10, 2022)

    New Feature

    • [ORC-40] - [C++] Support building SearchArgument
    • [ORC-577] - Allow row-level filtering
    • [ORC-602] - Create adaptor for using FSDataInputStream for Java ORC reader
    • [ORC-716] - Build and test on Java 17-EA
    • [ORC-731] - Improve Java Tools
    • [ORC-747] - Abstract Dictionary interface and refactoring
    • [ORC-751] - [C++] Implement Predicate Pushdown for C++ Reader
    • [ORC-765] - Added build option to compile libraries with position independent code
    • [ORC-819] - Add GitHub labeler

    Improvement

    • [ORC-377] - [C++] Adding writing with snappy compression to orc c++ writing lib
    • [ORC-480] - [C++] Deactivate WARN_FLAGS in release build
    • [ORC-566] - Add docker file for building site
    • [ORC-568] - Make the convert tool sort the old _col column names by number
    • [ORC-574] - Performance: Use const references for string statistics min and max to avoid copy construction
    • [ORC-588] - Static field or method should be directly referred by its class
    • [ORC-595] - Optimize Decimal64 scale calculation
    • [ORC-597] - Row-level filtering bench
    • [ORC-606] - Optimize Timestamp parseNanos calculation
    • [ORC-607] - Sync orc-benchmarks module to the others
    • [ORC-608] - Fix DecimalBench reader options
    • [ORC-609] - Upgrade aircompressor to 0.16
    • [ORC-614] - Implement efficient seek() in decompression streams
    • [ORC-615] - Refactor decompression streams into common base class
    • [ORC-622] - Refactoring of TreeReader into TypeReader and BatchReader
    • [ORC-638] - ORCMapredRecordWriter enlarge columnVector with factors when child array size is not large enough
    • [ORC-639] - Improve zstd compression performance
    • [ORC-646] - Add Ubuntu 20.04 docker file
    • [ORC-651] - Use GitHub Pull Request Template
    • [ORC-652] - Upgrade ZSTD to 1.4.5
    • [ORC-655] - Update bench to use Spark 2.4.6
    • [ORC-656] - Use gharchive.org instead of githubarchive.org
    • [ORC-657] - Remove com.netflix.iceberg dependency in java/bench module
    • [ORC-683] - PPD: Make Floating point NaN check more strict
    • [ORC-684] - [C++] Make Floating point NaN check more strict
    • [ORC-687] - Upgrade to JUnit5
    • [ORC-688] - Allow CHAR, VARCHAR to be promoted to STRING
    • [ORC-689] - Add GitHubAction job to publish snapshot
    • [ORC-693] - Update credential according to INFRA setup
    • [ORC-694] - Update docker files adding Java11 support
    • [ORC-696] - Consistent TypeDescription handling for quoted field names
    • [ORC-697] - Improve Scan tool to report where files are corrupted.
    • [ORC-699] - Minor improvements to the scan tool
    • [ORC-704] - Publish snapshots at only apache repo
    • [ORC-710] - Update maven plugins
    • [ORC-712] - Add USING IN SPARK to website
    • [ORC-722] - Improve code quality using static analysis.
    • [ORC-734] - Use org.apache.commons.lang3
    • [ORC-736] - Upgrade Hive to 3.1.2
    • [ORC-737] - Upgrade Spark to 3.1.0
    • [ORC-744] - LazyIO of non-filter columns
    • [ORC-745] - Migrate to travis-ci.com
    • [ORC-748] - Add separate writer implementation for Trino
    • [ORC-749] - Add checkstyle to -Panalyze
    • [ORC-750] - Fix benchmark to pass checkstyle:check
    • [ORC-757] - Add Hashtable implementation for dictionary
    • [ORC-760] - Update spark to 3.1.1
    • [ORC-761] - Replace MAINTAINER command with LABEL command in Dockerfile
    • [ORC-766] - Generalize the docker scripts to handle build-args
    • [ORC-767] - Add docker support for jdk 8 in debian 10
    • [ORC-768] - Update commons-csv to 1.8
    • [ORC-769] - Support ZSTD in ORC data benchmark
    • [ORC-770] - Support ZSTD in Avro data benchmark
    • [ORC-776] - Include source jars during publishing snapshot
    • [ORC-777] - Make the vectorized row batch size configurable in MR record readers and writers
    • [ORC-779] - Upgrade commons-cli to 1.4
    • [ORC-780] - Add LZ4 Compression to the C++ Writer
    • [ORC-791] - Upgrade guava test dependency to 30.1.1-jre
    • [ORC-792] - Upgrade commons-lang to 3.12.0
    • [ORC-796] - Upgrade apache parent pom version to the latest, 23
    • [ORC-797] - Allow writers to get the stripe information
    • [ORC-799] - Remove Ubuntu 16 docker test
    • [ORC-800] - [ORC]if map.value is selected, map.key should be selected automatically to prevent segment fault.
    • [ORC-801] - Clean up Logging
    • [ORC-802] - Document Maven Version and mvnw
    • [ORC-803] - MemoryManagerImpl Simplify removeWriter
    • [ORC-806] - Upgrade to Apache POM 23
    • [ORC-807] - Separate Jackson Versions in POM
    • [ORC-808] - Update Spark to 3.1.2
    • [ORC-812] - Simplify getClosestBufferSize in Writer
    • [ORC-813] - Upgrade ZSTD to 1.5.0
    • [ORC-818] - Build and test in Apple Silicon
    • [ORC-821] - Use mvnw instead of mvn
    • [ORC-823] - Upgrade maven-assembly-plugin to 3.3.0
    • [ORC-848] - Recycle Internal Buffer in StringHashTableDictionary
    • [ORC-849] - Core Benchmark Cleanup
    • [ORC-893] - Remove junit-vintage-engine from shims module.
    • [ORC-913] - Support data/format/compress options in Spark benchmark
    • [ORC-921] - Add an encrypted example file
    • [ORC-922] - Remove redundant conditional statements
    • [ORC-927] - Extracting duplicate codes for RowFilterBenchmark
    • [ORC-930] - Ignore unsupported JSON x ZSTD combination in bench
    • [ORC-931] - Optimize RunLengthIntegerWriterV2 code for better readability
    • [ORC-933] - extend the example with advanced reader
    • [ORC-941] - Move MacOS 10.15 and 11.5 test from Travis to GitHub Action
    • [ORC-943] - Add Intellij conf to support JIRA/PR autolinks
    • [ORC-945] - Add OUTPUT_QUIET, ERROR_QUIET to suppress Java8 addopen error messages
    • [ORC-970] - Reordering statements, improve readability in WriterImpl
    • [ORC-976] - Optimize compute zigZagLiterals
    • [ORC-984] - Save the software version that wrote each ORC file

    Sub-task

    • [ORC-599] - Bump guava version to 28.1-jre
    • [ORC-663] - [C++] Support nanosecond in timestamp column statistics
    • [ORC-713] - Add Java 15 test to github action
    • [ORC-714] - Remove MRUnit dependency and its usage
    • [ORC-715] - Add MapReduce test cases
    • [ORC-718] - Enable Checkstyle plugin and FileTabCharacter rule.
    • [ORC-719] - Enable UnusedImports.
    • [ORC-720] - Run mvn checkstyle:check in GitHub action.
    • [ORC-721] - Use org.junit.Assert instead of deprecated junit.framework.Assert.
    • [ORC-723] - Upgrade Mockito to 3.7.0.
    • [ORC-726] - Support Map type in orc-tools convert
    • [ORC-727] - Update Java Tools documentation
    • [ORC-728] - Support head command in Java Tools
    • [ORC-733] - Upgrade Zookeeper from 3.4.x to 3.6.2
    • [ORC-735] - ConvertTool should not fail at a single ORC file
    • [ORC-738] - Add date type conversion support in Java Tools
    • [ORC-741] - Schema Evolution missing column is not handled in the presence of filters
    • [ORC-742] - LazyIO of non-filter columns in the presence of filters
    • [ORC-743] - Conversion of SArg into Filters, to take advantage of LazyIO
    • [ORC-754] - Code cleanup
    • [ORC-755] - Introduce OrcFilterContext
    • [ORC-758] - Avoid decompressing compressed streams if already decompressed
    • [ORC-759] - StructBatchReader should always skip processing on the rootReader
    • [ORC-778] - Add "NewlineAtEndOfFile" checkstyle rule
    • [ORC-783] - Add a checkstyle rule to prevent trailing white spaces.
    • [ORC-795] - Add "LineLength" rule to checkstyle
    • [ORC-811] - Benchmarks for Filters
    • [ORC-814] - Build and test Java module on Apple Silicon
    • [ORC-815] - Build and test C++ module on CLang12
    • [ORC-816] - Rename and enable aarch64 profile automatically
    • [ORC-820] - Add Java 16 to GitHub Action
    • [ORC-822] - Add Java 17-ea to GitHub Action
    • [ORC-839] - Fix head command for batch reader
    • [ORC-851] - Fix CNFE in ORC tools uber jar to include required classes.
    • [ORC-857] - Add OuterTypeFilename/UpperEll/ArrayTypeStyle checkstyle rules.
    • [ORC-858] - Add NoLineWrap/OneStatementPerLine/NeedBraces checkstyle rules
    • [ORC-859] - Update maven-checkstyle-plugin to 3.1.2.
    • [ORC-866] - Reduce LineLength from 125 to 120
    • [ORC-867] - Upgrade hive-storage-api to 2.8.1
    • [ORC-871] - orc-tools json-schema fails at empty json file with EOFException
    • [ORC-882] - Remove hamcrest-core test dependency
    • [ORC-886] - Add an integration test for ORC Java tools
    • [ORC-889] - Remove orc-mapreduce build warnings due to overlapping resources
    • [ORC-895] - Use snappy-java 1.1.8.4 in bench/core to support Apple Silicon
    • [ORC-901] - Remove junit-vintage-engine from mapreduce/tools module
    • [ORC-905] - Add an integration test for example
    • [ORC-907] - Remove junit-vintage-engine from core module
    • [ORC-909] - Remove commons-io 2.1 dependency
    • [ORC-910] - Enforce maven-dependency-plugin
    • [ORC-911] - Remove janino dependency in favor of Spark's transitive dependency
    • [ORC-912] - Exclude Spark transitive avro/parquet dependency from Spark benchmark
    • [ORC-917] - Bump mockito-core from 3.7.0 to 3.11.2
    • [ORC-919] - Spark bench objenesis should be the same as Spark.
    • [ORC-920] - Use junit.version and mockito.version property and bump junit to 5.7.2
    • [ORC-924] - Add redundant modifier/modifier order checkstyle rules.
    • [ORC-926] - Consolidate license header style in Java files.
    • [ORC-928] - Bump checkstyle from 8.44 to 8.45.1
    • [ORC-929] - Fix NaN at orc-tools 'meta' command
    • [ORC-934] - Add integration tests for Java bench
    • [ORC-939] - Remove threetenbp dependency
    • [ORC-942] - Remove javax.xml.bind:jaxb-api dependency
    • [ORC-944] - Add "RedundantImport" checkstyle rule
    • [ORC-947] - Update coding guide to max line length 100 and enforce it.
    • [ORC-950] - Bump aircompressor to 0.20
    • [ORC-951] - Add since tag to org.apache.orc.Reader interface
    • [ORC-952] - Add since tag to org.apache.orc.RecordReader interface
    • [ORC-953] - Add since tag to org.apache.orc.Writer interface
    • [ORC-959] - C++ reader crash in resolving nested List columns for SearchArgument
    • [ORC-960] - Create SearchArgument using column ids
    • [ORC-971] - LESS_THAN_EQUALS doesn't handle the case when min=max
    • [ORC-973] - [C++] Provide more interfaces for creating IN predicate
    • [ORC-980] - Filter processing ignores the schema case-sensitivity flag
    • [ORC-981] - Add "CommentsIndentation" checkstyle rule
    • [ORC-983] - Revise filter processing log level/location/message

    Bug

    • [ORC-27] - C++ reader does not read dates correctly prior to 1583
    • [ORC-361] - org.apache.orc.impl.MemoryManagerImpl: Owner thread expected Thread[main,5,main], got Thread[pool-15-thread-1,5,main]
    • [ORC-414] - [C++] ORC files with malformed protobuf objects can crash a release build
    • [ORC-424] - Add findbugs checks for test classes
    • [ORC-503] - Add Maven Wrapper
    • [ORC-504] - Create a reproducible java build
    • [ORC-519] - Incorrect ORC v1 specification of Decimal encoding
    • [ORC-526] - orc-tools convert does not respect second fractions
    • [ORC-528] - orc-tools timestamps off by one?
    • [ORC-552] - Fix compilation of C++ Reader.cc on centos 6.
    • [ORC-554] - Float to timestamp schema evolution handles time/nanoseconds incorrectly
    • [ORC-555] - IllegalArgumentException when reading files with compressed footers bigger than 16k
    • [ORC-556] - ConvertTreeReader can incorrectly be applied on columns of the same primitive type
    • [ORC-557] - Large ORC file parsing failed
    • [ORC-562] - Don't wrap readerSchema in acidSchema, if readerSchema is already acid
    • [ORC-563] - ORC-540 could break schema evolution on PPD codepaths
    • [ORC-564] - fix Docker scripts so that their command is bash
    • [ORC-565] - Fix a couple c++ tests when building out of tree
    • [ORC-567] - [C++] Fix integer overflow in RowReader::seekToRow
    • [ORC-569] - Empty positions list in first row index entry
    • [ORC-570] - FS: ReaderOptions.filesystem should also accept a lazy Supplier
    • [ORC-571] - ArrayIndexOutOfBoundsException in StripePlanner.readRowIndex
    • [ORC-578] - IllegalArgumentException: Can't use LongColumnVector to read proleptic Gregorian dates.
    • [ORC-580] - Crash in StripeStreamsImpl::getEncoding
    • [ORC-581] - C++ library could crash in orc::TypeImpl::addStructField
    • [ORC-584] - TypeDescription toJson() returns invalid json
    • [ORC-586] - [C++] Memory leak in StructColumnReader
    • [ORC-589] - [C++] ORC doesn't check for negative dictionary entry lengths anymore
    • [ORC-590] - Crash in orc::RleDecoderV2::readByte
    • [ORC-591] - orc::readFully crash due to null pointer variable
    • [ORC-598] - Unable to read ORC file with struct and array.length > 1024
    • [ORC-600] - StringDictionaryColumnReader does not update index buffer correctly
    • [ORC-603] - Update current Hadoop version to 2.10.1
    • [ORC-604] - Check in StringDictionary.getValueByIndex is too permissive
    • [ORC-610] - Updated Copyright year in the NOTICE file
    • [ORC-611] - Incorrect min-max stats for sub-millisecond timestamps
    • [ORC-621] - Need reader fix for ORC-569
    • [ORC-623] - Potentially incorrect Sarg evaluation for not(in) and not(isNull)
    • [ORC-626] - Reading Struct Column Having Multiple Fields With Same Name Causes java.io.EOFException
    • [ORC-628] - Add a new java tool to count rows from ORC files under a directory
    • [ORC-629] - PPD: Floating point NaN is not transitive across comparisons
    • [ORC-630] - Fix orc-tools uber jar by adding guava dependency back
    • [ORC-631] - Add guava dependency to tools jar
    • [ORC-636] - [C++] PPD Floating point stats with NaN should be ignored
    • [ORC-641] - orc-core includes packages from io.airlift.slice
    • [ORC-643] - Change logging of codec creation to debug
    • [ORC-644] - Nested struct evolution does not respect orc.force.positional.evolution
    • [ORC-649] - duplicate method invocation (buildConversion) in SchemaEvolution
    • [ORC-650] - Fix argument to find_package() for ZSTD
    • [ORC-654] - Fix build with clang-10 and ubuntu-20
    • [ORC-658] - Fix NoClassDefFoundError during benchmark data generation
    • [ORC-659] - Initialize "next_in" before calling DeflateInit2
    • [ORC-661] - DateColumnStatistics uses Date, which is not timezone agnostic.
    • [ORC-667] - Positional mapping for nested struct types should not be applied by default
    • [ORC-668] - Use TestSchemaEvolution as a test file prefix to prevent test failure
    • [ORC-669] - Reduce breaking changes in ReaderImpl.java
    • [ORC-670] - RecordReaderImpl.findColumns should respect orc.schema.evolution.case.sensitive
    • [ORC-671] - Add OrcTail.getStripeStatistics back for backward compatibility
    • [ORC-672] - Fix made in ORC-598 needs to be extended to other readers like DecimalFromFloatTreeReader
    • [ORC-673] - PPD: LTE Point equality comparison is wrong when RG MIN==MAX
    • [ORC-676] - Add getRawDataSizeFromColIndices back
    • [ORC-677] - Fix SargApplier compilation error
    • [ORC-680] - Upgrade Travis CI linux os from Ubuntu 14.04 to Ubuntu 16.04
    • [ORC-681] - Upgrade commons-codec to 1.15
    • [ORC-682] - Upgrade to commons-lang3
    • [ORC-685] - Add ReaderImpl.extractFileTail back
    • [ORC-686] - Upgrade aircompressor to 0.19 and fix UT
    • [ORC-702] - [C++] Big files can't be opened in Windows
    • [ORC-705] - Predicate evaluation should take into account writer calendar
    • [ORC-706] - Put back DataReaderProperties default maxDiskRangeChunkLimit
    • [ORC-708] - FIX Snapshot publishing
    • [ORC-709] - FIX Boolean to StringGroup schema evolution
    • [ORC-711] - Support CryptoExtension in create/decryptLocalKey
    • [ORC-724] - PPD: Date IN single value comparison throws ClassCastException
    • [ORC-739] - Use Maven Wrapper in java/CMakeLists.txt
    • [ORC-740] - Add curl in debian and ubuntu Docker files
    • [ORC-756] - Include Vector Length Error with null readerSchema (for ACID formats)
    • [ORC-763] - ORC timestamp inconsistencies before UNIX epoch
    • [ORC-771] - ORC timestamp consistency Test for sql.Timestamps close to epoch
    • [ORC-772] - Fix Spark benchmark jar creation
    • [ORC-773] - BinaryColumnWriter does not update Bloom filter
    • [ORC-774] - Support ZSTD in Parquet data benchmark
    • [ORC-775] - Fix a regression on column names with dot
    • [ORC-781] - [C++] Make type annotations available from C++
    • [ORC-786] - Update Dockerfiles to use main branch
    • [ORC-787] - Update site to use main
    • [ORC-788] - Update travis build status to main
    • [ORC-790] - TIMESTAMP_INSTANT should be primitive
    • [ORC-793] - Deprecate the incorrect getInt method and add a new setInt within OrcConf
    • [ORC-804] - MaskDescriptionImpl should use List instead of Set
    • [ORC-810] - FIX expected Output Stats when using Hash Dictionary
    • [ORC-856] - Fix exception description in findSubtype
    • [ORC-861] - Bump CMake minimum requirement to 2.8.12
    • [ORC-862] - Add libarchive to centos8 docker image
    • [ORC-863] - Upgrade TravisCI from xenial to focal.
    • [ORC-873] - Fix FindCyrusSASL to use the package name CyrusSASL instead of CYRUS_SASL
    • [ORC-885] - Update bench README.md and allow user env shell
    • [ORC-902] - The example of orc-example cannot be run
    • [ORC-954] - Fix Javadoc generation failure
    • [ORC-965] - Got "Overflow detected" at spark-orc with zstd
    • [ORC-978] - Fix NPE in TestFlinkOrcReaderWriter
    • [ORC-985] - ORC branch 1.7 is producing larger files from java writer
    • [ORC-990] - [C++] fix RowReaderImpl::seekToRowGroup
    • [ORC-991] - Encrypted data throws exception with a SQL filter pushdown

    Test

    • [ORC-647] - Add macOS 10.15 test to Travis CI
    • [ORC-648] - Add GitHub Action for Java8/Java11 test coverage
    • [ORC-678] - Upgrade JUnit to 4.13.1
    • [ORC-679] - Update tzdata to recover Win32 build in AppVeyor
    • [ORC-891] - Use assert methods without package name Assert
    • [ORC-903] - Migrate TestVectorOrcFile to JUnit5
    • [ORC-925] - Simplify assertions
    • [ORC-955] - Add Javadoc generation GitHub Action job
    • [ORC-963] - Build benchmark module always for integration testing

    Task

    • [ORC-553] - Add test case to check that SchemaEvolution checkAcidSchema works well
    • [ORC-560] - Improve docker tests and include centos 8 and debian 10
    • [ORC-576] - Improve LICENSE file
    • [ORC-664] - docker image for centos7 fails to build zstd
    • [ORC-674] - Update docker files adding Ubuntu 20 and removing Debian 8 and Ubuntu 14
    • [ORC-675] - Remove debian8/ubuntu14 docker directories
    • [ORC-691] - Remove unused snapcraft-related code
    • [ORC-725] - Disable merge commits from Github Merge Button
    • [ORC-730] - Add website link and description to GitHub page
    • [ORC-785] - Update GitHub Action branch name to main
    • [ORC-805] - Add .asf.yaml for Apache ORC website publishing
    • [ORC-908] - Use https instead of http for website links in pom.xml
Owner
The Apache Software Foundation