Apache ORC - the smallest, fastest columnar storage for Hadoop workloads

Last update: Jan 2, 2023

Overview

Apache ORC

ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written. Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
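The row-index idea above can be illustrated with a minimal Java sketch. It is not the ORC reader's code: the class RowGroupPruningSketch, the MinMax type, and the groupsToRead method are hypothetical names, and the sketch assumes each group of rows carries only a min and a max for one integer column. A group is read when its range overlaps the queried range, or when it has no statistics at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowGroupPruningSketch {
    // Hypothetical per-row-group statistics; a null entry means "no statistics".
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    // Return the indexes of row groups that may contain a value in [lo, hi].
    static List<Integer> groupsToRead(List<MinMax> stats, long lo, long hi) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < stats.size(); i++) {
            MinMax s = stats.get(i);
            // Without statistics nothing can be proven, so the group must be read.
            if (s == null || (s.max >= lo && s.min <= hi)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<MinMax> stats = Arrays.asList(new MinMax(0, 9), new MinMax(10, 19), null);
        System.out.println(groupsToRead(stats, 12, 15));
    }
}
```

The design choice worth noting is the null check: skipping a group is only safe when the statistics prove that no row can match.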

ORC File Library

This project includes both a Java library and a C++ library for reading and writing the Optimized Row Columnar (ORC) file format. The two libraries are completely independent of each other, and each reads all versions of ORC files. The C++ library currently writes only the original (Hive 0.11) version of ORC files; this will be extended in the future.

Releases:

Latest: Apache ORC releases
Downloads: Apache ORC downloads

Bug tracking: Apache Jira

The subdirectories are:

c++ - the C++ reader and writer
cmake_modules - the CMake modules
docker - Docker scripts to build and test on various Linux distributions
examples - various ORC example files that are used to test compatibility
java - the Java reader and writer
proto - the protocol buffer definition for the ORC metadata
site - the website and documentation
snap - the script to build snaps of the ORC tools
tools - the C++ tools for reading and inspecting ORC files

Building

Install Java 1.8 or higher
Install Maven 3 or higher
Install CMake

To build a release version with debug information:

% mkdir build
% cd build
% cmake ..
% make package
% make test-out

To build a debug version:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=DEBUG
% make package
% make test-out

To build a release version without debug information:

% mkdir build
% cd build
% cmake .. -DCMAKE_BUILD_TYPE=RELEASE
% make package
% make test-out

To build only the Java library:

% cd java
% mvn package

To build only the C++ library:

% mkdir build
% cd build
% cmake .. -DBUILD_JAVA=OFF
% make package
% make test-out

Comments

ORC-322: [C++] Fix writing & reading timestamp

Currently, the C++ and Java versions behave differently when reading and writing values of timestamp type. This patch ensures that the C++ reader and printer behave the same as their counterparts on the Java side, and that the C++ reader and writer obtain the same timestamp values in the TimestampVectorBatch.

opened by wgtmac 22
Orc 17

In this pull request I added the Libhdfs++ library to the ORC project for reading files from HDFS.

Libhdfs++ is located in orc/c++/lib/libhdfspp and by default builds as a lightweight library without examples, tests, and tools (which avoids dependencies on the JDK, valgrind, and gmock). However, if the flag -DHDFSPP_LIBRARY_ONLY=FALSE is passed to cmake, it builds the examples, tests, and tools as well.

Libhdfs++ depends on the protobuf libraries in orc/c++/libs/protobuf-2.6.0 and dynamically searches the system for the packages Doxygen, OpenSSL, CyrusSASL, GSasl, and Threads (only OpenSSL and Threads are required).

The libhdfspp folder also includes a script, pull_hdfs.sh, which pulls the latest changes from the Libhdfs++ Hadoop branch into ORC and generates the file 'imported_timestamp' containing the timestamp and information about the latest commit.

I also updated all the ORC tools to automatically use Libhdfs++ to read ORC files on HDFS if their path begins with 'hdfs://'.

Please review.

opened by AnatoliShein 21

[JAVA] mvn package fails if test compiling was skipped

I would like to run mvn -Dmaven.test.skip=true clean package, but maven-dependency-plugin complains about "Unused declared dependencies" for some libraries used only by the test code, which breaks the build. Please check the attached logs.

I'm not a Java expert, but I'm guessing the cause of the problem is the misuse of analyze-only in the package phase. According to the documentation, the analyze-only goal is meant to be used during the test-compile phase. In our case, the test classes were not compiled, so the dependency analyzer treated some libraries as unused. I don't know the right way to fix it, though.

The setting of maven-dependency-plugin: https://github.com/apache/orc/blob/8cf1047f9ace3799df12f24d2a5096b17a9a6ed0/java/pom.xml#L373-L388

Logs:

$ cd orc/java

$ mvn -Dmaven.test.skip=true clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Apache ORC                                                         [pom]
[INFO] ORC Shims                                                          [jar]
[INFO] ORC Core                                                           [jar]
[INFO] ORC MapReduce                                                      [jar]
[INFO] ORC Tools                                                          [jar]
[INFO] ORC Examples                                                       [jar]
[INFO]
[INFO] -------------------------< org.apache.orc:orc >-------------------------
[INFO] Building Apache ORC 1.9.0-SNAPSHOT                                 [1/6]
[INFO] --------------------------------[ pom ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc ---
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc ---
[INFO] Executing tasks
[INFO]     [mkdir] Created dir: /Users/x/Documents/playground/orc/java/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc ---
[INFO] No site descriptor found: nothing to attach.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc ---
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc ---
[INFO]
[INFO] ----------------------< org.apache.orc:orc-shims >----------------------
[INFO] Building ORC Shims 1.9.0-SNAPSHOT                                  [2/6]
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.2.0:clean (default-clean) @ orc-shims ---
[INFO] Deleting /Users/x/Documents/playground/orc/java/shims/target
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-java-version) @ orc-shims ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.1.0:enforce (enforce-maven) @ orc-shims ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.3.0:add-source (add-source) @ orc-shims ---
[INFO] Source directory: /Users/x/Documents/playground/orc/java/shims/target/generated-sources added.
[INFO]
[INFO] --- maven-remote-resources-plugin:1.7.0:process (process-resource-bundles) @ orc-shims ---
[INFO] Preparing remote bundle org.apache:apache-jar-resource-bundle:1.4
[INFO] Copying 3 resources from 1 bundle.
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:resources (default-resources) @ orc-shims ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Using 'UTF-8' encoding to copy filtered properties files.
[INFO] skip non existing resourceDirectory /Users/x/Documents/playground/orc/java/shims/src/main/resources
[INFO] Copying 3 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:compile (default-compile) @ orc-shims ---
[INFO] Compiling 13 source files to /Users/x/Documents/playground/orc/java/shims/target/classes
[INFO]
[INFO] --- maven-resources-plugin:3.2.0:testResources (default-testResources) @ orc-shims ---
[INFO] Not copying test resources
[INFO]
[INFO] --- maven-antrun-plugin:3.1.0:run (setup-test-dirs) @ orc-shims ---
[INFO] Executing tasks
[INFO]     [mkdir] Created dir: /Users/x/Documents/playground/orc/java/shims/target/testing-tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-compiler-plugin:3.10.1:testCompile (default-testCompile) @ orc-shims ---
[INFO] Not compiling test sources
[INFO]
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ orc-shims ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:3.3.0:jar (default-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-site-plugin:3.12.0:attach-descriptor (attach-descriptor) @ orc-shims ---
[INFO] Skipping because packaging 'jar' is not pom.
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO]
[INFO] --- maven-source-plugin:3.2.1:test-jar-no-fork (create-source-jar) @ orc-shims ---
[INFO] Building jar: /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO]
[INFO] --- reproducible-build-maven-plugin:0.15:strip-jar (default) @ orc-shims ---
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-test-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT-sources.jar
[INFO] Stripping /Users/x/Documents/playground/orc/java/shims/target/orc-shims-1.9.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-dependency-plugin:3.1.2:analyze-only (default) @ orc-shims ---
[WARNING] Unused declared dependencies found:
[WARNING]    org.junit.jupiter:junit-jupiter-api:jar:5.9.0:test
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Apache ORC 1.9.0-SNAPSHOT:
[INFO]
[INFO] Apache ORC ......................................... SUCCESS [  4.108 s]
[INFO] ORC Shims .......................................... FAILURE [  4.817 s]
[INFO] ORC Core ........................................... SKIPPED
[INFO] ORC MapReduce ...................................... SKIPPED
[INFO] ORC Tools .......................................... SKIPPED
[INFO] ORC Examples ....................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  9.189 s
[INFO] Finished at: 2022-10-13T15:07:06+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:3.1.2:analyze-only (default) on project orc-shims: Dependency problems found -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <args> -rf :orc-shims

opened by zjx20 20

ORC-703 : Fix RLE encoding bug on large negative integer.

What changes were proposed in this pull request?

ORC uses run-length encoding (RLE) to encode and decode integers. The RLE algorithm comprises four sub-encodings:

Short Repeat: used for short repeating integer sequences.
Direct: used for integer sequences whose values have a relatively constant bit width.
Patched Base: used for integer sequences whose bit widths vary a lot.
Delta: used for monotonically increasing or decreasing sequences.

Why are the changes needed?

This bug occurs in the Patched Base type for large negative numbers. In Patched Base, 3 bits store the width of the base value, which is encoded using 1 to 8 bytes. If the base value is actually 8 bytes long, the stored base width should be 7. Currently, this value can go up to 8, which does not fit in 3 bits and can produce inconsistent data during encoding. In extreme cases, the encoding or decoding process can even crash with a core dump by referencing an illegal address.
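The width arithmetic described above can be sketched in a few lines of Java. This is only an illustration of the "byte count minus one" mapping implied by the text (8 bytes is stored as 7), not the ORC writer's code; PatchedBaseWidthSketch and its method names are hypothetical.

```java
final class PatchedBaseWidthSketch {
    // Map a byte count in 1..8 to the value stored in the 3-bit field (0..7).
    static int encodeBaseWidth(int byteCount) {
        if (byteCount < 1 || byteCount > 8) {
            throw new IllegalArgumentException("byte count must be in 1..8: " + byteCount);
        }
        return byteCount - 1;
    }

    // Recover the byte count from the 3-bit field.
    static int decodeBaseWidth(int field) {
        return (field & 0x7) + 1;
    }

    public static void main(String[] args) {
        System.out.println(encodeBaseWidth(8)); // 7
        System.out.println(decodeBaseWidth(7)); // 8
        // The raw value 8 needs 4 bits; masked to 3 bits it silently becomes 0.
        System.out.println(8 & 0x7); // 0
    }
}
```

The last line shows why a stored value of 8 is dangerous: once truncated to 3 bits it is indistinguishable from 0, so a reader would assume a 1-byte base.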

How was this patch tested?

Pass the newly added UT.
Stale

opened by chaoyli 19
ORC-1075: Support reading ORC files with no column statistics

What changes were proposed in this pull request?

This PR aims to fix predicate evaluation. When the row index statistics only implement ColumnStatistics and do not provide information other than min or max, the predicate cannot be evaluated correctly.

Why are the changes needed?

To make every file that complies with the ORC specification readable by the official library.

How was this patch tested?

Added unit tests: testWithoutStatistics covers files that do not provide statistics, and testMissMinOrMaxInStatistics covers the case where min/max statistics are not provided.
JAVA

opened by guiyanakuang 18
ORC-961: [C++] expose related metrics of the reader

What changes were proposed in this pull request?

This patch keeps track of the time spent and the number of calls to each module. The current metrics mainly include the decompression time and number of calls, decoding time and number of calls and total elapsed time.

Why are the changes needed?

It exposes the relevant metrics of the reader, so that the user can see the metrics of each module, including decompression and decoding.

How was this patch tested?

The orc-scan tool with the -m parameter can output relevant metrics.
CPP

opened by coderex2522 17
ORC-1172: Add row count limit config in one stripe
What changes were proposed in this pull request?

Add a row count limit config, "orc.stripe.row.count", to limit the number of rows in one stripe.

Why are the changes needed?

For query engines like Presto, the stripe is the base unit of query concurrency: one stripe can only be processed by one split. In the current implementation of the ORC writer, the only config that controls the row count in a stripe is "orc.stripe.size". But across different kinds of tables, the row count is difficult to control with that setting.

For a table with many columns (e.g. 100 columns), 64MB may contain 5000 rows. For a table with few columns (e.g. 5 columns), 64MB may contain 100000 rows. In Presto, a normal OLAP query reads only a subset of the table's columns, so the row count is the key factor in query performance. If one stripe contains too many rows, query performance may suffer.

So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe. A similar config has been introduced in cudf (a GPU DataFrame library based on Apache Arrow): rapidsai/cudf#9261
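A rough Java sketch of how a size limit and a row count limit could work together is shown here. It assumes that a stripe is closed as soon as either limit is reached; that rule, the class StripeLimitSketch, and the method shouldFlush are assumptions made for illustration, not the ORC writer's actual logic.

```java
final class StripeLimitSketch {
    // Decide whether the buffered data should be flushed as a stripe.
    // Assumption: reaching either the byte limit or the row limit closes the stripe.
    static boolean shouldFlush(long bufferedBytes, long bufferedRows,
                               long stripeSizeBytes, long stripeRowCount) {
        return bufferedBytes >= stripeSizeBytes || bufferedRows >= stripeRowCount;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024; // 64MB, as in the example above
        long rowLimit = 5000;                // hypothetical row count limit

        // A narrow table reaches the row limit long before it reaches 64MB.
        System.out.println(shouldFlush(3L * 1024 * 1024, 5000, stripeSize, rowLimit));
        // A wide table may reach 64MB first.
        System.out.println(shouldFlush(stripeSize, 4000, stripeSize, rowLimit));
    }
}
```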

How was this patch tested?

testStripeRowCountLimit was added. It can be run with the commands below:

cd java
./mvnw -Dtest=TestWriterImpl test

closed #1117
JAVA
opened by dengweisysu 17
ORC-713: Add Java 15 to GitHub action

What changes were proposed in this pull request?

This PR aims to add Java 15 to GitHub action as a preparation for Java 17.

Why are the changes needed?

Java 17 is the next LTS.

How was this patch tested?

Pass the GitHub action.

opened by williamhyun 17
ORC-1310: Allowlist Support for plugin filter

What changes were proposed in this pull request?

This PR aims to add allowlist support for plugin filters.

Why are the changes needed?

ServiceLoader loads every implementation of the interface found on the current classpath. When there are two implementations of PluginFilterService, both of them are loaded as plugin filters. This PR provides a configuration, "orc.filter.plugin.allowlist". When the configuration is not null, the ORC reader only loads the classes that are in the allowlist.
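The filtering step can be sketched in plain Java as follows. This is a simplified stand-in, not the ORC implementation: FilterService, FilterA, FilterB, and applyAllowlist are hypothetical names, the already-loaded services are passed in as a list rather than obtained from ServiceLoader, and the allowlist is assumed to be a comma-separated list of fully qualified class names (the actual value format is not stated above).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class AllowlistSketch {
    // Hypothetical stand-in for the PluginFilterService interface.
    interface FilterService {}

    static final class FilterA implements FilterService {}

    static final class FilterB implements FilterService {}

    // Keep only the services whose class name appears in the allowlist.
    // A null allowlist means "no restriction", mirroring the description above.
    static List<FilterService> applyAllowlist(List<FilterService> loaded, String allowlist) {
        if (allowlist == null) {
            return loaded;
        }
        // Assumption: the value is a comma-separated list of class names.
        Set<String> allowed = Arrays.stream(allowlist.split(","))
                .map(String::trim)
                .filter(name -> !name.isEmpty())
                .collect(Collectors.toSet());
        return loaded.stream()
                .filter(service -> allowed.contains(service.getClass().getName()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<FilterService> loaded = Arrays.asList(new FilterA(), new FilterB());
        System.out.println(applyAllowlist(loaded, null).size());
        System.out.println(applyAllowlist(loaded, FilterA.class.getName()).size());
    }
}
```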

How was this patch tested?

UT
JAVA

opened by deshanxiao 16

ORC-757: HashTable dictionary

What changes were proposed in this pull request?

Add a straightforward implementation of Dictionary using a hash table.
Refactor the RB-Tree code to make parts of it, like VisitorContextImpl, reusable, moving them into DictionaryUtils.java so they can be shared between different implementations of the Dictionary interface.
Enable the hash-based dictionary for existing tests that enable dictionary encoding.
Add ORCWriterBenchmark for benchmarking writer performance with different options.
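To give a feel for what a hash-based dictionary does, here is a minimal Java sketch. It uses java.util.HashMap as a stand-in and is not the implementation added by this PR; HashDictionarySketch and its methods are hypothetical. Each distinct string receives the next free id on first insertion, and repeated values map back to the same id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashDictionarySketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> values = new ArrayList<>();

    // Return the id of the value, assigning the next free id on first sight.
    int add(String value) {
        Integer existing = ids.get(value);
        if (existing != null) {
            return existing;
        }
        int id = values.size();
        ids.put(value, id);
        values.add(value);
        return id;
    }

    // Look up the value that belongs to an id.
    String get(int id) {
        return values.get(id);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        HashDictionarySketch dictionary = new HashDictionarySketch();
        for (String s : new String[] {"us", "de", "us", "fr", "de"}) {
            System.out.println(s + " -> " + dictionary.add(s));
        }
        System.out.println("distinct values: " + dictionary.size());
    }
}
```

A hash table gives expected constant-time lookups but no ordering, whereas a red-black tree keeps keys sorted at a logarithmic cost per insertion.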

Why are the changes needed?

We found the RB-Tree based dictionary implementation to be slow in our production workload. The performance comparison for the new hash-table based implementation will be done as part of ORC-50.

How was this patch tested?

Mostly tested with added unit tests for hash-table and enabled hash-table based dictionary in some of the existing tests.

Also added the benchmark result comparing to RB-Tree one:

ORCWriterBenchMark.dictBench                     RBTREE  avgt    5   28216.937 ±  368.582  us/op
ORCWriterBenchMark.dictBench:bytesPerRecord      RBTREE  avgt    5      49.832                 #
ORCWriterBenchMark.dictBench:iOs                 RBTREE  avgt    5         ≈ 0                 #
ORCWriterBenchMark.dictBench:perRecord           RBTREE  avgt    5       0.861 ±    0.011  us/op
ORCWriterBenchMark.dictBench:records             RBTREE  avgt    5  163840.000                 #
ORCWriterBenchMark.dictBench                       HASH  avgt    5    5751.196 ± 1049.305  us/op
ORCWriterBenchMark.dictBench:bytesPerRecord        HASH  avgt    5      50.146                 #
ORCWriterBenchMark.dictBench:iOs                   HASH  avgt    5         ≈ 0                 #
ORCWriterBenchMark.dictBench:perRecord             HASH  avgt    5       0.176 ±    0.032  us/op
ORCWriterBenchMark.dictBench:records               HASH  avgt    5  163840.000                 #
ORCWriterBenchMark.dictBench                       NONE  avgt    5    3988.156 ±  621.354  us/op
ORCWriterBenchMark.dictBench:bytesPerRecord        NONE  avgt    5      50.146                 #
ORCWriterBenchMark.dictBench:iOs                   NONE  avgt    5         ≈ 0                 #
ORCWriterBenchMark.dictBench:perRecord             NONE  avgt    5       0.122 ±    0.019  us/op
ORCWriterBenchMark.dictBench:records               NONE  avgt    5  163840.000                 #

opened by autumnust 16

ORC-1008: Fix overflow detection code for C++ int64_t / java long

What changes were proposed in this pull request?

https://github.com/apache/orc/blob/6da96bb8ceb64528d082974efed411c4c29f3408/c%2B%2B/src/Statistics.cc#L180-L190

A counter-example to the current check is easy to give. Assume sum=1 and call update(std::numeric_limits<int64_t>::max(), 3); then value * repetitions + _stats.getSum() overflows but is still a positive number: 9223372036854775806

This PR aims to fix the overflow detection code for C++ int64_t / Java long.
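The counter-example can be reproduced on the Java side with a short sketch. It relies on Math.multiplyExact and Math.addExact from the JDK, which throw ArithmeticException on overflow; this is one possible way to detect the condition and is not necessarily the check this PR implements. OverflowCheckSketch and updateOverflows are hypothetical names.

```java
final class OverflowCheckSketch {
    // Returns true when sum + value * repetitions cannot be computed in a long.
    static boolean updateOverflows(long sum, long value, long repetitions) {
        try {
            Math.addExact(sum, Math.multiplyExact(value, repetitions));
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Plain arithmetic wraps around silently and stays positive.
        long wrapped = Long.MAX_VALUE * 3 + 1;
        System.out.println(wrapped); // 9223372036854775806

        // The exact-arithmetic helpers report the overflow.
        System.out.println(updateOverflows(1L, Long.MAX_VALUE, 3L)); // true
    }
}
```

Note that this check is conservative: it also reports an overflow when only the intermediate product leaves the long range, even if adding a negative sum would bring the final result back in range.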

ORC-338 worked around a C++ compiler bug in Xcode 9.3 by removing an inline function.

Since I fixed the implementation, the current update function can be inline.

Why are the changes needed?

Fix bug.

How was this patch tested?

Pass the CIs.
JAVA CPP

opened by guiyanakuang 15
Bump checkstyle from 10.5.0 to 10.6.0 in /java
Bumps checkstyle from 10.5.0 to 10.6.0.

Release notes

Sourced from checkstyle's releases.

checkstyle-10.6.0

Checkstyle 10.6.0 - https://checkstyle.org/releasenotes.html#Release_10.6.0

Breaking backward compatibility:

#12520 - Simplify JavadocStyleCheck: remove functionality for missing package-info Javadoc

Bug fixes:

#12409 - Inconsistent allowedAbbreviations when a method contains an underscore
#12486 - NoWhitespaceAfter false positive on synchronized method
#11807 - Null pointer exception with records in RequireThisCheck

Commits

233c91b [maven-release-plugin] prepare release checkstyle-10.6.0

c982461 config: maven has problems to push, moving push to action level

2826b1b config: git push commands need write permission in actions

311a1b7 config: skip pgp sign plugin during release:prepare as we do not sign commits

04347b1 doc: release notes for 10.6.0

d12ffc7 Issue #12409: Inconsistentency In Allowed Abbreviations

a5be3cf minor: Bump version to 10.6.0-SNAPSHOT

ebb46cb Issue #12520: removes missing package-info Javadoc check in JavadocStyle

475063f supplemental: Forbid usage of @BeforeAll in tests

069905a config: upgrade sevntu to 1.44.1

Additional commits viewable in compare view

dependencies BUILD JAVA
opened by dependabot[bot] 0
Bump mockito.version from 4.10.0 to 4.11.0 in /java
Bumps mockito.version from 4.10.0 to 4.11.0. Updates mockito-core from 4.10.0 to 4.11.0

Release notes

Sourced from mockito-core's releases.

v4.11.0

Changelog generated by Shipkit Changelog Gradle Plugin

4.11.0

2022-12-28 - 1 commit(s) by Andy Coates

Improve vararg handling: approach 2 (mockito/mockito#2807)

Mocking varargs method with any(String[].class) doesn't work as expected (mockito/mockito#2796)

(Argument)Matchers regression from 1.10.19 to 2.18.3 for varargs (mockito/mockito#1498)

Cannot verify varargs parameter as an array (mockito/mockito#1222)

ArgumentCaptor can't capture varargs-arrays (mockito/mockito#584)

Verification of an empty varargs call fails when isNotNull() is used (mockito/mockito#567)

Commits

483e15f Add type() method to ArgumentMatcher (#2807)

See full diff in compare view

Updates mockito-junit-jupiter from 4.10.0 to 4.11.0

dependencies BUILD JAVA
opened by dependabot[bot] 0
Huge memory taken for each field when exporting

Hello. Using the Arrow adapter, I noticed that the memory (RAM) footprint of the export (exporting an ORC file) is very large for each field. For instance, exporting a table with 10000 fields can take up to 30 GB, even if there are only 10 records. Even for 100 fields, it can take 100 MB or more. The issue seems to come from here: https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/ColumnWriter.cc#L59

When we create a writer with "createWriter" (https://github.com/apache/orc/blob/432a7aade9ea8d3cd705d315da21c2c859bce9ef/c%2B%2B/src/Writer.cc#L681-L684), a stream (compressor) is created for each field. As we allocate a buffer of 1 * 1024 * 1024 bytes, at least 1 MB of additional memory is taken for each field.
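The arithmetic behind these figures can be checked with a small Java sketch. Only the initial capacity of 1 * 1024 * 1024 bytes comes from the description above; the class BufferFootprintSketch, the method minimumBytes, and in particular the number of buffers per field are assumptions used for illustration.

```java
final class BufferFootprintSketch {
    // Initial buffer capacity quoted in the issue: 1 * 1024 * 1024 bytes.
    static final long INITIAL_CAPACITY_BYTES = 1L * 1024 * 1024;

    // Lower bound on memory taken by the initial buffers alone.
    // buffersPerField is a hypothetical parameter, not a value from the ORC source.
    static long minimumBytes(long fields, long buffersPerField) {
        return Math.multiplyExact(
                Math.multiplyExact(fields, buffersPerField), INITIAL_CAPACITY_BYTES);
    }

    public static void main(String[] args) {
        // One buffer per field: 100 fields and 10000 fields.
        System.out.println(minimumBytes(100, 1));   // 104857600
        System.out.println(minimumBytes(10000, 1)); // 10485760000
        // Three buffers per field would be in the range of the reported 30 GB.
        System.out.println(minimumBytes(10000, 3)); // 31457280000
    }
}
```

With one buffer per field, 100 fields already cost about 100 MB, which matches the report; the 30 GB figure for 10000 fields would be consistent with roughly three such buffers per field, but that count is a guess.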

Is there a reason the BufferedOutputStream initial capacity is that high? I worked around my problem by lowering it to 1 KB (it didn't change performance much in my testing, but that may depend on the use case). Would it be possible to add a global (or static) variable so this hard-coded parameter can be changed? Thanks.

opened by LouisClt 7
ORC-1200: Extracting encryption setup logic from `WriterImpl`

What changes were proposed in this pull request?

Extracting the encryption setup logic as a tool class.

Why are the changes needed?

Because Flink's ORC writer is stream based, we must create a stream-based PhysicalFsWriter before WriterImpl is initialized. You can look at the Flink ORC writer implementation in OrcBulkWriterFactory. Given the above, I want to pass encryption settings to PhysicalFsWriter before WriterImpl is initialized, but the encryption setup in WriterImpl is so tightly coupled that the encryption variant cannot be obtained externally.

How was this patch tested?

It doesn't introduce new features and passed all test cases.
JAVA

opened by liujiawinds 7
RecordReaderImpl.getValueRange() may cause incorrect results

ORC version: 1.6.11, SQL: select xxx from xxx where str is not null

Recently I found that some ORC files written by Trino did not have complete statistics in the file metadata (maybe a Presto bug). As a result, OrcProto.ColumnStatistics can't be deserialized to any specific ColumnStatisticsImpl such as StringStatisticsImpl; RecordReaderImpl.getValueRange() then returns a ValueRange with a null lower bound, and RecordReaderImpl.pickRowGroups() skips this row group, which should not be skipped. Apart from this case, everything works. I also found that orc-1.5.x handles this case via RecordReaderImpl.UNKNOWN_VALUE, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
enhancement

opened by PengleiShi 7

Releases(v1.8.1)

v1.8.1(Dec 2, 2022)
Milestone

https://github.com/apache/orc/milestone/13

Changelog

https://github.com/apache/orc/commits/v1.8.1

Bug

ORC-1283 ENABLE_INDEXES does not take effect

ORC-1288 Invalid memory freeing with ZLIB compression

ORC-1291 NullPointerException in TypeDescription

Improvement

ORC-1268 Set CMP0135 policy for CMake 3.24+

ORC-1282 Add slf4j impl to avoid warning message

ORC-1294 Build error when skip tests build

ORC-1295 Improve ORC Spec example (Decoding RLE v2 direct)

ORC-1299 benchmark can't work for data resource 403

ORC-1305 Add more orc java examples

ORC-1308 Avoid star import

Test

ORC-1290 Bump spotbugs to 4.7.3

ORC-1300 Update Spark to 3.3.1 and its dependencies

Tasks

ORC-1269 Remove FindBugs

ORC-1270 Move opencsv dependency to the tools module.

ORC-1292 Add paragraph in java documentation

v1.7.7(Nov 19, 2022)
Milestone

https://github.com/apache/orc/milestone/12

Changelog

https://github.com/apache/orc/commits/v1.7.7

Bug

ORC-1283 ENABLE_INDEXES does not take effect

Test

ORC-1254 Add spotbugs check

ORC-1299 Fix fetch data error in bench module

Task

ORC-1256 Publish tests jar to maven central

ORC-1268 Set CMP0135 policy for CMake 3.24+

v1.8.0(Sep 3, 2022)
Milestone

https://github.com/apache/orc/milestone/2

Changelog

https://github.com/apache/orc/commits/v1.8.0

New Feature and Notable Changes

ORC-450 Support selecting list indices without materializing list items

ORC-824 Add column statistics for List and Map

ORC-1004 Java ORC writer supports the selection vector

ORC-1075 Support reading ORC files with no column statistics

ORC-1125 Support decoding decimals in RLE

ORC-1136 Optimize reads by combining multiple reads without significant separation into a single read

ORC-1138 Seek vs Read Optimization

ORC-1172 Add row count limit config for one stripe

ORC-1212 Upgrade protobuf-java to 3.17.3

ORC-1220 Set min.hadoop.version to 2.7.3

ORC-1248 Redefine Hadoop dependency for Apache ORC 1.8.0

ORC-1256 Publish test-jar to maven central

ORC-1260 Publish shaded-protobuf classifier artifacts

Improvement

ORC-825 Use Empty Array For Collections toArray

ORC-826 Do Not Use Collection Contains/Get

ORC-828 Improve Fetch Data Set Process

ORC-829 Optimize Serialization percentileBits

ORC-831 Do Not Copy String When Flushing Dictionary

ORC-833 RunLengthIntegerReaderV2 Calculate Batch Size Once

ORC-834 Do Not Convert to String in DecimalFromTimestampTreeReader

ORC-835 Cache TRUE/FALSE Bytes in StringGroupFromBooleanTreeReader

ORC-836 StringGroupFromDoubleTreeReader Use Double toString

ORC-837 Reuse HiveDecimalWritable in ConvertTreeReaderFactory

ORC-838 Simplify compareTo/equals/putBuffer of ByteBufferAllocatorPool

ORC-840 Remove Superfluous Array Fill in RecordReaderImpl

ORC-841 Remove Superfluous Array Fill in StringHashTableDictionary

ORC-842 Remove newKey from StringHashTableDictionary

ORC-844 Improve hashCode Methods

ORC-847 Do Not Create Empty Array in StringGroupFromBinaryTreeReader

ORC-852 Allow DynamicByteArray to Return a ByteBuffer

ORC-853 Optimize writeDouble Implementation

ORC-855 Remove Unused isRepeating from RunLengthIntegerReaderV2

ORC-865 Bump opencsv from 3.9 to 5.5.1

ORC-883 Dependency Audit and QA

ORC-897 optimization loop termination condition in readerIsCompatible method

ORC-935 Bump commons-csv from 1.8 to 1.9.0

ORC-937 Replace deprecated method

ORC-958 Convert command support overwrite option

ORC-969 Evaluate SearchArguments using file and stripe level stats

ORC-975 Avoid double counting closestFixedBits in percentileBits method

ORC-982 Extract checkstyle to a single file, help newcomers check code style

ORC-988 Bump opencsv from 5.5.1 to 5.5.2

ORC-992 Reached max repeat length, we can directly decide to use DELTA encoding

ORC-1005 Make the Java and C++ implementations of determineEncoding in RunLengthIntegerWriterV2 consistent

ORC-1007 Fix a warning from the shade plugin

ORC-1013 Renaming a parameter in constructors of TreeWriter's derived classes

ORC-1014 Add details when we get IOExceptions from file system

ORC-1020 Improve orc::RleDecoderV2::nextDirect

ORC-1027 Filter processing to allow filter injections that cannot be represented via SArgs

ORC-1047 Handle quoted field names during string schema parsing

ORC-1077 Remove commons-codec dependency and use java.util.Base64

ORC-1099 Extend ReadIntent to support MAP and UNION type

ORC-1101 Improve malformed STRUCT handling

ORC-1122 Add buffer to decode the whole run in RleDecoderV2

ORC-1137 Improve float/double conversion in DoubleColumnReader::next()

ORC-1149 Bump slf4j.version to 1.7.36

ORC-1150 Improve RowReaderImpl::computeBatchSize()

ORC-1152 Support encoding short decimals in RLEv2

ORC-1156 Update opencsv to 5.6

ORC-1163 Bump zookeeper from 3.7.0 to 3.8.0

ORC-1169 Use Hadoop 3.3.2 on Java 17+

ORC-1178 Use hadoop 3.3.3 on Java 17+

Bug

ORC-845 Fix NPE in DynamicIntArray toString

ORC-929 Fix NaN at orc-tools 'meta' command

ORC-1129 The build of tool-test should depend on cpp tools

ORC-1159 Crash when the last stripe is skipped

ORC-1242 Bump threeten-extra to 1.7.1

Test

ORC-860 Add dependabot

ORC-864 Bump jackson.version from 2.12.2 to 2.12.4

ORC-877 Bump junit-vintage-engine from 5.7.0 to 5.7.2

ORC-888 Bump objenesis from 3.1 to 3.2

ORC-905 Add an integration test for example

ORC-917 Bump mockito-core from 3.7.0 to 3.11.2

ORC-919 Spark bench objenesis should be the same as Spark.

ORC-920 Use junit.version and mockito.version property and bump junit to 5.7.2

ORC-925 Simplify assertions

ORC-928 Bump checkstyle from 8.44 to 8.45.1

ORC-932 Bump byte-buddy from 1.10.19 to 1.11.12

ORC-934 Add integration tests for Java bench

ORC-940 Use Hadoop 3.3.1 in bench module

ORC-955 Add Javadoc generation GitHub Action job

ORC-963 Build benchmark module always for integration testing

ORC-966 Bump byte-buddy from 1.11.12 to 1.11.13

ORC-967 Bump mockito.version from 3.11.2 to 3.12.1

ORC-986 Bump mockito.version from 3.12.1 to 3.12.4

ORC-987 Bump jackson.version from 2.12.4 to 2.12.5

ORC-1001 Bump maven-enforcer-plugin to 3.0.0

ORC-1019 Remove redundant jackson dependencies

ORC-1022 Bump byte-buddy from 1.11.13 to 1.11.19

ORC-1038 Bump mockito.version from 3.12.4 to 4.0.0

ORC-1074 Bump byte-buddy from 1.11.19 to 1.12.6

ORC-1079 Add Linux clang GitHub Action job

ORC-1085 Bump auto-service from 1.0 to 1.0.1

ORC-1089 Add test cases verifying writers with selected vector

ORC-1104 Use Spark 3.2.1 in benchmark

ORC-1107 Fix NPE at benchmark data schema loading

ORC-1110 Bump mockito.version from 4.0.0 to 4.3.1

ORC-1126 Bump byte-buddy from 1.12.6 to 1.12.8

ORC-1139 Benchmark for Seek vs Read

ORC-1141 Bump mockito.version from 4.3.1 to 4.4.0

ORC-1145 Add Java 18 to GitHub Action CI.

ORC-1153 Bump byte-buddy from 1.12.8 to 1.12.9

ORC-1157 Update guava to 31.1-jre

ORC-1168 Update byte-buddy to 1.12.10

ORC-1177 Upgrade mockito.version to 4.5.1

ORC-1179 Upgrade checkstyle to 10.2 on Java 11+

ORC-1187 Use main instead of master in merge_orc_pr.py

ORC-1194 Bump mockito.version to 4.6.0

ORC-1195 Bump checkstyle to 10.3

ORC-1196 Add spark benchmark integration tests to GHA

ORC-1197 Bump mockito.version from 4.6.0 to 4.6.1

ORC-1201 Remove Debian 9 from Docker Tests

ORC-1203 Bump maven-enforcer-plugin to 3.1.0

ORC-1206 Bump netty-all to 4.1.78.Final

ORC-1207 Upgrade Spark to 3.3.0

ORC-1208 Bump byte-buddy to 1.12.12

ORC-1209 Bump checkstyle to 10.3.1

ORC-1234 Upgrade objenesis to 3.2 in Spark benchmark

ORC-1236 Bump checkstyle to 10.3.2

ORC-1243 Bump byte-buddy to 1.12.13

ORC-1253 Add Fedora 37 docker test

ORC-1254 Add spotbugs check

Task

ORC-868 Pin gson to 2.2.4

ORC-869 Pin jmh 1.20

ORC-872 Bump kryo-shaded from 3.0.3 to 4.0.2

ORC-874 Bump zookeeper from 3.6.2 to 3.7.0

ORC-884 Bump jettison from 1.1 to 1.4.1

ORC-887 Remove ORC Twitter link from news page

ORC-890 Pin minimum support Hadoop version to 2.2.0

ORC-892 Pin scala-library to 2.12.10

ORC-898 Bump threeten-extra from 1.5.0 to 1.7.0

ORC-899 Archive Apache ORC 1.4.x in releases page

ORC-900 Update doap_orc.rdf for Apache Projects page

ORC-908 Use https instead of http for website links in pom.xml

ORC-914 Pin maven-dependency-plugin to 3.1.2

ORC-916 Bump annotations from 17.0.0 to 21.0.1

ORC-918 Pin protobuf-java to 2.5.0

ORC-923 Bump apache from 23 to 24

ORC-946 Unified json library

ORC-949 Add CustomImportOrder rule

ORC-956 Bump annotations from 21.0.1 to 22.0.0

ORC-977 Update webpages and TestVectorOrcFile.java to be more neutral

ORC-1045 Bump commons-cli to 1.5

ORC-1056 Bump annotations from 22.0.0 to 23.0.0

ORC-1103 Use Maven 3.8.4

ORC-1140 Documentation for Seek vs Read

ORC-1158 Add notification settings to .asf.yaml

ORC-1162 Fix Apache Project Website Checks Warning

ORC-1165 Enable GitHub Action in branch-1.8

ORC-1166 Enable snapshot publishing in branch-1.8

ORC-1171 Skip build and test on docker and site updates

ORC-1173 Pin jodd-core to 3.5.2

ORC-1176 Upgrade maven-jar-plugin to 3.2.2

ORC-1185 Add merge_orc_pr.py

ORC-1210 Upgrade maven to 3.8.6

ORC-1216 Pin org.jetbrains.annotations dependency to 17.0.0

ORC-1211 Upgrade maven-assembly-plugin to 3.4.0

ORC-1214 Bump maven-assembly-plugin to 3.4.1

ORC-1217 Downgrade org.jetbrains.annotations to 17.0.0

ORC-1223 Move DirectDecompressWrapper to org.apache.orc.impl

ORC-1224 Move getDecompressor to HadoopShimsCurrent

ORC-1226 Add a deprecation warning for Hadoop 2.7.2 and below

ORC-1229 Move KeyProviderImpl to org.apache.orc.impl

ORC-1230 Move encryption utility functions to HadoopShimsCurrent

ORC-1246 Revamp ORC Website

ORC-1247 Improve Apache ORC website and docs

ORC-1249 Move site/_docs/releases.md to site/releases/index.md

ORC-1255 Fix ORC website navbar highlight

ORC-1257 Publish multi-architecture ORC-dev docker images

ORC-1261 Rename shaded pattern com.google.protobuf25 to org.apache.orc.protobuf

ORC-1263 Add decimal type to ORC Website

ORC-1221 Move NullKeyProvider to org.apache.orc.impl

v1.7.6(Aug 18, 2022)
Milestone

https://github.com/apache/orc/milestone/11

Changelog

https://github.com/apache/orc/compare/v1.7.5...v1.7.6

Bug Fixes

ORC-1204: ORC MapReduce writer to flush when long arrays

ORC-1205: nextVector should invoke ensureSize when reusing vectors

ORC-1215: Remove a wrong NotNull annotation on value of setAttribute

ORC-1222: Upgrade tools.hadoop.version to 2.10.2

ORC-1227: Use Constructor.newInstance instead of Class.newInstance

ORC-1228: Fix setAttribute to handle null value

Tests

ORC-932: Bump byte-buddy from 1.10.19 to 1.11.12 (#842)

ORC-1169: Use Hadoop 3.3.2 on Java 17+ (#1113)

ORC-1178: Use Hadoop 3.3.3 on Java 17+ (#1129)

ORC-1193: Bump parquet.version to 1.12.3

ORC-1207: Upgrade Spark to 3.3.0

ORC-1210: Upgrade maven to 3.8.6

ORC-1234: Upgrade objenesis to 3.2 in Spark benchmark

ORC-1235: Bump avro.version to 1.11.1

ORC-1240: Update site README to use apache/orc-dev DockerHub image

ORC-1241: Use apache/orc-dev DockerHub repository in Docker tests

ORC-1244: Upgrade byte-buddy to 1.12.13 in branch-1.7

ORC-1245: Use Hadoop 3.3.4 on Java 17+ and benchmark

Documentation

MINOR: Update DOAP with new releases (#1127)

ORC-900: Update doap_orc.rdf for Apache Projects page (#806)

ORC-1231: Update supported OS list in building.md

ORC-1237: Remove a wrong image link to article-footer.png

ORC-1238: Update DOAP with 1.7.5

Task

ORC-1185: Add merge_orc_pr.py

ORC-1187: Use main instead of master in merge_orc_pr.py

ORC-1213: Use https in ThirdpartyToolchain.cmake

ORC-1226: Add a deprecation warning for Hadoop 2.7.2 and below

v1.7.5(Jun 16, 2022)
Milestone

https://github.com/apache/orc/milestone/9

Changelog

https://github.com/apache/orc/compare/v1.7.4...v1.7.5

Bug Fixes

ORC-1151: [C++] Fix ColumnWriter for non-UTC Timestamp columns (#1088)

ORC-1160: [C++] Fix seekToRow can't seek within selected row group (#1102)

ORC-1133: [C++] Fix csv-import tool options

ORC-1183: Upgrade gson to 2.9.0

ORC-1186: Limit family in aarch64 profile

ORC-1188: Fix ORC_PREFER_STATIC_ZLIB

Improvements

ORC-1198: Add a new PhysicalFsWriter constructor with FSDataOutputStream parameter

ORC-1199: Use Google mirror of Maven Central as the primary

Tests

ORC-1155: Add Ubuntu 22.04 to docker tests (#1093)

ORC-1154: Bump hive.version from 3.1.2 to 3.1.3 (#1090)

ORC-1161: Add MacOS 12 and remove MacOS 10

ORC-1174: Add Ubuntu 22.04 to GitHub Action (#1128)

ORC-1182: Use slf4j-simple instead of deprecated slf4j-log4j12

ORC-1184: Use Hadoop 3.3.3 in benchmark module

ORC-1189: Update README.md and help command message in benchmark module and .gitignore

ORC-1190: Fix ORCWriterBenchMark dumpDir initialization

ORC-1191: Updated TLC Taxi Benchmark Dataset

ORC-1192: Use orc.zstd instead of orc.none (#1144)

ORC-1196: Add Spark benchmark integration tests to GHA

ORC-1201: Remove Debian 9 from Docker Tests

Documentation

MINOR: Add ASF verification instruction link (#1134)

v1.6.14(Apr 14, 2022)
Milestone

https://github.com/apache/orc/milestone/6?closed=1

Changelog

https://github.com/apache/orc/compare/v1.6.13...v1.6.14

Bug Fixes

ORC-1121: Fix column conversion check bug which causes column filters don't work (#1055)

ORC-1146: Float category missing check if the statistic sum is a finite value (#1078)

ORC-1147: Use isNaN instead of isFinite to determine whether values contain NaN (#1082)

Tests

ORC-1016: Use [email protected] in GitHub Action MacOS CIs

ORC-1113: Remove CentOS 8 from docker-based tests (#1040)

v1.7.4(Apr 16, 2022)
Milestone

https://github.com/apache/orc/milestone/7

Changelog

https://github.com/apache/orc/compare/v1.7.3...v1.7.4

Bug Fixes

ORC-1120: Remove C++ library limitation about write version (#1054)

ORC-1121: Fix column conversion check bug which causes column filters don't work (#1055)

ORC-1127: [C++] add missing version of UNSTABLE-PRE-2.0 (#1064)

ORC-1146: Float category missing check if the statistic sum is a finite value (#1078)

ORC-1147: Use isNaN instead of isFinite to determine whether values contain NaN (#1082)

Improvements

ORC-236: Support UNION type in Java Convert tool (#1025)

ORC-1116: [C++] Fix csv-import tool when exporting long bytes (#1044)

ORC-1123: Add estimationMemory method for writer (#1057)

Tests

ORC-1145: Add Java 18 to GitHub Action CI (#1074)

ORC-1118: Support Java 17 and ARM64 docker tests (#1047)

Documentation

ORC-1117: Add Dask page at Using in Python section (#1045)

ORC-1119: Remove timestamp from ORC API docs (#1049)

v1.7.3(Feb 10, 2022)
Milestone

https://github.com/apache/orc/milestone/4?closed=1

Changelog

https://github.com/apache/orc/compare/v1.7.2...v1.7.3

Bug Fixes

ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns (#971)

ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail (#979)

ORC-1067: [C++] Upgrade ZSTD to 1.5.1 (#981)

ORC-1078: Row group end offset doesn't accommodate all the blocks (#996)

ORC-1081: Fix heap-use-after-free in SearchArgumentBuilderImpl::end() (#998)

ORC-1087: [C++] Handle unloaded seek positions when seeking in an uncompressed chunk (#1008)

ORC-1092: [C++] Upgrade LZ4 to version 1.9.3 (#1012)

ORC-1102: [C++] Upgrade ZSTD to 1.5.2 (#1026)

Improvements (orc-tools)

ORC-1055: [C++] Add the timezone option for the csv-import tool (#975)

ORC-1082: Improve FileDump and JsonFileDump to be robust on missing column statistics (#1003)

ORC-1098: [C++] Support specifying type ids or column names in cpp tools (#1020)

Documentation

ORC-1050: Update ORC site README.md and release process page (#963)

ORC-1069: Update building.md (#982)

ORC-1071: Update adopters page (#985)

ORC-1091: Add Tests section at ORC develop page (#1011)

ORC-1112: Add Using with Python web page (#1039)

ORC-1114: Update Using with Python page with PyArrow 7.0.0 (#1042)

Task

ORC-1070: Upgrade site docker image to use Ubuntu 20.04 (#983)

ORC-1072: Add 'Stale' GitHub Action job (#986)

ORC-1094: Enable GitHub issues tab (#1015)

ORC-1095: Deprecate UnknownFormatException (#1016)

Tests

ORC-875: Add GitHub Action job for Windows Server 2019 (#872)

ORC-878: Bump auto-service from 1.0-rc7 to 1.0

ORC-881: Bump slf4j.version from 1.7.30 to 1.7.32 (#786)

ORC-989: Bump checkstyle from 8.45.1 to 9.0 (#899)

ORC-993: Bump junit.version from 5.7.2 to 5.8.0 (#906)

ORC-1018: Bump checkstyle from 9.0 to 9.0.1 (#927)

ORC-1033: Bump junit.version from 5.8.0 to 5.8.1 (#938)

ORC-1044: Bump reproducible-build-maven-plugin to 0.14 (#955)

ORC-1048: Bump checkstyle from 9.0.1 to 9.1 (#960)

ORC-1052: Bump avro.version from 1.10.2 to 1.11.0 (#965)

ORC-1057: Bump junit.version from 5.8.1 to 5.8.2 (#969)

ORC-1061: Bump checkstyle from 9.1 to 9.2 (#970)

ORC-1066: Bump guava from 30.1.1-jre to 31.0.1-jre (#978)

ORC-1068: [C++] Stabilize HAS_POST_2038 test (#980)

ORC-1073: Remove appveyor.yml (#987)

ORC-1076: Remove Travis PR Builder Link from README.md (#991)

ORC-1079: Add Linux Clang 11 GitHub Action test coverage (#995)

ORC-1080: Remove .travis.yml (#997)

ORC-1084: Bump checkstyle from 9.2 to 9.2.1 (#1007)

ORC-1086: Bump reproducible-build-maven-plugin from 0.14 to 0.15 (#1005)

ORC-1090: Disable Clang 13.0-specific compilation warnings (#1017)

ORC-1093: Remove debian8 specific code in run-one.sh (#1013)

ORC-1096: Bump slf4j.version to 1.7.33 (#1019)

ORC-1103: Use Maven 3.8.4 (#1029)

ORC-1104: Use Spark 3.2.1 in benchmark (#1030)

ORC-1105: fetch-data.sh should use zsh instead of bash (#1031)

ORC-1106: Use transitive commons-lang3 dependency in bench module (#1032)

ORC-1107: Fix NPE at benchmark data schema loading (#1033)

ORC-1108: Use RawLocalFileSystem to skip checksum files during benchmark data generation (#1034)

ORC-1109: Use zstd instead of none in the default compress option (#1035)

ORC-1111: Bump build-helper-maven-plugin from 3.2.0 to 3.3.0 (#1038)

ORC-1113: Remove CentOS 8 from docker-based tests (#1040)

ORC-1115: Suppress Illegal reflective access warnings on Java9+ Tests (#1043)

v1.6.13(Feb 10, 2022)
Milestone

https://github.com/apache/orc/milestone/5?closed=1

Changelog

https://github.com/apache/orc/compare/rel/release-1.6.12...v1.6.13

Bug Fixes

ORC-1065: Fix IndexOutOfBoundsException in ReaderImpl.extractFileTail (#979)

ORC-1078: Row group end offset doesn’t accommodate all the blocks (#996)

Tests

ORC-875: Add GitHub Action job for Windows Server 2019 (#999)

ORC-941: Move MacOS 10.15/11.5 test from Travis to GitHub Action (#1001)

ORC-1079: Add Linux Clang 11 GitHub Action test coverage (#1002)

ORC-1080: Remove .travis.yml (#1000)

v1.7.2(Feb 10, 2022)
Milestone

https://github.com/apache/orc/milestone/3?closed=1

Changelog

https://github.com/apache/orc/compare/rel/release-1.7.1...v1.7.2

Bug Fixes

ORC-492: Avoid potential ArrayIndexOutOfBoundsException when getting WriterVersion (#961)

ORC-1041: Use memcpy during LZO decompression (#958)

ORC-1053: Fix time zone offset precision when convert tool converts LocalDateTime to Timestamp is not consistent with the internal default precision of ORC (#967)

ORC-1059: Align findColumns behaviour between 1.6 and 1.7 release (#972)

Improvements (orc-tools)

ORC-1012: Support specifying columns in orc-scan (#921)

ORC-1017: Add sizes tool to determine and display the sizes of each column in a set of files. (#925)

ORC-1023: Support writing bloom filters in ConvertTool (#933)

Tests

ORC-915: Remove io.netty.netty from Spark benchmark (#822)

ORC-938: Bump netty-all from 4.1.42.Final to 4.1.66.Final (#819)

ORC-948: Add hive benchmark integration tests (#860)

ORC-957: Bump netty-all from 4.1.66.Final to 4.1.67.Final (#870)

ORC-1021: Add -fno-omit-frame-pointer in DEBUG and RELWITHDEBINFO builds (#932)

ORC-1051: Update benchmark dependencies (#964)

v1.7.1(Feb 10, 2022)
Milestone

https://github.com/apache/orc/milestone/1?closed=1

Changelog

https://github.com/apache/orc/compare/rel/release-1.7.0...rel/release-1.7.1

Bug Fixes

ORC-879 - Flaky Test for TestJsonReader

ORC-1008 - Overflow detection code is incorrect in IntegerColumnStatisticsImpl

ORC-1009 - [C++] Missing string include causes build failure with MSVC++

ORC-1015 - Update OrcFile.WriterOptions::memory javadoc

ORC-1016 - Use [email protected] in GitHub Action MacOS CIs

ORC-1024 - BloomFilter hash computation is inconsistent between Java and C++ clients

ORC-1029 - Could not load 'org.apache.orc.DataMask.Provider' when using orc encryption and spark executor with multi cores!

ORC-1030 - Java Tools Recover File command does not accurately find OrcFile.MAGIC

ORC-1034 - The search byte array algorithm is incorrectly implemented in FileDump.java

ORC-1035 - backupDataPath may be incorrect in recoverFile

ORC-1039 - Make FileDump.recoverFile handle side files only if they exist

Test

ORC-1000 - Use Java 17 in GitHub Action

ORC-1002 - Add java17 profile for Java17 unit testing

ORC-1010 - Bump tzdata from tzdata-2020e-1.tar.xz to tzdata-2021b-1.tar.xz

ORC-1011 - Activate java17 profile automatically

ORC-1032 - Bump parquet.version from 1.12.0 to 1.12.2

ORC-1036 - Due to tzdata upgrade, the fixed download links in CI are often not working

ORC-1037 - Bump spark.version from 3.1.2 to 3.2.0

ORC-1040 - Add Debian 11 docker test

ORC-1042 - Ignore unused-function C++ compile warning on CentOS 7

ORC-1043 - Fix C++ conversion compilation error in CentOS 7

v1.7.0(Feb 10, 2022)
New Feature

[ORC-40] - [C++] Support building SearchArgument

[ORC-577] - Allow row-level filtering

[ORC-602] - Create adaptor for using FSDataInputStream for Java ORC reader

[ORC-716] - Build and test on Java 17-EA

[ORC-731] - Improve Java Tools

[ORC-747] - Abstract Dictionary interface and refactoring

[ORC-751] - [C++] Implement Predicate Pushdown for C++ Reader

[ORC-765] - Added build option to compile libraries with position independent code

[ORC-819] - Add GitHub labeler

Improvement

[ORC-377] - [C++] Adding writing with snappy compression to orc c++ writing lib

[ORC-480] - [C++] Deactivate WARN_FLAGS in release build

[ORC-566] - Add docker file for building site

[ORC-568] - Make the convert tool sort the old _col column names by number

[ORC-574] - Performance: Use const references for string statistics min and max to avoid copy construction

[ORC-588] - Static field or method should be directly referred by its class

[ORC-595] - Optimize Decimal64 scale calculation

[ORC-597] - Row-level filtering bench

[ORC-606] - Optimize Timestamp parseNanos calculation

[ORC-607] - Sync orc-benchmarks module to the others

[ORC-608] - Fix DecimalBench reader options

[ORC-609] - Upgrade aircompressor to 0.16

[ORC-614] - Implement efficient seek() in decompression streams

[ORC-615] - Refactor decompression streams into common base class

[ORC-622] - Refactoring of TreeReader into TypeReader and BatchReader

[ORC-638] - ORCMapredRecordWriter enlarge columnVector with factors when child array size is not large enough

[ORC-639] - Improve zstd compression performance

[ORC-646] - Add Ubuntu 20.04 docker file

[ORC-651] - Use GitHub Pull Request Template

[ORC-652] - Upgrade ZSTD to 1.4.5

[ORC-655] - Update bench to use Spark 2.4.6

[ORC-656] - Use gharchive.org instead of githubarchive.org

[ORC-657] - Remove com.netflix.iceberg dependency in java/bench module

[ORC-683] - PPD: Make Floating point NaN check more strict

[ORC-684] - [C++] Make Floating point NaN check more strict

[ORC-687] - Upgrade to JUnit5

[ORC-688] - Allow CHAR, VARCHAR to be promoted to STRING

[ORC-689] - Add GitHubAction job to publish snapshot

[ORC-693] - Update credential according to INFRA setup

[ORC-694] - Update docker files adding Java11 support

[ORC-696] - Consistent TypeDescription handling for quoted field names

[ORC-697] - Improve Scan tool to report where files are corrupted.

[ORC-699] - Minor improvements to the scan tool

[ORC-704] - Publish snapshots at only apache repo

[ORC-710] - Update maven plugins

[ORC-712] - Add USING IN SPARK to website

[ORC-722] - Improve code quality using static analysis.

[ORC-734] - Use org.apache.commons.lang3

[ORC-736] - Upgrade Hive to 3.1.2

[ORC-737] - Upgrade Spark to 3.1.0

[ORC-744] - LazyIO of non-filter columns

[ORC-745] - Migrate to travis-ci.com

[ORC-748] - Add separate writer implementation for Trino

[ORC-749] - Add checkstyle to -Panalyze

[ORC-750] - Fix benchmark to pass checkstyle:check

[ORC-757] - Add Hashtable implementation for dictionary

[ORC-760] - Update spark to 3.1.1

[ORC-761] - Replace MAINTAINER command with LABEL command in Dockerfile

[ORC-766] - Generalize the docker scripts to handle build-args

[ORC-767] - Add docker support for jdk 8 in debian 10

[ORC-768] - Update commons-csv to 1.8

[ORC-769] - Support ZSTD in ORC data benchmark

[ORC-770] - Support ZSTD in Avro data benchmark

[ORC-776] - Include source jars during publishing snapshot

[ORC-777] - Make the vectorized row batch size configurable in MR record readers and writers

[ORC-779] - Upgrade commons-cli to 1.4

[ORC-780] - Add LZ4 Compression to the C++ Writer

[ORC-791] - Upgrade guava test dependency to 30.1.1-jre

[ORC-792] - Upgrade commons-lang to 3.12.0

[ORC-796] - Upgrade apache parent pom version to the latest, 23

[ORC-797] - Allow writers to get the stripe information

[ORC-799] - Remove Ubuntu 16 docker test

[ORC-800] - [ORC]if map.value is selected, map.key should be selected automatically to prevent segment fault.

[ORC-801] - Clean up Logging

[ORC-802] - Document Maven Version and mvnw

[ORC-803] - MemoryManagerImpl Simplify removeWriter

[ORC-806] - Upgrade to Apache POM 23

[ORC-807] - Separate Jackson Versions in POM

[ORC-808] - Update Spark to 3.1.2

[ORC-812] - Simplify getClosestBufferSize in Writer

[ORC-813] - Upgrade ZSTD to 1.5.0

[ORC-818] - Build and test in Apple Silicon

[ORC-821] - Use mvnw instead of mvn

[ORC-823] - Upgrade maven-assembly-plugin to 3.3.0

[ORC-848] - Recycle Internal Buffer in StringHashTableDictionary

[ORC-849] - Core Benchmark Cleanup

[ORC-893] - Remove junit-vintage-engine from shims module.

[ORC-913] - Support data/format/compress options in Spark benchmark

[ORC-921] - Add an encrypted example file

[ORC-922] - Remove redundant conditional statements

[ORC-927] - Extracting duplicate codes for RowFilterBenchmark

[ORC-930] - Ignore unsupported JSON x ZSTD combination in bench

[ORC-931] - Optimize RunLengthIntegerWriterV2 code for better readability

[ORC-933] - extend the example with advanced reader

[ORC-941] - Move MacOS 10.15 and 11.5 test from Travis to GitHub Action

[ORC-943] - Add Intellij conf to support JIRA/PR autolinks

[ORC-945] - Add OUTPUT_QUIET, ERROR_QUIET to suppress Java8 addopen error messages

[ORC-970] - Reordering statements, improve readability in WriterImpl

[ORC-976] - Optimize compute zigZagLiterals

[ORC-984] - Save the software version that wrote each ORC file

Sub-task

[ORC-599] - Bump guava version to 28.1-jre

[ORC-663] - [C++] Support nanosecond in timestamp column statistics

[ORC-713] - Add Java 15 test to github action

[ORC-714] - Remove MRUnit dependency and its usage

[ORC-715] - Add MapReduce test cases

[ORC-718] - Enable Checkstyle plugin and FileTabCharacter rule.

[ORC-719] - Enable UnusedImports.

[ORC-720] - Run mvn checkstyle:check in GitHub action.

[ORC-721] - Use org.junit.Assert instead of deprecated junit.framework.Assert.

[ORC-723] - Upgrade Mockito to 3.7.0.

[ORC-726] - Support Map type in orc-tools convert

[ORC-727] - Update Java Tools documentation

[ORC-728] - Support head command in Java Tools

[ORC-733] - Upgrade Zookeeper from 3.4.x to 3.6.2

[ORC-735] - ConvertTool should not fail at a single ORC file

[ORC-738] - Add date type conversion support in Java Tools

[ORC-741] - Schema Evolution missing column is not handled in the presence of filters

[ORC-742] - LazyIO of non-filter columns in the presence of filters

[ORC-743] - Conversion of SArg into Filters, to take advantage of LazyIO

[ORC-754] - Code cleanup

[ORC-755] - Introduce OrcFilterContext

[ORC-758] - Avoid decompressing compressed streams if already decompressed

[ORC-759] - StructBatchReader should always skip processing on the rootReader

[ORC-778] - Add "NewlineAtEndOfFile" checkstyle rule

[ORC-783] - Add a checkstyle rule to prevent trailing white spaces.

[ORC-795] - Add "LineLength" rule to checkstyle

[ORC-811] - Benchmarks for Filters

[ORC-814] - Build and test Java module on Apple Silicon

[ORC-815] - Build and test C++ module on CLang12

[ORC-816] - Rename and enable aarch64 profile automatically

[ORC-820] - Add Java 16 to GitHub Action

[ORC-822] - Add Java 17-ea to GitHub Action

[ORC-839] - Fix head command for batch reader

[ORC-851] - Fix CNFE in ORC tools uber jar to include required classes.

[ORC-857] - Add OuterTypeFilename/UpperEll/ArrayTypeStyle checkstyle rules.

[ORC-858] - Add NoLineWrap/OneStatementPerLine/NeedBraces checkstyle rules

[ORC-859] - Update maven-checkstyle-plugin to 3.1.2.

[ORC-866] - Reduce LineLength from 125 to 120

[ORC-867] - Upgrade hive-storage-api to 2.8.1

[ORC-871] - orc-tools json-schema fails at empty json file with EOFException

[ORC-882] - Remove hamcrest-core test dependency

[ORC-886] - Add an integration test for ORC Java tools

[ORC-889] - Remove orc-mapreduce build warnings due to overlapping resources

[ORC-895] - Use snappy-java 1.1.8.4 in bench/core to support Apple Silicon

[ORC-901] - Remove junit-vintage-engine from mapreduce/tools module

[ORC-905] - Add an integration test for example

[ORC-907] - Remove junit-vintage-engine from core module

[ORC-909] - Remove commons-io 2.1 dependency

[ORC-910] - Enforce maven-dependency-plugin

[ORC-911] - Remove janino dependency in favor of Spark's transitive dependency

[ORC-912] - Exclude Spark transitive avro/parquet dependency from Spark benchmark

[ORC-917] - Bump mockito-core from 3.7.0 to 3.11.2

[ORC-919] - Spark bench objenesis should be the same as Spark.

[ORC-920] - Use junit.version and mockito.version property and bump junit to 5.7.2

[ORC-924] - Add redundant modifier/modifier order checkstyle rules.

[ORC-926] - Consolidate license header style in Java files.

[ORC-928] - Bump checkstyle from 8.44 to 8.45.1

[ORC-929] - Fix NaN at orc-tools 'meta' command

[ORC-934] - Add integration tests for Java bench

[ORC-939] - Remove threetenbp dependency

[ORC-942] - Remove javax.xml.bind:jaxb-api dependency

[ORC-944] - Add "RedundantImport" checkstyle rule

[ORC-947] - Update coding guide to max line length 100 and enforce it.

[ORC-950] - Bump aircompressor to 0.20

[ORC-951] - Add since tag to org.apache.orc.Reader interface

[ORC-952] - Add since tag to org.apache.orc.RecordReader interface

[ORC-953] - Add since tag to org.apache.orc.Writer interface

[ORC-959] - C++ reader crash in resolving nested List columns for SearchArgument

[ORC-960] - Create SearchArgument using column ids

[ORC-971] - LESS_THAN_EQUALS doesn't handle the case when min=max

[ORC-973] - [C++] Provide more interfaces for creating IN predicate

[ORC-980] - Filter processing ignores the schema case-sensitivity flag

[ORC-981] - Add "CommentsIndentation" checkstyle rule

[ORC-983] - Revise filter processing log level/location/message

Bug

[ORC-27] - C++ reader does not read dates correctly prior to 1583

[ORC-361] - org.apache.orc.impl.MemoryManagerImpl: Owner thread expected Thread[main,5,main], got Thread[pool-15-thread-1,5,main]

[ORC-414] - [C++] ORC files with malformed protobuf objects can crash a release build

[ORC-424] - Add findbugs checks for test classes

[ORC-503] - Add Maven Wrapper

[ORC-504] - Create a reproducible java build

[ORC-519] - Incorrect ORC v1 specification of Decimal encoding

[ORC-526] - orc-tools convert does not respect second fractions

[ORC-528] - orc-tools timestamps off by one?

[ORC-552] - Fix compilation of C++ Reader.cc on centos 6.

[ORC-554] - Float to timestamp schema evolution handles time/nanoseconds incorrectly

[ORC-555] - IllegalArgumentException when reading files with compressed footers bigger than 16k

[ORC-556] - ConvertTreeReader can incorrectly be applied on columns of the same primitive type

[ORC-557] - Large ORC file parsing failed

[ORC-562] - Don't wrap readerSchema in acidSchema, if readerSchema is already acid

[ORC-563] - ORC-540 could break schema evolution on PPD codepaths

[ORC-564] - fix Docker scripts so that their command is bash

[ORC-565] - Fix a couple c++ tests when building out of tree

[ORC-567] - [C++] Fix integer overflow in RowReader::seekToRow

[ORC-569] - Empty positions list in first row index entry

[ORC-570] - FS: ReaderOptions.filesystem should also accept a lazy Supplier

[ORC-571] - ArrayIndexOutOfBoundsException in StripePlanner.readRowIndex

[ORC-578] - IllegalArgumentException: Can't use LongColumnVector to read proleptic Gregorian dates.

[ORC-580] - Crash in StripeStreamsImpl::getEncoding

[ORC-581] - C++ library could crash in orc::TypeImpl::addStructField

[ORC-584] - TypeDescription toJson() returns invalid json

[ORC-586] - [C++] Memory leak in StructColumnReader

[ORC-589] - [C++] ORC doesn't check for negative dictionary entry lengths anymore

[ORC-590] - Crash in orc::RleDecoderV2::readByte

[ORC-591] - orc::readFully crash due to null pointer variable

[ORC-598] - Unable to read ORC file with struct and array.length > 1024

[ORC-600] - StringDictionaryColumnReader does not update index buffer correctly

[ORC-603] - Update current Hadoop version to 2.10.1

[ORC-604] - Check in StringDictionary.getValueByIndex is too permissive

[ORC-610] - Updated Copyright year in the NOTICE file

[ORC-611] - Incorrect min-max stats for sub-millisecond timestamps

[ORC-621] - Need reader fix for ORC-569

[ORC-623] - Potentially incorrect Sarg evaluation for not(in) and not(isNull)

[ORC-626] - Reading Struct Column Having Multiple Fields With Same Name Causes java.io.EOFException

[ORC-628] - Add a new java tool to count rows from ORC files under a directory

[ORC-629] - PPD: Floating point NaN is not transitive across comparisons

[ORC-630] - Fix orc-tools uber jar by adding guava dependency back

[ORC-631] - Add guava dependency to tools jar

[ORC-636] - [C++] PPD Floating point stats with NaN should be ignored

[ORC-641] - orc-core includes packages from io.airlift.slice

[ORC-643] - Change logging of codec creation to debug

[ORC-644] - nested struct evolution does not respect orc.force.positional.evolution

[ORC-649] - duplicate method invocation (buildConversion) in SchemaEvolution

[ORC-650] - Fix argument to find_package() for ZSTD

[ORC-654] - Fix build with clang-10 and ubuntu-20

[ORC-658] - Fix NoClassDefFoundError during benchmark data generation

[ORC-659] - Initialize "next_in" before calling DeflateInit2

[ORC-661] - DateColumnStatistics uses Date, which is not timezone agnostic.

[ORC-667] - Positional mapping for nested struct types should not be applied by default

[ORC-668] - Use TestSchemaEvolution as a test file prefix to prevent test failure

[ORC-669] - Reduce breaking changes in ReaderImpl.java

[ORC-670] - RecordReaderImpl.findColumns should respect orc.schema.evolution.case.sensitive

[ORC-671] - Add OrcTail.getStripeStatistics back for backward compatibility

[ORC-672] - Fix made in ORC-598 needs to be extended to other readers like DecimalFromFloatTreeReader

[ORC-673] - PPD: LTE Point equality comparison is wrong when RG MIN==MAX

[ORC-676] - Add getRawDataSizeFromColIndices back

[ORC-677] - Fix SargApplier compilation error

[ORC-680] - Upgrade Travis CI linux os from Ubuntu 14.04 to Ubuntu 16.04

[ORC-681] - Upgrade commons-codec to 1.15

[ORC-682] - Upgrade to commons-lang3

[ORC-685] - Add ReaderImpl.extractFileTail back

[ORC-686] - Upgrade aircompressor to 0.19 and fix UT

[ORC-702] - [C++] Big files can't be opened in Windows

[ORC-705] - Predicate evaluation should take into account writer calendar

[ORC-706] - Put back DataReaderProperties default maxDiskRangeChunkLimit

[ORC-708] - Fix Snapshot publishing

[ORC-709] - Fix Boolean to StringGroup schema evolution

[ORC-711] - Support CryptoExtension in create/decryptLocalKey

[ORC-724] - PPD: Date IN single value comparison throws ClassCastException

[ORC-739] - Use Maven Wrapper in java/CMakeLists.txt

[ORC-740] - Add curl in debian and ubuntu Docker files

[ORC-756] - Include Vector Length Error with null readerSchema (for ACID formats)

[ORC-763] - ORC timestamp inconsistencies before UNIX epoch

[ORC-771] - ORC timestamp consistency Test for sql.Timestamps close to epoch

[ORC-772] - Fix Spark benchmark jar creation

[ORC-773] - BinaryColumnWriter does not update Bloom filter

[ORC-774] - Support ZSTD in Parquet data benchmark

[ORC-775] - Fix a regression on column names with dot

[ORC-781] - [C++] Make type annotations available from C++

[ORC-786] - Update Dockerfiles to use main branch

[ORC-787] - Update site to use main

[ORC-788] - Update travis build status to main

[ORC-790] - TIMESTAMP_INSTANT should be primitive

[ORC-793] - Deprecate the incorrect getInt method and add a new setInt within OrcConf

[ORC-804] - MaskDescriptionImpl should use List instead of Set

[ORC-810] - Fix expected Output Stats when using Hash Dictionary

[ORC-856] - Fix exception description in findSubtype

[ORC-861] - Bump CMake minimum requirement to 2.8.12

[ORC-862] - Add libarchive to centos8 docker image

[ORC-863] - Upgrade TravisCI from xenial to focal.

[ORC-873] - Fix FindCyrusSASL to use the package name CyrusSASL instead of CYRUS_SASL

[ORC-885] - Update bench README.md and allow user env shell

[ORC-902] - The example of orc-example cannot be run

[ORC-954] - Fix Javadoc generation failure

[ORC-965] - Got "Overflow detected" at spark-orc with zstd

[ORC-978] - Fix NPE in TestFlinkOrcReaderWriter

[ORC-985] - ORC branch 1.7 is producing larger files from java writer

[ORC-990] - [C++] fix RowReaderImpl::seekToRowGroup

[ORC-991] - Encrypted data throws an exception with a SQL filter pushdown

Test

[ORC-647] - Add macOS 10.15 test to Travis CI

[ORC-648] - Add GitHub Action for Java8/Java11 test coverage

[ORC-678] - Upgrade JUnit to 4.13.1

[ORC-679] - Update tzdata to recover Win32 build in AppVeyor

[ORC-891] - Use assert methods without package name Assert

[ORC-903] - Migrate TestVectorOrcFile to JUnit5

[ORC-925] - Simplify assertions

[ORC-955] - Add Javadoc generation GitHub Action job

[ORC-963] - Build benchmark module always for integration testing

Task

[ORC-553] - Add test case to check that SchemaEvolution checkAcidSchema works well

[ORC-560] - Improve docker tests and include centos 8 and debian 10

[ORC-576] - Improve LICENSE file

[ORC-664] - docker image for centos7 fails to build zstd

[ORC-674] - Update docker files adding Ubuntu 20 and removing Debian 8 and Ubuntu 14

[ORC-675] - Remove debian8/ubuntu14 docker directories

[ORC-691] - Remove unused snapcraft-related code

[ORC-725] - Disable merge commits from GitHub Merge Button

[ORC-730] - Add website link and description to GitHub page

[ORC-785] - Update GitHub Action branch name to main

[ORC-805] - Add .asf.yaml for Apache ORC website publishing

[ORC-908] - Use https instead of http for website links in pom.xml
